Azure Databricks
Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform. Azure
Databricks offers three environments for developing data intensive applications: Databricks SQL, Databricks
Data Science & Engineering, and Databricks Machine Learning.
Databricks SQL provides an easy-to-use platform for analysts who want to run SQL queries on their data lake,
create multiple visualization types to explore query results from different perspectives, and build and share
dashboards.
Databricks Data Science & Engineering provides an interactive workspace that enables collaboration
between data engineers, data scientists, and machine learning engineers. For a big data pipeline, the data (raw
or structured) is ingested into Azure through Azure Data Factory in batches, or streamed near real-time using
Apache Kafka, Event Hub, or IoT Hub. This data lands in a data lake for long term persisted storage, in Azure Blob
Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from
multiple data sources and turn it into breakthrough insights using Spark.
Databricks Machine Learning is an integrated end-to-end machine learning environment incorporating
managed services for experiment tracking, model training, feature development and management, and feature
and model serving.
To select an environment, launch an Azure Databricks workspace and use the persona switcher in the sidebar.
Next steps
Learn more about Databricks Data Science & Engineering
Learn more about Databricks Machine Learning
Learn more about Databricks SQL Analytics
What is Databricks Data Science & Engineering?
7/21/2022 • 3 minutes to read
Databricks Data Science & Engineering (sometimes called simply "Workspace") is an analytics platform based
on Apache Spark. It is integrated with Azure to provide one-click setup, streamlined workflows, and an
interactive workspace that enables collaboration between data engineers, data scientists, and machine learning
engineers.
For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in
batches, or streamed near real-time using Apache Kafka, Event Hub, or IoT Hub. This data lands in a data lake for
long term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics
workflow, use Azure Databricks to read data from multiple data sources such as Azure Blob Storage, Azure Data
Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and turn it into breakthrough insights using
Spark.
Enterprise security
Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration, role-
based controls, and SLAs that protect your data and your business.
Integration with Azure Active Directory enables you to run complete Azure-based solutions using Azure
Databricks.
Azure Databricks role-based access enables fine-grained user permissions for notebooks, clusters, jobs, and
data.
Enterprise-grade SLAs.
IMPORTANT
Azure Databricks is a Microsoft Azure first-party service that is deployed on the Global Azure Public Cloud infrastructure.
All communications between components of the service, including between the public IPs in the control plane and the
customer data plane, remain within the Microsoft Azure network backbone. See also Microsoft global network.
Next steps
Quickstart: Create an Azure Databricks workspace and run a Spark job
Work with Spark clusters
Work with notebooks
Create Spark jobs
What is Databricks Machine Learning?
7/21/2022 • 2 minutes to read
Databricks Machine Learning is an integrated end-to-end machine learning platform incorporating managed
services for experiment tracking, model training, feature development and management, and feature and model
serving. The diagram shows how the capabilities of Databricks map to the steps of the model development and
deployment process.
Next steps
Run the machine learning quickstart tutorial
Run the 10-minute tutorials using your favorite ML libraries
Learn more about Databricks Machine Learning
What is Databricks SQL?
7/21/2022 • 2 minutes to read
Databricks SQL allows you to run quick ad-hoc SQL queries on your data lake. Queries support multiple
visualization types to help you explore your query results from different perspectives.
NOTE
Databricks SQL is not supported in Azure China regions.
Enterprise security
Databricks SQL provides enterprise-grade Azure security, including Azure Active Directory integration, role-
based controls, and SLAs that protect your data and your business.
Integration with Azure Active Directory enables you to run complete Azure-based solutions using Databricks
SQL.
Role based access enables fine-grained user permissions for alerts, dashboards, SQL warehouses, queries,
and data.
Enterprise-grade SLAs.
For details, see Data access overview.
IMPORTANT
Azure Databricks is a Microsoft Azure first-party service that is deployed on the Global Azure Public Cloud infrastructure.
All communications between components of the service, including between the public IPs in the control plane and the
customer data plane, remain within the Microsoft Azure network backbone. See also Microsoft global network.
Next steps
Quickstart: Learn about Databricks SQL by importing the sample dashboards
Quickstart: Complete the admin onboarding tasks
Quickstart: Enable users and create a SQL warehouse
Quickstart: Run a query and create a dashboard
Work with queries
Create dashboards
Quickstart: Run a Spark job on Azure Databricks
Workspace using the Azure portal
7/21/2022 • 7 minutes to read
In this quickstart, you use the Azure portal to create an Azure Databricks workspace with an Apache Spark
cluster. You run a job on the cluster and use custom charts to produce real-time reports from Seattle safety data.
Prerequisites
Portal
Azure CLI
Azure subscription - create one for free. This tutorial cannot be carried out using an Azure Free Trial
Subscription. If you have a free account, go to your profile and change your subscription to pay-as-you-go.
For more information, see Azure free account. Then, remove the spending limit, and request a quota increase
for vCPUs in your region. When you create your Azure Databricks workspace, you can select the
Trial (Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free Premium Azure
Databricks DBUs for 14 days.
Sign in to the Azure portal.
NOTE
If you want to create an Azure Databricks workspace in the Azure Commercial Cloud that holds US Government
compliance certifications like FedRAMP High, please reach out to your Microsoft or Databricks representative to gain
access to this experience.
Portal
Azure CLI
1. In the Azure portal, select Create a resource > Analytics > Azure Databricks .
2. Under Azure Databricks Service, provide the following values to create a Databricks workspace.
3. Select Review + Create , and then Create . The workspace creation takes a few minutes. During
workspace creation, you can view the deployment status in Notifications . Once this process is finished,
your user account is automatically added as an admin user in the workspace.
When a workspace deployment fails, the workspace is still created in a failed state. Delete the failed
workspace and create a new workspace that resolves the deployment errors. When you delete the failed
workspace, the managed resource group and any successfully deployed resources are also deleted.
1. In the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace .
2. You are redirected to the Azure Databricks portal. From the portal, click New Cluster .
Select Create .
3. In this step, create a Spark DataFrame with Seattle Safety Data from Azure Open Datasets, and use SQL to
query the data.
The following command sets the Azure storage access information. Paste this PySpark code into the first
cell and use Shift+Enter to run the code.
blob_account_name = "azureopendatastorage"
blob_container_name = "citydatacontainer"
blob_relative_path = "Safety/Release/city=Seattle"
blob_sas_token = r"?st=2019-02-26T02%3A34%3A32Z&se=2119-02-27T02%3A34%3A00Z&sp=rl&sv=2018-03-28&sr=c&sig=XlJVWA7fMXCSxCKqJm8psMOh0W4h7cSYO28coRqF2fs%3D"
The following command allows Spark to read from Blob storage remotely. Paste this PySpark code into
the next cell and use Shift+Enter to run the code.
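The cell contents are not reproduced here; the following is a minimal sketch of what this cell typically contains, reusing the variables from the previous cell (the fs.azure.sas configuration key pattern is an assumption, not taken from this article):
# Build the remote blob path and register the SAS token so Spark can read the container directly.
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)
print('Remote blob path: ' + wasbs_path)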
The following command creates a DataFrame. Paste this PySpark code into the next cell and use
Shift+Enter to run the code.
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
4. Run a SQL statement to return the top 10 rows of data from the temporary view called source. Paste this
PySpark code into the next cell and use Shift+Enter to run the code.
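The query itself is not shown above; a minimal sketch of an equivalent cell, assuming the temporary view source registered in the previous cell:
# Query the temporary view and display the first 10 rows.
top10_df = spark.sql('SELECT * FROM source LIMIT 10')
display(top10_df)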
5. You see tabular output like that shown in the following screenshot (only some columns are shown):
6. You now create a visual representation of this data to show how many safety events are reported using
the Citizens Connect App and City Worker App instead of other sources. From the bottom of the tabular
output, select the Bar chart icon, and then click Plot Options .
Clean up resources
After you have finished the article, you can terminate the cluster. To do so, in the Azure Databricks workspace,
select Clusters in the left pane. For the cluster you want to terminate, move the cursor over the ellipsis in the
Actions column, and select the Terminate icon.
If you do not manually terminate the cluster, it stops automatically, provided you selected the Terminate
after __ minutes of inactivity checkbox when you created the cluster. In that case, the cluster stops
automatically after it has been inactive for the specified time.
Next steps
In this article, you created a Spark cluster in Azure Databricks and ran a Spark job using data from Azure Open
Datasets. You can also look at Spark data sources to learn how to import data from other data sources into
Azure Databricks. Advance to the next article to learn how to perform an ETL operation (extract, transform, and
load data) using Azure Databricks.
Extract, transform, and load data using Azure Databricks
Quickstart: Create an Azure Databricks workspace
using PowerShell
7/21/2022 • 3 minutes to read
This quickstart describes how to use PowerShell to create an Azure Databricks workspace. You can use
PowerShell to create and manage Azure resources interactively or in scripts.
Prerequisites
If you don't have an Azure subscription, create a free account before you begin.
If you choose to use PowerShell locally, this article requires that you install the Az PowerShell module and
connect to your Azure account using the Connect-AzAccount cmdlet. For more information about installing the
Az PowerShell module, see Install Azure PowerShell.
IMPORTANT
While the Az.Databricks PowerShell module is in preview, you must install it separately from the Az PowerShell module
using the following command: Install-Module -Name Az.Databricks -AllowPrerelease . Once the Az.Databricks
PowerShell module is generally available, it becomes part of future Az PowerShell module releases and available natively
from within Azure Cloud Shell.
NOTE
If you want to create an Azure Databricks workspace in the Azure Commercial Cloud that holds US Government
compliance certifications like FedRAMP High, please reach out to your Microsoft or Databricks representative to gain
access to this experience.
If this is your first time using Azure Databricks, you must register the Microsoft.Databricks resource provider.
The workspace creation takes a few minutes. Once this process is finished, your user account is automatically
added as an admin user in the workspace.
When a workspace deployment fails, the workspace is still created in a failed state. Delete the failed workspace
and create a new workspace that resolves the deployment errors. When you delete the failed workspace, the
managed resource group and any successfully deployed resources are also deleted.
Clean up resources
If the resources created in this quickstart aren't needed for another quickstart or tutorial, you can delete them by
running the following example.
Caution
The following example deletes the specified resource group and all resources contained within it. If resources
outside the scope of this quickstart exist in the specified resource group, they will also be deleted.
To delete only the server created in this quickstart without deleting the resource group, use the
Remove-AzDatabricksWorkspace cmdlet.
Next steps
Create a Spark cluster in Databricks
Quickstart: Create an Azure Databricks workspace
by using an ARM template
7/21/2022 • 3 minutes to read
In this quickstart, you use an Azure Resource Manager template (ARM template) to create an Azure Databricks
workspace. Once the workspace is created, you validate the deployment.
An ARM template is a JavaScript Object Notation (JSON) file that defines the infrastructure and configuration for
your project. The template uses declarative syntax, which lets you state what you intend to deploy without
having to write the sequence of programming commands to create it.
If your environment meets the prerequisites and you're familiar with using ARM templates, select the Deploy to
Azure button. The template will open in the Azure portal.
Prerequisites
To complete this article, you need to:
Have an Azure subscription - create one for free
NOTE
If you want to create an Azure Databricks workspace in the Azure Commercial Cloud that holds US Government
compliance certifications like FedRAMP High, please reach out to your Microsoft or Databricks representative to gain
access to this experience.
The Azure resource defined in the template is Microsoft.Databricks/workspaces: create an Azure Databricks
workspace.
Azure PowerShell
$resourceGroupName = Read-Host -Prompt "Enter the resource group name where your Azure Databricks workspace exists"
(Get-AzResource -ResourceType "Microsoft.Databricks/workspaces" -ResourceGroupName $resourceGroupName).Name
Write-Host "Press [ENTER] to continue..."
Clean up resources
If you plan to continue on to subsequent tutorials, you may wish to leave these resources in place. When no
longer needed, delete the resource group, which deletes the Azure Databricks workspace and the related
managed resources. To delete the resource group by using Azure CLI or Azure PowerShell:
Azure CLI
Azure PowerShell
Limitations
The configuration of the storage account deployed in the resource group cannot be modified. To use locally
redundant storage (LRS) instead of geo-redundant storage (GRS), create a new storage account and mount it
in the existing workspace.
Next steps
In this quickstart, you created an Azure Databricks workspace by using an ARM template and validated the
deployment. Advance to the next article to learn how to perform an ETL operation (extract, transform, and load
data) using Azure Databricks.
Extract, transform, and load data using Azure Databricks
Quickstart: Create an Azure Databricks workspace
in your own Virtual Network
7/21/2022 • 5 minutes to read
The default deployment of Azure Databricks creates a new virtual network that is managed by Databricks. This
quickstart shows how to create an Azure Databricks workspace in your own virtual network instead. You also
create an Apache Spark cluster within that workspace.
For more information about why you might choose to create an Azure Databricks workspace in your own virtual
network, see Deploy Azure Databricks in your Azure Virtual Network (VNet Injection).
If you don't have an Azure subscription, create a free account. This tutorial cannot be carried out using an Azure
Free Trial Subscription. If you have a free account, go to your profile and change your subscription to
pay-as-you-go. For more information, see Azure free account. Then, remove the spending limit, and request a quota
increase for vCPUs in your region. When you create your Azure Databricks workspace, you can select the Trial
(Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free Premium Azure Databricks
DBUs for 14 days.
NOTE
If you want to create an Azure Databricks workspace in the Azure Commercial Cloud that holds US Government
compliance certifications like FedRAMP High, please reach out to your Microsoft or Databricks representative to gain
access to this experience.
Region - <Select the region that is closest to your users> - Select a geographic location where you can host your virtual network. Use the location that's closest to your users.
3. Select Next: IP Addresses > and apply the following settings. Then select Review + create .
4. On the Review + create tab, select Create to deploy the virtual network. Once the deployment is
complete, navigate to your virtual network and select Address space under Settings . In the box that
says Add additional address range, insert 10.179.0.0/16 and select Save .
Create an Azure Databricks workspace
1. From the Azure portal menu, select Create a resource . Then select Analytics > Databricks .
Location - <Select the region that is closest to your users> - Choose the same location as your virtual network.
3. Once you've finished entering settings on the Basics page, select Next: Networking > and apply the
following settings:
Deploy Azure Databricks workspace in your Virtual Network (VNet) - Yes - This setting allows you to deploy an Azure Databricks workspace in your virtual network.
Public Subnet Name - public-subnet - Use the default public subnet name.
4. Once the deployment is complete, navigate to the Azure Databricks resource. Notice that virtual network
peering is disabled. Also notice the resource group and managed resource group in the overview page.
The managed resource group is not modifiable, and it is not used to create virtual machines. You can only
create virtual machines in the resource group you manage.
When a workspace deployment fails, the workspace is still created in a failed state. Delete the failed
workspace and create a new workspace that resolves the deployment errors. When you delete the failed
workspace, the managed resource group and any successfully deployed resources are also deleted.
Create a cluster
NOTE
To use a free account to create the Azure Databricks cluster, before creating the cluster, go to your profile and change
your subscription to pay-as-you-go . For more information, see Azure free account.
1. Return to your Azure Databricks service and select Launch Workspace on the Overview page.
2. Select Clusters > + Create Cluster . Then create a cluster name, like databricks-quickstart-cluster, and
accept the remaining default settings. Select Create Cluster .
3. Once the cluster is running, return to the managed resource group in the Azure portal. Notice the new
virtual machines, disks, IP Address, and network interfaces. A network interface is created in each of the
public and private subnets with IP addresses.
4. Return to your Azure Databricks workspace and select the cluster you created. Then navigate to the
Executors tab on the Spark UI page. Notice that the addresses for the driver and the executors are in
the private subnet range. In this example, the driver is 10.179.0.6 and executors are 10.179.0.4 and
10.179.0.5. Your IP addresses could be different.
Clean up resources
After you have finished the article, you can terminate the cluster. To do so, in the Azure Databricks workspace,
select Clusters in the left pane. For the cluster you want to terminate, move the cursor over the ellipsis in the
Actions column, and select the Terminate icon. This stops the cluster.
If you do not manually terminate the cluster, it stops automatically, provided you selected the Terminate
after __ minutes of inactivity checkbox when you created the cluster. In that case, the cluster stops
automatically after it has been inactive for the specified time.
If you do not wish to reuse the cluster, you can delete the resource group you created in the Azure portal.
Next steps
In this article, you created a Spark cluster in Azure Databricks that you deployed to a virtual network. Advance to
the next article to learn how to query a SQL Server Linux Docker container in the virtual network using JDBC
from an Azure Databricks notebook.
Query a SQL Server Linux Docker container in a virtual network from an Azure Databricks notebook
What is the Databricks Lakehouse?
7/21/2022 • 2 minutes to read
The Databricks Lakehouse combines the ACID transactions and data governance of data warehouses with the
flexibility and cost-efficiency of data lakes to enable business intelligence (BI) and machine learning (ML) on all
data. The Databricks Lakehouse keeps your data in your massively scalable cloud object storage in open source
data standards, allowing you to use your data however and wherever you want.
Delta tables
Tables created on Azure Databricks use the Delta Lake protocol by default. When you create a new Delta table:
Metadata used to reference the table is added to the metastore in the declared schema or database.
Data and table metadata are saved to a directory in cloud object storage.
The metastore reference to a Delta table is technically optional; you can create Delta tables by directly interacting
with directory paths using Spark APIs. Some new features that build upon Delta Lake will store additional
metadata in the table directory, but all Delta tables have:
A directory containing table data in the Parquet file format.
A sub-directory /_delta_log that contains metadata about table versions in JSON and Parquet format.
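As an illustration only (the table name is hypothetical and this snippet is not part of the original article), you can create a small Delta table and list its backing directory to see this layout:
# Create a small Delta table and inspect its storage layout.
spark.range(5).write.mode("overwrite").saveAsTable("demo_delta")
# Look up the table's storage location from its metadata.
table_path = spark.sql("DESCRIBE DETAIL demo_delta").first()["location"]
# Parquet data files sit directly in the table directory...
display(dbutils.fs.ls(table_path))
# ...and the transaction log sits in the _delta_log sub-directory.
display(dbutils.fs.ls(table_path + "/_delta_log"))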
Learn more about Data objects in the Databricks Lakehouse.
The Databricks Lakehouse organizes data stored with Delta Lake in cloud object storage with familiar relations
like database, tables, and views. This model combines many of the benefits of a data warehouse with the
scalability and flexibility of a data lake. Learn more about how this model works, and the relationship between
object data and metadata so that you can apply best practices when designing and implementing Databricks
Lakehouse for your organization.
For information on securing objects with Unity Catalog, see securable objects model.
What is a metastore?
The metastore contains all of the metadata that defines data objects in the lakehouse. Azure Databricks provides
the following metastore options:
Unity Catalog : you can create a metastore to store and share metadata across multiple Azure Databricks
workspaces. Unity Catalog is managed at the account level.
Hive metastore : Azure Databricks stores all the metadata for the built-in Hive metastore as a managed
service. An instance of the metastore deploys to each cluster and securely accesses metadata from a central
repository for each customer workspace.
External metastore : you can also bring your own metastore to Azure Databricks.
Regardless of the metastore used, Azure Databricks stores all data associated with tables in object storage
configured by the customer in their cloud account.
What is a catalog?
A catalog is the highest abstraction (or coarsest grain) in the Databricks Lakehouse relational model. Every
database will be associated with a catalog. Catalogs exist as objects within a metastore.
Before the introduction of Unity Catalog, Azure Databricks used a two-tier namespace. Catalogs are the third tier
in the Unity Catalog namespacing model:
catalog_name.database_name.table_name
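For example (the catalog, schema, and table names below are hypothetical), a table can be addressed by its full three-level name:
# Three-level (catalog.schema.table) references with Unity Catalog.
orders_df = spark.table("main.sales.orders")
recent_df = spark.sql("SELECT * FROM main.sales.orders LIMIT 10")
display(recent_df)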
What is a database?
A database is a collection of data objects, such as tables or views (also called “relations”), and functions. In Azure
Databricks, the terms “schema” and “database” are used interchangeably (whereas in many relational systems, a
database is a collection of schemas).
Databases are always associated with a location on cloud object storage. You can optionally specify a
LOCATION when registering a database.
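As a sketch (the schema name and storage path are hypothetical), a database can be registered with an explicit location:
# Register a database (schema) with an explicit cloud storage location.
spark.sql("""
  CREATE SCHEMA IF NOT EXISTS sales_db
  LOCATION 'abfss://data@mystorageaccount.dfs.core.windows.net/schemas/sales_db'
""")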
What is a table?
An Azure Databricks table is a collection of structured data. A Delta table stores data as a directory of files on
cloud object storage and registers table metadata to the metastore within a catalog and schema. As Delta Lake is
the default storage provider for tables created in Azure Databricks, all tables created in Databricks are Delta
tables, by default. Because Delta tables store data in cloud object storage and provide references to data through
a metastore, users across an organization can access data using their preferred APIs; on Databricks, this includes
SQL, Python, PySpark, Scala, and R.
Note that it is possible to create tables on Databricks that are not Delta tables. These tables are not backed by
Delta Lake, and will not provide the ACID transactions and optimized performance of Delta tables. Tables falling
into this category include tables registered against data in external systems and tables registered against other
file formats in the data lake.
There are two kinds of tables in Databricks, managed and unmanaged (or external) tables.
NOTE
The Delta Live Tables distinction between live tables and streaming live tables is not enforced from the table perspective.
# Managed table: Databricks manages both the table metadata and the data location.
df.write.saveAsTable("table_name")

# Unmanaged (external) table: you control the storage path; only metadata is registered.
df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")
What is a view?
A view stores the text for a query typically against one or more data sources or tables in the metastore. In
Databricks, a view is equivalent to a Spark DataFrame persisted as an object in a database. Unlike DataFrames,
you can query views from any part of the Databricks product, assuming you have permission to do so. Creating
a view does not process or write any data; only the query text is registered to the metastore in the associated
database.
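For instance (hypothetical names), creating a view registers only the query text in the metastore; no data is copied:
# Only the query text is stored; nothing is written until the view is queried.
spark.sql("""
  CREATE VIEW IF NOT EXISTS sales_db.large_orders AS
  SELECT order_id, customer_id, amount
  FROM sales_db.orders
  WHERE amount > 1000
""")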
What is a function?
Functions allow you to associate user-defined logic with a database. Functions can return either scalar values or
sets of rows. You can use functions to provide managed access to custom logic across a variety of contexts on
the Databricks product.
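A minimal sketch of a scalar SQL user-defined function scoped to a database (all names are hypothetical, and SQL UDF syntax is assumed to be available in your runtime):
# A scalar SQL UDF registered in a database.
spark.sql("""
  CREATE FUNCTION IF NOT EXISTS sales_db.order_total(quantity INT, unit_price DOUBLE)
  RETURNS DOUBLE
  RETURN quantity * unit_price
""")
# Use it like any built-in function.
spark.sql("SELECT sales_db.order_total(3, 19.99) AS total").show()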
This article introduces the set of fundamental concepts you need to understand in order to use Azure Databricks
effectively.
Some concepts are general to Azure Databricks, and others are specific to the persona-based Azure Databricks
environment you are using:
Databricks Data Science & Engineering
Databricks Machine Learning
Databricks SQL
General concepts
This section describes concepts and terms that apply across all Azure Databricks persona-based environments.
Workspaces
In Azure Databricks, workspace has two meanings:
1. An Azure Databricks deployment in the cloud that functions as the unified environment that your team
uses for accessing all of their Databricks assets. Your organization can choose to have multiple
workspaces or just one: it depends on your needs.
2. The UI for the Databricks Data Science & Engineering and Databricks Machine Learning persona-based
environments, as opposed to the Databricks SQL environment.
When we talk about the “workspace browser,” for example, we are talking about the UI that lets you
browse notebooks, libraries, and other files in the Data Science & Engineering and Databricks Machine
Learning environments—a UI that isn’t part of the Databricks SQL environment. But Data Science &
Engineering, Databricks Machine Learning, and Databricks SQL are all included in your deployed Azure
Databricks workspace.
Billing
DBU
Azure Databricks bills based on Databricks units (DBUs), units of processing capability per hour based on VM
instance type.
See the Azure Databricks pricing page.
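As a rough, purely illustrative calculation (the rates below are hypothetical, not actual Azure Databricks prices), the DBU charge for a cluster is the per-node DBU rate multiplied by node count, hours, and the price per DBU:
# Hypothetical rates; actual DBU emission and pricing depend on VM size, workload type, tier, and region.
dbu_per_node_hour = 0.75   # DBUs emitted per node per hour (assumed)
price_per_dbu = 0.40       # USD per DBU (assumed)
nodes, hours = 4, 3
dbu_cost = dbu_per_node_hour * nodes * hours * price_per_dbu
print(f"Estimated DBU charge: ${dbu_cost:.2f}")  # VM and storage costs are billed separately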
Authentication and authorization
This section describes concepts that you need to know when you manage Azure Databricks users and their
access to Azure Databricks assets.
User
A unique individual who has access to the system.
Group
A collection of users.
Access control list (ACL)
A list of permissions attached to the workspace, cluster, job, table, or experiment. An ACL specifies which users
or system processes are granted access to the objects, as well as what operations are allowed on the assets. Each
entry in a typical ACL specifies a subject and an operation.
Databricks SQL
Databricks SQL is geared toward data analysts who work primarily with SQL queries and BI tools. It provides an
intuitive environment for running ad-hoc queries and creating dashboards on data stored in your data lake. Its
UI is quite different from that of the Data Science & Engineering and Databricks Machine Learning
environments. This section describes the fundamental concepts you need to understand in order to use
Databricks SQL effectively.
Databricks SQL interface
This section describes the interfaces that Azure Databricks supports for accessing your Databricks SQL assets: UI
and API.
UI : A graphical interface to dashboards and queries, SQL warehouses, query history, and alerts.
REST API : An interface that allows you to automate tasks on Databricks SQL objects.
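As a hedged sketch only (the endpoint path and response shape are assumptions based on the SQL Warehouses REST API, not taken from this article), warehouses can be listed programmatically with a personal access token:
# List SQL warehouses through the REST API using a personal access token.
import requests

workspace_url = "https://<per-workspace-url>"   # your workspace URL
token = "<personal-access-token>"

resp = requests.get(f"{workspace_url}/api/2.0/sql/warehouses",
                    headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
for warehouse in resp.json().get("warehouses", []):
    print(warehouse["id"], warehouse["name"], warehouse["state"])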
Data management in Databricks SQL
Visualization : A graphical presentation of the result of running a query.
Dashboard : A presentation of query visualizations and commentary.
Alert : A notification that a field returned by a query has reached a threshold.
Computation management in Databricks SQL
This section describes concepts that you need to know to run SQL queries in Databricks SQL.
Query : A valid SQL statement.
SQL warehouse : A computation resource on which you execute SQL queries.
Query history : A list of executed queries and their performance characteristics.
Authentication and authorization in Databricks SQL
This section describes concepts that you need to know when you manage Databricks SQL users and groups and
their access to assets.
User and group : A user is a unique individual who has access to the system. A group is a collection of users.
Personal access token : An opaque string used to authenticate to the REST API and by tools in the Databricks
integrations to connect to SQL warehouses.
Access control list : A set of permissions attached to a principal that requires access to an object. An ACL entry
specifies the object and the actions allowed on the object. Each entry in an ACL specifies a principal, action type,
and object.
Azure Databricks architecture overview
7/21/2022 • 2 minutes to read
The Databricks Unified Data Analytics Platform, from the original creators of Apache Spark, enables data teams
to collaborate in order to solve some of the world’s toughest problems.
High-level architecture
Azure Databricks is structured to enable secure cross-functional team collaboration while keeping a significant
amount of backend services managed by Azure Databricks so you can stay focused on your data science, data
analytics, and data engineering tasks.
Azure Databricks operates out of a control plane and a data plane.
The control plane includes the backend services that Azure Databricks manages in its own Azure account.
Notebook commands and many other workspace configurations are stored in the control plane and
encrypted at rest.
The data plane is managed by your Azure account and is where your data resides. This is also where data is
processed. You can use Azure Databricks connectors so that your clusters can connect to external data
sources outside of your Azure account to ingest data or for storage. You can also ingest data from external
streaming data sources, such as events data, streaming data, IoT data, and more.
Although architectures can vary depending on custom configurations (such as when you’ve deployed an Azure
Databricks workspace to your own virtual network, also known as VNet injection), the following architecture
diagram represents the most common structure and flow of data for Azure Databricks.
For more architecture information, see Manage virtual networks.
Your data is stored at rest in your Azure account in the data plane and in your own data sources, not the control
plane, so you maintain control and ownership of your data.
Job results reside in storage in your account.
Interactive notebook results are stored in a combination of the control plane (partial results for presentation in
the UI) and your Azure storage. If you want interactive notebook results stored only in your cloud account
storage, you can ask your Databricks representative to enable interactive notebook results in the customer
account for your workspace. Note that some metadata about results, such as chart column names, continues to
be stored in the control plane. This feature is in Public Preview.
Tutorial: Run a job with an Azure service principal
7/21/2022 • 8 minutes to read
Jobs provide a non-interactive way to run applications in an Azure Databricks cluster, for example, an ETL job or
data analysis task that should run on a scheduled basis. Typically these jobs run as the user that created them,
but this can have some limitations:
Creating and running jobs is dependent on the user having appropriate permissions.
Only the user that created the job has access to the job.
The user might be removed from the Azure Databricks workspace.
Using a service account—an account associated with an application rather than a specific user—is a common
method to address these limitations. In Azure, you can use an Azure Active Directory (Azure AD) application and
service principal to create a service account.
An example of where this is important is when service principals control access to data stored in an Azure Data
Lake Storage Gen2 account. Running jobs with those service principals allows the jobs to access data in the
storage account and provides control over data access scope.
This tutorial describes how to create an Azure AD application and service principal and make that service
principal the owner of a job. You’ll also learn how to give job run permissions to other groups that don’t own the
job. The following is a high-level overview of the tasks this tutorial walks through:
1. Create a service principal in Azure Active Directory.
2. Create a personal access token (PAT) in Azure Databricks. You’ll use the PAT to authenticate to the Databricks
REST API.
3. Add the service principal as a non-administrative user to Azure Databricks using the Databricks SCIM API.
4. Create an Azure Key Vault-backed secret scope in Azure Databricks.
5. Grant the service principal read access to the secret scope.
6. Create a job in Azure Databricks and configure the job cluster to read secrets from the secret scope.
7. Transfer ownership of the job to the service principal.
8. Test the job by running it as the service principal.
If you don’t have an Azure subscription, create a free account before you begin.
Requirements
You’ll need the following for this tutorial:
A user account with the permissions required to register an application in your Azure AD tenant.
Administrative privileges in the Azure Databricks workspace where you’ll run jobs.
A tool for making API requests to Azure Databricks. This tutorial uses cURL, but you can use any tool that
allows you to submit REST API requests.
TIP
This example uses a personal access token, but you can use an Azure Active Directory token for most APIs. As a best
practice, use a PAT for administrative configuration tasks, and prefer Azure AD tokens for production
workloads.
You can restrict the generation of PATs to administrators only for security purposes. See Manage personal access tokens
for more details.
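The article's cURL command for this step is not reproduced above. As a hedged sketch only, the service principal can be added to the workspace through the SCIM API; the endpoint path, payload shape, and display name below are assumptions, not taken from this article:
# Add the Azure AD service principal to the workspace as a non-admin user via the SCIM API.
import requests

resp = requests.post(
    "https://<per-workspace-url>/api/2.0/preview/scim/v2/ServicePrincipals",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
        "applicationId": "<application-id>",
        "displayName": "tutorial-service-principal",  # hypothetical display name
        "entitlements": [{"value": "allow-cluster-create"}],
    },
)
print(resp.status_code, resp.text)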
Replace <per-workspace-url> with the unique per-workspace URL for your Azure Databricks workspace.
Replace <personal-access-token> with the Azure Databricks personal access token.
Replace <application-id> with the Application (client) ID for the Azure AD application registration.
Replace <per-workspace-url> with the unique per-workspace URL for your Azure Databricks workspace.
Replace <personal-access-token> with the Azure Databricks personal access token.
Replace <scope-name> with the name of the Azure Databricks secret scope that contains the client secret.
Replace <application-id> with the Application (client) ID for the Azure AD application registration.
fs.azure.account.auth.type.acmeadls.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.acmeadls.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.acmeadls.dfs.core.windows.net <application-id>
fs.azure.account.oauth2.client.secret.acmeadls.dfs.core.windows.net {{secrets/<secret-scope-name>/<secret-name>}}
fs.azure.account.oauth2.client.endpoint.acmeadls.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
Replace <secret-scope-name> with the name of the Azure Databricks secret scope that contains the
client secret.
Replace <application-id> with the Application (client) ID for the Azure AD application registration.
Replace <secret-name> with the name associated with the client secret value in the secret scope.
Replace <directory-id> with the Directory (tenant) ID for the Azure AD application registration.
Replace <per-workspace-url> with the unique per-workspace URL for your Azure Databricks workspace.
Replace <personal-access-token> with the Azure Databricks personal access token.
Replace <application-id> with the Application (client) ID for the Azure AD application registration.
The job will also need read permissions to the notebook. Run the following command to grant the required
permissions:
curl -X PUT 'https://<per-workspace-url>/api/2.0/permissions/notebooks/<notebook-id>' \
--header 'Authorization: Bearer <personal-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{
"access_control_list": [
{
"service_principal_name": "<application-id>",
"permission_level": "CAN_READ"
}
]
}'
Replace <per-workspace-url> with the unique per-workspace URL for your Azure Databricks workspace.
Replace <notebook-id> with the ID of the notebook associated with the job. To find the ID, go to the notebook
in the Azure Databricks workspace and look for the numeric ID that follows notebook/ in the notebook’s
URL.
Replace <personal-access-token> with the Azure Databricks personal access token.
Replace <application-id> with the Application (client) ID for the Azure AD application registration.
Learn more
To learn more about creating and running jobs, see Jobs.
Tutorial: Query a SQL Server Linux Docker
container in a virtual network from an Azure
Databricks notebook
7/21/2022 • 5 minutes to read
This tutorial teaches you how to integrate Azure Databricks with a SQL Server Linux Docker container in a
virtual network.
In this tutorial, you learn how to:
Deploy an Azure Databricks workspace to a virtual network
Install a Linux virtual machine in a public network
Install Docker
Install a Microsoft SQL Server on Linux Docker container
Query the SQL Server using JDBC from a Databricks notebook
Prerequisites
Create a Databricks workspace in a virtual network.
Install Ubuntu for Windows.
Download SQL Server Management Studio.
3. Navigate to the Networking tab. Choose the virtual network and the public subnet that includes your
Azure Databricks cluster. Select Review + create , then Create to deploy the virtual machine.
4. When the deployment is complete, navigate to the virtual machine. Notice the Public IP address and
Virtual network/subnet on the Overview page. Select the Public IP Address.
5. Change the Assignment to Static and enter a DNS name label . Select Save , and restart the virtual
machine.
6. Select the Networking tab under Settings . Notice that the network security group that was created
during the Azure Databricks deployment is associated with the virtual machine. Select Add inbound
port rule .
7. Add a rule to open port 22 for SSH. Use the following settings:
Source IP addresses - <your public ip> - Enter your public IP address. You can find your public IP address by visiting bing.com and searching for "my IP".
Destination IP addresses - <your vm public ip> - Enter your virtual machine's public IP address. You can find this on the Overview page of your virtual machine.
2. Enter the command in your Ubuntu terminal and enter the admin password you created when you
configured the virtual machine.
3. Install Docker on the virtual machine. Then use the following command to list the Docker containers and verify the installation.
sudo docker ps -a
%sh
ping 10.179.64.4
%sh
nslookup databricks-tutorial-vm.westus2.cloudapp.azure.com
3. Once you've successfully pinged the SQL Server, you can query the database and tables. Run the
following python code:
jdbcHostname = "10.179.64.4"
jdbcDatabase = "MYDB"
userName = 'SA'
password = 'Password1234'
jdbcPort = 1433
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(jdbcHostname,
jdbcPort, jdbcDatabase, userName, password)
df = spark.read.jdbc(url=jdbcUrl, table='states')
display(df)
Clean up resources
When no longer needed, delete the resource group, the Azure Databricks workspace, and all related resources.
Deleting the job avoids unnecessary billing. If you're planning to use the Azure Databricks workspace in future,
you can stop the cluster and restart it later. If you are not going to continue to use this Azure Databricks
workspace, delete all resources you created in this tutorial by using the following steps:
1. From the left-hand menu in the Azure portal, click Resource groups and then click the name of the
resource group you created.
2. On your resource group page, select Delete , type the name of the resource to delete in the text box, and
then select Delete again.
Next steps
Advance to the next article to learn how to extract, transform, and load data using Azure Databricks.
Tutorial: Extract, transform, and load data by using Azure Databricks
Tutorial: Access Azure Blob Storage from Azure
Databricks using Azure Key Vault
7/21/2022 • 5 minutes to read
This tutorial describes how to access Azure Blob Storage from Azure Databricks using secrets stored in a key
vault.
In this tutorial, you learn how to:
Create a storage account and blob container
Create an Azure Key Vault and add a secret
Create an Azure Databricks workspace and add a secret scope
Access your blob container from Azure Databricks
Prerequisites
Azure subscription - create one for free
NOTE
This tutorial cannot be carried out using Azure Free Trial Subscription . If you have a free account, go to your profile
and change your subscription to pay-as-you-go . For more information, see Azure free account. Then, remove the
spending limit, and request a quota increase for vCPUs in your region. When you create your Azure Databricks
workspace, you can select the Trial (Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free
Premium Azure Databricks DBUs for 14 days.
5. Locate a file you want to upload to your blob storage container. If you don't have a file, use a text editor to
create a new text file with some information. In this example, a file named hw.txt contains the text "hello
world." Save your text file locally and upload it to your blob storage container.
6. Return to your storage account and select Access keys under Settings . Copy Storage account name
and key 1 to a text editor for later use in this tutorial.
6. On the Create a secret page, provide the following information, and keep the default values for the
remaining fields:
7. Save the key name in a text editor for use later in this tutorial, and select Create . Then, navigate to the
Properties menu. Copy the DNS Name and Resource ID to a text editor for use later in the tutorial.
Create an Azure Databricks workspace and add a secret scope
1. In the Azure portal, select Create a resource > Analytics > Azure Databricks .
2. Under Azure Databricks Service , provide the following values to create a Databricks workspace.
Resource group - Select the same resource group that contains your key vault.
Location - Select the same location as your Azure Key Vault. For all available regions, see Azure services available by region.
4. Once your Azure Databricks workspace is open in a separate window, append #secrets/createScope to
the URL. The URL should have the following format:
https://<location>.azuredatabricks.net/?o=<orgID>#secrets/createScope
5. Enter a scope name, and enter the Azure Key Vault DNS name and Resource ID you saved earlier. Save the
scope name in a text editor for use later in this tutorial. Then, select Create .
dbutils.fs.mount(
source = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net",
mount_point = "/mnt/<mount-name>",
extra_configs = {"<conf-key>":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})
mount-name is a DBFS path representing where the Blob Storage container or a folder inside the
container (specified in source) will be mounted. A directory is created using the mount-name you
provide.
conf-key can be either fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net or fs.azure.sas.<your-container-name>.<your-storage-account-name>.blob.core.windows.net
scope-name is the name of the secret scope you created in the previous section.
key-name is the name of the secret you created for the storage account key in your key vault.
6. Run the following command to read the text file in your blob storage container to a dataframe. Change
the values in the command to match your mount name and file name.
df = spark.read.text("/mnt/<mount-name>/<file-name>")
df.show()
dbutils.fs.unmount("/mnt/<mount-name>")
9. Notice that once the mount has been unmounted, you can no longer read from your blob storage
account.
Clean up resources
If you're not going to continue to use this application, delete your entire resource group with the following steps:
1. From the left-hand menu in Azure portal, select Resource groups and navigate to your resource group.
2. Select Delete resource group and type your resource group name. Then select Delete .
Next steps
Advance to the next article to learn how to implement a VNet injected Databricks environment with a Service
Endpoint enabled for Cosmos DB.
Tutorial: Implement Azure Databricks with a Cosmos DB endpoint
Tutorial: Implement Azure Databricks with a Cosmos
DB endpoint
7/21/2022 • 4 minutes to read
This tutorial describes how to implement a VNet injected Databricks environment with a Service Endpoint
enabled for Cosmos DB.
In this tutorial you learn how to:
Create an Azure Databricks workspace in a virtual network
Create a Cosmos DB service endpoint
Create a Cosmos DB account and import data
Create an Azure Databricks cluster
Query Cosmos DB from an Azure Databricks notebook
Prerequisites
Before you start, do the following:
Create an Azure Databricks workspace in a virtual network.
Download the Spark connector.
Download sample data from the NOAA National Centers for Environmental Information. Select a state or
area and select Search . On the next page, accept the defaults and select Search . Then select CSV
Download on the left side of the page to download the results.
Download the pre-compiled binary of the Azure Cosmos DB Data Migration Tool.
SETTING - VALUE
Location - West US
Geo-Redundancy - Disable
NOTE
It is not necessary for this tutorial, but you can also enable Allow access from my IP if you want the ability to
access your Cosmos DB account from your local machine. For example, if you are connecting to your account
using the Cosmos DB SDK, you need to enable this setting. If it is disabled, you will receive "Access Denied" errors.
4. Select Review + Create , and then Create to create your Cosmos DB account inside the virtual network.
5. Once your Cosmos DB account has been created, navigate to Keys under Settings . Copy the primary
connection string and save it in a text editor for later use.
6. Select Data Explorer and New Container to add a new database and container to your Cosmos DB
account.
Upload data to Cosmos DB
1. Open the graphical interface version of the data migration tool for Cosmos DB, Dtui.exe .
2. On the Source Information tab, select CSV File(s) in the Import from dropdown. Then select Add
Files and add the storm data CSV you downloaded as a prerequisite.
3. On the Target Information tab, input your connection string. The connection string format is
AccountEndpoint=<URL>;AccountKey=<key>;Database=<database> . The AccountEndpoint and AccountKey are
included in the primary connection string you saved in the previous section. Append
Database=<your database name> to the end of the connection string, and select Verify . Then, add the
Container name and partition key.
4. Select Next until you get to the Summary page. Then, select Import .
You can verify that the library was installed on the Libraries tab.
Query Cosmos DB from a Databricks notebook
1. Navigate to your Azure Databricks workspace and create a new python notebook.
2. Run the following python code to set the Cosmos DB connection configuration. Change the Endpoint ,
Masterkey , Database , and Container accordingly.
connectionConfig = {
"Endpoint" : "https://<your Cosmos DB account name.documents.azure.com:443/",
"Masterkey" : "<your Cosmos DB primary key>",
"Database" : "<your database name>",
"preferredRegions" : "West US 2",
"Container": "<your container name>",
"schema_samplesize" : "1000",
"query_pagesize" : "200000",
"query_custom" : "SELECT * FROM c"
}
3. Use the following python code to load the data and create a temporary view.
users = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**connectionConfig).load()
users.createOrReplaceTempView("storm")
4. Use the following magic command to execute a SQL statement that returns data.
%sql
select * from storm
You have successfully connected your VNet-injected Databricks workspace to a service-endpoint enabled
Cosmos DB resource. To read more about how to connect to Cosmos DB, see Azure Cosmos DB
Connector for Apache Spark.
Clean up resources
When no longer needed, delete the resource group, the Azure Databricks workspace, and all related resources.
Deleting the job avoids unnecessary billing. If you're planning to use the Azure Databricks workspace in future,
you can stop the cluster and restart it later. If you are not going to continue to use this Azure Databricks
workspace, delete all resources you created in this tutorial by using the following steps:
1. From the left-hand menu in the Azure portal, click Resource groups and then click the name of the
resource group you created.
2. On your resource group page, select Delete , type the name of the resource to delete in the text box, and
then select Delete again.
Next steps
In this tutorial, you've deployed an Azure Databricks workspace to a virtual network, and used the Cosmos DB
Spark connector to query Cosmos DB data from Databricks. To learn more about working with Azure Databricks
in a virtual network, continue to the tutorial for using SQL Server with Azure Databricks.
Tutorial: Query a SQL Server Linux Docker container in a virtual network from an Azure Databricks notebook
Tutorial: Extract, transform, and load data by using
Azure Databricks
7/21/2022 • 12 minutes to read
In this tutorial, you perform an ETL (extract, transform, and load data) operation by using Azure Databricks. You
extract data from Azure Data Lake Storage Gen2 into Azure Databricks, run transformations on the data in Azure
Databricks, and load the transformed data into Azure Synapse Analytics.
The steps in this tutorial use the Azure Synapse connector for Azure Databricks to transfer data to Azure
Databricks. This connector, in turn, uses Azure Blob Storage as temporary storage for the data being transferred
between an Azure Databricks cluster and Azure Synapse.
The following illustration shows the application flow:
NOTE
This tutorial cannot be carried out using Azure Free Trial Subscription . If you have a free account, go to your profile
and change your subscription to pay-as-you-go . For more information, see Azure free account. Then, remove the
spending limit, and request a quota increase for vCPUs in your region. When you create your Azure Databricks
workspace, you can select the Trial (Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free
Premium Azure Databricks DBUs for 14 days.
Prerequisites
Complete these tasks before you begin this tutorial:
Create an Azure Synapse, create a server-level firewall rule, and connect to the server as a server admin.
See Quickstart: Create and query a Synapse SQL pool using the Azure portal.
Create a master key for the Azure Synapse. See Create a database master key.
Create an Azure Blob storage account, and a container within it. Also, retrieve the access key to access the
storage account. See Quickstart: Upload, download, and list blobs with the Azure portal.
Create an Azure Data Lake Storage Gen2 storage account. See Quickstart: Create an Azure Data Lake
Storage Gen2 storage account.
Create a service principal. See How to: Use the portal to create an Azure AD application and service
principal that can access resources.
There are a couple of specific things that you'll have to do as you perform the steps in that article.
When performing the steps in the Assign the application to a role section of the article, make sure
to assign the Storage Blob Data Contributor role to the service principal in the scope of the
Data Lake Storage Gen2 account. If you assign the role to the parent resource group or
subscription, you'll receive permissions-related errors until those role assignments propagate to
the storage account.
If you'd prefer to use an access control list (ACL) to associate the service principal with a specific
file or directory, reference Access control in Azure Data Lake Storage Gen2.
When performing the steps in the Get values for signing in section of the article, paste the tenant
ID, app ID, and secret values into a text file.
Sign in to the Azure portal.
2. Under Azure Databricks Ser vice , provide the following values to create a Databricks service:
3. The account creation takes a few minutes. To monitor the operation status, view the progress bar at the
top.
4. Select Pin to dashboard and then select Create .
Create a file system in the Azure Data Lake Storage Gen2 account
In this section, you create a notebook in the Azure Databricks workspace and then run code snippets to configure
the storage account.
1. In the Azure portal, go to the Azure Databricks service that you created, and select Launch Workspace .
2. On the left, select Workspace . From the Workspace drop-down, select Create > Notebook .
3. In the Create Notebook dialog box, enter a name for the notebook. Select Scala as the language, and
then select the Spark cluster that you created earlier.
4. Select Create .
5. The following code block sets default service principal credentials for any ADLS Gen 2 account accessed
in the Spark session. The second code block appends the account name to the setting to specify
credentials for a specific ADLS Gen 2 account. Copy and paste either code block into the first cell of your
Azure Databricks notebook.
Session configuration
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<appID>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<tenant-
id>/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
Account configuration
val storageAccountName = "<storage-account-name>"
val appID = "<app-id>"
val secret = "<secret>"
val fileSystemName = "<file-system-name>"
val tenantID = "<tenant-id>"
6. In this code block, replace the <app-id> , <secret> , <tenant-id> , and <storage-account-name>
placeholder values with the values that you collected while completing the prerequisites of this tutorial.
Replace the <file-system-name> placeholder value with whatever name you want to give the file system.
The <app-id> and <secret> are from the app that you registered with Azure Active Directory as part of
creating a service principal.
The <tenant-id> is from your subscription.
The <storage-account-name> is the name of your Azure Data Lake Storage Gen2 storage account.
7. Press the SHIFT + ENTER keys to run the code in this block.
Ingest sample data into the Azure Data Lake Storage Gen2 account
Before you begin with this section, you must complete the following prerequisites:
Enter the following code into a notebook cell:
Extract data from the Azure Data Lake Storage Gen2 account
1. You can now load the sample JSON file as a data frame in Azure Databricks. Paste the following code into a new cell; it uses the file system and storage account variables that you defined earlier.
val df = spark.read.json("abfss://" + fileSystemName + "@" + storageAccountName +
".dfs.core.windows.net/small_radio_json.json")
2. Press the SHIFT + ENTER keys to run the code in this block.
3. Run the following code to see the contents of the data frame:
df.show()
+--------------------+---------+---------+------+-------------+----------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+------+
|              artist|     auth|firstName|gender|itemInSession|  lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|userId|
+--------------------+---------+---------+------+-------------+----------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+------+
|         El Arrebato|Logged In| Annalyse|     F|            2|Montgomery|234.57914| free|  Killeen-Temple, TX|   PUT|NextSong|1384448062332|     1879|Quiero Quererte Q...|   200|1409318650332|   309|
|Creedence Clearwa...|Logged In|   Dylann|     M|            9|    Thomas|340.87138| paid|       Anchorage, AK|   PUT|NextSong|1400723739332|       10|        Born To Move|   200|1409318653332|    11|
|            Gorillaz|Logged In|     Liam|     M|           11|     Watts|246.17751| paid|New York-Newark-J...|   PUT|NextSong|1406279422332|     2047|                DARE|   200|1409318685332|   201|
...
...
You have now extracted the data from Azure Data Lake Storage Gen2 into Azure Databricks.
You can further transform this data to rename the column level to subscription_type.
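The transformation cell is not reproduced here. A minimal sketch that selects the columns of interest and renames level, producing the renamedColumnsDF dataframe used in the rest of this tutorial and the output shown below, is:

// Keep only the columns needed downstream and rename "level" to "subscription_type".
val renamedColumnsDF = df
  .select("firstname", "lastname", "gender", "location", "level")
  .withColumnRenamed("level", "subscription_type")

renamedColumnsDF.show()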
+---------+----------+------+--------------------+-----------------+
|firstname| lastname|gender| location|subscription_type|
+---------+----------+------+--------------------+-----------------+
| Annalyse|Montgomery| F| Killeen-Temple, TX| free|
| Dylann| Thomas| M| Anchorage, AK| paid|
| Liam| Watts| M|New York-Newark-J...| paid|
| Tess| Townsend| F|Nashville-Davidso...| free|
| Margaux| Smith| F|Atlanta-Sandy Spr...| free|
| Alan| Morse| M|Chicago-Napervill...| paid|
|Gabriella| Shelton| F|San Jose-Sunnyval...| free|
| Elijah| Williams| M|Detroit-Warren-De...| paid|
| Margaux| Smith| F|Atlanta-Sandy Spr...| free|
| Tess| Townsend| F|Nashville-Davidso...| free|
| Alan| Morse| M|Chicago-Napervill...| paid|
| Liam| Watts| M|New York-Newark-J...| paid|
| Liam| Watts| M|New York-Newark-J...| paid|
| Dylann| Thomas| M| Anchorage, AK| paid|
| Alan| Morse| M|Chicago-Napervill...| paid|
| Elijah| Williams| M|Detroit-Warren-De...| paid|
| Margaux| Smith| F|Atlanta-Sandy Spr...| free|
| Alan| Morse| M|Chicago-Napervill...| paid|
| Dylann| Thomas| M| Anchorage, AK| paid|
| Margaux| Smith| F|Atlanta-Sandy Spr...| free|
+---------+----------+------+--------------------+-----------------+
Load data into Azure Synapse
2. Specify a temporary folder to use while moving data between Azure Databricks and Azure Synapse.
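A sketch of this setting follows. The blobStorage, blobContainer, and blobAccessKey names are assumptions standing in for an earlier step that defines your Blob storage account, container, and access key; replace the placeholder values with your own.

// Assumed Blob storage settings (placeholders) and the temporary folder used
// for staging data between Azure Databricks and Azure Synapse.
val blobStorage = "<blob-storage-account-name>.blob.core.windows.net"
val blobContainer = "<blob-container-name>"
val blobAccessKey = "<access-key>"

val tempDir = "wasbs://" + blobContainer + "@" + blobStorage + "/tempDirs"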
3. Run the following snippet to store Azure Blob storage access keys in the configuration. This action
ensures that you don't have to keep the access key in the notebook in plain text.
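A minimal sketch, assuming the blobStorage and blobAccessKey values defined above, is:

// Store the access key in the Hadoop configuration for the session so it is
// not embedded in later cells in plain text.
val acntInfo = "fs.azure.account.key." + blobStorage
sc.hadoopConfiguration.set(acntInfo, blobAccessKey)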
4. Provide the values to connect to the Azure Synapse instance. You must have created an Azure Synapse Analytics service as a prerequisite. Use the fully qualified server name for dwServer. For example, <servername>.database.windows.net.
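A sketch of these settings follows; the variable names are assumptions chosen to match the sqlDwUrlSmall reference in the next step, and every value shown is a placeholder.

// Azure Synapse connection settings (placeholders).
val dwDatabase = "<database-name>"
val dwServer = "<database-server-name>"   // fully qualified, for example <servername>.database.windows.net
val dwUser = "<user-name>"
val dwPass = "<password>"
val dwJdbcPort = "1433"
val sqlDwUrlSmall = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort +
  ";database=" + dwDatabase + ";user=" + dwUser + ";password=" + dwPass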
5. Run the following snippet to load the transformed dataframe, renamedColumnsDF , as a table in Azure
Synapse. This snippet creates a table called SampleTable in the SQL database.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

renamedColumnsDF.write
  .format("com.databricks.spark.sqldw")
  .option("url", sqlDwUrlSmall)
  .option("dbtable", "SampleTable")
  .option("forward_spark_azure_storage_credentials", "True")
  .option("tempdir", tempDir)
  .mode("overwrite")
  .save()
NOTE
This sample uses the forward_spark_azure_storage_credentials flag, which causes Azure Synapse to access data from blob storage using an access key. This is the only supported method of authentication.
If your Azure Blob Storage is restricted to select virtual networks, Azure Synapse requires Managed Service Identity instead of access keys, and connections that use an access key fail with the error "This request is not authorized to perform this operation."
6. Connect to the SQL database and verify that you see a table named SampleTable.
7. Run a select query to verify the contents of the table. The table should have the same data as the
renamedColumnsDF dataframe.
Clean up resources
After you finish the tutorial, you can terminate the cluster. From the Azure Databricks workspace, select Clusters on the left. For the cluster you want to terminate, under Actions, point to the ellipsis (...) and select the Terminate icon.
If you don't manually terminate the cluster, it automatically stops, provided you selected the Terminate after __
minutes of inactivity check box when you created the cluster. In such a case, the cluster automatically stops if
it's been inactive for the specified time.
Next steps
In this tutorial, you learned how to:
Create an Azure Databricks service
Create a Spark cluster in Azure Databricks
Create a notebook in Azure Databricks
Extract data from a Data Lake Storage Gen2 account
Transform data in Azure Databricks
Load data into Azure Synapse
Advance to the next tutorial to learn about streaming real-time data into Azure Databricks using Azure Event
Hubs.
Stream data into Azure Databricks using Event Hubs
Tutorial: Stream data into Azure Databricks using
Event Hubs
7/21/2022 • 12 minutes to read
In this tutorial, you connect a data ingestion system with Azure Databricks to stream data into an Apache Spark
cluster in near real-time. You set up a data ingestion system using Azure Event Hubs and then connect it to Azure
Databricks to process the messages coming through. To access a stream of data, you use Twitter APIs to ingest
tweets into Event Hubs. Once you have the data in Azure Databricks, you can run analytical jobs to further
analyze the data.
By the end of this tutorial, you will have streamed tweets from Twitter (that have the term "Azure" in them)
and read the tweets in Azure Databricks.
The following illustration shows the application flow:
NOTE
This tutorial cannot be carried out using Azure Free Trial Subscription . If you have a free account, go to your profile
and change your subscription to pay-as-you-go . For more information, see Azure free account. Then, remove the
spending limit, and request a quota increase for vCPUs in your region. When you create your Azure Databricks
workspace, you can select the Trial (Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free
Premium Azure Databricks DBUs for 14 days.
Prerequisites
Before you start with this tutorial, make sure to meet the following requirements:
An Azure Event Hubs namespace.
An Event Hub within the namespace.
Connection string to access the Event Hubs namespace. The connection string should have a format similar to Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key name>;SharedAccessKey=<key value>.
Shared access policy name and policy key for Event Hubs.
You can meet these requirements by completing the steps in the article, Create an Azure Event Hubs namespace
and event hub.
2. Under Azure Databricks Service, provide the values to create a Databricks workspace.
Save the values that you retrieved for the Twitter application. You need the values later in the tutorial.
2. In the Create Notebook dialog box, enter SendTweetsToEventHub , select Scala as the language, and
select the Spark cluster that you created earlier.
Select Create .
3. Repeat the steps to create the ReadTweetsFromEventHub notebook.
NOTE
The Twitter API has certain request restrictions and quotas. If you are not satisfied with the standard rate limiting in the Twitter API, you can generate text content without using the Twitter API in this example. To do that, set the variable dataSource to test instead of twitter and populate the list testSource with preferred test input.
import scala.collection.JavaConverters._
import com.microsoft.azure.eventhubs._
import java.util.concurrent._
import scala.collection.immutable._
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
val namespaceName = "<EVENT HUBS NAMESPACE>"
val eventHubName = "<EVENT HUB NAME>"
val sasKeyName = "<POLICY NAME>"
val sasKey = "<POLICY KEY>"
val connStr = new ConnectionStringBuilder()
.setNamespaceName(namespaceName)
.setEventHubName(eventHubName)
.setSasKeyName(sasKeyName)
.setSasKey(sasKey)
// Specify 'test' if you prefer to not use the Twitter API and loop through a list of values you define in `testSource`
// Otherwise specify 'twitter'
val dataSource = "test"
if (dataSource == "twitter") {
import twitter4j._
import twitter4j.TwitterFactory
import twitter4j.Twitter
import twitter4j.conf.ConfigurationBuilder
// Twitter configuration!
// Replace the values below with your Twitter application values
// Getting tweets with keyword "Azure" and sending them to the Event Hub in realtime!
val query = new Query(" #Azure ")
query.setCount(100)
query.lang("en")
var finished = false
while (!finished) {
val result = twitter.search(query)
val statuses = result.getTweets()
var lowestStatusId = Long.MaxValue
for (status <- statuses.asScala) {
if(!status.isRetweet()){
sendEvent(status.getText(), 5000)
}
lowestStatusId = Math.min(status.getId(), lowestStatusId)
}
query.setMaxId(lowestStatusId - 1)
}
} else {
System.out.println("Unsupported Data Source. Set 'dataSource' to \"twitter\" or \"test\"")
}
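The listing above calls a sendEvent helper (and, in the Twitter branch, a twitter client) that the full notebook defines earlier but that is not reproduced here; the test branch that loops over testSource is likewise omitted. A minimal sketch of the Event Hubs client and helper, under the assumption that the com.microsoft.azure.eventhubs client shown in the imports is attached to the cluster, is:

// Hypothetical sketch: build an Event Hubs client from connStr and define the
// sendEvent helper used in the loop above.
val pool = Executors.newScheduledThreadPool(1)
val eventHubClient = EventHubClient.create(connStr.toString(), pool)

def sendEvent(message: String, delay: Long) = {
  Thread.sleep(delay)
  val messageData = EventData.create(message.getBytes("UTF-8"))
  eventHubClient.get().send(messageData)
  System.out.println("Sent event: " + message + "\n")
}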
To run the notebook, press SHIFT + ENTER . You see an output like the snippet below. Each event in the output
is a tweet that is ingested into the Event Hubs containing the term "Azure".
Sent event: @Microsoft and @Esri launch Geospatial AI on Azure https://t.co/VmLUCiPm6q via @geoworldmedia
#geoai #azure #gis #ArtificialIntelligence
Sent event: Public preview of Java on App Service, built-in support for Tomcat and OpenJDK
https://t.co/7vs7cKtvah
#cloudcomputing #Azure
Sent event: 4 Killer #Azure Features for #Data #Performance https://t.co/kpIb7hFO2j by @RedPixie
Sent event: Migrate your databases to a fully managed service with Azure SQL Managed Instance | #Azure |
#Cloud https://t.co/sJHXN4trDk
Sent event: Top 10 Tricks to #Save Money with #Azure Virtual Machines https://t.co/F2wshBXdoz #Cloud
...
...
val customEventhubParameters =
EventHubsConf(connStr.toString())
.setMaxEventsPerTrigger(5)
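The snippet above only builds the Event Hubs configuration. The incomingStream referenced below is created by reading from that configuration with Structured Streaming; a sketch, assuming the azure-eventhubs-spark connector library is attached to the cluster, is:

// Read events from Event Hubs as a streaming DataFrame and write them to the
// console sink, which produces the schema and batch output shown below.
val incomingStream = spark.readStream
  .format("eventhubs")
  .options(customEventhubParameters.toMap)
  .load()

incomingStream.writeStream.outputMode("append").format("console").option("truncate", false).start()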
incomingStream.printSchema
root
|-- body: binary (nullable = true)
|-- offset: long (nullable = true)
|-- seqNumber: long (nullable = true)
|-- enqueuedTime: long (nullable = true)
|-- publisher: string (nullable = true)
|-- partitionKey: string (nullable = true)
-------------------------------------------
Batch: 0
-------------------------------------------
+------+------+--------------+---------------+---------+------------+
|body |offset|sequenceNumber|enqueuedTime |publisher|partitionKey|
+------+------+--------------+---------------+---------+------------+
|[50 75 62 6C 69 63 20 70 72 65 76 69 65 77 20 6F 66 20 4A 61 76 61 20 6F 6E 20 41 70 70 20 53 65 72 76 69
63 65 2C 20 62 75 69 6C 74 2D 69 6E 20 73 75 70 70 6F 72 74 20 66 6F 72 20 54 6F 6D 63 61 74 20 61 6E 64 20
4F 70 65 6E 4A 44 4B 0A 68 74 74 70 73 3A 2F 2F 74 2E 63 6F 2F 37 76 73 37 63 4B 74 76 61 68 20 0A 23 63 6C
6F 75 64 63 6F 6D 70 75 74 69 6E 67 20 23 41 7A 75 72 65] |0 |0
|2018-03-09 05:49:08.86 |null |null |
|[4D 69 67 72 61 74 65 20 79 6F 75 72 20 64 61 74 61 62 61 73 65 73 20 74 6F 20 61 20 66 75 6C 6C 79 20 6D
61 6E 61 67 65 64 20 73 65 72 76 69 63 65 20 77 69 74 68 20 41 7A 75 72 65 20 53 51 4C 20 44 61 74 61 62 61
73 65 20 4D 61 6E 61 67 65 64 20 49 6E 73 74 61 6E 63 65 20 7C 20 23 41 7A 75 72 65 20 7C 20 23 43 6C 6F 75
64 20 68 74 74 70 73 3A 2F 2F 74 2E 63 6F 2F 73 4A 48 58 4E 34 74 72 44 6B] |168 |1
|2018-03-09 05:49:24.752|null |null |
+------+------+--------------+---------------+---------+------------+
-------------------------------------------
Batch: 1
-------------------------------------------
...
...
Because the body of each event arrives in binary format, use the following snippet to convert it into a string.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
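// The select that builds the "messages" dataframe is not shown in this excerpt.
// A typical (assumed) version casts the binary body to a string and keeps a few
// useful columns, matching the schema printed below.
val messages = incomingStream
  .select(
    col("offset").cast(LongType).alias("Offset"),
    col("enqueuedTime").cast(TimestampType).alias("Time (readable)"),
    col("enqueuedTime").cast(LongType).alias("Timestamp"),
    col("body").cast(StringType).alias("Body"))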
messages.printSchema
messages.writeStream.outputMode("append").format("console").option("truncate",
false).start().awaitTermination()
root
|-- Offset: long (nullable = true)
|-- Time (readable): timestamp (nullable = true)
|-- Timestamp: long (nullable = true)
|-- Body: string (nullable = true)
-------------------------------------------
Batch: 0
-------------------------------------------
+------+-----------------+----------+-------+
|Offset|Time (readable) |Timestamp |Body
+------+-----------------+----------+-------+
|0 |2018-03-09 05:49:08.86 |1520574548|Public preview of Java on App Service, built-in support for
Tomcat and OpenJDK
https://t.co/7vs7cKtvah
#cloudcomputing #Azure |
|168 |2018-03-09 05:49:24.752|1520574564|Migrate your databases to a fully managed service with Azure SQL
Managed Instance | #Azure | #Cloud https://t.co/sJHXN4trDk |
|0 |2018-03-09 05:49:02.936|1520574542|@Microsoft and @Esri launch Geospatial AI on Azure
https://t.co/VmLUCiPm6q via @geoworldmedia #geoai #azure #gis #ArtificialIntelligence|
|176 |2018-03-09 05:49:20.801|1520574560|4 Killer #Azure Features for #Data #Performance
https://t.co/kpIb7hFO2j by @RedPixie |
+------+-----------------+----------+-------+
-------------------------------------------
Batch: 1
-------------------------------------------
...
...
That's it! Using Azure Databricks, you have successfully streamed data into Azure Event Hubs in near real-time.
You then consumed the stream data using the Event Hubs connector for Apache Spark. For more information on
how to use the Event Hubs connector for Spark, see the connector documentation.
Clean up resources
After you have finished running the tutorial, you can terminate the cluster. To do so, from the Azure Databricks
workspace, from the left pane, select Clusters . For the cluster you want to terminate, move the cursor over the
ellipsis under Actions column, and select the Terminate icon.
If you do not manually terminate the cluster, it will automatically stop, provided you selected the Terminate
after __ minutes of inactivity checkbox while creating the cluster. In such a case, the cluster will automatically
stop if it has been inactive for the specified time.
Next steps
In this tutorial, you learned how to:
Create an Azure Databricks workspace
Create a Spark cluster in Azure Databricks
Create a Twitter app to generate streaming data
Create notebooks in Azure Databricks
Add libraries for Event Hubs and Twitter API
Send tweets to Event Hubs
Read tweets from Event Hubs
Databricks runtimes
7/21/2022 • 2 minutes to read
Databricks runtimes are the set of core components that run on Azure Databricks clusters. Azure Databricks
offers several types of runtimes.
Databricks Runtime
Databricks Runtime includes Apache Spark but also adds a number of components and updates that
substantially improve the usability, performance, and security of big data analytics.
Databricks Runtime for Machine Learning
Databricks Runtime ML is a variant of Databricks Runtime that adds multiple popular machine learning
libraries, including TensorFlow, Keras, PyTorch, and XGBoost.
Photon runtime
Photon is the Azure Databricks native vectorized query engine that runs SQL workloads faster and
reduces your total cost per workload.
Databricks Light
Databricks Light provides a runtime option for jobs that don’t need the advanced performance, reliability,
or autoscaling benefits provided by Databricks Runtime.
Databricks Runtime for Genomics (Deprecated)
Databricks Runtime for Genomics is a variant of Databricks Runtime optimized for working with genomic
and biomedical data.
You can choose from among the supported runtime versions when you create a cluster.
For information about the contents of each runtime variant, see the release notes.
Databricks Runtime
7/21/2022 • 2 minutes to read
Databricks Runtime includes Apache Spark but also adds a number of components and updates that
substantially improve the usability, performance, and security of big data analytics:
Delta Lake, a next-generation storage layer built on top of Apache Spark that provides ACID transactions,
optimized layouts and indexes, and execution engine improvements for building data pipelines.
Installed Java, Scala, Python, and R libraries
Ubuntu and its accompanying system libraries
GPU libraries for GPU-enabled clusters
Databricks services that integrate with other components of the platform, such as notebooks, jobs, and
cluster manager
For information about the contents of each runtime version, see the release notes.
Runtime versioning
Databricks Runtime versions are released on a regular basis:
Major versions are represented by an increment to the version number that precedes the decimal point (the
jump from 3.5 to 4.0, for example). They are released when there are major changes, some of which may not
be backwards-compatible.
Feature versions are represented by an increment to the version number that follows the decimal point (the
jump from 3.4 to 3.5, for example). Each major release includes multiple feature releases. Feature releases are
always backwards compatible with previous releases within their major release.
Long Term Support versions are represented by an LTS qualifier (for example, 3.5 LTS). For each major
release, we declare a “canonical” feature version, for which we provide two full years of support. See
Databricks runtime support lifecycle for more information.
Databricks Runtime for Machine Learning
7/21/2022 • 3 minutes to read
Databricks Runtime for Machine Learning (Databricks Runtime ML) automates the creation of a cluster
optimized for machine learning. Databricks Runtime ML clusters include the most popular machine learning
libraries, such as TensorFlow, PyTorch, Keras, and XGBoost, and also include libraries required for distributed
training such as Horovod. Using Databricks Runtime ML speeds up cluster creation and ensures that the
installed library versions are compatible.
For complete information about using Azure Databricks for machine learning and deep learning, see Databricks
Machine Learning guide.
For information about the contents of each Databricks Runtime ML version, see the release notes.
Databricks Runtime ML is built on Databricks Runtime. For example, Databricks Runtime 7.3 LTS for Machine
Learning is built on Databricks Runtime 7.3 LTS. The libraries included in the base Databricks Runtime are listed
in the Databricks Runtime release notes.
If you select a GPU-enabled ML runtime, you are prompted to select a compatible Driver Type and Worker
Type . Incompatible instance types are grayed out in the drop-downs. GPU-enabled instance types are listed
under the GPU-Accelerated label.
IMPORTANT
Libraries in your workspace that automatically install into all clusters can conflict with the libraries included in
Databricks Runtime ML. Before you create a cluster with Databricks Runtime ML, clear the Install automatically on
all clusters checkbox for conflicting libraries. See the release notes for a list of libraries that are included with each
version of Databricks Runtime ML.
To access data in Unity Catalog for machine learning workflows, you must use a Single User cluster. User Isolation
clusters are not compatible with Databricks Runtime ML.
Databricks Runtime for Genomics
Databricks Runtime for Genomics (Databricks Runtime Genomics) is a version of Databricks Runtime optimized
for working with genomic and biomedical data. It is a component of the Azure Databricks Unified Analytics
Platform for Genomics. For more information on developing genomics applications, see Genomics guide.
NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.
Requirements
Your Azure Databricks workspace must have Databricks Runtime for Genomics enabled.
Databricks Light
Databricks Light is the Databricks packaging of the open source Apache Spark runtime. It provides a runtime
option for jobs that don’t need the advanced performance, reliability, or autoscaling benefits provided by
Databricks Runtime. In particular, Databricks Light does not support:
Delta Lake
Autopilot features such as autoscaling
Highly concurrent, all-purpose clusters
Notebooks, dashboards, and collaboration features
Connectors to various data sources and BI tools
Databricks Light is a runtime environment for jobs (or “automated workloads”). When you run jobs on
Databricks Light clusters, they are subject to lower Jobs Light Compute pricing. You can select Databricks Light
only when you create or schedule a JAR, Python, or spark-submit job and attach a cluster to that job; you cannot
use Databricks Light to run notebook jobs or interactive workloads.
Databricks Light can be used in the same workspace with clusters running on other Databricks runtimes and
pricing tiers. You don’t need to request a separate workspace to get started.
IMPORTANT
Support for Databricks Light on pool-backed job clusters is in Public Preview.
Photon runtime
7/21/2022 • 2 minutes to read
Photon is the native vectorized query engine on Azure Databricks, written to be directly compatible with Apache
Spark APIs so it works with your existing code. It is developed in C++ to take advantage of modern hardware,
and uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level
parallelism in CPUs, enhancing performance on real-world data and applications, all natively on your data
lake. Photon is part of a high-performance runtime that runs your existing SQL and DataFrame API calls faster
and reduces your total cost per workload. Photon is used by default in Databricks SQL warehouses.
To access Photon on Azure Databricks clusters you must explicitly select a runtime containing Photon when you
create the cluster, either using the UI or the APIs (Clusters API 2.0 and Jobs API 2.1, specifying spark_version
using the syntax <databricks-runtime-version>-photon-scala2.12 ). Photon is available for clusters running
Databricks Runtime 9.1 LTS and above.
Photon supports a limited set of instance types on the driver and worker nodes. Photon instance types consume
DBUs at a different rate than the same instance type running the non-Photon runtime. For more information
about Photon instances and DBU consumption, see the Azure Databricks pricing page.
Photon advantages
Supports SQL and equivalent DataFrame operations against Delta and Parquet tables.
Expected to accelerate queries that process a significant amount of data (100GB+) and include aggregations
and joins.
Faster performance when data is accessed repeatedly from the Delta cache.
More robust scan performance on tables with many columns and many small files.
Faster Delta and Parquet writing using UPDATE , DELETE , MERGE INTO , INSERT , and CREATE TABLE AS SELECT ,
especially for wide tables (hundreds to thousands of columns).
Replaces sort-merge joins with hash-joins.
Limitations
Works on Delta and Parquet tables only for both read and write.
Does not support window and sort operators.
Does not support Spark Structured Streaming.
Does not support UDFs.
Not expected to improve short-running queries (<2 seconds), for example, queries against small amounts of
data.
Features not supported by Photon run the same way they would with Databricks Runtime; there is no
performance advantage for those features.
Navigate the workspace
7/21/2022 • 2 minutes to read
An Azure Databricks workspace is an environment for accessing all of your Azure Databricks assets. The
workspace organizes objects (notebooks, libraries, and experiments) into folders, and provides access to data
and computational resources such as clusters and jobs.
You can manage the workspace using the workspace UI, the Databricks CLI, and the Databricks REST API
reference. Most of the articles in the Azure Databricks documentation focus on performing tasks using the
workspace UI.
To change the persona, click the icon below the Databricks logo , and select a persona.
To pin a persona so that it appears the next time you log in, click next to the persona. Click it again to
remove the pin.
Use Menu options at the bottom of the sidebar to set the sidebar mode to Auto (default behavior),
Expand , or Collapse .
When you open a machine learning-related page, the persona automatically switches to Machine
Learning .
Get help
To get help:
Clusters
Azure Databricks Data Science & Engineering and Databricks Machine Learning clusters provide a unified
platform for various use cases such as running production ETL pipelines, streaming analytics, ad-hoc analytics,
and machine learning. A cluster is a type of Azure Databricks compute resource. Other compute resource types
include Azure Databricks SQL warehouses.
For detailed information on managing and using clusters, see Clusters.
Notebooks
A notebook is a web-based interface to documents containing a series of runnable cells (commands) that
operate on files and tables, visualizations, and narrative text. Commands can be run in sequence, referring to the
output of one or more previously run commands.
Notebooks are one mechanism for running code in Azure Databricks. The other mechanism is jobs.
For detailed information on managing and using notebooks, see Notebooks.
Jobs
Jobs are one mechanism for running code in Azure Databricks. The other mechanism is notebooks.
For detailed information on managing and using jobs, see Jobs.
Libraries
A library makes third-party or locally-built code available to notebooks and jobs running on your clusters.
For detailed information on managing and using libraries, see Libraries.
Data
You can import data into a distributed file system mounted into an Azure Databricks workspace and work with it
in Azure Databricks notebooks and clusters. You can also use a wide variety of Apache Spark data sources to
access data.
For detailed information on loading data, see Ingest data into the Azure Databricks Lakehouse.
Repos
Repos are Azure Databricks folders whose contents are co-versioned together by syncing them to a remote Git
repository. Using an Azure Databricks repo, you can develop notebooks in Azure Databricks and use a remote Git
repository for collaboration and version control.
For detailed information on using repos, see Git integration with Databricks Repos.
Models
Model refers to a model registered in MLflow Model Registry. Model Registry is a centralized model store that
enables you to manage the full lifecycle of MLflow models. It provides chronological model lineage, model
versioning, stage transitions, and model and model version annotations and descriptions.
For detailed information on managing and using models, see MLflow Model Registry on Azure Databricks.
Experiments
An MLflow experiment is the primary unit of organization and access control for MLflow machine learning
model training runs; all MLflow runs belong to an experiment. Each experiment lets you visualize, search, and
compare runs, as well as download run artifacts or metadata for analysis in other tools.
For detailed information on managing and using experiments, see Experiments.
Work with workspace objects
7/21/2022 • 3 minutes to read
This article explains how to work with folders and other workspace objects.
Folders
Folders contain all static assets within a workspace: notebooks, libraries, experiments, and other folders. Icons
indicate the type of the object contained in a folder. Click a folder name to open or close the folder and view its
contents.
To perform an action on a folder, click the at the right side of a folder and select a menu item.
Special folders
An Azure Databricks workspace has three special folders: Workspace, Shared, and Users. You cannot rename or
move a special folder.
Workspace root folder
To navigate to the Workspace root folder:
1. Click Workspace .
If workspace access control is enabled, by default objects in this folder are private to that user.
NOTE
When you remove a user, the user’s home folder is retained.
To perform an action on a Workspace object, right-click the object or click the drop-down menu at the right side of the object.
From the drop-down menu you can:
If the object is a folder:
Create a notebook, library, MLflow experiment, or folder.
Import a Databricks archive.
Clone the object.
Rename the object.
Move the object to another folder.
Move the object to Trash. See Delete an object.
Export a folder or notebook as a Databricks archive.
If the object is a notebook, copy the notebook’s file path.
If you have Workspace access control enabled, set permissions on the object.
Search workspace for an object
IMPORTANT
This feature is in Public Preview.
NOTE
The search behavior described in this section is not supported on workspaces that use customer-managed keys for
encryption. In those workspaces, you can click Search in the sidebar and type a search string in the Search
Workspace field. As you type, objects whose name contains the search string are listed. Click a name from the list to
open that item in the workspace.
To search the workspace for an object, click Search in the sidebar. The Search dialog appears.
To search for a text string, type it into the search field and press Enter. The system searches the names of all
notebooks, folders, files, libraries, and Repos in the workspace that you have access to. It also searches notebook
commands, but not text in non-notebook files.
You can also search for items by type (file, folder, notebook, library, or repo). A text string is not required.
When you press Enter, workspace objects that match the search criteria appear in the dialog. Click a name from
the list to open that item in the workspace.
Access recently used objects
You can access recently used objects by clicking Recents in the sidebar or the Recents column on the
workspace landing page.
NOTE
The Recents list is cleared after deleting the browser cache and cookies.
Move an object
To move an object, you can drag-and-drop the object or click the or at the right side of the object and
select Move :
To move all the objects inside a folder to another folder, select the Move action on the source folder and select
the Move all items in ‘’ rather than the folder itself checkbox.
Delete an object
To delete a folder, notebook, library or experiment, click the or at the right side of the object and select
Move to Trash . The Trash folder is automatically emptied (purged) after 30 days .
You can permanently delete an object in the Trash by selecting the to the right of the object and selecting
Delete Immediately .
You can permanently delete all objects in the Trash by selecting the to the right of the Trash folder and
selecting Empty Trash .
Restore an object
You restore an object by dragging it from the Trash folder to another folder.
Get workspace, cluster, notebook, folder, model, and
job identifiers
7/21/2022 • 3 minutes to read
This article explains how to get workspace, cluster, directory, model, notebook, and job identifiers and URLs in
Azure Databricks.
You can get the per-workspace URL in the following ways:
In the Azure portal, by selecting the resource and noting the value in the URL field.
Using the Azure API. See Get a per-workspace URL using the Azure API.
Legacy regional URL
IMPORTANT
Avoid using legacy regional URLs. They may not work for new workspaces, are less reliable, and exhibit lower performance
than per-workspace URLs.
The legacy regional URL is composed of the region where the Azure Databricks workspace is deployed plus the
domain azuredatabricks.net , for example, https://westus.azuredatabricks.net/ .
If you log in to a legacy regional URL like https://westus.azuredatabricks.net/ , the instance name is
westus.azuredatabricks.net .
The workspace ID appears in the URL only after you have logged in using a legacy regional URL. It appears
after the o= . In the URL https://<databricks-instance>/?o=6280049833385130 , the workspace ID is
6280049833385130 .
https://<databricks-instance>/#/setting/clusters/<cluster-id>
In this notebook:
The notebook URL is:
https://westus.azuredatabricks.net/?o=6280049833385130#notebook/1940481404050342
https://westus.azuredatabricks.net/?o=6280049833385130#notebook/1940481404050342/command/2432220274659491
Folder ID
A folder is a directory used to store files that can be used in the Azure Databricks workspace. These files can be notebooks, libraries, or subfolders. There is a specific ID associated with each folder and each individual subfolder. The Permissions API refers to this ID as a directory_id, which is used when setting and updating permissions for a folder.
To retrieve the directory_id , use the Workspace API:
{
"object_type": "DIRECTORY",
"path": "/Users/me@example.com/MyFolder",
"object_id": 123456789012345
}
Model ID
A model refers to an MLflow registered model, which lets you manage MLflow Models in production through
stage transitions and versioning. The registered model ID is required for changing the permissions on the model
programmatically through the Permissions API 2.0.
To get the ID of a registered model, you can use the REST API (latest) endpoint
mlflow/databricks/registered-models/get . For example, the following code returns the registered model object
with its properties, including its ID:
{
"registered_model_databricks": {
"name":"model_name",
"id":"ceb0477eba94418e973f170e626f4471"
}
}
https://westus.azuredatabricks.net/?o=6280049833385130#job/1
In April 2020, Azure Databricks added a new unique per-workspace URL for each workspace. This per-
workspace URL has the format
adb-<workspace-id>.<random-number>.azuredatabricks.net
The per-workspace URL replaces the deprecated regional URL ( <region>.azuredatabricks.net ) to access
workspaces.
IMPORTANT
Avoid using legacy regional URLs. They may not work for new workspaces, are less reliable, and exhibit lower performance
than per-workspace URLs.
You create workspaces in one or more regions and have a list of <regional-url, api-token> pairs either
stored in the script itself or in a database. If this is the case, we recommend that you store the per-
workspace URL instead of the regional URL in the list.
NOTE
Because both regional URLs and per-workspace URLs are supported, any existing automation that uses regional URLs to
reference workspaces that were created before the introduction of per-workspace URLs will continue to work. Although
Databricks recommends that you update any automation to use per-workspace URLs, doing so is not required in this
case.
$ nslookup adb-<workspace-id>.<random-number>.azuredatabricks.net
Server: 192.168.50.1
Address: 192.168.50.1#53
Non-authoritative answer:
adb-<workspace-id>.<random-number>.azuredatabricks.net canonical name = eastus-c3.azuredatabricks.net.
Name: eastus-c3.azuredatabricks.net
Address: 20.42.4.211
DataFrames and Datasets
7/21/2022 • 2 minutes to read
This section gives an introduction to Apache Spark DataFrames and Datasets using Azure Databricks notebooks.
Introduction to DataFrames - Python
Create DataFrames
Work with DataFrames
DataFrame FAQs
Introduction to DataFrames - Scala
Create DataFrames
Work with DataFrames
Frequently asked questions (FAQ)
Introduction to Datasets
Create a Dataset
Work with Datasets
Convert a Dataset to a DataFrame
Complex and nested data
Complex nested data notebook
Aggregators
Dataset aggregator notebook
Dates and timestamps
Dates and calendars
Timestamps and time zones
Construct dates and timestamps
Collect dates and timestamps
For reference information about DataFrames and Datasets, Azure Databricks recommends the following Apache Spark API references:
Python API
Scala API
Java API
Introduction to DataFrames - Python
7/21/2022 • 11 minutes to read
This article provides several coding examples of common PySpark DataFrame APIs that use Python.
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can
think of a DataFrame like a spreadsheet, a SQL table, or a dictionary of series objects. For more information and
examples, see the Quickstart on the Apache Spark documentation website.
Create DataFrames
This example uses the Row class from Spark SQL to create several DataFrames. The contents of a few of these
DataFrames are then printed.
print(department1)
print(employee2)
print(departmentWithEmployees1.employees[0].email)
Output:
df1.show(truncate=False)
df2.show(truncate=False)
Output:
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
|department |employees
|
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
|{123456, Computer Science} |[{michael, armbrust, no-reply@berkeley.edu, 100000}, {xiangrui, meng, no-
reply@stanford.edu, 120000}]|
|{789012, Mechanical Engineering}|[{matei, null, no-reply@waterloo.edu, 140000}, {null, wendell, no-
reply@berkeley.edu, 160000}] |
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
+---------------------------+-------------------------------------------------------------------------------
-----------------+
|department |employees
|
+---------------------------+-------------------------------------------------------------------------------
-----------------+
|{345678, Theater and Drama}|[{michael, jackson, no-reply@neverla.nd, 80000}, {null, wendell, no-
reply@berkeley.edu, 160000}]|
|{901234, Indoor Recreation}|[{xiangrui, meng, no-reply@stanford.edu, 120000}, {matei, null, no-
reply@waterloo.edu, 140000}] |
+---------------------------+-------------------------------------------------------------------------------
-----------------+
unionDF = df1.union(df2)
unionDF.show(truncate=False)
Output:
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
|department |employees
|
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
|{123456, Computer Science} |[{michael, armbrust, no-reply@berkeley.edu, 100000}, {xiangrui, meng, no-
reply@stanford.edu, 120000}]|
|{789012, Mechanical Engineering}|[{matei, null, no-reply@waterloo.edu, 140000}, {null, wendell, no-
reply@berkeley.edu, 160000}] |
|{345678, Theater and Drama} |[{michael, jackson, no-reply@neverla.nd, 80000}, {null, wendell, no-
reply@berkeley.edu, 160000}] |
|{901234, Indoor Recreation} |[{xiangrui, meng, no-reply@stanford.edu, 120000}, {matei, null, no-
reply@waterloo.edu, 140000}] |
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
parquetDF = spark.read.format("parquet").load("/tmp/databricks-df-example.parquet")
parquetDF.show(truncate=False)
Output:
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
|department |employees
|
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
|{789012, Mechanical Engineering}|[{matei, null, no-reply@waterloo.edu, 140000}, {null, wendell, no-
reply@berkeley.edu, 160000}] |
|{901234, Indoor Recreation} |[{xiangrui, meng, no-reply@stanford.edu, 120000}, {matei, null, no-
reply@waterloo.edu, 140000}] |
|{345678, Theater and Drama} |[{michael, jackson, no-reply@neverla.nd, 80000}, {null, wendell, no-
reply@berkeley.edu, 160000}] |
|{123456, Computer Science} |[{michael, armbrust, no-reply@berkeley.edu, 100000}, {xiangrui, meng, no-
reply@stanford.edu, 120000}]|
+--------------------------------+--------------------------------------------------------------------------
---------------------------+
explodeDF = unionDF.select(explode("employees").alias("e"))
flattenDF = explodeDF.selectExpr("e.firstName", "e.lastName", "e.email", "e.salary")
flattenDF.show(truncate=False)
Output:
+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|michael |armbrust|no-reply@berkeley.edu|100000|
|xiangrui |meng |no-reply@stanford.edu|120000|
|matei |null |no-reply@waterloo.edu|140000|
|null |wendell |no-reply@berkeley.edu|160000|
|michael |jackson |no-reply@neverla.nd |80000 |
|null |wendell |no-reply@berkeley.edu|160000|
|xiangrui |meng |no-reply@stanford.edu|120000|
|matei |null |no-reply@waterloo.edu|140000|
+---------+--------+---------------------+------+
Output:
+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|xiangrui |meng |no-reply@stanford.edu|120000|
|xiangrui |meng |no-reply@stanford.edu|120000|
+---------+--------+---------------------+------+
This example is similar to the previous one, except that it displays only those rows where the firstName field’s
value is xiangrui or michael .
Output:
+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|michael |armbrust|no-reply@berkeley.edu|100000|
|michael |jackson |no-reply@neverla.nd |80000 |
|xiangrui |meng |no-reply@stanford.edu|120000|
|xiangrui |meng |no-reply@stanford.edu|120000|
+---------+--------+---------------------+------+
This example is equivalent to the preceding example, except that it uses the where method instead of the filter
method.
Output:
+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|michael |armbrust|no-reply@berkeley.edu|100000|
|michael |jackson |no-reply@neverla.nd |80000 |
|xiangrui |meng |no-reply@stanford.edu|120000|
|xiangrui |meng |no-reply@stanford.edu|120000|
+---------+--------+---------------------+------+
nonNullDF = flattenDF.fillna("--")
nonNullDF.show(truncate=False)
Before:
+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|michael |armbrust|no-reply@berkeley.edu|100000|
|xiangrui |meng |no-reply@stanford.edu|120000|
|matei |null |no-reply@waterloo.edu|140000|
|null |wendell |no-reply@berkeley.edu|160000|
|michael |jackson |no-reply@neverla.nd |80000 |
|null |wendell |no-reply@berkeley.edu|160000|
|xiangrui |meng |no-reply@stanford.edu|120000|
|matei |null |no-reply@waterloo.edu|140000|
+---------+--------+---------------------+------+
After:
+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|michael |armbrust|no-reply@berkeley.edu|100000|
|xiangrui |meng |no-reply@stanford.edu|120000|
|matei |-- |no-reply@waterloo.edu|140000|
|-- |wendell |no-reply@berkeley.edu|160000|
|michael |jackson |no-reply@neverla.nd |80000 |
|-- |wendell |no-reply@berkeley.edu|160000|
|xiangrui |meng |no-reply@stanford.edu|120000|
|matei |-- |no-reply@waterloo.edu|140000|
+---------+--------+---------------------+------+
This example uses the filter method of the previous flattenDF DataFrame along with the isNull method of the
Column class to display all rows where the firstName or lastName field has a null value.
Output:
+---------+--------+---------------------+------+
|firstName|lastName|email |salary|
+---------+--------+---------------------+------+
|null |wendell |no-reply@berkeley.edu|160000|
|null |wendell |no-reply@berkeley.edu|160000|
|matei |null |no-reply@waterloo.edu|140000|
|matei |null |no-reply@waterloo.edu|140000|
+---------+--------+---------------------+------+
This example uses the select, groupBy, and agg methods of the previous nonNullDF DataFrame to select only
the rows’ firstName and lastName fields, group the results by the firstName field’s values, and then display
the number of distinct lastName field values for each of those first names. For each first name, only one distinct
last name is found, except for michael , which has both michael armbrust and michael jackson .
countDistinctDF.show()
Output:
+---------+-------------------+
|firstName|distinct_last_names|
+---------+-------------------+
| null| 1|
| xiangrui| 1|
| matei| 0|
| michael| 2|
+---------+-------------------+
Compare the DataFrame and SQL query physical plans
This example uses the explain method of the preceding example’s DataFrame to print the results of the physical plan for debugging purposes.
TIP
They should be the same.
countDistinctDF.explain()
This example uses the createOrReplaceTempView method of the preceding example’s DataFrame to create a
local temporary view with this DataFrame. This temporary view exists until the related Spark session goes out of
scope. This example then uses the Spark session’s sql method to run a query on this temporary view. The
physical plan for this query is then displayed. The results of this explain call should be the same as the
previous explain call.
# Register the DataFrame as a temporary view so that we can query it by using SQL.
nonNullDF.createOrReplaceTempView("databricks_df_example")
# Perform the same query as the preceding DataFrame and then display its physical plan.
countDistinctDF_sql = spark.sql('''
SELECT firstName, count(distinct lastName) AS distinct_last_names
FROM databricks_df_example
GROUP BY firstName
''')
countDistinctDF_sql.explain()
Output:
+-----------+
|sum(salary)|
+-----------+
| 1020000|
+-----------+
This example displays the underlying data type of the salary field for the preceding DataFrame, which is a
bigint .
match = 'salary'
Output:
Data type of 'salary' is 'bigint'.
nonNullDF.describe("salary").show()
Output:
+-------+------------------+
|summary| salary|
+-------+------------------+
| count| 8|
| mean| 127500.0|
| stddev|28157.719063467175|
| min| 80000|
| max| 160000|
+-------+------------------+
import pandas as pd
import matplotlib.pyplot as plt
plt.clf()
pdDF = nonNullDF.toPandas()
pdDF.plot(x='firstName', y='salary', kind='bar', rot=45)
display()
Output:
dbutils.fs.rm("/tmp/databricks-df-example.parquet", True)
DataFrame FAQs
This FAQ addresses common use cases and example usage using the available APIs. For more detailed API
descriptions, see the PySpark documentation.
How can I get better performance with DataFrame UDFs?
If the functionality exists in the built-in functions, using these will perform better. Example usage follows. Also
see the PySpark Functions API reference. Use the built-in functions and the withColumn() API to add new
columns. You can also use withColumnRenamed() to replace an existing column after the transformation.
# Instead of registering a UDF, call the builtin functions to perform operations on the columns.
# This will provide a performance improvement as the builtins compile and run in the platform's JVM.
df.createOrReplaceTempView("sample_df")
display(sql("select * from sample_df"))
I want to convert the DataFrame back to JSON strings to send back to Kafka.
There is an underlying toJSON() function that returns an RDD of JSON strings using the column names and
schema to produce the JSON records.
rdd_json = df.toJSON()
rdd_json.take(2)
My UDF takes a parameter including the column to operate on. How do I pass this parameter?
There is a function available called lit() that creates a constant column.
from pyspark.sql import functions as F
# We register a UDF that adds a column to the DataFrame, and we cast the id column to an Integer type.
df = df.withColumn('id_offset', add_n(F.lit(1000), df.id.cast(IntegerType())))
display(df)
df_filtered = df.filter(last_n_days(df.date_diff))
display(df_filtered)
I have a table in the Hive metastore and I’d like to access to table as a DataFrame. What’s the best
way to define this?
There are multiple ways to define a DataFrame from a registered table. Call table(tableName) or select and filter
specific columns using an SQL query:
I’d like to clear all the cached tables on the current cluster.
There’s an API available to do this at a global level or per table.
spark.catalog.clearCache()
spark.catalog.cacheTable("sample_df")
spark.catalog.uncacheTable("sample_df")
I’d like to compute aggregates on columns. What’s the best way to do this?
The agg(*exprs) method takes a list of column names and expressions for the type of aggregation you’d like to
compute. See pyspark.sql.DataFrame.agg. You can use built-in functions in the expressions for each column.
# Provide the min, count, and avg and groupBy the location column. Display the results.
agg_df = df.groupBy("location").agg(F.min("id"), F.count("id"), F.avg("date_diff"))
display(agg_df)
I’d like to write out the DataFrames to Parquet, but would like to partition on a particular column.
You can use the following APIs to accomplish this. Ensure that the code does not create a large number of partition columns with the datasets; otherwise the overhead of the metadata can cause significant slowdowns. If there is a SQL table backed by this directory, you will need to run refresh table <table-name> to update the metadata prior to the query.
df = df.withColumn('end_month', F.month('end_date'))
df = df.withColumn('end_year', F.year('end_date'))
df.write.partitionBy("end_year", "end_month").format("parquet").load("/tmp/sample_table")
display(dbutils.fs.ls("/tmp/sample_table"))
How do I properly handle cases where I want to filter out NULL data?
You can use filter() and provide similar syntax as you would with a SQL query.
adult_df = spark.read.\
  format("csv").\
  option("header", "false").\
  option("inferSchema", "true").load("dbfs:/databricks-datasets/adult/adult.data")
adult_df.printSchema()
You have a delimited string dataset that you want to convert to the appropriate data types. How would you accomplish this?
Use the RDD APIs to filter out the malformed rows and map the values to the appropriate types. We define a
function that filters the items using regular expressions.
Introduction to DataFrames - Scala
7/21/2022 • 6 minutes to read
This article demonstrates a number of common Spark DataFrame functions using Scala.
Create DataFrames
// Create the case classes for our domain
case class Department(id: String, name: String)
case class Employee(firstName: String, lastName: String, email: String, salary: Int)
case class DepartmentWithEmployees(department: Department, employees: Seq[Employee])
import org.apache.spark.sql.functions._
+---------+--------+--------------------+------+
|firstName|lastName| email|salary|
+---------+--------+--------------------+------+
| matei| null|no-reply@waterloo...|140000|
| null| wendell|no-reply@princeto...|160000|
| michael|armbrust|no-reply@berkeley...|100000|
| xiangrui| meng|no-reply@stanford...|120000|
| michael| jackson| no-reply@neverla.nd| 80000|
| null| wendell|no-reply@princeto...|160000|
| xiangrui| meng|no-reply@stanford...|120000|
| matei| null|no-reply@waterloo...|140000|
+---------+--------+--------------------+------+
TIP
They should be the same.
countDistinctDF.explain()
// register the DataFrame as a temp view so that we can query it using SQL
nonNullDF.createOrReplaceTempView("databricks_df_example")
spark.sql("""
SELECT firstName, count(distinct lastName) as distinct_last_names
FROM databricks_df_example
GROUP BY firstName
""").explain
nonNullDF.describe("salary").show()
dbutils.fs.rm("/tmp/databricks-df-example.parquet", true)
val df_schema =
StructType(
header.split('|').map(fieldName => StructField(fieldName, StringType, true)))
// Instead of registering a UDF, call the builtin functions to perform operations on the columns.
// This will provide a performance improvement as the builtins compile and run in the platform's JVM.
df.createOrReplaceTempView("sample_df")
display(sql("select * from sample_df"))
I want to convert the DataFrame back to JSON strings to send back to Kafka.
There is a toJSON() function that returns an RDD of JSON strings using the column names and schema to
produce the JSON records.
val rdd_json = df.toJSON
rdd_json.take(2).foreach(println)
My UDF takes a parameter including the column to operate on. How do I pass this parameter?
There is a function available called lit() that creates a static column.
// We register a UDF that adds a column to the DataFrame, and we cast the id column to an Integer type.
df = df.withColumn("id_offset", add_n(lit(1000), col("id").cast("int")))
display(df)
I have a table in the Hive metastore and I’d like to access to table as a DataFrame. What’s the best
way to define this?
There are multiple ways to define a DataFrame from a registered table. Call table(tableName) or select and filter
specific columns using an SQL query:
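A minimal sketch of both approaches follows; <table-name>, <column-name>, and <value> are placeholders.

// Load the whole registered table as a DataFrame.
val tableDF = spark.table("<table-name>")

// Or select and filter specific columns with a SQL query.
val filteredDF = spark.sql("SELECT <column-name> FROM <table-name> WHERE <column-name> = '<value>'")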
I’d like to clear all the cached tables on the current cluster.
There’s an API available to do this at the global or per table level.
spark.catalog.clearCache()
spark.catalog.cacheTable("sample_df")
spark.catalog.uncacheTable("sample_df")
I’d like to compute aggregates on columns. What’s the best way to do this?
There’s an API named agg(*exprs) that takes a list of column names and expressions for the type of
aggregation you’d like to compute. You can leverage the built-in functions mentioned above as part of the
expressions for each column.
// Provide the min, count, and avg and groupBy the location column. Display the results.
var agg_df = df.groupBy("location").agg(min("id"), count("id"), avg("date_diff"))
display(agg_df)
I’d like to write out the DataFrames to Parquet, but would like to partition on a particular column.
You can use the following APIs to accomplish this. Ensure that the code does not create a large number of partition columns with the datasets; otherwise the overhead of the metadata can cause significant slowdowns. If there is a SQL table backed by this directory, you will need to run refresh table <table-name> to update the metadata prior to the query.
df = df.withColumn("end_month", month(col("end_date")))
df = df.withColumn("end_year", year(col("end_date")))
dbutils.fs.rm("/tmp/sample_table", true)
df.write.partitionBy("end_year", "end_month").format("parquet").load("/tmp/sample_table")
display(dbutils.fs.ls("/tmp/sample_table"))
How do I properly handle cases where I want to filter out NULL data?
You can use filter() and provide similar syntax as you would with a SQL query.
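For example, a minimal sketch (the column name is a placeholder):

// Keep only the rows whose <column-name> column is not null.
val nonNullRows = df.filter(col("<column-name>").isNotNull)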
You have a delimited string dataset that you want to convert to the appropriate data types. How would you accomplish this?
Use the RDD APIs to filter out the malformed rows and map the values to the appropriate types.
Introduction to Datasets
7/21/2022 • 3 minutes to read
The Datasets API provides the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. You can define a Dataset of JVM objects and then manipulate them using functional transformations ( map , flatMap , filter , and so on), similar to an RDD. The benefit is that, unlike with RDDs, these transformations are applied to a structured and strongly typed distributed collection, which allows Spark to leverage Spark SQL’s execution engine for optimization.
Create a Dataset
To convert a sequence to a Dataset, call .toDS() on the sequence.
If you have a sequence of case classes, calling .toDS() provides a Dataset with all the necessary fields.
You can also deal with tuples while converting a DataFrame to Dataset without using a case class .
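A minimal sketch of the three cases above, assuming spark.implicits._ is in scope (it is by default in Azure Databricks Scala notebooks), is:

// A sequence of primitives becomes a Dataset directly.
val numbers = Seq(1, 2, 3).toDS()

// A sequence of case classes yields a Dataset with all the fields of the case class.
case class Person(name: String, age: Int)
val people = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

// A DataFrame can be converted to a Dataset of tuples without a case class.
val pairs = people.toDF("name", "age").as[(String, Int)]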
val wordsDataset = sc.parallelize(Seq("Spark I am your father", "May the spark be with you", "Spark I am
your father")).toDS()
val groupedDataset = wordsDataset.flatMap(_.toLowerCase.split(" "))
.filter(_ != "")
.groupBy("value")
val countsDataset = groupedDataset.count()
countsDataset.show()
Join Datasets
The following example demonstrates how to:
Union multiple Datasets.
Do an inner join on a condition.
Group by a specific column.
Do a custom aggregation (average) on the grouped Dataset.
The example uses only the Datasets API to demonstrate all the operations available. In reality, using DataFrames for
doing aggregation would be simpler and faster than doing custom aggregation with mapGroups . The next
section covers the details of converting Datasets to DataFrames and using DataFrames API for doing
aggregations.
case class Employee(name: String, age: Int, departmentId: Int, salary: Double)
case class Department(id: Int, name: String)
case class Record(name: String, age: Int, salary: Double, departmentId: Int, departmentName: String)
case class ResultSet(departmentId: Int, departmentName: String, avgSalary: Double)
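A sketch of how these case classes might be used for the union, join, and custom aggregation described above (the sample data and intermediate variable names are assumptions):
val employeeDataSet1 = Seq(Employee("Max", 22, 1, 100000.0), Employee("Adam", 33, 2, 93000.0)).toDS()
val employeeDataSet2 = Seq(Employee("Eve", 35, 2, 89999.0), Employee("Muller", 39, 3, 120000.0)).toDS()
val departmentDataSet = Seq(Department(1, "Engineering"), Department(2, "Marketing"), Department(3, "Finance")).toDS()

// Union the two employee Datasets
val employeeDataset = employeeDataSet1.union(employeeDataSet2)

// Inner join on the department id and build Record objects
val joined = employeeDataset
  .joinWith(departmentDataSet, employeeDataset("departmentId") === departmentDataSet("id"), "inner")
  .map { case (emp, dept) => Record(emp.name, emp.age, emp.salary, emp.departmentId, dept.name) }

// Group by department and compute the average salary with mapGroups
val averageSalaryDataset = joined
  .groupByKey(r => (r.departmentId, r.departmentName))
  .mapGroups { case ((id, name), records) =>
    val recs = records.toSeq
    ResultSet(id, name, recs.map(_.salary).sum / recs.size)
  }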
averageSalaryDataset.show()
val wordsDataset = sc.parallelize(Seq("Spark I am your father", "May the spark be with you", "Spark I am
your father")).toDS()
val result = wordsDataset
.flatMap(_.split(" ")) // Split on whitespace
.filter(_ != "") // Filter empty words
.map(_.toLowerCase())
.toDF() // Convert to DataFrame to perform aggregation / sorting
.groupBy($"value") // Count number of occurrences of each word
.agg(count("*") as "numOccurances")
.orderBy($"numOccurances" desc) // Show most common words first
result.show()
Complex and nested data
7/21/2022 • 2 minutes to read
Here’s a notebook showing you how to work with complex and nested data.
The Date and Timestamp datatypes changed significantly in Databricks Runtime 7.0. This article describes:
The Date type and the associated calendar.
The Timestamp type and how it relates to time zones. It also explains the details of time zone offset resolution
and the subtle behavior changes in the new time API in Java 8, used by Databricks Runtime 7.0.
APIs to construct date and timestamp values.
Common pitfalls and best practices for collecting date and timestamp objects on the Apache Spark driver.
java.time.ZoneId.systemDefault
res0:java.time.ZoneId = America/Los_Angeles
java.time.ZoneId.of("America/Los_Angeles").getRules.getOffset(java.time.LocalDateTime.parse("1883-11-
10T00:00:00"))
Prior to November 18, 1883, time of day in North America was a local matter, and most cities and towns used
some form of local solar time, maintained by a well-known clock (on a church steeple, for example, or in a
jeweler’s window). That’s why you see such a strange time zone offset.
The example demonstrates that Java 8 functions are more precise and take into account historical data from
IANA TZDB. After switching to the Java 8 time API, Databricks Runtime 7.0 benefited from the improvement
automatically and became more precise in how it resolves time zone offsets.
Databricks Runtime 7.0 also switched to the Proleptic Gregorian calendar for the Timestamp type. The ISO
SQL:2016 standard declares the valid range for timestamps is from 0001-01-01 00:00:00 to
9999-12-31 23:59:59.999999 . Databricks Runtime 7.0 fully conforms to the standard and supports all
timestamps in this range. Compared to Databricks Runtime 6.x and below, note the following sub-ranges:
0001-01-01 00:00:00..1582-10-03 23:59:59.999999 . Databricks Runtime 6.x and below uses the Julian calendar
and doesn’t conform to the standard. Databricks Runtime 7.0 fixes the issue and applies the Proleptic
Gregorian calendar in internal operations on timestamps such as getting year, month, day, etc. Due to
different calendars, some dates that exist in Databricks Runtime 6.x and below don’t exist in Databricks
Runtime 7.0. For example, 1000-02-29 is not a valid date because 1000 isn’t a leap year in the Gregorian
calendar. Also, Databricks Runtime 6.x and below resolves time zone name to zone offsets incorrectly for this
timestamp range.
1582-10-04 00:00:00..1582-10-14 23:59:59.999999 . This is a valid range of local timestamps in Databricks
Runtime 7.0, in contrast to Databricks Runtime 6.x and below where such timestamps didn’t exist.
1582-10-15 00:00:00..1899-12-31 23:59:59.999999 . Databricks Runtime 7.0 resolves time zone offsets
correctly using historical data from IANA TZDB. Compared to Databricks Runtime 7.0, Databricks Runtime 6.x
and below might resolve zone offsets from time zone names incorrectly in some cases, as shown in the
preceding example.
1900-01-01 00:00:00..2036-12-31 23:59:59.999999 . Both Databricks Runtime 7.0 and Databricks Runtime 6.x
and below conform to the ANSI SQL standard and use Gregorian calendar in date-time operations such as
getting the day of the month.
2037-01-01 00:00:00..9999-12-31 23:59:59.999999 . Databricks Runtime 6.x and below can resolve time zone
offsets and daylight saving time offsets incorrectly. Databricks Runtime 7.0 does not.
One more aspect of mapping time zone names to offsets is overlapping of local timestamps that can happen
due to daylight savings time (DST) or switching to another standard time zone offset. For instance, on
November 3 2019, 02:00:00, most states in the USA turned clocks backwards 1 hour to 01:00:00. The local
timestamp 2019-11-03 01:30:00 America/Los_Angeles can be mapped either to 2019-11-03 01:30:00 UTC-08:00 or
2019-11-03 01:30:00 UTC-07:00 . If you don’t specify the offset and just set the time zone name (for example,
2019-11-03 01:30:00 America/Los_Angeles ), Databricks Runtime 7.0 takes the earlier offset, typically
corresponding to “summer”. The behavior diverges from Databricks Runtime 6.x and below which takes the
“winter” offset. In the case of a gap, where clocks jump forward, there is no valid offset. For a typical one-hour
daylight saving time change, Spark moves such timestamps to the next valid timestamp corresponding to
“summer” time.
As you can see from the preceding examples, the mapping of time zone names to offsets is ambiguous, and is
not one to one. When possible, we recommend specifying exact time zone offsets when constructing timestamps,
for example 2019-11-03 01:30:00 UTC-07:00 .
ANSI SQL and Spark SQL timestamps
The ANSI SQL standard defines two types of timestamps:
TIMESTAMP WITHOUT TIME ZONE or TIMESTAMP : Local timestamp as ( YEAR , MONTH , DAY , HOUR , MINUTE , SECOND
). These timestamps are not bound to any time zone, and are wall clock timestamps.
TIMESTAMP WITH TIME ZONE : Zoned timestamp as ( YEAR , MONTH , DAY , HOUR , MINUTE , SECOND , TIMEZONE_HOUR ,
TIMEZONE_MINUTE ). These timestamps represent an instant in the UTC time zone + a time zone offset (in hours
and minutes) associated with each value.
The time zone offset of a TIMESTAMP WITH TIME ZONE does not affect the physical point in time that the timestamp
represents, as that is fully represented by the UTC time instant given by the other timestamp components.
Instead, the time zone offset only affects the default behavior of a timestamp value for display, date/time
component extraction (for example, EXTRACT ), and other operations that require knowing a time zone, such as
adding months to a timestamp.
Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE , which is a combination of the fields
( YEAR , MONTH , DAY , HOUR , MINUTE , SECOND , SESSION TZ ) where the YEAR through SECOND fields identify a time
instant in the UTC time zone, and where SESSION TZ is taken from the SQL config spark.sql.session.timeZone.
The session time zone can be set as:
Zone offset (+|-)HH:mm . This form allows you to unambiguously define a physical point in time.
Time zone name in the form of region ID area/city , such as America/Los_Angeles . This form of time zone
info suffers from some of the problems described previously like overlapping of local timestamps. However,
each UTC time instant is unambiguously associated with one time zone offset for any region ID, and as a
result, each timestamp with a region ID based time zone can be unambiguously converted to a timestamp
with a zone offset. By default, the session time zone is set to the default time zone of the Java virtual
machine.
Spark TIMESTAMP WITH SESSION TIME ZONE is different from:
TIMESTAMP WITHOUT TIME ZONE , because a value of this type can map to multiple physical time instants, but any
value of TIMESTAMP WITH SESSION TIME ZONE is a concrete physical time instant. The SQL type can be emulated
by using one fixed time zone offset across all sessions, for instance UTC+0. In that case, you could consider
timestamps at UTC as local timestamps.
TIMESTAMP WITH TIME ZONE , because according to the SQL standard column values of the type can have
different time zone offsets. That is not supported by Spark SQL.
Note that timestamps associated with a global (session-scoped) time zone are not something newly invented
by Spark SQL. RDBMSs such as Oracle provide a similar type for timestamps:
TIMESTAMP WITH LOCAL TIME ZONE .
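For example, you can construct a DATE column with the make_date function; a Scala sketch whose input values are assumptions chosen to match the output shown below (the second row is the invalid 1000-02-29 date discussed earlier, which becomes null):
val df = Seq((2020, 6, 26), (1000, 2, 29), (-44, 1, 1)).toDF("Y", "M", "D")
  .selectExpr("make_date(Y, M, D) as date")
df.printSchema()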
root
|-- date: date (nullable = true)
To print DataFrame content, call the show() action, which converts dates to strings on executors and transfers
the strings to the driver to output them on the console:
df.show()
+-----------+
| date|
+-----------+
| 2020-06-26|
| null|
|-0044-01-01|
+-----------+
Similarly, you can construct timestamp values using the MAKE_TIMESTAMP functions. Like MAKE_DATE , it performs
the same validation for date fields, and additionally accepts time fields HOUR (0-23), MINUTE (0-59) and
SECOND (0-60). SECOND has the type Decimal(precision = 8, scale = 6) because seconds can be passed with
the fractional part up to microsecond precision. For example:
+----+-----+---+----+------+---------+
|YEAR|MONTH|DAY|HOUR|MINUTE| SECOND|
+----+-----+---+----+------+---------+
|2020| 6| 28| 10| 31|30.123456|
|1582| 10| 10| 0| 1| 2.0001|
|2019| 2| 29| 9| 29| 1.0|
+----+-----+---+----+------+---------+
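A Scala sketch of building the ts DataFrame from these fields (the source DataFrame df and its column names are assumptions matching the table above):
val ts = df.selectExpr(
  "make_timestamp(YEAR, MONTH, DAY, HOUR, MINUTE, SECOND) as MAKE_TIMESTAMP")
ts.printSchema()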
root
|-- MAKE_TIMESTAMP: timestamp (nullable = true)
As for dates, print the content of the ts DataFrame using the show() action. In a similar way, show() converts
timestamps to strings but now it takes into account the session time zone defined by the SQL config
spark.sql.session.timeZone .
ts.show(truncate=False)
+--------------------------+
|MAKE_TIMESTAMP |
+--------------------------+
|2020-06-28 10:31:30.123456|
|1582-10-10 00:01:02.0001 |
|null |
+--------------------------+
Spark cannot create the last timestamp because this date is not valid: 2019 is not a leap year.
You might notice that there is no time zone information in the preceding example. In that case, Spark takes a
time zone from the SQL configuration spark.sql.session.timeZone and applies it to function invocations. You
can also pick a different time zone by passing it as the last parameter of MAKE_TIMESTAMP . Here is an example:
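A Scala sketch of passing the time zone as the last argument and then formatting the result for display (the TZ column, the source DataFrame df, and the formatting pattern are assumptions):
val ts = df.selectExpr(
  "make_timestamp(YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, TZ) as MAKE_TIMESTAMP")
ts.selectExpr("date_format(MAKE_TIMESTAMP, 'yyyy-MM-dd HH:mm:ss VV') as TIMESTAMP_STRING")
  .show(false)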
+---------------------------------+
|TIMESTAMP_STRING |
+---------------------------------+
|2020-06-28 13:31:00 Europe/Moscow|
|1582-10-10 10:24:00 Europe/Moscow|
|2019-02-28 09:29:00 Europe/Moscow|
+---------------------------------+
As the example demonstrates, Spark takes into account the specified time zones but adjusts all local timestamps
to the session time zone. The original time zones passed to the MAKE_TIMESTAMP function are lost because the
TIMESTAMP WITH SESSION TIME ZONE type assumes that all values belong to one time zone, and it doesn’t even
store a time zone per every value. According to the definition of the TIMESTAMP WITH SESSION TIME ZONE , Spark
stores local timestamps in the UTC time zone, and uses the session time zone while extracting date-time fields or
converting the timestamps to strings.
Also, timestamps can be constructed from the LONG type using casting. If a LONG column contains the number
of seconds since the epoch 1970-01-01 00:00:00Z, it can be cast to a Spark SQL TIMESTAMP :
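For example, a minimal Scala sketch (the epoch value is illustrative):
Seq(1593093133L).toDF("seconds")
  .select(col("seconds").cast("timestamp").as("ts"))
  .show(false)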
Unfortunately, this approach doesn’t allow you to specify the fractional part of seconds.
Another way is to construct dates and timestamps from values of the STRING type. You can make literals using
special keywords:
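For example, a minimal sketch using the DATE and TIMESTAMP keywords in SQL:
spark.sql("SELECT DATE '2020-06-28' AS date, TIMESTAMP '2020-06-28 22:17:33.123456' AS ts").show(false)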
Alternatively, you can use casting that you can apply for all values in a column:
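For example, a minimal sketch (the input string is illustrative):
Seq("2020-06-28 22:17:33.123456").toDF("ts_string")
  .select(col("ts_string").cast("timestamp").as("ts"))
  .show(false)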
The input timestamp strings are interpreted as local timestamps in the specified time zone or in the session time
zone if a time zone is omitted in the input string. Strings with unusual patterns can be converted to timestamp
using the to_timestamp() function. The supported patterns are described in Datetime Patterns for Formatting
and Parsing:
For example:
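A Scala sketch of such a conversion (the input string and pattern are illustrative):
Seq("28/6/2020 22.17.33").toDF("ts_string")
  .select(to_timestamp(col("ts_string"), "d/M/yyyy HH.mm.ss").as("ts"))
  .show(false)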
Spark allows you to create Datasets from existing collections of external objects at the driver side and create
columns of corresponding types. Spark converts instances of external types to semantically equivalent internal
representations. For example, to create a Dataset with DATE and TIMESTAMP columns from Python collections,
you can use:
import datetime
df = spark.createDataFrame([(datetime.datetime(2020, 7, 1, 0, 0, 0), datetime.date(2020, 7, 1))],
['timestamp', 'date'])
df.show()
+-------------------+----------+
| timestamp| date|
+-------------------+----------+
|2020-07-01 00:00:00|2020-07-01|
+-------------------+----------+
PySpark converts Python’s date-time objects to internal Spark SQL representations at the driver side using the
system time zone, which can be different from Spark’s session time zone setting spark.sql.session.timeZone .
The internal values don’t contain information about the original time zone. Future operations over the
parallelized date and timestamp values take into account only Spark SQL sessions time zone according to the
TIMESTAMP WITH SESSION TIME ZONE type definition.
In a similar way, Spark recognizes the following types as external date-time types in Java and Scala APIs:
java.sql.Date and java.time.LocalDate as external types for the DATE type
java.sql.Timestamp and java.time.Instant for the TIMESTAMP type.
There is a difference between java.sql.* and java.time.* types. java.time.LocalDate and java.time.Instant
were added in Java 8, and the types are based on the Proleptic Gregorian calendar–the same calendar that is
used by Databricks Runtime 7.0 and above. java.sql.Date and java.sql.Timestamp have another calendar
underneath–the hybrid calendar (Julian + Gregorian since 1582-10-15), which is the same as the legacy
calendar used by Databricks Runtime 6.x and below. Due to different calendar systems, Spark has to perform
additional operations during conversions to internal Spark SQL representations, and rebase input
dates/timestamps from one calendar to another. The rebase operation has little overhead for modern
timestamps after the year 1900, and it can be more significant for old timestamps.
The following example shows how to make timestamps from Scala collections. The first example constructs a
java.sql.Timestamp object from a string. The valueOf method interprets the input strings as a local timestamp
in the default JVM time zone which can be different from Spark’s session time zone. If you need to construct
instances of java.sql.Timestamp or java.sql.Date in a specific time zone, have a look at
java.text.SimpleDateFormat (and its method setTimeZone ) or java.util.Calendar.
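A sketch matching the output below (the input values are assumptions; the second value is the epoch, displayed here in a +03:00 session time zone):
Seq(java.sql.Timestamp.valueOf("2020-06-29 22:41:30"), new java.sql.Timestamp(0)).toDF("ts").show(false)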
+-------------------+
|ts |
+-------------------+
|2020-06-29 22:41:30|
|1970-01-01 03:00:00|
+-------------------+
Seq(java.time.Instant.ofEpochSecond(-12219261484L), java.time.Instant.EPOCH).toDF("ts").show
+-------------------+
| ts|
+-------------------+
|1582-10-15 11:12:13|
|1970-01-01 03:00:00|
+-------------------+
Similarly, you can make a DATE column from collections of java.sql.Date or java.time.LocalDate .
Parallelization of java.time.LocalDate instances is fully independent of either Spark’s session or JVM default time
zones, but the same is not true for parallelization of java.sql.Date instances. There are nuances:
1. java.sql.Date instances represent local dates at the default JVM time zone on the driver.
2. For correct conversions to Spark SQL values, the default JVM time zone on the driver and executors must be
the same.
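For example, a sketch using java.time.LocalDate values (the dates are assumptions; LocalDate.now produced 2020-06-29 when the output below was captured):
Seq(java.time.LocalDate.of(2020, 2, 29), java.time.LocalDate.now).toDF("date").show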
+----------+
| date|
+----------+
|2020-02-29|
|2020-06-29|
+----------+
To avoid any calendar and time zone related issues, we recommend Java 8 types java.time.LocalDate / Instant
as external types in parallelization of Java/Scala collections of timestamps or dates.
Spark transfers internal values of dates and timestamps columns as time instants in the UTC time zone from
executors to the driver, and performs conversions to Python datetime objects in the system time zone at the
driver, not using Spark SQL session time zone. collect() is different from the show() action described in the
previous section. show() uses the session time zone while converting timestamps to strings, and collects the
resulting strings on the driver.
In Java and Scala APIs, Spark performs the following conversions by default:
Spark SQL DATE values are converted to instances of java.sql.Date .
Spark SQL TIMESTAMP values are converted to instances of java.sql.Timestamp .
Both conversions are performed in the default JVM time zone on the driver. Therefore, to get the same date-
time fields from Date.getDay() , getHour() , and so on, as from the Spark SQL functions DAY and HOUR , the
default JVM time zone on the driver and the session time zone on the executors should be the same.
Similarly to making dates/timestamps from java.sql.Date / Timestamp , Databricks Runtime 7.0 performs
rebasing from the Proleptic Gregorian calendar to the hybrid calendar (Julian + Gregorian). This operation is
almost free for modern dates (after the year 1582) and timestamps (after the year 1900), but it could bring
some overhead for ancient dates and timestamps.
You can avoid such calendar-related issues by asking Spark to return java.time types, which were added in
Java 8. If you set the SQL config spark.sql.datetime.java8API.enabled to true, the Dataset.collect() action
returns:
java.time.LocalDate for Spark SQL DATE type
java.time.Instant for Spark SQL TIMESTAMP type
Now the conversions don’t suffer from the calendar-related issues because Java 8 types and Databricks Runtime
7.0 and above are both based on the Proleptic Gregorian calendar. The collect() action doesn’t depend on the
default JVM time zone. The timestamp conversions don’t depend on time zone at all. Date conversions use the
session time zone from the SQL config spark.sql.session.timeZone . For example, consider a Dataset with
DATE and TIMESTAMP columns, with the default JVM time zone set to Europe/Moscow and the session time
zone set to America/Los_Angeles .
java.util.TimeZone.getDefault
spark.conf.get("spark.sql.session.timeZone")
df.show
+-------------------+----------+
| timestamp| date|
+-------------------+----------+
|2020-07-01 00:00:00|2020-07-01|
+-------------------+----------+
The show() action prints the timestamp at the session time America/Los_Angeles , but if you collect the Dataset ,
it is converted to java.sql.Timestamp and the toString method prints Europe/Moscow :
df.collect()
df.collect()(0).getAs[java.sql.Timestamp](0).toString
Actually, the local timestamp 2020-07-01 00:00:00 is 2020-07-01T07:00:00Z at UTC. You can observe that if you
enable Java 8 API and collect the Dataset:
df.collect()
You can convert a java.time.Instant object to any local timestamp independently from the global JVM time
zone. This is one of the advantages of java.time.Instant over java.sql.Timestamp . The latter requires
changing the global JVM setting, which influences other timestamps on the same JVM. Therefore, if your
applications process dates or timestamps in different time zones, and the applications should not clash with
each other while collecting data to the driver using Java or Scala Dataset.collect() API, we recommend
switching to Java 8 API using the SQL config spark.sql.datetime.java8API.enabled .
What is Apache Spark Structured Streaming?
7/21/2022 • 2 minutes to read
Apache Spark Structured Streaming is a near-real time processing engine that offers end-to-end fault tolerance
with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express
computation on streaming data in the same way you express a batch computation on static data. The Structured
Streaming engine performs the computation incrementally and continuously updates the result as streaming
data arrives. For an overview of Structured Streaming, see the Apache Spark Structured Streaming
Programming Guide.
Examples
For introductory notebooks and notebooks demonstrating example use cases, see Examples for working with
Structured Streaming on Azure Databricks.
API reference
For reference information about Structured Streaming, Azure Databricks recommends the following Apache
Spark API reference:
Python
Scala
Java
Production considerations for Structured Streaming
applications on Azure Databricks
7/21/2022 • 2 minutes to read
You can easily configure production incremental processing workloads with Structured Streaming on Azure
Databricks to fulfill latency and cost requirements for real-time or batch applications. Understanding key
concepts of Structured Streaming on Azure Databricks can help you avoid common pitfalls as you scale up the
volume and velocity of data and move from development to production.
Azure Databricks has introduced Delta Live Tables to reduce the complexities of managing production
infrastructure for Structured Streaming workloads. Databricks recommends using Delta Live Tables for new
Structured Streaming pipelines; see Delta Live Tables.
The intermediate state information required for stateful Structured Streaming queries can lead to unexpected
latency and production problems if not configured properly.
Optimize performance of stateful Structured Streaming queries on Azure Databricks
Configure RocksDB state store on Azure Databricks
Enable asynchronous state checkpointing for Structured Streaming
Control late data threshold for Structured Streaming with multiple watermark policy
Specify initial state for Structured Streaming mapGroupsWithState
Test state update function for Structured Streaming mapGroupsWithState
Recover from Structured Streaming query failures
7/21/2022 • 5 minutes to read
Structured Streaming provides fault-tolerance and data consistency for streaming queries; using Azure
Databricks workflows, you can easily configure your Structured Streaming queries to automatically restart on
failure. By enabling checkpointing for a streaming query, you can restart the query after a failure. The restarted
query continues where the failed one left off.
streamingDataFrame.writeStream
.format("parquet")
.option("path", "/path/to/table")
.option("checkpointLocation", "/path/to/table/_checkpoint")
.start()
This checkpoint location preserves all of the essential information that identifies a query. Each query must have
a different checkpoint location. Multiple queries should never have the same location. For more information, see
the Structured Streaming Programming Guide.
NOTE
While checkpointLocation is required for most types of output sinks, some sinks, such as memory sink, may
automatically generate a temporary checkpoint location when you do not provide checkpointLocation . These
temporary checkpoint locations do not ensure any fault tolerance or data consistency guarantees and may not get
cleaned up properly. Avoid potential pitfalls by always specifying a checkpointLocation .
WARNING
Notebook workflows are not supported with long-running jobs. Therefore we don’t recommend using notebook
workflows in your streaming jobs.
NOTE
Failure in any of the active streaming queries causes the active run to fail and terminate all the other streaming
queries.
You do not need to use streamingQuery.awaitTermination() or spark.streams.awaitAnyTermination() at the
end of your notebook. Jobs automatically prevent a run from completing when a streaming query is active.
spark.readStream.format("kafka").option("subscribe", "article")
to
spark.readStream.format("kafka").option("subscribe", "article").option("maxOffsetsPerTrigger",
...)
Changes to subscribed topics and files are generally not allowed as the results are unpredictable:
spark.readStream.format("kafka").option("subscribe", "article") to
spark.readStream.format("kafka").option("subscribe", "newarticle")
Changes in the type of output sink : Changes between a few specific combinations of sinks are allowed.
This needs to be verified on a case-by-case basis. Here are a few examples.
File sink to Kafka sink is allowed. Kafka will see only the new data.
Kafka sink to file sink is not allowed.
Kafka sink changed to foreach, or vice versa is allowed.
Changes in the parameters of output sink : Whether this is allowed and whether the semantics of the
change are well-defined depends on the sink and the query. Here are a few examples.
Changes to output directory of a file sink is not allowed:
sdf.writeStream.format("parquet").option("path", "/somePath") to
sdf.writeStream.format("parquet").option("path", "/anotherPath")
Changes to the output topic are allowed:
sdf.writeStream.format("kafka").option("topic", "somearticle") to
sdf.writeStream.format("kafka").option("topic", "anotherarticle")
Changes to the user-defined foreach sink (that is, the ForeachWriter code) is allowed, but the
semantics of the change depends on the code.
Changes in projection / filter / map-like operations : Some cases are allowed. For example:
Addition / deletion of filters is allowed: sdf.selectExpr("a") to
sdf.where(...).selectExpr("a").filter(...) .
Changes in projections with same output schema is allowed:
sdf.selectExpr("stringColumn AS json").writeStream to
sdf.select(to_json(...).as("json")).writeStream .
Changes in projections with different output schema are conditionally allowed:
sdf.selectExpr("a").writeStream to sdf.selectExpr("b").writeStream is allowed only if the output
sink allows the schema change from "a" to "b" .
Changes in stateful operations : Some operations in streaming queries need to maintain state data in
order to continuously update the result. Structured Streaming automatically checkpoints the state data to
fault-tolerant storage (for example, DBFS, Azure Blob storage) and restores it after restart. However, this
assumes that the schema of the state data remains same across restarts. This means that any changes (that
is, additions, deletions, or schema modifications) to the stateful operations of a streaming query are not
allowed between restarts. Here is the list of stateful operations whose schema should not be changed
between restarts in order to ensure state recovery:
Streaming aggregation : For example, sdf.groupBy("a").agg(...) . Any change in number or type of
grouping keys or aggregates is not allowed.
Streaming deduplication : For example, sdf.dropDuplicates("a") . Any change in the number or type of
deduplication columns is not allowed.
Stream-stream join : For example, sdf1.join(sdf2, ...) (i.e. both inputs are generated with
sparkSession.readStream ). Changes in the schema or equi-joining columns are not allowed. Changes
in join type (outer or inner) not allowed. Other changes in the join condition are ill-defined.
Arbitrary stateful operation : For example, sdf.groupByKey(...).mapGroupsWithState(...) or
sdf.groupByKey(...).flatMapGroupsWithState(...) . Any change to the schema of the user-defined state
and the type of timeout is not allowed. Any change within the user-defined state-mapping function is
allowed, but the semantic effect of the change depends on the user-defined logic. If you really want to
support state schema changes, then you can explicitly encode/decode your complex state data
structures into bytes using an encoding/decoding scheme that supports schema migration. For
example, if you save your state as Avro-encoded bytes, then you can change the Avro-state-schema
between query restarts as this restores the binary state.
Monitoring Structured Streaming queries on Azure
Databricks
7/21/2022 • 3 minutes to read
Azure Databricks provides built-in monitoring for Structured Streaming applications through the Spark UI under
the Streaming tab.
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class MyListener extends StreamingQueryListener {
/**
* Called when a query is started.
* @note This is called synchronously with
* [[org.apache.spark.sql.streaming.DataStreamWriter `DataStreamWriter.start()`]].
* `onQueryStart` calls on all listeners before
* `DataStreamWriter.start()` returns the corresponding [[StreamingQuery]].
* Do not block this method, as it blocks your query.
*/
def onQueryStarted(event: QueryStartedEvent): Unit = {}
/**
* Called when there is some status update (ingestion rate updated, etc.)
*
* @note This method is asynchronous. The status in [[StreamingQuery]] returns the
* latest status, regardless of when this method is called. The status of [[StreamingQuery]]
* may change before or when you process the event. For example, you may find [[StreamingQuery]]
* terminates when processing `QueryProgressEvent`.
*/
def onQueryProgress(event: QueryProgressEvent): Unit = {}
/**
* Called when a query is stopped, with or without error.
*/
def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
}
Python
class MyListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        """
        Called when a query is started.

        Parameters
        ----------
        event: :class:`pyspark.sql.streaming.listener.QueryStartedEvent`
            The properties are available as the same as Scala API.

        Notes
        -----
        This is called synchronously with
        meth:`pyspark.sql.streaming.DataStreamWriter.start`,
        that is, ``onQueryStart`` will be called on all listeners before
        ``DataStreamWriter.start()`` returns the corresponding
        :class:`pyspark.sql.streaming.StreamingQuery`.
        Do not block in this method as it will block your query.
        """
        pass

    def onQueryProgress(self, event):
        """
        Called when there is some status update (ingestion rate updated, etc.)

        Parameters
        ----------
        event: :class:`pyspark.sql.streaming.listener.QueryProgressEvent`
            The properties are available as the same as Scala API.

        Notes
        -----
        This method is asynchronous. The status in
        :class:`pyspark.sql.streaming.StreamingQuery` returns the
        most recent status, regardless of when this method is called. The status
        of :class:`pyspark.sql.streaming.StreamingQuery` may change before or
        when you process the event. For example, you may find
        :class:`StreamingQuery` terminates when processing `QueryProgressEvent`.
        """
        pass

    def onQueryTerminated(self, event):
        """
        Called when a query is stopped, with or without error.

        Parameters
        ----------
        event: :class:`pyspark.sql.streaming.listener.QueryTerminatedEvent`
            The properties are available as the same as Scala API.
        """
        pass
my_listener = MyListener()
// Observe row count (rc) and error row count (erc) in the streaming Dataset
val observed_ds = ds.observe("my_event", count(lit(1)).as("rc"), count($"error").as("erc"))
observed_ds.writeStream.format("...").start()
Python
# Observe metric
observed_df = df.observe("metric", count(lit(1)).alias("cnt"), count(col("error")).alias("malformed"))
observed_df.writeStream.format("...").start()
# Define my listener.
class MyListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        print(f"'{event.name}' [{event.id}] got started!")
    def onQueryProgress(self, event):
        row = event.progress.observedMetrics.get("metric")
        if row is not None:
            if row.malformed / row.cnt > 0.5:
                print("ALERT! Ouch! there are too many malformed "
                      f"records {row.malformed} out of {row.cnt}!")
            else:
                print(f"{row.cnt} rows processed!")
    def onQueryTerminated(self, event):
        print(f"{event.id} got terminated!")
# Add my listener.
spark.streams.addListener(MyListener())
Configure scheduler pools for multiple Structured
Streaming workloads on a cluster
7/21/2022 • 2 minutes to read
To enable multiple streaming queries to execute jobs concurrently on a shared cluster, you can configure queries
to execute in separate scheduler pools.
NOTE
The local property configuration must be in the same notebook cell where you start your streaming query.
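For example, a sketch that runs two queries in separate pools (the pool names, query names, and output paths are illustrative):
// Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("delta").start("/path/to/table1")

// Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("delta").start("/path/to/table2")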
Limiting the input rate for Structured Streaming queries helps to maintain a consistent batch size and prevents
large batches from leading to spill and cascading micro-batch processing delays.
Azure Databricks provides the same options to control Structured Streaming batch sizes for both Delta Lake and
Auto Loader.
Apache Spark Structured Streaming processes data incrementally; controlling the trigger interval for batch
processing allows you to use Structured Streaming for workloads including near-real time processing,
refreshing databases every 5 minutes or once per hour, or batch processing all new data for a day or week.
Because Databricks Auto Loader uses Structured Streaming to load data, understanding how triggers work
provides you with the greatest flexibility to control costs while ingesting data with the desired frequency.
When you specify a trigger interval that is too small (less than tens of seconds), the system may perform
unnecessary checks to see if new data arrives. Configure your processing time to balance latency requirements
and the rate that data arrives in the source.
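For example, a sketch of a query that processes new data every five minutes (the sink format, output path, and checkpoint location are placeholders):
import org.apache.spark.sql.streaming.Trigger

df.writeStream
  .format("delta")
  .option("checkpointLocation", "/path/to/checkpoint")
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .start("/path/to/output")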
Managing the intermediate state information of stateful Structured Streaming queries can help prevent
unexpected latency and production problems.
NOTE
The state management scheme cannot be changed between query restarts. That is, if a query has been started with the
default management, then it cannot be changed without starting the query from scratch with a new checkpoint location.
Configure RocksDB state store on Azure Databricks
7/21/2022 • 2 minutes to read
You can enable RocksDB-based state management by setting the following configuration in the SparkSession
before starting the streaming query.
spark.conf.set(
"spark.sql.streaming.stateStore.providerClass",
"com.databricks.sql.streaming.state.RocksDBStateStoreProvider")
rocksdbCommitWriteBatchLatency: Time (in millis) taken to apply the staged writes in the in-memory structure (WriteBatch) to native RocksDB.
rocksdbCommitFlushLatency: Time (in millis) taken to flush the RocksDB in-memory changes to local disk.
rocksdbCommitCompactLatency: Time (in millis) taken for compaction (optional) during the checkpoint commit.
rocksdbCommitPauseLatency: Time (in millis) taken to stop the background worker threads (for compaction and so on) as part of the checkpoint commit.
rocksdbCommitCheckpointLatency: Time (in millis) taken to take a snapshot of native RocksDB and write it to a local directory.
rocksdbCommitFileSyncLatencyMs: Time (in millis) taken to sync the native RocksDB snapshot-related files to external storage (the checkpoint location).
rocksdbGetLatency: Average time (in nanos) taken per underlying native RocksDB::Get call.
rocksdbPutCount: Average time (in nanos) taken per underlying native RocksDB::Put call.
rocksdbReadBlockCacheMissCount: Number of times the native RocksDB block cache missed and required reading data from local disk.
rocksdbTotalBytesReadByCompaction: Number of bytes read from local disk by the native RocksDB compaction process.
rocksdbWriterStallLatencyMs: Time (in millis) the writer has stalled due to a background compaction or flushing of the memtables to disk.
NOTE
Available in Databricks Runtime 10.3 and above.
For stateful streaming queries bottlenecked on state updates, enabling asynchronous state checkpointing can
reduce end-to-end latencies without sacrificing any fault-tolerance guarantees, but with a minor cost of higher
restart delays.
Structured Streaming uses synchronous checkpointing by default. Every micro-batch ensures that all the state
updates in that batch are backed up in cloud storage (called “checkpoint location”) before starting the next batch.
If a stateful streaming query fails, all micro-batches except the last micro-batch are checkpointed. On restart,
only the last batch needs to be re-run. Fast recovery with synchronous checkpointing comes at the cost of
higher latency for each micro-batch.
Asynchronous state checkpointing attempts to perform the checkpointing asynchronously so that the micro-
batch execution doesn’t have to wait for the checkpoint to complete. In other words, the next micro-batch can
start as soon as the computation of the previous micro-batch has been completed. Internally, however, the offset
metadata (also saved in the checkpoint location) tracks whether the state checkpointing has been completed for
a micro-batch. On query restart, more than one micro-batch may need to be re-executed - the last micro-batch
whose computation was incomplete, as well as the one micro-batch before it whose state checkpointing was
incomplete. And you get the same fault-tolerance guarantees (that is, exactly-once guarantees with an
idempotent sink) as that of synchronous checkpointing.
{
"id" : "2e3495a2-de2c-4a6a-9a8e-f6d4c4796f19",
"runId" : "e36e9d7e-d2b1-4a43-b0b3-e875e767e1fe",
"...",
"batchId" : 0,
"durationMs" : {
"...",
"triggerExecution" : 547730,
"..."
},
"stateOperators" : [ {
"...",
"commitTimeMs" : 3186626,
"numShufflePartitions" : 64,
"..."
}]
}
spark.conf.set(
"spark.databricks.streaming.statefulOperator.asyncCheckpoint.enabled",
"true"
)
spark.conf.set(
"spark.sql.streaming.stateStore.providerClass",
"com.databricks.sql.streaming.state.RocksDBStateStoreProvider"
)
When working with multiple Structured Streaming inputs, you can set multiple watermarks to control tolerance
thresholds for late-arriving data. Configuring watermarks allows you to control state information and impacts
latency.
A streaming query can have multiple input streams that are unioned or joined together. Each of the input
streams can have a different threshold of late data that needs to be tolerated for stateful operations. Specify
these thresholds using withWatermark("eventTime", delay) on each of the input streams. The following is an
example query with stream-stream joins.
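A sketch of such a query (the stream names, column names, and delays are assumptions):
import org.apache.spark.sql.functions.expr

val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))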
While running the query, Structured Streaming individually tracks the maximum event time seen in each input
stream, calculates watermarks based on the corresponding delay, and chooses a single global watermark with
them to be used for stateful operations. By default, the minimum is chosen as the global watermark because it
ensures that no data is accidentally dropped as too late if one of the streams falls behind the others (for
example, one of the streams stops receiving data due to upstream failures). In other words, the global watermark
safely moves at the pace of the slowest stream and the query output is delayed accordingly.
If you want to get faster results, you can set the multiple watermark policy to choose the maximum value as the
global watermark by setting the SQL configuration spark.sql.streaming.multipleWatermarkPolicy to max
(default is min ). This lets the global watermark move at the pace of the fastest stream. However, this
configuration drops data from the slowest streams. Because of this, we recommend that you use this
configuration judiciously.
Specify initial state for Structured Streaming
mapGroupsWithState
7/21/2022 • 2 minutes to read
You can specify a user defined initial state for Structured Streaming stateful processing using
flatMapGroupsWithState or mapGroupsWithState . This allows you to avoid reprocessing data when starting a
stateful stream without a valid checkpoint.
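The initial state and the RunningCount class referenced in the examples below might be defined like this (a sketch; the sample fruits and counts are illustrative):
case class RunningCount(count: Long)

val fruitCountInitial = Seq(("apple", RunningCount(1)), ("orange", RunningCount(2)))
  .toDS()
  .groupByKey(_._1)
  .mapValues(_._2)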
Example use case that specifies an initial state to the flatMapGroupsWithState operator:
fruitStream
.groupByKey(x => x)
.flatMapGroupsWithState(Update, GroupStateTimeout.NoTimeout, fruitCountInitial)(fruitCountFunc)
Example use case that specifies an initial state to the mapGroupsWithState operator:
val fruitCountFunc =(key: String, values: Iterator[String], state: GroupState[RunningCount]) => {
val count = state.getOption.map(_.count).getOrElse(0L) + values.size
state.update(new RunningCount(count))
(key, count.toString)
}
fruitStream
.groupByKey(x => x)
.mapGroupsWithState(GroupStateTimeout.NoTimeout, fruitCountInitial)(fruitCountFunc)
Test state update function for Structured Streaming
mapGroupsWithState
7/21/2022 • 2 minutes to read
The TestGroupState API enables you to test the state update function used for
Dataset.groupByKey(...).mapGroupsWithState(...) and Dataset.groupByKey(...).flatMapGroupsWithState(...) .
The state update function takes the previous state as input using an object of type GroupState . See the Apache
Spark GroupState reference documentation. For example:
import org.apache.spark.sql.streaming._
import org.apache.spark.api.java.Optional

// `prevState` is a TestGroupState instance created with TestGroupState.create(...)
// (construction omitted here).
assert(!prevState.hasUpdated)

// ... invoke the state update function, which calls prevState.update(...) ...

assert(prevState.hasUpdated)
Working with pub/sub and message queues on
Azure Databricks
7/21/2022 • 2 minutes to read
Azure Databricks can integrate with stream messaging services for near-real time data ingestion into the
Databricks Lakehouse. It can also sync enriched and transformed data in the lakehouse with other streaming
systems.
Ingesting streaming messages to Delta Lake allows you to retain messages indefinitely, allowing you to replay
data streams without fear of losing data due to retention thresholds.
Azure Databricks has specific features for working with semi-structured data fields contained in Avro and JSON
data payloads. To learn more, see:
Read and write streaming Avro data
To learn more about specific configurations for streaming from or to message queues, see:
Apache Kafka
Azure Event Hubs
Apache Kafka
7/21/2022 • 6 minutes to read
The Apache Kafka connectors for Structured Streaming are packaged in Databricks Runtime. You use the kafka
connector to connect to Kafka 0.10+ and the kafka08 connector to connect to Kafka 0.8+ (deprecated).
Schema
The schema of the records is:
COLUMN TYPE
key binary
value binary
topic string
partition int
offset long
timestamp long
timestampType int
The key and the value are always deserialized as byte arrays with the ByteArrayDeserializer . Use DataFrame
operations ( cast("string") , udfs) to explicitly deserialize the keys and values.
Quickstart
Let’s start with the canonical WordCount example. The following notebook demonstrates how to run
WordCount using Structured Streaming with Kafka.
NOTE
This notebook example uses Kafka 0.10. To use Kafka 0.8, change the format to kafka08 (that is, .format("kafka08") ).
Configuration
For the comprehensive list of configuration options, see the Spark Structured Streaming + Kafka Integration
Guide. To get you started, here is a subset of the most common configuration options.
NOTE
As Structured Streaming is still under development, this list may not be up to date.
There are multiple ways of specifying which topics to subscribe to. You should provide only one of these
parameters:
OPTION | VALUE | DEFAULT VALUE | SUPPORTED KAFKA VERSIONS | DESCRIPTION
Concurrently running queries (both batch and streaming) with the same group ID are likely to interfere with each
other, causing each query to read only part of the data. This may also occur when queries are started or restarted
in quick succession. To minimize such issues, set the Kafka consumer configuration session.timeout.ms to be very
small.
See Structured Streaming Kafka Integration Guide for other optional configurations.
IMPORTANT
You should not set the following Kafka parameters for the Kafka 0.10 connector as it will throw an exception:
group.id : Setting this parameter is not allowed for Spark versions below 2.2.
auto.offset.reset : Instead, set the source option startingOffsets to specify where to start. To maintain
consistency, Structured Streaming (as opposed to the Kafka Consumer) manages the consumption of offsets internally.
This ensures that you don’t miss any data after dynamically subscribing to new topics/partitions. startingOffsets
applies only when you start a new Streaming query, and that resuming from a checkpoint always picks up from where
the query left off.
key.deserializer : Keys are always deserialized as byte arrays with ByteArrayDeserializer . Use DataFrame
operations to explicitly deserialize the keys.
value.deserializer : Values are always deserialized as byte arrays with ByteArrayDeserializer . Use DataFrame
operations to explicitly deserialize the values.
enable.auto.commit : Setting this parameter is not allowed. Spark keeps track of Kafka offsets internally and doesn’t
commit any offset.
interceptor.classes : Kafka source always reads keys and values as byte arrays. It’s not safe to use
ConsumerInterceptor as it may break the query.
NOTE
Available in Databricks Runtime 8.1 and above.
You can get the average, min, and max of the number of offsets that the streaming query is behind the latest
available offset among all the subscribed topics with the avgOffsetsBehindLatest , maxOffsetsBehindLatest , and
minOffsetsBehindLatest metrics. See Reading Metrics Interactively.
NOTE
Available in Databricks Runtime 9.1 and above.
Get the estimated total number of bytes that the query process has not consumed from the subscribed topics by
examining the value of estimatedTotalBytesBehindLatest . This estimate is based on the batches that were
processed in the last 300 seconds. The timeframe that the estimate is based on can be changed by setting the
option bytesEstimateWindowLength to a different value. For example, to set it to 10 minutes:
df = spark.readStream \
.format("kafka") \
.option("bytesEstimateWindowLength", "10m") // m for minutes, you can also use "600s" for 600 seconds
If you are running the stream in a notebook, you can see these metrics under the Raw Data tab in the
streaming query progress dashboard:
{
"sources" : [ {
"description" : "KafkaV2[Subscribe[topic]]",
"metrics" : {
"avgOffsetsBehindLatest" : "4.0",
"maxOffsetsBehindLatest" : "4",
"minOffsetsBehindLatest" : "4",
"estimatedTotalBytesBehindLatest" : "80.0"
},
} ]
}
Use SSL
To enable SSL connections to Kafka, follow the instructions in the Confluent documentation Encryption and
Authentication with SSL. You can provide the configurations described there, prefixed with kafka. , as options.
For example, you specify the trust store location in the property kafka.ssl.truststore.location .
We recommend that you:
Store your certificates in Azure Blob storage or Azure Data Lake Storage Gen2 and access them through a
DBFS mount point. Combined with cluster and job ACLs, you can restrict access to the certificates only to
clusters that can access Kafka.
Store your certificate passwords as secrets in a secret scope.
Once paths are mounted and secrets stored, you can do the following:
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", ...) \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.ssl.truststore.location", <dbfs-truststore-location>) \
.option("kafka.ssl.keystore.location", <dbfs-keystore-location>) \
.option("kafka.ssl.keystore.password", dbutils.secrets.get(scope=<certificate-scope-name>,key=<keystore-
password-key-name>)) \
.option("kafka.ssl.truststore.password", dbutils.secrets.get(scope=<certificate-scope-name>,key=
<truststore-password-key-name>))
Resources
Real-Time End-to-End Integration with Apache Kafka in Apache Spark Structured Streaming
Azure Event Hubs
7/21/2022 • 4 minutes to read
Azure Event Hubs is a hyper-scale telemetry ingestion service that collects, transforms, and stores millions of
events. As a distributed streaming platform, it gives you low latency and configurable time retention, which
enables you to ingress massive amounts of telemetry into the cloud and read the data from multiple
applications using publish-subscribe semantics.
This article explains how to use Structured Streaming with Azure Event Hubs and Azure Databricks clusters.
Requirements
For current release support, see “Latest Releases” in the Azure Event Hubs Spark Connector project readme file.
1. Create a library in your Azure Databricks workspace using the Maven coordinate
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.17 .
NOTE
This connector is updated regularly, and a more recent version may be available: we recommend that you pull the
latest connector from the Maven repository
Schema
The schema of the records is:
COLUMN TYPE
body binary
partition string
offset string
sequenceNumber long
enqueuedTime timestamp
publisher string
partitionKey string
properties map[string,json]
The body is always provided as a byte array. Use cast("string") to explicitly deserialize the body column.
Quick Start
Let’s start with a quick example: WordCount. The following notebook is all that it takes to run WordCount using
Structured Streaming with Azure Event Hubs.
Azure Event Hubs WordCount with Structured Streaming notebook
Get notebook
Configuration
This section discusses the configuration settings you need to work with Event Hubs.
For detailed guidance on configuring Structured Streaming with Azure Event Hubs, see the Structured Streaming
and Azure Event Hubs Integration Guide developed by Microsoft.
For detailed guidance on using Structured Streaming, see What is Apache Spark Structured Streaming?.
Connection string
An Event Hubs connection string is required to connect to the Event Hubs service. You can get the connection
string for your Event Hubs instance from the Azure portal or by using the ConnectionStringBuilder in the
library.
Azure portal
When you get the connection string from the Azure portal, it may or may not have the EntityPath key.
Consider:
To connect to your EventHubs, an EntityPath must be present. If your connection string doesn’t have one, don’t
worry. This will take care of it:
import org.apache.spark.eventhubs.ConnectionStringBuilder
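A sketch of adding the EntityPath by setting the event hub name on the connection string from the portal (placeholders shown in angle brackets):
val connectionString = ConnectionStringBuilder("<connection-string-from-portal>")
  .setEventHubName("<event-hub-name>")
  .build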
ConnectionStringBuilder
Alternatively, you can use the ConnectionStringBuilder to make your connection string.
import org.apache.spark.eventhubs.ConnectionStringBuilder
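For example, a sketch that builds a connection string from its parts (all values are placeholders):
val connectionString = ConnectionStringBuilder()
  .setNamespaceName("<namespace-name>")
  .setEventHubName("<event-hub-name>")
  .setSasKeyName("<key-name>")
  .setSasKey("<key>")
  .build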
EventHubsConf
All configuration relating to Event Hubs happens in your EventHubsConf . To create an EventHubsConf , you must
pass a connection string:
val connectionString = "<event-hub-connection-string>"
val eventHubsConf = EventHubsConf(connectionString)
See Connection String for more information about obtaining a valid connection string.
For a complete list of configurations, see EventHubsConf. Here is a subset of configurations to get you started:
startingPosition ( EventPosition , default: start of stream, query types: streaming and batch): The starting position
for your Structured Streaming job. See startingPositions for information about the order in which options are read.
For each option, there exists a corresponding setting in EventHubsConf . For example:
import org.apache.spark.eventhubs._
val cs = "<your-connection-string>"
val eventHubsConf = EventHubsConf(cs)
.setConsumerGroup("sample-cg")
.setMaxEventsPerTrigger(10000)
EventPosition
EventHubsConf allows users to specify starting (and ending) positions with the EventPosition class.
EventPosition defines the position of an event in an Event Hub partition. The position can be an enqueued time,
offset, sequence number, the start of the stream, or the end of the stream.
import org.apache.spark.eventhubs._
If you would like to start (or end) at a specific position, simply create the correct EventPosition and set it in your
EventHubsConf :
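For example, a sketch that starts reading from the end of the stream:
val connectionString = "<event-hub-connection-string>"
val eventHubsConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEndOfStream)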
Apache Avro is a commonly used data serialization system in the streaming world. A typical solution is to put
data in Avro format in Apache Kafka, metadata in Confluent Schema Registry, and then run queries with a
streaming framework that connects to both Kafka and Schema Registry.
Azure Databricks supports the from_avro and to_avro functions to build streaming pipelines with Avro data in
Kafka and metadata in Schema Registry. The function to_avro encodes a column as binary in Avro format and
from_avro decodes Avro binary data into a column. Both functions transform one column to another column,
and the input/output SQL data type can be a complex type or a primitive type.
NOTE
The from_avro and to_avro functions:
Are available in Python, Scala, and Java.
Can be passed to SQL functions in both batch and streaming queries.
Basic example
Similar to from_json and to_json, you can use from_avro and to_avro with any binary column, but you must
specify the Avro schema manually.
import org.apache.spark.sql.avro.functions._
import org.apache.avro.SchemaBuilder
// When reading the key and value of a Kafka topic, decode the
// binary (Avro) data into structured data.
// The schema of the resulting DataFrame is: <key: string, value: int>
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t")
.load()
.select(
from_avro($"key", SchemaBuilder.builder().stringType()).as("key"),
from_avro($"value", SchemaBuilder.builder().intType()).as("value"))
{
"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
output = df\
.select(from_avro("value", jsonFormatSchema).alias("user"))\
.where('user.favorite_color == "red"')\
.select(to_avro("user.name").alias("value"))
NOTE
Integration with Schema Registry is available only in Scala and Java.
import org.apache.spark.sql.avro.functions._
// Read a Kafka topic "t", assuming the key and value are already
// registered in Schema Registry as subjects "t-key" and "t-value" of type
// string and int. The binary key and value columns are turned into string
// and int type with Avro and Schema Registry. The schema of the resulting DataFrame
// is: <key: string, value: int>.
val schemaRegistryAddr = "https://myhost:8081"
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t")
.load()
.select(
from_avro($"key", "t-key", schemaRegistryAddr).as("key"),
from_avro($"value", "t-value", schemaRegistryAddr).as("value"))
For to_avro , the default output Avro schema might not match the schema of the target subject in the Schema
Registry service for the following reasons:
The mapping from Spark SQL type to Avro schema is not one-to-one. See Supported types for Spark SQL ->
Avro conversion.
If the converted output Avro schema is of record type, the record name is topLevelRecord and there is no
namespace by default.
If the default output schema of to_avro matches the schema of the target subject, you can do the following:
Otherwise, you must provide the schema of the target subject in the to_avro function:
This article contains notebooks and code samples for common patterns for working with Structured Streaming on
Azure Databricks.
import com.datastax.spark.connector.cql.CassandraConnectorConf
import com.datastax.spark.connector.rdd.ReadConf
import com.datastax.spark.connector._
spark.setCassandraConf(clusterName, CassandraConnectorConf.ConnectionHostParam.option(host))
spark.readStream.format("rate").load()
.selectExpr("value % 10 as key")
.groupBy("key")
.count()
.toDF("key", "value")
.writeStream
.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
spark.conf.set("spark.sql.shuffle.partitions", "1")
query = (
spark.readStream.format("rate").load()
.selectExpr("value % 10 as key")
.groupBy("key")
.count()
.toDF("key", "count")
.writeStream
.foreachBatch(writeToSQLWarehouse)
.outputMode("update")
.start()
)
Stream-Stream joins
These two notebooks show how to use stream-stream joins in Python and Scala.
Stream-Stream joins Python notebook
Get notebook
Stream-Stream joins Scala notebook
Get notebook
Perform streaming writes to arbitrary data sinks with
Structured Streaming and foreachBatch
7/21/2022 • 4 minutes to read
Structured Streaming APIs provide two ways to write the output of a streaming query to data sources that do
not have an existing streaming sink: foreachBatch() and foreach() .
NOTE
If you are running multiple Spark jobs on the batchDF , the input data rate of the streaming query (reported through
StreamingQueryProgress and visible in the notebook rate graph) may be reported as a multiple of the actual rate at
which data is generated at the source. This is because the input data may be read multiple times in the multiple Spark
jobs per batch.
datasetOfString.writeStream.foreach(
new ForeachWriter[String] {
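A fuller sketch of such a writer (the method bodies are placeholders):
datasetOfString.writeStream.foreach(
  new ForeachWriter[String] {
    def open(partitionId: Long, epochId: Long): Boolean = {
      // Open a connection to the external system; return true to process this partition
      true
    }
    def process(record: String): Unit = {
      // Write the record to the connection
    }
    def close(errorOrNull: Throwable): Unit = {
      // Close the connection
    }
  }
).start()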
Using Python
In Python, you can invoke foreach in two ways: in a function or in an object. The function offers a simple way to
express your processing logic but does not allow you to deduplicate generated data when failures cause
reprocessing of some input data. For that situation you must specify the processing logic in an object.
The function takes a row as input.
def processRow(row):
    # Write row to storage
    pass

query = streamingDF.writeStream.foreach(processRow).start()
The object has a process method and optional open and close methods:
class ForeachWriter:
    def open(self, partition_id, epoch_id):
        # Open connection. This method is optional in Python.
        pass
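    # The process and close methods described above might look like this (a sketch):
    def process(self, row):
        # Write row to connection. This method is not optional in Python.
        pass

    def close(self, error):
        # Close the connection. This method is optional in Python.
        pass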
query = streamingDF.writeStream.foreach(ForeachWriter()).start()
Execution semantics
When the streaming query is started, Spark calls the function or the object’s methods in the following way:
A single copy of this object is responsible for all the data generated by a single task in a query. In other
words, one instance is responsible for processing one partition of the data generated in a distributed
manner.
This object must be serializable, because each task will get a fresh serialized-deserialized copy of the
provided object. Hence, it is strongly recommended that any initialization for writing data (for example,
opening a connection or starting a transaction) is done after you call the open() method, which signifies
that the task is ready to generate data.
The lifecycle of the methods are as follows:
For each partition with partition_id :
For each batch/epoch of streaming data with epoch_id :
Method open(partitionId, epochId) is called.
If open(...) returns true, for each row in the partition and batch/epoch, method process(row) is called.
Method close(error) is called with error (if any) seen while processing rows.
The close() method (if it exists) is called if an open() method exists and returns successfully
(irrespective of the return value), except if the JVM or Python process crashes in the middle.
NOTE
The partitionId and epochId in the open() method can be used to deduplicate generated data when failures cause
reprocessing of some input data. This depends on the execution mode of the query. If the streaming query is being
executed in the micro-batch mode, then every partition represented by a unique tuple (partition_id, epoch_id) is
guaranteed to have the same data. Hence, (partition_id, epoch_id) can be used to deduplicate and/or
transactionally commit data and achieve exactly-once guarantees. However, if the streaming query is being executed in
the continuous mode, then this guarantee does not hold and therefore should not be used for deduplication.
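As a rough sketch of this pattern (the helper functions below are hypothetical and stand in for whatever commit tracking your sink supports; streamingDF is assumed to be defined as in the examples above), a Python ForeachWriter can skip a partition whose (partition_id, epoch_id) pair has already been committed:
class IdempotentWriter:
    def open(self, partition_id, epoch_id):
        self.partition_id, self.epoch_id = partition_id, epoch_id
        # Hypothetical helper: returns True if this (partition, epoch) was already written
        return not already_committed(partition_id, epoch_id)
    def process(self, row):
        stage_row(row)  # hypothetical helper: buffer the row for the external sink
    def close(self, error):
        if error is None:
            commit_staged_rows(self.partition_id, self.epoch_id)  # hypothetical helper

query = streamingDF.writeStream.foreach(IdempotentWriter()).start()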
Delta Lake tables
7/21/2022 • 2 minutes to read
Delta tables can be both sources and sinks for streaming queries. For more information, see the Delta Lake
streaming guide.
(Deprecated) Azure Blob storage file source with
Azure Queue Storage
7/21/2022 • 3 minutes to read
IMPORTANT
The Databricks ABS-AQS connector is deprecated. Databricks recommends using Auto Loader instead.
The ABS-AQS connector provides an optimized file source that uses Azure Queue Storage (AQS) to find new
files written to an Azure Blob storage (ABS) container without repeatedly listing all of the files. This provides two
advantages:
Lower latency: no need to list nested directory structures on ABS, which is slow and resource intensive.
Lower costs: no more costly LIST API requests made to ABS.
NOTE
The ABS-AQS source deletes messages from the AQS queue as it consumes events. If you want other pipelines to
consume messages from this queue, set up a separate AQS queue for the optimized reader. You can set up multiple Event
Grid Subscriptions to publish to different queues.
spark.readStream \
.format("abs-aqs") \
.option("fileFormat", "json") \
.option("queueName", ...) \
.option("connectionString", ...) \
.schema(...) \
.load()
NOTE
We strongly recommend that you use Secrets for providing your connection strings.
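For example, a sketch that reads the connection string from a secret scope with dbutils.secrets.get instead of hard-coding it (the scope, key, and queue names are placeholders, and the schema is assumed to be defined elsewhere):
connection_string = dbutils.secrets.get(scope="my-scope", key="aqs-connection-string")

df = (spark.readStream
  .format("abs-aqs")
  .option("fileFormat", "json")
  .option("queueName", "my-queue")
  .option("connectionString", connection_string)
  .schema(schema)
  .load())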
Configuration
fileFormat (String, required, no default): The format of the files, such as parquet, json, csv, text, and so on.
If you observe a lot of messages in the driver logs that look like "Fetched 0 new events and 3 old events.", where
you tend to observe a lot more old events than new, you should reduce the trigger interval of your stream.
If you are consuming files from a location on Blob storage where you expect that some files may be deleted
before they can be processed, you can set the following configuration to ignore the error and continue
processing:
spark.sql("SET spark.sql.files.ignoreMissingFiles=true")
An Azure Databricks cluster is a set of computation resources and configurations on which you run data
engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics,
ad-hoc analytics, and machine learning.
You run these workloads as a set of commands in a notebook or as an automated job. Azure Databricks makes a
distinction between all-purpose clusters and job clusters. You use all-purpose clusters to analyze data
collaboratively using interactive notebooks. You use job clusters to run fast and robust automated jobs.
You can create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart
an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
The Azure Databricks job scheduler creates a job cluster when you run a job on a new job cluster and
terminates the cluster when the job is complete. You cannot restart a job cluster.
This section describes how to work with clusters using the UI. For other methods, see Clusters CLI and Clusters
API 2.0.
This section also focuses more on all-purpose than job clusters, although many of the configurations and
management tools described apply equally to both cluster types. To learn more about creating job clusters, see
Jobs.
IMPORTANT
Azure Databricks retains cluster configuration information for up to 200 all-purpose clusters terminated in the last 30
days and up to 30 job clusters recently terminated by the job scheduler. To keep an all-purpose cluster configuration even
after it has been terminated for more than 30 days, an administrator can pin a cluster to the cluster list.
In this section:
Create a cluster
Use the Create button
Use the cluster UI
Terraform integration
Manage clusters
Display clusters
Pin a cluster
View a cluster configuration as a JSON file
Edit a cluster
Clone a cluster
Control access to clusters
Start a cluster
Terminate a cluster
Delete a cluster
Restart a cluster to update it with the latest images
View cluster information in the Apache Spark UI
View cluster logs
Monitor performance
Decommission spot instances
Configure clusters
Cluster policy
Cluster mode
Pools
Databricks Runtime
Cluster node type
Cluster size and autoscaling
Autoscaling local storage
Local disk encryption
Security mode
Spark configuration
Retrieve a Spark configuration property from a secret
Environment variables
Cluster tags
SSH access to clusters
Cluster log delivery
Init scripts
Best practices: Cluster configuration
Cluster features
Cluster sizing considerations
Common scenarios
Task preemption
Preemption options
Customize containers with Databricks Container Services
Requirements
Step 1: Build your base
Step 2: Push your base image
Step 3: Launch your cluster
Use an init script
Cluster node initialization scripts
Init script types
Init script execution order
Environment variables
Logging
Cluster-scoped init scripts
Global init scripts
GPU-enabled clusters
Overview
Create a GPU cluster
GPU scheduling
NVIDIA GPU driver, CUDA, and cuDNN
Databricks Container Services on GPU clusters
Single Node clusters
Create a Single Node cluster
Single Node cluster properties
Limitations
REST API
Single Node cluster policy
Single Node job cluster policy
Pools
Display pools
Create a pool
Configure pools
Edit a pool
Delete a pool
Attach a cluster to one or more pools
Best practices: pools
Web terminal
Requirements
Launch the web terminal
Limitations
Debugging with the Apache Spark UI
Spark UI
Driver logs
Executor logs
Create a cluster
7/21/2022 • 2 minutes to read
NOTE
You must have permission to create a cluster. See Configure cluster creation entitlement.
1. Click Create in the sidebar and select Cluster from the menu. The Create Cluster page appears.
2. Name and configure the cluster.
There are many cluster configuration options, which are described in detail in cluster configuration.
3. Click the Create Cluster button.
The cluster Configuration tab displays a spinning progress indicator while the cluster is in a pending
state. When the cluster has started and is ready to use, the progress spinner turns into a green circle with
a check mark. This indicates that the cluster is in the running state, and you can now attach notebooks and
start running commands and queries.
Terraform integration
You can manage clusters in a fully automated setup using the Databricks Terraform provider and the databricks_cluster resource:
data "databricks_node_type" "smallest" {
local_disk = true
}
This article describes how to manage Azure Databricks clusters, including displaying, editing, starting,
terminating, deleting, controlling access, and monitoring performance and logs.
Display clusters
To display the clusters in your workspace, click Compute in the sidebar.
The Compute page displays clusters in two tabs: All-purpose clusters and Job clusters .
At the left side are two columns indicating if the cluster has been pinned and the status of the cluster:
Pinned
Starting, Terminating
Standard cluster: Running, Terminated
High Concurrency cluster: Running, Terminated
Access denied: Running, Terminated
Table ACLs enabled: Running, Terminated
At the far right of the All-purpose clusters tab is an icon you can use to terminate the cluster.
You can use the three-button menu to restart, clone, delete, or edit permissions for the cluster. Menu options
that are not available are grayed out.
The All-purpose clusters tab also shows the number of notebooks attached to each cluster.
Filter cluster list
You can filter the cluster lists using the buttons and search box at the top right:
Pin a cluster
30 days after a cluster is terminated, it is permanently deleted. To keep an all-purpose cluster configuration even
after a cluster has been terminated for more than 30 days, an administrator can pin the cluster. Up to 100
clusters can be pinned.
You can pin a cluster from the cluster list or the cluster detail page:
Pin cluster from cluster list
To pin or unpin a cluster, click the pin icon to the left of the cluster name.
You can also invoke the Pin API endpoint to programmatically pin a cluster.
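For example, a sketch of calling the pin endpoint of the Clusters API 2.0 with the Python requests library (the workspace URL, token, and cluster ID are placeholders; in practice, store the token in a secret scope):
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"  # placeholder

response = requests.post(
    f"{workspace_url}/api/2.0/clusters/pin",
    headers={"Authorization": f"Bearer {token}"},
    json={"cluster_id": "1234-567890-abcde123"},  # placeholder cluster ID
)
response.raise_for_status()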
You can also invoke the Edit API endpoint to programmatically edit the cluster.
NOTE
Notebooks and jobs that were attached to the cluster remain attached after editing.
Libraries installed on the cluster remain installed after editing.
If you edit any attribute of a running cluster (except for the cluster size and permissions), you must restart it. This can
disrupt users who are currently using the cluster.
You can edit only running or terminated clusters. You can, however, update permissions for clusters that are not in
those states on the cluster details page.
For detailed information about cluster configuration properties you can edit, see Configure clusters.
Clone a cluster
You can create a new cluster by cloning an existing cluster.
From the cluster list, click the three-button menu and select Clone from the drop down.
From the cluster detail page, open the menu and select Clone from the drop down.
The cluster creation form is opened prepopulated with the cluster configuration. The following attributes from
the existing cluster are not included in the clone:
Cluster permissions
Installed libraries
Attached notebooks
Cluster-level permissions: A user who has the Can manage permission for a cluster can configure
whether other users can attach to, restart, resize, and manage that cluster from the cluster list or the
cluster details page.
From the cluster list, click the three-button menu and select Edit Permissions .
From the cluster detail page, open the menu and select Permissions .
To learn how to configure cluster access control and cluster-level permissions, see Cluster access control.
Start a cluster
Apart from creating a new cluster, you can also start a previously terminated cluster. This lets you re-create a
previously terminated cluster with its original configuration.
You can start a cluster from the cluster list, the cluster detail page, or a notebook.
To start a cluster from the cluster list, click the arrow:
You can also invoke the Start API endpoint to programmatically start a cluster.
Azure Databricks identifies a cluster with a unique cluster ID. When you start a terminated cluster, Databricks re-
creates the cluster with the same ID, automatically installs all the libraries, and re-attaches the notebooks.
NOTE
If you are using a Trial workspace and the trial has expired, you will not be able to start a cluster.
NOTE
If your cluster was created in Azure Databricks platform version 2.70 or earlier, there is no autostart: jobs scheduled to
run on terminated clusters will fail.
Terminate a cluster
To save cluster resources, you can terminate a cluster. A terminated cluster cannot run notebooks or jobs, but its
configuration is stored so that it can be reused (or—in the case of some types of jobs—autostarted) at a later
time. You can manually terminate a cluster or configure the cluster to automatically terminate after a specified
period of inactivity. Azure Databricks records information whenever a cluster is terminated. When the number of
terminated clusters exceeds 150, the oldest clusters are deleted.
Unless a cluster is pinned, 30 days after the cluster is terminated, it is automatically and permanently deleted.
Terminated clusters appear in the cluster list with a gray circle at the left of the cluster name.
NOTE
When you run a job on a New Job Cluster (which is usually recommended), the cluster terminates and is unavailable for
restarting when the job is complete. On the other hand, if you schedule a job to run on an Existing All-Purpose Cluster
that has been terminated, that cluster will autostart.
IMPORTANT
If you are using a Trial Premium workspace, all running clusters are terminated:
When you upgrade a workspace to full Premium.
If the workspace is not upgraded and the trial expires.
Manual termination
You can manually terminate a cluster from the cluster list or the cluster detail page.
To terminate a cluster from the cluster list, click the square:
To terminate a cluster from the cluster detail page, click Terminate :
Automatic termination
You can also set auto termination for a cluster. During cluster creation, you can specify an inactivity period in
minutes after which you want the cluster to terminate. If the difference between the current time and the last
command run on the cluster is more than the inactivity period specified, Azure Databricks automatically
terminates that cluster.
A cluster is considered inactive when all commands on the cluster, including Spark jobs, Structured Streaming,
and JDBC calls, have finished executing.
WARNING
Clusters do not report activity resulting from the use of DStreams. This means that an autoterminating cluster may be
terminated while it is running DStreams. Turn off auto termination for clusters running DStreams or consider using
Structured Streaming.
The auto termination feature monitors only Spark jobs, not user-defined local processes. Therefore, if all Spark jobs
have completed, a cluster may be terminated even if local processes are running.
Idle clusters continue to accumulate DBU and cloud instance charges during the inactivity period before termination.
IMPORTANT
The default value of the auto terminate setting depends on whether you choose to create a standard or high concurrency
cluster:
Standard clusters are configured to terminate automatically after 120 minutes.
High concurrency clusters are configured to not terminate automatically.
You can opt out of auto termination by clearing the Auto Termination checkbox or by specifying an inactivity
period of 0 .
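When you create or edit a cluster through the Clusters API 2.0, the equivalent setting is the autotermination_minutes field. A minimal sketch of the relevant fragment of a request body (other fields omitted; values are placeholders):
cluster_spec = {
    "cluster_name": "my-cluster",
    "autotermination_minutes": 60,  # 0 opts out of auto termination
}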
NOTE
Auto termination is best supported in the latest Spark versions. Older Spark versions have known limitations which can
result in inaccurate reporting of cluster activity. For example, clusters running JDBC, R, or streaming commands can report
a stale activity time that leads to premature cluster termination. Please upgrade to the most recent Spark version to
benefit from bug fixes and improvements to auto termination.
Unexpected termination
Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured
automatic termination.
For a list of termination reasons and remediation steps, see the Knowledge Base.
Delete a cluster
Deleting a cluster terminates the cluster and removes its configuration.
WARNING
You cannot undo this action.
You cannot delete a pinned cluster. In order to delete a pinned cluster, it must first be unpinned by an
administrator.
From the cluster list, click the three-button menu and select Delete from the drop down.
From the cluster detail page, open the menu and select Delete from the drop down.
You can also invoke the Permanent delete API endpoint to programmatically delete a cluster.
WARNING
If you set perform_restart to True , the script automatically restarts eligible clusters, which can cause active jobs to
fail and reset open notebooks. To reduce the risk of disrupting your workspace’s business critical jobs, plan a scheduled
maintenance window and be sure to notify workspace users.
To filter the events, click in the Filter by Event Type… field and select one or more event type checkboxes.
Use Select all to make it easier to filter by excluding particular event types.
View event details
For more information about an event, click its row in the log and then click the JSON tab for details.
Monitor performance
To help you monitor the performance of Azure Databricks clusters, Azure Databricks provides access to Ganglia
metrics from the cluster details page.
In addition, you can configure an Azure Databricks cluster to send metrics to a Log Analytics workspace in Azure
Monitor, the monitoring platform for Azure.
You can install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account.
Ganglia metrics
To access the Ganglia UI, navigate to the Metrics tab on the cluster details page. CPU metrics are available in the
Ganglia UI for all Databricks runtimes. GPU metrics are available for GPU-enabled clusters.
Datadog metrics
You can install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account. The
following notebook demonstrates how to install a Datadog agent on a cluster using a cluster-scoped init script.
To install the Datadog agent on all clusters, use a global init script after testing the cluster-scoped init script.
Install Datadog agent init script notebook
Get notebook
Because spot instances can reduce costs, creating clusters using spot instances rather than on-demand instances
is a common way to run jobs. However, spot instances can be preempted by cloud provider scheduling
mechanisms. Preemption of spot instances can cause issues with jobs that are running, including:
Shuffle fetch failures
Shuffle data loss
RDD data loss
Job failures
You can enable decommissioning to help address these issues. Decommissioning takes advantage of the
notification that the cloud provider usually sends before a spot instance is decommissioned. When a spot
instance containing an executor receives a preemption notification, the decommissioning process will attempt to
migrate shuffle and RDD data to healthy executors. The duration before the final preemption is typically 30
seconds to 2 minutes, depending on the cloud provider.
Databricks recommends enabling data migration when decommissioning is also enabled. Generally, the
possibility of errors decreases as more data is migrated, including shuffle fetching failures, shuffle data loss, and
RDD data loss. Data migration can also lead to less re-computation and save cost.
Decommissioning is best effort and does not guarantee that all data can be migrated before final preemption.
Decommissioning cannot guarantee against shuffle fetch failures when running tasks are fetching shuffle data
from the executor.
With decommissioning enabled, task failures caused by spot instance preemption are not added to the total
number of failed attempts. Task failures caused by preemption are not counted as failed attempts because the
cause of the failure is external to the task and will not result in job failure.
To enable decommissioning, you set Spark configuration settings and environment variables when you create a
cluster:
To enable decommissioning for applications:
spark.decommission.enabled true
To enable shuffle data migration:
spark.storage.decommission.enabled true
spark.storage.decommission.shuffleBlocks.enabled true
To enable RDD cache data migration:
spark.storage.decommission.enabled true
spark.storage.decommission.rddBlocks.enabled true
NOTE
When RDD StorageLevel replication is set to more than 1, Databricks does not recommend enabling RDD data
migration since the replicas ensure RDDs will not lose data.
To enable decommissioning for workers, set the following environment variable:
SPARK_WORKER_OPTS="-Dspark.decommission.enabled=true"
When the decommissioning finishes, the executor that decommissioned shows the loss reason in the Spark UI
> Executors tab on the cluster’s details page:
Configure clusters
7/21/2022 • 17 minutes to read
This article explains the configuration options available when you create and edit Azure Databricks clusters. It
focuses on creating and editing clusters using the UI. For other methods, see Clusters CLI, Clusters API 2.0, and
Databricks Terraform provider.
For help deciding what combination of configuration options suits your needs best, see cluster configuration
best practices.
Cluster policy
A cluster policy limits the ability to configure clusters based on a set of rules. The policy rules limit the attributes
or attribute values available for cluster creation. Cluster policies have ACLs that limit their use to specific users
and groups and thus limit which policies you can select when you create a cluster.
To configure a cluster policy, select the cluster policy in the Policy drop-down.
NOTE
If no policies have been created in the workspace, the Policy drop-down does not display.
If you have:
Cluster create permission, you can select the Unrestricted policy and create fully-configurable clusters. The
Unrestricted policy does not limit any cluster attributes or attribute values.
Both cluster create permission and access to cluster policies, you can select the Unrestricted policy and the
policies you have access to.
Access to cluster policies only, you can select the policies you have access to.
Cluster mode
Azure Databricks supports three cluster modes: Standard, High Concurrency, and Single Node. The default
cluster mode is Standard.
IMPORTANT
If your workspace is assigned to a Unity Catalog metastore, High Concurrency clusters are not available. Instead, you
use security mode to ensure the integrity of access controls and enforce strong isolation guarantees. See also Create a
Data Science & Engineering cluster.
You cannot change the cluster mode after a cluster is created. If you want a different cluster mode, you must create a
new cluster.
NOTE
The cluster configuration includes an auto terminate setting whose default value depends on cluster mode:
Standard and Single Node clusters terminate automatically after 120 minutes by default.
High Concurrency clusters do not terminate automatically by default.
Standard clusters
A Standard cluster is recommended for a single user. Standard clusters can run workloads developed in any
language: Python, SQL, R, and Scala.
High Concurrency clusters
A High Concurrency cluster is a managed cloud resource. The key benefits of High Concurrency clusters are that
they provide fine-grained sharing for maximum resource utilization and minimum query latencies.
High Concurrency clusters can run workloads developed in SQL, Python, and R. The performance and security of
High Concurrency clusters is provided by running user code in separate processes, which is not possible in
Scala.
In addition, only High Concurrency clusters support table access control.
To create a High Concurrency cluster, set Cluster Mode to High Concurrency .
For an example of how to create a High Concurrency cluster using the Clusters API, see High Concurrency
cluster example.
Single Node clusters
A Single Node cluster has no workers and runs Spark jobs on the driver node.
In contrast, a Standard cluster requires at least one Spark worker node in addition to the driver node to execute
Spark jobs.
To create a Single Node cluster, set Cluster Mode to Single Node .
To learn more about working with Single Node clusters, see Single Node clusters.
Pools
To reduce cluster start time, you can attach a cluster to a predefined pool of idle instances, for the driver and
worker nodes. The cluster is created using instances in the pools. If a pool does not have sufficient idle resources
to create the requested driver or worker nodes, the pool expands by allocating new instances from the instance
provider. When an attached cluster is terminated, the instances it used are returned to the pools and can be
reused by a different cluster.
If you select a pool for worker nodes but not for the driver node, the driver node inherits the pool from the
worker node configuration.
IMPORTANT
If you attempt to select a pool for the driver node but not for worker nodes, an error occurs and your cluster isn’t created.
This requirement prevents a situation where the driver node has to wait for worker nodes to be created, or vice versa.
See Pools to learn more about working with pools in Azure Databricks.
Databricks Runtime
Databricks runtimes are the set of core components that run on your clusters. All Databricks runtimes include
Apache Spark and add components and updates that improve usability, performance, and security. For details,
see Databricks runtimes.
Azure Databricks offers several types of runtimes and several versions of those runtime types in the Databricks
Runtime Version drop-down when you create or edit a cluster.
Photon acceleration
IMPORTANT
This feature is in Public Preview.
NOTE
Available in Databricks Runtime 8.3 and above.
If desired, you can specify the instance type in the Worker Type and Driver Type drop-down.
Databricks recommends the following instance types for optimal price and performance:
Standard_E4ds_v4
Standard_E8ds_v4
Standard_E16ds_v4
You can view Photon activity in the Spark UI. The following screenshot shows the query details DAG. There are
two indications of Photon in the DAG. First, Photon operators start with “Photon”, for example,
PhotonGroupingAgg . Second, in the DAG, Photon operators and stages are colored peach, while the non-Photon
ones are blue.
Docker images
For some Databricks Runtime versions, you can specify a Docker image when you create a cluster. Example use
cases include library customization, a golden container environment that doesn’t change, and Docker CI/CD
integration.
You can also use Docker images to create custom deep learning environments on clusters with GPU devices.
For instructions, see Customize containers with Databricks Container Services and Databricks Container
Services on GPU clusters.
Driver node
Worker node
GPU instance types
Spot instances
Driver node
The driver node maintains state information of all notebooks attached to the cluster. The driver node also
maintains the SparkContext and interprets all the commands you run from a notebook or a library on the
cluster, and runs the Apache Spark master that coordinates with the Spark executors.
The default value of the driver node type is the same as the worker node type. You can choose a larger driver
node type with more memory if you are planning to collect() a lot of data from Spark workers and analyze
them in the notebook.
TIP
Since the driver node maintains all of the state information of the notebooks attached, make sure to detach unused
notebooks from the driver node.
Worker node
Azure Databricks worker nodes run the Spark executors and other services required for the proper functioning
of the clusters. When you distribute your workload with Spark, all of the distributed processing happens on
worker nodes. Azure Databricks runs one executor per worker node; therefore the terms executor and worker
are used interchangeably in the context of the Azure Databricks architecture.
TIP
To run a Spark job, you need at least one worker node. If a cluster has zero workers, you can run non-Spark commands
on the driver node, but Spark commands will fail.
The first instance will always be on-demand (the driver node is always on-demand) and subsequent instances
will be spot instances. If spot instances are evicted due to unavailability, on-demand instances are deployed to
replace evicted instances.
Cluster size and autoscaling
When you create an Azure Databricks cluster, you can either provide a fixed number of workers for the cluster or
provide a minimum and maximum number of workers for the cluster.
When you provide a fixed size cluster, Azure Databricks ensures that your cluster has the specified number of
workers. When you provide a range for the number of workers, Databricks chooses the appropriate number of
workers required to run your job. This is referred to as autoscaling.
With autoscaling, Azure Databricks dynamically reallocates workers to account for the characteristics of your
job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks
automatically adds additional workers during these phases of your job (and removes them when they’re no
longer needed).
Autoscaling makes it easier to achieve high cluster utilization, because you don’t need to provision the cluster to
match a workload. This applies especially to workloads whose requirements change over time (like exploring a
dataset during the course of a day), but it can also apply to a one-time shorter workload whose provisioning
requirements are unknown. Autoscaling thus offers two advantages:
Workloads can run faster compared to a constant-sized under-provisioned cluster.
Autoscaling clusters can reduce overall costs compared to a statically-sized cluster.
Depending on the constant size of the cluster and the workload, autoscaling gives you one or both of these
benefits at the same time. The cluster size can go below the minimum number of workers selected when the
cloud provider terminates instances. In this case, Azure Databricks continuously retries to re-provision instances
in order to maintain the minimum number of workers.
NOTE
Autoscaling is not available for spark-submit jobs.
Job cluster - On the Configure Cluster page, select the Enable autoscaling checkbox in the
Autopilot Options box:
When the cluster is running, the cluster detail page displays the number of allocated workers. You can
compare number of allocated workers with the worker configuration and make adjustments as needed.
IMPORTANT
If you are using an instance pool:
Make sure the cluster size requested is less than or equal to the minimum number of idle instances in the pool. If it is
larger, cluster startup time will be equivalent to a cluster that doesn’t use a pool.
Make sure the maximum cluster size is less than or equal to the maximum capacity of the pool. If it is larger, the cluster
creation will fail.
Autoscaling example
If you reconfigure a static cluster to be an autoscaling cluster, Azure Databricks immediately resizes the cluster
within the minimum and maximum bounds and then starts autoscaling. As an example, the following table
demonstrates what happens to clusters with a certain initial size if you reconfigure a cluster to autoscale
between 5 and 10 nodes.
Initial size 6 → size after reconfiguration 6
Initial size 12 → size after reconfiguration 10
Initial size 3 → size after reconfiguration 5
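In the Clusters API 2.0, a fixed-size cluster is specified with num_workers, while an autoscaling cluster is specified with an autoscale block giving the minimum and maximum number of workers. A minimal sketch (values are placeholders):
# Fixed-size cluster: exactly 8 workers
fixed_size_spec = {"num_workers": 8}

# Autoscaling cluster: Azure Databricks chooses between 5 and 10 workers as load changes
autoscaling_spec = {"autoscale": {"min_workers": 5, "max_workers": 10}}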
Some instance types you use to run clusters may have locally attached disks. Azure Databricks may store shuffle
data or ephemeral data on these locally attached disks. To ensure that all data at rest is encrypted for all storage
types, including shuffle data that is stored temporarily on your cluster’s local disks, you can enable local disk
encryption.
IMPORTANT
Your workloads may run more slowly because of the performance impact of reading and writing encrypted data to and
from local volumes.
When local disk encryption is enabled, Azure Databricks generates an encryption key locally that is unique to
each cluster node and is used to encrypt all data stored on local disks. The scope of the key is local to each
cluster node and is destroyed along with the cluster node itself. During its lifetime, the key resides in memory
for encryption and decryption and is stored encrypted on the disk.
To enable local disk encryption, you must use the Clusters API 2.0. During cluster creation or edit, set:
{
"enable_local_disk_encryption": true
}
See Create and Edit in the Clusters API reference for examples of how to invoke these APIs.
Here is an example of a cluster create call that enables local disk encryption:
{
"cluster_name": "my-cluster",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"enable_local_disk_encryption": true,
"spark_conf": {
"spark.speculation": true
},
"num_workers": 25
}
Security mode
If your workspace is assigned to a Unity Catalog metastore, you use security mode instead of High Concurrency
cluster mode to ensure the integrity of access controls and enforce strong isolation guarantees. High
Concurrency cluster mode is not available with Unity Catalog.
Under Advanced options , select from the following cluster security modes:
None : No isolation. Does not enforce workspace-local table access control or credential passthrough. Cannot
access Unity Catalog data.
Single User : Can be used only by a single user (by default, the user who created the cluster). Other users
cannot attach to the cluster. When accessing a view from a cluster with Single User security mode, the view
is executed with the user’s permissions. Single-user clusters support workloads using Python, Scala, and R.
Init scripts, library installation, and DBFS FUSE mounts are supported on single-user clusters. Automated
jobs should use single-user clusters.
User Isolation : Can be shared by multiple users. Only SQL workloads are supported. Library installation,
init scripts, and DBFS FUSE mounts are disabled to enforce strict isolation among the cluster users.
Table ACL only (Legacy) : Enforces workspace-local table access control, but cannot access Unity Catalog
data.
Passthrough only (Legacy) : Enforces workspace-local credential passthrough, but cannot access Unity
Catalog data.
The only security modes supported for Unity Catalog workloads are Single User and User Isolation .
For more information, see Cluster security mode.
Spark configuration
To fine tune Spark jobs, you can provide custom Spark configuration properties in a cluster configuration.
1. On the cluster configuration page, click the Advanced Options toggle.
2. Click the Spark tab.
In Spark config , enter the configuration properties as one key-value pair per line.
When you configure a cluster using the Clusters API 2.0, set Spark properties in the spark_conf field in the
Create cluster request or Edit cluster request.
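For example, a fragment of a create or edit request body that sets Spark properties through the spark_conf field (a sketch; the property values are illustrative):
cluster_spec = {
    "cluster_name": "my-cluster",
    "spark_conf": {
        "spark.speculation": "true",
        "spark.sql.shuffle.partitions": "64",
    },
}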
To set Spark properties for all clusters, create a global init script:
dbutils.fs.put("dbfs:/databricks/init/set_spark_params.sh","""
|#!/bin/bash
|
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark-driver-defaults.conf
|[driver] {
| "spark.sql.sources.partitionOverwriteMode" = "DYNAMIC"
|}
|EOF
""".stripMargin, true)
To retrieve the value of a Spark configuration property from a secret without exposing the secret value in plaintext, reference the secret using the following syntax:
spark.<property-name> {{secrets/<scope-name>/<secret-name>}}
For example, to set a Spark configuration property called password to the value of the secret stored in
secrets/acme_app/password :
spark.password {{secrets/acme_app/password}}
For more information, see Syntax for referencing secrets in a Spark configuration property or environment
variable.
Environment variables
You can configure custom environment variables that you can access from init scripts running on a cluster.
Databricks also provides predefined environment variables that you can use in init scripts. You cannot override
these predefined environment variables.
1. On the cluster configuration page, click the Advanced Options toggle.
2. Click the Spark tab.
3. Set the environment variables in the Environment Variables field.
You can also set environment variables using the spark_env_vars field in the Create cluster request or Edit
cluster request Clusters API endpoints.
Cluster tags
Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your
organization. You can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies
these tags to cloud resources like VMs and disk volumes, as well as DBU usage reports.
For clusters launched from pools, the custom cluster tags are only applied to DBU usage reports and do not
propagate to cloud resources.
For detailed information about how pool and cluster tag types work together, see Monitor usage using cluster,
pool, and workspace tags.
For convenience, Azure Databricks applies four default tags to each cluster: Vendor , Creator , ClusterName , and
ClusterId .
In addition, on job clusters, Azure Databricks applies two default tags: RunName and JobId .
On resources used by Databricks SQL, Azure Databricks also applies the default tag SqlWarehouseId .
WARNING
Do not assign a custom tag with the key Name to a cluster. Every cluster has a tag Name whose value is set by Azure
Databricks. If you change the value associated with the key Name , the cluster can no longer be tracked by Azure
Databricks. As a consequence, the cluster might not be terminated after becoming idle and will continue to incur usage
costs.
You can add custom tags when you create a cluster. To configure cluster tags:
1. On the cluster configuration page, click the Advanced Options toggle.
2. At the bottom of the page, click the Tags tab.
3. Add a key-value pair for each custom tag. You can add up to 43 custom tags.
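If you create clusters through the Clusters API 2.0 instead, the same tags can be supplied in the custom_tags field. A minimal sketch (tag keys and values are placeholders):
cluster_spec = {
    "cluster_name": "my-cluster",
    "custom_tags": {
        "team": "data-engineering",
        "cost-center": "12345",
    },
}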
For more details, see Monitor usage using cluster, pool, and workspace tags.
NOTE
SSH can be enabled only if your workspace is deployed in your own Azure virtual network.
Init scripts
A cluster node initialization—or init—script is a shell script that runs during startup for each cluster node before
the Spark driver or worker JVM starts. You can use init scripts to install packages and libraries not included in
the Databricks runtime, modify the JVM system classpath, set system properties and environment variables
used by the JVM, or modify Spark configuration parameters, among other configuration tasks.
You can attach init scripts to a cluster by expanding the Advanced Options section and clicking the Init
Scripts tab.
For detailed instructions, see Cluster node initialization scripts.
Best practices: Cluster configuration
7/21/2022 • 16 minutes to read
Azure Databricks provides a number of options when you create and configure clusters to help you get the best
performance at the lowest cost. This flexibility, however, can create challenges when you’re trying to determine
optimal configurations for your workloads. Carefully considering how users will utilize clusters will help guide
configuration options when you create new clusters or configure existing clusters. Some of the things to
consider when determining configuration options are:
What type of user will be using the cluster? A data scientist may be running different job types with different
requirements than a data engineer or data analyst.
What types of workloads will users run on the cluster? For example, batch extract, transform, and load (ETL)
jobs will likely have different requirements than analytical workloads.
What level of service level agreement (SLA) do you need to meet?
What budget constraints do you have?
This article provides cluster configuration recommendations for different scenarios based on these
considerations. This article also discusses specific features of Azure Databricks clusters and the considerations to
keep in mind for those features.
Your configuration decisions will require a tradeoff between cost and performance. The primary cost of a cluster
includes the Databricks Units (DBUs) consumed by the cluster and the cost of the underlying resources needed
to run the cluster. What may not be obvious are the secondary costs such as the cost to your business of not
meeting an SLA, decreased employee efficiency, or possible waste of resources because of poor controls.
Cluster features
Before discussing more detailed cluster configuration scenarios, it’s important to understand some features of
Azure Databricks clusters and how best to use those features.
All-purpose clusters and job clusters
When you create a cluster you select a cluster type: an all-purpose cluster or a job cluster. All-purpose clusters
can be shared by multiple users and are best for performing ad-hoc analysis, data exploration, or development.
Once you’ve completed implementing your processing and are ready to operationalize your code, switch to
running it on a job cluster. Job clusters terminate when your job ends, reducing resource usage and cost.
Cluster mode
Azure Databricks supports three cluster modes: Standard, High Concurrency, and Single Node. Most regular
users use Standard or Single Node clusters.
Standard clusters are ideal for processing large amounts of data with Apache Spark.
Single Node clusters are intended for jobs that use small amounts of data or non-distributed workloads such
as single-node machine learning libraries.
High Concurrency clusters are ideal for groups of users who need to share resources or run ad-hoc jobs.
Administrators usually create High Concurrency clusters. Databricks recommends enabling autoscaling for
High Concurrency clusters.
On-demand and spot instances
To save cost, Azure Databricks supports creating clusters using a combination of on-demand and spot instances.
You can use spot instances to take advantage of unused capacity on Azure to reduce the cost of running your
applications, grow your application’s compute capacity, and increase throughput.
Autoscaling
Autoscaling allows clusters to resize automatically based on workloads. Autoscaling can benefit many use cases
and scenarios from both a cost and performance perspective, but it can be challenging to understand when and
how to use autoscaling. The following are some considerations for determining whether to use autoscaling and
how to get the most benefit:
Autoscaling typically reduces costs compared to a fixed-size cluster.
Autoscaling workloads can run faster compared to an under-provisioned fixed-size cluster.
Some workloads are not compatible with autoscaling clusters, including spark-submit jobs and some Python
packages.
With single-user all-purpose clusters, users may find autoscaling is slowing down their development or
analysis when the minimum number of workers is set too low. This is because the commands or queries
they’re running are often several minutes apart, time in which the cluster is idle and may scale down to save
on costs. When the next command is executed, the cluster manager will attempt to scale up, taking a few
minutes while retrieving instances from the cloud provider. During this time, jobs might run with insufficient
resources, slowing the time to retrieve results. While increasing the minimum number of workers helps, it
also increases cost. This is another example where cost and performance need to be balanced.
If Delta Caching is being used, it’s important to remember that any cached data on a node is lost if that node
is terminated. If retaining cached data is important for your workload, consider using a fixed-size cluster.
If you have a job cluster running an ETL workload, you can sometimes size your cluster appropriately when
tuning if you know your job is unlikely to change. However, autoscaling gives you flexibility if your data sizes
increase. It’s also worth noting that optimized autoscaling can reduce expense with long-running jobs if there
are long periods when the cluster is underutilized or waiting on results from another process. Once again,
though, your job may experience minor delays as the cluster attempts to scale up appropriately. If you have
tight SLAs for a job, a fixed-sized cluster may be a better choice or consider using an Azure Databricks pool
to reduce cluster start times.
Azure Databricks also supports autoscaling local storage. With autoscaling local storage, Azure Databricks
monitors the amount of free disk space available on your cluster’s Spark workers. If a worker begins to run low
on disk, Azure Databricks automatically attaches a new managed volume to the worker before it runs out of disk
space.
Pools
Pools reduce cluster start and scale-up times by maintaining a set of available, ready-to-use instances.
Databricks recommends taking advantage of pools to improve processing time while minimizing cost.
Databricks Runtime versions
Databricks recommends using the latest Databricks Runtime version for all-purpose clusters. Using the most
current version will ensure you have the latest optimizations and most up-to-date compatibility between your
code and preloaded packages.
For job clusters running operational workloads, consider using the Long Term Support (LTS) Databricks Runtime
version. Using the LTS version will ensure you don’t run into compatibility issues and can thoroughly test your
workload before upgrading. If you have an advanced use case around machine learning or genomics, consider
the specialized Databricks Runtime versions.
Cluster policies
Azure Databricks cluster policies allow administrators to enforce controls over the creation and configuration of
clusters. Databricks recommends using cluster policies to help apply the recommendations discussed in this
guide. Learn more about cluster policies in the cluster policies best practices guide.
Automatic termination
Many users won’t think to terminate their clusters when they’re finished using them. Fortunately, clusters are
automatically terminated after a set period, with a default of 120 minutes.
Administrators can change this default setting when creating cluster policies. Decreasing this setting can lower
cost by reducing the time that clusters are idle. It’s important to remember that when a cluster is terminated all
state is lost, including all variables, temp tables, caches, functions, objects, and so forth. All of this state will need
to be restored when the cluster starts again. If a developer steps out for a 30-minute lunch break, it would be
wasteful to spend that same amount of time to get a notebook back to the same state as before.
IMPORTANT
Idle clusters continue to accumulate DBU and cloud instance charges during the inactivity period before termination.
Garbage collection
While it may be less obvious than other considerations discussed in this article, paying attention to garbage
collection can help optimize job performance on your clusters. Providing a large amount of RAM can help jobs
perform more efficiently but can also lead to delays during garbage collection.
To minimize the impact of long garbage collection sweeps, avoid deploying clusters with large amounts of RAM
configured for each instance. Having more RAM allocated to the executor will lead to longer garbage collection
times. Instead, configure instances with smaller RAM sizes, and deploy more instances if you need more
memory for your jobs. However, there are cases where fewer nodes with more RAM are recommended, for
example, workloads that require a lot of shuffles, as discussed in Cluster sizing considerations.
Cluster access control
You can configure two types of cluster permissions:
The Allow Cluster Creation permission controls the ability of users to create clusters.
Cluster-level permissions control the ability to use and modify a specific cluster.
To learn more about configuring cluster permissions, see cluster access control.
You can create a cluster if you have either cluster create permissions or access to a cluster policy, which allows
you to create any cluster within the policy’s specifications. The cluster creator is the owner and has Can Manage
permissions, which will enable them to share it with any other user within the constraints of the data access
permissions of the cluster.
Understanding cluster permissions and cluster policies are important when deciding on cluster configurations
for common scenarios.
Cluster tags
Cluster tags allow you to easily monitor the cost of cloud resources used by different groups in your
organization. You can specify tags as key-value strings when creating a cluster, and Azure Databricks applies
these tags to cloud resources, such as instances and EBS volumes. Learn more about tag enforcement in the
cluster policies best practices guide.
Analytical workloads will likely require reading the same data repeatedly, so recommended worker types are
storage optimized with Delta Cache enabled.
Additional features recommended for analytical workloads include:
Enable auto termination to ensure clusters are terminated after a period of inactivity.
Consider enabling autoscaling based on the analyst’s typical workload.
Consider using pools, which will allow restricting clusters to pre-approved instance types and ensure
consistent cluster configurations.
Features that are probably not useful:
Storage autoscaling, since this user will probably not produce a lot of data.
High Concurrency clusters, since this cluster is for a single user, and High Concurrency clusters are best
suited for shared use.
Basic batch ETL
Simple batch ETL jobs that don’t require wide transformations, such as joins or aggregations, typically benefit
from clusters that are compute-optimized. For these types of workloads, any of the clusters in the following
diagram are likely acceptable.
Compute-optimized worker types are recommended; these will be cheaper, and these workloads will likely not
require significant memory or storage.
Using a pool might provide a benefit for clusters supporting simple ETL jobs by decreasing cluster launch times
and reducing total runtime when running job pipelines. However, since these types of workloads typically run as
scheduled jobs where the cluster runs only long enough to complete the job, using a pool might not provide a
benefit.
The following features probably aren’t useful:
Delta Caching, since re-reading data is not expected.
Auto termination probably isn’t required since these are likely scheduled jobs.
Autoscaling is not recommended since compute and storage should be pre-configured for the use case.
High Concurrency clusters are intended for multi-users and won’t benefit a cluster running a single job.
Complex batch ETL
More complex ETL jobs, such as processing that requires unions and joins across multiple tables, will probably
work best when you can minimize the amount of data shuffled. Since reducing the number of workers in a
cluster will help minimize shuffles, you should consider a smaller cluster like cluster A in the following diagram
over a larger cluster like cluster D.
Complex transformations can be compute-intensive, so for some workloads reaching an optimal number of
cores may require adding additional nodes to the cluster.
Like simple ETL jobs, compute-optimized worker types are recommended; these will be cheaper, and these
workloads will likely not require significant memory or storage. Also, like simple ETL jobs, the main cluster
feature to consider is pools to decrease cluster launch times and reduce total runtime when running job
pipelines.
The following features probably aren’t useful:
Delta Caching, since re-reading data is not expected.
Auto termination probably isn’t required since these are likely scheduled jobs.
Autoscaling is not recommended since compute and storage should be pre-configured for the use case.
High Concurrency clusters are intended for multi-users and won’t benefit a cluster running a single job.
Training machine learning models
Since initial iterations of training a machine learning model are often experimental, a smaller cluster such as
cluster A is a good choice. A smaller cluster will also reduce the impact of shuffles.
If stability is a concern, or for more advanced stages, a larger cluster such as cluster B or C may be a good
choice.
A large cluster such as cluster D is not recommended due to the overhead of shuffling data between nodes.
Recommended worker types are storage optimized with Delta Caching enabled to account for repeated reads of
the same data and to enable caching of training data. If the compute and storage options provided by storage
optimized nodes are not sufficient, consider GPU optimized nodes. A possible downside is the lack of Delta
Caching support with these nodes.
Additional features recommended for analytical workloads include:
Enable auto termination to ensure clusters are terminated after a period of inactivity.
Consider enabling autoscaling based on the analyst’s typical workload.
Use pools, which will allow restricting clusters to pre-approved instance types and ensure consistent cluster
configurations.
Features that are probably not useful:
Autoscaling, since cached data can be lost when nodes are removed as a cluster scales down. Additionally,
typical machine learning jobs will often consume all available nodes, in which case autoscaling will provide
no benefit.
Storage autoscaling, since this user will probably not produce a lot of data.
High Concurrency clusters, since this cluster is for a single user, and High Concurrency clusters are best
suited for shared use.
Common scenarios
The following sections provide additional recommendations for configuring clusters for common cluster usage
patterns:
Multiple users running data analysis and ad-hoc processing.
Specialized use cases like machine learning.
Support scheduled batch jobs.
Multi-user clusters
Scenario
You need to provide multiple users access to data for running data analysis and ad-hoc queries. Cluster usage
might fluctuate over time, and most jobs are not very resource-intensive. The users mostly require read-only
access to the data and want to perform analyses or create dashboards through a simple user interface.
The recommended approach for cluster provisioning is a hybrid approach for node provisioning in the cluster
along with autoscaling. A hybrid approach involves defining the number of on-demand instances and spot
instances for the cluster and enabling autoscaling between the minimum and the maximum number of
instances.
This cluster is always available and shared by the users belonging to a group by default. Enabling autoscaling
allows the cluster to scale up and down depending upon the load.
Users do not have access to start/stop the cluster, but the initial on-demand instances are immediately available
to respond to user queries. If the user query requires more capacity, autoscaling automatically provisions more
nodes (mostly Spot instances) to accommodate the workload.
Azure Databricks has other features to further improve multi-tenancy use cases:
Handling large queries in interactive workflows describes a process to automatically manage queries that will
never finish.
Task preemption improves how long-running jobs and shorter jobs work together.
Autoscaling local storage helps prevent running out of storage space in a multi-tenant environment.
This approach keeps the overall cost down by:
Using a shared cluster model.
Using a mix of on-demand and spot instances.
Using autoscaling to avoid paying for underutilized clusters.
Specialized workloads
Scenario
You need to provide clusters for specialized use cases or teams within your organization, for example, data
scientists running complex data exploration and machine learning algorithms. A typical pattern is that a user
needs a cluster for a short period to run their analysis.
The best approach for this kind of workload is to create cluster policies with pre-defined configurations for
default, fixed, and settings ranges. These settings might include the number of instances, instance types, spot
versus on-demand instances, roles, libraries to be installed, and so forth. Using cluster policies allows users with
more advanced requirements to quickly spin up clusters that they can configure as needed for their use case
and enforce cost and compliance with policies.
This approach provides more control to users while maintaining the ability to keep cost under control by pre-
defining cluster configurations. This also allows you to configure clusters for different groups of users with
permissions to access different data sets.
One downside to this approach is that users have to work with administrators for any changes to clusters, such
as configuration, installed libraries, and so forth.
Batch workloads
Scenario
You need to provide clusters for scheduled batch jobs, such as production ETL jobs that perform data
preparation. The suggested best practice is to launch a new cluster for each job run. Running each job on a new
cluster helps avoid failures and missed SLAs caused by other workloads running on a shared cluster. Depending
on the level of criticality for the job, you could use all on-demand instances to meet SLAs or balance between
spot and on-demand instances for cost savings.
Task preemption
7/21/2022 • 2 minutes to read
The Apache Spark scheduler in Azure Databricks automatically preempts tasks to enforce fair sharing. This
guarantees interactive response times on clusters with many concurrently running jobs.
TIP
When tasks are preempted by the scheduler, their kill reason will be set to preempted by scheduler . This reason is
visible in the Spark UI and can be used to debug preemption behavior.
Preemption options
By default, preemption is conservative: jobs can be starved of resources for up to 30 seconds before the
scheduler intervenes. You can tune preemption by setting the following Spark configuration properties at cluster
launch time:
Whether preemption should be enabled.
spark.databricks.preemption.enabled true
The fair share fraction to guarantee per job. Setting this to 1.0 means the scheduler will aggressively
attempt to guarantee perfect fair sharing. Setting this to 0.0 effectively disables preemption. The default
setting is 0.5, which means that, at worst, a job will get half of its fair share.
spark.databricks.preemption.threshold 0.5
How long a job must remain starved before preemption kicks in. Setting this to lower values will provide
more interactive response times, at the cost of cluster efficiency. Recommended values are from 1-100
seconds.
spark.databricks.preemption.timeout 30s
How often the scheduler will check for task preemption. This should be set to less than the preemption
timeout.
spark.databricks.preemption.interval 5s
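For example, here is a minimal sketch of supplying these properties at cluster creation time through the Clusters API 2.0; the workspace URL and token are placeholders, and it assumes the properties are passed in spark_conf:
import requests

# Placeholder workspace URL and personal access token; replace with your own values.
DATABRICKS_INSTANCE = "https://<databricks-instance>"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "shared-high-concurrency",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_D3_v2",
    "num_workers": 4,
    # Preemption properties described above, set at cluster launch time.
    "spark_conf": {
        "spark.databricks.preemption.enabled": "true",
        "spark.databricks.preemption.threshold": "0.5",
        "spark.databricks.preemption.timeout": "30s",
        "spark.databricks.preemption.interval": "5s",
    },
}

response = requests.post(
    f"{DATABRICKS_INSTANCE}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json()["cluster_id"])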
Databricks Container Services lets you specify a Docker image when you create a cluster. Some example use
cases include:
Library customization: you have full control over the system libraries you want installed.
Golden container environment: your Docker image is a locked down environment that will never change.
Docker CI/CD integration: you can integrate Azure Databricks with your Docker CI/CD pipelines.
You can also use Docker images to create custom deep learning environments on clusters with GPU devices. For
additional information about using GPU clusters with Databricks Container Services, see Databricks Container
Services on GPU clusters.
For tasks to be executed each time the container starts, use an init script.
Requirements
NOTE
Databricks Runtime for Machine Learning and Databricks Runtime for Genomics do not support Databricks Container
Services.
Databricks Runtime 6.1 or above. If you have previously used Databricks Container Services you must
upgrade your base images. See the latest images in https://github.com/databricks/containers tagged with
6.x .
Your Azure Databricks workspace must have Databricks Container Services enabled.
Your machine must be running a recent Docker daemon (one that is tested and works with Client/Server
Version 18.03.0-ce) and the docker command must be available on your PATH .
FROM databricksruntime/standard:9.x
...
To specify additional Python libraries, such as the latest versions of pandas and urllib3, use the container-specific
version of pip . For the databricksruntime/standard:9.x container, include the following:
RUN /databricks/python3/bin/pip install pandas
RUN /databricks/python3/bin/pip install urllib3
Example base images are hosted on Docker Hub at https://hub.docker.com/u/databricksruntime. The Dockerfiles
used to generate these bases are at https://github.com/databricks/containers.
NOTE
The base images databricksruntime/standard and databricksruntime/minimal are not to be confused with the
unrelated databricks-standard and databricks-minimal environments included in the no longer available Databricks
Runtime with Conda (Beta).
NOTE
Databricks recommends using Ubuntu Linux; however, it is possible to use Alpine Linux. To use Alpine Linux, you must
include these files:
alpine coreutils
alpine procps
alpine sudo
In addition, you must set up Python, as shown in this example Dockerfile.
WARNING
Test your custom container image thoroughly on an Azure Databricks cluster. Your container may work on a local or build
machine, but when your container is launched on an Azure Databricks cluster, the cluster launch may fail, certain features
may become disabled, or your container may stop working, even silently. In worst-case scenarios, it could corrupt your
data or accidentally expose your data to external parties.
Step 2: Push your base image
Push your custom base image to a Docker registry. This process is supported with the following registries:
Docker Hub with no auth or basic auth.
Azure Container Registry with basic auth.
Other Docker registries that support no auth or basic auth are also expected to work.
NOTE
If you use Docker Hub for your Docker registry, be sure to check that rate limits accommodate the number of clusters
that you expect to launch in a six-hour period. These rate limits are different for anonymous users, authenticated users
without a paid subscription, and paid subscriptions. See the Docker documentation for details. If this limit is exceeded, you
will get a “429 Too Many Requests” response.
For Databricks Container Services images, you can also store init scripts in DBFS or cloud storage.
The following steps take place when you launch a Databricks Container Services cluster:
1. VMs are acquired from the cloud provider.
2. The custom Docker image is downloaded from your repo.
3. Azure Databricks creates a Docker container from the image.
4. Databricks Runtime code is copied into the Docker container.
5. The init scripts are executed. See Init script execution order.
Azure Databricks ignores the Docker CMD and ENTRYPOINT primitives.
Cluster node initialization scripts
7/21/2022 • 11 minutes to read
An init script is a shell script that runs during startup of each cluster node before the Apache Spark driver or
worker JVM starts.
Some examples of tasks performed by init scripts include:
Install packages and libraries not included in Databricks Runtime. To install Python packages, use the Azure
Databricks pip binary located at /databricks/python/bin/pip to ensure that Python packages install into the
Azure Databricks Python virtual environment rather than the system Python environment. For example,
/databricks/python/bin/pip install <package-name> .
Modify the JVM system classpath in special cases.
Set system properties and environment variables used by the JVM.
Modify Spark configuration parameters.
WARNING
Azure Databricks scans the reserved location /databricks/init for legacy global init scripts which are enabled in new
workspaces by default. Databricks recommends you avoid storing init scripts in this location to avoid unexpected behavior.
NOTE
There are two kinds of init scripts that are deprecated. You should migrate init scripts of these types to those listed above:
Cluster-named : run on a cluster with the same name as the script. Cluster-named init scripts are best-effort (silently
ignore failures), and attempt to continue the cluster launch process. Cluster-scoped init scripts should be used instead
and are a complete replacement.
Legacy global: run on every cluster. They are less secure than the new global init script framework, silently ignore
failures, and cannot reference environment variables. You should migrate existing legacy global init scripts to the new
global init script framework. See Migrate from legacy to new global init scripts.
Whenever you change any type of init script you must restart all clusters affected by the script.
Environment variables
Cluster-scoped and global init scripts support the following environment variables:
DB_CLUSTER_ID : the ID of the cluster on which the script is running. See Clusters API 2.0.
DB_CONTAINER_IP : the private IP address of the container in which Spark runs. The init script is run inside this
container. See SparkNode.
DB_IS_DRIVER : whether the script is running on a driver node.
DB_DRIVER_IP : the IP address of the driver node.
DB_INSTANCE_TYPE : the instance type of the host VM.
DB_CLUSTER_NAME : the name of the cluster the script is executing on.
DB_IS_JOB_CLUSTER : whether the cluster was created to run a job. See Create a job.
For example, if you want to run part of a script only on a driver node, you could write a script like:
echo $DB_IS_DRIVER
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
<run this part only on driver>
else
<run this part only on workers>
fi
<run this part on both driver and workers>
You can also configure custom environment variables for a cluster and reference those variables in init scripts.
Use secrets in environment variables
Environment variables that reference secrets exhibit special behavior on Azure Databricks: init scripts can use
these variables, but programs running in Spark cannot use these variables.
You can use any valid variable name when you Reference a secret in an environment variable.
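For example, a sketch of a cluster specification fragment that declares such a variable; the scope and key names are hypothetical, and the reference syntax is the one described in Reference a secret in an environment variable:
# Fragment of a Clusters API request body; scope and key names are hypothetical.
# The reference is resolved on the cluster: init scripts can read $MY_API_TOKEN,
# but programs running in Spark cannot.
cluster_env = {
    "spark_env_vars": {
        "MY_API_TOKEN": "{{secrets/my-scope/my-api-token}}"
    }
}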
Logging
Init script start and finish events are captured in cluster event logs. Details are captured in cluster logs. Global
init script create, edit, and delete events are also captured in account-level diagnostic logs.
Init script events
Cluster event logs capture two init script events: INIT_SCRIPTS_STARTED and INIT_SCRIPTS_FINISHED , indicating
which scripts are scheduled for execution and which have completed successfully. INIT_SCRIPTS_FINISHED also
captures execution duration.
Global init scripts are indicated in the log event details by the key "global" and cluster-scoped init scripts are
indicated by the key "cluster" .
NOTE
Cluster event logs do not log init script events for each cluster node; only one node is selected to represent them all.
If the cluster is configured to write logs to DBFS, you can view the logs using the File system utility (dbutils.fs) or
the DBFS CLI. For example, if the cluster ID is 1001-234039-abcde739 :
dbfs ls dbfs:/cluster-logs/1001-234039-abcde739/init_scripts
1001-234039-abcde739_10_97_225_166
1001-234039-abcde739_10_97_231_88
1001-234039-abcde739_10_97_244_199
dbfs ls dbfs:/cluster-logs/1001-234039-abcde739/init_scripts/1001-234039-abcde739_10_97_225_166
<timestamp>_<log-id>_<init-script-name>.sh.stderr.log
<timestamp>_<log-id>_<init-script-name>.sh.stdout.log
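Equivalently, you can list the same init script logs from a notebook with the File system utility; the cluster ID below is the example used above:
# Run in a notebook attached to a cluster; dbutils is predefined there.
log_root = "dbfs:/cluster-logs/1001-234039-abcde739/init_scripts"
for node_dir in dbutils.fs.ls(log_root):
    for log_file in dbutils.fs.ls(node_dir.path):
        print(log_file.path)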
When cluster log delivery is not configured, logs are written to /databricks/init_scripts . You can use standard
shell commands in a notebook to list and view the logs:
%sh
ls /databricks/init_scripts/
cat /databricks/init_scripts/<timestamp>_<log-id>_<init-script-name>.sh.stdout.log
Every time a cluster launches, it writes a log to the init script log folder.
IMPORTANT
Any user who creates a cluster and enables cluster log delivery can view the stderr and stdout output from global
init scripts. You should ensure that your global init scripts do not output any sensitive information.
Diagnostic logs
Azure Databricks diagnostic logging captures global init script create, edit, and delete events under the event
type globalInitScripts . See Diagnostic logging in Azure Databricks.
dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
dbutils.fs.put("/databricks/scripts/postgresql-install.sh","""
#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
display(dbutils.fs.ls("dbfs:/databricks/scripts/postgresql-install.sh"))
#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar
IMPORTANT
Anaconda Inc. updated their terms of service for anaconda.org channels in September 2020. Based on the new terms of
service you may require a commercial license if you rely on Anaconda’s packaging and distribution. See Anaconda
Commercial Edition FAQ for more information. Your use of any Anaconda channels is governed by their terms of service.
As a result of this change, Databricks has removed the default channel configuration for the Conda package manager. This
is a breaking change. You must update the usage of conda commands in init-scripts to specify a channel using -c . If you
do not specify a channel, conda commands will fail with PackagesNotFoundError .
In Databricks Runtime 8.4 ML and below, you use the Conda package manager to install Python packages. To
install a Python library at cluster initialization, you can use a script like the following:
#!/bin/bash
set -ex
/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python
conda install -c conda-forge -y astropy
IMPORTANT
The script must exist at the configured location. If the script doesn’t exist, the cluster will fail to start or be autoscaled
up.
The init script cannot be larger than 64KB. If a script exceeds that size, the cluster will fail to launch and a failure
message will appear in the cluster log.
3. In the Destination drop-down, select a destination type. In the example in the preceding section, the
destination is DBFS .
4. Specify a path to the init script. In the example in the preceding section, the path is
dbfs:/databricks/scripts/postgresql-install.sh . The path must begin with dbfs:/ .
5. Click Add .
To remove a script from the cluster configuration, click the delete icon at the right of the script. When you confirm
the deletion, you are prompted to restart the cluster. Optionally, you can delete the script file from the location you
uploaded it to.
Configure a cluster-scoped init script using the DBFS REST API
To use the Clusters API 2.0 to configure the cluster with ID 1202-211320-brick1 to run the init script in the
preceding section, run the following command:
curl -n -X POST -H 'Content-Type: application/json' -d '{
"cluster_id": "1202-211320-brick1",
"num_workers": 1,
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"cluster_log_conf": {
"dbfs" : {
"destination": "dbfs:/cluster-logs"
}
},
"init_scripts": [ {
"dbfs": {
"destination": "dbfs:/databricks/scripts/postgresql-install.sh"
}
} ]
}' https://<databricks-instance>/api/2.0/clusters/edit
IMPORTANT
Use global init scripts carefully:
It is easy to add libraries or make other modifications that cause unanticipated impacts. Whenever possible, use
cluster-scoped init scripts instead.
Any user who creates a cluster and enables cluster log delivery can view the stderr and stdout output from global
init scripts. You should ensure that your global init scripts do not output any sensitive information.
You can troubleshoot global init scripts by configuring cluster log delivery and examining the init script log.
Add a global init script using the UI
To configure global init scripts using the Admin Console:
1. Go to the Admin Console and click the Global Init Scripts tab.
2. Click + Add .
3. Name the script and enter it by typing, pasting, or dragging a text file into the Script field.
NOTE
The init script cannot be larger than 64KB. If a script exceeds that size, an error message appears when you try to
save.
4. If you have more than one global init script configured for your workspace, set the order in which the
new script will run.
5. If you want the script to be enabled for all new and restarted clusters after you save, toggle Enabled .
IMPORTANT
When you add a global init script or make changes to the name, run order, or enablement of init scripts, those
changes do not take effect until you restart the cluster.
6. Click Add .
Edit a global init script using the UI
1. Go to the Admin Console and click the Global Init Scripts tab.
2. Click a script.
3. Edit the script.
4. Click Confirm .
Configure a global init script using the API
Admins can add, delete, re-order, and get information about the global init scripts in your workspace using the
Global Init Scripts API 2.0.
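As an illustration, here is a sketch of adding a script with this API; the workspace URL and token are placeholders, and the API expects the script body base64-encoded:
import base64
import requests

DATABRICKS_INSTANCE = "https://<databricks-instance>"
TOKEN = "<personal-access-token>"

script_body = "#!/bin/bash\napt-get update\n"

response = requests.post(
    f"{DATABRICKS_INSTANCE}/api/2.0/global-init-scripts",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "apt-update",
        "script": base64.b64encode(script_body.encode("utf-8")).decode("ascii"),
        "position": 0,      # run order relative to other global init scripts
        "enabled": False,   # keep disabled until you have verified the script
    },
)
response.raise_for_status()
print(response.json())  # contains the ID of the new global init script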
Migrate from legacy to new global init scripts
If your Azure Databricks workspace was launched before August 2020, you might still have legacy global init
scripts. You should migrate these to the new global init script framework to take advantage of the security,
consistency, and visibility features included in the new script framework.
1. Copy your existing legacy global init scripts and add them to the new global init script framework using
either the UI or the REST API.
Keep them disabled until you have completed the next step.
2. Disable all legacy global init scripts.
In the Admin Console, go to the Global Init Scripts tab and toggle off the Legacy Global Init Scripts
switch.
NOTE
Some GPU-enabled instance types are in Beta and are marked as such in the drop-down list when you select the driver
and worker types during cluster creation.
Overview
Azure Databricks supports clusters accelerated with graphics processing units (GPUs). This article describes how
to create clusters with GPU-enabled instances and describes the GPU drivers and libraries installed on those
instances.
To learn more about deep learning on GPU-enabled clusters, see Deep learning.
GPU scheduling
Databricks Runtime 7.0 ML and above support GPU-aware scheduling from Apache Spark 3.0. Azure Databricks
preconfigures it on GPU clusters.
GPU scheduling is not enabled on Single Node clusters.
spark.task.resource.gpu.amount is the only Spark config related to GPU-aware scheduling that you might need
to change. The default configuration uses one GPU per task, which is ideal for distributed inference workloads
and distributed training, if you use all GPU nodes. To do distributed training on a subset of nodes, which helps
reduce communication overhead during distributed training, Databricks recommends setting
spark.task.resource.gpu.amount to the number of GPUs per worker node in the cluster Spark configuration.
For PySpark tasks, Azure Databricks automatically remaps assigned GPU(s) to indices 0, 1, …. Under the default
configuration that uses one GPU per task, your code can simply use the default GPU without checking which
GPU is assigned to the task. If you set multiple GPUs per task, for example 4, your code can assume that the
indices of the assigned GPUs are always 0, 1, 2, and 3. If you do need the physical indices of the assigned GPUs,
you can get them from the CUDA_VISIBLE_DEVICES environment variable.
If you use Scala, you can get the indices of the GPUs assigned to the task from
TaskContext.resources().get("gpu") .
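To make this concrete, here is a minimal sketch of inspecting the assigned GPUs from inside a PySpark task; it assumes the notebook is attached to a GPU cluster where spark is the preconfigured SparkSession:
import os
from pyspark import TaskContext

def report_gpus(rows):
    # Logical GPU addresses assigned to this task by GPU-aware scheduling.
    addresses = TaskContext.get().resources()["gpu"].addresses
    # Physical indices of the assigned GPUs, if you need them.
    physical = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    yield (addresses, physical)

print(spark.sparkContext.parallelize(range(2), 2).mapPartitions(report_gpus).collect())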
For Databricks Runtime releases below 7.0, to avoid conflicts among multiple Spark tasks trying to use the same
GPU, Azure Databricks automatically configures GPU clusters so that there is at most one running task per node.
That way the task can use all GPUs on the node without running into conflicts with other tasks.
NOTE
This software contains source code provided by NVIDIA Corporation. Specifically, to support GPUs, Azure Databricks
includes code from CUDA Samples.
You can use Databricks Container Services on clusters with GPUs to create portable deep learning environments
with customized libraries. See Customize containers with Databricks Container Services for instructions.
To create custom images for GPU clusters, you must select a standard runtime version instead of Databricks
Runtime ML for GPU. When you select Use your own Docker container , you can choose GPU clusters with a
standard runtime version. The custom images for GPU clusters are based on the official CUDA containers, which
is different from Databricks Runtime ML for GPU.
When you create custom images for GPU clusters, you cannot change the NVIDIA driver version, because it
must match the driver version on the host machine.
The databricksruntime Docker Hub contains example base images with GPU capability. The Dockerfiles used to
generate these images are located in the example containers GitHub repository, which also has details on what
the example images provide and how to customize them.
Single Node clusters
7/21/2022 • 3 minutes to read
A Single Node cluster is a cluster consisting of an Apache Spark driver and no Spark workers. A Single Node
cluster supports Spark jobs and all Spark data sources, including Delta Lake. A Standard cluster requires a
minimum of one Spark worker to run Spark jobs.
Single Node clusters are helpful for:
Single-node machine learning workloads that use Spark to load and save data
Lightweight exploratory data analysis
Limitations
Large-scale data processing will exhaust the resources on a Single Node cluster. For these workloads,
Databricks recommends using a Standard mode cluster.
Single Node clusters are not designed to be shared. To avoid resource conflicts, Databricks recommends
using a Standard mode cluster when the cluster must be shared.
A Standard mode cluster can’t be scaled to 0 workers. Use a Single Node cluster instead.
Single Node clusters are not compatible with process isolation.
GPU scheduling is not enabled on Single Node clusters.
On Single Node clusters, Spark cannot read Parquet files with a UDT column. The following error
message results:
The Spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically
reattached.
To work around this issue, disable the native Parquet reader:
spark.conf.set("spark.databricks.io.parquet.nativeReader.enabled", False)
REST API
You can use the Clusters API to create a Single Node cluster.
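A sketch of such a request follows; the workspace URL and token are placeholders, and the singleNode profile, local master, and ResourceClass tag reflect the typical Single Node configuration, so verify them against the Clusters API documentation:
import requests

DATABRICKS_INSTANCE = "https://<databricks-instance>"
TOKEN = "<personal-access-token>"

single_node_cluster = {
    "cluster_name": "single-node-eda",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_D3_v2",
    "num_workers": 0,  # driver only, no Spark workers
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

response = requests.post(
    f"{DATABRICKS_INSTANCE}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=single_node_cluster,
)
response.raise_for_status()
print(response.json()["cluster_id"])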
Azure Databricks web terminal provides a convenient and highly interactive way for you to run shell commands
and use editors, such as Vim or Emacs, on the Spark driver node. The web terminal can be used by many users
on one cluster. Example uses of the web terminal include monitoring resource usage and installing Linux
packages.
Web terminal is disabled by default for all workspace users.
Enabling Docker Container Services disables web terminal.
WARNING
Azure Databricks proxies the web terminal service from port 7681 on the cluster’s Spark driver. This web proxy is intended
for use only with the web terminal. If the port is occupied when the cluster starts or if there is otherwise a conflict, the
web terminal may not work as expected. If other web services are launched on port 7681, cluster users may be exposed
to potential security exploits. Neither Databricks nor Microsoft is responsible for any issues that result from the
installation of unsupported software on a cluster.
Requirements
Databricks Runtime 7.0 or above.
Can Attach To permission on a cluster.
Your Azure Databricks workspace must have web terminal enabled.
Each user can have up to 100 active web terminal sessions (tabs) open. Idle web terminal sessions may time out
and the web terminal web application will reconnect, resulting in a new shell process. If you want to keep your
Bash session, Databricks recommends using tmux.
Limitations
Azure Databricks does not support running Spark jobs from the web terminal. In addition, Azure Databricks
web terminal is not available in the following cluster types:
Job clusters
High concurrency clusters with either table access control or credential passthrough enabled.
Clusters launched with the DISABLE_WEB_TERMINAL=true environment variable set.
Enabling Docker Container Services disables web terminal.
Debugging with the Apache Spark UI
7/21/2022 • 6 minutes to read
This guide walks you through the different debugging options available to peek at the internals of your Apache
Spark application. The three important places to look are:
Spark UI
Driver logs
Executor logs
Spark UI
Once you start the job, the Spark UI shows information about what’s happening in your application. To get to the
Spark UI, click the attached cluster:
Streaming tab
Once you get to the Spark UI, you will see a Streaming tab if a streaming job is running in this cluster. If there is
no streaming job running in this cluster, this tab will not be visible. You can skip to Driver logs to learn how to
check for exceptions that might have happened while starting the streaming job.
The first thing to look for in this page is to check if your streaming application is receiving any input events from
your source. In this case, you can see the job receives 1000 events/second.
If you have an application that receives multiple input streams, you can click the Input Rate link which will
show the # of events received for each receiver.
Processing time
As you scroll down, find the graph for Processing Time . This is one of the key graphs to understand the
performance of your streaming job. As a general rule of thumb, it is good if you can process each batch within
80% of your batch processing time.
For this application, the batch interval was 2 seconds. The average processing time is 450ms, which is well under
the batch interval. If the average processing time is close to or greater than your batch interval, the streaming
application will start queuing up batches, creating a backlog that can eventually bring down your streaming job.
Completed batches
Towards the end of the page, you will see a list of all the completed batches. The page displays details about the
last 1000 batches that completed. From the table, you can get the # of events processed for each batch and their
processing time. If you want to know more about what happened on one of the batches, you can click the batch
link to get to the Batch Details Page.
Batch details page
This page has all the details you want to know about a batch. Two key things are:
Input: Has details about the input to the batch. In this case, it has details about the Apache Kafka topic,
partition and offsets read by Spark Structured Streaming for this batch. In the case of TextFileStream, you see a
list of the file names that were read for this batch. This is the best way to start debugging a streaming application
reading from text files.
Processing: You can click the link to the Job ID which has all the details about the processing done during this
batch.
TIP
Ensure that the tasks are executed on multiple executors (nodes) in your cluster to have enough parallelism while
processing. If you have a single receiver, sometimes only one executor might be doing all the work even though you
have more than one executor in your cluster.
Thread dump
A thread dump shows a snapshot of a JVM’s thread states.
Thread dumps are useful in debugging a specific hanging or slow-running task. To view a specific task’s thread
dump in the Spark UI:
1. Click the Jobs tab.
2. In the Jobs table, find the target job that corresponds to the thread dump you want to see, and click the link
in the Description column.
3. In the job’s Stages table, find the target stage that corresponds to the thread dump you want to see, and click
the link in the Description column.
4. In the stage’s Tasks list, find the target task that corresponds to the thread dump you want to see, and note
its Task ID and Executor ID values.
5. Click the Executors tab.
6. In the Executors table, find the row that contains the Executor ID value that corresponds to the Executor
ID value that you noted earlier. In that row, click the link in the Thread Dump column.
7. In the Thread dump for executor table, click the row where the Thread Name column contains (TID
followed by the Task ID value that you noted earlier. (If the task has finished running, you will not find a
matching thread). The task’s thread dump is shown.
Thread dumps are also useful for debugging issues where the driver appears to be hanging (for example, no
Spark progress bars are showing) or making no progress on queries (for example, Spark progress bars are stuck
at 100%). To view the driver’s thread dump in the Spark UI:
1. Click the Executors tab.
2. In the Executors table, in the driver row, click the link in the Thread Dump column. The driver’s thread
dump is shown.
Driver logs
Driver logs are helpful for two purposes:
Exceptions: Sometimes you may not see the Streaming tab in the Spark UI because the streaming job failed to
start due to an exception. You can drill into the driver logs to look at the stack trace of the exception. In other
cases, the streaming job may have started properly, but the batches never reach the Completed batches section;
they might all be in a processing or failed state. In such cases too, driver logs are handy for understanding the
nature of the underlying issues.
Prints: Any print statements that are part of the DAG show up in the logs too.
Executor logs
Executor logs are sometimes helpful if you see that certain tasks are misbehaving and would like to see the logs for
specific tasks. From the task details page shown above, you can get the executor where the task was run. Once
you have that, go to the clusters UI page, click the # nodes, and then the master. The master page lists all
the workers. You can choose the worker where the suspicious task ran and then get to the log4j output.
Pools
7/21/2022 • 2 minutes to read
Azure Databricks pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use
instances. When a cluster is attached to a pool, cluster nodes are created using the pool’s idle instances. If the
pool has no idle instances, the pool expands by allocating a new instance from the instance provider in order to
accommodate the cluster’s request. When a cluster releases an instance, it returns to the pool and is free for
another cluster to use. Only clusters attached to a pool can use that pool’s idle instances.
You can specify a different pool for the driver node and worker nodes, or use the same pool for both.
For an introduction to pools and configuration recommendations, view the Databricks pools video:
Azure Databricks does not charge DBUs while instances are idle in the pool. Instance provider billing does apply.
See pricing.
You can manage pools using the UI, the Instance Pools CLI, or by calling the Instance Pools API 2.0.
This section describes how to work with pools using the UI:
Display pools
Create a pool
Configure pools
Edit a pool
Delete a pool
Attach a cluster to one or more pools
Best practices: pools
Display pools
7/21/2022 • 2 minutes to read
To display the pools in your workspace, click Compute in the sidebar and click the Pools tab:
The button at the far right of a row provides quick access to delete a pool.
To display cluster attachment info, click the pool name in the Pools list.
Create a pool
7/21/2022 • 2 minutes to read
IMPORTANT
You must have permission to create a pool; see Pool access control.
You will notice idle instances in the pending state. When they are no longer pending, clusters attached to the
pool will start faster.
To create a pool using the REST API, see the Instance Pools API 2.0 documentation.
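For example, here is a sketch of creating a pool with the Instance Pools API 2.0; the workspace URL and token are placeholders, and you should adjust the instance type and sizing for your workload:
import requests

DATABRICKS_INSTANCE = "https://<databricks-instance>"
TOKEN = "<personal-access-token>"

pool_spec = {
    "instance_pool_name": "analytics-pool",
    "node_type_id": "Standard_D3_v2",
    "min_idle_instances": 2,
    "max_capacity": 50,
    "idle_instance_autotermination_minutes": 30,
}

response = requests.post(
    f"{DATABRICKS_INSTANCE}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=pool_spec,
)
response.raise_for_status()
print(response.json()["instance_pool_id"])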
Configure pools
7/21/2022 • 4 minutes to read
This article explains the configuration options available when you create and edit a pool.
Maximum Capacity
The maximum number of instances that the pool will provision. If set, this value constrains all instances (idle +
used). If a cluster using the pool requests more instances than this number during autoscaling, the request will
fail with an INSTANCE_POOL_MAX_CAPACITY_FAILURE error.
This configuration is optional. Azure Databricks recommends setting a value only in the following circumstances:
You have an instance quota you must stay under.
You want to protect one set of work from impacting another set of work. For example, suppose your instance
quota is 100 and you have teams A and B that need to run jobs. You can create pool A with a max 50 and
pool B with max 50 so that the two teams share the 100 quota fairly.
You need to cap cost.
Idle Instance Auto Termination
The time in minutes that instances above the value set in Minimum Idle Instances can be idle before being
terminated by the pool.
Instance types
A pool consists of both idle instances kept ready for new clusters and instances in use by running clusters. All of
these instances are of the same instance provider type, selected when creating a pool.
A pool’s instance type cannot be edited. Clusters attached to a pool use the same instance type for the driver and
worker nodes. Different families of instance types fit different use cases, such as memory-intensive or compute-
intensive workloads.
Azure Databricks always provides one year’s deprecation notice before ceasing support for an instance type.
NOTE
If your security requirements include compute isolation, select a Standard_F72s_V2 instance as your worker type. These
instance types represent isolated virtual machines that consume the entire physical host and provide the necessary level
of isolation required to support, for example, US Department of Defense Impact Level 5 (IL5) workloads.
Pool tags
Pool tags allow you to easily monitor the cost of cloud resources used by various groups in your organization.
You can specify tags as key-value pairs when you create a pool, and Azure Databricks applies these tags to cloud
resources like VMs and disk volumes, as well as DBU usage reports.
For convenience, Azure Databricks applies three default tags to each pool: Vendor , DatabricksInstancePoolId ,
and DatabricksInstancePoolCreatorId . You can also add custom tags when you create a pool. You can add up to
41 custom tags.
Custom tag inheritance
Pool-backed clusters inherit default and custom tags from the pool configuration. For detailed information
about how pool tags and cluster tags work together, see Monitor usage using cluster, pool, and workspace tags.
Configure custom pool tags
1. At the bottom of the pool configuration page, select the Tags tab.
2. Specify a key-value pair for the custom tag.
3. Click Add .
Spot instances
To save cost, you can choose to use spot instances by checking the All Spot radio button.
Clusters in the pool will launch with spot instances for all nodes, driver and worker (as opposed to the hybrid
on-demand driver and spot instance workers for non-pool clusters).
If spot instances are evicted due to unavailability, on-demand instances do not replace evicted instances.
Edit a pool
7/21/2022 • 2 minutes to read
Some pool configuration settings are not editable. These settings are grayed out.
You can also invoke the Edit API to programmatically edit the pool.
NOTE
Clusters that were attached to the pool remain attached after editing.
Delete a pool
7/21/2022 • 2 minutes to read
Deleting a pool terminates the pool’s idle instances and removes its configuration.
WARNING
You cannot undo this action.
To delete a pool, click the icon in the actions on the Pools page.
NOTE
Running clusters attached to the pool continue to run, but cannot allocate instances during resize or up-scaling.
Terminated clusters attached to the pool will fail to start.
You can also invoke the Delete API endpoint to programmatically delete a pool.
Attach a cluster to one or more pools
7/21/2022 • 2 minutes to read
To reduce cluster start time, you can designate predefined pools of idle instances to create worker nodes and the
driver node. This is also called attaching the cluster to the pools. The cluster is created using instances in the
pools. If a pool does not have sufficient idle resources to create the requested driver node or worker nodes, the
pool expands by allocating new instances from the instance provider. When the cluster is terminated, the
instances it used are returned to the pool and can be reused by a different cluster.
You can attach a different pool for the driver node and worker nodes, or attach the same pool for both.
IMPORTANT
You must use a pool for both the driver node and worker nodes, or for neither. Otherwise, an error occurs and your
cluster isn’t created. This prevents a situation where the driver node has to wait for worker nodes to be created, or vice
versa.
Requirements
You must have permission to attach to each pool; see Pool access control.
If you use the Clusters API, you must specify driver_instance_pool_id for the driver node and instance_pool_id
for the worker nodes.
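For example, here is a minimal sketch of the relevant fields in a cluster specification; the pool IDs are hypothetical placeholders:
# Fragment of a Clusters API request body; pool IDs are hypothetical.
cluster_spec = {
    "cluster_name": "pool-backed-cluster",
    "spark_version": "7.3.x-scala2.12",
    "num_workers": 4,
    "instance_pool_id": "0727-104344-pool-worker",        # pool for worker nodes
    "driver_instance_pool_id": "0727-104344-pool-driver",  # pool for the driver node
}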
Inherited configuration
When you attach a cluster to a pool, the following configuration properties are inherited from the pool:
Docker images.
Custom cluster tags: You can add additional custom tags for the cluster, and both the cluster-level tags and
those inherited from pools are applied. You cannot add a cluster-specific custom tag with the same key name
as a custom tag inherited from a pool (that is, you cannot override a custom tag that is inherited from the
pool).
Best practices: pools
7/21/2022 • 4 minutes to read
Clusters provide the computation resources and configurations that run your notebooks and jobs. Clusters run
on instances provisioned by your cloud provider on demand. The Azure Databricks platform provides an
efficient and cost-effective way to manage your analytics infrastructure. This article shows how to address the
following challenges when creating new clusters or scaling up existing clusters:
The execution time of your Azure Databricks job might be shorter than the time to provision instances and
start a new cluster.
When autoscaling is enabled on a cluster, it takes time for the cloud provider to provision new instances. This
can negatively impact jobs with strict performance requirements or varying workloads.
Azure Databricks pools reduce cluster start and scale-up times by maintaining a set of available, ready-to-use
instances.
You can use a different pool for the driver node and worker nodes.
For an introduction to pools and configuration recommendations, view the following video:
As shown in the following diagram, when a cluster attached to a pool needs an instance, it first attempts to
allocate one of the pool’s available instances. If the pool has no available instances, it expands by allocating a
new instance from the cloud provider to accommodate the cluster’s request. When a cluster releases an
instance, the instance returns to the pool and is free for use by another cluster. Only clusters attached to a pool
can use that pool’s available instances.
This article discusses the following best practices to ensure the best performance at the lowest cost when you
use pools:
Create pools using instance types and Azure Databricks runtimes based on target workloads.
When possible, populate pools with spot instances to reduce costs.
Populate pools with on-demand instances for jobs with short execution times and strict execution time
requirements.
Use pool tags and cluster tags to manage billing.
Use pool configuration options to minimize cost.
Pre-populate pools to make sure instances are available when clusters need them.
To learn more, see Monitor usage using cluster, pool, and workspace tags
Pre-populate pools
To benefit fully from pools, you can pre-populate newly created pools. Set the Min Idle instances greater than
zero in the pool configuration. Alternatively, if you’re following the recommendation to set this value to zero, use
a starter job to ensure that newly created pools have available instances for clusters to access.
With the starter job approach, schedule a job with flexible execution time requirements to run before jobs with
more strict performance requirements or before users start using interactive clusters. After the job finishes, the
instances used for the job are released back to the pool. Set the Min Idle Instances setting to 0 and set the Idle
Instance Auto Termination time high enough to ensure that idle instances remain available for subsequent
jobs.
Using a starter job allows the pool instances to spin up, populate the pool, and remain available for downstream
jobs or interactive clusters.
Learn more
Learn more about Azure Databricks pools.
Notebooks
7/21/2022 • 2 minutes to read
A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative
text.
For a quick introduction to notebooks, view this video:
This section describes how to manage and use notebooks. It also contains articles on creating data
visualizations, sharing visualizations as dashboards, parameterizing notebooks and dashboards with widgets,
building complex pipelines with notebooks, and best practices for defining classes in Scala notebooks.
Manage notebooks
Create a notebook
Open a notebook
Delete a notebook
Copy notebook path
Rename a notebook
Control access to a notebook
Notebook external formats
Notebooks and clusters
Schedule a notebook
Distribute notebooks
Use notebooks
Develop notebooks
Run notebooks
Share code in notebooks
Manage notebook state and results
Revision history
Version control with Git
Test notebooks
Visualizations
Create a new visualization
Create a new data profile
Work with visualizations and data profiles
Dashboards
Dashboards notebook
Create a scheduled job to refresh a dashboard
View a specific dashboard version
ipywidgets
Requirements
Usage
Example notebook
Best practices for using ipywidgets and Databricks widgets
Limitations
Databricks widgets
Databricks widget types
Databricks widget API
Configure widget settings
Databricks widgets in dashboards
Use Databricks widgets with %run
Modularize or link notebook code
Ways to modularize or link notebooks
API
Example
Pass structured data
Handle errors
Run multiple notebooks concurrently
Package cells
Package Cells notebook
IPython kernel
Benefits of using the IPython kernel
Known issue
Best practices
Requirements
Walkthrough
bamboolib
Requirements
Quickstart
Walkthroughs
Key tasks
Additional resources
Legacy visualizations
Create a legacy visualization
Machine learning visualizations
Structured Streaming DataFrames
displayHTML function
Images
Visualizations in Python
Visualizations in R
Visualizations in Scala
Deep dive notebooks for Python and Scala
Manage notebooks
7/21/2022 • 10 minutes to read
You can manage notebooks using the UI, the CLI, and by invoking the Workspace API. This article focuses on
performing notebook tasks using the UI. For the other methods, see Databricks CLI and Workspace API 2.0.
Create a notebook
Use the Create button
The easiest way to create a new notebook in your default folder is to use the Create button:
1. Click Create in the sidebar and select Notebook from the menu. The Create Notebook dialog appears.
2. Enter a name and select the notebook’s default language.
3. If there are running clusters, the Cluster drop-down displays. Select the cluster you want to attach the
notebook to.
4. Click Create .
Create a notebook in any folder
You can create a new notebook in any folder (for example, in the Shared folder) following these steps:
Next to any folder, click the menu icon on the right side of the text and select Create > Notebook .
In the workspace or a user folder, click the menu icon and select Create > Notebook .
2. Follow steps 2 through 4 in Use the Create button.
Open a notebook
In your workspace, click a notebook. The notebook path displays when you hover over the notebook title.
Delete a notebook
See Folders and Workspace object operations for information about how to access the workspace menu and
delete notebooks or other items in the workspace.
Rename a notebook
To change the title of an open notebook, click the title and edit inline or click File > Rename .
Next to any folder, click the menu icon on the right side of the text and select Import .
In the Workspace or a user folder, click the menu icon and select Import .
2. Specify the URL or browse to a file containing a supported external format or a ZIP archive of notebooks
exported from an Azure Databricks workspace.
3. Click Import .
If you choose a single notebook, it is imported into the current folder.
If you choose a DBC or ZIP archive, its folder structure is recreated in the current folder and each
notebook is imported.
Convert a file to a notebook
You can convert existing Python, SQL, Scala, and R scripts to single-cell notebooks by adding a comment to the
first cell of the file:
Python: # COMMAND ----------
SQL: -- COMMAND ----------
Scala: // COMMAND ----------
R: # COMMAND ----------
Export a notebook
In the notebook toolbar, select File > Export and a format.
NOTE
When you export a notebook as HTML, IPython notebook, or archive (DBC), and you have not cleared the results, the
results of running the notebook are included.
Next to any folder, click the menu icon on the right side of the text and select Export .
In the Workspace or a user folder, click the menu icon and select Export .
2. Select the export format:
DBC Archive : Export a Databricks archive, a binary format that includes metadata and notebook
command results.
Source File : Export a ZIP archive of notebook source files, which can be imported into an Azure
Databricks workspace, used in a CI/CD pipeline, or viewed as source files in each notebook’s default
language. Notebook command results are not included.
HTML Archive : Export a ZIP archive of HTML files. Each notebook’s HTML file can be imported into
an Azure Databricks workspace or viewed as HTML. Notebook command results are included.
If you attempt to attach a notebook to a cluster that has the maximum number of execution contexts and there are
no idle contexts (or if auto-eviction is disabled), the UI displays a message saying that the current maximum
execution contexts threshold has been reached and the notebook remains in the detached state.
If you fork a process, an idle execution context is still considered idle once execution of the request that forked
the process returns. Forking separate processes is not recommended with Spark.
Configure context auto-eviction
Auto-eviction is enabled by default. To disable auto-eviction for a cluster, set the Spark property
spark.databricks.chauffeur.enableIdleContextTracking false .
IMPORTANT
As long as a notebook is attached to a cluster, any user with the Can Run permission on the notebook has implicit
permission to access the cluster.
CLASS           VARIABLE NAME
SparkContext    sc
Do not create a SparkSession , SparkContext , or SQLContext . Doing so will lead to inconsistent behavior.
To determine the Spark version of the cluster your notebook is attached to, run:
spark.version
To determine the Databricks Runtime version of the cluster your notebook is attached to, run:
spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
NOTE
Both this sparkVersion tag and the spark_version property required by the endpoints in the Clusters API 2.0 and
Jobs API 2.1 refer to the Databricks Runtime version, not the Spark version.
You can also detach notebooks from a cluster using the Notebooks tab on the cluster details page.
When you detach a notebook from a cluster, the execution context is removed and all computed variable values
are cleared from the notebook.
TIP
Azure Databricks recommends that you detach unused notebooks from a cluster. This frees up memory space on the
driver.
1. In the notebook, click Schedule at the top right. If no jobs exist for this notebook, the Schedule
dialog appears.
If jobs already exist for the notebook, the Jobs List dialog appears. To display the Schedule dialog, click
Add a schedule .
2. In the Schedule dialog, optionally enter a name for the job. The default name is the name of the
notebook.
3. Select Manual to run your job only when manually triggered, or Scheduled to define a schedule for
running the job. If you select Scheduled , use the drop-downs to specify the frequency, time, and time
zone.
4. In the Cluster drop-down, select the cluster to run the task.
If you have Allow Cluster Creation permissions, by default the job runs on a new job cluster. To edit
the configuration of the default job cluster, click Edit at the right of the field to display the cluster
configuration dialog.
If you do not have Allow Cluster Creation permissions, by default the job runs on the cluster that the
notebook is attached to. If the notebook is not attached to a cluster, you must select a cluster from the
Cluster drop-down.
5. Optionally, enter any Parameters to pass to the job. Click Add and specify the key and value of each
parameter. Parameters set the value of the notebook widget specified by the key of the parameter. Use
Task parameter variables to pass a limited set of dynamic values as part of a parameter value.
6. Optionally, specify email addresses to receive Alerts on job events. See Notifications.
7. Click Submit .
Manage scheduled notebook jobs
To display jobs associated with this notebook, click the Schedule button. The jobs list dialog appears, showing
all jobs currently defined for this notebook. To manage jobs, click the menu icon at the right of a job in the list.
From this menu, you can edit, clone, view, pause, resume, or delete a scheduled job.
When you clone a scheduled job, a new job is created with the same parameters as the original. The new job
appears in the list with the name “Clone of ”.
How you edit a job depends on the complexity of the job’s schedule. Either the Schedule dialog or the Job details
panel displays, allowing you to edit the schedule, cluster, parameters, and so on.
Distribute notebooks
To allow you to easily distribute Azure Databricks notebooks, Azure Databricks supports the Databricks archive,
which is a package that can contain a folder of notebooks or a single notebook. A Databricks archive is a JAR file
with extra metadata and has the extension .dbc . The notebooks contained in the archive are in an Azure
Databricks internal format.
Import an archive
A notebook is a collection of runnable cells (commands). When you use a notebook, you are primarily
developing and running cells.
All notebook tasks are supported by UI actions, but you can also perform many tasks using keyboard shortcuts.
Toggle the shortcut display by clicking the icon.
Develop notebooks
This section describes how to develop notebook cells and navigate around a notebook.
In this section:
About notebooks
Add a cell
Delete a cell
Cut, copy, and paste cells
Select multiple cells or all cells
Default language
Mix languages
Include documentation
Command comments
Change cell display
Show line and command numbers
Find and replace text
Autocomplete
Format SQL
View table of contents
View notebooks in dark mode
About notebooks
A notebook has a toolbar that lets you manage the notebook and perform actions within the notebook:
and one or more cells (or commands) that you can run:
At the far right of a cell, the cell actions menu contains three menus: Run , Dashboard , and Edit , and two
actions: Hide and Delete .
Add a cell
To add a cell, mouse over a cell at the top or bottom and click the icon, or access the notebook cell menu at
the far right, click , and select Add Cell Above or Add Cell Below .
Delete a cell
Go to the cell actions menu at the far right and click (Delete).
When you delete a cell, by default a delete confirmation dialog appears. To disable future confirmation dialogs,
select the Do not show this again checkbox and click Confirm . You can also toggle the confirmation dialog
setting with the Turn on command delete confirmation option in > User Settings > Notebook
Settings .
To restore deleted cells, either select Edit > Undo Delete Cells or use the ( Z ) keyboard shortcut.
Cut, copy, and paste cells
There are several options to cut and copy cells:
Use the cell actions menu at the right of the cell. Click and select Cut Cell or Copy Cell .
Use keyboard shortcuts: Command-X or Ctrl-X to cut and Command-C or Ctrl-C to copy.
Use the Edit menu at the top of the notebook. Select Cut current cell or Copy current cell .
After you cut or copy cells, you can paste those cells elsewhere in the notebook, into a different notebook, or
into a notebook in a different browser tab or window. To paste cells, use the keyboard shortcut Command-V or
Ctrl-V . The cells are pasted below the current cell.
You can use the keyboard shortcut Command-Z or Ctrl-Z to undo cut or paste actions.
NOTE
If you are using Safari, you must use the keyboard shortcuts.
Alternately, you can use the language magic command %<language> at the beginning of a cell. The supported
magic commands are: %python , %r , %scala , and %sql .
NOTE
When you invoke a language magic command, the command is dispatched to the REPL in the execution context for the
notebook. Variables defined in one language (and hence in the REPL for that language) are not available in the REPL of
another language. REPLs can share state only through external resources such as files in DBFS or objects in object storage.
NOTE
In Python notebooks, the DataFrame _sqldf is not saved automatically and is replaced with the results of the
most recent SQL cell run. To save the DataFrame, run this code in a Python cell:
new_dataframe_name = _sqldf
If the query uses a widget for parameterization, the results are not available as a Python DataFrame.
If the query uses the keywords CACHE TABLE or UNCACHE TABLE , the results are not available as a Python
DataFrame.
Collapsible headings
Cells that appear after cells containing Markdown headings can be collapsed into the heading cell. The following
image shows a level-one heading called Heading 1 with the following two cells collapsed into it.
Display images
To display images stored in the FileStore, use the syntax:
%md
![test](files/image.png)
For example, suppose you have the Databricks logo image file in FileStore:
dbfs ls dbfs:/FileStore/
databricks-logo-mobile.png
%md
\\(c = \\pm\\sqrt{a^2 + b^2} \\)
\\(A{_i}{_j}=B{_i}{_j}\\)
\\[A{_i}{_j}=B{_i}{_j}\\]
renders as:
and
%md
\\( f(\beta)= -Y_t^T X_t \beta + \sum log( 1+{e}^{X_t\bullet\beta}) + \frac{1}{2}\delta^t S_t^{-1}\delta\\)
renders as:
Include HTML
You can include HTML in a notebook by using the function displayHTML . See HTML, D3, and SVG in notebooks
for an example of how to do this.
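For instance, running a cell like the following in a Python notebook renders a small HTML fragment inline:
displayHTML("<h1>Quarterly results</h1><p>Revenue is <b>up</b>.</p>")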
NOTE
The displayHTML iframe is served from the domain databricksusercontent.com and the iframe sandbox includes the
allow-same-origin attribute. databricksusercontent.com must be accessible from your browser. If it is currently
blocked by your corporate network, it must be added to an allow list.
Command comments
You can have discussions with collaborators using command comments.
To toggle the Comments sidebar, click the Comments button at the top right of a notebook.
To edit, delete, or reply to a comment, click the comment and choose an action.
To replace the current match, click Replace . To replace all matches in the notebook, click Replace All .
To move between matches, click the Prev and Next buttons. You can also press shift+enter and enter to go to
the previous and next matches, respectively.
To close the find and replace tool, click or press esc .
Autocomplete
You can use Azure Databricks autocomplete to automatically complete code segments as you type them. Azure
Databricks supports two types of autocomplete: local and server.
Local autocomplete completes words that are defined in the notebook. Server autocomplete accesses the cluster
for defined types, classes, and objects, as well as SQL database and table names. To activate server
autocomplete, attach your notebook to a cluster and run all cells that define completable objects.
IMPORTANT
Server autocomplete in R notebooks is blocked during command execution.
To trigger autocomplete, press Tab after entering a completable object. For example, after you define and run
the cells containing the definitions of MyClass and instance , the methods of instance are completable, and a
list of valid completions displays when you press Tab .
Type completion, as well as SQL database and table name completion, work in SQL cells and in SQL embedded
in Python.
In Databricks Runtime 7.4 and above, you can display Python docstring hints by pressing Shift+Tab after
entering a completable Python object. The docstrings contain the same information as the help() function for
an object.
Format SQL
Azure Databricks provides tools that allow you to format SQL code in notebook cells quickly and easily. These
tools reduce the effort to keep your code formatted and help to enforce the same coding standards across your
notebooks.
You can trigger the formatter in the following ways:
Single cells
Keyboard shortcut: Press Cmd+Shift+F .
Command context menu: Select Format SQL in the command context drop-down menu of a SQL
cell. This item is visible only in SQL notebook cells and those with a %sql language magic.
Multiple cells
Select multiple SQL cells and then select Edit > Format SQL Cells . If you select cells of more than one
language, only SQL cells are formatted. This includes those that use %sql .
Run notebooks
This section describes how to run one or more notebook cells.
In this section:
Requirements
Run a cell
Run all above or below
Run all cells
View multiple outputs per cell
Python and Scala error highlighting
Notifications
Databricks Advisor
Requirements
The notebook must be attached to a cluster. If the cluster is not running, the cluster is started when you run one
or more cells.
Run a cell
In the cell actions menu at the far right, click and select Run Cell , or press shift+enter .
IMPORTANT
The maximum size for a notebook cell, both contents and output, is 16MB.
For example, try running this Python code snippet that references the predefined spark variable.
spark
1+1 # => 2
NOTE
Notebooks have a number of default settings:
When you run a cell, the notebook automatically attaches to a running cluster without prompting.
When you press shift+enter , the notebook auto-scrolls to the next cell if the cell is not visible.
To change these settings, select > User Settings > Notebook Settings and configure the respective checkboxes.
IMPORTANT
Do not do a Run All if steps for mount and unmount are in the same notebook. It could lead to a race condition and
possibly corrupt the mount points.
Notebook notifications are enabled by default. You can disable them under > User Settings > Notebook
Settings .
Databricks Advisor
Databricks Advisor automatically analyzes commands every time they are run and displays appropriate advice
in the notebooks. The advice notices provide information that can assist you in improving the performance of
workloads, reducing costs, and avoiding common mistakes.
View advice
A blue box with a lightbulb icon signals that advice is available for a command. The box displays the number of
distinct pieces of advice.
Click the lightbulb to expand the box and view the advice. One or more pieces of advice will become visible.
Click the Learn more link to view documentation providing more information related to the advice.
Click the Don’t show me this again link to hide the piece of advice. The advice of this type will no longer be
displayed. This action can be reversed in Notebook Settings.
Click the lightbulb again to collapse the advice box.
Advice settings
Access the Notebook Settings page by selecting > User Settings > Notebook Settings or by clicking
the gear icon in the expanded advice box.
Because both of these notebooks are in the same directory in the workspace, use the prefix ./ in
./shared-code-notebook to indicate that the path should be resolved relative to the currently running notebook.
You can organize notebooks into directories, such as %run ./dir/notebook , or use an absolute path like
%run /Users/username@organization.com/directory/notebook .
NOTE
%run must be in a cell by itself, because it runs the entire notebook inline.
You cannot use %run to run a Python file and import the entities defined in that file into a notebook. To import
from a Python file, see Reference source code files using git. Or, package the file into a Python library, create an Azure
Databricks library from that Python library, and install the library into the cluster you use to run your notebook.
When you use %run to run a notebook that contains widgets, by default the specified notebook runs with the
widget’s default values. You can also pass in values to widgets; see Use Databricks widgets with %run.
For notebooks stored in an Azure Databricks Repo, you can reference source code files in the repository. The
following example uses a Python file rather than a notebook.
Create a new example repo to show the file layout:
Now, when you open the notebook, you can reference source code files in the repository using common
commands like import .
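For example, here is a minimal sketch that assumes the repo contains a hypothetical module helpers.py at its root, defining a function add():

# Assumes a hypothetical file helpers.py at the repo root that defines add(a, b).
from helpers import add

print(add(1, 2))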
For more information on working with files in Git repositories, see Work with non-notebook files in an Azure
Databricks repo.
Download results
By default downloading results is enabled. To toggle this setting, see Manage the ability to download results
from notebooks. If downloading results is disabled, the button is not visible.
Download a cell result
You can download a cell result that contains tabular output to your local machine. Click the button at the
bottom of a cell.
When a query returns more than 1000 rows, a down arrow is added to the button. To download all the
results of a query:
1. Click the down arrow next to and select Download full results .
After you download full results, a CSV file named export.csv is downloaded to your local machine and
the /databricks-results folder has a generated folder containing the full query results.
Hide and show cell content
Cell content consists of cell code and the result of running the cell. You can hide and show the cell code and
result using the cell actions menu at the top right of the cell.
To hide cell code:
Click and select Hide Code
To hide and show the cell result, do any of the following:
Click and select Hide Result
Select
Type Esc > Shift + o
To show hidden cell code or results, click the Show links:
NOTE
Since all notebooks attached to the same cluster execute on the same cluster VMs, even with Spark session isolation
enabled there is no guaranteed user isolation within a cluster.
IMPORTANT
Setting spark.databricks.session.share true breaks the monitoring used by both streaming notebook cells and
streaming jobs. Specifically:
The graphs in streaming cells are not displayed.
Jobs do not block as long as a stream is running (they just finish “successfully”, stopping the stream).
Streams in jobs are not monitored for termination. Instead you must manually call awaitTermination() .
Calling Create a new visualization on streaming DataFrames doesn’t work.
Cells that trigger commands in other languages (that is, cells using %scala , %python , %r , and %sql ) and cells
that include other notebooks (that is, cells using %run ) are part of the current notebook. Thus, these cells are in
the same session as other notebook cells. By contrast, a notebook workflow runs a notebook with an isolated
SparkSession , which means temporary views defined in such a notebook are not visible in other notebooks.
Revision history
Azure Databricks notebooks maintain a history of revisions, allowing you to view and restore previous
snapshots of the notebook. You can perform the following actions on revisions: add comments, restore and
delete revisions, and clear revision history.
To access notebook revisions, click Revision History at the top right of the notebook toolbar.
In this section:
Add a comment
Restore a revision
Delete a revision
Clear a revision history
Add a comment
To add a comment to the latest revision:
1. Click the revision.
2. Click the Save now link.
3. Click Confirm . The selected revision becomes the latest revision of the notebook.
Delete a revision
To delete a notebook’s revision entry:
1. Click the revision.
2. Click the trash icon .
3. Click Yes, erase . The selected revision is deleted from the notebook’s revision history.
Clear a revision history
To clear a notebook’s revision history:
1. Select File > Clear Revision History .
2. Click Yes, clear . The notebook revision history is cleared.
WARNING
Once cleared, the revision history is not recoverable.
To link a single notebook to Git, Azure Databricks also supports these Git-based version control tools:
GitHub version control
Bitbucket Cloud and Bitbucket Server version control
Azure DevOps Services version control
Test notebooks
This section covers several ways to test code in Databricks notebooks. You can use these methods separately or
together.
Many unit testing libraries work directly within the notebook. For example, you can use the built-in Python
unittest package to test notebook code.
def reverse(s):
    return s[::-1]

import unittest

class TestHelpers(unittest.TestCase):
    def test_reverse(self):
        self.assertEqual(reverse('abc'), 'cba')

# Run the tests inside the notebook; exit=False prevents unittest from stopping the kernel.
r = unittest.main(argv=[''], verbosity=2, exit=False)
To hide test code and results, select the associated menu items from the cell dropdown. Any errors that occur
appear even when results are hidden.
To run tests periodically and automatically, you can use scheduled notebooks. You can configure the job to send
notification emails to an address you specify.
For notebooks in a Databricks Repo, you can set up a CI/CD-style workflow by configuring notebook tests to run
for each commit. See Databricks GitHub Actions.
Visualizations
7/21/2022 • 3 minutes to read
Azure Databricks notebooks have built-in support for charts and visualizations. The visualizations described in
this section are available when you use the display command to view a data table result as a pandas or Apache
Spark DataFrame in a notebook cell.
For information about legacy Databricks visualizations, see Legacy visualizations.
2. Select the data to appear in the visualization. The fields available depend on the selected type.
3. Click Save .
Visualization tools
If you hover over the top right of a chart in the visualization editor, a Plotly toolbar appears where you can
perform operations such as select, zoom, and pan.
If you hover over the top right of a chart in a notebook, a subset of tools appears:
Visualization types
Visualization types in Azure Databricks notebooks
Data profiles display summary statistics of an Apache Spark DataFrame, a pandas DataFrame, or a SQL table in
tabular and graphic format. To create a data profile from a results cell, click + and select .
In this topic:
Rename, duplicate, or remove a visualization or data profile
Edit a visualization
Download a visualization
Add a visualization or data profile to a dashboard
Rename, duplicate, or remove a visualization or data profile
To rename, duplicate, or remove a visualization or data profile, click the three vertical dots at the right of the tab
name.
You can also change the name by clicking directly on it and editing the name in place.
Edit a visualization
Click beneath the visualization to open the visualization editor. When you have finished
making changes, click Save .
Edit colors
You can customize a visualization’s colors when you create the visualization or by editing it.
1. Create or edit a visualization.
2. Click Colors .
3. To modify a color, click the square and select the new color by doing one of the following:
Click it in the color selector.
Enter a hex value.
4. Click anywhere outside the color selector to close it and save changes.
Temporarily hide or show a series
To hide a series in a visualization, click the series in the legend. To show the series again, click it again in the
legend.
To show only a single series, double-click the series in the legend. To show other series, click each one.
Download a visualization
To download a visualization in .png format, click the camera icon in the notebook cell or in the visualization
editor.
In a notebook cell, the camera icon appears at the upper right when you move the cursor over the cell.
In the visualization editor, the camera icon appears when you move the cursor over the chart. See
Visualization tools.
2. Select Add to dashboard . A list of available dashboard views appears, along with a menu option Add
to new dashboard .
3. Select a dashboard or select Add to new dashboard . The dashboard appears, including the newly
added visualization or data profile.
Visualization deep dive in Python
7/21/2022 • 2 minutes to read
This article contains Python and Scala notebooks that show how to view HTML, SVG, and D3 visualizations in
notebooks.
If you want to use a custom Javascript library to render D3, see Use a Javascript library.
IMPORTANT
The maximum size for a notebook cell, both contents and output, is 16MB. Make sure that the size of the HTML
you pass to the displayHTML() function does not exceed this value.
The method for displaying Matplotlib figures depends on which version of Databricks Runtime your cluster is
running.
Databricks Runtime 6.5 and above display Matplotlib figures inline.
With Databricks Runtime 6.4 ES, you must call the %matplotlib inline magic command.
The following notebook shows how to display Matplotlib figures in Python notebooks.
To revert from the higher-resolution png2x (retina) output to the default resolution, use the png option:
set_matplotlib_formats('png')
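A minimal sketch of enabling the higher-resolution format (assuming set_matplotlib_formats is importable from IPython.display, as in the IPython versions bundled with these runtimes):

from IPython.display import set_matplotlib_formats

# Render inline Matplotlib figures at double (retina) resolution.
set_matplotlib_formats('png2x')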
Plotly is an interactive graphing library. Azure Databricks supports Plotly 2.0.7. To use Plotly, install the Plotly
PyPI package and attach it to your cluster.
NOTE
Inside Azure Databricks notebooks we recommend using Plotly Offline. Plotly Offline may not perform well when handling
large datasets. If you notice performance issues, you should reduce the size of your dataset.
With htmlwidgets for R you can generate interactive plots using R’s flexible syntax and environment. Azure
Databricks notebooks support htmlwidgets.
The setup has two steps:
1. Install pandoc, a Linux package used by htmlwidgets to generate HTML.
2. Change one function in the htmlwidgets package to make it work in Azure Databricks.
You can automate the first step using an init script so that the cluster installs pandoc when it launches. You
should do the second step, changing an htmlwidgets function, in every notebook that uses the htmlwidgets
package.
The notebook shows how to use htmlwidgets with dygraphs, leaflet, and plotly.
IMPORTANT
With each library invocation, an HTML file containing the rendered plot is downloaded. The plot does not display inline.
htmlwidgets notebook
Get notebook
ggplot2
7/21/2022 • 2 minutes to read
ggplot2 R notebook
Get notebook
Legacy visualizations
7/21/2022 • 9 minutes to read
This article describes legacy Azure Databricks visualizations. See Visualizations for current visualization support.
Azure Databricks also natively supports visualization libraries in Python and R and lets you install and use third-
party libraries.
To choose another plot type, click to the right of the bar chart and choose the plot type.
Legacy chart toolbar
Both line and bar charts have a built-in toolbar that supports a rich set of client-side interactions.
To configure a chart, click Plot Options… .
The line chart has a few custom chart options: setting a Y-axis range, showing and hiding points, and displaying
the Y-axis with a log scale.
For information about legacy chart types, see:
Legacy line charts
Color consistency across charts
Azure Databricks supports two kinds of color consistency across legacy charts: series set and global.
Series set color consistency assigns the same color to the same value if you have series with the same values
but in different orders (for example, A = ["Apple", "Orange", "Banana"] and B = ["Orange", "Banana", "Apple"]
). The values are sorted before plotting, so both legends are sorted the same way (
["Apple", "Banana", "Orange"] ), and the same values are given the same colors. However, if you have a series C
= ["Orange", "Banana"] , it would not be color consistent with set A because the set isn’t the same. The sorting
algorithm would assign the first color to “Banana” in set C but the second color to “Banana” in set A. If you want
these series to be color consistent, you can specify that charts should have global color consistency.
In global color consistency, each value is always mapped to the same color no matter what values the series
have. To enable this for each chart, select the Global color consistency checkbox.
NOTE
To achieve this consistency, Azure Databricks hashes directly from values to colors. To avoid collisions (where two values
go to the exact same color), the hash is to a large set of colors, which has the side effect that nice-looking or easily
distinguishable colors cannot be guaranteed; with many colors there are bound to be some that are very similar looking.
# Excerpt: assumes pop_df and tdata are defined earlier in the example.
# Drop rows with missing values and rename the feature and label columns, replacing spaces with _
from pyspark.sql.functions import col

pop_df = pop_df.dropna() # drop rows with missing values
exprs = [col(column).alias(column.replace(' ', '_')) for column in pop_df.columns]

# Register a UDF to convert the feature (2014_Population_estimate) column vector to a VectorUDT type and apply it to the column.
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.regression import LinearRegression

lr = LinearRegression()
modelA = lr.fit(tdata, {lr.regParam: 0.0})
ROC curves
For logistic regressions, you can render an ROC curve. To obtain this plot, supply the model, the prepped data
that is input to the fit method, and the parameter "ROC" .
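A minimal sketch of that call, where lrModel and preppedDataDF are the fitted logistic regression model and the transformed DataFrame produced in the example that follows:

display(lrModel, preppedDataDF, "ROC")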
The following example develops a classifier that predicts if an individual earns <=50K or >50K a year from
various attributes of the individual. The Adult dataset derives from census data, and consists of information
about 48842 individuals and their annual income.
The example code in this section uses one-hot encoding. The function was renamed with Apache Spark 3.0, so
the code is slightly different depending on the version of Databricks Runtime you are using. If you are using
Databricks Runtime 6.x or below, you must adjust two lines in the code as described in the code comments.
# This code uses one-hot encoding to convert all categorical variables into binary vectors.
# Run the stages as a Pipeline. This puts the data through all of the feature transformations in a single call.
partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(dataset)
preppedDataDF = pipelineModel.transform(dataset)
display(lrModel, preppedDataDF)
Decision trees
Legacy visualizations support rendering a decision tree.
To obtain this visualization, supply the decision tree model.
The following examples train a tree to recognize digits (0 - 9) from the MNIST dataset of images of handwritten
digits and then displays the tree.
Python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import DecisionTreeClassifier

trainingDF = spark.read.format("libsvm").load("/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt").cache()
testDF = spark.read.format("libsvm").load("/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt").cache()

indexer = StringIndexer().setInputCol("label").setOutputCol("indexedLabel")
dtc = DecisionTreeClassifier().setLabelCol("indexedLabel")

# Chain the indexer and classifier into a pipeline, train it, and display the fitted tree.
pipeline = Pipeline().setStages([indexer, dtc])
model = pipeline.fit(trainingDF)
display(model.stages[-1])
Scala
display(tree)
streaming_df = spark.readStream.format("rate").load()
display(streaming_df.groupBy().count())
Python
streaming_df = spark.readStream.format("rate").load()
display(streaming_df.groupBy().count(), processingTime = "5 seconds", checkpointLocation = "dbfs:/<checkpoint-path>")
Scala
import org.apache.spark.sql.streaming.Trigger
For more information about these parameters, see Starting Streaming Queries.
displayHTML function
Azure Databricks programming language notebooks (Python, R, and Scala) support HTML graphics using the
displayHTML function; you can pass the function any HTML, CSS, or JavaScript code. This function supports
interactive graphics using JavaScript libraries such as D3.
For examples of using displayHTML , see:
HTML, D3, and SVG in notebooks
Embed static images in notebooks
NOTE
The displayHTML iframe is served from the domain databricksusercontent.com , and the iframe sandbox includes the
allow-same-origin attribute. databricksusercontent.com must be accessible from your browser. If it is currently
blocked by your corporate network, it must be added to an allow list.
Images
Columns containing image data types are rendered as rich HTML. Azure Databricks attempts to render image
thumbnails for DataFrame columns matching the Spark ImageSchema. Thumbnail rendering works for any
images successfully read in through the spark.read.format('image') function. For image values generated
through other means, Azure Databricks supports the rendering of 1, 3, or 4 channel images (where each channel
consists of a single byte), with the following constraints:
One-channel images : mode field must be equal to 0. height , width , and nChannels fields must
accurately describe the binary image data in the data field.
Three-channel images : mode field must be equal to 16. height , width , and nChannels fields must
accurately describe the binary image data in the data field. The data field must contain pixel data in three-
byte chunks, with the channel ordering (blue, green, red) for each pixel.
Four-channel images : mode field must be equal to 24. height , width , and nChannels fields must
accurately describe the binary image data in the data field. The data field must contain pixel data in four-
byte chunks, with the channel ordering (blue, green, red, alpha) for each pixel.
Example
Suppose you have a folder containing some images:
If you read the images into a DataFrame with ImageSchema.readImages and then display the DataFrame, Azure
Databricks renders thumbnails of the images:
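As a minimal sketch, the spark.read.format('image') reader mentioned earlier loads such a folder into a DataFrame whose thumbnails render the same way (the folder path is hypothetical):

# Hypothetical folder of images; any path readable by the cluster works.
image_df = spark.read.format("image").load("/FileStore/tmp/images/")
display(image_df)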
import seaborn as sns  # Seaborn example

df = sns.load_dataset("iris")
g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=3)
g.map_upper(sns.regplot)
display(g.fig)
Other Python libraries
Bokeh
Matplotlib
Plotly
Visualizations in R
To plot data in R, use the display function as follows:
library(SparkR)
diamonds_df <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "csv", header="true", inferSchema = "true")
# Display the SparkR DataFrame in the notebook
display(diamonds_df)

library(lattice)
xyplot(price ~ carat | cut, diamonds, scales = list(log = TRUE), type = c("p", "g", "smooth"), ylab = "Log price")
DandEFA
The DandEFA package supports dandelion plots.
Visualizations in Scala
To plot data in Scala, use the display function as follows:
val diamonds_df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
display(diamonds_df.groupBy("color").avg("price").orderBy("color"))
Dashboards allow you to publish graphs and visualizations derived from notebook output and share them in a
presentation format with your organization. View the notebook to learn how to create and organize dashboards.
The remaining sections describe how to schedule a job to refresh the dashboard and how to view a specific
dashboard version.
Dashboards notebook
Get notebook
IMPORTANT
This feature is in Public Preview.
ipywidgets are visual elements that allow users to specify parameter values in notebook cells. You can use
ipywidgets to make your Databricks Python notebooks interactive.
The ipywidgets package includes over 30 different controls, including form controls such as sliders, text boxes,
and checkboxes, as well as layout controls such as tabs, accordions, and grids. Using these elements, you can
build graphical user interfaces to interface with your notebook code.
NOTE
For information about Databricks widgets, see Databricks widgets. For guidelines on when to use Databricks widgets or
ipywidgets, see Best practices for using ipywidgets and Databricks widgets.
Requirements
ipywidgets are available in Databricks Runtime 11.0 and above.
Usage
The following code creates a histogram with a slider that can take on values between 3 and 10. The value of the
widget determines the number of bins in the histogram. As you move the slider, the histogram updates
immediately. See the example notebook to try this out.
from ipywidgets import interact

# Load a dataset
sparkDF = spark.read.csv("/databricks-datasets/bikeSharing/data-001/day.csv", header="true", inferSchema="true")

# In this code, `bins=(3, 10)` defines an integer slider widget that allows values between 3 and 10.
@interact(bins=(3, 10))
def plot_histogram(bins):
    pdf = sparkDF.toPandas()
    pdf.hist(column='temp', bins=bins)
The following code creates an integer slider that can take on values between 0 and 10. The default value is 5. To
access the value of the slider in your code, use int_slider.value .
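A minimal sketch of that slider (the variable name int_slider is illustrative):

import ipywidgets as widgets

# Integer slider that allows values between 0 and 10, defaulting to 5.
int_slider = widgets.IntSlider(min=0, max=10, value=5)
int_slider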
Limitations
A notebook using ipywidgets must be attached to a running cluster.
Widget states are not preserved across notebook sessions. You must re-run widget cells to render them each
time you attach the notebook to a cluster.
The following ipywidgets are not supported: Password, File Upload, Controller.
HTMLMath and Label widgets with LaTeX expressions do not render correctly. (For example,
widgets.Label(value=r'$$\frac{x+1}{x-1}$$') does not render correctly.)
Widgets might not render properly if the notebook is in dark mode, especially colored widgets.
Widget outputs cannot be used in notebook dashboard views.
The maximum message payload size for an ipywidget is 1 MB. Widgets that use images or large text data
may not be properly rendered.
Databricks widgets
7/21/2022 • 6 minutes to read
Input widgets allow you to add parameters to your notebooks and dashboards. The widget API consists of calls
to create various types of input widgets, remove them, and get bound values.
NOTE
In Databricks Runtime 11.0 and above, you can also use ipywidgets in Databricks notebooks.
TIP
View the documentation for the widget API in Scala, Python, and R with the following command:
dbutils.widgets.help()
Widget dropdowns and text boxes appear immediately following the notebook toolbar.
dbutils.widgets.dropdown("x1232133123", "1", [str(x) for x in range(1, 10)], "hello this is a widget 2")
You can access the current value of the widget with the call:
dbutils.widgets.get("X")
dbutils.widgets.remove("X")
dbutils.widgets.removeAll()
IMPORTANT
If you add a command to remove a widget, you cannot add a subsequent command to create a widget in the same cell.
You must create the widget in another cell.
dbutils.widgets.help("dropdown")
You can create a dropdown widget by passing a unique identifying name, default value, and list of default
choices, along with an optional label. Once you create it, a dropdown input widget appears at the top of the
notebook. These input widgets are notebook-level entities.
If you try to create a widget that already exists, the configuration of the existing widget is overwritten with the
new options.
Databricks widgets in SQL
The API to create widgets in SQL is slightly different from, but just as powerful as, the APIs for the other languages. The following is an example of creating a text input widget.
To specify the selectable values in a dropdown widget in SQL, you can write a sub-query. The first column of the
resulting table of the sub-query determines the values.
The following cell creates a dropdown widget from a sub-query over a table.
CREATE WIDGET DROPDOWN cuts DEFAULT "Good" CHOICES SELECT DISTINCT cut FROM diamonds
The default value specified when you create a dropdown widget must be one of the selectable values and must
be specified as a string literal. To access the current selected value of an input widget in SQL, you can use a
special UDF function in your query. The function is getArgument() . For example:
SELECT COUNT(*) AS numChoices, getArgument("cuts") AS cuts FROM diamonds WHERE cut = getArgument("cuts")
NOTE
getArgument is implemented as a Scala UDF and is not supported on a table ACL-enabled high concurrency cluster. On
such clusters, use the $<parameter> syntax shown in the following example.
You can also use the $<parameter> syntax to access the current value of a SQL input widget:
IMPORTANT
In general, you cannot use widgets to pass arguments between different languages within a notebook. You can create a
widget arg1 in a Python cell and use it in a SQL or Scala cell if you run cell by cell. However, it will not work if you
execute all the commands using Run All or run the notebook as a job. To work around this limitation, we recommend
that you create a notebook for each language and pass the arguments when you run the notebook.
NOTE
Databricks will end support for rendering legacy SQL widgets on January 15, 2022. To ensure that your widgets continue
to render in the UI, update your code to use the SQL widgets. You can still use $<parameter> in your code to get the
parameters passed to a notebook using %run .
The old way of creating widgets in SQL queries with the $<parameter> syntax still works as before. Here is an
example:
SELECT * FROM diamonds WHERE cut LIKE '%$cuts%'
NOTE
To escape the $ character in a SQL string literal, use \$ . For example, the string $1,000 can be expressed as
"\$1,000" . The $ character cannot be escaped for SQL identifiers.
Run Notebook : Every time a new value is selected, the entire notebook is rerun.
Run Accessed Commands : Every time a new value is selected, only cells that retrieve the values
for that particular widget are rerun. This is the default setting when you create a widget.
NOTE
SQL cells are not rerun in this configuration.
3. To pin the widgets to the top of the notebook or to place the widgets above the first cell, click . The
setting is saved on a per-user basis.
4. If you have Can Manage permission for notebooks, you can configure the widget layout by clicking
. Each widget’s order and size can be customized. To save or dismiss your changes, click .
NOTE
The widget layout is saved with the notebook.
If the widget layout is configured, new widgets will be added out of alphabetical order.
5. To reset the widget layout to a default order and size, click to open the Widget Panel Settings
dialog and then click Reset Layout .
NOTE
The widget layout cannot be reset by the removeAll() command.
Notebook
You can see a demo of how the Run Accessed Commands setting works in the following notebook. The year
widget is created with setting 2014 and is used in DataFrame API and SQL commands.
When you change the setting of the year widget to 2007 , the DataFrame command reruns, but the SQL
command is not rerun.
Widget demo notebook
Get notebook
This article describes how to use Databricks notebooks to code complex workflows that use modular code,
linked or embedded notebooks, and if-then-else logic.
These methods, like all of the dbutils APIs, are available only in Python and Scala. However, you can use
dbutils.notebook.run() to invoke an R notebook.
WARNING
Jobs based on notebook workflows must complete in 30 days or less. Longer-running jobs based on modularized or
linked notebook tasks aren’t supported.
API
The methods available in the dbutils.notebook API to build notebook workflows are: run and exit . Both
parameters and return values must be strings.
run(path: String, timeout_seconds: int, arguments: Map): String
Run a notebook and return its exit value. The method starts an ephemeral job that runs immediately.
The timeout_seconds parameter controls the timeout of the run (0 means no timeout): the call to run throws an
exception if it doesn’t finish within the specified time. If Azure Databricks is down for more than 10 minutes, the
notebook run fails regardless of timeout_seconds .
The arguments parameter sets widget values of the target notebook. Specifically, if the notebook you are
running has a widget named A , and you pass a key-value pair ("A": "B") as part of the arguments parameter
to the run() call, then retrieving the value of widget A will return "B" . You can find the instructions for
creating and working with widgets in the Databricks widgets article.
WARNING
The arguments parameter accepts only Latin characters (ASCII character set). Using non-ASCII characters will return an
error. Examples of invalid, non-ASCII characters are Chinese, Japanese kanjis, and emojis.
run Usage
Python
Scala
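A minimal Python sketch of calling run , with a hypothetical notebook path and argument:

dbutils.notebook.run("/path/to/other-notebook", 60, {"argument": "data"})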
run Example
Suppose you have a notebook named workflows with a widget named foo that prints the widget’s value:
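A minimal sketch of the caller (the 60-second timeout is illustrative):

dbutils.notebook.run("workflows", 60, {"foo": "bar"})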
The widget had the value you passed in through the workflow, "bar" , rather than the default.
exit(value: String): void Exit a notebook with a value. If you call a notebook using the run method, this is the
value returned.
dbutils.notebook.exit("returnValue")
Calling dbutils.notebook.exit in a job causes the notebook to complete successfully. If you want to cause the
job to fail, throw an exception.
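A minimal sketch of failing the job from the notebook (the message text is illustrative):

raise Exception("Notebook failed: input validation did not pass")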
Example
In the following example, you pass arguments to DataImportNotebook and run different notebooks (
DataCleaningNotebook or ErrorHandlingNotebook ) based on the result from DataImportNotebook .
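A minimal sketch of that flow, in which the data_source argument name and the "OK" exit value are assumptions:

status = dbutils.notebook.run("DataImportNotebook", 3600, {"data_source": "default"})

if status == "OK":
    dbutils.notebook.run("DataCleaningNotebook", 3600)
else:
    dbutils.notebook.run("ErrorHandlingNotebook", 3600)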
When the notebook workflow runs, you see a link to the running notebook:
Click the notebook link Notebook job #xxxx to view the details of the run:
## In callee notebook
spark.range(5).toDF("value").createOrReplaceGlobalTempView("my_data")
dbutils.notebook.exit("my_data")
## In caller notebook
returned_table = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(table(global_temp_db + "." + returned_table))
## In callee notebook
dbutils.fs.rm("/tmp/results/my_data", recurse=True)
spark.range(5).toDF("value").write.format("parquet").save("dbfs:/tmp/results/my_data")
dbutils.notebook.exit("dbfs:/tmp/results/my_data")
## In caller notebook
returned_table = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
display(spark.read.format("parquet").load(returned_table))
## In callee notebook
import json
dbutils.notebook.exit(json.dumps({
"status": "OK",
"table": "my_data"
}))
## In caller notebook
result = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
print(json.loads(result))
Scala
// Example 1 - returning data through temporary views.
// You can only return one string using dbutils.notebook.exit(), but since called notebooks reside in the
same JVM, you can
// return a name referencing data stored in a temporary view.
Handle errors
This section illustrates how to handle errors in notebook workflows.
Python
# Errors in workflows throw a WorkflowException.
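# A minimal sketch that mirrors the Scala example below: retry a failed notebook run a few
# times before giving up. The notebook path passed at the bottom is a placeholder.
def run_with_retry(notebook, timeout, args={}, max_retries=3):
    num_retries = 0
    while True:
        try:
            return dbutils.notebook.run(notebook, timeout, args)
        except Exception as e:
            if num_retries >= max_retries:
                raise e
            print("Error, retrying:", e)
            num_retries += 1

run_with_retry("LOCATION_OF_CALLEE_NOTEBOOK", 60, max_retries=3)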
Scala
import com.databricks.WorkflowException
// Since dbutils.notebook.run() is just a function call, you can retry failures using standard Scala try-catch
// control flow. Here we show an example of retrying a notebook a number of times.
def runRetry(notebook: String, timeout: Int, args: Map[String, String] = Map.empty, maxTries: Int = 3): String = {
var numTries = 0
while (true) {
try {
return dbutils.notebook.run(notebook, timeout, args)
} catch {
case e: WorkflowException if numTries < maxTries =>
println("Error, retrying: " + e)
}
numTries += 1
}
"" // not reached
}
To use custom Scala classes and objects defined within notebooks reliably in Spark and across notebook
sessions, you should define classes in package cells. A package cell is a cell that is compiled when it is run. A
package cell has no visibility with respect to the rest of the notebook. You can think of it as a separate Scala file.
Only class and object definitions can go in a package cell. You cannot have any values, variables, or function
definitions.
The following notebook shows what can happen if you do not use package cells and provides some examples,
caveats, and best practices.
The IPython kernel is a Jupyter kernel for Python code execution. Jupyter, and other compatible notebooks, use
the IPython kernel for executing Python notebook code.
In Databricks Runtime 11.0 and above, Python notebooks use the IPython kernel to execute Python code.
Known issue
The IPython command update_display only updates the outputs of the current cell.
bamboolib
7/21/2022 • 20 minutes to read
IMPORTANT
This feature is in Public Preview.
NOTE
bamboolib is supported in Databricks Runtime 11.0 and above.
bamboolib is a user interface component that allows no-code data analysis and transformations from within an
Azure Databricks notebook. bamboolib helps users more easily work with their data and speeds up common
data wrangling, exploration, and visualization tasks. As users complete these kinds of tasks with their data,
bamboolib automatically generates Python code in the background. Users can share this code with others, who
can run this code in their own notebooks to quickly reproduce those original tasks. They can also use bamboolib
to extend those original tasks with additional data tasks, all without needing to know how to code. Those who
are experienced with coding can extend this code to create even more sophisticated results.
Behind the scenes, bamboolib uses ipywidgets, which is an interactive HTML widget framework for the IPython
kernel. ipywidgets runs inside of the IPython kernel.
Contents
Requirements
Quickstart
Walkthroughs
Key tasks
Additional resources
Requirements
An Azure Databricks notebook, which is attached to an Azure Databricks cluster with Databricks Runtime
11.0 or above.
The bamboolib library must be available to the notebook. You can install the library in the workspace from
PyPI, install the library only on a specific cluster from PyPI, or make the library available only to a specific
notebook with the %pip command.
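For example, a minimal notebook-scoped install, run in a cell of the notebook itself:

%pip install bamboolib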
Quickstart
1. Create a Python notebook.
2. Attach the notebook to a cluster that meets the requirements.
3. In the notebook’s first cell, enter the following code, and then run the cell.
bam
NOTE
Alternatively, you can print an existing pandas DataFrame to display bamboolib for use with that specific
DataFrame.
Walkthroughs
You can use bamboolib by itself or with an existing pandas DataFrame.
Use bamboolib by itself
In this walkthrough, you use bamboolib to display in your notebook the contents of an example sales data set.
You then experiment with some of the related notebook code that bamboolib automatically generates for you.
You finish by querying and sorting a copy of the sales data set’s contents.
1. Create a Python notebook.
2. Attach the notebook to a cluster that meets the requirements.
3. In the notebook’s first cell, enter the following code, and then run the cell.
4. In the notebook’s second cell, enter the following code, and then run the cell.
bam
b. Add to this code so that it displays only those rows where order_prio is C , and then run the cell:
import pandas as pd
df = pd.read_csv(bam.sales_csv)
# Step: Keep rows where item_type is one of: Baby Food
df = df.loc[df['item_type'].isin(['Baby Food'])]
# Added step: keep only the rows where order_prio is C
df = df[df['order_prio'] == 'C']
df
TIP
Instead of writing this code, you can also do the same thing by just using bamboolib in the second cell to display
only those rows where order_prio is C . This step is an example of extending the code that bamboolib
automatically generated earlier.
NOTE
This is equivalent to writing the following code yourself:
df = df.sort_values(by=['region'], ascending=[True])
df
You could have also just used bamboolib in the second cell to sort the rows by region in ascending order. This
step demonstrates how you can use bamboolib to extend the code that you write. As you use bamboolib, it
automatically generates the additional code for you in the background, so that you can further extend your
already-extended code!
4. In the notebook’s second cell, enter the following code, and then run the cell.
import pandas as pd
df = pd.read_csv(bam.sales_csv)
df
Note that bamboolib only supports pandas DataFrames. To convert a PySpark DataFrame to a pandas
DataFrame, call toPandas on the PySpark DataFrame. To convert a Pandas API on Spark DataFrame to a
pandas DataFrame, call to_pandas on the Pandas API on Spark DataFrame.
5. Click Show bamboolib UI .
6. Display all of the rows where item_type is Baby Food :
a. In the Search actions list, select Filter rows .
b. In the Filter rows pane, in the Choose list (above where ), select Select rows .
c. In the list below where , select item_type .
d. In the Choose list next to item_type , select has value(s) .
e. In the Choose value(s) box next to has value(s) , select Baby Food .
f. Click Execute .
7. Copy the automatically generated Python code for this query:
a. Click Get Code .
b. In the Export code pane, click Copy code .
8. Paste and modify the code:
a. In the notebook’s third cell, paste the code that you copied. It should look like this:
b. Add to this code so that it displays only those rows where order_prio is C , and then run the cell:
TIP
Instead of writing this code, you can also do the same thing by just using bamboolib in the second cell to display
only those rows where order_prio is C . This step is an example of extending the code that bamboolib
automatically generated earlier.
NOTE
This is equivalent to writing the following code yourself:
df = df.sort_values(by=['region'], ascending=[True])
df
You could have also just used bamboolib in the second cell to sort the rows by region in ascending order. This
step demonstrates how you can use bamboolib to extend the code that you write. As you use bamboolib, it
automatically generates the additional code for you in the background, so that you can further extend your
already-extended code!
Key tasks
In this section:
Add the widget to a cell
Clear the widget
Data loading tasks
Data action tasks
Data action history tasks
Get code to programmatically recreate the widget’s current state as a DataFrame
Add the widget to a cell
Scenario : You want the bamboolib widget to display in a cell.
1. Make sure the notebook meets the requirements for bamboolib.
2. Run the following code in the notebook, preferably in the notebook’s first cell:
import bamboolib as bam
3. Option 1 : In the cell where you want the widget to appear, add the following code, and then run the cell:
bam
import pandas as pd
from datetime import date, datetime

df = pd.DataFrame({
    'a': [ 1, 2, 3 ],
    'b': [ 2., 3., 4. ],
    'c': [ 'string1', 'string2', 'string3' ],
    'd': [ date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1) ],
    'e': [ datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0) ]
})
df
bam
The widget clears and then redisplays the Databricks: Read CSV file from DBFS , Databricks: Load
database table , and Load dummy data buttons.
NOTE
If the error name 'bam' is not defined appears, run the following code in the notebook (preferably in the notebook’s
first cell), and then try again:
import bamboolib as bam
Option 2 : In a cell that contains a reference to a pandas DataFrame, print the DataFrame again by running the
cell again. The widget clears and then displays the new data.
Data loading tasks
In this section:
Read an example dataset’s contents into the widget
Read a CSV file’s contents into the widget
Read a database table’s contents into the widget
Read an example dataset’s contents into the widget
Scenario : You want to read some example data into the widget, for example some pretend sales data, so that
you can test out the widget’s functionality.
1. Click Load dummy data .
NOTE
If Load dummy data is not visible, clear the widget with Option 1 and try again.
2. In the Load dummy data pane, for Load a dummy data set for testing bamboolib , select the name
of the dataset that you want to load.
3. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a
DataFrame, or leave df as the default programmatic identifier.
4. Click Execute .
The widget displays the contents of the dataset.
TIP
You can switch the current widget to display the contents of a different example dataset:
1. In the current widget, click the Load dummy data tab.
2. Follow the preceding steps to read the other example dataset’s contents into the widget.
NOTE
If Databricks: Read CSV file from DBFS is not visible, clear the widget with Option 1 and try again.
2. In the Read CSV from DBFS pane, browse to the location that contains the target CSV file.
3. Select the target CSV file.
4. For Dataframe name , enter a name for the programmatic identifier of the CSV file’s contents as a
DataFrame, or leave df as the default programmatic identifier.
5. For CSV value separator , enter the character that separates values in the CSV file, or leave the ,
(comma) character as the default value separator.
6. For Decimal separator , enter the character that separates decimals in the CSV file, or leave the . (dot)
character as the default value separator.
7. For Row limit: read the first N rows - leave empty for no limit , enter the maximum number of
rows to read into the widget, or leave 100000 as the default number of rows, or leave this box empty to
specify no row limit.
8. Click Open CSV file .
The widget displays the contents of the CSV file, based on the settings that you specified.
TIP
You can switch the current widget to display the contents of a different CSV file:
1. In the current widget, click the Read CSV from DBFS tab.
2. Follow the preceding steps to read the other CSV file’s contents into the widget.
NOTE
If Databricks: Load database table is not visible, clear the widget with Option 1 and try again.
2. In the Databricks: Load database table pane, for Database - leave empty for default database ,
enter the name of the database in which the target table is located, or leave this box empty to specify the
default database.
3. For Table , enter the name of the target table.
4. For Row limit: read the first N rows - leave empty for no limit , enter the maximum number of
rows to read into the widget, or leave 100000 as the default number of rows, or leave this box empty to
specify no row limit.
5. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a
DataFrame, or leave df as the default programmatic identifier.
6. Click Execute .
The widget displays the contents of the table, based on the settings that you specified.
TIP
You can switch the current widget to display the contents of a different table:
1. In the current widget, click the Databricks: Load database table tab.
2. Follow the preceding steps to read the other table’s contents into the widget.
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type drop , and then select Select or drop columns .
Select Select or drop columns .
2. In the Select or drop columns pane, in the Choose drop-down list, select Drop .
3. Select the target column names or inclusion criterion.
4. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
5. Click Execute .
Filter rows
Scenario : You want to show or hide specific table rows based on criteria such as specific column values that are
matching or missing. For example, in the dummy Sales dataset , you want to show only those rows where the
item_type column’s value is set to Baby Food .
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type filter , and then select Filter rows .
Select Filter rows .
2. In the Filter rows pane, in the Choose drop-down list above where , select Select rows or Drop rows .
3. Specify the first filter criterion.
4. To add another filter criterion, click add condition , and specify the next filter criterion. Repeat as desired.
5. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
6. Click Execute .
Sort rows
Scenario : You want to sort table rows based on the values within one or more columns. For example, in the
dummy Sales dataset , you want to show the rows by the region column’s values in alphabetical order from A
to Z.
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type sort , and then select Sort rows .
Select Sort rows .
2. In the Sort column(s) pane, choose the first column to sort by and the sort order.
3. To add another sort criterion, click add column , and specify the next sort criterion. Repeat as desired.
4. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
5. Click Execute .
Grouping rows and columns tasks
In this section:
Scenario : You want to show row and column results by calculated groupings, and you want to assign custom
names to those groupings. For example, in the dummy Sales dataset , you want to group the rows by the
country column’s values, showing the numbers of rows containing the same country value, and giving the list
of calculated counts the name country_count .
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type group , and then select Group by and aggregate (with renaming) .
Select Group by and aggregate (with renaming) .
2. In the Group by with column rename pane, select the columns to group by, the first calculation, and
optionally specify a name for the calculated column.
3. To add another calculation, click add calculation , and specify the next calculation and column name. Repeat
as desired.
4. Specify where to store the result.
5. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
6. Click Execute .
Group rows and columns by multiple aggregate functions
Scenario : You want to show row and column results by calculated groupings. For example, in the dummy Sales
dataset , you want to group the rows by the region , country , and sales_channel columns’ values, showing
the numbers of rows containing the same region and country value by sales_channel , as well as the
total_revenue by unique combination of region , country , and sales_channel .
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type group , and then select Group by and aggregate (default) .
Select Group by and aggregate (default) .
2. In the Group by with column rename pane, select the columns to group by and the first calculation.
3. To add another calculation, click add calculation , and specify the next calculation. Repeat as desired.
4. Specify where to store the result.
5. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
6. Click Execute .
Remove rows with missing values
Scenario : You want to remove any row that has a missing value for the specified columns. For example, in the
dummy Sales dataset , you want to remove any rows that have a missing item_type value.
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type drop or remove , and then select Drop missing values .
Select Drop missing values .
2. In the Drop missing values pane, select the columns to remove any row that has a missing value for that
column.
3. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
4. Click Execute .
Remove duplicated rows
Scenario : You want to remove any row that has a duplicated value for the specified columns. For example, in
the dummy Sales dataset , you want to remove any rows that are exact duplicates of each other.
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type drop or remove , and then select Drop/Remove duplicates .
Select Drop/Remove duplicates .
2. In the Remove Duplicates pane, select the columns to remove any row that has a duplicated value for
those columns, and then select whether to keep the first or last row that has the duplicated value.
3. For Dataframe name , enter a name for the programmatic identifier of the table’s contents as a DataFrame,
or leave df as the default programmatic identifier.
4. Click Execute .
Find and replace missing values
Scenario : You want to replace the missing value with a replacement value for any row with the specified
columns. For example, in the dummy Sales dataset , you want to replace any row with a missing value in the
item_type column with the value Unknown Item Type .
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type find or replace , and then select Find and replace missing values .
Select Find and replace missing values .
2. In the Replace missing values pane, select the columns to replace missing values for, and then specify the
replacement value.
3. Click Execute .
Create a column formula
Scenario : You want to create a column that uses a unique formula. For example, in the dummy Sales dataset ,
you want to create a column named profit_per_unit that displays the result of dividing the total_profit
column value by the units_sold column value for each row.
1. On the Data tab, in the Search actions drop-down list, do one of the following:
Type formula , and then select New column formula .
Select New column formula .
2. In the pane that appears, enter a name for the new column (for example, profit_per_unit ), and then specify the
formula that calculates its values (for example, total_profit divided by units_sold ).
3. Click Execute .
Data action history tasks
In this section:
View the list of actions taken in the widget
Undo the most recent action taken in the widget
Redo the most recent action taken in the widget
Change the most recent action taken in the widget
View the list of actions taken in the widget
Scenario : You want to see a list of all of the changes that were made in the widget, starting with the most recent
change.
Click History . The list of actions appears in the Transformations history pane.
Undo the most recent action taken in the widget
Scenario : You want to revert the most recent change that was made in the widget.
Do one of the following:
Click the counterclockwise arrow icon.
Click History , and in the Transformations history pane, click Undo last step .
Redo the most recent action taken in the widget
Scenario : You want to revert the most recent revert that was made in the widget.
Do one of the following:
Click the clockwise arrow icon.
Click History , and in the Transformations history pane, click Recover last step .
Change the most recent action taken in the widget
Scenario : You want to change the most recent change that was taken in the widget.
1. Do one of the following:
Click the pencil icon.
Click History , and in the Transformations history pane, click Edit last step .
2. Make the desired change, and then click Execute .
Get code to programmatically recreate the widget’s current state as a DataFrame
Scenario : You want to get Python code that programmatically recreates the current widget’s state, represented
as a pandas DataFrame. You want to run this code in a different cell in this notebook or in a different notebook
altogether.
1. Click Get Code .
2. In the Export code pane, click Copy code . The code is copied to your system’s clipboard.
3. Paste the code into a different cell in this notebook or into a different notebook.
4. Write additional code to work with this pandas DataFrame programmatically, and then run the cell. For
example, to display the DataFrame’s contents, assuming that your DataFrame is represented
programmatically by df :
df
Additional resources
bamboolib Documentation
Software engineering best practices for notebooks
7/21/2022 • 25 minutes to read
This article provides a hands-on walkthrough that demonstrates how to apply software engineering best
practices to your Azure Databricks notebooks, including version control, code sharing, testing, and optionally
continuous integration and continuous delivery or deployment (CI/CD).
In this walkthrough, you will:
Add notebooks to Azure Databricks Repos for version control.
Extract portions of code from one of the notebooks into a shareable module.
Test the shared code.
Run the notebooks from an Azure Databricks job.
Optionally apply CI/CD to the shared code.
Requirements
To complete this walkthrough, you must have the following resources:
A remote repository with a Git provider that Databricks supports. This article’s walkthrough uses GitHub.
This walkthrough assumes that you have a GitHub repository named best-notebooks available. (You can
give your repository a different name. If you do, replace best-notebooks with your repo’s name
throughout this walkthrough.) Create a GitHub repo if you do not already have one.
NOTE
If you create a new repo, be sure to initialize the repository with at least one file, for example a README file.
An Azure Databricks workspace. Create a workspace if you do not already have one.
An Azure Databricks all-purpose cluster in the workspace. To run notebooks during the design phase, you
attach the notebooks to a running all-purpose cluster. Later on, this walkthrough uses an Azure
Databricks job to automate running the notebooks on this cluster. (You can also run jobs on job clusters
that exist only for the jobs’ lifetimes.) Create an all-purpose cluster if you do not already have one.
NOTE
To work with files in Databricks Repos, participating clusters must have Databricks Runtime 8.4 or higher installed.
Databricks recommends that these clusters have the latest Long Term Support (LTS) version installed, which is
Databricks Runtime 10.4 LTS.
Walkthrough
In this walkthrough, you will:
1. Connect your existing GitHub repo to Azure Databricks Repos.
2. Add an existing notebook to the repo and then run the notebook for the first time.
3. Move some code from the notebook into a shared module. Run the notebook for the second time to make
sure that the notebook calls the shared code as expected.
4. Use a second notebook to test the shared code separately without running the first notebook again.
5. Create an Azure Databricks job to run the two notebooks automatically, either on-demand or on a regular
schedule.
6. Set up the repo to run the second notebook that tests the shared code whenever a pull request is created in
the repo.
7. Make a pull request that changes the shared code, which will trigger the tests to run automatically.
The following steps walk you through each of these activities.
Steps
Step 1: Set up Databricks Repos
Step 2: Import and run the notebook
Step 3: Move code into a shared module
Step 4: Test the shared code
Step 5: Create a job to run the notebooks
(Optional) Step 6: Set up the repo to test the code and run the notebook automatically whenever the code
changes
(Optional) Step 7: Update the shared code in GitHub to trigger tests
Step 1: Set up Databricks Repos
In this step, you connect your existing GitHub repo to Azure Databricks Repos in your existing Azure Databricks
workspace.
To enable your workspace to connect to your GitHub repo, you must first provide your workspace with your
GitHub credentials, if you have not done so already.
Step 1.1: Provide your GitHub credentials
1. In your workspace, on the sidebar in the Data Science & Engineering or Databricks Machine Learning
environment, click Settings > User Settings .
2. On the User Settings page, click Git integration .
3. On the Git integration tab, for Git provider , select GitHub .
4. For Git provider username or email , enter your GitHub username.
5. For Token , enter your GitHub personal access token. This token must have the repo permission.
6. Click Save .
Step 1.2: Connect to your GitHub repo
1. On the sidebar in the Data Science & Engineering or Databricks Machine Learning environment, click
Repos .
2. In the Repos pane, click Add Repo .
3. In the Add Repo dialog:
a. Click Clone remote Git repo .
b. For Git repository URL , enter the GitHub Clone with HTTPS URL for your GitHub repo. This article
assumes that your URL ends with best-notebooks.git , for example
https://github.com/<your-GitHub-username>/best-notebooks.git .
c. In the drop-down list next to Git repository URL , select GitHub .
d. Leave Repo name set to the name of your repo, for example best-notebooks .
e. Click Create .
Step 2: Import and run the notebook
In this step, you import an existing external notebook into your repo. You could create your own notebooks for
this walkthrough, but to speed things up we provide them for you here.
Step 2.1: Create a working branch in the repo
In this substep, you create a branch named eda in your repo. This branch enables you to work on files and code
independently from your repo’s main branch, which is a software engineering best practice. (You can give your
branch a different name.)
NOTE
In some repos, the main branch may be named master instead. If so, replace main with master throughout this
walkthrough.
TIP
If you’re not familiar with working in Git branches, see Git Branches - Branches in a Nutshell on the Git website.
1. If the Repos pane is not showing, then on the sidebar in the Data Science & Engineering or
Databricks Machine Learning environment, click Repos .
2. If the repo that you connected to in the previous step is not showing in the Repos pane, then select your
workspace username, and select the name of the repo that you connected to in the previous step.
3. Click the drop-down arrow next to your repo’s name, and then click Git .
4. In the best-notebooks dialog, click the + (Create branch ) button.
NOTE
If your repo has a name other than best-notebooks , this dialog’s title will be different, here and throughout this
walkthrough.
NOTE
You could delete the existing covid_eda_raw notebook at this point, because the new covid_eda_modular
notebook is a shared version of the first notebook. However, you might still want to keep the previous notebook
for comparison purposes, even though you will not use it anymore.
NOTE
Do not click the drop-down arrow next to the notebooks folder. Click the drop-down arrow next to your repo’s
name instead. You want this to go into the root of the repo, not into the notebooks folder.
2. In the New Folder Name dialog, enter covid_analysis , and then click Create Folder .
3. In the Repos pane for your repo, click the drop-down arrow next to the covid_analysis folder, and then
click Create > File .
4. In the New File Name dialog, enter transforms.py , and then click Create File .
5. In the Repos pane for your repo, click the covid_analysis folder, and then click transforms.py .
6. In the editor window, enter the following code:
import pandas as pd
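Only the module's opening import is reproduced above. As a rough, hypothetical sketch of the kind of shared transform this walkthrough moves into the module (the tests later in this walkthrough call pivot_and_clean(df, 0) and expect NaN values in the pivoted "Daily ICU occupancy" column to become 0), the function might look like the following; the column names and input layout are assumptions, not the walkthrough's actual code:

import pandas as pd

def pivot_and_clean(df: pd.DataFrame, fillna: int) -> pd.DataFrame:
    # Hypothetical sketch: pivot one row per (date, entity, indicator) into one
    # column per indicator, then replace missing values with the fill value.
    pivoted = df.pivot_table(
        values="value", columns="indicator", index=["date", "entity"], dropna=False
    )
    return pivoted.fillna(fillna).reset_index()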
TIP
For other code sharing techniques, see Share code in notebooks.
NOTE
Do not click the drop-down arrow next to the notebooks or covid_analysis folders. You want the list of
package dependencies to go into the repo’s root folder, not the notebooks or covid_analysis folders.
2. In the New File Name dialog, enter requirements.txt , and then click Create File .
3. In the Repos pane for your repo, click requirements.txt , and enter the following code:
NOTE
If the requirements.txt file is not visible, you may need to refresh your web browser.
-i https://pypi.org/simple
attrs==21.4.0
cycler==0.11.0
fonttools==4.33.3
iniconfig==1.1.1
kiwisolver==1.4.2
matplotlib==3.5.1
numpy==1.22.3
packaging==21.3
pandas==1.4.2
pillow==9.1.0
pluggy==1.0.0
py==1.11.0
py4j==0.10.9.3
pyarrow==7.0.0
pyparsing==3.0.8
pyspark==3.2.1
pytest==7.1.2
python-dateutil==2.8.2
pytz==2022.1
six==1.16.0
tomli==2.0.1
wget==3.2
NOTE
The preceding file lists specific package versions. For better compatibility, you can cross-reference these versions
with the ones that are installed on your all-purpose cluster. See the “System environment” section for your
cluster’s Databricks Runtime version in Databricks runtime releases.
|-- covid_analysis
| `-- transforms.py
|-- notebooks
| |-- covid_eda_modular
| `-- covid_eda_raw (optional)
`-- requirements.txt
NOTE
Using test data is a software engineering best practice. This enables you to run your tests faster, relying on a small
portion of the data that has the same format as your real data. Of course, you want to always make sure that this
test data accurately represents your real data before you run your tests.
7. In the Repos pane for your repo, click the drop-down arrow next to the tests folder, and then click
Create > File .
8. In the New File Name dialog, enter transforms_test.py , and then click Create File .
9. In the Repos pane for your repo, click the tests folder, and then click transforms_test.py .
10. In the editor window, enter the following test code. These tests use standard pytest fixtures as well as a
mocked in-memory pandas DataFrame:
# Test each of the transform functions.
import pytest
from textwrap import fill
import os
import pandas as pd
import numpy as np
from covid_analysis.transforms import *
from pyspark.sql import SparkSession
@pytest.fixture
def raw_input_df() -> pd.DataFrame:
"""
Create a basic version of the input dataset for testing, including NaNs.
"""
return pd.read_csv('tests/testdata.csv')
@pytest.fixture
def colnames_df() -> pd.DataFrame:
df = pd.DataFrame(
data=[[0,1,2,3,4,5]],
columns=[
"Daily ICU occupancy",
"Daily ICU occupancy per million",
"Daily hospital occupancy",
"Daily hospital occupancy per million",
"Weekly new hospital admissions",
"Weekly new hospital admissions per million"
]
)
return df
# The test data has NaNs for Daily ICU occupancy; this should get filled to 0.
def test_pivot(raw_input_df):
pivoted = pivot_and_clean(raw_input_df, 0)
assert pivoted["Daily ICU occupancy"][0] == 0
|-- covid_analysis
| `-- transforms.py
|-- notebooks
| |-- covid_eda_modular
| `-- covid_eda_raw (optional)
|-- requirements.txt
`-- tests
|-- testdata.csv
`-- transforms_test.py
NOTE
Running pytest runs all files whose names follow the form test_*.py or *_test.py in the current directory and its
subdirectories.
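The run_unit_tests notebook that you import below handles this discovery for you. As a rough sketch only (not the notebook's exact contents; the repo path is a placeholder), a minimal test-runner notebook could look like this:

import os
import sys
import pytest

# Run pytest from the repo root so that it discovers the tests folder.
# Replace the path with your own workspace user name and repo name.
os.chdir("/Workspace/Repos/<user-name>/best-notebooks")

# Avoid writing .pyc files and a pytest cache directory into the repo.
sys.dont_write_bytecode = True

retcode = pytest.main([".", "-v", "-p", "no:cacheprovider"])

# Fail the notebook run (and any job that wraps it) if any test failed.
assert retcode == 0, "One or more pytest tests failed; see the output above."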
1. In the Repos pane for your repo, click the drop-down arrow next to the notebooks folder, and then click
Import .
2. In the Import Notebooks dialog:
a. For Import from , select URL .
b. Enter the URL to the raw contents of the run_unit_tests notebook in the
databricks/notebook-best-practices repo in GitHub. To get this URL: i. Go to
https://github.com/databricks/notebook-best-practices. ii. Click the notebooks folder. iii. Click the
run_unit_tests.py file. iv. Click Raw . v. Copy the full URL from your web browser’s address bar over
into the Import Notebooks dialog.
c. Click Import .
3. If the notebook is not already showing, in the Repos pane for your repo, click the notebooks folder, and
then double-click the run_unit_tests notebook.
4. In the drop-down list next to File , select the cluster to attach this notebook to.
5. Click Run All .
6. If prompted, click Attach & Run or Start, Attach & Run .
7. Wait while the notebook runs.
After the notebook finishes running, in the notebook you should see information about the number of passing
and failed tests, along with other related details. If the cluster was not already running when you started running
this notebook, it could take several minutes for the cluster to start up before displaying the results.
Your repo structure should now look like this:
|-- covid_analysis
| `-- transforms.py
|-- notebooks
| |-- covid_eda_modular
| |-- covid_eda_raw (optional)
| `-- run_unit_tests
|-- requirements.txt
`-- tests
|-- testdata.csv
`-- transforms_test.py
NOTE
In this scenario, Databricks does not recommend that you use the schedule button in the notebook as described in
Schedule a notebook to schedule a job to run this notebook periodically. This is because the schedule button creates a job
by using the latest working copy of the notebook in the workspace repo. Instead, Databricks recommends that you follow
the preceding instructions to create a job that uses the latest committed version of the notebook in the repo.
3. To see the job results, click on the run_notebook_tests tile, the run_main_notebook tile, or both. The
results on each tile are the same as if you ran the notebooks yourself, one by one.
NOTE
This job ran on-demand. To set up this job to run on a regular basis, see Schedule a job.
(Optional) Step 6: Set up the repo to test the code and run the notebook automatically whenever the code
changes
In the previous step, you used a job to automatically test your shared code and run your notebooks at a point in
time or on a recurring basis. However, you may prefer to trigger tests automatically when changes are merged
into your GitHub repo. You can perform this automation by using a CI/CD platform such as GitHub Actions.
Step 6.1: Set up GitHub access to your workspace
In this substep, you set up a GitHub Actions workflow that runs jobs in the workspace whenever changes are
merged into your repository. You do this by giving GitHub a unique Azure Databricks token for access.
For security reasons, Databricks discourages you from giving your Azure Databricks workspace user’s personal
access token to GitHub. Instead, Databricks recommends that you give GitHub an Azure Active Directory (Azure
AD) token that is associated with an Azure service principal. For instructions, see the Azure section of the Run
Databricks Notebook GitHub Action page in the GitHub Actions Marketplace.
IMPORTANT
Notebooks are run with all of the workspace permissions of the identity that is associated with the token, so Databricks
recommends using a service principal. If you really want to give your Azure Databricks workspace user’s personal access
token to GitHub for personal exploration purposes only, and you understand that for security reasons Databricks
discourages this practice, see the instructions to create your workspace user’s personal access token.
on:
  pull_request:

env:
  # Replace this value with your workspace instance name.
  DATABRICKS_HOST: https://<your-workspace-instance-name>

jobs:
  unit-test-notebook:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      - name: Run test notebook
        uses: databricks/run-notebook@main
        with:
          databricks-token: <your-access-token>
          local-notebook-path: notebooks/run_unit_tests.py
          existing-cluster-id: <your-cluster-id>
          # Grant all users view permission on the notebook's results, so that they can
          # see the result of the notebook, if they have related access permissions.
          access-control-list-json: >
            [
              {
                "group_name": "users",
                "permission_level": "CAN_VIEW"
              }
            ]
          run-name: "EDA transforms helper module unit tests"
9. Select Create a new branch for this commit and start a pull request .
10. Click Propose new file .
11. Click the Pull requests tab, and then create the pull request.
12. On the pull request page, wait for the icon next to Run pre-merge Databricks tests / unit-test-
notebook (pull_request) to display a green check mark. (It may take a few moments for the icon to
appear.) If there is a red X instead of a green check mark, click Details to find out why. If the icon or
Details are no longer showing, click Show all checks .
13. If the green check mark appears, merge the pull request into the main branch.
(Optional) Step 7: Update the shared code in GitHub to trigger tests
In this step, you make a change to the shared code and then push the change into your GitHub repo, which
immediately triggers the tests automatically, based on the GitHub Actions from the previous step.
Step 7.1: Create another working branch in the repo
1. In your workspace, in the Repos pane for your repo, click the first_tests branch.
2. In the best-notebooks dialog, click the drop-down arrow next to the first_tests branch, and select main .
3. Click the Pull button. If prompted to proceed with pulling, click Confirm .
4. Click the + (Create branch ) button.
5. Enter trigger_tests , and then press Enter. (You can give your branch a different name.)
6. Close this dialog.
Step 7.2: Change the shared code
1. In your workspace, in the Repos pane for your repo, double-click the covid_analysis/transforms.py
file.
2. In the third line of this file, change this line of code:
To this:
Learn how to integrate Git source control with Databricks Repos. To support best practices for data science and
engineering code development, Databricks Repos provides repository-level integration with Git providers. You
can develop code in an Azure Databricks notebook, sync it with a remote Git repository, and use Git commands
for updates and source control.
NOTE
Support for arbitrary files in Databricks Repos is now in Public Preview. For details, see Work with non-notebook files in an
Azure Databricks repo and Import Python and R modules.
Set up your Azure Databricks workspace and a Git repo to use Databricks Repos capabilities. Once you set up
Databricks Repos, you can run notebooks or access project files and libraries stored in a remote Git repo.
NOTE
Databricks recommends that you set an expiration date for all personal access tokens.
If you are using GitHub AE and you have enabled GitHub allow lists, you must add Azure Databricks control plane NAT
IPs to the allow list. Use the IP for the region that the Azure Databricks workspace is in.
IMPORTANT
This feature is in Public Preview.
In addition to syncing notebooks with a remote Git repository, you can sync any type of file your project
requires, such as:
.py files
data files in .csv or .json format
.yaml configuration files
You can import and read these files within a Databricks repo. You can also view and edit plain text files in the UI.
If support for this feature is not enabled, you still see non-notebook files in your repo, but you cannot work with
them.
Enable Files in Repos
An admin can enable this feature as follows:
1. Go to the Admin Console.
2. Click the Workspace Settings tab.
3. In the Repos section, click the Files in Repos toggle.
After the feature has been enabled, you must restart your cluster and refresh your browser before you can work
with non-notebook files in Repos.
Additionally, the first time you access a repo after Files in Repos is enabled, you must open the Git dialog. The
dialog indicates that you must perform a pull operation to sync non-notebook files in the repo. Select Agree
and Pull to sync files. If there are any merge conflicts, another dialog appears giving you the option of
discarding your conflicting changes or pushing your changes to a new branch.
NOTE
Users can load and pull remote repositories even if they are not on the allow list.
The list you save overwrites the existing set of saved URL prefixes.
It may take about 15 minutes for changes to take effect.
Secrets detection
Databricks Repos scans code for access key IDs that begin with the prefix AKIA and warns the user before
committing.
Terraform integration
You can manage Databricks Repos in a fully automated setup using the Databricks Terraform provider and the
databricks_repo resource.
NOTE
Databricks recommends that you use Git integration with Databricks Repos to sync your work in Azure Databricks with a
remote Git repository.
This article describes how to set up version control for notebooks using GitHub through the UI. You can also use
the Databricks CLI or Workspace API 2.0 to import and export notebooks and manage notebook versions using
GitHub tools.
NOTE
You cannot modify a notebook while the history panel is open.
2. Click Save Now to save your notebook to GitHub. The Save Notebook Revision dialog appears.
3. Optionally, enter a message to describe your change.
4. Make sure that Also commit to Git is selected.
5. Click Save .
Revert or update a notebook to a version from GitHub
Once you link a notebook, Azure Databricks syncs your history with Git every time you re-open the history
panel. Versions that sync to Git have commit hashes as part of the entry.
1. Click Revision history at the top right of the notebook to open the history panel.
2. Choose an entry in the history panel. Azure Databricks displays that version.
3. Click Restore this version .
4. Click Confirm to confirm that you want to restore that version.
Unlink a notebook
1. Click Revision history at the top right of the notebook to open the history panel.
2. The Git status bar displays Git: Synced .
3. Click Create PR . GitHub opens to a pull request page for the branch.
Rebase a branch
You can also rebase your branch inside Azure Databricks. The Rebase link displays if new commits are available
in the parent branch. Only rebasing on top of the default branch of the parent repository is supported.
For example, assume that you are working on databricks/reference-apps . You fork it into your own account (for
example, brkyvz ) and start working on a branch called my-branch . If a new update is pushed to
databricks:master , then the Rebase button displays, and you will be able to pull the changes into your branch
brkyvz:my-branch .
Rebasing works a little differently in Azure Databricks. Assume the following branch structure:
After a rebase, the branch structure will look like:
What’s different here is that Commits C5 and C6 will not apply on top of C4. They will appear as local changes in
your notebook. Any merge conflict will show up as follows:
You can then commit to GitHub once again using the Save Now button.
What happens if someone branched off from my branch that I just rebased?
If your branch (for example, branch-a ) was the base for another branch ( branch-b ), and you rebase, you need
not worry! Once a user also rebases branch-b , everything will work out. The best practice in this situation is to
use separate branches for separate notebooks.
Best practices for code reviews
Azure Databricks supports Git branching.
You can link a notebook to any branch in a repository. Azure Databricks recommends using a separate
branch for each notebook.
During development, you can link a notebook to a fork of a repository or to a non-default branch in the main
repository. To integrate your changes upstream, you can use the Create PR link in the Git Preferences dialog
in Azure Databricks to create a GitHub pull request. The Create PR link displays only if you’re not working on
the default branch of the parent repository.
GitHub Enterprise
IMPORTANT
This feature is in Private Preview. To try it, reach out to your Azure Databricks contact.
You can also use the Workspace API 2.0 to programmatically create notebooks and manage the code base in
GitHub Enterprise Server.
Troubleshooting
If you receive errors related to syncing GitHub history, verify the following:
You can only link a notebook to an initialized Git repository that isn’t empty. Test the URL in a web browser.
The GitHub personal access token must be active.
To use a private GitHub repository, you must have permission to read the repository.
If a notebook is linked to a GitHub branch that is renamed, the change is not automatically reflected in Azure
Databricks. You must re-link the notebook to the branch manually.
Azure DevOps Services version control
7/21/2022 • 2 minutes to read
Azure DevOps is a collection of services that provide an end-to-end solution for the five core practices of
DevOps: planning and tracking, development, build and test, delivery, and monitoring and operations. This
article describes how to set Azure DevOps as your Git provider.
NOTE
Databricks recommends that you use Git integration with Databricks Repos to sync your work in Azure Databricks with
a remote Git repository.
For information about the name change from Visual Studio Team Services to Azure DevOps, see Visual Studio Team
Services is now Azure DevOps Services.
Get started
Authentication with Azure DevOps Services is done automatically when you authenticate using Azure Active
Directory (Azure AD). The Azure DevOps Services organization must be linked to the same Azure AD tenant as
Databricks.
In Azure Databricks, set your Git provider to Azure DevOps Services on the User Settings page:
1. Click Settings at the lower left of your screen and select User Settings .
2. Click the Git Integration tab.
3. Change your provider to Azure DevOps Services.
Notebook integration
Notebook integration with Azure DevOps Services is exactly like integration with GitHub. See Work with
notebook revisions to learn more about how to work with notebooks using Git.
TIP
In Git Preferences, use the URL scheme https://dev.azure.com/<org>/<project>/_git/<repo> to link Azure DevOps
and Azure Databricks to the same Azure AD tenant.
If your Azure DevOps organization is org.visualstudio.com , open dev.azure.com in your browser and navigate to
your repository. Copy the URL from the browser and paste that URL in the Link field.
Troubleshooting
The Save button in the Databricks UI is grayed out.
Visual Studio Team Services renamed to Azure DevOps Services. Original URLs in the format
https://<org>.visualstudio.com/<project>/_git/<repo> do not work in Azure Databricks notebooks.
An organization administrator can automatically update the URLs in Azure DevOps Services from the
organization settings page.
Alternately, you can manually create the new URL format used in Azure Databricks notebooks to sync with Azure
DevOps Services. In the Azure Databricks notebook, enter the new URL in the Link field in the Git Preferences
dialog.
Old URL format:
https://<org>.visualstudio.com/<project>/_git/<repo>
This guide describes how to set up version control for notebooks using Bitbucket Cloud and Bitbucket Server
through the UI.
NOTE
Databricks recommends that you use Git integration with Databricks Repos to sync your work in Azure Databricks with a
remote Git repository.
5. Click Save .
Revert or update a notebook to a version from Bitbucket Cloud
Once you link a notebook, Azure Databricks syncs your history with Git every time you re-open the History
panel. Versions that sync to Git have commit hashes as part of the entry.
1. Open the History panel.
2. Choose an entry in the History panel. Azure Databricks displays that version.
3. Click Restore this version .
4. Click Confirm to confirm that you want to restore that version.
Unlink a notebook
1. Open the History panel.
2. The Git status bar displays Git: Synced .
3. Click Git: Synced .
4. Click Create PR . Bitbucket Cloud opens to a pull request page for the branch.
This article describes how to set up Git integration with Databricks Repos for notebooks using GitLab through
the UI.
This guide describes how to configure version control with AWS CodeCommit. Configuring version control
involves creating access credentials in your version control provider and adding those credentials to Azure
Databricks.
This guide describes how to configure version control with GitHub AE. Configuring version control involves
creating access credentials in your version control provider and adding those credentials to Azure Databricks.
This article walks you through steps for working with notebooks and other files in Databricks Repos with a
remote Git integration.
In Azure Databricks you can:
Clone a remote Git repository.
Work in notebooks or files.
Create notebooks, and edit notebooks and other files.
Sync with a remote repository.
Create new branches for development work.
For other tasks, you work in your Git provider:
Creating a PR
Resolving conflicts
Merging or deleting branches
Rebasing a branch
3. In the Add Repo dialog, click Clone remote Git repo and enter the repository URL. Select your Git
provider from the drop-down menu, optionally change the name to use for the Databricks repo, and click
Create . The contents of the remote repository are cloned to the Databricks repo.
Create a notebook or folder
To create a new notebook or folder in a repo, click the down arrow next to the repo name, and select Create >
Notebook or Create > Folder from the menu.
To move a notebook or folder in your workspace into a repo, navigate to the notebook or folder and select
Move from the drop-down menu:
In the dialog, select the repo to which you want to move the object:
You can import a SQL or Python file as a single-cell Azure Databricks notebook.
Add the comment line -- Databricks notebook source at the top of a SQL file.
Add the comment line # Databricks notebook source at the top of a Python file.
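For example, a minimal Python file that imports as a single-cell notebook might look like this (the contents below the marker line are illustrative):

# Databricks notebook source
# The marker comment on the first line tells the importer to treat this .py file
# as a notebook; the code below becomes the notebook's single cell.
message = "hello from an imported file"
print(message)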
IMPORTANT
This feature is in Public Preview.
Requirements
Databricks Runtime 8.4 or above.
Create a new file
The most common way to create a file in a repo is to clone a Git repository. You can also create a new file
directly from the Databricks repo. Click the down arrow next to the repo name, and select Create > File from
the menu.
Import a file
To import a file, click the down arrow next to the repo name, and select Import .
The import dialog appears. You can drag files into the dialog or click browse to select files.
import pandas as pd
df = pd.read_csv("./data/winequality-red.csv")
df
You can use Spark to access files in a repo. Spark requires absolute file paths for file data. The absolute file path
for a file in a repo is file:/Workspace/Repos/<user_folder>/<repo_name>/file .
You can copy the absolute or relative path to a file in a repo from the drop-down menu next to the file:
The example below shows the use of {os.getcwd()} to get the full path.
import os
spark.read.format("csv").load(f"file:{os.getcwd()}/my_data.csv")
Example notebook
This notebook shows examples of working with arbitrary files in Databricks Repos.
Arbitrary Files in Repos example notebook
Get notebook
Requirements
Databricks Runtime 8.4 or above.
Import Python and R modules
The current working directory of your repo and notebook are automatically added to the Python path. When
you work in the repo root, you can import modules from the root directory and all subdirectories.
To import modules from another repo, you must add that repo to sys.path . For example:
import sys
sys.path.append("/Workspace/Repos/<user-name>/<repo-name>")
You import functions from a module in a repo just as you would from a module saved as a cluster library or
notebook-scoped library. For example, in R:

source("sample.R")
power.powerOfTwo(3)
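The equivalent Python import was not preserved in this extract. As a hedged sketch, assuming a package named sample at the repo root whose power module defines a powerOfTwo function, it might look like:

from sample import power

power.powerOfTwo(3)

The %load_ext autoreload and %autoreload 2 magics shown next reload imported modules automatically when their source files change, which is convenient while you iterate on shared code in a repo.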
%load_ext autoreload
%autoreload 2
This article describes how you can use common Git capabilities to sync Databricks Repos with a remote Git
repository.
To update notebooks and other files in Databricks Repos, you can:
Pull changes from the remote Git repo.
Resolve merge conflicts.
Commit and push from Databricks to the remote repo.
You can also create a new branch in Databricks Repos.
To sync with Git, use the Git dialog. The Git dialog lets you pull changes from your remote Git repository and
push and commit changes. You can also change the branch you are working on or create a new branch.
IMPORTANT
Git operations that pull in upstream changes clear the notebook state. For more information, see Incoming changes clear
the notebook state.
From the Databricks Repos browser, click the button to the right of the repo name:
You can also click the down arrow next to the repo name, and select Git… from the menu.
2. When you click Commit to new branch , a notice appears with a link: Create a pull request to
resolve merge conflicts . Click the link to open your Git provider.
3. In your Git provider, create the PR, resolve the conflicts, and merge the new branch into the original
branch.
4. Return to the Repos UI. Use the Git dialog to pull changes from the Git repository to the original branch.
Add a required Summary of the changes, and click Commit & Push to push these changes to the remote Git
repository.
If you don’t have permission to commit to the default branch, such as main , create a new branch and use your
Git provider interface to create a pull request (PR) to merge it into the default branch.
NOTE
Results are not included with a notebook commit. All results are cleared before the commit is made.
For instructions on resolving merge conflicts, see Resolve merge conflicts.
Learn best practices for using Databricks Repos in a CI/CD workflow. Integrating Git repos with Databricks Repos
provides source control for project files.
The following figure shows an overview of the steps.
Admin workflow
Databricks Repos have user-level folders and non-user top level folders. User-level folders are automatically
created when users first clone a remote repository. You can think of Databricks Repos in user folders as “local
checkouts” that are individual for each user and where users make changes to their code.
Developer workflow
In your user folder in Databricks Repos, clone your remote repository. A best practice is to create a new feature
branch or select a previously created branch for your work, instead of directly committing and pushing changes
to the main branch. You can make changes, commit, and push changes in that branch. When you are ready to
merge your code, create a pull request and follow the review and merge processes in Git.
Here is an example workflow.
Requirements
This workflow requires that you have already set up your Git integration.
NOTE
Databricks recommends that each developer work on their own feature branch. Sharing feature branches among
developers can cause merge conflicts, which must be resolved using your Git provider. For information about how to
resolve merge conflicts, see Resolve merge conflicts.
Workflow
1. Clone your existing Git repository to your Databricks workspace.
2. Use the Repos UI to create a feature branch from the main branch. This example uses a single feature branch
feature-b for simplicity. You can create and use multiple feature branches to do your work.
3. Make your modifications to Databricks notebooks and files in the Repo.
4. Commit and push your changes to your Git provider.
5. Coworkers can now clone the Git repository into their own user folder.
a. Working on a new branch, a coworker makes changes to the notebooks and files in the Repo.
b. The coworker commits and pushes their changes to the Git provider.
6. To merge changes from other branches or rebase the feature branch, you must use the Git command line or
an IDE on your local system. Then, in the Repos UI, use the Git dialog to pull changes into the feature-b
branch in the Databricks Repo.
7. When you are ready to merge your work to the main branch, use your Git provider to create a PR to merge
the changes from feature-b.
8. In the Repos UI, pull changes to the main branch.
If you are using %run commands to make Python or R functions defined in a notebook available to another
notebook, or are installing custom .whl files on a cluster, consider including those custom modules in a
Databricks repo. In this way, you can keep your notebooks and other code modules in sync, ensuring that your
notebook always uses the correct version.
Migrate from %run commands
%run commands let you include one notebook within another and are often used to make supporting Python
or R code available to a notebook. In this example, a notebook named power.py includes the code below.
You can then make functions defined in power.py available to a different notebook with a %run command:
Using Files in Repos, you can directly import the module that contains the Python code and run the function.
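As a hedged illustration (the module and function names below are hypothetical, not the article's original example), the two approaches compare like this:

# Old approach, notebook inclusion:
#     %run ./power
#
# New approach with Files in Repos: import the file power.py in the repo root
# as an ordinary Python module and call its function directly.
from power import power_of_two

print(power_of_two(3))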
Databricks Repos and Git integration have limits specified in the following sections. For general information, see
Databricks limits.
Repo configuration
Where is Databricks repo content stored?
The contents of a repo are temporarily cloned onto disk in the control plane. Azure Databricks notebook files are
stored in the control plane database just like notebooks in the main workspace. Non-notebook files may be
stored on disk for up to 30 days.
Does Repos support on-premises or self-hosted Git servers?
Databricks Repos supports Bitbucket Server integration, if the server is internet accessible.
To integrate with a Bitbucket Server, GitHub Enterprise Server, or a GitLab self-managed subscription instance
that is not internet-accessible, get in touch with your Databricks representative.
Does Repos support .gitignore files?
Yes. If you add a file to your repo and do not want it to be tracked by Git, create a .gitignore file or use one
cloned from your remote repository and add the filename, including the extension.
.gitignore works only for files that are not already tracked by Git. If you add a file that is already tracked by Git
to a .gitignore file, the file is still tracked by Git.
Can I create top-level folders that are not user folders?
Yes, admins can create top-level folders to a single depth. Repos does not support additional folder levels.
Does Repos support Git submodules?
No. You can clone a repo that contains Git submodules, but the submodule is not cloned.
Does Azure Data Factory (ADF) support Repos?
Yes.
How can I disable Repos in my workspace?
Follow these steps to disable Repos for Git in your workspace.
1. Go to the Admin Console.
2. Click the Workspace Settings tab.
3. In the Advanced section, click the Repos toggle.
4. Click Confirm .
5. Refresh your browser.
Source management
Can I pull in .ipynb files?
Yes. The file renders in .json format, not notebook format.
Does Repos support branch merging?
No. Databricks recommends that you create a pull request and merge through your Git provider.
Can I delete a branch from an Azure Databricks repo?
No. To delete a branch, you must work in your Git provider.
If a library is installed on a cluster, and a library with the same name is included in a folder within a repo, which
library is imported?
The library in the repo is imported.
Can I pull the latest version of a repository from Git before running a job without relying on an external
orchestration tool?
No. Typically you can integrate this as a pre-commit on the Git server so that every push to a branch
(main/prod) updates the Production repo.
Can I export a Repo?
You can export notebooks, folders, or an entire Repo. You cannot export non-notebook files, and if you export an
entire Repo, non-notebook files are not included. To export, use the Workspace CLI or the Workspace API 2.0.
IMPORTANT
This feature is in Public Preview.
In Databricks Runtime 10.1 and below, Files in Repos is not compatible with Spark Streaming. To use Spark
Streaming on a cluster running Databricks Runtime 10.1 or below, you must disable Files in Repos on the
cluster. Set the Spark configuration spark.databricks.enableWsfs to false .
Native file reads are supported in Python and R notebooks. Native file reads are not supported in Scala
notebooks, but you can use Scala notebooks with DBFS as you do today.
Only text-encoded files are rendered in the UI. To view files in Azure Databricks, the files must not be larger
than 10 MB.
You cannot create or edit a file from your notebook.
You can only export notebooks. You cannot export non-notebook files from a repo.
How can I run non-Databricks notebook files in a repo? For example, a .py file?
You can use any of the following:
Bundle and deploy as a library on the cluster.
Pip install the Git repository directly. This requires a credential in secrets manager.
Use %run with inline code in a notebook.
Use a custom container image. See Customize containers with Databricks Container Services.
Errors and troubleshooting for Databricks Repos
7/21/2022 • 3 minutes to read
Follow the guidance below to respond to common error messages or troubleshoot issues with Databricks
Repos.
Invalid credentials
Try the following:
Confirm that the settings in the Git integration tab (User Settings > Git Integration ) are correct.
You must enter both your Git provider username and token.
Legacy Git integrations did not require a username, so you may need to add a username to work with
Databricks Repos.
Confirm that you have selected the correct Git provider in the Add Repo dialog.
Ensure your personal access token or app password has the correct repo access.
If SSO is enabled on your Git provider, authorize your tokens for SSO.
Test your token with the Git command line. Both of these options should work:
<link>: Secure connection to <link> could not be established because of SSL problems
This error occurs if your Git server is not accessible from Azure Databricks. Private Git servers are not
supported.
This error can occur if your team has recently moved to using a multi-factor authentication (MFA) policy for
Azure Active Directory. To resolve this problem, you must log out of Azure Active Directory by going to
portal.azure.com and logging out. When you log back in, you should get the prompt to use MFA to log in.
If that does not work, try logging out completely from all Azure services before attempting to log in again.
Timeout errors
Expensive operations such as cloning a large repo or checking out a large branch may hit timeout errors, but the
operation might complete in the background. You can also try again later if the workspace was under heavy load
at the time.
404 errors
If you get a 404 error when you try to open a non-notebook file, wait a few minutes and then try again. There is
a delay of a few minutes between when the workspace is enabled and when the webapp picks up
the configuration flag.
Resource not found errors after you pull non-notebook files into a
Databricks repo
This error can occur if you are not using Databricks Runtime 8.4 or above. A cluster running Databricks Runtime
8.4 or above is required to work with non-notebook files in a repo.
This error indicates that a problem occurred while deleting folders from the repo. This could leave the repo in an
inconsistent state, where folders that should have been deleted still exist. If this error occurs, Databricks
recommends deleting and re-cloning the repo to reset its state.
Unable to set repo to most recent state. This may be due to force pushes overriding commit history on the
remote repo. Repo may be out of sync and re-cloning is recommended.
This error indicates that the local and remote Git state have diverged. This can happen when a force push on the
remote overrides recent commits that still exist on the local repo. Databricks does not support a hard reset
within Repos and recommends deleting and re-cloning the repo if this error occurs.
Files do not appear after cloning a remote repo or pulling files into
an existing one
If you know your admin enabled Databricks Repos and support for arbitrary files, try the following:
Confirm your cluster is running Databricks Runtime 8.4 or above.
Refresh your browser and restart your cluster to pick up the new configuration.
NOTE
This applies only to notebook experiments. Creation of new experiments in Repos is unsupported.
Libraries
7/21/2022 • 4 minutes to read
To make third-party or custom code available to notebooks and jobs running on your clusters, you can install a
library. Libraries can be written in Python, Java, Scala, and R. You can upload Java, Scala, and Python libraries
and point to external packages in PyPI, Maven, and CRAN repositories.
This article focuses on performing library tasks in the workspace UI. You can also manage libraries using the
Libraries CLI or the Libraries API 2.0.
TIP
Azure Databricks includes many common libraries in Databricks Runtime. To see which libraries are included in Databricks
Runtime, look at the System Environment subsection of the Databricks Runtime release notes for your Databricks
Runtime version.
IMPORTANT
Azure Databricks does not invoke Python atexit functions when your notebook or job completes processing. If you use
a Python library that registers atexit handlers, you must ensure your code calls required functions before exiting.
Installing Python eggs is deprecated and will be removed in a future Databricks Runtime release. Use Python wheels or
install packages from PyPI instead.
NOTE
Microsoft Support helps isolate and resolve issues related to libraries installed and maintained by Azure Databricks. For
third-party components, including libraries, Microsoft provides commercially reasonable support to help you further
troubleshoot issues. Microsoft Support assists on a best-effort basis and might be able to resolve the issue. For open
source connectors and projects hosted on GitHub, we recommend that you file issues on GitHub and follow up on them.
Development efforts such as shading jars or building Python libraries are not supported through the standard support
case submission process: they require a consulting engagement for faster resolution. Support might ask you to engage
other channels for open-source technologies where you can find deep expertise for that technology. There are several
community sites; two examples are the Microsoft Q&A page for Azure Databricks and Stack Overflow.
You can install libraries in three modes: workspace, cluster-installed, and notebook-scoped.
Workspace libraries serve as a local repository from which you create cluster-installed libraries. A workspace
library might be custom code created by your organization, or might be a particular version of an open-
source library that your organization has standardized on.
Cluster libraries can be used by all notebooks running on a cluster. You can install a cluster library directly
from a public repository such as PyPI or Maven, or create one from a previously installed workspace library.
Notebook-scoped libraries, available for Python and R, allow you to install libraries and create an
environment scoped to a notebook session. These libraries do not affect other notebooks running on the
same cluster. Notebook-scoped libraries do not persist and must be re-installed for each session. Use
notebook-scoped libraries when you need a custom environment for a specific notebook.
Notebook-scoped Python libraries
Notebook-scoped R libraries
This section covers:
Workspace libraries
Cluster libraries
Notebook-scoped Python libraries
Notebook-scoped R libraries
NOTE
Custom containers that use a conda-based environment are not compatible with notebook-scoped libraries in
Databricks Runtime 9.0 and above and with cluster libraries in Databricks Runtime 10.1 and above. Instead, Azure
Databricks recommends installing libraries directly in the image or using init scripts. To continue using cluster libraries
in those scenarios, you can set the Spark configuration
spark.databricks.driverNfs.clusterWidePythonLibsEnabled to false . Support for the Spark configuration will
be removed on or after December 31, 2021.
Notebook-scoped libraries using magic commands are enabled by default in Databricks Runtime 7.1 and above,
Databricks Runtime 7.1 ML and above, and Databricks Runtime 7.1 for Genomics and above. They are also available
using a configuration setting in Databricks Runtime 6.4 ML to 7.0 ML and Databricks Runtime 6.4 for Genomics to
Databricks Runtime 7.0 for Genomics. See Requirements for details.
Notebook-scoped libraries with the library utility are deprecated and will be removed in an upcoming Databricks
Runtime version. They are not available on Databricks Runtime ML or Databricks Runtime for Genomics.
The following list shows, for each Python package source, how to install packages with each method (notebook-scoped libraries with %pip, notebook-scoped libraries with the library utility, cluster libraries, and job libraries with the Jobs API):

PyPI
Notebook-scoped libraries with %pip: Use %pip install . See example.
Notebook-scoped libraries with the library utility: Use dbutils.library.installPyPI .
Cluster libraries: Select PyPI as the source.
Job libraries with Jobs API: Add a new pypi object to the job libraries and specify the package field.

Private PyPI mirror, such as Nexus or Artifactory
Notebook-scoped libraries with %pip: Use %pip install with the --index-url option. Secret management is available. See example.
Notebook-scoped libraries with the library utility: Use dbutils.library.installPyPI and specify the repo argument.
Cluster libraries: Not supported.
Job libraries with Jobs API: Not supported.

VCS, such as GitHub, with raw source
Notebook-scoped libraries with %pip: Use %pip install and specify the repository URL as the package name. See example.
Notebook-scoped libraries with the library utility: Not supported.
Cluster libraries: Select PyPI as the source and specify the repository URL as the package name.
Job libraries with Jobs API: Add a new pypi object to the job libraries and specify the repository URL as the package field.

Private VCS with raw source
Notebook-scoped libraries with %pip: Use %pip install and specify the repository URL with basic authentication as the package name. Secret management is available. See example.
Notebook-scoped libraries with the library utility: Not supported.
Cluster libraries: Not supported.
Job libraries with Jobs API: Not supported.

DBFS
Notebook-scoped libraries with %pip: Use %pip install . See example.
Notebook-scoped libraries with the library utility: Use dbutils.library.install(dbfs_path) .
Cluster libraries: Select DBFS as the source.
Job libraries with Jobs API: Add a new egg or whl object to the job libraries and specify the DBFS path as the package field.
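For example, a job definition submitted through the Jobs API can declare libraries like the following; this is a hedged sketch, and the package name, wheel path, and exact API shape should be checked against the Jobs API reference:

# Hypothetical fragment of a Jobs API request body, expressed as a Python dict list.
libraries = [
    {"pypi": {"package": "simplejson==3.17.6"}},                      # PyPI source
    {"whl": "dbfs:/FileStore/wheels/my_pkg-0.1.0-py3-none-any.whl"},  # DBFS source
]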
Workspace libraries
7/21/2022 • 4 minutes to read
Workspace libraries serve as a local repository from which you create cluster-installed libraries. A workspace
library might be custom code created by your organization, or might be a particular version of an open-source
library that your organization has standardized on.
You must install a workspace library on a cluster before it can be used in a notebook or job.
Workspace libraries in the Shared folder are available to all users in a workspace, while workspace libraries in a
user folder are available only to that user.
NOTE
Installing Python eggs is deprecated and will be removed in a future Databricks Runtime release.
NOTE
Libraries stored in ADLS are only supported in Databricks Runtime 8.0 and above and Databricks Runtime 7.3 LTS. ADLS
is only supported through the encrypted abfss:// path.
NOTE
Internal Maven repositories are not supported.
4. In the Exclusions field, optionally provide the groupId and the artifactId of the dependencies that you
want to exclude; for example, log4j:log4j .
5. Click Create . The library status screen displays.
6. Optionally install the library on a cluster.
CRAN package
1. In the Library Source button list, select CRAN .
2. In the Package field, enter the name of the package.
3. In the Repository field, optionally enter the CRAN repository URL.
4. Click Create . The library detail screen displays.
5. Optionally install the library on a cluster.
NOTE
CRAN mirrors serve the latest version of a library. As a result, you may end up with different versions of an R package if
you attach the library to different clusters at different times. To learn how to manage and fix R package versions on
Databricks, see the Knowledge Base.
2. Click the drop-down arrow to the right of the library name and select Move . A folder browser
displays.
3. Click the destination folder.
4. Click Select .
5. Click Confirm and Move .
Delete a workspace library
IMPORTANT
Before deleting a workspace library, you should uninstall it from all clusters.
Cluster libraries can be used by all notebooks running on a cluster. You can install a cluster library directly from
a public repository such as PyPI or Maven, using a previously installed workspace library, or using an init script.
NOTE
When you install a library on a cluster, a notebook already attached to that cluster will not immediately see the new
library. You must first detach and then reattach the notebook to the cluster.
In this section:
Workspace library
Cluster-installed library
Init script
Workspace library
NOTE
Starting with Databricks Runtime 7.2, Azure Databricks processes all workspace libraries in the order that they were
installed on the cluster. On Databricks Runtime 7.1 and below, Azure Databricks processes Maven and CRAN libraries in
the order they are installed on the cluster.
You might need to pay attention to the order of installation on the cluster if there are dependencies between libraries.
To install a library that already exists in the workspace, you can start from the cluster UI or the library UI:
Cluster
IMPORTANT
This option does not install the library on clusters running Databricks Runtime 7.0 and above.
Select the checkbox next to the cluster that you want to install the library on and click Install .
The library is installed on the cluster.
Cluster-installed library
IMPORTANT
If you have configured a library to install on all clusters automatically, or you select an existing terminated cluster that has
libraries installed, the job execution does not wait for library installation to complete. If a job requires a specific library, you
should attach the library to the job in the Dependent Libraries field.
You can install a library on a specific cluster without making it available as a workspace library.
To install a library on a cluster:
#!/bin/bash
Notebook-scoped libraries let you create, modify, save, reuse, and share custom Python environments that are
specific to a notebook. When you install a notebook-scoped library, only the current notebook and any jobs
associated with that notebook have access to that library. Other notebooks attached to the same cluster are not
affected.
Notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the
beginning of each session, or whenever the notebook is detached from a cluster.
There are two methods for installing notebook-scoped libraries:
Run the %pip magic command in a notebook. The %pip command is supported on Databricks Runtime 7.1
and above, and on Databricks Runtime 6.4 ML and above. Databricks recommends using this approach for
new workloads. This article describes how to use these magic commands.
On Databricks Runtime 10.5 and below, you can use the Azure Databricks library utility. The library utility is
supported only on Databricks Runtime, not Databricks Runtime ML or Databricks Runtime for Genomics. See
Library utility (dbutils.library).
To install libraries for all notebooks attached to a cluster, use workspace or cluster-installed libraries.
IMPORTANT
dbutils.library.install and dbutils.library.installPyPI APIs are removed in Databricks Runtime 11.0.
Requirements
Notebook-scoped libraries using magic commands are enabled by default in Databricks Runtime 7.1 and above,
Databricks Runtime 7.1 ML and above, and Databricks Runtime 7.1 for Genomics and above.
They are also available using a configuration setting in Databricks Runtime 6.4 ML to 7.0 ML and Databricks
Runtime 6.4 for Genomics to Databricks Runtime 7.0 for Genomics. Set the Spark configuration
spark.databricks.conda.condaMagic.enabled to true .
On a High Concurrency cluster running Databricks Runtime 7.4 ML or Databricks Runtime 7.4 for Genomics or
below, notebook-scoped libraries are not compatible with table access control or credential passthrough. An
alternative is to use Library utility (dbutils.library) on a Databricks Runtime cluster, or to upgrade your cluster to
Databricks Runtime 7.5 ML or Databricks Runtime 7.5 for Genomics or above.
To use notebook-scoped libraries with Databricks Connect, you must use Library utility (dbutils.library).
Driver node
Using notebook-scoped libraries might result in more traffic to the driver node as it works to keep the
environment consistent across executor nodes.
When you use a cluster with 10 or more nodes, Databricks recommends these specs as a minimum requirement
for the driver node:
For a 100 node CPU cluster, use Standard_DS5_v2.
For a 10 node GPU cluster, use Standard_NC12.
For larger clusters, use a larger driver node.
IMPORTANT
You should place all %pip commands at the beginning of the notebook. The notebook state is reset after any %pip
command that modifies the environment. If you create Python methods or variables in a notebook, and then use
%pip commands in a later cell, the methods or variables are lost.
Upgrading, modifying, or uninstalling core Python packages (such as IPython) with %pip may cause some features to
stop working as expected. For example, IPython 7.21 and above are incompatible with Databricks Runtime 8.1 and
below. If you experience such problems, reset the environment by detaching and re-attaching the notebook or by
restarting the cluster.
NOTE
You cannot uninstall a library that is included in Databricks Runtime or a library that has been installed as a cluster library.
If you have installed a different library version than the one included in Databricks Runtime or the one installed on the
cluster, you can use %pip uninstall to revert the library to the default version in Databricks Runtime or the version
installed on the cluster, but you cannot use a %pip command to uninstall the version of a library included in Databricks
Runtime or installed on the cluster.
You can add parameters to the URL to specify things like the version or git subdirectory. See the VCS support for
more information and for examples using other version control systems.
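For example, a hedged sketch that pins a tag and a subdirectory (the organization, repository, tag, and path are placeholders):

%pip install git+https://github.com/<org>/<repo>.git@v1.2.0#subdirectory=src/my_package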
Install a private package with credentials managed by Databricks secrets with %pip
Pip supports installing packages from private sources with basic authentication, including private version
control systems and private package repositories, such as Nexus and Artifactory. Secret management is
available via the Databricks Secrets API, which allows you to store authentication tokens and passwords. Use the
DBUtils API to access secrets from your notebook. Note that you can use $variables in magic commands.
To install a package from a private repository, specify the repository URL with the --index-url option to
%pip install or add it to the pip config file at ~/.pip/pip.conf .
Similarly, you can use secret management with magic commands to install private packages from version
control systems.
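A hedged sketch of the pattern (the secret scope, key, user name, and repository URL are placeholders):

# Cell 1: read the credential from a Databricks secret scope.
token = dbutils.secrets.get(scope="my-scope", key="pypi-token")

# Cell 2: reference the Python variable with $ inside the %pip magic.
%pip install my-private-package --index-url https://my-user:$token@nexus.example.com/repository/pypi-internal/simple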
You can use %pip to install a private package that has been saved on DBFS.
When you upload a file to DBFS, it automatically renames the file, replacing spaces, periods, and hyphens with
underscores. pip requires that the name of the wheel file use periods in the version (for example, 0.1.0) and
hyphens instead of spaces or underscores. To install the package with a %pip command, you must rename the
file to meet these requirements.
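For example, assuming the wheel has already been renamed to follow these rules (the path and package name are placeholders):

%pip install /dbfs/FileStore/wheels/my_package-0.1.0-py3-none-any.whl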
Any subdirectories in the file path must already exist. If you run
%pip freeze > /dbfs/<new-directory>/requirements.txt , the command fails if the directory
/dbfs/<new-directory> does not already exist.
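For example, you could create the directory with the file system utility before writing the snapshot (the directory name is a placeholder):

dbutils.fs.mkdirs("dbfs:/pip-snapshots")   # also available at /dbfs/pip-snapshots
%pip freeze > /dbfs/pip-snapshots/requirements.txt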
The %conda command is equivalent to the conda command and supports the same API with some restrictions
noted below. The following sections contain examples of how to use %conda commands to manage your
environment. For more information on installing Python packages with conda , see the conda install
documentation.
Note that %conda magic commands are not available on Databricks Runtime. They are only available on
Databricks Runtime ML up to Databricks Runtime ML 8.4, and on Databricks Runtime for Genomics. Databricks
recommends using pip to install libraries. For more information, see Understanding conda and pip.
If you must use both %pip and %conda commands in a notebook, see Interactions between pip and conda
commands.
NOTE
The following conda commands are not supported when used with %conda :
activate
create
init
run
env create
env remove
In this section:
Install a library with %conda
Uninstall a library with %conda
Save and reuse or share an environment
List the Python environment of a notebook
Interactions between pip and conda commands
Install a library with %conda
%conda list
NOTE
On Databricks Runtime 11.0 and above, %pip , %sh pip , and !pip all install a library as a notebook-scoped Python
library.
Known issues
On Databricks Runtime 7.0 ML and below as well as Databricks Runtime 7.0 for Genomics and below, if a
registered UDF depends on Python packages installed using %pip or %conda , it won’t work in %sql cells.
Use spark.sql in a Python command shell instead.
On Databricks Runtime 7.2 ML and below as well as Databricks Runtime 7.2 for Genomics and below, when
you update the notebook environment using %conda , the new environment is not activated on worker
Python processes. This can cause issues if a PySpark UDF function calls a third-party function that uses
resources installed inside the Conda environment.
When you use %conda env update to update a notebook environment, the installation order of packages is
not guaranteed. This can cause problems for the horovod package, which requires that tensorflow and
torch be installed before horovod in order to use horovod.tensorflow or horovod.torch respectively. If this
happens, uninstall the horovod package and reinstall it after ensuring that the dependencies are installed.
On Databricks Runtime 10.3 and below, notebook-scoped libraries are incompatible with batch streaming
jobs. Databricks recommends using cluster libraries or the IPython kernel instead.
Notebook-scoped R libraries
7/21/2022 • 2 minutes to read
Notebook-scoped R libraries enable you to create and modify custom R environments that are specific to a
notebook session. When you install an R notebook-scoped library, only the current notebook and any jobs
associated with that notebook have access to that library. Other notebooks attached to the same cluster are not
affected.
Notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the
beginning of each session, or whenever the notebook is detached from a cluster.
Notebook-scoped libraries are automatically available on workers for SparkR UDFs.
To install libraries for all notebooks attached to a cluster, use workspace or cluster-installed libraries.
Requirements
Notebook-scoped R libraries are enabled by default in Databricks Runtime 9.0 and above.
Databricks recommends using a CRAN snapshot as the repository to guarantee reproducible results.
devtools::install_github("klutometis/roxygen")
remove.packages("caesar")
FAQs
How do I install a package on just the driver for all R notebooks?
Explicitly set the installation directory to /databricks/spark/R/lib . For example, with install.packages() , run
install.packages("pckg", lib="/databricks/spark/R/lib") . Packages installed in /databricks/spark/R/lib are
shared across all notebooks on the cluster, but they are not accessible to SparkR workers. If you wish to share
libraries across notebooks and also workers, use cluster-scoped libraries.
Are notebook-scoped libraries cached?
There is no caching implemented for notebook-scoped libraries on a cluster. If you install a package in a
notebook, and another user installs the same package in another notebook on the same cluster, the package is
downloaded, compiled, and installed again.
Databricks File System (DBFS)
7/21/2022 • 11 minutes to read
Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and
available on Azure Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the
following benefits:
Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
Allows you to interact with object storage using directory and file semantics instead of storage URLs.
Persists files to object storage, so you won’t lose data after you terminate a cluster.
DBFS root
The default storage location in DBFS is known as the DBFS root. Several types of data are stored in the following
DBFS root locations:
/FileStore : Imported data files, generated plots, and uploaded libraries. See Special DBFS root locations.
/databricks-datasets : Sample public datasets. See Special DBFS root locations.
/databricks-results : Files generated by downloading the full results of a query.
/databricks/init : Global and cluster-named (deprecated) init scripts.
/user/hive/warehouse : Data and metadata for non-external Hive tables.
In a new workspace, the DBFS root contains the default folders listed above.
The DBFS root also contains data—including mount point metadata and credentials and certain types of logs—
that is not visible and cannot be directly accessed.
Configuration and usage recommendations
The DBFS root is created during workspace creation.
Data written to mount point paths ( /mnt ) is stored outside of the DBFS root. Even though the DBFS root is
writeable, Databricks recommends that you store data in mounted object storage rather than in the DBFS root.
The DBFS root is not intended for production customer data.
Optional encryption of DBFS root data with a customer-managed key
You can encrypt DBFS root data with a customer-managed key. See Configure customer-managed keys for DBFS root.
Special DBFS root locations
The following articles provide more detail on special DBFS root locations:
FileStore
Sample datasets (databricks-datasets)
NOTE
An admin user must enable the DBFS browser interface before you can use it. See Manage the DBFS file browser.
You can also list DBFS objects using the DBFS CLI, DBFS API 2.0, Databricks file system utility (dbutils.fs), Spark
APIs, and local file APIs. See Access DBFS.
IMPORTANT
Nested mounts are not supported. For example, the following structure is not supported:
storage1 mounted as /mnt/storage1
storage2 mounted as /mnt/storage1/storage2
Databricks recommends creating separate mount entries for each storage object:
storage1 mounted as /mnt/storage1
storage2 mounted as /mnt/storage2
Access DBFS
IMPORTANT
All users have read and write access to the objects in object storage mounted to DBFS, with the exception of the DBFS
root. For more information, see Important information about DBFS permissions.
You can upload data to DBFS using the file upload interface, and can upload and access DBFS objects using the
DBFS CLI, DBFS API 2.0, Databricks file system utility (dbutils.fs), Spark APIs, and local file APIs.
In an Azure Databricks cluster you access DBFS objects using the Databricks file system utility, Spark APIs, or local
file APIs. On a local computer you access DBFS objects using the Databricks CLI or the DBFS API.
In this section:
DBFS and local driver node paths
File upload interface
Databricks CLI
dbutils
DBFS API
Spark APIs
Local file APIs
DBFS and local driver node paths
You can work with files on DBFS or on the local driver node of the cluster. You can access the file system using
magic commands such as %fs or %sh . You can also use the Databricks file system utility (dbutils.fs).
Azure Databricks uses a FUSE mount to provide local access to files stored in the cloud. A FUSE mount is a
secure, virtual filesystem.
Access files on DBFS
The path to the default blob storage (root) is dbfs:/ .
The default location for %fs and dbutils.fs is root. Thus, to read from or write to root or an external bucket:
dbutils.fs.<command> ("/<path>/")
%sh reads from the local filesystem by default. To access root or mounted paths in root with %sh , preface the
path with /dbfs/ . A typical use case is if you are working with single node libraries like TensorFlow or scikit-
learn and want to read and write data to cloud storage.
import os
os.<command>('/dbfs/tmp')
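For instance, a minimal check of the DBFS /tmp directory from Python might look like this (the path is illustrative):
import os
os.listdir('/dbfs/tmp')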
Examples
# Default location for %fs is root
%fs ls /tmp/
%fs mkdirs /tmp/my_cloud_dir
%fs cp /tmp/test_dbfs.txt /tmp/file_b.txt
%sh reads from the local filesystem by default, so do not use file:/ :
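For example, to list a DBFS path through the FUSE mount with %sh (an illustrative command):
%sh ls /dbfs/tmp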
Examples
# With %fs and dbutils.fs, you must use file:/ to read from local filesystem
%fs ls file:/tmp
%fs mkdirs file:/tmp/my_local_dir
dbutils.fs.ls ("file:/tmp/")
dbutils.fs.put("file:/tmp/my_new_file", "This is a file on the local driver node.")
dbutils.fs.ls("/mnt/mymount")
df = spark.read.format("text").load("dbfs:/mnt/mymount/my_file.txt")
NOTE
This feature is disabled by default. An administrator must enable the DBFS browser interface before you can use it. See
Manage the DBFS file browser.
NOTE
This feature is enabled by default. If an administrator has disabled this feature, you will not have the option to upload files.
To create a table using the UI, see Upload data and create table in Databricks SQL.
To upload data for use in a notebook, follow these steps.
1. Create a new notebook or open an existing one, then click File > Upload Data
2. Select a target directory in DBFS to store the uploaded file. The target directory defaults to
/shared_uploads/<your-email-address>/ .
Uploaded files are accessible by everyone who has access to the workspace.
3. Either drag files onto the drop target or click Browse to locate files in your local filesystem.
4. When you have finished uploading the files, click Next .
If you’ve uploaded CSV, TSV, or JSON files, Azure Databricks generates code showing how to load the
data into a DataFrame.
For more information about the DBFS command-line interface, see Databricks CLI.
dbutils
dbutils.fs provides file-system-like commands to access files in DBFS. This section has several examples of how
to write files to and read files from DBFS using dbutils.fs commands.
TIP
To access the help menu for DBFS, use the dbutils.fs.help() command.
Write files to and read files from the DBFS root as if it were a local filesystem
dbutils.fs.mkdirs("/foobar/")
dbutils.fs.head("/foobar/baz.txt")
dbutils.fs.rm("/foobar/baz.txt")
display(dbutils.fs.ls("dbfs:/foobar"))
%fs ls
%fs rm -r foobar
DBFS API
See DBFS API 2.0 and Upload a big file into DBFS.
Spark APIs
When you’re using Spark APIs, you reference files with "/mnt/training/file.csv" or
"dbfs:/mnt/training/file.csv" . The following example writes the file foo.text to the DBFS /tmp directory.
df.write.format("text").save("/tmp/foo.txt")
When you use the Spark APIs to access DBFS (for example, by calling spark.read ), you must specify the full,
absolute path to the target DBFS location. The path must start from the DBFS root, represented by / or dbfs:/
, which are equivalent. For example, to read a file named people.json in the DBFS location /FileStore , you can
specify either of the following:
df = spark.read.format("json").load('dbfs:/FileStore/people.json')
df.show()
Or:
df = spark.read.format("json").load('/FileStore/people.json')
df.show()
Scala
import scala.io.Source
// Read a DBFS file through the local FUSE path (the path reuses the test file from the %fs examples above).
Source.fromFile("/dbfs/tmp/test_dbfs.txt").getLines().foreach(println)
The local file APIs do not support random writes. For workloads that require random writes, perform the operations on a local disk first and then copy the result to /dbfs , as in the following example:
# python
import xlsxwriter
from shutil import copyfile
workbook = xlsxwriter.Workbook('/local_disk0/tmp/excel.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, "Key")
worksheet.write(0, 1, "Value")
workbook.close()
copyfile('/local_disk0/tmp/excel.xlsx', '/dbfs/tmp/excel.xlsx')
The local file APIs also do not support sparse files. To copy sparse files, use cp --sparse=never :
$ cp sparse.file /dbfs/sparse.file
error writing '/dbfs/sparse.file': Operation not supported
$ cp --sparse=never sparse.file /dbfs/sparse.file
IMPORTANT
If you experience issues with FUSE V1 on Databricks Runtime 5.5 LTS, Databricks recommends that you use FUSE V2 instead.
You can override the default FUSE version in Databricks Runtime 5.5 LTS by setting the environment variable
DBFS_FUSE_VERSION=2 .
The local file APIs support only files less than 2GB in size. If you use local file system APIs to read or write files
larger than 2GB you might see corrupted files. Instead, access files larger than 2GB using the DBFS
CLI, dbutils.fs, or Spark APIs or use the /dbfs/ml folder described in Local file APIs for deep
learning.
If you write a file using the local file system APIs and then immediately try to access it using the
DBFS CLI, dbutils.fs, or Spark APIs, you might encounter a FileNotFoundException , a file of size 0,
or stale file contents. That is expected because the operating system caches writes by default. To
force those writes to be flushed to persistent storage (in our case DBFS), use the standard Unix
system call sync. For example:
// scala
import scala.sys.process._
// Write a file using the local file API (over the FUSE mount).
dbutils.fs.put("file:/dbfs/tmp/test", "test-contents")
// Flush the write to persistent storage.
"sync /dbfs/tmp/test" !
// Read the file back using the dbfs:/ path instead of the FUSE mount.
dbutils.fs.head("dbfs:/tmp/test")
For distributed deep learning applications, which require DBFS access for loading, checkpointing, and logging
data, Databricks Runtime 6.0 and above provide a high-performance /dbfs mount that’s optimized for deep
learning workloads.
In Databricks Runtime 5.5 LTS, only /dbfs/ml is optimized. In this version Databricks recommends saving data
under /dbfs/ml , which maps to dbfs:/ml .
FileStore
7/21/2022 • 3 minutes to read
FileStore is a special folder within Databricks File System (DBFS) where you can save files and have them
accessible to your web browser. You can use FileStore to:
Save files, such as images and libraries, that are accessible within HTML and JavaScript when you call
displayHTML .
Save output files that you want to download to your local desktop.
Upload CSVs and other data files from your local desktop to process on Databricks.
When you use certain features, Azure Databricks puts files in the following folders under FileStore:
/FileStore/jars - contains libraries that you upload. If you delete files in this folder, libraries that reference
these files in your workspace may no longer work.
/FileStore/tables - contains the files that you import using the UI. If you delete files in this folder, tables that
you created from these files may no longer be accessible.
/FileStore/plots - contains images created in notebooks when you call display() on a Python or R plot
object, such as a ggplot or matplotlib plot. If you delete files in this folder, you may have to regenerate
those plots in the notebooks that reference them. See Matplotlib and ggplot2 for more information.
/FileStore/import-stage - contains temporary files created when you import notebooks or Databricks
archive files. These temporary files disappear after the notebook import completes.
In the following, replace <databricks-instance> with the workspace URL of your Azure Databricks deployment.
Files stored in /FileStore are accessible in your web browser at
https://<databricks-instance>/files/<path-to-file>?o=###### . For example, the file you stored in
/FileStore/my-stuff/my-file.txt is accessible at
https://<databricks-instance>/files/my-stuff/my-file.txt?o=###### where the number after o= is the same as
in your URL.
NOTE
You can also use the DBFS file upload interfaces to put files in the /FileStore directory. See Databricks CLI.
You can upload static images using the DBFS Databricks REST API reference and the requests Python HTTP
library. In the following example:
Replace <databricks-instance> with the workspace URL of your Azure Databricks deployment.
Replace <token> with the value of your personal access token.
Replace <image-dir> with the location in FileStore where you want to upload the image files.
NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.
import requests
import json
import os
TOKEN = '<token>'
headers = {'Authorization': 'Bearer %s' % TOKEN}
url = "https://<databricks-instance>/api/2.0"
dbfs_dir = "dbfs:/FileStore/<image-dir>/"
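# The loop below calls two helper functions, mkdirs and put_file, that are not shown in this copy of
# the example. A minimal sketch of what they might look like, using the DBFS API 2.0 /dbfs/mkdirs and
# /dbfs/put endpoints (the success/error handling is an assumption chosen to match the resp check below):
import base64

def perform_query(api_path, headers, data):
    resp = requests.post(url + api_path, data=json.dumps(data), headers=headers)
    resp.raise_for_status()
    return resp

def mkdirs(path, headers):
    perform_query('/dbfs/mkdirs', headers=headers, data={'path': path})

def put_file(src_path, dbfs_path, overwrite, headers):
    with open(src_path, 'rb') as local_file:
        contents = base64.standard_b64encode(local_file.read()).decode()
    try:
        perform_query('/dbfs/put', headers=headers,
                      data={'path': dbfs_path, 'contents': contents, 'overwrite': overwrite})
        return None  # None signals success to the caller
    except requests.HTTPError as e:
        return e.response.text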
mkdirs(path=dbfs_dir, headers=headers)
files = [f for f in os.listdir('.') if os.path.isfile(f)]
for f in files:
if ".png" in f:
target_path = dbfs_dir + f
resp = put_file(src_path=f, dbfs_path=target_path, overwrite=True, headers=headers)
if resp == None:
print("Success")
else:
print(resp)
Azure Databricks enables users to mount cloud object storage to the Databricks File System (DBFS) to simplify
data access patterns for users that are unfamiliar with cloud concepts. Mounted data does not work with Unity
Catalog, and Databricks recommends migrating away from using mounts and managing data governance with
Unity Catalog.
mount(
source: str,
mountPoint: str,
encryptionType: Optional[str] = "",
extraConfigs: Optional[dict[str:str]] = None
)
Check with your workspace and cloud administrators before configuring or altering data mounts, as improper
configuration can provide unsecured access to all users in your workspace.
dbutils.fs.unmount("/mnt/<mount-name>")
IMPORTANT
Unmounting a mount point while jobs are running can lead to errors. Ensure that production jobs do not unmount
storage as part of processing.
IMPORTANT
All users in the Azure Databricks workspace have access to the mounted ADLS Gen2 account. The service principal you
use to access the ADLS Gen2 account should be granted access only to that ADLS Gen2 account; it should not be
granted access to other Azure resources.
When you create a mount point through a cluster, cluster users can immediately access the mount point. To use the
mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that running cluster to
make the newly created mount point available for use.
Unmounting a mount point while jobs are running can lead to errors. Ensure that production jobs do not unmount
storage as part of processing.
Mount points that use secrets are not automatically refreshed. If mounted storage relies on a secret that is rotated,
expires, or is deleted, errors can occur, such as 401 Unauthorized . To resolve such an error, you must unmount and
remount the storage.
Run the following in your notebook to authenticate and create a mount point.
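The mount command below references a configs dictionary that is elided in this copy. A sketch of what it contains, assuming OAuth (service principal) authentication and using the placeholders explained after the example:
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}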
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
Replace
<application-id> with the Application (client) ID for the Azure Active Directory application.
<scope-name> with the Databricks secret scope name.
<service-credential-key-name> with the name of the key containing the client secret.
<directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
<container-name> with the name of a container in the ADLS Gen2 storage account.
<storage-account-name> with the ADLS Gen2 storage account name.
<mount-name> with the name of the intended mount point in DBFS.
External Apache Hive metastore
7/21/2022 • 8 minutes to read
This article describes how to set up Azure Databricks clusters to connect to existing external Apache Hive
metastores. It provides information about recommended metastore setup and cluster configuration
requirements, followed by instructions for configuring clusters to connect to an external metastore. The
following table summarizes which Hive metastore versions are supported in each version of Databricks
Runtime.
[Table: Hive metastore versions 0.13 - 1.2.1, 2.1, 2.2, 2.3, and 3.1.0 supported by each Databricks Runtime version]
IMPORTANT
While SQL Server works as the underlying metastore database for Hive 2.0 and above, the examples throughout this
article use Azure SQL Database.
You can use a Hive 1.2.0 or 1.2.1 metastore of an HDInsight cluster as an external metastore. See Use external
metadata stores in Azure HDInsight.
If you use Azure Database for MySQL as an external metastore, you must change the value of the
lower_case_table_names property from 1 (the default) to 2 in the server-side database configuration. For details,
see Identifier Case Sensitivity.
%sh
nc -vz <DNS name> <port>
where
<DNS name> is the server name of Azure SQL Database.
<port> is the port of the database.
Cluster configurations
You must set two sets of configuration options to connect a cluster to an external metastore:
Spark options configure Spark with the Hive metastore version and the JARs for the metastore client.
Hive options configure the metastore client to connect to the external metastore.
Spark configuration options
Set spark.sql.hive.metastore.version to the version of your Hive metastore and spark.sql.hive.metastore.jars
as follows:
Hive 0.13: do not set spark.sql.hive.metastore.jars .
Hive 1.2.0 or 1.2.1 (Databricks Runtime 6.6 and below): set spark.sql.hive.metastore.jars to builtin .
NOTE
Hive 1.2.0 and 1.2.1 are not the built-in metastore on Databricks Runtime 7.0 and above. If you want to use Hive
1.2.0 or 1.2.1 with Databricks Runtime 7.0 and above, follow the procedure described in Download the metastore
jars and point to them.
Hive 2.3.7 (Databricks Runtime 7.0 - 9.x) or Hive 2.3.9 (Databricks Runtime 10.0 and above): set
spark.sql.hive.metastore.jars to builtin .
For all other Hive versions, Azure Databricks recommends that you download the metastore JARs and set
the configuration spark.sql.hive.metastore.jars to point to the downloaded JARs using the procedure
described in Download the metastore jars and point to them.
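For example, to use the built-in Hive 2.3.9 metastore client on Databricks Runtime 10.0 and above, the cluster's Spark config would include entries like the following (illustrative):
spark.sql.hive.metastore.version 2.3.9
spark.sql.hive.metastore.jars builtin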
Download the metastore jars and point to them
1. Create a cluster with spark.sql.hive.metastore.jars set to maven and spark.sql.hive.metastore.version
to match the version of your metastore.
2. When the cluster is running, search the driver log for the line that reports where the metastore JARs were downloaded.
The directory <path> in that line is the location of the downloaded JARs in the driver node of the cluster.
Alternatively you can run the following code in a Scala notebook to print the location of the JARs:
import com.typesafe.config.ConfigFactory
val path = ConfigFactory.load().getString("java.io.tmpdir")
3. Run %sh cp -r <path> /dbfs/hive_metastore_jar (replacing <path> with your cluster’s info) to copy this
directory to a directory in DBFS called hive_metastore_jar through the FUSE client in the driver node.
4. Create an init script that copies /dbfs/hive_metastore_jar to the local filesystem of the node, making sure
to make the init script sleep a few seconds before it accesses the DBFS FUSE client. This ensures that the
client is ready.
5. Set spark.sql.hive.metastore.jars to use this directory. If your init script copies
/dbfs/hive_metastore_jar to /databricks/hive_metastore_jars/ , set spark.sql.hive.metastore.jars to
/databricks/hive_metastore_jars/* . The location must include the trailing /* .
6. Restart the cluster.
Hive configuration options
This section describes options specific to Hive.
To connect to an external metastore using local mode, set the following Hive configuration options:
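The option values themselves are elided in this copy. Based on the init script example later in this article, they are the JDBC connection settings for the metastore database, set with the spark.hadoop prefix so that they propagate to the metastore client (the placeholders are explained below):
spark.hadoop.javax.jdo.option.ConnectionURL <mssql-connection-string>
spark.hadoop.javax.jdo.option.ConnectionUserName <mssql-username>
spark.hadoop.javax.jdo.option.ConnectionPassword <mssql-password>
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver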
where
<mssql-connection-string> is the JDBC connection string (which you can get in the Azure portal). You do not
need to include username and password in the connection string, because these will be set by
javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword .
<mssql-username> and <mssql-password> specify the username and password of your Azure SQL Database
account that has read/write access to the database.
NOTE
For production environments, we recommend that you set hive.metastore.schema.verification to true . This
prevents Hive metastore client from implicitly modifying the metastore database schema when the metastore client
version does not match the metastore database version. When enabling this setting for metastore client versions lower
than Hive 1.2.0, make sure that the metastore client has the write permission to the metastore database (to prevent the
issue described in HIVE-9749).
For Hive metastore 1.2.0 and higher, set hive.metastore.schema.verification.record.version to true to
enable hive.metastore.schema.verification .
For Hive metastore 2.1.1 and higher, set hive.metastore.schema.verification.record.version to true as it is
set to false by default.
Scala
dbutils.fs.put(
"/databricks/scripts/external-metastore.sh",
"""#!/bin/sh
|# Loads environment variables to determine the correct JDBC driver to use.
|source /etc/environment
|# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
|[driver] {
| # Hive specific configuration options.
| # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
| # JDBC connect string for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionURL" = "<mssql-connection-string>"
|
| # Username to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionUserName" = "<mssql-username>"
|
| # Password to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionPassword" = "<mssql-password>"
|
| # Driver class name for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionDriverName" =
"com.microsoft.sqlserver.jdbc.SQLServerDriver"
|
| # Spark specific configuration options
| "spark.sql.hive.metastore.version" = "<hive-version>"
| # Skip this one if <hive-version> is 0.13.x.
| "spark.sql.hive.metastore.jars" = "<hive-jar-source>"
|}
|EOF
|""".stripMargin,
overwrite = true
)
Python
contents = """#!/bin/sh
# Loads environment variables to determine the correct JDBC driver to use.
source /etc/environment
# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
[driver] {
# Hive specific configuration options.
# spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
# JDBC connect string for a JDBC metastore
"spark.hadoop.javax.jdo.option.ConnectionURL" = "<mssql-connection-string>"
dbutils.fs.put(
file = "/databricks/scripts/external-metastore.sh",
contents = contents,
overwrite = True
)
Troubleshooting
Clusters do not start (due to incorrect init script settings)
If an init script for setting up the external metastore causes cluster creation failure, configure the init script to
log, and debug the init script using the logs.
Error in SQL statement: InvocationTargetException
Error message pattern in the full exception stack trace:
External metastore JDBC connection information is misconfigured. Verify the configured hostname, port,
username, password, and JDBC driver class name. Also, make sure that the username has the right
privilege to access the metastore database.
Error message pattern in the full exception stack trace:
Required table missing : "`DBS`" in Catalog "" Schema "". DataNucleus requires this table to perform
its persistence operations. [...]
External metastore database not properly initialized. Verify that you created the metastore database and
put the correct database name in the JDBC connection string. Then, start a new cluster with the following
two Spark configuration options:
datanucleus.autoCreateSchema true
datanucleus.fixedDatastore false
In this way, the Hive client library will try to create and initialize tables in the metastore database
automatically when it tries to access them but finds them absent.
Error in SQL statement: AnalysisException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetastoreClient
Error message in the full exception stacktrace:
The specified datastore driver (driver name) was not found in the CLASSPATH
This guide explains how to move your production jobs from Apache Spark on other platforms to Apache Spark
on Azure Databricks.
Concepts
Databricks job
A single unit of code that you can bundle and submit to Azure Databricks. An Azure Databricks job is equivalent
to a Spark application with a single SparkContext . The entry point can be in a library (for example, JAR, egg,
wheel) or a notebook. You can run Azure Databricks jobs on a schedule with sophisticated retries and alerting
mechanisms. The primary interfaces for running jobs are the Jobs API and UI.
Pool
A set of instances in your account that are managed by Azure Databricks but incur no Azure Databricks charges
when they are idle. Submitting multiple jobs on a pool ensures your jobs start quickly. You can set guardrails
(instance types, instance limits, and so on) and autoscaling policies for the pool of instances. A pool is equivalent
to an autoscaling cluster on other Spark platforms.
Migration steps
This section provides the steps for moving your production jobs to Azure Databricks.
Step 1: Create a pool
Create an autoscaling pool. This is equivalent to creating an autoscaling cluster in other Spark platforms. On
other platforms, if instances in the autoscaling cluster are idle for a few minutes or hours, you pay for them.
Azure Databricks manages the instance pool for you for free. That is, you don’t pay Azure Databricks if these
machines are not in use; you pay only the cloud provider. Azure Databricks charges only when jobs are run on
the instances.
Key configurations:
Min Idle : Number of standby instances, not in use by jobs, that the pool maintains. You can set this to 0.
Max Capacity : This is an optional field. If you already have cloud provider instance limits set, you can leave
this field empty. If you want to set additional max limits, set a high value so that a large number of jobs can
share the pool.
Idle Instance Auto Termination : The instances over Min Idle are released back to the cloud provider if
they are idle for the specified period. The higher the value, the more the instances are kept ready and thereby
your jobs will start faster.
Step 2: Run a job on a pool
You can run a job on a pool using the Jobs API or the UI. You must run each job by providing a cluster spec.
When a job is about to start, Azure Databricks automatically creates a new cluster from the pool. The cluster is
automatically terminated when the job finishes. You are charged exactly for the amount of time your job was
run. This is the most cost-effective way to run jobs on Azure Databricks. Each new cluster has:
One associated SparkContext , which is equivalent to a Spark application on other Spark platforms.
A driver node and a specified number of workers. For a single job, you can specify a worker range. Azure
Databricks autoscales a single Spark job based on the resources needed for that job. Azure Databricks
benchmarks show that this can save you up to 30% on cloud costs, depending on the nature of your job.
There are three ways to run jobs on a pool: API/CLI, Airflow, UI.
API / CLI
1. Download and configure the Databricks CLI.
2. Run the following command to submit your code one time. The API returns a URL that you can use to
track the progress of the job run.
databricks runs submit --json
{
"run_name": "my spark job",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"instance_pool_id": "0313-121005-test123-pool-ABCD1234",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
}
],
"timeout_seconds": 3600,
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}
3. To schedule a job, use the following example. Jobs created through this mechanism are displayed in the
jobs list page. The return value is a job_id that you can use to look at the status of all the runs.
{
"name": "Nightly model training",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
...
"instance_pool_id": "0313-121005-test123-pool-ABCD1234",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
}
],
"email_notifications": {
"on_start": ["john@foo.com"],
"on_success": ["sally@foo.com"],
"on_failure": ["bob@foo.com"]
},
"timeout_seconds": 3600,
"max_retries": 2,
"schedule": {
"quartz_cron_expression": "0 15 22 ? \* \*",
"timezone_id": "America/Los_Angeles"
},
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}
If you use spark-submit to submit Spark jobs, the following table shows how spark-submit parameters map to
different arguments in the Create a new job operation ( POST /jobs/create ) in the Jobs API.
--driver-memory , --driver-cores : Based on the driver memory and cores you need, choose an appropriate instance type.
You will provide the instance type for the driver during the pool creation. Ignore these parameters during job submission.
--executor-memory , --executor-cores , --num-executors : You will provide the instance type and number of workers during the
pool creation. Ignore these parameters during job submission.
Airflow
Azure Databricks offers an Airflow operator if you want to use Airflow to submit jobs in Azure Databricks. The
Databricks Airflow operator calls the Trigger a new job run operation ( POST /jobs/run-now ) of the Jobs API to
submit jobs to Azure Databricks. See Apache Airflow.
UI
Azure Databricks provides a simple and intuitive UI to submit and schedule jobs. To create and
submit jobs from the UI, follow the step-by-step guide.
Step 3: Troubleshoot jobs
Azure Databricks provides lots of tools to help you troubleshoot your jobs.
Access logs and Spark UI
Azure Databricks maintains a fully managed Spark history server to allow you to access all the Spark logs and
Spark UI for each job run. They can be accessed from the job runs page as well as the job run details page:
Forward logs
You can also forward cluster logs to your cloud storage location. To send logs to your location of choice, use the
cluster_log_conf parameter in the new_cluster structure.
View metrics
While the job is running, you can go to the cluster page and look at the live Ganglia metrics in the Metrics tab.
Azure Databricks also snapshots these metrics every 15 minutes and stores them, so you can look at these
metrics even after your job is completed. To send metrics to your metrics server, you can install custom agents in
the cluster. See Monitor performance.
Set alerts
Use email_notifications in the Create a new job operation ( POST /jobs/create ) in the Jobs API to get alerts on
job failures.
You can also forward these email alerts to PagerDuty, Slack, and other monitoring systems.
How to set up PagerDuty alerts with emails
How to set up Slack notification with emails
This article answers typical questions that come up when you migrate single node workloads to Azure
Databricks.
I just created a 20 node Spark cluster and my pandas code doesn’t run any faster. What is going
wrong?
If you are working with any single-node libraries, they will not inherently become distributed when you switch
to using Azure Databricks. You will need to re-write your code using PySpark, the Apache Spark Python API.
Alternatively, you can use Pandas API on Spark, which allows you to use the pandas DataFrame API to access
data in Apache Spark DataFrames.
There is an algorithm in sklearn that I love, but Spark ML doesn’t support it (such as DBSCAN).
How can I use this algorithm and still take advantage of Spark?
Use joblib-spark, an Apache Spark backend for joblib to distribute tasks on a Spark cluster.
Use a pandas user-defined function.
For hyperparameter tuning, use Hyperopt.
What are my deployment options for Spark ML?
The best deployment option depends on the latency requirement of the application.
For batch predictions, see Deploy models for inference and prediction.
For streaming applications, see What is Apache Spark Structured Streaming?.
For low-latency model inference, consider MLflow Model Serving or a cloud provider-based solution
such as Azure Machine Learning.
How can I install or update pandas or another library?
There are several ways to install or update a library.
To install or update a library for all users on a cluster, see Cluster libraries.
To make a Python library or a library version available only for a specific notebook, see Notebook-scoped
Python libraries.
How can I view data on DBFS with just the driver?
Add /dbfs/ to the beginning of the file path. See Local file APIs.
How can I get data into Azure Databricks?
Mounting. See Mount object storage to DBFS.
Data tab. See Explore and create tables with the Data tab.
%sh wget
If you have a data file at a URL, you can use %sh wget <url>/<filename> to import data to a Spark
driver node.
NOTE
The cell output prints Saving to: '<filename>' , but the file is actually saved to
file:/databricks/driver/<filename> .
Load a single partition : As an optimization, you may sometimes directly load the partition of data you
are interested in. For example, spark.read.format("parquet").load("/data/date=2017-01-01") . This is
unnecessary with Delta Lake, since it can quickly read the list of files from the transaction log to find the
relevant ones. If you are interested in a single partition, specify it using a WHERE clause. For example,
spark.read.delta("/data").where("date = '2017-01-01'") . For large tables with many files in the partition,
this can be much faster than loading a single partition (with direct partition path, or with WHERE ) from a
Parquet table because listing the files in the directory is often slower than reading the list of files from the
transaction log.
When you port an existing application to Delta Lake, you should avoid the following operations, which bypass
the transaction log:
Manually modify data : Delta Lake uses the transaction log to atomically commit changes to the table.
Because the log is the source of truth, files that are written out but not added to the transaction log are
not read by Spark. Similarly, even if you manually delete a file, a pointer to the file is still present in the
transaction log. Instead of manually modifying files stored in a Delta table, always use the commands that
are described in this guide.
External readers : The data stored in Delta Lake is encoded as Parquet files. However, accessing these
files using an external reader is not safe. You’ll see duplicates and uncommitted data and the read may
fail when someone runs Remove files no longer referenced by a Delta table.
NOTE
Because the files are encoded in an open format, you always have the option to move the files outside Delta Lake. At that
point, you can run VACUUM RETAIN 0 and delete the transaction log. This leaves the table’s files in a consistent state that
can be read by the external reader of your choice.
Example
Suppose you have Parquet data stored in a directory named /data-pipeline , and you want to create a Delta
table named events .
The first example shows how to:
Read the Parquet data from its original location, /data-pipeline , into a DataFrame.
Save the DataFrame’s contents in Delta format in a separate location, /tmp/delta/data-pipeline/ .
Create the events table based on that separate location, /tmp/delta/data-pipeline/ .
The second example shows how to use CONVERT TO DELTA to convert data from Parquet to Delta format without
changing its original location, /data-pipeline/ .
Each of these examples creates an unmanaged table, where you continue to manage the data in its specified
location. Azure Databricks records the table’s name and its specified location in the metastore.
Save as Delta table
1. Read the Parquet data into a DataFrame and then save the DataFrame’s contents to a new directory in
delta format:
data = spark.read.format("parquet").load("/data-pipeline")
data.write.format("delta").save("/tmp/delta/data-pipeline/")
2. Create a Delta table named events that refers to the files in the new directory:
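A minimal sketch of that step, run from the same Python notebook as step 1 and reusing the location written above:
spark.sql("CREATE TABLE events USING DELTA LOCATION '/tmp/delta/data-pipeline/'")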
You can also convert Iceberg tables to Delta Lake using the file path in the cloud storage location.
Azure Databricks offers a variety of ways to help you ingest data into a lakehouse backed by Delta Lake.
Partner integrations
Databricks partner integrations enable you to load data into Azure Databricks. These integrations enable low-
code, scalable data ingestion from a variety of sources into Azure Databricks. See Databricks integrations.
COPY INTO
Load data with COPY INTO allows SQL users to idempotently and incrementally load data from cloud object
storage into Delta Lake tables. It can be used in Databricks SQL, notebooks, and Databricks Jobs.
Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without
additional setup. Auto Loader provides a new Structured Streaming source called cloudFiles . Given an input
directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive,
with the option of also processing existing files in that directory.
The Databricks SQL create table UI allows you to quickly upload a CSV file and create a Delta table.
NOTE
For loading files from cloud storage such as Azure Data Lake Storage Gen2, AWS S3, or Google Cloud Storage, check out
the tutorial on COPY INTO.
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Requirements
To use Create table in Databricks SQL with Unity Catalog, you need a metastore, catalog, and schema.
For Unity Catalog, you must also have the USAGE permission on the parent catalog of the selected
schema.
If your workspace is assigned to a Unity Catalog metastore, you can still create tables under schemas
in the Hive metastore.
You need USAGE and CREATE permissions on the schema you want to create a table in.
You must have a running SQL Warehouse.
NOTE
Imported files are uploaded to a secure internal location within your account which is garbage collected daily.
1. For workspaces that are assigned to a Unity Catalog metastore, you can select a catalog. If your workspace is
not assigned to a Unity Catalog metastore, the destination catalog will be hidden, and schemas will be loaded
from the Hive metastore.
To use the Hive metastore in a workspace that has been assigned to a Unity Catalog metastore, select
hive_metastore in the catalog selector.
2. Select a schema.
3. By default, the UI converts the file name to a valid table name. You can edit the table name.
Data preview
After the file upload is complete, you can preview the data (limit of 50 rows).
After the upload, the UI tries to start the endpoint selected in the top right. You can switch endpoints at any
time, but the preview and table creation require an active endpoint. If your endpoint is not active yet, it starts
automatically. This may take some time. The preview starts when your endpoint is running.
There are two ways to preview the data, vertically or horizontally. To switch between preview options, use the control above the preview.
NOTE
Schema inference does a best effort detection of column types. Changing column types may lead to certain values
being cast to NULL if the value cannot be cast correctly to the target data type. Casting BIGINT to DATE or
TIMESTAMP columns is not supported. Databricks recommends that you create a table first and then transform these
columns using SQL functions afterwards.
To support table column names with special characters, the upload-based create table UI in Databricks SQL uses Column
Mapping.
To add comments to columns, create the table and navigate to Data Explorer where you can add comments.
Known issues
Casting BIGINT to non-castable types like DATE , such as dates in the format of ‘yyyy’, may trigger errors.
Load data with COPY INTO
7/21/2022 • 3 minutes to read
The COPY INTO SQL command lets you load data from a file location into a Delta table. This is a re-triable and
idempotent operation; files in the source location that have already been loaded are skipped.
COPY INTO supports secure access in several ways, including the ability to Use temporary credentials to load
data with COPY INTO.
You can create empty placeholder Delta tables so that the schema is later inferred during a COPY INTO
command:
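A sketch of that pattern, using an illustrative table name and source path:
CREATE TABLE IF NOT EXISTS my_table;

COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');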
The SQL statement above is idempotent and can be scheduled to run to ingest data exactly-once into a Delta
table.
NOTE
The empty Delta table is not usable outside of COPY INTO . INSERT INTO and MERGE INTO are not supported to write
data into schemaless Delta tables. After data is inserted into the table with COPY INTO , the table becomes queryable.
Example
For common use patterns, see Common data loading patterns with COPY INTO
The following example shows how to create a Delta table and then use the COPY INTO SQL command to load
sample data from Sample datasets (databricks-datasets) into the table. You can run the example Python, R, Scala,
or SQL code from a notebook attached to an Azure Databricks cluster. You can also run the SQL code from a
query associated with a SQL warehouse in Databricks SQL.
Python
table_name = 'default.loan_risks_upload'
source_data = '/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet'
source_format = 'PARQUET'
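# The steps that create the table and run COPY INTO are elided in this copy of the example.
# A sketch of what they look like, with the column types inferred from the result shown below:
spark.sql(f"DROP TABLE IF EXISTS {table_name}")
spark.sql(f"CREATE TABLE {table_name} (loan_id BIGINT, funded_amnt INT, paid_amnt DOUBLE, addr_state STRING)")
spark.sql(f"""
  COPY INTO {table_name}
  FROM '{source_data}'
  FILEFORMAT = {source_format}
""")
loan_risks_upload_data = spark.table(table_name)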
display(loan_risks_upload_data)
'''
Result:
+---------+-------------+-----------+------------+
| loan_id | funded_amnt | paid_amnt | addr_state |
+=========+=============+===========+============+
| 0 | 1000 | 182.22 | CA |
+---------+-------------+-----------+------------+
| 1 | 1000 | 361.19 | WA |
+---------+-------------+-----------+------------+
| 2 | 1000 | 176.26 | TX |
+---------+-------------+-----------+------------+
...
'''
R
library(SparkR)
sparkR.session()
table_name = "default.loan_risks_upload"
source_data = "/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet"
source_format = "PARQUET"
loan_risks_upload_data = tableToDF(table_name)
display(loan_risks_upload_data)
# Result:
# +---------+-------------+-----------+------------+
# | loan_id | funded_amnt | paid_amnt | addr_state |
# +=========+=============+===========+============+
# | 0 | 1000 | 182.22 | CA |
# +---------+-------------+-----------+------------+
# | 1 | 1000 | 361.19 | WA |
# +---------+-------------+-----------+------------+
# | 2 | 1000 | 176.26 | TX |
# +---------+-------------+-----------+------------+
# ...
Scala
val table_name = "default.loan_risks_upload"
val source_data = "/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet"
val source_format = "PARQUET"
display(loan_risks_upload_data)
/*
Result:
+---------+-------------+-----------+------------+
| loan_id | funded_amnt | paid_amnt | addr_state |
+=========+=============+===========+============+
| 0 | 1000 | 182.22 | CA |
+---------+-------------+-----------+------------+
| 1 | 1000 | 361.19 | WA |
+---------+-------------+-----------+------------+
| 2 | 1000 | 176.26 | TX |
+---------+-------------+-----------+------------+
...
*/
SQL
-- Result:
-- +---------+-------------+-----------+------------+
-- | loan_id | funded_amnt | paid_amnt | addr_state |
-- +=========+=============+===========+============+
-- | 0 | 1000 | 182.22 | CA |
-- +---------+-------------+-----------+------------+
-- | 1 | 1000 | 361.19 | WA |
-- +---------+-------------+-----------+------------+
-- | 2 | 1000 | 176.26 | TX |
-- +---------+-------------+-----------+------------+
-- ...
To clean up, run the following code, which deletes the table:
Python
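# A sketch of the cleanup, assuming the table_name variable from the example above:
spark.sql(f"DROP TABLE {table_name}")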
Scala
SQL
Tutorial
Bulk load data into a table with COPY INTO in Databricks SQL
Bulk load data into a table with COPY INTO with Spark SQL
Reference
Databricks Runtime 7.x and above: COPY INTO
Databricks Runtime 5.5 LTS and 6.x: Copy Into (Delta Lake on Azure Databricks)
Use temporary credentials to load data with COPY
INTO
7/21/2022 • 2 minutes to read
If your Azure Databricks cluster or SQL warehouse doesn’t have permissions to read your source files, you
can use temporary credentials to access data from external cloud object storage and load files into a Delta Lake
table.
Depending on how your organization manages your cloud security, you may need to ask a cloud administrator
or power user to provide you with credentials.
WARNING
To avoid misuse or exposure of temporary credentials, Databricks recommends that you set expiration horizons that are
just long enough to complete the task.
COPY INTO supports loading encrypted data from AWS S3. To load encrypted data, provide the type of
encryption and the key to decrypt the data.
Learn common patterns for using COPY INTO to load data from file sources into Delta Lake.
There are many options for using COPY INTO. You can also Use temporary credentials to load data with COPY
INTO in combination with these patterns.
See COPY INTO for a full reference of all options.
Note that to infer schema with COPY INTO , you must pass additional format and copy options, as shown in the examples below.
The following example creates a schemaless Delta table called my_pipe_data and loads a pipe-delimited CSV
with a header:
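A sketch of that pattern, following the storage path convention used in the other examples in this section:
CREATE TABLE IF NOT EXISTS my_pipe_data;

COPY INTO my_pipe_data
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true', 'delimiter' = '|', 'header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')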
-- The second execution will not copy any data since the first command already loaded the data
COPY INTO my_json_data
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
FILEFORMAT = JSON
FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json')
-- The example below loads CSV files without headers on ADLS Gen2 using COPY INTO.
-- By casting the data and renaming the columns, you can put the data in the schema you want
COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM (SELECT _c0::bigint key, _c1::int index, _c2 textData
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path')
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'
The result of the COPY INTO command returns how many files were skipped due to corruption in the
num_skipped_corrupt_files column. This metric also shows up in the operationMetrics column under
numSkippedCorruptFiles after running DESCRIBE HISTORY on the Delta table.
Corrupt files aren’t tracked by COPY INTO , so they can be reloaded in a subsequent run if the corruption is fixed.
You can see which files are corrupt by running COPY INTO in VALIDATE mode.
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>
[VALIDATE ALL]
FORMAT_OPTIONS ('ignoreCorruptFiles' = 'true')
NOTE
ignoreCorruptFiles is available in Databricks Runtime 11.0 and above.
Bulk load data into a table with COPY INTO with
Spark SQL
7/21/2022 • 4 minutes to read
Databricks recommends that you use the COPY INTO command for incremental and bulk data loading for data
sources that contain thousands of files. Databricks recommends that you use Auto Loader for advanced use
cases.
In this tutorial, you use the COPY INTO command to load data from cloud object storage into a table in your
Azure Databricks workspace.
Requirements
1. An Azure subscription, an Azure Databricks workspace in that subscription, and a cluster in that workspace.
To create these, see Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal. If you
follow this quickstart, you do not need to follow the instructions in the Run a Spark SQL job section.
2. An all-purpose cluster in your workspace running Databricks Runtime 11.0 or above. To create an all-purpose
cluster, see Create a cluster.
3. Familiarity with the Azure Databricks workspace user interface. See Navigate the workspace.
4. Familiarity working with Notebooks.
5. A location you can write data to; this demo uses the DBFS root as an example, but Databricks recommends
an external storage location configured with Unity Catalog.
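The first setup cell of this tutorial is partially elided here. The SET statements below assume three Python variables; a sketch of their definitions, with illustrative values:
# Illustrative placeholders; the original tutorial derives these values differently.
username = spark.sql("SELECT current_user()").first()[0]
database = "copy_into_demo_db"
source = f"/tmp/{username}/copy-into-demo"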
spark.sql(f"SET c.username='{username}'")
spark.sql(f"SET c.database={database}")
spark.sql(f"SET c.source='{source}'")
dbutils.fs.rm(source, True)
3. Copy and run the following code to configure some tables and functions that will be used to randomly
generate data:
Because this action is idempotent, you can run it multiple times but data will only be loaded once.
2. To stop your compute resource, go to the Clusters tab and Terminate your cluster.
Additional resources
The COPY INTO reference article
Bulk load data into a table with COPY INTO in
Databricks SQL
7/21/2022 • 6 minutes to read
Databricks recommends using the COPY INTO command for incremental and bulk data loading with Databricks
SQL.
NOTE
COPY INTO works well for data sources that contain thousands of files. Databricks recommends that you use Auto
Loader for loading millions of files, which is not supported in Databricks SQL.
In this tutorial, you use the COPY INTO command to load data from an Azure Data Lake Storage Gen2 (ADLS
Gen2) container in your Azure account into a table in Databricks SQL.
Requirements
1. A Databricks SQL warehouse. To create a SQL warehouse, see Create a SQL warehouse.
2. Familiarity with the Databricks SQL user interface. See the Databricks SQL user guide.
3. An ADLS Gen2 storage account in your Azure account. To create an ADLS Gen2 storage account, see Create a
storage account to use with Azure Data Lake Storage Gen2. Make sure that your storage account has
Storage account key access set to Enabled and Soft delete set to Disabled .
4. Click Run .
5. At the bottom of the editor, click the ellipses icon, and then click Download as CSV file .
NOTE
This dataset contains almost 22,000 rows of data. This tutorial downloads only the first 1,000 rows of data. To
download all of the rows, clear the LIMIT 1000 box and then repeat steps 4-5.
4. Click Run .
Step 5: Load the sample data from cloud storage into the table
In this step, you load the CSV file from the ADLS Gen2 container into the table in your Azure Databricks
workspace.
1. In the sidebar, click Create > Query .
2. In the SQL editor’s menu bar, select the SQL warehouse that you created in the Requirements section, or
select another available SQL warehouse that you want to use.
3. In the SQL editor, paste the following code. In this code, replace:
nyctaxisampledata with the name of your ADLS Gen2 storage account.
nyctaxi with the name of the container within your storage account.
<yourBlobSASToken> with the value of Blob SAS token from Step 3.
NOTE
FORMAT_OPTIONS differs by FILEFORMAT . In this case, the header option instructs Azure Databricks to treat the
first row of the CSV file as a header, and the inferSchema options instructs Azure Databricks to automatically
determine the data type of each field in the CSV file.
4. Click Run .
NOTE
If you click Run again, no new data is loaded into the table. This is because the COPY INTO command only
processes what it considers to be new data.
Step 6: Clean up
When you are done with this tutorial, you can clean up the associated resources in your cloud account and
Azure Databricks if you no longer want to keep them.
Delete the ADLS Gen2 storage account
1. Open the Azure portal for your Azure account, typically at https://portal.azure.com.
2. Browse to and open the nyctaxisampledata storage account.
3. Click Delete .
4. Enter nyctaxisampledata , and then click Delete .
Delete the tables
1. In the sidebar, click Create > Query .
2. Select the SQL warehouse that you created in the Requirements section, or select another available SQL
warehouse that you want to use.
3. Paste the following code:
4. Click Run .
5. Hover over the tab for this query, and then click the X icon.
Delete the queries in the SQL editor
1. In your Azure Databricks workspace, in the SQL persona, click SQL Editor in the sidebar.
2. In the SQL editor’s menu bar, hover over the tab for each query that you created for this tutorial, and then
click the X icon.
Stop the SQL warehouse
If you are not using the SQL warehouse for any other tasks, you should stop the SQL warehouse to avoid
additional costs.
1. In the SQL persona, on the sidebar, click SQL Warehouses .
2. Next to the name of the SQL warehouse, click Stop .
3. When prompted, click Stop again.
Additional resources
The COPY INTO reference article
Auto Loader
7/21/2022 • 2 minutes to read
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any
additional setup.
Auto Loader provides a Structured Streaming source called cloudFiles . Given an input directory path on the
cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of
also processing existing files in that directory. Auto Loader has support for both Python and SQL in Delta Live
Tables.
You can use Auto Loader to process billions of files to migrate or backfill a table. Auto Loader scales to support
near real-time ingestion of millions of files per hour.
Getting Started
Databricks recommends using Auto Loader in Delta Live Tables for incremental data ingestion. Delta Live Tables
extends functionality in Apache Spark Structured Streaming and allows you to write just a few lines of
declarative Python or SQL to deploy a production-quality data pipeline.
Databricks recommends Auto Loader whenever you use Apache Spark Structured Streaming to ingest data from
cloud object storage. APIs are available in Python and Scala.
To get started using Auto Loader, see:
Using Auto Loader in Delta Live Tables
Using Auto Loader in Structured Streaming applications
Concepts
You can tune Auto Loader based on data volume, variety, and velocity.
Configuring schema inference and evolution in Auto Loader
Choosing between file notification and directory listing modes
Configure Auto Loader for production workloads
Reference
For a full list of Auto Loader options, see:
Auto Loader options
Tutorials
For details on how to use Auto Loader, see:
Common data loading patterns
Resources
For an overview and demonstration of Auto Loader, watch this YouTube video (59 minutes).
FAQ
Auto Loader FAQ
Using Auto Loader in Delta Live Tables
7/21/2022 • 2 minutes to read
You can use Auto Loader in your Delta Live Tables pipelines. Delta Live Tables extends functionality in Apache
Spark Structured Streaming and allows you to write just a few lines of declarative Python or SQL to deploy a
production-quality data pipeline with:
Autoscaling compute infrastructure for cost savings
Data quality checks with expectations
Automatic schema evolution handling
Monitoring via metrics in the event log
You do not need to provide a schema or checkpoint location because Delta Live Tables automatically manages
these settings for your pipelines. See Delta Live Tables data sources.
@dlt.table
def customers():
return (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.load("/databricks-datasets/retail-org/customers/")
)
@dlt.table
def sales_orders_raw():
return (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.load("/databricks-datasets/retail-org/sales_orders/")
)
SQL
You can use supported format options with Auto Loader. Using the map() function, you can pass any number of
options to the cloud_files() method. Options are key-value pairs, where the keys and values are strings. The
following describes the syntax for working with Auto Loader in SQL:
CREATE OR REFRESH STREAMING LIVE TABLE <table_name>
AS SELECT *
FROM cloud_files(
"<file_path>",
"<file_format>",
map(
"<option_key>", "<option_value",
"<option_key>", "<option_value",
...
)
)
The following example reads data from tab-delimited CSV files with a header:
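A sketch of that example, filling the syntax template above with the CSV delimiter and header options (the table name and path are placeholders):
CREATE OR REFRESH STREAMING LIVE TABLE <table_name>
AS SELECT *
FROM cloud_files(
  "<file_path>",
  "csv",
  map("delimiter", "\t", "header", "true")
)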
You can specify the schema manually; you must specify the schema for formats that do not
support schema inference:
Python
@dlt.table
def wiki_raw():
return (
spark.readStream.format("cloudFiles")
.schema("title STRING, id INT, revisionId INT, revisionTimestamp TIMESTAMP, revisionUsername STRING,
revisionUsernameId INT, text STRING")
.option("cloudFiles.format", "parquet")
.load("/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet")
)
SQL
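-- A sketch of the equivalent SQL, reusing the path and schema string from the Python example above:
CREATE OR REFRESH STREAMING LIVE TABLE wiki_raw
AS SELECT *
FROM cloud_files(
  "/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet",
  "parquet",
  map("schema", "title STRING, id INT, revisionId INT, revisionTimestamp TIMESTAMP, revisionUsername STRING, revisionUsernameId INT, text STRING")
)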
NOTE
Delta Live Tables automatically configures and manages the schema and checkpoint directories when using Auto Loader
to read files. However, if you manually configure either of these directories, performing a full refresh does not affect the
contents of the configured directories. Databricks recommends using the automatically configured directories to avoid
unexpected side effects during processing.
Using Auto Loader in Structured Streaming
applications
7/21/2022 • 4 minutes to read
Databricks recommends using Auto Loader in all Structured Streaming applications that ingest data from cloud
object storage.
Quickstart
The following code example demonstrates how Auto Loader detects new data files as they arrive in cloud
storage. You can run the example code from within a notebook attached to an Azure Databricks cluster.
1. Create the file upload directory, for example:
Python
user_dir = '<my-name>@<my-organization.com>'
upload_path = "/FileStore/shared-uploads/" + user_dir + "/population_data_upload"
dbutils.fs.mkdirs(upload_path)
Scala
val user_dir = "<my-name>@<my-organization.com>"
val upload_path = "/FileStore/shared-uploads/" + user_dir + "/population_data_upload"
dbutils.fs.mkdirs(upload_path)
2. Create the following sample CSV files, and then upload them to the file upload directory by using the
DBFS file browser:
WA.csv :
city,year,population
Seattle metro,2019,3406000
Seattle metro,2020,3433000
OR.csv :
city,year,population
Portland metro,2019,2127000
Portland metro,2020,2151000
3. Run the following code, which defines the checkpoint and write locations and starts an Auto Loader stream that writes incoming CSV files from the upload directory to the write directory:
Python
checkpoint_path = '/tmp/delta/population_data/_checkpoints'
write_path = '/tmp/delta/population_data'
Scala
val checkpoint_path = "/tmp/delta/population_data/_checkpoints"
val write_path = "/tmp/delta/population_data"
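A minimal Python sketch of the stream that this step starts, assuming the header and columns shown in the sample files above (the Scala version follows the same pattern):
df = spark.readStream.format('cloudFiles') \
  .option('cloudFiles.format', 'csv') \
  .option('header', 'true') \
  .schema('city string, year int, population long') \
  .load(upload_path)

df.writeStream.format('delta') \
  .option('checkpointLocation', checkpoint_path) \
  .start(write_path)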
4. With the code from step 3 still running, run the following code to query the data in the write directory:
Python
df_population = spark.read.format('delta').load(write_path)
display(df_population)
'''
Result:
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
'''
Scala
display(df_population)
/* Result:
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
*/
5. With the code from step 3 still running, create the following additional CSV files, and then upload them to
the upload directory by using the DBFS file browser:
ID.csv :
city,year,population
Boise,2019,438000
Boise,2020,447000
MT.csv :
city,year,population
Helena,2019,81653
Helena,2020,82590
Misc.csv :
city,year,population
Seattle metro,2021,3461000
Portland metro,2021,2174000
Boise,2021,455000
Helena,2021,81653
6. With the code from step 3 still running, run the following code to query the existing data in the write
directory, in addition to the new data from the files that Auto Loader has detected in the upload directory
and then written to the write directory:
Python
df_population = spark.read.format('delta').load(write_path)
display(df_population)
'''
Result:
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Helena | 2019 | 81653 |
+----------------+------+------------+
| Helena | 2020 | 82590 |
+----------------+------+------------+
| Boise | 2019 | 438000 |
+----------------+------+------------+
| Boise | 2020 | 447000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
| Seattle metro | 2021 | 3461000 |
+----------------+------+------------+
| Portland metro | 2021 | 2174000 |
+----------------+------+------------+
| Boise | 2021 | 455000 |
+----------------+------+------------+
| Helena | 2021 | 81653 |
+----------------+------+------------+
'''
Scala
val df_population = spark.read.format("delta").load(write_path)
display(df_population)
/* Result
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Helena | 2019 | 81653 |
+----------------+------+------------+
| Helena | 2020 | 82590 |
+----------------+------+------------+
| Boise | 2019 | 438000 |
+----------------+------+------------+
| Boise | 2020 | 447000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
| Seattle metro | 2021 | 3461000 |
+----------------+------+------------+
| Portland metro | 2021 | 2174000 |
+----------------+------+------------+
| Boise | 2021 | 455000 |
+----------------+------+------------+
| Helena | 2021 | 81653 |
+----------------+------+------------+
*/
7. To clean up, cancel the running code in step 3, and then run the following code, which deletes the upload,
checkpoint, and write directories:
Python
dbutils.fs.rm(write_path, True)
dbutils.fs.rm(upload_path, True)
Scala
dbutils.fs.rm(write_path, true)
dbutils.fs.rm(upload_path, true)
See also Tutorial: Continuously ingest data into Delta Lake with Auto Loader.
Configuring schema inference and evolution in Auto
Loader
7/21/2022 • 8 minutes to read
Auto Loader can automatically detect the introduction of new columns to your data and restart so you don’t
have to manage the tracking and handling of schema changes yourself. Auto Loader can also “rescue” data that
was unexpected (for example, of differing data types) in a JSON blob column, which you can choose to access later
using the semi-structured data access APIs.
The following formats are supported for schema inference and evolution:
FILE FORMAT | SUPPORTED VERSIONS
ORC | Unsupported
Schema inference
To infer the schema, Auto Loader samples the first 50 GB or 1000 files that it discovers, whichever limit is
crossed first. To avoid incurring this inference cost at every stream start up, and to be able to provide a stable
schema across stream restarts, you must set the option cloudFiles.schemaLocation . Auto Loader creates a
hidden directory _schemas at this location to track schema changes to the input data over time. If your stream
contains a single cloudFiles source to ingest data, you can provide the checkpoint location as
cloudFiles.schemaLocation . Otherwise, provide a unique directory for this option. If your input data returns an
unexpected schema for your stream, check that your schema location is being used by only a single Auto Loader
source.
NOTE
To change the size of the sample that's used, you can set the SQL configurations:
spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes
spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles (integer)
By default, Auto Loader infers columns in text-based file formats like CSV and JSON as string columns. In
JSON datasets, nested columns are also inferred as string columns. Since JSON and CSV data is self-
describing and can support many data types, inferring the data as string can help avoid schema evolution issues
such as numeric type mismatches (integers, longs, floats). If you want to retain the original Spark schema
inference behavior, set the option cloudFiles.inferColumnTypes to true .
NOTE
Unless case sensitivity is enabled, the columns abc , Abc , and ABC are considered the same column for the purposes
of schema inference. Which case is chosen is arbitrary and depends on the sampled data. You can use
schema hints to enforce which case should be used. Once a selection has been made and the schema is inferred, Auto
Loader does not consider casing variants that were not selected to be consistent with the schema; those columns
may instead be found in the rescued data column.
Auto Loader also attempts to infer partition columns from the underlying directory structure of the data if the
data is laid out in Hive style partitioning. For example, a file path such as
base_path/event=click/date=2021-04-01/f0.json would result in the inference of date and event as partition
columns. The data types for these columns will be strings unless you set cloudFiles.inferColumnTypes to true. If
the underlying directory structure contains conflicting Hive partitions or doesn’t contain Hive style partitioning,
the partition columns will be ignored. You can provide the option cloudFiles.partitionColumns as a comma-
separated list of column names to always try and parse the given columns from the file path if these columns
exist as key=value pairs in your directory structure.
When Auto Loader infers the schema, a rescued data column is automatically added to your schema as
_rescued_data . See the section on rescued data column and schema evolution for details.
NOTE
Binary file ( binaryFile ) and text file formats have fixed data schemas, but also support partition column inference.
The partition columns are inferred at each stream restart unless you specify cloudFiles.schemaLocation . To avoid any
potential errors or information loss, Databricks recommends setting cloudFiles.schemaLocation or
cloudFiles.partitionColumns as options for these file formats, because cloudFiles.schemaLocation is not a required
option for these formats.
Schema hints
The data types that are inferred may not always be exactly what you’re looking for. By using schema hints, you
can superimpose the information that you know and expect on an inferred schema.
By default, Apache Spark has a standard approach for inferring the type of data columns. For example, it infers
nested JSON as structs and integers as longs. In contrast, Auto Loader considers all columns as strings. When
you know that a column is of a specific data type, or if you want to choose an even more general data type (for
example, a double instead of an integer), you can provide an arbitrary number of hints for column data types
as follows:
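For example, a minimal sketch that supplies hints through the cloudFiles.schemaHints option; the column names, types, and paths shown are illustrative:
Python
spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .option("cloudFiles.schemaHints", "tags map<string,string>, version int") \
  .option("cloudFiles.schemaLocation", "<path_to_schema_location>") \
  .load("<path_to_source_data>")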
See the documentation on data types for the list of supported data types.
If a column is not present at the start of the stream, you can also use schema hints to add that column to the
inferred schema.
Here is an illustrative example of the behavior with schema hints. Suppose the inferred schema is:
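|-- date: string
|-- quantity: string
Providing the hint "date DATE, quantity INT" (a hypothetical sketch; the column names are illustrative) changes the inferred schema to:
|-- date: date
|-- quantity: int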
NOTE
Array and Map schema hints support is available in Databricks Runtime 9.1 LTS and above.
Here is an example of an inferred schema with complex datatypes to see the behavior with schema hints.
Inferred schema:
|-- products: array<string>
|-- locations: array<string>
|-- users: array<struct>
| |-- users.element: struct
| | |-- id: string
| | |-- name: string
| | |-- dob: string
|-- ids: map<string,string>
|-- names: map<string,string>
|-- prices: map<string,string>
|-- discounts: map<struct,string>
| |-- discounts.key: struct
| | |-- id: string
| |-- discounts.value: string
|-- descriptions: map<string,struct>
| |-- descriptions.key: string
| |-- descriptions.value: struct
| | |-- content: int
NOTE
Schema hints are used only if you do not provide a schema to Auto Loader. You can use schema hints whether
cloudFiles.inferColumnTypes is enabled or disabled.
Schema evolution
Auto Loader detects the addition of new columns as it processes your data. By default, addition of a new column
will cause your streams to stop with an UnknownFieldException . Before your stream throws this error, Auto
Loader performs schema inference on the latest micro-batch of data, and updates the schema location with the
latest schema. New columns are merged to the end of the schema. The data types of existing columns remain
unchanged. By setting your Auto Loader stream within an Azure Databricks job, you can get your stream to
restart automatically after such schema changes.
Auto Loader supports the following modes for schema evolution, which you set in the option
cloudFiles.schemaEvolutionMode :
addNewColumns : The default mode when a schema is not provided to Auto Loader. The streaming job fails
with an UnknownFieldException when new columns are detected. New columns are added to the schema, and existing
columns do not evolve data types. addNewColumns is not allowed when the schema of the stream is provided; you can
provide your schema as schema hints instead if you want to use this mode.
failOnNewColumns : If Auto Loader detects a new column, the stream will fail. It will not restart unless the
provided schema is updated, or the offending data file is removed.
rescue : The stream runs with the very first inferred or provided schema. Any data type changes or new
columns that are added are rescued in the rescued data column that is automatically added to your stream’s
schema as _rescued_data . In this mode, your stream will not fail due to schema changes.
none : The default mode when a schema is provided. Does not evolve the schema, new columns are ignored,
and data is not rescued unless the rescued data column is provided separately as an option.
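A minimal sketch of setting the mode explicitly; the paths are placeholders and rescue is used here only as an example value:
Python
spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .option("cloudFiles.schemaLocation", "<path_to_schema_location>") \
  .option("cloudFiles.schemaEvolutionMode", "rescue") \
  .load("<path_to_source_data>")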
Partition columns are not considered for schema evolution. If you had an initial directory structure like
base_path/event=click/date=2021-04-01/f0.json , and then start receiving new files as
base_path/event=click/date=2021-04-01/hour=01/f1.json , the hour column is ignored. To capture information for
new partition columns, set cloudFiles.partitionColumns to event,date,hour .
Limitations
Schema evolution is not supported in Python applications running on Databricks Runtime 8.2 and 8.3 that
use foreachBatch . You can use foreachBatch in Scala instead.
Choosing between file notification and directory
listing modes
7/21/2022 • 15 minutes to read
Auto Loader supports two modes for detecting new files: directory listing and file notification.
Directory listing : Auto Loader identifies new files by listing the input directory. Directory listing mode
allows you to quickly start Auto Loader streams without any permission configurations other than access to
your data on cloud storage. In Databricks Runtime 9.1 and above, Auto Loader can automatically detect
whether files are arriving with lexical ordering to your cloud storage and significantly reduce the amount of
API calls it needs to make to detect new files. See Incremental Listing for more details.
File notification : Auto Loader can automatically set up a notification service and queue service that
subscribe to file events from the input directory. File notification mode is more performant and scalable for
large input directories or a high volume of files but requires additional cloud permissions for set up. See
Leveraging file notifications for more details.
The availability of these modes is listed below.
As files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the checkpoint
location of your Auto Loader pipeline. This key-value store ensures that data is processed exactly once. You can
switch file discovery modes across stream restarts and still obtain exactly-once data processing guarantees. In
fact, this is how Auto Loader can both perform a backfill on a directory containing existing files and concurrently
process new files that are being discovered through file notifications.
In case of failures, Auto Loader can resume from where it left off using information stored in the checkpoint
location and continue to provide exactly-once guarantees when writing data into Delta Lake. You don’t need to
maintain or manage any state yourself to achieve fault tolerance or exactly-once semantics.
Auto Loader can discover files on cloud storage systems using directory listing more efficiently than other
alternatives. For example, if you had files being uploaded every 5 minutes as /some/path/YYYY/MM/DD/HH/fileName
, to find all the files in these directories, the Apache Spark file source would list all subdirectories in parallel,
causing 1 (base directory) + 365 (days) * 24 (hours per day) = 8761 LIST API directory calls to storage. By
receiving a flattened response from storage, Auto Loader reduces the number of API calls to the number of files
in storage divided by the number of results returned by each API call (1000 with S3, 5000 with ADLS Gen2, and
1024 with GCS), greatly reducing your cloud costs.
Incremental Listing
NOTE
Available in Databricks Runtime 9.1 LTS and above.
For lexicographically generated files, Auto Loader now can leverage the lexical file ordering and optimized listing
APIs to improve the efficiency of directory listing by listing from recently ingested files rather than listing the
contents of the entire directory.
By default, Auto Loader will automatically detect whether a given directory is applicable for incremental listing
by checking and comparing file paths of previously completed directory listings. To ensure eventual
completeness of data in auto mode, Auto Loader will automatically trigger a full directory list after completing
7 consecutive incremental lists. You can control the frequency of full directory lists by setting
cloudFiles.backfillInterval to trigger asynchronous backfills at a given interval.
You can explicitly enable or disable incremental listing by setting cloudFiles.useIncrementalListing to "true"
or "false" (default "auto" ). When explicitly enabled, Auto Loader will not trigger full directory lists unless a
backfill interval is set. Services such as AWS Kinesis Firehose, AWS DMS, and Azure Data Factory can be
configured to upload files to a storage system in lexical order. See the Appendix for more examples of
lexical directory structures.
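A minimal sketch of setting these options; the values and paths shown are illustrative:
Python
spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .option("cloudFiles.useIncrementalListing", "auto") \
  .option("cloudFiles.backfillInterval", "1 day") \
  .option("cloudFiles.schemaLocation", "<path_to_schema_location>") \
  .load("<path_to_source_data>")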
CLOUD STORAGE | SUBSCRIPTION SERVICE | QUEUE SERVICE | PREFIX (1) | LIMIT (2)
ADLS Gen2 | Azure Event Grid | Azure Queue Storage | databricks | 500 per storage account
GCS | Google Pub/Sub | Google Pub/Sub | databricks-auto-ingest | 100 per GCS bucket
Azure Blob Storage | Azure Event Grid | Azure Queue Storage | databricks | 500 per storage account
NOTE
Cloud providers do not guarantee 100% delivery of all file events under very rare conditions and do not provide any strict
SLAs on the latency of the file events. Databricks recommends that you trigger regular backfills with Auto Loader by using
the cloudFiles.backfillInterval option to guarantee that all files are discovered within a given SLA if data
completeness is a requirement. Triggering regular backfills will not cause duplicates.
If you require running more than the limited number of file notification pipelines for a given storage account,
you can:
Consider rearchitecting how files are uploaded to leverage incremental listing instead of file notifications
Leverage a service such as AWS Lambda, Azure Functions, or Google Cloud Functions to fan out notifications
from a single queue that listens to an entire container or bucket into directory specific queues
File notification events
AWS S3 provides an ObjectCreated event when a file is uploaded to an S3 bucket regardless of whether it was
uploaded by a put or multi-part upload.
ADLS Gen2 provides different event notifications for files appearing in your Gen2 container.
Auto Loader listens for the FlushWithClose event for processing a file.
Auto Loader streams created with Databricks Runtime 8.3 and after support the RenameFile action for
discovering files. RenameFile actions will require an API request to the storage system to get the size of the
renamed file.
Auto Loader streams created with Databricks Runtime 9.0 and after support the RenameDirectory action for
discovering files. RenameDirectory actions will require API requests to the storage system to list the contents
of the renamed directory.
Google Cloud Storage provides an OBJECT_FINALIZE event when a file is uploaded, which includes overwrites
and file copies. Failed uploads do not generate this event.
Managing file notification resources
You can use Scala APIs to manage the notification and queuing services created by Auto Loader. You must
configure the resource setup permissions described in Permissions before using this API.
/////////////////////////////////////
// Creating a ResourceManager in AWS
/////////////////////////////////////
import com.databricks.sql.CloudFilesAWSResourceManager
val manager = CloudFilesAWSResourceManager
.newManager
.option("cloudFiles.region", <region>) // optional, will use the region of the EC2 instances by default
.option("path", <path-to-specific-bucket-and-folder>) // required only for setUpNotificationServices
.create()
///////////////////////////////////////
// Creating a ResourceManager in Azure
///////////////////////////////////////
import com.databricks.sql.CloudFilesAzureResourceManager
val manager = CloudFilesAzureResourceManager
.newManager
.option("cloudFiles.connectionString", <connection-string>)
.option("cloudFiles.resourceGroup", <resource-group>)
.option("cloudFiles.subscriptionId", <subscription-id>)
.option("cloudFiles.tenantId", <tenant-id>)
.option("cloudFiles.clientId", <service-principal-client-id>)
.option("cloudFiles.clientSecret", <service-principal-client-secret>)
.option("path", <path-to-specific-container-and-folder>) // required only for setUpNotificationServices
.create()
///////////////////////////////////////
// Creating a ResourceManager in GCP
///////////////////////////////////////
import com.databricks.sql.CloudFilesGCPResourceManager
val manager = CloudFilesGCPResourceManager
.newManager
.option("path", <path-to-specific-bucket-and-folder>) // Required only for setUpNotificationServices.
.create()
// Set up a queue and a topic subscribed to the path provided in the manager.
manager.setUpNotificationServices(<resource-suffix>)
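// List notification services created by Auto Loader (the "list result" referenced below);
// a sketch assuming the resource manager's listNotificationServices API.
val notificationServices = manager.listNotificationServices()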
// Tear down the notification services created for a specific stream ID.
// Stream ID is a GUID string that you can find in the list result above.
manager.tearDownNotificationServices(<stream-id>)
<path_to_table>/_delta_log/00000000000000000000.json
<path_to_table>/_delta_log/00000000000000000001.json <- guaranteed to be written after version 0
<path_to_table>/_delta_log/00000000000000000002.json <- guaranteed to be written after version 1
...
database_schema_name/table_name/LOAD00000001.csv
database_schema_name/table_name/LOAD00000002.csv
...
// <base_path>/yyyy/MM/dd/HH:mm:ss-randomString
<base_path>/2021/12/01/10:11:23-b1662ecd-e05e-4bb7-a125-ad81f6e859b4.json
<base_path>/2021/12/01/10:11:23-b9794cf3-3f60-4b8d-ae11-8ea320fad9d1.json
...
// <base_path>/year=yyyy/month=MM/day=dd/hour=HH/minute=mm/randomString
<base_path>/year=2021/month=12/day=04/hour=08/minute=22/442463e5-f6fe-458a-8f69-a06aa970fc69.csv
<base_path>/year=2021/month=12/day=04/hour=08/minute=22/8f00988b-46be-4112-808d-6a35aead0d44.csv <- this may
be uploaded before the file above as long as processing happens less frequently than a minute
When files are uploaded with date partitioning, some things to keep in mind are:
Months, days, hours, and minutes need to be left-padded with zeros to ensure lexical ordering (for example,
upload as hour=03 instead of hour=3 , and 2021/05/03 instead of 2021/5/3 ).
Files don't necessarily have to be uploaded in lexical order in the deepest directory, as long as processing
happens less frequently than the parent directory's time granularity.
Some services that can upload files in a date partitioned lexical ordering are:
Azure Data Factory can be configured to upload files in a lexical order. See an example here.
Kinesis Firehose
Required permissions for setting up file notification resources
ADLS Gen2 and Azure Blob Storage
You must have read permissions for the input directory. See Azure Blob Storage.
To use file notification mode, you must provide authentication credentials for setting up and accessing the event
notification services. In Databricks Runtime 8.1 and above, you only need a service principal for authentication.
For Databricks Runtime 8.0 and below, you must provide both a service principal and a connection string.
Service principal - using Azure built-in roles
Create an Azure Active Directory app and service principal in the form of client ID and client secret.
Assign this app the following roles to the storage account in which the input path resides:
Contributor : This role is for setting up resources in your storage account, such as queues and event
subscriptions.
Storage Queue Data Contributor : This role is for performing queue operations such as retrieving
and deleting messages from the queues. This role is required in Databricks Runtime 8.1 and above
only when you provide a service principal without a connection string.
Assign this app the following role to the related resource group:
EventGrid EventSubscription Contributor : This role is for performing event grid subscription
operations such as creating or listing event subscriptions.
For more information, see Assign Azure roles using the Azure portal.
Service principal - using custom role
If you are concerned about the excessive permissions required for the preceding roles, you can create a
Custom Role with at least the following permissions, listed below in Azure role JSON format:
"permissions": [
{
"actions": [
"Microsoft.EventGrid/eventSubscriptions/write",
"Microsoft.EventGrid/eventSubscriptions/read",
"Microsoft.EventGrid/eventSubscriptions/delete",
"Microsoft.EventGrid/locations/eventSubscriptions/read",
"Microsoft.Storage/storageAccounts/read",
"Microsoft.Storage/storageAccounts/write",
"Microsoft.Storage/storageAccounts/queueServices/read",
"Microsoft.Storage/storageAccounts/queueServices/write",
"Microsoft.Storage/storageAccounts/queueServices/queues/write",
"Microsoft.Storage/storageAccounts/queueServices/queues/read",
"Microsoft.Storage/storageAccounts/queueServices/queues/delete"
],
"notActions": [],
"dataActions": [
"Microsoft.Storage/storageAccounts/queueServices/queues/messages/delete",
"Microsoft.Storage/storageAccounts/queueServices/queues/messages/read",
"Microsoft.Storage/storageAccounts/queueServices/queues/messages/write",
"Microsoft.Storage/storageAccounts/queueServices/queues/messages/process/action"
],
"notDataActions": []
}
]
AWS S3
You must have read permissions for the input directory. See S3 connection details for more details.
To use file notification mode, attach the following JSON policy document to your IAM user or role.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DatabricksAutoLoaderSetup",
"Effect": "Allow",
"Action": [
"s3:GetBucketNotification",
"s3:PutBucketNotification",
"sns:ListSubscriptionsByTopic",
"sns:GetTopicAttributes",
"sns:SetTopicAttributes",
"sns:CreateTopic",
"sns:TagResource",
"sns:Publish",
"sns:Subscribe",
"sqs:CreateQueue",
"sqs:DeleteMessage",
"sqs:DeleteMessageBatch",
"sqs:ReceiveMessage",
"sqs:SendMessage",
"sqs:GetQueueUrl",
"sqs:GetQueueAttributes",
"sqs:SetQueueAttributes",
"sqs:TagQueue",
"sqs:ChangeMessageVisibility",
"sqs:ChangeMessageVisibilityBatch"
],
"Resource": [
"arn:aws:s3:::<bucket-name>",
"arn:aws:sqs:<region>:<account-number>:databricks-auto-ingest-*",
"arn:aws:sns:<region>:<account-number>:databricks-auto-ingest-*"
]
},
{
"Sid": "DatabricksAutoLoaderList",
"Effect": "Allow",
"Action": [
"sqs:ListQueues",
"sqs:ListQueueTags",
"sns:ListTopics"
],
"Resource": "*"
},
{
"Sid": "DatabricksAutoLoaderTeardown",
"Effect": "Allow",
"Action": [
"sns:Unsubscribe",
"sns:DeleteTopic",
"sqs:DeleteQueue"
],
"Resource": [
"arn:aws:sqs:<region>:<account-number>:databricks-auto-ingest-*",
"arn:aws:sns:<region>:<account-number>:databricks-auto-ingest-*"
]
}
]
}
where:
<bucket-name> : The S3 bucket name where your stream will read files, for example, auto-logs . You can use
* as a wildcard, for example, databricks-*-logs . To find out the underlying S3 bucket for your DBFS path,
you can list all the DBFS mount points in a notebook by running %fs mounts .
<region> : The AWS region where the S3 bucket resides, for example, us-west-2 . If you don’t want to specify
the region, use * .
<account-number> : The AWS account number that owns the S3 bucket, for example, 123456789012 . If you don't
want to specify the account number, use * .
The string databricks-auto-ingest-* in the SQS and SNS ARN specification is the name prefix that the
cloudFiles source uses when creating SQS and SNS services. Since Azure Databricks sets up the notification
services in the initial run of the stream, you can use a policy with reduced permissions after the initial run (for
example, stop the stream and then restart it).
NOTE
The preceding policy is concerned only with the permissions needed for setting up file notification services, namely S3
bucket notification, SNS, and SQS services and assumes you already have read access to the S3 bucket. If you need to add
S3 read-only permissions, add the following to the Action list in the DatabricksAutoLoaderSetup statement in the
JSON document:
s3:ListBucket
s3:GetObject
IMPORTANT
With the reduced permissions, you won't be able to start new streaming queries or recreate resources in case of failures (for
example, the SQS queue has been accidentally deleted); you also won’t be able to use the cloud resource management
API to list or tear down resources.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DatabricksAutoLoaderUse",
"Effect": "Allow",
"Action": [
"s3:GetBucketNotification",
"sns:ListSubscriptionsByTopic",
"sns:GetTopicAttributes",
"sns:TagResource",
"sns:Publish",
"sqs:DeleteMessage",
"sqs:DeleteMessageBatch",
"sqs:ReceiveMessage",
"sqs:SendMessage",
"sqs:GetQueueUrl",
"sqs:GetQueueAttributes",
"sqs:TagQueue",
"sqs:ChangeMessageVisibility",
"sqs:ChangeMessageVisibilityBatch"
],
"Resource": [
"arn:aws:sqs:<region>:<account-number>:<queue-name>",
"arn:aws:sns:<region>:<account-number>:<topic-name>",
"arn:aws:s3:::<bucket-name>"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::<bucket-name>"
]
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:PutObjectAcl",
"s3:GetObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::<bucket-name>/*"
]
},
{
"Sid": "DatabricksAutoLoaderListTopics",
"Effect": "Allow",
"Action": [
"sqs:ListQueues",
"sqs:ListQueueTags",
"sns:ListTopics"
],
"Resource": "arn:aws:sns:<region>:<account-number>:*"
}
]
}
To securely ingest data from an S3 bucket owned by a different AWS account by assuming an IAM role, you can set Spark configurations such as:
fs.s3a.credentialsType AssumeRole
fs.s3a.stsAssumeRole.arn arn:aws:iam::<bucket-owner-acct-id>:role/MyRoleB
fs.s3a.acl.default BucketOwnerFullControl
GCS
You must have list and get permissions on your GCS bucket and on all the objects. For details, see the
Google documentation on IAM permissions.
To use file notification mode, you need to add permissions for the GCS service account and the account used to
access the Google Cloud Pub/Sub resources.
Add the Pub/Sub Publisher role to the GCS service account. This will allow the account to publish event
notification messages from your GCS buckets to Google Cloud Pub/Sub.
As for the service account used for the Google Cloud Pub/Sub resources, you will need to add the following
permissions:
pubsub.subscriptions.consume
pubsub.subscriptions.create
pubsub.subscriptions.delete
pubsub.subscriptions.get
pubsub.subscriptions.list
pubsub.subscriptions.update
pubsub.topics.attachSubscription
pubsub.topics.create
pubsub.topics.delete
pubsub.topics.get
pubsub.topics.list
pubsub.topics.update
To do this, you can either create an IAM custom role with these permissions or assign pre-existing GCP roles to
cover these permissions.
Finding the GCS Service Account
In the Google Cloud Console for the corresponding project, navigate to Cloud Storage > Settings . On that page,
you should see a section titled “Cloud Storage Service Account” containing the email of the GCS service account.
Creating a Custom Google Cloud IAM Role for File Notification Mode
In the Google Cloud console for the corresponding project, navigate to IAM & Admin > Roles . Then, either create
a role at the top or update an existing role. In the screen for role creation or edit, click Add Permissions . A menu
should then pop up in which you can add the desired permissions to the role.
Troubleshooting
Error :
If you see this error message when you run Auto Loader for the first time, it means Event Grid is not registered as a
resource provider in your Azure subscription. To register it in the Azure portal:
1. Go to your subscription.
2. Click Resource Providers under the Settings section.
3. Register the provider Microsoft.EventGrid .
Configure Auto Loader for production workloads
Databricks recommends that you follow the streaming best practices for running Auto Loader in production.
Databricks recommends using Auto Loader in Delta Live Tables for incremental data ingestion. Delta Live Tables
extends functionality in Apache Spark Structured Streaming and allows you to write just a few lines of
declarative Python or SQL to deploy a production-quality data pipeline with:
Autoscaling compute infrastructure for cost savings
Data quality checks with expectations
Automatic schema evolution handling
Monitoring via metrics in the event log
NOTE
The cloud_files_state function is available in Databricks Runtime 10.5 and above.
Auto Loader provides a SQL API for inspecting the state of a stream. Using the cloud_files_state function, you
can find metadata about files that have been discovered by an Auto Loader stream. Simply query from
cloud_files_state , providing the checkpoint location associated with an Auto Loader stream.
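For example, with the checkpoint path as a placeholder:
SELECT * FROM cloud_files_state('<path_to_checkpoint>');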
{
"sources" : [
{
"description" : "CloudFilesSource[/path/to/source]",
"metrics" : {
"numFilesOutstanding" : "238",
"numBytesOutstanding" : "163939124006"
}
}
]
}
In Databricks Runtime 10.1 and later, when using file notification mode, the metrics will also include the
approximate number of file events that are in the cloud queue as approximateQueueSize for AWS and Azure.
Cost considerations
When running Auto Loader, your main sources of cost are compute resources and file discovery.
To reduce compute costs, Databricks recommends using Databricks Jobs to schedule Auto Loader as batch jobs
using Trigger.AvailableNow (in Databricks Runtime 10.1 and later) or Trigger.Once instead of running it
continuously, as long as you don't have low latency requirements.
File discovery costs can come in the form of LIST operations on your storage accounts in directory listing mode
and API requests to the subscription service and queue service in file notification mode. To reduce file
discovery costs, Databricks recommends:
Providing a ProcessingTime trigger when running Auto Loader continuously in directory listing mode
Architecting file uploads to your storage account in lexical ordering to leverage Incremental Listing when
possible
Using Databricks Runtime 9.0 or later in directory listing mode, especially for deeply nested directories
Leveraging file notifications when incremental listing is not possible
Using resource tags to tag resources created by Auto Loader to track your costs
Auto Loader can be scheduled to run in Databricks Jobs as a batch job by using Trigger.AvailableNow . The
AvailableNow trigger will instruct Auto Loader to process all files that arrived before the query start time. New
files that are uploaded after the stream has started will be ignored until the next trigger.
With Trigger.AvailableNow , file discovery will happen asynchronously with data processing and data can be
processed across multiple micro-batches with rate limiting. Auto Loader by default processes a maximum of
1000 files every micro-batch. You can set cloudFiles.maxFilesPerTrigger and
cloudFiles.maxBytesPerTrigger to control how many files or how many bytes should be processed in a micro-
batch. The file limit is a hard limit but the byte limit is a soft limit, meaning that more bytes can be processed
than the provided maxBytesPerTrigger . When both options are provided together, Auto Loader processes
as many files as are needed to reach one of the limits.
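A minimal Python sketch of a scheduled run; the paths and format are placeholders, and it assumes a runtime where the Python trigger(availableNow=True) API is available:
Python
df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .option("cloudFiles.schemaLocation", "<path_to_schema_location>") \
  .load("<path_to_source_data>")

df.writeStream \
  .trigger(availableNow=True) \
  .option("checkpointLocation", "<path_to_checkpoint>") \
  .start("<path_to_target>")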
Event retention
NOTE
Available in Databricks Runtime 8.4 and above.
Auto Loader keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once
ingestion guarantees. For high volume datasets, you can use the cloudFiles.maxFileAge option to expire events
from the checkpoint location to reduce your storage costs and Auto Loader start up time. The minimum value
that you can set for cloudFiles.maxFileAge is "14 days" . Deletes in RocksDB appear as tombstone entries,
therefore you should expect the storage usage to increase temporarily as events expire before it starts to level
off.
WARNING
cloudFiles.maxFileAge is provided as a cost control mechanism for high volume datasets, ingesting in the order of
millions of files every hour. Tuning cloudFiles.maxFileAge incorrectly can lead to data quality issues. Therefore,
Databricks doesn’t recommend tuning this parameter unless absolutely required.
Trying to tune the cloudFiles.maxFileAge option can lead to unprocessed files being ignored by Auto Loader or
already processed files expiring and then being reprocessed, causing duplicate data. Here are some things to
consider when choosing a cloudFiles.maxFileAge value:
If your stream restarts after a long time, file notification events that are pulled from the queue that are older
than cloudFiles.maxFileAge are ignored. Similarly, if you use directory listing, files that may have appeared
during the down time that are older than cloudFiles.maxFileAge are ignored.
If you use directory listing mode with cloudFiles.maxFileAge set to, for example, "1 month" , then stop
your stream and restart it with cloudFiles.maxFileAge set to "2 months" , all files that are older
than 1 month but more recent than 2 months are reprocessed.
The best approach to tuning cloudFiles.maxFileAge is to start from a generous expiration, for example,
"1 year" , and work downwards to something like "9 months" . If you set this option the first time you start
the stream, you will not ingest data older than cloudFiles.maxFileAge ; therefore, if you want to ingest old data,
you should not set this option when you start your stream.
Auto Loader options
7/21/2022 • 21 minutes to read
Configuration options specific to the cloudFiles source are prefixed with cloudFiles so that they are in a
separate namespace from other Structured Streaming source options.
Common Auto Loader options
Directory listing options
File notification options
File format options
Generic options
JSON options
CSV options
PARQUET options
AVRO options
BINARYFILE options
TEXT options
ORC options
Cloud specific options
AWS specific options
Azure specific options
Google specific options
O P T IO N
cloudFiles.allowOverwrites
Type: Boolean
Whether to allow input directory file changes to overwrite existing data. Available in Databricks Runtime 7.6 and above.
cloudFiles.backfillInterval
cloudFiles.format
Type: String
The data file format in the source path. Allowed values include:
cloudFiles.includeExistingFiles
Type: Boolean
Whether to include existing files in the stream processing input path or to only process new files arriving after initial setup.
This option is evaluated only when you start a stream for the first time. Changing this option after restarting the stream has
no effect.
cloudFiles.inferColumnTypes
Type: Boolean
Whether to infer exact column types when leveraging schema inference. By default, columns are inferred as strings when
inferring JSON and CSV datasets. See schema inference for more details.
cloudFiles.maxBytesPerTrigger
The maximum number of new bytes to be processed in every trigger. You can specify a byte string such as 10g to limit each
microbatch to 10 GB of data. This is a soft maximum. If you have files that are 3 GB each, Azure Databricks processes 12 GB in
a microbatch. When used together with cloudFiles.maxFilesPerTrigger , Azure Databricks consumes up to the lower limit
of cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger , whichever is reached first. This option has no
effect when used with Trigger.Once() .
cloudFiles.maxFileAge
How long a file event is tracked for deduplication purposes. Databricks does not recommend tuning this parameter unless you
are ingesting data at the order of millions of files an hour. See the section on Event retention for more details.
cloudFiles.maxFilesPerTrigger
Type: Integer
The maximum number of new files to be processed in every trigger. When used together with
cloudFiles.maxBytesPerTrigger , Azure Databricks consumes up to the lower limit of cloudFiles.maxFilesPerTrigger or
cloudFiles.maxBytesPerTrigger , whichever is reached first. This option has no effect when used with Trigger.Once() .
cloudFiles.partitionColumns
Type: String
A comma separated list of Hive style partition columns that you would like inferred from the directory structure of the files.
Hive style partition columns are key value pairs combined by an equality sign such as
<base_path>/a=x/b=1/c=y/file.format . In this example, the partition columns are a , b , and c . By default these
columns will be automatically added to your schema if you are using schema inference and provide the <base_path> to load
data from. If you provide a schema, Auto Loader expects these columns to be included in the schema. If you do not want
these columns as part of your schema, you can specify "" to ignore these columns. In addition, you can use this option
when you want columns to be inferred from the file path in complex directory structures, like the example below:
<base_path>/year=2022/week=1/file1.csv
<base_path>/year=2022/month=2/day=3/file2.csv
<base_path>/year=2022/month=2/day=4/file3.csv
cloudFiles.schemaEvolutionMode
Type: String
The mode for evolving the schema as new columns are discovered in the data. By default, columns are inferred as strings
when inferring JSON datasets. See schema evolution for more details.
cloudFiles.schemaHints
Type: String
Schema information that you provide to Auto Loader during schema inference. See schema hints for more details.
cloudFiles.schemaLocation
Type: String
The location to store the inferred schema and subsequent changes. See schema inference for more details.
cloudFiles.validateOptions
Type: Boolean
Whether to validate Auto Loader options and return an error for unknown or inconsistent options.
O P T IO N
cloudFiles.useIncrementalListing
Type: String
Whether to use the incremental listing rather than the full listing in directory listing mode. By default, Auto Loader will make
the best effort to automatically detect if a given directory is applicable for the incremental listing. You can explicitly use the
incremental listing or use the full directory listing by setting it as true or false respectively.
O P T IO N
cloudFiles.fetchParallelism
Type: Integer
Number of threads to use when fetching messages from the queueing service.
Default value: 1
cloudFiles.pathRewrites
Required only if you specify a queueUrl that receives file notifications from multiple S3 buckets and you want to leverage
mount points configured for accessing data in these containers. Use this option to rewrite the prefix of the bucket/key path
with the mount point. Only prefixes can be rewritten. For example, for the configuration
{"<databricks-mounted-bucket>/path": "dbfs:/mnt/data-warehouse"} , the path
s3://<databricks-mounted-bucket>/path/2017/08/fileA.json is rewritten to
dbfs:/mnt/data-warehouse/2017/08/fileA.json .
cloudFiles.resourceTags
A series of key-value tag pairs to help associate and identify related resources, for example:
cloudFiles.option("cloudFiles.resourceTag.myFirstKey", "myFirstValue")
.option("cloudFiles.resourceTag.mySecondKey", "mySecondValue")
For more information on AWS, see Amazon SQS cost allocation tags and Configuring tags for an Amazon SNS topic. (1)
For more information on Azure, see Naming Queues and Metadata and the coverage of properties.labels in Event
Subscriptions. Auto Loader stores these key-value tag pairs in JSON as labels. (1)
For more information on GCP, see Reporting usage with labels. (1)
cloudFiles.useNotifications
Type: Boolean
Whether to use file notification mode to determine when there are new files. If false , use directory listing mode. See How
Auto Loader works.
(1) Auto Loader adds the following key-value tag pairs by default on a best-effort basis:
vendor : Databricks
path : The location from where the data is loaded. Unavailable in GCP due to labeling limitations.
checkpointLocation : The location of the stream’s checkpoint. Unavailable in GCP due to labeling limitations.
streamId : A globally unique identifier for the stream.
These key names are reserved and you cannot overwrite their values.
Generic options
The following options apply to all file formats.
O P T IO N
ignoreCorruptFiles
Type: Boolean
Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents
that have been read will still be returned. Observable as numSkippedCorruptFiles in the
operationMetrics column of the Delta Lake history. Available in Databricks Runtime 11.0 and above.
ignoreMissingFiles
Type: Boolean
Whether to ignore missing files. If true, the Spark jobs will continue to run when encountering missing files and the contents
that have been read will still be returned. Available in Databricks Runtime 11.0 and above.
modifiedAfter
An optional timestamp to ingest files that have a modification timestamp after the provided timestamp.
modifiedBefore
An optional timestamp to ingest files that have a modification timestamp before the provided timestamp.
pathGlobFilter
Type: String
recursiveFileLookup
Type: Boolean
Whether to load data recursively within the base directory and skip partition inference.
JSON options
O P T IO N
allowBackslashEscapingAnyCharacter
Type: Boolean
Whether to allow backslashes to escape any character that succeeds it. If not enabled, only characters that are explicitly listed
by the JSON specification can be escaped.
allowComments
Type: Boolean
Whether to allow the use of Java, C, and C++ style comments ( '/' , '*' , and '//' varieties) within parsed content or
not.
allowNonNumericNumbers
Type: Boolean
Whether to allow the set of not-a-number ( NaN ) tokens as legal floating number values.
allowNumericLeadingZeros
Type: Boolean
Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, 000001).
allowSingleQuotes
Type: Boolean
Whether to allow use of single quotes (apostrophe, character '\'' ) for quoting strings (names and String values).
allowUnquotedControlChars
Type: Boolean
Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab
and line feed characters) or not.
allowUnquotedFieldNames
Type: Boolean
Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification).
badRecordsPath
Type: String
The path to store files for recording the information about bad JSON records.
columnNameOfCorruptRecord
Type: String
The column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED ,
this column will be empty.
dateFormat
Type: String
dropFieldIfAllNull
Type: Boolean
Whether to ignore columns of all null values or empty arrays and structs during schema inference.
encoding or charset
Type: String
The name of the encoding of the JSON files. See java.nio.charset.Charset for list of options. You cannot use UTF-16 and
UTF-32 when multiline is true .
inferTimestamp
Type: Boolean
lineSep
Type: String
locale
Type: String
A java.util.Locale identifier. Influences default date, timestamp, and decimal parsing within the JSON.
Default value: US
mode
Type: String
multiLine
Type: Boolean
prefersDecimal
Type: Boolean
primitivesAsString
Type: Boolean
rescuedDataColumn
Type: String
Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to
a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data column.
timestampFormat
Type: String
timeZone
Type: String
CSV options
O P T IO N
badRecordsPath
Type: String
The path to store files for recording the information about bad CSV records.
charToEscapeQuoteEscaping
Type: Char
The character used to escape the character used for escaping quotes. For example, for the following record: [ " a\\", b ] :
* If the character to escape the '\' is undefined, the record won’t be parsed. The parser will read characters:
[a],[\],["],[,],[ ],[b] and throw an error because it cannot find a closing quote.
* If the character to escape the '\' is defined as '\' , the record will be read with 2 values: [a\] and [b] .
columnNameOfCorruptRecord
Type: String
A column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED ,
this column will be empty.
comment
Type: Char
Defines the character that represents a line comment when found in the beginning of a line of text. Use '\0' to disable
comment skipping.
dateFormat
Type: String
emptyValue
Type: String
encoding or charset
Type: String
The name of the encoding of the CSV files. See java.nio.charset.Charset for the list of options. UTF-16 and UTF-32
cannot be used when multiline is true .
enforceSchema
Type: Boolean
Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are
ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution.
escape
Type: Char
header
Type: Boolean
Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema.
ignoreLeadingWhiteSpace
Type: Boolean
ignoreTrailingWhiteSpace
Type: Boolean
inferSchema
Type: Boolean
Whether to infer the data types of the parsed CSV records or to assume all columns are of StringType . Requires an
additional pass over the data if set to true .
lineSep
Type: String
locale
Type: String
A java.util.Locale identifier. Influences default date, timestamp, and decimal parsing within the CSV.
Default value: US
O P T IO N
maxCharsPerColumn
Type: Int
Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to -1 , which
means unlimited.
Default value: -1
maxColumns
Type: Int
mergeSchema
Type: Boolean
Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader
when inferring the schema.
mode
Type: String
multiLine
Type: Boolean
nanValue
Type: String
The string representation of a not-a-number value when parsing FloatType and DoubleType columns.
negativeInf
Type: String
The string representation of negative infinity when parsing FloatType or DoubleType columns.
nullValue
Type: String
parserCaseSensitive (deprecated)
Type: Boolean
While reading files, whether to align columns declared in the header with the schema case sensitively. This is true by default
for Auto Loader. Columns that differ by case will be rescued in the rescuedDataColumn if enabled. This option has been
deprecated in favor of readerCaseSensitive .
positiveInf
Type: String
The string representation of positive infinity when parsing FloatType or DoubleType columns.
quote
Type: Char
The character used for escaping values where the field delimiter is part of the value.
readerCaseSensitive
Type: Boolean
Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names
differ by case from the schema; otherwise, read the data in a case-insensitive manner.
rescuedDataColumn
Type: String
Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing)
into a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data
column.
sep or delimiter
Type: String
skipRows
Type: Int
The number of rows from the beginning of the CSV file that should be ignored (including commented and empty rows). If
header is true, the header will be the first unskipped and uncommented row.
Default value: 0
timestampFormat
Type: String
timeZone
Type: String
unescapedQuoteHandling
Type: String
* STOP_AT_CLOSING_QUOTE : If unescaped quotes are found in the input, accumulate the quote character and proceed parsing
the value as a quoted value, until a closing quote is found.
* BACK_TO_DELIMITER : If unescaped quotes are found in the input, consider the value as an unquoted value. This will make
the parser accumulate all characters of the current parsed value until the delimiter defined by sep is found. If no delimiter is
found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found.
* STOP_AT_DELIMITER : If unescaped quotes are found in the input, consider the value as an unquoted value. This will make
the parser accumulate all characters until the delimiter defined by sep , or a line ending is found in the input.
* SKIP_VALUE : If unescaped quotes are found in the input, the content parsed for the given value will be skipped (until the
next delimiter is found) and the value set in nullValue will be produced instead.
* RAISE_ERROR : If unescaped quotes are found in the input, a
TextParsingException will be thrown.
PARQUET options
O P T IO N
datetimeRebaseMode
Type: String
Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values:
EXCEPTION , LEGACY , and
CORRECTED .
int96RebaseMode
Type: String
Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Allowed values:
EXCEPTION , LEGACY , and
CORRECTED .
mergeSchema
Type: Boolean
Whether to infer the schema across multiple files and to merge the schema of each file.
readerCaseSensitive
Type: Boolean
Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names
differ by case from the schema; otherwise, read the data in a case-insensitive manner.
rescuedDataColumn
Type: String
Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing)
into a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data
column.
AVRO options
O P T IO N
avroSchema
Type: String
Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is
compatible with, but different from, the actual Avro schema. The deserialization schema will be consistent with the evolved schema.
For example, if you set an evolved schema containing one additional column with a default value, the read result will contain
the new column too.
datetimeRebaseMode
Type: String
Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values:
EXCEPTION , LEGACY , and
CORRECTED .
mergeSchema
Type: Boolean
Whether to infer the schema across multiple files and to merge the schema of each file.
mergeSchema for Avro does not relax data types.
readerCaseSensitive
Type: Boolean
Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names
differ by case from the schema; otherwise, read the data in a case-insensitive manner.
rescuedDataColumn
Type: String
Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing)
into a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data
column.
BINARYFILE options
Binary files do not have any additional configuration options.
TEXT options
O P T IO N
encoding
Type: String
The name of the encoding of the TEXT files. See java.nio.charset.Charset for list of options.
lineSep
Type: String
wholeText
Type: Boolean
ORC options
O P T IO N
mergeSchema
Type: Boolean
Whether to infer the schema across multiple files and to merge the schema of each file.
O P T IO N
cloudFiles.region
Type: String
The region where the source S3 bucket resides and where the AWS SNS and SQS services will be created.
Default value: In Databricks Runtime 9.0 and above the region of the EC2 instance. In Databricks Runtime 8.4 and below you
must specify the region.
Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto
Loader to use a queue that you have already set up:
O P T IO N
cloudFiles.queueUrl
Type: String
The URL of the SQS queue. If provided, Auto Loader directly consumes events from this queue instead of setting up its own
AWS SNS and SQS services.
You can use the following options to provide credentials to access AWS SNS and SQS when IAM roles are not
available or when you’re ingesting data from different clouds.
O P T IO N
cloudFiles.awsAccessKey
Type: String
The AWS access key ID for the user. Must be provided with
cloudFiles.awsSecretKey .
cloudFiles.awsSecretKey
Type: String
The AWS secret access key for the user. Must be provided with
cloudFiles.awsAccessKey .
cloudFiles.roleArn
Type: String
The ARN of an IAM role to assume. The role can be assumed from your cluster’s instance profile or by providing credentials
with
cloudFiles.awsAccessKey and cloudFiles.awsSecretKey .
cloudFiles.roleExternalId
Type: String
An identifier to provide while assuming a role using cloudFiles.roleArn .
cloudFiles.roleSessionName
Type: String
An optional session name to use while assuming a role using cloudFiles.roleArn .
cloudFiles.stsEndpoint
Type: String
An optional endpoint to provide for accessing AWS STS when assuming a role using cloudFiles.roleArn .
Azure specific options
cloudFiles.clientId
Type: String
The client ID or application ID of the service principal.
cloudFiles.clientSecret
Type: String
The client secret of the service principal.
cloudFiles.connectionString
Type: String
The connection string for the storage account, based on either account access key or shared access signature (SAS).
cloudFiles.resourceGroup
Type: String
The Azure Resource Group under which the storage account is created.
cloudFiles.subscriptionId
Type: String
The Azure Subscription ID under which the resource group is created.
cloudFiles.tenantId
Type: String
The Azure Tenant ID under which the service principal is created.
IMPORTANT
Automated notification setup is available in Azure China and Government regions with Databricks Runtime 9.1 and later.
You must provide a queueName to use Auto Loader with file notifications in these regions for older DBR versions.
Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto
Loader to use a queue that you have already set up:
cloudFiles.queueName
Type: String
The name of the Azure queue. If provided, the cloud files source directly consumes events from this queue instead of setting
up its own Azure Event Grid and Queue Storage services. In that case, your cloudFiles.connectionString requires only
read permissions on the queue.
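For illustration, a sketch of reusing an existing queue (all names, paths, and secret keys below are placeholders, not from this article):
Python
df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.useNotifications", "true")
  # Consume events from a queue you have already set up instead of creating new Event Grid resources.
  .option("cloudFiles.queueName", "<queue-name>")
  .option("cloudFiles.connectionString", dbutils.secrets.get(scope="<scope>", key="<connection-string-key>"))
  .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
  .load("<path-to-source-data>"))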
Google specific options
cloudFiles.client
Type: String
The client ID of the Google Service Account.
cloudFiles.clientEmail
Type: String
The email of the Google Service Account used for notifications.
cloudFiles.privateKey
Type: String
The private key that’s generated for the Google Service Account.
cloudFiles.privateKeyId
Type: String
The id of the private key that’s generated for the Google Service Account.
cloudFiles.projectId
Type: String
The id of the project that the GCS bucket is in. The Google Cloud Pub/Sub subscription will also be created within this project.
Provide the following option only if you choose cloudFiles.useNotifications = true and you want Auto
Loader to use a queue that you have already set up:
cloudFiles.subscription
Type: String
The name of the Google Cloud Pub/Sub subscription. If provided, the cloud files source consumes events from this queue
instead of setting up its own GCS Notification and Google Cloud Pub/Sub services.
Auto Loader simplifies a number of common data ingestion tasks. This quick reference provides examples for
several popular patterns.
You can use glob patterns to filter directories and files when providing a path. For example, the pattern {ab,c{de, fh}} matches a string from the string set {ab, cde, cfh}:
Python
df = spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", <format>) \
.schema(schema) \
.load("<base_path>/*/files")
Scala
val df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", <format>)
.schema(schema)
.load("<base_path>/*/files")
IMPORTANT
You need to use the option pathGlobFilter for explicitly providing suffix patterns. The path only provides a prefix
filter.
For example, if you would like to parse only png files in a directory that contains files with different suffixes,
you can do:
Python
df = spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "binaryFile") \
.option("pathGlobfilter", "*.png") \
.load(<base_path>)
Scala
val df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "binaryFile")
.option("pathGlobfilter", "*.png")
.load(<base_path>)
Python
spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .option("cloudFiles.schemaLocation", "<path_to_schema_location>") \
  .load("<path_to_source_data>") \
  .writeStream \
  .option("mergeSchema", "true") \
  .option("checkpointLocation", "<path_to_checkpoint>") \
  .start("<path_to_target>")
Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "<path_to_schema_location>")
.load("<path_to_source_data>")
.writeStream
.option("mergeSchema", "true")
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target>")
Scala
spark.readStream.format("cloudFiles")
.schema(expected_schema)
.option("cloudFiles.format", "json")
// will collect all new fields as well as data type mismatches in _rescued_data
.option("cloudFiles.schemaEvolutionMode", "rescue")
.load("<path_to_source_data>")
.writeStream
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target>")
If you want your stream to stop processing if a new field is introduced that doesn’t match your schema, you can
add:
.option("cloudFiles.schemaEvolutionMode", "failOnNewColumns")
Python
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  # will ensure that the headers column gets processed as a map
  .option("cloudFiles.schemaHints",
          "headers map<string,string>, statusCode SHORT")
  .load("/api/requests")
  .writeStream
  .option("mergeSchema", "true")
  .option("checkpointLocation", "<path_to_checkpoint>")
  .start("<path_to_target>"))
Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
// will ensure that the headers column gets processed as a map
.option("cloudFiles.schemaHints",
"headers map<string,string>, statusCode SHORT")
.load("/api/requests")
.writeStream
.option("mergeSchema", "true")
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target>")
Examples
For detailed examples with common data formats, see:
Ingest CSV data with Auto Loader
Ingest JSON data with Auto Loader
Ingest Parquet data with Auto Loader
Ingest Avro data with Auto Loader
Ingest image data with Auto Loader
Tutorial: Continuously ingest data into Delta Lake with Auto Loader
Access file metadata with Auto Loader
Ingest CSV data with Auto Loader
7/21/2022 • 6 minutes to read
NOTE
Schema inference for CSV files is available in Databricks Runtime 8.3 and above.
Using Auto Loader to ingest CSV data into Delta Lake takes only a few lines of code. By leveraging Auto Loader,
you get the following benefits:
Automatic discovery of new files to process: You don’t need to have special logic to handle late arriving data
or keep track of which files have been processed yourself.
Scalable file discovery: Auto Loader can ingest billions of files.
Schema inference and evolution: Auto Loader can infer your data schema and detect schema drift on the fly.
It can also evolve the schema to add new columns and restart the stream with the new schema automatically.
Data rescue: You can configure Auto Loader to rescue data that couldn’t be parsed from your CSV files in a
rescued data column.
You can use the following code to run Auto Loader with schema inference and evolution capabilities on CSV
files. You specify cloudFiles as the format to leverage Auto Loader. You then specify csv with the option
cloudFiles.format . In the option cloudFiles.schemaLocation specify a directory that Auto Loader can use to
persist the schema changes in your source data over time:
Python
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path-to-checkpoint>")
  .load("<path-to-source-data>")
  .writeStream
  .option("mergeSchema", "true")
  .option("checkpointLocation", "<path-to-checkpoint>")
  .start("<path-to-target>"))
Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
// The schema location directory keeps track of your data schema over time
.option("cloudFiles.schemaLocation", "<path-to-checkpoint>")
.load("<path-to-source-data>")
.writeStream
.option("mergeSchema", "true")
.option("checkpointLocation", "<path-to-checkpoint>")
.start("<path-to-target>")
Auto Loader provides additional functionality to help make ingesting CSV data easier. In this section, we describe
some of the behavior differences between the Apache Spark built-in CSV parser and Auto Loader.
Schema inference
To infer the schema, Auto Loader uses a sample of data. When inferring schema for CSV data, Auto Loader
assumes that the files contain headers. If your CSV files do not contain headers, provide the option
.option("header", "false") . In addition, Auto Loader merges the schemas of all the files in the sample to come
up with a global schema. Auto Loader can then read each file according to its header and parse the CSV
correctly. This behavior differs from the Apache Spark built-in CSV parser. The following example
demonstrates the differences in behavior:
f0.csv:
-------
name,age,lucky_number
john,20,4
f1.csv:
-------
age,lucky_number,name
25,7,nadia
f2.csv:
-------
height,lucky_number
1.81,five
+-------+------+--------------+
| name | age | lucky_number | <-- uses just the first file to infer schema
+-------+------+--------------+
| john | 20 | 4 |
| 25 | 7 | nadia | <-- all files are assumed to have the same schema
| 1.81 | five | null |
+-------+------+--------------+
+-------+------+--------------+--------+
| name | age | lucky_number | height | <-- schema is merged across files
+-------+------+--------------+--------+
| john | 20 | 4 | null |
| nadia | 25 | 7 | null | <-- columns are parsed according to order specified in header
| null | null | five | 1.81 | <-- lucky_number's data type will be relaxed to a string
+-------+------+--------------+--------+
NOTE
To get the same schema inference and parsing semantics with the CSV reader in Databricks Runtime, you can use
spark.read.option("mergeSchema", "true").format("csv").load(<path>)
By default, Auto Loader infers columns in your CSV data as string columns. Since CSV data can support many
data types, inferring the data as string can help avoid schema evolution issues such as numeric type mismatches
(integers, longs, floats). If you want to infer specific column types, set the option cloudFiles.inferColumnTypes to
true . You don’t need to set inferSchema to true if you set cloudFiles.inferColumnTypes as true .
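A minimal sketch of enabling column type inference (the paths are placeholders):
Python
df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  # Infer numeric and other column types instead of defaulting every column to string.
  .option("cloudFiles.inferColumnTypes", "true")
  .option("cloudFiles.schemaLocation", "<path-to-checkpoint>")
  .load("<path-to-source-data>"))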
NOTE
Unless case sensitivity is enabled, the columns abc , Abc , and ABC are considered the same column for the purposes
of schema inference. The selection of which case will be chosen is arbitrary and depends on the sampled data. You can use
schema hints to enforce which case should be used. Once a selection has been made and the schema is inferred, Auto
Loader will not consider the casing variants that were not selected consistent with the schema. These columns may need
to be found in the rescued data column.
Learn more about schema inference and evolution with Auto Loader in Configuring schema inference and
evolution in Auto Loader.
NOTE
You can provide a rescued data column to all CSV parsers in Databricks Runtime by using the option
rescuedDataColumn . For example, you can pass it as an option to spark.read.csv by using the DataFrameReader, or to the
from_csv function within a SELECT query.
/path/to/table/f0.csv:
---------------------
name,age,lucky_number
john,20,4
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("rescuedDataColumn", "_rescue") \
.option("header", "true") \
.schema("name string, age int") \
.load("/path/to/table/")
+-------+------+------------------------------------------+
| name | age | _rescue |
+-------+------+------------------------------------------+
| john | 20 | { |
| | | "lucky_number": 4, |
| | | "_file_path": "/path/to/table/f0.csv" |
| | | } |
+-------+------+------------------------------------------+
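Per the note above, the rescuedDataColumn option also applies to batch reads. A sketch reading the same directory with the DataFrameReader:
Python
df = (spark.read.format("csv")
  .option("header", "true")
  # Values that do not fit the provided schema are collected in the _rescue column.
  .option("rescuedDataColumn", "_rescue")
  .schema("name string, age int")
  .load("/path/to/table/"))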
To remove the source file path from the rescued data column, you can set the SQL configuration
spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false") .
df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("rescuedDataColumn", "_rescued_data")  # makes sure that you don't lose data
  .schema(<schema>)  # provide a schema here for the files
  .load(<path>))
Scala
val df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("rescuedDataColumn", "_rescued_data") // makes sure that you don't lose data
.schema(<schema>) // provide a schema here for the files
.load(<path>)
df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("header", "true")
  .option("rescuedDataColumn", "_rescued_data")  # makes sure that you don't lose data
  .schema(<schema>)  # provide a schema here for the files
  .load(<path>))
Scala
val df = spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("header", "true")
.option("rescuedDataColumn", "_rescued_data") // makes sure that you don't lose data
.schema(<schema>) // provide a schema here for the files
.load(<path>)
Ingest JSON data with Auto Loader
7/21/2022 • 5 minutes to read
NOTE
Schema inference for JSON files is available in Databricks Runtime 8.2 and above.
Using Auto Loader to ingest JSON data into Delta Lake takes only a few lines of code. By leveraging Auto Loader,
you get the following benefits:
Automatic discovery of new files to process: You don’t need to have special logic to handle late arriving data
or keep track of which files have been processed yourself.
Scalable file discovery: Auto Loader can ingest billions of files without a hiccup.
Schema inference and evolution: Auto Loader can infer your data schema and detect schema drift on the fly.
It can also evolve the schema to add new columns and restart the stream with the new schema automatically.
Data rescue: You can configure Auto Loader to rescue data that couldn’t be parsed from your JSON in a
rescued data column that preserves the structure of your JSON record.
You can use the following code to run Auto Loader with schema inference and evolution capabilities on JSON
files. You specify cloudFiles as the format to leverage Auto Loader. To ingest JSON files, specify json with the
option cloudFiles.format . In the option cloudFiles.schemaLocation specify a directory that Auto Loader can use
to persist the schema changes in your source data over time:
Python
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
  .load("<path_to_source_data>")
  .writeStream
  .option("mergeSchema", "true")
  .option("checkpointLocation", "<path_to_checkpoint>")
  .start("<path_to_target>"))
Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
// The schema location directory keeps track of your data schema over time
.option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
.load("<path_to_source_data>")
.writeStream
.option("mergeSchema", "true")
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target>")
NOTE
Unless case sensitivity is enabled, the columns abc , Abc , and ABC are considered the same column for the purposes
of schema inference. The selection of which case will be chosen is arbitrary and depends on the sampled data. You can use
schema hints to enforce which case should be used. Once a selection has been made and the schema is inferred, Auto
Loader will not consider the casing variants that were not selected consistent with the schema. These columns may need
to be found in the rescued data column.
Learn more about schema inference and evolution with Auto Loader in Configuring schema inference and
evolution in Auto Loader.
NOTE
You can provide a rescued data column to all JSON parsers in Databricks Runtime by using the option
rescuedDataColumn . For example, you can pass it as an option to spark.read.json by using the DataFrameReader, or to the
from_json function within a SELECT query.
/path/to/table/f0.json:
---------------------
{"name":"john","age":20,"lucky_number":4}
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.option("rescuedDataColumn", "_rescue") \
.schema("name string, age int") \
.load("/path/to/table/")
+-------+------+--------------------------------------------+
| name | age | _rescue |
+-------+------+--------------------------------------------+
| john | 20 | { |
| | | "lucky_number": 4, |
| | | "_file_path": "/path/to/table/f0.json" |
| | | } |
+-------+------+--------------------------------------------+
Python
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
  .load("<source_data_with_nested_json>")
  .selectExpr(
    "*",
    "tags:page.name",     # extracts {"tags":{"page":{"name":...}}}
    "tags:page.id::int",  # extracts {"tags":{"page":{"id":...}}} and casts to int
    "tags:eventType"      # extracts {"tags":{"eventType":...}}
  ))
Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
// The schema location directory keeps track of your data schema over time
.option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
.load("<source_data_with_nested_json>")
.selectExpr(
"*",
"tags:page.name", // extracts {"tags":{"page":{"name":...}}}
"tags:page.id::int", // extracts {"tags":{"page":{"id":...}}} and casts to int
"tags:eventType" // extracts {"tags":{"eventType":...}}
)
Python
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
  .option("cloudFiles.inferColumnTypes", "true")
  .load("<source_data_with_nested_json>"))
Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
// The schema location directory keeps track of your data schema over time
.option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
.option("cloudFiles.inferColumnTypes", "true")
.load("<source_data_with_nested_json>")
Ingest Parquet data with Auto Loader
7/21/2022 • 3 minutes to read
NOTE
Schema inference for Parquet files is available in Databricks Runtime 11.1 and above.
Using Auto Loader to ingest Parquet data into Delta Lake takes only a few lines of code. By leveraging Auto
Loader, you get the following benefits:
Automatic discovery of new files to process: You don’t need to have special logic to handle late arriving data
or keep track of which files have been processed yourself.
Scalable file discovery: Auto Loader can ingest billions of files without a hiccup.
Schema inference and evolution: Auto Loader can infer your data schema and detect schema drift on the fly.
It can also evolve the schema to add new columns and restart the stream with the new schema automatically.
Data rescue: You can configure Auto Loader to rescue data that couldn’t be read properly in a rescued data
column that preserves the structure of your nested record.
You can use the following code to run Auto Loader with schema inference and evolution capabilities on Parquet
files. You specify cloudFiles as the format to leverage Auto Loader. To ingest Parquet files, specify parquet
with the option cloudFiles.format . In the option cloudFiles.schemaLocation specify a directory that Auto
Loader can use to persist the schema changes in your source data over time:
Python
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
  .load("<path_to_source_data>")
  .writeStream
  .option("checkpointLocation", "<path_to_checkpoint>")
  .start("<path_to_target>"))
Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "parquet")
// The schema location directory keeps track of your data schema over time
.option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
.load("<path_to_source_data>")
.writeStream
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target>")
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "parquet") \
.option("rescuedDataColumn", "_rescue") \
.schema("name string, age int") \
.load("/path/to/table/")
+-------+------+-----------------------------------------------+
| name | age | _rescue |
+-------+------+-----------------------------------------------+
| john | 20 | { |
| | | "lucky_number": 4, |
| | | "_file_path": "/path/to/table/f0.parquet" |
| | | } |
+-------+------+-----------------------------------------------+
Ingest Avro data with Auto Loader
NOTE
Schema inference for Avro files is available in Databricks Runtime 10.2 and above.
You can use Auto Loader to ingest Avro data into Delta Lake with only a few lines of code. Auto Loader provides
the following benefits:
Automatic discovery of new files to process: You do not need special logic to handle late arriving data or to
keep track of which files that you have already processed.
Scalable file discovery: Auto Loader can ingest billions of files with ease.
Schema inference and evolution: Auto Loader can infer your data schema and detect schema drift in real
time. It can also evolve the schema to add new columns and continue the ingestion with the new schema
automatically when the stream is restarted.
Data rescue: You can configure Auto Loader to rescue data that cannot be read from your Avro file by placing
that data in a rescued data column, which preserves the structure of your Avro record.
You can use the following code to run Auto Loader with schema inference and evolution capabilities on Avro
files. You specify cloudFiles as the format to leverage Auto Loader. To ingest Avro files, specify avro with the
option cloudFiles.format . In the option cloudFiles.schemaLocation specify a directory that Auto Loader can use
to persist the schema changes in your source data over time:
Python
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "avro")
  # The schema location directory keeps track of the data schema over time.
  .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
  .load("<path_to_source_data>")
  .writeStream
  .option("mergeSchema", "true")
  .option("checkpointLocation", "<path_to_checkpoint>")
  .start("<path_to_target>"))
Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "avro")
// The schema location directory keeps track of the data schema over time.
.option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
.load("<path_to_source_data>")
.writeStream
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target>")
NOTE
You can provide a rescued data column to all Avro readers in Databricks Runtime by using the option
rescuedDataColumn , for example as an option to spark.read.format("avro") by using the DataFrameReader.
/path/to/table/f0.avro:
---------------------
{"name":"john","age":20,"lucky_number":4}
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "avro") \
.option("rescuedDataColumn", "_rescue") \
.schema("name string, age int") \
.load("/path/to/table/")
+-------+------+--------------------------------------------+
| name | age | _rescue |
+-------+------+--------------------------------------------+
| john | 20 | { |
| | | "lucky_number": 4, |
| | | "_file_path": "/path/to/table/f0.avro" |
| | | } |
+-------+------+--------------------------------------------+
Ingest image data with Auto Loader
NOTE
Available in Databricks Runtime 9.0 and above.
Using Auto Loader to ingest image data into Delta Lake takes only a few lines of code. By using Auto Loader, you
get the following benefits:
Automatic discovery of new files to process: You don’t need to have special logic to handle late arriving data
or keep track of which files have been processed yourself.
Scalable file discovery: Auto Loader can ingest billions of files.
Optimized storage: Auto Loader can provide Delta Lake with additional information over the data to optimize
file storage.
Python
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "binaryFile") \
.load("<path_to_source_data>") \
.writeStream \
.option("checkpointLocation", "<path_to_checkpoint>") \
.start("<path_to_target>")
Scala
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "binaryFile")
.load("<path_to_source_data>")
.writeStream
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target>")
The preceding code will write your image data into a Delta table in an optimized format.
Tutorial: Continuously ingest data into Delta Lake with Auto Loader
Continuous, incremental data ingestion is a common need. For example, applications from mobile games to e-
commerce websites to IoT sensors generate continuous streams of data. Analysts desire access to the freshest
data, yet it can be challenging to implement for several reasons:
You may need to transform and ingest data as it arrives, while processing files exactly once.
You may want to enforce schemas before writing to tables. This logic can be complex to write and maintain.
It is challenging to handle data whose schemas change over time. For example, you must decide how to deal
with incoming rows that have data quality problems and how to reprocess those rows after you have solved
issues with the raw data.
A scalable solution–one that processes thousands or millions of files per minute–requires integrating cloud
services like event notifications, message queues, and triggers, which adds to development complexity and
long-term maintenance.
Building a continuous, cost effective, maintainable, and scalable data transformation and ingestion system is not
trivial. Azure Databricks provides Auto Loader as a built-in, optimized solution that addresses the preceding
issues, and provides a way for data teams to load raw data from cloud object stores at lower costs and latencies.
Auto Loader automatically configures and listens to a notification service for new files and can scale up to
millions of files per second. It also takes care of common issues such as schema inference and schema evolution.
To learn more, see Auto Loader.
In this tutorial, you use Auto Loader to incrementally ingest (load) data into a Delta table.
Requirements
1. An Azure subscription, an Azure Databricks workspace within that subscription, and a cluster within that
workspace. To create these, see Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure
portal. (If you follow this quickstart, you do not need to follow the instructions in the Run a Spark SQL job
section.)
2. Familiarity with the Azure Databricks workspace user interface. See Navigate the workspace.
NOTE
Auto Loader also works with data in the following formats: Avro, binary, CSV, JSON, ORC, Parquet, and text.
import csv
import uuid
import random
import time
from pathlib import Path
count = 0
path = "/tmp/generated_raw_csv_data"
Path(path).mkdir(parents=True, exist_ok=True)
while True:
    row_list = [ ["id", "x_axis", "y_axis"],
                 [uuid.uuid4(), random.randint(-100, 100), random.randint(-100, 100)],
                 [uuid.uuid4(), random.randint(-100, 100), random.randint(-100, 100)],
                 [uuid.uuid4(), random.randint(-100, 100), random.randint(-100, 100)]
               ]
    file_location = f'{path}/file_{count}.csv'

    # Write the generated rows to a local CSV file, then move it into DBFS
    # where Auto Loader watches for new files.
    with open(file_location, 'w', newline='') as f:
        csv.writer(f).writerows(row_list)

    count += 1
    dbutils.fs.mv(f'file:{file_location}', f'dbfs:{file_location}')

    print(f'New CSV file created at dbfs:{file_location}. Contents:')
    for row in row_list:
        print(','.join(str(value) for value in row))

    time.sleep(30)
TIP
If this path already exists in your workspace because someone else ran this tutorial, you may want to clear
out any existing files in this path first.
id,x_axis,y_axis
d033faf3-b6bd-4bbc-83a4-43a37ce7e994,88,-13
fde2bdb6-b0a1-41c2-9650-35af717549ca,-96,19
297a2dfe-99de-4c52-8310-b24bc2f83874,-23,43
c. After 30 seconds, creates a file named file_<number>.csv , writes the random set of data to the file,
stores the file in dbfs:/tmp/generated_raw_csv_data , and reports the path to the file and its
contents. <number> starts at 0 and increases by 1 every time a file is created (for example,
file_0.csv , file_1.csv , and so on).
8. In the notebook’s menu bar, click Run All . Leave this notebook running.
NOTE
To view the list of generated files, in the sidebar, click Data . Click DBFS, select a cluster if prompted, and then click
tmp > generated_raw_csv_data .
raw_data_location = "dbfs:/tmp/generated_raw_csv_data"
target_delta_table_location = "dbfs:/tmp/table/coordinates"
schema_location = "dbfs:/tmp/auto_loader/schema"
checkpoint_location = "dbfs:/tmp/auto_loader/checkpoint"
This code defines in your workspace the paths to the raw data and the target Delta table, the path to the
table’s schema, and the path to the location where Auto Loader writes checkpoint file information in the
Delta Lake transaction log. Checkpoints enable Auto Loader to process only new incoming data and to
skip over any existing data that has already been processed.
TIP
If any of these paths already exist in your workspace because someone else ran this tutorial, you may want to
clear out any existing files in these paths first.
8. With your cursor still in the first cell, run the cell. (To run the cell, press Shift+Enter.) Azure Databricks
reads the specified paths into memory.
9. Add a cell below the first cell, if it is not already there. (To add a cell, rest your mouse pointer along the
bottom edge of the cell, and then click the + icon.) In this second cell, paste the following code (note that
cloudFiles represents Auto Loader):
stream = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header", "true") \
.option("cloudFiles.schemaLocation", schema_location) \
.load(raw_data_location)
display(stream)
12. Run this cell. Auto Loader begins processing the existing CSV files in raw_data_location as well as any
incoming CSV files as they arrive in that location. Auto Loader processes each CSV file by using the first
line in the file for the field names and the remaining lines as field data. Azure Databricks displays the data
as Auto Loader processes it.
13. In the notebook’s fourth cell, paste the following code:
stream.writeStream \
.option("checkpointLocation", checkpoint_location) \
.start(target_delta_table_location)
14. Run this cell. Auto Loader writes the data to the Delta table in target_delta_table_location . Auto Loader
also writes checkpoint file information in checkpoint_location .
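As an optional check (not part of the numbered steps), you can add a cell that reads the target Delta table to confirm that rows are arriving:
Python
# Read and display the Delta table that Auto Loader is writing to.
display(spark.read.format("delta").load(target_delta_table_location))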
stream.printSchema()
3. Run all of the notebook’s cells. (To run all of the cells, click Run All in the notebook’s menu bar.) Azure
Databricks prints the data’s schema, which shows all fields as strings. Let’s evolve the x_axis and
y_axis fields to integers.
6. Run all of the notebook’s cells. Azure Databricks prints the data’s new schema, which shows the x_axis
and y_axis columns as integers. Let’s now enforce data quality by using this new schema.
7. Stop the notebook.
8. Replace the contents of the second cell with the following code:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
  StructField('id', StringType(), True),
  StructField('x_axis', IntegerType(), True),
  StructField('y_axis', IntegerType(), True)
])
stream = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header", "true") \
.option("cloudFiles.schemaLocation", schema_location) \
.schema(schema) \
.load(raw_data_location)
9. Run all of the notebook’s cells. Auto Loader now uses its schema inference and evolution logic to
determine how to process incoming data that does not match the new schema.
Step 4: Clean up
When you are done with this tutorial, you can clean up the associated Azure Databricks resources in your
workspace, if you no longer want to keep them.
Delete the data
1. Stop both notebooks. (To open a notebook, in the sidebar, click Workspace > Users > your user
name , and then click the notebook.)
2. In the notebook from step 1, add a cell after the first one, and paste the following code into this second
cell.
dbutils.fs.rm("dbfs:/tmp/generated_raw_csv_data", True)
dbutils.fs.rm("dbfs:/tmp/table", True)
dbutils.fs.rm("dbfs:/tmp/auto_loader", True)
WARNING
If you have any other information in these locations, this information will also be deleted!
3. Run the cell. Azure Databricks deletes the directories that contain the raw data, the Delta table, the table’s
schema, and the Auto Loader checkpoint information.
Delete the notebooks
1. In the sidebar, click Workspace > Users > your user name .
2. Click the drop-down arrow next to the first notebook, and click Move to Trash .
3. Click Confirm and move to Trash .
4. Repeat steps 1 - 3 for the second notebook.
Stop the cluster
If you are not using the cluster for any other tasks, you should stop it to avoid additional costs.
1. In the sidebar, click Compute .
2. Click the cluster’s name.
3. Click Terminate .
4. Click Confirm .
Additional resources
Auto Loader technical documentation
Leveraging file notifications for larger volumes of data
10 Powerful Features to Simplify Semi-structured Data Management in the Databricks Lakehouse blog
Hassle-Free Data Ingestion on-demand webinar series
Auto Loader FAQ
7/21/2022 • 3 minutes to read
Does Auto Loader process the file again when the file gets appended
or overwritten?
Files are processed exactly once unless cloudFiles.allowOverwrites is enabled. If a file is appended to or
overwritten, Azure Databricks does not guarantee which version of the file is processed. Databricks
recommends you use Auto Loader to ingest only immutable files. If this does not meet your requirements,
contact your Databricks representative.
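If reprocessing changed files is acceptable for your workload, a sketch of enabling this option (the format, paths, and names are placeholders):
Python
df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  # Allow files to be picked up again after they are appended to or overwritten.
  .option("cloudFiles.allowOverwrites", "true")
  .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
  .load("<path-to-source-data>"))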
Can I use this feature when there are existing file notifications on my
bucket or container?
Yes, as long as your input directory does not conflict with the existing notification prefix (for example, the above
parent-child directories).
If the directory structure under the base path is inconsistent, for example:
base/path/partition=1/date=2020-12-31/file1.json
// inconsistent because date and partition directories are in different orders
base/path/date=2020-12-31/partition=2/file2.json
// inconsistent because the date directory is missing
base/path/partition=3/file3.json
Auto Loader infers the partition columns as empty. Use cloudFiles.partitionColumns to explicitly parse columns
from the directory structure.
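A sketch of naming the partition columns explicitly (the column names follow the directory layout shown above; the format and schema location are placeholders):
Python
df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  # Parse the partition and date values out of the directory names under base/path.
  .option("cloudFiles.partitionColumns", "partition,date")
  .option("cloudFiles.schemaLocation", "<path-to-schema-location>")
  .load("base/path"))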
How does Auto Loader behave when the source folder is empty?
If the source directory is empty, Auto Loader requires you to provide a schema as there is no data to perform
inference.
SQL databases using JDBC
Databricks Runtime contains JDBC drivers for Microsoft SQL Server and Azure SQL Database. See the
Databricks runtime release notes for the complete list of JDBC libraries included in Databricks Runtime.
This article covers how to use the DataFrame API to connect to SQL databases using JDBC and how to control
the parallelism of reads through the JDBC interface. This article provides detailed examples using the Scala API,
with abbreviated Python and Spark SQL examples at the end. For all of the supported arguments for connecting
to SQL databases using JDBC, see JDBC To Other Databases.
NOTE
Another option for connecting to SQL Server and Azure SQL Database is the Apache Spark connector. It can provide
faster bulk inserts and lets you connect using your Azure Active Directory identity.
IMPORTANT
The examples in this article do not include usernames and passwords in JDBC URLs. Instead, they expect you to follow the
Secret management user guide to store your database credentials as secrets, and then leverage them in a notebook to
populate your credentials in a java.util.Properties object. For example:
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"

// Create a Properties() object to hold the connection parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
Step 3: Check connectivity to the SQLServer database
%sh
# Copy the PEM files to a folder within /dbfs so that all nodes can read them.
mkdir -p <target-folder>
cp <source-files> <target-folder>
%sh
# Convert the PEM files to PK8 and DER format.
cd <target-folder>
openssl pkcs8 -topk8 -inform PEM -in client_key.pem -outform DER -out client_key.pk8 -nocrypt
openssl x509 -in server_ca.pem -out server_ca.der -outform DER
openssl x509 -in client_cert.pem -out client_cert.der -outform DER
Replace:
<source-files> with the list of files in the source directory, for example
.pem
/dbfs/FileStore/Users/someone@example.com/* .
<target-folder> with the name of the target directory containing the generated PK8 and DER files, for
example /dbfs/databricks/driver/ssl .
<connection-string> with the JDBC URL connection string to the database.
<table-name> with the name of the table to use in the database.
<username> and <password> with the username and password to access the database.
Spark automatically reads the schema from the database table and maps its types back to Spark SQL types. For example:
val employees_table = spark.read.jdbc(jdbcUrl, "employees", connectionProperties)
employees_table.printSchema
display(employees_table.select("age", "salary").groupBy("age").avg("salary"))
The following code saves the data into a database table named diamonds . Using column names that are
reserved keywords can trigger an exception. The example table has a column named table , so you can rename it
with withColumnRenamed() prior to pushing it to the JDBC API.
spark.table("diamonds").withColumnRenamed("table", "table_number")
.write
.jdbc(jdbcUrl, "diamonds", connectionProperties)
Spark automatically creates a database table with the appropriate schema determined from the DataFrame
schema.
The default behavior is to create a new table and to throw an error message if a table with the same name
already exists. You can use the Spark SQL SaveMode feature to change this behavior. For example, here’s how to
append more rows to the table:
import org.apache.spark.sql.SaveMode
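The Scala example above is abbreviated. As a sketch in Python (assuming jdbcUrl and a connectionProperties dictionary like the one shown in the Python example later in this article), the append pattern might look like this:
Python
# Append the DataFrame's rows to the existing diamonds table instead of failing when it already exists.
(spark.table("diamonds")
  .withColumnRenamed("table", "table_number")
  .write
  .jdbc(url=jdbcUrl, table="diamonds", mode="append", properties=connectionProperties))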
You can prune columns and pushdown query predicates to the database with DataFrame methods.
// Explain plan with column selection will prune columns and just return the ones specified
// Notice that only the 3 specified columns are in the explain plan
spark.read.jdbc(jdbcUrl, "diamonds", connectionProperties).select("carat", "cut", "price").explain(true)
Manage parallelism
In the Spark UI, you can see that the number of partitions dictates the number of tasks that are launched. These
tasks are spread across the executors, which can increase the parallelism of the reads and writes through the JDBC
interface. See the Spark SQL programming guide for other parameters, such as fetchsize , that can help with
performance.
You can use two DataFrameReader APIs to specify partitioning:
jdbc(url:String,table:String,partitionColumn:String,lowerBound:Long,upperBound:Long,numPartitions:Int,...)
takes the name of a numeric, date, or timestamp column ( partitionColumn ), two range endpoints (
lowerBound , upperBound ) and a target numPartitions and generates Spark tasks by evenly splitting the
specified range into numPartitions tasks. This works well if your database table has an indexed numeric
column with fairly evenly-distributed values, such as an auto-incrementing primary key; it works somewhat
less well if the numeric column is extremely skewed, leading to imbalanced tasks.
jdbc(url:String,table:String,predicates:Array[String],...) accepts an array of WHERE conditions that can
be used to define custom partitions: this is useful for partitioning on non-numeric columns or for dealing
with skew. When defining custom partitions, remember to consider NULL when the partition columns are
Nullable. Don’t manually define partitions using more than two columns, since writing the boundary
predicates requires much more complex logic. A sketch of this form appears after this list.
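A sketch of the predicates form using the Python API (the table name, partition column, and predicate expressions are illustrative; jdbcUrl and connectionProperties are assumed to be defined as elsewhere in this article):
Python
# Each predicate becomes the WHERE clause of one partition's query.
predicates = [
  "emp_no >= 1 AND emp_no < 50000",
  "emp_no >= 50000",
  "emp_no IS NULL"  # remember NULL when the partition column is nullable
]

df = spark.read.jdbc(
  url=jdbcUrl,
  table="employees",
  predicates=predicates,
  properties=connectionProperties)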
JDBC reads
You can provide split boundaries based on the dataset’s column values.
These options specify the parallelism on read. These options must all be specified if any of them is specified.
lowerBound and upperBound decide the partition stride, but do not filter the rows in table. Therefore, Spark
partitions and returns all rows in the table.
The following example splits the table read across executors on the emp_no column using the columnName ,
lowerBound , upperBound , and numPartitions parameters.
val df = (spark.read.jdbc(url=jdbcUrl,
table="employees",
columnName="emp_no",
lowerBound=1L,
upperBound=100000L,
numPartitions=100,
connectionProperties=connectionProperties))
display(df)
JDBC writes
Spark’s partitions dictate the number of connections used to push data through the JDBC API. You can control
the parallelism by calling coalesce(<N>) or repartition(<N>) depending on the existing number of partitions.
Call coalesce when reducing the number of partitions, and repartition when increasing the number of
partitions.
import org.apache.spark.sql.SaveMode
val df = spark.table("diamonds")
println(df.rdd.partitions.length)
// Given the number of partitions above, you can reduce the partition value by calling coalesce() or
// increase it by calling repartition() to manage the number of connections.
df.repartition(10).write.mode(SaveMode.Append).jdbc(jdbcUrl, "diamonds", connectionProperties)
Python example
The following Python examples cover some of the same tasks as those provided for Scala.
Create the JDBC URL
jdbcHostname = "<hostname>"
jdbcDatabase = "employees"
jdbcPort = 1433
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(jdbcHostname, jdbcPort,
jdbcDatabase, username, password)
You can pass in a dictionary that contains the credentials and driver class similar to the preceding Scala example.
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
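With that dictionary in place, a read might look like the following sketch (the table name is illustrative):
Python
# Read the employees table through JDBC; Spark maps the database column types to Spark SQL types.
employees_table = spark.read.jdbc(url=jdbcUrl, table="employees", properties=connectionProperties)
employees_table.printSchema()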
SELECT count(*) record_count FROM diamonds --count returned to original value (10 less)
Create a view on the table
Create a view on the table using Spark SQL.
Here’s an example of a JDBC read with partitioning configured: the column partitionColumn , which was passed
as columnName , two range endpoints ( lowerBound , upperBound ), and the numPartitions parameter specifying
the maximum number of partitions.
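A sketch of such a read with the Python API (the bounds and partition count are illustrative):
Python
df = spark.read.jdbc(
  url=jdbcUrl,
  table="employees",
  column="emp_no",   # the partition column is passed as `column` in the Python API
  lowerBound=1,
  upperBound=100000,
  numPartitions=100,
  properties=connectionProperties)
display(df)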
Tune the JDBC fetchSize parameter
JDBC drivers have a fetchSize parameter that controls the number of rows fetched at a time from the remote
JDBC database. If this value is set too low then your workload may become latency-bound due to a high number
of roundtrip requests between Spark and the external database in order to fetch the full result set. If this value is
too high you risk OOM exceptions. The optimal value will be workload dependent (since it depends on the result
schema, sizes of strings in results, and so on), but increasing it even slightly from the default can result in huge
performance gains.
Oracle’s default fetchSize is 10. Increasing it even slightly, to 100, gives massive performance gains, and going
up to a higher value, like 2000, gives an additional improvement. For example:
try {
stmt = conn.prepareStatement("select a, b, c from table");
stmt.setFetchSize(100);
rs = stmt.executeQuery();
while (rs.next()) {
...
}
}
See Make your java run faster for a more general discussion of this tuning parameter for Oracle JDBC drivers.
Consider the impact of indexes
If you are reading in parallel (using one of the partitioning techniques) Spark issues concurrent queries to the
JDBC database. If these queries end up requiring full table scans this could end up bottlenecking in the remote
database and become extremely slow. Thus you should consider the impact of indexes when choosing a
partitioning column and pick a column such that the individual partitions’ queries can be executed reasonably
efficiently in parallel.
IMPORTANT
Make sure that the database has an index on the partitioning column.
When a single-column index is not defined on the source table, you still can choose the leading (leftmost)
column in a composite index as the partitioning column. When only composite indexes are available, most
databases can use a concatenated index when searching with the leading (leftmost) columns. Thus, the leading
column in a multi-column index can also be used as a partitioning column.
Consider whether the number of partitions is appropriate
Using too many partitions when reading from the external database risks overloading that database with too
many queries. Most DBMS systems have limits on the concurrent connections. As a starting point, aim to have
the number of partitions be close to the number of cores or task slots in your Spark cluster in order to maximize
parallelism but keep the total number of queries capped at a reasonable limit. If you need lots of parallelism
after fetching the JDBC rows (because you’re doing something CPU bound in Spark) but don’t want to issue too
many concurrent queries to your database then consider using a lower numPartitions for the JDBC read and
then doing an explicit repartition() in Spark.
Consider database-specific tuning techniques
The database vendor may have a guide on tuning performance for ETL and bulk access workloads.
SQL Databases using the Apache Spark connector
7/21/2022 • 2 minutes to read
The Apache Spark connector for Azure SQL Database and SQL Server enables these databases to act as input
data sources and output data sinks for Apache Spark jobs. It allows you to use real-time transactional data in big
data analytics and persist results for ad-hoc queries or reporting.
Compared to the built-in JDBC connector, this connector provides the ability to bulk insert data into SQL
databases. It can outperform row-by-row insertion with 10x to 20x faster performance. The Spark connector for
SQL Server and Azure SQL Database also supports Azure Active Directory (Azure AD) authentication, enabling
you to connect securely to your Azure SQL databases from Azure Databricks using your Azure AD account. It
provides interfaces that are similar to the built-in JDBC connector. It is easy to migrate your existing Spark jobs
to use this connector.
Requirements
There are two versions of the Spark connector for SQL Server: one for Spark 2.4 and one for Spark 3.x. The
Spark 3.x connector requires Databricks Runtime 7.x or above. The connector is community-supported and does
not include Microsoft SLA support. File any issues on GitHub to engage the community for help.
COMPONENT    VERSIONS SUPPORTED
Databricks Runtime    Apache Spark 3.0 connector: Databricks Runtime 7.x and above
Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure Databricks
Use the Azure Blob Filesystem driver (ABFS) to connect to Azure Blob Storage and Azure Data Lake Storage
Gen2 from Azure Databricks. Databricks recommends securing access to Azure storage containers by using
Azure service principals set in cluster configurations.
This article details how to access Azure storage containers using:
Azure service principals
SAS tokens
Account keys
You will set Spark properties to configure these credentials for a compute environment, either:
Scoped to an Azure Databricks cluster
Scoped to an Azure Databricks notebook
Azure service principals can also be used to access Azure storage from Databricks SQL; see Configure access to
cloud storage.
Databricks recommends using secret scopes for storing all credentials.
Direct access using ABFS URI for Blob Storage or Azure Data Lake
Storage Gen2
If you have properly configured credentials to access your Azure storage container, you can interact with
resources in the storage account using URIs. Databricks recommends using the abfss driver for greater
security.
spark.read.load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
Access Azure Data Lake Storage Gen2 or Blob Storage using OAuth
2.0 with an Azure service principal
You can securely access data in an Azure storage account using OAuth 2.0 with an Azure Active Directory (Azure
AD) application service principal for authentication; see Configure access to Azure storage with an Azure Active
Directory service principal.
service_credential = dbutils.secrets.get(scope="<scope>",key="<service-credential-key>")
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-
id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net",
service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
"https://login.microsoftonline.com/<directory-id>/oauth2/token")
Replace
<scope> with the Databricks secret scope name.
<service-credential-key> with the name of the key containing the client secret.
<storage-account> with the name of the Azure storage account.
<application-id> with the Application (client) ID for the Azure Active Directory application.
<directory-id> with the Director y (tenant) ID for the Azure Active Directory application.
Access Azure Data Lake Storage Gen2 or Blob Storage using a SAS
token
You can use storage shared access signatures (SAS) to access an Azure Data Lake Storage Gen2 storage account
directly. With SAS, you can restrict access to a storage account using temporary tokens with fine-grained access
control.
You can configure SAS tokens for multiple storage accounts in the same Spark session.
NOTE
SAS support is available in Databricks Runtime 7.5 and above.
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net",
"org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "<token>")
Access Azure Data Lake Storage Gen2 or Blob Storage using the
account key
You can use storage account access keys to manage access to Azure Storage.
spark.conf.set(
"fs.azure.account.key.<storage-account>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
Replace
<storage-account> with the Azure Storage account name.
<scope> with the Azure Databricks secret scope name.
<storage-account-access-key> with the name of the key containing the Azure storage account access key.
Example notebook
This notebook demonstrates using a service principal to:
1. Authenticate to an ADLS Gen2 storage account.
2. Mount a filesystem in the storage account.
3. Write a JSON file containing Internet of things (IoT) data to the new container.
4. List files using direct access and through the mount point.
5. Read and display the IoT file using direct access and through the mount point.
ADLS Gen2 OAuth 2.0 with Azure service principals notebook
Get notebook
If a hierarchical namespace is enabled on the storage account and the container was created through the Azure portal, you might see the following error:
StatusCode=404
StatusDescription=The specified filesystem does not exist.
ErrorCode=FilesystemNotFound
ErrorMessage=The specified filesystem does not exist.
When a hierarchical namespace is enabled, you do not need to create containers through Azure portal. If you
see this issue, delete the Blob container through Azure portal. After a few minutes, you will be able to access the
container. Alternatively, you can change your abfss URI to use a different container, as long as this container is
not created through Azure portal.
Known issues
See Known issues with Azure Data Lake Storage Gen2 in the Microsoft documentation.
Accessing Azure Data Lake Storage Gen1 from
Azure Databricks
7/21/2022 • 3 minutes to read
Microsoft has announced the planned retirement of Azure Data Lake Storage Gen1 (formerly Azure Data Lake
Store, also known as ADLS) and recommends all users migrate to Azure Data Lake Storage Gen2. Databricks
recommends upgrading to Azure Data Lake Storage Gen2 for best performance and new features.
There are two ways of accessing Azure Data Lake Storage Gen1:
1. Pass your Azure Active Directory credentials, also known as credential passthrough.
2. Use a service principal directly.
Access directly with Spark APIs using a service principal and OAuth
2.0
To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials
with the following snippet in your notebook:
spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-
service-credential>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
where
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") retrieves your storage account access key
that has been stored as a secret in a secret scope.
After you’ve set up your credentials, you can use standard Spark and Databricks APIs to access the resources.
For example:
val df = spark.read.format("parquet").load("adl://<storage-resource>.azuredatalakestore.net/<directory-
name>")
dbutils.fs.ls("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")
Azure Data Lake Storage Gen1 provides directory level access control, so the service principal must have access
to the directories that you want to read from as well as the Azure Data Lake Storage Gen1 resource.
Access through metastore
To access adl:// locations specified in the metastore, you must specify Hadoop credential configuration
options as Spark options when you create the cluster by adding the spark.hadoop. prefix to the corresponding
Hadoop configuration keys to propagate them to the Hadoop configurations used by the metastore:
spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential
spark.hadoop.fs.adl.oauth2.client.id <application-id>
spark.hadoop.fs.adl.oauth2.credential <service-credential>
spark.hadoop.fs.adl.oauth2.refresh.url https://login.microsoftonline.com/<directory-id>/oauth2/token
WARNING
These credentials are available to all users who access the cluster.
Python
configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential",
"fs.adl.oauth2.client.id": "<application-id>",
"fs.adl.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-
service-credential>"),
"fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
val configs = Map(
"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
"fs.adl.oauth2.client.id" -> "<application-id>",
"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-
credential>"),
"fs.adl.oauth2.refresh.url" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")
// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
mountPoint = "/mnt/<mount-name>",
extraConfigs = configs)
where
<mount-name>is a DBFS path that represents where the account or a folder inside it (specified in source ) will
be mounted in DBFS.
Access files in your container as if they were local files, for example:
df = spark.read.format("text").load("/mnt/<mount-name>/....")
df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")
val df = spark.read.format("text").load("/mnt/<mount-name>/....")
val df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")
You can set up service credentials for multiple Azure Data Lake Storage Gen1 accounts within a single Spark
session by adding account.<account-name> to the configuration keys. For example, if you want to set up
credentials for both the accounts to access adl://example1.azuredatalakestore.net and
adl://example2.azuredatalakestore.net , you can do this as follows:
Scala
spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.account.example1.oauth2.client.id", "<application-id-example1>")
spark.conf.set("fs.adl.account.example1.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key
= "<key-name-for-service-credential-example1>"))
spark.conf.set("fs.adl.account.example1.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-
id-example1>/oauth2/token")
spark.conf.set("fs.adl.account.example2.oauth2.client.id", "<application-id-example2>")
spark.conf.set("fs.adl.account.example2.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key
= "<key-name-for-service-credential-example2>"))
spark.conf.set("fs.adl.account.example2.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-
id-example2>/oauth2/token")
spark.hadoop.fs.adl.account.example1.oauth2.client.id <application-id-example1>
spark.hadoop.fs.adl.account.example1.oauth2.credential <service-credential-example1>
spark.hadoop.fs.adl.account.example1.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example1>/oauth2/token
spark.hadoop.fs.adl.account.example2.oauth2.client.id <application-id-example2>
spark.hadoop.fs.adl.account.example2.oauth2.credential <service-credential-example2>
spark.hadoop.fs.adl.account.example2.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example2>/oauth2/token
The following notebook demonstrates how to access Azure Data Lake Storage Gen1 directly and with a mount.
ADLS Gen1 service principal notebook
Get notebook
Connect to Azure Blob Storage with WASB (legacy)
7/21/2022 • 4 minutes to read
Microsoft has deprecated the Windows Azure Storage Blob driver (WASB) for Azure Blob Storage in favor of the
Azure Blob Filesystem driver (ABFS); see Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure
Databricks. ABFS has numerous benefits over WASB; see Azure documentation on ABFS.
This article provides documentation for maintaining code that uses the WASB driver. Databricks recommends
using ABFS for all connections to Azure Blob Storage.
You can configure a storage account access key for WASB in the notebook session configuration:
spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
  "<storage-account-access-key>"
)
You can upgrade account key URIs to use ABFS. For more information, see Access Azure Data Lake Storage Gen2
or Blob Storage using the account key.
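For example, a minimal sketch of the equivalent ABFS setup, using the same placeholder names as the WASB examples:

```python
# Configure the account key for the ABFS driver instead of WASB.
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    "<storage-account-access-key>")

# Read with an abfss:// URI instead of a wasbs:// URI.
df = spark.read.format("parquet").load(
    "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")
```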
To configure a SAS token for a container in the notebook session configuration:
spark.conf.set(
  "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net",
  "<sas-token-for-container>"
)
wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>
The following code examples show how you can use the DataFrames API and Databricks Utilities to interact with
a named directory within a container.
df = spark.read.format("parquet").load("wasbs://<container-name>@<storage-account-
name>.blob.core.windows.net/<directory-name>")
dbutils.fs.ls("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>")
To use ABFS instead of WASB, update your URIs. For more information, see Direct access using ABFS URI for
Blob Storage or Azure Data Lake Storage Gen2.
-- SQL
CREATE DATABASE <db-name>
LOCATION "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/";
To use ABFS instead of WASB, update your URIs; see Direct access using ABFS URI for Blob Storage or Azure
Data Lake Storage Gen2.
DBFS uses the credential that you provide when you create the mount point to access the mounted Blob storage
container. If a Blob storage container is mounted using a storage account access key, DBFS uses temporary SAS
tokens derived from the storage account key when it accesses this mount point.
Mount an Azure Blob storage container
Databricks recommends using ABFS instead of WASB. For more information about mounting with ABFS, see:
Mount ADLS Gen2 or Blob Storage with ABFS.
1. To mount a Blob storage container or a folder inside a container, use the following command:
Python
dbutils.fs.mount(
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
mount_point = "/mnt/<mount-name>",
extra_configs = {"<conf-key>":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})
Scala
dbutils.fs.mount(
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>",
mountPoint = "/mnt/<mount-name>",
extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-
name>")))
where
<storage-account-name> is the name of your Azure Blob storage account.
<container-name> is the name of a container in your Azure Blob storage account.
<mount-name> is a DBFS path representing where the Blob storage container or a folder inside the
container (specified in source ) will be mounted in DBFS.
<conf-key> can be either fs.azure.account.key.<storage-account-name>.blob.core.windows.net or
fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") gets the key that has been stored as
a secret in a secret scope.
2. Access files in your container as if they were local files, for example:
Python
# python
df = spark.read.format("text").load("/mnt/<mount-name>/...")
df = spark.read.format("text").load("dbfs:/<mount-name>/...")
Scala
// scala
val df = spark.read.format("text").load("/mnt/<mount-name>/...")
val df = spark.read.format("text").load("dbfs:/<mount-name>/...")
SQL
-- SQL
CREATE DATABASE <db-name>
LOCATION "/mnt/<mount-name>"
Azure Cosmos DB
7/21/2022 • 2 minutes to read
Azure Cosmos DB is Microsoft’s globally distributed, multi-model database. Azure Cosmos DB enables you to
elastically and independently scale throughput and storage across any number of Azure’s geographic regions. It
offers throughput, latency, availability, and consistency guarantees with comprehensive service level agreements
(SLAs). Azure Cosmos DB provides APIs for the following data models, with SDKs available in multiple
languages:
SQL API
MongoDB API
Cassandra API
Graph (Gremlin) API
Table API
This article explains how to read data from and write data to Azure Cosmos DB using Azure Databricks. For
the most up-to-date details about Azure Cosmos DB, see Accelerate big data analytics by using the Apache
Spark to Azure Cosmos DB connector.
IMPORTANT
This connector supports the core (SQL) API of Azure Cosmos DB. For the Cosmos DB for MongoDB API, use the
MongoDB Spark connector. For the Cosmos DB Cassandra API, use the Cassandra Spark connector.
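As a rough sketch of reading with this connector (assuming the Azure Cosmos DB Spark 3 OLTP connector library is installed on the cluster; the endpoint, secret scope, database, and container names below are placeholders):

```python
# Sketch only: requires the Azure Cosmos DB Spark 3 OLTP connector on the cluster.
cosmos_config = {
    "spark.cosmos.accountEndpoint": "https://<cosmos-account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": dbutils.secrets.get(scope="<scope-name>", key="<cosmos-key-name>"),
    "spark.cosmos.database": "<database-name>",
    "spark.cosmos.container": "<container-name>",
}

# Read the container through the core (SQL) API into a DataFrame.
df = spark.read.format("cosmos.oltp").options(**cosmos_config).load()
display(df)
```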
Azure Synapse Analytics
Azure Synapse Analytics (formerly SQL Data Warehouse) is a cloud-based enterprise data warehouse that
leverages massively parallel processing (MPP) to quickly run complex queries across petabytes of data. Use
Azure Synapse as a key component of a big data solution. Import big data into Azure Synapse with simple PolyBase T-SQL
queries or the COPY statement, and then use the power of MPP to run high-performance analytics. As you integrate
and analyze, the data warehouse becomes the single version of truth your business can count on for insights.
You can access Azure Synapse from Azure Databricks using the Azure Synapse connector, a data source
implementation for Apache Spark that uses Azure Blob storage, and PolyBase or the COPY statement in Azure
Synapse to transfer large volumes of data efficiently between an Azure Databricks cluster and an Azure Synapse
instance.
Both the Azure Databricks cluster and the Azure Synapse instance access a common Blob storage container to
exchange data between these two systems. In Azure Databricks, Apache Spark jobs are triggered by the Azure
Synapse connector to read data from and write data to the Blob storage container. On the Azure Synapse side,
data loading and unloading operations performed by PolyBase are triggered by the Azure Synapse connector
through JDBC. In Databricks Runtime 7.0 and above, COPY is used by default to load data into Azure Synapse by
the Azure Synapse connector through JDBC.
NOTE
COPY is available only on Azure Synapse Gen2 instances, which provide better performance. If your database still uses
Gen1 instances, we recommend that you migrate the database to Gen2.
The Azure Synapse connector is more suited to ETL than to interactive queries, because each query execution
can extract large amounts of data to Blob storage. If you plan to perform several queries against the same Azure
Synapse table, we recommend that you save the extracted data in a format such as Parquet.
Requirements
A database master key for the Azure Synapse.
Authentication
The Azure Synapse connector uses three types of network connections:
Spark driver to Azure Synapse
Spark driver and executors to Azure storage account
Azure Synapse to Azure storage account
The Spark driver connects to Azure Synapse over JDBC using a username and password or OAuth 2.0 with a
service principal. The Spark driver and executors connect to the Azure storage account using a storage account
access key, OAuth 2.0 with a service principal, or a managed service identity, as configured in Spark. Azure
Synapse connects to the same storage account using a storage account access key, OAuth 2.0, or a managed
service identity.
To allow the Spark driver to reach Azure Synapse, we recommend that you set Allow access to Azure
services to ON on the firewall pane of the Azure Synapse server through the Azure portal. This setting allows
communications from all Azure IP addresses and all Azure subnets, which allows Spark drivers to reach the
Azure Synapse instance.
OAuth 2.0 with a service principal
You can authenticate to Azure Synapse Analytics using a service principal with access to the underlying storage
account. For more information on using service principal credentials to access an Azure storage account, see
Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure Databricks. You must set the
enableServicePrincipalAuth option to true in the connection configuration (see Parameters) to enable the connector
to authenticate with a service principal.
You can optionally use a different service principal for the Azure Synapse Analytics connection. An example that
configures service principal credentials for the storage account and optional service principal credentials for
Synapse:
ini
; Defining the Service Principal credentials for the Azure storage account
fs.azure.account.auth.type OAuth
fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id <application-id>
fs.azure.account.oauth2.client.secret <service-credential>
fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<directory-id>/oauth2/token
; Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the connector will use the Azure storage account credentials)
spark.databricks.sqldw.jdbc.service.principal.client.id <application-id>
spark.databricks.sqldw.jdbc.service.principal.client.secret <service-credential>
Scala
// Defining the Service Principal credentials for the Azure storage account
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<service-credential>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<directory-
id>/oauth2/token")
// Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the
connector will use the Azure storage account credentials)
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.id", "<application-id>")
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.secret", "<service-credential>")
Python
# Defining the service principal credentials for the Azure storage account
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<service-credential>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<directory-
id>/oauth2/token")
# Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the
connector will use the Azure storage account credentials)
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.id", "<application-id>")
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.secret", "<service-credential>")
R
# Load SparkR
library(SparkR)
conf <- sparkR.callJMethod(sparkR.session(), "conf")
# Defining the service principal credentials for the Azure storage account
sparkR.callJMethod(conf, "set", "fs.azure.account.auth.type", "OAuth")
sparkR.callJMethod(conf, "set", "fs.azure.account.oauth.provider.type",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
sparkR.callJMethod(conf, "set", "fs.azure.account.oauth2.client.id", "<application-id>")
sparkR.callJMethod(conf, "set", "fs.azure.account.oauth2.client.secret", "<service-credential>")
sparkR.callJMethod(conf, "set", "fs.azure.account.oauth2.client.endpoint",
"https://login.microsoftonline.com/<directory-id>/oauth2/token")
# Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the
connector will use the Azure storage account credentials)
sparkR.callJMethod(conf, "set", "spark.databricks.sqldw.jdbc.service.principal.client.id", "<application-
id>")
sparkR.callJMethod(conf, "set", "spark.databricks.sqldw.jdbc.service.principal.client.secret", "<service-
credential>")
To authenticate to the storage account with an account key, set the key in the notebook session configuration:
spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
  "<your-storage-account-access-key>")
Or set it in the global Hadoop configuration associated with the SparkContext (Scala):
sc.hadoopConfiguration.set(
  "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
  "<your-storage-account-access-key>")
Python
hadoopConfiguration is not exposed in all versions of PySpark. Although the following command relies on some
Spark internals, it should work with all PySpark versions and is unlikely to break or change in the future:
sc._jsc.hadoopConfiguration().set(
"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
"<your-storage-account-access-key>")
Streaming support
The Azure Synapse connector offers efficient and scalable Structured Streaming write support for Azure Synapse
that provides a consistent user experience with batch writes, and uses PolyBase or COPY for large data transfers
between an Azure Databricks cluster and an Azure Synapse instance. Like batch writes, streaming is designed
largely for ETL, and therefore incurs higher latency that may not be suitable for real-time data processing in
some cases.
Fault tolerance semantics
By default, Azure Synapse streaming offers an end-to-end exactly-once guarantee for writing data into an Azure
Synapse table by reliably tracking the progress of the query using a combination of a checkpoint location in DBFS,
a checkpoint table in Azure Synapse, and a locking mechanism that ensures streaming can handle any type of
failure, retry, or query restart. Optionally, you can select less restrictive at-least-once semantics for Azure
Synapse streaming by setting the spark.databricks.sqldw.streaming.exactlyOnce.enabled option to false , in which
case data duplication could occur in the event of intermittent connection failures to Azure Synapse or
unexpected query termination.
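For example, to opt into at-least-once semantics in the notebook session configuration:

```python
# Relax the default end-to-end exactly-once guarantee to at-least-once semantics.
spark.conf.set("spark.databricks.sqldw.streaming.exactlyOnce.enabled", "false")
```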
Usage (Batch)
You can use this connector via the data source API in Scala, Python, SQL, and R notebooks.
Scala
// Otherwise, set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
"<your-storage-account-access-key>")
df.write
.format("com.databricks.spark.sqldw")
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
.option("forwardSparkAzureStorageCredentials", "true")
.option("dbTable", "<your-table-name>")
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>")
.save()
Python
# Otherwise, set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
"<your-storage-account-access-key>")
df.write \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<your-table-name>") \
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>") \
.save()
SQL
-- Otherwise, set up the Blob storage account access key in the notebook session conf.
SET fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net=<your-storage-account-access-key>;
R
# Load SparkR
library(SparkR)
# Otherwise, set up the Blob storage account access key in the notebook session conf.
conf <- sparkR.callJMethod(sparkR.session(), "conf")
sparkR.callJMethod(conf, "set", "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net", "
<your-storage-account-access-key>")
write.df(
df,
source = "com.databricks.spark.sqldw",
url = "jdbc:sqlserver://<the-rest-of-the-connection-string>",
forward_spark_azure_storage_credentials = "true",
dbTable = "<your-table-name>",
tempDir = "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-
name>")
Usage (Streaming)
You can write data using Structured Streaming in Scala and Python notebooks.
Scala
// Set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
"<your-storage-account-access-key>")
df.writeStream
.format("com.databricks.spark.sqldw")
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>")
.option("forwardSparkAzureStorageCredentials", "true")
.option("dbTable", "<your-table-name>")
.option("checkpointLocation", "/tmp_checkpoint_location")
.start()
Python
# Set up the Blob storage account access key in the notebook session conf.
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
"<your-storage-account-access-key>")
df.writeStream \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<your-table-name>") \
.option("checkpointLocation", "/tmp_checkpoint_location") \
.start()
Configuration
This section describes how to configure write semantics for the connector, required permissions, and
miscellaneous configuration parameters.
In this section:
Supported save modes for batch writes
Supported output modes for streaming writes
Write semantics
Required Azure Synapse permissions for PolyBase
Required Azure Synapse permissions for the COPY statement
Parameters
Query pushdown into Azure Synapse
Temporary data management
Temporary object management
Streaming checkpoint table management
Supported save modes for batch writes
The Azure Synapse connector supports ErrorIfExists , Ignore , Append , and Overwrite save modes with the
default mode being ErrorIfExists . For more information on supported save modes in Apache Spark, see Spark
SQL documentation on Save Modes.
Supported output modes for streaming writes
The Azure Synapse connector supports Append and Complete output modes for record appends and
aggregations. For more details on output modes and compatibility matrix, see the Structured Streaming guide.
Write semantics
NOTE
COPY is available in Databricks Runtime 7.0 and above.
In addition to PolyBase, the Azure Synapse connector supports the COPY statement. The COPY statement offers
a more convenient way of loading data into Azure Synapse without the need to create an external table, requires
fewer permissions to load data, and improves the performance of data ingestion into Azure Synapse.
By default, the connector automatically discovers the best write semantics ( COPY when targeting an Azure
Synapse Gen2 instance, PolyBase otherwise). You can also specify the write semantics with the following
configuration:
Scala
// Configure the write semantics for Azure Synapse connector in the notebook session conf.
spark.conf.set("spark.databricks.sqldw.writeSemantics", "<write-semantics>")
Python
# Configure the write semantics for Azure Synapse connector in the notebook session conf.
spark.conf.set("spark.databricks.sqldw.writeSemantics", "<write-semantics>")
SQL
-- Configure the write semantics for Azure Synapse connector in the notebook session conf.
SET spark.databricks.sqldw.writeSemantics=<write-semantics>;
R
# Load SparkR
library(SparkR)
# Configure the write semantics for Azure Synapse connector in the notebook session conf.
conf <- sparkR.callJMethod(sparkR.session(), "conf")
sparkR.callJMethod(conf, "set", "spark.databricks.sqldw.writeSemantics", "<write-semantics>")
where <write-semantics> is either polybase to use PolyBase, or copy to use the COPY statement.
Required Azure Synapse permissions for PolyBase
When you use PolyBase, the Azure Synapse connector requires the JDBC connection user to have permission to
run the following commands in the connected Azure Synapse instance:
CREATE DATABASE SCOPED CREDENTIAL
CREATE EXTERNAL DATA SOURCE
CREATE EXTERNAL FILE FORMAT
CREATE EXTERNAL TABLE
As a prerequisite for the first command, the connector expects that a database master key already exists for the
specified Azure Synapse instance. If not, you can create a key using the CREATE MASTER KEY command.
Additionally, to read the Azure Synapse table set through dbTable or tables referred in query , the JDBC user
must have permission to access needed Azure Synapse tables. To write data back to an Azure Synapse table set
through dbTable , the JDBC user must have permission to write to this Azure Synapse table.
The following table summarizes the required permissions for all operations with PolyBase:
Required Azure Synapse permissions for PolyBase with the external data source option
NOTE
Available in Databricks Runtime 8.4 and above.
You can use PolyBase with a pre-provisioned external data source. See the externalDataSource parameter in
Parameters for more information.
To use PolyBase with a pre-provisioned external data source, the Azure Synapse connector requires the JDBC
connection user to have permission to run the following commands in the connected Azure Synapse instance:
CREATE EXTERNAL FILE FORMAT
CREATE EXTERNAL TABLE
To create an external data source, you should first create a database scoped credential. The following links
describe how to create a scoped credential for service principals and an external data source for an ABFS
location:
CREATE DATABASE SCOPED CREDENTIAL
CREATE EXTERNAL DATA SOURCE
NOTE
The external data source location must point to a container. The connector will not work if the location is a directory in a
container.
PolyBase write operations with the external data source option require the following permissions: INSERT,
ALTER ANY EXTERNAL DATA SOURCE, and ALTER ANY EXTERNAL FILE FORMAT.
The following table summarizes the permissions for PolyBase read operations with external data source option:
You can use this connector to read via the data source API in Scala, Python, SQL, and R notebooks.
Scala
Python
# Get some data from an Azure Synapse table.
df = spark.read \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-
directory-name>") \
.option("externalDataSource", "<your-pre-provisioned-data-source>") \
.option("dbTable", "<your-table-name>") \
.load()
SQL
NOTE
Available in Databricks Runtime 7.0 and above.
When you use the COPY statement, the Azure Synapse connector requires the JDBC connection user to have
permission to run the following commands in the connected Azure Synapse instance:
COPY INTO
If the destination table does not exist in Azure Synapse, permission to run the following command is required in
addition to the command above:
CREATE TABLE
For batch and streaming writes with COPY , the JDBC connection user needs INSERT permission on an existing
target table; writing to a new table additionally requires the CREATE TABLE permission.
Parameters
The parameter map or OPTIONS provided in Spark SQL support a number of settings, including the following:
forwardSparkAzureStorageCredentials (optional, default false ): If true , the library automatically discovers the
credentials that Spark is using to connect to the Blob storage container and forwards those credentials to Azure
Synapse over JDBC. These credentials are sent as part of the JDBC query. Therefore it is strongly recommended
that you enable SSL encryption of the JDBC connection when you use this option.
For error thresholds during loads, see the REJECT_VALUE documentation in CREATE EXTERNAL TABLE and the
MAXERRORS documentation in COPY.
NOTE
tableOptions , preActions , postActions , and maxStrLength are relevant only when writing data from Azure
Databricks to a new table in Azure Synapse.
externalDataSource is relevant only when reading data from Azure Synapse and writing data from Azure Databricks
to a new table in Azure Synapse with PolyBase semantics. You should not specify other storage authentication types
while using externalDataSource such as forwardSparkAzureStorageCredentials or useAzureMSI .
checkpointLocation and numStreamingTempDirsToKeep are relevant only for streaming writes from Azure
Databricks to a new table in Azure Synapse.
Even though all data source option names are case-insensitive, we recommend that you specify them in “camel case”
for clarity.
NOTE
The Azure Synapse connector does not push down expressions operating on strings, dates, or timestamps.
Query pushdown built with the Azure Synapse connector is enabled by default. You can disable it by setting
spark.databricks.sqldw.pushdown to false .
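For example, to disable pushdown in the notebook session configuration:

```python
# Disable query pushdown into Azure Synapse for this session.
spark.conf.set("spark.databricks.sqldw.pushdown", "false")
```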
We recommend that you periodically look for leaked objects using queries such as the following:
SELECT * FROM sys.database_scoped_credentials WHERE name LIKE 'tmp_databricks_%'
SELECT * FROM sys.external_data_sources WHERE name LIKE 'tmp_databricks_%'
SELECT * FROM sys.external_file_formats WHERE name LIKE 'tmp_databricks_%'
SELECT * FROM sys.external_tables WHERE name LIKE 'tmp_databricks_%'
You can configure the streaming checkpoint table name prefix with the Spark SQL configuration option
spark.databricks.sqldw.streaming.exactlyOnce.checkpointTableNamePrefix .
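For example:

```python
# Use a custom prefix for the streaming checkpoint tables created in Azure Synapse.
spark.conf.set(
    "spark.databricks.sqldw.streaming.exactlyOnce.checkpointTableNamePrefix",
    "<custom-prefix>")
```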
What should I do if my query fails with the error “No access key found in the session conf or the
global Hadoop conf ”?
This error means that the Azure Synapse connector could not find the storage account access key in the notebook
session configuration or global Hadoop configuration for the storage account specified in tempDir . See Usage
(Batch) for examples of how to configure storage account access properly. If a Spark table is created using the Azure
Synapse connector, you must still provide the storage account access credentials in order to read or write to the
Spark table.
Can I use a Shared Access Signature (SAS) to access the Blob storage container specified by
tempDir ?
Azure Synapse does not support using SAS to access Blob storage. Therefore the Azure Synapse connector does
not support SAS to access the Blob storage container specified by tempDir .
I created a Spark table using Azure Synapse connector with the dbTable option, wrote some data
to this Spark table, and then dropped this Spark table. Will the table created at the Azure Synapse
side be dropped?
No. Azure Synapse is considered an external data source. The Azure Synapse table with the name set through
dbTable is not dropped when the Spark table is dropped.
That is because we want to make the following distinction clear: .option("dbTable", tableName) refers to the
database (that is, Azure Synapse) table, whereas .saveAsTable(tableName) refers to the Spark table. In fact, you
could even combine the two: df.write. ... .option("dbTable", tableNameDW).saveAsTable(tableNameSpark) which
creates a table in Azure Synapse called tableNameDW and an external table in Spark called tableNameSpark that is
backed by the Azure Synapse table.
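A sketch of that combined pattern in Python, with placeholder names and the same write options used in the batch examples above:

```python
# Write to an Azure Synapse table and register a Spark table backed by it.
(df.write
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "<table-name-dw>")
  .option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")
  .saveAsTable("<table-name-spark>"))
```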
WARNING
Beware of the following difference between .save() and .saveAsTable() :
For df.write. ... .option("dbTable", tableNameDW).mode(writeMode).save() , writeMode acts on the Azure
Synapse table, as expected.
For df.write. ... .option("dbTable", tableNameDW).mode(writeMode).saveAsTable(tableNameSpark) ,
writeMode acts on the Spark table, whereas tableNameDW is silently overwritten if it already exists in Azure Synapse.
This behavior is no different from writing to any other data source. It is just a caveat of the Spark DataFrameWriter API.
Binary file
7/21/2022 • 2 minutes to read
Databricks Runtime supports the binary file data source, which reads binary files and converts each file into a
single record that contains the raw content and metadata of the file. The binary file data source produces a
DataFrame with the following columns and possibly partition columns:
path (StringType) : The path of the file.
modificationTime (TimestampType) : The modification time of the file. In some Hadoop FileSystem
implementations, this parameter might be unavailable and the value would be set to a default value.
length (LongType) : The length of the file in bytes.
content (BinaryType) : The contents of the file.
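For example, a minimal read that surfaces these columns (the directory path is a placeholder):

```python
# Each file under the directory becomes one row with its raw bytes and metadata.
df = spark.read.format("binaryFile").load("<path-to-dir>")
display(df.select("path", "length", "modificationTime"))
```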
Images
Databricks recommends that you use the binary file data source to load image data.
In Databricks Runtime 8.4 and above, the Databricks display function supports displaying image data loaded
using the binary data source.
If all the loaded files have a file name with an image extension, image preview is automatically enabled:
df = spark.read.format("binaryFile").load("<path-to-image-dir>")
display(df) # image thumbnails are rendered in the "content" column
Alternatively, you can force the image preview functionality by using the mimeType option with a string value
"image/*" to annotate the binary column. Images are decoded based on their format information in the binary
content. Supported image types are bmp , gif , jpeg , and png . Unsupported files appear as a broken image
icon.
df = spark.read.format("binaryFile").option("mimeType", "image/*").load("<path-to-dir>")
display(df) # unsupported files are displayed as a broken image icon
See Reference solution for image applications for the recommended workflow to handle image data.
Options
To load files with paths matching a given glob pattern while keeping the behavior of partition discovery, you can
use the pathGlobFilter option. The following code reads all JPG files from the input directory with partition
discovery:
df = spark.read.format("binaryFile").option("pathGlobFilter", "*.jpg").load("<path-to-dir>")
If you want to ignore partition discovery and recursively search files under the input directory, use the
recursiveFileLookup option. This option searches through nested directories even if their names do not follow a
partition naming scheme like date=2019-07-01 . The following code reads all JPG files recursively from the input
directory and ignores partition discovery:
df = spark.read.format("binaryFile") \
.option("pathGlobFilter", "*.jpg") \
.option("recursiveFileLookup", "true") \
.load("<path-to-dir>")
NOTE
To improve read performance when you load data back, Azure Databricks recommends turning off compression when you
save data loaded from binary files:
spark.conf.set("spark.sql.parquet.compression.codec", "uncompressed")
df.write.format("delta").save("<path-to-table>")
Cassandra
7/21/2022 • 2 minutes to read
The following notebook shows how to connect Cassandra with Azure Databricks.
Couchbase
Couchbase provides an enterprise-class, multi-cloud to edge database that offers the robust capabilities
required for business-critical applications on a highly scalable and available platform.
The following notebook shows how to set up Couchbase with Azure Databricks.
Couchbase notebook
Get notebook
ElasticSearch
7/21/2022 • 2 minutes to read
ElasticSearch notebook
Get notebook
Image
7/21/2022 • 2 minutes to read
IMPORTANT
Databricks recommends that you use the binary file data source to load image data into the Spark DataFrame as raw
bytes. See Reference solution for image applications for the recommended workflow to handle image data.
The image data source abstracts from the details of image representations and provides a standard API to load
image data. To read image files, specify the data source format as image .
df = spark.read.format("image").load("<path-to-image-data>")
Image structure
Image files are loaded as a DataFrame containing a single struct-type column called image with the following
fields:
mode (IntegerType) : OpenCV-compatible type. The mapping from type name to value for one to four channels (C1 to C4) is:
TYPE      C1   C2   C3   C4
CV_8U      0    8   16   24
CV_8S      1    9   17   25
CV_16U     2   10   18   26
CV_16S     3   11   19   27
CV_32S     4   12   20   28
CV_32F     5   13   21   29
CV_64F     6   14   22   30
data : Image data stored in a binary format. Image data is represented as a 3-dimensional array with the
dimension shape (height, width, nChannels) and array values of type t specified by the mode field. The
array is stored in row-major order.
Notebook
The following notebook shows how to read and write data to image files.
Image data source notebook
Get notebook
This article shows how to import a Hive table from cloud storage into Azure Databricks using an external table.
The MLflow experiment data source provides a standard API to load MLflow experiment run data. You can load
data from the notebook experiment, or you can use the MLflow experiment name or experiment ID.
Requirements
Databricks Runtime 6.0 ML or above.
df = spark.read.format("mlflow-experiment").load()
display(df)
Scala
val df = spark.read.format("mlflow-experiment").load()
display(df)
df = spark.read.format("mlflow-experiment").load("3270527066281272")
display(df)
Scala
val df = spark.read.format("mlflow-experiment").load("3270527066281272,953590262154175")
display(df)
import mlflow

expId = mlflow.get_experiment_by_name("/Shared/diabetes_experiment/").experiment_id
df = spark.read.format("mlflow-experiment").load(expId)
display(df)
Scala
val expId = mlflow.getExperimentByName("/Shared/diabetes_experiment/").get.getExperimentId
val df = spark.read.format("mlflow-experiment").load(expId)
display(df)
df = spark.read.format("mlflow-experiment").load("3270527066281272")
filtered_df = df.filter("metrics.loss < 0.01 AND params.learning_rate > '0.001'")
display(filtered_df)
Scala
val df = spark.read.format("mlflow-experiment").load("3270527066281272")
val filtered_df = df.filter("metrics.loss < 1.85 AND params.num_epochs > '30'")
display(filtered_df)
Schema
The schema of the DataFrame returned by the data source is:
root
|-- run_id: string
|-- experiment_id: string
|-- metrics: map
| |-- key: string
| |-- value: double
|-- params: map
| |-- key: string
| |-- value: string
|-- tags: map
| |-- key: string
| |-- value: string
|-- start_time: timestamp
|-- end_time: timestamp
|-- status: string
|-- artifact_uri: string
MongoDB
7/21/2022 • 2 minutes to read
MongoDB notebook
Get notebook
Neo4j
7/21/2022 • 2 minutes to read
Neo4j is a native graph database that leverages data relationships as first-class entities. You can connect an
Azure Databricks cluster to a Neo4j cluster using the neo4j-spark-connector, which offers Apache Spark APIs for
RDD, DataFrame, and GraphFrames. The neo4j-spark-connector uses the binary Bolt protocol to transfer data to
and from the Neo4j server.
This article describes how to deploy and configure Neo4j, configure Azure Databricks to access Neo4j, and
includes a notebook demonstrating usage.
# conf/neo4j.conf
# Bolt connector
dbms.connector.bolt.enabled=true
#dbms.connector.bolt.tls_level=OPTIONAL
dbms.connector.bolt.listen_address=0.0.0.0:7687
spark.neo4j.bolt.url bolt://<ip-of-neo4j-instance>:7687
spark.neo4j.bolt.user <username>
spark.neo4j.bolt.password <password>
Neo4j notebook
Get notebook
Avro file
7/21/2022 • 4 minutes to read
Configuration
You can change the behavior of an Avro data source using various configuration parameters.
To ignore files without the .avro extension when reading, you can set the parameter
avro.mapred.ignore.inputs.without.extension in the Hadoop configuration. The default is false.
spark
.sparkContext
.hadoopConfiguration
.set("avro.mapred.ignore.inputs.without.extension", "true")
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")
For Databricks Runtime 9.1 LTS and above, you can change the default schema inference behavior in Avro by
providing the mergeSchema option when reading files. Setting mergeSchema to true will infer a schema from a
set of Avro files in the target directory and merge them rather than infer the read schema from a single file.
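For example, a minimal sketch of reading with merged schemas (the directory path is a placeholder):

```python
# Infer the read schema by merging the schemas of all Avro files in the directory.
df = spark.read.format("avro").option("mergeSchema", "true").load("<path-to-avro-dir>")
```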
When reading, Avro types map to Spark SQL types as follows:
AVRO TYPE      SPARK SQL TYPE
boolean BooleanType
int IntegerType
long LongType
float FloatType
double DoubleType
bytes BinaryType
string StringType
record StructType
enum StringType
array ArrayType
map MapType
fixed BinaryType
Union types
The Avro data source supports reading union types. Avro considers the following three types to be union
types:
union(int, long) maps to LongType .
union(float, double) maps to DoubleType .
union(something, null) , where something is any supported Avro type. This maps to the same Spark SQL
type as that of something , with nullable set to true .
All other union types are complex types. They map to StructType where field names are member0 , member1 ,
and so on, in accordance with members of the union . This is consistent with the behavior when converting
between Avro and Parquet.
Logical types
The Avro data source supports reading the following Avro logical types:
AVRO LOGICAL TYPE      AVRO TYPE       SPARK SQL TYPE
date                   int             DateType
timestamp-millis       long            TimestampType
timestamp-micros       long            TimestampType
decimal                fixed, bytes    DecimalType
NOTE
The Avro data source ignores docs, aliases, and other properties present in the Avro file.
When writing, most Spark SQL types map directly to Avro types; special cases include:
SPARK SQL TYPE      AVRO TYPE
ByteType            int
ShortType           int
BinaryType          bytes
You can also specify the whole output Avro schema with the option avroSchema , so that Spark SQL types can be
converted into other Avro types. The following conversions are not applied by default and require a user-specified
Avro schema:
SPARK SQL TYPE      AVRO TYPE
ByteType            fixed
StringType          enum
Examples
These examples use the episodes.avro file.
Scala
val df = spark.read.format("avro").load("/tmp/episodes.avro")
df.filter("doctor > 5").write.format("avro").save("/tmp/output")
import java.io.File
import org.apache.avro.Schema

// Parse the Avro schema from a schema file (assumed here to be episodes.avsc).
val schema = new Schema.Parser().parse(new File("episodes.avsc"))

spark
  .read
  .format("avro")
  .option("avroSchema", schema.toString)
  .load("/tmp/episodes.avro")
  .show()
val df = spark.read.format("avro").load("/tmp/episodes.avro")
import org.apache.spark.sql.SparkSession
val df = spark.createDataFrame(
Seq(
(2012, 8, "Batman", 9.8),
(2012, 8, "Hero", 8.7),
(2012, 7, "Robot", 5.5),
(2011, 7, "Git", 2.0))
).toDF("year", "month", "title", "rating")
df.toDF.write.format("avro").partitionBy("year", "month").save("/tmp/output")
val df = spark.read.format("avro").load("/tmp/episodes.avro")
df.write.options(parameters).format("avro").save("/tmp/output")
To query Avro data in SQL, register the data file as a table or temporary view:
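A minimal sketch, run from a Python cell and reusing the episodes.avro example file:

```python
# Register the Avro file as a temporary view, then query it with SQL.
spark.sql("""
  CREATE TEMPORARY VIEW episodes
  USING avro
  OPTIONS (path '/tmp/episodes.avro')
""")
display(spark.sql("SELECT * FROM episodes"))
```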
Notebook
The following notebook demonstrates how to read and write Avro files.
Read and write Avro files notebook
Get notebook
CSV file
7/21/2022 • 3 minutes to read
This article provides examples for reading and writing to CSV files with Azure Databricks using Python, Scala, R,
and SQL.
NOTE
You can use SQL to read CSV data directly or by using a temporary view. Databricks recommends using a temporary view.
Reading the CSV file directly has the following drawbacks:
You can’t specify data source options.
You can’t specify the schema for the data.
See Examples.
Options
You can configure several options for CSV file data sources. See the following Apache Spark reference articles
for supported read and write options.
Read
Python
Scala
Write
Python
Scala
The rescued data column ensures that you never lose or miss out on data during ETL. The rescued data column
contains any data that wasn’t parsed, either because it was missing from the given schema, or because there
was a type mismatch, or because the casing of the column in the record or file didn’t match with that in the
schema. The rescued data column is returned as a JSON blob containing the columns that were rescued, and the
source file path of the record (the source file path is available in Databricks Runtime 8.3 and above). To remove
the source file path from the rescued data column, you can set the SQL configuration
spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false") . You can enable the
rescued data column by setting the option rescuedDataColumn to a column name, such as _rescued_data with
spark.read.option("rescuedDataColumn", "_rescued_data").format("csv").load(<path>) .
The CSV parser supports three modes when parsing records: PERMISSIVE , DROPMALFORMED , and FAILFAST . When
used together with rescuedDataColumn , data type mismatches do not cause records to be dropped in
DROPMALFORMED mode or throw an error in FAILFAST mode. Only corrupt records—that is, incomplete or
malformed CSV—are dropped or throw errors. If you use the option badRecordsPath when parsing CSV, data
type mismatches are not considered as bad records when using the rescuedDataColumn . Only incomplete and
malformed CSV records are stored in badRecordsPath .
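A sketch combining both options (paths are placeholders):

```python
# Keep mismatched or unparsed values in _rescued_data, and route incomplete or
# malformed CSV records to badRecordsPath instead of failing the read.
df = (spark.read
      .option("rescuedDataColumn", "_rescued_data")
      .option("badRecordsPath", "<path-to-bad-records>")
      .format("csv")
      .load("<path-to-csv>"))
```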
Examples
These examples use the diamonds dataset. Specify the path to the dataset as well as any options that you would
like.
In this section:
Read file in any language
Specify schema
Verify correctness of the data
Pitfalls of reading a subset of columns
Read file in any language
This notebook shows how to read a file, display sample data, and print the data schema using Scala, R, Python,
and SQL.
Read CSV files notebook
Get notebook
Specify schema
When the schema of the CSV file is known, you can specify the desired schema to the CSV reader with the
schema option.
In the PERMISSIVE mode it is possible to inspect the rows that could not be parsed correctly. To do that, you can
add _corrupt_record column to the schema.
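A minimal sketch of that pattern, using hypothetical column names and a placeholder path:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare the expected columns plus _corrupt_record; in PERMISSIVE mode, rows that
# fail to parse keep their original text in the _corrupt_record column.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read
      .schema(schema)
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .format("csv")
      .load("<path-to-csv>"))
display(df)
```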
Find malformed rows notebook
Get notebook
Pitfalls of reading a subset of columns
The behavior of the CSV parser depends on the set of columns that are read. If the specified schema is incorrect,
the results might differ considerably depending on the subset of columns that is accessed. The following
notebook presents the most common pitfalls.
Caveats of reading a subset of columns of a CSV file notebook
Get notebook
JSON file
7/21/2022 • 2 minutes to read
You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts
and read in parallel. In multi-line mode, a file is loaded as a whole entity and cannot be split.
For further information, see JSON Files.
Options
See the following Apache Spark reference articles for supported read and write options.
Read
Python
Scala
Write
Python
Scala
The rescued data column ensures that you never lose or miss out on data during ETL. The rescued data column
contains any data that wasn’t parsed, either because it was missing from the given schema, or because there
was a type mismatch, or because the casing of the column in the record or file didn’t match with that in the
schema. The rescued data column is returned as a JSON blob containing the columns that were rescued, and the
source file path of the record (the source file path is available in Databricks Runtime 8.3 and above). To remove
the source file path from the rescued data column, you can set the SQL configuration
spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false") . You can enable the
rescued data column by setting the option rescuedDataColumn to a column name, such as _rescued_data with
spark.read.option("rescuedDataColumn", "_rescued_data").format("json").load(<path>) .
The JSON parser supports three modes when parsing records: PERMISSIVE , DROPMALFORMED , and FAILFAST .
When used together with rescuedDataColumn , data type mismatches do not cause records to be dropped in
DROPMALFORMED mode or throw an error in FAILFAST mode. Only corrupt records—that is, incomplete or
malformed JSON—are dropped or throw errors. If you use the option badRecordsPath when parsing JSON, data
type mismatches are not considered as bad records when using the rescuedDataColumn . Only incomplete and
malformed JSON records are stored in badRecordsPath .
Examples
Single -line mode
In this example, there is one JSON object per line:
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}
val df = spark.read.format("json").load("example.json")
df.printSchema
root
|-- array: array (nullable = true)
| |-- element: long (containsNull = true)
|-- dict: struct (nullable = true)
| |-- extra_key: string (nullable = true)
| |-- key: string (nullable = true)
|-- int: long (nullable = true)
|-- string: string (nullable = true)
Multi-line mode
This JSON object occupies multiple lines:
[
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
{
"string": "string3",
"int": 3,
"array": [
3,
6,
9
],
"dict": {
"key": "value3",
"extra_key": "extra_value3"
}
}
]
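To read a file like this, set the multiLine option; a minimal sketch (the file name is a placeholder):

```python
# Load the whole file as a single JSON document; multi-line files cannot be split.
df = spark.read.option("multiLine", "true").format("json").load("<path-to-multi-line-json>")
df.printSchema()
```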
To read JSON files in a character set other than the default, specify the charset option:
Scala
spark.read.option("charset", "UTF-16BE").format("json").load("fileInUTF16.json")
Some supported charsets include: UTF-8 , UTF-16BE , UTF-16LE , UTF-16 , UTF-32BE , UTF-32LE , UTF-32 . For the
full list of charsets supported by Oracle Java SE, see Supported Encodings.
Notebook
The following notebook demonstrates single line and multi-line mode.
Read JSON files notebook
Get notebook
LZO compressed file
7/21/2022 • 2 minutes to read
Due to licensing restrictions, the LZO compression codec is not available by default on Azure Databricks clusters.
To read an LZO compressed file, you must use an init script to install the codec on your cluster at launch time.
This article includes two notebooks:
Init LZO compressed files
Builds the LZO codec.
Creates an init script that:
Installs the LZO compression libraries and the lzop command, and copies the LZO codec to proper
class path.
Configures Spark to use the LZO compression codec.
Read LZO compressed files - Uses the codec installed by the init script.
Parquet file
Apache Parquet is a columnar file format that provides optimizations to speed up queries and is a far more
efficient file format than CSV or JSON.
For further information, see Parquet Files.
Options
See the following Apache Spark reference articles for supported read and write options.
Read
Python
Scala
Write
Python
Scala
The following notebook shows how to read and write data to Parquet files.
Reading Parquet files notebook
Get notebook
Redis
7/21/2022 • 2 minutes to read
Redis is a popular key-value store that is fast and easy to use. The following notebook shows how to use Redis
with Apache Spark.
Example Notebooks
Redis overview notebook
Get notebook
Snowflake
7/21/2022 • 2 minutes to read
Snowflake is a cloud-based SQL data warehouse. This article explains how to read data from and write data to
Snowflake using the Databricks Snowflake connector.
TIP
Avoid exposing your Snowflake username and password in notebooks by using Secrets, which are demonstrated in the
notebooks.
In this section:
Snowflake Scala notebook
Snowflake Python notebook
Snowflake R notebook
Snowflake Scala notebook
Get notebook
Snowflake Python notebook
Get notebook
Snowflake R notebook
Get notebook
Zip files
Hadoop does not have support for zip files as a compression codec. While a text file in GZip, BZip2, and other
supported compression formats can be configured to be automatically decompressed in Apache Spark as long
as it has the right file extension, you must perform additional steps to read zip files.
The following notebooks show how to read zip files. After you download a zip file to a temp directory, you can
invoke the Azure Databricks %sh zip magic command to unzip the file. For the sample file used in the
notebooks, the tail step removes a comment line from the unzipped file.
When you use %sh to operate on files, the results are stored in the directory /databricks/driver . Before you
load the file using the Spark API, you move the file to DBFS using Databricks Utilities.
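A rough sketch of that workflow (the URL and file names are placeholders):

```python
# In a separate notebook cell, download and unzip the file on the driver:
#   %sh
#   curl -o /databricks/driver/sample.zip https://<host>/<path>/sample.zip
#   unzip /databricks/driver/sample.zip -d /databricks/driver/
#
# Then move the unzipped file from the driver's local filesystem into DBFS
# so that the Spark API can read it.
dbutils.fs.mv("file:/databricks/driver/sample.csv", "dbfs:/tmp/sample.csv")
df = spark.read.format("csv").option("header", "true").load("dbfs:/tmp/sample.csv")
```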
Databricks integrates with a wide range of data sources, developer tools, and partner solutions.
Data sources: Databricks can read data from and write data to a variety of data formats such as CSV, Delta
Lake, JSON, Parquet, XML, and other formats, as well as data storage providers such as Azure Data Lake
Storage, Google BigQuery and Cloud Storage, Snowflake, and other providers.
Developer tools: Databricks supports various developer tools such as DataGrip, IntelliJ, PyCharm, Visual
Studio Code, and others, that allow you to work with data through Azure Databricks clusters and Databricks
SQL warehouses by writing code.
Partner solutions: Databricks has validated integrations with various third-party products such as Fivetran,
Power BI, Tableau, and others, that allow you to work with data through Azure Databricks clusters and SQL
warehouses, in many cases with low-code and no-code experiences. These solutions enable common
scenarios such as data ingestion, data preparation and transformation, business intelligence (BI), and
machine learning. Databricks also provides Partner Connect, a user interface that allows some of these
validated solutions to integrate faster and easier with your Azure Databricks clusters and SQL warehouses.
See also Git integration with Databricks Repos.
This guide covers Databricks partner solutions:
Databricks Partner Connect
Databricks partners
Explore and create tables with the Data tab
7/21/2022 • 3 minutes to read
You can use the Data tab in the Data Science & Engineering workspace to create, view, and delete tables.
Databricks recommends using features in Databricks SQL to complete these tasks; the Data explorer provides an
improved experience for viewing data objects and managing ACLs, and the create table UI allows users to easily
ingest small files into Delta Lake.
Requirements
To view and create databases and tables, you must be connected to a running cluster.
You can change the cluster from the Databases menu, create table UI, or view table UI. For example, from the
Databases menu:
1. Click the cluster drop-down at the top of the Databases folder.
2. Select a cluster.
Import data
If you have small data files on your local machine that you want to analyze with Azure Databricks, you can
import them to DBFS using the UI.
NOTE
This feature may be disabled by admin users. To enable or disable this setting, see Manage data upload.
Create a table
The Create button in the sidebar and the Create Table button in the Data tab both launch the Create Table UI.
You can populate a table from files in DBFS or data stored in any of the supported data sources.
NOTE
When you create a table using the UI, you cannot update the table.
1. Click Data in the sidebar. The Databases and Tables folders appear.
2. In the Databases folder, select a database.
3. Above the Tables folder, click Create Table .
4. Choose a data source and follow the steps in the corresponding section to configure the table.
If an Azure Databricks administrator has disabled the Upload File option, you do not have the option to
upload files; you can create tables using one of the other data sources.
Instructions for uploading a file:
a. Drag files to the Files dropzone or click the dropzone to browse and choose files. After upload, a
path displays for each file. The path will be something like
/FileStore/tables/<filename>-<integer>.<file-type> . You can use this path in a notebook to read
data.
Instructions for DBFS:
a. Select a file.
b. Click Create Table with UI .
c. In the Cluster drop-down, choose a cluster.
5. Click Preview Table to view the table.
6. In the Table Name field, optionally override the default table name. A table name can contain only
lowercase alphanumeric characters and underscores and must start with a lowercase letter or
underscore.
7. In the Create in Database field, optionally override the selected default database.
8. In the File Type field, optionally override the inferred file type.
9. If the file type is CSV:
a. In the Column Delimiter field, select whether to override the inferred delimiter.
b. Indicate whether to use the first row as the column titles.
c. Indicate whether to infer the schema.
10. If the file type is JSON, indicate whether the file is multi-line.
11. Click Create Table .
Create a table in a notebook
In the Create New Table UI you can use quickstart notebooks provided by Azure Databricks to connect to any
data source.
DBFS : Click Create Table in Notebook .
Other Data Sources : In the Connector drop-down, select a data source type. Then click Create Table in
Notebook .
View table details
The table details view shows the table schema and sample data.
NOTE
To display the table preview, a Spark SQL query runs on the cluster selected in the Cluster drop-down. If the
cluster already has a workload running on it, the table preview may take longer to load.
File metadata column
7/21/2022 • 2 minutes to read
NOTE
Available in Databricks Runtime 10.5 and above.
You can get metadata information for input files with the _metadata column. The _metadata column is a hidden
column, and is available for all input file formats. To include the _metadata column in the returned DataFrame,
you must explicitly reference it in your query.
If the data source contains a column named _metadata , queries will return the column from the data source,
and not the file metadata.
WARNING
New fields may be added to the _metadata column in future releases. To prevent schema evolution errors if the
_metadata column is updated, Databricks recommends selecting specific fields from the column in your queries. See
examples.
Supported metadata
The _metadata column is a STRUCT containing the following fields:
Examples
Use in a basic file -based data source reader
Python
df = spark.read \
.format("csv") \
.schema(schema) \
.load("dbfs:/tmp/*") \
.select("*", "_metadata")
display(df)
'''
Result:
+---------+-----+----------------------------------------------------+
| name | age | _metadata |
+=========+=====+====================================================+
| | | { |
| | | "file_path": "dbfs:/tmp/f0.csv", |
| Debbie | 18 | "file_name": "f0.csv", |
| | | "file_size": 12, |
| | | "file_modification_time": "2021-07-02 01:05:21" |
| | | } |
+---------+-----+----------------------------------------------------+
| | | { |
| | | "file_path": "dbfs:/tmp/f1.csv", |
| Frank | 24 | "file_name": "f1.csv", |
| | | "file_size": 12, |
| | | "file_modification_time": "2021-12-20 02:06:21" |
| | | } |
+---------+-----+----------------------------------------------------+
'''
Scala
val df = spark.read
.format("csv")
.schema(schema)
.load("dbfs:/tmp/*")
.select("*", "_metadata")
display(df)
/* Result:
+---------+-----+----------------------------------------------------+
| name | age | _metadata |
+=========+=====+====================================================+
| | | { |
| | | "file_path": "dbfs:/tmp/f0.csv", |
| Debbie | 18 | "file_name": "f0.csv", |
| | | "file_size": 12, |
| | | "file_modification_time": "2021-07-02 01:05:21" |
| | | } |
+---------+-----+----------------------------------------------------+
| | | { |
| | | "file_path": "dbfs:/tmp/f1.csv", |
| Frank | 24 | "file_name": "f1.csv", |
| | | "file_size": 10, |
| | | "file_modification_time": "2021-12-20 02:06:21" |
| | | } |
+---------+-----+----------------------------------------------------+
*/
To select specific fields from the _metadata column:
Scala
spark.read
.format("csv")
.schema(schema)
.load("dbfs:/tmp/*")
.select("_metadata.file_name", "_metadata.file_size")
Use in filters
Python
from pyspark.sql.functions import col, lit

spark.read \
.format("csv") \
.schema(schema) \
.load("dbfs:/tmp/*") \
.select("*") \
.filter(col("_metadata.file_name") == lit("test.csv"))
Scala
import org.apache.spark.sql.functions.{col, lit}

spark.read
.format("csv")
.schema(schema)
.load("dbfs:/tmp/*")
.select("*")
.filter(col("_metadata.file_name") === lit("test.csv"))
Use in Auto Loader
Python
spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.schema(schema) \
.load("abfss://my-bucket/csvData") \
.select("*", "_metadata") \
.writeStream \
.format("delta") \
.option("checkpointLocation", checkpointLocation) \
.start(targetTable)
Scala
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.schema(schema)
.load("abfss://my-bucket/csvData")
.select("*", "_metadata")
.writeStream
.format("delta")
.option("checkpointLocation", checkpointLocation)
.start(targetTable)
Related articles
COPY INTO
Auto Loader
Structured Streaming
Workflows
7/21/2022 • 2 minutes to read
This guide shows how to process and analyze data using Azure Databricks jobs, Delta Live Tables (the Azure
Databricks data processing pipeline framework), and common workflow tools including Apache Airflow and
Azure Data Factory.
Workflows with jobs
Delta Live Tables
Managing dependencies in data pipelines
Delta Live Tables
7/21/2022 • 2 minutes to read
Delta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. You
define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster
management, monitoring, data quality, and error handling.
Instead of defining your data pipelines using a series of separate Apache Spark tasks, Delta Live Tables manages
how your data is transformed based on a target schema you define for each processing step. You can also
enforce data quality with Delta Live Tables expectations. Expectations allow you to define expected data quality
and specify how to handle records that fail those expectations.
To get started with Delta Live Tables:
Develop your first Delta Live Tables pipeline with the quickstart.
Learn about fundamental Delta Live Tables concepts.
Learn how to create, run, and manage pipelines with the Delta Live Tables user interface.
Learn how to develop Delta Live Tables pipelines with Python or SQL.
Learn how to manage data quality in your Delta Live Tables pipelines with expectations.
Learn more about Delta Live Tables:
Use external data sources in your Delta Live Tables pipelines: Data sources
Use the data produced by your Delta Live Tables pipelines: Publish data
Efficiently process continually arriving data in your Delta Live Tables pipelines: Streaming data processing
Use change data capture (CDC) processing in your Delta Live Tables pipelines: Change data capture with
Delta Live Tables
Use the Delta Live Tables API: API guide
Use the Delta Live Tables command line interface: CLI
Configure your Delta Live Tables pipelines: Pipeline settings
Analyze and report on your Delta Live Tables pipelines: Querying the event log
Run your Delta Live Tables pipelines with popular workflow orchestration tools: Workflow tool integration
Learn how to use access control lists (ACLs) to configure permissions on your Delta Live Tables pipelines:
Access control
Find answers and solutions for Delta Live Tables:
Implement common tasks in your Delta Live Tables pipelines: Cookbook
Review frequently asked questions and issues: FAQ
Learn how the Delta Live Tables upgrade process works and how to test your pipelines with the next system
version: Upgrades
Delta Live Tables quickstart
7/21/2022 • 5 minutes to read
You can easily create and run a Delta Live Tables pipeline using an Azure Databricks notebook. This article
demonstrates using a Delta Live Tables pipeline on a dataset containing Wikipedia clickstream data to:
Read the raw JSON clickstream data into a table.
Read the records from the raw data table and use Delta Live Tables expectations to create a new table that
contains cleansed data.
Use the records from the cleansed data table to make Delta Live Tables queries that create derived datasets.
In this quickstart, you:
1. Create a new notebook and add the code to implement the pipeline.
2. Create a new pipeline job using the notebook.
3. Start an update of the pipeline job.
4. View results of the pipeline job.
Requirements
You must have cluster creation permission to start a pipeline. The Delta Live Tables runtime creates a cluster
before it runs your pipeline and fails if you don’t have the correct permission.
Create a notebook
You can use an example notebook or create a new notebook to run the Delta Live Tables pipeline:
1. Go to your Azure Databricks landing page and select Create Blank Notebook .
2. In the Create Notebook dialog, give your notebook a name and select Python or SQL from the
Default Language dropdown menu. You can leave Cluster set to the default value. The Delta Live
Tables runtime creates a cluster before it runs your pipeline.
3. Click Create .
4. Copy the Python or SQL code example and paste it into your new notebook. You can add the example
code to a single cell of the notebook or multiple cells.
NOTE
You must start your pipeline from the Delta Live Tables tab of the Jobs user interface. Clicking the run button in
the notebook to run your pipeline will return an error.
Code example
Python
import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *

json_path = "/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json"

@dlt.table(
  comment="The raw wikipedia clickstream dataset, ingested from /databricks-datasets."
)
def clickstream_raw():
  return (spark.read.format("json").load(json_path))

@dlt.table(
  comment="Wikipedia clickstream data cleaned and prepared for analysis."
)
@dlt.expect("valid_current_page_title", "current_page_title IS NOT NULL")
@dlt.expect_or_fail("valid_count", "click_count > 0")
def clickstream_prepared():
  return (
    dlt.read("clickstream_raw")
      .withColumn("click_count", expr("CAST(n AS INT)"))
      .withColumnRenamed("curr_title", "current_page_title")
      .withColumnRenamed("prev_title", "previous_page_title")
      .select("current_page_title", "click_count", "previous_page_title")
  )

@dlt.table(
  comment="A table containing the top pages linking to the Apache Spark page."
)
def top_spark_referrers():
  return (
    dlt.read("clickstream_prepared")
      .filter(expr("current_page_title == 'Apache_Spark'"))
      .withColumnRenamed("previous_page_title", "referrer")
      .sort(desc("click_count"))
      .select("referrer", "click_count")
      .limit(10)
  )
SQL
CREATE OR REFRESH LIVE TABLE clickstream_raw
COMMENT "The raw wikipedia clickstream dataset, ingested from /databricks-datasets."
AS SELECT * FROM json.`/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-
json/2015_2_clickstream.json`;
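The SQL statement above creates only the raw table. A possible follow-on query for the cleansed table, written to mirror the Python clickstream_prepared definition above (the constraint names, renamed columns, and comment are carried over from that example and are a sketch rather than the original SQL listing), might look like this:
CREATE OR REFRESH LIVE TABLE clickstream_prepared(
  CONSTRAINT valid_current_page_title EXPECT (current_page_title IS NOT NULL),
  CONSTRAINT valid_count EXPECT (click_count > 0) ON VIOLATION FAIL UPDATE
)
COMMENT "Wikipedia clickstream data cleaned and prepared for analysis."
AS SELECT
  curr_title AS current_page_title,
  CAST(n AS INT) AS click_count,
  prev_title AS previous_page_title
FROM LIVE.clickstream_raw;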
Create a pipeline
To create a new pipeline using the Delta Live Tables notebook:
1. Click Workflows in the sidebar, click the Delta Live Tables tab, and click Create Pipeline .
After successfully starting the update, the Delta Live Tables system:
1. Starts a cluster using a cluster configuration created by the Delta Live Tables system. You can also specify a
custom cluster configuration.
2. Creates any tables that don’t exist and ensures that the schema is correct for any existing tables.
3. Updates tables with the latest data available.
4. Shuts down the cluster when the update is complete.
You can track the progress of the update by viewing the event log at the bottom of the Pipeline Details page.
View results
You can use the Delta Live Tables user interface to view pipeline processing details. This includes a visual view of
the pipeline graph and schemas, and record processing details such as the number of records processed and
records that fail validation.
View the pipeline graph
To view the processing graph for your pipeline, click the Graph tab. You can use your mouse to adjust the view.
Publish datasets
You can make pipeline output data available for querying by publishing tables to the Azure Databricks
metastore:
1. Click the Settings button.
2. Add the target setting to configure a database name for your tables.
3. Click Save .
Example notebooks
These notebooks provide Python and SQL examples that implement a Delta Live Tables pipeline to:
Read raw JSON clickstream data into a table.
Read the records from the raw data table and use Delta Live Tables expectations to create a new table that
contains cleansed data.
Use the records from the cleansed data table to make Delta Live Tables queries that create derived datasets.
Get started with Delta Live Tables Python notebook
Get notebook
Get started with Delta Live Tables SQL notebook
Get notebook
Find more example notebooks at _.
Delta Live Tables concepts
7/21/2022 • 10 minutes to read
This article introduces the fundamental concepts you should understand to use Delta Live Tables effectively.
In this section:
Pipelines
Datasets
Continuous and triggered pipelines
Tables and views in continuous pipelines
Development and production modes
Databricks Enhanced Autoscaling
Product editions
Pipelines
The main unit of execution in Delta Live Tables is a pipeline. A pipeline is a directed acyclic graph (DAG) linking
data sources to target datasets. You define the contents of Delta Live Tables datasets using SQL queries or
Python functions that return Spark SQL or Koalas DataFrames. A pipeline also has an associated configuration
defining the settings required to run the pipeline. You can optionally specify data quality constraints when
defining datasets.
You implement Delta Live Tables pipelines in Azure Databricks notebooks. You can implement pipelines in a
single notebook or in multiple notebooks. All queries in a single notebook must be implemented in either
Python or SQL, but you can configure multiple-notebook pipelines with a mix of Python and SQL notebooks.
Each notebook shares a storage location for output data and is able to reference datasets from other notebooks
in the pipeline.
You can use Databricks Repos to store and manage your Delta Live Tables notebooks. To make a notebook
managed with Databricks Repos available when you create a pipeline:
Add the comment line -- Databricks notebook source at the top of a SQL notebook.
Add the comment line # Databricks notebook source at the top of a Python notebook.
See Create, run, and manage Delta Live Tables pipelines to learn more about creating and running a pipeline.
See Configure multiple notebooks in a pipeline for an example of configuring a multi-notebook pipeline.
Queries
Expectations
Pipeline settings
Pipeline updates
Queries
Queries implement data transformations by defining a data source and a target dataset. Delta Live Tables
queries can be implemented in Python or SQL.
Expectations
You use expectations to specify data quality controls on the contents of a dataset. Unlike a CHECK constraint in a
traditional database which prevents adding any records that fail the constraint, expectations provide flexibility
when processing data that fails data quality requirements. This flexibility allows you to process and store data
that you expect to be messy and data that must meet strict quality requirements.
You can define expectations to retain records that fail validation, drop records that fail validation, or halt the
pipeline when a record fails validation.
Pipeline settings
Pipeline settings are defined in JSON and include the parameters required to run the pipeline, including:
Libraries (in the form of notebooks) that contain the queries that describe the tables and views to create the
target datasets in Delta Lake.
A cloud storage location where the tables and metadata required for processing will be stored. This location
is either DBFS or another location you provide.
Optional configuration for a Spark cluster where data processing will take place.
See Delta Live Tables settings for more details.
Pipeline updates
After you create the pipeline and are ready to run it, you start an update. An update:
Starts a cluster with the correct configuration.
Discovers all the tables and views defined, and checks for any analysis errors such as invalid column names,
missing dependencies, syntax errors, and so on.
Creates or updates all of the tables and views with the most recent data available.
If the pipeline is triggered, the system stops processing after updating all tables in the pipeline once.
When a triggered update completes successfully, each table is guaranteed to be updated based on the data
available when the update started.
For use cases that require low latency, you can configure a pipeline to update continuously.
See Continuous and triggered pipelines for more information about choosing an execution mode for your
pipeline.
Datasets
There are two types of datasets in a Delta Live Tables pipeline: views and tables.
Views are similar to a temporary view in SQL and are an alias for some computation. A view allows you to
break a complicated query into smaller or easier-to-understand queries. Views also allow you to reuse a
given transformation as a source for more than one table. Views are available within a pipeline only and
cannot be queried interactively.
Tables are similar to traditional materialized views. The Delta Live Tables runtime automatically creates tables
in the Delta format and ensures those tables are updated with the latest result of the query that creates the
table.
You can define a live or streaming live view or table:
A live table or view always reflects the results of the query that defines it, including when the query defining the
table or view is updated, or an input data source is updated. Like a traditional materialized view, a live table or
view may be entirely computed when possible to optimize computation resources and time.
A streaming live table or view processes data that has been added only since the last pipeline update. Streaming
tables and views are stateful; if the defining query changes, new data will be processed based on the new query
and existing data is not recomputed.
Streaming live tables are valuable for a number of use cases, including:
Data retention: a streaming live table can preserve data indefinitely, even when an input data source has low
retention, for example, a streaming data source such as Apache Kafka or Amazon Kinesis.
Data source evolution: data can be retained even if the data source changes, for example, moving from Kafka
to Kinesis.
You can publish your tables to make them available for discovery and querying by downstream consumers.
To run a pipeline continuously, set the continuous field to true in the pipeline settings:
{
  ...
  "continuous": true,
  ...
}
The execution mode is independent of the type of table being computed. Both live and streaming live tables can
be updated in either execution mode.
If some tables in your pipeline have weaker latency requirements, you can configure their update frequency
independently by setting the pipelines.trigger.interval setting:
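A minimal sketch of setting this interval for a single table in a Python pipeline; the table name, source, and the 10-second interval are illustrative only:
@dlt.table(
  spark_conf={"pipelines.trigger.interval": "10 seconds"}
)
def slowly_updated_table():
  # Controls how often the flow updating this table is triggered, independent of the rest of the pipeline.
  return dlt.read_stream("source_table")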
This option does not turn off the cluster in between pipeline updates, but can free up resources for updating
other tables in your pipeline.
Databricks Enhanced Autoscaling optimizes cluster utilization by automatically allocating cluster resources
based on workload volume, with minimal impact to the data processing latency of your pipelines.
Enhanced Autoscaling adds to the existing cluster autoscaling functionality with the following features:
Enhanced Autoscaling implements optimization of streaming workloads, and adds enhancements to improve
the performance of batch workloads. These optimizations result in more efficient cluster utilization, reduced
resource usage, and lower cost.
Enhanced Autoscaling proactively shuts down under-utilized nodes while guaranteeing there are no failed
tasks during shutdown. The existing cluster autoscaling feature scales down nodes only if the node is idle.
Requirements
To use Enhanced Autoscaling:
1. Set the pipelines.advancedAutoscaling.enabled field to "true" in the pipeline settings configuration object.
2. Add the autoscale configuration to the pipeline default cluster. The following example configures an
Enhanced Autoscaling cluster with a minimum of 5 workers and a maximum of 10 workers. max_workers
must be greater than or equal to min_workers .
NOTE
Enhanced Autoscaling is available for the default cluster only. If you include the autoscale configuration in the
maintenance cluster configuration, the existing cluster autoscaling feature is used.
If you add the autoscale configuration without the pipelines.advancedAutoscaling.enabled configuration, Delta
Live Tables will use the existing cluster autoscaling feature.
{
  "configuration": {
    "pipelines.advancedAutoscaling.enabled": "true"
  },
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 5,
        "max_workers": 10
      }
    }
  ]
}
If the pipeline is continuous, it is automatically restarted after the autoscaling configuration changes. Expect a
short period of increased latency after the restart. Following this brief period, the cluster size is updated based
on your autoscale configuration, and pipeline latency returns to its previous characteristics.
Monitoring Enhanced Autoscaling enabled pipelines
You can use the Delta Live Tables event log to monitor Enhanced Autoscaling metrics. You can view the metrics
in the user interface. Enhanced Autoscaling events have the autoscale event type. The following are example
events:
EVENT: Cluster resize request submitted
MESSAGE: Autoscale cluster to <X> executors while keeping alive <Y> executors and retiring <Z> executors.

EVENT: Cluster manager accepted the resize request
MESSAGE: Submitted request to resize cluster <cluster-id> to size <X>.

EVENT: Resizing successfully completed
MESSAGE: Achieved desired cluster size <X> for cluster <cluster-id>.
You can also view Enhanced Autoscaling events by directly querying the event log:
To query the event log for cluster performance metrics, for example, Spark task slot utilization, see Cluster
performance metrics.
To monitor cluster resizing requests and responses during Enhanced Autoscaling operations, see Databricks
Enhanced Autoscaling events.
Product editions
You can use the Delta Live Tables product edition option to run your pipeline with the features best suited for the
pipeline requirements. The following product editions are available:
core to run streaming ingest workloads. Select the core edition if your pipeline doesn’t require advanced
features such as change data capture (CDC) or Delta Live Tables expectations.
pro to run streaming ingest and CDC workloads. The pro product edition supports all of the core
features, plus support for workloads that require updating tables based on changes in source data.
advanced to run streaming ingest workloads, CDC workloads, and workloads that require expectations. The
advanced product edition supports the features of the core and pro editions, and also supports
enforcement of data quality constraints with Delta Live Tables expectations.
You can select the product edition when you create or edit a pipeline. You can select a different edition for each
pipeline.
If your pipeline includes features not supported by the selected product edition, for example, expectations, you
will receive an error message with the reason for the error. You can then edit the pipeline to select the
appropriate edition.
Create, run, and manage Delta Live Tables pipelines
7/21/2022 • 9 minutes to read
You can create, run, manage, and monitor a Delta Live Tables pipeline using the UI or the Delta Live Tables API.
You can also run your pipeline with an orchestration tool such as Azure Databricks jobs. This article focuses on
performing Delta Live Tables tasks using the UI. To use the API, see the API guide.
To create and run your first pipeline, see the Delta Live Tables quickstart.
Create a pipeline
1. Do one of the following:
Click Workflows in the sidebar, click the Delta Live Tables tab, and click Create Pipeline . The
Create Pipeline dialog appears.
In the sidebar, click Create and select Pipeline from the menu.
2. Select the Delta Live Tables product edition for the pipeline from the Product Edition drop-down.
The product edition option allows you to choose the best product edition based on the requirements of
your pipeline. See Product editions.
3. Enter a name for the pipeline in the Pipeline Name field.
4. Enter a path to a notebook containing your pipeline queries in the Notebook Libraries field, or click
to browse to your notebook.
5. To optionally add additional notebooks to the pipeline, click the Add notebook library button.
You can add notebooks in any order. Delta Live Tables automatically analyzes dataset dependencies to
construct the processing graph for your pipeline.
6. To optionally add Spark configuration settings to the cluster that will run the pipeline, click the Add
configuration button.
7. To optionally make your tables available for discovery and querying, enter a database name in the Target
field. See Publish datasets
8. To optionally enter a storage location for output data from the pipeline, enter a DBFS or cloud storage
path in the Storage Location field. The system uses a default location if you leave Storage Location
empty.
9. Select Triggered or Continuous for Pipeline Mode . See Continuous and triggered pipelines.
10. You can optionally modify the configuration for pipeline clusters, including enabling and disabling
autoscaling and setting the number of worker nodes. See Manage cluster size.
11. To optionally run this pipeline using Photon runtime, click the Use Photon Acceleration check box.
12. To optionally change the Delta Live Tables runtime version for this pipeline, click the Channel drop-down.
See the channel field in the Delta Live Tables settings.
13. Click Create .
To optionally view and edit the JSON configuration for your pipeline, click the JSON button on the Create
Pipeline dialog.
To start an update of your pipeline from the Pipeline Details page, click the Start button.
You might want to reprocess data that has already been ingested, for example, because you modified your
queries based on new requirements or to fix a bug calculating a new column. You can reprocess data that’s
already been ingested by instructing the Delta Live Tables system to perform a full refresh from the UI. To
perform a full refresh, click the drop-down arrow next to the Start button and select Full Refresh .
After starting an update or a full refresh, the system returns a message confirming your pipeline is starting.
After successfully starting the update, the Delta Live Tables system:
1. Starts a cluster using a cluster configuration created by the Delta Live Tables system. You can also specify a
custom cluster configuration.
2. Creates any tables that don’t exist and ensures that the schema is correct for any existing tables.
3. Updates tables with the latest data available.
4. Shuts down the cluster when the update is complete.
You can track the progress of the update by viewing the event log at the bottom of the Pipeline Details page.
To view details for a log entry, click the entry. The Pipeline event log details pop-up appears. To view a JSON
document containing the log details, click the JSON tab.
To learn how to query the event log, for example, to analyze performance or data quality metrics, see Delta Live
Tables event log.
View pipeline details
Pipeline graph
After the pipeline starts successfully, the pipeline graph displays. You can use your mouse to adjust the view.
To view tooltips for data quality metrics, hover over the data quality values for a dataset in the pipeline graph.
Pipeline details
The Pipeline Details panel displays information about the pipeline and the current or most recent update of
the pipeline, including pipeline and update identifiers, update status, and update runtime.
The Pipeline Details panel also displays information about the pipeline compute cluster, including the compute
cost, product edition, Databricks Runtime version, and the channel configured for the pipeline. To open the Spark
UI for the cluster in a new tab, click the Spark UI button. To open the cluster logs in a new tab, click the Logs
button. To open the cluster metrics in a new tab, click the Metrics button.
The Run as value displays the user that pipeline updates run as. The Run as user is the pipeline owner, and
pipeline updates run with this user’s permissions. To change the run as user, click Permissions and change the
pipeline owner.
Dataset details
To view details for a dataset, including the dataset schema and data quality metrics, click the dataset in the
Graph view. The dataset details panel displays.
To open the pipeline notebook in a new window, click the Path value.
To close the dataset details view and return to the Pipeline Details panel, click the close button.
Schedule a pipeline
You can start a triggered pipeline manually or run the pipeline on a schedule with an Azure Databricks job. You
can create and schedule a job with a single pipeline task directly in the Delta Live Tables UI or add a pipeline task
to a multi-task workflow in the jobs UI.
To create a single-task job and a schedule for the job in the Delta Live Tables UI:
1. Click Schedule > Add a schedule . The Schedule button is updated to show the number of existing
schedules if the pipeline is included in one or more scheduled jobs, for example, Schedule (5) .
2. Enter a name for the job in the Job name field.
3. Set the Schedule to Scheduled .
4. Specify the period, starting time, and time zone.
5. Configure one or more email addresses to receive alerts on pipeline start, success, or failure.
6. Click Create .
To create a multi-task workflow with an Azure Databricks job and add a pipeline task:
1. Create a job in the jobs UI and add your pipeline to the job workflow using a Pipeline task.
2. Create a schedule for the job in the jobs UI.
After creating the pipeline schedule, you can:
View a summary of the schedule in the Delta Live Tables UI, including the schedule name, whether it is
paused, the last run time, and the status of the last run. To view the schedule summary, click the Schedule
button.
Edit the job or the pipeline task.
Edit the schedule or pause and resume the schedule. The schedule will also be paused if you selected
Manual when creating the schedule.
Run the job manually and view details on job runs.
View pipelines
Click Workflows in the sidebar and click the Delta Live Tables tab. The Pipelines page appears with a
list of all defined pipelines, the status of the most recent pipeline updates, the pipeline identifier, and the pipeline
creator.
You can filter pipelines in the list by:
Pipeline name.
A partial text match on one or more pipeline names.
Selecting only the pipelines you own.
Selecting all pipelines you have permissions to access.
Click the Name column header to sort pipelines by name in ascending order (A -> Z) or descending order (Z ->
A).
Pipeline names render as a link when you view the pipelines list, allowing you to right-click on a pipeline name
and access context menu options such as opening the pipeline details in a new tab or window.
Edit settings
On the Pipeline Details page, click the Settings button to view and modify the pipeline settings. You can add,
edit, or remove settings. For example, to make pipeline output available for querying after you’ve created a
pipeline:
1. Click the Settings button. The Edit Pipeline Settings dialog appears.
2. Enter a database name in the Target field.
3. Click Save .
To view and edit the JSON specification, click the JSON button.
See Delta Live Tables settings for more information on configuration settings.
To view the graph, details, and events for an update, select the update in the drop-down. To return to the latest
update, click Show the latest update .
Publish datasets
When creating or editing a pipeline, you can configure the target setting to publish your table definitions to
the Azure Databricks metastore and persist the records to Delta tables.
After your update completes, you can view the database and tables, query the data, or use the data in
downstream applications.
See Delta Live Tables data publishing.
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
]
This snippet from the settings for a pipeline illustrates cluster autoscaling disabled, with the number of worker
nodes fixed at 5:
"clusters": [
  {
    "label": "default",
    "num_workers": 5
  }
]
Delete a pipeline
You can delete a pipeline from the Pipelines list or the Pipeline Details page:
Delta Live Tables Python language reference
This article provides details and examples for the Delta Live Tables Python programming interface. For the
complete API specification, see the Python API specification.
For information on the SQL API, see the Delta Live Tables SQL language reference.
Python datasets
The Python API is defined in the dlt module. You must import the dlt module in your Delta Live Tables
pipelines implemented with the Python API. Apply the @dlt.view or @dlt.table decorator to a function to
define a view or table in Python. You can use the function name or the name parameter to assign the table or
view name. The following example defines two different datasets: a view called taxi_raw that takes a JSON file
as the input source and a table called filtered_data that takes the taxi_raw view as input:
@dlt.view
def taxi_raw():
  return spark.read.format("json").load("/databricks-datasets/nyctaxi/sample/json/")

@dlt.table
def filtered_data():
  return dlt.read("taxi_raw").where(...)
View and table functions must return a Spark DataFrame or a Koalas DataFrame. A Koalas DataFrame returned
by a function is converted to a Spark Dataset by the Delta Live Tables runtime.
In addition to reading from external data sources, you can access datasets defined in the same pipeline with the
Delta Live Tables read() function. The following example demonstrates creating a customers_filtered dataset
using the read() function:
@dlt.table
def customers_raw():
  return spark.read.format("csv").load("/data/customers.csv")

@dlt.table
def customers_filteredA():
  return dlt.read("customers_raw").where(...)
You can also use the spark.table() function to access a dataset defined in the same pipeline or a table
registered in the metastore. When using the spark.table() function to access a dataset defined in the pipeline,
in the function argument prepend the LIVE keyword to the dataset name:
@dlt.table
def customers_raw():
  return spark.read.format("csv").load("/data/customers.csv")

@dlt.table
def customers_filteredB():
  return spark.table("LIVE.customers_raw").where(...)
To read data from a table registered in the metastore, in the function argument omit the LIVE keyword and
optionally qualify the table name with the database name:
@dlt.table
def customers():
  return spark.table("sales.customers").where(...)
Delta Live Tables ensures that the pipeline automatically captures the dependency between datasets. This
dependency information is used to determine the execution order when performing an update and recording
lineage information in the event log for a pipeline.
You can also return a dataset using a spark.sql expression in a query function. To read from an internal dataset,
prepend LIVE. to the dataset name:
@dlt.table
def chicago_customers():
  return spark.sql("SELECT * FROM LIVE.customers_cleaned WHERE city = 'Chicago'")
You can specify a table schema with a Python StructType or a SQL DDL string. The following examples define the
schema for a sales table with an explicit StructType (sales_schema, defined elsewhere in the notebook) and with a
DDL string:
@dlt.table(
  comment="Raw data on sales",
  schema=sales_schema)
def sales():
  return ("...")

@dlt.table(
  comment="Raw data on sales",
  schema="customer_id STRING, customer_name STRING, number_of_line_items STRING, order_datetime STRING, order_number LONG")
def sales():
  return ("...")
By default, Delta Live Tables infers the schema from the table definition if you don’t specify a schema.
Python libraries
To specify external Python libraries, use the %pip install magic command. When an update starts, Delta Live
Tables runs all cells containing a %pip install command before running any table definitions. Every Python
notebook included in the pipeline has access to all installed libraries. The following example installs a package
called logger and makes it globally available to any Python notebook in the pipeline:
%pip install logger

@dlt.table
def dataset():
  log_info(...)
  return dlt.read(..)
To install a Python wheel package, add the wheel path to the %pip install command. Installed Python wheel
packages are available to all tables in the pipeline. The following example installs a wheel named
dltfns-1.0-py3-none-any.whl from the DBFS directory /dbfs/dlt/ :
%pip install /dbfs/dlt/dltfns-1.0-py3-none-any.whl
NOTE
The Delta Live Tables Python interface has the following limitations:
The pivot() function is not supported. Using the pivot() function in a dataset definition results in non-
deterministic pipeline latencies.
Delta Live Tables Python functions are defined in the dlt module. Your pipelines implemented with the Python
API must import this module:
import dlt
Create table
To define a table in Python, apply the @table decorator. The @table decorator is an alias for the @create_table
decorator.
import dlt
@dlt.table(
  name="<name>",
  comment="<comment>",
  spark_conf={"<key>" : "<value>", "<key>" : "<value>"},
  table_properties={"<key>" : "<value>", "<key>" : "<value>"},
  path="<storage-location-path>",
  partition_cols=["<partition-column>", "<partition-column>"],
  schema="schema-definition",
  temporary=False)
@dlt.expect
@dlt.expect_or_fail
@dlt.expect_or_drop
@dlt.expect_all
@dlt.expect_all_or_drop
@dlt.expect_all_or_fail
def <function-name>():
  return (<query>)
Create view
To define a view in Python, apply the @view decorator. The @view decorator is an alias for the @create_view
decorator.
import dlt

@dlt.view(
  name="<name>",
  comment="<comment>")
@dlt.expect
@dlt.expect_or_fail
@dlt.expect_or_drop
@dlt.expect_all
@dlt.expect_all_or_drop
@dlt.expect_all_or_fail
def <function-name>():
  return (<query>)
Python properties
@TABLE OR @VIEW
name
Type: str
An optional name for the table or view. If not defined, the function name is used as the table or view name.
@TABLE OR @VIEW
comment
Type: str
spark_conf
Type: dict
table_properties
Type: dict
path
Type: str
An optional storage location for table data. If not set, the system will default to the pipeline storage location.
partition_cols
Type: array
An optional list of one or more columns to use for partitioning the table.
schema
An optional schema definition for the table. Schemas can be defined as a SQL DDL string, or with a Python
StructType .
temporary
Type: bool
TABLE OR VIEW DEFINITION
def <function-name>()
A Python function that defines the dataset. If the name parameter is not set, then <function-name> is used as the target
dataset name.
TABLE OR VIEW DEFINITION
query
Use dlt.read() or spark.table() to perform a complete read from a dataset defined in the same pipeline. When using
the spark.table() function to read from a dataset defined in the same pipeline, prepend the LIVE keyword to the dataset
name in the function argument. For example, to read from a dataset named customers :
spark.table("LIVE.customers")
You can also use the spark.table() function to read from a table registered in the metastore by omitting the LIVE
keyword and optionally qualifying the table name with the database name:
spark.table("sales.customers")
Use dlt.read_stream() to perform a streaming read from a dataset defined in the same pipeline.
Use the spark.sql function to define a SQL query to create the return dataset.
Use PySpark syntax to define Delta Live Tables queries with Python.
EXPECTATIONS
@expect("description", "constraint")
@expect_or_drop("description", "constraint")
@expect_or_fail("description", "constraint")
@expect_all(expectations)
@expect_all_or_drop(expectations)
@expect_all_or_fail(expectations)
TABLE PROPERTIES
pipelines.autoOptimize.managed
Default: true
pipelines.autoOptimize.zOrderCols
Default: None
pipelines.reset.allowed
Default: true
Delta Live Tables SQL language reference
This article provides details and examples for the Delta Live Tables SQL programming interface. For the
complete API specification, see SQL API specification.
For information on the Python API, see the Delta Live Tables Python language reference.
SQL datasets
Use the CREATE LIVE VIEW or CREATE OR REFRESH LIVE TABLE syntax to create a view or table with SQL. You can
create a dataset by reading from an external data source or from datasets defined in a pipeline. To read from an
internal dataset, prepend the LIVE keyword to the dataset name. The following example defines two different
datasets: a table called taxi_raw that takes a JSON file as the input source and a table called filtered_data that
takes the taxi_raw table as input:
Delta Live Tables automatically captures the dependencies between datasets defined in your pipeline and uses
this dependency information to determine the execution order when performing an update and to record
lineage information in the event log for a pipeline.
Both views and tables have the following optional properties:
COMMENT: A human-readable description of this dataset.
Data quality constraints enforced with expectations.
Tables also offer additional control of their materialization:
Specify how tables are partitioned using PARTITIONED BY . You can use partitioning to speed up queries.
You can set table properties using TBLPROPERTIES . See Table properties for more detail.
Set a storage location using the LOCATION setting. By default, table data is stored in the pipeline storage
location if LOCATION isn’t set.
See SQL API specification for more information about table and view properties.
Use SET to specify a configuration value for a table or view, including Spark configurations. Any table or view
you define in a notebook after the SET statement has access to the defined value. Any Spark configurations
specified using the SET statement are used when executing the Spark query for any table or view following the
SET statement. To read a configuration value in a query, use the string interpolation syntax ${} . The following
example sets a Spark configuration value named startDate and uses that value in a query:
SET startDate='2020-01-01';
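A minimal sketch of a query that might follow, using string interpolation to read the value (the table name src and the date column are illustrative only):
CREATE OR REFRESH LIVE TABLE filtered
AS SELECT * FROM src
WHERE date > ${startDate}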
To specify multiple configuration values, use a separate SET statement for each value.
To read data from a streaming source, for example, Auto Loader or an internal data set, define a STREAMING LIVE
table:
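A minimal sketch of a streaming live table that reads from Auto Loader; the table name and source path are illustrative only:
CREATE OR REFRESH STREAMING LIVE TABLE customers_bronze
AS SELECT * FROM cloud_files("/path/to/customers/", "csv")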
Create table
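A sketch of the CREATE TABLE syntax, modeled on the view syntax below and on the table options described under SQL properties (PARTITIONED BY, LOCATION, COMMENT, and TBLPROPERTIES); treat it as illustrative rather than a formal grammar:
CREATE OR REFRESH [TEMPORARY] [STREAMING] LIVE TABLE table_name
  [(
    [
      col_name1 [ COMMENT col_comment1 ],
      col_name2 [ COMMENT col_comment2 ],
      ...
    ]
    [
      CONSTRAINT expectation_name_1 EXPECT (expectation_expr1) [ON VIOLATION { FAIL UPDATE | DROP ROW }],
      CONSTRAINT expectation_name_2 EXPECT (expectation_expr2) [ON VIOLATION { FAIL UPDATE | DROP ROW }],
      ...
    ]
  )]
  [PARTITIONED BY (col_name1, col_name2, ... )]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1 [ = ] val1, key2 [ = ] val2, ... )]
  AS select_statement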
Create view
CREATE TEMPORARY [STREAMING] LIVE VIEW view_name
  [(
    [
      col_name1 [ COMMENT col_comment1 ],
      col_name2 [ COMMENT col_comment2 ],
      ...
    ]
    [
      CONSTRAINT expectation_name_1 EXPECT (expectation_expr1) [ON VIOLATION { FAIL UPDATE | DROP ROW }],
      CONSTRAINT expectation_name_2 EXPECT (expectation_expr2) [ON VIOLATION { FAIL UPDATE | DROP ROW }],
      ...
    ]
  )]
  [COMMENT view_comment]
  AS select_statement
SQL properties
CREATE TABLE OR VIEW
TEMPORARY
STREAMING
Create a table that reads an input dataset as a stream. The input dataset must be a streaming data source, for example, Auto
Loader or a STREAMING LIVE table.
PARTITIONED BY
An optional list of one or more columns to use for partitioning the table.
LOCATION
An optional storage location for table data. If not set, the system will default to the pipeline storage location.
COMMENT
TBLPROPERTIES
select_statement
A Delta Live Tables query that defines the dataset for the table.
CONSTRAINT CLAUSE
EXPECT expectation_name
Define data quality constraint expectation_name . If ON VIOLATION constraint is not defined, add rows that violate the
constraint to the target dataset.
CONSTRAINT CLAUSE
ON VIOLATION
Table properties
In addition to the table properties supported by Delta Lake, you can set the following table properties.
TABLE PROPERTIES
pipelines.autoOptimize.managed
Default: true
pipelines.autoOptimize.zOrderCols
Default: None
pipelines.reset.allowed
Default: true
You use expectations to define data quality constraints on the contents of a dataset. An expectation consists of a
description, an invariant, and an action to take when a record fails the invariant. You apply expectations to
queries using Python decorators or SQL constraint clauses.
Use the expect , expect_or_drop , and expect_or_fail expectations with Python or SQL queries to define a
single data quality constraint.
You can define expectations with one or more data quality constraints in Python pipelines using the
@expect_all , @expect_all_or_drop , and @expect_all_or_fail decorators. These decorators accept a Python
dictionary as an argument, where the key is the expectation name and the value is the expectation constraint.
You can view data quality metrics such as the number of records that violate an expectation by querying the
Delta Live Tables event log.
SQL
CONSTRAINT valid_current_page EXPECT (current_page_id IS NOT NULL and current_page_title IS NOT NULL) ON
VIOLATION DROP ROW
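A minimal Python sketch of the same constraint, assuming the same expectation name and columns as the SQL example above:
@dlt.expect_or_drop("valid_current_page", "current_page_id IS NOT NULL AND current_page_title IS NOT NULL")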
When a pipeline fails because of an expectation violation, you must fix the pipeline code to handle the invalid
data correctly before re-running the pipeline.
Fail expectations modify the Spark query plan of your transformations to track information required to detect
and report on violations. For many queries, you can use this information to identify which input record resulted
in the violation. The following is an example exception:
Expectation Violated:
{
  "flowName": "a-b",
  "verboseInfo": {
    "expectationsViolated": [
      "x1 is negative"
    ],
    "inputData": {
      "a": {"x1": 1, "y1": "a"},
      "b": {
        "x2": 1,
        "y2": "aa"
      }
    },
    "outputRecord": {
      "x1": 1,
      "y1": "a",
      "x2": 1,
      "y2": "aa"
    },
    "missingInputData": false
  }
}
Multiple expectations
Use expect_all to specify multiple data quality constraints when records that fail validation should be included
in the target dataset:
Use expect_all_or_drop to specify multiple data quality constraints when records that fail validation should be
dropped from the target dataset:
Use expect_all_or_fail to specify multiple data quality constraints when records that fail validation should halt
pipeline execution:
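A minimal sketch of expect_all_or_fail with an inline dictionary of constraints; the table name, source, and rules are illustrative only:
@dlt.table
@dlt.expect_all_or_fail({"valid_count": "count > 0", "valid_current_page": "current_page_id IS NOT NULL"})
def validated_data():
  # The update halts if any record violates either constraint.
  return dlt.read("raw_data")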
You can also define a collection of expectations as a variable and pass it to one or more queries in your pipeline:
valid_pages = {"valid_count": "count > 0", "valid_current_page": "current_page_id IS NOT NULL AND current_page_title IS NOT NULL"}

@dlt.table
@dlt.expect_all(valid_pages)
def raw_data():
  # Create raw dataset

@dlt.table
@dlt.expect_all_or_drop(valid_pages)
def prepared_data():
  # Create cleaned and prepared dataset
Delta Live Tables data sources
7/21/2022 • 2 minutes to read
You can use the following external data sources to create datasets:
Any data source that Databricks Runtime directly supports.
Any file in cloud storage such as Azure Data Lake Storage Gen2 (ADLS Gen2), AWS S3, or Google Cloud
Storage (GCS).
Any file stored in DBFS.
Databricks recommends using Auto Loader for pipelines that read data from supported file formats, particularly
for streaming live tables that operate on continually arriving data. Auto Loader is scalable, efficient, and
supports schema inference.
Python datasets can use the Apache Spark built-in file data sources to read data in a batch operation from file
formats not supported by Auto Loader.
SQL datasets can use Delta Live Tables file sources to read data in a batch operation from file formats not
supported by Auto Loader.
Auto Loader
The following examples use Auto Loader to create datasets from CSV and JSON files:
Python
@dlt.table
def customers():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .load("/databricks-datasets/retail-org/customers/")
  )

@dlt.table
def sales_orders_raw():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("/databricks-datasets/retail-org/sales_orders/")
  )
SQL
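A possible SQL counterpart, mirroring the Python examples above (the table names and paths are carried over from those examples):
CREATE OR REFRESH STREAMING LIVE TABLE customers
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv")

CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_raw
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/sales_orders/", "json")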
You can use supported format options with Auto Loader. Using the map() function, you can pass any number of
options to the cloud_files() method. Options are key-value pairs, where the keys and values are strings. The
following describes the syntax for working with Auto Loader in SQL:
CREATE OR REFRESH STREAMING LIVE TABLE <table_name>
AS SELECT *
FROM cloud_files(
  "<file_path>",
  "<file_format>",
  map(
    "<option_key>", "<option_value>",
    "<option_key>", "<option_value>",
    ...
  )
)
The following example reads data from tab-delimited CSV files with a header:
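A minimal sketch, assuming a hypothetical source path; delimiter and header are standard CSV reader options passed through the map() function:
CREATE OR REFRESH STREAMING LIVE TABLE tsv_data
AS SELECT *
FROM cloud_files(
  "/path/to/source/data",
  "csv",
  map("delimiter", "\t", "header", "true")
)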
You can specify the schema manually; you must specify the schema for formats that do not support schema
inference:
Python
@dlt.table
def wiki_raw():
  return (
    spark.readStream.format("cloudFiles")
      .schema("title STRING, id INT, revisionId INT, revisionTimestamp TIMESTAMP, revisionUsername STRING, revisionUsernameId INT, text STRING")
      .option("cloudFiles.format", "parquet")
      .load("/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet")
  )
SQL
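A possible SQL counterpart to the Python example above, assuming the schema can be passed as a reader option through the map() function:
CREATE OR REFRESH STREAMING LIVE TABLE wiki_raw
AS SELECT *
FROM cloud_files(
  "/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet",
  "parquet",
  map("schema", "title STRING, id INT, revisionId INT, revisionTimestamp TIMESTAMP, revisionUsername STRING, revisionUsernameId INT, text STRING")
)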
NOTE
Delta Live Tables automatically configures and manages the schema and checkpoint directories when using Auto Loader
to read files. However, if you manually configure either of these directories, performing a full refresh does not affect the
contents of the configured directories. Databricks recommends using the automatically configured directories to avoid
unexpected side effects during processing.
You can make the output data of your pipeline discoverable and available to query by publishing datasets to the
Azure Databricks metastore. To publish datasets to the metastore, enter a database name in the Target field
when you create a pipeline. You can also add a target database to an existing pipeline:
1. Click the Settings button.
2. Add the target setting to configure a database name for your tables.
3. Click Save .
Exclude tables
To prevent publishing of intermediate tables that are not intended for external consumption, mark them as
TEMPORARY :
@dlt.table(
  temporary=True)
def temp_table():
  return ("...")
Streaming data processing
7/21/2022 • 3 minutes to read
Many applications require that tables be updated based on continually arriving data. However, as data sizes
grow, the resources required to reprocess data with each update can become prohibitive. You can define a
streaming table or view to incrementally compute continually arriving data. Streaming tables and views reduce
the cost of ingesting new data and the latency at which new data is made available.
When an update is triggered for a pipeline, a streaming table or view processes only new data that has arrived
since the last update. Data already processed is automatically tracked by the Delta Live Tables runtime.
inputPath = "/databricks-datasets/structured-streaming/events/"

@dlt.table
def streaming_bronze_table():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load(inputPath)
  )
SQL
@dlt.table
def streaming_silver_table():
  return dlt.read_stream("streaming_bronze_table").where(...)
SQL
@dlt.table
def streaming_bronze():
  return (
    # Since this is a streaming source, this table is incremental.
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("abfss://path/to/raw/data")
  )

@dlt.table
def streaming_silver():
  # Since we read the bronze table as a stream, this silver table is also
  # updated incrementally.
  return dlt.read_stream("streaming_bronze").where(...)

@dlt.table
def live_gold():
  # This table will be recomputed completely by reading the whole silver table
  # when it is updated.
  return dlt.read("streaming_silver").groupBy("user_id").count()
SQL
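A possible SQL counterpart to the Python pipeline above, assuming the same table names; STREAM(LIVE.<table>) reads an internal dataset as a stream:
CREATE OR REFRESH STREAMING LIVE TABLE streaming_bronze
AS SELECT * FROM cloud_files("abfss://path/to/raw/data", "json");

CREATE OR REFRESH STREAMING LIVE TABLE streaming_silver
AS SELECT * FROM STREAM(LIVE.streaming_bronze) WHERE ...;

CREATE OR REFRESH LIVE TABLE live_gold
AS SELECT user_id, count(*) AS count FROM LIVE.streaming_silver GROUP BY user_id;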
Learn more about using Auto Loader to efficiently read JSON files from Azure storage for incremental
processing.
Streaming joins
Delta Live Tables supports various join strategies for updating tables.
Stream-batch joins
Stream-batch joins are a good choice when denormalizing a continuous stream of append-only data with a
primarily static dimension table. Each time the derived dataset is updated, new records from the stream are
joined with a static snapshot of the batch table when the update started. Records added or updated in the static
table are not reflected in the table until a full refresh is performed.
The following are examples of stream-batch joins:
Python
@dlt.table
def customer_sales():
  return dlt.read_stream("sales").join(dlt.read("customers"), ["customer_id"], "left")
SQL
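A possible SQL counterpart, assuming the same sales and customers datasets as the Python example:
CREATE OR REFRESH STREAMING LIVE TABLE customer_sales
AS SELECT * FROM STREAM(LIVE.sales)
  LEFT JOIN LIVE.customers USING (customer_id)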
In continuous pipelines, the batch side of the join is regularly polled for updates with each micro-batch.
Streaming aggregation
Simple distributive aggregates like count, min, max, or sum, and algebraic aggregates like average or standard
deviation can also be calculated incrementally with streaming live tables. Databricks recommends incremental
aggregation for queries with a limited number of groups, for example, a query with a GROUP BY country clause.
Only new input data is read with each update.
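A minimal sketch of an incremental aggregation, assuming a hypothetical sales streaming table with a country column:
@dlt.table
def sales_by_country():
  # The count per country is maintained incrementally; only new rows are read on each update.
  return dlt.read_stream("sales").groupBy("country").count()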
Change data capture with Delta Live Tables
7/21/2022 • 13 minutes to read
NOTE
This article describes how to update tables in your Delta Live Tables pipeline based on changes in source data. To learn
how to record and query row-level change information for Delta tables, see Change data feed.
IMPORTANT
Delta Live Tables support for SCD type 2 is in Public Preview.
You can use change data capture (CDC) in Delta Live Tables to update tables based on changes in source data.
CDC is supported in the Delta Live Tables SQL and Python interfaces. Delta Live Tables supports updating tables
with slowly changing dimensions (SCD) type 1 and type 2:
Use SCD type 1 to update records directly. History is not retained for records that are updated.
Use SCD type 2 to retain the history of all updates to records.
To represent the effective period of a change, SCD Type 2 stores every change with the generated __START_AT
and __END_AT columns. Delta Live Tables uses the column specified by SEQUENCE BY in SQL or sequence_by in
Python to generate the __START_AT and __END_AT columns.
NOTE
The data type of the __START_AT and __END_AT columns is the same as the data type of the specified SEQUENCE BY
field.
SQL
Use the APPLY CHANGES INTO statement to use Delta Live Tables CDC functionality:
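A sketch of the statement's general shape, built from the clauses described below; the table and column names are illustrative and optional clauses are shown in brackets:
APPLY CHANGES INTO LIVE.target_table
FROM STREAM(LIVE.cdc_source)
KEYS (key_column)
[WHERE condition]
[IGNORE NULL UPDATES]
[APPLY AS DELETE WHEN condition]
SEQUENCE BY sequence_column
[COLUMNS {col_list | * EXCEPT (col_list)}]
[STORED AS {SCD TYPE 1 | SCD TYPE 2}]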
KEYS
The column or combination of columns that uniquely identify a row in the source data. This is used to identify which CDC
events apply to specific records in the target table.
WHERE
A condition applied to both source and target to trigger optimizations such as partition pruning. This condition cannot be
used to drop source rows; all CDC rows in the source must satisfy this condition or an error is thrown. Using the WHERE
clause is optional and should be used when your processing requires specific optimizations.
IGNORE NULL UPDATES
Allow ingesting updates containing a subset of the target columns. When a CDC event matches an existing row and IGNORE
NULL UPDATES is specified, columns with a null will retain their existing values in the target. This also applies to nested
columns with a value of null .
APPLY AS DELETE WHEN
Specifies when a CDC event should be treated as a DELETE rather than an upsert. To handle out-of-order data, the deleted
row is temporarily retained as a tombstone in the underlying Delta table, and a view is created in the metastore that filters out
these tombstones. The retention interval can be configured with the
pipelines.cdc.tombstoneGCThresholdInSeconds table property.
SEQUENCE BY
The column name specifying the logical order of CDC events in the source data. Delta Live Tables uses this sequencing to
handle change events that arrive out of order.
COLUMNS
Specifies a subset of columns to include in the target table. You can either:
* Specify the complete list of columns to include: COLUMNS (userId, name, city) .
* Specify a list of columns to exclude: COLUMNS * EXCEPT (operation, sequenceNum)
The default is to include all columns in the target table when the COLUMNS clause is not specified.
CLAUSES
STORED AS
Whether to store records as SCD type 1 or SCD type 2. The default is SCD type 1.
The default behavior for INSERT and UPDATE events is to upsert CDC events from the source: update any rows
in the target table that match the specified key(s) or insert a new row when a matching record does not exist in
the target table. Handling for DELETE events can be specified with the APPLY AS DELETE WHEN condition.
Python
Use the apply_changes() function in the Python API to use Delta Live Tables CDC functionality. The Delta Live
Tables Python CDC interface also provides the create_streaming_live_table() function. You can use this function
to create the target table required by the apply_changes() function. See the example queries.
Apply changes function
apply_changes(
  target = "<target-table>",
  source = "<data-source>",
  keys = ["key1", "key2", "keyN"],
  sequence_by = "<sequence-column>",
  ignore_null_updates = False,
  apply_as_deletes = None,
  column_list = None,
  except_column_list = None,
  stored_as_scd_type = <type>
)
ARGUMENTS
target
Type: str
The name of the table to be updated. You can use the create_streaming_live_table() function to create the target table before
executing the apply_changes() function.
source
Type: str
keys
Type: list
The column or combination of columns that uniquely identify a row in the source data. This is used to identify which CDC
events apply to specific records in the target table.
Arguments to col() functions cannot include qualifiers. For example, you can use col(userId) , but you cannot use
col(source.userId) .
sequence_by
The column name specifying the logical order of CDC events in the source data. Delta Live Tables uses this sequencing to
handle change events that arrive out of order.
* A string: "sequenceNum"
* A Spark SQL col() function: col("sequenceNum")
Arguments to col() functions cannot include qualifiers. For example, you can use col(userId) , but you cannot use
col(source.userId) .
ignore_null_updates
Type: bool
Allow ingesting updates containing a subset of the target columns. When a CDC event matches an existing row and
ignore_null_updates is True , columns with a null will retain their existing values in the target. This also applies to
nested columns with a value of null . When ignore_null_updates is False , existing values will be overwritten with
null values.
apply_as_deletes
Specifies when a CDC event should be treated as a DELETE rather than an upsert. To handle out-of-order data, the deleted
row is temporarily retained as a tombstone in the underlying Delta table, and a view is created in the metastore that filters out
these tombstones. The retention interval can be configured with the
pipelines.cdc.tombstoneGCThresholdInSeconds table property.
column_list
except_column_list
Type: list
A subset of columns to include in the target table. Use column_list to specify the complete list of columns to include. Use
except_column_list to specify the columns to exclude. You can declare either value as a list of strings or as Spark SQL
col() functions:
Arguments to col() functions cannot include qualifiers. For example, you can use col(userId) , but you cannot use
col(source.userId) .
The default is to include all columns in the target table when no column_list or except_column_list argument is passed
to the function.
stored_as_scd_type
The default behavior for INSERT and UPDATE events is to upsert CDC events from the source: update any rows
in the target table that match the specified key(s) or insert a new row when a matching record does not exist in
the target table. Handling for DELETE events can be specified with the apply_as_deletes argument.
Create a target table for output records
Use the create_streaming_live_table() function to create a target table for the apply_changes() output records.
NOTE
The create_target_table() function is deprecated. Databricks recommends updating existing code to use the
create_streaming_live_table() function.
create_streaming_live_table(
  name = "<table-name>",
  comment = "<comment>",
  spark_conf={"<key>" : "<value>", "<key>" : "<value>"},
  table_properties={"<key>" : "<value>", "<key>" : "<value>"},
  partition_cols=["<partition-column>", "<partition-column>"],
  path="<storage-location-path>",
  schema="schema-definition"
)
ARGUMENTS
name
Type: str
comment
Type: str
spark_conf
Type: dict
table_properties
Type: dict
partition_cols
Type: array
An optional list of one or more columns to use for partitioning the table.
path
Type: str
An optional storage location for table data. If not set, the system will default to the pipeline storage location.
ARGUMENTS
schema
An optional schema definition for the table. Schemas can be defined as a SQL DDL string, or with a Python
StructType .
When specifying the schema of the apply_changes target table, you must also include the __START_AT and
__END_AT columns with the same data type as the sequence_by field. For example, if your target table has the
columns key, STRING , value, STRING , and sequencing, LONG :
create_streaming_live_table(
  name = "target",
  comment = "Target for CDC ingestion.",
  partition_cols=["value"],
  path="$tablePath",
  schema=
    StructType(
      [
        StructField('key', StringType()),
        StructField('value', StringType()),
        StructField('sequencing', LongType()),
        StructField('__START_AT', LongType()),
        StructField('__END_AT', LongType())
      ]
    )
)
NOTE
You must ensure that a target table is created before you execute the APPLY CHANGES INTO query or
apply_changes function. See the example queries.
Metrics for the target table, such as number of output rows, are not available.
SCD type 2 updates will add a history row for every input row, even if no columns have changed.
The target of the APPLY CHANGES INTO query or apply_changes function cannot be used as a source for a
streaming live table. A table that reads from the target of an APPLY CHANGES INTO query or apply_changes
function must be a live table.
Expectations are not supported in an APPLY CHANGES INTO query or apply_changes() function. To use expectations
for the source or target dataset:
Add expectations on source data by defining an intermediate table with the required expectations and use this
dataset as the source for the target table.
Add expectations on target data with a downstream table that reads input data from the target table.
Table properties
The following table properties are added to control the behavior of tombstone management for DELETE events:
TABLE PROPERTIES
pipelines.cdc.tombstoneGCThresholdInSeconds
Set this value to match the highest expected interval between out-of-order data.
TABLE PROPERTIES
pipelines.cdc.tombstoneGCFrequencyInSeconds
Examples
These examples demonstrate Delta Live Tables SCD type 1 and type 2 queries that update target tables based on
source events that:
1. Create new user records.
2. Delete a user record.
3. Update user records. In the SCD type 1 example, the last UPDATE operations arrive late and are dropped from
the target table, demonstrating the handling of out of order events.
The following are the input records for these examples.
After running the SCD type 1 example, the target table contains the following records:
USERID NAME CITY
124 Raul Oaxaca
125 Mercedes Guadalajara
After running the SCD type 2 example, the target table contains the following records:
USERID NAME CITY __START_AT __END_AT
123 Isabel Monterrey 1 4
123 Isabel Chihuahua 4 5
124 Raul Oaxaca 1 null
125 Mercedes Tijuana 2 4
125 Mercedes Mexicali 4 5
125 Mercedes Guadalajara 5 null
1. Go to your Azure Databricks landing page and select Create a notebook or click Create in the
sidebar and select Notebook from the menu. The Create Notebook dialog appears.
2. In the Create Notebook dialog, give your notebook a name; for example, Generate test CDC records .
Select SQL from the Default Language drop-down menu.
3. If there are running clusters, the Cluster drop-down displays. Select the cluster you want to attach the
notebook to. You can also create a new cluster to attach to after you create the notebook.
4. Click Create .
5. Copy the following query and paste it into the first cell of the new notebook:
CREATE TABLE
cdc_data.users
AS SELECT
col1 AS userId,
col2 AS name,
col3 AS city,
col4 AS operation,
col5 AS sequenceNum
FROM (
VALUES
-- Initial load.
(123, "Isabel", "Monterrey", "INSERT", 1),
(124, "Raul", "Oaxaca", "INSERT", 1),
-- One new user.
(125, "Mercedes", "Tijuana", "INSERT", 2),
-- Isabel is removed from the system and Mercedes moved to Guadalajara.
(123, null, null, "DELETE", 5),
(125, "Mercedes", "Guadalajara", "UPDATE", 5),
-- This batch of updates arrived out of order. The above batch at sequenceNum 5 will be the final state.
(123, "Isabel", "Chihuahua", "UPDATE", 4),
(125, "Mercedes", "Mexicali", "UPDATE", 4)
);
6. To run the notebook and populate the test records, in the cell actions menu at the far
right, click and select Run Cell , or press shift+enter .
Create and run the SCD type 1 example pipeline
1. Go to your Azure Databricks landing page and select Create a notebook or click Create in the sidebar
and select Notebook from the menu. The Create Notebook dialog appears.
2. In the Create Notebook dialog, give your notebook a name; for example, DLT CDC example . Select
Python or SQL from the Default Language drop-down menu based on your preferred language. You can
leave Cluster set to the default value. The Delta Live Tables runtime creates a cluster before it runs your
pipeline.
3. Click Create .
4. Copy the Python or SQL query and paste it into the first cell of the notebook.
5. Create a new pipeline and add the notebook in the Notebook Libraries field. To publish the output of the
pipeline processing, you can optionally enter a database name in the Target field.
6. Start the pipeline. If you configured the Target value, you can view and validate the results of the query.
Example queries
Python
import dlt
from pyspark.sql.functions import col, expr

@dlt.view
def users():
  return spark.readStream.format("delta").table("cdc_data.users")

dlt.create_streaming_live_table("target")

dlt.apply_changes(
  target = "target",
  source = "users",
  keys = ["userId"],
  sequence_by = col("sequenceNum"),
  apply_as_deletes = expr("operation = 'DELETE'"),
  except_column_list = ["operation", "sequenceNum"],
  stored_as_scd_type = 1
)
Python (SCD type 2)
import dlt
from pyspark.sql.functions import col, expr

@dlt.view
def users():
  return spark.readStream.format("delta").table("cdc_data.users")

dlt.create_streaming_live_table("target")

dlt.apply_changes(
  target = "target",
  source = "users",
  keys = ["userId"],
  sequence_by = col("sequenceNum"),
  apply_as_deletes = expr("operation = 'DELETE'"),
  except_column_list = ["operation", "sequenceNum"],
  stored_as_scd_type = "2"
)
The Delta Live Tables API allows you to create, edit, delete, start, and view details about pipelines.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Create a pipeline
Endpoint: 2.0/pipelines
HTTP method: POST
pipeline-settings.json :
{
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"continuous": false
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file.
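As a minimal sketch, the same request can be sent with the Python requests library instead of curl; the token
variable and the local path to pipeline-settings.json are placeholders, not values defined in this guide:
import json
import requests

instance = "<databricks-instance>"        # placeholder
token = "<personal-access-token>"         # placeholder

with open("pipeline-settings.json") as f:
    settings = json.load(f)

# POST the pipeline settings to the create endpoint.
response = requests.post(
    f"https://{instance}/api/2.0/pipelines",
    headers={"Authorization": f"Bearer {token}"},
    json=settings,
)
print(response.json())  # Expected to contain the new pipeline_id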
Response
{
"pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5"
}
Request structure
See PipelineSettings.
Response structure
FIELD NAME | TYPE | DESCRIPTION
Edit a pipeline
Endpoint: 2.0/pipelines/{pipeline_id}
HTTP method: PUT
pipeline-settings.json
{
"id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"target": "wikipedia_quickstart_data",
"continuous": false
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Delete a pipeline
Endpoint: 2.0/pipelines/{pipeline_id}
HTTP method: DELETE
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Start a pipeline update

Endpoint: 2.0/pipelines/{pipeline_id}/updates
HTTP method: POST
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
{
"update_id": "a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8"
}
Request structure
FIELD NAME | TYPE | DESCRIPTION
Response structure
FIELD NAME | TYPE | DESCRIPTION
Stop any active pipeline update

Endpoint: 2.0/pipelines/{pipeline_id}/stop
HTTP method: POST
Stops any active pipeline update. If no update is running, this request is a no-op.
Example
This example stops an update for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
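A minimal sketch of the same stop request using the Python requests library; the instance and token values are
placeholders:
import requests

instance = "<databricks-instance>"        # placeholder
token = "<personal-access-token>"         # placeholder
pipeline_id = "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5"

# POST to the stop endpoint; if no update is running, the call is a no-op.
response = requests.post(
    f"https://{instance}/api/2.0/pipelines/{pipeline_id}/stop",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()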
List pipeline events

Endpoint: 2.0/pipelines/{pipeline_id}/events
HTTP method: GET
curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/events \
--data '{"max_results": 5}'
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Response structure
FIELD NAME | TYPE | DESCRIPTION
events | An array of pipeline events. | The list of events matching the request criteria.
Get pipeline details

Endpoint: 2.0/pipelines/{pipeline_id}
HTTP method: GET
Gets details about a pipeline, including the pipeline settings and recent updates.
Example
This example gets details for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request
curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Response structure
FIELD NAME | TYPE | DESCRIPTION
Get update details

Endpoint: 2.0/pipelines/{pipeline_id}/updates/{update_id}
HTTP method: GET
curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates/9a84f906-fc51-
11eb-9a03-0242ac130003
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Response structure
FIELD NAME | TYPE | DESCRIPTION
List pipelines
Endpoint: 2.0/pipelines/
HTTP method: GET
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Request structure

FIELD NAME | TYPE | DESCRIPTION
"notebook='<path>'" to select pipelines that reference the provided notebook path.
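For example, a sketch of listing pipelines that reference a specific notebook with the Python requests library;
the filter and max_results field names, and the statuses key in the response, are assumptions based on the
request and response structures referenced above:
import requests

instance = "<databricks-instance>"        # placeholder
token = "<personal-access-token>"         # placeholder

response = requests.get(
    f"https://{instance}/api/2.0/pipelines/",
    headers={"Authorization": f"Bearer {token}"},
    json={"filter": "notebook='/Users/username/my-notebook'", "max_results": 25},  # assumed field names
)
for status in response.json().get("statuses", []):   # assumed response key
    print(status.get("pipeline_id"), status.get("name"))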
Response structure
FIELD NAME | TYPE | DESCRIPTION
Data structures
In this section:
KeyValue
NotebookLibrary
PipelineLibrary
PipelineSettings
PipelineStateInfo
PipelinesNewCluster
UpdateStateInfo
KeyValue
A key-value pair that specifies configuration parameters.
NotebookLibrary
A specification for a notebook containing pipeline code.
PipelineLibrary
A specification for pipeline dependencies.
PipelineSettings
The settings for a pipeline deployment.
PipelineStateInfo
The state of a pipeline, the status of the most recent updates, and information about associated resources.
PipelinesNewCluster
A pipeline cluster specification.
The Delta Live Tables system sets the following attributes. These attributes cannot be configured by users:
spark_version
init_scripts
UpdateStateInfo
The current state of a pipeline update.
Delta Live Tables settings specify one or more notebooks that implement a pipeline and the parameters
specifying how to run the pipeline in an environment, for example, development, staging, or production. Delta
Live Tables settings are expressed as JSON and can be modified in the Delta Live Tables UI.
Settings
Fields
id
Type: string
A globally unique identifier for this pipeline. The identifier is assigned by the system and cannot be changed.
name
Type: string
A user-friendly name for this pipeline. The name can be used to identify pipeline jobs in the UI.
storage
Type: string
A location on DBFS or cloud storage where output data and metadata required for pipeline execution are stored. Tables and
metadata are stored in subdirectories of this location.
When the storage setting is not specified, the system will default to a location in dbfs:/pipelines/ .
configuration
Type: object
An optional list of settings to add to the Spark configuration of the cluster that will run the pipeline. These settings are read by
the Delta Live Tables runtime and available to pipeline queries through the Spark configuration.
See Databricks Enhanced Autoscaling for an example of using the configuration object to enable Enhanced Autoscaling for
a Delta Live Tables pipeline.
libraries
An array of notebooks containing the pipeline code and required artifacts. See Configure multiple notebooks in a pipeline for
an example.
Fields
clusters
An array of specifications for the clusters to run the pipeline. See Cluster configuration for more detail.
If this is not specified, pipelines will automatically select a default cluster configuration for the pipeline.
continuous
Type: boolean
target
Type: string
The name of a database for persisting pipeline output data. Configuring the target setting allows you to view and query the
pipeline output data from the Azure Databricks UI.
channel
Type: string
The version of the Delta Live Tables runtime to use. The supported values are current and preview.
edition
Type: string
The Delta Live Tables product edition to run the pipeline. This setting allows you to choose the best product edition based on
the requirements of your pipeline.
photon
Type: boolean
A flag indicating whether to use Photon runtime to run the pipeline. Photon is the Azure Databricks high performance Spark
engine. Photon enabled pipelines are billed at a different rate than non-Photon pipelines.
To set pipelines.trigger.interval on a table, set it in the table's spark_conf:

@dlt.table(
  spark_conf={"pipelines.trigger.interval" : "10 seconds"}
)
def <function-name>():
  return (<query>)
To set pipelines.trigger.interval on a pipeline, add it to the configuration object in the pipeline settings:
{
"configuration": {
"pipelines.trigger.interval": "10 seconds"
}
}
pipelines.trigger.interval
The value is a number plus the time unit. The following are the valid time units:
* second , seconds
* minute , minutes
* hour , hours
* day , days
You can use the singular or plural unit when defining the value, for example, "1 hour", "4 hours", or "30 seconds".
Cluster configuration
You can configure clusters used by managed pipelines with the same JSON format as the create cluster API. You
can specify configuration for two different cluster types: a default cluster where all processing is performed and
a maintenance cluster where daily maintenance tasks are run. Each cluster is identified using the label field.
Specifying cluster properties is optional, and the system uses defaults for any missing values.
NOTE
You cannot set the Spark version in cluster configurations. Delta Live Tables clusters run on a custom version of
Databricks Runtime that is continually updated to include the latest features.
Because a Delta Live Tables cluster automatically shuts down when not in use, referencing a cluster policy that sets
autotermination_minutes in your cluster configuration results in an error. To control cluster shutdown behavior,
you can use development or production mode or use the pipelines.clusterShutdown.delay setting in the
pipeline configuration. The following example sets the pipelines.clusterShutdown.delay value to 60 seconds:
{
"configuration": {
"pipelines.clusterShutdown.delay": "60s"
}
}
If you set num_workers to 0 in cluster settings, the cluster is created as a Single Node cluster. Configuring an
autoscaling cluster and setting min_workers to 0 and max_workers to 0 also creates a Single Node cluster.
If you configure an autoscaling cluster and set only min_workers to 0, then the cluster is not created as a Single
Node cluster. The cluster has at least 1 active worker at all times until terminated.
An example cluster configuration to create a Single Node cluster in Delta Live Tables:
{
"clusters": [
{
"label": "default",
"num_workers": 0
}
]
}
NOTE
If you need Azure Data Lake Storage credential passthrough or other configuration to access your storage location,
specify it for both the default cluster and the maintenance cluster.
{
"clusters": [
{
"label": "default",
"node_type_id": "Standard_D3_v2",
"driver_node_type_id": "Standard_D3_v2",
"num_workers": 20,
"spark_conf": {
"spark.databricks.io.parquet.nativeReader.enabled": "false"
}
},
{
"label": "maintenance"
}
]
}
Cluster policies
NOTE
When using cluster policies to configure Delta Live Tables clusters, Databricks recommends applying a single policy to
both the default and maintenance clusters.
To configure a cluster policy for a pipeline cluster, create a policy with the cluster_type field set to dlt :
{
"cluster_type": {
"type": "fixed",
"value": "dlt"
}
}
In the pipeline settings, set the cluster policy_id field to the value of the policy identifier. The following example
configures the default and maintenance clusters using the cluster policy with the identifier C65B864F02000008 .
{
"clusters": [
{
"label": "default",
"policy_id": "C65B864F02000008",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
},
{
"label": "maintenance",
"policy_id": "C65B864F02000008"
}
]
}
For an example of creating and using a cluster policy, see Define limits on pipeline clusters.
Examples
Configure a pipeline and cluster
The following example configures a triggered pipeline implemented in example_notebook_1 , using DBFS for
storage, and running on a small one-node cluster:
{
"name": "Example pipeline 1",
"storage": "dbfs:/pipeline-examples/storage-location/example1",
"clusters": [
{
"num_workers": 1,
"spark_conf": {}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/user@databricks.com/example_notebook_1"
}
}
],
"continuous": false
}
The following example configures a pipeline that runs the transformations from multiple notebooks:

{
"name": "Example pipeline 3",
"storage": "dbfs:/pipeline-examples/storage-location/example3",
"libraries": [
{ "notebook": { "path": "/example-notebook_1" } },
{ "notebook": { "path": "/example-notebook_2" } }
]
}
The following examples configure separate development and production pipelines that run the same notebook code but
publish their output to different target databases:

{
  "name": "Data Ingest - DEV user@databricks",
  "target": "customers_dev_user",
  "libraries": ["/Repos/user@databricks.com/ingestion/etl.py"]
}
{
  "name": "Data Ingest - PROD",
  "target": "customers",
  "libraries": ["/Repos/production/ingestion/etl.py"]
}
Parameterize pipelines
The Python and SQL code that defines your datasets can be parameterized by the pipeline’s settings.
Parameterization enables the following use cases:
Separating long paths and other variables from your code.
Reducing the amount of data that is processed in development or staging environments to speed up testing.
Reusing the same transformation logic to process from multiple data sources.
The following example uses the startDate configuration value to limit the development pipeline to a subset of
the input data:
import dlt
from pyspark.sql.functions import col

@dlt.table
def customer_events():
  start_date = spark.conf.get("mypipeline.startDate")
  return read("sourceTable").where(col("date") > start_date)
{
"name": "Data Ingest - DEV",
"configuration": {
"mypipeline.startDate": "2021-01-02"
}
}
{
"name": "Data Ingest - PROD",
"configuration": {
"mypipeline.startDate": "2010-01-02"
}
}
Delta Live Tables event log
7/21/2022 • 5 minutes to read
An event log is created and maintained for every Delta Live Tables pipeline. The event log contains all
information related to the pipeline, including audit logs, data quality checks, pipeline progress, and data lineage.
You can use the event log to track, understand, and monitor the state of your data pipelines.
The event log for each pipeline is stored in a Delta table in DBFS. You can view event log entries in the Delta Live
Tables user interface, the Delta Live Tables API, or by directly querying the Delta table. This article focuses on
querying the Delta table.
The example notebook includes queries discussed in this article and can be used to explore the Delta Live Tables
event log.
Requirements
The examples in this article use JSON SQL functions available in Databricks Runtime 8.1 or higher.
If you have not configured the storage setting, the default event log location is
/pipelines/<pipeline-id>/system/events in DBFS. For example, if the ID of your pipeline is
91de5e48-35ed-11ec-8d3d-0242ac130003 , the storage location is
/pipelines/91de5e48-35ed-11ec-8d3d-0242ac130003/system/events .
event_log_path = "/pipelines/<pipeline-id>/system/events"  # replace with your pipeline's event log location
event_log = spark.read.format('delta').load(event_log_path)
event_log.createOrReplaceTempView("event_log_raw")
Audit logging
You can use the event log to audit events, for example, user actions. Events containing information about user
actions have the event type user_action . Information about the action is stored in the user_action object in the
details field. Use the following query to construct an audit log of user events:
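The following is a minimal sketch of such a query using the PySpark DataFrame API over the event_log_raw view
registered earlier; the action and user_name field names under details:user_action are assumptions based on the
description above:
from pyspark.sql.functions import col

audit_log = (
    spark.table("event_log_raw")
    .where(col("event_type") == "user_action")
    .selectExpr(
        "timestamp",
        "details:user_action:action as action",        # assumed field name
        "details:user_action:user_name as user_name",  # assumed field name
    )
    .orderBy("timestamp")
)
display(audit_log)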
TIMESTAMP | ACTION | USER_NAME
Lineage
You can see a visual representation of your pipeline graph in the Delta Live Tables user interface. You can also
programmatically extract this information to perform tasks such as generating reports for compliance or tracking
data dependencies across an organization. Events containing information about lineage have the event type
flow_definition . The lineage information is stored in the flow_definition object in the details field. The
fields in the flow_definition object contain the necessary information to infer the relationships between
datasets:
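A sketch of extracting this lineage information with the DataFrame API; the output_dataset and input_datasets
field names within details:flow_definition are assumptions:
from pyspark.sql.functions import col

lineage = (
    spark.table("event_log_raw")
    .where(col("event_type") == "flow_definition")
    .selectExpr(
        "details:flow_definition:output_dataset as output_dataset",   # assumed field name
        "details:flow_definition:input_datasets as input_datasets",   # assumed field name
    )
)
display(lineage)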
ID | OUTPUT DATASET | INPUT DATASETS
1 | customers | null
2 | sales_orders_raw | null
4 | sales_order_in_la | ["sales_orders_cleaned"]
Data quality
The event log captures data quality metrics based on the expectations defined in your pipelines. Events
containing information about data quality have the event type flow_progress . When an expectation is defined
on a dataset, the data quality metrics are stored in the details field in the
flow_progress.data_quality.expectations object. The following example queries the data quality metrics for the
last pipeline update:
SELECT
row_expectations.dataset as dataset,
row_expectations.name as expectation,
SUM(row_expectations.passed_records) as passing_records,
SUM(row_expectations.failed_records) as failing_records
FROM
(
SELECT
explode(
from_json(
details :flow_progress :data_quality :expectations,
"array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>"
)
) row_expectations
FROM
event_log_raw
WHERE
event_type = 'flow_progress'
AND origin.update_id = '${latest_update.id}'
)
GROUP BY
row_expectations.dataset,
row_expectations.name
The following example queries cluster utilization and backlog metrics for the last pipeline update:
SELECT
timestamp,
Double(details :cluster_utilization.num_executors) as current_num_executors,
Double(details :cluster_utilization.avg_num_task_slots) as avg_num_task_slots,
Double(
details :cluster_utilization.avg_task_slot_utilization
) as avg_task_slot_utilization,
Double(
details :cluster_utilization.avg_num_queued_tasks
) as queue_size,
Double(details :flow_progress.metrics.backlog_bytes) as backlog
FROM
event_log_raw
WHERE
event_type IN ('cluster_utilization', 'flow_progress')
AND origin.update_id = '${latest_update.id}'
NOTE
The backlog metrics may not be available depending on the pipeline’s data source type and Databricks Runtime version.
The following example queries Enhanced Autoscaling scaling events for the last pipeline update:
SELECT
timestamp,
Double(
case
when details :autoscale.status = 'REQUESTED' then details :autoscale.desired_num_workers
else null
end
) as requested_workers,
Double(
case
when details :autoscale.status = 'ACCEPTED' then details :autoscale.desired_num_workers
else null
end
) as accepted_workers,
Double(
case
when details :autoscale.status = 'SUCCEEDED' then details :autoscale.desired_num_workers
else null
end
) as succeeded_workers,
Double(
case
when details :autoscale.status = 'REJECTED' then details :autoscale.desired_num_workers
else null
end
) as rejected_workers
FROM
event_log_raw
WHERE
event_type = 'autoscale'
AND origin.update_id = '${latest_update.id}'
Runtime information
You can view runtime information for a pipeline update, for example, the Databricks Runtime version for the
update:
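A sketch of such a query; the create_update event type and the dbr_version field path are assumptions about the
event log schema:
runtime_info = (
    spark.table("event_log_raw")
    .where("event_type = 'create_update'")   # assumed event type
    .selectExpr(
        "details:create_update:runtime_version:dbr_version as dbr_version"   # assumed field path
    )
)
display(runtime_info)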
DBR_VERSION
11.0
Example notebook
Querying the Delta Live Tables event log
Get notebook
Run a Delta Live Tables pipeline in a workflow
7/21/2022 • 4 minutes to read
You can run a Delta Live Tables pipeline as part of a data processing workflow with Databricks jobs, Apache
Airflow, or Azure Data Factory.
Jobs
You can orchestrate multiple tasks in a Databricks job to implement a data processing workflow. To include a
Delta Live Tables pipeline in a job, use the Pipeline task when you create a job.
Apache Airflow
Apache Airflow is an open source solution for managing and scheduling data workflows. Airflow represents
workflows as directed acyclic graphs (DAGs) of operations. You define a workflow in a Python file and Airflow
manages the scheduling and execution. For information on installing and using Airflow with Azure Databricks,
see Apache Airflow.
To run a Delta Live Tables pipeline as part of an Airflow workflow, use the DatabricksSubmitRunOperator.
Requirements
The following are required to use the Airflow support for Delta Live Tables:
Airflow version 2.1.0 or later.
The Databricks provider package version 2.1.0 or later.
Example
The following example creates an Airflow DAG that triggers an update for the Delta Live Tables pipeline with the
identifier 8279d543-063c-4d63-9926-dae38e35ce8b :
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.utils.dates import days_ago

default_args = {
  'owner': 'airflow'
}

with DAG('dlt',
         start_date=days_ago(2),
         schedule_interval="@once",
         default_args=default_args
         ) as dag:

  opr_run_now=DatabricksSubmitRunOperator(
    task_id='run_now',
    databricks_conn_id='CONNECTION_ID',
    pipeline_task={"pipeline_id": "8279d543-063c-4d63-9926-dae38e35ce8b"}
  )
Replace CONNECTION_ID with the identifier for an Airflow connection to your workspace.
Save this example in the airflow/dags directory and use the Airflow UI to view and trigger the DAG. Use the
Delta Live Tables UI to view the details of the pipeline update.
Azure Data Factory
Azure Data Factory is a cloud-based ETL service that lets you orchestrate data integration and transformation
workflows. Azure Data Factory directly supports running Azure Databricks tasks in a workflow, including
notebooks, JAR tasks, and Python scripts. You can also include a pipeline in a workflow by calling the Delta Live
Tables API from an Azure Data Factory Web activity. For example, to trigger a pipeline update from Azure Data
Factory:
1. Create a data factory or open an existing data factory.
2. When creation completes, open the page for your data factory and click the Open Azure Data Factory
Studio tile. The Azure Data Factory user interface appears.
3. Create an Azure Databricks linked service.
4. Create a new Azure Data Factory pipeline by selecting Pipeline from the New dropdown menu in the
Azure Data Factory Studio user interface.
5. In the Activities toolbox, expand General and drag the Web activity to the pipeline canvas. Click the
Settings tab and enter the following values:
NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access
tokens, or both for authentication. As a security best practice, when authenticating with automated tools, systems,
scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of
workspace users. For more information, see Service principals for Azure Databricks automation.
URL : https://<databricks-instance>/api/2.0/pipelines/<pipeline-id>/updates .
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This article contains a collection of recommendations and solutions to implement common tasks in your Delta
Live Tables pipelines.
Make expectations portable and reusable
Use Python UDFs in SQL
Use MLFlow models in a Delta Live Tables pipeline
Create sample datasets for development and testing
Programmatically manage and create multiple live tables
Quarantine invalid data
Validate row counts across tables
Retain manual deletes or updates
Exclude tables from publishing
Use secrets in a pipeline
Define limits on pipeline clusters
The following Python example defines data quality expectations based on the rules stored in the rules.csv file.
The get_rules() function reads the rules from rules.csv and returns a Python dictionary containing rules
matching the tag argument passed to the function. The dictionary is applied in the @dlt.expect_all_*()
decorators to enforce data quality constraints. For example, any records failing the rules tagged with validity
will be dropped from the raw_farmers_market table:
import dlt
from pyspark.sql.functions import expr, col
def get_rules(tag):
"""
loads data quality rules from csv file
:param tag: tag to match
:return: dictionary of rules that matched the tag
"""
rules = {}
df = spark.read.format("csv").option("header", "true").load("/path/to/rules.csv")
for row in df.filter(col("tag") == tag).collect():
rules[row['name']] = row['constraint']
return rules
@dlt.table(
name="raw_farmers_market"
)
@dlt.expect_all_or_drop(get_rules('validity'))
def get_farmers_market_data():
return (
spark.read.format('csv').option("header", "true")
.load('/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/')
)
@dlt.table(
name="organic_farmers_market"
)
@dlt.expect_all_or_drop(get_rules('maintained'))
def get_organic_farmers_market():
return (
dlt.read("raw_farmers_market")
.filter(expr("Organic = 'Y'"))
.select("MarketName", "Website", "State",
"Facebook", "Twitter", "Youtube", "Organic",
"updateTime"
)
)
3. Create a pipeline
Create a new Delta Live Tables pipeline, adding the notebooks you created to Notebook Libraries . Use
the Add notebook library button in the Create Pipeline dialog, or the libraries field in the Delta Live Tables
settings, to add additional notebooks.
The following example loads an MLflow model as a Spark UDF and uses it to score datasets in a Delta Live Tables
pipeline:

import dlt
import mlflow
from pyspark.sql.functions import struct

run_id = "mlflow_run_id"
model_name = "the_model_name_in_run"
model_uri = "runs:/{run_id}/{model_name}".format(run_id=run_id, model_name=model_name)
loaded_model = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)

@dlt.table(
  comment="GBT ML scored training dataset based on Loan Risk",
  table_properties={
    "quality": "gold"
  }
)
def gtb_scoring_train_data():
  # `features` is assumed to be a list of feature column names defined elsewhere in the notebook.
  return dlt.read("train_data").withColumn('predictions', loaded_model(struct(features)))

@dlt.table(
  comment="GBT ML scored valid dataset based on Loan Risk",
  table_properties={
    "quality": "gold"
  }
)
def gtb_scoring_valid_data():
  return dlt.read("valid_data").withColumn('predictions', loaded_model(struct(features)))
For example, a notebook used in production might ingest the complete input data:

CREATE OR REFRESH STREAMING LIVE TABLE input_data AS SELECT * FROM cloud_files("/production/data", "json")

Then create notebooks that define a sample of data based on requirements. For example, to create a smaller dataset
for development and testing by limiting the production data to recent records:

CREATE OR REFRESH LIVE TABLE input_data AS SELECT * FROM prod.input_data WHERE date > current_date() - INTERVAL 1 DAY
To use these different datasets, create multiple pipelines with the notebooks implementing the transformation
logic. Each pipeline can read data from the LIVE.input_data dataset but is configured to include the notebook
that creates the dataset specific to the environment.
You can use a metaprogramming pattern to reduce the overhead of generating and maintaining redundant flow
definitions. Metaprogramming in Delta Live Tables is done using Python inner functions. Because these
functions are lazily evaluated, you can use them to create flows that are identical except for input parameters.
Each invocation can include a different set of parameters that controls how each table should be generated, as
shown in the following example:
import dlt
import functools
from pyspark.sql.functions import *

@dlt.table(
  name="raw_fire_department",
  comment="raw table for fire department response"
)
@dlt.expect_or_drop("valid_received", "received IS NOT NULL")
@dlt.expect_or_drop("valid_response", "responded IS NOT NULL")
@dlt.expect_or_drop("valid_neighborhood", "neighborhood != 'None'")
def get_raw_fire_department():
  return (
    spark.read.format('csv')
      .option('header', 'true')
      .option('multiline', 'true')
      .load('/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv')
      .withColumnRenamed('Call Type', 'call_type')
      .withColumnRenamed('Received DtTm', 'received')
      .withColumnRenamed('Response DtTm', 'responded')
      .withColumnRenamed('Neighborhooods - Analysis Boundaries', 'neighborhood')
      .select('call_type', 'received', 'responded', 'neighborhood')
  )
all_tables = []

# The paragraph above describes wrapping parameterized table definitions in a Python inner function.
# The wrapper below is a sketch of that pattern; the call tables referenced as LIVE.{call_table}
# (providing ts_received and ts_responded columns) are assumed to be defined elsewhere in the
# pipeline, and the wrapper is invoked once per parameter set, for example
# generate_tables("alarms_table", "alarms_response").
def generate_tables(call_table, response_table):
  @dlt.table(
    name=response_table,
    comment="top 10 neighborhoods with fastest response time "
  )
  def create_response_table():
    return (
      spark.sql("""
        SELECT
          neighborhood,
          AVG((ts_received - ts_responded)) as response_time
        FROM LIVE.{call_table}
        GROUP BY 1
        ORDER BY response_time
        LIMIT 10
      """.format(call_table=call_table))
    )

  all_tables.append(response_table)

@dlt.table(
  name="best_neighborhoods",
  comment="which neighbor appears in the best response time list the most"
)
def summary():
  target_tables = [dlt.read(t) for t in all_tables]
  unioned = functools.reduce(lambda x,y: x.union(y), target_tables)
  return (
    unioned.groupBy(col("neighborhood"))
      .agg(count("*").alias("score"))
      .orderBy(desc("score"))
  )
Quarantine invalid data
Scenario
You’ve defined expectations to filter out records that violate data quality constraints, but you also want to save
the invalid records for analysis.
Solution
Create rules that are the inverse of the expectations you’ve defined and use those rules to save the invalid
records to a separate table. You can programmatically create these inverse rules. The following example creates
the valid_farmers_market table containing input records that pass the valid_website and valid_location data
quality constraints and also creates the invalid_farmers_market table containing the records that fail those data
quality constraints:
import dlt

# The specific constraints below are illustrative; define the rules your data requires.
rules = {}
rules["valid_website"] = "(Website IS NOT NULL)"
rules["valid_location"] = "(Location IS NOT NULL)"

# Build the quarantine rules as the inverse of the validity rules.
quarantine_rules = {}
quarantine_rules["invalid_record"] = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.table(
  name="raw_farmers_market"
)
def get_farmers_market_data():
  return (
    spark.read.format('csv').option("header", "true")
      .load('/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/')
  )

@dlt.table(
  name="valid_farmers_market"
)
@dlt.expect_all_or_drop(rules)
def get_valid_farmers_market():
  return (
    dlt.read("raw_farmers_market")
      .select("MarketName", "Website", "Location", "State",
              "Facebook", "Twitter", "Youtube", "Organic", "updateTime")
  )

@dlt.table(
  name="invalid_farmers_market"
)
@dlt.expect_all_or_drop(quarantine_rules)
def get_invalid_farmers_market():
  return (
    dlt.read("raw_farmers_market")
      .select("MarketName", "Website", "Location", "State",
              "Facebook", "Twitter", "Youtube", "Organic", "updateTime")
  )
A disadvantage of the above approach is that it generates the quarantine table by processing the data twice. If
you don’t want this performance overhead, you can use the constraints directly within a query to generate a
column indicating the validation status of a record. You can then partition the table by this column for further
optimization.
This approach does not use expectations, so data quality metrics do not appear in the event logs or the pipelines
UI.
import dlt
from pyspark.sql.functions import expr

# Illustrative constraints; the quarantine expression is the inverse of the combined rules
# and is evaluated directly in the query instead of as expectations.
rules = {}
rules["valid_website"] = "(Website IS NOT NULL)"
rules["valid_location"] = "(Location IS NOT NULL)"
quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.table(
  name="raw_farmers_market"
)
def get_farmers_market_data():
  return (
    spark.read.format('csv').option("header", "true")
      .load('/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/')
  )

@dlt.table(
  name="partitioned_farmers_market",
  partition_cols = [ 'Quarantine' ]
)
def get_partitioned_farmers_market():
  return (
    dlt.read("raw_farmers_market")
      .withColumn("Quarantine", expr(quarantine_rules))
      .select("MarketName", "Website", "Location", "State",
              "Facebook", "Twitter", "Youtube", "Organic", "updateTime",
              "Quarantine")
  )
You can manually delete or update the record from raw_user_table and do a refresh operation to recompute
the downstream tables. However, you need to make sure the deleted record isn’t reloaded from the source data.
Solution
Use the pipelines.reset.allowed table property to disable full refresh for raw_user_table so that intended
changes are retained over time.
Setting pipelines.reset.allowed to false prevents refreshes to raw_user_table , but does not prevent
incremental writes to the tables or prevent new data from flowing into the table.
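For example, a minimal sketch in Python; the source format and path in the table definition are illustrative,
and the key point is the pipelines.reset.allowed table property:
import dlt

@dlt.table(
  name="raw_user_table",
  # Prevent full refresh of this table so manual deletes and updates are retained.
  table_properties={"pipelines.reset.allowed": "false"}
)
def raw_user_table():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("/path/to/raw/user/data")   # illustrative path
  )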
To exclude a table from publishing to the target database, for example an intermediate table that should not be
consumed outside the pipeline, set the temporary flag:

@dlt.table(
  comment="Raw customer data",
  temporary=True)
def customers_raw():
  return ("...")
NOTE
You must add the spark.hadoop. prefix to the spark_conf configuration key that sets the secret value.
{
"id": "43246596-a63f-11ec-b909-0242ac120002",
"clusters": [
{
"label": "default",
"spark_conf": {
"spark.hadoop.fs.azure.account.key.<storage-account-name>.dfs.core.windows.net": "
{{secrets/<scope-name>/<secret-name>}}"
},
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
},
{
"label": "maintenance",
"spark_conf": {
"spark.hadoop.fs.azure.account.key.<storage-account-name>.dfs.core.windows.net": "
{{secrets/<scope-name>/<secret-name>}}"
}
}
],
"development": true,
"continuous": false,
"libraries": [
{
"notebook": {
"path": "/Users/user@databricks.com/DLT Notebooks/Delta Live Tables quickstart"
}
}
],
"name": "DLT quickstart using ADLS2"
}
Replace
<storage-account-name> with the ADLS Gen2 storage account name.
<scope-name> with the Azure Databricks secret scope name.
<secret-name> with the name of the key containing the Azure storage account access key.
import dlt
json_path = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-input-dataset>"
@dlt.create_table(
  comment="Data ingested from an ADLS2 storage account."
)
def read_from_ADLS2():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load(json_path)
  )
Replace
<container-name> with the name of the Azure storage account container that stores the input data.
<storage-account-name> with the ADLS Gen2 storage account name.
<path-to-input-dataset> with the path to the input dataset.
{
"cluster_type": {
"type": "fixed",
"value": "dlt"
},
"num_workers": {
"type": "unlimited",
"defaultValue": 3,
"isOptional": true
},
"node_type_id": {
"type": "unlimited",
"isOptional": true
},
"spark_version": {
"type": "unlimited",
"hidden": true
}
}
For more information on creating cluster policies, including example policies, see Create a cluster policy.
To use a cluster policy in a pipeline configuration, you need the policy ID. To configure the pipeline clusters to
use the policy:
NOTE
When using cluster policies to configure Delta Live Tables clusters, Databricks recommends applying a single policy to
both the default and maintenance clusters.
1. Click Workflows in the sidebar and click the Delta Live Tables tab. The Pipelines list displays.
2. Click the pipeline name. The Pipeline details page appears.
3. Click the Settings button. The Edit Pipeline Settings dialog appears.
4. Click the JSON button.
5. In the clusters setting, set the policy_id field to the value of the policy ID. The following example
configures the default and maintenance clusters using the cluster policy with the ID C65B864F02000008 :
{
"clusters": [
{
"label": "default",
"policy_id": "C65B864F02000008",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
},
{
"label": "maintenance",
"policy_id": "C65B864F02000008"
}
]
}
6. Click Save .
Delta Live Tables frequently asked questions
7/21/2022 • 2 minutes to read
Delta Live Tables clusters use a runtime based on Databricks Runtime. Databricks automatically upgrades the
Delta Live Tables runtime to support enhancements and upgrades to the platform. As with any software
upgrade, a Delta Live Tables runtime upgrade may result in errors or issues running your pipelines. This article
describes best practices to test your pipeline with upcoming releases of the Delta Live Tables runtime, and Delta
Live Tables features that enhance the stability of your pipelines.
NOTE
Delta Live Tables reverts only pipelines running in production mode and with the channel set to current .
When Delta Live Tables detects that a pipeline cannot run because of a runtime upgrade, the pipeline is reverted:
The pipeline’s Delta Live Tables runtime is pinned to the previous known-good version.
The Delta Live Tables UI shows a visual indicator that the pipeline is pinned to a previous version because of
an upgrade failure.
Databricks support is notified of the issue. If the issue is related to a regression in the runtime, Databricks will
resolve the issue. If the issue is caused by a custom library or package used by the pipeline, Databricks will
contact you to resolve the issue.
When the issue is resolved, Databricks will initiate the upgrade again.
Best practices
Automate testing of your pipelines with the next runtime version
To ensure changes in the next Delta Live Tables runtime version do not impact your pipelines, use the Delta Live
Tables channels feature:
1. Create a staging pipeline and set the channel to preview .
2. In the Delta Live Tables UI, create a schedule to run the pipeline weekly and enable alerts to receive an email
notification for pipeline failures.
3. If you receive a notification of a failure and are unable to resolve it, open a support ticket with Databricks.
Pipeline dependencies
Delta Live Tables supports external dependencies in your pipelines; for example, you can install any Python
package using the %pip install command. Delta Live Tables also supports using global and cluster-scoped init
scripts. However, these external dependencies, particularly init scripts, increase the risk of issues with runtime
upgrades. To mitigate these risks, minimize using init scripts in your pipelines. If your processing requires init
scripts, automate testing of your pipeline to detect problems early; see Automate testing of your pipelines with
the next runtime version. If you use init scripts, Databricks recommends increasing your testing frequency.
Workflows with jobs
7/21/2022 • 2 minutes to read
You can use a job to run a data processing or data analysis task in an Azure Databricks cluster with scalable
resources. Your job can consist of a single task or can be a large, multi-task workflow with complex
dependencies. Azure Databricks manages the task orchestration, cluster management, monitoring, and error
reporting for all of your jobs. You can run your jobs immediately or periodically through an easy-to-use
scheduling system. You can implement job tasks using notebooks, JARs, Delta Live Tables pipelines, or Python,
Scala, Spark submit, and Java applications.
You create jobs through the Jobs UI, the Jobs API, or the Databricks CLI. The Jobs UI allows you to monitor, test,
and troubleshoot your running and completed jobs.
To get started:
Create your first Azure Databricks jobs workflow with the quickstart.
Learn how to create, view, and run workflows with the Azure Databricks jobs user interface.
Learn about Jobs API updates to support creating and managing workflows with Azure Databricks jobs.
Jobs quickstart
7/21/2022 • 3 minutes to read
This article demonstrates an Azure Databricks job that orchestrates tasks to read and process a sample dataset.
In this quickstart, you:
1. Create a new notebook and add code to retrieve a sample dataset containing popular baby names by year.
2. Save the sample dataset to DBFS.
3. Create a new notebook and add code to read the dataset from DBFS, filter it by year, and display the results.
4. Create a new job and configure two tasks using the notebooks.
5. Run the job and view the results.
Requirements
You must have cluster creation permission to create a job cluster or permissions to an all-purpose cluster.
1. Go to your Azure Databricks landing page and select Create Blank Notebook or click Create in
the sidebar and select Notebook from the menu. The Create Notebook dialog appears.
2. In the Create Notebook dialog, give your notebook a name; for example, Retrieve baby names . Select
Python from the Default Language dropdown menu. You can leave Cluster set to the default value.
You configure the cluster when you create a task using this notebook.
3. Click Create .
4. Copy the following Python code and paste it into the first cell of the notebook.
import requests
response = requests.get('http://health.data.ny.gov/api/views/myeu-hzra/rows.csv')
csvfile = response.content.decode('utf-8')
dbutils.fs.put("dbfs:/FileStore/babynames.csv", csvfile, True)
1. Go to your Azure Databricks landing page and select Create Blank Notebook or click Create in
the sidebar and select Notebook from the menu. The Create Notebook dialog appears.
2. In the Create Notebook dialog, give your notebook a name; for example, Filter baby names . Select
Python from the Default Language dropdown menu. You can leave Cluster set to the default value.
You configure the cluster when you create a task using this notebook.
3. Click Create .
4. Copy the following Python code and paste it into the first cell of the notebook.
babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/babynames.csv")
babynames.createOrReplaceTempView("babynames_table")
years = spark.sql("select distinct(Year) from babynames_table").rdd.map(lambda row : row[0]).collect()
years.sort()
dbutils.widgets.dropdown("year", "2014", [str(x) for x in years])
display(babynames.filter(babynames.Year == dbutils.widgets.get("year")))
Create a job
1. Click Workflows in the sidebar.
2. Click .
The Tasks tab displays with the create task dialog.
3. Replace Add a name for your job… with your job name.
4. In the Task name field, enter a name for the task; for example, retrieve-baby-names .
5. In the Type drop-down, select Notebook .
6. Use the file browser to find the first notebook you created, click the notebook name, and click Confirm .
7. Click Create task .
8. Click below the task you just created to add another task.
9. In the Task name field, enter a name for the task; for example, filter-baby-names .
10. In the Type drop-down, select Notebook .
11. Use the file browser to find the second notebook you created, click the notebook name, and click
Confirm .
12. Click Add under Parameters . In the Key field, enter year . In the Value field, enter 2014 .
13. Click Create task .
1. Click next to Run Now and select Run Now with Different Parameters or click Run Now with
Different Parameters in the Active Runs table.
2. In the Value field, enter 2015 .
3. Click Run .
Jobs
7/21/2022 • 29 minutes to read
A job is a way to run non-interactive code in an Azure Databricks cluster. For example, you can run an extract,
transform, and load (ETL) workload interactively or on a schedule. You can also run jobs interactively in the
notebook UI.
You can create and run a job using the UI, the CLI, or by invoking the Jobs API. You can repair and re-run a failed
or canceled job using the UI or API. You can monitor job run results using the UI, CLI, API, and email notifications.
This article focuses on performing job tasks using the UI. For the other methods, see Jobs CLI and Jobs API 2.1.
Your job can consist of a single task or can be a large, multi-task workflow with complex dependencies. Azure
Databricks manages the task orchestration, cluster management, monitoring, and error reporting for all of your
jobs. You can run your jobs immediately or periodically through an easy-to-use scheduling system.
You can implement a task in a JAR, an Azure Databricks notebook, a Delta Live Tables pipeline, or an application
written in Scala, Java, or Python. Legacy Spark Submit applications are also supported. You control the execution
order of tasks by specifying dependencies between the tasks. You can configure tasks to run in sequence or
parallel. The following diagram illustrates a workflow that:
1. Ingests raw clickstream data and performs processing to sessionize the records.
2. Ingests order data and joins it with the sessionized clickstream data to create a prepared data set for
analysis.
3. Extracts features from the prepared data.
4. Performs tasks in parallel to persist the features and train a machine learning model.
To create your first workflow with an Azure Databricks job, see the quickstart.
IMPORTANT
You can create jobs only in a Data Science & Engineering workspace or a Machine Learning workspace.
A workspace is limited to 1000 concurrent job runs. A 429 Too Many Requests response is returned when you
request a run that cannot start immediately.
The number of jobs a workspace can create in an hour is limited to 5000 (includes “run now” and “runs submit”). This
limit also affects jobs created by the REST API and notebook workflows.
Create a job
1. Do one of the following:
2. Replace Add a name for your job… with your job name.
3. Enter a name for the task in the Task name field.
4. Specify the type of task to run. In the Type drop-down, select Notebook , JAR , Spark Submit , Python ,
or Pipeline .
Notebook : In the Source drop-down, select a location for the notebook; either Workspace for a
notebook located in a Azure Databricks workspace folder or Git provider for a notebook located
in a remote Git repository.
Workspace : Use the file browser to find the notebook, click the notebook name, and click
Confirm .
Git provider : Click Edit and enter the Git repository information. See Run jobs using notebooks
in a remote Git repository.
JAR : Specify the Main class . Use the fully qualified name of the class containing the main
method, for example, org.apache.spark.examples.SparkPi . Then click Add under Dependent
Libraries to add libraries required to run the task. One of these libraries must contain the main
class.
To learn more about JAR tasks, see JAR jobs.
Spark Submit : In the Parameters text box, specify the main class, the path to the library JAR, and
all arguments, formatted as a JSON array of strings. The following example configures a spark-
submit task to run the DFSReadWriteTest from the Apache Spark examples:
["--
class","org.apache.spark.examples.DFSReadWriteTest","dbfs:/FileStore/libraries/spark_examples_
2_12_3_1_1.jar","/dbfs/databricks-datasets/README.md","/FileStore/examples/output/"]
IMPORTANT
There are several limitations for spark-submit tasks:
You can run spark-submit tasks only on new clusters.
Spark-submit does not support cluster autoscaling. To learn more about autoscaling, see Cluster
autoscaling.
Spark-submit does not support Databricks Utilities. To use Databricks Utilities, use JAR tasks instead.
Python : In the Path textbox, enter the URI of a Python script on DBFS or cloud storage; for
example, dbfs:/FileStore/myscript.py .
Pipeline : In the Pipeline drop-down, select an existing Delta Live Tables pipeline.
Python Wheel : In the Package name text box, enter the package to import, for example,
myWheel-1.0-py2.py3-none-any.whl . In the Entry Point text box, enter the function to call when
starting the wheel. Click Add under Dependent Libraries to add libraries required to run the
task.
5. Configure the cluster where the task runs. In the Cluster drop-down, select either New Job Cluster or
Existing All-Purpose Clusters .
New Job Cluster : Click Edit in the Cluster drop-down and complete the cluster configuration.
Existing All-Purpose Cluster : Select an existing cluster in the Cluster drop-down. To open the
cluster in a new page, click the icon to the right of the cluster name and description.
To learn more about selecting and configuring clusters to run tasks, see Cluster configuration tips.
6. You can pass parameters for your task. Each task type has different requirements for formatting and
passing the parameters.
Notebook : Click Add and specify the key and value of each parameter to pass to the task. You can
override or add additional parameters when you manually run a task using the Run a job with
different parameters option. Parameters set the value of the notebook widget specified by the key of
the parameter. Use task parameter variables to pass a limited set of dynamic values as part of a
parameter value.
JAR : Use a JSON-formatted array of strings to specify parameters. These strings are passed as
arguments to the main method of the main class. See Configure JAR job parameters.
Spark Submit task: Parameters are specified as a JSON-formatted array of strings. Conforming to
the Apache Spark spark-submit convention, parameters after the JAR path are passed to the main
method of the main class.
Python : Use a JSON-formatted array of strings to specify parameters. These strings are passed as
arguments which can be parsed using the argparse module in Python.
Python Wheel : In the Parameters drop-down, select Positional arguments to enter parameters as
a JSON-formatted array of strings, or select Keyword arguments > Add to enter the key and value
of each parameter. Both positional and keyword arguments are passed to the Python wheel task as
command-line arguments.
7. To access additional options, including Dependent Libraries , Retry Policy , and Timeouts , click
Advanced Options . See Edit a task.
8. Click Create .
9. To optionally set the job’s schedule, click Edit schedule in the Job details panel. See Schedule a job.
10. To optionally allow multiple concurrent runs of the same job, click Edit concurrent runs in the Job
details panel. See Maximum concurrent runs.
11. To optionally specify email addresses to receive notifications on job events, click Edit notifications in
the Job details panel. See Notifications.
12. To optionally control permission levels on the job, click Edit permissions in the Job details panel. See
Control access to jobs.
To add another task, click below the task you just created. A shared cluster option is provided if you have
configured a New Job Cluster for a previous task. You can also configure a cluster for each task when you
create or edit a task. To learn more about selecting and configuring clusters to run tasks, see Cluster
configuration tips.
Run a job
1. Click Workflows in the sidebar.
2. Select a job and click the Runs tab. You can run a job immediately or schedule the job to run later.
If one or more tasks in a job with multiple tasks are not successful, you can re-run the subset of unsuccessful
tasks. See Repair an unsuccessful job run.
Run a job immediately
To run the job immediately, click .
TIP
You can perform a test run of a job with a notebook task by clicking Run Now . If you need to make changes to the
notebook, clicking Run Now again after editing the notebook will automatically run the new version of the notebook.
1. Click next to Run Now and select Run Now with Different Parameters or, in the Active Runs table,
click Run Now with Different Parameters . Enter the new parameters depending on the type of task.
Notebook : You can enter parameters as key-value pairs or a JSON object. The provided parameters
are merged with the default parameters for the triggered run. You can use this dialog to set the values
of widgets.
JAR and spark-submit : You can enter a list of parameters or a JSON document. If you delete keys,
the default parameters are used. You can also add task parameter variables for the run.
2. Click Run .
Repair an unsuccessful job run
You can repair failed or canceled multi-task jobs by running only the subset of unsuccessful tasks and any
dependent tasks. Because successful tasks and any tasks that depend on them are not re-run, this feature
reduces the time and resources required to recover from unsuccessful job runs.
You can change job or task settings before repairing the job run. Unsuccessful tasks are re-run with the current
job and task settings. For example, if you change the path to a notebook or a cluster setting, the task is re-run
with the updated notebook or cluster settings.
You can view the history of all task runs on the Task run details page.
NOTE
If one or more tasks share a job cluster, a repair run creates a new job cluster; for example, if the original run used the
job cluster my_job_cluster , the first repair run uses the new job cluster my_job_cluster_v1 , allowing you to easily
see the cluster and cluster settings used by the initial run and any repair runs. The settings for my_job_cluster_v1
are the same as the current settings for my_job_cluster .
Repair is supported only with jobs that orchestrate two or more tasks.
The Duration value displayed in the Runs tab includes the time the first run started until the time when the latest
repair run finished. For example, if a run failed twice and succeeded on the third run, the duration includes the time for
all three runs.
3. Click Save .
Pause and resume a job schedule
To pause a job, you can either:
Click Pause in the Job details panel.
Click Edit schedule in the Job details panel and set the Schedule Type to Manual (Paused)
To resume a paused job schedule, set the Schedule Type to Scheduled .
View jobs
Click Workflows in the sidebar. The Jobs list appears. The Jobs page lists all defined jobs, the cluster
definition, the schedule, if any, and the result of the last run.
NOTE
If you have the increased jobs limit enabled for this workspace, only 25 jobs are displayed in the Jobs list to improve the
page loading time. Use the left and right arrows to page through the full list of jobs.
You can also click any column header to sort the list of jobs (either descending or ascending) by that column.
When the increased jobs limit feature is enabled, you can sort only by Name , Job ID , or Created by . The
default sorting is by Name in ascending order.
To view job run details, click the link in the Start time column for the run. To view job details, click the job name
in the Job column.
Edit a job
Some configuration options are available on the job, and other options are available on individual tasks. For
example, the maximum concurrent runs can be set on the job only, while parameters must be defined for each
task.
To change the configuration for a job:
Tags also propagate to job clusters created when a job is run, allowing you to use tags with your existing cluster
monitoring.
To add or edit tags, click + Tag in the Job details side panel. You can add the tag as a key and value, or a label.
To add a label, enter the label in the Key field and leave the Value field empty.
Clusters
To see tasks associated with a cluster, hover over the cluster in the side panel. To change the cluster configuration
for all associated tasks, click Configure under the cluster. To configure a new cluster for all associated tasks,
click Swap under the cluster.
Maximum concurrent runs
The maximum number of parallel runs for this job. Azure Databricks skips the run if the job has already reached
its maximum number of active runs when attempting to start a new run. Set this value higher than the default of
1 to perform multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a
frequent schedule and want to allow consecutive runs to overlap with each other, or you want to trigger multiple
runs that differ by their input parameters.
Notifications
You can add one or more email addresses to notify when runs of this job begin, complete, or fail:
1. Click Edit notifications .
2. Click Add .
3. Enter an email address and click the check box for each notification type to send to that address.
4. To enter another email address for notification, click Add .
5. If you do not want to receive notifications for skipped job runs, click the check box.
6. Click Confirm .
Integrate these email notifications with your favorite notification tools, including:
PagerDuty
Slack
Control access to jobs
Job access control enables job owners and administrators to grant fine-grained permissions on their jobs. Job
owners can choose which other users or groups can view the results of the job. Owners can also choose who
can manage their job runs (Run now and Cancel run permissions).
See Jobs access control for details.
Edit a task
To set task configuration options:
NOTE
Depends on is not visible if the job consists of only a single task.
Configuring task dependencies creates a Directed Acyclic Graph (DAG) of task execution, a common way of
representing execution order in job schedulers. For example, consider the following job consisting of four tasks:
Task 1 is the root task and does not depend on any other task.
Task 2 and Task 3 depend on Task 1 completing first.
Finally, Task 4 depends on Task 2 and Task 3 completing successfully.
Azure Databricks runs upstream tasks before running downstream tasks, running as many of them in parallel as
possible. The following diagram illustrates the order of processing for these tasks:
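Expressed in the Jobs API, these dependencies appear in each task’s depends_on field. A sketch of the tasks array (JSON) for this example, using illustrative task keys:
"tasks": [
  { "task_key": "task_1" },
  { "task_key": "task_2", "depends_on": [{ "task_key": "task_1" }] },
  { "task_key": "task_3", "depends_on": [{ "task_key": "task_1" }] },
  { "task_key": "task_4", "depends_on": [{ "task_key": "task_2" }, { "task_key": "task_3" }] }
]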
Individual task configuration options
Individual tasks have the following configuration options:
In this section:
Cluster
Dependent libraries
Task parameter variables
Timeout
Retries
Cluster
To configure the cluster where a task runs, click the Cluster drop-down. You can edit a shared job cluster, but
you cannot delete a shared cluster if it is still used by other tasks.
To learn more about selecting and configuring clusters to run tasks, see Cluster configuration tips.
Dependent libraries
Dependent libraries will be installed on the cluster before the task runs. You must set all task dependencies to
ensure they are installed before the run starts.
To add a dependent library, click Advanced options and select Add Dependent Libraries to open the Add
Dependent Library chooser. Follow the recommendations in Library dependencies for specifying
dependencies.
IMPORTANT
If you have configured a library to install on all clusters automatically, or you select an existing terminated cluster that has
libraries installed, the job execution does not wait for library installation to complete. If a job requires a specific library, you
should attach the library to the job in the Dependent Libraries field.
Task parameter variables
You can pass templated variables into a job task as part of the task’s parameters; they are replaced with the appropriate values when the task runs. For example, the following parameter passes the job ID into the task:
{
  "MyJobID": "my-job-{{job_id}}"
}
The contents of the double curly braces are not evaluated as expressions, so you cannot do operations or
functions within double-curly braces. Whitespace is not stripped inside the curly braces, so {{ job_id }} will
not be evaluated.
The following task parameter variables are supported:
{{task_key}} — The unique name assigned to a task that’s part of a job with multiple tasks.
You can set these variables with any task when you Create a job, Edit a job, or Run a job with different
parameters.
Timeout
The maximum completion time for a job. If the job does not complete in this time, Azure Databricks sets its
status to “Timed Out”.
Retries
A policy that determines when and how many times failed runs are retried. To set the retries for the task, click
Advanced options and select Edit Retry Policy . The retry interval is calculated in milliseconds between the
start of the failed run and the subsequent retry run.
NOTE
If you configure both Timeout and Retries , the timeout applies to each retry.
Clone a job
You can quickly create a new job by cloning an existing job. Cloning a job creates an identical copy of the job,
except for the job ID. On the job’s page, click More … next to the job’s name and select Clone from the drop-
down menu.
Clone a task
You can quickly create a new task by cloning an existing task:
1. On the job’s page, click the Tasks tab.
2. Select the task to clone.
3. Click the task menu and select Clone task .
Delete a job
To delete a job, on the job’s page, click More … next to the job’s name and select Delete from the drop-down
menu.
Delete a task
To delete a task:
1. Click the Tasks tab.
2. Select the task to be deleted.
3. Click the task menu and select Remove task .
Run jobs with notebooks in a remote Git repository
You can run jobs with notebooks located in a remote Git repository. This feature simplifies creation and
management of production jobs and automates continuous deployment:
You don’t need to create a separate production repo in Azure Databricks, manage permissions for it, and
keep it updated.
You can prevent unintentional changes to a production job, such as local edits in the production repo or
changes from switching a branch.
The job definition process has a single source of truth in the remote repository.
To use notebooks in a remote Git repository, you must Set up Git integration with Databricks Repos.
To create a task with a notebook located in a remote Git repository:
1. In the Type drop-down, select Notebook .
2. In the Source drop-down, select Git provider . The Git information dialog appears.
3. In the Git Information dialog, enter details for the repository.
For Path , enter a relative path to the notebook location, such as etl/notebooks/ .
When you enter the relative path, don’t begin it with / or ./ and don’t include the notebook file
extension, such as .py .
Additional notebook tasks in a multitask job reference the same commit in the remote repository in one of the
following ways:
sha of $branch/head when git_branch is set
sha of $tag when git_tag is set
the value of git_commit
In a multitask job, there cannot be a task that uses a local notebook and another task that uses a remote
repository. This restriction doesn’t apply to non-notebook tasks.
Best practices
In this section:
Cluster configuration tips
Notebook job tips
Streaming tasks
JAR jobs
Library dependencies
Cluster configuration tips
Cluster configuration is important when you operationalize a job. The following provides general guidance on
choosing and configuring job clusters, followed by recommendations for specific job types.
Use shared job clusters
To optimize resource usage with jobs that orchestrate multiple tasks, use shared job clusters. A shared job
cluster allows multiple tasks in the same job run to reuse the cluster. You can use a single job cluster to run all
tasks that are part of the job, or multiple job clusters optimized for specific workloads. To use a shared job
cluster:
1. Select New Job Clusters when you create a task and complete the cluster configuration.
2. Select the new cluster when adding a task to the job, or create a new job cluster. Any cluster you configure
when you select New Job Clusters is available to any task in the job.
A shared job cluster is scoped to a single job run, and cannot be used by other jobs or runs of the same job.
Libraries cannot be declared in a shared job cluster configuration. You must add dependent libraries in task
settings.
Choose the correct cluster type for your job
New Job Clusters are dedicated clusters for a job or task run. A shared job cluster is created and started
when the first task using the cluster starts and terminates after the last task using the cluster completes. The
cluster is not terminated when idle but terminates only after all tasks using it have completed. If a shared job
cluster fails or is terminated before all tasks have finished, a new cluster is created. A cluster scoped to a
single task is created and started when the task starts and terminates when the task completes. In
production, Databricks recommends using new shared or task scoped clusters so that each job or task runs
in a fully isolated environment.
When you run a task on a new cluster, the task is treated as a data engineering (task) workload, subject to the
task workload pricing. When you run a task on an existing all-purpose cluster, the task is treated as a data
analytics (all-purpose) workload, subject to all-purpose workload pricing.
If you select a terminated existing cluster and the job owner has Can Restar t permission, Azure Databricks
starts the cluster when the job is scheduled to run.
Existing all-purpose clusters work best for tasks such as updating dashboards at regular intervals.
Use a pool to reduce cluster start times
To decrease new job cluster start time, create a pool and configure the job’s cluster to use the pool.
Notebook job tips
Total notebook cell output (the combined output of all notebook cells) is subject to a 20MB size limit.
Additionally, individual cell output is subject to an 8MB size limit. If total cell output exceeds 20MB in size, or if
the output of an individual cell is larger than 8MB, the run is canceled and marked as failed.
If you need help finding cells near or beyond the limit, run the notebook against an all-purpose cluster and use
this notebook autosave technique.
Streaming tasks
Spark Streaming jobs should never have maximum concurrent runs set to greater than 1. Streaming jobs should
be set to run using the cron expression "* * * * * ?" (every minute).
Since a streaming task runs continuously, it should always be the final task in a job.
JAR jobs
When running a JAR job, keep in mind the following:
Output size limits
NOTE
Available in Databricks Runtime 6.3 and above.
Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger
size, the run is canceled and marked as failed.
To avoid encountering this limit, you can prevent stdout from being returned from the driver to Azure
Databricks by setting the spark.databricks.driver.disableScalaOutput Spark configuration to true . By default,
the flag value is false . The flag controls cell output for Scala JAR jobs and Scala notebooks. If the flag is
enabled, Spark does not return job execution results to the client. The flag does not affect the data that is written
in the cluster’s log files. Setting this flag is recommended only for job clusters for JAR jobs because it will disable
notebook results.
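For example, the flag can be included in a job cluster definition’s Spark configuration. A sketch of the relevant spark_conf entry (other cluster settings omitted):
"new_cluster": {
  "spark_conf": {
    "spark.databricks.driver.disableScalaOutput": "true"
  }
}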
Use the shared SparkContext
Because Azure Databricks is a managed service, some code changes may be necessary to ensure that your
Apache Spark jobs run correctly. JAR job programs must use the shared SparkContext API to get the
SparkContext . Because Azure Databricks initializes the SparkContext , programs that invoke new SparkContext()
will fail. To get the SparkContext , use only the shared SparkContext created by Azure Databricks:
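The code example referenced here is not included above; a minimal Scala sketch that obtains the shared context and session through the standard getOrCreate APIs:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Reuse the SparkContext and SparkSession that Azure Databricks has already created.
val sc = SparkContext.getOrCreate()
val spark = SparkSession.builder().getOrCreate()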
There are also several methods you should avoid when using the shared SparkContext .
Do not call SparkContext.stop() .
Do not call System.exit(0) or sc.stop() at the end of your Main program. This can cause undefined
behavior.
Use try-finally blocks for job clean up
Consider a JAR that consists of two parts:
jobBody() which contains the main part of the job.
jobCleanup() which has to be executed after jobBody() whether that function succeeded or returned an
exception.
As an example, jobBody() may create tables, and you can use jobCleanup() to drop these tables.
The safe way to ensure that the clean up method is called is to put a try-finally block in the code:
try {
jobBody()
} finally {
jobCleanup()
}
You should not try to clean up by calling sys.addShutdownHook(jobCleanup) or by otherwise registering a JVM shutdown hook that runs jobCleanup() .
Due to the way the lifetime of Spark containers is managed in Azure Databricks, the shutdown hooks are not run
reliably.
Configure JAR job parameters
You pass parameters to JAR jobs with a JSON string array. See the spark_jar_task object in the request body
passed to the Create a new job operation ( POST /jobs/create ) in the Jobs API. To access these parameters,
inspect the String array passed into your main function.
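For example, a JAR job’s main class can read these parameters from its args array. A minimal Scala sketch (the object name and the --data parameter are illustrative):
object ExampleJob {
  def main(args: Array[String]): Unit = {
    // args holds the strings from the spark_jar_task parameters array,
    // for example ["--data", "dbfs:/path/to/order-data.json"].
    val dataPath = args(args.indexOf("--data") + 1)
    println(s"Input path: $dataPath")
  }
}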
Library dependencies
The Spark driver has certain library dependencies that cannot be overridden. These libraries take priority over
any of your libraries that conflict with them.
To get the full list of the driver library dependencies, run the following command inside a notebook attached to a
cluster of the same Spark version (or the cluster with the driver you want to examine).
%sh
ls /databricks/jars
In sbt , add Spark and Hadoop as provided dependencies, as shown in the following example:
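A sketch of the kind of sbt settings this describes (the artifact versions shown are illustrative and should match your cluster’s Spark, Hadoop, and Scala versions):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "3.3.1" % "provided"
)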
TIP
Specify the correct Scala version for your dependencies based on the version you are running.
Jobs API updates
7/21/2022 • 10 minutes to read
You can now orchestrate multiple tasks with Azure Databricks jobs. This article details changes to the Jobs API
2.1 that support jobs with multiple tasks and provides guidance to help you update your existing API clients to
work with this new feature.
Databricks recommends Jobs API 2.1 for your API scripts and clients, particularly when using jobs with multiple
tasks.
This article refers to jobs defined with a single task as single-task format and jobs defined with multiple tasks as
multi-task format.
Jobs API 2.0 and 2.1 now support the update request. Use the update request to change an existing job instead
of the reset request to minimize changes between single-task format jobs and multi-task format jobs.
API changes
The Jobs API now defines a TaskSettings object to capture settings for each task in a job. For multi-task format
jobs, the tasks field, an array of TaskSettings data structures, is included in the JobSettings object. Some
fields previously part of JobSettings are now part of the task settings for multi-task format jobs. JobSettings
is also updated to include the format field. The format field indicates the format of the job and is a STRING
value set to SINGLE_TASK or MULTI_TASK .
You need to update your existing API clients for these changes to JobSettings for multi-task format jobs. See the
API client guide for more information on required changes.
Jobs API 2.1 supports the multi-task format. All API 2.1 requests must conform to the multi-task format and
responses are structured in the multi-task format. New features are released for API 2.1 first.
Jobs API 2.0 is updated with an additional field to support multi-task format jobs. Except where noted, the
examples in this document use API 2.0. However, Databricks recommends API 2.1 for new and existing API
scripts and clients.
An example JSON document representing a multi-task format job for API 2.0 and 2.1:
{
"job_id": 53,
"settings": {
"name": "A job with multiple tasks",
"email_notifications": {},
"timeout_seconds": 0,
"max_concurrent_runs": 1,
"tasks": [
{
"task_key": "clean_data",
"description": "Clean and prepare the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/clean-data"
},
"existing_cluster_id": "1201-my-cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
},
{
"task_key": "analyze_data",
"description": "Perform an analysis of the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/analyze-data"
},
"depends_on": [
{
"task_key": "clean_data"
}
],
"existing_cluster_id": "1201-my-cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
}
],
"format": "MULTI_TASK"
},
"created_time": 1625841911296,
"creator_user_name": "user@databricks.com",
"run_as_user_name": "user@databricks.com"
}
Jobs API 2.1 supports configuration of task level clusters or one or more shared job clusters:
A task level cluster is created and started when a task starts and terminates when the task completes.
A shared job cluster allows multiple tasks in the same job to use the cluster. The cluster is created and started
when the first task using the cluster starts and terminates after the last task using the cluster completes. A
shared job cluster is not terminated when idle but terminates only after all tasks using it are complete.
Multiple non-dependent tasks sharing a cluster can start at the same time. If a shared job cluster fails or is
terminated before all tasks have finished, a new cluster is created.
To configure shared job clusters, include a JobCluster array in the JobSettings object. You can specify a
maximum of 100 clusters per job. The following is an example of an API 2.1 response for a job configured with
two shared clusters:
NOTE
If a task has library dependencies, you must configure the libraries in the task field settings; libraries cannot be
configured in a shared job cluster configuration. In the following example, the libraries field in the configuration of the
ingest_orders task demonstrates specification of a library dependency.
{
"job_id": 53,
"settings": {
"name": "A job with multiple tasks",
"email_notifications": {},
"timeout_seconds": 0,
"max_concurrent_runs": 1,
"job_clusters": [
{
"job_cluster_key": "default_cluster",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "i3.xlarge",
"spark_conf": {
"spark.speculation": true
},
"aws_attributes": {
"availability": "SPOT",
"zone_id": "us-west-2a"
},
"autoscale": {
"min_workers": 2,
"max_workers": 8
}
}
},
{
"job_cluster_key": "data_processing_cluster",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "r4.2xlarge",
"spark_conf": {
"spark.speculation": true
},
"aws_attributes": {
"availability": "SPOT",
"zone_id": "us-west-2a"
},
"autoscale": {
"min_workers": 8,
"max_workers": 16
}
}
}
],
"tasks": [
{
"task_key": "ingest_orders",
"description": "Ingest order data",
"depends_on": [ ],
"job_cluster_key": "auto_scaling_cluster",
"spark_jar_task": {
"main_class_name": "com.databricks.OrdersIngest",
"parameters": [
"--data",
"dbfs:/path/to/order-data.json"
]
},
"libraries": [
{
"jar": "dbfs:/mnt/databricks/OrderIngest.jar"
}
],
"timeout_seconds": 86400,
"max_retries": 3,
"min_retry_interval_millis": 2000,
"retry_on_timeout": false
},
{
"task_key": "clean_orders",
"description": "Clean and prepare the order data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/clean-data"
},
"job_cluster_key": "default_cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
},
{
"task_key": "analyze_orders",
"description": "Perform an analysis of the order data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/analyze-data"
},
"depends_on": [
{
"task_key": "clean_data"
}
],
"job_cluster_key": "data_processing_cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
}
],
"format": "MULTI_TASK"
},
"created_time": 1625841911296,
"creator_user_name": "user@databricks.com",
"run_as_user_name": "user@databricks.com"
}
For single-task format jobs, the JobSettings data structure remains unchanged except for the addition of the
format field. No TaskSettings array is included, and the task settings remain defined at the top level of the
JobSettings data structure. You will not need to make changes to your existing API clients to process single-task
format jobs.
An example JSON document representing a single-task format job for API 2.0:
{
"job_id": 27,
"settings": {
"name": "Example notebook",
"existing_cluster_id": "1201-my-cluster",
"libraries": [
{
"jar": "dbfs:/FileStore/jars/spark_examples.jar"
}
],
"email_notifications": {},
"timeout_seconds": 0,
"schedule": {
"quartz_cron_expression": "0 0 0 * * ?",
"timezone_id": "US/Pacific",
"pause_status": "UNPAUSED"
},
"notebook_task": {
"notebook_path": "/notebooks/example-notebook",
"revision_timestamp": 0
},
"max_concurrent_runs": 1,
"format": "SINGLE_TASK"
},
"created_time": 1504128821443,
"creator_user_name": "user@databricks.com"
}
Create
To create a multi-task format job, include an array of task definitions in the tasks field of JobSettings in the request body of the Create a new job operation ( POST /jobs/create ). The following example shows a request body for a job with two notebook tasks.
NOTE
A maximum of 100 tasks can be specified per job.
{
"name": "Multi-task-job",
"max_concurrent_runs": 1,
"tasks": [
{
"task_key": "clean_data",
"description": "Clean and prepare the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/clean-data"
},
"existing_cluster_id": "1201-my-cluster",
"timeout_seconds": 3600,
"max_retries": 3,
"retry_on_timeout": true
},
{
"task_key": "analyze_data",
"description": "Perform an analysis of the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/analyze-data"
},
"depends_on": [
{
"task_key": "clean_data"
}
],
"existing_cluster_id": "1201-my-cluster",
"timeout_seconds": 3600,
"max_retries": 3,
"retry_on_timeout": true
}
]
}
Runs submit
To submit a one-time run of a single-task format job with the Create and trigger a one-time run operation (
POST /runs/submit ) in the Jobs API, you do not need to change existing clients.
To submit a one-time run of a multi-task format job, use the tasks field in JobSettings to specify settings for
each task, including clusters. Clusters must be set at the task level when submitting a multi-task format job
because the runs submit request does not support shared job clusters. See Create for an example JobSettings
specifying multiple tasks.
Update
To update a single-task format job with the Partially update a job operation ( POST /jobs/update ) in the Jobs API,
you do not need to change existing clients.
To update the settings of a multi-task format job, you must use the unique task_key field to identify new task
settings. See Create for an example JobSettings specifying multiple tasks.
Reset
To overwrite the settings of a single-task format job with the Overwrite all settings for a job operation (
POST /jobs/reset ) in the Jobs API, you do not need to change existing clients.
To overwrite the settings of a multi-task format job, specify a JobSettings data structure with an array of
TaskSettings data structures. See Create for an example JobSettings specifying multiple tasks.
Use Update to change individual fields without switching from single-task to multi-task format.
List
For single-task format jobs, no client changes are required to process the response from the List all jobs
operation ( GET /jobs/list ) in the Jobs API.
For multi-task format jobs, most settings are defined at the task level and not the job level. Cluster configuration
may be set at the task or job level. To modify clients to access cluster or task settings for a multi-task format job
returned in the Job structure:
Parse the job_id field for the multi-task format job.
Pass the job_id to the Get a job operation ( GET /jobs/get ) in the Jobs API to retrieve job details. See Get for
an example response from the Get API call for a multi-task format job.
The following example shows a response containing single-task and multi-task format jobs. This example is for
API 2.0:
{
"jobs": [
{
"job_id": 36,
"settings": {
"name": "A job with a single task",
"existing_cluster_id": "1201-my-cluster",
"email_notifications": {},
"timeout_seconds": 0,
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/example-notebook",
"revision_timestamp": 0
},
"max_concurrent_runs": 1,
"format": "SINGLE_TASK"
},
"created_time": 1505427148390,
"creator_user_name": "user@databricks.com"
},
{
"job_id": 53,
"settings": {
"name": "A job with multiple tasks",
"email_notifications": {},
"timeout_seconds": 0,
"max_concurrent_runs": 1,
"format": "MULTI_TASK"
},
"created_time": 1625841911296,
"creator_user_name": "user@databricks.com"
}
]
}
Get
For single-task format jobs, no client changes are required to process the response from the Get a job operation
( GET /jobs/get ) in the Jobs API.
Multi-task format jobs return an array of task data structures containing task settings. If you require access to
task level details, you need to modify your clients to iterate through the tasks array and extract required fields.
The following shows an example response from the Get API call for a multi-task format job. This example is for
API 2.0 and 2.1:
{
"job_id": 53,
"settings": {
"name": "A job with multiple tasks",
"email_notifications": {},
"timeout_seconds": 0,
"max_concurrent_runs": 1,
"tasks": [
{
"task_key": "clean_data",
"description": "Clean and prepare the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/clean-data"
},
"existing_cluster_id": "1201-my-cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
},
{
"task_key": "analyze_data",
"description": "Perform an analysis of the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/analyze-data"
},
"depends_on": [
{
"task_key": "clean_data"
}
],
"existing_cluster_id": "1201-my-cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
}
],
"format": "MULTI_TASK"
},
"created_time": 1625841911296,
"creator_user_name": "user@databricks.com",
"run_as_user_name": "user@databricks.com"
}
Runs get
For single-task format jobs, no client changes are required to process the response from the Get a job run
operation ( GET /jobs/runs/get ) in the Jobs API.
The response for a multi-task format job run contains an array of TaskSettings . To retrieve run results for each
task:
Iterate through each of the tasks.
Parse the run_id for each task.
Call the Get the output for a run operation ( GET /jobs/runs/get-output ) with the run_id to get details on the
run for each task. The following is an example response from this request:
{
"job_id": 53,
"run_id": 759600,
"number_in_job": 7,
"original_attempt_run_id": 759600,
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
},
"cluster_spec": {},
"start_time": 1595943854860,
"setup_duration": 0,
"execution_duration": 0,
"cleanup_duration": 0,
"trigger": "ONE_TIME",
"creator_user_name": "user@databricks.com",
"run_name": "Query logs",
"run_type": "JOB_RUN",
"tasks": [
{
"run_id": 759601,
"task_key": "query-logs",
"description": "Query session logs",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/log-query"
},
"existing_cluster_id": "1201-my-cluster",
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
}
},
{
"run_id": 759602,
"task_key": "validate_output",
"description": "Validate query output",
"depends_on": [
{
"task_key": "query-logs"
}
],
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/validate-query-results"
},
"existing_cluster_id": "1201-my-cluster",
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
}
}
],
"format": "MULTI_TASK"
}
Runs list
For single-task format jobs, no client changes are required to process the response from the List runs for a job operation ( GET /jobs/runs/list ) in the Jobs API. In the following example response for a multi-task format job run, the tasks array is empty; to retrieve task level details, pass the run_id to the Get a job run operation as described above:
{
"runs": [
{
"job_id": 53,
"run_id": 759600,
"number_in_job": 7,
"original_attempt_run_id": 759600,
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
},
"cluster_spec": {},
"start_time": 1595943854860,
"setup_duration": 0,
"execution_duration": 0,
"cleanup_duration": 0,
"trigger": "ONE_TIME",
"creator_user_name": "user@databricks.com",
"run_name": "Query logs",
"run_type": "JOB_RUN",
"tasks": [],
"format": "MULTI_TASK"
}
],
"has_more": false
}
Managing dependencies in data pipelines
7/21/2022 • 8 minutes to read
Developing and deploying a data processing pipeline often requires managing complex dependencies between
tasks. For example, a pipeline might read data from a source, clean the data, transform the cleaned data, and
write the transformed data to a target. You need to test, schedule, and troubleshoot data pipelines when you
operationalize them.
Workflow systems address these challenges by allowing you to define dependencies between tasks, schedule
when pipelines run, and monitor workflows. Databricks recommends jobs with multiple tasks to manage your
workflows without relying on an external system. Azure Databricks jobs provide task orchestration with
standard authentication and access control methods. You can manage jobs using a familiar, user-friendly
interface to create and manage complex workflows. You can define a job containing multiple tasks, where each
task runs code such as a notebook or JAR, and control the execution order of tasks in a job by specifying
dependencies between them. You can configure a job’s tasks to run in sequence or parallel.
Azure Databricks also supports workflow management with Azure Data Factory or Apache Airflow.
Apache Airflow
Apache Airflow is an open source solution for managing and scheduling data pipelines. Airflow represents data
pipelines as directed acyclic graphs (DAGs) of operations. You define a workflow in a Python file and Airflow
manages the scheduling and execution.
The Airflow Azure Databricks integration provides tight integration between the two tools, letting you take
advantage of the optimized Spark engine offered by Azure Databricks together with the scheduling features of
Airflow.
Requirements
The integration between Airflow and Azure Databricks is available in Airflow version 1.9.0 and later. The
examples in this article are tested with Airflow version 2.1.0.
Airflow requires Python 3.6, 3.7, or 3.8. The examples in this article are tested with Python 3.8.
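A sketch of shell commands that carry out these steps (the package names and airflow CLI flags shown are assumptions based on standard pipenv and Airflow 2.x usage):
mkdir airflow && cd airflow
pipenv --python 3.8
pipenv shell
export AIRFLOW_HOME=$(pwd)
pipenv install apache-airflow
pipenv install apache-airflow-providers-databricks
mkdir dags
airflow db init
airflow users create --username admin --firstname <firstname> --lastname <lastname> --role Admin --email <email>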
These commands:
1. Create a directory named airflow and change into that directory.
2. Use pipenv to create and spawn a Python virtual environment. Databricks recommends using a Python
virtual environment to isolate package versions and code dependencies to that environment. This isolation
helps reduce unexpected package version mismatches and code dependency collisions.
3. Initialize an environment variable named AIRFLOW_HOME set to the path of the airflow directory.
4. Install Airflow and the Airflow Databricks provider packages.
5. Create an airflow/dags directory. Airflow uses the dags directory to store DAG definitions.
6. Initialize a SQLite database that Airflow uses to track metadata. In a production Airflow deployment, you
would configure Airflow with a standard database. The SQLite database and default configuration for your
Airflow deployment are initialized in the airflow directory.
7. Create an admin user for Airflow.
To install extras, for example celery and password , run:
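A sketch of such a command, assuming the standard extras syntax:
pipenv install "apache-airflow[celery,password]"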
To start the Airflow web server, open a terminal and run the following command:
airflow webserver
The scheduler is the Airflow component that schedules DAGs. To run it, open a new terminal and run the
following command:
pipenv shell
export AIRFLOW_HOME=$(pwd)
airflow scheduler
1. Go to your Azure Databricks landing page and select Create Blank Notebook or click Create in
the sidebar and select Notebook from the menu. The Create Notebook dialog appears.
2. In the Create Notebook dialog, give your notebook a name, such as Hello Airflow . Set Default
Language to Python . Leave Cluster set to the default value. You will configure the cluster when you
create a task that uses this notebook.
3. Click Create .
4. Copy the following Python code and paste it into the first cell of the notebook (a sketch of this cell appears after this list).
5. Add a new cell below the first cell and copy and paste the following Python code into the new cell:
print("hello {}".format(greeting))
Create a job
2. Click Create Job .
The Tasks tab displays with the create task dialog.
3. Replace Add a name for your job… with your job name.
4. In the Task name field, enter a name for the task, for example, greeting-task .
5. In the Type drop-down, select Notebook .
6. Use the file browser to find the notebook you created, click the notebook name, and click Confirm .
7. Click Add under Parameters . In the Key field, enter greeting . In the Value field, enter Airflow user .
8. Click Create task .
Run the job
To run the job immediately, click Run Now in the upper right corner. You can also run the job by clicking the
Runs tab and clicking Run Now in the Active Runs table.
View run details
1. Click the Runs tab and click View Details in the Active Runs table or the Completed Runs (past 60
days) table.
2. Copy the Job ID value. This value is required to trigger the job from Airflow.
Create an Azure Databricks personal access token
NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.
Airflow connects to Databricks using an Azure Databricks personal access token (PAT). See personal access token
for instructions on creating a PAT.
Configure an Azure Databricks connection
Your Airflow installation contains a default connection for Azure Databricks. To update the connection to connect
to your workspace using the personal access token you created above:
1. In a browser window, open http://localhost:8080/connection/list/.
2. Under Conn ID , locate databricks_default and click the Edit record button.
3. Replace the value in the Host field with the workspace instance name of your Azure Databricks
deployment.
4. In the Extra field, enter the following value:
{"token": "PERSONAL_ACCESS_TOKEN"}
Create the DAG definition in a Python file in the airflow/dags directory, for example airflow/dags/databricks_dag.py :
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.utils.dates import days_ago

# Replace JOB_ID with the Job ID value you copied from the job run details earlier.
JOB_ID = "<job-id>"

default_args = {
    'owner': 'airflow'
}

with DAG('databricks_dag',
    start_date = days_ago(2),
    schedule_interval = None,
    default_args = default_args
) as dag:
    # Trigger the existing Azure Databricks job through the default connection.
    opr_run_now = DatabricksRunNowOperator(
        task_id = 'run_now',
        databricks_conn_id = 'databricks_default',
        job_id = JOB_ID
    )
Delta Lake guide
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID
transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on
top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake on Azure Databricks
allows you to configure Delta Lake based on your workload patterns.
Azure Databricks adds optimized layouts and indexes to Delta Lake for fast interactive queries.
This guide provides an introductory overview, quickstarts, and guidance for using Delta Lake on Azure
Databricks.
Introduction
Delta Lake quickstart
Introductory notebooks
Ingest data into Delta Lake
Table batch reads and writes
Table streaming reads and writes
Table deletes, updates, and merges
Change data feed
Table utility commands
Constraints
Table protocol versioning
Delta column mapping
Unity Catalog
Delta Lake APIs
Concurrency control
Migration guide
Best practices: Delta Lake
Frequently asked questions (FAQ)
Delta Lake resources
Optimizations
Delta table properties reference
Introduction
7/21/2022 • 3 minutes to read
Delta Lake is an open source project that enables building a Lakehouse architecture on top of data lakes. Delta
Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing
on top of existing data lakes.
For a quick overview and benefits of Delta Lake, watch this YouTube video (3 minutes).
Delta Engine optimizations make Delta Lake operations highly performant, supporting a variety of workloads
ranging from large-scale ETL processing to ad-hoc, interactive queries. For information on Delta Engine, see
Optimizations.
Quickstart
The Delta Lake quickstart provides an overview of the basics of working with Delta Lake. The quickstart shows
how to load data into a Delta table, modify the table, read the table, display table history, and optimize the table.
For Azure Databricks notebooks that demonstrate these features, see Introductory notebooks.
To try out Delta Lake, see Sign up for Azure Databricks.
Key tasks
The following list provides links to documentation for common Delta Lake tasks.
Create a Delta table: quick start, as part of batch data tasks
Load and write data into a Delta Lake table:
With COPY INTO
With Auto Loader
With the Create Table UI in Databricks SQL.
With streaming: quick start, as part of streaming
With third-party solutions: with partners, with third-party providers
Update data
Merge data updates and insertions (upserts): quick start, as part of table updates
Append data
Overwrite data
Convert a Parquet table to a Delta table
Read data from a Delta table: quick start, as part of batch data tasks, as part of streaming
Optimize a Delta table: quick start, as part of bin packing, as part of Z-ordering, as part of file size tuning
Create a view on top of a Delta table
Delete data from a Delta table
Display Delta table details
Display the history of a Delta table: quick start, as part of data utilities
Clean up Delta table snapshots (vacuum): quick start, as part of data utilities
Work with Delta table columns:
Work with column constraints
Partition data by columns: quick start, as part of batch data tasks
Use automatically-generated columns
Update columns (add, reorder, replace, rename, change type)
Map columns in Delta tables to columns in related Parquet tables
Track changes to a Delta table (change data feed)
Copy or clone a Delta table
Work with table constraints
Work with Delta table versions:
Query an earlier version of a Delta table (time travel): quick start, as part of batch data tasks
Restore or roll back a Delta table to an earlier version
Work with Delta table reader and writer versions
Work with Delta table metadata:
Read existing metadata
Add your own metadata
Use Delta Lake SQL statements
Use the Delta Lake API reference
Learn about Delta Lake concurrency control (ACID transactions)
Resources
For answers to frequently asked questions, see Frequently asked questions (FAQ).
For reference information on Delta Lake SQL commands, see Delta Lake statements.
For further resources, including blog posts, talks, and examples, see Delta Lake resources.
For deep-dive training on Delta Lake, watch this YouTube video (2 hours, 42 minutes).
Delta Lake quickstart
7/21/2022 • 14 minutes to read
The Delta Lake quickstart provides an overview of the basics of working with Delta Lake. The quickstart shows
how to load data into a Delta table, modify the table, read the table, display table history, and optimize the table.
For a demonstration of some of the features that are described in this article (and many more), watch this
YouTube video (9 minutes).
You can run the example Python, R, Scala, and SQL code in this article from within a notebook attached to an
Azure Databricks cluster. You can also run the SQL code in this article from within a query associated with a SQL
warehouse in Databricks SQL.
For existing Azure Databricks notebooks that demonstrate these features, see Introductory notebooks.
Create a table
To create a Delta table, you can use existing Apache Spark SQL code and change the write format from parquet ,
csv , json , and so on, to delta .
For all file types, you read the files into a DataFrame using the corresponding input format (for example,
parquet , csv , json , and so on) and then write out the data in Delta format. In this code example, the input
files are already in Delta format and are located in Sample datasets (databricks-datasets). This code saves the
data in Delta format in Databricks File System (DBFS) in the location specified by save_path .
Python
# Define the input and output formats and paths and the table name.
read_format = 'delta'
write_format = 'delta'
load_path = '/databricks-datasets/learning-spark-v2/people/people-10m.delta'
save_path = '/tmp/delta/people-10m'
table_name = 'default.people10m'

# Load the data from its source, write it out in Delta format, and register the table.
people = spark.read.format(read_format).load(load_path)
people.write.format(write_format).save(save_path)
spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")
R
library(SparkR)
sparkR.session()
# Define the input and output formats and paths and the table name.
read_format = "delta"
write_format = "delta"
load_path = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"
save_path = "/tmp/delta/people-10m/"
table_name = "default.people10m"
Scala
// Define the input and output formats and paths and the table name.
val read_format = "delta"
val write_format = "delta"
val load_path = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"
val save_path = "/tmp/delta/people-10m"
val table_name = "default.people10m"
SQL
The preceding operations create a new unmanaged table by using the schema that was inferred from the data.
For unmanaged tables, you control the location of the data. Azure Databricks tracks the table’s name and its
location. For information about available options when you create a Delta table, see Create a table and Write to a
table.
If your source files are in Parquet format, you can use the CONVERT TO DELTA statement to convert files in place.
If the corresponding table is unmanaged, the table remains unmanaged after the conversion:
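The conversion statement itself is not shown above; a minimal sketch, assuming Parquet files at a hypothetical path:
CONVERT TO DELTA parquet.`/path/to/parquet-files`
The Python, R, and Scala examples that follow instead create a managed table by reading the sample data and saving it with saveAsTable .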
tableName = 'people10m'
sourceType = 'delta'
loadPath = '/databricks-datasets/learning-spark-v2/people/people-10m.delta'
people = spark \
.read \
.format(sourceType) \
.load(loadPath)
people.write \
.format(sourceType) \
.saveAsTable(tableName)
library(SparkR)
sparkR.session()
tableName = "people10m"
sourceType = "delta"
loadPath = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"
people = read.df(
path = loadPath,
source = sourceType
)
saveAsTable(
df = people,
source = sourceType,
tableName = tableName
)
Scala
people.write
.format(sourceType)
.saveAsTable(tableName)
SQL
CREATE TABLE people10m USING DELTA AS
SELECT * FROM delta.`/databricks-datasets/learning-spark-v2/people/people-10m.delta`;
If your source files are in a format that Delta Lake supports, you can use the following shorthand to read from a
file directly, by using the <format>.`<path-to-files>` location specifier (for example, delta.`/databricks-datasets/learning-spark-v2/people/people-10m.delta` ):
Python
library(SparkR)
sparkR.session()
Scala
SQL
For managed tables, Azure Databricks determines the location for the data. To get the location, you can use the
DESCRIBE DETAIL statement, for example:
Python
Scala
display(spark.sql("DESCRIBE DETAIL people10m"))
SQL
# Define the input and output formats and paths and the table name.
read_format = 'delta'
write_format = 'delta'
load_path = '/databricks-datasets/learning-spark-v2/people/people-10m.delta'
partition_by = 'gender'
save_path = '/tmp/delta/people-10m'
table_name = 'default.people10m'
If you already ran the Python code example in Create a table, you must first delete the existing table and the
saved data:
R
library(SparkR)
sparkR.session()
# Define the input and output formats and paths and the table name.
read_format = "delta"
write_format = "delta"
load_path = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"
partition_by = "gender"
save_path = "/tmp/delta/people-10m/"
table_name = "default.people10m"
If you already ran the R code example in Create a table, you must first delete the existing table and the saved
data:
library(SparkR)
sparkR.session()
Scala
// Define the input and output formats and paths and the table name.
val read_format = "delta"
val write_format = "delta"
val load_path = "/databricks-datasets/learning-spark-v2/people/people-10m.delta"
val partition_by = "gender"
val save_path = "/tmp/delta/people-10m"
val table_name = "default.people10m"
If you already ran the Scala code example in Create a table, you must first delete the existing table and the saved
data:
// Define the table name and the output path.
val table_name = "default.people10m"
val save_path = "/tmp/delta/people-10m"
SQL
To partition data when you create a Delta table using SQL, specify the PARTITIONED BY columns.
If you already ran the SQL code example in Create a table, you must first delete the existing table:
Modify a table
Delta Lake supports a rich set of operations to modify tables.
Stream writes to a table
You can write data into a Delta table using Structured Streaming. The Delta Lake transaction log guarantees
exactly-once processing, even when there are other streams or batch queries running concurrently against the
table. By default, streams run in append mode, which adds new records to the table.
The following code example starts Structured Streaming. It monitors the DBFS location specified in
json_read_path , scanning for JSON files that are uploaded to this location. As Structured Streaming notices a file
upload, it attempts to write the data to the DBFS location specified in save_path by using the schema specified
in read_schema . Structured Streaming continues monitoring for uploaded files until the code is stopped.
Structured Streaming uses the DBFS location specified in checkpoint_path to help ensure that uploaded files are
evaluated only once.
Python
# Define the schema and the input, checkpoint, and output paths.
read_schema = ("id int, " +
"firstName string, " +
"middleName string, " +
"lastName string, " +
"gender string, " +
"birthDate timestamp, " +
"ssn string, " +
"salary int")
json_read_path = '/FileStore/streaming-uploads/people-10m'
checkpoint_path = '/tmp/delta/people-10m/checkpoints'
save_path = '/tmp/delta/people-10m'
people_stream = (spark
.readStream
.schema(read_schema)
.option('maxFilesPerTrigger', 1)
.option('multiline', True)
.format("json")
.load(json_read_path)
)
(people_stream.writeStream
.format('delta')
.outputMode('append')
.option('checkpointLocation', checkpoint_path)
.start(save_path)
)
library(SparkR)
sparkR.session()
# Define the schema and the input, checkpoint, and output paths.
read_schema = "id int, firstName string, middleName string, lastName string, gender string, birthDate
timestamp, ssn string, salary int"
json_read_path = "/FileStore/streaming-uploads/people-10m"
checkpoint_path = "/tmp/delta/people-10m/checkpoints"
save_path = "/tmp/delta/people-10m"
people_stream = read.stream(
"json",
path = json_read_path,
schema = read_schema,
multiline = TRUE,
maxFilesPerTrigger = 1
)
write.stream(
people_stream,
path = save_path,
mode = "append",
checkpointLocation = checkpoint_path
)
Scala
// Define the schema and the input, checkpoint, and output paths.
val read_schema = ("id int, " +
"firstName string, " +
"middleName string, " +
"lastName string, " +
"gender string, " +
"birthDate timestamp, " +
"ssn string, " +
"salary int")
val json_read_path = "/FileStore/streaming-uploads/people-10m"
val checkpoint_path = "/tmp/delta/people-10m/checkpoints"
val save_path = "/tmp/delta/people-10m"
val people_stream = (spark
  .readStream
  .schema(read_schema)
  .option("maxFilesPerTrigger", 1)
  .option("multiline", true)
  .format("json")
  .load(json_read_path))
people_stream.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", checkpoint_path)
.start(save_path)
To test this behavior, here is a conforming JSON file that you can upload to the location specified in
json_read_path , and then query the location in save_path to see the data written by Structured Streaming.
[
{
"id": 10000021,
"firstName": "Joe",
"middleName": "Alexander",
"lastName": "Smith",
"gender": "M",
"birthDate": 188712000,
"ssn": "123-45-6789",
"salary": 50000
},
{
"id": 10000022,
"firstName": "Mary",
"middleName": "Jane",
"lastName": "Doe",
"gender": "F",
"birthDate": "1968-10-27T04:00:00.000+000",
"ssn": "234-56-7890",
"salary": 75500
}
]
For more information about Delta Lake integration with Structured Streaming, see Table streaming reads and
writes and Production considerations for Structured Streaming applications on Azure Databricks. See also the
Structured Streaming Programming Guide on the Apache Spark website.
Batch upserts
To merge a set of updates and insertions into an existing Delta table, you use the MERGE INTO statement. For
example, the following statement takes data from the source table and merges it into the target Delta table.
When there is a matching row in both tables, Delta Lake updates the data column using the given expression.
When there is no matching row, Delta Lake adds a new row. This operation is known as an upsert.
MERGE INTO default.people10m
USING default.people10m_upload
ON default.people10m.id = default.people10m_upload.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
If you specify * , this updates or inserts all columns in the target table. This assumes that the source table has
the same columns as those in the target table, otherwise the query will throw an analysis error.
You must specify a value for every column in your table when you perform an INSERT operation (for example,
when there is no matching row in the existing dataset). However, you do not need to update all values.
To test the preceding example, create the source table as follows:
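The CREATE TABLE statement is not shown above; a sketch of a source table with the same columns as people10m (the column names and types are assumed from the schema used earlier in this quickstart):
CREATE TABLE default.people10m_upload (
  id INT,
  firstName STRING,
  middleName STRING,
  lastName STRING,
  gender STRING,
  birthDate TIMESTAMP,
  ssn STRING,
  salary INT
) USING DELTA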
To test the WHEN MATCHED clause, fill the source table with the following rows, and then run the preceding
MERGE INTO statement. Because both tables have rows that match the ON clause, the target table’s matching
rows are updated.
SELECT * FROM default.people10m WHERE id BETWEEN 9999998 AND 10000000 SORT BY id ASC
To test the WHEN NOT MATCHED clause, fill the source table with the following rows, and then run the preceding
MERGE INTO statement. Because the target table does not have the following rows, these rows are added to the
target table.
SELECT * FROM default.people10m WHERE id BETWEEN 20000001 AND 20000003 SORT BY id ASC
To run any of the preceding SQL statements in Python, R, or Scala, pass the statement as a string argument to
the spark.sql function in Python or Scala or the sql function in R.
Read a table
In this section:
Display table history
Query an earlier version of the table (time travel)
You access data in Delta tables either by specifying the path on DBFS ( "/tmp/delta/people-10m" ) or the table
name ( "default.people10m" ):
Python
people = spark.read.format('delta').load('/tmp/delta/people-10m')
display(people)
or
people = spark.table('default.people10m')
display(people)
library(SparkR)
sparkR.session()
people = read.df("/tmp/delta/people-10m", source = "delta")
display(people)
or
library(SparkR)
sparkR.session()
people = tableToDF("default.people10m")
display(people)
Scala
val people = spark.read.format("delta").load("/tmp/delta/people-10m")
display(people)
or
val people = spark.table("default.people10m")
display(people)
SQL
or
library(SparkR)
sparkR.session()
or
library(SparkR)
sparkR.session()
Scala
or
SQL
or
NOTE
Because version 1 is at timestamp '2019-01-29 00:38:10' , to query version 0 you can use any timestamp in the range
'2019-01-29 00:37:58' to '2019-01-29 00:38:09' inclusive.
DataFrameReader options allow you to create a DataFrame from a Delta table that is fixed to a specific version
of the table, for example in Python:
# For example, pin the DataFrame to a timestamp or version; the values here correspond to the note above.
df1 = spark.read.format('delta').option('timestampAsOf', '2019-01-29 00:38:10').load('/tmp/delta/people-10m')
display(df1)
or
df2 = spark.read.format('delta').option('versionAsOf', 1).load('/tmp/delta/people-10m')
display(df2)
Optimize a table
Once you have performed multiple changes to a table, you might have a lot of small files. To improve the speed
of read queries, you can use OPTIMIZE to collapse small files into larger ones:
Python
spark.sql("OPTIMIZE delta.`/tmp/delta/people-10m`")
or
spark.sql('OPTIMIZE default.people10m')
library(SparkR)
sparkR.session()
sql("OPTIMIZE delta.`/tmp/delta/people-10m`")
or
library(SparkR)
sparkR.session()
sql("OPTIMIZE default.people10m")
Scala
spark.sql("OPTIMIZE delta.`/tmp/delta/people-10m`")
or
spark.sql("OPTIMIZE default.people10m")
SQL
OPTIMIZE delta.`/tmp/delta/people-10m`
or
OPTIMIZE default.people10m
Z-order by columns
To improve read performance further, you can co-locate related information in the same set of files by Z-
Ordering. This co-locality is automatically used by Delta Lake data-skipping algorithms to dramatically reduce
the amount of data that needs to be read. To Z-Order data, you specify the columns to order on in the
ZORDER BY clause. For example, to co-locate by gender , run:
Python
or
library(SparkR)
sparkR.session()
or
library(SparkR)
sparkR.session()
Scala
or
SQL
OPTIMIZE delta.`/tmp/delta/people-10m`
ZORDER BY (gender)
or
OPTIMIZE default.people10m
ZORDER BY (gender)
For the full set of options available when running OPTIMIZE , see Compaction (bin-packing).
Clean up snapshots
Delta Lake provides snapshot isolation for reads, which means that it is safe to run OPTIMIZE even while other
users or jobs are querying the table. Eventually however, you should clean up old snapshots. You can do this by
running the VACUUM command:
Python
spark.sql('VACUUM default.people10m')
library(SparkR)
sparkR.session()
sql("VACUUM default.people10m")
Scala
spark.sql("VACUUM default.people10m")
SQL
VACUUM default.people10m
You control the age of the latest retained snapshot by using the RETAIN <N> HOURS option:
Python
library(SparkR)
sparkR.session()
Scala
For details on using VACUUM effectively, see Remove files no longer referenced by a Delta table.
Introductory notebooks
7/21/2022 • 2 minutes to read
These notebooks show how to load and save data in Delta Lake format, create a Delta table, optimize the
resulting table, and finally use Delta Lake metadata commands to show the table history, format, and details.
To try out Delta Lake, see Quickstart: Run a Spark job on Azure Databricks using the Azure portal.
Ingest data into Delta Lake
Azure Databricks offers a variety of ways to help you ingest data into Delta Lake.
Partner integrations
Databricks partner integrations enable you to easily load data into Azure Databricks. These integrations enable
low-code, easy-to-implement, and scalable data ingestion from a variety of sources into Azure Databricks. See
the Databricks integrations.
COPY INTO
The following example shows how to create a Delta table and then use the COPY INTO SQL command to load
sample data from Sample datasets (databricks-datasets) into the table. You can run the example Python, R, Scala,
or SQL code from within a notebook attached to an Azure Databricks cluster. You can also run the SQL code
from within a query associated with a SQL warehouse in Databricks SQL.
Python
table_name = 'default.loan_risks_upload'
source_data = '/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet'
source_format = 'PARQUET'
display(loan_risks_upload_data)
'''
Result:
+---------+-------------+-----------+------------+
| loan_id | funded_amnt | paid_amnt | addr_state |
+=========+=============+===========+============+
| 0 | 1000 | 182.22 | CA |
+---------+-------------+-----------+------------+
| 1 | 1000 | 361.19 | WA |
+---------+-------------+-----------+------------+
| 2 | 1000 | 176.26 | TX |
+---------+-------------+-----------+------------+
...
'''
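The statements that create the table and load the data are not shown in the Python example above. A minimal sketch, assuming the variables defined above; the column types are assumptions based on the sample output:
# Drop any existing copy of the table, create it, and load the sample data with COPY INTO.
spark.sql("DROP TABLE IF EXISTS " + table_name)
spark.sql("CREATE TABLE " + table_name + " (loan_id BIGINT, funded_amnt INT, paid_amnt DOUBLE, addr_state STRING)")
spark.sql("COPY INTO " + table_name
  + " FROM '" + source_data + "'"
  + " FILEFORMAT = " + source_format)
loan_risks_upload_data = spark.sql("SELECT * FROM " + table_name)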
R
library(SparkR)
sparkR.session()
table_name = "default.loan_risks_upload"
source_data = "/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet"
source_format = "PARQUET"
loan_risks_upload_data = tableToDF(table_name)
display(loan_risks_upload_data)
# Result:
# +---------+-------------+-----------+------------+
# | loan_id | funded_amnt | paid_amnt | addr_state |
# +=========+=============+===========+============+
# | 0 | 1000 | 182.22 | CA |
# +---------+-------------+-----------+------------+
# | 1 | 1000 | 361.19 | WA |
# +---------+-------------+-----------+------------+
# | 2 | 1000 | 176.26 | TX |
# +---------+-------------+-----------+------------+
# ...
Scala
val table_name = "default.loan_risks_upload"
val source_data = "/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet"
val source_format = "PARQUET"
display(loan_risks_upload_data)
/*
Result:
+---------+-------------+-----------+------------+
| loan_id | funded_amnt | paid_amnt | addr_state |
+=========+=============+===========+============+
| 0 | 1000 | 182.22 | CA |
+---------+-------------+-----------+------------+
| 1 | 1000 | 361.19 | WA |
+---------+-------------+-----------+------------+
| 2 | 1000 | 176.26 | TX |
+---------+-------------+-----------+------------+
...
*/
SQL
-- Result:
-- +---------+-------------+-----------+------------+
-- | loan_id | funded_amnt | paid_amnt | addr_state |
-- +=========+=============+===========+============+
-- | 0 | 1000 | 182.22 | CA |
-- +---------+-------------+-----------+------------+
-- | 1 | 1000 | 361.19 | WA |
-- +---------+-------------+-----------+------------+
-- | 2 | 1000 | 176.26 | TX |
-- +---------+-------------+-----------+------------+
-- ...
To clean up, run the following code, which deletes the table:
Python
Scala
SQL
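The cleanup statements themselves are not shown above; the SQL form would be a simple drop of the table created earlier:
DROP TABLE default.loan_risks_upload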
Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any
additional setup. Auto Loader provides a new Structured Streaming source called cloudFiles . Given an input
directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive,
with the option of also processing existing files in that directory.
Use Auto Loader instead of the COPY INTO SQL command when:
You want to load data from a file location that contains files in the order of millions or higher. Auto Loader
can discover files more efficiently than the COPY INTO SQL command and can split file processing into
multiple batches.
Your data schema evolves frequently. Auto Loader provides better support for schema inference and
evolution. See Configuring schema inference and evolution in Auto Loader.
You do not plan to load subsets of previously uploaded files. With Auto Loader, it can be more difficult to
reprocess subsets of files. However, you can use the COPY INTO SQL command to reload subsets of files
while an Auto Loader stream is simultaneously running.
For a brief overview and demonstration of Auto Loader, as well as the COPY INTO SQL command earlier in this
article, watch this YouTube video (2 minutes).
For a longer overview and demonstration of Auto Loader, watch this YouTube video (59 minutes).
The following code example demonstrates how Auto Loader detects new data files as they arrive in cloud
storage. You can run the example code from within a notebook attached to an Azure Databricks cluster.
1. Create the file upload directory, for example:
Python
user_dir = '<my-name>@<my-organization.com>'
upload_path = "/FileStore/shared-uploads/" + user_dir + "/population_data_upload"
dbutils.fs.mkdirs(upload_path)
Scala
val user_dir = "<my-name>@<my-organization.com>"
val upload_path = "/FileStore/shared-uploads/" + user_dir + "/population_data_upload"
dbutils.fs.mkdirs(upload_path)
2. Create the following sample CSV files, and then upload them to the file upload directory by using the
DBFS file browser:
WA.csv :
city,year,population
Seattle metro,2019,3406000
Seattle metro,2020,3433000
OR.csv :
city,year,population
Portland metro,2019,2127000
Portland metro,2020,2151000
checkpoint_path = '/tmp/delta/population_data/_checkpoints'
write_path = '/tmp/delta/population_data'
Scala
val checkpoint_path = "/tmp/delta/population_data/_checkpoints"
val write_path = "/tmp/delta/population_data"
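The streaming code for step 3 is not included above. A minimal Python sketch of an Auto Loader stream using the cloudFiles source, assuming the upload, checkpoint, and write paths defined in the earlier steps and a simple schema matching the sample CSV files:
# Use Auto Loader (cloudFiles) to pick up new CSV files from upload_path
# and append them to a Delta table at write_path.
(spark.readStream
  .format('cloudFiles')
  .option('cloudFiles.format', 'csv')
  .option('header', 'true')
  .schema('city string, year int, population long')
  .load(upload_path)
  .writeStream
  .option('checkpointLocation', checkpoint_path)
  .start(write_path))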
4. With the code from step 3 still running, run the following code to query the data in the write directory:
Python
df_population = spark.read.format('delta').load(write_path)
display(df_population)
'''
Result:
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
'''
Scala
display(df_population)
/* Result:
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
*/
5. With the code from step 3 still running, create the following additional CSV files, and then upload them to
the upload directory by using the DBFS file browser:
ID.csv :
city,year,population
Boise,2019,438000
Boise,2020,447000
MT.csv :
city,year,population
Helena,2019,81653
Helena,2020,82590
Misc.csv :
city,year,population
Seattle metro,2021,3461000
Portland metro,2021,2174000
Boise,2021,455000
Helena,2021,81653
6. With the code from step 3 still running, run the following code to query the existing data in the write
directory, in addition to the new data from the files that Auto Loader has detected in the upload directory
and then written to the write directory:
Python
df_population = spark.read.format('delta').load(write_path)
display(df_population)
'''
Result:
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Helena | 2019 | 81653 |
+----------------+------+------------+
| Helena | 2020 | 82590 |
+----------------+------+------------+
| Boise | 2019 | 438000 |
+----------------+------+------------+
| Boise | 2020 | 447000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
| Seattle metro | 2021 | 3461000 |
+----------------+------+------------+
| Portland metro | 2021 | 2174000 |
+----------------+------+------------+
| Boise | 2021 | 455000 |
+----------------+------+------------+
| Helena | 2021 | 81653 |
+----------------+------+------------+
'''
Scala
val df_population = spark.read.format("delta").load(write_path)
display(df_population)
/* Result
+----------------+------+------------+
| city | year | population |
+================+======+============+
| Seattle metro | 2019 | 3406000 |
+----------------+------+------------+
| Seattle metro | 2020 | 3433000 |
+----------------+------+------------+
| Helena | 2019 | 81653 |
+----------------+------+------------+
| Helena | 2020 | 82590 |
+----------------+------+------------+
| Boise | 2019 | 438000 |
+----------------+------+------------+
| Boise | 2020 | 447000 |
+----------------+------+------------+
| Portland metro | 2019 | 2127000 |
+----------------+------+------------+
| Portland metro | 2020 | 2151000 |
+----------------+------+------------+
| Seattle metro | 2021 | 3461000 |
+----------------+------+------------+
| Portland metro | 2021 | 2174000 |
+----------------+------+------------+
| Boise | 2021 | 455000 |
+----------------+------+------------+
| Helena | 2021 | 81653 |
+----------------+------+------------+
*/
7. To clean up, cancel the running code in step 3, and then run the following code, which deletes the upload,
checkpoint, and write directories:
Python
dbutils.fs.rm(write_path, True)
dbutils.fs.rm(upload_path, True)
Scala
dbutils.fs.rm(write_path, true)
dbutils.fs.rm(upload_path, true)
Delta Lake supports most of the options provided by Apache Spark DataFrame read and write APIs for
performing batch reads and writes on tables.
For information on Delta Lake SQL commands, see
Databricks Runtime 7.x and above: Delta Lake statements
Databricks Runtime 5.5 LTS and 6.x: SQL reference for Databricks Runtime 5.5 LTS and 6.x
Create a table
Delta Lake supports creating two types of tables—tables defined in the metastore and tables defined by path.
You can create tables in the following ways.
SQL DDL commands : You can use standard SQL DDL commands supported in Apache Spark (for
example, CREATE TABLE and REPLACE TABLE ) to create Delta tables.
NOTE
In Databricks Runtime 8.0 and above, Delta Lake is the default format and you don’t need USING DELTA .
In Databricks Runtime 7.0 and above, SQL also supports creating a table at a path without creating an
entry in the Hive metastore.
-- Create or replace table with path
CREATE OR REPLACE TABLE delta.`/tmp/delta/people10m` (
id INT,
firstName STRING,
middleName STRING,
lastName STRING,
gender STRING,
birthDate TIMESTAMP,
ssn STRING,
salary INT
) USING DELTA
DataFrameWriter API : If you want to simultaneously create a table and insert data into it from Spark
DataFrames or Datasets, you can use the Spark DataFrameWriter (Scala or Java and Python).
Python
# Create table in the metastore using DataFrame's schema and write data to it
df.write.format("delta").saveAsTable("default.people10m")
# Create or replace table with path using DataFrame's schema and write/overwrite data to it
df.write.format("delta").mode("overwrite").save("/tmp/delta/people10m")
Scala
// Create table in the metastore using DataFrame's schema and write data to it
df.write.format("delta").saveAsTable("default.people10m")
// Create table with path using DataFrame's schema and write data to it
df.write.format("delta").mode("overwrite").save("/tmp/delta/people10m")
In Databricks Runtime 8.0 and above, Delta Lake is the default format and you don’t need to specify
USING DELTA , format("delta") , or using("delta") .
In Databricks Runtime 7.0 and above, you can also create Delta tables using the Spark
DataFrameWriterV2 API.
DeltaTableBuilder API : You can also use the DeltaTableBuilder API in Delta Lake to create tables.
Compared to the DataFrameWriter APIs, this API makes it easier to specify additional information like
column comments, table properties, and generated columns.
IMPORTANT
This feature is in Public Preview.
NOTE
This feature is available on Databricks Runtime 8.3 and above.
Python
# Create table in the metastore
DeltaTable.createIfNotExists(spark) \
.tableName("default.people10m") \
.addColumn("id", "INT") \
.addColumn("firstName", "STRING") \
.addColumn("middleName", "STRING") \
.addColumn("lastName", "STRING", comment = "surname") \
.addColumn("gender", "STRING") \
.addColumn("birthDate", "TIMESTAMP") \
.addColumn("ssn", "STRING") \
.addColumn("salary", "INT") \
.execute()
Scala
Partition data
You can partition data to speed up queries or DML that have predicates involving the partition columns. To partition data when you create a Delta table, specify the partition columns. The following examples partition by gender.
Python
df.write.format("delta").partitionBy("gender").saveAsTable("default.people10m")
DeltaTable.create(spark) \
.tableName("default.people10m") \
.addColumn("id", "INT") \
.addColumn("firstName", "STRING") \
.addColumn("middleName", "STRING") \
.addColumn("lastName", "STRING", comment = "surname") \
.addColumn("gender", "STRING") \
.addColumn("birthDate", "TIMESTAMP") \
.addColumn("ssn", "STRING") \
.addColumn("salary", "INT") \
.partitionedBy("gender") \
.execute()
Scala
df.write.format("delta").partitionBy("gender").saveAsTable("default.people10m")
DeltaTable.createOrReplace(spark)
.tableName("default.people10m")
.addColumn("id", "INT")
.addColumn("firstName", "STRING")
.addColumn("middleName", "STRING")
.addColumn(
DeltaTable.columnBuilder("lastName")
.dataType("STRING")
.comment("surname")
.build())
.addColumn("lastName", "STRING", comment = "surname")
.addColumn("gender", "STRING")
.addColumn("birthDate", "TIMESTAMP")
.addColumn("ssn", "STRING")
.addColumn("salary", "INT")
.partitionedBy("gender")
.execute()
To determine whether a table contains a specific partition, use the statement
SELECT COUNT(*) > 0 FROM <table-name> WHERE <partition-column> = <value> . If the partition exists, true is
returned. For example:
SQL
SELECT COUNT(*) > 0 AS `Partition exists` FROM default.people10m WHERE gender = "M"
Python
display(spark.sql("SELECT COUNT(*) > 0 AS `Partition exists` FROM default.people10m WHERE gender = 'M'"))
Scala
display(spark.sql("SELECT COUNT(*) > 0 AS `Partition exists` FROM default.people10m WHERE gender = 'M'"))
If you create a Delta table in the metastore using a location that already contains data stored with Delta Lake,
the table in the metastore automatically inherits the schema, partitioning, and table properties of the
existing data. This functionality can be used to “import” data into the metastore.
If you specify any configuration (schema, partitioning, or table properties), Delta Lake verifies that the
specification exactly matches the configuration of the existing data.
IMPORTANT
If the specified configuration does not exactly match the configuration of the data, Delta Lake throws an exception
that describes the discrepancy.
NOTE
The metastore is not the source of truth about the latest information of a Delta table. In fact, the table definition in the
metastore may not contain all the metadata like schema and properties. It contains the location of the table, and the
table’s transaction log at the location is the source of truth. If you query the metastore from a system that is not aware of
this Delta-specific customization, you may see incomplete or stale table information.
IMPORTANT
This feature is in Public Preview.
NOTE
This feature is available on Databricks Runtime 8.3 and above.
Delta Lake supports generated columns which are a special type of columns whose values are automatically
generated based on a user-specified function over other columns in the Delta table. When you write to a table
with generated columns and you do not explicitly provide values for them, Delta Lake automatically computes
the values. For example, you can automatically generate a date column (for partitioning the table by date) from
the timestamp column; any writes into the table need only specify the data for the timestamp column. However,
if you explicitly provide values for them, the values must satisfy the constraint
(<value> <=> <generation expression>) IS TRUE or the write will fail with an error.
IMPORTANT
Tables created with generated columns have a higher table writer protocol version than the default. See Table protocol
versioning to understand table protocol versioning and what it means to have a higher version of a table protocol
version.
The following example shows how to create a table with generated columns:
SQL
Python
DeltaTable.create(spark) \
.tableName("default.people10m") \
.addColumn("id", "INT") \
.addColumn("firstName", "STRING") \
.addColumn("middleName", "STRING") \
.addColumn("lastName", "STRING", comment = "surname") \
.addColumn("gender", "STRING") \
.addColumn("birthDate", "TIMESTAMP") \
.addColumn("dateOfBirth", DateType(), generatedAlwaysAs="CAST(birthDate AS DATE)") \
.addColumn("ssn", "STRING") \
.addColumn("salary", "INT") \
.partitionedBy("gender") \
.execute()
Scala
DeltaTable.create(spark)
.tableName("default.people10m")
.addColumn("id", "INT")
.addColumn("firstName", "STRING")
.addColumn("middleName", "STRING")
.addColumn(
DeltaTable.columnBuilder("lastName")
.dataType("STRING")
.comment("surname")
.build())
.addColumn("lastName", "STRING", comment = "surname")
.addColumn("gender", "STRING")
.addColumn("birthDate", "TIMESTAMP")
.addColumn(
DeltaTable.columnBuilder("dateOfBirth")
.dataType(DateType)
.generatedAlwaysAs("CAST(dateOfBirth AS DATE)")
.build())
.addColumn("ssn", "STRING")
.addColumn("salary", "INT")
.partitionedBy("gender")
.execute()
Generated columns are stored as if they were normal columns. That is, they occupy storage.
The following restrictions apply to generated columns:
A generation expression can use any SQL functions in Spark that always return the same result when
given the same argument values, except the following types of functions:
User-defined functions.
Aggregate functions.
Window functions.
Functions returning multiple rows.
For Databricks Runtime 9.1 and above, MERGE operations support generated columns when you set
spark.databricks.delta.schema.autoMerge.enabled to true.
In Databricks Runtime 8.4 and above with Photon support, Delta Lake may be able to generate partition filters
for a query whenever a partition column is defined by one of the following expressions:
CAST(col AS DATE) and the type of col is TIMESTAMP .
YEAR(col) and the type of col is TIMESTAMP .
Two partition columns defined by YEAR(col), MONTH(col) and the type of col is TIMESTAMP .
Three partition columns defined by YEAR(col), MONTH(col), DAY(col) and the type of col is TIMESTAMP .
Four partition columns defined by YEAR(col), MONTH(col), DAY(col), HOUR(col) and the type of col is
TIMESTAMP .
SUBSTRING(col, pos, len) and the type of col is STRING
DATE_FORMAT(col, format) and the type of col is TIMESTAMP .
If a partition column is defined by one of the preceding expressions, and a query filters data using the
underlying base column of a generation expression, Delta Lake looks at the relationship between the base
column and the generated column, and populates partition filters based on the generated partition column if
possible. For example, given the following table:
CREATE TABLE events(
eventId BIGINT,
data STRING,
eventType STRING,
eventTime TIMESTAMP,
eventDate date GENERATED ALWAYS AS (CAST(eventTime AS DATE))
)
USING DELTA
PARTITIONED BY (eventType, eventDate)
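For illustration, consider a query like the following sketch, which filters only on the base column eventTime (the filter values are arbitrary):
Python
df = spark.read.table("events").where(
  "eventTime >= '2020-10-01 00:00:00' AND eventTime <= '2020-10-01 12:00:00'")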
Delta Lake automatically generates a partition filter so that the preceding query only reads the data in partition
date=2020-10-01 even if a partition filter is not specified.
Similarly, for a table whose partition columns are generated with YEAR(col) , MONTH(col) , and DAY(col) , a query that filters on the base TIMESTAMP column only reads the data in the matching partition, such as
year=2020/month=10/day=01 , even if a partition filter is not specified.
You can use an EXPLAIN clause and check the provided plan to see whether Delta Lake automatically generates
any partition filters.
Use special characters in column names
By default, special characters such as spaces and any of the characters ,;{}()\n\t= are not supported in table
column names. To include these special characters in a table’s column name, enable column mapping.
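A sketch of enabling column mapping on an existing table by setting the relevant table properties (the table name is illustrative):
Python
spark.sql("""
  ALTER TABLE default.people10m SET TBLPROPERTIES (
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5',
    'delta.columnMapping.mode' = 'name'
  )
""")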
Read a table
You can load a Delta table as a DataFrame by specifying a table name or a path:
SQL
Scala
import io.delta.implicits._
spark.read.delta("/tmp/delta/people10m")
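A Python equivalent, shown here as a sketch using the same example table and path, is:
Python
people_df = spark.read.table("default.people10m")                     # by table name
people_df = spark.read.format("delta").load("/tmp/delta/people10m")  # by path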
The DataFrame returned automatically reads the most recent snapshot of the table for any query; you never
need to run REFRESH TABLE . Delta Lake automatically uses partitioning and statistics to read the minimum
amount of data when there are applicable predicates in the query.
SQL AS OF syntax
where
timestamp_expression can be any one of:
'2018-10-18T22:15:12.013Z' , that is, a string that can be cast to a timestamp
cast('2018-10-18 13:36:32 CEST' as timestamp)
'2018-10-18' , that is, a date string
In Databricks Runtime 6.6 and above:
current_timestamp() - interval 12 hours
date_sub(current_date(), 1)
Any other expression that is or can be cast to a timestamp
version is a long value that can be obtained from the output of DESCRIBE HISTORY table_spec .
DataFrameReader options
DataFrameReader options allow you to create a DataFrame from a Delta table that is fixed to a specific version
of the table.
A common pattern is to use the latest state of the Delta table throughout the execution of an Azure Databricks
job to update downstream applications.
Because Delta tables auto update, a DataFrame loaded from a Delta table may return different results across
invocations if the underlying data is updated. By using time travel, you can fix the data returned by the
DataFrame across invocations:
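For example, the following sketch pins DataFrames to a specific timestamp or version using the timestampAsOf and versionAsOf reader options (the values and path are illustrative):
Python
df_at_time = (spark.read.format("delta")
  .option("timestampAsOf", "2019-01-01")
  .load("/tmp/delta/people10m"))

df_at_version = (spark.read.format("delta")
  .option("versionAsOf", 123)
  .load("/tmp/delta/people10m"))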
@ syntax
You may have a parametrized pipeline, where the input path of your pipeline is a parameter of your job. After
the execution of your job, you may want to reproduce the output some time in the future. In this case, you can
use the @ syntax to specify the timestamp or version. The timestamp must be in yyyyMMddHHmmssSSS format. You
can specify a version after @ by prepending a v to the version. For example, to query version 123 for the
table people10m , specify people10m@v123 .
SQL
Python
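A sketch of the @ syntax in Python (the timestamp and version values are illustrative):
df_v123 = spark.read.table("people10m@v123")
df_ts = spark.read.format("delta").load("/tmp/delta/people10m@20190101000000000")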
For example, time travel lets you query the number of new customers added over the last week by comparing the latest version of a table with the version from a week ago.
Data retention
To time travel to a previous version, you must retain both the log and the data files for that version.
The data files backing a Delta table are never deleted automatically; data files are deleted only when you run
VACUUM. VACUUM does not delete Delta log files; log files are automatically cleaned up after checkpoints are
written.
By default you can time travel to a Delta table up to 30 days old unless you have:
Run VACUUM on your Delta table.
Changed the data or log file retention periods using the following table properties:
delta.logRetentionDuration = "interval <interval>" : controls how long the history for a table is
kept. The default is interval 30 days .
Each time a checkpoint is written, Azure Databricks automatically cleans up log entries older than
the retention interval. If you set this config to a large enough value, many log entries are retained.
This should not impact performance as operations against the log are constant time. Operations
on history are parallel but will become more expensive as the log size increases.
delta.deletedFileRetentionDuration = "interval <interval>" : controls how long ago a file must
have been deleted before being a candidate for VACUUM . The default is interval 7 days .
To access 30 days of historical data even if you run VACUUM on the Delta table, set
delta.deletedFileRetentionDuration = "interval 30 days" . This setting may cause your storage
costs to go up.
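For example, to set both retention properties described above on an existing table, a sketch like the following could be used (the table name is illustrative):
Python
spark.sql("""
  ALTER TABLE default.people10m SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 30 days',
    'delta.deletedFileRetentionDuration' = 'interval 30 days'
  )
""")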
Write to a table
Append
To atomically add new data to an existing Delta table, use append mode:
SQL
Python
df.write.format("delta").mode("append").save("/tmp/delta/people10m")
df.write.format("delta").mode("append").saveAsTable("default.people10m")
Scala
df.write.format("delta").mode("append").save("/tmp/delta/people10m")
df.write.format("delta").mode("append").saveAsTable("default.people10m")
import io.delta.implicits._
df.write.mode("append").delta("/tmp/delta/people10m")
Overwrite
To atomically replace all the data in a table, use overwrite mode:
SQL
Python
df.write.format("delta").mode("overwrite").save("/tmp/delta/people10m")
df.write.format("delta").mode("overwrite").saveAsTable("default.people10m")
Scala
df.write.format("delta").mode("overwrite").save("/tmp/delta/people10m")
df.write.format("delta").mode("overwrite").saveAsTable("default.people10m")
import io.delta.implicits._
df.write.mode("overwrite").delta("/tmp/delta/people10m")
Using DataFrames, you can also selectively overwrite only the data that matches an arbitrary expression. This
feature is available in Databricks Runtime 9.1 LTS and above. The following command atomically replaces events
in January in the target table, which is partitioned by start_date , with the data in df :
Python
df.write \
.format("delta") \
.mode("overwrite") \
.option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'") \
.save("/tmp/delta/events")
Scala
df.write
.format("delta")
.mode("overwrite")
.option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'")
.save("/tmp/delta/events")
This sample code writes out the data in df , validates that it all matches the predicate, and performs an atomic
replacement. If you want to write out data that doesn’t all match the predicate, to replace the matching rows in
the target table, you can disable the constraint check by setting
spark.databricks.delta.replaceWhere.constraintCheck.enabled to false:
Python
spark.conf.set("spark.databricks.delta.replaceWhere.constraintCheck.enabled", False)
Scala
spark.conf.set("spark.databricks.delta.replaceWhere.constraintCheck.enabled", false)
In Databricks Runtime 9.0 and below, replaceWhere overwrites data matching a predicate over partition
columns only. The following command atomically replaces the month in January in the target table, which is
partitioned by date , with the data in df :
Python
df.write \
.format("delta") \
.mode("overwrite") \
.option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'") \
.save("/tmp/delta/people10m")
Scala
df.write
.format("delta")
.mode("overwrite")
.option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'")
.save("/tmp/delta/people10m")
In Databricks Runtime 9.1 and above, if you want to fall back to the old behavior, you can disable the
spark.databricks.delta.replaceWhere.dataColumns.enabled flag:
Python
spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", False)
Scala
spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", false)
IMPORTANT
This feature is in Public Preview.
Databricks Runtime 11.1 and above supports dynamic partition overwrite mode for partitioned tables.
When in dynamic partition overwrite mode, we overwrite all existing data in each logical partition for which the
write will commit new data. Any existing logical partitions for which the write does not contain data will remain
unchanged. This mode is only applicable when data is being written in overwrite mode: either INSERT OVERWRITE
in SQL, or a DataFrame write with df.write.mode("overwrite") .
Configure dynamic partition overwrite mode by setting the Spark session configuration
spark.sql.sources.partitionOverwriteMode to dynamic . You can also enable this by setting the DataFrameWriter
option partitionOverwriteMode to dynamic . If present, the query-specific option overrides the mode defined in
the session configuration. The default for partitionOverwriteMode is static .
SQL
SET spark.sql.sources.partitionOverwriteMode=dynamic;
INSERT OVERWRITE TABLE default.people10m SELECT * FROM morePeople;
Python
df.write \
.format("delta") \
.mode("overwrite") \
.option("partitionOverwriteMode", "dynamic") \
.saveAsTable("default.people10m")
Scala
df.write
.format("delta")
.mode("overwrite")
.option("partitionOverwriteMode", "dynamic")
.saveAsTable("default.people10m")
NOTE
Dynamic partition overwrite conflicts with the option replaceWhere for partitioned tables.
If dynamic partition overwrite is enabled in the Spark session configuration, and replaceWhere is provided as a
DataFrameWriter option, then Delta Lake overwrites the data according to the replaceWhere expression (query-
specific options override session configurations).
You’ll receive an error if the DataFrameWriter options have both dynamic partition overwrite and replaceWhere
enabled.
IMPORTANT
Validate that the data written with dynamic partition overwrite touches only the expected partitions. A single row in the
incorrect partition can lead to unintentionally overwriting an entire partition. We recommend using replaceWhere to
specify which data to overwrite.
If a partition has been accidentally overwritten, you can use Find the last commit’s version in the Spark session to undo
the change.
For Delta Lake support for updating tables, see Table deletes, updates, and merges.
Limit rows written in a file
You can use the SQL session configuration spark.sql.files.maxRecordsPerFile to specify the maximum number
of records to write to a single file for a Delta Lake table. Specifying a value of zero or a negative value represents
no limit.
In Databricks Runtime 10.5 and above, you can also use the DataFrameWriter option maxRecordsPerFile when
using the DataFrame APIs to write to a Delta Lake table. When maxRecordsPerFile is specified, the value of the
SQL session configuration spark.sql.files.maxRecordsPerFile is ignored.
Python
df.write.format("delta") \
.mode("append") \
.option("maxRecordsPerFile", "10000") \
.save("/tmp/delta/people10m")
Scala
df.write.format("delta")
.mode("append")
.option("maxRecordsPerFile", "10000")
.save("/tmp/delta/people10m")
Idempotent writes
Sometimes a job that writes data to a Delta table is restarted due to various reasons (for example, job
encounters a failure). The failed job may or may not have written the data to Delta table before terminating. In
the case where the data is written to the Delta table, the restarted job writes the same data to the Delta table
which results in duplicate data.
To address this, Delta tables support the following DataFrameWriter options to make the writes idempotent:
txnAppId : A unique string that you can pass on each DataFrame write. For example, this can be the name of
the job.
txnVersion : A monotonically increasing number that acts as transaction version. This number needs to be
unique for data that is being written to the Delta table(s). For example, this can be the epoch seconds of the
instant when the query is attempted for the first time. Any subsequent restarts of the same job need to have
the same value for txnVersion .
The above combination of options needs to be unique for each new batch of data being ingested into the Delta
table, and the txnVersion needs to be higher than the txnVersion of the last data that was ingested into the Delta table. For
example:
Last successfully written data contains option values as dailyETL:23423 ( txnAppId:txnVersion ).
Next write of data should have txnAppId = dailyETL and txnVersion as at least 23424 (one more than the
last written data txnVersion ).
Any attempt to write data with txnAppId = dailyETL and txnVersion as 23422 or less is ignored because the
txnVersion is less than the last recorded txnVersion in the table.
An attempt to write data with txnAppId:txnVersion as anotherETL:23424 succeeds in writing data to the table
because it contains a different txnAppId from the last ingested data.
WARNING
This solution assumes that the data being written to the Delta table(s) in multiple retries of the job is the same. If a write attempt
to a Delta table succeeds but, due to some downstream failure, there is a second write attempt with the same txn options but
different data, that second write attempt is ignored. This can cause unexpected results.
Example
Python
Scala
val appId = ... // A unique string that is used as an application ID.
val version = ... // A monotonically increasing number that acts as transaction version.
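A complete write using these options might look like the following Python sketch (the application ID, version, and path are illustrative):
Python
app_id = "dailyETL"      # a unique string used as the application ID
txn_version = 23424      # a monotonically increasing transaction version

(df.write.format("delta")
  .mode("append")
  .option("txnAppId", app_id)
  .option("txnVersion", txn_version)
  .save("/tmp/delta/people10m"))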
Set user-defined commit metadata
You can specify user-defined strings as metadata in commits, either using the DataFrameWriter option userMetadata or the SparkSession configuration spark.databricks.delta.commitInfo.userMetadata . If both are specified, the option takes precedence. This user-defined metadata is readable in the DESCRIBE HISTORY output.
SQL
SET spark.databricks.delta.commitInfo.userMetadata=overwritten-for-fixing-incorrect-data
INSERT OVERWRITE default.people10m SELECT * FROM morePeople
Python
df.write.format("delta") \
.mode("overwrite") \
.option("userMetadata", "overwritten-for-fixing-incorrect-data") \
.save("/tmp/delta/people10m")
Scala
df.write.format("delta")
.mode("overwrite")
.option("userMetadata", "overwritten-for-fixing-incorrect-data")
.save("/tmp/delta/people10m")
Schema validation
Delta Lake automatically validates that the schema of the DataFrame being written is compatible with the
schema of the table. Delta Lake uses the following rules to determine whether a write from a DataFrame to a
table is compatible:
All DataFrame columns must exist in the target table. If there are columns in the DataFrame not present in
the table, an exception is raised. Columns present in the table but not in the DataFrame are set to null.
DataFrame column data types must match the column data types in the target table. If they don’t match, an
exception is raised.
DataFrame column names cannot differ only by case. This means that you cannot have columns such as
“Foo” and “foo” defined in the same table. While you can use Spark in case sensitive or insensitive (default)
mode, Parquet is case sensitive when storing and returning column information. Delta Lake is case-
preserving but insensitive when storing the schema and has this restriction to avoid potential mistakes, data
corruption, or loss issues.
Delta Lake supports DDL to add new columns explicitly and the ability to update the schema automatically.
If you specify other options, such as partitionBy , in combination with append mode, Delta Lake validates that
they match and throws an error for any mismatch. When partitionBy is not present, appends automatically
follow the partitioning of the existing data.
NOTE
In Databricks Runtime 7.0 and above, INSERT syntax provides schema enforcement and supports schema evolution. If a
column’s data type cannot be safely cast to your Delta Lake table’s data type, then a runtime exception is thrown. If
schema evolution is enabled, new columns can exist as the last columns of your schema (or nested columns) for the
schema to evolve.
For more information about enforcing and evolving schemas in Delta Lake, watch this YouTube video (55
minutes).
IMPORTANT
When you update a Delta table schema, streams that read from that table terminate. If you want the stream to continue
you must restart it.
For recommended methods, see Production considerations for Structured Streaming applications on Azure
Databricks.
Explicitly update schema
You can use the following DDL to explicitly change the schema of a table.
Add columns
ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name], ...)
ALTER TABLE table_name ADD COLUMNS (col_name.nested_col_name data_type [COMMENT col_comment] [FIRST|AFTER
colA_name], ...)
Example
If the schema before running ALTER TABLE boxes ADD COLUMNS (colB.nested STRING AFTER field1) is:
- root
| - colA
| - colB
| +-field1
| +-field2
the schema after is:
- root
| - colA
| - colB
| +-field1
| +-nested
| +-field2
NOTE
Adding nested columns is supported only for structs. Arrays and maps are not supported.
Change a column comment or ordering
ALTER TABLE table_name ALTER [COLUMN] col_name col_name data_type [COMMENT col_comment] [FIRST|AFTER
colA_name]
Example
If the schema before running ALTER TABLE boxes CHANGE COLUMN colB.field2 field2 STRING FIRST is:
- root
| - colA
| - colB
| +-field1
| +-field2
the schema after is:
- root
| - colA
| - colB
| +-field2
| +-field1
Replace columns
ALTER TABLE table_name REPLACE COLUMNS (col_name1 col_type1 [COMMENT col_comment1], ...)
Example
ALTER TABLE boxes REPLACE COLUMNS (colC STRING, colB STRUCT<field2:STRING, nested:STRING, field1:STRING>,
colA STRING)
the schema after running the command is:
- root
| - colC
| - colB
| +-field2
| +-nested
| +-field1
| - colA
Rename columns
IMPORTANT
This feature is in Public Preview.
NOTE
This feature is available in Databricks Runtime 10.2 and above.
To rename columns without rewriting any of the columns’ existing data, you must enable column mapping for
the table. See Delta column mapping.
To rename a column:
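The rename commands themselves are not shown above. A sketch using the standard Delta Lake ALTER TABLE ... RENAME COLUMN syntax from Python (the table and column names are illustrative):
Python
# Rename a top-level column
spark.sql("ALTER TABLE default.people10m RENAME COLUMN salary TO annual_salary")

# Rename a nested field
spark.sql("ALTER TABLE boxes RENAME COLUMN colB.field1 TO field1_renamed")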
Example
- root
| - colA
| - colB
| +-field1
| +-field2
IMPORTANT
This feature is in Public Preview.
NOTE
This feature is available in Databricks Runtime 11.0 and above.
To drop columns as a metadata-only operation without rewriting any data files, you must enable column
mapping for the table. See Delta column mapping.
IMPORTANT
Dropping a column from metadata does not delete the underlying data for the column in files. To purge the dropped
column data, you can use REORG TABLE to rewrite files. You can then use VACUUM to physically delete the files that
contain the dropped column data.
To drop a column:
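The drop commands themselves are not shown above. A sketch using the standard Delta Lake ALTER TABLE ... DROP COLUMN syntax from Python (the table and column names are illustrative):
Python
# Drop a single column
spark.sql("ALTER TABLE default.people10m DROP COLUMN middleName")

# Drop multiple columns
spark.sql("ALTER TABLE default.people10m DROP COLUMNS (middleName, ssn)")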
Change a column type
spark.read.table(...) \
.withColumn("birthDate", col("birthDate").cast("date")) \
.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.saveAsTable(...)
Change a column name
spark.read.table(...) \
.withColumnRenamed("dateOfBirth", "birthDate") \
.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.saveAsTable(...)
You can enable automatic schema merging on write by setting the DataFrameWriter option mergeSchema to true or the Spark session configuration spark.databricks.delta.schema.autoMerge.enabled to true . When both options are specified, the option from the DataFrameWriter takes precedence. The added columns
are appended to the end of the struct they are present in. Case is preserved when appending a new column.
NOTE
mergeSchema is not supported when table access control is enabled (as it elevates a request that requires MODIFY to
one that requires ALL PRIVILEGES ).
mergeSchema cannot be used with INSERT INTO or .write.insertInto() .
NullType columns
Because Parquet doesn’t support NullType , NullType columns are dropped from the DataFrame when writing
into Delta tables, but are still stored in the schema. When a different data type is received for that column, Delta
Lake merges the schema to the new data type. If Delta Lake receives a NullType for an existing column, the old
schema is retained and the new column is dropped during the write.
NullType in streaming is not supported. Since you must set schemas when using streaming, this should be very
rare. NullType is also not accepted for complex types such as ArrayType and MapType .
Replace table schema
By default, overwriting the data in a table does not overwrite the schema. To overwrite the schema and partitioning of the table as well, set the overwriteSchema option when writing:
df.write.option("overwriteSchema", "true")
Views on tables
Delta Lake supports the creation of views on top of Delta tables just like you might with a data source table.
These views integrate with table access control to allow for column and row level security.
The core challenge when you operate with views is resolving the schemas. If you alter a Delta table schema, you
must recreate derivative views to account for any additions to the schema. For instance, if you add a new
column to a Delta table, you must make sure that this column is available in the appropriate views built on top
of that base table.
Table properties
You can store your own metadata as a table property using TBLPROPERTIES in CREATE and ALTER . You can then
SHOW that metadata. For example:
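A sketch of setting and showing a custom table property from Python (the table and property names are illustrative):
Python
spark.sql("ALTER TABLE default.people10m SET TBLPROPERTIES ('department' = 'finance')")
display(spark.sql("SHOW TBLPROPERTIES default.people10m"))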
TBLPROPERTIES are stored as part of Delta table metadata. You cannot define new TBLPROPERTIES in a CREATE
statement if a Delta table already exists in a given location.
In addition, to tailor behavior and performance, Delta Lake supports certain Delta table properties:
Block deletes and updates in a Delta table: delta.appendOnly=true .
Configure the time travel retention properties: delta.logRetentionDuration=<interval-string> and
delta.deletedFileRetentionDuration=<interval-string> . For details, see Data retention.
Configure the number of columns for which statistics are collected: delta.dataSkippingNumIndexedCols=n . This
property indicates to the writer that statistics are to be collected only for the first n columns in the table.
Also, the data skipping code ignores statistics for any column beyond this column index. This property takes
effect only for new data that is written out.
NOTE
Modifying a Delta table property is a write operation that will conflict with other concurrent write operations, causing
them to fail. We recommend that you modify a table property only when there are no concurrent write operations on
the table.
You can also set delta. -prefixed properties during the first commit to a Delta table using Spark configurations.
For example, to initialize a Delta table with the property delta.appendOnly=true , set the Spark configuration
spark.databricks.delta.properties.defaults.appendOnly to true . For example:
SQL
Python
spark.conf.set("spark.databricks.delta.properties.defaults.appendOnly", "true")
Scala
spark.conf.set("spark.databricks.delta.properties.defaults.appendOnly", "true")
DESCRIBE DETAIL
Provides information about schema, partitioning, table size, and so on. For details, see Retrieve Delta table
details.
DESCRIBE HISTORY
Provides provenance information, including the operation, user, and so on, and operation metrics for each write
to a table. Table history is retained for 30 days. For details, see Retrieve Delta table history.
The Explore and create tables with the Data tab provides a visual view of this detailed table information and
history for Delta tables. In addition to the table schema and sample data, you can click the History tab to see
the table history that displays with DESCRIBE HISTORY .
NOTE
This feature is available in Databricks Runtime 10.1 and above.
For example, you can pass your storage credentials through DataFrame options:
Python
df1 = spark.read.format("delta") \
.option("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key-
1>") \
.read("...")
df2 = spark.read.format("delta") \
.option("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key-
2>") \
.read("...")
df1.union(df2).write.format("delta") \
.mode("overwrite") \
.option("fs.azure.account.key.<storage-account-name>.dfs.core.windows.net", "<storage-account-access-key-
3>") \
.save("...")
Scala
You can find the details of the Hadoop file system configurations for your storage in Data sources.
Notebook
For an example of the various Delta table metadata commands, see the end of the following notebook:
Delta Lake batch commands notebook
Get notebook
Table streaming reads and writes
7/21/2022 • 8 minutes to read
Delta Lake is deeply integrated with Spark Structured Streaming through readStream and writeStream . Delta
Lake overcomes many of the limitations typically associated with streaming systems and files, including:
Coalescing small files produced by low latency ingest
Maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs)
Efficiently discovering which files are new when using files as the source for a stream
See also Production considerations for Structured Streaming applications on Azure Databricks.
spark.readStream.format("delta")
.load("/tmp/delta/events")
import io.delta.implicits._
spark.readStream.delta("/tmp/delta/events")
or
import io.delta.implicits._
spark.readStream.format("delta").table("events")
In this section:
Limit input rate
Ignore updates and deletes
Specify initial position
Limit input rate
The following options are available to control micro-batches:
maxFilesPerTrigger : The maximum number of new files to consider in every micro-batch. The default is 1000.
maxBytesPerTrigger : How much data gets processed in each micro-batch. This option sets a “soft max”,
meaning that a batch processes approximately this amount of data and may process more than the limit in
order to make the streaming query move forward in cases when the smallest input unit is larger than this
limit. If you use Trigger.Once for your streaming, this option is ignored. This is not set by default.
If you use maxBytesPerTrigger in conjunction with maxFilesPerTrigger , the micro-batch processes data until
either the maxFilesPerTrigger or maxBytesPerTrigger limit is reached.
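For example, the following sketch applies both options to a Delta streaming source (the values and path are illustrative):
Python
df = (spark.readStream.format("delta")
  .option("maxFilesPerTrigger", 100)
  .option("maxBytesPerTrigger", "10g")
  .load("/tmp/delta/user_events"))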
NOTE
In cases when the source table transactions are cleaned up due to the logRetentionDuration configuration and the
stream lags in processing, Delta Lake processes the data corresponding to the latest available transaction history of the
source table but does not fail the stream. This can result in data being dropped.
spark.readStream.format("delta")
.option("ignoreDeletes", "true")
.load("/tmp/delta/user_events")
However, if you have to delete data based on user_email , then you will need to use:
spark.readStream.format("delta")
.option("ignoreChanges", "true")
.load("/tmp/delta/user_events")
If you update a user_email with the UPDATE statement, the file containing the user_email in question is
rewritten. When you use ignoreChanges , the new record is propagated downstream with all other unchanged
records that were in the same file. Your logic should be able to handle these incoming duplicate records.
Specify initial position
NOTE
This feature is available on Databricks Runtime 7.3 LTS and above.
You can use the following options to specify the starting point of the Delta Lake streaming source without
processing the entire table.
startingVersion : The Delta Lake version to start from. All table changes starting from this version
(inclusive) will be read by the streaming source. You can obtain the commit versions from the version
column of the DESCRIBE HISTORY command output.
In Databricks Runtime 7.4 and above, to return only the latest changes, specify latest .
startingTimestamp : The timestamp to start from. All table changes committed at or after the timestamp
(inclusive) will be read by the streaming source. One of:
A timestamp string. For example, "2019-01-01T00:00:00.000Z" .
A date string. For example, "2019-01-01" .
You cannot set both options at the same time; you can use only one of them. They take effect only when starting
a new streaming query. If a streaming query has started and the progress has been recorded in its checkpoint,
these options are ignored.
IMPORTANT
Although you can start the streaming source from a specified version or timestamp, the schema of the streaming source
is always the latest schema of the Delta table. You must ensure there is no incompatible schema change to the Delta table
after the specified version or timestamp. Otherwise, the streaming source may return incorrect results when reading the
data with an incorrect schema.
Example
For example, suppose you have a table user_events . If you want to read changes since version 5, use:
spark.readStream.format("delta")
.option("startingVersion", "5")
.load("/tmp/delta/user_events")
spark.readStream.format("delta")
.option("startingTimestamp", "2018-10-18")
.load("/tmp/delta/user_events")
NOTE
The Delta Lake VACUUM function removes all files not managed by Delta Lake but skips any directories that begin with
_ . You can safely store checkpoints alongside other data and metadata for a Delta table using a directory structure such
as <table_name>/_checkpoints .
In this section:
Metrics
Append mode
Complete mode
Metrics
NOTE
Available in Databricks Runtime 8.1 and above.
You can find the number of bytes and number of files yet to be processed in a streaming query's progress as
the numBytesOutstanding and numFilesOutstanding metrics. If you are running the stream in a notebook, you can
see these metrics under the Raw Data tab in the streaming query progress dashboard:
{
"sources" : [
{
"description" : "DeltaSource[file:/path/to/source]",
"metrics" : {
"numBytesOutstanding" : "3456",
"numFilesOutstanding" : "8"
},
}
]
}
Append mode
By default, streams run in append mode, which adds new records to the table.
You can use the path method:
Python
(events.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/delta/_checkpoints/")
  .start("/delta/events")
)
Scala
events.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/tmp/delta/events/_checkpoints/")
.start("/tmp/delta/events")
import io.delta.implicits._
events.writeStream
.outputMode("append")
.option("checkpointLocation", "/tmp/delta/events/_checkpoints/")
.delta("/tmp/delta/events")
or the toTable method in Spark 3.1 and higher (Databricks Runtime 8.3 and above), as follows. (In Spark
versions before 3.1 (Databricks Runtime 8.2 and below), use the table method instead.)
Python
(events.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/delta/events/_checkpoints/")
  .toTable("events")
)
Scala
events.writeStream
.outputMode("append")
.option("checkpointLocation", "/tmp/delta/events/_checkpoints/")
.toTable("events")
Complete mode
You can also use Structured Streaming to replace the entire table with every batch. One example use case is to
compute a summary using aggregation:
Python
(spark.readStream
.format("delta")
.load("/tmp/delta/events")
.groupBy("customerId")
.count()
.writeStream
.format("delta")
.outputMode("complete")
.option("checkpointLocation", "/tmp/delta/eventsByCustomer/_checkpoints/")
.start("/tmp/delta/eventsByCustomer")
)
Scala
spark.readStream
.format("delta")
.load("/tmp/delta/events")
.groupBy("customerId")
.count()
.writeStream
.format("delta")
.outputMode("complete")
.option("checkpointLocation", "/tmp/delta/eventsByCustomer/_checkpoints/")
.start("/tmp/delta/eventsByCustomer")
The preceding example continuously updates a table that contains the aggregate number of events by customer.
For applications with more lenient latency requirements, you can save computing resources with one-time
triggers. Use these to update summary aggregation tables on a given schedule, processing only new data that
has arrived since the last update.
Idempotent table writes in foreachBatch
NOTE
Available in Databricks Runtime 8.4 and above.
The command foreachBatch allows you to specify a function that is executed on the output of every micro-batch
after arbitrary transformations in the streaming query. This allows implementing a foreachBatch function that
can write the micro-batch output to one or more target Delta table destinations. However, foreachBatch does
not make those writes idempotent, because those write attempts lack the information about whether the batch is being re-executed.
For example, rerunning a failed batch could result in duplicate data writes.
To address this, Delta tables support the following DataFrameWriter options to make the writes idempotent:
txnAppId : A unique string that you can pass on each DataFrame write. For example, you can use the
StreamingQuery ID as txnAppId .
txnVersion : A monotonically increasing number that acts as transaction version.
Delta table uses the combination of txnAppId and txnVersion to identify duplicate writes and ignore them.
If a batch write is interrupted with a failure, rerunning the batch uses the same application and batch ID, which
would help the runtime correctly identify duplicate writes and ignore them. Application ID ( txnAppId ) can be
any user-generated unique string and does not have to be related to the stream ID.
WARNING
If you delete the streaming checkpoint and restart the query with a new checkpoint, you must provide a different appId
; otherwise, writes from the restarted query will be ignored because it will contain the same txnAppId and the batch ID
would start from 0.
The same DataFrameWriter options can be used to achieve idempotent writes in non-streaming jobs. For
details, see Idempotent writes.
Example
Python
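The foreachBatch example itself is not reproduced above. A minimal Python sketch, assuming a streaming DataFrame streaming_df ; the application ID and target path are illustrative:
app_id = "streaming-etl"  # a unique string used as the application ID

def write_batch_idempotent(batch_df, batch_id):
    # txnAppId and txnVersion let Delta Lake detect and skip replayed micro-batches
    (batch_df.write.format("delta")
      .mode("append")
      .option("txnAppId", app_id)
      .option("txnVersion", batch_id)
      .save("/tmp/delta/target"))

(streaming_df.writeStream
  .foreachBatch(write_batch_idempotent)
  .option("checkpointLocation", "/tmp/delta/target/_checkpoints")
  .start())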
Scala
The following snippet joins a streaming DataFrame with a static DataFrame and writes the result to a Delta table:
query = (streamingDF
.join(staticDF, streamingDF.customer_id==staticDF.id, "inner")
.writeStream
.option("checkpointLocation", checkpoint_path)
.table("orders_with_customer_info")
)
Table deletes, updates, and merges
7/21/2022 • 20 minutes to read
Delta Lake supports several statements to facilitate deleting data from and updating data in Delta tables.
For an overview and demonstration of deleting and updating data in Delta Lake, watch this YouTube video (54
minutes).
For additional information about capturing change data from Delta Lake, watch this YouTube video (53 minutes).
Python
NOTE
The Python API is available in Databricks Runtime 6.1 and above.
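The Python setup and delete call are not shown above. A minimal sketch (the table path and predicate are illustrative):
from delta.tables import *
from pyspark.sql.functions import *

deltaTable = DeltaTable.forPath(spark, "/tmp/delta/people-10m")

# Delete every person born before 1955
deltaTable.delete("birthDate < '1955-01-01'")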
Scala
NOTE
The Scala API is available in Databricks Runtime 6.0 and above.
import io.delta.tables._
import org.apache.spark.sql.functions._
import spark.implicits._
Java
NOTE
The Java API is available in Databricks Runtime 6.0 and above.
import io.delta.tables.*;
import org.apache.spark.sql.functions;
IMPORTANT
delete removes the data from the latest version of the Delta table but does not remove it from the physical storage
until the old versions are explicitly vacuumed. See vacuum for details.
TIP
When possible, provide predicates on the partition columns for a partitioned Delta table as such predicates can
significantly speed up the operation.
Update a table
You can update data that matches a predicate in a Delta table. For example, in a table named people10m or a
path at /tmp/delta/people-10m , to change an abbreviation in the gender column from M or F to Male or
Female , you can run the following:
SQL
Python
NOTE
The Python API is available in Databricks Runtime 6.1 and above.
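The Python update call is not shown above. A minimal sketch (the table path and expressions are illustrative):
from delta.tables import *

deltaTable = DeltaTable.forPath(spark, "/tmp/delta/people-10m")

# Expand the gender abbreviation 'F' to 'Female'
deltaTable.update(
  condition = "gender = 'F'",
  set = { "gender": "'Female'" }
)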
Scala
NOTE
The Scala API is available in Databricks Runtime 6.0 and above.
import io.delta.tables._
import org.apache.spark.sql.functions._
import spark.implicits._
Java
NOTE
The Java API is available in Databricks Runtime 6.0 and above.
import io.delta.tables.*;
import org.apache.spark.sql.functions;
import java.util.HashMap;
TIP
Similar to delete, update operations can get a significant speedup with predicates on partitions.
Upsert into a table using merge
You can upsert data from a source table, view, or DataFrame into a target Delta table using the merge operation. The following example assumes that deltaTablePeople (the target) and deltaTablePeopleUpdates (the source of updates) have already been defined, for example with DeltaTable.forName or DeltaTable.forPath .
Python
dfUpdates = deltaTablePeopleUpdates.toDF()
deltaTablePeople.alias('people') \
.merge(
dfUpdates.alias('updates'),
'people.id = updates.id'
) \
.whenMatchedUpdate(set =
{
"id": "updates.id",
"firstName": "updates.firstName",
"middleName": "updates.middleName",
"lastName": "updates.lastName",
"gender": "updates.gender",
"birthDate": "updates.birthDate",
"ssn": "updates.ssn",
"salary": "updates.salary"
}
) \
.whenNotMatchedInsert(values =
{
"id": "updates.id",
"firstName": "updates.firstName",
"middleName": "updates.middleName",
"lastName": "updates.lastName",
"gender": "updates.gender",
"birthDate": "updates.birthDate",
"ssn": "updates.ssn",
"salary": "updates.salary"
}
) \
.execute()
Scala
import io.delta.tables._
import org.apache.spark.sql.functions._
deltaTablePeople
.as("people")
.merge(
dfUpdates.as("updates"),
"people.id = updates.id")
.whenMatched
.updateExpr(
Map(
"id" -> "updates.id",
"firstName" -> "updates.firstName",
"middleName" -> "updates.middleName",
"lastName" -> "updates.lastName",
"gender" -> "updates.gender",
"birthDate" -> "updates.birthDate",
"ssn" -> "updates.ssn",
"salary" -> "updates.salary"
))
.whenNotMatched
.insertExpr(
Map(
"id" -> "updates.id",
"firstName" -> "updates.firstName",
"middleName" -> "updates.middleName",
"lastName" -> "updates.lastName",
"gender" -> "updates.gender",
"birthDate" -> "updates.birthDate",
"ssn" -> "updates.ssn",
"salary" -> "updates.salary"
))
.execute()
Java
import io.delta.tables.*;
import org.apache.spark.sql.functions;
import java.util.HashMap;
deltaTable
.as("people")
.merge(
dfUpdates.as("updates"),
"people.id = updates.id")
.whenMatched()
.updateExpr(
new HashMap<String, String>() {{
put("id", "updates.id");
put("firstName", "updates.firstName");
put("middleName", "updates.middleName");
put("lastName", "updates.lastName");
put("gender", "updates.gender");
put("birthDate", "updates.birthDate");
put("ssn", "updates.ssn");
put("salary", "updates.salary");
}})
.whenNotMatched()
.insertExpr(
new HashMap<String, String>() {{
put("id", "updates.id");
put("firstName", "updates.firstName");
put("middleName", "updates.middleName");
put("lastName", "updates.lastName");
put("gender", "updates.gender");
put("birthDate", "updates.birthDate");
put("ssn", "updates.ssn");
put("salary", "updates.salary");
}})
.execute();
See the Delta Lake APIs for Scala, Java, and Python syntax details.
Delta Lake merge operations typically require two passes over the source data. If your source data contains
nondeterministic expressions, multiple passes on the source data can produce different rows causing incorrect
results. Some common examples of nondeterministic expressions include the current_date and
current_timestamp functions. If you cannot avoid using non-deterministic functions, consider saving the source
data to storage, for example as a temporary Delta table. Caching the source data may not address this issue, as
cache invalidation can cause the source data to be recomputed partially or completely (for example, when a
cluster loses some of its executors when scaling down).
Operation semantics
Here is a detailed description of the merge programmatic operation.
There can be any number of whenMatched and whenNotMatched clauses.
NOTE
In Databricks Runtime 7.2 and below, merge can have at most 2 whenMatched clauses and at most 1
whenNotMatched clause.
whenMatched clauses are executed when a source row matches a target table row based on the match
condition. These clauses have the following semantics.
whenMatched clauses can have at most one update and one delete action. The update action in
merge only updates the specified columns (similar to the update operation) of the matched target
row. The delete action deletes the matched row.
Each whenMatched clause can have an optional condition. If this clause condition exists, the update
or delete action is executed for any matching source-target row pair only when the clause
condition is true.
If there are multiple whenMatched clauses, then they are evaluated in the order they are specified.
All whenMatched clauses, except the last one, must have conditions.
If none of the whenMatched conditions evaluate to true for a source and target row pair that
matches the merge condition, then the target row is left unchanged.
To update all the columns of the target Delta table with the corresponding columns of the source
dataset, use whenMatched(...).updateAll() . This is equivalent to
whenMatched(...).updateExpr(Map("col1" -> "source.col1", "col2" -> "source.col2", ...))
for all the columns of the target Delta table. Therefore, this action assumes that the source table
has the same columns as those in the target table, otherwise the query throws an analysis error.
NOTE
This behavior changes when automatic schema migration is enabled. See Automatic schema evolution for
details.
whenNotMatched clauses are executed when a source row does not match any target row based on the
match condition. These clauses have the following semantics.
whenNotMatched clauses can have only the insert action. The new row is generated based on the
specified column and corresponding expressions. You do not need to specify all the columns in the
target table. For unspecified target columns, NULL is inserted.
NOTE
In Databricks Runtime 6.5 and below, you must provide all the columns in the target table for the
INSERT action.
Each whenNotMatched clause can have an optional condition. If the clause condition is present, a
source row is inserted only if that condition is true for that row. Otherwise, the source column is
ignored.
If there are multiple whenNotMatched clauses, then they are evaluated in the order they are
specified. All whenNotMatched clauses, except the last one, must have conditions.
To insert all the columns of the target Delta table with the corresponding columns of the source
dataset, use whenNotMatched(...).insertAll() . This is equivalent to
whenNotMatched(...).insertExpr(Map("col1" -> "source.col1", "col2" -> "source.col2", ...))
for all the columns of the target Delta table. Therefore, this action assumes that the source table
has the same columns as those in the target table, otherwise the query throws an analysis error.
NOTE
This behavior changes when automatic schema migration is enabled. See Automatic schema evolution for
details.
IMPORTANT
A merge operation can fail if multiple rows of the source dataset match and the merge attempts to update the same
rows of the target Delta table. According to the SQL semantics of merge, such an update operation is ambiguous as it
is unclear which source row should be used to update the matched target row. You can preprocess the source table to
eliminate the possibility of multiple matches. See the change data capture example—it shows how to preprocess the
change dataset (that is, the source dataset) to retain only the latest change for each key before applying that change
into the target Delta table.
A merge operation can produce incorrect results if the source dataset is non-deterministic. This is because merge
may perform two scans of the source dataset and if the data produced by the two scans are different, the final
changes made to the table can be incorrect. Non-determinism in the source can arise in many ways. Some of them are
as follows:
Reading from non-Delta tables. For example, reading from a CSV table where the underlying files can change
between the multiple scans.
Using non-deterministic operations. For example, Dataset.filter() operations that use the current timestamp
to filter data can produce different results between the multiple scans.
You can apply a SQL MERGE operation on a SQL VIEW only if the view has been defined as
CREATE VIEW viewName AS SELECT * FROM deltaTable .
NOTE
In Databricks Runtime 7.3 LTS and above, multiple matches are allowed when matches are unconditionally deleted (since
unconditional delete is not ambiguous even if there are multiple matches).
Schema validation
merge automatically validates that the schema of the data generated by insert and update expressions are
compatible with the schema of the table. It uses the following rules to determine whether the merge operation
is compatible:
For update and insert actions, the specified target columns must exist in the target Delta table.
For updateAll and insertAll actions, the source dataset must have all the columns of the target Delta table.
The source dataset can have extra columns and they are ignored.
For all actions, if the data type generated by the expressions producing the target columns are different from
the corresponding columns in the target Delta table, merge tries to cast them to the types in the table.
Automatic schema evolution
NOTE
Schema evolution in merge is available in Databricks Runtime 6.6 and above.
By default, updateAll and insertAll assign all the columns in the target Delta table with columns of the same
name from the source dataset. Any columns in the source dataset that don’t match columns in the target table
are ignored. However, in some use cases, it is desirable to automatically add source columns to the target Delta
table. To automatically update the table schema during a merge operation with updateAll and insertAll (at
least one of them), you can set the Spark session configuration
spark.databricks.delta.schema.autoMerge.enabled to true before running the merge operation.
NOTE
Schema evolution occurs only when there is either an updateAll ( UPDATE SET * ) or an insertAll ( INSERT * )
action, or both.
update and insert actions cannot explicitly refer to target columns that do not already exist in the target table
(even if there are updateAll or insertAll clauses). See the examples below.
NOTE
In Databricks Runtime 7.4 and below, merge supports schema evolution of only top-level columns, and not of nested
columns.
Here are a few examples of the effects of merge operation with and without schema evolution.
(The table of examples is not reproduced here; its columns were: Columns, Query (in Scala), Behavior without schema evolution (default), and Behavior with schema evolution.)
NOTE
This feature is available in Databricks Runtime 9.1 and above. For Databricks Runtime 9.0 and below, implicit Spark casting
is used for arrays of structs to resolve struct fields by position, and the effects of merge operations with and without
schema evolution of structs in arrays are inconsistent with the behaviors of structs outside of arrays.
Here are a few examples of the effects of merge operations with and without schema evolution for arrays of
structs.
Source schema: array<struct<b: string, a: string>>. Target schema: array<struct<a: int, b: int>>.
Behavior without schema evolution (default): The table schema remains unchanged. Columns will be resolved by name and updated or inserted.
Behavior with schema evolution: The table schema remains unchanged. Columns will be resolved by name and updated or inserted.
Source schema: array<struct<a: int, c: string, d: string>>. Target schema: array<struct<a: string, b: string>>.
Behavior without schema evolution (default): update and insert throw errors because c and d do not exist in the target table.
Behavior with schema evolution: The table schema is changed to array<struct<a: string, b: string, c: string, d: string>>. c and d are inserted as NULL for existing entries in the target table. update and insert fill entries in the source table with a cast to string and b as NULL.
Source schema: array<struct<a: string, b: struct<c: string, d: string>>>. Target schema: array<struct<a: string, b: struct<c: string>>>.
Behavior without schema evolution (default): update and insert throw errors because d does not exist in the target table.
Behavior with schema evolution: The target table schema is changed to array<struct<a: string, b: struct<c: string, d: string>>>. d is inserted as NULL for existing entries in the target table.
Performance tuning
You can reduce the time taken by merge using the following approaches:
Reduce the search space for matches : By default, the merge operation searches the entire Delta table
to find matches in the source table. One way to speed up merge is to reduce the search space by adding
known constraints in the match condition. For example, suppose you have a table that is partitioned by
country and date and you want to use merge to update information for the last day and a specific
country. Adding a condition that restricts the match to that date and country (for example, date = current_date()
AND country = 'USA' in the merge condition) will make the query faster as it looks for matches only in the
relevant partitions. Furthermore, it will also
reduce the chances of conflicts with other concurrent operations. See Concurrency control for more
details.
Compact files : If the data is stored in many small files, reading the data to search for matches can
become slow. You can compact small files into larger files to improve read throughput. See Compact files
for details.
Control the shuffle partitions for writes : The merge operation shuffles data multiple times to
compute and write the updated data. The number of tasks used to shuffle is controlled by the Spark
session configuration spark.sql.shuffle.partitions . Setting this parameter not only controls the
parallelism but also determines the number of output files. Increasing the value increases parallelism but
also generates a larger number of smaller data files.
Enable optimized writes : For partitioned tables, merge can produce a much larger number of small
files than the number of shuffle partitions. This is because every shuffle task can write multiple files in
multiple partitions, and can become a performance bottleneck. You can reduce the number of files by
enabling Optimized Write.
NOTE
In Databricks Runtime 7.4 and above, Optimized Write is automatically enabled in merge operations on partitioned
tables.
Tune file sizes in table : In Databricks Runtime 8.2 and above, Azure Databricks can automatically detect if
a Delta table has frequent merge operations that rewrite files and may choose to reduce the size of rewritten
files in anticipation of further file rewrites in the future. See the section on tuning file sizes for details.
Low Shuffle Merge : In Databricks Runtime 9.0 and above, Low Shuffle Merge provides an optimized
implementation of MERGE that provides better performance for most common workloads. In addition, it
preserves existing data layout optimizations such as Z-ordering on unmodified data.
Merge examples
Here are a few examples on how to use merge in different scenarios.
In this section:
Data deduplication when writing into Delta tables
Slowly changing data (SCD) Type 2 operation into Delta tables
Write change data into a Delta table
Upsert from streaming queries using foreachBatch
Data deduplication when writing into Delta tables
A common ETL use case is to collect logs into a Delta table by appending them to the table. However, often the
sources can generate duplicate log records and downstream deduplication steps are needed to take care of
them. With merge , you can avoid inserting the duplicate records.
SQL
Python
deltaTable.alias("logs").merge(
newDedupedLogs.alias("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId") \
.whenNotMatchedInsertAll() \
.execute()
Scala
deltaTable
.as("logs")
.merge(
newDedupedLogs.as("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId")
.whenNotMatched()
.insertAll()
.execute()
Java
deltaTable
.as("logs")
.merge(
newDedupedLogs.as("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId")
.whenNotMatched()
.insertAll()
.execute();
NOTE
The dataset containing the new logs needs to be deduplicated within itself. By the SQL semantics of merge, it matches
and deduplicates the new data with the existing data in the table, but if there is duplicate data within the new dataset, it is
inserted. Hence, deduplicate the new data before merging into the table.
If you know that you may get duplicate records only for a few days, you can optimize your query further by
partitioning the table by date, and then specifying the date range of the target table to match on.
Python
deltaTable.alias("logs").merge(
newDedupedLogs.alias("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId AND logs.date > current_date() - INTERVAL 7 DAYS") \
.whenNotMatchedInsertAll("newDedupedLogs.date > current_date() - INTERVAL 7 DAYS") \
.execute()
Scala
deltaTable.as("logs").merge(
newDedupedLogs.as("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId AND logs.date > current_date() - INTERVAL 7 DAYS")
.whenNotMatched("newDedupedLogs.date > current_date() - INTERVAL 7 DAYS")
.insertAll()
.execute()
Java
deltaTable.as("logs").merge(
newDedupedLogs.as("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId AND logs.date > current_date() - INTERVAL 7 DAYS")
.whenNotMatched("newDedupedLogs.date > current_date() - INTERVAL 7 DAYS")
.insertAll()
.execute();
This is more efficient than the previous command as it looks for duplicates only in the last 7 days of logs, not the
entire table. Furthermore, you can use this insert-only merge with Structured Streaming to perform continuous
deduplication of the logs.
In a streaming query, you can use merge operation in foreachBatch to continuously write any streaming
data to a Delta table with deduplication. See the following streaming example for more information on
foreachBatch .
In another streaming query, you can continuously read deduplicated data from this Delta table. This is
possible because an insert-only merge only appends new data to the Delta table.
NOTE
Insert-only merge is optimized to only append data in Databricks Runtime 6.2 and above. In Databricks Runtime 6.1 and
below, writes from insert-only merge operations cannot be read as a stream.
You can use a combination of merge and foreachBatch (see foreachbatch for more information) to write
complex upserts from a streaming query into a Delta table. For example:
Write streaming aggregates in Update Mode : This is much more efficient than Complete Mode.
Write a stream of database changes into a Delta table : The merge query for writing change data can
be used in foreachBatch to continuously apply a stream of changes to a Delta table (a minimal sketch of this
pattern follows this list).
Write streaming data into a Delta table with deduplication : The insert-only merge query for
deduplication can be used in foreachBatch to continuously write data (with duplicates) to a Delta table with
automatic deduplication.
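A minimal sketch of the foreachBatch upsert pattern described above; the table name, key column, and streaming DataFrame are hypothetical.
Python
from delta.tables import DeltaTable

def upsert_to_delta(micro_batch_df, batch_id):
    # Merge each micro-batch into the target table; restarts may replay a batch,
    # so the merge keeps the write idempotent with respect to the key.
    target = DeltaTable.forName(spark, "target_table")
    (target.alias("t")
      .merge(micro_batch_df.alias("s"), "s.key = t.key")
      .whenMatchedUpdateAll()
      .whenNotMatchedInsertAll()
      .execute())

(streamingDF.writeStream
  .foreachBatch(upsert_to_delta)
  .outputMode("update")
  .start())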
NOTE
Make sure that your merge statement inside foreachBatch is idempotent as restarts of the streaming query can
apply the operation on the same batch of data multiple times.
When merge is used in foreachBatch , the input data rate of the streaming query (reported through
StreamingQueryProgress and visible in the notebook rate graph) may be reported as a multiple of the actual rate at
which data is generated at the source. This is because merge reads the input data multiple times causing the input
metrics to be multiplied. If this is a bottleneck, you can cache the batch DataFrame before merge and then uncache it
after merge .
Write streaming aggregates in update mode using merge and foreachBatch notebook
Get notebook
Change data feed
7/21/2022 • 6 minutes to read
NOTE
Delta change data feed is available in Databricks Runtime 8.4 and above.
This article describes how to record and query row-level change information for Delta tables using the change data
feed feature. To learn how to update tables in a Delta Live Tables pipeline based on changes in source data, see Change
data capture with Delta Live Tables.
The change data feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta
table. When enabled on a Delta table, the runtime records “change events” for all the data written into the table.
This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or
updated.
You can read the change events in batch queries using SQL and DataFrame APIs (that is, df.read ), and in
streaming queries using DataFrame APIs (that is, df.readStream ).
Use cases
Change Data Feed is not enabled by default. The following use cases should drive when you enable the change
data feed.
Silver and Gold tables : Improve Delta performance by processing only row-level changes following initial
MERGE , UPDATE , or DELETE operations to accelerate and simplify ETL and ELT operations.
Materialized views : Create up-to-date, aggregated views of information for use in BI and analytics without
having to reprocess the full underlying tables, instead updating only where changes have come through.
Transmit changes : Send a change data feed to downstream systems such as Kafka or RDBMS that can use
it to incrementally process in later stages of data pipelines.
Audit trail table : Capturing the change data feed as a Delta table provides perpetual storage and efficient
query capability to see all changes over time, including when deletes occur and what updates were made.
New table : Set the table property delta.enableChangeDataFeed = true in the CREATE TABLE command.
CREATE TABLE student (id INT, name STRING, age INT) TBLPROPERTIES (delta.enableChangeDataFeed = true)
Existing table : Set the table property delta.enableChangeDataFeed = true in the ALTER TABLE command.
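For example, a sketch of enabling the property on an existing table (the table name is hypothetical):
Python
spark.sql("ALTER TABLE myDeltaTable SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")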
-- database/schema names inside the string for table name, with backticks for escaping dots and special
characters
SELECT * FROM table_changes('dbName.`dotted.tableName`', '2021-04-21 06:45:46' , '2021-05-21 12:00:00')
Python
# version as ints or longs
spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 0) \
.option("endingVersion", 10) \
.table("myDeltaTable")
Python
# not providing a starting version/timestamp will result in the latest snapshot being fetched first
spark.readStream.format("delta") \
.option("readChangeFeed", "true") \
.table("myDeltaTable")
Scala
// not providing a starting version/timestamp will result in the latest snapshot being fetched first
spark.readStream.format("delta")
.option("readChangeFeed", "true")
.table("myDeltaTable")
To get the change data while reading the table, set the option readChangeFeed to true . The startingVersion or
startingTimestamp are optional and if not provided the stream returns the latest snapshot of the table at the
time of streaming as an INSERT and future changes as change data. Options like rate limits ( maxFilesPerTrigger
, maxBytesPerTrigger ) and excludeRegex are also supported when reading change data.
NOTE
Rate limiting can be atomic for versions other than the starting snapshot version. That is, the entire commit version will
be rate limited or the entire commit will be returned.
By default if a user passes in a version or timestamp exceeding the last commit on a table, the error
timestampGreaterThanLatestCommit will be thrown. CDF can handle the out of range version case, if the user sets the
following configuration to true .
If you provide a start version greater than the last commit on a table or a start timestamp newer than the last commit on
a table, then when the preceding configuration is enabled, an empty read result is returned.
If you provide an end version greater than the last commit on a table or an end timestamp newer than the last commit
on a table, then when the preceding configuration is enabled in batch read mode, all changes between the start version
and the last commit are returned.
Change data event schema
In addition to the data columns, change data contains metadata columns that identify the type of change event:
Column name | Type | Values
_change_type | string | insert, update_preimage, update_postimage, delete (1)
_commit_version | long | The Delta log or table version containing the change.
_commit_timestamp | timestamp | The timestamp associated with when the commit was created.
(1) preimage is the value before the update, postimage is the value after the update.
Notebook
The notebook shows how to propagate changes made to a silver table of absolute number of vaccinations to a
gold table of vaccination rates.
Change data feed notebook
Get notebook
Table utility commands
7/21/2022 • 23 minutes to read
IMPORTANT
vacuum removes all files from directories not managed by Delta Lake, ignoring directories beginning with _ . If you
are storing additional metadata like Structured Streaming checkpoints within a Delta table directory, use a directory
name such as _checkpoints .
vacuum deletes only data files, not log files. Log files are deleted automatically and asynchronously after checkpoint
operations. The default retention period of log files is 30 days, configurable through the
delta.logRetentionDuration property which you set with the ALTER TABLE SET TBLPROPERTIES SQL method. See
Table properties.
The ability to time travel back to a version older than the retention period is lost after running vacuum .
NOTE
When the Delta cache is enabled, a cluster might contain data from Parquet files that have been deleted with vacuum .
Therefore, it may be possible to query the data of previous table versions whose files have been deleted. Restarting the
cluster will remove the cached data. See Configure the Delta cache.
SQL
VACUUM eventsTable -- vacuum files not required by versions older than the default retention period
VACUUM delta.`/data/events/`
VACUUM delta.`/data/events/` RETAIN 100 HOURS -- vacuum files not required by versions more than 100 hours old
VACUUM eventsTable DRY RUN -- do dry run to get the list of files to be deleted
NOTE
The Python API is available in Databricks Runtime 6.1 and above.
from delta.tables import *
deltaTable.vacuum() # vacuum files not required by versions older than the default retention period
deltaTable.vacuum(100) # vacuum files not required by versions more than 100 hours old
Scala
NOTE
The Scala API is available in Databricks Runtime 6.0 and above.
import io.delta.tables._
deltaTable.vacuum() // vacuum files not required by versions older than the default retention period
deltaTable.vacuum(100) // vacuum files not required by versions more than 100 hours old
Java
NOTE
The Java API is available in Databricks Runtime 6.0 and above.
import io.delta.tables.*;
import org.apache.spark.sql.functions;
deltaTable.vacuum(); // vacuum files not required by versions older than the default retention period
deltaTable.vacuum(100); // vacuum files not required by versions more than 100 hours old
See the Delta Lake APIs for Scala, Java, and Python syntax details.
WARNING
It is recommended that you set a retention interval to be at least 7 days, because old snapshots and uncommitted files
can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can
fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed. You must choose an
interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag
behind the most recent update to the table.
Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain
that there are no operations being performed on this table that take longer than the retention interval you plan
to specify, you can turn off this safety check by setting the Spark configuration property
spark.databricks.delta.retentionDurationCheck.enabled to false .
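A sketch of turning off the check for the current session, assuming you have verified that no longer-running operations depend on the files to be removed:
Python
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")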
Audit information
VACUUM commits to the Delta transaction log contain audit information. You can query the audit events using
DESCRIBE HISTORY .
NOTE
The Python API is available in Databricks Runtime 6.1 and above.
Scala
NOTE
The Scala API is available in Databricks Runtime 6.0 and above.
import io.delta.tables._
Java
NOTE
The Java API is available in Databricks Runtime 6.0 and above.
import io.delta.tables.*;
NOTE
Operation metrics are available only when the history command and the operation in the history were run using
Databricks Runtime 6.5 or above.
A few of the other columns are not available if you write into a Delta table using the following methods:
JDBC or ODBC
JAR job
spark-submit job
Run a command using the REST API
Columns added in the future will always be added after the last column.
STREAMING UPDATE
DELETE
TRUNCATE
MERGE
UPDATE
OPTIMIZE
CLONE (1)
RESTORE (2)
VACUUM (3)
properties (string-string map): All the properties set for this table.
Example output:
format: delta
id: d31f82d2-a69f-42e...
name: default.deltatable
description: null
location: file:/Users/tuor/...
createdAt: 2020-06-05 12:20:...
lastModified: 2020-06-05 12:20:20
partitionColumns: []
numFiles: 10
sizeInBytes: 12345
properties: []
minReaderVersion: 1
minWriterVersion: 2
NOTE
If a Parquet table was created by Structured Streaming, the listing of files can be avoided by using the _spark_metadata
sub-directory as the source of truth for files contained in the table, by setting the SQL configuration
spark.databricks.delta.convert.useMetadataLog to true .
SQL
-- Convert partitioned Parquet table at path '<path-to-table>' and partitioned by integer columns named 'part' and 'part2'
CONVERT TO DELTA parquet.`<path-to-table>` PARTITIONED BY (part int, part2 int)
NOTE
The Python API is available in Databricks Runtime 6.1 and above.
# Convert partitioned parquet table at path '<path-to-table>' and partitioned by integer column named 'part'
partitionedDeltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`", "part int")
Scala
NOTE
The Scala API is available in Databricks Runtime 6.0 and above.
import io.delta.tables._
// Convert partitioned Parquet table at path '<path-to-table>' and partitioned by integer columns named 'part' and 'part2'
val partitionedDeltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`", "part int, part2 int")
Java
NOTE
The Java API is available in Databricks Runtime 6.0 and above.
import io.delta.tables.*;
// Convert partitioned Parquet table at path '<path-to-table>' and partitioned by integer columns named 'part' and 'part2'
DeltaTable deltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`", "part int, part2 int");
NOTE
Any file not tracked by Delta Lake is invisible and can be deleted when you run vacuum . You should avoid updating or
appending data files during the conversion process. After the table is converted, make sure all writes go through Delta
Lake.
Convert an Iceberg table to a Delta table
NOTE
This feature is in Public Preview.
This feature is supported in Databricks Runtime 10.4 and above.
You can convert an Iceberg table to a Delta table in place if the underlying file format of the Iceberg table is
Parquet. The following command creates a Delta Lake transaction log based on the Iceberg table’s native file
manifest, schema and partitioning information. The converter also collects column stats during the conversion,
unless NO STATISTICS is specified.
-- Convert the Iceberg table in the path <path-to-table> without collecting statistics.
CONVERT TO DELTA iceberg.`<path-to-table>` NO STATISTICS
NOTE
Converting Iceberg metastore tables is not supported.
You can restore a Delta table to its earlier state by using the RESTORE command. A Delta table internally
maintains historic versions of the table that enable it to be restored to an earlier state. A version corresponding
to the earlier state or a timestamp of when the earlier state was created are supported as options by the
RESTORE command.
IMPORTANT
You can restore an already restored table.
You can restore a cloned table.
Restoring a table to an older version where the data files were deleted manually or by vacuum will fail. Restoring
to this version partially is still possible if spark.sql.files.ignoreMissingFiles is set to true .
The timestamp format for restoring to an earlier state is yyyy-MM-dd HH:mm:ss . Providing only a date(
yyyy-MM-dd ) string is also supported.
SQL
Python
Scala
import io.delta.tables._
Java
import io.delta.tables.*;
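As a sketch (the table name, version, and timestamp are hypothetical), a table can be restored by version or by timestamp:
Python
spark.sql("RESTORE TABLE my_table TO VERSION AS OF 8")
spark.sql("RESTORE TABLE my_table TO TIMESTAMP AS OF '2022-01-01'")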
Table version | Operation | Delta log updates | Records in data change log updates
In the preceding example, the RESTORE command results in updates that were already seen when reading the Delta
table version 0 and 1. If a streaming query was reading this table, then these files will be considered as newly added data
and will be processed again.
Restore metrics
NOTE
Available in Databricks Runtime 8.2 and above.
RESTORE reports the following metrics as a single row DataFrame once the operation is complete:
table_size_after_restore : The size of the table after restoring.
num_of_files_after_restore : The number of files in the table after restoring.
num_removed_files : Number of files removed (logically deleted) from the table.
num_restored_files : Number of files restored due to rolling back.
removed_files_size : Total size in bytes of the files that are removed from the table.
restored_files_size : Total size in bytes of the files that are restored.
You can create a copy of an existing Delta table at a specific version using the clone command. Clones can be
either deep or shallow.
In this section:
Clone types
Clone metrics
Permissions
Clone use cases
Clone types
A deep clone is a clone that copies the source table data to the clone target in addition to the metadata of the
existing table. Additionally, stream metadata is also cloned such that a stream that writes to the Delta table
can be stopped on a source table and continued on the target of a clone from where it left off.
A shallow clone is a clone that does not copy the data files to the clone target. The table metadata is
equivalent to the source. These clones are cheaper to create.
Any changes made to either deep or shallow clones affect only the clones themselves and not the source table.
The metadata that is cloned includes: schema, partitioning information, invariants, nullability. For deep clones
only, stream and COPY INTO metadata are also cloned. Metadata not cloned are the table description and user-
defined commit metadata.
IMPORTANT
Shallow clones reference data files in the source directory. If you run vacuum on the source table, clients will no longer
be able to read the referenced data files and a FileNotFoundException will be thrown. In this case, running clone
with replace over the shallow clone will repair the clone. If this occurs often, consider using a deep clone instead which
does not depend on the source table.
Deep clones do not depend on the source from which they were cloned, but are expensive to create because a deep
clone copies the data as well as the metadata.
Cloning with replace to a target that already has a table at that path creates a Delta log if one does not exist at that
path. You can clean up any existing data by running vacuum .
If an existing Delta table exists, a new commit is created that includes the new metadata and new data from the source
table. This new commit is incremental, meaning that only new changes since the last clone are committed to the table.
Cloning a table is not the same as Create Table As Select or CTAS . A clone copies the metadata of the source
table in addition to the data. Cloning also has simpler syntax: you don’t need to specify partitioning, format, invariants,
nullability and so on as they are taken from the source table.
A cloned table has an independent history from its source table. Time travel queries on a cloned table will not work
with the same inputs as they work on its source table.
SQL
CREATE TABLE delta.`/data/target/` CLONE delta.`/data/source/` -- Create a deep clone of /data/source at /data/target
CREATE TABLE IF NOT EXISTS delta.`/data/target/` CLONE db.source_table -- No-op if the target table exists
Python
Scala
import io.delta.tables._
Java
import io.delta.tables.*;
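As a sketch (database and table names are hypothetical), deep and shallow clones can also be created through SQL:
Python
# Deep clone: copies the data and metadata of the source table.
spark.sql("CREATE OR REPLACE TABLE db.target_deep CLONE db.source_table")

# Shallow clone: copies only metadata; data files are still referenced in the source.
spark.sql("CREATE TABLE IF NOT EXISTS db.target_shallow SHALLOW CLONE db.source_table VERSION AS OF 5")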
CLONE reports the following metrics as a single row DataFrame once the operation is complete:
source_table_size : Size of the source table that’s being cloned in bytes.
source_num_of_files : The number of files in the source table.
num_removed_files : If the table is being replaced, how many files are removed from the current table.
num_copied_files : Number of files that were copied from the source (0 for shallow clones).
removed_files_size : Size in bytes of the files that are being removed from the current table.
copied_files_size : Size in bytes of the files copied to the table.
Permissions
You must configure permissions for Azure Databricks table access control and your cloud provider.
Table access control
The following permissions are required for both deep and shallow clones:
SELECT permission on the source table.
If you are using CLONE to create a new table, CREATE permission on the database in which you are creating
the table.
If you are using CLONE to replace a table, you must have MODIFY permission on the table.
Cloud provider permissions
If you have created a deep clone, any user that reads the deep clone must have read access to the clone’s
directory. To make changes to the clone, users must have write access to the clone’s directory.
If you have created a shallow clone, any user that reads the shallow clone needs permission to read the files in
the original table, since the data files remain in the source table with shallow clones, as well as the clone’s
directory. To make changes to the clone, users will need write access to the clone’s directory.
Clone use cases
In this section:
Data archiving
Machine learning flow reproduction
Short-term experiments on a production table
Data sharing
Table property overrides
Data archiving
Data may need to be kept for longer than is feasible with time travel or for disaster recovery. In these cases, you
can create a deep clone to preserve the state of a table at a certain point in time for archival. Incremental
archiving is also possible to keep a continually updating state of a source table for disaster recovery.
-- This should leverage the update information in the clone to prune to only
-- changed files in the clone if possible
MERGE INTO my_prod_table
USING my_test
ON my_test.user_id <=> my_prod_table.user_id
WHEN MATCHED AND my_test.user_id is null THEN UPDATE *;
Data sharing
Other business units within a single organization may want to access the same data but may not require the
latest updates. Instead of giving access to the source table directly, you can provide clones with different
permissions for different business units. The performance of the clone can exceed that of a simple view.
NOTE
Available in Databricks Runtime 7.5 and above.
Python
dt = DeltaTable.forName(spark, "prod.my_table")
tblProps = {
"delta.logRetentionDuration": "3650 days",
"delta.deletedFileRetentionDuration": "3650 days"
}
dt.clone('xx://archive/my_table', isShallow=False, replace=True, properties=tblProps)
Scala
To get the version number of the last commit written by the current SparkSession across all threads and all
tables, query the SQL configuration spark.databricks.delta.lastCommitVersionInSession .
SQL
SET spark.databricks.delta.lastCommitVersionInSession
Python
spark.conf.get("spark.databricks.delta.lastCommitVersionInSession")
Scala
spark.conf.get("spark.databricks.delta.lastCommitVersionInSession")
If no commits have been made by the SparkSession , querying the key returns an empty value.
NOTE
If you share the same SparkSession across multiple threads, it’s similar to sharing a variable across multiple threads;
you may hit race conditions as the configuration value is updated concurrently.
Delta Lake APIs
7/21/2022 • 2 minutes to read
For most read and write operations on Delta tables, you can use Apache Spark reader and writer APIs. For
examples, see Table batch reads and writes and Table streaming reads and writes.
However, there are some operations that are specific to Delta Lake and you must use Delta Lake APIs. For
examples, see Table utility commands.
NOTE
Some Delta Lake APIs are still evolving and are indicated with the Evolving qualifier in the API docs.
Azure Databricks ensures binary compatibility between the Delta Lake project and Delta Lake in Databricks
Runtime. To view the Delta Lake API version packaged in each Databricks Runtime version and links to the API
documentation, see the Delta Lake API compatibility matrix.
Concurrency control
7/21/2022 • 4 minutes to read
Delta Lake provides ACID transaction guarantees between reads and writes. This means that:
Multiple writers across multiple clusters can simultaneously modify a table partition and see a consistent
snapshot view of the table and there will be a serial order for these writes.
Readers continue to see a consistent snapshot view of the table that the Azure Databricks job started with,
even when a table is modified during a job.
Write conflicts
The following table describes which pairs of write operations can conflict in each isolation level.
Conflict exceptions
When a transaction conflict occurs, you will observe one of the following exceptions:
ConcurrentAppendException
ConcurrentDeleteReadException
ConcurrentDeleteDeleteException
MetadataChangedException
ConcurrentTransactionException
ProtocolChangedException
ConcurrentAppendException
This exception occurs when a concurrent operation adds files in the same partition (or anywhere in an
unpartitioned table) that your operation reads. The file additions can be caused by INSERT , DELETE , UPDATE , or
MERGE operations.
With the default isolation level of WriteSerializable , files added by blind INSERT operations (that is, operations
that blindly append data without reading any data) do not conflict with any operation, even if they touch the
same partition (or anywhere in an unpartitioned table). If the isolation level is set to Serializable , then blind
appends may conflict.
This exception is often thrown during concurrent DELETE , UPDATE , or MERGE operations. While the concurrent
operations may be physically updating different partition directories, one of them may read the same partition
that the other one concurrently updates, thus causing a conflict. You can avoid this by making the separation
explicit in the operation condition. Consider the following example.
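As a sketch of such an operation (the table, source, and column names are hypothetical; the target is partitioned by date and country, but the condition constrains neither):
Python
(deltaTable.alias("t")
  .merge(updates.alias("s"), "s.user_id = t.user_id")
  .whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .execute())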
Suppose you run the above code concurrently for different dates or countries. Since each job is working on an
independent partition on the target Delta table, you don’t expect any conflicts. However, the condition is not
explicit enough and can scan the entire table and can conflict with concurrent operations updating any other
partitions. Instead, you can rewrite your statement to add specific date and country to the merge condition, as
shown in the following example.
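A sketch of the same merge with the partition columns pinned in the condition (again with hypothetical names); concurrent runs for different dates or countries then operate on disjoint partitions:
Python
(deltaTable.alias("t")
  .merge(
      updates.alias("s"),
      "s.user_id = t.user_id AND t.date = '2022-01-01' AND t.country = 'USA'")
  .whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .execute())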
ConcurrentTransactionException
This exception occurs if a streaming query using the same checkpoint location is started multiple times concurrently
and tries to write to the Delta table at the same time. You should never have two streaming queries use the same
checkpoint location and run at the same time.
ProtocolChangedException
This exception can occur in the following cases:
When your Delta table is upgraded to a new version. For future operations to succeed you may need to
upgrade your Delta Lake version.
When multiple writers are creating or replacing a table at the same time.
When multiple writers are writing to an empty path at the same time.
See Table protocol versioning for more details.
Best practices: Delta Lake
7/21/2022 • 3 minutes to read
Compact files
If you continuously write data to a Delta table, it will over time accumulate a large number of files, especially if
you add data in small batches. This can have an adverse effect on the efficiency of table reads, and it can also
affect the performance of your file system. Ideally, a large number of small files should be rewritten into a
smaller number of larger files on a regular basis. This is known as compaction.
You can compact a table using the OPTIMIZE command.
dataframe.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.partitionBy(<your-partition-columns>) \
.saveAsTable("<your-table>") # Managed table
dataframe.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.option("path", "<your-table-path>") \
.partitionBy(<your-partition-columns>) \
.saveAsTable("<your-table>") # External table
SQL
REPLACE TABLE <your-table> USING DELTA PARTITIONED BY (<your-partition-columns>) AS SELECT ... -- Managed
table
REPLACE TABLE <your-table> USING DELTA PARTITIONED BY (<your-partition-columns>) LOCATION "<your-table-
path>" AS SELECT ... -- External table
Scala
dataframe.write
.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy(<your-partition-columns>)
.saveAsTable("<your-table>") // Managed table
dataframe.write
.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.option("path", "<your-table-path>")
.partitionBy(<your-partition-columns>)
.saveAsTable("<your-table>") // External table
Spark caching
Databricks does not recommend that you use Spark caching for the following reasons:
You lose any data skipping that can come from additional filters added on top of the cached DataFrame .
The data that gets cached may not be updated if the table is accessed using a different identifier (for example,
you do spark.table(x).cache() but then write to the table using spark.write.save(/some/path) ).
Frequently asked questions (FAQ)
7/21/2022 • 5 minutes to read
Does Delta Lake support writes or reads using the Spark Streaming
DStream API?
Delta does not support the DStream API. We recommend Table streaming reads and writes.
When I use Delta Lake, will I be able to port my code to other Spark
platforms easily?
Yes. When you use Delta Lake, you are using open Apache Spark APIs so you can easily port your code to other
Spark platforms. To port your code, replace delta format with parquet format.
What DDL and DML features does Delta Lake not support?
Unsupported DDL features:
ANALYZE TABLE PARTITION
ALTER TABLE [ADD|DROP] PARTITION
ALTER TABLE RECOVER PARTITIONS
ALTER TABLE SET SERDEPROPERTIES
CREATE TABLE LIKE
INSERT OVERWRITE DIRECTORY
LOAD DATA
Unsupported DML features:
INSERT INTO [OVERWRITE] table with static partitions
INSERT OVERWRITE TABLE for tables with dynamic partitions
Bucketing
Specifying a schema when reading from a table
Specifying target partitions using PARTITION (part_spec) in TRUNCATE TABLE
Snapshot isolation for readers: Long running jobs will continue to read a consistent snapshot from the
moment the jobs started, even if the table is modified concurrently. Running VACUUM with a retention
less than the length of these jobs can cause them to fail with a FileNotFoundException .
Streaming from Delta tables: Streams read from the original files written into a table in order to
ensure exactly once processing. When combined with OPTIMIZE , VACUUM with zero retention can
remove these files before the stream has time to process them, causing it to fail.
For these reasons Databricks recommends using this technique only on static data sets that must be read
by external tools.
External writes: Delta Lake maintains additional metadata in a transaction log to enable ACID transactions
and snapshot isolation for readers. To ensure the transaction log is updated correctly and the proper
validations are performed, writer implementations must strictly adhere to the Delta Transaction Protocol.
Delta Lake in Databricks Runtime ensures ACID guarantees based on the Delta Transaction Protocol.
Whether non-Spark Delta connectors that write to Delta tables can write with ACID guarantees depends
on the connector implementation. For information, see the integration-specific documentation on their write
guarantees.
Delta Lake resources
7/21/2022 • 2 minutes to read
Examples
The Delta Lake GitHub repository has Scala and Python examples.
Data types
Delta Lake supports all of the data types listed in Data types except for Interval data types.
Optimizations
7/21/2022 • 2 minutes to read
Azure Databricks provides optimizations for Delta Lake that accelerate data lake operations, supporting a variety
of workloads ranging from large-scale ETL processing to ad-hoc, interactive queries. Many of these
optimizations take place automatically; you get their benefits simply by using Azure Databricks for your data
lakes.
Optimize performance with file management
Compaction (bin-packing)
Data skipping
Z-Ordering (multi-dimensional clustering)
Tune file size
Notebooks
Improve interactive query performance
Frequently asked questions (FAQ)
Auto Optimize
How Auto Optimize works
Enable Auto Optimize
When to opt in and opt out
Example workflow: Streaming ingest with concurrent deletes or updates
Frequently asked questions (FAQ)
Optimize performance with caching
Delta and Apache Spark caching
Delta cache consistency
Use Delta caching
Cache a subset of the data
Monitor the Delta cache
Configure the Delta cache
Dynamic file pruning
Isolation levels
Set the isolation level
Bloom filter indexes
How Bloom filter indexes work
Configuration
Create a Bloom filter index
Drop a Bloom filter index
Display the list of Bloom filter indexes
Notebook
Low Shuffle Merge
Optimized performance
Optimized data layout
Availability
Optimize join performance
Range join optimization
Skew join optimization
Optimized data transformation
Higher-order functions
Transform complex data types
Additional resources
Cost-based optimizer
Optimize performance with file management
7/21/2022 • 16 minutes to read
To improve query speed, Delta Lake on Azure Databricks supports the ability to optimize the layout of data
stored in cloud storage. Delta Lake on Azure Databricks supports two layout algorithms: bin-packing and Z-
Ordering.
This article describes:
How to run the optimization commands.
How the two layout algorithms work.
How to clean up stale table snapshots.
The FAQ explains why optimization is not automatic and includes recommendations for how often to run
optimize commands.
For notebooks that demonstrate the benefits of optimization, see Optimization examples.
For Delta Lake on Azure Databricks SQL optimization command reference information, see
Databricks Runtime 7.x and above: OPTIMIZE (Delta Lake on Azure Databricks)
Databricks Runtime 5.5 LTS and 6.x: Optimize (Delta Lake on Azure Databricks)
Compaction (bin-packing)
Delta Lake on Azure Databricks can improve the speed of read queries from a table. One way to improve this
speed is to coalesce small files into larger ones. You trigger compaction by running the OPTIMIZE command:
SQL
OPTIMIZE delta.`/data/events`
Python
Scala
import io.delta.tables._
val deltaTable = DeltaTable.forPath(spark, "/data/events")
deltaTable.optimize().executeCompaction()
or
SQL
OPTIMIZE events
Python
from delta.tables import *
deltaTable = DeltaTable.forName(spark, "events")
deltaTable.optimize().executeCompaction()
Scala
import io.delta.tables._
val deltaTable = DeltaTable.forName(spark, "events")
deltaTable.optimize().executeCompaction()
If you have a large amount of data and only want to optimize a subset of it, you can specify an optional partition
predicate using WHERE :
SQL
Python
Scala
import io.delta.tables._
val deltaTable = DeltaTable.forName(spark, "events")
deltaTable.optimize().where("date='2021-11-18'").executeCompaction()
NOTE
Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no
effect.
Bin-packing aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number
of tuples per file. However, the two measures are most often correlated.
Python and Scala APIs for executing OPTIMIZE operation are available from Databricks Runtime 11.0 and above.
Readers of Delta tables use snapshot isolation, which means that they are not interrupted when OPTIMIZE
removes unnecessary files from the transaction log. OPTIMIZE makes no data-related changes to the table, so a
read before and after an OPTIMIZE has the same results. Performing OPTIMIZE on a table that is a streaming
source does not affect any current or future streams that treat this table as a source. OPTIMIZE returns the file
statistics (min, max, total, and so on) for the files removed and the files added by the operation. Optimize stats
also contains the Z-Ordering statistics, the number of batches, and partitions optimized.
NOTE
Available in Databricks Runtime 6.0 and above.
You can also compact small files automatically using Auto Optimize.
Data skipping
Data skipping information is collected automatically when you write data into a Delta table. Delta Lake on Azure
Databricks takes advantage of this information (minimum and maximum values) at query time to provide faster
queries. You do not need to configure data skipping; the feature is activated whenever applicable. However, its
effectiveness depends on the layout of your data. For best results, apply Z-Ordering.
For an example of the benefits of Delta Lake on Azure Databricks data skipping and Z-Ordering, see the
notebooks in Optimization examples. By default Delta Lake on Azure Databricks collects statistics on the first 32
columns defined in your table schema. You can change this value using the table property
delta.dataSkippingNumIndexedCols . Adding more columns to collect statistics would add more overhead as you
write files.
Collecting statistics on long strings is an expensive operation. To avoid collecting statistics on long strings, you
can either configure the table property delta.dataSkippingNumIndexedCols to avoid columns containing long
strings or move columns containing long strings to a column position greater than delta.dataSkippingNumIndexedCols
using ALTER TABLE ALTER COLUMN . See:
Databricks Runtime 7.x and above: ALTER TABLE
Databricks Runtime 5.5 LTS and 6.x: Change columns
For the purposes of collecting statistics, each field within a nested column is considered as an individual column.
You can read more on this article in the blog post: Processing Petabytes of Data in Seconds with Databricks
Delta.
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
If you expect a column to be commonly used in query predicates and if that column has high cardinality (that is,
a large number of distinct values), then use ZORDER BY .
You can specify multiple columns for ZORDER BY as a comma-separated list. However, the effectiveness of the
locality drops with each extra column. Z-Ordering on columns that do not have statistics collected on them
would be ineffective and a waste of resources. This is because data skipping requires column-local stats such as
min, max, and count. You can configure statistics collection on certain columns by reordering columns in the
schema, or you can increase the number of columns to collect statistics on. See Data skipping.
NOTE
Z-Ordering is not idempotent but aims to be an incremental operation. The time it takes for Z-Ordering is not
guaranteed to reduce over multiple runs. However, if no new data was added to a partition that was just Z-
Ordered, another Z-Ordering of that partition will not have any effect.
Z-Ordering aims to produce evenly-balanced data files with respect to the number of tuples, but not necessarily
data size on disk. The two measures are most often correlated, but there can be situations when that is not the
case, leading to skew in optimize task times.
For example, if you ZORDER BY date and your most recent records are all much wider (for example longer arrays
or string values) than the ones in the past, it is expected that the OPTIMIZE job’s task durations will be skewed, as
well as the resulting file sizes. This is, however, only a problem for the OPTIMIZE command itself; it should not
have any negative impact on subsequent queries.
NOTE
Available in Databricks Runtime 8.2 and above.
If you want to tune the size of files in your Delta table, set the table property delta.targetFileSize to the
desired size. If this property is set, all data layout optimization operations will make a best-effort attempt to
generate files of the specified size. Examples here include optimize with Compaction (bin-packing) or Z-
Ordering (multi-dimensional clustering), Auto Compaction, and Optimized Writes.
Table property: delta.targetFileSize
For existing tables, you can set and unset properties using the SQL command ALTER TABLE SET TBLPROPERTIES.
You can also set these properties automatically when creating new tables using Spark session
configurations. See Table properties for details.
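For example, a sketch of pinning a target file size on an existing table (the table name and size value are hypothetical):
Python
spark.sql("ALTER TABLE my_table SET TBLPROPERTIES (delta.targetFileSize = '128mb')")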
Autotune based on workload
NOTE
Available in Databricks Runtime 8.2 and above.
To minimize the need for manual tuning, Azure Databricks can automatically tune the file size of Delta tables,
based on workloads operating on the table. Azure Databricks can automatically detect if a Delta table has
frequent MERGE operations that rewrite files and may choose to reduce the size of rewritten files in anticipation
of further file rewrites in the future. For example, when executing a MERGE operation, if 9 out of last 10 previous
operations on the table were also MERGEs, then Optimized Writes and Auto Compaction used by MERGE (if
enabled) will generate smaller file sizes than it would otherwise. This helps in reducing the duration of future
MERGE operations.
Autotune is activated after a few rewrite operations have occurred. However, if you anticipate a Delta table will
experience frequent MERGE , UPDATE , or DELETE operations and want this tuning immediately, you can explicitly
tune file sizes for rewrites by setting the table property delta.tuneFileSizesForRewrites . Set this property to
true to always use lower file sizes for all data layout optimization operations on the table. Set it to false to
never tune to lower file sizes, that is, prevent auto-detection from being activated.
Table property: delta.tuneFileSizesForRewrites (Type: Boolean)
For existing tables, you can set and unset properties using the SQL command ALTER TABLE SET TBLPROPERTIES.
You can also set these properties automatically when creating new tables using Spark session
configurations. See Table properties for details.
Autotune based on table size
NOTE
Available in Databricks Runtime 8.4 and above.
To minimize the need for manual tuning, Azure Databricks automatically tunes the file size of Delta tables based
on the size of the table. Azure Databricks will use smaller file sizes for smaller tables and larger file sizes for
larger tables so that the number of files in the table does not grow too large. Azure Databricks does not
autotune tables that you have tuned with a specific target size or based on a workload with frequent rewrites.
The target file size is based on the current size of the Delta table. For tables smaller than 2.56 TB, the autotuned
target file size is 256 MB. For tables with a size between 2.56 TB and 10 TB, the target size will grow linearly
from 256 MB to 1 GB. For tables larger than 10 TB, the target file size is 1 GB.
NOTE
When the target file size for a table grows, existing files are not re-optimized into larger files by the OPTIMIZE command.
A large table can therefore always have some files that are smaller than the target size. If it is required to optimize those
smaller files into larger files as well, you can configure a fixed target file size for the table using the
delta.targetFileSize table property.
When a table is written incrementally, the target file sizes and file counts will be close to the following numbers,
based on table size. The file counts in this table are only an example. The actual results will be different
depending on many factors.
Table size | Target file size | Approximate number of files in table
10 GB | 256 MB | 40
1 TB | 256 MB | 4096
3 TB | 307 MB | 12108
5 TB | 512 MB | 17339
7 TB | 716 MB | 20784
10 TB | 1 GB | 24437
20 TB | 1 GB | 34437
50 TB | 1 GB | 64437
100 TB | 1 GB | 114437
Notebooks
For an example of the benefits of optimization, see the following notebooks:
Optimization examples
Delta Lake on Databricks optimizations Python notebook
Delta Lake on Databricks optimizations Scala notebook
Delta Lake on Databricks optimizations SQL notebook
IMPORTANT
Delta Lake checkpoints are different than Structured Streaming checkpoints.
In Databricks Runtime 7.2 and below, column-level statistics are stored in Delta Lake checkpoints as a JSON
column.
In Databricks Runtime 7.3 LTS and above, column-level statistics are stored as a struct. The struct format makes
Delta Lake reads much faster, because:
Delta Lake doesn’t perform expensive JSON parsing to obtain column-level statistics.
Parquet column pruning capabilities significantly reduce the I/O required to read the statistics for a column.
The struct format enables a collection of optimizations that reduce the overhead of Delta Lake read operations
from seconds to tens of milliseconds, which significantly reduces the latency for short queries.
Manage column-level statistics in checkpoints
You manage how statistics are written in checkpoints using the table properties
delta.checkpoint.writeStatsAsJson and delta.checkpoint.writeStatsAsStruct . If both table properties are
false , Delta Lake cannot perform data skipping.
IMPORTANT
Enhanced checkpoints do not break compatibility with open source Delta Lake readers. However, setting
delta.checkpoint.writeStatsAsJson to false may have implications on proprietary Delta Lake readers. Contact
your vendors to learn more about performance implications.
If you do not use Databricks Runtime 7.2 or below to query your data, you can also improve the checkpoint
write latency by setting the following table properties:
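A sketch of one such configuration, assuming the intent is to keep struct-format statistics and skip the JSON form; the table name is hypothetical and the exact property values should be verified against your runtime:
Python
spark.sql("""
  ALTER TABLE my_table SET TBLPROPERTIES (
    'delta.checkpoint.writeStatsAsStruct' = 'true',
    'delta.checkpoint.writeStatsAsJson' = 'false'
  )
""")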
Disable writes from clusters that write checkpoints without the stats struct
Writers in Databricks Runtime 7.2 and below write checkpoints without the stats struct, which prevents
optimizations for Databricks Runtime 7.3 LTS readers.
To block clusters running Databricks Runtime 7.2 and below from writing to a Delta table, you can upgrade the
Delta table using the upgradeTableProtocol method:
Python
Scala
import io.delta.tables.DeltaTable
val delta = DeltaTable.forPath(spark, "path_to_table") // or DeltaTable.forName
delta.upgradeTableProtocol(1, 3)
WARNING
Applying the upgradeTableProtocol method prevents clusters running Databricks Runtime 7.2 and below from writing
to your table and this change is irreversible. We recommend upgrading your tables only after you are committed to the
new format. You can try out these optimizations by creating a shallow CLONE of your tables using Databricks Runtime 7.3
LTS.
Once you upgrade the table writer version, writers must obey your settings for
'delta.checkpoint.writeStatsAsStruct' and 'delta.checkpoint.writeStatsAsJson' .
The following table summarizes how to take advantage of enhanced checkpoints in various versions of
Databricks Runtime, table protocol versions, and writer types.
Auto Optimize is an optional set of features that automatically compact small files during individual writes to a
Delta table. Paying a small cost during writes offers significant benefits for tables that are queried actively. Auto
Optimize is particularly useful in the following scenarios:
Streaming use cases where latency in the order of minutes is acceptable
MERGE INTO is the preferred method of writing into Delta Lake
CREATE TABLE AS SELECT or INSERT INTO are commonly used operations
CREATE TABLE student (id INT, name STRING, age INT) TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)
Existing tables : Set the table properties delta.autoOptimize.optimizeWrite = true and
delta.autoOptimize.autoCompact = true in the ALTER TABLE command.
In Databricks Runtime 10.1 and above, the table property delta.autoOptimize.autoCompact also accepts the
values auto and legacy in addition to true and false . When set to auto (recommended), Auto Compaction
uses better defaults, such as setting 32 MB as the target file size (although default behaviors are subject to
change in the future). When set to legacy or true , Auto Compaction uses 128 MB as the target file size.
In addition, you can enable and disable both of these features for Spark sessions with the configurations:
spark.databricks.delta.optimizeWrite.enabled
spark.databricks.delta.autoCompact.enabled
The session configurations take precedence over the table properties allowing you to better control when to opt
in or opt out of these features.
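A sketch of enabling both features for the current session:
Python
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")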
If Auto Compaction fails due to a transaction conflict, Azure Databricks does not fail or retry the
compaction. The corresponding write query (which triggered the Auto Compaction) will succeed even if
the Auto Compaction does not succeed.
In DBR 10.4 and above, this is not an issue: Auto Compaction does not cause transaction conflicts to other
concurrent operations like DELETE , MERGE , or UPDATE . The other concurrent transactions are given
higher priority and will not fail due to Auto Compaction.
This ensures that the number of files written by the stream and the delete and update jobs are of optimal
size.
Enable Auto Compaction on the session level using the following setting on the job that performs the
delete or update.
This allows files to be compacted across your table. Since it happens after the delete or update, you
mitigate the risks of a transaction conflict.
The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast
intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote
location. Successive reads of the same data are then performed locally, which results in significantly improved
reading speed.
The Delta cache works for all Parquet files and is not limited to Delta Lake format files. The Delta cache supports
reading Parquet files in DBFS, HDFS, Azure Blob storage, Azure Data Lake Storage Gen1, and Azure Data Lake
Storage Gen2. It does not support other storage formats such as CSV, JSON, and ORC.
NOTE
You can use Delta caching and Apache Spark caching at the same time.
Summary
The following table summarizes the key differences between Delta and Apache Spark caching so that you can
choose the best tool for your workflow:
Feature | Delta cache | Apache Spark cache
Applied to | Any Parquet table stored on WASB and other file systems. | Any DataFrame or RDD.
Triggered | Automatically, on the first read (if cache is enabled). | Manually, requires code changes.
You don’t need to use this command for the Delta cache to work correctly (the data will be cached automatically
when first accessed). But it can be helpful when you require consistent query performance.
For examples and more details, see
Databricks Runtime 7.x and above: CACHE SELECT
Databricks Runtime 5.5 LTS and 6.x: Cache Select (Delta Lake on Azure Databricks)
NOTE
When a worker is decommissioned, the Spark cache stored on that worker is lost. So if autoscaling is enabled, there is
some instability with the cache. Spark would then need to reread missing partitions from source as needed.
Example configuration:
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false
Disabling the cache does not result in dropping the data that is already in the local storage. Instead, it prevents
queries from adding new data to the cache and reading data from the cache.
Dynamic file pruning
7/21/2022 • 2 minutes to read
Dynamic file pruning (DFP) can significantly improve the performance of many queries on Delta tables. DFP is
especially efficient for non-partitioned tables, or for joins on non-partitioned columns. The performance impact
of DFP is often correlated with the clustering of data, so consider using Z-Ordering to maximize the benefit of DFP.
For background and use cases for DFP, see Faster SQL Queries on Delta Lake with Dynamic File Pruning.
NOTE
Available in Databricks Runtime 6.1 and above.
The isolation level of a table defines the degree to which a transaction must be isolated from modifications
made by concurrent transactions. Delta Lake on Azure Databricks supports two isolation levels: Serializable and
WriteSerializable.
Serializable : The strongest isolation level. It ensures that committed write operations and all reads are
Serializable. Operations are allowed as long as there exists a serial sequence of executing them one-at-a-
time that generates the same outcome as that seen in the table. For the write operations, the serial
sequence is exactly the same as that seen in the table’s history.
WriteSerializable (Default) : A weaker isolation level than Serializable. It ensures only that the write
operations (that is, not reads) are serializable. However, this is still stronger than Snapshot isolation.
WriteSerializable is the default isolation level because it provides great balance of data consistency and
availability for most common operations.
In this mode, the content of the Delta table may be different from that which is expected from the
sequence of operations seen in the table history. This is because this mode allows certain pairs of
concurrent writes (say, operations X and Y) to proceed such that the result would be as if Y was
performed before X (that is, serializable between them) even though the history would show that Y was
committed after X. To disallow this reordering, set the table isolation level to be Serializable to cause
these transactions to fail.
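For example, a minimal sketch of raising a table's isolation level (the table name is illustrative):

# Require full Serializable isolation for writes to this table.
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')")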
Read operations always use snapshot isolation. The write isolation level determines whether or not it is possible
for a reader to see a snapshot of a table that, according to the history, “never existed”.
For the Serializable level, a reader always sees only tables that conform to the history. For the WriteSerializable
level, a reader could see a table that does not exist in the Delta log.
For example, consider txn1, a long running delete and txn2, which inserts data deleted by txn1. txn2 and txn1
complete and they are recorded in that order in the history. According to the history, the data inserted in txn2
should not exist in the table. For Serializable level, a reader would never see data inserted by txn2. However, for
the WriteSerializable level, a reader could at some point see the data inserted by txn2.
For more information on which types of operations can conflict with each other in each isolation level and the
possible errors, see Concurrency control.
A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns,
particularly for fields containing arbitrary text.
Bloom filters support columns with the following (input) data types: byte , short , int , long , float , double ,
date , timestamp , and string . Nulls are not added to the Bloom filter, so any null related filter requires reading
the data file. Azure Databricks supports the following data source filters: and , or , in , equals , and
equalsnullsafe . Bloom filters are not supported on nested columns.
Configuration
Bloom filters are enabled by default. To disable Bloom filters, set the session level
spark.databricks.io.skipping.bloomFilter.enabled configuration to false .
For example:
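A minimal sketch (the configuration key is the one above; the table and column names are illustrative):

# Disable Bloom filter indexing for the current session.
spark.conf.set("spark.databricks.io.skipping.bloomFilter.enabled", "false")

# Or, with the feature enabled, create a Bloom filter index on a specific column.
spark.sql("""
  CREATE BLOOMFILTER INDEX ON TABLE events
  FOR COLUMNS (device_id OPTIONS (fpp = 0.1, numItems = 50000000))
""")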
Notebook
The following notebook demonstrates how defining a Bloom filter index speeds up “needle in a haystack”
queries.
Bloom filter demo notebook
Get notebook
Optimize join performance
7/21/2022 • 2 minutes to read
Delta Lake on Azure Databricks optimizes range and skew joins. Range join optimizations require tuning based
on your query patterns, and you can make your skew joins efficient with skew hints. See the following articles to
learn how to make best use of these join optimizations:
Range join optimization
Skew join optimization
See also:
Join hints
Join Hints on the Apache Spark website
Range join optimization
7/21/2022 • 7 minutes to read
A range join occurs when two relations are joined using a point in interval or interval overlap condition. The
range join optimization support in Databricks Runtime can bring orders of magnitude improvement in query
performance, but requires careful manual tuning.
-- join two sets of point values within a fixed distance from each other
SELECT *
FROM points1 p1 JOIN points2 p2 ON p1.p >= p2.p - 10 AND p1.p <= p2.p + 10;
NOTE
For DATE values, the value of the bin size is interpreted as days. For example, a bin size value of 7 represents a week.
For TIMESTAMP values, the value of the bin size is interpreted as seconds. If a sub-second value is required, fractional
values can be used. For example, a bin size value of 60 represents a minute, and a bin size value of 0.1 represents 100
milliseconds.
You can specify the bin size either by using a range join hint in the query or by setting a session configuration
parameter. The range join optimization is applied only if you manually specify the bin size. Section Choose the
bin size describes how to choose an optimal bin size.
You can also place a range join hint on one of the joined DataFrames. In that case, the hint contains just the
numeric bin size parameter.
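A minimal sketch of both forms, reusing the points1/points2 relations from the example above with an illustrative bin size of 10:

# Range join hint inside the SQL query; the first argument names one of the joined relations.
spark.sql("""
  SELECT /*+ RANGE_JOIN(p1, 10) */ *
  FROM points1 p1 JOIN points2 p2
    ON p1.p >= p2.p - 10 AND p1.p <= p2.p + 10
""")

# Equivalent hint placed on one of the joined DataFrames.
p1 = spark.table("points1")
p2 = spark.table("points2")
joined = p1.hint("range_join", 10).join(
    p2, (p1.p >= p2.p - 10) & (p1.p <= p2.p + 10)
)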
SET spark.databricks.optimizer.rangeJoin.binSize=5
This configuration parameter applies to any join with a range condition. However, a different bin size set through
a range join hint always overrides the one set through the parameter.
SELECT APPROX_PERCENTILE(CAST(end - start AS DOUBLE), ARRAY(0.5, 0.9, 0.99, 0.999, 0.9999)) FROM ranges
A recommended bin size is the maximum of: the value at the 90th percentile; the value at the 99th percentile
divided by 10; the value at the 99.9th percentile divided by 100; and so on. The rationale is:
If the value at the 90th percentile is the bin size, only 10% of the value interval lengths are longer than the
bin interval, so span more than 2 adjacent bin intervals.
If the value at the 99th percentile is the bin size, only 1% of the value interval lengths span more than 11
adjacent bin intervals.
If the value at the 99.9th percentile is the bin size, only 0.1% of the value interval lengths span more than 101
adjacent bin intervals.
The same can be repeated for the values at the 99.99th, the 99.999th percentile, and so on if needed.
The described method limits the amount of skewed long value intervals that overlap multiple bin intervals. The
bin size value obtained this way is only a starting point for fine tuning; actual results may depend on the specific
workload.
Skew join optimization
7/21/2022 • 2 minutes to read
Data skew is a condition in which a table’s data is unevenly distributed among partitions in the cluster. Data
skew can severely downgrade performance of queries, especially those with joins. Joins between big tables
require shuffling data and the skew can lead to an extreme imbalance of work in the cluster. It’s likely that data
skew is affecting a query if a query appears to be stuck finishing very few tasks (for example, the last 3 tasks out
of 200). To verify that data skew is affecting a query:
1. Click the stage that is stuck and verify that it is doing a join.
2. After the query finishes, find the stage that does a join and check the task duration distribution.
3. Sort the tasks by decreasing duration and check the first few tasks. If one task took much longer to complete
than the other tasks, there is skew.
To ameliorate skew, Delta Lake on Azure Databricks SQL accepts skew hints in queries. With the information
from a skew hint, Databricks Runtime can construct a better query plan, one that does not suffer from data
skew.
NOTE
With Databricks Runtime 7.3 and above, skew join hints are not required. Skew is automatically taken care of if adaptive
query execution (AQE) and spark.sql.adaptive.skewJoin.enabled are both enabled. See Adaptive query execution.
-- multiple columns
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId')) */ *
FROM orders, customers
WHERE o_custId = c_custId AND o_storeRegionId = c_regionId
Configure skew hint with relation name, column names, and skew
values
You can also specify skew values in the hint. Depending on the query and data, the skew values might be known
(for example, because they never change) or might be easy to find out. Doing this reduces the overhead of skew
join optimization. Otherwise, Delta Lake detects them automatically.
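A hedged sketch, reusing the orders/customers example above with illustrative skew values:

# The third argument lists the known skewed values of o_custId.
spark.sql("""
  SELECT /*+ SKEW('orders', 'o_custId', (0, 1)) */ *
  FROM orders, customers
  WHERE o_custId = c_custId
""")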
Azure Databricks optimizes the performance of higher-order functions and DataFrame operations using nested
types. See the following articles to learn how to get started with these optimized higher-order functions and
complex data types:
Higher-order functions
Transform complex data types
Higher-order functions
7/21/2022 • 2 minutes to read
Azure Databricks provides dedicated primitives for manipulating arrays in Apache Spark SQL; these make
working with arrays much easier and more concise and do away with the large amounts of boilerplate code
typically required. The primitives revolve around two functional programming constructs: higher-order
functions and anonymous (lambda) functions. These work together to allow you to define functions that
manipulate arrays in SQL. A higher-order function takes an array, implements how the array is processed, and
determines what the result of the computation will be. It delegates to a lambda function how to process each item
in the array.
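For instance, a minimal sketch of a higher-order function with a lambda (the array literal is illustrative):

# transform applies the lambda x -> x + 1 to every element of the array.
spark.sql("SELECT transform(array(1, 2, 3), x -> x + 1) AS plus_one").show()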
While working with nested data types, Delta Lake on Azure Databricks optimizes certain transformations out-of-
the-box. The following notebooks contain many examples on how to convert between complex and primitive
data types using functions natively supported in Apache Spark SQL.
The transaction log for a Delta table contains protocol versioning information that supports Delta Lake
evolution. Delta Lake tracks minimum reader and writer versions separately.
Delta Lake guarantees backward compatibility. A higher protocol version of Delta Lake reader is always able to
read data that was written by a lower protocol version.
Delta Lake will occasionally break forward compatibility. Lower protocol versions of Delta Lake may not be able
to read and write data that was written by a higher protocol version of Delta Lake. If you try to read and write to
a table with a protocol version of Delta Lake that is too low, you’ll get an error telling you that you need to
upgrade.
When creating a table, Delta Lake chooses the minimum required protocol version based on table characteristics
such as the schema or table properties. You can also set the default protocol versions by setting the SQL
configurations:
spark.databricks.delta.properties.defaults.minWriterVersion = 2 (default)
spark.databricks.delta.properties.defaults.minReaderVersion = 1 (default)
To upgrade a table to a newer protocol version, use the DeltaTable.upgradeTableProtocol method:
WARNING
Protocol version upgrades are irreversible, and upgrading the protocol version may break the existing Delta Lake table
readers, writers, or both. Therefore, we recommend you upgrade specific tables only when needed, such as to opt-in to
new features in Delta Lake. You should also check to make sure that all of your current and future production tools
support Delta Lake tables with the new protocol version.
SQL
-- Upgrades the reader protocol version to 1 and the writer protocol version to 3.
ALTER TABLE <table_identifier> SET TBLPROPERTIES('delta.minReaderVersion' = '1', 'delta.minWriterVersion' =
'3')
Python
from delta.tables import DeltaTable
delta = DeltaTable.forPath(spark, "path_to_table") # or DeltaTable.forName
delta.upgradeTableProtocol(1, 3) # upgrades to readerVersion=1, writerVersion=3
Scala
import io.delta.tables.DeltaTable
val delta = DeltaTable.forPath(spark, "path_to_table") // or DeltaTable.forName
delta.upgradeTableProtocol(1, 3) // Upgrades to readerVersion=1, writerVersion=3.
Features by protocol version
FEATURE   MINWRITERVERSION   MINREADERVERSION   INTRODUCED IN   DOCUMENTATION
NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.
For documentation on these features along with associated use cases and Azure Databricks example
notebooks, see:
Secondary analysis
DNASeq pipeline
RNASeq pipeline
Tumor/Normal pipeline
Variant annotation methods
Joint genotyping
Joint genotyping pipeline
Features of Databricks Runtime for Genomics have been open-sourced as part of the Databricks-Regeneron
project Glow. For information on Glow, see the Glow documentation.
Tertiary analytics with Apache Spark
7/21/2022 • 2 minutes to read
Use the following guides to get started with open source libraries that extend Apache Spark for genomics on
Databricks Runtime.
ADAM
Hail
Create a cluster
Use Hail in a notebook
Glow
Sync Glow notebooks to your workspace
Set up a Glow environment
Get started with Glow
Setup automated jobs
ADAM
7/21/2022 • 2 minutes to read
ADAM is a library for genomic data processing on Apache Spark. It is used to implement pipelines that operate
on genomic read data such as BAM, SAM, and CRAM files.
To use ADAM in Azure Databricks:
1. Launch a Databricks Runtime cluster with these Spark configurations:
# Hadoop configs
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.bdgenomics.adam.serialization.ADAMKryoRegistrator
spark.hadoop.hadoopbam.bam.enable-bai-splitter true
Hail is a library built on Apache Spark for analyzing large genomic datasets.
IMPORTANT
When you use Hail 0.2.65 and above, use Apache Spark version 3.1 (Databricks Runtime 8.x or 9.x)
Install Hail on Databricks Runtime, not Databricks Runtime for Genomics (deprecated)
Hail is not supported with Credential passthrough
Hail is not supported with Glow, except when exporting from Hail to Glow
Create a cluster
Install Hail via Docker with Databricks Container Services.
For containers to set up a Hail environment, see the ProjectGlow Dockerhub page. Use
projectglow/databricks-hail:<hail_version> , replacing the tag with an available Hail version.
hail-create-job.json :
{
"name": "hail",
"notebook_task": {
"notebook_path" : "/Users/<user@organization.com>/hail/docs/hail-tutorial",
},
"new_cluster": {
"spark_version": "<databricks_runtime_version>.x-scala2.12",
"azure_attributes": {
"availability": "SPOT_WITH_FALLBACK_AZURE",
"spot_bid_max_price": -1
},
"node_type_id": "Standard_DS3_v2",
"num_workers": 32,
"docker_image": {
"url": "projectglow/databricks-hail:<hail_version>"
}
}
}
NOTE
Enable skip_logging_configuration to save logs to the rolling driver log4j output. This setting is supported only in
Hail 0.2.39 and above.
import hail as hl
hl.init(sc, idempotent=True, quiet=True, skip_logging_configuration=True)
Glow is an open source project created in collaboration between Databricks and the Regeneron Genetics Center.
For information on features in Glow, see the Glow documentation.
IMPORTANT
Checkpoint to Delta Lake after ingest of or transformations to genotype data.
IMPORTANT
Start small. Experiment on individual variants, samples or chromosomes.
Steps in your pipeline might require a different cluster configuration, depending on the type of computation
performed.
TIP
Use compute-optimized virtual machines to read variant data from cloud object stores.
Use Delta Cache accelerated virtual machines to query variant data.
Use memory-optimized virtual machines for genetic association studies.
Clusters with small machines have a better price-performance ratio when compared with large machines.
The Glow Pipe Transformer supports parallelization of deep learning tools that run on GPUs.
The following example cluster configuration runs a genetic association study on a single chromosome. Edit the
notebook_path and <databricks_runtime_version> as needed.
glow-create-job.json :
{
"name": "glow_gwas",
"notebook_task": {
"notebook_path" : "/Users/<user@organization.com>/glow/docs/source/_static/notebooks/tertiary/gwas-
quantitative",
"base_parameters": {
"allele_freq_cutoff": 0.01
}
},
"new_cluster": {
"spark_version": "<databricks_runtime_version>.x-scala2.12",
"azure_attributes": {
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK_AZURE",
"spot_bid_max_price": -1
},
"node_type_id": "Standard_E8s_v3",
"num_workers": 32,
"spark_conf": {
"spark.sql.execution.arrow.maxRecordsPerBatch": 100
},
"docker_image": {
"url": "projectglow/databricks-glow:<databricks_runtime_version>"
}
}
}
Secondary analysis
7/21/2022 • 2 minutes to read
NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.
Databricks Runtime for Genomics (Deprecated) contains pre-packaged pipelines to align reads and detect and
annotate variants in individual samples, parallelized using Apache Spark.
DNASeq pipeline
Setup
Reference genomes
Parameters
Customization
Manifest format
Supported input formats
Output
Troubleshooting
Run programmatically
RNASeq pipeline
Setup
Reference genomes
Parameters
Walkthrough
Additional usage info and troubleshooting
Tumor/Normal pipeline
Walkthrough
Setup
Reference genomes
Parameters
Manifest format
Additional usage info and troubleshooting
Variant annotation methods
Pre-packaged SnpEff annotation pipeline
Pre-packaged VEP annotation pipeline
Variant Annotation using Pipe Transformer
DNASeq pipeline
7/21/2022 • 6 minutes to read
NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.
NOTE
The following library versions are packaged in Databricks Runtime 7.0 for Genomics. For libraries included in lower
versions of Databricks Runtime for Genomics, see the release notes.
The Azure Databricks DNASeq pipeline is a GATK best practices compliant pipeline for short read alignment,
variant calling, and variant annotation. It uses the following software packages, parallelized using Spark.
BWA v0.7.17
ADAM v0.32.0
GATK HaplotypeCaller v4.1.4.1
SnpEff v4.3
For more information about the pipeline implementation and expected runtimes and costs for various option
combinations, see Building the Fastest DNASeq Pipeline at Scale.
Setup
The pipeline is run as an Azure Databricks job. You can set up a cluster policy to save the configuration:
{
"num_workers": {
"type": "unlimited",
"defaultValue": 13
},
"node_type_id": {
"type": "unlimited",
"defaultValue": "Standard_F32s_v2"
},
"spark_env_vars.refGenomeId": {
"type": "unlimited",
"defaultValue": "grch38"
},
"spark_version": {
"type": "regex",
"pattern": ".*-hls.*",
"defaultValue": "7.4.x-hls-scala2.12"
}
}
The cluster configuration should use Databricks Runtime for Genomics.
The task should be the DNASeq notebook found at the bottom of this page.
For best performance, use compute-optimized VMs with at least 60 GB of memory. We recommend
Standard_F32s_v2 VMs.
If you’re running base quality score recalibration, use general purpose (Standard_D32s_v3) instances
instead, since this operation requires more memory.
Reference genomes
You must configure the reference genome using an environment variable. To use GRCh37, set the environment
variable:
refGenomeId=grch37
NOTE
Custom reference genome support is available in Databricks Runtime 6.6 for Genomics and above.
To use a reference build other than GRCh37 or GRCh38, follow these steps:
1. Prepare the reference for use with BWA and GATK.
The reference genome directory contents should include these files:
<reference_name>.dict
<reference_name>.fa
<reference_name>.fa.amb
<reference_name>.fa.ann
<reference_name>.fa.bwt
<reference_name>.fa.fai
<reference_name>.fa.pac
<reference_name>.fa.sa
2. Upload the reference genome files to a directory in cloud storage or DBFS. If you upload the files to cloud
storage, you must mount the directory to a location in DBFS.
3. In your cluster configuration, set an environment variable REF_GENOME_PATH that points to the path of the
fasta file in DBFS. For example,
REF_GENOME_PATH=/mnt/reference-genome/reference.fa
2. Generate the index image file from the BWA index files.
import org.broadinstitute.hellbender.utils.bwa._
BwaMemIndex.createIndexImageFromIndexFiles("/local_disk0/reference-genome/<reference_name>.fa",
"/local_disk0/reference-genome/<reference_name>.fa.img")
3. Copy the index image file to the same directory as the reference fasta files.
4. Delete the unneeded BWA index files ( .amb , .ann , .bwt , .pac , .sa ) from DBFS.
%fs rm <file>
Parameters
The pipeline accepts parameters that control its behavior. The most important and commonly changed
parameters are documented here; the rest can be found in the DNASeq notebook. After importing the notebook
and setting it as a job task, you can set these parameters for all runs or per-run.
TIP
To optimize run time, set the spark.sql.shuffle.partitions Spark configuration to three times the number of cores of
the cluster.
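For example, a minimal sketch following that tip, assuming the 13-worker Standard_F32s_v2 (32 cores per worker) configuration from the policy above:

# 3 x total cores: 3 * 13 workers * 32 cores per worker = 1248 shuffle partitions.
spark.conf.set("spark.sql.shuffle.partitions", 3 * 13 * 32)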
Customization
You can customize the DNASeq pipeline by disabling read alignment, variant calling, and variant annotation. By
default, all three stages are enabled.
Manifest format
NOTE
Manifest blobs are supported in Databricks Runtime 6.6 for Genomics and above.
The manifest is a CSV file or blob describing where to find the input FASTQ or BAM files. For example:
file_path,sample_id,paired_end,read_group_id
*_R1_*.fastq.bgz,HG001,1,read_group
*_R2_*.fastq.bgz,HG001,2,read_group
If your input consists of unaligned BAM files, you should omit the paired_end field:
file_path,sample_id,paired_end,read_group_id
*.bam,HG001,,read_group
TIP
If the provided manifest is a file, the file_path field in each row can be an absolute path or a path relative to the
manifest file. If the provided manifest is a blob, the file_path field must be an absolute path. You can include globs
(*) to match many files.
IMPORTANT
Gzipped files are not splittable. Choose autoscaling clusters to minimize cost for these files.
To block compress a FASTQ, install htslib, which includes the bgzip executable.
Output
The aligned reads, called variants, and annotated variants are all written out to Delta tables inside the provided
output directory if the corresponding stages are enabled. Each table is partitioned by sample ID. In addition, if
you configured the pipeline to export BAMs or VCFs, they’ll appear under the output directory as well.
|---alignments
|---sampleId=HG001
|---Parquet files
|---alignments.bam
|---HG001.bam
|---annotations
|---Delta files
|---annotations.vcf
|---HG001.vcf
|---genotypes
|---Delta files
|---genotypes.vcf
|---HG001.vcf
When you run the pipeline on a new sample, it’ll appear as a new partition. If you run the pipeline for a sample
that already appears in the output directory, that partition will be overwritten.
Since all the information is available in Delta Lake, you can easily analyze it with Spark in Python, R, Scala, or
SQL. For example:
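A minimal Python sketch, assuming the pipeline wrote its output under an illustrative directory and using the sampleId partition column described above:

# Load the genotypes table written by the pipeline (path is illustrative).
genotypes = spark.read.format("delta").load("dbfs:/genomics/output/genotypes")
# Count called variants for a single sample, using the sampleId partition column.
genotypes.where("sampleId = 'HG001'").count()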
Troubleshooting
Job is slow and few tasks are running
Usually indicates that the input FASTQ files are compressed with gzip instead of bgzip . Gzipped files are not
splittable, so the input cannot be processed in parallel.
Run programmatically
In addition to using the UI, you can start runs of the pipeline programmatically using the Databricks CLI.
After setting up the pipeline job in the UI, copy the Job ID so that you can pass it to the jobs run-now CLI command.
Here’s an example bash script that you can adapt for your workflow:
# Generate a manifest file
cat <<HERE >manifest.csv
file_path,sample_id,paired_end,read_group_id
dbfs:/genomics/my_new_sample/*_R1_*.fastq.bgz,my_new_sample,1,read_group
dbfs:/genomics/my_new_sample/*_R2_*.fastq.bgz,my_new_sample,2,read_group
HERE

# Upload the manifest to DBFS and start a run (the job ID, paths, and parameter name are illustrative)
dbfs cp manifest.csv dbfs:/genomics/manifests/my_new_sample.csv
databricks jobs run-now --job-id <job-id> \
  --notebook-params '{"manifest": "dbfs:/genomics/manifests/my_new_sample.csv"}'
In addition to starting runs from the command line, you can use this pattern to invoke the pipeline from
automated systems like Jenkins.
DNASeq pipeline notebook
Get notebook
RNASeq pipeline
7/21/2022 • 2 minutes to read
NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.
NOTE
The following library versions are packaged in Databricks Runtime 7.0 for Genomics. For libraries included in lower
versions of Databricks Runtime for Genomics, see the release notes.
The Databricks RNASeq pipeline handles short read alignment and quantification using STAR v2.6.1a and ADAM
v0.32.0.
Setup
The pipeline is run as an Azure Databricks job. You can set up a cluster policy to save the configuration:
{
"num_workers": {
"type": "unlimited",
"defaultValue": 13
},
"node_type_id": {
"type": "unlimited",
"defaultValue": "Standard_F32s_v2"
},
"spark_env_vars.refGenomeId": {
"type": "unlimited",
"defaultValue": "grch38_star"
},
"spark_version": {
"type": "regex",
"pattern": ".*-hls.*",
"defaultValue": "7.4.x-hls-scala2.12"
}
}
The task should be the RNASeq notebook provided at the bottom of this page.
For best performance, use compute-optimized VMs with at least 60 GB of memory. We recommend
Standard_F32s_v2 VMs.
Reference genomes
You must configure the reference genome using environment variables. To use GRCh37, set the environment
variable:
refGenomeId=grch37_star
To use GRCh38, set the environment variable:
refGenomeId=grch38_star
Parameters
The pipeline accepts a number of parameters that control its behavior. The most important and commonly
changed parameters are documented here; the rest can be found in the RNASeq notebook. After importing the
notebook and setting it as a job task, you can set these parameters for all runs or per-run.
Walkthrough
The pipeline consists of two steps:
1. Alignment: Map each short read to the reference genome using the STAR aligner.
2. Quantification: Count how many reads correspond to each reference transcript.
NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.
The Azure Databricks tumor/normal pipeline is a GATK best practices compliant pipeline for short read
alignment and somatic variant calling using the MuTect2 variant caller.
Walkthrough
The pipeline consists of the following steps:
1. Normal sample alignment using BWA-MEM.
2. Tumor sample alignment using BWA-MEM.
3. Variant calling with MuTect2.
Setup
The pipeline is run as an Azure Databricks job. You can set up a cluster policy to save the configuration:
{
"num_workers": {
"type": "unlimited",
"defaultValue": 13
},
"node_type_id": {
"type": "unlimited",
"defaultValue": "Standard_F32s_v2"
},
"spark_env_vars.refGenomeId": {
"type": "unlimited",
"defaultValue": "grch38"
},
"spark_version": {
"type": "regex",
"pattern": ".*-hls.*",
"defaultValue": "7.4.x-hls-scala2.12"
}
}
Reference genomes
You must configure the reference genome using an environment variable. To use GRCh37, set the environment
variable:
refGenomeId=grch37
Parameters
The pipeline accepts parameters that control its behavior. The most important and commonly changed
parameters are documented here. To view all available parameters and their usage information, run the first cell
of the pipeline notebook. New parameters are added regularly. After importing the notebook and setting it as a
job task, you can set these parameters for all runs or per-run.
TIP
To optimize run time, set the spark.sql.shuffle.partitions Spark configuration to three times the number of cores of
the cluster.
Manifest format
NOTE
Manifest blobs are supported in Databricks Runtime 6.6 for Genomics and above.
The manifest is a CSV file or blob describing where to find the input FASTQ or BAM files. For example:
pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*_R1_*.normal.fastq.bgz,HG001_normal,normal,1,read_group_normal
HG001,*_R2_*.normal.fastq.bgz,HG001_normal,normal,2,read_group_normal
HG001,*_R1_*.tumor.fastq.bgz,HG001_tumor,tumor,1,read_group_tumor
HG001,*_R2_*.tumor.fastq.bgz,HG001_tumor,tumor,2,read_group_tumor
If your input consists of unaligned BAM files, you should omit the paired_end field:
pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*.normal.bam,HG001_normal,normal,,read_group_normal
HG001,*.tumor.bam,HG001_tumor,tumor,,read_group_tumor
The tumor and normal samples for a given individual are grouped by the pair_id field. The tumor and normal
sample names and read group names must be different within a pair.
TIP
If the provided manifest is a file, the file_path field in each row may be an absolute path or a path relative to the
manifest file. If the provided manifest is a blob, the file_path field must be an absolute path. You can include globs
(*) to match many files.
NOTE
The pipeline was renamed from TNSeq to MutSeq in Databricks Runtime 7.3 LTS for Genomics and above.
NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.
Any annotation method can be used on variant data using Glow’s Pipe Transformer.
For example, VEP annotation is performed by downloading annotation data sources (the cache) to each node in
a cluster and calling the VEP command line script with the Pipe Transformer using a script similar to the
following cell.
import glow
import json
input_vcf = "/databricks-datasets/hail/data-001/1kg_sample.vcf.bgz"
input_df = spark.read.format("vcf").load(input_vcf)
cmd = json.dumps([
"/opt/vep/src/ensembl-vep/vep",
"--dir_cache", "/mnt/dbnucleus/dbgenomics/grch37_merged_vep_96",
"--fasta", "/mnt/dbnucleus/dbgenomics/grch37_merged_vep_96/data/human_g1k_v37.fa",
"--assembly", "GRCh37",
"--format", "vcf",
"--output_file", "STDOUT",
"--no_stats",
"--cache",
"--offline",
"--vcf",
"--merged"])
output_df = glow.transform("pipe", input_df, cmd=cmd, input_formatter='vcf', in_vcf_header=input_vcf,
output_formatter='vcf')
output_df.write.format("delta").save("dbfs:/mnt/vep-pipe")
Variant annotation methods
7/21/2022 • 2 minutes to read
NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.
This article describes how to use Databricks Runtime for Genomics (Deprecated) to parallelize variant
annotation methods with Apache Spark using Azure Databricks notebooks.
Pre-packaged SnpEff annotation pipeline
Pre-packaged VEP annotation pipeline
Variant Annotation using Pipe Transformer
Pre-packaged SnpEff annotation pipeline
7/21/2022 • 2 minutes to read
NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.
Setup
Run SnpEff (v4.3) as an Azure Databricks job. Most likely, an Azure Databricks solutions architect will set up the
initial job for you. The necessary details are:
Cluster configuration
Databricks Runtime for Genomics (Deprecated)
Set the task to the SnpEffAnnotationPipeline notebook imported into your workspace.
Benchmarks
The pipeline has been tested on 85.2 million variant sites from the 1000 Genomes project using the following
cluster configurations:
Driver: Standard_DS13_v2
Workers: Standard_D32s_v3 * 7 (224 cores)
Runtime: 2.5 hours
Reference genomes
You must configure the reference genome using environment variables. To use GRCh37, set the environment
variable:
refGenomeId=grch37
To use GRCh38, set the environment variable:
refGenomeId=grch38
Parameters
The pipeline accepts a number of parameters that control its behavior. The most important and commonly
changed parameters are documented here; the rest can be found in the SnpEff Annotation pipeline notebook.
After importing the notebook and setting it as a job task, you can set these parameters for all runs or per-run.
PARAMETER   DEFAULT   DESCRIPTION
Output
The annotated variants are written out to Delta tables inside the provided output directory. If you configured the
pipeline to export to VCF, they’ll appear under the output directory as well.
output
|---annotations
|---Delta files
|---annotations.vcf
NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.
Setup
Run VEP (release 96) as an Azure Databricks job.
The necessary details are:
Cluster configuration
Databricks Runtime for Genomics (Deprecated)
For best performance, set the Spark configuration spark.executor.cores to 1 and use memory-optimized
instances with at least 200 GB of memory. We recommend Standard_L32s VMs.
Set the task as the VEPPipeline notebook imported into your workspace.
Reference genomes
You must configure the reference genome and transcripts using environment variables. To use GRCh37 with
merged Ensembl and RefSeq transcripts, set the environment variable:
refGenomeId=grch37_merged_vep_96
The refGenomeId values for all pairs of reference genomes and transcripts are listed:
GRCH37   GRCH38
Parameters
The pipeline accepts a number of parameters that control its behavior. After importing the notebook and setting
it as a job task, you can set these parameters for all runs or per-run.
PARAMETER   DEFAULT   DESCRIPTION
LOFTEE
You can run VEP with plugins in order to extend, filter, or manipulate the VEP output. Set up LOFTEE with the
following instructions according to the desired reference genome.
grch37
Create a LOFTEE cluster using an init script.
#!/bin/bash
DIR_VEP_PLUGINS=/opt/vep/Plugins
mkdir -p $DIR_VEP_PLUGINS
cd $DIR_VEP_PLUGINS
echo export PERL5LIB=$PERL5LIB:$DIR_VEP_PLUGINS/loftee >> /databricks/spark/conf/spark-env.sh
git clone --depth 1 --branch master https://github.com/konradjk/loftee.git
We recommend creating a mount point to store any additional files in cloud storage; these files can then be
accessed using the FUSE mount. Replace the values in the scripts with your mount point.
If desired, save the ancestral sequence at the mount point.
cd <mount-point>
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.fai
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.gzi
When running the VEP pipeline, provide the corresponding extra options.
grch38
Create a LOFTEE cluster that can parse BigWig files using an init script.
#!/bin/bash
# Download LOFTEE
DIR_VEP_PLUGINS=/opt/vep/Plugins
mkdir -p $DIR_VEP_PLUGINS
cd $DIR_VEP_PLUGINS
echo export PERL5LIB=$PERL5LIB:$DIR_VEP_PLUGINS/loftee >> /databricks/spark/conf/spark-env.sh
git clone --depth 1 --branch grch38 https://github.com/konradjk/loftee.git
# Install Bio::DB::BigFile
cpanm --notest Bio::Perl
cpanm --notest Bio::DB::BigFile
We recommend creating a mount point to store any additional files in cloud storage; these files can then be
accessed using the FUSE mount. Replace the values in the scripts with your mount point.
Save the GERP scores BigWig at the mount point.
cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/gerp_conservation_scores.homo_sapiens.GRCh38.bw
Save the SQL database at the mount point.
cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/loftee.sql.gz
gunzip loftee.sql.gz
When running the VEP pipeline, provide the corresponding extra options.
NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.
Aggregate genetic variants using the GATK’s GenotypeGVCF implemented on Apache Spark.
Joint genotyping pipeline
Walkthrough
Setup
Reference genomes
Parameters
Output
Manifest format
Troubleshooting
Additional usage info
Joint genotyping pipeline
7/21/2022 • 4 minutes to read
NOTE
Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics
releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime
for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for
selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule,
see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime
have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.
The Azure Databricks joint genotyping pipeline is a GATK best practices compliant pipeline for joint genotyping
using GenotypeGVCFs.
Walkthrough
The pipeline typically consists of the following steps:
1. Ingest variants into Delta Lake.
2. Joint-call the cohort with GenotypeGVCFs.
During variant ingest, single-sample gVCFs are processed in batches and the rows are stored in Delta Lake to
provide fault tolerance, fast querying, and incremental joint genotyping. In the joint genotyping step, the gVCF
rows are ingested from Delta Lake, split into bins, and distributed to partitions. For each variant site, the relevant
gVCF rows per sample are identified and used for regenotyping.
Setup
The pipeline is run as an Azure Databricks job. Most likely an Azure Databricks solutions architect will work with
you to set up the initial job. The necessary details are:
{
"autoscale.min_workers": {
"type": "unlimited",
"defaultValue": 1
},
"autoscale.max_workers": {
"type": "unlimited",
"defaultValue": 25
},
"node_type_id": {
"type": "unlimited",
"defaultValue": "Standard_L32s_v2"
},
"spark_env_vars.refGenomeId": {
"type": "unlimited",
"defaultValue": "grch38"
},
"spark_version": {
"type": "regex",
"pattern": ".*-hls.*",
"defaultValue": "7.4.x-hls-scala2.12"
}
}
The cluster configuration should use Databricks Runtime for Genomics (Deprecated).
The task should be the joint genotyping pipeline notebook found at the bottom of this page.
For best performance, use the storage-optimized VMs. We recommend Standard_L32s_v2 .
To reduce costs, enable autoscaling with a minimum of 1 worker and a maximum of 10-50 depending on
latency requirements.
Reference genomes
You must configure the reference genome using environment variables. To use GRCh37, set the environment
variable:
refGenomeId=grch37
Parameters
The pipeline accepts parameters that control its behavior. The most important and commonly changed
parameters are documented here. To view all available parameters and their usage information, run the first cell
of the pipeline notebook. New parameters are added regularly. After importing the notebook and setting it as a
job task, you can set these parameters for all runs or per-run.
TIP
To perform joint calling from an existing Delta table, set gvcfDeltaOutput to the table path and replayMode to skip .
You can also provide the manifest , which will be used to define the VCF schema and samples; these will be inferred from
the Delta table otherwise. We ignore the targetedRegions and performValidation parameters in this setup.
Output
The regenotyped variants are all written out to Delta tables inside the provided output directory. In addition, if
you configured the pipeline to export VCFs, they’ll appear under the output directory as well.
output
|---genotypes
|---Delta files
|---genotypes.vcf
|---VCF files
Manifest format
NOTE
Manifest blobs are supported in Databricks Runtime 6.6 for Genomics and above.
The manifest is a file or blob describing where to find the input single-sample GVCF files, with each file path on
a new row. For example:
HG00096.g.vcf.bgz
HG00097.g.vcf.bgz
TIP
If the provided manifest is a file, each row may be an absolute path or a path relative to the manifest file. If the provided
manifest is a blob, each row must be an absolute path. You can include globs (*) to match many files.
Troubleshooting
Job fails with an ArrayIndexOutOfBoundsException
This error usually indicates that an input record has an incorrect number of genotype probabilities. Try setting
the performValidation option to true and the validationStringency option to LENIENT or SILENT .
Azure Databricks provides many tools for securing your network infrastructure. This guide covers general
security functionality. For information about securing access to your data, see Data governance guide.
Enterprise security for Azure Databricks
Access control
Secret management
Credential passthrough
Customer-managed keys for encryption
Configure double encryption for DBFS root
Secure cluster connectivity (No Public IP / NPIP)
Encrypt traffic between cluster worker nodes
IP access lists
Configure domain name firewall rules
Best practices: GDPR and CCPA compliance using Delta Lake
Configure access to Azure storage with an Azure Active Directory service principal
For security information specific to Databricks SQL, see the Databricks SQL security guide.
Enterprise security for Azure Databricks
7/21/2022 • 16 minutes to read
This article provides an overview of the most important security-related controls and configurations for the
deployment of Azure Databricks.
This article illustrates some scenarios using example companies to compare how small and large organizations
might handle deployment differently. There are references to the fictional large corporation LargeCorp and the
fictional small company SmallCorp . Use these examples as general guides, but every company is different. If
you have questions, contact your Azure Databricks representative.
For detailed information about specific security features, see Security guide. Your Azure Databricks
representative can also provide you with additional security and compliance documentation.
Account plan
Talk to your Azure Databricks representative about the features you want. They will help you choose the pricing
plan (tier) that suits your needs.
The following features are common for security-conscious organizations. All but SSO require the Premium plan.
Single sign-on (SSO): Authenticate users using Azure Active Directory. This is always enabled and is the
only option. If you use a different IdP, federate your IdP with Azure Active Directory.
Role-based access control: Control access to clusters, jobs, data tables, APIs, and workspace resources
such as notebooks, folders, jobs, and registered models.
Credential passthrough: Control access to Azure Data Lake Storage using users’ Azure Active Directory
credentials.
VNet injection: Deploy an Azure Databricks workspace in your own VNet that you manage in your Azure
subscription. Enables you to implement custom network configuration with custom Network Security
Group rules.
Secure cluster connectivity: Enable secure cluster connectivity on the workspace, which means that your
VNet’s Network Security Group (NSG) has no open inbound ports and Databricks Runtime cluster nodes
have no public IP addresses. Also known as “No Public IPs.” You can enable this feature for a workspace
during deployment.
NOTE
Independent of whether secure cluster connectivity is enabled, all Azure Databricks network traffic between the
data plane VNet and the Azure Databricks control plane goes across the Microsoft network backbone, not the
public Internet.
Workspaces
An Azure Databricks workspace is an environment for accessing your Azure Databricks assets. The workspace
organizes your objects (notebooks, libraries, and experiments) into folders. Your workspace provides access to
data and computational resources such as clusters and jobs.
Determine how many workspaces your organization will need, what teams need to collaborate, and your
requirements for geographic regions.
A small organization such as our example SmallCorp might only need one or a small number of workspaces.
This might also be true of a single division of a larger company that is relatively self-contained. The workspace
administrators could be regular users of the workspace. In some cases, a separate department (IT/OpSec) might
take on the role of workspace administrator to deploy according to enterprise governance policies and manage
permissions, users, and groups.
A large organization such as our example LargeCorp typically requires many workspaces. LargeCorp already has
a centralized group (IT/OpSec) that handles all security and administrative functions. That centralized group
typically sets up new workspaces and enforces security controls across the company.
Common reasons a large corporation might create separate workspaces:
Teams handle different levels of confidential information, possibly including personally identifying
information. By separating workspaces, teams keep different levels of confidential assets separate without
additional complexity such as access control lists. For example, the LargeCorp finance team can easily store
its finance-related notebooks in a workspace separate from those of other departments.
Simplify billing for Databricks usage (DBUs) and Cloud compute to be charged back to different budgets.
Geographic region variations of teams or data sources. Teams in one region may prefer cloud resources
based in a different region for cost, network latency, or legal compliance. Each workspace can be defined in a
different supported region.
Although workspaces are a common approach to segregate access to resources by team, project, or geography,
there are other options. Workspace administrators can use access control lists (ACLs) within a workspace to
limit access to resources such as notebooks, folders, jobs, and more, based on user and group memberships.
Another option for controlling differential access to data sources in a single workspace is credential passthrough.
Plan your virtual network configuration
The default deployment of Azure Databricks creates a new virtual network that is managed by Microsoft. You
can create a new workspace in your own customer-managed virtual network (also known as VNet injection)
instead. To learn why you might want to do this, see Deploy Azure Databricks in your Azure virtual network
(VNet injection).
Smaller companies might use the default VNet or have only a single customer-managed VNet for a single
Databricks workspace. If there are two or three workspaces, depending on the network architecture and regions,
they might share a single customer-managed VNet with multiple workspaces.
Larger companies often want to create and specify a customer-managed VNet for Databricks to use. They could
have some workspaces share a VNet to simplify Azure resource allocation. One organization might have VNets
in different Azure subscriptions and possibly different regions, and that could affect the planning of the number
of workspaces and your planned VNets to support them.
For each workspace you must provide two subnets: one for the container and one for the host. You cannot share
these subnets (or their address space) across workspaces. If you have multiple workspaces in one VNet, it’s
critical to plan your address space within the VNet. For the supported VNet and subnet sizes and the maximum
Azure Databricks nodes for various subnet sizes, see Deploy Azure Databricks in your Azure virtual network
(VNet injection).
Customer-managed keys for root Azure Blob storage (root DBFS and workspace system data)
Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and
available on Azure Databricks clusters. DBFS is implemented as a storage account in your Azure Databricks
workspace’s managed resource group. The default storage location in DBFS is known as the DBFS root. By
default, the storage account is encrypted with Microsoft-managed keys. This root Azure Blob storage in your
own subscription also stores other workspace system data like cluster or job logs and notebook version history.
Optionally, you can secure your workspace’s root Azure Blob storage using customer-managed keys. For details,
see Configure customer-managed keys for DBFS root
Customer-managed keys for managed services in the control plane
An Azure Databricks workspace comprises a control plane that is hosted in an Azure Databricks-managed
subscription and a data plane that is deployed in a virtual network in your subscription. The control plane stores
your notebook source code, partial notebook results, secrets stored with the secrets manager, and other
workspace configuration data. By default, the managed services data in the control plane is encrypted at rest
with a Databricks-managed key.
If your security and compliance requirements specify that you must own and manage the key used for
encrypting your notebooks and all results yourself, you can provide your own key for encrypting the notebook
data that is stored in the Azure Databricks control plane. For details, see Enable customer-managed keys for
managed services.
This feature does not encrypt data stored outside of the control plane. For example, it does not encrypt data in
your root Azure Blob storage, which stores job results and other parts of what is called your root DBFS. You can
encrypt that information with your own customer-managed keys. See Customer-managed keys for root Azure
Blob storage (root DBFS and workspace system data)
A small company, such as our example SmallCorp, might create new users by email address manually using the
Azure Databricks Admin console, in which case the users must also be members of the Azure AD tenant that is
associated with that workspace. However, Databricks recommends that you use Azure AD user and group
synchronization to automate user provisioning.
A large company, such as our example LargeCorp, typically manages Azure Databricks users and groups using
Azure AD SCIM synchronization of users and groups.
IP access lists
Authentication proves user identity, but it does not enforce the network location of the users. Accessing a cloud
service from an unsecured network poses security risks, especially when the user may have authorized access to
sensitive or personal data. Enterprise network perimeters (for example, firewalls, proxies, DLP, and logging)
apply security policies and limit access to external services, so access beyond these controls is assumed to be
untrusted.
For example, if an employee walks from the office to a coffee shop, the company can block connections to the
Azure Databricks workspace even if the customer has correct credentials to access the web application and the
REST API.
Specify the IP addresses (or CIDR ranges) on the public network that are allowed access. These IP addresses
could belong to egress gateways or specific user environments. You can also specify IP addresses or subnets to
block even if they are included in the allow list. For example, an allowed IP address range might include a
smaller range of infrastructure IP addresses that in practice are outside the actual secure network perimeter.
For details, see IP access lists.
Audit logs
Audit logs are available automatically for Azure Databricks workspaces. You must configure these to flow to
your Azure Storage Account, Azure Log Analytics workspace, or Azure Event Hub. In the Azure portal, the user
interface refers to them as Diagnostic Logs. After experimenting with your new workspace, try examining the
audit logs. See Diagnostic logging in Azure Databricks.
Cluster policies
Use cluster policies to enforce particular cluster settings, such as instance types, number of nodes, attached
libraries, and compute cost, and display different cluster-creation interfaces for different user levels. Managing
cluster configurations using policies can help enforce universal governance controls and manage the costs of
your compute infrastructure.
A small organization like SmallCorp might have a single cluster policy for all clusters.
A large organization like LargeCorp might have more complex policies, for example (a sketch of one such policy follows this list):
Customer data analysts who work on extremely large data sets and complex calculations might be allowed to have clusters of up to one hundred nodes.
Finance team data analysts might be allowed to use clusters of up to ten nodes.
A human resources department that works with smaller datasets and simpler notebooks might only be allowed to have autoscaling clusters of four to eight nodes.
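As a hedged illustration of the last item, the sketch below creates a small policy with the Cluster Policies API 2.0. The policy attributes and limits are assumptions chosen for this example, and the workspace URL and token are placeholders.

```python
import json
import requests

# Hypothetical policy for the HR team: autoscaling capped at 4-8 workers and a
# forced autotermination timeout. Attribute paths follow the cluster policy
# definition format; the values are illustrative only.
hr_policy = {
    "autoscale.min_workers": {"type": "fixed", "value": 4},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {"type": "fixed", "value": 60, "hidden": True},
}

resp = requests.post(
    "https://<databricks-instance>/api/2.0/policies/clusters/create",   # placeholder host
    headers={"Authorization": "Bearer <personal-access-token>"},        # placeholder token
    json={"name": "hr-small-clusters", "definition": json.dumps(hr_policy)},
)
resp.raise_for_status()
print(resp.json())  # the response contains the new policy_id
```

Users assigned this policy then see only the simplified cluster-creation form that the policy allows.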
Folder (directory): Manage which users can read, run, edit, or manage all notebooks in a folder.
MLflow registered model and experiment: Manage which users can read, edit, or manage MLflow registered models and experiments.
Token: Manage which users can create or use tokens. See also Secure API access.
Depending on team size and sensitivity of the information, a small company like SmallCorp or a small team
within LargeCorp with its own workspace might allow all non-admin users access to the same objects, like
clusters, jobs, notebooks, and directories.
A larger team or organization with very sensitive information would likely want to use all of these access
controls to enforce the Principle of Least Privilege so that any individual user has access only to the resources
for which they have a legitimate need.
For example, suppose that LargeCorp has three people who need access to a specific workspace folder (which
contains notebooks and experiments) for the finance team. LargeCorp can use these APIs to grant directory
access only to the finance data team group.
Secrets
You may also want to use the secret manager to set up secrets that you expect your notebooks to need. A secret
is a key-value pair that stores secret material for an external data source or other calculation, with a key name
unique within a secret scope.
You create secrets using either the REST API or CLI, but you must use the Secrets utility (dbutils.secrets) in a
notebook or job to read your secrets.
Secrets are stored encrypted at rest, but you can add a customer-managed key to add additional security. See
Customer-managed keys for managed services in the control plane
Registering an application with Azure Active Directory (Azure AD) creates a service principal you can use to
provide access to Azure storage accounts. You can then configure access to these service principals using
credentials stored with secrets.
Databricks recommends using Azure Active Directory service principals scoped to clusters or Databricks SQL
endpoints to configure data access. See Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure
Databricks and Configure access to cloud storage.
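For example, here is a minimal sketch (an illustration under assumptions, not a prescribed configuration from this article) of a notebook that configures Spark to read ADLS Gen2 with a service principal whose client secret is stored in a secret scope. The storage account, container, scope, key, application ID, and tenant ID are all placeholders.

```python
# Placeholders: replace the storage account, container, scope, key names, and AAD IDs.
storage_account = "<storage-account>"
client_id = "<application-client-id>"
tenant_id = "<directory-tenant-id>"
client_secret = dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key>")

# Standard ABFS OAuth configuration keys for a service principal (client credentials flow).
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)

df = spark.read.text(f"abfss://<container>@{storage_account}.dfs.core.windows.net/<path>")
```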
Assign roles
You control access to storage resources by assigning roles to an Azure AD application registration associated
with the storage account. This example assigns the Storage Blob Data Contributor to an Azure storage
account. You may need to assign other roles depending on specific requirements.
1. In the Azure portal, go to the Storage accounts service.
2. Select an Azure storage account to use with this application registration.
3. Click Access Control (IAM) .
4. Click + Add and select Add role assignment from the dropdown menu.
5. Set the Select field to the Azure AD application name and set Role to Storage Blob Data Contributor .
6. Click Save .
Access control
7/21/2022 • 2 minutes to read
In Azure Databricks, you can use access control lists (ACLs) to configure permission to access workspace objects
(folders, notebooks, experiments, and models), clusters, pools, jobs, Delta Live Tables pipelines, and data tables.
All admin users can manage access control lists, as can users who have been given delegated permissions to
manage access control lists.
Admin users enable and disable access control at the Azure Databricks workspace level. See Enable access
control.
NOTE
Workspace object, cluster, pool, job, Delta Live Tables pipelines, and table access control are available only in the Premium
Plan.
NOTE
Access control is available only in the Premium Plan.
By default, all users can create and modify workspace objects—including folders, notebooks, experiments, and
models—unless an administrator enables workspace access control. With workspace object access control,
individual permissions determine a user’s abilities. This article describes the individual permissions and how to
configure workspace object access control.
TIP
You can manage permissions in a fully automated setup using Databricks Terraform provider and databricks_permissions.
Before you can use workspace object access control, an Azure Databricks admin must enable it for the
workspace. See Enable workspace object access control.
Folder permissions
You can assign five permission levels to folders: No Permissions , Can Read , Can Run , Can Edit , and Can
Manage . The table lists the abilities for each permission.
| Ability | No Permissions | Can Read | Can Run | Can Edit | Can Manage |
|---|---|---|---|---|---|
| List items in folder | x | x | x | x | x |
| View items in folder | | x | x | x | x |
| Clone and export items | | x | x | x | x |
| Create, import, and delete items | | | | | x |
| Move and rename items | | | | | x |
| Change permissions | | | | | x |
Notebooks and experiments in a folder inherit all permissions settings of that folder. For example, a user that
has Can Run permission on a folder has Can Run permission on the notebooks in that folder.
Default folder permissions
Independent of workspace object access control, the following permissions exist:
All users have Can Manage permission for items in the Workspace > Shared folder. You can
grant Can Manage permission to notebooks and folders by moving them to the Shared folder.
All users have Can Manage permission for objects the user creates.
With workspace object access control disabled, the following permissions exist:
All users have Can Edit permission for items in the Workspace folder.
With workspace object access control enabled, the following permissions exist:
Workspace folder
Only administrators can create new items in the Workspace folder.
Existing items in the Workspace folder - Can Manage . For example, if the Workspace folder
contained the Documents and Temp folders, all users continue to have the Can
Manage permission for these folders.
New items in the Workspace folder - No Permissions .
A user has the same permission for all items in a folder, including items created or moved into the
folder after you set the permissions, as the permission the user has on the folder.
User home directory - The user has Can Manage permission. All other users have No Permissions
permission.
Notebook permissions
You can assign five permission levels to notebooks: No Permissions , Can Read , Can Run , Can Edit , and Can
Manage . The table lists the abilities for each permission.
| Ability | No Permissions | Can Read | Can Run | Can Edit | Can Manage |
|---|---|---|---|---|---|
| View cells | | x | x | x | x |
| Comment | | x | x | x | x |
| Attach and detach notebooks | | | x | x | x |
| Run commands | | | x | x | x |
| Edit cells | | | | x | x |
| Change permissions | | | | | x |
Repos permissions
You can assign five permission levels to repos: No Permissions , Can Read , Can Run , Can Edit , and Can
Manage . The table lists the abilities for each permission.
| Ability | No Permissions | Can Read | Can Run | Can Edit | Can Manage |
|---|---|---|---|---|---|
| View items in repo | | x | x | x | x |
| Clone and export items | | x | x | x | x |
| Run notebooks in repo | | | x | x | x |
| Edit notebooks in repo | | | | x | x |
| Create, import, and delete items | | | | | x |
| Move and rename items | | | | | x |
| Change permissions | | | | | x |
1. Select Permissions from the drop-down menu for the notebook, folder, or repo:
2. To grant permissions to a user or group, select from the Add Users, Groups, and Service Principals
drop-down, select the permission, and click Add:
To change the permissions of a user or group, select the new permission from the permission drop-down:
3. After you make changes in the dialog, Done changes to Save Changes and a Cancel button appears.
Click Save Changes or Cancel .
NOTE
Experiment permissions are only enforced on artifacts stored in DBFS locations managed by MLflow. For more
information, see MLflow Artifact permissions.
Creating, deleting, and restoring an experiment requires Can Edit or Can Manage access to the folder containing the experiment.
You can specify the Can Run permission for experiments. It is enforced the same way as Can Edit .
You can change permissions for an experiment that you own from the experiments page. Click in the
Actions column and select Permissions .
Configure MLflow experiment permissions from the experiment page
All users in your account belong to the group all users . Administrators belong to the group admins , which
has Manage permissions on all objects.
NOTE
Permissions you set in this dialog apply to the notebook that corresponds to this experiment.
1. Click Share .
2. In the dialog, click the Select User, Group or Service Principal… drop-down and select a user, group,
or service principal.
3. Select a permission from the permission drop-down.
4. Click Add .
5. If you want to remove a permission, click for that user, group, or service principal.
2. Grant or remove permissions. All users in your account belong to the group all users . Administrators
belong to the group admins , which has Can Manage permissions on all items.
To grant permissions, select from the Select User, Group, or Service Principal drop-down, select the
permission, and click Add :
To change existing permissions, select the new permission from the permission drop-down:
NOTE
Artifacts stored in MLflow-managed locations can only be accessed using the MLflow Client (version 1.9.1 or later),
which is available for Python, Java, and R. Other access mechanisms, such as dbutils and the DBFS API 2.0, are not
supported for MLflow-managed locations.
You can also specify your own artifact location when creating an MLflow experiment. Experiment access controls are
not enforced on artifacts stored outside of the default MLflow-managed DBFS directory.
MLflow Model permissions
You can assign six permission levels to MLflow Models registered in the MLflow Model Registry: No
Permissions , Can Read , Can Edit , Can Manage Staging Versions , Can Manage Production Versions ,
and Can Manage . The table lists the abilities for each permission.
NOTE
A model version inherits permissions from its parent model; you cannot set permissions for model versions.
| Ability | No Permissions | Can Read | Can Edit | Can Manage Staging Versions | Can Manage Production Versions | Can Manage |
|---|---|---|---|---|---|---|
| Create a model | x | x | x | x | x | x |
| View model details, versions, stage transition requests, activities, and artifact download URIs | | x | x | x | x | x |
| Request a model version stage transition | | x | x | x | x | x |
| Add a version to a model | | | x | x | x | x |
| Update model and version description | | | x | x | x | x |
| Add or edit tags for a model or model version | | | x | x | x | x |
| Transition model version between stages | | | | x (between None, Archived, and Staging) | x | x |
| Approve or reject a model version stage transition request | | | | x (between None, Archived, and Staging) | x | x |
| Cancel a model version stage transition request (see Note) | | | | | | x |
| Modify permissions | | | | | | x |
| Rename model | | | | | | x |
| Delete model and model versions | | | | | | x |
NOTE
The creator of a stage transition request can also cancel the request.
NOTE
This section describes how to manage permissions using the UI. You can also use the Permissions API 2.0.
3. Follow the steps listed in Configure MLflow Model permissions, starting at step 4.
When you navigate to a specific model page, permissions set at the registry-wide level are marked
“inherited”.
NOTE
A user with Can Manage permission at the registry-wide level can change registry-wide permissions for all other users.
To get the exact location of the files for a model version, you must have Read access to the model. Use the REST
API endpoint /api/2.0/mlflow/model-versions/get-download-uri .
After obtaining the URI, you can use the DBFS API 2.0 to download the files.
The MLflow Client (for Python, Java, and R) provides several convenience methods that wrap this workflow to
download and load the model, such as mlflow.<flavor>.load_model() .
NOTE
Other access mechanisms, such as dbutils and %fs are not supported for MLflow-managed file locations.
To control who can run jobs and see the results of job runs, see Jobs access control.
Cluster access control
7/21/2022 • 3 minutes to read
NOTE
Access control is available only in the Premium Plan.
By default, all users can create and modify clusters unless an administrator enables cluster access control. With
cluster access control, permissions determine a user’s abilities. This article describes the permissions.
Before you can use cluster access control, an Azure Databricks admin must enable it for the workspace. See
Enable cluster access control for your workspace.
Types of permissions
You can configure two types of cluster permissions:
The Allow unrestricted cluster creation entitlement controls your ability to create clusters.
Cluster-level permissions control your ability to use and modify a specific cluster.
When cluster access control is enabled:
An administrator can configure whether a user can create clusters.
Any user with Can Manage permission for a cluster can configure whether a user can attach to, restart,
resize, and manage that cluster.
Cluster-level permissions
There are four permission levels for a cluster: No Permissions, Can Attach To, Can Restart, and Can Manage. The table lists the abilities for each permission.
IMPORTANT
Users with Can Attach To permissions can view the service account keys in the log4j file. Use caution when granting this
permission level.
| Ability | No Permissions | Can Attach To | Can Restart | Can Manage |
|---|---|---|---|---|
| Attach notebook to cluster | | x | x | x |
| View Spark UI | | x | x | x |
| Terminate cluster | | | x | x |
| Start cluster | | | x | x |
| Restart cluster | | | x | x |
| Edit cluster | | | | x |
| Attach library to cluster | | | | x |
| Resize cluster | | | | x |
| Modify permissions | | | | x |
NOTE
Secrets are not redacted from the Spark driver log streams stdout and stderr . To protect secrets that might
appear in those driver log streams such that only users with the Can Manage permission on the cluster can view them,
set the cluster’s Spark configuration property spark.databricks.acl.needAdminPermissionToViewLogs true .
You have Can Manage permission for any cluster that you create.
Cluster access control must be enabled and you must have Can Manage permission for the cluster.
NOTE
This entitlement cannot be removed from admin users.
2. After you create all of the cluster configurations that you want your users to use, give the users who need
access to a given cluster Can Restart permission. This allows a user to freely start and stop the cluster
without having to set up all of the configurations manually.
Terraform integration
You can manage permissions in a fully automated setup using Databricks Terraform provider and
databricks_permissions:
resource "databricks_group" "auto" {
  display_name = "Automation"
}

# The "eng" and "ds" groups and the cluster being shared are assumed to be
# defined elsewhere in your configuration; adjust the references to match.
resource "databricks_permissions" "cluster_usage" {
  cluster_id = databricks_cluster.shared_autoscaling.id

  access_control {
    group_name       = databricks_group.auto.display_name
    permission_level = "CAN_ATTACH_TO"
  }

  access_control {
    group_name       = databricks_group.eng.display_name
    permission_level = "CAN_RESTART"
  }

  access_control {
    group_name       = databricks_group.ds.display_name
    permission_level = "CAN_MANAGE"
  }
}
Pool access control
7/21/2022 • 2 minutes to read
IMPORTANT
This feature is in Public Preview.
NOTE
Access control is available only in the Premium Plan.
By default, all users can create and modify pools unless an administrator enables pool access control. With pool
access control, permissions determine a user’s abilities. This article describes the individual permissions and
how to configure pool access control.
Before you can use pool access control, an Azure Databricks admin must enable it for the workspace. See Enable
pool access control for your workspace.
Pool permissions
There are three permission levels for a pool: No Permissions , Can Attach To , and Can Manage . The table
lists the abilities for each permission.
| Ability | No Permissions | Can Attach To | Can Manage |
|---|---|---|---|
| Delete pool | | | x |
| Edit pool | | | x |
The only way to grant a user or group permission to create a pool is through the SCIM API. Follow the SCIM API
2.0 documentation and grant the user the allow-instance-pool-create entitlement.
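A hedged sketch of granting that entitlement through the SCIM API follows; the workspace URL, token, and user ID are placeholders, and the request body uses the standard SCIM PatchOp format.

```python
import requests

# Placeholders: workspace URL, an admin personal access token, and the numeric
# SCIM ID of the target user.
host = "https://<databricks-instance>"
token = "<personal-access-token>"
user_id = "<scim-user-id>"

payload = {
    "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
    "Operations": [
        {
            "op": "add",
            "path": "entitlements",
            "value": [{"value": "allow-instance-pool-create"}],
        }
    ],
}

resp = requests.patch(
    f"{host}/api/2.0/preview/scim/v2/Users/{user_id}",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
```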
Terraform integration
You can manage permissions in a fully automated setup using Databricks Terraform provider and
databricks_permissions:
# The instance pool and the "auto" and "eng" groups are assumed to be defined
# elsewhere in your configuration.
resource "databricks_permissions" "pool_usage" {
  instance_pool_id = databricks_instance_pool.this.id

  access_control {
    group_name       = databricks_group.auto.display_name
    permission_level = "CAN_ATTACH_TO"
  }

  access_control {
    group_name       = databricks_group.eng.display_name
    permission_level = "CAN_MANAGE"
  }
}
Jobs access control
7/21/2022 • 2 minutes to read
NOTE
Access control is available only in the Premium Plan.
Enabling access control for jobs allows job owners to control who can view job results or manage runs of a job.
This article describes the individual permissions and how to configure jobs access control.
Before you can use jobs access control, an Azure Databricks admin must enable it for the workspace. See Enable
jobs access control for your workspace.
Job permissions
There are five permission levels for jobs: No Permissions , Can View , Can Manage Run , Is Owner , and Can
Manage . Admins are granted the Can Manage permission by default, and they can assign that permission to
non-admin users.
NOTE
The job owner can be changed only by an admin.
| Ability | No Permissions | Can View | Can Manage Run | Is Owner | Can Manage |
|---|---|---|---|---|---|
| View results, Spark UI, logs of a job run | | x | x | x | x |
| Run now | | | x | x | x |
| Cancel run | | | x | x | x |
| Modify permissions | | | | x | x |
| Delete job | | | | x | x |
| Change owner | | | | | |
NOTE
The creator of a job has Is Owner permission.
A job cannot have more than one owner.
A job cannot have a group as an owner.
Jobs triggered through Run Now assume the permissions of the job owner and not the user who issued Run Now .
For example, even if job A is configured to run on an existing cluster accessible only to the job owner (user A), a user
(user B) with Can Manage Run permission can start a new run of the job.
You can view notebook run results only if you have the Can View or higher permission on the job. This allows jobs
access control to remain intact even if the job notebook is renamed, moved, or deleted.
Jobs access control applies to jobs displayed in the Databricks Jobs UI and their runs. It doesn’t apply to runs spawned
by modularized or linked code in notebooks or runs submitted by API whose ACLs are bundled with the notebooks.
Terraform integration
You can manage permissions in a fully automated setup using Databricks Terraform provider and
databricks_permissions:
resource "databricks_group" "auto" {
  display_name = "Automation"
}

# The "eng" group and the data sources for the Spark version and node type are
# assumed to be defined elsewhere in your configuration; the job name is illustrative.
resource "databricks_job" "this" {
  name = "Featurization"

  new_cluster {
    num_workers   = 300
    spark_version = data.databricks_spark_version.latest.id
    node_type_id  = data.databricks_node_type.smallest.id
  }

  notebook_task {
    notebook_path = "/Production/MakeFeatures"
  }
}

resource "databricks_permissions" "job_usage" {
  job_id = databricks_job.this.id

  access_control {
    group_name       = "users"
    permission_level = "CAN_VIEW"
  }

  access_control {
    group_name       = databricks_group.auto.display_name
    permission_level = "CAN_MANAGE_RUN"
  }

  access_control {
    group_name       = databricks_group.eng.display_name
    permission_level = "CAN_MANAGE"
  }
}
Delta Live Tables access control
7/21/2022 • 2 minutes to read
NOTE
Access control is available only in the Premium Plan.
Enabling access control for Delta Live Tables allows pipeline owners to control access to pipelines, including
permissions to view pipeline details, start and stop pipeline updates, and manage pipeline settings. This article
describes the individual permissions and how to configure pipeline access control.
NOTE
The pipeline owner can be changed only by an admin.
The following table lists the access for each permission. The Admin column specifies access for workspace
admins.
| Ability | No Permissions | Can View | Can Run | Can Manage | Is Owner | Admin |
|---|---|---|---|---|---|---|
| View pipeline details and list pipeline | | x | x | x | x | x |
| View Spark UI and driver logs | | x | x | x | x | x |
| Start and stop a pipeline update | | | x | x | x | x |
| Stop pipeline clusters directly | | | x | x | x | x |
| Edit pipeline settings | | | | x | x | x |
| Delete the pipeline | | | | x | x | x |
| Modify pipeline permissions | | | | x | x | x |
| Change owner | | | | | | x |
NOTE
By default, the creator of a pipeline has Is Owner permission.
Each execution of a pipeline is called an update. Updates assume the permissions of the user with Is owner
permission on the pipeline and not the user who started the update. All update actions, for example, fetching
notebooks, creating clusters, and running queries, are executed as the pipeline owner.
Clusters are not re-used between pipeline owners.
A pipeline owner can only be a workspace user or service principal.
A pipeline cannot have more than one owner.
A pipeline owner can be changed only by a workspace admin.
By default, all users in all pricing plans can create secrets and secret scopes. Using secret access control,
available with the Premium Plan, you can configure fine-grained permissions for managing access control. This
guide describes how to set up these controls.
NOTE
Access control is available only in the Premium Plan. If your account has the Standard Plan, you must explicitly
grant MANAGE permission to the “users” (all users) group when you create secret scopes.
This article describes how to manage secret access control using the Databricks CLI (version 0.7.1 and above).
Alternatively, you can use the Secrets API 2.0.
Permission levels
The secret access permissions are as follows:
MANAGE - Allowed to change ACLs, and read and write to this secret scope.
WRITE - Allowed to read and write to this secret scope.
READ - Allowed to read this secret scope and list what secrets are available.
Each permission level is a subset of the previous level’s permissions (that is, a principal with WRITE permission
for a given scope can perform all actions that require READ permission).
NOTE
Databricks admins have MANAGE permissions to all secret scopes in the workspace.
Making a put request for a principal that already has an applied permission overwrites the existing permission
level.
View secret ACLs
To view all secret ACLs for a given secret scope:
To get the secret ACL applied to a principal for a given secret scope:
If no ACL exists for the given principal and scope, this request will fail.
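If you prefer the Secrets API 2.0 over the CLI, here is a minimal sketch of both requests; the host, token, scope name, and principal are placeholders.

```python
import requests

host = "https://<databricks-instance>"
headers = {"Authorization": "Bearer <personal-access-token>"}  # placeholder token

# List every ACL applied to a scope.
all_acls = requests.get(
    f"{host}/api/2.0/secrets/acls/list",
    headers=headers,
    params={"scope": "my-scope"},
)
print(all_acls.json())

# Get the ACL for a single principal; this returns an error if no ACL exists
# for that principal and scope.
one_acl = requests.get(
    f"{host}/api/2.0/secrets/acls/get",
    headers=headers,
    params={"scope": "my-scope", "principal": "datascience"},
)
print(one_acl.json())
```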
Terraform integration
You can manage permissions in a fully automated setup using Databricks Terraform provider and
databricks_secret_acl:
Sometimes accessing data requires that you authenticate to external data sources through JDBC. Instead of
directly entering your credentials into a notebook, use Azure Databricks secrets to store your credentials and
reference them in notebooks and jobs. To manage secrets, you can use the Databricks CLI to access the Secrets
API 2.0.
WARNING
Administrators, secret creators, and users granted permission can read Azure Databricks secrets. While Azure Databricks
makes an effort to redact secret values that might be displayed in notebooks, it is not possible to prevent such users from
reading secrets. For more information, see Secret redaction.
Managing secrets begins with creating a secret scope. A secret scope is a collection of secrets identified by a
name. A workspace is limited to a maximum of 100 secret scopes.
NOTE
Databricks recommends aligning secret scopes to roles or applications rather than individuals.
Overview
There are two types of secret scope: Azure Key Vault-backed and Databricks-backed.
Azure Key Vault-backed scopes
To reference secrets stored in an Azure Key Vault, you can create a secret scope backed by Azure Key Vault. You
can then leverage all of the secrets in the corresponding Key Vault instance from that secret scope. Because the
Azure Key Vault-backed secret scope is a read-only interface to the Key Vault, the PutSecret and DeleteSecret
Secrets API 2.0 operations are not allowed. To manage secrets in Azure Key Vault, you must use the Azure
SetSecret REST API or Azure portal UI.
Databricks-backed scopes
A Databricks-backed secret scope is stored in (backed by) an encrypted database owned and managed by Azure
Databricks. The secret scope name:
Must be unique within a workspace.
Must consist of alphanumeric characters, dashes, underscores, and periods, and may not exceed 128
characters.
The names are considered non-sensitive and are readable by all users in the workspace.
You create a Databricks-backed secret scope using the Databricks CLI (version 0.7.1 and above). Alternatively,
you can use the Secrets API 2.0.
Scope permissions
Scopes are created with permissions controlled by ACLs. By default, scopes are created with MANAGE permission
for the user who created the scope (the “creator”), which lets the creator read secrets in the scope, write secrets
to the scope, and change ACLs for the scope. If your account has the Premium Plan, you can assign granular
permissions at any time after you create the scope. For details, see Secret access control.
You can also override the default and explicitly grant MANAGE permission to all users when you create the scope.
In fact, you must do this if your account does not have the Premium Plan.
Best practices
As a team lead, you might want to create different scopes for Azure Synapse Analytics and Azure Blob storage
credentials and then provide different subgroups in your team access to those scopes. You should consider how
to achieve this using the different scope types:
If you use two Databricks-backed scopes and add the secrets to those scopes, they will be different secrets (Azure Synapse Analytics secrets in scope 1, and Azure Blob storage secrets in scope 2).
If you use two Azure Key Vault-backed scopes, each referencing a different Azure Key Vault, and add your secrets to those two Key Vaults, they will be different sets of secrets (Azure Synapse Analytics secrets in scope 1, and Azure Blob storage secrets in scope 2). These scopes behave like Databricks-backed scopes.
If you use two Azure Key Vault-backed scopes with both scopes referencing the same Azure Key Vault and
add your secrets to that Azure Key Vault, all Azure Synapse Analytics and Azure Blob storage secrets will be
available. Since ACLs are at the scope level, all members across the two subgroups will see all secrets. This
arrangement does not satisfy your use case of restricting access to a set of secrets to each group.
3. Enter the name of the secret scope. Secret scope names are case insensitive.
4. Use the Manage Principal drop-down to specify whether All Users have MANAGE permission for this
secret scope or only the Creator of the secret scope (that is to say, you).
MANAGE permission allows users to read and write to this secret scope, and, in the case of accounts on the
Premium Plan, to change permissions for the scope.
Your account must have the Premium Plan for you to be able to select Creator. This is the recommended
approach: grant MANAGE permission to the Creator when you create the secret scope, and then assign
more granular access permissions after you have tested the scope. For an example workflow, see Secret
workflow example.
If your account has the Standard Plan, you must set the MANAGE permission to the “All Users” group. If
you select Creator here, you will see an error message when you try to save the scope.
For more information about the MANAGE permission, see Secret access control.
5. Enter the DNS Name (for example, https://databrickskv.vault.azure.net/ ) and Resource ID , for
example:
/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourcegroups/databricks-
rg/providers/Microsoft.KeyVault/vaults/databricksKV
These properties are available from the Properties tab of an Azure Key Vault in your Azure portal.
IMPORTANT
You need an Azure AD user token to create an Azure Key Vault-backed secret scope with the Databricks CLI. You
cannot use an Azure Databricks personal access token or an Azure AD application token that belongs to a service
principal.
If the key vault exists in a different tenant than the Azure Databricks workspace, the Azure AD user who creates
the secret scope must have permission to create service principals in the key vault’s tenant. Otherwise, the
following error occurs:
By default, scopes are created with MANAGE permission for the user who created the scope. If your account does not have the Premium Plan, you must override that default and explicitly grant the MANAGE permission to the users (all users) group when you create the scope.
If your account is on the Premium Plan, you can change permissions at any time after you create the scope. For details, see Secret access control.
Once you have created a Databricks-backed secret scope, you can add secrets.
For an example of using secrets when accessing Azure Blob storage, see Mounting cloud object storage on
Azure Databricks.
By default, scopes are created with MANAGE permission for the user who created the scope. If your account does not have the Premium Plan, you must override that default and explicitly grant the MANAGE permission to “users” (all users) when you create the scope.
You can also create a Databricks-backed secret scope using the Secrets API Create secret scope operation.
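For example, a minimal sketch of that API call (the workspace URL, token, and scope name are placeholders); setting initial_manage_principal to users grants MANAGE to all users, as required on the Standard Plan.

```python
import requests

resp = requests.post(
    "https://<databricks-instance>/api/2.0/secrets/scopes/create",
    headers={"Authorization": "Bearer <personal-access-token>"},  # placeholder token
    json={
        "scope": "my-scope",
        # On the Premium Plan you can omit initial_manage_principal so that only
        # the creator gets MANAGE; on the Standard Plan it must be "users".
        "initial_manage_principal": "users",
    },
)
resp.raise_for_status()
```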
If your account has the Premium Plan, you can change permissions at any time after you create the scope. For
details, see Secret access control.
Once you have created a Databricks-backed secret scope, you can add secrets.
You can also list existing scopes using the Secrets API List secrets operation.
You can also delete a secret scope using the Secrets API Delete secret scope operation.
Secrets
7/21/2022 • 5 minutes to read
A secret is a key-value pair that stores secret material, with a key name unique within a secret scope. Each scope
is limited to 1000 secrets. The maximum allowed secret value size is 128 KB.
Create a secret
Secret names are case insensitive.
The method for creating a secret depends on whether you are using an Azure Key Vault-backed scope or a
Databricks-backed scope.
Create a secret in an Azure Key Vault-backed scope
To create a secret in Azure Key Vault you use the Azure SetSecret REST API or Azure portal UI.
# ----------------------------------------------------------------------
# Do not edit the above line. Everything that follows it will be ignored.
# Please input your secret value above the line. Text will be stored in
# UTF-8 (MB4) form and any trailing new line will be stripped.
# Exit without saving will abort writing secret.
Paste your secret value above the line and save and exit the editor. Your input is stripped of the comments and
stored associated with the key in the scope.
If you issue a write request with a key that already exists, the new value overwrites the existing value.
You can also provide a secret from a file or from the command line. For more information about writing secrets,
see Secrets CLI.
List secrets
To list secrets in a given scope:
The response displays metadata about the secret, such as the secret key name and the last updated timestamp (in milliseconds since epoch). To read the secret value itself, use the Secrets utility (dbutils.secrets) in a notebook or job, as described in the next section.
Read a secret
You create secrets using the REST API or CLI, but you must use the Secrets utility (dbutils.secrets) in a notebook
or job to read a secret.
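For example, a minimal sketch in a notebook, assuming a scope named my-scope and a key named password:

```python
# List the secret keys you can read in a scope, then fetch one value. The value
# returned by dbutils.secrets.get() is redacted if you try to display it in
# notebook cell output.
for s in dbutils.secrets.list("my-scope"):
    print(s.key)

password = dbutils.secrets.get(scope="my-scope", key="password")
```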
Delete a secret
To delete a secret from a scope with the Databricks CLI:
NOTE
Available in Databricks Runtime 6.4 Extended Support and above.
You can reference a secret in a Spark configuration property or environment variable. Retrieved secrets are
redacted from notebook output and Spark driver and executor logs.
IMPORTANT
Keep the following security implications in mind when referencing secrets in a Spark configuration property or
environment variable:
If table access control is not enabled on a cluster, any user with Can Attach To permissions on a cluster or Run
permissions on a notebook can read Spark configuration properties from within the notebook. This includes users
who do not have direct permission to read a secret. Databricks recommends enabling table access control on all
clusters or managing access to secrets using secret scopes.
Even when table access control is enabled, users with Can Attach To permissions on a cluster or Run permissions
on a notebook can read cluster environment variables from within the notebook. Databricks does not recommend
storing secrets in cluster environment variables if they must not be available to all users on the cluster.
Secrets are not redacted from the Spark driver log stdout and stderr streams. By default, Spark driver logs are
viewable by users with any of the following cluster level permissions:
Can Attach To
Can Restart
Can Manage
You can optionally limit who can read Spark driver logs to users with the Can Manage permission by setting the
cluster’s Spark configuration property spark.databricks.acl.needAdminPermissionToViewLogs true
NOTE
There should be no spaces between the curly brackets. If there are spaces, they are treated as part of the scope or
secret name.
Any Spark configuration <property-name> can reference a secret. Each Spark configuration property can only
reference one secret, but you can configure multiple Spark properties to reference secrets.
Example
You set a Spark configuration to reference a secret:
spark.password {{secrets/scope1/key1}}
spark.conf.get("spark.password")
SQL
SELECT ${spark.password};
<variable-name>={{secrets/<scope-name>/<secret-name>}}
Environment variables that reference secrets have special behavior. These environment variables are accessible
from a cluster-scoped init script, but are not accessible from a program running in Spark.
Example
You set an environment variable to reference a secret:
SPARKPASSWORD={{secrets/scope1/key1}}
if [ -n "$SPARKPASSWORD" ]; then
use ${SPARKPASSWORD}
fi
Secret redaction
7/21/2022 • 2 minutes to read
Storing credentials as Azure Databricks secrets makes it easy to protect your credentials when you run
notebooks and jobs. However, it is easy to accidentally print a secret to standard output buffers or display the
value during variable assignment.
To prevent this, Azure Databricks redacts secret values that are read using dbutils.secrets.get() . When
displayed in notebook cell output, the secret values are replaced with [REDACTED] .
WARNING
Secret redaction for notebook cell output applies only to literals. The secret redaction functionality therefore does not
prevent deliberate and arbitrary transformations of a secret literal. To ensure the proper control of secrets, you should use
Workspace object access control (limiting permission to run commands) to prevent unauthorized access to shared
notebook contexts.
Secret workflow example
7/21/2022 • 2 minutes to read
In this workflow example, we use secrets to set up JDBC credentials for connecting to an Azure Data Lake Store.
To create an Azure Key Vault-backed secret scope, follow the instructions in Create an Azure Key Vault-backed
secret scope.
NOTE
If your account does not have the Premium Plan, you must create the scope with MANAGE permission granted to all users
(“users”). For example:
Create secrets
The method for creating the secrets depends on whether you are using an Azure Key Vault-backed scope or a
Databricks-backed scope.
Create the secrets in an Azure Key Vault-backed scope
Add the secrets username and password using the Azure SetSecret REST API or Azure portal UI:
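In a notebook, you can then read both secrets and build the JDBC connection properties. The sketch below assumes a scope named jdbc; the JDBC URL, driver, and table name are placeholders.

```python
# Placeholders: adjust the scope name, JDBC URL, driver, and table to your source.
jdbc_url = "jdbc:sqlserver://<server-name>.database.windows.net:1433;database=<database>"

connection_properties = {
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "user": dbutils.secrets.get(scope="jdbc", key="username"),
    "password": dbutils.secrets.get(scope="jdbc", key="password"),
}

df = spark.read.jdbc(url=jdbc_url, table="<schema>.<table>", properties=connection_properties)
```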
You can now use these connection properties with the JDBC connector to talk to your data source. The values fetched from the scope are never displayed in the notebook (see Secret redaction).
After verifying that the credentials were configured correctly, share these credentials with the datascience
group to use for their analysis.
Grant the datascience group read-only permission to these credentials by making the following request:
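A minimal sketch of that request using the Secrets API 2.0 (the workspace URL and token are placeholders):

```python
import requests

resp = requests.post(
    "https://<databricks-instance>/api/2.0/secrets/acls/put",
    headers={"Authorization": "Bearer <personal-access-token>"},  # placeholder token
    json={"scope": "jdbc", "principal": "datascience", "permission": "READ"},
)
resp.raise_for_status()
```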
This article describes how you can use Delta Lake on Azure Databricks to manage General Data Protection
Regulation (GDPR) and California Consumer Privacy Act (CCPA) compliance for your data lake. Because Delta
Lake adds a transactional layer that provides structured data management on top of your data lake, it can
dramatically simplify and speed up your ability to locate and remove personal information (also known as
“personal data”) in response to consumer GDPR or CCPA requests.
The challenge
Your organization may manage hundreds of terabytes worth of personal information in your cloud. Bringing
these datasets into GDPR and CCPA compliance is of paramount importance, but this can be a big challenge,
especially for larger datasets stored in data lakes.
The challenge typically arises from the following factors:
When you have large amounts (petabyte scale) of data in the cloud, user data can be stored and distributed
across multiple datasets and locations.
Point or ad-hoc queries to find data for specific users is expensive (akin to finding a needle in a haystack),
because it often requires full table scans. Taking a brute force approach to GDPR/CCPA compliance can result
in multiple jobs operating over different tables, resulting in weeks of engineering and operational effort.
Data lakes are inherently append-only and do not support the ability to perform row level “delete” or
“update” operations natively, which means that you must rewrite partitions of data. Typical data lake
offerings do not provide ACID transactional capabilities or efficient methods to find relevant data. Moreover,
read/write consistency is also a concern: while user data is being redacted from the data lake, processes that
read data should be protected from material impacts the way they would with a traditional RDBMS.
Data hygiene in the data lake is challenging, given that data lakes by design support availability and partition
tolerance with eventual consistency. Enforceable and rigorous practices and standards are required to assure
cleansed data.
As a result, organizations that manage user data at this scale often end up writing computationally difficult,
expensive, and time-consuming data pipelines to deal with GDPR and CCPA. For example, you might upload
portions of your data lake into proprietary data warehousing technologies, where GDPR and CCPA compliance-
related deletion activities are performed. This adds complexity and reduces data fidelity by forcing multiple
copies of the data. Moreover, exporting data from such warehouse technologies back into a data lake may
require re-optimization to improve query performance. This too results in multiple copies of data being created
and maintained.
The list of customers requesting to be forgotten per GDPR and CCPA come from a transactional database table,
gdpr.customer_delete_keys , that is populated using an online portal. The keys (distinct users) to be deleted
represent roughly 10% (337.615 MB) of the original keys sampled from the original dataset in gdpr.customers .
The schema of the gdpr.customer_delete_keys table contains the following fields:
NOTE
The following example involves a straightforward delete of customer personal data from the customers table. A better
practice is to pseudonymize all customer personal information in your working tables (prior to receiving a data subject
request) and delete the customer entry from the “lookup table” that maps the customer to the pseudonym, while
ensuring that data in working tables cannot be used to reconstruct the customer’s identity. For details, see Pseudonymize
data.
NOTE
The following examples make reference to performance numbers as a way of illustrating the impact of certain
performance options. These numbers were recorded on the dataset described above, on a cluster with 3 worker nodes,
each with 90 GB memory and 12 cores; the driver had 30GB memory and 4 cores.
Here is a simple Delta Lake DELETE FROM operation, deleting the customers included in the
customer_delete_keys table from our sample gdpr.customers table:
DELETE FROM gdpr.customers AS t1 WHERE EXISTS (SELECT c_customer_id FROM gdpr.customer_delete_keys WHERE
t1.c_customer_id = c_customer_id)
During testing, this operation took too long to complete: finding files took 32 seconds and rewriting files took
2.6 min. To reduce the time to find the relevant files, you can increase the broadcast threshold:
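A sketch of one way to do this from a notebook; the 100 MB threshold shown here is an illustrative value, not a recommendation from the original test.

```python
# Raise the broadcast join threshold so the relatively small delete-keys table
# is broadcast to every executor instead of being shuffled. 100 MB is illustrative;
# size it to comfortably cover your keys table.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)
```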
With the keys table under the threshold, Spark broadcasts it to every worker when joining it with the target table rather than shuffling both sides. This setting dropped file-finding to 8 seconds and writing to 1.6 minutes.
You can speed up performance even more with Delta Lake Z-Ordering (multi-dimensional clustering). Z-
Ordering creates a range partition-based arrangement of data and indexes this information in the Delta table.
Delta Lake uses this z-index to find files impacted by the DELETE operation.
To take advantage of Z-Ordering, you must understand how the data you expect to be deleted is spread across
the target table. For example, if the data, even for a few keys, is spread across 90% of the files for the dataset,
you’ll be rewriting more than 90% of your data. Z-Ordering by relevant key columns reduces the number of files
touched and can make rewrites much more efficient.
In this case, you should Z-Order by the c_customer_id column before running delete:
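A sketch of the corresponding command, run against the target table before the delete:

```python
# Cluster the customers table on the delete key so the DELETE touches fewer files.
spark.sql("OPTIMIZE gdpr.customers ZORDER BY (c_customer_id)")
```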
After Z-Ordering, finding files took 7 secs and writing dropped to 50 seconds.
Step 3: Clean up stale data
Depending on how long after a consumer request you delete your data and on your underlying data lake, you
may need to delete table history and underlying raw data.
By default, Delta Lake retains table history for 30 days and makes it available for “time travel” and rollbacks.
That means that, even after you have deleted personal information from a Delta table, users in your organization
may be able to view that historical data and roll back to a version of the table in which the personal information
is still stored. If you determine that GDPR or CCPA compliance requires that these stale records be made
unavailable for querying before the default retention period is up, you can use the VACUUM function to remove
files that are no longer referenced by a Delta table and are older than a specified retention threshold. Once you
have removed table history using the VACUUM command, all users lose the ability to view that history and roll
back.
To delete all customers who requested that their information be deleted (using the DELETE statement shown earlier) and then remove all table history older than 7 days, you simply run:
VACUUM gdpr.customers
To remove artifacts younger than 7 days, use the RETAIN num HOURS option:
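For example, a sketch that keeps only the last 24 hours of history; vacuuming below the default threshold also requires disabling Delta's retention duration safety check, which this sketch does for the session.

```python
# Allow VACUUM below the default 7-day retention threshold for this session.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Keep only the last 24 hours of table history and data files.
spark.sql("VACUUM gdpr.customers RETAIN 24 HOURS")
```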
In addition, if you created Delta tables using Spark APIs to rewrite non-Parquet files to Delta (as opposed to
converting Parquet files to Delta Lake in-place), your raw data may still contain personal information that you
have deleted or anonymized. Databricks recommends that you set up a retention policy with your cloud
provider of thirty days or less to remove raw data automatically.
Pseudonymize data
While the deletion method described above can, strictly, permit your organization to comply with the GDPR and
CCPA requirement to perform deletions of personal information, it comes with a number of downsides. The first
is that the GDPR does not permit any additional processing of personal information once a valid request to
delete has been received. As a consequence, if the data is not stored in a pseudonymized fashion—that is,
replacing personally identifiable information with an artificial identifier or pseudonym—prior to the receipt of
the data subject request, you are obligated to simply delete all of the linked information. If, however, you have
previously pseudonymized the underlying data, your obligations to delete are satisfied by the simple destruction
of any record that links the identifier to the pseudonym (assuming the remaining data is not itself identifiable),
and you may retain the remainder of the data.
In a typical pseudonymization scenario, you keep a secured “lookup table” that maps the customer’s personal
identifiers (name, email address, etc) to the pseudonym. This has the advantage not only of making deletion
easier, but also of allowing you to “restore” the user identity temporarily to update user data over time, an
advantage denied in an anonymization scenario, in which by definition a customer’s identity can never be
restored, and all customer data is by definition static and historical.
For a simple pseudonymization example, consider the customer table updated in the deletion example. In the
pseudonymization scenario, you can create a gdpr.customers_lookup table that contains all customer data that
could be used to identify the customer, with an additional column for a pseudonymized email address. Now, you
can use the pseudo email address as the key in any data tables that reference customers, and when there is a
request to forget this information, you can simply delete that information from the gdpr.customers_lookup table
and the rest of the information can remain non-identifiable forever.
The schema of the gdpr.customers_lookup table is:
In this scenario, put the remaining customer data, which cannot be used to identify the customer, in a
pseudonymized table called gdpr.customers_pseudo :
NOTE
This is a simple example to illustrate salting. Using the same salt for all of your customer keys is not a good way to
mitigate attacks; it just makes the customer keys longer. A more secure approach would be to generate a random salt for
each user. See Make your pseudonymization stronger.
Once you salt the column c_email_address , you can hash it and add the hash to the gdpr.customers_lookup
table as c_email_address_pseudonym :
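A hypothetical sketch of that step; the source table name, secret scope, and single shared salt are illustrative only (see the next section on per-user salts).

```python
from pyspark.sql import functions as F

# Illustrative only: a single shared salt read from a secret scope. A stronger
# design uses a distinct salt per customer, as discussed below.
salt = dbutils.secrets.get(scope="gdpr", key="customer-salt")

lookup = (
    spark.table("gdpr.customers_raw")  # hypothetical table of identifiable customer data
    .withColumn(
        "c_email_address_pseudonym",
        F.sha2(F.concat(F.col("c_email_address"), F.lit(salt)), 256),
    )
)
lookup.write.format("delta").mode("overwrite").saveAsTable("gdpr.customers_lookup")
```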
Now you can use this value for all of your customer-keyed tables.
Make your pseudonymization stronger
To reduce the risk that a compromise of a single salt could have on your database, it is advisable where practical
to use different salts (one per customer, or even per user). Provided that the data attached to the pseudonymous
identifier does not itself contain any information that can identify an individual, if you delete your record of
which salt is related to which user and cannot recreate it, the remaining data should be rendered fully
anonymous and therefore fall outside of the scope of the GDPR and the CCPA. Many organizations choose to
create multiple salts per user and create fully anonymized data outside of the scope of data protection law by
rotating these salts periodically according to business need.
And don’t forget that whether data is “personal” or “identifiable” is not an element-level analysis, but essentially
an array-level analysis. So while obvious things like email addresses are clearly personal, combinations of things
that by themselves would not be personal can also be personal. See, for example
https://aboutmyinfo.org/identity/about: based on an analysis of the 1990 US Census, 87% of the United States
population is uniquely identifiable by the three attributes of zip code, date of birth, and gender. So when you’re
deciding what should be stored as part of the personal identifiers table or the working tables with only
pseudonymous information, make sure to think about whether or not the collision of the seemingly non-
identifiable information might itself be identifiable. And make sure for your own privacy compliance that you
have internal processes that prevent attempts to re-identify individuals with the information you intended to be
non-identifiable (for example differential privacy, privacy preserving histograms, etc.). While it may never be
possible to completely prevent re-identification, following these steps will go a long way towards helping.
Learn more
To learn more about Delta Lake on Azure Databricks, see Delta Lake guide.
For blogs about using Delta Lake for GDPR and CCPA compliance written by Databricks experts, see:
How to Avoid Drowning in GDPR Data Subject Requests in a Data Lake
Make Your Data Lake CCPA Compliant with a Unified Approach to Data and Analytics
Efficient Upserts into Data Lakes with Databricks Delta
To learn about purging personal information in the Azure Databricks workspace, see Manage workspace
storage.
IP access lists
7/21/2022 • 6 minutes to read
Security-conscious enterprises that use cloud SaaS applications need to restrict access to their own employees.
Authentication helps to prove user identity, but that does not enforce network location of the users. Accessing a
cloud service from an unsecured network can pose security risks to an enterprise, especially when the user may
have authorized access to sensitive or personal data. Enterprise network perimeters apply security policies and
limit access to external services (for example, firewalls, proxies, DLP, and logging), so access beyond these
controls is assumed to be untrusted.
For example, suppose a hospital employee accesses an Azure Databricks workspace. If the employee walks from
the office to a coffee shop, the hospital can block connections to the Azure Databricks workspace even if the
employee has valid credentials to access the web application and the REST API.
You can configure Azure Databricks workspaces so that employees connect to the service only through existing
corporate networks with a secure perimeter. Azure Databricks customers can use the IP access lists feature to
define a set of approved IP addresses. All incoming access to the web application and REST APIs requires the
user to connect from an authorized IP address.
Employees who are remote or traveling can use a VPN to connect to the corporate network, which in turn enables access to the workspace. Using the previous example, the hospital could allow an employee to use a VPN from the coffee shop to access the Azure Databricks workspace.
Requirements
This feature requires the Premium Plan.
Flexible configuration
The IP access lists feature is flexible:
Your own workspace administrators control the set of IP addresses on the public Internet that are allowed
access. This is known as the allow list. Allow multiple IP addresses explicitly or as entire subnets (for example
216.58.195.78/28).
Workspace administrators can optionally specify IP addresses or subnets to block even if they are included in
the allow list. This is known as the block list. You might use this feature if an allowed IP address range
includes a smaller range of infrastructure IP addresses that in practice are outside the actual secure network
perimeter.
Workspace administrators use REST APIs to update the list of allowed and blocked IP addresses and subnets.
Feature details
The IP Access List API enables Azure Databricks admins to configure IP allow lists and block lists for a
workspace. If the feature is disabled for a workspace, all access is allowed. There is support for allow lists
(inclusion) and block lists (exclusion).
When a connection is attempted:
1. First all block lists are checked. If the connection IP address matches any block list, the connection is
rejected.
2. If the connection was not rejected by block lists , the IP address is compared with the allow lists. If
there is at least one allow list for the workspace, the connection is allowed only if the IP address matches
an allow list. If there are no allow lists for the workspace, all IP addresses are allowed.
For all allow lists and block lists combined, the workspace supports a maximum of 1000 IP/CIDR values, where
one CIDR counts as a single value.
After changes to the IP access list feature, it can take a few minutes for changes to take effect.
How to use the IP access list API
This article discusses the most common tasks you can perform with the API. For the complete REST API
reference, download the OpenAPI spec and view it directly or using an application that reads OpenAPI 3.0. For
more details on using the OpenAPI spec, see IP Access List API 2.0.
To learn about authenticating to Azure Databricks APIs, see Authentication using Azure Databricks personal
access tokens.
The base path for the endpoints described in this article is https://<databricks-instance>/api/2.0 , where
<databricks-instance> is the adb-<workspace-id>.<random-number>.azuredatabricks.net domain name of your
Azure Databricks deployment.
curl -X GET -n \
https://<databricks-instance>/api/2.0/workspace-conf?keys=enableIpAccessLists
Example response:
{
"enableIpAccessLists": "true"
}
curl -X PATCH -n \
https://<databricks-instance>/api/2.0/workspace-conf \
-d '{
"enableIpAccessLists": "true"
}'
Example response:
{
"enableIpAccessLists": "true"
}
The response is a copy of the object that you passed in, but with some additional fields, most importantly the
list_id field. You may want to save that value so you can update or delete the list later. If you do not save it,
you are still able to get the ID later by querying the full set of IP access lists with a GET request to the
/ip-access-lists endpoint.
curl -X POST -n \
https://<databricks-instance>/api/2.0/ip-access-lists \
-d '{
"label": "office",
"list_type": "ALLOW",
"ip_addresses": [
"1.1.1.1",
"2.2.2.2/21"
]
}'
Example response:
{
"ip_access_list": {
"list_id": "<list-id>",
"label": "office",
"ip_addresses": [
"1.1.1.1",
"2.2.2.2/21"
],
"address_count": 2,
"list_type": "ALLOW",
"created_at": 1578423494457,
"created_by": 6476783916686816,
"updated_at": 1578423494457,
"updated_by": 6476783916686816,
"enabled": true
}
}
To add a block list, do the same thing but with list_type set to BLOCK .
curl -X PATCH -n \
https://<databricks-instance>/api/2.0/ip-access-lists/<list-id> \
-d '{ "enabled": "false" }'
The response is a copy of the object that you passed in with additional fields for the ID and modification dates.
For example, to replace the contents of the specified list with the following values:
curl -X PUT -n \
https://<databricks-instance>/api/2.0/ip-access-lists/<list-id> \
-d '{
"label": "office",
"list_type": "ALLOW",
"ip_addresses": [
"1.1.1.1",
"2.2.2.2/21"
],
"enabled": "false"
}'
curl -X DELETE -n \
https://<databricks-instance>/api/2.0/ip-access-lists/<list-id>
Configure domain name firewall rules
7/21/2022 • 2 minutes to read
If your corporate firewall blocks traffic based on domain names, you must allow HTTPS and WebSocket traffic to
Azure Databricks domain names to ensure access to Azure Databricks resources. You can choose between two
options, one more permissive but easier to configure, the other specific to your workspace domains.
With secure cluster connectivity enabled, customer virtual networks have no open ports and Databricks
Runtime cluster nodes have no public IP addresses. Secure cluster connectivity is also known as No Public IP
(NPIP).
At a network level, each cluster initiates a connection to the control plane secure cluster connectivity relay
during cluster creation. The cluster establishes this connection using port 443 (HTTPS) and uses a different IP
address than is used for the Web application and REST API.
When the control plane logically starts new Databricks Runtime jobs or performs other cluster
administration tasks, these requests are sent to the cluster through this tunnel.
The data plane (the VNet) has no open ports, and Databricks Runtime cluster nodes have no public IP
addresses.
Benefits:
Easy network administration, with no need to configure ports on security groups or to configure network
peering.
With enhanced security and simple network administration, information security teams can expedite
approval of Databricks as a PaaS provider.
NOTE
All Azure Databricks network traffic between the data plane VNet and the Azure Databricks control plane goes across the
Microsoft network backbone, not the public Internet. This is true even if secure cluster connectivity is disabled.
Use secure cluster connectivity
To use secure cluster connectivity with a new Azure Databricks workspace, use any of the following options.
Azure Portal: When you provision the workspace, go to the Networking tab and set the option Deploy
Azure Databricks workspace with Secure Cluster Connectivity (No Public IP) to Yes .
ARM Templates: For the Microsoft.Databricks/workspaces resource that creates your new workspace, set the
enableNoPublicIp Boolean parameter to true .
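A hedged sketch of the relevant fragment of the Microsoft.Databricks/workspaces resource (other properties omitted; the parameter sits under the resource's properties.parameters block):
"properties": {
  "parameters": {
    "enableNoPublicIp": {
      "value": true
    }
  }
}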
IMPORTANT
In either case, you must register the Azure Resource Provider Microsoft.ManagedIdentity in the Azure subscription
that is used to launch workspaces with secure cluster connectivity. This is a one-time operation per subscription. For
instructions, see Azure resource providers and types.
You cannot add secure cluster connectivity to an existing workspace. For information about migrating your
resources to the new workspaces, contact your Microsoft or Databricks account team for details.
If you’re using ARM templates, add the parameter to one of the following templates, based on whether you
want Azure Databricks to create a default (managed) virtual network for the workspace, or if you want to use
your own virtual network, also known as VNet injection. VNet injection is an optional feature that allows you to
provide your own VNet to host new Azure Databricks clusters.
ARM template to set up a workspace using the default (managed) VNet.
ARM template to set up a workspace using VNet injection.
Egress from workspace subnets
When you enable secure cluster connectivity, both of your workspace subnets are private subnets, since cluster
nodes do not have public IP addresses.
The implementation details of network egress vary based on whether you use the default (managed) VNet or
whether you use the optional VNet injection feature to provide your own VNet in which to deploy your
workspace. See the following sections for details.
IMPORTANT
Additional costs may be incurred due to increased egress traffic when you use secure cluster connectivity. For a smaller
organization that needs a cost-optimized solution, it may be acceptable to disable secure cluster connectivity when you
deploy your workspace. However, for the most secure deployment, Microsoft and Databricks strongly recommend that
you enable secure cluster connectivity.
IMPORTANT
The example init script that is referenced in this article derives its shared encryption secret from the hash of the keystore
stored in DBFS. If you rotate the secret by updating the keystore file in DBFS, all running clusters must be restarted.
Otherwise, Spark workers may fail to authenticate with the Spark driver due to an inconsistent shared secret, causing jobs
to slow down. Furthermore, since the shared secret is stored in DBFS, any user with DBFS access can retrieve the secret
using a notebook. For further guidance, contact your representative.
Requirements
This feature requires the Premium plan. Contact your Databricks account representative for more
information.
User queries and transformations are typically sent to your clusters over an encrypted channel. By default,
however, the data exchanged between worker nodes in a cluster is not encrypted. If your environment requires
that data be encrypted at all times, whether at rest or in transit, you can create an init script that configures your
clusters to encrypt traffic between worker nodes, using AES 128-bit encryption over a TLS 1.2 connection.
NOTE
Although AES enables cryptographic routines to take advantage of hardware acceleration, there’s a performance penalty
compared to unencrypted traffic. This penalty can result in queries taking longer on an encrypted cluster, depending on
the amount of data shuffled between nodes.
Enabling encryption of traffic between worker nodes requires setting Spark configuration parameters through
an init script. You can use a cluster-scoped init script for a single cluster or a global init script if you want all
clusters in your workspace to use worker-to-worker encryption.
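As a hedged sketch (the script path is a placeholder), a cluster-scoped init script is attached through the init_scripts field of a cluster definition in the Clusters API:
"init_scripts": [
  {
    "dbfs": {
      "destination": "dbfs:/<init-script-directory>/encrypt-worker-traffic.sh"
    }
  }
]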
As a one-time setup task, copy the keystore file to a directory in DBFS. Then create the init script that applies the
encryption settings.
The init script must perform the following tasks:
1. Get the JKS keystore file and password.
2. Set the Spark executor configuration.
3. Set the Spark driver configuration.
NOTE
The JKS keystore file used for enabling SSL/HTTPS is dynamically generated for each workspace. The JKS keystore file’s
password is hardcoded and not intended to protect the confidentiality of the keystore.
The following is an example init script that implements these three tasks to generate the cluster encryption
configuration.
#!/bin/bash

keystore_dbfs_file="/dbfs/<keystore_directory>/jetty_ssl_driver_keystore.jks"

## Wait until the keystore file is visible through the DBFS FUSE mount
max_attempts=30
while [ ! -f ${keystore_dbfs_file} ];
do
  if [ "$max_attempts" == 0 ]; then
    echo "ERROR: Unable to find the file: $keystore_dbfs_file. Failing the script."
    exit 1
  fi
  sleep 2s
  ((max_attempts--))
done

## Derive shared internode encryption secret from the hash of the keystore file
sasl_secret=$(sha256sum $keystore_dbfs_file | cut -d' ' -f1)
if [ -z "${sasl_secret}" ]; then
  echo "ERROR: Unable to derive the secret. Failing the script."
  exit 1
fi

## Local keystore path and password (placeholder values; substitute the values for your workspace)
local_keystore_file="$DB_HOME/keys/jetty_ssl_driver_keystore.jks"
local_keystore_password="<local-keystore-password>"

## Driver configuration file (assumed path; adjust if your driver configuration lives elsewhere)
driver_conf="$DB_HOME/driver/conf/config.conf"
if [ ! -e $driver_conf ] ; then
  touch $driver_conf
fi

## Append the worker-to-worker TLS settings to spark-defaults.conf
spark_defaults_conf="$DB_HOME/spark/conf/spark-defaults.conf"
echo "Configuring spark defaults conf at $spark_defaults_conf"
if [ ! -e $spark_defaults_conf ] ; then
  touch $spark_defaults_conf
fi

cat >> $spark_defaults_conf <<EOF
spark.ssl.enabled true
spark.ssl.keyPassword $local_keystore_password
spark.ssl.keyStore $local_keystore_file
spark.ssl.keyStorePassword $local_keystore_password
spark.ssl.protocol TLSv1.3
spark.ssl.standalone.enabled true
spark.ssl.ui.enabled true
EOF
Once the initialization of the driver and worker nodes is complete, all traffic between these nodes is encrypted
using the keystore file.
The following notebook copies the keystore file and generates the init script in DBFS. You can use the init script
to create new clusters with encryption enabled.
Install an encryption init script notebook
Get notebook
IMPORTANT
This feature is in Public Preview.
NOTE
This feature requires the Premium Plan.
For some types of data, Azure Databricks supports adding a customer-managed key to help protect and control
access to encrypted data. Azure Databricks has two customer-managed key features for different types of data:
Enable customer-managed keys for managed services
Configure customer-managed keys for DBFS root
The following list describes which customer-managed key feature is used for each type of data.
Customer-accessible DBFS root data
Location: Your workspace's DBFS root in your workspace root Blob storage in your Azure subscription. This also includes workspace libraries and the FileStore area.
Customer-managed key feature: DBFS root
Databricks SQL results
Location: Workspace root Blob storage instance in your Azure subscription.
Customer-managed key feature: DBFS root
Interactive notebook results
Location: By default, when you run a notebook interactively (rather than as a job), results are stored in the control plane for performance, with some large results stored in your workspace root Blob storage in your Azure subscription. You can choose to configure Azure Databricks to store all interactive notebook results in your Azure subscription.
Customer-managed key feature: For partial results in the control plane, use a customer-managed key for managed services. For results in the root Blob storage, which you can configure for all result storage, use a customer-managed key for DBFS root.
Other workspace system data in the root Blob storage that is inaccessible through DBFS, such as notebook revisions
Location: Workspace root Blob storage in your Azure subscription.
Customer-managed key feature: DBFS root
For additional security for your workspace’s root Blob storage instance in your Azure subscription, you can
enable double encryption for the DBFS root.
Enable customer-managed keys for managed
services
7/21/2022 • 9 minutes to read
IMPORTANT
This feature is in Public Preview.
NOTE
This feature requires the Premium Plan.
For additional control of your data, you can add your own key to protect and control access to some types of
data. Azure Databricks has two customer-managed key features for different types of data and locations. To
compare them, see Customer-managed keys for encryption.
Managed services data in the Azure Databricks control plane is encrypted at rest. You can add a customer-
managed key for managed services to help protect and control access to the following types of encrypted data:
Notebook source in the Azure Databricks control plane.
Notebook results for notebooks run interactively (not as jobs) that are stored in the control plane. By default,
larger results are also stored in your workspace root Blob storage. You can configure Azure Databricks to store all
interactive notebook results in your cloud account.
Secrets stored by the secret manager APIs.
Databricks SQL queries and query history.
After you add a customer-managed key encryption for a workspace, Azure Databricks uses your key to control
access to the key that encrypts future write operations to your workspace’s managed services data. Existing data
is not re-encrypted. The data encryption key is cached in memory for several read and write operations and
evicted from memory at a regular interval. New requests for that data require another request to your cloud
service’s key management system. If you delete or revoke your key, reading or writing to the protected data fails
at the end of the cache time interval.
You can rotate (update) the customer-managed key at a later time. See Rotate the key.
IMPORTANT
After you run the key rotation command, you must keep your old KMS key available to Azure Databricks for 24 hours.
NOTE
This feature does not encrypt data stored outside of the control plane. To encrypt data in your workspace’s root Blob
storage, see Configure customer-managed keys for DBFS root.
To use an existing key vault, copy the key vault name for the next step.
2. Get the object ID of the AzureDatabricks application:
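A hedged sketch using the Azure CLI (the application ID shown is the one commonly used for the AzureDatabricks enterprise application; the output property may be named objectId or id depending on your CLI version):
az ad sp show \
  --id 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d \
  --query objectId \
  --output tsv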
Instead of using the Azure CLI, you can get the object ID from within the Azure portal:
a. In Azure Active Directory, select Enterprise Applications from the sidebar menu.
b. Search for AzureDatabricks and click the Enterprise application in the results.
c. From Properties , copy the object ID.
3. Set the required permissions for your key vault. Replace <key-vault-name> with the vault name that you
used in the previous step and replace <object-id> with the object ID of the AzureDatabricks application.
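A hedged sketch of that step with the Azure CLI (the permission names follow the pattern used for the DBFS key vault policy later in this article):
az keyvault set-policy \
  --name <key-vault-name> \
  --object-id <object-id> \
  --key-permissions get wrapKey unwrapKey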
Make note of the following values, which you can get from the key ID in the kid property in the response. You
will use them in subsequent steps:
Key vault URL: The beginning part of the key ID that includes the key vault name. It has the form
https://<key-vault-name>.vault.azure.net .
Key name: Name of your key.
Key version: Version of the key.
The full key ID has the form <key-vault-URL>/keys/<key-name>/<key-version> .
If instead you use an existing key, get and copy these values for your key so you can use them in the next steps.
Check to confirm that your existing key is enabled before proceeding.
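One way to retrieve the key ID and confirm that the key is enabled is a hedged az keyvault key show call; the --query expression is only an illustration:
az keyvault key show \
  --vault-name <key-vault-name> \
  --name <key-name> \
  --query "{kid: key.kid, enabled: attributes.enabled}"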
Step 3: Create or update a workspace with your key
You can deploy a new workspace with customer-managed key for managed services or add customer-managed
key to an existing workspace. You can do both with ARM templates, applied with whatever tooling you prefer: the
Azure portal, the Azure CLI, or other tools.
The following ARM template creates a new workspace with a customer-managed key, using the preview API
version for resource Microsoft.Databricks/workspaces . Save this text locally to a file named
databricks-cmk-template.json .
NOTE
This example template does not include all possible features such as providing your own VNet. If you already use a
template, merge this template’s parameters, resources, and outputs into your existing template.
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"workspaceName": {
"type": "string",
"metadata": {
"description": "The name of the Azure Databricks workspace to create."
}
},
"pricingTier": {
"type": "string",
"defaultValue": "premium",
"allowedValues": [
"standard",
"premium"
],
"metadata": {
"description": "The pricing tier of workspace."
}
},
"location": {
"type": "string",
"defaultValue": "[resourceGroup().location]",
"metadata": {
"description": "Location for all resources."
}
},
"apiVersion": {
"type": "string",
"defaultValue": "2021-04-01-preview",
"allowedValues":[
"2021-04-01-preview"
],
"metadata": {
"description": "The api version to create the workspace resources"
}
},
"keyvaultUri": {
"type": "string",
"metadata": {
"description": "The key vault URI for customer-managed key for managed services"
}
},
"keyName": {
"type": "string",
"metadata": {
"description": "The key name used for customer-managed key for managed services"
}
},
"keyVersion": {
"type": "string",
"metadata": {
"description": "The key version used for customer-managed key for managed services"
}
}
},
"variables": {
"managedResourceGroupName": "[concat('databricks-rg-', parameters('workspaceName'), '-',
uniqueString(parameters('workspaceName'), resourceGroup().id))]"
},
"resources": [
{
"type": "Microsoft.Databricks/workspaces",
"name": "[parameters('workspaceName')]",
"location": "[parameters('location')]",
"apiVersion": "[parameters('apiVersion')]",
"sku": {
"name": "[parameters('pricingTier')]"
},
"properties": {
"ManagedResourceGroupId": "[concat(subscription().id, '/resourceGroups/',
variables('managedResourceGroupName'))]",
"encryption": {
"entities": {
"managedServices": {
"keySource": "Microsoft.Keyvault",
"keyVaultProperties": {
"keyVaultUri": "[parameters('keyvaultUri')]",
"keyName": "[parameters('keyName')]",
"keyVersion": "[parameters('keyVersion')]"
}
}
}
}
}
}
],
"outputs": {
"workspace": {
"type": "object",
"value": "[reference(resourceId('Microsoft.Databricks/workspaces', parameters('workspaceName')))]"
}
}
}
If you use another template already, you can merge this template’s parameters, resources, and outputs into your
existing template.
To use this template to create or update a workspace, you have several options depending on your tooling.
Create workspace with Azure CLI
To create a new workspace with Azure CLI, run the following command:
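A sketch of that command, following the same pattern as the update command shown later in this article (the resource group and parameter values are placeholders, and the template file name matches the one saved earlier):
az deployment group create --resource-group <resource-group-name> \
  --template-file databricks-cmk-template.json \
  --parameters workspaceName=<workspace-name> \
  keyvaultUri=<key-vault-uri> \
  keyName=<key-name> keyVersion=<key-version>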
IMPORTANT
Other than changes in the key-related parameters, use the same parameters that were used for creating the
workspace.
IMPORTANT
After you run the key rotation command, you must keep your old KMS key available to Azure Databricks for 24
hours.
For more details, see the Azure article Quickstart: Create and deploy ARM templates by using the Azure portal.
IMPORTANT
Ensure the new key has the proper permission.
2. Confirm that your template has the correct API version 2021-04-01-preview .
3. Update the workspace:
IMPORTANT
After you run the key rotation command, you must keep your old KMS key available to Azure Databricks for 24
hours.
To use the Azure portal, apply the template using the Custom deployment tool. See Create or
update workspace with Azure portal. Ensure that you use the same values for the resource group
name and the workspace name so it updates the existing workspace, rather than creating a new
workspace.
To use the Azure CLI, run the following command. Ensure that you use the same values for the
resource group name and the workspace name so it updates the existing workspace, rather than
creating a new workspace.
IMPORTANT
Other than changes in the key-related parameters, use the same parameters that were used for creating
the workspace.
az deployment group create --resource-group <existing-resource-group-name> \
--template-file <file-name>.json \
--parameters workspaceName=<existing-workspace-name> \
keyvaultUri=<keyvaultUrl> \
keyName=<keyName> keyVersion=<keyVersion>
4. Optionally export and re-import existing notebooks to ensure all existing notebooks use your new key.
NOTE
This feature is available only in the Premium Plan.
For additional control of your data, you can add your own key to protect and control access to some types of
data. Azure Databricks has two customer-managed key features that involve different types of data and
locations. For a comparison, see Customer-managed keys for encryption.
Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and
available on Azure Databricks clusters. DBFS is implemented as a Blob storage instance in your Azure Databricks
workspace’s managed resource group. The default storage location in DBFS is known as the DBFS root. By
default, the storage account is encrypted with Microsoft-managed keys.
After you add a customer-managed key encryption for a workspace, Azure Databricks uses your key to encrypt
future write operations to your workspace’s root Blob storage. Existing data is not re-encrypted.
IMPORTANT
This feature affects your DBFS root but is not used for encrypting data on any additional DBFS mounts such as DBFS
mounts of additional Blob or ADLS storage.
You must use Azure Key Vault to store your customer-managed keys. You can either create your own keys and
store them in the key vault, or you can use the Azure Key Vault APIs to generate keys.
There are three ways of enabling customer-managed keys for your DBFS storage:
Configure customer-managed keys for DBFS using the Azure portal
Configure customer-managed keys for DBFS using the Azure CLI
Configure customer-managed keys for DBFS using PowerShell
Configure customer-managed keys for DBFS using
the Azure portal
7/21/2022 • 2 minutes to read
NOTE
This feature is available only in the Premium Plan.
You can use the Azure portal to configure your own encryption key to encrypt the DBFS root storage account.
You must use Azure Key Vault to store the key.
For more information about customer-managed keys for DBFS, see Configure customer-managed keys for DBFS
root.
1. Create a key vault following the instructions in Quickstart: Set and retrieve a key from Azure Key Vault
using the Azure portal.
The Azure Databricks workspace and the key vault must be in the same region and the same Azure Active
Directory (Azure AD) tenant, but they can be in different subscriptions.
2. Create a key in the key vault, continuing to follow the instructions in the Quickstart.
DBFS root storage supports RSA and RSA-HSM keys of sizes 2048, 3072 and 4096. For more information
about keys, see About Key Vault keys.
3. Once your key is created, copy and paste the Key Identifier into a text editor. You will need it when you
configure your key for Azure Databricks.
NOTE
Only users with the key vault Contributor role or higher for the key vault can save.
When the encryption is enabled, the system enables Soft-Delete and Purge Protection on the key vault,
creates a managed identity on the DBFS root, and adds an access policy for this identity in the key vault.
IMPORTANT
If you delete the key that is used for encryption, the data in the DBFS root cannot be accessed. You can use the Azure Key
Vault APIs to recover deleted keys.
Configure customer-managed keys for DBFS using
the Azure CLI
7/21/2022 • 2 minutes to read
NOTE
This feature is available only in the Premium Plan.
You can use the Azure CLI to configure your own encryption key to encrypt the DBFS root storage account. You
must use Azure Key Vault to store the key.
For more information about customer-managed keys for DBFS, see Configure customer-managed keys for DBFS
root.
az login
az account set --subscription <subscription-id>
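The principalId comes from preparing the workspace for encryption, which creates a managed identity for the DBFS root storage account. A hedged sketch of that step (the --prepare-encryption flag is part of the az databricks CLI extension):
az databricks workspace update \
  --name <workspace-name> \
  --resource-group <resource-group> \
  --prepare-encryption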
Note the principalId field in the storageAccountIdentity section of the command output. You will provide it as
the managed identity value when you configure your key vault.
For more information about Azure CLI commands for Azure Databricks workspaces, see the az databricks
workspace command reference.
az keyvault create \
--name <key-vault> \
--resource-group <resource-group> \
--location <region> \
--enable-soft-delete \
--enable-purge-protection
For more information about enabling Soft Delete and Purge Protection using the Azure CLI, see How to use Key
Vault soft-delete with CLI.
az keyvault set-policy \
--name <key-vault> \
--resource-group <resource-group> \
--object-id <managed-identity> \
--key-permissions get unwrapKey wrapKey
Replace <managed-identity> with the principalId value that you noted when you prepared your workspace for
encryption.
DBFS root storage supports RSA and RSA-HSM keys of sizes 2048, 3072 and 4096. For more information about
keys, see About Key Vault keys.
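Once the key vault policy is in place, the workspace is typically pointed at the key with another az databricks workspace update call. A hedged sketch (flag names per the az databricks CLI extension; substitute your own values):
az databricks workspace update \
  --name <workspace-name> \
  --resource-group <resource-group> \
  --key-source Microsoft.Keyvault \
  --key-name <key-name> \
  --key-version <key-version> \
  --key-vault <key-vault-uri>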
NOTE
This feature is available only in the Premium Plan.
You can use PowerShell to configure your own encryption key to encrypt the DBFS root storage account. You
must use Azure Key Vault to store the key.
For more information about customer-managed keys for DBFS, see Configure customer-managed keys for DBFS
root.
For more information about PowerShell cmdlets for Azure Databricks workspaces, see the Az.Databricks
reference.
To learn how to enable Soft Delete and Purge Protection on an existing key vault with PowerShell, see “Enabling
soft-delete” and “Enabling Purge Protection” in How to use Key Vault soft-delete with PowerShell.
Configure the key vault access policy
Set the access policy for the key vault so that the Azure Databricks workspace has permission to access it, using
Set-AzKeyVaultAccessPolicy.
Set-AzKeyVaultAccessPolicy `
-VaultName $keyVault.VaultName `
-ObjectId $workSpace.StorageAccountIdentity.PrincipalId `
-PermissionsToKeys wrapkey,unwrapkey,get
DBFS root storage supports RSA and RSA-HSM keys of sizes 2048, 3072 and 4096. For more information about
keys, see About Key Vault keys.
NOTE
This feature is available only in the Premium Plan.
Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and
available on Azure Databricks clusters. DBFS is implemented as a storage account in your Azure Databricks
workspace’s managed resource group. The default storage location in DBFS is known as the DBFS root.
Azure Storage automatically encrypts all data in a storage account—including DBFS root storage—at the service
level using 256-bit AES encryption. This is one of the strongest block ciphers available and is FIPS 140-2
compliant. If you require higher levels of assurance that your data is secure, you can also enable 256-bit AES
encryption at the Azure Storage infrastructure level. When infrastructure encryption is enabled, data in a storage
account is encrypted twice, once at the service level and once at the infrastructure level, with two different
encryption algorithms and two different keys. Double encryption of Azure Storage data protects against a
scenario where one of the encryption algorithms or keys is compromised. In this scenario, the additional layer of
encryption continues to protect your data.
This article describes how to create a workspace that adds infrastructure encryption (and therefore double
encryption) for a workspace’s root storage. You must enable infrastructure encryption at workspace creation;
you cannot add infrastructure encryption to an existing workspace.
Requirements
Premium Plan
2. On the Create an Azure Databricks workspace page (Create a resource > Analytics > Azure
Databricks ), click the Advanced tab.
3. Next to Enable Infrastructure Encryption , select Yes .
4. When you have finished your workspace configuration and created the workspace, verify that
infrastructure encryption is enabled.
In the resource page for the Azure Databricks workspace, go to the sidebar menu and select Settings >
Encryption . Confirm that Enable Infrastructure Encryption is selected.
Alternatively, after your workspace is created, you can verify that infrastructure encryption is enabled from the command line.
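One way to check, sketched with the Azure CLI (the exact location of the encryption settings in the output may vary by API version):
az databricks workspace show \
  --name <workspace-name> \
  --resource-group <resource-group> \
  --query "parameters.encryption"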
The requireInfrastructureEncryption field should be present in the encryption property and set to true .
For more information about Azure CLI commands for Azure Databricks workspaces, see the az databricks
workspace command reference.
Data governance guide
7/21/2022 • 2 minutes to read
This guide shows how to manage access to your data in Azure Databricks.
The Databricks Security and Trust Center provides information about the ways in which security is built
into every layer of the Databricks Lakehouse Platform. The Security and Trust Center provides
information that enables you to meet your regulatory needs while taking advantage of the Databricks
Lakehouse Platform. Find the following types of information in the Security and Trust Center:
An overview and list of the security and governance features built in the platform.
Information about the compliance standards the platform meets on each cloud provider.
A due-diligence package to help you evaluate how Azure Databricks helps you meet your compliance
and regulatory needs.
An overview of Databricks’ privacy guidelines and how they are enforced.
The information in this article supplements the Security and Trust Center.
Unity Catalog (Preview) is a secure metastore developed by Databricks. Unity Catalog centralizes
metadata and governance of an organization’s data. With Unity Catalog, data governance rules scale with
your needs, regardless of the number of workspaces or the business intelligence tools your organization
uses. See Get started using Unity Catalog.
Table access control lets you apply data governance controls for your data.
Credential passthrough allows you to authenticate automatically to Azure Data Lake Storage from Azure
Databricks clusters using the identity that you use to log in to Azure Databricks.
Audit logs allow your enterprise to monitor details about usage patterns across your Databricks account
and workspaces.
For information about securing your account, workspaces, and compute resources, see Security guide.
In this guide:
Data governance overview
Unity Catalog (Preview)
Table access control
Credential passthrough
Data governance overview
7/21/2022 • 14 minutes to read
This article describes the need for data governance and shares best practices and strategies you can use to
implement these techniques across your organization. It demonstrates a typical deployment workflow you can
employ using Azure Databricks and cloud-native solutions to secure and monitor each layer from the
application down to storage.
Governance challenges
Whether you’re managing the data of a startup or a large corporation, security teams and platform owners have
the singular challenge of ensuring that this data is secure and is being managed according to the internal
controls of the organization. Regulatory bodies the world over are changing the way we think about how data is
both captured and stored. These compliance risks only add further complexity to an already tough problem.
How then, do you open your data to those who can drive the use cases of the future? Ultimately, you should be
adopting data policies and practices that help the business to realize value through the meaningful application
of what can often be vast stores of data, stores that are growing all the time. We get solutions to the world’s
toughest problems when data teams have access to many and disparate sources of data.
Typical challenges when considering the security and availability of your data in the cloud:
Do your current data and analytics tools support access controls on your data in the cloud? Do they provide
robust logging of actions taken on the data as it moves through the given tool?
Will the security and monitoring solution you put in place now scale as demand on the data in your data lake
grows? It can be easy enough to provision and monitor data access for a small number of users. What
happens when you want to open up your data lake to hundreds of users? To thousands?
Is there anything you can do to be proactive in ensuring that your data access policies are being observed? It
is not enough to simply monitor; that is just more data. If data availability is merely a challenge of data
security, you should have a solution in place to actively monitor and track access to this information across
the organization.
What steps can you take to identify gaps in your existing data governance solution?
You can take this one step further by defining fine-grained access controls to a subset of a table or by setting
privileges on derived views of a table.
Service principals
How do you grant access to users or service accounts for more long-running or frequent workloads? What if
you want to utilize a business intelligence tool, such as Power BI or Tableau, that needs access to the tables in
Azure Databricks via ODBC/JDBC? In these cases, you should use service principals and OAuth. Service
principals are identity accounts scoped to very specific Azure resources. When building a job in a notebook, you
can add the following lines to the job cluster’s Spark configuration or run directly in the notebook. This allows
you to access the corresponding file store within the scope of the job.
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "
<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net",
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net",
"https://login.microsoftonline.com/<directory-id>/oauth2/token")
Similarly, you can access said data by reading directly from an Azure Data Lake Storage Gen1 or Gen2 URI by
mounting your file store(s) with a service principal and an OAuth token. Once you’ve set the configuration
above, you can now access files directly in your Azure Data Lake Storage using the URI:
"abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>"
All users on a cluster with a file system registered in this way will have access to the data in the file system.
This solution is suitable for many interactive use cases and offers a streamlined approach, requiring that you
manage permissions in just one place. In this way, you can allocate one cluster to multiple users without having
to worry about provisioning specific access controls for each of your users. Process isolation on Azure
Databricks clusters ensures that user credentials will not be leaked or otherwise shared. This approach also has
the added benefit of logging user-level entries in your Azure storage audit logs, which can help platform admins
to associate storage layer actions with specific users.
Some limitations to this method are:
Supports only Azure Data Lake Storage file systems.
Does not support access through the Databricks REST API.
Table access control: Azure Databricks does not suggest using credential passthrough with table access
control. For more details on the limitations of combining these two features, see Limitations. For more
information about using table access control, see Implement table access control.
Not suitable for long-running jobs or queries, because of the limited time-to-live on a user’s access token.
For these types of workloads, we recommend that you use service principals to access your data.
Securely mount Azure Data Lake Storage using credential passthrough
You can mount an Azure Data Lake Storage account or folder inside it to the Databricks File System (DBFS),
providing an easy and secure way to access data in your data lake. The mount is a pointer to a data lake store, so
the data is never synced locally. When you mount data using a cluster enabled with Azure Data Lake Storage
credential passthrough, any read or write to the mount point uses your Azure AD credentials. This mount point
will be visible to other users, but the only users that will have read and write access are those who:
Have access to the underlying Azure Data Lake Storage storage account
Are using a cluster enabled for Azure Data Lake Storage credential passthrough
To mount Azure Data Lake Storage using credential passthrough, follow the instructions in Mount Azure Data
Lake Storage to DBFS using credential passthrough.
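As a hedged sketch (the container, storage account, and mount point names are placeholders), such a mount is typically created from a passthrough-enabled cluster like this:
configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class":
    spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

# Mount the container; reads and writes through the mount use the Azure AD identity of the querying user
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs
)
The following cluster definition and permissions payload illustrate how a passthrough-enabled, autoscaling cluster for a project team can be provisioned and locked down through the Clusters and Permissions REST APIs: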
{
"autoscale": {
"min_workers": 2,
"max_workers": 20
},
"cluster_name": "project team interactive cluster",
"spark_version": "latest-stable-scala2.11",
"spark_conf": {
"spark.Azure Databricks.cluster.profile": "serverless",
"spark.Azure Databricks.repl.allowedLanguages": "python,sql",
"spark.Azure Databricks.passthrough.enabled": "true",
"spark.Azure Databricks.pyspark.enableProcessIsolation": "true"
},
"node_type_id": "Standard_D14_v2",
"ssh_public_keys": [],
"custom_tags": {
"ResourceClass": "Serverless",
"team": "new-project-team"
},
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"autotermination_minutes": 60,
"enable_elastic_disk": true,
"init_scripts": []
}
{
"access_control_list": [
{
"group_name": "project team",
"permission_level": "CAN_MANAGE"
}
]
}
Instantly you have a cluster that has been provisioned with secure access to critical data in the lake, locked down
to all but the corresponding team, tagged for chargebacks, and configured to meet the requirements of the
project. There are additional configuration steps within your cloud provider account required to implement
this solution, though these, too, can be automated to meet the requirements of scale.
Audit access
Configuring access control in Azure Databricks and controlling data access in storage is the first step towards an
efficient data governance solution. However, a complete solution also requires auditing access to data and providing
alerting and monitoring capabilities. Azure Databricks provides a comprehensive set of audit events that log activities
performed by Azure Databricks users, allowing enterprises to monitor detailed usage patterns on the platform. To get a
complete understanding of what users are doing on the platform and what data is being accessed, you should use both
the native Azure Databricks audit logging and your cloud provider's audit logging capabilities.
Make sure you have diagnostic logging enabled in Azure Databricks. Once logging is enabled for your account,
Azure Databricks automatically starts sending diagnostic logs to the delivery location you specified. You also
have the option to Send to Log Analytics , which will forward diagnostic data to Azure Monitor. Here is an
example query you can enter into the Log search box to query all users who have logged into the Azure
Databricks workspace and their location:
In a few steps, you can use Azure monitoring services or create real-time alerts. The Azure Activity Log provides
visibility into the actions taken on your storage accounts and the containers therein. Alert rules can be
configured here as well.
Learn more
Here are some resources to help you build a comprehensive data governance solution that meets your
organization’s needs:
Data governance on Databricks
Access control on Databricks
Data objects in the Databricks Lakehouse
Keep data secure with secrets
Unity Catalog (Preview)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. During the preview, some functionality is limited. See Unity Catalog public preview
limitations. To participate in the preview, contact your Azure Databricks representative.
Unity Catalog is a fine-grained governance solution for data and AI on the Lakehouse.
Unity Catalog helps simplify security and governance of your data with the following key features:
Define once, secure everywhere : Unity Catalog offers a single place to administer data access policies
that apply across all workspaces and personas.
Standards-compliant security model : Unity Catalog’s security model is based on standard ANSI SQL, and
allows administrators to grant permissions at the level of catalogs, databases (also called schemas), tables,
and views in their existing data lake using familiar syntax.
Built-in auditing : Unity Catalog automatically captures user-level audit logs that record access to your data.
Unity Catalog requires an Azure Databricks account on the Premium plan.
In this guide:
Get started using Unity Catalog
Key concepts
Data permissions
Create compute resources
Use Azure managed identities in Unity Catalog to access storage
Create a metastore
Create and manage catalogs
Create and manage schemas (databases)
Manage identities in Unity Catalog
Create tables
Create views
Manage access to data
Manage external locations and storage credentials
Query data
Train a machine-learning model with Python from data in Unity Catalog
Connect to BI tools
Audit access and activity for Unity Catalog resources
Upgrade tables and views to Unity Catalog
Automate Unity Catalog setup using Terraform
Unity Catalog public preview limitations
Get started using Unity Catalog
7/21/2022 • 12 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
This guide helps you get started with Unity Catalog, the Azure Databricks data governance framework.
Requirements
You must be an Azure Databricks account admin.
The first Azure Databricks account admin must be an Azure Active Directory Global Administrator or a
member of the root management group, which is usually named Tenant root group . That user can
assign users with any level of Azure tenant permission as subsequent Azure Databricks account admins
(who can themselves assign more account admins).
Your Azure Databricks account must be on the Premium plan.
In your Azure tenant, you must have permission to create:
A storage account to use with Azure Data Lake Storage Gen2. See Create a storage account to use with
Azure Data Lake Storage Gen2.
A new resource to hold a system-assigned managed identity. This requires that you be a Contributor
or Owner of a resource group in any subscription in the tenant.
The metastore's root storage path has the form: abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<metastore-name>
A single metastore can be shared across multiple Azure Databricks workspaces in an account. Each linked
workspace has the same view of the data in the metastore, and data access control can be managed across
workspaces. Databricks allows one metastore per region. If you have a multi-region Databricks deployment, you
may want separate metastores for each region, but it is good practice to use a small number of metastores
unless your organization requires hard isolation boundaries between sets of data. Data cannot easily be joined
or queried across metastores.
To create a metastore:
1. Make sure that you have the path to the storage container and the resource ID of the Azure Databricks
access connector that you created in the previous task.
2. Log in to the Azure Databricks account console.
3. Click Data .
4. Click Create Metastore .
5. Enter values for the following fields
Name for the metastore.
Region where the metastore will be deployed.
For best performance, co-locate the access connector, workspaces, metastore and cloud storage
location in the same cloud region.
ADLS Gen 2 path : Enter the path to the storage container that you will use as root storage for the
metastore.
The abfss:// prefix is added automatically.
Access Connector ID : Enter the Azure Databricks access connector’s resource ID in the format:
/subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>
6. Click Create .
If the request fails, retry using a different metastore name.
7. When prompted, select workspaces to link to the metastore.
The account-level user who creates a metastore is its owner and metastore admin. Any account admin can
manage permissions for a metastore and its objects. To transfer ownership of a metastore to a different account-
level user or a group, see (Recommended) Transfer ownership of your metastore to a group.
NOTE
Users and groups must be added as account-level identities before they can access Unity Catalog.
1. The initial account-level admin must be a Contributor in the Azure Active Directory root management
group, which is named Tenant root group by default. An Azure Active Directory Global Administrator
can add themselves to this group. Grant yourself this role, or ask an Azure Active Directory Global
Administrator to grant it to you.
The initial account-level admin can add users or groups to the account console, and can designate other
account-level admins by granting the Admin role to users.
2. All Azure Active Directory users who have been added to workspaces in your Azure tenant are
automatically added as account-level identities.
3. To designate additional account-level admins, you grant users the Admin role.
NOTE
It is not possible to grant the Admin role to a group.
a. Log in to the account console by clicking Settings , then clicking Manage account .
b. Click Users and Groups . A list of Azure Active Directory users appears. Only users and groups
who have been added to workspaces are shown.
c. Click the name of a user.
d. Click Roles .
e. Enable Admin .
To get started, create a group called data-consumers . This group is used later in this walk-through.
2. Click Compute .
3. Click Create cluster .
a. Enter a name for the cluster.
b. Set Databricks runtime version to Runtime: 10.3 (Scala 2.12, Spark 3.2.1) or higher.
4. Click Advanced Options . Set Security Mode to User Isolation or Single User .
User Isolation clusters can be shared by multiple users, but only SQL workloads are supported. Some
advanced cluster features such as library installation, init scripts, and the DBFS Fuse mount are also
disabled to ensure security isolation among cluster users.
To use those advanced cluster features or languages, or to run workloads using Python, Scala, and R, set
the cluster security mode to Single User. A Single User cluster can also run SQL workloads. The cluster can be used
exclusively by a single user (by default, the owner of the cluster); other users cannot attach to the cluster.
Automated jobs should run in this mode, and the job’s owner should be the cluster’s owner. In this mode,
view security cannot be enforced. A user selecting from a view executes with their own permissions.
For more information about the features available in each security mode, see Cluster security mode.
5. Click Create Cluster .
Create a SQL warehouse
To create a SQL warehouse that can access Unity Catalog data:
1. Log in to the workspace as a workspace-level admin.
2. From the persona switcher, select SQL .
3. Click Create , then select SQL Warehouse .
4. Under Advanced Settings set Channel to Preview .
SQL warehouses are automatically created with the correct security mode, with no configuration required.
In Unity Catalog, you reference a table using a three-level namespace: <catalog>.<schema>.<table>
A newly-created metastore contains a catalog named main with an empty schema named default . In this
example, you will create a table named department in the default schema in the main catalog.
To create a table, you must be an account admin, metastore admin, or a user with the CREATE permission on the
parent schema and the USAGE permission on the parent catalog and schema.
Follow these steps to create a table manually. You can also import an example notebook and run it to create a
catalog, schema, and table, along with managing permissions on each.
1. Create a notebook and attach it to the cluster you created in Create a compute resource.
For the notebook language, select SQL , Python , R , or Scala , depending on the language you want to
use.
2. Grant permission to create tables on the default schema.
To create tables, users require the CREATE and USAGE permissions on the schema in addition to the
USAGE permission on the catalog. All users receive the USAGE privilege on the main catalog and the
main.default schema when a metastore is created.
Account admins, metastore admins, and the owner of the schema main.default can use the following
command to GRANT the CREATE privilege to a user or group:
SQL
GRANT CREATE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`;
Python
spark.sql("GRANT CREATE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")
For example, to allow members of the group data-consumers to create tables in main.default :
SQL
GRANT CREATE ON SCHEMA main.default TO `data-consumers`;
Python
spark.sql("GRANT CREATE ON SCHEMA main.default TO `data-consumers`")
3. Create the table and insert sample data. Use the cell that matches your notebook language:
Python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
  StructField("deptcode", IntegerType(), True),
  StructField("deptname", StringType(), True),
  StructField("location", StringType(), True)
])

spark.catalog.createTable(
  tableName = "main.default.department",
  schema = schema
)
dfInsert = spark.createDataFrame(
data = [
(10, "FINANCE", "EDINBURGH"),
(20, "SOFTWARE", "PADDINGTON"),
(30, "SALES", "MAIDSTONE"),
(40, "MARKETING", "DARLINGTON"),
(50, "ADMIN", "BIRMINGHAM")
],
schema = schema
)
dfInsert.write.saveAsTable(
name = "main.default.department",
mode = "append"
)
R
library(SparkR)
schema = structType(
structField("deptcode", "integer", TRUE),
structField("deptname", "string", TRUE),
structField("location", "string", TRUE)
)
df = createDataFrame(
data = list(),
schema = schema
)
saveAsTable(
df = df,
tableName = "main.default.department"
)
data = list(
list("deptcode" = 10L, "deptname" = "FINANCE", "location" = "EDINBURGH"),
list("deptcode" = 20L, "deptname" = "SOFTWARE", "location" = "PADDINGTON"),
list("deptcode" = 30L, "deptname" = "SALES", "location" = "MAIDSTONE"),
list("deptcode" = 40L, "deptname" = "MARKETING", "location" = "DARLINGTON"),
list("deptcode" = 50L, "deptname" = "ADMIN", "location" = "BIRMINGHAM")
)
dfInsert = createDataFrame(
data = data,
schema = schema
)
insertInto(
x = dfInsert,
tableName = "main.default.department"
)
Scala
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
val df = spark.createDataFrame(
new java.util.ArrayList[Row](),
new StructType()
.add("deptcode", "int")
.add("deptname", "string")
.add("location", "string")
)
df.write
.format("delta")
.saveAsTable("main.default.department")
// Insert the sample rows (matching the Python and R examples above)
val dfInsert = Seq(
  (10, "FINANCE", "EDINBURGH"),
  (20, "SOFTWARE", "PADDINGTON"),
  (30, "SALES", "MAIDSTONE"),
  (40, "MARKETING", "DARLINGTON"),
  (50, "ADMIN", "BIRMINGHAM")
).toDF("deptcode", "deptname", "location")

dfInsert.write.insertInto("main.default.department")
4. Query the table to verify that the rows were inserted.
Python
display(spark.table("main.default.department"))
R
display(tableToDF("main.default.department"))
Scala
display(spark.table("main.default.department"))
5. Grant the ability to read and query the table to the data-consumers group that you created in Add users
and groups.
Add a new cell to the notebook and paste in the following code:
SQL
GRANT SELECT ON TABLE main.default.department TO `data-consumers`;
Python
spark.sql("GRANT SELECT ON TABLE main.default.department TO `data-consumers`")
Scala
spark.sql("GRANT SELECT ON TABLE main.default.department TO `data-consumers`")
NOTE
To grant read access to all account-level users instead of only data-consumers , use the group name
account users instead.
Example notebook
You can use the following example SQL notebook to create a catalog, schema, and table, as well as manage
permissions on each.
Create and manage a Unity Catalog table
Get notebook
Next steps
Learn more about key concepts of Unity Catalog
Create tables
Create views
Key concepts
7/21/2022 • 9 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
This article explains the key concepts behind how Unity Catalog brings security and governance to your
Lakehouse.
Account-level identities
Unity Catalog uses the account-level identity system in Databricks to resolve users and groups and enforce
permissions. You configure users and groups directly in the Databricks account console. Refer to those account-
level users and groups when creating access-control policies in Unity Catalog.
Although Databricks also allows adding local groups to workspaces, those local groups cannot be used in Unity
Catalog. Commands that reference local groups return an error that the group was not found.
Unity Catalog users must also be added to workspaces to access Unity Catalog data in a notebook, a Databricks
SQL query, the Databricks SQL Data Explorer, or a REST API command, and to join Unity Catalog data with data
that is local to a workspace.
Data permissions
In Unity Catalog, data is secure by default. Initially, users have no access to data in a metastore. Metastore
admins and object owners can manage object permissions using the Databricks SQL Data Explorer or SQL
commands.
To learn more, see Data permissions.
Object model
The following diagram illustrates the main securable objects in Unity Catalog:
NOTE
Some objects, such as external locations and storage credentials, are not shown in the diagram. These objects reside in the
metastore at the same level as catalogs.
Metastore
A metastore is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the
permissions that govern access to them. Databricks account admins can create metastores and assign them to
Databricks workspaces to control which workloads use each metastore.
NOTE
Unity Catalog offers a new metastore with built in security and auditing. This is distinct from the metastore used in
previous versions of Databricks, which was based on the Hive Metastore.
Metastore admin
A metastore admin can manage the privileges for all securable objects within a metastore, such as who can
create catalogs or query a table.
NOTE
For more details about Unity Catalog’s data governance model, see Data Permissions.
If necessary, a metastore admin can delegate management of permissions for a metastore object to a different
user or group by changing the object’s ownership. The account-level admin who creates a metastore is its owner.
The owner or owners (if the object is owned by a group) of an object can grant privileges on that object and its
descendants to others.
The initial account-level admin, who must have the Contributor role in the root management group of the
Azure tenant, can enable the Admin role for other users using the Azure Databricks account console. These
account admins are metastore admins.
The account admin who created the metastore is its initial metastore admin. An account admin can assign other
users as metastore admins by changing the metastore’s owner. Account admins can always manage a
metastore, regardless of the metastore’s owner.
Workspace admin
A workspace admin can manage workspace objects like users, jobs, and notebooks, regardless of whether Unity
Catalog is configured for a workspace. In other words, if a workspace is configured to use Unity Catalog,
workspace admins retain the ability to manage workspace objects. Although workspace admins cannot manage
access to data stored in Unity Catalog in the same way a metastore admin can, they do have the ability to
perform workspace management tasks such as adding users and service principals to the workspace, and they
can view and modify workspace objects like jobs and notebooks. This may give access to data registered in
Unity Catalog. The workspace admin therefore remains a privileged role that should be distributed carefully.
Default storage location
Each metastore is configured with a default storage location in an Azure storage account. This is the default
storage location for data in managed tables. External tables store data in other storage paths.
To access the default storage location on the behalf of a user, Unity Catalog uses a root storage credential that is
configured during metastore creation. The root storage credential contains the client secret of a service principal
that has the Azure Blob Contributor role for the default storage location. You can create additional external
credentials that use separate service principals, but the metastore’s root storage credential is still used to write
metadata to the metastore. User code never receives full access to a storage credential. Instead, Unity Catalog
generates scoped access tokens that allow each user or application to access the requested data.
Catalog
A catalog is the first layer of Unity Catalog’s three-level namespace and is used to organize your data assets.
Users can see all catalogs on which they have been assigned the USAGE data permission.
Schema
A schema (also called a database) is the second layer of Unity Catalog’s three-level namespace and organizes
tables and views. To access or list a table or view in a schema, a user must have the USAGE data permission on
the schema and its parent catalog and the SELECT permission on the table or view.
Table
A table resides in the third layer of Unity Catalog’s three-level namespace and contains rows of data. To create a
table, a user must have CREATE and USAGE permissions on the schema and the USAGE permission on its parent
catalog. To query a table, the user must have the SELECT permission on the table and the USAGE permission on
its parent schema and catalog.
A table can be managed or external.
Managed table
Managed tables are the default way to create tables in Unity Catalog. These tables are stored in the managed
storage location you configured when you created each metastore.
To create a managed table, run a CREATE TABLE command without a LOCATION clause.
To delete a managed table, use the DROP TABLE statement.
When a managed table is dropped, its underlying data is deleted from your cloud tenant. The only supported
format for managed tables is Delta.
Example syntax (a hedged sketch reusing the department table from earlier in this guide):
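-- Managed table: no LOCATION clause; data is stored in the metastore's managed storage location
CREATE TABLE main.default.department
  (deptcode INT, deptname STRING, location STRING);

-- Dropping a managed table also deletes its underlying data
DROP TABLE main.default.department;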
External table
External tables are tables whose data is stored in a storage location outside of the managed storage location,
and are not fully managed by Unity Catalog. When you run DROP TABLE on an external table, Unity Catalog does
not delete the underlying data. You can manage privileges on external tables and use them in queries in the
same way as managed tables. To create an external table, specify a LOCATION path in your CREATE TABLE
statement. External tables can use the following file formats:
DELTA
CSV
JSON
AVRO
PARQUET
ORC
TEXT
To manage access to the underlying cloud storage for an external table, Unity Catalog introduces two new object
types: storage credentials and external locations.
A storage credential represents an authentication and authorization mechanism for accessing data stored on
your cloud tenant, such as the client secret for a service principal. Each storage credential is subject to Unity
Catalog access-control policies that control which users and groups can access the credential.
If a user attempts to use a storage credential on which they haven’t been granted the USAGE
permission, the request fails and Unity Catalog does not attempt to authenticate to the cloud tenant on behalf of
the user.
An external location is an object that contains a reference to a storage credential and a cloud storage
path. The external location grants access only to that path and its child directories and files. Each external
location is subject to Unity Catalog access-control policies that control which users and groups can access
the credential.
If a user attempts to use an external location on which they haven’t been granted the USAGE permission,
the request fails and Unity Catalog does not attempt to authenticate to the cloud tenant on behalf of the
user.
Only metastore admins can create and grant permissions on storage credentials and external locations.
Example syntax (a hedged sketch; the storage path shown is a placeholder):
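-- External table: the LOCATION clause must point at a path covered by an external location or storage credential
CREATE TABLE main.default.department_external
  (deptcode INT, deptname STRING, location STRING)
LOCATION 'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path>';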
NOTE
Before a user can create an external table, the user must have the CREATE TABLE privilege on an external location or
storage credential that grants access to the LOCATION specified in the CREATE TABLE statement.
View
A view resides in the third layer of Unity Catalog’s three-level namespace and is a read-only object composed
from one or more tables and views in a metastore. A view can be composed from tables and views in multiple
schemas or catalogs.
Example syntax (a hedged sketch built on the department table used earlier in this guide):
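-- A view can combine and filter data from tables and views in one or more schemas or catalogs
CREATE VIEW main.default.department_edinburgh AS
SELECT deptcode, deptname
FROM main.default.department
WHERE location = 'EDINBURGH';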
NOTE
If your workspace is assigned to a Unity Catalog metastore, you use cluster security modes instead of High Concurrency
clusters to ensure the integrity of access controls and enforce strong isolation guarantees. High Concurrency cluster mode
is not available with Unity Catalog.
When you create a Data Science & Engineering or Databricks Machine Learning cluster, you can select from the
following cluster security modes:
None : No isolation. Does not enforce workspace-local table access control or credential passthrough. Cannot
access Unity Catalog data.
Single User : Can be used only by a single user (by default, the user who created the cluster). Other users
cannot attach to the cluster. When accessing a view from a cluster with Single User security mode, the view
is executed with the user’s permissions. Single-user clusters support workloads using Python, Scala, and R.
Init scripts, library installation, and DBFS FUSE mounts are supported on single-user clusters. Automated
jobs should use single-user clusters.
User Isolation : Can be shared by multiple users. Only SQL workloads are supported. Library installation,
init scripts, and DBFS FUSE mounts are disabled to enforce strict isolation among the cluster users.
Table ACL only (Legacy) : Enforces workspace-local table access control, but cannot access Unity Catalog
data.
Passthrough only (Legacy) : Enforces workspace-local credential passthrough, but cannot access Unity
Catalog data.
The only security modes supported for Unity Catalog workloads are Single User and User Isolation .
Databricks SQL endpoints automatically use User Isolation , with no configuration required.
You can upgrade an existing cluster to meet the requirements of Unity Catalog by setting its cluster security
mode to Single User or User Isolation .
The following summarizes the features that are enabled for each cluster security mode; features not listed are disabled in that mode.
None (all languages): multiple users, RDD API, DBFS FUSE mounts, init scripts and library installation, Databricks Runtime for Machine Learning. Cannot access Unity Catalog data.
Single User (all languages): Unity Catalog, RDD API, DBFS FUSE mounts, init scripts and library installation, Databricks Runtime for Machine Learning.
User Isolation (SQL only): Unity Catalog, legacy table access control, multiple users, dynamic views.
Table ACL only (Legacy) (SQL and Python): legacy table access control, multiple users, dynamic views. Cannot access Unity Catalog data.
Passthrough only (Legacy) (SQL and Python): credential passthrough, multiple users, DBFS FUSE mounts, init scripts and library installation, dynamic views. Cannot access Unity Catalog data.
Data permissions
7/21/2022 • 16 minutes to read
This article explains how data permissions work to control access to data and objects in Unity Catalog.
You can use data access control policies to grant and revoke access to Unity Catalog data and objects in the
Databricks SQL Data Explorer, SQL statements in notebooks or Databricks SQL queries, or using the Unity
Catalog REST API.
Initially, users have no access to data in a metastore. Only metastore admins can create schemas, tables, views,
and other Unity Catalog objects and grant or revoke access on them to account-level users or groups. Access
control policies are not inherited. The account-level admin who creates a metastore is its owner and metastore
admin.
Access control policies are applied by Unity Catalog before data can be read or written to your cloud tenant.
Ownership
Each securable object in Unity Catalog has an owner. The owner can be any account-level user or group, called a
principal. The principal that creates an object becomes its initial owner. An object’s owner has all privileges on
the object, such as SELECT and MODIFY on a table, as well as the permission to grant privileges to other
principals.
The object’s owner can transfer ownership to another user or group. A metastore admin can transfer ownership
of any object in the metastore to another user or group.
To see the owner of a securable object, use the following syntax. Replace the placeholder values:
<SECURABLE_TYPE> : The type of securable, such as CATALOG or TABLE .
<catalog> : The parent catalog for a table or view.
<schema> : The parent schema for a table or view.
<securable_name> : The name of the securable, such as a table or view.
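A hedged SQL sketch (the Owner field appears in the extended description, and ownership can be transferred with ALTER ... OWNER TO; exact output varies by securable type):

DESCRIBE <SECURABLE_TYPE> EXTENDED <catalog>.<schema>.<securable_name>;

-- Transfer ownership of the securable to another user or group.
ALTER <SECURABLE_TYPE> <catalog>.<schema>.<securable_name> OWNER TO `<principal>`;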
Ownership of a metastore
The account-level admin who creates a metastore is its owner. To transfer ownership of a metastore to a
different account-level user or group, see (Recommended) Transfer ownership of your metastore to a group. For
added security, this command is not available using SQL syntax.
Privileges
In Unity Catalog, you can grant the following privileges on a securable object:
USAGE : This privilege does not grant access to the securable itself, but allows the grantee to traverse the
securable in order to access its child objects. For example, to select data from a table, users need to have the
SELECT privilege on that table and USAGE privileges on its parent schema and parent catalog. Thus, you can
use this privilege to restrict access to sections of your data namespace to specific groups.
SELECT : Allows a user to select from a table or view, if the user also has USAGE on its parent catalog and
schema.
MODIFY : Allows the grantee to add, update and delete data to or from the securable if the user also has
USAGE on its parent catalog and schema.
CREATE : Allows a user to create a schema if the user also has USAGE and CREATE permissions on its parent
catalog. Allows a user to create a table or view if the user also has USAGE on its parent catalog and schema
and the CREATE permission on the schema.
In addition, you can grant the following privileges on storage credentials and external locations.
CREATE TABLE : Allows a user to create external tables directly in your cloud tenant using a storage
credential.
READ FILES : When granted on an external location, allows a user to read files directly from your cloud
tenant using the storage credential associated with the external location.
When granted directly on a storage credential, allows a user to read files directly from your cloud tenant
using the storage credential.
WRITE FILES : When granted on an external location, allows a user to write files directly to your cloud
tenant using the storage credential associated with the external location.
When granted directly on a storage credential, allows a user to write files directly to your cloud tenant
using the storage credential.
NOTE
Although you can grant READ FILES and WRITE FILES privileges on a storage credential, Databricks recommends that
you instead grant these privileges on an external location. This allows you to manage permissions at a more granular level
and provides a simpler experience to end users.
In Unity Catalog, privileges are not inherited on child securable objects. For example, if you grant the CREATE
privilege on a catalog to a user, the user does not automatically have the CREATE privilege on all databases in
the catalog.
The following summarizes the privileges that can be granted on each securable object:
Catalog: CREATE, USAGE
Schema: CREATE, USAGE
Table: SELECT, MODIFY
View: SELECT
External location: CREATE TABLE, READ FILES, WRITE FILES
Storage credential: CREATE TABLE, READ FILES, WRITE FILES
Manage privileges
You can manage privileges for metastore objects in the Databricks SQL data explorer or by using SQL
commands in the Databricks SQL editor or a Data Science & Engineering notebook.
To manage privileges, you use GRANT and REVOKE statements. Only an object’s owner or a metastore admin can
grant privileges on the object and its descendent objects. A built-in account-level group called account users
includes all account-level users.
This section contains examples of using SQL commands to manage privileges. To manage privileges using the
Databricks SQL Data Explorer, see Manage access to data.
Show grants on a securable object
To show grants on an object using SQL, use a command like the following. To use the Databricks SQL Data
Explorer, see Use the Databricks SQL Data Explorer.
Use the following syntax. Replace the placeholder values:
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.
SQL
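-- A hedged sketch of the statement this section describes.
SHOW GRANTS ON <securable_type> <securable_name>;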
Python
library(SparkR)
Scala
To show all grants for a given principal on an object, use the following syntax. Replace the placeholder values:
<principal> : The email address of an account-level user or the name of an account-level group.
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.
SQL
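-- A hedged sketch: show all grants for a specific principal on an object.
SHOW GRANTS `<principal>` ON <securable_type> <securable_name>;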
Python
library(SparkR)
Scala
Grant a privilege
To grant a privilege using SQL, use a command like the following. To use the Databricks SQL Data Explorer, see
Use the Databricks SQL Data Explorer.
Use the following syntax. Replace the placeholder values:
<privilege> : The privilege to grant, such as SELECT or USAGE .
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.
<principal> : The email address of an account-level user or the name of an account-level group.
SQL
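-- A hedged sketch of the grant statement this section describes.
GRANT <privilege> ON <securable_type> <securable_name> TO <principal>;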
Python
library(SparkR)
Scala
Revoke a privilege
To revoke a privilege using SQL, use a command like the following. To use the Databricks SQL Data Explorer, see
Use the Databricks SQL Data Explorer.
Use the following syntax. Replace the placeholder values:
<privilege> : The privilege to revoke, such as SELECT or USAGE .
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.
<principal> : The email address of an account-level user or the name of an account-level group.
SQL
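-- SQL form of the revoke statement; it mirrors the Python example below.
REVOKE <privilege> ON <securable_type> <securable_name> FROM <principal>;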
Python
spark.sql("REVOKE <privilege> ON <securable_type> <securable_name> FROM <principal>")
library(SparkR)
Scala
Dynamic views
In Unity Catalog, you can use dynamic views to configure fine-grained access control, including:
Security at the level of columns or rows.
Data masking.
NOTE
Fine-grained access control using dynamic views is not available on clusters with Single User security mode.
Unity Catalog introduces the following functions, which allow you to dynamically limit which users can access a
row, column, or record in a view:
current_user() : Returns the current user’s email address.
is_account_group_member() : Returns TRUE if the current user is a member of a specific account-level group.
Recommended for use in dynamic views against Unity Catalog data.
is_member() : Returns TRUE if the current user is a member of a specific workspace-level group. This function
is provided for compatibility with the existing Hive metastore. Avoid using it with views against Unity Catalog
data, because it does not evaluate account-level group membership.
The following examples illustrate how to create dynamic views in Unity Catalog.
Column-level permissions
With a dynamic view, you can limit the columns a specific user or group can access. In the following example,
only members of the auditors group can access email addresses from the sales_raw table. During query
analysis, Apache Spark replaces the CASE statement with either the literal string REDACTED or the actual
contents of the email address column. Other columns are returned as normal. This strategy has no negative
impact on the query performance.
SQL
-- Alias the field 'email' to itself (as 'email') to prevent the
-- permission logic from showing up directly in the column name results.
CREATE VIEW sales_redacted AS
SELECT
user_id,
CASE WHEN
is_account_group_member('auditors') THEN email
ELSE 'REDACTED'
END AS email,
country,
product,
total
FROM sales_raw
Python
library(SparkR)
Scala
// Alias the field 'email' to itself (as 'email') to prevent the
// permission logic from showing up directly in the column name results.
spark.sql("CREATE VIEW sales_redacted AS " +
"SELECT " +
" user_id, " +
" CASE WHEN " +
" is_account_group_member('auditors') THEN email " +
" ELSE 'REDACTED' " +
" END AS email, " +
" country, " +
" product, " +
" total " +
"FROM sales_raw")
Row-level permissions
With a dynamic view, you can specify permissions down to the row or field level. In the following example, only
members of the managers group can view transaction amounts when they exceed $1,000,000. Matching results
are filtered out for other users.
SQL
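-- A hedged sketch: only members of the managers group see rows where total
-- exceeds 1,000,000; the group and table names come from the example above.
CREATE VIEW sales_redacted AS
SELECT
  user_id,
  country,
  product,
  total
FROM sales_raw
WHERE
  CASE
    WHEN is_account_group_member('managers') THEN TRUE
    ELSE total <= 1000000
  END;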
Python
R
library(SparkR)
Scala
Data masking
Because views in Unity Catalog use Spark SQL, you can implement advanced data masking by using more
complex SQL expressions and regular expressions. In the following example, all users can analyze email
domains, but only members of the auditors group can view a user’s entire email address.
SQL
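-- A hedged sketch: all users can analyze the email domain, extracted with
-- regexp_extract, but only members of the auditors group see the full address.
CREATE VIEW sales_redacted AS
SELECT
  user_id,
  CASE WHEN
    is_account_group_member('auditors') THEN email
    ELSE regexp_extract(email, '@([A-Za-z0-9]+)', 1)
  END AS email,
  country,
  product,
  total
FROM sales_raw;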
Python
# The regexp_extract function takes an email address such as
# user.x.lastname@example.com and extracts 'example', allowing
# analysts to query the domain name.
library(SparkR)
Scala
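The examples that follow query a table in the legacy Hive metastore by qualifying it with the hive_metastore catalog; a hedged sketch of the SQL form:

SELECT * FROM hive_metastore.sales.sales_raw;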
display(spark.table("hive_metastore.sales.sales_raw"))
library(SparkR)
display(tableToDF("hive_metastore.sales.sales_raw"))
Scala
display(spark.table("hive_metastore.sales.sales_raw"))
You can also specify the catalog and schema with a USE statement:
SQL
USE hive_metastore.sales;
SELECT * from sales_raw;
Python
spark.sql("USE hive_metastore.sales")
display(spark.table("sales_raw"))
library(SparkR)
sql("USE hive_metastore.sales")
display(tableToDF("sales_raw"))
Scala
spark.sql("USE hive_metastore.sales")
display(spark.table("sales_raw"))
NOTE
A join with data in the legacy Hive metastore will only work on the workspace where that data resides. Trying to run such
a join in another workspace results in an error. Azure Databricks recommends that you upgrade legacy tables and views to
Unity Catalog.
The following example joins results from the sales_current table in the legacy Hive metastore with the
sales_historical table in the Unity Catalog metastore when the order_id fields are equal.
SQL
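-- A hedged SQL sketch of the same join shown in Python below.
SELECT *
FROM hive_metastore.sales.sales_current
JOIN main.shared_sales.sales_historical
  ON sales_current.order_id = sales_historical.order_id;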
Python
dfCurrent = spark.table("hive_metastore.sales.sales_current")
dfHistorical = spark.table("main.shared_sales.sales_historical")
display(dfCurrent.join(
other = dfHistorical,
on = dfCurrent.order_id == dfHistorical.order_id
))
R
library(SparkR)
dfCurrent = tableToDF("hive_metastore.sales.sales_current")
dfHistorical = tableToDF("main.shared_sales.sales_historical")
display(join(
x = dfCurrent,
y = dfHistorical,
joinExpr = dfCurrent$order_id == dfHistorical$order_id))
Scala
val dfCurrent = spark.table("hive_metastore.sales.sales_current")
val dfHistorical = spark.table("main.shared_sales.sales_historical")
display(dfCurrent.join(
  right = dfHistorical,
  joinExprs = dfCurrent("order_id") === dfHistorical("order_id")
))
The following example expresses a more complex join. It returns the sum of current and historical sales by
customer, assuming that each table contains at least one row for each customer.
SQL
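-- A hedged SQL sketch of the same full outer join shown in Python below.
SELECT
  cur.customer_id,
  cur.customer_name,
  cur.total + hist.total AS total
FROM hive_metastore.sales.sales_current AS cur
FULL OUTER JOIN main.shared_sales.sales_historical AS hist
  ON cur.customer_id = hist.customer_id;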
Python
from pyspark.sql.functions import coalesce

dfCurrent = spark.table("hive_metastore.sales.sales_current")
dfHistorical = spark.table("main.shared_sales.sales_historical")
dfJoin = dfCurrent.join(
other = dfHistorical,
on = dfCurrent.customer_id == dfHistorical.customer_id,
how = "full_outer"
)
display(dfJoin.select(
dfCurrent.customer_id,
dfCurrent.customer_name,
(coalesce(dfCurrent.total) + coalesce(dfHistorical.total)).alias("total")))
R
library(SparkR)
dfCurrent = tableToDF("hive_metastore.sales.sales_current")
dfHistorical = tableToDF("main.shared_sales.sales_historical")
dfJoin = join(
x = dfCurrent,
y = dfHistorical,
joinExpr = dfCurrent$customer_id == dfHistorical$customer_id,
joinType = "fullouter")
display(
select(
x = dfJoin,
col = list(
dfCurrent$customer_id,
dfCurrent$customer_name,
alias(coalesce(dfCurrent$total) + coalesce(dfHistorical$total), "total")
)
)
)
Scala
import org.apache.spark.sql.functions.coalesce
val dfCurrent = spark.table("hive_metastore.sales.sales_current")
val dfHistorical = spark.table("main.shared_sales.sales_historical")
val dfJoin = dfCurrent.join(dfHistorical,
  dfCurrent("customer_id") === dfHistorical("customer_id"), "full_outer")
display(dfJoin.select(
  dfCurrent("customer_id"),
  dfCurrent("customer_name"),
  (coalesce(dfCurrent("total")) + coalesce(dfHistorical("total"))).alias("total")
))
Default catalog
If you omit the top-level catalog name and there is no USE CATALOG statement, the default catalog is assumed. To
configure the default catalog for a workspace, set the spark.databricks.sql.initial.catalog.name value.
Databricks recommends setting the default catalog value to hive_metastore so that your existing code can
operate on current Hive metastore data without any change.
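For example, a cluster Spark configuration entry (a sketch) that makes the Hive metastore the default catalog:

spark.databricks.sql.initial.catalog.name hive_metastore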
Cluster instance profile
When using the Hive metastore alongside Unity Catalog, the instance profile on the cluster is used to access
Hive Metastore data but not the data in Unity Catalog. Unity Catalog does not rely on the instance profile
configured for a cluster.
Upgrade legacy tables to Unity Catalog
Tables in the Hive metastore do not benefit from the full set of security and governance features that Unity
Catalog introduces, such as built-in auditing and access control. Databricks recommends that you upgrade your
legacy tables by adding them to Unity Catalog.
Next steps
Get started using Unity Catalog
Learn more about key concepts of Unity Catalog
Use Azure managed identities in Unity Catalog to
access storage
7/21/2022 • 9 minutes to read
IMPORTANT
This feature is in Public Preview.
This article describes how to use Azure managed identities for connecting to storage containers on behalf of
Unity Catalog users.
NOTE
Azure Databricks supports only system-assigned managed identities. You cannot use user-assigned managed identities.
NOTE
You cannot manage access connectors as a service principal.
The resource group should be in the same region as the storage account that you want to connect to.
2. Click + Create or Create a new resource .
3. Search for Template Deployment .
4. Click Create .
5. Click Build your own template in the editor .
6. Copy and paste this template into the editor:
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"connectorName": {
"defaultValue": "testConnector",
"type": "String",
"metadata": {
"description": "The name of the Azure Databricks Access Connector to create."
}
},
"accessConnectorRegion": {
"defaultValue": "[resourceGroup().location]",
"type": "String",
"metadata": {
"description": "Location for the access connector resource."
}
},
"enableSystemAssignedIdentity": {
"defaultValue": true,
"type": "bool",
"metadata": {
"description": "Whether the system assigned managed identity is enabled"
}
}
},
"resources": [
{
"type": "Microsoft.Databricks/accessConnectors",
"apiVersion": "2022-04-01-preview",
"name": "[parameters('connectorName')]",
"location": "[parameters('accessConnectorRegion')]",
"identity": {
"type": "[if(parameters('enableSystemAssignedIdentity'), 'SystemAssigned', 'None')]"
}
}
]
}
7. Click Save .
8. On the Basics tab, accept, select, or enter values for the following fields:
Subscription : This is the Azure subscription that the Azure Databricks access connector will be
created in. The default is the Azure subscription you are currently using. It can be any subscription in
the tenant.
Resource group : This should be a resource group in the same region as the storage account that you
will connect to.
Region : This should be the same region as the storage account that you will connect to.
Connector Name : Enter a name that indicates the purpose of the connector resource.
Access Connector Region : Accept the default [resourceGroup().location] to have the connector
resource deploy in the same region as the resource group. You can enter a different region value, but
Databricks recommends that the connector region and resource group region be the same as the
storage account that you will connect to.
Enable System Assigned Identity : Accept the default value of true .
9. Click Next: Review + create > .
10. Click Create .
When the deployment succeeds, the Azure Databricks access connector is deployed with a system-
assigned managed identity.
11. When the deployment is complete, click Go to resource .
12. Make note of the Resource ID .
The resource ID is in the format:
/subscriptions/12f34567-8ace-9c10-111c-
aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connec
tor-name>
2. Click Data .
3. Click Create Metastore .
4. Enter values for the following fields:
Name for the metastore.
Region where the metastore will be deployed.
For best performance, co-locate the access connector, workspaces, metastore and cloud storage
location in the same cloud region.
ADLS Gen 2 path : enter the path to the storage container that you will use as root storage for the
metastore.
The abfss:// prefix is added automatically.
Access Connector ID : enter the Azure Databricks access connector’s resource ID in the format:
/subscriptions/12f34567-8ace-9c10-111c-
aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/
<connector-name>
5. Click Create .
If the request fails, retry using a different metastore name.
6. When prompted, select workspaces to link to the metastore.
/subscriptions/12f34567-8ace-9c10-111c-
aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connec
tor-name>
3. After you create the network rule, go to your Azure Storage account in the Azure Portal and view the
managed identity in the Networking tab under Resource instances , resource type
Microsoft.Databricks/accessConnectors .
4. Under Exceptions , clear the Allow Azure services on the trusted services list to access this
storage account checkbox.
5. Optionally, set Public Network Access to Disabled . The managed identity can be used to bypass the
check on public network access.
The standard approach is to keep this value set to Enabled from selected virtual networks and IP
addresses .
Step 2. Enable your Azure Databricks workspace to access Azure Storage
Follow the instructions in Securely Accessing Azure Data Sources from Azure Databricks to secure connectivity
from your Azure Databricks workspace to Azure Storage.
8. Run the following cURL command to update the metastore with the new root storage credential.
Replace the placeholder values:
<workspace-url> : The URL of the workspace where the personal access token was generated.
<metastore-id> : The metastore ID that you retrieved in the previous step.
<storage-credential-id> : The storage credential ID.
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
This article shows how to create a metastore in Unity Catalog and link it to workspaces.
Requirements
You must be an Azure Databricks account admin.
Your Azure Databricks account must be on the Premium plan.
In your Azure tenant, you must have permission to create:
A storage account to use with Azure Data Lake Storage Gen2. See Create a storage account to use with
Azure Data Lake Storage Gen2.
A new resource to hold a system-assigned managed identity. This requires that you be a Contributor
or Owner of a resource group in any subscription in the tenant.
3. Click Data .
4. Click Create Metastore .
5. Enter values for the following fields:
Name for the metastore.
Region where the metastore will be deployed.
For best performance, co-locate the access connector, workspaces, metastore and cloud storage
location in the same cloud region.
ADLS Gen 2 path : Enter the path to the storage container that you will use as root storage for the
metastore.
The abfss:// prefix is added automatically.
Access Connector ID : Enter the Azure Databricks access connector’s resource ID in the format:
/subscriptions/12f34567-8ace-9c10-111c-
aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/
<connector-name>
6. Click Create .
If the request fails, retry using a different metastore name.
7. When prompted, select workspaces to link to the metastore.
The user who creates a metastore is its owner. To change ownership of a metastore after creating it, see
(Recommended) Transfer ownership of your metastore to a group.
Create a metastore that is accessed using a service principal
To create a Unity Catalog metastore that is accessed by a service principal:
1. Create a storage account for Azure Data Lake Storage Gen2.
This storage account will contain metadata related to Unity Catalog metastores and their objects, as well
as the data for managed tables in Unity Catalog. See Create a storage account to use with Azure Data
Lake Storage Gen2. Make a note of the region where you created the storage account.
2. Create a container in the new storage account.
Make a note of the ADLSv2 URI for the container, which is in the following format:
abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<metastore-name>
4. In the storage account, go to Access Control (IAM) and grant the new service principal the Storage
Blob Data Contributor role.
5. Make note of these properties, which you will use when you create a metastore:
<aad-application-id>
The storage account region
<storage-container>
The service principal’s <client-secret> , <client-application-id> , and <directory-id>
6. Log in to the account console.
7. Click Data .
8. Click Create Metastore .
a. Enter a name for the metastore.
b. Enter the region where the metastore will be deployed. For best performance, co-locate the
workspaces, metastore and cloud storage location in the same cloud region.
c. For ADLS Gen 2 path , enter the value of <storage-container> . The abfss:// prefix is added
automatically.
9. Click Create .
The user who creates a metastore is its owner. To change ownership of a metastore after creating it, see
(Recommended) Transfer ownership of your metastore to a group.
10. Make a note of the metastore’s ID. When you view the metastore’s properties, the metastore’s ID is the
portion of the URL after /data and before /configuration .
11. The metastore has been created, but Unity Catalog cannot yet write data to it. To finish setting up the
metastore:
a. In a separate browser, log in to a workspace that is assigned to the metastore as a workspace
admin.
b. Make a note of the workspace URL, which is the first portion of the URL, after https:// and
inclusive of azuredatabricks.net .
c. Generate a personal access token. See Generate a personal access token.
d. Add the personal access token to the .netrc file in your home directory. This improves security
by preventing the personal access token from appearing in your shell’s command history. See
Store tokens in a .netrc file and use them in curl.
e. Run the following cURL command to create the root storage credential for the metastore. Replace
the placeholder values:
<workspace-url> : The URL of the workspace where the personal access token was generated.
<credential-name> : A name for the storage credential.
<directory-id> : The directory ID for the service principal you created.
<application-id> : The application ID for the service principal you created.
<client-secret> : The value of the client secret you generated for the service principal (not the
client secret ID).
Make a note of the storage credential ID, which is the value of id from the cURL command’s
response.
12. Run the following cURL command to update the metastore with the new root storage credential. Replace
the placeholder values:
<workspace-url> : The URL of the workspace where the personal access token was generated.
<metastore-id >: The metastore’s ID.
<storage-credential-id >: The storage credential’s ID from the previous command.
Next steps
Create and manage catalogs
Create and manage schemas (databases)
Create tables
Learn more about the key concepts of Unity Catalog
Create compute resources
7/21/2022 • 3 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
This article shows how to create a Data Science & Engineering or Databricks Machine Learning cluster or a
Databricks SQL warehouse that can access data in Unity Catalog.
Requirements
Your Azure Databricks account must be on the Premium plan.
In a workspace, you must have permission to create compute resources.
2. Click Compute .
3. Click Create cluster .
a. Enter a name for the cluster.
b. Set Databricks runtime version to Runtime: 10.3 (Scala 2.12, Spark 3.2.1) or higher.
4. Click Advanced Options . Set Security Mode to User Isolation or Single User .
User Isolation clusters can be shared by multiple users, but only SQL workloads are supported. Some
advanced cluster features such as library installation, init scripts, and the DBFS Fuse mount are also
disabled to ensure security isolation among cluster users.
To use those advanced cluster features or languages, or to run workloads in Python, Scala, and R, set
the cluster mode to Single User. A Single User cluster can also run SQL workloads. The cluster can be used
exclusively by a single user (by default, the owner of the cluster); other users cannot attach to the cluster.
Automated jobs should run in this mode, and the job’s owner should be the cluster’s owner. In this mode,
view security cannot be enforced: a user selecting from a view executes with their own permissions.
For more information about the features available in each security mode, see Cluster security mode.
5. Click Create Cluster .
2. In the Data Science & Engineering or Databricks Machine Learning persona, click Compute .
3. Click Create cluster .
a. Enter a name for the cluster.
b. For Databricks runtime version :
a. Click ML .
b. Select either 10.3 ML (Scala 2.12, Spark 3.2.1) or higher, or 10.3 ML (GPU, Scala 2.12,
Spark 3.2.1) or higher.
4. Click Advanced Options . Set Security Mode to User Isolation or Single User . To run Python code,
you must use Single User .
User Isolation clusters can be shared by multiple users, but only SQL workloads are supported. Some
advanced cluster features such as library installation, init scripts, and the DBFS Fuse mount are also
disabled to ensure security isolation among cluster users.
To use those advanced cluster features or languages, or to run workloads in Python, Scala, and R, set
the cluster mode to Single User. A Single User cluster can also run SQL workloads. The cluster can be used
exclusively by a single user (by default, the owner of the cluster); other users cannot attach to the cluster.
Automated jobs should run in this mode, and the job’s owner should be the cluster’s owner. In this mode,
view security cannot be enforced: a user selecting from a view executes with their own permissions.
For more information about the features available in each security mode, see Cluster security mode.
5. Click Create Cluster .
Next steps
Create and manage catalogs
Create and manage schemas (databases)
Create tables
Create and manage catalogs
7/21/2022 • 4 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
This article shows how to create and manage catalogs in Unity Catalog. A catalog contains schemas (databases),
and a schema contains tables and views.
Requirements
You must be an Azure Databricks account admin.
Your Azure Databricks account must be on the Premium Plan.
You must have a Unity Catalog metastore linked to the workspace where you perform the catalog creation.
The compute resource that you use to run the notebook, Databricks SQL editor, or Data Explorer workflow to
create the catalog must be compliant with Unity Catalog security requirements.
Create a catalog
To create a catalog, you can use the Data Explorer or a SQL command.
Data explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. Click the Create Catalog button.
5. Assign permissions for your catalog. See Manage privileges.
6. Click Save .
Sql
1. Run the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional.
Replace the placeholder values:
<catalog_name> : A name for the catalog.
<comment> : An optional comment.
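-- A hedged sketch of the statement this step describes.
CREATE CATALOG [ IF NOT EXISTS ] <catalog_name>
[ COMMENT '<comment>' ];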
library(SparkR)
library(SparkR)
Next steps
Now you can add schemas (databases) to your catalog.
Delete a catalog
To delete (or drop) a catalog, you can use the Data Explorer or a SQL command.
Data explorer
You must delete all schemas in the catalog except information_schema before you can delete a catalog. This
includes the auto-created default schema.
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. In the Data pane, on the left, click the catalog you want to delete.
5. In the detail pane, click the three-dot menu to the left of the Create database button and select Delete .
6. On the Delete catalog dialog, click Delete .
Sql
Run the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. Replace
the placeholder <catalog_name> .
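-- A hedged sketch of the statement this step describes.
DROP CATALOG [ IF EXISTS ] <catalog_name> [ CASCADE ];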
For parameter descriptions, see DROP CATALOG.
If you use DROP CATALOG without the CASCADE option, you must delete all schemas in the catalog except
information_schema before you can delete the catalog. This includes the auto-created default schema.
Python
Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
<catalog_name> .
R
Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
<catalog_name> .
library(SparkR)
library(SparkR)
Scala
Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
<catalog_name> .
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
This article shows how to create and manage schemas (databases) in Unity Catalog. A schema contains tables
and views. You create schemas inside catalogs.
Requirements
You must be an Azure Databricks account admin.
Your Azure Databricks account must be on the Premium Plan.
You must have a Unity Catalog metastore linked to the workspace where you perform the schema creation.
The compute resource that you use to run the notebook, Databricks SQL editor, or Data Explorer workflow to
create the schema must be compliant with Unity Catalog security requirements.
You must have the USAGE and CREATE data permissions on the schema’s parent catalog.
Create a schema
To create a schema (database), you can use the Data Explorer or SQL commands.
Data explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. In the Data pane on the left, click the catalog you want to create the schema in.
5. In the detail pane, click Create database .
6. Give the schema a name and add any comment that would help users understand the purpose of the
schema, then click Create .
7. Assign permissions for your catalog. See Manage privileges.
8. Click Save .
Sql
1. Run the following SQL commands in a notebook or Databricks SQL editor. Items in brackets are optional.
You can use either SCHEMA or DATABASE . Replace the placeholder values:
<catalog_name> : The name of the parent catalog for the schema.
<schema_name> : A name for the schema.
<comment> : An optional comment.
<property_name> = <property_value> [ , ... ] : The Spark SQL properties and values to set for the
schema.
For parameter descriptions, see CREATE SCHEMA.
USE CATALOG <catalog>;
CREATE { DATABASE | SCHEMA } [ IF NOT EXISTS ] <schema_name>
[ COMMENT <comment> ]
[ WITH DBPROPERTIES ( <property_name = property_value [ , ... ]> ) ];
You can optionally omit the USE CATALOG statement and replace <schema_name> with
<catalog_name>.<schema_name> .
2. Assign privileges to the schema. See Manage privileges.
Python
1. Run the following SQL commands in a notebook. Items in brackets are optional. You can use either
SCHEMA or DATABASE . Replace the placeholder values:
You can optionally omit the USE CATALOG statement and replace <schema_name> with
<catalog_name>.<schema_name> .
2. Assign privileges to the schema. See Manage privileges.
R
1. Run the following SQL commands in a notebook. Items in brackets are optional. You can use either
SCHEMA or DATABASE . Replace the placeholder values:
library(SparkR)
You can optionally omit the USE CATALOG statement and replace <schema_name> with
<catalog_name>.<schema_name> .
2. Assign privileges to the schema. See Manage privileges.
Scala
1. Run the following SQL commands in a notebook. Items in brackets are optional. You can use either
SCHEMA or DATABASE . Replace the placeholder values:
You can optionally omit the USE CATALOG statement and replace <schema_name> with
<catalog_name>.<schema_name> .
2. Assign privileges to the schema. See Manage privileges.
Next steps
Now you can add tables to your schema.
Delete a schema
To delete (or drop) a schema (database), you can use the Data Explorer or a SQL command.
Data explorer
You must delete all tables in the schema before you can delete it.
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. In the Data pane, on the left, click the schema (database) that you want to delete.
5. In the detail pane, click the three-dot menu in the upper right corner and select Delete .
6. On the Delete Database dialog, click Delete .
Sql
Run the following SQL command in a notebook or Databricks SQL editor. Items in brackets are optional. Replace
the placeholder <schema_name> .
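-- A hedged sketch of the statement this step describes.
DROP SCHEMA [ IF EXISTS ] <schema_name> [ CASCADE ];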
For parameter descriptions, see DROP SCHEMA.
If you use DROP SCHEMA without the CASCADE option, you must delete all tables in the schema before you can
delete it.
Python
Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
<schema_name> .
R
Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
<schema_name> .
library(SparkR)
library(SparkR)
Scala
Run the following SQL command in a notebook. Items in brackets are optional. Replace the placeholder
<schema_name> .
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
This article describes how to manage users, groups, and service principals in Azure Databricks accounts that use
Unity Catalog.
Requirements
You must be an Azure Databricks account admin.
Your Azure Databricks account must be on the Premium plan.
You must have a Unity Catalog metastore.
To manage account-level identities, you must be an account-level admin.
The initial account-level admin must be a member of the root management group for your Azure tenant.
The initial account-level admin can grant others the Admin role in the account console, as long as those
users have been added to Azure Databricks workspaces in the same Azure tenant.
When a user is added to a workspace, that user also becomes available in the account console and can be
granted the account admin role.
Service principals are not supported as account-level identities.
NOTE
Users and groups must be added as account-level identities before they can access Unity Catalog.
1. The initial account-level admin must be a Contributor in the Azure Active Directory root management
group, which is named Tenant root group by default. An Azure Active Directory Global Administrator
can add themselves to this group. Grant yourself this role, or ask an Azure Active Directory Global
Administrator to grant it to you.
The initial account-level admin can add users or groups to the account console, and can designate other
account-level admins by granting the Admin role to users.
2. All Azure Active Directory users who have been added to workspaces in your Azure tenant are
automatically added as account-level identities.
3. To designate additional account-level admins, you grant users the Admin role.
NOTE
It is not possible to grant the Admin role to a group.
a. Log in to the account console by clicking Settings , then clicking Manage account .
b. Click Users and Groups . A list of Azure Active Directory users appears. Only users and groups
who have been added to workspaces are shown.
c. Click the name of a user.
d. Click Roles .
e. Enable Admin .
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
This article shows how to create tables in Unity Catalog. A table can be managed or external.
Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
If necessary, create a metastore.
The metastore must be linked to the workspace where you run the commands to create the catalog or
schema.
You can create a catalog or schema from a Data Science & Engineering notebook or a Databricks SQL
endpoint that uses a compute resource that is compliant with Unity Catalog’s security requirements.
If necessary, create a catalog and schema in the metastore. The catalog and schema will contain the new
table. To create a schema, you must have the USAGE and CREATE data permissions on the schema’s parent
catalog. When you create a metastore, it contains a catalog called main with an empty schema called
default . See Create and manage catalogs and Create and manage schemas (databases).
If necessary, load the data into files on your cloud tenant, and create an external location in the metastore
that grants access to the files.
To create a table, you must have the USAGE permission on the parent catalog and the USAGE and CREATE
permissions on the parent schema.
For additional requirements to create an external table, see Create an external table.
Managed table
Managed tables are the default way to create tables in Unity Catalog. These tables are stored in the managed
storage location you configured when you created each metastore. To create a managed table with SQL, run a
CREATE TABLE command without a LOCATION clause. To delete a managed table with SQL, use the DROP TABLE
statement. When a managed table is dropped, its underlying data is deleted from your cloud tenant. The only
supported format for managed tables is Delta.
Example SQL syntax:
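-- A sketch; the table and values mirror the Scala example later in this article.
CREATE TABLE main.default.department
(
  deptcode INT,
  deptname STRING,
  location STRING
);

INSERT INTO main.default.department VALUES
  (10, 'FINANCE', 'EDINBURGH'),
  (20, 'SOFTWARE', 'PADDINGTON'),
  (30, 'SALES', 'MAIDSTONE'),
  (40, 'MARKETING', 'DARLINGTON'),
  (50, 'ADMIN', 'BIRMINGHAM');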
External table
External tables are tables whose data is stored in a storage location outside of the managed storage location,
and are not fully managed by Unity Catalog. When you run DROP TABLE on an external table, Unity Catalog does
not delete the underlying data. You can manage privileges on external tables and use them in queries in the
same way as managed tables. To create an external table with SQL, specify a LOCATION path in your
CREATE TABLE statement. External tables can use the following file formats:
DELTA
CSV
JSON
AVRO
PARQUET
ORC
TEXT
To manage access to the underlying cloud storage for an external table, Unity Catalog introduces two new object
types: storage credentials and external locations.
To learn more, see Create an external table.
SQL
Python
library(SparkR)
Scala
Python
library(SparkR)
Scala
spark.sql("CREATE TABLE main.default.department " +
"(" +
" deptcode INT," +
" deptname STRING," +
" location STRING" +
")" +
"INSERT INTO main.default.department VALUES " +
" (10, 'FINANCE', 'EDINBURGH')," +
" (20, 'SOFTWARE', 'PADDINGTON')," +
" (30, 'SALES', 'MAIDSTONE')," +
" (40, 'MARKETING', 'DARLINGTON')," +
" (50, 'ADMIN', 'BIRMINGHAM')")
Example notebook
You can use the following example SQL notebook to create a catalog, schema, and table, and to manage
permissions on them.
Create and manage a table in Unity Catalog
Get notebook
External locations and storage credentials are stored in the top level of the metastore, rather than in a catalog. To
create a storage credential or an external location, you must be the metastore’s owner or an account-level
admin.
To create an external table, follow these high-level steps. You can also use an example notebook to create the
storage credential, external location, and external table and manage permissions for them.
Create a storage credential
NOTE
In the Databricks SQL Data Explorer, you can create storage credentials and view their details, but you cannot modify,
rotate, or delete them. For those operations, you can use the Databricks SQL editor or a notebook.
1. In the Azure Portal, create a service principal.
a. Create a client secret for the service principal and make a note of it.
b. Make a note of the directory ID and application ID for the service principal.
c. In your storage account, grant the service principal the Azure Blob Contributor role.
2. In Azure Databricks, log in to a workspace that is linked to the metastore.
3. From the persona switcher, select SQL
4. Click Data .
5. At the bottom of the screen, click Storage Credentials .
6. Click Create credential .
7. Enter a name for the credential.
8. Enter the directory ID, application ID, and client secret of a service principal that has been granted the Azure
Blob Contributor role on the relevant storage container.
9. Optionally, enter a comment for the storage credential.
10. Click Save .
You can grant permissions directly on the storage credential, but Databricks recommends that you reference it in
an external location and grant permissions to that instead. An external location combines a storage credential
with a specific path, and authorizes access only to that path and its contents.
Manage permissions on a storage credential
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click Storage Credentials .
5. Click the name of a storage credential to open its properties.
6. Click Permissions .
7. To grant permission to users or groups, select each identity, then click Grant .
8. To revoke permissions from users or groups, deselect each identity, then click Revoke .
Create an external location
1. From the persona switcher, select SQL .
2. Click Data .
3. Click the + menu at the upper right and select Add an external location .
4. Click Create location .
a. Enter a name for the location.
b. Optionally copy the storage container path from an existing mount point.
c. If you aren’t copying from an existing mount point, enter a storage container path.
d. Select the storage credential that grants access to the location.
e. Click Save .
Manage permissions on an external location
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click External Locations .
5. Click the name of an external location to open its properties.
6. Click Permissions .
7. To grant permission to users or groups, select each identity, then click Grant .
8. To revoke permissions from users or groups, deselect each identity, then click Revoke .
Create an external table
You can create an external table using an external location (recommended) or using a storage credential directly.
In the following examples, replace the placeholder values:
<catalog> : The name of the catalog that will contain the table.
<schema> : The name of the schema that will contain the table.
<table_name> : A name for the table.
<column_specification> : The name and data type for each column.
<bucket_path> : The path on your cloud tenant where the table will be created.
<table_directory> : A directory where the table will be created. Use a unique directory for each table.
IMPORTANT
Once a table is created in a path, users can no longer directly access the files in that path even if they have been given
privileges on an external location or storage credential to do so. This is to ensure that users cannot circumvent access
controls applied to tables by reading files from your cloud tenant directly.
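A hedged SQL sketch using an external location (recommended):

CREATE TABLE <catalog>.<schema>.<table_name>
(
  <column_specification>
)
LOCATION 'abfss://<bucket_path>/<table_directory>';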
Python
library(SparkR)
Scala
Unity Catalog checks whether you have the CREATE TABLE permission on an external location that includes the LOCATION path you specify. If so, the external table is created. Otherwise, an error occurs and the external table is not created.
NOTE
You can instead migrate an existing external table in the Hive metastore to Unity Catalog without duplicating its data. See
Upgrade an external table to Unity Catalog.
NOTE
Databricks recommends that you use external locations, rather than using storage credentials directly.
To create an external table using a storage credential, add a WITH (CREDENTIAL <credential_name>) clause to your
SQL statement:
SQL
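-- A hedged SQL sketch; it mirrors the Scala example below.
CREATE TABLE <catalog>.<schema>.<table_name>
(
  <column_specification>
)
LOCATION 'abfss://<bucket_path>/<table_directory>'
WITH (CREDENTIAL <storage_credential>);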
Python
library(SparkR)
Scala
spark.sql("CREATE TABLE <catalog>.<schema>.<table_name> " +
"( " +
" <column_specification> " +
") " +
"LOCATION 'abfss://<bucket_path>/<table_directory>' " +
"WITH (CREDENTIAL <storage_credential>)")
Unity Catalog checks whether you have the CREATE TABLE permission on the storage credential you specify, and
whether the storage credential authorizes reading from and writing to the location you specified in the
LOCATION clause. If both of these things are true, the external table is created. Otherwise, an error occurs and the
external table is not created.
Example notebook
Create and manage an external table in Unity Catalog
Get notebook
NOTE
A storage path where you create an external table cannot also be used to read or write data files.
LIST 'abfss://<path_to_files>';
Python
display(spark.sql("LIST 'abfss://<path_to_files>'"))
library(SparkR)
display(sql("LIST 'abfss://<path_to_files>'"))
Scala
display(spark.sql("LIST 'abfss://<path_to_files>'"))
If you have the READ FILES permission on the external location associated with the cloud storage path, a
list of data files in that location is returned.
2. Query the data in the files in a given path:
SQL
SELECT * FROM <format>.`abfss://<path_to_files>`;
Python
display(spark.read.load("abfss://<path_to_files>"))
library(SparkR)
display(loadDF("abfss://<path_to_files>"))
Scala
display(spark.read.load("abfss://<path_to_files>"))
Python
library(SparkR)
Scala
NOTE
You can instead migrate an existing external table in the Hive metastore to Unity Catalog without duplicating its data. See
Upgrade an external table to Unity Catalog.
1. Create a new table and populate it with records from data files on your cloud tenant.
IMPORTANT
When you create a table using this method, the storage path is read only once, to prevent duplication of
records. If you want to re-read the contents of the directory, you must drop and re-create the table. For an
existing table, you can insert records from a storage path.
The bucket path where you create a table cannot also be used to read or write data files.
Only the files in the exact directory are read; the read is not recursive.
You must have the following permissions:
USAGE on the parent catalog and schema.
CREATE on the parent schema.
READ FILES on the external location associated with the bucket path where the files are located, or
directly on the storage credential if you are not using an external location.
If you are creating an external table, you need CREATE TABLE on the bucket path where the table will
be created.
To create a managed table and populate it with records from a bucket path:
SQL
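-- A hedged sketch: create a managed table from the files at a path covered by
-- an external location on which you hold READ FILES.
CREATE TABLE <catalog>.<schema>.<table_name>
AS SELECT * FROM <format>.`abfss://<path_to_files>`;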
Python
library(SparkR)
Scala
Python
library(SparkR)
Scala
To use a storage credential directly, add WITH (CREDENTIAL <storage_credential>) to the command.
To insert records from files in a bucket path into a managed table, using an external location to read from
the bucket path:
SQL
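-- A hedged sketch: append records from the files at the path into an existing managed table.
INSERT INTO <catalog>.<schema>.<table_name>
SELECT * FROM <format>.`abfss://<path_to_files>`;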
Python
library(SparkR)
Scala
Python
library(SparkR)
Scala
To use a storage credential directly, add WITH (CREDENTIAL <storage_credential>) to the command.
Next steps
Manage access to data
Create views
7/21/2022 • 7 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
This article shows how to create views in Unity Catalog. A view is a read-only object that joins records from
multiple tables and views.
Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
If necessary, create a metastore.
The metastore must be linked to the workspace where you run the commands to create the catalog or
schema.
If necessary, create a catalog and schema in the metastore. The catalog and schema will contain the new
table. To create a schema, you must have the USAGE and CREATE data permissions on the schema’s
parent catalog. See Create and manage catalogs and Create and manage schemas (databases).
If necessary, create tables that will compose the view.
You can create a view from a Data Science & Engineering notebook or a Databricks SQL endpoint that
uses a compute resource that is compliant with Unity Catalog’s security requirements.
The parent catalog and schema must exist. When you create a metastore, it contains a catalog called
main with an empty schema called default . See Create and manage catalogs and Create and manage
schemas (databases).
You must have the USAGE permission on the parent catalog and the USAGE and CREATE permissions on
the parent schema.
You must have SELECT access on all tables and views referenced in the view.
NOTE
To read from a view from a cluster with Single User cluster security mode, you must have SELECT on all
referenced tables and views.
Create a view
To create a view, run the following SQL command. Items in brackets are optional. Replace the placeholder values:
<catalog_name> : The name of the catalog.
<schema_name> : The name of the schema.
<view_name> : A name for the view.
<query> : The query, columns, and tables and views used to compose the view.
SQL
Python
library(SparkR)
Scala
For example, to create a view named sales_redacted from columns in the sales_raw table:
SQL
Python
R
library(SparkR)
Scala
NOTE
Fine-grained access control using dynamic views is not available on clusters with Single User security mode.
Unity Catalog introduces the following functions, which allow you to dynamically limit which users can access a
row, column, or record in a view:
current_user(): Returns the current user’s email address.
is_account_group_member() : Returns TRUE if the current user is a member of a specific account-level group.
Recommended for use in dynamic views against Unity Catalog data.
is_member() : Returns TRUE if the current user is a member of a specific workspace-level group. This function
is provided for compatibility with the existing Hive metastore. Avoid using it with views against Unity Catalog
data, because it does not evaluate account-level group membership.
The following examples illustrate how to create dynamic views in Unity Catalog.
Column-level permissions
With a dynamic view, you can limit the columns a specific user or group can access. In the following example,
only members of the auditors group can access email addresses from the sales_raw table. During query
analysis, Apache Spark replaces the CASE statement with either the literal string REDACTED or the actual
contents of the email address column. Other columns are returned as normal. This strategy has no negative
impact on the query performance.
SQL
-- Alias the field 'email' to itself (as 'email') to prevent the
-- permission logic from showing up directly in the column name results.
CREATE VIEW sales_redacted AS
SELECT
user_id,
CASE WHEN
is_account_group_member('auditors') THEN email
ELSE 'REDACTED'
END AS email,
country,
product,
total
FROM sales_raw
Python
library(SparkR)
Scala
// Alias the field 'email' to itself (as 'email') to prevent the
// permission logic from showing up directly in the column name results.
spark.sql("CREATE VIEW sales_redacted AS " +
"SELECT " +
" user_id, " +
" CASE WHEN " +
" is_account_group_member('auditors') THEN email " +
" ELSE 'REDACTED' " +
" END AS email, " +
" country, " +
" product, " +
" total " +
"FROM sales_raw")
Row-level permissions
With a dynamic view, you can specify permissions down to the row or field level. In the following example, only
members of the managers group can view transaction amounts when they exceed $1,000,000. Matching results
are filtered out for other users.
SQL
Python
R
library(SparkR)
Scala
Data masking
Because views in Unity Catalog use Spark SQL, you can implement advanced data masking by using more
complex SQL expressions and regular expressions. In the following example, all users can analyze email
domains, but only members of the auditors group can view a user’s entire email address.
SQL
Python
# The regexp_extract function takes an email address such as
# user.x.lastname@example.com and extracts 'example', allowing
# analysts to query the domain name.
library(SparkR)
Scala
Next steps
Manage access to data
Manage access to data
7/21/2022 • 8 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
This article shows how to manage access to objects and data in Unity Catalog. To learn more about data
permissions and object ownership, see Data permissions.
Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
If necessary, create a metastore.
To manage permissions for a table or view you must be a metastore admin, or you must be the table’s owner
and have the USAGE permission on its parent catalog and schema.
To manage permissions for a schema, you must be a metastore admin, or you must be the schema’s owner
and have the USAGE permission on its parent catalog.
To manage permissions for a catalog, you must be a metastore admin, or you must be the catalog’s owner.
To manage permissions for an external location or storage credential, you must be a metastore admin.
NOTE
Account-level admins can manage any object in any metastore in the account.
SQL
SHOW GRANTS ON <securable_type> <securable_name>;
Python
library(SparkR)
Scala
To show all grants for a given principal on an object, use the following syntax. Replace the placeholder values:
<principal> : The email address of an account-level user or the name of an account-level group.
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.
SQL
Python
library(SparkR)
Scala
Manage privileges
To manage privileges, you use GRANT and REVOKE statements. Only an object’s owner or a metastore admin can
grant privileges on the object and its descendent objects. A built-in account-level group called account users
includes all account-level users.
In Unity Catalog, you can grant the following privileges on a securable object:
USAGE : This privilege does not grant access to the securable itself, but allows the grantee to traverse the
securable in order to access its child objects. For example, to select data from a table, users need to have the
SELECT privilege on that table and USAGE privileges on its parent schema and parent catalog. Thus, you can
use this privilege to restrict access to sections of your data namespace to specific groups.
SELECT : Allows a user to select from a table or view, if the user also has USAGE on its parent catalog and
schema.
MODIFY : Allows the grantee to add, update and delete data to or from the securable if the user also has
USAGE on its parent catalog and schema.
CREATE : Allows a user to create a schema if the user also has USAGE and CREATE permissions on its parent
catalog. Allows a user to create a table or view if the user also has USAGE on its parent catalog and schema
and the CREATE permission on the schema.
In addition, you can grant the following privileges on storage credentials and external locations.
CREATE TABLE : Allows a user to create external tables directly in your cloud tenant using a storage
credential.
READ FILES : When granted on an external location, allows a user to read files directly from your cloud
tenant using the storage credential associated with the external location.
When granted directly on a storage credential, allows a user to read files directly from your cloud tenant
using the storage credential.
WRITE FILES : When granted on an external location, allows a user to write files directly to your cloud
tenant using the storage credential associated with the external location.
When granted directly on a storage credential, allows a user to write files directly to your cloud tenant
using the storage credential.
NOTE
Although you can grant READ FILES and WRITE FILES privileges on a storage credential, Databricks recommends that
you instead grant these privileges on an external location. This allows you to manage permissions at a more granular level
and provides a simpler experience to end users.
In Unity Catalog, privileges are not inherited on child securable objects. For example, if you grant the CREATE
privilege on a catalog to a user, the user does not automatically have the CREATE privilege on all databases in
the catalog.
The following list summarizes the privileges that can be granted on each securable object:
Catalog: USAGE, CREATE
Schema: USAGE, CREATE
Table: SELECT, MODIFY
View: SELECT
External location: CREATE TABLE, READ FILES, WRITE FILES
Storage credential: CREATE TABLE, READ FILES, WRITE FILES
Grant a privilege
To grant a privilege, you can use the Databricks SQL Data Explorer or SQL commands. Keep in mind that when
you grant a privilege on an object, you must also grant the USAGE privilege on its parent objects.
Use the Databricks SQL Data Explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. Select the object, such as a catalog, schema, table, or view.
5. Click Permissions .
6. Click Grant .
7. Enter the email address for a user or the name of a group.
8. Select the permissions to grant.
9. Click OK .
Use SQL
Use the following syntax. Replace the placeholder values:
<privilege> : The privilege to grant, such as SELECT or USAGE .
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.
<principal> : The email address of an account-level user or the name of an account-level group.
SQL
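A sketch of the typical GRANT statement, using the placeholders above:

GRANT <privilege> ON <securable_type> <securable_name> TO `<principal>`;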
Python
library(SparkR)
Scala
Revoke a privilege
To revoke a privilege, you can use the Databricks SQL Data Explorer or SQL commands.
Use the Databricks SQL Data Explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. Select the object, such as a catalog, schema, table, or view.
5. Click Permissions .
6. Select a privilege that has been granted to a user or group.
7. Click Revoke .
8. To confirm, click Revoke .
Use SQL
Use the following syntax. Replace the placeholder values:
<privilege> : The privilege to revoke, such as SELECT or USAGE .
<securable_type> : The type of the securable object, such as catalog or table .
<securable_name> : The name of the securable object.
<principal> : The email address of an account-level user or the name of an account-level group.
SQL
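A sketch of the typical REVOKE statement, using the placeholders above:

REVOKE <privilege> ON <securable_type> <securable_name> FROM `<principal>`;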
Python
library(SparkR)
Scala
Transfer ownership
To transfer ownership of an object within a metastore, you can use SQL.
Each securable object in Unity Catalog has an owner. The owner can be any account-level user or group, called a
principal. The principal that creates an object becomes its initial owner. An object’s owner has all privileges on
the object, such as SELECT and MODIFY on a table, as well as the permission to grant privileges to other
principals.
The object’s owner can transfer ownership to another user or group. A metastore admin can transfer ownership
of any object in the metastore to another user or group.
To see the owner of a securable object, use the following syntax. Replace the placeholder values:
<SECURABLE_TYPE> : The type of securable, such as CATALOG or TABLE .
<catalog> : The parent catalog for a table or view.
<schema> : The parent schema for a table or view.
<securable_name> : The name of the securable, such as a table or view.
SQL
Python
R
library(SparkR)
Scala
To transfer ownership of an object, use a SQL command with the following syntax. Replace the placeholder
values:
<SECURABLE_TYPE> : The type of securable, such as CATALOG or TABLE .
<SECURABLE_NAME> : The name of the securable.
<PRINCIPAL> : The email address of an account-level user or the name of an account-level group.
SQL
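A sketch of the typical statement, using the placeholders above (compare the concrete Scala example below):

ALTER <SECURABLE_TYPE> <SECURABLE_NAME> OWNER TO `<PRINCIPAL>`;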
Python
library(SparkR)
Scala
Python
library(SparkR)
Scala
spark.sql("ALTER TABLE orders OWNER TO `accounting`")
Dynamic views
Dynamic views allow you to manage which users have access to a view’s rows, columns, or even specific records
by filtering or masking their values. See Create a dynamic view.
Manage external locations and storage credentials
7/21/2022 • 18 minutes to read
External locations and storage credentials allow Unity Catalog to read and write data on your cloud tenant on
behalf of users. These objects are used for:
Creating, reading from, and writing to external tables.
Creating a managed or external table from files stored on your cloud tenant.
Inserting records into tables from files stored on your cloud tenant.
Directly exploring data files stored on your cloud tenant.
A storage credential represents an authentication and authorization mechanism for accessing data stored on
your cloud tenant, using either an Azure managed identity (strongly recommended) or a service principal. Each
storage credential is subject to Unity Catalog access-control policies that control which users and groups can
access the credential. If a user does not have access to a storage credential in Unity Catalog, the request fails and
Unity Catalog does not attempt to authenticate to your cloud tenant on the user’s behalf.
An external location is an object that combines a cloud storage path with a storage credential that authorizes
access to the cloud storage path. Each storage location is subject to Unity Catalog access-control policies that
control which users and groups can access the credential. If a user does not have access to a storage location in
Unity Catalog, the request fails and Unity Catalog does not attempt to authenticate to your cloud tenant on the
user’s behalf.
Databricks recommends using external locations rather than using storage credentials directly.
To create or manage a storage credential or an external location, you must be a workspace admin for the Unity
Catalog-enabled workspace you want to access the storage from.
Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
You must have a Unity Catalog metastore.
To manage permissions for an external location or storage credential, you must be a metastore admin.
Unity Catalog only supports Azure Data Lake Storage Gen2 for external locations.
4. Click Data .
5. Click the + menu at the upper right and select Add a storage credential .
6. On the Create a new storage credential dialog, select Managed identity (recommended) .
7. Enter a name for the credential, and enter the access connector’s resource ID in the format:
/subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>
8. Click Save .
9. Create an external location that references this storage credential.
To create a storage credential using a service principal:
1. In the Azure Portal, create a service principal and grant it access to your storage account.
a. Create a client secret for the service principal and make a note of it.
b. Make a note of the directory ID, and application ID for the service principal.
c. Go to your storage account and grant the service principal the Azure Blob Contributor role.
2. Log in to your Unity Catalog-enabled Azure Databricks workspace as a workspace admin.
3. From the persona switcher at the top of the sidebar, select SQL
4. Click Data .
5. Click the + menu at the upper right and select Add a storage credential .
6. On the Create a new storage credential dialog, select Service principal .
7. Enter a name for the credential, along with the directory ID, application ID, and client secret of the service
principal that has been granted the Azure Blob Contributor role on the storage container you want to
access.
8. Click Save .
9. Create an external location that references this storage credential.
List storage credentials
To view the list of all storage credentials in a metastore, you can use the Data Explorer or a SQL command.
Data explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click Storage Credentials .
Sql
Run the following command in a notebook or the Databricks SQL editor.
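For example, the listing command typically looks like this:

SHOW STORAGE CREDENTIALS;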
R
Run the following command in a notebook.
library(SparkR)
Scala
Run the following command in a notebook.
Python
Run the following command in a notebook. Replace <credential_name> with the name of the credential.
R
Run the following command in a notebook. Replace <credential_name> with the name of the credential.
library(SparkR)
Scala
Run the following command in a notebook. Replace <credential_name> with the name of the credential.
Python
Run the following command in a notebook. Replace the placeholder values:
<credential_name> : The name of the credential.
<new_credential_name> : A new name for the credential.
R
Run the following command in a notebook. Replace the placeholder values:
<credential_name> : The name of the credential.
<new_credential_name> : A new name for the credential.
library(SparkR)
Scala
Run the following command in a notebook. Replace the placeholder values:
<credential_name> : The name of the credential.
<new_credential_name> : A new name for the credential.
Data Explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click Storage Credentials .
5. Click the name of a storage credential to open its properties.
6. Click Permissions .
7. To grant permission to users or groups, select each identity, then click Grant .
8. To revoke permissions from users or groups, deselect each identity, then click Revoke .
SQL, Python, R, Scala
In the following examples, replace the placeholder values:
<principal> : The email address of the account-level user or the name of the account-level group to whom to
grant the permission.
<storage_credential_name> : The name of a storage credential.
To show grants on a storage credential, use a command like the following. You can optionally filter the results to
show only the grants for the specified principal.
SQL
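A sketch of the typical statement; include the principal to filter the results:

SHOW GRANTS ON STORAGE CREDENTIAL <storage_credential_name>;
SHOW GRANTS `<principal>` ON STORAGE CREDENTIAL <storage_credential_name>;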
Python
library(SparkR)
Scala
Python
R
library(SparkR)
Scala
To grant permission to select from an external table using a storage credential directly:
SQL
Python
library(SparkR)
Scala
NOTE
If a group name contains a space, use back-ticks around it (not apostrophes).
Python
Run the following command in a notebook. Replace the placeholder values:
<credential_name> : The name of the credential.
<principal> : The email address of an account-level user or the name of an account-level group.
R
Run the following command in a notebook. Replace the placeholder values:
<credential_name> : The name of the credential.
<principal> : The email address of an account-level user or the name of an account-level group.
library(SparkR)
Scala
Run the following command in a notebook. Replace the placeholder values:
<credential_name> : The name of the credential.
<principal> : The email address of an account-level user or the name of an account-level group.
Python
Run the following command in a notebook. Replace <credential_name> with the name of the credential.
Portions of the command that are in brackets are optional. By default, if the credential is used by an external
location, it is not deleted. Replace <credential_name> with the name of the credential.
IF EXISTS does not return an error if the credential does not exist.
FORCE deletes the credential even if it is used by an external location. External locations that depend on
this credential can no longer be used.
<credential_name> : The name of the credential.
<principal> : The email address of an account-level user or the name of an account-level group.
spark.sql("DROP STORAGE CREDENTIAL [IF EXISTS] <credential_name> [FORCE]")
R
Run the following command in a notebook. Replace <credential_name> with the name of the credential.
Portions of the command that are in brackets are optional. By default, if the credential is used by an external
location, it is not deleted. Replace <credential_name> with the name of the credential.
IF EXISTS does not return an error if the credential does not exist.
FORCE deletes the credential even if it is used by an external location. External locations that depend on
this credential can no longer be used.
library(SparkR)
Scala
Run the following command in a notebook. Replace <credential_name> with the name of the credential.
Portions of the command that are in brackets are optional. By default, if the credential is used by an external
location, it is not deleted. Replace <credential_name> with the name of the credential.
IF EXISTS does not return an error if the credential does not exist.
FORCE deletes the credential even if it is used by an external location. External locations that depend on
this credential can no longer be used.
NOTE
Each cloud storage path can be associated with only one external location. If you attempt to create a second external
location that references the same path, the command fails.
External locations only support Azure Data Lake Storage Gen2 storage.
SQL
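A sketch of the typical CREATE EXTERNAL LOCATION statement, assuming an ADLS Gen2 container path and an existing storage credential:

CREATE EXTERNAL LOCATION <location_name>
URL 'abfss://<container>@<storage_account>.dfs.core.windows.net/<path>'
WITH (STORAGE CREDENTIAL <credential_name>);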
Python
library(SparkR)
Scala
Python
Run the following command in a notebook. Replace <location_name> with the name of the external location.
display(spark.sql("DESCRIBE EXTERNAL LOCATION <location_name>"))
R
Run the following command in a notebook. Replace <location_name> with the name of the external location.
library(SparkR)
Scala
Run the following command in a notebook. Replace <location_name> with the name of the external location.
Python
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the location.
<new_location_name> : A new name for the location.
R
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the location.
<new_location_name> : A new name for the location.
library(SparkR)
Scala
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the location.
<new_location_name> : A new name for the location.
Python
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the external location.
<url> : The new storage URL the location should authorize access to in your cloud tenant.
R
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the external location.
<url> : The new storage URL the location should authorize access to in your cloud tenant.
library(SparkR)
Scala
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the external location.
<url> : The new storage URL the location should authorize access to in your cloud tenant.
The FORCE option changes the URL even if external tables depend upon the external location.
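A sketch of the typical statement for changing the URL; the FORCE keyword is optional:

ALTER EXTERNAL LOCATION <location_name> SET URL '<url>' [FORCE];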
To change the storage credential that an external location uses, do the following:
Sql
Run the following command in a notebook or the Databricks SQL editor. Replace the placeholder values:
<location_name> : The name of the external location.
<credential_name> : The name of the storage credential that grants access to the location’s URL in your cloud
tenant.
Python
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the external location.
<credential_name> : The name of the storage credential that grants access to the location’s URL in your cloud
tenant.
R
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the external location.
<credential_name> : The name of the storage credential that grants access to the location’s URL in your cloud
tenant.
library(SparkR)
Scala
Run the following command in a notebook. Replace the placeholder values:
<location_name> : The name of the external location.
<credential_name> : The name of the storage credential that grants access to the location’s URL in your cloud
tenant.
Data Explorer
1. Log in to a workspace that is linked to the metastore.
2. From the persona switcher, select SQL .
3. Click Data .
4. At the bottom of the screen, click External Locations .
5. Click the name of an external location to open its properties.
6. Click Permissions .
7. To grant permission to users or groups, select each identity, then click Grant .
8. To revoke permissions from users or groups, deselect each identity, then click Revoke .
SQL, Python, R, Scala
In the following examples, replace the placeholder values:
<principal> : The email address of the account-level user or the name of the account-level group to whom to
grant the permission.
<location_name> : The name of the external location that authorizes reading from and writing to the
storage container path in your cloud tenant.
To show grants on an external location, use a command like the following. You can optionally filter the results to
show only the grants for the specified principal.
SQL
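A sketch of the typical statement; include the principal to filter the results:

SHOW GRANTS ON EXTERNAL LOCATION <location_name>;
SHOW GRANTS `<principal>` ON EXTERNAL LOCATION <location_name>;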
Python
library(SparkR)
Scala
Python
library(SparkR)
Scala
Python
library(SparkR)
NOTE
If a group name contains a space, use back-ticks around it (not apostrophes).
The FORCE option deletes the external location even if external tables depend upon it.
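A sketch of the typical statement; the portions in brackets are optional:

DROP EXTERNAL LOCATION [IF EXISTS] <location_name> [FORCE];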
Python
Run the following command in a notebook. Items in brackets are optional. Replace <location_name> with the
name of the external location.
The FORCE option deletes the external location even if external tables depend upon it.
R
Run the following command in a notebook. Items in brackets are optional. Replace <location_name> with the
name of the external location.
The FORCE option deletes the external location even if external tables depend upon it.
library(SparkR)
Scala
Run the following command in a notebook. Items in brackets are optional. Replace <location_name> with the
name of the external location.
The FORCE option deletes the external location even if external tables depend upon it.
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
To query data in a table or view, the user must have the USAGE permission on the parent catalog and
schema and the SELECT permission on the table or view.
NOTE
To read from a view on a cluster with single-user security mode, the user must have SELECT on all referenced
tables and views.
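In SQL, a sketch of the equivalent query (run in a notebook or the Databricks SQL editor) is simply:

SELECT * FROM <table_name>;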
Python
display(spark.table("<table_name>"))
library(SparkR)
display(tableToDF("<table_name>"))
Scala
display(spark.table("<table_name>"))
Python
display(spark.table("<catalog_name>.<schema_name>.<table_name>"))
library(SparkR)
display(tableToDF("<catalog_name>.<schema_name>.<table_name>"))
Scala
display(spark.table("<catalog_name>.<schema_name>.<table_name>"))
Using three-level namespace simplifies querying data in multiple catalogs and schemas.
You can also use three-level namespace notation for data in the Hive metastore by setting <catalog_name> to
hive_metastore .
NOTE
To read from a view from a cluster with single-user security mode, the user must have SELECT on all referenced
tables and views.
LIST 'abfss://<path_to_files>';
Python
display(spark.sql("LIST 'abfss://<path_to_files>'"))
library(SparkR)
display(sql("LIST 'abfss://<path_to_files>'"))
Scala
display(spark.sql("LIST 'abfss://<path_to_files>'"))
If you have the READ FILES permission on the external location associated with the cloud storage path, a
list of data files in that location is returned.
2. Query the data in the files in a given path:
SQL
Python
display(spark.read.load("abfss://<path_to_files>"))
library(SparkR)
display(loadDF("abfss://<path_to_files>"))
Scala
display(spark.read.load("abfss://<path_to_files>"))
To explore data using a storage credential directly:
SQL
Python
library(SparkR)
Scala
Next steps
Manage access to data
Train a machine-learning model with Python from
data in Unity Catalog
7/21/2022 • 3 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Unity Catalog allows you to apply fine-grained security to tables and to securely access them from any
language, all while interacting seamlessly with other machine-learning components in Azure Databricks. This
article shows how to use Python to train a machine-learning model using data in Unity Catalog.
Requirements
Your Azure Databricks account must be on the Premium plan.
You must be an account admin or the metastore admin for the metastore you use to train the model.
2. In the Data Science & Engineering or Databricks Machine Learning persona, click Compute .
3. Click Create cluster .
a. Enter a name for the cluster.
b. For Databricks runtime version :
a. Click ML .
b. Select either 10.3 ML (Scala 2.12, Spark 3.2.1) or higher, or 10.3 ML (GPU, Scala 2.12,
Spark 3.2.1) or higher.
4. Click Advanced Options . Set Security Mode to User Isolation or Single User . To run Python code,
you must use Single User .
User Isolation clusters can be shared by multiple users, but only SQL workloads are supported. Some
advanced cluster features such as library installation, init scripts, and the DBFS FUSE mount are also
disabled to ensure security isolation among cluster users.
To use those advanced cluster features, or to run workloads in Python, Scala, or R, set
the cluster mode to Single User. A Single User cluster can also run SQL workloads. The cluster can be used
exclusively by a single user (by default, the owner of the cluster); other users cannot attach to the cluster.
Automated jobs should run in this mode, and the job’s owner should be the cluster’s owner. In this mode,
view security cannot be enforced: a user selecting from a view executes with their own permissions.
For more information about the features available in each security mode, see Cluster security mode.
5. Click Create Cluster .
When you create a catalog, a schema named default is automatically created within it.
4. Grant access to the ml catalog and the ml.default schema, and the ability to create tables and views, to
the ml_team group. To include all account level users, you could use the group account users .
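A sketch of the grant statements for this step, assuming the catalog, schema, and group names used in this walkthrough:

GRANT USAGE ON CATALOG ml TO `ml_team`;
GRANT USAGE, CREATE ON SCHEMA ml.default TO `ml_team`;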
Now, any user in the ml_team group can run the following example notebook.
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
This article shows how to connect Unity Catalog to business intelligence (BI) tools.
Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
If necessary, create a metastore.
JDBC
To use Unity Catalog over JDBC, download the Simba JDBC driver version 2.6.21 or above.
ODBC
To use Unity Catalog over ODBC, download the Simba ODBC driver version 2.6.19 or above.
Tableau
To use Unity Catalog data in Tableau, use Tableau Desktop 2021.4 with the Simba ODBC driver version 2.6.19 or
above. See Tableau.
For more information, see Databricks ODBC and JDBC drivers.
Audit access and activity for Unity Catalog
resources
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
This article shows how you can audit Unity Catalog access and activity.
Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
If necessary, create a metastore.
Link the metastore to the workspace in which you will process the diagnostic logs.
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
To take advantage of Unity Catalog’s access control and auditing mechanisms, and to share data to multiple
workspaces, you can upgrade tables and views to Unity Catalog.
Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
If necessary, create a metastore.
If necessary, create catalogs and schemas in the metastore. The catalogs and schemas will contain the new
tables and views.
SQL
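A sketch of the typical CREATE TABLE ... AS SELECT statement, using the same placeholders as the examples below:

CREATE TABLE <catalog>.<new_schema>.<new_table>
AS SELECT * FROM hive_metastore.<old_schema>.<old_table>;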
Python
df = spark.table("hive_metastore.<old_schema>.<old_table>")
df.write.saveAsTable(
name = "<catalog>.<new_schema>.<new_table>"
)
R
%r
library(SparkR)
df = tableToDF("hive_metastore.<old_schema>.<old_table>")
saveAsTable(
df = df,
tableName = "<catalog>.<new_schema>.<new_table>"
)
Scala
val df = spark.table("hive_metastore.<old_schema>.<old_table>")
df.write.saveAsTable(
tableName = "<catalog>.<new_schema>.<new_table>"
)
If you want to migrate only some columns or rows, modify the SELECT statement.
NOTE
This command creates a managed table in which data is copied into the storage location that was nominated
when the metastore was set up. To create an external table, where a table is registered in Unity Catalog without
moving the data in cloud storage, see Upgrade an external table to Unity Catalog.
4. Grant account-level users or groups access to the new table. See Manage access to data.
5. After the table is migrated, users should update their existing queries and workloads to use the new table.
6. Before you drop the old table, test for dependencies by revoking access to it and re-running related
queries and workloads.
Upgrade process
To upgrade an external table:
1. If you are not already in Databricks SQL, use the persona switcher in the sidebar to select SQL .
NOTE
If you no longer need the old table, you can drop it from the Hive Metastore. Dropping an external table does not
modify the data files on your cloud tenant.
NOTE
If your view also references other views, upgrade those views first.
After you upgrade the view, grant access to it to account-level users and groups.
Before you drop the old view, test for dependencies by revoking access to it and re-running related queries and
workloads.
Upgrade process
1. If you are not already in Databricks SQL, use the persona switcher in the sidebar to select SQL .
NOTE
If you no longer need the old tables, you can drop them from the Hive Metastore. Dropping an external table does not
modify the data files on your cloud tenant.
Automate Unity Catalog setup using Terraform
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Requirements
In Azure Databricks, you must be an account admin.
Your Azure Databricks account must be on the Premium plan.
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
During the public preview, Unity Catalog has the following limitations:
Python, Scala, and R workloads are supported only on clusters that use the Single User security mode.
Workloads in these languages do not support the use of dynamic views for row-level or column-level
security.
Unity Catalog can be used together with the built-in Hive metastore provided by Databricks. You can’t use
Unity Catalog with external Hive metastores that require configuration using init scripts.
Unity Catalog does not manage partitions like Hive does. Unity Catalog managed tables are, by definition,
Delta tables, for which partition metadata is tracked in the Delta log. External non-Delta tables may be
partitioned in storage; Unity Catalog does not manage the partitions for such tables, which means that you
cannot partition objects for these tables from Unity Catalog.
Overwrite mode for DataFrame write operations into Unity Catalog is supported only for managed Delta
tables and not for other cases, such as external tables. In addition, the user must have the CREATE privilege
on the parent schema and must be the owner of the existing object.
Support for streaming workloads against Unity Catalog is in Private Preview, and there is no support for:
Using external locations as the source or destination for streams.
Storing streaming metadata in external locations. This metadata includes streaming checkpoints and
schema location for the cloud_files source.
Using Python or R to write streaming queries. Only Scala and Java are supported in Private Preview.
The following Delta Lake features aren’t supported:
Low shuffle merge
Sampling
Change data feed
Bloom filters
Table access control
7/21/2022 • 2 minutes to read
Table access control lets you programmatically grant and revoke access to your data from Python and SQL.
By default, all users have access to all data stored in a cluster’s managed tables unless table access control is
enabled for that cluster. Once table access control is enabled, users can set permissions for data objects on that
cluster.
Requirements
This feature requires the Premium Plan.
This feature requires a Data Science & Engineering cluster with an appropriate configuration or a Databricks
SQL endpoint.
This section covers:
Enable table access control for a cluster
Data object privileges
Enable table access control for a cluster
7/21/2022 • 3 minutes to read
NOTE
This article contains references to the term whitelist, a term that Azure Databricks does not use. When the term is
removed from the software, we’ll remove it from this article.
This article describes how to enable table access control for a cluster.
For information about how to set privileges on a data object once table access control has been enabled on a
cluster, see Data object privileges.
IMPORTANT
Even if table access control is enabled for a cluster, Azure Databricks administrators have access to file-level data.
spark.databricks.acl.sqlOnly true
NOTE
Access to SQL-only table access control is not affected by the Enable Table Access Control setting in the admin console.
That setting controls only the workspace-wide enablement of Python and SQL table access control.
To create the cluster using the REST API, see Create cluster enabled for table access control example.
The Azure Databricks data governance model lets you programmatically grant, deny, and revoke access to your
data from Spark SQL. This model lets you control access to securable objects like catalogs, schemas (databases),
tables, views, and functions. It also allows for fine-grained access control (to a particular subset of a table, for
example) by setting privileges on derived views created from arbitrary queries. The Azure Databricks SQL query
analyzer enforces these access control policies at runtime on Azure Databricks clusters with table access control
enabled and all SQL warehouses.
This article describes the privileges, objects, and ownership rules that make up the Azure Databricks data
governance model. It also describes how to grant, deny, and revoke object privileges.
Requirements
The requirements for managing object privileges depend on your environment:
Databricks Data Science & Engineering and Databricks Machine Learning
Databricks SQL
Databricks Data Science & Engineering and Databricks Machine Learning
An administrator must enable and enforce table access control for the workspace.
The cluster must be enabled for table access control.
Databricks SQL
See Admin quickstart requirements.
NOTE
ANONYMOUS FUNCTION objects are not supported in Databricks SQL.
Privileges
SELECT : gives read access to an object.
CREATE : gives ability to create an object (for example, a table in a schema).
MODIFY : gives ability to add, delete, and modify data to or from an object.
USAGE : does not give any abilities, but is an additional requirement to perform any action on a schema
object.
READ_METADATA : gives ability to view an object and its metadata.
CREATE_NAMED_FUNCTION : gives ability to create a named UDF in an existing catalog or schema.
MODIFY_CLASSPATH : gives ability to add files to the Spark class path.
ALL PRIVILEGES : gives all privileges (is translated into all the above privileges).
NOTE
The MODIFY_CLASSPATH privilege is not supported in Databricks SQL.
USAGE privilege
To perform an action on a schema object, a user must have the USAGE privilege on that schema in addition to
the privilege to perform that action. Any one of the following satisfies the USAGE requirement:
Be an admin
Have the USAGE privilege on the schema or be in a group that has the USAGE privilege on the schema
Have the USAGE privilege on the CATALOG or be in a group that has the USAGE privilege
Be the owner of the schema or be in a group that owns the schema
Even the owner of an object inside a schema must have the USAGE privilege in order to use it.
As an example, an administrator could define a finance group and an accounting schema for them to use. To
set up a schema that only the finance team can use and share, an admin would do the following:
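A sketch of the statements an admin might run, assuming the accounting schema already exists:

GRANT USAGE ON SCHEMA accounting TO `finance`;
GRANT CREATE ON SCHEMA accounting TO `finance`;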
With these privileges, members of the finance group can create tables and views in the accounting schema,
but can’t share those tables or views with any principal that does not have USAGE on the accounting schema.
Databricks Data Science & Engineering and Databricks Runtime version behavior
Clusters running Databricks Runtime 7.3 LTS and above enforce the USAGE privilege.
Clusters running Databricks Runtime 7.2 and below do not enforce the USAGE privilege.
To ensure that existing workloads function unchanged, workspaces that used table access control before
USAGE was introduced have had the USAGE privilege on CATALOG granted to the users group. If you want to
take advantage of the USAGE privilege, you must run REVOKE USAGE ON CATALOG FROM users and then
GRANT USAGE ... as needed.
Privilege hierarchy
When table access control is enabled on the workspace and on all clusters, SQL objects in Azure Databricks are
hierarchical and privileges are inherited downward. This means that granting or denying a privilege on the
CATALOG automatically grants or denies the privilege to all schemas in the catalog. Similarly, privileges granted
on a schema object are inherited by all objects in that schema. This pattern is true for all securable objects.
If you deny a user privileges on a table, the user can’t see the table by attempting to list all tables in the schema.
If you deny a user privileges on a schema, the user can’t see that the schema exists by attempting to list all
schemas in the catalog.
Object ownership
When table access control is enabled on a cluster or SQL warehouse, a user who creates a schema, table, view,
or function becomes its owner. The owner is granted all privileges and can grant privileges to other users.
Groups may own objects, in which case all members of that group are considered owners.
Ownership determines whether or not you can grant privileges on derived objects to other users. For example,
suppose user A owns table T and grants user B SELECT privilege on table T. Even though user B can select from
table T, user B cannot grant SELECT privilege on table T to user C, because user A is still the owner of the
underlying table T. Furthermore, user B cannot circumvent this restriction simply by creating a view V on table T
and granting privileges on that view to user C. When Azure Databricks checks for privileges for user C to access
view V, it also checks that the owner of V and underlying table T are the same. If the owners are not the same,
user C must also have SELECT privileges on underlying table T.
When table access control is disabled on a cluster, no owner is registered when a schema, table, view, or function
is created. To test if an object has an owner, run SHOW GRANTS ON <object-name> . If you do not see an entry with
ActionType OWN , the object does not have an owner.
NOTE
You must enclose user specifications in backticks ( ` ` ), not single quotes ( ' ' ).
CREATE TABLE : Either OWN or both USAGE and CREATE on the schema.
CREATE VIEW : Either OWN or both USAGE and CREATE on the schema.
SHOW GRANTS : OWN on the object, or being the user subject to the grant.
IMPORTANT
When you use table access control, DROP TABLE statements are case sensitive. If a table name is lower case and the
DROP TABLE references the table name using mixed or upper case, the DROP TABLE statement will fail.
Examples
NOTE
Available in Databricks Runtime 7.3 LTS and above. However, to use these functions in Databricks Runtime 7.3 LTS, you
must set the Spark config spark.databricks.userInfoFunctions.enabled true .
Consider the following example, which combines both functions to determine if a user has the appropriate
group membership:
-- Return: true if the user is a member and false if they are not
SELECT
current_user as user,
-- Check to see if the current user is a member of the "Managers" group.
is_member("Managers") as admin
Allowing administrators to set fine granularity privileges for multiple users and groups within a single view is
both expressive and powerful, while saving on administration overhead.
Column-level permissions
Through dynamic views it’s easy to limit what columns a specific group or user can see. Consider the following
example where only users who belong to the auditors group are able to see email addresses from the
sales_rawtable. At analysis time Spark replaces the CASE statement with either the literal 'REDACTED' or the
column email . This behavior allows for all the usual performance optimizations provided by Spark.
Row-level permissions
Using dynamic views you can specify permissions down to the row or field level. Consider the following
example, where only users who belong to the managers group are able to see transaction amounts ( total
column) greater than $1,000,000.00:
Data masking
As shown in the preceding examples, you can implement column-level masking to prevent users from seeing
specific column data unless they are in the correct group. Because these views are standard Spark SQL, you can
do more advanced types of masking with more complex SQL expressions. The following example lets all users
perform analysis on email domains, but lets members of the auditors group see users’ full email addresses.
The principal <user>@<domain-name> can select from tables t1 and t2, as well as any tables and views created in
schema D in the future.
How do I grant a user privileges on all tables except one?
You grant SELECT privilege to the schema and then deny SELECT privilege for the specific table you want to
restrict access to.
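For example, a sketch of this pattern for a schema D and table D.T:

GRANT SELECT ON SCHEMA D TO `<user>@<domain-name>`;
DENY SELECT ON TABLE D.T TO `<user>@<domain-name>`;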
The principal <user>@<domain-name> can select from all tables in D except D.T.
A user has SELECT privileges on a view of table T, but when that user tries to SELECT from that view, they get
the error User does not have privilege SELECT on table .
This common error can occur for one of the following reasons:
Table T has no registered owner because it was created using a cluster or SQL warehouse for which table
access control is disabled.
The grantor of the SELECT privilege on a view of table T is not the owner of table T or the user does not also
have the SELECT privilege on table T.
privilege_assignments {
principal = "serge@example.com"
privileges = ["SELECT", "MODIFY"]
}
privilege_assignments {
principal = "special group"
privileges = ["SELECT"]
}
}
Access Azure Data Lake Storage using Azure Active
Directory credential passthrough
7/21/2022 • 11 minutes to read
NOTE
This article contains references to the term whitelisted, a term that Azure Databricks does not use. When the term is
removed from the software, we’ll remove it from this article.
You can authenticate automatically to Azure Data Lake Storage Gen1 (ADLS Gen1) and Azure Data Lake Storage
Gen2 (ADLS Gen2) from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity
that you use to log in to Azure Databricks. When you enable Azure Data Lake Storage credential passthrough for
your cluster, commands that you run on that cluster can read and write data in Azure Data Lake Storage without
requiring you to configure service principal credentials for access to storage.
Azure Data Lake Storage credential passthrough is supported with Azure Data Lake Storage Gen1 and Gen2
only. Azure Blob storage does not support credential passthrough.
This article covers:
Enabling credential passthrough for standard and high-concurrency clusters.
Configuring credential passthrough and initializing storage resources in ADLS accounts.
Accessing ADLS resources directly when credential passthrough is enabled.
Accessing ADLS resources through a mount point when credential passthrough is enabled.
Supported features and limitations when using credential passthrough.
Notebooks are included to provide examples of using credential passthrough with ADLS Gen1 and ADLS Gen2
storage accounts.
Requirements
Premium Plan. See Upgrade or Downgrade an Azure Databricks Workspace for details on upgrading a
standard plan to a premium plan.
An Azure Data Lake Storage Gen1 or Gen2 storage account. Azure Data Lake Storage Gen2 storage accounts
must use the hierarchical namespace to work with Azure Data Lake Storage credential passthrough. See
Create a storage account for instructions on creating a new ADLS Gen2 account, including how to enable the
hierarchical namespace.
Properly configured user permissions to Azure Data Lake Storage. An Azure Databricks administrator needs
to ensure that users have the correct roles, for example, Storage Blob Data Contributor, to read and write data
stored in Azure Data Lake Storage. See Use the Azure portal to assign an Azure role for access to blob and
queue data.
You cannot use a cluster configured with ADLS credentials, for example, service principal credentials, with
credential passthrough.
IMPORTANT
You cannot authenticate to Azure Data Lake Storage with your Azure Active Directory credentials if you are behind a
firewall that has not been configured to allow traffic to Azure Active Directory. Azure Firewall blocks Active Directory
access by default. To allow access, configure the AzureActiveDirectory service tag. You can find equivalent information for
network virtual appliances under the AzureActiveDirectory tag in the Azure IP Ranges and Service Tags JSON file. For
more information, see Azure Firewall service tags and Azure IP Addresses for Public Cloud.
Logging recommendations
You can log identities passed through to ADLS storage in the Azure storage diagnostic logs. Logging identities
allows ADLS requests to be tied to individual users from Azure Databricks clusters. Turn on diagnostic logging
on your storage account to start receiving these logs:
Azure Data Lake Storage Gen1: Follow the instructions in Enable diagnostic logging for your Data Lake
Storage Gen1 account.
Azure Data Lake Storage Gen2: Configure using PowerShell with the Set-AzStorageServiceLoggingProperty
command. Specify 2.0 as the version, because log entry format 2.0 includes the user principal name in the
request.
IMPORTANT
Enabling Azure Data Lake Storage credential passthrough for a High Concurrency cluster blocks all ports on the cluster
except for ports 44, 53, and 80.
IMPORTANT
The user assigned to the cluster must have at least Can Attach To permission for the cluster in order to run commands
on the cluster. Admins and the cluster creator have Can Manage permissions, but cannot run commands on the cluster
unless they are the designated cluster user.
Create a container
Containers provide a way to organize objects in an Azure storage account.
Access Azure Data Lake Storage directly using credential passthrough
After configuring Azure Data Lake Storage credential passthrough and creating storage containers, you can
access data directly in Azure Data Lake Storage Gen1 using an adl:// path and Azure Data Lake Storage Gen2
using an abfss:// path.
Azure Data Lake Storage Gen1
Python
spark.read.format("csv").load("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv").collect()
# SparkR
library(SparkR)
sparkR.session()
collect(read.df("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv", source = "csv"))
# sparklyr
library(sparklyr)
sc <- spark_connect(method = "databricks")
sc %>% spark_read_csv("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv") %>% sdf_collect()
Azure Data Lake Storage Gen2
Python
spark.read.format("csv").load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv").collect()
# SparkR
library(SparkR)
sparkR.session()
collect(read.df("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv", source = "csv"))
# sparklyr
library(sparklyr)
sc <- spark_connect(method = "databricks")
sc %>% spark_read_csv("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv") %>% sdf_collect()
Replace <container-name> with the name of a container in the ADLS Gen2 storage account.
Replace <storage-account-name> with the ADLS Gen2 storage account name.
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "adl://<storage-account-name>.azuredatalakestore.net/<directory-name>",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
Scala
// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "adl://<storage-account-name>.azuredatalakestore.net/<directory-name>",
mountPoint = "/mnt/<mount-name>",
extraConfigs = configs)
configs = {
"fs.azure.account.auth.type": "CustomAccessToken",
"fs.azure.account.custom.token.provider.class":
spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
Scala
// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mountPoint = "/mnt/<mount-name>",
extraConfigs = configs)
Replace <container-name> with the name of a container in the ADLS Gen2 storage account.
Replace <storage-account-name> with the ADLS Gen2 storage account name.
Replace <mount-name> with the name of the intended mount point in DBFS.
WARNING
Do not provide your storage account access keys or service principal credentials to authenticate to the mount point. That
would give other users access to the filesystem using those credentials. The purpose of Azure Data Lake Storage
credential passthrough is to prevent you from having to use those credentials and to ensure that access to the filesystem
is restricted to users who have access to the underlying Azure Data Lake Storage account.
Security
It is safe to share Azure Data Lake Storage credential passthrough clusters with other users. You will be isolated
from each other and will not be able to read or use each other’s credentials.
Supported features
FEATURE / MINIMUM DATABRICKS RUNTIME VERSION / NOTES
%run 5.5
* org/apache/spark/ml/classification/RandomForestClassifier
* org/apache/spark/ml/clustering/BisectingKMeans
* org/apache/spark/ml/clustering/GaussianMixture
* org/apache/spark/ml/clustering/KMeans
* org/apache/spark/ml/clustering/LDA
* org/apache/spark/ml/evaluation/ClusteringEvaluator
* org/apache/spark/ml/feature/HashingTF
* org/apache/spark/ml/feature/OneHotEncoder
* org/apache/spark/ml/feature/StopWordsRemover
* org/apache/spark/ml/feature/VectorIndexer
* org/apache/spark/ml/feature/VectorSizeHint
* org/apache/spark/ml/regression/IsotonicRegression
* org/apache/spark/ml/regression/RandomForestRegressor
* org/apache/spark/ml/util/DatasetUtils
Scala 5.5
SparkR 6.0
sparklyr 10.1
Ganglia UI 6.1
Limitations
The following features are not supported with Azure Data Lake Storage credential passthrough:
%fs (use the equivalent dbutils.fs command instead).
The Databricks REST API.
Table access control. The permissions granted by Azure Data Lake Storage credential passthrough could be
used to bypass the fine-grained permissions of table ACLs, while the extra restrictions of table ACLs will
constrain some of the benefits you get from credential passthrough. In particular:
If you have Azure AD permission to access the data files that underlie a particular table you will have
full permissions on that table via the RDD API, regardless of the restrictions placed on them via table
ACLs.
You will be constrained by table ACLs permissions only when using the DataFrame API. You will see
warnings about not having permission SELECT on any file if you try to read files directly with the
DataFrame API, even though you could read those files directly via the RDD API.
You will be unable to read from tables backed by filesystems other than Azure Data Lake Storage, even
if you have table ACL permission to read the tables.
The following methods on SparkContext ( sc ) and SparkSession ( spark ) objects:
Deprecated methods.
Methods such as addFile() and addJar() that would allow non-admin users to call Scala code.
Any method that accesses a filesystem other than Azure Data Lake Storage Gen1 or Gen2 (to access
other filesystems on a cluster with Azure Data Lake Storage credential passthrough enabled, use a
different method to specify your credentials and see the section on trusted filesystems under
Troubleshooting).
The old Hadoop APIs ( hadoopFile() and hadoopRDD() ).
Streaming APIs, since the passed-through credentials would expire while the stream was still running.
The FUSE mount ( /dbfs ) is available only in Databricks Runtime 7.3 LTS and above. Mount points with
credential passthrough configured are not supported through the FUSE mount.
Azure Data Factory.
MLflow on high concurrency clusters.
azureml-sdk Python package on high concurrency clusters.
You cannot extend the lifetime of Azure Active Directory passthrough tokens using Azure Active Directory
token lifetime policies. As a consequence, if you send a command to the cluster that takes longer than an
hour, it will fail if an Azure Data Lake Storage resource is accessed after the 1 hour mark.
When using Hive 2.3 and above you can’t add a partition on a cluster with credential passthrough enabled.
For more information, see the relevant troubleshooting section.
Example notebooks
The following notebooks demonstrate Azure Data Lake Storage credential passthrough for Azure Data Lake
Storage Gen1 and Gen2.
Azure Data Lake Storage Gen1 passthrough notebook
Get notebook
Azure Data Lake Storage Gen2 passthrough notebook
Get notebook
Troubleshooting
py4j.security.Py4JSecurityException: … is not whitelisted
This exception is thrown when you have accessed a method that Azure Databricks has not explicitly marked as
safe for Azure Data Lake Storage credential passthrough clusters. In most cases, this means that the method
could allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user’s
credentials.
org.apache.spark.api.python.PythonSecurityException: Path … uses an untrusted filesystem
This exception is thrown when you have tried to access a filesystem that is not known by the Azure Data Lake
Storage credential passthrough cluster to be safe. Using an untrusted filesystem might allow a user on an Azure
Data Lake Storage credential passthrough cluster to access another user’s credentials, so we disallow all
filesystems that we are not confident are being used safely.
To configure the set of trusted filesystems on an Azure Data Lake Storage credential passthrough cluster, set the
Spark conf key spark.databricks.pyspark.trustedFilesystems on that cluster to be a comma-separated list of the
class names that are trusted implementations of org.apache.hadoop.fs.FileSystem .
Adding a partition fails with AzureCredentialNotFoundException when credential passthrough is enabled
When using Hive 2.3-3.1, if you try to add a partition on a cluster with credential passthrough enabled, the
following exception occurs:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could
not find ADLS Gen2 Token
To work around this issue, add partitions on a cluster without credential passthrough enabled.
Data sharing guide
7/21/2022 • 2 minutes to read
This guide shows how you can use Delta Sharing to share data in Azure Databricks with recipients outside your
organization.
Delta Sharing is an open protocol developed by Databricks for secure data sharing with other organizations
regardless of which computing platforms they use. Delta Sharing is available for data in a Unity Catalog
metastore.
Unity Catalog (Preview) is a secure metastore developed by Databricks. Unity Catalog centralizes metadata and
governance of an organization’s data. With Unity Catalog, data governance rules scale with your needs,
regardless of the number of workspaces or the business intelligence tools your organization uses. See Get
started using Unity Catalog.
To share data using Delta Sharing:
1. You load the data into a Unity Catalog metastore.
You can create new tables and insert records into them, or you can import existing tables into Unity
Catalog from a workspace’s local Hive metastore.
2. You enable Delta Sharing on the metastore.
3. You create shares and recipients. Shares and recipients are Delta Sharing objects.
A share is a read-only collection of tables and table partitions to be shared with one or more
recipients. A metastore can have multiple shares, and you can control which recipients have access to
each share. A single metastore can contain multiple shares, but each share can belong to only one
metastore. If you remove a share, all recipients of that share lose the ability to access it.
A recipient is an object that associates an organization with a credential that allows the organization to
access one or more shares. When you create a recipient, a downloadable credential is generated for
that recipient. Each metastore can have multiple recipients, but each recipient can belong to only one
metastore. A recipient can have access to multiple shares. If you remove a recipient, that recipient
loses access to all shares it could previously access.
4. After creating a recipient and granting the recipient access to shares, use a secure channel to
communicate with the recipient, and share with them the unique URL where they can download the
credential.
A credential can be downloaded only one time. Databricks recommends the use of a password manager
for storing and sharing a downloaded credential.
Also share with them the documentation for Delta Sharing data recipients. They can use this
documentation to access the data you share with them.
5. At any time, you can modify the contents of a share, modify the shares to which a recipient has access, or
drop a share or a recipient.
6. Data recipients have immediate read-only access to the live, up-to-date data you share with them.
7. A data provider can enable audit logs for Delta Sharing to understand who is creating shares and
recipients and which recipients are accessing which shares.
8. A data recipient who uses Azure Databricks to access Delta Sharing data can also enable audit logs to
understand who is accessing which Delta Sharing data.
In this guide:
Share data using Delta Sharing (Preview)
Access data shared with you using Delta Sharing
Delta Sharing IP access list guide
Share data using Delta Sharing (Preview)
7/21/2022 • 16 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
This article shows how to share data in Unity Catalog with data recipients outside your organization. If you are a
data recipient, see Access data shared with you using Delta Sharing.
The shared data is not provided by Databricks directly but by data providers running on Databricks.
NOTE
By accessing a data provider’s shared data as a data recipient, data recipient represents that it has been authorized to
access the data share(s) provided to it by the data provider and acknowledges that (1) Databricks has no liability for such
data or data recipient’s use of such shared data, and (2) Databricks may collect information about data recipient’s use of
and access to the shared data (including identifying any individual or company who accesses the data using the credential
file in connection with such information) and may share it with the applicable data provider.
Requirements
Unity Catalog must be enabled, and at least one metastore must exist.
Only an account admin can enable Delta Sharing for a metastore.
Only a metastore admin or account admin can share data using Delta Sharing. See Metastore admin.
To rotate a recipient’s credential, you must use the Unity Catalog CLI. See (Optional) Install the Unity Catalog
CLI.
To manage shares and recipients, you can use a Data Science & Engineering notebook or a Databricks SQL
query.
Key concepts
In Delta Sharing, a share is a read-only collection of tables and table partitions to be shared with one or more
recipients. A metastore can have multiple shares, and you can control which recipients have access to each
share. A single metastore can contain multiple shares, but each share can belong to only one metastore. If you
remove a share, all recipients of that share lose the ability to access it.
A recipient is an object that associates an organization with a credential that allows the organization to access
one or more shares. When you create a recipient, a downloadable credential is generated for that recipient. Each
metastore can have multiple recipients, but each recipient can belong to only one metastore. A recipient can
have access to multiple shares. If you remove a recipient, that recipient loses access to all shares it could
previously access.
Shares and recipients exist independent of each other.
Manage shares
In Delta Sharing, a share is a named object that contains a collection of tables in a metastore that you want to
share as a group. A share can contain tables from only a single metastore. You can add or remove tables from a
share at any time.
The following sections show how to create, describe, update and delete shares.
Create a share
To create a share, run the following command in a notebook or the Databricks SQL editor. Replace the
placeholder values:
<share_name> : a descriptive name for the share.
<comment> : a comment describing the share.
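A typical command looks like the following sketch (standard Unity Catalog SQL; the IF NOT EXISTS and COMMENT clauses are optional):
CREATE SHARE [IF NOT EXISTS] <share_name>
COMMENT "<comment>";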
After you create a share, you can add tables to it and associate it with one or more recipients.
List shares
To list all shares, use the SHOW SHARES command.
SHOW SHARES;
Describe a share
To list a share’s metadata and all tables associated with a share, use the DESCRIBE SHARE command. Replace
<share_name> with the name of the share:
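A typical invocation:
DESCRIBE SHARE <share_name>;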
To associate a table with the share, use the ALTER SHARE ADD TABLE command.
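For example, the following sketch adds a table and, optionally, exposes it in the share under a different name (the alias placeholder is illustrative):
ALTER SHARE <share_name>
ADD TABLE <schema_name>.<table_name>
AS <schema_name>.<shared_table_name>;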
NOTE
You can provide either the original schema and table name in the metastore or the names defined in the share.
When you add or remove tables from a share, the change takes effect the next time a recipient accesses the
share.
Partition specifications
To share only part of a table when adding the table to a share, you can provide a partition specification. The
following example shares part of the data in the inventory table, given that the table is partitioned by year ,
month , and date columns.
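A sketch of such a statement (the partition values shown are illustrative):
ALTER SHARE <share_name>
ADD TABLE <schema_name>.inventory
PARTITION (year = "2021", month = "12", date LIKE "2021-12-2%");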
Delete a share
To delete a share, use the DROP SHARE command. Recipients can no longer access data that was previously
shared. Replace <share_name> with the name of the share.
DROP SHARE [IF EXISTS] <share_name>;
Manage recipients
A recipient is a named set of credentials that represents an organization with whom to share data. This section
shows how to manage recipients in Delta Sharing.
Create a recipient
Use the CREATE RECIPIENT command to create a recipient. Replace the placeholder values:
<recipient_name> : A descriptive name for the recipient.
<comment> : A comment with more information.
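A typical command looks like the following sketch (IF NOT EXISTS and COMMENT are optional):
CREATE RECIPIENT [IF NOT EXISTS] <recipient_name>
COMMENT "<comment>";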
The recipient’s token will expire after the recipient token lifetime has elapsed. For more information, see Security
recommendations for recipients.
After creating a recipient:
1. Use the DESCRIBE command to get their activation link.
2. Use a secure channel to share the activation link with them, along with the article showing how to access
shared data. The activation link can be accessed only a single time. Recipients should treat the downloaded
credential as a secret and must not share it outside of their organization. If necessary, you can rotate a
recipient’s credential.
3. Grant them access to shares.
List recipients
The SHOW RECIPIENTS command lists all recipients. Optionally, replace <pattern> with a LIKE predicate.
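For example (the LIKE clause is optional):
SHOW RECIPIENTS [LIKE '<pattern>'];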
Describe a recipient
To view details about a recipient, including its creator, creation timestamp, token lifetime, activation link, and
whether the credential has been downloaded, use the DESCRIBE RECIPIENT command. Replace <recipient_name>
with the name of the recipient.
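For example:
DESCRIBE RECIPIENT <recipient_name>;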
3. View the activation URL by using the DESCRIBE RECIPIENT <recipient_name> command, and share it with
the recipient over a secure channel.
To grant access:
GRANT SELECT
ON SHARE <share_name>
TO RECIPIENT <recipient_name>;
To revoke access:
REVOKE SELECT
ON SHARE <share_name>
FROM RECIPIENT <recipient_name>;
NOTE
SELECT is the only privilege you can grant on a share.
IMPORTANT
Delta Sharing activity is logged at the level of the account. Do not enter a value into workspace_ids_filter .
Audit logs are delivered for each workspace in your account, as well as account-level activities. Logs are
delivered to the storage container you configure.
2. Events for Delta Sharing are logged with serviceName set to unityCatalog . The requestParams section of
each event includes a delta_sharing prefix.
For example, the following audit event shows an update to the recipient token lifetime. In this example,
redacted values are replaced with <redacted> .
{
  "version":"2.0",
  "auditLevel":"ACCOUNT_LEVEL",
  "timestamp":1629775584891,
  "orgId":"3049059095686970",
  "shardName":"example-workspace",
  "accountId":"<redacted>",
  "sourceIPAddress":"<redacted>",
  "userAgent":"curl/7.64.1",
  "sessionId":"<redacted>",
  "userIdentity":{
    "email":"<redacted>",
    "subjectName":null
  },
  "serviceName":"unityCatalog",
  "actionName":"updateMetastore",
  "requestId":"<redacted>",
  "requestParams":{
    "id":"<redacted>",
    "delta_sharing_enabled":"true",
    "delta_sharing_recipient_token_lifetime_in_seconds": 31536000
  },
  "response":{
    "statusCode":200,
    "errorMessage":null,
    "result":null
  },
  "MAX_LOG_MESSAGE_LENGTH":16384
}
The following table lists audited events for Delta Sharing, from the point of view of the data provider.
NOTE
The following important fields are always present in the audit log:
userIdentity.email : The ID of the user who initiated the activity.
requestParams.id : The ID of the Unity Catalog metastore.
ACTION NAME | REQUEST PARAMS
updateMetastore | delta_sharing_recipient_token_lifetime_in_seconds : If present, indicates that the recipient token lifetime was updated.
listRecipients | none
listShares | none
The following Delta Sharing errors are logged, from the point of view of the data provider. Items between <
and > characters represent placeholder text.
Delta Sharing is not enabled on the selected metastore.
DatabricksServiceException: FEATURE_DISABLED:
Delta Sharing is not enabled
DatabricksServiceException: CATALOG_DOES_NOT_EXIST:
Catalog 'xxx' does not exist.
A user who is not an account admin or metastore admin attempted to perform a privileged operation.
DatabricksServiceException: PERMISSION_DENIED:
Only administrators can <operation_name> <operation_target>
An operation was attempted on a metastore from a workspace to which the metastore is not assigned.
DatabricksServiceException: INVALID_STATE:
Workspace <workspace_name> is no longer assigned to this metastore
A user attempted to rotate a recipient that was already in a rotated state and whose previous token had
not yet expired.
DatabricksServiceException: RECIPIENT_ALREADY_EXISTS/SHARE_ALREADY_EXISTS: Recipient/Share <name>
already exists
A user attempted to create a new recipient or share with the same name as an existing one.
DatabricksServiceException: RECIPIENT_DOES_NOT_EXIST/SHARE_DOES_NOT_EXIST: Recipient/Share '<name>'
does not exist.
A user attempted to perform an operation on a recipient or share that does not exist.
DatabricksServiceException: RESOURCE_ALREADY_EXISTS: Shared Table '<name>' already exists.
A user attempted to add a table to a share, but the table had already been added.
DatabricksServiceException: TABLE_DOES_NOT_EXIST: Table '<name>' does not exist.
A user attempted to perform an operation that referenced a table that does not exist.
DatabricksServiceException: SCHEMA_DOES_NOT_EXIST: Schema '<name>' does not exist.
A user attempted to perform an operation that referenced a schema that does not exist.
For auditable events and errors for data recipients, see Audit access and activity for Delta Sharing resources.
Limitations
Only tables stored in a Unity Catalog metastore can be shared with Delta Sharing.
Only managed and external tables in Delta format are supported.
Sharing views is not supported in this preview.
Next steps
Learn how a recipient can access shares with Delta Sharing.
Learn more about Unity Catalog.
Access data shared with you using Delta Sharing
7/21/2022 • 11 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
This article shows data recipients how to access data shared from Delta Sharing using Databricks, Apache Spark,
and pandas.
NOTE
By accessing a data provider’s shared data as a data recipient, data recipient represents that it has been authorized to
access the data share(s) provided to it by the data provider and acknowledges that (1) Databricks has no liability for such
data or data recipient’s use of such shared data, and (2) Databricks may collect information about data recipient’s use of
and access to the shared data (including identifying any individual or company who accesses the data using the credential
file in connection with such information) and may share it with the applicable data provider.
IMPORTANT
You can download a credential file only one time. If you visit the activation link again after you have already downloaded
the credential file, the Download Credential File button is disabled.
Don’t further share the activation link or share the credential file you have received with anyone outside of your
organization.
If you lose the activation link before using it, contact the data provider.
1. Click the activation link shared with you by the data provider. The activation page opens in your browser.
2. Click Download Credential File .
Store the credential file in a secure location. If you need to share it with someone in your organization,
Databricks recommends using a password manager.
NOTE
Partner integrations are, unless otherwise noted, provided by the third parties and you must have an account with the
appropriate provider for the use of their products and services. While Databricks does its best to keep this content up to
date, we make no representation regarding the integrations or the accuracy of the content on the partner integration
pages. Reach out to the appropriate providers regarding the integrations.
Use Databricks
Follow these steps to access shared data in an Azure Databricks workspace using notebook commands. You store
the credential file in DBFS, then use it to authenticate to the data provider’s Azure Databricks account and read
the data the data provider shared with you.
In this example, you create a notebook with multiple cells you can run independently. You could instead add the
notebook commands to the same cell and run them in a sequence.
1. In a text editor, open the credential file you downloaded.
2. Click Workspace .
3. Right-click a folder, then click Create > Notebook .
Enter a name.
Set the default language for the notebook to Python. This is the default.
Select a cluster to attach to the notebook. Select a cluster that runs Databricks Runtime 8.4 or above
or a cluster with the Apache Spark connector library installed. For more information about installing
cluster libraries, see Libraries.
Click Create .
The notebook opens in the notebook editor.
4. To use Python or pandas to access the shared data, install the delta-sharing Python connector. In the
notebook editor, paste the following command:
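One common way to install it is with the %pip magic (the connector is published on PyPI as delta-sharing):
%pip install delta-sharing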
5. In the cell actions menu at the far right, click and select Run Cell , or press
shift+enter .
The delta-sharing Python library is installed in the cluster if it isn’t already installed.
6. In the notebook editor, click , and select Add Cell Below . In the new cell, paste the following
command, which uploads the contents of the credential file to a folder in DBFS. Replace the variables as
follows:
<dbfs-path> : the path to the folder where you want to save the credential file
<credential-file-contents> : the contents of the credential file.
The credential file contains JSON which defines three fields: shareCredentialsVersion , endpoint ,
and bearerToken .
%scala
dbutils.fs.put("<dbfs-path>/config.share","""
<credential-file-contents>
""")
7. In the cell actions menu at the far right, click and select Run Cell , or press
shift+enter .
After the credential file is uploaded, you can delete this cell. All workspace users can read the credential
file from DBFS, and the credential file is available in DBFS on all clusters and SQL warehouses. To delete
the cell, click x in the cell actions menu at the far right.
8. Using Python, list the tables in the share. In the notebook editor, click , and select Add Cell Below .
9. In the new cell, paste the following command. Replace <dbfs-path> with the path from the previous
command.
When the code runs, Python reads the credential file from DBFS on the cluster. DBFS is mounted using
FUSE at /dbfs/ .
import delta_sharing
client = delta_sharing.SharingClient(f"/dbfs/<dbfs-path>/config.share")
client.list_all_tables()
10. In the cell actions menu at the far right, click and select Run Cell , or press
shift+enter .
The result is an array of tables, along with metadata for each table. The following output shows two
tables:
If the output is empty or doesn’t contain the tables you expect, contact the data provider.
11. Using Scala, query a shared table. In the notebook editor, click , and select Add Cell Below . In the new
cell, paste the following command. When the code runs, the credential file is read from DBFS through the
JVM.
Replace the variables:
<profile_path> : the DBFS path of the credential file. For example, /<dbfs-path>/config.share .
<share_name> : the value of share= for the table.
<schema_name> : the value of schema= for the table.
<table_name> : the value of name= for the table.
%scala
spark.read.format("deltaSharing")
.load("<profile_path>#<share_name>.<schema_name>.<table_name>").limit(10);
12. In the cell actions menu at the far right, click and select Run Cell , or press
shift+enter .
Each time you load the shared table, you see fresh data from the source.
13. To query the shared data using SQL, you must first create a local table in the workspace from the shared
table, then query the local table.
The shared data is not stored or cached in the local table. Each time you query the local table, you see the
current state of the shared data.
In the notebook editor, click , and select Add Cell Below . In the new cell, paste the following
command.
Replace the variables:
<local_table_name> : the name of the local table.
<profile_path> : the location of the credential file.
<share_name> : the value of share= for the table.
<schema_name> : the value of schema= for the table.
<table_name> : the value of name= for the table.
%sql
DROP TABLE IF EXISTS <local_table_name>;
CREATE TABLE <local_table_name> USING deltaSharing LOCATION "<profile_path>#<share_name>.<schema_name>.<table_name>";
SELECT * FROM <local_table_name> LIMIT 10;
14. In the cell actions menu at the far right, click and select Run Cell , or press
shift+enter .
When you run the command, the shared data is queried directly. As a test, the table is queried and the
first 10 results are returned.
If the output is empty or doesn’t contain the data you expect, contact the data provider.
IMPORTANT
Delta Sharing activity is logged at the level of the account. Do not enter a value into workspace_ids_filter .
Audit logs are delivered for each workspace in your account, as well as account-level activities. Logs are
delivered to the storage container you configure.
2. Events for Delta Sharing are logged with serviceName set to unityCatalog . The requestParams section of
each event includes the following fields, which you can share with the data provider to help them
troubleshoot issues.
recipient_name: The name of the recipient in the data provider’s system.
metastore_id : The name of the metastore in the data provider’s system.
sourceIPAddress : The IP address where the request originated.
For example, the following audit event shows that a recipient successfully listed the shares that were
available to them. In this example, redacted values are replaced with <redacted> .
{
  "version":"2.0",
  "auditLevel":"ACCOUNT_LEVEL",
  "timestamp":1635235341950,
  "orgId":"0",
  "shardName":"<redacted>",
  "accountId":"<redacted>",
  "sourceIPAddress":"<redacted>",
  "userAgent":null,
  "sessionId":null,
  "userIdentity":null,
  "serviceName":"unityCatalog",
  "actionName":"deltaSharingListShares",
  "requestId":"ServiceMain-cddd3114b1b40003",
  "requestParams":{
    "metastore_id":"<redacted>",
    "options":"{}",
    "recipient_name":"<redacted>"
  },
  "response":{
    "statusCode":200,
    "errorMessage":null,
    "result":null
  },
  "MAX_LOG_MESSAGE_LENGTH":16384
}
The following table lists audited events for Delta Sharing, from the point of view of the data recipient.
ACTION | REQUEST PARAMS
The following Delta Sharing errors are logged, from the point of view of the data recipient. Items between <
and > characters represent placeholder text.
The user attempted to access a share they do not have permission to access.
DatabricksServiceException: PERMISSION_DENIED:
User does not have SELECT on Share <share_name>
The user attempted to access a table that does not exist in the share.
For auditable events and errors for data providers, see Audit access and activity for Delta Sharing resources.
Access shared data outside Azure Databricks
If you don’t use Azure Databricks, follow these instructions to access shared data.
Use Apache Spark
Follow these steps to access shared data in Apache Spark 3.x or above.
1. To access metadata related to the shared data, such as the list of tables shared with you, install the delta-
sharing Python connector:
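For example, install the connector from PyPI (package name delta-sharing):
pip install delta-sharing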
import delta_sharing
client = delta_sharing.SharingClient(f"<profile_path>/config.share")
client.list_all_tables()
The result is an array of tables, along with metadata for each table. The following output shows two
tables:
If the output is empty or doesn’t contain the tables you expect, contact the data provider.
4. Access shared data in Spark using Python:
delta_sharing.load_as_spark(f"<profile_path>#<share_name>.<schema_name>.<table_name>")
spark.read.format("deltaSharing")\
.load("<profile_path>#<share_name>.<schema_name>.<table_name>")\
.limit(10))
spark.read.format("deltaSharing")
.load("<profile_path>#<share_name>.<schema_name>.<table_name>")
.limit(10)
If the output is empty or doesn’t contain the data you expect, contact the data provider.
Use pandas
Follow these steps to access shared data in pandas 0.25.3 or above.
1. To access metadata related to the shared data, such as the list of tables shared with you, you must install
the delta-sharing Python connector:
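For example, install the connector from PyPI (package name delta-sharing):
pip install delta-sharing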
2. List the tables in the share. In the following example, replace <profile_path> with the location of the
credential file.
import delta_sharing
client = delta_sharing.SharingClient(f"<profile_path>/config.share")
client.list_all_tables()
If the output is empty or doesn’t contain the tables you expect, contact the data provider.
3. Access shared data in pandas using Python. In the following example, replace the variables as follows:
<profile_path> : the location of the credential file.
<share_name> : the value of share= for the table.
<schema_name> : the value of schema= for the table.
<table_name> : the value of name= for the table.
import delta_sharing
delta_sharing.load_as_pandas(f"<profile_path>#<share_name>.<schema_name>.<table_name>")
If the output is empty or doesn’t contain the data you expect, contact the data provider.
Next steps
Learn more about Databricks
Learn more about Delta Sharing
Learn more about Unity Catalog
Delta Sharing IP access list guide
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
The Delta Sharing IP access list API enables the provider metastore admin to configure an IP access list for each
recipient. This list is independent of Workspace IP Access Lists. This API supports allowlists (inclusion) only.
The IP access list affects:
Delta Sharing OSS Protocol REST API access.
Delta Sharing Activation URL access.
Delta Sharing Credential File download.
Each recipient supports a maximum of 100 IP/CIDR values, where one CIDR counts as a single value. Only IPv4
addresses are supported.
Audit Logging
The following operations have audit logs related to IP access lists:
Recipient management operations: create, update
Denial of access to any of the Delta Sharing OSS Protocol REST API calls
Denial of access to Delta Sharing Activation URL
Denial of access to Delta Sharing Credential File download
To learn more about how to enable and read audit logs for Delta Sharing, please refer to Audit access and
activity for Delta Sharing resources. The following table lists audited events related to IP access lists:
deltaSharing* (all Delta Sharing actions would have this audit log)
is_ip_access_denied : None if there is no IP access list configured. Otherwise, true if the request was denied and false if the request was not denied.
The recipient IP address is also logged with the event.
Developer tools and guidance
7/21/2022 • 3 minutes to read
Learn about tools and guidance you can use to work with Azure Databricks assets and data and to develop
Azure Databricks applications.
Use an IDE
You can connect many popular third-party IDEs to an Azure Databricks cluster. This allows you to write code on
your local development machine by using the Spark APIs and then run that code as jobs remotely on an Azure
Databricks cluster.
Databricks recommends that you use dbx by Databricks Labs for local development.
Databricks also provides a code sample that you can explore to use an IDE with dbx .
NOTE
Databricks also supports a tool named Databricks Connect. However, Databricks plans no new feature development for
Databricks Connect at this time. Also, Databricks Connect has several limitations.
Use the command line
NAME | USE THIS TOOL WHEN YOU WANT TO…
Databricks CLI | Use the command line to work with Data Science & Engineering workspace assets such as cluster policies, clusters, file systems, groups, pools, jobs, libraries, runs, secrets, and tokens.
Databricks SQL CLI | Use the command line to run SQL commands and scripts on a Databricks SQL warehouse.
Use APIs
CATEGORY | USE THIS API TO WORK WITH…
REST API (latest) | Data Science & Engineering workspace assets such as clusters, global init scripts, groups, pools, jobs, libraries, permissions, secrets, and tokens, by using the latest version of the Databricks REST API.
REST API 2.1 | Data Science & Engineering workspace assets such as jobs, by using version 2.1 of the Databricks REST API.
REST API 2.0 | Data Science & Engineering workspace assets such as clusters, global init scripts, groups, pools, jobs, libraries, permissions, secrets, and tokens, by using version 2.0 of the Databricks REST API.
Provision infrastructure
You can use an infrastructure-as-code (IaC) approach to programmatically provision Azure Databricks
infrastructure and assets such as workspaces, clusters, cluster policies, pools, jobs, groups, permissions, secrets,
tokens, and users. For details, see Databricks Terraform provider.
Use CI/CD
To manage the lifecycle of Azure Databricks assets and data, you can use continuous integration and continuous
delivery (CI/CD) and data pipeline tools.
Continuous integration and delivery on Azure Databricks using Azure DevOps | Develop a CI/CD pipeline for Azure Databricks that uses Azure DevOps.
Continuous integration and delivery on Azure Databricks using GitHub Actions | Develop a CI/CD workflow on GitHub that uses GitHub Actions developed for Azure Databricks.
Continuous integration and delivery on Azure Databricks using Jenkins | Develop a CI/CD pipeline for Azure Databricks that uses Jenkins.
Managing dependencies in data pipelines | Manage and schedule a data pipeline that uses Apache Airflow.
AREA | USE THESE TOOLS WHEN YOU WANT TO…
Service principals for CI/CD Use service principals, instead of users, with CI/CD systems.
TOOL | USE THIS WHEN YOU WANT TO:
Databricks SQL CLI Use a command line to run SQL commands and scripts on a
Databricks SQL warehouse.
DataGrip integration with Azure Databricks Use a query console, schema navigation, smart code
completion, and other features to run SQL commands and
scripts and to browse database objects in Azure Databricks.
DBeaver integration with Azure Databricks Run SQL commands and browse database objects in Azure
Databricks by using this client software application and
database administration tool.
IMPORTANT
dbx by Databricks Labs is provided as-is and is not officially supported by Databricks through customer technical support
channels. Support, questions, and feature requests can be communicated through the Issues page of the
databrickslabs/dbx repo on GitHub. Issues with the use of this code will not be answered or investigated by Databricks
Support.
dbx by Databricks Labs is an open source tool which is designed to extend the Databricks command-line
interface (Databricks CLI) and to provide functionality for rapid development lifecycle and continuous
integration and continuous delivery/deployment (CI/CD) on the Azure Databricks platform.
dbx simplifies jobs launch and deployment processes across multiple environments. It also helps to package
your project and deliver it to your Azure Databricks environment in a versioned fashion. Designed in a CLI-first
manner, it is built to be actively used both inside CI/CD pipelines and as a part of local tooling (such as local
IDEs, including Visual Studio Code and PyCharm).
The typical development workflow with dbx is:
1. Create a remote repository with a Git provider Databricks supports, if you do not have a remote repo
available already.
2. Clone your remote repo into your Azure Databricks workspace.
3. Create or move an Azure Databricks notebook into the cloned repo in your Azure Databricks workspace. Use
this notebook to begin prototyping the code that you want your Azure Databricks clusters to run.
4. To enhance and modularize your notebook code by adding separate helper classes and functions,
configuration files, and tests, switch over to using a local development machine with dbx , your preferred
IDE, and Git installed.
5. Clone your remote repo to your local development machine.
6. Move your code out of your notebook into one or more local code files.
7. As you code locally, push your work from your local repo to your remote repo. Also, sync your remote repo
with your Azure Databricks workspace.
8. Keep using the notebook in your Azure Databricks workspace for rapid prototyping, and keep moving
validated code from your notebook to your local machine. Keep using your local IDE for tasks such as code
modularization, code completion, linting, unit testing, and step-through debugging of code and objects that
do not require a live connection to Azure Databricks.
9. Use dbx to batch run your local code on your target clusters, as desired. (This is similar to running the spark-
submit script in Spark’s bin directory to launch applications on a Spark cluster.)
10. When you are ready for production, use a CI/CD platform such as GitHub Actions, Azure DevOps, or GitLab
to automate running your remote repo’s code on your clusters.
Requirements
To use dbx , you must have the following installed on your local development machine, regardless of whether
your code uses Python, Scala, or Java:
Python version 3.6 or above.
If your code uses Python, you should use a version of Python that matches the one that is installed on
your target clusters. To get the version of Python that is installed on an existing cluster, you can use the
cluster’s web terminal to run the python --version command. See also the “System environment”
section in the Databricks runtime releases for the Databricks Runtime version for your target clusters.
pip. ( dbx also supports conda, but conda is not covered in this article.)
If your code uses Python, a method to create Python virtual environments to ensure you are using the
correct versions of Python and package dependencies in your dbx projects. This article covers pipenv.
The dbx package from the Python Package Index (PyPI). You can install this by running pip install dbx .
The Databricks CLI, set up with authentication. The Databricks CLI is automatically installed when you
install dbx . This authentication can be set up on your local development machine in one or both of the
following locations:
Within the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (starting with Databricks
CLI version 0.8.0).
In a profile within your .databrickscfg file.
dbx looks for authentication credentials in these two locations, in that order, and uses only the first set
of matching credentials that it finds.
NOTE
dbx does not support the use of a .netrc file for authentication.
git for pushing and syncing local and remote code changes.
Continue with the instructions for one of the following IDEs:
Visual Studio Code
PyCharm
IntelliJ IDEA
Eclipse
NOTE
Databricks has validated usage of the preceding IDEs with dbx ; however, dbx should work with any IDE. You can also
use No IDE (terminal only).
dbx is optimized to work with single-file Python code files and compiled Scala and Java JAR files. dbx does not work
with single-file R code files or compiled R code packages. This is because dbx works with the Jobs API 2.0 and 2.1, and
these APIs cannot run single-file R code files or compiled R code packages as jobs.
For Linux and macOS:
mkdir dbx-demo
cd dbx-demo
code .
TIP
If command not found: code displays after you run code . , see Launching from the command line on the
Microsoft website.
For Windows:
md dbx-demo
cd dbx-demo
code .
2. In Visual Studio Code, create a Python virtual environment for this project:
a. On the menu bar, click View > Terminal .
b. From the root of the dbx-demo folder, run the pipenv command with the following option, where
<version> is the target version of Python that you already have installed locally (and, ideally, a
version that matches your target clusters’ version of Python), for example 3.7.5 .
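For example (pipenv's option for selecting the Python version):
pipenv --python <version>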
Make a note of the Virtualenv location value in the output of the pipenv command, as you will
need it in the next step.
3. Select the target Python interpreter, and then activate the Python virtual environment:
a. On the menu bar, click View > Command Palette , type Python: Select , and then click Python:
Select Interpreter .
b. Select the Python interpreter within the path to the Python virtual environment that you just created.
(This path is listed as the Virtualenv location value in the output of the pipenv command.)
c. On the menu bar, click View > Command Palette , type Terminal: Create , and then click Terminal:
Create New Terminal .
For more information, see Using Python environments in VS Code in the Visual Studio Code
documentation.
4. Continue with Create a dbx project.
PyCharm
Complete the following instructions to begin using PyCharm and Python with dbx .
On your local development machine, you must have PyCharm installed in addition to the general requirements.
Follow these steps to begin setting up your dbx project structure:
1. In PyCharm, on the menu bar, click File > New Project .
2. In the Create Project dialog, choose a location for your new project.
3. Expand Python interpreter : New Pipenv environment .
4. Select New environment using , if it is not already selected, and then select Pipenv from the drop-down
list.
5. For Base interpreter , select the location that contains the Python interpreter for the target version of
Python that you already have installed locally (and, ideally, a version that matches your target clusters’
version of Python).
6. For Pipenv executable , select the location that contains your local installation of pipenv , if it is not already
auto-detected.
7. If you want to create a minimal dbx project, and you want to use the main.py file with that minimal dbx
project, then select the Create a main.py welcome script box. Otherwise, clear this box.
8. Click Create .
9. In the Project tool window, right-click the project’s root folder, and then click Open in > Terminal .
10. Continue with Create a dbx project.
IntelliJ IDEA
Complete the following instructions to begin using IntelliJ IDEA and Scala with dbx . These instructions create a
minimal sbt-based Scala project that you can use to start a dbx project.
On your local development machine, you must have the following installed in addition to the general
requirements:
IntelliJ IDEA.
The Scala plugin for IntelliJ IDEA. For more information, see Discover IntelliJ IDEA for Scala in the IntelliJ IDEA
documentation.
Java Runtime Environment (JRE) 8. While any edition of JRE 8 should work, Databricks has so far only
validated usage of dbx and IntelliJ IDEA with the OpenJDK 8 JRE. Databricks has not yet validated usage of
dbx with IntelliJ IDEA and Java 11. For more information, see Java Development Kit (JDK) in the IntelliJ IDEA
documentation.
Follow these steps to begin setting up your dbx project structure:
Step 1: Create an sbt-based Scala project
1. In IntelliJ IDEA, depending on your view, click Projects > New Project or File > New > Project .
2. In the New Project dialog, click Scala , click sbt , and then click Next .
3. Enter a project name and a location for the project.
4. For JDK , select your installation of the OpenJDK 8 JRE.
5. For sbt , choose the highest available version of sbt that is listed.
6. For Scala , ideally, choose the version of Scala that matches your target clusters’ version of Scala. See the
“System environment” section in the Databricks runtime releases for the Databricks Runtime version for your
target clusters.
7. Next to Scala , select the Sources box if it is not already selected.
8. Add a package prefix to Package Prefix . These steps use the package prefix com.example.demo . If you specify
a different package prefix, replace the package prefix throughout these steps.
9. Click Finish .
Step 2: Add an object to the package
You can add any required objects to your package. This package contains a single object named SampleApp .
1. In the Project tool window (View > Tool Windows > Project ), right-click the project-name > src >
main > scala folder, and then click New > Scala Class .
2. Choose Object , and type the object’s name and then press Enter. For example, type SampleApp . If you
enter a different object name here, be sure to replace the name throughout these steps.
3. Replace the contents of the SampleApp.scala file with the following code:
package com.example.demo

object SampleApp {
  def main(args: Array[String]) {
  }
}
In the project’s src > main > scala > SampleApp.scala file, add the code that you want dbx to batch run on
your target clusters. For basic testing, use the example Scala code in the section Code example.
Step 5: Run the project
1. On the menu bar, click Run > Edit Configurations .
2. In the Run/Debug Configurations dialog, click the + (Add New Configuration ) icon, or Add new , or
Add new run configuration .
3. In the drop-down, click sbt Task .
4. For Name , enter a name for the configuration, for example, Run the program .
5. For Tasks , enter ~run .
6. Select Use sbt shell .
7. Click OK .
8. On the menu bar, click Run > Run ‘Run the program’ . The run’s results appear in the sbt shell tool
window.
Step 6: Build the project as a JAR
You can add any JAR build settings to your project that you want. This step assumes that you only want to build
a JAR that is based on the project that was set up in the previous steps.
1. On the menu bar, click File > Project Structure .
2. In the Project Structure dialog, click Project Settings > Artifacts .
3. Click the + (Add ) icon.
4. In the drop-down list, select JAR > From modules with dependencies .
5. In the Create JAR from Modules dialog, for Module , select the name of your project.
6. For Main Class , click the folder icon.
7. In the Select Main Class dialog, on the Search by Name tab, select SampleApp , and then click OK .
8. For JAR files from libraries , select copy to the output directory and link via manifest .
9. Click OK to close the Create JAR from Modules dialog.
10. Click OK to close the Project Structure dialog.
11. On the menu bar, click Build > Build Artifacts .
12. In the context menu that appears, select project-name :jar > Build . Wait while sbt builds your JAR. The
build’s results appear in the Build Output tool window (View > Tool Windows > Build ).
The JAR is built to the project’s out > artifacts > <project-name>_jar folder. The JAR’s name is
<project-name>.jar .
Step 7: Display the terminal in the IDE
With your dbx project structure now in place, you are ready to create your dbx project.
Display the IntelliJ IDEA terminal by clicking View > Tool Windows > Terminal on the menu bar, and then
continue with Create a dbx project.
Eclipse
Complete the following instructions to begin using Eclipse and Java with dbx . These instructions create a
minimal Maven-based Java project that you can use to start a dbx project.
On your local development machine, you must have the following installed in addition to the general
requirements:
A version of Eclipse. These instructions use the Eclipse IDE for Java Developers edition of the Eclipse IDE.
An edition of the Java Runtime Environment (JRE) or Java Development Kit (JDK) 11, depending on your local
machine’s operating system. While any edition of JRE or JDK 11 should work, Databricks has so far only
validated usage of dbx and the Eclipse IDE for Java Developers with Eclipse 2022-03 R, which includes
AdoptOpenJDK 11.
Follow these steps to begin setting up your dbx project structure:
Step 1: Create a Maven-based Java project
1. In Eclipse, click File > New > Project .
2. In the New Project dialog, expand Maven , select Maven Project , and click Next .
3. In the New Maven Project dialog, select Create a simple project (skip archetype selection) , and click
Next .
4. For Group Id , enter a group ID that conforms to Java’s package name rules. These steps use the package
name of com.example.demo . If you enter a different group ID, substitute it throughout these steps.
5. For Artifact Id , enter a name for the JAR file without the version number. These steps use the JAR name of
dbx-demo . If you enter a different name for the JAR file, substitute it throughout these steps.
6. Click Finish .
Step 2: Add a class to the package
You can add any classes to your package that you want. This package will contain a single class named
SampleApp .
1. In the Project Explorer view (Window > Show View > Project Explorer ), select the project-name
project icon, and then click File > New > Class .
2. In the New Java Class dialog, for Package , enter com.example.demo .
3. For Name , enter SampleApp .
4. For Modifiers , select public .
5. Leave Superclass blank.
6. For Which method stubs would you like to create , select public static void main(String[] args) .
7. Click Finish .
Step 3: Add dependencies to the project
1. In the Project Explorer view, double-click project-name > pom.xml .
2. Add the following dependencies as a child element of the <project> element, and then save the file:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.1</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.12</artifactId>
    <version>3.2.1</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
Replace:
2.12 with your target clusters’ version of Scala.
3.2.1 with your target clusters’ version of Spark.
See the “System environment” section in the Databricks runtime releases for the Databricks Runtime
version for your target clusters.
Step 4: Compile the project
1. In the project’s pom.xml file, add the following Maven compiler properties as a child element of the
<project> element, and then save the file:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <maven.compiler.source>1.6</maven.compiler.source>
  <maven.compiler.target>1.6</maven.compiler.target>
</properties>
2. In the Project Explorer view, right-click the project-name project icon, and then click Run As > Run
Configurations .
3. In the Run Configurations dialog, click Maven Build .
4. Click the New launch configuration icon.
5. Enter a name for this launch configuration, for example clean compile .
6. For Base directory , click Workspace , choose your project’s directory, and click OK .
7. For Goals , enter clean compile .
8. Click Run . The run’s output appears in the Console view (Window > Show View > Console ).
Step 5: Add code to the project
You can add any code to your project that you want. This step assumes that you only want to add code to a file
named SampleApp.java for a package named com.example.demo .
In the project’s src/main/java > com.example.demo > SampleApp.java file, add the code that you want dbx to
batch run on your target clusters. (If you do not have any code handy, you can use the Java code in the Code
example, listed toward the end of this article.)
Step 6: Run the project
1. In the Project Explorer view, right-click the project-name project icon, and then click Run As > Run
Configurations .
2. In the Run Configurations dialog, expand Java Application , and then click App .
3. Click Run . The run’s output appears in the Console view.
Step 7: Build the project as a JAR
1. In the Project Explorer view, right-click the project-name project icon, and then click Run As > Run
Configurations .
2. In the Run Configurations dialog, click Maven Build .
3. Click the New launch configuration icon.
4. Enter a name for this launch configuration, for example clean package .
5. For Base directory , click Workspace , choose your project’s directory, and click OK .
6. For Goals , enter clean package .
7. Click Run . The run’s output appears in the Console view.
The JAR is built to the <project-name> > target folder. The JAR’s name is <project-name>-0.0.1-SNAPSHOT.jar .
NOTE
If the JAR does not appear in the target folder in the Project Explorer window at first, you can try to display it by
right-clicking the project-name project icon, and then click Refresh .
For Linux and macOS:
mkdir dbx-demo
cd dbx-demo
For Windows:
md dbx-demo
cd dbx-demo
2. Create a Python virtual environment for this project by running the pipenv command, with the following
option, from the root of the dbx-demo folder, where <version> is the target version of Python that you
already have installed locally, for example 3.7.5 .
pipenv --python <version>
pipenv shell
NOTE
To create a dbx templated project for Python that demonstrates batch running of code on all-purpose clusters and jobs
clusters, remote code artifact deployments, and CI/CD platform setup, skip ahead to Create a dbx templated project for
Python with CI/CD support.
To complete this procedure, you must have an existing all-purpose cluster in your workspace. (See Display
clusters or Create a cluster.) Ideally (but not required), the version of Python in your Python virtual environment
should match the version that is installed on this cluster. To get the version of Python on the cluster, you can use
the cluster’s web terminal to run the command python --version .
python --version
1. From your terminal, from your dbx project’s root folder, run the dbx configure command with the
following option. This command creates a hidden .dbx folder within your dbx project’s root folder. This
.dbx folder contains lock.json and project.json files.
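A sketch of a typical command follows; the exact option depends on how you authenticate, and binding to the DEFAULT profile here is an assumption:
dbx configure --profile DEFAULT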
2. Create a folder named conf within your dbx project’s root folder.
For Linux and macOS:
mkdir conf
For Windows:
md conf
3. Add a file named deployment.yaml to the conf directory, with the following file contents:
environments:
  default:
    jobs:
      - name: "dbx-demo-job"
        spark_python_task:
          python_file: "dbx-demo-job.py"
NOTE
The deployment.yaml file contains the lower-cased word default , which is a reference to the upper-cased
DEFAULT profile within your Databricks CLI .databrickscfg file. If you want dbx to use a different profile,
replace default with your target profile’s name.
If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a
profile in your Databricks CLI .databrickscfg file, then leave default in the deployment.yaml as is. dbx
will use this reference by default.
4. Add the code to run on the cluster to a file named dbx-demo-job.py and add the file to the root folder of
your dbx project. (If you do not have any code handy, you can use the Python code in the Code example,
listed toward the end of this article.)
NOTE
You do not have to name this file dbx-demo-job.py . If you choose a different file name, be sure to update the
python_file field in the conf/deployment.yaml file to match.
5. Run the dbx execute command with the following options. In this command, replace
<existing-cluster-id> with the ID of the target cluster in your workspace. (To get the ID, see Cluster URL
and ID.)
dbx execute --cluster-id=<existing-cluster-id> --job=dbx-demo-job --no-rebuild --no-package
6. To view the run’s results locally, see your terminal’s output. To view the run’s results on your cluster, go to
the Standard output pane in the Driver logs tab for your cluster. (See Cluster driver and worker logs.)
7. Continue with Next steps.
Create a minimal dbx project for Scala or Java
The following minimal dbx project is the simplest and fastest approach to getting started with dbx and Scala
or Java. It demonstrates deploying a single Scala or Java JAR to your Azure Databricks workspace and then
running that deployed JAR on an Azure Databricks jobs cluster in your Azure Databricks workspace.
NOTE
Azure Databricks limits how you can run Scala and Java code on clusters:
You cannot run a single Scala or Java file as a job on a cluster as you can with a single Python file. To run Scala or Java
code, you must first build it into a JAR.
You can run a JAR as a job on an existing all-purpose cluster. However, you cannot reinstall any updates to that JAR on
the same all-purpose cluster. In this case, you must use a job cluster instead. This section uses the job cluster
approach.
You must first deploy the JAR to your Azure Databricks workspace before you can run that deployed JAR on any all-
purpose cluster or jobs cluster in that workspace.
1. In your terminal, from your project’s root folder, run the dbx configure command with the following
option. This command creates a hidden .dbx folder within your project’s root folder. This .dbx folder
contains lock.json and project.json files.
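As with the Python project, a typical sketch (assuming the DEFAULT profile in your .databrickscfg file) is:
dbx configure --profile DEFAULT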
NOTE
The project.json file contains a reference to the DEFAULT profile within your Databricks CLI .databrickscfg
file. If you want dbx to use a different profile, replace DEFAULT with your target profile’s name.
If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a
profile in your Databricks CLI .databrickscfg file, then leave DEFAULT in the project.json file as is. dbx
will use this reference by default.
mkdir conf
For Windows:
md conf
3. Add a file named deployment.yaml to the conf directory, with the following minimal file contents:
environments:
  default:
    strict_path_adjustment_policy: true
    jobs:
      - name: "dbx-demo-job"
        new_cluster:
          spark_version: "10.4.x-scala2.12"
          node_type_id: "Standard_DS3_v2"
          num_workers: 2
          instance_pool_id: "my-instance-pool"
        libraries:
          - jar: "file://out/artifacts/dbx_demo_jar/dbx-demo.jar"
        spark_jar_task:
          main_class_name: "com.example.demo.SampleApp"
Replace:
The value of spark_version with the appropriate Runtime version strings for your target jobs cluster.
The value of node_type_id with the appropriate Cluster node type for your target jobs cluster.
The value of instance_pool_id with the ID of an existing instance pool in your workspace, to enable
faster running of jobs. If you do not have an existing instance pool available or you do not want to use
an instance pool, remove this line altogether.
The value of jar with the path in the project to the JAR. For IntelliJ IDEA with Scala, it could be
file://out/artifacts/dbx_demo_jar/dbx-demo.jar . For the Eclipse IDE with Java, it could be
file://target/dbx-demo-0.0.1-SNAPSHOT.jar .
The value of main_class_name with the name of main class in the JAR, for example
com.example.demo.SampleApp .
NOTE
The deployment.yaml file contains the word default , which is a reference to the default environment in the
.dbx/project.json file, which in turn is a reference to the DEFAULT profile within your Databricks CLI
.databrickscfg file. If you want dbx to use a different profile, replace default in this deployment.yaml file
with the corresponding reference in the .dbx/project.json file, which in turn references the corresponding
profile within your Databricks CLI .databrickscfg file.
If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a
profile in your Databricks CLI .databrickscfg file, then leave default in the deployment.yaml as is. dbx
will use the default environment settings (except for the profile value) in the .dbx/project.json file by
default.
4. Run the dbx deploy command with the following options. dbx deploys the JAR to the location in the
.dbx/project.json file’s artifact_location path for the matching environment. dbx also deploys the
project’s files as part of an MLflow experiment, to the location listed in the .dbx/project.json file’s
workspace_dir path for the matching environment.
5. Run the dbx launch command with the following options. This command runs the job with the matching
name in conf/deployment.yaml . To find the deployed JAR to run as part of the job, dbx references the
location in the .dbx/project.json file’s artifact_location path for the matching environment. To
determine which specific JAR to run, dbx references the MLflow experiment in the location listed in the
.dbx/project.json file’s workspace_dir path for the matching environment.
dbx launch --job=dbx-demo-job
6. To view the job run’s results on your jobs cluster, see View jobs.
7. To view the experiment that the job referenced, see Experiments.
8. Continue with Next steps.
Create a dbx templated project for Python with CI/CD support
The following dbx templated project for Python demonstrates support for batch running of Python code on
Azure Databricks all-purpose clusters and jobs clusters in your Azure Databricks workspaces, remote code
artifact deployments, and CI/CD platform setup. (To create a minimal dbx project for Python that only
demonstrates batch running of a single Python code file on an existing all-purpose cluster, skip back to Create a
minimal dbx project for Python.)
1. From your terminal, in your dbx project’s root folder, run the dbx init command.
dbx init
2. For project_name , enter a name for your project, or press Enter to accept the default project name.
3. For version , enter a starting version number for your project, or press Enter to accept the default project
version.
4. For cloud , select the number that corresponds to the Azure Databricks cloud version that you want your
project to use, or press Enter to accept the default.
5. For cicd_tool , select the number that corresponds to the supported CI/CD tool that you want your
project to use, or press Enter to accept the default.
6. For project_slug , enter a prefix that you want to use for resources in your project, or press Enter to
accept the default.
7. For workspace_dir , enter the local path to the workspace directory for your project, or press Enter to
accept the default.
8. For artifact_location , enter the path in your Azure Databricks workspace to where your project’s
artifacts will be written, or press Enter to accept the default.
9. For profile , enter the name of the Databricks CLI authentication profile that you want your project to use,
or press Enter to accept the default.
TIP
You can skip the preceding steps by running dbx init with hard-coded template parameters, for example:
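A sketch of such a command follows; the template name and parameter values shown here are illustrative and may differ for your dbx version and project:
dbx init --template=python_basic \
  -p "project_name=cicd-sample-project" \
  -p "cloud=Azure" \
  -p "cicd_tool=GitHub Actions" \
  -p "profile=DEFAULT" \
  --no-input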
dbx calculates the parameters project_slug , workspace_dir , and artifact_location automatically. These three
parameters are optional, and they are useful only for more advanced use cases.
See the init command in CLI Reference in the dbx documentation.
To use your new project, see Basic Python Template in the dbx documentation.
See also Next steps.
Code example
If you do not have any code readily available to batch run with dbx , you can experiment by having dbx batch
run the following code. This code creates a small table in your workspace, queries the table, and then deletes the
table.
TIP
If you want to leave the table in your workspace instead of deleting it, comment out the last line of code in this example
before you batch run it with dbx .
Python
# For testing and debugging of local objects, run
# "pip install pyspark=X.Y.Z", where "X.Y.Z"
# matches the version of PySpark
# on your target clusters.
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dbx-demo").getOrCreate()
data = [
[ 'BLI', date(2021, 4, 3), 52, 43],
[ 'BLI', date(2021, 4, 2), 50, 38],
[ 'BLI', date(2021, 4, 1), 52, 41],
[ 'PDX', date(2021, 4, 3), 64, 45],
[ 'PDX', date(2021, 4, 2), 61, 41],
[ 'PDX', date(2021, 4, 1), 66, 39],
[ 'SEA', date(2021, 4, 3), 57, 43],
[ 'SEA', date(2021, 4, 2), 54, 39],
[ 'SEA', date(2021, 4, 1), 56, 41]
]
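
# The middle of this example did not survive extraction. The sketch below fills the
# gap under stated assumptions: the table name demo_temps_table and the exact filter
# are hypothetical, chosen so the query matches the sample results shown next.
columns = ["AirportCode", "Date", "TempHighF", "TempLowF"]
df = spark.createDataFrame(data, columns)

# Create a small table in the workspace from the DataFrame.
df.write.mode("overwrite").saveAsTable("demo_temps_table")

# Query the table for recent days with a high temperature above 52 F.
spark.sql(
    "SELECT AirportCode, Date, TempHighF, TempLowF "
    "FROM demo_temps_table "
    "WHERE TempHighF > 52 AND Date >= '2021-04-02' "
    "ORDER BY AirportCode, Date DESC"
).show()

# Clean up: drop the table (comment this out to keep the table, as noted in the TIP above).
spark.sql("DROP TABLE demo_temps_table")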
# Results:
#
# +-----------+----------+---------+--------+
# |AirportCode| Date|TempHighF|TempLowF|
# +-----------+----------+---------+--------+
# | PDX|2021-04-03| 64| 45|
# | PDX|2021-04-02| 61| 41|
# | SEA|2021-04-03| 57| 43|
# | SEA|2021-04-02| 54| 39|
# +-----------+----------+---------+--------+
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.sql.Date
object SampleApp {
def main(args: Array[String]) {
val spark = SparkSession.builder().master("local").getOrCreate()
// Results:
//
// +-----------+----------+---------+--------+
// |AirportCode| Date|TempHighF|TempLowF|
// +-----------+----------+---------+--------+
// | PDX|2021-04-03| 64| 45|
// | PDX|2021-04-02| 61| 41|
// | SEA|2021-04-03| 57| 43|
// | SEA|2021-04-02| 54| 39|
// +-----------+----------+---------+--------+
package com.example.demo;
import java.util.ArrayList;
import java.util.List;
import java.sql.Date;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.Dataset;
// Results:
//
// +-----------+----------+---------+--------+
// |AirportCode| Date|TempHighF|TempLowF|
// +-----------+----------+---------+--------+
// | PDX|2021-04-03| 64| 45|
// | PDX|2021-04-02| 61| 41|
// | SEA|2021-04-03| 57| 43|
// | SEA|2021-04-02| 54| 39|
// +-----------+----------+---------+--------+
Next steps
Extend your conf/deployment.yaml file to support various types of all-purpose and jobs cluster definitions.
Declare multitask jobs in your conf/deployment.yaml file.
Reference environment variables and named properties in your conf/deployment.yaml file.
Batch run code as new jobs on clusters with the dbx execute command.
Limitations
Jobs that contain notebook task types are not supported.
Target all-purpose clusters or jobs clusters must use a Databricks Runtime for Machine Learning (Databricks
Runtime ML) version.
dbx execute cannot be used with multi-task jobs.
Only Python is fully supported.
Additional resources
Batch deploy code artifacts to Azure Databricks workspace storage with the dbx deploy command.
Batch run existing jobs on clusters with the dbx launch command.
Use dbx to do jobless deployments.
Integrate dbx with Azure Data Factory.
Integrate dbx with Apache Airflow.
Learn more about dbx and CI/CD.
dbx documentation
databrickslabs/dbx repository on GitHub
dbx limitations
Use an IDE with Azure Databricks
7/21/2022 • 15 minutes to read
You can use third-party integrated development environments (IDEs) for software development with Azure
Databricks. Some of these IDEs include the following:
Visual Studio Code
PyCharm
IntelliJ IDEA
Eclipse
You use these IDEs to do software development in programming languages that Azure Databricks supports,
including the following languages:
Python
R
Scala
Java
To demonstrate how this can work, this article describes a Python-based code sample that you can work with in
any Python-compatible IDE. Specifically, this article describes how to work with this code sample in Visual Studio
Code, which provides the following developer productivity features:
Code completion
Linting
Testing
Debugging code objects that do not require a real-time connection to remote Azure Databricks resources.
This article uses dbx by Databricks Labs along with Visual Studio Code to submit the code sample to a remote
Azure Databricks workspace. dbx instructs Azure Databricks Workflows to run the submitted code on an Azure
Databricks jobs cluster in that workspace.
You can use popular third-party Git providers for version control and continuous integration and continuous
delivery or continuous deployment (CI/CD) of your code. For version control, these Git providers include the
following:
GitHub
Bitbucket
GitLab
Azure DevOps (not available in Azure China regions)
AWS CodeCommit
GitHub AE
For CI/CD, dbx supports the following CI/CD platforms:
GitHub Actions
Azure Pipelines
GitLab CI/CD
To demonstrate how version control and CI/CD can work, this article describes how to use Visual Studio Code,
dbx , and this code sample, along with GitHub and GitHub Actions.
Code sample requirements
To use this code sample, you must have the following:
An Azure Databricks workspace in your Azure Databricks account. Create a workspace if you do not already
have one.
A GitHub account. Create a GitHub account, if you do not already have one.
Additionally, on your local development machine, you must have the following:
Python version 3.8 or above.
You should use a version of Python that matches the one that is installed on your target clusters. To get
the version of Python that is installed on an existing cluster, you can use the cluster’s web terminal to run
the python --version command. See also the “System environment” section in the Databricks runtime
releases for the Databricks Runtime version for your target clusters. In any case, the version of Python
must be 3.8 or above.
To get the version of Python that is currently referenced on your local machine, run python --version
from your local terminal. (Depending on how you set up Python on your local machine, you may need to
run python3 instead of python throughout this article.) See also Select a Python interpreter.
pip. pip is automatically installed with newer versions of Python. To check whether pip is already
installed, run pip --version from your local terminal. (Depending on how you set up Python or pip on
your local machine, you may need to run pip3 instead of pip throughout this article.)
The dbx package from the Python Package Index (PyPI). You can install this by running pip install dbx .
NOTE
You do not need to install dbx now. You can install it later in the code sample setup section.
A method to create Python virtual environments to ensure you are using the correct versions of Python
and package dependencies in your dbx projects. This article covers pipenv.
The Databricks CLI, set up with authentication.
NOTE
You do not need to install the Databricks CLI now. You can install it later in the code sample setup section. If you
want to install it later, you must remember to set up authentication at that time instead.
NOTE
These steps do not include setting up this code sample for CI/CD. You do not need to set up CI/CD to run this code
sample. If you want to set up CI/CD later, see Run with GitHub Actions.
mkdir ide-demo
cd ide-demo
code .
TIP
If you get the error command not found: code , see Launching from the command line on the Microsoft website.
For Windows:
md ide-demo
cd ide-demo
code .
2. In Visual Studio Code, on the menu bar, click View > Terminal .
3. From the root of the ide-demo folder, run the pipenv command with the following option, where
<version> is the target version of Python that you already have installed locally (and, ideally, a version
that matches your target clusters' version of Python), for example 3.8.10 :

pipenv --python <version>
Make a note of the Virtualenv location value in the output of the pipenv command, as you will need it
in the next step.
4. Select the target Python interpreter, and then activate the Python virtual environment:
a. On the menu bar, click View > Command Palette , type Python: Select , and then click Python:
Select Interpreter .
b. Select the Python interpreter within the path to the Python virtual environment that you just
created. (This path is listed as the Virtualenv location value in the output of the pipenv
command.)
c. On the menu bar, click View > Command Palette , type Terminal: Create , and then click
Terminal: Create New Terminal .
d. Make sure that the command prompt indicates that you are in the pipenv shell. To confirm, you
should see something like (<your-username>) before your command prompt. If you do not see it,
run the following command:
pipenv shell
To exit the pipenv shell, run the command exit , and the parentheses disappear.
For more information, see Using Python environments in VS Code in the Visual Studio Code
documentation.
Step 2: Clone the code sample from GitHub
1. In Visual Studio Code, open the ide-demo folder (File > Open Folder ), if it is not already open.
2. Click View > Command Palette , type Git: Clone , and then click Git: Clone .
3. For Provide repositor y URL or pick a repositor y source , enter
https://github.com/databricks/ide-best-practices
4. Browse to your ide-demo folder, and click Select Repositor y Location .
Step 3: Install the code sample’s dependencies
1. Install a version of dbx and the Databricks CLI that is compatible with your version of Python. To do this,
in Visual Studio Code from your terminal, from your ide-demo folder with a pipenv shell activated (
pipenv shell ), run the following command:
dbx --version
databricks --version
If a list of root-level folder names for your workspace is returned, authentication is set up.
5. Install the Python packages that this code sample depends on. To do this, run the following command
from the ide-demo/ide-best-practices folder:
6. Confirm that the code sample’s dependent packages are installed. To do this, run the following command:
pip list
If the packages that are listed in the requirements.txt and unit-requirements.txt files are somewhere in
this list, the dependent packages are installed.
NOTE
The files listed in requirements.txt are for specific package versions. For better compatibility, you can cross-
reference these versions with the cluster node type that you want your Azure Databricks workspace to use for
running deployments on later. See the “System environment” section for your cluster’s Databricks Runtime version
in Databricks runtime releases.
Step 4: Customize the code sample for your Azure Databricks workspace
1. Customize the repo’s dbx project settings. To do this, in the .dbx/project.json file, change the value of
the profile object from DEFAULT to the name of the profile that matches the one that you set up for
authentication with the Databricks CLI. If you did not set up any non-default profile, leave DEFAULT as is.
For example:
{
"environments": {
"default": {
"profile": "DEFAULT",
"workspace_dir": "/Shared/dbx/covid_analysis",
"artifact_location": "dbfs:/Shared/dbx/projects/covid_analysis"
}
}
}
2. Customize the dbx project’s deployment settings. To do this, in the conf/deployment.yml file, change the
value of the spark_version and node_type_id objects from 10.4.x-scala2.12 and m6gd.large to the
Azure Databricks runtime version string and cluster node type that you want your Azure Databricks
workspace to use for running deployments on.
For example, to specify Databricks Runtime 10.4 LTS and a Standard_DS3_v2 node type:
environments:
default:
strict_path_adjustment_policy: true
jobs:
- name: "covid_analysis_etl_integ"
new_cluster:
spark_version: "10.4.x-scala2.12"
num_workers: 1
node_type_id: "Standard_DS3_v2"
spark_python_task:
python_file: "file://jobs/covid_trends_job.py"
- name: "covid_analysis_etl_prod"
new_cluster:
spark_version: "10.4.x-scala2.12"
num_workers: 1
node_type_id: "Standard_DS3_v2"
spark_python_task:
python_file: "file://jobs/covid_trends_job.py"
parameters: ["--prod"]
- name: "covid_analysis_etl_raw"
new_cluster:
spark_version: "10.4.x-scala2.12"
num_workers: 1
node_type_id: "Standard_DS3_v2"
spark_python_task:
python_file: "file://jobs/covid_trends_job_raw.py"
TIP
In this example, each of these three job definitions has the same spark_version and node_type_id value. You can use
different values for different job definitions. You can also create shared values and reuse them across job definitions, to
reduce typing errors and code maintenance. See the YAML example in the dbx documentation.
Code modularization
Unmodularized code
The jobs/covid_trends_job_raw.py file is an unmodularized version of the code logic. You can run this file by
itself.
Modularized code
The jobs/covid_trends_job.py file is a modularized version of the code logic. This file relies on the shared code
in the covid_analysis/transforms.py file. The covid_analysis/__init__.py file treats the covid_analysis folder
as a containing package.
Testing
Unit tests
The tests/testdata.csv file contains a small portion of the data in the covid-hospitalizations.csv file for
testing purposes. The tests/transforms_test.py file contains the unit tests for the covid_analysis/transforms.py
file.
Unit test runner
The pytest.ini file contains configuration options for running tests with pytest. See pytest.ini and
Configuration Options in the pytest documentation.
The .coveragerc file contains configuration options for Python code coverage measurements with coverage.py.
See Configuration reference in the coverage.py documentation.
The requirements.txt file, which is a subset of the unit-requirements.txt file that you ran earlier with pip ,
contains a list of packages that the unit tests also depend on.
Packaging
The setup.py file provides commands to be run at the console (console scripts), such as the pip command, for
packaging Python projects with setuptools. See Entry Points in the setuptools documentation.
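As a rough illustration of the mechanism (the command name, module, and function below are hypothetical, not the actual contents of this repo's setup.py ), a console script is declared through an entry point like this:

Python

# A generic setuptools sketch; the names used here are illustrative only.
from setuptools import find_packages, setup

setup(
    name="covid_analysis",
    version="0.0.1",
    packages=find_packages(exclude=["tests", "tests.*"]),
    entry_points={
        "console_scripts": [
            # Installs a "covid-etl" command that calls jobs.covid_trends_job.main()
            "covid-etl = jobs.covid_trends_job:main",
        ]
    },
)

After pip install -e . , any command declared this way becomes available on the PATH of the active environment.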
Other files
There are other files in this code sample that have not been previously described:
The .github/workflows folder contains three files, databricks_pull_request_tests.yml , onpush.yml , and
onrelease.yml , that represent the GitHub Actions, which are covered later in the GitHub Actions section.
The .gitignore file contains a list of local folders and files that Git ignores for your repo.
pip install -e .
This command creates a covid_analysis.egg-info folder, which contains information about the compiled
version of the covid_analysis/__init__.py and covid_analysis/transforms.py files.
2. Run the tests by running the following command:
pytest tests/
The tests’ results are displayed in the terminal. All four tests should show as passing.
3. Optionally, get test coverage metrics for your tests by running the following command:
NOTE
If a message displays that coverage cannot be found, run pip install coverage , and try again.
coverage report -m
4. If all four tests pass, send the dbx project’s contents to your Azure Databricks workspace, by running the
following command:
Information about the project and its runs are sent to the location specified in the workspace_dir object
in the .dbx/project.json file.
The project’s contents are sent to the location specified in the artifact_location object in the
.dbx/project.json file.
5. Run the pre-production version of the code in your workspace, by running the following command:
A link to the run's results is displayed in the terminal. It should look something like this:
https://<your-workspace-instance-id>/?o=1234567890123456#job/123456789012345/run/12345
Follow this link in your web browser to see the run’s results in your workspace.
6. Run the production version of the code in your workspace, by running the following command:
A link to the run's results is displayed in the terminal. It should look something like this:
https://<your-workspace-instance-id>/?o=1234567890123456#job/123456789012345/run/23456
Follow this link in your web browser to see the run’s results in your workspace.
Run with GitHub Actions
In the project’s .github/workflows folder, the onpush.yml and onrelease.yml GitHub Actions files do the
following:
On each push to a tag that begins with v , uses dbx to deploy the covid_analysis_etl_prod job.
On each push that is not to a tag that begins with v :
1. Uses pytest to run the unit tests.
2. Uses dbx to deploy the file specified in the covid_analysis_etl_integ job to the remote workspace.
3. Uses dbx to launch the already-deployed file specified in the covid_analysis_etl_integ job on the
remote workspace, tracing this run until it finishes.
NOTE
An additional GitHub Actions file, databricks_pull_request_tests.yml , is provided for you as a template to
experiment with, without impacting the onpush.yml and onrelease.yml GitHub Actions files. You can run this code
sample without the databricks_pull_request_tests.yml GitHub Actions file. Its usage is not covered in this article.
The following subsections describe how to set up and run the onpush.yml and onrelease.yml GitHub Actions
files.
Set up to use GitHub Actions
Set up your Azure Databricks workspace by following the instructions in Service principals for CI/CD. This
includes the following actions:
1. Create an Azure AD service principal.
2. Create an Azure AD token for the Azure AD service principal.
As a security best practice, Databricks recommends that you use an Azure AD token for an Azure AD service
principal, instead of the Azure Databricks personal access token for your workspace user, for enabling GitHub to
authenticate with your Azure Databricks workspace.
After you create the Azure AD service principal and its Azure AD token, stop and make a note of the Azure AD
token value, which you will use in the next section.
Run GitHub Actions
Step 1: Publish your cloned repo
1. In Visual Studio Code, in the sidebar, click the GitHub icon. If the icon is not visible, enable the GitHub Pull
Requests and Issues extension through the Extensions view (View > Extensions ) first.
2. If the Sign In button is visible, click it, and follow the on-screen instructions to sign in to your GitHub
account.
3. On the menu bar, click View > Command Palette , type Publish to GitHub , and then click Publish to
GitHub .
4. Select an option to publish your cloned repo to your GitHub account.
Step 2: Add encrypted secrets to your repo
In the GitHub website for your published repo, follow the instructions in Creating encrypted secrets for a
repository, for the following encrypted secrets:
Create an encrypted secret named DATABRICKS_HOST , set to the value of your per-workspace URL, for example
https://adb-1234567890123456.7.azuredatabricks.net .
Create an encrypted secret named DATABRICKS_TOKEN , set to the value of the Azure AD token for the Azure AD
service principal.
Step 3: Create and publish a branch to your repo
1. In Visual Studio Code, in Source Control view (View > Source Control ), click the … (Views and More
Actions ) icon.
2. Click Branch > Create Branch From .
3. Enter a name for the branch, for example my-branch .
4. Select the branch to create the branch from, for example main .
5. Make a minor change to one of the files in your local repo, and then save the file. For example, make a minor
change to a code comment in the tests/transforms_test.py file.
6. In Source Control view, click the … (Views and More Actions ) icon again.
7. Click Changes > Stage All Changes .
8. Click the … (Views and More Actions ) icon again.
9. Click Commit > Commit Staged .
10. Enter a message for the commit.
11. Click the … (Views and More Actions ) icon again.
12. Click Branch > Publish Branch .
Step 4: Create a pull request and merge
NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Databricks Connect allows you to connect your favorite IDE (Eclipse, IntelliJ, PyCharm, RStudio, Visual Studio
Code), notebook server (Jupyter Notebook, Zeppelin), and other custom applications to Azure Databricks
clusters.
This article explains how Databricks Connect works, walks you through the steps to get started with Databricks
Connect, explains how to troubleshoot issues that may arise when using Databricks Connect, and differences
between running using Databricks Connect versus running in an Azure Databricks notebook.
Overview
Databricks Connect is a client library for Databricks Runtime. It allows you to write jobs using Spark APIs and
run them remotely on an Azure Databricks cluster instead of in the local Spark session.
For example, when you run the DataFrame command
spark.read.format("parquet").load(...).groupBy(...).agg(...).show() using Databricks Connect, the parsing
and planning of the job runs on your local machine. Then, the logical representation of the job is sent to the
Spark server running in Azure Databricks for execution in the cluster.
With Databricks Connect, you can:
Run large-scale Spark jobs from any Python, Java, Scala, or R application. Anywhere you can import pyspark ,
import org.apache.spark , or require(SparkR) , you can now run Spark jobs directly from your application,
without needing to install any IDE plugins or use Spark submission scripts.
Step through and debug code in your IDE even when working with a remote cluster.
Iterate quickly when developing libraries. You do not need to restart the cluster after changing Python or Java
library dependencies in Databricks Connect, because each client session is isolated from each other in the
cluster.
Shut down idle clusters without losing work. Because the client application is decoupled from the cluster, it is
unaffected by cluster restarts or upgrades, which would normally cause you to lose all the variables, RDDs,
and DataFrame objects defined in a notebook.
NOTE
For Python development with SQL queries, Databricks recommends that you use the Databricks SQL Connector for
Python instead of Databricks Connect. The Databricks SQL Connector for Python is easier to set up than Databricks
Connect. Also, Databricks Connect parses and plans job runs on your local machine, while jobs run on remote compute
resources. This can make it especially difficult to debug runtime errors. The Databricks SQL Connector for Python submits
SQL queries directly to remote compute resources and fetches results.
Requirements
Only the following Databricks Runtime versions are supported:
Databricks Runtime 10.4 LTS ML, Databricks Runtime 10.4 LTS
Databricks Runtime 9.1 LTS ML, Databricks Runtime 9.1 LTS
Databricks Runtime 7.3 LTS ML, Databricks Runtime 7.3 LTS
Databricks Runtime 6.4 ML, Databricks Runtime 6.4
The minor version of your client Python installation must be the same as the minor Python version of
your Azure Databricks cluster. The table shows the Python version installed with each Databricks Runtime.
For example, if you’re using Conda on your local development environment and your cluster is running
Python 3.7, you must create an environment with that version, for example:
The Databricks Connect major and minor package version must always match your Databricks Runtime
version. Databricks recommends that you always use the most recent package of Databricks Connect that
matches your Databricks Runtime version. For example, when using a Databricks Runtime 7.3 LTS cluster,
use the databricks-connect==7.3.* package.
NOTE
See the Databricks Connect release notes for a list of available Databricks Connect releases and maintenance
updates.
Java Runtime Environment (JRE) 8. The client has been tested with the OpenJDK 8 JRE. The client does not
support Java 11.
NOTE
On Windows, if you see an error that Databricks Connect cannot find winutils.exe , see Cannot find winutils.exe on
Windows.
NOTE
Always specify databricks-connect==X.Y.* instead of databricks-connect=X.Y , to make sure that the
newest package is installed.
The unique organization ID for your workspace. See Get workspace, cluster, notebook, folder,
model, and job identifiers.
The port that Databricks Connect connects to. The default port is 15001 . If your cluster is
configured to use a different port, such as 8787 which was given in previous instructions for
Azure Databricks, use the configured port number.
2. Configure the connection. You can use the CLI, SQL configs, or environment variables. The precedence of
configuration methods from highest to lowest is: SQL config keys, CLI, and environment variables.
CLI
a. Run the following command to start the configuration:
databricks-connect configure
The license displays:
This library (the "Software") may not be used except in connection with the
Licensee's use of the Databricks Platform Services pursuant to an Agreement
...
b. Accept the license and supply configuration values. For Databricks Host and Databricks
Token , enter the workspace URL and the personal access token you noted in Step 1.
If you get a message that the Azure Active Directory token is too long, you can leave the Databricks
Token field empty and manually enter the token in ~/.databricks-connect .
SQL configs or environment variables. The following table shows the SQL config keys and the
environment variables that correspond to the configuration properties you noted in Step 1. To set
a SQL config key, use sql("set config=value") . For example:
sql("set spark.databricks.service.clusterId=0304-201045-abcdefgh") .
PARAMETER | SQL CONFIG KEY | ENVIRONMENT VARIABLE NAME
databricks-connect test
If the cluster you configured is not running, the test starts the cluster which will remain running until its
configured autotermination time. The output should be something like:
* PySpark is installed at /.../3.5.6/lib/python3.5/site-packages/pyspark
* Checking java version
java version "1.8.0_152"
Java(TM) SE Runtime Environment (build 1.8.0_152-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.152-b16, mixed mode)
* Testing scala command
18/12/10 16:38:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/12/10 16:38:50 WARN MetricsSystem: Using default name SparkStatusTracker for source because
neither spark.metrics.namespace nor spark.app.id is set.
18/12/10 16:39:53 WARN SparkServiceRPCClient: Now tracking server state for 5abb7c7e-df8e-4290-947c-
c9a38601024e, invalidating prev state
18/12/10 16:39:59 WARN SparkServiceRPCClient: Syncing 129 files (176036 bytes) took 3003 ms
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.0-SNAPSHOT
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.range(100).reduce(_ + _)
Spark context Web UI available at https://10.8.5.214:4040
Spark context available as 'sc' (master = local[*], app id = local-1544488730553).
Spark session available as 'spark'.
View job details at <databricks-url>/?o=0#/setting/clusters/<cluster-id>/sparkUi
View job details at <databricks-url>?o=0#/setting/clusters/<cluster-id>/sparkUi
res0: Long = 4950
scala> :quit
NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.
The Databricks Connect configuration script automatically adds the package to your project configuration. To get
started in a Python kernel, run:
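The command itself is not reproduced above; presumably it creates a SparkSession in the kernel, along the lines of this minimal sketch:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()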
To enable the %sql shorthand for running and visualizing SQL queries, use the following snippet:
from IPython import get_ipython
from IPython.core.magic import Magics, magics_class, line_cell_magic

@magics_class
class DatabricksConnectMagics(Magics):

    @line_cell_magic
    def sql(self, line, cell=None):
        if cell and line:
            raise ValueError("Line must be empty for cell magic", line)
        try:
            from autovizwidget.widget.utils import display_dataframe
        except ImportError:
            print("Please run `pip install autovizwidget` to enable the visualization widget.")
            display_dataframe = lambda x: x
        return display_dataframe(self.get_spark().sql(cell or line).toPandas())

    def get_spark(self):
        user_ns = get_ipython().user_ns
        if "spark" in user_ns:
            return user_ns["spark"]
        else:
            from pyspark.sql import SparkSession
            user_ns["spark"] = SparkSession.builder.getOrCreate()
            return user_ns["spark"]

ip = get_ipython()
ip.register_magics(DatabricksConnectMagics)
PyCharm
NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.
The Databricks Connect configuration script automatically adds the package to your project configuration.
Python 3 clusters
1. When you create a PyCharm project, select Existing Interpreter . From the drop-down menu, select the
Conda environment you created (see Requirements).
1. Download and unpack the open source Spark onto your local machine. Choose the same version as in
your Azure Databricks cluster (Hadoop 2.7).
2. Run databricks-connect get-jar-dir . This command returns a path like
/usr/local/lib/python3.5/dist-packages/pyspark/jars . Copy the file path of one directory above the JAR
directory file path, for example, /usr/local/lib/python3.5/dist-packages/pyspark , which is the SPARK_HOME
directory.
3. Configure the Spark lib path and Spark home by adding them to the top of your R script. Set
<spark-lib-path> to the directory where you unpacked the open source Spark package in step 1. Set
<spark-home-path> to the Databricks Connect directory from step 2.
sparkR.session()
df <- as.DataFrame(faithful)
head(df)
IMPORTANT
This feature is in Public Preview.
NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.
You can copy sparklyr-dependent code that you’ve developed locally using Databricks Connect and run it in an Azure
Databricks notebook or hosted RStudio Server in your Azure Databricks workspace with minimal or no code changes.
In this section:
Requirements
Install, configure, and use sparklyr
Resources
sparklyr and RStudio Desktop limitations
Requirements
sparklyr 1.2 or above.
Databricks Runtime 6.4 or above with matching Databricks Connect.
Install, configure, and use sparklyr
1. In RStudio Desktop, install sparklyr 1.2 or above from CRAN or install the latest master version from
GitHub.
2. Activate the Python environment with Databricks Connect installed and run the following command in
the terminal to get the <spark-home-path> :
databricks-connect get-spark-home
library(sparklyr)
sc <- spark_connect(method = "databricks", spark_home = "<spark-home-path>")
library(dplyr)
src_tbls(sc)
spark_disconnect(sc)
Resources
For more information, see the sparklyr GitHub README.
For code examples, see sparklyr.
sparklyr and RStudio Desktop limitations
The following features are unsupported:
sparklyr streaming APIs
sparklyr ML APIs
broom APIs
csv_file serialization mode
spark submit
IntelliJ (Scala or Java)
NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.
To avoid conflicts, we strongly recommend removing any other Spark installations from your classpath. If
this is not possible, make sure that the JARs you add are at the front of the classpath. In particular, they
must be ahead of any other installed version of Spark (otherwise you will either use one of those other
Spark versions and run locally or throw a ClassDefNotFoundError ).
3. Check the setting of the breakout option in IntelliJ. The default is All and will cause network timeouts if
you set breakpoints for debugging. Set it to Thread to avoid stopping the background network threads.
Eclipse
NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.
To avoid conflicts, we strongly recommend removing any other Spark installations from your classpath. If
this is not possible, make sure that the JARs you add are at the front of the classpath. In particular, they
must be ahead of any other installed version of Spark (otherwise you will either use one of those other
Spark versions and run locally or throw a ClassDefNotFoundError ).
Visual Studio Code
NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.
For example, if your cluster is Python 3.5, your local environment should be Python 3.5.
SBT
NOTE
Databricks recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Databricks plans no new feature development for Databricks Connect at this time. Also, be aware of the limitations of
Databricks Connect.
Before you begin to use Databricks Connect, you must meet the requirements and set up the client for Databricks
Connect.
To use SBT, you must configure your build.sbt file to link against the Databricks Connect JARs instead of the
usual Spark library dependency. You do this with the unmanagedBase directive in the following example build file,
which assumes a Scala app that has a com.example.Test main object:
build.sbt
name := "hello-world"
version := "1.0"
scalaVersion := "2.11.6"
// this should be set to the path returned by ``databricks-connect get-jar-dir``
unmanagedBase := new java.io.File("/usr/local/lib/python2.7/dist-packages/pyspark/jars")
mainClass := Some("com.example.Test")
import java.util.ArrayList;
import java.util.List;
import java.sql.Date;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.Dataset;
// Results:
//
// +-----------+----------+---------+--------+
// |AirportCode| Date|TempHighF|TempLowF|
// +-----------+----------+---------+--------+
// | PDX|2021-04-03| 64| 45|
// | PDX|2021-04-02| 61| 41|
// | SEA|2021-04-03| 57| 43|
// | SEA|2021-04-02| 54| 39|
// +-----------+----------+---------+--------+
Python
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from datetime import date
spark = SparkSession.builder.appName('temps-demo').getOrCreate()
data = [
[ 'BLI', date(2021, 4, 3), 52, 43],
[ 'BLI', date(2021, 4, 2), 50, 38],
[ 'BLI', date(2021, 4, 1), 52, 41],
[ 'PDX', date(2021, 4, 3), 64, 45],
[ 'PDX', date(2021, 4, 2), 61, 41],
[ 'PDX', date(2021, 4, 1), 66, 39],
[ 'SEA', date(2021, 4, 3), 57, 43],
[ 'SEA', date(2021, 4, 2), 54, 39],
[ 'SEA', date(2021, 4, 1), 56, 41]
]
# Results:
#
# +-----------+----------+---------+--------+
# |AirportCode| Date|TempHighF|TempLowF|
# +-----------+----------+---------+--------+
# | PDX|2021-04-03| 64| 45|
# | PDX|2021-04-02| 61| 41|
# | SEA|2021-04-03| 57| 43|
# | SEA|2021-04-02| 54| 39|
# +-----------+----------+---------+--------+
Scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.sql.Date
object Demo {
def main(args: Array[String]) {
val spark = SparkSession.builder.master("local").getOrCreate()
// Results:
//
// +-----------+----------+---------+--------+
// |AirportCode| Date|TempHighF|TempLowF|
// +-----------+----------+---------+--------+
// | PDX|2021-04-03| 64| 45|
// | PDX|2021-04-02| 61| 41|
// | SEA|2021-04-03| 57| 43|
// | SEA|2021-04-02| 54| 39|
// +-----------+----------+---------+--------+
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
#sc.setLogLevel("INFO")

class Foo(object):
    def __init__(self, x):
        self.x = x
from pyspark.sql import SparkSession
from pyspark.sql.column import _to_java_column, _to_seq, Column

spark = SparkSession.builder \
    .config("spark.jars", "/path/to/udf.jar") \
    .getOrCreate()
sc = spark.sparkContext

def plus_one_udf(col):
    f = sc._jvm.com.example.Test.plusOne()
    return Column(f.apply(_to_seq(sc, [col], _to_java_column)))

sc._jsc.addJar("/path/to/udf.jar")
spark.range(100).withColumn("plusOne", plus_one_udf("id")).show()
Scala
package com.example
import org.apache.spark.sql.SparkSession
object Test {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
...
.getOrCreate();
spark.sparkContext.setLogLevel("INFO")
Access DBUtils
You can use dbutils.fs and dbutils.secrets utilities of the Databricks Utilities module. Supported commands
are dbutils.fs.cp , dbutils.fs.head , dbutils.fs.ls , dbutils.fs.mkdirs , dbutils.fs.mv , dbutils.fs.put ,
dbutils.fs.rm , dbutils.secrets.get , dbutils.secrets.getBytes , dbutils.secrets.list ,
dbutils.secrets.listScopes . See File system utility (dbutils.fs) or run dbutils.fs.help() and Secrets utility
(dbutils.secrets) or run dbutils.secrets.help() .
Python
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)
print(dbutils.fs.ls("dbfs:/"))
print(dbutils.secrets.listScopes())
When using Databricks Runtime 7.3 LTS or above, to access the DBUtils module in a way that works both locally
and in Azure Databricks clusters, use the following get_dbutils() :
def get_dbutils(spark):
    from pyspark.dbutils import DBUtils
    return DBUtils(spark)
Scala
dbutils.fs.cp('file:/home/user/data.csv', 'dbfs:/uploads')
dbutils.fs.cp('dbfs:/output/results.csv', 'file:/home/user/downloads/')
The maximum file size that can be transferred that way is 250 MB.
Enable dbutils.secrets.get
Because of security restrictions, the ability to call dbutils.secrets.get is disabled by default. Contact Azure
Databricks support to enable this feature for your workspace.
// list files
> dbfs.listStatus(new Path("dbfs:/"))
res1: Array[org.apache.hadoop.fs.FileStatus] = Array(FileStatus{path=dbfs:/$; isDirectory=true; ...})
// open file
> val stream = dbfs.open(new Path("dbfs:/path/to/your_file"))
stream: org.apache.hadoop.fs.FSDataInputStream = org.apache.hadoop.fs.FSDataInputStream@7aa4ef24
Troubleshooting
Run databricks-connect test to check for connectivity issues. This section describes some common issues you
may encounter and how to resolve them.
Python version mismatch
Check that the Python version you are using locally has at least the same minor release as the version on the cluster
(for example, 3.5.1 versus 3.5.2 is OK, 3.5 versus 3.6 is not).
If you have multiple Python versions installed locally, ensure that Databricks Connect is using the right one by
setting the PYSPARK_PYTHON environment variable (for example, PYSPARK_PYTHON=python3 ).
Server not enabled
Ensure the cluster has the Spark server enabled with spark.databricks.service.server.enabled true . You should
see the following lines in the driver log if it is:
Conflicting SPARK_HOME
If you have previously used Spark on your machine, your IDE may be configured to use one of those other
versions of Spark rather than the Databricks Connect Spark. This can manifest in several ways, including “stream
corrupted” or “class not found” errors. You can see which version of Spark is being used by checking the value of
the SPARK_HOME environment variable:
Java
System.out.println(System.getenv("SPARK_HOME"));
Python
import os
print(os.environ['SPARK_HOME'])
Scala
println(sys.env.get("SPARK_HOME"))
Resolution
If SPARK_HOME is set to a version of Spark other than the one in the client, you should unset the SPARK_HOME
variable and try again.
Check your IDE environment variable settings, your .bashrc , .zshrc , or .bash_profile file, and anywhere else
environment variables might be set. You will most likely have to quit and restart your IDE to purge the old state,
and you may even need to create a new project if the problem persists.
You should not need to set SPARK_HOME to a new value; unsetting it should be sufficient.
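If you want to clear the variable just for the current Python process before creating a SparkSession (shell-level settings still need to be removed from your profile files), a minimal sketch is:

Python

import os

# Remove SPARK_HOME from this process's environment so that the Spark
# distribution bundled with Databricks Connect is picked up instead.
os.environ.pop("SPARK_HOME", None)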
Conflicting or Missing PATH entry for binaries
It is possible your PATH is configured so that commands like spark-shell will be running some other previously
installed binary instead of the one provided with Databricks Connect. This can cause databricks-connect test to
fail. You should make sure either the Databricks Connect binaries take precedence, or remove the previously
installed ones.
If you can’t run commands like spark-shell , it is also possible your PATH was not automatically set up by
pip install and you’ll need to add the installation bin dir to your PATH manually. It’s possible to use
Databricks Connect with IDEs even if this isn’t set up. However, the databricks-connect test command will not
work.
Conflicting serialization settings on the cluster
If you see “stream corrupted” errors when running databricks-connect test , this may be due to incompatible
cluster serialization configs. For example, setting the spark.io.compression.codec config can cause this issue. To
resolve this issue, consider removing these configs from the cluster settings, or setting the configuration in the
Databricks Connect client.
Cannot find winutils.exe on Windows
If you are using Databricks Connect on Windows and see:
ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Either Java or Databricks Connect was installed into a directory with a space in your path. You can work around
this by either installing into a directory path without spaces, or configuring your path using the short name
form.
Python
spark.conf.set("spark.databricks.service.token", new_aad_token)
Scala
spark.conf.set("spark.databricks.service.token", newAADToken)
After you update the token, the application can continue to use the same SparkSession and any objects and
state that are created in the context of the session. To avoid intermittent errors, Databricks recommends that you
provide a new token before the old token expires.
You can extend the lifetime of the Azure Active Directory token to persist during the execution of your
application. To do that, attach a TokenLifetimePolicy with an appropriately long lifetime to the Azure Active
Directory authorization application that you used to acquire the access token.
NOTE
Azure Active Directory passthrough uses two tokens: the Azure Active Directory access token that was previously
described that you configure in Databricks Connect, and the ADLS passthrough token for the specific resource that
Databricks generates while Databricks processes the request. You cannot extend the lifetime of ADLS passthrough tokens
by using Azure Active Directory token lifetime policies. If you send a command to the cluster that takes longer than an
hour, it will fail if the command accesses an ADLS resource after the one hour mark.
Limitations
Databricks Connect does not support the following Azure Databricks features and third-party platforms:
Structured Streaming.
Running arbitrary code that is not a part of a Spark job on the remote cluster.
Native Scala, Python, and R APIs for Delta table operations (for example, DeltaTable.forPath ) are not
supported. However, the SQL API ( spark.sql(...) ) with Delta Lake operations and the Spark API (for
example, spark.read.load ) on Delta tables are both supported.
Copy into.
Apache Zeppelin 0.7.x and below.
Connecting to clusters with table access control.
Connecting to clusters with process isolation enabled (in other words, where
spark.databricks.pyspark.enableProcessIsolation is set to true ).
The Databricks SQL Connector for Python is a Python library that allows you to use Python code to run SQL
commands on Azure Databricks clusters and Databricks SQL warehouses. The Databricks SQL Connector for
Python is easier to set up and use than similar Python libraries such as pyodbc. This library follows PEP 249 –
Python Database API Specification v2.0.
Requirements
A development machine running Python >=3.7, <3.10.
An existing cluster or SQL warehouse.
Get started
Gather the following information for the cluster or SQL warehouse that you want to use:
Cluster
The server hostname of the cluster. You can get this from the Server Hostname value in the
Advanced Options > JDBC/ODBC tab for your cluster.
The HTTP path of the cluster. You can get this from the HTTP Path value in the Advanced Options >
JDBC/ODBC tab for your cluster.
A valid access token. You can use an Azure Databricks personal access token for the workspace. You
can also use an Azure Active Directory access token.
NOTE
As a security best practice, you should not hard-code this information into your code. Instead, you should retrieve
this information from a secure location. For example, the code examples later in this article use environment
variables.
SQL warehouse
The server hostname of the SQL warehouse. You can get this from the Server Hostname value in the
Connection Details tab for your SQL warehouse.
The HTTP path of the SQL warehouse. You can get this from the HTTP Path value in the Connection
Details tab for your SQL warehouse.
A valid access token. You can use an Azure Databricks personal access token for the workspace. You
can also use an Azure Active Directory access token.
NOTE
As a security best practice, you should not hard-code this information into your code. Instead, you should retrieve
this information from a secure location. For example, the code examples later in this article use environment
variables.
Install the Databricks SQL Connector for Python library on your development machine by running
pip install databricks-sql-connector .
Examples
The following code examples demonstrate how to use the Databricks SQL Connector for Python to query and
insert data, query metadata, manage cursors and connections, and configure logging.
These code examples retrieve their server_hostname , http_path , and access_token connection variable values
from these environment variables:
DATABRICKS_SERVER_HOSTNAME , which represents the Server Hostname value from the requirements.
DATABRICKS_HTTP_PATH , which represents the HTTP Path value from the requirements.
DATABRICKS_TOKEN , which represents your access token from the requirements.
You can use other approaches to retrieving these connection variable values. Using environment variables is just
one approach among many.
Query data
Insert data
Query metadata
Manage cursors and connections
Configure logging
Query data
The following code example demonstrates how to call the Databricks SQL Connector for Python to run a basic
SQL command on a cluster or SQL warehouse. This command returns the first two rows from the diamonds
table.
NOTE
The diamonds table is included in the Sample datasets (databricks-datasets).
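The query example itself is not reproduced above; a minimal sketch, using the environment variables described earlier, might look like the following:

Python

from databricks import sql
import os

with sql.connect(server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
                 http_path=os.getenv("DATABRICKS_HTTP_PATH"),
                 access_token=os.getenv("DATABRICKS_TOKEN")) as connection:
    with connection.cursor() as cursor:
        # Return the first two rows of the diamonds sample table.
        cursor.execute("SELECT * FROM default.diamonds LIMIT 2")
        result = cursor.fetchall()
        for row in result:
            print(row)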
Insert data
The following example demonstrates how to insert small amounts of data (thousands of rows):
from databricks import sql
import os
result = cursor.fetchall()
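Only fragments of the insert example remain above; a minimal sketch of the full flow, assuming a hypothetical squares table, could look like this:

Python

from databricks import sql
import os

with sql.connect(server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
                 http_path=os.getenv("DATABRICKS_HTTP_PATH"),
                 access_token=os.getenv("DATABRICKS_TOKEN")) as connection:
    with connection.cursor() as cursor:
        # The table name "squares" is illustrative.
        cursor.execute("CREATE TABLE IF NOT EXISTS squares (x int, x_squared int)")

        squares = [(i, i * i) for i in range(100)]
        values = ",".join([f"({x},{y})" for (x, y) in squares])
        cursor.execute(f"INSERT INTO squares VALUES {values}")

        cursor.execute("SELECT * FROM squares LIMIT 10")
        result = cursor.fetchall()
        for row in result:
            print(row)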
For large amounts of data, you should first upload the data to cloud storage and then execute the COPY INTO
command.
Query metadata
There are dedicated methods for retrieving metadata. The following example retrieves metadata about columns
in a sample table:
cursor = connection.cursor()
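The rest of the metadata example is not shown above; a short sketch that uses the columns method documented later in this article (the schema and table names are illustrative) could be:

cursor.columns(schema_name="default", table_name="squares")
print(cursor.fetchall())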
cursor.close()
connection.close()
Configure logging
The Databricks SQL Connector uses Python’s standard logging module. You can configure the logging level
similar to the following:
import logging

logging.getLogger("databricks.sql").setLevel(logging.DEBUG)
logging.basicConfig(filename = "results.log",
level = logging.DEBUG)
cursor = connection.cursor()
result = cursor.fetchall()
cursor.close()
connection.close()
API reference
Package
Module
Methods
connect method
Classes
Connection class
Methods
close method
cursor method
Cursor class
Attributes
arraysize attribute
description attribute
Methods
cancel method
close method
execute method
executemany method
catalogs method
schemas method
tables method
columns method
fetchall method
fetchmany method
fetchone method
fetchall_arrow method
fetchmany_arrow method
Row class
Methods
asDict method
Type conversions
Package
databricks-sql-connector
Methods
connect method
Parameters
server_hostname
Type: str
The server hostname for the cluster or SQL warehouse. To get the server hostname, see the instructions earlier in this article.
Example: adb-1234567890123456.7.azuredatabricks.net
Parameters
http_path
Type: str
The HTTP path of the cluster or SQL warehouse. To get the HTTP path, see the instructions earlier in this article.
Example:
sql/protocolv1/o/1234567890123456/1234-567890-test123 for a cluster.
/sql/1.0/warehouses/a1b234c567d8e9fa for a SQL warehouse.
access_token
Type: str
Your Azure Databricks personal access token or Azure Active Directory token for the workspace for the cluster or SQL
warehouse. To create a token, see the instructions earlier in this article.
Example: dapi...<the-remaining-portion-of-your-token>
session_configuration
A dictionary of Spark session configuration parameters. Setting a configuration is equivalent to using the SET key=val SQL
command. Run the SQL command SET -v to get a full list of available configurations.
Defaults to None .
http_headers
Additional (key, value) pairs to set in HTTP headers on every RPC request the client makes. Typical usage will not set any extra
HTTP headers. Defaults to None .
catalog
Type: str
Initial catalog to use for the connection. Defaults to None (in which case the default catalog, typically hive_metastore , will
be used).
schema
Type: str
Initial schema to use for the connection. Defaults to None (in which case the default schema default will be used).
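As a brief, illustrative sketch of passing these optional parameters to connect (the configuration key and the catalog and schema values are examples only):

Python

from databricks import sql
import os

connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_TOKEN"),
    # Equivalent to running: SET spark.sql.session.timeZone = UTC
    session_configuration={"spark.sql.session.timeZone": "UTC"},
    catalog="hive_metastore",
    schema="default",
)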
Classes
Connection class
Represents a connection to a database.
Methods
close method
Closes the connection to the database and releases all associated resources on the server. Any additional calls to
this connection will throw an Error .
No parameters.
No return value.
cursor method
Returns a new Cursor object, which you use to run commands and fetch results (see the Cursor class).
arraysize attribute
Used with the fetchmany method, specifies the internal buffer size, which is also how many rows are actually
fetched from the server at a time. The default value is 10000 . For narrow results (results in which each row does
not contain a lot of data), you should increase this value for better performance.
Read-write access.
description attribute
Contains a Python list of tuple objects. Each of these tuple objects contains 7 values, with the first 2 items
of each tuple object containing information describing a single result column as follows:
name : The name of the column.
type_code : A string representing the type of the column. For example, an integer column will have a type
code of int .
The remaining 5 items of each 7-item tuple object are not implemented, and their values are not defined. They
will typically be returned as 4 None values followed by a single True value.
Read-only access.
Methods
cancel method
Interrupts the running of any database query or command that the cursor has started. To release the associated
resources on the server, call the close method after calling the cancel method.
No parameters.
No return value.
close method
Closes the cursor and releases the associated resources on the server. Closing an already closed cursor might
throw an error.
No parameters.
No return value.
execute method
Parameters
operation
Type: str
cursor.execute(
'SELECT * FROM default.diamonds WHERE cut="Ideal" LIMIT 2'
)
cursor.execute(
'SELECT * FROM default.diamonds WHERE cut=%(cut_type)s LIMIT 2',
{ 'cut_type': 'Ideal' }
)
parameters
Type: dictionary
executemany method
Prepares and then runs a database query or command using all parameter sequences in the seq_of_parameters
argument. Only the final result set is retained.
No return value.
Parameters
operation
Type: str
seq_of_parameters
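As a rough illustration of the call shape (the table name is hypothetical, and the named-parameter style mirrors the execute examples earlier in this article):

cursor.executemany(
    "INSERT INTO squares VALUES (%(x)s, %(x_squared)s)",
    [
        {"x": 1, "x_squared": 1},
        {"x": 2, "x_squared": 4},
        {"x": 3, "x_squared": 9},
    ]
)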
catalogs method
Execute a metadata query about the catalogs. Actual results should then be fetched using fetchmany or
fetchall . Important fields in the result set include:
No parameters.
No return value.
Since version 1.0
schemas method
Execute a metadata query about the schemas. Actual results should then be fetched using fetchmany or
fetchall . Important fields in the result set include:
No return value.
Since version 1.0
Parameters
catalog_name
Type: str
schema_name
Type: str
tables method
Execute a metadata query about tables and views. Actual results should then be fetched using fetchmany or
fetchall . Important fields in the result set include:
Field name: TABLE_CAT . Type: str . The catalog to which the table belongs.
Field name: TABLE_SCHEM . Type: str . The schema to which the table belongs.
Field name: TABLE_NAME . Type: str . The name of the table.
Field name: TABLE_TYPE . Type: str . The kind of relation, for example VIEW or TABLE (applies to Databricks
Runtime 10.2 and above as well as to Databricks SQL; prior versions of the Databricks Runtime return an
empty string).
No return value.
Since version 1.0
Parameters
catalog_name
Type: str
schema_name
Type: str
table_name
Type: str
table_types
Type: List[str]
columns method
Execute a metadata query about the columns. Actual results should then be fetched using fetchmany or
fetchall . Important fields in the result set include:
Field name: TABLE_CAT . Type: str . The catalog to which the column belongs.
Field name: TABLE_SCHEM . Type: str . The schema to which the column belongs.
Field name: TABLE_NAME . Type: str . The name of the table to which the column belongs.
Field name: COLUMN_NAME . Type: str . The name of the column.
No return value.
Since version 1.0
Parameters
catalog_name
Type: str
schema_name
Type: str
table_name
Type: str
column_name
Type: str
fetchall method
Throws an Error if the previous call to the execute method did not return any data or no execute call has yet
been made.
fetchmany method
Parameters
size
Type: int
This parameter is optional. If not specified, the value of the arraysize attribute is used.
Example: cursor.fetchmany(10)
fetchone method
Gets the next row of the dataset, or None if there is no more data.
fetchall_arrow method
Gets all (or all remaining) rows of a query, as a PyArrow Table object. Queries returning very large amounts of
data should use fetchmany_arrow instead to reduce memory consumption.
No parameters.
Returns all (or all remaining) rows of the query as a PyArrow table.
Throws an Error if the previous call to the execute method did not return any data or no execute call has yet
been made.
Since version 2.0
fetchmany_arrow method
Parameters
size
Type: int
This parameter is optional. If not specified, the value of the arraysize attribute is used.
Example: cursor.fetchmany_arrow(10)
Row class
The row class is a tuple-like data structure that represents an individual result row. If the row contains a column
with the name "my_column" , you can access the "my_column" field of row via row.my_column . You can also use
numeric indices to access fields, for example row[0] . If the column name is not allowed as an attribute method
name (for example, it begins with a digit), then you can access the field as row["1_my_column"] .
Since version 1.0
Methods
asDict method
Return a dictionary representation of the row, which is indexed by field names. If there are duplicate field names,
one of the duplicate fields (but only one) will be returned in the dictionary. Which duplicate field is returned is
not defined.
No parameters.
Returns a dict of fields.
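As a brief, illustrative sketch of working with Row objects (the column name my_column is hypothetical, and cursor is assumed to have already run a query):

Python

row = cursor.fetchone()     # a single Row, or None if there are no more rows
print(row.my_column)        # access a field by attribute name
print(row[0])               # access a field by numeric index
print(row.asDict())         # convert the row to a dict keyed by field name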
Type conversions
The following table maps Apache Spark SQL data types to their Python data type equivalents.
SPARK SQL TYPE        PYTHON TYPE
array                 str
bigint                int
binary                bytearray
boolean               bool
date                  datetime.date
decimal               decimal.Decimal
double                float
int                   int
map                   str
null                  NoneType
smallint              int
string                str
struct                str
timestamp             datetime.datetime
tinyint               int
Troubleshooting
tokenAuthWrapperInvalidAccessToken: Invalid access token message
Issue : When you run your code, you see a message similar to
Error during request to server: tokenAuthWrapperInvalidAccessToken: Invalid access token .
Possible cause : The value passed to access_token is not a valid Azure Databricks personal access token.
Recommended fix : Check that the value passed to access_token is correct and try again.
gaierror(8, 'nodename nor servname provided, or not known') message
Issue : When you run your code, you see a message similar to
Error during request to server: gaierror(8, 'nodename nor servname provided, or not known') .
Possible cause : The value passed to server_hostname is not the correct host name.
Recommended fix : Check that the value passed to server_hostname is correct and try again.
For more information on finding the server hostname, see Retrieve the connection details.
IpAclError message
Issue : When you run your code, you see the message Error during request to server: IpAclValidation when
you try to use the connector on an Azure Databricks notebook.
Possible cause : You may have IP allow listing enabled for the Azure Databricks workspace. With IP allow listing,
connections from Spark clusters back to the control plane are not allowed by default.
Recommended fix : Ask your administrator to add the data plane subnet to the IP allow list.
Additional resources
For more information, see:
Data types (Databricks SQL)
Built-in Types (for bool , bytearray , float , int , and str ) on the Python website
datetime (for datetime.date and datetime.datetime ) on the Python website
decimal (for decimal.Decimal ) on the Python website
Built-in Constants (for NoneType ) on the Python website
Databricks SQL Driver for Go
7/21/2022 • 2 minutes to read
IMPORTANT
The Databricks SQL Driver for Go is provided as-is and is not officially supported by Databricks through customer
technical support channels. Support, questions, and feature requests can be communicated through the Issues page of
the databricks/databricks-sql-go repo on GitHub. Issues with the use of this code will not be answered or investigated by
Databricks Support.
The Databricks SQL Driver for Go is a Go library that allows you to use Go code to run SQL commands on Azure
Databricks compute resources.
Requirements
A development machine running Go, version 1.18 or higher. To print the installed version of Go, run the
command go version . Download and install Go.
An existing cluster or SQL warehouse.
Display clusters.
Create a cluster.
View SQL warehouses.
Create a SQL warehouse.
The Server Hostname and HTTP Path values for the existing cluster or SQL warehouse.
Get these values for a cluster.
Get these values for a SQL warehouse.
An Azure Databricks personal access token.
Generate a personal access token.
Manage personal access tokens.
NOTE
The Databricks SQL Driver for Go does not support Azure Active Directory (Azure AD) tokens for authentication.
databricks://:dapi1ab2c34defabc567890123d4efa56789@adb-
1234567890123456.7.azuredatabricks.net/sql/1.0/endpoints/a1b234c5678901d2
NOTE
As a security best practice, you should not hard-code this DSN connection string into your Go code. Instead, you should
retrieve this DSN connection string from a secure location. For example, the code example later in this article uses an
environment variable.
Query data
The following code example demonstrates how to call the Databricks SQL Driver for Go to run a basic SQL
query on an Azure Databricks compute resource. This command returns the first two rows from the diamonds
table.
NOTE
The diamonds table is included in the Sample datasets (databricks-datasets).
This code example retrieves the DSN connection string from an environment variable named DATABRICKS_DSN .
package main
import (
"database/sql"
"fmt"
"os"
_ "github.com/databricks/databricks-sql-go"
)
func main() {
dsn := os.Getenv("DATABRICKS_DSN")
if dsn == "" {
panic("No connection string found." +
"Set the DATABRICKS_DSN environment variable, and try again.")
}
// Open a database handle; the driver registers itself under the name "databricks".
db, err := sql.Open("databricks", dsn)
if err != nil {
panic(err)
}
defer db.Close()
var (
_c0 string
carat string
cut string
color string
clarity string
depth string
table string
price string
x string
y string
z string
)
// Query the first two rows of the diamonds table.
rows, err := db.Query("SELECT * FROM default.diamonds LIMIT 2")
if err != nil {
panic(err)
}
defer rows.Close()
for rows.Next() {
err := rows.Scan(&_c0,
&carat,
&cut,
&color,
&clarity,
&depth,
&table,
&price,
&x,
&y,
&z)
if err != nil {
panic(err)
}
fmt.Print(_c0, ",",
carat, ",",
cut, ",",
color, ",",
clarity, ",",
depth, ",",
table, ",",
price, ",",
x, ",",
y, ",",
z, "\n")
}
err = rows.Err()
if err != nil {
panic(err)
}
}
Output:
1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
For additional examples, see the examples folder in the databricks/databricks-sql-go repository on GitHub.
Additional resources
The Databricks SQL Driver for Go repository on GitHub
Go database/sql tutorial
The database/sql package home page
Databricks SQL Driver for Node.js
7/21/2022 • 3 minutes to read
IMPORTANT
The Databricks SQL Driver for Node.js is provided as-is and is not officially supported by Databricks through customer
technical support channels. Support, questions, and feature requests can be communicated through the Issues page of
the databricks/databricks-sql-nodejs repo on GitHub. Issues with the use of this code will not be answered or investigated
by Databricks Support.
The Databricks SQL Driver for Node.js is a Node.js library that allows you to use JavaScript code to run SQL
commands on Azure Databricks compute resources.
Requirements
A development machine running Node.js, version 14 or higher. To print the installed version of Node.js,
run the command node -v . To install and use different versions of Node.js, you can use tools such as
Node Version Manager (nvm).
Node Package Manager ( npm ). Later versions of Node.js already include npm . To check whether npm is
installed, run the command npm -v . To install npm if needed, you can follow instructions such as the
ones at Download and install npm.
The @databricks/sql package from npm. To install the @databricks/sql package in your Node.js project,
use npm to run the following command from within the same directory as your project:
npm i @databricks/sql
NOTE
The Databricks SQL Driver for Node.js does not support Azure Active Directory (Azure AD) tokens for
authentication.
Specify the connection variables
To access your cluster or SQL warehouse, the Databricks SQL Driver for Node.js uses connection variables
named token , server_hostname and http_path , representing your Azure Databricks personal access token and
your cluster’s or SQL warehouse’s Server Hostname and HTTP Path values, respectively.
The Azure Databricks personal access token value for token is similar to the following:
dapi1ab2c34defabc567890123d4efa56789 .
The Server Hostname value for server_hostname is similar to the following:
adb-1234567890123456.7.azuredatabricks.net .
The HTTP Path value for http_path is similar to the following: for a cluster,
sql/protocolv1/o/1234567890123456/1234-567890-abcdefgh ; and for a SQL warehouse,
/sql/1.0/endpoints/a1b234c5678901d2 .
NOTE
As a security best practice, you should not hard code these connection variable values into your code. Instead, you should
retrieve these connection variable values from a secure location. For example, the code example later in this article uses
environment variables.
Query data
The following code example demonstrates how to call the Databricks SQL Driver for Node.js to run a basic SQL
query on an Azure Databricks compute resource. This command returns the first two rows from the diamonds
table.
NOTE
The diamonds table is included in the Sample datasets (databricks-datasets).
This code example retrieves the token , server_hostname and http_path connection variable values from a set
of environment variables. These environment variables have the following environment variable names:
DATABRICKS_TOKEN , which represents your Azure Databricks personal access token from the requirements.
DATABRICKS_SERVER_HOSTNAME , which represents the Server Hostname value from the requirements.
DATABRICKS_HTTP_PATH , which represents the HTTP Path value from the requirements.
You can use other approaches to retrieving these connection variable values. Using environment variables is just
one approach among many.
const { DBSQLClient } = require('@databricks/sql');
const token = process.env.DATABRICKS_TOKEN;
const server_hostname = process.env.DATABRICKS_SERVER_HOSTNAME;
const http_path = process.env.DATABRICKS_HTTP_PATH;
const client = new DBSQLClient();
const utils = DBSQLClient.utils;
client.connect(
options = {
token: token,
host: server_hostname,
path: http_path
}).then(
async client => {
const session = await client.openSession();
const queryOperation = await session.executeStatement(
statement = 'SELECT * FROM default.diamonds LIMIT 2',
options = { runAsync: true });
await utils.waitUntilReady(
operation = queryOperation,
progress = false,
callback = () => {});
const result = await utils.fetchAll(
operation = queryOperation
);
await queryOperation.close();
console.table(result);
await session.close();
client.close();
}).catch(error => {
console.log(error);
});
Output:
┌─────────┬─────┬────────┬───────────┬───────┬─────────┬────────┬───────┬───────┬────────┬────────┬────────┐
│ (index) │ _c0 │ carat │ cut │ color │ clarity │ depth │ table │ price │ x │ y │ z │
├─────────┼─────┼────────┼───────────┼───────┼─────────┼────────┼───────┼───────┼────────┼────────┼────────┤
│ 0 │ '1' │ '0.23' │ 'Ideal' │ 'E' │ 'SI2' │ '61.5' │ '55' │ '326' │ '3.95' │ '3.98' │ '2.43' │
│ 1 │ '2' │ '0.21' │ 'Premium' │ 'E' │ 'SI1' │ '59.8' │ '61' │ '326' │ '3.89' │ '3.84' │ '2.31' │
└─────────┴─────┴────────┴───────────┴───────┴─────────┴────────┴───────┴───────┴────────┴────────┴────────┘
For additional examples, see the examples folder in the databricks/databricks-sql-nodejs repository on GitHub.
Additional resources
The Databricks SQL Driver for Node.js repository on GitHub
Getting started with the Databricks SQL Driver for Node.js
Troubleshooting the Databricks SQL Driver for Node.js
Connect Python and pyodbc to Azure Databricks
7/21/2022 • 12 minutes to read
You can connect from your local Python code through ODBC to data in a Databricks cluster or SQL warehouse.
To do this, you can use the open source Python code module pyodbc .
Follow these instructions to install, configure, and use pyodbc .
For more information about pyodbc , see the pyodbc Wiki.
NOTE
Databricks offers the Databricks SQL Connector for Python as an alternative to pyodbc . The Databricks SQL Connector
for Python is easier to set up and use, and has a more robust set of coding constructs, than pyodbc . However, pyodbc
may have better performance when fetching query results larger than 10 MB.
Requirements
A local development machine running one of the following:
macOS
Windows
A Unix or Linux distribution that supports .rpm or .deb files
pip.
For Unix, Linux, or macOS, Homebrew.
An Azure Databricks cluster, a Databricks SQL warehouse, or both. For more information, see Create a cluster
and Create a SQL warehouse.
Follow the instructions for Unix, Linux, or macOS or for Windows.
NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.
TIP
If you do not want to or cannot use the /etc/odbc.ini file on your machine, you can specify connection details
directly in Python code. To do this, skip the rest of this step and proceed to Step 3: Test your configuration.
Cluster
[Databricks_Cluster]
Driver = <driver-path>
Description = Simba Spark ODBC Driver DSN
HOST = <server-hostname>
PORT = 443
Schema = default
SparkServerType = 3
AuthMech = 3
UID = token
PWD = <personal-access-token>
ThriftTransport = 2
SSL = 1
HTTPPath = <http-path>
In the preceding configuration file, replace the following placeholders, and then save the file:
Replace <driver-path> with one of the following:
macOS : /Library/simba/spark/lib/libsparkodbc_sbu.dylib
Linux 64-bit : /opt/simba/spark/lib/64/libsparkodbc_sb64.so
Linux 32-bit : /opt/simba/spark/lib/32/libsparkodbc_sb32.so
Replace <server-hostname> with the Server Hostname value from the Advanced Options >
JDBC/ODBC tab for your cluster.
Replace <personal-access-token> with the value of your personal access token for your Azure
Databricks workspace.
Replace <http-path> with the HTTP Path value from the Advanced Options > JDBC/ODBC tab for
your cluster.
TIP
To allow pyodbc to switch connections to a different cluster, add an entry to the [ODBC Data Sources] section
and a matching entry below [Databricks_Cluster] with the specific connection details. Each entry must have a
unique name within this file.
SQL warehouse
[SQL_Warehouse]
Driver = <driver-path>
HOST = <server-hostname>
PORT = 443
Schema = default
SparkServerType = 3
AuthMech = 3
UID = token
PWD = <personal-access-token>
ThriftTransport = 2
SSL = 1
HTTPPath = <http-path>
In the preceding configuration file, replace the following placeholders, and then save the file:
Replace <driver-path> with one of the following:
macOS : /Library/simba/spark/lib/libsparkodbc_sbu.dylib
Linux 64-bit : /opt/simba/spark/lib/64/libsparkodbc_sb64.so
Linux 32-bit : /opt/simba/spark/lib/32/libsparkodbc_sb32.so
Replace <server-hostname> with the Server Hostname value from the Connection Details tab for
your SQL warehouse.
Replace <personal-access-token> with the value of your personal access token for your SQL
warehouse.
Replace <http-path> with the HTTP Path value from the Connection Details tab for your SQL
warehouse.
TIP
To allow pyodbc to switch connections to a different SQL warehouse, add an entry to the
[ODBC Data Sources] section and a matching entry below [SQL_Warehouse] with the specific connection
details. Each entry must have a unique name within this file.
2. Add the preceding information you just added to the /etc/odbc.ini file to the corresponding
/usr/local/etc/odbc.ini file on your machine as well.
3. Add the following information at the end of the /etc/odbcinst.ini file on your machine:
[ODBC Drivers]
Simba Spark ODBC Driver = Installed
[Simba Spark ODBC Driver]
Driver = <driver-path>
In the preceding content, replace <driver-path> with one of the following values, and then save the file:
macOS : /Library/simba/spark/lib/libsparkodbc_sbu.dylib
Linux 64-bit : /opt/simba/spark/lib/64/libsparkodbc_sb64.so
Linux 32-bit : /opt/simba/spark/lib/32/libsparkodbc_sb32.so
4. Add the information you just added to the /etc/odbcinst.ini file to the corresponding
/usr/local/etc/odbcinst.ini file on your machine as well.
5. Add the following information at the end of the simba.sparkodbc.ini file on your machine, and then save
the file. For macOS, this file is in /Library/simba/spark/lib .
DriverManagerEncoding=UTF-16
ODBCInstLib=/usr/local/Cellar/unixodbc/2.3.9/lib/libodbcinst.dylib
import pyodbc
NOTE
If you skipped Step 2: Configure software and did not use an /etc/odbc.ini file, then specify connection details
in the call to pyodbc.connect , for example:
conn = pyodbc.connect("Driver=<driver-path>;" +
"HOST=<server-hostname>;" +
"PORT=443;" +
"Schema=default;" +
"SparkServerType=3;" +
"AuthMech=3;" +
"UID=token;" +
"PWD=<personal-access-token>;" +
"ThriftTransport=2;" +
"SSL=1;" +
"HTTPPath=<http-path>",
autocommit=True)
Replace the placeholders with the values as described in Step 2: Configure software.
2. To speed up running the code, start the cluster that corresponds to the HTTPPath setting in your
odbc.ini file.
3. Run the pyodbc-test-cluster.py file with your Python interpreter. The first two rows of the database table
are displayed.
To query by using a SQL warehouse:
1. Create a file named pyodbc-test-warehouse.py . Replace <table-name> with the name of the database table
to query, and then save the file.
import pyodbc
NOTE
If you skipped Step 2: Configure software and did not use an /etc/odbc.ini file, then specify connection details
in the call to pyodbc.connect , for example:
conn = pyodbc.connect("Driver=<driver-path>;" +
"HOST=<server-hostname>;" +
"PORT=443;" +
"Schema=default;" +
"SparkServerType=3;" +
"AuthMech=3;" +
"UID=token;" +
"PWD=<personal-access-token>;" +
"ThriftTransport=2;" +
"SSL=1;" +
"HTTPPath=<http-path>",
autocommit=True)
Replace the placeholders with the values as described in Step 2: Configure software.
2. To speed up running the code, start the SQL warehouse that corresponds to the HTTPPath setting in your
odbc.ini file.
3. Run the pyodbc-test-warehouse.py file with your Python interpreter. The first two rows of the database
table are displayed.
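For reference, a minimal sketch of such a test file might look like the following; it assumes the [SQL_Warehouse] DSN from the preceding odbc.ini example and uses <table-name> as the placeholder for your table:
import pyodbc

# Connect through the DSN configured in the odbc.ini file.
conn = pyodbc.connect("DSN=SQL_Warehouse", autocommit=True)
cursor = conn.cursor()

# Fetch and print the first two rows of the table.
cursor.execute("SELECT * FROM <table-name> LIMIT 2")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()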
Next steps
To run the Python test code against a different cluster or SQL warehouse, change the settings in the
preceding two odbc.ini files. Or add a new entry to the [ODBC Data Sources] section, along with matching
connection details, to the two odbc.ini files. Then change the DSN name in the test code to match the
related name in [ODBC Data Sources] .
To run the Python test code against a different database table, change the table_name value.
To run the Python test code with a different SQL query, change the execute command string.
Windows
If your local Python code is running on a Windows machine, follow these instructions.
Step 1: Install software
1. Download the Databricks ODBC driver.
2. To install the Databricks ODBC driver, open the SimbaSparkODBC.zip file that you downloaded.
3. Double-click the extracted Simba Spark.msi file, and follow any on-screen directions.
4. Install the pyodbc module: from an administrative command prompt, run pip install pyodbc . For more
information, see pyodbc on the PyPI website and Install in the pyodbc Wiki.
Step 2: Configure software
Specify connection details for the Azure Databricks cluster or Databricks SQL warehouse for pyodbc to use.
To specify connection details for a cluster:
1. Add a data source name (DSN) that contains information about your cluster: start the ODBC Data Sources
application: on the Start menu, begin typing ODBC , and then click ODBC Data Sources .
2. On the User DSN tab, click Add . In the Create New Data Source dialog box, click Simba Spark ODBC
Driver , and then click Finish .
3. In the Simba Spark ODBC Driver DSN Setup dialog box, change the following values:
Data Source Name : Databricks_Cluster
Description : My cluster
Spark Server Type : SparkThriftServer (Spark 1.1 and later)
Host(s) : The Server Hostname value from the Advanced Options, JDBC/ODBC tab for your cluster.
Port : 443
Database : default
Mechanism : User Name and Password
User Name : token
Password : The value of your personal access token for your Azure Databricks workspace.
Thrift Transport : HTTP
4. Click HTTP Options . In the HTTP Properties dialog box, for HTTP Path , enter the HTTP Path value from
the Advanced Options, JDBC/ODBC tab for your cluster, and then click OK .
5. Click SSL Options . In the SSL Options dialog box, check the Enable SSL box, and then click OK .
6. Click Test . If the test succeeds, click OK .
TIP
To allow pyodbc to switch connections to a different cluster, repeat this procedure with the specific connection details.
Each DSN must have a unique name.
TIP
To allow pyodbc to switch connections to a different SQL warehouse, repeat this procedure with the specific
connection details. Each DSN must have a unique name.
import pyodbc
2. To speed up running the code, start the cluster that corresponds to the Host(s) value in the Simba
Spark ODBC Driver DSN Setup dialog box for your Azure Databricks cluster.
3. Run the pyodbc-test-cluster.py file with your Python interpreter. The first two rows of the database table
are displayed.
To query by using a SQL warehouse:
1. Create a file named pyodbc-test-warehouse.py . Replace <table-name> with the name of the database table
to query, and then save the file.
import pyodbc
2. To speed up running the code, start the SQL warehouse that corresponds to the Host(s) value in the
Simba Spark ODBC Driver DSN Setup dialog box for your Databricks SQL warehouse.
3. Run the pyodbc-test-warehouse.py file with your Python interpreter. The first two rows of the database
table are displayed.
Next steps
To run the Python test code against a different cluster or SQL warehouse, change the Host(s) value in the
Simba Spark ODBC Driver DSN Setup dialog box for your Azure Databricks cluster or Databricks SQL
warehouse. Or create a new DSN. Then change the DSN name in the test code to match the related Data
Source Name .
To run the Python test code against a different database table, change the table_name value.
To run the Python test code with a different SQL query, change the execute command string.
Troubleshooting
This section addresses common issues when using pyodbc with Databricks.
Unicode decode error
Issue : You receive an error message similar to the following:
Cause : An issue exists in pyodbc version 4.0.31 or below that could manifest with such symptoms when
running queries that return columns with long names or a long error message. The issue has been fixed by a
newer version of pyodbc .
Solution : Upgrade your installation of pyodbc to version 4.0.32 or above.
General troubleshooting
See Issues in the mkleehammer/pyodbc repository on GitHub.
Databricks CLI
7/21/2022 • 8 minutes to read
The Databricks command-line interface (CLI) provides an easy-to-use interface to the Azure Databricks platform.
The open source project is hosted on GitHub. The CLI is built on top of the Databricks REST API 2.0 and is
organized into command groups based on the Cluster Policies API 2.0, Clusters API 2.0, DBFS API 2.0, Groups
API 2.0, Instance Pools API 2.0, Jobs API 2.1, Libraries API 2.0, Delta Live Tables API 2.0, Repos API 2.0, Secrets API
2.0, Token API 2.0, and Workspace API 2.0 through the cluster-policies , clusters , fs , groups ,
instance-pools , jobs and runs , libraries , repos , secrets , tokens , and workspace command groups,
respectively.
For example, you can use the Databricks CLI to do things such as:
Provision compute resources in Azure Databricks workspaces.
Run data processing and data analysis tasks.
List, import, and export notebooks and folders in workspaces.
IMPORTANT
This CLI is under active development and is released as an Experimental client. This means that interfaces are still subject
to change.
IMPORTANT
On macOS, the default Python 2 installation does not implement the TLSv1_2 protocol and running the CLI with
this Python installation results in the error:
AttributeError: 'module' object has no attribute 'PROTOCOL_TLSv1_2' . Use Homebrew to install a version
of Python that has ssl.PROTOCOL_TLSv1_2 .
Limitations
Using the Databricks CLI with firewall enabled storage containers is not supported. Databricks recommends you
use Databricks Connect or az storage.
Install the CLI
Run pip install databricks-cli using the appropriate version of pip for your Python installation:
To list the version of the CLI that is currently installed, run databricks --version (or databricks -v ):
databricks --version
# Or...
databricks -v
Set up authentication
NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.
Before you can run CLI commands, you must set up authentication. To authenticate to the CLI you can use a
Databricks personal access token or an Azure Active Directory (Azure AD) token.
Set up authentication using an Azure AD token
To configure the CLI using an Azure AD token, generate the Azure AD token and store it in the environment
variable DATABRICKS_AAD_TOKEN .
Unix, Linux, macOS
export DATABRICKS_AAD_TOKEN=<Azure-AD-token>
Windows
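One way to set this variable on Windows (shown here as an illustrative sketch) is with setx , which persists the value for new Command Prompt sessions:
setx DATABRICKS_AAD_TOKEN "<Azure-AD-token>"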
[DEFAULT]
host = <workspace-URL>
token = <Azure-AD-token>
Token:
After you complete the prompts, your access credentials are stored in the file ~/.databrickscfg on Unix, Linux,
or macOS, or %USERPROFILE%\.databrickscfg on Windows. The file contains a default profile entry:
[DEFAULT]
host = <workspace-URL>
token = <personal-access-token>
For CLI 0.8.1 and above, you can change the path of this file by setting the environment variable
DATABRICKS_CONFIG_FILE .
export DATABRICKS_CONFIG_FILE=<path-to-file>
Windows
IMPORTANT
The CLI does not work with a .netrc file. You can have a .netrc file in your environment for other purposes, but the CLI
will not use that .netrc file.
CLI 0.8.0 and above supports the following environment variables:
DATABRICKS_HOST
DATABRICKS_TOKEN
An environment variable setting takes precedence over the setting in the configuration file.
Test your authentication setup
To check whether you set up authentication correctly, you can run a command such as the following, replacing
<someone@example.com> with your Azure Databricks workspace username:
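For example, assuming the workspace command group:
databricks workspace ls /Users/<someone@example.com>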
If successful, this command lists the objects in the specified workspace path.
Connection profiles
The Databricks CLI configuration supports multiple connection profiles. The same installation of Databricks CLI
can be used to make API calls on multiple Azure Databricks workspaces.
To add a connection profile, specify a unique name for the profile:
[<profile-name>]
host = <workspace-URL>
token = <token>
If --profile <profile-name> is not specified, the default profile is used. If a default profile is not found, you are
prompted to configure the CLI with a default profile.
Test your connection profiles
To check whether you set up your connection profiles correctly, you can run a command such as the following,
replacing <someone@example.com> with your Azure Databricks workspace username and <DEFAULT> with one of
your connection profile names:
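For example, again assuming the workspace command group:
databricks workspace ls --profile <DEFAULT> /Users/<someone@example.com>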
If successful, this command lists the objects in the specified workspace path in the workspace for the specified
connection profile. Run this command for each connection profile that you want to test.
Alias command groups
Sometimes it can be inconvenient to prefix each CLI invocation with the name of a command group, for example
databricks workspace ls . To make the CLI easier to use, you can alias command groups to shorter commands.
For example, to shorten databricks workspace ls to dw ls in the Bourne again shell, you can add
alias dw="databricks workspace" to the appropriate bash profile. Typically, this file is located at ~/.bash_profile .
TIP
Azure Databricks already aliases databricks fs to dbfs ; databricks fs ls and dbfs ls are equivalent.
databricks fs -h
databricks fs cp -h
For example, the following command prints the settings of the job with the ID of 233.
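One likely form of that command, assuming the jobs command group and jq to extract the settings field from the response:
databricks jobs get --job-id 233 | jq .settings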
{
"name": "Quickstart",
"new_cluster": {
"spark_version": "7.5.x-scala2.12",
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"num_workers": 8,
...
},
"email_notifications": {},
"timeout_seconds": 0,
"notebook_task": {
"notebook_path": "/Quickstart"
},
"max_concurrent_runs": 1
}
As another example, the following command prints the names and IDs of all available clusters in the workspace:
databricks clusters list --output JSON | jq '[ .clusters[] | { name: .cluster_name, id: .cluster_id } ]'
[
{
"name": "My Cluster 1",
"id": "1234-567890-grip123"
},
{
"name": "My Cluster 2",
"id": "2345-678901-patch234"
}
]
You can install jq for example on macOS using Homebrew with brew install jq or on Windows using
Chocolatey with choco install jq . For more information on jq , see the jq Manual.
JSON string parameters
String parameters are handled differently depending on your operating system:
Unix, Linux, macOS
You must enclose JSON string parameters in single quotes. For example:
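An illustrative call (the job ID and notebook parameters here are placeholders):
databricks jobs run-now --job-id 9 --notebook-params '{"name": "john doe", "age": "35"}'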
Windows
You must enclose JSON string parameters in double quotes, and the quote characters inside the string must be
preceded by \ . For example:
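The same illustrative call on Windows:
databricks jobs run-now --job-id 9 --notebook-params "{\"name\": \"john doe\", \"age\": \"35\"}"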
Troubleshooting
The following sections provide tips for troubleshooting common issues with the Databricks CLI.
Using EOF with databricks configure does not work
For Databricks CLI 0.12.0 and above, using the end of file ( EOF ) sequence in a script to pass parameters to the
databricks configure command does not work. For example, the following script causes Databricks CLI to
ignore the parameters, and no error message is thrown:
# Do not do this.
databricksUrl=<per-workspace-url>
databricksToken=<personal-access-token-or-Azure-AD-token>
CLI commands
Cluster Policies CLI
Clusters CLI
DBFS CLI
Delta Live Tables CLI
Groups CLI
Instance Pools CLI
Jobs CLI
Libraries CLI
Repos CLI
Runs CLI
Secrets CLI
Stack CLI
Tokens CLI
Unity Catalog CLI
Workspace CLI
Databricks SQL CLI
7/21/2022 • 7 minutes to read
IMPORTANT
The Databricks SQL CLI is provided as-is and is not officially supported by Databricks through customer technical support
channels. Support, questions, and feature requests can be communicated through the Issues page of the
databricks/databricks-sql-cli repo on GitHub. Issues with the use of this code will not be answered or investigated by
Databricks Support.
The Databricks SQL command line interface (Databricks SQL CLI) enables you to run SQL queries on your
existing Databricks SQL warehouses from your terminal or Windows Command Prompt instead of from
locations such as the Databricks SQL editor or an Azure Databricks notebook. From the command line, you get
productivity features such as suggestions and syntax highlighting.
Requirements
At least one Databricks SQL warehouse. View your available warehouses. Create a warehouse, if you do not
already have one.
Your warehouse’s connection details. Specifically, you need the Server hostname and HTTP path values.
An Azure Databricks personal access token. Create a personal access token, if you do not already have one.
Python 3.7 or higher. To check whether you have Python installed, run the command python --version from
your terminal or Command Prompt. (On some systems, you may need to enter python3 instead.) Install
Python, if you do not have it already installed.
pip, the package installer for Python. Newer versions of Python install pip by default. To check whether you
have pip installed, run the command pip --version from your terminal or Command Prompt. (On some
systems, you may need to enter pip3 instead.) Install pip, if you do not have it already installed.
The Databricks SQL CLI package from the Python Packaging Index (PyPI). You can use pip to install the
Databricks SQL CLI package from PyPI by running pip install databricks-sql-cli or
python -m pip install databricks-sql-cli .
(Optional) A utility for creating and managing Python virtual environments, such as venv, virtualenv, or
pipenv. Virtual environments help to ensure that you are using the correct versions of Python and the
Databricks SQL CLI together. Setting up and using virtual environments is outside of the scope of this article.
For more information, see Creating Virtual Environments.
Authentication
You must provide the Databricks SQL CLI with authentication details for your Databricks SQL warehouse, so that
the target warehouse is called with the proper access credentials. You can provide this information in several
ways:
In the dbsqlclirc settings file in its default location (or by specifying an alternate settings file through the
--clirc option each time you run a command with the Databricks SQL CLI). See Settings file.
By setting the DBSQLCLI_HOST_NAME , DBSQLCLI_HTTP_PATH and DBSQLCLI_ACCESS_TOKEN environment variables.
See Environment variables.
By specifying the --hostname , --http-path , and --access-token options each time you run a command with
the Databricks SQL CLI. See Command options.
Whenever you run the Databricks SQL CLI, it looks for authentication details in the following order, stopping
when it finds the first set of details:
1. The --hostname , --http-path , and --access-token options.
2. The DBSQLCLI_HOST_NAME , DBSQLCLI_HTTP_PATH and DBSQLCLI_ACCESS_TOKEN environment variables.
3. The dbsqlclirc settings file in its default location (or an alternate settings file specified by the --clirc
option).
Settings file
To use the dbsqlclirc settings file to provide the Databricks SQL CLI with authentication details for your
Databricks SQL warehouse, run the Databricks SQL CLI for the first time, as follows:
dbsqlcli
The Databricks SQL CLI creates a settings file for you, at ~/.dbsqlcli/dbsqlclirc on Unix, Linux, and macOS, and
at %HOMEDRIVE%%HOMEPATH%\.dbsqlcli\dbsqlclirc or %USERPROFILE%\.dbsqlcli\dbsqlclirc on Windows. To
customize this file:
1. Use a text editor to open and edit the dbsqlclirc file.
2. Scroll to the following section:
# [credentials]
# host_name = ""
# http_path = ""
# access_token = ""
[credentials]
host_name = "adb-12345678901234567.8.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/1abc2d3456e7f890a"
access_token = "dapi1234567890b2cd34ef5a67bc8de90fa12b"
Alternatively, instead of using the dbsqlclirc file in its default location, you can specify a file in a different
location by adding the --clirc command option and the path to the alternate file. That alternate file’s contents
must conform to the preceding syntax.
Environment variables
To use the DBSQLCLI_HOST_NAME , DBSQLCLI_HTTP_PATH , and DBSQLCLI_ACCESS_TOKEN environment variables to
provide the Databricks SQL CLI with authentication details for your Databricks SQL warehouse, do the following:
Unix, Linux, and macOS
To set the environment variables for only the current terminal session, run the following commands. To set the
environment variables for all terminal sessions, enter the following commands into your shell’s startup file and
then restart your terminal. In the following commands, replace the value of:
DBSQLCLI_HOST_NAME with your warehouse’s Server hostname value from the requirements.
DBSQLCLI_HTTP_PATH with your warehouse’s HTTP path value from the requirements.
DBSQLCLI_ACCESS_TOKEN with your personal access token value from the requirements.
export DBSQLCLI_HOST_NAME="adb-12345678901234567.8.azuredatabricks.net"
export DBSQLCLI_HTTP_PATH="/sql/1.0/warehouses/1abc2d3456e7f890a"
export DBSQLCLI_ACCESS_TOKEN="dapi1234567890b2cd34ef5a67bc8de90fa12b"
Windows
To set the environment variables for only the current Command Prompt session, run the following commands,
replacing the value of:
DBSQLCLI_HOST_NAME with your warehouse’s Server hostname value from the requirements.
DBSQLCLI_HTTP_PATH with your warehouse’s HTTP path value from the requirements.
DBSQLCLI_ACCESS_TOKEN with your personal access token value from the requirements:
set DBSQLCLI_HOST_NAME="adb-12345678901234567.8.azuredatabricks.net"
set DBSQLCLI_HTTP_PATH="/sql/1.0/warehouses/1abc2d3456e7f890a"
set DBSQLCLI_ACCESS_TOKEN="dapi1234567890b2cd34ef5a67bc8de90fa12b"
To set the environment variables for all Command Prompt sessions, run the following commands and then
restart your Command Prompt, replacing the value of:
DBSQLCLI_HOST_NAME with your warehouse’s Server hostname value from the requirements.
DBSQLCLI_HTTP_PATH with your warehouse’s HTTP path value from the requirements.
DBSQLCLI_ACCESS_TOKEN with your personal access token value from the requirements.
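For example, using setx with the sample values shown earlier (a sketch; setx persists the variables for new Command Prompt sessions only):
setx DBSQLCLI_HOST_NAME "adb-12345678901234567.8.azuredatabricks.net"
setx DBSQLCLI_HTTP_PATH "/sql/1.0/warehouses/1abc2d3456e7f890a"
setx DBSQLCLI_ACCESS_TOKEN "dapi1234567890b2cd34ef5a67bc8de90fa12b"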
Command options
To use the --hostname , --http-path , and --access-token options to provide the Databricks SQL CLI with
authentication details for your Databricks SQL warehouse, do the following:
Every time you run a command with the Databricks SQL CLI:
Specify the --hostname option and your warehouse’s Server hostname value from the requirements.
Specify the --http-path option and your warehouse’s HTTP path value from the requirements.
Specify the --access-token option and your personal access token value from the requirements.
For example:
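A sketch that combines these options with the -e query option described in the next section, using the sample values from earlier in this article:
dbsqlcli -e "SELECT * FROM default.diamonds LIMIT 2" --hostname "adb-12345678901234567.8.azuredatabricks.net" --http-path "/sql/1.0/warehouses/1abc2d3456e7f890a" --access-token "dapi1234567890b2cd34ef5a67bc8de90fa12b"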
Query sources
The Databricks SQL CLI enables you to run queries in the following ways:
From a query string.
From a file.
In a read-evaluate-print loop (REPL) approach. This approach provides suggestions as you type.
Query string
To run a query as a string, use the -e option followed by the query, represented as a string. For example:
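The following sketch assumes the diamonds sample table used elsewhere in this article:
dbsqlcli -e "SELECT * FROM default.diamonds LIMIT 2"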
Output:
_c0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
To switch output formats, use the --table-format option along with a value such as ascii for ASCII table
format, for example:
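Again assuming the diamonds sample table:
dbsqlcli -e "SELECT * FROM default.diamonds LIMIT 2" --table-format ascii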
Output:
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| _c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
For a list of available output format values, see the comments for the table_format setting in the dbsqlclirc
file.
File
To run a file that contains SQL, use the -e option followed by the path to a .sql file. For example:
dbsqlcli -e my-query.sql
Output:
_c0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
To switch output formats, use the --table-format option along with a value such as ascii for ASCII table
format, for example:
dbsqlcli -e my-query.sql --table-format ascii
Output:
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| _c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
For a list of available output format values, see the comments for the table_format setting in the dbsqlclirc
file.
REPL
To enter read-evaluate-print loop (REPL) mode scoped to the default database, run the following command:
dbsqlcli
You can also enter REPL mode scoped to a specific database, by running the following command:
dbsqlcli <database-name>
For example:
dbsqlcli default
exit
In REPL mode, you can use the following characters and keys:
Use the semicolon ( ; ) to end a line.
Use F3 to toggle multiline mode.
Use the spacebar to show suggestions at the insertion point, if suggestions are not already displayed.
Use the up and down arrows to navigate suggestions.
Use the right arrow to complete the highlighted suggestion.
For example:
dbsqlcli default
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| _c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
2 rows in set
Time: 0.703s
hostname:default> exit
Additional resources
Databricks SQL CLI README
Databricks Utilities
7/21/2022 • 29 minutes to read
Databricks Utilities ( dbutils ) make it easy to perform powerful combinations of tasks. You can use the utilities
to work with object storage efficiently, to chain and parameterize notebooks, and to work with secrets. dbutils
are not supported outside of notebooks.
IMPORTANT
Calling dbutils inside of executors can produce unexpected results. To learn more about limitations of dbutils and
alternatives that could be used instead, see Limitations.
dbutils.help()
Scala
dbutils.help()
This module provides various utilities for users to interact with the rest of Databricks.
fs: DbfsUtils -> Manipulates the Databricks filesystem (DBFS) from the console
jobs: JobsUtils -> Utilities for leveraging jobs features
library: LibraryUtils -> Utilities for session isolated libraries
notebook: NotebookUtils -> Utilities for the control flow of a notebook (EXPERIMENTAL)
secrets: SecretUtils -> Provides utilities for leveraging secrets within notebooks
widgets: WidgetsUtils -> Methods to create and get bound value of input widgets inside notebooks
dbutils.fs.help()
Scala
dbutils.fs.help()
dbutils.fs provides utilities for working with FileSystems. Most methods in this package can take either a
DBFS path (e.g., "/foo" or "dbfs:/foo"), or another FileSystem URI. For more info about a method, use
dbutils.fs.help("methodName"). In notebooks, you can also use the %fs shorthand to access DBFS. The %fs
shorthand maps straightforwardly onto dbutils calls. For example, "%fs head --maxBytes=10000 /file/path"
translates into "dbutils.fs.head("/file/path", maxBytes = 10000)".
fsutils
cp(from: String, to: String, recurse: boolean = false): boolean -> Copies a file or directory, possibly
across FileSystems
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given
file as a String encoded in UTF-8
ls(dir: String): Seq -> Lists the contents of a directory
mkdirs(dir: String): boolean -> Creates the given directory if it does not exist, also creating any
necessary parent directories
mv(from: String, to: String, recurse: boolean = false): boolean -> Moves a file or directory, possibly
across FileSystems
put(file: String, contents: String, overwrite: boolean = false): boolean -> Writes the given String out to a
file, encoded in UTF-8
rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory
mount
mount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null, extraConfigs:
Map = Map.empty[String, String]): boolean -> Mounts the given source directory into DBFS at the given mount
point
mounts: Seq -> Displays information about what is mounted within DBFS
refreshMounts: boolean -> Forces all machines in this cluster to refresh their mount cache, ensuring they
receive the most recent information
unmount(mountPoint: String): boolean -> Deletes a DBFS mount point
updateMount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null,
extraConfigs: Map = Map.empty[String, String]): boolean -> Similar to mount(), but updates an existing mount
point instead of creating a new one
dbutils.fs.help("cp")
dbutils.fs.help("cp")
Scala
dbutils.fs.help("cp")
/**
* Copies a file or directory, possibly across FileSystems.
*
* Example: cp("/mnt/my-folder/a", "dbfs://a/b")
*
* @param from FileSystem URI of the source file or directory
* @param to FileSystem URI of the destination file or directory
* @param recurse if true, all files and directories will be recursively copied
* @return true if all files were successfully copied
*/
cp(from: java.lang.String, to: java.lang.String, recurse: boolean = false): boolean
NOTE
Available in Databricks Runtime 9.0 and above.
Commands : summarize
The data utility allows you to understand and interpret datasets. To list the available commands, run
dbutils.data.help() .
dbutils.data provides utilities for understanding and interpreting datasets. This module is currently in
preview and may be unstable. For more info about a method, use dbutils.data.help("methodName").
summarize(df: Object, precise: boolean): void -> Summarize a Spark DataFrame and visualize the statistics to
get quick insights
NOTE
This feature is in Public Preview.
When precise is set to false (the default), some returned statistics include approximations to reduce run
time.
The number of distinct values for categorical columns may have ~5% relative error for high-
cardinality columns.
The frequent value counts may have an error of up to 0.01% when the number of distinct values is
greater than 10000.
The histograms and percentile estimates may have an error of up to 0.01% relative to the total
number of rows.
When precise is set to true, the statistics are computed with higher precision. All statistics except for the
histograms and percentiles for numeric columns are now exact.
The histograms and percentile estimates may have an error of up to 0.0001% relative to the total
number of rows.
The tooltip at the top of the data summary output indicates the mode of the current run.
This example displays summary statistics for an Apache Spark DataFrame with approximations enabled by
default. To see the results, run this command in a notebook. This example is based on Sample datasets
(databricks-datasets).
Python
df = spark.read.format('csv').load(
'/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv',
header=True,
inferSchema=True
)
dbutils.data.summarize(df)
Scala
val df = spark.read.format("csv")
.option("inferSchema", "true")
.option("header", "true")
.load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
dbutils.data.summarize(df)
Note that the visualization uses SI notation to concisely render numerical values smaller than 0.01 or larger than
10000. As an example, the numerical value 1.25e-15 will be rendered as 1.25f . One exception: the
visualization uses “ B ” for 1.0e9 (giga) instead of “ G ”.
fsutils
cp(from: String, to: String, recurse: boolean = false): boolean -> Copies a file or directory, possibly
across FileSystems
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given
file as a String encoded in UTF-8
ls(dir: String): Seq -> Lists the contents of a directory
mkdirs(dir: String): boolean -> Creates the given directory if it does not exist, also creating any
necessary parent directories
mv(from: String, to: String, recurse: boolean = false): boolean -> Moves a file or directory, possibly
across FileSystems
put(file: String, contents: String, overwrite: boolean = false): boolean -> Writes the given String out to a
file, encoded in UTF-8
rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory
mount
mount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null, extraConfigs:
Map = Map.empty[String, String]): boolean -> Mounts the given source directory into DBFS at the given mount
point
mounts: Seq -> Displays information about what is mounted within DBFS
refreshMounts: boolean -> Forces all machines in this cluster to refresh their mount cache, ensuring they
receive the most recent information
unmount(mountPoint: String): boolean -> Deletes a DBFS mount point
updateMount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null,
extraConfigs: Map = Map.empty[String, String]): boolean -> Similar to mount(), but updates an existing mount
point instead of creating a new one
cp command (dbutils.fs.cp)
Copies a file or directory, possibly across filesystems.
To display help for this command, run dbutils.fs.help("cp") .
This example copies the file named old_file.txt from /FileStore to /tmp/new , renaming the copied file to
new_file.txt .
Python
dbutils.fs.cp("/FileStore/old_file.txt", "/tmp/new/new_file.txt")
# Out[4]: True
dbutils.fs.cp("/FileStore/old_file.txt", "/tmp/new/new_file.txt")
# [1] TRUE
Scala
dbutils.fs.cp("/FileStore/old_file.txt", "/tmp/new/new_file.txt")
head command (dbutils.fs.head)
Returns up to the specified maximum number of bytes of the given file, as a UTF-8 encoded string.
To display help for this command, run dbutils.fs.help("head") .
This example displays the first 25 bytes of the file my_file.txt located in /tmp .
Python
dbutils.fs.head("/tmp/my_file.txt", 25)
R
dbutils.fs.head("/tmp/my_file.txt", 25)
Scala
dbutils.fs.head("/tmp/my_file.txt", 25)
ls command (dbutils.fs.ls)
Lists the contents of a directory.
To display help for this command, run dbutils.fs.help("ls") .
This example displays information about the contents of /tmp . The modificationTime field is available in
Databricks Runtime 10.2 and above. In R, modificationTime is returned as a string.
Python
dbutils.fs.ls("/tmp")
R
dbutils.fs.ls("/tmp")
# [[1]]
# [[1]]$path
# [1] "dbfs:/tmp/my_file.txt"
# [[1]]$name
# [1] "my_file.txt"
# [[1]]$size
# [1] 40
# [[1]]$isDir
# [1] FALSE
# [[1]]$isFile
# [1] TRUE
# [[1]]$modificationTime
# [1] "1622054945000"
Scala
dbutils.fs.ls("/tmp")
mkdirs command (dbutils.fs.mkdirs)
Creates the given directory if it does not exist, also creating any necessary parent directories.
To display help for this command, run dbutils.fs.help("mkdirs") .
This example creates the directory structure /tmp/parent/child/grandchild .
Python
dbutils.fs.mkdirs("/tmp/parent/child/grandchild")
# Out[15]: True
dbutils.fs.mkdirs("/tmp/parent/child/grandchild")
# [1] TRUE
Scala
dbutils.fs.mkdirs("/tmp/parent/child/grandchild")
mount command (dbutils.fs.mount)
Mounts the specified source directory into DBFS at the specified mount point.
To display help for this command, run dbutils.fs.help("mount") .
Python
dbutils.fs.mount(
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
mount_point = "/mnt/<mount-name>",
extra_configs = {"<conf-key>":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})
Scala
dbutils.fs.mount(
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>",
mountPoint = "/mnt/<mount-name>",
extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")))
For additional code examples, see Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure
Databricks.
mounts command (dbutils.fs.mounts)
Displays information about what is currently mounted within DBFS.
To display help for this command, run dbutils.fs.help("mounts") .
Python
dbutils.fs.mounts()
Scala
dbutils.fs.mounts()
For additional code examples, see Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure
Databricks.
mv command (dbutils.fs.mv)
Moves a file or directory, possibly across filesystems. A move is a copy followed by a delete, even for moves
within filesystems.
To display help for this command, run dbutils.fs.help("mv") .
This example moves the file my_file.txt from /FileStore to /tmp/parent/child/grandchild .
Python
dbutils.fs.mv("/FileStore/my_file.txt", "/tmp/parent/child/grandchild")
# Out[2]: True
dbutils.fs.mv("/FileStore/my_file.txt", "/tmp/parent/child/grandchild")
# [1] TRUE
Scala
dbutils.fs.mv("/FileStore/my_file.txt", "/tmp/parent/child/grandchild")
put command (dbutils.fs.put)
Writes the given string out to a file, encoded in UTF-8.
To display help for this command, run dbutils.fs.help("put") .
This example writes the string Hello, Databricks! to a file named hello_db.txt in /tmp .
Python
dbutils.fs.put("/tmp/hello_db.txt", "Hello, Databricks!", True)
# Wrote 18 bytes.
# Out[6]: True
R
dbutils.fs.put("/tmp/hello_db.txt", "Hello, Databricks!", TRUE)
# [1] TRUE
Scala
dbutils.fs.put("/tmp/hello_db.txt", "Hello, Databricks!", true)
// Wrote 18 bytes.
// res2: Boolean = true
refreshMounts command (dbutils.fs.refreshMounts)
Forces all machines in the cluster to refresh their mount cache, ensuring they receive the most recent information.
To display help for this command, run dbutils.fs.help("refreshMounts") .
Python
dbutils.fs.refreshMounts()
Scala
dbutils.fs.refreshMounts()
For additional code examples, see Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure
Databricks.
rm command (dbutils.fs.rm)
Removes a file or directory.
To display help for this command, run dbutils.fs.help("rm") .
This example removes the file named hello_db.txt in /tmp .
Python
dbutils.fs.rm("/tmp/hello_db.txt")
# Out[8]: True
dbutils.fs.rm("/tmp/hello_db.txt")
# [1] TRUE
Scala
dbutils.fs.rm("/tmp/hello_db.txt")
unmount command (dbutils.fs.unmount)
Deletes a DBFS mount point.
To display help for this command, run dbutils.fs.help("unmount") .
dbutils.fs.unmount("/mnt/<mount-name>")
For additional code examples, see Accessing Azure Data Lake Storage Gen2 and Blob Storage with Azure
Databricks.
updateMount command (dbutils.fs.updateMount)
Similar to the dbutils.fs.mount command, but updates an existing mount point instead of creating a new one.
Returns an error if the mount point is not present.
To display help for this command, run dbutils.fs.help("updateMount") .
This command is available in Databricks Runtime 10.2 and above.
Python
dbutils.fs.updateMount(
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
mount_point = "/mnt/<mount-name>",
extra_configs = {"<conf-key>":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})
Scala
dbutils.fs.updateMount(
source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>",
mountPoint = "/mnt/<mount-name>",
extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")))
The jobs utility allows you to leverage jobs features. To display help for this utility, run dbutils.jobs.help() .
taskValues: TaskValuesUtils -> Provides utilities for leveraging job task values
NOTE
Available in Databricks Runtime 7.3 and above.
This subutility is available only for Python.
NOTE
Available in Databricks Runtime 7.3 and above.
This command is available only for Python.
On Databricks Runtime 10.4 and earlier, if get cannot find the task, a Py4JJavaError is raised instead of a ValueError .
Gets the contents of the specified task value for the specified task in the current job run.
To display help for this command, run dbutils.jobs.taskValues.help("get") .
For example:
dbutils.jobs.taskValues.get(taskKey = "my-task", \
key = "my-key", \
default = 7, \
debugValue = 42)
NOTE
Available in Databricks Runtime 7.3 and above.
This command is available only for Python.
Sets or updates a task value. You can set up to 250 task values for a job run.
To display help for this command, run dbutils.jobs.taskValues.help("set") .
Some examples include:
dbutils.jobs.taskValues.set(key = "my-key", \
value = 5)
dbutils.jobs.taskValues.set(key = "my-other-key", \
value = "my other value")
IMPORTANT
Library utilities are not available on Databricks Runtime ML or Databricks Runtime for Genomics. Instead, see Notebook-
scoped Python libraries.
For Databricks Runtime 7.2 and above, Databricks recommends using %pip magic commands to install notebook-
scoped libraries. See Notebook-scoped Python libraries.
Library utilities are enabled by default. Therefore, by default the Python environment for each notebook is
isolated by using a separate Python executable that is created when the notebook is attached to a cluster and that
inherits the cluster's default Python environment. Libraries installed through an init script into the Azure Databricks
Python environment are still available. You can disable this feature by setting
spark.databricks.libraryIsolation.enabled to false .
This API is compatible with the existing cluster-wide library installation through the UI and REST API. Libraries
installed through this API have higher priority than cluster-wide libraries.
To list the available commands, run dbutils.library.help() .
install(path: String): boolean -> Install the library within the current notebook session
installPyPI(pypiPackage: String, version: String = "", repo: String = "", extras: String = ""): boolean ->
Install the PyPI library within the current notebook session
list: List -> List the isolated libraries added for the current notebook session via dbutils
restartPython: void -> Restart python process for the current notebook session
updateCondaEnv(envYmlContent: String): boolean -> Update the current notebook's Conda environment based on
the specification (content of environment
IMPORTANT
dbutils.library.install is removed in Databricks Runtime 11.0 and above.
Databricks recommends that you put all your library install commands in the first cell of your notebook and call
restartPython at the end of that cell. The Python notebook state is reset after running restartPython ; the notebook
loses all state including but not limited to local variables, imported libraries, and other ephemeral states. Therefore, we
recommend that you install libraries and reset the notebook state in the first notebook cell.
The accepted library sources are dbfs , abfss , adl , and wasbs .
dbutils.library.install("abfss:/path/to/your/library.egg")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling
this command.
dbutils.library.install("abfss:/path/to/your/library.whl")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling
this command.
NOTE
You can directly install custom wheel files using %pip . In the following example we are assuming you have uploaded your
library wheel file to DBFS:
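The path below is a hypothetical DBFS location for your wheel file:
%pip install /dbfs/path/to/your/library.whl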
Egg files are not supported by pip, and wheel is considered the standard for build and binary packaging for Python. See
Wheel vs Egg for more details. However, if you want to use an egg file in a way that’s compatible with %pip , you can use
the following workaround:
# This step is only needed if no %pip commands have been run yet.
# It will trigger setting up the isolated notebook environment
%pip install <any-lib> # This doesn't need to be a real library; for example "%pip install any-lib"
would work
import sys
# Assuming the preceding step was completed, the following command
# adds the egg file to the current notebook environment
sys.path.append("/local/path/to/library.egg")
IMPORTANT
dbutils.library.installPyPI is removed in Databricks Runtime 11.0 and above.
The version and extras keys cannot be part of the PyPI package string. For example:
dbutils.library.installPyPI("azureml-sdk[databricks]==1.19.0") is not valid. Use the version and extras
arguments to specify the version and extras information as follows:
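For example, splitting the package string from the preceding invalid example into its parts:
dbutils.library.installPyPI("azureml-sdk", version="1.19.0", extras="databricks")
dbutils.library.restartPython()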
This example specifies library requirements in one notebook and installs them by using %run in the other. To do
this, first define the libraries to install in a notebook. This example uses a notebook named InstallDependencies .
dbutils.library.installPyPI("torch")
dbutils.library.installPyPI("scikit-learn", version="1.19.1")
dbutils.library.installPyPI("azureml-sdk", extras="databricks")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling
this command.
import torch
from sklearn.linear_model import LinearRegression
import azureml
...
This example resets the Python notebook state while maintaining the environment. This technique is available
only in Python notebooks. For example, you can use this technique to reload libraries Azure Databricks
preinstalled with a different version:
dbutils.library.installPyPI("numpy", version="1.15.4")
dbutils.library.restartPython()
You can also use this technique to install libraries such as tensorflow that need to be loaded on process start up:
dbutils.library.installPyPI("tensorflow")
dbutils.library.restartPython()
NOTE
The equivalent of this command using %pip is:
%pip freeze
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling
this command.
dbutils.library.updateCondaEnv(
"""
channels:
- anaconda
dependencies:
- gensim=3.4
- nltk=3.4
""")
exit(value: String): void -> This method lets you exit a notebook with a value
run(path: String, timeoutSeconds: int, arguments: Map): String -> This method runs a notebook and returns
its exit value.
NOTE
The maximum length of the string value returned from the run command is 5 MB. See Get the output for a single run (
GET /jobs/runs/get-output ).
WARNING
Administrators, secret creators, and users granted permission can read Azure Databricks secrets. While Azure Databricks
makes an effort to redact secret values that might be displayed in notebooks, it is not possible to prevent such users from
reading secrets. For more information, see Secret redaction.
dbutils.secrets.get(scope="my-scope", key="my-key")
# Out[14]: '[REDACTED]'
dbutils.secrets.get(scope="my-scope", key="my-key")
# [1] "[REDACTED]"
Scala
dbutils.secrets.get(scope="my-scope", key="my-key")
# Out[1]: 'a1!b2@c3#'
R
my_secret = dbutils.secrets.getBytes(scope="my-scope", key="my-key")
print(rawToChar(my_secret))
# [1] "a1!b2@c3#"
Scala
// a1!b2@c3#
// mySecret: Array[Byte] = Array(97, 49, 33, 98, 50, 64, 99, 51, 35)
// convertedString: String = a1!b2@c3#
dbutils.secrets.list("my-scope")
# Out[10]: [SecretMetadata(key='my-key')]
dbutils.secrets.list("my-scope")
# [[1]]
# [[1]]$key
# [1] "my-key"
Scala
dbutils.secrets.list("my-scope")
dbutils.secrets.listScopes()
# Out[14]: [SecretScope(name='my-scope')]
R
dbutils.secrets.listScopes()
# [[1]]
# [[1]]$name
# [1] "my-scope"
Scala
dbutils.secrets.listScopes()
combobox(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a combobox input
widget with a given name, default value and choices
dropdown(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a dropdown input
widget with a given name, default value and choices
get(name: String): String -> Retrieves current value of an input widget
getArgument(name: String, optional: String): String -> (DEPRECATED) Equivalent to get
multiselect(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a multiselect
input widget with a given name, default value and choices
remove(name: String): void -> Removes an input widget from the notebook
removeAll: void -> Removes all widgets in the notebook
text(name: String, defaultValue: String, label: String): void -> Creates a text input widget with a given
name and default value
dbutils.widgets.combobox(
name='fruits_combobox',
defaultValue='banana',
choices=['apple', 'banana', 'coconut', 'dragon fruit'],
label='Fruits'
)
print(dbutils.widgets.get("fruits_combobox"))
# banana
R
dbutils.widgets.combobox(
name='fruits_combobox',
defaultValue='banana',
choices=list('apple', 'banana', 'coconut', 'dragon fruit'),
label='Fruits'
)
print(dbutils.widgets.get("fruits_combobox"))
# [1] "banana"
Scala
dbutils.widgets.combobox(
"fruits_combobox",
"banana",
Array("apple", "banana", "coconut", "dragon fruit"),
"Fruits"
)
print(dbutils.widgets.get("fruits_combobox"))
// banana
dbutils.widgets.dropdown(
name='toys_dropdown',
defaultValue='basketball',
choices=['alphabet blocks', 'basketball', 'cape', 'doll'],
label='Toys'
)
print(dbutils.widgets.get("toys_dropdown"))
# basketball
dbutils.widgets.dropdown(
name='toys_dropdown',
defaultValue='basketball',
choices=list('alphabet blocks', 'basketball', 'cape', 'doll'),
label='Toys'
)
print(dbutils.widgets.get("toys_dropdown"))
# [1] "basketball"
Scala
dbutils.widgets.dropdown(
"toys_dropdown",
"basketball",
Array("alphabet blocks", "basketball", "cape", "doll"),
"Toys"
)
print(dbutils.widgets.get("toys_dropdown"))
// basketball
dbutils.widgets.get('fruits_combobox')
# banana
dbutils.widgets.get('fruits_combobox')
# [1] "banana"
Scala
dbutils.widgets.get("fruits_combobox")
This example gets the value of the notebook task parameter that has the programmatic name age . This
parameter was set to 35 when the related notebook task was run.
Python
dbutils.widgets.get('age')
# 35
dbutils.widgets.get('age')
# [1] "35"
Scala
dbutils.widgets.get("age")
// res6: String = 35
NOTE
This command is deprecated. Use dbutils.widgets.get instead.
Scala
Python
dbutils.widgets.multiselect(
name='days_multiselect',
defaultValue='Tuesday',
choices=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday'],
label='Days of the Week'
)
print(dbutils.widgets.get("days_multiselect"))
# Tuesday
dbutils.widgets.multiselect(
name='days_multiselect',
defaultValue='Tuesday',
choices=list('Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday'),
label='Days of the Week'
)
print(dbutils.widgets.get("days_multiselect"))
# [1] "Tuesday"
Scala
dbutils.widgets.multiselect(
"days_multiselect",
"Tuesday",
Array("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"),
"Days of the Week"
)
print(dbutils.widgets.get("days_multiselect"))
// Tuesday
IMPORTANT
If you add a command to remove a widget, you cannot add a subsequent command to create a widget in the same cell.
You must create the widget in another cell.
This example removes the widget with the programmatic name fruits_combobox .
Python
dbutils.widgets.remove('fruits_combobox')
R
dbutils.widgets.remove('fruits_combobox')
Scala
dbutils.widgets.remove("fruits_combobox")
IMPORTANT
If you add a command to remove all widgets, you cannot add a subsequent command to create any widgets in the same
cell. You must create the widgets in another cell.
dbutils.widgets.removeAll()
dbutils.widgets.removeAll()
Scala
dbutils.widgets.removeAll()
dbutils.widgets.text(
name='your_name_text',
defaultValue='Enter your name',
label='Your name'
)
print(dbutils.widgets.get("your_name_text"))
R
dbutils.widgets.text(
name='your_name_text',
defaultValue='Enter your name',
label='Your name'
)
print(dbutils.widgets.get("your_name_text"))
Scala
dbutils.widgets.text(
"your_name_text",
"Enter your name",
"Your name"
)
print(dbutils.widgets.get("your_name_text"))
Maven
<dependency>
<groupId>com.databricks</groupId>
<artifactId>dbutils-api_TARGET</artifactId>
<version>VERSION</version>
</dependency>
Gradle
compile 'com.databricks:dbutils-api_TARGET:VERSION'
Replace TARGET with the desired target (for example 2.12 ) and VERSION with the desired version (for example
0.0.5 ). For a list of available targets and versions, see the DBUtils API webpage on the Maven Repository
website.
Once you build your application against this library, you can deploy the application.
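As a rough sketch (assuming the DBUtilsHolder object exposed by the dbutils-api package; verify against the version you depend on), application code compiled against the library typically accesses dbutils like this:

import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

object DBUtilsExample {
  def main(args: Array[String]): Unit = {
    // Compiles locally against dbutils-api; the call only resolves to a real
    // implementation when the application runs on an Azure Databricks cluster.
    dbutils.fs.ls("dbfs:/tmp").foreach(info => println(info.path))
  }
}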
IMPORTANT
The dbutils-api library allows you to locally compile an application that uses dbutils , but not to run it. To run the
application, you must deploy it in Azure Databricks.
Limitations
Calling dbutils inside of executors can produce unexpected results or potentially result in errors.
If you need to run file system operations on executors using dbutils , there are several faster and more scalable
alternatives available:
For file copy or move operations, a faster option is to run the parallelized filesystem operations described in
Parallelize filesystem operations.
For file system list and delete operations, you can refer to parallel listing and delete methods utilizing Spark
in How to list and delete files faster in Databricks.
For information about executors, see Cluster Mode Overview on the Apache Spark website.
REST API (latest)
7/21/2022 • 2 minutes to read
The Databricks REST API allows for programmatic management of various Azure Databricks resources. This
article provides links to the latest version of each API.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
For general usage notes about the Databricks REST API, see Databricks REST API reference. You can also jump
directly to the REST API home pages for versions 2.1, 2.0, or 1.2.
Clusters API 2.0
Cluster Policies API 2.0
Databricks SQL Queries and Dashboards API 2.0
Databricks SQL Query History API 2.0
Databricks SQL Warehouses API 2.0
DBFS API 2.0
Databricks SQL API 2.0
Delta Live Tables API 2.0
Git Credentials API 2.0
Global Init Scripts API 2.0
Groups API 2.0
Instance Pools API 2.0
IP Access List API 2.0
Jobs API 2.1
Libraries API 2.0
MLflow API 2.0
Permissions API 2.0
Repos API 2.0
SCIM API 2.0
Secrets API 2.0
Token API 2.0
Token Management API 2.0
Workspace API 2.0
Clusters API 2.0
7/21/2022 • 46 minutes to read
The Clusters API allows you to create, start, edit, list, terminate, and delete clusters. The maximum allowed size of
a request to the Clusters API is 10MB.
Cluster lifecycle methods require a cluster ID, which is returned from Create. To obtain a list of clusters, invoke
List.
Azure Databricks maps cluster node instance types to compute units known as DBUs. See the instance type
pricing page for a list of the supported instance types and their corresponding DBUs. For instance provider
information, see Azure instance type specifications and pricing.
Azure Databricks always provides one year’s deprecation notice before ceasing support for an instance type.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Create
Endpoint: 2.0/clusters/create | HTTP method: POST
Create a new Apache Spark cluster. This method acquires new instances from the cloud provider if necessary.
This method is asynchronous; the returned cluster_id can be used to poll the cluster state. When this method
returns, the cluster is in a PENDING state. The cluster is usable once it enters a RUNNING state. See ClusterState.
NOTE
Azure Databricks may not be able to acquire some of the requested nodes, due to cloud provider limitations or transient
network issues. If it is unable to acquire a sufficient number of the requested nodes, cluster creation will terminate with an
informative error message.
Examples
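A representative request, posting the create-cluster.json file shown below as the request body and assuming authentication with a .netrc file as in the other curl examples in this article:

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/create \
--data @create-cluster.json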
create-cluster.json :
{
"cluster_name": "my-cluster",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"spark_conf": {
"spark.speculation": true
},
"num_workers": 25
}
{ "cluster_id": "1234-567890-undid123" }
Here is an example for an autoscaling cluster. This cluster will start with two nodes, the minimum.
create-cluster.json :
{
"cluster_name": "autoscaling-cluster",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"autoscale" : {
"min_workers": 2,
"max_workers": 50
}
}
{ "cluster_id": "1234-567890-hared123" }
This example creates a Single Node cluster. To create a Single Node cluster:
Set spark_conf and custom_tags to the exact values in the example.
Set num_workers to 0 .
create-cluster.json :
{
"cluster_name": "single-node-cluster",
"spark_version": "7.6.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 0,
"spark_conf": {
"spark.databricks.cluster.profile": "singleNode",
"spark.master": "local[*]"
},
"custom_tags": {
"ResourceClass": "SingleNode"
}
}
{ "cluster_id": "1234-567890-pouch123" }
To create a job or submit a run with a new cluster using a policy, set policy_id to the policy ID:
create-cluster.json :
{
"num_workers": null,
"autoscale": {
"min_workers": 2,
"max_workers": 8
},
"cluster_name": "my-cluster",
"spark_version": "7.3.x-scala2.12",
"spark_conf": {},
"node_type_id": "Standard_D3_v2",
"custom_tags": {},
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"autotermination_minutes": 120,
"init_scripts": [],
"policy_id": "C65B864F02000008"
}
create-job.json :
{
"run_name": "my spark task",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10,
"policy_id": "ABCD000000000000"
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}
Note :
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
Edit
Endpoint: 2.0/clusters/edit | HTTP method: POST
Edit the configuration of a cluster to match the provided attributes and size.
You can edit a cluster if it is in a RUNNING or TERMINATED state. If you edit a cluster while it is in a RUNNING state,
it will be restarted so that the new attributes can take effect. If you edit a cluster while it is in a TERMINATED state,
it will remain TERMINATED . The next time it is started using the clusters/start API, the new attributes will take
effect. An attempt to edit a cluster in any other state will be rejected with an INVALID_STATE error code.
Clusters created by the Databricks Jobs service cannot be edited.
Example
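A representative request, posting the edit-cluster.json file shown below as the request body:

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/edit \
--data @edit-cluster.json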
edit-cluster.json :
{
"cluster_id": "1202-211320-brick1",
"num_workers": 10,
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2"
}
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Start
Endpoint: 2.0/clusters/start | HTTP method: POST
Start a terminated cluster given its ID. This is similar to createCluster , except:
The terminated cluster ID and attributes are preserved.
The cluster starts with the last specified cluster size. If the terminated cluster is an autoscaling cluster, the
cluster starts with the minimum number of nodes.
If the cluster is in the RESTARTING state, a 400 error is returned.
You cannot start a cluster launched to run a job.
Example
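A representative request (the cluster ID is illustrative):

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/start \
--data '{ "cluster_id": "1234-567890-reef123" }'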
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Restart
Endpoint: 2.0/clusters/restart | HTTP method: POST
Restart a cluster given its ID. The cluster must be in the RUNNING state.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/restart \
--data '{ "cluster_id": "1234-567890-reef123" }'
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Resize
Endpoint: 2.0/clusters/resize | HTTP method: POST
Resize a cluster to have a desired number of workers. The cluster must be in the RUNNING state.
Example
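A representative request (a sketch: cluster_id identifies the cluster and num_workers is the desired worker count; the values are illustrative):

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/resize \
--data '{ "cluster_id": "1234-567890-reef123", "num_workers": 30 }'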
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Delete (terminate)
Endpoint: 2.0/clusters/delete | HTTP method: POST
Terminate a cluster given its ID. The cluster is removed asynchronously. Once the termination has completed, the
cluster will be in the TERMINATED state. If the cluster is already in a TERMINATING or TERMINATED state, nothing
will happen.
Unless a cluster is pinned, 30 days after the cluster is terminated, it is permanently deleted.
Example
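A representative request (the cluster ID is illustrative):

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/delete \
--data '{ "cluster_id": "1234-567890-reef123" }'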
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Permanent delete
Endpoint: 2.0/clusters/permanent-delete | HTTP method: POST
Permanently delete a cluster. If the cluster is running, it is terminated and its resources are asynchronously
removed. If the cluster is terminated, then it is immediately removed.
You cannot perform any action, including retrieve the cluster’s permissions, on a permanently deleted cluster. A
permanently deleted cluster is also no longer returned in the cluster list.
Example
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Get
Endpoint: 2.0/clusters/get | HTTP method: GET
Retrieve the information for a cluster given its identifier. Clusters can be described while they are running or up
to 30 days after they are terminated.
Example
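A representative request, passing the cluster ID in the same style as the other GET examples in this article (the ID is illustrative):

curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/get \
--data '{ "cluster_id": "1234-567890-reef123" }'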
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
Note: Azure Databricks adds a set of default tags to every cluster: Vendor (Databricks), Creator, ClusterName, ClusterId, and Name. On job clusters it also adds RunName and JobId, and on resources used by Databricks SQL it adds SqlWarehouseId.
Pin
NOTE
You must be an Azure Databricks administrator to invoke this API.
Endpoint: 2.0/clusters/pin | HTTP method: POST
Ensure that an all-purpose cluster configuration is retained even after a cluster has been terminated for more
than 30 days. Pinning ensures that the cluster is always returned by the List API. Pinning a cluster that is already
pinned has no effect.
Example
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Unpin
NOTE
You must be an Azure Databricks administrator to invoke this API.
Endpoint: 2.0/clusters/unpin | HTTP method: POST
Allows the cluster to eventually be removed from the list returned by the List API. Unpinning a cluster that is not
pinned has no effect.
Example
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
List
Endpoint: 2.0/clusters/list | HTTP method: GET
Return information about all pinned clusters, active clusters, up to 200 of the most recently terminated all-
purpose clusters in the past 30 days, and up to 30 of the most recently terminated job clusters in the past 30
days. For example, if there is 1 pinned cluster, 4 active clusters, 45 terminated all-purpose clusters in the past 30
days, and 50 terminated job clusters in the past 30 days, then this API returns the 1 pinned cluster, 4 active
clusters, all 45 terminated all-purpose clusters, and the 30 most recently terminated job clusters.
Example
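A representative request:

curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list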
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
List node types
Endpoint: 2.0/clusters/list-node-types | HTTP method: GET
Return a list of supported Spark node types. These node types can be used to launch a cluster.
Example
{
"node_types": [
{
"node_type_id": "Standard_L80s_v2",
"memory_mb": 655360,
"num_cores": 80,
"description": "Standard_L80s_v2",
"instance_type_id": "Standard_L80s_v2",
"is_deprecated": false,
"category": "Storage Optimized",
"support_ebs_volumes": true,
"support_cluster_tags": true,
"num_gpus": 0,
"node_instance_type": {
"instance_type_id": "Standard_L80s_v2",
"local_disks": 1,
"local_disk_size_gb": 800,
"instance_family": "Standard LSv2 Family vCPUs",
"local_nvme_disk_size_gb": 1788,
"local_nvme_disks": 10,
"swap_size": "10g"
},
"is_hidden": false,
"support_port_forwarding": true,
"display_order": 0,
"is_io_cache_enabled": true,
"node_info": {
"available_core_quota": 350,
"total_core_quota": 350
}
},
{
"..."
}
]
}
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
Runtime versions
Endpoint: 2.0/clusters/spark-versions | HTTP method: GET
Return the list of available runtime versions. These versions can be used to launch a cluster.
Example
{
"versions": [
{
"key": "8.2.x-scala2.12",
"name": "8.2 (includes Apache Spark 3.1.1, Scala 2.12)"
},
{
"..."
}
]
}
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
Events
Endpoint: 2.0/clusters/events | HTTP method: POST
Retrieve a list of events about the activity of a cluster. You can retrieve events from active clusters (running,
pending, or reconfiguring) and terminated clusters within 30 days of their last termination. This API is paginated.
If there are more events to read, the response includes all the parameters necessary to request the next page of
events.
Example:
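A representative request, posting the list-events.json file shown below as the request body:

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/events \
--data @list-events.json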
list-events.json :
{
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 5,
"limit": 5,
"event_type": "RUNNING"
}
{
"events": [
{
"cluster_id": "1234-567890-reef123",
"timestamp": 1619471498409,
"type": "RUNNING",
"details": {
"current_num_workers": 2,
"target_num_workers": 2
}
},
{
"..."
}
],
"next_page": {
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 10,
"limit": 5
},
"total_count": 25
}
list-events.json :
{
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 10,
"limit": 5,
"event_type": "RUNNING"
}
{
"events": [
{
"cluster_id": "1234-567890-reef123",
"timestamp": 1618330776302,
"type": "RUNNING",
"details": {
"current_num_workers": 2,
"target_num_workers": 2
}
},
{
"..."
}
],
"next_page": {
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 15,
"limit": 5
},
"total_count": 25
}
Request structure
Retrieve events pertaining to a specific cluster.
Data structures
In this section:
AutoScale
ClusterInfo
ClusterEvent
ClusterEventType
EventDetails
ClusterAttributes
ClusterSize
ListOrder
ResizeCause
ClusterLogConf
InitScriptInfo
ClusterTag
DbfsStorageInfo
FileStorageInfo
DockerImage
DockerBasicAuth
LogSyncStatus
NodeType
ClusterCloudProviderNodeInfo
ClusterCloudProviderNodeStatus
ParameterPair
SparkConfPair
SparkEnvPair
SparkNode
SparkVersion
TerminationReason
PoolClusterTerminationCode
ClusterSource
ClusterState
TerminationCode
TerminationType
TerminationParameter
AzureAttributes
AzureAvailability
AutoScale
Range defining the min and max number of cluster workers.
ClusterInfo
Metadata about a cluster.
Its default_tags field contains tags added by Azure Databricks regardless of any custom tags: Vendor (Databricks), Creator, ClusterName, ClusterId, and Name; on job clusters, RunName and JobId; and on resources used by Databricks SQL, SqlWarehouseId.
ClusterEvent
Cluster event information.
ClusterEventType
Type of a cluster event.
DID_NOT_EXPAND_DISK: Indicates that a disk is low on space, but adding disks would put it over the max capacity.
EXPANDED_DISK: Indicates that a disk was low on space and the disks were expanded.
FAILED_TO_EXPAND_DISK: Indicates that a disk was low on space and disk space could not be expanded.
INIT_SCRIPTS_STARTING: Indicates that the cluster scoped init script has started.
INIT_SCRIPTS_FINISHED: Indicates that the cluster scoped init script has finished.
RUNNING: Indicates the cluster has finished being created. Includes the number of nodes in the cluster and a failure reason if some nodes could not be acquired.
NODES_LOST: Indicates that some nodes were lost from the cluster.
DRIVER_HEALTHY: Indicates that the driver is healthy and the cluster is ready for use.
SPARK_EXCEPTION: Indicates that a Spark exception was thrown from the driver.
EventDetails
Details about a cluster event.
ClusterAttributes
Common set of attributes set during cluster creation. These attributes cannot be changed over the lifetime of a
cluster.
Note :
ClusterSize
Cluster size specification.
ListOrder
Generic ordering enum for list-based queries.
ResizeCause
Reason why a cluster was resized.
ClusterLogConf
Path to cluster log.
InitScriptInfo
Path to an init script. For instructions on using init scripts with Databricks Container Services, see Use an init
script.
NOTE
The file storage type is only available for clusters set up using Databricks Container Services.
ClusterTag
Cluster tag definition.
STRING The value of the tag. The value length must be less than or
equal to 256 UTF-8 characters.
DbfsStorageInfo
DBFS storage information.
FileStorageInfo
File storage information.
NOTE
This location type is only available for clusters set up using Databricks Container Services.
DockerImage
Docker image connection information.
DockerBasicAuth
Docker repository basic authentication information.
LogSyncStatus
Log delivery status.
NodeType
Description of a Spark node type including both the dimensions of the node and the instance type on which it
will be hosted.
ClusterCloudProviderNodeInfo
Information about an instance supplied by a cloud provider.
ClusterCloudProviderNodeStatus
Status of an instance supplied by a cloud provider.
ParameterPair
Parameter that provides additional information about why a cluster was terminated.
SparkConfPair
Spark configuration key-value pairs.
SparkEnvPair
Spark environment variable key-value pairs.
IMPORTANT
When specifying environment variables in a job cluster, the fields in this data structure accept only Latin characters (ASCII
character set). Using non-ASCII characters will return an error. Examples of invalid, non-ASCII characters are Chinese,
Japanese kanjis, and emojis.
SparkNode
Spark driver or executor configuration.
SparkVersion
Databricks Runtime version of the cluster.
TerminationReason
Reason why a cluster was terminated.
PoolClusterTerminationCode
Status code indicating why the cluster was terminated due to a pool failure.
C O DE DESC RIP T IO N
ClusterSource
Service that created the cluster.
ClusterState
State of a cluster. The allowable state transitions are as follows:
PENDING -> RUNNING
PENDING -> TERMINATING
RUNNING -> RESIZING
RUNNING -> RESTARTING
RUNNING -> TERMINATING
RESTARTING -> RUNNING
RESTARTING -> TERMINATING
RESIZING -> RUNNING
RESIZING -> TERMINATING
TERMINATING -> TERMINATED
RUNNING Indicates that a cluster has been started and is ready for use.
TerminationCode
Status code indicating why the cluster was terminated.
CODE | DESCRIPTION
JOB_FINISHED: The cluster was launched by a job and terminated when the job completed.
CLOUD_PROVIDER_SHUTDOWN: The instance that hosted the Spark driver was terminated by the cloud provider.
SPARK_ERROR: The Spark driver failed to start. Possible reasons include incompatible libraries and initialization scripts that corrupted the Spark container.
DRIVER_UNREACHABLE: Azure Databricks was not able to access the Spark driver, because it was not reachable.
DRIVER_UNRESPONSIVE: Azure Databricks was not able to access the Spark driver, because it was unresponsive.
INSTANCE_POOL_CLUSTER_FAILURE: Pool-backed cluster specific failure. See Pools for details.
TerminationType
Reason why the cluster was terminated.
CLOUD_FAILURE: Cloud provider infrastructure issue. The client can retry after the underlying issue is resolved.
TerminationParameter
Key that provides additional information about why a cluster was terminated.
KEY | DESCRIPTION
databricks_error_message: Additional context that may explain the reason for cluster termination.
inactivity_duration_min: An idle cluster was shut down after being inactive for this duration.
instance_id: The ID of the instance that was hosting the Spark driver.
azure_error_code: The Azure-provided error code describing why cluster nodes could not be provisioned. For reference, see https://docs.microsoft.com/azure/virtual-machines/windows/error-messages.
AzureAttributes
Attributes set during cluster creation related to Azure.
spot_bid_max_price (DOUBLE): The max bid price used for Azure spot instances. You can set this to greater than or equal to the current spot price. You can also set this to -1 (the default), which specifies that the instance cannot be evicted on the basis of price. The price for the instance will be the current price for spot instances or the price for a standard instance. You can view historical pricing and eviction rates in the Azure portal.
AzureAvailability
The Azure instance availability type behavior.
IMPORTANT
This feature is in Public Preview.
A cluster policy limits the ability to create clusters based on a set of rules. The policy rules limit the attributes or
attribute values available for cluster creation. Cluster policies have ACLs that limit their use to specific users and
groups.
Only admin users can create, edit, and delete policies. Admin users also have access to all policies.
For requirements and limitations on cluster policies, see Manage cluster policies.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
IMPORTANT
The Cluster Policies API requires a policy JSON definition to be passed within a JSON request in stringified form. In most
cases this requires escaping of the quote characters.
In this section:
Get
List
Create
Edit
Delete
Data structures
Get
Endpoint: 2.0/policies/clusters/get | HTTP method: GET
{
"policy_id": "ABCD000000000000",
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":
{\"type\":\"forbidden\",\"hidden\":true}}",
"created_at_timestamp": 1600000000000
}
Request structure
Response structure
List
Endpoint: 2.0/policies/clusters/list | HTTP method: GET
Request structure
Response structure
Create
Endpoint: 2.0/policies/clusters/create | HTTP method: POST
create-cluster-policy.json :
{
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":{\"type\":\"forbidden\",\"hidden\":true}}"
}
{ "policy_id": "ABCD000000000000" }
Request structure
Response structure
Edit
Endpoint: 2.0/policies/clusters/edit | HTTP method: POST
Update an existing policy. This may make some clusters governed by this policy invalid. For such clusters the
next cluster edit must provide a confirming configuration, but otherwise they can continue to run.
Example
edit-cluster-policy.json :
{
"policy_id": "ABCD000000000000",
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":{\"type\":\"forbidden\",\"hidden\":true}}"
}
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Delete
Endpoint: 2.0/policies/clusters/delete | HTTP method: POST
Delete a policy. Clusters governed by this policy can still run, but cannot be edited.
Example
{}
Request structure
Data structures
In this section:
Policy
PolicySortColumn
Policy
A cluster policy entity.
PolicySortColumn
The sort order for the ListPolicies request.
Get permissions
Endpoint: 2.0/preview/permissions/cluster-policies/<clusterPolicyId> | HTTP method: GET
Example
Request structure
Response structure
A Clusters ACL.
Get permission levels
Endpoint: 2.0/preview/permissions/cluster-policies/<clusterPolicyId>/permissionLevels | HTTP method: GET
Example
{
"permission_levels": [
{
"permission_level": "CAN_USE",
"description": "Can use the policy"
}
]
}
Request structure
Response structure
An array of PermissionLevel with associated description.
Add or modify permissions
Endpoint: 2.0/preview/permissions/cluster-policies/<clusterPolicyId> | HTTP method: PATCH
Example
add-cluster-policy-permissions.json :
{
"access_control_list": [
{
"user_name": "someone-else@example.com",
"permission_level": "CAN_USE"
}
]
}
{
"object_id": "/cluster-policies/ABCD000000000000",
"object_type": "cluster-policy",
"access_control_list": [
{
"user_name": "mary@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"user_name": "someone-else@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"group_name": "admins",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": true,
"inherited_from_object": [
"/cluster-policies/"
]
}
]
}
]
}
Request structure
Request body
Response body
A Clusters ACL.
Set or delete permissions
A PUT request replaces all direct permissions on the cluster policy object. You can make delete requests by
making a GET request to retrieve the current list of permissions followed by a PUT request removing entries to
be deleted.
Endpoint: 2.0/preview/permissions/cluster-policies/<clusterPolicyId> | HTTP method: PUT
Example
set-cluster-policy-permissions.json :
{
"access_control_list": [
{
"user_name": "someone@example.com",
"permission_level": "CAN_USE"
}
]
}
{
"object_id": "/cluster-policies/ABCD000000000000",
"object_type": "cluster-policy",
"access_control_list": [
{
"user_name": "someone@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"group_name": "admins",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": true,
"inherited_from_object": [
"/cluster-policies/"
]
}
]
}
]
}
Request structure
Request body
F IEL D N A M E TYPE DESC RIP T IO N
Response body
A Clusters ACL.
Data structures
In this section:
Clusters ACL
AccessControl
Permission
AccessControlInput
PermissionLevel
Clusters ACL
AccessControl
Permission
AccessControlInput
An item representing an ACL rule applied to the principal (user, group, or service principal).
PermissionLevel
Permission level that you can set on a cluster policy.
CAN_USE: Allow user to create clusters based on the policy. The user does not need the cluster create permission.
DBFS API 2.0
7/21/2022 • 9 minutes to read
The DBFS API is a Databricks API that makes it simple to interact with various data sources without having to
include your credentials every time you read a file. See Databricks File System (DBFS) for more information. For
an easy to use command line client of the DBFS API, see Databricks CLI.
NOTE
To ensure high quality of service under heavy load, Azure Databricks is now enforcing API rate limits for DBFS API calls.
Limits are set per workspace to ensure fair usage and high availability. Automatic retries are available using Databricks CLI
version 0.12.0 and above. We advise all customers to switch to the latest Databricks CLI version.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Limitations
Using the DBFS API with firewall enabled storage containers is not supported. Databricks recommends you use
Databricks Connect or az storage.
Add block
Endpoint: 2.0/dbfs/add-block | HTTP method: POST
Append a block of data to the stream specified by the input handle. If the handle does not exist, this call will
throw an exception with RESOURCE_DOES_NOT_EXIST . If the block of data exceeds 1 MB, this call will throw an
exception with MAX_BLOCK_SIZE_EXCEEDED . A typical workflow for file upload would be:
1. Call create and get a handle.
2. Make one or more add-block calls with the handle you have.
3. Call close with the handle you have.
Example
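A representative call (a sketch: the handle comes from a preceding create call, and data is the base64-encoded block to append; the values here are illustrative):

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/add-block \
--data '{ "handle": 1234567890123456, "data": "SGVsbG8sIFdvcmxkIQ==" }'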
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Close
Endpoint: 2.0/dbfs/close | HTTP method: POST
Close the stream specified by the input handle. If the handle does not exist, this call throws an exception with
RESOURCE_DOES_NOT_EXIST . A typical workflow for file upload would be:
Create
Endpoint: 2.0/dbfs/create | HTTP method: POST
Opens a stream to write to a file and returns a handle to this stream. There is a 10 minute idle timeout on this
handle. If a file or directory already exists on the given path and overwrite is set to false, this call throws an
exception with RESOURCE_ALREADY_EXISTS . A typical workflow for file upload would be:
1. Call create and get a handle.
2. Make one or more add-block calls with the handle you have.
3. Call close with the handle you have.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/create \
--data '{ "path": "/tmp/HelloWorld.txt", "overwrite": true }'
{ "handle": 1234567890123456 }
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
Delete
Endpoint: 2.0/dbfs/delete | HTTP method: POST
Delete the file or directory (optionally recursively delete all files in the directory). This call throws an exception
with IO_ERROR if the path is a non-empty directory and recursive is set to false or on other similar errors.
When you delete a large number of files, the delete operation is done in increments. The call returns a response
after approximately 45 seconds with an error message (503 Service Unavailable) asking you to re-invoke the
delete operation until the directory structure is fully deleted. For example:
{
"error_code": "PARTIAL_DELETE",
"message": "The requested operation has deleted 324 files. There are more files remaining. You must make
another request to delete more."
}
For operations that delete more than 10K files, we discourage using the DBFS REST API, but advise you to
perform such operations in the context of a cluster, using the File system utility (dbutils.fs). dbutils.fs covers
the functional scope of the DBFS REST API, but from notebooks. Running such operations using notebooks
provides better control and manageability, such as selective deletes, and the possibility to automate periodic
delete jobs.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/delete \
--data '{ "path": "/tmp/HelloWorld.txt" }'
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Get status
Endpoint: 2.0/dbfs/get-status | HTTP method: GET
Get the file information of a file or directory. If the file or directory does not exist, this call throws an exception
with RESOURCE_DOES_NOT_EXIST .
Example
{
"path": "/tmp/HelloWorld.txt",
"is_dir": false,
"file_size": 13,
"modification_time": 1622054945000
}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
List
Endpoint: 2.0/dbfs/list | HTTP method: GET
List the contents of a directory, or details of the file. If the file or directory does not exist, this call throws an
exception with RESOURCE_DOES_NOT_EXIST .
When calling list on a large directory, the list operation will time out after approximately 60 seconds. We
strongly recommend using list only on directories containing less than 10K files and discourage using the
DBFS REST API for operations that list more than 10K files. Instead, we recommend that you perform such
operations in the context of a cluster, using the File system utility (dbutils.fs), which provides the same
functionality without timing out.
Example
{
"files": [
{
"path": "/tmp/HelloWorld.txt",
"is_dir": false,
"file_size": 13,
"modification_time": 1622054945000
},
{
"..."
}
]
}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
F IEL D N A M E TYPE DESC RIP T IO N
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
Mkdirs
Endpoint: 2.0/dbfs/mkdirs | HTTP method: POST
Create the given directory and necessary parent directories if they do not exist. If there exists a file (not a
directory) at any prefix of the input path, this call throws an exception with RESOURCE_ALREADY_EXISTS . If this
operation fails it may have succeeded in creating some of the necessary parent directories.
Example
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Move
Endpoint: 2.0/dbfs/move | HTTP method: POST
Move a file from one location to another location within DBFS. If the source file does not exist, this call throws an
exception with RESOURCE_DOES_NOT_EXIST . If there already exists a file in the destination path, this call throws an
exception with RESOURCE_ALREADY_EXISTS . If the given source path is a directory, this call always recursively
moves all files.
When moving a large number of files, the API call will time out after approximately 60 seconds, potentially
resulting in partially moved data. Therefore, for operations that move more than 10K files, we strongly
discourage using the DBFS REST API. Instead, we recommend that you perform such operations in the context of
a cluster, using the File system utility (dbutils.fs) from a notebook, which provides the same functionality without
timing out.
Example
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Put
Endpoint: 2.0/dbfs/put | HTTP method: POST
Upload a file through the use of multipart form post. It is mainly used for streaming uploads, but can also be
used as a convenient single call for data upload.
The amount of data that can be passed using the contents parameter is limited to 1 MB if specified as a string (
MAX_BLOCK_SIZE_EXCEEDED is thrown if exceeded) and 2 GB as a file.
Example
To upload a local file named HelloWorld.txt in the current directory:
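One common form of the request (a sketch, using the multipart form fields described above and the authentication pattern used elsewhere in this article):

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/put \
-F path=/tmp/HelloWorld.txt \
-F contents=@HelloWorld.txt \
-F overwrite=true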
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Read
Endpoint: 2.0/dbfs/read | HTTP method: GET
Return the contents of a file. If the file does not exist, this call throws an exception with RESOURCE_DOES_NOT_EXIST .
If the path is a directory, the read length is negative, or if the offset is negative, this call throws an exception with
INVALID_PARAMETER_VALUE . If the read length exceeds 1 MB, this call throws an exception with
MAX_READ_SIZE_EXCEEDED . If offset + length exceeds the number of bytes in a file, reads contents until the end
of file.
Example
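One way to call it, passing the parameters as a JSON body in the same style as the other GET examples in this article (path, offset, and length are the request fields; the values here are chosen to match the response shown below):

curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/read \
--data '{ "path": "/tmp/HelloWorld.txt", "offset": 1, "length": 8 }'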
{
"bytes_read": 8,
"data": "ZWxsbywgV28="
}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
Data structures
In this section:
FileInfo
FileInfo
The attributes of a file or directory.
Databricks SQL Warehouses API 2.0
IMPORTANT
To access Databricks REST APIs, you must authenticate.
To configure individual SQL warehouses, use the SQL Warehouses API. To configure all SQL warehouses, use the
Global SQL Warehouses API.
Requirements
To create SQL warehouses you must have cluster create permission, which is enabled in the Data Science &
Engineering workspace.
To manage a SQL warehouse you must have Can Manage permission in Databricks SQL for the warehouse.
Create
Endpoint: 2.0/sql/warehouses/ | HTTP method: POST
Example request
{
"name": "My SQL warehouse",
"cluster_size": "MEDIUM",
"min_num_clusters": 1,
"max_num_clusters": 10,
"tags": {
"custom_tags": [
{
"key": "mykey",
"value": "myvalue"
}
]
},
"enable_photon": "true",
"channel": {
"name": "CHANNEL_NAME_CURRENT"
}
}
Example response
{
"id": "0123456789abcdef"
}
Delete
Endpoint: 2.0/sql/warehouses/{id} | HTTP method: DELETE
Edit
Endpoint: 2.0/sql/warehouses/{id}/edit | HTTP method: POST
Modify a SQL warehouse. All fields are optional. Missing fields default to the current values.
Example request
{
"name": "My Edited SQL warehouse",
"cluster_size": "LARGE",
"auto_stop_mins": 60
}
Get
Endpoint: 2.0/sql/warehouses/{id} | HTTP method: GET
Example response
{
"id": "7f2629a529869126",
"name": "MyWarehouse",
"size": "SMALL",
"min_num_clusters": 1,
"max_num_clusters": 1,
"auto_stop_mins": 0,
"auto_resume": true,
"num_clusters": 0,
"num_active_sessions": 0,
"state": "STOPPED",
"creator_name": "user@example.com",
"jdbc_url":
"jdbc:spark://hostname.staging.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath
=/sql/1.0/warehouses/7f2629a529869126;",
"odbc_params": {
"hostname": "hostname.cloud.databricks.com",
"path": "/sql/1.0/warehouses/7f2629a529869126",
"protocol": "https",
"port": 443
},
"tags": {
"custom_tags": [
{
"key": "mykey",
"value": "myvalue"
}
]
},
"spot_instance_policy": "COST_OPTIMIZED",
"enable_photon": true,
"cluster_size": "SMALL",
"channel": {
"name": "CHANNEL_NAME_CURRENT"
}
}
List
Endpoint: 2.0/sql/warehouses/ | HTTP method: GET
{
"warehouses": [
{ "id": "123456790abcdef", "name": "My SQL warehouse", "cluster_size": "MEDIUM" },
{ "id": "098765321fedcba", "name": "Another SQL warehouse", "cluster_size": "LARGE" }
]
}
Note: If you use the deprecated 2.0/sql/endpoints/ API, the top-level response field would be “endpoints”
instead of “warehouses”.
Start
Endpoint: 2.0/sql/warehouses/{id}/start | HTTP method: POST
Stop
Endpoint: 2.0/sql/warehouses/{id}/stop | HTTP method: POST
Get the configuration for all SQL warehouses
Endpoint: 2.0/sql/config/warehouses | HTTP method: GET
Example response
{
"security_policy": "DATA_ACCESS_CONTROL",
"data_access_config": [
{
"key": "spark.sql.hive.metastore.jars",
"value": "/databricks/hive_metastore_jars/*"
}
],
"sql_configuration_parameters": {
"configuration_pairs": [
{
"key" : "legacy_time_parser_policy",
"value": "LEGACY"
}
]
}
}
Edit
Edit the configuration for all SQL warehouses.
IMPORTANT
All fields are required.
Invoking this method restarts all running SQL warehouses.
Endpoint: 2.0/sql/config/warehouses | HTTP method: PUT
Example request
{
"data_access_config": [
{
"key": "spark.sql.hive.metastore.jars",
"value": "/databricks/hive_metastore_jars/*"
}
],
"sql_configuration_parameters": {
"configuration_pairs": [
{
"key" : "legacy_time_parser_policy",
"value": "LEGACY"
}
]
}
}
Data structures
In this section:
WarehouseConfPair
WarehouseHealth
WarehouseSecurityPolicy
WarehouseSpotInstancePolicy
WarehouseState
WarehouseStatus
WarehouseTags
WarehouseTagPair
ODBCParams
RepeatedWarehouseConfPairs
Channel
ChannelName
WarehouseConfPair
F IEL D N A M E TYPE DESC RIP T IO N
WarehouseHealth
F IEL D N A M E TYPE DESC RIP T IO N
WarehouseSecurityPolicy
O P T IO N DESC RIP T IO N
WarehouseSpotInstancePolicy
O P T IO N DESC RIP T IO N
COST_OPTIMIZED: Use an on-demand instance for the cluster driver and spot instances for cluster executors. The maximum spot price is 100% of the on-demand price. This is the default policy.
WarehouseState
State of a SQL warehouse. The allowable state transitions are:
STARTING -> STARTING , RUNNING , STOPPING , DELETING
RUNNING -> STOPPING , DELETING
STOPPING -> STOPPED , STARTING
STOPPED -> STARTING , DELETING
DELETING -> DELETED
WarehouseStatus
STAT E DESC RIP T IO N
WarehouseTags
F IEL D N A M E TYPE DESC RIP T IO N
WarehouseTagPair
F IEL D N A M E TYPE DESC RIP T IO N
ODBCParams
F IEL D N A M E TYPE DESC RIP T IO N
RepeatedWarehouseConfPairs
F IEL D N A M E TYPE DESC RIP T IO N
Channel
F IEL D N A M E TYPE DESC RIP T IO N
ChannelName
NAME DESC RIP T IO N
Databricks SQL Queries and Dashboards API 2.0
The Queries and Dashboards API manages queries, results, and dashboards.
This API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification.
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Impor t file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Query History API 2.0
7/21/2022 • 2 minutes to read
The Query History API shows SQL queries performed using Databricks SQL warehouses. You can use this
information to help you debug issues with queries.
This API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification.
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Impor t file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Delta Live Tables API guide
7/21/2022 • 13 minutes to read
The Delta Live Tables API allows you to create, edit, delete, start, and view details about pipelines.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Create a pipeline
Endpoint: 2.0/pipelines | HTTP method: POST
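A representative request, posting the pipeline-settings.json file shown below (this sketch follows the curl -n pattern used by the other Delta Live Tables examples in this article):

curl -n -X POST \
https://<databricks-instance>/api/2.0/pipelines \
--data @pipeline-settings.json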
pipeline-settings.json :
{
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"continuous": false
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file.
Response
{
"pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5"
}
Request structure
See PipelineSettings.
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
Edit a pipeline
Endpoint: 2.0/pipelines/{pipeline_id} | HTTP method: PUT
pipeline-settings.json
{
"id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"target": "wikipedia_quickstart_data",
"continuous": false
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Delete a pipeline
Endpoint: 2.0/pipelines/{pipeline_id} | HTTP method: DELETE
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
2.0/pipelines/{pipeline_id}/updates POST
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
{
"update_id": "a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8"
}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
2.0/pipelines/{pipeline_id}/stop POST
Stops any active pipeline update. If no update is running, this request is a no-op.
Example
This example stops an update for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
2.0/pipelines/{pipeline_id}/events GET
curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/events \
--data '{"max_results": 5}'
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
events An array of pipeline events. The list of events matching the request
criteria.
2.0/pipelines/{pipeline_id} GET
Gets details about a pipeline, including the pipeline settings and recent updates.
Example
This example gets details for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request
curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
2.0/pipelines/{pipeline_id}/updates/{update_id} GET
curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates/9a84f906-fc51-
11eb-9a03-0242ac130003
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
List pipelines
EN DP O IN T H T T P M ET H O D
2.0/pipelines/ GET
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
"notebook='<path>'" to select
pipelines that reference the provided
notebook path.
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
Data structures
In this section:
KeyValue
NotebookLibrary
PipelineLibrary
PipelineSettings
PipelineStateInfo
PipelinesNewCluster
UpdateStateInfo
KeyValue
A key-value pair that specifies configuration parameters.
NotebookLibrary
A specification for a notebook containing pipeline code.
PipelineLibrary
A specification for pipeline dependencies.
PipelineSettings
The settings for a pipeline deployment.
PipelineStateInfo
The state of a pipeline, the status of the most recent updates, and information about associated resources.
PipelinesNewCluster
A pipeline cluster specification.
The Delta Live Tables system sets the following attributes. These attributes cannot be configured by users:
spark_version
init_scripts
Note :
UpdateStateInfo
The current state of a pipeline update.
The Git Credentials API allows users to manage their Git credentials to use Databricks Repos.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
The Git Credentials API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Impor t file to import
and view the downloaded OpenAPI specification.
Global Init Scripts API 2.0
7/21/2022 • 2 minutes to read
Global init scripts are shell scripts that run during startup on each cluster node of every cluster in the workspace, before the Apache Spark driver or worker JVM starts. They can help you enforce consistent cluster configurations across your workspace. Use them carefully because they can cause unanticipated impacts, like library conflicts.
The Global Init Scripts API lets Azure Databricks administrators add global cluster initialization scripts in a secure
and controlled manner. To learn how to add them using the UI, see Configure a cluster-scoped init script using
the DBFS REST API.
The Global Init Scripts API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Impor t file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Groups API 2.0
7/21/2022 • 4 minutes to read
NOTE
You must be an Azure Databricks administrator to invoke this API.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Add member
Endpoint: 2.0/groups/add-member | HTTP method: POST
Add a user or group to a group. This call returns the error RESOURCE_DOES_NOT_EXIST if a user or group with the
given name does not exist, or if a group with the given parent name does not exist.
Examples
To add a user to a group:
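A representative request (a sketch; the user and group names are illustrative). To add a group to a group, pass group_name instead of user_name:

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/add-member \
--data '{ "user_name": "someone@example.com", "parent_name": "reporting-department" }'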
{}
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Create
Endpoint: 2.0/groups/create | HTTP method: POST
Create a new group with the given name. This call returns an error RESOURCE_ALREADY_EXISTS if a group with the
given name already exists.
Example
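A representative request (mirroring the authentication pattern used by the other curl examples in this article):

curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/create \
--data '{ "group_name": "reporting-department" }'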
{ "group_name": "reporting-department" }
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
List members
Endpoint: 2.0/groups/list-members | HTTP method: GET
Return all of the members of a particular group. This call returns the error RESOURCE_DOES_NOT_EXIST if a group
with the given name does not exist. This method is non-recursive; it returns all groups that belong to the given
group but not the principals that belong to those child groups.
Example
curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/list-members \
--data '{ "group_name": "reporting-department" }' \
| jq .
{
"members": [
{
"user_name": "someone@example.com"
}
]
}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
List
Endpoint: 2.0/groups/list | HTTP method: GET
{
"group_names": [
"reporting-department",
"data-ops-read-only",
"admins"
]
}
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
2.0/groups/list-parents GET
Retrieve all groups in which a given user or group is a member. This method is non-recursive; it returns all
groups in which the given user or group is a member but not the groups in which those groups are members.
This call returns the error RESOURCE_DOES_NOT_EXIST if a user or group with the given name does not exist.
Examples
To list groups for a user:
{
"group_names": [
"reporting-department"
]
}
{
"group_names": [
"data-ops-read-only"
]
}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
Remove member
Endpoint: 2.0/groups/remove-member | HTTP method: POST
Remove a user or group from a group. This call returns the error RESOURCE_DOES_NOT_EXIST if a user or group
with the given name does not exist or if a group with the given parent name does not exist.
Examples
To remove a user from a group:
{}
{}
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Delete
Endpoint: 2.0/groups/delete | HTTP method: POST
Remove a group from this organization. This call returns the error RESOURCE_DOES_NOT_EXIST if a group with the
given name does not exist.
Example
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Data structures
In this section:
PrincipalName
PrincipalName
Container type for a name that is either a user name or a group name.
The Instance Pools API allows you to create, edit, delete and list instance pools.
An instance pool reduces cluster start and auto-scaling times by maintaining a set of idle, ready-to-use cloud
instances. When a cluster attached to a pool needs an instance, it first attempts to allocate one of the pool’s idle
instances. If the pool has no idle instances, it expands by allocating a new instance from the instance provider in
order to accommodate the cluster’s request. When a cluster releases an instance, it returns to the pool and is
free for another cluster to use. Only clusters attached to a pool can use that pool’s idle instances.
Azure Databricks does not charge DBUs while instances are idle in the pool. Instance provider billing does apply.
See pricing.
Requirements
You must have permission to attach to the pool; see Pool access control.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Create
Endpoint: 2.0/instance-pools/create | HTTP method: POST
Create an instance pool. Use the returned instance_pool_id to query the status of the instance pool, which
includes the number of instances currently allocated by the instance pool. If you provide the min_idle_instances
parameter, instances are provisioned in the background and are ready to use once the idle_count in the
InstancePoolStats equals the requested minimum.
If your account has Databricks Container Services enabled and the instance pool is created with
preloaded_docker_images , you can use the instance pool to launch clusters with a Docker image. The Docker
image in the instance pool doesn’t have to match the Docker image in the cluster. However, the container
environment of the cluster created on the pool must align with the container environment of the instance pool:
you cannot use an instance pool created with preloaded_docker_images to launch a cluster without a Docker image, and you cannot use an instance pool created without preloaded_docker_images to launch a cluster with a Docker image.
NOTE
Azure Databricks may not be able to acquire some of the requested idle instances due to instance provider limitations or
transient network issues. Clusters can still attach to the instance pool, but may not start as quickly.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/create \
--data @create-instance-pool.json
create-instance-pool.json :
{
"instance_pool_name": "my-pool",
"node_type_id": "Standard_D3_v2",
"min_idle_instances": 10,
"custom_tags": [
{
"key": "my-key",
"value": "my-value"
}
]
}
{ "instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg" }
Request structure
F IEL D N A M E TYPE DESC RIP T IO N
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
Edit
Endpoint: 2.0/instance-pools/edit | HTTP method: POST
Edit an instance pool. This modifies the configuration of an existing instance pool.
NOTE
You can edit only the following values: instance_pool_name , min_idle_instances , max_capacity , and
idle_instance_autotermination_minutes .
You must provide an instance_pool_name value.
Example
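Following the Create example above, the request that submits this file might look like the following sketch:
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/edit \
--data @edit-instance-pool.json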
edit-instance-pool.json :
{
"instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg",
"instance_pool_name": "my-edited-pool",
"min_idle_instances": 5,
"max_capacity": 200,
"idle_instance_autotermination_minutes": 30
}
{}
Request structure
FIELD NAME | TYPE | DESCRIPTION
Delete
ENDPOINT | HTTP METHOD
2.0/instance-pools/delete POST
Delete an instance pool. This permanently deletes the instance pool. The idle instances in the pool are
terminated asynchronously. New clusters cannot attach to the pool. Running clusters attached to the pool
continue to run but cannot autoscale up. Terminated clusters attached to the pool will fail to start until they are
edited to no longer use the pool.
Example
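A request could look like the following sketch; the file name delete-instance-pool.json is illustrative, and instance_pool_id identifies the pool to delete. The empty object below is the response.
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/delete \
--data @delete-instance-pool.json
delete-instance-pool.json :
{
"instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg"
}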
{}
Request structure
FIELD NAME | TYPE | DESCRIPTION
Get
ENDPOINT | HTTP METHOD
2.0/instance-pools/get GET
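Retrieve information about an instance pool. A request like the following sketch would return a response similar to the one shown below; the --get/--data form mirrors other GET examples in this reference, and the instance_pool_id value is illustrative:
curl --netrc --get \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/get \
--data instance_pool_id=101-120000-brick1-pool-ABCD1234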
{
"instance_pool_name": "mypool",
"node_type_id": "Standard_D3_v2",
"custom_tags": {
"my-key": "my-value"
},
"idle_instance_autotermination_minutes": 60,
"enable_elastic_disk": false,
"preloaded_spark_versions": [
"5.4.x-scala2.11"
],
"instance_pool_id": "101-120000-brick1-pool-ABCD1234",
"default_tags": {
"Vendor": "Databricks",
"DatabricksInstancePoolCreatorId": "100125",
"DatabricksInstancePoolId": "101-120000-brick1-pool-ABCD1234"
},
"state": "ACTIVE",
"stats": {
"used_count": 10,
"idle_count": 5,
"pending_used_count": 5,
"pending_idle_count": 5
},
"status": {}
}
Request structure
FIELD NAME | TYPE | DESCRIPTION
Response structure
FIELD NAME | TYPE | DESCRIPTION
preloaded_spark_versions | An array of STRING | A list with the runtime version the pool installs on each instance. Pool clusters that use a preloaded runtime version start faster as they do not have to wait for the image to download. You can retrieve a list of available runtime versions by using the Runtime versions API call.
FIELD NAME | TYPE | DESCRIPTION
* Vendor: Databricks
* DatabricksInstancePoolCreatorId:
<create_user_id>
* DatabricksInstancePoolId:
<instance_pool_id>
List
ENDPOINT | HTTP METHOD
2.0/instance-pools/list GET
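List all instance pools and their statistics. A request might look like the following sketch; the response structure is described below:
curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/list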
Response structure
FIELD NAME | TYPE | DESCRIPTION
Data structures
In this section:
InstancePoolState
InstancePoolStats
InstancePoolStatus
PendingInstanceError
DiskSpec
DiskType
InstancePoolAndStats
AzureDiskVolumeType
InstancePoolAzureAttributes
InstancePoolState
The state of an instance pool. The current allowable state transitions are:
ACTIVE -> DELETED
NAME | DESCRIPTION
InstancePoolStats
Statistics about the usage of the instance pool.
InstancePoolStatus
Status about failed pending instances in the pool.
PendingInstanceError
Error message of a failed pending instance.
DiskSpec
Describes the initial set of disks to attach to each instance. For example, if there are 3 instances and each
instance is configured to start with 2 disks, 100 GiB each, then Azure Databricks creates a total of 6 disks, 100
GiB each, for these instances.
DiskType
Describes the type of disk.
InstancePoolAndStats
FIELD NAME | TYPE | DESCRIPTION
preloaded_spark_versions | An array of STRING | A list with the runtime version the pool installs on each instance. Pool clusters that use a preloaded runtime version start faster as they do not have to wait for the image to download. You can retrieve a list of available runtime versions by using the Runtime versions API call.
* Vendor: Databricks
* DatabricksInstancePoolCreatorId:
<create_user_id>
* DatabricksInstancePoolId:
<instance_pool_id>
InstancePoolAzureAttributes
Attributes set during instance pools creation related to Azure.
spot_bid_max_price DOUBLE The max bid price used for Azure spot
instances. You can set this to greater
than or equal to the current spot price.
You can also set this to -1 (the default),
which specifies that the instance
cannot be evicted on the basis of price.
The price for the instance will be the
current price for spot instances or the
price for a standard instance. You can
view historical pricing and eviction
rates in the Azure portal.
IP Access List API 2.0
7/21/2022 • 2 minutes to read
Azure Databricks workspaces can be configured so that employees connect to the service only through existing
corporate networks with a secure perimeter. Azure Databricks customers can use the IP access lists feature to
define a set of approved IP addresses. All incoming access to the web application and REST APIs requires the
user to connect from an authorized IP address.
For more details about this feature and examples of how to use this API, see IP access lists.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
The IP Access List API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Jobs API 2.1
7/21/2022 • 2 minutes to read
The Jobs API allows you to programmatically manage Azure Databricks jobs. See Jobs.
The Jobs API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Libraries API 2.0
7/21/2022 • 7 minutes to read
The Libraries API allows you to install and uninstall libraries and get the status of libraries on a cluster.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
2.0/libraries/all-cluster-statuses GET
Get the status of all libraries on all clusters. A status will be available for all libraries installed on clusters via the
API or the libraries UI as well as libraries set to be installed on all clusters via the libraries UI. If a library has been
set to be installed on all clusters, is_library_for_all_clusters will be true , even if the library was also installed
on this specific cluster.
Example
Request
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Response structure
FIELD NAME | TYPE | DESCRIPTION
Cluster status
ENDPOINT | HTTP METHOD
2.0/libraries/cluster-status GET
Get the status of libraries on a cluster. A status will be available for all libraries installed on the cluster via the API
or the libraries UI as well as libraries set to be installed on all clusters via the libraries UI. If a library has been set
to be installed on all clusters, is_library_for_all_clusters will be true , even if the library was also installed on
the cluster.
Example
Request
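One possible form passes the cluster ID as a query string (a sketch; it mirrors the alternative shown next):
curl --netrc \
'https://<databricks-instance>/api/2.0/libraries/cluster-status?cluster_id=<cluster-id>' \
| jq .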
Or:
curl --netrc --get \
https://<databricks-instance>/api/2.0/libraries/cluster-status \
--data cluster_id=<cluster-id> \
| jq .
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<cluster-id> with the Azure Databricks workspace ID of the cluster, for example 1234-567890-example123 .
{
"cluster_id": "11203-my-cluster",
"library_statuses": [
{
"library": {
"jar": "dbfs:/mnt/libraries/library.jar"
},
"status": "INSTALLED",
"messages": [],
"is_library_for_all_clusters": false
},
{
"library": {
"pypi": {
"package": "beautifulsoup4"
}
},
"status": "INSTALLING",
"messages": ["Successfully resolved package from PyPI"],
"is_library_for_all_clusters": false
},
{
"library": {
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
},
"status": "FAILED",
"messages": ["R package installation is not supported on this spark version.\nPlease upgrade to
Runtime 3.2 or higher"],
"is_library_for_all_clusters": false
}
]
}
Request structure
FIELD NAME | TYPE | DESCRIPTION
Response structure
FIELD NAME | TYPE | DESCRIPTION
Install
ENDPOINT | HTTP METHOD
2.0/libraries/install POST
Install libraries on a cluster. The installation is asynchronous - it completes in the background after the request.
IMPORTANT
This call will fail if the cluster is terminated.
Installing a wheel library on a cluster is like running the pip command against the wheel file directly on driver
and executors. All the dependencies specified in the library setup.py file are installed and this requires the
library name to satisfy the wheel file name convention.
The installation on the executors happens only when a new task is launched. With Databricks Runtime 7.1 and
below, the installation order of libraries is nondeterministic. For wheel libraries, you can ensure a deterministic
installation order by creating a zip file with suffix .wheelhouse.zip that includes all the wheel files.
Example
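The request that submits this file might look like the following sketch:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/libraries/install \
--data @install-libraries.json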
install-libraries.json :
{
"cluster_id": "10201-my-cluster",
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"whl": "dbfs:/mnt/libraries/mlflow-0.0.1.dev0-py2-none-any.whl"
},
{
"whl": "dbfs:/mnt/libraries/wheel-libraries.wheelhouse.zip"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": ["slf4j:slf4j"]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "https://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of install-libraries.json with fields that are appropriate for your solution.
Uninstall
ENDPOINT | HTTP METHOD
2.0/libraries/uninstall POST
Set libraries to be uninstalled on a cluster. The libraries aren’t uninstalled until the cluster is restarted.
Uninstalling libraries that are not installed on the cluster has no impact but is not an error.
Example
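The request that submits this file might look like the following sketch:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/libraries/uninstall \
--data @uninstall-libraries.json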
uninstall-libraries.json :
{
"cluster_id": "10201-my-cluster",
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"cran": "ada"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of uninstall-libraries.json with fields that are appropriate for your solution.
Data structures
In this section:
ClusterLibraryStatuses
Library
LibraryFullStatus
MavenLibrary
PythonPyPiLibrary
RCranLibrary
LibraryInstallStatus
ClusterLibraryStatuses
FIELD NAME | TYPE | DESCRIPTION
Library
FIELD NAME | TYPE | DESCRIPTION
jar OR egg OR whl OR pypi OR maven OR cran | STRING OR STRING OR STRING OR PythonPyPiLibrary OR MavenLibrary OR RCranLibrary | If jar, URI of the JAR to be installed. DBFS and ADLS ( abfss ) URIs are supported. For example: { "jar": "dbfs:/mnt/databricks/library.jar" } or { "jar": "abfss://my-bucket/library.jar" } . If ADLS is used, make sure the cluster has read access on the library.
LibraryFullStatus
The status of the library on a specific cluster.
FIELD NAME | TYPE | DESCRIPTION
messages | An array of STRING | All the info and warning messages that have occurred so far for this library.
MavenLibrary
FIELD NAME | TYPE | DESCRIPTION
PythonPyPiLibrary
FIELD NAME | TYPE | DESCRIPTION
RCranLibrary
FIELD NAME | TYPE | DESCRIPTION
LibraryInstallStatus
The status of a library on a specific cluster.
PENDING No action has yet been taken to install the library. This state
should be very short lived.
UNINSTALL_ON_RESTART The library has been marked for removal. Libraries can be
removed only when clusters are restarted, so libraries that
enter this state will remain until the cluster is restarted.
MLflow API 2.0
7/21/2022 • 2 minutes to read
Azure Databricks provides a managed version of the MLflow tracking server and the Model Registry, which host
the MLflow REST API. You can invoke the MLflow REST API using URLs of the form
https://<databricks-instance>/api/2.0/mlflow/<api-endpoint>
replacing <databricks-instance> with the workspace URL of your Azure Databricks deployment.
MLflow compatibility matrix lists the MLflow release packaged in each Databricks Runtime version and a link to
the respective documentation.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Rate limits
The MLflow APIs are rate limited as four groups, based on their function and maximum throughput. The
following is the list of API groups and their respective limits in qps (queries per second):
Low throughput experiment management (list, update, delete, restore): 7 qps
Search runs: 7 qps
Log batch: 47 qps
All other APIs: 127 qps
In addition, there is a limit of 20 concurrent model versions in Pending status (in creation) per workspace.
If the rate limit is reached, subsequent API calls will return status code 429. All MLflow clients (including the UI)
automatically retry 429s with an exponential backoff.
API reference
The MLflow API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Permissions API 2.0
7/21/2022 • 2 minutes to read
IMPORTANT
This feature is in Public Preview.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Repos API 2.0
7/21/2022 • 2 minutes to read
The Repos API allows you to manage Databricks repos programmatically. See Git integration with Databricks
Repos.
The Repos API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
SCIM API 2.0
7/21/2022 • 2 minutes to read
IMPORTANT
This feature is in Public Preview.
Azure Databricks supports SCIM, or System for Cross-domain Identity Management, an open standard that
allows you to automate user provisioning using a REST API and JSON. The Azure Databricks SCIM API follows
version 2.0 of the SCIM protocol.
Requirements
Your Azure Databricks account must have the Premium Plan.
https://<databricks-instance>/api/2.0/preview/scim/v2/<api-endpoint>
Header parameters
PARAMETER | TYPE | DESCRIPTION
machine <databricks-instance>
login token
password <access-token>
Filter results
Use filters to return a subset of users or groups. For all users, the user userName and group displayName fields
are supported. Admin users can filter users on the active attribute.
Sort results
Sort results using the sortBy and sortOrder query parameters. The default is to sort by ID.
IMPORTANT
This feature is in Public Preview.
Requirements
Your Azure Databricks account must have the Premium Plan.
Get me
ENDPOINT | HTTP METHOD
2.0/preview/scim/v2/Me GET
Retrieve the same information about yourself as returned by Get user by ID.
Example
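A request might look like the following sketch:
curl --netrc -X GET \
https://<databricks-instance>/api/2.0/preview/scim/v2/Me \
| jq .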
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
IMPORTANT
This feature is in Public Preview.
An Azure Databricks administrator can invoke all SCIM API endpoints. Non-admin users can invoke the Get
users endpoint to read user display names and IDs.
NOTE
Each workspace can have a maximum of 10,000 users and 5,000 groups. Service principals count toward the user
maximum.
SCIM (Users) lets you create users in Azure Databricks and give them the proper level of access, temporarily lock
and unlock user accounts, and remove access for users (deprovision them) when they leave your organization
or no longer need access to Azure Databricks.
For error codes, see SCIM API 2.0 Error Codes.
Requirements
Your Azure Databricks account must have the Premium Plan.
Get users
ENDPOINT | HTTP METHOD
2.0/preview/scim/v2/Users GET
Admin users: Retrieve a list of all users in the Azure Databricks workspace.
Non-admin users: Retrieve a list of all users in the Azure Databricks workspace, returning username, user
display name, and object ID only.
Examples
This example gets information about all users.
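A minimal request for all users might look like this sketch:
curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/scim/v2/Users" \
| jq .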
This example uses the eq (equals) filter query parameter with userName to get information about a specific
user.
curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/scim/v2/Users?filter=userName+eq+<username>" \
| jq .
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<username> with the Azure Databricks workspace username of the user, for example someone@example.com .
Get user by ID
ENDPOINT | HTTP METHOD
2.0/preview/scim/v2/Users/{id} GET
Admin users: Retrieve a single user resource from the Azure Databricks workspace, given their Azure Databricks
ID.
Example
Request
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.
Response
Create user
ENDPOINT | HTTP METHOD
2.0/preview/scim/v2/Users POST
Example
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users \
--header 'Content-type: application/scim+json' \
--data @create-user.json \
| jq .
create-user.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:User" ],
"userName": "<username>",
"groups": [
{
"value":"123456"
}
],
"entitlements":[
{
"value":"allow-cluster-create"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<username> with the Azure Databricks workspace username of the user, for example someone@example.com .
2.0/preview/scim/v2/Users/{id} PATCH
Admin users: Update a user resource with operations on specific attributes, except those that are immutable (
userName and userId ). The PATCH method is recommended over the PUT method for setting or updating user
entitlements.
Request parameters follow the standard SCIM 2.0 protocol and depend on the value of the schemas attribute.
Example
This example adds the allow-cluster-create entitlement to the specified user.
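A request might look like the following sketch, reusing the Content-type header shown in the Create user example:
curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id> \
--header 'Content-type: application/scim+json' \
--data @update-user.json \
| jq .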
update-user.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "add",
"path": "entitlements",
"value": [
{
"value": "allow-cluster-create"
}
]
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.
2.0/preview/scim/v2/Users/{id} PUT
Admin users: Overwrite the user resource across multiple attributes, except those that are immutable ( userName
and userId ).
Request must include the schemas attribute, set to urn:ietf:params:scim:schemas:core:2.0:User .
NOTE
The PATCH method is recommended over the PUT method for setting or updating user entitlements.
Example
This example changes the specified user’s previous entitlements to now have only the allow-cluster-create
entitlement.
overwrite-user.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:User" ],
"userName": "<username>",
"entitlements": [
{
"value": "allow-cluster-create"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
<username> with the Azure Databricks workspace username of the user, for example someone@example.com . To
get the username, call Get users.
This example uses a .netrc file and jq.
Delete user by ID
ENDPOINT | HTTP METHOD
2.0/preview/scim/v2/Users/{id} DELETE
Admin users: Remove a user resource. A user that does not own or belong to a workspace in Azure Databricks is
automatically purged after 30 days.
Deleting a user from a workspace also removes objects associated with the user. For example, notebooks are
archived, clusters are terminated, and jobs become ownerless.
The user’s home directory is not automatically deleted. Only an administrator can access or remove a deleted
user’s home directory.
The access control list (ACL) configuration of a user is preserved even after that user is removed from a
workspace.
Example request
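A request might look like the following sketch:
curl --netrc -X DELETE \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id>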
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file.
ENDPOINT | HTTP METHOD
2.0/preview/scim/v2/Users/{id} PATCH
Admin users: Activate or deactivate a user. Deactivating a user removes all access to a workspace for that user
but leaves permissions and objects associated with the user unchanged. Clusters associated with the user keep
running, and notebooks remain in their original locations. The user’s tokens are retained but cannot be used to
authenticate while the user is deactivated. Scheduled jobs, however, fail unless assigned to a new owner.
You can use the Get users and Get user by ID requests to view whether users are active or inactive.
NOTE
Allow at least five minutes for the cache to be cleared for deactivation to take effect.
IMPORTANT
An Azure Active Directory (Azure AD) user with the Contributor or Owner role on the Azure Databricks subscription
can reactivate themselves using the Azure AD login flow. If a user with one of these roles needs to be deactivated, you
should also revoke their privileges on the subscription.
Set the active value to false to deactivate a user and true to activate a user.
Example
Request
toggle-user-activation.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "replace",
"path": "active",
"value": [
{
"value": "false"
}
]
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 .
{
"emails": [
{
"type": "work",
"value": "someone@example.com",
"primary": true
}
],
"displayName": "Someone User",
"schemas": [
"urn:ietf:params:scim:schemas:core:2.0:User",
"urn:ietf:params:scim:schemas:extension:workspace:2.0:User"
],
"name": {
"familyName": "User",
"givenName": "Someone"
},
"active": false,
"groups": [],
"id": "123456",
"userName": "someone@example.com"
}
ENDPOINT | HTTP METHOD
2.0/preview/scim/v2/Users GET
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Admin users: Deactivate users that have not logged in for a customizable period. Scheduled jobs owned by a
user are also considered activity.
ENDPOINT | HTTP METHOD
2.0/preview/workspace-conf PATCH
The request body is a key-value pair where the value is the time limit for how long a user can be inactive before
being automatically deactivated.
Example
deactivate-users.json :
{
"maxUserInactiveDays": "90"
}
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Admin users: Retrieve the user inactivity limit defined for a workspace.
ENDPOINT | HTTP METHOD
2.0/preview/workspace-conf GET
Example request
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file and jq.
Example response
{
"maxUserInactiveDays": "90"
}
SCIM API 2.0 (Groups)
7/21/2022 • 3 minutes to read
IMPORTANT
This feature is in Public Preview.
Requirements
Your Azure Databricks account must have the Premium Plan.
NOTE
An Azure Databricks administrator can invoke all SCIM API endpoints.
Non-admin users can invoke the Get groups endpoint to read group display names and IDs.
You can have no more than 10,000 users and 5,000 groups in a workspace.
SCIM (Groups) lets you create users and groups in Azure Databricks, give them the proper level of access, and
remove access for groups (deprovision them).
For error codes, see SCIM API 2.0 Error Codes.
Get groups
ENDPOINT | HTTP METHOD
2.0/preview/scim/v2/Groups GET
Admin users: Retrieve a list of all groups in the Azure Databricks workspace.
Non-admin users: Retrieve a list of all groups in the Azure Databricks workspace, returning group display name
and object ID only.
Examples
You can use filters to specify subsets of groups. For example, you can apply the sw (starts with) filter parameter
to displayName to retrieve a specific group or set of groups. This example retrieves all groups with a
displayName field that starts with my- .
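A request along these lines should work (a sketch; the filter query string mirrors the userName example in the Users section):
curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/scim/v2/Groups?filter=displayName+sw+my-" \
| jq .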
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file and jq.
Get group by ID
ENDPOINT | HTTP METHOD
2.0/preview/scim/v2/Groups/{id} GET
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
Create group
ENDPOINT | HTTP METHOD
2.0/preview/scim/v2/Groups POST
Members list is optional and can include users and other groups. You can also add members to a group using
PATCH .
Example
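A request could look like this sketch, mirroring the Create user example:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/preview/scim/v2/Groups \
--header 'Content-type: application/scim+json' \
--data @create-group.json \
| jq .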
create-group.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:Group" ],
"displayName": "<group-name>",
"members": [
{
"value":"<user-id>"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-name> with the name of the group in the Azure Databricks workspace, for example my-group .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.
Update group
ENDPOINT | HTTP METHOD
2.0/preview/scim/v2/Groups/{id} PATCH
Admin users: Update a group in Azure Databricks by adding or removing members. Can add and remove
individual members or groups within the group.
Request parameters follow the standard SCIM 2.0 protocol and depend on the value of the schemas attribute.
NOTE
Azure Databricks does not support updating group names.
Example
Add to group
update-group.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op":"add",
"value": {
"members": [
{
"value":"<user-id>"
}
]
}
}
]
}
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "remove",
"path": "members[value eq \"<user-id>\"]"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.
Delete group
ENDPOINT | HTTP METHOD
2.0/preview/scim/v2/Groups/{id} DELETE
Admin users: Remove a group from Azure Databricks. Users in the group are not removed.
Example
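A request might look like the following sketch:
curl --netrc -X DELETE \
https://<databricks-instance>/api/2.0/preview/scim/v2/Groups/<group-id>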
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file.
SCIM API 2.0 (ServicePrincipals)
7/21/2022 • 6 minutes to read
IMPORTANT
This feature is in Public Preview.
SCIM (ServicePrincipals) lets you manage Azure Active Directory service principals in Azure Databricks.
For error codes, see SCIM API 2.0 Error Codes.
For additional examples, see Service principals for Azure Databricks automation.
Requirements
Your Azure Databricks account must have the Premium Plan.
2.0/preview/scim/v2/ServicePrincipals GET
You can use filters to specify subsets of service principals. For example, you can apply the eq (equals) filter
parameter to applicationId to retrieve a specific service principal:
In workspaces with a large number of service principals, you can exclude attributes from the request to improve
performance.
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<application-id> with the applicationId value of the service principal, for example
12345a67-8b9c-0d1e-23fa-4567b89cde01 .
2.0/preview/scim/v2/ServicePrincipals/{id} GET
Retrieve a single service principal resource from the Azure Databricks workspace, given a service principal ID.
Example
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.
2.0/preview/scim/v2/ServicePrincipals POST
Add an Azure Active Directory (Azure AD) service principal to the Azure Databricks workspace. In Azure
Databricks, you must create an application in Azure Active Directory and then add it to your Azure Databricks
workspace to use as a service principal. Service principals count toward the limit of 10000 users per workspace.
Request parameters follow the standard SCIM 2.0 protocol.
Example
add-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"applicationId": "<azure-application-id>",
"displayName": "<display-name>",
"groups": [
{
"value": "<group-id>"
}
],
"entitlements": [
{
"value":"allow-cluster-create"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<azure-application-id> with the application ID of the Azure Active Directory (Azure AD) application, for
example 12345a67-8b9c-0d1e-23fa-4567b89cde01
<display-name> with the display name of the service principal, for example someone@example.com .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
2.0/preview/scim/v2/ServicePrincipals/{id} PATCH
Update a service principal resource with operations on specific attributes, except for applicationId and id ,
which are immutable.
Use the PATCH method to add, update, or remove individual attributes. Use the PUT method to overwrite the
entire service principal in a single operation.
Request parameters follow the standard SCIM 2.0 protocol and depend on the value of the schemas attribute.
Add entitlements
Example
change-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "add",
"path": "entitlements",
"value": [
{
"value": "allow-cluster-create"
}
]
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.
Remove entitlements
Example
change-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "remove",
"path": "entitlements",
"value": [
{
"value": "allow-cluster-create"
}
]
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.
Add to a group
Example
change-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "add",
"path": "groups",
"value": [
{
"value": "<group-id>"
}
]
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
Remove from a group
Example
remove-from-group.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "remove",
"path": "members[value eq \"<service-principal-id>\"]"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.
2.0/preview/scim/v2/ServicePrincipals/{id} PUT
Overwrite the entire service principal resource, except for applicationId and id , which are immutable.
Use the PATCH method to add, update, or remove individual attributes.
IMPORTANT
You must include the schemas attribute in the request, with the exact value
urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal .
Examples
Add an entitlement
update-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"applicationId": "<appliation-id>",
"displayName": "<display-name>",
"groups": [
{
"value": "<group-id>"
}
],
"entitlements": [
{
"value":"allow-cluster-create"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
<application-id> with the applicationId value of the service principal, for example
12345a67-8b9c-0d1e-23fa-4567b89cde01 .
<display-name> with the display name of the service principal, for example someone@example.com .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
Remove all entitlements and groups
Removing all entitlements and groups is a reversible alternative to deactivating the service principal.
Use the PUT method to avoid the need to check the existing entitlements and group memberships first.
update-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"applicationId": "<application-id>",
"displayName": "<display-name>",
"groups": [],
"entitlements": []
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
<application-id> with the applicationId value of the service principal, for example
12345a67-8b9c-0d1e-23fa-4567b89cde01 .
<display-name> with the display name of the service principal, for example someone@example.com .
2.0/preview/scim/v2/ServicePrincipals/{id} DELETE
The Secrets API allows you to manage secrets, secret scopes, and access permissions. To manage secrets, you
must:
1. Create a secret scope.
2. Add your secrets to the scope.
3. If you have the Premium Plan, assign access control to the secret scope.
To learn more about creating and managing secrets, see Secret management and Secret access control. You
access and reference secrets in notebooks and jobs by using Secrets utility (dbutils.secrets).
IMPORTANT
To access Databricks REST APIs, you must authenticate. To use the Secrets API with Azure Key Vault secrets, you must
authenticate using an Azure Active Directory token.
2.0/secrets/scopes/create POST
{
"scope": "my-simple-azure-keyvault-scope",
"scope_backend_type": "AZURE_KEYVAULT",
"backend_azure_keyvault":
{
"resource_id": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/azure-
rg/providers/Microsoft.KeyVault/vaults/my-azure-kv",
"dns_name": "https://my-azure-kv.vault.azure.net/"
},
"initial_manage_principal": "users"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<token> with your Azure Databricks personal access token. For more information, see Authentication using
Azure Databricks personal access tokens.
<management-token> with your Azure Active Directory token. For more information, see Get Azure AD tokens
by using the Microsoft Authentication Library.
The contents of create-scope.json with fields that are appropriate for your solution.
This example uses a .netrc file.
If initial_manage_principal is specified, the initial ACL applied to the scope is applied to the supplied principal
(user, service principal, or group) with MANAGE permissions. The only supported principal for this option is the
group users , which contains all users in the workspace. If initial_manage_principal is not specified, the initial
ACL with MANAGE permission applied to the scope is assigned to the API request issuer’s user identity.
Throws RESOURCE_ALREADY_EXISTS if a scope with the given name already exists. Throws RESOURCE_LIMIT_EXCEEDED
if maximum number of scopes in the workspace is exceeded. Throws INVALID_PARAMETER_VALUE if the scope
name is invalid.
For more information, see Create an Azure Key Vault-backed secret scope using the Databricks CLI.
Create a Databricks-backed secret scope
The scope name:
Must be unique within a workspace.
Must consist of alphanumeric characters, dashes, underscores, and periods, and may not exceed 128
characters.
The names are considered non-sensitive and are readable by all users in the workspace. A workspace is limited
to a maximum of 100 secret scopes.
Example
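The request that submits this file might look like the following sketch:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/secrets/scopes/create \
--data @create-scope.json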
create-scope.json :
{
"scope": "my-simple-databricks-scope",
"initial_manage_principal": "users"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of create-scope.json with fields that are appropriate for your solution.
2.0/secrets/scopes/delete POST
delete-scope.json :
{
"scope": "my-secret-scope"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of delete-scope.json with fields that are appropriate for your solution.
This example uses a .netrc file.
Throws RESOURCE_DOES_NOT_EXIST if the scope does not exist. Throws PERMISSION_DENIED if the user does not
have permission to make this API call.
Request structure
FIELD NAME | TYPE | DESCRIPTION
2.0/secrets/scopes/list GET
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
{
"scopes": [
{
"name": "my-databricks-scope",
"backend_type": "DATABRICKS"
},
{
"name": "mount-points",
"backend_type": "DATABRICKS"
}
]
}
Throws PERMISSION_DENIED if you do not have permission to make this API call.
Response structure
FIELD NAME | TYPE | DESCRIPTION
ENDPOINT | HTTP METHOD
2.0/secrets/put POST
Insert a secret under the provided scope with the given name. If a secret already exists with the same name, this
command overwrites the existing secret’s value. The server encrypts the secret using the secret scope’s
encryption settings before storing it. You must have WRITE or MANAGE permission on the secret scope.
The secret key must consist of alphanumeric characters, dashes, underscores, and periods, and cannot exceed
128 characters. The maximum allowed secret value size is 128 KB. The maximum number of secrets in a given
scope is 1000.
You can read a secret value only from within a command on a cluster (for example, through a notebook); there is
no API to read a secret value outside of a cluster. The permission applied is based on who is invoking the
command and you must have at least READ permission.
Example
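The request that submits this file might look like the following sketch:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/secrets/put \
--data @put-secret.json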
put-secret.json :
{
"scope": "my-databricks-scope",
"key": "my-string-key",
"string_value": "my-value"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of put-secret.json with fields that are appropriate for your solution.
Delete secret
The method for deleting a secret depends on the type of scope backend. To delete a secret from a scope backed
by Azure Key Vault, use the Azure SetSecret REST API. To delete a secret from a Databricks-backed scope, use the
following endpoint:
ENDPOINT | HTTP METHOD
2.0/secrets/delete POST
Delete the secret stored in this secret scope. You must have WRITE or MANAGE permission on the secret scope.
Example
delete-secret.json :
{
"scope": "my-secret-scope",
"key": "my-secret-key"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of delete-secret.json with fields that are appropriate for your solution.
List secrets
ENDPOINT | HTTP METHOD
2.0/secrets/list GET
List the secret keys that are stored at this scope. This is a metadata-only operation; you cannot retrieve secret
data using this API. You must have READ permission to make this call.
Example
Request
Or:
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<scope-name> with the name of the secrets scope, for example my-scope .
{
"secrets": [
{
"key": "my-string-key",
"last_updated_timestamp": 1520467595000
},
{
"key": "my-byte-key",
"last_updated_timestamp": 1520467595000
}
]
}
Response structure
FIELD NAME | TYPE | DESCRIPTION
2.0/secrets/acls/put POST
Create or overwrite the ACL associated with the given principal (user, service principal, or group) on the
specified scope point. In general, a user, service principal, or group will use the most powerful permission
available to them, and permissions are ordered as follows:
MANAGE - Allowed to change ACLs, and read and write to this secret scope.
WRITE - Allowed to read and write to this secret scope.
READ - Allowed to read this secret scope and list what secrets are available.
put-secret-acl.json :
{
"scope": "my-secret-scope",
"principal": "data-scientists",
"permission": "READ"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of put-secret-acl.json with fields that are appropriate for your solution.
Request structure
FIELD NAME | TYPE | DESCRIPTION
2.0/secrets/acls/delete POST
delete-secret-acl.json :
{
"scope": "my-secret-scope",
"principal": "data-scientists"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of delete-secret-acl.json with fields that are appropriate for your solution.
2.0/secrets/acls/get GET
Describe the details about the given ACL, such as the group and permission.
You must have the MANAGE permission to invoke this API.
Example
Request
Or:
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<scope-name> with the name of the secrets scope, for example my-scope .
<principal-name> with the name of the principal, for example users .
{
"principal": "data-scientists",
"permission": "READ"
}
Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws PERMISSION_DENIED if you do not have
permission to make this API call.
Request structure
FIELD NAME | TYPE | DESCRIPTION
Response structure
FIELD NAME | TYPE | DESCRIPTION
2.0/secrets/acls/list GET
Or:
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<scope-name> with the name of the secrets scope, for example my-scope .
Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws PERMISSION_DENIED if you do not have
permission to make this API call.
Request structure
FIELD NAME | TYPE | DESCRIPTION
Response structure
FIELD NAME | TYPE | DESCRIPTION
Data structures
In this section:
AclItem
SecretMetadata
SecretScope
AclPermission
ScopeBackendType
AclItem
An item representing an ACL rule applied to the given principal (user, service principal, or group) on the
associated scope point.
SecretMetadata
The metadata about a secret. Returned when listing secrets. Does not contain the actual secret value.
FIELD NAME | TYPE | DESCRIPTION
SecretScope
An organizational resource for storing secrets. Secret scopes can be different types, and ACLs can be applied to
control permissions for all secrets within a scope.
AclPermission
The ACL permission levels for secret ACLs applied to secret scopes.
ScopeBackendType
The type of secret scope backend.
The Token API allows you to create, list, and revoke tokens that can be used to authenticate and access Azure
Databricks REST APIs.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Create
ENDPOINT | HTTP METHOD
2.0/token/create POST
Create and return a token. This call returns the error QUOTA_EXCEEDED if the current number of non-expired
tokens exceeds the token quota. The token quota for a user is 600.
Example
Request
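A request could look like the following sketch; the file name create-token.json is illustrative, and it assumes the create call accepts lifetime_seconds and comment fields matching the values described in the Replace notes below:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/token/create \
--data @create-token.json
create-token.json :
{
"lifetime_seconds": 7776000,
"comment": "This is an example token"
}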
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This is an example token with a description to attach to the token.
7776000 with the lifetime of the token, in seconds. This example specifies 90 days.
{
"token_value": "dapi1a2b3c45d67890e1f234567a8bc9012d",
"token_info": {
"token_id": "1234567890a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c3",
"creation_time": 1626286601651,
"expiry_time": 1634062601651,
"comment": "This is an example token"
}
}
Request structure
FIELD NAME | TYPE | DESCRIPTION
Response structure
FIELD NAME | TYPE | DESCRIPTION
List
ENDPOINT | HTTP METHOD
2.0/token/list GET
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
{
"token_infos": [
{
"token_id": "1234567890a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c3",
"creation_time": 1626286601651,
"expiry_time": 1634062601651,
"comment": "This is an example token"
},
{
"token_id": "2345678901a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c4",
"creation_time": 1626286906596,
"expiry_time": 1634062906596,
"comment": "This is another example token"
}
]
}
Response structure
FIELD NAME | TYPE | DESCRIPTION
token_infos | An array of Public token info | A list of token information for a user-workspace pair.
Revoke
ENDPOINT | HTTP METHOD
2.0/token/delete POST
Revoke an access token. This call returns the error RESOURCE_DOES_NOT_EXIST if a token with the specified ID is not
valid.
Example
Request structure
FIELD NAME | TYPE | DESCRIPTION
Data structures
In this section:
Public token info
Public token info
A data structure that describes the public metadata of an access token.
The Token Management API lets Azure Databricks administrators manage their users’ Azure Databricks personal
access tokens. As an admin, you can:
Monitor and revoke users’ personal access tokens.
Control the lifetime of future tokens in your workspace.
You can also control which users can create and use tokens via the Permissions API 2.0 or in the Admin Console.
The Token Management API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Workspace API 2.0
7/21/2022 • 6 minutes to read
The Workspace API allows you to list, import, export, and delete notebooks and folders. The maximum allowed
size of a request to the Workspace API is 10MB. See Cluster log delivery examples for a how to guide on this
API.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Delete
ENDPOINT | HTTP METHOD
2.0/workspace/delete POST
Delete an object or a directory (and optionally recursively deletes all objects in the directory). If path does not
exist, this call returns an error RESOURCE_DOES_NOT_EXIST . If path is a non-empty directory and recursive is set
to false , this call returns an error DIRECTORY_NOT_EMPTY . Object deletion cannot be undone and deleting a
directory recursively is not atomic.
Example
Request:
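A request might look like the following sketch; the path value is illustrative, and recursive controls whether the contents of a directory are also deleted:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/workspace/delete \
--data '{ "path": "/Users/me@example.com/MyFolder/MyNotebook", "recursive": false }'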
Export
ENDPOINT | HTTP METHOD
2.0/workspace/export GET
Export a notebook or contents of an entire directory. You can also export a Databricks Repo, or a notebook or
directory from a Databricks Repo. You cannot export non-notebook files from a Databricks Repo. If path does
not exist, this call returns an error RESOURCE_DOES_NOT_EXIST . You can export a directory only in DBC format. If
the exported data exceeds the size limit, this call returns an error MAX_NOTEBOOK_SIZE_EXCEEDED . This API does not
support exporting a library.
Example
Request:
Response:
If the direct_download field was set to false or was omitted from the request, a base64-encoded version of the
content is returned, for example:
{
"content": "Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKMSsx",
}
Otherwise, if direct_download was set to true in the request, the content is downloaded.
Request structure
FIELD NAME | TYPE | DESCRIPTION
Response structure
FIELD NAME | TYPE | DESCRIPTION
Get status
ENDPOINT | HTTP METHOD
2.0/workspace/get-status GET
Gets the status of an object or a directory. If path does not exist, this call returns an error
RESOURCE_DOES_NOT_EXIST .
Example
Request:
Response:
{
"object_type": "NOTEBOOK",
"path": "/Users/me@example.com/MyFolder/MyNotebook",
"language": "PYTHON",
"object_id": 123456789012345
}
Request structure
FIELD NAME | TYPE | DESCRIPTION
Response structure
FIELD NAME | TYPE | DESCRIPTION
Import
ENDPOINT | HTTP METHOD
2.0/workspace/import POST
Import a notebook or the contents of an entire directory. If path already exists and overwrite is set to false ,
this call returns an error RESOURCE_ALREADY_EXISTS . You can use only DBC format to import a directory.
Example
Import a base64-encoded string:
List
ENDPOINT | HTTP METHOD
2.0/workspace/list GET
List the contents of a directory, or the object if it is not a directory. If the input path does not exist, this call
returns an error RESOURCE_DOES_NOT_EXIST .
Example
List directories and their contents:
Request:
Response:
{
"objects": [
{
"path": "/Users/me@example.com/MyFolder",
"object_type": "DIRECTORY",
"object_id": 234567890123456
},
{
"path": "/Users/me@example.com/MyFolder/MyNotebook",
"object_type": "NOTEBOOK",
"language": "PYTHON",
"object_id": 123456789012345
},
{
"..."
}
]
}
List repos:
Response:
{
"objects": [
{
"path": "/Repos/me@example.com/MyRepo1",
"object_type": "REPO",
"object_id": 234567890123456
},
{
"path": "/Repos/me@example.com/MyRepo2",
"object_type": "REPO",
"object_id": 123456789012345
},
{
"..."
}
]
}
Request structure
FIELD NAME | TYPE | DESCRIPTION
Response structure
FIELD NAME | TYPE | DESCRIPTION
2.0/workspace/mkdirs POST
Create the given directory and necessary parent directories if they do not exist. If there exists an object (not a
directory) at any prefix of the input path, this call returns an error RESOURCE_ALREADY_EXISTS . If this operation fails
it may have succeeded in creating some of the necessary parent directories.
Example
Request:
Data structures
In this section:
ObjectInfo
ExportFormat
Language
ObjectType
ObjectInfo
Information about an object in the workspace, returned by list and get-status .
FORMAT | DESCRIPTION
Language
The language of notebook.
R R notebook.
ObjectType
The type of the object in workspace.
NOTEBOOK Notebook
DIRECTORY Directory
LIBRARY Library
REPO Repository
REST API 1.2
7/21/2022 • 9 minutes to read
The Databricks REST API allows you to programmatically access Azure Databricks instead of going through the
web UI.
This article covers REST API 1.2. The latest version of the REST API, as well as REST API 2.1 and 2.0, is also available.
IMPORTANT
Use the Clusters API 2.0 for managing clusters programmatically and the Libraries API 2.0 for managing libraries
programmatically.
The 1.2 Create an execution context and Run a command APIs continue to be supported.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
API categories
Execution context: create unique variable namespaces where Spark commands can be called.
Command execution: run commands within a specific execution context.
Details
This REST API runs over HTTPS.
For retrieving information, use HTTP GET.
For modifying state, use HTTP POST.
For file upload, use multipart/form-data . Otherwise use application/json .
The response content type is JSON.
Basic authentication is used to authenticate the user for every API call.
User credentials are base64 encoded and are in the HTTP header for every API call. For example,
Authorization: Basic YWRtaW46YWRtaW4= . If you use curl , alternatively you can store user credentials in a
.netrc file.
For more information about using the Databricks REST API, see the Databricks REST API reference.
Get started
To try out the examples in this article, replace <databricks-instance> with the workspace URL of your Azure
Databricks deployment.
The following examples use curl and a .netrc file. You can adapt these curl examples with an HTTP library in
your programming language of choice.
API reference
Get the list of clusters
Get information about a cluster
Restart a cluster
Create an execution context
Get information about an execution context
Delete an execution context
Run a command
Get information about a command
Cancel a command
Get the list of libraries for a cluster
Upload a library to a cluster
Get the list of clusters
Method and path:
GET /api/1.2/clusters/list
Example
Request:
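A request might look like the following sketch, using the .netrc authentication described above:
curl --netrc -X GET \
https://<databricks-instance>/api/1.2/clusters/list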
Response:
[
{
"id": "1234-567890-span123",
"name": "MyCluster",
"status": "Terminated",
"driverIp": "",
"jdbcPort": 10000,
"numWorkers":0
},
{
"..."
}
]
Request schema
None.
Response schema
An array of objects, with each object representing information about a cluster as follows:
FIELD
id
Type: string
name
Type: string
status
Type: string
* Error
* Pending
* Reconfiguring
* Restarting
* Running
* Terminated
* Terminating
* Unknown
driverIp
Type: string
jdbcPort
Type: number
numWorkers
Type: number
Example
Request:
Response:
{
"id": "1234-567890-span123",
"name": "MyCluster",
"status": "Terminated",
"driverIp": "",
"jdbcPort": 10000,
"numWorkers": 0
}
Request schema
FIELD
clusterId
Type: string
Response schema
An object that represents information about the cluster.
FIELD
id
Type: string
name
Type: string
status
Type: string
* Error
* Pending
* Reconfiguring
* Restarting
* Running
* Terminated
* Terminating
* Unknown
driverIp
Type: string
jdbcPort
Type: number
numWorkers
Type: number
Restart a cluster
Method and path:
POST /api/1.2/clusters/restart
Example
Request:
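The request is not shown; a minimal sketch (the cluster ID is illustrative):
curl --netrc -X POST \
https://<databricks-instance>/api/1.2/clusters/restart \
--data '{ "clusterId": "1234-567890-span123" }'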
Response:
{
"id": "1234-567890-span123"
}
Request schema
FIELD
clusterId
Type: string
Response schema
FIELD
id
Type: string
Example
Request:
Response:
{
"id": "1234567890123456789"
}
Request schema
FIELD
clusterId
Type: string
language
Type: string
* python
* scala
* sql
Response schema
FIELD
id
Type: string
Example
Request:
Response:
{
"id": "1234567890123456789",
"status": "Running"
}
Request schema
FIELD
clusterId
Type: string
contextId
Type: string
Response schema
FIELD
id
Type: string
status
Type: string
* Error
* Pending
* Running
Example
Request:
Response:
{
"id": "1234567890123456789"
}
Request schema
FIELD
clusterId
Type: string
contextId
Type: string
Response schema
FIELD
id
Type: string
Run a command
Method and path:
POST /api/1.2/commands/execute
Example
Request:
execute-command.json :
{
"clusterId": "1234-567890-span123",
"contextId": "1234567890123456789",
"language": "python",
"command": "print('Hello, World!')"
}
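A hedged sketch of posting this file, assuming .netrc authentication:
curl --netrc -X POST \
-H 'Content-Type: application/json' \
--data @execute-command.json \
https://<databricks-instance>/api/1.2/commands/execute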
Response:
{
"id": "1234ab56-7890-1cde-234f-5abcdef67890"
}
Request schema
FIELD
clusterId
Type: string
contextId
Type: string
language
Type: string
command
Type: string
commandFile
Type: string
options
Type: string
An optional map of values used downstream. For example, a displayRowLimit override (used in testing).
Response schema
FIELD
id
Type: string
Example
Request:
Response:
{
"id": "1234ab56-7890-1cde-234f-5abcdef67890",
"status": "Finished",
"results": {
"resultType": "text",
"data": "Hello, World!"
}
}
Request schema
FIELD
clusterId
Type: string
contextId
Type: string
commandId
Type: string
Response schema
FIELD
id
Type: string
status
Type: string
* Cancelled
* Cancelling
* Error
* Finished
* Queued
* Running
results
Type: object
* error
* image
* images
* table
* text
For error :
For image :
For images :
For table :
* isJsonSchema : true if a JSON schema is returned instead of a string representation of the Hive type. Type: true /
false
For text :
Cancel a command
Method and path:
POST /api/1.2/commands/cancel
Example
Request:
Response:
{
"id": "1234ab56-7890-1cde-234f-5abcdef67890"
}
Request schema
FIELD
clusterId
Type: string
contextId
Type: string
The ID of the execution context that is associated with the command to cancel.
commandId
Type: string
Response schema
FIELD
id
Type: string
IMPORTANT
This operation is deprecated. Use the Cluster status operation in the Libraries API instead.
Example
Request:
Request schema
FIELD
clusterId
Type: string
Response schema
An array of objects, with each object representing information about a library as follows:
FIELD
name
Type: string
status
Type: string
* LibraryError
* LibraryLoaded
* LibraryPending
IMPORTANT
This operation is deprecated. Use the Install operation in the Libraries API instead.
Request schema
FIELD
clusterId
Type: string
name
Type: string
language
Type: string
uri
Type: string
Response schema
Information about the uploaded library.
FIELD
language
Type: string
uri
Type: string
Additional examples
The following additional examples provide commands that you can use with curl or adapt with an HTTP
library in your programming language of choice.
Create an execution context
Run a command
Upload and run a Spark JAR
Create an execution context
Create an execution context on a specified cluster for a given programming language:
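The request itself is not shown; a minimal sketch, assuming the standard contexts/create path and an illustrative cluster ID:
curl --netrc -X POST \
https://<databricks-instance>/api/1.2/contexts/create \
--data '{ "language": "python", "clusterId": "1234-567890-span123" }'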
Run a command
Known limitations: command execution does not support %run .
Run a command string:
Run a file:
{
"id": "1234567890123456789"
}
{
"id": "1234ab56-7890-1cde-234f-5abcdef67890"
}
3. Check on the status of your command. It may not return immediately if you are running a lengthy Spark
job.
{
"id": "1234ab56-7890-1cde-234f-5abcdef67890",
"results": {
"data": "Content Size Avg: 1234, Min: 1234, Max: 1234",
"resultType": "text"
},
"status": "Finished"
}
To authenticate to and access Databricks REST APIs, you can use Azure Databricks personal access tokens or
Azure Active Directory (Azure AD) tokens.
This article discusses how to use Azure Databricks personal access tokens. For Azure AD tokens, see
Authenticate using Azure Active Directory tokens.
IMPORTANT
Tokens replace passwords in an authentication flow and should be protected like passwords. To protect tokens, Databricks
recommends that you store tokens in:
Secret management and retrieve tokens in notebooks using the Secrets utility (dbutils.secrets).
A local key store and use the Python keyring package to retrieve tokens at runtime.
NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.
Requirements
Token-based authentication is enabled by default for all Azure Databricks accounts launched after January 2018.
If token-based authentication is disabled, your administrator must enable it before you can perform the tasks
described in Manage personal access tokens.
1. Click Settings in the lower left corner of your Azure Databricks workspace.
2. Click User Settings .
3. Go to the Access Tokens tab.
4. Click the Generate New Token button.
5. Optionally enter a description (comment) and expiration period.
6. Click the Generate button.
7. Copy the generated token and store in a secure location.
1. Click Settings in the lower left corner of your Azure Databricks workspace.
2. Click User Settings .
3. Go to the Access Tokens tab.
4. Click x for the token you want to revoke.
5. On the Revoke Token dialog, click the Revoke Token button.
machine <databricks-instance>
login token
password <token-value>
where:
<databricks-instance> is the instance ID portion of the workspace URL for your Azure Databricks
deployment. For example, if the workspace URL is https://adb-1234567890123456.7.azuredatabricks.net then
<databricks-instance> is adb-1234567890123456.7.azuredatabricks.net .
token is the literal string token .
<token-value> is the value of your token, for example dapi1234567890ab1cde2f3ab456c7d89efa .
For multiple machine/token entries, add one line per entry, with the machine , login and password properties
for each machine/token matching pair on the same line. The result looks like this:
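For example (hostnames and token values are illustrative):
machine adb-1234567890123456.7.azuredatabricks.net login token password dapi1234567890ab1cde2f3ab456c7d89efa
machine adb-2345678901234567.8.azuredatabricks.net login token password dapi2345678901ab1cde2f3ab456c7d89efa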
This example invokes the .netrc file by using --netrc (you can also use -n ) in the curl command. It uses
the specified workspace URL to find the matching machine entry in the .netrc file.
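A minimal sketch of such a call (the endpoint is illustrative):
curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list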
export DATABRICKS_TOKEN=dapi1234567890ab1cde2f3ab456c7d89efa
This article contains examples that demonstrate how to use the Azure Databricks REST API.
In the following examples, replace <databricks-instance> with the workspace URL of your Azure Databricks
deployment. <databricks-instance> should start with adb- . Do not use the deprecated regional URL starting
with <azure-region-name> . It may not work for new workspaces, will be less reliable, and will exhibit lower
performance than per-workspace URLs.
Authentication
NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.
To learn how to authenticate to the REST API, review Authentication using Azure Databricks personal access
tokens and Authenticate using Azure Active Directory tokens.
The examples in this article assume you are using Azure Databricks personal access tokens. In the following
examples, replace <your-token> with your personal access token. The curl examples assume that you store
Azure Databricks API credentials under .netrc. The Python examples use Bearer authentication. Although the
examples show storing the token in the code, to use credentials safely in Azure Databricks, we
recommend that you follow the Secret management user guide.
For examples that use Authenticate using Azure Active Directory tokens, see the articles in that section.
DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
BASE_URL = 'https://%s/api/2.0/dbfs/' % (DOMAIN)
The following example shows how to launch a Python 3 cluster using the Databricks REST API and the requests
Python HTTP library. This example uses Databricks REST API version 2.0.
import requests
DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
response = requests.post(
'https://%s/api/2.0/clusters/create' % (DOMAIN),
headers={'Authorization': 'Bearer %s' % TOKEN},
json={
"cluster_name": "my-cluster",
"spark_version": "5.5.x-scala2.11",
"node_type_id": "Standard_D3_v2",
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3",
}
}
)
if response.status_code == 200:
print(response.json()['cluster_id'])
else:
print("Error launching cluster: %s: %s" % (response.json()["error_code"], response.json()["message"]))
Create a High Concurrency cluster
The following example shows how to launch a High Concurrency mode cluster using the Databricks REST API.
This example uses Databricks REST API version 2.0.
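The request is not reproduced here; the following is a hedged sketch of what such a request might look like. The spark_conf and custom_tags values reflect the typical High Concurrency cluster profile and are assumptions, as are the cluster name and node type:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/clusters/create \
--data '{
  "cluster_name": "high-concurrency-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_D3_v2",
  "spark_conf": {
    "spark.databricks.cluster.profile": "serverless",
    "spark.databricks.repl.allowedLanguages": "sql,python,r"
  },
  "custom_tags": {
    "ResourceClass": "Serverless"
  },
  "autoscale": {
    "min_workers": 1,
    "max_workers": 2
  }
}'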
Databricks Light
curl -n -X POST -H 'Content-Type: application/json' -d
'{
"name": "SparkPi Python job",
"new_cluster": {
"spark_version": "apache-spark-2.4.x-scala2.11",
"node_type_id": "Standard_D3_v2",
"num_workers": 2
},
"spark_python_task": {
"python_file": "dbfs:/docs/pi.py",
"parameters": [
"10"
]
}
}' https://<databricks-instance>/api/2.0/jobs/create
curl -n \
-X POST -H 'Content-Type: application/json' \
-d '{
"name": "SparkPi spark-submit job",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 2
},
"spark_submit_task": {
"parameters": [
"--class",
"org.apache.spark.examples.SparkPi",
"dbfs:/docs/sparkpi.jar",
"10"
]
}
}' https://<databricks-instance>/api/2.0/jobs/create
If the code uses SparkR, it must first install the package. Databricks Runtime contains the SparkR source
code. Install the SparkR package from its local directory as shown in the following example:
install.packages("/databricks/spark/R/pkg", repos = NULL)
library(SparkR)
sparkR.session()
n <- nrow(createDataFrame(iris))
write.csv(n, "/dbfs/path/to/num_rows.csv")
Databricks Runtime installs the latest version of sparklyr from CRAN. If the code uses sparklyr, you must
specify the Spark master URL in spark_connect . To form the Spark master URL, use the SPARK_LOCAL_IP
environment variable to get the IP, and use the default port 7077. For example:
library(sparklyr)
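# The rest of the sparklyr example is not shown above; a sketch of the remaining
# steps (paths are illustrative):
master <- paste("spark://", Sys.getenv("SPARK_LOCAL_IP"), ":7077", sep = "")
sc <- spark_connect(master)                  # connect using the Spark master URL
iris_tbl <- copy_to(sc, iris)                # copy a local data frame to Spark
n <- sdf_nrow(iris_tbl)                      # count rows with sparklyr
write.csv(n, "/dbfs/path/to/num_rows.csv")   # write the result to DBFS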
curl -n \
-X POST -H 'Content-Type: application/json' \
-d '{
"name": "R script spark-submit job",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 2
},
"spark_submit_task": {
"parameters": [ "dbfs:/path/to/your_code.R" ]
}
}' https://<databricks-instance>/api/2.0/jobs/create
This returns a job-id that you can then use to run the job.
3. Run the job using the job-id .
curl -n \
-X POST -H 'Content-Type: application/json' \
-d '{ "job_id": <job-id> }' https://<databricks-instance>/api/2.0/jobs/run-now
curl -n \
-F filedata=@"SparkPi-assembly-0.1.jar" \
-F path="/docs/sparkpi.jar" \
-F overwrite=true \
https://<databricks-instance>/api/2.0/dbfs/put
curl -n https://<databricks-instance>/api/2.0/clusters/spark-versions
This example uses 7.3.x-scala2.12 . See Runtime version strings for more information about Spark
cluster versions.
4. Create the job. The JAR is specified as a library and the main class name is referenced in the Spark JAR
task.
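The create request is not shown; a hedged sketch, reusing the JAR path, class name, and runtime version from the surrounding steps (the job name is illustrative):
curl -n -X POST -H 'Content-Type: application/json' \
-d '{
  "name": "SparkPi JAR job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2
  },
  "libraries": [ { "jar": "dbfs:/docs/sparkpi.jar" } ],
  "spark_jar_task": {
    "main_class_name": "org.apache.spark.examples.SparkPi",
    "parameters": [ "10" ]
  }
}' https://<databricks-instance>/api/2.0/jobs/create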
This returns a job-id that you can then use to run the job.
5. Run the job using run now :
curl -n \
-X POST -H 'Content-Type: application/json' \
-d '{ "job_id": <job-id> }' https://<databricks-instance>/api/2.0/jobs/run-now
curl -n https://<databricks-instance>/api/2.0/jobs/runs/get?run_id=<run-id> | jq
8. To view the job output, visit the job run details page.
{"cluster_id":"1111-223344-abc55"}
After cluster creation, Azure Databricks syncs log files to the destination every 5 minutes. It uploads driver logs
to dbfs:/logs/1111-223344-abc55/driver and executor logs to dbfs:/logs/1111-223344-abc55/executor .
Check log delivery status
You can retrieve cluster information with log delivery status via API. This example uses Databricks REST API
version 2.0.
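The call itself is not shown; a minimal sketch (the cluster ID matches the example above):
curl --netrc -X GET \
'https://<databricks-instance>/api/2.0/clusters/get?cluster_id=1111-223344-abc55'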
If the latest batch of log upload was successful, the response should contain only the timestamp of the last
attempt:
{
"cluster_log_status": {
"last_attempted": 1479338561
}
}
{
"cluster_log_status": {
"last_attempted": 1479338561,
"last_exception": "Exception: Access Denied ..."
}
}
Workspace examples
Here are some examples for using the Workspace API to list, get info about, create, delete, export, and import
workspace objects.
List a notebook or a folder
The following cURL command lists a path in the workspace. This example uses Databricks REST API version 2.0.
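A minimal sketch of that command, assuming .netrc authentication (the path is illustrative):
curl --netrc -X GET \
https://<databricks-instance>/api/2.0/workspace/list \
--data '{ "path": "/Users/user@example.com/" }'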
{
"objects": [
{
"object_type": "DIRECTORY",
"path": "/Users/user@example.com/folder"
},
{
"object_type": "NOTEBOOK",
"language": "PYTHON",
"path": "/Users/user@example.com/notebook1"
},
{
"object_type": "NOTEBOOK",
"language": "SCALA",
"path": "/Users/user@example.com/notebook2"
}
]
}
If the path is a notebook, the response contains an array containing the status of the input notebook.
Get information about a notebook or a folder
The following cURL command gets the status of a path in the workspace. This example uses Databricks REST API
version 2.0.
curl -n -X GET -H 'Content-Type: application/json' -d \
'{
"path": "/Users/user@example.com/"
}' https://<databricks-instance>/api/2.0/workspace/get-status
{
"object_type": "DIRECTORY",
"path": "/Users/user@example.com"
}
Create a folder
The following cURL command creates a folder. It creates the folder recursively like mkdir -p . If the folder
already exists, it will do nothing and succeed. This example uses Databricks REST API version 2.0.
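The mkdirs command is not shown; a minimal sketch (the path is illustrative):
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/workspace/mkdirs \
--data '{ "path": "/Users/user@example.com/new-folder" }'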
curl -n -X GET \
-d '{ "path": "/Users/user@example.com/notebook", "format": "SOURCE" }' \
https://<databricks-instance>/api/2.0/workspace/export
{
"content":
"Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKcHJpbnQoImhlbGxvLCB3b3JsZCIpCgovLyBDT01NQU5EIC0tLS0tLS0tLS0KCg=="
}
The Databricks REST API allows for programmatic management of various Azure Databricks resources. This
article provides links to version 2.1 of each API.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
For general usage notes about the Databricks REST API, see Databricks REST API reference.
The REST API latest version, as well as REST API 2.0 and 1.2, are also available.
Jobs API 2.1
Authentication using Azure Databricks personal
access tokens
7/21/2022 • 3 minutes to read
To authenticate to and access Databricks REST APIs, you can use Azure Databricks personal access tokens or
Azure Active Directory (Azure AD) tokens.
This article discusses how to use Azure Databricks personal access tokens. For Azure AD tokens, see
Authenticate using Azure Active Directory tokens.
IMPORTANT
Tokens replace passwords in an authentication flow and should be protected like passwords. To protect tokens, Databricks
recommends that you store tokens in:
Secret management and retrieve tokens in notebooks using the Secrets utility (dbutils.secrets).
A local key store and use the Python keyring package to retrieve tokens at runtime.
NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.
Requirements
Token-based authentication is enabled by default for all Azure Databricks accounts launched after January 2018.
If token-based authentication is disabled, your administrator must enable it before you can perform the tasks
described in Manage personal access tokens.
1. Click Settings in the lower left corner of your Azure Databricks workspace.
2. Click User Settings .
3. Go to the Access Tokens tab.
4. Click the Generate New Token button.
5. Optionally enter a description (comment) and expiration period.
6. Click the Generate button.
7. Copy the generated token and store in a secure location.
1. Click Settings in the lower left corner of your Azure Databricks workspace.
2. Click User Settings .
3. Go to the Access Tokens tab.
4. Click x for the token you want to revoke.
5. On the Revoke Token dialog, click the Revoke Token button.
machine <databricks-instance>
login token
password <token-value>
where:
<databricks-instance> is the instance ID portion of the workspace URL for your Azure Databricks
deployment. For example, if the workspace URL is https://adb-1234567890123456.7.azuredatabricks.net then
<databricks-instance> is adb-1234567890123456.7.azuredatabricks.net .
token is the literal string token .
<token-value> is the value of your token, for example dapi1234567890ab1cde2f3ab456c7d89efa .
For multiple machine/token entries, add one line per entry, with the machine , login and password properties
for each machine/token matching pair on the same line. The result looks like this:
This example invokes the .netrc file by using --netrc (you can also use -n ) in the curl command. It uses
the specified workspace URL to find the matching machine entry in the .netrc file.
export DATABRICKS_TOKEN=dapi1234567890ab1cde2f3ab456c7d89efa
The Jobs API allows you to programmatically manage Azure Databricks jobs. See Jobs.
The Jobs API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
REST API 2.0
7/21/2022 • 2 minutes to read
The Databricks REST API allows for programmatic management of various Azure Databricks resources. This
article provides links to version 2.0 of each API.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
For general usage notes about the Databricks REST API, see Databricks REST API reference.
The REST API latest version, as well as REST API 2.1 and 1.2, are also available.
Clusters API 2.0
Cluster Policies API 2.0
Databricks SQL Warehouses API 2.0
Databricks SQL Queries and Dashboards API 2.0
Databricks SQL Query History API 2.0
DBFS API 2.0
Delta Live Tables API 2.0
Git Credentials API 2.0
Global Init Scripts API 2.0
Groups API 2.0
Instance Pools API 2.0
IP Access List API 2.0
Jobs API 2.0
Libraries API 2.0
MLflow API 2.0
Permissions API 2.0
Repos API 2.0
SCIM API 2.0
Secrets API 2.0
Token API 2.0
Token Management API 2.0
Workspace API 2.0
Authentication using Azure Databricks personal
access tokens
7/21/2022 • 3 minutes to read
To authenticate to and access Databricks REST APIs, you can use Azure Databricks personal access tokens or
Azure Active Directory (Azure AD) tokens.
This article discusses how to use Azure Databricks personal access tokens. For Azure AD tokens, see
Authenticate using Azure Active Directory tokens.
IMPORTANT
Tokens replace passwords in an authentication flow and should be protected like passwords. To protect tokens, Databricks
recommends that you store tokens in:
Secret management and retrieve tokens in notebooks using the Secrets utility (dbutils.secrets).
A local key store and use the Python keyring package to retrieve tokens at runtime.
NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.
Requirements
Token-based authentication is enabled by default for all Azure Databricks accounts launched after January 2018.
If token-based authentication is disabled, your administrator must enable it before you can perform the tasks
described in Manage personal access tokens.
1. Click Settings in the lower left corner of your Azure Databricks workspace.
2. Click User Settings .
3. Go to the Access Tokens tab.
4. Click the Generate New Token button.
5. Optionally enter a description (comment) and expiration period.
6. Click the Generate button.
7. Copy the generated token and store in a secure location.
1. Click Settings in the lower left corner of your Azure Databricks workspace.
2. Click User Settings .
3. Go to the Access Tokens tab.
4. Click x for the token you want to revoke.
5. On the Revoke Token dialog, click the Revoke Token button.
machine <databricks-instance>
login token
password <token-value>
where:
<databricks-instance> is the instance ID portion of the workspace URL for your Azure Databricks
deployment. For example, if the workspace URL is https://adb-1234567890123456.7.azuredatabricks.net then
<databricks-instance> is adb-1234567890123456.7.azuredatabricks.net .
token is the literal string token .
<token-value> is the value of your token, for example dapi1234567890ab1cde2f3ab456c7d89efa .
For multiple machine/token entries, add one line per entry, with the machine , login and password properties
for each machine/token matching pair on the same line. The result looks like this:
This example invokes the .netrc file by using --netrc (you can also use -n ) in the curl command. It uses
the specified workspace URL to find the matching machine entry in the .netrc file.
export DATABRICKS_TOKEN=dapi1234567890ab1cde2f3ab456c7d89efa
The Clusters API allows you to create, start, edit, list, terminate, and delete clusters. The maximum allowed size of
a request to the Clusters API is 10MB.
Cluster lifecycle methods require a cluster ID, which is returned from Create. To obtain a list of clusters, invoke
List.
Azure Databricks maps cluster node instance types to compute units known as DBUs. See the instance type
pricing page for a list of the supported instance types and their corresponding DBUs. For instance provider
information, see Azure instance type specifications and pricing.
Azure Databricks always provides one year’s deprecation notice before ceasing support for an instance type.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Create
ENDPOINT  HTTP METHOD
2.0/clusters/create POST
Create a new Apache Spark cluster. This method acquires new instances from the cloud provider if necessary.
This method is asynchronous; the returned cluster_id can be used to poll the cluster state. When this method
returns, the cluster is in a PENDING state. The cluster is usable once it enters a RUNNING state. See ClusterState.
NOTE
Azure Databricks may not be able to acquire some of the requested nodes, due to cloud provider limitations or transient
network issues. If it is unable to acquire a sufficient number of the requested nodes, cluster creation will terminate with an
informative error message.
Examples
create-cluster.json :
{
"cluster_name": "my-cluster",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"spark_conf": {
"spark.speculation": true
},
"num_workers": 25
}
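A hedged sketch of posting this file with curl, assuming .netrc authentication:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/clusters/create \
--data @create-cluster.json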
{ "cluster_id": "1234-567890-undid123" }
Here is an example for an autoscaling cluster. This cluster will start with two nodes, the minimum.
create-cluster.json :
{
"cluster_name": "autoscaling-cluster",
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"autoscale" : {
"min_workers": 2,
"max_workers": 50
}
}
{ "cluster_id": "1234-567890-hared123" }
This example creates a Single Node cluster. To create a Single Node cluster:
Set spark_conf and custom_tags to the exact values in the example.
Set num_workers to 0 .
create-cluster.json :
{
"cluster_name": "single-node-cluster",
"spark_version": "7.6.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"num_workers": 0,
"spark_conf": {
"spark.databricks.cluster.profile": "singleNode",
"spark.master": "local[*]"
},
"custom_tags": {
"ResourceClass": "SingleNode"
}
}
{ "cluster_id": "1234-567890-pouch123" }
To create a job or submit a run with a new cluster using a policy, set policy_id to the policy ID:
create-cluster.json :
{
"num_workers": null,
"autoscale": {
"min_workers": 2,
"max_workers": 8
},
"cluster_name": "my-cluster",
"spark_version": "7.3.x-scala2.12",
"spark_conf": {},
"node_type_id": "Standard_D3_v2",
"custom_tags": {},
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"autotermination_minutes": 120,
"init_scripts": [],
"policy_id": "C65B864F02000008"
}
create-job.json :
{
"run_name": "my spark task",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10,
"policy_id": "ABCD000000000000"
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}
Note :
Response structure
FIELD NAME  TYPE  DESCRIPTION
Edit
ENDPOINT  HTTP METHOD
2.0/clusters/edit POST
Edit the configuration of a cluster to match the provided attributes and size.
You can edit a cluster if it is in a RUNNING or TERMINATED state. If you edit a cluster while it is in a RUNNING state,
it will be restarted so that the new attributes can take effect. If you edit a cluster while it is in a TERMINATED state,
it will remain TERMINATED . The next time it is started using the clusters/start API, the new attributes will take
effect. An attempt to edit a cluster in any other state will be rejected with an INVALID_STATE error code.
Clusters created by the Databricks Jobs service cannot be edited.
Example
edit-cluster.json :
{
"cluster_id": "1202-211320-brick1",
"num_workers": 10,
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2"
}
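A hedged sketch of posting this file:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/clusters/edit \
--data @edit-cluster.json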
{}
Request structure
FIELD NAME  TYPE  DESCRIPTION
Start
ENDPOINT  HTTP METHOD
2.0/clusters/start POST
Start a terminated cluster given its ID. This is similar to createCluster , except:
The terminated cluster ID and attributes are preserved.
The cluster starts with the last specified cluster size. If the terminated cluster is an autoscaling cluster, the
cluster starts with the minimum number of nodes.
If the cluster is in the RESTARTING state, a 400 error is returned.
You cannot start a cluster launched to run a job.
Example
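The request is not shown; a minimal sketch (the cluster ID is illustrative):
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/clusters/start \
--data '{ "cluster_id": "1234-567890-reef123" }'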
{}
Request structure
FIELD NAME  TYPE  DESCRIPTION
Restart
ENDPOINT  HTTP METHOD
2.0/clusters/restart POST
Restart a cluster given its ID. The cluster must be in the RUNNING state.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/restart \
--data '{ "cluster_id": "1234-567890-reef123" }'
{}
Request structure
FIELD NAME  TYPE  DESCRIPTION
Resize
ENDPOINT  HTTP METHOD
2.0/clusters/resize POST
Resize a cluster to have a desired number of workers. The cluster must be in the RUNNING state.
Example
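The request is not shown; a minimal sketch (cluster ID and worker count are illustrative):
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/clusters/resize \
--data '{ "cluster_id": "1234-567890-reef123", "num_workers": 30 }'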
{}
Request structure
FIELD NAME  TYPE  DESCRIPTION
Delete (terminate)
ENDPOINT  HTTP METHOD
2.0/clusters/delete POST
Terminate a cluster given its ID. The cluster is removed asynchronously. Once the termination has completed, the
cluster will be in the TERMINATED state. If the cluster is already in a TERMINATING or TERMINATED state, nothing
will happen.
Unless a cluster is pinned, 30 days after the cluster is terminated, it is permanently deleted.
Example
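The request is not shown; a minimal sketch (the cluster ID is illustrative):
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/clusters/delete \
--data '{ "cluster_id": "1234-567890-reef123" }'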
{}
Request structure
FIELD NAME  TYPE  DESCRIPTION
Permanent delete
ENDPOINT  HTTP METHOD
2.0/clusters/permanent-delete POST
Permanently delete a cluster. If the cluster is running, it is terminated and its resources are asynchronously
removed. If the cluster is terminated, then it is immediately removed.
You cannot perform any action, including retrieving the cluster’s permissions, on a permanently deleted cluster. A
permanently deleted cluster is also no longer returned in the cluster list.
Example
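The request is not shown; a minimal sketch (the cluster ID is illustrative):
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/clusters/permanent-delete \
--data '{ "cluster_id": "1234-567890-reef123" }'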
{}
Request structure
FIELD NAME  TYPE  DESCRIPTION
Get
ENDPOINT  HTTP METHOD
2.0/clusters/get GET
Retrieve the information for a cluster given its identifier. Clusters can be described while they are running or up
to 30 days after they are terminated.
Example
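The request is not shown; a minimal sketch (the cluster ID is illustrative):
curl --netrc -X GET \
'https://<databricks-instance>/api/2.0/clusters/get?cluster_id=1234-567890-reef123'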
Request structure
FIELD NAME  TYPE  DESCRIPTION
Response structure
FIELD NAME  TYPE  DESCRIPTION
Note :
* Vendor: Databricks
* Creator:
* ClusterName:
* ClusterId:
* Name:
On job clusters:
* RunName:
* JobId:
On resources used by Databricks SQL:
* SqlWarehouseId:
Pin
NOTE
You must be an Azure Databricks administrator to invoke this API.
ENDPOINT  HTTP METHOD
2.0/clusters/pin POST
Ensure that an all-purpose cluster configuration is retained even after a cluster has been terminated for more
than 30 days. Pinning ensures that the cluster is always returned by the List API. Pinning a cluster that is already
pinned has no effect.
Example
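The request is not shown; a minimal sketch (the cluster ID is illustrative):
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/clusters/pin \
--data '{ "cluster_id": "1234-567890-reef123" }'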
{}
Request structure
FIELD NAME  TYPE  DESCRIPTION
Unpin
NOTE
You must be an Azure Databricks administrator to invoke this API.
ENDPOINT  HTTP METHOD
2.0/clusters/unpin POST
Allows the cluster to eventually be removed from the list returned by the List API. Unpinning a cluster that is not
pinned has no effect.
Example
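The request is not shown; a minimal sketch (the cluster ID is illustrative):
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/clusters/unpin \
--data '{ "cluster_id": "1234-567890-reef123" }'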
{}
Request structure
FIELD NAME  TYPE  DESCRIPTION
List
ENDPOINT  HTTP METHOD
2.0/clusters/list GET
Return information about all pinned clusters, active clusters, up to 200 of the most recently terminated all-
purpose clusters in the past 30 days, and up to 30 of the most recently terminated job clusters in the past 30
days. For example, if there is 1 pinned cluster, 4 active clusters, 45 terminated all-purpose clusters in the past 30
days, and 50 terminated job clusters in the past 30 days, then this API returns the 1 pinned cluster, 4 active
clusters, all 45 terminated all-purpose clusters, and the 30 most recently terminated job clusters.
Example
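A minimal sketch, assuming .netrc authentication:
curl --netrc -X GET \
https://<databricks-instance>/api/2.0/clusters/list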
Response structure
FIELD NAME  TYPE  DESCRIPTION
List node types
ENDPOINT  HTTP METHOD
2.0/clusters/list-node-types GET
Return a list of supported Spark node types. These node types can be used to launch a cluster.
Example
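A minimal sketch, assuming .netrc authentication:
curl --netrc -X GET \
https://<databricks-instance>/api/2.0/clusters/list-node-types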
{
"node_types": [
{
"node_type_id": "Standard_L80s_v2",
"memory_mb": 655360,
"num_cores": 80,
"description": "Standard_L80s_v2",
"instance_type_id": "Standard_L80s_v2",
"is_deprecated": false,
"category": "Storage Optimized",
"support_ebs_volumes": true,
"support_cluster_tags": true,
"num_gpus": 0,
"node_instance_type": {
"instance_type_id": "Standard_L80s_v2",
"local_disks": 1,
"local_disk_size_gb": 800,
"instance_family": "Standard LSv2 Family vCPUs",
"local_nvme_disk_size_gb": 1788,
"local_nvme_disks": 10,
"swap_size": "10g"
},
"is_hidden": false,
"support_port_forwarding": true,
"display_order": 0,
"is_io_cache_enabled": true,
"node_info": {
"available_core_quota": 350,
"total_core_quota": 350
}
},
{
"..."
}
]
}
Response structure
FIELD NAME  TYPE  DESCRIPTION
Runtime versions
ENDPOINT  HTTP METHOD
2.0/clusters/spark-versions GET
Return the list of available runtime versions. These versions can be used to launch a cluster.
Example
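A minimal sketch, assuming .netrc authentication:
curl --netrc -X GET \
https://<databricks-instance>/api/2.0/clusters/spark-versions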
{
"versions": [
{
"key": "8.2.x-scala2.12",
"name": "8.2 (includes Apache Spark 3.1.1, Scala 2.12)"
},
{
"..."
}
]
}
Response structure
FIELD NAME  TYPE  DESCRIPTION
Events
ENDPOINT  HTTP METHOD
2.0/clusters/events POST
Retrieve a list of events about the activity of a cluster. You can retrieve events from active clusters (running,
pending, or reconfiguring) and terminated clusters within 30 days of their last termination. This API is paginated.
If there are more events to read, the response includes all the parameters necessary to request the next page of
events.
Example:
list-events.json :
{
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 5,
"limit": 5,
"event_type": "RUNNING"
}
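A hedged sketch of posting this file:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/clusters/events \
--data @list-events.json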
{
"events": [
{
"cluster_id": "1234-567890-reef123",
"timestamp": 1619471498409,
"type": "RUNNING",
"details": {
"current_num_workers": 2,
"target_num_workers": 2
}
},
{
"..."
}
],
"next_page": {
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 10,
"limit": 5
},
"total_count": 25
}
list-events.json :
{
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 10,
"limit": 5,
"event_type": "RUNNING"
}
{
"events": [
{
"cluster_id": "1234-567890-reef123",
"timestamp": 1618330776302,
"type": "RUNNING",
"details": {
"current_num_workers": 2,
"target_num_workers": 2
}
},
{
"..."
}
],
"next_page": {
"cluster_id": "1234-567890-reef123",
"start_time": 1617238800000,
"end_time": 1619485200000,
"order": "DESC",
"offset": 15,
"limit": 5
},
"total_count": 25
}
Request structure
Retrieve events pertaining to a specific cluster.
Data structures
In this section:
AutoScale
ClusterInfo
ClusterEvent
ClusterEventType
EventDetails
ClusterAttributes
ClusterSize
ListOrder
ResizeCause
ClusterLogConf
InitScriptInfo
ClusterTag
DbfsStorageInfo
FileStorageInfo
DockerImage
DockerBasicAuth
LogSyncStatus
NodeType
ClusterCloudProviderNodeInfo
ClusterCloudProviderNodeStatus
ParameterPair
SparkConfPair
SparkEnvPair
SparkNode
SparkVersion
TerminationReason
PoolClusterTerminationCode
ClusterSource
ClusterState
TerminationCode
TerminationType
TerminationParameter
AzureAttributes
AzureAvailability
AutoScale
Range defining the min and max number of cluster workers.
ClusterInfo
Metadata about a cluster.
* Vendor: Databricks
* Creator:
* ClusterName:
* ClusterId:
* Name:
On job clusters:
* RunName:
* JobId:
On resources used by Databricks SQL:
* SqlWarehouseId:
ClusterEvent
Cluster event information.
ClusterEventType
Type of a cluster event.
DID_NOT_EXPAND_DISK Indicates that a disk is low on space, but adding disks would
put it over the max capacity.
EXPANDED_DISK Indicates that a disk was low on space and the disks were
expanded.
FAILED_TO_EXPAND_DISK Indicates that a disk was low on space and disk space could
not be expanded.
INIT_SCRIPTS_STARTING Indicates that the cluster scoped init script has started.
INIT_SCRIPTS_FINISHED Indicates that the cluster scoped init script has finished.
RUNNING Indicates the cluster has finished being created. Includes the
number of nodes in the cluster and a failure reason if some
nodes could not be acquired.
NODES_LOST Indicates that some nodes were lost from the cluster.
DRIVER_HEALTHY Indicates that the driver is healthy and the cluster is ready
for use.
SPARK_EXCEPTION Indicates that a Spark exception was thrown from the driver.
EventDetails
Details about a cluster event.
ClusterAttributes
Common set of attributes set during cluster creation. These attributes cannot be changed over the lifetime of a
cluster.
Note :
ClusterSize
Cluster size specification.
ListOrder
Generic ordering enum for list-based queries.
ResizeCause
Reason why a cluster was resized.
ClusterLogConf
Path to cluster log.
InitScriptInfo
Path to an init script. For instructions on using init scripts with Databricks Container Services, see Use an init
script.
NOTE
The file storage type is only available for clusters set up using Databricks Container Services.
ClusterTag
Cluster tag definition.
STRING The value of the tag. The value length must be less than or
equal to 256 UTF-8 characters.
DbfsStorageInfo
DBFS storage information.
FileStorageInfo
File storage information.
NOTE
This location type is only available for clusters set up using Databricks Container Services.
DockerImage
Docker image connection information.
DockerBasicAuth
Docker repository basic authentication information.
LogSyncStatus
Log delivery status.
NodeType
Description of a Spark node type including both the dimensions of the node and the instance type on which it
will be hosted.
ClusterCloudProviderNodeInfo
Information about an instance supplied by a cloud provider.
ClusterCloudProviderNodeStatus
Status of an instance supplied by a cloud provider.
ParameterPair
Parameter that provides additional information about why a cluster was terminated.
SparkConfPair
Spark configuration key-value pairs.
SparkEnvPair
Spark environment variable key-value pairs.
IMPORTANT
When specifying environment variables in a job cluster, the fields in this data structure accept only Latin characters (ASCII
character set). Using non-ASCII characters will return an error. Examples of invalid, non-ASCII characters are Chinese,
Japanese kanjis, and emojis.
SparkNode
Spark driver or executor configuration.
SparkVersion
Databricks Runtime version of the cluster.
TerminationReason
Reason why a cluster was terminated.
PoolClusterTerminationCode
Status code indicating why the cluster was terminated due to a pool failure.
CODE  DESCRIPTION
ClusterSource
Service that created the cluster.
ClusterState
State of a cluster. The allowable state transitions are as follows:
PENDING -> RUNNING
PENDING -> TERMINATING
RUNNING -> RESIZING
RUNNING -> RESTARTING
RUNNING -> TERMINATING
RESTARTING -> RUNNING
RESTARTING -> TERMINATING
RESIZING -> RUNNING
RESIZING -> TERMINATING
TERMINATING -> TERMINATED
RUNNING Indicates that a cluster has been started and is ready for use.
TerminationCode
Status code indicating why the cluster was terminated.
CODE  DESCRIPTION
JOB_FINISHED The cluster was launched by a job, and terminated when the
job completed.
CLOUD_PROVIDER_SHUTDOWN The instance that hosted the Spark driver was terminated by
the cloud provider.
SPARK_ERROR The Spark driver failed to start. Possible reasons may include
incompatible libraries and initialization scripts that corrupted
the Spark container.
DRIVER_UNREACHABLE Azure Databricks was not able to access the Spark driver,
because it was not reachable.
DRIVER_UNRESPONSIVE Azure Databricks was not able to access the Spark driver,
because it was unresponsive.
INSTANCE_POOL_CLUSTER_FAILURE Pool backed cluster specific failure. See Pools for details.
TerminationType
Reason why the cluster was terminated.
CLOUD_FAILURE Cloud provider infrastructure issue. Client can retry after the
underlying issue is resolved.
TerminationParameter
Key that provides additional information about why a cluster was terminated.
KEY  DESCRIPTION
databricks_error_message Additional context that may explain the reason for cluster
termination.
inactivity_duration_min An idle cluster was shut down after being inactive for this
duration.
instance_id The ID of the instance that was hosting the Spark driver.
azure_error_code The Azure provided error code describing why cluster nodes
could not be provisioned. For reference, see:
https://docs.microsoft.com/azure/virtual-
machines/windows/error-messages.
KEY  DESCRIPTION
AzureAttributes
Attributes set during cluster creation related to Azure.
spot_bid_max_price DOUBLE The max bid price used for Azure spot
instances. You can set this to greater
than or equal to the current spot price.
You can also set this to -1 (the default),
which specifies that the instance
cannot be evicted on the basis of price.
The price for the instance will be the
current price for spot instances or the
price for a standard instance. You can
view historical pricing and eviction
rates in the Azure portal.
AzureAvailability
The Azure instance availability type behavior.
IMPORTANT
This feature is in Public Preview.
A cluster policy limits the ability to create clusters based on a set of rules. The policy rules limit the attributes or
attribute values available for cluster creation. Cluster policies have ACLs that limit their use to specific users and
groups.
Only admin users can create, edit, and delete policies. Admin users also have access to all policies.
For requirements and limitations on cluster policies, see Manage cluster policies.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
IMPORTANT
The Cluster Policies API requires a policy JSON definition to be passed within a JSON request in stringified form. In most
cases this requires escaping of the quote characters.
In this section:
Get
List
Create
Edit
Delete
Data structures
Get
ENDPOINT  HTTP METHOD
2.0/policies/clusters/get GET
{
"policy_id": "ABCD000000000000",
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":
{\"type\":\"forbidden\",\"hidden\":true}}",
"created_at_timestamp": 1600000000000
}
Request structure
Response structure
List
ENDPOINT  HTTP METHOD
2.0/policies/clusters/list GET
Request structure
Response structure
Create
ENDPOINT  HTTP METHOD
2.0/policies/clusters/create POST
create-cluster-policy.json :
{
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":{\"type\":\"forbidden\",\"hidden\":true}}"
}
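A hedged sketch of posting this file, assuming .netrc authentication:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/policies/clusters/create \
--data @create-cluster-policy.json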
{ "policy_id": "ABCD000000000000" }
Request structure
Response structure
Edit
ENDPOINT  HTTP METHOD
2.0/policies/clusters/edit POST
Update an existing policy. This may make some clusters governed by this policy invalid. For such clusters the
next cluster edit must provide a confirming configuration, but otherwise they can continue to run.
Example
edit-cluster-policy.json :
{
"policy_id": "ABCD000000000000",
"name": "Test policy",
"definition": "{\"spark_conf.spark.databricks.cluster.profile\":{\"type\":\"forbidden\",\"hidden\":true}}"
}
{}
Request structure
FIELD NAME  TYPE  DESCRIPTION
Delete
ENDPOINT  HTTP METHOD
2.0/policies/clusters/delete POST
Delete a policy. Clusters governed by this policy can still run, but cannot be edited.
Example
{}
Request structure
Data structures
In this section:
Policy
PolicySortColumn
Policy
A cluster policy entity.
PolicySortColumn
The sort order for the ListPolicies request.
Get permissions
ENDPOINT  HTTP METHOD
2.0/preview/permissions/cluster-policies/<clusterPolicyId> GET
Example
Request structure
Response structure
A Clusters ACL.
Get permission levels
ENDPOINT  HTTP METHOD
2.0/preview/permissions/cluster-policies/<clusterPolicyId>/permissionLevels GET
Example
{
"permission_levels": [
{
"permission_level": "CAN_USE",
"description": "Can use the policy"
}
]
}
Request structure
Response structure
An array of PermissionLevel with associated description.
Add or modify permissions
ENDPOINT  HTTP METHOD
2.0/preview/permissions/cluster-policies/<clusterPolicyId> PATCH
Example
add-cluster-policy-permissions.json :
{
"access_control_list": [
{
"user_name": "someone-else@example.com",
"permission_level": "CAN_USE"
}
]
}
{
"object_id": "/cluster-policies/ABCD000000000000",
"object_type": "cluster-policy",
"access_control_list": [
{
"user_name": "mary@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"user_name": "someone-else@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"group_name": "admins",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": true,
"inherited_from_object": [
"/cluster-policies/"
]
}
]
}
]
}
Request structure
Request body
Response body
A Clusters ACL.
Set or delete permissions
A PUT request replaces all direct permissions on the cluster policy object. To delete permissions, first make a
GET request to retrieve the current list of permissions, and then make a PUT request that omits the entries you
want to remove.
ENDPOINT  HTTP METHOD
2.0/preview/permissions/cluster-policies/<clusterPolicyId> PUT
Example
set-cluster-policy-permissions.json :
{
"access_control_list": [
{
"user_name": "someone@example.com",
"permission_level": "CAN_USE"
}
]
}
{
"object_id": "/cluster-policies/ABCD000000000000",
"object_type": "cluster-policy",
"access_control_list": [
{
"user_name": "someone@example.com",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": false
}
]
},
{
"group_name": "admins",
"all_permissions": [
{
"permission_level": "CAN_USE",
"inherited": true,
"inherited_from_object": [
"/cluster-policies/"
]
}
]
}
]
}
Request structure
Request body
FIELD NAME  TYPE  DESCRIPTION
Response body
A Clusters ACL.
Data structures
In this section:
Clusters ACL
AccessControl
Permission
AccessControlInput
PermissionLevel
Clusters ACL
AccessControl
Permission
AccessControlInput
An item representing an ACL rule applied to the principal (user, group, or service principal).
PermissionLevel
Permission level that you can set on a cluster policy.
CAN_USE Allow user to create clusters based on the policy. The user
does not need the cluster create permission.
SQL Warehouses APIs 2.0
7/21/2022 • 8 minutes to read
IMPORTANT
To access Databricks REST APIs, you must authenticate.
To configure individual SQL warehouses, use the SQL Warehouses API. To configure all SQL warehouses, use the
Global SQL Warehouses API.
Requirements
To create SQL warehouses you must have cluster create permission, which is enabled in the Data Science &
Engineering workspace.
To manage a SQL warehouse you must have Can Manage permission in Databricks SQL for the warehouse.
Create
ENDPOINT  HTTP METHOD
2.0/sql/warehouses/ POST
Example request
{
"name": "My SQL warehouse",
"cluster_size": "MEDIUM",
"min_num_clusters": 1,
"max_num_clusters": 10,
"tags": {
"custom_tags": [
{
"key": "mykey",
"value": "myvalue"
}
]
},
"enable_photon": "true",
"channel": {
"name": "CHANNEL_NAME_CURRENT"
}
}
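A hedged sketch of sending this request, assuming the body above is saved as create-warehouse.json (an illustrative file name):
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/sql/warehouses/ \
--data @create-warehouse.json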
Example response
{
"id": "0123456789abcdef"
}
Delete
ENDPOINT  HTTP METHOD
2.0/sql/warehouses/{id} DELETE
Edit
ENDPOINT  HTTP METHOD
2.0/sql/warehouses/{id}/edit POST
Modify a SQL warehouse. All fields are optional. Missing fields default to the current values.
Example request
{
"name": "My Edited SQL warehouse",
"cluster_size": "LARGE",
"auto_stop_mins": 60
}
Get
ENDPOINT  HTTP METHOD
2.0/sql/warehouses/{id} GET
Example response
{
"id": "7f2629a529869126",
"name": "MyWarehouse",
"size": "SMALL",
"min_num_clusters": 1,
"max_num_clusters": 1,
"auto_stop_mins": 0,
"auto_resume": true,
"num_clusters": 0,
"num_active_sessions": 0,
"state": "STOPPED",
"creator_name": "user@example.com",
"jdbc_url":
"jdbc:spark://hostname.staging.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath
=/sql/1.0/warehouses/7f2629a529869126;",
"odbc_params": {
"hostname": "hostname.cloud.databricks.com",
"path": "/sql/1.0/warehouses/7f2629a529869126",
"protocol": "https",
"port": 443
},
"tags": {
"custom_tags": [
{
"key": "mykey",
"value": "myvalue"
}
]
},
"spot_instance_policy": "COST_OPTIMIZED",
"enable_photon": true,
"cluster_size": "SMALL",
"channel": {
"name": "CHANNEL_NAME_CURRENT"
}
}
List
ENDPOINT  HTTP METHOD
2.0/sql/warehouses/ GET
{
"warehouses": [
{ "id": "123456790abcdef", "name": "My SQL warehouse", "cluster_size": "MEDIUM" },
{ "id": "098765321fedcba", "name": "Another SQL warehouse", "cluster_size": "LARGE" }
]
}
Note: If you use the deprecated 2.0/sql/endpoints/ API, the top-level response field would be “endpoints”
instead of “warehouses”.
Start
ENDPOINT  HTTP METHOD
2.0/sql/warehouses/{id}/start POST
Stop
ENDPOINT  HTTP METHOD
2.0/sql/warehouses/{id}/stop POST
Get the configuration for all SQL warehouses
ENDPOINT  HTTP METHOD
/2.0/sql/config/warehouses GET
Example response
{
"security_policy": "DATA_ACCESS_CONTROL",
"data_access_config": [
{
"key": "spark.sql.hive.metastore.jars",
"value": "/databricks/hive_metastore_jars/*"
}
],
"sql_configuration_parameters": {
"configuration_pairs": [
{
"key" : "legacy_time_parser_policy",
"value": "LEGACY"
}
]
}
}
Edit
Edit the configuration for all SQL warehouses.
IMPORTANT
All fields are required.
Invoking this method restarts all running SQL warehouses.
ENDPOINT  HTTP METHOD
/2.0/sql/config/warehouses PUT
Example request
{
"data_access_config": [
{
"key": "spark.sql.hive.metastore.jars",
"value": "/databricks/hive_metastore_jars/*"
}
],
"sql_configuration_parameters": {
"configuration_pairs": [
{
"key" : "legacy_time_parser_policy",
"value": "LEGACY"
}
]
}
}
Data structures
In this section:
WarehouseConfPair
WarehouseHealth
WarehouseSecurityPolicy
WarehouseSpotInstancePolicy
WarehouseState
WarehouseStatus
WarehouseTags
WarehouseTagPair
ODBCParams
RepeatedWarehouseConfPairs
Channel
ChannelName
WarehouseConfPair
FIELD NAME  TYPE  DESCRIPTION
WarehouseHealth
FIELD NAME  TYPE  DESCRIPTION
WarehouseSecurityPolicy
OPTION  DESCRIPTION
WarehouseSpotInstancePolicy
OPTION  DESCRIPTION
COST_OPTIMIZED Use an on-demand instance for the cluster driver and spot
instances for cluster executors. The maximum spot price is
100% of the on-demand price. This is the default policy.
WarehouseState
State of a SQL warehouse. The allowable state transitions are:
STARTING -> STARTING , RUNNING , STOPPING , DELETING
RUNNING -> STOPPING , DELETING
STOPPING -> STOPPED , STARTING
STOPPED -> STARTING , DELETING
DELETING -> DELETED
WarehouseStatus
STATE  DESCRIPTION
WarehouseTags
FIELD NAME  TYPE  DESCRIPTION
WarehouseTagPair
FIELD NAME  TYPE  DESCRIPTION
ODBCParams
FIELD NAME  TYPE  DESCRIPTION
RepeatedWarehouseConfPairs
FIELD NAME  TYPE  DESCRIPTION
Channel
FIELD NAME  TYPE  DESCRIPTION
ChannelName
NAME  DESCRIPTION
The Queries and Dashboards API manages queries, results, and dashboards.
This API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification.
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Query History API 2.0
7/21/2022 • 2 minutes to read
The Query History API shows SQL queries performed using Databricks SQL warehouses. You can use this
information to help you debug issues with queries.
This API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification.
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
DBFS API 2.0
7/21/2022 • 9 minutes to read
The DBFS API is a Databricks API that makes it simple to interact with various data sources without having to
include your credentials every time you read a file. See Databricks File System (DBFS) for more information. For
an easy-to-use command-line client of the DBFS API, see Databricks CLI.
NOTE
To ensure high quality of service under heavy load, Azure Databricks is now enforcing API rate limits for DBFS API calls.
Limits are set per workspace to ensure fair usage and high availability. Automatic retries are available using Databricks CLI
version 0.12.0 and above. We advise all customers to switch to the latest Databricks CLI version.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Limitations
Using the DBFS API with firewall enabled storage containers is not supported. Databricks recommends you use
Databricks Connect or az storage.
Add block
ENDPOINT  HTTP METHOD
2.0/dbfs/add-block POST
Append a block of data to the stream specified by the input handle. If the handle does not exist, this call will
throw an exception with RESOURCE_DOES_NOT_EXIST . If the block of data exceeds 1 MB, this call will throw an
exception with MAX_BLOCK_SIZE_EXCEEDED . A typical workflow for file upload would be:
1. Call create and get a handle.
2. Make one or more add-block calls with the handle you have.
3. Call close with the handle you have.
Example
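The request is not shown; a minimal sketch (the handle matches the create example below, and the data value is an illustrative base64-encoded block):
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/add-block \
--data '{ "handle": 1234567890123456, "data": "SGVsbG8sIFdvcmxkIQ==" }'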
{}
Request structure
FIELD NAME  TYPE  DESCRIPTION
Close
ENDPOINT  HTTP METHOD
2.0/dbfs/close POST
Close the stream specified by the input handle. If the handle does not exist, this call throws an exception with
RESOURCE_DOES_NOT_EXIST . A typical workflow for file upload would be:
1. Call create and get a handle.
2. Make one or more add-block calls with the handle you have.
3. Call close with the handle you have.
Create
ENDPOINT  HTTP METHOD
2.0/dbfs/create POST
Open a stream to write to a file and return a handle to this stream. There is a 10 minute idle timeout on this
handle. If a file or directory already exists on the given path and overwrite is set to false, this call throws an
exception with RESOURCE_ALREADY_EXISTS . A typical workflow for file upload would be:
1. Call create and get a handle.
2. Make one or more add-block calls with the handle you have.
3. Call close with the handle you have.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/create \
--data '{ "path": "/tmp/HelloWorld.txt", "overwrite": true }'
{ "handle": 1234567890123456 }
Request structure
FIELD NAME  TYPE  DESCRIPTION
Response structure
FIELD NAME  TYPE  DESCRIPTION
Delete
ENDPOINT  HTTP METHOD
2.0/dbfs/delete POST
Delete the file or directory (optionally recursively delete all files in the directory). This call throws an exception
with IO_ERROR if the path is a non-empty directory and recursive is set to false or on other similar errors.
When you delete a large number of files, the delete operation is done in increments. The call returns a response
after approximately 45 seconds with an error message (503 Service Unavailable) asking you to re-invoke the
delete operation until the directory structure is fully deleted. For example:
{
"error_code": "PARTIAL_DELETE",
"message": "The requested operation has deleted 324 files. There are more files remaining. You must make
another request to delete more."
}
For operations that delete more than 10K files, we discourage using the DBFS REST API, but advise you to
perform such operations in the context of a cluster, using the File system utility (dbutils.fs). dbutils.fs covers
the functional scope of the DBFS REST API, but from notebooks. Running such operations using notebooks
provides better control and manageability, such as selective deletes, and the possibility to automate periodic
delete jobs.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/delete \
--data '{ "path": "/tmp/HelloWorld.txt" }'
{}
Request structure
FIELD NAME  TYPE  DESCRIPTION
Get status
ENDPOINT  HTTP METHOD
2.0/dbfs/get-status GET
Get the file information of a file or directory. If the file or directory does not exist, this call throws an exception
with RESOURCE_DOES_NOT_EXIST .
Example
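The request is not shown; a minimal sketch (the path matches the response below):
curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/get-status \
--data '{ "path": "/tmp/HelloWorld.txt" }'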
{
"path": "/tmp/HelloWorld.txt",
"is_dir": false,
"file_size": 13,
"modification_time": 1622054945000
}
Request structure
FIELD NAME  TYPE  DESCRIPTION
Response structure
FIELD NAME  TYPE  DESCRIPTION
List
ENDPOINT  HTTP METHOD
2.0/dbfs/list GET
List the contents of a directory, or details of the file. If the file or directory does not exist, this call throws an
exception with RESOURCE_DOES_NOT_EXIST .
When calling list on a large directory, the list operation will time out after approximately 60 seconds. We
strongly recommend using list only on directories containing less than 10K files and discourage using the
DBFS REST API for operations that list more than 10K files. Instead, we recommend that you perform such
operations in the context of a cluster, using the File system utility (dbutils.fs), which provides the same
functionality without timing out.
Example
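A request sketch; the directory path is illustrative:
curl --netrc --get \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/list \
--data path=/tmp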
{
"files": [
{
"path": "/tmp/HelloWorld.txt",
"is_dir": false,
"file_size": 13,
"modification_time": 1622054945000
},
{
"..."
}
]
}
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
Mkdirs
ENDPOINT HTTP METHOD
2.0/dbfs/mkdirs POST
Create the given directory and necessary parent directories if they do not exist. If there exists a file (not a
directory) at any prefix of the input path, this call throws an exception with RESOURCE_ALREADY_EXISTS . If this
operation fails it may have succeeded in creating some of the necessary parent directories.
Example
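A request sketch; the path is illustrative:
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/mkdirs \
--data '{ "path": "/tmp/new-dir" }'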
{}
Request structure
FIELD NAME TYPE DESCRIPTION
Move
ENDPOINT HTTP METHOD
2.0/dbfs/move POST
Move a file from one location to another location within DBFS. If the source file does not exist, this call throws an
exception with RESOURCE_DOES_NOT_EXIST . If there already exists a file in the destination path, this call throws an
exception with RESOURCE_ALREADY_EXISTS . If the given source path is a directory, this call always recursively
moves all files.
When moving a large number of files, the API call will time out after approximately 60 seconds, potentially
resulting in partially moved data. Therefore, for operations that move more than 10K files, we strongly
discourage using the DBFS REST API. Instead, we recommend that you perform such operations in the context of
a cluster, using the File system utility (dbutils.fs) from a notebook, which provides the same functionality without
timing out.
Example
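A request sketch; the source and destination paths are illustrative:
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/move \
--data '{ "source_path": "/tmp/HelloWorld.txt", "destination_path": "/tmp/moved/HelloWorld.txt" }'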
{}
Request structure
FIELD NAME TYPE DESCRIPTION
Put
ENDPOINT HTTP METHOD
2.0/dbfs/put POST
Upload a file by using a multipart form post. It is mainly used for streaming uploads, but can also be
used as a convenient single call for data upload.
The amount of data that can be passed using the contents parameter is limited to 1 MB if specified as a string (
MAX_BLOCK_SIZE_EXCEEDED is thrown if exceeded) and 2 GB as a file.
Example
To upload a local file named HelloWorld.txt in the current directory:
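A multipart form request sketch, assuming HelloWorld.txt exists in the directory where you run curl:
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/put \
--form contents=@HelloWorld.txt \
--form path=/tmp/HelloWorld.txt \
--form overwrite=true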
{}
Request structure
FIELD NAME TYPE DESCRIPTION
Read
ENDPOINT HTTP METHOD
2.0/dbfs/read GET
Return the contents of a file. If the file does not exist, this call throws an exception with RESOURCE_DOES_NOT_EXIST .
If the path is a directory, the read length is negative, or if the offset is negative, this call throws an exception with
INVALID_PARAMETER_VALUE . If the read length exceeds 1 MB, this call throws an exception with
MAX_READ_SIZE_EXCEEDED . If offset + length exceeds the number of bytes in a file, the call reads content up to the end
of the file.
Example
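A request sketch that could produce a response like the one below; the offset and length values are illustrative:
curl --netrc --get \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/read \
--data path=/tmp/HelloWorld.txt \
--data offset=1 \
--data length=8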
{
"bytes_read": 8,
"data": "ZWxsbywgV28="
}
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
Data structures
In this section:
FileInfo
FileInfo
The attributes of a file or directory.
The Delta Live Tables API allows you to create, edit, delete, start, and view details about pipelines.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Create a pipeline
ENDPOINT HTTP METHOD
2.0/pipelines POST
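Example
A request sketch that creates a pipeline from the pipeline-settings.json file shown below:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/pipelines \
--data @pipeline-settings.json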
pipeline-settings.json :
{
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"continuous": false
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file.
Response
{
"pipeline_id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5"
}
Request structure
See PipelineSettings.
Response structure
FIELD NAME TYPE DESCRIPTION
Edit a pipeline
ENDPOINT HTTP METHOD
2.0/pipelines/{pipeline_id} PUT
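Example
A request sketch that applies the pipeline-settings.json file shown below; the pipeline ID is illustrative:
curl --netrc -X PUT \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 \
--data @pipeline-settings.json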
pipeline-settings.json
{
"id": "a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5",
"name": "Wikipedia pipeline (SQL)",
"storage": "/Users/username/data",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5
}
}
],
"libraries": [
{
"notebook": {
"path": "/Users/username/DLT Notebooks/Delta Live Tables quickstart (SQL)"
}
}
],
"target": "wikipedia_quickstart_data",
"continuous": false
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Delete a pipeline
ENDPOINT HTTP METHOD
2.0/pipelines/{pipeline_id} DELETE
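Example
A request sketch; the pipeline ID is illustrative:
curl --netrc -X DELETE \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5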
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
2.0/pipelines/{pipeline_id}/updates POST
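Example
A request sketch that starts an update for the pipeline and returns the update_id shown below; the pipeline ID is illustrative:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates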
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
{
"update_id": "a1b23c4d-5e6f-78gh-91i2-3j4k5lm67no8"
}
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
2.0/pipelines/{pipeline_id}/stop POST
Stops any active pipeline update. If no update is running, this request is a no-op.
Example
This example stops an update for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request
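A request sketch, assuming a .netrc file for authentication:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/stop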
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
2.0/pipelines/{pipeline_id}/events GET
curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/events \
--data '{"max_results": 5}'
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Response structure
FIELD NAME TYPE DESCRIPTION
events (an array of pipeline events): The list of events matching the request criteria.
2.0/pipelines/{pipeline_id} GET
Gets details about a pipeline, including the pipeline settings and recent updates.
Example
This example gets details for the pipeline with ID a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5 :
Request
curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Response structure
FIELD NAME TYPE DESCRIPTION
2.0/pipelines/{pipeline_id}/updates/{update_id} GET
curl -n -X GET \
https://<databricks-instance>/api/2.0/pipelines/a12cd3e4-0ab1-1abc-1a2b-1a2bcd3e4fg5/updates/9a84f906-fc51-
11eb-9a03-0242ac130003
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Response structure
FIELD NAME TYPE DESCRIPTION
List pipelines
ENDPOINT HTTP METHOD
2.0/pipelines/ GET
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Request structure
FIELD NAME TYPE DESCRIPTION
Use "notebook='<path>'" to select pipelines that reference the provided notebook path.
Response structure
FIELD NAME TYPE DESCRIPTION
Data structures
In this section:
KeyValue
NotebookLibrary
PipelineLibrary
PipelineSettings
PipelineStateInfo
PipelinesNewCluster
UpdateStateInfo
KeyValue
A key-value pair that specifies configuration parameters.
NotebookLibrary
A specification for a notebook containing pipeline code.
PipelineLibrary
A specification for pipeline dependencies.
PipelineSettings
The settings for a pipeline deployment.
PipelineStateInfo
The state of a pipeline, the status of the most recent updates, and information about associated resources.
PipelinesNewCluster
A pipeline cluster specification.
The Delta Live Tables system sets the following attributes. These attributes cannot be configured by users:
spark_version
init_scripts
Note :
UpdateStateInfo
The current state of a pipeline update.
The Git Credentials API allows users to manage their Git credentials to use Databricks Repos.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
The Git Credentials API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
Global Init Scripts API 2.0
7/21/2022 • 2 minutes to read
Global init scripts are shell scripts that run during startup on each
cluster node of every cluster in the workspace, before the Apache Spark driver or worker JVM starts. They can
help you to enforce consistent cluster configurations across your workspace. Use them carefully because they
can cause unanticipated impacts, like library conflicts.
The Global Init Scripts API lets Azure Databricks administrators add global cluster initialization scripts in a secure
and controlled manner. To learn how to add them using the UI, see Configure a cluster-scoped init script using
the DBFS REST API.
The Global Init Scripts API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Groups API 2.0
7/21/2022 • 4 minutes to read
NOTE
You must be an Azure Databricks administrator to invoke this API.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Add member
ENDPOINT HTTP METHOD
2.0/groups/add-member POST
Add a user or group to a group. This call returns the error RESOURCE_DOES_NOT_EXIST if a user or group with the
given name does not exist, or if a group with the given parent name does not exist.
Examples
To add a user to a group:
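A request sketch; the user name and group name are illustrative:
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/add-member \
--data '{ "user_name": "someone@example.com", "parent_name": "reporting-department" }'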
{}
{}
Request structure
FIELD NAME TYPE DESCRIPTION
Create
ENDPOINT HTTP METHOD
2.0/groups/create POST
Create a new group with the given name. This call returns an error RESOURCE_ALREADY_EXISTS if a group with the
given name already exists.
Example
{ "group_name": "reporting-department" }
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
List members
ENDPOINT HTTP METHOD
2.0/groups/list-members GET
Return all of the members of a particular group. This call returns the error RESOURCE_DOES_NOT_EXIST if a group
with the given name does not exist. This method is non-recursive; it returns all groups that belong to the given
group but not the principals that belong to those child groups.
Example
curl --netrc -X GET \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/groups/list-members \
--data '{ "group_name": "reporting-department" }' \
| jq .
{
"members": [
{
"user_name": "someone@example.com"
}
]
}
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
List
ENDPOINT HTTP METHOD
2.0/groups/list GET
{
"group_names": [
"reporting-department",
"data-ops-read-only",
"admins"
]
}
Response structure
FIELD NAME TYPE DESCRIPTION
2.0/groups/list-parents GET
Retrieve all groups in which a given user or group is a member. This method is non-recursive; it returns all
groups in which the given user or group is a member but not the groups in which those groups are members.
This call returns the error RESOURCE_DOES_NOT_EXIST if a user or group with the given name does not exist.
Examples
To list groups for a user:
{
"group_names": [
"reporting-department"
]
}
{
"group_names": [
"data-ops-read-only"
]
}
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
Remove member
ENDPOINT HTTP METHOD
2.0/groups/remove-member POST
Remove a user or group from a group. This call returns the error RESOURCE_DOES_NOT_EXIST if a user or group
with the given name does not exist or if a group with the given parent name does not exist.
Examples
To remove a user from a group:
{}
{}
Request structure
FIELD NAME TYPE DESCRIPTION
Delete
ENDPOINT HTTP METHOD
2.0/groups/delete POST
Remove a group from this organization. This call returns the error RESOURCE_DOES_NOT_EXIST if a group with the
given name does not exist.
Example
Request structure
FIELD NAME TYPE DESCRIPTION
Data structures
In this section:
PrincipalName
PrincipalName
Container type for a name that is either a user name or a group name.
The Instance Pools API allows you to create, edit, delete and list instance pools.
An instance pool reduces cluster start and auto-scaling times by maintaining a set of idle, ready-to-use cloud
instances. When a cluster attached to a pool needs an instance, it first attempts to allocate one of the pool’s idle
instances. If the pool has no idle instances, it expands by allocating a new instance from the instance provider in
order to accommodate the cluster’s request. When a cluster releases an instance, it returns to the pool and is
free for another cluster to use. Only clusters attached to a pool can use that pool’s idle instances.
Azure Databricks does not charge DBUs while instances are idle in the pool. Instance provider billing does apply.
See pricing.
Requirements
You must have permission to attach to the pool; see Pool access control.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Create
ENDPOINT HTTP METHOD
2.0/instance-pools/create POST
Create an instance pool. Use the returned instance_pool_id to query the status of the instance pool, which
includes the number of instances currently allocated by the instance pool. If you provide the min_idle_instances
parameter, instances are provisioned in the background and are ready to use once the idle_count in the
InstancePoolStats equals the requested minimum.
If your account has Databricks Container Services enabled and the instance pool is created with
preloaded_docker_images , you can use the instance pool to launch clusters with a Docker image. The Docker
image in the instance pool doesn’t have to match the Docker image in the cluster. However, the container
environment of the cluster created on the pool must align with the container environment of the instance pool:
you cannot use an instance pool created with preloaded_docker_images to launch a cluster without a Docker
image and you cannot use an instance pool created without preloaded_docker_images to launch a cluster with a
Docker image.
NOTE
Azure Databricks may not be able to acquire some of the requested idle instances due to instance provider limitations or
transient network issues. Clusters can still attach to the instance pool, but may not start as quickly.
Example
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/create \
--data @create-instance-pool.json
create-instance-pool.json :
{
"instance_pool_name": "my-pool",
"node_type_id": "Standard_D3_v2",
"min_idle_instances": 10,
"custom_tags": [
{
"key": "my-key",
"value": "my-value"
}
]
}
{ "instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg" }
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
2.0/instance-pools/edit POST
Edit an instance pool. This modifies the configuration of an existing instance pool.
NOTE
You can edit only the following values: instance_pool_name , min_idle_instances , max_capacity , and
idle_instance_autotermination_minutes .
You must provide an instance_pool_name value.
Example
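A request sketch that submits the edit-instance-pool.json file shown below:
curl --netrc -X POST \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/edit \
--data @edit-instance-pool.json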
edit-instance-pool.json :
{
"instance_pool_id": "1234-567890-fetch12-pool-A3BcdEFg",
"instance_pool_name": "my-edited-pool",
"min_idle_instances": 5,
"max_capacity": 200,
"idle_instance_autotermination_minutes": 30
}
{}
Request structure
FIELD NAME TYPE DESCRIPTION
Delete
ENDPOINT HTTP METHOD
2.0/instance-pools/delete POST
Delete an instance pool. This permanently deletes the instance pool. The idle instances in the pool are
terminated asynchronously. New clusters cannot attach to the pool. Running clusters attached to the pool
continue to run but cannot autoscale up. Terminated clusters attached to the pool will fail to start until they are
edited to no longer use the pool.
Example
{}
Request structure
FIELD NAME TYPE DESCRIPTION
Get
ENDPOINT HTTP METHOD
2.0/instance-pools/get GET
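Example
A request sketch that could return a response like the one below; the instance pool ID is illustrative:
curl --netrc --get \
https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/instance-pools/get \
--data instance_pool_id=101-120000-brick1-pool-ABCD1234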
{
"instance_pool_name": "mypool",
"node_type_id": "Standard_D3_v2",
"custom_tags": {
"my-key": "my-value"
},
"idle_instance_autotermination_minutes": 60,
"enable_elastic_disk": false,
"preloaded_spark_versions": [
"5.4.x-scala2.11"
],
"instance_pool_id": "101-120000-brick1-pool-ABCD1234",
"default_tags": {
"Vendor": "Databricks",
"DatabricksInstancePoolCreatorId": "100125",
"DatabricksInstancePoolId": "101-120000-brick1-pool-ABCD1234"
},
"state": "ACTIVE",
"stats": {
"used_count": 10,
"idle_count": 5,
"pending_used_count": 5,
"pending_idle_count": 5
},
"status": {}
}
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
preloaded_spark_versions (an array of STRING): A list with the runtime version the pool installs on each instance. Pool clusters that use a preloaded runtime version start faster as they do not have to wait for the image to download. You can retrieve a list of available runtime versions by using the Runtime versions API call.
FIELD NAME TYPE DESCRIPTION
* Vendor: Databricks
* DatabricksInstancePoolCreatorId: <create_user_id>
* DatabricksInstancePoolId: <instance_pool_id>
List
ENDPOINT HTTP METHOD
2.0/instance-pools/list GET
Response structure
FIELD NAME TYPE DESCRIPTION
Data structures
In this section:
InstancePoolState
InstancePoolStats
InstancePoolStatus
PendingInstanceError
DiskSpec
DiskType
InstancePoolAndStats
AzureDiskVolumeType
InstancePoolAzureAttributes
InstancePoolState
The state of an instance pool. The current allowable state transitions are:
ACTIVE -> DELETED
NAME DESCRIPTION
InstancePoolStats
Statistics about the usage of the instance pool.
InstancePoolStatus
Status about failed pending instances in the pool.
PendingInstanceError
Error message of a failed pending instance.
DiskSpec
Describes the initial set of disks to attach to each instance. For example, if there are 3 instances and each
instance is configured to start with 2 disks, 100 GiB each, then Azure Databricks creates a total of 6 disks, 100
GiB each, for these instances.
DiskType
Describes the type of disk.
InstancePoolAndStats
FIELD NAME TYPE DESCRIPTION
preloaded_spark_versions (an array of STRING): A list with the runtime version the pool installs on each instance. Pool clusters that use a preloaded runtime version start faster as they do not have to wait for the image to download. You can retrieve a list of available runtime versions by using the Runtime versions API call.
* Vendor: Databricks
* DatabricksInstancePoolCreatorId: <create_user_id>
* DatabricksInstancePoolId: <instance_pool_id>
InstancePoolAzureAttributes
Attributes set during instance pools creation related to Azure.
spot_bid_max_price (DOUBLE): The max bid price used for Azure spot instances. You can set this to greater than or equal to the current spot price. You can also set this to -1 (the default), which specifies that the instance cannot be evicted on the basis of price. The price for the instance will be the current price for spot instances or the price for a standard instance. You can view historical pricing and eviction rates in the Azure portal.
IP Access List API 2.0
7/21/2022 • 2 minutes to read
Azure Databricks workspaces can be configured so that employees connect to the service only through existing
corporate networks with a secure perimeter. Azure Databricks customers can use the IP access lists feature to
define a set of approved IP addresses. All incoming access to the web application and REST APIs requires the
user to connect from an authorized IP address.
For more details about this feature and examples of how to use this API, see IP access lists.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
The IP Access List API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Jobs API 2.0
7/21/2022 • 45 minutes to read
The Jobs API allows you to create, edit, and delete jobs. The maximum allowed size of a request to the Jobs API is
10MB. See Create a High Concurrency cluster for a how-to guide on this API.
For details about updates to the Jobs API that support orchestration of multiple tasks with Azure Databricks
jobs, see Jobs API updates.
NOTE
If you receive a 500-level error when making Jobs API requests, Databricks recommends retrying requests for up to 10
min (with a minimum 30 second interval between retries).
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Create
ENDPOINT HTTP METHOD
2.0/jobs/create POST
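Example
A request sketch that submits the create-job.json file shown below:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/jobs/create \
--data @create-job.json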
create-job.json :
{
"name": "Nightly model training",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"timeout_seconds": 3600,
"max_retries": 1,
"schedule": {
"quartz_cron_expression": "0 15 22 * * ?",
"timezone_id": "America/Los_Angeles"
},
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of create-job.json with fields that are appropriate for your solution.
{
"job_id": 1
}
Request structure
IMPORTANT
When you run a job on a new jobs cluster, the job is treated as a Jobs Compute (automated) workload subject to Jobs
Compute pricing.
When you run a job on an existing all-purpose cluster, it is treated as an All-Purpose Compute (interactive) workload
subject to All-Purpose Compute pricing.
Response structure
FIELD NAME TYPE DESCRIPTION
List
ENDPOINT HTTP METHOD
2.0/jobs/list GET
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
{
"jobs": [
{
"job_id": 1,
"settings": {
"name": "Nightly model training",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"timeout_seconds": 100000000,
"max_retries": 1,
"schedule": {
"quartz_cron_expression": "0 15 22 * * ?",
"timezone_id": "America/Los_Angeles",
"pause_status": "UNPAUSED"
},
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
},
"created_time": 1457570074236
}
]
}
Response structure
FIELD NAME TYPE DESCRIPTION
Delete
ENDPOINT HTTP METHOD
2.0/jobs/delete POST
Delete a job and send an email to the addresses specified in JobSettings.email_notifications . No action occurs
if the job has already been removed. After the job is removed, neither its details nor its run history is visible in
the Jobs UI or API. The job is guaranteed to be removed upon completion of this request. However, runs that
were active before the receipt of this request may still be active. They will be terminated asynchronously.
Example
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<job-id> with the ID of the job, for example 123 .
Get
ENDPOINT HTTP METHOD
2.0/jobs/get GET
Or:
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<job-id> with the ID of the job, for example 123 .
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
settings (JobSettings): Settings for this job and all of its runs. These settings can be updated using the Reset or Update endpoints.
Reset
ENDPOINT HTTP METHOD
2.0/jobs/reset POST
Overwrite all settings for a specific job. Use the Update endpoint to update job settings partially.
Example
This example request makes job 2 identical to job 1 in the create example.
reset-job.json :
{
"job_id": 2,
"new_settings": {
"name": "Nightly model training",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"timeout_seconds": 100000000,
"max_retries": 1,
"schedule": {
"quartz_cron_expression": "0 15 22 * * ?",
"timezone_id": "America/Los_Angeles",
"pause_status": "UNPAUSED"
},
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of reset-job.json with fields that are appropriate for your solution.
Update
ENDPOINT HTTP METHOD
2.0/jobs/update POST
Add, change, or remove specific settings of an existing job. Use the Reset endpoint to overwrite all job settings.
Example
This example request removes libraries and adds email notification settings to job 1 defined in the create
example.
update-job.json :
{
"job_id": 1,
"new_settings": {
"existing_cluster_id": "1201-my-cluster",
"email_notifications": {
"on_start": [ "someone@example.com" ],
"on_success": [],
"on_failure": []
}
},
"fields_to_remove": ["libraries"]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of update-job.json with fields that are appropriate for your solution.
new_settings (JobSettings): The new settings for the job. Any top-level fields specified in new_settings are completely replaced. Partially updating nested fields is not supported.
Run now
IMPORTANT
You can create jobs only in a Data Science & Engineering workspace or a Machine Learning workspace.
A workspace is limited to 1000 concurrent job runs. A 429 Too Many Requests response is returned when you
request a run that cannot start immediately.
The number of jobs a workspace can create in an hour is limited to 5000 (includes “run now” and “runs submit”). This
limit also affects jobs created by the REST API and notebook workflows.
ENDPOINT HTTP METHOD
2.0/jobs/run-now POST
Run a job now and return the run_id of the triggered run.
TIP
If you invoke Create together with Run now, you can use the Runs submit endpoint instead, which allows you to submit
your workload directly without having to create a job.
Example
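A request sketch that submits the run-job.json file shown below:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/jobs/run-now \
--data @run-job.json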
run-job.json :
An example request for a notebook job:
{
"job_id": 1,
"notebook_params": {
"name": "john doe",
"age": "35"
}
}
{
"job_id": 2,
"jar_params": [ "john doe", "35" ]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of run-job.json with fields that are appropriate for your solution.
job_id INT64
Response structure
FIELD NAME TYPE DESCRIPTION
Runs submit
IMPORTANT
You can create jobs only in a Data Science & Engineering workspace or a Machine Learning workspace.
A workspace is limited to 1000 concurrent job runs. A 429 Too Many Requests response is returned when you
request a run that cannot start immediately.
The number of jobs a workspace can create in an hour is limited to 5000 (includes “run now” and “runs submit”). This
limit also affects jobs created by the REST API and notebook workflows.
ENDPOINT HTTP METHOD
2.0/jobs/runs/submit POST
Submit a one-time run. This endpoint allows you to submit a workload directly without creating a job. Runs
submitted using this endpoint don’t display in the UI. Use the jobs/runs/get API to check the run state after the
job is submitted.
Example
Request
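A request sketch; the JSON below is the content you would save as submit-job.json:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/jobs/runs/submit \
--data @submit-job.json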
{
"run_name": "my spark task",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "Standard_D3_v2",
"num_workers": 10
},
"libraries": [
{
"jar": "dbfs:/my-jar.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"spark_jar_task": {
"main_class_name": "com.databricks.ComputeModels"
}
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of submit-job.json with fields that are appropriate for your solution.
{
"run_id": 123
}
Request structure
IMPORTANT
When you run a job on a new jobs cluster, the job is treated as a Jobs Compute (automated) workload subject to Jobs
Compute pricing.
When you run a job on an existing all-purpose cluster, it is treated as an All-Purpose Compute (interactive) workload
subject to All-Purpose Compute pricing.
Response structure
F IEL D N A M E TYPE DESC RIP T IO N
Runs list
ENDPOINT HTTP METHOD
2.0/jobs/runs/list GET
NOTE
Runs are automatically removed after 60 days. If you want to reference them beyond 60 days, you should save old run
results before they expire. To export using the UI, see Export job run results. To export using the Jobs API, see Runs
export.
Example
Request
Or:
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<job-id> with the ID of the job, for example 123 .
<true-false> with true or false .
<offset> with the offset value.
<limit> with the limit value.
<run-type> with the run_type value.
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
Runs get
ENDPOINT HTTP METHOD
2.0/jobs/runs/get GET
NOTE
Runs are automatically removed after 60 days. If you want to reference them beyond 60 days, you should save old run
results before they expire. To export using the UI, see Export job run results. To export using the Jobs API, see Runs
export.
Example
Request
Or:
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<run-id> with the ID of the run, for example 123 .
{
"job_id": 1,
"run_id": 452,
"number_in_job": 5,
"state": {
"life_cycle_state": "RUNNING",
"state_message": "Performing action"
},
"task": {
"notebook_task": {
"notebook_path": "/Users/someone@example.com/my-notebook"
}
},
"cluster_spec": {
"existing_cluster_id": "1201-my-cluster"
},
"cluster_instance": {
"cluster_id": "1201-my-cluster",
"spark_context_id": "1102398-spark-context-id"
},
"overriding_parameters": {
"jar_params": ["param1", "param2"]
},
"start_time": 1457570074236,
"end_time": 1457570075149,
"setup_duration": 259754,
"execution_duration": 3589020,
"cleanup_duration": 31038,
"trigger": "PERIODIC"
}
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
cluster_instance (ClusterInstance): The cluster used for this run. If the run is specified to use a new cluster, this field will be set once the Jobs service has requested a cluster for the run.
2.0/jobs/runs/export GET
NOTE
Only notebook runs can be exported in HTML format. Exporting runs of other types will fail.
Example
Request
Or:
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<run-id> with the ID of the run, for example 123 .
{
"views": [ {
"content": "<!DOCTYPE html><html><head>Head</head><body>Body</body></html>",
"name": "my-notebook",
"type": "NOTEBOOK"
} ]
}
To extract the HTML notebook from the JSON response, download and run this Python script.
NOTE
The notebook body in the __DATABRICKS_NOTEBOOK_MODEL object is encoded.
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
Runs cancel
ENDPOINT HTTP METHOD
2.0/jobs/runs/cancel POST
Cancel a job run. Because the run is canceled asynchronously, the run may still be running when this request
completes. The run will be terminated shortly. If the run is already in a terminal life_cycle_state , this method
is a no-op.
This endpoint validates that the run_id parameter is valid and for invalid parameters returns HTTP status code
400.
Example
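A request sketch; the run ID placeholder follows the replacement instructions below:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/jobs/runs/cancel \
--data '{ "run_id": <run-id> }'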
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<run-id> with the ID of the run, for example 123 .
2.0/jobs/runs/cancel-all POST
Cancel all active runs of a job. Because the run is canceled asynchronously, it doesn’t prevent new runs from
being started.
This endpoint validates that the job_id parameter is valid and for invalid parameters returns HTTP status code
400.
Example
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<job-id> with the ID of the job, for example 123 .
2.0/jobs/runs/get-output GET
Retrieve the output and metadata of a run. When a notebook task returns a value through the
dbutils.notebook.exit() call, you can use this endpoint to retrieve that value. Azure Databricks restricts this API to
return the first 5 MB of the output. For returning a larger result, you can store job results in a cloud storage
service.
This endpoint validates that the run_id parameter is valid and for invalid parameters returns HTTP status code
400.
Runs are automatically removed after 60 days. If you want to reference them beyond 60 days, you should
save old run results before they expire. To export using the UI, see Export job run results. To export using the
Jobs API, see Runs export.
Example
Request
Or:
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<run-id> with the ID of the run, for example 123 .
This example uses a .netrc file and jq.
Response
{
"metadata": {
"job_id": 1,
"run_id": 452,
"number_in_job": 5,
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
},
"task": {
"notebook_task": {
"notebook_path": "/Users/someone@example.com/my-notebook"
}
},
"cluster_spec": {
"existing_cluster_id": "1201-my-cluster"
},
"cluster_instance": {
"cluster_id": "1201-my-cluster",
"spark_context_id": "1102398-spark-context-id"
},
"overriding_parameters": {
"jar_params": ["param1", "param2"]
},
"start_time": 1457570074236,
"setup_duration": 259754,
"execution_duration": 3589020,
"cleanup_duration": 31038,
"trigger": "PERIODIC"
},
"notebook_output": {
"result": "the maybe truncated string passed to dbutils.notebook.exit()"
}
}
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
Runs delete
ENDPOINT HTTP METHOD
2.0/jobs/runs/delete POST
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<run-id> with the ID of the run, for example 123 .
Data structures
In this section:
ClusterInstance
ClusterSpec
CronSchedule
Job
JobEmailNotifications
JobSettings
JobTask
NewCluster
NotebookOutput
NotebookTask
ParamPair
PipelineTask
Run
RunLifeCycleState
RunParameters
RunResultState
RunState
SparkJarTask
SparkPythonTask
SparkSubmitTask
TriggerType
ViewItem
ViewType
ViewsToExport
ClusterInstance
Identifiers for the cluster and Spark context used by a run. These two values together identify an execution
context across all time.
ClusterSpec
IMPORTANT
When you run a job on a new jobs cluster, the job is treated as a Jobs Compute (automated) workload subject to Jobs
Compute pricing.
When you run a job on an existing all-purpose cluster, it is treated as an All-Purpose Compute (interactive) workload
subject to All-Purpose Compute pricing.
CronSchedule
FIELD NAME TYPE DESCRIPTION
Job
FIELD NAME TYPE DESCRIPTION
run_as (STRING): The user name that the job will run as. run_as is based on the current job settings, and is set to the creator of the job if job access control is disabled, or the is_owner permission if job access control is enabled.
settings (JobSettings): Settings for this job and all of its runs. These settings can be updated using the resetJob method.
JobEmailNotifications
IMPORTANT
The on_start, on_success, and on_failure fields accept only Latin characters (ASCII character set). Using non-ASCII
characters will return an error. Examples of invalid, non-ASCII characters are Chinese, Japanese kanjis, and emojis.
JobSettings
IMPORTANT
When you run a job on a new jobs cluster, the job is treated as a Jobs Compute (automated) workload subject to Jobs
Compute pricing.
When you run a job on an existing all-purpose cluster, it is treated as an All-Purpose Compute (interactive) workload
subject to All-Purpose Compute pricing.
Settings for a job. These settings can be updated using the resetJob method.
JobTask
FIELD NAME TYPE DESCRIPTION
NewCluster
FIELD NAME TYPE DESCRIPTION
Note :
NotebookOutput
FIELD NAME TYPE DESCRIPTION
NotebookTask
All output cells together are subject to an 8 MB size limit. If the output of a cell exceeds this limit, the rest of the
run will be cancelled and the run will be marked as failed. In that case, some of the content output from other cells may
also be missing.
If you need help finding the cell that is beyond the limit, run the notebook against an all-purpose cluster and use
this notebook autosave technique.
ParamPair
Name-based parameters for jobs running notebook tasks.
IMPORTANT
The fields in this data structure accept only Latin characters (ASCII character set). Using non-ASCII characters will return
an error. Examples of invalid, non-ASCII characters are Chinese, Japanese kanjis, and emojis.
PipelineTask
FIELD NAME TYPE DESCRIPTION
Run
All the information about a run except for its output. The output can be retrieved separately with the
getRunOutput method.
cluster_instance (ClusterInstance): The cluster used for this run. If the run is specified to use a new cluster, this field will be set once the Jobs service has requested a cluster for the run.
RunLifeCycleState
The life cycle state of a run. Allowed state transitions are:
PENDING -> RUNNING -> TERMINATING -> TERMINATED
PENDING -> SKIPPED
PENDING -> INTERNAL_ERROR
RUNNING -> INTERNAL_ERROR
TERMINATING -> INTERNAL_ERROR
PENDING: The run has been triggered. If there is not already an active run of the same job, the cluster and execution context are being prepared. If there is already an active run of the same job, the run will immediately transition into the SKIPPED state without preparing any resources.
TERMINATING: The task of this run has completed, and the cluster and execution context are being cleaned up.
TERMINATED: The task of this run has completed, and the cluster and execution context have been cleaned up. This state is terminal.
SKIPPED: This run was aborted because a previous run of the same job was already active. This state is terminal.
RunParameters
Parameters for this run. Only one of jar_params, python_params , or notebook_params should be specified in the
run-now request, depending on the type of job task. Jobs with a Spark JAR task or Python task take a list of
position-based parameters, and jobs with notebook tasks take a key-value map.
FIELD NAME TYPE DESCRIPTION
RunResultState
The result state of the run.
If life_cycle_state = TERMINATED : if the run had a task, the result is guaranteed to be available, and it
indicates the result of the task.
If life_cycle_state = PENDING , RUNNING , or SKIPPED , the result state is not available.
If life_cycle_state = TERMINATING or life_cycle_state = INTERNAL_ERROR : the result state is available if the run
had a task and managed to start it.
Once available, the result state never changes.
STATE DESCRIPTION
RunState
FIELD NAME TYPE DESCRIPTION
SparkJarTask
FIELD NAME TYPE DESCRIPTION
SparkPythonTask
FIELD NAME TYPE DESCRIPTION
SparkSubmitTask
IMPORTANT
You can invoke Spark submit tasks only on new clusters.
In the new_cluster specification, libraries and spark_conf are not supported. Instead, use --jars and
--py-files to add Java and Python libraries and --conf to set the Spark configuration.
master , deploy-mode , and executor-cores are automatically configured by Azure Databricks; you cannot specify
them in parameters.
By default, the Spark submit job uses all available memory (excluding reserved memory for Azure Databricks services).
You can set --driver-memory , and --executor-memory to a smaller value to leave some room for off-heap usage.
The --jars , --py-files , --files arguments support DBFS paths.
For example, assuming the JAR is uploaded to DBFS, you can run SparkPi by setting the following parameters.
{
"parameters": [
"--class",
"org.apache.spark.examples.SparkPi",
"dbfs:/path/to/examples.jar",
"10"
]
}
TriggerType
These are the type of triggers that can fire a run.
ONE_TIME: One time triggers that fire a single run. This occurs when you trigger a single run on demand through the UI or the API.
TYPE DESCRIPTION
ViewItem
The exported content is in HTML format. For example, if the view to export is dashboards, one HTML string is
returned for every dashboard.
ViewType
TYPE DESCRIPTION
ViewsToExport
View to export: either code, all dashboards, or all.
The Libraries API allows you to install and uninstall libraries and get the status of libraries on a cluster.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
2.0/libraries/all-cluster-statuses GET
Get the status of all libraries on all clusters. A status will be available for all libraries installed on clusters via the
API or the libraries UI as well as libraries set to be installed on all clusters via the libraries UI. If a library has been
set to be installed on all clusters, is_library_for_all_clusters will be true , even if the library was also installed
on this specific cluster.
Example
Request
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
Response structure
FIELD NAME TYPE DESCRIPTION
Cluster status
ENDPOINT HTTP METHOD
2.0/libraries/cluster-status GET
Get the status of libraries on a cluster. A status will be available for all libraries installed on the cluster via the API
or the libraries UI as well as libraries set to be installed on all clusters via the libraries UI. If a library has been set
to be installed on all clusters, is_library_for_all_clusters will be true , even if the library was also installed on
the cluster.
Example
Request
Or:
curl --netrc --get \
https://<databricks-instance>/api/2.0/libraries/cluster-status \
--data cluster_id=<cluster-id> \
| jq .
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<cluster-id> with the Azure Databricks workspace ID of the cluster, for example 1234-567890-example123 .
{
"cluster_id": "11203-my-cluster",
"library_statuses": [
{
"library": {
"jar": "dbfs:/mnt/libraries/library.jar"
},
"status": "INSTALLED",
"messages": [],
"is_library_for_all_clusters": false
},
{
"library": {
"pypi": {
"package": "beautifulsoup4"
},
},
"status": "INSTALLING",
"messages": ["Successfully resolved package from PyPI"],
"is_library_for_all_clusters": false
},
{
"library": {
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
},
},
"status": "FAILED",
"messages": ["R package installation is not supported on this spark version.\nPlease upgrade to
Runtime 3.2 or higher"],
"is_library_for_all_clusters": false
}
]
}
Request structure
FIELD NAME TYPE DESCRIPTION
Response structure
FIELD NAME TYPE DESCRIPTION
Install
ENDPOINT HTTP METHOD
2.0/libraries/install POST
Install libraries on a cluster. The installation is asynchronous - it completes in the background after the request.
IMPORTANT
This call will fail if the cluster is terminated.
Installing a wheel library on a cluster is like running the pip command against the wheel file directly on driver
and executors. All the dependencies specified in the library setup.py file are installed and this requires the
library name to satisfy the wheel file name convention.
The installation on the executors happens only when a new task is launched. With Databricks Runtime 7.1 and
below, the installation order of libraries is nondeterministic. For wheel libraries, you can ensure a deterministic
installation order by creating a zip file with suffix .wheelhouse.zip that includes all the wheel files.
Example
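A request sketch that submits the install-libraries.json file shown below:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/libraries/install \
--data @install-libraries.json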
install-libraries.json :
{
"cluster_id": "10201-my-cluster",
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"whl": "dbfs:/mnt/libraries/mlflow-0.0.1.dev0-py2-none-any.whl"
},
{
"whl": "dbfs:/mnt/libraries/wheel-libraries.wheelhouse.zip"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": ["slf4j:slf4j"]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "https://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of install-libraries.json with fields that are appropriate for your solution.
Uninstall
ENDPOINT HTTP METHOD
2.0/libraries/uninstall POST
Set libraries to be uninstalled on a cluster. The libraries aren’t uninstalled until the cluster is restarted.
Uninstalling libraries that are not installed on the cluster has no impact but is not an error.
Example
uninstall-libraries.json :
{
"cluster_id": "10201-my-cluster",
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"cran": "ada"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of uninstall-libraries.json with fields that are appropriate for your solution.
Data structures
In this section:
ClusterLibraryStatuses
Library
LibraryFullStatus
MavenLibrary
PythonPyPiLibrary
RCranLibrary
LibraryInstallStatus
ClusterLibraryStatuses
FIELD NAME TYPE DESCRIPTION
Library
FIELD NAME TYPE DESCRIPTION
jar OR egg OR whl OR pypi OR maven OR cran (STRING OR STRING OR STRING OR PythonPyPiLibrary OR MavenLibrary OR RCranLibrary): If jar, URI of the JAR to be installed. DBFS and ADLS ( abfss ) URIs are supported. For example: { "jar": "dbfs:/mnt/databricks/library.jar" } or { "jar": "abfss://my-bucket/library.jar" } . If ADLS is used, make sure the cluster has read access on the library.
LibraryFullStatus
The status of the library on a specific cluster.
FIELD NAME TYPE DESCRIPTION
messages (an array of STRING): All the info and warning messages that have occurred so far for this library.
MavenLibrary
FIELD NAME TYPE DESCRIPTION
PythonPyPiLibrary
FIELD NAME TYPE DESCRIPTION
RCranLibrary
FIELD NAME TYPE DESCRIPTION
LibraryInstallStatus
The status of a library on a specific cluster.
PENDING: No action has yet been taken to install the library. This state should be very short lived.
UNINSTALL_ON_RESTART: The library has been marked for removal. Libraries can be removed only when clusters are restarted, so libraries that enter this state will remain until the cluster is restarted.
MLflow API 2.0
7/21/2022 • 2 minutes to read
Azure Databricks provides a managed version of the MLflow tracking server and the Model Registry, which host
the MLflow REST API. You can invoke the MLflow REST API using URLs of the form
https://<databricks-instance>/api/2.0/mlflow/<api-endpoint>
replacing <databricks-instance> with the workspace URL of your Azure Databricks deployment.
MLflow compatibility matrix lists the MLflow release packaged in each Databricks Runtime version and a link to
the respective documentation.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Rate limits
The MLflow APIs are rate limited as four groups, based on their function and maximum throughput. The
following is the list of API groups and their respective limits in qps (queries per second):
Low throughput experiment management (list, update, delete, restore): 7 qps
Search runs: 7 qps
Log batch: 47 qps
All other APIs: 127 qps
In addition, there is a limit of 20 concurrent model versions in Pending status (in creation) per workspace.
If the rate limit is reached, subsequent API calls will return status code 429. All MLflow clients (including the UI)
automatically retry 429s with an exponential backoff.
API reference
The MLflow API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Permissions API 2.0
7/21/2022 • 2 minutes to read
IMPORTANT
This feature is in Public Preview.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Repos API 2.0
7/21/2022 • 2 minutes to read
The Repos API allows you to manage Databricks repos programmatically. See Git integration with Databricks
Repos.
The Repos API is provided as an OpenAPI 3.0 specification that you can download and view as a structured API
reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
SCIM API 2.0
7/21/2022 • 2 minutes to read
IMPORTANT
This feature is in Public Preview.
Azure Databricks supports SCIM, or System for Cross-domain Identity Management, an open standard that
allows you to automate user provisioning using a REST API and JSON. The Azure Databricks SCIM API follows
version 2.0 of the SCIM protocol.
Requirements
Your Azure Databricks account must have the Premium Plan.
https://<databricks-instance>/api/2.0/preview/scim/v2/<api-endpoint>
Header parameters
PARAMETER TYPE DESCRIPTION
machine <databricks-instance>
login token
password <access-token>
Filter results
Use filters to return a subset of users or groups. For all users, the user userName and group displayName fields
are supported. Admin users can filter users on the active attribute.
Sort results
Sort results using the sortBy and sortOrder query parameters. The default is to sort by ID.
IMPORTANT
This feature is in Public Preview.
Requirements
Your Azure Databricks account must have the Premium Plan.
Get me
ENDPOINT HTTP METHOD
2.0/preview/scim/v2/Me GET
Retrieve the same information about yourself as returned by Get user by ID.
Example
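A request sketch, assuming a .netrc file and jq:
curl --netrc -X GET \
https://<databricks-instance>/api/2.0/preview/scim/v2/Me \
| jq .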
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
IMPORTANT
This feature is in Public Preview.
An Azure Databricks administrator can invoke all SCIM API endpoints. Non-admin users can invoke the Get
users endpoint to read user display names and IDs.
NOTE
Each workspace can have a maximum of 10,000 users and 5,000 groups. Service principals count toward the user
maximum.
SCIM (Users) lets you create users in Azure Databricks and give them the proper level of access, temporarily lock
and unlock user accounts, and remove access for users (deprovision them) when they leave your organization
or no longer need access to Azure Databricks.
For error codes, see SCIM API 2.0 Error Codes.
Requirements
Your Azure Databricks account must have the Premium Plan.
Get users
ENDPOINT HTTP METHOD
2.0/preview/scim/v2/Users GET
Admin users: Retrieve a list of all users in the Azure Databricks workspace.
Non-admin users: Retrieve a list of all users in the Azure Databricks workspace, returning username, user
display name, and object ID only.
Examples
This example gets information about all users.
This example uses the eq (equals) filter query parameter with userName to get information about a specific
user.
curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/scim/v2/Users?filter=userName+eq+<username>" \
| jq .
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<username> with the Azure Databricks workspace username of the user, for example someone@example.com .
Get user by ID
ENDPOINT HTTP METHOD
2.0/preview/scim/v2/Users/{id} GET
Admin users: Retrieve a single user resource from the Azure Databricks workspace, given their Azure Databricks
ID.
Example
Request
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.
Response
Create user
ENDPOINT HTTP METHOD
2.0/preview/scim/v2/Users POST
Example
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users \
--header 'Content-type: application/scim+json' \
--data @create-user.json \
| jq .
create-user.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:User" ],
"userName": "<username>",
"groups": [
{
"value":"123456"
}
],
"entitlements":[
{
"value":"allow-cluster-create"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<username> with the Azure Databricks workspace username of the user, for example someone@example.com .
2.0/preview/scim/v2/Users/{id} PATCH
Admin users: Update a user resource with operations on specific attributes, except those that are immutable (
userName and userId ). The PATCH method is recommended over the PUT method for setting or updating user
entitlements.
Request parameters follow the standard SCIM 2.0 protocol and depend on the value of the schemas attribute.
Example
This example adds the allow-cluster-create entitlement to the specified user.
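A request sketch that applies the update-user.json file shown below; <user-id> is a placeholder described in the replacement instructions:
curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id> \
--header 'Content-type: application/scim+json' \
--data @update-user.json \
| jq .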
update-user.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "add",
"path": "entitlements",
"value": [
{
"value": "allow-cluster-create"
}
]
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.
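The accompanying request is not shown above; a sketch that submits update-user.json with the PATCH method, mirroring the Create user example, could look like:
curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id> \
--header 'Content-type: application/scim+json' \
--data @update-user.json \
| jq .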
Update user by ID (PUT)
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Users/{id} PUT
Admin users: Overwrite the user resource across multiple attributes, except those that are immutable ( userName
and userId ).
Request must include the schemas attribute, set to urn:ietf:params:scim:schemas:core:2.0:User .
NOTE
The PATCH method is recommended over the PUT method for setting or updating user entitlements.
Example
This example changes the specified user’s previous entitlements to now have only the allow-cluster-create
entitlement.
overwrite-user.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:User" ],
"userName": "<username>",
"entitlements": [
{
"value": "allow-cluster-create"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
<username> with the Azure Databricks workspace username of the user, for example someone@example.com . To
get the username, call Get users.
This example uses a .netrc file and jq.
Delete user by ID
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Users/{id} DELETE
Admin users: Remove a user resource. A user that does not own or belong to a workspace in Azure Databricks is
automatically purged after 30 days.
Deleting a user from a workspace also removes objects associated with the user. For example, notebooks are
archived, clusters are terminated, and jobs become ownerless.
The user’s home directory is not automatically deleted. Only an administrator can access or remove a deleted
user’s home directory.
The access control list (ACL) configuration of a user is preserved even after that user is removed from a
workspace.
Example request
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file.
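A minimal sketch of the delete request, using the placeholders above:
curl --netrc -X DELETE \
"https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id>"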
Activate and deactivate user by ID
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Users/{id} PATCH
Admin users: Activate or deactivate a user. Deactivating a user removes all access to a workspace for that user
but leaves permissions and objects associated with the user unchanged. Clusters associated with the user keep
running, and notebooks remain in their original locations. The user’s tokens are retained but cannot be used to
authenticate while the user is deactivated. Scheduled jobs, however, fail unless assigned to a new owner.
You can use the Get users and Get user by ID requests to view whether users are active or inactive.
NOTE
Allow at least five minutes for the cache to be cleared for deactivation to take effect.
IMPORTANT
An Azure Active Directory (Azure AD) user with the Contributor or Owner role on the Azure Databricks subscription
can reactivate themselves using the Azure AD login flow. If a user with one of these roles needs to be deactivated, you
should also revoke their privileges on the subscription.
Set the active value to false to deactivate a user and true to activate a user.
Example
Request
toggle-user-activation.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "replace",
"path": "active",
"value": [
{
"value": "false"
}
]
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 .
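The request itself is not reproduced above; a sketch that submits toggle-user-activation.json with the PATCH method might be:
curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/scim/v2/Users/<user-id> \
--header 'Content-type: application/scim+json' \
--data @toggle-user-activation.json \
| jq .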
{
"emails": [
{
"type": "work",
"value": "someone@example.com",
"primary": true
}
],
"displayName": "Someone User",
"schemas": [
"urn:ietf:params:scim:schemas:core:2.0:User",
"urn:ietf:params:scim:schemas:extension:workspace:2.0:User"
],
"name": {
"familyName": "User",
"givenName": "Someone"
},
"active": false,
"groups": [],
"id": "123456",
"userName": "someone@example.com"
}
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Users GET
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .
Admin users: Deactivate users that have not logged in for a customizable period. Scheduled jobs owned by a
user are also considered activity.
ENDPOINT    HTTP METHOD
2.0/preview/workspace-conf PATCH
The request body is a key-value pair where the value is the time limit for how long a user can be inactive before
being automatically deactivated.
Example
deactivate-users.json :
{
"maxUserInactiveDays": "90"
}
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .
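A minimal sketch of the PATCH request that submits deactivate-users.json (the application/json content type is an assumption; it is not stated in the original example):
curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/workspace-conf \
--header 'Content-type: application/json' \
--data @deactivate-users.json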
Admin users: Retrieve the user inactivity limit defined for a workspace.
ENDPOINT    HTTP METHOD
2.0/preview/workspace-conf GET
Example request
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file and jq.
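The example request is not reproduced above. A sketch that reads the setting back by naming its configuration key might be (the keys query parameter is an assumption, not shown in the original example):
# The keys query parameter is an assumption; adjust if your workspace-conf call differs.
curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/workspace-conf?keys=maxUserInactiveDays" \
| jq .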
Example response
{
"maxUserInactiveDays": "90"
}
SCIM API 2.0 (Groups)
7/21/2022 • 3 minutes to read
IMPORTANT
This feature is in Public Preview.
Requirements
Your Azure Databricks account must have the Premium Plan.
NOTE
An Azure Databricks administrator can invoke all SCIM API endpoints.
Non-admin users can invoke the Get groups endpoint to read group display names and IDs.
You can have no more than 10,000 users and 5,000 groups in a workspace.
SCIM (Groups) lets you create users and groups in Azure Databricks and give them the proper level of access
and remove access for groups (deprovision them).
For error codes, see SCIM API 2.0 Error Codes.
Get groups
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Groups GET
Admin users: Retrieve a list of all groups in the Azure Databricks workspace.
Non-admin users: Retrieve a list of all groups in the Azure Databricks workspace, returning group display name
and object ID only.
Examples
You can use filters to specify subsets of groups. For example, you can apply the sw (starts with) filter parameter
to displayName to retrieve a specific group or set of groups. This example retrieves all groups with a
displayName field that start with my- .
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .
This example uses a .netrc file and jq.
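A minimal sketch of the filtered request described above:
curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/scim/v2/Groups?filter=displayName+sw+my-" \
| jq .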
Get group by ID
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Groups/{id} GET
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
Create group
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Groups POST
Members list is optional and can include users and other groups. You can also add members to a group using
PATCH .
Example
create-group.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:Group" ],
"displayName": "<group-name>",
"members": [
{
"value":"<user-id>"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-name> with the name of the group in the Azure Databricks workspace, for example my-group .
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.
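The POST request is not shown above; a sketch that submits create-group.json, following the pattern of the Create user example:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/preview/scim/v2/Groups \
--header 'Content-type: application/scim+json' \
--data @create-group.json \
| jq .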
Update group
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Groups/{id} PATCH
Admin users: Update a group in Azure Databricks by adding or removing members. Can add and remove
individual members or groups within the group.
Request parameters follow the standard SCIM 2.0 protocol and depend on the value of the schemas attribute.
NOTE
Azure Databricks does not support updating group names.
Example
Add to group
update-group.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op":"add",
"value": {
"members": [
{
"value":"<user-id>"
}
]
}
}
]
}
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "remove",
"path": "members[value eq \"<user-id>\"]"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
<user-id> with the Azure Databricks workspace ID of the user, for example 2345678901234567 . To get the
user ID, call Get users.
This example uses a .netrc file and jq.
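A sketch of the PATCH request that submits update-group.json (the same form works for the remove operation shown above):
curl --netrc -X PATCH \
https://<databricks-instance>/api/2.0/preview/scim/v2/Groups/<group-id> \
--header 'Content-type: application/scim+json' \
--data @update-group.json \
| jq .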
Delete group
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/Groups/{id} DELETE
Admin users: Remove a group from Azure Databricks. Users in the group are not removed.
Example
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file.
SCIM API 2.0 (ServicePrincipals)
7/21/2022 • 6 minutes to read
IMPORTANT
This feature is in Public Preview.
SCIM (ServicePrincipals) lets you manage Azure Active Directory service principals in Azure Databricks.
For error codes, see SCIM API 2.0 Error Codes.
For additional examples, see Service principals for Azure Databricks automation.
Requirements
Your Azure Databricks account must have the Premium Plan.
Get service principals
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/ServicePrincipals GET
You can use filters to specify subsets of service principals. For example, you can apply the eq (equals) filter
parameter to applicationId to retrieve a specific service principal:
In workspaces with a large number of service principals, you can exclude attributes from the request to improve
performance.
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<application-id> with the applicationId value of the service principal, for example
12345a67-8b9c-0d1e-23fa-4567b89cde01 .
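A minimal sketch of the filtered request described above:
curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals?filter=applicationId+eq+<application-id>" \
| jq .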
Get service principal by ID
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/ServicePrincipals/{id} GET
Retrieve a single service principal resource from the Azure Databricks workspace, given a service principal ID.
Example
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.
Add service principal
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/ServicePrincipals POST
Add an Azure Active Directory (Azure AD) service principal to the Azure Databricks workspace. In Azure
Databricks, you must create an application in Azure Active Directory and then add it to your Azure Databricks
workspace to use as a service principal. Service principals count toward the limit of 10000 users per workspace.
Request parameters follow the standard SCIM 2.0 protocol.
Example
add-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"applicationId": "<azure-application-id>",
"displayName": "<display-name>",
"groups": [
{
"value": "<group-id>"
}
],
"entitlements": [
{
"value":"allow-cluster-create"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<azure-application-id> with the application ID of the Azure Active Directory (Azure AD) application, for
example 12345a67-8b9c-0d1e-23fa-4567b89cde01
<display-name> with the display name of the service principal, for example someone@example.com .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
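The POST request is not reproduced above; a sketch that submits add-service-principal.json:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals \
--header 'Content-type: application/scim+json' \
--data @add-service-principal.json \
| jq .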
Update service principal by ID (PATCH)
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/ServicePrincipals/{id} PATCH
Update a service principal resource with operations on specific attributes, except for applicationId and id ,
which are immutable.
Use the PATCH method to add, update, or remove individual attributes. Use the PUT method to overwrite the
entire service principal in a single operation.
Request parameters follow the standard SCIM 2.0 protocol and depend on the value of the schemas attribute.
Add entitlements
Example
change-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "add",
"path": "entitlements",
"value": [
{
"value": "allow-cluster-create"
}
]
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.
Remove entitlements
Example
change-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "remove",
"path": "entitlements",
"value": [
{
"value": "allow-cluster-create"
}
]
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.
Add to a group
Example
change-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "add",
"path": "groups",
"value": [
{
"value": "<group-id>"
}
]
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
Remove from a group
Example
remove-from-group.json :
{
"schemas": [ "urn:ietf:params:scim:api:messages:2.0:PatchOp" ],
"Operations": [
{
"op": "remove",
"path": "members[value eq \"<service-principal-id>\"]"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
This example uses a .netrc file and jq.
Update service principal by ID (PUT)
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/ServicePrincipals/{id} PUT
Overwrite the entire service principal resource, except for applicationId and id , which are immutable.
Use the PATCH method to add, update, or remove individual attributes.
IMPORTANT
You must include the schemas attribute in the request, with the exact value urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal .
Examples
Add an entitlement
update-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"applicationId": "<appliation-id>",
"displayName": "<display-name>",
"groups": [
{
"value": "<group-id>"
}
],
"entitlements": [
{
"value":"allow-cluster-create"
}
]
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
<application-id> with the applicationId value of the service principal, for example
12345a67-8b9c-0d1e-23fa-4567b89cde01 .
<display-name> with the display name of the service principal, for example someone@example.com .
<group-id> with the ID of the group in the Azure Databricks workspace, for example 2345678901234567 . To
get the group ID, call Get groups.
This example uses a .netrc file and jq.
Remove all entitlements and groups
Removing all entitlements and groups is a reversible alternative to deactivating the service principal.
Use the PUT method to avoid the need to check the existing entitlements and group memberships first.
update-service-principal.json :
{
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"applicationId": "<application-id>",
"displayName": "<display-name>",
"groups": [],
"entitlements": []
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<service-principal-id> with the ID of the service principal, for example 2345678901234567 . To get the service
principal ID, call Get service principals.
<application-id> with the applicationId value of the service principal, for example
12345a67-8b9c-0d1e-23fa-4567b89cde01 .
<display-name> with the display name of the service principal, for example someone@example.com .
Delete service principal by ID
ENDPOINT    HTTP METHOD
2.0/preview/scim/v2/ServicePrincipals/{id} DELETE
Secrets API 2.0
The Secrets API allows you to manage secrets, secret scopes, and access permissions. To manage secrets, you
must:
1. Create a secret scope.
2. Add your secrets to the scope.
3. If you have the Premium Plan, assign access control to the secret scope.
To learn more about creating and managing secrets, see Secret management and Secret access control. You
access and reference secrets in notebooks and jobs by using Secrets utility (dbutils.secrets).
IMPORTANT
To access Databricks REST APIs, you must authenticate. To use the Secrets API with Azure Key Vault secrets, you must
authenticate using an Azure Active Directory token.
Create secret scope
ENDPOINT    HTTP METHOD
2.0/secrets/scopes/create POST
Example
create-scope.json :
{
"scope": "my-simple-azure-keyvault-scope",
"scope_backend_type": "AZURE_KEYVAULT",
"backend_azure_keyvault":
{
"resource_id": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/azure-
rg/providers/Microsoft.KeyVault/vaults/my-azure-kv",
"dns_name": "https://my-azure-kv.vault.azure.net/"
},
"initial_manage_principal": "users"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<token> with your Azure Databricks personal access token. For more information, see Authentication using
Azure Databricks personal access tokens.
<management-token> with your Azure Active Directory token. For more information, see Get Azure AD tokens
by using the Microsoft Authentication Library.
The contents of create-scope.json with fields that are appropriate for your solution.
This example uses a .netrc file.
If initial_manage_principal is specified, the initial ACL on the scope grants MANAGE permission to the supplied principal (user, service principal, or group). The only supported principal for this option is the group users , which contains all users in the workspace. If initial_manage_principal is not specified, the initial ACL grants MANAGE permission on the scope to the user identity of the API request issuer.
Throws RESOURCE_ALREADY_EXISTS if a scope with the given name already exists. Throws RESOURCE_LIMIT_EXCEEDED
if maximum number of scopes in the workspace is exceeded. Throws INVALID_PARAMETER_VALUE if the scope
name is invalid.
For more information, see Create an Azure Key Vault-backed secret scope using the Databricks CLI.
Create a Databricks-backed secret scope
The scope name:
Must be unique within a workspace.
Must consist of alphanumeric characters, dashes, underscores, and periods, and may not exceed 128
characters.
The names are considered non-sensitive and are readable by all users in the workspace. A workspace is limited
to a maximum of 100 secret scopes.
Example
create-scope.json :
{
"scope": "my-simple-databricks-scope",
"initial_manage_principal": "users"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of create-scope.json with fields that are appropriate for your solution.
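A minimal sketch of the request that submits create-scope.json, using the same .netrc convention as the other examples:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/secrets/scopes/create \
--data @create-scope.json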
Delete secret scope
ENDPOINT    HTTP METHOD
2.0/secrets/scopes/delete POST
delete-scope.json :
{
"scope": "my-secret-scope"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of delete-scope.json with fields that are appropriate for your solution.
This example uses a .netrc file.
Throws RESOURCE_DOES_NOT_EXIST if the scope does not exist. Throws PERMISSION_DENIED if the user does not
have permission to make this API call.
Request structure
FIELD NAME    TYPE    DESCRIPTION
List secret scopes
ENDPOINT    HTTP METHOD
2.0/secrets/scopes/list GET
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .
{
"scopes": [
{
"name": "my-databricks-scope",
"backend_type": "DATABRICKS"
},
{
"name": "mount-points",
"backend_type": "DATABRICKS"
}
]
}
Throws PERMISSION_DENIED if you do not have permission to make this API call.
Response structure
FIELD NAME    TYPE    DESCRIPTION
ENDPOINT    HTTP METHOD
2.0/secrets/put POST
Insert a secret under the provided scope with the given name. If a secret already exists with the same name, this
command overwrites the existing secret’s value. The server encrypts the secret using the secret scope’s
encryption settings before storing it. You must have WRITE or MANAGE permission on the secret scope.
The secret key must consist of alphanumeric characters, dashes, underscores, and periods, and cannot exceed
128 characters. The maximum allowed secret value size is 128 KB. The maximum number of secrets in a given
scope is 1000.
You can read a secret value only from within a command on a cluster (for example, through a notebook); there is
no API to read a secret value outside of a cluster. The permission applied is based on who is invoking the
command and you must have at least READ permission.
Example
put-secret.json :
{
"scope": "my-databricks-scope",
"key": "my-string-key",
"string_value": "my-value"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of put-secret.json with fields that are appropriate for your solution.
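A minimal sketch of the request that submits put-secret.json:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/secrets/put \
--data @put-secret.json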
Delete secret
The method for deleting a secret depends on the type of scope backend. To delete a secret from a scope backed
by Azure Key Vault, use the Azure SetSecret REST API. To delete a secret from a Databricks-backed scope, use the
following endpoint:
ENDPOINT    HTTP METHOD
2.0/secrets/delete POST
Delete the secret stored in this secret scope. You must have WRITE or MANAGE permission on the secret scope.
Example
delete-secret.json :
{
"scope": "my-secret-scope",
"key": "my-secret-key"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of delete-secret.json with fields that are appropriate for your solution.
List secrets
ENDPOINT    HTTP METHOD
2.0/secrets/list GET
List the secret keys that are stored at this scope. This is a metadata-only operation; you cannot retrieve secret
data using this API. You must have READ permission to make this call.
Example
Request
Or:
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<scope-name> with the name of the secrets scope, for example my-scope .
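The two request forms mentioned above (query parameter or request body) are not reproduced; a sketch of the query-parameter form might be:
curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/secrets/list?scope=<scope-name>" \
| jq .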
{
"secrets": [
{
"key": "my-string-key",
"last_updated_timestamp": 1520467595000
},
{
"key": "my-byte-key",
"last_updated_timestamp": 1520467595000
}
]
}
Response structure
FIELD NAME    TYPE    DESCRIPTION
2.0/secrets/acls/put POST
Create or overwrite the ACL associated with the given principal (user, service principal, or group) on the
specified scope point. In general, a user, service principal, or group will use the most powerful permission
available to them, and permissions are ordered as follows:
MANAGE - Allowed to change ACLs, and read and write to this secret scope.
WRITE - Allowed to read and write to this secret scope.
READ - Allowed to read this secret scope and list what secrets are available.
put-secret-acl.json :
{
"scope": "my-secret-scope",
"principal": "data-scientists",
"permission": "READ"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of put-secret-acl.json with fields that are appropriate for your solution.
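A minimal sketch of the request that submits put-secret-acl.json:
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/secrets/acls/put \
--data @put-secret-acl.json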
Request structure
FIELD NAME    TYPE    DESCRIPTION
2.0/secrets/acls/delete POST
delete-secret-acl.json :
{
"scope": "my-secret-scope",
"principal": "data-scientists"
}
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
The contents of delete-secret-acl.json with fields that are appropriate for your solution.
2.0/secrets/acls/get GET
Describe the details about the given ACL, such as the group and permission.
You must have the MANAGE permission to invoke this API.
Example
Request
Or:
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<scope-name> with the name of the secrets scope, for example my-scope .
<principal-name> with the name of the principal, for example users .
{
"principal": "data-scientists",
"permission": "READ"
}
Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws PERMISSION_DENIED if you do not have
permission to make this API call.
Request structure
FIELD NAME    TYPE    DESCRIPTION
Response structure
FIELD NAME    TYPE    DESCRIPTION
2.0/secrets/acls/list GET
Or:
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
<scope-name> with the name of the secrets scope, for example my-scope .
Throws RESOURCE_DOES_NOT_EXIST if no such secret scope exists. Throws PERMISSION_DENIED if you do not have
permission to make this API call.
Request structure
FIELD NAME    TYPE    DESCRIPTION
Response structure
FIELD NAME    TYPE    DESCRIPTION
Data structures
In this section:
AclItem
SecretMetadata
SecretScope
AclPermission
ScopeBackendType
AclItem
An item representing an ACL rule applied to the given principal (user, service principal, or group) on the
associated scope point.
SecretMetadata
The metadata about a secret. Returned when listing secrets. Does not contain the actual secret value.
FIELD NAME    TYPE    DESCRIPTION
SecretScope
An organizational resource for storing secrets. Secret scopes can be different types, and ACLs can be applied to
control permissions for all secrets within a scope.
AclPermission
The ACL permission levels for secret ACLs applied to secret scopes.
ScopeBackendType
The type of secret scope backend.
The Token API allows you to create, list, and revoke tokens that can be used to authenticate and access Azure
Databricks REST APIs.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Create
ENDPOINT    HTTP METHOD
2.0/token/create POST
Create and return a token. This call returns the error QUOTA_EXCEEDED if the current number of non-expired
tokens exceeds the token quota. The token quota for a user is 600.
Example
Request
Replace:
<databricks-instance> with the Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
This is an example token with a description to attach to the token.
7776000 with the lifetime of the token, in seconds. This example specifies 90 days.
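The request body is not shown above; a sketch that passes the comment and lifetime described in the Replace list (the field names comment and lifetime_seconds are assumptions based on the standard Token API request):
# Field names are assumptions; verify against the Token API request structure.
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/token/create \
--data '{ "comment": "This is an example token", "lifetime_seconds": 7776000 }' \
| jq .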
{
"token_value": "dapi1a2b3c45d67890e1f234567a8bc9012d",
"token_info": {
"token_id": "1234567890a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c3",
"creation_time": 1626286601651,
"expiry_time": 1634062601651,
"comment": "This is an example token"
}
}
Request structure
FIELD NAME    TYPE    DESCRIPTION
Response structure
FIELD NAME    TYPE    DESCRIPTION
List
ENDPOINT    HTTP METHOD
2.0/token/list GET
Replace <databricks-instance> with the Azure Databricks workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net .
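A minimal sketch of the list request:
curl --netrc -X GET \
https://<databricks-instance>/api/2.0/token/list \
| jq .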
{
"token_infos": [
{
"token_id": "1234567890a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c3",
"creation_time": 1626286601651,
"expiry_time": 1634062601651,
"comment": "This is an example token"
},
{
"token_id": "2345678901a12bc3456de789012f34ab56c78d9012e3fabc4de56f7a89b012c4",
"creation_time": 1626286906596,
"expiry_time": 1634062906596,
"comment": "This is another example token"
}
]
}
Response structure
FIELD NAME    TYPE    DESCRIPTION
token_infos    An array of Public token info    A list of token information for a user-workspace pair.
Revoke
ENDPOINT    HTTP METHOD
2.0/token/delete POST
Revoke an access token. This call returns the error RESOURCE_DOES_NOT_EXIST if a token with the specified ID is not
valid.
Example
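The example request is not reproduced above; a sketch (the token_id field and <token-id> placeholder are assumptions) might be:
# <token-id> is a hypothetical placeholder for an ID returned by 2.0/token/list.
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/token/delete \
--data '{ "token_id": "<token-id>" }'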
Request structure
FIELD NAME    TYPE    DESCRIPTION
Data structures
In this section:
Public token info
Public token info
A data structure that describes the public metadata of an access token.
The Token Management API lets Azure Databricks administrators manage their users’ Azure Databricks personal
access tokens. As an admin, you can:
Monitor and revoke users’ personal access tokens.
Control the lifetime of future tokens in your workspace.
You can also control which users can create and use tokens via the Permissions API 2.0 or in the Admin Console.
The Token Management API is provided as an OpenAPI 3.0 specification that you can download and view as a
structured API reference in your favorite OpenAPI editor.
Download the OpenAPI specification
View in Redocly: this link immediately opens the OpenAPI specification as a structured API reference for easy
viewing.
View in Postman: Postman is an app that you must download to your computer. Once you do, you can import
the OpenAPI spec as a file or URL.
View in Swagger Editor: In the online Swagger Editor, go to the File menu and click Import file to import
and view the downloaded OpenAPI specification.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Workspace API 2.0
7/21/2022 • 6 minutes to read
The Workspace API allows you to list, import, export, and delete notebooks and folders. The maximum allowed size of a request to the Workspace API is 10MB. See Cluster log delivery examples for a how-to guide on this API.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
Delete
ENDPOINT    HTTP METHOD
2.0/workspace/delete POST
Delete an object or a directory (and, optionally, recursively delete all objects in the directory). If path does not exist, this call returns an error RESOURCE_DOES_NOT_EXIST . If path is a non-empty directory and recursive is set to false , this call returns an error DIRECTORY_NOT_EMPTY . Object deletion cannot be undone, and deleting a directory recursively is not atomic.
Example
Request:
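A minimal sketch of the delete request (the path value is a hypothetical example):
# The path value is illustrative only.
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/workspace/delete \
--data '{ "path": "/Users/me@example.com/MyFolder", "recursive": true }'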
Export
ENDPOINT    HTTP METHOD
2.0/workspace/export GET
Export a notebook or contents of an entire directory. You can also export a Databricks Repo, or a notebook or
directory from a Databricks Repo. You cannot export non-notebook files from a Databricks Repo. If path does
not exist, this call returns an error RESOURCE_DOES_NOT_EXIST . You can export a directory only in DBC format. If
the exported data exceeds the size limit, this call returns an error MAX_NOTEBOOK_SIZE_EXCEEDED . This API does not
support exporting a library.
Example
Request:
Response:
If the direct_download field was set to false or was omitted from the request, a base64-encoded version of the
content is returned, for example:
{
"content": "Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKMSsx",
}
Otherwise, if direct_download was set to true in the request, the content is downloaded.
Request structure
FIELD NAME    TYPE    DESCRIPTION
Response structure
FIELD NAME    TYPE    DESCRIPTION
Get status
ENDPOINT    HTTP METHOD
2.0/workspace/get-status GET
Gets the status of an object or a directory. If path does not exist, this call returns an error
RESOURCE_DOES_NOT_EXIST .
Example
Request:
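A minimal sketch of the request, using the path that appears in the response below:
curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/workspace/get-status?path=/Users/me@example.com/MyFolder/MyNotebook" \
| jq .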
Response:
{
"object_type": "NOTEBOOK",
"path": "/Users/me@example.com/MyFolder/MyNotebook",
"language": "PYTHON",
"object_id": 123456789012345
}
Request structure
FIELD NAME    TYPE    DESCRIPTION
Response structure
FIELD NAME    TYPE    DESCRIPTION
Import
ENDPOINT    HTTP METHOD
2.0/workspace/import POST
Import a notebook or the contents of an entire directory. If path already exists and overwrite is set to false ,
this call returns an error RESOURCE_ALREADY_EXISTS . You can use only DBC format to import a directory.
Example
Import a base64-encoded string:
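The original request is not reproduced; a sketch that imports a small base64-encoded Python source string (the path and content values are hypothetical; the content decodes to print("hello")):
# path and content are illustrative; content is base64 for: print("hello")
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/workspace/import \
--data '{ "path": "/Users/me@example.com/MyNewNotebook", "format": "SOURCE", "language": "PYTHON", "content": "cHJpbnQoImhlbGxvIik=", "overwrite": false }'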
List
ENDPOINT    HTTP METHOD
2.0/workspace/list GET
List the contents of a directory, or the object if it is not a directory. If the input path does not exist, this call
returns an error RESOURCE_DOES_NOT_EXIST .
Example
List directories and their contents:
Request:
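A minimal sketch of the list request, using a hypothetical user folder:
# The path value is illustrative only.
curl --netrc -X GET \
"https://<databricks-instance>/api/2.0/workspace/list?path=/Users/me@example.com/" \
| jq .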
Response:
{
"objects": [
{
"path": "/Users/me@example.com/MyFolder",
"object_type": "DIRECTORY",
"object_id": 234567890123456
},
{
"path": "/Users/me@example.com/MyFolder/MyNotebook",
"object_type": "NOTEBOOK",
"language": "PYTHON",
"object_id": 123456789012345
},
{
"..."
}
]
}
List repos:
Response:
{
"objects": [
{
"path": "/Repos/me@example.com/MyRepo1",
"object_type": "REPO",
"object_id": 234567890123456
},
{
"path": "/Repos/me@example.com/MyRepo2",
"object_type": "REPO",
"object_id": 123456789012345
},
{
"..."
}
]
}
Request structure
FIELD NAME    TYPE    DESCRIPTION
Response structure
FIELD NAME    TYPE    DESCRIPTION
Mkdirs
ENDPOINT    HTTP METHOD
2.0/workspace/mkdirs POST
Create the given directory and necessary parent directories if they do not exist. If there exists an object (not a directory) at any prefix of the input path, this call returns an error RESOURCE_ALREADY_EXISTS . If this operation fails, it may have succeeded in creating some of the necessary parent directories.
Example
Request:
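A minimal sketch of the request (the path value is a hypothetical example):
# The path value is illustrative only.
curl --netrc -X POST \
https://<databricks-instance>/api/2.0/workspace/mkdirs \
--data '{ "path": "/Users/me@example.com/MyNewFolder" }'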
Data structures
In this section:
ObjectInfo
ExportFormat
Language
ObjectType
ObjectInfo
Information about an object in the workspace. It is returned by list and get-status .
ExportFormat
FORMAT    DESCRIPTION
Language
The language of the notebook.
R    R notebook.
ObjectType
The type of the object in the workspace.
NOTEBOOK    Notebook
DIRECTORY    Directory
LIBRARY    Library
REPO    Repository
REST API 1.2
7/21/2022 • 9 minutes to read
The Databricks REST API allows you to programmatically access Azure Databricks instead of going through the
web UI.
This article covers REST API 1.2. The latest version of the REST API, as well as REST API 2.1 and 2.0, is also available.
IMPORTANT
Use the Clusters API 2.0 for managing clusters programmatically and the Libraries API 2.0 for managing libraries
programmatically.
The 1.2 Create an execution context and Run a command APIs continue to be supported.
IMPORTANT
To access Databricks REST APIs, you must authenticate.
API categories
Execution context: create unique variable namespaces where Spark commands can be called.
Command execution: run commands within a specific execution context.
Details
This REST API runs over HTTPS.
For retrieving information, use HTTP GET.
For modifying state, use HTTP POST.
For file upload, use multipart/form-data . Otherwise use application/json .
The response content type is JSON.
Basic authentication is used to authenticate the user for every API call.
User credentials are base64 encoded and are in the HTTP header for every API call. For example,
Authorization: Basic YWRtaW46YWRtaW4= . If you use curl , alternatively you can store user credentials in a
.netrc file.
For more information about using the Databricks REST API, see the Databricks REST API reference.
Get started
To try out the examples in this article, replace <databricks-instance> with the workspace URL of your Azure
Databricks deployment.
The following examples use curl and a .netrc file. You can adapt these curl examples with an HTTP library in
your programming language of choice.
API reference
Get the list of clusters
Get information about a cluster
Restart a cluster
Create an execution context
Get information about an execution context
Delete an execution context
Run a command
Get information about a command
Cancel a command
Get the list of libraries for a cluster
Upload a library to a cluster
Get the list of clusters
Method and path:
GET /api/1.2/clusters/list
Example
Request:
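A minimal sketch of the request, using the method and path shown above:
curl --netrc -X GET \
https://<databricks-instance>/api/1.2/clusters/list \
| jq .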
Response:
[
{
"id": "1234-567890-span123",
"name": "MyCluster",
"status": "Terminated",
"driverIp": "",
"jdbcPort": 10000,
"numWorkers":0
},
{
"..."
}
]
Request schema
None.
Response schema
An array of objects, with each object representing information about a cluster as follows:
FIELD
id
Type: string
name
Type: string
status
Type: string
* Error
* Pending
* Reconfiguring
* Restarting
* Running
* Terminated
* Terminating
* Unknown
driverIp
Type: string
jdbcPort
Type: number
numWorkers
Type: number
Get information about a cluster
Example
Request:
Response:
{
"id": "1234-567890-span123",
"name": "MyCluster",
"status": "Terminated",
"driverIp": "",
"jdbcPort": 10000,
"numWorkers": 0
}
Request schema
FIELD
clusterId
Type: string
Response schema
An object that represents information about the cluster.
FIELD
id
Type: string
name
Type: string
status
Type: string
* Error
* Pending
* Reconfiguring
* Restarting
* Running
* Terminated
* Terminating
* Unknown
driverIp
Type: string
jdbcPort
Type: number
numWorkers
Type: number
Restart a cluster
Method and path:
POST /api/1.2/clusters/restart
Example
Request:
Response:
{
"id": "1234-567890-span123"
}
Request schema
FIELD
clusterId
Type: string
Response schema
FIELD
id
Type: string
Create an execution context
Example
Request:
Response:
{
"id": "1234567890123456789"
}
Request schema
FIELD
clusterId
Type: string
clusterId
Type: string
* python
* scala
* sql
Response schema
FIELD
id
Type: string
Get information about an execution context
Example
Request:
Response:
{
"id": "1234567890123456789",
"status": "Running"
}
Request schema
FIELD
clusterId
Type: string
contextId
Type: string
Response schema
FIELD
id
Type: string
status
Type: string
* Error
* Pending
* Running
Delete an execution context
Example
Request:
Response:
{
"id": "1234567890123456789"
}
Request schema
FIELD
clusterId
Type: string
contextId
Type: string
Response schema
FIELD
id
Type: string
Run a command
Method and path:
POST /api/1.2/commands/execute
Example
Request:
execute-command.json :
{
"clusterId": "1234-567890-span123",
"contextId": "1234567890123456789",
"language": "python",
"command": "print('Hello, World!')"
}
Response:
{
"id": "1234ab56-7890-1cde-234f-5abcdef67890"
}
Request schema
FIELD
clusterId
Type: string
contextId
Type: string
language
Type: string
command
Type: string
commandFile
Type: string
options
Type: string
An optional map of values used downstream. For example, a displayRowLimit override (used in testing).
Response schema
FIELD
id
Type: string
Get information about a command
Example
Request:
Response:
{
"id": "1234ab56-7890-1cde-234f-5abcdef67890",
"status": "Finished",
"results": {
"resultType": "text",
"data": "Hello, World!"
}
}
Request schema
FIELD
clusterId
Type: string
contextId
Type: string
commandId
Type: string
Response schema
FIELD
id
Type: string
status
Type: string
* Cancelled
* Cancelling
* Error
* Finished
* Queued
* Running
results
Type: object
* error
* image
* images
* table
* text
For error :
For image :
For images :
For table :
* isJsonSchema : true if a JSON schema is returned instead of a string representation of the Hive type. Type: true /
false
For text :
Cancel a command
Method and path:
POST /api/1.2/commands/cancel
Example
Request:
Response:
{
"id": "1234ab56-7890-1cde-234f-5abcdef67890"
}
Request schema
FIELD
clusterId
Type: string
contextId
Type: string
The ID of the execution context that is associated with the command to cancel.
commandId
Type: string
Response schema
FIELD
id
Type: string
Get the list of libraries for a cluster
IMPORTANT
This operation is deprecated. Use the Cluster status operation in the Libraries API instead.
Example
Request:
Request schema
FIELD
clusterId
Type: string
Response schema
An array of objects, with each object representing information about a library as follows:
FIELD
name
Type: string
status
Type: string
* LibraryError
* LibraryLoaded
* LibraryPending
Upload a library to a cluster
IMPORTANT
This operation is deprecated. Use the Install operation in the Libraries API instead.
Request schema
FIELD
clusterId
Type: string
name
Type: string
language
Type: string
uri
Type: string
Response schema
Information about the uploaded library.
FIELD
language
Type: string
uri
Type: string
Additional examples
The following additional examples provide commands that you can use with curl or adapt with an HTTP
library in your programming language of choice.
Create an execution context
Run a command
Upload and run a Spark JAR
Create an execution context
Create an execution context on a specified cluster for a given programming language:
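The original command is not reproduced; a sketch (the /api/1.2/contexts/create path and JSON body shape are assumptions consistent with the API details above, and <cluster-id> is a placeholder) might be:
# Endpoint path and body shape are assumptions; <cluster-id> is a placeholder.
curl --netrc -X POST \
https://<databricks-instance>/api/1.2/contexts/create \
--header 'Content-type: application/json' \
--data '{ "language": "python", "clusterId": "<cluster-id>" }' \
| jq .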
Run a command
Known limitations: command execution does not support %run .
Run a command string:
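A sketch that runs a command string through the commands/execute endpoint shown earlier, reusing the execute-command.json example from the Run a command section:
curl --netrc -X POST \
https://<databricks-instance>/api/1.2/commands/execute \
--header 'Content-type: application/json' \
--data @execute-command.json \
| jq .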
Run a file:
{
"id": "1234567890123456789"
}
{
"id": "1234ab56-7890-1cde-234f-5abcdef67890"
}
3. Check on the status of your command. It may not return immediately if you are running a lengthy Spark
job.
{
"id": "1234ab56-7890-1cde-234f-5abcdef67890",
"results": {
"data": "Content Size Avg: 1234, Min: 1234, Max: 1234",
"resultType": "text"
},
"status": "Finished"
}
HashiCorp Terraform is a popular open source tool for creating safe and predictable cloud infrastructure across
several cloud providers. You can use the Databricks Terraform provider to manage your Azure Databricks
workspaces and the associated cloud infrastructure using a flexible, powerful tool. The goal of the Databricks
Terraform provider is to support all Databricks REST APIs, supporting automation of the most complicated
aspects of deploying and managing your data platforms. Databricks customers are using the Databricks
Terraform provider to deploy and manage clusters and jobs, provision Databricks workspaces, and configure
data access.
Getting started
Complete the following steps to install and configure the command line tools that Terraform needs to operate.
These tools include the Terraform CLI and the Azure CLI. After setting up these tools, complete the steps to
create a base Terraform configuration that you can use later to manage your Azure Databricks workspaces and
the associated Azure cloud infrastructure.
NOTE
This procedure assumes that you have access to a deployed Azure Databricks workspace as a Databricks admin, access to
the corresponding Azure subscription, and the appropriate permissions for the actions you want Terraform to perform in
that Azure subscription. For more information, see the following:
Manage users, groups, and service principals
Assign Azure roles using the Azure portal on the Azure website
1. Install the Terraform CLI. For details, see Download Terraform on the Terraform website.
2. Install the Azure CLI, and then use the Azure CLI to login to Azure by running the az login command.
For details, see Install the Azure CLI on the Microsoft Azure website and Azure Provider: Authenticating
using the Azure CLI on the Terraform website.
az login
TIP
To have Terraform run within the context of a different login, run the az login command again. You can switch
to have Terraform use an Azure subscription other than the one listed as "isDefault": true in the output of
running az login . To do this, run the command az account set --subscription="<subscription ID>" ,
replacing <subscription ID> with the value of the id property of the desired subscription in the output of
running az login .
This procedure uses the Azure CLI, along with the default subscription, to authenticate. For alternative
authentication options, see Authenticating to Azure on the Terraform website.
3. In your terminal, create an empty directory and then switch to it. (Each separate set of Terraform
configuration files must be in its own directory.) For example: mkdir terraform_demo && cd terraform_demo .
mkdir terraform_demo && cd terraform_demo
4. In this empty directory, create a file named main.tf . Add the following content to this file, and then save
the file.
TIP
If you use Visual Studio Code, the HashiCorp Terraform extension for Visual Studio Code adds editing features for
Terraform files such as syntax highlighting, IntelliSense, code navigation, code formatting, a module explorer, and
much more.
terraform {
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = ">= 2.26"
}
databricks = {
source = "databricks/databricks"
}
}
}
provider "azurerm" {
features {}
}
provider "databricks" {}
5. Initialize the working directory containing the main.tf file by running the terraform init command. For
more information, see Command: init on the Terraform website.
terraform init
Terraform downloads the azurerm and databricks providers and installs them in a hidden subdirectory
of your current working directory, named .terraform . The terraform init command prints out which
version of the providers were installed. Terraform also creates a lock file named .terraform.lock.hcl
which specifies the exact provider versions used, so that you can control when you want to update the
providers used for your project.
6. Apply the changes required to reach the desired state of the configuration by running the
terraform apply command. For more information, see Command: apply on the Terraform website.
terraform apply
Because no resources have yet been specified in the main.tf file, the output is
Apply complete! Resources: 0 added, 0 changed, 0 destroyed. Also, Terraform writes data into a file called
terraform.tfstate . To create resources, continue with Sample configuration, Next steps, or both to
specify the desired resources to create, and then run the terraform apply command again. Terraform
stores the IDs and properties of the resources it manages in this terraform.tfstate file, so that it can
update or destroy those resources going forward.
Sample configuration
Complete the following procedure to create a sample Terraform configuration that creates a notebook and a job
to run that notebook, in an existing Azure Databricks workspace.
1. In the main.tf file that you created in Getting started, change the databricks provider to reference an
existing Azure Databricks workspace:
provider "databricks" {
host = var.databricks_workspace_url
}
variable "databricks_workspace_url" {
description = "The URL to the Azure Databricks workspace (must start with https://)"
type = string
default = "<Azure Databricks workspace URL>"
}
variable "resource_prefix" {
description = "The prefix to use when naming the notebook and job"
type = string
default = "terraform-demo"
}
variable "email_notifier" {
description = "The email address to send job status to"
type = list(string)
default = ["<Your email address>"]
}
// Create a job to run the sample notebook. The job will create
// a cluster to run on. The cluster will use the smallest available
// node type and run the latest version of Spark.
// Get the smallest available node type to use for the cluster. Choose
// only from among available node types with local storage.
data "databricks_node_type" "smallest" {
local_disk = true
}
Next steps
1. Create an Azure Databricks workspace.
2. Manage workspace resources for an Azure Databricks workspace.
Troubleshooting
NOTE
For Terraform-specific support, see the Latest Terraform topics on the HashiCorp Discuss website. For issues specific to the
Databricks Terraform Provider, see Issues in the databrickslabs/terraform-provider-databricks GitHub repository.
2. Run the following Terraform command and then approve the changes when prompted:
For information about this command, see Command: state replace-provider in the Terraform
documentation.
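The command itself is omitted above; based on the referenced state replace-provider documentation, it is likely of the following form (treat this as an assumption and confirm against the Terraform documentation before running it):
# Assumed form of the provider-source migration command; verify before use.
terraform state replace-provider databrickslabs/databricks databricks/databricks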
3. Verify the changes by running the following Terraform command:
terraform init
Additional examples
Deploy an Azure Databricks workspace using Terraform
Manage a workspace end-to-end using Terraform
Create clusters
Control access to clusters: see Enable cluster access control for your workspace and Cluster access control
Control access to jobs: see Enable jobs access control for a workspace and Jobs access control
Control access to pools: see Enable instance pool access control for a workspace and Pool access control
Control access to personal access tokens
Control access to notebooks
Configure Databricks Repos
Control access to secrets
Configure usage log delivery
Control access to Databricks SQL tables
Implement CI/CD pipelines to deploy Databricks resources using the Databricks Terraform provider
Additional resources
Databricks Provider Documentation on the Terraform Registry website
Terraform Documentation on the Terraform website
Continuous integration and delivery on Azure
Databricks using Azure DevOps
7/21/2022 • 20 minutes to read
Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering
software in short, frequent cycles through the use of automation pipelines. While this is by no means a new
process, having been ubiquitous in traditional software engineering for decades, it is becoming an increasingly
necessary process for data engineering and data science teams. In order for data products to be valuable, they
must be delivered in a timely manner. Additionally, consumers must have confidence in the validity of outcomes
within these products. By automating the building, testing, and deployment of code, development teams are
able to deliver releases more frequently and reliably than the more manual processes that are still prevalent
across many data engineering and data science teams.
Continuous integration begins with the practice of having you commit your code with some frequency to a
branch within a source code repository. Each commit is then merged with the commits from other developers to
ensure that no conflicts were introduced. Changes are further validated by creating a build and running
automated tests against that build. This process ultimately results in an artifact, or deployment bundle, that will
eventually be deployed to a target environment, in this case an Azure Databricks workspace.
git add .
git commit -m "<commit-message>"
git push
If you prefer to develop in an IDE rather than in Azure Databricks notebooks, you can use the VCS integration
features built into modern IDEs or the git CLI to commit your code.
Azure Databricks provides Databricks Connect, an SDK that connects IDEs to Azure Databricks clusters. This is
especially useful when developing libraries, as it allows you to run and unit test your code on Azure Databricks
clusters without having to deploy that code. See Databricks Connect limitations to determine whether your use
case is supported.
NOTE
Databricks now recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Depending on your branching strategy and promotion process, the point at which a CI/CD pipeline will initiate a
build will vary. However, committed code from various contributors will eventually be merged into a designated
branch to be built and deployed. Branch management steps run outside of Azure Databricks, using the interfaces
provided by the version control system.
There are numerous CI/CD tools you can use to manage and execute your pipeline. This article illustrates how to
use the Azure DevOps automation server. CI/CD is a design pattern, so the steps and stages outlined in this
article should transfer with a few changes to the pipeline definition language in each tool. Furthermore, much of
the code in this example pipeline runs standard Python code, which you can invoke in other tools.
For information about using Jenkins with Azure Databricks, see Continuous integration and delivery on Azure
Databricks using Jenkins.
Define your build pipeline
Azure DevOps provides a cloud hosted interface for defining the stages of your CI/CD pipeline using YAML. You
define the build pipeline, which runs unit tests and builds a deployment artifact, in the Pipelines interface. Then,
to deploy the code to an Azure Databricks workspace, you specify this deployment artifact in a release pipeline.
In your Azure DevOps project, open the Pipelines menu and click Pipelines .
Click the New Pipeline button to open the Pipeline editor, where you define your build in the
azure-pipelines.yml file.
You can use the Git branch selector to customize the build process for each branch in your
Git repository.
The azure-pipelines.yml file is stored by default in the root directory of the git repository for the pipeline.
Environment variables referenced by the pipeline are configured using the Variables button.
For more information on Azure DevOps and build pipelines, see the Azure DevOps documentation.
Configure your build agent
To execute the pipeline, Azure DevOps provides cloud-hosted, on-demand execution agents that support
deployments to Kubernetes, VMs, Azure Functions, Azure Web Apps, and many more targets. In this example,
you use an on-demand agent to automate the deployment of code to the target Azure Databricks workspace.
Tools or packages required by the pipeline must be defined in the pipeline script and installed on the agent at
execution time.
This example requires the following dependencies:
Conda - Conda is an open source environment management system
Python v3.7.3 - Python will be used to run tests, build a deployment wheel, and execute deployment scripts.
The version of Python is important as tests require that the version of Python running on the agent should
match that of the Azure Databricks cluster. This example uses Databricks Runtime 6.4, which includes Python
3.7.
Python libraries: requests , databricks-connect , databricks-cli , pytest
Here is an example pipeline ( azure-pipelines.yml ). The complete script follows. This article steps through each
section of the script.
trigger:
- release
pool:
name: Hosted Ubuntu 1604
steps:
- task: UsePythonVersion@0
displayName: 'Use Python 3.7'
inputs:
versionSpec: 3.7
- script: |
pip install pytest requests setuptools wheel
pip install -U databricks-connect==6.4.*
displayName: 'Load Python Dependencies'
- script: |
echo "y
$(WORKSPACE-REGION-URL)
$(CSE-DEVELOP-PAT)
$(EXISTING-CLUSTER-ID)
$(WORKSPACE-ORG-ID)
15001" | databricks-connect configure
displayName: 'Configure DBConnect'
- checkout: self
persistCredentials: true
clean: true
- script: |
    python -m pytest --junit-xml=$(Build.Repository.LocalPath)/logs/TEST-LOCAL.xml $(Build.Repository.LocalPath)/libraries/python/dbxdemo/test*.py || true
- task: PublishTestResults@2
inputs:
testResultsFiles: '**/TEST-*.xml'
failTaskOnFailedTests: true
publishRunAttachments: true
- script: |
cd $(Build.Repository.LocalPath)/libraries/python/dbxdemo
python3 setup.py sdist bdist_wheel
ls dist/
displayName: 'Build Python Wheel for Libs'
- script: |
    git diff --name-only --diff-filter=AMR HEAD^1 HEAD | xargs -I '{}' cp --parents -r '{}' $(Build.BinariesDirectory)
    mkdir -p $(Build.BinariesDirectory)/libraries/python/libs
    cp $(Build.Repository.LocalPath)/libraries/python/dbxdemo/dist/*.* $(Build.BinariesDirectory)/libraries/python/libs
    mkdir -p $(Build.BinariesDirectory)/cicd-scripts
    cp $(Build.Repository.LocalPath)/cicd-scripts/*.* $(Build.BinariesDirectory)/cicd-scripts
- task: ArchiveFiles@2
inputs:
rootFolderOrFile: '$(Build.BinariesDirectory)'
includeRootFolder: false
archiveType: 'zip'
archiveFile: '$(Build.ArtifactStagingDirectory)/$(Build.BuildId).zip'
replaceExistingArchive: true
- task: PublishBuildArtifacts@1
inputs:
ArtifactName: 'DatabricksBuild'
# Install Python. The version must match the version on the Databricks cluster.
steps:
- task: UsePythonVersion@0
displayName: 'Use Python 3.7'
inputs:
versionSpec: 3.7
# Install required Python modules, including databricks-connect, required to execute a unit test
# on a cluster.
- script: |
pip install pytest requests setuptools wheel
pip install -U databricks-connect==6.4.*
displayName: 'Load Python Dependencies'
# Use environment variables to pass Databricks login information to the Databricks Connect
# configuration function
- script: |
echo "y
$(WORKSPACE-REGION-URL)
$(CSE-DEVELOP-PAT)
$(EXISTING-CLUSTER-ID)
$(WORKSPACE-ORG-ID)
15001" | databricks-connect configure
displayName: 'Configure DBConnect'
- script: |
    python -m pytest --junit-xml=$(Build.Repository.LocalPath)/logs/TEST-LOCAL.xml $(Build.Repository.LocalPath)/libraries/python/dbxdemo/test*.py || true
    ls logs
displayName: 'Run Python Unit Tests for library code'
The following snippet ( addcol.py ) is a library function that might be installed on an Azure Databricks cluster.
This simple function adds a new column, populated by a literal, to an Apache Spark DataFrame.
# addcol.py
import pyspark.sql.functions as F

def with_status(df):
    return df.withColumn("status", F.lit("checked"))
The following test, test-addcol.py , passes a mock DataFrame object to the with_status function, defined in
addcol.py. The result is then compared to a DataFrame object containing the expected values. If the values
match, the test passes.
# test-addcol.py
import pytest
# The test assumes that get_spark() and with_status() are importable from the
# dbxdemo package built by this pipeline; the exact import lines are not shown
# in this excerpt.

class TestAppendCol(object):

    def test_with_status(self):
        source_data = [
            ("pete", "pan", "peter.pan@databricks.com"),
            ("jason", "argonaut", "jason.argonaut@databricks.com")
        ]
        source_df = get_spark().createDataFrame(
            source_data,
            ["first_name", "last_name", "email"]
        )

        actual_df = with_status(source_df)

        expected_data = [
            ("pete", "pan", "peter.pan@databricks.com", "checked"),
            ("jason", "argonaut", "jason.argonaut@databricks.com", "checked")
        ]
        expected_df = get_spark().createDataFrame(
            expected_data,
            ["first_name", "last_name", "email", "status"]
        )

        assert(expected_df.collect() == actual_df.collect())
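The test relies on a get_spark() helper that returns a SparkSession but whose definition is not shown in this
article. A minimal sketch of such a helper, assuming it lives somewhere in the dbxdemo package and that a Spark
session is available through Databricks Connect on the agent, might look like this:

# spark.py (hypothetical helper assumed by the test above)
from pyspark.sql import SparkSession

def get_spark():
    # Returns the active SparkSession, creating one if necessary.
    return SparkSession.builder.getOrCreate()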
- script: |
    cd $(Build.Repository.LocalPath)/libraries/python/dbxdemo
    python3 setup.py sdist bdist_wheel
    ls dist/
  displayName: 'Build Python Wheel for Libs'

- task: PublishTestResults@2
  inputs:
    testResultsFiles: '**/TEST-*.xml'
    failTaskOnFailedTests: true
    publishRunAttachments: true

- script: |
    # Add the wheel file you just created along with utility scripts used by the Release pipeline
    # The implementation in your Pipeline may be different.
    # The objective is to add all files intended for the current release.
    mkdir -p $(Build.BinariesDirectory)/libraries/python/libs
    cp $(Build.Repository.LocalPath)/libraries/python/dbxdemo/dist/*.* $(Build.BinariesDirectory)/libraries/python/libs
    mkdir -p $(Build.BinariesDirectory)/cicd-scripts
    cp $(Build.Repository.LocalPath)/cicd-scripts/*.* $(Build.BinariesDirectory)/cicd-scripts
  displayName: 'Get Changes'

- task: PublishBuildArtifacts@1
  inputs:
    ArtifactName: 'DatabricksBuild'
3. In the Artifacts box on the left side of the screen, click Add and select the build pipeline created earlier.
You can configure how the pipeline is triggered by clicking the trigger icon, which displays triggering options on
the right side of the screen. If you want a release to be initiated automatically based on build artifact availability
or after a pull request workflow, enable the appropriate trigger.
To add steps or tasks for the deployment, click the link within the stage object.
Add tasks
To add tasks, click the plus sign in the Agent job section. A searchable list of available tasks appears. There is
also a Marketplace for third-party plug-ins that can be used to supplement the standard Azure DevOps tasks.
NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.
# installWhlLibrary.py
#!/usr/bin/python3
import json
import requests
import sys
import getopt
import time
import os

def main():
    shard = ''
    token = ''
    clusterid = ''
    libspath = ''
    dbfspath = ''

    try:
        # Options that take a value need a trailing colon in the short-option string.
        opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:d:',
                                   ['shard=', 'token=', 'clusterid=', 'libs=', 'dbfspath='])
    except getopt.GetoptError:
        print(
            'installWhlLibrary.py -s <shard> -t <token> -c <clusterid> -l <libs> -d <dbfspath>')
        sys.exit(2)

    # (Argument parsing, library upload, uninstall, and install steps are omitted
    # from this excerpt.)

    # Wait for the cluster to reach a stable state before continuing.
    p = 0
    waiting = True
    while waiting:
        time.sleep(30)
        clusterresp = requests.get(shard + '/api/2.0/clusters/get?cluster_id=' + clusterid,
                                   auth=("token", token))
        clusterjson = clusterresp.text
        jsonout = json.loads(clusterjson)
        current_state = jsonout['state']
        print(clusterid + " state:" + current_state)
        if current_state in ['TERMINATED', 'RUNNING', 'INTERNAL_ERROR', 'SKIPPED'] or p >= 10:
            break
        p = p + 1

if __name__ == '__main__':
    main()
The executenotebook.py script runs the notebook using the jobs runs submit endpoint, which submits an
anonymous job. Because this endpoint is asynchronous, the script uses the job ID initially returned by the REST
call to poll for the status of the job. After the job completes, the JSON output is saved to the path specified by
the function arguments passed at invocation.
# executenotebook.py
#!/usr/bin/python3
import json
import requests
import os
import sys
import getopt
import time

def main():
    shard = ''
    token = ''
    clusterid = ''
    localpath = ''
    workspacepath = ''
    outfilepath = ''

    try:
        # Options that take a value need a trailing colon in the short-option string.
        opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:w:o:',
                                   ['shard=', 'token=', 'clusterid=', 'localpath=', 'workspacepath=',
                                    'outfilepath='])
    except getopt.GetoptError:
        print(
            'executenotebook.py -s <shard> -t <token> -c <clusterid> -l <localpath> -w <workspacepath> -o <outfilepath>')
        sys.exit(2)

    # (Argument parsing is omitted from this excerpt.)

    notebooks = []
    for path, subdirs, files in os.walk(localpath):
        for name in files:
            fullpath = path + '/' + name
            # removes localpath to repo but keeps workspace path
            fullworkspacepath = workspacepath + path.replace(localpath, '')

            # (The jobs/runs/submit call that defines values and returns the
            # initial runid is omitted from this excerpt.)

            i = 0
            waiting = True
            while waiting:
                time.sleep(10)
                jobresp = requests.get(shard + '/api/2.0/jobs/runs/get?run_id=' + str(runid),
                                       data=json.dumps(values), auth=("token", token))
                jobjson = jobresp.text
                print("jobjson:" + jobjson)
                j = json.loads(jobjson)
                current_state = j['state']['life_cycle_state']
                runid = j['run_id']
                if current_state in ['TERMINATED', 'INTERNAL_ERROR', 'SKIPPED'] or i >= 12:
                    break
                i = i + 1

            if outfilepath != '':
                file = open(outfilepath + '/' + str(runid) + '.json', 'w')
                file.write(json.dumps(j))
                file.close()

if __name__ == '__main__':
    main()
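The REST call that actually submits the notebook run is not shown in the excerpt above. A minimal, hypothetical
sketch of that step, with placeholder values standing in for the script's shard, token, clusterid, and notebook path
variables, might look like the following:

# Hypothetical sketch of the elided submission step: create a one-time
# (anonymous) run of the notebook and capture its run ID for polling.
import json
import requests

shard = 'https://<databricks-instance>'
token = '<token>'
clusterid = '<cluster-id>'
notebook_path = '/Shared/<path>/<notebook-name>'

values = {
    'run_name': 'CI notebook run',
    'existing_cluster_id': clusterid,
    'timeout_seconds': 3600,
    'notebook_task': {'notebook_path': notebook_path},
}

resp = requests.post(shard + '/api/2.0/jobs/runs/submit',
                     data=json.dumps(values), auth=("token", token))
runid = resp.json()['run_id']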
# evaluatenotebookruns.py
import unittest
import json
import glob
import os

class TestJobOutput(unittest.TestCase):

    test_output_path = '#ENV#'

    def test_performance(self):
        path = self.test_output_path
        statuses = []

        # (Reading each run-output JSON file and extracting its status is
        # omitted from this excerpt.)
        statuses.append(status)

        self.assertFalse('FAILED' in statuses)

    def test_job_run(self):
        path = self.test_output_path
        statuses = []

        # (Reading each run-output JSON file and extracting its status is
        # omitted from this excerpt.)

        self.assertFalse('FAILED' in statuses)

if __name__ == '__main__':
    unittest.main()
IMPORTANT
This feature is in Public Preview.
Below is a list of GitHub Actions developed for Azure Databricks that you can use in your CI/CD workflows on
GitHub.
A service principal is an identity created for use with automated tools and applications, including CI/CD
platforms such as GitHub Actions, Airflow in data pipelines, and Jenkins.
As a security best practice, Databricks recommends using an Azure AD service principal and its Azure AD token
instead of your Azure Databricks user or your Azure Databricks personal access token for your workspace user
to give CI/CD platforms access to Azure Databricks resources. Some benefits to this approach include the
following:
You can grant and restrict access to Azure Databricks resources for an Azure AD service principal
independently of a user. For instance, this allows you to prohibit an Azure AD service principal from acting as
an admin in your Azure Databricks workspace while still allowing other specific users in your workspace to
continue to act as admins.
Users can safeguard their access tokens from being accessed by CI/CD platforms.
You can temporarily disable or permanently delete an Azure AD service principal without impacting other
users. For instance, this allows you to pause or remove access from an Azure AD service principal that you
suspect is being used in a malicious way.
If a user leaves your organization, you can remove that user without impacting any Azure AD service
principal.
To give a CI/CD platform access to your Azure Databricks workspace, you provide the CI/CD platform with
information about your Azure AD service principal. For instance, this may involve generating an Azure AD token
for the Azure AD service principal and then giving this Azure AD token to the CI/CD platform.
To give your Azure Databricks workspace access to a Git provider (for example, when you use Azure Databricks
Git integration with Databricks Repos), you must add your Git provider credentials to your workspace,
depending on the Git provider’s requirements.
To complete these two access connections, you do the following:
1. Create an Azure AD service principal.
2. Create an Azure AD token for an Azure AD service principal.
3. Add the Azure AD service principal to your Azure Databricks workspace.
4. Add your Git provider credentials to your workspace with your Azure AD token and the Git Credentials API
2.0.
The first three steps are covered in Service principals for Azure Databricks automation.
To complete the last step, you can use tools such as curl and Postman. You cannot use the Azure Databricks
user interface.
This article describes how to:
1. Provide information to the CI/CD platform, for example the Azure AD token for the Azure AD service
principal, depending on the CI/CD platform’s requirements.
2. If you use Azure Databricks Git integration with Databricks Repos, add your Git provider credentials to your
Azure Databricks workspace.
Requirements
The Azure AD token for an Azure AD service principal. To create an Azure AD service principal and its Azure
AD token, see Service principals for Azure Databricks automation.
A tool to call the Azure Databricks APIs, such as curl or Postman.
If you use a Git provider, an account with your Git provider.
For more information about which GitHub encrypted secrets are required for a GitHub Action, see Service
principals for Azure Databricks automation and the documentation for that GitHub Action.
To add these GitHub encrypted secrets to your GitHub repository, see Creating encrypted secrets for a
repository in the GitHub documentation. For other approaches to add these GitHub repository secrets, see
Encrypted secrets in the GitHub documentation.
Add the GitHub personal access token for a GitHub machine user to your Azure Databricks workspace
This section describes how to enable your Azure Databricks workspace to access GitHub (for example, when you
use Azure Databricks Git integration with Databricks Repos).
As a security best practice, Databricks recommends that you use GitHub machine users instead of GitHub
personal accounts, for many of the same reasons that you should use an Azure AD service principal instead of a
Azure Databricks user. To add the GitHub personal access token for a GitHub machine user to your Azure
Databricks workspace, do the following:
1. Create a GitHub machine user, if you do not already have one available. A GitHub machine user is a
GitHub personal account, separate from your own GitHub personal account, that you can use to automate
activity on GitHub. Create a new separate GitHub account to use as a GitHub machine user, if you do not
already have one available.
NOTE
When you create a new separate GitHub account as a GitHub machine user, you cannot associate it with the email
address for your own GitHub personal account. Instead, see your organization’s email administrator about getting
a separate email address that you can associate with this new separate GitHub account as a GitHub machine user.
See your organization’s account administrator about managing the separate email address and its associated
GitHub machine user and its GitHub personal access tokens within your organization.
2. Give the GitHub machine user access to your GitHub repository. See Inviting a team or person in the
GitHub documentation. To accept the invitation, you may first need to sign out of your GitHub personal
account, and then sign back in as the GitHub machine user.
3. Sign in to GitHub as the machine user, and then create a GitHub personal access token for that machine
user. See Create a personal access token in the GitHub documentation. Be sure to give the GitHub
personal access token repo access.
4. Use a tool such as curl or Postman to call the “create a Git credential entry” ( POST /git-credentials )
operation in the Git Credentials API 2.0. (You cannot use the Azure Databricks user interface for this.) In
the following instructions, replace:
<service-principal-access-token> with the Azure AD token for your Azure AD service principal.
(Do not use the Azure Databricks personal access token for your workspace user.)
TIP
To confirm that you are using the correct token, you can first use the Azure AD token for your Azure AD
service principal to call the SCIM API 2.0 (Me) API, and review the output of the call.
<machine-user-access-token> with the GitHub personal access token for the GitHub machine user.
<machine-user-name> with the GitHub username of the GitHub machine user.
Curl
Run the following command. Make sure the set-git-credentials.json file is in the same directory where
you run this command. This command uses the environment variable DATABRICKS_HOST , representing
your Azure Databricks per-workspace URL, for example
https://adb-1234567890123456.7.azuredatabricks.net .
curl -X POST \
${DATABRICKS_HOST}/api/2.0/git-credentials \
--header 'Authorization: Bearer <service-principal-access-token>' \
--data @set-git-credentials.json \
| jq .
set-git-credentials.json :
{
"personal_access_token": "<machine-user-access-token>",
"git_username": "<machine-user-name>",
"git_provider": "gitHub"
}
Postman
a. Create a new HTTP request (File > New > HTTP Request ).
b. In the HTTP verb drop-down list, select POST .
c. For Enter request URL , enter https://<databricks-instance-name>/api/2.0/git-credentials , where
<databricks-instance-name> is your Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
f. On the Headers tab, add the Key and Value pair of Content-Type and application/scim+json
{
"personal_access_token": "<machine-user-access-token>",
"git_username": "<machine-user-name>",
"git_provider": "gitHub"
}
i. Click Send .
TIP
To confirm that the call was successful, you can use the Azure AD token for your Azure AD service principal to call the “get
Git credentials” ( GET /git-credentials ) operation in the Git Credentials API 2.0, and review the output of the call.
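If you prefer to script this step rather than use curl or Postman, the following is a minimal Python sketch of the
same "create a Git credential entry" call. It assumes the requests library, the DATABRICKS_HOST environment
variable described earlier, and the same placeholder token values; it is an illustration, not an official client.

# Hypothetical sketch: create a Git credential entry with the Git Credentials API 2.0.
import os
import requests

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/git-credentials",
    headers={"Authorization": "Bearer <service-principal-access-token>"},
    json={
        "personal_access_token": "<machine-user-access-token>",
        "git_username": "<machine-user-name>",
        "git_provider": "gitHub",
    },
)
resp.raise_for_status()
print(resp.json())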
Continuous integration and delivery on Azure
Databricks using Jenkins
7/21/2022 • 19 minutes to read
Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering
software in short, frequent cycles through the use of automation pipelines. While this is by no means a new
process, having been ubiquitous in traditional software engineering for decades, it is becoming an increasingly
necessary process for data engineering and data science teams. In order for data products to be valuable, they
must be delivered in a timely manner. Additionally, consumers must have confidence in the validity of outcomes
within these products. By automating the building, testing, and deployment of code, development teams are
able to deliver releases more frequently and reliably than the more manual processes that are still prevalent
across many data engineering and data science teams.
Continuous integration begins with the practice of committing your code frequently to a branch within a source
code repository. Each commit is then merged with the commits from other developers to
ensure that no conflicts were introduced. Changes are further validated by creating a build and running
automated tests against that build. This process ultimately results in an artifact, or deployment bundle, that will
eventually be deployed to a target environment, in this case an Azure Databricks workspace.
git add .
git commit -m "<commit-message>"
git push
If you prefer to develop in an IDE rather than in Azure Databricks notebooks, you can use the VCS integration
features built into modern IDEs or the git CLI to commit your code.
Azure Databricks provides Databricks Connect, an SDK that connects IDEs to Azure Databricks clusters. This is
especially useful when developing libraries, as it allows you to run and unit test your code on Azure Databricks
clusters without having to deploy that code. See Databricks Connect limitations to determine whether your use
case is supported.
NOTE
Databricks now recommends that you use dbx by Databricks Labs for local development instead of Databricks Connect.
Depending on your branching strategy and promotion process, the point at which a CI/CD pipeline will initiate a
build will vary. However, committed code from various contributors will eventually be merged into a designated
branch to be built and deployed. Branch management steps run outside of Azure Databricks, using the interfaces
provided by the version control system.
There are numerous CI/CD tools you can use to manage and execute your pipeline. This article illustrates how to
use the Jenkins automation server. CI/CD is a design pattern, so the steps and stages outlined in this article
should transfer with a few changes to the pipeline definition language in each tool. Furthermore, much of the
code in this example pipeline runs standard Python code, which you can invoke in other tools.
For information about using Azure DevOps with Azure Databricks, see Continuous integration and delivery on
Azure Databricks using Azure DevOps.
Configure your agent
Jenkins uses a master service for coordination and one or more execution agents. In this example you use the
default permanent agent node included with the Jenkins server. You must manually install the following tools
and packages required by the pipeline on the agent, in this case the Jenkins server:
Conda: an open source Python environment management system.
Python 3.7.3: used to run tests, build a deployment wheel, and execute deployment scripts. The version of
Python matters because the tests require that the Python version on the agent match that of the Azure
Databricks cluster. This example uses Databricks Runtime 6.4, which includes Python 3.7.
Python libraries: requests , databricks-connect , databricks-cli , and pytest .
You write a Pipeline definition in a text file (called a Jenkinsfile) which in turn is checked into a project’s source
control repository. For more information, see Jenkins Pipeline. Here is an example Pipeline:
// Jenkinsfile
node {
  def GITREPO         = "/var/lib/jenkins/workspace/${env.JOB_NAME}"
  def GITREPOREMOTE   = "https://github.com/<repo>"
  def GITHUBCREDID    = "<github-token>"
  def CURRENTRELEASE  = "<release>"
  def DBTOKEN         = "<databricks-token>"
  def DBURL           = "https://<databricks-instance>"
  def SCRIPTPATH      = "${GITREPO}/Automation/Deployments"
  def NOTEBOOKPATH    = "${GITREPO}/Workspace"
  def LIBRARYPATH     = "${GITREPO}/Libraries"
  def BUILDPATH       = "${GITREPO}/Builds/${env.JOB_NAME}-${env.BUILD_NUMBER}"
  def OUTFILEPATH     = "${BUILDPATH}/Validation/Output"
  def TESTRESULTPATH  = "${BUILDPATH}/Validation/reports/junit"
  def WORKSPACEPATH   = "/Shared/<path>"
  def DBFSPATH        = "dbfs:<dbfs-path>"
  def CLUSTERID       = "<cluster-id>"
  def CONDAPATH       = "<conda-path>"
  def CONDAENV        = "<conda-env>"

  stage('Setup') {
    withCredentials([string(credentialsId: DBTOKEN, variable: 'TOKEN')]) {
      sh """#!/bin/bash
            # Configure Conda environment for deployment & testing
            source ${CONDAPATH}/bin/activate ${CONDAENV}
         """
    }
  }

  stage('Package') {
    sh """#!/bin/bash
          # Generate artifact
          tar -czvf Builds/latest_build.tar.gz ${BUILDPATH}
       """
    archiveArtifacts artifacts: 'Builds/latest_build.tar.gz'
  }

  stage('Deploy') {
    sh """#!/bin/bash
          # Enable Conda environment for tests
          source ${CONDAPATH}/bin/activate ${CONDAENV}
       """
    // (The remaining deployment, test, and report steps are elided in this excerpt.)
  }
}
NOTE
This article mentions the use of Azure Databricks personal access tokens, Azure Active Directory (Azure AD) access tokens,
or both for authentication. As a security best practice, when authenticating with automated tools, systems, scripts, and
apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more
information, see Service principals for Azure Databricks automation.
stage('Setup') {
  withCredentials([string(credentialsId: DBTOKEN, variable: 'TOKEN')]) {
    sh """#!/bin/bash
          # Configure Conda environment for deployment & testing
          source ${CONDAPATH}/bin/activate ${CONDAENV}
The following snippet is a library function that might be installed on an Azure Databricks cluster. It is a simple
function that adds a new column, populated by a literal, to an Apache Spark DataFrame.
# addcol.py
import pyspark.sql.functions as F

def with_status(df):
    return df.withColumn("status", F.lit("checked"))
This test passes a mock DataFrame object to the with_status function, defined in addcol.py . The result is then
compared to a DataFrame object containing the expected values. If the values match, which in this case they will,
the test passes.
# test-addcol.py
import pytest
# The test assumes that get_spark() and with_status() are importable from the
# library package built by this pipeline; the exact import lines are not shown
# in this excerpt.

class TestAppendCol(object):

    def test_with_status(self):
        source_data = [
            ("paula", "white", "paula.white@example.com"),
            ("john", "baer", "john.baer@example.com")
        ]
        source_df = get_spark().createDataFrame(
            source_data,
            ["first_name", "last_name", "email"]
        )

        actual_df = with_status(source_df)

        expected_data = [
            ("paula", "white", "paula.white@example.com", "checked"),
            ("john", "baer", "john.baer@example.com", "checked")
        ]
        expected_df = get_spark().createDataFrame(
            expected_data,
            ["first_name", "last_name", "email", "status"]
        )

        assert(expected_df.collect() == actual_df.collect())
stage('Package') {
  sh """#!/bin/bash
        # Enable Conda environment for tests
        source ${CONDAPATH}/bin/activate ${CONDAENV}

        # Generate artifact
        tar -czvf Builds/latest_build.tar.gz ${BUILDPATH}
     """
  archiveArtifacts artifacts: 'Builds/latest_build.tar.gz'
}
Deploy artifacts
In the Deploy stage you use the Databricks CLI, which, like the Databricks Connect module used earlier, is
installed in your Conda environment, so you must activate it for this shell session. You use the Workspace CLI
and DBFS CLI to upload the notebooks and libraries, respectively:
stage('Deploy') {
  sh """#!/bin/bash
        # Enable Conda environment for tests
        source ${CONDAPATH}/bin/activate ${CONDAENV}
Installing a new version of a library on an Azure Databricks cluster requires that you first uninstall the existing
library. To do this, you invoke the Databricks REST API in a Python script to perform the following steps:
1. Check if the library is installed.
2. Uninstall the library.
3. Restart the cluster if any uninstalls were performed.
a. Wait until the cluster is running again before proceeding.
4. Install the library.
# installWhlLibrary.py
#!/usr/bin/python3
import json
import requests
import sys
import getopt
import time

def main():
    workspace = ''
    token = ''
    clusterid = ''
    libs = ''
    dbfspath = ''

    try:
        # Options that take a value need a trailing colon in the short-option string.
        opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:d:',
                                   ['workspace=', 'token=', 'clusterid=', 'libs=', 'dbfspath='])
    except getopt.GetoptError:
        print(
            'installWhlLibrary.py -s <workspace> -t <token> -c <clusterid> -l <libs> -d <dbfspath>')
        sys.exit(2)

    # (Argument parsing, library upload to DBFS, and the uninstall/restart steps
    # are omitted from this excerpt.)

    libslist = libs.split()

    # Wait for the cluster to return to a stable state before installing.
    p = 0
    waiting = True
    while waiting:
        time.sleep(30)
        clusterresp = requests.get(workspace + '/api/2.0/clusters/get?cluster_id=' + clusterid,
                                   auth=("token", token))
        clusterjson = clusterresp.text
        jsonout = json.loads(clusterjson)
        current_state = jsonout['state']
        print(clusterid + " state:" + current_state)
        if current_state in ['RUNNING', 'INTERNAL_ERROR', 'SKIPPED'] or p >= 10:
            break
        p = p + 1

    # Install Libraries
    for lib in libslist:
        dbfslib = dbfspath + lib
        print("Installing " + dbfslib)
        values = {'cluster_id': clusterid, 'libraries': [{'whl': dbfslib}]}
        # (The POST to the /api/2.0/libraries/install endpoint is omitted from
        # this excerpt.)

if __name__ == '__main__':
    main()
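The uninstall, restart, and install REST calls that installWhlLibrary.py performs are omitted from the excerpt
above. The following is a minimal, hypothetical sketch of those calls using the Libraries and Clusters APIs; the
placeholder values stand in for the script's workspace, token, clusterid, and dbfslib variables.

# Hypothetical sketch of the elided library-management calls.
import json
import requests

workspace = 'https://<databricks-instance>'
token = '<token>'
clusterid = '<cluster-id>'
dbfslib = 'dbfs:/<dbfs-path>/<lib>.whl'

# 1. Check whether the library is already installed on the cluster.
status = requests.get(workspace + '/api/2.0/libraries/cluster-status?cluster_id=' + clusterid,
                      auth=("token", token)).json()

# 2. Uninstall the existing version, if present.
requests.post(workspace + '/api/2.0/libraries/uninstall',
              data=json.dumps({'cluster_id': clusterid, 'libraries': [{'whl': dbfslib}]}),
              auth=("token", token))

# 3. Restart the cluster so the uninstall takes effect, then wait for it to
#    return to the RUNNING state (the polling loop shown in the script above).
requests.post(workspace + '/api/2.0/clusters/restart',
              data=json.dumps({'cluster_id': clusterid}), auth=("token", token))

# 4. Install the new wheel from DBFS.
requests.post(workspace + '/api/2.0/libraries/install',
              data=json.dumps({'cluster_id': clusterid, 'libraries': [{'whl': dbfslib}]}),
              auth=("token", token))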
This stage calls two Python automation scripts. The first script, executenotebook.py , runs the notebook using the
Create and trigger a one-time run ( POST /jobs/runs/submit ) endpoint which submits an anonymous job. Since
this endpoint is asynchronous, it uses the job ID initially returned by the REST call to poll for the status of the
job. Once the job has completed, the JSON output is saved to the path specified by the function arguments
passed at invocation.
# executenotebook.py
#!/usr/bin/python3
import json
import requests
import os
import sys
import getopt
import time

def main():
    workspace = ''
    token = ''
    clusterid = ''
    localpath = ''
    workspacepath = ''
    outfilepath = ''

    try:
        # Options that take a value need a trailing colon in the short-option string.
        opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:w:o:',
                                   ['workspace=', 'token=', 'clusterid=', 'localpath=', 'workspacepath=',
                                    'outfilepath='])
    except getopt.GetoptError:
        print(
            'executenotebook.py -s <workspace> -t <token> -c <clusterid> -l <localpath> -w <workspacepath> -o <outfilepath>')
        sys.exit(2)

    # (Argument parsing is omitted from this excerpt.)

    notebooks = []
    for path, subdirs, files in os.walk(localpath):
        for name in files:
            fullpath = path + '/' + name
            # removes localpath to repo but keeps workspace path
            fullworkspacepath = workspacepath + path.replace(localpath, '')

            # (The jobs/runs/submit call that defines values and returns the
            # initial runid is omitted from this excerpt.)

            i = 0
            waiting = True
            while waiting:
                time.sleep(10)
                jobresp = requests.get(workspace + '/api/2.0/jobs/runs/get?run_id=' + str(runid),
                                       data=json.dumps(values), auth=("token", token))
                jobjson = jobresp.text
                print("jobjson:" + jobjson)
                j = json.loads(jobjson)
                current_state = j['state']['life_cycle_state']
                runid = j['run_id']
                if current_state in ['TERMINATED', 'INTERNAL_ERROR', 'SKIPPED'] or i >= 12:
                    break
                i = i + 1

            if outfilepath != '':
                file = open(outfilepath + '/' + str(runid) + '.json', 'w')
                file.write(json.dumps(j))
                file.close()

if __name__ == '__main__':
    main()
The second script, evaluatenotebookruns.py , defines the test_job_run function, which parses and evaluates the
JSON to determine if the assert statements within the notebook passed or failed. An additional test,
test_performance , catches tests that run longer than expected.
# evaluatenotebookruns.py
import unittest
import json
import glob
import os

class TestJobOutput(unittest.TestCase):

    test_output_path = '#ENV#'

    def test_performance(self):
        path = self.test_output_path
        statuses = []

        # (Reading each run-output JSON file and extracting its status is
        # omitted from this excerpt.)
        statuses.append(status)

        self.assertFalse('FAILED' in statuses)

    def test_job_run(self):
        path = self.test_output_path
        statuses = []

        # (Reading each run-output JSON file and extracting its status is
        # omitted from this excerpt.)

        self.assertFalse('FAILED' in statuses)

if __name__ == '__main__':
    unittest.main()
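The bodies of the two tests above are abbreviated; each presumably builds its statuses list from the JSON
run-output files written by executenotebook.py. A minimal sketch of that step, under that assumption, might look
like this:

# Hypothetical sketch of how each test could populate `statuses` from the
# JSON output files saved by executenotebook.py.
import glob
import json
import os

def collect_statuses(path):
    statuses = []
    for filename in glob.glob(os.path.join(path, '*.json')):
        with open(filename) as f:
            data = json.load(f)
        # result_state is present in the run state once the run has finished.
        statuses.append(data['state'].get('result_state', ''))
    return statuses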
As seen earlier in the unit test stage, you use pytest to run the tests and generate the result summaries.
Publish test results
The JSON results are archived and the test results are published to Jenkins using the junit Jenkins plugin. This
enables you to visualize reports and dashboards related to the status of the build process.
IMPORTANT
The Databricks SQL CLI is provided as-is and is not officially supported by Databricks through customer technical support
channels. Support, questions, and feature requests can be communicated through the Issues page of the
databricks/databricks-sql-cli repo on GitHub. Issues with the use of this code will not be answered or investigated by
Databricks Support.
The Databricks SQL command line interface (Databricks SQL CLI) enables you to run SQL queries on your
existing Databricks SQL warehouses from your terminal or Windows Command Prompt instead of from
locations such as the Databricks SQL editor or an Azure Databricks notebook. From the command line, you get
productivity features such as suggestions and syntax highlighting.
Requirements
At least one Databricks SQL warehouse. View your available warehouses. Create a warehouse, if you do not
already have one.
Your warehouse’s connection details. Specifically, you need the Server hostname and HTTP path values.
An Azure Databricks personal access token. Create a personal access token, if you do not already have one.
Python 3.7 or higher. To check whether you have Python installed, run the command python --version from
your terminal or Command Prompt. (On some systems, you may need to enter python3 instead.) Install
Python, if you do not have it already installed.
pip, the package installer for Python. Newer versions of Python install pip by default. To check whether you
have pip installed, run the command pip --version from your terminal or Command Prompt. (On some
systems, you may need to enter pip3 instead.) Install pip, if you do not have it already installed.
The Databricks SQL CLI package from the Python Packaging Index (PyPI). You can use pip to install the
Databricks SQL CLI package from PyPI by running pip install databricks-sql-cli or
python -m pip install databricks-sql-cli .
(Optional) A utility for creating and managing Python virtual environments, such as venv, virtualenv, or
pipenv. Virtual environments help to ensure that you are using the correct versions of Python and the
Databricks SQL CLI together. Setting up and using virtual environments is outside of the scope of this article.
For more information, see Creating Virtual Environments.
Authentication
You must provide the Databricks SQL CLI with authentication details for your Databricks SQL warehouse, so that
the target warehouse is called with the proper access credentials. You can provide this information in several
ways:
In the dbsqlclirc settings file in its default location (or by specifying an alternate settings file through the
--clirc option each time you run a command with the Databricks SQL CLI). See Settings file.
By setting the DBSQLCLI_HOST_NAME , DBSQLCLI_HTTP_PATH and DBSQLCLI_ACCESS_TOKEN environment variables.
See Environment variables.
By specifying the --hostname , --http-path , and --access-token options each time you run a command with
the Databricks SQL CLI. See Command options.
Whenever you run the Databricks SQL CLI, it looks for authentication details in the following order, stopping
when it finds the first set of details:
1. The --hostname , --http-path , and --access-token options.
2. The DBSQLCLI_HOST_NAME , DBSQLCLI_HTTP_PATH and DBSQLCLI_ACCESS_TOKEN environment variables.
3. The dbsqlclirc settings file in its default location (or an alternate settings file specified by the --clirc
option).
Settings file
To use the dbsqlclirc settings file to provide the Databricks SQL CLI with authentication details for your
Databricks SQL warehouse, run the Databricks SQL CLI for the first time, as follows:
dbsqlcli
The Databricks SQL CLI creates a settings file for you, at ~/.dbsqlcli/dbsqlclirc on Unix, Linux, and macOS, and
at %HOMEDRIVE%%HOMEPATH%\.dbsqlcli\dbsqlclirc or %USERPROFILE%\.dbsqlcli\dbsqlclirc on Windows. To
customize this file:
1. Use a text editor to open and edit the dbsqlclirc file.
2. Scroll to the following section:
# [credentials]
# host_name = ""
# http_path = ""
# access_token = ""
3. Uncomment the [credentials] section, and set host_name , http_path , and access_token to your
warehouse’s Server hostname and HTTP path values and your personal access token from the requirements,
for example:
[credentials]
host_name = "adb-12345678901234567.8.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/1abc2d3456e7f890a"
access_token = "dapi1234567890b2cd34ef5a67bc8de90fa12b"
Alternatively, instead of using the dbsqlclirc file in its default location, you can specify a file in a different
location by adding the --clirc command option and the path to the alternate file. That alternate file’s contents
must conform to the preceding syntax.
Environment variables
To use the DBSQLCLI_HOST_NAME , DBSQLCLI_HTTP_PATH , and DBSQLCLI_ACCESS_TOKEN environment variables to
provide the Databricks SQL CLI with authentication details for your Databricks SQL warehouse, do the following:
Unix, Linux, and macOS
To set the environment variables for only the current terminal session, run the following commands. To set the
environment variables for all terminal sessions, enter the following commands into your shell’s startup file and
then restart your terminal. In the following commands, replace the value of:
DBSQLCLI_HOST_NAME with your warehouse’s Server hostname value from the requirements.
DBSQLCLI_HTTP_PATH with your warehouse’s HTTP path value from the requirements.
DBSQLCLI_ACCESS_TOKEN with your personal access token value from the requirements.
export DBSQLCLI_HOST_NAME="adb-12345678901234567.8.azuredatabricks.net"
export DBSQLCLI_HTTP_PATH="/sql/1.0/warehouses/1abc2d3456e7f890a"
export DBSQLCLI_ACCESS_TOKEN="dapi1234567890b2cd34ef5a67bc8de90fa12b"
Windows
To set the environment variables for only the current Command Prompt session, run the following commands,
replacing the value of:
DBSQLCLI_HOST_NAME with your warehouse’s Server hostname value from the requirements.
DBSQLCLI_HTTP_PATH with your warehouse’s HTTP path value from the requirements.
DBSQLCLI_ACCESS_TOKEN with your personal access token value from the requirements:
set DBSQLCLI_HOST_NAME="adb-12345678901234567.8.azuredatabricks.net"
set DBSQLCLI_HTTP_PATH="/sql/1.0/warehouses/1abc2d3456e7f890a"
set DBSQLCLI_ACCESS_TOKEN="dapi1234567890b2cd34ef5a67bc8de90fa12b"
To set the environment variables for all Command Prompt sessions, run the following commands and then
restart your Command Prompt, replacing the value of:
DBSQLCLI_HOST_NAME with your warehouse’s Server hostname value from the requirements.
DBSQLCLI_HTTP_PATH with your warehouse’s HTTP path value from the requirements.
DBSQLCLI_ACCESS_TOKEN with your personal access token value from the requirements.
Command options
To use the --hostname , --http-path , and --access-token options to provide the Databricks SQL CLI with
authentication details for your Databricks SQL warehouse, do the following:
Every time you run a command with the Databricks SQL CLI:
Specify the --hostname option and your warehouse’s Server hostname value from the requirements.
Specify the --http-path option and your warehouse’s HTTP path value from the requirements.
Specify the --access-token option and your personal access token value from the requirements.
For example:
Query sources
The Databricks SQL CLI enables you to run queries in the following ways:
From a query string.
From a file.
In a read-evaluate-print loop (REPL) approach. This approach provides suggestions as you type.
Query string
To run a query as a string, use the -e option followed by the query, represented as a string. For example:
Output:
_c0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
To switch output formats, use the --table-format option along with a value such as ascii for ASCII table
format, for example:
Output:
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| _c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
For a list of available output format values, see the comments for the table_format setting in the dbsqlclirc
file.
File
To run a file that contains SQL, use the -e option followed by the path to a .sql file. For example:
dbsqlcli -e my-query.sql
Output:
_c0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
To switch output formats, use the --table-format option along with a value such as ascii for ASCII table
format, for example:
dbsqlcli -e my-query.sql --table-format ascii
Output:
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| _c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
For a list of available output format values, see the comments for the table_format setting in the dbsqlclirc
file.
REPL
To enter read-evaluate-print loop (REPL) mode scoped to the default database, run the following command:
dbsqlcli
You can also enter REPL mode scoped to a specific database, by running the following command:
dbsqlcli <database-name>
For example:
dbsqlcli default
To leave REPL mode, enter exit .
In REPL mode, you can use the following characters and keys:
Use the semicolon ( ; ) to end a line.
Use F3 to toggle multiline mode.
Use the spacebar to show suggestions at the insertion point, if suggestions are not already displayed.
Use the up and down arrows to navigate suggestions.
Use the right arrow to complete the highlighted suggestion.
For example:
dbsqlcli default
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| _c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
| 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 2 | 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
+-----+-------+---------+-------+---------+-------+-------+-------+------+------+------+
2 rows in set
Time: 0.703s
hostname:default> exit
Additional resources
Databricks SQL CLI README
DataGrip integration with Azure Databricks
7/21/2022 • 5 minutes to read
DataGrip is an integrated development environment (IDE) for database developers that provides a query
console, schema navigation, explain plans, smart code completion, real-time analysis and quick fixes,
refactorings, version control integration, and other features.
This article describes how to use your local development machine to install, configure, and use DataGrip to work
with databases in Azure Databricks.
NOTE
This article was tested with macOS, Databricks JDBC Driver version 2.6.25, and DataGrip version 2021.1.1.
Requirements
Before you install DataGrip, your local development machine must meet the following requirements:
A Linux, macOS, or Windows operating system.
Download the Databricks JDBC Driver onto your local development machine, extracting the
DatabricksJDBC42.jar file from the downloaded DatabricksJDBC42-<version>.zip file.
An Azure Databricks cluster or SQL warehouse to connect DataGrip to.
jdbc:databricks://adb-
1234567890123456.7.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/proto
colv1/o/1234567890123456/1234-567890-reef123;AuthMech=3;UID=token;PWD=<personal-access-token>
IMPORTANT
If the JDBC URL starts with jdbc:spark: , you must change it to jdbc:databricks: or else you will get
a connection error later.
b. Replace <personal-access-token> with your personal access token for the Azure Databricks
workspace.
TIP
If you do not want to store your personal access token on your local development machine, omit
UID=token;PWD=<personal-access-token> from the JDBC URL and, in the Save list, choose Never . You
will be prompted for your User ( token ) and Password (your personal access token) each time you try
to connect.
jdbc:databricks://adb-
1234567890123456.7.azuredatabricks.net:443/default;transportMode=http;ssl=1;AuthMech=3;httpPat
h=/sql/1.0/warehouses/a123456bcde7f890;
IMPORTANT
If the JDBC URL starts with jdbc:spark: , you must change it to jdbc:databricks: or else you will get
a connection error later.
b. For User , enter token .
c. For Password , enter your personal access token.
TIP
If you do not want to store your personal access token on your local development machine, leave User
and Password blank and, in the Save list, select Never . You will be prompted for your User (the word
token ) and Password (your personal access token) each time you try to connect.
TIP
You should start your resource before testing your connection. Otherwise the test might take several minutes to
complete while the resource starts.
6. If the connection succeeds, on the Schemas tab, check the boxes for the schemas that you want to be
able to access, for example default .
7. Click OK .
Repeat the instructions in this step for each resource that you want DataGrip to access.
TIP
To change what happens when you click the Execute icon, select Customize in the drop-down list.
7. In the Database window, double-click the diamonds table to see its data. If the diamonds table is not
displayed, click the Refresh button in the window’s toolbar.
To delete the diamonds table:
1. In DataGrip, in the Database window’s toolbar, click the Jump to Query Console button.
2. Select console (Default) .
3. In the console tab, enter this SQL statement:
DROP TABLE diamonds;
Next steps
Learn more about the Query console in DataGrip.
Learn about the Data editor in DataGrip.
Learn more about the various tool windows in DataGrip.
Learn how to search in DataGrip.
Learn how to export data in DataGrip.
Learn how to find and replace text using regular expressions in DataGrip.
Additional resources
DataGrip documentation
DataGrip Support
DBeaver integration with Azure Databricks
7/21/2022 • 6 minutes to read
DBeaver is a local, multi-platform database tool for developers, database administrators, data analysts, data
engineers, and others who need to work with databases. DBeaver supports Azure Databricks as well as other
popular databases.
This article describes how to use your local development machine to install, configure, and use the free, open
source DBeaver Community Edition (CE) to work with databases in Azure Databricks.
NOTE
This article was tested with macOS, Databricks JDBC Driver version 2.6.25, and DBeaver CE version 22.1.0.
Requirements
Before you install DBeaver, your local development machine must meet the following requirements:
A Linux 64-bit, macOS, or Windows 64-bit operating system. (Linux 32-bit is supported but not
recommended.)
The Databricks JDBC Driver onto your local development machine, extracting the DatabricksJDBC42.jar file
from the downloaded DatabricksJDBC42-<version>.zip file.
You must also have an Azure Databricks cluster or SQL warehouse to connect DBeaver to.
IMPORTANT
If the JDBC URL starts with jdbc:spark: , you must change it to jdbc:databricks: or else you will get
a connection error later.
b. Replace <personal-access-token> with your personal access token for the Azure Databricks
workspace.
c. Check Save password locally .
TIP
If you do not want to store your personal access token on your local development machine, omit
UID=token;PWD=<personal-access-token> from the JDBC URL and uncheck Save password locally . You will
be prompted for your Username ( token ) and Password (your personal access token) each time you try to
connect.
SQL warehouse
a. Find the JDBC URL field value on the Connection Details tab for your SQL warehouse. The JDBC
URL should look similar to this one:
jdbc:databricks://adb-
1234567890123456.7.azuredatabricks.net:443/default;transportMode=http;ssl=1;AuthMech=3;httpPat
h=/sql/1.0/warehouses/a123456bcde7f890;
IMPORTANT
If the JDBC URL starts with jdbc:spark: , you must change it to jdbc:databricks: or else you will get
a connection error later.
TIP
You should start your Azure Databricks resource before testing your connection. Otherwise the test might take
several minutes to complete while the resource starts.
TIP
You should start your resource before trying to connect to it. Otherwise the connection might take several
minutes to complete while the resource starts.
Next steps
Use the Database object editor to work with database object properties, data, and entity relation diagrams.
Use the Data editor to view and edit data in a database table or view.
Use the SQL editor to work with SQL scripts.
Work with entity relation diagrams (ERDs) in DBeaver.
Import and export data into and from DBeaver.
Migrate data using DBeaver.
Troubleshoot JDBC driver issues with DBeaver.
Additional resources
DBeaver documentation
DBeaver support
DBeaver editions
CloudBeaver
Service principals for Azure Databricks automation
7/21/2022 • 6 minutes to read
A service principal is an identity created for use with automated tools and systems including scripts, apps, and
CI/CD platforms.
As a security best practice, Databricks recommends using an Azure AD service principal and its Azure AD token
instead of your Azure Databricks user or your Azure Databricks personal access token for your workspace user
to give automated tools and systems access to Azure Databricks resources. Some benefits to this approach
include the following:
You can grant and restrict access to Azure Databricks resources for an Azure AD service principal
independently of a user. For instance, this allows you to prohibit an Azure AD service principal from acting as
an admin in your Azure Databricks workspace while still allowing other specific users in your workspace to
continue to act as admins.
Users can safeguard their access tokens from being accessed by automated tools and systems.
You can temporarily disable or permanently delete an Azure AD service principal without impacting other
users. For instance, this allows you to pause or remove access from an Azure AD service principal that you
suspect is being used in a malicious way.
If a user leaves your organization, you can remove that user without impacting any Azure AD service
principal.
To create an Azure AD service principal, you use these tools and APIs:
You create an Azure AD service principal with tools such as the Azure portal.
You create an Azure AD token for an Azure AD service principal with tools such as curl and Postman.
After you create an Azure AD service principal, you add it to your Azure Databricks workspace with the SCIM
API 2.0 (ServicePrincipals). To call this API, you can also use tools such as curl and Postman. You cannot use
the Azure Databricks user interface.
This article describes how to:
1. Add an Azure AD service principal to your Azure Databricks workspace.
2. Create an Azure AD token for the Azure AD service principal.
To create an Azure AD service principal, see Provision a service principal in Azure portal.
Requirements
Access to the Azure portal.
One of the following, which enables you to call the Azure Databricks APIs:
An Azure Databricks personal access token for your Azure Databricks workspace user.
An Azure AD token for your Azure AD application.
A tool to call the Azure Databricks APIs, such as curl or Postman.
If you want to call the Azure Databricks APIs with Postman, note that instead of entering your Azure Databricks
workspace instance name, for example adb-1234567890123456.7.azuredatabricks.net and your Azure Databricks
personal access token for your workspace user for every Postman example in this article, you can define
variables and use variables in Postman instead.
If you want to call the Azure Databricks APIs with curl , this article’s curl examples use two environment
variables, DATABRICKS_HOST and DATABRICKS_TOKEN , representing your Azure Databricks per-workspace URL, for
example https://adb-1234567890123456.7.azuredatabricks.net ; and your Azure Databricks personal access token
for your workspace user. To set these environment variables, do the following:
Unix, Linux, and macOS
To set the environment variables for only the current terminal session, run the following commands. To set the
environment variables for all terminal sessions, enter the following commands into your shell’s startup file and
then restart your terminal. Replace the example values here with your own values.
export DATABRICKS_HOST="https://adb-12345678901234567.8.azuredatabricks.net"
export DATABRICKS_TOKEN="dapi1234567890b2cd34ef5a67bc8de90fa12b"
Windows
To set the environment variables for only the current Command Prompt session, run the following commands.
Replace the example values here with your own values.
set DATABRICKS_HOST="https://adb-12345678901234567.8.azuredatabricks.net"
set DATABRICKS_TOKEN="dapi1234567890b2cd34ef5a67bc8de90fa12b"
To set the environment variables for all Command Prompt sessions, run the following commands and then
restart your Command Prompt. Replace the example values here with your own values.
If you want to call the Azure Databricks APIs with curl , also note the following:
This article’s curl examples use shell command formatting for Unix, Linux, and macOS. For the Windows
Command shell, replace \ with ^ , and replace ${...} with %...% .
You can use a tool such as jq to format the JSON-formatted output of curl for easier reading and querying.
This article’s curl examples use jq to format the JSON output.
If you work with multiple Azure Databricks workspaces, instead of constantly changing the DATABRICKS_HOST
and DATABRICKS_TOKEN variables, you can use a .netrc file. If you use a .netrc file, modify this article’s curl
examples as follows:
Change curl -X to curl --netrc -X
Replace ${DATABRICKS_HOST} with your Azure Databricks per-workspace URL, for example
https://adb-1234567890123456.7.azuredatabricks.net
Remove --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
curl -X POST \
${DATABRICKS_HOST}/api/2.0/preview/scim/v2/ServicePrincipals \
--header "Content-type: application/scim+json" \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--data @add-service-principal.json \
| jq .
add-service-principal.json :
{
"applicationId": "<application-client-id>",
"displayName": "<display-name>",
"entitlements": [
{
"value": "allow-cluster-create"
}
],
"groups": [
{
"value": "<group-id>"
}
],
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"active": true
}
Postman
1. Create a new HTTP request (File > New > HTTP Request ).
2. In the HTTP verb drop-down list, select POST .
3. For Enter request URL , enter
https://<databricks-instance-name>/api/2.0/preview/scim/v2/ServicePrincipals , where
<databricks-instance-name> is your Azure Databricks workspace instance name, for example
adb-1234567890123456.7.azuredatabricks.net .
{
"applicationId": "<application-client-id>",
"displayName": "<display-name>",
"entitlements": [
{
"value": "allow-cluster-create"
}
],
"groups": [
{
"value": "<group-id>"
}
],
"schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ],
"active": true
}
9. Click Send .
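Whether you use curl or Postman, the same SCIM call can also be scripted. The following is a minimal Python
sketch of the request shown above; it assumes the requests library and the DATABRICKS_HOST and
DATABRICKS_TOKEN environment variables described earlier, and it is an illustration rather than an official client.

# Hypothetical sketch: add an Azure AD service principal to a workspace with
# the SCIM API 2.0 (ServicePrincipals), mirroring the curl example above.
import os
import requests

payload = {
    "applicationId": "<application-client-id>",
    "displayName": "<display-name>",
    "entitlements": [{"value": "allow-cluster-create"}],
    "groups": [{"value": "<group-id>"}],
    "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
    "active": True,
}

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/preview/scim/v2/ServicePrincipals",
    headers={
        "Content-Type": "application/scim+json",
        "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
    },
    json=payload,
)
resp.raise_for_status()
print(resp.json())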
This section provides a guide to developing notebooks and jobs in Azure Databricks using the Python language.
The first subsection provides links to tutorials for common workflows and tasks. The second subsection provides
links to APIs, libraries, and key tools.
A basic workflow for getting started is:
Import code: Either import your own code from files or Git repos or try a tutorial listed below. Databricks
recommends learning using interactive Databricks Notebooks.
Run your code on a cluster: Either create a cluster of your own, or ensure you have permissions to use a
shared cluster. Attach your notebook to the cluster, and run the notebook.
Beyond this, you can branch out into more specific topics:
Work with larger data sets using Apache Spark
Add visualizations
Automate your workload as a job
Use machine learning to analyze your data
Develop in IDEs
Tutorials
The below tutorials provide example code and notebooks to learn about common workflows. See Import a
notebook for instructions on importing notebook examples into your workspace.
Interactive data science and machine learning
Getting started with Apache Spark DataFrames for data preparation and analytics: Introduction to
DataFrames - Python
End-to-end example of building machine learning models on Azure Databricks. For additional examples, see
10-minute tutorials: Get started with machine learning on Azure Databricks and the MLflow guide’s
Quickstart Python.
Databricks AutoML lets you get started quickly with developing machine learning models on your own
datasets. Its glass-box approach generates notebooks with the complete machine learning workflow, which
you may clone, modify, and rerun.
Data engineering
Introduction to DataFrames - Python provides a walkthrough and FAQ to help you learn about Apache Spark
DataFrames for data preparation and analytics.
Delta Lake quickstart.
Delta Live Tables quickstart provides a walkthrough of Delta Live Tables to build and manage reliable data
pipelines, including Python examples.
Production machine learning and machine learning operations
MLflow Model Registry example
End-to-end example of building machine learning models on Azure Databricks
Reference
The below subsections list key features and tips to help you begin developing in Azure Databricks with Python.
Python APIs
Python code that runs outside of Databricks can generally run within Databricks, and vice versa. If you have
existing code, just import it into Databricks to get started. See Manage code with notebooks and Databricks
Repos below for details.
Databricks can run both single-machine and distributed Python workloads. For single-machine computing, you
can use Python APIs and libraries as usual; for example, pandas and scikit-learn will “just work.” For distributed
Python workloads, Databricks offers two popular APIs out of the box: the Pandas API on Spark and PySpark.
Pandas API on Spark
NOTE
The Koalas open-source project now recommends switching to the Pandas API on Spark. The Pandas API on Spark is
available on clusters that run Databricks Runtime 10.0 (Unsupported) and above. For clusters that run Databricks
Runtime 9.1 LTS and below, use Koalas instead.
pandas is a Python package commonly used by data scientists for data analysis and manipulation. However,
pandas does not scale out to big data. Pandas API on Spark fills this gap by providing pandas-equivalent APIs
that work on Apache Spark. This open-source API is an ideal choice for data scientists who are familiar with
pandas but not Apache Spark.
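For example, the following short snippet uses the Pandas API on Spark with pandas-like syntax on a cluster
running Databricks Runtime 10.0 or above; the column names and values are illustrative.

# A short example of the Pandas API on Spark: pandas-like syntax that runs on Spark.
import pyspark.pandas as ps

psdf = ps.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})
print(psdf.describe())
print(psdf.groupby("x").sum())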
PySpark API
PySpark is the official Python API for Apache Spark. This API provides more flexibility than the Pandas API on
Spark. These links provide an introduction to and reference for PySpark.
Introduction to DataFrames
Introduction to Structured Streaming
PySpark API reference
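As a quick illustration, the following minimal PySpark snippet creates a DataFrame and runs a simple
aggregation; the data and column names are placeholders.

# A short PySpark example: create a DataFrame and run a simple aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("pete", 3), ("jason", 5), ("pete", 7)],
    ["name", "score"],
)
df.groupBy("name").agg(F.avg("score").alias("avg_score")).show()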
Manage code with notebooks and Databricks Repos
Databricks notebooks support Python. These notebooks provide functionality similar to that of Jupyter, but with
additions such as built-in visualizations using big data, Apache Spark integrations for debugging and
performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by
importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the
notebook.
TIP
To completely reset the state of your notebook, it can be useful to restart the IPython kernel. For Jupyter users, the
“restart kernel” option in Jupyter corresponds to detaching and re-attaching a notebook in Databricks. To restart the
kernel in a Python notebook, click on the cluster dropdown in the upper-left and click Detach & Re-attach . This
detaches the notebook from your cluster and reattaches it, which restarts the Python process.
Databricks Repos allows users to synchronize notebooks and other files with Git repositories. Databricks Repos
helps with code versioning and collaboration, and it can simplify importing a full repository of code into Azure
Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a
remote Git repository. You can then open or create notebooks with the repository clone, attach the notebook to
a cluster, and run the notebook.
Clusters and libraries
Azure Databricks Clusters provide compute management for both single nodes and large clusters. You can
customize cluster hardware and libraries according to your needs. Data scientists will generally begin work
either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach
a notebook to the cluster or run a job on the cluster.
For small workloads which only require single nodes, data scientists can use Single Node clusters for cost
savings.
For detailed tips, see Best practices: Cluster configuration
Administrators can set up cluster policies to simplify and guide cluster creation.
Azure Databricks clusters use a Databricks Runtime, which provides many popular libraries out-of-the-box,
including Apache Spark, Delta Lake, pandas, and more. You can also install additional third-party or custom
Python libraries to use with notebooks and jobs.
Start with the default libraries in the Databricks Runtime. Use the Databricks Runtime for Machine Learning
for machine learning workloads. For full lists of pre-installed libraries, see Databricks runtime releases.
Customize your environment using Notebook-scoped Python libraries, which allow you to modify your
notebook or job environment with libraries from PyPI or other repositories. The %pip install my_library
magic command installs my_library to all nodes in your currently attached cluster, yet does not interfere
with other workloads on shared clusters.
Install non-Python libraries as Cluster libraries as needed.
For more details, see Libraries.
Visualizations
Azure Databricks Python notebooks have built-in support for many types of visualizations. You can also use
legacy visualizations.
You can also visualize data using third-party libraries; some are pre-installed in the Databricks Runtime, but you
can install custom libraries as well. Popular options include:
Bokeh
Matplotlib
Plotly
Jobs
You can automate Python workloads as scheduled or triggered Jobs in Databricks. Jobs can run notebooks,
Python scripts, and Python wheels.
For details on creating a job via the UI, see Create a job.
The Jobs API 2.1 allows you to create, edit, and delete jobs.
The Jobs CLI provides a convenient command line interface for calling the Jobs API.
TIP
To schedule a Python script instead of a notebook, use the spark_python_task field under tasks in the body of a
create job request.
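As an illustration only, here is a Python sketch of a Jobs API 2.1 create-job request that uses spark_python_task; the workspace URL, access token, script path, cluster settings, and job name are placeholders rather than values from this article.

import requests

# Placeholder values -- replace with your workspace URL, a personal access token,
# and the path to your own Python script.
host = "https://<databricks-instance>"
token = "<personal-access-token>"

job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "run_script",
            "spark_python_task": {
                "python_file": "dbfs:/scripts/etl.py",
                "parameters": ["--date", "2022-07-21"],
            },
            "new_cluster": {
                "spark_version": "10.4.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }
    ],
}

# POST /api/2.1/jobs/create returns the new job_id on success.
response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(response.json())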
Machine learning
Databricks supports a wide variety of machine learning (ML) workloads, including traditional ML on tabular
data, deep learning for computer vision and natural language processing, recommendation systems, graph
analytics, and more. For general information about machine learning on Databricks, see the Databricks Machine
Learning guide.
For ML algorithms, you can use pre-installed libraries in the Databricks Runtime for Machine Learning, which
includes popular Python tools such as scikit-learn, TensorFlow, Keras, PyTorch, Apache Spark MLlib, and XGBoost.
You can also install custom libraries.
For machine learning operations (MLOps), Azure Databricks provides a managed service for the open source
library MLflow. MLflow Tracking lets you record model development and save models in reusable formats; the
MLflow Model Registry lets you manage and automate the promotion of models towards production; and Jobs
and MLflow Model Serving allow hosting models as batch and streaming jobs and as REST endpoints. For more
information and examples, see the MLflow guide or the MLflow Python API docs.
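As a minimal sketch of MLflow Tracking on a cluster running the Databricks Runtime for Machine Learning, the dataset and model below are illustrative only and not taken from this article.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative data and model; any scikit-learn estimator works the same way.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, max_depth=6)
    model.fit(X_train, y_train)

    mse = mean_squared_error(y_test, model.predict(X_test))

    # Record parameters, metrics, and the model artifact for this run.
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("mse", mse)
    mlflow.sklearn.log_model(model, "model")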
To get started with common machine learning workloads, see the following pages:
Training scikit-learn and tracking with MLflow: 10-minute tutorial: machine learning on Databricks with scikit-
learn
Training deep learning models: Deep learning
Hyperparameter tuning: Parallelize hyperparameter tuning with scikit-learn and MLflow
Graph analytics: GraphFrames user guide - Python
IDEs, developer tools, and APIs
In addition to developing Python code within Azure Databricks notebooks, you can develop externally using
integrated development environments (IDEs) such as PyCharm, Jupyter, and Visual Studio Code. To synchronize
work between external development environments and Databricks, there are several options:
Code : You can synchronize code using Git. See Git integration with Databricks Repos.
Libraries and Jobs : You can create libraries (such as wheels) externally and upload them to Databricks.
Those libraries may be imported within Databricks notebooks, or they can be used to create jobs. See
Libraries and Jobs.
Remote machine execution : You can run code from your local IDE for interactive development and testing.
The IDE can communicate with Azure Databricks to execute large computations on Azure Databricks clusters.
To learn to use Databricks Connect to create this connection, see Use an IDE.
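For example, assuming the Databricks Connect client is installed and configured locally (for instance via databricks-connect configure), a rough sketch of remote execution looks like this:

from pyspark.sql import SparkSession

# With Databricks Connect configured, this session runs its work on the remote
# Azure Databricks cluster rather than on the local machine.
spark = SparkSession.builder.getOrCreate()

df = spark.range(100)
print(df.count())  # Executed on the cluster; only the result returns locally.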
Databricks provides a full set of REST APIs which support automation and integration with external tooling. You
can use APIs to manage resources like clusters and libraries, code and other workspace objects, workloads and
jobs, and more. See REST API (latest).
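For instance, a minimal sketch that lists clusters through the REST API using the requests library; the workspace URL and access token are placeholders.

import requests

host = "https://<databricks-instance>"   # placeholder workspace URL
token = "<personal-access-token>"        # placeholder personal access token

# GET /api/2.0/clusters/list returns metadata for the clusters in the workspace.
resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["state"])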
For more information on IDEs, developer tools, and APIs, see Developer tools and guidance.
Additional resources
The Databricks Academy offers self-paced and instructor-led courses on many topics.
Features that support interoperability between PySpark and pandas
pandas function APIs
pandas user-defined functions
Optimize conversion between PySpark and pandas DataFrames
Python and SQL database connectivity
The Databricks SQL Connector for Python allows you to use Python code to run SQL commands on
Azure Databricks resources (see the sketch after this list).
pyodbc allows you to connect from your local Python code through ODBC to data stored in the
Databricks Lakehouse.
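A minimal sketch with the Databricks SQL Connector for Python follows; the hostname, HTTP path, and access token are placeholders you copy from your own cluster or SQL warehouse settings.

from databricks import sql

with sql.connect(
    server_hostname="<server-hostname>",
    http_path="<http-path>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # Any SQL statement supported by the attached compute can run here.
        cursor.execute("SELECT 1 AS probe")
        print(cursor.fetchall())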
FAQs and tips for moving Python workloads to Databricks
Migrate single node workloads to Azure Databricks
Migrate production workloads to Azure Databricks
Knowledge Base
Pandas API on Spark
7/21/2022 • 2 minutes to read
NOTE
This feature is available on clusters that run Databricks Runtime 10.0 (Unsupported) and above. For clusters that run
Databricks Runtime 9.1 LTS and below, use Koalas instead.
Commonly used by data scientists, pandas is a Python package that provides easy-to-use data structures and
data analysis tools for the Python programming language. However, pandas does not scale out to big data.
Pandas API on Spark fills this gap by providing pandas equivalent APIs that work on Apache Spark. Pandas API
on Spark is useful not only for pandas users but also PySpark users, because pandas API on Spark supports
many tasks that are difficult to do with PySpark, for example plotting data directly from a PySpark DataFrame.
Requirements
Pandas API on Spark is available beginning in Apache Spark 3.2 (which is included beginning in Databricks
Runtime 10.0 (Unsupported)) by using the following import statement:
import pyspark.pandas as ps
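For example, a small pandas-on-Spark DataFrame supports familiar pandas-style operations; the values below are illustrative.

import pyspark.pandas as ps

# A pandas-on-Spark DataFrame: pandas-style API, Spark execution underneath.
psdf = ps.DataFrame({"carat": [0.3, 0.7, 1.1], "price": [500, 2400, 6100]})

print(psdf.describe())           # summary statistics, as in pandas
print(psdf[psdf.price > 1000])   # boolean filtering, as in pandas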
Notebook
The following notebook shows how to migrate from pandas to pandas API on Spark.
pandas to pandas API on Spark notebook
Get notebook
Resources
Pandas API on Spark user guide on the Apache Spark website
Migrating from Koalas to pandas API on Spark on the Apache Spark website
Pandas API on Spark reference on the Apache Spark website
Koalas
7/21/2022 • 2 minutes to read
NOTE
Koalas is deprecated on clusters that run Databricks Runtime 10.0 (Unsupported) and above. For clusters running
Databricks Runtime 10.0 (Unsupported) and above, use Pandas API on Spark instead.
If you try using Koalas on clusters that run Databricks Runtime 10.0 (Unsupported) and above, an informational message
displays, recommending that you use Pandas API on Spark instead.
Koalas provides a drop-in replacement for pandas. Commonly used by data scientists, pandas is a Python
package that provides easy-to-use data structures and data analysis tools for the Python programming
language. However, pandas does not scale out to big data. Koalas fills this gap by providing pandas equivalent
APIs that work on Apache Spark. Koalas is useful not only for pandas users but also PySpark users, because
Koalas supports many tasks that are difficult to do with PySpark, for example plotting data directly from a
PySpark DataFrame.
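On a supported runtime, a minimal Koalas sketch looks like the following; the values are illustrative.

import databricks.koalas as ks

# A Koalas DataFrame exposes the pandas API while executing on Spark.
kdf = ks.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
print(kdf.describe())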
Requirements
Koalas is included on clusters running Databricks Runtime 7.3 through 9.1. For clusters running Databricks
Runtime 10.0 and above, use Pandas API on Spark instead.
To use Koalas on a cluster running Databricks Runtime 7.0 or below, install Koalas as an Azure Databricks
PyPI library.
To use Koalas in an IDE, notebook server, or other custom applications that connect to an Azure Databricks
cluster, install Databricks Connect and follow the Koalas installation instructions.
Notebook
The following notebook shows how to migrate from pandas to Koalas.
pandas to Koalas notebook
Get notebook
Resources
Koalas documentation
10 Minutes from pandas to Koalas on Apache Spark
Azure Databricks for R developers
7/21/2022 • 2 minutes to read
This section provides a guide to developing notebooks in Azure Databricks using the R language.
R APIs
Azure Databricks supports two APIs that provide an R interface to Apache Spark: SparkR and sparklyr.
SparkR
These articles provide an introduction and reference for SparkR.
SparkR overview
SparkR ML tutorials
SparkR function reference
sparklyr
This article provides an introduction to sparklyr.
sparklyr
Visualizations
Azure Databricks R notebooks support various types of visualizations using the display function.
Visualizations in R
Tools
In addition to Azure Databricks notebooks, you can also use the following R developer tools:
RStudio on Azure Databricks
Shiny on hosted RStudio Server
Use Shiny inside Databricks notebooks
renv on Azure Databricks
Use SparkR and RStudio Desktop with Databricks Connect.
Use sparklyr and RStudio Desktop with Databricks Connect.
Resources
Knowledge Base
SparkR overview
7/21/2022 • 4 minutes to read
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR also supports
distributed machine learning using MLlib.
SparkR in notebooks
For Spark 2.0 and above, you do not need to explicitly pass a sqlContext object to every function call.
For Spark 2.2 and above, notebooks no longer import SparkR by default because SparkR functions were
conflicting with similarly named functions from other popular packages. To use SparkR you can call
library(SparkR) in your notebooks. The SparkR session is already configured, and all SparkR functions will
talk to your attached cluster using the existing session.
The simplest way to create a DataFrame is to convert a local R data.frame into a SparkDataFrame . Specifically,
we can use createDataFrame and pass in the local R data.frame to create a SparkDataFrame . Like most other
SparkR functions, createDataFrame syntax changed in Spark 2.0. You can see examples of this in the code
snippet below. For more examples, see createDataFrame.
library(SparkR)
df <- createDataFrame(faithful)
library(SparkR)
diamondsDF <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "csv",
header="true", inferSchema = "true")
head(diamondsDF)
require(SparkR)
irisDF <- createDataFrame(iris)
write.df(irisDF, source = "com.databricks.spark.avro", path = "dbfs:/tmp/iris.avro", mode = "overwrite")
%fs ls /tmp/iris.avro
Now use the spark-avro package again to read back the data.
The data source API can also be used to save DataFrames into multiple file formats. For example, you can save
the DataFrame from the previous example to a Parquet file using write.df .
write.df(irisDF, source = "parquet", path = "dbfs:/tmp/iris.parquet", mode = "overwrite")
%fs ls dbfs:/tmp/iris.parquet
# Create a df consisting of only the 'species' column using a Spark SQL query
species <- sql("SELECT species FROM irisTemp")
species is a SparkDataFrame.
DataFrame operations
Spark DataFrames support a number of functions to do structured data processing. Here are some basic
examples. A complete list can be found in the API docs.
Select rows and columns
# Create DataFrame
df <- createDataFrame(faithful)
# Filter the DataFrame to only retain rows with wait times shorter than 50 mins
head(filter(df, df$waiting < 50))
head(count(groupBy(df, df$waiting)))
# You can also sort the output from the aggregation to get the most common waiting times
waiting_counts <- count(groupBy(df, df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))
Column operations
SparkR provides a number of functions that can be directly applied to columns for data processing and
aggregation. The following example shows the use of basic arithmetic functions.
Machine learning
SparkR exposes most of the MLlib algorithms. Under the hood, SparkR uses MLlib to train the model.
The following example shows how to build a gaussian GLM model using SparkR. To run linear regression, set
family to "gaussian" . To run logistic regression, set family to "binomial" . When using SparkML GLM, SparkR
automatically performs one-hot encoding of categorical features so that it does not need to be done manually.
Beyond String and Double type features, it is also possible to fit over MLlib Vector features, for compatibility
with other MLlib components.
Use glm
Load diamonds data and split into training and test sets
Train a linear regression model using glm()
Train a logistic regression model using glm()
Use glm
7/21/2022 • 2 minutes to read
Parameters :
formula: Symbolic description of the model to be fitted, for example ResponseVariable ~ Predictor1 + Predictor2 .
Supported operators: ~ , + , - , and .
data : Any SparkDataFrame
family : String, "gaussian" for linear regression or "binomial" for logistic regression
lambda : Numeric, Regularization parameter
alpha : Numeric, Elastic-net mixing parameter
Load diamonds data and split into training and test sets
require(SparkR)
# Exclude rowIDs
trainingData <- trainingData[, -1]
testData <- testData[, -1]
print(count(diamonds))
print(count(trainingData))
print(count(testData))
head(trainingData)
Use predict() on the test data to see how well the model works on new data.
Syntax : predict(model, newData)
Parameters :
model : MLlib model
newData : SparkDataFrame, typically your test set
Output : SparkDataFrame
# Calculate RMSE
head(select(errors, alias(sqrt(sum(errors$error^2 , na.rm = TRUE) / nrow(errors)), "RMSE")))
# Subset data to include rows where diamond cut = "Premium" or diamond cut = "Very Good"
trainingDataSub <- subset(trainingData, trainingData$cut %in% c("Premium", "Very Good"))
testDataSub <- subset(testData, testData$cut %in% c("Premium", "Very Good"))
Requirements
Azure Databricks distributes the latest stable version of sparklyr with every runtime release. You can use
sparklyr in Azure Databricks R notebooks or inside RStudio Server hosted on Azure Databricks by importing the
installed version of sparklyr.
In RStudio Desktop, Databricks Connect allows you to connect sparklyr from your local machine to Azure
Databricks clusters and run Apache Spark code. See Use sparklyr and RStudio Desktop with Databricks Connect.
Use sparklyr
After you install sparklyr and establish the connection, all other sparklyr APIs work as they normally do. See the
example notebook for some examples.
sparklyr is usually used along with other tidyverse packages such as dplyr. Most of these packages are
preinstalled on Databricks for your convenience. You can simply import them and start using the API.
> library(SparkR)
The following objects are masked from ‘package:dplyr’:
If you import SparkR after you imported dplyr, you can reference the functions in dplyr by using the fully
qualified names, for example, dplyr::arrange() . Similarly if you import dplyr after SparkR, the functions in
SparkR are masked by dplyr.
Alternatively, you can selectively detach one of the two packages while you do not need it.
detach("package:dplyr")
Unsupported features
Azure Databricks does not support sparklyr methods such as spark_web() and spark_log() that require a local
browser. However, since the Spark UI is built-in on Azure Databricks, you can inspect Spark jobs and logs easily.
See Cluster driver and worker logs.
Sparklyr notebook
Get notebook
RStudio on Azure Databricks
7/21/2022 • 10 minutes to read
Azure Databricks integrates with RStudio Server, the popular integrated development environment (IDE) for R.
You can use either the Open Source or Pro editions of RStudio Server on Azure Databricks. If you want to use
RStudio Server Pro, you must transfer your existing RStudio Pro license to Azure Databricks (see Get started
with RStudio Workbench (previously RStudio Server Pro)).
Databricks Runtime for Machine Learning includes an unmodified version of RStudio Server Open Source
package for which the source code can be found in GitHub. The following table lists the version of RStudio
Server Open Source that is currently preinstalled on Databricks Runtime for ML versions.
WARNING
Azure Databricks proxies the RStudio web service from port 8787 on the cluster’s Spark driver. This web proxy is intended
for use only with RStudio. If you launch other web services on port 8787, you might expose your users to potential
security exploits. Neither Databricks nor Microsoft is responsible for any issues that result from the installation of
unsupported software on a cluster.
Requirements
The cluster must not have table access control, automatic termination, or credential passthrough enabled.
You must have Can Attach To permission for that cluster. The cluster admin can grant you this permission.
See Cluster access control.
If you want to use the Pro edition, you must have an RStudio Server floating Pro license.
Get started with RStudio Server Open Source
IMPORTANT
If you are using Databricks Runtime 7.0 ML or above, RStudio Server Open Source is already installed and you can skip
the section on installing RStudio Server.
To get started with RStudio Server Open Source on Azure Databricks, you must install RStudio on an Azure
Databricks cluster. You need to perform this installation only once. Installation is usually performed by an
administrator.
Install RStudio Server Open Source
To set up RStudio Server Open Source on an Azure Databricks cluster, you must create an init script to install the
RStudio Server Open Source binary package. See Cluster-scoped init scripts for more details. Here is an example
notebook cell that installs an init script on a location on DBFS.
IMPORTANT
All users have read and write access to DBFS, so the init script can be modified by any user. If this is a potential
issue for you, Databricks recommends that you put the init script on Azure Data Lake Storage Gen2 and restrict
permissions to it.
You may need to modify the package URL depending on the Ubuntu version of your runtime, which you can find
in the release notes.
script = """#!/bin/bash
dbutils.fs.mkdirs("/databricks/rstudio")
dbutils.fs.put("/databricks/rstudio/rstudio-install.sh", script, True)
3. Click the Open RStudio UI link to open the UI in a new tab. Enter your username and password in the
login form and sign in.
4. From the RStudio UI, you can import the SparkR package and set up a SparkR session to launch Spark
jobs on your cluster.
library(SparkR)
sparkR.session()
5. You can also attach the sparklyr package and set up a Spark connection.
SparkR::sparkR.session()
library(sparklyr)
sc <- spark_connect(method = "databricks")
NOTE
If you plan to install RStudio Workbench on a Databricks Runtime version that already includes RStudio Server Open
Source package, you need to first uninstall that package for installation to succeed.
The following is an example notebook cell that generates an init script on DBFS. The script also performs
additional authentication configurations that streamline integration with Azure Databricks.
IMPORTANT
All users have read and write access to DBFS, so the init script can be modified by any user. If this is a potential
issue for you, Databricks recommends that you put the init script on Azure Data Lake Storage Gen2 and restrict
permissions to it.
You may need to modify the package URL depending on the Ubuntu version of your runtime, which you can find
in the release notes.
script = """#!/bin/bash
## Configuring authentication
sudo echo 'auth-proxy=1' >> /etc/rstudio/rserver.conf
sudo echo 'auth-proxy-user-header-rewrite=^(.*)$ $1' >> /etc/rstudio/rserver.conf
sudo echo 'auth-proxy-sign-in-url=<domain>/login.html' >> /etc/rstudio/rserver.conf
sudo echo 'admin-enabled=1' >> /etc/rstudio/rserver.conf
sudo echo 'export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin' >>
/etc/rstudio/rsession-profile
# Session configurations
sudo echo 'session-rprofile-on-resume-default=1' >> /etc/rstudio/rsession.conf
sudo echo 'allow-terminal-websockets=0' >> /etc/rstudio/rsession.conf
dbutils.fs.mkdirs("/databricks/rstudio")
dbutils.fs.put("/databricks/rstudio/rstudio-install.sh", script, True)
1. Replace <domain> with your Azure Databricks URL and <license-server-url> with the URL of your floating
license server.
2. Run the code in a notebook to install the script at dbfs:/databricks/rstudio/rstudio-install.sh
3. Before launching a cluster add dbfs:/databricks/rstudio/rstudio-install.sh as an init script. See Diagnostic
logs for details.
4. Launch the cluster.
Use RStudio Server Pro
1. Display the details of the cluster on which you installed RStudio and click the Apps tab:
2. In the Apps tab, click the Set up RStudio button.
3. You do not need the one-time password. Click the Open RStudio UI link and it will open an
authenticated RStudio Pro session for you.
4. From the RStudio UI, you can attach the SparkR package and set up a SparkR session to launch Spark
jobs on your cluster.
library(SparkR)
sparkR.session()
5. You can also attach the sparklyr package and set up a Spark connection.
SparkR::sparkR.session()
library(sparklyr)
sc <- spark_connect(method = "databricks")
Frequently asked questions (FAQ)
What is the difference between RStudio Server Open Source and RStudio Workbench?
RStudio Workbench supports a wide range of enterprise features that are not available on the Open Source
edition. You can see the feature comparison on RStudio’s website.
In addition, RStudio Server Open Source is distributed under the GNU Affero General Public License (AGPL),
while the Pro version comes with a commercial license for organizations that are not able to use AGPL software.
Finally, RStudio Workbench comes with professional and enterprise support from RStudio, PBC, while RStudio
Server Open Source comes with no support.
Can I use my RStudio Workbench / RStudio Server Pro license on Azure Databricks?
Yes, if you already have a Pro or Enterprise license for RStudio Server, you can use that license on Azure
Databricks. See Get started with RStudio Workbench (previously RStudio Server Pro) to learn how to set up
RStudio Workbench on Azure Databricks.
Where does RStudio Server run? Do I need to manage any additional services/servers?
As you can see on the diagram in RStudio integration architecture, the RStudio Server daemon runs on the
driver (master) node of your Azure Databricks cluster. With RStudio Server Open Source, you do not need to run
any additional servers/services. However, for RStudio Workbench, you must manage a separate instance that
runs RStudio License Server.
Can I use RStudio Server on a standard cluster?
Yes, you can. Originally, you were required to use a high concurrency cluster, but that limitation is no longer in
place.
Can I use RStudio Server on a cluster with auto termination?
No, you can’t use RStudio when auto termination is enabled. Auto termination can purge unsaved user scripts
and data inside an RStudio session. To protect users against this unintended data loss scenario, RStudio is
disabled on such clusters by default.
For customers who require cleaning up cluster resources when they are not used, Databricks recommends using
cluster APIs to clean up RStudio clusters based on a schedule.
How should I persist my work on RStudio?
We strongly recommend that you persist your work using a version control system from RStudio. RStudio has
great support for various version control systems and allows you to check in and manage your projects.
You can also save your files (code or data) on the Databricks File System (DBFS). For example, if you save a file
under /dbfs/ the files will not be deleted when your cluster is terminated or restarted.
IMPORTANT
If you do not persist your code through version control or DBFS, you risk losing your work if an admin restarts or
terminates the cluster.
Another method is to save the R notebook to your local file system by exporting it as Rmarkdown , then later
importing the file into the RStudio instance.
The blog Sharing R Notebooks using RMarkdown describes the steps in more detail.
How do I start a SparkR session?
SparkR is contained in Databricks Runtime, but you must load it into RStudio. Run the following code inside
RStudio to initialize a SparkR session.
library(SparkR)
sparkR.session()
If there is an error importing the SparkR package, run .libPaths() and verify that
/home/ubuntu/databricks/spark/R/lib is included in the result.
If it is not included, check the content of /usr/lib/R/etc/Rprofile.site . List
/home/ubuntu/databricks/spark/R/lib/SparkR on the driver to verify that the SparkR package is installed.
How do I start a sparklyr session?
The sparklyr package must be installed on the cluster. Use one of the following methods to install the
sparklyr package:
As an Azure Databricks library
install.packages() command
RStudio package management UI
SparkR is contained in Databricks Runtime, but you must load it into RStudio. Run the following code inside
RStudio to initialize a sparklyr session.
SparkR::sparkR.session()
library(sparklyr)
sc <- spark_connect(method = "databricks")
Shiny is an R package, available on CRAN, used to build interactive R applications and dashboards. You can use
Shiny inside RStudio Server hosted on Azure Databricks clusters. You can also develop, host, and share Shiny
applications directly from an Azure Databricks notebook. See Share Shiny app URL.
To get started with Shiny, see the Shiny tutorials.
This article describes how to run Shiny applications on RStudio on Azure Databricks and use Apache Spark
inside Shiny applications.
Requirements
Shiny is included in Databricks Runtime 7.3 LTS and above. On Databricks Runtime 6.4 Extended Support
and Databricks Runtime 5.5 LTS, you must install the Shiny R package.
RStudio on Azure Databricks.
IMPORTANT
With RStudio Server Pro, you must disable proxied authentication. Make sure auth-proxy=1 is not present inside
/etc/rstudio/rserver.conf .
> library(shiny)
> runExample("01_hello")
Listening on http://127.0.0.1:3203
# Define the UI
ui <- fluidPage(
sliderInput("carat", "Select Carat Range:",
min = 0, max = 5, value = c(0, 5), step = 0.01),
plotOutput('plot')
)
library(shiny)
library(promises)
library(future)
plan(multisession)
ui <- fluidPage(
sidebarLayout(
# Display heartbeat
sidebarPanel(textOutput("keep_alive")),
How many connections can be accepted for one Shiny app link during development?
Databricks recommends up to 20.
Can I use a different version of the Shiny package than the one installed in Databricks Runtime?
Yes. See Fix the Version of R Packages.
How can I develop a Shiny application that can be published to a Shiny server and access data on Azure
Databricks?
While you can access data naturally using SparkR or sparklyr during development and testing on Azure
Databricks, after a Shiny application is published to a stand-alone hosting service, it cannot directly access the
data and tables on Azure Databricks.
To enable your application to function outside Azure Databricks, you must rewrite how you access data. There
are a few options:
Use JDBC/ODBC to submit queries to an Azure Databricks cluster.
Use Databricks Connect.
Directly access data on object storage.
Databricks recommends that you work with your Azure Databricks solutions team to find the best approach for
your existing data and analytics architecture.
How can I save the Shiny applications that I develop on Azure Databricks?
You can either save your application code on DBFS through the FUSE mount or check your code into version
control.
Can I develop a Shiny application inside an Azure Databricks notebook?
Yes, you can develop a Shiny application inside an Azure Databricks notebook. For more details see Use Shiny
inside Databricks notebooks.
Use Shiny inside Databricks notebooks
7/21/2022 • 2 minutes to read
IMPORTANT
This feature is in Public Preview.
You can develop, host, and share Shiny applications directly from an Azure Databricks notebook.
To get started with Shiny, see the Shiny tutorials. You can run these tutorials on Azure Databricks notebooks.
Requirements
Databricks Runtime 8.3 or above.
library(shiny)
runExample("01_hello")
3. When the app is ready, the output includes the Shiny app URL as a clickable link which opens a new tab.
See Share Shiny app URL for information about sharing this app with other users.
NOTE
Log messages appear in the command result, similar to the default log message (
Listening on http://0.0.0.0:5150 ) shown in the example.
To stop the Shiny application, click Cancel.
The Shiny application uses the notebook R process. If you detach the notebook from the cluster, or if you cancel the
cell running the application, the Shiny application terminates. You cannot run other cells while the Shiny application is
running.
NOTE
You must use an absolute path or set the working directory with setwd().
1. Check out the code from a repository using code similar to:
2. To run the application, enter code similar to the following code in another cell:
library(shiny)
runApp("/databricks/driver/shiny-examples/007-widgets/")
library(shiny)
library(SparkR)
sparkR.session()
ui <- fluidPage(
mainPanel(
textOutput("value")
)
)
ui <- fluidPage(
mainPanel(
textOutput("value")
)
)
renv is an R package that lets users manage R dependencies specific to the notebook.
Using renv , you can create and manage the R library environment for your project, save the state of these
libraries to a lockfile , and later restore libraries as required. Together, these tools can help make projects more
isolated, portable, and reproducible.
You can install renv as a cluster-scoped library or as a notebook-scoped library. To install renv as a notebook-
scoped library, use:
Databricks recommends using a CRAN snapshot as the repository to fix the package version.
Initialize renv session with pre-installed R libraries
The first step when using renv is to initialize a session using renv::init() . Set libPaths to change the default
download location to be your R notebook-scoped library path.
renv::init(settings = list(external.libraries=.libPaths()))
.libPaths(c(.libPaths()[2], .libPaths()))
renv::install("digest")
To install an old version of digest , run the following inside of a notebook cell.
renv::install("digest@0.6.18")
To install digest from GitHub, run the following inside of a notebook cell.
renv::install("eddelbuettel/digest")
To install a package from Bioconductor, run the following inside of a notebook cell.
renv::settings$snapshot.type("all")
This sets renv to snapshot all packages that are installed into libPaths , not just the ones that are currently
used in the notebook. See renv documentation for more information.
Now you can run the following inside of a notebook cell to save the current state of your environment.
renv::snapshot(lockfile="/dbfs/PATH/TO/WHERE/YOU/WANT/TO/SAVE/renv.lock", force=TRUE)
This updates the lockfile by capturing all packages installed on libPaths . It also moves your lockfile from
the local filesystem to DBFS, where it persists even if your cluster terminates or restarts.
Reinstall a renv environment given a lockfile from DBFS
First, make sure that your new cluster is running an identical Databricks Runtime version as the one you first
created the renv environment on. This ensures that the pre-installed R packages are identical. You can find a list
of these in each runtime’s release notes. After you Install renv, run the following inside of a notebook cell.
renv::init(settings = list(external.libraries=.libPaths()))
.libPaths(c(.libPaths()[2], .libPaths()))
renv::restore(lockfile="/dbfs/PATH/TO/WHERE/YOU/SAVED/renv.lock", exclude=c("Rserve", "SparkR"))
This copies your lockfile from DBFS into the local file system and then restores any packages specified in the
lockfile .
NOTE
To avoid missing repository errors, exclude the Rserve and SparkR packages from package restoration. Both of these
packages are pre-installed in all runtimes.
renv Cache
A very useful feature of renv is its global package cache, which is shared across all renv projects on the
cluster. It speeds up installation times and saves disk space. The renv cache does not cache packages
downloaded via the devtools API or install.packages() with any additional arguments other than pkgs .
Azure Databricks for Scala developers
7/21/2022 • 2 minutes to read
This section provides a guide to developing notebooks and jobs in Azure Databricks using the Scala language.
Scala API
These links provide an introduction to and reference for the Apache Spark Scala API.
Introduction to DataFrames
Complex and nested data
Aggregators
Introduction to Datasets
Introduction to Structured Streaming
Apache Spark API reference
Visualizations
Azure Databricks Scala notebooks have built-in support for many types of visualizations. You can also use legacy
visualizations:
Visualization overview
Visualization deep dive in Scala
Interoperability
This section describes features that support interoperability between Scala and SQL.
User-defined functions
User-defined aggregate functions
Tools
In addition to Azure Databricks notebooks, you can also use the following Scala developer tools:
IntelliJ
Libraries
Databricks runtimes provide many libraries. To make third-party or locally-built Scala libraries available to
notebooks and jobs running on your Azure Databricks clusters, you can install libraries following these
instructions:
Install Scala libraries in a cluster
Resources
Knowledge Base
Azure Databricks for SQL developers
7/21/2022 • 2 minutes to read
This section provides a guide to developing notebooks in the Databricks Data Science & Engineering and
Databricks Machine Learning environments using the SQL language.
Databricks SQL
If you are a data analyst who works primarily with SQL queries and BI tools, Databricks SQL provides an
intuitive environment for running ad-hoc queries and creating dashboards on data stored in your data lake. You
may want to skip this article, which is focused on developing notebooks in the Databricks Data Science &
Engineering and Databricks Machine Learning environments. Instead see:
Queries in Databricks SQL
SQL reference for Databricks SQL
SQL Reference
The SQL language reference that you use depends on the Databricks Runtime version that your cluster is
running:
Databricks Runtime 7.x and above (Spark SQL 3.x)
Databricks Runtime 6.4 Extended Support and Databricks Light 2.4 (Spark SQL 2.4)
For Databricks SQL, see SQL reference for Databricks SQL.
Use cases
Cost-based optimizer
Transactional writes to cloud storage with DBIO
Handling bad records and files
Handling large queries in interactive workflows
Adaptive query execution
Query semi-structured data in SQL
Data skipping index
Visualizations
SQL notebooks support various types of visualizations using the display function.
Create a new visualization
Interoperability
This section describes features that support interoperability between SQL and other languages supported in
Azure Databricks.
User-defined scalar functions - Python
User-defined scalar functions - Scala
User-defined aggregate functions - Scala
Tools
In addition to Azure Databricks notebooks, you can also use various third-party developer tools, data sources,
and other integrations. See Databricks integrations.
Access control
This article describes how to use SQL constructs to control access to database objects:
Data object privileges
Resources
Apache Spark SQL Guide
Delta Lake guide
Knowledge Base
SQL reference for Databricks Runtime 7.3 LTS and
above
7/21/2022 • 2 minutes to read
This is a SQL command reference for users on clusters running Databricks Runtime 7.x and above in the
Databricks Data Science & Engineering workspace and Databricks Machine Learning environment.
NOTE
For Databricks Runtime 5.5 LTS and 6.x SQL commands, see SQL reference for Databricks Runtime 5.5 LTS and 6.x.
For the Databricks SQL language reference, see SQL reference for Databricks SQL.
General reference
This general reference describes data types, functions, identifiers, literals, and semantics:
How to read a syntax diagram
Data types and literals
SQL data type rules
Datetime patterns
Functions
Built-in functions
Lambda functions
Window functions
Identifiers
Names
Null semantics
Expressions
JSON path expressions
Partitions
ANSI compliance
Apache Hive compatibility
Principals
Privileges and securable objects
External locations and storage credentials
Delta Sharing
Information schema
Reserved words
DDL statements
You use data definition statements to create or modify the structure of database objects in a database:
ALTER CATALOG
ALTER CREDENTIAL
ALTER DATABASE
ALTER LOCATION
ALTER TABLE
ALTER SCHEMA
ALTER SHARE
ALTER VIEW
COMMENT ON
CREATE BLOOMFILTER INDEX
CREATE CATALOG
CREATE DATABASE
CREATE FUNCTION (External)
CREATE FUNCTION (SQL)
CREATE LOCATION
CREATE RECIPIENT
CREATE SCHEMA
CREATE SHARE
CREATE TABLE
CREATE VIEW
DROP BLOOMFILTER INDEX
DROP CATALOG
DROP DATABASE
DROP CREDENTIAL
DROP FUNCTION
DROP LOCATION
DROP RECIPIENT
DROP SCHEMA
DROP SHARE
DROP TABLE
DROP VIEW
MSCK REPAIR TABLE
TRUNCATE TABLE
DML statements
You use data manipulation statements to add, change, or delete data:
COPY INTO
DELETE FROM
INSERT INTO
INSERT OVERWRITE DIRECTORY
INSERT OVERWRITE DIRECTORY with Hive format
LOAD DATA
MERGE INTO
UPDATE
EXPLAIN
Auxiliary statements
You use auxiliary statements to collect statistics, manage caching for Apache Spark cache, explore metadata, set
configurations, and manage resources:
Analyze statement
Apache Spark Cache statements
Describe statements
Show statements
Configuration management
Resource management
Analyze statement
ANALYZE TABLE
Apache Spark Cache statements
CACHE TABLE
CLEAR CACHE
REFRESH
REFRESH FUNCTION
REFRESH TABLE
UNCACHE TABLE
Describe statements
DESCRIBE CATALOG
DESCRIBE CREDENTIAL
DESCRIBE DATABASE
DESCRIBE FUNCTION
DESCRIBE LOCATION
DESCRIBE QUERY
DESCRIBE RECIPIENT
DESCRIBE SCHEMA
DESCRIBE SHARE
DESCRIBE TABLE
Show statements
LIST
SHOW ALL IN SHARE
SHOW CATALOGS
SHOW COLUMNS
SHOW CREATE TABLE
SHOW CREDENTIALS
SHOW DATABASES
SHOW FUNCTIONS
SHOW GROUPS
SHOW LOCATIONS
SHOW PARTITIONS
SHOW RECIPIENTS
SHOW SCHEMAS
SHOW SHARES
SHOW TABLE
SHOW TABLES
SHOW TBLPROPERTIES
SHOW USERS
SHOW VIEWS
Configuration management
RESET
SET
SET TIMEZONE
USE CATALOG
USE DATABASE
USE SCHEMA
Resource management
ADD ARCHIVE
ADD FILE
ADD JAR
LIST ARCHIVE
LIST FILE
LIST JAR
Security statements
You use security SQL statements to manage access to data:
ALTER GROUP
CREATE GROUP
DENY
DROP GROUP
GRANT
GRANT SHARE
REPAIR PRIVILEGES
REVOKE
REVOKE SHARE
SHOW GRANTS
SHOW GRANTS ON SHARE
SHOW GRANTS TO RECIPIENT
For details using these statements, see Data object privileges.
How to read a syntax diagram
7/21/2022 • 2 minutes to read
This section describes the various patterns of syntax used throughout the Databricks Runtime reference.
Base components
Keyword
Token
Clause
Argument
Keyword
SELECT
Keywords in SQL are always capitalized in this document, but they are case insensitive.
Token
( )
< >
.
*
,
The SQL language includes round braces ( ( , ) ) as well as angled braces ( < , > ), dots ( . ), commas ( , ), and
a few other characters. When these characters are present in a syntax diagram you must enter them as is.
Clause
LIMIT clause
SELECT named_expression
named_expression
expression AS alias
A clause represents a named subsection of syntax. A local clause is described in the same syntax diagram that
invokes it. If the clause is common, it links to another section of the Databricks Runtime reference. Some clauses
are known by their main keyword and are depicted with a capital keyword followed by clause. Other clauses are
always lower case and use underscore ( _ ) where appropriate. Local clauses are fully explained within the
following section. All other clauses have a short description with a link to the main page.
Argument
mapExpr
Arguments to functions are specified in camelCase. Databricks Runtime describes the meaning of arguments in
the Arguments section.
Chain of tokens
SELECT expr
Components separated by whitespace must be entered in order, unconditionally, and be separated only by
whitespace or comments. Databricks Runtime supports comments of the form /* ... */ (C-style), and -- ...
, which extends to end of the line.
Choice
Specifies a fork in the syntax.
Mandatory choice
{ INT | INTEGER }
Curly braces { ... } mean you must specify exactly one of the multiple components. Each choice is separated
by a | .
Optional choice
[ ASC | DESC ]
Square brackets [ ... ] indicate you can choose at most one of multiple components. Each choice is separated
by a | .
Grouping
{ SELECT expr }
{ SELECT
expr }
Curly braces { ... } specify that you must provide all the embedded components. If a syntax diagram spans
multiple lines, this form clarifies that it depicts the same syntax.
Option
[ NOT NULL ]
Square brackets [...] specify that the enclosed components are optional.
Repetition
col_option [...]
col_alias [, ...]
The [...] ellipsis notation indicates that you can repeat the immediately preceding component, grouping, or
choice multiple times. If the ellipsis is preceded by another character, such as a separated dot [. ...] , or a
comma [, ...] , you must separate each repetition by that character.
Names
7/21/2022 • 5 minutes to read
Catalog name
Identifies a catalog. A catalog provides a grouping of objects which can be further subdivided into schemas.
Syntax
catalog_identifier
Parameters
catalog_identifier : An identifier that uniquely identifies the catalog.
Examples
Schema name
Identifies a schema. A schema provides a grouping of objects in a catalog.
Syntax
[ catalog_name . ] schema_identifier
Parameters
catalog_name : The name of an existing catalog.
schema_identifier : An identifier that uniquely identifies the schema.
Examples
Database name
A synonym for schema name.
While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.
Table name
Identifies a table object. The table can be qualified with a schema name or unqualified using a simple identifier.
Syntax
temporal_spec
{
@ timestamp_encoding |
@V version |
[ FOR ] { SYSTEM_TIMESTAMP | TIMESTAMP } AS OF timestamp_expression |
[ FOR ] { SYSTEM_VERSION | VERSION } AS OF version
}
credential_spec
WITH ( CREDENTIAL credential_name )
Parameters
schema_name : A qualified or unqualified schema name that contains the table.
table_identifier : An identifier that specifies the name of the table or table_alias.
file_format : One of json , csv , avro , parquet , orc , binaryFile , text , delta (case insensitive).
path_to_table : The location of the table in the file system. You must have the ANY_FILE permission to
use this syntax.
temporal_spec : When used references a Delta table at the specified point in time or version.
You can use a temporal specification only within the context of a query or a MERGE USING.
@ timestamp_encoding : A positive Bigint literal that encodes a timestamp in yyyyMMddHHmmssSSS
format.
@V version : A positive Integer literal identifying the version of the Delta table.
timestamp_expression : A simple expression that evaluates to a TIMESTAMP. timestamp_expression
must be a constant expression, but may contain current_date() or current_timestamp() .
version : An Integer literal or String literal identifying the version of the Delta table.
credential_spec
You can use an applicable credential to gain access to a path_to_table which is not embedded in an
external location.
credential_name
The name of the credential used to access the storage location.
If the name is unqualified and does not reference a known table alias, Databricks Runtime first attempts to
resolve the table in the current schema.
If the name is qualified with a schema, Databricks Runtime attempts to resolve the table in the current catalog.
Databricks Runtime raises an error if you use a temporal_spec for a table that is not in Delta Lake format.
Examples
`Employees`
employees
hr.employees
`hr`.`employees`
hive_metastore.default.tab
system.information_schema.columns
delta.`somedir/delta_table`
`csv`.`spreadsheets/data.csv`
View name
Identifies a view. The view can be qualified with a schema name or unqualified using a simple identifier.
Syntax
[ schema_name . ] view_identifier
Parameters
schema_name : The qualified or unqualified name of the schema that contains the view.
view_identifier : An identifier that specifies the name of the view or the view identifier of a CTE.
Examples
`items`
items
hr.items
`hr`.`items`
Column name
Identifies a column within a table or view. The column can be qualified with a table or view name, or unqualified
using a simple identifier.
Syntax
Parameters
table_name : A qualified or unqualified table name of the table containing the column.
view_name : A qualified or unqualified view name of the view containing the column.
column_identifier : An identifier that specifies the name of the column.
The identified column must exist within the table or view.
Databricks Runtime supports a special _metadata column. This pseudo column of type struct is part of every
table and can be used to retrieve metadata information about the rows in the table.
Examples
Field name
Identifies a field within a struct. The field must be qualified with the path up to the struct containing the field.
Syntax
Parameters
expr : An expression of type STRUCT.
field_identifier : An identifier that specifies the name of the field.
A deeply nested field can be referenced by specifying the field identifier along the path to the root struct.
Examples
Function name
Identifies a function. The function can be qualified with a schema name, or unqualified using a simple identifier.
Syntax
[ schema_name . ] function_identifier
Parameters
schema_name : A qualified or unqualified schema name that contains the function.
function_identifier : An identifier that specifies the name of the function.
Examples
`math`.myplus
myplus
math.`myplus`
Parameter name
Identifies a parameter in the body of a SQL user-defined function (SQL UDF). The function can be qualified with
a function identifier, or unqualified using a simple identifier.
Syntax
[ function_identifier . ] parameter_identifier
Parameters
function_identifier : An identifier that specifies the name of a function.
parameter_identifier : An identifier that specifies the name of a parameter.
Examples
Table alias
Labels a table reference, query, table function, or other form of a relation.
Syntax
Parameters
table_identifier : An identifier that specifies the name of the table.
column_identifierN : An optional identifier that specifies the name of the column.
If you provide column identifiers, their number must match the number of columns in the matched relation.
If you don’t provide column identifiers, their names are inherited from the labeled relation.
Examples
Column alias
Labels the result of an expression in a SELECT list for reference.
If the expression is a table valued generator function, the alias labels the list of columns produced.
Syntax
[AS] column_identifier
Parameters
column_identifier : An identifier that specifies the name of the column.
While column aliases need not be unique within the select list, uniqueness is a requirement to reference an alias
by name.
Examples
> SELECT 1 AS a;
a
1
> SELECT 1 a, 2 b;
a b
1 2
Credential name
Identifies a credential to access storage at an external location.
Syntax
credential_identifier
Parameters
credential_identifier : An unqualified identifier that uniquely identifies the credential.
Examples
Location name
Identifies an external storage location.
Syntax
location_identifier
Parameters
location_identifier : An unqualified identifier that uniquely identifies the location.
Examples
`s3-json-data`
s3_json_data
Share name
Identifies a share to access data shared by a provider.
Syntax
share_identifier
Parameters
share_identifier : An unqualified identifier that uniquely identifies the share.
Examples
`public info`
`public-info`
public_info
Recipient name
Identifies a recipient for a share.
Syntax
recipient_identifier
Parameters
recipient_identifier : An unqualified identifier that uniquely identifies the recipient.
Examples
`Good Corp`
`Good-corp`
Good_Corp
Related
Identifiers
File metadata column
Databricks Runtime expression
7/21/2022 • 2 minutes to read
An expression is a formula that computes a result based on literals or references to columns, fields, or variables,
using functions or operators.
Syntax
{ literal |
column_reference |
field_reference |
parameter_reference |
CAST expression |
CASE expression |
expr operator expr |
operator expr |
expr [ expr ] |
function_invocation |
( expr ) |
scalar_subquery }
scalar_subquery
( query )
The brackets [ expr ] are actual brackets and do not indicate optional syntax.
Parameters
literal
A literal of a type described in Data types.
column_reference
A reference to a column in a table or column alias.
field_reference
A reference to a field in a STRUCT type.
parameter_reference
A reference to a parameter of a SQL user defined function from within the body of the function. The
reference may use the unqualified name of the parameter or qualify the name with the function name.
Parameters constitute the outermost scope when resolving identifiers.
CAST expression
An expression casting the argument to a different type.
CASE expression
An expression allowing for conditional evaluation.
expr
An expression itself which is combined with an operator , or which is an argument to a function.
operator
A unary or binary operator.
[ expr ]
A reference to an array element or a map key.
function_invocation
An expression invoking a built-in or user defined function.
The pages for each builtin function and operator describe the data types their parameters expect.
Databricks Runtime performs implicit casting to expected types using SQL data type rules. If an operator
or function is invalid for the provided argument, Databricks Runtime raises an error. Functions also
document which parameters are mandatory or optional.
When invoking a SQL user defined function you may omit arguments for trailing parameters if the
parameters have defined defaults.
( expr )
Enforced precedence that overrides operator precedence.
scalar_subquery :
( query )
An expression based on a query that must return a single column and at most one row.
The pages for each function and operator describe the data types their parameters expect. Databricks Runtime
performs implicit casting to expected types using SQL data type rules. If an operator or function is invalid for the
provided argument, Databricks Runtime raises an error.
Constant expression
An expression that is based only on literals or deterministic functions with no arguments. Databricks Runtime
can execute the expression and use the resulting constant where ordinarily literals are required.
Boolean expression
An expression with a result type of BOOLEAN . A Boolean expression is also sometimes referred to as a condition
or a predicate .
Scalar subquery
An expression of the form ( query ) . The query must return a table that has one column and at most one row.
If the query returns no row, the result is NULL . If the query returns more than one row, Databricks Runtime
returns an error. Otherwise, the result is the value returned by the query.
Simple expression
An expression that does not contain a query , such as a scalar subquery or an EXISTS predicate.
Examples
> SELECT 1;
1
> SELECT 1 + 1;
2
> SELECT 2 * 1 + 2;
4
Reserved words are literals used as keywords by the SQL language which should not be used as identifiers to
avoid unexpected behavior.
Reserved schema names have special meaning to Databricks Runtime.
Reserved words
Databricks Runtime does not formally disallow any specific literals from being used as identifiers.
However, to use any of the following list of identifiers as a table alias, you must surround the name with backticks (`).
ANTI
CROSS
EXCEPT
FULL
INNER
INTERSECT
JOIN
LATERAL
LEFT
MINUS
NATURAL
ON
RIGHT
SEMI
UNION
USING
Examples
-- Using SQL keywords
> CREATE TEMPORARY VIEW where(where) AS (VALUES (1));
-- Usage of NULL
> SELECT NULL, `null`, T.null FROM VALUES(1) AS T(null);
NULL 1 1
Related articles
names
identifiers
Data types
7/21/2022 • 5 minutes to read
STRUCT<[fieldName:fieldType [NOT NULL][COMMENT str][, …]]> : Represents values with the structure described by a sequence of fields.
Data type classification
Data types are grouped into the following classes:
Integral numeric types represent whole numbers:
TINYINT
SMALLINT
INT
BIGINT
Exact numeric types represent base-10 numbers:
Integral numeric
DECIMAL
Binar y floating point types use exponents and a binary representation to cover a large range of numbers:
FLOAT
DOUBLE
Numeric types represents all numeric data types:
Exact numeric
Binary floating point
Date-time types represent date and time components:
DATE
TIMESTAMP
Simple types are types defined by holding singleton values:
Numeric
Date-time
BINARY
BOOLEAN
INTERVAL
STRING
Complex types are composed of multiple components of complex or simple types:
ARRAY
MAP
STRUCT
Language mappings
Scala
Spark SQL data types are defined in the package org.apache.spark.sql.types . You access them by importing the
package:
import org.apache.spark.sql.types._
(Table: SQL type, data type, value type, and the API to access or create the data type.)
Java
Spark SQL data types are defined in the package org.apache.spark.sql.types . To access or create a data type,
use factory methods provided in org.apache.spark.sql.types.DataTypes .
(Table: SQL type, data type, value type, and the API to access or create the data type.)
Python
Spark SQL data types are defined in the package pyspark.sql.types . You access them by importing the package:
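from pyspark.sql.types import *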
R
(Table: SQL type, data type, value type, and the API to access or create the data type.)
(1) Numbers are converted to the domain at runtime. Make sure that numbers are within range.
(2) The optional value defaults to TRUE .
(3) Interval types
YearMonthIntervalType([startField,] endField) : Represents a year-month interval which is made up of a
contiguous subset of the following fields:
startField is the leftmost field, and endField is the rightmost field of the type. Valid values of
startField and endField are 0(YEAR) and 1(MONTH) .
DayTimeIntervalType([startField,] endField) : Represents a day-time interval which is made up of a
contiguous subset of the following fields:
startField is the leftmost field, and endField is the rightmost field of the type. Valid values of
startField and endField are 0(DAY) , 1(HOUR) , 2(MINUTE) , 3(SECOND) .
(4) StructType
StructType(fields) Represents values with the structure described by a sequence, list, or array of
StructField s (fields). Two fields with the same name are not allowed.
StructField(name, dataType, nullable) Represents a field in a StructType . The name of a field is indicated
by name . The data type of a field is indicated by dataType . nullable indicates if values of these fields can
have null values; allowing null values is the default.
Related articles
Special floating point values
SQL data type rules
7/21/2022 • 6 minutes to read
Databricks Runtime uses several rules to resolve conflicts among data types:
Promotion safely expands a type to a wider type.
Implicit downcasting narrows a type. The opposite of promotion.
Implicit crosscasting transforms a type into a type of another type family.
You can also explicitly cast between many types:
cast function casts between most types, and returns errors if it cannot.
try_cast function works like cast function but returns NULL when passed invalid values.
Other builtin functions cast between types using provided format directives.
Type promotion
Type promotion is the process of casting a type into another type of the same type family which contains all
possible values of the original type. Therefore type promotion is a safe operation. For example TINYINT has a
range from -128 to 127 . All its possible values can be safely promoted to INTEGER .
TINYINT: TINYINT -> SMALLINT -> INT -> BIGINT -> DECIMAL -> FLOAT (1) -> DOUBLE
SMALLINT: SMALLINT -> INT -> BIGINT -> DECIMAL -> FLOAT (1) -> DOUBLE
INT: INT -> BIGINT -> DECIMAL -> FLOAT (1) -> DOUBLE
DOUBLE: DOUBLE
TIMESTAMP: TIMESTAMP
BINARY: BINARY
BOOLEAN: BOOLEAN
INTERVAL: INTERVAL
STRING: STRING
(1) For least common type resolution FLOAT is skipped to avoid loss of precision.
(2) For a complex type the precedence rule applies recursively to its component elements.
Strings and NULL
Special rules apply for STRING and untyped NULL :
NULL can be promoted to any other type.
STRING can be promoted to BIGINT , BINARY , BOOLEAN , DATE , DOUBLE , INTERVAL , and TIMESTAMP . If the
actual string value cannot be cast to least common type Databricks Runtime raises a runtime error. When
promoting to INTERVAL the string value must match the intervals units.
Type precedence graph
This is a graphical depiction of the precedence hierarchy, combining the type precedence list and strings and
NULLs rules.
-- INTEGER and DATE do not share a precedence chain or support crosscasting in either direction.
> SELECT typeof(coalesce(1, DATE'2020-01-01'));
Error: Incompatible types [INT, DATE]
-- Both are ARRAYs and the elements have a least common type
> SELECT typeof(coalesce(ARRAY(1Y), ARRAY(1L)))
ARRAY<BIGINT>
-- The least common type is a BIGINT, but the value is not BIGINT.
> SELECT coalesce('6.1', 5);
Error: 6.1 is not a BIGINT
The substring function expects arguments of type STRING for the string and INTEGER for the start and length
parameters.
-- Promotion of TINYINT to INTEGER
> SELECT substring('hello', 1Y, 2);
he
-- No casting
> SELECT substring('hello', 1, 2);
he
Related
cast
Data types
Functions
try_cast
Datetime patterns
7/21/2022 • 6 minutes to read
There are several common scenarios for datetime usage in Databricks Runtime:
CSV and JSON data sources use the pattern string for parsing and formatting datetime content.
Datetime functions related to convert STRING to and from DATE or TIMESTAMP . For example:
unix_timestamp
date_format
to_unix_timestamp
from_unixtime
to_date
to_timestamp
from_utc_timestamp
to_utc_timestamp
Pattern table
Databricks Runtime uses pattern letters in the following table for date and timestamp parsing and formatting:
SYMBOL  MEANING  PRESENTATION  EXAMPLES
d  day-of-month  number(3)  28
a  am-pm-of-day  am-pm  PM
m  minute-of-hour  number(2)  30
s  second-of-minute  number(2)  55
'MM' or 'LL' : Month number in a year starting from 1. Zero padding is added for months 1-9.
'MMM' : Short textual representation in the standard form. The month pattern should be a part of a
date pattern, not just a stand-alone month, except in locales where there is no difference between
standard and stand-alone forms, such as in English.
'MMMM' : Full textual month representation in the standard form. It is used for parsing/formatting
months as a part of dates/timestamps.
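A short sketch of the 'MMM' and 'MMMM' forms, assuming the default English locale:

> SELECT date_format(DATE'2020-01-15', 'MMM'), date_format(DATE'2020-01-15', 'MMMM');
 Jan January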
Related articles
date_format function
from_unixtime function
from_utc_timestamp function
to_date function
to_timestamp function
to_utc_timestamp function
to_unix_timestamp function
unix_timestamp function
Functions
7/21/2022 • 2 minutes to read
Spark SQL provides two function features to meet a wide range of needs: built-in functions and user-defined
functions (UDFs).
Built-in functions
This article presents the usages and descriptions of categories of frequently used built-in functions for
aggregation, arrays and maps, dates and timestamps, and JSON data.
Built-in functions
User-defined functions
UDFs allow you to define your own functions when the system’s built-in functions are not enough to perform
the desired task. To use UDFs, you first define the function, then register the function with Spark, and finally call
the registered function. A UDF can act on a single row or act on multiple rows at once. Spark SQL also supports
integration of existing Hive implementations of UDFs, user defined aggregate functions (UDAF), and user
defined table functions (UDTF).
User-defined aggregate functions (UDAFs)
Integration with Hive UDFs, UDAFs, and UDTFs
User-defined scalar functions (UDFs)
Built-in functions
7/21/2022 • 28 minutes to read
This article presents links to and descriptions of built-in operators, and functions for strings and binary types,
numeric scalars, aggregations, windows, arrays, maps, dates and timestamps, casting, CSV data, JSON data,
XPath manipulation, and miscellaneous functions.
Also see:
Alphabetic list of built-in functions
OPERATOR  SYNTAX  DESCRIPTION
and  expr1 and expr2  Returns the logical AND of expr1 and expr2 .
between  expr1 [not] between expr2 and expr3  Tests whether expr1 is greater than or equal to expr2 and less than or equal to expr3 .
ilike  str [not] ilike {ANY|SOME|ALL} ([pattern[, ...]])  Returns true if str matches any/all patterns case-insensitively.
is distinct  expr1 is [not] distinct from expr2  Tests whether the arguments have different values where NULLs are considered as comparable values.
like  str [not] like {ANY|SOME|ALL} ([pattern[, ...]])  Returns true if str matches any/all patterns.
regexp  str [not] regexp regex  Returns true if str matches regex .
regexp_like  str [not] regexp_like regex  Returns true if str matches regex .
rlike  str [not] rlike regex  Returns true if str matches regex .
Operator precedence
PRECEDENCE  OPERATOR
1  : , :: , [ ]
2  - (unary), + (unary), ~
3  * , / , % , div
4  + , - , ||
5  &
6  ^
7  |
9  not , exists
11  and
12  or
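A trivial illustration of precedence (multiplication binds tighter than addition):

> SELECT 2 + 3 * 4;
 14
> SELECT (2 + 3) * 4;
 20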
aes_decrypt(expr, key[, mode[, padding]]) Decrypts a binary expr using AES encryption.
aes_encrypt(expr, key[, mode[, padding]]) Encrypts a binary expr using AES encryption.
ascii(str) Returns the ASCII code point of the first character of str .
btrim(str [, trimStr]) Returns str with leading and trailing characters removed.
charindex(substr, str[, pos]) Returns the position of the first occurrence of substr in
str after position pos .
decode(expr, charSet) Translates binary expr to a string using the character set
encoding charSet .
format_string(strfmt[, obj1 [, …]]) Returns a formatted string from printf-style format strings.
str ilike (pattern[ESCAPE escape]) Returns true if str matches pattern with escape case
insensitively.
levenshtein(str1, str2) Returns the Levenshtein distance between the strings str1
and str2 .
str like (pattern[ESCAPE escape]) Returns true if str matches pattern with escape .
locate(substr, str[, pos]) Returns the position of the first occurrence of substr in
str after position pos .
lpad(expr, len[, pad]) Returns expr , left-padded with pad to a length of len .
overlay(input PLACING replace FROM pos [FOR len]) Replaces input with replace that starts at pos and is
of length len .
position(substr, str[, pos]) Returns the position of the first occurrence of substr in
str after position pos .
printf(strfmt[, obj1 [, …]]) Returns a formatted string from printf-style format strings.
regexp_extract(str, regexp[, idx]) Extracts the first string in str that matches the regexp
expression and corresponds to the regex group index.
regexp_extract_all(str, regexp[, idx]) Extracts all strings in str that match the regexp
expression and correspond to the regex group index.
regexp_replace(str, regexp, rep[, position]) Replaces all substrings of str that match regexp with
rep .
right(str, len) Returns the rightmost len characters from the string str
.
rpad(expr, len[, pad]) Returns expr , right-padded with pad to a length of len .
split(str, regex[, limit]) Splits str around occurrences that match regex and
returns an array with a length of at most limit .
split_part(str, delim, partNum) Splits str around occurrences of delim and returns the
partNum part.
substr(expr, pos[, len]) Returns the substring of expr that starts at pos and is of
length len .
substr(expr FROM pos[ FOR len]) Returns the substring of expr that starts at pos and is of
length len .
substring(expr, pos[, len]) Returns the substring of expr that starts at pos and is of
length len .
substring(expr FROM pos[ FOR len]) Returns the substring of expr that starts at pos and is of
length len .
substring_index(expr, delim, count) Returns the substring of expr before count occurrences
of the delimiter delim .
translate(expr, from, to) Returns an expr where all characters in from have been
replaced with those in to .
trim([[BOTH | LEADING | TRAILING] [trimStr] FROM] str) Trim characters from a string.
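A brief sketch using a few of the string functions listed above:

> SELECT lpad('7', 3, '0'), split_part('a,b,c', ',', 2), levenshtein('kitten', 'sitting');
 007 b 3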
expr1 & expr2 Returns the bitwise AND of expr1 and expr2 .
atan2(exprY, exprX) Returns the angle in radians between the positive x-axis of a
plane and the point specified by the coordinates ( exprX ,
exprY ).
bit_reverse(expr) Returns the value obtained by reversing the order of the bits
in the argument.
divisor div dividend Returns the integral part of the division of divisor by
dividend .
floor(expr[,targetScale]) Returns the largest number not greater than expr rounded
down to targetScale digits relative to the decimal point.
try_add(expr1, expr2) Returns the sum of expr1 and expr2 , or NULL in case of
error.
Aggregate functions
FUNCTION  DESCRIPTION
bit_and(expr) Returns the bitwise AND of all input values in the group.
bit_xor(expr) Returns the bitwise XOR of all input values in the group.
bool_and(expr) Returns true if all values in expr are true within the group.
bool_or(expr) Returns true if at least one value in expr is true within the
group.
count(expr[, …]) Returns the number of rows in a group for which the
supplied expressions are all non-null.
count_if(expr) Returns the number of true values for the group in expr .
count_min_sketch(expr, epsilon, confidence, seed) Returns a count-min sketch of all values in the group in
expr with the epsilon , confidence and seed .
every(expr) Returns true if all values of expr in the group are true.
last(expr[,ignoreNull]) Returns the last value of expr for the group of rows.
last_value(expr[,ignoreNull]) Returns the last value of expr for the group of rows.
percentile(expr, percentage [,frequency]) Returns the exact percentile value of expr at the specified
percentage .
percentile_cont(pct) WITHIN GROUP (ORDER BY key) Returns the interpolated percentile of the key within the
group.
percentile_disc(pct) WITHIN GROUP (ORDER BY key) Returns the discrete percentile of the key within the group.
regr_count(yExpr, xExpr) Returns the number of non-null value pairs yExpr , xExpr
in the group.
regr_sxx(yExpr, xExpr) Returns the sum of squares of the xExpr values of a group
where xExpr and yExpr are NOT NULL.
regr_syy(yExpr, xExpr) Returns the sum of squares of the yExpr values of a group
where xExpr and yExpr are NOT NULL.
ntile(n) Divides the rows for each window partition into n buckets
ranging from 1 to at most n .
lag(expr[,offset[,default]]) Returns the value of expr from a preceding row within the
partition.
nth_value(expr, offset[, ignoreNulls]) Returns the value of expr at a specific offset in the
window.
Array functions
FUNCTION  DESCRIPTION
arrays_zip(array1 [, …]) Returns a merged array of structs in which the N-th struct
contains all N-th values of the input arrays.
exists(expr, pred) Returns true if pred is true for any element in expr .
forall(expr, predFunc) Tests whether predFunc holds for all elements in the array.
zip_with(expr1, expr2, func) Merges the arrays in expr1 and expr2 , element-wise,
into a single array using func .
Map functions
FUNCTION  DESCRIPTION
map([{key1, value1}[, …]]) Creates a map with the specified key-value pairs.
map_filter(expr, func) Filters entries in the map in expr using the function func .
map_from_arrays(keys, values) Creates a map with a pair of the keys and values arrays.
map_zip_with(map1, map2, func) Merges map1 and map2 into a single map.
transform_keys(expr, func) Transforms keys in a map in expr using the function func
.
try_element_at(mapExpr, key) Returns the value of mapExpr for key , or NULL if key
does not exist.
FUNCTION  DESCRIPTION
datediff(unit, start, stop) Returns the difference between two timestamps measured in
unit s.
divisor div dividend Returns the integral part of the division of interval divisor
by interval dividend .
last_day(expr) Returns the last day of the month that the date belongs to.
make_dt_interval([days[, hours[, mins[, secs]]]]) Creates a day-time interval from days , hours , mins ,
and secs .
make_interval(years, months, weeks, days, hours, mins, secs) Deprecated: Creates an interval from years , months ,
weeks , days , hours , mins and secs .
next_day(expr,dayOfWeek) Returns the first date which is later than expr and named
as in dayOfWeek .
quarter(expr) Returns the quarter of the year for expr in the range 1 to
4.
timestampdiff(unit, start, stop) Returns the difference between two timestamps measured in
unit s.
trunc(expr, fmt) Returns a date with a portion of the date truncated to
the unit specified by the format model fmt .
try_add(expr1, expr2) Returns the sum of expr1 and expr2 , or NULL in case of
error.
window(expr, width[, step[, start]]) Creates a hopping based sliding-window over a timestamp
expression.
FUNCTION  DESCRIPTION
cast(expr AS type) Casts the value expr to the target data type type .
expr :: type Casts the value expr to the target data type type .
make_dt_interval([days[, hours[, mins[, secs]]]]) Creates a day-time interval from days , hours , mins ,
and secs .
make_interval(years, months, weeks, days, hours, mins, secs) Creates an interval from years , months , weeks , days ,
hours , mins and secs .
map([{key1, value1} [, …]]) Creates a map with the specified key-value pairs.
named_struct({name1, val1} [, …]) Creates a struct with the specified field names and values.
try_cast(expr AS type) Casts the value expr to the target data type type safely.
CSV functions
FUNCTION  DESCRIPTION
from_csv(csvStr, schema[, options]) Returns a struct value with the csvStr and schema .
to_csv(expr[, options]) Returns a CSV string with the specified struct value.
JSON functions
FUNCTION  DESCRIPTION
from_json(jsonStr, schema[, options]) Returns a struct value with the jsonStr and schema .
to_json(expr[, options]) Returns a JSON string with the struct specified in expr .
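A short usage sketch of the two JSON functions:

> SELECT from_json('{"a":1, "b":0.8}', 'a INT, b DOUBLE');
 {1, 0.8}
> SELECT to_json(named_struct('a', 1, 'b', 2));
 {"a":1,"b":2}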
XPath functions
FUNCTION  DESCRIPTION
xpath(xml, xpath) Returns values within the nodes of xml that match xpath .
xpath_string(xml, xpath) Returns the contents of the first XML node that matches the
XPath expression.
Miscellaneous functions
FUNCTION  DESCRIPTION
CASE expr { WHEN opt1 THEN res1 } […] [ELSE def] END Returns resN for the first optN that equals expr or
def if none matches.
CASE { WHEN cond1 THEN res1 } […] [ELSE def] END Returns resN for the first condN that evaluates to true, or
def if none found.
decode(expr, { key, value } [, …] [,defValue]) Returns the value matching the key.
greatest(expr1 [, …]) Returns the largest value of all arguments, skipping null
values.
input_file_block_start() Returns the start offset in bytes of the block being read.
input_file_name() Returns the name of the file being read, or empty string if
not available.
least(expr1 [, …]) Returns the smallest value of all arguments, skipping null
values.
range(start, end [, step [, numParts]]) Returns a table of values within a specified range.
window(expr, width[, step [, start]]) Creates a hopping based sliding-window over a timestamp
expression.
Alphabetic list of built-in functions
This article provides an alphabetically ordered list of built-in functions and operators in Databricks Runtime.
abs function
acos function
acosh function
add_months function
aes_decrypt function
aes_encrypt function
aggregate function
& (ampersand sign) operator
and predicate
any aggregate function
approx_count_distinct aggregate function
approx_percentile aggregate function
approx_top_k aggregate function
array function
array_agg aggregate function
array_contains function
array_distinct function
array_except function
array_intersect function
array_join function
array_max function
array_min function
array_position function
array_remove function
array_repeat function
array_size function
array_sort function
array_union function
arrays_overlap function
arrays_zip function
ascii function
asin function
asinh function
assert_true function
* (asterisk sign) operator
atan function
atan2 function
atanh function
avg aggregate function
!= (bangeq sign) operator
! (bang sign) operator
base64 function
between predicate
bigint function
bin function
binary function
bit_and aggregate function
bit_count function
bit_get function
bit_length function
bit_or aggregate function
bit_reverse function
bit_xor aggregate function
bool_and aggregate function
bool_or aggregate function
boolean function
[ ] (bracket sign) operator (Databricks Runtime)
bround function
btrim function
cardinality function
^ (caret sign) operator
case expression
cast function
cbrt function
ceil function
ceiling function
char function
char_length function
character_length function
charindex function
chr function
coalesce function
collect_list aggregate function
collect_set aggregate function
:: (colon colon sign) operator
: (colon sign) operator (Databricks Runtime)
concat function
concat_ws function
contains function
conv function
corr aggregate function
cos function
cosh function
cot function
count aggregate function
count_if aggregate function
count_min_sketch aggregate function
covar_pop aggregate function
covar_samp aggregate function
crc32 function
csc function
cube function
cume_dist analytic window function
current_catalog function
current_database function
current_date function
current_schema function
current_timestamp function
current_timezone function
current_user function
current_version function
date function
date_add function
date_format function
date_from_unix_date function
date_part function
date_sub function
date_trunc function
dateadd function
datediff function
datediff (timestamp) function
day function
dayofmonth function
dayofweek function
dayofyear function
decimal function
decode function
decode (character set) function
degrees function
dense_rank ranking window function
div operator
double function
e function
element_at function
elt function
encode function
endswith function
== (eq eq sign) operator
= (eq sign) operator
every aggregate function
exists function
exp function
explode table-valued generator function
explode_outer table-valued generator function
expm1 function
extract function
factorial function
filter function
find_in_set function
first aggregate function
first_value aggregate function
flatten function
float function
floor function
forall function
format_number function
format_string function
from_csv function
from_json function
from_unixtime function
from_utc_timestamp function
get_json_object function
getbit function
greatest function
grouping function
grouping_id function
>= (gt eq sign) operator
> (gt sign) operator
hash function
hex function
hour function
hypot function
if function
iff function
ifnull function
ilike operator
in predicate
initcap function
inline table-valued generator function
inline_outer table-valued generator function
input_file_block_length function
input_file_block_start function
input_file_name function
instr function
int function
is_member function
is distinct operator
is false operator
isnan function
isnotnull function
isnull function
is null operator
is true operator
java_method function
json_array_length function
json_object_keys function
json_tuple table-valued generator function
kurtosis aggregate function
lag analytic window function
last aggregate function
last_day function
last_value aggregate function
lcase function
lead analytic window function
least function
left function
length function
levenshtein function
like operator
ln function
locate function
log function
log10 function
log1p function
log2 function
lower function
lpad function
<=> (lt eq gt sign) operator
<= (lt eq sign) operator
<> (lt gt sign) operator
ltrim function
< (lt sign) operator
make_date function
make_dt_interval function
make_interval function
make_timestamp function
make_ym_interval function
map function
map_concat function
map_contains_key function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_keys function
map_values function
map_zip_with function
max aggregate function
max_by aggregate function
md5 function
mean aggregate function
min aggregate function
min_by aggregate function
- (minus sign) operator
- (minus sign) unary operator
minute function
mod function
monotonically_increasing_id function
month function
months_between function
named_struct function
nanvl function
negative function
next_day function
not operator
now function
nth_value analytic window function
ntile ranking window function
nullif function
nvl function
nvl2 function
octet_length function
or operator
overlay function
parse_url function
percent_rank ranking window function
percentile aggregate function
percentile_approx aggregate function
percentile_cont aggregate function
percentile_disc aggregate function
% (percent sign) operator
pi function
|| (pipe pipe sign) operator
| (pipe sign) operator
+ (plus sign) operator
+ (plus sign) unary operator
pmod function
posexplode table-valued generator function
posexplode_outer table-valued generator function
position function
positive function
pow function
power function
printf function
quarter function
radians function
raise_error function
rand function
randn function
random function
range table-valued function
rank ranking window function
reduce function
reflect function
regexp operator
regexp_extract function
regexp_extract_all function
regexp_like function
regexp_replace function
regr_avgx aggregate function
regr_avgy aggregate function
regr_count aggregate function
regr_r2 aggregate function
regr_sxx aggregate function
regr_sxy aggregate function
regr_syy aggregate function
repeat function
replace function
reverse function
right function
rint function
rlike operator
round function
row_number ranking window function
rpad function
rtrim function
schema_of_csv function
schema_of_json function
sec function
second function
sentences function
sequence function
sha function
sha1 function
sha2 function
shiftleft function
shiftright function
shiftrightunsigned function
shuffle function
sign function
signum function
sin function
sinh function
size function
skewness aggregate function
/ (slash sign) operator
slice function
smallint function
some aggregate function
sort_array function
soundex function
space function
spark_partition_id function
split function
split_part function
sqrt function
stack table-valued generator function
startswith function
std aggregate function
stddev aggregate function
stddev_pop aggregate function
stddev_samp aggregate function
str_to_map function
string function
struct function
substr function
substring function
substring_index function
sum aggregate function
tan function
tanh function
~ (tilde sign) operator
timestamp function
timestamp_micros function
timestamp_millis function
timestamp_seconds function
timestampadd function
timestampdiff function
tinyint function
to_csv function
to_date function
to_json function
to_number function
to_timestamp function
to_unix_timestamp function
to_utc_timestamp function
transform function
transform_keys function
transform_values function
translate function
trim function
trunc function
try_add function
try_avg aggregate function
try_cast function
try_divide function
try_element_at function
try_multiply function
try_subtract function
try_sum aggregate function
try_to_number function
typeof function
ucase function
unbase64 function
unhex function
unix_date function
unix_micros function
unix_millis function
unix_seconds function
unix_timestamp function
upper function
uuid function
var_pop aggregate function
var_samp aggregate function
variance aggregate function
version function
weekday function
weekofyear function
width_bucket function
window grouping expression
xpath function
xpath_boolean function
xpath_double function
xpath_float function
xpath_int function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xxhash64 function
year function
zip_with function
Azure Databricks lambda functions
7/21/2022 • 2 minutes to read
Syntax
{ param -> expr |
(param1 [, ...] ) -> expr }
Parameters
paramN : An identifier used by the parent function to pass arguments for the lambda function.
expr : Any simple expression referencing paramN , which does not contain a subquery.
Returns
The result type is defined by the result type of expr .
If there is more than one paramN , the parameter names must be unique. The types of the parameters are set by
the invoking function. The expression must be valid for these types and the result type must match the defined
expectations of the invoking functions.
Examples
The array_sort function expects a lambda function with two parameters. The parameter types will be the
type of the elements of the array to be sorted. The expression is expected to return an INTEGER where -1
means param1 < param2 , 0 means param1 = param2 , and 1 otherwise.
To sort an ARRAY of STRING in a right-to-left lexical order, you can use the lambda function shown below.
Lambda functions are defined and used ad hoc; the function definition itself is the argument:
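A minimal sketch of such a lambda, comparing the reversed strings so that words are ordered right to left:

> SELECT array_sort(array('Hello', 'World'),
         (p1, p2) -> CASE WHEN p1 = p2 THEN 0
                          WHEN reverse(p1) < reverse(p2) THEN -1
                          ELSE 1 END);
 [World, Hello]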
Related articles
aggregate function
array_sort function
exists function
filter function
forall function
map_filter function
map_zip_with function
transform function
transform_keys function
transform_values function
zip_with function
Window functions
7/21/2022 • 3 minutes to read
Functions that operate on a group of rows, referred to as a window, and calculate a return value for each row
based on the group of rows. Window functions are useful for processing tasks such as calculating a moving
average, computing a cumulative statistic, or accessing the value of rows given the relative position of the
current row.
Syntax
function OVER { window_name | ( window_name ) | window_spec }
function:
{ ranking_function | analytic_function | aggregate_function }
window_spec:
( [ PARTITION BY partition [ , ... ] ] [ order_by ] [ window_frame ] )
Parameters
function
The function operating on the window. Different classes of functions support different configurations of
window specifications.
ranking_function
Any of the Ranking window functions.
If specified the window_spec must include an ORDER BY clause, but not a window_frame clause.
analytic_function
Any of the Analytic window functions.
aggregate_function
Any of the Aggregate functions.
If specified the function must not include a FILTER clause.
window_spec
This clause defines how the rows will be grouped, sorted within the group, and which rows within a
partition a function operates on.
partition
One or more expressions used to specify a group of rows defining the scope on which the function
operates. If no PARTITION BY clause is specified, the partition comprises all rows.
order_by
The ORDER BY clause specifies the order of rows within a partition.
window_frame
The window frame clause specifies a sliding subset of rows within the partition on which the
aggregate or analytics function operates.
You can specify SORT BY as an alias for ORDER BY.
You can also specify CLUSTER BY or DISTRIBUTE BY as an alias for PARTITION BY.
Examples
> CREATE TABLE employees
(name STRING, dept STRING, salary INT, age INT);
> INSERT INTO employees
VALUES ('Lisa', 'Sales', 10000, 35),
('Evan', 'Sales', 32000, 38),
('Fred', 'Engineering', 21000, 28),
('Alex', 'Sales', 30000, 33),
('Tom', 'Engineering', 23000, 33),
('Jane', 'Marketing', 29000, 28),
('Jeff', 'Marketing', 35000, 38),
('Paul', 'Engineering', 29000, 23),
('Chloe', 'Engineering', 23000, 25);
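A minimal sketch of a window function query over this table (not part of the original example set), ranking employees by salary within each department:

> SELECT name, dept, salary,
         RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
    FROM employees;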
Related articles
SELECT
ORDER BY
window frame clause
Aggregate functions
Ranking window functions
Analytic window functions
Ranking window functions
User-defined scalar functions (UDFs)
7/21/2022 • 2 minutes to read
User-defined scalar functions (UDFs) are user-programmable routines that act on one row. This documentation
lists the classes that are required for creating and registering UDFs. It also contains examples that demonstrate
how to define and register UDFs and invoke them in Spark SQL.
UserDefinedFunction class
To define the properties of a user-defined function, you can use some of the methods defined in this class.
asNonNullable(): UserDefinedFunction : Updates UserDefinedFunction to non-nullable.
asNondeterministic(): UserDefinedFunction : Updates UserDefinedFunction to nondeterministic.
withName(name: String): UserDefinedFunction : Updates UserDefinedFunction with a given name.
Examples
Scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
Java
import org.apache.spark.sql.*;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import static org.apache.spark.sql.functions.udf;
import org.apache.spark.sql.types.DataTypes;
Related statements
User-defined aggregate functions (UDAFs)
Integration with Hive UDFs, UDAFs, and UDTFs
User-defined aggregate functions (UDAFs)
7/21/2022 • 4 minutes to read
User-defined aggregate functions (UDAFs) are user-programmable routines that act on multiple rows at once
and return a single aggregated value as a result. This documentation lists the classes that are required for
creating and registering UDAFs. It also contains examples that demonstrate how to define and register UDAFs in
Scala and invoke them in Spark SQL.
Aggregator
Syntax Aggregator[-IN, BUF, OUT]
A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements
of a group and reduce them to a single value.
IN : The input type for the aggregation.
BUF : The type of the intermediate value of the reduction.
OUT : The type of the final output result.
bufferEncoder : Encoder[BUF]
The Encoder for the intermediate value type.
finish(reduction: BUF): OUT
Transform the output of the reduction.
merge(b1: BUF, b2: BUF): BUF
Merge two intermediate values.
outputEncoder : Encoder[OUT]
The Encoder for the final output value type.
reduce(b: BUF, a: IN): BUF
Aggregate input value a into the current intermediate value. For performance, the function may modify b
and return it instead of constructing a new object for b .
zero: BUF
The initial value of the intermediate result for this aggregation.
Examples
Type-safe user-defined aggregate functions
User-defined aggregations for strongly typed Datasets revolve around the Aggregator abstract class. For
example, a type-safe user-defined average can look like:
Untyped user-defined aggregate functions
Typed aggregations, as described above, may also be registered as untyped aggregating UDFs for use with
DataFrames. For example, a user-defined average for untyped DataFrames can look like:
Scala
val df = spark.read.format("json").load("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+
Java
import java.io.Serializable;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Aggregator;
import org.apache.spark.sql.functions;
public static class Average implements Serializable {
private long sum;
private long count;
Dataset<Row> df = spark.read().format("json").load("examples/src/main/resources/employees.json");
df.createOrReplaceTempView("employees");
df.show();
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+
SQL
-- Compile and place UDAF MyAverage in a JAR file called `MyAverage.jar` in /tmp.
CREATE FUNCTION myAverage AS 'MyAverage' USING JAR '/tmp/MyAverage.jar';
Related statements
Scalar user defined functions (UDFs)
Integration with Hive UDFs, UDAFs, and UDTFs
Integration with Hive UDFs, UDAFs, and UDTFs
7/21/2022 • 2 minutes to read
Spark SQL supports integration of Hive UDFs, UDAFs, and UDTFs. Similar to Spark UDFs and UDAFs, Hive UDFs
work on a single row as input and generate a single row as output, while Hive UDAFs operate on multiple rows
and return a single aggregated row as a result. In addition, Hive also supports UDTFs (User Defined Tabular
Functions) that act on one row as input and return multiple rows as output. To use Hive UDFs/UDAFs/UDTFs, the
user should register them in Spark, and then use them in Spark SQL queries.
Examples
Hive has two UDF interfaces: UDF and GenericUDF. An example below uses GenericUDFAbs derived from
GenericUDF .
SELECT * FROM t;
+-----+
|value|
+-----+
| -1.0|
| 2.0|
| -3.0|
+-----+
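A hedged sketch of how the GenericUDFAbs example could be registered and invoked; the function name testUDF is illustrative:

-- Register the Hive GenericUDF and apply it to the table above.
> CREATE TEMPORARY FUNCTION testUDF AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs';
> SELECT testUDF(value) FROM t;
 1.0
 2.0
 3.0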
SELECT * FROM t;
+------+
| value|
+------+
|[1, 2]|
|[3, 4]|
+------+
Hive has two UDAF interfaces: UDAF and GenericUDAFResolver. An example below uses GenericUDAFSum
derived from GenericUDAFResolver .
SELECT * FROM t;
+---+-----+
|key|value|
+---+-----+
| a| 1|
| a| 2|
| b| 3|
+---+-----+
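A hedged sketch of registering and invoking the GenericUDAFSum example; the function name hive_sum is illustrative:

-- Register the Hive GenericUDAFResolver implementation and aggregate per key.
> CREATE TEMPORARY FUNCTION hive_sum AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum';
> SELECT key, hive_sum(value) FROM t GROUP BY key;
 a 3
 b 3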
JSON path expression
A JSON path expression is used to extract values from a JSON string using the : operator.
Syntax
{ { identifier | [ field ] | [ * ] | [ index ] }
[ . identifier | [ field ] | [ * ] | [ index ] ] [...] }
The brackets surrounding field , * and index are actual brackets and not indicating an optional syntax.
Parameters
identifier : A case insensitive identifier of a JSON field.
[ field ] : A bracketed case sensitive STRING literal identifying a JSON field.
[ * ] : Identifying all elements in a JSON array.
[ index ] : An integer literal identifying a specific element in a 0-based JSON array.
Returns
A STRING.
When a JSON field exists with an un-delimited null value, you will receive a SQL NULL value for that column,
not a null text value.
You can use :: operator to cast values to basic data types.
Use the from_json function to cast nested results into more complex data types, such as arrays or structs.
Notes
You can use an un-delimited identifier to refer to a JSON field if the name does not contain spaces or special
characters, and there is no field of the same name in a different case.
Use a delimited identifier if the name contains spaces or special characters and there is no field of the same
name in a different case.
The [ field ] notation can always be used, but requires you to exactly match the case of the field.
If Databricks Runtime cannot uniquely identify a field an error is returned. If no match is found for any field
Databricks Runtime returns NULL .
Examples
The following examples use the data created with the statement in Example data.
In this section:
Extract using identifier and delimiters
Extract nested fields
Extract values from arrays
NULL behavior
Cast values
Example data
Extract using identifier and delimiters
-- Use backticks to escape special characters. References are case insensitive when you use backticks.
-- Use brackets to make them case sensitive.
> SELECT raw:`zip code`, raw:`Zip Code`, raw:['fb:testid'] FROM store_data;
94025 94025 1234
-- Use brackets
> SELECT raw:['store']['bicycle'] FROM store_data;
'{ "price":19.95, "color":"red" }'
-- Index elements
> SELECT raw:store.fruit[0], raw:store.fruit[1] FROM store_data;
'{ "weight":8, "type":"apple" }' '{ "weight":9, "type":"pear" }'
NULL behavior
Cast values
-- price is returned as a double, not a string
> SELECT raw:store.bicycle.price::double FROM store_data
19.95
Example data
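A minimal sketch of example data consistent with the queries above; the original example data may contain additional fields, and the values here are inferred from the example output:

> CREATE TABLE store_data AS SELECT
  '{
     "store":{
        "fruit": [
          {"weight":8, "type":"apple"},
          {"weight":9, "type":"pear"}
        ],
        "bicycle":{ "price":19.95, "color":"red" }
      },
     "zip code":"94025",
     "fb:testid":"1234"
   }' AS raw;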
Partitions
A partition is composed of a subset of rows in a table that share the same value for a predefined subset of
columns called the partitioning columns. Using partitions can speed up queries against the table as well as data
manipulation.
To use partitions, you define the set of partitioning columns when you create a table by including the
PARTITIONED BY clause.
When inserting or manipulating rows in a table Databricks Runtime automatically dispatches rows into the
appropriate partitions.
You can also specify the partition directly using a PARTITION clause.
This syntax is also available for tables that don’t use Delta Lake format, to DROP, ADD or RENAME partitions
quickly by using the ALTER TABLE statement.
PARTITIONED BY
The PARTITIONED BY clause specifies a list of columns along which the new table is partitioned.
Syntax
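A minimal sketch of the clause, assuming the parameters described below:

PARTITIONED BY ( { partition_column [ column_type ] } [ , ... ] )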
Parameters
par tition_column
An identifier may reference a column_identifier in the table. If you specify more than one column there
must be no duplicates. If you reference all columns in the table’s column_specification an error is raised.
column_type
Unless the partition_column refers to a column_identifier in the table’s column_specification ,
column_type defines the data type of the partition_column .
Not all data types supported by Databricks Runtime are supported by all data sources.
Notes
Unless you define a Delta Lake table, partitioning columns that reference the columns in the column specification
are always moved to the end of the table.
PARTITION
You use the PARTITION clause to identify a partition to be queried or manipulated.
A partition is identified by naming all its columns and associating each with a value. You need not specify them
in a specific order.
Unless you are adding a new partition to an existing table, you may omit columns or values to indicate that the
operation applies to all partitions matching the subset of columns.
PARTITION ( { partition_column [ = partition_value | LIKE pattern ] } [ , ... ] )
Parameters
par tition_column
A column named as a partition column of the table. You may not specify the same column twice.
= partition_value
A literal of a data type matching the type of the partition column. If you omit a partition value the
specification will match all values for this partition column.
LIKE pattern
Examples
-- Use the PARTITIONED BY clause in a table definition
> CREATE TABLE student(university STRING,
major STRING,
name STRING)
PARTITIONED BY(university, major)
-- Drop all partitions from the named university, independent of the major.
> ALTER TABLE student DROP PARTITION(university = 'TU Kaiserslautern');
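A hedged sketch of targeting a specific partition when inserting; the values 'CS' and 'Alice' are illustrative:

-- Insert a row into a single, fully specified partition.
> INSERT INTO student PARTITION(university = 'TU Kaiserslautern', major = 'CS')
  VALUES ('Alice');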
Principal
7/21/2022 • 2 minutes to read
A principal is a user, service principal, or group known to the metastore. Principals can be granted privileges and
may own securable objects.
Syntax
{ `<user>@<domain-name>` |
`<sp-application-id>` |
group_name |
USERS }
Parameters
<user>@<domain-name>
An individual user. You must quote the identifier with back-ticks (`) due to the @ character.
<sp-application-id>
A service principal, specified by its applicationId value. You must quote the identifier with back-ticks (`)
due to the dash characters in the ID.
group_name
An identifier specifying a group of users or groups.
USERS
Examples
-- Granting a privilege to the user alf@melmak.et
> GRANT SELECT ON TABLE t TO `alf@melmak.et`;
Related
ALTER GROUP
CREATE GROUP
GRANT
REVOKE
Privileges and securable objects
7/21/2022 • 3 minutes to read
Securable objects
A securable object is an object defined in the metastore on which privileges can be granted to a principal.
To manage privileges on any object you must be its owner or an administrator.
Syntax
securable_object
{ ANONYMOUS FUNCTION |
ANY FILE |
CATALOG [ catalog_name ] |
{ SCHEMA | DATABASE } schema_name |
EXTERNAL LOCATION location_name |
FUNCTION function_name |
STORAGE CREDENTIAL credential_name |
[ TABLE ] table_name |
VIEW view_name }
Parameters
ANONYMOUS FUNCTION
ANY FILE
You can grant the privilege to SELECT and MODIFY any file in the filesystem.
CATALOG catalog_name
You can grant CREATE , CREATE_NAMED_FUNCTION , and USAGE on a catalog. The default catalog name is
hive_metastore . If the catalog name is hive_metastore you can also grant SELECT , READ_METADATA , and
MODIFY ; these privileges then apply to any existing and future securable object within the catalog.
Privilege types
CREATE
Create objects other than external user defined functions (UDF) within the catalog or schema.
CREATE_NAMED_FUNCTION
Add files to the Spark class path to create named non-SQL functions.
READ_METADATA
Discover the securable object in SHOW and interrogate the object in DESCRIBE
If the securable object is the hive_metastore catalog or a schema within it, granting READ_METADATA will
grant READ_METADATA on all current and future tables and views within the securable object.
READ FILES
Directly query files governed by the storage credential or external location.
SELECT
Query a table or view, invoke a user defined or anonymous function, or select ANY FILE . The user needs
SELECT on the table, view, or function, as well as USAGE on the object’s schema and catalog.
If the securable object is the hive_metastore or a schema within it, granting SELECT will grant SELECT on
all current and future tables and views within the securable object.
USAGE
Required, but not sufficient to reference any objects in a catalog or schema. The principal also needs to
have privileges on the individual securable objects.
WRITE FILES
Directly COPY INTO files governed by the storage credential or external location.
Privilege matrix
The following table shows which privileges are associated with which securable objects.
The privilege matrix maps each privilege type (for example MODIFY_CLASSPATH) to the securable objects it can
apply to: anonymous function, any file, catalog, schema, external location, function, storage credential, table,
and view.
(HMS) marks privileges that only apply to securable objects in the hive_metastore catalog.
Examples
-- Grant a privilege to the user alf@melmak.et
> GRANT SELECT ON TABLE t TO `alf@melmak.et`;
Related
GRANT
Principal
REVOKE
External locations and storage credentials
7/21/2022 • 3 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
To interface with storage not managed by Databricks Runtime in a secure manner use a:
storage credential
A SQL object used to abstract long term credentials from cloud storage providers.
Since: Databricks Runtime 10.3
external location
A SQL Object used to associate a URL with a storage credential.
Since: Databricks Runtime 10.3
external table
A table with a storage path contained within an external location.
Storage credential
A storage credential is a securable SQL object encapsulating an identity for the cloud service provider, any of:
An AWS IAM role
An Azure service principal
Once a storage credential is created access to it can be granted to principals (users and groups).
A user or group with permission to use a storage credential can access any storage path covered by the storage
credential by using WITH (CREDENTIAL = credential) in your SQL command.
For more fine-grained access control, combine a storage credential with an external location.
Storage credential names are unqualified and must be unique within the metastore.
Related articles
Create a storage credential (CLI)
ALTER STORAGE CREDENTIAL
DROP STORAGE CREDENTIAL
DESCRIBE STORAGE CREDENTIAL
SHOW STORAGE CREDENTIALS
GRANT
REVOKE
External location
An external location is a securable SQL object that combines a storage path with a storage credential that
authorizes access to that path.
After an external location is created, you can grant access to it to account-level principals (users and groups).
A user or group with permission to use an external location can access any storage path within the location’s
path without direct access to the storage credential.
To further refine access control you can use GRANT on external tables to encapsulate access to individual files
within an external location.
External location names are unqualified and must be unique within the metastore.
The storage path of any external location may not be contained within another external location’s storage path,
or within an external table’s storage path using an explicit storage credential.
External table
An external table is a table that references an external storage path by using a LOCATION clause.
The storage path should be an existing external location to which you have been granted access.
Alternatively you can reference a storage credential to which you have been granted access.
Using external tables abstracts away the storage path, external location, and storage credential for users who
are granted access to the external table.
WARNING
To avoid accidental data loss, do not register a schema (database) to a location with existing data or create new external
tables in a location managed by a schema. Dropping a schema will recursively delete all data files in the managed location.
-- The ceo can directly read from any storage path using my_azure_storage_cred
> SELECT count(1) FROM `delta`.`abfss://depts/finance/forecast/somefile` WITH (CREDENTIAL
my_azure_storage_cred);
100
> SELECT count(1) FROM `delta`.`abfss://depts/hr/employees` WITH (CREDENTIAL my_azure_storage_cred);
2017
-- `finance` can read from any storage path under abfss://depts/finance but nowhere else
> SELECT count(1) FROM `delta`.`abfss://depts/finance/forecast/somefile` WITH (CREDENTIAL
my_azure_storage_cred);
100
> SELECT count(1) FROM `delta`.`abfss://depts/hr/employees` WITH (CREDENTIAL my_azure_storage_cred);
Error
-- `finance` can create an external table over specific object within the `finance_loc` location
> CREATE TABLE sec_filings LOCATION 'abfss://depts/finance/sec_filings';
Related articles
Create a storage credential (CLI)
ALTER STORAGE CREDENTIAL
ALTER TABLE
CREATE LOCATION
DESCRIBE STORAGE CREDENTIAL
DESCRIBE TABLE
DROP STORAGE CREDENTIAL
DROP TABLE
SHOW STORAGE CREDENTIALS
SHOW TABLES
GRANT
REVOKE
Delta Sharing
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Delta Sharing is an open protocol for secure data sharing with other organizations regardless of which
computing platforms they use. It can share collections of tables in a Unity Catalog metastore in real time without
copying them, so that data recipients can immediately begin working with the latest version of the shared data.
Since: Databricks Runtime 10.3
There are two components to Delta Sharing:
Shares
A share provides a logical grouping for the tables you intend to share.
Recipients
A recipient identifies an organization with which you want to share any number of shares.
Shares
A share is a container instantiated with the CREATE SHARE command. Once created you can iteratively register a
collection of existing tables defined within the metastore using the ALTER SHARE command. You can register
tables under their original name, qualified by their original schema, or provide alternate exposed names.
You must be a metastore admin or account admin to create, alter, and drop shares.
Examples
-- Create share `customer_share` only if share with same name doesn't exist, with a comment.
> CREATE SHARE IF NOT EXISTS customer_share COMMENT 'This is customer share';
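A hedged sketch of registering a table with the share using ALTER SHARE; the table name my_schema.customers is illustrative:

-- Add an existing table to the share.
> ALTER SHARE customer_share ADD TABLE my_schema.customers;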
Recipients
A recipient is an object you create using CREATE RECIPIENT to represent an organization to which you want to
allow access to shares. When you create a recipient, Databricks Runtime generates an activation link you can send
to the organization. To retrieve the activation link after creation, use DESCRIBE RECIPIENT.
Once a recipient has been created you can give it SELECT privileges on shares of your choice using GRANT ON
SHARE.
You must be a metastore administrator to create recipients, drop recipients, and grant access to shares.
Examples
-- Create a recipient.
> CREATE RECIPIENT IF NOT EXISTS other_org COMMENT 'other.org';
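A hedged sketch of granting the recipient access to a share with GRANT ON SHARE:

-- Allow the recipient to read every table registered in the share.
> GRANT SELECT ON SHARE customer_share TO RECIPIENT other_org;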
Related articles
CREATE RECIPIENT
DESCRIBE RECIPIENT
DROP RECIPIENT
SHOW RECIPIENT
ARRAY type
7/21/2022 • 2 minutes to read
Syntax
ARRAY < elementType >
elementType : Any data type defining the type of the elements of the array.
Limits
The array type supports sequences of any length greater or equal to 0.
Literals
See array function for details on how to produce literal array values.
See [ ] operator for details how to retrieve elements from an array.
Examples
> SELECT ARRAY(1, 2, 3);
[1, 2, 3]
Related
[]
MAP type
STRUCT type
array function
cast function
BIGINT type
7/21/2022 • 2 minutes to read
Syntax
{ BIGINT |
LONG }
Limits
The range of numbers is from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
Literals
[ + | - ] digit [ ... ] [L]
Examples
> SELECT +1L;
1
Related
TINYINT type
SMALLINT type
INT type
DECIMAL type
FLOAT type
DOUBLE type
cast function
BINARY type
7/21/2022 • 2 minutes to read
Syntax
BINARY
Limits
The type supports byte sequences of any length greater or equal to 0.
Literals
X { 'num [ ... ]' | "num [ ... ]" }
Examples
> SELECT X'1';
[01]
Related
STRING type
cast function
BOOLEAN type
7/21/2022 • 2 minutes to read
Syntax
BOOLEAN
Limits
The type supports true and false values.
Literals
{ TRUE | FALSE }
Examples
> SELECT true;
TRUE
Related
cast function
DATE type
7/21/2022 • 2 minutes to read
Represents values comprising values of fields year, month, and day, without a time-zone.
Syntax
DATE
Limits
The range of dates supported is June 23 -5877641 CE to July 11 +5881580 CE .
Literals
DATE dateString
dateString
{ '[+|-]yyyy[...]' |
'[+|-]yyyy[...]-[m]m' |
'[+|-]yyyy[...]-[m]m-[d]d' |
'[+|-]yyyy[...]-[m]m-[d]d[T]' }
Examples
> SELECT DATE'0000';
0000-01-01
Related
TIMESTAMP type
INTERVAL type
cast function
DECIMAL type
7/21/2022 • 2 minutes to read
Syntax
{ DECIMAL | DEC | NUMERIC } [ ( p [ , s ] ) ]
p : Optional maximum precision (total number of digits) of the number between 1 and 38. The default is 10. s :
Optional scale of the number between 0 and p . The number of digits to the right of the decimal point. The
default is 0.
Limits
The range of numbers:
-1Ep + 1 to -1E-s
0
+1E-s to +1Ep - 1
For example a DECIMAL(5, 2) has a range of: -999.99 to 999.99.
Literals
decimal_digits { [ BD ] | [ exponent BD ] }
| digit [ ... ] [ exponent ] BD
decimal_digits:
[ + | - ] { digit [ ... ] . [ digit [ ... ] ]
| . digit [ ... ] }
exponent:
E [ + | - ] digit [ ... ]
Examples
> SELECT +1BD;
1
Related
TINYINT type
SMALLINT type
INT type
BIGINT type
FLOAT type
DOUBLE type
cast function
DOUBLE type
7/21/2022 • 2 minutes to read
Syntax
DOUBLE
Limits
The range of numbers is:
Negative infinity
-1.79769E+308 to -2.225E-307
0
+2.225E-307 to +1.79769E+308
Positive infinity
NaN (not a number)
Literals
decimal_digits { D | exponent [ D ] }
| digit [ ... ] { exponent [ D ] | [ exponent ] D }
decimal_digits:
[ + | - ] { digit [ ... ] . [ digit [ ... ] ]
| . digit [ ... ] }
exponent:
E [ + | - ] digit [ ... ]
Notes
DOUBLE is a base-2 numeric type. When given a literal which is base-10 the representation may not be exact.
Use DECIMAL type to accurately represent fractional or large base-10 numbers.
Examples
> SELECT +1D;
1.0
Related
TINYINT type
SMALLINT type
INT type
BIGINT type
DECIMAL type
FLOAT type
cast function
Special floating point values
FLOAT type
7/21/2022 • 2 minutes to read
Syntax
{ FLOAT | REAL }
Limits
The range of numbers is:
Negative infinity
-3.402E+38 to -1.175E-37
0
+1.175E-37 to +3.402E+38
Positive infinity
NaN (not a number)
Literals
decimal_digits [ exponent ] F
| [ + | - ] digit [ ... ] [ exponent ] F
decimal_digits:
[ + | - ] { digit [ ... ] . [ digit [ ... ] ]
| . digit [ ... ] }
exponent:
E [ + | - ] digit [ ... ]
Notes
FLOAT is a base-2 numeric type. When given a literal which is base-10 the representation may not be exact. Use
DECIMAL type to accurately represent fractional or large base-10 numbers.
Examples
> SELECT +1F;
1.0
Related
TINYINT type
SMALLINT type
INT type
BIGINT type
DECIMAL type
DOUBLE type
cast function
Special floating point values
INT type
7/21/2022 • 2 minutes to read
Syntax
{ INT | INTEGER }
Limits
The range of numbers is from -2,147,483,648 to 2,147,483,647.
Literals
[ + | - ] digit [ ... ]
Examples
> SELECT +1;
1
Related
TINYINT type
SMALLINT type
BIGINT type
DECIMAL type
FLOAT type
DOUBLE type
cast function
INTERVAL type
7/21/2022 • 2 minutes to read
Syntax
INTERVAL { yearMonthIntervalQualifier | dayTimeIntervalQualifier }
yearMonthIntervalQualifier
{ YEAR [TO MONTH] |
MONTH }
dayTimeIntervalQualifier
{ DAY [TO { HOUR | MINUTE | SECOND } ] |
HOUR [TO { MINUTE | SECOND } ] |
MINUTE [TO SECOND] |
SECOND }
Notes
Intervals covering years or months are called year-month intervals.
Intervals covering days, hours, minutes, or seconds are called day-time intervals.
You cannot combine or compare year-month and day-time intervals.
Day-time intervals are strictly based on 86400s/day and 60s/min.
Seconds are always considered to include microseconds.
Limits
A year-month interval has a maximal range of +/- 178,956,970 years and 11 months.
A day-time interval has a maximal range of +/- 106,751,991 days, 23 hours, 59 minutes, and 59.999999
seconds.
Literals
year-month interval
INTERVAL [+|-] yearMonthIntervalString yearMonthIntervalQualifier
day-time interval
INTERVAL [+|-] dayTimeIntervalString dayTimeIntervalQualifier
yearMonthIntervalString
{ '[+|-] y[...]' |
'[+|-] y[...]-[m]m' }
dayTimeIntervalString
{ '[+|-] d[...]' |
'[+|-] d[...] [h]h' |
'[+|-] d[...] [h]h:[m]m' |
'[+|-] d[...] [h]h:[m]m:[s]s' |
'[+|-] d[...] [h]h:[m]m:[s]s.ms[ms][ms][us][us][us]' |
'[+|-] h[...]' |
'[+|-] h[...]:[m]m' |
'[+|-] h[...]:[m]m:[s]s' |
'[+|-] h[...]:[m]m:[s]s.ms[ms][ms][us][us][us]' |
'[+|-] m[...]' |
'[+|-] m[...]:[s]s' |
'[+|-] m[...]:[s]s.ms[ms][ms][us][us][us]' |
'[+|-] s[...]' |
'[+|-] s[...].ms[ms][ms][us][us][us]' }
Unless a unit constitutes the leading unit of the intervalQualifier it must fall within the defined range:
Months: between 0 and 11
Hours: between 0 and 23
Minutes: between 0 and 59
Seconds: between 0.000000 and 59.999999
You can prefix a sign either inside or outside intervalString . If there is one - sign, the interval is negative. If
there are two or no - signs, the interval is positive. If the components in the intervalString do not match up
with the components in the intervalQualifier an error is raised. If the intervalString value does not fit into
the range specified by the intervalQualifier an error is raised.
Examples
> SELECT INTERVAL '100-00' YEAR TO MONTH;
100-0
Related
DATE type
TIMESTAMP type
cast function
MAP type
7/21/2022 • 2 minutes to read
Syntax
MAP <keyType, valueType>
keyType : Any data type other than MAP specifying the keys.
valueType : Any data type specifying the values.
Limits
The map type supports maps of any cardinality greater or equal to 0.
The keys must be unique and not be NULL.
MAP is not a comparable data type.
Literals
See map function for details on how to produce literal map values.
See [ ] operator for details on how to retrieve values from a map by key.
Examples
> SELECT map('red', 1, 'green', 2);
{red->1, green->2}
Related
[ ] operator
ARRAY type
STRUCT type
map function
cast function
VOID type
7/21/2022 • 2 minutes to read
Syntax
{ NULL | VOID }
Limits
The only value the VOID type can hold is NULL.
Literals
NULL
Examples
> SELECT typeof(NULL);
VOID
Related
cast function
SMALLINT type
7/21/2022 • 2 minutes to read
Syntax
{ SMALLINT | SHORT }
Limits
The range of numbers is from -32,768 to 32,767.
Literals
[ + | - ] digit [ ... ] S
Examples
> SELECT +1S;
1
Related
TINYINT type
INT type
BIGINT type
DECIMAL type
FLOAT type
DOUBLE type
cast function
STRING type
7/21/2022 • 2 minutes to read
Syntax
STRING
Literals
[r|R]'c [ ... ]'
r or R
Examples
> SELECT 'Spark';
Spark
Related
cast function
STRUCT type
7/21/2022 • 2 minutes to read
Syntax
STRUCT < [fieldName [:] fieldType [NOT NULL] [COMMENT str] [, …] ] >
fieldName : An identifier naming the field. The names need not be unique.
fieldType : Any data type.
NOT NULL : When specified the struct guarantees that the value of this field is never NULL.
COMMENT str : An optional string literal describing the field.
Limits
The type supports any number of fields greater or equal to 0.
Literals
See struct function and named_struct function for details on how to produce literal array values.
Examples
> SELECT struct('Spark', 5);
{Spark, 5}
> SELECT typeof(CAST(NULL AS STRUCT<Field1:INT NOT NULL COMMENT 'The first field.',Field2:ARRAY<INT>>));
struct<Field1:int,Field2:array<int>>
Related
ARRAY type
MAP type
struct function
named_struct function
cast function
TIMESTAMP type
7/21/2022 • 2 minutes to read
Represents values comprising values of fields year, month, day, hour, minute, and second, with the session local
time-zone. The timestamp value represents an absolute point in time.
Syntax
TIMESTAMP
Limits
The range of timestamps supported is June 23 -5877641 CE to July 11 +5881580 CE .
Literals
TIMESTAMP timestampString
timestampString
{ '[+|-]yyyy[...]' |
'[+|-]yyyy[...]-[m]m' |
'[+|-]yyyy[...]-[m]m-[d]d' |
'[+|-]yyyy[...]-[m]m-[d]d ' |
'[+|-]yyyy[...]-[m]m-[d]d[T][h]h[:]' |
'[+|-]yyyy[..]-[m]m-[d]d[T][h]h:[m]m[:]' |
'[+|-]yyyy[...]-[m]m-[d]d[T][h]h:[m]m:[s]s[.]' |
'[+|-]yyyy[...]-[m]m-[d]d[T][h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zoneId]' }
zoneId :
Z - Zulu time zone UTC+0
+|-[h]h:[m]m
An ID with one of the prefixes UTC+, UTC-, GMT+, GMT-, UT+ or UT-, and a suffix in the formats:
+|-h[h]
+|-hh[:]mm
+|-hh:mm:ss
+|-hhmmss
Region-based zone IDs in the form <area>/<city> , for example, Europe/Paris .
If the month or day components are not specified they default to 1. If hour, minute, or second components are
not specified they default to 0. If no zoneId is specified, it defaults to the session time zone.
If the literal does not represent a proper timestamp, Azure Databricks raises an error.
Notes
Timestamps with local timezone are internally normalized and persisted in UTC. Whenever the value or a
portion of it is extracted the local session timezone is applied.
Examples
> SELECT TIMESTAMP'0000';
0000-01-01 00:00:00
Related
DATE type
INTERVAL type
cast function
TINYINT type
7/21/2022 • 2 minutes to read
Syntax
{ TINYINT | BYTE }
Limits
The range of numbers is from -128 to 127.
Literals
[ + | - ] digit [ ... ] Y
Examples
> SELECT +1Y;
1
Related
SMALLINT type
INT type
BIGINT type
DECIMAL type
FLOAT type
DOUBLE type
cast function
Special floating point values (Databricks SQL)
7/21/2022 • 2 minutes to read
NaN semantics
When dealing with float or double types, NaN is handled in ways that do not exactly match standard
floating point semantics. Specifically:
NaN = NaN returns true.
In aggregations, all NaN values are grouped together.
NaN is treated as a normal value in join keys.
NaN values go last when in ascending order, larger than any other numeric value.
Examples
> SELECT double('infinity');
Infinity
Related
FLOAT type (Databricks SQL)
DOUBLE type (Databricks SQL)
abs function
7/21/2022 • 2 minutes to read
Syntax
abs(expr)
Arguments
expr : An expression that evaluates to a numeric or interval.
Returns
A numeric or interval of the same type as expr .
For integral numeric types the function can return an ARITHMETIC_OVERFLOW error.
WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but “wrap” the result.
Examples
> SELECT abs(-1);
1
Related functions
sign function
signum function
negative function
positive function
acos function
7/21/2022 • 2 minutes to read
Syntax
acos(expr)
Arguments
expr : A numeric expression.
Returns
A DOUBLE. If the argument is out of bounds, NaN is returned.
Examples
> SELECT acos(1);
0.0
> SELECT acos(2);
NaN
Related functions
cos function
acosh function
acosh function
7/21/2022 • 2 minutes to read
Syntax
acosh(expr)
Arguments
expr : A numeric expression.
Returns
A DOUBLE. If the argument is out of bounds, NaN is returned.
Examples
> SELECT acosh(1);
0.0
> SELECT acosh(0);
NaN
Related functions
cos function
acos function
add_months function
7/21/2022 • 2 minutes to read
Syntax
add_months(startDate, numMonths)
Arguments
startDate : A DATE expression.
numMonths : An integral number.
Returns
A DATE. If the result exceeds the number of days of the month the result is rounded down to the end of the
month. If the result exceeds the supported range for a date an overflow error is reported.
Examples
> SELECT add_months('2016-08-31', 1);
2016-09-30
Related functions
dateadd function
datediff (timestamp) function
months_between function
aes_decrypt function
7/21/2022 • 2 minutes to read
Syntax
aes_decrypt(expr, key [, mode [, padding]])
Arguments
expr : The BINARY expression to be decrypted.
key : A BINARY expression. Must match the key originally used to produce the encrypted value and be 16,
24, or 32 bytes long.
mode : An optional STRING expression describing the encryption mode used to produce the encrypted value.
padding : An optional STRING expression describing how encryption handled padding of the value to key
length.
Returns
A BINARY.
mode must be one of (case insensitive):
'ECB' : Use Electronic CodeBook (ECB) mode.
'GCM' : Use Galois/Counter Mode (GCM) . This is the default.
Examples
> SELECT base64(aes_encrypt('Spark', 'abcdefghijklmnop'));
4A5jOAh9FNGwoMeuJukfllrLdHEZxA2DyuSQAWz77dfn
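The example above shows only the encryption side. The following round trip (added for illustration, using the default GCM mode) decrypts the value produced by aes_encrypt with the same 16-byte key:
> SELECT cast(aes_decrypt(aes_encrypt('Spark', 'abcdefghijklmnop'), 'abcdefghijklmnop') AS STRING);
Spark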
Related functions
aes_encrypt function
aes_encrypt function
7/21/2022 • 2 minutes to read
Syntax
aes_encrypt(expr, key [, mode [, padding]])
Arguments
expr : The BINARY expression to be encrypted.
key : A BINARY expression. The key to be used to encrypt expr . It must be 16, 24, or 32 bytes long.
mode : An optional STRING expression describing the encryption mode.
padding : An optional STRING expression describing how encryption handles padding of the value to key
length.
Returns
A BINARY.
mode must be one of (case insensitive):
'ECB' : Use Electronic CodeBook (ECB) mode.
'GCM' : Use Galois/Counter Mode (GCM) . This is the default.
Examples
> SELECT base64(aes_encrypt('Spark', 'abcdefghijklmnop'));
4A5jOAh9FNGwoMeuJukfllrLdHEZxA2DyuSQAWz77dfn
Related functions
aes_decrypt function
aggregate function
7/21/2022 • 2 minutes to read
Syntax
aggregate(expr, start, merge [, finish])
Arguments
expr : An ARRAY expression.
start : An initial value of any type.
merge : A lambda function used to aggregate the current element.
finish : An optional lambda function used to finalize the aggregation.
Returns
The result type matches the result type of the finish lambda function if it exists, or the type of start otherwise.
Applies an expression to an initial state and all elements in the array, and reduces this to a single state. The final
state is converted into the final result by applying a finish function.
The merge function takes two parameters. The first being the accumulator, the second the element to be
aggregated. The accumulator and the result must be of the type of start . The optional finish function takes
one parameter and returns the final result.
This function is a synonym for reduce function.
Examples
> SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x);
6
> SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x, acc -> acc * 10);
60
Related functions
array function
reduce function
& (ampersand sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 & expr2
Arguments
expr1 : An integral numeric type expression.
expr2 : An integral numeric type expression.
Returns
The result type matches the widest type of expr1 and expr2 .
Examples
> SELECT 3 & 5;
1
Related functions
| (pipe sign) operator
~ (tilde sign) operator
^ (caret sign) operator
bit_count function
and predicate
7/21/2022 • 2 minutes to read
Syntax
expr1 and expr2
Arguments
expr1 : A BOOLEAN expression
expr2 : A BOOLEAN expression
Returns
A BOOLEAN.
Examples
> SELECT true and true;
true
> SELECT true and false;
false
> SELECT true and NULL;
NULL
> SELECT false and NULL;
false
Related functions
or operator
not operator
any aggregate function
7/21/2022 • 2 minutes to read
Syntax
any(expr) [FILTER ( WHERE cond ) ]
Arguments
expr : A BOOLEAN expression.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A BOOLEAN.
The any aggregate function is synonymous with the max aggregate function, but limited to a BOOLEAN argument.
Examples
> SELECT any(col) FROM VALUES (true), (false), (false) AS tab(col);
true
Related functions
max aggregate function
min aggregate function
approx_count_distinct aggregate function
7/21/2022 • 2 minutes to read
Returns the estimated number of distinct values in expr within the group.
Syntax
approx_count_distinct(expr[, relativeSD]) [FILTER ( WHERE cond ) ]
Arguments
expr : Can be of any type for which equivalence is defined.
relativeSD : Defines the maximum relative standard deviation allowed.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A BIGINT.
Examples
> SELECT approx_count_distinct(col1) FROM VALUES (1), (1), (2), (2), (3) tab(col1);
3
> SELECT approx_count_distinct(col1) FILTER(WHERE col2 = 10)
FROM VALUES (1, 10), (1, 10), (2, 10), (2, 10), (3, 10), (1, 12) AS tab(col1, col2);
3
Related functions
approx_percentile aggregate function
approx_top_k aggregate function
approx_percentile aggregate function
7/21/2022 • 2 minutes to read
Syntax
approx_percentile ( [ALL | DISTINCT] expr, percentile [, accuracy] ) [FILTER ( WHERE cond ) ]
Arguments
expr : A numeric expression.
percentile : A numeric literal between 0 and 1 or a literal array of numeric values, each between 0 and 1.
accuracy : An INTEGER literal greater than 0. If accuracy is omitted it is set to 10000.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
The aggregate function returns the expression that is the smallest value in the ordered group (sorted from least
to greatest) such that no more than percentile of expr values is less than the value or equal to that value.
If percentile is an array, approx_percentile returns the approximate percentile array of expr at percentile .
The accuracy parameter controls approximation accuracy at the cost of memory. A higher value of accuracy
yields better accuracy; 1.0/accuracy is the relative error of the approximation. This function is a synonym for
percentile_approx aggregate function.
If DISTINCT is specified the function operates only on a unique set of expr values.
Examples
> SELECT approx_percentile(col, array(0.5, 0.4, 0.1), 100) FROM VALUES (0), (1), (2), (10) AS tab(col);
[1,1,0]
> SELECT approx_percentile(col, 0.5, 100) FROM VALUES (0), (6), (6), (7), (9), (10) AS tab(col);
6
> SELECT approx_percentile(DISTINCT col, 0.5, 100) FROM VALUES (0), (6), (6), (7), (9), (10) AS tab(col);
7
Related functions
approx_count_distinct aggregate function
approx_top_k aggregate function
percentile aggregate function
percentile_approx aggregate function
percentile_cont aggregate function
approx_top_k aggregate function
7/21/2022 • 2 minutes to read
Returns the top k most frequently occurring item values in an expr along with their approximate counts.
Since: Databricks Runtime 10.2
Syntax
approx_top_k(expr[, k[, maxItemsTracked]]) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression of STRING, BOOLEAN, DATE, TIMESTAMP, or numeric type.
k : An optional INTEGER literal greater than 0. If k is not specified, it defaults to 5 .
maxItemsTracked : An optional INTEGER literal greater than or equal to k . If maxItemsTracked is not specified,
it defaults to 10000 .
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
Results are returned as an ARRAY of type STRUCT, where each STRUCT contains an item field for the value
(with its original input type) and a count field (of type LONG) with the approximate number of occurrences. The
array is sorted by count descending.
The aggregate function returns the top k most frequently occurring item values in an expression expr along
with their approximate counts. The error in each count may be up to 2.0 * numRows / maxItemsTracked where
numRows is the total number of rows. Higher values of maxItemsTracked provide better accuracy at the cost of
increased memory usage. Expressions that have fewer than maxItemsTracked distinct items will yield exact item
counts. Results include NULL values as their own item in the results.
Examples
> SELECT approx_top_k(expr) FROM VALUES (0), (0), (1), (1), (2), (3), (4), (4) AS tab(expr);
[{'item':4,'count':2},{'item':1,'count':2},{'item':0,'count':2},{'item':3,'count':1},{'item':2,'count':1}]
> SELECT approx_top_k(expr, 2) FROM VALUES 'a', 'b', 'c', 'c', 'c', 'c', 'd', 'd' AS tab(expr);
[{'item':'c','count':4},{'item':'d','count':2}]
> SELECT approx_top_k(expr, 10, 100) FROM VALUES (0), (1), (1), (2), (2), (2) AS tab(expr);
[{'item':2,'count':3},{'item':1,'count':2},{'item':0,'count':1}]
Related functions
approx_count_distinct aggregate function
approx_percentile aggregate function
array function
7/21/2022 • 2 minutes to read
Syntax
array(expr [, ...])
Arguments
exprN : Elements of any type that share a least common type.
Returns
An array of elements of exprNs least common type.
If the array is empty or all elements are NULL the result type is an array of type null.
Examples
-- an array of integers
> SELECT array(1, 2, 3);
[1,2,3]
-- an array of strings
> SELECT array(1.0, 1, 'hello');
[1.0,1,hello]
Related
[ ] operator
map function
collect_set aggregate function
collect_list aggregate function
SQL data type rules
array_agg aggregate function
7/21/2022 • 2 minutes to read
Syntax
array_agg ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression of any type.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
An ARRAY of the argument type.
The order of elements in the array is non-deterministic. NULL values are excluded.
If DISTINCT is specified the function collects only unique values and is a synonym for collect_set aggregate
function
This function is a synonym for collect_list
Examples
> SELECT array_agg(col) FROM VALUES (1), (2), (NULL), (1) AS tab(col);
[1,2,1]
> SELECT array_agg(DISTINCT col) FROM VALUES (1), (2), (NULL), (1) AS tab(col);
[1,2]
Related functions
array function
collect_list aggregate function
collect_set aggregate function
array_contains function
7/21/2022 • 2 minutes to read
Syntax
array_contains(array, value)
Arguments
array : An ARRAY to be searched.
value : An expression with a type sharing a least common type with the array elements.
Returns
A BOOLEAN. If value is NULL , the result is NULL . If any element in array is NULL , the result is NULL if value
is not matched to any other element.
Examples
> SELECT array_contains(array(1, 2, 3), 2);
true
> SELECT array_contains(array(1, NULL, 3), 2);
NULL
> SELECT array_contains(array(1, 2, 3), NULL);
NULL
Related
arrays_overlap function
array_position function
SQL data type rules
array_distinct function
7/21/2022 • 2 minutes to read
Syntax
array_distinct(array)
Arguments
array : An ARRAY expression.
Returns
The function returns an array of the same type as the input argument where all duplicate values have been
removed.
Examples
> SELECT array_distinct(array(1, 2, 3, NULL, 3));
[1,2,3,NULL]
Related functions
array_except function
array_intersect function
array_sort function
array_remove function
array_union function
array_except function
7/21/2022 • 2 minutes to read
Syntax
array_except(array1, array2)
Arguments
array1 : An ARRAY of any type with comparable elements.
array2 : An ARRAY of elements sharing a least common type with the elements of array1 .
Returns
An ARRAY of matching type to array1 with no duplicates.
Examples
> SELECT array_except(array(1, 2, 2, 3), array(1, 1, 3, 5));
[2]
Related
array_distinct function
array_intersect function
array_sort function
array_remove function
array_union function
SQL data type rules
array_intersect function
7/21/2022 • 2 minutes to read
Syntax
array_intersect(array1, array2)
Arguments
array1 : An ARRAY of any type with comparable elements.
array2 : An ARRAY of elements sharing a least common type with the elements of array1 .
Returns
An ARRAY of matching type to array1 with no duplicates and elements contained in both array1 and array2 .
Examples
> SELECT array_intersect(array(1, 2, 3), array(1, 3, 3, 5));
[1,3]
Related
array_distinct function
array_except function
array_sort function
array_remove function
array_union function
SQL data type rules
array_join function
7/21/2022 • 2 minutes to read
Syntax
array_join(array, delimiter [, nullReplacement])
Arguments
array : Any ARRAY type, but its elements are interpreted as strings.
delimiter : A STRING used to separate the concatenated array elements.
nullReplacement : A STRING used to express a NULL value in the result.
Returns
A STRING where the elements of array are separated by delimiter and null elements are replaced with
nullReplacement . If nullReplacement is omitted, null elements are filtered out. If any argument is NULL , the
result is NULL .
Examples
> SELECT array_join(array('hello', 'world'), ' ');
hello world
> SELECT array_join(array('hello', NULL ,'world'), ' ');
hello world
> SELECT array_join(array('hello', NULL ,'world'), ' ', ',');
hello , world
Related functions
concat function
concat_ws function
array_max function
7/21/2022 • 2 minutes to read
Syntax
array_max(array)
Arguments
array : Any ARRAY with elements for which order is supported.
Returns
The result matches the type of the elements. NULL elements are skipped. If array is empty, or contains only
NULL elements, NULL is returned.
Examples
> SELECT array_max(array(1, 20, NULL, 3));
20
Related functions
array_min function
array_min function
7/21/2022 • 2 minutes to read
Syntax
array_min(array)
Arguments
array : Any ARRAY with elements for which order is supported.
Returns
The result matches the type of the elements. NULL elements are skipped. If array is empty, or contains only
NULL elements, NULL is returned.
Examples
> SELECT array_min(array(1, 20, NULL, 3));
1
Related functions
array_max function
array_position function
7/21/2022 • 2 minutes to read
Syntax
array_position(array, element)
Arguments
array : An ARRAY with comparable elements.
element : An expression matching the types of the elements in array .
Returns
A BIGINT.
Array indexing starts at 1. If the element value is NULL a NULL is returned.
Examples
> SELECT array_position(array(3, 2, 1, 4, 1), 1);
3
> SELECT array_position(array(3, NULL, 1), NULL)
NULL
Related functions
array_contains function
arrays_overlap function
array_remove function
7/21/2022 • 2 minutes to read
Syntax
array_remove(array, element)
Arguments
array : An ARRAY.
element : An expression of a type sharing a least common type with the elements of array .
Returns
The result type matches the type of the array.
If the element to be removed is NULL , the result is NULL .
Examples
> SELECT array_remove(array(1, 2, 3, NULL, 3, 2), 3);
[1,2,NULL,2]
> SELECT array_remove(array(1, 2, 3, NULL, 3, 2), NULL);
NULL
Related
array_except function
SQL data type rules
array_repeat function
7/21/2022 • 2 minutes to read
Syntax
array_repeat(element, count)
Arguments
element : An expression of any type.
count : An INTEGER greater or equal to 0.
Returns
An ARRAY containing element repeated count times.
Examples
> SELECT array_repeat('123', 2);
[123, 123]
Related functions
array function
array_size function
7/21/2022 • 2 minutes to read
Syntax
array_size(array)
Arguments
array : An ARRAY expression.
Returns
An INTEGER.
Examples
> SELECT array_size(array(1, NULL, 3, NULL));
4
Related
array function
element_at function
array_sort function
7/21/2022 • 2 minutes to read
Syntax
array_sort(array, func)
Arguments
array : An expression that evaluates to an array.
func : A lambda function defining the sort order.
Returns
The result type matches the type of array .
If func is omitted, the array is sorted in ascending order.
If func is provided it takes two arguments representing two elements of the array.
The function must return -1, 0, or 1 depending on whether the first element is less than, equal to, or greater than
the second element.
If the func returns other values (including NULL), array_sort fails and raises an error.
NULL elements are placed at the end of the returned array.
Examples
> SELECT array_sort(array(5, 6, 1),
(left, right) -> CASE WHEN left < right THEN -1
WHEN left > right THEN 1 ELSE 0 END);
[1,5,6]
> SELECT array_sort(array('bc', 'ab', 'dc'),
(left, right) -> CASE WHEN left IS NULL and right IS NULL THEN 0
WHEN left IS NULL THEN -1
WHEN right IS NULL THEN 1
WHEN left < right THEN 1
WHEN left > right THEN -1 ELSE 0 END);
[dc,bc,ab]
> SELECT array_sort(array('b', 'd', null, 'c', 'a'));
[a,b,c,d,NULL]
Related functions
array_distinct function
array_intersect function
array_except function
array_remove function
array_union function
sort_array function
array_union function
7/21/2022 • 2 minutes to read
Returns an array of the elements in the union of array1 and array2 without duplicates.
Syntax
array_union(array1, array2)
Arguments
array1 : An ARRAY.
array2 : An ARRAY of the same type as array1 .
Returns
An ARRAY of the same type as array1 .
Examples
> SELECT array_union(array(1, 2, 2, 3), array(1, 3, 5));
[1,2,3,5]
Related functions
array_distinct function
array_intersect function
array_except function
array_sort function
array_remove function
zip_with function
arrays_overlap function
7/21/2022 • 2 minutes to read
Syntax
arrays_overlap (array1, array2)
Arguments
array1 : An ARRAY.
array2 : An ARRAY sharing a least common type with array1 .
Returns
A BOOLEAN. The result is true if the arrays share at least one common non-null element.
If there is no common non-null element, both arrays are non-empty, and either of them contains a NULL
element, the result is NULL ; false otherwise.
Examples
> SELECT arrays_overlap(array(1, 2, 3), array(3, 4, 5));
true
> SELECT arrays_overlap(array(1, 2, NULL, 3), array(NULL, 4, 5));
NULL
Related
array_contains function
array_position function
SQL data type rules
arrays_zip function
7/21/2022 • 2 minutes to read
Returns a merged array of structs in which the Nth struct contains the Nth values of the input arrays.
Syntax
arrays_zip (array1 [, ...])
Arguments
arrayN : An ARRAY.
Returns
An ARRAY of STRUCT where the type of the Nth field matches the type of the elements of arrayN .
The number of array arguments can be 0 or more. If the function is called without arguments it returns an
empty array of an empty struct. Arrays that are shorter than the largest array are extended with null elements.
Examples
> SELECT arrays_zip(array(1, 2, 3), array(2, 3, 4));
[{1,2},{2,3},{3,4}]
> SELECT arrays_zip(array(1, 2), array(2, 3), array(3, 4));
[{1,2,3},{2,3,4}]
> SELECT arrays_zip(array(1, 2), array('shoe', 'string', 'budget'));
[{1, shoe},{2, string},{null,budget}]
> SELECT arrays_zip();
[{}]
Related functions
ascii function
7/21/2022 • 2 minutes to read
Syntax
ascii(str)
Arguments
str : A STRING.
Returns
An INTEGER.
If str is empty, the result is 0. If the first character is not an ASCII character or part of the Latin-1 Supplement
range of UTF-16, the result is undefined.
Examples
> SELECT ascii('234');
50
> SELECT ascii('');
0
Related functions
chr function
char function
asin function
7/21/2022 • 2 minutes to read
Syntax
asin(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE. If the argument is out of bounds, the result is NaN .
Examples
> SELECT asin(0);
0.0
> SELECT asin(2);
NaN
Related functions
sin function
acos function
atan function
asinh function
asinh function
7/21/2022 • 2 minutes to read
Syntax
asinh(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
The result type is DOUBLE. If the argument is out of bounds, the result is NaN .
Examples
> SELECT asinh(0);
0.0
Related functions
sin function
asin function
assert_true function
7/21/2022 • 2 minutes to read
Syntax
assert_true(expr)
Arguments
expr : A BOOLEAN expression.
Returns
An untyped NULL if no error is returned.
Examples
> SELECT assert_true(0 < 1);
NULL
> SELECT assert_true(0 > 1);
'0 > 1' is not true
Related functions
* (asterisk sign) operator
7/21/2022 • 2 minutes to read
Syntax
multiplier * multiplicand
Arguments
multiplier : A numeric or INTERVAL expression.
multiplicand : A numeric expression or INTERVAL expression.
Returns
If both multiplier and multiplicand are DECIMAL, the result is DECIMAL.
If multiplier or multiplicand is an INTERVAL, the result is of the same type.
If both multiplier and multiplicand are integral numeric types, the result is the larger of the two types.
In all other cases the result is a DOUBLE.
If either the multiplier or the multiplicand is 0, the operator returns 0.
If the result of the multiplication is outside the bound for the result type an ARITHMETIC_OVERFLOW error is
raised.
Use try_multiply to return NULL on overflow.
WARNING
If spark.sql.ansi.enabled is false the result “wraps” if it is out of bounds for integral types and NULL for fractional types.
Examples
> SELECT 3 * 2;
6
atan function
7/21/2022 • 2 minutes to read
Syntax
atan(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
Examples
> SELECT atan(0);
0.0
Related functions
atan2 function
acos function
asin function
tan function
atan2 function
7/21/2022 • 2 minutes to read
Returns the angle in radians between the positive x-axis of a plane and the point specified by the coordinates (
exprX , exprY ).
Syntax
atan2(exprY, exprX)
Arguments
exprY : An expression that evaluates to a numeric.
exprX : An expression that evaluates to a numeric.
Returns
A DOUBLE.
Examples
> SELECT atan2(0, 0);
0.0
Related functions
atan function
acos function
asin function
tan function
atanh function
7/21/2022 • 2 minutes to read
Syntax
atanh(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE. If the argument is out of bounds, the result is NaN.
Examples
> SELECT atanh(0);
0.0
> SELECT atanh(2);
NaN
Related functions
atan function
tan function
atan2 function
acosh function
asinh function
avg aggregate function
7/21/2022 • 2 minutes to read
Syntax
avg( [ALL | DISTINCT] expr) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric or an interval.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
The result type is computed as for the arguments:
DECIMAL(p, s) : The result type is a DECIMAL(p + 4, s + 4) . If the maximum precision for DECIMAL is reached
the increase in scale will be limited to avoid loss of significant digits.
year-month interval: The result is an INTERVAL YEAR TO MONTH .
day-time interval: The result is an INTERVAL DAY TO SECOND .
In all other cases the result is a DOUBLE.
Nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified the average is computed after duplicates have been removed.
If the result overflows the result type, Databricks Runtime raises an ARITHMETIC_OVERFLOW error or
CANNOT_CHANGE_DECIMAL_PRECISION error. To return a NULL instead use try_avg.
WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but return NULL.
Examples
> SELECT avg(col) FROM VALUES (1), (2), (3) AS tab(col);
2.0
> SELECT avg(DISTINCT col) FROM VALUES (1), (1), (2) AS tab(col);
1.5
> SELECT avg(col) FROM VALUES (INTERVAL '1' YEAR), (INTERVAL '2' YEAR) AS tab(col);
1-6
Related functions
aggregate function
max aggregate function
mean aggregate function
min aggregate function
try_avg aggregate function
try_sum aggregate function
sum aggregate function
!= (bangeq sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 != expr2
Arguments
expr1 : An expression of any comparable type.
expr2 : An expression that shares a least common type with expr1 .
Returns
A BOOLEAN.
This function is a synonym for <> (lt gt sign) operator.
Examples
> SELECT 2 != 2;
false
> SELECT 3 != 2;
true
> SELECT 1 != '1';
false
> SELECT true != NULL;
NULL
> SELECT NULL != NULL;
NULL
Related
< (lt sign) operator
<= (lt eq sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
= (eq sign) operator
<> (lt gt sign) operator
SQL data type rules
! (bang sign) operator
7/21/2022 • 2 minutes to read
Syntax
!expr
Arguments
expr : A BOOLEAN expression.
Returns
A BOOLEAN.
This operator is a synonym for not operator.
Examples
> SELECT !true;
false
> SELECT !false;
true
> SELECT !NULL;
NULL
Related functions
and predicate
or operator
not operator
base64 function
7/21/2022 • 2 minutes to read
Syntax
base64(expr)
Arguments
expr : A BINARY expression or a STRING which the function will interpret as BINARY.
Returns
A STRING.
Examples
> SELECT base64('Spark SQL');
U3BhcmsgU1FM
Related functions
unbase64 function
between predicate
7/21/2022 • 2 minutes to read
Tests whether expr1 is greater than or equal to expr2 and less than or equal to expr3 .
Syntax
expr1 [not] between expr2 and expr3
Arguments
expr1 : An expression of any comparable type.
expr2 : An expression that shares a least common type with all other arguments.
expr3 : An expression that shares a least common type with all other arguments.
Returns
The result is a BOOLEAN.
If not is specified the function is a synonym for expr1 < expr2 or expr1 > expr3 .
Without not the function is a synonym for expr1 >= expr2 and expr1 <= expr3 .
Examples
> SELECT 4 between 3 and 5;
true
> SELECT 4 not between 3 and 5;
false
> SELECT 4 not between NULL and 5;
NULL
Related
in predicate
and predicate
SQL data type rules
bigint function
7/21/2022 • 2 minutes to read
Syntax
bigint(expr)
Arguments
expr : Any expression which is castable to BIGINT.
Returns
A BIGINT.
This function is a synonym for CAST(expr AS BIGINT) .
See cast function for details.
Examples
> SELECT bigint(current_timestamp);
1616168320
> SELECT bigint('5');
5
Related functions
cast function
bin function
7/21/2022 • 2 minutes to read
Syntax
bin(expr)
Arguments
expr : A BIGINT expression.
Returns
A STRING consisting of 1 s and 0 s.
Examples
> SELECT bin(13);
1101
> SELECT bin(-13);
1111111111111111111111111111111111111111111111111111111111110011
> SELECT bin(13.3);
1101
Related functions
binary function
7/21/2022 • 2 minutes to read
Syntax
binary(expr)
Arguments
expr : Any expression that can be cast to BINARY.
Returns
A BINARY.
This function is a synonym for CAST(expr AS BINARY) .
See cast function for details.
Examples
> SELECT binary('Spark SQL');
[53 70 61 72 6B 20 53 51 4C]
Related functions
cast function
bit_and aggregate function
7/21/2022 • 2 minutes to read
Syntax
bit_and(expr) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to an integral numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
The result type matches the argument type.
Examples
> SELECT bit_and(col) FROM VALUES (3), (5) AS tab(col);
1
> SELECT bit_and(col) FILTER(WHERE col < 6) FROM VALUES (3), (5), (6) AS tab(col);
1
Related functions
bit_or aggregate function
bit_xor aggregate function
some aggregate function
bit_count function
7/21/2022 • 2 minutes to read
Syntax
bit_count(expr)
Arguments
expr : A BIGINT or BOOLEAN expression.
Returns
An INTEGER.
Examples
> SELECT bit_count(0);
0
> SELECT bit_count(5);
2
> SELECT bit_count(-1);
64
> SELECT bit_count(true);
1
Related functions
| (pipe sign) operator
& (ampersand sign) operator
^ (caret sign) operator
~ (tilde sign) operator
bit_get function
7/21/2022 • 2 minutes to read
Syntax
bit_get(expr, pos)
Arguments
expr : An expression that evaluates to an integral numeric.
pos : An expression of type INTEGER.
Returns
The result type is an INTEGER.
The result value is 1 if the bit is set, 0 otherwise.
Bits are counted right to left and 0-based.
If pos is outside the bounds of the data type of expr Databricks Runtime raises an error.
bit_get is a synonym of getbit.
Examples
> SELECT hex(23Y), bit_get(23Y, 3);
17  0
Related functions
bit_reverse function
getbit function
~ (tilde sign) operator
bit_length function
7/21/2022 • 2 minutes to read
Returns the bit length of string data or number of bits of binary data.
Syntax
bit_length(expr)
Arguments
expr : A BINARY or STRING expression.
Returns
An INTEGER.
Examples
> SELECT bit_length('Spark SQL');
72
> SELECT bit_length('北京');
48
Related functions
length function
char_length function
character_length function
bit_or aggregate function
7/21/2022 • 2 minutes to read
Syntax
bit_or(expr) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to an integral numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
The result type matches the argument type.
Examples
> SELECT bit_or(col) FROM VALUES (3), (5) AS tab(col);
7
> SELECT bit_or(col) FILTER(WHERE col < 8) FROM VALUES (3), (5), (8) AS tab(col);
7
Related functions
bit_and aggregate function
bit_xor aggregate function
some aggregate function
bit_xor aggregate function
7/21/2022 • 2 minutes to read
Syntax
bit_xor ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to an integral numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
The result type matches the argument type.
If DISTINCT is specified the aggregate operates only on distinct values.
Examples
> SELECT bit_xor(col) FROM VALUES (3), (3), (5) AS tab(col);
5
> SELECT bit_xor(DISTINCT col) FROM VALUES (3), (3), (5) AS tab(col);
6
Related functions
bit_or aggregate function
bit_and aggregate function
some aggregate function
bool_and aggregate function
7/21/2022 • 2 minutes to read
Returns true if all values in expr are true within the group.
Syntax
bool_and(expr) [FILTER ( WHERE cond ) ]
Arguments
expr : A BOOLEAN expression.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A BOOLEAN.
Examples
> SELECT bool_and(col) FROM VALUES (true), (true), (true) AS tab(col);
true
> SELECT bool_and(col) FROM VALUES (NULL), (true), (true) AS tab(col);
true
> SELECT bool_and(col) FROM VALUES (true), (false), (true) AS tab(col);
false
Related functions
bool_or aggregate function
every aggregate function
bool_or aggregate function
7/21/2022 • 2 minutes to read
Returns true if at least one value in expr is true within the group.
Syntax
bool_or(expr) [FILTER ( WHERE cond ) ]
Arguments
expr : A BOOLEAN expression.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A BOOLEAN.
Examples
> SELECT bool_or(col) FROM VALUES (true), (false), (false) AS tab(col);
true
> SELECT bool_or(col) FROM VALUES (NULL), (true), (false) AS tab(col);
true
> SELECT bool_or(col) FROM VALUES (false), (false), (NULL) AS tab(col);
false
Related functions
bool_and aggregate function
every aggregate function
some aggregate function
boolean function
7/21/2022 • 2 minutes to read
Syntax
boolean(expr)
Arguments
expr : Any expression that can be cast to BOOLEAN.
Returns
A BOOLEAN.
This function is a synonym for CAST(expr AS BOOLEAN) .
See cast function for details.
Examples
> SELECT boolean(1);
true
> SELECT boolean(0);
false
Related functions
cast function
bround function
7/21/2022 • 2 minutes to read
Syntax
bround(expr [,targetScale] )
Arguments
expr : A numeric expression.
targetScale : An INTEGER expression greater or equal to 0. If targetScale is omitted the default is 0.
Returns
If expr is DECIMAL the result is DECIMAL with a scale that is the smaller of expr scale and targetScale .
In HALF_EVEN rounding, also known as Gaussian or banker’s rounding, the digit 5 is rounded towards an even
digit.
Examples
> SELECT bround(2.5, 0);
2
> SELECT bround(2.6, 0);
3
> SELECT bround(3.5, 0);
4
> SELECT bround(2.25, 1);
2.2
Related functions
floor function
ceiling function
ceil function
round function
btrim function
7/21/2022 • 2 minutes to read
Syntax
btrim( str [, trimStr ] )
Arguments
str : A STRING expression to be trimmed.
trimStr : An optional STRING expression with characters to be trimmed. The default is a space character.
Returns
A STRING.
The function removes any leading and trailing characters within trimStr from str .
Examples
> SELECT 'X' || btrim(' SparkSQL ') || 'X';
XSparkSQLX
Related functions
lpad function
ltrim function
rpad function
rtrim function
trim function
cardinality function
7/21/2022 • 2 minutes to read
Syntax
cardinality(expr)
Arguments
expr : An ARRAY or MAP expression.
Returns
An INTEGER.
Examples
> SELECT cardinality(array('b', 'd', 'c', 'a'));
4
> SELECT cardinality(map('a', 1, 'b', 2));
2
Related functions
^ (caret sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 ^ expr2
Arguments
expr1 : An integral numeric type expression.
expr2 : An integral numeric type expression.
Returns
The result type matches the least common type of expr1 and expr2 .
Examples
> SELECT 3 ^ 5;
6
Related functions
& (ampersand sign) operator
~ (tilde sign) operator
| (pipe sign) operator
bit_count function
case expression
7/21/2022 • 2 minutes to read
Returns resN for the first optN that equals expr or def if none matches.
Returns resN for the first condN evaluating to true, or def if none found.
Syntax
CASE expr {WHEN opt1 THEN res1} [...] [ELSE def] END
CASE {WHEN cond1 THEN res1} [...] [ELSE def] END
Arguments
expr : Any expression for which comparison is defined.
optN : An expression that has a least common type with expr and all other optN .
resN : Any expression that has a least common type with all other resN and def .
def : An optional expression that has a least common type with all resN .
condN : A BOOLEAN expression.
Returns
The result type matches the least common type of resN and def .
If def is omitted the default is NULL. Conditions are evaluated in order and only the resN or def which yields
the result is executed.
Examples
> SELECT CASE WHEN 1 > 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
1.0
> SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
2.0
> SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 END;
NULL
> SELECT CASE 3 WHEN 1 THEN 'A' WHEN 2 THEN 'B' WHEN 3 THEN 'C' END;
C
Related articles
decode function
coalesce function
nullif function
nvl function
nvl2 function
SQL data type rules
cast function
7/21/2022 • 13 minutes to read
Syntax
cast(sourceExpr AS targetType)
Arguments
sourceExpr : Any castable expression.
targetType : The data type of the result.
Returns
The result is type targetType .
The following combinations of data type casting are valid:
SOURCE (ROW) \ TARGET (COLUMN) | VOID | numeric | STRING | DATE | TIMESTAMP | year-month interval | day-time interval | BOOLEAN | BINARY | ARRAY | MAP | STRUCT
VOID | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y
numeric | N | Y | Y | N | Y | N | N | Y | N | N | N | N
STRING | N | Y | Y | Y | Y | Y | Y | Y | Y | N | N | N
DATE | N | N | Y | Y | Y | N | N | N | N | N | N | N
TIMESTAMP | N | Y | Y | Y | Y | N | N | N | N | N | N | N
year-month interval | N | N | Y | N | N | Y | N | N | N | N | N | N
day-time interval | N | N | Y | N | N | N | Y | N | N | N | N | N
BOOLEAN | N | Y | Y | N | Y | N | N | Y | N | N | N | N
BINARY | N | Y | Y | N | N | N | N | N | Y | N | N | N
ARRAY | N | N | Y | N | N | N | N | N | N | Y | N | N
MAP | N | N | Y | N | N | N | N | N | N | N | Y | N
STRUCT | N | N | Y | N | N | N | N | N | N | N | N | Y
numeric
If the targetType is a numeric and sourceExpr is of type:
VOID
The result is a NULL of the specified numeric type.
numeric
If targetType is an integral numeric, the result is sourceExpr truncated to a whole number.
Otherwise the result is sourceExpr rounded to fit the available scale of targetType .
If the value is outside the range of targetType , an overflow error is raised.
Use try_cast to turn overflow errors into NULL .
STRING
sourceExpr is read as a literal value of the targetType .
If sourceExpr doesn’t comply with the format for literal values, an error is raised.
If the value is outside the range of the targetType , an overflow error is raised.
Use try_cast to turn overflow and invalid format errors into NULL .
TIMESTAMP
The result is the number of seconds elapsed between 1970-01-01 00:00:00 UTC and sourceExpr .
If targetType is an integral numeric, the result is truncated to a whole number.
Otherwise the result is rounded to fit the available scale of targetType .
If the result is outside the range of targetType , an overflow error is raised.
Use try_cast to turn overflow errors into NULL .
BOOLEAN
If sourceExpr is:
true : The result is 1.
false : The result is 0.
NULL : The result is NULL .
Examples
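The following examples are added for illustration; the outputs assume default settings and follow the rules above:
> SELECT cast(NULL AS INT);
NULL
> SELECT cast(5.6 AS INT);
5
> SELECT cast('5' AS INT);
5
> SELECT cast(true AS INT);
1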
STRING
If the targetType is a STRING type and sourceExpr is of type:
VOID
The result is a NULL string.
exact numeric
The result is the literal number with an optional minus-sign and no leading zeros except for the single
digit to the left of the decimal point. If the targetType is DECIMAL(p, s) with s greater 0, a decimal
point is added and trailing zeros are added up to scale.
floating-point binary
If the absolute number is less than 10,000,000 and greater than or equal to 0.001 , the result is expressed
without scientific notation with at least one digit on either side of the decimal point.
Otherwise Databricks Runtime uses a mantissa followed by E and an exponent. The mantissa has an
optional leading minus sign followed by one digit to the left of the decimal point, and the minimal
number of digits greater than zero to the right. The exponent has an optional leading minus sign.
DATE
If the year is between 9999 BCE and 9999 CE, the result is a dateString of the form -YYYY-MM-DD and
YYYY-MM-DD respectively.
For years prior or after this range, the necessary number of digits are added to the year component and
+ is used for CE.
TIMESTAMP
If the year is between 9999 BCE and 9999 CE, the result is a timestampString of the form
-YYYY-MM-DD hh:mm:ss and YYYY-MM-DD hh:mm:ss respectively.
For years prior or after this range, the necessary number of digits are added to the year component and
+ is used for CE.
-- Caesar no more
> SELECT cast(DATE'-0044-03-15' AS STRING);
-0044-03-15
DATE
If the targetType is a DATE type and sourceExpr is of type:
VOID
The result is a NULL DATE.
STRING
sourceExpr must be a valid dateString.
If sourceExpr is not a valid dateString , Databricks Runtime returns an error.
Use try_cast to turn invalid data errors into NULL .
TIMESTAMP
The result is date portion of the timestamp sourceExpr .
Examples
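The following examples are added for illustration and follow the rules above:
> SELECT cast(NULL AS DATE);
NULL
> SELECT cast('2022-07-21' AS DATE);
2022-07-21
> SELECT cast(TIMESTAMP'2022-07-21 12:34:56' AS DATE);
2022-07-21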
TIMESTAMP
If the targetType is a TIMESTAMP type and sourceExpr is of type:
VOID
The result is a NULL TIMESTAMP.
numeric
sourceExpr is read as the number of seconds since 1970-01-01 00:00:00 UTC .
Fractions smaller than microseconds are truncated.
If the value is outside of the range of TIMESTAMP , an overflow error is raised.
Use try_cast to turn overflow errors into NULL .
STRING
sourceExpr must be a valid timestampString.
If sourceExpr is not a valid timestampString , Databricks Runtime returns an error.
Use try_cast to turn invalid data errors into NULL .
DATE
The result is the sourceExpr DATE at 00:00:00 hrs.
Examples
> SELECT cast(NULL AS TIMESTAMP);
NULL
year-month interval
If the targetType is a year-month interval and sourceExpr is of type:
VOID
The result is a NULL year-month interval.
STRING
sourceExpr must be a valid yearMonthIntervalString.
If sourceExpr is not a valid yearMonthIntervalString , Databricks Runtime returns an error.
Use try_cast to turn invalid data errors into NULL .
year-month interval
If the targetType yearMonthIntervalQualifier includes MONTH the value remains unchanged, but is
reinterpreted to match the target type.
Otherwise, if the source type yearMonthIntervalQualifier includes MONTH , the result is truncated to full
years.
Examples
> SELECT cast(NULL AS INTERVAL YEAR);
NULL
day-time interval
If the targetType is a day-time interval and sourceExpr is of type:
VOID
The result is a NULL day-time interval.
STRING
sourceExpr must be a valid dayTimeIntervalString.
If sourceExpr is not a valid dayTimeIntervalString , Databricks Runtime returns an error.
Use try_cast to turn invalid data errors into NULL .
day-time interval
If the targetType dayTimeIntervalQualifier includes the smallest unit of the source type
dayTimeIntervalQualifier, the value remains unchanged, but is reinterpreted to match the target type.
Otherwise, the sourceExpr interval is truncated to fit the targetType .
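Examples
The following example is added for illustration and shows the NULL case described above:
> SELECT cast(NULL AS INTERVAL DAY);
NULL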
BOOLEAN
If the targetType is a BOOLEAN and sourceExpr is of type:
VOID
The result is a NULL Boolean.
numeric
If sourceExpr is:
0 : The result is false .
NULL : The result is NULL .
special floating point value : The result is true .
Otherwise the result is true .
STRING
If sourceExpr is (case insensitive):
'T', 'TRUE', 'Y', 'YES', or '1' : The result is true .
'F', 'FALSE', 'N', 'NO', or '0' : The result is false .
NULL : The result is NULL .
Otherwise Databricks Runtime returns an invalid input syntax for type boolean error.
Use try_cast to turn invalid data errors into NULL .
Examples
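The following examples are added for illustration and follow the rules above:
> SELECT cast(NULL AS BOOLEAN);
NULL
> SELECT cast(0 AS BOOLEAN);
false
> SELECT cast('yes' AS BOOLEAN);
true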
BINARY
If the targetType is a BINARY and sourceExpr is of type:
VOID
The result is a NULL Binary.
STRING
The result is the UTF-8 encoding of the sourceExpr .
Examples
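The following examples are added for illustration; the output format matches the binary function example elsewhere in this reference:
> SELECT cast(NULL AS BINARY);
NULL
> SELECT cast('Spark' AS BINARY);
[53 70 61 72 6B]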
ARRAY
If the targetType is an ARRAY and sourceExpr is of type:
VOID
The result is a NULL of the targetType .
ARRAY
If the cast from sourceElementType to targetElementType is supported, the result is an
ARRAY<targetElementType> with all elements cast to the targetElementType .
Databricks Runtime raises an error if the cast isn’t supported or if any of the elements can’t be cast.
Use try_cast to turn invalid data or overflow errors into NULL .
Examples
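The following examples are added for illustration and follow the rules above:
> SELECT cast(NULL AS ARRAY<INT>);
NULL
> SELECT cast(array('1', '2', '3') AS ARRAY<INT>);
[1,2,3]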
MAP
If the targetType is a MAP<targetKeyType, targetValueType> and sourceExpr is of type:
VOID
The result is a NULL of the targetType .
MAP<sourceKeyType, sourceValueType>
If the casts from sourceKeyType to targetKeyType and sourceValueType to targetValueType are
supported, the result is a MAP<targetKeyType, targetValueType> with all keys cast to the targetKeyType
and all values cast to the targetValueType .
Databricks Runtime raises an error if the cast isn’t supported or if any of the keys or values can’t be cast.
Use try_cast to turn invalid data or overflow errors into NULL .
Examples
> SELECT cast(NULL AS MAP<STRING, INT>);
NULL
> SELECT cast(map('10', 't', '15', 'f', '20', NULL) AS MAP<INT, BOOLEAN>);
{10:true,15:false,20:null}
> SELECT cast(map('10', 't', '15', 'f', '20', NULL) AS MAP<INT, ARRAY<INT>>);
error: cannot cast map<string,string> to map<int,array<int>>
> SELECT cast(map('10', 't', '15', 'f', '20', 'o') AS MAP<INT, BOOLEAN>);
error: invalid input syntax for type boolean: o.
STRUCT
If the targetType is a STRUCT<[targetFieldName:targetFieldType [NOT NULL][COMMENT str][, …]]> and
sourceExpr is of type:
VOID
The result is a NULL of the targetType .
STRUCT<[sourceFieldName:sourceFieldType [NOT NULL][COMMENT str][, …]]>
The sourceExpr can be cast to targetType if all of these conditions are true:
The source type has the same number of fields as the target.
For all fields: sourceFieldTypeN can be cast to the targetFieldTypeN .
For all field values: The source field value N can be cast to targetFieldTypeN and the value isn’t null if
target field N is marked as NOT NULL .
sourceFieldName s, source NOT NULL constraints, and source COMMENT s need not match the targetType
and are ignored.
Databricks Runtime raises an error if the cast isn’t supported or if any of the keys or values can’t be cast.
Use try_cast to turn invalid data or overflow errors into NULL .
Examples
> SELECT cast(named_struct('a', 't', 'b', '1900') AS STRUCT<b:BOOLEAN, c:DATE NOT NULL COMMENT 'Hello'>);
{"b":true,"c":1900-01-01}
> SELECT cast(named_struct('a', 't', 'b', NULL::DATE) AS STRUCT<b:BOOLEAN, c:DATE NOT NULL COMMENT
'Hello'>);
error: cannot cast struct<a:string,b:date> to struct<b:boolean,c:date>
Related functions
:: (colon colon sign) operator
try_cast function
cbrt function
7/21/2022 • 2 minutes to read
Syntax
cbrt(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
Examples
> SELECT cbrt(27.0);
3.0
Related functions
sqrt function
ceil function
7/21/2022 • 2 minutes to read
Returns the smallest number not smaller than expr rounded up to targetScale digits relative to the decimal
point.
Syntax
ceil(expr [, targetScale])
Arguments
expr : An expression that evaluates to a numeric.
targetScale: An optional INTEGER literal greater than -38 specifying by how many digits after the
decimal point to round up.
Since: Databricks Runtime 10.5
Returns
If no targetScale is given:
If expr is DECIMAL(p, s) , returns DECIMAL(p - s + 1, 0) .
For all other cases, returns a BIGINT.
If targetScale is specified and expr is a:
TINYINT
Returns a DECIMAL(p, s) with p = max(14, -targetScale + 1)) and s = min(7, max(0, targetScale))
DOUBLE
Returns a DECIMAL(p, s) with p = max(30, -targetScale + 1)) and s = min(15, max(0, targetScale))
DECIMAL(p_in, s_in)
Returns a DECIMAL(p, s) with p = max(p_in - s_in + 1, -targetScale + 1)) and
s = min(s_in, max(0, targetScale))
If targetScale is negative the rounding occurs to -targetScale digits to the left of the decimal point.
The default targetScale is 0, which rounds up to the next bigger integral number.
This function is a synonym of ceiling function.
Examples
> SELECT ceil(-0.1);
0
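The following examples are added for illustration and show the effect of targetScale described above:
> SELECT ceil(3.14159, 3);
3.142
> SELECT ceil(3345.1, -2);
3400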
Related functions
floor function
ceiling function
bround function
round function
ceiling function
7/21/2022 • 2 minutes to read
Returns the smallest number not smaller than expr rounded up to targetScale digits relative to the decimal
point.
Syntax
ceiling(expr [, targetScale])
Arguments
expr : An expression that evaluates to a numeric.
targetScale: An optional INTEGER literal greater than -38 specifying by how many digits after the
decimal point to round up.
Since: Databricks Runtime 10.5
Returns
If no targetScale is given:
If expr is DECIMAL(p, s) , returns DECIMAL(p - s + 1, 0) .
For all other cases, returns a BIGINT.
If targetScale is specified and expr is a:
TINYINT
Returns a DECIMAL(p, s) with p = max(14, -targetScale + 1)) and s = min(7, max(0, targetScale))
DOUBLE
Returns a DECIMAL(p, s) with p = max(30, -targetScale + 1)) and s = min(15, max(0, targetScale))
DECIMAL(p_in, s_in)
Returns a DECIMAL(p, s) with p = max(p_in - s_in + 1, -targetScale + 1)) and
s = min(s_in, max(0, targetScale))
If targetScale is negative the rounding occurs to -targetScale digits to the left of the decimal point.
The default targetScale is 0, which rounds up to the next bigger integral number.
This function is a synonym of ceil function.
Examples
> SELECT ceiling(-0.1);
0
Related functions
floor function
ceil function
bround function
round function
char function
7/21/2022 • 2 minutes to read
Syntax
char(expr)
Arguments
expr : An expression that evaluates to an integral numeric.
Returns
A STRING.
If the argument is less than 0, an empty string is returned. If the argument is larger than 255 , it is treated as
modulo 256. This implies char covers the ASCII and Latin-1 Supplement range of UTF-16.
This function is a synonym for chr function.
Examples
> SELECT char(65);
A
Related functions
chr function
ascii function
char_length function
7/21/2022 • 2 minutes to read
Returns the character length of string data or number of bytes of binary data.
Syntax
char_length(expr)
Arguments
expr : A BINARY or STRING expression.
Returns
The result type is INTEGER.
The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
This function is a synonym for character_length function and length function.
Examples
> SELECT char_length('Spark SQL ');
10
> SELECT char_length('床前明月光');
5
Related functions
character_length function
length function
character_length function
7/21/2022 • 2 minutes to read
Returns the character length of string data or number of bytes of binary data.
Syntax
character_length(expr)
Arguments
expr : A BINARY or STRING expression.
Returns
The result type is INTEGER.
The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
This function is a synonym for char_length function and length function.
Examples
> SELECT character_length('Spark SQL ');
10
> SELECT character_length('床前明月光');
5
Related functions
char_length function
length function
chr function
7/21/2022 • 2 minutes to read
Syntax
chr(expr)
Arguments
expr : An expression that evaluates to an integral numeric.
Returns
The result type is STRING.
If the argument is less than 0, an empty string is returned. If the argument is larger than 255 , it is treated as
modulo 256. This implies char covers the ASCII and Latin-1 Supplement range of UTF-16.
This function is a synonym for char function.
Examples
> SELECT chr(65);
A
Related functions
char function
ascii function
coalesce function
7/21/2022 • 2 minutes to read
Syntax
coalesce(expr1 [, ...] )
Arguments
exprN : Any expression that shares a least common type across all exprN.
Returns
The result type is the least common type of the arguments.
There must be at least one argument. Unlike for regular functions where all arguments are evaluated before
invoking the function, coalesce evaluates arguments left to right until a non-null value is found. If all
arguments are NULL , the result is NULL .
Examples
> SELECT coalesce(NULL, 1, NULL);
1
> SELECT coalesce(NULL, 5 / 0);
Division by zero
Related
nvl function
nvl2 function
SQL data type rules
collect_list aggregate function
7/21/2022 • 2 minutes to read
Syntax
collect_list ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression of any type.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
An ARRAY of the argument type.
The order of elements in the array is non-deterministic. NULL values are excluded.
If DISTINCT is specified the function collects only unique values and is a synonym for collect_set aggregate
function
This function is a synonym for array_agg
Examples
> SELECT collect_list(col) FROM VALUES (1), (2), (NULL), (1) AS tab(col);
[1,2,1]
> SELECT collect_list(DISTINCT col) FROM VALUES (1), (2), (NULL), (1) AS tab(col);
[1,2]
Related functions
array_agg aggregate function
array function
collect_set aggregate function
collect_set aggregate function
7/21/2022 • 2 minutes to read
Returns an array consisting of all unique values in expr within the group.
Syntax
collect_set(expr) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression of any type.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
An ARRAY of the argument type.
The order of elements in the array is non-deterministic. NULL values are excluded.
Examples
> SELECT collect_set(col) FROM VALUES (1), (2), (NULL), (1) AS tab(col);
[1,2]
> SELECT collect_set(col1) FILTER(WHERE col2 = 10)
FROM VALUES (1, 10), (2, 10), (NULL, 10), (1, 10), (3, 12) AS tab(col1, col2);
[1,2]
Related functions
array function
collect_list aggregate function
:: (colon colon sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr :: type
Arguments
expr : Any castable expression.
Returns
The result is type type .
This operator is a synonym for cast(expr AS type) . See cast function for a detailed description.
Examples
> SELECT '20.5'::INTEGER;
20
> SELECT typeof(NULL::STRING);
string
Related functions
cast function
try_cast function
concat function
7/21/2022 • 2 minutes to read
Syntax
concat(expr1, expr2 [, ...] )
Arguments
exprN : Expressions which are all STRING, all BINARY or all ARRAYs of STRING or BINARY.
Returns
The result type matches the argument types.
There must be at least one argument. This function is a synonym for || (pipe pipe sign) operator.
Examples
> SELECT concat('Spark', 'SQL');
SparkSQL
> SELECT concat(array(1, 2, 3), array(4, 5), array(6));
[1,2,3,4,5,6]
Related functions
|| (pipe pipe sign) operator
array_join function
array_union function
concat_ws function
concat_ws function
7/21/2022 • 2 minutes to read
Syntax
concat_ws(sep [, expr1 [, ...] ])
Arguments
sep : An STRING expression.
exprN : Each exprN can be either a STRING or an ARRAY of STRING.
Returns
The result type is STRING.
If sep is NULL the result is NULL. exprN that are NULL are ignored. If only the separator is provided, or all
exprN are NULL, the result is an empty string.
Examples
> SELECT concat_ws(' ', 'Spark', 'SQL');
Spark SQL
> SELECT concat_ws('s');
''
> SELECT concat_ws(',', 'Spark', array('S', 'Q', NULL, 'L'), NULL);
Spark,S,Q,L
Related functions
|| (pipe pipe sign) operator
concat function
array_join function
contains function
7/21/2022 • 2 minutes to read
Syntax
contains(expr, subExpr)
Arguments
expr : A STRING or BINARY within which to search.
subExpr : The STRING or BINARY to search for.
Returns
A BOOLEAN. If expr or subExpr are NULL , the result is NULL . If subExpr is the empty string or empty binary
the result is true .
Since: Databricks Runtime 10.5
The function operates in BINARY mode if both arguments are BINARY.
Examples
> SELECT contains(NULL, 'Spark');
NULL
Related
array_contains function
conv function
7/21/2022 • 2 minutes to read
Syntax
conv(num, fromBase, toBase)
Arguments
num : A STRING expression expressing a number in fromBase .
fromBase : An INTEGER expression denoting the source base.
toBase : An INTEGER expression denoting the target base.
Returns
A STRING.
The function supports base 2 to base 36. The digit ‘A’ (or ‘a’) represents decimal 10 and ‘Z’ (or ‘z’) represents
decimal 35.
Examples
> SELECT conv('100', 2, 10);
4
> SELECT conv('-10', 16, 10);
-16
Related functions
corr aggregate function
7/21/2022 • 2 minutes to read
Syntax
corr ( [ALL | DISTINCT] expr1, expr2 ) [FILTER ( WHERE cond ) ]
Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr1 , expr2 pairs.
Examples
> SELECT corr(c1, c2) FROM VALUES (3, 2), (3, 3), (3, 3), (6, 4) as tab(c1, c2);
0.816496580927726
> SELECT corr(DISTINCT c1, c2) FROM VALUES (3, 2), (3, 3), (3, 3), (6, 4) as tab(c1, c2);
0.8660254037844387
> SELECT corr(DISTINCT c1, c2) FILTER(WHERE c1 != c2)
FROM VALUES (3, 2), (3, 3), (3, 3), (6, 4) as tab(c1, c2);
1.0
Related functions
cos function
7/21/2022 • 2 minutes to read
Syntax
cos(expr)
Arguments
expr : An expression that evaluates to a numeric expressing the angle in radians.
Returns
A DOUBLE.
Examples
> SELECT cos(0);
1.0
> SELECT cos(pi());
-1.0
Related functions
sin function
tan function
acos function
cosh function
cosh function
7/21/2022 • 2 minutes to read
Syntax
cosh(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
Examples
> SELECT cosh(0);
1.0
Related functions
sinh function
cos function
cot function
7/21/2022 • 2 minutes to read
Syntax
cot(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
Examples
> SELECT cot(1);
0.6420926159343306
Related functions
cos function
cosh function
tan function
tanh function
count aggregate function
7/21/2022 • 2 minutes to read
Syntax
count ( [DISTINCT | ALL] * ) [FILTER ( WHERE cond ) ]
count ( [DISTINCT | ALL] expr [, ...] ) [FILTER ( WHERE cond ) ]
Arguments
expr : Any expression.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A BIGINT.
If * is specified, rows containing NULL values are also counted.
If expr is specified, only rows for which all expr are not NULL are counted.
If DISTINCT is specified, duplicate rows are not counted.
Examples
> SELECT count(*) FROM VALUES (NULL), (5), (5), (20) AS tab(col);
4
> SELECT count(col) FROM VALUES (NULL), (5), (5), (20) AS tab(col);
3
> SELECT count(col) FILTER(WHERE col < 10)
FROM VALUES (NULL), (5), (5), (20) AS tab(col);
2
> SELECT count(DISTINCT col) FROM VALUES (NULL), (5), (5), (10) AS tab(col);
2
Related functions
avg aggregate function
sum aggregate function
min aggregate function
max aggregate function
count_if aggregate function
count_if aggregate function
7/21/2022 • 2 minutes to read
Syntax
count_if ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : A BOOLEAN expression.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A BIGINT.
count_if(expr) FILTER(WHERE cond) is equivalent to count_if(expr AND cond) .
If DISTINCT is specified only unique rows are counted.
Examples
> SELECT count_if(col % 2 = 0) FROM VALUES (NULL), (0), (1), (2), (2), (3) AS tab(col);
3
> SELECT count_if(DISTINCT col % 2 = 0) FROM VALUES (NULL), (0), (1), (2), (2), (3) AS tab(col);
2
> SELECT count_if(col IS NULL) FROM VALUES (NULL), (0), (1), (2), (3) AS tab(col);
1
Related functions
avg aggregate function
sum aggregate function
min aggregate function
max aggregate function
count aggregate function
count_min_sketch aggregate function
7/21/2022 • 2 minutes to read
Returns a count-min sketch of all values in the group in expr with the epsilon , confidence and seed .
Syntax
count_min_sketch ( [ALL | DISTINCT] expr, epsilon, confidence, seed ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to an integral numeric, STRING, or BINARY.
epsilon : A DOUBLE literal greater than 0 describing the relative error.
confidence : A DOUBLE literal greater than 0 and less than 1.
seed : An INTEGER literal.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A BINARY.
Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
If DISTINCT is specified the function operates only on a unique set of expr values.
Examples
> SELECT hex(count_min_sketch(col, 0.5d, 0.5d, 1)) FROM VALUES (1), (2), (1) AS tab(col);
0000000100000000000000030000000100000004000000005D8D6AB9000000000000000000000000000000020000000000000001000000000000000
> SELECT hex(count_min_sketch(DISTINCT col, 0.5d, 0.5d, 1)) FROM VALUES (1), (2), (1) AS tab(col);
0000000100000000000000020000000100000004000000005D8D6AB9000000000000000000000000000000010000000000000001000000000000000
Related functions
covar_pop aggregate function
7/21/2022 • 2 minutes to read
Syntax
covar_pop ( [ALL | DISTINCT] expr1, expr2 ) [FILTER ( WHERE cond ) ]
Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr1 , expr2 pairs.
Examples
> SELECT covar_pop(c1, c2) FROM VALUES (1,1), (2,2), (2,2), (3,3) AS tab(c1, c2);
0.5
> SELECT covar_pop(DISTINCT c1, c2) FROM VALUES (1,1), (2,2), (2,2), (3,3) AS tab(c1, c2);
0.6666666666666666
Related functions
covar_samp aggregate function
covar_samp aggregate function
7/21/2022 • 2 minutes to read
Syntax
covar_samp ( [ALL | DISTINCT] expr1, expr2 ) [FILTER ( WHERE cond ) ]
Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr1 , expr2 pairs.
Examples
> SELECT covar_samp(c1, c2) FROM VALUES (1,1), (2,2), (2, 2), (3,3) AS tab(c1, c2);
0.6666666666666666
> SELECT covar_samp(DISTINCT c1, c2) FROM VALUES (1,1), (2,2), (2, 2), (3,3) AS tab(c1, c2);
1.0
Related functions
covar_pop aggregate function
crc32 function
7/21/2022 • 2 minutes to read
Syntax
crc32(expr)
Arguments
expr : A BINARY expression.
Returns
A BIGINT.
Examples
> SELECT crc32('Spark');
1557323817
Related functions
hash function
md5 function
sha function
sha1 function
sha2 function
csc function
7/21/2022 • 2 minutes to read
Syntax
csc(expr)
Arguments
expr : An expression that evaluates to a numeric expressing the angle in radians.
Returns
A DOUBLE.
csc(expr) is equivalent to 1 / sin(expr) .
Examples
> SELECT csc(pi() / 2);
1.0
Related functions
acos function
cos function
cosh function
csc function
sin function
tan function
cube function
7/21/2022 • 2 minutes to read
Syntax
cube (expr1 [, ...] )
Arguments
exprN : Any expression that can be grouped.
Returns
The function must be the only grouping expression in the GROUP BY clause. See GROUP BY clause for details.
Examples
> SELECT name, age, count(*) FROM VALUES (2, 'Alice'), (5, 'Bob') people(age, name) GROUP BY cube(name, age);
Bob 5 1
Alice 2 1
Alice NULL 1
NULL 2 1
NULL NULL 2
Bob NULL 1
NULL 5 1
Related functions
GROUP BY clause
cume_dist analytic window function
7/21/2022 • 2 minutes to read
Syntax
cume_dist()
Arguments
This function takes no arguments.
Returns
A DOUBLE.
The OVER clause of the window function must include an ORDER BY clause. If the order is not unique the
duplicates share the same relative later position. cume_dist() over(order by expr) is similar, but not identical, to
rank() over(order by expr) / count(*) since the rank ranking window function produces the earliest absolute
order.
Examples
> SELECT a, b, cume_dist() OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3),
('A1', 1) tab(a, b);
A1 1 0.6666666666666666
A1 1 0.6666666666666666
A1 2 1.0
A2 3 1.0
Related functions
rank ranking window function
dense_rank ranking window function
row_number ranking window function
Window functions
current_catalog function
7/21/2022 • 2 minutes to read
Syntax
current_catalog()
Arguments
This function takes no arguments.
Returns
A STRING.
Examples
> SELECT current_catalog();
spark_catalog
Related functions
current_schema function
current_database function
7/21/2022 • 2 minutes to read
Syntax
current_database()
Arguments
This function takes no arguments.
Returns
A STRING.
This function is an alias for current_schema function.
Examples
> SELECT current_database();
default
Related functions
current_catalog function
current_schema function
current_date function
7/21/2022 • 2 minutes to read
Syntax
current_date()
Arguments
This function takes no arguments.
Returns
A DATE.
The parentheses are optional.
Examples
> SELECT current_date();
2020-04-25
> SELECT current_date;
2020-04-25
Related functions
current_timestamp function
current_timezone function
now function
current_schema function
7/21/2022 • 2 minutes to read
Syntax
current_schema()
Arguments
This function takes no arguments.
Returns
A STRING.
Examples
> SELECT current_schema();
default
Related functions
current_catalog function
current_timestamp function
7/21/2022 • 2 minutes to read
Syntax
current_timestamp()
Arguments
This function takes no arguments.
Returns
A TIMESTAMP.
The parentheses are optional.
Examples
> SELECT current_timestamp();
2020-04-25 15:49:11.914
> SELECT current_timestamp;
2020-04-25 15:49:11.914
Related functions
current_date function
current_timezone function
now function
current_timezone function
7/21/2022 • 2 minutes to read
Syntax
current_timezone()
Arguments
This function takes no arguments.
Returns
A STRING.
Examples
> SELECT current_timezone();
Asia/Shanghai
Related functions
current_date function
current_timestamp function
now function
current_user function
7/21/2022 • 2 minutes to read
Syntax
current_user()
Arguments
This function takes no arguments.
Returns
A STRING.
The parentheses are optional.
Examples
> SELECT current_user();
user1
> SELECT current_user;
user1
Related functions
is_member function
current_version function
7/21/2022 • 2 minutes to read
Syntax
current_version()
Arguments
This function takes no arguments.
Returns
A STRUCT with the following fields:
dbr_version : A STRING with the current version of Databricks Runtime.
dbsql_version : A NULL STRING in Databricks Runtime.
u_build_hash : A STRING used by Azure Databricks support.
r_build_hash : A STRING used by Azure Databricks support.
Examples
> SELECT current_version().dbr_version;
11.0
Related functions
version function
date function
7/21/2022 • 2 minutes to read
Syntax
date(expr)
Arguments
expr : An expression that can be cast to DATE.
Returns
A DATE.
This function is a synonym for CAST(expr AS DATE) .
See cast function for details.
Examples
> SELECT date('2021-03-21');
2021-03-21
Related functions
cast function
date_add function
7/21/2022 • 2 minutes to read
Syntax
date_add(startDate, numDays)
Arguments
startDate : A DATE expression.
numDays : An INTEGER expression.
Returns
A DATE.
If numDays is negative, abs(numDays) days are subtracted from startDate .
If the result date overflows the date range the function raises an error.
Examples
> SELECT date_add('2016-07-30', 1);
2016-07-31
Related functions
date_from_unix_date function
date_sub function
datediff function
months_between function
timestampadd function
date_format function
7/21/2022 • 2 minutes to read
Syntax
date_format(expr, fmt)
Arguments
expr : A DATE, TIMESTAMP, or a STRING in a valid datetime format.
fmt : A STRING expression describing the desired format.
Returns
A STRING.
See Datetime patterns for details on valid formats.
Examples
> SELECT date_format('2016-04-08', 'y');
2016
Related functions
Datetime patterns
date_from_unix_date function
7/21/2022 • 2 minutes to read
Syntax
date_from_unix_date(days)
Arguments
days : An INTEGER expression.
Returns
A DATE.
If days is negative the days are subtracted from 1970-01-01 .
This function is a synonym for date_add(DATE'1970-01-01', days) .
Examples
> SELECT date_from_unix_date(1);
1970-01-02
Related functions
date_add function
date_sub function
date_part function
7/21/2022 • 2 minutes to read
Syntax
date_part(field, expr)
Arguments
field : A STRING literal. See extract function for details.
expr : A DATE, TIMESTAMP, or INTERVAL expression.
Returns
If field is ‘SECOND’, a DECIMAL(8, 6) . In all other cases, an INTEGER.
The date_part function is a synonym for extract(field FROM expr) .
Examples
> SELECT date_part('YEAR', TIMESTAMP'2019-08-12 01:00:00.123456');
2019
> SELECT date_part('WEEK', TIMESTAMP'2019-08-12 01:00:00.123456');
33
> SELECT date_part('DOY', DATE'2019-08-12');
224
> SELECT date_part('SECONDS', TIMESTAMP'2019-10-01 00:00:01.000001');
1.000001
> SELECT date_part('MONTHS', INTERVAL '2-11' YEAR TO MONTH);
11
> SELECT date_part('SECONDS', INTERVAL '5:00:30.001' HOUR TO SECOND);
30.001000
Related functions
extract function
date_sub function
7/21/2022 • 2 minutes to read
Syntax
date_sub(startDate, numDays)
Arguments
startDate : A DATE expression.
numDays : An INTEGER expression.
Returns
A DATE.
If numDays is negative, abs(numDays) days are added to startDate .
If the result date overflows the date range the function raises an error.
Examples
> SELECT date_sub('2016-07-30', 1);
2016-07-29
Related functions
date_add function
date_from_unix_date function
datediff function
months_between function
timestampadd function
date_trunc function
7/21/2022 • 2 minutes to read
Syntax
date_trunc(field, expr)
Arguments
field : A STRING literal.
expr : A DATE, TIMESTAMP, or STRING with a valid timestamp format.
Returns
A TIMESTAMP.
Valid units for field are:
‘YEAR’, ‘YYYY’, ‘YY’: truncate to the first date of the year that the expr falls in; the time part is zeroed out
‘QUARTER’: truncate to the first date of the quarter that the expr falls in; the time part is zeroed out
‘MONTH’, ‘MM’, ‘MON’: truncate to the first date of the month that the expr falls in; the time part is
zeroed out
‘WEEK’: truncate to the Monday of the week that the expr falls in; the time part is zeroed out
‘DAY’, ‘DD’: zero out the time part
‘HOUR’: zero out the minutes and seconds with fraction part
‘MINUTE’: zero out the seconds with fraction part
‘SECOND’: zero out the fraction part of the seconds
‘MILLISECOND’: zero out the microseconds
‘MICROSECOND’: everything remains
Examples
> SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359');
2015-01-01 00:00:00
> SELECT date_trunc('MM', '2015-03-05T09:32:05.359');
2015-03-01 00:00:00
> SELECT date_trunc('DD', '2015-03-05T09:32:05.359');
2015-03-05 00:00:00
> SELECT date_trunc('HOUR', '2015-03-05T09:32:05.359');
2015-03-05 09:00:00
> SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05.123456');
2015-03-05 09:32:05.123
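An additional sketch of the ‘WEEK’ unit described above, assuming standard calendar arithmetic
(2015-03-05 falls on a Thursday, so the preceding Monday is 2015-03-02):
> SELECT date_trunc('WEEK', '2015-03-05T09:32:05.359');
2015-03-02 00:00:00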
Related functions
trunc function
dateadd function
7/21/2022 • 2 minutes to read
Syntax
dateadd(unit, value, expr)
unit
{ MICROSECOND |
MILLISECOND |
SECOND |
MINUTE |
HOUR |
DAY | DAYOFYEAR |
WEEK |
MONTH |
QUARTER |
YEAR }
Arguments
unit : A unit of measure.
value : A numeric expression with the number of units to add to expr .
expr : A TIMESTAMP expression.
Returns
A TIMESTAMP.
If value is negative it is subtracted from the expr . If unit is MONTH , QUARTER , or YEAR the day portion of the
result will be adjusted to result in a valid date.
The function returns an overflow error if the result is beyond the supported range of timestamps.
dateadd is a synonym for timestampadd.
Examples
> SELECT dateadd(MICROSECOND, 5, TIMESTAMP'2022-02-28 00:00:00');
2022-02-28 00:00:00.000005
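A sketch of the day-adjustment rule described above, assuming the result is clamped to the last valid day of
the month because 2022-02-31 does not exist:
-- Adding one month to January 31 is expected to yield the last day of February.
> SELECT dateadd(MONTH, 1, TIMESTAMP'2022-01-31 00:00:00');
2022-02-28 00:00:00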
Related functions
add_months function
date_add function
date_sub function
timestamp function
timestampadd function
datediff function
7/21/2022 • 2 minutes to read
Syntax
datediff(endDate, startDate)
Arguments
endDate : A DATE expression.
startDate : A DATE expression.
Returns
An INTEGER.
If endDate is before startDate the result is negative.
To measure the difference between two dates in units other than days use datediff (timestamp) function.
Examples
> SELECT datediff('2009-07-31', '2009-07-30');
1
> SELECT datediff('2009-07-30', '2009-07-31');
-1
Related functions
date_add function
date_sub function
datediff (timestamp) function
datediff (timestamp) function
7/21/2022 • 2 minutes to read
Syntax
datediff(unit, start, end)
unit
{ MICROSECOND |
MILLISECOND |
SECOND |
MINUTE |
HOUR |
DAY |
WEEK |
MONTH |
QUARTER |
YEAR }
Arguments
unit : A unit of measure.
start : A starting TIMESTAMP expression.
end : An ending TIMESTAMP expression.
Returns
A BIGINT.
If start is greater than end the result is negative.
The function counts whole elapsed units based on UTC with a DAY being 86400 seconds.
One month is considered elapsed when the calendar month has increased and the calendar day and time are
equal to or greater than the start. Weeks, quarters, and years follow from that.
datediff (timestamp) is a synonym for timestampdiff function.
Examples
-- One second shy of a month elapsed
> SELECT datediff(MONTH, TIMESTAMP'2021-02-28 12:00:00', TIMESTAMP'2021-03-28 11:59:59');
0
-- One month has passed even though it's not the end of the month yet, because the day and time line up.
> SELECT datediff(MONTH, TIMESTAMP'2021-02-28 12:00:00', TIMESTAMP'2021-03-28 12:00:00');
1
Related functions
add_months function
date_add function
date_sub function
datediff function
timestamp function
timestampadd function
day function
7/21/2022 • 2 minutes to read
Syntax
day(expr)
Arguments
expr : A DATE or TIMESTAMP expression.
Returns
An INTEGER.
This function is a synonym for extract(DAY FROM expr) .
Examples
> SELECT day('2009-07-30');
30
Related functions
dayofmonth function
dayofweek function
dayofyear function
hour function
minute function
second function
extract function
dayofmonth function
7/21/2022 • 2 minutes to read
Syntax
dayofmonth(expr)
Arguments
expr : A DATE or TIMESTAMP expression.
Returns
An INTEGER.
This function is a synonym for extract(DAY FROM expr) .
Examples
> SELECT dayofmonth('2009-07-30');
30
Related functions
day function
dayofweek function
dayofyear function
hour function
minute function
second function
extract function
dayofweek function
7/21/2022 • 2 minutes to read
Syntax
dayofweek(expr)
Arguments
expr : A DATE or TIMESTAMP expression.
Returns
An INTEGER where 1 = Sunday , and 7 = Saturday .
This function is a synonym for extract(DAYOFWEEK FROM expr) .
Examples
> SELECT dayofweek('2009-07-30');
5
Related functions
day function
dayofmonth function
dayofyear function
hour function
minute function
second function
extract function
weekday function
dayofyear function
7/21/2022 • 2 minutes to read
Syntax
dayofyear(expr)
Arguments
expr : A DATE or TIMESTAMP expression.
Returns
An INTEGER.
This function is a synonym for extract(DOY FROM expr) .
Examples
> SELECT dayofyear('2016-04-09');
100
Related functions
day function
dayofmonth function
dayofweek function
hour function
minute function
second function
extract function
weekday function
decimal function
7/21/2022 • 2 minutes to read
Syntax
decimal(expr)
Arguments
expr : An expression that can be cast to DECIMAL.
Returns
The result is DECIMAL(10, 0).
This function is a synonym for CAST(expr AS decimal(10, 0)) .
Examples
> SELECT decimal('5.2');
5
Related functions
cast function
decode function
7/21/2022 • 2 minutes to read
Syntax
decode(expr, { key1, value1 } [, ...] [, defValue])
Arguments
expr : Any expression of a comparable type.
keyN : An expression that matches the type of expr .
valueN : An expression that shares a least common type with defValue and the other valueN s.
defValue : An optional expression that shares a least common type with valueN .
Returns
The result is of the least common type of the valueN and defValue .
The function returns the first valueN for which keyN matches expr. For this function NULL matches NULL. If
no keyN matches expr , defValue is returned if it exists. If no defValue was specified the result is NULL.
Examples
> SELECT decode(5, 6, 'Spark', 5, 'SQL', 4, 'rocks');
SQL
> SELECT decode(NULL, 6, 'Spark', NULL, 'SQL', 4, 'rocks');
SQL
> SELECT decode(7, 6, 'Spark', 5, 'SQL', 'rocks');
rocks
Related functions
case expression
decode (character set) function
decode (character set) function
7/21/2022 • 2 minutes to read
Translates binary expr to a string using the character set encoding charSet .
Syntax
decode(expr, charSet)
Arguments
expr : A BINARY expression encoded in charset .
charSet : A STRING expression.
Returns
A STRING.
If charSet does not match the encoding the result is undefined. charSet must be one of (case insensitive):
‘US-ASCII’
‘ISO-8859-1’
‘UTF-8’
‘UTF-16BE’
‘UTF-16LE’
‘UTF-16’
Examples
> SELECT encode('Spark SQL', 'UTF-16');
[FE FF 00 53 00 70 00 61 00 72 00 6B 00 20 00 53 00 51 00 4C]
> SELECT decode(X'FEFF0053007000610072006B002000530051004C', 'UTF-16')
Spark SQL
Related functions
encode function
decode function
degrees function
7/21/2022 • 2 minutes to read
Syntax
degrees(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
Given an angle in radians, this function returns the associated degrees.
Examples
> SELECT degrees(3.141592653589793);
180.0
Related functions
radians function
dense_rank ranking window function
7/21/2022 • 2 minutes to read
Syntax
dense_rank()
Arguments
This function takes no arguments.
Returns
An INTEGER.
The OVER clause of the window function must include an ORDER BY clause. Unlike the function rank ranking
window function, dense_rank will not produce gaps in the ranking sequence. Unlike row_number ranking
window function, dense_rank does not break ties. If the order is not unique the duplicates share the same
relative later position.
Examples
> SELECT a,
b,
dense_rank() OVER(PARTITION BY a ORDER BY b),
rank() OVER(PARTITION BY a ORDER BY b),
row_number() OVER(PARTITION BY a ORDER BY b)
FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
A1 1 1 1 1
A1 1 1 1 2
A1 2 2 3 3
A2 3 1 1 1
Related functions
rank ranking window function
row_number ranking window function
cume_dist analytic window function
Window functions
div operator
7/21/2022 • 2 minutes to read
Syntax
dividend div divisor
Arguments
dividend : An expression that evaluates to a numeric or interval.
divisor : A matching interval type if dividend is an interval, a numeric otherwise.
Returns
A BIGINT.
If divisor is 0 , INTERVAL '0' SECOND or INTERVAL '0' MONTH the operator raises a DIVIDE_BY_ZERO error.
NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of raising an error on division by 0.
Examples
> SELECT 3 div 2;
1
> SELECT -5.9 div 1;
-5
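A sketch of interval operands as described in the arguments above, assuming interval division behaves as
stated (100 hours contains 600 ten-minute units):
> SELECT INTERVAL '100' HOUR div INTERVAL '10' MINUTE;
600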
Related functions
/ (slash sign) operator
* (asterisk sign) operator
+ (plus sign) operator
- (minus sign) operator
double function
7/21/2022 • 2 minutes to read
Syntax
double(expr)
Arguments
expr : An expression that can be cast to DOUBLE.
Returns
A DOUBLE.
This function is a synonym for CAST(expr AS DOUBLE) .
See cast function for details.
Examples
> SELECT double('5.2');
5.2
Related functions
cast function
e function
7/21/2022 • 2 minutes to read
Syntax
e()
Arguments
This function takes no arguments.
Returns
The result is Euler’s number e as a DOUBLE.
Examples
> SELECT e();
2.7182818284590455
Related functions
exp function
pi function
element_at function
7/21/2022 • 2 minutes to read
Syntax
element_at(arrayExpr, index)
element_at(mapExpr, key)
Arguments
arrayExpr : An ARRAY expression.
index : An INTEGER expression.
mapExpr : A MAP expression.
key : An expression matching the type of the keys of mapExpr
Returns
If the first argument is an ARRAY:
The result is of the type of the elements of arrayExpr .
abs(index) must be between 1 and the length of the array.
If index is negative the function accesses elements from the last to the first.
The function raises INVALID_ARRAY_INDEX_IN_ELEMENT_AT error if abs(index) exceeds the length of the
array.
If the first argument is a MAP and key cannot be matched to an entry in mapExpr the function raises a
MAP_KEY_DOES_NOT_EXIST error.
NOTE
If spark.sql.ansi.failOnElementNotExists is false the function returns NULL instead of raising errors.
Examples
> SELECT element_at(array(1, 2, 3), 2);
2
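Two further sketches of the negative-index and MAP forms described above:
-- A negative index accesses elements from the end of the array.
> SELECT element_at(array(1, 2, 3), -1);
3
-- The MAP form looks up the value for the given key.
> SELECT element_at(map(1, 'a', 2, 'b'), 2);
b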
Related functions
array_contains function
array_position function
try_element_at function
elt function
7/21/2022 • 2 minutes to read
Syntax
elt(index, expr1 [, ...])
Arguments
index : An INTEGER expression greater than 0.
exprN : Any expression that shares a least common type with all exprN .
Returns
The result has the type of the least common type of the exprN .
index must be between 1 and the number of exprN . If index is out of bounds, an INVALID_ARRAY_INDEX error is
raised.
NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error if the index is out of bounds.
Examples
> SELECT elt(1, 'scala', 'java');
scala
Related functions
element_at function
encode function
7/21/2022 • 2 minutes to read
Returns the binary representation of a string using the charSet character encoding.
Syntax
encode(expr, charSet)
Arguments
expr : A STRING expression to be encoded.
charSet : A STRING expression specifying the encoding.
Returns
A BINARY.
The charset must be one of (case insensitive):
‘US-ASCII’
‘ISO-8859-1’
‘UTF-8’
‘UTF-16BE’
‘UTF-16LE’
‘UTF-16’
Examples
> SELECT encode('Spark SQL', 'UTF-16');
[FE FF 00 53 00 70 00 61 00 72 00 6B 00 20 00 53 00 51 00 4C]
> SELECT decode(X'FEFF0053007000610072006B002000530051004C', 'UTF-16')
Spark SQL
Related functions
decode (character set) function
endswith function
7/21/2022 • 2 minutes to read
Syntax
endswith(expr, endExpr)
Arguments
expr : A STRING or BINARY expression.
endExpr : A STRING or BINARY expression which is compared to the end of expr .
Returns
A BOOLEAN.
If expr or endExpr is NULL , the result is NULL .
If endExpr is the empty string or empty binary the result is true .
Since: Databricks Runtime 10.5
The function operates in BINARY mode if both arguments are BINARY.
Examples
> SELECT endswith('SparkSQL', 'SQL');
true
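A sketch of the empty-string rule described above:
> SELECT endswith('SparkSQL', '');
true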
Related
contains function
startswith function
substr function
== (eq eq sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 == expr2
Arguments
expr1 : An expression of any comparable type.
expr2 : An expression sharing a least common type with expr1 .
Returns
A BOOLEAN.
This function is a synonym for = (eq sign) operator.
Examples
> SELECT 2 == 2;
true
Related
< (lt sign) operator
<= (lt eq sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
!= (bangeq sign) operator
== (eq eq sign) operator
<> (lt gt sign) operator
SQL data type rules
= (eq sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 = expr2
Arguments
expr1 : An expression of any comparable type.
expr2 : An expression sharing a least common type with expr1 .
Returns
A BOOLEAN.
This function is a synonym for == (eq eq sign) operator.
Examples
> SELECT 2 = 2;
true
Related
< (lt sign) operator
<= (lt eq sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
!= (bangeq sign) operator
== (eq eq sign) operator
<> (lt gt sign) operator
SQL data type rules
every aggregate function
7/21/2022 • 2 minutes to read
Syntax
every(expr) [FILTER ( WHERE cond ) ]
Arguments
expr : A BOOLEAN expression.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A BOOLEAN.
This function is a synonym for bool_and aggregate function.
Examples
> SELECT every(col) FROM VALUES (true), (true), (true) AS tab(col);
true
> SELECT every(col) FROM VALUES (NULL), (true), (true) AS tab(col);
true
> SELECT every(col) FROM VALUES (true), (false), (true) AS tab(col);
false
> SELECT every(col1) FILTER(WHERE col2 = 1)
FROM VALUES (true, 1), (false, 2), (true, 1) AS tab(col1, col2);
true
Related functions
bool_and aggregate function
bool_or aggregate function
some aggregate function
exists function
7/21/2022 • 2 minutes to read
Returns true if func is true for any element in expr or query returns at least one row.
Syntax
exists(expr, func)
exists(query)
Arguments
expr : An ARRAY expression.
func : A lambda function.
query : Any SELECT.
Returns
A BOOLEAN.
The lambda function must result in a boolean and operate on one parameter, which represents an element in the
array.
exists(query) can only be used in the WHERE clause and a few other specific cases.
Examples
> SELECT exists(array(1, 2, 3), x -> x % 2 == 0);
true
> SELECT exists(array(1, 2, 3), x -> x % 2 == 10);
false
> SELECT exists(array(1, NULL, 3), x -> x % 2 == 0);
NULL
> SELECT exists(array(0, NULL, 2, 3, NULL), x -> x IS NULL);
true
> SELECT exists(array(1, 2, 3), x -> x IS NULL);
false
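A sketch of the exists(query) form described above in a WHERE clause with a correlated subquery; the aliases
t and s are illustrative only:
> SELECT c1 FROM VALUES (1), (2) AS t(c1)
    WHERE exists(SELECT * FROM VALUES (1) AS s(c2) WHERE s.c2 = t.c1);
1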
Related functions
filter function
array_contains function
exp function
7/21/2022 • 2 minutes to read
Syntax
exp(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
Examples
> SELECT exp(0);
1.0
> SELECT exp(1);
2.7182818284590455
Related functions
e function
expm1 function
ln function
pow function
explode table-valued generator function
7/21/2022 • 2 minutes to read
Syntax
explode(expr)
Arguments
expr : An ARRAY or MAP expression.
Returns
A set of rows composed of the other expressions in the select list and either the elements of the array or the
keys and values of the map. If expr is NULL no rows are produced.
explode can only be placed in the select list or a LATERAL VIEW. When placing the function in the SELECT list
there must be no other generator function in the same SELECT list.
The column produced by explode of an array is named col by default, but can be aliased. The columns for a
map are by default called key and value . They can also be aliased using an alias tuple such as
AS (myKey, myValue) .
Examples
> SELECT explode(array(10, 20)) AS elem, 'Spark';
10 Spark
20 Spark
> SELECT explode(map(1, 'a', 2, 'b')) AS (num, val), 'Spark';
1 a Spark
2 b Spark
Related functions
explode_outer table-valued generator function
posexplode table-valued generator function
posexplode_outer table-valued generator function
inline table-valued generator function
inline_outer table-valued generator function
explode_outer table-valued generator function
7/21/2022 • 2 minutes to read
Syntax
explode_outer(expr)
Arguments
expr : An ARRAY or MAP expression.
Returns
A set of rows composed of the other expressions in the select list and either the elements of the array or the
keys and values of the map. If expr is NULL a single row with NULLs for the array or map values is produced.
explode_outer can only be placed in the select list or a LATERAL VIEW. When placing the function in the select
list there must be no other generator function in the same select list.
The column produced by explode_outer of an array is named col by default, but can be aliased. The columns for a
map are by default called key and value . They can also be aliased using an alias tuple such as
AS (myKey, myValue) .
Examples
> SELECT explode_outer(array(10, 20)) AS elem, 'Spark';
10 Spark
20 Spark
Related functions
explode table-valued generator function
posexplode table-valued generator function
posexplode_outer table-valued generator function
inline table-valued generator function
inline_outer table-valued generator function
expm1 function
7/21/2022 • 2 minutes to read
Returns exp(expr) - 1 .
Syntax
expm1(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
Examples
> SELECT expm1(0);
0.0
Related functions
e function
exp function
extract function
7/21/2022 • 2 minutes to read
Syntax
extract(field FROM source)
Arguments
field : A keyword that selects which part of source should be extracted.
source : A DATE, TIMESTAMP, or INTERVAL expression.
Returns
If field is SECOND , a DECIMAL(8, 6) . In all other cases, an INTEGER.
Supported values of field when source is DATE or TIMESTAMP are:
“YEAR”, (“Y”, “YEARS”, “YR”, “YRS”) - the year field
“YEAROFWEEK” - the ISO 8601 week-numbering year that the datetime falls in. For example, 2005-01-02 is
part of the 53rd week of year 2004, so the result is 2004
“QUARTER”, (“QTR”) - the quarter (1 - 4) of the year that the datetime falls in
“MONTH”, (“MON”, “MONS”, “MONTHS”) - the month field (1 - 12)
“WEEK”, (“W”, “WEEKS”) - the number of the ISO 8601 week-of-week-based-year. A week is considered to
start on a Monday and week 1 is the first week with >3 days. In the ISO week-numbering system, it is
possible for early-January dates to be part of the 52nd or 53rd week of the previous year, and for late-
December dates to be part of the first week of the next year. For example, 2005-01-02 is part of the 53rd
week of year 2004, while 2012-12-31 is part of the first week of 2013
“DAY”, (“D”, “DAYS”) - the day of the month field (1 - 31)
“DAYOFWEEK”,(“DOW”) - the day of the week for datetime as Sunday(1) to Saturday(7)
“DAYOFWEEK_ISO”,(“DOW_ISO”) - ISO 8601 based day of the week for datetime as Monday(1) to Sunday(7)
“DOY” - the day of the year (1 - 365/366)
“HOUR”, (“H”, “HOURS”, “HR”, “HRS”) - The hour field (0 - 23)
“MINUTE”, (“M”, “MIN”, “MINS”, “MINUTES”) - the minutes field (0 - 59)
“SECOND”, (“S”, “SEC”, “SECONDS”, “SECS”) - the seconds field, including fractional parts
Supported values of field when source is INTERVAL are:
“YEAR”, (“Y”, “YEARS”, “YR”, “YRS”) - the total months / 12
“MONTH”, (“MON”, “MONS”, “MONTHS”) - the total months % 12
“DAY”, (“D”, “DAYS”) - the days part of interval
“HOUR”, (“H”, “HOURS”, “HR”, “HRS”) - how many hours the microseconds contains
“MINUTE”, (“M”, “MIN”, “MINS”, “MINUTES”) - how many minutes left after taking hours from microseconds
“SECOND”, (“S”, “SEC”, “SECONDS”, “SECS”) - how many seconds with fractions left after taking hours and
minutes from microseconds
Examples
> SELECT extract(YEAR FROM TIMESTAMP '2019-08-12 01:00:00.123456');
2019
> SELECT extract(week FROM TIMESTAMP'2019-08-12 01:00:00.123456');
33
> SELECT extract(DOY FROM DATE'2019-08-12');
224
> SELECT extract(SECONDS FROM TIMESTAMP'2019-10-01 00:00:01.000001');
1.000001
> SELECT extract(MONTHS FROM INTERVAL '2-11' YEAR TO MONTH);
11
> SELECT extract(SECONDS FROM INTERVAL '5:00:30.001' HOUR TO SECOND);
30.001000
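A sketch contrasting “DAYOFWEEK” and “DAYOFWEEK_ISO” as described above, assuming standard calendar
arithmetic (2019-08-12 is a Monday):
> SELECT extract(DAYOFWEEK FROM DATE'2019-08-12');
2
> SELECT extract(DAYOFWEEK_ISO FROM DATE'2019-08-12');
1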
Related functions
date_part function
dayofweek function
dayofmonth function
dayofyear function
factorial function
7/21/2022 • 2 minutes to read
Syntax
factorial(expr)
Arguments
expr : An INTEGER expression between 0 and 20.
Returns
A BIGINT.
If expr is out of bounds, the function returns NULL.
Examples
> SELECT factorial(5);
120
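A sketch of the out-of-bounds rule described above:
-- 21 is outside the supported range, so NULL is expected.
> SELECT factorial(21);
NULL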
Related functions
filter function
7/21/2022 • 2 minutes to read
Syntax
filter(expr, func)
Arguments
expr : An ARRAY expression.
func : A lambda function.
Returns
The result is of the same type as expr .
The lambda function may use one or two parameters where the first parameter represents the element and the
second the index into the array.
Examples
> SELECT filter(array(1, 2, 3), x -> x % 2 == 1);
[1,3]
> SELECT filter(array(0, 2, 3), (x, i) -> x > i);
[2,3]
> SELECT filter(array(0, null, 2, 3, null), x -> x IS NOT NULL);
[0,2,3]
Related functions
exists function
forall function
map_filter function
find_in_set function
7/21/2022 • 2 minutes to read
Syntax
find_in_set(searchExpr, sourceExpr)
Arguments
searchExpr : A STRING expression specifying the “word” to be searched.
sourceExpr : A STRING expression with commas separating “words”.
Returns
An INTEGER. The result is the 1-based position of searchExpr within the comma-separated list sourceExpr ,
counted in “words” rather than characters. If no match is found for searchExpr in sourceExpr , or searchExpr
contains a comma, 0 is returned.
Examples
> SELECT find_in_set('ab','abc,b,ab,c,def');
3
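A sketch of the comma rule described above; because searchExpr contains a comma, 0 is expected:
> SELECT find_in_set('ab,c', 'abc,b,ab,c,def');
0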
Related functions
array_contains function
first aggregate function
7/21/2022 • 2 minutes to read
Syntax
first(expr[, ignoreNull]) [FILTER ( WHERE cond ) ] [ IGNORE NULLS | RESPECT NULLS ]
Arguments
expr : An expression of any type.
ignoreNull : An optional BOOLEAN literal. The default for ignoreNull is false.
cond : An optional boolean expression filtering the rows used for aggregation.
IGNORE NULLS or RESPECT NULLS : When IGNORE NULLS is used or ignoreNull is true any expr value that is
NULL is ignored. The default is RESPECT NULLS .
Returns
The result has the same type as expr .
first is a synonym for first_value aggregate function.
This function is non-deterministic.
Examples
> SELECT first(col) FROM VALUES (10), (5), (20) AS tab(col);
10
> SELECT first(col) FROM VALUES (NULL), (5), (20) AS tab(col);
NULL
> SELECT first(col) IGNORE NULLS FROM VALUES (NULL), (5), (20) AS tab(col);
5
Related functions
min aggregate function
max aggregate function
last aggregate function
first_value aggregate function
first_value aggregate function
7/21/2022 • 2 minutes to read
Syntax
first_value(expr[, ignoreNull]) [FILTER ( WHERE cond ) ] [ IGNORE NULLS | RESPECT NULLS ]
Arguments
expr : An expression of any type.
ignoreNull : An optional BOOLEAN literal. The default for ignoreNull is false.
cond : An optional boolean expression filtering the rows used for aggregation.
IGNORE NULLS or RESPECT NULLS : When IGNORE NULLS is used or ignoreNull is true any expr value that is
NULL is ignored. The default is RESPECT NULLS .
Returns
The result has the same type as expr .
first_value is a synonym for first aggregate function.
This function is non-deterministic.
Examples
> SELECT first_value(col) FROM VALUES (10), (5), (20) AS tab(col);
10
> SELECT first_value(col) FROM VALUES (NULL), (5), (20) AS tab(col);
NULL
> SELECT first_value(col) IGNORE NULLS FROM VALUES (NULL), (5), (20) AS tab(col);
5
Related functions
min aggregate function
max aggregate function
last aggregate function
first aggregate function
flatten function
7/21/2022 • 2 minutes to read
Syntax
flatten(expr)
Arguments
expr : An ARRAY of ARRAY expression.
Returns
The result matches the type of the nested arrays within expr .
Examples
> SELECT flatten(array(array(1, 2), array(3, 4)));
[1,2,3,4]
Related functions
float function
7/21/2022 • 2 minutes to read
Syntax
float(expr)
Arguments
expr : An expression that can be cast to FLOAT.
Returns
A FLOAT.
This function is a synonym for CAST(expr AS FLOAT) .
See cast function for details.
Examples
> SELECT float('5.2');
5.2
Related functions
cast function
floor function
7/21/2022 • 2 minutes to read
Returns the largest number not bigger than expr rounded down to targetScale digits relative to the decimal
point.
Syntax
floor(expr [, targetScale])
Arguments
expr : An expression that evaluates to a numeric.
targetScale : An optional INTEGER literal greater than -38 specifying by how many digits after the
decimal point to round down.
Since: Databricks Runtime 10.5
Returns
If no targetScale is given:
If expr is DECIMAL(p, s) , returns DECIMAL(p - s + 1, 0) .
For all other cases, returns a BIGINT.
If targetScale is specified and expr is a:
FLOAT
Returns a DECIMAL(p, s) with p = max(14, -targetScale + 1) and s = min(7, max(0, targetScale))
DOUBLE
Returns a DECIMAL(p, s) with p = max(30, -targetScale + 1) and s = min(15, max(0, targetScale))
DECIMAL(p_in, s_in)
Returns a DECIMAL(p, s) with p = max(p_in - s_in + 1, -targetScale + 1) and
s = min(s_in, max(0, targetScale))
If targetScale is negative the rounding occurs to -targetScale digits to the left of the decimal point.
The default targetScale is 0, which rounds down to the next smaller integral number.
Examples
> SELECT floor(-0.1);
-1
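Two further sketches of targetScale as described above (a positive scale keeps digits after the decimal point;
a negative scale rounds down to the left of it):
> SELECT floor(3.14159, 3);
3.141
> SELECT floor(3345.1, -2);
3300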
Related functions
ceiling function
ceil function
bround function
round function
forall function
7/21/2022 • 2 minutes to read
Syntax
forall(expr, func)
Arguments
expr : An ARRAY expression.
func : A lambda function returning a BOOLEAN.
Returns
A BOOLEAN.
The lambda function takes one parameter, which is passed an element of the array.
Examples
> SELECT forall(array(1, 2, 3), x -> x % 2 == 0);
false
> SELECT forall(array(2, 4, 8), x -> x % 2 == 0);
true
> SELECT forall(array(1, NULL, 3), x -> x % 2 == 0);
false
> SELECT forall(array(2, NULL, 8), x -> x % 2 == 0);
NULL
Related functions
filter function
exists function
format_number function
7/21/2022 • 2 minutes to read
Syntax
format_number(expr, scale)
format_number(expr, fmt)
Arguments
expr : An expression that evaluates to a numeric.
scale : An INTEGER expression greater or equal to 0.
fmt : A STRING expression specifying a format.
Returns
A STRING.
A negative scale produces a null.
Examples
> SELECT format_number(12332.123456, 4);
12,332.1235
> SELECT format_number(12332.123456, '#.###');
12332.123
> SELECT format_number(12332.123456, 'EUR ,###.-');
EUR 12,332.-
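A sketch of the negative-scale rule described above:
> SELECT format_number(12332.123456, -1);
NULL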
Related functions
format_string function
format_string function
7/21/2022 • 2 minutes to read
Syntax
format_string(strfmt [, obj1 [, ...] ])
Arguments
strfmt : A STRING expression.
objN : STRING or numeric expressions.
Returns
A STRING.
Examples
> SELECT format_string('Hello World %d %s', 100, 'days');
Hello World 100 days
Related functions
format_number function
from_csv function
7/21/2022 • 5 minutes to read
Syntax
from_csv(csvStr, schema [, options])
Arguments
csvStr : A STRING expression specifying a row of CSV data.
schema : A STRING literal or invocation of schema_of_csv function.
options : An optional MAP<STRING,STRING> literal specifying directives.
Returns
A STRUCT with field names and types matching the schema definition.
csvStr should be well formed with respect to the schema and options . schema must be defined as comma-
separated column name and data type pairs as used in for example CREATE TABLE .
options , if provided, can be any of the following:
sep (default , ): sets a separator for each field and value. This separator can be one or more characters.
encoding (default UTF-8): decodes the CSV files by the specified encoding type.
quote (default " ): sets a single character used for escaping quoted values where the separator can be part
of the value. To turn off quotations, set this option to an empty string rather than null. This behavior is
different from com.databricks.spark.csv .
escape (default \ ): sets a single character used for escaping quotes inside an already quoted value.
charToEscapeQuoteEscaping (default escape or \0 ): sets a single character used for escaping the escape for
the quote character. The default value is escape character when escape and quote characters are different,
\0 otherwise.
comment (default empty string): sets a single character used for skipping lines beginning with this character.
By default, it is disabled.
header (default false ): uses the first line as names of columns.
enforceSchema (default true ): If it is set to true, the specified or inferred schema is forcibly applied to
datasource files, and headers in CSV files are ignored. If the option is set to false, the schema is validated
against all headers in CSV files in the case when the header option is set to true. Field names in the schema
and column names in CSV headers are checked by their positions taking into account
spark.sql.caseSensitive . Though the default value is true, it is recommended to disable the enforceSchema
option to avoid incorrect results.
inferSchema (default false ): infers the input schema automatically from data. It requires one extra pass
over the data.
samplingRatio (default 1.0): defines fraction of rows used for schema inferring.
ignoreLeadingWhiteSpace (default false ): a flag indicating whether or not leading whitespaces from values
being read should be skipped.
ignoreTrailingWhiteSpace (default false ): a flag indicating whether or not trailing whitespaces from values
being read should be skipped.
nullValue (default empty string): sets the string representation of a null value.
emptyValue (default empty string): sets the string representation of an empty value.
nanValue (default NaN ): sets the string representation of a non-number value.
positiveInf (default Inf ): sets the string representation of a positive infinity value.
negativeInf (default -Inf ): sets the string representation of a negative infinity value.
dateFormat (default yyyy-MM-dd ): sets the string that indicates a date format. Custom date formats follow the
formats at Datetime patterns. This applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] ): sets the string that indicates a timestamp
format. Custom date formats follow the formats at Datetime patterns. This applies to timestamp type.
maxColumns (default 20480 ): defines a hard limit of how many columns a record can have.
maxCharsPerColumn (default -1): defines the maximum number of characters allowed for any specified value
being read. By default, it is -1, meaning unlimited length.
unescapedQuoteHandling (default STOP_AT_DELIMITER ): defines how the CSV parser handles values with
unescaped quotes.
STOP_AT_CLOSING_QUOTE : If unescaped quotes are found in the input, accumulate the quote character
and proceed parsing the value as a quoted value, until a closing quote is found.
BACK_TO_DELIMITER : If unescaped quotes are found in the input, consider the value as an unquoted
value. This will make the parser accumulate all characters of the current parsed value until the
delimiter is found. If no delimiter is found in the value, the parser will continue accumulating
characters from the input until a delimiter or line ending is found.
STOP_AT_DELIMITER : If unescaped quotes are found in the input, consider the value as an unquoted
value. This will make the parser accumulate all characters until the delimiter or a line ending is found
in the input.
SKIP_VALUE : If unescaped quotes are found in the input, the content parsed for the specified
value is skipped and the value set in nullValue is produced instead.
RAISE_ERROR : If unescaped quotes are found in the input, a TextParsingException is thrown.
mode (default PERMISSIVE ): allows a mode for dealing with corrupt records during parsing. It supports the
following case-insensitive modes. Spark tries to parse only required columns in CSV under column pruning.
Therefore, corrupt records can be different based on required set of fields. This behavior can be controlled by
spark.sql.csv.parser.columnPruning.enabled (enabled by default).
PERMISSIVE : when it meets a corrupted record, puts the malformed string into a field configured by
columnNameOfCorruptRecord , and sets malformed fields to null. To keep corrupt records, a user can set
a string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does
not have the field, it drops corrupt records during parsing. A record with fewer or more tokens than
schema is not a corrupted record to CSV. When it meets a record having fewer tokens than the length
of the schema, sets null to extra fields. When the record has more tokens than the length of the
schema, it drops extra tokens.
FAILFAST : throws an exception when it meets corrupted records.
columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord ): allows
renaming the new field having malformed string created by PERMISSIVE mode. This overrides
spark.sql.columnNameOfCorruptRecord .
multiLine (default false ): parse one record, which may span multiple lines.
locale (default en-US ): sets a locale as language tag in IETF BCP 47 format. For instance, this is used while
parsing dates and timestamps.
lineSep (default covers all \r , \r\n , and \n ): defines the line separator that should be used for parsing.
Maximum length is 1 character.
pathGlobFilter : an optional glob pattern to only include files with paths matching the pattern. The syntax
follows org.apache.hadoop.fs.GlobFilter . It does not change the behavior of partition discovery.
Examples
> SELECT from_csv('1, 0.8', 'a INT, b DOUBLE');
{1,0.8}
> SELECT from_csv('26/08/2015', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
{"time":2015-08-26 00:00:00}
Related functions
from_json function
schema_of_json function
schema_of_csv function
to_json function
to_csv function
from_json function
7/21/2022 • 2 minutes to read
Syntax
from_json(jsonStr, schema [, options])
Arguments
jsonStr : A STRING expression specifying a JSON document.
schema : A STRING literal or invocation of schema_of_json function.
options : An optional MAP<STRING,STRING> literal specifying directives.
Returns
A struct with field names and types matching the schema definition.
jsonStr should be well formed with respect to schema and options . schema must be defined as comma-
separated column name and data type pairs as used in for example CREATE TABLE .
options , if provided, can be any of the following:
primitivesAsString (default false ): infers all primitive values as a string type.
prefersDecimal (default false ): infers all floating-point values as a decimal type. If the values do not fit in
decimal, then it infers them as doubles.
allowComments (default false ): ignores Java and C++ style comments in JSON records.
allowUnquotedFieldNames (default false ): allows unquoted JSON field names.
allowSingleQuotes (default true ): allows single quotes in addition to double quotes.
allowNumericLeadingZeros (default false ): allows leading zeros in numbers (for example, 00012 ).
allowBackslashEscapingAnyCharacter (default false ): allows accepting quoting of all character using
backslash quoting mechanism.
allowUnquotedControlChars (default false ): allows JSON Strings to contain unquoted control characters
(ASCII characters with value less than 32, including tab and line feed characters) or not.
mode (default PERMISSIVE ): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE : when it meets a corrupted record, puts the malformed string into a field configured by
columnNameOfCorruptRecord , and sets malformed fields to null. To keep corrupt records, you can set a
string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not
have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a
columnNameOfCorruptRecord field in an output schema.
FAILFAST : throws an exception when it meets corrupted records.
columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord ): allows
renaming the new field having malformed string created by PERMISSIVE mode. This overrides
spark.sql.columnNameOfCorruptRecord .
dateFormat (default yyyy-MM-dd ): sets the string that indicates a date format. Custom date formats follow the
formats at Datetime patterns. This applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] ): sets the string that indicates a timestamp
format. Custom date formats follow the formats at Datetime patterns. This applies to timestamp type.
multiLine (default false ): parses one record, which may span multiple lines, per file.
encoding (by default it is not set): allows forcibly setting one of the standard basic or extended encodings for
the JSON files. For example UTF-16BE, UTF-32LE. If the encoding is not specified and multiLine is set to true , it
is detected automatically.
lineSep (default covers all \r , \r\n and \n ): defines the line separator that should be used for parsing.
samplingRatio (default 1.0): defines fraction of input JSON objects used for schema inferring.
dropFieldIfAllNull (default false ): whether to ignore column of all null values or empty array/struct
during schema inference.
locale (default is en-US ): sets a locale as language tag in IETF BCP 47 format. For instance, this is used
while parsing dates and timestamps.
allowNonNumericNumbers (default true ): allows JSON parser to recognize set of not-a-number ( NaN ) tokens
as legal floating number values:
+INF for positive infinity, as well as the aliases +Infinity and Infinity .
-INF for negative infinity, as well as the alias -Infinity .
NaN for other not-a-numbers, like result of division by zero.
Examples
> SELECT from_json('{"a":1, "b":0.8}', 'a INT, b DOUBLE');
{1,0.8}
> SELECT from_json('{"time":"26/08/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
{2015-08-26 00:00:00}
Related functions
: operator
from_csv function
schema_of_json function
schema_of_csv function
to_csv function
to_json function
json_object_keys function
json_array_length function
json_tuple table-valued generator function
from_json function
get_json_object function
from_unixtime function
7/21/2022 • 2 minutes to read
Syntax
from_unixtime(unixTime [, fmt])
Arguments
unixTime : A BIGINT expression representing the number of seconds elapsed since 1970-01-01 00:00:00 UTC
(the Unix epoch). The example results below are rendered in a UTC-8 session time zone.
fmt : An optional STRING expression with a valid format.
Returns
A STRING.
See Datetime patterns for valid formats. The ‘yyyy-MM-dd HH:mm:ss’ pattern is used if fmt is omitted.
Examples
> SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');
1969-12-31 16:00:00
> SELECT from_unixtime(0);
1969-12-31 16:00:00
Related functions
to_unix_timestamp function
Datetime patterns
from_utc_timestamp function
7/21/2022 • 2 minutes to read
Syntax
from_utc_timestamp(expr, timeZone)
Arguments
expr : A TIMESTAMP expression with a UTC timestamp.
timeZone : A STRING expression that is a valid timezone.
Returns
A TIMESTAMP.
Examples
> SELECT from_utc_timestamp('2016-08-31', 'Asia/Seoul');
2016-08-31 09:00:00
> SELECT from_utc_timestamp('2017-07-14 02:40:00.0', 'GMT+1');
'2017-07-14 03:40:00.0'
Related functions
to_utc_timestamp function
getbit function
7/21/2022 • 2 minutes to read
Syntax
getbit(expr, pos)
Arguments
expr : An expression that evaluates to an integral numeric.
pos : An expression of type INTEGER.
Returns
The result type is INTEGER.
The result value is 1 if the bit is set, 0 otherwise.
Bits are counted right to left and 0-based.
If pos is outside the bounds of the data type of expr Databricks Runtime raises an error.
getbit is a synonym of bit_get.
Examples
> SELECT hex(23Y), getbit(23Y, 3);
17 0
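A sketch of the bit numbering described above, assuming 23 is 10111 in binary:
> SELECT getbit(23Y, 0), getbit(23Y, 2), getbit(23Y, 3);
1 1 0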
Related functions
bit_get function
bit_reverse function
~ (tilde sign) operator
get_json_object function
7/21/2022 • 2 minutes to read
Syntax
get_json_object(expr, path)
Arguments
expr : A STRING expression containing well formed JSON.
path : A STRING literal with a well formed JSON path.
Returns
A STRING.
If the object cannot be found, NULL is returned.
Examples
> SELECT get_json_object('{"a":"b"}', '$.a');
b
Related functions
json_tuple table-valued generator function
greatest function
7/21/2022 • 2 minutes to read
Syntax
greatest(expr1 [, ...])
Arguments
exprN : Any expression of a comparable type with a shared least common type across all exprN .
Returns
The result type is the least common type of the arguments.
Examples
> SELECT greatest(10, 9, 2, 4, 3);
10
Related
least function
SQL data type rules
grouping function
7/21/2022 • 2 minutes to read
Indicates whether a specified column in a GROUPING SET , ROLLUP , or CUBE represents a subtotal.
Syntax
grouping(col)
Arguments
col : A column reference identified in a GROUPING SET , ROLLUP , or CUBE .
Returns
An INTEGER.
The result is 1 for a specified row if the row represents a subtotal over the grouping of col , or 0 if it does not.
Examples
> SELECT name, grouping(name), sum(age) FROM VALUES (2, 'Alice'), (5, 'Bob') people(age, name) GROUP BY cube(name);
Alice 0 2
Bob 0 5
NULL 1 7
Related functions
grouping_id function
grouping_id function
7/21/2022 • 2 minutes to read
Syntax
grouping_id( [col1 [, ...] ] )
Arguments
colN : A column reference identified in a GROUPING SET , ROLLUP , or CUBE .
Returns
A BIGINT.
The function combines the grouping function for several columns into one by assigning each column a bit in a
bit vector. The col1 is represented by the highest order bit. A bit is set to 1 if the row computes a subtotal for
the corresponding column.
Specifying no argument is equivalent to specifying all columns listed in the GROUPING SET , CUBE , or ROLLUP .
Examples
> SELECT name, age, grouping_id(name, age),
conv(cast(grouping_id(name, age) AS STRING), 10, 2),
avg(height)
FROM VALUES (2, 'Alice', 165), (5, 'Bob', 180) people(age, name, height)
GROUP BY cube(name, age)
Alice 2 0 0 165.0
Alice NULL 1 1 165.0
NULL 2 2 10 165.0
NULL NULL 3 11 172.5
Bob NULL 1 1 180.0
Bob 5 0 0 180.0
NULL 5 2 10 180.0
Related functions
grouping function
>= (gt eq sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 >= expr2
Arguments
expr1 : An expression of any comparable type.
expr2 : An expression sharing a least common type with expr1 .
Returns
A BOOLEAN.
Examples
> SELECT 2 >= 1;
true
> SELECT 2.0 >= '2.1';
false
> SELECT to_date('2009-07-30 04:17:52') >= to_date('2009-07-30 04:17:52');
true
> SELECT to_date('2009-07-30 04:17:52') >= to_date('2009-08-01 04:17:52');
false
> SELECT 1 >= NULL;
NULL
Related
!= (bangeq sign) operator
<= (lt eq sign) operator
< (lt sign) operator
> (gt sign) operator
<=> (lt eq gt sign) operator
= (eq sign) operator
<> (lt gt sign) operator
SQL data type rules
> (gt sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 > expr2
Arguments
expr1 : An expression of any comparable type.
expr2 : An expression sharing a least common type with expr1 .
Returns
A BOOLEAN.
Examples
> SELECT 2 > 1;
true
> SELECT 2 > '1.1';
true
> SELECT to_date('2009-07-30 04:17:52') > to_date('2009-07-30 04:17:52');
false
> SELECT to_date('2009-07-30 04:17:52') > to_date('2009-08-01 04:17:52');
false
> SELECT 1 > NULL;
NULL
Related
!= (bangeq sign) operator
<= (lt eq sign) operator
< (lt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
= (eq sign) operator
<> (lt gt sign) operator
SQL data type rules
hash function
7/21/2022 • 2 minutes to read
Syntax
hash(expr1, ...)
Arguments
exprN : An expression of any type.
Returns
An INTEGER.
Examples
> SELECT hash('Spark', array(123), 2);
-1321691492
Related functions
crc32 function
md5 function
sha function
sha1 function
sha2 function
xxhash64 function
hex function
7/21/2022 • 2 minutes to read
Syntax
hex(expr)
Arguments
expr : A BIGINT, BINARY, or STRING expression.
Returns
A STRING.
The function returns the hexadecimal representation of the argument.
Examples
> SELECT hex(17);
11
> SELECT hex('Spark SQL');
537061726B2053514C
Related functions
unhex function
hour function
7/21/2022 • 2 minutes to read
Syntax
hour(expr)
Arguments
expr : A TIMESTAMP expression.
Returns
An INTEGER.
This function is a synonym for extract(HOUR FROM expr) .
Examples
> SELECT hour('2009-07-30 12:58:59');
12
Related functions
dayofmonth function
dayofweek function
dayofyear function
day function
minute function
extract function
hypot function
7/21/2022 • 2 minutes to read
Syntax
hypot(expr1, expr2)
Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.
Returns
A DOUBLE.
Examples
> SELECT hypot(3, 4);
5.0
Related functions
cbrt function
sqrt function
if function
7/21/2022 • 2 minutes to read
Syntax
if(cond, expr1, expr2)
Arguments
cond : A BOOLEAN expression.
expr1 : An expression of any type.
expr2 : An expression that shares a least common type with expr1 .
Returns
The result is the common maximum type of expr1 and expr2 .
This function is a synonym for iff function.
Examples
> SELECT if(1 < 2, 'a', 'b');
a
Related functions
case expression
decode function
iff function
ifnull function
7/21/2022 • 2 minutes to read
Syntax
ifnull(expr1, expr2)
Arguments
expr1 : An expression of any type.
expr2 : An expression sharing a least common type with expr1 .
Returns
The result type is the least common type of expr1 and expr2 .
This function is a synonym for coalesce function.
Examples
> SELECT ifnull(NULL, array('2'));
[2]
Related functions
coalesce function
if function
nvl function
nvl2 function
in predicate
7/21/2022 • 2 minutes to read
Syntax
elem in ( expr1 [, ...] )
elem in ( query )
Arguments
elem : An expression of any comparable type.
exprN : An expression of any type sharing a least common type with all other arguments.
query : Any query. The result must share a least common type with elem . If the query returns more than
one column, elem must be a tuple (STRUCT) with the same number of fields.
Returns
The result is a BOOLEAN.
Examples
> SELECT 1 in(1, 2, 3);
true
> SELECT 1 in(2, 3, 4);
false
> SELECT (1, 2) IN ((1, 2), (2, 3));
true
> SELECT named_struct('a', 1, 'b', 2) in(named_struct('a', 1, 'b', 1), named_struct('a', 1, 'b', 3));
false
> SELECT named_struct('a', 1, 'b', 2) in(named_struct('a', 1, 'b', 2), named_struct('a', 1, 'b', 3));
true
> SELECT 1 IN (SELECT * FROM VALUES(1), (2));
true
> SELECT (1, 2) IN (SELECT c1, c2 FROM VALUES(1, 2), (3, 4) AS T(c1, c2));
true
Related functions
exists function
array_contains function
SELECT
initcap function
7/21/2022 • 2 minutes to read
Syntax
initcap(expr)
Arguments
expr : A STRING expression.
Returns
A STRING.
The first letter of each word is in uppercase; all other letters are in lowercase. Words are delimited by white space.
Examples
> SELECT initcap('sPark sql');
Spark Sql
Related functions
lower function
lcase function
ucase function
upper function
inline table-valued generator function
7/21/2022 • 2 minutes to read
Syntax
inline(expr)
Arguments
expr : An ARRAY of STRUCT expression.
Returns
A set of rows composed of the other expressions in the select list and the fields of the structs.
If expr is NULL no rows are produced.
inline can only be placed in the select list or a LATERAL VIEW. When placing the function in the select list there
must be no other generator function in the same select list.
The columns produced by inline are named “col1”, “col2”, etc by default, but can be aliased using an alias tuple
such as AS (myCol1, myCol2) .
Examples
> SELECT inline(array(struct(1, 'a'), struct(2, 'b'))), 'Spark SQL';
1 a Spark SQL
2 b Spark SQL
Related functions
explode table-valued generator function
explode_outer table-valued generator function
posexplode table-valued generator function
posexplode_outer table-valued generator function
inline_outer table-valued generator function
inline_outer table-valued generator function
7/21/2022 • 2 minutes to read
Syntax
inline_outer(expr)
Arguments
expr : An ARRAY of STRUCT expression.
Returns
A set of rows composed of the other expressions in the select list and the fields of the structs.
If expr is NULL, or the array is empty, a single row with NULLs for the attributes is produced.
inline_outer can only be placed in the select list or a LATERAL VIEW. When placing the function in the select list there
must be no other generator function in the same select list.
The columns produced by inline_outer are named “col1”, “col2”, etc by default, but can be aliased using an alias
tuple such as AS (myCol1, myCol2) .
Examples
> SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')));
1 a
2 b
Related functions
explode table-valued generator function
explode_outer table-valued generator function
posexplode table-valued generator function
posexplode_outer table-valued generator function
inline table-valued generator function
input_file_block_length function
7/21/2022 • 2 minutes to read
Syntax
input_file_block_length()
Arguments
This function takes no arguments.
Returns
A BIGINT.
If the information is not available -1 is returned.
The function is non-deterministic.
Examples
> SELECT input_file_block_length();
-1
Related functions
input_file_block_start function
input_file_name function
input_file_block_start function
7/21/2022 • 2 minutes to read
Syntax
input_file_block_start()
Arguments
This function takes no arguments.
Returns
A BIGINT.
If the information is not available -1 is returned.
The function is non-deterministic.
Examples
> SELECT input_file_block_start();
-1
Related functions
input_file_block_length function
input_file_name function
input_file_name function
7/21/2022 • 2 minutes to read
Returns the name of the file being read, or empty string if not available.
Syntax
input_file_name()
Arguments
This function takes no arguments.
Returns
A STRING.
If the information is not available an empty string is returned.
The function is non-deterministic.
Examples
> SELECT input_file_name();
Related functions
input_file_block_length function
input_file_block_start function
instr function
7/21/2022 • 2 minutes to read
Syntax
instr(str, substr)
Arguments
str : A STRING expression.
substr : A STRING expression.
Returns
A BIGINT.
If substr cannot be found the function returns 0.
Examples
> SELECT instr('SparkSQL', 'SQL');
6
> SELECT instr('SparkSQL', 'R');
0
Related functions
locate function
position function
int function
7/21/2022 • 2 minutes to read
Syntax
int(expr)
Arguments
expr : Any expression which is castable to INTEGER.
Returns
An INTEGER.
This function is a synonym for CAST(expr AS INTEGER) .
Examples
> SELECT int(-5.6);
-5
> SELECT int('5');
5
Related functions
cast function
is distinct operator
7/21/2022 • 2 minutes to read
Tests whether the arguments have different values where NULLs are considered as comparable values.
Syntax
expr1 is [not] distinct from expr2
Arguments
expr1 : An expression of a comparable type.
expr2 : An expression of a type sharing a least common type with expr1 .
Returns
A BOOLEAN.
If both expr1 and expr2 are NULL they are considered not distinct.
If only one of expr1 and expr2 is NULL the expressions are considered distinct.
If both expr1 and expr2 are not NULL they are considered distinct if expr1 <> expr2 .
Examples
> SELECT NULL is distinct from NULL;
false
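A few additional illustrative examples, not from the original page, covering the remaining rules above (assuming default settings):
> SELECT 1 is distinct from 2;
true
> SELECT NULL is distinct from 5;
true
> SELECT 1 is not distinct from 1;
true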
Related
= (eq sign) operator
!= (bangeq sign) operator
<=> (lt eq gt sign) operator
isnan function
is true operator
SQL data type rules
is false operator
7/21/2022 • 2 minutes to read
Syntax
expr is [not] false
Arguments
expr : A BOOLEAN, STRING, or numeric expression.
Returns
A BOOLEAN.
If expr is a STRING of case insensitive value 't' or 'true' it is interpreted as a BOOLEAN true . Any other
non NULL string is interpreted as false .
If expr is a numeric of value 1 it is interpreted as a BOOLEAN true . Any other non NULL number is
interpreted as false .
If not is specified this operator returns true if expr is true or NULL and false otherwise.
If not is not specified the operator returns true if expr is false and false otherwise.
Examples
> SELECT 5 is false;
true
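An additional illustrative sketch (not part of the original examples): the not form treats a NULL operand as satisfying the negation.
> SELECT NULL is not false;
true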
Related functions
isnotnull function
isnull function
is null operator
isnan function
is true operator
isnan function
7/21/2022 • 2 minutes to read
Syntax
isnan(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A BOOLEAN.
Examples
> SELECT isnan(cast('NaN' as double));
true
> SELECT isnan(7);
false
Related functions
isnotnull function
isnull function
isnotnull function
7/21/2022 • 2 minutes to read
Syntax
isnotnull(expr)
Arguments
expr : An expression of any type.
Returns
A BOOLEAN.
This function is a synonym for expr IS NOT NULL .
Examples
> SELECT isnotnull(1);
true
Related functions
isnull function
isnan function
is null operator
isnull function
7/21/2022 • 2 minutes to read
Syntax
isnull(expr)
Arguments
expr : An expression of any type.
Returns
A BOOLEAN.
This function is a synonym for expr IS NULL .
Examples
> SELECT isnull(1);
false
Related functions
isnotnull function
isnan function
is null operator
is null operator
7/21/2022 • 2 minutes to read
Syntax
expr is [not] null
Arguments
expr : An expression of any type.
Returns
A BOOLEAN.
If not is specified this operator is a synonym for isnotnull(expr) . Otherwise the operator is a synonym for
isnull(expr) .
Examples
> SELECT 1 is null;
false
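Additional illustrative examples (assumed to follow the rules above under default settings):
> SELECT NULL is null;
true
> SELECT 1 is not null;
true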
Related functions
isnotnull function
isnull function
isnan function
is false operator
is true operator
is true operator
7/21/2022 • 2 minutes to read
Syntax
expr is [not] true
Arguments
expr : A BOOLEAN, STRING, or numeric expression.
Returns
A BOOLEAN.
If expr is a STRING of case insensitive value 't' or 'true' it is interpreted as a BOOLEAN true . Any other
non NULL string is interpreted as false .
If expr is a numeric of value 1 it is interpreted as a BOOLEAN true . Any other non NULL number is
interpreted as false .
If not is specified this operator returns true if expr is false or NULL and false otherwise.
If not is not specified the operator returns true if expr is true and false otherwise.
Examples
> SELECT 1 is true;
true
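As a further illustrative sketch (not from the original examples), the not form returns true for NULL operands, and numbers other than 1 are treated as false:
> SELECT NULL is not true;
true
> SELECT 0 is true;
false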
Related functions
isnotnull function
isnull function
is null operator
isnan function
is false operator
java_method function
7/21/2022 • 2 minutes to read
Syntax
java_method(class, method [, arg1 [, ...] ] )
Arguments
class : A STRING literal specifying the java class.
method : A STRING literal specifying the java method.
argn : An expression with a type appropriate for the selected method.
Returns
A STRING.
Examples
> SELECT java_method('java.util.UUID', 'randomUUID');
c33fb387-8500-4bfa-81d2-6e0e3e930df2
> SELECT java_method('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
a5cf6c42-0c85-418f-af6c-3e4e5b1328f2
Related functions
reflect function
json_array_length function
7/21/2022 • 2 minutes to read
Syntax
json_array_length(jsonArray)
Arguments
jsonArray : A JSON array.
Returns
An INTEGER.
The function returns NULL if jsonArray is not a valid JSON string or NULL .
Examples
> SELECT json_array_length('[1,2,3,4]');
4
> SELECT json_array_length('[1,2,3,{"f1":1,"f2":[5,6]},4]');
5
> SELECT json_array_length('[1,2');
NULL
Related functions
: operator
json_object_keys function
json_array_length function
json_tuple table-valued generator function
from_json function
get_json_object function
schema_of_json function
to_json function
json_object_keys function
7/21/2022 • 2 minutes to read
Syntax
json_object_keys(jsonObject)
Arguments
jsonObject : A STRING expression of a valid JSON object format.
Returns
An ARRAY.
If jsonObject is any other valid JSON string, an invalid JSON string, or an empty string, the function returns
NULL.
Examples
> SELECT json_object_keys('{}');
[]
> SELECT json_object_keys('{"key": "value"}');
[key]
> SELECT json_object_keys('{"f1":"abc","f2":{"f3":"a", "f4":"b"}}');
[f1,f2]
Related functions
: operator
json_array_length function
json_tuple table-valued generator function
from_json function
get_json_object function
schema_of_json function
to_json function
json_tuple table-valued generator function
7/21/2022 • 2 minutes to read
Syntax
json_tuple(jsonStr, path1 [, ...] )
Arguments
jsonStr : A STRING expression with well formed JSON.
pathN : A STRING literal with a JSON path.
Returns
A row composed of the other expressions in the select list and the JSON objects.
If any object cannot be found, NULL is returned for that object. The produced columns are named c1, c2, etc. by
default, but can be aliased using AS (myC1, myC2, …) .
json_tuple can only be placed in the select list or a LATERAL VIEW. When placing the function in the select list
there must be no other generator function in the same select list.
Examples
> SELECT json_tuple('{"a":1, "b":2}', 'a', 'b'), 'Spark SQL';
1 2 Spark SQL
> SELECT json_tuple('{"a":1, "b":2}', 'a', 'c'), 'Spark SQL';
1 NULL Spark SQL
Related functions
: operator
json_object_keys function
json_array_length function
json_tuple table-valued generator function
from_json function
get_json_object function
schema_of_json function
to_json function
kurtosis aggregate function
7/21/2022 • 2 minutes to read
Syntax
kurtosis ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
Examples
> SELECT kurtosis(col) FROM VALUES (-10), (-20), (100), (100), (1000) AS tab(col);
0.16212458373485106
> SELECT kurtosis(DISTINCT col) FROM VALUES (-10), (-20), (100), (100), (1000) AS tab(col);
-0.7014368047529627
> SELECT kurtosis(col) FROM VALUES (1), (10), (100), (10), (1) as tab(col);
0.19432323191699075
Related functions
skewness aggregate function
lag analytic window function
7/21/2022 • 2 minutes to read
Returns the value of expr from a preceding row within the partition.
Syntax
lag( expr [, offset [, default] ] )
Arguments
expr : An expression of any type.
offset : An optional INTEGER literal specifying the offset.
default : An expression of the same type as expr .
Returns
The result type matches expr .
If offset is positive the value originates from the row preceding the current row by offset rows, as specified by the
ORDER BY in the OVER clause. An offset of 0 uses the current row’s value. A negative offset uses the value from
a row following the current row. If you do not specify offset it defaults to 1, the immediately preceding row.
If there is no row at the specified offset within the partition, the specified default is used. The default default
is NULL . You must provide an ORDER BY clause.
This function is a synonym to lead(expr, -offset, default) .
Examples
> SELECT a, b, lag(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1',
1) tab(a, b);
A1 1 NULL
A1 1 1
A1 2 1
A2 3 NULL
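The following additional sketch (not part of the original examples) illustrates the offset and default arguments; an offset of 2 and a default of 0 are chosen purely for illustration:
> SELECT a, b, lag(b, 2, 0) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1',
1) tab(a, b);
A1 1 0
A1 1 0
A1 2 1
A2 3 0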
Related functions
lead analytic window function
last aggregate function
last_value aggregate function
first_value aggregate function
Window functions
last aggregate function
7/21/2022 • 2 minutes to read
Syntax
last(expr [, ignoreNull] ) [FILTER ( WHERE cond ) ] [ IGNORE NULLS | RESPECT NULLS ]
Arguments
expr : An expression of any type.
ignoreNull : An optional BOOLEAN literal defaulting to false.
cond : An optional boolean expression filtering the rows used for aggregation.
IGNORE NULLS or RESPECT NULLS : When IGNORE NULLS is used or ignoreNull is true any expr value that is
NULL is ignored. The default is RESPECT NULLS .
Returns
The result type matches expr .
The function is a synonym for last_value aggregate function.
This function is non-deterministic.
Examples
> SELECT last(col) FROM VALUES (10), (5), (20) AS tab(col);
20
> SELECT last(col) FROM VALUES (10), (5), (NULL) AS tab(col);
NULL
> SELECT last(col) IGNORE NULLS FROM VALUES (10), (5), (NULL) AS tab(col);
5
Related functions
last_value aggregate function
first aggregate function
first_value aggregate function
last_day function
7/21/2022 • 2 minutes to read
Returns the last day of the month that the date belongs to.
Syntax
last_day(expr)
Arguments
expr : A DATE expression.
Returns
A DATE.
Examples
> SELECT last_day('2009-01-12');
2009-01-31
Related functions
next_day function
last_value aggregate function
7/21/2022 • 2 minutes to read
Syntax
last_value(expr [, ignoreNull] ) [FILTER ( WHERE cond ) ] [ IGNORE NULLS | RESPECT NULLS ]
Arguments
expr : An expression of any type.
ignoreNull : An optional BOOLEAN literal defaulting to false.
cond : An optional boolean expression filtering the rows used for aggregation.
IGNORE NULLS or RESPECT NULLS : When IGNORE NULLS is used or ignoreNull is true any expr value that is
NULL is ignored. The default is RESPECT NULLS .
Returns
The result type matches expr .
The function is a synonym for last aggregate function.
This function is non-deterministic.
Examples
> SELECT last_value(col) FROM VALUES (10), (5), (20) AS tab(col);
20
> SELECT last_value(col) FROM VALUES (10), (5), (NULL) AS tab(col);
NULL
> SELECT last_value(col) IGNORE NULLS FROM VALUES (10), (5), (NULL) AS tab(col);
5
Related functions
last aggregate function
first aggregate function
first_value aggregate function
lcase function
7/21/2022 • 2 minutes to read
Syntax
lcase(expr)
Arguments
expr : A STRING expression.
Returns
A STRING.
Examples
> SELECT lcase('LowerCase');
lowercase
Related functions
lower function
initcap function
ucase function
upper function
lead analytic window function
7/21/2022 • 2 minutes to read
Returns the value of expr from a subsequent row within the partition.
Syntax
lead(expr [, offset [, default] ] )
Arguments
expr : An expression of any type.
offset : An optional INTEGER literal specifying the offset.
default : An expression of the same type as expr .
Returns
The result type matches expr .
If offset is positive the value originates from the row following the current row by offset rows, as specified by the
ORDER BY in the OVER clause. An offset of 0 uses the current row’s value. A negative offset uses the value from
a row preceding the current row. If you do not specify offset it defaults to 1, the immediately following row.
If there is no row at the specified offset within the partition the specified default is used. The default default
is NULL. An ORDER BY clause must be provided.
This function is a synonym to lag(expr, -offset, default) .
Examples
> SELECT a, b, lead(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1',
1) tab(a, b);
A1 1 1
A1 1 2
A1 2 NULL
A2 3 NULL
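A further illustrative sketch (not from the original examples) for the offset and default arguments, with values chosen purely for illustration:
> SELECT a, b, lead(b, 2, 0) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1',
1) tab(a, b);
A1 1 2
A1 1 0
A1 2 0
A2 3 0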
Related functions
lag analytic window function
last aggregate function
last_value aggregate function
first_value aggregate function
Window functions
least function
7/21/2022 • 2 minutes to read
Syntax
least(expr1 [, ...] )
Arguments
exprN : An expression of any type that shares a least common type with all other arguments.
Returns
The result is the least common type of all arguments.
Examples
> SELECT least(10, 9, 2, 4, 3);
2
Related
greatest function
SQL data type rules
left function
7/21/2022 • 2 minutes to read
Syntax
left(str, len)
Arguments
str : A STRING expression.
len : An INTEGER expression.
Returns
A STRING.
If len is less than 1, an empty string is returned.
Examples
> SELECT left('Spark SQL', 3);
Spa
Related functions
right function
substr function
length function
7/21/2022 • 2 minutes to read
Returns the character length of string data or number of bytes of binary data.
Syntax
length(expr)
Arguments
expr : A STRING or BINARY expression.
Returns
An INTEGER.
The length of string data includes the trailing spaces. The length of binary data includes trailing binary zeros.
This function is a synonym for character_length function and char_length function.
Examples
> SELECT length('Spark SQL ');
10
> SELECT length('床前明月光');
5
Related functions
character_length function
char_length function
levenshtein function
7/21/2022 • 2 minutes to read
Returns the Levenshtein distance between the strings str1 and str2 .
Syntax
levenshtein(str1, str2)
Arguments
str1 : A STRING expression.
str2 : A STRING expression.
Returns
An INTEGER.
Examples
> SELECT levenshtein('kitten', 'sitting');
3
Related functions
like operator
7/21/2022 • 2 minutes to read
Syntax
str [ NOT ] like ( pattern [ ESCAPE escape ] )
str [ NOT ] like { ANY | SOME | ALL } ( [ pattern [, ...] ] )
Arguments
str : A STRING expression.
pattern : A STRING expression.
escape : A single character STRING literal.
ANY or SOME or ALL :
Since: Databricks Runtime 9.1
If ALL is specified then like returns true if str matches all patterns, otherwise returns true if it
matches at least one pattern.
Returns
A BOOLEAN.
The pattern is a string which is matched literally, with exception to the following special symbols:
_ matches any one character in the input (similar to . in POSIX regular expressions)
% matches zero or more characters in the input (similar to .* in POSIX regular expressions).
The default escape character is the '\' . If an escape character precedes a special symbol or another escape
character, the following character is matched literally. It is invalid to escape any other character.
String literals are unescaped. For example, in order to match '\abc' , the pattern should be '\\abc' .
str NOT like ... is equivalent to NOT(str like ...) .
Examples
> SELECT like('Spark', '_park');
true
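Additional illustrative sketches (not from the original page) for the % wildcard and the ESCAPE clause; '/' is used as the escape character purely for illustration:
> SELECT like('Spark SQL', 'Spark%');
true
> SELECT '%SystemDrive%/Users/John' like '/%SystemDrive/%//Users%' ESCAPE '/';
true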
Related functions
ilike operator
rlike operator
regexp operator
ln function
7/21/2022 • 2 minutes to read
Syntax
ln(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
If expr is less than or equal to 0 the result is NULL.
Examples
> SELECT ln(1);
0.0
Related functions
exp function
expm1 function
e function
log function
log10 function
log1p function
locate function
7/21/2022 • 2 minutes to read
Returns the position of the first occurrence of substr in str after position pos .
Syntax
locate(substr, str [, pos] )
Arguments
substr : A STRING expression.
str : A STRING expression.
pos : An optional INTEGER expression.
Returns
An INTEGER.
The specified pos and return value are 1-based. If pos is omitted substr is searched from the beginning of
str . If pos is less than 1 the result is 0.
Examples
> SELECT locate('bar', 'foobarbar');
4
> SELECT locate('bar', 'foobarbar', 5);
7
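An extra illustrative example of the pos rule described above (a pos less than 1 yields 0):
> SELECT locate('bar', 'foobarbar', 0);
0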
Related functions
position function
instr function
charindex function
log function
7/21/2022 • 2 minutes to read
Syntax
log( [base,] expr)
Arguments
base : An optional expression that evaluates to a numeric.
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
If base or expr are less than or equal to 0 the result is NULL. log(expr) is a synonym for ln(expr) .
Examples
> SELECT log(10, 100);
2.0
> SELECT log(e());
1.0
Related functions
log10 function
ln function
log1p function
pow function
log10 function
7/21/2022 • 2 minutes to read
Syntax
log10(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
If expr is less than or equal to 0 the result is NULL.
Examples
> SELECT log10(10);
1.0
Related functions
log function
log2 function
ln function
log1p function
pow function
log1p function
7/21/2022 • 2 minutes to read
Syntax
log1p(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
If expr is less than or equal to -1 the result is NULL.
Examples
> SELECT log1p(0);
0.0
Related functions
log function
ln function
log10 function
expm1 function
e function
exp function
pow function
log2 function
7/21/2022 • 2 minutes to read
Syntax
log2(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
If expr is less than or equal to 0 the result is NULL.
Examples
> SELECT log2(2);
1.0
Related functions
log function
ln function
log1p function
log10 function
pow function
lower function
7/21/2022 • 2 minutes to read
Syntax
lower(expr)
Arguments
expr : A STRING expression.
Returns
A STRING.
Examples
> SELECT lower('LowerCase');
lowercase
Related functions
lcase function
initcap function
ucase function
upper function
lpad function
7/21/2022 • 2 minutes to read
Syntax
lpad(expr, len [, pad] )
Arguments
expr : A STRING or BINARY expression to be padded.
len : An INTEGER expression specifying the length of the result string
pad : An optional STRING or BINARY expression specifying the padding.
Returns
A BINARY if both expr and pad are BINARY, otherwise STRING.
If expr is longer than len , the return value is shortened to len characters. If you do not specify pad , a
STRING expr is padded to the left with space characters, whereas a BINARY expr is padded to the left with
x'00' bytes. If len is less than 1, an empty string is returned.
BINARY is supported since: Databricks Runtime 11.0.
Examples
> SELECT lpad('hi', 5, 'ab');
abahi
> SELECT lpad('hi', 1, '??');
h
> SELECT lpad('hi', 5);
hi
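An illustrative BINARY sketch (assumes Databricks Runtime 11.0 or above, per the note above; hex is used only to render the binary result):
> SELECT hex(lpad(x'1020', 5, x'05'));
0505051020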
Related functions
rpad function
ltrim function
rtrim function
trim function
<=> (lt eq gt sign) operator
7/21/2022 • 2 minutes to read
Returns the same result as the EQUAL(=) operator for non-null operands, but returns true if both are NULL , false if
one of them is NULL .
Syntax
expr1 <=> expr2
Arguments
expr1 : An expression of a comparable type.
expr2 : An expression that shares a least common type with expr1 .
Returns
A BOOLEAN.
This operator is a synonym for expr1 is not distinct from expr2
Examples
> SELECT 2 <=> 2;
true
> SELECT 1 <=> '1';
true
> SELECT true <=> NULL;
false
> SELECT NULL <=> NULL;
true
Related
!= (bangeq sign) operator
< (lt sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<= (lt eq sign) operator
= (eq sign) operator
<> (lt gt sign) operator
is distinct operator
SQL data type rules
<= (lt eq sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 <= expr2
Arguments
expr1 : An expression of any comparable type.
expr2 : An expression that shares a least common type with expr1 .
Returns
A BOOLEAN.
Examples
> SELECT 2 <= 2;
true
> SELECT 1.0 <= '1';
true
> SELECT to_date('2009-07-30 04:17:52') <= to_date('2009-07-30 04:17:52');
true
> SELECT to_date('2009-07-30 04:17:52') <= to_date('2009-08-01 04:17:52');
true
> SELECT 1 <= NULL;
NULL
Related
!= (bangeq sign) operator
< (lt sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
= (eq sign) operator
<> (lt gt sign) operator
SQL data type rules
<> (lt gt sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 <> expr2
Arguments
expr1 : An expression of any comparable type.
expr2 : An expression that shares a least common type with expr1 .
Returns
A BOOLEAN.
This function is a synonym for != (bangeq sign) operator.
Examples
> SELECT 2 <> 2;
false
> SELECT 3 <> 2;
true
> SELECT 1 <> '1';
false
> SELECT true <> NULL;
NULL
> SELECT NULL <> NULL;
NULL
Related
< (lt sign) operator
<= (lt eq sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
= (eq sign) operator
!= (bangeq sign) operator
SQL data type rules
ltrim function
7/21/2022 • 2 minutes to read
Syntax
ltrim( [trimStr ,] str)
Arguments
trimStr : An optional STRING expression with the string to be trimmed.
str : A STRING expression from which to trim.
Returns
A STRING.
The default for trimStr is a single space. The function removes any leading characters within trimStr from
str .
Examples
> SELECT '+' || ltrim(' SparkSQL ') || '+';
+SparkSQL +
> SELECT '+' || ltrim('abc', 'acbabSparkSQL ') || '+';
+SparkSQL +
Related functions
btrim function
rpad function
lpad function
rtrim function
trim function
< (lt sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 < expr2
Arguments
expr1 : An expression of any comparable type.
expr2 : An expression that shares a least common type with expr1 .
Returns
A BOOLEAN.
Examples
> SELECT 1 < 2;
true
> SELECT 1.1 < '1';
false
> SELECT to_date('2009-07-30 04:17:52') < to_date('2009-07-30 04:17:52');
false
> SELECT to_date('2009-07-30 04:17:52') < to_date('2009-08-01 04:17:52');
true
> SELECT 1 < NULL;
NULL
Related
!= (bangeq sign) operator
<= (lt eq sign) operator
> (gt sign) operator
>= (gt eq sign) operator
<=> (lt eq gt sign) operator
= (eq sign) operator
<> (lt gt sign) operator
SQL data type rules
make_date function
7/21/2022 • 2 minutes to read
Syntax
make_date(year, month, day)
Arguments
year : An INTEGER expression evaluating to a value from 1 to 9999.
month : An INTEGER expression evaluating to a value from 1 (January) to 12 (December).
day : An INTEGER expression evaluating to a value from 1 to 31.
Returns
A DATE.
If any of the arguments is out of bounds, the function raises an error.
NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for malformed arguments.
Examples
> SELECT make_date(2013, 7, 15);
2013-07-15
> SELECT make_date(2019, 13, 1);
NULL
> SELECT make_date(2019, 7, NULL);
NULL
> SELECT make_date(2019, 2, 30);
NULL
Related functions
make_timestamp function
make_interval function
make_dt_interval function
7/21/2022 • 2 minutes to read
Syntax
make_dt_interval( [ days [, hours [, mins [, secs] ] ] ] )
Arguments
days : An integral number of days, positive or negative
hours : An integral number of hours, positive or negative
mins : An integral number of minutes, positive or negative
secs : A number of seconds with the fractional part in microsecond precision.
Returns
An INTERVAL DAY TO SECOND .
Unspecified arguments are defaulted to 0. If you provide no arguments the result is an
INTERVAL '0 00:00:00.000000000' DAY TO SECOND .
Examples
> SELECT make_dt_interval(100, 13);
100 13:00:00.000000000
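An additional illustrative example of the default behavior described above, with all arguments omitted:
> SELECT make_dt_interval();
0 00:00:00.000000000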
Related functions
make_date function
make_timestamp function
make_ym_interval function
make_interval function
7/21/2022 • 2 minutes to read
Creates an interval from years , months , weeks , days , hours , mins and secs .
WARNING
This constructor is deprecated since it generates an INTERVAL which cannot be compared or operated upon. Please use
make_ym_interval or make_dt_interval to produce intervals.
Syntax
make_interval( [years [, months [, weeks [, days [, hours [, mins [, secs] ] ] ] ] ] ] )
Arguments
years : An integral number of years, positive or negative
months : An integral number of months, positive or negative
weeks : An integral number of weeks, positive or negative
days : An integral number of days, positive or negative
hours : An integral number of hours, positive or negative
mins : An integral number of minutes, positive or negative
secs : A number of seconds with the fractional part in microsecond precision.
Returns
An INTERVAL.
Unspecified arguments are defaulted to 0. If you provide no arguments the result is an INTERVAL with 0
seconds.
Examples
> SELECT make_interval(100, 11);
100 years 11 months
> SELECT make_interval(100, null);
NULL
> SELECT make_interval();
0 seconds
> SELECT make_interval(0, 0, 1, 1, 12, 30, 01.001001);
8 days 12 hours 30 minutes 1.001001 seconds
Related functions
make_dt_interval function
make_date function
make_timestamp function
make_ym_interval function
make_ym_interval function
7/21/2022 • 2 minutes to read
Syntax
make_ym_interval( [ years [, months ] ] )
Arguments
years : An integral number of years, positive or negative
months : An integral number of months, positive or negative
Returns
An INTERVAL YEAR TO MONTH .
Unspecified arguments are defaulted to 0. If you provide no arguments the result is an
INTERVAL '0-0' YEAR TO MONTH .
Examples
> SELECT make_ym_interval(100, 5);
100-5
Related functions
make_dt_interval function
make_date function
make_timestamp function
make_timestamp function
7/21/2022 • 2 minutes to read
Creates a timestamp from year , month , day , hour , min , sec , and timezone fields.
Syntax
make_timestamp(year, month, day, hour, min, sec [, timezone] )
Arguments
year : An INTEGER expression evaluating to a value from 1 to 9999.
month : An INTEGER expression evaluating to a value from 1 (January) to 12 (December).
day : An INTEGER expression evaluating to a value from 1 to 31.
hour : An INTEGER expression evaluating to a value between 0 and 23.
min : An INTEGER expression evaluating to a value between 0 and 59.
sec : A numeric expression evaluating to a value between 0 and 60.
timezone : An optional STRING expression evaluating to a valid timezone string. For example: CET, UTC.
Returns
A TIMESTAMP.
If any of the arguments is out of bounds, the function returns an error. If sec is 60 it is interpreted as 0 and a
minute is added to the result.
NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for out of bounds arguments.
Examples
> SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887);
2014-12-28 06:30:45.887
> SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887, 'CET');
2014-12-27 21:30:45.887
> SELECT make_timestamp(2019, 6, 30, 23, 59, 60);
2019-07-01 00:00:00
> SELECT make_timestamp(2019, 13, 1, 10, 11, 12, 'PST');
NULL
> SELECT make_timestamp(NULL, 7, 22, 15, 30, 0);
NULL
Related functions
make_date function
map function
7/21/2022 • 2 minutes to read
Syntax
map( [key1, value1] [, ...] )
Arguments
keyN : An expression of any comparable type. All keyN must share a least common type.
valueN : An expression of any type. All valueN must share a least common type.
Returns
A MAP with keys typed as the least common type of keyN and values typed as the least common type of
valueN .
Examples
> SELECT map(1.0, '2', 3.0, '4');
{1.0 -> 2, 3.0 -> 4}
Related
[ ] operator
map_concat function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_keys function
map_values function
map_zip_with function
SQL data type rules
map_concat function
7/21/2022 • 2 minutes to read
Syntax
map_concat([ expr1 [, ...] ] )
Arguments
exprN : A MAP expression. All exprN must share a least common type.
Returns
A MAP of the least common type of exprN .
If no argument is provided, an empty map. If there is a key collision an error is raised.
Examples
> SELECT map_concat(map(1, 'a', 2, 'b'), map(3, 'c'));
{1 -> a, 2 -> b, 3 -> c}
Related
map function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_keys function
map_values function
map_zip_with function
SQL data type rules
map_contains_key function
7/21/2022 • 2 minutes to read
Syntax
map_contains_key(map, key)
Arguments
map : A MAP expression to be searched.
key : An expression with a type sharing a least common type with the map keys.
Returns
A BOOLEAN. If map or key is NULL , the result is NULL .
Examples
> SELECT map_contains_key(map(1, 'a', 2, 'b'), 2);
true
Related
array_contains function
map function
map_keys function
map_values function
SQL data type rules
map_entries function
7/21/2022 • 2 minutes to read
Syntax
map_entries(map)
Arguments
map : A MAP expression.
Returns
An ARRAY of STRUCTs holding key-value pairs.
Examples
> SELECT map_entries(map(1, 'a', 2, 'b'));
[{1, a}, {2, b}]
Related functions
map function
map_concat function
map_from_entries function
map_filter function
map_from_arrays function
map_keys function
map_values function
map_zip_with function
map_filter function
7/21/2022 • 2 minutes to read
Syntax
map_filter(expr, func)
Arguments
expr : A MAP expression.
func : A lambda function with two parameters returning a BOOLEAN. The first parameter takes the key the
second parameter takes the value.
Returns
The result is the same type as expr .
Examples
> SELECT map_filter(map(1, 0, 2, 2, 3, -1), (k, v) -> k > v);
{1 -> 0, 3 -> -1}
Related functions
map function
map_concat function
map_entries function
map_from_arrays function
map_from_entries function
map_keys function
map_values function
map_zip_with function
map_from_arrays function
7/21/2022 • 2 minutes to read
Syntax
map_from_arrays(keys, values)
Arguments
keys : An ARRAY expression without duplicates or NULL.
values : An ARRAY expression of the same cardinality as keys
Returns
A MAP where keys are of the element type of keys and values are of the element type of values .
Examples
> SELECT map_from_arrays(array(1.0, 3.0), array('2', '4'));
{1.0 -> 2, 3.0 -> 4}
Related functions
map function
map_concat function
map_entries function
map_filter function
map_from_entries function
map_keys function
map_values function
map_zip_with function
map_from_entries function
7/21/2022 • 2 minutes to read
Syntax
map_from_entries(expr)
Arguments
expr : An ARRAY expression of STRUCT with two fields.
Returns
A MAP where keys are the first field of the structs and values the second. There must be no duplicates or nulls in
the first field (the key).
Examples
> SELECT map_from_entries(array(struct(1, 'a'), struct(2, 'b')));
{1 -> a, 2 -> b}
Related functions
map function
map_concat function
map_entries function
map_filter function
map_from_arrays function
map_keys function
map_values function
map_zip_with function
map_keys function
7/21/2022 • 2 minutes to read
Syntax
map_keys(map)
Arguments
map : A MAP expression.
Returns
An ARRAY where the element type matches the map key type.
Examples
> SELECT map_keys(map(1, 'a', 2, 'b'));
[1,2]
Related functions
map function
map_concat function
map_contains_key function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_values function
map_zip_with function
map_values function
7/21/2022 • 2 minutes to read
Syntax
map_values(map)
Arguments
map : A MAP expression.
Returns
An ARRAY where the element type matches the map value type.
Examples
> SELECT map_values(map(1, 'a', 2, 'b'));
[a,b]
Related functions
map function
map_concat function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_keys function
map_zip_with function
map_zip_with function
7/21/2022 • 2 minutes to read
Syntax
map_zip_with(map1, map2, func)
Arguments
map1 : A MAP expression.
map2 : A MAP expression of the same key type as map1
func : A lambda function taking three parameters. The first parameter is the key, followed by the values from
each map.
Returns
A MAP where the key matches the key type of the input maps and the value is typed by the return type of the
lambda function.
If a key is not matched by one side the respective value provided to the lambda function is NULL.
Examples
> SELECT map_zip_with(map(1, 'a', 2, 'b'), map(1, 'x', 2, 'y'), (k, v1, v2) -> concat(v1, v2));
{1 -> ax, 2 -> by}
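An extra illustrative sketch (not from the original examples) of the NULL behavior for unmatched keys; coalesce is used only to make the NULL visible:
> SELECT map_zip_with(map(1, 'a'), map(1, 'x', 2, 'y'), (k, v1, v2) -> concat(coalesce(v1, '_'), coalesce(v2, '_')));
{1 -> ax, 2 -> _y}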
Related functions
map function
map_concat function
map_entries function
map_filter function
map_from_arrays function
map_from_entries function
map_keys function
map_values function
max aggregate function
7/21/2022 • 2 minutes to read
Syntax
max(expr) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression of any type that can be ordered.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
The result type matches the type of the argument.
Examples
> SELECT max(col) FROM VALUES (10), (50), (20) AS tab(col);
50
Related functions
min aggregate function
avg aggregate function
max_by aggregate function
min_by aggregate function
mean aggregate function
max_by aggregate function
7/21/2022 • 2 minutes to read
Returns the value of an expr1 associated with the maximum value of expr2 in a group.
Syntax
max_by(expr1, expr2) [FILTER ( WHERE cond ) ]
Arguments
expr1 : An expression of any type.
expr2 : An expression of a type that can be ordered.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
The result type matches the type of expr1 .
This function is non-deterministic if expr2 is not unique within the group.
Examples
> SELECT max_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
b
Related functions
min aggregate function
avg aggregate function
max aggregate function
min_by aggregate function
md5 function
7/21/2022 • 2 minutes to read
Syntax
md5(expr)
Arguments
expr : A BINARY expression.
Returns
A STRING.
Examples
> SELECT md5('Spark');
8cde774d6f7333752ed72cacddb05126
Related functions
crc32 function
hash function
sha function
sha1 function
sha2 function
mean aggregate function
7/21/2022 • 2 minutes to read
Syntax
mean ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
The result type is computed as for the arguments:
DECIMAL(p, s) : The result type is a DECIMAL(p + 4, s + 4) . If the maximum precision for DECIMAL is reached
the increase in scale will be limited to avoid loss of significant digits.
year-month interval: The result is an INTERVAL YEAR TO MONTH .
day-time interval: The result is an INTERVAL DAY TO SECOND .
In all other cases the result is a DOUBLE.
Nulls within the group are ignored. If a group is empty or consists only of nulls the result is NULL.
If DISTINCT is specified the mean is computed after duplicates have been removed.
If the result overflows the result type, Databricks Runtime raises an overflow error. To return a NULL instead use
try_avg.
WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but return NULL.
Examples
> SELECT mean(col) FROM VALUES (1), (2), (3) AS tab(col);
2.0
> SELECT avg(DISTINCT col) FROM VALUES (1), (1), (2) AS tab(col);
1.5
> SELECT mean(col) FROM VALUES (INTERVAL '1' YEAR), (INTERVAL '2' YEAR) AS tab(col);
1-6
Related functions
avg aggregate function
max aggregate function
min aggregate function
sum aggregate function
try_avg aggregate function
min aggregate function
7/21/2022 • 2 minutes to read
Syntax
min(expr) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression of any type that can be ordered.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
The result type matches the type of the argument.
Examples
> SELECT min(col) FROM VALUES (10), (50), (20) AS tab(col);
10
Related functions
max aggregate function
avg aggregate function
max_by aggregate function
min_by aggregate function
mean aggregate function
min_by aggregate function
7/21/2022 • 2 minutes to read
Returns the value of an expr1 associated with the minimum value of expr2 in a group.
Syntax
min_by(expr1, expr2) [FILTER ( WHERE cond ) ]
Arguments
expr1 : An expression of any type.
expr2 : An expression of a type that can be ordered.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
The result type matches the type of expr1 .
This function is non-deterministic if expr2 is not unique within the group.
Examples
> SELECT min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
a
Related functions
min aggregate function
avg aggregate function
max aggregate function
max_by aggregate function
- (minus sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 - expr2
Arguments
expr1 : A numeric, DATE, TIMESTAMP, or INTERVAL expression.
expr2 : The accepted type depends on the type of expr1 :
If expr1 is a numeric, expr2 must be a numeric expression.
If expr1 is a year-month or day-time interval, expr2 must be of the matching class of interval.
Otherwise expr2 must be a DATE or TIMESTAMP.
Returns
The result type is determined in the following order:
If expr1 is a numeric, the result is the common maximum type of the arguments.
If expr1 is a DATE and expr2 is a day-time interval the result is a TIMESTAMP.
If expr1 is a TIMESTAMP and expr2 is an interval the result is a TIMESTAMP.
If expr1 and expr2 are DATEs the result is an INTERVAL DAYS .
If expr1 or expr2 are TIMESTAMP the result is an INTERVAL DAY TO SECOND .
If expr1 and expr2 are year-month intervals the result is a year-month interval of sufficiently wide units to
represent the result.
If expr1 and expr2 are day-time intervals the result is a day-time interval of sufficiently wide units to
represent the result.
Otherwise, the result type matches expr1 .
If both expressions are interval they must be of the same class.
When you subtract a year-month interval from a DATE, Databricks Runtime ensures that the resulting date is
well-formed.
If the result overflows the result type, Databricks Runtime raises an ARITHMETIC_OVERFLOW error.
Use try_subtract to return NULL on overflow.
WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but “wrap” the result.
Examples
> SELECT 2 - 1;
1
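Additional illustrative examples of the date and interval rules above (results assume the default session time zone):
> SELECT DATE'2021-03-20' - INTERVAL '1' YEAR;
2020-03-20
> SELECT TIMESTAMP'2021-03-20 12:00:00' - INTERVAL '3' HOUR;
2021-03-20 09:00:00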
Related functions
* (asterisk sign) operator
+ (plus sign) operator
/ (slash sign) operator
sum aggregate function
try_add function
try_subtract function
- (minus sign) unary operator
7/21/2022 • 2 minutes to read
Syntax
- expr
Arguments
expr : An expression that evaluates to a numeric or interval.
Returns
The result type matches the argument type.
For integral numeric types the function can return an ARITHMETIC_OVERFLOW error.
This function is a synonym for negative function.
WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but “wrap” the result.
Examples
> SELECT -(1);
-1
Related functions
sign function
abs function
positive function
+ (plus sign) unary operator
negative function
minute function
7/21/2022 • 2 minutes to read
Syntax
minute(expr)
Arguments
expr : A TIMESTAMP expression or a STRING of a valid timestamp format.
Returns
An INTEGER.
This function is a synonym for extract(MINUTES FROM expr) .
Examples
> SELECT minute('2009-07-30 12:58:59');
58
Related functions
dayofmonth function
dayofweek function
dayofyear function
day function
hour function
month function
extract function
mod function
7/21/2022 • 2 minutes to read
Syntax
mod(dividend, divisor)
Arguments
dividend : An expression that evaluates to a numeric.
divisor : An expression that evaluates to a numeric.
Returns
If both dividend and divisor are of DECIMAL , the result matches the divisor’s type. In all other cases, a
DOUBLE.
If divisor is 0, the function raises a DIVIDE_BY_ZERO error.
This function is equivalent to the % (percent sign) operator.
Examples
> SELECT MOD(2, 1.8);
0.2
Related functions
% (percent sign) operator
/ (slash sign) operator
pmod function
monotonically_increasing_id function
7/21/2022 • 2 minutes to read
Syntax
monotonically_increasing_id()
Arguments
This function takes no arguments.
Returns
A BIGINT.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
Examples
> SELECT monotonically_increasing_id();
0
Related functions
uuid function
month function
7/21/2022 • 2 minutes to read
Syntax
month(expr)
Arguments
expr : A TIMESTAMP expression or a STRING of a valid timestamp format.
Returns
An INTEGER.
This function is a synonym for extract(MONTH FROM expr) .
Examples
> SELECT month('2016-07-30');
7
Related functions
dayofmonth function
dayofweek function
dayofyear function
day function
hour function
minute function
extract function
months_between function
7/21/2022 • 2 minutes to read
Returns the number of months elapsed between dates or timestamps in expr1 and expr2 .
Syntax
months_between(expr1, expr2 [, roundOff] )
Arguments
expr1 : A DATE or TIMESTAMP expression.
expr2 : An expression of the same type as expr1 .
roundOff : An optional BOOLEAN expression.
Returns
A DOUBLE.
If expr1 is later than expr2 , the result is positive.
If expr1 and expr2 are on the same day of the month, or both are the last day of the month, time of day is
ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits unless
roundOff =false.
Examples
> SELECT months_between('1997-02-28 10:30:00', '1996-10-30');
3.94959677
> SELECT months_between('1997-02-28 10:30:00', '1996-10-30', false);
3.9495967741935485
Related functions
- (minus sign) operator
add_months function
datediff function
datediff (timestamp) function
date_add function
date_sub function
dateadd function
named_struct function
7/21/2022 • 2 minutes to read
Syntax
named_struct( {name1, val1} [, ...] )
Arguments
nameN : A STRING literal naming field N.
valN : An expression of any type specifying the value for field N.
Returns
A struct with field N matching the type of valN .
Examples
> SELECT named_struct('a', 1, 'b', 2, 'c', 3);
{ 1, 2, 3}
Related functions
struct function
map function
str_to_map function
array function
nanvl function
7/21/2022 • 2 minutes to read
Syntax
nanvl(expr1, expr2)
Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.
Returns
A DOUBLE.
The function returns expr1 if it is not NaN, or expr2 otherwise.
Examples
> SELECT nanvl(cast('NaN' AS DOUBLE), 123);
123.0
Related functions
coalesce function
negative function
7/21/2022 • 2 minutes to read
Syntax
negative(expr)
Arguments
expr : An expression that evaluates to a numeric or interval.
Returns
The result type matches the argument type.
For integral numeric types the function can return an ARITHMETIC_OVERFLOW error.
This function is a synonym for - (minus sign) unary operator.
WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but “wrap” the result.
Examples
> SELECT negative(1);
-1
Related functions
sign function
abs function
positive function
- (minus sign) unary operator
+ (plus sign) unary operator
next_day function
7/21/2022 • 2 minutes to read
Returns the first date which is later than expr and named as in dayOfWeek .
Syntax
next_day(expr, dayOfWeek)
Arguments
expr : A DATE expression.
dayOfWeek : A STRING expression identifying a day of the week.
Returns
A DATE.
dayOfWeek must be one of the following (case insensitive):
'SU' , 'SUN' , 'SUNDAY'
'MO' , 'MON' , 'MONDAY'
'TU' , 'TUE' , 'TUESDAY'
'WE' , 'WED' , 'WEDNESDAY'
'TH' , 'THU' , 'THURSDAY'
'FR' , 'FRI' , 'FRIDAY'
'SA' , 'SAT' , 'SATURDAY'
NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for a malformed dayOfWeek .
Examples
> SELECT next_day('2015-01-14', 'TU');
2015-01-20
Related functions
dayofweek function
last_day function
not operator
7/21/2022 • 2 minutes to read
Syntax
not expr
Arguments
expr : A BOOLEAN expression.
Returns
A BOOLEAN.
This operator is an alias for ! (bang sign) operator.
Examples
> SELECT not true;
false
> SELECT not false;
true
> SELECT not NULL;
NULL
Related functions
& (ampersand sign) operator
| (pipe sign) operator
! (bang sign) operator
now function
7/21/2022 • 2 minutes to read
Syntax
now()
Arguments
This function takes no arguments.
Returns
A TIMESTAMP.
Examples
> SELECT now();
2020-04-25 15:49:11.914
Related functions
current_date function
current_timezone function
current_timestamp function
nth_value analytic window function
7/21/2022 • 2 minutes to read
Syntax
nth_value(expr, offset) [ IGNORE NULLS | RESPECT NULLS ]
Arguments
expr : An expression of any type.
offset : An INTEGER literal greater than 0.
IGNORE NULLS or RESPECT NULLS : When IGNORE NULLS is used any expr value that is NULL is ignored in the
count. The default is RESPECT NULLS .
Returns
The result type matches the expr type.
The window function returns the value of expr at the row that is the offset th row from the beginning of the
window frame.
If there is no such offset th row, returns NULL .
You must use the ORDER BY clause with this function. If the order is non-unique, the result is non-
deterministic.
Examples
> SELECT a, b, nth_value(b, 2) OVER (PARTITION BY a ORDER BY b)
FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
A1 1 1
A1 1 1
A1 2 1
A2 3 NULL
Related functions
lag analytic window function
lead analytic window function
first aggregate function
last aggregate function
Window functions
ntile ranking window function
7/21/2022 • 2 minutes to read
Divides the rows for each window partition into n buckets ranging from 1 to at most n .
Syntax
ntile([n])
Arguments
n : An optional INTEGER literal greater than 0.
Returns
An INTEGER.
The default for n is 1. If n is greater than the actual number of rows in the window, each row is placed in its own
bucket. You must use the ORDER BY clause with this function.
If the order is non-unique, the result is non-deterministic.
Examples
> SELECT a, b, ntile(2) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1',
1) tab(a, b);
A1 1 1
A1 1 1
A1 2 2
A2 3 1
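An additional illustrative sketch (not from the original examples) of the case where n exceeds the number of rows in a partition, so each row lands in its own bucket:
> SELECT a, b, ntile(5) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1',
1) tab(a, b);
A1 1 1
A1 1 2
A1 2 3
A2 3 1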
Related functions
Window functions
nullif function
7/21/2022 • 2 minutes to read
Syntax
nullif(expr1, expr2)
Arguments
expr1 : An expression of any type.
expr2 : An expression of the same type as expr1 .
Returns
NULL if expr1 equals to expr2 , or expr1 otherwise.
Examples
> SELECT nullif(2, 2);
NULL
> SELECT nullif(2, 3);
2
Related functions
coalesce function
decode function
case expression
nvl function
7/21/2022 • 2 minutes to read
Syntax
nvl(expr1, expr2)
Arguments
expr1 : An expression of any type.
expr2 : An expression that shares a least common type with expr1 .
Returns
The result type is the least common type of the argument types.
This function is a synonym for coalesce(expr1, expr2) .
Examples
> SELECT nvl(NULL, 2);
2
> SELECT nvl(3, 2);
3
Related functions
coalesce function
nvl2 function
nvl2 function
7/21/2022 • 2 minutes to read
Syntax
nvl2(expr1, expr2, expr3)
Arguments
expr1 : An expression of any type.
expr2 : An expression of any type.
expr3 : An expression that shares a least common type with expr2 .
Returns
The result is least common type of expr2 and expr3 .
This function is a synonym for CASE WHEN expr1 IS NOT NULL THEN expr2 ELSE expr3 END .
Examples
> SELECT nvl2(NULL, 2, 1);
1
> SELECT nvl2('spark', 2, 1);
2
Related functions
coalesce function
case expression
nvl function
octet_length function
7/21/2022 • 2 minutes to read
Returns the byte length of string data or number of bytes of binary data.
Syntax
octet_length(expr)
Arguments
expr : A STRING or BINARY expression.
Returns
An INTEGER.
Examples
> SELECT octet_length('Spark SQL');
9
> SELECT octet_length('서울시');
9
Related functions
bit_length function
length function
char_length function
character_length function
or operator
7/21/2022 • 2 minutes to read
Syntax
expr1 or expr2
Arguments
expr1 : A BOOLEAN expression.
expr2 : A BOOLEAN expression.
Returns
A BOOLEAN.
Examples
> SELECT true or false;
true
> SELECT false or false;
false
> SELECT true or NULL;
true
> SELECT false or NULL;
NULL
Related functions
and predicate
or operator
not operator
! (bang sign) operator
overlay function
7/21/2022 • 2 minutes to read
Replaces the portion of input that starts at pos and is of length len with replace .
Syntax
overlay(input, replace, pos[, len])
Arguments
input : A STRING or BINARY expression.
replace : An expression of the same type as input .
pos : An INTEGER expression.
len : An optional INTEGER expression.
Returns
The result type matches the type of input .
If pos is negative the position is counted starting from the back. len must be greater or equal to 0. len
specifies the length of the snippet within input to be replaced. The default for len is the length of replace .
Examples
> SELECT overlay('Spark SQL', 'ANSI ', 7, 0);
Spark ANSI SQL
> SELECT overlay('Spark SQL' PLACING '_' FROM 6);
Spark_SQL
> SELECT overlay('Spark SQL' PLACING 'CORE' FROM 7);
Spark CORE
> SELECT overlay('Spark SQL' PLACING 'ANSI ' FROM 7 FOR 0);
Spark ANSI SQL
> SELECT overlay('Spark SQL' PLACING 'tructured' FROM 2 FOR 4);
Structured SQL
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 6);
[53 70 61 72 6B 5F 53 51 4C]
Related functions
replace function
regexp_replace function
parse_url function
7/21/2022 • 2 minutes to read
Syntax
parse_url(url, partToExtract [, key] )
Arguments
url : A STRING expression.
partToExtract : A STRING expression.
key : A STRING expression.
Returns
A STRING.
partToExtract must be one of:
'HOST'
'PATH'
'QUERY'
'REF'
'PROTOCOL'
'FILE'
'AUTHORITY'
'USERINFO'
NOTE
If spark.sql.ansi.enabled is false parse_url returns NULL if the url string is invalid.
Examples
> SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST');
spark.apache.org
Related functions
percent_rank ranking window function
7/21/2022 • 2 minutes to read
Syntax
percent_rank()
Arguments
The function takes no arguments
Returns
A DOUBLE.
The function is defined as the rank within the window minus one divided by the number of rows within the
window minus one. If there is only one row in the window the rank is 0.
As an expression the semantic can be expressed as:
nvl((rank() OVER(PARTITION BY p ORDER BY o) - 1) / nullif(count(1) OVER(PARTITION BY p) - 1, 0), 0)
This function is similar, but not the same as cume_dist analytic window function.
You must include ORDER BY clause in the window specification.
Examples
> SELECT a, b, percent_rank(b) OVER (PARTITION BY a ORDER BY b)
FROM VALUES ('A1', 2), ('A1', 1), ('A1', 3), ('A1', 6), ('A1', 7), ('A1', 7), ('A2', 3), ('A1', 1)
tab(a, b)
A1 1 0.0
A1 1 0.0
A1 2 0.3333333333333333
A1 3 0.5
A1 6 0.6666666666666666
A1 7 0.8333333333333334
A1 7 0.8333333333333334
A2 3 0.0
Related functions
cume_dist analytic window function
rank ranking window function
Window functions
percentile aggregate function
7/21/2022 • 2 minutes to read
Returns the exact percentile value of expr at the specified percentage in a group.
Syntax
percentile ( [ALL | DISTINCT] expr, percentage [, frequency] ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric.
percentage : A numeric expression between 0 and 1 or an ARRAY of numeric expressions, each between 0
and 1.
frequency : An optional integral number literal greater than 0.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
DOUBLE if percentage is numeric, or an ARRAY of DOUBLE if percentage is an ARRAY.
Frequency describes the number of times expr must be counted. A frequency of 10 for a specific value is
equivalent to that value appearing 10 times in the window at a frequency of 1. The default frequency is 1.
If DISTINCT is specified the function operates only on a unique set of expr values.
Examples
> SELECT percentile(col, 0.3) FROM VALUES (0), (10), (10) AS tab(col);
6.0
> SELECT percentile(DISTINCT col, 0.3) FROM VALUES (0), (10), (10) AS tab(col);
3.0
> SELECT percentile(col, 0.3, freq) FROM VALUES (0, 1), (10, 2) AS tab(col, freq);
6.0
> SELECT percentile(col, array(0.25, 0.75)) FROM VALUES (0), (10) AS tab(col);
[2.5,7.5]
Related functions
approx_percentile aggregate function
percentile_approx aggregate function
percentile_cont aggregate function
percentile_approx aggregate function
7/21/2022 • 2 minutes to read
Syntax
percentile_approx ( [ALL | DISTINCT ] expr, percentile [, accuracy] ) [FILTER ( WHERE cond ) ]
Arguments
expr : A numeric expression.
percentile : A numeric literal between 0 and 1 or a literal array of numeric values, each between 0 and 1.
accuracy : An INTEGER literal greater than 0. If accuracy is omitted it is set to 10000 .
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
The aggregate function returns the expression which is the smallest value in the ordered group (sorted from
least to greatest) such that no more than percentile of expr values is less than the value or equal to that
value. If percentile is an array, percentile_approx returns the approximate percentile array of expr at the
specified percentile.
The accuracy parameter controls approximation accuracy at the cost of memory. Higher value of accuracy
yields better accuracy, 1.0/accuracy is the relative error of the approximation.
If DISTINCT is specified the function operates only on a unique set of expr values.
This function is a synonym for approx_percentile aggregate function.
Examples
> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), 100)
FROM VALUES (0), (1), (2), (10) AS tab(col);
[1,1,0]
Related functions
approx_percentile aggregate function
approx_count_distinct aggregate function
percentile aggregate function
percentile_cont aggregate function
percentile_cont aggregate function
7/21/2022 • 2 minutes to read
Returns the value that corresponds to the percentile of the provided sortKeys using a continuous
distribution model.
Since: Databricks Runtime 10.3
Syntax
percentile_cont ( percentile )
WITHIN GROUP (ORDER BY sortKey [ASC | DESC] )
Arguments
percentile : A numeric literal between 0 and 1 or a literal array of numeric literals, each between 0 and 1.
sortKey : A numeric expression over which the percentile will be computed.
ASC or DESC : Optionally specify whether the percentile is computed using ascending or descending order.
The default is ASC .
Returns
DOUBLE if percentile is numeric, or an ARRAY of DOUBLE if percentile is an ARRAY.
The aggregate function returns the interpolated percentile within the group of sortKeys .
Examples
-- Return the median, 40%-ile and 10%-ile.
> SELECT percentile_cont(array(0.5, 0.4, 0.1)) WITHIN GROUP (ORDER BY col)
FROM VALUES (0), (1), (2), (10) AS tab(col);
[1.5, 1.2000000000000002, 0.30000000000000004]
Related functions
percentile_approx aggregate function
approx_count_distinct aggregate function
percentile aggregate function
percentile_disc aggregate function
percentile_disc aggregate function
7/21/2022 • 2 minutes to read
Returns the value that corresponds to the percentile of the provided sortKey using a discrete distribution
model.
Since: Databricks Runtime 11.0
Syntax
percentile_disc ( percentile )
WITHIN GROUP (ORDER BY sortKey [ASC | DESC] )
Arguments
percentile : A numeric literal between 0 and 1 or a literal array of numeric literals, each between 0 and 1.
sortKey : A numeric expression over which the percentile is computed.
ASC or DESC : Optionally specify whether the percentile is computed using ascending or descending order.
The default is ASC .
Returns
DOUBLE if percentile is numeric, or an ARRAY of DOUBLE if percentile is an ARRAY.
The aggregate function returns the sortKey value that matches the percentile within the group of sortKeys .
Examples
-- Return the median, 40%-ile and 10%-ile.
> SELECT percentile_disc(array(0.5, 0.4, 0.1)) WITHIN GROUP (ORDER BY col)
FROM VALUES (0), (1), (2), (10) AS tab(col);
[1, 1, 0]
Related functions
percentile_approx aggregate function
approx_count_distinct aggregate function
percentile aggregate function
percentile_cont aggregate function
% (percent sign) operator
7/21/2022 • 2 minutes to read
Syntax
dividend % divisor
Arguments
dividend : An expression that evaluates to a numeric.
divisor : An expression that evaluates to a numeric.
Returns
If both dividend and divisor are of DECIMAL, the result matches the divisor’s type. In all other cases, a
DOUBLE.
If divisor is 0 (zero) the function raises a DIVIDE_BY_ZERO error.
This function is equivalent to the mod function.
Examples
> SELECT 2 % 1.8;
0.2
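For illustration, the equivalence with mod can be checked with the same operands (output shown is the expected result):
> SELECT mod(2, 1.8);
0.2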
Related functions
mod function
/ (slash sign) operator
pmod function
pi function
7/21/2022 • 2 minutes to read
Returns pi.
Syntax
pi()
Arguments
The function takes no arguments.
Returns
A DOUBLE.
Examples
> SELECT pi();
3.141592653589793
Related functions
e function
|| (pipe pipe sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 || expr2
Arguments
expr1 : A STRING, BINARY or ARRAY of STRING or BINARY expression.
expr2 : An expression with type matching expr1 .
Returns
The result type matches the argument types.
This operator is a synonym for the concat function.
Examples
> SELECT 'Spark' || 'SQL';
SparkSQL
> SELECT array(1, 2, 3) || array(4, 5) || array(6);
[1,2,3,4,5,6]
Related functions
concat function
array_join function
array_union function
concat_ws function
| (pipe sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 | expr2
Arguments
expr1 : An integral numeric type expression.
expr2 : An integral numeric type expression.
Returns
The bitwise OR of expr1 and expr2 .
The result type matches the widest type of expr1 and expr2 .
Examples
> SELECT 3 | 5;
7
Related functions
& (ampersand sign) operator
~ (tilde sign) operator
^ (caret sign) operator
bit_count function
+ (plus sign) operator
7/21/2022 • 2 minutes to read
Syntax
expr1 + expr2
Arguments
expr1 : A numeric, DATE, TIMESTAMP, or INTERVAL expression.
expr2 : A numeric expression if expr1 is numeric; otherwise, an INTERVAL expression.
Returns
If expr1 is a numeric, the common maximum type of the arguments.
If expr1 is a DATE and expr2 is a day-time interval the result is a TIMESTAMP.
If expr1 and expr2 are year-month intervals the result is a year-month interval of sufficiently wide units to
represent the result.
If expr1 and expr2 are day-time intervals the result is a day-time interval of sufficiently wide units to
represent the result.
Otherwise, the result type matches expr1 .
If both expressions are interval they must be of the same class.
When you add a year-month interval to a DATE, Databricks Runtime ensures that the resulting date is well-
formed.
If the result overflows the result type, Databricks Runtime raises an ARITHMETIC_OVERFLOW error.
Use try_add to return NULL on overflow.
WARNING
If spark.sql.ansi.enabled is false an overflow will not cause an error but “wrap” the result.
Examples
> SELECT 1 + 2;
3
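As an additional illustration of the year-month interval rule above, adding one month to the last day of January should clamp to a well-formed date (output shown is the expected result):
> SELECT DATE'2021-01-31' + INTERVAL '1' MONTH;
2021-02-28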
Related functions
* (asterisk sign) operator
- (minus sign) operator
/ (slash sign) operator
sum aggregate function
try_add function
try_divide function
+ (plus sign) unary operator
7/21/2022 • 2 minutes to read
Syntax
+ expr
Arguments
expr : An expression that evaluates to a numeric or INTERVAL.
Returns
The result type matches the argument.
This function is a no-op.
+ is a synonym for positive function.
Examples
> SELECT +(1);
1
> SELECT +(-1);
-1
Related functions
negative function
abs function
sign function
- (minus sign) unary operator
positive function
pmod function
7/21/2022 • 2 minutes to read
Syntax
pmod(dividend, divisor)
Arguments
dividend : An expression that evaluates to a numeric.
divisor : An expression that evaluates to a numeric.
Returns
If both dividend and divisor are of DECIMAL the result matches the type of divisor . In all other cases a
DOUBLE.
If divisor is 0 the function raises a DIVIDE_BY_ZERO error.
Unlike the mod function, the result is non-negative when the divisor is positive; see the additional example below.
Examples
> SELECT pmod(10, 3);
1
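An additional example contrasting pmod with mod for a negative dividend (output shown is the expected result):
> SELECT pmod(-10, 3);
2
> SELECT mod(-10, 3);
-1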
Related functions
mod function
% (percent sign) operator
posexplode table-valued generator function
7/21/2022 • 2 minutes to read
Syntax
posexplode(expr)
Arguments
expr : An ARRAY or MAP expression.
Returns
A set of rows composed of the other expressions in the select list, the position of the elements in the array or
map, and the elements of the array, or keys and values of the map.
If expr is NULL , no rows are produced.
The columns produced by posexplode of an array are named pos and col by default, but can be aliased. You can
also alias them using an alias tuple such as AS (myPos, myValue) .
The columns for maps are by default called pos, key and value. You can also alias them using an alias tuple such
as AS (myPos, myKey, myValue) .
You can place posexplode only in the select list or a LATERAL VIEW. When placing the function in the select list
there must be no other generator function in the same select list.
Examples
> SELECT posexplode(array(10, 20)) AS (r, elem), 'Spark';
0 10 Spark
1 20 Spark
> SELECT posexplode(map(1, 'a', 2, 'b')) AS (r, num, val), 'Spark';
0 1 a Spark
1 2 b Spark
Related functions
explode table-valued generator function
explode_outer table-valued generator function
inline table-valued generator function
inline_outer table-valued generator function
posexplode_outer table-valued generator function
stack table-valued generator function
posexplode_outer table-valued generator function
7/21/2022 • 2 minutes to read
Returns rows by un-nesting the array with numbering of positions using OUTER semantics.
Syntax
posexplode_outer(expr)
Arguments
expr : An ARRAY or MAP expression.
Returns
A set of rows composed of the other expressions in the select list, the position of the elements in the array or
map, and the elements of the array, or keys and values of the map.
If expr is NULL , a single row with NULLs for the array or map values.
The columns produced by posexplode_outer of an array are named pos and col by default, but can be aliased.
You can also alias them using an alias tuple such as AS (myPos, myValue) .
The columns for maps are by default called pos , key , and value . You can also alias them using an alias tuple
such as AS (myPos, myKey, myValue) .
You can place posexplode_outer only in the select list or a LATERAL VIEW . When placing the function in the select
list there must be no other generator function in the same select list.
Examples
> SELECT posexplode_outer(array(10, 20)) AS (r, elem), 'Spark';
0 10 Spark
1 20 Spark
> SELECT posexplode_outer(map(1, 'a', 2, 'b')) AS (r, num, val), 'Spark';
0 1 a Spark
1 2 b Spark
> SELECT posexplode_outer(cast(NULL AS array<int>)), 'Spark';
NULL Spark
Related functions
explode table-valued generator function
explode_outer table-valued generator function
inline table-valued generator function
inline_outer table-valued generator function
posexplode table-valued generator function
position function
7/21/2022 • 2 minutes to read
Returns the position of the first occurrence of substr in str after position pos .
Syntax
position(substr, str [, pos] )
position(substr IN str)
Arguments
substr : A STRING expression.
str : A STRING expression.
pos : An INTEGER expression.
Returns
An INTEGER.
The specified pos and return value are 1-based. If pos is omitted, substr is searched from the beginning of
str . If pos is less than 1, the result is 0.
Examples
> SELECT position('bar', 'foobarbar');
4
> SELECT position('bar', 'foobarbar', 5);
7
> SELECT position('bar' IN 'foobarbar');
4
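An additional example of the pos rule above; a pos of less than 1 yields 0 (output shown is the expected result):
> SELECT position('bar', 'foobarbar', 0);
0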
Related functions
charindex function
instr function
locate function
positive function
7/21/2022 • 2 minutes to read
Syntax
positive(expr)
Arguments
expr : An expression that evaluates to a numeric or INTERVAL.
Returns
The result type matches the argument.
This function is a no-op.
positive is a synonym for + (plus sign) unary operator.
Examples
> SELECT positive(1);
1
> SELECT positive(-1);
-1
Related functions
negative function
abs function
sign function
+ (plus sign) unary operator
- (minus sign) unary operator
pow function
7/21/2022 • 2 minutes to read
Syntax
pow(expr1, expr2)
Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.
Returns
A DOUBLE.
This function is a synonym for power function.
Examples
> SELECT pow(2, 3);
8.0
Related functions
exp function
log function
log10 function
ln function
power function
power function
7/21/2022 • 2 minutes to read
Syntax
power(expr1, expr2)
Arguments
expr1 : An expression that evaluates to a numeric.
expr2 : An expression that evaluates to a numeric.
Returns
A DOUBLE.
This function is a synonym for pow function.
Examples
> SELECT power(2, 3);
8.0
Related functions
exp function
log function
log10 function
ln function
pow function
printf function
7/21/2022 • 2 minutes to read
Syntax
printf(strfmt[, obj1, ...])
Arguments
strfmt : A STRING expression.
objN : A STRING or numeric expression.
Returns
A STRING.
Examples
> SELECT printf('Hello World %d %s', 100, 'days');
Hello World 100 days
Related functions
format_number function
format_string function
quarter function
7/21/2022 • 2 minutes to read
Syntax
quarter(expr)
Arguments
expr : A DATE or TIMESTAMP expression.
Returns
An INTEGER.
This function is a synonym for extract(QUARTER FROM expr) .
Examples
> SELECT quarter('2016-08-31');
3
Related functions
day function
dayofweek function
dayofyear function
extract function
month function
radians function
7/21/2022 • 2 minutes to read
Syntax
radians(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
Given an angle in degrees, returns the associated radians.
Examples
> SELECT radians(180);
3.141592653589793
Related functions
degrees function
pi function
raise_error function
7/21/2022 • 2 minutes to read
Syntax
raise_error(expr)
Arguments
expr : A STRING expression.
Returns
The NULL type.
The function raises a runtime error with expr as the error message.
Examples
> SELECT raise_error('custom error message');
custom error message
Related functions
assert_true function
rand function
7/21/2022 • 2 minutes to read
Syntax
rand( [seed] )
Arguments
seed : An optional INTEGER literal.
Returns
A DOUBLE.
The function generates pseudo-random, independent and identically distributed (i.i.d.) values drawn uniformly
from [0, 1).
This function is non-deterministic.
rand is a synonym for random function.
Examples
> SELECT rand();
0.9629742951434543
> SELECT rand(0);
0.8446490682263027
> SELECT rand(null);
0.8446490682263027
Related functions
randn function
random function
randn function
7/21/2022 • 2 minutes to read
Syntax
randn( [seed] )
Arguments
seed : An optional INTEGER literal.
Returns
A DOUBLE.
The function generates pseudo-random, independent and identically distributed (i.i.d.) values drawn from the
standard normal distribution.
This function is non-deterministic.
Examples
> SELECT randn();
-0.3254147983080288
> SELECT randn(0);
1.1164209726833079
> SELECT randn(null);
1.1164209726833079
Related functions
rand function
random function
random function
7/21/2022 • 2 minutes to read
Syntax
random( [seed] )
Arguments
seed : An optional INTEGER literal.
Returns
A DOUBLE.
The function generates pseudo-random, independent and identically distributed (i.i.d.) values drawn uniformly
from [0, 1) .
This function is non-deterministic.
random is a synonym for the rand function.
Examples
> SELECT rand();
0.9629742951434543
> SELECT rand(0);
0.8446490682263027
> SELECT rand(null);
0.8446490682263027
Related functions
randn function
rand function
range table-valued function
7/21/2022 • 2 minutes to read
Syntax
range(end)
range(start, end [, step [, numParts] ] )
Arguments
start : An optional BIGINT literal defaulted to 0, marking the first value generated.
end : A BIGINT literal marking the endpoint (exclusive) of the number generation.
step : An optional BIGINT literal defaulted to 1, specifying the increment used when generating values.
numParts : An optional INTEGER literal specifying how the production of rows is spread across partitions.
Returns
A table with a single BIGINT column named id .
Examples
> SELECT spark_partition_id(), t.* FROM range(5) AS t;
3 0
6 1
9 2
12 3
15 4
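An additional example of the multi-argument form, where start and step control the generated values (output shown is the expected result):
> SELECT * FROM range(1, 10, 3);
1
4
7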
Related functions
sequence function
rank ranking window function
7/21/2022 • 2 minutes to read
Syntax
rank()
Arguments
This function takes no arguments.
Returns
An INTEGER.
The OVER clause of the window function must include an ORDER BY clause.
Unlike the function dense_rank , rank will produce gaps in the ranking sequence. Unlike row_number , rank does
not break ties.
If the order is not unique, the duplicates share the same relative earlier position.
Examples
> SELECT a,
b,
dense_rank() OVER(PARTITION BY a ORDER BY b),
rank() OVER(PARTITION BY a ORDER BY b),
row_number() OVER(PARTITION BY a ORDER BY b)
FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
A1 1 1 1 1
A1 1 1 1 2
A1 2 2 3 3
A2 3 1 1 1
Related functions
dense_rank ranking window function
row_number ranking window function
cume_dist analytic window function
Window functions
reflect function
7/21/2022 • 2 minutes to read
Syntax
reflect(class, method [, arg1] [, ...])
Arguments
class : A STRING literal specifying the java class.
method : A STRING literal specifying the java method.
argN : An expression with a type appropriate for the selected method.
Returns
A STRING.
Examples
> SELECT reflect('java.util.UUID', 'randomUUID');
c33fb387-8500-4bfa-81d2-6e0e3e930df2
> SELECT reflect('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
a5cf6c42-0c85-418f-af6c-3e4e5b1328f2
Related functions
java_method function
regexp operator
7/21/2022 • 2 minutes to read
Syntax
str [NOT] regexp regex
Arguments
str : A STRING expression to be matched.
regex : A STRING expression with a matching pattern.
Returns
A BOOLEAN.
The regex string must be a Java regular expression. String literals are unescaped. For example, to match
'\abc' , a regular expression for regex can be '^\\abc$' .
Examples
> SELECT '%SystemDrive%\\Users\\John' regexp '%SystemDrive%\\\\Users.*';
true
Related functions
ilike operator
like operator
regexp_extract_all function
regexp_replace function
split function
rlike operator
regexp_extract function
7/21/2022 • 2 minutes to read
Extracts the first string in str that matches the regexp expression and corresponds to the regex group index.
Syntax
regexp_extract(str, regexp [, idx] )
Arguments
str : A STRING expression to be matched.
regexp : A STRING expression with a matching pattern.
idx : An optional integral numeric expression greater than or equal to 0. The default is 1.
Returns
A STRING.
The regexp string must be a Java regular expression. String literals are unescaped. For example, to match
'\abc' , a regular expression for regexp can be '^\\abc$' . regexp may contain multiple groups. idx
indicates which regex group to extract. An idx of 0 means matching the entire regular expression.
Examples
> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1);
100
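Additional examples showing how idx selects the group; an idx of 0 returns the entire match (output shown is the expected result):
> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 0);
100-200
> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 2);
200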
Related functions
ilike operator
like operator
regexp_extract_all function
regexp_replace function
split function
regexp operator
rlike operator
regexp_extract_all function
7/21/2022 • 2 minutes to read
Extracts all strings in str that match the regexp expression and correspond to the regex group index.
Syntax
regexp_extract_all(str, regexp [, idx] )
Arguments
str : A STRING expression to be matched.
regexp : A STRING expression with a matching pattern.
idx : An optional integral numeric expression greater than or equal to 0. The default is 1.
Returns
An ARRAY of STRING.
The regexp string must be a Java regular expression. String literals are unescaped. For example, to match
'\abc' , a regular expression for regexp can be '^\\abc$' . regexp may contain multiple groups. idx
indicates which regex group to extract. An idx of 0 means match the entire regular expression.
Examples
> SELECT regexp_extract_all('100-200, 300-400', '(\\d+)-(\\d+)', 1);
[100, 300]
Related functions
ilike operator
like operator
regexp_extract function
regexp_replace function
split function
regexp_like function
7/21/2022 • 2 minutes to read
Syntax
regexp_like( str, regex )
Arguments
str : A STRING expression to be matched.
regex : A STRING expression with a matching pattern.
Returns
A BOOLEAN.
The regex string must be a Java regular expression. String literals are unescaped. For example, to match
'\abc' , a regular expression for regex can be '^\\abc$' .
Examples
> SELECT regexp_like('%SystemDrive%\\Users\\John', '%SystemDrive%\\\\Users.*');
true
Related functions
ilike operator
like operator
regexp operator
regexp_extract_all function
regexp_replace function
rlike operator
split function
regexp_replace function
7/21/2022 • 2 minutes to read
Syntax
regexp_replace(str, regexp, rep [, position] )
Arguments
str : A STRING expression to be matched.
regexp : A STRING expression with a matching pattern.
rep : A STRING expression which is the replacement string.
position : An optional integral numeric literal greater than 0, stating where to start matching. The default is 1.
Returns
A STRING.
The regexp string must be a Java regular expression. String literals are unescaped. For example, to match
'\abc' , a regular expression for regexp can be '^\\abc$' . Searching starts at position . The default is 1, which
marks the beginning of str . If position exceeds the character length of str , the result is str .
Examples
> SELECT regexp_replace('100-200', '(\\d+)', 'num');
num-num
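An additional example of the position argument; matching starts at character 5, so the first number is left unchanged (output shown is the expected result):
> SELECT regexp_replace('100-200', '(\\d+)', 'num', 5);
100-num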
Related functions
ilike operator
like operator
regexp_extract function
regexp_extract_all function
regr_avgx aggregate function
7/21/2022 • 2 minutes to read
Returns the mean of xExpr calculated from values of a group where xExpr and yExpr are NOT NULL.
Since: Databricks Runtime 10.5
Syntax
regr_avgx( [ALL | DISTINCT] yExpr, xExpr) [FILTER ( WHERE cond ) ]
Arguments
yExpr : A numeric expression, the dependent variable.
xExpr : A numeric expression, the independent variable.
cond : An optional boolean expression filtering the rows used for the function.
Returns
The result type depends on the type of xExpr :
DECIMAL(p, s) : The result type is a DECIMAL(p + 4, s + 4) . If the maximum precision for DECIMAL is reached
the increase in scale will be limited to avoid loss of significant digits.
Otherwise the result is a DOUBLE.
Any nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified the average is computed after duplicates have been removed.
regr_avgx(y, x) is a synonym for avg(x) FILTER(WHERE x IS NOT NULL AND y IS NOT NULL) .
Examples
> SELECT regr_avgx(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
2.6666666666666665
Related functions
avg aggregate function
regr_avgx aggregate function
regr_avgy aggregate function
regr_count aggregate function
regr_sxx aggregate function
regr_sxy aggregate function
regr_syy aggregate function
regr_avgy aggregate function
7/21/2022 • 2 minutes to read
Returns the mean of yExpr calculated from values of a group where xExpr and yExpr are NOT NULL.
Since: Databricks Runtime 10.5
Syntax
regr_avgy( [ALL | DISTINCT] yExpr, xExpr) [FILTER ( WHERE cond ) ]
Arguments
yExpr : A numeric expression, the dependent variable.
xExpr : A numeric expression, the independent variable.
cond : An optional boolean expression filtering the rows used for the function.
Returns
The result type depends on the type of yExpr :
DECIMAL(p, s) : The result type is a DECIMAL(p + 4, s + 4) . If the maximum precision for DECIMAL is reached
the increase in scale will be limited to avoid loss of significant digits.
Otherwise the result is a DOUBLE.
Any nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified the average is computed after duplicates have been removed.
regr_avgy(y, x) is a synonym for avg(y) FILTER(WHERE x IS NOT NULL AND y IS NOT NULL) .
Examples
> SELECT regr_avgy(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
1.6666666666666667
Related functions
avg aggregate function
regr_avgy aggregate function
regr_count aggregate function
regr_sxx aggregate function
regr_sxy aggregate function
regr_syy aggregate function
regr_count aggregate function
7/21/2022 • 2 minutes to read
Returns the number of non-null value pairs yExpr , xExpr in the group.
Since: Databricks Runtime 10.5
Syntax
regr_count ( [ALL | DISTINCT] yExpr, xExpr ) [FILTER ( WHERE cond ) ]
Arguments
yExpr : A numeric expression, the dependent variable.
xExpr : A numeric expression, the independent variable.
cond : An optional boolean expression filtering the rows used for the function.
Returns
A BIGINT.
regr_count(yExpr, xExpr) is equivalent to count_if(yExpr IS NOT NULL AND xExpr IS NOT NULL) .
If DISTINCT is specified only unique rows are counted.
Examples
> SELECT regr_count(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS t(y, x);
4
> SELECT regr_count(y, x) FROM VALUES (1, 2), (2, NULL), (2, 3), (2, 4) AS t(y, x);
3
> SELECT regr_count(y, x) FROM VALUES (1, 2), (2, NULL), (NULL, 3), (2, 4) AS t(y, x);
2
Related functions
avg aggregate function
count aggregate function
count_if aggregate function
min aggregate function
max aggregate function
regr_avgx aggregate function
regr_avgy aggregate function
regr_sxx aggregate function
regr_sxy aggregate function
regr_syy aggregate function
sum aggregate function
regr_r2 aggregate function
7/21/2022 • 2 minutes to read
Returns the coefficient of determination from values of a group where xExpr and yExpr are NOT NULL.
Since: Databricks Runtime 11.0
Syntax
regr_r2( [ALL | DISTINCT] yExpr, xExpr) [FILTER ( WHERE cond ) ]
Arguments
yExpr : A numeric expression, the dependent variable.
xExpr : A numeric expression, the independent variable.
cond : An optional boolean expression filtering the rows used for the function.
Returns
A DOUBLE.
Any nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified, the average is computed after duplicates are removed.
Examples
> SELECT regr_r2(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
1
Related functions
avg aggregate function
regr_avgx aggregate function
regr_avgy aggregate function
regr_count aggregate function
regr_sxx aggregate function
7/21/2022 • 2 minutes to read
Returns the sum of squares of the xExpr values of a group where xExpr and yExpr are NOT NULL.
Since: Databricks Runtime 11.0
Syntax
regr_sxx( [ALL | DISTINCT] yExpr, xExpr) [FILTER ( WHERE cond ) ]
Arguments
yExpr : A numeric expression, the dependent variable .
xExpr : A numeric expression, the independent variable .
cond : An optional boolean expression filtering the rows used for the function.
Returns
The result type is DOUBLE.
Any nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified, the result is computed after duplicates are removed.
regr_sxx(y, x) is a synonym for regr_count(y, x) * var_pop(x) .
Examples
> SELECT regr_sxx(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
0.6666666666666666
Related functions
avg aggregate function
regr_avgx aggregate function
regr_avgy aggregate function
regr_count aggregate function
regr_sxy aggregate function
7/21/2022 • 2 minutes to read
Returns the sum of products of yExpr and xExpr calculated from values of a group where xExpr and yExpr
are NOT NULL.
Since: Databricks Runtime 11.0
Syntax
regr_sxy( [ALL | DISTINCT] yExpr, xExpr) [FILTER ( WHERE cond ) ]
Arguments
yExpr : A numeric expression, the dependent variable .
xExpr : A numeric expression, the independent variable .
cond : An optional boolean expression filtering the rows used for the function.
Returns
The result type is a DOUBLE.
Any nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified, the result is computed after duplicates are removed.
regr_sxy(y, x) is a synonym for regr_count(y, x) * covar_pop(y, x) .
Examples
> SELECT regr_sxy(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
0.6666666666666666
Related functions
covar_pop aggregate function
regr_avgx aggregate function
regr_avgy aggregate function
regr_count aggregate function
regr_syy aggregate function
7/21/2022 • 2 minutes to read
Returns the sum of squares of the yExpr values of a group where xExpr and yExpr are NOT NULL.
Since: Databricks Runtime 11.0
Syntax
regr_syy( [ALL | DISTINCT] yExpr, xExpr) [FILTER ( WHERE cond ) ]
Arguments
yExpr : A numeric expression, the dependent variable.
xExpr : A numeric expression, the independent variable.
cond : An optional boolean expression filtering the rows used for the function.
Returns
The result type is DOUBLE.
Any nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified, the result is computed after duplicates are removed.
regr_syy(y, x) is a synonym for regr_count(y, x) * var_pop(y) .
Examples
> SELECT regr_syy(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
0.6666666666666666
Related functions
avg aggregate function
regr_avgx aggregate function
regr_avgy aggregate function
regr_count aggregate function
repeat function
7/21/2022 • 2 minutes to read
Syntax
repeat(expr, n)
Arguments
expr : A STRING expression.
n : An INTEGER expression.
Returns
A STRING.
If n is less than 1, an empty string.
Examples
> SELECT repeat('123', 2);
123123
Related functions
space function
replace function
7/21/2022 • 2 minutes to read
Syntax
replace(str, search [, replace] )
Arguments
str : A STRING expression to be searched.
search : A STRING expression to be replaced.
replace : An optional STRING expression to replace search with. The default is an empty string.
Returns
A STRING.
If you do not specify replace , or if replace is an empty string, the matched search string is removed from str with no replacement.
Examples
> SELECT replace('ABCabc', 'abc', 'DEF');
ABCDEF
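An additional example of omitting replace , which removes the matched string (output shown is the expected result):
> SELECT replace('ABCabc', 'abc');
ABC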
Related functions
overlay function
signum function
regexp_replace function
translate function
reverse function
7/21/2022 • 2 minutes to read
Syntax
reverse(expr)
Arguments
expr : A STRING or ARRAY expression.
Returns
The result type matches the type of expr .
Examples
> SELECT reverse('Spark SQL');
LQS krapS
> SELECT reverse(array(2, 1, 4, 3));
[3,4,1,2]
Related functions
right function
7/21/2022 • 2 minutes to read
Syntax
right(str, len)
Arguments
str : A STRING expression.
len : An integral number expression.
Returns
A STRING.
If len is less than or equal to 0, the result is an empty string.
Examples
> SELECT right('Spark SQL', 3);
SQL
Related functions
substr function
substring function
left function
rint function
7/21/2022 • 2 minutes to read
Syntax
rint(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
This function is a synonym for round(expr, 0) .
Examples
> SELECT rint(12.3456);
12.0
Related functions
ceil function
ceiling function
floor function
round function
rlike operator
7/21/2022 • 2 minutes to read
Syntax
str [NOT] rlike regex
Arguments
str : A STRING expression to be matched.
regex : A STRING expression with a matching pattern.
Returns
A BOOLEAN.
The regex string must be a Java regular expression. String literals are unescaped. For example, to match
'\abc' , a regular expression for regex can be '^\\abc$' .
Examples
> SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.*';
true
Related functions
ilike operator
like operator
regexp operator
regexp_extract_all function
regexp_like function
regexp_replace function
split function
round function
7/21/2022 • 2 minutes to read
Syntax
round(expr [,targetScale] )
Arguments
expr : A numeric expression.
targetScale : An INTEGER expression greater than or equal to 0. If targetScale is omitted the default is 0.
Returns
If expr is DECIMAL the result is DECIMAL with a scale that is the smaller of expr scale and targetScale .
In HALF_UP rounding, the digit 5 is rounded up.
Examples
> SELECT round(2.5, 0);
3
> SELECT round(2.6, 0);
3
> SELECT round(3.5, 0);
4
> SELECT round(2.25, 1);
2.3
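For contrast, bround applies HALF_EVEN rounding instead of HALF_UP (output shown is the expected result):
> SELECT bround(2.5, 0);
2
> SELECT bround(2.25, 1);
2.2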
Related functions
floor function
ceiling function
ceil function
bround function
row_number ranking window function
7/21/2022 • 2 minutes to read
Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the
window partition.
Syntax
row_number()
Arguments
The function takes no arguments.
Returns
An INTEGER.
The OVER clause of the window function must include an ORDER BY clause. Unlike rank and dense_rank ,
row_number breaks ties.
Examples
> SELECT a,
b,
dense_rank() OVER(PARTITION BY a ORDER BY b),
rank() OVER(PARTITION BY a ORDER BY b),
row_number() OVER(PARTITION BY a ORDER BY b)
FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
A1 1 1 1 1
A1 1 1 1 2
A1 2 2 3 3
A2 3 1 1 1
Related functions
rank ranking window function
dense_rank ranking window function
cume_dist analytic window function
Window functions
rpad function
7/21/2022 • 2 minutes to read
Syntax
rpad(expr, len [, pad] )
Arguments
expr : A STRING or BINARY expression to be padded.
len : An INTEGER expression.
pad : An optional STRING or BINARY expression with the pattern for padding. The default is a space
character for STRING and x’00’ for BINARY.
Returns
A BINARY if both expr and pad are BINARY. Otherwise, returns a STRING.
If expr is longer than len , the return value is shortened to len characters. If you do not specify pad , a
STRING expr is padded to the right with space characters, whereas a BINARY expr is padded to the right with
x’00’ bytes. If len is less than 1, an empty string.
BINARY is supported since: Databricks Runtime 11.0.
Examples
> SELECT rpad('hi', 5, 'ab');
hiaba
> SELECT rpad('hi', 1, '??');
h
> SELECT rpad('hi', 5);
hi
Related functions
lpad function
ltrim function
rtrim function
trim function
rtrim function
7/21/2022 • 2 minutes to read
Syntax
rtrim( [trimStr ,] str)
Arguments
trimStr : An optional STRING expression with characters to be trimmed. The default is a space character.
str : A STRING expression to be trimmed.
Returns
A STRING.
The function removes any trailing characters within trimStr from str .
Examples
> SELECT rtrim('SparkSQL ') || '+';
SparkSQL+
> SELECT rtrim('ab', 'SparkSQLabcaaba');
SparkSQLabc
Related functions
btrim function
lpad function
ltrim function
rpad function
trim function
schema_of_csv function
7/21/2022 • 2 minutes to read
Syntax
schema_of_csv(csv [, options] )
Arguments
csv : A STRING literal with valid CSV data.
options : An optional MAP literal where keys and values are STRING.
Returns
A STRING describing a struct. The field names are derived by position as _cN . The values hold the derived
formatted SQL types. For details on options see from_csv function.
Examples
> DESCRIBE SELECT schema_of_csv('1,abc');
STRUCT<`_c0`: INT, `_c1`: STRING>
Related functions
from_csv function
schema_of_json function
7/21/2022 • 2 minutes to read
Syntax
schema_of_json(json [, options] )
Arguments
json : A STRING literal with JSON.
options : An optional MAP literal with keys and values being STRING.
Returns
A STRING holding a definition of an array of structs with n fields of strings where the column names are
derived from the JSON keys. The field values hold the derived formatted SQL types. For details on options, see
from_json function.
Examples
> SELECT schema_of_json('[{"col":0}]');
ARRAY<STRUCT<`col`: BIGINT>>
> SELECT schema_of_json('[{"col":01}]', map('allowNumericLeadingZeros', 'true'));
ARRAY<STRUCT<`col`: BIGINT>>
Related functions
from_json function
sec function
7/21/2022 • 2 minutes to read
Syntax
sec(expr)
Arguments
expr : An expression that evaluates to a numeric expressing the angle in radians.
Returns
A DOUBLE.
sec(expr) is equivalent to 1 / cos(expr)
Examples
> SELECT sec(pi());
-1.0
Related functions
acos function
cos function
cosh function
csc function
sin function
tan function
second function
7/21/2022 • 2 minutes to read
Syntax
second(expr)
Arguments
expr : A TIMESTAMP expression.
Returns
An INTEGER.
This function is equivalent to int(extract(SECOND FROM timestamp)) .
Examples
> SELECT second('2009-07-30 12:58:59');
59
Related functions
extract function
hour function
minute function
sentences function
7/21/2022 • 2 minutes to read
Syntax
sentences(str [, lang, country] )
Arguments
str : A STRING expression to be parsed.
lang : An optional STRING expression with a language code from ISO 639 Alpha-2 (e.g. ‘DE’) , Alpha-3, or a
language subtag of up to 8 characters.
country : An optional STRING expression with a country code from ISO 3166 alpha-2 country code or a UN
M.49 numeric-3 area code.
Returns
An ARRAY of ARRAY of STRING.
The default for lang is en and country US .
Examples
> SELECT sentences('Hi there! Good morning.');
[[Hi, there],[Good, morning]]
> SELECT sentences('Hi there! Good morning.', 'en', 'US');
[[Hi, there],[Good, morning]]
Related functions
split function
sequence function
7/21/2022 • 2 minutes to read
Syntax
sequence(start, stop [, step] )
Arguments
start : An expression of an integral numeric type, DATE, or TIMESTAMP.
stop : An integral numeric if start is numeric; otherwise, a DATE or TIMESTAMP.
step : An INTERVAL expression if start is a DATE or TIMESTAMP, or an integral numeric otherwise.
Returns
An ARRAY of least common type of start and stop .
By default step is 1 if start is less than or equal to stop , otherwise -1.
For DATE or TIMESTAMP sequences, the default step is INTERVAL ‘1’ DAY or INTERVAL ‘-1’ DAY, respectively.
If start is greater than stop then step must be negative, and vice versa.
Examples
> SELECT sequence(1, 5);
[1,2,3,4,5]
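An additional example of a DATE sequence with an INTERVAL step (output shown is the expected result):
> SELECT sequence(DATE'2021-01-01', DATE'2021-04-01', INTERVAL '1' MONTH);
[2021-01-01,2021-02-01,2021-03-01,2021-04-01]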
Related functions
repeat function
sha function
7/21/2022 • 2 minutes to read
Syntax
sha(expr)
Arguments
expr : A BINARY or STRING expression.
Returns
A STRING.
This function is a synonym for sha1 function.
Examples
> SELECT sha('Spark');
85f5955f4b27a9a4c2aab6ffe5d7189fc298b92c
Related functions
sha1 function
sha2 function
crc32 function
md5 function
hash function
sha1 function
7/21/2022 • 2 minutes to read
Syntax
sha1(expr)
Arguments
expr : A BINARY or STRING expression.
Returns
A STRING.
This function is a synonym for sha function.
Examples
> SELECT sha1('Spark');
85f5955f4b27a9a4c2aab6ffe5d7189fc298b92c
Related functions
sha function
sha2 function
crc32 function
md5 function
hash function
sha2 function
7/21/2022 • 2 minutes to read
Syntax
sha2(expr, bitLength)
Arguments
expr : A BINARY or STRING expression.
bitLength : An INTEGER expression.
Returns
A STRING.
bitLength can be 0, 224 , 256 , 384 , or 512 . bitLength 0 is equivalent to 256 .
Examples
> SELECT sha2('Spark', 256);
529bc3b07127ecb7e53a4dcf1991d9152c24537d919178022b2c42657f79a26b
Related functions
sha function
sha1 function
crc32 function
md5 function
hash function
shiftleft function
7/21/2022 • 2 minutes to read
Syntax
shiftleft(expr, n)
Arguments
expr : An INTEGER or BIGINT expression.
n : An INTEGER expression.
Returns
The result matches the type of expr .
If n is less than 0 the result is 0.
Examples
> SELECT shiftleft(2, 1);
4
Related functions
shiftright function
shiftrightunsigned function
shiftright function
7/21/2022 • 2 minutes to read
Syntax
shiftright(expr, n)
Arguments
expr : An INTEGER or BIGINT expression.
n : An INTEGER expression specifying the number of bits to shift.
Returns
The result type matches expr .
When expr is negative (that is, the highest order bit is set) the result remains negative because the highest
order bit is sticky. When n is negative the result is 0.
Examples
> SELECT shiftright(4, 1);
2
> SELECT shiftright(-4, 1);
-2
Related functions
shiftleft function
shiftrightunsigned function
shiftrightunsigned function
7/21/2022 • 2 minutes to read
Syntax
shiftrightunsigned(expr, n)
Arguments
expr : An INTEGER or BIGINT expression.
n : An INTEGER expression specifying the number of bits to shift.
Returns
The result type matches expr .
When n is negative the result is 0.
Examples
> SELECT shiftrightunsigned(4, 1);
2
> SELECT shiftrightunsigned(-4, 1);
2147483646
Related functions
shiftleft function
shiftright function
shuffle function
7/21/2022 • 2 minutes to read
Syntax
shuffle(expr)
Arguments
expr : An ARRAY expression.
Returns
The result type matches the type of expr .
This function is non-deterministic.
Examples
> SELECT shuffle(array(1, 20, 3, 5));
[3,1,5,20]
> SELECT shuffle(array(1, 20, NULL, 3));
[20,NULL,3,1]
Related functions
array_sort function
sort_array function
sign function
7/21/2022 • 2 minutes to read
Syntax
sign(expr)
Arguments
expr : An expression that evaluates to a numeric or interval.
Returns
A DOUBLE.
This function is a synonym for signum function.
Examples
> SELECT sign(40);
1.0
Related functions
abs function
signum function
negative function
positive function
signum function
7/21/2022 • 2 minutes to read
Syntax
signum(expr)
Arguments
expr : An expression that evaluates to a numeric or interval.
Returns
A DOUBLE.
This function is a synonym for sign function.
Examples
> SELECT signum(40);
1.0
Related functions
abs function
sign function
negative function
positive function
sin function
7/21/2022 • 2 minutes to read
Syntax
sin(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
Examples
> SELECT sin(0);
0.0
Related functions
cos function
sinh function
tan function
sinh function
7/21/2022 • 2 minutes to read
Syntax
sinh(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
Examples
> SELECT sinh(0);
0.0
Related functions
cos function
cosh function
sin function
tanh function
tan function
size function
7/21/2022 • 2 minutes to read
Syntax
size(expr)
Arguments
expr : An ARRAY or MAP expression.
Returns
An INTEGER.
NOTE
If spark.sql.ansi.enabled is false size(NULL) returns -1 instead of NULL .
Examples
> SELECT size(array('b', 'd', 'c', 'a'));
4
> SELECT size(map('a', 1, 'b', 2));
2
> SELECT size(NULL);
NULL
Related functions
length function
skewness aggregate function
7/21/2022 • 2 minutes to read
Syntax
skewness ( [ALL | DISTINCT ] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
Examples
> SELECT skewness(col) FROM VALUES (-10), (-20), (100), (1000), (1000) AS tab(col);
0.3853941073355022
> SELECT skewness(DISTINCT col) FROM VALUES (-10), (-20), (100), (1000), (1000) AS tab(col);
1.1135657469022011
> SELECT skewness(col) FROM VALUES (-1000), (-100), (10), (20) AS tab(col);
-1.1135657469022011
Related functions
kurtosis aggregate function
/ (slash sign) operator
7/21/2022 • 2 minutes to read
Syntax
dividend / divisor
Arguments
dividend : A numeric or INTERVAL expression.
divisor : A numeric expression.
Returns
If both dividend and divisor are DECIMAL, the result is DECIMAL.
If dividend is a year-month interval, the result is an INTERVAL YEAR TO MONTH .
If dividend is a day-time interval, the result is an INTERVAL DAY TO SECOND .
In all other cases, a DOUBLE.
If the divisor is 0, the operator returns a DIVIDE_BY_ZERO error.
Use try_divide to return NULL on division-by-zero.
NOTE
If spark.sql.ansi.enabled is false the operator returns NULL instead of an error on division by 0.
Examples
> SELECT 3 / 2;
1.5
> SELECT 3 / 0;
Error: DIVIDE_BY_ZERO
Related functions
* (asterisk sign) operator
div operator
- (minus sign) operator
+ (plus sign) operator
sum aggregate function
try_divide function
slice function
7/21/2022 • 2 minutes to read
Syntax
slice(expr, start, length)
Arguments
expr : An ARRAY expression.
start : An INTEGER expression.
length : An INTEGER expression that is greater than or equal to 0.
Returns
The result is of the type of expr .
The function subsets array expr starting from index start (array indices start at 1), or starting from the end if
start is negative, with the specified length . If the requested array slice does not overlap with the actual length
of the array, an empty array is returned.
Examples
> SELECT slice(array(1, 2, 3, 4), 2, 2);
[2,3]
> SELECT slice(array(1, 2, 3, 4), -2, 2);
[3,4]
Related functions
array function
smallint function
7/21/2022 • 2 minutes to read
Syntax
smallint(expr)
Arguments
expr : Any expression which is castable to SMALLINT.
Returns
The result is SMALLINT.
This function is a synonym for CAST(expr AS SMALLINT) .
See cast function for details.
Examples
> SELECT smallint(-5.6);
-5
> SELECT smallint('5');
5
Related functions
cast function
some aggregate function
7/21/2022 • 2 minutes to read
Syntax
some(expr) [FILTER ( WHERE cond ) ]
Arguments
expr : A BOOLEAN expression.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A BOOLEAN.
Examples
> SELECT some(col) FROM VALUES (true), (false), (false) AS tab(col);
true
> SELECT some(col) FROM VALUES (NULL), (true), (false) AS tab(col);
true
> SELECT some(col) FROM VALUES (false), (false), (NULL) AS tab(col);
false
Related functions
bool_and aggregate function
bool_or aggregate function
every aggregate function
sort_array function
7/21/2022 • 2 minutes to read
Syntax
sort_array(expr [, ascendingOrder] )
Arguments
expr : An ARRAY expression of sortable elements.
ascendingOrder : An optional BOOLEAN expression defaulting to true .
Returns
The result type matches expr .
Sorts the input array in ascending or descending order according to the natural ordering of the array elements.
NULL elements are placed at the beginning of the returned array in ascending order or at the end of the
returned array in descending order.
Examples
> SELECT sort_array(array('b', 'd', NULL, 'c', 'a'), true);
[NULL,a,b,c,d]
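An additional example of descending order, where NULL elements are placed at the end (output shown is the expected result):
> SELECT sort_array(array('b', 'd', NULL, 'c', 'a'), false);
[d,c,b,a,NULL]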
Related functions
array_sort function
soundex function
7/21/2022 • 2 minutes to read
Syntax
soundex(expr)
Arguments
expr : A STRING expression.
Returns
A STRING.
Examples
> SELECT soundex('Miller');
M460
Related functions
space function
7/21/2022 • 2 minutes to read
Syntax
space(n)
Arguments
n : An INTEGER expression.
Returns
A STRING.
If n is less than or equal to 0 an empty string.
Examples
> SELECT concat('1', space(2), '1');
1 1
Related functions
repeat function
spark_partition_id function
7/21/2022 • 2 minutes to read
Syntax
spark_partition_id()
Arguments
The function takes no arguments.
Returns
An INTEGER.
Examples
> SELECT spark_partition_id();
0
Related functions
split function
7/21/2022 • 2 minutes to read
Splits str around occurrences that match regex and returns an array with a length of at most limit .
Syntax
split(str, regex [, limit] )
Arguments
str : A STRING expression to be split.
regex : A STRING expression that is a Java regular expression used to split str .
limit : An optional INTEGER expression defaulting to 0 (no limit).
Returns
An ARRAY of STRING.
If limit > 0: The resulting array’s length will not be more than limit, and the resulting array’s last entry will
contain all input beyond the last matched regex .
If limit <= 0: regex will be applied as many times as possible, and the resulting array can be of any size.
Examples
> SELECT split('oneAtwoBthreeC', '[ABC]');
[one,two,three,]
> SELECT split('oneAtwoBthreeC', '[ABC]', -1);
[one,two,three,]
> SELECT split('oneAtwoBthreeC', '[ABC]', 2);
[one,twoBthreeC]
Related functions
regexp_extract function
regexp_extract_all function
split_part function
split_part function
7/21/2022 • 2 minutes to read
Splits str around occurrences of delim and returns the partNum part.
Since: Databricks Runtime 11.0
Syntax
split_part(str, delim, partNum)
Arguments
str : A STRING expression to be split.
delim : A STRING expression serving as delimiter for the parts.
partNum : An INTEGER expression electing the part to be returned.
Returns
A STRING.
If partNum >= 1: The partNum th part, counting from the beginning of str , is returned.
If partNum <= -1: The abs(partNum) th part, counting from the end of str , is returned.
partNum must not be 0. split_part returns an empty string if partNum is beyond the number of parts in str .
Examples
> SELECT '->' || split_part('Hello,world,!', ',', 1) || '<-';
->Hello<-
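An additional example of a negative partNum , which counts parts from the end (output shown is the expected result):
> SELECT split_part('Hello,world,!', ',', -1);
!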
sqrt function
7/21/2022 • 2 minutes to read
Syntax
sqrt(expr)
Arguments
expr : An expression that evaluates to a numeric.
Returns
A DOUBLE.
If expr is negative the result is NaN.
Examples
> SELECT sqrt(4);
2.0
Related functions
cbrt function
stack table-valued generator function
7/21/2022 • 2 minutes to read
Syntax
stack(numRows, expr1 [, ...] )
Arguments
numRows : An INTEGER literal greater than 0 specifying the number of rows produced.
exprN : An expression of any type. The type of any exprN must match the type of expr(N+numRows) .
Returns
A set of numRows rows which includes all other columns in the select list and max(1, (N/numRows)) columns
produced by this function. An incomplete row is padded with NULL .
By default the produced columns are named col0, … col(n-1) . The column aliases can be specified using for
example, AS (myCol1, .. myColn) .
You can place stack only in the select list or a LATERAL VIEW. When placing the function in the select list there
must be no other generator function in the same select list.
Examples
> SELECT 'hello', stack(2, 1, 2, 3) AS (first, second), 'world';
-- hello 1 2 world
-- hello 3 NULL world
Related functions
explode table-valued generator function
explode_outer table-valued generator function
inline table-valued generator function
inline_outer table-valued generator function
posexplode_outer table-valued generator function
posexplode table-valued generator function
startswith function
7/21/2022 • 2 minutes to read
Syntax
startswith(expr, startExpr)
Arguments
expr : A STRING expression.
startExpr : A STRING expression which is compared to the start of expr .
Returns
A BOOLEAN.
If expr or startExpr is NULL , the result is NULL .
If startExpr is the empty string or empty binary the result is true .
Since: Databricks Runtime 10.5
The function operates in BINARY mode if both arguments are BINARY.
Examples
> SELECT startswith('SparkSQL', 'Spark');
true
Related functions
contains function
endswith function
substr function
std aggregate function
7/21/2022 • 2 minutes to read
Returns the sample standard deviation calculated from the values within the group.
Syntax
std ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
This function is a synonym for stddev aggregate function.
Examples
> SELECT std(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.9574271077563381
> SELECT std(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
1.0
Related functions
stddev aggregate function
stddev_pop aggregate function
stddev_samp aggregate function
stddev aggregate function
7/21/2022 • 2 minutes to read
Returns the sample standard deviation calculated from the values within the group.
Syntax
stddev ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
This function is a synonym for std aggregate function.
Examples
> SELECT stddev(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.9574271077563381
> SELECT stddev(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
1.0
Related functions
std aggregate function
stddev_pop aggregate function
stddev_samp aggregate function
stddev_pop aggregate function
7/21/2022 • 2 minutes to read
Syntax
stddev_pop ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
Examples
> SELECT stddev_pop(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.82915619758885
> SELECT stddev_pop(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.816496580927726
Related functions
std aggregate function
stddev aggregate function
stddev_samp aggregate function
stddev_samp aggregate function
7/21/2022 • 2 minutes to read
Syntax
stddev_samp ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
Examples
> SELECT stddev_samp(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.9574271077563381
> SELECT stddev_samp(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
1.0
Related functions
std aggregate function
stddev aggregate function
stddev_pop aggregate function
str_to_map function
7/21/2022 • 2 minutes to read
Creates a map after splitting the input into key-value pairs using delimiters.
Syntax
str_to_map(expr [, pairDelim [, keyValueDelim] ] )
Arguments
expr : A STRING expression.
pairDelim : An optional STRING literal defaulting to ',' that specifies how to split entries.
keyValueDelim : An optional STRING literal defaulting to ':' that specifies how to split each key-value pair.
Returns
A MAP of STRING for both keys and values.
Both pairDelim and keyValueDelim are treated as regular expressions.
Examples
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');
{a -> 1, b -> 2, c -> 3}
> SELECT str_to_map('a');
{a->NULL}
Related functions
map function
string function
7/21/2022 • 2 minutes to read
Syntax
string(expr)
Arguments
expr : An expression that can be cast to STRING.
Returns
A STRING.
This function is a synonym for cast(expr AS STRING) .
Examples
> SELECT string(5);
5
> SELECT string(current_date);
2021-04-01
Related functions
cast function
struct function
7/21/2022 • 2 minutes to read
Syntax
struct(expr1 [, ...] )
Arguments
exprN : An expression of any type.
Returns
A struct with fieldN matching the type of exprN .
Fields are named colN .
Examples
> SELECT struct(1, 2, 3);
{1, 2, 3}
Related functions
named_struct function
map function
str_to_map function
substr function
7/21/2022 • 2 minutes to read
Returns the substring of expr that starts at pos and is of length len .
Syntax
substr(expr, pos [, len] )
substr(expr FROM pos [FOR len] )
Arguments
expr : A BINARY or STRING expression.
pos : An integral numeric expression specifying the starting position.
len : An optional integral numeric expression.
Returns
The result matches the type of expr .
pos is 1 based. If pos is negative the start is determined by counting characters (or bytes for BINARY) from the
end.
If len is less than 1 the result is empty.
If len is omitted the function returns all characters or bytes starting with pos .
This function is a synonym for substring function.
Examples
> SELECT substr('Spark SQL', 5);
k SQL
> SELECT substr('Spark SQL', -3);
SQL
> SELECT substr('Spark SQL', 5, 1);
k
> SELECT substr('Spark SQL' FROM 5);
k SQL
> SELECT substr('Spark SQL' FROM -3);
SQL
> SELECT substr('Spark SQL' FROM 5 FOR 1);
k
> SELECT substr('Spark SQL' FROM -10 FOR 5);
Spar
Related functions
substring function
substring function
7/21/2022 • 2 minutes to read
Returns the substring of expr that starts at pos and is of length len .
Syntax
substring(expr, pos [, len] )
substring(expr FROM pos [FOR len] )
Arguments
expr : A BINARY or STRING expression.
pos : An integral numeric expression specifying the starting position.
len : An optional integral numeric expression.
Returns
A STRING.
pos is 1 based. If pos is negative the start is determined by counting characters (or bytes for BINARY) from the
end.
If len is less than 1 the result is empty.
If len is omitted the function returns all characters or bytes starting with pos .
This function is a synonym for substr function.
Examples
> SELECT substring('Spark SQL', 5);
k SQL
> SELECT substring('Spark SQL', -3);
SQL
> SELECT substring('Spark SQL', 5, 1);
k
> SELECT substring('Spark SQL' FROM 5);
k SQL
> SELECT substring('Spark SQL' FROM -3);
SQL
> SELECT substring('Spark SQL' FROM 5 FOR 1);
k
> SELECT substring('Spark SQL' FROM -10 FOR 5);
Spar
Related functions
substr function
substring_index function
7/21/2022 • 2 minutes to read
Returns the substring of expr before count occurrences of the delimiter delim .
Syntax
substring_index(expr, delim, count)
Arguments
expr : A STRING or BINARY expression.
delim : An expression matching the type of expr specifying the delimiter.
count : An INTEGER expression to count the delimiters.
Returns
The result matches the type of expr .
If count is positive, everything to the left of the final delimiter (counting from the left) is returned.
If count is negative, everything to the right of the final delimiter (counting from the right) is returned.
Examples
> SELECT substring_index('www.apache.org', '.', 2);
www.apache
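An additional example of a negative count , which returns everything to the right of the final delimiter counted from the right (output shown is the expected result):
> SELECT substring_index('www.apache.org', '.', -2);
apache.org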
Related functions
substr function
substring function
sum aggregate function
7/21/2022 • 2 minutes to read
Syntax
sum ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric or interval.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
If expr is an integral number type, a BIGINT. If expr is DECIMAL(p, s) the result is DECIMAL(p + min(10, 31-p), s) .
If expr is an interval the result type matches expr .
Otherwise, a DOUBLE.
If DISTINCT is specified only unique values are summed up.
If the result overflows the result type Databricks Runtime raises an ARITHMETIC_OVERFLOW error. To return a
NULL instead use try_sum
WARNING
If spark.sql.ansi.enabled is false an overflow of BIGINT will not cause an error but “wrap” the result.
Examples
> SELECT sum(col) FROM VALUES (5), (10), (15) AS tab(col);
30
> SELECT sum(DISTINCT col) FROM VALUES (5), (10), (10), (15) AS tab(col);
30
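An additional example of the FILTER clause, which restricts the rows that are aggregated (output shown is the expected result):
> SELECT sum(col) FILTER (WHERE col > 5) FROM VALUES (5), (10), (15) AS tab(col);
25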
Related functions
aggregate function
avg aggregate function
max aggregate function
mean aggregate function
min aggregate function
try_avg aggregate function
try_sum aggregate function
tan function
7/21/2022 • 2 minutes to read
Syntax
tan(expr)
Arguments
expr : An expression that evaluates to a numeric expressing the angle in radians.
Returns
A DOUBLE.
Examples
> SELECT tan(0);
0.0
Related functions
tanh function
cos function
sin function
tanh function
7/21/2022 • 2 minutes to read
Syntax
tanh(expr)
Arguments
expr : An expression that evaluates to a numeric expressing the hyperbolic angle.
Returns
A DOUBLE.
Examples
> SELECT tanh(0);
0.0
Related functions
tan function
cosh function
sinh function
~ (tilde sign) operator
7/21/2022 • 2 minutes to read
Syntax
~ expr
Arguments
expr : An integral numeric type expression.
Returns
The bitwise NOT of expr .
The result type matches the type of expr .
Examples
> SELECT ~ 0;
-1
Related functions
& (ampersand sign) operator
| (pipe sign) operator
^ (caret sign) operator
bit_count function
timestamp function
7/21/2022 • 2 minutes to read
Syntax
timestamp(expr)
Arguments
expr : Any expression that can be cast to TIMESTAMP.
Returns
A TIMESTAMP.
This function is a synonym for CAST(expr AS TIMESTAMP) .
For details see cast function.
Examples
> SELECT timestamp('2020-04-30 12:25:13.45');
2020-04-30 12:25:13.45
> SELECT timestamp(date'2020-04-30');
2020-04-30 00:00:00
> SELECT timestamp(123);
1969-12-31 16:02:03
Related functions
cast function
timestamp_micros function
7/21/2022 • 2 minutes to read
Syntax
timestamp_micros(expr)
Arguments
expr : An integral numeric expression specifying microseconds.
Returns
A TIMESTAMP.
Examples
> SELECT timestamp_micros(1230219000123123);
2008-12-25 07:30:00.123123
Related functions
timestamp function
timestamp_millis function
timestamp_seconds function
timestamp_millis function
7/21/2022 • 2 minutes to read
Syntax
timestamp_millis(expr)
Arguments
expr : An integral numeric expression specifying milliseconds.
Returns
A TIMESTAMP.
Examples
> SELECT timestamp_millis(1230219000123);
2008-12-25 07:30:00.123
Related functions
timestamp function
timestamp_micros function
timestamp_seconds function
timestamp_seconds function
7/21/2022 • 2 minutes to read
Syntax
timestamp_seconds(expr)
Arguments
expr : A numeric expression specifying seconds.
Returns
A TIMESTAMP.
Examples
> SELECT timestamp_seconds(1230219000);
2008-12-25 07:30:00
> SELECT timestamp_seconds(1230219000.123);
2008-12-25 07:30:00.123
Related functions
timestamp function
timestamp_micros function
timestamp_millis function
timestampadd function
7/21/2022 • 2 minutes to read
Syntax
timestampadd(unit, value, expr)
unit
{ MICROSECOND |
MILLISECOND |
SECOND |
MINUTE |
HOUR |
DAY | DAYOFYEAR |
WEEK |
MONTH |
QUARTER |
YEAR }
Arguments
unit : A unit of measure.
value : A numeric expression with the number of unit s to add to expr .
expr : A TIMESTAMP expression.
Returns
A TIMESTAMP.
If value is negative it is subtracted from the expr . If unit is MONTH , QUARTER , or YEAR the day portion of the
result will be adjusted to result in a valid date.
The function returns an overflow error if the result is beyond the supported range of timestamps.
Examples
> SELECT timestampadd(MICROSECOND, 5, TIMESTAMP'2022-02-28 00:00:00');
2022-02-28 00:00:00.000005
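To illustrate the day-adjustment rule for MONTH , QUARTER , and YEAR , a hedged example (the result assumes standard end-of-month adjustment):
> SELECT timestampadd(MONTH, 1, TIMESTAMP'2022-01-31 00:00:00');
 2022-02-28 00:00:00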
Related functions
add_months function
date_add function
date_sub function
dateadd function
timestamp function
timestampdiff function
7/21/2022 • 2 minutes to read
Syntax
timestampdiff(unit, start, end)
unit
{ MICROSECOND |
MILLISECOND |
SECOND |
MINUTE |
HOUR |
DAY |
WEEK |
MONTH |
QUARTER |
YEAR }
Arguments
unit : A unit of measure.
start : A starting TIMESTAMP expression.
end : An ending TIMESTAMP expression.
Returns
A BIGINT.
If start is greater than end the result is negative.
The function counts whole elapsed units based on UTC with a DAY being 86400 seconds.
One month is considered elapsed when the calendar month has increased and the calendar day and time are
equal to or greater than the start. Weeks, quarters, and years follow from that.
Examples
-- One second shy of a month elapsed
> SELECT timestampdiff(MONTH, TIMESTAMP'2021-02-28 12:00:00', TIMESTAMP'2021-03-28 11:59:59');
0
-- One month has passed even though it's not the end of the month yet because day and time line up.
> SELECT timestampdiff(MONTH, TIMESTAMP'2021-02-28 12:00:00', TIMESTAMP'2021-03-28 12:00:00');
1
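As an added illustration, swapping start and end yields a negative count:
> SELECT timestampdiff(HOUR, TIMESTAMP'2021-01-01 12:00:00', TIMESTAMP'2021-01-01 10:00:00');
 -2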
tinyint function
Syntax
tinyint(expr)
Arguments
expr : Any expression which is castable to TINYINT.
Returns
The result is TINYINT.
This function is a synonym for CAST(expr AS TINYINT) .
See cast function for details.
Examples
> SELECT tinyint('12');
12
> SELECT tinyint(5.4);
5
Related functions
cast function
to_csv function
7/21/2022 • 2 minutes to read
Syntax
to_csv(expr [, options] )
Arguments
expr : A STRUCT expression.
options : An optional MAP literal expression with keys and values being STRING.
Returns
A STRING.
See from_csv function for details on possible options .
Examples
> SELECT to_csv(named_struct('a', 1, 'b', 2));
1,2
> SELECT to_csv(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat',
'dd/MM/yyyy'));
26/08/2015
Related functions
from_csv function
schema_of_csv function
to_json function
from_json function
schema_of_json function
to_date function
7/21/2022 • 2 minutes to read
Syntax
to_date(expr [, fmt] )
Arguments
expr : A STRING expression representing a date.
fmt: An optional format STRING expression.
Returns
A DATE.
If fmt is supplied, it must conform with Datetime patterns.
If fmt is not supplied, the function is a synonym for cast(expr AS DATE) .
If fmt is malformed or its application does not result in a well-formed date, the function raises an error.
NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for malformed dates.
Examples
> SELECT to_date('2009-07-30 04:17:52');
2009-07-30
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
2016-12-31
Related functions
cast function
date function
to_timestamp function
Datetime patterns
to_json function
7/21/2022 • 2 minutes to read
Syntax
to_json(expr [, options] )
Arguments
expr : A STRUCT expression.
options : An optional MAP literal expression with keys and values being STRING.
Returns
A STRING.
See from_json function for details on possible options .
Examples
> SELECT to_json(named_struct('a', 1, 'b', 2));
{"a":1,"b":2}
> SELECT to_json(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat',
'dd/MM/yyyy'));
{"time":"26/08/2015"}
> SELECT to_json(array(named_struct('a', 1, 'b', 2)));
[{"a":1,"b":2}]
> SELECT to_json(map('a', named_struct('b', 1)));
{"a":{"b":1}}
> SELECT to_json(map(named_struct('a', 1),named_struct('b', 2)));
{"[1]":{"b":2}}
> SELECT to_json(map('a', 1));
{"a":1}
> SELECT to_json(array((map('a', 1))));
[{"a":1}]
Related functions
: operator
from_csv function
schema_of_csv function
from_json function
schema_of_json function
to_number function
7/21/2022 • 2 minutes to read
Syntax
to_number(expr, fmt)
fmt
{ ' [ MI | S ] [ L | $ ]
[ 0 | 9 | G | , ] [...]
[ . | D ]
[ 0 | 9 ] [...]
[ L | $ ] [ PR | MI | S ] ' }
Arguments
expr : A STRING expression representing a number. expr may include leading or trailing spaces.
fmt : A STRING literal, specifying the expected format of expr .
Returns
A DECIMAL(p, s) where p is the total number of digits ( 0 or 9 ) and s is the number of digits after the
decimal point, or 0 if there is none.
fmt can contain the following elements (case insensitive):
0 or 9
Specifies an expected digit between 0 and 9 . A 0 to the left of the decimal point indicates that expr
must have at least as many digits. A leading 9 indicates that expr may omit these digits.
expr must not be larger than the number of digits to the left of the decimal point allows.
Digits to the right of the decimal point indicate the maximum number of digits expr may have to the right
of the decimal point, as specified by fmt .
, or G
Specifies the position of the grouping (thousands) separator. There must be a 0 or 9 to the left and
right of each grouping separator. expr must match the grouping separator relevant to the size of the
number.
. or D
Specifies the position of the decimal point. This character may only be specified once.
L or $
Specifies the location of the $ currency sign. This character may only be specified once.
S or MI
Specifies the position of an optional '+' or '-' sign for S , and '-' only for MI . This directive may be
specified only once.
PR
Only allowed at the end of the format string; specifies that expr indicates a negative number with
wrapping angled brackets ( <1> ).
If expr contains any characters other than 0 through 9 , or characters permitted in fmt , an error is returned.
To return NULL instead of an error for invalid expr use try_to_number().
Examples
-- The format expects:
-- * an optional sign at the beginning,
-- * followed by a dollar sign,
-- * followed by a number between 3 and 6 digits long,
-- * thousands separators,
-- * up to two digits beyond the decimal point.
> SELECT to_number('-$12,345.67', 'S$999,099.99');
-12345.67
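A simpler format, sketched here as an additional illustration, uses only digit positions and a decimal point:
> SELECT to_number('78.12', '99.99');
 78.12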
Related functions
cast function
to_date function
try_to_number function
to_timestamp function
7/21/2022 • 2 minutes to read
Syntax
to_timestamp(expr [, fmt] )
Arguments
expr : A STRING expression representing a timestamp.
fmt: An optional format STRING expression.
Returns
A TIMESTAMP.
If fmt is supplied, it must conform with Datetime patterns.
If fmt is not supplied, the function is a synonym for cast(expr AS TIMESTAMP) .
If fmt is malformed or its application does not result in a well formed timestamp, the function raises an error.
NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for malformed timestamps.
Examples
> SELECT to_timestamp('2016-12-31 00:12:00');
2016-12-31 00:12:00
Related functions
cast function
timestamp function
to_date function
Datetime patterns
to_unix_timestamp function
7/21/2022 • 2 minutes to read
Syntax
to_unix_timestamp(expr [, fmt] )
Arguments
expr : A STRING expression representing a timestamp.
fmt: An optional format STRING expression.
Returns
A BIGINT.
If fmt is supplied, it must conform with Datetime patterns.
If fmt is not supplied, the function is a synonym for cast(expr AS TIMESTAMP) .
If fmt is malformed or its application does not result in a well formed timestamp, the function raises an error.
NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for malformed timestamps.
Examples
> SELECT to_unix_timestamp('2016-04-08', 'yyyy-MM-dd');
1460098800
Related functions
from_unixtime function
Datetime patterns
to_utc_timestamp function
7/21/2022 • 2 minutes to read
Syntax
to_utc_timestamp(expr, timezone)
Arguments
expr : A TIMESTAMP expression.
timezone : A STRING expression that is a valid timezone.
Returns
A TIMESTAMP.
Examples
> SELECT to_utc_timestamp('2016-08-31', 'Asia/Seoul');
2016-08-30 15:00:00
> SELECT to_utc_timestamp( '2017-07-14 02:40:00.0', 'GMT+1');
2017-07-14 01:40:00.0
Related functions
from_utc_timestamp function
transform function
7/21/2022 • 2 minutes to read
Syntax
transform(expr, func)
Arguments
expr : An ARRAY expression.
func : A lambda function.
Returns
An ARRAY of the type of the lambda function’s result.
The lambda function must have 1 or 2 parameters. The first parameter represents the element, the optional
second parameter represents the index of the element.
The lambda function produces a new value for each element in the array.
Examples
> SELECT transform(array(1, 2, 3), x -> x + 1);
[2,3,4]
> SELECT transform(array(1, 2, 3), (x, i) -> x + i);
[1,3,5]
Related functions
transform_keys function
transform_values function
transform_keys function
7/21/2022 • 2 minutes to read
Syntax
transform_keys(expr, func)
Arguments
expr : A MAP expression.
func : A lambda function.
Returns
A MAP where the keys have the type of the result of the lambda functions and the values have the type of the
expr MAP values.
The lambda function must have 2 parameters. The first parameter represents the key. The second parameter
represents the value.
The lambda function produces a new key for each entry in the map.
Examples
> SELECT transform_keys(map_from_arrays(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + 1);
{2 -> 1, 3 -> 2, 4 -> 3}
> SELECT transform_keys(map_from_arrays(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + v);
{2 -> 1, 4 -> 2, 6 -> 3}
Related functions
transform function
transform_values function
transform_values function
7/21/2022 • 2 minutes to read
Syntax
transform_values(expr, func)
Arguments
expr : A MAP expression.
func : A lambda function.
Returns
A MAP where the values have the type of the result of the lambda functions and the keys have the type of the
expr MAP keys.
The lambda function must have 2 parameters. The first parameter represents the key. The second parameter
represents the value.
The lambda function produces a new value for each entry in the map.
Examples
> SELECT transform_values(map_from_arrays(array(1, 2, 3), array(1, 2, 3)), (k, v) -> v + 1);
{1 -> 2, 2 -> 3, 3 -> 4}
> SELECT transform_values(map_from_arrays(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + v);
{1 -> 2, 2 -> 4, 3 -> 6}
Related functions
transform function
transform_keys function
translate function
7/21/2022 • 2 minutes to read
Returns an expr where all characters in from have been replaced with those in to .
Syntax
translate(expr, from, to)
Arguments
expr : A STRING expression.
from : A STRING expression consisting of a set of characters to be replaced.
to : A STRING expression consisting of a matching set of characters to replace from .
Returns
A STRING.
The function replaces all occurrences of any character in from with the corresponding character in to.
If to has a shorter length than from unmatched characters are removed.
Examples
> SELECT translate('AaBbCc', 'abc', '123');
A1B2C3
> SELECT translate('AaBbCc', 'abc', '1');
A1BC
> SELECT translate('AaBbCc', 'abc', '');
ABC
Related functions
replace function
overlay function
regexp_replace function
trim function
7/21/2022 • 2 minutes to read
Syntax
trim(str)
trim( [BOTH | LEADING | TRAILING] [trimStr] FROM str )
Arguments
trimStr : A STRING expression with a set of characters to be trimmed.
str : A STRING expression to be trimmed.
Returns
A STRING.
Examples
> SELECT '+' || trim(' SparkSQL ') || '+';
+SparkSQL+
> SELECT '+' || trim(BOTH FROM ' SparkSQL ') || '+';
+SparkSQL+
> SELECT '+' || trim(LEADING FROM ' SparkSQL ') || '+';
+SparkSQL +
> SELECT '+' || trim(TRAILING FROM ' SparkSQL ') || '+';
+ SparkSQL+
> SELECT trim('SL' FROM 'SSparkSQLS');
parkSQ
> SELECT trim(BOTH 'SL' FROM 'SSparkSQLS');
parkSQ
> SELECT trim(LEADING 'SL' FROM 'SSparkSQLS');
parkSQLS
> SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS');
SSparkSQ
Related functions
btrim function
lpad function
ltrim function
rpad function
rtrim function
trunc function
7/21/2022 • 2 minutes to read
Returns a date with the portion of the date truncated to the unit specified by the format model fmt .
Syntax
trunc(expr, fmt)
Arguments
expr : A DATE expression.
fmt : A STRING expression specifying how to truncate.
Returns
A DATE.
fmt must be one of (case insensitive):
'YEAR' , 'YYYY' , 'YY' - truncate to the first date of the year that the date falls in.
'QUARTER' - truncate to the first date of the quarter that the date falls in.
'MONTH' , 'MM' , 'MON' - truncate to the first date of the month that the date falls in.
'WEEK' - truncate to the Monday of the week that the date falls in.
Examples
> SELECT trunc('2019-08-04', 'week');
2019-07-29
> SELECT trunc('2019-08-04', 'quarter');
2019-07-01
> SELECT trunc('2009-02-12', 'MM');
2009-02-01
> SELECT trunc('2015-10-27', 'YEAR');
2015-01-01
Related functions
date_trunc function
try_add function
7/21/2022 • 2 minutes to read
Syntax
try_add ( expr1 , expr2 )
Arguments
expr1 : A numeric, DATE, TIMESTAMP, or INTERVAL expression.
expr2 : If expr1 is numeric, expr2 must be a numeric expression; otherwise an INTERVAL.
Returns
If expr1 is a numeric, the common maximum type of the arguments.
If expr1 is a DATE and expr2 is a day-time interval the result is a TIMESTAMP.
If expr1 and expr2 are year-month intervals the result is a year-month interval of sufficiently wide units to
represent the result.
If expr1 and expr2 are day-time intervals the result is a day-time interval of sufficiently wide units to
represent the result.
Otherwise, the result type matches expr1 .
If both expressions are interval they must be of the same class.
If the result overflows the result type Databricks Runtime returns NULL.
When you add a year-month interval to a DATE Databricks Runtime will assure that the resulting date is well
formed.
Examples
> SELECT try_add(1, 2);
3
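A hedged example of adding a year-month interval to a DATE, where the day portion is adjusted to a valid date:
> SELECT try_add(DATE'2021-01-31', INTERVAL '1' MONTH);
 2021-02-28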
Related functions
- (minus sign) operator
/ (slash sign) operator
* (asterisk sign) operator
sum aggregate function
try_divide function
try_avg aggregate function
7/21/2022 • 2 minutes to read
Returns the mean calculated from values of a group. If there is an overflow, returns NULL.
Since: Databricks Runtime 11.0
Syntax
try_avg( [ALL | DISTINCT] expr) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that returns a numeric or an interval value.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
The result type is computed as for the arguments:
DECIMAL(p, s) : The result type is a DECIMAL(p + 4, s + 4) . If the maximum precision for DECIMAL is reached
the increase in scale will be limited to avoid loss of significant digits.
year-month interval: The result is an INTERVAL YEAR TO MONTH .
day-time interval: The result is an INTERVAL DAY TO SECOND .
In all other cases the result is a DOUBLE.
Nulls within the group are ignored. If a group is empty or consists only of nulls, the result is NULL.
If DISTINCT is specified, the average is computed after duplicates are removed.
To raise an error instead of NULL in case of an overflow use avg.
Examples
> SELECT try_avg(col) FROM VALUES (1), (2), (3) AS tab(col);
2.0
> SELECT try_avg(DISTINCT col) FROM VALUES (1), (1), (2) AS tab(col);
1.5
> SELECT try_avg(col) FROM VALUES (INTERVAL '1' YEAR), (INTERVAL '2' YEAR) AS tab(col);
1-6
Related functions
avg aggregate function
aggregate function
max aggregate function
mean aggregate function
min aggregate function
try_sum aggregate function
sum aggregate function
try_cast function
7/21/2022 • 2 minutes to read
Returns the value of sourceExpr cast to data type targetType if possible, or NULL if not possible.
Since: Databricks Runtime 10.0
Syntax
try_cast(sourceExpr AS targetType)
Arguments
sourceExpr : Any castable expression.
targetType : The type of the result.
Returns
The result is of type targetType .
This function is a more relaxed variant of cast function which includes a detailed description.
try_cast differs from cast function by tolerating the following conditions as long as the cast from the type of
sourceExpr to targetType is supported:
If a sourceExpr value cannot fit within the domain of targetType the result is NULL instead of an overflow
error.
If a sourceExpr value is not well formed or contains invalid characters the result is NULL instead of an
invalid data error.
Exceptions to the above are:
Casting to a STRUCT field with NOT NULL property.
Casting a MAP key.
Examples
> SELECT try_cast('10' AS INT);
10
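As an added sketch of the relaxed behavior, an unparseable string yields NULL rather than an error:
> SELECT try_cast('abc' AS INT);
 NULL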
Related functions
:: (colon colon sign) operator
cast function
try_divide function
7/21/2022 • 2 minutes to read
Syntax
try_divide(dividend, divisor)
Arguments
dividend : A numeric or INTERVAL expression.
divisor : A numeric expression.
Returns
If both dividend and divisor are DECIMAL, the result is DECIMAL.
If dividend is a year-month interval, the result is an INTERVAL YEAR TO MONTH .
If dividend is a day-time interval, the result is an INTERVAL DAY TO SECOND .
In all other cases, a DOUBLE.
If the divisor is 0, the operator returns NULL.
Examples
> SELECT try_divide(3, 2);
1.5
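As an added illustration, division by zero returns NULL rather than an error:
> SELECT try_divide(3, 0);
 NULL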
Related functions
* (asterisk sign) operator
div operator
- (minus sign) operator
+ (plus sign) operator
sum aggregate function
try_add function
try_element_at function
7/21/2022 • 2 minutes to read
Syntax
try_element_at(arrayExpr, index)
try_element_at(mapExpr, key)
Arguments
arrayExpr : An ARRAY expression.
index : An INTEGER expression.
mapExpr : A MAP expression.
key : An expression matching the type of the keys of mapExpr
Returns
If the first argument is an ARRAY:
The result is of the type of the elements of arrayExpr .
abs(index) must not be 0.
If index is negative the function accesses elements from the last to the first.
The function returns NULL if abs(index) exceeds the length of the array, or if key does not exist in the map.
Examples
> SELECT try_element_at(array(1, 2, 3), 2);
2
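Additional sketches (not from the original page) showing the NULL results for an out-of-range index and a missing key:
> SELECT try_element_at(array(1, 2, 3), 5);
 NULL
> SELECT try_element_at(map('a', 1, 'b', 2), 'c');
 NULL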
Related functions
array_contains function
array_position function
element_at function
try_multiply function
7/21/2022 • 2 minutes to read
Syntax
try_multiply(multiplier, multiplicand)
Arguments
multiplier : A numeric or INTERVAL expression.
multiplicand : A numeric expression or INTERVAL expression.
Returns
If both multiplier and multiplicand are DECIMAL, the result is DECIMAL.
If multiplier or multiplicand is an INTERVAL, the result is of the same type.
If both multiplier and multiplicand are integral numeric types the result is the larger of the two types.
In all other cases the result is a DOUBLE.
If either the multiplier or the multiplicand is 0, the operator returns 0.
If the result of the multiplication is outside the bound for the result type the result is NULL .
Examples
> SELECT try_multiply(3, 2);
6
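A hedged overflow sketch, assuming both literals are typed as INT so that the product exceeds the INT range and the function returns NULL:
> SELECT try_multiply(2147483647, 2);
 NULL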
Related functions
* (asterisk sign) operator
div operator
- (minus sign) operator
+ (plus sign) operator
sum aggregate function
try_add function
try_divide function
try_subtract function
try_subtract function
7/21/2022 • 2 minutes to read
Syntax
try_subtract ( expr1 , expr2 )
Arguments
expr1 : A numeric, DATE, TIMESTAMP, or INTERVAL expression.
expr2 : If expr1 is numeric, expr2 must be a numeric expression; otherwise an INTERVAL.
Returns
If expr1 is a numeric, the common maximum type of the arguments.
If expr1 is a DATE and expr2 is a day-time interval the result is a TIMESTAMP.
If expr1 and expr2 are year-month intervals the result is a year-month interval of sufficiently wide units to
represent the result.
If expr1 and expr2 are day-time intervals the result is a day-time interval of sufficiently wide units to
represent the result.
Otherwise, the result type matches expr1 .
If both expressions are interval they must be of the same class.
If the result overflows the result type Databricks Runtime returns NULL.
When you subtract a year-month interval from a DATE Databricks Runtime will assure that the resulting date is
well formed.
Examples
> SELECT try_subtract(1, 2);
-1
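A hedged example of subtracting a year-month interval from a DATE, where the day portion is adjusted to a valid date:
> SELECT try_subtract(DATE'2021-03-31', INTERVAL '1' MONTH);
 2021-02-28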
Related functions
- (minus sign) operator
/ (slash sign) operator
* (asterisk sign) operator
sum aggregate function
try_add function
try_divide function
try_multiply function
try_sum aggregate function
7/21/2022 • 2 minutes to read
Returns the sum calculated from values of a group, or NULL if there is an overflow.
Since: Databricks Runtime 10.5
Syntax
try_sum ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric or interval.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
If expr is an integral number type, a BIGINT.
If expr is DECIMAL(p, s) the result is DECIMAL(p + min(10, 31-p), s) .
If expr is an interval the result type matches expr .
Otherwise, a DOUBLE.
If DISTINCT is specified only unique values are summed up.
If the result overflows the result type Databricks Runtime returns NULL. To return an error instead use sum.
Examples
> SELECT try_sum(col) FROM VALUES (5), (10), (15) AS tab(col);
30
> SELECT try_sum(DISTINCT col) FROM VALUES (5), (10), (10), (15) AS tab(col);
30
Related functions
aggregate function
avg aggregate function
max aggregate function
mean aggregate function
min aggregate function
sum aggregate function
try_to_number function
7/21/2022 • 2 minutes to read
Returns expr cast to DECIMAL using formatting fmt , or NULL if expr does not match the format.
Since: Databricks Runtime 10.5
Syntax
try_to_number(expr, fmt)
fmt
{ ' [ MI | S ] [ L | $ ]
[ 0 | 9 | G | , ] [...]
[ . | D ]
[ 0 | 9 ] [...]
[ L | $ ] [ PR | MI | S ] ' }
Arguments
expr : A STRING expression representing a number. expr may include leading or trailing spaces.
fmt : A STRING literal, specifying the expected format of expr .
Returns
A DECIMAL(p, s) where p is the total number of digits ( 0 or 9 ) and s is the number of digits after the
decimal point, or 0 if there are no digits after the decimal point.
fmt can contain the following elements (case insensitive):
0 or 9
Specifies an expected digit between 0 and 9 . A 0 to the left of the decimal point indicates that expr
must have at least as many digits. A leading 9 indicates that expr may omit these digits.
expr must not be larger than the number of digits to the left of the decimal point allows.
Digits to the right of the decimal point indicate the maximum number of digits expr may have to the right of
the decimal point specified by fmt .
, or G
Specifies the position of the grouping (thousands) separator. There must be a 0 or 9 to the left and
right of each grouping separator. expr must match the grouping separator relevant to the size of the
number.
. or D
Specifies the position of the decimal point. This character may only be specified once.
L or $
Specifies the location of the $ currency sign. This character may only be specified once.
S or MI
Specifies the position of an optional '+' or '-' sign for S , and '-' only for MI . This directive may be
specified only once.
PR
Specifies that expr indicates a negative number with wrapping angled brackets ( <1> ).
If expr contains any characters other than 0 through 9 , or those permitted in fmt , a NULL is returned.
For strict semantics use to_number().
Examples
-- The format expects:
-- * an optional sign at the beginning,
-- * followed by a dollar sign,
-- * followed by a number between 3 and 6 digits long,
-- * thousands separators,
-- * up to two digits beyond the decimal point.
> SELECT try_to_number('-$12,345.67', 'S$999,099.99');
-12345.67
Related functions
cast function
to_date function
to_number function
typeof function
7/21/2022 • 2 minutes to read
Returns a DDL-formatted type string for the data type of the input.
Syntax
typeof(expr)
Arguments
expr : Any expression.
Returns
A STRING.
Examples
> SELECT typeof(1);
int
> SELECT typeof(array(1));
array<int>
Related functions
ucase function
7/21/2022 • 2 minutes to read
Syntax
ucase(expr)
Arguments
expr : A STRING expression.
Returns
A STRING.
This function is a synonym for upper function.
Examples
> SELECT ucase('SparkSql');
SPARKSQL
Related functions
lower function
initcap function
upper function
unbase64 function
7/21/2022 • 2 minutes to read
Syntax
unbase64(expr)
Arguments
expr : A STRING expression in a base64 format.
Returns
A BINARY.
Examples
> SELECT cast(unbase64('U3BhcmsgU1FM') AS STRING);
Spark SQL
Related functions
base64 function
unhex function
7/21/2022 • 2 minutes to read
Syntax
unhex(expr)
Arguments
expr : A STRING expression of hexadecimal characters.
Returns
The result is BINARY.
If the length of expr is odd, the first character is discarded and the result is padded with a null byte. If expr
contains non hex characters the result is NULL.
Examples
> SELECT decode(unhex('537061726B2053514C'), 'UTF-8');
Spark SQL
Related functions
hex function
unix_date function
7/21/2022 • 2 minutes to read
Syntax
unix_date(expr)
Arguments
expr : A DATE expression.
Returns
An INTEGER.
Examples
> SELECT unix_date(DATE('1970-01-02'));
1
Related functions
unix_micros function
unix_millis function
unix_seconds function
unix_micros function
7/21/2022 • 2 minutes to read
Syntax
unix_micros(expr)
Arguments
expr : A TIMESTAMP expression.
Returns
A BIGINT.
Examples
> SELECT unix_micros(TIMESTAMP('1970-01-01 00:00:01Z'));
1000000
Related functions
unix_date function
unix_millis function
unix_seconds function
unix_millis function
7/21/2022 • 2 minutes to read
Syntax
unix_millis(expr)
Arguments
expr : A TIMESTAMP expression.
Returns
A BIGINT.
The function truncates higher levels of precision.
Examples
> SELECT unix_millis(TIMESTAMP('1970-01-01 00:00:01Z'));
1000
Related functions
unix_date function
unix_micros function
unix_seconds function
unix_seconds function
7/21/2022 • 2 minutes to read
Syntax
unix_seconds(expr)
Arguments
expr : A TIMESTAMP expression.
Returns
A BIGINT.
The function truncates higher levels of precision.
Examples
> SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'));
1
Related functions
unix_date function
unix_micros function
unix_millis function
unix_timestamp function
7/21/2022 • 2 minutes to read
Syntax
unix_timestamp([expr [, fmt] ] )
Arguments
expr : An optional DATE, TIMESTAMP, or a STRING expression in a valid datetime format.
fmt : An optional STRING expression specifying the format if expr is a STRING.
Returns
A BIGINT.
If no argument is provided the default is the current timestamp. fmt is ignored if expr is a DATE or
TIMESTAMP. If expr is a STRING fmt is used to translate the string to a TIMESTAMP before computing the unix
timestamp.
The default fmt value is 'yyyy-MM-dd HH:mm:ss' .
See Datetime patterns for valid date and time format patterns.
If fmt or expr are invalid the function raises an error.
NOTE
If spark.sql.ansi.enabled is false the function returns NULL instead of an error for malformed timestamps.
Examples
> SELECT unix_timestamp();
1476884637
> SELECT unix_timestamp('2016-04-08', 'yyyy-MM-dd');
1460041200
Related functions
timestamp function
upper function
7/21/2022 • 2 minutes to read
Syntax
upper(expr)
Arguments
expr : A STRING expression.
Returns
A STRING.
This function is a synonym for ucase function.
Examples
> SELECT upper('SparkSql');
SPARKSQL
Related functions
lower function
initcap function
ucase function
uuid function
7/21/2022 • 2 minutes to read
Syntax
uuid()
Arguments
The function takes no argument.
Returns
A STRING formatted as a canonical UUID 36-character string.
The function is non-deterministic.
Examples
> SELECT uuid();
46707d92-02f4-4817-8116-a4c3b23e6266
Related functions
rand function
var_pop aggregate function
7/21/2022 • 2 minutes to read
Syntax
var_pop ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
Examples
> SELECT var_pop(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.6875
> SELECT var_pop(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.6666666666666666
Related functions
var_samp aggregate function
variance aggregate function
var_samp aggregate function
7/21/2022 • 2 minutes to read
Syntax
var_samp ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
This function is a synonym for variance aggregate function.
Examples
> SELECT var_samp(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.9166666666666666
> SELECT var_samp(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
1.0
Related functions
var_pop aggregate function
variance aggregate function
variance aggregate function
7/21/2022 • 2 minutes to read
Syntax
variance ( [ALL | DISTINCT] expr ) [FILTER ( WHERE cond ) ]
Arguments
expr : An expression that evaluates to a numeric.
cond : An optional boolean expression filtering the rows used for aggregation.
Returns
A DOUBLE.
If DISTINCT is specified the function operates only on a unique set of expr values.
This function is a synonym for var_samp aggregate function.
Examples
> SELECT variance(col) FROM VALUES (1), (2), (3), (3) AS tab(col);
0.9166666666666666
> SELECT variance(DISTINCT col) FROM VALUES (1), (2), (3), (3) AS tab(col);
1.0
Related functions
var_pop aggregate function
var_samp aggregate function
version function
7/21/2022 • 2 minutes to read
Syntax
version()
Arguments
The function takes no argument.
Returns
A STRING that contains two fields, the first being a release version and the second being a git revision.
Examples
> SELECT version();
3.1.0 a6d6ea3efedbad14d99c24143834cd4e2e52fb40
Related functions
current_version function
weekday function
7/21/2022 • 2 minutes to read
Syntax
weekday(expr)
Arguments
expr : A DATE or TIMESTAMP expression.
Returns
An INTEGER where 0 = Monday and 6 = Sunday.
This function is a synonym for extract(DAYOFWEEK_ISO FROM expr) - 1 .
Examples
> SELECT weekday(DATE'2009-07-30'), extract(DAYOFWEEK_ISO FROM DATE'2009-07-30');
3 4
Related functions
day function
dayofmonth function
dayofyear function
extract function
dayofweek function
weekofyear function
7/21/2022 • 2 minutes to read
Syntax
weekofyear(expr)
Arguments
expr : A DATE expression.
Returns
An INTEGER.
A week is considered to start on a Monday and week 1 is the first week with >3 days. This function is a synonym
for extract(WEEK FROM expr) .
Examples
> SELECT weekofyear('2008-02-20');
8
Related functions
day function
dayofmonth function
dayofyear function
extract function
dayofweek function
width_bucket function
7/21/2022 • 2 minutes to read
Syntax
width_bucket(expr, minExpr, maxExpr, numBuckets)
Arguments
expr : A numeric expression to be bucketed.
minExpr : A numeric expression providing a lower bound for the buckets.
maxExpr : A numeric expression providing an upper bound for the buckets.
numBuckets : An INTEGER expression greater than 0 specifying the number of buckets.
Returns
An INTEGER.
The function divides the range between minExpr and maxExpr into numBuckets slices of equal size. The result is
the slice into which expr falls.
If expr is outside of minExpr the result is 0.
If expr is outside of maxExpr the result is numBuckets + 1.
Examples
> SELECT width_bucket(5.3, 0.2, 10.6, 5);
3
> SELECT width_bucket(-2.1, 1.3, 3.4, 3);
0
> SELECT width_bucket(8.1, 0.0, 5.7, 4);
5
> SELECT width_bucket(-0.9, 5.2, 0.5, 2);
3
Related functions
window grouping expression
7/21/2022 • 2 minutes to read
Syntax
window(expr, width [, slide [, start] ] )
Arguments
expr : A TIMESTAMP expression specifying the subject of the window.
width : A STRING literal representing the width of the window as an INTERVAL DAY TO SECOND literal.
slide : An optional STRING literal representing the start of the next window expressed as an INTERVAL DAY
TO SECOND literal.
start : An optional STRING literal representing an offset from midnight to start, expressed as an INTERVAL
HOUR TO SECOND literal.
Returns
Returns a set of groupings which can be operated on with aggregate functions. The GROUP BY column name is
window . It is of type STRUCT<start:TIMESTAMP, end:TIMESTAMP>
slide must be less than or equal to width . start must be less than slide .
If slide < width the rows in each group overlap. By default slide equals width , so expr values are
partitioned into non-overlapping groups. The windowing starts at 1970-01-01 00:00:00 UTC + start . The default for start is '0 SECONDS' .
Examples
> SELECT window, min(val), max(val), count(val)
FROM VALUES (TIMESTAMP'2020-08-01 12:20:21', 17),
(TIMESTAMP'2020-08-01 12:20:22', 12),
(TIMESTAMP'2020-08-01 12:23:10', 8),
(TIMESTAMP'2020-08-01 12:25:05', 11),
(TIMESTAMP'2020-08-01 12:28:59', 15),
(TIMESTAMP'2020-08-01 12:30:01', 23),
(TIMESTAMP'2020-08-01 12:30:15', 2),
(TIMESTAMP'2020-08-01 12:35:22', 16) AS S(stamp, val)
GROUP BY window(stamp, '2 MINUTES 30 SECONDS', '30 SECONDS', '15 SECONDS');
{2020-08-01 12:19:15, 2020-08-01 12:21:45} 12 17 2
{2020-08-01 12:18:15, 2020-08-01 12:20:45} 12 17 2
{2020-08-01 12:20:15, 2020-08-01 12:22:45} 12 17 2
{2020-08-01 12:19:45, 2020-08-01 12:22:15} 12 17 2
{2020-08-01 12:18:45, 2020-08-01 12:21:15} 12 17 2
{2020-08-01 12:21:45, 2020-08-01 12:24:15} 8 8 1
{2020-08-01 12:22:45, 2020-08-01 12:25:15} 8 11 2
{2020-08-01 12:21:15, 2020-08-01 12:23:45} 8 8 1
{2020-08-01 12:22:15, 2020-08-01 12:24:45} 8 8 1
{2020-08-01 12:20:45, 2020-08-01 12:23:15} 8 8 1
{2020-08-01 12:23:45, 2020-08-01 12:26:15} 11 11 1
{2020-08-01 12:23:15, 2020-08-01 12:25:45} 11 11 1
{2020-08-01 12:24:45, 2020-08-01 12:27:15} 11 11 1
{2020-08-01 12:24:15, 2020-08-01 12:26:45} 11 11 1
{2020-08-01 12:27:15, 2020-08-01 12:29:45} 15 15 1
{2020-08-01 12:27:45, 2020-08-01 12:30:15} 15 23 2
{2020-08-01 12:28:45, 2020-08-01 12:31:15} 2 23 3
{2020-08-01 12:26:45, 2020-08-01 12:29:15} 15 15 1
{2020-08-01 12:28:15, 2020-08-01 12:30:45} 2 23 3
{2020-08-01 12:29:45, 2020-08-01 12:32:15} 2 23 2
Related functions
cube function
xpath function
7/21/2022 • 2 minutes to read
Syntax
xpath(xml, xpath)
Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.
Returns
An ARRAY of STRING.
The function raises an error if xml or xpath are malformed.
Examples
> SELECT xpath('<a><b>b1</b><b>b2</b><b>b3</b><c>c1</c><c>c2</c></a>','a/b/text()');
[b1, b2, b3]
Related functions
xpath_boolean function
xpath_double function
xpath_int function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xpath_boolean function
7/21/2022 • 2 minutes to read
Returns true if the xpath expression evaluates to true , or if a matching node in xml is found.
Syntax
xpath_boolean(xml, xpath)
Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.
Returns
A BOOLEAN.
The function raises an error if xml or xpath are malformed.
Examples
> SELECT xpath_boolean('<a><b>1</b></a>','a/b');
true
Related functions
xpath function
xpath_double function
xpath_float function
xpath_int function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xpath_double function
7/21/2022 • 2 minutes to read
Syntax
xpath_double(xml, xpath)
Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.
Returns
A DOUBLE.
The result is zero if no match is found, or NaN if a match is found but the value is non-numeric.
The function raises an error if xml or xpath are malformed.
Examples
> SELECT xpath_double('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
3.0
Related functions
xpath function
xpath_boolean function
xpath_float function
xpath_int function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xpath_float function
7/21/2022 • 2 minutes to read
Syntax
xpath_float(xml, xpath)
Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.
Returns
The result is FLOAT.
The result is zero if no match is found, or NaN if a match is found but the value is non-numeric.
The function raises an error if xml or xpath are malformed.
Examples
> SELECT xpath_float('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
3.0
Related functions
xpath function
xpath_boolean function
xpath_double function
xpath_int function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xpath_int function
7/21/2022 • 2 minutes to read
Syntax
xpath_int(xml, xpath)
Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.
Returns
An INTEGER.
The result is zero if no match is found, or if a match is found but the value is non-numeric.
The function raises an error if xml or xpath are malformed.
Examples
> SELECT xpath_int('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
3
Related functions
xpath function
xpath_boolean function
xpath_double function
xpath_float function
xpath_long function
xpath_number function
xpath_short function
xpath_string function
xpath_long function
7/21/2022 • 2 minutes to read
Syntax
xpath_long(xml, xpath)
Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.
Returns
A BIGINT.
The result is zero if no match is found, or if a match is found but the value is non-numeric.
The function raises an error if xml or xpath are malformed.
Examples
> SELECT xpath_long('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
3
Related functions
xpath function
xpath_boolean function
xpath_double function
xpath_float function
xpath_int function
xpath_number function
xpath_short function
xpath_string function
xpath_number function
7/21/2022 • 2 minutes to read
Syntax
xpath_number(xml, xpath)
Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.
Returns
A DOUBLE.
The result is zero if no match is found, or NaN if a match is found but the value is non-numeric.
The function raises an error if xml or xpath are malformed.
Examples
> SELECT xpath_number('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
3.0
Related functions
xpath function
xpath_boolean function
xpath_float function
xpath_int function
xpath_long function
xpath_double function
xpath_short function
xpath_string function
xpath_short function
7/21/2022 • 2 minutes to read
Syntax
xpath_short(xml, xpath)
Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.
Returns
The result is SMALLINT.
The result is zero if no match is found, or if a match is found but the value is non-numeric.
The function raises an error if xml or xpath are malformed.
Examples
> SELECT xpath_short('<a><b>1</b><b>2</b></a>', 'sum(a/b)');
3
Related functions
xpath function
xpath_boolean function
xpath_double function
xpath_float function
xpath_long function
xpath_number function
xpath_int function
xpath_string function
xpath_string function
7/21/2022 • 2 minutes to read
Returns the contents of the first XML node that matches the XPath expression.
Syntax
xpath_string(xml, xpath)
Arguments
xml : A STRING expression of XML.
xpath : A STRING expression that is a well formed XPath.
Returns
The result is STRING.
The function raises an error if xml or xpath are malformed.
Examples
> SELECT xpath_string('<a><b>b</b><c>cc</c></a>','a/c');
cc
Related functions
xpath function
xpath_boolean function
xpath_double function
xpath_float function
xpath_long function
xpath_number function
xpath_int function
xpath_short function
xxhash64 function
7/21/2022 • 2 minutes to read
Syntax
xxhash64(expr1 [, ...] )
Arguments
exprN : An expression of any type.
Returns
A BIGINT.
Examples
> SELECT xxhash64('Spark', array(123), 2);
5602566077635097486
Related functions
hash function
crc32 function
year function
7/21/2022 • 2 minutes to read
Syntax
year(expr)
Arguments
expr : A DATE or TIMESTAMP expression.
Returns
An INTEGER.
This function is a synonym for extract(YEAR FROM expr) .
Examples
> SELECT year('2016-07-30');
2016
Related functions
dayofmonth function
dayofweek function
day function
hour function
minute function
extract function
zip_with function
7/21/2022 • 2 minutes to read
Merges the arrays in expr1 and expr2 , element-wise, into a single array using func .
Syntax
zip_with(expr1, expr2, func)
Arguments
expr1 : An ARRAY expression.
expr2 : An ARRAY expression.
func : A lambda function taking two parameters.
Returns
An ARRAY of the result of the lambda function.
If one array is shorter, nulls are appended at the end to match the length of the longer array before applying
func .
Examples
> SELECT zip_with(array(1, 2, 3), array('a', 'b', 'c'), (x, y) -> (y, x));
[{a, 1}, {b, 2}, {c, 3}]
> SELECT zip_with(array(1, 2), array(3, 4), (x, y) -> x + y);
[4,6]
> SELECT zip_with(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y));
[ad, be, cf]
Related functions
array_distinct function
array_intersect function
array_except function
array_sort function
array_remove function
array_union function
Identifiers
7/21/2022 • 2 minutes to read
An identifier is a string used to identify an object such as a table, view, schema, or column. Databricks Runtime
has regular identifiers and delimited identifiers, which are enclosed within backticks. All identifiers are case-
insensitive.
Syntax
Regular identifiers
{ letter | digit | '_' } [ ... ]
NOTE
If spark.sql.ansi.enabled is set to true , you cannot use an ANSI SQL reserved keyword as an identifier. For details,
see ANSI Compliance.
Delimited identifiers
`c [ ... ]`
Parameters
letter : Any letter from A-Z or a-z.
digit : Any numeral from 0 to 9.
c : Any character from the character set. Use ` to escape special characters (for example, `.` ).
Examples
-- This CREATE TABLE fails because of the illegal identifier name a.b
CREATE TABLE test (a.b int);
no viable alternative at input 'CREATE TABLE test (a.'(line 1, pos 20)
-- This CREATE TABLE fails because the special character ` is not escaped
CREATE TABLE test1 (`a`b` int);
no viable alternative at input 'CREATE TABLE test (`a`b`'(line 1, pos 23)
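For contrast, a hypothetical example (the table name test2 is an assumption, not from the original page) that makes the special character legal by delimiting the identifier with backticks:
-- Delimiting the identifier with backticks makes the dot legal.
CREATE TABLE test2 (`a.b` int);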
NULL semantics
A table consists of a set of rows and each row contains a set of columns. A column is associated with a data type
and represents a specific attribute of an entity (for example, age is a column of an entity called person ).
Sometimes, the value of a column specific to a row is not known at the time the row comes into existence. In
SQL , such values are represented as NULL . This section details the semantics of NULL values handling in
various operators, expressions and other SQL constructs.
The following illustrates the schema layout and data of a table named person . The data contains NULL values in
the age column and this table is used in various examples in the sections below.
Id Name Age
--- -------- ----
100 Joe 30
200 Marry NULL
300 Mike 18
400 Fred 50
500 Albert NULL
600 Michelle 30
700 Dan 50
Comparison operators
Databricks Runtime supports the standard comparison operators such as > , >= , = , < and <= . The result of
these operators is unknown or NULL when one of the operands or both the operands are unknown or NULL . In
order to compare the NULL values for equality, Databricks Runtime provides a null-safe equal operator ( <=> ),
which returns False when one of the operands is NULL and returns True when both operands are NULL .
The following table illustrates the behavior of comparison operators when one or both operands are NULL :
Left Operand   Right Operand   >      >=     =      <      <=     <=>
NULL           Any value       NULL   NULL   NULL   NULL   NULL   False
Any value      NULL            NULL   NULL   NULL   NULL   NULL   False
NULL           NULL            NULL   NULL   NULL   NULL   NULL   True
Examples
-- Normal comparison operators return `NULL` when one of the operands is `NULL`.
> SELECT 5 > null AS expression_output;
expression_output
-----------------
null
-- Normal comparison operators return `NULL` when both the operands are `NULL`.
> SELECT null = null AS expression_output;
expression_output
-----------------
null
-- Null-safe equal operator returns `False` when one of the operands is `NULL`
> SELECT 5 <=> null AS expression_output;
expression_output
-----------------
false
-- Null-safe equal operator returns `True` when both operands are `NULL`
> SELECT NULL <=> NULL;
expression_output
-----------------
true
-----------------
Logical operators
Databricks Runtime supports standard logical operators such as AND , OR and NOT . These operators take
Boolean expressions as the arguments and return a Boolean value.
The following tables illustrate the behavior of logical operators when one or both operands are NULL .
Operand   NOT
NULL      NULL
Examples
-- `OR` returns `true` when at least one operand is `true`, even if the other operand is `NULL`.
> SELECT (true OR null) AS expression_output;
expression_output
-----------------
true
-- `OR` returns `NULL` when the result cannot be determined from the non-`NULL` operand.
> SELECT (null OR false) AS expression_output;
expression_output
-----------------
null
-- `NOT` of `NULL` is `NULL`.
> SELECT NOT(null) AS expression_output;
expression_output
-----------------
null
Expressions
The comparison operators and logical operators are treated as expressions in Databricks Runtime. Databricks
Runtime also supports other forms of expressions, which can be broadly classified as:
Null intolerant expressions
Expressions that can process NULL value operands
The result of these expressions depends on the expression itself.
Null intolerant expressions
Null intolerant expressions return NULL when one or more arguments of the expression are NULL ; most
expressions fall into this category.
Examples
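As an added minimal illustration (not from the original page), an arithmetic expression is null intolerant:
> SELECT 1 + null AS expression_output;
  expression_output
  -----------------
  null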
-- `count(*)` does not skip `NULL` values.
> SELECT count(*) FROM person;
count(1)
--------
7
Examples
-- Persons whose age is unknown (`NULL`) are filtered out from the result set.
> SELECT * FROM person WHERE age > 0;
name age
-------- ---
Michelle 30
Fred 50
Mike 18
Dan 50
Joe 30
-- A self join case with a join condition `p1.age = p2.age AND p1.name = p2.name`.
-- The persons with unknown age (`NULL`) are filtered out by the join operator.
> SELECT * FROM person p1, person p2
WHERE p1.age = p2.age
AND p1.name = p2.name;
name age name age
-------- --- -------- ---
Michelle 30 Michelle 30
Fred 50 Fred 50
Mike 18 Mike 18
Dan 50 Dan 50
Joe 30 Joe 30
-- The age column from both legs of join are compared using null-safe equal which
-- is why the persons with unknown age (`NULL`) are qualified by the join.
> SELECT * FROM person p1, person p2
WHERE p1.age <=> p2.age
AND p1.name = p2.name;
name age name age
-------- ---- -------- ----
Albert null Albert null
Michelle 30 Michelle 30
Fred 50 Fred 50
Mike 18 Mike 18
Dan 50 Dan 50
Marry null Marry null
Joe 30 Joe 30
-- All `NULL` ages are considered one distinct value in `DISTINCT` processing.
> SELECT DISTINCT age FROM person;
age
----
null
50
30
18
-- `NULL` values from two legs of the `EXCEPT` are not in output.
-- This basically shows that the comparison happens in a null-safe manner.
> SELECT age, name FROM person
EXCEPT
SELECT age, name FROM unknown_age;
age name
--- --------
30 Joe
50 Fred
30 Michelle
18 Mike
50 Dan
NOT IN always returns UNKNOWN when the list contains NULL , regardless of the input value. This is because
IN returns UNKNOWN if the value is not in the list containing NULL , and because NOT UNKNOWN is again UNKNOWN .
Examples
-- The subquery has only `NULL` value in its result set. Therefore,
-- the result of `IN` predicate is UNKNOWN.
> SELECT * FROM person WHERE age IN (SELECT null);
name age
---- ---
-- The subquery has `NULL` value in the result set as well as a valid
-- value `50`. Rows with age = 50 are returned.
> SELECT * FROM person
WHERE age IN (SELECT age FROM VALUES (50), (null) sub(age));
name age
---- ---
Fred 50
Dan 50
-- Since subquery has `NULL` value in the result set, the `NOT IN`
-- predicate would return UNKNOWN. Hence, no rows are
-- qualified for this query.
> SELECT * FROM person
WHERE age NOT IN (SELECT age FROM VALUES (50), (null) sub(age));
name age
---- ---
Information schema (Databricks Runtime)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
The INFORMATION_SCHEMA is a SQL Standard based, system provided schema present in every catalog other than
the HIVE_METASTORE catalog.
Within the information schema you find a set of views describing the objects known to the schema’s catalog that
you are privileged to see. The information schema of the SYSTEM catalog returns information about objects
across all catalogs within the metastore.
The purpose of the information schema is to provide a SQL based, self describing API to the metadata.
Since: Databricks Runtime 10.2
TABLE_PRIVILEGES Lists principals which have privileges on the tables and views
in the catalog.
Notes
While identifiers are case insensitive when referenced in SQL statements they are stored in the information
schema as STRING . This implies that you must either search for them using the case in which the identifier is
stored, or use functions such as ilike.
Examples
> SELECT table_name, column_name
FROM information_schema.columns
WHERE data_type = 'DOUBLE'
AND table_schema = 'information_schema';
Related articles
SHOW
DESCRIBE
INFORMATION_SCHEMA.CATALOG_PRIVILEGES
(Databricks Runtime)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Definition
The CATALOG_PRIVILEGES relation contains the following columns:
Constraints
The following constraints apply to the CATALOG_PRIVILEGES relation:
Examples
> SELECT catalog_name, grantee
FROM information_schema.catalog_privileges;
Related
Information schema
INFORMATION_SCHEMA.CATALOGS
SHOW GRANTS
INFORMATION_SCHEMA.CATALOGS (Databricks
Runtime)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Describes catalogs.
Information is displayed only for catalogs the user has permission to interact with.
Since: Databricks Runtime 10.2
Definition
The CATALOGS relation contains the following columns:
Constraints
The following constraints apply to the CATALOGS relation:
Examples
> SELECT catalog_owner
FROM information_schema.catalogs
system
Related
Information schema
INFORMATION_SCHEMA.CATALOG_PRIVILEGES
INFORMATION_SCHEMA.INFORMATION_SCHEMA_CATALOG_NAME
INFORMATION_SCHEMA.CHECK_CONSTRAINTS
(Databricks Runtime)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Definition
The CHECK_CONSTRAINTS relation contains the following columns:
Constraints
The following constraints apply to the CHECK_CONSTRAINTS relation:
Examples
> SELECT constraint_name, check_clause
FROM information_schema.check_constraints
WHERE table_schema = 'information_schema';
Related
Information schema
INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS
INFORMATION_SCHEMA.SCHEMATA
INFORMATION_SCHEMA.COLUMNS (Databricks
Runtime)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Definition
The COLUMNS relation contains the following columns:
CHARACTER_MAXIMUM_LENGTH  INTEGER  Yes  Yes  Always NULL , reserved for future use.
IS_SYSTEM_TIME_PERIOD_START  STRING  No  Yes  Always NO , reserved for future use.
IS_SYSTEM_TIME_PERIOD_END  STRING  No  Yes  Always NO , reserved for future use.
SYSTEM_TIME_PERIOD_TIMESTAMP_GENERATION  STRING  Yes  Yes  Always NULL , reserved for future use.
PARTITION_ORDINAL_POSITION  INTEGER  Yes  No  Position (numbered from 1 ) of the column in the partition, NULL if not a partitioning column.
Constraints
The following constraints apply to the COLUMNS relation:
Examples
> SELECT ordinal_position, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'information_schema'
AND table_name = 'catalog_privileges'
ORDER BY ordinal_position;
1 grantor STRING
2 grantee STRING
3 catalog_name STRING
4 privilege_type STRING
5 is_grantable STRING
Related
DESCRIBE TABLE
Information schema
INFORMATION_SCHEMA.TABLES
SHOW COLUMNS
SHOW TABLE
SHOW TABLES
INFORMATION_SCHEMA.INFORMATION_SCHEMA_CATALOG_NAME
(Databricks Runtime)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Definition
The INFORMATION_SCHEMA_CATALOG_NAME relation contains the following columns:
Constraints
The following constraints apply to the INFORMATION_SCHEMA_CATALOG_NAME relation:
Examples
> SELECT catalog_name
FROM information_schema.information_schema_catalog_name
default
Related
Information schema
INFORMATION_SCHEMA.CATALOGS
INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS
(Databricks Runtime)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Definition
The REFERENTIAL_CONSTRAINTS relation contains the following columns:
UNIQUE_CONSTRAINT_CATALOG  STRING  No  Yes  Catalog containing the referenced constraint.
UNIQUE_CONSTRAINT_SCHEMA  STRING  No  Yes  Database (schema) containing the referenced constraint.
Constraints
The following constraints apply to the REFERENTIAL_CONSTRAINTS relation:
Examples
> SELECT constraint_name, check_clause
FROM information_schema.referential_constraints
WHERE table_schema = 'information_schema';
Related
Information schema
INFORMATION_SCHEMA.CHECK_CONSTRAINTS
INFORMATION_SCHEMA.SCHEMATA
INFORMATION_SCHEMA.SCHEMA_PRIVILEGES
(Databricks Runtime)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Definition
The SCHEMA_PRIVILEGES relation contains the following columns:
Constraints
The following constraints apply to the SCHEMA_PRIVILEGES relation:
Examples
> SELECT catalog_name, schema_name, grantee
FROM information_schema.schema_privileges;
Related
Information schema
INFORMATION_SCHEMA.SCHEMATA
SHOW GRANTS
INFORMATION_SCHEMA.SCHEMATA (Databricks
Runtime)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Definition
The SCHEMATA relation contains the following columns:
Constraints
The following constraints apply to the SCHEMATA relation:
Examples
> SELECT schema_owner
FROM information_schema.schemata
WHERE schema_name = 'information_schema';
system
Related
DESCRIBE DATABASE
Information schema
INFORMATION_SCHEMA.CATALOGS
INFORMATION_SCHEMA.SCHEMA_PRIVILEGES
SHOW DATABASES
INFORMATION_SCHEMA.TABLE_PRIVILEGES
(Databricks Runtime)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Definition
The TABLE_PRIVILEGES relation contains the following columns:
Constraints
The following constraints apply to the TABLE_PRIVILEGES relation:
Examples
> SELECT table_catalog, table_schema, table_name, grantee
FROM information_schema.table_privileges;
Related
Information schema
INFORMATION_SCHEMA.TABLES
SHOW GRANTS
INFORMATION_SCHEMA.TABLES (Databricks
Runtime)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Contains the object level meta data for tables and views (relations) within the local catalog, or all catalogs, if
owned by the SYSTEM catalog.
The rows returned are limited to the relations the user is privileged to interact with.
Since: Databricks Runtime 10.2
Definition
The TABLES relation contains the following columns:
Constraints
The following constraints apply to the TABLES relation:
Examples
> SELECT table_owner
FROM information_schema.tables
WHERE table_schema = 'information_schema'
AND table_name = 'columns';
system
Related
DESCRIBE TABLE
Information schema
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.SCHEMATA
INFORMATION_SCHEMA.TABLE_PRIVILEGES
INFORMATION_SCHEMA.VIEWS
SHOW CREATE TABLE
SHOW PARTITIONS
SHOW TABLE
SHOW TABLES
SHOW TBLPROPERTIES
INFORMATION_SCHEMA.VIEWS (Databricks
Runtime)
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Definition
The VIEWS relation contains the following columns:
Constraints
The following constraints apply to the VIEWS relation:
Examples
> SELECT is_insertable_into
FROM information_schema.views
WHERE table_schema = 'information_schema'
AND table_name = 'columns';
NO
Related
DESCRIBE TABLE
Information schema
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.SCHEMATA
INFORMATION_SCHEMA.TABLES
INFORMATION_SCHEMA.TABLE_PRIVILEGES
SHOW CREATE TABLE
SHOW VIEWS
ANSI compliance
7/21/2022 • 16 minutes to read
Spark SQL has two options to support compliance with the ANSI SQL standard: spark.sql.ansi.enabled and
spark.sql.storeAssignmentPolicy .
When spark.sql.ansi.enabled is set to true , Spark SQL uses an ANSI compliant dialect instead of being Hive
compliant. For example, Spark will throw an exception at runtime instead of returning null results if the inputs to
a SQL operator/function are invalid. Some ANSI dialect features may not come directly from the ANSI SQL
standard, but their behaviors align with ANSI SQL's style.
Moreover, Spark SQL has an independent option to control implicit casting behaviours when inserting rows in a
table. The casting behaviours are defined as store assignment rules in the standard.
When spark.sql.storeAssignmentPolicy is set to ANSI , Spark SQL complies with the ANSI store assignment
rules. This is a separate configuration because its default value is ANSI , while the configuration
spark.sql.ansi.enabled is disabled by default.
The following subsections present behavior changes in arithmetic operations, type conversions, and SQL
parsing when ANSI mode is enabled. There are three kinds of type conversions in Spark SQL, and this article
introduces them one by one: cast, store assignment, and type coercion.
Arithmetic operations
In Spark SQL, arithmetic operations performed on numeric types (with the exception of decimal) are not
checked for overflows by default. This means that if an operation causes an overflow, the result is the same as
the corresponding operation in a Java or Scala program (for example, if the sum of 2 integers is higher
than the maximum value representable, the result is a negative number). On the other hand, Spark SQL returns
null for decimal overflows. When spark.sql.ansi.enabled is set to true and an overflow occurs in numeric and
interval arithmetic operations, it throws an arithmetic exception at runtime.
-- `spark.sql.ansi.enabled=true`
> SELECT 2147483647 + 1;
error: integer overflow
-- `spark.sql.ansi.enabled=false`
> SELECT 2147483647 + 1;
-2147483648
Cast
When spark.sql.ansi.enabled is set to true , explicit casting by CAST syntax throws a runtime exception for
illegal cast patterns defined in the standard, such as casts from a string to an integer.
The CAST clause of Spark ANSI mode follows the syntax rules of section 6.13 “cast specification” in ISO/IEC
9075-2:2011 Information technology — Database languages - SQL — Part 2: Foundation (SQL/Foundation),
except it specially allows the following straightforward type conversions which are disallowed as per the ANSI
standard:
NumericType <=> BooleanType
StringType <=> BinaryType
The valid combinations of source and target data type in a CAST expression are given by the following table. “Y”
indicates that the combination is syntactically valid without restriction and “N” indicates that the combination is
not valid.
SOURCE\TARGET  NUMERIC  STRING  DATE  TIMESTAMP  INTERVAL  BOOLEAN  BINARY  ARRAY  MAP  STRUCT
Numeric        Y        Y       N     N          N         Y        N       N      N    N
String         Y        Y       Y     Y          Y         Y        Y       N      N    N
Date           N        Y       Y     Y          N         N        N       N      N    N
Timestamp      N        Y       Y     Y          N         N        N       N      N    N
Interval       N        Y       N     N          Y         N        N       N      N    N
Boolean        Y        Y       N     N          N         Y        N       N      N    N
Binary         Y        N       N     N          N         N        Y       N      N    N
Array          N        N       N     N          N         N        N       Y      N    N
Map            N        N       N     N          N         N        N       N      Y    N
Struct         N        N       N     N          N         N        N       N      N    Y
-- Examples of explicit casting
-- `spark.sql.ansi.enabled=true`
> SELECT CAST('a' AS INT);
error: invalid input syntax for type numeric: a
-- `spark.sql.storeAssignmentPolicy=ANSI`
> INSERT INTO t VALUES ('1');
error: Cannot write incompatible data to table '`default`.`t`':
- Cannot safely cast 'v': string to int;
Store assignment
As mentioned at the beginning, when spark.sql.storeAssignmentPolicy is set to ANSI (which is the default
value), Spark SQL complies with the ANSI store assignment rules on table insertions. The valid combinations of
source and target data type in table insertions are given by the following table.
SOURCE\TARGET  NUMERIC  STRING  DATE  TIMESTAMP  INTERVAL  BOOLEAN  BINARY  ARRAY  MAP  STRUCT
Numeric        Y        Y       N     N          N         N        N       N      N    N
String         N        Y       N     N          N         N        N       N      N    N
Date           N        Y       Y     Y          N         N        N       N      N    N
Timestamp      N        Y       Y     Y          N         N        N       N      N    N
Interval       N        Y       N     N          Y         N        N       N      N    N
Boolean        N        Y       N     N          N         Y        N       N      N    N
Binary         N        Y       N     N          N         N        Y       N      N    N
Array          N        N       N     N          N         N        N       Y*     N    N
Map            N        N       N     N          N         N        N       N      Y*   N
Struct         N        N       N     N          N         N        N       N      N    Y*
For Array/Map/Struct types, the data type check rule applies recursively to its component elements.
During table insertion, Spark will throw an exception on numeric value overflow.
Type coercion
Type Promotion and Precedence
When spark.sql.ansi.enabled is set to true , Spark SQL uses several rules that govern how conflicts between
data types are resolved. At the heart of this conflict resolution is the Type Precedence List which defines whether
values of a given data type can be promoted to another data type implicitly.
DATA TYPE   PRECEDENCE LIST (FROM NARROWEST TO WIDEST)
Byte        Byte -> Short -> Int -> Long -> Decimal -> Float* -> Double
Short       Short -> Int -> Long -> Decimal -> Float* -> Double
Int         Int -> Long -> Decimal -> Float* -> Double
Double      Double
Timestamp   Timestamp
String      String
Binary      Binary
Boolean     Boolean
Interval    Interval
Map         Map**
Array       Array**
Struct      Struct**
* For least common type resolution float is skipped to avoid loss of precision.
** For a complex type, the precedence rule applies recursively to its component elements.
Special rules apply for the String type and untyped NULL. A NULL can be promoted to any other type, while a
String can be promoted to any simple data type.
The precedence list can also be represented as a directed tree.
-- The coalesce function accepts any set of argument types as long as they share a least common type.
-- The result type is the least common type of the arguments.
> SET spark.sql.ansi.enabled=true;
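-- Illustrative query (assumed, not from the original example set): coalesce over TINYINT and
-- BIGINT arguments resolves to their least common type, BIGINT.
> SELECT typeof(coalesce(1Y, 1L, NULL));
bigint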
-- The substring function expects arguments of type INT for the start and length parameters.
> SELECT substring('hello', 1Y, 2);
he
SQL functions
The behavior of some SQL functions can be different under ANSI mode ( spark.sql.ansi.enabled=true ).
size : This function returns null for null input under ANSI mode.
element_at :
This function throws ArrayIndexOutOfBoundsException if using invalid indices.
This function throws NoSuchElementException if key does not exist in map.
elt : This function throws ArrayIndexOutOfBoundsException if using invalid indices.
make_date : This function fails with an exception if the result date is invalid.
make_timestamp : This function fails with an exception if the result timestamp is invalid.
make_interval : This function fails with an exception if the result interval is invalid.
next_day : This function throws IllegalArgumentException if input is not a valid day of week.
parse_url : This function throws IllegalArgumentException if an input string is not a valid url.
to_date : This function fails with an exception if the input string can’t be parsed, or the pattern string is
invalid.
to_timestamp : This function fails with an exception if the input string can’t be parsed, or the pattern string is
invalid.
to_unix_timestamp : This function fails with an exception if the input string can’t be parsed, or the pattern
string is invalid.
unix_timestamp : This function fails with an exception if the input string can’t be parsed, or the pattern string
is invalid.
SQL operators
The behavior of some SQL operators can be different under ANSI mode ( spark.sql.ansi.enabled=true ).
array_col[index] : This operator throws ArrayIndexOutOfBoundsException if using invalid indices.
map_col[key] : This operator throws NoSuchElementException if key does not exist in map.
CAST(string_col AS TIMESTAMP) : This operator fails with an exception if the input string can’t be parsed.
CAST(string_col AS DATE) : This operator fails with an exception if the input string can’t be parsed.
SQL keywords
When spark.sql.ansi.enabled is true, Spark SQL will use the ANSI mode parser. In this mode, Spark SQL has
two kinds of keywords:
Reserved keywords: Keywords that are reserved and can’t be used as identifiers for table, view, column,
function, alias, etc.
Non-reserved keywords: Keywords that have a special meaning only in particular contexts and can be used
as identifiers in other contexts. For example, EXPLAIN SELECT ... is a command, but EXPLAIN can be used as
an identifier in other places.
When the ANSI mode is disabled, Spark SQL has two kinds of keywords:
Non-reserved keywords: Same definition as when ANSI mode is enabled.
Strict-non-reserved keywords: A strict version of non-reserved keywords, which cannot be used as a table
alias.
By default spark.sql.ansi.enabled is false.
Below is a list of all the keywords in Spark SQL.
KEYWORD | SPARK SQL ANSI MODE | SPARK SQL DEFAULT MODE | SQL-2016
ALTER CATALOG
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Syntax
ALTER CATALOG [ catalog_name ] OWNER TO principal
Parameters
catalog_name
The name of the catalog to be altered. If you provide no name the default is hive_metastore .
OWNER TO principal
Transfers ownership of the catalog to principal .
Examples
-- Creates a catalog named `some_cat`.
> CREATE CATALOG some_cat;
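-- A minimal sketch of the ownership transfer described above; the principal name is illustrative.
> ALTER CATALOG some_cat OWNER TO `alf@melmak.et`;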
Related articles
CREATE CATALOG
DESCRIBE CATALOG
DROP CATALOG
SHOW CATALOGS
ALTER STORAGE CREDENTIAL
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Syntax
ALTER STORAGE CREDENTIAL credential_name
{ RENAME TO to_credential_name |
OWNER TO principal }
Parameters
credential_name
Identifies the storage credential being altered.
RENAME TO to_credential_name
Renames the credential to a new name. The name must be unique among all credentials in the metastore.
OWNER TO principal
Transfers ownership of the storage credential to principal .
Examples
> ALTER STORAGE CREDENTIAL street_cred RENAME TO good_cred;
Related articles
DESCRIBE STORAGE CREDENTIAL
DROP STORAGE CREDENTIAL
Principal
SHOW STORAGE CREDENTIAL
ALTER DATABASE
7/21/2022 • 2 minutes to read
Related articles
ALTER SCHEMA
CREATE SCHEMA
DESCRIBE SCHEMA
DROP SCHEMA
SHOW SCHEMAS
ALTER EXTERNAL LOCATION
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Syntax
ALTER EXTERNAL LOCATION location_name
{ RENAME TO to_location_name |
SET URL url [FORCE] |
SET STORAGE CREDENTIAL credential_name |
OWNER TO principal }
Parameters
location_name
Identifies the external location being altered.
RENAME TO to_location_name
Renames the location to a new name. The name must be unique among all locations in the metastore.
SET URL url [FORCE]
url must be a STRING literal with the location of the cloud storage described as an absolute URL.
Unless you specify FORCE the statement will fail if the location is currently in use.
SET STORAGE CREDENTIAL credential_name
Updates the named credential used to access this location. If the credential does not exist Databricks
Runtime raises an error.
OWNER TO principal
Transfers ownership of the storage location to principal .
Examples
-- Rename a location
> ALTER EXTERNAL LOCATION descend_loc RENAME TO decent_loc;
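-- A sketch of changing the URL; the URL is illustrative, and FORCE permits the change while the location is in use.
> ALTER EXTERNAL LOCATION decent_loc SET URL 'abfss://container@account.dfs.core.windows.net/new-path' FORCE;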
Related articles
CREATE EXTERNAL LOCATION
DESCRIBE EXTERNAL LOCATION
DROP EXTERNAL LOCATION
External locations and storage credentials
SHOW EXTERNAL LOCATIONS
ALTER SCHEMA
7/21/2022 • 2 minutes to read
Alters metadata associated with a schema by setting DBPROPERTIES . The specified property values override any
existing value with the same property name. An error message is issued if the schema is not found in the
system. This command is mostly used to record the metadata for a schema and may be used for auditing
purposes.
While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.
Syntax
ALTER { SCHEMA | DATABASE } schema_name
{ SET DBPROPERTIES ( { key = val } [, ...] ) |
OWNER TO principal }
Parameters
schema_name
The name of the schema to be altered.
DBPROPERTIES ( key = val [, …] )
The schema properties to be set or unset.
OWNER TO principal
Transfers ownership of the schema to principal .
Examples
-- Creates a schema named `inventory`.
> CREATE SCHEMA inventory;
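-- A sketch of setting schema properties on the schema created above; the property keys and values are illustrative.
> ALTER SCHEMA inventory SET DBPROPERTIES ('Edited-by' = 'John', 'Edit-date' = '01/01/2001');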
Related articles
CREATE SCHEMA
DESCRIBE SCHEMA
DROP SCHEMA
SHOW SCHEMAS
ALTER SHARE
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Adds or removes tables to or from the share. Transfers the ownership of a share to a new principal.
Since: Databricks Runtime 10.3
Syntax
ALTER SHARE share_name
{ alter_table |
REMOVE TABLE clause }
alter_table
{ ADD [ TABLE ] table_name [ COMMENT comment ]
[ PARTITION clause ] [ AS table_share_name ] }
Parameters
share_name
The name of the share to be altered.
alter_table
Examples
-- Creates a share named `some_share`.
> CREATE SHARE some_share;
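-- A sketch of adding a table to the share; the table name is illustrative.
> ALTER SHARE some_share ADD TABLE my_catalog.my_schema.my_table;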
Related articles
CREATE SHARE
DESCRIBE SHARE
DROP SHARE
SHOW SHARES
ALTER TABLE
7/21/2022 • 10 minutes to read
Required permissions
If you use Unity Catalog you must have OWNERSHIP on the table to use ALTER TABLE to:
Change the owner
Grant permissions on the table
Change the table name
For all other metadata operations on a table (for example updating comments, properties, or columns) you can
make updates if you have the MODIFY permission on the table.
Syntax
ALTER TABLE table_name
{ RENAME TO clause |
ADD COLUMN clause |
ALTER COLUMN clause |
DROP COLUMN clause |
RENAME COLUMN clause |
ADD CONSTRAINT clause |
DROP CONSTRAINT clause |
ADD PARTITION clause |
DROP PARTITION clause |
RENAME PARTITION clause |
RECOVER PARTITIONS clause |
SET TBLPROPERTIES clause |
UNSET TBLPROPERTIES clause |
SET SERDE clause |
SET LOCATION clause |
OWNER TO clause }
Parameters
table_name
Identifies the table being altered. The name must not include a temporal specification.
RENAME TO to_table_name
Renames the table within the same schema.
to_table_name
Identifies the new table name. The name must not include a temporal specification.
ADD COLUMN
This clause is not supported for JDBC data sources.
Adds one or more columns to the table, or fields to existing columns in a Delta Lake table.
column_identifier
The name of the column to be added. The name must be unique within the table.
Unless FIRST or AFTER name are specified the column or field will be appended at the end.
field_name
The fully qualified name of the field to be added to an existing column. All components of the path
to the nested field must exist and the field name itself must be unique.
COMMENT comment
An optional STRING literal describing the added column or field.
FIRST
If specified the column will be added as the first column of the table, or the field will be added as
the first field in the containing struct.
AFTER identifier
If specified the column or field will be added immediately after the field or column identifier .
ALTER COLUMN
Changes a property or the location of a column.
column_identifier
The name of the column to be altered.
field_name
The fully qualified name of the field to be altered. All components of the path to the nested field
must exist.
COMMENT comment
Changes the description of the column_name column. comment must be a STRING literal.
FIRST or AFTER identifier
Moves the column from its current position to the front ( FIRST ) or immediately AFTER the
identifier . This clause is only supported if table_name is a Delta table.
SET NOT NULL or DROP NOT NULL
Changes the domain of valid column values to exclude nulls ( SET NOT NULL ) or include nulls
( DROP NOT NULL ). This option is only supported for Delta Lake tables. Delta Lake will ensure the
constraint is valid for all existing and new data.
SYNC IDENTITY
Since: Databricks Runtime 10.3
Synchronize the metadata of an identity column with the actual data. When you write your own
values to an identity column, it might not comply with the metadata. This option evaluates the
state and updates the metadata to be consistent with the actual data. After this command, the next
automatically assigned identity value will start from start + (n + 1) * step , where n is the
smallest value that satisfies start + n * step >= max() (for a positive step).
This option is only supported for identity columns on Delta Lake tables.
DROP COLUMN
Since: Databricks Runtime 11.0
Drop one or more columns or fields in a Delta Lake table.
When you drop a column or field, you must drop dependent check constraints and generated columns.
IF EXISTS
When you specify IF EXISTS , Databricks Runtime ignores an attempt to drop columns that do not
exist. Otherwise, dropping non-existing columns will cause an error.
column_identifier
The name of the existing column.
field_name
The fully qualified name of an existing field.
RENAME COLUMN
Renames a column or field in a Delta Lake table.
When you rename a column or field you also need to change dependent check constraints and generated
columns. Any primary keys and foreign keys using the column will be dropped. In case of foreign keys
you must own the table on which the foreign key is defined.
column_identifier
The existing name of the column.
to_column_identifier
The new column identifier. The identifier must be unique within the table.
field_name
The existing fully qualified name of a field.
to_field_identifier
The new field identifier. The identifier must be unique within the local struct.
ADD CONSTRAINT
Adds a check constraint, foreign key constraint, or primary key constraint to the table.
Foreign keys and primary keys are not supported for tables in the hive_metastore catalog.
DROP CONSTRAINT
Drops a primary key, foreign key, or check constraint from the table.
ADD PARTITION
If specified adds one or more partitions to the table. Adding partitions is not supported for Delta Lake
tables.
IF NOT EXISTS
An optional clause directing Databricks Runtime to ignore the statement if the partition already
exists.
PARTITION clause
A partition to be added. The partition keys must match the partitioning of the table and be
associated with values. If the partition already exists an error is raised unless IF NOT EXISTS has
been specified.
LOCATION path
path must be a STRING literal representing an optional location pointing to the partition.
If no location is specified the location will be derived from the location of the table and the
partition keys.
If there are files present at the location they populate the partition and must be compatible with
the data_source of the table and its options.
DROP PARTITION
If specified this clause drops one or more partitions from the table, optionally deleting any files at the
partitions’ locations.
Delta Lake tables do not support dropping of partitions.
IF EXISTS
When you specify IF EXISTS Azure Databricks will ignore an attempt to drop partitions that do
not exist. Otherwise, dropping non-existing partitions will cause an error.
PARTITION clause
Specifies a partition to be dropped. If the partition is only partially identified a slice of partitions is
dropped.
PURGE
If set, the table catalog must remove partition data by skipping the Trash folder even when the
catalog has configured one. The option is applicable only for managed tables. It is effective only
when:
The file system supports a Trash folder.
The catalog has been configured for moving the dropped partition to the Trash folder.
There is no Trash folder in AWS S3, so PURGE is not effective there.
There is no need to manually delete files after dropping partitions.
RENAME PARTITION
Replaces the keys of a partition.
Delta Lake tables do not support renaming partitions.
from_partition_clause
The definition of the partition to be renamed.
to_partition_clause
The new definition for this partition. A partition with the same keys must not already exist.
RECOVER PARTITIONS
This clause does not apply to Delta Lake tables.
Instructs Databricks Runtime to scan the table’s location and add any files to the table which have been
added directly to the filesystem.
SET TBLPROPERTIES
Sets or resets one or more user defined properties.
UNSET TBLPROPERTIES
Removes one or more user defined properties.
SET LOCATION
Moves the location of a partition or table.
Delta Lake does not support moving individual partitions of a Delta Lake table.
PARTITION clause
Optionally identifies the partition for which the location will be changed. If you omit naming a
partition Azure Databricks moves the location of the table.
LOCATION path
path must be a STRING literal. Specifies the new location for the partition or table.
Files in the original location will not be moved to the new location.
OWNER TO principal
Transfers ownership of the table to principal .
Examples
For Delta Lake add and alter column examples, see
Explicitly update schema
Add columns
Change column comment or ordering
Constraints
-- RENAME table
> DESCRIBE student;
col_name data_type comment
----------------------- --------- -------
name string NULL
rollno int NULL
age int NULL
# Partition Information
# col_name data_type comment
age int NULL
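-- Illustrative (assumed) statement completing the table rename announced above:
> ALTER TABLE student RENAME TO StudentInfo;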
-- RENAME partition
> SHOW PARTITIONS StudentInfo;
partition
---------
age=10
age=11
age=12
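-- Illustrative (assumed) statement renaming one of the partitions listed above:
> ALTER TABLE StudentInfo PARTITION (age=10) RENAME TO PARTITION (age=15);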
> ALTER TABLE StudentInfo ADD columns (LastName string, DOB timestamp);
> ALTER TABLE StudentInfo ADD IF NOT EXISTS PARTITION (age=18) PARTITION (age=20);
-- RENAME COLUMN
> ALTER TABLE StudentInfo RENAME COLUMN name TO FirstName;
> ALTER TABLE dbx.tab1 SET SERDE 'org.apache.hadoop' WITH SERDEPROPERTIES ('k' = 'v', 'kay' = 'vee')
Related articles
ALTER VIEW
ADD CONSTRAINT
COMMENT ON
CREATE TABLE
DROP CONSTRAINT
DROP TABLE
MSCK REPAIR TABLE
PARTITION
ALTER VIEW
7/21/2022 • 2 minutes to read
Alters metadata associated with the view. It can change the definition of the view, rename the view, and set or
unset the metadata of the view by setting TBLPROPERTIES .
If the view is cached, the command clears the cached data of the view and of all its dependents that refer to it.
The view's cache will be lazily filled the next time the view is accessed. The command leaves the view's
dependents uncached.
Syntax
ALTER VIEW view_name
{ rename |
SET TBLPROPERTIES clause |
UNSET TBLPROPERTIES clause |
alter_body |
owner_to }
rename
RENAME TO to_view_name
alter_body
AS query
property_key
{ identifier [. ...] | string_literal }
owner_to
OWNER TO principal
Parameters
view_name
Identifies the view to be altered.
RENAME TO to_view_name
Renames the existing view within the schema.
to_view_name specifies the new name of the view. If the to_view_name already exists, a
TableAlreadyExistsException is thrown. If to_view_name is qualified it must match the schema name of
view_name .
SET TBLPROPERTIES
Sets or resets one or more user defined properties.
UNSET TBLPROPERTIES
Removes one or more user defined properties.
AS query
A query that constructs the view from base tables or other views.
This clause is equivalent to a CREATE OR REPLACE VIEW statement on an existing view.
OWNER TO principal
Transfers ownership of the view to principal . Unless the view is defined in the hive_metastore you may
only transfer ownership to a group you belong to.
Examples
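A minimal sketch of typical usage; the view name and property key are illustrative.
> ALTER VIEW sales_v RENAME TO sales_v2;
> ALTER VIEW sales_v2 SET TBLPROPERTIES ('created.by.user' = 'John');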
Related articles
DESCRIBE TABLE
CREATE VIEW
DROP VIEW
SHOW VIEWS
CREATE CATALOG
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Creates a catalog with the specified name. If a catalog with the same name already exists, an exception is thrown.
Since: Databricks Runtime 10.3
Syntax
CREATE CATALOG [ IF NOT EXISTS ] catalog_name
[ COMMENT comment ]
Parameters
catalog_name
The name of the catalog to be created.
IF NOT EXISTS
Creates a catalog with the given name if it does not exist. If a catalog with the same name already exists,
nothing will happen.
comment
An optional STRING literal. The description for the catalog.
Examples
-- Create catalog `customer_cat`. This throws exception if catalog with name customer_cat
-- already exists.
> CREATE CATALOG customer_cat;
-- Create catalog `customer_cat` only if catalog with same name doesn't exist.
> CREATE CATALOG IF NOT EXISTS customer_cat;
-- Create catalog `customer_cat` only if catalog with same name doesn't exist, with a comment.
> CREATE CATALOG IF NOT EXISTS customer_cat COMMENT 'This is customer catalog';
Related articles
DESCRIBE CATALOG
DROP CATALOG
CREATE DATABASE
7/21/2022 • 2 minutes to read
Related articles
CREATE SCHEMA
DESCRIBE SCHEMA
DROP SCHEMA
CREATE FUNCTION (External)
7/21/2022 • 3 minutes to read
Creates a temporary or permanent external function. Temporary functions are scoped at a session level whereas
permanent functions are created in the persistent catalog and are made available to all sessions. The
resources specified in the USING clause are made available to all executors when they are executed for the first
time.
In addition to the SQL interface, Spark allows you to create custom user defined scalar and aggregate functions
using Scala, Python, and Java APIs. See User-defined scalar functions (UDFs) and User-defined aggregate
functions (UDAFs) for more information.
Syntax
CREATE [ OR REPLACE ] [ TEMPORARY ] FUNCTION [ IF NOT EXISTS ]
function_name AS class_name [ resource_locations ]
Parameters
OR REPLACE
If specified, the resources for the function are reloaded. This is mainly useful to pick up any changes made
to the implementation of the function. This parameter is mutually exclusive to IF NOT EXISTS and cannot
be specified together.
TEMPORARY
Indicates the scope of the function being created. When TEMPORARY is specified, the created function is valid
and visible in the current session. No persistent entry is made in the catalog for this kind of function.
IF NOT EXISTS
If specified, creates the function only when it does not exist. The creation of function succeeds (no error is
thrown) if the specified function already exists in the system. This parameter is mutually exclusive to
OR REPLACE and cannot be specified together.
function_name
A name for the function. The function name may be optionally qualified with a schema name.
class_name
The name of the class that provides the implementation for the function to be created. The implementing
class should extend one of the base classes as follows:
Should extend UDF or UDAF in org.apache.hadoop.hive.ql.exec package.
Should extend AbstractGenericUDAFResolver, GenericUDF , or GenericUDTF in
org.apache.hadoop.hive.ql.udf.generic package.
Should extend UserDefinedAggregateFunction in org.apache.spark.sql.expressions package.
resource_locations
The list of resources that contain the implementation of the function along with its dependencies.
Syntax: USING { { (JAR | FILE | ARCHIVE) resource_uri } , ... }
Examples
-- 1. Create a simple UDF `SimpleUdf` that increments the supplied integral value by 10.
-- import org.apache.hadoop.hive.ql.exec.UDF;
-- public class SimpleUdf extends UDF {
-- public int evaluate(int value) {
-- return value + 10;
-- }
-- }
-- 2. Compile and place it in a JAR file called `SimpleUdf.jar` in /tmp.
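-- 3. Illustrative (assumed) registration of the function from the JAR:
> CREATE FUNCTION simple_udf AS 'SimpleUdf' USING JAR '/tmp/SimpleUdf.jar';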
Related articles
CREATE FUNCTION (SQL)
SHOW FUNCTIONS
DESCRIBE FUNCTION
DROP FUNCTION
CREATE EXTERNAL LOCATION
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Creates an external location with the specified name. If a location with the same name already exists, an
exception is thrown.
Since: Databricks Runtime 10.3
Syntax
CREATE EXTERNAL LOCATION [IF NOT EXISTS] location_name
URL url
WITH (STORAGE CREDENTIAL credential_name)
[COMMENT comment]
Parameters
location_name
The name of the location to be created.
IF NOT EXISTS
Creates a location with the given name if it does not exist. If a location with the same name already exists,
nothing will happen.
url
A STRING literal with the location of the cloud storage described as an absolute URL.
credential_name
The named credential used to connect to this location.
comment
An optional description for the location, or NULL . The default is NULL .
Examples
-- Create a location accessed using the abfss_remote_cred credential
> CREATE EXTERNAL LOCATION abfss_remote URL 'abfss://us-east-1/location'
WITH (STORAGE CREDENTIAL abfss_remote_cred)
COMMENT 'Default source for Azure external data';
Related articles
ALTER EXTERNAL LOCATION
DESCRIBE EXTERNAL LOCATION
DROP EXTERNAL LOCATION
External locations and storage credentials
SHOW EXTERNAL LOCATIONS
CREATE RECIPIENT
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Creates a recipient with the specified name and generates an activation link. If a recipient with the same name
already exists, an exception is thrown.
To create and manage a recipient you must be a metastore administrator and Databricks Runtime must be
configured with the Unity Catalog metastore.
Use DESCRIBE RECIPIENT to retrieve the activation link.
Since: Databricks Runtime 10.3
Syntax
CREATE RECIPIENT [ IF NOT EXISTS ] recipient_name
[ COMMENT comment ]
Parameters
recipient_name
The name of the recipient to be created.
IF NOT EXISTS
Creates a recipient with the given name if it does not exist. If a recipient with the same name already
exists, nothing will happen.
comment
An optional STRING literal. The description for the recipient.
Examples
-- Create recipient `other_corp`. This throws an exception if a recipient with name other_corp
-- already exists.
> CREATE RECIPIENT other_corp;
-- Create recipient `other_corp` only if a recipient with the same name doesn't exist.
> CREATE RECIPIENT IF NOT EXISTS other_corp;
-- Create recipient `other_corp` only if a recipient with same name doesn't exist, with a comment.
> CREATE RECIPIENT IF NOT EXISTS other_corp COMMENT 'This is Other Corp';
Related articles
DESCRIBE RECIPIENT
DROP RECIPIENT
SHOW RECIPIENTS
CREATE SCHEMA
7/21/2022 • 2 minutes to read
Creates a schema with the specified name. If a schema with the same name already exists, an exception is
thrown.
Syntax
CREATE SCHEMA [ IF NOT EXISTS ] schema_name
[ COMMENT schema_comment ]
[ LOCATION schema_directory ]
[ WITH DBPROPERTIES ( property_name = property_value [ , ... ] ) ]
Parameters
schema_name
The name of the schema to be created.
IF NOT EXISTS
Creates a schema with the given name if it does not exist. If a schema with the same name already exists,
nothing will happen.
schema_directory
Path of the file system in which the specified schema is to be created. If the specified path does not exist
in the underlying file system, creates a directory with the path. If the location is not specified, the schema
is created in the default warehouse directory, whose path is configured by the static configuration
spark.sql.warehouse.dir .
WARNING
To avoid accidental data loss, do not register a schema (database) to a location with existing data or create new external
tables in a location managed by a schema. Dropping a schema will recursively delete all data files in the managed location.
schema_comment
The description for the schema.
WITH DBPROPERTIES ( property_name = property_value [ , … ] )
The properties for the schema in key-value pairs.
Examples
-- Create schema `customer_sc`. This throws exception if schema with name customer_sc
-- already exists.
> CREATE SCHEMA customer_sc;
-- Create schema `customer_sc` only if schema with same name doesn't exist.
> CREATE SCHEMA IF NOT EXISTS customer_sc;
-- Create schema `customer_sc` only if schema with same name doesn't exist with
-- `Comments`,`Specific Location` and `Database properties`.
> CREATE SCHEMA IF NOT EXISTS customer_sc COMMENT 'This is customer schema' LOCATION '/user'
WITH DBPROPERTIES (ID=001, Name='John');
Related articles
DESCRIBE SCHEMA
DROP SCHEMA
CREATE SHARE
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Creates a share with the specified name. If a share with the same name already exists, an exception is thrown.
To create and manage a share you must be a metastore administrator and Databricks Runtime must be
configured with the Unity Catalog metastore.
To add content to the share use ALTER SHARE.
Since: Databricks Runtime 10.3
Syntax
CREATE SHARE [ IF NOT EXISTS ] share_name
[ COMMENT comment ]
Parameters
share_name
The name of the share to be created.
IF NOT EXISTS
Creates a share with the given name if it does not exist. If a share with the same name already exists,
nothing will happen.
comment
An optional STRING literal. The description for the share.
Examples
-- Create share `customer_share`. This throws exception if a share with name customer_share
-- already exists.
> CREATE SHARE customer_share;
-- Create share `customer_share` only if share with same name doesn't exist.
> CREATE SHARE IF NOT EXISTS customer_share;
-- Create share `customer_share` only if share with same name doesn't exist, with a comment.
> CREATE SHARE IF NOT EXISTS customer_share COMMENT 'This is customer share';
Related articles
ALTER SHARE
DESCRIBE SHARE
DROP SHARE
CREATE FUNCTION (SQL)
7/21/2022 • 6 minutes to read
NOTE
This statement is supported only for functions created in the hive_metastore catalog.
Syntax
CREATE [OR REPLACE] [TEMPORARY] FUNCTION [IF NOT EXISTS]
function_name ( [ function_parameter [, ...] ] )
RETURNS { data_type | TABLE ( column_spec [, ...] ) }
[ characteristic [...] ]
RETURN { expression | query }
function_parameter
parameter_name data_type [DEFAULT default_expression] [COMMENT parameter_comment]
column_spec
column_name data_type [COMMENT column_comment]
characteristic
{ LANGUAGE SQL |
[NOT] DETERMINISTIC |
COMMENT function_comment |
[CONTAINS SQL | READS SQL DATA] |
SQL SECURITY DEFINER }
Parameters
OR REPLACE
If specified, the function with the same name and signature (number of parameters and parameter types)
is replaced. You cannot replace an existing function with a different signature. This is mainly useful to
update the function body and the return type of the function. You cannot specify this parameter with
IF NOT EXISTS .
TEMPORARY
The scope of the function being created. When you specify TEMPORARY , the created function is valid and
visible in the current session. No persistent entry is made in the catalog.
IF NOT EXISTS
If specified, creates the function only when it does not exist. The creation of the function succeeds (no
error is thrown) if the specified function already exists in the system. You cannot specify this parameter
with OR REPLACE .
function_name
A name for the function. For a permanent function, you can optionally qualify the function name with a
schema name. If the name is not qualified the permanent function is created in the current schema.
function_parameter
Specifies a parameter of the function.
parameter_name
The parameter name must be unique within the function.
data_type
Any supported data type.
DEFAULT default_expression
Since: Databricks Runtime 10.4
An optional default to be used when a function invocation does not assign an argument to the
parameter. default_expression must be castable to data_type . The expression must not reference
another parameter or contain a subquery.
When you specify a default for one parameter, all following parameters must also have a default.
COMMENT comment
An optional description of the parameter. comment must be a STRING literal.
RETURNS data_type
The return data type of the scalar function.
RETURNS TABLE (column_spec [,…] )
The signature of the result of the table function.
column_name
The column name must be unique within the signature.
data_type
Any supported data type.
COMMENT column_comment
An optional description of the column. comment must be a STRING literal.
RETURN { expression | quer y }
The body of the function. For a scalar function, it can either be a query or an expression. For a table
function, it can only be a query. The expression cannot contain:
Aggregate functions
Window functions
Ranking functions
Row producing functions such as explode
Within the body of the function you can refer to a parameter by its unqualified name or by qualifying the
parameter with the function name.
characteristic
All characteristic clauses are optional. You can specify any number of them in any order, but you can
specify each clause only once.
LANGUAGE SQL
The language of the function. SQL is the only supported language.
[NOT] DETERMINISTIC
Whether the function is deterministic. A function is deterministic when it returns only one result
for a given set of arguments.
COMMENT function_comment
A comment for the function. function_comment must be a STRING literal.
CONTAINS SQL or READS SQL DATA
Whether a function reads data directly or indirectly from a table or a view. When the function
reads SQL data, you cannot specify CONTAINS SQL . If you don’t specify either clause, the property is
derived from the function body.
SQL SECURITY DEFINER
The body of the function and any default expressions are executed using the authorization of the
owner of the function. This is the only supported behavior.
Examples
Create and use a SQL scalar function
Create and use a function that uses DEFAULTs
Create a SQL table function
Replace a SQL function
Describe a SQL function
Create and use a SQL scalar function
> CREATE VIEW t(c1, c2) AS VALUES (0, 1), (1, 2);
-- Create a temporary function with no parameter.
> CREATE TEMPORARY FUNCTION hello() RETURNS STRING RETURN 'Hello World!';
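-- Illustrative (assumed) invocation of the function defined above:
> SELECT hello();
Hello World!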
Related articles
CREATE FUNCTION (External)
DROP FUNCTION
SHOW FUNCTIONS
DESCRIBE FUNCTION
GRANT
REVOKE
CREATE TABLE
7/21/2022 • 2 minutes to read
Related articles
ALTER TABLE
DROP TABLE
CREATE TABLE (Hive format)
CREATE TABLE [USING]
CREATE TABLE LIKE
CREATE TABLE CLONE
CREATE TABLE [USING]
7/21/2022 • 6 minutes to read
Syntax
{ { [CREATE OR] REPLACE TABLE | CREATE TABLE [ IF NOT EXISTS ] }
table_name
[ table_specification ] [ USING data_source ]
[ table_clauses ]
[ AS query ] }
table_specification
( { column_identifier column_type [ NOT NULL ]
[ GENERATED ALWAYS AS ( expr ) |
GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [ INCREMENT BY step ] ) ] ]
[ COMMENT column_comment ]
[ column_constraint ] } [, ...]
[ , table_constraint ] [...] )
table_clauses
{ OPTIONS clause |
PARTITIONED BY clause |
clustered_by_clause |
LOCATION path [ WITH ( CREDENTIAL credential_name ) ] |
COMMENT table_comment |
TBLPROPERTIES clause } [...]
clustered_by_clause
{ CLUSTERED BY ( cluster_column [, ...] )
[ SORTED BY ( { sort_column [ ASC | DESC ] } [, ...] ) ]
INTO num_buckets BUCKETS }
Parameters
REPLACE
If specified replaces the table and its content if it already exists. This clause is only supported for Delta
Lake tables.
NOTE
Azure Databricks strongly recommends using REPLACE instead of dropping and re-creating Delta Lake tables.
IF NOT EXISTS
If specified and a table with the same name already exists, the statement is ignored.
IF NOT EXISTS cannot coexist with REPLACE , which means CREATE OR REPLACE TABLE IF NOT EXISTS is not
allowed.
table_name
The name of the table to be created. The name must not include a temporal specification. If the name is
not qualified the table is created in the current schema.
table_specification
This optional clause defines the list of columns, their types, properties, descriptions, and column
constraints.
If you do not define columns for the table schema you must specify either AS query or LOCATION .
column_identifier
A unique name for the column.
column_type
Specifies the data type of the column. Not all data types supported by Azure Databricks are
supported by all data sources.
NOT NULL
If specified the column will not accept NULL values. This clause is only supported for Delta Lake
tables.
GENERATED ALWAYS AS ( expr )
When you specify this clause the value of this column is determined by the specified expr .
expr may be composed of literals, column identifiers within the table, and deterministic, built-in
SQL functions or operators except:
Aggregate functions
Analytic window functions
Ranking window functions
Table valued generator functions
Also expr must not contain any subquery.
GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [
INCREMENT BY step ] ) ]
Since: Databricks Runtime 10.3
Defines an identity column. When you write to the table, and do not provide values for the identity
column, it will be automatically assigned a unique and statistically increasing (or decreasing if
step is negative) value. This clause is only supported for Delta Lake tables. This clause can only be
used for columns with BIGINT data type.
The automatically assigned values start with start and increment by step . Assigned values are
unique but are not guaranteed to be contiguous. Both parameters are optional, and the default
value is 1. step cannot be 0 .
If the automatically assigned values are beyond the range of the identity column type, the query
will fail.
When ALWAYS is used, you cannot provide your own values for the identity column.
The following operations are not supported:
PARTITIONED BY an identity column
UPDATE an identity column
COMMENT column_comment
A string literal to describe the column.
column_constraint
Adds a primary key or foreign key constraint to the column in a Delta Lake table.
Constraints are not supported for tables in the hive_metastore catalog.
To add a check constraint to a Delta Lake table use ALTER TABLE.
table_constraint
Adds a primary key or foreign key constraints to the Delta Lake table.
Constraints are not supported for tables in the hive_metastore catalog.
To add a check constraint to a Delta Lake table use ALTER TABLE.
USING data_source
The file format to use for the table. data_source must be one of:
TEXT
AVRO
CSV
JSON
JDBC
PARQUET
ORC
DELTA
LIBSVM
a fully-qualified class name of a custom implementation of
org.apache.spark.sql.sources.DataSourceRegister .
If USING is omitted, the default is DELTA .
For any data_source other than DELTA you must also specify a LOCATION unless the table catalog is
hive_metastore .
HIVE is supported to create a Hive SerDe table. You can specify the Hive-specific file_format and
row_format using the OPTIONS clause, which is a case-insensitive string map. The option_keys are:
FILEFORMAT
INPUTFORMAT
OUTPUTFORMAT
SERDE
FIELDDELIM
ESCAPEDELIM
MAPKEYDELIM
LINEDELIM
table_clauses
Optionally specify location, partitioning, clustering, options, comments, and user defined properties for
the new table. Each sub clause may only be specified once.
PARTITIONED BY
An optional clause to partition the table by a subset of columns.
NOTE
Unless you define a Delta Lake table, partitioning columns that reference the columns in the column
specification are always moved to the end of the table.
clustered_by_clause
Optionally cluster the table or each partition into a fixed number of hash buckets using a subset of
the columns.
Clustering is not supported for Delta Lake tables.
CLUSTERED BY
Specifies the set of columns by which to cluster each partition, or the table if no partitioning
is specified.
cluster_column
An identifier referencing a column_identifier in the table. If you specify more than
one column there must be no duplicates. Since a clustering operates on the partition
level you must not name a partition column also as a cluster column.
SORTED BY
Optionally maintains a sort order for rows in a bucket.
sort_column
A column to sort the bucket by. The column must not be a partition column. Sort
columns must be unique.
ASC or DESC
Optionally specifies whether sort_column is sorted in ascending ( ASC ) or
descending ( DESC ) order. The default value is ASC .
INTO num_buckets BUCKETS
An INTEGER literal specifying the number of buckets into which each partition (or the table
if no partitioning is specified) is divided.
LOCATION path [ WITH ( CREDENTIAL credential_name ) ]
An optional path to the directory where table data is stored, which could be a path on distributed
storage. path must be a STRING literal. If you specify no location the table is considered a
managed table and Azure Databricks creates a default table location.
Examples
-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);
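-- A sketch (assumed) of a non-default data source with partitioning and an explicit location; the path is illustrative.
> CREATE TABLE student_csv (id INT, name STRING, age INT)
USING CSV
PARTITIONED BY (age)
LOCATION '/mnt/csv_files';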
Related articles
ALTER TABLE
CONSTRAINT
CREATE TABLE LIKE
CREATE TABLE CLONE
DROP TABLE
PARTITIONED BY
Table properties and table options
CREATE TABLE with Hive format
7/21/2022 • 3 minutes to read
Syntax
CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
[ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
[ COMMENT table_comment ]
[ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
| ( col_name1, col_name2, ... ) ]
[ ROW FORMAT row_format ]
[ STORED AS file_format ]
[ LOCATION path ]
[ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
[ AS select_statement ]
row_format:
: SERDE serde_class [ WITH SERDEPROPERTIES (k1=v1, k2=v2, ... ) ]
| DELIMITED [ FIELDS TERMINATED BY fields_terminated_char [ ESCAPED BY escaped_char ] ]
[ COLLECTION ITEMS TERMINATED BY collection_items_terminated_char ]
[ MAP KEYS TERMINATED BY map_key_terminated_char ]
[ LINES TERMINATED BY row_terminated_char ]
[ NULL DEFINED AS null_char ]
The clauses between the column definition clause and the AS SELECT clause can appear in any order. For
example, you can write COMMENT table_comment after TBLPROPERTIES .
NOTE
In Databricks Runtime 8.0 and above you must specify either the STORED AS or ROW FORMAT clause. Otherwise, the SQL
parser uses the CREATE TABLE [USING] syntax to parse it and creates a Delta table by default.
Parameters
table_identifier
A table name, optionally qualified with a schema name.
Syntax: [schema_name.] table_name
EXTERNAL
Defines the table using the path provided in LOCATION .
PARTITIONED BY
Partitions the table by the specified columns.
ROW FORMAT
Use the SERDE clause to specify a custom SerDe for one table. Otherwise, use the DELIMITED clause to
use the native SerDe and specify the delimiter, escape character, null character and so on.
SERDE
Specifies a custom SerDe for one table.
serde_class
Specifies a fully-qualified class name of a custom SerDe.
SERDEPROPERTIES
A list of key-value pairs used to tag the SerDe definition.
DELIMITED
The DELIMITED clause can be used to specify the native SerDe and state the delimiter, escape character,
null character and so on.
FIELDS TERMINATED BY
Used to define a column separator.
COLLECTION ITEMS TERMINATED BY
Used to define a collection item separator.
MAP KEYS TERMINATED BY
Used to define a map key separator.
LINES TERMINATED BY
Used to define a row separator.
NULL DEFINED AS
Used to define the specific value for NULL.
ESCAPED BY
Define the escape mechanism.
COLLECTION ITEMS TERMINATED BY
Define a collection item separator.
MAP KEYS TERMINATED BY
Define a map key separator.
LINES TERMINATED BY
Define a row separator.
NULL DEFINED AS
Define the specific value for NULL .
STORED AS
The file format for the table. Available formats include TEXTFILE , SEQUENCEFILE , RCFILE , ORC , PARQUET ,
and AVRO . Alternatively, you can specify your own input and output formats through INPUTFORMAT and
OUTPUTFORMAT . Only formats TEXTFILE , SEQUENCEFILE , and RCFILE can be used with ROW FORMAT SERDE
and only TEXTFILE can be used with ROW FORMAT DELIMITED .
LOCATION
Path to the directory where table data is stored, which could be a path on distributed storage.
COMMENT
A string literal to describe the table.
TBLPROPERTIES
A list of key-value pairs used to tag the table definition.
AS select_statement
Populates the table using the data from the select statement.
Examples
--Use hive format
CREATE TABLE student (id INT, name STRING, age INT) STORED AS ORC;
--Use personalized custom SerDe (we may need to `ADD JAR xxx.jar` first to ensure we can find the serde_class,
--or you may run into `CLASSNOTFOUND` exception)
ADD JAR /tmp/hive_serde_example.jar;
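-- A sketch (assumed) of creating a table with a custom SerDe class from the JAR above; the class name is illustrative.
CREATE TABLE student_serde (id INT, name STRING)
ROW FORMAT SERDE 'com.example.hive.serde.ExampleSerDe'
STORED AS TEXTFILE;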
Related statements
CREATE TABLE [USING]
CREATE TABLE LIKE
CREATE TABLE LIKE
7/21/2022 • 3 minutes to read
Defines a table using the definition and metadata of an existing table or view.
The statement does not inherit primary key or foreign key constraints from the source table.
Delta Lake does not support CREATE TABLE LIKE . Instead use CREATE TABLE AS.
Syntax
CREATE TABLE [ IF NOT EXISTS ] table_name LIKE source_table_name [table_clauses]
table_clauses
{ USING data_source |
LOCATION path |
TBLPROPERTIES clause |
ROW FORMAT row_format |
STORED AS file_format } [...]
row_format
{ SERDE serde_class [ WITH SERDEPROPERTIES (serde_key = serde_val [, ...] ) ] |
{ DELIMITED [ FIELDS TERMINATED BY fields_terminated_char [ ESCAPED BY escaped_char ] ]
[ COLLECTION ITEMS TERMINATED BY collection_items_terminated_char ]
[ MAP KEYS TERMINATED BY map_key_terminated_char ]
[ LINES TERMINATED BY row_terminated_char ]
[ NULL DEFINED AS null_char ] } }
property_key
{ identifier [. ...] | string_literal }
Parameters
IF NOT EXISTS
If specified ignores the statement if the table_name already exists.
table_name
The name of the table to create. The name must not include a temporal specification. If the name is not
qualified the table is created in the current schema. A table_name must not exist already.
source_table_name
The name of the table whose definition is copied. The table must not be a Delta Lake table.
table_clauses
Optionally specify a data source format, location, and user defined properties for the new table. Each sub
clause may only be specified once.
LOCATION path
Path to the directory where table data is stored, which could be a path on distributed storage. If
you specify a location the new table becomes an external table . If you do not specify a location
the table is a managed table .
TBLPROPERTIES
Optionally sets one or more user defined properties.
USING data_source
The file format to use for the table. data_source must be one of:
TEXT
CSV
JSON
JDBC
PARQUET
ORC
HIVE
LIBSVM
a fully-qualified class name of a custom implementation of
org.apache.spark.sql.sources.DataSourceRegister .
HIVE is supported to create a Hive SerDe table. You can specify the Hive-specific file_format and
row_format using the OPTIONS clause, which is a case-insensitive string map. The option keys are
FILEFORMAT , INPUTFORMAT , OUTPUTFORMAT , SERDE , FIELDDELIM , ESCAPEDELIM , MAPKEYDELIM , and
LINEDELIM .
If you do not specify USING the format of the source table will be inherited.
ROW FORMAT row_format
To specify a custom SerDe, set to SERDE and specify the fully-qualified class name of a custom
SerDe and optional SerDe properties. To use the native SerDe, set to DELIMITED and specify the
delimiter, escape character, null character and so on.
SERDEPROPERTIES
A list of key-value pairs used to tag the SerDe definition.
FIELDS TERMINATED BY
Define a column separator.
ESCAPED BY
Define the escape mechanism.
COLLECTION ITEMS TERMINATED BY
Define a collection item separator.
MAP KEYS TERMINATED BY
Define a map key separator.
LINES TERMINATED BY
Define a row separator.
NULL DEFINED AS
Define the specific value for NULL .
STORED AS
The file format for the table. Available formats include TEXTFILE , SEQUENCEFILE , RCFILE ,
ORC , PARQUET , and AVRO . Alternatively, you can specify your own input and output formats
through INPUTFORMAT and OUTPUTFORMAT . Only formats TEXTFILE , SEQUENCEFILE , and
RCFILE can be used with ROW FORMAT SERDE and only TEXTFILE can be used with
ROW FORMAT DELIMITED .
Examples
-- Create table using a new location
> CREATE TABLE Student_Dupli LIKE Student LOCATION '/mnt/data_files';
Related articles
CREATE TABLE [USING]
CREATE TABLE CLONE
DROP TABLE
ALTER TABLE
CREATE TABLE with Hive format
Table properties
CREATE VIEW
7/21/2022 • 2 minutes to read
Constructs a virtual table that has no physical data based on the result-set of a SQL query. ALTER VIEW and
DROP VIEW only change metadata.
Syntax
CREATE [ OR REPLACE ] [ [ GLOBAL ] TEMPORARY ] VIEW [ IF NOT EXISTS ] view_name
[ column_list ]
[ COMMENT view_comment ]
[ TBLPROPERTIES clause ]
AS query
column_list
( { column_alias [ COMMENT column_comment ] } [, ...] )
Parameters
OR REPLACE
If a view of the same name already exists, it is replaced. To replace an existing view you must be its owner.
[ GLOBAL ] TEMPORARY
TEMPORARY views are session-scoped and are dropped when the session ends because they skip persisting the
definition in the underlying metastore, if any. GLOBAL TEMPORARY views are tied to a system preserved
temporary schema global_temp .
IF NOT EXISTS
Creates the view only if it does not exist. If a view by this name already exists the CREATE VIEW statement
is ignored.
You may specify at most one of IF NOT EXISTS or OR REPLACE .
view_name
The name of the newly created view. A temporary view's name must not be qualified. The fully qualified
name of a view must be unique.
column_list
Optionally labels the columns in the query result of the view. If you provide a column list the number of
column aliases must match the number of expressions in the query. In case no column list is specified
aliases are derived from the body of the view.
column_alias
The column aliases must be unique.
column_comment
An optional STRING literal describing the column alias.
view_comment
An optional STRING literal providing a view-level comment.
TBLPROPERTIES
Optionally sets one or more user defined properties.
AS query
A query that constructs the view from base tables or other views.
Examples
-- Create or replace view for `experienced_employee` with comments.
> CREATE OR REPLACE VIEW experienced_employee
(id COMMENT 'Unique identification number', Name)
COMMENT 'View for experienced employees'
AS SELECT id, name
FROM all_employee
WHERE working_years > 5;
Related articles
ALTER VIEW
DROP VIEW
query
SHOW VIEWS
Table properties
DROP CATALOG
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Drops a catalog. An exception is thrown if the catalog does not exist in the metastore.
Since: Databricks Runtime 10.3
Syntax
DROP CATALOG [ IF EXISTS ] catalog_name [ RESTRICT | CASCADE ]
Parameters
IF EXISTS
If specified, no exception is thrown when the catalog does not exist.
catalog_name :
The name of an existing catalog in the metastore. If the name does not exist, an exception is thrown.
RESTRICT
If specified, will restrict dropping a non-empty catalog and is enabled by default.
CASCADE
If specified, will drop all the associated databases (schemas) and the objects within them.
Examples
-- Create a `vaccine` catalog
> CREATE CATALOG vaccine COMMENT 'This catalog is used to maintain information about vaccines';
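-- A sketch (assumed) of dropping the catalog above together with its schemas:
> DROP CATALOG vaccine CASCADE;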
Related articles
CREATE CATALOG
DESCRIBE CATALOG
SHOW CATALOGS
DROP STORAGE CREDENTIAL
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Drops an existing storage credential. An exception is thrown if the credential does not exist in the metastore.
Since: Databricks Runtime 10.3
Syntax
DROP STORAGE CREDENTIAL [ IF EXISTS ] credential_name [ FORCE ]
Parameters
IF EXISTS
If specified, no exception is thrown when the credential does not exist.
FORCE
Optionally force the credential to be dropped even if it is used by existing objects. If FORCE is not
specified an error is raised if the credential is in use.
Examples
> DROP STORAGE CREDENTIAL street_cred FORCE;
Related articles
ALTER STORAGE CREDENTIAL
DESCRIBE STORAGE CREDENTIAL
SHOW STORAGE CREDENTIALS
DROP DATABASE
7/21/2022 • 2 minutes to read
An alias for DROP SCHEMA. While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.
Related articles
CREATE SCHEMA
DESCRIBE SCHEMA
DROP SCHEMA
SHOW SCHEMAS
DROP FUNCTION
7/21/2022 • 2 minutes to read
Syntax
DROP [ TEMPORARY ] FUNCTION [ IF EXISTS ] function_name
Parameters
function_name
The name of an existing function. The function name may be optionally qualified with a schema name.
TEMPORARY
Used to delete a TEMPORARY function.
IF EXISTS
If specified, no exception is thrown when the function does not exist.
Examples
-- Create a permanent function `hello`
> CREATE FUNCTION hello() RETURNS STRING RETURN 'Hello World!';
-- Drop the permanent function `hello`
> DROP FUNCTION hello;
-- Listing the user functions after dropping no longer shows `hello`
> SHOW USER FUNCTIONS;
Related statements
CREATE FUNCTION (External)
CREATE FUNCTION (SQL)
DESCRIBE FUNCTION
SHOW FUNCTIONS
DROP EXTERNAL LOCATION
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Drops an external location. An exception is thrown if the location does not exist in the metastore.
Since: Databricks Runtime 10.3
Syntax
DROP EXTERNAL LOCATION [ IF EXISTS ] location_name [ FORCE ]
Parameters
IF EXISTS
If specified, no exception is thrown when the location does not exist.
location_name
The name of the external location to drop.
FORCE
Optionally force the location to be dropped even if it is used by existing external tables. If FORCE is not
specified, an error is raised if the location is in use.
Examples
> DROP EXTERNAL LOCATION some_location FORCE;
Related articles
CREATE EXTERNAL LOCATION
DESCRIBE EXTERNAL LOCATION
External locations and storage credentials
SHOW EXTERNAL LOCATIONS
DROP RECIPIENT
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Drops a recipient. An exception is thrown if the recipient does not exist in the system.
Since: Databricks Runtime 10.3
Syntax
DROP RECIPIENT [ IF EXISTS ] recipient_name
Parameters
IF EXISTS
If specified, no exception is thrown when the recipient does not exist.
recipient_name
The name of an existing recipient in the system. If the name does not exist, an exception is thrown.
Examples
-- Create `other_corp` recipient
> CREATE RECIPIENT other_corp COMMENT 'OtherCorp.com';
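-- The recipient can later be removed, for example:
> DROP RECIPIENT IF EXISTS other_corp;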
Related articles
CREATE RECIPIENT
DESCRIBE RECIPIENT
SHOW RECIPIENTS
DROP SCHEMA
7/21/2022 • 2 minutes to read
Drops a schema and deletes the directory associated with the schema from the file system. An exception is
thrown if the schema does not exist in the system.
While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.
Syntax
DROP SCHEMA [ IF EXISTS ] schema_name [ RESTRICT | CASCADE ]
Parameters
IF EXISTS
If specified, no exception is thrown when the schema does not exist.
schema_name
The name of an existing schema in the system. If the name does not exist, an exception is thrown.
RESTRICT
If specified, restricts dropping a non-empty schema. Enabled by default.
CASCADE
If specified, will drop all the associated tables and functions.
WARNING
To avoid accidental data loss, do not register a schema (database) to a location with existing data or create new external
tables in a location managed by a schema. Dropping a schema will recursively delete all data files in the managed location.
Examples
-- Create `inventory_schema` Database
> CREATE SCHEMA inventory_schema COMMENT 'This schema is used to maintain Inventory';
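-- A matching drop; CASCADE also removes any tables and functions the schema still contains.
> DROP SCHEMA IF EXISTS inventory_schema CASCADE;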
Related articles
CREATE SCHEMA
DESCRIBE SCHEMA
SHOW SCHEMAS
DROP SHARE
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Drops a share. An exception is thrown if the share does not exist in the system.
Since: Databricks Runtime 10.3
Syntax
DROP SHARE [ IF EXISTS ] share_name
Parameters
IF EXISTS
If specified, no exception is thrown when the share does not exist.
share_name
The name of an existing share. If the name does not exist, an exception is thrown.
Examples
-- Create `vaccine` share
> CREATE SHARE vaccine COMMENT 'This share is used to share information about vaccines';
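-- The share can subsequently be dropped, for example:
> DROP SHARE IF EXISTS vaccine;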
Related articles
ALTER SHARE
CREATE SHARE
DESCRIBE SHARE
SHOW SHARES
DROP TABLE
7/21/2022 • 2 minutes to read
Deletes the table and removes the directory associated with the table from the file system if the table is not an
EXTERNAL table. An exception is thrown if the table does not exist.
In case of an external table, only the associated metadata information is removed from the metastore schema.
Any foreign key constraints referencing the table are also dropped.
If the table is cached, the command uncaches the table and all its dependents.
Syntax
DROP TABLE [ IF EXISTS ] table_name
Parameter
IF EXISTS
If specified, no exception is thrown when the table does not exist.
table_name
The name of the table to be dropped. The name must not include a temporal specification.
Examples
-- Assumes a table named `employeetable` exists.
> DROP TABLE employeetable;
Related articles
CREATE TABLE
CREATE SCHEMA
DROP SCHEMA
DROP VIEW
7/21/2022 • 2 minutes to read
Removes the metadata associated with a specified view from the catalog.
Syntax
DROP VIEW [ IF EXISTS ] view_name
Parameter
IF EXISTS
If specified, no exception is thrown when the view does not exist.
view_name
The name of the view to be dropped.
Examples
-- Assumes a view named `employeeView` exists.
> DROP VIEW employeeView;
-- Assumes a view named `employeeView` does not exist. Try with IF EXISTS
-- this time it will not throw exception
> DROP VIEW IF EXISTS employeeView;
Related articles
CREATE VIEW
ALTER VIEW
SHOW VIEWS
CREATE SCHEMA
DROP SCHEMA
MSCK REPAIR TABLE
7/21/2022 • 2 minutes to read
Recovers all of the partitions in the directory of a table and updates the Hive metastore. When creating a table
using PARTITIONED BY clause, partitions are generated and registered in the Hive metastore. However, if the
partitioned table is created from existing data, partitions are not registered automatically in the Hive metastore;
you must run MSCK REPAIR TABLE to register the partitions.
Another way to recover partitions is to use ALTER TABLE RECOVER PARTITIONS.
This statement does not apply to Delta Lake tables.
If the table is cached, the command clears cached data of the table and all its dependents that refer to it. The
cache will be lazily filled the next time the table or the dependents are accessed.
Syntax
MSCK REPAIR TABLE table_name [ {ADD | DROP | SYNC} PARTITIONS]
Parameters
table_name
The name of the partitioned table to be repaired.
ADD or DROP or SYNC PARTITIONS
ADD , the default, registers in the metastore any partitions that exist in the table directory but are missing from the metastore. DROP removes from the metastore any partitions whose directories no longer exist. SYNC is the combination of DROP and ADD .
Examples
-- create a partitioned table from existing data /tmp/namesAndAges.parquet
> CREATE TABLE t1 (name STRING, age INT) USING parquet PARTITIONED BY (age)
LOCATION "/tmp/namesAndAges.parquet";
Related articles
ALTER TABLE
MSCK REPAIR PRIVILEGES
TRUNCATE TABLE
7/21/2022 • 2 minutes to read
Removes all the rows from a table or partition(s). The table must not be a view or an external or temporary
table. In order to truncate multiple partitions at once, specify the partitions in partition_spec . If no
partition_spec is specified, removes all partitions in the table.
If the table is cached, the command clears cached data of the table and all its dependents that refer to it. The
cache will be lazily filled when the table or the dependents are accessed the next time.
Syntax
TRUNCATE TABLE table_name [ PARTITION clause ]
Parameters
table_name
The name of the table to truncate. The name must not include a temporal specification.
PARTITION clause
Optional specification of a partition.
Examples
-- Create table Student with partition
> CREATE TABLE Student (name STRING, rollno INT) PARTITIONED BY (age INT);
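The table, or a single partition of it, can then be emptied, for example:
-- Remove all rows from the partition with age = 10
> TRUNCATE TABLE Student PARTITION (age = 10);
-- Remove all rows from the table
> TRUNCATE TABLE Student;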
Related articles
DROP TABLE
ALTER TABLE
PARTITION
USE CATALOG
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Sets the current catalog. After the current catalog is set, partially qualified and unqualified identifiers for tables, functions,
and views that are referenced by SQL statements are resolved from the current catalog.
Setting the catalog also resets the current database to default .
Since: Databricks Runtime 10.3
Syntax
{ USE | SET } CATALOG [ catalog_name | ' catalog_name ' ]
Parameter
catalog_name
Name of the catalog to use. If the catalog does not exist, an exception is thrown.
Examples
-- Use the 'hive_metastore' which exists.
> USE CATALOG hive_metastore;
Related articles
CREATE SCHEMA
DROP SCHEMA
USE SCHEMA
USE SCHEMA
7/21/2022 • 2 minutes to read
Sets the current schema. After the current schema is set, unqualified references to objects such as tables, functions, and views are resolved from the current schema.
Syntax
USE [SCHEMA] schema_name
Parameter
schema_name
Name of the schema to use. If schema_name is qualified the current catalog is also set to the specified
catalog name. If the schema does not exist, an exception is thrown.
Examples
-- Use the 'userschema' which exists.
> USE SCHEMA userschema;
Related articles
CREATE SCHEMA
DROP SCHEMA
Table properties and table options
7/21/2022 • 3 minutes to read
TBLPROPERTIES
Sets one or more table properties in a new table or view.
You can use table properties to tag tables with information not tracked by SQL.
Syntax
property_key
{ identifier [. ...] | string_literal }
Parameters
proper ty_key
The property key. The key can consist of one or more identifiers separated by a dot, or a string literal.
Property keys must be unique and are case sensitive.
proper ty_val
The value for the property. The value must be a BOOLEAN, STRING, INTEGER, or DECIMAL literal.
Examples
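A minimal sketch, assuming a Delta table named T (the name is illustrative and matches the UNSET example further down): properties can be set when the table is created and inspected afterwards.
-- Set table properties at creation time
> CREATE TABLE T (c1 INT) TBLPROPERTIES ('this.is.my.key' = 12, this.is.my.key2 = true);
> SHOW TBLPROPERTIES T;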
SET TBLPROPERTIES
Syntax
property_key
{ identifier [. ...] | string_literal }
Parameters
proper ty_key
The property key. The key can consist of one or more identifiers separated by a dot, or a string literal.
Property keys must be unique and are case sensitive.
proper ty_val
The new value for the property. The value must be a BOOLEAN, STRING, INTEGER, or DECIMAL literal.
Examples
UNSET TBLPROPERTIES
Removes one or more table properties from a table or view.
Syntax
property_key
{ identifier [. ...] | string_literal }
Parameters
IF EXISTS
An optional clause directing Databricks Runtime not to raise an error if any of the property keys do not
exist.
proper ty_key
The property key to remove. The key can consist of one or more identifiers separated by a dot, or a string
literal.
Property keys are case sensitive. If property_key doesn’t exist, an error is raised unless IF EXISTS has
been specified.
Examples
-- Remove a table's table properties.
> ALTER TABLE T UNSET TBLPROPERTIES(this.is.my.key, 'this.is.my.key2');
> SHOW TBLPROPERTIES T;
option.serialization.format 1
transient_lastDdlTime 1649784415
OPTIONS
Sets one or more table options in a new table.
The purpose of table options is to pass storage properties to the underlying storage, such as SERDE properties
to Hive.
Specifying table options for Delta Lake tables will also echo these options as table properties.
Syntax
property_key
{ identifier [. ...] | string_literal }
Parameters
proper ty_key
The property key. The key can consist of one or more identifiers separated by a dot, or a string literal.
Property keys must be unique and are case sensitive.
proper ty_val
The value for the property. The value must be a BOOLEAN, STRING, INTEGER, or DECIMAL literal.
Examples
location
Use the LOCATION clauses of ALTER TABLE and CREATE TABLE to set a table location.
owner
Use the OWNER TO clause of ALTER TABLE and ALTER VIEW to transfer ownership of a table or view.
provider
Use the USING clause of CREATE TABLE to set the data source of a table.
You should not use property keys starting with the option identifier. This prefix is filtered out in
SHOW TBLPROPERTIES. The option prefix is also used to display table options.
Related articles
CREATE TABLE [USING]
CREATE TABLE CLONE
DROP TABLE
ALTER TABLE
INSERT INTO
7/21/2022 • 6 minutes to read
Inserts new rows into a table and optionally truncates the table or partitions. You specify the inserted rows by
value expressions or the result of a query.
Syntax
INSERT { OVERWRITE | INTO } [ TABLE ] table_name
[ PARTITION clause ]
[ ( column_name [, ...] ) ]
query
NOTE
When you INSERT INTO a Delta table, schema enforcement and evolution are supported. If a column’s data type cannot
be safely cast to the Delta table’s data type, a runtime exception is thrown. If schema evolution is enabled, new columns can
exist as the last columns of your schema (or nested columns) for the schema to evolve.
Parameters
INTO or OVERWRITE
Examples
In this section:
INSERT INTO
Insert with a column list
Insert with both a partition spec and a column list
INSERT OVERWRITE
INSERT INTO
Single row insert using a VALUES clause
> CREATE TABLE students (name VARCHAR(64), address VARCHAR(64), student_id INT)
PARTITIONED BY (student_id);
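A single row can then be inserted with a VALUES clause, for example (values are illustrative):
> INSERT INTO students VALUES ('Amy Smith', '123 Park Ave, San Jose', 111111);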
-- Assuming the visiting_students table has already been created and populated.
> SELECT * FROM visiting_students;
name address student_id
------------- --------------------- ----------
Fleur Laurent 345 Copper St, London 777777
Gordon Martin 779 Lake Ave, Oxford 888888
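Rows can also be inserted from a query, and a column list can name the target columns explicitly; a minimal sketch (the second row's values are illustrative):
-- Insert all rows from visiting_students using a query
> INSERT INTO students SELECT * FROM visiting_students;
-- Insert a single row, naming the target columns explicitly
> INSERT INTO students (address, name, student_id) VALUES ('Hangzhou, China', 'Kent Yao', 11215016);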
> CREATE TABLE students (name VARCHAR(64), address VARCHAR(64), student_id INT)
PARTITIONED BY (student_id)
LOCATION "/mnt/user1/students";
INSERT OVERWRITE
Insert using a VALUES clause
-- Assuming the students table has already been created and populated.
> SELECT * FROM students;
name address student_id
------------- ------------------------- ----------
Amy Smith 123 Park Ave, San Jose 111111
Bob Brown 456 Taylor St, Cupertino 222222
Cathy Johnson 789 Race Ave, Palo Alto 333333
Dora Williams 134 Forest Ave, Melo Park 444444
Fleur Laurent 345 Copper St, London 777777
Gordon Martin 779 Lake Ave, Oxford 888888
Helen Davis 469 Mission St, San Diego 999999
Jason Wang 908 Bird St, Saratoga 121212
-- Assuming the persons table has already been created and populated.
> SELECT * FROM persons;
name address ssn
------------- ------------------------- ---------
Dora Williams 134 Forest Ave, Melo Park 123456789
Eddie Davis 245 Market St,Milpitas 345678901
> CREATE TABLE students (name VARCHAR(64), address VARCHAR(64), student_id INT)
PARTITIONED BY (student_id)
LOCATION "/mnt/user1/students";
Related articles
COPY
DELETE
MERGE
PARTITION
query
UPDATE
INSERT OVERWRITE DIRECTORY
INSERT OVERWRITE DIRECTORY with Hive format
INSERT OVERWRITE DIRECTORY with Hive format
7/21/2022 • 2 minutes to read
Overwrites the existing data in the directory with the new values using Hive SerDe . Hive support must be
enabled to use this command. You specify the inserted rows by value expressions or the result of a query.
Syntax
INSERT OVERWRITE [ LOCAL ] DIRECTORY directory_path
[ ROW FORMAT row_format ] [ STORED AS file_format ]
{ VALUES ( { value | NULL } [ , ... ] ) [ , ( ... ) ] | query }
Parameters
director y_path
The destination directory. The LOCAL keyword specifies that the directory is on the local file system.
row_format
The row format for this insert. Valid options are SERDE clause and DELIMITED clause. SERDE clause can
be used to specify a custom SerDe for this insert. Alternatively, DELIMITED clause can be used to specify
the native SerDe and state the delimiter, escape character, null character, and so on.
file_format
The file format for this insert. Valid options are TEXTFILE , SEQUENCEFILE , RCFILE , ORC , PARQUET , and
AVRO . You can also specify your own input and output format using INPUTFORMAT and OUTPUTFORMAT .
ROW FORMAT SERDE can only be used with TEXTFILE , SEQUENCEFILE , or RCFILE , while
ROW FORMAT DELIMITED can only be used with TEXTFILE .
Examples
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/destination'
STORED AS orc
SELECT * FROM test_table;
Related statements
INSERT INTO
INSERT OVERWRITE DIRECTORY
INSERT OVERWRITE DIRECTORY
7/21/2022 • 2 minutes to read
Overwrites the existing data in the directory with the new values using a given Spark file format. You specify the
inserted rows by value expressions or the result of a query.
Syntax
INSERT OVERWRITE [ LOCAL ] DIRECTORY [ directory_path ]
USING file_format [ OPTIONS ( key [ = ] val [ , ... ] ) ]
{ VALUES ( { value | NULL } [ , ... ] ) [ , ( ... ) ] | query }
Parameters
director y_path
The destination directory. It can also be specified in OPTIONS using path . The LOCAL keyword is used to
specify that the directory is on the local file system.
file_format
The file format to use for the insert. Valid options are TEXT , CSV , JSON , JDBC , PARQUET , ORC , HIVE ,
LIBSVM , or a fully qualified class name of a custom implementation of
org.apache.spark.sql.execution.datasources.FileFormat .
Examples
INSERT OVERWRITE DIRECTORY '/tmp/destination'
USING parquet
OPTIONS (col1 1, col2 2, col3 'test')
SELECT * FROM test_table;
Related statements
INSERT INTO
INSERT OVERWRITE DIRECTORY with Hive format
LOAD DATA
7/21/2022 • 2 minutes to read
Loads the data into a Hive SerDe table from the user specified directory or file. If a directory is specified then all
the files from the directory are loaded. If a file is specified then only the single file is loaded. Additionally the
LOAD DATA statement takes an optional partition specification. When a partition is specified, the data files (when
input source is a directory) or the single file (when input source is a file) are loaded into the partition of the
target table.
If the table is cached, the command clears cached data of the table and all its dependents that refer to it. The
cache will be lazily filled when the table or the dependents are accessed the next time.
Syntax
LOAD DATA [ LOCAL ] INPATH path [ OVERWRITE ] INTO TABLE table_name [ PARTITION clause ]
Parameters
path
Path of the file system. It can be either an absolute or a relative path.
table_name
Identifies the table to be inserted to. The name must not include a temporal specification.
PARTITION clause
An optional parameter that specifies a target partition for the insert. You may also only partially specify
the partition.
LOCAL
If specified, it causes the INPATH to be resolved against the local file system, instead of the default file
system, which is typically a distributed storage.
OVERWRITE
By default, new data is appended to the table. If OVERWRITE is used, the table is instead overwritten with
new data.
Examples
-- Example without partition specification.
-- Assuming the students table has already been created and populated.
> SELECT * FROM students;
name address student_id
--------- ---------------------- ----------
Amy Smith 123 Park Ave, San Jose 111111
> CREATE TABLE test_load (name VARCHAR(64), address VARCHAR(64), student_id INT) USING HIVE;
> CREATE TABLE test_load_partition (c1 INT, c2 INT, c3 INT) USING HIVE PARTITIONED BY (c2, c3);
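Assuming the students data has first been written out to a directory in the table's SerDe format, it could be loaded as follows (the path is illustrative):
-- Load every file in the directory into the Hive SerDe table, replacing existing rows
> LOAD DATA LOCAL INPATH '/tmp/students_export' OVERWRITE INTO TABLE test_load;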
Related articles
INSERT INTO
COPY INTO
EXPLAIN
7/21/2022 • 2 minutes to read
Provides the logical or physical plans for an input statement. By default, this clause provides information about a
physical plan only.
Syntax
EXPLAIN [ EXTENDED | CODEGEN | COST | FORMATTED ] statement
Parameters
EXTENDED
Generates a parsed logical plan, an analyzed logical plan, an optimized logical plan, and a physical plan. The parsed
logical plan is an unresolved plan extracted from the query. The analyzed logical plan transforms an
unresolvedAttribute and unresolvedRelation into fully typed objects. The optimized logical plan
transforms through a set of optimization rules, resulting in the physical plan.
CODEGEN
Generates code for the statement, if any, and a physical plan.
COST
If plan node statistics are available, generates a logical plan and the statistics.
FORMATTED
Generates two sections: a physical plan outline and node details.
statement
A SQL statement to be explained.
Examples
-- Default Output
> EXPLAIN select k, sum(v) from values (1, 2), (1, 3) t(k, v) group by k;
+----------------------------------------------------+
| plan|
+----------------------------------------------------+
| == Physical Plan ==
*(2) HashAggregate(keys=[k#33], functions=[sum(cast(v#34 as bigint))])
+- Exchange hashpartitioning(k#33, 200), true, [id=#59]
+- *(1) HashAggregate(keys=[k#33], functions=[partial_sum(cast(v#34 as bigint))])
+- *(1) LocalTableScan [k#33, v#34]
|
+----------------------------------------------------
-- Using Extended
> EXPLAIN EXTENDED select k, sum(v) from values (1, 2), (1, 3) t(k, v) group by k;
+----------------------------------------------------+
| plan|
+----------------------------------------------------+
| == Parsed Logical Plan ==
'Aggregate ['k], ['k, unresolvedalias('sum('v), None)]
+- 'SubqueryAlias `t`
+- 'UnresolvedInlineTable [k, v], [List(1, 2), List(1, 3)]
== Physical Plan ==
*(2) HashAggregate(keys=[k#47], functions=[sum(cast(v#48 as bigint))], output=[k#47, sum(v)#50L])
+- Exchange hashpartitioning(k#47, 200), true, [id=#79]
+- *(1) HashAggregate(keys=[k#47], functions=[partial_sum(cast(v#48 as bigint))], output=[k#47, sum#52L])
+- *(1) LocalTableScan [k#47, v#48]
|
+----------------------------------------------------+
-- Using Formatted
> EXPLAIN FORMATTED select k, sum(v) from values (1, 2), (1, 3) t(k, v) group by k;
+----------------------------------------------------+
| plan|
+----------------------------------------------------+
| == Physical Plan ==
* HashAggregate (4)
+- Exchange (3)
+- * HashAggregate (2)
+- * LocalTableScan (1)
(3) Exchange
Input: [k#19, sum#24L]
Query
Syntax
[ common_table_expression ]
{ subquery | set_operator }
[ ORDER BY clause | { [ DISTRIBUTE BY clause ] [ SORT BY clause ] } | CLUSTER BY clause ]
[ WINDOW clause ]
[ LIMIT clause ]
subquery
{ SELECT clause |
VALUES clause |
( query ) |
TABLE [ table_name | view_name ]}
Parameters
common table expression
Common table expressions (CTE) are one or more named queries which can be reused multiple times
within the main query block to avoid repeated computations or to improve readability of complex, nested
queries.
subquer y
One of several constructs producing an intermediate result set.
SELECT
A subquery consisting of a SELECT FROM WHERE pattern.
VALUES
Specifies an inline temporary table.
( query )
A nested invocation of a query which may contain set operators or common table expressions.
TABLE
Returns the entire table or view.
table_name
Identifies the table to be returned.
view_name
Identifies the view to be returned.
set_operator
A construct combining subqueries using UNION , EXCEPT , or INTERSECT operators.
ORDER BY
An ordering of the rows of the complete result set of the query. The output rows are ordered across the
partitions. This parameter is mutually exclusive with SORT BY , CLUSTER BY , and DISTRIBUTE BY and
cannot be specified together.
DISTRIBUTE BY
A set of expressions by which the result rows are repartitioned. This parameter is mutually exclusive with
ORDER BY and CLUSTER BY and cannot be specified together.
SORT BY
An ordering by which the rows are ordered within each partition. This parameter is mutually exclusive
with ORDER BY and CLUSTER BY and cannot be specified together.
CLUSTER BY
A set of expressions that is used to repartition and sort the rows. Using this clause has the same effect as
using DISTRIBUTE BY and SORT BY together.
LIMIT
The maximum number of rows that can be returned by a statement or subquery. This clause is mostly
used in conjunction with ORDER BY to produce a deterministic result.
WINDOW
Defines named window specifications that can be shared by multiple Window functions in the
select_query .
Related articles
CLUSTER BY clause
Common table expression (CTE)
DISTRIBUTE BY clause
GROUP BY clause
HAVING clause
Hints
VALUES clause
JOIN
LATERAL VIEW clause
LIMIT clause
ORDER BY clause
PIVOT clause
TABLESAMPLE clause
Set operator
SORT BY clause
Table-valued function (TVF)
WHERE clause
WINDOW clause
Window functions
SELECT
7/21/2022 • 5 minutes to read
Composes a result set from one or more tables. The SELECT clause can be part of a query which also includes
common table expressions (CTE), set operations, and various other clauses.
Syntax
SELECT [ hints ] [ ALL | DISTINCT ] { named_expression | star_clause } [, ...]
FROM from_item [, ...]
[ LATERAL VIEW clause ]
[ PIVOT clause ]
[ WHERE clause ]
[ GROUP BY clause ]
[ HAVING clause]
[ QUALIFY clause ]
from_item
{ table_name [ TABLESAMPLE clause ] [ table_alias ] |
JOIN clause |
[ LATERAL ] table_valued_function [ table_alias ] |
VALUES clause |
[ LATERAL ] ( query ) [ TABLESAMPLE clause ] [ table_alias ] }
named_expression
expression [ column_alias ]
star_clause
[ { table_name | view_name } . ] * [ except_clause ]
except_clause
EXCEPT ( { column_name | field_name } [, ...] )
Parameters
hints
Hints help the Databricks Runtime optimizer make better planning decisions. Databricks Runtime
supports hints that influence selection of join strategies and repartitioning of the data.
ALL
Select all matching rows from the relation. Enabled by default.
DISTINCT
Select all matching rows from the relation after removing duplicates in results.
named_expression
An expression with an optional assigned name.
expression
A combination of one or more values, operators, and SQL functions that evaluates to a value.
column_alias
An optional column identifier naming the expression result. If no column_alias is provided
Databricks Runtime derives one.
star_clause
A shorthand to name all the referencable columns in the FROM clause. The list of columns is ordered by
the order of from_item s and the order of columns within each from_item .
The _metadata column is not included in this list. You must reference it explicitly.
table_name
If present limits the columns to be named to those in the specified referencable table.
view_name
If specified limits the columns to be expanded to those in the specified referencable view.
except_clause
Since: Databricks Runtime 11.0
Optionally prunes columns or fields from the referencable set of columns identified in the select_star
clause.
column_name
A column that is part of the set of columns that you can reference.
field_name
A reference to a field in a column of the set of columns that you can reference. If you exclude all
fields from a STRUCT , the result is an empty STRUCT .
Each name must reference a column included in the set of columns that you can reference or their fields.
Otherwise, Databricks Runtime raises a MISSING_COLUMN error. If names overlap or are not unique,
Databricks Runtime raises an EXCEPT_OVERLAPPING_COLUMNS error.
from_item
A source of input for the SELECT . One of the following:
table_name
Identifies a table that may contain a temporal specification. See Query an older snapshot of a table
(time travel) for details.
view_name
Identifies a view.
JOIN
Combines two or more relations using a join.
[ LATERAL ] table_valued_function
Invokes a table function. To refer to columns exposed by a preceding from_item in the same FROM
clause you must specify LATERAL .
VALUES
Defines an inline table.
[ LATERAL ] ( query )
Computes a relation using a query. A query prefixed by LATERAL may reference columns exposed
by a preceding from_item in the same FROM clause. Such a construct is called a correlated or
dependent query.
LATERAL is supported since Databricks Runtime 9.0.
TABLESAMPLE
Optionally reduce the size of the result set by only sampling a fraction of the rows.
table_alias
Optionally specifies a label for the from_item . If the table_alias includes column_identifier s
their number must match the number of columns in the from_item .
PIVOT
Used for data perspective; you can get the aggregated values based on specific column value.
LATERAL VIEW
Used in conjunction with generator functions such as EXPLODE , which generates a virtual table containing
one or more rows. LATERAL VIEW applies the rows to each original output row.
WHERE
Filters the result of the FROM clause based on the supplied predicates.
GROUP BY
The expressions that are used to group the rows. This is used in conjunction with aggregate functions (
MIN , MAX , COUNT , SUM , AVG ) to group rows based on the grouping expressions and aggregate values
in each group. When a FILTER clause is attached to an aggregate function, only the matching rows are
passed to that function.
HAVING
The predicates by which the rows produced by GROUP BY are filtered. The HAVING clause is used to filter
rows after the grouping is performed. If you specify HAVING without GROUP BY , it indicates a GROUP BY
without grouping expressions (global aggregate).
QUALIFY
The predicates that are used to filter the results of window functions. To use QUALIFY , at least one
window function is required to be present in the SELECT list or the QUALIFY clause.
A temporal specification takes the form table_name TIMESTAMP AS OF timestamp_expression or
table_name VERSION AS OF version, where timestamp_expression can be any one of:
'2018-10-18T22:15:12.013Z' , that is, a string that can be cast to a timestamp
cast('2018-10-18 13:36:32 CEST' as timestamp)
'2018-10-18' , that is, a date string
In Databricks Runtime 6.6 and above:
current_timestamp() - interval 12 hours
date_sub(current_date(), 1)
Any other expression that is or can be cast to a timestamp
version is a long value that can be obtained from the output of DESCRIBE HISTORY table_spec .
@ syntax
Use the @ syntax to specify the timestamp or version. The timestamp must be in yyyyMMddHHmmssSSS format.
You can specify a version after @ by prepending a v to the version. For example, to query version 123 for the
table events , specify events@v123 .
Examples
-- select all referencable columns from all tables
> SELECT * FROM VALUES(1, 2) AS t1(c1, c2), VALUES(3, 4) AS t2(c3, c4);
1 2 3 4
VALUES clause
Syntax
VALUES {expression | ( expression [, ...] ) } [, ...] [table_alias]
Parameters
expression
A combination of one or more values, operators and SQL functions that results in a value.
table_alias
An optional label to allow the result set to be referenced by name.
Each tuple constitutes a row.
If there is more than one row the number of fields in each tuple must match.
When using the VALUES syntax, if no tuples are specified, each expression equates to a single field tuple.
When using the SELECT syntax all expressions constitute a single row temporary table.
The nth field of each tuple must share a least common type. If table_alias specifies column names, their
number must match the number of expressions per tuple.
The result is a temporary table where each column’s type is the least common type of the matching fields across the tuples.
Examples
-- single row, without a table alias
> VALUES ("one", 1);
one 1
Related articles
Query
SELECT
CLUSTER BY clause
7/21/2022 • 2 minutes to read
Repartitions the data based on the input expressions and then sorts the data within each partition. This is
semantically equivalent to performing a DISTRIBUTE BY followed by a SORT BY. This clause only ensures that the
resultant rows are sorted within each partition and does not guarantee a total order of output.
Syntax
CLUSTER BY expression [, ...]
Parameters
expression
Specifies combination of one or more values, operators and SQL functions that results in a value.
Examples
> CREATE TEMP VIEW person (name, age)
AS VALUES ('Zen Hui', 25),
('Anil B', 18),
('Shone S', 16),
('Mike A', 25),
('John A', 18),
('Jack N', 16);
-- Reduce the number of shuffle partitions to 2 to illustrate the behavior of `CLUSTER BY`.
-- It's easier to see the clustering and sorting behavior with less number of partitions.
> SET spark.sql.shuffle.partitions = 2;
-- Select the rows with no ordering. Please note that without any sort directive, the results
-- of the query is not deterministic. It's included here to show the difference in behavior
-- of a query when `CLUSTER BY` is not used vs when it's used. The query below produces rows
-- where age column is not sorted.
> SELECT age, name FROM person;
16 Shone S
25 Zen Hui
16 Jack N
25 Mike A
18 John A
18 Anil B
-- Produces rows clustered by age. Persons with same age are clustered together.
-- In the query below, persons with age 18 and 25 are in first partition and the
-- persons with age 16 are in the second partition. The rows are sorted based
-- on age within each partition.
> SELECT age, name FROM person CLUSTER BY age;
18 John A
18 Anil B
25 Zen Hui
25 Mike A
16 Shone S
16 Jack N
Related articles
Query
DISTRIBUTE BY
SORT BY
Common table expression (CTE)
7/21/2022 • 2 minutes to read
Defines a temporary result set that you can reference possibly multiple times within the scope of a SQL
statement. A CTE is used mainly in a SELECT statement.
Syntax
WITH common_table_expression [, ...]
common_table_expression
view_identifier [ ( column_identifier [, ...] ) ] [ AS ] ( query )
Parameters
view_identifier
An identifier by which the common_table_expression can be referenced
column_identifier
An optional identifier by which a column of the common_table_expression can be referenced.
If column_identifiers are specified their number must match the number of columns returned by the
query . If no names are specified the column names are derived from the query .
quer y
A query producing a result set.
Examples
-- CTE with multiple column aliases
> WITH t(x, y) AS (SELECT 1, 2)
SELECT * FROM t WHERE x = 1 AND y = 2;
1 2
-- CTE in subquery
> SELECT max(c) FROM (
WITH t(c) AS (SELECT 1)
SELECT * FROM t);
1
Related articles
Query
DISTRIBUTE BY clause
7/21/2022 • 2 minutes to read
Repartitions data based on the input expressions. Unlike the CLUSTER BY clause, does not sort the data within
each partition.
Syntax
DISTRIBUTE BY expression [, ...]
Parameters
expression
An expression of any type.
Examples
> CREATE TEMP VIEW person (name, age)
AS VALUES ('Zen Hui', 25),
('Anil B', 18),
('Shone S', 16),
('Mike A', 25),
('John A', 18),
('Jack N', 16);
-- Reduce the number of shuffle partitions to 2 to illustrate the behavior of `DISTRIBUTE BY`.
-- It's easier to see the clustering and sorting behavior with less number of partitions.
> SET spark.sql.shuffle.partitions = 2;
-- Select the rows with no ordering. Please note that without any sort directive, the result
-- of the query is not deterministic. It's included here to just contrast it with the
-- behavior of `DISTRIBUTE BY`. The query below produces rows where age columns are not
-- clustered together.
> SELECT age, name FROM person;
16 Shone S
25 Zen Hui
16 Jack N
25 Mike A
18 John A
18 Anil B
-- Produces rows clustered by age. Persons with same age are clustered together.
-- Unlike `CLUSTER BY` clause, the rows are not sorted within a partition.
> SELECT age, name FROM person DISTRIBUTE BY age;
25 Zen Hui
25 Mike A
18 John A
18 Anil B
16 Shone S
16 Jack N
Related articles
Query
CLUSTER BY
SORT BY
GROUP BY clause
7/21/2022 • 7 minutes to read
The GROUP BY clause is used to group the rows based on a set of specified grouping expressions and compute
aggregations on the group of rows based on one or more specified aggregate functions. Databricks Runtime
also supports advanced aggregations to do multiple aggregations for the same input record set via
GROUPING SETS , CUBE , ROLLUP clauses. The grouping expressions and advanced aggregations can be mixed in
the GROUP BY clause and nested in a GROUPING SETS clause.
See more details in the Mixed/Nested Grouping Analytics section.
When a FILTER clause is attached to an aggregate function, only the matching rows are passed to that function.
Syntax
GROUP BY group_expression [, ...] [ WITH ROLLUP | WITH CUBE ]
grouping_set
{ expression |
( [ expression [, ...] ] ) }
Parameters
group_expression
Specifies the criteria for grouping rows together. The grouping of rows is performed based on result
values of the grouping expressions. A grouping expression may be a column name like GROUP BY a, a ,
column position like GROUP BY 0 , or an expression like GROUP BY a + b .
grouping_set
A grouping set is specified by zero or more comma-separated expressions in parentheses. When the
grouping set has only one element, parentheses can be omitted. For example, GROUPING SETS ((a), (b)) is
the same as GROUPING SETS (a, b).
GROUPING SETS
Groups the rows for each grouping set specified after GROUPING SETS . For example:
GROUP BY GROUPING SETS ((warehouse), (product)) is semantically equivalent to a union of results of
GROUP BY warehouse and GROUP BY product .
This clause is a shorthand for a UNION ALL where each leg of the UNION ALL operator performs
aggregation of each grouping set specified in the GROUPING SETS clause.
Similarly, GROUP BY GROUPING SETS ((warehouse, product), (product), ()) is semantically equivalent to the
union of results of GROUP BY warehouse, product , GROUP BY product and a global aggregate.
NOTE
For Hive compatibility Databricks Runtime allows GROUP BY ... GROUPING SETS (...) . The GROUP BY expressions are
usually ignored, but if they contain extra expressions in addition to the GROUPING SETS expressions, the extra
expressions will be included in the grouping expressions and the value is always null. For example,
SELECT a, b, c FROM ... GROUP BY a, b, c GROUPING SETS (a, b) , the output of column c is always null.
ROLLUP
Specifies multiple levels of aggregations in a single statement. This clause is used to compute
aggregations based on multiple grouping sets. ROLLUP is a shorthand for GROUPING SETS . For example:
GROUP BY warehouse, product WITH ROLLUP or GROUP BY ROLLUP(warehouse, product) is equivalent to
GROUP BY GROUPING SETS((warehouse, product), (warehouse), ()) .
While GROUP BY ROLLUP(warehouse, product, (warehouse, location))
is equivalent to
GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse), ()) .
The N elements of a ROLLUP specification result in N+1 GROUPING SETS .
CUBE
The CUBE clause is used to perform aggregations based on a combination of grouping columns specified
in the GROUP BY clause. CUBE is a shorthand for GROUPING SETS . For example:
GROUP BY warehouse, product WITH CUBE or GROUP BY CUBE(warehouse, product) is equivalent to
GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ()) .
While GROUP BY CUBE(warehouse, product, (warehouse, location))
is equivalent to
GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location),
(product, warehouse, location), (warehouse), (product), (warehouse, product), ())
.
The N elements of a CUBE specification result in 2^N GROUPING SETS .
aggregate_name
An aggregate function name (MIN, MAX, COUNT, SUM, AVG, etc.).
DISTINCT
Removes duplicates in input rows before they are passed to aggregate functions.
FILTER
Filters the input rows for which the boolean_expression in the WHERE clause evaluates to true are passed
to the aggregate function; other rows are discarded.
Mixed/Nested Grouping Analytics
CUBE and ROLLUP are just syntactic sugar for GROUPING SETS . Refer to the sections above for how to translate
CUBE and ROLLUP to GROUPING SETS . group_expression can be treated as a single-group GROUPING SETS in this
context.
For multiple GROUPING SETS in the GROUP BY clause, Databricks Runtime generates a single GROUPING SETS by
doing a cross-product of the original GROUPING SETS .
For nested GROUPING SETS in the GROUPING SETS clause, Databricks Runtime simply takes its grouping sets and
strips them. For example:
GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ())
and
GROUP BY warehouse, ROLLUP(product), CUBE(location, size)
are equivalent to
GROUP BY GROUPING SETS( (warehouse, product, location, size), (warehouse, product, location), (warehouse,
product, size), (warehouse, product), (warehouse, location, size), (warehouse, location), (warehouse, size),
(warehouse))
.
While GROUP BY GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product))) is equivalent to
GROUP BY GROUPING SETS((warehouse), (warehouse, product)) .
Examples
CREATE TEMP VIEW dealer (id, city, car_model, quantity) AS
VALUES (100, 'Fremont', 'Honda Civic', 10),
(100, 'Fremont', 'Honda Accord', 15),
(100, 'Fremont', 'Honda CRV', 7),
(200, 'Dublin', 'Honda Civic', 20),
(200, 'Dublin', 'Honda Accord', 10),
(200, 'Dublin', 'Honda CRV', 3),
(300, 'San Jose', 'Honda Civic', 5),
(300, 'San Jose', 'Honda Accord', 8);
-- Multiple aggregations.
-- 1. Sum of quantity per dealership.
-- 2. Max quantity per dealership.
> SELECT id, sum(quantity) AS sum, max(quantity) AS max
FROM dealer GROUP BY id ORDER BY id;
id sum max
--- --- ---
100 32 15
200 33 20
300 13 8
-- Sum of only 'Honda Civic' and 'Honda CRV' quantities per dealership.
> SELECT id,
sum(quantity) FILTER (WHERE car_model IN ('Honda Civic', 'Honda CRV')) AS `sum(quantity)`
FROM dealer
GROUP BY id ORDER BY id;
id sum(quantity)
--- -------------
100 17
200 23
300 5
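Using the same dealer data, ROLLUP (a shorthand for the GROUPING SETS described above) adds a grand-total row in which id is NULL; a minimal sketch:
-- Sum of quantity per dealership, plus a grand total.
> SELECT id, sum(quantity) AS sum
    FROM dealer
    GROUP BY id WITH ROLLUP
    ORDER BY id;
id sum
---- ---
NULL 78
100 32
200 33
300 13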
--Get the first row in column `age` ignore nulls,last row in column `id` and sum of column `id`.
> SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id) FROM person;
first(age, true) last(id, false) sum(id)
------------------- ------------------ ----------
30 400 1000
Related articles
QUALIFY
SELECT
HAVING clause
7/21/2022 • 2 minutes to read
Filters the results produced by GROUP BY based on the specified condition. Often used in conjunction with a
GROUP BY clause.
Syntax
HAVING boolean_expression
Parameters
boolean_expression
Any expression that evaluates to a result type BOOLEAN . Two or more expressions may be combined
together using logical operators such as AND or OR .
The expressions specified in the HAVING clause can only refer to:
Constant expressions
Expressions that appear in GROUP BY
Aggregate functions
Examples
> CREATE TABLE dealer (id INT, city STRING, car_model STRING, quantity INT);
> INSERT INTO dealer VALUES
(100, 'Fremont' , 'Honda Civic' , 10),
(100, 'Fremont' , 'Honda Accord', 15),
(100, 'Fremont' , 'Honda CRV' , 7),
(200, 'Dublin' , 'Honda Civic' , 20),
(200, 'Dublin' , 'Honda Accord', 10),
(200, 'Dublin' , 'Honda CRV' , 3),
(300, 'San Jose', 'Honda Civic' , 5),
(300, 'San Jose', 'Honda Accord', 8);
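For example, a query that keeps only cities whose total quantity exceeds a threshold:
-- Keep only cities with a total quantity greater than 30.
> SELECT city, sum(quantity) AS sum
    FROM dealer
    GROUP BY city
    HAVING sum(quantity) > 30
    ORDER BY city;
Dublin 33
Fremont 32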
Related articles
GROUP BY
QUALIFY
SELECT
Hints
7/21/2022 • 4 minutes to read
Syntax
/*+ hint [, ...] */
Partitioning hints
Partitioning hints allow you to suggest a partitioning strategy that Databricks Runtime should follow. COALESCE ,
REPARTITION , and REPARTITION_BY_RANGE hints are supported and are equivalent to coalesce , repartition , and
repartitionByRange Dataset APIs, respectively. These hints give you a way to tune performance and control the
number of output files. When multiple partitioning hints are specified, multiple nodes are inserted into the
logical plan, but the leftmost hint is picked by the optimizer.
Partitioning hint types
COALESCE
Reduce the number of partitions to the specified number of partitions. It takes a partition number as a
parameter.
REPARTITION
Repartition to the specified number of partitions using the specified partitioning expressions. It takes a
partition number, column names, or both as parameters.
REPARTITION_BY_RANGE
Repartition to the specified number of partitions using the specified partitioning expressions. It takes
column names and an optional partition number as parameters.
REBALANCE
The REBALANCE hint can be used to rebalance the query result output partitions, so that every partition is
of a reasonable size (not too small and not too big). It can take column names as parameters, and try its
best to partition the query result by these columns. This is a best-effort: if there are skews, Spark will split
the skewed partitions, to make these partitions not too big. This hint is useful when you need to write the
result of this query to a table, to avoid too small/big files. This hint is ignored if AQE is not enabled.
Examples
> SELECT /*+ COALESCE(3) */ * FROM t;
== Physical Plan ==
Exchange RoundRobinPartitioning(100), false, [id=#121]
+- *(1) ColumnarToRow
+- FileScan parquet default.t[name#29,c#30] Batched: true, DataFilters: [], Format: Parquet,
Location: CatalogFileIndex[file:/spark/spark-warehouse/t], PartitionFilters: [],
PushedFilters: [], ReadSchema: struct<name:string>
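A REPARTITION hint works the same way; as a sketch, the following repartitions the result to three partitions by column c (the table and column are those of the example above):
> SELECT /*+ REPARTITION(3, c) */ * FROM t;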
Join hints
Join hints allow you to suggest the join strategy that Databricks Runtime should use. When different join
strategy hints are specified on both sides of a join, Databricks Runtime prioritizes hints in the following order:
BROADCAST over MERGE over SHUFFLE_HASH over SHUFFLE_REPLICATE_NL . When both sides are specified with the
BROADCAST hint or the SHUFFLE_HASH hint, Databricks Runtime picks the build side based on the join type and the
sizes of the relations. Since a given strategy may not support all join types, Databricks Runtime is not
guaranteed to use the join strategy suggested by the hint.
Join hint types
BROADCAST
Use broadcast join. The join side with the hint is broadcast regardless of autoBroadcastJoinThreshold . If
both sides of the join have the broadcast hints, the one with the smaller size (based on stats) is broadcast.
The aliases for BROADCAST are BROADCASTJOIN and MAPJOIN .
MERGE
Use shuffle sort merge join. The aliases for MERGE are SHUFFLE_MERGE and MERGEJOIN .
SHUFFLE_HASH
Use shuffle hash join. If both sides have the shuffle hash hints, Databricks Runtime chooses the smaller
side (based on stats) as the build side.
SHUFFLE_REPLICATE_NL
Use shuffle-and-replicate nested loop join.
Examples
-- When different join strategy hints are specified on both sides of a join, Databricks Runtime
-- prioritizes the BROADCAST hint over the MERGE hint over the SHUFFLE_HASH hint
-- over the SHUFFLE_REPLICATE_NL hint.
-- Databricks Runtime will issue Warning in the following example
-- org.apache.spark.sql.catalyst.analysis.HintErrorLogger: Hint (strategy=merge)
-- is overridden by another hint and will not take effect.
SELECT /*+ BROADCAST(t1), MERGE(t1, t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
Skew hints
(Delta Lake on Azure Databricks) See Skew join optimization for information about the SKEW hint.
Related statements
SELECT
JOIN
7/21/2022 • 4 minutes to read
Syntax
relation { [ join_type ] JOIN relation join_criteria |
NATURAL join_type JOIN relation |
CROSS JOIN relation }
relation
{ table_name [ table_alias ] |
view_name [ table_alias ] |
[ LATERAL ] ( query ) [ table_alias ] |
( JOIN clause ) [ table_alias ] |
VALUES clause |
[ LATERAL ] table_valued_function [ table_alias ] }
join_type
{ [ INNER ] |
LEFT [ OUTER ] |
[ LEFT ] SEMI |
RIGHT [ OUTER ] |
FULL [ OUTER ] |
[ LEFT ] ANTI |
CROSS }
join_criteria
{ ON boolean_expression |
USING ( column_name [, ...] ) }
Parameters
relation
The relations to be joined.
table_name
A reference to a table, view, or common table expression (CTE).
view_name
A reference to a view, or common table expression (CTE).
[ L ATERAL ] ( quer y )
Any nested query. A query prefixed by LATERAL (Since: Databricks Runtime 9.0) may reference
columns exposed by preceding from_item s in the same FROM clause. Such a construct is called a
correlated or dependent join. A correlated join cannot be a RIGHT OUTER JOIN or a
FULL OUTER JOIN .
( JOIN clause )
A nested invocation of a JOIN.
VALUES clause
A clause that produces an inline temporary table.
[ L ATERAL ] table_valued_function
An invocation of a table function.
join_type
The join type.
[ INNER ]
Returns rows that have matching values in both relations. The default join.
LEFT [ OUTER ]
Returns all values from the left relation and the matched values from the right relation, or appends
NULL if there is no match. It is also referred to as a left outer join.
RIGHT [ OUTER ]
Returns all values from the right relation and the matched values from the left relation, or appends
NULL if there is no match. It is also referred to as a right outer join.
FULL [OUTER]
Returns all values from both relations, appending NULL values on the side that does not have a
match. It is also referred to as a full outer join.
[ LEFT ] SEMI
Returns values from the left side of the relation that has a match with the right. It is also referred to
as a left semi join.
[ LEFT ] ANTI
Returns values from the left relation that has no match with the right. It is also referred to as a left
anti join.
CROSS JOIN
Returns the Cartesian product of two relations.
NATURAL
Specifies that the rows from the two relations will implicitly be matched on equality for all columns with
matching names.
join_criteria
Specifies how the rows from one relation are combined with the rows of another relation.
ON boolean_expression
An expression with a return type of BOOLEAN which specifies how rows from the two relations are
matched. If the result is true the rows are considered a match.
USING ( column_name [, …] )
Matches rows by comparing equality for list of columns column_name which must exist in both
relations.
USING (c1, c2) is a synonym for ON rel1.c1 = rel2.c1 AND rel1.c2 = rel2.c2 .
table_alias
A temporary name with an optional column identifier list.
Notes
When you specify USING or NATURAL , SELECT * will only show one occurrence for each of the columns used to
match.
If you omit the join_criteria the semantic of any join_type becomes that of a CROSS JOIN .
Examples
-- Use employee and department tables to demonstrate different type of joins.
> CREATE TEMP VIEW employee(id, name, deptno) AS
VALUES(105, 'Chloe', 5),
(103, 'Paul' , 3),
(101, 'John' , 1),
(102, 'Lisa' , 2),
(104, 'Evan' , 4),
(106, 'Amy' , 6);
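To illustrate, a hypothetical department relation can be joined to employee; a minimal sketch:
> CREATE TEMP VIEW department(deptno, deptname) AS
     VALUES(3, 'Engineering'),
           (2, 'Sales'),
           (1, 'Marketing');
-- Inner join: only employees whose deptno has a match in department are returned.
> SELECT id, name, deptname
    FROM employee
    INNER JOIN department ON employee.deptno = department.deptno;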
Related articles
SELECT
LATERAL VIEW clause
7/21/2022 • 2 minutes to read
Used in conjunction with generator functions such as EXPLODE , which generates a virtual table containing one
or more rows. LATERAL VIEW applies the rows to each original output row.
Syntax
LATERAL VIEW [ OUTER ] generator_function ( expression [, ...] ) [ table_identifier ] AS column_identifier
[, ...]
Parameters
OUTER
If OUTER specified, returns null if an input array/map is empty or null.
generator_function
A generator function (EXPLODE, INLINE, etc.).
table_identifier
The alias for generator_function , which is optional.
column_identifier
Lists the column aliases of generator_function , which may be used in output rows. The number of
column identifiers must match the number of columns returned by the generator function.
Examples
> CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
> INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');
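As a sketch, exploding an inline array pairs each person row with every generated value (the alias names are illustrative):
-- Each row of person is repeated once per element of the exploded array.
> SELECT id, name, c_age
    FROM person
    LATERAL VIEW EXPLODE(ARRAY(30, 60)) exploded_ages AS c_age;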
Related articles
SELECT
LIMIT clause
7/21/2022 • 2 minutes to read
Constrains the number of rows returned by the Query. In general, this clause is used in conjunction with ORDER
BY to ensure that the results are deterministic.
Syntax
LIMIT { ALL | integer_expression }
Parameters
ALL
If specified, the query returns all the rows. In other words, no limit is applied if this option is specified.
integer_expression
A literal expression that returns an integer.
Examples
> CREATE TEMP VIEW person (name, age)
AS VALUES ('Zen Hui', 25),
('Anil B' , 18),
('Shone S', 16),
('Mike A' , 25),
('John A' , 18),
('Jack N' , 16);
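Combined with ORDER BY, LIMIT returns a deterministic prefix of the result, for example:
-- Return the first two persons ordered by name.
> SELECT name, age FROM person ORDER BY name LIMIT 2;
Anil B 18
Jack N 16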
ORDER BY clause
Returns the result rows in a sorted manner in the user specified order. Unlike the SORT BY clause, this clause
guarantees a total order in the output.
Syntax
ORDER BY { expression [ sort_direction | nulls_sort_order ] } [, ...]
sort_direction
[ ASC | DESC ]
nulls_sort_order
[ NULLS FIRST | NULLS LAST ]
Parameters
expression
An expression of any type used to establish an order in which results are returned.
If the expression is a literal INT value it is interpreted as a column position in the select list.
sort_direction
Specifies the sort order for the order by expression.
ASC : The sort direction for this expression is ascending.
DESC : The sort order for this expression is descending.
If sort direction is not explicitly specified, then by default rows are sorted ascending.
nulls_sort_order
Optionally specifies whether NULL values are returned before or after non-NULL values. If nulls_sort_order
is not specified, then NULLs sort first if the sort order is ASC and last if the sort order is DESC .
NULLS FIRST : NULL values are returned first regardless of the sort order.
NULLS LAST : NULL values are returned last regardless of the sort order.
When specifying more than one expression, sorting occurs left to right. All rows are sorted by the first
expression. If there are duplicate values for the first expression, the second expression is used to resolve order
within the group of duplicates, and so on. The resulting order is not deterministic if there are duplicate values
across all order by expressions.
Examples
> CREATE TABLE person (id INT, name STRING, age INT);
> INSERT INTO person VALUES
(100, 'John' , 30),
(200, 'Mary' , NULL),
(300, 'Mike' , 80),
(400, 'Jerry', NULL),
(500, 'Dan' , 50);
-- Sort rows by age. By default rows are sorted in ascending manner with NULL FIRST.
> SELECT name, age FROM person ORDER BY age;
Jerry NULL
Mary NULL
John 30
Dan 50
Mike 80
-- Sort rows based on more than one column with each column having different
-- sort direction.
> SELECT * FROM person ORDER BY name ASC, age DESC;
500 Dan 50
400 Jerry NULL
100 John 30
200 Mary NULL
300 Mike 80
Related articles
Query
SORT BY
Window functions
PIVOT clause
7/21/2022 • 4 minutes to read
Transforms the intermediate result set of the FROM clause by rotating unique values of a specified column list
into separate columns.
Syntax
PIVOT ( { aggregate_expression [ [ AS ] agg_column_alias ] } [, ...]
FOR column_list IN ( expression_list ) )
column_list
{ column_name |
( column_name [, ...] ) }
expression_list
{ expression [ AS ] [ column_alias ] |
( expression [, ...] ) [ AS ] [ column_alias ] } [, ...]
Parameters
aggregate_expression
An expression of any type where all column references to the FROM clause are arguments to aggregate
functions.
agg_column_alias
An optional alias for the result of the aggregation. If no alias is specified, PIVOT generates an alias based
on aggregate_expression .
column_list
The set of columns to be rotated.
column_name
A column from the FROM clause.
expression_list
Maps values from column_list to column aliases.
expression
A literal expression with a type that shares a least common type with the respective column_name .
The number of expressions in each tuple must match the number of column_names in
column_list .
column_alias
An optional alias specifying the name of the generated column. If no alias is specified PIVOT
generates an alias based on the expression s.
Result
A temporary table of the following form:
All the columns from the intermediate result set of the FROM clause that have not been specified in any
aggregate_expression or column_list .
Examples
-- A very basic PIVOT
-- Given a table with sales by quarter, return a table that returns sales across quarters per year.
> CREATE TEMP VIEW sales(year, quarter, region, sales) AS
VALUES (2018, 1, 'east', 100),
(2018, 2, 'east', 20),
(2018, 3, 'east', 40),
(2018, 4, 'east', 40),
(2019, 1, 'east', 120),
(2019, 2, 'east', 110),
(2019, 3, 'east', 80),
(2019, 4, 'east', 60),
(2018, 1, 'west', 105),
(2018, 2, 'west', 25),
(2018, 3, 'west', 45),
(2018, 4, 'west', 45),
(2019, 1, 'west', 125),
(2019, 2, 'west', 115),
(2019, 3, 'west', 85),
(2019, 4, 'west', 65);
-- To aggregate across regions the column must be removed from the input.
> SELECT year, q1, q2, q3, q4
FROM (SELECT year, quarter, sales FROM sales) AS s
PIVOT (sum(sales) AS sales
FOR quarter
IN (1 AS q1, 2 AS q2, 3 AS q3, 4 AS q4));
2018 205 45 85 85
2019 245 225 165 125
> CREATE TEMP VIEW person (id, name, age, class, address) AS
VALUES (100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');
2018 205 102.5 45 22.5 85 42.5 85 42.5
2019 245 122.5 225 112.5 165 82.5 125 62.5
Related articles
SELECT
Aggregate functions
QUALIFY clause
7/21/2022 • 2 minutes to read
Filters the results of window functions. To use QUALIFY , at least one window function is required to be present in
the SELECT list or the QUALIFY clause.
NOTE
Available in Databricks Runtime 10.0 and above.
Syntax
QUALIFY boolean_expression
Parameters
boolean_expression
Any expression that evaluates to a result type boolean . Two or more expressions may be combined
together using the logical operators ( AND, OR).
The expressions specified in the QUALIFY clause cannot contain aggregate functions.
Examples
CREATE TABLE dealer (id INT, city STRING, car_model STRING, quantity INT);
INSERT INTO dealer VALUES
(100, 'Fremont', 'Honda Civic', 10),
(100, 'Fremont', 'Honda Accord', 15),
(100, 'Fremont', 'Honda CRV', 7),
(200, 'Dublin', 'Honda Civic', 20),
(200, 'Dublin', 'Honda Accord', 10),
(200, 'Dublin', 'Honda CRV', 3),
(300, 'San Jose', 'Honda Civic', 5),
(300, 'San Jose', 'Honda Accord', 8);
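A sketch that keeps, for each city, only the best selling car model by filtering on a window function:
-- For each city keep the car model with the highest quantity.
> SELECT city, car_model, quantity
    FROM dealer
    QUALIFY RANK() OVER (PARTITION BY city ORDER BY quantity DESC) = 1
    ORDER BY city;
Dublin   Honda Civic  20
Fremont  Honda Accord 15
San Jose Honda Accord 8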
Related statements
SELECT
WHERE clause
GROUP BY clause
ORDER BY clause
SORT BY clause
CLUSTER BY clause
DISTRIBUTE BY clause
LIMIT clause
PIVOT clause
LATERAL VIEW clause
TABLESAMPLE clause
7/21/2022 • 2 minutes to read
Syntax
TABLESAMPLE ( { percentage PERCENT |
num_rows ROWS |
BUCKET fraction OUT OF total } )
[ REPEATABLE ( seed ) ]
Parameters
percentage PERCENT
An INTEGER or DECIMAL constant percentage between 0 and 100 specifying which percentage of the
table’s rows to sample.
num_rows ROWS
A constant positive INTEGER expression num_rows specifying an absolute number of rows out of all rows
to sample.
BUCKET fraction OUT OF total
An INTEGER constant fraction specifying the portion out of the INTEGER constant total to sample.
REPEATABLE ( seed )
An optional positive INTEGER constant seed , used to always produce the same set of sampled rows.
NOTE
TABLESAMPLE returns the approximate number of rows or fraction requested.
Always use TABLESAMPLE (percentage PERCENT) if randomness is important. TABLESAMPLE (num_rows ROWS) is not a
simple random sample but instead is implemented using LIMIT .
Examples
> CREATE TEMPORARY VIEW test(id, name) AS
VALUES ( 1, 'Lisa'),
( 2, 'Mary'),
( 3, 'Evan'),
( 4, 'Fred'),
( 5, 'Alex'),
( 6, 'Mark'),
( 7, 'Lily'),
( 8, 'Lucy'),
( 9, 'Eric'),
(10, 'Adam');
> SELECT * FROM test;
5 Alex
8 Lucy
2 Mary
4 Fred
1 Lisa
9 Eric
10 Adam
6 Mark
7 Lily
3 Evan
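The example above does not include a sampling query. A minimal sketch over the test view (the sample is approximate and, without REPEATABLE , non-deterministic):
-- Sample roughly half of the rows, repeatably.
> SELECT * FROM test TABLESAMPLE (50 PERCENT) REPEATABLE (123);
-- Sample an absolute number of rows (implemented using LIMIT).
> SELECT * FROM test TABLESAMPLE (3 ROWS);
-- Sample one bucket out of ten.
> SELECT * FROM test TABLESAMPLE (BUCKET 4 OUT OF 10);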
Related articles
SELECT
Set operators
7/21/2022 • 2 minutes to read
Combines two subqueries into a single one. Databricks Runtime supports three types of set operators:
EXCEPT
INTERSECT
UNION
Syntax
subquery1 { { UNION [ ALL | DISTINCT ] |
INTERSECT [ ALL | DISTINCT ] |
EXCEPT [ ALL | DISTINCT ] } subquery2 } [...]
subquery1 , subquery2
Any two subquery clauses as specified in SELECT. Both subqueries must have the same number of
columns and share a least common type for each respective column.
UNION [ALL | DISTINCT]
Returns the result of subquery1 plus the rows of subquery2 .
If ALL is specified duplicate rows are preserved.
If DISTINCT is specified the result does not contain any duplicate rows. This is the default.
INTERSECT [ALL | DISTINCT]
Returns the set of rows which are in both subqueries.
If ALL is specified a row that appears multiple times in subquery1 as well as in subquery2 will be
returned multiple times.
If DISTINCT is specified the result does not contain duplicate rows. This is the default.
EXCEPT [ALL | DISTINCT ]
Returns the rows in subquery1 which are not in subquery2 .
If ALL is specified, each row in subquery2 will remove exactly one of possibly multiple matches from
subquery1 .
If DISTINCT is specified, duplicate rows are removed from subquery1 before applying the operation, so
all matches are removed and the result will have no duplicate rows (matched or unmatched). This is the
default.
You can specify MINUS as a syntax alternative for EXCEPT .
When chaining set operations INTERSECT has a higher precedence than UNION and EXCEPT .
The type of each result column is the least common type of the respective columns in subquery1 and subquery2
.
Examples
-- Use number1 and number2 tables to demonstrate set operators in this page.
> CREATE TEMPORARY VIEW number1(c) AS VALUES (3), (1), (2), (2), (3), (4);
> CREATE TEMPORARY VIEW number2(c) AS VALUES (5), (1), (1), (2);
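The statements above only create the two views. A minimal sketch of each operator over them (row order in the results is not guaranteed):
-- UNION removes duplicates by default (DISTINCT).
> SELECT c FROM number1 UNION SELECT c FROM number2;
-- INTERSECT ALL keeps a row as many times as it appears in both inputs.
> SELECT c FROM number1 INTERSECT ALL SELECT c FROM number2;
-- EXCEPT returns the distinct rows of number1 that do not appear in number2.
> SELECT c FROM number1 EXCEPT SELECT c FROM number2;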
Related articles
SELECT
SORT BY clause
7/21/2022 • 3 minutes to read
Returns the result rows sorted within each partition in the user specified order. When there is more than one
partition SORT BY may return a result that is only partially ordered. This differs from the ORDER BY clause, which
guarantees a total order of the output.
Syntax
SORT BY { expression [ sort_direction ] [ nulls_sort_order ] } [, ...]
sort_direction
[ ASC | DESC ]
nulls_sort_order
[ NULLS FIRST | NULLS LAST ]
Parameters
expression
An expression of any type used to establish a partition local order in which results are returned.
If the expression is a literal INT value it is interpreted as a column position in the select list.
sort_direction
Specifies the sort order for the sort by expression.
ASC : The sort direction for this expression is ascending.
DESC : The sort order for this expression is descending.
If sort direction is not explicitly specified, then by default rows are sorted ascending.
nulls_sort_order
Optionally specifies whether NULL values are returned before/after non-NULL values. If nulls_sort_order
is not specified, then NULLs sort first if the sort order is ASC and NULLs sort last if the sort order is DESC .
NULLS FIRST : NULL values are returned first regardless of the sort order.
NULLS LAST : NULL values are returned last regardless of the sort order.
When specifying more than one expression sorting occurs left to right. All rows within the partition are sorted
by the first expression. If there are duplicate values for the first expression the second expression is used to
resolve order within the group of duplicates, and so on. The resulting order is not deterministic if there are
duplicate values across all order by expressions.
Examples
> CREATE TEMP VIEW person (zip_code, name, age)
AS VALUES (94588, 'Zen Hui', 50),
(94588, 'Dan Li', 18),
(94588, 'Anil K', 27),
(94588, 'John V', NULL),
(94511, 'David K', 42),
(94511, 'Aryan B.', 18),
(94511, 'Lalit B.', NULL);
-- Sort rows within partition in ascending manner keeping null values to be last.
> SELECT /*+ REPARTITION(zip_code) */ age, name, zip_code FROM person
SORT BY age NULLS LAST;
18 Dan Li 94588
27 Anil K 94588
50 Zen Hui 94588
NULL John V 94588
18 Aryan B. 94511
42 David K 94511
NULL Lalit B. 94511
-- Sort rows by age within each partition in descending manner, which defaults to NULL LAST.
> SELECT /*+ REPARTITION(zip_code) */ age, name, zip_code FROM person
SORT BY age DESC;
50 Zen Hui 94588
27 Anil K 94588
18 Dan Li 94588
NULL John V 94588
42 David K 94511
18 Aryan B. 94511
NULL Lalit B. 94511
-- Sort rows by age within each partition in descending manner keeping null values to be first.
> SELECT /*+ REPARTITION(zip_code) */ age, name, zip_code FROM person
SORT BY age DESC NULLS FIRST;
NULL John V 94588
50 Zen Hui 94588
27 Anil K 94588
18 Dan Li 94588
NULL Lalit B. 94511
42 David K 94511
18 Aryan B. 94511
-- Sort rows within each partition based on more than one column with each column having
-- different sort direction.
> SELECT /*+ REPARTITION(zip_code) */ name, age, zip_code FROM person
SORT BY name ASC, age DESC;
Anil K 27 94588
Dan Li 18 94588
John V null 94588
Zen Hui 50 94588
Aryan B. 18 94511
David K 42 94511
Lalit B. null 94511
Related articles
Query
Table-valued function (TVF)
7/21/2022 • 3 minutes to read
A function that returns a relation or a set of rows. There are two types of TVFs:
Specified in a FROM clause, for example, range .
Specified in SELECT and LATERAL VIEW clauses, for example, explode .
Syntax
function_name ( expression [, ...] ) [ table_alias ]
Parameters
expression
A combination of one or more values, operators, and SQL functions that results in a value.
table_alias
An optional label to reference the function result and its columns.
range (start, end)
Argument types: Long, Long. Creates a table with a single LongType column named id, containing rows in a range from start to end (exclusive) with step value 1.
range (start, end, step)
Argument types: Long, Long, Long. Creates a table with a single LongType column named id, containing rows in a range from start to end (exclusive) with the given step value.
range (start, end, step, numPartitions)
Argument types: Long, Long, Long, Int. Creates a table with a single LongType column named id, containing rows in a range from start to end (exclusive) with the given step value, distributed across numPartitions partitions.
stack (n, expr1, ..., exprk)
Argument types: Seq[Expression]. Separates expr1, ..., exprk into n rows. Uses column names col0, col1, and so on by default unless specified otherwise.
json_tuple (jsonStr, p1, p2, ..., pn)
Argument types: Seq[Expression]. Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.
Examples
-- range call with end
> SELECT * FROM range(6 + cos(3));
0
1
2
3
4
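A minimal sketch of a SELECT-clause TVF as well, using stack to split six values into three rows (the values are illustrative):
-- stack call producing default columns col0 and col1
> SELECT stack(3, 1, 2, 3, 4, 5, 6);
 1 2
 3 4
 5 6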
Related articles
SELECT
WHERE clause
7/21/2022 • 2 minutes to read
Limits the results of the FROM clause of a query or a subquery based on the specified condition.
Syntax
WHERE boolean_expression
Parameters
boolean_expression
Any expression that evaluates to a result type BOOLEAN . You can combine two or more expressions using
the logical operators such as AND or OR .
Examples
> CREATE TABLE person (id INT, name STRING, age INT);
> INSERT INTO person VALUES
(100, 'John', 30),
(200, 'Mary', NULL),
(300, 'Mike', 80),
(400, 'Dan' , 50);
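The statements above only create and populate the person table. A minimal sketch of WHERE queries over it:
-- Filter rows on a simple predicate.
> SELECT * FROM person WHERE age > 40;
 300 Mike 80
 400 Dan 50
-- Combine predicates with logical operators.
> SELECT * FROM person WHERE age IS NOT NULL AND id < 300;
 100 John 30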
Related articles
QUALIFY
SELECT
WINDOW clause
7/21/2022 • 2 minutes to read
The window clause allows you to define and name one or more distinct window specifications once and share
them across many window functions within the same query.
Syntax
WINDOW { window_name AS window_spec } [, ...]
Parameters
window_name
An identifier by which the window specification can be referenced. The identifier must be unique within
the WINDOW clause.
window_spec
A window specification to be shared across one or more window functions.
Examples
> CREATE TABLE employees
(name STRING, dept STRING, salary INT, age INT);
> INSERT INTO employees
VALUES ('Lisa', 'Sales', 10000, 35),
('Evan', 'Sales', 32000, 38),
('Fred', 'Engineering', 21000, 28),
('Alex', 'Sales', 30000, 33),
('Tom', 'Engineering', 23000, 33),
('Jane', 'Marketing', 29000, 28),
('Jeff', 'Marketing', 35000, 38),
('Paul', 'Engineering', 29000, 23),
('Chloe', 'Engineering', 23000, 25);
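The statements above only create and populate the employees table. A minimal sketch of a named window shared by two window functions (the window name w is illustrative):
> SELECT name, dept, salary,
         RANK() OVER w AS salary_rank,
         MIN(salary) OVER w AS min_salary
  FROM employees
  WINDOW w AS (PARTITION BY dept ORDER BY salary);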
Window frame clause
Specifies a sliding subset of rows within the partition on which the aggregate or analytic window function
operates.
Syntax
{ frame_mode frame_start |
frame_mode BETWEEN frame_start AND frame_end }
frame_mode
{ RANGE | ROWS }
frame_start
{ UNBOUNDED PRECEDING |
offset_start PRECEDING |
CURRENT ROW |
offset_start FOLLOWING }
frame_end
{ offset_stop PRECEDING |
CURRENT ROW |
offset_stop FOLLOWING |
UNBOUNDED FOLLOWING }
Parameters
frame_mode
ROWS
If specified, the sliding window frame is expressed in terms of rows preceding or following the
current row.
RANGE
If specified, the window function must specify an ORDER BY clause with a single expression
obExpr .
The boundaries of the sliding window are then expressed as an offset from the obExpr for the
current row.
frame_start
The starting position of the sliding window frame relative to the current row.
UNBOUNDED PRECEDING
Specifies that the window frame starts at the beginning of partition.
offset_start PRECEDING
If the mode is ROWS , offset_start is the positive integral literal number defining how many rows
prior to the current row the frame starts.
If the mode is RANGE , offset_start is a positive literal value of a type which can be subtracted
from obExpr . The frame starts at the first row of the partition for which obExpr is greater or
equal to obExpr - offset_start at the current row.
CURRENT ROW
Specifies that the frame starts at the current row.
offset_start FOLLOWING
If the mode is ROWS , offset_start is the positive integral literal number defining how many rows
after the current row the frame starts. If the mode is RANGE , offset_start is a positive literal
value of a type which can be added to obExpr . The frame starts at the first row of the partition for
which obExpr is greater than or equal to obExpr + offset_start at the current row.
frame_end
The end of the sliding window frame relative to the current row.
If not specified, the frame stops at the CURRENT ROW. The end of the sliding window must be greater
than the start of the window frame.
offset_stop PRECEDING
If frame_mode is ROWS , offset_stop is the positive integral literal number defining how many
rows prior to the current row the frame stops. If frame_mode is RANGE , offset_stop is a positive
literal value of the same type as offset_start . The frame ends at the last row of the partition for
which obExpr is less than or equal to obExpr - offset_stop at the current row.
CURRENT ROW
Specifies that the frame stops at the current row.
offset_stop FOLLOWING
If frame_mode is ROWS , offset_stop is the positive integral literal number defining how many
rows after the current row the frame ends. If frame_mode is RANGE , offset_stop is a positive
literal value of the same type as offset_start . The frame ends at the last row of the partition for
which obExpr is less than or equal to obExpr + offset_stop at the current row.
UNBOUNDED FOLLOWING
Specifies that the window frame stops at the end of the partition.
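As a minimal sketch of an explicit frame (reusing the employees table from the example above), a ROWS frame computing a running total per department:
> SELECT name, dept, salary,
         SUM(salary) OVER (PARTITION BY dept ORDER BY salary
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
  FROM employees;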
Related articles
Window functions
ANALYZE TABLE
7/21/2022 • 2 minutes to read
The ANALYZE TABLE statement collects statistics about one specific table or all the tables in one specified schema,
that are to be used by the query optimizer to find a better query execution plan.
IMPORTANT
You can run ANALYZE TABLE on Delta tables only on Databricks Runtime 8.3 and above.
Syntax
ANALYZE TABLE table_name [ PARTITION clause ]
COMPUTE STATISTICS [ NOSCAN | FOR COLUMNS col1 [, ...] | FOR ALL COLUMNS ]
Parameters
table_name
Identifies the table to be analyzed. The name must not include a temporal specification.
PARTITION clause
Optionally limits the command to a subset of partitions.
[ NOSCAN | FOR COLUMNS col [, …] | FOR ALL COLUMNS ]
If no analyze option is specified, ANALYZE TABLE collects the table’s number of rows and size in bytes.
NOSCAN
Collect only the table’s size in bytes ( which does not require scanning the entire table ).
FOR COLUMNS col [, …] | FOR ALL COLUMNS
Collect column statistics for each column specified, or alternatively for every column, as well as
table statistics.
{ FROM | IN } schema_name
Specifies the name of the schema to be analyzed. Without a schema name, ANALYZE TABLES collects all
tables in the current schema that the current user has permission to analyze.
NOSCAN
Collects only the table’s size in bytes (which does not require scanning the entire table).
FOR COLUMNS col [ , … ] | FOR ALL COLUMNS
Collects column statistics for each column specified, or alternatively for every column, as well as table
statistics.
If no analyze option is specified, both number of rows and size in bytes are collected.
Examples
> CREATE TABLE students (name STRING, student_id INT) PARTITIONED BY (student_id);
> INSERT INTO students PARTITION (student_id = 111111) VALUES ('Mark');
> INSERT INTO students PARTITION (student_id = 222222) VALUES ('John');
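The statements above only create and populate the table. A minimal sketch of the ANALYZE TABLE variants over it:
-- Collect only the table's size in bytes (no scan of the data).
> ANALYZE TABLE students COMPUTE STATISTICS NOSCAN;
-- Collect table statistics for a single partition.
> ANALYZE TABLE students PARTITION (student_id = 111111) COMPUTE STATISTICS;
-- Collect table statistics plus column statistics for the name column.
> ANALYZE TABLE students COMPUTE STATISTICS FOR COLUMNS name;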
Related articles
PARTITION
CACHE TABLE
7/21/2022 • 2 minutes to read
Caches contents of a table or output of a query with the given storage level in Apache Spark cache. If a query is
cached, then a temp view is created for this query. This reduces scanning of the original files in future queries.
Syntax
CACHE [ LAZY ] TABLE table_name
[ OPTIONS ( 'storageLevel' [ = ] value ) ] [ [ AS ] query ]
See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.
Parameters
LAZY
Only cache the table when it is first used, instead of immediately.
table_name
Identifies the Delta table or view to cache. The name must not include a temporal specification.
OPTIONS ( ‘storageLevel’ [ = ] value )
OPTIONS clause with storageLevel key and value pair. A warning is issued when a key other than
storageLevel is used. The valid options for storageLevel are:
NONE
DISK_ONLY
DISK_ONLY_2
MEMORY_ONLY
MEMORY_ONLY_2
MEMORY_ONLY_SER
MEMORY_ONLY_SER_2
MEMORY_AND_DISK
MEMORY_AND_DISK_2
MEMORY_AND_DISK_SER
MEMORY_AND_DISK_SER_2
OFF_HEAP
An Exception is thrown when an invalid value is set for storageLevel . If storageLevel is not explicitly set
using OPTIONS clause, the default storageLevel is set to MEMORY_AND_DISK .
query
A query that produces the rows to be cached. It can be in one of following formats:
A SELECT statement
A TABLE statement
A FROM statement
Examples
> CACHE TABLE testCache OPTIONS ('storageLevel' 'DISK_ONLY') SELECT * FROM testData;
Related statements
CLEAR CACHE
UNCACHE TABLE
REFRESH TABLE
REFRESH
REFRESH FUNCTION
CLEAR CACHE
7/21/2022 • 2 minutes to read
Removes the entries and associated data from the in-memory and/or on-disk cache for all cached tables and
views in Apache Spark cache.
Syntax
> CLEAR CACHE
See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.
Examples
> CLEAR CACHE;
Related statements
CACHE TABLE
UNCACHE TABLE
REFRESH TABLE
REFRESH
REFRESH FUNCTION
REFRESH
7/21/2022 • 2 minutes to read
Invalidates and refreshes all the cached data (and the associated metadata) in Apache Spark cache for all
Datasets that contain the given data source path. Path matching is by prefix, that is, / would invalidate
everything that is cached.
Syntax
REFRESH resource_path
See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.
Parameters
resource_path
The path of the resource that is to be refreshed.
Examples
-- The Path is resolved using the datasource's File Index.
> CREATE TABLE test(ID INT) using parquet;
> INSERT INTO test SELECT 1000;
> CACHE TABLE test;
> INSERT INTO test SELECT 100;
> REFRESH "hdfs://path/to/table";
Related statements
CACHE TABLE
CLEAR CACHE
UNCACHE TABLE
REFRESH TABLE
REFRESH FUNCTION
REFRESH FUNCTION
7/21/2022 • 2 minutes to read
Invalidates the cached function entry for Apache Spark cache, which includes a class name and resource location
of the given function. The invalidated cache is populated right away. Note that REFRESH FUNCTION only works for
permanent functions. Refreshing native functions or temporary functions will cause an exception.
Syntax
REFRESH FUNCTION function_name
See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.
Parameters
function_name
A function name. If the name is unqualified, the current schema is used.
Examples
-- The cached entry of the function is refreshed
-- The function is resolved from the current schema as the function name is unqualified.
> REFRESH FUNCTION func1;
Related statements
CACHE TABLE
CLEAR CACHE
UNCACHE TABLE
REFRESH TABLE
REFRESH
REFRESH TABLE
7/21/2022 • 2 minutes to read
Invalidates the cached entries of the Apache Spark cache, which include data and metadata of the given table or
view. The invalidated cache is populated in a lazy manner when the cached table or the query associated with it is
executed again.
Syntax
REFRESH [TABLE] table_name
See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.
Parameters
table_name
Identifies the Delta table or view to refresh. The name must not include a temporal specification.
Examples
-- The cached entries of the table are refreshed
-- The table is resolved from the current schema as the table name is unqualified.
> REFRESH TABLE tbl1;
Related statements
CACHE TABLE
CLEAR CACHE
UNCACHE TABLE
REFRESH
REFRESH FUNCTION
UNCACHE TABLE
7/21/2022 • 2 minutes to read
Removes the entries and associated data from the in-memory and/or on-disk cache for a given table or view in
Apache Spark cache. The underlying entries should already have been brought to cache by previous
CACHE TABLE operation. UNCACHE TABLE on a non-existent table throws an exception if IF EXISTS is not
specified.
Syntax
UNCACHE TABLE [ IF EXISTS ] table_name
See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.
Parameters
table_name
Identifies the Delta table or view to uncache. The name must not include a temporal specification.
Examples
> UNCACHE TABLE t1;
Related statements
CACHE TABLE
CLEAR CACHE
REFRESH TABLE
REFRESH
REFRESH FUNCTION
DESCRIBE CATALOG
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Returns the metadata of an existing catalog. The metadata information includes catalog name, comment, and
owner. If the optional EXTENDED option is specified, it returns the basic metadata information along with the
other catalog properties.
Since: Databricks Runtime 10.3
Syntax
{ DESC | DESCRIBE } CATALOG [ EXTENDED ] catalog_name
Parameters
catalog_name : The name of an existing catalog in the metastore. If the name does not exist, an exception is
thrown.
Examples
> DESCRIBE CATALOG main;
info_name info_value
------------ ------------------------------------
Catalog Name main
Comment Main catalog (auto-created)
Owner metastore-admin-users
Related articles
DESCRIBE DATABASE
DESCRIBE FUNCTION
DESCRIBE QUERY
DESCRIBE TABLE
INFORMATION_SCHEMA.CATALOGS
DESCRIBE STORAGE CREDENTIAL
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Returns the metadata of an existing storage credential. The metadata information includes credential name,
comment, owner and other metadata.
You must be an account or metastore admin to execute this command.
Since: Databricks Runtime 10.3
Syntax
DESCRIBE STORAGE CREDENTIAL credential_name
Parameters
credential_name
The name of an existing storage credential in the metastore. If the name does not exist, an exception is
thrown.
Examples
> DESCRIBE STORAGE CREDENTIAL good_cred;
name owner created_at created_by credential
--------- ------ ------------------------ ------------ ---------------------------------------------
good_cred admins 2022-01-01T08:00:00.0000 jane@doe.com AwsIamRole:arn:aws:iam:123456789012:roe/us....
Related articles
ALTER STORAGE CREDENTIAL
DROP STORAGE CREDENTIAL
SHOW STORAGE CREDENTIALS
DESCRIBE DATABASE
7/21/2022 • 2 minutes to read
An alias for DESCRIBE SCHEMA. While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.
Related articles
DESCRIBE CATALOG
DESCRIBE FUNCTION
DESCRIBE QUERY
DESCRIBE SCHEMA
DESCRIBE TABLE
INFORMATION_SCHEMA.SCHEMATA
DESCRIBE FUNCTION
7/21/2022 • 2 minutes to read
Returns the basic metadata information of an existing function. The metadata information includes the function
name, implementing class and the usage details. If the optional EXTENDED option is specified, the basic metadata
information is returned along with the extended usage information.
Syntax
{ DESC | DESCRIBE } FUNCTION [ EXTENDED ] function_name
Parameters
function_name
The name of an existing function in the metastore. The function name may be optionally qualified with a
schema name. If function_name is qualified with a schema then the function is resolved from the user
specified schema, otherwise it is resolved from the current schema.
Examples
-- Describe a builtin scalar function.
-- Returns function name, implementing class and usage
> DESCRIBE FUNCTION abs;
Function: abs
Class: org.apache.spark.sql.catalyst.expressions.Abs
Usage: abs(expr) - Returns the absolute value of the numeric value.
Related articles
DESCRIBE SCHEMA
DESCRIBE TABLE
DESCRIBE QUERY
DESCRIBE EXTERNAL LOCATION
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Returns the metadata of an existing external location. The metadata information includes location name, URL,
associated credential, owner, and timestamps of creation and last modification.
Since: Databricks Runtime 10.3
Syntax
DESCRIBE EXTERNAL LOCATION location_name
Parameters
location_name
The name of an existing external location in the metastore. If the name does not exist, an exception is
thrown.
Examples
> DESCRIBE EXTERNAL LOCATION best_loco;
name : best_loco
url : abfss://us-east-1-dev/best_location
credential_name : good_credential
owner : scooby@doo.com
created_by : scooby@doo.com
created_at : 2021-11-12 13:51
comment : Nice place
Related articles
ALTER EXTERNAL LOCATION
CREATE EXTERNAL LOCATION
DROP EXTERNAL LOCATION
SHOW EXTERNAL LOCATIONS
DESCRIBE QUERY
7/21/2022 • 2 minutes to read
Syntax
{ DESC | DESCRIBE } [ QUERY ] input_statement
Parameters
QUERY
This clause is optional and may be omitted.
query
The query to be described.
Examples
-- Create table `person`
> CREATE TABLE person (name STRING , age INT COMMENT 'Age column', address STRING);
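The statement above only creates the person table. A minimal sketch of DESCRIBE QUERY over it (the exact output formatting may differ):
> DESCRIBE QUERY SELECT age, sum(age) AS total_age FROM person GROUP BY age;
 col_name data_type comment
 --------- --------- ----------
 age int Age column
 total_age bigint null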
Related articles
DESCRIBE SCHEMA
DESCRIBE TABLE
DESCRIBE FUNCTION
DESCRIBE RECIPIENT
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Returns the metadata of an existing recipient. The metadata information includes recipient name, and activation
link.
Since: Databricks Runtime 10.3
Syntax
[ DESC | DESCRIBE ] RECIPIENT recipient_name
Parameters
recipient_name
The name of an existing recipient. If the name does not exist, an exception is thrown.
Examples
> CREATE RECIPIENT other_org COMMENT 'other.org';
> DESCRIBE RECIPIENT other_org;
name : other_org
created_at : 2022-01-01T00:00:00.000+0000
created_by : alwaysworks@databricks.com
comment : other.org
activation_link : https://....
active_token_id : 0160c81f-5262-40bb-9b03-3ee12e6d98d7
active_token_expiration_time : 9999-12-31T23:59:59.999+0000
rotated_token_id : NULL
rotated_token_expiration_time : NULL
Related articles
CREATE RECIPIENT
DROP RECIPIENT
SHOW RECIPIENTS
DESCRIBE SCHEMA
7/21/2022 • 2 minutes to read
Returns the metadata of an existing schema. The metadata information includes the schema’s name, comment,
and location on the filesystem. If the optional EXTENDED option is specified, schema properties are also returned.
While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.
Syntax
{ DESC | DESCRIBE } SCHEMA [ EXTENDED ] schema_name
Parameters
schema_name : The name of an existing schema in the system. If the name does not exist, an
exception is thrown.
Examples
-- Create employees SCHEMA
> CREATE SCHEMA employees COMMENT 'For software companies';
-- Describe employees SCHEMA with EXTENDED option to return additional schema properties
> DESCRIBE SCHEMA EXTENDED employees;
database_description_item database_description_value
------------------------- ---------------------------------------------
Database Name employees
Description For software companies
Location file:/you/Temp/employees.db
Properties ((Create-by,kevin), (Create-date,09/01/2019))
-- Describe deployment.
> DESCRIBE SCHEMA deployment;
database_description_item database_description_value
------------------------- ------------------------------
Database Name deployment
Description Deployment environment
Location file:/you/Temp/deployment.db
Related articles
DESCRIBE FUNCTION
DESCRIBE QUERY
DESCRIBE TABLE
INFORMATION_SCHEMA.SCHEMATA
DESCRIBE SHARE
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Returns the metadata of an existing share. The metadata information includes share name, owner, and
timestamps of creation and last modification.
To list the content of a share use SHOW ALL IN SHARE.
Since: Databricks Runtime 10.3
Syntax
[ DESC | DESCRIBE ] SHARE share_name
Parameters
share_name
The name of an existing share. If the name does not exist, an exception is thrown.
Examples
> CREATE SHARE vaccine COMMENT 'vaccine data to publish';
> DESCRIBE SHARE vaccine;
name created_at created_by comment
--------- ---------------------------- -------------------------- -----------------------
vaccine 2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com vaccine data to publish
Related articles
ALTER SHARE
CREATE SHARE
DROP SHARE
SHOW ALL IN SHARE
SHOW SHARES
DESCRIBE TABLE
7/21/2022 • 3 minutes to read
Returns the basic metadata information of a table. The metadata information includes column name, column
type and column comment. Optionally you can specify a partition spec or column name to return the metadata
pertaining to a partition or column respectively.
Syntax
{ DESC | DESCRIBE } [ TABLE ] [ EXTENDED | FORMATTED ] table_name { [ PARTITION clause ] | [ column_name ] }
Parameters
EXTENDED or FORMATTED
If specified display detailed information about the specified columns, including the column statistics
collected by the command, and additional metadata information (such as schema qualifier, owner, and
access time).
table_name
Identifies the table to be described. The name may not use a temporal specification.
PARTITION clause
An optional parameter directing Databricks Runtime to return addition metadata for the named
partitions.
column_name
An optional parameter with the column name that needs to be described. Currently nested columns are
not allowed to be specified.
The PARTITION clause and column_name are mutually exclusive and cannot be specified together.
Examples
-- Creates a table `customer`. Assumes current schema is `salesdb`.
> CREATE TABLE customer(
cust_id INT,
state VARCHAR(20),
name STRING COMMENT 'Short name'
)
USING parquet
PARTITIONED BY (state);
> INSERT INTO customer PARTITION (state = 'AR') VALUES (100, 'Mike');
-- Returns additional metadata such as parent schema, owner, access time etc.
> DESCRIBE TABLE EXTENDED customer;
col_name data_type comment
---------------------------- ------------------------------ ----------
cust_id int null
name string Short name
state string null
# Partition Information
# col_name data_type comment
state string null
-- Returns partition metadata such as partitioning column name, column type and comment.
> DESCRIBE TABLE EXTENDED customer PARTITION (state = 'AR');
col_name data_type comment
------------------------------ ------------------------------ ----------
cust_id int null
name string Short name
state string null
# Partition Information
# col_name data_type comment
state string null
# Storage Information
Location file:/tmp/salesdb.db/custom...
Serde Library org.apache.hadoop.hive.ql.i...
InputFormat org.apache.hadoop.hive.ql.i...
OutputFormat org.apache.hadoop.hive.ql.i...
> CREATE TABLE T(pk1 INTEGER NOT NULL, pk2 INTEGER NOT NULL,
CONSTRAINT t_pk PRIMARY KEY(pk1, pk2));
> CREATE TABLE S(pk INTEGER NOT NULL PRIMARY KEY,
fk1 INTEGER, fk2 INTEGER,
CONSTRAINT s_t_fk FOREIGN KEY(fk1, fk2) REFERENCES T);
# Constraints
t_pk PRIMARY KEY (pk1, pk2)
# Constraints
s_pk PRIMARY KEY (pk)
s_fk_p FOREIGN KEY (fk1, fk2) REFERENCES default.t (pk1, pk2)
DESCRIBE DETAIL
DESCRIBE DETAIL [schema_name.]table_name
Return information about schema, partitioning, table size, and so on. For example, for Delta tables, you can see
the current reader and writer versions of a table. See Detail schema for the detail schema.
Related articles
DESCRIBE FUNCTION
DESCRIBE QUERY
DESCRIBE SCHEMA
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.TABLES
PARTITION
LIST
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Syntax
LIST url [ WITH ( CREDENTIAL credential_name ) ] [ LIMIT limit ]
Parameters
url
A STRING literal with the location of the cloud storage described as an absolute URL.
credential_name
An optional named credential used to access this URL. If you supply a credential it must be sufficient to
access the URL. If you do not supply a credential, the URL must be contained in an external location to
which you have access.
limit
An optional INTEGER constant between 1 and 1001 used to limit the number of objects returned. The
default limit is 1001 .
Examples
> LIST 'abfss://us-east-1-dev/some_dir' WITH (CREDENTIAL azure_some_dir) LIMIT 2
path name size modification_time is_directory
------------------------------------- ------ ---- ----------------- ------------
abfss://us-east-1-dev/some_dir/table1 table1 0 ... true
abfss://us-east-1-dev/some_dir/table1 table1 0 ... true
Related articles
ALTER EXTERNAL LOCATION
CREATE EXTERNAL LOCATION
DESCRIBE EXTERNAL LOCATION
DROP EXTERNAL LOCATION
SHOW ALL IN SHARE
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Syntax
SHOW ALL IN SHARE share_name
Parameters
share_name
The name of an existing share. If the name does not exist, an exception is thrown.
Examples
-- Create share `customer_share` only if share with same name doesn't exist, with a comment.
> CREATE SHARE IF NOT EXISTS customer_share COMMENT 'This is customer share';
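The statement above only creates the share. A minimal sketch of listing its content (the table name main.default.customer is an assumption):
-- Add a table to the share, then list everything the share contains.
> ALTER SHARE customer_share ADD TABLE main.default.customer;
> SHOW ALL IN SHARE customer_share;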
Related articles
ALTER SHARE
CREATE SHARE
DESCRIBE SHARE
DROP SHARE
SHOW SHARES
SHOW COLUMNS
7/21/2022 • 2 minutes to read
Returns the list of columns in a table. If the table does not exist, an exception is thrown.
Syntax
SHOW COLUMNS { IN | FROM } table_name [ { IN | FROM } schema_name ]
NOTE
Keywords IN and FROM are interchangeable.
Parameters
table_name
Identifies the table. The name must not include a temporal specification.
schema_name
An optional alternative means of qualifying the table_name with a schema name. When this parameter is
specified then table name should not be qualified with a different schema name.
Examples
-- Create `customer` table in the `salessc` schema;
> USE SCHEMA salessc;
> CREATE TABLE customer(
cust_cd INT,
name VARCHAR(100),
cust_addr STRING);
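The statements above only create the table. A minimal sketch of listing its columns:
> SHOW COLUMNS IN customer;
 col_name
 ---------
 cust_cd
 name
 cust_addr
-- Qualify the table with a schema name instead.
> SHOW COLUMNS IN customer IN salessc;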
Related articles
DESCRIBE TABLE
INFORMATION_SCHEMA.COLUMNS
SHOW TABLE
SHOW CREATE TABLE
7/21/2022 • 2 minutes to read
Returns the CREATE TABLE statement or CREATE VIEW statement that was used to create a given table or view.
SHOW CREATE TABLE on a non-existent table or a temporary view throws an exception.
Syntax
SHOW CREATE TABLE { table_name | view_name }
Parameters
table_name
Identifies the table. The name must not include a temporal specification.
Examples
> CREATE TABLE test (c INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('prop1' = 'value1', 'prop2' = 'value2');
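The statement above only creates the table. A minimal sketch of retrieving its definition (the exact output is omitted):
> SHOW CREATE TABLE test;
-- Returns the full CREATE TABLE statement for `test`, including the row format,
-- storage format, and the TBLPROPERTIES specified above.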
Related articles
CREATE TABLE
CREATE VIEW
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.TABLES
INFORMATION_SCHEMA.VIEWS
SHOW STORAGE CREDENTIALS
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Syntax
SHOW STORAGE CREDENTIALS
Parameters
The statement takes no parameters.
Examples
> SHOW STORAGE CREDENTIALS
name comment
------------ -----------------
some_creds Used to access s3
Related articles
DESCRIBE STORAGE CREDENTIAL
DROP STORAGE CREDENTIAL
SHOW DATABASES
7/21/2022 • 2 minutes to read
An alias for SHOW SCHEMAS. While usage of SCHEMAS and DATABASES is interchangeable, SCHEMAS is preferred.
Related articles
ALTER SCHEMA
CREATE SCHEMA
DESCRIBE SCHEMA
INFORMATION_SCHEMA.SCHEMATA
SHOW SCHEMAS
SHOW FUNCTIONS
7/21/2022 • 2 minutes to read
Returns the list of functions after applying an optional regex pattern. Databricks Runtime supports a large
number of functions. You can use SHOW FUNCTIONS in conjunction with describe function to quickly find a
function and learn how to use it. The LIKE clause is optional, and ensures compatibility with other systems.
Syntax
SHOW [ function_kind ] FUNCTIONS [ { FROM | IN } schema_name ]
[ [ LIKE ] { function_name | regex_pattern } ]
function_kind
{ USER | SYSTEM | ALL }
Parameters
function_kind
The name space of the function to be searched upon. The valid name spaces are:
USER - Looks up the function(s) among the user defined functions.
SYSTEM - Looks up the function(s) among the system defined functions.
ALL - Looks up the function(s) among both user and system defined functions.
schema_name
Since: Databricks Runtime 10.3
Specifies the schema in which functions are to be listed.
function_name
A name of an existing function in the system. If schema_name is not provided the function name may be
qualified with a schema name instead. If function_name is not qualified and schema_name has not been
specified the function is resolved from the current schema.
regex_pattern
A regular expression pattern that is used to filter the results of the statement.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.
Examples
-- List a system function `trim` by searching both user defined and system
-- defined functions.
> SHOW FUNCTIONS trim;
trim
-- Use normal regex pattern to list function names that has 4 characters
-- with `t` as the starting character.
> SHOW FUNCTIONS LIKE 't[a-z][a-z][a-z]';
tanh
trim
Related articles
DESCRIBE FUNCTION
SHOW GROUPS
7/21/2022 • 2 minutes to read
Syntax
SHOW GROUPS [ WITH USER user_principal |
WITH GROUP group_principal ]
[ [ LIKE ] regex_pattern ]
Parameters
user_principal
Show only groups that contain the specified user.
group_principal
Show only groups that contain the specified group.
regex_pattern
A STRING literal with a limited regular expression pattern used to filter the results of the statement.
* at the start and end of a pattern matches on a substring.
* only at end of a pattern matches the start of a group.
| separates multiple regular expressions, any of which can match.
The pattern match is case-insensitive.
Examples
-- Lists all groups.
> SHOW GROUPS;
name directGroup
------------ -----------
tv_alien NULL
alien NULL
californian NULL
pastafarian NULL
Related articles
SHOW GRANTS
SHOW USERS
INFORMATION_SCHEMA.CATALOG_PRIVILEGES
INFORMATION_SCHEMA.SCHEMA_PRIVILEGES
INFORMATION_SCHEMA.TABLE_PRIVILEGES
SHOW EXTERNAL LOCATIONS
7/21/2022 • 2 minutes to read
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
Lists the external locations that match an optionally supplied regular expression pattern. If no pattern is supplied
then the command lists all the external locations in the metastore.
Since: Databricks Runtime 10.3
Syntax
SHOW EXTERNAL LOCATIONS [ LIKE regex_pattern ]
Parameters
regex_pattern
A regular expression pattern that is used to filter the results of the statement.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.
Examples
> SHOW EXTERNAL LOCATIONS;
name url comment
------------ ----------------------------------- ----------
best_loco abfss://us-east-1-dev/best_location Nice place
three_tatami abfss://us-west-1-dev/tatami_xs Quite cozy
Related articles
ALTER EXTERNAL LOCATION
CREATE EXTERNAL LOCATION
DESCRIBE EXTERNAL LOCATION
DROP EXTERNAL LOCATION
SHOW PARTITIONS
7/21/2022 • 2 minutes to read
Syntax
SHOW PARTITIONS table_name [ PARTITION clause ]
Parameters
table_name
Identifies the table. The name must not include a temporal specification.
PARTITION clause
An optional parameter that specifies a partition. If the specification is only a partial all matching partitions
are returned. If no partition is specified at all Databricks Runtime returns all partitions.
Examples
-- create a partitioned table and insert a few rows.
> USE salesdb;
> CREATE TABLE customer(id INT, name STRING) PARTITIONED BY (state STRING, city STRING);
> INSERT INTO customer PARTITION (state = 'CA', city = 'Fremont') VALUES (100, 'John');
> INSERT INTO customer PARTITION (state = 'CA', city = 'San Jose') VALUES (200, 'Marry');
> INSERT INTO customer PARTITION (state = 'AZ', city = 'Peoria') VALUES (300, 'Daniel');
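The statements above only create and populate the table. A minimal sketch of listing its partitions (the output ordering is illustrative):
> SHOW PARTITIONS customer;
 partition
 -----------------------
 state=AZ/city=Peoria
 state=CA/city=Fremont
 state=CA/city=San Jose
-- Limit the result to a partial partition specification.
> SHOW PARTITIONS customer PARTITION (state = 'CA');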
SHOW RECIPIENTS
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Lists the recipients that match an optionally supplied regular expression pattern. If no pattern is supplied then
the command lists all the recipients in the metastore.
Since: Databricks Runtime 10.3
Syntax
SHOW RECIPIENTS [ LIKE regex_pattern ]
Parameters
regex_pattern
A regular expression pattern that is used to filter the results of the statement.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.
Examples
> CREATE RECIPIENT other_org COMMENT 'other.org';
> CREATE RECIPIENT better_corp COMMENT 'better.com';
> SHOW RECIPIENTS;
name created_at created_by comment
----------- ---------------------------- -------------------------- ----------
other_org 2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com other.org
better_corp 2022-01-01T00:00:01.000+0000 alwaysworks@databricks.com better.com
Related articles
CREATE RECIPIENT
DESCRIBE RECIPIENT
DROP RECIPIENT
SHOW SCHEMAS
7/21/2022 • 2 minutes to read
Lists the schemas that match an optionally supplied regular expression pattern. If no pattern is supplied then the
command lists all the schemas in the system.
While usage of SCHEMAS and DATABASES is interchangeable, SCHEMAS is preferred.
Syntax
SHOW SCHEMAS [ LIKE regex_pattern ]
Parameters
regex_pattern
A regular expression pattern that is used to filter the results of the statement.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.
Examples
-- Create schema. Assumes a schema named `default` already exists in
-- the system.
> CREATE SCHEMA payroll_sc;
> CREATE SCHEMA payments_sc;
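The statements above only create the schemas. A minimal sketch of listing them with a pattern (the output column name may differ by runtime version):
> SHOW SCHEMAS LIKE 'pay*';
 databaseName
 ------------
 payments_sc
 payroll_sc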
SHOW SHARES
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Lists the shares that match an optionally supplied regular expression pattern. If no pattern is supplied then the
command lists all the shares in the metastore.
To list the content of a share use SHOW ALL IN SHARE.
Since: Databricks Runtime 10.3
Syntax
SHOW SHARES [ LIKE regex_pattern ]
Parameters
regex_pattern
A regular expression pattern that is used to filter the results of the statement.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.
Examples
> CREATE SHARE vaccine COMMENT 'vaccine data to publish';
> CREATE SHARE meds COMMENT 'meds data to publish';
> SHOW SHARES;
name created_at created_by comment
--------- ---------------------------- -------------------------- -----------------------
vaccine 2022-01-01T00:00:00.000+0000 alwaysworks@databricks.com vaccine data to publish
meds 2022-01-01T00:00:01.000+0000 alwaysworks@databricks.com meds data to publish
Related articles
ALTER SHARE
CREATE SHARE
DESCRIBE SHARE
DROP SHARE
SHOW ALL IN SHARE
SHOW TABLE EXTENDED
7/21/2022 • 2 minutes to read
Shows information for all tables matching the given regular expression. Output includes basic table information
and file system information like Last Access , Created By , Type , Provider , Table Properties , Location ,
Serde Library , InputFormat , OutputFormat , Storage Properties , Partition Provider , Partition Columns , and
Schema .
If a partition specification is present, it outputs the given partition’s file-system-specific information such as
Partition Parameters and Partition Statistics . You cannot use a table regex with a partition specification.
Syntax
SHOW TABLE EXTENDED [ { IN | FROM } schema_name ] LIKE regex_pattern
[ PARTITION clause ]
Parameters
schema_name
Specifies schema name. If not provided, uses the current schema.
regex_pattern
The regular expression pattern used to filter out unwanted tables.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.
PARTITION clause
Optionally specifying partitions. You cannot use a table regex pattern with a PARTITION clause.
Examples
-- Assumes `employee` table partitioned by column `grade`
> CREATE TABLE employee(name STRING, grade INT) PARTITIONED BY (grade);
> INSERT INTO employee PARTITION (grade = 1) VALUES ('sam');
> INSERT INTO employee PARTITION (grade = 2) VALUES ('suj');
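The statements above only create and populate the table. A minimal sketch of the statement itself (detailed output omitted):
> SHOW TABLE EXTENDED LIKE 'employee';
-- Returns database, tableName, isTemporary, and an information block with
-- Location, Provider, Partition Columns, Schema, and other table details.
> SHOW TABLE EXTENDED LIKE 'employee' PARTITION (grade = 1);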
Related articles
CREATE TABLE
DESCRIBE TABLE
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.TABLES
PARTITION
SHOW TABLES
7/21/2022 • 2 minutes to read
Returns all the tables for an optionally specified schema. Additionally, the output of this statement may be
filtered by an optional matching pattern. If no schema is specified then the tables are returned from the current
schema.
Syntax
SHOW TABLES [ { FROM | IN } schema_name ] [ LIKE regex_pattern ]
Parameters
schema_name
Specifies schema name from which tables are to be listed. If not provided, uses the current schema.
regex_pattern
The regular expression pattern that is used to filter out unwanted tables.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.
Examples
-- List all tables in default schema
> SHOW TABLES;
database tableName isTemporary
-------- --------- -----------
default sam false
default sam1 false
default suj false
-- List all tables from default schema matching the pattern `sam*`
> SHOW TABLES FROM default LIKE 'sam*';
database tableName isTemporary
-------- --------- -----------
default sam false
default sam1 false
Related articles
CREATE SCHEMA
CREATE TABLE
DROP SCHEMA
DROP TABLE
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.TABLES
SHOW TBLPROPERTIES
7/21/2022 • 2 minutes to read
Returns the value of a table property given an optional value for a property key. If no key is specified then all the
properties and options are returned. Table options are prefixed with option .
Syntax
SHOW TBLPROPERTIES table_name
[ ( [unquoted_property_key | property_key_as_string_literal] ) ]
unquoted_property_key
key_part1 [. ...]
Parameters
table_name
Identifies the table. The name must not include a temporal specification.
unquoted_proper ty_key
The property key in unquoted form. The key can consist of multiple parts separated by a dot.
proper ty_key_as_string_literal
A property key value as a string literal.
NOTE
Property values returned by this statement exclude some properties that are internal to Spark and Hive. The excluded
properties are:
All the properties that start with prefix spark.sql
Property keys such as: EXTERNAL , comment
All the properties generated internally by Hive to store statistics. Some of these properties are: numFiles ,
numPartitions , numRows .
Examples
-- create a table `customer` in schema `salessc`
> USE salessc;
> CREATE TABLE customer(cust_code INT, name VARCHAR(100), cust_addr STRING)
TBLPROPERTIES ('created.by.user' = 'John', 'created.date' = '01-01-2001');
-- show all the user specified properties for a qualified table `customer`
-- in schema `salessc`
> SHOW TBLPROPERTIES salessc.customer;
key value
--------------------- ----------
created.by.user John
created.date 01-01-2001
transient_lastDdlTime 1567554931
Related articles
ALTER TABLE
CREATE TABLE
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.TABLES
SHOW TABLES
SHOW TABLE
Table properties and table options
SHOW USERS
7/21/2022 • 2 minutes to read
Syntax
SHOW USERS [ LIKE pattern_expression ]
Parameters
pattern_expression
A limited pattern expression that is used to filter the results of the statement.
The * character is used at the start and end of a pattern to match on a substring.
The * character is used only at end of a pattern to match the start of a username.
The | character is used to separate multiple different expressions, any of which can match.
The pattern match is case-insensitive.
Examples
-- Lists all users.
> SHOW USERS;
name
------------------
user1@example.com
user2@example.com
user3@example.com
Related articles
INFORMATION_SCHEMA.CATALOG_PRIVILEGES
INFORMATION_SCHEMA.SCHEMA_PRIVILEGES
INFORMATION_SCHEMA.TABLE_PRIVILEGES
SHOW GRANTS
SHOW GROUPS
SHOW VIEWS
7/21/2022 • 2 minutes to read
Returns all the views for an optionally specified schema. Additionally, the output of this statement may be
filtered by an optional matching pattern. If no schema is specified then the views are returned from the current
schema. If the specified schema is the global temporary view schema, Databricks Runtime lists global temporary
views. Note that the command also lists local temporary views regardless of a given schema.
Syntax
SHOW VIEWS [ { FROM | IN } schema_name ] [ LIKE regex_pattern ]
Parameters
schema_name
The schema name from which views are listed.
regex_pattern
The regular expression pattern that is used to filter out unwanted views.
Except for * and | character, the pattern works like a regular expression.
* alone matches 0 or more characters and | is used to separate multiple different regular
expressions, any of which can match.
The leading and trailing blanks are trimmed in the input pattern before processing. The pattern match
is case-insensitive.
Examples
-- Create views in different schemas, also create global/local temp views.
> CREATE VIEW sam AS SELECT id, salary FROM employee WHERE name = 'sam';
> CREATE VIEW sam1 AS SELECT id, salary FROM employee WHERE name = 'sam1';
> CREATE VIEW suj AS SELECT id, salary FROM employee WHERE name = 'suj';
> USE SCHEMA usersc;
> CREATE VIEW user1 AS SELECT id, salary FROM default.employee WHERE name = 'user1';
> CREATE VIEW user2 AS SELECT id, salary FROM default.employee WHERE name = 'user2';
> USE SCHEMA default;
> CREATE TEMP VIEW temp1 AS SELECT 1 AS col1;
> CREATE TEMP VIEW temp2 AS SELECT 1 AS col1;
-- List all views from default schema matching the pattern `sam*`
> SHOW VIEWS FROM default LIKE 'sam*';
namespace viewName isTemporary
----------- ------------ --------------
default sam false
default sam1 false
-- List all views from the current schema matching the pattern `sam|suj|temp*`
> SHOW VIEWS LIKE 'sam|suj|temp*';
namespace viewName isTemporary
------------- ------------ --------------
default sam false
default suj false
temp2 true
Related articles
CREATE SCHEMA
CREATE VIEW
DROP SCHEMA
DROP VIEW
INFORMATION_SCHEMA.COLUMNS
INFORMATION_SCHEMA.TABLES
INFORMATION_SCHEMA.VIEWS
RESET
7/21/2022 • 2 minutes to read
Resets any runtime configurations specific to the current session that were set via the SET command to their
default values.
Syntax
RESET;
Parameters
(none)
Resets any runtime configurations specific to the current session that were set via the SET command to
their default values.
Examples
-- Reset any runtime configurations specific to the current session which were
-- set via the SET command to their default values.
RESET;
Related statements
SET
SET
7/21/2022 • 2 minutes to read
Sets a property, returns the value of an existing property or returns all SQLConf properties with value and
meaning.
Syntax
SET
SET [ -v ]
SET property_key[ = property_value ]
Parameters
-v
Outputs the key, value and meaning of existing SQLConf properties.
property_key
Returns the value of specified property key.
property_key = property_value
Sets the value for a given property key. If an old value exists for a given property key, then it gets
overridden by the new value.
Examples
-- Set a property.
SET spark.sql.variable.substitute=false;
Related statements
RESET
SET TIME ZONE
7/21/2022 • 2 minutes to read
Syntax
SET TIME ZONE { LOCAL | time_zone_value | INTERVAL interval_literal }
Parameters
LOCAL
Set the time zone to the one specified in the java user.timezone property, or to the environment variable
TZ if user.timezone is undefined, or to the system time zone if both of them are undefined.
time_zone_value
A STRING literal. The ID of session local timezone in the format of either region-based zone IDs or zone
offsets. Region IDs must have the form ‘area/city’, such as ‘America/Los_Angeles’. Zone offsets must be in
the format ‘ (+|-)HH ’, ‘ (+|-)HH:mm ’ or ‘ (+|-)HH:mm:ss ’, e.g ‘-08’, ‘+01:00’ or ‘-13:33:33’. Also, ‘UTC’ and
‘Z’ are supported as aliases of ‘+00:00’. Other short names are not recommended to use because they
can be ambiguous.
interval_literal
The interval literal represents the difference between the session time zone and UTC. It must be in the
range of [-18, 18] hours with up to second precision, for example INTERVAL 2 HOURS 30 MINUTES or
INTERVAL '15:40:32' HOUR TO SECOND .
Examples
-- Set time zone to the system default.
> SET TIME ZONE LOCAL;
Related articles
SET
ADD ARCHIVE
7/21/2022 • 2 minutes to read
Adds an archive file to the list of resources. The given archive file should be one of .zip, .tar, .tar.gz, .tgz and .jar. To
list the archive files that have been added, use LIST ARCHIVE.
Since: Databricks Runtime 10.0
Syntax
ADD [ARCHIVE | ARCHIVES] file_name [...]
Parameters
file_name
The name of an ARCHIVE file to add. It could be either on a local file system or a distributed file system.
Examples
> ADD ARCHIVE /tmp/test.tar.gz;
> ADD ARCHIVE "/path with space/abc.tar" ADD ARCHIVE "/path with space/def.tar";
Related statements
ADD FILE
ADD JAR
LIST ARCHIVE
LIST FILE
LIST JAR
ADD FILE
7/21/2022 • 2 minutes to read
Adds a single file or directory to the list of resources. The added resource can be listed using LIST FILE.
Syntax
ADD [ FILE | FILES ] resource_name [...]
Parameters
resource_name
The name of a file or directory to be added.
Examples
> ADD FILE /tmp/test;
Related statements
ADD ARCHIVE
ADD JAR
LIST FILE
LIST JAR
LIST ARCHIVE
ADD JAR
7/21/2022 • 2 minutes to read
Adds a JAR file to the list of resources. The added JAR file can be listed using LIST JAR.
Syntax
ADD [JAR | JARS] file_name [...]
Parameters
file_name
The name of a JAR file to be added. It could be either on a local file system or a distributed file system.
Examples
> ADD JAR /tmp/test.jar;
Related statements
ADD ARCHIVE
ADD FILE
LIST ARCHIVE
LIST FILE
LIST JAR
LIST ARCHIVE
7/21/2022 • 2 minutes to read
Syntax
LIST [ARCHIVE | ARCHIVES] [file_name [...]]
Parameters
file_name
Optionally, the name of an archive to list.
Examples
> ADD ARCHIVES /tmp/test.zip /tmp/test_2.tar.gz;
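A minimal follow-up to the example above, listing the archives that were just added (output omitted):
> LIST ARCHIVES;
> LIST ARCHIVE /tmp/test_2.tar.gz;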
Related statements
ADD ARCHIVE
ADD JAR
ADD FILE
LIST FILE
LIST JAR
LIST FILE
7/21/2022 • 2 minutes to read
Syntax
LIST [ FILE | FILES ] [ resource_name [...]]
Parameters
resource_name
Optionally, the name of a file or directory to list.
Examples
> ADD FILE /tmp/test /tmp/test_2;
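A minimal follow-up to the example above, listing the files that were just added (output omitted):
> LIST FILES;
> LIST FILE /tmp/test;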
Related statements
ADD FILE
ADD JAR
LIST JAR
LIST JAR
7/21/2022 • 2 minutes to read
Lists the JARs added by ADD JAR.
Syntax
LIST [JAR | JARS] [file_name [...]]
Parameters
file_name
Optionally, the name of a JAR file to list.
Examples
> ADD JAR /tmp/test.jar /tmp/test_2.jar;
> LIST JARS;
Related statements
ADD ARCHIVE
ADD FILE
ADD JAR
LIST ARCHIVE
LIST FILE
CACHE SELECT
7/21/2022 • 2 minutes to read
Caches the data accessed by the specified simple SELECT query in the Delta cache. You can choose a subset of
columns to be cached by providing a list of column names and choose a subset of rows by providing a
predicate. This enables subsequent queries to avoid scanning the original files as much as possible. This
construct is applicable only to Parquet tables. Views are also supported, but the expanded queries are restricted
to the simple queries, as described above.
Syntax
CACHE SELECT column_name [, ...] FROM table_name [ WHERE boolean_expression ]
See Delta and Apache Spark caching for the differences between the Delta cache and the Apache Spark cache.
Parameters
table_name
Identifies an existing table. The name must not include a temporal specification.
Examples
> CACHE SELECT * FROM boxes
> CACHE SELECT width, length FROM boxes WHERE height=3
CREATE TABLE CLONE
7/21/2022 • 2 minutes to read
Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow:
deep clones copy over the data from the source and shallow clones do not.
IMPORTANT
There are important differences between shallow and deep clones that can determine how best to use them. See Clone a
Delta table.
Syntax
CREATE TABLE [IF NOT EXISTS] table_name
[SHALLOW | DEEP] CLONE source_table_name [LOCATION path]
Parameters
IF NOT EXISTS
If specified, the statement is ignored if table_name already exists.
[CREATE OR] REPLACE
If CREATE OR is specified the table is replaced if it exists and newly created if it does not. Without
CREATE OR the table_name must exist.
table_name
The name of the Delta Lake table to be created. The name must not include a temporal specification. If the
name is not qualified the table is created in the current schema. table_name must not exist already unless
REPLACE or IF NOT EXISTS has been specified.
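Examples
A minimal sketch, assuming a source Delta table named my_prod_table; the target names are placeholders:
-- Create a deep clone of my_prod_table at the current version.
> CREATE TABLE IF NOT EXISTS my_prod_table_clone DEEP CLONE my_prod_table;
-- Create a shallow clone for a short-lived experiment.
> CREATE TABLE my_prod_table_shallow SHALLOW CLONE my_prod_table;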
CONVERT TO DELTA
Converts an existing Parquet table to a Delta table in-place. This command lists all the files in the directory,
creates a Delta Lake transaction log that tracks these files, and automatically infers the data schema by reading
the footers of all Parquet files. The conversion process collects statistics to improve query performance on the
converted Delta table. If you provide a table name, the metastore is also updated to reflect that the table is now
a Delta table.
This command supports converting Iceberg tables whose underlying file format is Parquet. In this case, the
converter generates the Delta Lake transaction log based on Iceberg table’s native file manifest, schema and
partitioning information.
Syntax
CONVERT TO DELTA table_name [ NO STATISTICS ] [ PARTITIONED BY clause ]
Parameters
table_name
Either an optionally qualified table identifier or a path to a Parquet file. The name must not include a
temporal specification.
NO STATISTICS
Bypass statistics collection during the conversion process and finish conversion faster. After the table is
converted to Delta Lake, you can use OPTIMIZE ZORDER BY to reorganize the data layout and generate
statistics.
PARTITIONED BY
Partition the created table by the specified columns. Required if the data is partitioned. The conversion
process aborts and throws an exception if the directory structure does not conform to the PARTITIONED BY
specification. If you do not provide the PARTITIONED BY clause, the command assumes that the table is
not partitioned.
Examples
NOTE
You do not need to provide partitioning information for Iceberg tables or tables registered to the metastore.
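A minimal sketch, assuming a Parquet table registered as events and a partitioned Parquet directory at the hypothetical path /data/events:
-- Convert a Parquet table registered in the metastore.
> CONVERT TO DELTA events;
-- Convert a partitioned Parquet directory by path.
> CONVERT TO DELTA parquet.`/data/events` PARTITIONED BY (date DATE);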
Related articles
PARTITIONED BY
VACUUM
COPY INTO
7/21/2022 • 15 minutes to read
Loads data from a file location into a Delta table. This is a retriable and idempotent operation—files in the
source location that have already been loaded are skipped. For examples, see Common data loading patterns
with COPY INTO.
Syntax
COPY INTO target_table
FROM { source |
( SELECT expression_list FROM source ) }
[ WITH (
[ CREDENTIAL { credential_name |
(temporary_credential_options) } ]
[ ENCRYPTION (encryption_options) ])
]
FILEFORMAT = data_source
[ VALIDATE [ ALL | num_rows ROWS ] ]
[ FILES = ( file_name [, ...] ) | PATTERN = regex_pattern ]
[ FORMAT_OPTIONS ( { data_source_reader_option = value } [, ...] ) ]
[ COPY_OPTIONS ( { copy_option = value } [, ...] ) ]
Parameters
target_table
Identifies an existing Delta table. The target_table must not include a temporal specification.
If the table name is provided in the form of a location, such as: delta.`/path/to/table` , Unity Catalog can
govern access to the locations that are being written to. You can write to an external location by:
Defining the location as an external location and having WRITE FILES permissions on that external
location.
Having WRITE FILES permissions on a named storage credential that provide authorization to write to
a location using: COPY INTO delta.`/some/location` WITH (CREDENTIAL <named_credential>)
See Manage external locations and storage credentials for more details.
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
source
The file location to load the data from. Files in this location must have the format specified in FILEFORMAT .
The location is provided in the form of a URI.
Access to the source location can be provided through:
credential_name
Optional name of the credential used to access or write to the storage location. You use this
credential only if the file location isn’t included in an external location.
Inline temporary credentials.
Defining the source location as an external location and having READ FILES permissions on the
external location through Unity Catalog.
Using a named storage credential with READ FILES permissions that provide authorization to read
from a location through Unity Catalog.
IMPORTANT
Unity Catalog is in Public Preview. To participate in the preview, contact your Azure Databricks representative.
You don’t need to provide inline or named credentials if the path is already defined as an external location
that you have permissions to use. See Manage external locations and storage credentials for more details.
NOTE
If the source file path is a root path, add a slash ( / ) at the end of the file path, for example, s3://my-bucket/ .
NOTE
VALIDATE mode is available in Databricks Runtime 10.3 and above.
FILES
A list of file names to load, with length up to 1000. Cannot be specified with PATTERN .
PATTERN
A regex pattern that identifies the files to load from the source directory. Cannot be specified with FILES .
FORMAT_OPTIONS
Options to be passed to the Apache Spark data source reader for the specified format. See Format
options for each file format.
COPY_OPTIONS
Options to control the operation of the COPY INTO command.
force: boolean, default false . If set to true , idempotency is disabled and files are loaded
regardless of whether they’ve been loaded before.
mergeSchema : boolean, default false . If set to true , the schema can be evolved according to the
incoming data. To evolve the schema of a table, you must have OWN permissions on the table.
NOTE
mergeSchema option is available in Databricks Runtime 10.3 and above.
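As a hedged sketch of the parameters above (the table name, source path, and file pattern are placeholders), the following loads CSV files with headers and allows the table schema to evolve:
COPY INTO my_delta_table
  FROM 'abfss://container@account.dfs.core.windows.net/base/path'
  FILEFORMAT = CSV
  PATTERN = 'folder1/file_[a-g].csv'
  FORMAT_OPTIONS ('header' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true');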
Format options
Generic options
JSON options
CSV options
PARQUET options
AVRO options
BINARYFILE options
TEXT options
ORC options
Generic options
The following options apply to all file formats.
O P T IO N
ignoreCorruptFiles
Type: Boolean
Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents
that have been read will still be returned. Observable as numSkippedCorruptFiles in the
operationMetrics column of the Delta Lake history. Available in Databricks Runtime 11.0 and above.
ignoreMissingFiles
Type: Boolean
Whether to ignore missing files. If true, the Spark jobs will continue to run when encountering missing files and the contents
that have been read will still be returned. Available in Databricks Runtime 11.0 and above.
modifiedAfter
An optional timestamp to ingest files that have a modification timestamp after the provided timestamp.
modifiedBefore
An optional timestamp to ingest files that have a modification timestamp before the provided timestamp.
pathGlobFilter
Type: String
recursiveFileLookup
Type: Boolean
Whether to load data recursively within the base directory and skip partition inference.
JSON options
O P T IO N
allowBackslashEscapingAnyCharacter
Type: Boolean
Whether to allow backslashes to escape any character that succeeds it. If not enabled, only characters that are explicitly listed
by the JSON specification can be escaped.
allowComments
Type: Boolean
Whether to allow the use of Java, C, and C++ style comments ( '/' , '*' , and '//' varieties) within parsed content or
not.
allowNonNumericNumbers
Type: Boolean
Whether to allow the set of not-a-number ( NaN ) tokens as legal floating number values.
allowNumericLeadingZeros
Type: Boolean
Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, 000001).
allowSingleQuotes
Type: Boolean
Whether to allow use of single quotes (apostrophe, character '\' ) for quoting strings (names and String values).
allowUnquotedControlChars
Type: Boolean
Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab
and line feed characters) or not.
allowUnquotedFieldNames
Type: Boolean
Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification).
badRecordsPath
Type: String
The path to store files for recording the information about bad JSON records.
columnNameOfCorruptRecord
Type: String
The column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED ,
this column will be empty.
dateFormat
Type: String
dropFieldIfAllNull
Type: Boolean
Whether to ignore columns of all null values or empty arrays and structs during schema inference.
encoding or charset
Type: String
The name of the encoding of the JSON files. See java.nio.charset.Charset for list of options. You cannot use UTF-16 and
UTF-32 when multiline is true .
inferTimestamp
Type: Boolean
lineSep
Type: String
locale
Type: String
A java.util.Locale identifier. Influences default date, timestamp, and decimal parsing within the JSON.
Default value: US
mode
Type: String
multiLine
Type: Boolean
prefersDecimal
Type: Boolean
primitivesAsString
Type: Boolean
rescuedDataColumn
Type: String
Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to
a separate column. This column is included by default when using Auto Loader. For more details, refer to Rescued data column.
timestampFormat
Type: String
timeZone
Type: String
CSV options
O P T IO N
badRecordsPath
Type: String
The path to store files for recording the information about bad CSV records.
charToEscapeQuoteEscaping
Type: Char
The character used to escape the character used for escaping quotes. For example, for the following record: [ " a\\", b ] :
* If the character to escape the '\' is undefined, the record won’t be parsed. The parser will read characters:
[a],[\],["],[,],[ ],[b] and throw an error because it cannot find a closing quote.
* If the character to escape the '\' is defined as '\' , the record will be read with 2 values: [a\] and [b] .
columnNameOfCorruptRecord
Type: String
A column for storing records that are malformed and cannot be parsed. If the mode for parsing is set as DROPMALFORMED ,
this column will be empty.
comment
Type: Char
Defines the character that represents a line comment when found in the beginning of a line of text. Use '\0' to disable
comment skipping.
dateFormat
Type: String
emptyValue
Type: String
encoding or charset
Type: String
The name of the encoding of the CSV files. See java.nio.charset.Charset for the list of options. UTF-16 and UTF-32
cannot be used when multiline is true .
enforceSchema
Type: Boolean
Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are
ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution.
escape
Type: Char
header
Type: Boolean
Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema.
ignoreLeadingWhiteSpace
Type: Boolean
ignoreTrailingWhiteSpace
Type: Boolean
inferSchema
Type: Boolean
Whether to infer the data types of the parsed CSV records or to assume all columns are of StringType . Requires an
additional pass over the data if set to true .
lineSep
Type: String
locale
Type: String
A java.util.Locale identifier. Influences default date, timestamp, and decimal parsing within the CSV.
Default value: US
maxCharsPerColumn
Type: Int
Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to -1 , which
means unlimited.
Default value: -1
maxColumns
Type: Int
mergeSchema
Type: Boolean
Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader
when inferring the schema.
mode
Type: String
multiLine
Type: Boolean
nanValue
Type: String
The string representation of a not-a-number value when parsing FloatType and DoubleType columns.
negativeInf
Type: String
The string representation of negative infinity when parsing FloatType or DoubleType columns.
nullValue
Type: String
parserCaseSensitive (deprecated)
Type: Boolean
While reading files, whether to align columns declared in the header with the schema case sensitively. This is true by default
for Auto Loader. Columns that differ by case will be rescued in the rescuedDataColumn if enabled. This option has been
deprecated in favor of readerCaseSensitive .
positiveInf
Type: String
The string representation of positive infinity when parsing FloatType or DoubleType columns.
quote
Type: Char
The character used for escaping values where the field delimiter is part of the value.
readerCaseSensitive
Type: Boolean
Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names
differ by case from the schema; otherwise, read the data in a case-insensitive manner.
rescuedDataColumn
Type: String
Whether to collect all data that can’t be parsed due to: a data type mismatch, and schema mismatch (including column casing)
to a separate column. This column is included by default when using Auto Loader. For more details refer to Rescued data
column.
sep or delimiter
Type: String
skipRows
Type: Int
The number of rows from the beginning of the CSV file that should be ignored (including commented and empty rows). If
header is true, the header will be the first unskipped and uncommented row.
Default value: 0
timestampFormat
Type: String
timeZone
Type: String
unescapedQuoteHandling
Type: String
* STOP_AT_CLOSING_QUOTE : If unescaped quotes are found in the input, accumulate the quote character and proceed parsing
the value as a quoted value, until a closing quote is found.
* BACK_TO_DELIMITER : If unescaped quotes are found in the input, consider the value as an unquoted value. This will make
the parser accumulate all characters of the current parsed value until the delimiter defined by sep is found. If no delimiter is
found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found.
* STOP_AT_DELIMITER : If unescaped quotes are found in the input, consider the value as an unquoted value. This will make
the parser accumulate all characters until the delimiter defined by sep , or a line ending is found in the input.
* SKIP_VALUE : If unescaped quotes are found in the input, the content parsed for the given value will be skipped (until the
next delimiter is found) and the value set in nullValue will be produced instead.
* RAISE_ERROR : If unescaped quotes are found in the input, a
TextParsingException will be thrown.
PARQUET options
O P T IO N
datetimeRebaseMode
Type: String
Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values:
EXCEPTION , LEGACY , and
CORRECTED .
int96RebaseMode
Type: String
Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Allowed values:
EXCEPTION , LEGACY , and
CORRECTED .
mergeSchema
Type: Boolean
Whether to infer the schema across multiple files and to merge the schema of each file.
readerCaseSensitive
Type: Boolean
Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names
differ by case from the schema; otherwise, read the data in a case-insensitive manner.
rescuedDataColumn
Type: String
Whether to collect all data that can’t be parsed due to: a data type mismatch, and schema mismatch (including column casing)
to a separate column. This column is included by default when using Auto Loader. For more details refer to Rescued data
column.
AVRO options
O P T IO N
avroSchema
Type: String
Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is
compatible but different with the actual Avro schema. The deserialization schema will be consistent with the evolved schema.
For example, if you set an evolved schema containing one additional column with a default value, the read result will contain
the new column too.
datetimeRebaseMode
Type: String
Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values:
EXCEPTION , LEGACY , and
CORRECTED .
mergeSchema
Type: Boolean
Whether to infer the schema across multiple files and to merge the schema of each file.
mergeSchema for Avro does not relax data types.
readerCaseSensitive
Type: Boolean
Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names
differ by case from the schema; otherwise, read the data in a case-insensitive manner.
rescuedDataColumn
Type: String
Whether to collect all data that can’t be parsed due to: a data type mismatch, and schema mismatch (including column casing)
to a separate column. This column is included by default when using Auto Loader. For more details refer to Rescued data
column.
BINARYFILE options
Binary files do not have any additional configuration options.
TEXT options
O P T IO N
encoding
Type: String
The name of the encoding of the TEXT files. See java.nio.charset.Charset for list of options.
lineSep
Type: String
wholeText
Type: Boolean
ORC options
O P T IO N
mergeSchema
Type: Boolean
Whether to infer the schema across multiple files and to merge the schema of each file.
Related articles
Credentials
DELETE
INSERT
MERGE
PARTITION
query
UPDATE
CREATE BLOOM FILTER INDEX (Delta Lake on
Azure Databricks)
7/21/2022 • 2 minutes to read
Creates a Bloom filter index for new or rewritten data; it does not create Bloom filters for existing data. The
command fails if either the table name or one of the columns does not exist. If Bloom filtering is enabled for a
column, existing Bloom filter options are replaced by the new options.
Syntax
CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]
options
OPTIONS ( { key1 [ = ] val1 } [, ...] )
Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
While it is not possible to build a Bloom filter index for data that is already written, the OPTIMIZE command
updates Bloom filters for data that is reorganized. Therefore, you can backfill a Bloom filter by running OPTIMIZE
on a table:
If you have not previously optimized the table.
With a different file size, requiring that the data files be re-written.
With a ZORDER (or a different ZORDER , if one is already present), requiring that the data files be re-written.
You can tune the Bloom filter by defining options at the column level or at the table level:
fpp : False positive probability. The desired false positive rate per written Bloom filter. This influences the
number of bits needed to put a single item in the Bloom filter and influences the size of the Bloom filter. The
value must be larger than 0 and smaller than or equal to 1. The default value is 0.1 which requires 5 bits per
item.
numItems : Number of distinct items the file can contain. This setting is important for the quality of filtering as
it influences the total number of bits used in the Bloom filter (number of items * number of bits per item). If
this setting is incorrect, the Bloom filter is either very sparsely populated, wasting disk space and slowing
queries that must download this file, or it is too full and is less accurate (higher FPP). The value must be
larger than 0. The default is 1 million items.
maxExpectedFpp : The expected FPP threshold for which a Bloom filter is not written to disk. The maximum
expected false positive probability at which a Bloom filter is written. If the expected FPP is larger than this
threshold, the Bloom filter’s selectivity is too low; the time and resources it takes to use the Bloom filter
outweighs its usefulness. The value must be between 0 and 1. The default is 1.0 (disabled).
These options play a role only when writing the data. You can configure these properties at various hierarchical
levels: write operation, table level, and column level. The column level takes precedence over the table and
operation levels, and the table level takes precedence over the operation level.
See Bloom filter indexes.
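Examples
A minimal sketch, assuming a Delta table named events with a column sha (both names are placeholders):
> CREATE BLOOMFILTER INDEX
  ON TABLE events
  FOR COLUMNS(sha OPTIONS (fpp = 0.1, numItems = 50000000));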
Related articles
DROP BLOOMFILTER INDEX
DELETE FROM
7/21/2022 • 2 minutes to read
Deletes the rows that match a predicate. When no predicate is provided, deletes all rows.
This statement is only supported for Delta Lake tables.
Syntax
DELETE FROM table_name [table_alias] [WHERE predicate]
Parameters
table_name
Identifies an existing table. The name must not include a temporal specification.
table_alias
Define an alias for the table. The alias must not include a column list.
WHERE
Filter rows by predicate.
The WHERE predicate supports subqueries, including IN , NOT IN , EXISTS , NOT EXISTS , and scalar
subqueries. The following types of subqueries are not supported:
Nested subqueries, that is, an subquery inside another subquery
NOT IN subquery inside an OR , for example, a = 3 OR b NOT IN (SELECT c from t)
In most cases, you can rewrite NOT IN subqueries using NOT EXISTS . We recommend using NOT EXISTS
whenever possible, as DELETE with NOT IN subqueries can be slow.
Examples
> DELETE FROM events WHERE date < '2017-01-01'
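As a hedged illustration of the NOT IN to NOT EXISTS rewrite described above (the tables events and blocked and their columns are placeholders; the rewrite is equivalent when blocked.category contains no NULLs):
-- Slower: NOT IN subquery.
> DELETE FROM events WHERE category NOT IN (SELECT category FROM blocked)
-- Usually faster: NOT EXISTS rewrite.
> DELETE FROM events WHERE NOT EXISTS (SELECT 1 FROM blocked b WHERE b.category = events.category)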
Related articles
COPY
INSERT
MERGE
PARTITION
query
UPDATE
DESCRIBE HISTORY (Delta Lake on Azure
Databricks)
7/21/2022 • 2 minutes to read
Returns provenance information, including the operation, user, and so on, for each write to a table. Table history
is retained for 30 days.
Syntax
DESCRIBE HISTORY table_name
Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
See Retrieve Delta table history for details.
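Examples
A minimal sketch, assuming a Delta table named events (the name and path are placeholders):
> DESCRIBE HISTORY events;
> DESCRIBE HISTORY delta.`/data/events`;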
DROP BLOOM FILTER INDEX (Delta Lake on Azure
Databricks)
7/21/2022 • 2 minutes to read
Syntax
DROP BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS(columnName1 [, ...] ) ]
Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
The command fails if either the table name or one of the columns does not exist. All Bloom filter related
metadata is removed from the specified columns.
When a table does not have any Bloom filters, the underlying index files are cleaned when the table is
vacuumed.
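Examples
A minimal sketch, assuming a Delta table named events with a Bloom filter index on the column sha (both names are placeholders):
> DROP BLOOMFILTER INDEX ON TABLE events FOR COLUMNS(sha);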
Related articles
CREATE BLOOMFILTER INDEX
FSCK REPAIR TABLE
7/21/2022 • 2 minutes to read
Removes the file entries from the transaction log of a Delta table that can no longer be found in the underlying
file system. This can happen when these files have been manually deleted.
Syntax
FSCK REPAIR TABLE table_name [DRY RUN]
Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
DRY RUN
Return a list of files to be removed from the transaction log.
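Examples
A minimal sketch, assuming a Delta table named events (the name is a placeholder):
-- Preview the file entries that would be removed from the transaction log.
> FSCK REPAIR TABLE events DRY RUN;
-- Remove the entries for files that can no longer be found.
> FSCK REPAIR TABLE events;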
MERGE INTO
7/21/2022 • 4 minutes to read
Merges a set of updates, insertions, and deletions based on a source table into a target Delta table.
This statement is supported only for Delta Lake tables.
Syntax
MERGE INTO target_table_name [target_alias]
USING source_table_reference [source_alias]
ON merge_condition
[ WHEN MATCHED [ AND condition ] THEN matched_action ] [...]
[ WHEN NOT MATCHED [ AND condition ] THEN not_matched_action ] [...]
matched_action
{ DELETE |
UPDATE SET * |
UPDATE SET { column1 = value1 } [, ...] }
not_matched_action
{ INSERT * |
INSERT (column1 [, ...] ) VALUES (value1 [, ...]) }
Parameters
target_table_name
A Table name identifying the table being modified. The table referenced must be a Delta table.
target_alias
A Table alias for the target table. The alias must not include a column list.
source_table_reference
A Table name identifying the source table to be merged into the target table.
source_alias
A Table alias for the source table. The alias must not include a column list.
merge_condition
How the rows from one relation are combined with the rows of another relation. An expression with a
return type of BOOLEAN.
condition
A Boolean expression which must be true to satisfy the WHEN MATCHED or WHEN NOT MATCHED clause.
matched_action
There can be any number of WHEN MATCHED and WHEN NOT MATCHED clauses each, but at least one clause is
required. Multiple matches are allowed when matches are unconditionally deleted (since unconditional
delete is not ambiguous even if there are multiple matches).
WHEN MATCHED clauses are executed when a source row matches a target table row based on the
match condition. These clauses have the following semantics.
WHEN MATCHED clauses can have at most one UPDATE and one DELETE action. The UPDATE action in
merge only updates the specified columns of the matched target row. The DELETE action will
delete the matched row.
Each WHEN MATCHED clause can have an optional condition. If this clause condition exists, the
UPDATE or DELETE action is executed for any matching source-target row pair row only when the
clause condition is true.
If there are multiple WHEN MATCHED clauses, then they are evaluated in the order they are specified.
All WHEN MATCHED clauses, except the last one, must have conditions.
If none of the WHEN MATCHED conditions evaluate to true for a source and target row pair that
matches the merge condition, then the target row is left unchanged.
To update all the columns of the target Delta table with the corresponding columns of the source
dataset, use UPDATE SET * . This is equivalent to
UPDATE SET col1 = source.col1 [, col2 = source.col2 ...] for all the columns of the target Delta
table. Therefore, this action assumes that the source table has the same columns as those in the
target table, otherwise the query will throw an analysis error.
This behavior changes when automatic schema migration is enabled. See Automatic schema
evolution for details.
WHEN NOT MATCHED clauses are executed when a source row does not match any target row based on the
match condition. These clauses have the following semantics.
WHEN NOT MATCHED clauses can only have the INSERT action. The new row is generated based on
the specified column and corresponding expressions. All the columns in the target table do not
need to be specified. For unspecified target columns, NULL is inserted.
Each WHEN NOT MATCHED clause can have an optional condition. If the clause condition is present, a
source row is inserted only if that condition is true for that row. Otherwise, the source row is
ignored.
If there are multiple WHEN NOT MATCHED clauses, then they are evaluated in the order they are
specified. All WHEN NOT MATCHED clauses, except the last one, must have conditions.
To insert all the columns of the target Delta table with the corresponding columns of the source
dataset, use INSERT * . This is equivalent to
INSERT (col1 [, col2 ...]) VALUES (source.col1 [, source.col2 ...]) for all the columns of the
target Delta table. Therefore, this action assumes that the source table has the same columns as
those in the target table, otherwise the query will throw an analysis error.
NOTE
This behavior changes when automatic schema migration is enabled. See Automatic schema evolution for
details.
IMPORTANT
A MERGE operation can fail if multiple rows of the source dataset match and attempt to update the same rows of the
target Delta table. According to the SQL semantics of merge, such an update operation is ambiguous as it is unclear
which source row should be used to update the matched target row. You can preprocess the source table to eliminate the
possibility of multiple matches. See the Change data capture example—it preprocesses the change dataset (that is, the
source dataset) to retain only the latest change for each key before applying that change into the target Delta table.
Examples
You can use MERGE INTO for complex operations like deduplicating data, upserting change data, applying SCD
Type 2 operations, etc. See Merge examples for a few examples.
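As a minimal upsert sketch (the tables target and updates and their columns are placeholders):
> MERGE INTO target t
  USING updates u
  ON t.id = u.id
  WHEN MATCHED THEN UPDATE SET t.value = u.value
  WHEN NOT MATCHED THEN INSERT (id, value) VALUES (u.id, u.value);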
Related articles
DELETE
INSERT INTO
UPDATE
OPTIMIZE (Delta Lake on Azure Databricks)
7/21/2022 • 2 minutes to read
Optimizes the layout of Delta Lake data. Optionally optimize a subset of data or colocate data by column. If you
do not specify colocation, bin-packing optimization is performed.
Syntax
OPTIMIZE table_name [WHERE predicate]
[ZORDER BY (col_name1 [, ...] ) ]
NOTE
Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no
effect. It aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of
tuples per file. However, the two measures are most often correlated.
Z-Ordering is not idempotent but aims to be an incremental operation. The time it takes for Z-Ordering is not
guaranteed to reduce over multiple runs. However, if no new data was added to a partition that was just Z-Ordered,
another Z-Ordering of that partition will not have any effect. It aims to produce evenly-balanced data files with respect
to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there
can be situations when that is not the case, leading to skew in optimize task times.
To control the output file size, set the Spark configuration spark.databricks.delta.optimize.maxFileSize . The
default value is 1073741824 , which sets the size to 1 GB. Specifying the value 104857600 sets the file size to 100
MB.
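For example, as a hedged sketch, to target 100 MB output files before running OPTIMIZE:
> SET spark.databricks.delta.optimize.maxFileSize = 104857600;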
Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
WHERE
Optimize the subset of rows matching the given partition predicate. Only filters involving partition key
attributes are supported.
ZORDER BY
Colocate column information in the same set of files. Co-locality is used by Delta Lake data-skipping
algorithms to dramatically reduce the amount of data that needs to be read. You can specify multiple
columns for ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with
each additional column.
Examples
OPTIMIZE events
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
For more information about the OPTIMIZE command, see Optimize performance with file management.
REORG TABLE
7/21/2022 • 2 minutes to read
Reorganize a Delta Lake table by rewriting files to purge soft-deleted data, such as the column data dropped by
ALTER TABLE DROP COLUMN.
Syntax
REORG TABLE table_name [WHERE predicate] APPLY (PURGE)
NOTE
REORG TABLE only rewrites files that actually contain soft-deleted data.
REORG TABLE is idempotent, meaning that if it is run twice on the same dataset, the second run has no effect.
After running REORG TABLE, the soft-deleted data may still exist in the old files. You can run VACUUM to physically
delete the old files.
Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
WHERE predicate
Reorganizes the files that match the given partition predicate. Only filters involving partition key
attributes are supported.
APPLY (PURGE)
Examples
> REORG TABLE events APPLY (PURGE);
> REORG TABLE events WHERE date >= '2022-01-01' APPLY (PURGE);
RESTORE (Delta Lake on Azure Databricks)
NOTE
Available in Databricks Runtime 7.4 and above.
Restores a Delta table to an earlier state. Restoring to an earlier version number or a timestamp is supported.
Syntax
RESTORE [TABLE] table_name [TO] time_travel_version
Parameters
table_name
Identifies the Delta table to be restored. The table name must not use a temporal specification.
time_travel_version
{ TIMESTAMP AS OF timestamp_expression |
VERSION AS OF version }
where
timestamp_expression can be any one of:
'2018-10-18T22:15:12.013Z' , that is, a string that can be cast to a timestamp
cast('2018-10-18 13:36:32 CEST' as timestamp)
'2018-10-18' , that is, a date string
In Databricks Runtime 6.6 and above:
current_timestamp() - interval 12 hours
date_sub(current_date(), 1)
Any other expression that is or can be cast to a timestamp
version is a long value that can be obtained from the output of DESCRIBE HISTORY table_spec .
Neither timestamp_expression nor version can be subqueries.
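Examples
A minimal sketch, assuming a Delta table named employee (the name, version, and timestamp are placeholders):
-- Restore to a specific version number.
> RESTORE TABLE employee TO VERSION AS OF 1;
-- Restore to a specific point in time.
> RESTORE TABLE employee TO TIMESTAMP AS OF '2022-08-02 00:00:00';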
For more information about the RESTORE command, see Restore a Delta table.
UPDATE
7/21/2022 • 2 minutes to read
Updates the column values for the rows that match a predicate. When no predicate is provided, update the
column values for all rows.
This statement is only supported for Delta Lake tables.
Syntax
UPDATE table_name [table_alias]
SET { { column_name | field_name } = expr } [, ...]
[WHERE clause]
Parameters
table_name
Identifies the table to be updated. The table name must not use a temporal specification.
table_alias
Define an alias for the table. The alias must not include a column list.
column_name
A reference to a column in the table. You may reference each column at most once.
field_name
A reference to a field within a column of type STRUCT. You may reference each field at most once.
expr
An arbitrary expression. If you reference table_name columns, they represent the state of the row prior
to the update.
WHERE
Filter rows by predicate. The WHERE clause may include subqueries with the following exceptions:
Nested subqueries, that is, a subquery inside another subquery
A NOT IN subquery inside an OR , for example, a = 3 OR b NOT IN (SELECT c from t)
In most cases, you can rewrite NOT IN subqueries using NOT EXISTS . You should use NOT EXISTS
whenever possible, as UPDATE with NOT IN subqueries can be slow.
Examples
> UPDATE events SET eventType = 'click' WHERE eventType = 'clk'
Related articles
COPY
DELETE
INSERT
MERGE
PARTITION
query
VACUUM
7/21/2022 • 2 minutes to read
NOTE
This command works differently depending on whether you’re working on a Delta or Apache Spark table.
WARNING
It is recommended that you set a retention interval to be at least 7 days, because old snapshots and uncommitted files
can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can
fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed. You must choose an
interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag
behind the most recent update to the table.
Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain
that there are no operations being performed on this table that take longer than the retention interval you plan
to specify, you can turn off this safety check by setting the Spark configuration property
spark.databricks.delta.retentionDurationCheck.enabled to false .
Parameters
table_name
Identifies an existing Delta table. The name must not include a temporal specification.
RETAIN num HOURS
The retention threshold.
DRY RUN
Return a list of files to be deleted.
Parameters
table_name
Identifies an existing table by name or path.
RETAIN num HOURS
The retention threshold.
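Examples
A minimal sketch for the Delta table form, assuming a table named events and a 7-day (168-hour) retention threshold (the name and value are placeholders):
-- Preview the files that would be deleted.
> VACUUM events RETAIN 168 HOURS DRY RUN;
-- Remove files no longer referenced by the table and older than the retention threshold.
> VACUUM events RETAIN 168 HOURS;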
DENY
7/21/2022 • 2 minutes to read
Denies a privilege on a securable object to a principal. Denying a privilege takes precedence over any explicit or
implicit grant.
Denying a privilege on a schema (for example a SELECT privilege) has the effect of implicitly denying that
privilege on all objects in that schema. Denying a specific privilege on the catalog implicitly denies that privilege
on all schemas in the catalog.
NOTE
This statement applies only to the hive_metastore catalog and its objects.
IMPORTANT
To undo a DENY, REVOKE the same privilege from the principal.
Syntax
DENY privilege_types ON securable_object TO principal
privilege_types
{ ALL PRIVILEGES |
privilege_type [, ...] }
Parameters
privilege_types
This identifies one or more privileges the principal is denied.
ALL PRIVILEGES
securable_object
The object on which the privileges are denied to the principal.
principal
The user or group whose privileges are denied.
Example
-- Deny Alf the right to query `t`.
> DENY SELECT ON TABLE t TO `alf@melmak.et`;
Related
GRANT
REPAIR PRIVILEGES
REVOKE
SHOW GRANTS
ALTER GROUP
7/21/2022 • 2 minutes to read
Alters a workspace level group by either adding or dropping users and groups as members.
Syntax
ALTER GROUP parent_principal { ADD | DROP }
{ GROUP group_principal [, ...] |
USER user_principal [, ...] } [...]
Parameters
parent_principal
The name of the workspace level group to be altered.
group_principal
A list of workspace level subgroups to ADD or DROP.
user_principal
A list of workspace level users to ADD or DROP.
Examples
-- Creates a group named `aliens` containing a user `alf@melmak.et`.
CREATE GROUP aliens WITH USER `alf@melmak.et`;
-- Adds the subgroup `tv_aliens` to the `aliens` group.
ALTER GROUP aliens ADD GROUP tv_aliens;
Related articles
SHOW GROUPS
DROP GROUP
CREATE GROUP
CREATE GROUP
7/21/2022 • 2 minutes to read
Creates a workspace level group with the specified name, optionally including a list of users and groups.
Syntax
CREATE GROUP group_principal
[ WITH
[ USER user_principal [, ...] ]
[ GROUP subgroup_principal [, ...] ]
]
Parameters
group_principal
The name of the workspace-level group to be created.
user_principal
A workspace level user to include as a member of the group.
subgroup_principal
A workspace level subgroup to include as a member of the group.
Examples
-- Create an empty group.
CREATE GROUP humans;
-- Create a group with a user and a subgroup as members.
CREATE GROUP aliens WITH USER `alf@melmak.et` GROUP tv_aliens;
Related articles
SHOW GROUPS
ALTER GROUP
DROP GROUP
principals
DROP GROUP
7/21/2022 • 2 minutes to read
Drops a workspace level group. An exception is thrown if the group does not exist in the system.
Syntax
DROP GROUP group_principal
Parameters
group_principal
The name of the existing workspace level group to drop.
Examples
-- Create the `aliens` group.
> CREATE GROUP aliens WITH GROUP tv_aliens;
-- Drop the `aliens` group.
> DROP GROUP aliens;
Related articles
SHOW GROUPS
ALTER GROUP
CREATE GROUP
principal
GRANT
7/21/2022 • 2 minutes to read
NOTE
Modifying access to the samples catalog is not supported. This catalog is available to all workspaces, but is read-only.
Grants a privilege on a securable object to a principal.
Syntax
GRANT privilege_types ON securable_object TO principal
privilege_types
{ ALL PRIVILEGES |
privilege_type [, ...] }
Parameters
privilege_types
This identifies one or more privileges to be granted to the principal .
ALL PRIVILEGES
Examples
> GRANT CREATE ON SCHEMA <schema-name> TO `alf@melmak.et`;
Related articles
GRANT ON SHARE
REPAIR PRIVILEGES
REVOKE
SHOW GRANTS
GRANT ON SHARE
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Syntax
GRANT SELECT ON SHARE share_name TO RECIPIENT recipient_name
Parameters
share_name
The name of the share which the recipient is granted access to. If the share does not exist an error is
raised.
recipient_name
The name of the recipient to which access to the share is granted. If the recipient does not exist an error is
raised.
Examples
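A minimal sketch, assuming a share named my_share and a recipient named other_org (both names are placeholders):
> GRANT SELECT ON SHARE my_share TO RECIPIENT other_org;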
Related articles
REVOKE ON SHARE
REVOKE
7/21/2022 • 2 minutes to read
NOTE
Modifying access to the samples catalog is not supported. This catalog is available to all workspaces, but is read-only.
Revokes an explicitly granted or denied privilege on a securable object from a principal.
Syntax
REVOKE privilege_types ON securable_object FROM principal
privilege_types
{ ALL PRIVILEGES |
privilege_type [, ...] }
Parameters
privilege_types
This identifies one or more privileges to be revoked from the principal .
ALL PRIVILEGES
Examples
> REVOKE ALL PRIVILEGES ON SCHEMA default FROM `alf@melmak.et`;
Related articles
GRANT
REPAIR PRIVILEGES
REVOKE ON SHARE
SHOW GRANTS
REVOKE ON SHARE
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Syntax
REVOKE SELECT ON SHARE share_name FROM RECIPIENT recipient_name
Parameters
share_name
The name of the share from which the recipient is revoked access. If the share does not exist an error is
raised.
recipient_name
The name of the recipient from which access to the share is revoked. If the recipient does not exist an
error is raised.
Examples
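A minimal sketch, assuming the share and recipient from the GRANT ON SHARE example (both names are placeholders):
> REVOKE SELECT ON SHARE my_share FROM RECIPIENT other_org;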
Related articles
GRANT ON SHARE
REVOKE
SHOW GRANTS
7/21/2022 • 2 minutes to read
Displays all privileges (inherited, denied, and granted) that affect the securable object.
To run this command you must be either:
A workspace administrator or the owner of the object.
The user specified in principal .
Use SHOW GRANTS TO RECIPIENT to list which shares a recipient has access to.
Syntax
SHOW GRANTS [ principal ] ON securable_object
Parameters
principal
An optional user or group for which to show the privileges granted or denied. If not specified SHOW will
return privileges for all principals who have privileges on the object.
securable_object
The object whose privileges to show.
Example
> SHOW GRANTS `alf@melmak.et` ON SCHEMA my_schema;
principal     privilege
------------- --------
alf@melmak.et USE
Related articles
GRANT
INFORMATION_SCHEMA.CATALOG_PRIVILEGES
INFORMATION_SCHEMA.SCHEMA_PRIVILEGES
INFORMATION_SCHEMA.TABLE_PRIVILEGES
REPAIR PRIVILEGES
REVOKE
REVOKE ON SHARE
SHOW GRANTS TO RECIPIENT
SHOW GRANTS ON SHARE
SHOW GROUPS
SHOW USERS
SHOW GRANTS ON SHARE
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Syntax
SHOW GRANTS ON SHARE share_name
Parameters
share_name
The name of the share whose grants will be listed.
Example
> SHOW GRANTS ON SHARE shared_date;
recipient privilege
--------- ---------
some_corp SELECT
other_org SELECT
Related articles
GRANT ON SHARE
REVOKE ON SHARE
SHOW GRANTS
SHOW GRANTS TO RECIPIENT
SHOW SHARES
SHOW RECIPIENTS
SHOW GRANTS TO RECIPIENT
7/21/2022 • 2 minutes to read
IMPORTANT
Delta Sharing is in Public Preview. To participate in the preview, you must enable the External Data Sharing feature group
in the Azure Databricks Account Console. See Enable the External Data Sharing feature group for your account.
Delta Sharing is subject to applicable terms. Enabling the External Data Sharing feature group represents acceptance of
those terms.
Syntax
SHOW GRANTS TO RECIPIENT recipient_name
Parameters
recipient_name
The name of the recipient whose shares will be listed.
Example
> SHOW GRANTS TO RECIPIENT a_corp;
share privilege
----- ---------
data1 SELECT
data2 SELECT
Related articles
GRANT ON SHARE
REVOKE ON SHARE
SHOW GRANTS
SHOW GRANTS ON SHARE
SHOW SHARES
SHOW RECIPIENTS
MSCK REPAIR PRIVILEGES
7/21/2022 • 2 minutes to read
Removes all the privileges from all the users associated with the object.
You use this statement to clean up residual access control left behind after objects have been dropped from the
Hive metastore outside of Databricks Runtime.
This statement only applies to objects in the hive_metastore catalog.
Syntax
MSCK REPAIR object PRIVILEGES
object
{ [ SCHEMA | DATABASE ] schema_name |
FUNCTION function_name |
TABLE table_name |
VIEW view_name |
ANONYMOUS FUNCTION |
ANY FILE }
Parameters
schema_name
Names the schema from which privileges are removed.
function_name
Names the function from which privileges are removed.
table_name
Names the table from which privileges are removed.
view_name
Names the view from which privileges are removed.
ANY FILE
Revokes ANY FILE privilege from all users.
ANONYMOUS FUNCTION
Revokes ANONYMOUS FUNCTION privilege from all users.
Examples
> MSCK REPAIR SCHEMA gone_from_hive PRIVILEGES;
Alter Database
SET DBPROPERTIES
Specify a property named key for the database and establish the value for the property respectively as val . If
key already exists, the old value is overwritten with val .
Assign owner
ALTER DATABASE db_name OWNER TO `user_name@user_domain.com`
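A minimal sketch, assuming a database named inventory (the name, property, and owner are placeholders):
-- Set a database property.
ALTER DATABASE inventory SET DBPROPERTIES ('Edited-by' = 'John Doe');
-- Assign a new owner.
ALTER DATABASE inventory OWNER TO `alf@melmak.et`;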
Rename an existing table or view. If the destination table name already exists, an exception is thrown. This
operation does not support moving tables across databases.
For managed tables, renaming a table moves the table location; for unmanaged (external) tables, renaming a
table does not move the table location.
For further information on managed versus unmanaged (external) tables, see Data objects in the Databricks
Lakehouse.
Set the properties of an existing table or view. If a particular property was already set, this overrides the old
value with the new one.
NOTE
Property names are case sensitive. If you have key1 and then later set Key1 , a new table property is created.
To view table properties, run:
Drop one or more properties of an existing table or view. If a specified property does not exist, an exception is
thrown.
IF EXISTS
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
Set the SerDe or the SerDe properties of a table or partition. If a specified SerDe property was already set, this
overrides the old value with the new one. Setting the SerDe is allowed only for tables created using the Hive
format.
Assign owner
ALTER (TABLE|VIEW) object-name OWNER TO `user_name@user_domain.com`
ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name], ...)
ALTER TABLE table_name ADD COLUMNS (col_name.nested_col_name data_type [COMMENT col_comment] [FIRST|AFTER
colA_name], ...)
Add columns to an existing table. It supports adding nested columns. If a column with the same name already
exists in the table or the same nested struct, an exception is thrown.
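A minimal sketch, assuming a table named boxes (the table and column names are placeholders):
ALTER TABLE boxes ADD COLUMNS (weight DOUBLE COMMENT 'weight in kg' AFTER length);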
Change columns
alterColumnAction:
: TYPE dataType
: [COMMENT col_comment]
: [FIRST|AFTER colA_name]
: (SET | DROP) NOT NULL
Change a column definition of an existing table. You can change the data type, comment, or nullability of a
column, or reorder columns.
NOTE
Available in Databricks Runtime 7.0 and above.
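A minimal sketch, assuming a Delta table named boxes with a column height (both names are placeholders):
-- Change the comment on a column and relax its nullability.
ALTER TABLE boxes ALTER COLUMN height COMMENT 'height in cm';
ALTER TABLE boxes ALTER COLUMN height DROP NOT NULL;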
Change columns (Hive syntax)
ALTER TABLE table_name CHANGE [COLUMN] col_name col_name data_type [COMMENT col_comment] [FIRST|AFTER
colA_name]
ALTER TABLE table_name CHANGE [COLUMN] col_name.nested_col_name col_name data_type [COMMENT col_comment]
[FIRST|AFTER colA_name]
Change a column definition of an existing table. You can change the comment of the column and reorder
columns.
NOTE
In Databricks Runtime 7.0 and above you cannot use CHANGE COLUMN :
To change the contents of complex data types such as structs. Instead use ADD COLUMNS to add new columns to
nested fields, or ALTER COLUMN to change the properties of a nested column.
To relax the nullability of a column in a Delta table. Instead use
ALTER TABLE table_name ALTER COLUMN column_name DROP NOT NULL .
Replace columns
ALTER TABLE table_name REPLACE COLUMNS (col_name1 col_type1 [COMMENT col_comment1], ...)
Replace the column definitions of an existing table. It supports changing the comments of columns, adding
columns, and reordering columns. If specified column definitions are not compatible with the existing
definitions, an exception is thrown.
Alter Table Partition
7/21/2022 • 2 minutes to read
Add partition
ALTER TABLE table_name ADD [IF NOT EXISTS]
(PARTITION part_spec [LOCATION path], ...)
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
Add partitions to the table, optionally with a custom location for each partition added. This is supported only for
tables created using the Hive format. However, beginning with Spark 2.1, Alter Table Partitions is also
supported for tables defined using the datasource API.
IF NOT EXISTS
Change partition
ALTER TABLE table_name PARTITION part_spec RENAME TO PARTITION part_spec
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
Change the partitioning field values of a partition. This operation is allowed only for tables created using the
Hive format.
Drop partition
ALTER TABLE table_name DROP [IF EXISTS] (PARTITION part_spec, ...)
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
Drop a partition from a table or view. This operation is allowed only for tables created using the Hive format.
IF EXISTS
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
Set the location of the specified partition. Setting the location of individual partitions is allowed only for tables
created using the Hive format.
Analyze Table
7/21/2022 • 2 minutes to read
Collect statistics about the table that can be used by the query optimizer to find a better plan.
Table statistics
ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [NOSCAN]
Collect only basic statistics for the table (number of rows, size in bytes).
NOSCAN
Collect only statistics that do not require scanning the whole table (that is, size in bytes).
Column statistics
ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS FOR COLUMNS col1 [, col2, ...]
Collect column statistics for the specified columns in addition to table statistics.
TIP
Use this command whenever possible because it collects more statistics so the optimizer can find better plans. Make sure
to collect statistics for all columns used by the query.
See also:
Use Describe Table to inspect the existing statistics
Cost-based optimizer
Cache Select (Delta Lake on Azure Databricks)
7/21/2022 • 2 minutes to read
Cache the data accessed by the specified simple SELECT query in the Delta cache. You can choose a subset of
columns to be cached by providing a list of column names and choose a subset of rows by providing a
predicate. This enables subsequent queries to avoid scanning the original files as much as possible. This
construct is applicable only to Parquet tables. Views are also supported, but the expanded queries are restricted
to the simple queries, as described above.
See Delta and Apache Spark caching for the differences between the RDD cache and the Databricks IO cache.
Examples
CACHE SELECT * FROM boxes
CACHE SELECT width, length FROM boxes WHERE height=3
Cache Table
7/21/2022 • 2 minutes to read
Cache the contents of the table in memory using the RDD cache. This enables subsequent queries to avoid
scanning the original files as much as possible.
LAZY
Cache the table lazily instead of eagerly scanning the entire table.
See Delta and Apache Spark caching for the differences between the RDD cache and the Databricks IO cache.
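A minimal sketch, assuming a table named boxes (the name is a placeholder):
-- Eagerly cache the table.
CACHE TABLE boxes;
-- Or cache it lazily, on first use.
CACHE LAZY TABLE boxes;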
Clear Cache
7/21/2022 • 2 minutes to read
CLEAR CACHE
Clone (Delta Lake on Azure Databricks)
IMPORTANT
This feature is in Public Preview.
NOTE
Available in Databricks Runtime 7.2 and above.
Clone a source Delta table to a target destination at a specific version. A clone can be either deep or shallow
referring to whether it copies over the data from the source or not.
IMPORTANT
There are important differences between shallow and deep clones with respect to dependencies between a clone and the
source and other differences. See Clone a Delta table.
where
<time_travel_version> =
TIMESTAMP AS OF timestamp_expression |
VERSION AS OF version
Specify CREATE TABLE IF NOT EXISTS to avoid creating a table target_table if a table already exists. If a table
already exists at the target, the clone operation is a no-op.
Specify CREATE OR REPLACE to replace the target of a clone operation if there is an existing table
target_table . This updates the metastore with the new table if table name is used.
Specifying SHALLOW or DEEP creates a shallow or deep clone at the target. If neither SHALLOW nor DEEP is
specified then a deep clone is created by default.
Specifying LOCATION creates an external table at the target with the provided location as the path where the
data will be stored. If the target provided is a path instead of a table name, the operation will fail.
Examples
You can use CLONE for complex operations like data migration, data archiving, machine learning flow
reproduction, short-term experiments, data sharing etc. See Clone use cases for a few examples.
Convert To Delta (Delta Lake on Azure Databricks)
7/21/2022 • 2 minutes to read
NOTE
CONVERT TO DELTA [db_name.]table_name requires Databricks Runtime 6.6 or above.
Convert an existing Parquet table to a Delta table in-place. This command lists all the files in the directory,
creates a Delta Lake transaction log that tracks these files, and automatically infers the data schema by reading
the footers of all Parquet files. The conversion process collects statistics to improve query performance on the
converted Delta table. If you provide a table name, the metastore is also updated to reflect that the table is now
a Delta table.
NO STATISTICS
Bypass statistics collection during the conversion process and finish conversion faster. After the table is
converted to Delta Lake, you can use OPTIMIZE ZORDER BY to reorganize the data layout and generate statistics.
PARTITIONED BY
Partition the created table by the specified columns. Required if the data is partitioned. The conversion process
aborts and throws an exception if the directory structure does not conform to the PARTITIONED BY specification. If
you do not provide the PARTITIONED BY clause, the command assumes that the table is not partitioned.
Caveats
Any file not tracked by Delta Lake is invisible and can be deleted when you run VACUUM . You should avoid
updating or appending data files during the conversion process. After the table is converted, make sure all
writes go through Delta Lake.
It is possible that multiple external tables share the same underlying Parquet directory. In this case, if you run
CONVERT on one of the external tables, then you will not be able to access the other external tables because their
underlying directory has been converted from Parquet to Delta Lake. To query or write to these external tables
again, you must run CONVERT on them as well.
CONVERT populates the catalog information, such as schema and table properties, to the Delta Lake transaction
log. If the underlying directory has already been converted to Delta Lake and its metadata is different from the
catalog metadata, a convertMetastoreMetadataMismatchException will be thrown. If you want CONVERT to
overwrite the existing metadata in the Delta Lake transaction log, set the SQL configuration
spark.databricks.delta.convert.metadataCheck.enabled to false.
Copy Into (Delta Lake on Azure Databricks)
IMPORTANT
This feature is in Public Preview.
Load data from a file location into a Delta table. This is a re-triable and idempotent operation—files in the source
location that have already been loaded are skipped.
table_identifier
The file location to load the data from. Files in this location must have the format specified in FILEFORMAT .
SELECT identifier_list
Selects the specified columns or expressions from the source data before copying into the Delta table.
FILEFORMAT = data_source
The format of the source files to load. One of CSV , JSON , AVRO , ORC , PARQUET .
FILES
A list of file names to load, with length up to 1000. Cannot be specified with PATTERN .
PATTERN
A regex pattern that identifies the files to load from the source directory. Cannot be specified with FILES .
FORMAT_OPTIONS
Options to be passed to the Apache Spark data source reader for the specified format.
COPY_OPTIONS
Options to control the operation of the COPY INTO command. The only option is 'force' ; if set to 'true' ,
idempotency is disabled and files are loaded regardless of whether they’ve been loaded before.
Examples
COPY INTO delta.`target_path`
FROM (SELECT key, index, textData, 'constant_value' FROM 'source_path')
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'
FORMAT_OPTIONS('header' = 'true')
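The example above uses PATTERN ; a minimal sketch of the FILES and COPY_OPTIONS parameters instead, with a hypothetical target table and source path:
COPY INTO my_delta_table
FROM '/mnt/source_data'
FILEFORMAT = JSON
FILES = ('f1.json', 'f2.json')
COPY_OPTIONS('force' = 'true')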
Create Bloom Filter Index (Delta Lake on Azure
Databricks)
7/21/2022 • 2 minutes to read
Create a Bloom filter index for new or rewritten data; it does not create Bloom filters for existing data. The
command fails if either the table name or one of the columns does not exist. If Bloom filtering is enabled for a
column, existing Bloom filter options are replaced by the new options.
While it is not possible to build a Bloom filter index for data that is already written, the OPTIMIZE command
updates Bloom filters for data that is reorganized. Therefore, you can backfill a Bloom filter by running OPTIMIZE
on a table:
If you have not previously optimized the table.
With a different file size, requiring that the data files be re-written.
With a ZORDER (or a different ZORDER , if one is already present), requiring that the data files be re-written.
You can tune the Bloom filter by defining options at the column level or at the table level:
fpp : False positive probability. The desired false positive rate per written Bloom filter. This influences the
number of bits needed to put a single item in the Bloom filter and influences the size of the Bloom filter. The
value must be larger than 0 and smaller than or equal to 1. The default value is 0.1 which requires 5 bits per
item.
numItems : Number of distinct items the file can contain. This setting is important for the quality of filtering as
it influences the total number of bits used in the Bloom filter (number of items * number of bits per item). If
this setting is incorrect, the Bloom filter is either very sparsely populated, wasting disk space and slowing
queries that must download this file, or it is too full and is less accurate (higher FPP). The value must be
larger than 0. The default is 1 million items.
maxExpectedFpp : The expected FPP threshold for which a Bloom filter is not written to disk. The maximum
expected false positive probability at which a Bloom filter is written. If the expected FPP is larger than this
threshold, the Bloom filter’s selectivity is too low; the time and resources it takes to use the Bloom filter
outweighs its usefulness. The value must be between 0 and 1. The default is 1.0 (disabled).
These options play a role only when writing the data. You can configure these properties at various hierarchical
levels: write operation, table level, and column level. The column level takes precedence over the table and
operation levels, and the table level takes precedence over the operation level.
See Bloom filter indexes.
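A minimal sketch of creating a Bloom filter index with the fpp and numItems options described above; the table and column names are hypothetical:
CREATE BLOOMFILTER INDEX
ON TABLE my_table
FOR COLUMNS(id_column OPTIONS (fpp = 0.1, numItems = 50000000))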
Create Database
7/21/2022 • 2 minutes to read
Create a database. If a database with the same name already exists, an exception is thrown.
IF NOT EXISTS
If a database with the same name already exists, nothing will happen.
LOCATION
If the specified path does not already exist in the underlying file system, this command tries to create a directory
with the path.
WITH DBPROPERTIES
Specify a property named key for the database and set its value to val . If key already exists, the old value is
overwritten with val .
Examples
-- Create database `customer_db`. This throws exception if database with name customer_db
-- already exists.
CREATE DATABASE customer_db;
-- Create database `customer_db` only if database with same name doesn't exist.
CREATE DATABASE IF NOT EXISTS customer_db;
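A sketch combining the LOCATION and WITH DBPROPERTIES clauses described above; the path and property values are hypothetical:
-- Create database `customer_db` at a specific location with custom properties.
CREATE DATABASE IF NOT EXISTS customer_db
LOCATION '/mnt/databases/customer_db'
WITH DBPROPERTIES (owner = 'data_team', purpose = 'analytics');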
Create Function
7/21/2022 • 2 minutes to read
resource:
: [JAR|FILE|ARCHIVE] file_uri
Create a function. The specified class for the function must extend either UDF or UDAF in
org.apache.hadoop.hive.ql.exec , or one of AbstractGenericUDAFResolver , GenericUDF , or GenericUDTF in
org.apache.hadoop.hive.ql.udf.generic . If a function with the same name already exists in the database, an
exception will be thrown.
NOTE
This command is supported only when Hive support is enabled.
TEMPORARY
The created function is available only in this session and is not persisted to the underlying metastore, if any.
No database name can be specified for temporary functions.
USING resource
The resources that must be loaded to support this function. A list of JAR, file, or archive URIs.
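A minimal sketch, assuming a Hive UDF class packaged in a JAR; the function name, class, and JAR path are hypothetical:
CREATE TEMPORARY FUNCTION simple_udf AS 'com.example.SimpleUdf'
USING JAR '/tmp/SimpleUdf.jar';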
Create Table
7/21/2022 • 7 minutes to read
Create a table using a data source. If a table with the same name already exists in the database, an exception is
thrown.
IF NOT EXISTS
If a table with the same name already exists in the database, nothing will happen.
USING data_source
The file format to use for the table. data_source must be one of TEXT , AVRO , CSV , JSON , JDBC , PARQUET , ORC ,
HIVE , DELTA , or LIBSVM , or a fully-qualified class name of a custom implementation of
org.apache.spark.sql.sources.DataSourceRegister .
HIVE is supported to create a Hive SerDe table. You can specify the Hive-specific file_format and row_format
using the OPTIONS clause, which is a case-insensitive string map. The option keys are FILEFORMAT , INPUTFORMAT ,
OUTPUTFORMAT , SERDE , FIELDDELIM , ESCAPEDELIM , MAPKEYDELIM , and LINEDELIM .
OPTIONS
Table options used to optimize the behavior of the table or configure HIVE tables.
NOTE
This clause is not supported by Delta Lake.
PARTITIONED BY (col_name1, col_name2, ...)
Partition the created table by the specified columns. A directory is created for each partition.
CLUSTERED BY (col_name3, col_name4, ...)
Each partition in the created table will be split into a fixed number of buckets by the specified columns. This is
typically used with partitioning to read and shuffle less data.
LOCATION path
The directory to store the table data. This clause automatically implies EXTERNAL .
WARNING
To avoid accidental data loss, do not register a schema (database) to a location with existing data or create new external
tables in a location managed by a schema. Dropping a schema will recursively delete all data files in the managed location.
AS select_statement
Populate the table with input data from the SELECT statement. This cannot contain a column list.
NOT NULL
Indicate that a column value cannot be NULL . If specified, and an Insert or Update (Delta Lake on Azure
Databricks) statement sets a column value to NULL , a SparkException is thrown. The default is to allow a NULL
value.
LOCATION <path-to-delta-files>
If you specify a LOCATION that already contains data stored in Delta Lake, Delta Lake does the following:
If you specify only the table name and location (for example, as in the sketch below),
the table in the Hive metastore automatically inherits the schema, partitioning, and table properties of the
existing data. This functionality can be used to “import” data into the metastore.
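A minimal sketch of registering a table over existing Delta data by name and location only; the path is hypothetical:
CREATE TABLE events
USING DELTA
LOCATION '/mnt/delta/events'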
If you specify any configuration (schema, partitioning, or table properties), Delta Lake verifies that the
specification exactly matches the configuration of the existing data.
WARNING
If the specified configuration does not exactly match the configuration of the data, Delta Lake throws an exception that
describes the discrepancy.
Examples
CREATE TABLE boxes (width INT, length INT, height INT) USING CSV
-- CREATE a HIVE SerDe table using the CREATE TABLE USING syntax.
CREATE TABLE my_table (name STRING, age INT, hair_color STRING)
USING HIVE
OPTIONS(
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat',
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat',
SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe')
PARTITIONED BY (hair_color)
TBLPROPERTIES ('status'='staging', 'owner'='andrew')
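A sketch combining the Delta Lake clauses described above ( NOT NULL , PARTITIONED BY , LOCATION ); the column names and path are hypothetical:
CREATE TABLE events (
  date DATE,
  eventId STRING,
  eventType STRING,
  data STRING NOT NULL)
USING DELTA
PARTITIONED BY (date)
LOCATION '/mnt/delta/events'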
row_format:
: SERDE serde_cls [WITH SERDEPROPERTIES (key1=val1, key2=val2, ...)]
| DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
[NULL DEFINED AS char]
file_format:
: TEXTFILE | SEQUENCEFILE | RCFILE | ORC | PARQUET | AVRO
| INPUTFORMAT input_fmt OUTPUTFORMAT output_fmt
Create a table using the Hive format. If a table with the same name already exists in the database, an exception
will be thrown. When the table is dropped later, its data will be deleted from the file system.
NOTE
This command is supported only when Hive support is enabled.
EXTERNAL
The table uses the custom directory specified with LOCATION . Queries on the table access existing data
previously stored in the directory. When an EXTERNAL table is dropped, its data is not deleted from the file
system. This flag is implied if LOCATION is specified.
IF NOT EXISTS
If a table with the same name already exists in the database, nothing will happen.
PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)
Partition the table by the specified columns. This set of columns must be distinct from the set of non-partitioned
columns. You cannot specify partitioned columns with AS select_statement .
ROW FORMAT
Use the SERDE clause to specify a custom SerDe for this table. Otherwise, use the DELIMITED clause to use the
native SerDe and specify the delimiter, escape character, null character, and so on.
STORED AS file_format
Specify the file format for this table. Available formats include TEXTFILE , SEQUENCEFILE , RCFILE , ORC , PARQUET ,
and AVRO . Alternatively, you can specify your own input and output formats through INPUTFORMAT and
OUTPUTFORMAT . Only formats TEXTFILE , SEQUENCEFILE , and RCFILE can be used with ROW FORMAT SERDE and only
TEXTFILE can be used with ROW FORMAT DELIMITED .
LOCATION path
The directory to store the table data. This clause automatically implies EXTERNAL .
AS select_statement
Populate the table with input data from the select statement. You cannot specify this with PARTITIONED BY .
Data types
Spark SQL supports the following data types:
Numeric types
ByteType : Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127 .
ShortType : Represents 2-byte signed integer numbers. The range of numbers is from -32768 to
32767 .
IntegerType : Represents 4-byte signed integer numbers. The range of numbers is from -2147483648
to 2147483647 .
LongType : Represents 8-byte signed integer numbers. The range of numbers is from
-9223372036854775808 to 9223372036854775807 .
FloatType : Represents 4-byte single-precision floating point numbers.
DoubleType : Represents 8-byte double-precision floating point numbers.
DecimalType : Represents arbitrary-precision signed decimal numbers. Backed internally by
java.math.BigDecimal . A BigDecimal consists of an arbitrary precision integer unscaled value and a
32-bit integer scale.
String type: StringType : Represents character string values.
Binary type: BinaryType : Represents byte sequence values.
Boolean type: BooleanType : Represents boolean values.
Datetime types
TimestampType : Represents values comprising values of fields year, month, day, hour, minute, and
second, with the session local time zone. The timestamp value represents an absolute point in time.
DateType : Represents values comprising values of fields year, month and day, without a time-zone.
Complex types
ArrayType(elementType, containsNull) : Represents values comprising a sequence of elements with the
type of elementType . containsNull is used to indicate if elements in an ArrayType value can have
null values.
MapType(keyType, valueType, valueContainsNull) : Represents values comprising a set of key-value
pairs. The data type of keys is described by keyType and the data type of values is described by
valueType . For a MapType value, keys are not allowed to have null values. valueContainsNull is
used to indicate if values of a MapType value can have null values.
StructType(fields) : Represents values with the structure described by a sequence of StructField (
fields ).
StructField(name, dataType, nullable) : Represents a field in a StructType . The name of a field is
indicated by name . The data type of a field is indicated by dataType . nullable is used to indicate if
values of these fields can have null values.
The following table shows the type names and aliases for each data type.
DATA TYPE                SQL NAME
BooleanType BOOLEAN
DoubleType DOUBLE
DateType DATE
TimestampType TIMESTAMP
StringType STRING
BinaryType BINARY
CalendarIntervalType INTERVAL
ArrayType ARRAY<element_type>
Examples
CREATE TABLE my_table (name STRING, age INT)
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (name STRING, age INT)
COMMENT 'This table is created with existing data'
LOCATION 'spark-warehouse/tables/my_existing_table'
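The examples above do not show ROW FORMAT or STORED AS ; a sketch using the native SerDe with a delimited text file, where the table name and path are hypothetical:
CREATE TABLE my_csv_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/mnt/data/my_csv_table'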
Create a managed table using the definition/metadata of an existing table or view. The created table always uses
its own directory in the default warehouse location.
NOTE
Delta Lake does not support CREATE TABLE LIKE . Instead use CREATE TABLE AS . See AS.
Create View
7/21/2022 • 2 minutes to read
If the view does not exist, CREATE OR REPLACE VIEW is equivalent to CREATE VIEW . If the view does exist,
CREATE OR REPLACE VIEW is equivalent to ALTER VIEW .
[GLOBAL] TEMPORARY
TEMPORARY skips persisting the view definition in the underlying metastore, if any. If GLOBAL is specified, the
view can be accessed by different sessions and kept alive until your application ends; otherwise, the temporary
views are session-scoped and will be automatically dropped if the session terminates. All the global temporary
views are tied to a system preserved temporary database global_temp . The database name is preserved, and
thus, users are not allowed to create/use/drop this database. You must use the qualified name to access the
global temporary view.
NOTE
A temporary view defined in a notebook is not visible in other notebooks. See Notebook isolation.
A column list that defines the view schema. The column names must be unique with the same number of
columns retrieved by select_statement . When the column list is not given, the view schema is the output
schema of select_statement .
TBLPROPERTIES
A list of key-value pairs to add to the view as properties.
AS select_statement
A SELECT statement that defines the view. The statement can select from base tables or the other views.
IMPORTANT
You cannot specify datasource, partition, or clustering options since a view is not materialized like a table.
Examples
-- Create a persistent view view_deptDetails in database1. The view definition is recorded in the
-- underlying metastore.
CREATE VIEW database1.view_deptDetails
AS SELECT * FROM company JOIN dept ON company.dept_id = dept.id;
-- Create or replace a local temporary view from a persistent view with an extra filter
CREATE OR REPLACE TEMPORARY VIEW temp_DeptSFO
AS SELECT * FROM database1.view_deptDetails WHERE loc = 'SFO';
-- Create a global temp view to share the data through different sessions
CREATE GLOBAL TEMP VIEW global_DeptSJC
AS SELECT * FROM database1.view_deptDetails WHERE loc = 'SJC';
-- Drop the global temp view, temp view, and persistent view.
DROP VIEW global_temp.global_DeptSJC;
DROP VIEW temp_DeptSFO;
DROP VIEW database1.view_deptDetails;
Delete From (Delta Lake on Azure Databricks)
7/21/2022 • 2 minutes to read
Delete the rows that match a predicate. When no predicate is provided, delete all rows.
WHERE
Filter rows by predicate.
In most cases, you can rewrite NOT IN subqueries using NOT EXISTS . We recommend using NOT EXISTS
whenever possible, as DELETE with NOT IN subqueries can be slow.
Example
DELETE FROM events WHERE date < '2017-01-01'
Subquery Examples
DELETE FROM all_events
WHERE session_time < (SELECT min(session_time) FROM good_events)
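As noted above, NOT IN subqueries are better rewritten with NOT EXISTS . A sketch of such a rewrite, assuming a hypothetical event_id column shared by both tables:
-- Delete events that do not appear in good_events
DELETE FROM all_events
WHERE NOT EXISTS (SELECT 1 FROM good_events WHERE good_events.event_id = all_events.event_id)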
DENY
privilege_type [, privilege_type ] ...
ON [CATALOG | DATABASE <database-name> | TABLE <table-name> | VIEW <view-name> | FUNCTION <function-name>
| ANONYMOUS FUNCTION | ANY FILE]
TO principal
privilege_type
: SELECT | CREATE | MODIFY | READ_METADATA | CREATE_NAMED_FUNCTION | ALL PRIVILEGES
principal
: `<user>@<domain-name>` | <group-name>
Deny a privilege on an object to a user or principal. Denying a privilege on a database (for example a SELECT
privilege) has the effect of implicitly denying that privilege on all objects in that database. Denying a specific
privilege on the catalog has the effect of implicitly denying that privilege on all databases in the catalog.
To deny a privilege to all users, specify the keyword users after TO .
DENY can be used to ensure that a user or principal cannot access the specified object, despite any implicit or
explicit GRANTs . When an object is accessed, Databricks first checks if there are any explicit or implicit DENYs on
the object before checking if there are any explicit or implicit GRANTs .
For example, suppose there is a database db with tables t1 and t2 . A user is initially granted SELECT
privileges on db . The user can access t1 and t2 due to the GRANT on the database db .
If the administrator issues a DENY on table t1 , the user will no longer be able to access t1 . If the
administrator issues a DENY on database db , the user will not be able to access any tables in db even if there
is an explicit GRANT on these tables. That is, the DENY always supersedes the GRANT .
Example
DENY SELECT ON <table-name> TO `<user>@<domain-name>`;
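A sketch of the users keyword and a group principal described above; the database, table, and group names are hypothetical:
-- Deny SELECT on a table to all users
DENY SELECT ON <table-name> TO users;
-- Deny MODIFY on a database to a group
DENY MODIFY ON DATABASE <database-name> TO data_consumers;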
Describe Database
7/21/2022 • 2 minutes to read
Return the metadata of an existing database (name, comment and location). If the database does not exist, an
exception is thrown.
EXTENDED
Also display the database properties.
Describe Function
7/21/2022 • 2 minutes to read
Return the metadata of an existing function (implementing class and usage). If the function does not exist, an
exception is thrown.
EXTENDED
Also display the extended usage information of the function.
Describe History (Delta Lake on Azure Databricks)
7/21/2022 • 2 minutes to read
Return provenance information, including the operation, user, and so on, for each write to a table. Table history is
retained for 30 days.
See Retrieve Delta table history for details.
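Minimal sketches; the path and table name are hypothetical:
DESCRIBE HISTORY '/data/events/'
DESCRIBE HISTORY eventsTable LIMIT 1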
Describe Table
7/21/2022 • 2 minutes to read
Return the metadata of an existing table (column names, data types, and comments). If the table does not exist,
an exception is thrown.
EXTENDED
Display detailed information about the table, including parent database, table type, storage information, and
properties.
Return the metadata of a specified partition. The partition_spec must provide the values for all the partition
columns.
EXTENDED
Display basic information about the table and the partition-specific storage information.
Display detailed information about the specified columns, including the column statistics collected by the
command ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column_name [column_name, ...] .
Describe Detail (Delta Lake on Azure Databricks)
7/21/2022 • 2 minutes to read
Return information about schema, partitioning, table size, and so on. For example, you can see the current
reader and writer versions of a table.
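Sketches of the describe commands above; the table names and partition spec are hypothetical:
DESCRIBE TABLE EXTENDED customer_db.my_table;
DESCRIBE TABLE customer_db.my_table PARTITION (date = '2022-01-01');
DESCRIBE DETAIL eventsTable;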
Drop Bloom Filter Index (Delta Lake on Azure
Databricks)
7/21/2022 • 2 minutes to read
Drop a Bloom filter index for a table or for specified columns of a table.
Drop Database
7/21/2022 • 2 minutes to read
Drop a database and delete the directory associated with the database from the file system. If the database does
not exist, an exception is thrown.
IF EXISTS
If the database does not exist, nothing will happen.
CASCADE
Dropping a non-empty database also drops all associated tables and functions.
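A minimal sketch using the clauses described above:
-- Drop the database and all of its tables and functions, if it exists.
DROP DATABASE IF EXISTS customer_db CASCADE;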
Drop Function
7/21/2022 • 2 minutes to read
Drop an existing function. If the function to drop does not exist, an exception is thrown.
NOTE
This command is supported only when Hive support is enabled.
TEMPORARY
Drop a temporary function.
Drop Table
7/21/2022 • 2 minutes to read
Drop a table and delete the directory associated with the table from the file system if this is not an EXTERNAL
table. If the table to drop does not exist, an exception is thrown.
IF EXISTS
If the table does not exist, nothing will happen.
Drop View
7/21/2022 • 2 minutes to read
Drop a view. If the view to drop does not exist, an exception is thrown.
Examples
-- Drop the global temp view, temp view, and persistent view.
DROP VIEW global_temp.global_DeptSJC;
DROP VIEW temp_DeptSFO;
DROP VIEW database1.view_deptDetails;
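Sketches for the drop commands described above; the function and table names are hypothetical:
DROP TEMPORARY FUNCTION IF EXISTS simple_udf;
DROP TABLE IF EXISTS my_table;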
Explain
7/21/2022 • 2 minutes to read
Provide detailed plan information about a statement without actually running it. By default this only outputs
information about the physical plan. Explaining DESCRIBE TABLE is not supported.
EXTENDED
Output information about the logical plan before and after analysis and optimization.
CODEGEN
Output the generated code for the statement, if any.
Fsck Repair Table (Delta Lake on Azure Databricks)
7/21/2022 • 2 minutes to read
Remove the file entries from the transaction log of a Delta table that can no longer be found in the underlying
file system. This can happen when these files have been manually deleted.
DRY RUN
Return the list of files that would be removed from the transaction log, without removing them.
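A minimal sketch of the command described above, using DRY RUN to preview the files that would be removed; the table name is hypothetical:
FSCK REPAIR TABLE my_delta_table DRY RUN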
!
! expr - Logical not.
%
expr1 % expr2 - Returns the remainder after expr1 / expr2 .
Examples:
&
expr1 & expr2 - Returns the result of bitwise AND of expr1 and expr2 .
Examples:
*
expr1 * expr2 - Returns expr1 * expr2 .
Examples:
> SELECT 2 * 3;
6
+
expr1 + expr2 - Returns expr1 + expr2 .
Examples:
> SELECT 1 + 2;
3
-
expr1 - expr2 - Returns expr1 - expr2 .
Examples:
> SELECT 2 - 1;
1
/
expr1 / expr2 - Returns expr1 / expr2 . It always performs floating point division.
Examples:
> SELECT 3 / 2;
1.5
> SELECT 2L / 2L;
1.0
<
expr1 < expr2 - Returns true if expr1 is less than expr2 .
Arguments:
expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a
type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types
such array/struct, the data types of fields must be orderable.
Examples:
<=
expr1 <= expr2 - Returns true if expr1 is less than or equal to expr2 .
Arguments:
expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a
type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types
such array/struct, the data types of fields must be orderable.
Examples:
> SELECT 2 <= 2;
true
> SELECT 1.0 <= '1';
true
> SELECT to_date('2009-07-30 04:17:52') <= to_date('2009-07-30 04:17:52');
true
> SELECT to_date('2009-07-30 04:17:52') <= to_date('2009-08-01 04:17:52');
true
> SELECT 1 <= NULL;
NULL
<=>
expr1 <=> expr2 - Returns same result as the EQUAL(=) operator for non-null operands, but returns true if both
are null, false if one of the them is null.
Arguments:
expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a
type that can be used in equality comparison. Map type is not supported. For complex types such
array/struct, the data types of fields must be orderable.
Examples:
=
expr1 = expr2 - Returns true if expr1 equals expr2 , or false otherwise.
Arguments:
expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a
type that can be used in equality comparison. Map type is not supported. For complex types such
array/struct, the data types of fields must be orderable.
Examples:
> SELECT 2 = 2;
true
> SELECT 1 = '1';
true
> SELECT true = NULL;
NULL
> SELECT NULL = NULL;
NULL
==
expr1 == expr2 - Returns true if expr1 equals expr2 , or false otherwise.
Arguments:
expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a
type that can be used in equality comparison. Map type is not supported. For complex types such
array/struct, the data types of fields must be orderable.
Examples:
> SELECT 2 == 2;
true
> SELECT 1 == '1';
true
> SELECT true == NULL;
NULL
> SELECT NULL == NULL;
NULL
>
expr1 > expr2 - Returns true if expr1 is greater than expr2 .
Arguments:
expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a
type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types
such array/struct, the data types of fields must be orderable.
Examples:
>=
expr1 >= expr2 - Returns true if expr1 is greater than or equal to expr2 .
Arguments:
expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a
type that can be ordered. For example, map type is not orderable, so it is not supported. For complex types
such array/struct, the data types of fields must be orderable.
Examples:
> SELECT 2 >= 1;
true
> SELECT 2.0 >= '2.1';
false
> SELECT to_date('2009-07-30 04:17:52') >= to_date('2009-07-30 04:17:52');
true
> SELECT to_date('2009-07-30 04:17:52') >= to_date('2009-08-01 04:17:52');
false
> SELECT 1 >= NULL;
NULL
^
expr1 ^ expr2 - Returns the result of bitwise exclusive OR of expr1 and expr2 .
Examples:
> SELECT 3 ^ 5;
2
abs
abs(expr) - Returns the absolute value of the numeric value.
Examples:
acos
acos(expr) - Returns the inverse cosine (arccosine) of expr , as if computed by java.lang.Math.acos .
Examples:
add_months
add_months(start_date, num_months) - Returns the date that is num_months after start_date .
Examples:
Since: 1.5.0
aggregate
aggregate(expr, start, merge, finish) - Applies a binary operator to an initial state and all elements in the array,
and reduces this to a single state. The final state is converted into the final result by applying a finish function.
Examples:
Since: 2.4.0
and
expr1 and expr2 - Logical AND.
approx_count_distinct
approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++. relativeSD
defines the maximum estimation error allowed.
approx_percentile
approx_percentile(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column
col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter
(default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory.
Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation. When
percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the
approximate percentile array of column col at the given percentage array.
Examples:
array
array(expr, …) - Returns an array with the given elements.
Examples:
array_contains
array_contains(array, value) - Returns true if the array contains the value.
Examples:
Since: 2.4.0
array_except
array_except(array1, array2) - Returns an array of the elements in array1 but not in array2, without duplicates.
Examples:
Since: 2.4.0
array_intersect
array_intersect(array1, array2) - Returns an array of the elements in the intersection of array1 and array2,
without duplicates.
Examples:
Since: 2.4.0
array_join
array_join(array, delimiter[, nullReplacement]) - Concatenates the elements of the given array using the delimiter
and an optional string to replace nulls. If no value is set for nullReplacement, any null value is filtered.
Examples:
Since: 2.4.0
array_max
array_max(array) - Returns the maximum value in the array. NULL elements are skipped.
Examples:
> SELECT array_max(array(1, 20, null, 3));
20
Since: 2.4.0
array_min
array_min(array) - Returns the minimum value in the array. NULL elements are skipped.
Examples:
Since: 2.4.0
array_position
array_position(array, element) - Returns the (1-based) index of the first element of the array as long.
Examples:
Since: 2.4.0
array_remove
array_remove(array, element) - Remove all elements that equal to element from array.
Examples:
Since: 2.4.0
array_repeat
array_repeat(element, count) - Returns the array containing element count times.
Examples:
Since: 2.4.0
array_sort
array_sort(array) - Sorts the input array in ascending order. The elements of the input array must be orderable.
Null elements will be placed at the end of the returned array.
Examples:
Since: 2.4.0
array_union
array_union(array1, array2) - Returns an array of the elements in the union of array1 and array2, without
duplicates.
Examples:
Since: 2.4.0
arrays_overlap
arrays_overlap(a1, a2) - Returns true if a1 contains at least a non-null element present also in a2. If the arrays
have no common element and they are both non-empty and either of them contains a null element null is
returned, false otherwise.
Examples:
Since: 2.4.0
arrays_zip
arrays_zip(a1, a2, …) - Returns a merged array of structs in which the N-th struct contains all N-th values of
input arrays.
Examples:
[{"0":1,"1":2},{"0":2,"1":3},{"0":3,"1":4}]
[{"0":1,"1":2,"2":3},{"0":2,"1":3,"2":4}]
Since: 2.4.0
ascii
ascii(str) - Returns the numeric value of the first character of str .
Examples:
asin
asin(expr) - Returns the inverse sine (arcsine) of expr , as if computed by java.lang.Math.asin .
Examples:
assert_true
assert_true(expr) - Throws an exception if expr is not true.
Examples:
atan
atan(expr) - Returns the inverse tangent (arctangent) of expr , as if computed by java.lang.Math.atan
Examples:
atan2
atan2(exprY, exprX) - Returns the angle in radians between the positive x-axis of a plane and the point given by
the coordinates ( exprX , exprY ), as if computed by java.lang.Math.atan2 .
Arguments:
exprY - coordinate on y-axis
exprX - coordinate on x-axis
Examples:
base64
base64(bin) - Converts the argument from a binary bin to a base 64 string.
Examples:
bigint
bigint(expr) - Casts the value expr to the target data type bigint .
bin
bin(expr) - Returns the string representation of the long value expr represented in binary.
Examples:
binary
binary(expr) - Casts the value expr to the target data type binary .
bit_length
bit_length(expr) - Returns the bit length of string data or number of bits of binary data.
Examples:
boolean
boolean(expr) - Casts the value expr to the target data type boolean .
bround
bround(expr, d) - Returns expr rounded to d decimal places using HALF_EVEN rounding mode.
Examples:
> SELECT bround(2.5, 0);
2.0
cardinality
cardinality(expr) - Returns the size of an array or a map. The function returns -1 if its input is null and
spark.sql.legacy.sizeOfNull is set to true. If spark.sql.legacy.sizeOfNull is set to false, the function returns null for
null input. By default, the spark.sql.legacy.sizeOfNull parameter is set to true.
Examples:
cast
cast(expr AS type) - Casts the value expr to the target data type type .
Examples:
cbrt
cbrt(expr) - Returns the cube root of expr .
Examples:
ceil
ceil(expr) - Returns the smallest integer not smaller than expr .
Examples:
ceiling
ceiling(expr) - Returns the smallest integer not smaller than expr .
Examples:
> SELECT ceiling(-0.1);
0
> SELECT ceiling(5);
5
char
char(expr) - Returns the ASCII character having the binary equivalent to expr . If n is larger than 256 the result is
equivalent to chr(n % 256)
Examples:
char_length
char_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of
string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
character_length
character_length(expr) - Returns the character length of string data or number of bytes of binary data. The
length of string data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
chr
chr(expr) - Returns the ASCII character having the binary equivalent to expr . If n is larger than 256 the result is
equivalent to chr(n % 256)
Examples:
collect_list
collect_list(expr) - Collects and returns a list of non-unique elements.
collect_set
collect_set(expr) - Collects and returns a set of unique elements.
concat
concat(col1, col2, …, colN) - Returns the concatenation of col1, col2, …, colN.
Examples:
concat_ws
concat_ws(sep, [str | array(str)]+) - Returns the concatenation of the strings separated by sep .
Examples:
conv
conv(num, from_base, to_base) - Convert num from from_base to to_base .
Examples:
corr
corr(expr1, expr2) - Returns Pearson coefficient of correlation between a set of number pairs.
cos
cos(expr) - Returns the cosine of expr , as if computed by java.lang.Math.cos .
Arguments:
expr - angle in radians
Examples:
cosh
cosh(expr) - Returns the hyperbolic cosine of expr , as if computed by java.lang.Math.cosh .
Arguments:
expr - hyperbolic angle
Examples:
cot
cot(expr) - Returns the cotangent of expr , as if computed by 1/java.lang.Math.tan .
Arguments:
expr - angle in radians
Examples:
count
count(*) - Returns the total number of retrieved rows, including rows containing null.
count(expr[, expr…]) - Returns the number of rows for which the supplied expression(s) are all non-null.
count(DISTINCT expr[, expr…]) - Returns the number of rows for which the supplied expression(s) are unique and
non-null.
count_min_sketch
count_min_sketch(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given esp,
confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before usage.
Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
covar_pop
covar_pop(expr1, expr2) - Returns the population covariance of a set of number pairs.
covar_samp
covar_samp(expr1, expr2) - Returns the sample covariance of a set of number pairs.
crc32
crc32(expr) - Returns a cyclic redundancy check value of the expr as a bigint.
Examples:
cube
cume_dist
cume_dist() - Computes the position of a value relative to all values in the partition.
current_database
current_database() - Returns the current database.
Examples:
current_date
current_date() - Returns the current date at the start of query evaluation.
Since: 1.5.0
current_timestamp
current_timestamp() - Returns the current timestamp at the start of query evaluation.
Since: 1.5.0
date
date(expr) - Casts the value expr to the target data type date .
date_add
date_add(start_date, num_days) - Returns the date that is num_days after start_date .
Examples:
> SELECT date_add('2016-07-30', 1);
2016-07-31
Since: 1.5.0
date_format
date_format(timestamp, fmt) - Converts timestamp to a value of string in the format specified by the date
format fmt .
Examples:
Since: 1.5.0
date_sub
date_sub(start_date, num_days) - Returns the date that is num_days before start_date .
Examples:
Since: 1.5.0
date_trunc
date_trunc(fmt, ts) - Returns timestamp ts truncated to the unit specified by the format model fmt . fmt
should be one of [“YEAR”, “YYYY”, “YY”, “MON”, “MONTH”, “MM”, “DAY”, “DD”, “HOUR”, “MINUTE”, “SECOND”,
“WEEK”, “QUARTER”]
Examples:
Since: 2.3.0
datediff
datediff(endDate, startDate) - Returns the number of days from startDate to endDate .
Examples:
> SELECT datediff('2009-07-31', '2009-07-30');
1
Since: 1.5.0
day
day(date) - Returns the day of month of the date/timestamp.
Examples:
Since: 1.5.0
dayofmonth
dayofmonth(date) - Returns the day of month of the date/timestamp.
Examples:
Since: 1.5.0
dayofweek
dayofweek(date) - Returns the day of the week for date/timestamp (1 = Sunday, 2 = Monday, …, 7 = Saturday).
Examples:
Since: 2.3.0
dayofyear
dayofyear(date) - Returns the day of year of the date/timestamp.
Examples:
Since: 1.5.0
decimal
decimal(expr) - Casts the value expr to the target data type decimal .
decode
decode(bin, charset) - Decodes the first argument using the second argument character set.
Examples:
degrees
degrees(expr) - Converts radians to degrees.
Arguments:
expr - angle in radians
Examples:
dense_rank
dense_rank() - Computes the rank of a value in a group of values. The result is one plus the previously assigned
rank value. Unlike the function rank, dense_rank will not produce gaps in the ranking sequence.
double
double(expr) - Casts the value expr to the target data type double .
e
e() - Returns Euler’s number, e.
Examples:
element_at
element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements
from the last to the first. Returns NULL if the index exceeds the length of the array.
element_at(map, key) - Returns value for given key, or NULL if the key is not contained in the map
Examples:
> SELECT element_at(array(1, 2, 3), 2);
2
> SELECT element_at(map(1, 'a', 2, 'b'), 2);
b
Since: 2.4.0
elt
elt(n, input1, input2, …) - Returns the n -th input, e.g., returns input2 when n is 2.
Examples:
encode
encode(str, charset) - Encodes the first argument using the second argument character set.
Examples:
exists
exists(expr, pred) - Tests whether a predicate holds for one or more elements in the array.
Examples:
Since: 2.4.0
exp
exp(expr) - Returns e to the power of expr .
Examples:
explode
explode(expr) - Separates the elements of array expr into multiple rows, or the elements of map expr into
multiple rows and columns.
Examples:
> SELECT explode(array(10, 20));
10
20
explode_outer
explode_outer(expr) - Separates the elements of array expr into multiple rows, or the elements of map expr
into multiple rows and columns.
Examples:
expm1
expm1(expr) - Returns exp( expr ) - 1.
Examples:
factorial
factorial(expr) - Returns the factorial of expr . expr is [0..20]. Otherwise, null.
Examples:
filter
filter(expr, func) - Filters the input array using the given predicate.
Examples:
Since: 2.4.0
find_in_set
find_in_set(str, str_array) - Returns the index (1-based) of the given string ( str ) in the comma-delimited list (
str_array ). Returns 0, if the string was not found or if the given string ( str ) contains a comma.
Examples:
> SELECT find_in_set('ab','abc,b,ab,c,def');
3
first
first(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true, returns
only non-null values.
first_value
first_value(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If isIgnoreNull is true,
returns only non-null values.
flatten
flatten(arrayOfArrays) - Transforms an array of arrays into a single array.
Examples:
Since: 2.4.0
float
float(expr) - Casts the value expr to the target data type float .
floor
floor(expr) - Returns the largest integer not greater than expr .
Examples:
format_number
format_number(expr1, expr2) - Formats the number expr1 like ‘#,###,###.##’, rounded to expr2 decimal
places. If expr2 is 0, the result has no decimal point or fractional part. expr2 also accepts a user specified
format. This is supposed to function like MySQL’s FORMAT.
Examples:
from_json
from_json( jsonStr, schema[, options]) - Returns a struct value with the given jsonStr and schema .
Examples:
{"a":1, "b":0.8}
{"time":"2015-08-26 00:00:00.0"}
Since: 2.2.0
from_unixtime
from_unixtime(unix_time, format) - Returns unix_time in the specified format .
Examples:
Since: 1.5.0
from_utc_timestamp
from_utc_timestamp(timestamp, timezone) - Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a
time in UTC, and renders that time as a timestamp in the given time zone. For example, ‘GMT+1’ would yield
‘2017-07-14 03:40:00.0’.
Examples:
Since: 1.5.0
get_json_object
get_json_object( json_txt, path) - Extracts a json object from path .
Examples:
greatest
greatest(expr, …) - Returns the greatest value of all parameters, skipping null values.
Examples:
grouping
grouping_id
hash
hash(expr1, expr2, …) - Returns a hash value of the arguments.
Examples:
hex
hex(expr) - Converts expr to hexadecimal.
Examples:
hour
hour(timestamp) - Returns the hour component of the string/timestamp.
Examples:
Since: 1.5.0
hypot
hypot(expr1, expr2) - Returns sqrt( expr1 **2 + expr2 **2).
Examples:
if
if(expr1, expr2, expr3) - If expr1 evaluates to true, then returns expr2 ; otherwise returns expr3 .
Examples:
ifnull
ifnull(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.
Examples:
in
expr1 in(expr2, expr3, …) - Returns true if expr1 equals any of expr2, expr3, ….
Arguments:
expr1, expr2, expr3, … - the arguments must be same type.
Examples:
initcap
initcap(str) - Returns str with the first letter of each word in uppercase. All other letters are in lowercase.
Words are delimited by white space.
Examples:
> SELECT initcap('sPark sql');
Spark Sql
inline
inline(expr) - Explodes an array of structs into a table.
Examples:
inline_outer
inline_outer(expr) - Explodes an array of structs into a table.
Examples:
input_file_block_length
input_file_block_length() - Returns the length of the block being read, or -1 if not available.
input_file_block_start
input_file_block_start() - Returns the start offset of the block being read, or -1 if not available.
input_file_name
input_file_name() - Returns the name of the file being read, or empty string if not available.
instr
instr(str, substr) - Returns the (1-based) index of the first occurrence of substr in str .
Examples:
int
int(expr) - Casts the value expr to the target data type int .
isnan
isnan(expr) - Returns true if expr is NaN, or false otherwise.
Examples:
isnotnull
isnotnull(expr) - Returns true if expr is not null, or false otherwise.
Examples:
isnull
isnull(expr) - Returns true if expr is null, or false otherwise.
Examples:
java_method
java_method(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
Examples:
json_tuple
json_tuple( jsonStr, p1, p2, …, pn) - Returns a tuple like the function get_json_object, but it takes multiple names.
All the input parameters and output column types are string.
Examples:
kurtosis
kurtosis(expr) - Returns the kurtosis value calculated from values of a group.
lag
lag(input[, offset[, default]]) - Returns the value of input at the offset th row before the current row in the
window. The default value of offset is 1 and the default value of default is null. If the value of input at the
offsetth row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the first row of the
window does not have any previous row), default is returned.
last
last(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns
only non-null values.
last_day
last_day(date) - Returns the last day of the month which the date belongs to.
Examples:
Since: 1.5.0
last_value
last_value(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true,
returns only non-null values.
lcase
lcase(str) - Returns str with all characters changed to lowercase.
Examples:
lead
lead(input[, offset[, default]]) - Returns the value of input at the offset th row after the current row in the
window. The default value of offset is 1 and the default value of default is null. If the value of input at the
offset th row is null, null is returned. If there is no such an offset row (e.g., when the offset is 1, the last row of
the window does not have any subsequent row), default is returned.
least
least(expr, …) - Returns the least value of all parameters, skipping null values.
Examples:
left
left(str, len) - Returns the leftmost len ( len can be string type) characters from the string str . If len is less
than or equal to 0, the result is an empty string.
Examples:
length
length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string
data includes the trailing spaces. The length of binary data includes binary zeros.
Examples:
levenshtein
levenshtein(str1, str2) - Returns the Levenshtein distance between the two given strings.
Examples:
like
str like pattern - Returns true if str matches pattern, null if any arguments are null, false otherwise.
Arguments:
str - a string expression
pattern - a string expression. The pattern is a string which is matched literally, with exception to the
following special symbols:
_ matches any one character in the input (similar to . in posix regular expressions)
% matches zero or more characters in the input (similar to .* in posix regular expressions)
The escape character is '\'. If an escape character precedes a special symbol or another escape character,
the following character is matched literally. It is invalid to escape any other character.
Since Spark 2.0, string literals are unescaped in our SQL parser. For example, in order to match "\abc", the
pattern should be "\\abc".
When SQL config 'spark.sql.parser.escapedStringLiterals' is enabled, it falls back to Spark 1.6 behavior
regarding string literal parsing. For example, if the config is enabled, the pattern to match "\abc" should be
"\abc".
Examples:
> SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\\Users%'
true
Note:
Use RLIKE to match with standard regular expressions.
ln
ln(expr) - Returns the natural logarithm (base e) of expr .
Examples:
locate
locate(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos . The
given pos and return value are 1-based.
Examples:
log
log(base, expr) - Returns the logarithm of expr with base .
Examples:
log10
log10(expr) - Returns the logarithm of expr with base 10.
Examples:
log1p
log1p(expr) - Returns log(1 + expr ).
Examples:
> SELECT log1p(0);
0.0
log2
log2(expr) - Returns the logarithm of expr with base 2.
Examples:
lower
lower(str) - Returns str with all characters changed to lowercase.
Examples:
lpad
lpad(str, len, pad) - Returns str , left-padded with pad to a length of len . If str is longer than len , the
return value is shortened to len characters.
Examples:
ltrim
ltrim(str) - Removes the leading space characters from str .
ltrim(trimStr, str) - Removes the leading string contains the characters from the trim string
Arguments:
str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
Examples:
map
map(key0, value0, key1, value1, …) - Creates a map with the given key/value pairs.
Examples:
{1.0:"2",3.0:"4"}
map_concat
map_concat(map, …) - Returns the union of all the given maps
Examples:
{1:"a",2:"c",3:"d"}
Since: 2.4.0
map_from_arrays
map_from_arrays(keys, values) - Creates a map with a pair of the given key/value arrays. All elements in keys
should not be null
Examples:
{1.0:"2",3.0:"4"}
Since: 2.4.0
map_from_entries
map_from_entries(arrayOfEntries) - Returns a map created from the given array of entries.
Examples:
{1:"a",2:"b"}
Since: 2.4.0
map_keys
map_keys(map) - Returns an unordered array containing the keys of the map.
Examples:
map_values
map_values(map) - Returns an unordered array containing the values of the map.
Examples:
max
max(expr) - Returns the maximum value of expr .
md5
md5(expr) - Returns an MD5 128-bit checksum as a hex string of expr .
Examples:
mean
mean(expr) - Returns the mean calculated from values of a group.
min
min(expr) - Returns the minimum value of expr .
minute
minute(timestamp) - Returns the minute component of the string/timestamp.
Examples:
Since: 1.5.0
mod
expr1 mod expr2 - Returns the remainder after expr1 / expr2 .
Examples:
> SELECT 2 mod 1.8;
0.2
> SELECT MOD(2, 1.8);
0.2
monotonically_increasing_id
monotonically_increasing_id() - Returns monotonically increasing 64-bit integers. The generated ID is
guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts
the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition.
The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion
records. The function is non-deterministic because its result depends on partition IDs.
month
month(date) - Returns the month component of the date/timestamp.
Examples:
Since: 1.5.0
months_between
months_between(timestamp1, timestamp2[, roundOff]) - If timestamp1 is later than timestamp2 , then the result
is positive. If timestamp1 and timestamp2 are on the same day of month, or both are the last day of month, time
of day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8
digits unless roundOff=false.
Examples:
Since: 1.5.0
named_struct
named_struct(name1, val1, name2, val2, …) - Creates a struct with the given field names and values.
Examples:
{"a":1,"b":2,"c":3}
nanvl
nanvl(expr1, expr2) - Returns expr1 if it’s not NaN, or expr2 otherwise.
Examples:
negative
negative(expr) - Returns the negated value of expr .
Examples:
next_day
next_day(start_date, day_of_week) - Returns the first date which is later than start_date and named as
indicated.
Examples:
Since: 1.5.0
not
not expr - Logical not.
now
now() - Returns the current timestamp at the start of query evaluation.
Since: 1.5.0
ntile
ntile(n) - Divides the rows for each window partition into n buckets ranging from 1 to at most n .
nullif
nullif(expr1, expr2) - Returns null if expr1 equals to expr2 , or expr1 otherwise.
Examples:
nvl
nvl(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.
Examples:
nvl2
nvl2(expr1, expr2, expr3) - Returns expr2 if expr1 is not null, or expr3 otherwise.
Examples:
octet_length
octet_length(expr) - Returns the byte length of string data or number of bytes of binary data.
Examples:
or
expr1 or expr2 - Logical OR.
parse_url
parse_url(url, partToExtract[, key]) - Extracts a part from a URL.
Examples:
percent_rank
percent_rank() - Computes the percentage ranking of a value in a group of values.
percentile
percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric column col at the given
percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency should be positive
integral
percentile(col, array(percentage1 [, percentage2]…) [, frequency]) - Returns the exact percentile value array of
numeric column col at the given percentage(s). Each value of the percentage array must be between 0.0 and
1.0. The value of frequency should be positive integral
percentile_approx
percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column
col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter
(default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory.
Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation. When
percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the
approximate percentile array of column col at the given percentage array.
Examples:
pi
pi() - Returns pi.
Examples:
pmod
pmod(expr1, expr2) - Returns the positive value of expr1 mod expr2 .
Examples:
posexplode
posexplode(expr) - Separates the elements of array expr into multiple rows with positions, or the elements of
map expr into multiple rows and columns with positions.
Examples:
posexplode_outer
posexplode_outer(expr) - Separates the elements of array expr into multiple rows with positions, or the
elements of map expr into multiple rows and columns with positions.
Examples:
position
position(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos .
The given pos and return value are 1-based.
Examples:
positive
positive(expr) - Returns the value of expr .
pow
pow(expr1, expr2) - Raises expr1 to the power of expr2 .
Examples:
power
power(expr1, expr2) - Raises expr1 to the power of expr2 .
Examples:
printf
printf(strfmt, obj, …) - Returns a formatted string from printf-style format strings.
Examples:
Since: 1.5.0
radians
radians(expr) - Converts degrees to radians.
Arguments:
expr - angle in degrees
Examples:
rand
rand([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed
values in [0, 1).
Examples:
randn
randn([seed]) - Returns a random value with independent and identically distributed (i.i.d.) values drawn from
the standard normal distribution.
Examples:
rank
rank() - Computes the rank of a value in a group of values. The result is one plus the number of rows preceding
or equal to the current row in the ordering of the partition. The values will produce gaps in the sequence.
reflect
reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
Examples:
regexp_extract
regexp_extract(str, regexp[, idx]) - Extracts a group that matches regexp .
Examples:
regexp_replace
regexp_replace(str, regexp, rep) - Replaces all substrings of str that match regexp with rep .
Examples:
repeat
repeat(str, n) - Returns the string which repeats the given string value n times.
Examples:
replace
replace(str, search[, replace]) - Replaces all occurrences of search with replace .
Arguments:
str - a string expression
search - a string expression. If search is not found in str , str is returned unchanged.
replace - a string expression. If replace is not specified or is an empty string, nothing replaces the string that
is removed from str .
Examples:
> SELECT replace('ABCabc', 'abc', 'DEF');
ABCDEF
reverse
reverse(array) - Returns a reversed string or an array with reverse order of elements.
Examples:
right
right(str, len) - Returns the rightmost len ( len can be string type) characters from the string str . If len is
less than or equal to 0, the result is an empty string.
Examples:
rint
rint(expr) - Returns the double value that is closest in value to the argument and is equal to a mathematical
integer.
Examples:
rlike
str rlike regexp - Returns true if str matches regexp , or false otherwise.
Arguments:
str - a string expression
regexp - a string expression. The pattern string should be a Java regular expression.
Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser. For example, to
match "\abc", a regular expression for regexp can be "^\\abc$".
There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fall back to the Spark 1.6
behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match
"\abc" is "^\abc$".
Examples:
When spark.sql.parser.escapedStringLiterals is disabled (default).
> SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'
true
Note:
Use LIKE to match with simple string pattern.
rollup
round
round(expr, d) - Returns expr rounded to d decimal places using HALF_UP rounding mode.
Examples:
row_number
row_number() - Assigns a unique, sequential number to each row, starting with one, according to the ordering
of rows within the window partition.
rpad
rpad(str, len, pad) - Returns str , right-padded with pad to a length of len . If str is longer than len , the
return value is shortened to len characters.
Examples:
rtrim
rtrim(str) - Removes the trailing space characters from str .
rtrim(trimStr, str) - Removes the trailing string which contains the characters from the trim string from the str
Arguments:
str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
Examples:
> SELECT rtrim(' SparkSQL ');
SparkSQL
> SELECT rtrim('LQSa', 'SSparkSQLS');
SSpark
schema_of_json
schema_of_json( json[, options]) - Returns schema in the DDL format of JSON string.
Examples:
Since: 2.4.0
second
second(timestamp) - Returns the second component of the string/timestamp.
Examples:
Since: 1.5.0
sentences
sentences(str[, lang, country]) - Splits str into an array of array of words.
Examples:
sequence
sequence(start, stop, step) - Generates an array of elements from start to stop (inclusive), incrementing by step.
The type of the returned elements is the same as the type of argument expressions.
Supported types are: byte, short, integer, long, date, timestamp.
The start and stop expressions must resolve to the same type. If start and stop expressions resolve to the ‘date’
or ‘timestamp’ type then the step expression must resolve to the ‘interval’ type, otherwise to the same type as
the start and stop expressions.
Arguments:
start - an expression. The start of the range.
stop - an expression. The end the range (inclusive).
step - an optional expression. The step of the range. By default step is 1 if start is less than or equal to stop,
otherwise -1. For the temporal sequences it’s 1 day and -1 day respectively. If start is greater than stop then
the step must be negative, and vice versa.
Examples:
Since: 2.4.0
sha
sha(expr) - Returns a sha1 hash value as a hex string of the expr .
Examples:
sha1
sha1(expr) - Returns a sha1 hash value as a hex string of the expr .
Examples:
sha2
sha2(expr, bitLength) - Returns a checksum of SHA-2 family as a hex string of expr . SHA-224, SHA-256, SHA-
384, and SHA-512 are supported. Bit length of 0 is equivalent to 256.
Examples:
shiftleft
shiftleft(base, expr) - Bitwise left shift.
Examples:
shiftright
shiftright(base, expr) - Bitwise (signed) right shift.
Examples:
> SELECT shiftright(4, 1);
2
shiftrightunsigned
shiftrightunsigned(base, expr) - Bitwise unsigned right shift.
Examples:
shuffle
shuffle(array) - Returns a random permutation of the given array.
Examples:
sign
sign(expr) - Returns -1.0, 0.0 or 1.0 as expr is negative, 0 or positive.
Examples:
signum
signum(expr) - Returns -1.0, 0.0 or 1.0 as expr is negative, 0 or positive.
Examples:
sin
sin(expr) - Returns the sine of expr , as if computed by java.lang.Math.sin .
Arguments:
expr - angle in radians
Examples:
> SELECT sin(0);
0.0
sinh
sinh(expr) - Returns hyperbolic sine of expr , as if computed by java.lang.Math.sinh .
Arguments:
expr - hyperbolic angle
Examples:
size
size(expr) - Returns the size of an array or a map. The function returns -1 if its input is null and
spark.sql.legacy.sizeOfNull is set to true. If spark.sql.legacy.sizeOfNull is set to false, the function returns null for
null input. By default, the spark.sql.legacy.sizeOfNull parameter is set to true.
Examples:
skewness
skewness(expr) - Returns the skewness value calculated from values of a group.
slice
slice(x, start, length) - Subsets array x starting from index start (or starting from the end if start is negative) with
the specified length.
Examples:
Since: 2.4.0
smallint
smallint(expr) - Casts the value expr to the target data type smallint .
sort_array
sort_array(array[, ascendingOrder]) - Sorts the input array in ascending or descending order according to the
natural ordering of the array elements. Null elements will be placed at the beginning of the returned array in
ascending order or at the end of the returned array in descending order.
Examples:
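For example:
> SELECT sort_array(array('b', 'd', null, 'c', 'a'), true);
[null,"a","b","c","d"]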
soundex
soundex(str) - Returns Soundex code of the string.
Examples:
space
space(n) - Returns a string consisting of n spaces.
Examples:
spark_partition_id
spark_partition_id() - Returns the current partition id.
split
split(str, regex) - Splits str around occurrences that match regex .
Examples:
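For example:
> SELECT split('oneAtwoBthreeC', '[ABC]');
["one","two","three",""]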
sqrt
sqrt(expr) - Returns the square root of expr .
Examples:
std
std(expr) - Returns the sample standard deviation calculated from values of a group.
stddev
stddev(expr) - Returns the sample standard deviation calculated from values of a group.
stddev_pop
stddev_pop(expr) - Returns the population standard deviation calculated from values of a group.
stddev_samp
stddev_samp(expr) - Returns the sample standard deviation calculated from values of a group.
str_to_map
str_to_map(text[, pairDelim[, keyValueDelim]]) - Creates a map after splitting the text into key/value pairs using
delimiters. Default delimiters are ‘,’ for pairDelim and ‘:’ for keyValueDelim .
Examples:
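For example:
> SELECT str_to_map('a:1,b:2,c:3', ',', ':');
{"a":"1","b":"2","c":"3"}
> SELECT str_to_map('a');
{"a":null}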
string
string(expr) - Casts the value expr to the target data type string .
struct
struct(col1, col2, col3, …) - Creates a struct with the given field values.
substr
substr(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len , or the slice of byte
array that starts at pos and is of length len .
Examples:
> SELECT substr('Spark SQL', 5);
k SQL
> SELECT substr('Spark SQL', -3);
SQL
> SELECT substr('Spark SQL', 5, 1);
k
substring
substring(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len , or the slice of
byte array that starts at pos and is of length len .
Examples:
substring_index
substring_index(str, delim, count) - Returns the substring from str before count occurrences of the delimiter
delim . If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If
count is negative, everything to the right of the final delimiter (counting from the right) is returned. The
function substring_index performs a case-sensitive match when searching for delim .
Examples:
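For example:
> SELECT substring_index('www.apache.org', '.', 2);
www.apache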
sum
sum(expr) - Returns the sum calculated from values of a group.
tan
tan(expr) - Returns the tangent of expr , as if computed by java.lang.Math.tan .
Arguments:
expr - angle in radians
Examples:
tanh
tanh(expr) - Returns the hyperbolic tangent of expr , as if computed by java.lang.Math.tanh .
Arguments:
expr - hyperbolic angle
Examples:
timestamp
timestamp(expr) - Casts the value expr to the target data type timestamp .
tinyint
tinyint(expr) - Casts the value expr to the target data type tinyint .
to_date
to_date(date_str[, fmt]) - Parses the date_str expression with the fmt expression to a date. Returns null with
invalid input. By default, it follows casting rules to a date if the fmt is omitted.
Examples:
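For example:
> SELECT to_date('2009-07-30 04:17:52');
2009-07-30
> SELECT to_date('2016-12-31', 'yyyy-MM-dd');
2016-12-31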
Since: 1.5.0
to_json
to_json(expr[, options]) - Returns a JSON string with a given struct value
Examples:
{"a":1,"b":2}
```sql
> SELECT to_json(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat',
'dd/MM/yyyy'));
{"time":"26/08/2015"}
```sql
> SELECT to_json(array(named_struct('a', 1, 'b', 2)));
[{"a":1,"b":2}]
```sql
> SELECT to_json(map('a', named_struct('b', 1)));
{"a":{"b":1}}
```sql
> SELECT to_json(map(named_struct('a', 1),named_struct('b', 2)));
{"[1]":{"b":2}}
```sql
> SELECT to_json(map('a', 1));
{"a":1}
```sql
> SELECT to_json(array((map('a', 1))));
[{"a":1}]
Since: 2.2.0
to_timestamp
to_timestamp(timestamp[, fmt]) - Parses the timestamp expression with the fmt expression to a timestamp.
Returns null with invalid input. By default, it follows casting rules to a timestamp if the fmt is omitted.
Examples:
Since: 2.2.0
to_unix_timestamp
to_unix_timestamp(expr[, pattern]) - Returns the UNIX timestamp of the given time.
Examples:
Since: 1.6.0
to_utc_timestamp
to_utc_timestamp(timestamp, timezone) - Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a time
in the given time zone, and renders that time as a timestamp in UTC. For example, ‘GMT+1’ would yield ‘2017-
07-14 01:40:00.0’.
Examples:
transform
transform(expr, func) - Transforms elements in an array using the function.
Examples:
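For example:
> SELECT transform(array(1, 2, 3), x -> x + 1);
[2,3,4]
> SELECT transform(array(1, 2, 3), (x, i) -> x + i);
[1,3,5]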
Since: 2.4.0
translate
translate(input, from, to) - Translates the input string by replacing the characters present in the from string
with the corresponding characters in the to string.
Examples:
trim
trim(str) - Removes the leading and trailing space characters from str .
trim(BOTH trimStr FROM str) - Remove the leading and trailing trimStr characters from str
trim(LEADING trimStr FROM str) - Remove the leading trimStr characters from str
trim(TRAILING trimStr FROM str) - Remove the trailing trimStr characters from str
Arguments:
str - a string expression
trimStr - the trim string characters to trim, the default value is a single space
BOTH, FROM - these are keywords to specify trimming string characters from both ends of the string
LEADING, FROM - these are keywords to specify trimming string characters from the left end of the string
TRAILING, FROM - these are keywords to specify trimming string characters from the right end of the string
Examples:
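For example:
> SELECT trim('    SparkSQL   ');
SparkSQL
> SELECT trim(BOTH 'SL' FROM 'SSparkSQLS');
parkSQ
> SELECT trim(LEADING 'SL' FROM 'SSparkSQLS');
parkSQLS
> SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS');
SSparkSQ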
Since: 1.5.0
ucase
ucase(str) - Returns str with all characters changed to uppercase.
Examples:
unbase64
unbase64(str) - Converts the argument from a base 64 string str to a binary.
Examples:
unhex
unhex(expr) - Converts hexadecimal expr to binary.
Examples:
unix_timestamp
unix_timestamp([expr[, pattern]]) - Returns the UNIX timestamp of current or specified time.
Examples:
Since: 1.5.0
upper
upper(str) - Returns str with all characters changed to uppercase.
Examples:
uuid
uuid() - Returns a universally unique identifier (UUID) string. The value is returned as a canonical UUID
36-character string.
Examples:
var_pop
var_pop(expr) - Returns the population variance calculated from values of a group.
var_samp
var_samp(expr) - Returns the sample variance calculated from values of a group.
variance
variance(expr) - Returns the sample variance calculated from values of a group.
weekday
weekday(date) - Returns the day of the week for date/timestamp (0 = Monday, 1 = Tuesday, …, 6 = Sunday).
Examples:
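For example (2009-07-30 was a Thursday):
> SELECT weekday('2009-07-30');
3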
Since: 2.4.0
weekofyear
weekofyear(date) - Returns the week of the year of the given date. A week is considered to start on a Monday
and week 1 is the first week with >3 days.
Examples:
Since: 1.5.0
when
CASE WHEN expr1 THEN expr2 [WHEN expr3 THEN expr4]* [ELSE expr5] END - When expr1 = true, returns
expr2 ; else when expr3 = true, returns expr4 ; else returns expr5 .
Arguments:
expr1, expr3 - the branch condition expressions should all be boolean type.
expr2, expr4, expr5 - the branch value expressions and else value expression should all be same type or
coercible to a common type.
Examples:
> SELECT CASE WHEN 1 > 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
1.0
> SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
2.0
> SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 END;
NULL
window
xpath
xpath(xml, xpath) - Returns a string array of values within the nodes of xml that match the XPath expression.
Examples:
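For example:
> SELECT xpath('<a><b>b1</b><b>b2</b><b>b3</b><c>c1</c><c>c2</c></a>','a/b/text()');
["b1","b2","b3"]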
xpath_boolean
xpath_boolean(xml, xpath) - Returns true if the XPath expression evaluates to true, or if a matching node is
found.
Examples:
xpath_double
xpath_double(xml, xpath) - Returns a double value, the value zero if no match is found, or NaN if a match is
found but the value is non-numeric.
Examples:
xpath_int
xpath_int(xml, xpath) - Returns an integer value, or the value zero if no match is found or if a match is found but
the value is non-numeric.
Examples:
xpath_long
xpath_long(xml, xpath) - Returns a long integer value, or the value zero if no match is found or if a match is found
but the value is non-numeric.
Examples:
xpath_number
xpath_number(xml, xpath) - Returns a double value, the value zero if no match is found, or NaN if a match is
found but the value is non-numeric.
Examples:
xpath_short
xpath_short(xml, xpath) - Returns a short integer value, or the value zero if no match is found or if a match is
found but the value is non-numeric.
Examples:
xpath_string
xpath_string(xml, xpath) - Returns the text contents of the first xml node that matches the XPath expression.
Examples:
year
year(date) - Returns the year component of the date/timestamp.
Examples:
Since: 1.5.0
zip_with
zip_with(left, right, func) - Merges the two given arrays, element-wise, into a single array using function. If one
array is shorter, nulls are appended at the end to match the length of the longer array, before applying function.
Examples:
> SELECT zip_with(array(1, 2, 3), array('a', 'b', 'c'), (x, y) -> (y, x));
[{"y":"a","x":1},{"y":"b","x":2},{"y":"c","x":3}]
> SELECT zip_with(array(1, 2), array(3, 4), (x, y) -> x + y);
[4,6]
> SELECT zip_with(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y));
["ad","be","cf"]
Since: 2.4.0
|
expr1 | expr2 - Returns the result of bitwise OR of expr1 and expr2 .
Examples:
> SELECT 3 | 5;
7
~
~ expr - Returns the result of bitwise NOT of expr .
Examples:
> SELECT ~ 0;
-1
Grant
7/21/2022 • 2 minutes to read
GRANT
privilege_type [, privilege_type ] ...
ON (CATALOG | DATABASE <database-name> | TABLE <table-name> | VIEW <view-name> | FUNCTION <function-name>
| ANONYMOUS FUNCTION | ANY FILE)
TO principal
privilege_type
: SELECT | CREATE | MODIFY | READ_METADATA | CREATE_NAMED_FUNCTION | ALL PRIVILEGES
principal
: `<user>@<domain-name>` | <group-name>
Grant a privilege on an object to a user or principal. Granting a privilege on a database (for example a SELECT
privilege) has the effect of implicitly granting that privilege on all objects in that database. Granting a specific
privilege on the catalog has the effect of implicitly granting that privilege on all databases in the catalog.
To grant a privilege to all users, specify the keyword users after TO .
Examples
GRANT SELECT ON DATABASE <database-name> TO `<user>@<domain-name>`
GRANT SELECT ON ANONYMOUS FUNCTION TO `<user>@<domain-name>`
GRANT SELECT ON ANY FILE TO `<user>@<domain-name>`
CREATE OR REPLACE VIEW <view-name> AS SELECT columnA, columnB FROM <table-name> WHERE columnC > 1000;
GRANT SELECT ON VIEW <view-name> TO `<user>@<domain-name>`;
For details on required table ownership, see Frequently asked questions (FAQ).
Insert
7/21/2022 • 6 minutes to read
part_spec:
: (part_col_name1=val1 [, part_col_name2=val2, ...])
Insert data into a table or a partition from the result table of a select statement. Data is inserted by ordinal
(ordering of columns) and not by names.
NOTE
(Delta Lake on Azure Databricks) If a column has a NOT NULL constraint, and an INSERT INTO statement sets a column
value to NULL , a SparkException is thrown.
OVERWRITE
Overwrite existing data in the table or the partition. Otherwise, new data is appended.
Examples
INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] VALUES values_row [, values_row ...]
values_row:
: (val1 [, val2, ...])
Overwrite existing data in the table or the partition. Otherwise, new data is appended.
Examples
IMPORTANT
In the dynamic partition mode, the input result set could result in a large number of dynamic partitions, and thus
generate a large number of partition directories.
OVERWRITE
The semantics are different based on the type of the target table.
Hive SerDe tables: INSERT OVERWRITE doesn’t delete partitions ahead, and only overwrite those partitions that have
data written into it at runtime. This matches Apache Hive semantics. For Hive SerDe tables, Spark SQL respects the
Hive-related configuration, including hive.exec.dynamic.partition and hive.exec.dynamic.partition.mode .
Native data source tables: INSERT OVERWRITE first deletes all the partitions that match the partition specification (e.g.,
PARTITION(a=1, b)) and then inserts all the remaining values. The behavior of native data source tables can be
changed to be consistent with Hive SerDe tables by changing the session-specific configuration
spark.sql.sources.partitionOverwriteMode to DYNAMIC . The default mode is STATIC .
Examples
-- Create a partitioned native Parquet table
CREATE TABLE data_source_tab2 (col1 INT, p1 STRING, p2 STRING)
USING PARQUET PARTITIONED BY (p1, p2)
-- Two partitions ('part1', 'part1') and ('part1', 'part2') are created by this dynamic insert.
-- The dynamic partition column p2 is resolved by the last column `'part' || id`
INSERT INTO data_source_tab2 PARTITION (p1 = 'part1', p2)
SELECT id, 'part' || id FROM RANGE(1, 3)
-- A new partition ('partNew1', 'partNew2') is added by this INSERT OVERWRITE.
INSERT OVERWRITE TABLE data_source_tab2 PARTITION (p1 = 'partNew1', p2 = 'partNew2')
VALUES (3)
-- After this INSERT OVERWRITE, the two partitions ('part1', 'part1') and ('part1', 'part2') are dropped,
-- because both partitions are included by (p1 = 'part1', p2).
-- Then, two partitions ('partNew1', 'partNew2'), ('part1', 'part1') exist after this operation.
INSERT OVERWRITE TABLE data_source_tab2 PARTITION (p1 = 'part1', p2)
VALUES (5, 'part1')
-- Create and fill a partitioned hive serde table with three partitions:
-- ('part1', 'part1'), ('part1', 'part2') and ('partNew1', 'partNew2')
CREATE TABLE hive_serde_tab2 (col1 INT, p1 STRING, p2 STRING)
USING HIVE OPTIONS(fileFormat 'PARQUET') PARTITIONED BY (p1, p2)
INSERT INTO hive_serde_tab2 PARTITION (p1 = 'part1', p2)
SELECT id, 'part' || id FROM RANGE(1, 3)
INSERT OVERWRITE TABLE hive_serde_tab2 PARTITION (p1 = 'partNew1', p2)
VALUES (3, 'partNew2')
-- After this INSERT OVERWRITE, only the partition ('part1', 'part1') is overwritten by the new value.
-- All the three partitions still exist.
INSERT OVERWRITE TABLE hive_serde_tab2 PARTITION (p1 = 'part1', p2)
VALUES (5, 'part1')
Insert the query results of select_statement into a directory directory_path using Spark native format. If the
specified path exists, it is replaced with the output of the select_statement .
DIRECTORY
The path of the destination directory of the insert. The directory can also be specified in OPTIONS using the key
path . If the specified path exists, it is replaced with the output of the select_statement . If LOCAL is used, the
directory is on the local file system.
USING
The file format to use for the insert. One of TEXT , CSV , JSON , JDBC , PARQUET , ORC , HIVE , and LIBSVM , or a
fully qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister .
AS
Populate the destination directory with input data from the select statement.
Examples
INSERT OVERWRITE DIRECTORY
USING parquet
OPTIONS ('path' '/tmp/destination/path')
SELECT key, col1, col2 FROM source_table
Insert the query results of select_statement into a directory directory_path using Hive SerDe. If the specified
path exists, it is replaced with the output of the select_statement .
NOTE
This command is supported only when Hive support is enabled.
DIRECTORY
The path of the destination directory of the insert. If the specified path exists, it will be replaced with the output
of the select_statement . If LOCAL is used, the directory is on the local file system.
ROW FORMAT
Use the SERDE clause to specify a custom SerDe for this insert. Otherwise, use the DELIMITED clause to use the
native SerDe and specify the delimiter, escape character, null character, and so on.
STORED AS
The file format for this insert. One of TEXTFILE , SEQUENCEFILE , RCFILE , ORC , PARQUET , and AVRO . Alternatively,
you can specify your own input and output format through INPUTFORMAT and OUTPUTFORMAT . Only TEXTFILE ,
SEQUENCEFILE , and RCFILE can be used with ROW FORMAT SERDE , and only TEXTFILE can be used with
ROW FORMAT DELIMITED .
AS
Populate the destination directory with input data from the select statement.
Examples
LOAD DATA [LOCAL] INPATH path [OVERWRITE] INTO TABLE [db_name.]table_name [PARTITION part_spec]
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
Load data from a file into a table or a partition in the table. The target table must not be temporary. A partition
spec must be provided if and only if the target table is partitioned.
NOTE
This is supported only for tables created using the Hive format.
LOCAL
Load the path from the local file system. Otherwise, the default file system is used.
OVERWRITE
Delete existing data in the table. Otherwise, new data is appended to the table.
Merge Into (Delta Lake on Azure Databricks)
7/21/2022 • 3 minutes to read
Merge a set of updates, insertions, and deletions based on a source table into a target Delta table.
where
<matched_action> =
DELETE |
UPDATE SET * |
UPDATE SET column1 = value1 [, column2 = value2 ...]
<not_matched_action> =
INSERT * |
INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
<time_travel_version> =
TIMESTAMP AS OF timestamp_expression |
VERSION AS OF version
In Databricks Runtime 5.5 LTS and 6.x, MERGE can have at most 2 WHEN MATCHED clauses and at most 1
WHEN NOT MATCHED clause.
WHEN MATCHED clauses are executed when a source row matches a target table row based on the match
condition. These clauses have the following semantics.
WHEN MATCHED clauses can have at most one UPDATE and one DELETE action. The UPDATE action in
merge only updates the specified columns of the matched target row. The DELETE action will delete
the matched row.
Each WHEN MATCHED clause can have an optional condition. If this clause condition exists, the UPDATE or
DELETE action is executed for any matching source-target row pair only when the clause
condition is true.
If there are multiple WHEN MATCHED clauses, then they are evaluated in the order they are specified (that is,
the order of the clauses matters). All WHEN MATCHED clauses, except the last one, must have conditions.
If both WHEN MATCHED clauses have conditions and neither of the conditions are true for a matching
source-target row pair, then the matched target row is left unchanged.
To update all the columns of the target Delta table with the corresponding columns of the source
dataset, use UPDATE SET * . This is equivalent to
UPDATE SET col1 = source.col1 [, col2 = source.col2 ...] for all the columns of the target Delta
table. Therefore, this action assumes that the source table has the same columns as those in the target
table, otherwise the query will throw an analysis error.
This behavior changes when automatic schema migration is enabled. See Automatic schema
evolution for details.
WHEN NOT MATCHED clauses are executed when a source row does not match any target row based on the
match condition. These clauses have the following semantics.
WHEN NOT MATCHED clauses can only have the INSERT action. The new row is generated based on
the specified column and corresponding expressions. All the columns in the target table do not
need to be specified. For unspecified target columns, NULL will be inserted.
NOTE
In Databricks Runtime 6.5 and below, you must provide all columns in the target table for the INSERT
action.
Each WHEN NOT MATCHED clause can have an optional condition. If the clause condition is present, a
source row is inserted only if that condition is true for that row. Otherwise, the source row is
ignored.
If there are multiple WHEN NOT MATCHED clauses, then they are evaluated in the order they are specified
(that is, the order of the clauses matters). All WHEN NOT MATCHED clauses, except the last one, must
have conditions.
To insert all the columns of the target Delta table with the corresponding columns of the source
dataset, use INSERT * . This is equivalent to
INSERT (col1 [, col2 ...]) VALUES (source.col1 [, source.col2 ...]) for all the columns of the
target Delta table. Therefore, this action assumes that the source table has the same columns as
those in the target table, otherwise the query will throw an analysis error.
NOTE
This behavior changes when automatic schema migration is enabled. See Automatic schema evolution for
details.
IMPORTANT
A MERGE operation can fail if multiple rows of the source dataset match and attempt to update the same rows of the
target Delta table. According to the SQL semantics of merge, such an update operation is ambiguous as it is unclear
which source row should be used to update the matched target row. You can preprocess the source table to eliminate the
possibility of multiple matches. See the Change data capture example—it preprocesses the change dataset (that is, the
source dataset) to retain only the latest change for each key before applying that change into the target Delta table.
Examples
You can use MERGE for complex operations such as deduplicating data, upserting change data, applying SCD
Type 2 operations. See Merge examples for a few examples.
Msck
7/21/2022 • 2 minutes to read
Optimize (Delta Lake on Azure Databricks)
Optimize the layout of Delta Lake data. Optionally optimize a subset of data or colocate data by column. If you
do not specify colocation, bin-packing optimization is performed.
NOTE
Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no
effect. It aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of
tuples per file. However, the two measures are most often correlated.
Z-Ordering is not idempotent but aims to be an incremental operation. The time it takes for Z-Ordering is not
guaranteed to reduce over multiple runs. However, if no new data was added to a partition that was just Z-Ordered,
another Z-Ordering of that partition will not have any effect. It aims to produce evenly-balanced data files with respect
to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there
can be situations when that is not the case, leading to skew in optimize task times.
To control the output file size, set the Spark configuration spark.databricks.delta.optimize.maxFileSize . The
default value is 1073741824 , which sets the size to 1 GB. Specifying the value 104857600 sets the file size to 100
MB.
WHERE
Optimize the subset of rows matching the given partition predicate. Only filters involving partition key
attributes are supported.
ZORDER BY
Colocate column information in the same set of files. Co-locality is used by Delta Lake data-skipping algorithms
to dramatically reduce the amount of data that needs to be read. You can specify multiple columns for
ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each additional
column.
Examples
OPTIMIZE events
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
Refresh Table
7/21/2022 • 2 minutes to read
Refresh all cached entries associated with the table. If the table was previously cached, then it would be cached
lazily the next time it is scanned.
Reset
7/21/2022 • 2 minutes to read
RESET
Reset all properties to their default values. The Set command output will be empty after this.
Revoke
7/21/2022 • 2 minutes to read
REVOKE
privilege_type [, privilege_type ] ...
ON (CATALOG | DATABASE <database-name> | TABLE <table-name> | VIEW <view-name> | FUNCTION <function-name>
| ANONYMOUS FUNCTION | ANY FILE)
FROM principal
privilege_type
: SELECT | CREATE | MODIFY | READ_METADATA | CREATE_NAMED_FUNCTION | ALL PRIVILEGES
principal
: `<user>@<domain-name>` | <group-name>
Revoke an explicitly granted or denied privilege on an object from a user or principal. A REVOKE is strictly
scoped to the object specified in the command and does not cascade to contained objects.
To revoke a privilege from all users, specify the keyword users after FROM .
For example, suppose there is a database db with tables t1 and t2 . A user is initially granted SELECT
privileges on db and on t1 . The user can access t2 due to the GRANT on the database db .
If the administrator revokes the SELECT privilege on db , the user will no longer be able to access t2 , but will
still be able to access t1 since there is an explicit GRANT on table t1 .
If the administrator instead revokes the SELECT on table t1 but still keeps the SELECT on database db , the
user can still access t1 because the SELECT on the database db implicitly confers privileges on the table t1 .
Examples
REVOKE ALL PRIVILEGES ON DATABASE default FROM `<user>@<domain-name>`
REVOKE SELECT ON <table-name> FROM `<user>@<domain-name>`
Select
7/21/2022 • 6 minutes to read
named_expression:
: expression [AS alias]
relation:
| join_relation
| (table_name|query|relation) [sample] [AS alias]
: VALUES (expressions)[, (expressions), ...]
[AS (column_name[, column_name, ...])]
expressions:
: expression[, expression, ...]
sort_expressions:
: expression [ASC|DESC][, expression [ASC|DESC], ...]
DISTINCT
Select all matching rows from the relation, then remove duplicate results.
WHERE
Filter rows by a predicate.
ORDER BY
Impose total ordering on a set of expressions. Default sort direction is ascending. You cannot use this with
SORT BY , CLUSTER BY , or DISTRIBUTE BY .
DISTRIBUTE BY
Repartition rows in the relation based on a set of expressions. Rows with the same expression values will be
hashed to the same worker. You cannot use this with ORDER BY or CLUSTER BY .
SORT BY
Impose ordering on a set of expressions within each partition. Default sort direction is ascending. You cannot
use this with ORDER BY or CLUSTER BY .
CLUSTER BY
Repartition rows in the relation based on a set of expressions and sort the rows in ascending order based on the
expressions. In other words, this is a shorthand for DISTRIBUTE BY and SORT BY where all expressions are
sorted in ascending order. You cannot use this with ORDER BY , DISTRIBUTE BY , or SORT BY .
WINDOW
Examples
SELECT * FROM boxes
SELECT width, length FROM boxes WHERE height=3
SELECT DISTINCT width, length FROM boxes WHERE height=3 LIMIT 2
SELECT * FROM VALUES (1, 2, 3) AS (width, length, height)
SELECT * FROM VALUES (1, 2, 3), (2, 3, 4) AS (width, length, height)
SELECT * FROM boxes ORDER BY width
SELECT * FROM boxes DISTRIBUTE BY width SORT BY width
SELECT * FROM boxes CLUSTER BY length
Delta tables
You can specify a table as delta.<path-to-table> or <table-name> .
You can specify a time travel version after the table identifier using TIMESTAMP AS OF , VERSION AS OF , or @
syntax. See Query an older snapshot of a table (time travel) for details.
Examples
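The following illustrative queries assume a Delta table named events and a table stored at the hypothetical path /mnt/delta/events:
SELECT * FROM events TIMESTAMP AS OF '2018-10-18T22:15:12.013Z'
SELECT * FROM events VERSION AS OF 123
SELECT * FROM delta.`/mnt/delta/events` VERSION AS OF 123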
Table sample
sample:
| TABLESAMPLE ([integer_expression | decimal_expression] PERCENT)
: TABLESAMPLE (integer_expression ROWS)
Sample the input data. Express in terms of either a percentage (must be between 0 and 100) or a fixed number
of input rows.
Examples
SELECT * FROM boxes TABLESAMPLE (3 ROWS)
SELECT * FROM boxes TABLESAMPLE (25 PERCENT)
Join
join_relation:
| relation join_type JOIN relation [ON boolean_expression | USING (column_name, column_name) ]
: relation NATURAL join_type JOIN relation
join_type:
| INNER
| [LEFT | RIGHT] SEMI
| [LEFT | RIGHT | FULL] [OUTER]
: [LEFT] ANTI
INNER JOIN
Select all rows from both relations where there is match.
OUTER JOIN
Select all rows from both relations, filling with null values on the side that does not have a match.
SEMI JOIN
Select only rows from the side of the SEMI JOIN where there is a match. If one row matches multiple
rows, only the first match is returned.
LEFT ANTI JOIN
Select only rows from the left side that match no rows on the right side.
Examples
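The following illustrative queries reuse the boxes table and assume a second, hypothetical rectangles table with compatible columns:
SELECT * FROM boxes INNER JOIN rectangles ON boxes.width = rectangles.width
SELECT * FROM boxes FULL OUTER JOIN rectangles USING (width, length)
SELECT * FROM boxes LEFT ANTI JOIN rectangles ON boxes.width = rectangles.width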
Lateral view
lateral_view:
: LATERAL VIEW [OUTER] function_name (expressions)
table_name [AS (column_name[, column_name, ...])]
Generate zero or more output rows for each input row using a table-generating function. The most common
built-in function used with LATERAL VIEW is explode .
LATERAL VIEW OUTER
Generate a row with null values even when the function returned zero rows.
Examples
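For illustration, the first query reuses the boxes table; the second assumes a hypothetical students table with an array column grades:
SELECT * FROM boxes LATERAL VIEW explode(array(1, 2, 3)) my_view
SELECT name, my_view.grade FROM students LATERAL VIEW OUTER explode(grades) my_view AS grade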
Group by a set of expressions using one or more aggregate functions. Common built-in aggregate functions
include count, avg, min, max, and sum.
ROLLUP
Create a grouping set at each hierarchical level of the specified expressions. For instance,
GROUP BY a, b, c WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ()) .
The total number of grouping sets will be N + 1 , where N is the number of group expressions.
CUBE
Create a grouping set for each possible combination of set of the specified expressions. For instance,
GROUP BY a, b, c WITH CUBE is equivalent to
GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ()) . The total number of
grouping sets will be 2^N , where N is the number of group expressions.
GROUPING SETS
Perform a group by for each subset of the group expressions specified in the grouping sets. For instance,
GROUP BY x, y GROUPING SETS (x, y) is equivalent to the result of GROUP BY x unioned with that of GROUP BY y .
Examples
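For instance, reusing the boxes table from the earlier examples:
SELECT width, sum(length) FROM boxes GROUP BY width
SELECT width, length, count(*) FROM boxes GROUP BY width, length WITH ROLLUP
SELECT width, length, count(*) FROM boxes GROUP BY width, length GROUPING SETS (width, length)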
Window functions
window_expression:
: expression OVER window_spec
named_window:
: window_identifier AS window_spec
window_spec:
| window_identifier
: ( [PARTITION | DISTRIBUTE] BY expressions
[[ORDER | SORT] BY sort_expressions] [window_frame])
window_frame:
| [RANGE | ROWS] frame_bound
: [RANGE | ROWS] BETWEEN frame_bound AND frame_bound
frame_bound:
| CURRENT ROW
| UNBOUNDED [PRECEDING | FOLLOWING]
: expression [PRECEDING | FOLLOWING]
Compute a result over a range of input rows. A windowed expression is specified using the OVER keyword,
which is followed by either an identifier to the window (defined using the WINDOW keyword) or the specification
of a window.
PARTITION BY
Specify which rows are placed in the same partition, aliased by DISTRIBUTE BY.
ORDER BY
Specify how rows within a window partition are ordered, aliased by SORT BY.
RANGE bound
Express the size of the window in terms of a value range for the expression.
ROWS bound
Express the size of the window in terms of the number of rows before and/or after the current row.
CURRENT ROW
Use the current row as a bound.
UNBOUNDED
Use negative infinity as the lower bound or infinity as the upper bound.
PRECEDING
If used with a RANGE bound, this defines the lower bound of the value range. If used with a ROWS bound, this
determines the number of rows before the current row to keep in the window.
FOLLOWING
If used with a RANGE bound, this defines the upper bound of the value range. If used with a ROWS bound, this
determines the number of rows after the current row to keep in the window.
Hints
hints:
: /*+ hint[, hint, ...] */
hint:
: hintName [(expression[, expression, ...])]
You use hints to improve the performance of a query. For example, you can hint that a table is small enough to be
broadcast, which would speed up joins.
You add one or more hints to a SELECT statement inside /*+ ... */ comment blocks. You can specify multiple
hints inside the same comment block, in which case the hints are separated by commas, and there can be
multiple such comment blocks. A hint has a name (for example, BROADCAST ) and accepts 0 or more parameters.
Examples
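For illustration, using hypothetical customers and orders tables:
SELECT /*+ BROADCAST(customers) */ * FROM customers, orders WHERE o_custId = c_custId
SELECT /*+ SKEW('orders') */ * FROM customers, orders WHERE o_custId = c_custId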
Delta Lake on Azure Databricks See Skew join optimization for more information about the SKEW hint.
Set
7/21/2022 • 2 minutes to read
SET [-v]
SET property_key[=property_value]
Set a property, return the value of an existing property, or list all existing properties. If a value is provided for an
existing property key, the old value will be overridden.
-v
Output the key, value, and meaning of existing SQLConf properties.
Show Columns
Return the list of columns in a table. If the table does not exist, an exception is thrown.
Show Create Table
7/21/2022 • 2 minutes to read
Return the command used to create an existing table. If the table does not exist, an exception is thrown.
Show Databases
7/21/2022 • 2 minutes to read
Show Functions
Show functions matching the given regex or function name. If no regex or name is provided, then all functions
are shown. If USER or SYSTEM is declared, only user-defined Spark SQL functions or system-defined Spark SQL
functions, respectively, are shown.
LIKE
This qualifier is allowed only for compatibility and has no effect.
Show Grants
SHOW GRANTS [user] ON [CATALOG | DATABASE <database-name> | TABLE <table-name> | VIEW <view-name> | FUNCTION
<function-name> | ANONYMOUS FUNCTION | ANY FILE]
Display all privileges (including inherited, denied, and granted) that affect the specified object.
Example
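A representative query (substitute your own principal and database name):
SHOW GRANTS `<user>@<domain-name>` ON DATABASE <database-name>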
Show Partitions
part_spec:
: (part_col_name1=val1, part_col_name2=val2, ...)
List the partitions of a table, filtering by given partition values. Listing partitions is supported only for tables
created using the Delta Lake format or the Hive format, when Hive support is enabled.
Show Table Properties
7/21/2022 • 2 minutes to read
Return all properties or the value of a specific property set in a table. If the table does not exist, an exception will
be thrown.
Show Tables
7/21/2022 • 2 minutes to read
Return all tables. Shows a table’s database and whether a table is temporary.
FROM | IN
Specify the database whose tables are listed.
LIKE
Indicates which table names to match. In pattern , * matches any number of characters.
Truncate Table
7/21/2022 • 2 minutes to read
part_spec:
: (part_col1=value1, part_col2=value2, ...)
Delete all rows from a table or matching partitions in the table. The table must not be an external table or a view.
PARTITION
A partial partition spec to match partitions to be truncated. In Spark 2.0, this is supported only for tables created
using the Hive format. Since Spark 2.1, data source tables are also supported. Not supported for Delta tables.
Uncache Table
7/21/2022 • 2 minutes to read
Drop all cached entries associated with the table from the RDD cache.
Update (Delta Lake on Azure Databricks)
7/21/2022 • 2 minutes to read
UPDATE [db_name.]table_name [AS alias] SET col1 = value1 [, col2 = value2 ...] [WHERE predicate]
Update the column values for the rows that match a predicate. When no predicate is provided, update the
column values for all rows.
NOTE
(Delta Lake on Azure Databricks) If a column has a NOT NULL constraint, and an INSERT INTO statement sets a column
value to NULL , a SparkException is thrown.
WHERE
Example
UPDATE events SET eventType = 'click' WHERE eventType = 'clk'
UPDATE supports subqueries in the WHERE predicate, including IN , NOT IN , EXISTS , NOT EXISTS , and scalar
subqueries.
Subquery Examples
UPDATE all_events
SET session_time = 0, ignored = true
WHERE session_time < (SELECT min(session_time) FROM good_events)
UPDATE orders AS t1
SET order_status = 'returned'
WHERE EXISTS (SELECT oid FROM returned_orders WHERE t1.oid = oid)
UPDATE events
SET category = 'undefined'
WHERE category NOT IN (SELECT category FROM events2 WHERE date > '2001-01-01')
NOTE
The following types of subqueries are not supported:
Nested subqueries, that is, a subquery inside another subquery
A NOT IN subquery inside an OR , for example, a = 3 OR b NOT IN (SELECT c from t)
In most cases, you can rewrite NOT IN subqueries using NOT EXISTS . We recommend using NOT EXISTS whenever
possible, as UPDATE with NOT IN subqueries can be slow.
Use Database
7/21/2022 • 2 minutes to read
USE db_name
Set the current database. All subsequent commands that do not explicitly specify a database will use this one. If
the provided database does not exist, an exception is thrown. The default current database is default .
Vacuum
7/21/2022 • 2 minutes to read
Recursively vacuum directories associated with the Delta table and remove data files that are no longer in the
latest state of the transaction log for the table and are older than a retention threshold. Files are deleted
according to the time they have been logically removed from Delta’s transaction log + retention hours, not their
modification timestamps on the storage system. The default threshold is 7 days.
On Delta tables, Azure Databricks does not automatically trigger VACUUM operations. See Remove files no longer
referenced by a Delta table.
If you run VACUUM on a Delta table, you lose the ability to time travel back to a version older than the specified
data retention period.
RETAIN num HOURS
The retention threshold.
Cost-based optimizer
Spark SQL can use a cost-based optimizer (CBO) to improve query plans. This is especially useful for queries
with multiple joins. For this to work it is critical to collect table and column statistics and keep them up to date.
Collect statistics
To get the full benefit of the CBO it is important to collect both column statistics and table statistics. Statistics can
be collected using the Analyze Table command.
TIP
To maintain the statistics up-to-date, run ANALYZE TABLE after writing to the table.
Spark SQL UI
Use the Spark SQL UI page to see the executed plan and accuracy of the statistics.
A line such as rows output: 2,451,005 est: N/A means that this operator produces approximately 2M rows and
there were no statistics available.
A line such as rows output: 2,451,005 est: 1616404 (1X) means that this operator produces approx. 2M rows,
while the estimate was approx. 1.6M and the estimation error factor was 1.
A line such as rows output: 2,451,005 est: 2626656323 means that this operator produces approximately 2M
rows while the estimate was 2B rows, so the estimation error factor was 1000.
spark.conf.set("spark.sql.cbo.enabled", false)
Data skipping index
7/21/2022 • 2 minutes to read
IMPORTANT
DATASKIPPING INDEX was removed in Databricks Runtime 7.0. We recommend that you use Delta tables instead, which
offer improved data skipping capabilities.
Description
In addition to partition pruning, Databricks Runtime includes another feature that is meant to avoid scanning
irrelevant data, namely the Data Skipping Index. It uses file-level statistics in order to perform additional
skipping at file granularity. This works with, but does not depend on, Hive-style partitioning.
The effectiveness of data skipping depends on the characteristics of your data and its physical layout. As
skipping is done at file granularity, it is important that your data is horizontally partitioned across multiple files.
This will typically happen as a consequence of having multiple append jobs, (shuffle) partitioning, bucketing,
and/or the use of spark.sql.files.maxRecordsPerFile . It works best on tables with sorted buckets (
df.write.bucketBy(...).sortBy(...).saveAsTable(...) / CREATE TABLE ... CLUSTERED BY ... SORTED BY ... ), or
with columns that are correlated with partition keys (for example, brandName - modelName ,
companyID - stockPrice ), but also when your data just happens to exhibit some sortedness / clusteredness (for
example, orderID , bitcoinValue ).
NOTE
This beta feature has a number of important limitations:
It’s Opt-In: needs to be enabled manually, on a per-table basis.
It’s SQL only: there is no DataFrame API for it.
Once a table is indexed, the effects of subsequent INSERT or ADD PARTITION operations are not guaranteed to be
visible until the index is explicitly REFRESHed.
SQL Syntax
Create Index
Enables Data Skipping on the given table for the first (i.e. left-most) N supported columns, where N is controlled
by spark.databricks.io.skipping.defaultNumIndexedCols (default: 32)
partitionBy columns are always indexed and do not count towards this N.
Create Index For Columns
Enables Data Skipping on the given table for the specified list of columns. Same as above, all partitionBy
columns will always be indexed in addition to the ones specified.
Describe Index
Displays which columns of the given table are indexed, along with the corresponding types of file-level statistic
that are collected.
If EXTENDED is specified, a third column called “effectiveness_score” is displayed that gives an approximate
measure of how beneficial we expect DataSkipping to be for filters on the corresponding columns.
Refresh Full Index
Rebuilds the whole index. I.e. all the table’s partitions will be re-indexed.
Refresh Partitions
Re-indexes the specified partitions only. This operation should generally be faster than full index refresh.
Drop Index
Disables Data Skipping on the given table and deletes all index data.
Transactional writes to cloud storage with DBIO
7/21/2022 • 2 minutes to read
The Databricks DBIO package provides transactional writes to cloud storage for Apache Spark jobs. This solves a
number of performance and correctness issues that occur when Spark is used in a cloud-native setting (for
example, writing directly to storage services).
IMPORTANT
The commit protocol is not respected when you access data using paths ending in * . For example, reading
dbfs://my/path will only return committed changes, while reading dbfs://my/path/* will return the content of all the
data files in the directory, irrespective of whether their content was committed or not. This is an expected behavior.
With DBIO transactional commit, metadata files starting with _started_<id> and _committed_<id> accompany
data files created by Spark jobs. Generally you shouldn’t alter these files directly. Rather, you should use the
VACUUM command to clean them up.
For example, VACUUM ... RETAIN 1 HOUR removes uncommitted files older than one hour.
IMPORTANT
Avoid vacuuming with a horizon of less than one hour. It can cause data inconsistency.
Handle bad records and files
When reading data from a file-based data source, Apache Spark SQL faces two typical error cases. First, the files
may not be readable (for instance, they could be missing, inaccessible or corrupted). Second, even if the files are
processable, some records may not be parsable (for example, due to syntax errors and schema mismatch).
Azure Databricks provides a unified interface for handling bad records and files without interrupting Spark jobs.
You can obtain the exception records/files and reasons from the exception logs by setting the data source option
badRecordsPath . badRecordsPath specifies a path to store exception files for recording the information about
bad records for CSV and JSON sources and bad files for all the file-based built-in sources (for example, Parquet).
In addition, when reading files transient errors like network connection exception, IO exception, and so on, may
occur. These errors are ignored and also recorded under the badRecordsPath , and Spark will continue to run the
tasks.
NOTE
Using the badRecordsPath option in a file-based data source has a few important limitations:
It is non-transactional and can lead to inconsistent results.
Transient errors are treated as failures.
Examples
Unable to find input file
val df = spark.read
.option("badRecordsPath", "/tmp/badRecordsPath")
.format("parquet").load("/input/parquetFile")
df.show()
In the above example, since df.show() is unable to find the input file, Spark creates an exception file in JSON
format to record the error. For example, /tmp/badRecordsPath/20170724T101153/bad_files/xyz is the path of the
exception file. This file is under the specified badRecordsPath directory, /tmp/badRecordsPath . 20170724T101153 is
the creation time of this DataFrameReader . bad_files is the exception type. xyz is a file that contains a JSON
record, which has the path of the bad file and the exception/reason message.
Input file contains bad record
// Creates a json file containing both parsable and corrupted records
Seq("""{"a": 1, "b": 2}""", """{bad-record""").toDF().write.format("text").save("/tmp/input/jsonFile")
val df = spark.read
.option("badRecordsPath", "/tmp/badRecordsPath")
.schema("a int, b int")
.format("json")
.load("/tmp/input/jsonFile")
df.show()
In this example, the DataFrame contains only the first parsable record ( {"a": 1, "b": 2} ). The second bad
record ( {bad-record ) is recorded in the exception file, which is a JSON file located in
/tmp/badRecordsPath/20170724T114715/bad_records/xyz . The exception file contains the bad record, the path of the
file containing the record, and the exception/reason message. After you locate the exception files, you can use a
JSON reader to process them.
Handling large queries in interactive workflows
7/21/2022 • 4 minutes to read
A challenge with interactive data workflows is handling large queries. This includes queries that generate too
many output rows, fetch many external partitions, or compute on extremely large data sets. These queries can
be extremely slow, saturate cluster resources, and make it difficult for others to share the same cluster.
Query Watchdog is a process that prevents queries from monopolizing cluster resources by examining the most
common causes of large queries and terminating queries that pass a threshold. This article describes how to
enable and configure Query Watchdog.
IMPORTANT
Query Watchdog is enabled for all all-purpose clusters created using the UI.
import org.apache.spark.sql.functions._
spark.conf.set("spark.sql.shuffle.partitions", 10)
spark.range(1000000)
.withColumn("join_key", lit(" "))
.createOrReplaceTempView("table_x")
spark.range(1000000)
.withColumn("join_key", lit(" "))
.createOrReplaceTempView("table_y")
These table sizes are manageable in Apache Spark. However, they each include a join_key column with an
empty string in every row. This can happen if the data is not perfectly clean or if there is significant data skew
where some keys are more prevalent than others. These empty join keys are far more prevalent than any other
value.
In the following code, the analyst is joining these two tables on their keys, which produces output of one trillion
results, and all of these are produced on a single executor (the executor that gets the " " key):
SELECT
id, count(*)
FROM
(SELECT
x.id
FROM
table_x x
JOIN
table_y y
on x.join_key = y.join_key)
GROUP BY id
This query appears to be running. But without knowing about the data, the analyst sees that there’s “only” a
single task left over the course of executing the job. The query never finishes, leaving the analyst frustrated and
confused about why it did not work.
In this case there is only one problematic join key. Other times there may be many more.
spark.conf.set("spark.databricks.queryWatchdog.enabled", true)
spark.conf.set("spark.databricks.queryWatchdog.outputRatioThreshold", 1000L)
The latter configuration declares that any given task should never produce more than 1000 times the number of
input rows.
TIP
The output ratio is completely customizable. We recommend starting lower and seeing what threshold works well for you
and your team. A range of 1,000 to 10,000 is a good starting point.
Not only does Query Watchdog prevent users from monopolizing cluster resources for jobs that will never
complete, it also saves time by fast-failing a query that would have never completed. For example, the following
query will fail after several minutes because it exceeds the ratio.
SELECT
join_key,
sum(x.id),
count(*)
FROM
(SELECT
x.id,
y.join_key
FROM
table_x x
JOIN
table_y y
on x.join_key = y.join_key)
GROUP BY join_key
spark.conf.set("spark.databricks.queryWatchdog.minTimeSecs", 10L)
spark.conf.set("spark.databricks.queryWatchdog.minOutputRows", 100000L)
TIP
If you configure Query Watchdog in a notebook, the configuration does not persist across cluster restarts. If you want to
configure Query Watchdog for all users of a cluster, we recommend that you use a cluster configuration.
spark.conf.set("spark.databricks.queryWatchdog.maxHivePartitions", 20000)
spark.conf.set("spark.databricks.queryWatchdog.maxQueryTasks", 20000)
Adaptive query execution
Adaptive query execution (AQE) is query re-optimization that occurs during query execution.
The motivation for runtime re-optimization is that Azure Databricks has the most up-to-date accurate statistics
at the end of a shuffle and broadcast exchange (referred to as a query stage in AQE). As a result, Azure
Databricks can opt for a better physical strategy, pick an optimal post-shuffle partition size and number, or do
optimizations that used to require hints, for example, skew join handling.
This can be very useful when statistics collection is not turned on or when statistics are stale. It is also useful in
places where statically derived statistics are inaccurate, such as in the middle of a complicated query, or after the
occurrence of data skew.
Capabilities
In Databricks Runtime 7.3 LTS and above, AQE is enabled by default. It has 4 major features:
Dynamically changes sort merge join into broadcast hash join.
Dynamically coalesces partitions (combine small partitions into reasonably sized partitions) after shuffle
exchange. Very small tasks have worse I/O throughput and tend to suffer more from scheduling overhead
and task setup overhead. Combining small tasks saves resources and improves cluster throughput.
Dynamically handles skew in sort merge join and shuffle hash join by splitting (and replicating if needed)
skewed tasks into roughly evenly sized tasks.
Dynamically detects and propagates empty relations.
Application
AQE applies to all queries that are:
Non-streaming
Contain at least one exchange (usually when there’s a join, aggregate, or window), one sub-query, or both.
Not all AQE-applied queries are necessarily re-optimized. The re-optimization might or might not come up with
a different query plan than the one statically compiled. To determine whether a query’s plan has been changed
by AQE, see the following section, Query plans.
Query plans
This section discusses how you can examine query plans in different ways.
In this section:
Spark UI
DataFrame.explain()
SQL EXPLAIN
Spark UI
AdaptiveSparkPlan node
AQE-applied queries contain one or more AdaptiveSparkPlan nodes, usually as the root node of each main
query or sub-query. Before the query runs or when it is running, the isFinalPlan flag of the corresponding
AdaptiveSparkPlan node shows as false ; after the query execution completes, the isFinalPlan flag changes to
true.
Evolving plan
The query plan diagram evolves as the execution progresses and reflects the most current plan that is being
executed. Nodes that have already been executed (in which metrics are available) will not change, but those that
haven’t can change over time as the result of re-optimizations.
The following is a query plan diagram example:
DataFrame.explain()
AdaptiveSparkPlan node
AQE-applied queries contain one or more AdaptiveSparkPlan nodes, usually as the root node of each main
query or sub-query. Before the query runs or when it is running, the isFinalPlan flag of the corresponding
AdaptiveSparkPlan node shows as false ; after the query execution completes, the isFinalPlan flag changes to
true .
After the stage execution completes, the statistics are those collected at runtime, and the flag isRuntime will
become true , for example: Statistics(sizeInBytes=658.1 KiB, rowCount=2.81E+4, isRuntime=true)
The following is a DataFrame.explain example:
Before the execution
During the execution
SQL EXPLAIN
AdaptiveSparkPlan node
AQE-applied queries contain one or more AdaptiveSparkPlan nodes, usually as the root node of each main
query or sub-query.
No current plan
As SQL EXPLAIN does not execute the query, the current plan is always the same as the initial plan and does not
reflect what would eventually get executed by AQE.
The following is a SQL explain example:
Effectiveness
The query plan will change if one or more AQE optimizations take effect. The effect of these AQE optimizations
is demonstrated by the difference between the current and final plans and the initial plan and specific plan
nodes in the current and final plans.
Dynamically change sort merge join into broadcast hash join: different physical join nodes between the
current/final plan and the initial plan
Dynamically handle skew join: node SortMergeJoin with field isSkew as true.
Dynamically detect and propagate empty relations: part of (or entire) the plan is replaced by node
LocalTableScan with the relation field as empty.
Configuration
In this section:
Enable and disable adaptive query execution
Dynamically change sort merge join into broadcast hash join
Dynamically coalesce partitions
Dynamically handle skew join
Dynamically detect and propagate empty relations
Enable and disable adaptive query execution
Property: spark.databricks.optimizer.adaptive.enabled
Type: Boolean
Dynamically change sort merge join into broadcast hash join
Property: spark.databricks.adaptive.autoBroadcastJoinThreshold
Type: Byte String
Dynamically coalesce partitions
Property: spark.sql.adaptive.advisoryPartitionSizeInBytes
Type: Byte String
The target size after coalescing. The coalesced partition sizes will be close to but no bigger than this target size.
Property: spark.sql.adaptive.coalescePartitions.minPartitionSize
Type: Byte String
The minimum size of partitions after coalescing. The coalesced partition sizes will be no smaller than this size.
Property: spark.sql.adaptive.coalescePartitions.minPartitionNum
Type: Integer
The minimum number of partitions after coalescing. Not recommended, because setting it explicitly overrides
spark.sql.adaptive.coalescePartitions.minPartitionSize .
Dynamically handle skew join
Property: spark.sql.adaptive.skewJoin.enabled
Type: Boolean
Property: spark.sql.adaptive.skewJoin.skewedPartitionFactor
Type: Integer
Default value: 5
A factor that, when multiplied by the median partition size, contributes to determining whether a partition is skewed.
Property: spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes
Type: Byte String
A partition is considered skewed when both (partition size > skewedPartitionFactor * median partition size)
and (partition size > skewedPartitionThresholdInBytes) are true .
Dynamically detect and propagate empty relations
Property: spark.databricks.adaptive.emptyRelationPropagation.enabled
Type: Boolean
In addition, skew handling support is limited for certain join types, for example, in LEFT OUTER JOIN , only skew
on the left side can be optimized.
Legacy
The term “Adaptive Execution” has existed since Spark 1.6, but the new AQE in Spark 3.0 is fundamentally
different. In terms of functionality, Spark 1.6 does only the “dynamically coalesce partitions” part. In terms of
technical architecture, the new AQE is a framework of dynamic planning and replanning of queries based on
runtime stats, which supports a variety of optimizations such as the ones we have described in this article and
can be extended to enable more potential optimizations.
Query semi-structured data in SQL
7/21/2022 • 4 minutes to read
NOTE
Available in Databricks Runtime 8.1 and above.
This article describes the Databricks SQL operators you can use to query and transform semi-structured data
stored as JSON.
NOTE
This feature lets you read semi-structured data without flattening the files. However, for optimal read query performance
Databricks recommends that you extract nested columns with the correct data types.
Syntax
You extract a column from fields containing JSON strings using the syntax <column-name>:<extraction-path> ,
where <column-name> is the string column name and <extraction-path> is the path to the field to extract. The
returned results are strings.
Examples
The following examples use the data created with the statement in Example data.
In this section:
Extract a top-level column
Extract nested fields
Extract values from arrays
Cast values
Example data
Extract a top-level column
To extract a column, specify the name of the JSON field in your extraction path.
You can provide column names within brackets. Columns referenced inside brackets are matched case
sensitively. The column name is also referenced case insensitively.
+-------+-------+
| owner | owner |
+-------+-------+
| amy | amy |
+-------+-------+
-- References are case sensitive when you use brackets
SELECT raw:OWNER case_insensitive, raw:['OWNER'] case_sensitive FROM store_data
+------------------+----------------+
| case_insensitive | case_sensitive |
+------------------+----------------+
| amy | null |
+------------------+----------------+
Use backticks to escape spaces and special characters. The field names are matched case insensitively.
-- Use backticks to escape special characters. References are case insensitive when you use backticks.
-- Use brackets to make them case sensitive.
SELECT raw:`zip code`, raw:`Zip Code`, raw:['fb:testid'] FROM store_data
+----------+----------+-----------+
| zip code | Zip Code | fb:testid |
+----------+----------+-----------+
| 94025 | 94025 | 1234 |
+----------+----------+-----------+
NOTE
If a JSON record contains multiple columns that can match your extraction path due to case insensitive matching, you will
receive an error asking you to use brackets. If you have matches of columns across rows, you will not receive any errors.
The following will throw an error: {"foo":"bar", "Foo":"bar"} , and the following won’t throw an error:
{"foo":"bar"}
{"Foo":"bar"}
+------------------+
| bicycle |
+------------------+
| { |
| "price":19.95, |
| "color":"red" |
| } |
+------------------+
-- Use brackets
SELECT raw:store['bicycle'], raw:store['BICYCLE'] FROM store_data
+------------------+---------+
| bicycle | BICYCLE |
+------------------+---------+
| { | null |
| "price":19.95, | |
| "color":"red" | |
| } | |
+------------------+---------+
-- Index elements
SELECT raw:store.fruit[0], raw:store.fruit[1] FROM store_data
+------------------+-----------------+
| fruit | fruit |
+------------------+-----------------+
| { | { |
| "weight":8, | "weight":9, |
| "type":"apple" | "type":"pear" |
| } | } |
+------------------+-----------------+
+--------------------+
| isbn |
+--------------------+
| [ |
| null, |
| "0-553-21311-3", |
| "0-395-19395-8" |
| ] |
+--------------------+
Cast values
You can use :: to cast values to basic data types. Use the from_json method to cast nested results into more
complex data types, such as arrays or structs.
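For example, assuming the store_data table used in this article, the following query casts the nested price field to a double; the first result table below shows output of this kind:
SELECT raw:store.bicycle.price::double FROM store_data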
+------------------+
| price |
+------------------+
| 19.95 |
+------------------+
+------------------+
| bicycle |
+------------------+
| { |
| "price":19.95, |
| "color":"red" |
| } |
+------------------+
Example data
NULL behavior
When a JSON field exists with a null value, you will receive a SQL null value for that column, not a null text
value.
+-------------+-----------+
| sql_null | text_null |
+-------------+-----------+
| true | null |
+-------------+-----------+
Optimize conversion between PySpark and pandas
DataFrames
7/21/2022 • 2 minutes to read
Apache Arrow is an in-memory columnar data format used in Apache Spark to efficiently transfer data between
JVM and Python processes. This is beneficial to Python developers who work with pandas and NumPy data.
However, its usage is not automatic and requires some minor changes to configuration or code to take full
advantage and ensure compatibility.
PyArrow versions
PyArrow is installed in Databricks Runtime. For information on the version of PyArrow available in each
Databricks Runtime version, see the Databricks runtime release notes.
Arrow is used when you convert a Spark DataFrame to a pandas DataFrame with toPandas() and when you create a
Spark DataFrame from a pandas DataFrame with createDataFrame() . To use Arrow for these methods, set the Spark
configuration spark.sql.execution.arrow.pyspark.enabled to
true . This configuration is enabled by default except for High Concurrency clusters as well as user isolation
clusters in workspaces that are Unity Catalog enabled.
In addition, optimizations enabled by spark.sql.execution.arrow.pyspark.enabled could fall back to a non-Arrow
implementation if an error occurs before the computation within Spark. You can control this behavior using the
Spark configuration spark.sql.execution.arrow.pyspark.fallback.enabled .
Example
import numpy as np
import pandas as pd
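A minimal round-trip sketch using the imports above (assuming a SparkSession named spark is available, as in a Databricks notebook; the pdf, df, and result_pdf names are illustrative):
# Enable Arrow-based columnar data transfers (already the default on most clusters)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()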
Using the Arrow optimizations produces the same results as when Arrow is not enabled. Even with Arrow,
toPandas() results in the collection of all records in the DataFrame to the driver program and should be done
on a small subset of the data.
In addition, not all Spark data types are supported and an error can be raised if a column has an unsupported
type. If an error occurs during createDataFrame() , Spark falls back to create the DataFrame without Arrow.
User-defined scalar functions - Scala
7/21/2022 • 2 minutes to read
This article contains Scala user-defined function (UDF) examples. It shows how to register UDFs, how to invoke
UDFs, and caveats regarding evaluation order of subexpressions in Spark SQL. See User-defined scalar functions
(UDFs) for more details.
This WHERE clause does not guarantee the strlen UDF to be invoked after filtering out nulls.
To perform proper null checking, we recommend that you do either of the following:
Make the UDF itself null-aware and do null checking inside the UDF itself
Use IF or CASE WHEN expressions to do the null check and invoke the UDF in a conditional branch
This article contains Python user-defined function (UDF) examples. It shows how to register UDFs, how to invoke
UDFs, and caveats regarding evaluation order of subexpressions in Spark SQL.
You can optionally set the return type of your UDF. The default return type is StringType .
Alternatively, you can declare the same UDF using annotation syntax:
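For example, a minimal sketch of the decorator form (the function name squared_udf is hypothetical):
from pyspark.sql.functions import udf

# squared_udf is a hypothetical example UDF with a long return type
@udf("long")
def squared_udf(s):
    return s * s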
This WHERE clause does not guarantee the strlen UDF to be invoked after filtering out nulls.
To perform proper null checking, we recommend that you do either of the following:
Make the UDF itself null-aware and do null checking inside the UDF itself
Use IF or CASE WHEN expressions to do the null check and invoke the UDF in a conditional branch
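For example, a sketch of both approaches (the table test1, its column s, and the UDF names are hypothetical):
# test1 and s are hypothetical table and column names used only for illustration

# Option 1: make the UDF itself null-aware so it is safe regardless of evaluation order
spark.udf.register("strlen_nullsafe", lambda s: len(s) if s is not None else -1, "int")
spark.sql("SELECT s FROM test1 WHERE s IS NOT NULL AND strlen_nullsafe(s) > 1")

# Option 2: keep a simple UDF, but guard the call with IF / CASE WHEN
spark.udf.register("strlen", lambda s: len(s), "int")
spark.sql("SELECT s FROM test1 WHERE IF(s IS NOT NULL, strlen(s), NULL) > 1")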
A pandas user-defined function (UDF)—also known as vectorized UDF—is a user-defined function that uses
Apache Arrow to transfer data and pandas to work with the data. pandas UDFs allow vectorized operations that
can increase performance up to 100x compared to row-at-a-time Python UDFs.
For background information, see the blog post New Pandas UDFs and Python Type Hints in the Upcoming
Release of Apache Spark 3.0 and Optimize conversion between PySpark and pandas DataFrames.
You define a pandas UDF using the keyword pandas_udf as a decorator and wrap the function with a Python
type hint. This article describes the different types of pandas UDFs and shows how to use pandas UDFs with
type hints.
The Python function should take a pandas Series as an input and return a pandas Series of the same length, and
you should specify these in the Python type hints. Spark runs a pandas UDF by splitting columns into batches,
calling the function for each batch as a subset of the data, then concatenating the results.
The following example shows how to create a pandas UDF that computes the product of 2 columns.
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType
# Declare the function and create the UDF
def multiply_func(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b

multiply = pandas_udf(multiply_func, returnType=LongType())

# The function for a pandas_udf should be able to execute with local pandas data
x = pd.Series([1, 2, 3])
print(multiply_func(x, x))
# 0 1
# 1 4
# 2 9
# dtype: int64
import pandas as pd
from typing import Iterator
from pyspark.sql.functions import col, pandas_udf, struct
# Create a Spark DataFrame; "spark" is the notebook's SparkSession
pdf = pd.DataFrame([1, 2, 3], columns=["x"])
df = spark.createDataFrame(pdf)

# An Iterator of Series to Iterator of Series pandas UDF
@pandas_udf("long")
def plus_one(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for x in batch_iter:
        yield x + 1

df.select(plus_one(col("x"))).show()
# +-----------+
# |plus_one(x)|
# +-----------+
# | 2|
# | 3|
# | 4|
# +-----------+
# In the UDF, you can initialize some state before processing batches.
# Wrap your code with try/finally or use context managers to ensure
# the release of resources at the end.
y_bc = spark.sparkContext.broadcast(1)
@pandas_udf("long")
def plus_y(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
y = y_bc.value # initialize states
try:
for x in batch_iter:
yield x + y
finally:
pass # release resources here, if any
df.select(plus_y(col("x"))).show()
# +---------+
# |plus_y(x)|
# +---------+
# | 2|
# | 3|
# | 4|
# +---------+
Iterator of multiple Series to Iterator of Series UDF
An Iterator of multiple Series to Iterator of Series UDF has similar characteristics and restrictions as Iterator of
Series to Iterator of Series UDF. The specified function takes an iterator of batches and outputs an iterator of
batches. It is also useful when the UDF execution requires initializing some state.
The differences are:
The underlying Python function takes an iterator of a tuple of pandas Series.
The wrapped pandas UDF takes multiple Spark columns as an input.
You specify the type hints as Iterator[Tuple[pandas.Series, ...]] -> Iterator[pandas.Series] .
from typing import Iterator, Tuple

@pandas_udf("long")
def multiply_two_cols(
iterator: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
for a, b in iterator:
yield a * b
df.select(multiply_two_cols("x", "x")).show()
# +-----------------------+
# |multiply_two_cols(x, x)|
# +-----------------------+
# | 1|
# | 4|
# | 9|
# +-----------------------+
df = spark.createDataFrame(
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
("id", "v"))
# A Series to scalar pandas UDF that returns the mean of a column
@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return v.mean()

df.select(mean_udf(df['v'])).show()
# +-----------+
# |mean_udf(v)|
# +-----------+
# | 4.2|
# +-----------+
df.groupby("id").agg(mean_udf(df['v'])).show()
# +---+-----------+
# | id|mean_udf(v)|
# +---+-----------+
# | 1| 1.5|
# | 2| 6.0|
# +---+-----------+
from pyspark.sql import Window

w = Window \
.partitionBy('id') \
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
# +---+----+------+
# | id| v|mean_v|
# +---+----+------+
# | 1| 1.0| 1.5|
# | 1| 2.0| 1.5|
# | 2| 3.0| 6.0|
# | 2| 5.0| 6.0|
# | 2|10.0| 6.0|
# +---+----+------+
Usage
Setting Arrow batch size
Data partitions in Spark are converted into Arrow record batches, which can temporarily lead to high memory
usage in the JVM. To avoid possible out of memory exceptions, you can adjust the size of the Arrow record
batches by setting the spark.sql.execution.arrow.maxRecordsPerBatch configuration to an integer that
determines the maximum number of rows for each batch. The default value is 10,000 records per batch. If the
number of columns is large, the value should be adjusted accordingly. Using this limit, each data partition is
divided into 1 or more record batches for processing.
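For example, a sketch of lowering the batch size for a wide DataFrame (the value shown is an arbitrary illustration):
# Reduce the maximum number of rows per Arrow record batch from the 10,000 default
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "2000")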
Timestamp with time zone semantics
Spark internally stores timestamps as UTC values, and timestamp data brought in without a specified time zone
is converted as local time to UTC with microsecond resolution.
When timestamp data is exported or displayed in Spark, the session time zone is used to localize the timestamp
values. The session time zone is set with the spark.sql.session.timeZone configuration and defaults to the JVM
system local time zone. pandas uses a datetime64 type with nanosecond resolution, datetime64[ns] , with
optional time zone on a per-column basis.
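For example, a sketch of pinning the session time zone so that localization of timestamp values is explicit (UTC is just an illustrative choice):
# Set the session time zone used when localizing timestamps
spark.conf.set("spark.sql.session.timeZone", "UTC")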
When timestamp data is transferred from Spark to pandas it is converted to nanoseconds and each column is
converted to the Spark session time zone then localized to that time zone, which removes the time zone and
displays values as local time. This occurs when calling toPandas() or pandas_udf with timestamp columns.
When timestamp data is transferred from pandas to Spark, it is converted to UTC microseconds. This occurs
when calling createDataFrame with a pandas DataFrame or when returning a timestamp from a pandas UDF.
These conversions are done automatically to ensure Spark has data in the expected format, so it is not necessary
to do any of these conversions yourself. Any nanosecond values are truncated.
A standard UDF loads timestamp data as Python datetime objects, which is different than a pandas timestamp.
To get the best performance, we recommend that you use pandas time series functionality when working with
timestamps in a pandas UDF. For details, see Time Series / Date functionality.
Example notebook
The following notebook illustrates the performance improvements you can achieve with pandas UDFs:
pandas UDFs benchmark notebook
Get notebook
pandas function APIs
7/21/2022 • 4 minutes to read
pandas function APIs enable you to directly apply a Python native function, which takes and outputs pandas
instances, to a PySpark DataFrame. Similar to pandas user-defined functions, function APIs also use Apache
Arrow to transfer data and pandas to work with the data; however, Python type hints are optional in pandas
function APIs.
There are three types of pandas function APIs:
Grouped map
Map
Cogrouped map
pandas function APIs leverage the same internal logic that pandas UDF executions use. Therefore, they share the
same characteristics as pandas UDFs, such as the use of PyArrow, the supported SQL types, and the configurations.
For more information, see the blog post New Pandas UDFs and Python Type Hints in the Upcoming Release of
Apache Spark 3.0.
Grouped map
You transform your grouped data via groupBy().applyInPandas() to implement the “split-apply-combine”
pattern. Split-apply-combine consists of three steps:
Split the data into groups by using DataFrame.groupBy .
Apply a function on each group. The input and output of the function are both pandas.DataFrame . The input
data contains all the rows and columns for each group.
Combine the results into a new DataFrame .
The column labels of the returned pandas.DataFrame must either match the field names in the defined output
schema if specified as strings, or match the field data types by position if not strings, for example, integer
indices. See pandas.DataFrame for how to label columns when constructing a pandas.DataFrame .
All data for a group is loaded into memory before the function is applied. This can lead to out of memory
exceptions, especially if the group sizes are skewed. The configuration for maxRecordsPerBatch is not applied on
groups and it is up to you to ensure that the grouped data fits into the available memory.
The following example shows how to use groupby().applyInPandas() to subtract the mean from each value in the group.
df = spark.createDataFrame(
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
("id", "v"))
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame containing all rows for one group
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
Map
You perform map operations with pandas instances by DataFrame.mapInPandas() in order to transform an
iterator of pandas.DataFrame to another iterator of pandas.DataFrame that represents the current PySpark
DataFrame and returns the result as a PySpark DataFrame.
The underlying function takes and outputs an iterator of pandas.DataFrame . It can return the output of arbitrary
length in contrast to some pandas UDFs such as Series to Series pandas UDF.
The following example shows how to use mapInPandas() :
df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def filter_func(iterator):
for pdf in iterator:
yield pdf[pdf.id == 1]
df.mapInPandas(filter_func, schema=df.schema).show()
# +---+---+
# | id|age|
# +---+---+
# | 1| 21|
# +---+---+
Cogrouped map
For cogrouped map operations with pandas instances, use DataFrame.groupby().cogroup().applyInPandas() for
two PySpark DataFrame s to be cogrouped by a common key and then a Python function applied to each
cogroup. It consists of the following steps:
Shuffle the data such that the groups of each DataFrame which share a key are cogrouped together.
Apply a function to each cogroup. The input of the function is two pandas.DataFrame (with an optional tuple
representing the key). The output of the function is a pandas.DataFrame .
Combine the pandas.DataFrame s from all groups into a new PySpark DataFrame .
To use groupBy().cogroup().applyInPandas() , you must define the following:
A Python function that defines the computation for each cogroup.
A StructType object or a string that defines the schema of the output PySpark DataFrame .
The column labels of the returned pandas.DataFrame must either match the field names in the defined output
schema if specified as strings, or match the field data types by position if not strings, for example, integer
indices. See pandas.DataFrame for how to label columns when constructing a pandas.DataFrame .
All data for a cogroup is loaded into memory before the function is applied. This can lead to out of memory
exceptions, especially if the group sizes are skewed. The configuration for maxRecordsPerBatch is not applied
and it is up to you to ensure that the cogrouped data fits into the available memory.
The following example shows how to use groupby().cogroup().applyInPandas() to perform an asof join
between two datasets.
import pandas as pd
df1 = spark.createDataFrame(
[(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],
("time", "id", "v1"))
df2 = spark.createDataFrame(
[(20000101, 1, "x"), (20000101, 2, "y")],
("time", "id", "v2"))
def asof_join(left, right):
    # Merge the cogrouped pandas DataFrames on "time", matching by "id"
    return pd.merge_asof(left, right, on="time", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
    asof_join, schema="time int, id int, v1 double, v2 string").show()
# +--------+---+---+---+
# | time| id| v1| v2|
# +--------+---+---+---+
# |20000101| 1|1.0| x|
# |20000102| 1|3.0| x|
# |20000101| 2|2.0| y|
# |20000102| 2|4.0| y|
# +--------+---+---+---+
Apache Spark SQL in Azure Databricks is designed to be compatible with Apache Hive, including metastore
connectivity, SerDes, and UDFs.
Metastore connectivity
See External Apache Hive metastore for information on how to connect Azure Databricks to an externally hosted
Hive metastore.
This article lists the top questions you might have related to Azure Databricks. It also lists some common
problems you might have while using Databricks. For more information, see What is Azure Databricks.
Next steps
Quickstart: Get started with Azure Databricks
What is Azure Databricks?
Welcome to the Knowledge Base for Azure
Databricks
7/21/2022 • 2 minutes to read
This Knowledge Base provides a wide variety of troubleshooting, how-to, and best practices articles to help you
succeed with Azure Databricks, Delta Lake, and Apache Spark. These articles were written mostly by support and
field engineers, in response to typical customer questions and issues.
Azure Databricks administration: tips and troubleshooting
Azure infrastructure: tips and troubleshooting
Business intelligence tools: tips and troubleshooting
Clusters: tips and troubleshooting
Data management: tips and troubleshooting
Data sources: tips and troubleshooting
Databricks File System (DBFS): tips and troubleshooting
Databricks SQL: tips and troubleshooting
Developer tools: tips and troubleshooting
Delta Lake: tips and troubleshooting
Jobs: tips and troubleshooting
Job execution: tips and troubleshooting
Libraries: tips and troubleshooting
Machine learning: tips and troubleshooting
Metastore: tips and troubleshooting
Metrics: tips and troubleshooting
Notebooks: tips and troubleshooting
Security and permissions: tips and troubleshooting
Streaming: tips and troubleshooting
Visualizations: tips and troubleshooting
Python with Apache Spark: tips and troubleshooting
R with Apache Spark: tips and troubleshooting
Scala with Apache Spark: tips and troubleshooting
SQL with Apache Spark: tips and troubleshooting
How to discover who deleted a workspace in Azure
portal
7/21/2022 • 2 minutes to read
If your workspace has disappeared or been deleted, you can identify which user deleted it by checking the
Activity log in the Azure portal.
1. Go to the Activity log in the Azure portal.
2. Expand the timeline to focus on when the workspace was deleted.
3. Filter the log for a record of the specific event.
4. Click on the event to display information about the event, including the user who initiated the event.
The screenshot shows how you can click the Remove Databricks Workspace event in the Operation Name
column, and then view detailed information about the event.
If you are still unable to find who deleted the workspace, create a support case with Microsoft Support. Provide
details such as the workspace id and the time range of the event (including your time zone). Microsoft Support
will review the corresponding backend activity logs.
How to discover who deleted a cluster in Azure
portal
7/21/2022 • 2 minutes to read
If a cluster in your workspace has disappeared or been deleted, you can identify which user deleted it by running
a query in the Log Analytics workspaces service in the Azure portal.
NOTE
If you do not have a Log Analytics workspace set up, you must configure Diagnostic Logging in Azure Databricks before you
continue.
DatabricksClusters
| where ActionName == "permanentDelete"
    and Response contains "\"statusCode\":200"
    and RequestParams contains "\"cluster_id\":\"0210-024915-bore731\"" // Add a cluster_id filter if the cluster ID is known
    and TimeGenerated between(datetime("2020-01-25 00:00:00") .. datetime("2020-01-28 00:00:00")) // Add a timestamp (in UTC) filter to narrow down the result
| extend id = parse_json(Identity)
| extend requestParams = parse_json(RequestParams)
| project UserEmail = id.email, clusterId = requestParams.cluster_id, SourceIPAddress, EventTime = TimeGenerated
If you are still unable to find who deleted the cluster, create a support case with Microsoft Support. Provide
details such as the workspace id and the time range of the event (including your time zone). Microsoft Support
will review the corresponding backend activity logs.
Configure Simba JDBC driver using Azure AD
7/21/2022 • 2 minutes to read
This article describes how to access Azure Databricks with a Simba JDBC driver using Azure AD authentication.
This can be useful if you want to use an Azure AD user account to connect to Azure Databricks.
NOTE
Power BI has native support for Azure AD authentication with Azure Databricks. Review the Power BI documentation for
more information.
NOTE
If Grant admin consent is not enabled, you may encounter an error later on in the process.
4. Click Save .
authority_host_url = "https://login.microsoftonline.com/"
# Application ID of Azure Databricks
azure_databricks_resource_id = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"
# Configure the AuthenticationContext using the authority URL and tenant ID
authority_url = authority_host_url + user_parameters['tenant']
context = AuthenticationContext(authority_url)
# token_response is returned by the ADAL token acquisition call (elided in this excerpt)
access_token = token_response['accessToken']
refresh_token = token_response['refreshToken']
import jaydebeapi
import pandas as pd

cursor = None
try:
    # url is the JDBC connection string constructed earlier in this article
    conn = jaydebeapi.connect("com.simba.spark.jdbc.Driver", url)
    cursor = conn.cursor()
    # Uncomment the following two lines if this code is running in the Databricks
    # Connect IDE or within a workspace notebook.
    # df = spark.createDataFrame(pdf)
    # df.show()
finally:
    if cursor is not None:
        cursor.close()
Configure Simba ODBC driver with a proxy in
Windows
7/21/2022 • 2 minutes to read
In this article you learn how to configure the Databricks ODBC Driver when your local Windows machine is
behind a proxy server.
7. Click OK .
8. Click OK .
9. Click OK .
Troubleshooting JDBC and ODBC connections
7/21/2022 • 2 minutes to read
This article provides information to help you troubleshoot the connection between your Databricks JDBC/ODBC
server and BI tools and data sources.
If the connection times out, check that the network settings for the connection are correct.
TTransportException
If the response contains a TTransportException like the following (this error is expected), it means that the
gateway is functioning properly and you have passed in valid credentials. If you are not able to connect with the
same credentials, check that the client you are using is properly configured and is using the latest Simba drivers
(version >= 1.2.0):
Other errors
If you get the error 401 Unauthorized , check the credentials you are using:
Verify that the username is token (not your username) and the password is a personal access token (it
should start with dapi ).
Responses such as 404 Not Found usually indicate problems with locating the specified cluster:
You can ignore these errors. The Simba internal log4j library is shaded to avoid conflicts with the log4j
library in your application. However, Simba may still load the log4j configuration of your application, and
attempt to use some custom log4j appenders. This attempt fails with the shaded library. Relevant
information is still captured in the logs.
Power BI proxy and SSL configuration
7/21/2022 • 3 minutes to read
Driver configurations
You can set driver configurations using the microsoft.sparkodbc.ini file which can be found in the
ODBC Drivers\Simba Spark ODBC Driver directory. The absolute path of the microsoft.sparkodbc.ini directory
depends on whether you are using Power BI Desktop or on-premises Power BI Gateway:
Power BI Desktop:
C:\Program Files\Microsoft Power BI Desktop\bin\ODBC Drivers\Simba Spark ODBC
Driver\microsoft.sparkodbc.ini
Power BI Gateway: m\ODBC Drivers\Simba Spark ODBC Driver\microsoft.sparkodbc.ini ,
where the m directory is located inside the gateway installation directory.
Set driver configurations
1. Check whether the microsoft.sparkodbc.ini file already exists. If it does, skip to step 3.
2. Open Notepad or File Explorer using Run as Administrator and create a file at ODBC Drivers\Simba
Spark ODBC Driver\microsoft.sparkodbc.ini .
3. Add the new driver configurations to the file below the [Driver] header, using the syntax <key>=<value>. Configuration
keys can be found in the manual provided with the installation of the Databricks ODBC Driver. The manual is
located at
C:\Program Files\Simba Spark ODBC Driver\Simba Apache Spark ODBC Connector Install and Configuration
Guide.html
.
Configuring a proxy
To configure a proxy, add the following configurations to the driver configuration in the
microsoft.sparkodbc.ini file:
[Driver]
UseProxy=1
ProxyHost=<proxy.example.com>
ProxyPort=<port>
ProxyUID=<username>
ProxyPWD=<password>
[Driver]
CheckCertRevocation=0
Troubleshooting
Error: SSL_connect: certificate verify failed
When SSL issues occur, the ODBC driver returns a generic error SSL_connect: certificate verify failed . You
can get more detailed SSL debugging logs by adding the following two configurations to the
ODBC Drivers\Simba Spark ODBC Driver\microsoft.sparkodbc.ini file:
[Driver]
AllowDetailedSSLErrorMessages=1
EnableCurlDebugLogging=1
Problem
When you try to mount an Azure Data Lake Storage (ADLS) Gen1 account on Azure Databricks, it fails with the
error:
Cause
This error can occur if the ADLS Gen1 account was previously mounted in the workspace, but not unmounted,
and the credential used for that mount subsequently expired. When you try to mount the same account with a
new credential, there is a conflict between the expired and new credentials.
Solution
You need to unmount all existing mounts, and then create a new mount with a new, unexpired credential.
For more information, see Mount Azure Data Lake Storage Gen1 with DBFS.
Network configuration of Azure Data Lake Storage
Gen1 causes
ADLException: Error getting info for file
7/21/2022 • 2 minutes to read
Problem
Access to Azure Data Lake Storage Gen1 (ADLS Gen1) fails with
ADLException: Error getting info for file <filename> when the following network configuration is in place:
Azure Databricks workspace is deployed in your own virtual network (uses VNet injection).
Traffic is allowed via Azure Data Lake Storage credential passthrough.
ADLS Gen1 storage firewall is enabled.
Azure Active Directory (Azure AD) service endpoint is enabled for the Azure Databricks workspace’s virtual
network.
Cause
Azure Databricks uses a control plane located in its own virtual network, and the control plane is responsible for
obtaining a token from Azure AD. ADLS credential passthrough uses the control plane to obtain Azure AD
tokens to authenticate the interactive user with ADLS Gen1.
When you deploy your Databricks workspace in your own virtual network (using VNet injection), Azure
Databricks clusters are created in your own virtual network. For increased security, you can restrict access to the
ADLS Gen 1 account by configuring the ADLS Gen1 firewall to allow only requests from your own virtual
network, by implementing service endpoints to Azure AD.
However, ADLS credential passthrough fails in this case. The reason is that when ADLS Gen1 checks for the
virtual network where the token was created, it finds the network to be the Azure Databricks control plane and
not the customer-provided virtual network where the original passthrough call was made.
Solution
To use ADLS credential passthrough with a service endpoint, storage firewall, and ADLS Gen1, enable Allow
access to Azure services in the firewall settings.
If you have security concerns about enabling this setting in the firewall, you can upgrade to ADLS Gen2. ADLS
Gen2 works with the network configuration described above.
For more information, see:
Deploying Azure Databricks in your Azure Virtual Network
Accessing Azure Data Lake Storage Automatically with your Azure Active Directory Credentials
How to assign a single public IP for VNet-injected
workspaces using Azure Firewall
7/21/2022 • 2 minutes to read
You can use an Azure Firewall to create a VNet-injected workspace in which all clusters have a single IP
outbound address. The single IP address can be used as an additional security layer with other Azure services
and applications that allow access based on specific IP addresses.
1. Set up an Azure Databricks Workspace in your own virtual network.
2. Set up a firewall within the virtual network. See Create an NVA. When you create the firewall, you should:
Note both the private and public IP addresses for the firewall for later use.
Create a network rule for the public subnet to forward all traffic to the internet:
Name: any arbitrary name
Priority: 100
Protocol: Any
Source Addresses: IP range for the public subnet in the virtual network that you created
Destination Addresses: 0.0.0.0/1
Destination Ports: *
3. Create a Custom Route Table and associate it with the public subnet.
a. Add custom routes, also known as user-defined routes (UDR) for the following services. Specify the
Azure Databricks region addresses for your region. For Next hop type , enter Internet , as shown in
creating a route table.
Control Plane NAT VIP
Webapp
Metastore
Artifact Blob Storage
Logs Blob Storage
b. Add a custom route for the firewall with the following values:
Address prefix: 0.0.0.0/0
Next hop type: Virtual appliance
Next hop address: The private IP address for the firewall.
c. Associate the route table with the public subnet.
4. Validate the setup
a. Create a cluster in the Azure Databricks workspace.
b. Next, query blob storage using your own paths, or run %fs ls in a cell.
c. If it fails, confirm that the route table has all of the required UDRs (or a service endpoint in place of the
UDR for Blob Storage).
For more information, see Route Azure Databricks traffic using a virtual appliance or firewall.
How to analyze user interface performance issues
7/21/2022 • 2 minutes to read
Problem
The Azure Databricks user interface seems to be running slowly.
Cause
User interface performance issues typically occur due to network latency or a database query taking more time
than expected.
In order to troubleshoot this type of problem, you need to collect network logs and analyze them to see which
network traffic is affected.
In most cases, you will need the assistance of Databricks Support to identify and resolve issues with Databricks
user interface performance, but you can also analyze the logs yourself with a tool such as G Suite Toolbox HAR
Analyzer. This tool helps you analyze the logs and identify the exact API and the time taken for each request.
Troubleshooting procedure
This is the procedure for Google Chrome. For other browsers, see G Suite Toolbox HAR Analyzer.
1. Open Google Chrome and go to the page where the issue occurs.
2. In the Chrome menu bar, select View > Developer > Developer Tools .
3. In the panel at the bottom of your screen, select the Network tab.
4. Look for a round Record button in the upper left corner of the Network tab, and make sure it is red. If it is
grey, click it once to start recording.
5. Check the box next to Preserve log .
6. Click Clear to clear out any existing logs from the Network tab.
7. Reproduce the issue while the network requests are being recorded.
8. After you reproduce and record the issue, right-click anywhere on the grid of network requests to open a
context menu, select Save all as HAR with Content , and save the file to your computer.
9. Analyze the file using the HAR Analyzer tool. If this analysis does not resolve the problem, open a support
ticket and upload the HAR file or attach it to your email so that Databricks can analyze it.
Example output from HAR Analyzer
Configure custom DNS settings using dnsmasq
7/21/2022 • 2 minutes to read
dnsmasq is a tool for installing and configuring DNS routing rules for cluster nodes. You can use it to set up
routing between your Azure Databricks environment and your on-premises network.
WARNING
If you use your own DNS server and it goes down, you will experience an outage and will not be able to create clusters.
Use the following cluster-scoped init script to configure dnsmasq for a cluster node.
1. Use netcat ( nc ) to test connectivity from the notebook environment to your on-premises network.
nc -vz <on-premise-ip> 53
2. Create the base directory you want to store the init script in if it does not already exist.
dbutils.fs.mkdirs("dbfs:/databricks/<init-script-folder>/")
dbutils.fs.put("/databricks/<init-script-folder>/dns-masq-az.sh";,"""
#!/bin/bash
sudo apt-get update -y
sudo apt-get install dnsmasq -y --force-yes
## Find the default DNS settings for the instance and use them as the default DNS route
azvm_dns=$(cat /etc/resolv.conf | grep "nameserver" | cut -d' ' -f 2)
echo "Old dns in resolv.conf $azvm_dns"
echo "server=$azvm_dns" | sudo tee --append /etc/dnsmasq.conf
display(dbutils.fs.ls("dbfs:/databricks/<init-script-folder>/dns-masq-az.sh"))
5. Install the init script that you just created as a cluster-scoped init script.
You will need the full path to the location of the script (
dbfs:/databricks/<init-script-folder>/dns-masq-az.sh ).
6. Launch a zero-node cluster to confirm that you can create clusters.
Jobs are not progressing in the workspace
7/21/2022 • 2 minutes to read
Problem
Jobs fail to run on any cluster in the workspace.
Cause
This can happen if you have changed the VNet of an existing workspace. Changing the VNet of an existing Azure
Databricks workspace is not supported.
Review Deploy Azure Databricks in your VNet for more details.
Solution
1. Open the cluster driver logs in the Azure Databricks UI.
2. Search for the following WARN messages:
19/11/19 16:50:29 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your
cluster UI to ensure that workers are registered and have sufficient resources
19/11/19 16:50:44 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your
cluster UI to ensure that workers are registered and have sufficient resources
19/11/19 16:50:59 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your
cluster UI to ensure that workers are registered and have sufficient resources
If this error is present, it is likely that the VNet of the Azure Databricks workspace was changed.
3. Revert the change to restore the original VNet configuration that was used when the Azure Databricks
workspace was created.
4. Restart the running cluster.
5. Resubmit your jobs.
6. Verify the jobs are getting resources.
SAS requires current ABFS client
7/21/2022 • 2 minutes to read
Problem
While using SAS token authentication, you encounter an IllegalArgumentException error.
Cause
SAS requires the current ABFS client. Previous ABFS clients do not support SAS.
Solution
You must use the current ABFS client (
shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem ) to use SAS.
This ABFS client is available by default in Databricks Runtime 7.3 LTS and above.
If you are using an old ABFS client, you should update your code so it references the current ABFS client.
Auto termination is disabled when starting a job
cluster
7/21/2022 • 2 minutes to read
Problem
You are trying to start a job cluster, but the job creation fails with an error message.
Cause
Job clusters auto terminate once the job is completed. As a result, they do not support explicit autotermination
policies.
If you include autotermination_minutes in your cluster policy JSON, you get the error on job creation.
{
"autotermination_minutes": {
"type": "fixed",
"value": 30,
"hidden": true
}
}
Solution
Do not define autotermination_minutes in the cluster policy for job clusters.
Auto termination should only be used for all-purpose clusters.
How to calculate the number of cores in a cluster
7/21/2022 • 2 minutes to read
If your organization has installed a metrics service on your cluster nodes, you can view the number of cores in
an Azure Databricks cluster in the Workspace UI using the Metrics tab on the cluster details page.
If the driver and executors are of the same node type, you can also determine the number of cores available in a
cluster programmatically, using Scala utility code:
1. Use sc.statusTracker.getExecutorInfos.length to get the total number of nodes. The result includes the
driver node, so subtract 1.
2. Use java.lang.Runtime.getRuntime.availableProcessors to get the number of cores per node.
3. Multiply both results (subtracting 1 from the total number of nodes) to get the total number of cores
available:
Problem
Your cluster’s Spark configuration values are not applied.
Cause
This happens when the Spark config values are declared in the cluster configuration as well as in an init script.
When Spark config values are located in more than one place, the configuration in the init script takes
precedence and the cluster ignores the configuration settings in the UI.
Solution
You should define your Spark configuration values in one place.
Choose to define the Spark configuration in the cluster configuration or include the Spark configuration in an
init script.
Do not do both.
Cluster failed to launch
7/21/2022 • 3 minutes to read
This article describes several scenarios in which a cluster fails to launch, and provides troubleshooting steps for
each scenario based on error messages found in logs.
Cluster timeout
Error messages:
Cause
The cluster can fail to launch if it has a connection to an external Hive metastore and it tries to download all the
Hive metastore libraries from a maven repo. A cluster downloads almost 200 JAR files, including dependencies.
If the Azure Databricks cluster manager cannot confirm that the driver is ready within 5 minutes, then cluster
launch fails. This can occur because JAR downloading is taking too much time.
Solution
Store the Hive libraries in DBFS and access them locally from the DBFS location. See Spark Options.
The cluster could not be started in 50 minutes. Cause: Timed out with exception after <xxx> attempts
Cause
Init scripts that run during the cluster spin-up stage send an RPC (remote procedure call) to each worker
machine to run the scripts locally. All RPCs must return their status before the process continues. If any RPC hits
an issue and doesn’t respond back (due to a transient networking issue, for example), then the 1-hour timeout
can be hit, causing the cluster setup job to fail.
Solution
Use a cluster-scoped init script instead of global or cluster-named init scripts. With cluster-scoped init scripts,
Azure Databricks does not use synchronous blocking of RPCs to fetch init script execution status.
Library installation timed out after 1800 seconds. Libraries that are not yet installed:
Cause
This is usually an intermittent problem due to network problems.
Solution
Usually you can fix this problem by re-running the job or restarting the cluster.
The library installer is configured to time out after 3 minutes. While fetching and installing jars, a timeout can
occur due to network problems. To mitigate this issue, you can download the libraries from Maven to a DBFS
location and install them from there.
Cause
This error is usually returned by the cloud provider.
Solution
See the cloud provider error information in cluster unexpected termination.
Cause
This error is usually returned by the cloud provider.
Solution
See the cloud provider error information in cluster unexpected termination.
Instances unreachable
Error message:
An unexpected error was encountered while setting up the cluster. Please retry and contact Azure Databricks
if the problem persists. Internal error message: Timeout while placing node
Cause
This error is usually returned by the cloud provider. Typically, it occurs when you have an Azure Databricks
workspace deployed to your own virtual network (VNet) (as opposed to the default VNet created when you
launch a new Azure Databricks workspace). If the virtual network where the workspace is deployed is already
peered or has an ExpressRoute connection to on-premises resources, the virtual network cannot make an ssh
connection to the cluster node when Azure Databricks is attempting to create a cluster.
Solution
Add a user-defined route (UDR) to give the Azure Databricks control plane ssh access to the cluster instances,
Blob Storage instances, and artifact resources. This custom UDR allows outbound connections and does not
interfere with cluster creation. For detailed UDR instructions, see Step 3: Create user-defined routes and
associate them with your Azure Databricks virtual network subnets. For more VNet-related troubleshooting
information, see Troubleshooting.
Cluster fails to start with dummy does not exist
error
7/21/2022 • 2 minutes to read
Problem
You try to start a cluster, but it fails to start. You get an Apache Spark error message.
You review the cluster driver and worker logs and see an error message containing
java.io.FileNotFoundException: File file:/databricks/driver/dummy does not exist .
21/07/14 21:44:06 ERROR DriverDaemon$: XXX Fatal uncaught exception. Terminating driver.
java.io.FileNotFoundException: File file:/databricks/driver/dummy does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1668)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1632)
at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:511)
at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:511)
at scala.collection.immutable.List.foreach(List.scala:392)
Cause
You have spark.files dummy set in your Spark Config, but no such file exists.
Spark interprets the dummy configuration value as a valid file path and tries to find it on the local file system. If
the file does not exist, it generates the error message.
Solution
Option 1: Delete spark.files dummy from your Spark Config if you are not passing actual files to Spark.
Option 2: Create a dummy file and place it on the cluster. You can do this with an init script.
1. Create the init script.
dbutils.fs.put("dbfs:/databricks/<init-script-folder>/create_dummy_file.sh",
"""
#!/bin/bash
touch /databricks/driver/dummy""", True)
2. Install the init script that you just created as a cluster-scoped init script.
You will need the full path to the location of the script (
dbfs:/databricks/<init-script-folder>/create_dummy_file.sh ).
3. Restart the cluster
Restart your cluster after you have installed the init script.
Job fails due to cluster manager core instance
request limit
7/21/2022 • 2 minutes to read
Problem
An Azure Databricks Notebook or Job API returns the following error:
Unexpected failure while creating the cluster for the job. Cause REQUEST_LIMIT_EXCEEDED: Your request was
rejected due to API rate limit. Please retry your request later, or choose a larger node type instead.
Cause
The error indicates the Cluster Manager Service core instance request limit was exceeded.
A Cluster Manager core instance can support a maximum of 1000 requests.
Solution
Contact Azure Databricks Support to increase the limit set in the core instance.
Azure Databricks can increase the job limit maxBurstyUpsizePerOrg up to 2000, and upsizeTokenRefillRatePerMin
up to 120. Current running jobs are affected when the limit is increased.
Increasing these values can stop the throttling issue, but can also cause high CPU utilization.
The best solution for this issue is to replace the Cluster Manager core instance with a larger instance that can
support maximum data transmission rates.
Azure Databricks Support can change the current Cluster Manager instance type to a larger one.
Cannot apply updated cluster policy
7/21/2022 • 2 minutes to read
Problem
You are attempting to update an existing cluster policy, however the update does not apply to the cluster
associated with the policy. If you attempt to edit a cluster that is managed by a policy, the changes are not
applied or saved.
Cause
This is a known issue that is being addressed.
Solution
You can use a workaround until a permanent fix is available.
1. Edit the cluster policy.
2. Re-attribute the policy to Free form .
3. Add the edited policy back to the cluster.
If you want to edit a cluster that is associated with a policy:
1. Terminate the cluster.
2. Associate a different policy to the cluster.
3. Edit the cluster.
4. Re-associate the original policy to the cluster.
Custom Docker image requires root
7/21/2022 • 2 minutes to read
Problem
You are trying to launch an Azure Databricks cluster with a custom Docker container, but cluster creation fails
with an error.
{
"reason": {
"code": "CONTAINER_LAUNCH_FAILURE",
"type": "SERVICE_FAULT",
"parameters": {
"instance_id": "i-xxxxxxx",
"databricks_error_message": "Failed to launch spark container on instance i-xxxx. Exception: Could not add
container for xxxx with address xxxx. Could not mkdir in container"
}
}
}
Cause
Azure Databricks clusters require a root user and sudo .
Custom container images that are configured to start as a non-root user are not supported.
For more information, review the custom container documentation.
Solution
You must configure your Docker container to start as the root user.
Example
This container configuration starts as the standard user ubuntu. It fails to launch.
FROM databricksruntime/standard:8.x
RUN apt-get update -y && apt-get install -y git && \
ln -s /databricks/conda/envs/dcs-minimal/bin/pip /usr/local/bin/pip && \
ln -s /databricks/conda/envs/dcs-minimal/bin/python /usr/local/bin/python
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt .
RUN chown -R ubuntu /app
USER ubuntu
Problem
You are trying to use a custom Apache Spark garbage collection algorithm (other than the default one (parallel
garbage collection) on clusters running Databricks Runtime 10.0 and above. When you try to start a cluster, it
fails to start. If the configuration is set on an executor, the executor is immediately terminated.
For example, if you set either of the following custom garbage collection algorithms in your Spark config, the
cluster creation fails.
Spark driver
spark.driver.extraJavaOptions -XX:+UseG1GC
Spark executor
spark.executor.extraJavaOptions -XX:+UseG1GC
Cause
A new Java virtual machine (JVM) flag was introduced to set the garbage collection algorithm to parallel
garbage collection. If you do not change the default, the change has no impact.
If you change the garbage collection algorithm by setting spark.executor.extraJavaOptions or
spark.driver.extraJavaOptions in your Spark config , the value conflicts with the new flag. As a result, the JVM
crashed and prevents the cluster from starting.
Solution
To work around this issue, you must explicitly remove the parallel garbage collection flag in your Spark config .
This must be done at the cluster level.
Queries and transformations are encrypted before being sent to your clusters. By default, the data exchanged
between worker nodes in a cluster is not encrypted.
If you require that data is encrypted at all times, you can encrypt traffic between cluster worker nodes using AES
128 over a TLS 1.2 connection.
In some cases, you may want to use TLS 1.3 instead of TLS 1.2 because it allows for stronger ciphers.
To use TLS 1.3 on your clusters, you must enable OpenJSSE in the cluster’s Apache Spark configuration.
1. Add spark.driver.extraJavaOptions -XX:+UseOpenJSSE to your Spark Config .
2. Restart your cluster.
OpenJSSE and TLS 1.3 are now enabled on your cluster and can be used in notebooks.
Enable retries in init script
7/21/2022 • 2 minutes to read
dbutils.fs.put("dbfs:/databricks/<path-to-init-script>/retry-example-init.sh", """#!/bin/bash
function fail {
echo $1 >&2
exit 1
}
function retry {
local n=1
local max=5
local delay=5
while true; do
"$@" && break || {
if [[ $n -lt $max ]]; then
((n++))
echo "Command failed. Attempt $n/$max: `date`"
sleep $delay;
else
echo "Collecting additional info for debugging.."
ps aux > /tmp/ps_info.txt
debug_log_file=debug_logs_${HOSTNAME}_$(date +"%Y-%m-%d--%H-%M").zip
zip -r /tmp/${debug_log_file} /var/log/ /tmp/ps_info.txt /databricks/data/logs/
cp /tmp/${debug_log_file} /dbfs/tmp/
fail "The command has failed after $n attempts. `date`"
fi
}
done
}
sleep 15s
echo "starting Copying at `date`"
retry cp -rv /dbfs/libraries/xyz.jar /databricks/jars/
""", True)
Problem
When a user who has permission to start a cluster, such as an Azure Databricks Admin user, submits a job that is
owned by a different user, the job fails with the following message:
Message: Run executed on existing cluster ID <cluster id> failed because of insufficient permissions. The
error received from the cluster manager was: 'You are not authorized to restart this cluster. Please contact
your administrator or the cluster creator.'
Cause
This error can occur when the job owner’s privilege to start the cluster is revoked. In this scenario, the job will
fail even if it is submitted by an Admin user.
Solution
Re-grant the privilege to start the cluster (known as Can Manage ) to the job owner, or change the job owner to a
user or group that has the cluster start privilege. You can change the owner by navigating to your job page in Jobs , then
to Advanced > Permissions > Edit .
Adding a configuration setting overwrites all default
spark.executor.extraJavaOptions settings
7/21/2022 • 2 minutes to read
Problem
When you add a configuration setting by entering it in the Apache Spark Config text area, the new setting
replaces existing settings instead of being appended.
Version
Databricks Runtime 5.1 and below.
Cause
When the cluster restarts, the cluster reads settings from a configuration file that is created in the Clusters UI,
and overwrites the default settings.
For example, when you add the following extraJavaOptions to the Spark Config text area:
spark.executor.extraJavaOptions -
javaagent:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.jar=9404:/opt/prometheus
_jmx_exporter/jmx_prometheus_javaagent.yml
Then, in Spark UI > Environment > Spark Properties under spark.executor.extraJavaOptions , only the
newly added configuration setting shows:
-javaagent:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.jar=9404:/opt/prometheus
_jmx_exporter/jmx_prometheus_javaagent.yml
The following default settings are overwritten and are no longer applied:
-Djava.io.tmpdir=/local_disk0/tmp -XX:ReservedCodeCacheSize=256m -
XX:+UseCodeCacheFlushing -Ddatabricks.serviceName=spark-executor-1 -
Djava.security.properties=/databricks/spark/dbconf/java/extra.security -XX:+PrintFlagsFinal -
XX:+PrintGCDateStamps -verbose:gc -XX:+PrintGCDetails -Xss4m -
Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.Dataty
peFactoryImpl -
Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.Documen
tBuilderFactoryImpl -
Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFact
oryImpl -
Djavax.xml.validation.SchemaFactory=https://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jax
p.validation.XMLSchemaFactory -
Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser -
Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMX
SImplementationSourceImpl
Solution
To add a new configuration setting to spark.executor.extraJavaOptions without losing the default settings:
1. In Spark UI > Environment > Spark Properties , select and copy all of the properties set by default for
spark.executor.extraJavaOptions .
2. Click Edit .
3. In the Spark Config text area (Clusters > cluster-name > Advanced Options > Spark ), paste the
default settings.
4. Append the new configuration setting below the default settings.
5. Click outside the text area, then click Confirm .
6. Restart the cluster.
For example, let’s say you paste the following settings into the Spark Config text area. The new configuration
setting is appended to the default settings.
spark.executor.extraJavaOptions = -Djava.io.tmpdir=/local_disk0/tmp -
XX:ReservedCodeCacheSize=256m -XX:+UseCodeCacheFlushing -Ddatabricks.serviceName=spark-
executor-1 -Djava.security.properties=/databricks/spark/dbconf/java/extra.security -
XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -verbose:gc -XX:+PrintGCDetails -Xss4m -
Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.Dataty
peFactoryImpl -
Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentB
uilderFactoryImpl -
Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactor
yImpl -
Djavax.xml.validation.SchemaFactory:https://www.w3.org/2001/XMLSchema=com.sun.org.apache.xer
ces.internal.jaxp.validation.XMLSchemaFactory -
Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser -
Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplem
entationSourceImpl -
javaagent:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.jar=9404:/opt/prometheus_jm
x_exporter/jmx_prometheus_javaagent.yml
After you restart the cluster, the default settings and newly added configuration setting appear in Spark UI >
Environment > Spark Properties .
Enable GCM cipher suites
7/21/2022 • 2 minutes to read
Azure Databricks clusters do not have GCM (Galois/Counter Mode) cipher suites enabled by default.
You must enable GCM cipher suites on your cluster to connect to an external server that requires GCM cipher
suites.
NOTE
If nmap is not installed, run sudo apt-get install -y nmap to install it on your cluster.
dbutils.fs.put("/<path-to-init-script>/enable-gcm.sh", """#!/bin/bash
sed -i 's/, GCM//g' /databricks/spark/dbconf/java/extra.security
""",True)
dbutils.fs.put("/<path-to-init-script>/enable-gcm.sh", """#!/bin/bash
sed -i 's/, GCM//g' /databricks/spark/dbconf/java/extra.security
""",true)
Remember the path to the init script. You will need it when configuring your cluster.
If the GCM cipher suites are enabled, you will see the following AES-GCM ciphers listed in the output.
TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
TLS_RSA_WITH_AES_256_GCM_SHA384
TLS_ECDH_ECDSA_WITH_AES_256_GCM_SHA384
TLS_ECDH_RSA_WITH_AES_256_GCM_SHA384
TLS_DHE_RSA_WITH_AES_256_GCM_SHA384
TLS_DHE_DSS_WITH_AES_256_GCM_SHA384
TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
TLS_RSA_WITH_AES_128_GCM_SHA256
TLS_ECDH_ECDSA_WITH_AES_128_GCM_SHA256
TLS_ECDH_RSA_WITH_AES_128_GCM_SHA256
TLS_DHE_RSA_WITH_AES_128_GCM_SHA256
TLS_DHE_DSS_WITH_AES_128_GCM_SHA256
Problem
You are trying to create a cluster, but it is failing with an invalid tag value error message.
Cause
Limitations on tag Key and Value are set by Azure.
Azure tag keys must:
Contain 1-512 characters
Contain letters, numbers, and spaces; the characters < > * % & : \ ? / + are not allowed
Not start with azure , microsoft , or windows
Not duplicate an existing key
Azure tag values must:
Contain 1-256 characters
Contain letters, numbers, and spaces; the characters < > * % & : \ ? / + are not allowed
Not start with azure , microsoft , or windows
For more information, please refer to the Azure tag resource limitations documentation.
Solution
Requests to update any limits on tagging must be made directly with the Azure support team.
Install a private PyPI repo
7/21/2022 • 2 minutes to read
Certain use cases may require you to install libraries from private PyPI repositories.
If you are installing from a public repository, you should review the library documentation.
This article shows you how to configure an example init script that authenticates and downloads a PyPI library
from a private repository.
dbutils.fs.mkdirs("dbfs:/databricks/<init-script-folder>/")
dbutils.fs.put("/databricks/<init-script-folder>/private-pypi-install.sh","""
#!/bin/bash
/databricks/python/bin/pip install --index-url=https://${<repo-username>}:${<repo-password>}@<private-pypi-repo-domain-name> private-package==<version>
""", True)
display(dbutils.fs.ls("dbfs:/databricks/<init-script-folder>/private-pypi-install.sh"))
Problem
You are trying to update an IP access list and you get an INVALID_STATE error message.
Cause
The IP access list update that you are trying to commit does not include your current public IP address. If your
current IP address is not included in the access list, you are blocked from the environment.
If you assume that your current IP is 3.3.3.3, this example API call results in an INVALID_STATE error message.
curl -X POST -n \
https://<databricks-instance>/api/2.0/ip-access-lists \
-d '{
"label": "office",
"list_type": "ALLOW",
"ip_addresses": [
"1.1.1.1",
"2.2.2.2/21"
]
}'
Solution
You must always include your current public IP address in the JSON file that is used to update the IP access list.
If you assume that your current IP is 3.3.3.3, this example API call results in a successful IP access list update.
curl -X POST -n \
https://<databricks-instance>/api/2.0/ip-access-lists \
-d '{
"label": "office",
"list_type": "ALLOW",
"ip_addresses": [
"1.1.1.1",
"2.2.2.2/21",
"3.3.3.3"
]
}'
How to overwrite log4j configurations on Azure
Databricks clusters
7/21/2022 • 2 minutes to read
IMPORTANT
This article describes steps related to customer use of Log4j 1.x within an Azure Databricks cluster. Log4j 1.x is no longer
maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one
of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilities. You
should not enable either of these classes in your cluster.
There is no standard way to overwrite log4j configurations on clusters with custom configurations. You must
overwrite the configuration files using init scripts.
The current configurations are stored in two log4j.properties files:
On the driver:
%sh
cat /home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties
On the worker:
%sh
cat /home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties
To set class-specific logging on the driver or on workers, use the following script:
#!/bin/bash
echo "Executing on Driver: $DB_IS_DRIVER"
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties"
else
LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties"
fi
echo "Adjusting log4j.properties here: ${LOG4J_PATH}"
echo "log4j.<custom-prop>=<value>" >> ${LOG4J_PATH}
Replace <custom-prop> with the property name, and <value> with the property value.
Upload the script to DBFS and attach it to a cluster as an init script using the cluster configuration UI.
You can also set log4j.properties for the driver in the same way.
See Cluster Node Initialization Scripts for more information.
Persist Apache Spark CSV metrics to a DBFS
location
7/21/2022 • 2 minutes to read
Spark has a configurable metrics system that supports a number of sinks, including CSV files.
This article shows you how to configure an Azure Databricks cluster to use a CSV sink and persist the metrics to a DBFS location.
NOTE
The CSV metrics are saved locally before being uploaded to the DBFS location because DBFS is not designed for a large
number of random writes.
Customize the sample code and then run it in a notebook to create an init script on your cluster.
Sample code to create an init script:
dbutils.fs.put("/<init-path>/metrics.sh","""
#!/bin/bash
mkdir /tmp/csv

sudo bash -c "cat <<EOF >> /databricks/spark/dbconf/log4j/master-worker/metrics.properties
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
spark.metrics.staticSources.enabled true
spark.metrics.executorMetricsSource.enabled true
spark.executor.processTreeMetrics.enabled true
spark.sql.streaming.metricsEnabled true
master.source.jvm.class org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class org.apache.spark.metrics.source.JvmSource
*.sink.csv.period 5
*.sink.csv.unit seconds
*.sink.csv.directory /tmp/csv/
worker.sink.csv.period 5
worker.sink.csv.unit seconds
EOF"

# Periodically copy the locally written CSV metrics to the chosen DBFS location.
mkdir -p /dbfs/<metrics-path>
(while true; do cp -r /tmp/csv/. /dbfs/<metrics-path>/ 2>/dev/null; sleep 30; done) & disown
""", True)
Replace <init-path> with the DBFS location you want to use to save the init script.
Replace <metrics-path> with the DBFS location you want to use to save the CSV metrics.
IMPORTANT
Cluster log delivery is not enabled by default. You must enable cluster log delivery before starting your cluster; otherwise, there will be no logs to replay.
spark.ui.retainedJobs 1000
spark.ui.retainedStages 1000
spark.ui.retainedTasks 100000
spark.sql.ui.retainedExecutions 1000
Set Apache Hadoop core-site.xml properties
7/21/2022 • 2 minutes to read
mkdir -p /dbfs/hadoop-configs/
cat << 'EOF' > /dbfs/hadoop-configs/core-site.xml
<property>
<name><property-name-here></name>
<value><property-value-here></value>
</property>
EOF
You can add multiple properties to the file by adding additional name/value pairs to the script.
You can also create this file locally, and then upload it to your cluster.
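If you author core-site.xml locally, one option for the upload (a sketch, assuming the Databricks CLI is installed and configured) is to copy it to the same DBFS folder that the init script reads from:
# Copy a locally authored core-site.xml fragment to DBFS.
databricks fs cp core-site.xml dbfs:/hadoop-configs/core-site.xml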
The init script stages a helper script that splices the core-site.xml fragment from DBFS into the cluster's Hadoop configuration and copies it into temporary driver and worker start scripts:
dbutils.fs.put("/databricks/<init-script-folder>/set_core-site_configs.sh", """
#!/bin/bash

START_DRIVER_SCRIPT=/databricks/spark/scripts/start_driver.sh
START_WORKER_SCRIPT=/databricks/spark/scripts/start_spark_slave.sh

TMP_DRIVER_SCRIPT=/tmp/start_driver_temp.sh
TMP_WORKER_SCRIPT=/tmp/start_spark_slave_temp.sh

TMP_SCRIPT=/tmp/set_core-site_configs.sh

# Write a helper script that appends the properties from core-site.xml on DBFS
# into the cluster's Hadoop core-site.xml, just before the closing tag.
cat > "$TMP_SCRIPT" <<'EOL'
#!/bin/bash
config_xml="/dbfs/hadoop-configs/core-site.xml"

sed -i "/<\/configuration>/{
r $config_xml
a \</configuration>
d
}" /databricks/spark/dbconf/hadoop/core-site.xml
EOL

cat "$TMP_SCRIPT" > "$TMP_DRIVER_SCRIPT"
cat "$TMP_SCRIPT" > "$TMP_WORKER_SCRIPT"
""", True)
Set executor log level
7/21/2022 • 2 minutes to read
IMPORTANT
This article describes steps related to customer use of Log4j 1.x within an Azure Databricks cluster. Log4j 1.x is no longer
maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one
of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilities.
To set the log level on all executors, you must set it inside the JVM on each worker.
For example:
sc.parallelize(Seq("")).foreachPartition(x => {
  import org.apache.log4j.{LogManager, Level}
  import org.apache.commons.logging.LogFactory
  LogManager.getRootLogger().setLevel(Level.DEBUG)
  val log = LogFactory.getLog("EXECUTOR-LOG:")
  log.debug("START EXECUTOR DEBUG LOG LEVEL")
})
To verify that the level is set, navigate to the Spark UI, select the Executors tab, and open the stderr log for any executor.
Unexpected cluster termination
7/21/2022 • 2 minutes to read
Problem
When you launch an Azure Databricks cluster, you get an UnknownHostException error.
You may also get one of the following error messages:
Error: There was an error in the network configuration. databricks_error_message: Could not access worker
artifacts.
Error: Temporary failure in name resolution.
Internal error message: Failed to launch spark container on instance XXX. Exception: Could not add
container for XXX with address X.X.X.X.mysql.database.azure.com: Temporary failure in name resolution.
Cause
These errors indicate an issue with DNS settings.
Primary DNS could be down or unresponsive.
Artifacts are not being resolved, which results in the cluster launch failure.
You may have a host record listing the artifact public IP as static, but it has changed.
Solution
Identify a working DNS server and update the DNS entry on the cluster.
1. Start a standalone Azure VM and verify that the artifacts blob storage account is reachable from the
instance.
2. Verify that you can reach your primary DNS server from a notebook by running a ping command.
3. If your DNS server is not responding, try to reach your secondary DNS server from a notebook by
running a ping command.
4. Launch a Web Terminal from the cluster workspace.
5. Edit the /etc/resolv.conf file on the cluster.
6. Update the nameserver value with your working DNS server.
7. Save the changes to the file.
8. Restart systemd-resolved, as shown in the sketch after this list.
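A minimal sketch of steps 5-8 from the cluster web terminal (8.8.8.8 is a placeholder; substitute the working DNS server you identified):
# Point /etc/resolv.conf at the working DNS server (placeholder address shown).
sudo sed -i 's/^nameserver .*/nameserver 8.8.8.8/' /etc/resolv.conf

# Restart the resolver so the change takes effect.
sudo systemctl restart systemd-resolved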
Further troubleshooting
If you are still having DNS issues, you should try the following steps:
Verify that port 43 (used for whois) and port 53 (used for DNS) are open in your firewall.
Add the Azure recursive resolver (168.63.129.16) to the default DNS forwarder. Review the VMs and role instances documentation for more information.
Verify that nslookup results are identical between your laptop and the default DNS. If there is a mismatch, your DNS server may have an incorrect host record.
Verify that everything works with a default Azure DNS server. If it works with Azure DNS, but fails with your
custom DNS, your DNS admin should review your DNS server settings.
Apache Spark executor memory allocation
7/21/2022 • 2 minutes to read
By default, the amount of memory available for each executor is allocated within the Java Virtual Machine (JVM)
memory heap. This is controlled by the spark.executor.memory property.
However, some unexpected behaviors were observed on instances with a large amount of memory allocated. As
JVMs scale up in memory size, issues with the garbage collector become apparent. These issues can be resolved
by limiting the amount of memory under garbage collector management.
Selected Azure Databricks cluster types enable the off-heap mode, which limits the amount of memory under
garbage collector management. This is why certain Spark clusters have the spark.executor.memory value set to a
fraction of the overall cluster memory.
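You can confirm the value that was applied to your cluster from a notebook; a minimal check:
# Inspect the executor memory setting Databricks applied to this cluster.
print(spark.conf.get("spark.executor.memory"))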
The following Azure Databricks cluster types enable the off-heap memory policy:
Standard_L8s_v2
Standard_L16s_v2
Standard_L32s_v2
Standard_L80s_v2
Apache Spark UI shows less than total node
memory
7/21/2022 • 2 minutes to read
Problem
The Executors tab in the Spark UI shows less memory than is actually available on the node:
An F8s instance (16 GB, 8 core) for the driver node shows 4.5 GB of memory on the Executors tab.
An F4s instance (8 GB, 4 core) for the driver node shows 710 MB of memory on the Executors tab.
Cause
The total amount of memory shown is less than the memory on the cluster because some memory is occupied
by kernel and node-level services.
Solution
To calculate the available amount of memory, you can use the formula used for executor memory allocation:
(all_memory_size * 0.97 - 4800MB) * 0.8
where the fixed deductions account for memory used by the kernel and node-level services, and the 0.8 factor leaves headroom for the container that runs the Spark process.
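As a rough worked example (a sketch, assuming a node with 16 GB, that is 16384 MB, of physical memory):
# Approximate executor memory for a 16384 MB node using the formula above.
all_memory_size_mb = 16384
executor_memory_mb = (all_memory_size_mb * 0.97 - 4800) * 0.8
print(f"Approximate spark.executor.memory: {executor_memory_mb:.0f} MB")
# The Executors tab then reports only the storage-memory fraction of this value.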
CPU core limit prevents cluster creation
7/21/2022 • 2 minutes to read
Problem
Cluster creation fails with a message about a cloud provider error when you hover over cluster state.
Cloud Provider Launch Failure: A cloud provider error was encountered while setting up the cluster.
When you view the cluster event log to get more details, you see a message about core quota limits.
Operation results in exceeding quota limits of Core. Maximum allowed: 350, Current in use: 350, Additional
requested: 4.
Cause
Azure subscriptions have a CPU core quota that restricts the number of CPU cores you can use. This is a hard limit. If you try to start a cluster that would push your account over the CPU core quota, the cluster launch fails.
Solution
You can either free up resources or request a quota increase for your account.
Stop inactive clusters to free up CPU cores for use.
Open an Azure support case with a request to increase the CPU core quota limit for your subscription.
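To see how close the subscription is to the limit before requesting an increase, one option (a sketch, assuming the Azure CLI is installed and signed in) is to list compute usage for the region that hosts your workspace:
# List current vCPU usage and quota limits for the region.
az vm list-usage --location <region> --output table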
IP address limit prevents cluster creation
7/21/2022 • 2 minutes to read
Problem
Cluster creation fails with a message about a cloud provider error when you hover over cluster state.
Cloud Provider Launch Failure: A cloud provider error was encountered while setting up the cluster.
When you view the cluster event log to get more details, you see a message about publicIPAddresses limits.
Cause
Azure subscriptions have a public IP address limit that restricts the number of public IP addresses you can use. This is a hard limit. If you try to start a cluster that would push your account over the public IP address quota, the cluster launch fails.
Solution
You can either free up resources or request a quota increase for your account.
Stop inactive clusters to free up public IP addresses for use.
Open an Azure support case with a request to increase the public IP address quota limit for your
subscription.
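Similarly, you can check current public IP address usage against the quota with the Azure CLI (a sketch, assuming the CLI is installed and signed in):
# List public IP address usage and quota limits for the region.
az network list-usages --location <region> --output table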
Slow cluster launch and missing nodes
7/21/2022 • 2 minutes to read
Problem
A cluster takes a long time to launch and displays an error message indicating that not all of the requested nodes could be acquired.
Cause
Provisioning an Azure VM typically takes 2-4 minutes, but if all the VMs in a cluster cannot be provisioned at the
same time, cluster creation can be delayed. This is due to Azure Databricks having to reissue VM creation
requests over a period of time.
Solution
If a cluster launches without all of the nodes, Azure Databricks automatically tries to acquire the additional
nodes and will update the cluster once available.
To work around this, configure the cluster with a larger instance type and a smaller number of nodes.
Cluster slowdown due to Ganglia metrics filling root
partition
7/21/2022 • 2 minutes to read
NOTE
This article applies to Databricks Runtime 7.3 LTS and below.
Problem
Clusters start slowing down and may show a combination of the following symptoms:
1. Unhealthy cluster events are reported:
Request timed out. Driver is temporarily unavailable.
Metastore is down.
DBFS is down.
2. You do not see any high GC events or memory utilization associated with the driver process.
3. When you run top on the driver node, you see intermittently high load averages.
4. The Ganglia-related gmetad process shows intermittent high CPU utilization.
5. The root disk shows high disk usage with df -h /. Specifically, /var/lib/ganglia/rrds shows high disk usage.
6. The Ganglia UI is unable to show the load distribution.
You can verify the issue by looking for files whose names start with local in /var/lib/ganglia/rrds. Generally, this directory should only contain files prefixed with application-<applicationId>.
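A quick check from a notebook (a sketch using the %sh magic):
%sh
ls /var/lib/ganglia/rrds/ | grep '^local' | head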
Cause
Ganglia metrics typically use less than 10GB of disk space. However, under certain circumstances, a “data
explosion” can occur, which causes the root partition to fill with Ganglia metrics. Data explosions also create a
dirty cache. When this happens, the Ganglia metrics can consume more than 100GB of disk space on root.
This “data explosion” can happen if you define the spark session variable as global in your Python file and then
call functions defined in the same file to perform Apache Spark transformation on data. When this happens, the
Spark session logic can be serialized, along with the required function definition, resulting in a Spark session
being created on the worker node.
For example, take the following Spark session definition:
from pyspark.sql import SparkSession

def get_spark():
    """Returns a spark session."""
    return SparkSession.builder.getOrCreate()

def generator(partition):
    print(globals()['spark'])
    for row in partition:
        yield [word.lower() for word in row["value"]]
Running commands that send the generator() function to the executors while the Spark session is defined as a global variable creates these local prefixed files.
The print(globals()['spark']) statement in the generator() function does not raise an error when the serialized Spark session is available as a global variable on the worker nodes. In other cases it fails with an invalid key error, because that value is not available as a global variable there. Streaming jobs that execute on short batch intervals are especially susceptible to this issue.
Solution
Ensure that you are not using SparkSession.builder.getOrCreate() to define a Spark session as a global variable.
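A minimal sketch of the safer pattern: create the session once on the driver and have executor-side code reference only column data, never the session itself (the input path is a placeholder):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower

# Create the session on the driver only; do not capture it in functions shipped to executors.
spark = SparkSession.builder.getOrCreate()

df = spark.read.text("<input-path>")

# Executor-side work references only columns, not the SparkSession.
result = df.select(lower(col("value")).alias("value"))
result.show()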
When you troubleshoot, you can use the timestamps on files with the local prefix to help determine when a
problematic change was first introduced.
Configure a cluster to use a custom NTP server
7/21/2022 • 2 minutes to read
By default, Azure Databricks clusters use public NTP servers. This is sufficient for most use cases; however, you can configure a cluster to use a custom NTP server, and it does not have to be a public one. It can be a private NTP server under your control. A common use case is to minimize the amount of Internet traffic from your cluster.
# NTP configuration
server <ntp-server-hostname> iburst
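The init script shown below copies ntp.conf from /dbfs/databricks/init_scripts/, so create the file there first. A minimal sketch from a notebook:
dbutils.fs.put("/databricks/init_scripts/ntp.conf", """
# NTP configuration
server <ntp-server-hostname> iburst
""", True)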
dbutils.fs.put("/databricks/init_scripts/ntp.sh","""
#!/bin/bash
echo "<ntp-server-ip> <ntp-server-hostname>" >> /etc/hosts
cp /dbfs/databricks/init_scripts/ntp.conf /etc/
sudo service ntp restart""",True)
display(dbutils.fs.ls("dbfs:/databricks/init_scripts/ntp.sh"))
5. Click Clusters, click your cluster name, click Edit, click Advanced Options, then click Init Scripts.
6. Select DBFS under Destination.
7. Enter the full path to ntp.sh and click Add.
8. Click Confirm and Restart. A confirmation dialog box appears. Click Confirm and wait for the cluster to restart.
After the cluster restarts, verify that the custom NTP server is in use by running the following command in a notebook:
%sh ntpq -p
This article explains how to use SSH to connect to an Apache Spark driver node for advanced troubleshooting
and installing custom software.
IMPORTANT
You can only use SSH if your workspace is deployed in an Azure Virtual Network (VNet) under your control. If your
workspace is NOT VNet injected, the SSH option will not appear.
NOTE
You must provide the path to the directory where you want to save the public and private key. The public key is saved
with the extension .pub.
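For example, a sketch of generating the key pair on your local machine (the file name is illustrative; any path works):
# Generate an RSA key pair; the public key is written to ~/.ssh/databricks-ssh.pub.
ssh-keygen -t rsa -b 4096 -f ~/.ssh/databricks-ssh -C "databricks-cluster-ssh"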
In November 2021, the way environment variables are interpreted when creating, editing, or updating clusters
was changed in some workspaces.
This change will be reverted on December 3, 2021 from 01:00-03:00 UTC.
After the change is reverted, environment variables will behave as they did before the change.
This article explains how to validate the environment variable behavior on your cluster.
NOTE
The “new” input behavior will no longer work after the change has been reverted.