
Contents

SQL Data Warehouse Documentation


Overview
What is SQL Data Warehouse?
SQL Data Warehouse architecture
Data warehouse units
Cheat sheet
Best practices
Capacity limits
FAQ
Release Notes
Recent Release
Gen2 region upgrade schedule
Quickstarts
Create and connect
Portal
PowerShell
Pause and resume
Portal
PowerShell
Scale
Portal
PowerShell
T-SQL
Workload management
Create workload classifier
T-SQL
Concepts
Security
Overview
Advanced data security
Overview
Data discovery & classification
Vulnerability assessment
Advanced Threat Protection
Auditing
Network security
Firewall rules
Private Link
Virtual network service endpoints
Authentication
Overview
Azure Active Directory
Logins and users
Multi-Factor Auth
Access control
Overview
Column-level security
Row-level security
Data protection
Dynamic data masking
Transparent Data Encryption (TDE) overview
TDE with Bring Your Own Key
Data loading
Overview
Best practices
Columnstore compression
Development
Overview
Best practices
JSON
JSON
ISJSON
JSON_VALUE
JSON_QUERY
JSON_MODIFY
OPENJSON
Materialized view
Performance tuning guidance
Create
Alter
Check view overhead
Ordered clustered columnstore index
Performance tuning guidance
Create table
CTAS
Create and change index
Result set caching
Performance tuning guidance
Set for a database
Set for a session
Check cache size
Clean up cache
Tables
Overview
CTAS
Data types
Distributed tables
Table constraints
Indexes
Identity
Partitions
Replicated tables
Statistics
Temporary
T-SQL language elements
Loops
Stored procedures
Transactions
Transactions best practices
User-defined schemas
Variable assignment
Views
Querying
Dynamic SQL
Group by options
Labels
Workload management
Resource classes & workload management
Workload classification
Workload importance
Memory & concurrency limits
Manageability & monitoring
Overview
Scale, pause, resume
Resource utilization & query activity
Data protection
Recommendations
Troubleshoot
Maintenance schedules
Integration
Overview
How-to guides
Secure
Configure Azure AD auth
Conditional Access
Virtual network rules by PowerShell
Enable encryption - portal
Enable encryption - T-SQL
Load data
New York taxi cab data
Contoso public data
Azure Data Lake Storage
Azure Databricks
Data Factory
SSIS
Load WideWorldImporters
Develop
Overview
Connection strings
Change management and deployment
Install Visual Studio SSDT
Source control integration
Continuous integration and deployment
Connect and query
Visual Studio SSDT
SSMS
sqlcmd
Workload management
Analyze your workload
Manage and monitor workload importance
Configure workload importance
Manage and monitor
Upgrade to Gen2
Manage Compute
Monitor your workload - Portal
Monitor your workload - DMVs
Monitor Gen2 cache
Backup and restore
Create restore points
Restore an existing data warehouse
Restore a deleted data warehouse
Restore from a geo-backup
Integrate
Use machine learning
Build data pipelines
Connect with Fivetran
Get started with Striim
Add an Azure Stream Analytics job
Build dashboards and reports
Visualize with Power BI
Troubleshoot
Connectivity
Reference
Database collation types
T-SQL
Full reference
SQL DW language elements
SQL DW statements
System views
PowerShell cmdlets
REST APIs
Resources
Azure Roadmap
Forum
Pricing
Pricing calculator
Feature requests
Service updates
Stack Overflow
Support
Videos
Partners
Business intelligence
Data integration
Data management
What is Azure SQL Data Warehouse?
5/31/2019 • 2 minutes to read

SQL Data Warehouse is a cloud-based Enterprise Data Warehouse (EDW) that uses Massively Parallel
Processing (MPP) to quickly run complex queries across petabytes of data. Use SQL Data Warehouse as a key
component of a big data solution. Import big data into SQL Data Warehouse with simple PolyBase T-SQL
queries, and then use the power of MPP to run high-performance analytics. As you integrate and analyze, the
data warehouse will become the single version of truth your business can count on for insights.

Key component of big data solution


SQL Data Warehouse is a key component of an end-to-end big data solution in the Cloud.

In a cloud data solution, data is ingested into big data stores from a variety of sources. Once in a big data store,
Hadoop, Spark, and machine learning algorithms prepare and train the data. When the data is ready for complex
analysis, SQL Data Warehouse uses PolyBase to query the big data stores. PolyBase uses standard T-SQL
queries to bring the data into SQL Data Warehouse.
SQL Data Warehouse stores data in relational tables with columnar storage. This format significantly reduces
data storage costs and improves query performance. Once data is stored in SQL Data Warehouse, you can
run analytics at massive scale. Compared to traditional database systems, analysis queries finish in seconds
instead of minutes, or in hours instead of days.
The analysis results can go to worldwide reporting databases or applications. Business analysts can then gain
insights to make well-informed business decisions.

Next steps
Explore Azure SQL Data Warehouse architecture
Quickly create a SQL Data Warehouse
Load sample data.
Explore Videos
Or look at some of these other SQL Data Warehouse Resources.
Search Blogs
Submit a feature request
Search Customer Advisory Team blogs
Create a support ticket
Search MSDN forum
Search Stack Overflow forum
Azure SQL Data Warehouse - Massively parallel
processing (MPP) architecture
6/4/2019 • 5 minutes to read

Learn how Azure SQL Data Warehouse combines massively parallel processing (MPP) with Azure storage to
achieve high performance and scalability.

MPP architecture components


SQL Data Warehouse leverages a scale-out architecture to distribute computational processing of data across
multiple nodes. The unit of scale is an abstraction of compute power known as a data warehouse unit. SQL
Data Warehouse separates compute from storage, which enables you to scale compute independently of the data
in your system.

SQL Data Warehouse uses a node-based architecture. Applications connect and issue T-SQL commands to a
Control node, which is the single point of entry for the data warehouse. The Control node runs the MPP engine,
which optimizes queries for parallel processing, and then passes operations to Compute nodes to do their work in
parallel. The Compute nodes store all user data in Azure Storage and run the parallel queries. The Data Movement
Service (DMS) is a system-level internal service that moves data across the nodes as necessary to run queries in
parallel and return accurate results.
With decoupled storage and compute, SQL Data Warehouse can:
Independently size compute power irrespective of your storage needs.
Grow or shrink compute power without moving data.
Pause compute capacity while leaving data intact, so you only pay for storage.
Resume compute capacity during operational hours.
Azure storage
SQL Data Warehouse uses Azure storage to keep your user data safe. Since your data is stored and managed by
Azure storage, SQL Data Warehouse charges separately for your storage consumption. The data itself is sharded
into distributions to optimize the performance of the system. You can choose which sharding pattern to use to
distribute the data when you define the table. SQL Data Warehouse supports these sharding patterns:
Hash
Round Robin
Replicate
Control node
The Control node is the brain of the data warehouse. It is the front end that interacts with all applications and
connections. The MPP engine runs on the Control node to optimize and coordinate parallel queries. When you
submit a T-SQL query to SQL Data Warehouse, the Control node transforms it into queries that run against each
distribution in parallel.
Compute nodes
The Compute nodes provide the computational power. Distributions map to Compute nodes for processing. As
you pay for more compute resources, SQL Data Warehouse re-maps the distributions to the available Compute
nodes. The number of compute nodes ranges from 1 to 60, and is determined by the service level for the data
warehouse.
Each Compute node has a node ID that is visible in system views. You can see the Compute node ID by looking for
the node_id column in system views whose names begin with sys.pdw_nodes. For a list of these system views, see
MPP system views.
Data Movement Service
Data Movement Service (DMS) is the data transport technology that coordinates data movement between the
Compute nodes. Some queries require data movement to ensure the parallel queries return accurate results. When
data movement is required, DMS ensures the right data gets to the right location.

Distributions
A distribution is the basic unit of storage and processing for parallel queries that run on distributed data. When
SQL Data Warehouse runs a query, the work is divided into 60 smaller queries that run in parallel. Each of the 60
smaller queries runs on one of the data distributions. Each Compute node manages one or more of the 60
distributions. A data warehouse with maximum compute resources has one distribution per Compute node. A data
warehouse with minimum compute resources has all the distributions on one compute node.

Hash-distributed tables
A hash-distributed table can deliver the highest query performance for joins and aggregations on large tables.
To shard data into a hash-distributed table, SQL Data Warehouse uses a hash function to deterministically assign
each row to one distribution. In the table definition, one of the columns is designated as the distribution column.
The hash function uses the values in the distribution column to assign each row to a distribution.
The following diagram illustrates how a full (non-distributed) table gets stored as a hash-distributed table.
Each row belongs to one distribution.
A deterministic hash algorithm assigns each row to one distribution.
The number of table rows per distribution varies as shown by the different sizes of tables.
There are performance considerations for the selection of a distribution column, such as distinctness, data skew,
and the types of queries that run on the system.
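As a minimal sketch (the table and column names are illustrative assumptions, not from this documentation), a hash-distributed table names the distribution column in the WITH clause of CREATE TABLE:

CREATE TABLE dbo.FactOrders
(
    order_id    BIGINT NOT NULL,
    customer_id INT NOT NULL,
    order_date  DATE NOT NULL,
    amount      DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(order_id),   -- rows with the same order_id land in the same distribution
    CLUSTERED COLUMNSTORE INDEX      -- the default storage for large fact tables
);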

Round-robin distributed tables


A round-robin table is the simplest table to create and delivers fast performance when used as a staging table for
loads.
A round-robin distributed table distributes rows evenly across all distributions, without any further optimization. A
distribution is first chosen at random, and then buffers of rows are assigned to distributions sequentially. It is quick
to load data into a round-robin table, but query performance can often be better with hash-distributed tables. Joins
on round-robin tables require reshuffling data, which takes additional time.

Replicated Tables
A replicated table provides the fastest query performance for small tables.
A table that is replicated caches a full copy of the table on each Compute node. Consequently, replicating a table
removes the need to transfer data among Compute nodes before a join or aggregation. Replicated tables are best
utilized with small tables. Extra storage is required, and there is additional overhead incurred when writing
data, which makes large tables impractical.
The following diagram shows a replicated table. For SQL Data Warehouse, the replicated table is cached on the
first distribution on each compute node.
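As a minimal sketch (the table and column names are illustrative), a replicated dimension table is defined with DISTRIBUTION = REPLICATE:

CREATE TABLE dbo.DimRegion
(
    region_id   INT NOT NULL,
    region_name NVARCHAR(50) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (region_id)   -- a small rowstore index often suits replicated dimension tables
);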

Next steps
Now that you know a bit about SQL Data Warehouse, learn how to quickly create a SQL Data Warehouse and
load sample data. If you are new to Azure, you may find the Azure glossary helpful as you encounter new
terminology. Or look at some of these other SQL Data Warehouse Resources.
Customer success stories
Blogs
Feature requests
Videos
Customer Advisory Team blogs
Create support ticket
MSDN forum
Stack Overflow forum
Twitter
Data Warehouse Units (DWUs) and compute Data
Warehouse Units (cDWUs)
8/15/2019 • 7 minutes to read

Recommendations on choosing the ideal number of data warehouse units (DWUs, cDWUs) to optimize price and
performance, and how to change the number of units.

What are Data Warehouse Units


Azure SQL Data Warehouse CPU, memory, and IO are bundled into units of compute scale called Data
Warehouse Units (DWUs). A DWU represents an abstract, normalized measure of compute resources and
performance. A change to your service level alters the number of DWUs that are available to the system, which in
turn adjusts the performance, and the cost, of your system.
For higher performance, you can increase the number of data warehouse units. When you need less performance,
reduce data warehouse units. Storage and compute costs are billed separately, so changing data warehouse units does not
affect storage costs.
Performance for data warehouse units is based on these data warehouse workload metrics:
How fast a standard data warehousing query can scan a large number of rows and then perform a complex
aggregation. This operation is I/O and CPU intensive.
How fast the data warehouse can ingest data from Azure Storage Blobs or Azure Data Lake. This operation is
network and CPU intensive.
How fast the CREATE TABLE AS SELECT T-SQL command can copy a table. This operation involves reading data
from storage, distributing it across the nodes of the appliance and writing to storage again. This operation is
CPU, IO, and network intensive.
Increasing DWUs:
Linearly changes performance of the system for scans, aggregations, and CTAS statements
Increases the number of readers and writers for PolyBase load operations
Increases the maximum number of concurrent queries and concurrency slots.

Service Level Objective


The Service Level Objective (SLO) is the scalability setting that determines the cost and performance level of your
data warehouse. The service levels for Gen2 are measured in compute data warehouse units (cDWU), for
example DW2000c. Gen1 service levels are measured in DWUs, for example DW2000.

NOTE
Azure SQL Data Warehouse Gen2 recently added additional scale capabilities to support compute tiers as low as 100
cDWU. Existing data warehouses currently on Gen1 that require the lower compute tiers can now upgrade to Gen2 in the
regions that are currently available for no additional cost. If your region is not yet supported, you can still upgrade to a
supported region. For more information, see Upgrade to Gen2.

In T-SQL, the SERVICE_OBJECTIVE setting determines the service level and the performance tier for your data
warehouse.
--Gen1
CREATE DATABASE myElasticSQLDW
WITH
( SERVICE_OBJECTIVE = 'DW1000'
)
;

--Gen2
CREATE DATABASE myComputeSQLDW
(Edition = 'Datawarehouse'
,SERVICE_OBJECTIVE = 'DW1000c'
)
;

Performance Tiers and Data Warehouse Units


Each performance tier uses a slightly different unit of measure for its data warehouse units. This difference is
reflected on the invoice, as the unit of scale directly translates to billing.
Gen1 data warehouses are measured in Data Warehouse Units (DWUs).
Gen2 data warehouses are measured in compute Data Warehouse Units (cDWUs).
Both DWUs and cDWUs support scaling compute up or down, and pausing compute when you don't need to use
the data warehouse. These operations are all on-demand. Gen2 uses a local disk-based cache on the compute
nodes to improve performance. When you scale or pause the system, the cache is invalidated and so a period of
cache warming is required before optimal performance is achieved.
As you increase data warehouse units, you are linearly increasing computing resources. Gen2 provides the best
query performance and highest scale. Gen2 systems also make the most use of the cache.
Capacity limits
Each SQL server (for example, myserver.database.windows.net) has a Database Transaction Unit (DTU) quota
that allows a specific number of data warehouse units. For more information, see the workload management
capacity limits.

How many data warehouse units do I need


The ideal number of data warehouse units depends very much on your workload and the amount of data you
have loaded into the system.
Steps for finding the best DWU for your workload:
1. Begin by selecting a smaller DWU.
2. Monitor your application performance as you test data loads into the system, observing the number of DWUs
selected compared to the performance you observe.
3. Identify any additional requirements for periods of peak activity. Workloads that show significant
peaks and troughs in activity may need to be scaled frequently.
SQL Data Warehouse is a scale-out system that can provision vast amounts of compute and query sizeable
quantities of data. To see its true capabilities for scaling, especially at larger DWUs, we recommend scaling the
data set as you scale to ensure that you have enough data to feed the CPUs. For scale testing, we recommend
using at least 1 TB.
NOTE
Query performance only increases with more parallelization if the work can be split between compute nodes. If you find
that scaling is not changing your performance, you may need to tune your table design and/or your queries. For query
tuning guidance, see Manage user queries.

Permissions
Changing the data warehouse units requires the permissions described in ALTER DATABASE.
Built-in roles for Azure resources such as SQL DB Contributor and SQL Server Contributor can change DWU
settings.

View current DWU settings


To view the current DWU setting:
1. Open SQL Server Object Explorer in Visual Studio.
2. Connect to the master database associated with the logical SQL Database server.
3. Select from the sys.database_service_objectives dynamic management view. Here is an example:

SELECT db.name [Database]
,      ds.edition [Edition]
,      ds.service_objective [Service Objective]
FROM   sys.database_service_objectives AS ds
JOIN   sys.databases AS db ON ds.database_id = db.database_id
;

Change data warehouse units


Azure portal
To change DWUs or cDWUs:
1. Open the Azure portal, open your database, and click Scale.
2. Under Scale, move the slider left or right to change the DWU setting.
3. Click Save. A confirmation message appears. Click Yes to confirm or No to cancel.
PowerShell

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.

To change the DWUs or cDWUs, use the Set-AzSqlDatabase PowerShell cmdlet. The following example sets the
service level objective to DW1000 for the database MySQLDW that is hosted on server MyServer.

Set-AzSqlDatabase -DatabaseName "MySQLDW" -ServerName "MyServer" -RequestedServiceObjectiveName "DW1000"

For more information, see PowerShell cmdlets for SQL Data Warehouse
T-SQL
With T-SQL you can view the current DWU or cDWU settings, change the settings, and check the progress.
To change the DWUs or cDWUs:
1. Connect to the master database associated with your logical SQL Database server.
2. Use the ALTER DATABASE TSQL statement. The following example sets the service level objective to
DW1000 for the database MySQLDW.

ALTER DATABASE MySQLDW
MODIFY (SERVICE_OBJECTIVE = 'DW1000')
;

REST APIs
To change the DWUs, use the Create or Update Database REST API. The following example sets the service level
objective to DW1000 for the database MySQLDW, which is hosted on server MyServer. The server is in an Azure
resource group named ResourceGroup1.

PUT https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/Microsoft.Sql/servers/{server-name}/databases/{database-name}?api-version=2014-04-01-preview HTTP/1.1
Content-Type: application/json; charset=UTF-8

{
    "properties": {
        "requestedServiceObjectiveName": "DW1000"
    }
}

For more REST API examples, see REST APIs for SQL Data Warehouse.

Check status of DWU changes


DWU changes may take several minutes to complete. If you are scaling automatically, consider implementing
logic to ensure that certain operations have been completed before proceeding with another action.
Checking the database state through various endpoints allows you to correctly implement automation. The portal
provides a notification upon completion of an operation and shows the database's current state, but it does not allow
for programmatic checking of state. You cannot check the database state for scale-out operations with the Azure
portal.
To check the status of DWU changes:
1. Connect to the master database associated with your logical SQL Database server.
2. Submit the following query to check database state.

SELECT *
FROM sys.databases
;

3. Submit the following query to check the status of the operation.


SELECT *
FROM sys.dm_operation_status
WHERE resource_type_desc = 'Database'
AND major_resource_id = 'MySQLDW'
;

This DMV returns information about various management operations on your SQL Data Warehouse such as the
operation and the state of the operation, which is either IN_PROGRESS or COMPLETED.

The scaling workflow


When you start a scale operation, the system first kills all open sessions, rolling back any open transactions to
ensure a consistent state. For scale operations, scaling only occurs after this transactional rollback has completed.
For a scale-up operation, the system detaches all compute nodes, provisions the additional compute nodes,
and then reattaches to the storage layer.
For a scale-down operation, the system detaches all compute nodes and then reattaches only the needed
nodes to the storage layer.

Next steps
To learn more about managing performance, see Resource classes for workload management and Memory and
concurrency limits.
Cheat sheet for Azure SQL Data Warehouse
8/30/2019 • 6 minutes to read

This cheat sheet provides helpful tips and best practices for building your Azure SQL Data Warehouse solutions.
Before you get started, learn more about each step in detail by reading Azure SQL Data Warehouse Workload
Patterns and Anti-Patterns, which explains what SQL Data Warehouse is and what it is not.
The following graphic shows the process of designing a data warehouse:

Queries and operations across tables


When you know in advance the primary operations and queries to be run in your data warehouse, you can
prioritize your data warehouse architecture for those operations. These queries and operations might include:
Joining one or two fact tables with dimension tables, filtering the combined table, and then appending the
results into a data mart.
Making large or small updates to your sales fact table.
Appending only new data to your tables.
Knowing the types of operations in advance helps you optimize the design of your tables.

Data migration
First, load your data into Azure Data Lake Storage or Azure Blob storage. Next, use PolyBase to load your data into
SQL Data Warehouse in a staging table. Use the following configuration:

| Design | Recommendation |
| --- | --- |
| Distribution | Round Robin |
| Indexing | Heap |
| Partitioning | None |
| Resource Class | largerc or xlargerc |
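The following sketch shows one way to apply this configuration with a CTAS load; it assumes a PolyBase external table named ext.FactSales already exists over your staged files, and the schema and table names are illustrative:

CREATE TABLE stage.FactSales
WITH
(
    DISTRIBUTION = ROUND_ROBIN,   -- fastest distribution option for loading
    HEAP                          -- avoid index maintenance during the load
)
AS
SELECT *
FROM ext.FactSales;               -- PolyBase external table over Azure Storage or Data Lake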

Learn more about data migration, data loading, and the Extract, Load, and Transform (ELT) process.

Distributed or replicated tables


Use the following strategies, depending on the table properties:

| Type | Great fit for... | Watch out if... |
| --- | --- | --- |
| Replicated | Small dimension tables in a star schema with less than 2 GB of storage after compression (~5x compression) | Many write transactions are on the table (such as insert, upsert, delete, update) • You change Data Warehouse Units (DWU) provisioning frequently • You only use 2-3 columns but your table has many columns • You index a replicated table |
| Round Robin (default) | Temporary/staging table • No obvious joining key or good candidate column | Performance is slow due to data movement |
| Hash | Fact tables • Large dimension tables | The distribution key cannot be updated |

Tips:
Start with Round Robin, but aspire to a hash distribution strategy to take advantage of a massively parallel
architecture.
Make sure that common hash keys have the same data format.
Don't distribute on a varchar column.
Dimension tables with a common hash key to a fact table with frequent join operations can be hash distributed.
Use sys.dm_pdw_nodes_db_partition_stats to analyze any skewness in the data (see the sketch after these tips).
Use sys.dm_pdw_request_steps to analyze data movements behind queries, and monitor the time that broadcast and
shuffle operations take. This is helpful for reviewing your distribution strategy.
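The following sketch shows two quick checks; the table name and request ID are illustrative placeholders, not values from this documentation:

-- Check row counts and space per distribution for one table to spot skew.
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');

-- Inspect the steps of a recent request to see broadcast and shuffle moves and how long they took.
SELECT step_index, operation_type, total_elapsed_time, row_count
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID12345'   -- take the request_id from sys.dm_pdw_exec_requests
ORDER BY step_index;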
Learn more about replicated tables and distributed tables.

Index your table


Indexing is helpful for reading tables quickly. There is a unique set of technologies that you can use based on your
needs:
| Type | Great fit for... | Watch out if... |
| --- | --- | --- |
| Heap | Staging/temporary table • Small tables with small lookups | Any lookup scans the full table |
| Clustered index | Tables with up to 100 million rows • Large tables (more than 100 million rows) with only 1-2 columns heavily used | Used on a replicated table • You have complex queries involving multiple join and GROUP BY operations • You make updates on the indexed columns: it takes memory |
| Clustered columnstore index (CCI) (default) | Large tables (more than 100 million rows) | Used on a replicated table • You make massive update operations on your table • You overpartition your table: row groups do not span across different distribution nodes and partitions |

Tips:
On top of a clustered index, you might want to add a nonclustered index to a column heavily used for filtering.
Be careful how you manage the memory on a table with CCI. When you load data, you want the user (or the
query) to benefit from a large resource class. Make sure to avoid trimming and creating many small
compressed row groups.
On Gen2, CCI tables are cached locally on the compute nodes to maximize performance.
For CCI, slow performance can happen due to poor compression of your row groups. If this occurs, rebuild or
reorganize your CCI (see the sketch after these tips). You want at least 100,000 rows per compressed row group. The
ideal is 1 million rows in a row group.
Based on the incremental load frequency and size, you want to automate when you reorganize or rebuild your
indexes. Spring cleaning is always helpful.
Be strategic when you want to trim a row group. How large are the open row groups? How much data do you
expect to load in the coming days?
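As a minimal sketch (the table name is illustrative), a CCI can be rebuilt or reorganized with ALTER INDEX:

-- Rebuild the clustered columnstore index to recompress row groups and improve segment quality.
ALTER INDEX ALL ON dbo.FactSales REBUILD;

-- Or reorganize to compress closed delta row groups and merge small compressed row groups without a full rebuild.
ALTER INDEX ALL ON dbo.FactSales REORGANIZE;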
Learn more about indexes.

Partitioning
You might partition your table when you have a large fact table (greater than 1 billion rows). In 99 percent of cases,
the partition key should be based on date.
With staging tables that require ELT, you can also benefit from partitioning because it facilitates data lifecycle
management. In either case, be careful not to overpartition your data, especially on a clustered columnstore index.
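A minimal sketch of a date-partitioned fact table; the table, columns, and monthly boundary values are illustrative assumptions:

CREATE TABLE dbo.FactOrders
(
    order_id   BIGINT NOT NULL,
    order_date DATE NOT NULL,
    amount     DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(order_id),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( order_date RANGE RIGHT FOR VALUES
        ('2019-01-01', '2019-02-01', '2019-03-01') )
);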
Learn more about partitions.

Incremental load
If you're going to incrementally load your data, first make sure that you allocate larger resource classes to loading
your data. This is particularly important when loading into tables with clustered columnstore indexes. See resource
classes for further details.
We recommend using PolyBase and ADF V2 for automating your ELT pipelines into SQL Data Warehouse.
For a large batch of updates in your historical data, consider using a CTAS to write the data you want to keep in a
table rather than using INSERT, UPDATE, and DELETE.
Maintain statistics
Until auto-statistics are generally available, SQL Data Warehouse requires manual maintenance of statistics. It's
important to update statistics as significant changes happen to your data. This helps optimize your query plans. If
you find that it takes too long to maintain all of your statistics, be more selective about which columns have
statistics.
You can also define the frequency of the updates. For example, you might want to update date columns, where new
values might be added, on a daily basis. You gain the most benefit by having statistics on columns involved in joins,
columns used in the WHERE clause, and columns found in GROUP BY.
Learn more about statistics.

Resource class
SQL Data Warehouse uses resource groups as a way to allocate memory to queries. If you need more memory to
improve query or loading speed, you should allocate higher resource classes. On the flip side, using larger resource
classes impacts concurrency. You want to take that into consideration before moving all of your users to a large
resource class.
If you notice that queries take too long, check that your users do not run in large resource classes. Large resource
classes consume many concurrency slots. They can cause other queries to queue up.
Finally, on Gen2 of SQL Data Warehouse, each resource class gets 2.5 times more memory than on Gen1.
Learn more how to work with resource classes and concurrency.

Lower your cost


A key feature of SQL Data Warehouse is the ability to manage compute resources. You can pause the data
warehouse when you're not using it, which stops the billing of compute resources. You can scale resources to meet
your performance demands. To pause, use the Azure portal or PowerShell. To scale, use the Azure portal,
PowerShell, T-SQL, or a REST API.
You can also automate scaling on the schedule you want by using Azure Functions:

Optimize your architecture for performance


We recommend considering SQL Database and Azure Analysis Services in a hub-and-spoke architecture. This
solution can provide workload isolation between different user groups while also using advanced security features
from SQL Database and Azure Analysis Services. This is also a way to provide limitless concurrency to your users.
Learn more about typical architectures that take advantage of SQL Data Warehouse.
Deploy your spokes into SQL databases from SQL Data Warehouse in one click:
Best practices for Azure SQL Data Warehouse
9/22/2019 • 12 minutes to read

This article is a collection of best practices to help you achieve optimal performance from your Azure SQL Data
Warehouse. Some of the concepts in this article are basic and easy to explain; other concepts are more advanced,
and we just scratch the surface here. The purpose of this article is to give you some basic guidance and to
raise awareness of important areas to focus on as you build your data warehouse. Each section introduces you to a
concept and then points you to more detailed articles that cover the concept in more depth.
If you are just getting started with Azure SQL Data Warehouse, do not let this article overwhelm you. The
sequence of the topics is mostly in the order of importance. If you start by focusing on the first few concepts, you'll
be in good shape. As you get more familiar and comfortable with using SQL Data Warehouse, come back and look
at a few more concepts. It won't take long for everything to make sense.
For loading guidance, see Guidance for loading data.

Reduce cost with pause and scale


For more information about reducing costs through pausing and scaling, see Manage compute.

Maintain statistics
Unlike SQL Server, which automatically detects and creates or updates statistics on columns, SQL Data
Warehouse requires manual maintenance of statistics. While we do plan to change this in the future, for now you
will want to maintain your statistics to ensure that the SQL Data Warehouse plans are optimized. The plans
created by the optimizer are only as good as the available statistics. Creating sampled statistics on every
column is an easy way to get started with statistics. It's equally important to update statistics as significant
changes happen to your data. A conservative approach may be to update your statistics daily or after each load.
There are always trade-offs between performance and the cost to create and update statistics. If you find it is
taking too long to maintain all of your statistics, you may want to try to be more selective about which columns
have statistics or which columns need frequent updating. For example, you might want to update date columns,
where new values may be added, daily. You will gain the most benefit by having statistics on columns
involved in joins, columns used in the WHERE clause and columns found in GROUP BY.
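As a hedged sketch (the table and column names are illustrative), statistics on a join column can be created and refreshed like this:

-- Create sampled statistics on a column that is frequently used in joins.
CREATE STATISTICS stat_Orders_customer_id ON dbo.Orders (customer_id);

-- Refresh the statistics after a significant load.
UPDATE STATISTICS dbo.Orders (stat_Orders_customer_id);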
See also Manage table statistics, CREATE STATISTICS, UPDATE STATISTICS

Group INSERT statements into batches


A one-time load to a small table with an INSERT statement or even a periodic reload of a look-up may perform
just fine for your needs with a statement like INSERT INTO MyLookup VALUES (1, 'Type 1') . However, if you need to
load thousands or millions of rows throughout the day, you might find that singleton INSERTS just can't keep up.
Instead, develop your processes so that they write to a file and another process periodically comes along and loads
this file.
See also INSERT

Use PolyBase to load and export data quickly


SQL Data Warehouse supports loading and exporting data through several tools including Azure Data Factory,
PolyBase, and BCP. For small amounts of data where performance isn't critical, any tool may be sufficient for your
needs. However, when you are loading or exporting large volumes of data or fast performance is needed, PolyBase
is the best choice. PolyBase is designed to leverage the MPP (Massively Parallel Processing) architecture of SQL
Data Warehouse and will therefore load and export data orders of magnitude faster than any other tool. PolyBase loads can
be run using CTAS or INSERT INTO. Using CTAS minimizes transaction logging and is the fastest way to
load your data. Azure Data Factory also supports PolyBase loads and can achieve similar performance as CTAS.
PolyBase supports a variety of file formats including Gzip files. To maximize throughput when using gzip text
files, break files up into 60 or more files to maximize parallelism of your load. For faster total throughput,
consider loading data concurrently.
See also Load data, Guide for using PolyBase, Azure SQL Data Warehouse loading patterns and strategies, Load
Data with Azure Data Factory, Move data with Azure Data Factory, CREATE EXTERNAL FILE FORMAT, Create
table as select (CTAS)

Load then query external tables


While PolyBase, also known as external tables, can be the fastest way to load data, it is not optimal for queries. SQL
Data Warehouse PolyBase tables currently only support Azure blob files and Azure Data Lake storage. These files
do not have any compute resources backing them. As a result, SQL Data Warehouse cannot offload this work and
therefore must read the entire file by loading it to tempdb. Therefore, if you have several
queries that will be querying this data, it is better to load this data once and have queries use the local table.
See also Guide for using PolyBase

Hash distribute large tables


By default, tables are Round Robin distributed. This makes it easy for users to get started creating tables without
having to decide how their tables should be distributed. Round Robin tables may perform sufficiently for some
workloads, but in most cases selecting a distribution column will perform much better. The most common example
of when a table distributed by a column will far outperform a Round Robin table is when two large fact tables are
joined. For example, if you have an orders table, which is distributed by order_id, and a transactions table, which is
also distributed by order_id, when you join your orders table to your transactions table on order_id, this query
becomes a pass-through query, which means we eliminate data movement operations. Fewer steps mean a faster
query. Less data movement also makes for faster queries. This explanation only scratches the surface. When
loading a distributed table, be sure that your incoming data is not sorted on the distribution key as this will slow
down your loads. See the below links for much more details on how selecting a distribution column can improve
performance as well as how to define a distributed table in the WITH clause of your CREATE TABLE statement.
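As an illustrative sketch of this pattern (table and column names are assumptions), an existing round-robin table can be redistributed with CTAS and a pair of renames:

CREATE TABLE dbo.Orders_hash
WITH ( DISTRIBUTION = HASH(order_id), CLUSTERED COLUMNSTORE INDEX )
AS
SELECT * FROM dbo.Orders;        -- existing round-robin table

RENAME OBJECT dbo.Orders      TO Orders_old;
RENAME OBJECT dbo.Orders_hash TO Orders;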
See also Table overview, Table distribution, Selecting table distribution, CREATE TABLE, CREATE TABLE AS
SELECT

Do not over-partition
While partitioning data can be very effective for maintaining your data through partition switching or optimizing
scans with partition elimination, having too many partitions can slow down your queries. Often a high-granularity
partitioning strategy that works well on SQL Server may not work well on SQL Data Warehouse.
Having too many partitions can also reduce the effectiveness of clustered columnstore indexes if each partition has
fewer than 1 million rows. Keep in mind that behind the scenes, SQL Data Warehouse partitions your data for you
into 60 databases, so if you create a table with 100 partitions, this actually results in 6000 partitions. Each
workload is different so the best advice is to experiment with partitioning to see what works best for your
workload. Consider lower granularity than what may have worked for you in SQL Server. For example, consider
using weekly or monthly partitions rather than daily partitions.
See also Table partitioning

Minimize transaction sizes


INSERT, UPDATE, and DELETE statements run in a transaction and when they fail they must be rolled back. To
minimize the potential for a long rollback, minimize transaction sizes whenever possible. This can be done by
dividing INSERT, UPDATE, and DELETE statements into parts. For example, if you have an INSERT which you
expect to take 1 hour, if possible, break the INSERT up into 4 parts, which will each run in 15 minutes. Leverage
special Minimal Logging cases, like CTAS, TRUNCATE, DROP TABLE or INSERT to empty tables, to reduce
rollback risk. Another way to eliminate rollbacks is to use Metadata Only operations like partition switching for
data management. For example, rather than execute a DELETE statement to delete all rows in a table where the
order_date was in October of 2001, you could partition your data monthly and then switch out the partition with
data for an empty partition from another table (see ALTER TABLE examples). For unpartitioned tables consider
using a CTAS to write the data you want to keep in a table rather than using DELETE. If a CTAS takes the same
amount of time, it is a much safer operation to run as it has very minimal transaction logging and can be canceled
quickly if needed.
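For illustration only, and assuming a table dbo.FactOrders_out exists with an identical definition and an empty matching partition, switching out a month of data is a metadata-only operation:

-- Switch the October 2001 partition out of the fact table instead of running a large DELETE.
ALTER TABLE dbo.FactOrders SWITCH PARTITION 10 TO dbo.FactOrders_out PARTITION 10;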
See also Understanding transactions, Optimizing transactions, Table partitioning, TRUNCATE TABLE, ALTER
TABLE, Create table as select (CTAS)

Reduce query result sizes


This helps you avoid client-side issues caused by large query results. You can edit your query to reduce the number
of rows returned. Some query generation tools allow you to add “top N” syntax to each query. You can also CETAS
the query result to an external table and then use PolyBase export for the downlevel processing.
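A hedged sketch of the CETAS approach; the external data source, file format, schema, and table names are assumptions that must already exist in your environment:

CREATE EXTERNAL TABLE ext.SalesSummary
WITH
(
    LOCATION = '/output/sales_summary/',
    DATA_SOURCE = MyAzureStorage,     -- existing external data source (assumption)
    FILE_FORMAT = MyParquetFormat     -- existing external file format (assumption)
)
AS
SELECT customer_id, SUM(amount) AS total_amount
FROM dbo.FactOrders
GROUP BY customer_id;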

Use the smallest possible column size


When defining your DDL, using the smallest data type that will support your data will improve query
performance. This is especially important for CHAR and VARCHAR columns. If the longest value in a column is 25
characters, then define your column as VARCHAR(25). Avoid defining all character columns with a large default
length. In addition, define columns as VARCHAR when that is all that is needed, rather than using NVARCHAR.
See also Table overview, Table data types, CREATE TABLE

Use temporary heap tables for transient data


When you are temporarily landing data on SQL Data Warehouse, you may find that using a heap table will make
the overall process faster. If you are loading data only to stage it before running more transformations, loading the
table to heap table will be much faster than loading the data to a clustered columnstore table. In addition, loading
data to a temp table will also load much faster than loading a table to permanent storage. Temporary tables start
with a "#" and are only accessible by the session which created it, so they may only work in limited scenarios. Heap
tables are defined in the WITH clause of a CREATE TABLE. If you do use a temporary table, remember to create
statistics on that temporary table too.
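A minimal sketch, assuming a PolyBase external table named ext.Orders already exists and the column name is illustrative:

CREATE TABLE #StageOrders
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP )
AS
SELECT * FROM ext.Orders;

-- Remember to create statistics on the temporary table as well.
CREATE STATISTICS stat_StageOrders_order_id ON #StageOrders (order_id);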
See also Temporary tables, CREATE TABLE, CREATE TABLE AS SELECT

Optimize clustered columnstore tables


Clustered columnstore indexes are one of the most efficient ways you can store your data in SQL Data Warehouse.
By default, tables in SQL Data Warehouse are created as Clustered ColumnStore. To get the best performance for
queries on columnstore tables, having good segment quality is important. When rows are written to columnstore
tables under memory pressure, columnstore segment quality may suffer. Segment quality can be measured by
number of rows in a compressed Row Group. See the Causes of poor columnstore index quality in the Table
indexes article for step by step instructions on detecting and improving segment quality for clustered columnstore
tables. Because high-quality columnstore segments are important, it's a good idea to use user IDs that are in
the medium or large resource class for loading data. Using lower data warehouse units means you want to assign
a larger resource class to your loading user.
Since columnstore tables generally won't push data into a compressed columnstore segment until there are more
than 1 million rows per table, and each SQL Data Warehouse table is divided into 60 distributions, as a rule of thumb,
columnstore tables won't benefit a query unless the table has more than 60 million rows. For tables with fewer than
60 million rows, it may not make any sense to have a columnstore index. It also may not hurt. Furthermore, if you
partition your data, then you will want to consider that each partition will need to have 1 million rows to benefit
from a clustered columnstore index. If a table has 100 partitions, then it will need to have at least 6 billion rows to
benefit from a clustered columns store (60 distributions * 100 partitions * 1 million rows). If your table does not
have 6 billion rows in this example, either reduce the number of partitions or consider using a heap table instead. It
also may be worth experimenting to see if better performance can be gained with a heap table with secondary
indexes rather than a columnstore table.
When querying a columnstore table, queries will run faster if you select only the columns you need.
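One hedged way to get a feel for segment quality is to look at row-group sizes in the sys.pdw_nodes_column_store_row_groups catalog view; this simple sketch aggregates across all columnstore tables:

-- Many small COMPRESSED row groups (far below ~1 million rows) suggest poor segment quality.
SELECT state_description,
       COUNT(*)        AS row_group_count,
       SUM(total_rows) AS total_rows,
       AVG(total_rows) AS avg_rows_per_row_group
FROM sys.pdw_nodes_column_store_row_groups
GROUP BY state_description;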
See also Table indexes, Columnstore indexes guide, Rebuilding columnstore indexes

Use larger resource class to improve query performance


SQL Data Warehouse uses resource groups as a way to allocate memory to queries. Out of the box, all users are
assigned to the small resource class which grants 100 MB of memory per distribution. Since there are always 60
distributions and each distribution is given a minimum of 100 MB, system wide the total memory allocation is
6,000 MB, or just under 6 GB. Certain queries, like large joins or loads to clustered columnstore tables, will benefit
from larger memory allocations. Some queries, like pure scans, will see no benefit. On the flip side, utilizing larger
resource classes impacts concurrency, so you will want to take this into consideration before moving all of your
users to a large resource class.
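As a hedged sketch (the user name is illustrative), a loading user can be moved into a larger static resource class with the standard role procedures:

-- Add the loading user to the largerc resource class before a heavy load...
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- ...and remove it afterwards to free up concurrency slots.
EXEC sp_droprolemember 'largerc', 'LoadUser';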
See also Resource classes for workload management

Use Smaller Resource Class to Increase Concurrency


If you are noticing that user queries seem to have a long delay, it could be that your users are running in larger
resource classes and are consuming many concurrency slots, causing other queries to queue up. To see whether users'
queries are queued, run SELECT * FROM sys.dm_pdw_waits to see if any rows are returned.
See also Resource classes for workload management, sys.dm_pdw_waits

Use DMVs to monitor and optimize your queries


SQL Data Warehouse has several DMVs which can be used to monitor query execution. The monitoring article
below walks through step-by-step instructions on how to look at the details of an executing query. To quickly find
queries in these DMVs, using the LABEL option with your queries can help.
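For example (the label text and table name are illustrative), a query can be tagged with a label and then located in sys.dm_pdw_exec_requests:

SELECT COUNT(*) FROM dbo.FactOrders
OPTION ( LABEL = 'Nightly load validation' );

SELECT request_id, status, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Nightly load validation';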
See also Monitor your workload using DMVs, LABEL, OPTION, sys.dm_exec_sessions, sys.dm_pdw_exec_requests,
sys.dm_pdw_request_steps, sys.dm_pdw_sql_requests, sys.dm_pdw_dms_workers, DBCC
PDW_SHOWEXECUTIONPLAN, sys.dm_pdw_waits

Other resources
Also see our Troubleshooting article for common issues and solutions.
If you didn't find what you were looking for in this article, try using the "Search for docs" on the left side of this
page to search all of the Azure SQL Data Warehouse documents. The Azure SQL Data Warehouse Forum is a
place for you to ask questions to other users and to the SQL Data Warehouse Product Group. We actively monitor
this forum to ensure that your questions are answered either by another user or one of us. If you prefer to ask your
questions on Stack Overflow, we also have an Azure SQL Data Warehouse Stack Overflow Forum.
Finally, please do use the Azure SQL Data Warehouse Feedback page to make feature requests. Adding your
requests or up-voting other requests really helps us prioritize features.
SQL Data Warehouse capacity limits
9/30/2019 • 5 minutes to read

Maximum values allowed for various components of Azure SQL Data Warehouse.

Workload management
| Category | Description | Maximum |
| --- | --- | --- |
| Data Warehouse Units (DWU) | Max DWU for a single SQL Data Warehouse | Gen1: DW6000. Gen2: DW30000c |
| Data Warehouse Units (DWU) | Default DTU per server. By default, each SQL server (for example, myserver.database.windows.net) has a DTU quota of 54,000, which allows up to 9 DW6000c. This quota is simply a safety limit. You can increase your quota by creating a support ticket and selecting Quota as the request type. To calculate your DTU needs, multiply 7.5 by the total DWU needed, or multiply 9.0 by the total cDWU needed. For example: DW6000 x 7.5 = 45,000 DTUs; DW6000c x 9.0 = 54,000 DTUs. You can view your current DTU consumption from the SQL server option in the portal. Both paused and unpaused databases count toward the DTU quota. | 54,000 |
| Database connection | Maximum concurrent open sessions. The number of concurrent open sessions varies based on the selected DWU. DWU600c and above support a maximum of 1024 open sessions. DWU500c and below support a maximum of 512. Note that there are limits on the number of queries that can execute concurrently. When the concurrency limit is exceeded, the request goes into an internal queue where it waits to be processed. | 1024 |
| Database connection | Maximum memory for prepared statements | 20 MB |
| Workload management | Maximum concurrent queries. SQL Data Warehouse can execute a maximum of 128 concurrent queries and queues remaining queries. The number of concurrent queries can decrease when users are assigned to higher resource classes or when SQL Data Warehouse has a lower data warehouse unit setting. Some queries, like DMV queries, are always allowed to run and do not impact the concurrent query limit. For more details on concurrent query execution, see the concurrency maximums article. | 128 |
| tempdb | Maximum GB | 399 GB per DW100. At DWU1000, tempdb is therefore sized to 3.99 TB |

Database objects
| Category | Description | Maximum |
| --- | --- | --- |
| Database | Max size | Gen1: 240 TB compressed on disk. This space is independent of tempdb or log space, and is therefore dedicated to permanent tables. Clustered columnstore compression is estimated at 5X, which allows the database to grow to approximately 1 PB when all tables are clustered columnstore (the default table type). Gen2: 240 TB for rowstore and unlimited storage for columnstore tables |
| Table | Max size | 60 TB compressed on disk |
| Table | Tables per database | 100,000 |
| Table | Columns per table | 1024 columns |
| Table | Bytes per column | Dependent on column data type. The limit is 8000 for char data types, 4000 for nvarchar, or 2 GB for MAX data types |
| Table | Bytes per row, defined size. The number of bytes per row is calculated in the same manner as it is for SQL Server with page compression. Like SQL Server, SQL Data Warehouse supports row-overflow storage, which enables variable-length columns to be pushed off-row. When variable-length rows are pushed off-row, only the 24-byte root is stored in the main record. For more information, see Row-Overflow Data Exceeding 8 KB. | 8060 bytes |
| Table | Partitions per table. For high performance, we recommend minimizing the number of partitions you need while still supporting your business requirements. As the number of partitions grows, the overhead for Data Definition Language (DDL) and Data Manipulation Language (DML) operations grows and causes slower performance. | 15,000 |
| Table | Characters per partition boundary value | 4000 |
| Index | Non-clustered indexes per table. Applies to rowstore tables only. | 50 |
| Index | Clustered indexes per table. Applies to both rowstore and columnstore tables. | 1 |
| Index | Index key size. Applies to rowstore indexes only. Indexes on varchar columns with a maximum size of more than 900 bytes can be created if the existing data in the columns does not exceed 900 bytes when the index is created. However, later INSERT or UPDATE actions on the columns that cause the total size to exceed 900 bytes will fail. | 900 bytes |
| Index | Key columns per index. Applies to rowstore indexes only. Clustered columnstore indexes include all columns. | 16 |
| Statistics | Size of the combined column values | 900 bytes |
| Statistics | Columns per statistics object | 32 |
| Statistics | Statistics created on columns per table | 30,000 |
| Stored Procedures | Maximum levels of nesting | 8 |
| View | Columns per view | 1,024 |

Loads
| Category | Description | Maximum |
| --- | --- | --- |
| PolyBase loads | MB per row. PolyBase loads rows that are smaller than 1 MB. Loading LOB data types into tables with a Clustered Columnstore Index (CCI) is not supported. | 1 |

Queries
| Category | Description | Maximum |
| --- | --- | --- |
| Query | Queued queries on user tables | 1000 |
| Query | Concurrent queries on system views | 100 |
| Query | Queued queries on system views | 1000 |
| Query | Maximum parameters | 2098 |
| Batch | Maximum size | 65,536*4096 |
| SELECT results | Columns per row. You can never have more than 4096 columns per row in the SELECT result. There is no guarantee that you can always have 4096. If the query plan requires a temporary table, the 1024-columns-per-table maximum might apply. | 4096 |
| SELECT | Nested subqueries. You can never have more than 32 nested subqueries in a SELECT statement. There is no guarantee that you can always have 32. For example, a JOIN can introduce a subquery into the query plan. The number of subqueries can also be limited by available memory. | 32 |
| SELECT | Columns per JOIN. You can never have more than 1024 columns in the JOIN. There is no guarantee that you can always have 1024. If the JOIN plan requires a temporary table with more columns than the JOIN result, the 1024 limit applies to the temporary table. | 1024 columns |
| SELECT | Bytes per GROUP BY columns. The columns in the GROUP BY clause can have a maximum of 8060 bytes. | 8060 bytes |
| SELECT | Bytes per ORDER BY columns. The columns in the ORDER BY clause can have a maximum of 8060 bytes. | 8060 bytes |
| Identifiers per statement | Number of referenced identifiers. SQL Data Warehouse limits the number of identifiers that can be contained in a single expression of a query. Exceeding this number results in SQL Server error 8632. For more information, see Internal error: An expression services limit has been reached. | 65,535 |
| String literals | Number of string literals in a statement. SQL Data Warehouse limits the number of string constants in a single expression of a query. Exceeding this number results in SQL Server error 8632. | 20,000 |

Metadata

| System view | Maximum rows |
| --- | --- |
| sys.dm_pdw_component_health_alerts | 10,000 |
| sys.dm_pdw_dms_cores | 100 |
| sys.dm_pdw_dms_workers | Total number of DMS workers for the most recent 1000 SQL requests |
| sys.dm_pdw_errors | 10,000 |
| sys.dm_pdw_exec_requests | 10,000 |
| sys.dm_pdw_exec_sessions | 10,000 |
| sys.dm_pdw_request_steps | Total number of steps for the most recent 1000 SQL requests that are stored in sys.dm_pdw_exec_requests |
| sys.dm_pdw_os_event_logs | 10,000 |
| sys.dm_pdw_sql_requests | The most recent 1000 SQL requests that are stored in sys.dm_pdw_exec_requests |

Next steps
For recommendations on using SQL Data Warehouse, see the Cheat Sheet.
SQL Data Warehouse Frequently asked questions
6/4/2019 • 2 minutes to read

General
Q. What does SQL DW offer for data security?
A. SQL DW offers several solutions for protecting data such as TDE and auditing. For more information, see
Security.
Q. Where can I find out what legal or business standards SQL DW is compliant with?
A. Visit the Microsoft Compliance page for various compliance offerings by product, such as SOC and ISO. First
choose by compliance title, then expand Azure in the Microsoft in-scope cloud services section on the right side of
the page to see which Azure services are compliant.
Q. Can I connect Power BI?
A. Yes! Though Power BI supports direct query with SQL DW, it's not intended for a large number of users or real-
time data. For production use of Power BI, we recommend using Power BI on top of Azure Analysis Services or
Analysis Services IaaS.
Q. What are SQL Data Warehouse Capacity Limits?
A. See our current capacity limits page.
Q. Why is my Scale/Pause/Resume taking so long?
A. A variety of factors can influence the time for compute management operations. A common case for long
running operations is transactional rollback. When a scale or pause operation is initiated, all incoming sessions are
blocked and queries are drained. In order to leave the system in a stable state, transactions must be rolled back
before an operation can commence. The greater the number and larger the log size of transactions, the longer the
operation will be stalled restoring the system to a stable state.

User support
Q. I have a feature request, where do I submit it?
A. If you have a feature request, submit it on our UserVoice page
Q. How can I do x?
A. For help in developing with SQL Data Warehouse, you can ask questions on our Stack Overflow page.
Q. How do I submit a support ticket?
A. Support Tickets can be filed through Azure portal.

SQL language/feature support


Q. What datatypes does SQL Data Warehouse support?
A. See SQL Data Warehouse data types.
Q. What table features do you support?
A. While SQL Data Warehouse supports many features, some are not supported and are documented in
Unsupported Table Features.

Tooling and administration


Q. Do you support Database projects in Visual Studio?
A. We currently do not support Database projects in Visual Studio for SQL Data Warehouse. If you'd like to cast a
vote to get this feature, visit our User Voice Database projects feature request.
Q. Does SQL Data Warehouse support REST APIs?
A. Yes. Most REST functionality that can be used with SQL Database is also available with SQL Data Warehouse.
You can find API information within REST documentation pages or MSDN.

Loading
Q. What client drivers do you support?
A. Driver support for DW can be found on the Connection Strings page
Q: What file formats are supported by PolyBase with SQL Data Warehouse?
A: Orc, RC, Parquet, and flat delimited text
Q: What can I connect to from SQL DW using PolyBase?
A: Azure Data Lake Store and Azure Storage Blobs
Q: Is computation pushdown possible when connecting to Azure Storage Blobs or ADLS?
A: No, SQL DW PolyBase only interacts with the storage components.
Q: Can I connect to HDI?
A: HDI can use either ADLS or WASB as the HDFS layer. If you have either as your HDFS layer, then you can load
that data into SQL DW. However, you cannot generate pushdown computation to the HDI instance.

Next steps
For more information on SQL Data Warehouse as a whole, see our Overview page.
Azure SQL Data Warehouse release notes
9/25/2019 • 16 minutes to read

This article summarizes the new features and improvements in the recent releases of Azure SQL Data Warehouse
(Azure SQL DW). The article also lists notable content updates that aren't directly related to the release but were
published in the same time frame. For improvements to other Azure services, see Service updates.

Check your Azure SQL Data Warehouse version


As new features are rolled out to all regions, check the version deployed to your instance and the latest Azure SQL
DW release notes for feature availability. To check your Azure SQL DW version, connect to your data warehouse
via SQL Server Management Studio (SSMS ) and run SELECT @@VERSION; to return the current version of Azure
SQL DW.
Use the date in the version output to confirm which release has been applied to your Azure SQL DW.

September 2019
SERVICE IMPROVEMENTS DETAILS

Azure Private Link (Preview) With Azure Private Link, you can create a private endpoint in
your Virtual Network (VNet) and map it to your Azure SQL
DW. These resources are then accessible over a private IP
address in your VNet, enabling connectivity from on-premises
through Azure ExpressRoute private peering and/or VPN
gateway. Overall, this simplifies the network configuration by
not requiring you to open it up to public IP addresses. This
also enables protection against data exfiltration risks. For more
details, see overview and SQL DW documentation.

Data Discovery & Classification (GA) Data discovery and classification feature is now Generally
Available. This feature provides advanced capabilities for
discovering, classifying, labeling & protecting sensitive
data in your databases.

Azure Advisor one-click Integration SQL Data Warehouse now directly integrates with Azure
Advisor recommendations in the overview blade along with
providing a one-click experience. You can now discover
recommendations in the overview blade instead of navigating
to the Azure advisor blade. Find out more about
recommendations here.
SERVICE IMPROVEMENTS DETAILS

Read Committed Snapshot Isolation (Preview) You can use ALTER DATABASE to enable or disable snapshot
isolation for a user database. To avoid impact to your current workload, you may want to set this option during a
database maintenance window or wait until there is no other active connection to the database. For more
information, see Alter database set options.
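A minimal sketch of toggling the option, assuming the standard READ_COMMITTED_SNAPSHOT option name from the linked article and a database named mySampleDataWarehouse:

ALTER DATABASE mySampleDataWarehouse SET READ_COMMITTED_SNAPSHOT ON;
-- Disable it again the same way:
-- ALTER DATABASE mySampleDataWarehouse SET READ_COMMITTED_SNAPSHOT OFF;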

EXECUTE AS (Transact-SQL) EXECUTE AS T-SQL support is now available in SQL Data
Warehouse, enabling customers to set the execution context of
a session to the specified user.
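A brief sketch of switching and reverting the execution context; the user name ReportReader is hypothetical:

EXECUTE AS USER = 'ReportReader';  -- subsequent statements run as this database user
SELECT USER_NAME();                -- confirms the current execution context
REVERT;                            -- returns to the original context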

Additional T-SQL support The T-SQL language surface area for SQL Data Warehouse has
been extended to include support for:
- FORMAT (Transact-SQL)
- TRY_PARSE (Transact-SQL)
- TRY_CAST (Transact-SQL)
- TRY_CONVERT (Transact-SQL)
- sys.user_token (Transact-SQL)
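A few illustrative calls (the literal values are arbitrary); each TRY_ function returns NULL instead of raising an error when the conversion fails:

SELECT
    FORMAT(GETDATE(), 'yyyy-MM-dd')  AS formatted_date,
    TRY_CAST('123' AS INT)           AS successful_cast,   -- 123
    TRY_CAST('abc' AS INT)           AS failed_cast,       -- NULL, no error raised
    TRY_CONVERT(DATE, '2019-09-25')  AS converted_date,
    TRY_PARSE('2019-09-25' AS DATE)  AS parsed_date;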

July 2019
SERVICE IMPROVEMENTS DETAILS

Materialized View (Preview) A Materialized View persists the data returned from the view
definition query and automatically gets updated as data
changes in the underlying tables. It improves the performance
of complex queries (typically queries with joins and
aggregations) while offering simple maintenance operations.
For more information, see:
- CREATE MATERIALIZED VIEW AS SELECT (Transact-SQL)
- ALTER MATERIALIZED VIEW (Transact-SQL)
- T-SQL statements supported in Azure SQL Data Warehouse
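A minimal sketch of the pattern, using a hypothetical fact table dbo.FactSale; see the linked CREATE MATERIALIZED VIEW article for the full requirements on the SELECT list and distribution options:

CREATE MATERIALIZED VIEW dbo.SalesByRegion
WITH (DISTRIBUTION = HASH(RegionKey))   -- a distribution option is specified for the view
AS
SELECT RegionKey,
       COUNT_BIG(*)            AS SaleCount,
       SUM(ISNULL(Amount, 0))  AS TotalAmount
FROM dbo.FactSale
GROUP BY RegionKey;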

Additional T-SQL support The T-SQL language surface area for SQL Data Warehouse has
been extended to include support for:
- AT TIME ZONE (Transact-SQL)
- STRING_AGG (Transact-SQL)

Result set caching (Preview) DBCC commands added to manage the previously announced
result set cache. For more information, see:
- DBCC DROPRESULTSETCACHE (Transact-SQL)
- DBCC SHOWRESULTCACHESPACEUSED (Transact-SQL)
Also see the new result_set_cache column in
sys.dm_pdw_exec_requests that shows when an executed
query used the result set cache.
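For example, the cache can be inspected and cleared with the commands named above, and the new column in sys.dm_pdw_exec_requests shows cache hits (the DBCC and column names are taken from the linked articles):

DBCC SHOWRESULTCACHESPACEUSED;   -- space consumed by cached result sets
DBCC DROPRESULTSETCACHE;         -- removes cached result sets from the current database

SELECT request_id, command, result_set_cache
FROM sys.dm_pdw_exec_requests;   -- result_set_cache indicates whether a query was served from the cache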

Ordered clustered columnstore index (Preview) New column, column_store_order_ordinal, added to
sys.index_columns to identify the order of columns in an
ordered clustered columnstore index.

May 2019
SERVICE IMPROVEMENTS DETAILS

Dynamic data masking (Preview) Dynamic Data Masking (DDM) prevents unauthorized access
to your sensitive data in your data warehouse by obfuscating
it on-the-fly in the query results, based on the masking rules
you define. For more information, see SQL Database dynamic
data masking.
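A short sketch of defining a mask, using the AdventureWorksDW-style dbo.DimCustomer table and its EmailAddress column as an example; the ReportReader user is hypothetical:

ALTER TABLE dbo.DimCustomer
ALTER COLUMN EmailAddress ADD MASKED WITH (FUNCTION = 'email()');

-- Users without the UNMASK permission see obfuscated values; grant it where needed:
-- GRANT UNMASK TO ReportReader;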

Workload importance now Generally Available Workload Management Classification and Importance provide
the ability to influence the run order of queries. For more
information on workload importance, see the Classification
and Importance overview articles in the documentation. Check
out the CREATE WORKLOAD CLASSIFIER doc as well.

See workload importance in action in the videos below:
- Workload Management concepts
- Workload Management scenarios

Additional T-SQL support The T-SQL language surface area for SQL Data Warehouse has
been extended to include support for:
- TRIM

JSON functions Business analysts can now use familiar T-SQL language to
query and manipulate documents that are formatted as JSON
data using the following new JSON functions in Azure SQL Data
Warehouse:
- ISJSON
- JSON_VALUE
- JSON_QUERY
- JSON_MODIFY
- OPENJSON
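A small illustration using an inline JSON literal; the document shape is arbitrary:

DECLARE @json NVARCHAR(MAX) = N'{"customer":{"name":"Adams","children":3}}';

SELECT
    ISJSON(@json)                        AS is_valid_json,   -- 1
    JSON_VALUE(@json, '$.customer.name') AS customer_name,   -- Adams
    JSON_QUERY(@json, '$.customer')      AS customer_object; -- the nested JSON object

-- Shred the nested object into key/value rows:
SELECT [key], [value] FROM OPENJSON(@json, '$.customer');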

Result set caching (Preview) Result-set caching enables instant query response times while
reducing time-to-insight for business analysts and reporting
users. For more information, see:
- ALTER DATABASE (Transact-SQL)
- ALTER DATABASE SET Options (Transact SQL)
- SET RESULT SET CACHING (Transact-SQL)
- SET Statement (Transact-SQL)
- sys.databases (Transact-SQL)
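A minimal sketch of turning the feature on, assuming the RESULT_SET_CACHING option name used in the linked ALTER DATABASE and SET articles and the mySampleDataWarehouse database:

-- Enable for the whole database (run while connected to master):
ALTER DATABASE mySampleDataWarehouse SET RESULT_SET_CACHING ON;

-- Or toggle it for the current session:
SET RESULT_SET_CACHING ON;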

Ordered clustered columnstore index (Preview) Columnstore is a key enabler for storing and efficiently
querying large amounts of data. For each table, it divides the
incoming data into Row Groups and each column of a Row
Group forms a Segment on a disk. Ordered clustered
columnstore indexes further optimize query execution by
enabling efficient segment elimination. For more information,
see:
- CREATE TABLE (Azure SQL Data Warehouse)
- CREATE COLUMNSTORE INDEX (Transact-SQL).
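A sketch of creating such a table, with hypothetical table and column names; the ORDER clause names the columns that drive segment elimination:

CREATE TABLE dbo.FactSaleOrdered
(
    SaleDateKey INT NOT NULL,
    RegionKey   INT NOT NULL,
    Amount      DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(RegionKey),
    CLUSTERED COLUMNSTORE INDEX ORDER (SaleDateKey)
);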

March 2019
SERVICE IMPROVEMENTS DETAILS

Data Discovery & Classification Data Discovery & Classification is now available in public
preview for Azure SQL Data Warehouse. It’s critical to protect
sensitive data and the privacy of your customers. As your
business and customer data assets grow, it becomes
unmanageable to discover, classify, and protect your data. The
data discovery and classification feature that we’re introducing
natively with Azure SQL Data Warehouse helps make
protecting your data more manageable. The overall benefits of
this capability are:
• Meeting data privacy standards and regulatory compliance
requirements.
• Restricting access to and hardening the security of data
warehouses containing highly sensitive data.
• Monitoring and alerting on anomalous access to sensitive
data.
• Visualization of sensitive data in a central dashboard on the
Azure portal.

Data Discovery & Classification is available for Azure SQL Data
Warehouse in all Azure regions. It's part of Advanced Data
Security, including Vulnerability Assessment and Threat
Detection. For more information about Data Discovery &
Classification, see the blog post and our online
documentation.

GROUP BY ROLLUP ROLLUP is now a supported GROUP BY option in Azure SQL Data
Warehouse. GROUP BY ROLLUP creates a group for each
combination of column expressions. GROUP BY also "rolls up"
the results into subtotals and grand totals. The GROUP BY
function processes from right to left, decreasing the number of
column expressions over which it creates groups and
aggregation(s). The column order affects the ROLLUP output
and can affect the number of rows in the result set.

For more information on GROUP BY ROLLUP, see GROUP BY
(Transact-SQL)
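For example, against a hypothetical dbo.FactSale table, the following query returns per-region, per-year subtotals, per-region totals, and a grand total in one result set:

SELECT RegionKey, SaleYear, SUM(Amount) AS TotalAmount
FROM dbo.FactSale
GROUP BY ROLLUP (RegionKey, SaleYear);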

Improved accuracy for DWU used and CPU portal SQL Data Warehouse significantly enhances metric accuracy in
metrics the Azure portal. This release includes a fix to the CPU and
DWU Used metric definition to properly reflect your workload
across all compute nodes. Before this fix, metric values were
being underreported. Expect to see an increase in the DWU
used and CPU metrics in the Azure portal.

Row Level Security We introduced Row-level Security capability back in Nov 2017.
We’ve now extended this support to external tables as well.
Additionally, we’ve added support for calling non-deterministic
functions in the inline table-valued functions (inline TVFs)
required for defining a security filter predicate. This addition
allows you to specify IS_ROLEMEMBER(), USER_NAME() etc. in
the security filter predicate. For more information, please see
the examples in the Row-level Security documentation.
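A condensed sketch of the pattern with hypothetical object names; note the inline TVF can now call non-deterministic functions such as USER_NAME() and IS_ROLEMEMBER():

CREATE FUNCTION dbo.fn_SalesFilter (@SalesRep AS NVARCHAR(128))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_result
       WHERE @SalesRep = USER_NAME()             -- row owner sees the row
          OR IS_ROLEMEMBER('SalesManagers') = 1; -- managers see all rows
GO

CREATE SECURITY POLICY dbo.SalesFilterPolicy
ADD FILTER PREDICATE dbo.fn_SalesFilter(SalesRepName) ON dbo.FactSale
WITH (STATE = ON);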

Additional T-SQL Support The T-SQL language surface area for SQL Data Warehouse has
been extended to include support for STRING_SPLIT (Transact-
SQL).

Query Optimizer enhancements Query optimization is a critical component of any database.
Making optimal choices on how to best execute a query can
yield significant improvements. When executing complex
analytical queries in a distributed environment, the number of
operations executed matters. Query performance has been
enhanced by producing better quality plans. These plans
minimize expensive data transfer operations and redundant
computations such as repeated subqueries. For more
information, see this Azure SQL Data Warehouse blog post.

Documentation improvements
DOCUMENTATION IMPROVEMENTS DETAILS

January 2019
Service improvements
SERVICE IMPROVEMENTS DETAILS

Return Order By Optimization SELECT…ORDER BY queries get a performance boost in this
release. Now, all compute nodes send their results to a single
compute node. This node merges and sorts the results and
returns them to the user. Merging through a single compute
node results in a significant performance gain when the query
result set contains a large number of rows. Previously, the
query execution engine would order results on each compute
node. The results would then be streamed to the control
node. The control node would then merge the results.

Data Movement Enhancements for PartitionMove and BroadcastMove In Azure SQL Data Warehouse Gen2, data movement steps of
type ShuffleMove use instant data movement techniques. For
more information, see the performance enhancements blog. With
this release, PartitionMove and BroadcastMove are now
powered by the same instant data movement techniques.
User queries that use these types of data movement steps will
run with improved performance. No code change is required
to take advantage of these performance improvements.

Notable Bugs Incorrect Azure SQL Data Warehouse version -
SELECT @@VERSION may return the incorrect version,
10.0.9999.0. The correct version for the current release is
10.0.10106.0. This bug has been reported and is under review.

Documentation improvements
DOCUMENTATION IMPROVEMENTS DETAILS

none
December 2018
Service improvements
SERVICE IMPROVEMENTS DETAILS

Virtual Network Service Endpoints Generally Available This release includes general availability of Virtual Network
(VNet) Service Endpoints for Azure SQL Data Warehouse in all
Azure regions. VNet Service Endpoints enable you to isolate
connectivity to your logical server from a given subnet or set
of subnets within your virtual network. The traffic to Azure
SQL Data Warehouse from your VNet will always stay within
the Azure backbone network. This direct route will be
preferred over any specific routes that take Internet traffic
through virtual appliances or on-premises. No additional
billing is charged for virtual network access through service
endpoints. Current pricing model for Azure SQL Data
Warehouse applies as is.

With this release, we also enabled PolyBase connectivity to
Azure Data Lake Storage Gen2 (ADLS) via the Azure Blob File
System (ABFS) driver. Azure Data Lake Storage Gen2 brings all
the qualities that are required for the complete lifecycle of
analytics data to Azure Storage. Features of the two existing
Azure storage services, Azure Blob Storage and Azure Data
Lake Storage Gen1, are converged. Features from Azure Data
Lake Storage Gen1, such as file system semantics, file-level
security, and scale, are combined with low-cost, tiered storage
and high availability/disaster recovery capabilities from Azure
Blob Storage.

Using PolyBase, you can also import data into Azure SQL Data
Warehouse from Azure Storage secured to a VNet. Similarly,
exporting data from Azure SQL Data Warehouse to Azure
Storage secured to a VNet is also supported via PolyBase.

For more information on VNet Service Endpoints in Azure SQL
Data Warehouse, refer to the blog post or the documentation.

Automatic Performance Monitoring (Preview) Query Store is now available in Preview for Azure SQL Data
Warehouse. Query Store is designed to help you with query
performance troubleshooting by tracking queries, query plans,
runtime statistics, and query history to help you monitor the
activity and performance of your data warehouse. Query Store
is a set of internal stores and Dynamic Management Views
(DMVs) that allow you to:

• Identify and tune top resource consuming queries
• Identify and improve unplanned workloads
• Evaluate query performance and impact to the plan by
changes in statistics, indexes, or system size (DWU setting)
• See full query text for all queries executed

The Query Store contains three actual stores:
• A plan store for persisting the execution plan information
• A runtime stats store for persisting the execution statistics
information
• A wait stats store for persisting wait stats information.

SQL Data Warehouse manages these stores automatically and
provides an unlimited number of queries stored over the last
seven days at no additional charge. Enabling Query Store is as
simple as running an ALTER DATABASE T-SQL statement:

ALTER DATABASE [DatabaseName] SET QUERY_STORE = ON;

For more information on Query Store in Azure
SQL Data Warehouse, see the article, Monitoring performance
by using the Query Store, and the Query Store DMVs, such as
sys.query_store_query. Here is the blog post announcing the
release.
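Once enabled, the Query Store DMVs can be queried directly; a small illustrative example joins the query and query-text views:

SELECT TOP 10 q.query_id, t.query_sql_text
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS t
    ON q.query_text_id = t.query_text_id;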

Lower Compute Tiers for Azure SQL Data Warehouse Gen2 Azure SQL Data Warehouse Gen2 now supports lower
compute tiers. Customers can experience Azure SQL Data
Warehouse's leading performance, flexibility, and security
features starting with 100 cDWU (Data Warehouse Units) and
scale to 30,000 cDWU in minutes. Starting mid-December
2018, customers can benefit from Gen2 performance and
flexibility with lower compute tiers in supported regions, with
the rest of the regions available during 2019.

By dropping the entry point for next-generation data
warehousing, Microsoft opens the doors to value-driven
customers who want to evaluate all the benefits of a secure,
high-performance data warehouse without guessing which
trial environment is best for them. Customers may start as low
as 100 cDWU, down from the current 500 cDWU entry point.
SQL Data Warehouse Gen2 continues to support pause and
resume operations and goes beyond just the flexibility in
compute. Gen2 also supports unlimited columnstore storage
capacity along with 2.5 times more memory per query, up to
128 concurrent queries, and adaptive caching features. These
features on average bring five times more performance
compared to the same Data Warehouse Unit on Gen1 at the
same price. Geo-redundant backups are standard for Gen2
with built-in guaranteed data protection. Azure SQL Data
Warehouse Gen2 is ready to scale when you are.

Columnstore Background Merge By default, Azure SQL Data Warehouse (Azure SQL DW) stores
data in columnar format, with micro-partitions called
rowgroups. Sometimes, due to memory constraints at index
build or data load time, the rowgroups may be compressed
with less than the optimal size of one million rows. Rowgroups
may also become fragmented due to deletes. Small or
fragmented rowgroups result in higher memory consumption,
as well as inefficient query execution. With this release of
Azure SQL DW, the columnstore background maintenance
task merges small compressed rowgroups to create larger
rowgroups to better utilize memory and speed up query
execution.

October 2018
Service improvements
SERVICE IMPROVEMENTS DETAILS

DevOps for Data Warehousing A highly requested feature for SQL Data Warehouse (SQL
DW) is now in preview: support for SQL Server Data
Tools (SSDT) in Visual Studio! Teams of developers can now
collaborate over a single, version-controlled codebase and
quickly deploy changes to any instance in the world.
Interested in joining? This feature is available for preview
today! You can register by visiting the SQL Data Warehouse
Visual Studio SQL Server Data Tools (SSDT) - Preview
Enrollment form. Given the high demand, we are managing
acceptance into preview to ensure the best experience for our
customers. Once you sign up, our goal is to confirm your
status within seven business days.

Row Level Security Generally Available Azure SQL Data Warehouse (SQL DW) now supports row level
security (RLS) adding a powerful capability to secure your
sensitive data. With the introduction of RLS, you can
implement security policies to control access to rows in your
tables, as in who can access what rows. RLS enables this fine-
grained access control without having to redesign your data
warehouse. RLS simplifies the overall security model as the
access restriction logic is located in the database tier itself
rather than away from the data in another application. RLS
also eliminates the need to introduce views to filter out rows
for access control management. There is no additional cost for
this enterprise-grade security feature for all our customers.

Advanced Advisors Advanced tuning for Azure SQL Data Warehouse (SQL DW)
just got simpler with additional data warehouse
recommendations and metrics. There are additional advanced
performance recommendations through Azure Advisor at your
disposal, including:

1. Adaptive cache – Be advised when to scale to optimize
cache utilization.
2. Table distribution – Determine when to replicate tables to
reduce data movement and increase workload performance.
3. Tempdb – Understand when to scale and configure resource
classes to reduce tempdb contention.

There is a deeper integration of data warehouse metrics with
Azure Monitor, including an enhanced customizable
monitoring chart for near real-time metrics in the overview
blade. You no longer must leave the data warehouse overview
blade to access Azure Monitor metrics when monitoring
usage, or validating and applying data warehouse
recommendations. In addition, there are new metrics available,
such as tempdb and adaptive cache utilization, to complement
your performance recommendations.

Advanced tuning with integrated advisors Advanced tuning for Azure SQL Data Warehouse (SQL DW)
just got simpler with additional data warehouse
recommendations and metrics and a redesign of the portal
overview blade that provides an integrated experience with
Azure Advisor and Azure Monitor.

Accelerated Database Recovery (ADR) Azure SQL Data Warehouse Accelerated Database Recovery
(ADR) is now in Public Preview. ADR is a new SQL Server
database engine feature that greatly improves database
availability, especially in the presence of long-running
transactions, by completely redesigning the current recovery
process from the ground up.
The primary benefits of ADR are fast and consistent database
recovery and instantaneous transaction rollback.

Azure Monitor diagnostics logs SQL Data Warehouse (SQL DW) now enables enhanced
insights into analytical workloads by integrating directly with
Azure Monitor diagnostic logs. This new capability enables
developers to analyze workload behavior over an extended
time period and make informed decisions on query
optimization or capacity management. We have now
introduced an external logging process through Azure
Monitor diagnostic logs that provide additional insights into
your data warehouse workload. With a single click of a button,
you are now able to configure diagnostic logs for historical
query performance troubleshooting capabilities using Log
Analytics. Azure Monitor diagnostic logs support customizable
retention periods by saving the logs to a storage account for
auditing purposes, the capability to stream logs to event hubs
near real-time telemetry insights, and the ability to analyze
logs using Log Analytics with log queries. Diagnostic logs
consist of telemetry views of your data warehouse equivalent
to the most commonly used performance troubleshooting
DMVs for SQL Data Warehouse. For this initial release, we
have enabled views for the following system dynamic
management views:

• sys.dm_pdw_exec_requests
• sys.dm_pdw_request_steps
• sys.dm_pdw_dms_workers
• sys.dm_pdw_waits
• sys.dm_pdw_sql_requests

Columnstore memory management As the number of compressed column store row groups
increases, the memory required to manage the internal
column segment metadata for those rowgroups increases. As
a result, query performance and queries executed against
some of the Columnstore Dynamic Management Views
(DMVs) can degrade. Improvements have been made in this release
to optimize the size of the internal metadata for these cases,
leading to improved experience and performance for such
queries.

Azure Data Lake Storage Gen2 integration (GA) Azure SQL Data Warehouse (SQL DW) now has native
integration with Azure Data Lake Storage Gen2. Customers
can now load data using external tables from ABFS into SQL
DW. This functionality enables customers to integrate with
their data lakes in Data Lake Storage Gen2.

Notable Bugs CETAS to Parquet failures in small resource classes on data
warehouses of DW2000 and more - This fix correctly identifies
a null reference in the Create External Table As to Parquet
code path.

Identity column values might be lost in some CTAS operations -
The value of an identity column may not be preserved when
CTAS'd to another table. Reported in a blog.

Internal failure in some cases when a session is terminated
while a query is still running - This fix triggers an
InvalidOperationException if a session is terminated when the
query is still running.

(Deployed in November 2018) Customers were experiencing
suboptimal performance when attempting to load multiple
small files from ADLS (Gen1) using PolyBase. - System
performance was bottlenecked during AAD security token
validation. Performance problems were mitigated by enabling
caching of security tokens.

Next steps
Create a SQL Data Warehouse

More information
Blog - Azure SQL Data Warehouse
Customer Advisory Team blogs
Customer success stories
Stack Overflow forum
Twitter
Videos
Azure glossary
Upgrade your data warehouse to Gen2
9/11/2019 • 6 minutes to read • Edit Online

Microsoft is helping drive down the entry-level cost of running a data warehouse. Lower compute tiers capable of
handling demanding queries are now available for Azure SQL Data Warehouse. Read the full announcement
Lower compute tier support for Gen2. The new offering is available in the regions noted in the table below. For
supported regions, existing Gen1 data warehouses can be upgraded to Gen2 through either:
The automatic upgrade process: Automatic upgrades don't start as soon as the service is available in a
region. When automatic upgrades start in a specific region, individual DW upgrades will take place during your
selected maintenance schedule.
Self-upgrade to Gen2: You can control when to upgrade by doing a self-upgrade to Gen2. If your region is not
yet supported, you can restore from a restore point directly to a Gen2 instance in a supported region.

Automated Schedule and Region Availability Table


The following table summarizes by region when the Lower Gen2 compute tier will be available and when
automatic upgrades start. The dates are subject to change. Check back to see when your region becomes available.
* indicates a specific schedule for the region is currently unavailable.

REGION LOWER GEN2 AVAILABLE AUTOMATIC UPGRADES BEGIN

Australia East Available Complete

Australia Southeast Available Complete

Brazil South Available Complete

Canada Central Available Complete

Canada East June 1, 2020 July 1, 2020

Central US Available Complete

China East * *

China East 2 Available Complete

China North * *

China North 2 Available Complete

East Asia Available Complete

East US Available Complete

East US 2 Available Complete

France Central Available In-progress

Germany Central * *

Germany West Central September 1, 2019 October 1, 2019

India Central Available Complete

India South Available Complete

India West July 1, 2019 In-progress

Japan East Available Complete

Japan West Available Complete

Korea Central Available Complete

Korea South Available Complete

North Central US Available Complete

North Europe Available Complete

South Africa North July 12, 2019 Complete

South Central US Available Complete

South East Asia Available Complete

UAE North July 20, 2019 Complete

UK South Available In-progress

UK West Available In-progress

West Central US November 1, 2019 December 1, 2019

West Europe Available Complete

West US Available Complete

West US 2 Available Complete

Automatic upgrade process


Based on the availability chart above, we'll be scheduling automated upgrades for your Gen1 instances. To avoid
any unexpected interruptions on the availability of the data warehouse, the automated upgrades will be scheduled
during your maintenance schedule. The ability to create a new Gen1 instance will be disabled in regions
undergoing auto upgrade to Gen2. Gen1 will be deprecated once the automatic upgrades have been completed.
For more information on schedules, see View a maintenance schedule
The upgrade process will involve a brief drop in connectivity (approximately 5 min) as we restart your data
warehouse. Once your data warehouse has been restarted, it will be fully available for use. However, you may
experience a degradation in performance while the upgrade process continues to upgrade the data files in the
background. The total time for the performance degradation will vary depending on the size of your data files.
You can also expedite the data file upgrade process by running Alter Index rebuild on all primary columnstore
tables using a larger SLO and resource class after the restart.

NOTE
Alter Index rebuild is an offline operation and the tables will not be available until the rebuild completes.
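A sketch of the rebuild command for a single table (the table name is hypothetical); run it for each primary columnstore table you want to expedite, ideally under a larger resource class:

ALTER INDEX ALL ON dbo.FactSale REBUILD;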

Self-upgrade to Gen2
You can choose to self-upgrade by following these steps on an existing Gen1 data warehouse. If you choose to self-
upgrade, you must complete it before the automatic upgrade process begins in your region. Doing so ensures that
you avoid any risk of the automatic upgrades causing a conflict.
There are two options when conducting a self-upgrade. You can either upgrade your current data warehouse in-
place or you can restore a Gen1 data warehouse into a Gen2 instance.
Upgrade in-place - This option will upgrade your existing Gen1 data warehouse to Gen2. The upgrade
process will involve a brief drop in connectivity (approximately 5 min) as we restart your data warehouse.
Once your data warehouse has been restarted, it will be fully available for use. If you experience issues
during the upgrade, open a support request and reference “Gen2 upgrade” as the possible cause.
Upgrade from restore point - Create a user-defined restore point on your current Gen1 data warehouse and
then restore directly to a Gen2 instance. The existing Gen1 data warehouse will stay in place. Once the
restore has been completed, your Gen2 data warehouse will be fully available for use. Once you have run all
testing and validation processes on the restored Gen2 instance, the original Gen1 instance can be deleted.
Step 1: From the Azure portal, create a user-defined restore point.
Step 2: When restoring from a user-defined restore point, set the "Performance level" to your preferred
Gen2 tier.
You may experience a period of degradation in performance while the upgrade process continues to upgrade the
data files in the background. The total time for the performance degradation will vary depending on the size of
your data files.
To expedite the background data migration process, you can immediately force data movement by running Alter
Index rebuild on all primary columnstore tables you'd be querying at a larger SLO and resource class.

NOTE
Alter Index rebuild is an offline operation and the tables will not be available until the rebuild completes.

If you encounter any issues with your data warehouse, create a support request and reference “Gen2 upgrade” as
the possible cause.
For more information, see Upgrade to Gen2.

Migration frequently asked questions


Q: Does Gen2 cost the same as Gen1?
A: Yes.
Q: How will the upgrades affect my automation scripts?
A: Any automation script that references a Service Level Objective should be changed to correspond to the
Gen2 equivalent. See details here.
Q: How long does a self-upgrade normally take?
A: You can upgrade in place or upgrade from a restore point.
Upgrading in place will cause your data warehouse to momentarily pause and resume. A background
process will continue while the data warehouse is online.
It takes longer if you are upgrading through a restore point, because the upgrade will go through the full
restore process.
Q: How long will the auto upgrade take?
A: The actual downtime for the upgrade is only the time it takes to pause and resume the service, which is
between 5 to 10 minutes. After the brief down-time, a background process will run a storage migration. The
length of time for the background process is dependent on the size of your data warehouse.
Q: When will this automatic upgrade take place?
A: During your maintenance schedule. Leveraging your chosen maintenance schedule will minimize disruption
to your business.
Q: What should I do if my background upgrade process seems to be stuck?
A: Kick off a reindex of your columnstore tables. Note that the table will be offline while it is being reindexed.
Q: What if Gen2 does not have the Service Level Objective I have on Gen1?
A: If you are running a DW600 or DW1200 on Gen1, it is advised to use DW500c or DW1000c respectively
since Gen2 provides more memory, resources, and higher performance than Gen1.
Q: Can I disable geo-backup?
A: No. Geo-backup is an enterprise feature to preserve your data warehouse availability in the event a region
becomes unavailable. Open a support request if you have further concerns.
Q: Is there a difference in T-SQL syntax between Gen1 and Gen2?
A: There is no change in the T-SQL language syntax from Gen1 to Gen2.
Q: Does Gen2 support Maintenance Windows?
A: Yes.
Q: Will I be able to create a new Gen1 instance after my region has been upgraded?
A: No. After a region has been upgraded, the creation of new Gen1 instances will be disabled.

Next steps
Upgrade steps
Maintenance windows
Resource health monitor
Review Before you begin a migration
Upgrade in-place and upgrade from a restore point
Create a user-defined restore point
Learn How to restore to Gen2
Open a SQL Data Warehouse support request
Quickstart: Create and query an Azure SQL Data
Warehouse in the Azure portal
9/5/2019 • 6 minutes to read • Edit Online

Quickly create and query an Azure SQL Data Warehouse by using the Azure portal.
If you don't have an Azure subscription, create a free account before you begin.

NOTE
Creating a SQL Data Warehouse may result in a new billable service. For more information, see SQL Data Warehouse
pricing.

Before you begin


Download and install the newest version of SQL Server Management Studio (SSMS ).

Sign in to the Azure portal


Sign in to the Azure portal.

Create a data warehouse


An Azure SQL Data Warehouse is created with a defined set of compute resources. The database is created
within an Azure resource group and in an Azure SQL logical server.
Follow these steps to create a SQL Data Warehouse that contains the AdventureWorksDW sample data.
1. Click Create a resource in the upper left-hand corner of the Azure portal.
2. Select Databases from the New page, and select SQL Data Warehouse under Featured on the New
page.
3. Fill out the SQL Data Warehouse form with the following information:

SETTING SUGGESTED VALUE DESCRIPTION

Database name mySampleDataWarehouse For valid database names, see Database Identifiers. Note, a data warehouse is a type of database.

Subscription Your subscription For details about your subscriptions, see Subscriptions.

Resource group myResourceGroup For valid resource group names, see Naming rules and restrictions.

Select source Sample Specifies to load a sample database. Note, a data warehouse is one type of database.

Select sample AdventureWorksDW Specifies to load the AdventureWorksDW sample database.
4. Click Server to create and configure a new server for your new database. Fill out the New server form
with the following information:

SETTING SUGGESTED VALUE DESCRIPTION

Server name Any globally unique name For valid server names, see Naming
rules and restrictions.

Server admin login Any valid name For valid login names, see Database
Identifiers.

Password Any valid password Your password must have at least eight characters and must contain characters from three of the following categories: upper case characters, lower case characters, numbers, and non-alphanumeric characters.

Location Any valid location For information about regions, see Azure Regions.

5. Click Select.
6. Click Performance level to specify the performance configuration for the data warehouse.
7. For this tutorial, select Gen2. The slider, by default, is set to DW1000c. Try moving it up and down to see
how it works.
8. Click Apply.
9. Now that you've completed the SQL Data Warehouse form, click Create to provision the database.
Provisioning takes a few minutes.
10. On the toolbar, click Notifications to monitor the deployment process.

Create a server-level firewall rule


The SQL Data Warehouse service creates a firewall at the server-level. This firewall prevents external applications
and tools from connecting to the server or any databases on the server. To enable connectivity, you can add
firewall rules that enable connectivity for specific IP addresses. Follow these steps to create a server-level firewall
rule for your client's IP address.

NOTE
SQL Data Warehouse communicates over port 1433. If you are trying to connect from within a corporate network,
outbound traffic over port 1433 might not be allowed by your network's firewall. If so, you cannot connect to your Azure
SQL Database server unless your IT department opens port 1433.

1. After the deployment completes, select All services from the left-hand menu. Select Databases, select the
star next to SQL data warehouses to add SQL data warehouses to your favorites.
2. Select SQL data warehouses from the left-hand menu and then click mySampleDataWarehouse on
the SQL data warehouses page. The overview page for your database opens, showing you the fully
qualified server name (such as mynewserver-20180430.database.windows.net) and provides options
for further configuration.
3. Copy this fully qualified server name for use to connect to your server and its databases in this and other
quick starts. To open server settings, click the server name.

4. Click Show firewall settings.

5. The Firewall settings page for the SQL Database server opens.
6. To add your current IP address to a new firewall rule, click Add client IP on the toolbar. A firewall rule can
open port 1433 for a single IP address or a range of IP addresses.
7. Click Save. A server-level firewall rule is created for your current IP address opening port 1433 on the
logical server.
8. Click OK and then close the Firewall settings page.
You can now connect to the SQL server and its data warehouses using this IP address. The connection works
from SQL Server Management Studio or another tool of your choice. When you connect, use the ServerAdmin
account you created previously.

IMPORTANT
By default, access through the SQL Database firewall is enabled for all Azure services. Click OFF on this page and then click
Save to disable the firewall for all Azure services.

Get the fully qualified server name


Get the fully qualified server name for your SQL server in the Azure portal. Later you use the fully qualified name
when connecting to the server.
1. Sign in to the Azure portal.
2. Select SQL Data warehouses from the left-hand menu, and click your data warehouse on the SQL data
warehouses page.
3. In the Essentials pane in the Azure portal page for your database, locate and then copy the Server name.
In this example, the fully qualified name is mynewserver-20180430.database.windows.net.
Connect to the server as server admin
This section uses SQL Server Management Studio (SSMS ) to establish a connection to your Azure SQL server.
1. Open SQL Server Management Studio.
2. In the Connect to Server dialog box, enter the following information:

SETTING SUGGESTED VALUE DESCRIPTION

Server type Database engine This value is required

Server name The fully qualified server name Here's an example: mynewserver-
20180430.database.windows.net.

Authentication SQL Server Authentication SQL Authentication is the only authentication type that is configured in this tutorial.

Login The server admin account Account that you specified when you
created the server.

Password The password for your server admin Password that you specified when
account you created the server.

3. Click Connect. The Object Explorer window opens in SSMS.


4. In Object Explorer, expand Databases. Then expand mySampleDataWarehouse to view the objects in your new
database.

Run some queries


SQL Data Warehouse uses T-SQL as the query language. To open a query window and run some T-SQL queries,
use the following steps:
1. Right-click mySampleDataWarehouse and select New Query. A new query window opens.
2. In the query window, enter the following command to see a list of databases.
SELECT * FROM sys.databases

3. Click Execute. The query results show two databases: master and mySampleDataWarehouse.

4. To look at some data, use the following command to see the number of customers with last name of
Adams that have three children at home. The results list six customers.

SELECT LastName, FirstName FROM dbo.dimCustomer
WHERE LastName = 'Adams' AND NumberChildrenAtHome = 3;

Clean up resources
You're being charged for data warehouse units and the data stored in your data warehouse. These compute and storage
resources are billed separately.
If you want to keep the data in storage, you can pause compute when you aren't using the data warehouse. By
pausing compute, you're only charged for data storage. You can resume compute whenever you're ready to
work with the data.
If you want to remove future charges, you can delete the data warehouse.
Follow these steps to clean up resources you no longer need.
1. Sign in to the Azure portal, and click on your data warehouse.

2. To pause compute, click the Pause button. When the data warehouse is paused, you see a Resume button.
To resume compute, click Resume.
3. To remove the data warehouse so you aren't charged for compute or storage, click Delete.
4. To remove the SQL server you created, click the server name, mynewserver-20180430.database.windows.net,
and then click Delete. Be careful with this deletion, since deleting the server also deletes
all databases assigned to the server.
5. To remove the resource group, click myResourceGroup, and then click Delete resource group.

Next steps
You've now created a data warehouse, created a firewall rule, connected to your data warehouse, and run a few
queries. To learn more about Azure SQL Data Warehouse, continue to the tutorial for loading data.
Load data into a SQL Data Warehouse
Quickstart: Create and query an Azure SQL Data
Warehouse with Azure PowerShell
8/18/2019 • 4 minutes to read • Edit Online

Quickly create an Azure SQL Data Warehouse using Azure PowerShell.


If you don't have an Azure subscription, create a free account before you begin.

NOTE
Creating a SQL Data Warehouse may result in a new billable service. For more information, see SQL Data Warehouse pricing.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Sign in to Azure
Sign in to your Azure subscription using the Connect-AzAccount command and follow the on-screen directions.

Connect-AzAccount

To see which subscription you're using, run Get-AzSubscription.

Get-AzSubscription

If you need to use a different subscription than the default, run Set-AzContext.

Set-AzContext -SubscriptionName "MySubscription"

Create variables
Define variables for use in the scripts in this quickstart.
# The data center and resource name for your resources
$resourcegroupname = "myResourceGroup"
$location = "WestEurope"
# The logical server name: Use a random value or replace with your own value (don't capitalize)
$servername = "server-$(Get-Random)"
# Set an admin name and password for your database
# The sign-in information for the server
$adminlogin = "ServerAdmin"
$password = "ChangeYourAdminPassword1"
# The ip address range that you want to allow to access your server - change as appropriate
$startip = "0.0.0.0"
$endip = "0.0.0.0"
# The database name
$databasename = "mySampleDataWarehouse"

Create a resource group


Create an Azure resource group using the New-AzResourceGroup command. A resource group is a logical
container into which Azure resources are deployed and managed as a group. The following example creates a
resource group named myResourceGroup in the westeurope location.

New-AzResourceGroup -Name $resourcegroupname -Location $location

Create a logical server


Create an Azure SQL logical server using the New-AzSqlServer command. A logical server contains a group of
databases managed as a group. The following example creates a randomly named server in your resource group
with an admin user named ServerAdmin and a password of ChangeYourAdminPassword1 . Replace these pre-defined
values as desired.

New-AzSqlServer -ResourceGroupName $resourcegroupname `
    -ServerName $servername `
    -Location $location `
    -SqlAdministratorCredentials $(New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $adminlogin, $(ConvertTo-SecureString -String $password -AsPlainText -Force))

Configure a server firewall rule


Create an Azure SQL server-level firewall rule using the New-AzSqlServerFirewallRule command. A server-level
firewall rule allows an external application, such as SQL Server Management Studio or the SQLCMD utility to
connect to a SQL Data Warehouse through the SQL Data Warehouse service firewall. In the following example,
the firewall is only opened for other Azure resources. To enable external connectivity, change the IP address to an
appropriate address for your environment. To open all IP addresses, use 0.0.0.0 as the starting IP address and
255.255.255.255 as the ending address.

New-AzSqlServerFirewallRule -ResourceGroupName $resourcegroupname `
    -ServerName $servername `
    -FirewallRuleName "AllowSome" -StartIpAddress $startip -EndIpAddress $endip
NOTE
SQL Database and SQL Data Warehouse communicate over port 1433. If you're trying to connect from within a corporate
network, outbound traffic over port 1433 may not be allowed by your network's firewall. If so, you won't be able to connect
to your Azure SQL server unless your IT department opens port 1433.

Create a data warehouse


This example creates a data warehouse using the previously defined variables. It specifies the service objective as
DW100c, which is a lower-cost starting point for your data warehouse.

New-AzSqlDatabase `
-ResourceGroupName $resourcegroupname `
-ServerName $servername `
-DatabaseName $databasename `
-Edition "DataWarehouse" `
-RequestedServiceObjectiveName "DW100c" `
-CollationName "SQL_Latin1_General_CP1_CI_AS" `
-MaxSizeBytes 10995116277760

Required Parameters are:


RequestedServiceObjectiveName: The amount of data warehouse units you're requesting. Increasing this
amount increases compute cost. For a list of supported values, see memory and concurrency limits.
DatabaseName: The name of the SQL Data Warehouse that you're creating.
ServerName: The name of the server that you're using for creation.
ResourceGroupName: Resource group you're using. To find available resource groups in your subscription
use Get-AzResourceGroup.
Edition: Must be "DataWarehouse" to create a SQL Data Warehouse.
Optional Parameters are:
CollationName: The default collation if not specified is SQL_Latin1_General_CP1_CI_AS. Collation can't be
changed on a database.
MaxSizeBytes: The default max size of a database is 240TB. The max size limits rowstore data. There is
unlimited storage for columnar data.
For more information on the parameter options, see New-AzSqlDatabase.

Clean up resources
Other quickstart tutorials in this collection build upon this quickstart.

TIP
If you plan to continue on to work with later quickstart tutorials, don't clean up the resources created in this quickstart. If you
don't plan to continue, use the following steps to delete all resources created by this quickstart in the Azure portal.

Remove-AzResourceGroup -ResourceGroupName $resourcegroupname

Next steps
You've now created a data warehouse, created a firewall rule, connected to your data warehouse, and run a few
queries. To learn more about Azure SQL Data Warehouse, continue to the tutorial for loading data.
Load data into a SQL Data Warehouse
Quickstart: Pause and resume compute for an Azure
SQL Data Warehouse in the Azure portal
8/18/2019 • 2 minutes to read • Edit Online

Use the Azure portal to pause compute in Azure SQL Data Warehouse to save costs. Resume compute when you
are ready to use the data warehouse.
If you don't have an Azure subscription, create a free account before you begin.

Sign in to the Azure portal


Sign in to the Azure portal.

Before you begin


Use Create and Connect - portal to create a data warehouse called mySampleDataWarehouse.

Pause compute
To save costs, you can pause and resume compute resources on-demand. For example, if you won't be using the
database during the night and on weekends, you can pause it during those times, and resume it during the day.
You won't be charged for compute resources while the database is paused. However, you will continue to be
charged for storage.
Follow these steps to pause a SQL Data Warehouse.
1. Click SQL databases in the left page of the Azure portal.
2. Select mySampleDataWarehouse from the SQL databases page. This opens the data warehouse.
3. On the mySampleDataWarehouse page, notice Status is Online.
4. To pause the data warehouse, click the Pause button.
5. A confirmation question appears asking if you want to continue. Click Yes.
6. Wait a few moments, and then notice the Status is Pausing.
7. When the pause operation is complete, the status is Paused and the option button is Start.
8. The compute resources for the data warehouse are now offline. You won't be charged for compute until you
resume the service.

Resume compute
Follow these steps to resume a SQL Data Warehouse.
1. Click SQL databases in the left page of the Azure portal.
2. Select mySampleDataWarehouse from the SQL databases page. This opens the data warehouse.
3. On the mySampleDataWarehouse page, notice Status is Paused.
4. To resume the data warehouse, click Start.
5. A confirmation question appears asking if you want to start. Click Yes.
6. Notice the Status is Resuming.
7. When the data warehouse is back online, the status is Online and the option button is Pause.
8. The compute resources for the data warehouse are now online and you can use the service. Charges for
compute have resumed.

Clean up resources
You are being charged for data warehouse units and the data stored in your data warehouse. These compute and
storage resources are billed separately.
If you want to keep the data in storage, pause compute.
If you want to remove future charges, you can delete the data warehouse.
Follow these steps to clean up resources as you desire.
1. Sign in to the Azure portal, and click on your data warehouse.
2. To pause compute, click the Pause button. When the data warehouse is paused, you see a Start button. To
resume compute, click Start.
3. To remove the data warehouse so you are not charged for compute or storage, click Delete.
4. To remove the SQL server you created, click mynewserver-20171113.database.windows.net, and then
click Delete. Be careful with this deletion, since deleting the server also deletes all databases assigned to
the server.
5. To remove the resource group, click myResourceGroup, and then click Delete resource group.

Next steps
You have now paused and resumed compute for your data warehouse. To learn more about Azure SQL Data
Warehouse, continue to the tutorial for loading data.
Load data into a SQL Data Warehouse
Quickstart: Pause and resume compute in Azure SQL
Data Warehouse with Azure PowerShell
9/4/2019 • 3 minutes to read • Edit Online

Use PowerShell to pause compute in Azure SQL Data Warehouse to save costs. Resume compute when you are
ready to use the data warehouse.
If you don't have an Azure subscription, create a free account before you begin.

Before you begin


NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

This quickstart assumes you already have a SQL Data Warehouse that you can pause and resume. If you need to
create one, you can use Create and Connect - portal to create a data warehouse called
mySampleDataWarehouse.

Log in to Azure
Log in to your Azure subscription using the Connect-AzAccount command and follow the on-screen directions.

Connect-AzAccount

To see which subscription you are using, run Get-AzSubscription.

Get-AzSubscription

If you need to use a different subscription than the default, run Set-AzContext.

Set-AzContext -SubscriptionName "MySubscription"

Look up data warehouse information


Locate the database name, server name, and resource group for the data warehouse you plan to pause and
resume.
Follow these steps to find location information for your data warehouse.
1. Sign in to the Azure portal.
2. Click SQL databases in the left page of the Azure portal.
3. Select mySampleDataWarehouse from the SQL databases page. The data warehouse opens.
4. Write down the data warehouse name, which is the database name. Also write down the server name, and
the resource group.
5. If your server is foo.database.windows.net, use only the first part as the server name in the PowerShell
cmdlets. For example, if the full server name is newserver-20171113.database.windows.net, drop
the suffix and use newserver-20171113 as the server name in the PowerShell cmdlet.

Pause compute
To save costs, you can pause and resume compute resources on-demand. For example, if you are not using the
database during the night and on weekends, you can pause it during those times, and resume it during the day.
There is no charge for compute resources while the database is paused. However, you continue to be charged for
storage.
To pause a database, use the Suspend-AzSqlDatabase cmdlet. The following example pauses a data warehouse
named mySampleDataWarehouse hosted on a server named newserver-20171113. The server is in an Azure
resource group named myResourceGroup.

Suspend-AzSqlDatabase -ResourceGroupName "myResourceGroup" `
    -ServerName "newserver-20171113" -DatabaseName "mySampleDataWarehouse"

A variation, this next example retrieves the database into the $database object. It then pipes the object to Suspend-
AzSqlDatabase. The results are stored in the object resultDatabase. The final command shows the results.

$database = Get-AzSqlDatabase -ResourceGroupName "myResourceGroup" `
    -ServerName "newserver-20171113" -DatabaseName "mySampleDataWarehouse"
$resultDatabase = $database | Suspend-AzSqlDatabase
$resultDatabase

Resume compute
To start a database, use the Resume-AzSqlDatabase cmdlet. The following example starts a database named
mySampleDataWarehouse hosted on a server named newserver-20171113. The server is in an Azure resource
group named myResourceGroup.

Resume-AzSqlDatabase -ResourceGroupName "myResourceGroup" `
    -ServerName "newserver-20171113" -DatabaseName "mySampleDataWarehouse"

A variation, this next example retrieves the database into the $database object. It then pipes the object to Resume-
AzSqlDatabase and stores the results in $resultDatabase. The final command shows the results.
$database = Get-AzSqlDatabase -ResourceGroupName "ResourceGroup1" `
    -ServerName "Server01" -DatabaseName "Database02"
$resultDatabase = $database | Resume-AzSqlDatabase
$resultDatabase

Check status of your data warehouse operation


To check the status of your data warehouse, use the Get-AzSqlDatabaseActivity cmdlet.

Get-AzSqlDatabaseActivity -ResourceGroupName "ResourceGroup01" -ServerName "Server01" -DatabaseName "Database02"

Clean up resources
You are being charged for data warehouse units and data stored your data warehouse. These compute and storage
resources are billed separately.
If you want to keep the data in storage, pause compute.
If you want to remove future charges, you can delete the data warehouse.
Follow these steps to clean up resources as you desire.
1. Sign in to the Azure portal, and click on your data warehouse.

2. To pause compute, click the Pause button. When the data warehouse is paused, you see a Start button. To
resume compute, click Start.
3. To remove the data warehouse so you are not charged for compute or storage, click Delete.
4. To remove the SQL server you created, click mynewserver-20171113.database.windows.net, and then
click Delete. Be careful with this deletion, since deleting the server also deletes all databases assigned to the
server.
5. To remove the resource group, click myResourceGroup, and then click Delete resource group.

Next steps
You have now paused and resumed compute for your data warehouse. To learn more about Azure SQL Data
Warehouse, continue to the tutorial for loading data.
Load data into a SQL Data Warehouse
Quickstart: Scale compute in Azure SQL Data
Warehouse in the Azure portal
8/18/2019 • 2 minutes to read • Edit Online

Scale compute in Azure SQL Data Warehouse in the Azure portal. Scale out compute for better performance, or
scale back compute to save costs.
If you don't have an Azure subscription, create a free account before you begin.

Sign in to the Azure portal


Sign in to the Azure portal.

Before you begin


You can scale a data warehouse that you already have, or use Quickstart: create and connect - portal to create a
data warehouse named mySampleDataWarehouse. This quickstart scales mySampleDataWarehouse.

NOTE
Your data warehouse must be online to scale.

Scale compute
SQL Data Warehouse compute resources can be scaled by increasing or decreasing data warehouse units. The
Create and Connect - portal quickstart created mySampleDataWarehouse
and initialized it with 400 DWUs. The following steps adjust the DWUs for mySampleDataWarehouse.
To change data warehouse units:
1. Click SQL data warehouses in the left page of the Azure portal.
2. Select mySampleDataWarehouse from the SQL data warehouses page. The data warehouse opens.
3. Click Scale.
4. In the Scale panel, move the slider left or right to change the DWU setting.

5. Click Save. A confirmation message appears. Click yes to confirm or no to cancel.


Next steps
You've now learned to scale compute for your data warehouse. To learn more about Azure SQL Data Warehouse,
continue to the tutorial for loading data.
Load data into a SQL Data Warehouse
Quickstart: Scale compute in Azure SQL Data
Warehouse in Azure PowerShell
9/4/2019 • 2 minutes to read • Edit Online

Scale compute in Azure SQL Data Warehouse using Azure PowerShell. Scale out compute for better performance,
or scale back compute to save costs.
If you don't have an Azure subscription, create a free account before you begin.

Before you begin


NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

This quickstart assumes you already have a SQL Data Warehouse that you can scale. If you need to create one, use
Create and Connect - portal to create a data warehouse called mySampleDataWarehouse.

Log in to Azure
Log in to your Azure subscription using the Connect-AzAccount command and follow the on-screen directions.

Connect-AzAccount

To see which subscription you are using, run Get-AzSubscription.

Get-AzSubscription

If you need to use a different subscription than the default, run Set-AzContext.

Set-AzContext -SubscriptionName "MySubscription"

Look up data warehouse information


Locate the database name, server name, and resource group for the data warehouse you plan to scale.
Follow these steps to find location information for your data warehouse.
1. Sign in to the Azure portal.
2. Click SQL data warehouses in the left page of the Azure portal.
3. Select mySampleDataWarehouse from the SQL data warehouses page. This opens the data warehouse.
4. Write down the data warehouse name, which will be used as the database name. Remember, a data warehouse is one type of database. Also write down the server name and the resource group. You will use these in the scale commands.
5. Use only the first part of the fully qualified server name in the PowerShell cmdlets. For example, if the full server name is mynewserver-20171113.database.windows.net, use mynewserver-20171113 as the server name in the PowerShell cmdlets.

Scale compute
In SQL Data Warehouse, you can increase or decrease compute resources by adjusting data warehouse units. The Create and Connect - portal quickstart created mySampleDataWarehouse and initialized it with 400 DWUs. The following steps adjust the DWUs for mySampleDataWarehouse.
To change data warehouse units, use the Set-AzSqlDatabase PowerShell cmdlet. The following example sets the data warehouse units to DW300c for the database mySampleDataWarehouse, which is hosted in the resource group myResourceGroup on the server mynewserver-20171113.

Set-AzSqlDatabase -ResourceGroupName "myResourceGroup" -DatabaseName "mySampleDataWarehouse" -ServerName "mynewserver-20171113" -RequestedServiceObjectiveName "DW300c"

Check data warehouse state


To see the current state of the data warehouse, use the Get-AzSqlDatabase PowerShell cmdlet. This gets the state of the mySampleDataWarehouse database in the resource group myResourceGroup on the server mynewserver-20171113.database.windows.net.

$database = Get-AzSqlDatabase -ResourceGroupName myResourceGroup -ServerName mynewserver-20171113 -DatabaseName mySampleDataWarehouse
$database

The output will look something like this:


ResourceGroupName : myResourceGroup
ServerName : mynewserver-20171113
DatabaseName : mySampleDataWarehouse
Location : North Europe
DatabaseId : 34d2ffb8-b70a-40b2-b4f9-b0a39833c974
Edition : DataWarehouse
CollationName : SQL_Latin1_General_CP1_CI_AS
CatalogCollation :
MaxSizeBytes : 263882790666240
Status : Online
CreationDate : 11/20/2017 9:18:12 PM
CurrentServiceObjectiveId : 284f1aff-fee7-4d3b-a211-5b8ebdd28fea
CurrentServiceObjectiveName : DW300c
RequestedServiceObjectiveId : 284f1aff-fee7-4d3b-a211-5b8ebdd28fea
RequestedServiceObjectiveName :
ElasticPoolName :
EarliestRestoreDate :
Tags :
ResourceId :
/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/
resourceGroups/myResourceGroup/providers/Microsoft.Sql/servers/mynewserver-
20171113/databases/mySampleDataWarehouse
CreateMode :
ReadScale : Disabled
ZoneRedundant : False

You can see the Status of the database in the output. In this case, you can see that this database is online. When
you run this command, you should receive a Status value of Online, Pausing, Resuming, Scaling, or Paused.
To see the status by itself, use the following command:

$database | Select-Object DatabaseName,Status

Next steps
You have now learned how to scale compute for your data warehouse. To learn more about Azure SQL Data
Warehouse, continue to the tutorial for loading data.
Load data into a SQL Data Warehouse
Quickstart: Scale compute in Azure SQL Data
Warehouse using T-SQL
8/18/2019 • 3 minutes to read • Edit Online

Scale compute in Azure SQL Data Warehouse using T-SQL and SQL Server Management Studio (SSMS ). Scale
out compute for better performance, or scale back compute to save costs.
If you don't have an Azure subscription, create a free account before you begin.

Before you begin


Download and install the newest version of SQL Server Management Studio (SSMS ).

Create a data warehouse


Use Quickstart: create and Connect - portal to create a data warehouse named mySampleDataWarehouse.
Finish the quickstart to ensure you have a firewall rule and can connect to your data warehouse from within SQL
Server Management Studio.

Connect to the server as server admin


This section uses SQL Server Management Studio (SSMS ) to establish a connection to your Azure SQL server.
1. Open SQL Server Management Studio.
2. In the Connect to Server dialog box, enter the following information:

SETTING | SUGGESTED VALUE | DESCRIPTION
Server type | Database engine | This value is required.
Server name | The fully qualified server name | For example: mynewserver-20171113.database.windows.net.
Authentication | SQL Server Authentication | SQL Server authentication is the only authentication type configured in this tutorial.
Login | The server admin account | The account that you specified when you created the server.
Password | The password for your server admin account | The password that you specified when you created the server.
3. Click Connect. The Object Explorer window opens in SSMS.
4. In Object Explorer, expand Databases. Then expand mySampleDataWarehouse to view the objects in your new database.
View service objective
The service objective setting contains the number of data warehouse units for the data warehouse.
To view the current data warehouse units for your data warehouse:
1. Under the connection to mynewserver-20171113.database.windows.net, expand System Databases.
2. Right-click master and select New Query. A new query window opens.
3. Run the following query to select from the sys.database_service_objectives dynamic management view.
SELECT
db.name [Database]
, ds.edition [Edition]
, ds.service_objective [Service Objective]
FROM
sys.database_service_objectives ds
JOIN
sys.databases db ON ds.database_id = db.database_id
WHERE
db.name = 'mySampleDataWarehouse'

4. The following results show mySampleDataWarehouse has a service objective of DW400.

Scale compute
In SQL Data Warehouse, you can increase or decrease compute resources by adjusting data warehouse units. The
Create and Connect - portal created mySampleDataWarehouse and initialized it with 400 DWUs. The following
steps adjust the DWUs for mySampleDataWarehouse.
To change data warehouse units:
1. Right-click master and select New Query.
2. Use the ALTER DATABASE T-SQL statement to modify the service objective. Run the following query to change the service objective to DW300c.

ALTER DATABASE mySampleDataWarehouse
MODIFY (SERVICE_OBJECTIVE = 'DW300c');

Monitor scale change request


To see the progress of the previous change request, you can use the WAITFOR DELAY T-SQL syntax to poll the sys.dm_operation_status dynamic management view (DMV).
To poll for the service objective change status:
1. Right-click master and select New Query.
2. Run the following query to poll the sys.dm_operation_status DMV.

WHILE
(
SELECT TOP 1 state_desc
FROM sys.dm_operation_status
WHERE
1=1
AND resource_type_desc = 'Database'
AND major_resource_id = 'MySampleDataWarehouse'
AND operation = 'ALTER DATABASE'
ORDER BY
start_time DESC
) = 'IN_PROGRESS'
BEGIN
RAISERROR('Scale operation in progress',0,0) WITH NOWAIT;
WAITFOR DELAY '00:00:05';
END
PRINT 'Complete';

3. The resulting output shows a log of the polling of the status.

Check data warehouse state


When a data warehouse is paused, you can't connect to it with T-SQL. To see the current state of the data
warehouse, you can use a PowerShell cmdlet. For an example, see Check data warehouse state - PowerShell.

Check operation status


To return information about various management operations on your SQL Data Warehouse, run the following
query on the sys.dm_operation_status DMV. For example, it returns the operation and the state of the operation,
which is IN_PROGRESS or COMPLETED.

SELECT *
FROM
sys.dm_operation_status
WHERE
resource_type_desc = 'Database'
AND
major_resource_id = 'MySampleDataWarehouse'

Next steps
You've now learned how to scale compute for your data warehouse. To learn more about Azure SQL Data
Warehouse, continue to the tutorial for loading data.
Load data into a SQL Data Warehouse
Quickstart: Create a workload classifier using T-SQL
8/18/2019 • 2 minutes to read • Edit Online

In this quickstart, you'll quickly create a workload classifier with high importance for the CEO of your organization.
This workload classifier will allow CEO queries to take precedence over other queries with lower importance in the
queue.
If you don't have an Azure subscription, create a free account before you begin.

NOTE
Creating a SQL Data Warehouse may result in a new billable service. For more information, see SQL Data Warehouse pricing.

Prerequisites
This quickstart assumes you already have a SQL Data Warehouse and that you have CONTROL DATABASE
permissions. If you need to create one, use Create and Connect - portal to create a data warehouse called
mySampleDataWarehouse.

Sign in to the Azure portal


Sign in to the Azure portal.

Create login for TheCEO


Create a SQL Server authentication login in the master database using CREATE LOGIN for 'TheCEO'.

IF NOT EXISTS (SELECT * FROM sys.sql_logins WHERE name = 'TheCEO')
BEGIN
    CREATE LOGIN [TheCEO] WITH PASSWORD = '<strongpassword>'
END;

Create user
Create user, "TheCEO", in mySampleDataWarehouse

IF NOT EXISTS (SELECT * FROM sys.database_principals WHERE name = 'TheCEO')
BEGIN
    CREATE USER [TheCEO] FOR LOGIN [TheCEO]
END;

Create a workload classifier


Create a workload classifier for "TheCEO" with high importance.
DROP WORKLOAD CLASSIFIER [wgcTheCEO];
CREATE WORKLOAD CLASSIFIER [wgcTheCEO]
WITH (WORKLOAD_GROUP = 'xlargerc'
,MEMBERNAME = 'TheCEO'
,IMPORTANCE = HIGH);

View existing classifiers


SELECT * FROM sys.workload_management_workload_classifiers
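
To check how requests from TheCEO are classified, you can run a few queries as TheCEO and then inspect the request DMV. The following is a minimal sketch; it assumes the importance column is available in sys.dm_pdw_exec_requests for your service version.

-- List recent requests submitted by TheCEO and the importance assigned to them
-- (assumes the importance column is available in sys.dm_pdw_exec_requests).
SELECT r.request_id
     , s.login_name
     , r.importance
     , r.[status]
     , r.command
FROM sys.dm_pdw_exec_requests AS r
JOIN sys.dm_pdw_exec_sessions AS s
    ON r.session_id = s.session_id
WHERE s.login_name = 'TheCEO'
ORDER BY r.submit_time DESC;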

Clean up resources
DROP WORKLOAD CLASSIFIER [wgcTheCEO]
DROP USER [TheCEO]
;

You're being charged for data warehouse units and data stored in your data warehouse. These compute and
storage resources are billed separately.
If you want to keep the data in storage, you can pause compute when you aren't using the data warehouse. By
pausing compute, you're only charged for data storage. When you're ready to work with the data, resume
compute.
If you want to remove future charges, you can delete the data warehouse.
Follow these steps to clean up resources.
1. Sign in to the Azure portal and select your data warehouse.

2. To pause compute, select the Pause button. When the data warehouse is paused, you see a Start button. To
resume compute, select Start.
3. To remove the data warehouse so you're not charged for compute or storage, select Delete.
4. To remove the SQL server you created, select the server, for example mynewserver-20180430.database.windows.net, and then select Delete. Be careful with this deletion, since deleting the server also deletes all databases assigned to the server.
5. To remove the resource group, select myResourceGroup, and then select Delete resource group.

Next steps
You've now created a workload classifier. Run a few queries as TheCEO to see how they perform. See
sys.dm_pdw_exec_requests to view queries and the importance assigned.
For more information about Azure SQL Data Warehouse workload management, see Workload Importance
and Workload Classification.
See the how-to articles to Configure Workload Importance and to Manage and monitor Workload Management.
Secure a database in SQL Data Warehouse
3/18/2019 • 4 minutes to read • Edit Online

This article walks through the basics of securing your Azure SQL Data Warehouse database. In particular, this
article gets you started with resources for limiting access, protecting data, and monitoring activities on a
database.

Connection security
Connection Security refers to how you restrict and secure connections to your database using firewall rules and
connection encryption.
Firewall rules are used by both the server and the database to reject connection attempts from IP addresses that
have not been explicitly whitelisted. To allow connections from your application or client machine's public IP
address, you must first create a server-level firewall rule using the Azure portal, REST API, or PowerShell. As a
best practice, you should restrict the IP address ranges allowed through your server firewall as much as possible.
To access Azure SQL Data Warehouse from your local computer, ensure the firewall on your network and local
computer allows outgoing communication on TCP port 1433.
SQL Data Warehouse uses server-level firewall rules. It does not support database-level firewall rules. For more
information, see Azure SQL Database firewall, sp_set_firewall_rule.
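
If you prefer scripting, a server-level firewall rule can also be created with the sp_set_firewall_rule stored procedure while connected to the master database as the server admin. The following is a minimal sketch; the rule name and IP addresses are illustrative.

-- Run in the master database; the rule name and IP addresses are illustrative.
EXECUTE sp_set_firewall_rule
    @name = N'AllowMyClientIP',
    @start_ip_address = '203.0.113.5',
    @end_ip_address = '203.0.113.5';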
Connections to your SQL Data Warehouse are encrypted by default. Connection settings that attempt to disable encryption are ignored.

Authentication
Authentication refers to how you prove your identity when connecting to the database. SQL Data Warehouse
currently supports SQL Server Authentication with a username and password, and with Azure Active Directory.
When you created the logical server for your database, you specified a "server admin" login with a username and
password. Using these credentials, you can authenticate to any database on that server as the database owner, or
"dbo" through SQL Server Authentication.
However, as a best practice, your organization’s users should use a different account to authenticate. This way you
can limit the permissions granted to the application and reduce the risks of malicious activity in case your
application code is vulnerable to a SQL injection attack.
To create a SQL Server Authenticated user, connect to the master database on your server with your server
admin login and create a new server login. Additionally, it is a good idea to create a user in the master database
for Azure SQL Data Warehouse users. Creating a user in master allows a user to log in using tools like SSMS
without specifying a database name. It also allows them to use the object explorer to view all databases on a SQL
server.

-- Connect to master database and create a login
CREATE LOGIN ApplicationLogin WITH PASSWORD = 'Str0ng_password';
CREATE USER ApplicationUser FOR LOGIN ApplicationLogin;

Then, connect to your SQL Data Warehouse database with your server admin login and create a database user
based on the server login you created.
-- Connect to SQL DW database and create a database user
CREATE USER ApplicationUser FOR LOGIN ApplicationLogin;

To give a user permission to perform additional operations such as creating logins or creating new databases,
assign the user to the Loginmanager and dbmanager roles in the master database. For more information on these
additional roles and authenticating to a SQL Database, see Managing databases and logins in Azure SQL
Database. For more information, see Connecting to SQL Data Warehouse By Using Azure Active Directory
Authentication.
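
For example, to let ApplicationUser manage logins and create databases, you can add it to these roles while connected to the master database as the server admin. This is a minimal sketch based on the roles named above.

-- Run in the master database as the server admin.
EXEC sp_addrolemember 'loginmanager', 'ApplicationUser'; -- can create and manage logins
EXEC sp_addrolemember 'dbmanager', 'ApplicationUser';    -- can create databases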

Authorization
Authorization refers to what you can do within an Azure SQL Data Warehouse database. Authorization privileges
are determined by role memberships and permissions. As a best practice, you should grant users the least
privileges necessary. To manage roles, you can use the following stored procedures:

EXEC sp_addrolemember 'db_datareader', 'ApplicationUser'; -- allows ApplicationUser to read data
EXEC sp_addrolemember 'db_datawriter', 'ApplicationUser'; -- allows ApplicationUser to write data

The server admin account you are connecting with is a member of db_owner, which has authority to do anything
within the database. Save this account for deploying schema upgrades and other management operations. Use
the "ApplicationUser" account with more limited permissions to connect from your application to the database
with the least privileges needed by your application.
There are ways to further limit what a user can do with Azure SQL Data Warehouse:
Granular Permissions let you control which operations you can do on individual columns, tables, views,
schemas, procedures, and other objects in the database. Use granular permissions to have the most control
and grant the minimum permissions necessary.
Database roles other than db_datareader and db_datawriter can be used to create more powerful application
user accounts or less powerful management accounts. The built-in fixed database roles provide an easy way to
grant permissions, but can result in granting more permissions than are necessary.
Stored procedures can be used to limit the actions that can be taken on the database.
The following example grants read access to a user-defined schema.

--CREATE SCHEMA Test
GRANT SELECT ON SCHEMA::Test TO ApplicationUser;

Managing databases and logical servers from the Azure portal or using the Azure Resource Manager API is
controlled by your portal user account's role assignments. For more information, see Role-based access control in
Azure portal.

Encryption
Azure SQL Data Warehouse Transparent Data Encryption (TDE ) helps protect against the threat of malicious
activity by encrypting and decrypting your data at rest. When you encrypt your database, associated backups and
transaction log files are encrypted without requiring any changes to your applications. TDE encrypts the storage
of an entire database by using a symmetric key called the database encryption key.
In SQL Database, the database encryption key is protected by a built-in server certificate. The built-in server
certificate is unique for each SQL Database server. Microsoft automatically rotates these certificates at least every
90 days. The encryption algorithm used by SQL Data Warehouse is AES-256. For a general description of TDE,
see Transparent Data Encryption.
You can encrypt your database using the Azure portal or T-SQL.
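
For example, while connected to the master database on your server, the following T-SQL turns encryption on and checks the result; this is a minimal sketch, assuming a data warehouse named mySampleDataWarehouse.

-- Turn TDE on for the data warehouse (a minimal sketch).
ALTER DATABASE [mySampleDataWarehouse] SET ENCRYPTION ON;

-- Check the encryption state (is_encrypted = 1 when TDE is enabled).
SELECT [name], is_encrypted
FROM sys.databases
WHERE [name] = 'mySampleDataWarehouse';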

Next steps
For details and examples on connecting to your SQL Data Warehouse with different protocols, see Connect to
SQL Data Warehouse.
Advanced data security for Azure SQL Database
10/27/2019 • 3 minutes to read • Edit Online

Advanced data security is a unified package for advanced SQL security capabilities. It includes functionality for
discovering and classifying sensitive data, surfacing and mitigating potential database vulnerabilities, and
detecting anomalous activities that could indicate a threat to your database. It provides a single go-to location for
enabling and managing these capabilities.

Overview
Advanced data security (ADS ) provides a set of advanced SQL security capabilities, including data discovery &
classification, vulnerability assessment, and Advanced Threat Protection.
Data discovery & classification provides capabilities built into Azure SQL Database for discovering, classifying,
labeling & protecting the sensitive data in your databases. It can be used to provide visibility into your database
classification state, and to track the access to sensitive data within the database and beyond its borders.
Vulnerability assessment is an easy to configure service that can discover, track, and help you remediate
potential database vulnerabilities. It provides visibility into your security state, and includes actionable steps to
resolve security issues, and enhance your database fortifications.
Advanced Threat Protection detects anomalous activities indicating unusual and potentially harmful attempts
to access or exploit your database. It continuously monitors your database for suspicious activities, and
provides immediate security alerts on potential vulnerabilities, SQL injection attacks, and anomalous database
access patterns. Advanced Threat Protection alerts provide details of the suspicious activity and recommend
action on how to investigate and mitigate the threat.
Enable SQL ADS once to enable all of these included features. With one click, you can enable ADS for all
databases on your SQL Database server or managed instance. Enabling or managing ADS settings requires
belonging to the SQL security manager role, SQL database admin role or SQL server admin role.
ADS pricing aligns with Azure Security Center standard tier, where each protected SQL Database server or
managed instance is counted as one node. Newly protected resources qualify for a free trial of Security Center
standard tier. For more information, see the Azure Security Center pricing page.

Getting Started with ADS


The following steps get you started with ADS.

1. Enable ADS
Enable ADS by navigating to Advanced Data Security under the Security heading for your SQL Database
server or managed instance. To enable ADS for all databases on the database server or managed instance, click
Enable Advanced Data Security on the server.

NOTE
A storage account is automatically created and configured to store your Vulnerability Assessment scan results. If you've
already enabled ADS for another server in the same resource group and region, then the existing storage account is used.
NOTE
The cost of ADS is aligned with Azure Security Center standard tier pricing per node, where a node is the entire SQL
Database server or managed instance. You are thus paying only once for protecting all databases on the database server or
managed instance with ADS. You can try ADS out initially with a free trial.

2. Start classifying data, tracking vulnerabilities, and investigating threat


alerts
Click the Data Discovery & Classification card to see recommended sensitive columns to classify and to
classify your data with persistent sensitivity labels. Click the Vulnerability Assessment card to view and manage
vulnerability scans and reports, and to track your security stature. If security alerts have been received, click the
Advanced Threat Protection card to view details of the alerts and to see a consolidated report on all alerts in
your Azure subscription via the Azure Security Center security alerts page.

3. Manage ADS settings on your SQL Database server or managed


instance
To view and manage ADS settings, navigate to Advanced Data Security under the Security heading for your
SQL Database server or managed instance. On this page, you can enable or disable ADS, and modify vulnerability
assessment and Advanced Threat Protection settings for your entire SQL Database server or managed instance.
4. Manage ADS settings for a SQL database
To override ADS settings for a particular database, check the Enable Advanced Data Security at the database
level checkbox. Use this option only if you have a particular requirement to receive separate Advanced Threat
Protection alerts or vulnerability assessment results for the individual database, in place of or in addition to the
alerts and results received for all databases on the database server or managed instance.
Once the checkbox is selected, you can then configure the relevant settings for this database.
Advanced data security settings for your database server or managed instance can also be reached from the ADS
database pane. Click Settings in the main ADS pane, and then click View Advanced Data Security server
settings.

Next steps
Learn more about data discovery & classification
Learn more about vulnerability assessment
Learn more about Advanced Threat Protection
Learn more about Azure security center
Azure SQL Database and SQL Data Warehouse data
discovery & classification
9/16/2019 • 6 minutes to read • Edit Online

Data discovery & classification provides advanced capabilities built into Azure SQL Database for discovering,
classifying, labeling & protecting the sensitive data in your databases.
Discovering and classifying your most sensitive data (business, financial, healthcare, personally identifiable data
(PII), and so on.) can play a pivotal role in your organizational information protection stature. It can serve as
infrastructure for:
Helping meet data privacy standards and regulatory compliance requirements.
Various security scenarios, such as monitoring (auditing) and alerting on anomalous access to sensitive data.
Controlling access to and hardening the security of databases containing highly sensitive data.
Data discovery & classification is part of the Advanced Data Security (ADS) offering, which is a unified package for advanced SQL security capabilities. Data discovery & classification can be accessed and managed via the central SQL ADS portal.

NOTE
This document relates to Azure SQL Database and Azure SQL Data Warehouse. For simplicity, SQL Database is used when
referring to both SQL Database and SQL Data Warehouse. For SQL Server (on premises), see SQL Data Discovery and
Classification.

What is data discovery & classification


Data discovery & classification introduces a set of advanced services and new SQL capabilities, forming a new
SQL Information Protection paradigm aimed at protecting the data, not just the database:
Discovery & recommendations
The classification engine scans your database and identifies columns containing potentially sensitive data. It
then provides you an easy way to review and apply the appropriate classification recommendations via the
Azure portal.
Labeling
Sensitivity classification labels can be persistently tagged on columns using new classification metadata
attributes introduced into the SQL Engine. This metadata can then be utilized for advanced sensitivity-
based auditing and protection scenarios.
Query result set sensitivity
The sensitivity of query result set is calculated in real time for auditing purposes.
Visibility
The database classification state can be viewed in a detailed dashboard in the portal. Additionally, you can
download a report (in Excel format) to be used for compliance & auditing purposes, as well as other needs.

Discover, classify & label sensitive columns


The following section describes the steps for discovering, classifying, and labeling columns containing sensitive
data in your database, as well as viewing the current classification state of your database and exporting reports.
The classification includes two metadata attributes:
Labels – The main classification attributes, used to define the sensitivity level of the data stored in the column.
Information Types – Provide additional granularity into the type of data stored in the column.

Define and customize your classification taxonomy


SQL data discovery & classification comes with a built-in set of sensitivity labels and a built-in set of information
types and discovery logic. You now have the ability to customize this taxonomy and define a set and ranking of
classification constructs specifically for your environment.
Definition and customization of your classification taxonomy is done in one central place for your entire Azure
tenant. That location is in Azure Security Center, as part of your Security Policy. Only someone with administrative
rights on the Tenant root management group can perform this task.
As part of the Information Protection policy management, you can define custom labels, rank them, and associate
them with a selected set of information types. You can also add your own custom information types and configure
them with string patterns, which are added to the discovery logic for identifying this type of data in your
databases. Learn more about customizing and managing your policy in the Information Protection policy how -to
guide.
Once the tenant-wide policy has been defined, you can continue with the classification of individual databases
using your customized policy.

Classify your SQL Database


1. Go to the Azure portal.
2. Navigate to Advanced Data Security under the Security heading in your Azure SQL Database pane. Click
to enable advanced data security, and then click on the Data discovery & classification card.

3. The Overview tab includes a summary of the current classification state of the database, including a
detailed list of all classified columns, which you can also filter to view only specific schema parts,
information types and labels. If you haven’t yet classified any columns, skip to step 5.
4. To download a report in Excel format, click on the Export option in the top menu of the window.

5. To begin classifying your data, click on the Classification tab at the top of the window.

6. The classification engine scans your database for columns containing potentially sensitive data and
provides a list of recommended column classifications. To view and apply classification
recommendations:
To view the list of recommended column classifications, click on the recommendations panel at the
bottom of the window:

Review the list of recommendations – to accept a recommendation for a specific column, check the
checkbox in the left column of the relevant row. You can also mark all recommendations as accepted
by checking the checkbox in the recommendations table header.
To apply the selected recommendations, click on the blue Accept selected recommendations
button.

7. You can also manually classify columns as an alternative, or in addition, to the recommendation-based
classification:
Click on Add classification in the top menu of the window.

In the context window that opens, select the schema > table > column that you want to classify, and
the information type and sensitivity label. Then click on the blue Add classification button at the
bottom of the context window.
8. To complete your classification and persistently label (tag) the database columns with the new classification
metadata, click on Save in the top menu of the window.

Auditing access to sensitive data


An important aspect of the information protection paradigm is the ability to monitor access to sensitive data.
Azure SQL Database Auditing has been enhanced to include a new field in the audit log called
data_sensitivity_information, which logs the sensitivity classifications (labels) of the actual data that was returned
by the query.
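
For example, if auditing is configured to write to an Azure storage account, you can look for audit records that returned classified data by filtering on this field. The following is a minimal sketch; the storage URL is illustrative, and it assumes the data_sensitivity_information field is present in your audit log format.

-- A minimal sketch; the storage account URL is illustrative.
SELECT event_time, server_principal_name, statement, data_sensitivity_information
FROM sys.fn_get_audit_file(
    'https://<storageaccount>.blob.core.windows.net/sqldbauditlogs/',
    default, default)
WHERE data_sensitivity_information IS NOT NULL;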

Manage data classification using T-SQL


You can use T-SQL to add/remove column classifications, as well as retrieve all classifications for the entire
database.

NOTE
When using T-SQL to manage labels, there is no validation that labels added to a column exist in the organizational
information protection policy (the set of labels that appear in the portal recommendations). It is therefore up to you to
validate this.

Add/update the classification of one or more columns: ADD SENSITIVITY CLASSIFICATION


Remove the classification from one or more columns: DROP SENSITIVITY CLASSIFICATION
View all classifications on the database: sys.sensitivity_classifications
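
For example, the following T-SQL adds a classification to a column, removes it again, and lists the classifications defined in the database. This is a minimal sketch; the schema, table, column, label, and information type names are illustrative.

-- Add or update a classification (names are illustrative).
ADD SENSITIVITY CLASSIFICATION TO dbo.DimCustomer.Phone
WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Contact Info');

-- Remove the classification.
DROP SENSITIVITY CLASSIFICATION FROM dbo.DimCustomer.Phone;

-- View all classifications on the database.
SELECT * FROM sys.sensitivity_classifications;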
Manage classifications using REST APIs
You can also use REST APIs to programmatically manage classifications. The published REST APIs support the
following operations:
Create Or Update - Creates or updates the sensitivity label of a given column
Delete - Deletes the sensitivity label of a given column
Disable Recommendation - Disables sensitivity recommendations on a given column
Enable Recommendation - Enables sensitivity recommendations on a given column (recommendations are
enabled by default on all columns)
Get - Gets the sensitivity label of a given column
List Current By Database - Gets the current sensitivity labels of a given database
List Recommended By Database - Gets the recommended sensitivity labels of a given database

Manage data discovery and classification using Azure PowerShell


You can use PowerShell to get all the recommended columns in an Azure SQL database and a managed instance.
PowerShell Cmdlets for Azure SQL database
Get-AzSqlDatabaseSensitivityClassification
Set-AzSqlDatabaseSensitivityClassification
Remove-AzSqlDatabaseSensitivityClassification
Get-AzSqlDatabaseSensitivityRecommendation
PowerShell Cmdlets for managed instance
Get-AzSqlInstanceDatabaseSensitivityClassification
Set-AzSqlInstanceDatabaseSensitivityClassification
Remove-AzSqlInstanceDatabaseSensitivityClassification
Get-AzSqlInstanceDatabaseSensitivityRecommendation

Permissions
The following built-in roles can read the data classification of an Azure SQL database: Owner, Reader, Contributor, SQL Security Manager, and User Access Administrator.

The following built-in roles can modify the data classification of an Azure SQL database: Owner, Contributor, and SQL Security Manager.

Learn more about RBAC for Azure resources

Next steps
Learn more about advanced data security.
Consider configuring Azure SQL Database Auditing for monitoring and auditing access to your classified
sensitive data.
SQL Vulnerability Assessment service helps you
identify database vulnerabilities
7/26/2019 • 5 minutes to read • Edit Online

SQL Vulnerability Assessment is an easy to configure service that can discover, track, and help you remediate
potential database vulnerabilities. Use it to proactively improve your database security.
Vulnerability Assessment is part of the advanced data security (ADS ) offering, which is a unified package for
advanced SQL security capabilities. Vulnerability Assessment can be accessed and managed via the central SQL
ADS portal.

NOTE
Vulnerability Assessment is supported for Azure SQL Database, Azure SQL Managed Instance and Azure SQL Data
Warehouse. For simplicity, SQL Database is used in this article when referring to any of these managed database services.

The Vulnerability Assessment service


SQL Vulnerability Assessment (VA) is a service that provides visibility into your security state, and includes
actionable steps to resolve security issues, and enhance your database security. It can help you:
Meet compliance requirements that require database scan reports.
Meet data privacy standards.
Monitor a dynamic database environment where changes are difficult to track.
Vulnerability Assessment is a scanning service built into the Azure SQL Database service. The service employs a
knowledge base of rules that flag security vulnerabilities and highlight deviations from best practices, such as
misconfigurations, excessive permissions, and unprotected sensitive data. The rules are based on Microsoft’s best
practices and focus on the security issues that present the biggest risks to your database and its valuable data.
They cover both database-level issues as well as server-level security issues, like server firewall settings and
server-level permissions. These rules also represent many of the requirements from various regulatory bodies to
meet their compliance standards.
Results of the scan include actionable steps to resolve each issue and provide customized remediation scripts
where applicable. An assessment report can be customized for your environment by setting an acceptable baseline
for permission configurations, feature configurations, and database settings.

Implementing Vulnerability Assessment


The following steps implement VA on SQL Database.
1. Run a scan
Get started with VA by navigating to Advanced Data Security under the Security heading in your Azure SQL
Database pane. Click to enable advanced data security, and then click on Select Storage or on the Vulnerability
Assessment card, which automatically opens the Vulnerability Assessment settings card for the entire SQL server.
Start by configuring a storage account where your scan results for all databases on the server will be stored. For
information about storage accounts, see About Azure storage accounts. Once storage is configured, click Scan to
scan your database for vulnerabilities.
NOTE
The scan is lightweight and safe. It takes a few seconds to run, and is entirely read-only. It does not make any changes to
your database.

2. View the report


When your scan is complete, your scan report is automatically displayed in the Azure portal. The report presents
an overview of your security state: how many issues were found and their respective severities. Results include
warnings on deviations from best practices and a snapshot of your security-related settings, such as database
principals and roles and their associated permissions. The scan report also provides a map of sensitive data
discovered in your database, and includes recommendations to classify that data using data discovery &
classification.
3. Analyze the results and resolve issues
Review your results and determine the findings in the report that are true security issues in your environment.
Drill down to each failed result to understand the impact of the finding and why each security check failed. Use the
actionable remediation information provided by the report to resolve the issue.

4. Set your baseline


As you review your assessment results, you can mark specific results as being an acceptable Baseline in your
environment. The baseline is essentially a customization of how the results are reported. Results that match the
baseline are considered as passing in subsequent scans. Once you have established your baseline security state,
VA only reports on deviations from the baseline and you can focus your attention on the relevant issues.
5. Run a new scan to see your customized tracking report
After you complete setting up your Rule Baselines, run a new scan to view the customized report. VA now
reports only the security issues that deviate from your approved baseline state.

Vulnerability Assessment can now be used to monitor that your database maintains a high level of security at all
times, and that your organizational policies are met. If compliance reports are required, VA reports can be helpful
to facilitate the compliance process.
6. Set up periodic recurring scans
Navigate to the Vulnerability Assessment settings to turn on Periodic recurring scans. This configures
Vulnerability Assessment to automatically run a scan on your database once per week. A scan result summary will
be sent to the email address(es) you provide.
7. Export an assessment report
Click Export Scan Results to create a downloadable Excel report of your scan result. This report contains a
summary tab that displays a summary of the assessment, including all failed checks. It also includes a Results tab
containing the full set of results from the scan, including all checks that were run and the result details for each.
8. View scan history
Click Scan History in the VA pane to view a history of all scans previously run on this database. Select a
particular scan in the list to view the detailed results of that scan.

Manage Vulnerability Assessments using Azure PowerShell


NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

IMPORTANT
The PowerShell Azure Resource Manager module is still supported by Azure SQL Database, but all future development is for
the Az.Sql module. For these cmdlets, see AzureRM.Sql. The arguments for the commands in the Az module and in the
AzureRm modules are substantially identical.

You can use Azure PowerShell cmdlets to programmatically manage your vulnerability assessments. The
supported cmdlets are:
Update-AzSqlDatabaseVulnerabilityAssessmentSetting
Updates the vulnerability assessment settings of a database
Get-AzSqlDatabaseVulnerabilityAssessmentSetting
Returns the vulnerability assessment settings of a database
Clear-AzSqlDatabaseVulnerabilityAssessmentSetting
Clears the vulnerability assessment settings of a database
Set-AzSqlDatabaseVulnerabilityAssessmentRuleBaseline
Sets the vulnerability assessment rule baseline.
Get-AzSqlDatabaseVulnerabilityAssessmentRuleBaseline
Gets the vulnerability assessment rule baseline for a given rule.
Clear-AzSqlDatabaseVulnerabilityAssessmentRuleBaseline
Clears the vulnerability assessment rule baseline. First set the baseline before using this cmdlet to clear it.
Start-AzSqlDatabaseVulnerabilityAssessmentScan
Triggers the start of a vulnerability assessment scan
Get-AzSqlDatabaseVulnerabilityAssessmentScanRecord
Gets all vulnerability assessment scan record(s) associated with a given database.
Convert-AzSqlDatabaseVulnerabilityAssessmentScan
Converts vulnerability assessment scan results to an Excel file
For a script example, see Azure SQL Vulnerability Assessment PowerShell support.

Next steps
Learn more about advanced data security
Learn more about data discovery & classification
Advanced Threat Protection for Azure SQL Database
7/26/2019 • 4 minutes to read • Edit Online

Advanced Threat Protection for Azure SQL Database and SQL Data Warehouse detects anomalous activities
indicating unusual and potentially harmful attempts to access or exploit databases.
Advanced Threat Protection is part of the Advanced data security (ADS ) offering, which is a unified package for
advanced SQL security capabilities. Advanced Threat Protection can be accessed and managed via the central SQL
ADS portal.

NOTE
This topic applies to Azure SQL server, and to both SQL Database and SQL Data Warehouse databases that are created on
the Azure SQL server. For simplicity, SQL Database is used when referring to both SQL Database and SQL Data Warehouse.

What is Advanced Threat Protection


Advanced Threat Protection provides a new layer of security, which enables customers to detect and respond to
potential threats as they occur by providing security alerts on anomalous activities. Users receive an alert upon
suspicious database activities, potential vulnerabilities, and SQL injection attacks, as well as anomalous database
access and queries patterns. Advanced Threat Protection integrates alerts with Azure Security Center, which
include details of suspicious activity and recommend action on how to investigate and mitigate the threat.
Advanced Threat Protection makes it simple to address potential threats to the database without the need to be a
security expert or manage advanced security monitoring systems.
For a full investigation experience, it is recommended to enable SQL Database Auditing, which writes database
events to an audit log in your Azure storage account.

Advanced Threat Protection alerts


Advanced Threat Protection for Azure SQL Database detects anomalous activities indicating unusual and
potentially harmful attempts to access or exploit databases and it can trigger the following alerts:
Vulnerability to SQL injection: This alert is triggered when an application generates a faulty SQL
statement in the database. This alert may indicate a possible vulnerability to SQL injection attacks. There are
two possible reasons for the generation of a faulty statement:
A defect in application code that constructs the faulty SQL statement
Application code or stored procedures don't sanitize user input when constructing the faulty SQL
statement, which may be exploited for SQL Injection
Potential SQL injection: This alert is triggered when an active exploit happens against an identified
application vulnerability to SQL injection. This means the attacker is trying to inject malicious SQL
statements using the vulnerable application code or stored procedures.
Access from unusual location: This alert is triggered when there is a change in the access pattern to SQL
server, where someone has logged on to the SQL server from an unusual geographical location. In some
cases, the alert detects a legitimate action (a new application or developer maintenance). In other cases, the
alert detects a malicious action (former employee, external attacker).
Access from unusual Azure data center: This alert is triggered when there is a change in the access
pattern to SQL server, where someone has logged on to the SQL server from an unusual Azure data center
that was seen on this server during the recent period. In some cases, the alert detects a legitimate action
(your new application in Azure, Power BI, Azure SQL Query Editor). In other cases, the alert detects a
malicious action from an Azure resource/service (former employee, external attacker).
Access from unfamiliar principal: This alert is triggered when there is a change in the access pattern to
SQL server, where someone has logged on to the SQL server using an unusual principal (SQL user). In
some cases, the alert detects a legitimate action (new application, developer maintenance). In other cases,
the alert detects a malicious action (former employee, external attacker).
Access from a potentially harmful application: This alert is triggered when a potentially harmful
application is used to access the database. In some cases, the alert detects penetration testing in action. In
other cases, the alert detects an attack using common attack tools.
Brute force SQL credentials: This alert is triggered when there is an abnormally high number of failed
logins with different credentials. In some cases, the alert detects penetration testing in action. In other cases,
the alert detects brute force attack.

Explore anomalous database activities upon detection of a suspicious


event
You receive an email notification upon detection of anomalous database activities. The email provides information
on the suspicious security event including the nature of the anomalous activities, database name, server name,
application name, and the event time. In addition, the email provides information on possible causes and
recommended actions to investigate and mitigate the potential threat to the database.
1. Click the View recent SQL alerts link in the email to launch the Azure portal and show the Azure Security
Center alerts page, which provides an overview of active threats detected on the SQL database.

2. Click a specific alert to get additional details and actions for investigating this threat and remediating future
threats.
For example, SQL injection is one of the most common Web application security issues on the Internet that
is used to attack data-driven applications. Attackers take advantage of application vulnerabilities to inject
malicious SQL statements into application entry fields, breaching or modifying data in the database. For
SQL Injection alerts, the alert’s details include the vulnerable SQL statement that was exploited.

Explore Advanced Threat Protection alerts for your database in the


Azure portal
Advanced Threat Protection integrates its alerts with Azure security center. Live SQL Advanced Threat Protection
tiles within the database and SQL ADS blades in the Azure portal track the status of active threats.
Click Advanced Threat Protection alert to launch the Azure Security Center alerts page and get an overview of
active SQL threats detected on the database or data warehouse.
Next steps
Learn more about Advanced Threat Protection in single and pooled databases.
Learn more about Advanced Threat Protection in managed instance.
Learn more about Advanced data security.
Learn more about Azure SQL Database auditing
Learn more about Azure security center
For more information on pricing, see the SQL Database pricing page
Get started with SQL database auditing
10/28/2019 • 12 minutes to read • Edit Online

Auditing for Azure SQL Database and SQL Data Warehouse tracks database events and writes them to an audit
log in your Azure storage account, Log Analytics workspace or Event Hubs. Auditing also:
Helps you maintain regulatory compliance, understand database activity, and gain insight into
discrepancies and anomalies that could indicate business concerns or suspected security violations.
Enables and facilitates adherence to compliance standards, although it doesn't guarantee compliance. For
more information about Azure programs that support standards compliance, see the Azure Trust Center
where you can find the most current list of SQL Database compliance certifications.

NOTE
This topic applies to Azure SQL server, and to both SQL Database and SQL Data Warehouse databases that are created on
the Azure SQL server. For simplicity, SQL Database is used when referring to both SQL Database and SQL Data Warehouse.

NOTE
This article was recently updated to use the term Azure Monitor logs instead of Log Analytics. Log data is still stored in a
Log Analytics workspace and is still collected and analyzed by the same Log Analytics service. We are updating the
terminology to better reflect the role of logs in Azure Monitor. See Azure Monitor terminology changes for details.

Azure SQL database auditing overview


You can use SQL database auditing to:
Retain an audit trail of selected events. You can define categories of database actions to be audited.
Report on database activity. You can use pre-configured reports and a dashboard to get started quickly with
activity and event reporting.
Analyze reports. You can find suspicious events, unusual activity, and trends.

IMPORTANT
Audit logs are written to Append Blobs in Azure Blob storage on your Azure subscription.
All storage kinds (v1, v2, blob) are supported.
All storage replication configurations are supported.
Premium storage is currently not supported.
Storage in VNet is currently not supported.
Storage behind a Firewall is currently not supported

Define server-level vs. database-level auditing policy


An auditing policy can be defined for a specific database or as a default server policy:
A server policy applies to all existing and newly created databases on the server.
If server blob auditing is enabled, it always applies to the database. The database will be audited,
regardless of the database auditing settings.
Enabling blob auditing on the database or data warehouse, in addition to enabling it on the server, does not
override or change any of the settings of the server blob auditing. Both audits will exist side by side. In
other words, the database is audited twice in parallel; once by the server policy and once by the database
policy.

NOTE
You should avoid enabling both server blob auditing and database blob auditing together, unless:
You want to use a different storage account or retention period for a specific database.
You want to audit event types or categories for a specific database that differ from the rest of the databases on
the server. For example, you might have table inserts that need to be audited only for a specific database.
Otherwise, we recommended that you enable only server-level blob auditing and leave the database-level auditing
disabled for all databases.

Set up auditing for your database


The following section describes the configuration of auditing using the Azure portal.
1. Go to the Azure portal.
2. Navigate to Auditing under the Security heading in your SQL database/server pane.

3. If you prefer to set up a server auditing policy, you can select the View server settings link on the
database auditing page. You can then view or modify the server auditing settings. Server auditing policies
apply to all existing and newly created databases on this server.
4. If you prefer to enable auditing on the database level, switch Auditing to ON.
If server auditing is enabled, the database-configured audit will exist side-by-side with the server audit.

5. New - You now have multiple options for configuring where audit logs will be written. You can write logs
to an Azure storage account, to a Log Analytics workspace for consumption by Azure Monitor logs, or to
event hub for consumption using event hub. You can configure any combination of these options, and audit
logs will be written to each.

WARNING
Enabling auditing to Log Analytics will incur cost based on ingestion rates. Please be aware of the associated cost
with using this option, or consider storing the audit logs in an Azure storage account.
6. To configure writing audit logs to a storage account, select Storage and open Storage details. Select the
Azure storage account where logs will be saved, and then select the retention period. Logs older than the retention period are deleted. Then click OK.

IMPORTANT
The default value for retention period is 0 (unlimited retention). You can change this value by moving the
Retention (Days) slider in Storage settings when configuring the storage account for auditing.
If you change retention period from 0 (unlimited retention) to any other value, please note that retention will
only apply to logs written after retention value was changed (logs written during the period when retention was
set to unlimited are preserved, even after retention is enabled)
7. To configure writing audit logs to a Log Analytics workspace, select Log Analytics (Preview) and open
Log Analytics details. Select or create the Log Analytics workspace where logs will be written and then
click OK.

8. To configure writing audit logs to an event hub, select Event Hub (Preview) and open Event Hub
details. Select the event hub where logs will be written and then click OK. Be sure that the event hub is in
the same region as your database and server.
9. Click Save.
10. If you want to customize the audited events, you can do this via PowerShell cmdlets or the REST API.
11. After you've configured your auditing settings, you can turn on the new threat detection feature and
configure emails to receive security alerts. When you use threat detection, you receive proactive alerts on
anomalous database activities that can indicate potential security threats. For more information, see
Getting started with threat detection.

IMPORTANT
Enabling auditing on a paused Azure SQL Data Warehouse is not possible. To enable auditing, resume the data warehouse.

WARNING
Enabling auditing on a server that has an Azure SQL Data Warehouse on it will result in the data warehouse being resumed and then re-paused, which may incur billing charges.

Analyze audit logs and reports


If you chose to write audit logs to Azure Monitor logs:
Use the Azure portal. Open the relevant database. At the top of the database's Auditing page, click View
audit logs.
Then, you have two ways to view the logs:
Clicking on Log Analytics at the top of the Audit records page will open the Logs view in Log Analytics
workspace, where you can customize the time range and the search query.

Clicking View dashboard at the top of the Audit records page will open a dashboard displaying audit
logs info, where you can drill down into Security Insights, Access to Sensitive Data and more. This
dashboard is designed to help you gain security insights for your data. You can also customize the time
range and search query.
Alternatively, you can also access the audit logs from Log Analytics blade. Open your Log Analytics
workspace and under General section, click Logs. You can start with a simple query, such as: search
"SQLSecurityAuditEvents" to view the audit logs. From here, you can also use Azure Monitor logs to run
advanced searches on your audit log data. Azure Monitor logs gives you real-time operational insights
using integrated search and custom dashboards to readily analyze millions of records across all your
workloads and servers. For additional useful information about Azure Monitor logs search language and
commands, see Azure Monitor logs search reference.
If you chose to write audit logs to Event Hub:
To consume audit logs data from Event Hub, you will need to set up a stream to consume events and write
them to a target. For more information, see Azure Event Hubs Documentation.
Audit logs in Event Hub are captured in the body of Apache Avro events and stored using JSON formatting
with UTF -8 encoding. To read the audit logs, you can use Avro Tools or similar tools that process this format.
If you chose to write audit logs to an Azure storage account, there are several methods you can use to view the
logs:
Audit logs are aggregated in the account you chose during setup. You can explore audit logs by using a tool
such as Azure Storage Explorer. In Azure storage, auditing logs are saved as a collection of blob files within
a container named sqldbauditlogs. For further details about the hierarchy of the storage folder, naming
conventions, and log format, see the SQL Database Audit Log Format.
Use the Azure portal. Open the relevant database. At the top of the database's Auditing page, click View
audit logs.
Audit records opens, from which you'll be able to view the logs.
You can view specific dates by clicking Filter at the top of the Audit records page.
You can switch between audit records that were created by the server audit policy and the database
audit policy by toggling Audit Source.
You can view only SQL injection related audit records by checking Show only audit records for
SQL injections checkbox.

Use the system function sys.fn_get_audit_file (T-SQL ) to return the audit log data in tabular format. For
more information on using this function, see sys.fn_get_audit_file.
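
For example, the following is a minimal sketch that calls sys.fn_get_audit_file through the SqlServer PowerShell module (Invoke-Sqlcmd). The server, database, credentials, and blob path are placeholders; the path must point at your own sqldbauditlogs container.

# Requires the SqlServer module (Install-Module SqlServer)
$query = "SELECT TOP 100 event_time, action_id, succeeded, statement
          FROM sys.fn_get_audit_file('https://<mystorageaccount>.blob.core.windows.net/sqldbauditlogs/<myserver>/<mydatabase>/', DEFAULT, DEFAULT);"

Invoke-Sqlcmd -ServerInstance "<myserver>.database.windows.net" -Database "<mydatabase>" `
    -Username "<adminlogin>" -Password "<password>" -Query $query
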
Use Merge Audit Files in SQL Server Management Studio (starting with SSMS 17):
1. From the SSMS menu, select File > Open > Merge Audit Files.

2. The Add Audit Files dialog box opens. Select one of the Add options to choose whether to merge
audit files from a local disk or import them from Azure Storage. You are required to provide your
Azure Storage details and account key.
3. After all files to merge have been added, click OK to complete the merge operation.
4. The merged file opens in SSMS, where you can view and analyze it, as well as export it to an XEL or
CSV file, or to a table.
Use Power BI. You can view and analyze audit log data in Power BI. For more information and to access a
downloadable template, see Analyze audit log data in Power BI.
Download log files from your Azure Storage blob container via the portal or by using a tool such as Azure
Storage Explorer.
After you have downloaded a log file locally, double-click the file to open, view, and analyze the logs in
SSMS.
You can also download multiple files simultaneously via Azure Storage Explorer. To do so, right-click a
specific subfolder and select Save as to save in a local folder.
Additional methods:
After downloading several files or a subfolder that contains log files, you can merge them locally as
described in the SSMS Merge Audit Files instructions described previously.
View blob auditing logs programmatically:
Query Extended Events Files by using PowerShell.
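
As a rough sketch, assuming the SqlServer PowerShell module (which provides the Read-SqlXEvent cmdlet) is installed and the .xel files have already been downloaded to a local folder, you might read the downloaded files like this; the path is a placeholder and the event property names may differ slightly by module version.

# Requires the SqlServer module (Install-Module SqlServer)
Get-ChildItem "C:\AuditLogs\*.xel" | ForEach-Object {
    # Each file yields a stream of extended events read from the audit .xel file
    Read-SqlXEvent -FileName $_.FullName
} | Select-Object Name, Timestamp -First 50
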

Production practices

Auditing geo-replicated databases
With geo-replicated databases, when you enable auditing on the primary database, the secondary database will
have an identical auditing policy. It is also possible to set up auditing on the secondary database by enabling
auditing on the secondary server, independently from the primary database.
Server-level (recommended): Turn on auditing on both the primary server as well as the secondary server
- the primary and secondary databases will each be audited independently based on their respective server-
level policy.
Database-level: Database-level auditing for secondary databases can only be configured from Primary
database auditing settings.
Auditing must be enabled on the primary database itself, not the server.
After auditing is enabled on the primary database, it will also become enabled on the secondary
database.

IMPORTANT
With database-level auditing, the storage settings for the secondary database will be identical to those of
the primary database, causing cross-regional traffic. We recommend that you enable only server-level
auditing, and leave the database-level auditing disabled for all databases.

Storage key regeneration
In production, you are likely to refresh your storage keys periodically. When writing audit logs to Azure storage,
you need to resave your auditing policy when refreshing your keys. The process is as follows:
1. Open Storage Details. In the Storage Access Key box, select Secondary, and click OK. Then click Save
at the top of the auditing configuration page.
2. Go to the storage configuration page and regenerate the primary access key.

3. Go back to the auditing configuration page, switch the storage access key from secondary to primary, and
then click OK. Then click Save at the top of the auditing configuration page.
4. Go back to the storage configuration page and regenerate the secondary access key (in preparation for the
next key's refresh cycle).
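
If you script this rotation, a minimal sketch with the Az PowerShell modules could look like the following. The resource group, server, and storage account names are placeholders, and the exact Set-AzSqlServerAudit parameters may vary with your Az.Sql version.

# Step 1: point auditing at the secondary key before touching the primary
Set-AzSqlServerAudit -ResourceGroupName "myResourceGroup" -ServerName "myserver" `
    -BlobStorageTargetState Enabled `
    -StorageAccountResourceId "/subscriptions/<subscription-id>/resourceGroups/myResourceGroup/providers/Microsoft.Storage/storageAccounts/myauditstorage" `
    -StorageKeyType Secondary

# Step 2: regenerate the primary access key
New-AzStorageAccountKey -ResourceGroupName "myResourceGroup" -Name "myauditstorage" -KeyName key1

# Step 3: switch auditing back to the (new) primary key
Set-AzSqlServerAudit -ResourceGroupName "myResourceGroup" -ServerName "myserver" `
    -BlobStorageTargetState Enabled `
    -StorageAccountResourceId "/subscriptions/<subscription-id>/resourceGroups/myResourceGroup/providers/Microsoft.Storage/storageAccounts/myauditstorage" `
    -StorageKeyType Primary

# Step 4: regenerate the secondary access key for the next rotation cycle
New-AzStorageAccountKey -ResourceGroupName "myResourceGroup" -Name "myauditstorage" -KeyName key2
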

Additional Information
For details about the log format, hierarchy of the storage folder and naming conventions, see the Blob
Audit Log Format Reference.
IMPORTANT
Azure SQL Database Audit stores 4000 characters of data for character fields in an audit record. When the
statement or the data_sensitivity_information values returned from an auditable action contain more than 4000
characters, any data beyond the first 4000 characters will be truncated and not audited.

Audit logs are written to Append Blobs in an Azure Blob storage on your Azure subscription:
Premium Storage is currently not supported by Append Blobs.
Storage in VNet is currently not supported.
The default auditing policy includes all actions and the following set of action groups, which will audit all
the queries and stored procedures executed against the database, as well as successful and failed logins:
BATCH_COMPLETED_GROUP
SUCCESSFUL_DATABASE_AUTHENTICATION_GROUP
FAILED_DATABASE_AUTHENTICATION_GROUP
You can configure auditing for different types of actions and action groups using PowerShell, as described
in the Manage SQL database auditing using Azure PowerShell section.
When using AAD authentication, failed login records will not appear in the SQL audit log. To view failed
login audit records, you need to visit the Azure Active Directory portal, which logs details of these events.

Manage SQL database auditing using Azure PowerShell


PowerShell cmdlets (including WHERE clause support for additional filtering):
Create or Update Database Auditing Policy (Set-AzSqlDatabaseAudit)
Create or Update Server Auditing Policy (Set-AzSqlServerAudit)
Get Database Auditing Policy (Get-AzSqlDatabaseAudit)
Get Server Auditing Policy (Get-AzSqlServerAudit)
Remove Database Auditing Policy (Remove-AzSqlDatabaseAudit)
Remove Server Auditing Policy (Remove-AzSqlServerAudit)
For a script example, see Configure auditing and threat detection using PowerShell.
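
For instance, a minimal sketch that enables server-level blob auditing with a narrowed set of action groups and a WHERE-clause predicate might look like the following. The resource names are placeholders, and the exact parameter names depend on your Az.Sql version.

# Enable server-level blob auditing with specific action groups and a predicate filter
Set-AzSqlServerAudit -ResourceGroupName "myResourceGroup" -ServerName "myserver" `
    -BlobStorageTargetState Enabled `
    -StorageAccountResourceId "/subscriptions/<subscription-id>/resourceGroups/myResourceGroup/providers/Microsoft.Storage/storageAccounts/myauditstorage" `
    -AuditActionGroup SUCCESSFUL_DATABASE_AUTHENTICATION_GROUP, FAILED_DATABASE_AUTHENTICATION_GROUP, BATCH_COMPLETED_GROUP `
    -PredicateExpression "statement <> ''"

# Review the resulting server audit policy
Get-AzSqlServerAudit -ResourceGroupName "myResourceGroup" -ServerName "myserver"
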

Manage SQL database auditing using REST API


REST API:
Create or Update Database Auditing Policy
Create or Update Server Auditing Policy
Get Database Auditing Policy
Get Server Auditing Policy
Extended policy with WHERE clause support for additional filtering:
Create or Update Database Extended Auditing Policy
Create or Update Server Extended Auditing Policy
Get Database Extended Auditing Policy
Get Server Extended Auditing Policy

Manage SQL database auditing using Azure Resource Manager templates

You can manage Azure SQL database auditing using Azure Resource Manager templates, as shown in these
examples:
Deploy an Azure SQL Server with Auditing enabled to write audit logs to Azure Blob storage account
Deploy an Azure SQL Server with Auditing enabled to write audit logs to Log Analytics
Deploy an Azure SQL Server with Auditing enabled to write audit logs to Event Hubs

NOTE
The linked samples are on an external public repository and are provided 'as is', without warranty, and are not supported
under any Microsoft support program/service.
Azure SQL Database and Azure SQL Data
Warehouse IP firewall rules
9/30/2019 • 11 minutes to read • Edit Online

NOTE
This article applies to Azure SQL servers, and to both Azure SQL Database and Azure SQL Data Warehouse databases on
an Azure SQL server. For simplicity, SQL Database is used to refer to both SQL Database and SQL Data Warehouse.

IMPORTANT
This article does not apply to Azure SQL Database Managed Instance. For information about network configuration, see
Connect your application to Azure SQL Database Managed Instance.

When you create a new Azure SQL server named mysqlserver, for example, the SQL Database firewall blocks all
access to the public endpoint for the server (which is accessible at mysqlserver.database.windows.net).

IMPORTANT
SQL Data Warehouse only supports server-level IP firewall rules. It doesn't support database-level IP firewall rules.

How the firewall works


Connection attempts from the internet and Azure must pass through the firewall before they reach your SQL
server or SQL database, as the following diagram shows.
Server-level IP firewall rules
These rules enable clients to access your entire Azure SQL server, that is, all the databases within the same SQL
Database server. The rules are stored in the master database. You can have a maximum of 128 server-level IP
firewall rules for an Azure SQL Server.
You can configure server-level IP firewall rules by using the Azure portal, PowerShell, or Transact-SQL
statements.
To use the portal or PowerShell, you must be the subscription owner or a subscription contributor.
To use Transact-SQL, you must connect to the SQL Database instance as the server-level principal login or as
the Azure Active Directory administrator. (A server-level IP firewall rule must first be created by a user who
has Azure-level permissions.)
Database-level IP firewall rules
These rules enable clients to access certain (secure) databases within the same SQL Database server. You create
the rules for each database (including the master database), and they're stored in the individual database.
You can only create and manage database-level IP firewall rules for master and user databases by using Transact-
SQL statements and only after you configure the first server-level firewall.
If you specify an IP address range in the database-level IP firewall rule that's outside the range in the server-level
IP firewall rule, only those clients that have IP addresses in the database-level range can access the database.
You can have a maximum of 128 database-level IP firewall rules for a database. For more information about
configuring database-level IP firewall rules, see the example later in this article and see
sp_set_database_firewall_rule (Azure SQL Database).
Recommendations for how to set firewall rules
We recommend that you use database-level IP firewall rules whenever possible. This practice enhances security
and makes your database more portable. Use server-level IP firewall rules for administrators. Also use them
when you have many databases that have the same access requirements, and you don't want to configure each
database individually.

NOTE
For information about portable databases in the context of business continuity, see Authentication requirements for
disaster recovery.

Server-level versus database-level IP firewall rules


Should users of one database be fully isolated from another database?
If yes, use database-level IP firewall rules to grant access. This method avoids using server-level IP firewall rules,
which permit access through the firewall to all databases. That would reduce the depth of your defenses.
Do users at the IP addresses need access to all databases?
If yes, use server-level IP firewall rules to reduce the number of times that you have to configure IP firewall rules.
Does the person or team who configures the IP firewall rules only have access through the Azure portal,
PowerShell, or the REST API?
If so, you must use server-level IP firewall rules. Database-level IP firewall rules can only be configured through
Transact-SQL.
Is the person or team who configures the IP firewall rules prohibited from having high-level permission at the
database level?
If so, use server-level IP firewall rules. You need at least CONTROL DATABASE permission at the database level
to configure database-level IP firewall rules through Transact-SQL.
Does the person or team who configures or audits the IP firewall rules centrally manage IP firewall rules for
many (perhaps hundreds) of databases?
In this scenario, best practices are determined by your needs and environment. Server-level IP firewall rules
might be easier to configure, but scripting can configure rules at the database-level. And even if you use server-
level IP firewall rules, you might need to audit database-level IP firewall rules to see if users with CONTROL
permission on the database create database-level IP firewall rules.
Can I use a mix of server-level and database-level IP firewall rules?
Yes. Some users, such as administrators, might need server-level IP firewall rules. Other users, such as users of a
database application, might need database-level IP firewall rules.
Connections from the internet
When a computer tries to connect to your database server from the internet, the firewall first checks the
originating IP address of the request against the database-level IP firewall rules for the database that the
connection requests.
If the address is within a range that's specified in the database-level IP firewall rules, the connection is granted
to the SQL database that contains the rule.
If the address isn't within a range in the database-level IP firewall rules, the firewall checks the server-level IP
firewall rules. If the address is within a range that's in the server-level IP firewall rules, the connection is
granted. Server-level IP firewall rules apply to all SQL databases on the Azure SQL server.
If the address isn't within a range that's in any of the database-level or server-level IP firewall rules, the
connection request fails.

NOTE
To access SQL Database from your local computer, ensure that the firewall on your network and local computer allow
outgoing communication on TCP port 1433.

Connections from inside Azure


To allow applications hosted inside Azure to connect to your SQL server, Azure connections must be enabled.
When an application from Azure tries to connect to your database server, the firewall verifies that Azure
connections are allowed. A firewall setting that has starting and ending IP addresses equal to 0.0.0.0 indicates
that Azure connections are allowed. If the connection isn't allowed, the request doesn't reach the SQL Database
server.

IMPORTANT
This option configures the firewall to allow all connections from Azure, including connections from the subscriptions of
other customers. If you select this option, make sure that your login and user permissions limit access to authorized users
only.

Create and manage IP firewall rules


You create the first server-level firewall setting by using the Azure portal or programmatically by using Azure
PowerShell, Azure CLI, or an Azure REST API. You create and manage additional server-level IP firewall rules by
using these methods or Transact-SQL.

IMPORTANT
Database-level IP firewall rules can only be created and managed by using Transact-SQL.

To improve performance, server-level IP firewall rules are temporarily cached at the database level. To refresh the
cache, see DBCC FLUSHAUTHCACHE.
TIP
You can use SQL Database Auditing to audit server-level and database-level firewall changes.

Use the Azure portal to manage server-level IP firewall rules


To set a server-level IP firewall rule in the Azure portal, go to the overview page for your Azure SQL database or
your SQL Database server.

TIP
For a tutorial, see Create a DB using the Azure portal.

From the database overview page


1. To set a server-level IP firewall rule from the database overview page, select Set server firewall on the
toolbar, as the following image shows. The Firewall settings page for the SQL Database server opens.

2. Select Add client IP on the toolbar to add the IP address of the computer that you're using, and then
select Save. A server-level IP firewall rule is created for your current IP address.

From the server overview page


The overview page for your server opens. It shows the fully qualified server name (such as
mynewserver20170403.database.windows.net) and provides options for further configuration.
1. To set a server-level rule from this page, select Firewall from the Settings menu on the left side.
2. Select Add client IP on the toolbar to add the IP address of the computer that you're using, and then
select Save. A server-level IP firewall rule is created for your current IP address.
Use Transact-SQL to manage IP firewall rules
CATALOG VIEW OR STORED PROCEDURE     LEVEL      DESCRIPTION
sys.firewall_rules                   Server     Displays the current server-level IP firewall rules
sp_set_firewall_rule                 Server     Creates or updates server-level IP firewall rules
sp_delete_firewall_rule              Server     Removes server-level IP firewall rules
sys.database_firewall_rules          Database   Displays the current database-level IP firewall rules
sp_set_database_firewall_rule        Database   Creates or updates the database-level IP firewall rules
sp_delete_database_firewall_rule     Database   Removes database-level IP firewall rules

The following example reviews the existing rules, enables a range of IP addresses on the server Contoso, and
deletes an IP firewall rule:

SELECT * FROM sys.firewall_rules ORDER BY name;

Next, add a server-level IP firewall rule.

EXECUTE sp_set_firewall_rule @name = N'ContosoFirewallRule',
    @start_ip_address = '192.168.1.1', @end_ip_address = '192.168.1.10';

To delete a server-level IP firewall rule, execute the sp_delete_firewall_rule stored procedure. The following
example deletes the rule ContosoFirewallRule:

EXECUTE sp_delete_firewall_rule @name = N'ContosoFirewallRule'

Use PowerShell to manage server-level IP firewall rules

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
IMPORTANT
The PowerShell Azure Resource Manager module is still supported by Azure SQL Database, but all development is now for
the Az.Sql module. For these cmdlets, see AzureRM.Sql. The arguments for the commands in the Az and AzureRm modules
are substantially identical.

CMDLET                            LEVEL     DESCRIPTION
Get-AzSqlServerFirewallRule       Server    Returns the current server-level firewall rules
New-AzSqlServerFirewallRule       Server    Creates a new server-level firewall rule
Set-AzSqlServerFirewallRule       Server    Updates the properties of an existing server-level firewall rule
Remove-AzSqlServerFirewallRule    Server    Removes server-level firewall rules

The following example uses PowerShell to set a server-level IP firewall rule:

New-AzSqlServerFirewallRule -ResourceGroupName "myResourceGroup" `
    -ServerName $servername `
    -FirewallRuleName "AllowSome" -StartIpAddress "0.0.0.0" -EndIpAddress "0.0.0.0"

TIP
For PowerShell examples in the context of a quickstart, see Create DB - PowerShell and Create a single database and
configure a SQL Database server-level IP firewall rule using PowerShell.

Use CLI to manage server-level IP firewall rules


CMDLET                                LEVEL     DESCRIPTION
az sql server firewall-rule create    Server    Creates a server IP firewall rule
az sql server firewall-rule list      Server    Lists the IP firewall rules on a server
az sql server firewall-rule show      Server    Shows the detail of an IP firewall rule
az sql server firewall-rule update    Server    Updates an IP firewall rule
az sql server firewall-rule delete    Server    Deletes an IP firewall rule

The following example uses CLI to set a server-level IP firewall rule:

az sql server firewall-rule create --resource-group myResourceGroup --server $servername \
    -n AllowYourIp --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0

TIP
For a CLI example in the context of a quickstart, see Create DB - Azure CLI and Create a single database and configure a
SQL Database IP firewall rule using the Azure CLI.

Use a REST API to manage server-level IP firewall rules


API                                LEVEL     DESCRIPTION
List firewall rules                Server    Displays the current server-level IP firewall rules
Create or update firewall rules    Server    Creates or updates server-level IP firewall rules
Delete firewall rules              Server    Removes server-level IP firewall rules
Get firewall rules                 Server    Gets server-level IP firewall rules

Troubleshoot the database firewall


Consider the following points when access to the SQL Database service doesn't behave as you expect.
Local firewall configuration:
Before your computer can access SQL Database, you may need to create a firewall exception on your
computer for TCP port 1433. To make connections inside the Azure cloud boundary, you may have to
open additional ports. For more information, see the "SQL Database: Outside vs inside" section of Ports
beyond 1433 for ADO.NET 4.5 and SQL Database.
Network address translation:
Because of network address translation (NAT), the IP address that's used by your computer to connect to
SQL Database may be different than the IP address in your computer's IP configuration settings. To view
the IP address that your computer is using to connect to Azure:
1. Sign in to the portal.
2. Go to the Configure tab on the server that hosts your database.
3. The Current Client IP Address is displayed in the Allowed IP Addresses section. Select Add for
Allowed IP Addresses to allow this computer to access the server.
Changes to the allow list haven't taken effect yet:
There may be up to a five-minute delay for changes to the SQL Database firewall configuration to take
effect.
The login isn't authorized, or an incorrect password was used:
If a login doesn't have permissions on the SQL Database server or the password is incorrect, the
connection to the server is denied. Creating a firewall setting only gives clients an opportunity to try to
connect to your server. The client must still provide the necessary security credentials. For more
information about preparing logins, see Controlling and granting database access to SQL Database and
SQL Data Warehouse.
Dynamic IP address:
If you have an internet connection that uses dynamic IP addressing and you have trouble getting through
the firewall, try one of the following solutions:
Ask your internet service provider for the IP address range that's assigned to your client computers
that access the SQL Database server. Add that IP address range as an IP firewall rule.
Get static IP addressing instead for your client computers. Add the IP addresses as IP firewall rules.

Next steps
Confirm that your corporate network environment allows inbound communication from the compute IP
address ranges (including SQL ranges) that are used by the Azure datacenters. You might have to add those
IP addresses to the allow list. See Microsoft Azure datacenter IP ranges.
For a quickstart about creating a server-level IP firewall rule, see Create an Azure SQL database.
For help with connecting to an Azure SQL database from open-source or third-party applications, see Client
quickstart code samples to SQL Database.
For information about additional ports that you may need to open, see the "SQL Database: Outside vs inside"
section of Ports beyond 1433 for ADO.NET 4.5 and SQL Database
For an overview of Azure SQL Database security, see Securing your database.
Private Link for Azure SQL Database and Data
Warehouse (Preview)
9/17/2019 • 7 minutes to read • Edit Online

Private Link allows you to connect to various PaaS services in Azure via a private endpoint. For a list of PaaS
services that support Private Link functionality, go to the Private Link Documentation page. A private endpoint is a
private IP address within a specific VNet and subnet.

IMPORTANT
This article applies to Azure SQL server, and to both SQL Database and SQL Data Warehouse databases that are created on
the Azure SQL server. For simplicity, SQL Database is used when referring to both SQL Database and SQL Data Warehouse.
This article does not apply to a managed instance deployment in Azure SQL Database.

Data exfiltration prevention


Data exfiltration in Azure SQL Database occurs when an authorized user, such as a database admin, is able to extract data
from one system and move it to another location or system outside the organization. For example, the user moves the
data to a storage account owned by a third party.
Consider a scenario with a user running SQL Server Management Studio (SSMS ) inside an Azure VM connecting
to a SQL Database. This SQL Database is in the West US data center. The example below shows how to limit
access with public endpoints on SQL Database using network access controls.
1. Disable all Azure service traffic to SQL Database via the public endpoint by setting Allow Azure Services to
OFF. Ensure no IP addresses are allowed in the server and database level firewall rules. For more information,
see Azure SQL Database and Data Warehouse network access controls.
2. Only allow traffic to the SQL Database using the Private IP address of the VM. For more information, see the
articles on Service Endpoint and VNet firewall rules.
3. On the Azure VM, narrow down the scope of outgoing connection by using Network Security Groups (NSGs)
and Service Tags as follows
Specify an NSG rule to allow traffic for Service Tag = SQL.WestUs - only allowing connection to SQL
Database in West US
Specify an NSG rule (with a higher priority) to deny traffic for Service Tag = SQL - denying connections
to SQL Database in all regions
At the end of this setup, the Azure VM can connect only to SQL Databases in the West US region. However, the
connectivity isn't restricted to a single SQL Database. The VM can still connect to any SQL Databases in the West
US region, including the databases that aren't part of the subscription. While we've reduced the scope of data
exfiltration in the above scenario to a specific region, we haven't eliminated it altogether.
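
A minimal PowerShell sketch of the two NSG rules from step 3 above might look like the following. The rule names and region tag are assumptions, and the resulting NSG would still need to be associated with the VM's subnet or NIC.

# Allow outbound SQL traffic only to the West US region (service tag Sql.WestUS)
$allowSqlWestUs = New-AzNetworkSecurityRuleConfig -Name "Allow-Sql-WestUS" -Priority 100 `
    -Direction Outbound -Access Allow -Protocol Tcp `
    -SourceAddressPrefix VirtualNetwork -SourcePortRange "*" `
    -DestinationAddressPrefix "Sql.WestUS" -DestinationPortRange 1433

# Deny outbound SQL traffic to all regions (service tag Sql); the higher priority number means it is evaluated after the allow rule
$denySqlAll = New-AzNetworkSecurityRuleConfig -Name "Deny-Sql-All" -Priority 200 `
    -Direction Outbound -Access Deny -Protocol Tcp `
    -SourceAddressPrefix VirtualNetwork -SourcePortRange "*" `
    -DestinationAddressPrefix "Sql" -DestinationPortRange 1433

New-AzNetworkSecurityGroup -ResourceGroupName "myResourceGroup" -Location "westus" `
    -Name "nsg-restrict-sql" -SecurityRules $allowSqlWestUs, $denySqlAll
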
With Private Link, customers can now set up network access controls like NSGs to restrict access to the private
endpoint. Individual Azure PaaS resources are then mapped to specific private endpoints. A malicious insider can
only access the mapped PaaS resource (for example a SQL Database) and no other resource.

On-premises connectivity over private peering


When customers connect to the public endpoint from on-premises machines, their IP address needs to be added to
the IP-based firewall using a server-level firewall rule. While this model works well for allowing access to
individual machines for dev or test workloads, it's difficult to manage in a production environment.
With Private Link, customers can enable cross-premises access to the private endpoint using ExpressRoute, private
peering, or VPN tunneling. Customers can then disable all access via the public endpoint and not use the IP-based
firewall to allow any IP addresses.

How to set up Private Link for Azure SQL Database


Creation Process
Private Endpoints can be created using the portal, PowerShell, or Azure CLI:
Portal
PowerShell
CLI
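
For example, a minimal PowerShell sketch of the creation step might look like the following. All names and the target SQL server resource ID are placeholders; the group ID used for SQL Database private endpoints is sqlServer.

# Look up the VNet and subnet that will host the private endpoint
$vnet   = Get-AzVirtualNetwork -ResourceGroupName "myResourceGroup" -Name "myVNet"
$subnet = Get-AzVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "mySubnet"

# Define the connection to the target SQL server
$plsConnection = New-AzPrivateLinkServiceConnection -Name "myPlsConnection" `
    -PrivateLinkServiceId "/subscriptions/<subscription-id>/resourceGroups/myResourceGroup/providers/Microsoft.Sql/servers/myserver" `
    -GroupId "sqlServer"

# Create the private endpoint in the chosen subnet
New-AzPrivateEndpoint -ResourceGroupName "myResourceGroup" -Name "myPrivateEndpoint" `
    -Location "westus" -Subnet $subnet -PrivateLinkServiceConnection $plsConnection
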
Approval Process
Once the network admin creates the Private Endpoint (PE ), the SQL admin can manage the Private Endpoint
Connection (PEC ) to SQL Database.
1. Navigate to the SQL server resource in the Azure portal.
(1) Select the Private endpoint connections in the left pane
(2) Shows a list of all Private Endpoint Connections (PECs)
(3) Corresponding Private Endpoint (PE ) created

2. Select an individual PEC from the list.


3. The SQL admin can choose to approve or reject a PEC and optionally add a short text response.

4. After approval or rejection, the list will reflect the appropriate state along with the response text.

Use cases of Private Link for Azure SQL Database


Clients can connect to the Private endpoint from the same VNet, peered VNet in same region, or via VNet-to-VNet
connection across regions. Additionally, clients can connect from on-premises using ExpressRoute, private peering,
or VPN tunneling. Below is a simplified diagram showing the common use cases.
Test connectivity to SQL Database from an Azure VM in same Virtual
Network (VNet)
For this scenario, assume you've created an Azure Virtual Machine (VM ) running Windows Server 2016.
1. Start a Remote Desktop (RDP ) session and connect to the virtual machine.
2. You can then do some basic connectivity checks to ensure that the VM is connecting to SQL Database via the
private endpoint using the following tools:
a. Telnet
b. Psping
c. Nmap
d. SQL Server Management Studio (SSMS )
Check Connectivity using Telnet
Telnet Client is a Windows feature that can be used to test connectivity. Depending on the version of the Windows
OS, you may need to enable this feature explicitly.
Open a Command Prompt window after you have installed Telnet. Run the Telnet command, specifying the private IP
address of the SQL Database private endpoint and port 1433.

>telnet 10.1.1.5 1433

When Telnet connects successfully, you'll see a blank screen at the command window like the below image:

Check Connectivity using Psping


Psping can be used as follows to check that the private endpoint connection (PEC) is listening for connections on
port 1433.
Run psping as follows by providing the FQDN for your SQL Database server and port 1433:
>psping.exe mysqldbsrvr.database.windows.net:1433

PsPing v2.10 - PsPing - ping, latency, bandwidth measurement utility


Copyright (C) 2012-2016 Mark Russinovich
Sysinternals - www.sysinternals.com

TCP connect to 10.6.1.4:1433:


5 iterations (warmup 1) ping test:
Connecting to 10.6.1.4:1433 (warmup): from 10.6.0.4:49953: 2.83ms
Connecting to 10.6.1.4:1433: from 10.6.0.4:49954: 1.26ms
Connecting to 10.6.1.4:1433: from 10.6.0.4:49955: 1.98ms
Connecting to 10.6.1.4:1433: from 10.6.0.4:49956: 1.43ms
Connecting to 10.6.1.4:1433: from 10.6.0.4:49958: 2.28ms

The output shows that Psping could ping the private IP address associated with the PEC.
Check connectivity using Nmap
Nmap (Network Mapper) is a free and open-source tool used for network discovery and security auditing. For
more information and the download link, visit https://nmap.org. You can use this tool to ensure that the private
endpoint is listening for connections on port 1433.
Run Nmap as follows by providing the address range of the subnet that hosts the private endpoint.

>nmap -n -sP 10.1.1.0/24


...
...
Nmap scan report for 10.1.1.5
Host is up (0.00s latency).
Nmap done: 256 IP addresses (1 host up) scanned in 207.00 seconds

The result shows that one IP address is up; which corresponds to the IP address for the private endpoint.
Check Connectivity using SQL Server Management Studio (SSMS )
The last step is to use SSMS to connect to the SQL Database. After you connect to the SQL Database using SSMS,
verify that you're connecting from the private IP address of the Azure VM by running the following query:

select client_net_address from sys.dm_exec_connections where session_id = @@SPID;

NOTE
In preview, connections to private endpoint only support Proxy as the connection policy

Connecting from an Azure VM in Peered Virtual Network (VNet)


Configure VNet peering to establish connectivity to the SQL Database from an Azure VM in a peered VNet.

Connecting from an Azure VM in VNet-to-VNet environment


Configure VNet-to-VNet VPN gateway connection to establish connectivity to a SQL Database from an Azure VM
in a different region or subscription.

Connecting from an on-premises environment over VPN


To establish connectivity from an on-premises environment to the SQL Database, choose and implement one of
the options:
Point-to-Site connection
Site-to-Site VPN connection
ExpressRoute circuit

Connecting from an Azure SQL Data Warehouse to Azure Storage using PolyBase

PolyBase is commonly used to load data into Azure SQL Data Warehouse from Azure Storage accounts. If the
Azure Storage account that you are loading data from limits access only to a set of VNet-subnets via Private
Endpoints, Service Endpoints, or IP -based firewalls, the connectivity from PolyBase to the account will break. For
enabling both PolyBase import and export scenarios with Azure SQL Data Warehouse connecting to Azure
Storage that's secured to a VNet, follow the steps provided here.

Next steps
For an overview of Azure SQL Database security, see Securing your database
For an overview of Azure SQL Database connectivity, see Azure SQL Connectivity Architecture
Use virtual network service endpoints and rules for
database servers
10/2/2019 • 13 minutes to read • Edit Online

Virtual network rules are one firewall security feature that controls whether the database server for your single
databases and elastic pool in Azure SQL Database or for your databases in SQL Data Warehouse accepts
communications that are sent from particular subnets in virtual networks. This article explains why the virtual
network rule feature is sometimes your best option for securely allowing communication to your Azure SQL
Database and SQL Data Warehouse.

IMPORTANT
This article applies to Azure SQL server, and to both SQL Database and SQL Data Warehouse databases that are created on
the Azure SQL server. For simplicity, SQL Database is used when referring to both SQL Database and SQL Data Warehouse.
This article does not apply to a managed instance deployment in Azure SQL Database because it does not have a service
endpoint associated with it.

To create a virtual network rule, there must first be a virtual network service endpoint for the rule to reference.

How to create a virtual network rule


If you only create a virtual network rule, you can skip ahead to the steps and explanation later in this article.

Details about virtual network rules


This section describes several details about virtual network rules.
Only one geographic region
Each Virtual Network service endpoint applies to only one Azure region. The endpoint does not enable other
regions to accept communication from the subnet.
Any virtual network rule is limited to the region that its underlying endpoint applies to.
Server-level, not database-level
Each virtual network rule applies to your whole Azure SQL Database server, not just to one particular database on
the server. In other words, a virtual network rule applies at the server level, not at the database level.
In contrast, IP rules can apply at either level.
Security administration roles
There is a separation of security roles in the administration of Virtual Network service endpoints. Action is
required from each of the following roles:
Network Admin: Turn on the endpoint.
Database Admin: Update the access control list (ACL ) to add the given subnet to the SQL Database server.
RBAC alternative:
The roles of Network Admin and Database Admin have more capabilities than are needed to manage virtual
network rules. Only a subset of their capabilities is needed.
You have the option of using role-based access control (RBAC ) in Azure to create a single custom role that has
only the necessary subset of capabilities. The custom role could be used instead of involving either the Network
Admin or the Database Admin. The surface area of your security exposure is lower if you add a user to a custom
role, versus adding the user to the other two major administrator roles.

NOTE
In some cases the Azure SQL Database and the VNet-subnet are in different subscriptions. In these cases you must ensure
the following configurations:
Both subscriptions must be in the same Azure Active Directory tenant.
The user has the required permissions to initiate operations, such as enabling service endpoints and adding a VNet-
subnet to the given Server.
Both subscriptions must have the Microsoft.Sql provider registered.

Limitations
For Azure SQL Database, the virtual network rules feature has the following limitations:
In the firewall for your SQL Database, each virtual network rule references a subnet. All these referenced
subnets must be hosted in the same geographic region that hosts the SQL Database.
Each Azure SQL Database server can have up to 128 ACL entries for any given virtual network.
Virtual network rules apply only to Azure Resource Manager virtual networks; and not to classic
deployment model networks.
Turning ON virtual network service endpoints to Azure SQL Database also enables the endpoints for the
MySQL and PostgreSQL Azure services. However, with endpoints ON, attempts to connect from the
endpoints to your MySQL or PostgreSQL instances may fail.
The underlying reason is that MySQL and PostgreSQL likely do not have a virtual network rule
configured. You must configure a virtual network rule for Azure Database for MySQL and PostgreSQL
and the connection will succeed.
On the firewall, IP address ranges do apply to the following networking items, but virtual network rules do
not:
Site-to-Site (S2S ) virtual private network (VPN )
On-premises via ExpressRoute
Considerations when using Service Endpoints
When using service endpoints for Azure SQL Database, review the following considerations:
Outbound to Azure SQL Database Public IPs is required: Network Security Groups (NSGs) must be
opened to Azure SQL Database IPs to allow connectivity. You can do this by using NSG Service Tags for Azure
SQL Database.
ExpressRoute
If you are using ExpressRoute from your premises, for public peering or Microsoft peering, you will need to
identify the NAT IP addresses that are used. For public peering, each ExpressRoute circuit by default uses two NAT
IP addresses applied to Azure service traffic when the traffic enters the Microsoft Azure network backbone. For
Microsoft peering, the NAT IP address(es) that are used are either customer provided or are provided by the
service provider. To allow access to your service resources, you must allow these public IP addresses in the
resource IP firewall setting. To find your public peering ExpressRoute circuit IP addresses, open a support ticket
with ExpressRoute via the Azure portal. Learn more about NAT for ExpressRoute public and Microsoft peering.
To allow communication from your circuit to Azure SQL Database, you must create IP network rules for the public
IP addresses of your NAT.

Impact of using VNet Service Endpoints with Azure storage


Azure Storage has implemented the same feature that allows you to limit connectivity to your Azure Storage
account. If you choose to use this feature with an Azure Storage account that is being used by Azure SQL Server,
you can run into issues. Next is a list and discussion of Azure SQL Database and Azure SQL Data Warehouse
features that are impacted by this.
Azure SQL Data Warehouse PolyBase
PolyBase is commonly used to load data into Azure SQL Data Warehouse from Azure Storage accounts. If the
Azure Storage account that you are loading data from limits access only to a set of VNet-subnets, connectivity
from PolyBase to the Account will break. For enabling both PolyBase import and export scenarios with Azure SQL
Data Warehouse connecting to Azure Storage that's secured to VNet, follow the steps indicated below:
Prerequisites

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

IMPORTANT
The PowerShell Azure Resource Manager module is still supported by Azure SQL Database, but all future development is for
the Az.Sql module. For these cmdlets, see AzureRM.Sql. The arguments for the commands in the Az module and in the
AzureRm modules are substantially identical.

1. Install Azure PowerShell using this guide.


2. If you have a general-purpose v1 or blob storage account, you must first upgrade to general-purpose v2 using
this guide.
3. You must have Allow trusted Microsoft services to access this storage account turned on under Azure
Storage account Firewalls and Virtual networks settings menu. Refer to this guide for more information.
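
If you prefer scripting prerequisite 3, a minimal sketch using the Az.Storage module might look like the following; the resource names are placeholders. It keeps the default network action as Deny while letting trusted Microsoft services through.

# Allow trusted Microsoft services to bypass the storage account network rules
Update-AzStorageAccountNetworkRuleSet -ResourceGroupName "myResourceGroup" -Name "mystorageaccount" `
    -DefaultAction Deny -Bypass AzureServices
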
Steps
1. In PowerShell, register your Azure SQL Server hosting your Azure SQL Data Warehouse instance with
Azure Active Directory (AAD ):

Connect-AzAccount
Select-AzSubscription -SubscriptionId your-subscriptionId
Set-AzSqlServer -ResourceGroupName your-database-server-resourceGroup -ServerName your-SQL-servername -AssignIdentity

a. Create a general-purpose v2 Storage Account using this guide.

NOTE
If you have a general-purpose v1 or blob storage account, you must first upgrade to v2 using this guide.
For known issues with Azure Data Lake Storage Gen2, please refer to this guide.

2. Under your storage account, navigate to Access Control (IAM ), and click Add role assignment. Assign
Storage Blob Data Contributor RBAC role to your Azure SQL Server hosting your Azure SQL Data
Warehouse which you've registered with Azure Active Directory (AAD ) as in step#1.

NOTE
Only members with Owner privilege can perform this step. For various built-in roles for Azure resources, refer to this
guide.

3. Polybase connectivity to the Azure Storage account:


a. Create a database master key if you haven't created one earlier:

CREATE MASTER KEY [ENCRYPTION BY PASSWORD = 'somepassword'];

b. Create database scoped credential with IDENTITY = 'Managed Service Identity':

CREATE DATABASE SCOPED CREDENTIAL msi_cred WITH IDENTITY = 'Managed Service Identity';

NOTE
There is no need to specify SECRET with Azure Storage access key because this mechanism uses Managed
Identity under the covers.
IDENTITY name should be 'Managed Service Identity' for PolyBase connectivity to work with Azure
Storage account secured to VNet.

c. Create external data source with abfss:// scheme for connecting to your general-purpose v2 storage
account using PolyBase:

CREATE EXTERNAL DATA SOURCE ext_datasource_with_abfss WITH
    (TYPE = hadoop,
     LOCATION = 'abfss://myfile@mystorageaccount.dfs.core.windows.net',
     CREDENTIAL = msi_cred);

NOTE
If you already have external tables associated with general-purpose v1 or blob storage account, you
should first drop those external tables and then drop corresponding external data source. Then create
external data source with abfss:// scheme connecting to general-purpose v2 storage account as above and
re-create all the external tables using this new external data source. You could use Generate and Publish
Scripts Wizard to generate create-scripts for all the external tables for ease.
For more information on abfss:// scheme, refer to this guide.
For more information on CREATE EXTERNAL DATA SOURCE, refer to this guide.

d. Query as normal using external tables.


Azure SQL Database Blob Auditing
Blob auditing pushes audit logs to your own storage account. If this storage account uses the VNet Service
endpoints feature then connectivity from Azure SQL Database to the storage account will break.

Adding a VNet firewall rule to your server without turning on VNet Service Endpoints

Long ago, before this feature was enhanced, you were required to turn VNet service endpoints On before you
could implement a live VNet rule in the Firewall. The endpoints related a given VNet-subnet to an Azure SQL
Database. But now as of January 2018, you can circumvent this requirement by setting the
IgnoreMissingVNetServiceEndpoint flag.
Merely setting a Firewall rule does not help secure the server. You must also turn VNet service endpoints On for
the security to take effect. When you turn service endpoints On, your VNet-subnet experiences downtime until it
completes the transition from Off to On. This is especially true in the context of large VNets. You can use the
IgnoreMissingVNetServiceEndpoint flag to reduce or eliminate the downtime during transition.
You can set the IgnoreMissingVNetServiceEndpoint flag by using PowerShell. For details, see PowerShell to
create a Virtual Network service endpoint and rule for Azure SQL Database.
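
As a minimal sketch with placeholder resource names, creating a rule with the flag set might look like this:

# Create a VNet firewall rule even though the Microsoft.Sql service endpoint is not yet On for the subnet
New-AzSqlServerVirtualNetworkRule -ResourceGroupName "myResourceGroup" -ServerName "myserver" `
    -VirtualNetworkRuleName "myVNetRule" `
    -VirtualNetworkSubnetId "/subscriptions/<subscription-id>/resourceGroups/myResourceGroup/providers/Microsoft.Network/virtualNetworks/myVNet/subnets/mySubnet" `
    -IgnoreMissingVnetServiceEndpoint

# The rule becomes effective only after the Microsoft.Sql service endpoint is turned On for the subnet.
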

Errors 40914 and 40615


Connection error 40914 relates to virtual network rules, as specified on the Firewall pane in the Azure portal. Error
40615 is similar, except it relates to IP address rules on the Firewall.
Error 40914
Message text: Cannot open server '[server-name]' requested by the login. Client is not allowed to access the server.
Error description: The client is in a subnet that has virtual network server endpoints. But the Azure SQL Database
server has no virtual network rule that grants to the subnet the right to communicate with the SQL Database.
Error resolution: On the Firewall pane of the Azure portal, use the virtual network rules control to add a virtual
network rule for the subnet.
Error 40615
Message text: Cannot open server '{0}' requested by the login. Client with IP address '{1}' is not allowed to access
the server.
Error description: The client is trying to connect from an IP address that is not authorized to connect to the Azure
SQL Database server. The server firewall has no IP address rule that allows a client to communicate from the
given IP address to the SQL Database.
Error resolution: Enter the client's IP address as an IP rule. Do this by using the Firewall pane in the Azure portal.
A list of several SQL Database error messages is documented here.

Portal can create a virtual network rule


This section illustrates how you can use the Azure portal to create a virtual network rule in your Azure SQL
Database. The rule tells your SQL Database to accept communication from a particular subnet that has been
tagged as being a Virtual Network service endpoint.

NOTE
If you intend to add a service endpoint to the VNet firewall rules of your Azure SQL Database server, first ensure that service
endpoints are turned On for the subnet.
If service endpoints are not turned on for the subnet, the portal asks you to enable them. Click the Enable button on the
same blade on which you add the rule.

PowerShell alternative
A PowerShell script can also create virtual network rules. The crucial cmdlet is
New-AzSqlServerVirtualNetworkRule. If interested, see PowerShell to create a Virtual Network service endpoint and
rule for Azure SQL Database.

REST API alternative


Internally, the PowerShell cmdlets for SQL VNet actions call REST APIs. You can call the REST APIs directly.
Virtual Network Rules: Operations

Prerequisites
You must already have a subnet that is tagged with the particular Virtual Network service endpoint type name
relevant to Azure SQL Database.
The relevant endpoint type name is Microsoft.Sql.
If your subnet might not be tagged with the type name, see Verify your subnet is an endpoint.

Azure portal steps


1. Sign in to the Azure portal.
2. Then navigate the portal to SQL servers > Firewall / Virtual Networks.
3. Set the Allow access to Azure services control to OFF.

IMPORTANT
If you leave the control set to ON, your Azure SQL Database server accepts communication from any subnet inside
the Azure boundary, that is, from any IP address that falls within the ranges defined for Azure datacenters. Leaving
the control set to ON might allow excessive access from a security point of view. The Microsoft Azure Virtual
Network service endpoint feature, in coordination with the virtual network rule feature of SQL Database, can
reduce your security surface area.

4. Click the + Add existing control, in the Virtual networks section.


5. In the new Create/Update pane, fill in the controls with the names of your Azure resources.

TIP
You must include the correct Address prefix for your subnet. You can find the value in the portal. Navigate All
resources > All types > Virtual networks. The filter displays your virtual networks. Click your virtual network, and
then click Subnets. The ADDRESS RANGE column has the Address prefix you need.
6. Click the OK button near the bottom of the pane.
7. See the resulting virtual network rule on the firewall pane.

NOTE
The following statuses or states apply to the rules:
Ready: Indicates that the operation that you initiated has Succeeded.
Failed: Indicates that the operation that you initiated has Failed.
Deleted: Only applies to the Delete operation, and indicates that the rule has been deleted and no longer applies.
InProgress: Indicates that the operation is in progress. The old rule applies while the operation is in this state.

Related articles
Azure virtual network service endpoints
Azure SQL Database server-level and database-level firewall rules
The virtual network rule feature for Azure SQL Database became available in late September 2017.

Next steps
Use PowerShell to create a virtual network service endpoint, and then a virtual network rule for Azure SQL
Database.
Virtual Network Rules: Operations with REST APIs
Authenticate to Azure SQL Data Warehouse
4/3/2019 • 2 minutes to read • Edit Online

Learn how to authenticate to Azure SQL Data Warehouse by using Azure Active Directory (AAD ) or SQL Server
authentication.
To connect to SQL Data Warehouse, you must pass in security credentials for authentication purposes. Upon
establishing a connection, certain connection settings are configured as part of establishing your query session.
For more information on security and how to enable connections to your data warehouse, see Secure a database
in SQL Data Warehouse.

SQL authentication
To connect to SQL Data Warehouse, you must provide the following information:
Fully qualified servername
Specify SQL authentication
Username
Password
Default database (optional)
By default your connection connects to the master database and not your user database. To connect to your user
database, you can choose to do one of two things:
Specify the default database when registering your server with the SQL Server Object Explorer in SSDT,
SSMS, or in your application connection string. For example, include the InitialCatalog parameter for an
ODBC connection.
Highlight the user database before creating a session in SSDT.

NOTE
The Transact-SQL statement USE MyDatabase; is not supported for changing the database for a connection. For guidance
connecting to SQL Data Warehouse with SSDT, refer to the Query with Visual Studio article.
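
For example, a minimal sketch of a SQL-authenticated connection that specifies the user database up front (here via the SqlServer PowerShell module; the server, database, and credentials are placeholders) might look like this:

# Requires the SqlServer module (Install-Module SqlServer)
Invoke-Sqlcmd -ServerInstance "myserver.database.windows.net" `
    -Database "mydatawarehouse" `
    -Username "myadmin" -Password "<password>" `
    -Query "SELECT DB_NAME() AS current_database;"
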

Azure Active Directory (AAD) authentication


Azure Active Directory authentication is a mechanism of connecting to Microsoft Azure SQL Data Warehouse by
using identities in Azure Active Directory (Azure AD ). With Azure Active Directory authentication, you can
centrally manage the identities of database users and other Microsoft services in one central location. Central ID
management provides a single place to manage SQL Data Warehouse users and simplifies permission
management.
Benefits
Azure Active Directory benefits include:
Provides an alternative to SQL Server authentication.
Helps stop the proliferation of user identities across database servers.
Allows password rotation in a single place.
Manages database permissions using external (AAD) groups.
Eliminates storing passwords by enabling integrated Windows authentication and other forms of
authentication supported by Azure Active Directory.
Uses contained database users to authenticate identities at the database level.
Supports token-based authentication for applications connecting to SQL Data Warehouse.
Supports Multi-Factor authentication through Active Directory Universal Authentication for various tools
including SQL Server Management Studio and SQL Server Data Tools.

NOTE
Azure Active Directory is still relatively new and has some limitations. To ensure that Azure Active Directory is a good fit for
your environment, see Azure AD features and limitations, specifically the Additional considerations.

Configuration steps
Follow these steps to configure Azure Active Directory authentication.
1. Create and populate an Azure Active Directory
2. Optional: Associate or change the active directory that is currently associated with your Azure Subscription
3. Create an Azure Active Directory administrator for Azure SQL Data Warehouse.
4. Configure your client computers
5. Create contained database users in your database mapped to Azure AD identities
6. Connect to your data warehouse by using Azure AD identities
Currently Azure Active Directory users are not shown in SSDT Object Explorer. As a workaround, view the users
in sys.database_principals.
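
As an illustration of step 3 in the configuration steps above, a minimal PowerShell sketch that assigns an Azure AD administrator to the server hosting the data warehouse might look like the following; the group display name and resource names are placeholders.

# Requires the Az.Sql module
Set-AzSqlServerActiveDirectoryAdministrator -ResourceGroupName "myResourceGroup" `
    -ServerName "myserver" -DisplayName "DW Admin Group"

# Verify the assignment
Get-AzSqlServerActiveDirectoryAdministrator -ResourceGroupName "myResourceGroup" -ServerName "myserver"
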
Find the details
The steps to configure and use Azure Active Directory authentication are nearly identical for Azure SQL
Database and Azure SQL Data Warehouse. Follow the detailed steps in the topic Connecting to SQL Database
or SQL Data Warehouse By Using Azure Active Directory Authentication.
Create custom database roles and add users to the roles. Then grant granular permissions to the roles. For
more information, see Getting Started with Database Engine Permissions.

Next steps
To start querying your data warehouse with Visual Studio and other applications, see Query with Visual Studio.
Use Azure Active Directory Authentication for
authentication with SQL
8/21/2019 • 9 minutes to read • Edit Online

Azure Active Directory authentication is a mechanism of connecting to Azure SQL Database, Managed Instance,
and SQL Data Warehouse by using identities in Azure Active Directory (Azure AD ).

NOTE
This topic applies to Azure SQL server, and to both SQL Database and SQL Data Warehouse databases that are created on
the Azure SQL server. For simplicity, SQL Database is used when referring to both SQL Database and SQL Data Warehouse.

With Azure AD authentication, you can centrally manage the identities of database users and other Microsoft
services in one central location. Central ID management provides a single place to manage database users and
simplifies permission management. Benefits include the following:
It provides an alternative to SQL Server authentication.
Helps stop the proliferation of user identities across database servers.
Allows password rotation in a single place.
Customers can manage database permissions using external (Azure AD ) groups.
It can eliminate storing passwords by enabling integrated Windows authentication and other forms of
authentication supported by Azure Active Directory.
Azure AD authentication uses contained database users to authenticate identities at the database level.
Azure AD supports token-based authentication for applications connecting to SQL Database.
Azure AD authentication supports ADFS (domain federation) or native user/password authentication for a
local Azure Active Directory without domain synchronization.
Azure AD supports connections from SQL Server Management Studio that use Active Directory Universal
Authentication, which includes Multi-Factor Authentication (MFA). MFA includes strong authentication with a
range of easy verification options — phone call, text message, smart cards with pin, or mobile app notification.
For more information, see SSMS support for Azure AD MFA with SQL Database and SQL Data Warehouse.
Azure AD supports similar connections from SQL Server Data Tools (SSDT) that use Active Directory
Interactive Authentication. For more information, see Azure Active Directory support in SQL Server Data Tools
(SSDT).

NOTE
Connecting to SQL Server running on an Azure VM is not supported using an Azure Active Directory account. Use a domain
Active Directory account instead.

The configuration steps include the following procedures to configure and use Azure Active Directory
authentication.
1. Create and populate Azure AD.
2. Optional: Associate or change the active directory that is currently associated with your Azure Subscription.
3. Create an Azure Active Directory administrator for the Azure SQL Database server, the Managed Instance, or
the Azure SQL Data Warehouse.
4. Configure your client computers.
5. Create contained database users in your database mapped to Azure AD identities.
6. Connect to your database by using Azure AD identities.

NOTE
To learn how to create and populate Azure AD, and then configure Azure AD with Azure SQL Database, Managed Instance,
and SQL Data Warehouse, see Configure Azure AD with Azure SQL Database.

Trust architecture
The following high-level diagram summarizes the solution architecture of using Azure AD authentication with
Azure SQL Database. The same concepts apply to SQL Data Warehouse. To support Azure AD native user
password, only the Cloud portion and Azure AD/Azure SQL Database is considered. To support Federated
authentication (or user/password for Windows credentials), the communication with ADFS block is required. The
arrows indicate communication pathways.

The following diagram indicates the federation, trust, and hosting relationships that allow a client to connect to a
database by submitting a token. The token is authenticated by an Azure AD, and is trusted by the database.
Customer 1 can represent an Azure Active Directory with native users or an Azure AD with federated users.
Customer 2 represents a possible solution including imported users; in this example coming from a federated
Azure Active Directory with ADFS being synchronized with Azure Active Directory. It's important to understand
that access to a database using Azure AD authentication requires that the hosting subscription is associated to the
Azure AD. The same subscription must be used to create the SQL Server hosting the Azure SQL Database or SQL
Data Warehouse.
Administrator structure
When using Azure AD authentication, there are two Administrator accounts for the SQL Database server and
Managed Instance; the original SQL Server administrator and the Azure AD administrator. The same concepts
apply to SQL Data Warehouse. Only the administrator based on an Azure AD account can create the first Azure
AD contained database user in a user database. The Azure AD administrator login can be an Azure AD user or an
Azure AD group. When the administrator is a group account, it can be used by any group member, enabling
multiple Azure AD administrators for the SQL Server instance. Using a group account as an administrator
enhances manageability by allowing you to centrally add and remove group members in Azure AD without
changing the users or permissions in SQL Database. Only one Azure AD administrator (a user or group) can be configured at
any time.
Permissions
To create new users, you must have the ALTER ANY USER permission in the database. The ALTER ANY USER
permission can be granted to any database user. The ALTER ANY USER permission is also held by the server
administrator accounts, and database users with the CONTROL ON DATABASE or ALTER ON DATABASE permission for
that database, and by members of the db_owner database role.
To create a contained database user in Azure SQL Database, Managed Instance, or SQL Data Warehouse, you
must connect to the database or instance using an Azure AD identity. To create the first contained database user,
you must connect to the database by using an Azure AD administrator (who is the owner of the database). This is
demonstrated in Configure and manage Azure Active Directory authentication with SQL Database or SQL Data
Warehouse. Any Azure AD authentication is only possible if the Azure AD admin was created for Azure SQL
Database or SQL Data Warehouse server. If the Azure Active Directory admin was removed from the server,
existing Azure Active Directory users created previously inside SQL Server can no longer connect to the database
using their Azure Active Directory credentials.
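As an illustrative sketch (the group name is a placeholder), the Azure AD administrator can connect to a user
database, create the first Azure AD contained database user, and delegate further user creation:

-- Connect to the user database as the Azure AD administrator, then:
CREATE USER [dba_group@mydomain.com] FROM EXTERNAL PROVIDER;  -- Azure AD group or user

-- Optionally allow this principal to create additional users in the database.
GRANT ALTER ANY USER TO [dba_group@mydomain.com];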

Azure AD features and limitations


The following members of Azure AD can be provisioned in Azure SQL server or SQL Data Warehouse:
Native members: A member created in Azure AD in the managed domain or in a customer domain. For
more information, see Add your own domain name to Azure AD.
Federated domain members: A member created in Azure AD with a federated domain. For more
information, see Microsoft Azure now supports federation with Windows Server Active Directory.
Imported members from other Azure ADs who are native or federated domain members.
Active Directory groups created as security groups.
Azure AD users that are part of a group that has the db_owner role cannot use the CREATE
DATABASE SCOPED CREDENTIAL syntax against Azure SQL Database and Azure SQL Data
Warehouse. You will see the following error:
SQL Error [2760] [S0001]: The specified schema name 'user@mydomain.com' either does not exist or you do
not have permission to use it.

Grant the db_owner role directly to the individual Azure AD user to mitigate the CREATE DATABASE
SCOPED CREDENTIAL issue (see the sketch after this list).
These system functions return NULL values when executed under Azure AD principals:
SUSER_ID()
SUSER_NAME(<admin ID>)
SUSER_SNAME(<admin SID>)
SUSER_ID(<admin name>)
SUSER_SID(<admin name>)
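The following is an illustrative sketch of the db_owner mitigation noted above; the user name, credential name,
and secret are placeholders, not values from this article.

-- Run as the Azure AD admin in the user database; names are illustrative.
CREATE USER [analyst@mydomain.com] FROM EXTERNAL PROVIDER;

-- Grant db_owner to the individual user (not only via a group) to avoid the
-- CREATE DATABASE SCOPED CREDENTIAL error described above.
ALTER ROLE db_owner ADD MEMBER [analyst@mydomain.com];          -- SQL Database
-- EXEC sp_addrolemember 'db_owner', 'analyst@mydomain.com';    -- SQL Data Warehouse

-- Then, executed by analyst@mydomain.com:
CREATE DATABASE SCOPED CREDENTIAL MyStorageCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas_token_placeholder>';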

Managed Instances
Azure AD server principals (logins) and users are supported as a preview feature for Managed Instances.
Setting Azure AD server principals (logins) mapped to an Azure AD group as database owner is not supported
in Managed Instances.
An extension of this is that when a group is added as part of the dbcreator server role, users from this
group can connect to the Managed Instance and create new databases, but will not be able to access the
database. This is because the new database owner is SA, and not the Azure AD user. This issue does not
manifest if the individual user is added to the dbcreator server role.
SQL Agent management and jobs execution is supported for Azure AD server principals (logins).
Database backup and restore operations can be executed by Azure AD server principals (logins).
Auditing of all statements related to Azure AD server principals (logins) and authentication events is
supported.
Dedicated administrator connection for Azure AD server principals (logins) which are members of sysadmin
server role is supported.
Supported through SQLCMD Utility and SQL Server Management Studio.
Logon triggers are supported for logon events coming from Azure AD server principals (logins).
Service Broker and DB mail can be setup using an Azure AD server principal (login).

Connecting using Azure AD identities


Azure Active Directory authentication supports the following methods of connecting to a database using Azure
AD identities:
Azure Active Directory Password
Azure Active Directory Integrated
Azure Active Directory Universal with MFA
Using Application token authentication
The following authentication methods are supported for Azure AD server principals (logins) (public preview):
Azure Active Directory Password
Azure Active Directory Integrated
Azure Active Directory Universal with MFA
Additional considerations
To enhance manageability, we recommend you provision a dedicated Azure AD group as an administrator.
Only one Azure AD administrator (a user or group) can be configured for an Azure SQL Database server or
Azure SQL Data Warehouse at any time.
The addition of Azure AD server principals (logins) for Managed Instances (public preview) allows the
possibility of creating multiple Azure AD server principals (logins) that can be added to the sysadmin
role.
Only an Azure AD administrator for SQL Server can initially connect to the Azure SQL Database server,
Managed Instance, or Azure SQL Data Warehouse using an Azure Active Directory account. The Active
Directory administrator can configure subsequent Azure AD database users.
We recommend setting the connection timeout to 30 seconds.
SQL Server 2016 Management Studio and SQL Server Data Tools for Visual Studio 2015 (version
14.0.60311.1, April 2016, or later) support Azure Active Directory authentication. (Azure AD authentication is
supported by the .NET Framework Data Provider for SqlServer in .NET Framework 4.6 or later.) Therefore, the
newest versions of these tools and data-tier applications (DAC and .BACPAC) can use Azure AD
authentication.
Beginning with version 15.0.1, sqlcmd utility and bcp utility support Active Directory Interactive authentication
with MFA.
SQL Server Data Tools for Visual Studio 2015 requires at least the April 2016 version of the Data Tools
(version 14.0.60311.1). Currently Azure AD users are not shown in SSDT Object Explorer. As a workaround,
view the users in sys.database_principals.
Microsoft JDBC Driver 6.0 for SQL Server supports Azure AD authentication. Also, see Setting the Connection
Properties.
PolyBase cannot authenticate by using Azure AD authentication.
Azure AD authentication is supported for SQL Database by the Azure portal Import Database and Export
Database blades. Import and export using Azure AD authentication is also supported from the PowerShell
command.
Azure AD authentication is supported for SQL Database, Managed Instance, and SQL Data Warehouse by using
the CLI. For more information, see Configure and manage Azure Active Directory authentication with SQL
Database or SQL Data Warehouse and SQL Server - az sql server.

Next steps
To learn how to create and populate Azure AD, and then configure Azure AD with Azure SQL Database or
Azure SQL Data Warehouse, see Configure and manage Azure Active Directory authentication with SQL
Database, Managed Instance, or SQL Data Warehouse.
For a tutorial of using Azure AD server principals (logins) with Managed Instances, see Azure AD server
principals (logins) with Managed Instances
For an overview of access and control in SQL Database, see SQL Database access and control.
For an overview of logins, users, and database roles in SQL Database, see Logins, users, and database roles.
For more information about database principals, see Principals.
For more information about database roles, see Database roles.
For syntax on creating Azure AD server principals (logins) for Managed Instances, see CREATE LOGIN.
For more information about firewall rules in SQL Database, see SQL Database firewall rules.
Controlling and granting database access to SQL
Database and SQL Data Warehouse
8/14/2019 • 12 minutes to read • Edit Online

After firewall rules configuration, you can connect to Azure SQL Database and SQL Data Warehouse as one of
the administrator accounts, as the database owner, or as a database user in the database.

NOTE
This topic applies to Azure SQL server, and to SQL Database and SQL Data Warehouse databases created on the Azure SQL
server. For simplicity, SQL Database is used when referring to both SQL Database and SQL Data Warehouse.

TIP
For a tutorial, see Secure your Azure SQL Database. This tutorial does not apply to Azure SQL Database Managed
Instance.

Unrestricted administrative accounts


There are two administrative accounts (Server admin and Active Directory admin) that act as administrators.
To identify these administrator accounts for your SQL server, open the Azure portal, and navigate to the
Properties tab of your SQL server or SQL Database.

Server admin
When you create an Azure SQL server, you must designate a Server admin login. SQL server creates
that account as a login in the master database. This account connects using SQL Server authentication
(user name and password). Only one of these accounts can exist.

NOTE
To reset the password for the server admin, go to the Azure portal, click SQL Servers, select the server from the
list, and then click Reset Password.

Azure Active Directory admin


One Azure Active Directory account, either an individual or security group account, can also be configured
as an administrator. It is optional to configure an Azure AD administrator, but an Azure AD administrator
must be configured if you want to use Azure AD accounts to connect to SQL Database. For more
information about configuring Azure Active Directory access, see Connecting to SQL Database or SQL
Data Warehouse By Using Azure Active Directory Authentication and SSMS support for Azure AD MFA
with SQL Database and SQL Data Warehouse.
The Server admin and Azure AD admin accounts have the following characteristics:
Are the only accounts that can automatically connect to any SQL Database on the server. (To connect to a user
database, other accounts must either be the owner of the database, or have a user account in the user
database.)
These accounts enter user databases as the dbo user and they have all the permissions in the user databases.
(The owner of a user database also enters the database as the dbo user.)
Do not enter the master database as the dbo user, and have limited permissions in master.
Are not members of the standard SQL Server sysadmin fixed server role, which is not available in SQL
Database.
Can create, alter, and drop databases, logins, users in master, and server-level IP firewall rules.
Can add and remove members to the dbmanager and loginmanager roles.
Can view the sys.sql_logins system table.
Configuring the firewall
When the server-level firewall is configured for an individual IP address or range, the SQL server admin and the
Azure Active Directory admin can connect to the master database and all the user databases. The initial server-
level firewall can be configured through the Azure portal, using PowerShell or using the REST API. Once a
connection is made, additional server-level IP firewall rules can also be configured by using Transact-SQL.
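For example, once connected to the master database as an administrator, a server-level IP firewall rule can be
added with the sp_set_firewall_rule procedure; the rule name and IP range below are placeholders.

-- Run in the master database; name and addresses are illustrative.
EXECUTE sp_set_firewall_rule
    @name = N'ContosoClientRule',
    @start_ip_address = '192.168.1.1',
    @end_ip_address = '192.168.1.10';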
Administrator access path
When the server-level firewall is properly configured, the SQL server admin and the Azure Active Directory
admin can connect using client tools such as SQL Server Management Studio or SQL Server Data Tools. Only
the latest tools provide all the features and capabilities. The following diagram shows a typical configuration for
the two administrator accounts.
When using an open port in the server-level firewall, administrators can connect to any SQL Database.
Connecting to a database by using SQL Server Management Studio
For a walk-through of creating a server, a database, server-level IP firewall rules, and using SQL Server
Management Studio to query a database, see Get started with Azure SQL Database servers, databases, and
firewall rules by using the Azure portal and SQL Server Management Studio.

IMPORTANT
It is recommended that you always use the latest version of Management Studio to remain synchronized with updates to
Microsoft Azure and SQL Database. Update SQL Server Management Studio.

Additional server-level administrative roles


IMPORTANT
This section does not apply to Azure SQL Database Managed Instance as these roles are specific to Azure SQL
Database.

In addition to the server-level administrative roles discussed previously, SQL Database provides two restricted
administrative roles in the master database to which user accounts can be added that grant permissions to either
create databases or manage logins.
Database creators
One of these administrative roles is the dbmanager role. Members of this role can create new databases. To use
this role, you create a user in the master database and then add the user to the dbmanager database role. To
create a database, the user must be a user based on a SQL Server login in the master database or contained
database user based on an Azure Active Directory user.
1. Using an administrator account, connect to the master database.
2. Create a SQL Server authentication login, using the CREATE LOGIN statement. Sample statement:

CREATE LOGIN Mary WITH PASSWORD = '<strong_password>';


NOTE
Use a strong password when creating a login or contained database user. For more information, see Strong
Passwords.

To improve performance, logins (server-level principals) are temporarily cached at the database level. To
refresh the authentication cache, see DBCC FLUSHAUTHCACHE.
3. In the master database, create a user by using the CREATE USER statement. The user can be an Azure
Active Directory authentication contained database user (if you have configured your environment for
Azure AD authentication), or a SQL Server authentication contained database user, or a SQL Server
authentication user based on a SQL Server authentication login (created in the previous step.) Sample
statements:

CREATE USER [mike@contoso.com] FROM EXTERNAL PROVIDER; -- To create a user with Azure Active Directory
CREATE USER Ann WITH PASSWORD = '<strong_password>'; -- To create a SQL Database contained database
user
CREATE USER Mary FROM LOGIN Mary; -- To create a SQL Server user based on a SQL Server authentication
login

4. Add the new user, to the dbmanager database role in master using the ALTER ROLE statement. Sample
statements:

ALTER ROLE dbmanager ADD MEMBER Mary;


ALTER ROLE dbmanager ADD MEMBER [mike@contoso.com];

NOTE
The dbmanager is a database role in master database so you can only add a database user to the dbmanager role.
You cannot add a server-level login to database-level role.

5. If necessary, configure a firewall rule to allow the new user to connect. (The new user might be covered by
an existing firewall rule.)
Now the user can connect to the master database and can create new databases. The account creating the
database becomes the owner of the database.
Login managers
The other administrative role is the login manager role. Members of this role can create new logins in the master
database. If you wish, you can complete the same steps (create a login and user, and add a user to the
loginmanager role) to enable a user to create new logins in the master. Usually logins are not necessary as
Microsoft recommends using contained database users, which authenticate at the database-level instead of using
users based on logins. For more information, see Contained Database Users - Making Your Database Portable.
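A minimal sketch of those steps, with placeholder names, run in the master database by an administrator:

-- In the master database, as a server administrator:
CREATE LOGIN LoginCreator WITH PASSWORD = '<strong_password>';
CREATE USER LoginCreator FROM LOGIN LoginCreator;

-- Allow this user to create new logins.
ALTER ROLE loginmanager ADD MEMBER LoginCreator;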

Non-administrator users
Generally, non-administrator accounts do not need access to the master database. Create contained database
users at the database level using the CREATE USER (Transact-SQL ) statement. The user can be an Azure Active
Directory authentication contained database user (if you have configured your environment for Azure AD
authentication), or a SQL Server authentication contained database user, or a SQL Server authentication user
based on a SQL Server authentication login (created in the previous step.) For more information, see Contained
Database Users - Making Your Database Portable.
To create users, connect to the database, and execute statements similar to the following examples:

CREATE USER Mary FROM LOGIN Mary;


CREATE USER [mike@contoso.com] FROM EXTERNAL PROVIDER;

Initially, only one of the administrators or the owner of the database can create users. To authorize additional
users to create new users, grant that selected user the ALTER ANY USER permission, by using a statement such as:

GRANT ALTER ANY USER TO Mary;

To give additional users full control of the database, make them a member of the db_owner fixed database role.
In Azure SQL Database use the ALTER ROLE statement.

ALTER ROLE db_owner ADD MEMBER Mary;

In Azure SQL Data Warehouse use EXEC sp_addrolemember.

EXEC sp_addrolemember 'db_owner', 'Mary';

NOTE
One common reason to create a database user based on a SQL Database server login is for users that need access to
multiple databases. Since contained database users are individual entities, each database maintains its own user and its own
password. This can cause overhead as the user must then remember each password for each database, and it can become
untenable when having to change multiple passwords for many databases. However, when using SQL Server Logins and
high availability (active geo-replication and failover groups), the SQL Server logins must be set manually at each server.
Otherwise, the database user will no longer be mapped to the server login after a failover occurs, and will not be able to
access the database post failover. For more information on configuring logins for geo-replication, please see Configure and
manage Azure SQL Database security for geo-restore or failover.

Configuring the database-level firewall


As a best practice, non-administrator users should only have access through the firewall to the databases that
they use. Instead of authorizing their IP addresses through the server-level firewall and giving them access to all
databases, use the sp_set_database_firewall_rule statement to configure the database-level firewall. The database-
level firewall cannot be configured by using the portal.
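For example, while connected to the user database, a database-level IP firewall rule can be created with the
sp_set_database_firewall_rule procedure; the rule name and IP address below are placeholders.

-- Run in the user database; name and address are illustrative.
EXECUTE sp_set_database_firewall_rule
    @name = N'ContosoUserRule',
    @start_ip_address = '203.0.113.10',
    @end_ip_address = '203.0.113.10';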
Non-administrator access path
When the database-level firewall is properly configured, the database users can connect using client tools such as
SQL Server Management Studio or SQL Server Data Tools. Only the latest tools provide all the features and
capabilities. The following diagram shows a typical non-administrator access path.
Groups and roles
Efficient access management uses permissions assigned to groups and roles instead of individual users.
When using Azure Active Directory authentication, put Azure Active Directory users into an Azure Active
Directory group. Create a contained database user for the group. Place one or more database users into a
database role and then assign permissions to the database role.
When using SQL Server authentication, create contained database users in the database. Place one or
more database users into a database role and then assign permissions to the database role.
The database roles can be the built-in roles such as db_owner, db_ddladmin, db_datawriter, db_datareader,
db_denydatawriter, and db_denydatareader. db_owner is commonly used to grant full permission to only a
few users. The other fixed database roles are useful for getting a simple database in development quickly, but are
not recommended for most production databases. For example, the db_datareader fixed database role grants
read access to every table in the database, which is usually more than is strictly necessary. It is far better to use
the CREATE ROLE statement to create your own user-defined database roles and carefully grant each role the
least permissions necessary for the business need. When a user is a member of multiple roles, they aggregate the
permissions of them all.
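The following sketch illustrates that pattern, assuming a hypothetical Azure AD group named SalesAnalysts and
a Sales schema; neither name comes from this article.

-- Create a contained database user for the Azure AD group.
CREATE USER [SalesAnalysts] FROM EXTERNAL PROVIDER;

-- Create a user-defined role with only the permissions the group needs.
CREATE ROLE SalesReader;
GRANT SELECT ON SCHEMA::Sales TO SalesReader;

-- Add the group's database user to the role.
ALTER ROLE SalesReader ADD MEMBER [SalesAnalysts];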

Permissions
There are over 100 permissions that can be individually granted or denied in SQL Database. Many of these
permissions are nested. For example, the UPDATE permission on a schema includes the UPDATE permission on
each table within that schema. As in most permission systems, the denial of a permission overrides a grant.
Because of the nested nature and the number of permissions, it can take careful study to design an appropriate
permission system to properly protect your database. Start with the list of permissions at Permissions (Database
Engine) and review the poster size graphic of the permissions.
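For instance, because permissions are nested and a denial overrides a grant, the following sketch (which assumes
a hypothetical ReportReader role and a dbo.Payroll table) grants read access at the schema level while blocking
one sensitive table:

-- The schema-level grant implicitly covers every table in the schema.
GRANT SELECT ON SCHEMA::dbo TO ReportReader;

-- A DENY on a single object overrides the schema-level grant for that object.
DENY SELECT ON OBJECT::dbo.Payroll TO ReportReader;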
Considerations and restrictions
When managing logins and users in SQL Database, consider the following:
You must be connected to the master database when executing the CREATE/ALTER/DROP DATABASE
statements.
The database user corresponding to the Server admin login cannot be altered or dropped.
US-English is the default language of the Server admin login.
Only the administrators (Server admin login or Azure AD administrator) and the members of the
dbmanager database role in the master database have permission to execute the CREATE DATABASE and
DROP DATABASE statements.

You must be connected to the master database when executing the CREATE/ALTER/DROP LOGIN statements.
However using logins is discouraged. Use contained database users instead.
To connect to a user database, you must provide the name of the database in the connection string.
Only the server-level principal login and the members of the loginmanager database role in the master
database have permission to execute the CREATE LOGIN , ALTER LOGIN , and DROP LOGIN statements.
When executing the CREATE/ALTER/DROP LOGIN and CREATE/ALTER/DROP DATABASE statements in an ADO.NET
application, using parameterized commands is not allowed. For more information, see Commands and
Parameters.
When executing the CREATE/ALTER/DROP DATABASE and CREATE/ALTER/DROP LOGIN statements, each of these
statements must be the only statement in a Transact-SQL batch. Otherwise, an error occurs. For example,
the following Transact-SQL checks whether the database exists. If it exists, a DROP DATABASE statement is
called to remove the database. Because the DROP DATABASE statement is not the only statement in the batch,
executing the following Transact-SQL statement results in an error.

IF EXISTS (SELECT [name]
           FROM [sys].[databases]
           WHERE [name] = N'database_name')
    DROP DATABASE [database_name];
GO

Instead, use the following Transact-SQL statement:

DROP DATABASE IF EXISTS [database_name]

When executing the CREATE USER statement with the FOR/FROM LOGIN option, it must be the only statement
in a Transact-SQL batch.
When executing the ALTER USER statement with the WITH LOGIN option, it must be the only statement in a
Transact-SQL batch.
To CREATE/ALTER/DROP a user requires the ALTER ANY USER permission on the database.
When the owner of a database role tries to add or remove another database user to or from that database
role, the following error may occur: User or role 'Name' does not exist in this database. This error
occurs because the user is not visible to the owner. To resolve this issue, grant the role owner the
VIEW DEFINITION permission on the user.

Next steps
To learn more about firewall rules, see Azure SQL Database Firewall.
For an overview of all the SQL Database security features, see SQL security overview.
For a tutorial, see Secure your Azure SQL Database.
For information about views and stored procedures, see Creating views and stored procedures
For information about granting access to a database object, see Granting Access to a Database Object
Using Multi-factor AAD authentication with Azure
SQL Database and Azure SQL Data Warehouse
(SSMS support for MFA)
8/25/2019 • 5 minutes to read • Edit Online

Azure SQL Database and Azure SQL Data Warehouse support connections from SQL Server Management
Studio (SSMS ) using Active Directory Universal Authentication. This article discusses the differences between the
various authentication options, and also the limitations associated with using Universal Authentication.
Download the latest SSMS - On the client computer, download the latest version of SSMS, from Download
SQL Server Management Studio (SSMS ).
For all the features discussed in this article, use at least the July 2017 release, version 17.2. The most recent
connection dialog box should look similar to the following image:

The five authentication options


Active Directory Universal Authentication supports the two non-interactive authentication methods:
Active Directory - Password authentication
Active Directory - Integrated authentication

These two non-interactive authentication methods can also be used in many different applications (ADO.NET,
JDBC, ODBC, and so on), and they never result in pop-up dialog boxes:
Active Directory - Password
Active Directory - Integrated

The interactive method that also supports Azure Multi-Factor Authentication (MFA) is:
Active Directory - Universal with MFA

Azure MFA helps safeguard access to data and applications while meeting user demand for a simple sign-in
process. It delivers strong authentication with a range of easy verification options (phone call, text message,
smart card with a PIN, or mobile app notification), allowing users to choose the method they prefer. Interactive MFA with
Azure AD can result in a pop-up dialog box for validation.
For a description of Multi-Factor Authentication, see Multi-Factor Authentication. For configuration steps, see
Configure Azure SQL Database multi-factor authentication for SQL Server Management Studio.
Azure AD domain name or tenant ID parameter
Beginning with SSMS version 17, users that are imported into the current Active Directory from other Azure
Active Directories as guest users, can provide the Azure AD domain name, or tenant ID when they connect. Guest
users include users invited from other Azure ADs, Microsoft accounts such as outlook.com, hotmail.com, live.com,
or other accounts like gmail.com. This information allows Active Directory Universal with MFA
Authentication to identify the correct authenticating authority. This option is also required to support Microsoft
accounts (MSA) such as outlook.com, hotmail.com, live.com, or non-MSA accounts. All these users who want to
be authenticated using Universal Authentication must enter their Azure AD domain name or tenant ID. This
parameter represents the current Azure AD domain name/tenant ID the Azure Server is linked with. For example,
if Azure Server is associated with Azure AD domain contosotest.onmicrosoft.com where user
joe@contosodev.onmicrosoft.com is hosted as an imported user from Azure AD domain
contosodev.onmicrosoft.com , the domain name required to authenticate this user is contosotest.onmicrosoft.com .
When the user is a native user of the Azure AD linked to Azure Server, and is not an MSA account, no domain
name or tenant ID is required. To enter the parameter (beginning with SSMS version 17.2), in the Connect to
Database dialog box, complete the dialog box, selecting Active Directory - Universal with MFA authentication,
click Options, complete the User name box, and then click the Connection Properties tab. Check the AD
domain name or tenant ID box, and provide authenticating authority, such as the domain name
(contosotest.onmicrosoft.com ) or the GUID of the tenant ID.
If you are running SSMS 18.x or later, the AD domain name or tenant ID is no longer needed for guest users,
because SSMS 18.x or later recognizes it automatically.
Azure AD business to business support
Azure AD users supported for Azure AD B2B scenarios as guest users (see What is Azure B2B collaboration) can
connect to SQL Database and SQL Data Warehouse only as part of members of a group created in current Azure
AD and mapped manually using the Transact-SQL CREATE USER statement in a given database. For example, if
steve@gmail.com is invited to Azure AD contosotest (with the Azure AD domain contosotest.onmicrosoft.com), an
Azure AD group, such as usergroup must be created in the Azure AD that contains the steve@gmail.com member.
Then, this group must be created for a specific database (that is, MyDatabase) by Azure AD SQL admin or Azure
AD DBO by executing a Transact-SQL CREATE USER [usergroup] FROM EXTERNAL PROVIDER statement. After the
database user is created, then the user steve@gmail.com can log in to MyDatabase using the SSMS authentication
option Active Directory - Universal with MFA support. The usergroup, by default, has only the connect permission;
any further data access needs to be granted in the normal way. Note that user steve@gmail.com, as a
guest user, must check the box and add the AD domain name contosotest.onmicrosoft.com in the SSMS
Connection Property dialog box. The AD domain name or tenant ID option is only supported for the
Universal with MFA connection options, otherwise it is greyed out.
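Restated as a T-SQL sketch (the group name and the schema grant are illustrative; run by the Azure AD admin in
MyDatabase):

-- Map the Azure AD group that contains the guest user to a database user.
CREATE USER [usergroup] FROM EXTERNAL PROVIDER;

-- The group gets only CONNECT by default; grant further access as needed.
GRANT SELECT ON SCHEMA::dbo TO [usergroup];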

Universal Authentication limitations for SQL Database and SQL Data Warehouse
SSMS and SqlPackage.exe are the only tools currently enabled for MFA through Active Directory Universal
Authentication.
SSMS version 17.2, supports multi-user concurrent access using Universal Authentication with MFA. Version
17.0 and 17.1, restricted a login for an instance of SSMS using Universal Authentication to a single Azure
Active Directory account. To log in as another Azure AD account, you must use another instance of SSMS.
(This restriction is limited to Active Directory Universal Authentication; you can log in to different servers using
Active Directory Password Authentication, Active Directory Integrated Authentication, or SQL Server
Authentication).
SSMS supports Active Directory Universal Authentication for Object Explorer, Query Editor, and Query Store
visualization.
SSMS version 17.2 provides DacFx Wizard support for Export/Extract/Deploy Data database. Once a specific
user is authenticated through the initial authentication dialog using Universal Authentication, the DacFx Wizard
functions the same way it does for all other authentication methods.
The SSMS Table Designer does not support Universal Authentication.
There are no additional software requirements for Active Directory Universal Authentication except that you
must use a supported version of SSMS.
The Active Directory Authentication Library (ADAL) version for Universal Authentication was updated to its
latest available released version, ADAL.dll 3.13.9. See Active Directory Authentication Library 3.14.1.

Next steps
For configuration steps, see Configure Azure SQL Database multi-factor authentication for SQL Server
Management Studio.
Grant others access to your database: SQL Database Authentication and Authorization: Granting Access
Make sure others can connect through the firewall: Configure an Azure SQL Database server-level firewall rule
using the Azure portal
Configure and manage Azure Active Directory authentication with SQL Database or SQL Data Warehouse
Microsoft SQL Server Data-Tier Application Framework (17.0.0 GA)
SQLPackage.exe
Import a BACPAC file to a new Azure SQL Database
Export an Azure SQL database to a BACPAC file
C# interface IUniversalAuthProvider Interface
When using Active Directory - Universal with MFA authentication, ADAL tracing is available beginning with
SSMS 17.3. Off by default, you can turn on ADAL tracing by using the Tools, Options menu, under Azure
Services, Azure Cloud, ADAL Output Window Trace Level, and then enabling Output in the View
menu. The traces are available in the output window when the Azure Active Directory option is selected.
Azure SQL Database and SQL Data Warehouse
access control
7/26/2019 • 4 minutes to read • Edit Online

To provide security, Azure SQL Database and SQL Data Warehouse control access with firewall rules limiting
connectivity by IP address, authentication mechanisms requiring users to prove their identity, and authorization
mechanisms limiting users to specific actions and data.

IMPORTANT
For an overview of the SQL Database security features, see SQL security overview. For a tutorial, see Secure your Azure SQL
Database. For an overview of SQL Data Warehouse security features, see SQL Data Warehouse security overview

Firewall and firewall rules


Microsoft Azure SQL Database provides a relational database service for Azure and other Internet-based
applications. To help protect your data, firewalls prevent all access to your database server until you specify which
computers have permission. The firewall grants access to databases based on the originating IP address of each
request. For more information, see Overview of Azure SQL Database firewall rules
The Azure SQL Database service is only available through TCP port 1433. To access a SQL Database from your
computer, ensure that your client computer firewall allows outgoing TCP communication on TCP port 1433. If not
needed for other applications, block inbound connections on TCP port 1433.
As part of the connection process, connections from Azure virtual machines are redirected to a different IP address
and port, unique for each worker role. The port number is in the range from 11000 to 11999. For more
information about TCP ports, see Ports beyond 1433 for ADO.NET 4.5 and SQL Database.

Authentication
SQL Database supports two types of authentication:
SQL Authentication:
This authentication method uses a username and password. When you created the SQL Database server
for your database, you specified a "server admin" login with a username and password. Using these
credentials, you can authenticate to any database on that server as the database owner, or "dbo."
Azure Active Directory Authentication:
This authentication method uses identities managed by Azure Active Directory and is supported for
managed and integrated domains. If you want to use Azure Active Directory Authentication, you must
create another server admin called the "Azure AD admin," which is allowed to administer Azure AD users
and groups. This admin can also perform all operations that a regular server admin can. See Connecting to
SQL Database By Using Azure Active Directory Authentication for a walkthrough of how to create an Azure
AD admin to enable Azure Active Directory Authentication.
The Database Engine closes connections that remain idle for more than 30 minutes. The connection must log in
again before it can be used. Continuously active connections to SQL Database require reauthorization (performed
by the database engine) at least every 10 hours. The database engine attempts reauthorization using the originally
submitted password and no user input is required. For performance reasons, when a password is reset in SQL
Database, the connection is not reauthenticated, even if the connection is reset due to connection pooling. This is
different from the behavior of on-premises SQL Server. If the password has been changed since the connection
was initially authorized, the connection must be terminated and a new connection made using the new password.
A user with the KILL DATABASE CONNECTION permission can explicitly terminate a connection to SQL Database by
using the KILL command.
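As a minimal sketch, a user holding the KILL DATABASE CONNECTION permission could locate and terminate a
session as follows; the session ID shown is a placeholder.

-- Find user sessions connected to the database.
SELECT session_id, login_name, status
FROM sys.dm_exec_sessions
WHERE is_user_process = 1;

-- Terminate a specific session (52 is an illustrative session_id).
KILL 52;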
User accounts can be created in the master database and can be granted permissions in all databases on the
server, or they can be created in the database itself (called contained users). For information on creating and
managing logins, see Manage logins. Use contained databases to enhance portability and scalability. For more
information on contained users, see Contained Database Users - Making Your Database Portable, CREATE USER
(Transact-SQL ), and Contained Databases.
As a best practice, your application should use a dedicated account to authenticate. This way, you can limit the
permissions granted to the application and reduce the risks of malicious activity in case your application code is
vulnerable to a SQL injection attack. The recommended approach is to create a contained database user, which
allows your app to authenticate directly to the database.

Authorization
Authorization refers to what a user can do within an Azure SQL Database, and this is controlled by your user
account's database role memberships and object-level permissions. As a best practice, you should grant users the
least privileges necessary. The server admin account you are connecting with is a member of db_owner, which has
authority to do anything within the database. Save this account for deploying schema upgrades and other
management operations. Use the "ApplicationUser" account with more limited permissions to connect from your
application to the database with the least privileges needed by your application. For more information, see Manage
logins.
Typically, only administrators need access to the master database. Routine access to each user database should be
through non-administrator contained database users created in each database. When you use contained database
users, you do not need to create logins in the master database. For more information, see Contained Database
Users - Making Your Database Portable.
You should familiarize yourself with the following features that can be used to limit or elevate permissions:
Impersonation and module-signing can be used to securely elevate permissions temporarily.
Row-Level Security can be used to limit which rows a user can access.
Data Masking can be used to limit exposure of sensitive data.
Stored procedures can be used to limit the actions that can be taken on the database.

Next steps
For an overview of the SQL Database security features, see SQL security overview.
To learn more about firewall rules, see Firewall rules.
To learn about users and logins, see Manage logins.
For a discussion of proactive monitoring, see Database Auditing and SQL Database threat detection.
For a tutorial, see Secure your Azure SQL Database.
Column-level Security
4/2/2019 • 2 minutes to read • Edit Online

Column-Level Security (CLS ) enables customers to control access to database table columns based on the user's
execution context or their group membership.

CLS simplifies the design and coding of security in your application. CLS enables you to implement restrictions on
column access to protect sensitive data. For example, ensuring that specific users can access only certain columns
of a table pertinent to their department. The access restriction logic is located in the database tier rather than away
from the data in another application tier. The database applies the access restrictions every time that data access is
attempted from any tier. This restriction makes your security system more reliable and robust by reducing the
surface area of your overall security system. In addition, CLS eliminates the need to introduce views that filter
out columns to impose access restrictions on users.
You implement CLS with the GRANT T-SQL statement. With this mechanism, both SQL and Azure Active
Directory (AAD) authentication are supported.

Syntax
GRANT <permission> [ ,...n ] ON
[ OBJECT :: ][ schema_name ]. object_name [ ( column [ ,...n ] ) ]
TO <database_principal> [ ,...n ]
[ WITH GRANT OPTION ]
[ AS <database_principal> ]
<permission> ::=
SELECT
| UPDATE
<database_principal> ::=
Database_user
| Database_role
| Database_user_mapped_to_Windows_User
| Database_user_mapped_to_Windows_Group

Example
The following example shows how to restrict ‘TestUser’ from accessing ‘SSN’ column of ‘Membership’ table:
Create ‘Membership’ table with SSN column used to store social security numbers:
CREATE TABLE Membership
(MemberID int IDENTITY,
FirstName varchar(100) NULL,
SSN char(9) NOT NULL,
LastName varchar(100) NOT NULL,
Phone varchar(12) NULL,
Email varchar(100) NULL);

Allow ‘TestUser’ to access all columns except for SSN column that has sensitive data:

GRANT SELECT ON Membership(MemberID, FirstName, LastName, Phone, Email) TO TestUser;

Queries executed as ‘TestUser’ will fail if they include the SSN column:

SELECT * FROM Membership;

Msg 230, Level 14, State 1, Line 12


The SELECT permission was denied on the column 'SSN' of the object 'Membership', database 'CLS_TestDW', schema
'dbo'.

Use Cases
Some examples of how CLS is being used today:
A financial services firm allows only account managers to have access to customer social security numbers
(SSN ), phone numbers, and other personally identifiable information (PII).
A health care provider allows only doctors and nurses to have access to sensitive medical records while not
allowing members of the billing department to view this data.
Dynamic data masking for Azure SQL Database and
Data Warehouse
10/16/2019 • 3 minutes to read • Edit Online

SQL Database dynamic data masking limits sensitive data exposure by masking it to non-privileged users.
Dynamic data masking helps prevent unauthorized access to sensitive data by enabling customers to designate
how much of the sensitive data to reveal with minimal impact on the application layer. It’s a policy-based security
feature that hides the sensitive data in the result set of a query over designated database fields, while the data in
the database is not changed.
For example, a service representative at a call center may identify callers by several digits of their credit card
number, but those data items should not be fully exposed to the service representative. A masking rule can be
defined that masks all but the last four digits of any credit card number in the result set of any query. As another
example, an appropriate data mask can be defined to protect personally identifiable information (PII) data, so that a
developer can query production environments for troubleshooting purposes without violating compliance
regulations.

Dynamic data masking basics


You set up a dynamic data masking policy in the Azure portal by selecting the dynamic data masking operation in
your SQL Database configuration blade or settings blade.
Dynamic data masking permissions
Dynamic data masking can be configured by the Azure SQL Database admin, server admin, or SQL Security
Manager roles.
Dynamic data masking policy
SQL users excluded from masking - A set of SQL users or AAD identities that get unmasked data in the SQL
query results. Users with administrator privileges are always excluded from masking, and see the original data
without any mask.
Masking rules - A set of rules that define the designated fields to be masked and the masking function that is
used. The designated fields can be defined using a database schema name, table name, and column name.
Masking functions - A set of methods that control the exposure of data for different scenarios.

Default: Full masking according to the data types of the designated fields.
• Use XXXX or fewer Xs if the size of the field is less than 4 characters for string data types (nchar, ntext, nvarchar).
• Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).
• Use 01-01-1900 for date/time data types (date, datetime2, datetime, datetimeoffset, smalldatetime, time).
• For SQL variant, the default value of the current type is used.
• For XML, the document <masked/> is used.
• Use an empty value for special data types (timestamp table, hierarchyid, GUID, binary, image, varbinary spatial types).

Credit card: Masking method, which exposes the last four digits of the designated fields and adds a constant string as a prefix in the form of a credit card. Example: XXXX-XXXX-XXXX-1234

Email: Masking method, which exposes the first letter and replaces the domain with XXX.com using a constant string prefix in the form of an email address. Example: aXX@XXXX.com

Random number: Masking method, which generates a random number according to the selected boundaries and actual data types. If the designated boundaries are equal, then the masking function is a constant number.

Custom text: Masking method, which exposes the first and last characters and adds a custom padding string in the middle. If the original string is shorter than the exposed prefix and suffix, only the padding string is used. Example: prefix[padding]suffix

Recommended fields to mask


The DDM recommendations engine flags certain fields from your database as potentially sensitive fields, which
may be good candidates for masking. In the Dynamic Data Masking blade in the portal, you will see the
recommended columns for your database. All you need to do is click Add Mask for one or more columns and
then Save to apply a mask for these fields.
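In Azure SQL Database, masks can also be defined directly in T-SQL with the MASKED WITH column attribute.
The sketch below reuses the Membership table from the column-level security example earlier in this document
and assumes a hypothetical DataAnalysts role; treat it as an illustration rather than a required step.

-- Mask existing columns (table and column names follow the earlier example).
ALTER TABLE Membership
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

ALTER TABLE Membership
ALTER COLUMN Phone ADD MASKED WITH (FUNCTION = 'partial(1,"XXX-XXX-",2)');

-- Exclude a role from masking so its members see the original data.
GRANT UNMASK TO DataAnalysts;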

Set up dynamic data masking for your database using PowerShell cmdlets
See Azure SQL Database Cmdlets.

Set up dynamic data masking for your database using REST API
See Operations for Azure SQL Database.
Transparent data encryption for SQL Database and
Data Warehouse
10/25/2019 • 10 minutes to read • Edit Online

Transparent data encryption (TDE) helps protect Azure SQL Database, Azure SQL Managed Instance, and Azure
SQL Data Warehouse against the threat of malicious offline activity by encrypting data at rest. It performs real-time
encryption and decryption of the database, associated backups, and transaction log files at rest without requiring
changes to the application. By default, TDE is enabled for all newly deployed Azure SQL databases. TDE cannot be
used to encrypt the logical master database in SQL Database. The master database contains objects that are
needed to perform the TDE operations on the user databases.
TDE needs to be manually enabled for older databases of Azure SQL Database, Azure SQL Managed Instance, or
Azure SQL Data Warehouse. Managed Instance databases created through restore inherit encryption status from
the source database.
Transparent data encryption encrypts the storage of an entire database by using a symmetric key called the
database encryption key. This database encryption key is protected by the transparent data encryption protector.
The protector is either a service-managed certificate (service-managed transparent data encryption) or an
asymmetric key stored in Azure Key Vault (Bring Your Own Key). You set the transparent data encryption protector
at the server level for Azure SQL Database and Data Warehouse, and instance level for Azure SQL Managed
Instance. The term server refers both to server and instance throughout this document, unless stated differently.
On database startup, the encrypted database encryption key is decrypted and then used for decryption and re-
encryption of the database files in the SQL Server Database Engine process. Transparent data encryption performs
real-time I/O encryption and decryption of the data at the page level. Each page is decrypted when it's read into
memory and then encrypted before being written to disk. For a general description of transparent data encryption,
see Transparent data encryption.
SQL Server running on an Azure virtual machine also can use an asymmetric key from Key Vault. The
configuration steps are different from using an asymmetric key in SQL Database and SQL Managed Instance. For
more information, see Extensible key management by using Azure Key Vault (SQL Server).

Service-managed transparent data encryption


In Azure, the default setting for transparent data encryption is that the database encryption key is protected by a
built-in server certificate. The built-in server certificate is unique for each server and the encryption algorithm used
is AES 256. If a database is in a geo-replication relationship, both the primary and geo-secondary database are
protected by the primary database's parent server key. If two databases are connected to the same server, they also
share the same built-in certificate. Microsoft automatically rotates these certificates in compliance with the internal
security policy and the root key is protected by a Microsoft internal secret store. Customers can verify SQL
Database compliance with internal security policies in independent third-party audit reports available on the
Microsoft Trust Center.
Microsoft also seamlessly moves and manages the keys as needed for geo-replication and restores.
IMPORTANT
All newly created SQL databases and Managed Instance databases are encrypted by default by using service-managed
transparent data encryption. Existing SQL databases created before May 2017 and SQL databases created through restore,
geo-replication, and database copy are not encrypted by default. Existing Managed Instance databases created before
February 2019 are not encrypted by default. Managed Instance databases created through restore inherit encryption status
from the source.

Customer-managed transparent data encryption - Bring Your Own Key


TDE with customer-managed keys in Azure Key Vault allows you to encrypt the Database Encryption Key (DEK) with a
customer-managed asymmetric key called the TDE Protector. This is also generally referred to as Bring Your Own Key
(BYOK) support for Transparent Data Encryption. In the BYOK scenario, the TDE Protector is stored in a customer-
owned and managed Azure Key Vault, Azure’s cloud-based external key management system. The TDE Protector
can be generated by the key vault or transferred to the key vault from an on-premises HSM device. The TDE DEK,
which is stored on the boot page of a database, is encrypted and decrypted by the TDE Protector, which is stored in
Azure Key Vault and never leaves the key vault. SQL Database needs to be granted permissions to the customer-
owned key vault to decrypt and encrypt the DEK. If permissions of the logical SQL server to the key vault are
revoked, a database will be inaccessible and all data is encrypted. For Azure SQL Database, the TDE protector is set
at the logical SQL server level and is inherited by all databases associated with that server. For Azure SQL
Managed Instance (BYOK feature in preview), the TDE protector is set at the instance level and it is inherited by all
encrypted databases on that instance. The term server refers both to server and instance throughout this
document, unless stated differently.
With TDE with Azure Key Vault integration, users can control key management tasks including key rotations, key
vault permissions, key backups, and enable auditing/reporting on all TDE protectors using Azure Key Vault
functionality. Key Vault provides central key management, leverages tightly monitored hardware security modules
(HSMs), and enables separation of duties between management of keys and data to help meet compliance with
security policies. To learn more about transparent data encryption with Azure Key Vault integration (Bring Your
Own Key support) for Azure SQL Database, SQL Managed Instance (BYOK feature in preview), and Data
Warehouse, see Transparent data encryption with Azure Key Vault integration.
To start using transparent data encryption with Azure Key Vault integration (Bring Your Own Key support), see the
how-to guide Turn on transparent data encryption by using your own key from Key Vault by using PowerShell.

Move a transparent data encryption-protected database


You don't need to decrypt databases for operations within Azure. The transparent data encryption settings on the
source database or primary database are transparently inherited on the target. Included operations are:
Geo-restore
Self-service point-in-time restore
Restoration of a deleted database
Active geo-replication
Creation of a database copy
Restore of backup file to Azure SQL Managed Instance
IMPORTANT
Taking a manual COPY-ONLY backup of a database encrypted by service-managed TDE is not allowed in Azure SQL
Managed Instance, because the certificate used for encryption is not accessible. Use the point-in-time restore feature to
move this type of database to another Managed Instance.

When you export a transparent data encryption-protected database, the exported content of the database isn't
encrypted. This exported content is stored in un-encrypted BACPAC files. Be sure to protect the BACPAC files
appropriately and enable transparent data encryption after import of the new database is finished.
For example, if the BACPAC file is exported from an on-premises SQL Server instance, the imported content of the
new database isn't automatically encrypted. Likewise, if the BACPAC file is exported to an on-premises SQL Server
instance, the new database also isn't automatically encrypted.
The one exception is when you export to and from a SQL database. Transparent data encryption is enabled in the
new database, but the BACPAC file itself still isn't encrypted.

Manage transparent data encryption


Portal
PowerShell
Transact-SQL
REST API
Manage transparent data encryption in the Azure portal.
To configure transparent data encryption through the Azure portal, you must be connected as the Azure Owner,
Contributor, or SQL Security Manager.
You turn transparent data encryption on and off on the database level. To enable transparent data encryption on a
database, go to the Azure portal and sign in with your Azure Administrator or Contributor account. Find the
transparent data encryption settings under your user database. By default, service-managed transparent data
encryption is used. A transparent data encryption certificate is automatically generated for the server that contains
the database. For Azure SQL Managed Instance use T-SQL to turn transparent data encryption on and off on a
database.
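As a sketch, T-SQL can be used to turn transparent data encryption on and to check the encryption state; the
database name below is a placeholder.

-- Enable TDE for a database.
ALTER DATABASE [MyDatabase] SET ENCRYPTION ON;

-- Check the encryption state (3 = encrypted) and key details.
SELECT DB_NAME(database_id) AS database_name,
       encryption_state,
       key_algorithm,
       key_length
FROM sys.dm_database_encryption_keys;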
You set the transparent data encryption master key, also known as the transparent data encryption protector, on the
server level. To use transparent data encryption with Bring Your Own Key support and protect your databases with
a key from Key Vault, open the transparent data encryption settings under your server.
Azure SQL Transparent Data Encryption with
customer-managed keys in Azure Key Vault: Bring
Your Own Key support
10/24/2019 • 16 minutes to read • Edit Online

Transparent Data Encryption (TDE) with Azure Key Vault integration allows you to encrypt the Database Encryption
Key (DEK) with a customer-managed asymmetric key called the TDE Protector. This is also generally referred to as
Bring Your Own Key (BYOK) support for Transparent Data Encryption. In the BYOK scenario, the TDE Protector is
stored in a customer-owned and managed Azure Key Vault, Azure’s cloud-based external key management
system. The TDE Protector can be generated by the key vault or transferred to the key vault from an on-premises
HSM device. The TDE DEK, which is stored on the boot page of a database, is encrypted and decrypted by the
TDE Protector stored in Azure Key Vault, which it never leaves. SQL Database needs to be granted permissions to
the customer-owned key vault to decrypt and encrypt the DEK. If permissions of the logical SQL server to the key
vault are revoked, a database will be inaccessible, connections will be denied and all data is encrypted. For Azure
SQL Database, the TDE protector is set at the logical SQL server level and is inherited by all databases associated
with that server. For Azure SQL Managed Instance, the TDE protector is set at the instance level and it is inherited
by all encrypted databases on that instance. The term server refers both to server and instance throughout this
document, unless stated differently.

NOTE
Transparent Data Encryption with Azure Key Vault integration (Bring Your Own Key) for Azure SQL Database Managed
Instance is in preview.

With TDE with Azure Key Vault integration, users can control key management tasks including key rotations, key
vault permissions, key backups, and enable auditing/reporting on all TDE protectors using Azure Key Vault
functionality. Key Vault provides central key management, leverages tightly monitored hardware security modules
(HSMs), and enables separation of duties between management of keys and data to help meet compliance with
security policies.
TDE with Azure Key Vault integration provides the following benefits:
Increased transparency and granular control with the ability to self-manage the TDE protector
Ability to revoke permissions at any time to render database inaccessible
Central management of TDE protectors (along with other keys and secrets used in other Azure services) by
hosting them in Key Vault
Separation of key and data management responsibilities within the organization, to support separation of
duties
Greater trust from your own clients, since Key Vault is designed so that Microsoft does not see or extract any
encryption keys.
Support for key rotation
IMPORTANT
For those using service-managed TDE who would like to start using Key Vault, TDE remains enabled during the process of
switching over to a TDE protector in Key Vault. There is no downtime nor re-encryption of the database files. Switching from
a service-managed key to a Key Vault key only requires re-encryption of the database encryption key (DEK), which is a fast
and online operation.

How does TDE with Azure Key Vault integration support work?
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

IMPORTANT
The PowerShell Azure Resource Manager module is still supported by Azure SQL Database, but all future development is for
the Az.Sql module. For these cmdlets, see AzureRM.Sql. The arguments for the commands in the Az module and in the
AzureRm modules are substantially identical.

When TDE is first configured to use a TDE protector from Key Vault, the server sends the DEK of each TDE -
enabled database to Key Vault for a wrap key request. Key Vault returns the encrypted database encryption key,
which is stored in the user database.

IMPORTANT
It is important to note that once a TDE Protector is stored in Azure Key Vault, it never leaves the Azure Key Vault.
The server can only send key operation requests to the TDE protector key material within Key Vault, and never accesses or
caches the TDE protector. The Key Vault administrator has the right to revoke Key Vault permissions of the server at any
point, in which case all connections to the database are denied.
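For reference, the server-side assignment can be scripted with the Az.Sql module. The following is a minimal sketch with hypothetical resource names and key URI; it registers a Key Vault key with the server and sets it as the TDE protector:

# Hypothetical key URI copied from the key vault (includes the key version).
$keyId = 'https://contosokeyvault.vault.azure.net/keys/ContosoTdeProtector/<key-version>'

# Make the key known to the server.
Add-AzSqlServerKeyVaultKey -ResourceGroupName 'ContosoRG' -ServerName 'contoso-sqlsrv' -KeyId $keyId

# Set that key as the TDE protector for the server.
Set-AzSqlServerTransparentDataEncryptionProtector -ResourceGroupName 'ContosoRG' -ServerName 'contoso-sqlsrv' `
    -Type AzureKeyVault -KeyId $keyId

# Verify the current TDE protector.
Get-AzSqlServerTransparentDataEncryptionProtector -ResourceGroupName 'ContosoRG' -ServerName 'contoso-sqlsrv'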
Guidelines for configuring TDE with Azure Key Vault
General Guidelines
Ensure Azure Key Vault and Azure SQL Database/Managed Instance are in the same tenant. Cross-tenant key vault
and server interactions are not supported.
If you are planning a tenant move, TDE with AKV will have to be reconfigured; learn more about moving
resources.
When configuring TDE with Azure Key Vault, it is important to consider the load placed on the key vault by
repeated wrap/unwrap operations. For example, since all databases associated with a SQL Database server use
the same TDE protector, a failover of that server will trigger as many key operations against the vault as there
are databases in the server. Based on our experience and documented key vault service limits, we recommend
associating at most 500 Standard / General Purpose or 200 Premium / Business Critical databases with one
Azure Key Vault in a single subscription to ensure consistently high availability when accessing the TDE
protector in the vault.
Recommended: Keep a copy of the TDE protector on premises. This requires an HSM device to create a TDE
Protector locally and a key escrow system to store a local copy of the TDE Protector. Learn how to transfer a
key from a local HSM to Azure Key Vault.
Guidelines for configuring Azure Key Vault
Create a key vault with soft-delete and purge protection enabled to protect from data loss in case of
accidental deletion of the key or the key vault. You must enable the soft-delete property on the key vault via CLI or
PowerShell (this option is not yet available from the AKV portal, but it is required by Azure SQL):
Soft-deleted resources are retained for a set period of time (90 days) unless they are recovered or purged.
The recover and purge actions have their own permissions associated in a key vault access policy.
Set a resource lock on the key vault to control who can delete this critical resource and help to prevent
accidental or unauthorized deletion. Learn more about resource locks
Grant the SQL Database server access to the key vault using its Azure Active Directory (Azure AD ) Identity.
When using the Portal UI, the Azure AD identity is created automatically and the key vault access
permissions are granted to the server. When using PowerShell to configure TDE with BYOK, the Azure AD
identity must be created explicitly and its creation should be verified. See Configure TDE with BYOK and Configure
TDE with BYOK for Managed Instance for detailed step-by-step instructions when using PowerShell.

NOTE
If the Azure AD Identity is accidentally deleted or the server’s permissions are revoked using the key vault’s
access policy or inadvertently by moving the server to a different tenant, the server loses access to the key vault,
and TDE encrypted databases will be inaccessible and logons are denied until the logical server’s Azure AD Identity
and permissions have been restored.

When using firewalls and virtual networks with Azure Key Vault, you must allow trusted Microsoft services
to bypass this firewall. Choose YES.

NOTE
If TDE encrypted SQL databases lose access to the key vault because they cannot bypass the firewall, the databases
will be inaccessible, and logons are denied until firewall bypass permissions have been restored.

Enable auditing and reporting on all encryption keys: Key Vault provides logs that are easy to inject into
other security information and event management (SIEM ) tools. Operations Management Suite (OMS )
Log Analytics is one example of a service that is already integrated.
To ensure high availability of encrypted databases, configure each SQL Database server with two Azure
Key Vaults that reside in different regions.
Guidelines for configuring the TDE Protector (asymmetric key)
Create your encryption key locally on a local HSM device. Ensure this is an asymmetric, RSA 2048 or RSA
HSM 2048 key so it is storable in Azure Key Vault.
Escrow the key in a key escrow system.
Import the encryption key file (.pfx, .byok, or .backup) to Azure Key Vault.

NOTE
For testing purposes, it is possible to create a key with Azure Key Vault, however this key cannot be escrowed,
because the private key can never leave the key vault. Always back up and escrow keys used to encrypt production
data, as the loss of the key (accidental deletion in key vault, expiration etc.) results in permanent data loss.

If you use a key with an expiration date, implement an expiration warning system to rotate the key before
it expires: once the key expires, the encrypted databases lose access to their TDE Protector, become
inaccessible, and deny all logons until a new key has been selected as the default TDE Protector for the
logical SQL server.
Ensure the key is enabled and has permissions to perform get, wrap key, and unwrap key operations.
Create an Azure Key Vault key backup before using the key in Azure Key Vault for the first time. Learn more
about the Backup-AzKeyVaultKey command.
Create a new backup whenever any changes are made to the key (for example, add ACLs, add tags, add key
attributes).
Keep previous versions of the key in the key vault when rotating keys, so older database backups can be
restored. When the TDE Protector is changed for a database, old backups of the database are not updated
to use the latest TDE Protector. Each backup needs the TDE Protector it was created with at restore time.
Key rotations can be performed following the instructions at Rotate the Transparent Data Encryption
Protector Using PowerShell.
Keep all previously used keys in Azure Key Vault after changing back to service-managed keys. This ensures
database backups can be restored with the TDE protectors stored in Azure Key Vault. TDE protectors
created with Azure Key Vault have to be maintained until all stored backups have been created with service-
managed keys.
Make recoverable backup copies of these keys using Backup-AzKeyVaultKey.
To remove a potentially compromised key during a security incident without the risk of data loss, follow the
steps at Remove a potentially compromised key.
Guidelines for monitoring the TDE with Azure Key Vault configuration
If the logical SQL server loses access to the customer-managed TDE protector in Azure Key Vault, the database
will deny all connections and appear inaccessible in the Azure portal. The most common causes for this are:
Key vault accidentally deleted or behind a firewall
Key vault key accidentally deleted, disabled or expired
The logical SQL Server instance AppId accidentally deleted
Key specific permissions for logical SQL Server instance AppId revoked
NOTE
The database will self-heal and become online automatically if the access to the customer-managed TDE Protector is
restored within 48 hours. If the database is inaccessible due to an intermittent networking outage, there is no action
required and the databases will come back online automatically.

For more information about troubleshooting existing configurations see Troubleshoot TDE
To monitor database state and to enable alerting for loss of TDE Protector access, configure the following
Azure features:
Azure Resource Health. An inaccessible database that has lost access to the TDE Protector will show as
"Unavailable" after the first connection to the database has been denied.
Activity Log when access to the TDE protector in the customer-managed key vault fails, entries are
added to the activity log. Creating alerts for these events will enable you to reinstate access as soon as
possible.
Action Groups can be defined to send you notifications and alerts based on your preferences, e.g.
Email/SMS/Push/Voice, Logic App, Webhook, ITSM, or Automation Runbook.

High Availability, Geo-Replication, and Backup / Restore


High availability and disaster recovery
How to configure high availability with Azure Key Vault depends on the configuration of your database and SQL
Database server, and here are the recommended configurations for two distinct cases. The first case is a stand-
alone database or SQL Database server with no configured geo redundancy. The second case is a database or
SQL Database server configured with failover groups or geo-redundancy, where it must be ensured that each
geo-redundant copy has a local Azure Key Vault within the failover group to ensure geo-failovers work.
In the first case, if you require high availability of a database and SQL Database server with no configured geo-
redundancy, it is highly recommended to configure the server to use two different key vaults in two different
regions with the same key material. This can be accomplished by creating a TDE protector using the primary Key
Vault co-located in the same region as the SQL Database server and cloning the key into a key vault in a different
Azure region, so that the server has access to a second key vault should the primary key vault experience an
outage while the database is up and running. Use the Backup-AzKeyVaultKey cmdlet to retrieve the key in
encrypted format from the primary key vault and then use the Restore-AzKeyVaultKey cmdlet and specify a key
vault in the second region.
How to configure Geo-DR with Azure Key Vault
To maintain high availability of TDE Protectors for encrypted databases, it is required to configure redundant
Azure Key Vaults based on the existing or desired SQL Database failover groups or active geo-replication
instances. Each geo-replicated server requires a separate key vault, that must be co-located with the server in the
same Azure region. Should a primary database become inaccessible due to an outage in one region and a failover
is triggered, the secondary database is able to take over using the secondary key vault.
For Geo-Replicated Azure SQL databases, the following Azure Key Vault configuration is required:
One primary database with a key vault in its region and one secondary database with a key vault in its region.
At least one secondary is required, up to four secondaries are supported.
Secondaries of secondaries (chaining) are not supported.
The following section will go over the setup and configuration steps in more detail.
Azure Key Vault Configuration Steps
Install Azure PowerShell
Create two Azure Key Vaults in two different regions using PowerShell to enable the soft-delete property on
the key vaults (this option is not yet available from the AKV portal, but it is required by SQL).
Both Azure Key Vaults must be located in the two regions available in the same Azure Geo in order for backup
and restore of keys to work. If you need the two key vaults to be located in different geos to meet SQL Geo-DR
requirements, follow the BYOK Process that allows keys to be imported from an on-premises HSM.
Create a new key in the first key vault:
RSA/RSA-HSM 2048 key
No expiration dates
Key is enabled and has permissions to perform get, wrap key, and unwrap key operations
Back up the primary key and restore the key to the second key vault. See Backup-AzKeyVaultKey and
Restore-AzKeyVaultKey.
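A minimal PowerShell sketch of the steps above follows. The vault, key, and file names are hypothetical, and the -EnableSoftDelete switch depends on your Az.KeyVault version (newer versions enable soft-delete by default):

# Create two key vaults in two regions of the same Azure Geo, with soft-delete and purge protection.
New-AzKeyVault -Name 'ContosoAKVEastUS' -ResourceGroupName 'ContosoRG' -Location 'East US' -EnableSoftDelete -EnablePurgeProtection
New-AzKeyVault -Name 'ContosoAKVWestUS' -ResourceGroupName 'ContosoRG' -Location 'West US' -EnableSoftDelete -EnablePurgeProtection

# Create the TDE protector key in the primary vault (RSA 2048, no expiration date).
Add-AzKeyVaultKey -VaultName 'ContosoAKVEastUS' -Name 'ContosoTdeProtector' -Destination 'Software'

# Copy the key material to the secondary vault via backup and restore.
Backup-AzKeyVaultKey  -VaultName 'ContosoAKVEastUS' -Name 'ContosoTdeProtector' -OutputFile 'C:\Backup\ContosoTdeProtector.blob'
Restore-AzKeyVaultKey -VaultName 'ContosoAKVWestUS' -InputFile 'C:\Backup\ContosoTdeProtector.blob'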
Azure SQL Database Configuration Steps
The following configuration steps differ whether starting with a new SQL deployment or if working with an
already existing SQL Geo-DR deployment. We outline the configuration steps for a new deployment first, and
then explain how to assign TDE Protectors stored in Azure Key Vault to an existing deployment that already has a
Geo-DR link established.
Steps for a new deployment:
Create the two SQL Database servers in the same two regions as the previously created key vaults.
Select the SQL Database server TDE pane, and for each SQL Database server:
Select the AKV in the same region
Select the key to use as TDE Protector – each server will use the local copy of the TDE Protector.
Doing this in the Portal will create an AppID for the SQL Database server, which is used to assign the
SQL Database server permissions to access the key vault. Do not delete this identity. Access can be
revoked by removing the key vault permissions for the SQL Database server instead.
Create the primary database.
Follow the active geo-replication guidance to complete the scenario; this step will create the secondary
database.
NOTE
It is important to ensure that the same TDE Protectors are present in both key vaults, before proceeding to establish the
geo-link between the databases.

Steps for an existing SQL DB with Geo-DR deployment:


Because the SQL Database servers already exist, and primary and secondary databases are already assigned, the
steps to configure Azure Key Vault must be performed in the following order:
Start with the SQL Database server that hosts the secondary database:
Assign the key vault located in the same region
Assign the TDE Protector
Now go to the SQL Database server that hosts the primary database:
Select the same TDE Protector as used for the secondary DB
NOTE
When assigning the key vault to the server, it is important to start with the secondary server. When you then assign the
key vault to the primary server and update the TDE Protector in the second step, the Geo-DR link continues to work because at this
point the TDE Protector used by the replicated database is available to both servers.

Before enabling TDE with customer managed keys in Azure Key Vault for a SQL Database Geo-DR scenario, it is
important to create and maintain two Azure Key Vaults with identical contents in the same regions that will be
used for SQL Database geo-replication. “Identical contents” specifically means that both key vaults must contain
copies of the same TDE Protector(s) so that both servers have access to the TDE Protectors used by all databases.
Going forward, both key vaults must be kept in sync: they must contain the same copies of the TDE Protectors after
key rotation, maintain old versions of keys used for log files or backups, keep the same key properties, and
maintain the same access permissions for SQL.
Follow the steps in Active geo-replication overview to test and trigger a failover, which should be done on a
regular basis to confirm the access permissions for SQL to both key vaults have been maintained.
Backup and Restore
Once a database is encrypted with TDE using a key from Key Vault, any generated backups are also encrypted
with the same TDE Protector.
To restore a backup encrypted with a TDE Protector from Key Vault, make sure that the key material is still in the
original vault under the original key name. When the TDE Protector is changed for a database, old backups of the
database are not updated to use the latest TDE Protector. Therefore, we recommend that you keep all old
versions of the TDE Protector in Key Vault, so database backups can be restored.
If a key that might be needed for restoring a backup is no longer in its original key vault, the following error
message is returned: "Target server <Servername> does not have access to all AKV Uris created between
<Timestamp #1> and <Timestamp #2>. Please retry operation after restoring all AKV Uris."
To mitigate this, run the Get-AzSqlServerKeyVaultKey cmdlet to return the list of keys from Key Vault that were
added to the server (unless they were deleted by a user). To ensure all backups can be restored, make sure the
target server for the backup has access to all of these keys.
Get-AzSqlServerKeyVaultKey `
-ServerName <LogicalServerName> `
-ResourceGroup <SQLDatabaseResourceGroupName>

To learn more about backup recovery for SQL Database, see Recover an Azure SQL database. To learn more
about backup recovery for SQL Data Warehouse, see Recover an Azure SQL Data Warehouse.
Additional consideration for backed up log files: Backed up log files remain encrypted with the original TDE
Encryptor, even if the TDE Protector was rotated and the database is now using a new TDE Protector. At restore
time, both keys will be needed to restore the database. If the log file is using a TDE Protector stored in Azure Key
Vault, this key will be needed at restore time, even if the database has been changed to use service-managed TDE
in the meantime.
Designing a PolyBase data loading strategy for Azure
SQL Data Warehouse
7/28/2019 • 6 minutes to read • Edit Online

Traditional SMP data warehouses use an Extract, Transform and Load (ETL ) process for loading data. Azure SQL
Data Warehouse is a massively parallel processing (MPP ) architecture that takes advantage of the scalability and
flexibility of compute and storage resources. Utilizing an Extract, Load, and Transform (ELT) process can take
advantage of MPP and eliminate resources needed to transform the data prior to loading. While SQL Data
Warehouse supports many loading methods, including non-PolyBase options such as BCP and the SQLBulkCopy
API, the fastest and most scalable way to load data is through PolyBase. PolyBase is a technology that accesses
external data stored in Azure Blob storage or Azure Data Lake Store via the T-SQL language.

What is ELT?
Extract, Load, and Transform (ELT) is a process by which data is extracted from a source system, loaded into a data
warehouse and then transformed.
The basic steps for implementing a PolyBase ELT for SQL Data Warehouse are:
1. Extract the source data into text files.
2. Land the data into Azure Blob storage or Azure Data Lake Store.
3. Prepare the data for loading.
4. Load the data into SQL Data Warehouse staging tables using PolyBase.
5. Transform the data.
6. Insert the data into production tables.
For a loading tutorial, see Use PolyBase to load data from Azure blob storage to Azure SQL Data Warehouse.
For more information, see Loading patterns blog.

1. Extract the source data into text files


Getting data out of your source system depends on the storage location. The goal is to move the data into
PolyBase supported delimited text files.
PolyBase external file formats
PolyBase loads data from UTF-8 and UTF-16 encoded delimited text files. In addition to the delimited text files, it
loads from the Hadoop file formats RC File, ORC, and Parquet. PolyBase can also load data from Gzip and Snappy
compressed files. PolyBase currently does not support extended ASCII, fixed-width format, and nested formats
such as WinZip, JSON, and XML. If you are exporting from SQL Server, you can use bcp command-line tool to
export the data into delimited text files. The Parquet to SQL DW data type mapping is the following:

PARQUET DATA TYPE     SQL DATA TYPE
tinyint               tinyint
smallint              smallint
int                   int
bigint                bigint
boolean               bit
double                float
float                 real
double                money
double                smallmoney
string                nchar
string                nvarchar
string                char
string                varchar
binary                binary
binary                varbinary
timestamp             date
timestamp             smalldatetime
timestamp             datetime2
timestamp             datetime
timestamp             time
date                  date
decimal               decimal

2. Land the data into Azure Blob storage or Azure Data Lake Store
To land the data in Azure storage, you can move it to Azure Blob storage or Azure Data Lake Store. In either
location, the data should be stored in text files. PolyBase can load from either location.
Tools and services you can use to move data to Azure Storage:
Azure ExpressRoute service enhances network throughput, performance, and predictability. ExpressRoute is a
service that routes your data through a dedicated private connection to Azure. ExpressRoute connections do
not route data through the public internet. The connections offer more reliability, faster speeds, lower latencies,
and higher security than typical connections over the public internet.
AzCopy utility moves data to Azure Storage over the public internet. This works if your data sizes are less than
10 TB. To perform loads on a regular basis with AZCopy, test the network speed to see if it is acceptable.
Azure Data Factory (ADF ) has a gateway that you can install on your local server. Then you can create a
pipeline to move data from your local server up to Azure Storage. To use Data Factory with SQL Data
Warehouse, see Load data into SQL Data Warehouse.

3. Prepare the data for loading


You might need to prepare and clean the data in your storage account before loading it into SQL Data Warehouse.
Data preparation can be performed while your data is in the source, as you export the data to text files, or after the
data is in Azure Storage. It is easiest to work with the data as early in the process as possible.
Define external tables
Before you can load data, you need to define external tables in your data warehouse. PolyBase uses external tables
to define and access the data in Azure Storage. An external table is similar to a database view. The external table
contains the table schema and points to data that is stored outside the data warehouse.
Defining external tables involves specifying the data source, the format of the text files, and the table definitions.
These are the T-SQL syntax topics that you will need:
CREATE EXTERNAL DATA SOURCE
CREATE EXTERNAL FILE FORMAT
CREATE EXTERNAL TABLE
For an example of creating external objects, see the Create external tables step in the loading tutorial.
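For illustration, a minimal sketch of these objects might look like the following. The names, storage account, credential secret, and columns are hypothetical placeholders; see the loading tutorial for a complete walkthrough.

-- Credential for the storage account (hypothetical secret).
CREATE MASTER KEY;
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<azure_storage_account_key>';

-- External data source pointing to a blob container.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://datacontainer@contosostorage.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

-- File format for pipe-delimited, Gzip-compressed text files.
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = FALSE),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);

-- External table whose columns align with the rows in the text files.
CREATE EXTERNAL TABLE dbo.SalesExternal
(
    SaleId     INT           NOT NULL,
    SaleDate   DATE          NOT NULL,
    CustomerId INT           NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);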
Format text files
Once the external objects are defined, you need to align the rows of the text files with the external table and file
format definition. The data in each row of the text file must align with the table definition. To format the text files:
If your data is coming from a non-relational source, you need to transform it into rows and columns. Whether
the data is from a relational or non-relational source, the data must be transformed to align with the column
definitions for the table into which you plan to load the data.
Format data in the text file to align with the columns and data types in the SQL Data Warehouse destination
table. Misalignment between data types in the external text files and the data warehouse table causes rows to
be rejected during the load.
Separate fields in the text file with a terminator. Be sure to use a character or a character sequence that is not
found in your source data. Use the terminator you specified with CREATE EXTERNAL FILE FORMAT.

4. Load the data into SQL Data Warehouse staging tables using
PolyBase
It is best practice to load data into a staging table. Staging tables allow you to handle errors without interfering
with the production tables. A staging table also gives you the opportunity to use SQL Data Warehouse MPP for
data transformations before inserting the data into production tables.
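Continuing the hypothetical example above, a single CTAS statement can then load the external table into a round-robin heap staging table in one pass:

-- Load from blob storage into a staging heap table using PolyBase.
CREATE TABLE dbo.Sales_Staging
WITH
(
    HEAP,
    DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT SaleId, SaleDate, CustomerId, Amount
FROM dbo.SalesExternal;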
Options for loading with PolyBase
To load data with PolyBase, you can use any of these loading options:
PolyBase with T-SQL works well when your data is in Azure Blob storage or Azure Data Lake Store. It gives
you the most control over the loading process, but also requires you to define external data objects. The other
methods define these objects behind the scenes as you map source tables to destination tables. To orchestrate
T-SQL loads, you can use Azure Data Factory, SSIS, or Azure functions.
PolyBase with SSIS works well when your source data is in SQL Server, either SQL Server on-premises or in
the cloud. SSIS defines the source to destination table mappings, and also orchestrates the load. If you already
have SSIS packages, you can modify the packages to work with the new data warehouse destination.
PolyBase with Azure Data Factory (ADF ) is another orchestration tool. It defines a pipeline and schedules jobs.
PolyBase with Azure DataBricks transfers data from a SQL Data Warehouse table to a Databricks dataframe
and/or writes data from a Databricks dataframe to a SQL Data Warehouse table using PolyBase.
Non-PolyBase loading options
If your data is not compatible with PolyBase, you can use bcp or the SQLBulkCopy API. bcp loads directly to SQL
Data Warehouse without going through Azure Blob storage, and is intended only for small loads. Note that the load
performance of these options is significantly slower than PolyBase.

5. Transform the data


While data is in the staging table, perform transformations that your workload requires. Then move the data into a
production table.

6. Insert the data into production tables


The INSERT INTO ... SELECT statement moves the data from the staging table to the permanent table.
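For example, assuming the hypothetical staging and production tables from the earlier steps, the final step might look like this:

-- Move transformed rows from the staging table into the production table.
INSERT INTO dbo.FactSales (SaleId, SaleDate, CustomerId, Amount)
SELECT SaleId, SaleDate, CustomerId, Amount
FROM dbo.Sales_Staging;

-- Optionally truncate the staging table afterwards.
TRUNCATE TABLE dbo.Sales_Staging;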
As you design an ELT process, try running the process on a small test sample. Try extracting 1000 rows from the
table to a file, move it to Azure, and then try loading it into a staging table.

Partner loading solutions


Many of our partners have loading solutions. To find out more, see a list of our solution partners.

Next steps
For loading guidance, see Guidance for load data.
Best practices for loading data into Azure SQL Data
Warehouse
8/8/2019 • 6 minutes to read • Edit Online

Recommendations and performance optimizations for loading data into Azure SQL Data Warehouse.

Preparing data in Azure Storage


To minimize latency, co-locate your storage layer and your data warehouse.
When exporting data into an ORC File Format, you might get Java out-of-memory errors when there are large
text columns. To work around this limitation, export only a subset of the columns.
PolyBase cannot load rows that have more than 1,000,000 bytes of data. When you put data into the text files in
Azure Blob storage or Azure Data Lake Store, they must have fewer than 1,000,000 bytes of data. This byte
limitation is true regardless of the table schema.
All file formats have different performance characteristics. For the fastest load, use compressed delimited text files.
The difference between UTF-8 and UTF-16 performance is minimal.
Split large compressed files into smaller compressed files.

Running loads with enough compute


For fastest loading speed, run only one load job at a time. If that is not feasible, run a minimal number of loads
concurrently. If you expect a large loading job, consider scaling up your data warehouse before the load.
To run loads with appropriate compute resources, create loading users designated for running loads. Assign each
loading user to a specific resource class. To run a load, sign in as one of the loading users, and then run the load.
The load runs with the user's resource class. This method is simpler than trying to change a user's resource class
to fit the current resource class need.
Example of creating a loading user
This example creates a loading user for the staticrc20 resource class. The first step is to connect to master and
create a login.

-- Connect to master
CREATE LOGIN LoaderRC20 WITH PASSWORD = 'a123STRONGpassword!';

Connect to the data warehouse and create a user. The following code assumes you are connected to the database
called mySampleDataWarehouse. It shows how to create a user called LoaderRC20 and give the user CONTROL
permission on the database. It then adds the user as a member of the staticrc20 database role.

-- Connect to the database


CREATE USER LoaderRC20 FOR LOGIN LoaderRC20;
GRANT CONTROL ON DATABASE::[mySampleDataWarehouse] to LoaderRC20;
EXEC sp_addrolemember 'staticrc20', 'LoaderRC20';

To run a load with resources for the staticRC20 resource classes, sign in as LoaderRC20 and run the load.
Run loads under static rather than dynamic resource classes. Using the static resource classes guarantees the
same resources regardless of your data warehouse units. If you use a dynamic resource class, the resources vary
according to your service level. For dynamic classes, a lower service level means you probably need to use a
larger resource class for your loading user.

Allowing multiple users to load


There is often a need to have multiple users load data into a data warehouse. Loading with the CREATE TABLE AS
SELECT (Transact-SQL ) requires CONTROL permissions of the database. The CONTROL permission gives
control access to all schemas. You might not want all loading users to have control access on all schemas. To limit
permissions, use the DENY CONTROL statement.
For example, consider database schemas, schema_A for dept A, and schema_B for dept B. Let database users
user_A and user_B be users for PolyBase loading in dept A and B, respectively. They both have been granted
CONTROL database permissions. The creators of schema A and B now lock down their schemas using DENY:

DENY CONTROL ON SCHEMA :: schema_A TO user_B;


DENY CONTROL ON SCHEMA :: schema_B TO user_A;

User_A and user_B are now locked out from the other dept’s schema.

Loading to a staging table


To achieve the fastest loading speed for moving data into a data warehouse table, load data into a staging table.
Define the staging table as a heap and use round-robin for the distribution option.
Consider that loading is usually a two-step process in which you first load to a staging table and then insert the
data into a production data warehouse table. If the production table uses a hash distribution, the total time to load
and insert might be faster if you define the staging table with the hash distribution. Loading to the staging table
takes longer, but the second step of inserting the rows to the production table does not incur data movement
across the distributions.
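As a sketch with hypothetical table and column names, a staging table that matches the hash distribution of the production table might look like this:

-- Staging table that matches the hash distribution of the production table,
-- so the later INSERT ... SELECT incurs no data movement across distributions.
CREATE TABLE dbo.FactSales_Staging
WITH
(
    HEAP,
    DISTRIBUTION = HASH(CustomerKey)
)
AS
SELECT CustomerKey, OrderDateKey, SalesAmount
FROM dbo.FactSales_External;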

Loading to a columnstore index


Columnstore indexes require large amounts of memory to compress data into high-quality rowgroups. For best
compression and index efficiency, the columnstore index needs to compress the maximum of 1,048,576 rows into
each rowgroup. When there is memory pressure, the columnstore index might not be able to achieve maximum
compression rates. This in turn affects query performance. For a deep dive, see Columnstore memory
optimizations.
To ensure the loading user has enough memory to achieve maximum compression rates, use loading users
that are a member of a medium or large resource class.
Load enough rows to completely fill new rowgroups. During a bulk load, every 1,048,576 rows get
compressed directly into the columnstore as a full rowgroup. Loads with fewer than 102,400 rows send the
rows to the deltastore where rows are held in a b-tree index. If you load too few rows, they might all go to the
deltastore and not get compressed immediately into columnstore format.

Increase batch size when using SQLBulkCopy API or BCP


As mentioned before, loading with PolyBase will provide the highest throughput with SQL Data Warehouse. If
you cannot use PolyBase to load and must use the SQLBulkCopy API (or BCP ) you should consider increasing
batch size for better throughput.
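For example, the bcp command-line tool exposes a -b switch for the batch size; the server, table, and batch size below are illustrative placeholders only:

bcp dbo.FactSales in factsales.txt -S contoso-sqldw.database.windows.net -d mySampleDataWarehouse -U LoaderRC20 -P "<password>" -q -c -t "|" -b 100000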

Handling loading failures


A load using an external table can fail with the error "Query aborted -- the maximum reject threshold was reached
while reading from an external source". This message indicates that your external data contains dirty records. A
data record is considered dirty if the data types and number of columns do not match the column definitions of
the external table, or if the data doesn't conform to the specified external file format.
To fix the dirty records, ensure that your external table and external file format definitions are correct and your
external data conforms to these definitions. In case a subset of external data records are dirty, you can choose to
reject these records for your queries by using the reject options in CREATE EXTERNAL TABLE.

Inserting data into a production table


A one-time load to a small table with an INSERT statement, or even a periodic reload of a look-up might perform
well enough with a statement like INSERT INTO MyLookup VALUES (1, 'Type 1'). However, singleton inserts are not
as efficient as performing a bulk load.
If you have thousands or more single inserts throughout the day, batch the inserts so you can bulk load them.
Develop your processes to append the single inserts to a file, and then create another process that periodically
loads the file.

Creating statistics after the load


To improve query performance, it's important to create statistics on all columns of all tables after the first load, or
substantial changes occur in the data. This can be done manually or you can enable auto-create statistics.
For a detailed explanation of statistics, see Statistics. The following example shows how to manually create
statistics on five columns of the Customer_Speed table.

create statistics [SensorKey] on [Customer_Speed] ([SensorKey]);


create statistics [CustomerKey] on [Customer_Speed] ([CustomerKey]);
create statistics [GeographyKey] on [Customer_Speed] ([GeographyKey]);
create statistics [Speed] on [Customer_Speed] ([Speed]);
create statistics [YearMeasured] on [Customer_Speed] ([YearMeasured]);

Rotate storage keys


It is good security practice to change the access key to your blob storage on a regular basis. You have two storage
keys for your blob storage account, which enables you to transition the keys.
To rotate Azure Storage account keys:
For each storage account whose key has changed, issue ALTER DATABASE SCOPED CREDENTIAL.
Example:
Original key is created

CREATE DATABASE SCOPED CREDENTIAL my_credential WITH IDENTITY = 'my_identity', SECRET = 'key1'

Rotate key from key 1 to key 2

ALTER DATABASE SCOPED CREDENTIAL my_credential WITH IDENTITY = 'my_identity', SECRET = 'key2'

No other changes to underlying external data sources are needed.


Next steps
To learn more about PolyBase and designing an Extract, Load, and Transform (ELT) process, see Design ELT for
SQL Data Warehouse.
For a loading tutorial, see Use PolyBase to load data from Azure blob storage to Azure SQL Data Warehouse.
To monitor data loads, see Monitor your workload using DMVs.
Maximizing rowgroup quality for columnstore
7/5/2019 • 6 minutes to read • Edit Online

Rowgroup quality is determined by the number of rows in a rowgroup. Increasing the available memory can
maximize the number of rows a columnstore index compresses into each rowgroup. Use these methods to
improve compression rates and query performance for columnstore indexes.

Why the rowgroup size matters


Since a columnstore index scans a table by scanning column segments of individual rowgroups, maximizing the
number of rows in each rowgroup enhances query performance. When rowgroups have a high number of rows,
data compression improves which means there is less data to read from disk.
For more information about rowgroups, see Columnstore Indexes Guide.

Target size for rowgroups


For best query performance, the goal is to maximize the number of rows per rowgroup in a columnstore index. A
rowgroup can have a maximum of 1,048,576 rows. It's okay to not have the maximum number of rows per
rowgroup. Columnstore indexes achieve good performance when rowgroups have at least 100,000 rows.

Rowgroups can get trimmed during compression


During a bulk load or columnstore index rebuild, sometimes there isn't enough memory available to compress all
the rows designated for each rowgroup. When there is memory pressure, columnstore indexes trim the rowgroup
sizes so compression into the columnstore can succeed.
When there is insufficient memory to compress at least 10,000 rows into each rowgroup, SQL Data Warehouse
generates an error.
For more information on bulk loading, see Bulk load into a clustered columnstore index.

How to monitor rowgroup quality


The DMV sys.dm_pdw_nodes_db_column_store_row_group_physical_stats (the SQL Data Warehouse equivalent of
sys.dm_db_column_store_row_group_physical_stats in SQL Database) exposes useful information, such as the number of
rows in each rowgroup and the reason for trimming, if any trimming occurred. You can create the following view as a
handy way to query this DMV to get information on rowgroup trimming.
create view dbo.vCS_rg_physical_stats
as
with cte
as
(
    select  tb.[name]             AS [logical_table_name]
    ,       rg.[row_group_id]     AS [row_group_id]
    ,       rg.[state]            AS [state]
    ,       rg.[state_desc]       AS [state_desc]
    ,       rg.[total_rows]       AS [total_rows]
    ,       rg.[trim_reason_desc] AS trim_reason_desc
    ,       mp.[physical_name]    AS physical_name
    FROM    sys.[schemas] sm
    JOIN    sys.[tables] tb             ON sm.[schema_id]  = tb.[schema_id]
    JOIN    sys.[pdw_table_mappings] mp ON tb.[object_id]  = mp.[object_id]
    JOIN    sys.[pdw_nodes_tables] nt   ON nt.[name]       = mp.[physical_name]
    JOIN    sys.[dm_pdw_nodes_db_column_store_row_group_physical_stats] rg
            ON  rg.[object_id]       = nt.[object_id]
            AND rg.[pdw_node_id]     = nt.[pdw_node_id]
            AND rg.[distribution_id] = nt.[distribution_id]
)
select *
from cte;
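For example, the following query against this view summarizes rowgroup quality and trim reasons per table:

-- Summarize compressed rowgroups per table and per trim reason.
SELECT  logical_table_name,
        trim_reason_desc,
        COUNT(*)        AS row_group_count,
        AVG(total_rows) AS avg_rows_per_row_group
FROM    dbo.vCS_rg_physical_stats
WHERE   state_desc = 'COMPRESSED'
GROUP BY logical_table_name, trim_reason_desc
ORDER BY logical_table_name, trim_reason_desc;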

The trim_reason_desc column indicates whether the rowgroup was trimmed (trim_reason_desc = NO_TRIM means there was
no trimming and the rowgroup is of optimal quality). The following trim reasons indicate premature trimming of the
rowgroup:
BULKLOAD: This trim reason is used when the incoming batch of rows for the load had less than 1 million
rows. The engine will create compressed row groups if there are greater than 100,000 rows being inserted (as
opposed to inserting into the delta store) but sets the trim reason to BULKLOAD. In this scenario, consider
increasing your batch load to include more rows. Also, reevaluate your partitioning scheme to ensure it is not
too granular as row groups cannot span partition boundaries.
MEMORY_LIMITATION: To create row groups with 1 million rows, a certain amount of working memory is
required by the engine. When available memory of the loading session is less than the required working
memory, row groups get prematurely trimmed. The following sections explain how to estimate memory
required and allocate more memory.
DICTIONARY_SIZE: This trim reason indicates that rowgroup trimming occurred because there was at least
one string column with wide and/or high cardinality strings. The dictionary size is limited to 16 MB in memory
and once this limit is reached the row group is compressed. If you do run into this situation, consider isolating
the problematic column into a separate table.

How to estimate memory requirements


The maximum required memory to compress one rowgroup is approximately
72 MB +
#rows * #columns * 8 bytes +
#rows * #short-string-columns * 32 bytes +
#long-string-columns * 16 MB for compression dictionary
where short-string-columns use string data types of <= 32 bytes and long-string-columns use string data types of
> 32 bytes.
Long strings are compressed with a compression method designed for compressing text. This compression
method uses a dictionary to store text patterns. The maximum size of a dictionary is 16 MB. There is only one
dictionary for each long string column in the rowgroup.
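As an illustrative calculation (not an exact figure), a full rowgroup of 1,048,576 rows in a hypothetical table with 10 columns, of which 3 are short string columns and 2 are long string columns, needs roughly: 72 MB + 1,048,576 * 10 * 8 bytes (80 MB) + 1,048,576 * 3 * 32 bytes (96 MB) + 2 * 16 MB (32 MB), or about 280 MB of working memory.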
For an in-depth discussion of columnstore memory requirements, see the video Azure SQL Data Warehouse
scaling: configuration and guidance.

Ways to reduce memory requirements


Use the following techniques to reduce the memory requirements for compressing rowgroups into columnstore
indexes.
Use fewer columns
If possible, design the table with fewer columns. When a rowgroup is compressed into the columnstore, the
columnstore index compresses each column segment separately. Therefore the memory requirements to
compress a rowgroup increase as the number of columns increases.
Use fewer string columns
Columns of string data types require more memory than numeric and date data types. To reduce memory
requirements, consider removing string columns from fact tables and putting them in smaller dimension tables.
Additional memory requirements for string compression:
String data types up to 32 characters can require 32 additional bytes per value.
String data types with more than 32 characters are compressed using dictionary methods. Each column in the
rowgroup can require up to an additional 16 MB to build the dictionary.
Avoid over-partitioning
Columnstore indexes create one or more rowgroups per partition. In SQL Data Warehouse, the number of
partitions grows quickly because the data is distributed and each distribution is partitioned. If the table has too
many partitions, there might not be enough rows to fill the rowgroups. The lack of rows does not create memory
pressure during compression, but it leads to rowgroups that do not achieve the best columnstore query
performance.
Another reason to avoid over-partitioning is there is a memory overhead for loading rows into a columnstore
index on a partitioned table. During a load, many partitions could receive the incoming rows, which are held in
memory until each partition has enough rows to be compressed. Having too many partitions creates additional
memory pressure.
Simplify the load query
The database shares the memory grant for a query among all the operators in the query. When a load query has
complex sorts and joins, the memory available for compression is reduced.
Design the load query to focus only on loading the query. If you need to run transformations on the data, run
them separate from the load query. For example, stage the data in a heap table, run the transformations, and then
load the staging table into the columnstore index. You can also load the data first and then use the MPP system to
transform the data.
Adjust MAXDOP
Each distribution compresses rowgroups into the columnstore in parallel when there is more than one CPU core
available per distribution. The parallelism requires additional memory resources, which can lead to memory
pressure and rowgroup trimming.
To reduce memory pressure, you can use the MAXDOP query hint to force the load operation to run in serial
mode within each distribution.
CREATE TABLE MyFactSalesQuota
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM FactSalesQuota
OPTION (MAXDOP 1);

Ways to allocate more memory


DWU size and the user resource class together determine how much memory is available for a user query. To
increase the memory grant for a load query, you can either increase the number of DWUs or increase the resource
class.
To increase the DWUs, see How do I scale performance?
To change the resource class for a query, see Change a user resource class example.

Next steps
To find more ways to improve performance in SQL Data Warehouse, see the Performance overview.
Design decisions and coding techniques for SQL
Data Warehouse
7/24/2019 • 2 minutes to read • Edit Online

Take a look through these development articles to better understand key design decisions, recommendations,
and coding techniques for SQL Data Warehouse.

Key design decisions


The following articles highlight concepts and design decisions for developing a distributed data warehouse using
SQL Data Warehouse:
connections
concurrency
transactions
user-defined schemas
table distribution
table indexes
table partitions
CTAS
statistics

Development recommendations and coding techniques


These articles highlight specific coding techniques, tips, and recommendations for developing your SQL Data
Warehouse:
stored procedures
labels
views
temporary tables
dynamic SQL
looping
group by options
variable assignment

Next steps
For more reference information, see SQL Data Warehouse T-SQL statements.
Development best practices for Azure SQL Data
Warehouse
7/24/2019 • 6 minutes to read • Edit Online

This article describes guidance and best practices as you develop your data warehouse solution.

Reduce cost with pause and scale


For more information about reducing costs through pausing and scaling, see Manage compute.

Maintain statistics
Ensure you update your statistics daily or after each load. There are always trade-offs between performance and
the cost to create and update statistics. If you find it is taking too long to maintain all of your statistics, you may
want to try to be more selective about which columns have statistics or which columns need frequent updating. For
example, you might want to update date columns, where new values may be added, daily. You will gain the most
benefit by having statistics on columns involved in joins, columns used in the WHERE clause and
columns found in GROUP BY.
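As a sketch with hypothetical table and column names, statistics on a frequently changing date column can be created once and then refreshed after each load:

-- One-time: create a statistics object on the date column.
CREATE STATISTICS stats_OrderDate ON dbo.FactOrders (OrderDate);

-- After each load: refresh that statistics object.
UPDATE STATISTICS dbo.FactOrders (stats_OrderDate);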
See also Manage table statistics, CREATE STATISTICS, UPDATE STATISTICS

Hash distribute large tables


By default, tables are Round Robin distributed. This makes it easy for users to get started creating tables without
having to decide how their tables should be distributed. Round Robin tables may perform sufficiently for some
workloads, but in most cases selecting a distribution column will perform much better. The most common example
of when a table distributed by a column will far outperform a Round Robin table is when two large fact tables are
joined. For example, if you have an orders table, which is distributed by order_id, and a transactions table, which is
also distributed by order_id, when you join your orders table to your transactions table on order_id, this query
becomes a pass-through query, which means we eliminate data movement operations. Fewer steps mean a faster
query. Less data movement also makes for faster queries. This explanation only scratches the surface. When
loading a distributed table, be sure that your incoming data is not sorted on the distribution key as this will slow
down your loads. See the below links for much more details on how selecting a distribution column can improve
performance as well as how to define a distributed table in the WITH clause of your CREATE TABLES statement.
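For example, the orders table described above might be created as a hash-distributed clustered columnstore table with a CTAS statement (hypothetical names):

-- Distribute on the join key so joins on order_id avoid data movement.
CREATE TABLE dbo.Orders
WITH
(
    DISTRIBUTION = HASH(order_id),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.Orders_Staging;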
See also Table overview, Table distribution, Selecting table distribution, CREATE TABLE, CREATE TABLE AS
SELECT

Do not over-partition
While partitioning data can be very effective for maintaining your data through partition switching or optimizing
scans by with partition elimination, having too many partitions can slow down your queries. Often a high
granularity partitioning strategy which may work well on SQL Server may not work well on SQL Data Warehouse.
Having too many partitions can also reduce the effectiveness of clustered columnstore indexes if each partition has
fewer than 1 million rows. Keep in mind that behind the scenes, SQL Data Warehouse partitions your data for you
into 60 databases, so if you create a table with 100 partitions, this actually results in 6000 partitions. Each workload
is different so the best advice is to experiment with partitioning to see what works best for your workload. Consider
lower granularity than what may have worked for you in SQL Server. For example, consider using weekly or
monthly partitions rather than daily partitions.
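The following is a minimal sketch of monthly partitioning with hypothetical table, column, and boundary values:

-- Monthly partitions; each boundary value starts a new partition (RANGE RIGHT).
CREATE TABLE dbo.FactOrders
(
    order_id    INT   NOT NULL,
    order_date  DATE  NOT NULL,
    order_total MONEY NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(order_id),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( order_date RANGE RIGHT FOR VALUES
        ('2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01') )
);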
See also Table partitioning

Minimize transaction sizes


INSERT, UPDATE, and DELETE statements run in a transaction and when they fail they must be rolled back. To
minimize the potential for a long rollback, minimize transaction sizes whenever possible. This can be done by
dividing INSERT, UPDATE, and DELETE statements into parts. For example, if you have an INSERT which you
expect to take 1 hour, if possible, break the INSERT up into 4 parts, which will each run in 15 minutes. Leverage
special Minimal Logging cases, like CTAS, TRUNCATE, DROP TABLE or INSERT to empty tables, to reduce
rollback risk. Another way to eliminate rollbacks is to use Metadata Only operations like partition switching for
data management. For example, rather than execute a DELETE statement to delete all rows in a table where the
order_date was in October of 2001, you could partition your data monthly and then switch out the partition with
data for an empty partition from another table (see ALTER TABLE examples). For unpartitioned tables consider
using a CTAS to write the data you want to keep in a table rather than using DELETE. If a CTAS takes the same
amount of time, it is a much safer operation to run as it has very minimal transaction logging and can be canceled
quickly if needed.
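For example, a CTAS-plus-rename pattern (hypothetical table and column names) can replace a large DELETE with a minimally logged operation:

-- Write only the rows you want to keep into a new table.
CREATE TABLE dbo.FactOrders_New
WITH
(
    DISTRIBUTION = HASH(order_id),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM dbo.FactOrders
WHERE order_date < '2001-10-01' OR order_date >= '2001-11-01';

-- Swap the tables and drop the original.
RENAME OBJECT dbo.FactOrders TO FactOrders_Old;
RENAME OBJECT dbo.FactOrders_New TO FactOrders;
DROP TABLE dbo.FactOrders_Old;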
See also Understanding transactions, Optimizing transactions, Table partitioning, TRUNCATE TABLE, ALTER
TABLE, Create table as select (CTAS )

Use the smallest possible column size


When defining your DDL, using the smallest data type which will support your data will improve query
performance. This is especially important for CHAR and VARCHAR columns. If the longest value in a column is 25
characters, then define your column as VARCHAR (25). Avoid defining all character columns to a large default
length. In addition, define columns as VARCHAR when that is all that is needed rather than use NVARCHAR.
See also Table overview, Table data types, CREATE TABLE

Optimize clustered columnstore tables


Clustered columnstore indexes are one of the most efficient ways you can store your data in SQL Data Warehouse.
By default, tables in SQL Data Warehouse are created as Clustered ColumnStore. To get the best performance for
queries on columnstore tables, having good segment quality is important. When rows are written to columnstore
tables under memory pressure, columnstore segment quality may suffer. Segment quality can be measured by
number of rows in a compressed Row Group. See the Causes of poor columnstore index quality in the Table
indexes article for step by step instructions on detecting and improving segment quality for clustered columnstore
tables. Because high-quality columnstore segments are important, it's a good idea to use user IDs that are in the
medium or large resource class for loading data. Using lower data warehouse units means you want to assign a
larger resource class to your loading user.
Since columnstore tables generally won't push data into a compressed columnstore segment until there are more
than 1 million rows per table and each SQL Data Warehouse table is partitioned into 60 tables, as a rule of thumb,
columnstore tables won't benefit a query unless the table has more than 60 million rows. For table with less than
60 million rows, it may not make any sense to have a columnstore index. It also may not hurt. Furthermore, if you
partition your data, then you will want to consider that each partition will need to have 1 million rows to benefit
from a clustered columnstore index. If a table has 100 partitions, then it will need to have at least 6 billion rows to
benefit from a clustered columnstore index (60 distributions * 100 partitions * 1 million rows). If your table does not
have 6 billion rows in this example, either reduce the number of partitions or consider using a heap table instead. It
also may be worth experimenting to see if better performance can be gained with a heap table with secondary
indexes rather than a columnstore table.
When querying a columnstore table, queries will run faster if you select only the columns you need.
See also Table indexes, Columnstore indexes guide, Rebuilding columnstore indexes
Next steps
If you didn't find what you were looking for in this article, try using the "Search for docs" on the left side of this
page to search all of the Azure SQL Data Warehouse documents. The Azure SQL Data Warehouse Forum is a
place for you to ask questions to other users and to the SQL Data Warehouse Product Group. We actively monitor
this forum to ensure that your questions are answered either by another user or one of us. If you prefer to ask your
questions on Stack Overflow, we also have an Azure SQL Data Warehouse Stack Overflow Forum.
Finally, please do use the Azure SQL Data Warehouse Feedback page to make feature requests. Adding your
requests or up-voting other requests really helps us prioritize features.
Performance tuning with materialized views
9/27/2019 • 10 minutes to read • Edit Online

The materialized views in Azure SQL Data Warehouse provide a low maintenance method for complex analytical
queries to get fast performance without any query change. This article discusses the general guidance on using
materialized views.

Materialized views vs. standard views


Azure SQL Data Warehouse supports standard and materialized views. Both are virtual tables created with
SELECT expressions and presented to queries as logical tables. Views encapsulate the complexity of common data
computation and add an abstraction layer to computation changes so there's no need to rewrite queries.
A standard view computes its data each time the view is used. There's no data stored on disk. People typically
use standard views as a tool that helps organize the logical objects and queries in a database. To use a standard
view, a query needs to make direct reference to it.
A materialized view pre-computes, stores, and maintains its data in Azure SQL Data Warehouse just like a table.
There's no recomputation needed each time a materialized view is used. That's why queries that use all or a
subset of the data in materialized views can get faster performance. Even better, queries can use a materialized view
without making direct reference to it, so there's no need to change application code.
Most of the requirements on a standard view still apply to a materialized view. For details on the materialized view
syntax and other requirements, refer to CREATE MATERIALIZED VIEW AS SELECT.

COMPARISON                                         VIEW                                    MATERIALIZED VIEW
View definition                                    Stored in Azure data warehouse.         Stored in Azure data warehouse.
View content                                       Generated each time the view is used.   Pre-processed and stored in Azure data warehouse
                                                                                           during view creation. Updated as data is added
                                                                                           to the underlying tables.
Data refresh                                       Always updated                          Always updated
Speed to retrieve view data from complex queries   Slow                                    Fast
Extra storage                                      No                                      Yes
Syntax                                             CREATE VIEW                             CREATE MATERIALIZED VIEW AS SELECT

Benefits of using materialized views


A properly designed materialized view can provide following benefits:
Reduce the execution time for complex queries with JOINs and aggregate functions. The more complex the
query, the higher the potential for execution-time saving. The most benefit is gained when a query's
computation cost is high and the resulting data set is small.
The optimizer in Azure SQL Data Warehouse can automatically use deployed materialized views to improve
query execution plans. This process is transparent to users providing faster query performance and doesn't
require queries to make direct reference to the materialized views.
Require low maintenance on the views. All incremental data changes from the base tables are automatically
added to the materialized views in a synchronous manner. This design allows querying materialized views to
return the same data as directly querying the base tables.
The data in a materialized view can be distributed differently from the base tables.
Data in materialized views gets the same high availability and resiliency benefits as data in regular tables.
Compared to other data warehouse providers, the materialized views implemented in Azure SQL Data Warehouse
also provide the following additional benefits:
Automatic and synchronous data refresh with data changes in base tables. No user action is required.
Broad aggregate function support. See CREATE MATERIALIZED VIEW AS SELECT (Transact-SQL ).
The support for query-specific materialized view recommendation. See EXPLAIN (Transact-SQL ).
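As an illustrative sketch with hypothetical table and column names (see CREATE MATERIALIZED VIEW AS SELECT for the exact requirements), a materialized view that pre-aggregates a fact table might look like this:

-- Pre-aggregate sales by region; the view can use its own distribution strategy.
CREATE MATERIALIZED VIEW dbo.mvSalesByRegion
WITH (DISTRIBUTION = HASH(RegionKey))
AS
SELECT  RegionKey,
        COUNT_BIG(*)                AS order_count,
        SUM(ISNULL(SalesAmount, 0)) AS total_sales
FROM    dbo.FactSales
GROUP BY RegionKey;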

Common scenarios
Materialized views are typically used in following scenarios:
Need to improve the performance of complex analytical queries against large data in size
Complex analytical queries typically use more aggregation functions and table joins, causing more compute-heavy
operations such as shuffles and joins in query execution. That's why those queries take longer to complete,
especially on large tables. Users can create materialized views for the data returned from the common computations
of queries, so there's no recomputation needed when this data is needed by queries, allowing lower compute cost
and faster query response.
Need faster performance with no or minimum query changes
Schema and query changes in data warehouses are typically kept to a minimum to support regular ETL operations
and reporting. People can use materialized views for query performance tuning, if the cost incurred by the views
can be offset by the gain in query performance. In comparison to other tuning options such as scaling and statistics
management, it's a much less impactful production change to create and maintain a materialized view and its
potential performance gain is also higher.
Creating or maintaining materialized views does not impact the queries running against the base tables.
The query optimizer can automatically use the deployed materialized views without direct view reference in a
query. This capability reduces the need for query change in performance tuning.
Need different data distribution strategy for faster query performance
Azure SQL Data Warehouse is a distributed, massively parallel processing (MPP) system. Data in a data warehouse table is distributed across 60 distributions using one of three distribution strategies (hash, round_robin, or replicated). The data distribution is specified at table creation time and stays unchanged until the table is dropped. A materialized view, which is stored on disk like a table, supports hash and round_robin data distributions. Users can choose a data distribution that is different from the base tables but optimal for the performance of the queries that use the views most.

Design guidance
Here is the general guidance on using materialized views to improve query performance:
Design for your workload
Before you begin to create materialized views, it's important to have a deep understanding of your workload in
terms of query patterns, importance, frequency, and the size of resulting data.
Users can run EXPLAIN WITH_RECOMMENDATIONS <SQL_statement> for the materialized views
recommended by the query optimizer. Since these recommendations are query-specific, a materialized view that
benefits a single query may not be optimal for other queries in the same workload. Evaluate these
recommendations with your workload needs in mind. The ideal materialized views are those that benefit the
workload's performance.
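For example, the following is a minimal sketch of requesting recommendations for a single query; the table and column names (dbo.FactSales, CustomerKey, SalesAmount) are hypothetical:

-- Ask the optimizer for query-specific materialized view recommendations
EXPLAIN WITH_RECOMMENDATIONS
SELECT CustomerKey, SUM(SalesAmount) AS total_sales
FROM dbo.FactSales
GROUP BY CustomerKey;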
Be aware of the tradeoff between faster queries and the cost
For each materialized view, there's a data storage cost and a cost for maintaining the view. As data changes in base
tables, the size of the materialized view increases and its physical structure also changes. To avoid query
performance degradation, each materialized view is maintained separately by the data warehouse engine. The
maintenance workload gets higher when the number of materialized views and base table changes increase. Users
should check if the cost incurred from all materialized views can be offset by the query performance gain.
You can run this query for the list of materialized views in a database:

SELECT V.name AS materialized_view, V.object_id
FROM sys.views V
JOIN sys.indexes I ON V.object_id = I.object_id AND I.index_id < 2;

Options to reduce the number of materialized views:


Identify common data sets frequently used by the complex queries in your workload. Create materialized
views to store those data sets so the optimizer can use them as building blocks when creating execution
plans.
Drop the materialized views that have low usage or are no longer needed. A disabled materialized view is
not maintained but it still incurs storage cost.
Combine materialized views created on the same or similar base tables even if their data doesn't overlap.
Combining materialized views could result in a view that is larger in size than the sum of the separate views; however, the view maintenance cost should be reduced. For example:

-- Query 1 would benefit from having a materialized view created with this SELECT statement

SELECT A, SUM(B)
FROM T
GROUP BY A

-- Query 2 would benefit from having a materialized view created with this SELECT statement

SELECT C, SUM(D)
FROM T
GROUP BY C

-- You could create a single materialized view of this form

SELECT A, C, SUM(B), SUM(D)
FROM T
GROUP BY A, C

Not all performance tuning requires query change


The data warehouse optimizer can automatically use deployed materialized views to improve query performance. This support is applied transparently to queries that don't reference the views, including queries that use aggregates unsupported in materialized view creation. No query change is needed. You can check a query's estimated execution plan to confirm whether a materialized view is used.
Monitor materialized views
A materialized view is stored in the data warehouse just like a table with clustered columnstore index (CCI).
Reading data from a materialized view includes scanning the CCI index segments and applying any incremental
changes from base tables. When the number of incremental changes is too high, resolving a query from a
materialized view can take longer than directly querying the base tables. To avoid query performance degradation,
it’s a good practice to run DBCC PDW_SHOWMATERIALIZEDVIEWOVERHEAD to monitor the view’s
overhead_ratio (total_rows / max(1, base_view_row)). Users should REBUILD the materialized view if its
overhead_ratio is too high.
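For example, the following is a sketch of the monitoring and rebuild pattern, assuming a materialized view named dbo.nbViewSS (as created later in this article); check the DBCC documentation for the exact argument quoting:

-- Check the overhead from incremental changes tracked for the materialized view
DBCC PDW_SHOWMATERIALIZEDVIEWOVERHEAD ("dbo.nbViewSS");

-- Rebuild the view if its overhead_ratio is too high
ALTER MATERIALIZED VIEW dbo.nbViewSS REBUILD;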
Materialized view and result set caching
These two features were introduced in Azure SQL Data Warehouse around the same time for query performance tuning. Result set caching is used for getting high concurrency and fast response from repetitive queries against static data. To use the cached result, the form of the cache-requesting query must match the query that produced the cache. In addition, the cached result must apply to the entire query. Materialized views allow data changes in the base tables. Data in materialized views can be applied to a piece of a query. This support allows the same materialized views to be used by different queries that share some computation for faster performance.

Example
This example uses a TPC-DS-like query that finds customers who spend more money via catalog than in stores, and identifies the preferred customers and their country of origin. The query selects the TOP 100 records from the UNION of three sub-SELECT statements involving SUM() and GROUP BY.

WITH year_total AS (
SELECT c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum(isnull(ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt+ss_ext_sales_price, 0)/2)
year_total
,'s' sale_type
FROM customer
,store_sales
,date_dim
WHERE c_customer_sk = ss_customer_sk
AND ss_sold_date_sk = d_date_sk
GROUP BY c_customer_id
,c_first_name
,c_last_name
,c_preferred_cust_flag
,c_birth_country
,c_login
,c_email_address
,d_year
UNION ALL
SELECT c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum(isnull(cs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt+cs_ext_sales_price, 0)/2)
year_total
,'c' sale_type
FROM customer
,catalog_sales
,date_dim
WHERE c_customer_sk = cs_bill_customer_sk
AND cs_sold_date_sk = d_date_sk
GROUP BY c_customer_id
,c_first_name
,c_last_name
,c_preferred_cust_flag
,c_birth_country
,c_login
,c_email_address
,d_year
UNION ALL
SELECT c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum(isnull(ws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt+ws_ext_sales_price, 0)/2)
year_total
,'w' sale_type
FROM customer
,web_sales
,date_dim
WHERE c_customer_sk = ws_bill_customer_sk
AND ws_sold_date_sk = d_date_sk
GROUP BY c_customer_id
,c_first_name
,c_last_name
,c_preferred_cust_flag
,c_birth_country
,c_login
,c_email_address
,d_year
)
SELECT TOP 100
t_s_secyear.customer_id
,t_s_secyear.customer_first_name
,t_s_secyear.customer_last_name
,t_s_secyear.customer_birth_country
FROM year_total t_s_firstyear
,year_total t_s_secyear
,year_total t_c_firstyear
,year_total t_c_secyear
,year_total t_w_firstyear
,year_total t_w_secyear
WHERE t_s_secyear.customer_id = t_s_firstyear.customer_id
AND t_s_firstyear.customer_id = t_c_secyear.customer_id
AND t_s_firstyear.customer_id = t_c_firstyear.customer_id
AND t_s_firstyear.customer_id = t_w_firstyear.customer_id
AND t_s_firstyear.customer_id = t_w_secyear.customer_id
AND t_s_firstyear.sale_type = 's'
AND t_c_firstyear.sale_type = 'c'
AND t_w_firstyear.sale_type = 'w'
AND t_s_secyear.sale_type = 's'
AND t_c_secyear.sale_type = 'c'
AND t_w_secyear.sale_type = 'w'
AND t_s_firstyear.dyear+0 = 1999
AND t_s_secyear.dyear+0 = 1999+1
AND t_c_firstyear.dyear+0 = 1999
AND t_c_secyear.dyear+0 = 1999+1
AND t_w_firstyear.dyear+0 = 1999
AND t_w_secyear.dyear+0 = 1999+1
AND t_s_firstyear.year_total > 0
AND t_c_firstyear.year_total > 0
AND t_w_firstyear.year_total > 0
AND CASE WHEN t_c_firstyear.year_total > 0 THEN t_c_secyear.year_total / t_c_firstyear.year_total ELSE NULL
END
> CASE WHEN t_s_firstyear.year_total > 0 THEN t_s_secyear.year_total / t_s_firstyear.year_total ELSE
NULL END
AND CASE WHEN t_c_firstyear.year_total > 0 THEN t_c_secyear.year_total / t_c_firstyear.year_total ELSE NULL
END
> CASE WHEN t_w_firstyear.year_total > 0 THEN t_w_secyear.year_total / t_w_firstyear.year_total ELSE
NULL END
ORDER BY t_s_secyear.customer_id
,t_s_secyear.customer_first_name
,t_s_secyear.customer_last_name
,t_s_secyear.customer_birth_country
OPTION ( LABEL = 'Query04-af359846-253-3');

Check the query's estimated execution plan. There are 18 shuffle and 17 join operations, which take more time to execute. Now, let's create one materialized view for each of the three sub-SELECT statements.

CREATE materialized view nbViewSS WITH (DISTRIBUTION=HASH(customer_id)) AS


SELECT c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum(isnull(ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt+ss_ext_sales_price, 0)/2)
year_total
, count_big(*) AS cb
FROM dbo.customer
,dbo.store_sales
,dbo.date_dim
WHERE c_customer_sk = ss_customer_sk
AND ss_sold_date_sk = d_date_sk
GROUP BY c_customer_id
,c_first_name
,c_last_name
,c_preferred_cust_flag
,c_birth_country
,c_login
,c_email_address
,d_year
GO
CREATE materialized view nbViewCS WITH (DISTRIBUTION=HASH(customer_id)) AS
SELECT c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum(isnull(cs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt+cs_ext_sales_price, 0)/2)
year_total
, count_big(*) as cb
FROM dbo.customer
,dbo.catalog_sales
,dbo.date_dim
WHERE c_customer_sk = cs_bill_customer_sk
AND cs_sold_date_sk = d_date_sk
GROUP BY c_customer_id
,c_first_name
,c_last_name
,c_preferred_cust_flag
,c_birth_country
,c_login
,c_email_address
,d_year

GO
CREATE materialized view nbViewWS WITH (DISTRIBUTION=HASH(customer_id)) AS
SELECT c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum(isnull(ws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt+ws_ext_sales_price, 0)/2)
year_total
, count_big(*) AS cb
FROM dbo.customer
,dbo.web_sales
,dbo.date_dim
WHERE c_customer_sk = ws_bill_customer_sk
AND ws_sold_date_sk = d_date_sk
GROUP BY c_customer_id
,c_first_name
,c_last_name
,c_preferred_cust_flag
,c_birth_country
,c_login
,c_email_address
,d_year

Check the execution plan of the original query again. Now the number of joins changes from 17 to 5, and there are no shuffles anymore. Click the Filter operation icon in the plan; its Output List shows that the data is read from the materialized views instead of the base tables.

With materialized views, the same query runs much faster without any code change.
Next steps
For more development tips, see SQL Data Warehouse development overview.
Performance tuning with ordered clustered
columnstore index
10/18/2019 • 6 minutes to read • Edit Online

When users query a columnstore table in Azure SQL Data Warehouse, the optimizer checks the minimum and
maximum values stored in each segment. Segments that are outside the bounds of the query predicate aren't read
from disk to memory. A query can get faster performance if the number of segments to read and their total size are
small.

Ordered vs. non-ordered clustered columnstore index


By default, for each Azure SQL Data Warehouse table created without an index option, an internal component (index builder) creates a non-ordered clustered columnstore index (CCI) on it. Data in each column is compressed into a separate CCI rowgroup segment. There's metadata on each segment's value range, so segments that are outside the bounds of the query predicate aren't read from disk during query execution. CCI offers the highest level of data compression and reduces the size of segments to read so queries can run faster. However, because the index builder doesn't sort data before compressing it into segments, segments with overlapping value ranges could occur, causing queries to read more segments from disk and take longer to finish.
When creating an ordered CCI, the Azure SQL Data Warehouse engine sorts the existing data in memory by the
order key(s) before the index builder compresses them into index segments. With sorted data, segment
overlapping is reduced allowing queries to have a more efficient segment elimination and thus faster performance
because the number of segments to read from disk is smaller. If all data can be sorted in memory at once, then
segment overlapping can be avoided. Given the large size of data in data warehouse tables, this scenario doesn't
happen often.
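As a sketch, an ordered CCI can be specified at table creation time. The table and column names below are hypothetical; the ORDER clause names the sort key:

-- Create a hash-distributed table with an ordered clustered columnstore index on the date key
CREATE TABLE dbo.FactSalesOrdered
(
    OrderDateKey int NOT NULL,
    ProductKey   int NOT NULL,
    SalesAmount  money NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX ORDER (OrderDateKey)
);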
To check the segment ranges for a column, run this command with your table name and column name:

SELECT o.name, pnp.index_id, cls.row_count, pnp.data_compression_desc, pnp.pdw_node_id,
    pnp.distribution_id, cls.segment_id, cls.column_id, cls.min_data_id, cls.max_data_id,
    cls.max_data_id - cls.min_data_id AS difference
FROM sys.pdw_nodes_partitions AS pnp
JOIN sys.pdw_nodes_tables AS Ntables ON pnp.object_id = NTables.object_id AND pnp.pdw_node_id = NTables.pdw_node_id
JOIN sys.pdw_table_mappings AS Tmap ON NTables.name = TMap.physical_name AND substring(TMap.physical_name,40, 10) = pnp.distribution_id
JOIN sys.objects AS o ON TMap.object_id = o.object_id
JOIN sys.pdw_nodes_column_store_segments AS cls ON pnp.partition_id = cls.partition_id AND pnp.distribution_id = cls.distribution_id
JOIN sys.columns AS cols ON o.object_id = cols.object_id AND cls.column_id = cols.column_id
WHERE o.name = '<Table_Name>' AND cols.name = '<Column_Name>'
ORDER BY o.name, pnp.distribution_id, cls.min_data_id

NOTE
In an ordered CCI table, new data resulting from DML or data loading operations are not automatically sorted. Users can
REBUILD the ordered CCI to sort all data in the table. In Azure SQL Data Warehouse, the columnstore index REBUILD is an
offline operation. For a partitioned table, the REBUILD is done one partition at a time. Data in the partition that is being
rebuilt is "offline" and unavailable until the REBUILD is complete for that partition.
Query performance
A query's performance gain from an ordered CCI depends on the query patterns, the size of data, how well the
data is sorted, the physical structure of segments, and the DWU and resource class chosen for the query execution.
Users should review all these factors before choosing the ordering columns when designing an ordered CCI table.
Queries with all these patterns typically run faster with ordered CCI.
1. The queries have equality, inequality, or range predicates
2. The predicate columns and the ordered CCI columns are the same.
3. The predicate columns are used in the same order as the column ordinal of ordered CCI columns.
In this example, table T1 has a clustered columnstore index ordered in the sequence of Col_C, Col_B, and Col_A.

CREATE CLUSTERED COLUMNSTORE INDEX MyOrderedCCI ON T1


ORDER (Col_C, Col_B, Col_A)

The performance of query 1 can benefit more from ordered CCI than the other three queries.

-- Query #1:

SELECT * FROM T1 WHERE Col_C = 'c' AND Col_B = 'b' AND Col_A = 'a';

-- Query #2

SELECT * FROM T1 WHERE Col_B = 'b' AND Col_C = 'c' AND Col_A = 'a';

-- Query #3
SELECT * FROM T1 WHERE Col_B = 'b' AND Col_A = 'a';

-- Query #4
SELECT * FROM T1 WHERE Col_A = 'a' AND Col_C = 'c';

Data loading performance


The performance of data loading into an ordered CCI table is similar to that of a partitioned table. Loading data into an ordered CCI table can take longer than loading a non-ordered CCI table because of the data sorting operation; however, queries can run faster afterwards with the ordered CCI.
Here is an example performance comparison of loading data into tables with different schemas.
Here is an example query performance comparison between CCI and ordered CCI.

Reduce segment overlapping


The number of overlapping segments depends on the size of data to sort, the available memory, and the maximum
degree of parallelism (MAXDOP ) setting during ordered CCI creation. Below are options to reduce segment
overlapping when creating ordered CCI.
Use xlargerc resource class on a higher DWU to allow more memory for data sorting before the index
builder compresses the data into segments. Once in an index segment, the physical location of the data
cannot be changed. There is no data sorting within a segment or across segments.
Create ordered CCI with MAXDOP = 1. Each thread used for ordered CCI creation works on a subset of
data and sorts it locally. There's no global sorting across data sorted by different threads. Using parallel
threads can reduce the time to create an ordered CCI but will generate more overlapping segments than
using a single thread. Currently, the MAXDOP option is only supported when creating an ordered CCI table using the CREATE TABLE AS SELECT command. Creating an ordered CCI via the CREATE INDEX or CREATE TABLE commands does not support the MAXDOP option. For example:

CREATE TABLE Table1 WITH (DISTRIBUTION = HASH(c1), CLUSTERED COLUMNSTORE INDEX ORDER(c1) )
AS SELECT * FROM ExampleTable
OPTION (MAXDOP 1);
Pre-sort the data by the sort key(s) before loading them into Azure SQL Data Warehouse tables.
Here is an example of an ordered CCI table distribution that has zero segment overlapping when following the above recommendations. The ordered CCI table is created in a DWU1000c database via CTAS from a 20-GB heap table using MAXDOP 1 and xlargerc. The CCI is ordered on a BIGINT column with no duplicates.

Create ordered CCI on large tables


Creating an ordered CCI is an offline operation. For tables with no partitions, the data won't be accessible to users
until the ordered CCI creation process completes. For partitioned tables, since the engine creates the ordered CCI partition by partition, users can still access the data in partitions where ordered CCI creation isn't in progress. You can use this option to minimize the downtime during ordered CCI creation on large tables:
1. Create partitions on the target large table (called Table_A).
2. Create an empty ordered CCI table (called Table_B) with the same table and partition schema as Table_A.
3. Switch one partition from Table_A to Table_B.
4. Run ALTER INDEX <Ordered_CCI_Index> ON <Table_B> REBUILD PARTITION = <Partition_ID> on Table_B to rebuild the switched-in partition.
5. Repeat steps 3 and 4 for each partition in Table_A.
6. Once all partitions are switched from Table_A to Table_B and have been rebuilt, drop Table_A and rename Table_B to Table_A.
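For example, steps 3 and 4 for a single partition might look like the following sketch, assuming an ordered CCI named OrderedCCI_B on Table_B (the table, index, and partition number are hypothetical):

-- Step 3: switch one partition from Table_A into the empty ordered CCI table
ALTER TABLE Table_A SWITCH PARTITION 1 TO Table_B PARTITION 1;

-- Step 4: rebuild the switched-in partition so its data is sorted
ALTER INDEX OrderedCCI_B ON Table_B REBUILD PARTITION = 1;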

Examples
A. To check for ordered columns and order ordinal:

SELECT object_name(c.object_id) table_name, c.name column_name, i.column_store_order_ordinal
FROM sys.index_columns i
JOIN sys.columns c ON i.object_id = c.object_id AND c.column_id = i.column_id
WHERE column_store_order_ordinal <> 0;

B. To change column ordinal, add or remove columns from the order list, or to change from CCI to
ordered CCI:

CREATE CLUSTERED COLUMNSTORE INDEX InternetSales ON InternetSales
ORDER (ProductKey, SalesAmount)
WITH (DROP_EXISTING = ON);

Next steps
For more development tips, see SQL Data Warehouse development overview.
Performance tuning with result set caching
10/17/2019 • 2 minutes to read • Edit Online

When result set caching is enabled, Azure SQL Data Warehouse automatically caches query results in the user
database for repetitive use. This allows subsequent query executions to get results directly from the persisted cache
so recomputation is not needed. Result set caching improves query performance and reduces compute resource
usage. In addition, queries using cached result sets do not use any concurrency slots and thus do not count against existing concurrency limits. For security, users can only access the cached results if they have the same data access permissions as the users who created the cached results.

Key commands
Turn ON/OFF result set caching for a user database
Turn ON/OFF result set caching for a session
Check the size of cached result set
Clean up the cache

What's not cached


Once result set caching is turned ON for a database, results are cached for all queries until the cache is full, except
for these queries:
Queries using non-deterministic functions such as DateTime.Now()
Queries using user-defined functions
Queries using tables with row-level security or column-level security enabled
Queries returning data with row size larger than 64 KB
Queries with large result sets (for example, > 1 million rows) may experience slower performance during the first
run when the result cache is being created.

When cached results are used


A cached result set is reused for a query if all of the following requirements are met:
The user who's running the query has access to all the tables referenced in the query.
There is an exact match between the new query and the previous query that generated the result set cache.
There are no data or schema changes in the tables from which the cached result set was generated.
Run this command to check if a query was executed with a result cache hit or miss. If there is a cache hit, the
result_cache_hit will return 1.

SELECT request_id, command, result_cache_hit
FROM sys.pdw_exec_requests
WHERE request_id = <'Your_Query_Request_ID'>;

Manage cached results


The maximum size of result set cache is 1 TB per database. The cached results are automatically invalidated when
the underlying query data change.
The cache eviction is managed by Azure SQL Data Warehouse automatically following this schedule:
Every 48 hours if the result set hasn't been used or has been invalidated.
When the result set cache approaches the maximum size.
Users can manually empty the entire result set cache by using one of these options:
Turn OFF the result set cache feature for the database
Run DBCC DROPRESULTSETCACHE while connected to the database
Pausing a database won't empty the cached result sets.
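As a sketch, the key commands look like the following. MyDW is a hypothetical database name, and the ALTER DATABASE statement is run while connected to the master database:

-- Turn result set caching ON for a user database (run in master)
ALTER DATABASE MyDW SET RESULT_SET_CACHING ON;

-- Turn result set caching OFF for the current session
SET RESULT_SET_CACHING OFF;

-- Check the size of the cached result sets in the current database
DBCC SHOWRESULTCACHESPACEUSED;

-- Remove all cached result sets from the current database
DBCC DROPRESULTSETCACHE;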

Next steps
For more development tips, see SQL Data Warehouse development overview.
Designing tables in Azure SQL Data Warehouse
9/26/2019 • 10 minutes to read • Edit Online

Learn key concepts for designing tables in Azure SQL Data Warehouse.

Determine table category


A star schema organizes data into fact and dimension tables. Some tables are used for integration or staging data
before it moves to a fact or dimension table. As you design a table, decide whether the table data belongs in a
fact, dimension, or integration table. This decision informs the appropriate table structure and distribution.
Fact tables contain quantitative data that are commonly generated in a transactional system, and then
loaded into the data warehouse. For example, a retail business generates sales transactions every day, and
then loads the data into a data warehouse fact table for analysis.
Dimension tables contain attribute data that might change but usually changes infrequently. For
example, a customer's name and address are stored in a dimension table and updated only when the
customer's profile changes. To minimize the size of a large fact table, the customer's name and address do
not need to be in every row of a fact table. Instead, the fact table and the dimension table can share a
customer ID. A query can join the two tables to associate a customer's profile and transactions.
Integration tables provide a place for integrating or staging data. You can create an integration table as a
regular table, an external table, or a temporary table. For example, you can load data to a staging table,
perform transformations on the data in staging, and then insert the data into a production table.

Schema and table names


Schemas are a good way to group together tables that are used in a similar fashion. If you're migrating multiple databases from an on-premises solution to SQL Data Warehouse, it works best to migrate all of the fact, dimension, and integration tables to one schema in SQL Data Warehouse. For example, you could store all the tables in the
WideWorldImportersDW sample data warehouse within one schema called wwi. The following code creates a
user-defined schema called wwi.

CREATE SCHEMA wwi;

To show the organization of the tables in SQL Data Warehouse, you could use fact, dim, and int as prefixes to the
table names. The following table shows some of the schema and table names for WideWorldImportersDW.

WIDEWORLDIMPORTERSDW TABLE | TABLE TYPE | SQL DATA WAREHOUSE
City | Dimension | wwi.DimCity
Order | Fact | wwi.FactOrder

Table persistence
Tables store data either permanently in Azure Storage, temporarily in Azure Storage, or in a data store external to
data warehouse.
Regular table
A regular table stores data in Azure Storage as part of the data warehouse. The table and the data persist
regardless of whether a session is open. This example creates a regular table with two columns.

CREATE TABLE MyTable (col1 int, col2 int );

Temporary table
A temporary table only exists for the duration of the session. You can use a temporary table to prevent other
users from seeing temporary results and also to reduce the need for cleanup. Temporary tables utilize local
storage to offer fast performance. For more information, see Temporary tables.
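For example, the following is a minimal sketch of a session-scoped temporary table; the table name and columns are hypothetical:

-- A temporary table exists only for the duration of the current session
CREATE TABLE #StageSales
(
    ProductKey  int   NOT NULL,
    SalesAmount money NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);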
External table
An external table points to data located in Azure Storage blob or Azure Data Lake Store. When used in
conjunction with the CREATE TABLE AS SELECT statement, selecting from an external table imports data into
SQL Data Warehouse. External tables are therefore useful for loading data. For a loading tutorial, see Use
PolyBase to load data from Azure blob storage.
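For example, the following is a sketch of an external table definition. It assumes an external data source and file format have already been created; the names (ext.Sales, AzureBlobStore, TextFileFormat) and the location path are hypothetical:

-- The table definition is stored in SQL Data Warehouse; the data stays in external storage
CREATE EXTERNAL TABLE ext.Sales
(
    ProductKey  int   NOT NULL,
    SalesAmount money NOT NULL
)
WITH
(
    LOCATION    = '/sales/2019/',
    DATA_SOURCE = AzureBlobStore,
    FILE_FORMAT = TextFileFormat
);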

Data types
SQL Data Warehouse supports the most commonly used data types. For a list of the supported data types, see
the data types in the CREATE TABLE statement. For guidance on using data types, see
Data types.

Distributed tables
A fundamental feature of SQL Data Warehouse is the way it can store and operate on tables across distributions.
SQL Data Warehouse supports three methods for distributing data: round-robin (default), hash, and replicated.
Hash-distributed tables
A hash distributed table distributes rows based on the value in the distribution column. A hash distributed table
is designed to achieve high performance for queries on large tables. There are several factors to consider when
choosing a distribution column.
For more information, see Design guidance for distributed tables.
Replicated tables
A replicated table has a full copy of the table available on every Compute node. Queries run fast on replicated
tables since joins on replicated tables do not require data movement. Replication requires extra storage, though,
and is not practical for large tables.
For more information, see Design guidance for replicated tables.
Round-robin tables
A round-robin table distributes table rows evenly across all distributions. The rows are distributed randomly.
Loading data into a round-robin table is fast. However, queries can require more data movement than the other
distribution methods.
For more information, see Design guidance for distributed tables.
Common distribution methods for tables
The table category often determines which option to choose for distributing the table.

TABLE CATEGORY | RECOMMENDED DISTRIBUTION OPTION
Fact | Use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution column.
Dimension | Use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.
Staging | Use round-robin for the staging table. The load with CTAS is fast. Once the data is in the staging table, use INSERT...SELECT to move the data to production tables.

Table partitions
A partitioned table stores and performs operations on the table rows according to data ranges. For example, a
table could be partitioned by day, month, or year. You can improve query performance through partition
elimination, which limits a query scan to data within a partition. You can also maintain the data through partition
switching. Since the data in SQL Data Warehouse is already distributed, too many partitions can slow query
performance. For more information, see Partitioning guidance. When partition switching into table partitions that
are not empty, consider using the TRUNCATE_TARGET option in your ALTER TABLE statement if the existing
data is to be truncated. The following code switches the transformed daily data into the SalesFact table, overwriting any existing data.

ALTER TABLE SalesFact_DailyFinalLoad SWITCH PARTITION 256 TO SalesFact PARTITION 256 WITH (TRUNCATE_TARGET =
ON);

Columnstore indexes
By default, SQL Data Warehouse stores a table as a clustered columnstore index. This form of data storage
achieves high data compression and query performance on large tables. The clustered columnstore index is
usually the best choice, but in some cases a clustered index or a heap is the appropriate storage structure. A heap
table can be especially useful for loading transient data, such as a staging table which is transformed into a final
table.
For a list of columnstore features, see What's new for columnstore indexes. To improve columnstore index
performance, see Maximizing rowgroup quality for columnstore indexes.

Statistics
The query optimizer uses column-level statistics when it creates the plan for executing a query. To improve query
performance, it's important to have statistics on individual columns, especially columns used in query joins.
Creating statistics happens automatically. However, updating statistics does not happen automatically. Update
statistics after a significant number of rows are added or changed. For example, update statistics after a load. For
more information, see Statistics guidance.

Primary key and unique key


PRIMARY KEY is only supported when NONCLUSTERED and NOT ENFORCED are both used. A UNIQUE constraint is only supported when NOT ENFORCED is used. For details, check SQL Data Warehouse Table Constraints.
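For example, the following is a sketch of adding these constraints to a hypothetical dimension table (dbo.DimCustomer with CustomerKey and EmailAddress columns):

-- PRIMARY KEY must be NONCLUSTERED and NOT ENFORCED
ALTER TABLE dbo.DimCustomer
ADD CONSTRAINT PK_DimCustomer PRIMARY KEY NONCLUSTERED (CustomerKey) NOT ENFORCED;

-- UNIQUE constraints must be NOT ENFORCED
ALTER TABLE dbo.DimCustomer
ADD CONSTRAINT UQ_DimCustomer_Email UNIQUE (EmailAddress) NOT ENFORCED;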

Commands for creating tables


You can create a table as a new empty table. You can also create and populate a table with the results of a select
statement. The following are the T-SQL commands for creating a table.

T-SQL STATEMENT | DESCRIPTION
CREATE TABLE | Creates an empty table by defining all the table columns and options.
CREATE EXTERNAL TABLE | Creates an external table. The definition of the table is stored in SQL Data Warehouse. The table data is stored in Azure Blob storage or Azure Data Lake Store.
CREATE TABLE AS SELECT | Populates a new table with the results of a select statement. The table columns and data types are based on the select statement results. To import data, this statement can select from an external table.
CREATE EXTERNAL TABLE AS SELECT | Creates a new external table by exporting the results of a select statement to an external location. The location is either Azure Blob storage or Azure Data Lake Store.

Aligning source data with the data warehouse


Data warehouse tables are populated by loading data from another data source. To perform a successful load, the
number and data types of the columns in the source data must align with the table definition in the data
warehouse. Getting the data to align might be the hardest part of designing your tables.
If data is coming from multiple data stores, you can bring the data into the data warehouse and store it in an
integration table. Once data is in the integration table, you can use the power of SQL Data Warehouse to
perform transformation operations. Once the data is prepared, you can insert it into production tables.

Unsupported table features


SQL Data Warehouse supports many, but not all, of the table features offered by other databases. The following
list shows some of the table features that are not supported in SQL Data Warehouse.
Foreign key, Check Table Constraints
Computed Columns
Indexed Views
Sequence
Sparse Columns
Surrogate Keys. Implement with Identity.
Synonyms
Triggers
Unique Indexes
User-Defined Types

Table size queries


One simple way to identify space and rows consumed by a table in each of the 60 distributions is to use DBCC PDW_SHOWSPACEUSED.

DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');
However, using DBCC commands can be quite limiting. Dynamic management views (DMVs) show more detail
than DBCC commands. Start by creating this view.

CREATE VIEW dbo.vTableSizes


AS
WITH base
AS
(
SELECT
GETDATE() AS [execution_time]
, DB_NAME() AS [database_name]
, s.name AS [schema_name]
, t.name AS [table_name]
, QUOTENAME(s.name)+'.'+QUOTENAME(t.name) AS [two_part_name]
, nt.[name] AS [node_table_name]
, ROW_NUMBER() OVER(PARTITION BY nt.[name] ORDER BY (SELECT NULL)) AS [node_table_name_seq]
, tp.[distribution_policy_desc] AS [distribution_policy_name]
, c.[name] AS [distribution_column]
, nt.[distribution_id] AS [distribution_id]
, i.[type] AS [index_type]
, i.[type_desc] AS [index_type_desc]
, nt.[pdw_node_id] AS [pdw_node_id]
, pn.[type] AS [pdw_node_type]
, pn.[name] AS [pdw_node_name]
, di.name AS [dist_name]
, di.position AS [dist_position]
, nps.[partition_number] AS [partition_nmbr]
, nps.[reserved_page_count] AS [reserved_space_page_count]
, nps.[reserved_page_count] - nps.[used_page_count] AS [unused_space_page_count]
, nps.[in_row_data_page_count]
+ nps.[row_overflow_used_page_count]
+ nps.[lob_used_page_count] AS [data_space_page_count]
, nps.[reserved_page_count]
- (nps.[reserved_page_count] - nps.[used_page_count])
- ([in_row_data_page_count]
+ [row_overflow_used_page_count]+[lob_used_page_count]) AS [index_space_page_count]
, nps.[row_count] AS [row_count]
from
sys.schemas s
INNER JOIN sys.tables t
ON s.[schema_id] = t.[schema_id]
INNER JOIN sys.indexes i
ON t.[object_id] = i.[object_id]
AND i.[index_id] <= 1
INNER JOIN sys.pdw_table_distribution_properties tp
ON t.[object_id] = tp.[object_id]
INNER JOIN sys.pdw_table_mappings tm
ON t.[object_id] = tm.[object_id]
INNER JOIN sys.pdw_nodes_tables nt
ON tm.[physical_name] = nt.[name]
INNER JOIN sys.dm_pdw_nodes pn
ON nt.[pdw_node_id] = pn.[pdw_node_id]
INNER JOIN sys.pdw_distributions di
ON nt.[distribution_id] = di.[distribution_id]
INNER JOIN sys.dm_pdw_nodes_db_partition_stats nps
ON nt.[object_id] = nps.[object_id]
AND nt.[pdw_node_id] = nps.[pdw_node_id]
AND nt.[distribution_id] = nps.[distribution_id]
LEFT OUTER JOIN (select * from sys.pdw_column_distribution_properties where distribution_ordinal = 1) cdp
ON t.[object_id] = cdp.[object_id]
LEFT OUTER JOIN sys.columns c
ON cdp.[object_id] = c.[object_id]
AND cdp.[column_id] = c.[column_id]
)
, size
AS
(
SELECT
[execution_time]
, [database_name]
, [schema_name]
, [table_name]
, [two_part_name]
, [node_table_name]
, [node_table_name_seq]
, [distribution_policy_name]
, [distribution_column]
, [distribution_id]
, [index_type]
, [index_type_desc]
, [pdw_node_id]
, [pdw_node_type]
, [pdw_node_name]
, [dist_name]
, [dist_position]
, [partition_nmbr]
, [reserved_space_page_count]
, [unused_space_page_count]
, [data_space_page_count]
, [index_space_page_count]
, [row_count]
, ([reserved_space_page_count] * 8.0) AS [reserved_space_KB]
, ([reserved_space_page_count] * 8.0)/1000 AS [reserved_space_MB]
, ([reserved_space_page_count] * 8.0)/1000000 AS [reserved_space_GB]
, ([reserved_space_page_count] * 8.0)/1000000000 AS [reserved_space_TB]
, ([unused_space_page_count] * 8.0) AS [unused_space_KB]
, ([unused_space_page_count] * 8.0)/1000 AS [unused_space_MB]
, ([unused_space_page_count] * 8.0)/1000000 AS [unused_space_GB]
, ([unused_space_page_count] * 8.0)/1000000000 AS [unused_space_TB]
, ([data_space_page_count] * 8.0) AS [data_space_KB]
, ([data_space_page_count] * 8.0)/1000 AS [data_space_MB]
, ([data_space_page_count] * 8.0)/1000000 AS [data_space_GB]
, ([data_space_page_count] * 8.0)/1000000000 AS [data_space_TB]
, ([index_space_page_count] * 8.0) AS [index_space_KB]
, ([index_space_page_count] * 8.0)/1000 AS [index_space_MB]
, ([index_space_page_count] * 8.0)/1000000 AS [index_space_GB]
, ([index_space_page_count] * 8.0)/1000000000 AS [index_space_TB]
FROM base
)
SELECT *
FROM size
;

Table space summary


This query returns the rows and space by table. It allows you to see which tables are your largest tables and
whether they are round-robin, replicated, or hash-distributed. For hash-distributed tables, the query shows the
distribution column.
SELECT
database_name
, schema_name
, table_name
, distribution_policy_name
, distribution_column
, index_type_desc
, COUNT(distinct partition_nmbr) as nbr_partitions
, SUM(row_count) as table_row_count
, SUM(reserved_space_GB) as table_reserved_space_GB
, SUM(data_space_GB) as table_data_space_GB
, SUM(index_space_GB) as table_index_space_GB
, SUM(unused_space_GB) as table_unused_space_GB
FROM
dbo.vTableSizes
GROUP BY
database_name
, schema_name
, table_name
, distribution_policy_name
, distribution_column
, index_type_desc
ORDER BY
table_reserved_space_GB desc
;

Table space by distribution type

SELECT
distribution_policy_name
, SUM(row_count) as table_type_row_count
, SUM(reserved_space_GB) as table_type_reserved_space_GB
, SUM(data_space_GB) as table_type_data_space_GB
, SUM(index_space_GB) as table_type_index_space_GB
, SUM(unused_space_GB) as table_type_unused_space_GB
FROM dbo.vTableSizes
GROUP BY distribution_policy_name
;

Table space by index type

SELECT
index_type_desc
, SUM(row_count) as table_type_row_count
, SUM(reserved_space_GB) as table_type_reserved_space_GB
, SUM(data_space_GB) as table_type_data_space_GB
, SUM(index_space_GB) as table_type_index_space_GB
, SUM(unused_space_GB) as table_type_unused_space_GB
FROM dbo.vTableSizes
GROUP BY index_type_desc
;

Distribution space summary


SELECT
distribution_id
, SUM(row_count) as total_node_distribution_row_count
, SUM(reserved_space_MB) as total_node_distribution_reserved_space_MB
, SUM(data_space_MB) as total_node_distribution_data_space_MB
, SUM(index_space_MB) as total_node_distribution_index_space_MB
, SUM(unused_space_MB) as total_node_distribution_unused_space_MB
FROM dbo.vTableSizes
GROUP BY distribution_id
ORDER BY distribution_id
;

Next steps
After creating the tables for your data warehouse, the next step is to load data into the table. For a loading
tutorial, see Loading data to SQL Data Warehouse.
CREATE TABLE AS SELECT (CTAS) in Azure SQL
Data Warehouse
10/24/2019 • 9 minutes to read • Edit Online

This article explains the CREATE TABLE AS SELECT (CTAS ) T-SQL statement in Azure SQL Data Warehouse for
developing solutions. The article also provides code examples.

CREATE TABLE AS SELECT


The CREATE TABLE AS SELECT (CTAS ) statement is one of the most important T-SQL features available. CTAS
is a parallel operation that creates a new table based on the output of a SELECT statement. CTAS is the simplest
and fastest way to create and insert data into a table with a single command.

SELECT...INTO vs. CTAS


CTAS is a more customizable version of the SELECT...INTO statement.
The following is an example of a simple SELECT...INTO:

SELECT *
INTO [dbo].[FactInternetSales_new]
FROM [dbo].[FactInternetSales]

SELECT...INTO doesn't allow you to change either the distribution method or the index type as part of the
operation. You create [dbo].[FactInternetSales_new] by using the default distribution type of ROUND_ROBIN,
and the default table structure of CLUSTERED COLUMNSTORE INDEX.
With CTAS, on the other hand, you can specify both the distribution of the table data as well as the table structure
type. To convert the previous example to CTAS:

CREATE TABLE [dbo].[FactInternetSales_new]


WITH
(
DISTRIBUTION = ROUND_ROBIN
,CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM [dbo].[FactInternetSales];

NOTE
If you're only trying to change the index in your CTAS operation, and the source table is hash distributed, maintain the
same distribution column and data type. This avoids cross-distribution data movement during the operation, which is more
efficient.

Use CTAS to copy a table


Perhaps one of the most common uses of CTAS is creating a copy of a table in order to change the DDL. Let's say
you originally created your table as ROUND_ROBIN , and now want to change it to a table distributed on a column.
CTAS is how you would change the distribution column. You can also use CTAS to change partitioning, indexing,
or column types.
Let's say you created this table by using the default distribution type of ROUND_ROBIN , not specifying a distribution
column in the CREATE TABLE .

CREATE TABLE FactInternetSales


(
ProductKey int NOT NULL,
OrderDateKey int NOT NULL,
DueDateKey int NOT NULL,
ShipDateKey int NOT NULL,
CustomerKey int NOT NULL,
PromotionKey int NOT NULL,
CurrencyKey int NOT NULL,
SalesTerritoryKey int NOT NULL,
SalesOrderNumber nvarchar(20) NOT NULL,
SalesOrderLineNumber tinyint NOT NULL,
RevisionNumber tinyint NOT NULL,
OrderQuantity smallint NOT NULL,
UnitPrice money NOT NULL,
ExtendedAmount money NOT NULL,
UnitPriceDiscountPct float NOT NULL,
DiscountAmount float NOT NULL,
ProductStandardCost money NOT NULL,
TotalProductCost money NOT NULL,
SalesAmount money NOT NULL,
TaxAmt money NOT NULL,
Freight money NOT NULL,
CarrierTrackingNumber nvarchar(25),
CustomerPONumber nvarchar(25));

Now you want to create a new copy of this table, with a Clustered Columnstore Index , so you can take advantage
of the performance of Clustered Columnstore tables. You also want to distribute this table on ProductKey ,
because you're anticipating joins on this column and want to avoid data movement during joins on ProductKey .
Lastly, you also want to add partitioning on OrderDateKey , so you can quickly delete old data by dropping old
partitions. Here is the CTAS statement, which copies your old table into a new table.

CREATE TABLE FactInternetSales_new


WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(ProductKey),
PARTITION
(
OrderDateKey RANGE RIGHT FOR VALUES
(
20000101,20010101,20020101,20030101,20040101,20050101,20060101,20070101,20080101,20090101,
20100101,20110101,20120101,20130101,20140101,20150101,20160101,20170101,20180101,20190101,
20200101,20210101,20220101,20230101,20240101,20250101,20260101,20270101,20280101,20290101
)
)
)
AS SELECT * FROM FactInternetSales;

Finally, you can rename your tables, to swap in your new table and then drop your old table.
RENAME OBJECT FactInternetSales TO FactInternetSales_old;
RENAME OBJECT FactInternetSales_new TO FactInternetSales;

DROP TABLE FactInternetSales_old;

Use CTAS to work around unsupported features


You can also use CTAS to work around a number of the unsupported features listed below. This method can often
prove helpful, because not only will your code be compliant, but it will often run faster on SQL Data Warehouse.
This performance is a result of its fully parallelized design. Scenarios include:
ANSI JOINS on UPDATEs
ANSI JOINs on DELETEs
MERGE statement

TIP
Try to think "CTAS first." Solving a problem by using CTAS is generally a good approach, even if you're writing more data as
a result.

ANSI join replacement for update statements


You might find that you have a complex update. The update joins more than two tables together by using ANSI
join syntax to perform the UPDATE or DELETE.
Imagine you had to update this table:

CREATE TABLE [dbo].[AnnualCategorySales]


( [EnglishProductCategoryName] NVARCHAR(50) NOT NULL
, [CalendarYear] SMALLINT NOT NULL
, [TotalSalesAmount] MONEY NOT NULL
)
WITH
(
DISTRIBUTION = ROUND_ROBIN
);

The original query might have looked something like this example:
UPDATE acs
SET [TotalSalesAmount] = [fis].[TotalSalesAmount]
FROM [dbo].[AnnualCategorySales] AS acs
JOIN (
SELECT [EnglishProductCategoryName]
, [CalendarYear]
, SUM([SalesAmount]) AS [TotalSalesAmount]
FROM [dbo].[FactInternetSales] AS s
JOIN [dbo].[DimDate] AS d ON s.[OrderDateKey] = d.[DateKey]
JOIN [dbo].[DimProduct] AS p ON s.[ProductKey] = p.[ProductKey]
JOIN [dbo].[DimProductSubCategory] AS u ON p.[ProductSubcategoryKey] = u.
[ProductSubcategoryKey]
JOIN [dbo].[DimProductCategory] AS c ON u.[ProductCategoryKey] = c.
[ProductCategoryKey]
WHERE [CalendarYear] = 2004
GROUP BY
[EnglishProductCategoryName]
, [CalendarYear]
) AS fis
ON [acs].[EnglishProductCategoryName] = [fis].[EnglishProductCategoryName]
AND [acs].[CalendarYear] = [fis].[CalendarYear];

SQL Data Warehouse doesn't support ANSI joins in the FROM clause of an UPDATE statement, so you can't use
the previous example without modifying it.
You can use a combination of a CTAS and an implicit join to replace the previous example:

-- Create an interim table


CREATE TABLE CTAS_acs
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT ISNULL(CAST([EnglishProductCategoryName] AS NVARCHAR(50)),0) AS [EnglishProductCategoryName]
, ISNULL(CAST([CalendarYear] AS SMALLINT),0) AS [CalendarYear]
, ISNULL(CAST(SUM([SalesAmount]) AS MONEY),0) AS [TotalSalesAmount]
FROM [dbo].[FactInternetSales] AS s
JOIN [dbo].[DimDate] AS d ON s.[OrderDateKey] = d.[DateKey]
JOIN [dbo].[DimProduct] AS p ON s.[ProductKey] = p.[ProductKey]
JOIN [dbo].[DimProductSubCategory] AS u ON p.[ProductSubcategoryKey] = u.[ProductSubcategoryKey]
JOIN [dbo].[DimProductCategory] AS c ON u.[ProductCategoryKey] = c.[ProductCategoryKey]
WHERE [CalendarYear] = 2004
GROUP BY [EnglishProductCategoryName]
, [CalendarYear];

-- Use an implicit join to perform the update


UPDATE AnnualCategorySales
SET AnnualCategorySales.TotalSalesAmount = CTAS_ACS.TotalSalesAmount
FROM CTAS_acs
WHERE CTAS_acs.[EnglishProductCategoryName] = AnnualCategorySales.[EnglishProductCategoryName]
AND CTAS_acs.[CalendarYear] = AnnualCategorySales.[CalendarYear] ;

--Drop the interim table


DROP TABLE CTAS_acs;

ANSI join replacement for delete statements


Sometimes the best approach for deleting data is to use CTAS, especially for DELETE statements that use ANSI
join syntax. This is because SQL Data Warehouse doesn't support ANSI joins in the FROM clause of a DELETE
statement. Rather than deleting the data, select the data you want to keep.
The following is an example of a converted DELETE statement:
CREATE TABLE dbo.DimProduct_upsert
WITH
( Distribution=HASH(ProductKey)
, CLUSTERED INDEX (ProductKey)
)
AS -- Select Data you want to keep
SELECT p.ProductKey
, p.EnglishProductName
, p.Color
FROM dbo.DimProduct p
RIGHT JOIN dbo.stg_DimProduct s
ON p.ProductKey = s.ProductKey;

RENAME OBJECT dbo.DimProduct TO DimProduct_old;


RENAME OBJECT dbo.DimProduct_upsert TO DimProduct;

Replace merge statements


You can replace merge statements, at least in part, by using CTAS. You can combine the INSERT and the UPDATE
into a single statement. Any deleted records should be excluded from the SELECT statement so that they are omitted from the results.
The following example is for an UPSERT :

CREATE TABLE dbo.[DimProduct_upsert]


WITH
( DISTRIBUTION = HASH([ProductKey])
, CLUSTERED INDEX ([ProductKey])
)
AS
-- New rows and new versions of rows
SELECT s.[ProductKey]
, s.[EnglishProductName]
, s.[Color]
FROM dbo.[stg_DimProduct] AS s
UNION ALL
-- Keep rows that are not being touched
SELECT p.[ProductKey]
, p.[EnglishProductName]
, p.[Color]
FROM dbo.[DimProduct] AS p
WHERE NOT EXISTS
( SELECT *
FROM [dbo].[stg_DimProduct] s
WHERE s.[ProductKey] = p.[ProductKey]
);

RENAME OBJECT dbo.[DimProduct] TO [DimProduct_old];


RENAME OBJECT dbo.[DimProduct_upsert] TO [DimProduct];

Explicitly state data type and nullability of output


When migrating code, you might find you run across this type of coding pattern:
DECLARE @d decimal(7,2) = 85.455
, @f float(24) = 85.455

CREATE TABLE result


(result DECIMAL(7,2) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN)

INSERT INTO result


SELECT @d*@f;

You might think you should migrate this code to CTAS, and you'd be correct. However, there's a hidden issue
here.
The following code doesn't yield the same result:

DECLARE @d decimal(7,2) = 85.455


, @f float(24) = 85.455;

CREATE TABLE ctas_r


WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT @d*@f as result;

Notice that the column "result" carries forward the data type and nullability values of the expression. Carrying the
data type forward can lead to subtle variances in values if you aren't careful.
Try this example:

SELECT result,result*@d
from result;

SELECT result,result*@d
from ctas_r;

The value stored for result is different. As the persisted value in the result column is used in other expressions, the
error becomes even more significant.

This is important for data migrations. Even though the second query is arguably more accurate, there's a
problem. The data would be different compared to the source system, and that leads to questions of integrity in
the migration. This is one of those rare cases where the "wrong" answer is actually the right one!
The reason we see a disparity between the two results is due to implicit type casting. In the first example, the table
defines the column definition. When the row is inserted, an implicit type conversion occurs. In the second
example, there is no implicit type conversion as the expression defines the data type of the column.
Notice also that the column in the second example has been defined as a NULLable column, whereas in the first
example it has not. When the table was created in the first example, column nullability was explicitly defined. In
the second example, it was left to the expression, and by default would result in a NULL definition.
To resolve these issues, you must explicitly set the type conversion and nullability in the SELECT portion of the
CTAS statement. You can't set these properties in 'CREATE TABLE'. The following example demonstrates how to
fix the code:

DECLARE @d decimal(7,2) = 85.455


, @f float(24) = 85.455

CREATE TABLE ctas_r


WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT ISNULL(CAST(@d*@f AS DECIMAL(7,2)),0) as result

Note the following:


You can use CAST or CONVERT.
Use ISNULL, not COALESCE, to force NULLability. See the following note.
ISNULL is the outermost function.
The second part of the ISNULL is a constant, 0.

NOTE
For the nullability to be correctly set, it's vital to use ISNULL and not COALESCE. COALESCE is not a deterministic function,
and so the result of the expression will always be NULLable. ISNULL is different. It's deterministic. Therefore, when the
second part of the ISNULL function is a constant or a literal, the resulting value will be NOT NULL.

Ensuring the integrity of your calculations is also important for table partition switching. Imagine you have this
table defined as a fact table:

CREATE TABLE [dbo].[Sales]


(
[date] INT NOT NULL
, [product] INT NOT NULL
, [store] INT NOT NULL
, [quantity] INT NOT NULL
, [price] MONEY NOT NULL
, [amount] MONEY NOT NULL
)
WITH
( DISTRIBUTION = HASH([product])
, PARTITION ( [date] RANGE RIGHT FOR VALUES
(20000101,20010101,20020101
,20030101,20040101,20050101
)
)
);

However, the amount field is a calculated expression. It isn't part of the source data.
To create your partitioned dataset, you might want to use the following code:
CREATE TABLE [dbo].[Sales_in]
WITH
( DISTRIBUTION = HASH([product])
, PARTITION ( [date] RANGE RIGHT FOR VALUES
(20000101,20010101
)
)
)
AS
SELECT
[date]
, [product]
, [store]
, [quantity]
, [price]
, [quantity]*[price] AS [amount]
FROM [stg].[source]
OPTION (LABEL = 'CTAS : Partition IN table : Create');

The query would run perfectly well. The problem comes when you try to do the partition switch. The table
definitions don't match. To make the table definitions match, modify the CTAS to add an ISNULL function to
preserve the column's nullability attribute.

CREATE TABLE [dbo].[Sales_in]


WITH
( DISTRIBUTION = HASH([product])
, PARTITION ( [date] RANGE RIGHT FOR VALUES
(20000101,20010101
)
)
)
AS
SELECT
[date]
, [product]
, [store]
, [quantity]
, [price]
, ISNULL(CAST([quantity]*[price] AS MONEY),0) AS [amount]
FROM [stg].[source]
OPTION (LABEL = 'CTAS : Partition IN table : Create');

You can see that type consistency and maintaining nullability properties on a CTAS is an engineering best
practice. It helps to maintain integrity in your calculations, and also ensures that partition switching is possible.
CTAS is one of the most important statements in SQL Data Warehouse. Make sure you thoroughly understand it.
See the CTAS documentation.

Next steps
For more development tips, see the development overview.
Table data types in Azure SQL Data Warehouse
7/24/2019 • 2 minutes to read • Edit Online

Recommendations for defining table data types in Azure SQL Data Warehouse.

What are the data types?


SQL Data Warehouse supports the most commonly used data types. For a list of the supported data types, see
data types in the CREATE TABLE statement.

Minimize row length


Minimizing the size of data types shortens the row length, which leads to better query performance. Use the
smallest data type that works for your data.
Avoid defining character columns with a large default length. For example, if the longest value is 25 characters, then define your column as VARCHAR(25).
Avoid using NVARCHAR when you only need VARCHAR.
When possible, use NVARCHAR(4000) or VARCHAR(8000) instead of NVARCHAR(MAX) or VARCHAR(MAX).
If you are using PolyBase external tables to load your tables, the defined length of the table row cannot exceed 1
MB. When a row with variable-length data exceeds 1 MB, you can load the row with BCP, but not with PolyBase.
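For example, the following is a sketch that applies these guidelines to a hypothetical customer dimension; the table and column names are illustrative only:

-- Size character columns to the data instead of defaulting to MAX or Unicode types
CREATE TABLE dbo.DimCustomer_narrow
(
    CustomerKey  int          NOT NULL,
    PostalCode   varchar(10)  NOT NULL,   -- instead of nvarchar(max)
    CustomerName varchar(100) NOT NULL    -- sized to the longest expected value
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);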

Identify unsupported data types


If you are migrating your database from another SQL database, you might encounter data types that are not
supported in SQL Data Warehouse. Use this query to discover unsupported data types in your existing SQL
schema.

SELECT t.[name], c.[name], c.[system_type_id], c.[user_type_id], y.[is_user_defined], y.[name]
FROM sys.tables t
JOIN sys.columns c on t.[object_id] = c.[object_id]
JOIN sys.types y on c.[user_type_id] = y.[user_type_id]
WHERE y.[name] IN ('geography','geometry','hierarchyid','image','text','ntext','sql_variant','xml')
AND y.[is_user_defined] = 1;

Workarounds for unsupported data types


The following list shows the data types that SQL Data Warehouse does not support and gives alternatives that
you can use instead of the unsupported data types.

UNSUPPORTED DATA TYPE | WORKAROUND
geometry | varbinary
geography | varbinary
hierarchyid | nvarchar(4000)
image | varbinary
text | varchar
ntext | nvarchar
sql_variant | Split column into several strongly typed columns.
table | Convert to temporary tables.
timestamp | Rework code to use datetime2 and the CURRENT_TIMESTAMP function. Only constants are supported as defaults, therefore current_timestamp cannot be defined as a default constraint. If you need to migrate row version values from a timestamp typed column, then use BINARY(8) or VARBINARY(8) for NOT NULL or NULL row version values.
xml | varchar
user-defined type | Convert back to the native data type when possible.
default values | Default values support literals and constants only.

Next steps
For more information on developing tables, see Table Overview.
Guidance for designing distributed tables in Azure
SQL Data Warehouse
7/24/2019 • 9 minutes to read • Edit Online

Recommendations for designing hash-distributed and round-robin distributed tables in Azure SQL Data
Warehouse.
This article assumes you are familiar with data distribution and data movement concepts in SQL Data
Warehouse. For more information, see Azure SQL Data Warehouse - Massively Parallel Processing (MPP )
architecture.

What is a distributed table?


A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows
are distributed with a hash or round-robin algorithm.
Hash-distributed tables improve query performance on large fact tables, and are the focus of this article.
Round-robin tables are useful for improving loading speed. These design choices have a significant impact on
improving query and loading performance.
Another table storage option is to replicate a small table across all the Compute nodes. For more information,
see Design guidance for replicated tables. To quickly choose among the three options, see Distributed tables in
the tables overview.
As part of table design, understand as much as possible about your data and how the data is queried. For
example, consider these questions:
How large is the table?
How often is the table refreshed?
Do I have fact and dimension tables in a data warehouse?
Hash distributed
A hash-distributed table distributes table rows across the Compute nodes by using a deterministic hash function
to assign each row to one distribution.

Since identical values always hash to the same distribution, the data warehouse has built-in knowledge of the
row locations. SQL Data Warehouse uses this knowledge to minimize data movement during queries, which
improves query performance.
Hash-distributed tables work well for large fact tables in a star schema. They can have very large numbers of
rows and still achieve high performance. There are, of course, some design considerations that help you to get
the performance the distributed system is designed to provide. Choosing a good distribution column is one such
consideration that is described in this article.
Consider using a hash-distributed table when:
The table size on disk is more than 2 GB.
The table has frequent insert, update, and delete operations.
Round-robin distributed
A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to
distributions is random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned
to the same distribution.
As a result, the system sometimes needs to invoke a data movement operation to better organize your data
before it can resolve a query. This extra step can slow down your queries. For example, joining a round-robin
table usually requires reshuffling the rows, which is a performance hit.
Consider using the round-robin distribution for your table in the following scenarios:
When getting started as a simple starting point since it is the default
If there is no obvious joining key
If there is no good candidate column for hash distributing the table
If the table does not share a common join key with other tables
If the join is less significant than other joins in the query
When the table is a temporary staging table
The tutorial Load New York taxicab data to Azure SQL Data Warehouse gives an example of loading data into a
round-robin staging table.
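As a minimal sketch (hypothetical schema and column names), a round-robin staging table is often paired with a heap so that loads run as fast as possible:

CREATE TABLE [stg].[FactSales_Staging]
(
    [SaleDateKey]  int    NOT NULL
,   [ProductKey]   int    NOT NULL
,   [Quantity]     int    NOT NULL
,   [Amount]       money  NOT NULL
)
WITH
(   HEAP
,   DISTRIBUTION = ROUND_ROBIN
);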

Choosing a distribution column


A hash-distributed table has a distribution column that is the hash key. For example, the following code creates a
hash-distributed table with ProductKey as the distribution column.

CREATE TABLE [dbo].[FactInternetSales]


( [ProductKey] int NOT NULL
, [OrderDateKey] int NOT NULL
, [CustomerKey] int NOT NULL
, [PromotionKey] int NOT NULL
, [SalesOrderNumber] nvarchar(20) NOT NULL
, [OrderQuantity] smallint NOT NULL
, [UnitPrice] money NOT NULL
, [SalesAmount] money NOT NULL
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([ProductKey])
)
;

Choosing a distribution column is an important design decision since the values in this column determine how
the rows are distributed. The best choice depends on several factors, and usually involves tradeoffs. However, if
you don't choose the best column the first time, you can use CREATE TABLE AS SELECT (CTAS ) to re-create the
table with a different distribution column.
Choose a distribution column that does not require updates
You cannot update a distribution column unless you delete the row and insert a new row with the updated
values. Therefore, select a column with static values.
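If a distribution column value does need to change, the change has to be expressed as a delete followed by an insert. The following minimal sketch uses the FactInternetSales table from the earlier example with hypothetical key values:

BEGIN TRANSACTION;

DELETE FROM [dbo].[FactInternetSales]
WHERE  [ProductKey] = 100
  AND  [SalesOrderNumber] = N'SO43659';

INSERT INTO [dbo].[FactInternetSales]
    ([ProductKey], [OrderDateKey], [CustomerKey], [PromotionKey], [SalesOrderNumber], [OrderQuantity], [UnitPrice], [SalesAmount])
VALUES
    (200, 20190101, 1, 1, N'SO43659', 1, 10.00, 10.00);

COMMIT;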
Choose a distribution column with data that distributes evenly
For best performance, all of the distributions should have approximately the same number of rows. When one or
more distributions have a disproportionate number of rows, some distributions finish their portion of a parallel
query before others. Since the query can't complete until all distributions have finished processing, each query is
only as fast as the slowest distribution.
Data skew means the data is not distributed evenly across the distributions
Processing skew means that some distributions take longer than others when running parallel queries. This
can happen when the data is skewed.
To balance the parallel processing, select a distribution column that:
Has many unique values. The column can have some duplicate values. However, all rows with the same
value are assigned to the same distribution. Since there are 60 distributions, the column should have at least
60 unique values. Usually the number of unique values is much greater.
Does not have NULLs, or has only a few NULLs. For an extreme example, if all values in the column are
NULL, all the rows are assigned to the same distribution. As a result, query processing is skewed to one
distribution, and does not benefit from parallel processing.
Is not a date column. All data for the same date lands in the same distribution. If several users are all
filtering on the same date, then only 1 of the 60 distributions does all the processing work.
Choose a distribution column that minimizes data movement
To get the correct query result, queries might move data from one Compute node to another. Data movement
commonly happens when queries have joins and aggregations on distributed tables. Choosing a distribution
column that helps minimize data movement is one of the most important strategies for optimizing performance
of your SQL Data Warehouse.
To minimize data movement, select a distribution column that:
Is used in JOIN , GROUP BY , DISTINCT , OVER , and HAVING clauses. When two large fact tables have frequent
joins, query performance improves when you distribute both tables on one of the join columns. When a table
is not used in joins, consider distributing the table on a column that is frequently in the GROUP BY clause.
Is not used in WHERE clauses. This could narrow the query to not run on all the distributions.
Is not a date column. WHERE clauses often filter by date. When this happens, all the processing could run on
only a few distributions.
What to do when none of the columns are a good distribution column
If none of your columns have enough distinct values for a distribution column, you can create a new column as a
composite of one or more values. To avoid data movement during query execution, use the composite
distribution column as a join column in queries.
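As a minimal sketch (hypothetical table and column names), a composite distribution column can be built with CTAS by concatenating the values that together make rows distinct enough to distribute evenly:

CREATE TABLE [dbo].[FactSales_Composite]
WITH
(   CLUSTERED COLUMNSTORE INDEX
,   DISTRIBUTION = HASH([DistributionKey])
)
AS
SELECT  CAST([StoreKey] AS varchar(10)) + '|' + CAST([ProductKey] AS varchar(10)) AS [DistributionKey]
,       *
FROM    [dbo].[FactSales]
;

Queries should then join on [DistributionKey] so that the join remains local to each distribution.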
Once you design a hash-distributed table, the next step is to load data into the table. For loading guidance, see
Loading overview.

How to tell if your distribution column is a good choice


After data is loaded into a hash-distributed table, check to see how evenly the rows are distributed across the 60
distributions. The rows per distribution can vary up to 10% without a noticeable impact on performance.
Determine if the table has data skew
A quick way to check for data skew is to use DBCC PDW_SHOWSPACEUSED. The following SQL code returns
the number of table rows that are stored in each of the 60 distributions. For balanced performance, the rows in
your distributed table should be spread evenly across all the distributions.

-- Find data skew for a distributed table


DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');

To identify which tables have more than 10% data skew:


1. Create the view dbo.vTableSizes that is shown in the Tables overview article.
2. Run the following query:

select *
from dbo.vTableSizes
where two_part_name in
(
select two_part_name
from dbo.vTableSizes
where row_count > 0
group by two_part_name
having (max(row_count * 1.000) - min(row_count * 1.000))/max(row_count * 1.000) >= .10
)
order by two_part_name, row_count
;

Check query plans for data movement


A good distribution column enables joins and aggregations to have minimal data movement. This affects the way
joins should be written. To get minimal data movement for a join on two hash-distributed tables, one of the join
columns needs to be the distribution column. When two hash-distributed tables join on a distribution column of
the same data type, the join does not require data movement. Joins can use additional columns without incurring
data movement.
To avoid data movement during a join:
The tables involved in the join must be hash distributed on one of the columns participating in the join.
The data types of the join columns must match between both tables.
The columns must be joined with an equals operator.
The join type may not be a CROSS JOIN .
To see if queries are experiencing data movement, you can look at the query plan.
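For example, prefixing a query with EXPLAIN returns the distributed query plan as XML without running the query; move operations such as SHUFFLE_MOVE or BROADCAST_MOVE in the plan indicate data movement. A minimal sketch:

EXPLAIN
SELECT  s.[ProductKey]
,       SUM(s.[SalesAmount]) AS [TotalSales]
FROM    [dbo].[FactInternetSales] AS s
GROUP BY s.[ProductKey]
;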

Resolve a distribution column problem


It is not necessary to resolve all cases of data skew. Distributing data is a matter of finding the right balance
between minimizing data skew and data movement. It is not always possible to minimize both data skew and
data movement. Sometimes the benefit of having the minimal data movement might outweigh the impact of
having data skew.
To decide if you should resolve data skew in a table, you should understand as much as possible about the data
volumes and queries in your workload. You can use the steps in the Query monitoring article to monitor the
impact of skew on query performance. Specifically, look for how long it takes large queries to complete on
individual distributions.
Since you cannot change the distribution column on an existing table, the typical way to resolve data skew is to
re-create the table with a different distribution column.
Re -create the table with a new distribution column
This example uses CREATE TABLE AS SELECT to re-create a table with a different hash distribution column.
CREATE TABLE [dbo].[FactInternetSales_CustomerKey]
WITH ( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([CustomerKey])
, PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES ( 20000101, 20010101, 20020101, 20030101
, 20040101, 20050101, 20060101, 20070101
, 20080101, 20090101, 20100101, 20110101
, 20120101, 20130101, 20140101, 20150101
, 20160101, 20170101, 20180101, 20190101
, 20200101, 20210101, 20220101, 20230101
, 20240101, 20250101, 20260101, 20270101
, 20280101, 20290101
)
)
)
AS
SELECT *
FROM [dbo].[FactInternetSales]
OPTION (LABEL = 'CTAS : FactInternetSales_CustomerKey')
;

--Create statistics on new table


CREATE STATISTICS [ProductKey] ON [FactInternetSales_CustomerKey] ([ProductKey]);
CREATE STATISTICS [OrderDateKey] ON [FactInternetSales_CustomerKey] ([OrderDateKey]);
CREATE STATISTICS [CustomerKey] ON [FactInternetSales_CustomerKey] ([CustomerKey]);
CREATE STATISTICS [PromotionKey] ON [FactInternetSales_CustomerKey] ([PromotionKey]);
CREATE STATISTICS [SalesOrderNumber] ON [FactInternetSales_CustomerKey] ([SalesOrderNumber]);
CREATE STATISTICS [OrderQuantity] ON [FactInternetSales_CustomerKey] ([OrderQuantity]);
CREATE STATISTICS [UnitPrice] ON [FactInternetSales_CustomerKey] ([UnitPrice]);
CREATE STATISTICS [SalesAmount] ON [FactInternetSales_CustomerKey] ([SalesAmount]);

--Rename the tables


RENAME OBJECT [dbo].[FactInternetSales] TO [FactInternetSales_ProductKey];
RENAME OBJECT [dbo].[FactInternetSales_CustomerKey] TO [FactInternetSales];

Next steps
To create a distributed table, use one of these statements:
CREATE TABLE (Azure SQL Data Warehouse)
CREATE TABLE AS SELECT (Azure SQL Data Warehouse
Primary key, foreign key, and unique key in Azure
SQL Data Warehouse
9/26/2019 • 3 minutes to read • Edit Online

Learn about table constraints in Azure SQL Data Warehouse, including primary key, foreign key, and unique key.

Table constraints
Azure SQL Data Warehouse supports these table constraints:
PRIMARY KEY is only supported when NONCLUSTERED and NOT ENFORCED are both used.
UNIQUE constraint is only supported when NOT ENFORCED is used.
FOREIGN KEY constraint is not supported in Azure SQL Data Warehouse.

Remarks
Having a primary key and/or unique key allows the data warehouse engine to generate an optimal execution plan for a
query. All values in a primary key column or a unique constraint column should be unique.
After creating a table with a primary key or unique constraint in Azure SQL Data Warehouse, users need to make sure all
values in those columns are unique. A violation may cause the query to return inaccurate results. The following
example shows how a query may return an inaccurate result if the primary key or unique constraint column includes
duplicate values.

-- Create table t1
CREATE TABLE t1 (a1 INT NOT NULL, b1 INT) WITH (DISTRIBUTION = ROUND_ROBIN)

-- Insert values to table t1 with duplicate values in column a1.


INSERT INTO t1 VALUES (1, 100)
INSERT INTO t1 VALUES (1, 1000)
INSERT INTO t1 VALUES (2, 200)
INSERT INTO t1 VALUES (3, 300)
INSERT INTO t1 VALUES (4, 400)

-- Run this query. No primary key or unique constraint. 4 rows returned. Correct result.
SELECT a1, COUNT(*) AS total FROM t1 GROUP BY a1

/*
a1 total
----------- -----------
1 2
2 1
3 1
4 1

(4 rows affected)
*/

-- Add unique constraint


ALTER TABLE t1 ADD CONSTRAINT unique_t1_a1 unique (a1) NOT ENFORCED

-- Re-run this query. 5 rows returned. Incorrect result.


SELECT a1, count(*) AS total FROM t1 GROUP BY a1

/*
a1 total
----------- -----------
2 1
4 1
1 1
3 1
1 1

(5 rows affected)
*/

-- Drop unique constraint.


ALTER TABLE t1 DROP CONSTRAINT unique_t1_a1

-- Add primary key constraint


ALTER TABLE t1 add CONSTRAINT PK_t1_a1 PRIMARY KEY NONCLUSTERED (a1) NOT ENFORCED

-- Re-run this query. 5 rows returned. Incorrect result.


SELECT a1, COUNT(*) AS total FROM t1 GROUP BY a1

/*
a1 total
----------- -----------
2 1
4 1
1 1
3 1
1 1

(5 rows affected)
*/

-- Manually fix the duplicate values in a1


UPDATE t1 SET a1 = 0 WHERE b1 = 1000

-- Verify no duplicate values in column a1


SELECT * FROM t1

/*
a1 b1
----------- -----------
2 200
3 300
4 400
0 1000
1 100

(5 rows affected)
*/

-- Add unique constraint


ALTER TABLE t1 add CONSTRAINT unique_t1_a1 UNIQUE (a1) NOT ENFORCED

-- Re-run this query. 5 rows returned. Correct result.


SELECT a1, COUNT(*) as total FROM t1 GROUP BY a1

/*
a1 total
----------- -----------
2 1
3 1
4 1
0 1
1 1

(5 rows affected)
*/

-- Drop unique constraint.


ALTER TABLE t1 DROP CONSTRAINT unique_t1_a1

-- Add primary key constraint


ALTER TABLE t1 ADD CONSTRAINT PK_t1_a1 PRIMARY KEY NONCLUSTERED (a1) NOT ENFORCED

-- Re-run this query. 5 rows returned. Correct result.


SELECT a1, COUNT(*) AS total FROM t1 GROUP BY a1

/*
a1 total
----------- -----------
2 1
3 1
4 1
0 1
1 1

(5 rows affected)
*/

Examples
Create a data warehouse table with a primary key:

CREATE TABLE mytable (c1 INT PRIMARY KEY NONCLUSTERED NOT ENFORCED, c2 INT);

Create a data warehouse table with a unique constraint:

CREATE TABLE t6 (c1 INT UNIQUE NOT ENFORCED, c2 INT);
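Because these constraints are not enforced, it can help to verify that the candidate column is actually unique before, and periodically after, adding the constraint. A minimal sketch against the table above:

SELECT  c1
,       COUNT(*) AS duplicate_count
FROM    dbo.t6
GROUP BY c1
HAVING  COUNT(*) > 1;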

Next steps
After creating the tables for your data warehouse, the next step is to load data into the table. For a loading tutorial,
see Loading data to SQL Data Warehouse.
Indexing tables in SQL Data Warehouse
7/24/2019 • 14 minutes to read • Edit Online

Recommendations and examples for indexing tables in Azure SQL Data Warehouse.

Index types
SQL Data Warehouse offers several indexing options including clustered columnstore indexes, clustered indexes
and nonclustered indexes, and a non-index option also known as heap.
To create a table with an index, see the CREATE TABLE (Azure SQL Data Warehouse) documentation.

Clustered columnstore indexes


By default, SQL Data Warehouse creates a clustered columnstore index when no index options are specified on
a table. Clustered columnstore tables offer both the highest level of data compression as well as the best overall
query performance. Clustered columnstore tables will generally outperform clustered index or heap tables and
are usually the best choice for large tables. For these reasons, clustered columnstore is the best place to start
when you are unsure of how to index your table.
To create a clustered columnstore table, simply specify CLUSTERED COLUMNSTORE INDEX in the WITH
clause, or leave the WITH clause off:

CREATE TABLE myTable


(
id int NOT NULL,
lastName varchar(20),
zipCode varchar(6)
)
WITH ( CLUSTERED COLUMNSTORE INDEX );

There are a few scenarios where clustered columnstore may not be a good option:
Columnstore tables do not support varchar(max), nvarchar(max) and varbinary(max). Consider heap or
clustered index instead.
Columnstore tables may be less efficient for transient data. Consider heap and perhaps even temporary
tables.
Small tables with less than 60 million rows. Consider heap tables.

Heap tables
When you are temporarily landing data in SQL Data Warehouse, you may find that using a heap table makes
the overall process faster. This is because loads to heaps are faster than loads to indexed tables, and in some cases the
subsequent read can be done from cache. If you are loading data only to stage it before running more
transformations, loading the data to a heap table is much faster than loading it to a clustered columnstore
table. In addition, loading data to a temporary table is faster than loading a table to permanent storage.
For small lookup tables with less than 60 million rows, heap tables often make sense. Clustered columnstore tables
begin to achieve optimal compression once there are more than 60 million rows.
To create a heap table, simply specify HEAP in the WITH clause:
CREATE TABLE myTable
(
id int NOT NULL,
lastName varchar(20),
zipCode varchar(6)
)
WITH ( HEAP );

Clustered and nonclustered indexes


Clustered indexes may outperform clustered columnstore tables when a single row needs to be quickly
retrieved. For queries where a lookup of a single row, or of very few rows, must perform with extreme speed,
consider a clustered index or nonclustered secondary index. The disadvantage of using a clustered index is that
the only queries that benefit are the ones that use a highly selective filter on the clustered index column. To improve
filtering on other columns, a nonclustered index can be added to those columns. However, each index that is
added to a table adds both space and processing time to loads.
To create a clustered index table, simply specify CLUSTERED INDEX in the WITH clause:

CREATE TABLE myTable


(
id int NOT NULL,
lastName varchar(20),
zipCode varchar(6)
)
WITH ( CLUSTERED INDEX (id) );

To add a non-clustered index on a table, use the following syntax:

CREATE INDEX zipCodeIndex ON myTable (zipCode);

Optimizing clustered columnstore indexes


Clustered columnstore tables organize data into segments. Having high segment quality is critical to
achieving optimal query performance on a columnstore table. Segment quality can be measured by the number
of rows in a compressed row group. Segment quality is most optimal when there are at least 100K rows per
compressed row group, and performance improves as the number of rows per row group approaches 1,048,576
rows, which is the most rows a row group can contain.
The view below can be created and used on your system to compute the average rows per row group and
identify any suboptimal clustered columnstore indexes. The last column of this view generates a SQL statement
that can be used to rebuild your indexes.
CREATE VIEW dbo.vColumnstoreDensity
AS
SELECT
GETDATE() AS [execution_date]
, DB_Name() AS [database_name]
, s.name AS [schema_name]
, t.name AS [table_name]
, COUNT(DISTINCT rg.[partition_number]) AS [table_partition_count]
, SUM(rg.[total_rows]) AS [row_count_total]
, SUM(rg.[total_rows])/COUNT(DISTINCT rg.[distribution_id]) AS
[row_count_per_distribution_MAX]
, CEILING ((SUM(rg.[total_rows])*1.0/COUNT(DISTINCT rg.[distribution_id]))/1048576) AS
[rowgroup_per_distribution_MAX]
, SUM(CASE WHEN rg.[State] = 0 THEN 1 ELSE 0 END) AS
[INVISIBLE_rowgroup_count]
, SUM(CASE WHEN rg.[State] = 0 THEN rg.[total_rows] ELSE 0 END) AS [INVISIBLE_rowgroup_rows]
, MIN(CASE WHEN rg.[State] = 0 THEN rg.[total_rows] ELSE NULL END) AS
[INVISIBLE_rowgroup_rows_MIN]
, MAX(CASE WHEN rg.[State] = 0 THEN rg.[total_rows] ELSE NULL END) AS
[INVISIBLE_rowgroup_rows_MAX]
, AVG(CASE WHEN rg.[State] = 0 THEN rg.[total_rows] ELSE NULL END) AS
[INVISIBLE_rowgroup_rows_AVG]
, SUM(CASE WHEN rg.[State] = 1 THEN 1 ELSE 0 END) AS [OPEN_rowgroup_count]
, SUM(CASE WHEN rg.[State] = 1 THEN rg.[total_rows] ELSE 0 END) AS [OPEN_rowgroup_rows]
, MIN(CASE WHEN rg.[State] = 1 THEN rg.[total_rows] ELSE NULL END) AS [OPEN_rowgroup_rows_MIN]
, MAX(CASE WHEN rg.[State] = 1 THEN rg.[total_rows] ELSE NULL END) AS [OPEN_rowgroup_rows_MAX]
, AVG(CASE WHEN rg.[State] = 1 THEN rg.[total_rows] ELSE NULL END) AS [OPEN_rowgroup_rows_AVG]
, SUM(CASE WHEN rg.[State] = 2 THEN 1 ELSE 0 END) AS [CLOSED_rowgroup_count]
, SUM(CASE WHEN rg.[State] = 2 THEN rg.[total_rows] ELSE 0 END) AS [CLOSED_rowgroup_rows]
, MIN(CASE WHEN rg.[State] = 2 THEN rg.[total_rows] ELSE NULL END) AS
[CLOSED_rowgroup_rows_MIN]
, MAX(CASE WHEN rg.[State] = 2 THEN rg.[total_rows] ELSE NULL END) AS
[CLOSED_rowgroup_rows_MAX]
, AVG(CASE WHEN rg.[State] = 2 THEN rg.[total_rows] ELSE NULL END) AS
[CLOSED_rowgroup_rows_AVG]
, SUM(CASE WHEN rg.[State] = 3 THEN 1 ELSE 0 END) AS
[COMPRESSED_rowgroup_count]
, SUM(CASE WHEN rg.[State] = 3 THEN rg.[total_rows] ELSE 0 END) AS
[COMPRESSED_rowgroup_rows]
, SUM(CASE WHEN rg.[State] = 3 THEN rg.[deleted_rows] ELSE 0 END) AS
[COMPRESSED_rowgroup_rows_DELETED]
, MIN(CASE WHEN rg.[State] = 3 THEN rg.[total_rows] ELSE NULL END) AS
[COMPRESSED_rowgroup_rows_MIN]
, MAX(CASE WHEN rg.[State] = 3 THEN rg.[total_rows] ELSE NULL END) AS
[COMPRESSED_rowgroup_rows_MAX]
, AVG(CASE WHEN rg.[State] = 3 THEN rg.[total_rows] ELSE NULL END) AS
[COMPRESSED_rowgroup_rows_AVG]
, 'ALTER INDEX ALL ON ' + s.name + '.' + t.NAME + ' REBUILD;' AS [Rebuild_Index_SQL]
FROM sys.[pdw_nodes_column_store_row_groups] rg
JOIN sys.[pdw_nodes_tables] nt ON rg.[object_id] = nt.[object_id]
AND rg.[pdw_node_id] = nt.[pdw_node_id]
AND rg.[distribution_id] = nt.[distribution_id]
JOIN sys.[pdw_table_mappings] mp ON nt.[name] = mp.[physical_name]
JOIN sys.[tables] t ON mp.[object_id] = t.[object_id]
JOIN sys.[schemas] s ON t.[schema_id] = s.[schema_id]
GROUP BY
s.[name]
, t.[name]
;

Now that you have created the view, run this query to identify tables with row groups with less than 100K rows.
Of course, you may want to increase the threshold of 100K if you are looking for more optimal segment quality.
SELECT *
FROM [dbo].[vColumnstoreDensity]
WHERE COMPRESSED_rowgroup_rows_AVG < 100000
OR INVISIBLE_rowgroup_rows_AVG < 100000

Once you have run the query you can begin to look at the data and analyze your results. This table explains what
to look for in your row group analysis.

COLUMN                                  HOW TO USE THIS DATA

[table_partition_count]                 If the table is partitioned, then you may expect to see higher Open row group
                                        counts. Each partition in the distribution could in theory have an open row
                                        group associated with it. Factor this into your analysis. A small table that
                                        has been partitioned could be optimized by removing the partitioning
                                        altogether as this would improve compression.

[row_count_total]                       Total row count for the table. For example, you can use this value to
                                        calculate the percentage of rows in the compressed state.

[row_count_per_distribution_MAX]        If all rows are evenly distributed, this value would be the target number of
                                        rows per distribution. Compare this value with the compressed_rowgroup_count.

[COMPRESSED_rowgroup_rows]              Total number of rows in columnstore format for the table.

[COMPRESSED_rowgroup_rows_AVG]          If the average number of rows is significantly less than the maximum number
                                        of rows for a row group, then consider using CTAS or ALTER INDEX REBUILD to
                                        recompress the data.

[COMPRESSED_rowgroup_count]             Number of row groups in columnstore format. If this number is very high in
                                        relation to the table, it is an indicator that the columnstore density is low.

[COMPRESSED_rowgroup_rows_DELETED]      Rows are logically deleted in columnstore format. If the number is high
                                        relative to table size, consider recreating the partition or rebuilding the
                                        index as this removes them physically.

[COMPRESSED_rowgroup_rows_MIN]          Use this in conjunction with the AVG and MAX columns to understand the range
                                        of values for the row groups in your columnstore. A low number over the load
                                        threshold (102,400 per partition-aligned distribution) suggests that
                                        optimizations are available in the data load.

[COMPRESSED_rowgroup_rows_MAX]          As above.

[OPEN_rowgroup_count]                   Open row groups are normal. One would reasonably expect one OPEN row group
                                        per table distribution (60). Excessive numbers suggest data loading across
                                        partitions. Double check the partitioning strategy to make sure it is sound.

[OPEN_rowgroup_rows]                    Each row group can have a maximum of 1,048,576 rows in it. Use this value to
                                        see how full the open row groups are currently.

[OPEN_rowgroup_rows_MIN]                Open groups indicate that data is either being trickle loaded into the table
                                        or that the previous load spilled over remaining rows into this row group.
                                        Use the MIN, MAX, and AVG columns to see how much data is sitting in OPEN row
                                        groups. For small tables it could be 100% of all the data! In that case, run
                                        ALTER INDEX REBUILD to force the data to columnstore.

[OPEN_rowgroup_rows_MAX]                As above.

[OPEN_rowgroup_rows_AVG]                As above.

[CLOSED_rowgroup_rows]                  Look at the closed row group rows as a sanity check.

[CLOSED_rowgroup_count]                 The number of closed row groups should be low if any are seen at all. Closed
                                        row groups can be converted to compressed row groups using the ALTER INDEX
                                        ... REORGANIZE command. However, this is not normally required. Closed groups
                                        are automatically converted to columnstore row groups by the background
                                        "tuple mover" process.

[CLOSED_rowgroup_rows_MIN]              Closed row groups should have a very high fill rate. If the fill rate for a
                                        closed row group is low, then further analysis of the columnstore is required.

[CLOSED_rowgroup_rows_MAX]              As above.

[CLOSED_rowgroup_rows_AVG]              As above.

[Rebuild_Index_SQL]                     SQL to rebuild the columnstore index for a table.

Causes of poor columnstore index quality


If you have identified tables with poor segment quality, you will want to identify the root cause. Below are some
common causes of poor segment quality:
1. Memory pressure when index was built
2. High volume of DML operations
3. Small or trickle load operations
4. Too many partitions
These factors can cause a columnstore index to have significantly less than the optimal 1 million rows per row
group. They can also cause rows to go to the delta row group instead of a compressed row group.
Memory pressure when index was built
The number of rows per compressed row group is directly related to the width of the row and the amount of
memory available to process the row group. When rows are written to columnstore tables under memory
pressure, columnstore segment quality may suffer. Therefore, the best practice is to give the session which is
writing to your columnstore index tables access to as much memory as possible. Since there is a trade-off
between memory and concurrency, the guidance on the right memory allocation depends on the data in each
row of your table, the data warehouse units allocated to your system, and the number of concurrency slots you
can give to the session which is writing data to your table.
High volume of DML operations
A high volume of DML operations that update and delete rows can introduce inefficiency into the columnstore.
This is especially true when the majority of the rows in a row group are modified.
Deleting a row from a compressed row group only logically marks the row as deleted. The row remains in
the compressed row group until the partition or table is rebuilt.
Inserting a row adds the row to an internal rowstore table called a delta row group. The inserted row is not
converted to columnstore until the delta row group is full and is marked as closed. Row groups are closed
once they reach the maximum capacity of 1,048,576 rows.
Updating a row in columnstore format is processed as a logical delete and then an insert. The inserted row
may be stored in the delta store.
Batched update and insert operations that exceed the bulk threshold of 102,400 rows per partition-aligned
distribution go directly to the columnstore format. However, assuming an even distribution, you would need to
be modifying more than 6.144 million rows in a single operation for this to occur. If the number of rows for a
given partition-aligned distribution is less than 102,400 then the rows go to the delta store and stay there until
sufficient rows have been inserted or modified to close the row group or the index has been rebuilt.
Small or trickle load operations
Small loads that flow into SQL Data Warehouse are also sometimes known as trickle loads. They typically
represent a near constant stream of data being ingested by the system. However, as this stream is near
continuous the volume of rows is not particularly large. More often than not the data is significantly under the
threshold required for a direct load to columnstore format.
In these situations, it is often better to land the data first in Azure blob storage and let it accumulate prior to
loading. This technique is often known as micro-batching.
Too many partitions
Another thing to consider is the impact of partitioning on your clustered columnstore tables. Before partitioning,
SQL Data Warehouse already divides your data into 60 databases. Partitioning further divides your data. If you
partition your data, then consider that each partition needs at least 1 million rows to benefit from a clustered
columnstore index. If you partition your table into 100 partitions, then your table needs at least 6 billion rows to
benefit from a clustered columnstore index (60 distributions × 100 partitions × 1 million rows). If your 100-partition
table does not have 6 billion rows, either reduce the number of partitions or consider using a heap table instead.
Once your tables have been loaded with some data, follow the below steps to identify and rebuild tables with
sub-optimal clustered columnstore indexes.

Rebuilding indexes to improve segment quality


Step 1: Identify or create a user that uses the right resource class
One quick way to immediately improve segment quality is to rebuild the index. The SQL returned by the above
view returns an ALTER INDEX REBUILD statement which can be used to rebuild your indexes. When rebuilding
your indexes, be sure that you allocate enough memory to the session that rebuilds your index. To do this,
increase the resource class of a user which has permissions to rebuild the index on this table to the
recommended minimum.
Below is an example of how to allocate more memory to a user by increasing their resource class. To work with
resource classes, see Resource classes for workload management.

EXEC sp_addrolemember 'xlargerc', 'LoadUser'

Step 2: Rebuild clustered columnstore indexes with higher resource class user
Sign in as the user from step 1 (e.g. LoadUser), which is now using a higher resource class, and execute the
ALTER INDEX statements. Be sure that this user has ALTER permission to the tables where the index is being
rebuilt. These examples show how to rebuild the entire columnstore index or how to rebuild a single partition.
On large tables, it is more practical to rebuild indexes a single partition at a time.
Alternatively, instead of rebuilding the index, you could copy the table to a new table using CTAS. Which way is
best? For large volumes of data, CTAS is usually faster than ALTER INDEX. For smaller volumes of data, ALTER
INDEX is easier to use and won't require you to swap out the table.

-- Rebuild the entire clustered index


ALTER INDEX ALL ON [dbo].[DimProduct] REBUILD

-- Rebuild a single partition


ALTER INDEX ALL ON [dbo].[FactInternetSales] REBUILD Partition = 5

-- Rebuild a single partition with archival compression


ALTER INDEX ALL ON [dbo].[FactInternetSales] REBUILD Partition = 5 WITH (DATA_COMPRESSION =
COLUMNSTORE_ARCHIVE)

-- Rebuild a single partition with columnstore compression


ALTER INDEX ALL ON [dbo].[FactInternetSales] REBUILD Partition = 5 WITH (DATA_COMPRESSION = COLUMNSTORE)

Rebuilding an index in SQL Data Warehouse is an offline operation. For more information about rebuilding
indexes, see the ALTER INDEX REBUILD section in Columnstore Indexes Defragmentation, and ALTER INDEX.
Step 3: Verify clustered columnstore segment quality has improved
Rerun the query which identified tables with poor segment quality and verify segment quality has improved. If
segment quality did not improve, it could be that the rows in your table are extra wide. Consider using a higher
resource class or DWU when rebuilding your indexes.

Rebuilding indexes with CTAS and partition switching


This example uses the CREATE TABLE AS SELECT (CTAS ) statement and partition switching to rebuild a table
partition.

-- Step 1: Select the partition of data and write it out to a new table using CTAS
CREATE TABLE [dbo].[FactInternetSales_20000101_20010101]
WITH ( DISTRIBUTION = HASH([ProductKey])
, CLUSTERED COLUMNSTORE INDEX
, PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES
(20000101,20010101
)
)
)
AS
SELECT *
FROM [dbo].[FactInternetSales]
WHERE [OrderDateKey] >= 20000101
AND [OrderDateKey] < 20010101
;

-- Step 2: Switch IN the rebuilt data with TRUNCATE_TARGET option


ALTER TABLE [dbo].[FactInternetSales_20000101_20010101] SWITCH PARTITION 2 TO [dbo].[FactInternetSales]
PARTITION 2 WITH (TRUNCATE_TARGET = ON);
For more details about re-creating partitions using CTAS, see Using partitions in SQL Data Warehouse.

Next steps
For more information about developing tables, see Developing tables.
Using IDENTITY to create surrogate keys in Azure
SQL Data Warehouse
7/26/2019 • 5 minutes to read • Edit Online

Recommendations and examples for using the IDENTITY property to create surrogate keys on tables in Azure
SQL Data Warehouse.

What is a surrogate key


A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the
table data. Data modelers like to create surrogate keys on their tables when they design data warehouse models.
You can use the IDENTITY property to achieve this goal simply and effectively without affecting load performance.

Creating a table with an IDENTITY column


The IDENTITY property is designed to scale out across all the distributions in the data warehouse without affecting
load performance. Therefore, the implementation of IDENTITY is oriented toward achieving these goals.
You can define a table as having the IDENTITY property when you first create the table by using syntax that is
similar to the following statement:

CREATE TABLE dbo.T1


( C1 INT IDENTITY(1,1) NOT NULL
, C2 INT NULL
)
WITH
( DISTRIBUTION = HASH(C2)
, CLUSTERED COLUMNSTORE INDEX
)
;

You can then use INSERT..SELECT to populate the table.


The remainder of this section highlights the nuances of the implementation to help you understand them more
fully.
Allocation of values
The IDENTITY property doesn't guarantee the order in which the surrogate values are allocated, which reflects the
behavior of SQL Server and Azure SQL Database. However, in Azure SQL Data Warehouse, the absence of a
guarantee is more pronounced.
The following example is an illustration:
CREATE TABLE dbo.T1
( C1 INT IDENTITY(1,1) NOT NULL
, C2 VARCHAR(30) NULL
)
WITH
( DISTRIBUTION = HASH(C2)
, CLUSTERED COLUMNSTORE INDEX
)
;

INSERT INTO dbo.T1


VALUES (NULL);

INSERT INTO dbo.T1


VALUES (NULL);

SELECT *
FROM dbo.T1;

DBCC PDW_SHOWSPACEUSED('dbo.T1');

In the preceding example, two rows landed in distribution 1. The first row has the surrogate value of 1 in column
C1 , and the second row has the surrogate value of 61. Both of these values were generated by the IDENTITY
property. However, the allocation of the values is not contiguous. This behavior is by design.
Skewed data
The range of values for the data type is spread evenly across the distributions. If a distributed table suffers from
skewed data, then the range of values available to the datatype can be exhausted prematurely. For example, if all
the data ends up in a single distribution, then effectively the table has access to only one-sixtieth of the values of
the data type. For this reason, the IDENTITY property is limited to INT and BIGINT data types only.
SELECT..INTO
When an existing IDENTITY column is selected into a new table, the new column inherits the IDENTITY property,
unless one of the following conditions is true:
The SELECT statement contains a join.
Multiple SELECT statements are joined by using UNION.
The IDENTITY column is listed more than one time in the SELECT list.
The IDENTITY column is part of an expression.
If any one of these conditions is true, the column is created NOT NULL instead of inheriting the IDENTITY
property.
CREATE TABLE AS SELECT
CREATE TABLE AS SELECT (CTAS ) follows the same SQL Server behavior that's documented for SELECT..INTO.
However, you can't specify an IDENTITY property in the column definition of the CREATE TABLE part of the
statement. You also can't use the IDENTITY function in the SELECT part of the CTAS. To populate a table, you need
to use CREATE TABLE to define the table followed by INSERT..SELECT to populate it.

Explicitly inserting values into an IDENTITY column


SQL Data Warehouse supports SET IDENTITY_INSERT <your table> ON|OFF syntax. You can use this syntax to
explicitly insert values into the IDENTITY column.
Many data modelers like to use predefined negative values for certain rows in their dimensions. An example is the
-1 or "unknown member" row.
The next script shows how to explicitly add this row by using SET IDENTITY_INSERT:
SET IDENTITY_INSERT dbo.T1 ON;

INSERT INTO dbo.T1


( C1
, C2
)
VALUES (-1,'UNKNOWN')
;

SET IDENTITY_INSERT dbo.T1 OFF;

SELECT *
FROM dbo.T1
;

Loading data
The presence of the IDENTITY property has some implications to your data-loading code. This section highlights
some basic patterns for loading data into tables by using IDENTITY.
To load data into a table and generate a surrogate key by using IDENTITY, create the table and then use
INSERT..SELECT or INSERT..VALUES to perform the load.
The following example highlights the basic pattern:

--CREATE TABLE with IDENTITY


CREATE TABLE dbo.T1
( C1 INT IDENTITY(1,1)
, C2 VARCHAR(30)
)
WITH
( DISTRIBUTION = HASH(C2)
, CLUSTERED COLUMNSTORE INDEX
)
;

--Use INSERT..SELECT to populate the table from an external table


INSERT INTO dbo.T1
(C2)
SELECT C2
FROM ext.T1
;

SELECT *
FROM dbo.T1
;

DBCC PDW_SHOWSPACEUSED('dbo.T1');

NOTE
It's not possible to use CREATE TABLE AS SELECT currently when loading data into a table with an IDENTITY column.

For more information on loading data, see Designing Extract, Load, and Transform (ELT) for Azure SQL Data
Warehouse and Loading best practices.

System views
You can use the sys.identity_columns catalog view to identify a column that has the IDENTITY property.
To help you better understand the database schema, this example shows how to integrate sys.identity_columns with
other system catalog views:

SELECT sm.name
, tb.name
, co.name
, CASE WHEN ic.column_id IS NOT NULL
THEN 1
ELSE 0
END AS is_identity
FROM sys.schemas AS sm
JOIN sys.tables AS tb ON sm.schema_id = tb.schema_id
JOIN sys.columns AS co ON tb.object_id = co.object_id
LEFT JOIN sys.identity_columns AS ic ON co.object_id = ic.object_id
AND co.column_id = ic.column_id
WHERE sm.name = 'dbo'
AND tb.name = 'T1'
;

Limitations
The IDENTITY property can't be used:
When the column data type is not INT or BIGINT
When the column is also the distribution key
When the table is an external table
The following related functions are not supported in SQL Data Warehouse:
IDENTITY ()
@@IDENTITY
SCOPE_IDENTITY
IDENT_CURRENT
IDENT_INCR
IDENT_SEED

Common tasks
This section provides some sample code you can use to perform common tasks when you work with IDENTITY
columns.
Column C1 is the IDENTITY in all the following tasks.
Find the highest allocated value for a table
Use the MAX() function to determine the highest value allocated for a distributed table:

SELECT MAX(C1)
FROM dbo.T1

Find the seed and increment for the IDENTITY property


You can use the catalog views to discover the identity increment and seed configuration values for a table by using
the following query:
SELECT sm.name
, tb.name
, co.name
, ic.seed_value
, ic.increment_value
FROM sys.schemas AS sm
JOIN sys.tables AS tb ON sm.schema_id = tb.schema_id
JOIN sys.columns AS co ON tb.object_id = co.object_id
JOIN sys.identity_columns AS ic ON co.object_id = ic.object_id
AND co.column_id = ic.column_id
WHERE sm.name = 'dbo'
AND tb.name = 'T1'
;

Next steps
Table overview
CREATE TABLE (Transact-SQL ) IDENTITY (Property)
DBCC CHECKIDENT
Partitioning tables in SQL Data Warehouse
7/24/2019 • 10 minutes to read • Edit Online

Recommendations and examples for using table partitions in Azure SQL Data Warehouse.

What are table partitions?


Table partitions enable you to divide your data into smaller groups of data. In most cases, table partitions are
created on a date column. Partitioning is supported on all SQL Data Warehouse table types; including clustered
columnstore, clustered index, and heap. Partitioning is also supported on all distribution types, including both
hash or round robin distributed.
Partitioning can benefit data maintenance and query performance. Whether it benefits both or just one is
dependent on how data is loaded and whether the same column can be used for both purposes, since
partitioning can only be done on one column.
Benefits to loads
The primary benefit of partitioning in SQL Data Warehouse is to improve the efficiency and performance of
loading data by use of partition deletion, switching and merging. In most cases data is partitioned on a date
column that is closely tied to the order in which the data is loaded into the database. One of the greatest benefits
of using partitions to maintain data is the avoidance of transaction logging. While simply inserting, updating, or
deleting data can be the most straightforward approach, with a little thought and effort, using partitioning
during your load process can substantially improve performance.
Partition switching can be used to quickly remove or replace a section of a table. For example, a sales fact table
might contain just data for the past 36 months. At the end of every month, the oldest month of sales data is
deleted from the table. This data could be deleted by using a delete statement to delete the data for the oldest
month. However, deleting a large amount of data row -by-row with a delete statement can take too much time, as
well as create the risk of large transactions that take a long time to rollback if something goes wrong. A more
optimal approach is to drop the oldest partition of data. Where deleting the individual rows could take hours,
deleting an entire partition could take seconds.
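As a minimal sketch (hypothetical table names), the oldest partition is typically switched out to an empty table that has an identical definition and matching partition boundaries, and that table is then dropped:

-- Switch the oldest month out of the fact table into an empty, identically partitioned archive table
ALTER TABLE [dbo].[FactSales] SWITCH PARTITION 2 TO [dbo].[FactSales_Archive] PARTITION 2;

-- Remove the data by dropping (or truncating) the archive table
DROP TABLE [dbo].[FactSales_Archive];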
Benefits to queries
Partitioning can also be used to improve query performance. A query that applies a filter to partitioned data can
limit the scan to only the qualifying partitions. This method of filtering can avoid a full table scan and only scan a
smaller subset of data. With the introduction of clustered columnstore indexes, the predicate elimination
performance benefits are less beneficial, but in some cases there can be a benefit to queries. For example, if the
sales fact table is partitioned into 36 months using the sales date field, then queries that filter on the sale date
can skip searching in partitions that don’t match the filter.
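For example, with the monthly partitioning described above, a query like the following minimal sketch only needs to scan the partitions that cover the filtered date range:

SELECT  SUM([SalesAmount]) AS [TotalSales]
FROM    [dbo].[FactInternetSales]
WHERE   [OrderDateKey] >= 20190101
  AND   [OrderDateKey] <  20190201;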

Sizing partitions
While partitioning can be used to improve performance in some scenarios, creating a table with too many
partitions can hurt performance under some circumstances. These concerns are especially true for clustered
columnstore tables. For partitioning to be helpful, it is important to understand when to use partitioning and the
number of partitions to create. There is no hard and fast rule as to how many partitions are too many; it depends on
your data and how many partitions you are loading simultaneously. A successful partitioning scheme usually has
tens to hundreds of partitions, not thousands.
When creating partitions on clustered columnstore tables, it is important to consider how many rows belong
to each partition. For optimal compression and performance of clustered columnstore tables, a minimum of 1
million rows per distribution and partition is needed. Before partitions are created, SQL Data Warehouse already
divides each table into 60 distributed databases. Any partitioning added to a table is in addition to the
distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly
partitions, and given that SQL Data Warehouse has 60 distributions, then the sales fact table should contain 60
million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the
recommended minimum number of rows per partition, consider using fewer partitions in order to increase the
number of rows per partition. For more information, see the Indexing article, which includes queries that can
assess the quality of cluster columnstore indexes.
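The arithmetic behind this sizing guidance can be sketched as follows, assuming the 36 monthly partitions from the example above:

-- 60 distributions x 36 partitions x 1 million rows per row group
SELECT 60 * 36 * 1000000 AS [min_rows_for_optimal_compression];   -- 2,160,000,000 (~2.1 billion)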

Syntax differences from SQL Server


SQL Data Warehouse introduces a way to define partitions that is simpler than SQL Server. Partitioning
functions and schemes are not used in SQL Data Warehouse as they are in SQL Server. Instead, all you need to
do is identify the partitioned column and the boundary points. While the syntax of partitioning may be slightly
different from SQL Server, the basic concepts are the same. SQL Server and SQL Data Warehouse support one
partition column per table, which can be a ranged partition. To learn more about partitioning, see Partitioned
Tables and Indexes.
The following example uses the CREATE TABLE statement to partition the FactInternetSales table on the
OrderDateKey column:

CREATE TABLE [dbo].[FactInternetSales]


(
[ProductKey] int NOT NULL
, [OrderDateKey] int NOT NULL
, [CustomerKey] int NOT NULL
, [PromotionKey] int NOT NULL
, [SalesOrderNumber] nvarchar(20) NOT NULL
, [OrderQuantity] smallint NOT NULL
, [UnitPrice] money NOT NULL
, [SalesAmount] money NOT NULL
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([ProductKey])
, PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES
(20000101,20010101,20020101
,20030101,20040101,20050101
)
)
)
;

Migrating partitioning from SQL Server


To migrate SQL Server partition definitions to SQL Data Warehouse simply:
Eliminate the SQL Server partition scheme.
Add the partition function definition to your CREATE TABLE.
If you are migrating a partitioned table from a SQL Server instance, the following SQL can help you to figure
out the number of rows in each partition. Keep in mind that if the same partitioning granularity is used on
SQL Data Warehouse, the number of rows per partition decreases by a factor of 60.
-- Partition information for a SQL Server Database
SELECT s.[name] AS [schema_name]
, t.[name] AS [table_name]
, i.[name] AS [index_name]
, p.[partition_number] AS [partition_number]
, SUM(a.[used_pages]*8.0) AS [partition_size_kb]
, SUM(a.[used_pages]*8.0)/1024 AS [partition_size_mb]
, SUM(a.[used_pages]*8.0)/1048576 AS [partition_size_gb]
, p.[rows] AS [partition_row_count]
, rv.[value] AS [partition_boundary_value]
, p.[data_compression_desc] AS [partition_compression_desc]
FROM sys.schemas s
JOIN sys.tables t ON t.[schema_id] = s.[schema_id]
JOIN sys.partitions p ON p.[object_id] = t.[object_id]
JOIN sys.allocation_units a ON a.[container_id] = p.[partition_id]
JOIN sys.indexes i ON i.[object_id] = p.[object_id]
AND i.[index_id] = p.[index_id]
JOIN sys.data_spaces ds ON ds.[data_space_id] = i.[data_space_id]
LEFT JOIN sys.partition_schemes ps ON ps.[data_space_id] = ds.[data_space_id]
LEFT JOIN sys.partition_functions pf ON pf.[function_id] = ps.[function_id]
LEFT JOIN sys.partition_range_values rv ON rv.[function_id] = pf.[function_id]
AND rv.[boundary_id] = p.[partition_number]
WHERE p.[index_id] <=1
GROUP BY s.[name]
, t.[name]
, i.[name]
, p.[partition_number]
, p.[rows]
, rv.[value]
, p.[data_compression_desc]
;

Partition switching
SQL Data Warehouse supports partition splitting, merging, and switching. Each of these functions is executed
using the ALTER TABLE statement.
To switch partitions between two tables, you must ensure that the partitions align on their respective boundaries
and that the table definitions match. As check constraints are not available to enforce the range of values in a
table, the source table must contain the same partition boundaries as the target table. If the partition boundaries
are not the same, the partition switch will fail as the partition metadata will not be synchronized.
How to split a partition that contains data
The most efficient method to split a partition that already contains data is to use a CTAS statement. If the
partitioned table is a clustered columnstore, then the table partition must be empty before it can be split.
The following example creates a partitioned columnstore table. It inserts one row into each partition:
CREATE TABLE [dbo].[FactInternetSales]
(
[ProductKey] int NOT NULL
, [OrderDateKey] int NOT NULL
, [CustomerKey] int NOT NULL
, [PromotionKey] int NOT NULL
, [SalesOrderNumber] nvarchar(20) NOT NULL
, [OrderQuantity] smallint NOT NULL
, [UnitPrice] money NOT NULL
, [SalesAmount] money NOT NULL
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([ProductKey])
, PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES
(20000101
)
)
)
;

INSERT INTO dbo.FactInternetSales


VALUES (1,19990101,1,1,1,1,1,1);
INSERT INTO dbo.FactInternetSales
VALUES (1,20000101,1,1,1,1,1,1);

The following query finds the row count by using the sys.partitions catalog view:

SELECT QUOTENAME(s.[name])+'.'+QUOTENAME(t.[name]) as Table_name


, i.[name] as Index_name
, p.partition_number as Partition_nmbr
, p.[rows] as Row_count
, p.[data_compression_desc] as Data_Compression_desc
FROM sys.partitions p
JOIN sys.tables t ON p.[object_id] = t.[object_id]
JOIN sys.schemas s ON t.[schema_id] = s.[schema_id]
JOIN sys.indexes i ON p.[object_id] = i.[object_Id]
AND p.[index_Id] = i.[index_Id]
WHERE t.[name] = 'FactInternetSales'
;

The following split command receives an error message:

ALTER TABLE FactInternetSales SPLIT RANGE (20010101);

Msg 35346, Level 15, State 1, Line 44 SPLIT clause of ALTER PARTITION statement failed because the partition
is not empty. Only empty partitions can be split in when a columnstore index exists on the table. Consider
disabling the columnstore index before issuing the ALTER PARTITION statement, then rebuilding the
columnstore index after ALTER PARTITION is complete.
However, you can use CTAS to create a new table to hold the data.
CREATE TABLE dbo.FactInternetSales_20000101
WITH ( DISTRIBUTION = HASH(ProductKey)
, CLUSTERED COLUMNSTORE INDEX
, PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES
(20000101
)
)
)
AS
SELECT *
FROM FactInternetSales
WHERE 1=2
;

As the partition boundaries are aligned, a switch is permitted. This will leave the source table with an empty
partition that you can subsequently split.

ALTER TABLE FactInternetSales SWITCH PARTITION 2 TO FactInternetSales_20000101 PARTITION 2;

ALTER TABLE FactInternetSales SPLIT RANGE (20010101);

All that is left is to align the data to the new partition boundaries using CTAS , and then switch the data back into
the main table.

CREATE TABLE [dbo].[FactInternetSales_20000101_20010101]


WITH ( DISTRIBUTION = HASH([ProductKey])
, CLUSTERED COLUMNSTORE INDEX
, PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES
(20000101,20010101
)
)
)
AS
SELECT *
FROM [dbo].[FactInternetSales_20000101]
WHERE [OrderDateKey] >= 20000101
AND [OrderDateKey] < 20010101
;

ALTER TABLE dbo.FactInternetSales_20000101_20010101 SWITCH PARTITION 2 TO dbo.FactInternetSales PARTITION 2;

Once you have completed the movement of the data, it is a good idea to refresh the statistics on the target table.
Updating statistics ensures the statistics accurately reflect the new distribution of the data in their respective
partitions.

UPDATE STATISTICS [dbo].[FactInternetSales];

Load new data into partitions that contain data in one step
Loading data into partitions with partition switching is a convenient way to stage new data in a table that is not
visible to users, and then switch in the new data. On busy systems, it can be challenging to deal with the locking
contention associated with partition switching. To clear out the existing data in a partition, an ALTER TABLE used
to be required to switch out the data. Then another ALTER TABLE was required to switch in the new data. In SQL
Data Warehouse, the TRUNCATE_TARGET option is supported in the ALTER TABLE command. With TRUNCATE_TARGET
the ALTER TABLE command overwrites existing data in the partition with new data. Below is an example which
uses CTAS to create a new table with the existing data, inserts new data, then switches all the data back into the
target table, overwriting the existing data.
CREATE TABLE [dbo].[FactInternetSales_NewSales]
WITH ( DISTRIBUTION = HASH([ProductKey])
, CLUSTERED COLUMNSTORE INDEX
, PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES
(20000101,20010101
)
)
)
AS
SELECT *
FROM [dbo].[FactInternetSales]
WHERE [OrderDateKey] >= 20000101
AND [OrderDateKey] < 20010101
;

INSERT INTO dbo.FactInternetSales_NewSales


VALUES (1,20000101,2,2,2,2,2,2);

ALTER TABLE dbo.FactInternetSales_NewSales SWITCH PARTITION 2 TO dbo.FactInternetSales PARTITION 2 WITH


(TRUNCATE_TARGET = ON);

Table partitioning source control


To keep your table definition from going stale in your source control system, you may want to consider the
following approach:
1. Create the table as a partitioned table but with no partition values

CREATE TABLE [dbo].[FactInternetSales]


(
[ProductKey] int NOT NULL
, [OrderDateKey] int NOT NULL
, [CustomerKey] int NOT NULL
, [PromotionKey] int NOT NULL
, [SalesOrderNumber] nvarchar(20) NOT NULL
, [OrderQuantity] smallint NOT NULL
, [UnitPrice] money NOT NULL
, [SalesAmount] money NOT NULL
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([ProductKey])
, PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES () )
)
;

2. SPLIT the table as part of the deployment process:


-- Create a table containing the partition boundaries

CREATE TABLE #partitions


WITH
(
LOCATION = USER_DB
, DISTRIBUTION = HASH(ptn_no)
)
AS
SELECT ptn_no
, ROW_NUMBER() OVER (ORDER BY (ptn_no)) as seq_no
FROM (
SELECT CAST(20000101 AS INT) ptn_no
UNION ALL
SELECT CAST(20010101 AS INT)
UNION ALL
SELECT CAST(20020101 AS INT)
UNION ALL
SELECT CAST(20030101 AS INT)
UNION ALL
SELECT CAST(20040101 AS INT)
) a
;

-- Iterate over the partition boundaries and split the table

DECLARE @c INT = (SELECT COUNT(*) FROM #partitions)


, @i INT = 1 --iterator for while loop
, @q NVARCHAR(4000) --query
, @p NVARCHAR(20) = N'' --partition_number
, @s NVARCHAR(128) = N'dbo' --schema
, @t NVARCHAR(128) = N'FactInternetSales' --table
;

WHILE @i <= @c
BEGIN
SET @p = (SELECT ptn_no FROM #partitions WHERE seq_no = @i);
SET @q = (SELECT N'ALTER TABLE '+@s+N'.'+@t+N' SPLIT RANGE ('+@p+N');');

-- PRINT @q;
EXECUTE sp_executesql @q;
SET @i+=1;
END

-- Code clean-up

DROP TABLE #partitions;

With this approach the code in source control remains static and the partitioning boundary values are allowed to
be dynamic; evolving with the warehouse over time.

Next steps
For more information about developing tables, see the articles on Table Overview.
Design guidance for using replicated tables in Azure
SQL Data Warehouse
7/24/2019 • 7 minutes to read • Edit Online

This article gives recommendations for designing replicated tables in your SQL Data Warehouse schema. Use
these recommendations to improve query performance by reducing data movement and query complexity.

Prerequisites
This article assumes you are familiar with data distribution and data movement concepts in SQL Data
Warehouse. For more information, see the architecture article.
As part of table design, understand as much as possible about your data and how the data is queried. For
example, consider these questions:
How large is the table?
How often is the table refreshed?
Do I have fact and dimension tables in a data warehouse?

What is a replicated table?


A replicated table has a full copy of the table accessible on each Compute node. Replicating a table removes the
need to transfer data among Compute nodes before a join or aggregation. Since the table has multiple copies,
replicated tables work best when the table size is less than 2 GB compressed. 2 GB is not a hard limit. If the data is
static and does not change, you can replicate larger tables.
The following diagram shows a replicated table that is accessible on each Compute node. In SQL Data
Warehouse, the replicated table is fully copied to a distribution database on each Compute node.

Replicated tables work well for dimension tables in a star schema. Dimension tables are typically joined to fact
tables which are distributed differently than the dimension table. Dimensions are usually of a size that makes it
feasible to store and maintain multiple copies. Dimensions store descriptive data that changes slowly, such as
customer name and address, and product details. The slowly changing nature of the data leads to less
maintenance of the replicated table.
Consider using a replicated table when:
The table size on disk is less than 2 GB, regardless of the number of rows. To find the size of a table, you can
use the DBCC PDW_SHOWSPACEUSED command: DBCC PDW_SHOWSPACEUSED('ReplTableCandidate') .
The table is used in joins that would otherwise require data movement. When joining tables that are not
distributed on the same column, such as a hash-distributed table to a round-robin table, data movement is
required to complete the query. If one of the tables is small, consider a replicated table. We recommend using
replicated tables instead of round-robin tables in most cases. To view data movement operations in query
plans, use sys.dm_pdw_request_steps. The BroadcastMoveOperation is the typical data movement operation
that can be eliminated by using a replicated table.
Replicated tables may not yield the best query performance when:
The table has frequent insert, update, and delete operations. These data manipulation language (DML )
operations require a rebuild of the replicated table. Rebuilding frequently can cause slower performance.
The data warehouse is scaled frequently. Scaling a data warehouse changes the number of Compute nodes,
which incurs rebuilding the replicated table.
The table has a large number of columns, but data operations typically access only a small number of columns.
In this scenario, instead of replicating the entire table, it might be more effective to distribute the table, and
then create an index on the frequently accessed columns. When a query requires data movement, SQL Data
Warehouse only moves data for the requested columns.

Use replicated tables with simple query predicates


Before you choose to distribute or replicate a table, think about the types of queries you plan to run against the
table. Whenever possible,
Use replicated tables for queries with simple query predicates, such as equality or inequality.
Use distributed tables for queries with complex query predicates, such as LIKE or NOT LIKE.
CPU -intensive queries perform best when the work is distributed across all of the Compute nodes. For example,
queries that run computations on each row of a table perform better on distributed tables than replicated tables.
Since a replicated table is stored in full on each Compute node, a CPU -intensive query against a replicated table
runs against the entire table on every Compute node. The extra computation can slow query performance.
For example, this query has a complex predicate. It runs faster when the data is in a distributed table instead of a
replicated table. In this example, the data can be round-robin distributed.

SELECT EnglishProductName
FROM DimProduct
WHERE EnglishDescription LIKE '%frame%comfortable%'

Convert existing round-robin tables to replicated tables


If you already have round-robin tables, we recommend converting them to replicated tables if they meet the
criteria outlined in this article. Replicated tables improve performance over round-robin tables because they
eliminate the need for data movement. A round-robin table always requires data movement for joins.
This example uses CTAS to change the DimSalesTerritory table to a replicated table. This example works
regardless of whether DimSalesTerritory is hash-distributed or round-robin.
CREATE TABLE [dbo].[DimSalesTerritory_REPLICATE]
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = REPLICATE
)
AS SELECT * FROM [dbo].[DimSalesTerritory]
OPTION (LABEL = 'CTAS : DimSalesTerritory_REPLICATE')

-- Switch table names


RENAME OBJECT [dbo].[DimSalesTerritory] to [DimSalesTerritory_old];
RENAME OBJECT [dbo].[DimSalesTerritory_REPLICATE] TO [DimSalesTerritory];

DROP TABLE [dbo].[DimSalesTerritory_old];

Query performance example for round-robin versus replicated


A replicated table does not require any data movement for joins because the entire table is already present on
each Compute node. If the dimension tables are round-robin distributed, a join copies the dimension table in full
to each Compute node. To move the data, the query plan contains an operation called BroadcastMoveOperation.
This type of data movement operation slows query performance and is eliminated by using replicated tables. To
view query plan steps, use the sys.dm_pdw_request_steps system catalog view.
For example, in the following query against the AdventureWorks schema, the FactInternetSales table is hash-
distributed. The DimDate and DimSalesTerritory tables are smaller dimension tables. This query returns the total
sales in North America for fiscal year 2004:

SELECT [TotalSalesAmount] = SUM(SalesAmount)
FROM dbo.FactInternetSales s
INNER JOIN dbo.DimDate d
ON d.DateKey = s.OrderDateKey
INNER JOIN dbo.DimSalesTerritory t
ON t.SalesTerritoryKey = s.SalesTerritoryKey
WHERE d.FiscalYear = 2004
AND t.SalesTerritoryGroup = 'North America'

We re-created DimDate and DimSalesTerritory as round-robin tables. As a result, the query plan contained
multiple broadcast move operations.

We re-created DimDate and DimSalesTerritory as replicated tables, and ran the query again. The resulting query
plan is much shorter and does not have any broadcast moves.
Performance considerations for modifying replicated tables
SQL Data Warehouse implements a replicated table by maintaining a master version of the table. It copies the
master version to one distribution database on each Compute node. When there is a change, SQL Data
Warehouse first updates the master table. Then it rebuilds the tables on each Compute node. A rebuild of a
replicated table includes copying the table to each Compute node and then building the indexes. For example, a
replicated table on a DW400 has 5 copies of the data: a master copy, plus a full copy on each Compute node. All
data is stored in distribution databases. SQL Data Warehouse uses this model to support faster data modification
statements and flexible scaling operations.
Rebuilds are required after:
Data is loaded or modified
The data warehouse is scaled to a different level
Table definition is updated
Rebuilds are not required after:
Pause operation
Resume operation
The rebuild does not happen immediately after data is modified. Instead, the rebuild is triggered the first time a
query selects from the table. The query that triggered the rebuild reads immediately from the master version of
the table while the data is asynchronously copied to each Compute node. Until the data copy is complete,
subsequent queries will continue to use the master version of the table. If any activity happens against the
replicated table that forces another rebuild, the data copy is invalidated and the next select statement will trigger
data to be copied again.
Use indexes conservatively
Standard indexing practices apply to replicated tables. SQL Data Warehouse rebuilds each replicated table index
as part of the rebuild. Only use indexes when the performance gain outweighs the cost of rebuilding the indexes.
Batch data loads
When loading data into replicated tables, try to minimize rebuilds by batching loads together. Perform all the
batched loads before running select statements.
For example, this load pattern loads data from four sources and invokes four rebuilds.
Load from source 1.
Select statement triggers rebuild 1.
Load from source 2.
Select statement triggers rebuild 2.
Load from source 3.
Select statement triggers rebuild 3.
Load from source 4.
Select statement triggers rebuild 4.
For example, this load pattern loads data from four sources, but only invokes one rebuild.
Load from source 1.
Load from source 2.
Load from source 3.
Load from source 4.
Select statement triggers rebuild.
Rebuild a replicated table after a batch load
To ensure consistent query execution times, consider forcing the build of the replicated tables after a batch load.
Otherwise, the first query will still use data movement to complete the query.
This query uses the sys.pdw_replicated_table_cache_state DMV to list the replicated tables that have been
modified, but not rebuilt.

SELECT [ReplicatedTable] = t.[name]
FROM sys.tables t
JOIN sys.pdw_replicated_table_cache_state c
ON c.object_id = t.object_id
JOIN sys.pdw_table_distribution_properties p
ON p.object_id = t.object_id
WHERE c.[state] = 'NotReady'
AND p.[distribution_policy_desc] = 'REPLICATE'

To trigger a rebuild, run the following statement on each table in the preceding output.

SELECT TOP 1 * FROM [ReplicatedTable]
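
If many replicated tables are stale after a load, the rebuild trigger can be automated. The following is a minimal sketch, not part of the original guidance, that combines the DMV query above with the temporary-table and WHILE-loop pattern used elsewhere in this documentation; it issues a SELECT TOP 1 against every stale replicated table through dynamic SQL:

IF OBJECT_ID('tempdb..#stale_replicated_tables') IS NOT NULL
BEGIN
    DROP TABLE #stale_replicated_tables;
END

-- Capture the stale replicated tables with a sequence number for the loop
CREATE TABLE #stale_replicated_tables
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP )
AS
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS seq_nmbr
     , QUOTENAME(s.[name])+'.'+QUOTENAME(t.[name]) AS two_part_name
FROM sys.tables t
JOIN sys.schemas s ON t.[schema_id] = s.[schema_id]
JOIN sys.pdw_replicated_table_cache_state c ON c.object_id = t.object_id
JOIN sys.pdw_table_distribution_properties p ON p.object_id = t.object_id
WHERE c.[state] = 'NotReady'
AND p.[distribution_policy_desc] = 'REPLICATE';

DECLARE @i INT = 1
      , @t INT = (SELECT COUNT(*) FROM #stale_replicated_tables)
      , @s NVARCHAR(4000);

-- Touch each table once so the data copy to the Compute nodes starts
WHILE @i <= @t
BEGIN
    SET @s = (SELECT N'SELECT TOP 1 * FROM '+two_part_name FROM #stale_replicated_tables WHERE seq_nmbr = @i);
    EXEC sp_executesql @s;
    SET @i += 1;
END

DROP TABLE #stale_replicated_tables;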

Next steps
To create a replicated table, use one of these statements:
CREATE TABLE (Azure SQL Data Warehouse)
CREATE TABLE AS SELECT (Azure SQL Data Warehouse)
For an overview of distributed tables, see distributed tables.
Table statistics in Azure SQL Data Warehouse
7/24/2019 • 14 minutes to read • Edit Online

Recommendations and examples for creating and updating query-optimization statistics on tables in Azure SQL
Data Warehouse.

Why use statistics


The more Azure SQL Data Warehouse knows about your data, the faster it can execute queries against it. After
loading data into SQL Data Warehouse, collecting statistics on your data is one of the most important things you
can do to optimize your queries. The SQL Data Warehouse query optimizer is a cost-based optimizer. It
compares the cost of various query plans, and then chooses the plan with the lowest cost. In most cases, it
chooses the plan that will execute the fastest. For example, if the optimizer estimates that the date your query is
filtering on will return one row, it will choose one plan. If it estimates that the selected date will return 1 million
rows, it will choose a different plan.

Automatic creation of statistics


When the database AUTO_CREATE_STATISTICS option is on, SQL Data Warehouse analyzes incoming user
queries for missing statistics. If statistics are missing, the query optimizer creates statistics on individual columns
in the query predicate or join condition to improve cardinality estimates for the query plan. Automatic creation of
statistics is currently turned on by default.
You can check if your data warehouse has AUTO_CREATE_STATISTICS configured by running the following
command:

SELECT name, is_auto_create_stats_on
FROM sys.databases

If your data warehouse does not have AUTO_CREATE_STATISTICS configured, we recommend you enable this
property by running the following command:

ALTER DATABASE <yourdatawarehousename>
SET AUTO_CREATE_STATISTICS ON

These statements will trigger automatic creation of statistics:


SELECT
INSERT-SELECT
CTAS
UPDATE
DELETE
EXPLAIN when containing a join or the presence of a predicate is detected

NOTE
Automatic creation of statistics are not created on temporary or external tables.

Automatic creation of statistics is done synchronously so you may incur slightly degraded query performance if
your columns are missing statistics. The time to create statistics for a single column depends on the size of the
table. To avoid measurable performance degradation, especially in performance benchmarking, you should
ensure stats have been created first by executing the benchmark workload before profiling the system.

NOTE
The creation of stats will be logged in sys.dm_pdw_exec_requests under a different user context.

When automatic statistics are created, they will take the form: WA_Sys<8 digit column id in Hex>_<8 digit table
id in Hex>. You can view stats that have already been created by running the DBCC SHOW_STATISTICS
command:

DBCC SHOW_STATISTICS (<table_name>, <target>)

The table_name is the name of the table that contains the statistics to display. This cannot be an external table.
The target is the name of the target index, statistics, or column for which to display statistics information.
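To see which statistics objects were auto-created, you can also query the catalog views directly. The following is a minimal sketch, not from the original article, that relies on the WA_Sys naming convention described above:

SELECT sm.[name] AS [schema_name]
     , tb.[name] AS [table_name]
     , st.[name] AS [stats_name]
FROM sys.stats st
JOIN sys.tables tb ON st.[object_id] = tb.[object_id]
JOIN sys.schemas sm ON tb.[schema_id] = sm.[schema_id]
WHERE st.[name] LIKE 'WA_Sys%';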

Updating statistics
One best practice is to update statistics on date columns each day as new dates are added. Each time new rows
are loaded into the data warehouse, new load dates or transaction dates are added. These change the data
distribution and make the statistics out of date. Conversely, statistics on a country/region column in a customer
table might never need to be updated, because the distribution of values doesn't generally change. Assuming the
distribution is constant between customers, adding new rows to the table isn't going to change the data
distribution. However, if your data warehouse only contains one country/region and you bring in data from a
new country/region, resulting in data from multiple countries/regions being stored, then you need to update
statistics on the country/region column.
The following are recommendations for updating statistics:

Frequency of stats updates    Conservative: Daily. After loading or transforming your data.
Sampling                      Less than 1 billion rows, use default sampling (20 percent). With more than 1 billion rows, use sampling of two percent.

One of the first questions to ask when you're troubleshooting a query is, "Are the statistics up to date?"
This question is not one that can be answered by the age of the data. An up-to-date statistics object might be old
if there's been no material change to the underlying data. When the number of rows has changed substantially,
or there is a material change in the distribution of values for a column, then it's time to update statistics.
There is no dynamic management view to determine if data within the table has changed since the last time
statistics were updated. Knowing the age of your statistics can provide you with part of the picture. You can use
the following query to determine the last time your statistics were updated on each table.

NOTE
If there is a material change in the distribution of values for a column, you should update statistics regardless of the last
time they were updated.
SELECT
sm.[name] AS [schema_name],
tb.[name] AS [table_name],
co.[name] AS [stats_column_name],
st.[name] AS [stats_name],
STATS_DATE(st.[object_id],st.[stats_id]) AS [stats_last_updated_date]
FROM
sys.objects ob
JOIN sys.stats st
ON ob.[object_id] = st.[object_id]
JOIN sys.stats_columns sc
ON st.[stats_id] = sc.[stats_id]
AND st.[object_id] = sc.[object_id]
JOIN sys.columns co
ON sc.[column_id] = co.[column_id]
AND sc.[object_id] = co.[object_id]
JOIN sys.types ty
ON co.[user_type_id] = ty.[user_type_id]
JOIN sys.tables tb
ON co.[object_id] = tb.[object_id]
JOIN sys.schemas sm
ON tb.[schema_id] = sm.[schema_id]
WHERE
st.[user_created] = 1;

Date columns in a data warehouse, for example, usually need frequent statistics updates. Each time new rows
are loaded into the data warehouse, new load dates or transaction dates are added. These change the data
distribution and make the statistics out of date. Conversely, statistics on a gender column in a customer table
might never need to be updated. Assuming the distribution is constant between customers, adding new rows to
the table isn't going to change the data distribution. However, if your data warehouse contains only one
gender and a new requirement results in multiple genders, then you need to update statistics on the gender
column.
For more information, see general guidance for Statistics.

Implementing statistics management


It is often a good idea to extend your data-loading process to ensure that statistics are updated at the end of the
load. The data load is when tables most frequently change their size and/or their distribution of values. Therefore,
this is a logical place to implement some management processes.
The following guiding principles are provided for updating your statistics during the load process:
Ensure that each loaded table has at least one statistics object updated. This updates the table size (row count
and page count) information as part of the statistics update.
Focus on columns participating in JOIN, GROUP BY, ORDER BY, and DISTINCT clauses.
Consider updating "ascending key" columns such as transaction dates more frequently, because these values
will not be included in the statistics histogram.
Consider updating static distribution columns less frequently.
Remember, each statistic object is updated in sequence. Simply implementing
UPDATE STATISTICS <TABLE_NAME> isn't always ideal, especially for wide tables with lots of statistics objects.
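
As an illustration of these principles, a nightly load script might refresh only the statistics on the columns that the load actually changes. The following is a minimal sketch; the stats_orderdatekey statistics object is an illustrative name, not an object defined in this article:

-- Refresh only the date-column statistics invalidated by the nightly load (statistics name is illustrative)
UPDATE STATISTICS dbo.FactInternetSales ([stats_orderdatekey]);

-- Fall back to refreshing the whole table only when a broader change, such as a backfill, has occurred
-- UPDATE STATISTICS dbo.FactInternetSales;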

For more information, see Cardinality Estimation.

Examples: Create statistics


These examples show how to use various options for creating statistics. The options that you use for each
column depend on the characteristics of your data and how the column will be used in queries.
Create single-column statistics with default options
To create statistics on a column, simply provide a name for the statistics object and the name of the column.
This syntax uses all of the default options. By default, SQL Data Warehouse samples 20 percent of the table
when it creates statistics.

CREATE STATISTICS [statistics_name] ON [schema_name].[table_name]([column_name]);

For example:

CREATE STATISTICS col1_stats ON dbo.table1 (col1);

Create single-column statistics by examining every row


The default sampling rate of 20 percent is sufficient for most situations. However, you can adjust the sampling
rate.
To sample the full table, use this syntax:

CREATE STATISTICS [statistics_name] ON [schema_name].[table_name]([column_name]) WITH FULLSCAN;

For example:

CREATE STATISTICS col1_stats ON dbo.table1 (col1) WITH FULLSCAN;

Create single-column statistics by specifying the sample size


Alternatively, you can specify the sample size as a percent:

CREATE STATISTICS col1_stats ON dbo.table1 (col1) WITH SAMPLE = 50 PERCENT;

Create single-column statistics on only some of the rows


You can also create statistics on a portion of the rows in your table. This is called a filtered statistic.
For example, you can use filtered statistics when you plan to query a specific partition of a large partitioned table.
By creating statistics on only the partition values, the accuracy of the statistics will improve, and therefore
improve query performance.
This example creates statistics on a range of values. The values can easily be defined to match the range of values
in a partition.

CREATE STATISTICS stats_col1 ON table1(col1) WHERE col1 > '20000101' AND col1 < '20001231';

NOTE
For the query optimizer to consider using filtered statistics when it chooses the distributed query plan, the query must fit
inside the definition of the statistics object. Using the previous example, the query's WHERE clause needs to specify col1
values between 20000101 and 20001231.

Create single-column statistics with all the options


You can also combine the options together. The following example creates a filtered statistics object with a
custom sample size:
CREATE STATISTICS stats_col1 ON table1 (col1) WHERE col1 > '20000101' AND col1 < '20001231' WITH SAMPLE = 50 PERCENT;

For the full reference, see CREATE STATISTICS.


Create multi-column statistics
To create a multi-column statistics object, simply use the previous examples, but specify more columns.

NOTE
The histogram, which is used to estimate the number of rows in the query result, is only available for the first column listed
in the statistics object definition.

In this example, the histogram is on product_category. Cross-column statistics are calculated on
product_category and product_sub_category:

CREATE STATISTICS stats_2cols ON table1 (product_category, product_sub_category) WHERE product_category > '2000101' AND product_category < '20001231' WITH SAMPLE = 50 PERCENT;

Because there is a correlation between product_category and product_sub_category, a multi-column statistics
object can be useful if these columns are accessed at the same time.
Create statistics on all columns in a table
One way to create statistics is to issue CREATE STATISTICS commands after creating the table:

CREATE TABLE dbo.table1


(
col1 int
, col2 int
, col3 int
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
)
;

CREATE STATISTICS stats_col1 on dbo.table1 (col1);
CREATE STATISTICS stats_col2 on dbo.table1 (col2);
CREATE STATISTICS stats_col3 on dbo.table1 (col3);

Use a stored procedure to create statistics on all columns in a database


SQL Data Warehouse does not have a system stored procedure equivalent to sp_create_stats in SQL Server. This
stored procedure creates a single column statistics object on every column of the database that doesn't already
have statistics.
The following example will help you get started with your database design. Feel free to adapt it to your needs:

CREATE PROCEDURE [dbo].[prc_sqldw_create_stats]
( @create_type tinyint -- 1 default 2 Fullscan 3 Sample
, @sample_pct tinyint
)
AS

IF @create_type IS NULL
BEGIN
SET @create_type = 1;
END;

IF @create_type NOT IN (1,2,3)
BEGIN
THROW 151000,'Invalid value for @create_type parameter. Valid range 1 (default), 2 (fullscan) or 3
(sample).',1;
END;

IF @sample_pct IS NULL
BEGIN;
SET @sample_pct = 20;
END;

IF OBJECT_ID('tempdb..#stats_ddl') IS NOT NULL
BEGIN;
DROP TABLE #stats_ddl;
END;

CREATE TABLE #stats_ddl
WITH ( DISTRIBUTION = HASH([seq_nmbr])
, LOCATION = USER_DB
)
AS
WITH T
AS
(
SELECT t.[name] AS [table_name]
, s.[name] AS [table_schema_name]
, c.[name] AS [column_name]
, c.[column_id] AS [column_id]
, t.[object_id] AS [object_id]
, ROW_NUMBER()
OVER(ORDER BY (SELECT NULL)) AS [seq_nmbr]
FROM sys.[tables] t
JOIN sys.[schemas] s ON t.[schema_id] = s.[schema_id]
JOIN sys.[columns] c ON t.[object_id] = c.[object_id]
LEFT JOIN sys.[stats_columns] l ON l.[object_id] = c.[object_id]
AND l.[column_id] = c.[column_id]
AND l.[stats_column_id] = 1
LEFT JOIN sys.[external_tables] e ON e.[object_id] = t.[object_id]
WHERE l.[object_id] IS NULL
AND e.[object_id] IS NULL -- not an external table
)
SELECT [table_schema_name]
, [table_name]
, [column_name]
, [column_id]
, [object_id]
, [seq_nmbr]
, CASE @create_type
WHEN 1
THEN CAST('CREATE STATISTICS '+QUOTENAME('stat_'+table_schema_name+ '_' + table_name +
'_'+column_name)+' ON '+QUOTENAME(table_schema_name)+'.'+QUOTENAME(table_name)+'('+QUOTENAME(column_name)+')'
AS VARCHAR(8000))
WHEN 2
THEN CAST('CREATE STATISTICS '+QUOTENAME('stat_'+table_schema_name+ '_' + table_name +
'_'+column_name)+' ON '+QUOTENAME(table_schema_name)+'.'+QUOTENAME(table_name)+'('+QUOTENAME(column_name)+')
WITH FULLSCAN' AS VARCHAR(8000))
WHEN 3
THEN CAST('CREATE STATISTICS '+QUOTENAME('stat_'+table_schema_name+ '_' + table_name +
'_'+column_name)+' ON '+QUOTENAME(table_schema_name)+'.'+QUOTENAME(table_name)+'('+QUOTENAME(column_name)+')
WITH SAMPLE '+CONVERT(varchar(4),@sample_pct)+' PERCENT' AS VARCHAR(8000))
END AS create_stat_ddl
FROM T
;

DECLARE @i INT = 1
, @t INT = (SELECT COUNT(*) FROM #stats_ddl)
, @s NVARCHAR(4000) = N''
;

WHILE @i <= @t
BEGIN
SET @s=(SELECT create_stat_ddl FROM #stats_ddl WHERE seq_nmbr = @i);

PRINT @s
EXEC sp_executesql @s
SET @i+=1;
END

DROP TABLE #stats_ddl;

To create statistics on all columns in the table using the defaults, execute the stored procedure.

EXEC [dbo].[prc_sqldw_create_stats] 1, NULL;

To create statistics on all columns in the table using a fullscan, call this procedure:

EXEC [dbo].[prc_sqldw_create_stats] 2, NULL;

To create sampled statistics on all columns in the table, enter 3 and the sample percent. This example uses a
20 percent sample rate.

EXEC [dbo].[prc_sqldw_create_stats] 3, 20;


Examples: Update statistics


To update statistics, you can:
Update one statistics object. Specify the name of the statistics object you want to update.
Update all statistics objects on a table. Specify the name of the table instead of one specific statistics object.
Update one specific statistics object
Use the following syntax to update a specific statistics object:

UPDATE STATISTICS [schema_name].[table_name]([stat_name]);

For example:

UPDATE STATISTICS [dbo].[table1] ([stats_col1]);

By updating specific statistics objects, you can minimize the time and resources required to manage statistics.
This requires some thought to choose the best statistics objects to update.
Update all statistics on a table
A simple method for updating all the statistics objects on a table is:

UPDATE STATISTICS [schema_name].[table_name];

For example:
UPDATE STATISTICS dbo.table1;

The UPDATE STATISTICS statement is easy to use. Just remember that it updates all statistics on the table, and
therefore might perform more work than is necessary. If performance is not an issue, this is the easiest and most
complete way to guarantee that statistics are up to date.

NOTE
When updating all statistics on a table, SQL Data Warehouse does a scan to sample the table for each statistics object. If
the table is large and has many columns and many statistics, it might be more efficient to update individual statistics based
on need.

For an implementation of an UPDATE STATISTICS procedure, see Temporary Tables. The implementation method
is slightly different from the preceding CREATE STATISTICS procedure, but the result is the same.
For the full syntax, see Update Statistics.

Statistics metadata
There are several system views and functions that you can use to find information about statistics. For example,
you can see if a statistics object might be out of date by using the STATS_DATE function to see when statistics were
last created or updated.
Catalog views for statistics
These system views provide information about statistics:

CATALOG VIEW DESCRIPTION

sys.columns One row for each column.

sys.objects One row for each object in the database.

sys.schemas One row for each schema in the database.

sys.stats One row for each statistics object.

sys.stats_columns One row for each column in the statistics object. Links back
to sys.columns.

sys.tables One row for each table (includes external tables).

sys.table_types One row for each data type.

System functions for statistics


These system functions are useful for working with statistics:

SYSTEM FUNCTION DESCRIPTION

STATS_DATE Date the statistics object was last updated.

DBCC SHOW_STATISTICS Summary level and detailed information about the distribution of values as understood by the statistics object.

Combine statistics columns and functions into one view
This view brings columns that relate to statistics and results from the STATS_DATE () function together.

CREATE VIEW dbo.vstats_columns
AS
SELECT
sm.[name] AS [schema_name]
, tb.[name] AS [table_name]
, st.[name] AS [stats_name]
, st.[filter_definition] AS [stats_filter_definition]
, st.[has_filter] AS [stats_is_filtered]
, STATS_DATE(st.[object_id],st.[stats_id])
AS [stats_last_updated_date]
, co.[name] AS [stats_column_name]
, ty.[name] AS [column_type]
, co.[max_length] AS [column_max_length]
, co.[precision] AS [column_precision]
, co.[scale] AS [column_scale]
, co.[is_nullable] AS [column_is_nullable]
, co.[collation_name] AS [column_collation_name]
, QUOTENAME(sm.[name])+'.'+QUOTENAME(tb.[name])
AS two_part_name
, QUOTENAME(DB_NAME())+'.'+QUOTENAME(sm.[name])+'.'+QUOTENAME(tb.[name])
AS three_part_name
FROM sys.objects AS ob
JOIN sys.stats AS st ON ob.[object_id] = st.[object_id]
JOIN sys.stats_columns AS sc ON st.[stats_id] = sc.[stats_id]
AND st.[object_id] = sc.[object_id]
JOIN sys.columns AS co ON sc.[column_id] = co.[column_id]
AND sc.[object_id] = co.[object_id]
JOIN sys.types AS ty ON co.[user_type_id] = ty.[user_type_id]
JOIN sys.tables AS tb ON co.[object_id] = tb.[object_id]
JOIN sys.schemas AS sm ON tb.[schema_id] = sm.[schema_id]
WHERE 1=1
AND st.[user_created] = 1
;

DBCC SHOW_STATISTICS() examples


DBCC SHOW_STATISTICS () shows the data held within a statistics object. This data comes in three parts:
Header
Density vector
Histogram
The header contains metadata about the statistics. The histogram displays the distribution of values in the first key column
of the statistics object. The density vector measures cross-column correlation. SQL Data Warehouse computes
cardinality estimates with any of the data in the statistics object.
Show header, density, and histogram
This simple example shows all three parts of a statistics object:

DBCC SHOW_STATISTICS([<schema_name>.<table_name>],<stats_name>)

For example:

DBCC SHOW_STATISTICS (dbo.table1, stats_col1);

Show one or more parts of DBCC SHOW_STATISTICS ()


If you're only interested in viewing specific parts, use the WITH clause and specify which parts you want to see:

DBCC SHOW_STATISTICS([<schema_name>.<table_name>],<stats_name>) WITH stat_header, histogram, density_vector

For example:

DBCC SHOW_STATISTICS (dbo.table1, stats_col1) WITH histogram, density_vector

DBCC SHOW_STATISTICS() differences


DBCC SHOW_STATISTICS () is more strictly implemented in SQL Data Warehouse compared to SQL Server:
Undocumented features are not supported.
Cannot use Stats_stream.
Cannot join results for specific subsets of statistics data. For example, STAT_HEADER JOIN
DENSITY_VECTOR.
NO_INFOMSGS cannot be set for message suppression.
Square brackets around statistics names cannot be used.
Cannot use column names to identify statistics objects.
Custom error 2767 is not supported.

Next steps
To further improve query performance, see Monitor your workload.
Temporary tables in SQL Data Warehouse
7/24/2019 • 4 minutes to read • Edit Online

This article contains essential guidance for using temporary tables and highlights the principles of session level
temporary tables. Using the information in this article can help you modularize your code, improving both
reusability and ease of maintenance of your code.

What are temporary tables?


Temporary tables are useful when processing data - especially during transformation where the intermediate
results are transient. In SQL Data Warehouse, temporary tables exist at the session level. They are only visible to
the session in which they were created and are automatically dropped when that session logs off. Temporary
tables offer a performance benefit because their results are written to local rather than remote storage.

Create a temporary table


Temporary tables are created by prefixing your table name with a # . For example:

CREATE TABLE #stats_ddl
(
[schema_name] NVARCHAR(128) NOT NULL
, [table_name] NVARCHAR(128) NOT NULL
, [stats_name] NVARCHAR(128) NOT NULL
, [stats_is_filtered] BIT NOT NULL
, [seq_nmbr] BIGINT NOT NULL
, [two_part_name] NVARCHAR(260) NOT NULL
, [three_part_name] NVARCHAR(400) NOT NULL
)
WITH
(
DISTRIBUTION = HASH([seq_nmbr])
, HEAP
)

Temporary tables can also be created with a CTAS using exactly the same approach:
CREATE TABLE #stats_ddl
WITH
(
DISTRIBUTION = HASH([seq_nmbr])
, HEAP
)
AS
(
SELECT
sm.[name] AS [schema_name]
, tb.[name] AS [table_name]
, st.[name] AS [stats_name]
, st.[has_filter] AS [stats_is_filtered]
, ROW_NUMBER()
OVER(ORDER BY (SELECT NULL)) AS [seq_nmbr]
, QUOTENAME(sm.[name])+'.'+QUOTENAME(tb.[name]) AS [two_part_name]
, QUOTENAME(DB_NAME())+'.'+QUOTENAME(sm.[name])+'.'+QUOTENAME(tb.[name]) AS [three_part_name]
FROM sys.objects AS ob
JOIN sys.stats AS st ON ob.[object_id] = st.[object_id]
JOIN sys.stats_columns AS sc ON st.[stats_id] = sc.[stats_id]
AND st.[object_id] = sc.[object_id]
JOIN sys.columns AS co ON sc.[column_id] = co.[column_id]
AND sc.[object_id] = co.[object_id]
JOIN sys.tables AS tb ON co.[object_id] = tb.[object_id]
JOIN sys.schemas AS sm ON tb.[schema_id] = sm.[schema_id]
WHERE 1=1
AND st.[user_created] = 1
GROUP BY
sm.[name]
, tb.[name]
, st.[name]
, st.[filter_definition]
, st.[has_filter]
)
;

NOTE
CTAS is a powerful command and has the added advantage of being efficient in its use of transaction log space.

Dropping temporary tables


When a new session is created, no temporary tables should exist. However, if you are calling the same stored
procedure, which creates a temporary table with the same name, you can use a simple pre-existence check with a
DROP to ensure that your CREATE TABLE statements are successful, as in the following example:

IF OBJECT_ID('tempdb..#stats_ddl') IS NOT NULL
BEGIN
DROP TABLE #stats_ddl
END

For coding consistency, it is a good practice to use this pattern for both tables and temporary tables. It is also a
good idea to use DROP TABLE to remove temporary tables when you have finished with them in your code. In
stored procedure development, it is common to see the drop commands bundled together at the end of a
procedure to ensure these objects are cleaned up.

DROP TABLE #stats_ddl


Modularizing code
Since temporary tables can be seen anywhere in a user session, this can be exploited to help you modularize your
application code. For example, the following stored procedure generates DDL to update all statistics in the
database by statistic name.

CREATE PROCEDURE [dbo].[prc_sqldw_update_stats]
( @update_type tinyint -- 1 default 2 fullscan 3 sample 4 resample
,@sample_pct tinyint
)
AS

IF @update_type NOT IN (1,2,3,4)
BEGIN;
THROW 151000,'Invalid value for @update_type parameter. Valid range 1 (default), 2 (fullscan), 3 (sample)
or 4 (resample).',1;
END;

IF @sample_pct IS NULL
BEGIN;
SET @sample_pct = 20;
END;

IF OBJECT_ID('tempdb..#stats_ddl') IS NOT NULL
BEGIN
DROP TABLE #stats_ddl
END

CREATE TABLE #stats_ddl
WITH
(
DISTRIBUTION = HASH([seq_nmbr])
)
AS
WITH t1
AS
(
SELECT
sm.[name] AS [schema_name]
, tb.[name] AS [table_name]
, st.[name] AS [stats_name]
, st.[has_filter] AS [stats_is_filtered]
, ROW_NUMBER()
OVER(ORDER BY (SELECT NULL)) AS [seq_nmbr]
, QUOTENAME(sm.[name])+'.'+QUOTENAME(tb.[name]) AS [two_part_name]
, QUOTENAME(DB_NAME())+'.'+QUOTENAME(sm.[name])+'.'+QUOTENAME(tb.[name]) AS [three_part_name]
FROM sys.objects AS ob
JOIN sys.stats AS st ON ob.[object_id] = st.[object_id]
JOIN sys.stats_columns AS sc ON st.[stats_id] = sc.[stats_id]
AND st.[object_id] = sc.[object_id]
JOIN sys.columns AS co ON sc.[column_id] = co.[column_id]
AND sc.[object_id] = co.[object_id]
JOIN sys.tables AS tb ON co.[object_id] = tb.[object_id]
JOIN sys.schemas AS sm ON tb.[schema_id] = sm.[schema_id]
WHERE 1=1
AND st.[user_created] = 1
GROUP BY
sm.[name]
, tb.[name]
, st.[name]
, st.[filter_definition]
, st.[has_filter]
)
SELECT
CASE @update_type
WHEN 1
THEN 'UPDATE STATISTICS '+[two_part_name]+'('+[stats_name]+');'
WHEN 2
THEN 'UPDATE STATISTICS '+[two_part_name]+'('+[stats_name]+') WITH FULLSCAN;'
WHEN 3
THEN 'UPDATE STATISTICS '+[two_part_name]+'('+[stats_name]+') WITH SAMPLE '+CAST(@sample_pct AS
VARCHAR(20))+' PERCENT;'
WHEN 4
THEN 'UPDATE STATISTICS '+[two_part_name]+'('+[stats_name]+') WITH RESAMPLE;'
END AS [update_stats_ddl]
, [seq_nmbr]
FROM t1
;
GO

At this stage, the only action that has occurred is the creation of a stored procedure that generates a temporary
table, #stats_ddl, with DDL statements. This stored procedure drops #stats_ddl if it already exists to ensure it does
not fail if run more than once within a session. However, since there is no DROP TABLE at the end of the stored
procedure, when the stored procedure completes, it leaves the created table so that it can be read outside of the
stored procedure. In SQL Data Warehouse, unlike other SQL Server databases, it is possible to use the
temporary table outside of the procedure that created it. SQL Data Warehouse temporary tables can be used
anywhere inside the session. This can lead to more modular and manageable code as in the following example:

EXEC [dbo].[prc_sqldw_update_stats] @update_type = 1, @sample_pct = NULL;

DECLARE @i INT = 1
, @t INT = (SELECT COUNT(*) FROM #stats_ddl)
, @s NVARCHAR(4000) = N''

WHILE @i <= @t
BEGIN
SET @s=(SELECT update_stats_ddl FROM #stats_ddl WHERE seq_nmbr = @i);

PRINT @s
EXEC sp_executesql @s
SET @i+=1;
END

DROP TABLE #stats_ddl;

Temporary table limitations


SQL Data Warehouse does impose a couple of limitations when implementing temporary tables. Currently, only
session scoped temporary tables are supported. Global Temporary Tables are not supported. In addition, views
cannot be created on temporary tables. Temporary tables can only be created with hash or round robin
distribution. Replicated temporary table distribution is not supported.

Next steps
To learn more about developing tables, see the Table Overview.
Using T-SQL loops in SQL Data Warehouse
7/24/2019 • 2 minutes to read • Edit Online

Tips for using T-SQL loops and replacing cursors in Azure SQL Data Warehouse for developing solutions.

Purpose of WHILE loops


SQL Data Warehouse supports the WHILE loop for repeatedly executing statement blocks. This WHILE loop
continues for as long as the specified conditions are true or until the code specifically terminates the loop using the
BREAK keyword. Loops are useful for replacing cursors defined in SQL code. Fortunately, almost all cursors that
are written in SQL code are of the fast forward, read-only variety. Therefore, WHILE loops are a great alternative
for replacing cursors.
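
As a minimal illustration of the construct (not from the original article), the loop below ends either when its condition becomes false or when BREAK is reached:

DECLARE @i INT = 1;

WHILE @i <= 10
BEGIN
    IF @i = 5 BREAK;  -- exit the loop early
    SET @i += 1;
END

SELECT @i AS stopped_at;  -- returns 5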

Replacing cursors in SQL Data Warehouse


However, before diving in head first, you should ask yourself the following question: "Could this cursor be rewritten
to use set-based operations?" In many cases, the answer is yes and is often the best approach. A set-based
operation often performs faster than an iterative, row-by-row approach.
Fast forward read-only cursors can be easily replaced with a looping construct. The following is a simple example.
This code example updates the statistics for every table in the database. By iterating over the tables in the loop,
each command executes in sequence.
First, create a temporary table containing a unique row number used to identify the individual statements:

CREATE TABLE #tbl
WITH
( DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Sequence
, [name]
, 'UPDATE STATISTICS '+QUOTENAME([name]) AS sql_code
FROM sys.tables
;

Second, initialize the variables required to perform the loop:

DECLARE @nbr_statements INT = (SELECT COUNT(*) FROM #tbl)
, @i INT = 1
;

Now loop over statements executing them one at a time:

WHILE @i <= @nbr_statements
BEGIN
DECLARE @sql_code NVARCHAR(4000) = (SELECT sql_code FROM #tbl WHERE Sequence = @i);
EXEC sp_executesql @sql_code;
SET @i +=1;
END

Finally, drop the temporary table created in the first step:


DROP TABLE #tbl;

Next steps
For more development tips, see development overview.
Using stored procedures in SQL Data Warehouse
7/24/2019 • 2 minutes to read • Edit Online

Tips for implementing stored procedures in Azure SQL Data Warehouse for developing solutions.

What to expect
SQL Data Warehouse supports many of the T-SQL features that are used in SQL Server. More importantly, there
are scale-out specific features that you can use to maximize the performance of your solution.
However, to maintain the scale and performance of SQL Data Warehouse there are also some features and
functionality that have behavioral differences and others that are not supported.

Introducing stored procedures


Stored procedures are a great way to encapsulate your SQL code and store it close to your data in the data
warehouse. Stored procedures help developers modularize their solutions by encapsulating the code into
manageable units, facilitating greater reusability of code. Each stored procedure can also accept parameters to
make them even more flexible.
SQL Data Warehouse provides a simplified and streamlined stored procedure implementation. The biggest
difference compared to SQL Server is that the stored procedure is not pre-compiled code. In data warehouses, the
compilation time is small in comparison to the time it takes to run queries against large data volumes. It is more
important to ensure the stored procedure code is correctly optimized for large queries. The goal is to save hours,
minutes, and seconds, not milliseconds. It is therefore more helpful to think of stored procedures as containers for
SQL logic.
When SQL Data Warehouse executes your stored procedure, the SQL statements are parsed, translated, and
optimized at run time. During this process, each statement is converted into distributed queries. The SQL code that
is executed against the data is different than the query submitted.
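
As a simple illustration of a stored procedure acting as a container for SQL logic, the following minimal sketch wraps a parameterized clean-up of a staging table; the procedure, table, and parameter names are illustrative and not part of this article:

CREATE PROCEDURE dbo.prc_purge_staging
    @cutoff_datekey INT
AS
-- Remove staged rows older than the supplied date key (all names are illustrative)
DELETE FROM dbo.StageInternetSales
WHERE [OrderDateKey] < @cutoff_datekey;
GO

EXEC dbo.prc_purge_staging @cutoff_datekey = 20190101;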

Nesting stored procedures


When stored procedures call other stored procedures, or execute dynamic SQL, then the inner stored procedure or
code invocation is said to be nested.
SQL Data Warehouse supports a maximum of eight nesting levels. This is slightly different to SQL Server. The
nest level in SQL Server is 32.
The top-level stored procedure call equates to nest level 1.

EXEC prc_nesting

If the stored procedure also makes another EXEC call, the nest level increases to two.

CREATE PROCEDURE prc_nesting
AS
EXEC prc_nesting_2 -- This call is nest level 2
GO
EXEC prc_nesting
If the second procedure then executes some dynamic SQL, the nest level increases to three.

CREATE PROCEDURE prc_nesting_2
AS
EXEC sp_executesql N'SELECT ''another nest level''' -- This call is nest level 3
GO
EXEC prc_nesting

Note, SQL Data Warehouse does not currently support @@NESTLEVEL, so you need to track the nest level yourself. It is
unlikely that you will exceed the eight nest level limit, but if you do, you need to rework your code to fit the nesting
levels within this limit.
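
One way to track the nest level is to pass it explicitly as a parameter, since @@NESTLEVEL is unavailable. The following is a minimal sketch with illustrative names; each procedure passes its own level plus one to anything it calls:

CREATE PROCEDURE dbo.prc_nesting_guard
    @nest_level INT
AS
-- Fail fast if a caller is about to exceed the eight-level limit
IF @nest_level > 8
BEGIN
    THROW 151000, 'Nest level limit of eight exceeded.', 1;
END;
-- ... procedure body; any nested EXEC should be passed @nest_level + 1 ...
GO

EXEC dbo.prc_nesting_guard @nest_level = 1;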

INSERT..EXECUTE
SQL Data Warehouse does not permit you to consume the result set of a stored procedure with an INSERT
statement. However, there is an alternative approach you can use. For an example, see the article on temporary
tables.
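
A minimal sketch of that alternative, with illustrative names: the procedure writes its result set to a session-scoped temporary table, which the caller reads after the EXEC completes.

CREATE PROCEDURE dbo.prc_list_user_tables
AS
IF OBJECT_ID('tempdb..#table_list') IS NOT NULL
BEGIN
    DROP TABLE #table_list;
END

-- Materialize the "result set" into a temporary table instead of returning it to an INSERT
CREATE TABLE #table_list
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP )
AS
SELECT s.[name] AS [schema_name]
     , t.[name] AS [table_name]
FROM sys.tables t
JOIN sys.schemas s ON t.[schema_id] = s.[schema_id];
GO

EXEC dbo.prc_list_user_tables;

-- The temporary table is still available in the session after the procedure returns
SELECT * FROM #table_list;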

Limitations
There are some aspects of Transact-SQL stored procedures that are not implemented in SQL Data Warehouse.
They are:
temporary stored procedures
numbered stored procedures
extended stored procedures
CLR stored procedures
encryption option
replication option
table-valued parameters
read-only parameters
default parameters
execution contexts
return statement

Next steps
For more development tips, see development overview.
Using transactions in SQL Data Warehouse
7/24/2019 • 5 minutes to read • Edit Online

Tips for implementing transactions in Azure SQL Data Warehouse for developing solutions.

What to expect
As you would expect, SQL Data Warehouse supports transactions as part of the data warehouse workload.
However, to ensure the performance of SQL Data Warehouse is maintained at scale some features are limited
when compared to SQL Server. This article highlights the differences and lists the features that are not supported.

Transaction isolation levels


SQL Data Warehouse implements ACID transactions. However, the isolation level of the transactional support is
limited to READ UNCOMMITTED; this level cannot be changed. If READ UNCOMMITTED is a concern, you can
implement a number of coding methods to prevent dirty reads of data. The most popular methods use both CTAS
and table partition switching (often known as the sliding window pattern) to prevent users from querying data
that is still being prepared. Views that pre-filter the data are also a popular approach.
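
For example, a pre-filtering view can hide rows from a load batch until the load process marks the batch complete. This is a minimal sketch under assumed names; the dbo.FactSales table, its [BatchId] column, and the dbo.LoadControl table are illustrative and not objects defined in this article:

-- Readers only see batches that the load process has flagged as complete (names are illustrative)
CREATE VIEW dbo.vFactSales
AS
SELECT f.*
FROM dbo.FactSales f
JOIN dbo.LoadControl c ON c.[BatchId] = f.[BatchId]
WHERE c.[IsComplete] = 1;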

Transaction size
A single data modification transaction is limited in size. The limit is applied per distribution. Therefore, the total
allocation can be calculated by multiplying the limit by the distribution count. To approximate the maximum
number of rows in the transaction, divide the distribution cap by the total size of each row. For variable length
columns, consider taking an average column length rather than using the maximum size.
In the table below the following assumptions have been made:
An even distribution of data has occurred
The average row length is 250 bytes
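
As a worked example (not part of the original table), the DW1000c row below can be derived from these assumptions; the published figures appear to treat 1 GB as 1,000,000,000 bytes:

-- DW1000c: 7.5 GB cap per distribution, 60 distributions, 250-byte average rows
DECLARE @cap_per_distribution_gb DECIMAL(9,2) = 7.5
      , @distribution_count      INT          = 60
      , @avg_row_length_bytes    INT          = 250;

SELECT @cap_per_distribution_gb * @distribution_count AS max_transaction_size_gb            -- 450
     , CAST(@cap_per_distribution_gb * 1000000000 / @avg_row_length_bytes AS BIGINT)
       AS rows_per_distribution                                                             -- 30,000,000
     , CAST(@cap_per_distribution_gb * 1000000000 / @avg_row_length_bytes AS BIGINT)
       * @distribution_count AS max_rows_per_transaction;                                   -- 1,800,000,000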

Gen2
DWU         CAP PER DISTRIBUTION (GB)    NUMBER OF DISTRIBUTIONS    MAX TRANSACTION SIZE (GB)    # ROWS PER DISTRIBUTION    MAX ROWS PER TRANSACTION

DW100c 1 60 60 4,000,000 240,000,000

DW200c 1.5 60 90 6,000,000 360,000,000

DW300c 2.25 60 135 9,000,000 540,000,000

DW400c 3 60 180 12,000,000 720,000,000

DW500c 3.75 60 225 15,000,000 900,000,000

DW1000c 7.5 60 450 30,000,000 1,800,000,000

DW1500c 11.25 60 675 45,000,000 2,700,000,000



DW2000c 15 60 900 60,000,000 3,600,000,000

DW2500c 18.75 60 1125 75,000,000 4,500,000,000

DW3000c 22.5 60 1,350 90,000,000 5,400,000,000

DW5000c 37.5 60 2,250 150,000,000 9,000,000,000

DW6000c 45 60 2,700 180,000,000 10,800,000,000

DW7500c 56.25 60 3,375 225,000,000 13,500,000,000

DW10000c 75 60 4,500 300,000,000 18,000,000,000

DW15000c 112.5 60 6,750 450,000,000 27,000,000,000

DW30000c 225 60 13,500 900,000,000 54,000,000,000

Gen1
DWU         CAP PER DISTRIBUTION (GB)    NUMBER OF DISTRIBUTIONS    MAX TRANSACTION SIZE (GB)    # ROWS PER DISTRIBUTION    MAX ROWS PER TRANSACTION

DW100 1 60 60 4,000,000 240,000,000

DW200 1.5 60 90 6,000,000 360,000,000

DW300 2.25 60 135 9,000,000 540,000,000

DW400 3 60 180 12,000,000 720,000,000

DW500 3.75 60 225 15,000,000 900,000,000

DW600 4.5 60 270 18,000,000 1,080,000,000

DW1000 7.5 60 450 30,000,000 1,800,000,000

DW1200 9 60 540 36,000,000 2,160,000,000

DW1500 11.25 60 675 45,000,000 2,700,000,000

DW2000 15 60 900 60,000,000 3,600,000,000

DW3000 22.5 60 1,350 90,000,000 5,400,000,000

DW6000 45 60 2,700 180,000,000 10,800,000,000

The transaction size limit is applied per transaction or operation. It is not applied across all concurrent
transactions. Therefore each transaction is permitted to write this amount of data to the log.
To optimize and minimize the amount of data written to the log, please refer to the Transactions best practices
article.

WARNING
The maximum transaction size can only be achieved for HASH or ROUND_ROBIN distributed tables where the spread of the
data is even. If the transaction is writing data in a skewed fashion to the distributions then the limit is likely to be reached
prior to the maximum transaction size.

Transaction state
SQL Data Warehouse uses the XACT_STATE () function to report a failed transaction using the value -2. This value
means the transaction has failed and is marked for rollback only.

NOTE
The use of -2 by the XACT_STATE function to denote a failed transaction represents different behavior to SQL Server. SQL
Server uses the value -1 to represent an uncommittable transaction. SQL Server can tolerate some errors inside a
transaction without it having to be marked as uncommittable. For example SELECT 1/0 would cause an error but not force
a transaction into an uncommittable state. SQL Server also permits reads in the uncommittable transaction. However, SQL
Data Warehouse does not let you do this. If an error occurs inside a SQL Data Warehouse transaction it will automatically
enter the -2 state and you will not be able to make any further select statements until the statement has been rolled back.
It is therefore important to check your application code to see if it uses XACT_STATE(), as you may need to make code
modifications.

For example, in SQL Server you might see a transaction that looks like the following:
SET NOCOUNT ON;
DECLARE @xact_state smallint = 0;

BEGIN TRAN
BEGIN TRY
DECLARE @i INT;
SET @i = CONVERT(INT,'ABC');
END TRY
BEGIN CATCH
SET @xact_state = XACT_STATE();

SELECT ERROR_NUMBER() AS ErrNumber
, ERROR_SEVERITY() AS ErrSeverity
, ERROR_STATE() AS ErrState
, ERROR_PROCEDURE() AS ErrProcedure
, ERROR_MESSAGE() AS ErrMessage
;

IF @@TRANCOUNT > 0
BEGIN
ROLLBACK TRAN;
PRINT 'ROLLBACK';
END

END CATCH;

IF @@TRANCOUNT >0
BEGIN
PRINT 'COMMIT';
COMMIT TRAN;
END

SELECT @xact_state AS TransactionState;

The preceding code gives the following error message:


Msg 111233, Level 16, State 1, Line 1 111233; The current transaction has aborted, and any pending changes
have been rolled back. Cause: A transaction in a rollback-only state was not explicitly rolled back before a DDL,
DML, or SELECT statement.
You won't get the output of the ERROR_* functions.
In SQL Data Warehouse the code needs to be slightly altered:
SET NOCOUNT ON;
DECLARE @xact_state smallint = 0;

BEGIN TRAN
BEGIN TRY
DECLARE @i INT;
SET @i = CONVERT(INT,'ABC');
END TRY
BEGIN CATCH
SET @xact_state = XACT_STATE();

IF @@TRANCOUNT > 0
BEGIN
PRINT 'ROLLBACK';
ROLLBACK TRAN;
END

SELECT ERROR_NUMBER() AS ErrNumber
, ERROR_SEVERITY() AS ErrSeverity
, ERROR_STATE() AS ErrState
, ERROR_PROCEDURE() AS ErrProcedure
, ERROR_MESSAGE() AS ErrMessage
;
END CATCH;

IF @@TRANCOUNT >0
BEGIN
PRINT 'COMMIT';
COMMIT TRAN;
END

SELECT @xact_state AS TransactionState;

The expected behavior is now observed. The error in the transaction is managed and the ERROR_* functions
provide values as expected.
All that has changed is that the ROLLBACK of the transaction had to happen before the read of the error
information in the CATCH block.

Error_Line() function
It is also worth noting that SQL Data Warehouse does not implement or support the ERROR_LINE () function. If
you have this in your code, you need to remove it to be compliant with SQL Data Warehouse. Use query labels in
your code instead to implement equivalent functionality. For more details, see the LABEL article.

Using THROW and RAISERROR


THROW is the more modern implementation for raising exceptions in SQL Data Warehouse but RAISERROR is
also supported. There are, however, a few differences worth paying attention to:
User-defined error message numbers cannot be in the 100,000 - 150,000 range for THROW
RAISERROR error messages are fixed at 50,000
Use of sys.messages is not supported
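
A minimal sketch of both mechanisms (run each statement on its own, because THROW ends the batch). The error number for THROW stays outside the reserved 100,000 - 150,000 range, while a RAISERROR with a message string surfaces as error number 50,000:

-- Raise a user-defined error with THROW (error number outside the reserved range)
THROW 151000, 'Custom error raised with THROW.', 1;

-- Raise the same condition with RAISERROR; the error number is fixed at 50,000
RAISERROR ('Custom error raised with RAISERROR.', 16, 1);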

Limitations
SQL Data Warehouse does have a few other restrictions that relate to transactions.
They are as follows:
No distributed transactions
No nested transactions permitted
No save points allowed
No named transactions
No marked transactions
No support for DDL such as CREATE TABLE inside a user-defined transaction

Next steps
To learn more about optimizing transactions, see Transactions best practices. To learn about other SQL Data
Warehouse best practices, see SQL Data Warehouse best practices.
Optimizing transactions in Azure SQL Data
Warehouse
7/24/2019 • 9 minutes to read • Edit Online

Learn how to optimize the performance of your transactional code in Azure SQL Data Warehouse while
minimizing risk for long rollbacks.

Transactions and logging


Transactions are an important component of a relational database engine. SQL Data Warehouse uses transactions
during data modification. These transactions can be explicit or implicit. Single INSERT, UPDATE, and DELETE
statements are all examples of implicit transactions. Explicit transactions use BEGIN TRAN, COMMIT TRAN, or
ROLLBACK TRAN. Explicit transactions are typically used when multiple modification statements need to be tied
together in a single atomic unit.
Azure SQL Data Warehouse commits changes to the database using transaction logs. Each distribution has its
own transaction log. Transaction log writes are automatic. There is no configuration required. However, whilst this
process guarantees the write, it does introduce an overhead in the system. You can minimize this impact by writing
transactionally efficient code. Transactionally efficient code broadly falls into the following categories:
Use minimal logging constructs whenever possible
Process data using scoped batches to avoid singular long running transactions
Adopt a partition switching pattern for large modifications to a given partition

Minimal vs. full logging


Unlike fully logged operations, which use the transaction log to keep track of every row change, minimally logged
operations keep track of extent allocations and meta-data changes only. Therefore, minimal logging involves
logging only the information that is required to roll back the transaction after a failure, or for an explicit request
(ROLLBACK TRAN ). As much less information is tracked in the transaction log, a minimally logged operation
performs better than a similarly sized fully logged operation. Furthermore, because fewer writes go to the
transaction log, a much smaller amount of log data is generated and so is more I/O efficient.
The transaction safety limits only apply to fully logged operations.

NOTE
Minimally logged operations can participate in explicit transactions. As all changes in allocation structures are tracked, it is
possible to roll back minimally logged operations.

Minimally logged operations


The following operations are capable of being minimally logged:
CREATE TABLE AS SELECT (CTAS )
INSERT..SELECT
CREATE INDEX
ALTER INDEX REBUILD
DROP INDEX
TRUNCATE TABLE
DROP TABLE
ALTER TABLE SWITCH PARTITION

NOTE
Internal data movement operations (such as BROADCAST and SHUFFLE) are not affected by the transaction safety limit.

Minimal logging with bulk load


CTAS and INSERT...SELECT are both bulk load operations. However, both are influenced by the target table
definition and depend on the load scenario. The following table explains when bulk operations are fully or
minimally logged:

PRIMARY INDEX                  LOAD SCENARIO                                                LOGGING MODE

Heap                           Any                                                          Minimal

Clustered Index                Empty target table                                           Minimal

Clustered Index                Loaded rows do not overlap with existing pages in target     Minimal

Clustered Index                Loaded rows overlap with existing pages in target            Full

Clustered Columnstore Index    Batch size >= 102,400 per partition aligned distribution     Minimal

Clustered Columnstore Index    Batch size < 102,400 per partition aligned distribution      Full

It is worth noting that any writes to update secondary or non-clustered indexes will always be fully logged
operations.

IMPORTANT
SQL Data Warehouse has 60 distributions. Therefore, assuming all rows are evenly distributed and landing in a single
partition, your batch will need to contain 6,144,000 rows or larger to be minimally logged when writing to a Clustered
Columnstore Index. If the table is partitioned and the rows being inserted span partition boundaries, then you will need
6,144,000 rows per partition boundary assuming even data distribution. Each partition in each distribution must
independently exceed the 102,400 row threshold for the insert to be minimally logged into the distribution.

Loading data into a non-empty table with a clustered index can often contain a mixture of fully logged and
minimally logged rows. A clustered index is a balanced tree (b-tree) of pages. If the page being written to already
contains rows from another transaction, then these writes will be fully logged. However, if the page is empty then
the write to that page will be minimally logged.

Optimizing deletes
DELETE is a fully logged operation. If you need to delete a large amount of data in a table or a partition, it often
makes more sense to SELECT the data you wish to keep, which can be run as a minimally logged operation. To
select the data, create a new table with CTAS. Once created, use RENAME to swap out your old table with the
newly created table.

-- Delete all sales transactions for Promotions except PromotionKey 2.

--Step 01. Create a new table selecting only the records we want to keep (PromotionKey 2)
CREATE TABLE [dbo].[FactInternetSales_d]
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([ProductKey])
, PARTITION ( [OrderDateKey] RANGE RIGHT
FOR VALUES ( 20000101, 20010101, 20020101, 20030101, 20040101,
20050101
, 20060101, 20070101, 20080101, 20090101, 20100101,
20110101
, 20120101, 20130101, 20140101, 20150101, 20160101,
20170101
, 20180101, 20190101, 20200101, 20210101, 20220101,
20230101
, 20240101, 20250101, 20260101, 20270101, 20280101,
20290101
)
)
)
AS
SELECT *
FROM [dbo].[FactInternetSales]
WHERE [PromotionKey] = 2
OPTION (LABEL = 'CTAS : Delete')
;

--Step 02. Rename the tables to replace the old table with the newly created table


RENAME OBJECT [dbo].[FactInternetSales] TO [FactInternetSales_old];
RENAME OBJECT [dbo].[FactInternetSales_d] TO [FactInternetSales];

Optimizing updates
UPDATE is a fully logged operation. If you need to update a large number of rows in a table or a partition, it can
often be far more efficient to use a minimally logged operation such as CTAS to do so.
In the example below a full table update has been converted to a CTAS so that minimal logging is possible.
In this case, we are retrospectively adding a discount amount to the sales in the table:
--Step 01. Create a new table containing the "Update".
CREATE TABLE [dbo].[FactInternetSales_u]
WITH
(   CLUSTERED COLUMNSTORE INDEX
,   DISTRIBUTION = HASH([ProductKey])
,   PARTITION   (   [OrderDateKey] RANGE RIGHT
                    FOR VALUES (  20000101, 20010101, 20020101, 20030101, 20040101, 20050101
                                , 20060101, 20070101, 20080101, 20090101, 20100101, 20110101
                                , 20120101, 20130101, 20140101, 20150101, 20160101, 20170101
                                , 20180101, 20190101, 20200101, 20210101, 20220101, 20230101
                                , 20240101, 20250101, 20260101, 20270101, 20280101, 20290101
                               )
                )
)
AS
SELECT
[ProductKey]
, [OrderDateKey]
, [DueDateKey]
, [ShipDateKey]
, [CustomerKey]
, [PromotionKey]
, [CurrencyKey]
, [SalesTerritoryKey]
, [SalesOrderNumber]
, [SalesOrderLineNumber]
, [RevisionNumber]
, [OrderQuantity]
, [UnitPrice]
, [ExtendedAmount]
, [UnitPriceDiscountPct]
, ISNULL(CAST(5 as float),0) AS [DiscountAmount]
, [ProductStandardCost]
, [TotalProductCost]
, ISNULL(CAST(CASE WHEN [SalesAmount] <=5 THEN 0
ELSE [SalesAmount] - 5
END AS MONEY),0) AS [SalesAmount]
, [TaxAmt]
, [Freight]
, [CarrierTrackingNumber]
, [CustomerPONumber]
FROM [dbo].[FactInternetSales]
OPTION (LABEL = 'CTAS : Update')
;

--Step 02. Rename the tables


RENAME OBJECT [dbo].[FactInternetSales] TO [FactInternetSales_old];
RENAME OBJECT [dbo].[FactInternetSales_u] TO [FactInternetSales];

--Step 03. Drop the old table


DROP TABLE [dbo].[FactInternetSales_old]

NOTE
Re-creating large tables can benefit from using SQL Data Warehouse workload management features. For more information,
see Resource classes for workload management.

Optimizing with partition switching


If faced with large-scale modifications inside a table partition, then a partition switching pattern makes sense. If
the data modification is significant and spans multiple partitions, then iterating over the partitions achieves the
same result.
The steps to perform a partition switch are as follows:
1. Create an empty out partition
2. Perform the 'update' as a CTAS
3. Switch out the existing data to the out table
4. Switch in the new data
5. Clean up the data
However, to help identify the partitions to switch, create the following helper procedure.

CREATE PROCEDURE dbo.partition_data_get


@schema_name NVARCHAR(128)
, @table_name NVARCHAR(128)
, @boundary_value INT
AS
IF OBJECT_ID('tempdb..#ptn_data') IS NOT NULL
BEGIN
DROP TABLE #ptn_data
END
CREATE TABLE #ptn_data
WITH ( DISTRIBUTION = ROUND_ROBIN
, HEAP
)
AS
WITH CTE
AS
(
SELECT s.name AS [schema_name]
, t.name AS [table_name]
, p.partition_number AS [ptn_nmbr]
, p.[rows] AS [ptn_rows]
, CAST(r.[value] AS INT) AS [boundary_value]
FROM sys.schemas AS s
JOIN sys.tables AS t ON s.[schema_id] = t.[schema_id]
JOIN sys.indexes AS i ON t.[object_id] = i.[object_id]
JOIN sys.partitions AS p ON i.[object_id] = p.[object_id]
AND i.[index_id] = p.[index_id]
JOIN sys.partition_schemes AS h ON i.[data_space_id] = h.[data_space_id]
JOIN sys.partition_functions AS f ON h.[function_id] = f.[function_id]
LEFT JOIN sys.partition_range_values AS r ON f.[function_id] = r.[function_id]
AND r.[boundary_id] = p.[partition_number]
WHERE i.[index_id] <= 1
)
SELECT *
FROM CTE
WHERE [schema_name] = @schema_name
AND [table_name] = @table_name
AND [boundary_value] = @boundary_value
OPTION (LABEL = 'dbo.partition_data_get : CTAS : #ptn_data')
;
GO

This procedure maximizes code reuse and keeps the partition switching example more compact.
The following code demonstrates the steps mentioned previously to achieve a full partition switching routine.

--Create a partitioned aligned empty table to switch out the data


IF OBJECT_ID('[dbo].[FactInternetSales_out]') IS NOT NULL
BEGIN
DROP TABLE [dbo].[FactInternetSales_out]
END

CREATE TABLE [dbo].[FactInternetSales_out]


WITH
( DISTRIBUTION = HASH([ProductKey])
, CLUSTERED COLUMNSTORE INDEX
, PARTITION ( [OrderDateKey] RANGE RIGHT
FOR VALUES ( 20020101, 20030101
)
)
)
AS
SELECT *
FROM [dbo].[FactInternetSales]
WHERE 1=2
OPTION (LABEL = 'CTAS : Partition Switch IN : UPDATE')
;

--Create a partitioned aligned table and update the data in the select portion of the CTAS
IF OBJECT_ID('[dbo].[FactInternetSales_in]') IS NOT NULL
BEGIN
DROP TABLE [dbo].[FactInternetSales_in]
END

CREATE TABLE [dbo].[FactInternetSales_in]


WITH
( DISTRIBUTION = HASH([ProductKey])
, CLUSTERED COLUMNSTORE INDEX
, PARTITION ( [OrderDateKey] RANGE RIGHT
FOR VALUES ( 20020101, 20030101
)
)
)
AS
SELECT
[ProductKey]
, [OrderDateKey]
, [DueDateKey]
, [ShipDateKey]
, [CustomerKey]
, [PromotionKey]
, [CurrencyKey]
, [SalesTerritoryKey]
, [SalesOrderNumber]
, [SalesOrderLineNumber]
, [RevisionNumber]
, [OrderQuantity]
, [UnitPrice]
, [ExtendedAmount]
, [UnitPriceDiscountPct]
, ISNULL(CAST(5 as float),0) AS [DiscountAmount]
, [ProductStandardCost]
, [TotalProductCost]
, ISNULL(CAST(CASE WHEN [SalesAmount] <=5 THEN 0
ELSE [SalesAmount] - 5
END AS MONEY),0) AS [SalesAmount]
, [TaxAmt]
, [Freight]
, [CarrierTrackingNumber]
, [CustomerPONumber]
FROM [dbo].[FactInternetSales]
WHERE OrderDateKey BETWEEN 20020101 AND 20021231
OPTION (LABEL = 'CTAS : Partition Switch IN : UPDATE')
;

--Use the helper procedure to identify the partitions


--The source table
EXEC dbo.partition_data_get 'dbo','FactInternetSales',20030101
DECLARE @ptn_nmbr_src INT = (SELECT ptn_nmbr FROM #ptn_data)
SELECT @ptn_nmbr_src

--The "in" table


EXEC dbo.partition_data_get 'dbo','FactInternetSales_in',20030101
DECLARE @ptn_nmbr_in INT = (SELECT ptn_nmbr FROM #ptn_data)
SELECT @ptn_nmbr_in

--The "out" table


EXEC dbo.partition_data_get 'dbo','FactInternetSales_out',20030101
DECLARE @ptn_nmbr_out INT = (SELECT ptn_nmbr FROM #ptn_data)
SELECT @ptn_nmbr_out

--Switch the partitions over


DECLARE @SQL NVARCHAR(4000) = '
ALTER TABLE [dbo].[FactInternetSales]    SWITCH PARTITION '+CAST(@ptn_nmbr_src AS VARCHAR(20))+' TO [dbo].[FactInternetSales_out] PARTITION '+CAST(@ptn_nmbr_out AS VARCHAR(20))+';
ALTER TABLE [dbo].[FactInternetSales_in] SWITCH PARTITION '+CAST(@ptn_nmbr_in  AS VARCHAR(20))+' TO [dbo].[FactInternetSales]     PARTITION '+CAST(@ptn_nmbr_src AS VARCHAR(20))+';'
EXEC sp_executesql @SQL

--Perform the clean-up


TRUNCATE TABLE dbo.FactInternetSales_out;
TRUNCATE TABLE dbo.FactInternetSales_in;

DROP TABLE dbo.FactInternetSales_out


DROP TABLE dbo.FactInternetSales_in
DROP TABLE #ptn_data

Minimize logging with small batches


For large data modification operations, it may make sense to divide the operation into chunks or batches to scope
the unit of work.
The following code is a working example. The batch size has been set to a trivial number to highlight the technique.
In reality, the batch size would be significantly larger.
SET NOCOUNT ON;
IF OBJECT_ID('tempdb..#t') IS NOT NULL
BEGIN
DROP TABLE #t;
PRINT '#t dropped';
END

CREATE TABLE #t
WITH ( DISTRIBUTION = ROUND_ROBIN
, HEAP
)
AS
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS seq_nmbr
, SalesOrderNumber
, SalesOrderLineNumber
FROM dbo.FactInternetSales
WHERE [OrderDateKey] BETWEEN 20010101 and 20011231
;

DECLARE @seq_start INT = 1


, @batch_iterator INT = 1
, @batch_size INT = 50
, @max_seq_nmbr INT = (SELECT MAX(seq_nmbr) FROM dbo.#t)
;

DECLARE @batch_count INT = (SELECT CEILING((@max_seq_nmbr*1.0)/@batch_size))


, @seq_end INT = @batch_size
;

SELECT COUNT(*)
FROM dbo.FactInternetSales f

PRINT 'MAX_seq_nmbr '+CAST(@max_seq_nmbr AS VARCHAR(20))


PRINT 'MAX_Batch_count '+CAST(@batch_count AS VARCHAR(20))

WHILE @batch_iterator <= @batch_count


BEGIN
DELETE
FROM dbo.FactInternetSales
WHERE EXISTS
(
SELECT 1
FROM #t t
WHERE seq_nmbr BETWEEN @seq_start AND @seq_end
AND FactInternetSales.SalesOrderNumber = t.SalesOrderNumber
AND FactInternetSales.SalesOrderLineNumber = t.SalesOrderLineNumber
)
;

SET @seq_start = @seq_end


SET @seq_end = (@seq_start+@batch_size);
SET @batch_iterator +=1;
END

Pause and scaling guidance


Azure SQL Data Warehouse lets you pause, resume, and scale your data warehouse on demand. When you pause
or scale your SQL Data Warehouse, it is important to understand that any in-flight transactions are terminated
immediately, causing any open transactions to be rolled back. If your workload had issued a long-running and
incomplete data modification prior to the pause or scale operation, then this work will need to be undone. The
undo might increase the time it takes to pause or scale your Azure SQL Data Warehouse database.
IMPORTANT
Both UPDATE and DELETE are fully logged operations and so these undo/redo operations can take significantly longer
than equivalent minimally logged operations.

The best scenario is to let in-flight data modification transactions complete prior to pausing or scaling SQL Data
Warehouse. However, this scenario might not always be practical. To mitigate the risk of a long rollback, consider
one of the following options:
Rewrite long running operations using CTAS
Break the operation into chunks; operating on a subset of the rows
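Before pausing or scaling, you can also check whether any requests are still in flight. The following is a minimal sketch using the sys.dm_pdw_exec_requests DMV (the status values in the filter are illustrative):

SELECT  request_id
     ,  [status]
     ,  submit_time
     ,  command
FROM    sys.dm_pdw_exec_requests
WHERE   [status] NOT IN ('Completed','Failed','Cancelled')
ORDER BY submit_time
;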

Next steps
See Transactions in SQL Data Warehouse to learn more about isolation levels and transactional limits. For an
overview of other Best Practices, see SQL Data Warehouse Best Practices.
Using user-defined schemas in SQL Data Warehouse
7/24/2019 • 3 minutes to read • Edit Online

Tips for using T-SQL user-defined schemas in Azure SQL Data Warehouse for developing solutions.

Schemas for application boundaries


Traditional data warehouses often use separate databases to create application boundaries based on workload,
domain, or security. For example, a traditional SQL Server data warehouse might include a staging
database, a data warehouse database, and some data mart databases. In this topology each database operates as a
workload and security boundary in the architecture.
By contrast, SQL Data Warehouse runs the entire data warehouse workload within one database. Cross database
joins are not permitted. Therefore SQL Data Warehouse expects all tables used by the warehouse to be stored
within the one database.

NOTE
SQL Data Warehouse does not support cross database queries of any kind. Consequently, data warehouse implementations
that leverage this pattern will need to be revised.

Recommendations
These are recommendations for consolidating workload, security, domain, and functional boundaries by using
user-defined schemas:
1. Use one SQL Data Warehouse database to run your entire data warehouse workload
2. Consolidate your existing data warehouse environment to use one SQL Data Warehouse database
3. Leverage user-defined schemas to provide the boundary previously implemented using databases.
If user-defined schemas have not been used previously then you have a clean slate. Simply use the old database
name as the basis for your user-defined schemas in the SQL Data Warehouse database.
If schemas have already been used then you have a few options:
1. Remove the legacy schema names and start fresh
2. Retain the legacy schema names by pre-pending the legacy schema name to the table name
3. Retain the legacy schema names by implementing views over the table in an extra schema to re-create the old
schema structure.

NOTE
On first inspection option 3 may seem like the most appealing option. However, the devil is in the detail. Views are read only
in SQL Data Warehouse. Any data or table modification would need to be performed against the base table. Option 3 also
introduces a layer of views into your system. You might want to give this some additional thought if you are using views in
your architecture already.

Examples:
Implement user-defined schemas based on database names
CREATE SCHEMA [stg]; -- stg previously database name for staging database
GO
CREATE SCHEMA [edw]; -- edw previously database name for the data warehouse
GO
CREATE TABLE [stg].[customer] -- create staging tables in the stg schema
( CustKey BIGINT NOT NULL
, ...
);
GO
CREATE TABLE [edw].[customer] -- create data warehouse tables in the edw schema
( CustKey BIGINT NOT NULL
, ...
);

Retain legacy schema names by pre-pending them to the table name. Use schemas for the workload boundary.

CREATE SCHEMA [stg]; -- stg defines the staging boundary


GO
CREATE SCHEMA [edw]; -- edw defines the data warehouse boundary
GO
CREATE TABLE [stg].[dim_customer] --pre-pend the old schema name to the table and create in the staging
boundary
( CustKey BIGINT NOT NULL
, ...
);
GO
CREATE TABLE [edw].[dim_customer] --pre-pend the old schema name to the table and create in the data warehouse
boundary
( CustKey BIGINT NOT NULL
, ...
);

Retain legacy schema names using views

CREATE SCHEMA [stg]; -- stg defines the staging boundary


GO
CREATE SCHEMA [edw]; -- edw defines the data warehouse boundary
GO
CREATE SCHEMA [dim]; -- dim defines the legacy schema name boundary
GO
CREATE TABLE [stg].[customer] -- create the base staging tables in the staging boundary
( CustKey BIGINT NOT NULL
, ...
)
GO
CREATE TABLE [edw].[customer] -- create the base data warehouse tables in the data warehouse boundary
( CustKey BIGINT NOT NULL
, ...
)
GO
CREATE VIEW [dim].[customer] -- create a view in the legacy schema name boundary for presentation consistency
purposes only
AS
SELECT CustKey
, ...
FROM [edw].customer
;
NOTE
Any change in schema strategy needs a review of the security model for the database. In many cases you might be able to
simplify the security model by assigning permissions at the schema level. If more granular permissions are required then you
can use database roles.

Next steps
For more development tips, see development overview.
Assigning variables in Azure SQL Data Warehouse
7/24/2019 • 2 minutes to read • Edit Online

Tips for assigning T-SQL variables in Azure SQL Data Warehouse for developing solutions.

Setting variables with DECLARE


Variables in SQL Data Warehouse are set using the DECLARE statement or the SET statement. Initializing variables
with DECLARE is one of the most flexible ways to set a variable value in SQL Data Warehouse.

DECLARE @v int = 0
;

You can also use DECLARE to set more than one variable at a time. You cannot use SELECT or UPDATE to do the
following:

DECLARE @v INT = (SELECT TOP 1 c_customer_sk FROM Customer where c_last_name = 'Smith')
, @v1 INT = (SELECT TOP 1 c_customer_sk FROM Customer where c_last_name = 'Jones')
;

You cannot initialize and use a variable in the same DECLARE statement. The following example is not allowed and
gives an error, because @p1 is both initialized and used in the same DECLARE statement.

DECLARE @p1 int = 0


, @p2 int = (SELECT COUNT (*) FROM sys.types where is_user_defined = @p1 )
;

Setting values with SET


SET is a common method for setting a single variable.
The following statements are all valid ways to set a variable with SET:

SET @v = (Select max(database_id) from sys.databases);


SET @v = 1;
SET @v = @v+1;
SET @v +=1;

You can only set one variable at a time with SET. However, compound operators are permissible.

Limitations
You cannot use UPDATE for variable assignment.

Next steps
For more development tips, see development overview.
Views in Azure SQL Data Warehouse
10/24/2019 • 2 minutes to read • Edit Online

Views can be used in a number of different ways to improve the quality of your solution.
Azure SQL Data Warehouse supports standard and materialized views. Both are virtual tables created with
SELECT expressions and presented to queries as logical tables. Views encapsulate the complexity of common data
computation and add an abstraction layer to computation changes so there's no need to rewrite queries.

Standard view
A standard view computes its data each time when the view is used. There's no data stored on disk. People
typically use standard views as a tool that helps organize the logical objects and queries in a database. To use a
standard view, a query needs to make direct reference to it. For more information, see the CREATE VIEW
documentation.
Views in SQL Data Warehouse are stored as metadata only. Consequently, the following options are not available:
There is no schema binding option
Base tables cannot be updated through the view
Views cannot be created over temporary tables
There is no support for the EXPAND / NOEXPAND hints
There are no indexed views in SQL Data Warehouse
Standard views can be utilized to enforce performance optimized joins between tables. For example, a view can
incorporate a redundant distribution key as part of the joining criteria to minimize data movement. Another benefit
of a view could be to force a specific query or joining hint. Using views in this manner guarantees that joins are
always performed in an optimal fashion avoiding the need for users to remember the correct construct for their
joins.
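For illustration, the following is a minimal sketch of that pattern; the FactSales and DimCustomer tables, and the assumption that both are hash-distributed on CustomerKey, are made up for this example:

CREATE VIEW dbo.vw_SalesByCustomer
AS
SELECT  f.CustomerKey              -- shared distribution key kept in the projection
     ,  c.CustomerName
     ,  f.SalesAmount
FROM    dbo.FactSales   AS f
JOIN    dbo.DimCustomer AS c ON f.CustomerKey = c.CustomerKey  -- joining on the distribution key minimizes data movement
;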

Materialized view
A materialized view pre-computes, stores, and maintains its data in Azure SQL Data Warehouse just like a table.
There's no recomputation needed each time when a materialized view is used. As the data gets loaded into base
tables, Azure SQL Data Warehouse synchronously refreshes the materialized views. The query optimizer
automatically uses deployed materialized views to improve query performance even if the views are not
referenced in the query. Queries that benefit most from materialized views are complex queries (typically queries
with joins and aggregations) on large tables that produce a small result set.
For details on the materialized view syntax and other requirements, refer to CREATE MATERIALIZED VIEW AS
SELECT.
For query tuning guidance, check Performance tuning with materialized views.
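As a minimal sketch (the view name and aggregation shown here are illustrative; see the CREATE MATERIALIZED VIEW AS SELECT documentation for the full list of requirements):

CREATE MATERIALIZED VIEW dbo.mvw_SalesByOrderDate
WITH ( DISTRIBUTION = HASH([OrderDateKey]) )
AS
SELECT  [OrderDateKey]
     ,  COUNT_BIG(*)        AS [RowCount]          -- aggregate shape typical for a materialized view
     ,  SUM([SalesAmount])  AS [TotalSalesAmount]
FROM    dbo.FactInternetSales
GROUP BY [OrderDateKey]
;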

Example
A common application pattern is to re-create tables using CREATE TABLE AS SELECT (CTAS ) followed by an
object renaming pattern whilst loading data. The following example adds new date records to a date dimension.
Note how a new table, DimDate_New, is first created and then renamed to replace the original version of the table.
CREATE TABLE dbo.DimDate_New
WITH (DISTRIBUTION = ROUND_ROBIN
, CLUSTERED INDEX (DateKey ASC)
)
AS
SELECT *
FROM dbo.DimDate AS prod
UNION ALL
SELECT *
FROM dbo.DimDate_stg AS stg;

RENAME OBJECT DimDate TO DimDate_Old;


RENAME OBJECT DimDate_New TO DimDate;

However, this approach can result in tables appearing and disappearing from a user's view as well as "table does
not exist" error messages. Views can be used to provide users with a consistent presentation layer whilst the
underlying objects are renamed. By providing access to data through views, users do not need visibility to the
underlying tables. This layer provides a consistent user experience while ensuring that the data warehouse
designers can evolve the data model. Being able to evolve the underlying tables means designers can use CTAS to
maximize performance during the data loading process.
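For example, a simple view (the name is illustrative) can act as that presentation layer; the rename pattern above then swaps the underlying tables without users referencing them directly:

CREATE VIEW dbo.vDimDate
AS
SELECT  *
FROM    dbo.DimDate   -- users query the view; the base table can be rebuilt with CTAS and renamed behind it
;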

Next steps
For more development tips, see SQL Data Warehouse development overview.
Dynamic SQL in SQL Data Warehouse
7/24/2019 • 2 minutes to read • Edit Online

Tips for using dynamic SQL in Azure SQL Data Warehouse for developing solutions.

Dynamic SQL Example


When developing application code for SQL Data Warehouse, you may need to use dynamic SQL to help deliver
flexible, generic, and modular solutions. SQL Data Warehouse does not support blob data types at this time. Because
blob data types include varchar(max) and nvarchar(max), this limitation can restrict the size of the strings you can
build. If you have used these types in your application code to build large dynamic SQL strings, you need to break
the code into chunks and use the EXEC statement instead.
A simple example:

DECLARE @sql_fragment1 VARCHAR(8000)=' SELECT name '


, @sql_fragment2 VARCHAR(8000)=' FROM sys.system_views '
, @sql_fragment3 VARCHAR(8000)=' WHERE name like ''%table%''';

EXEC( @sql_fragment1 + @sql_fragment2 + @sql_fragment3);

If the string is short, you can use sp_executesql as normal.

NOTE
Statements executed as dynamic SQL are still subject to all T-SQL validation rules.

Next steps
For more development tips, see development overview.
Group by options in SQL Data Warehouse
7/24/2019 • 3 minutes to read • Edit Online

Tips for implementing group by options in Azure SQL Data Warehouse for developing solutions.

What does GROUP BY do?


The GROUP BY T-SQL clause aggregates data to a summary set of rows. GROUP BY has some options that SQL
Data Warehouse does not support. These options have workarounds.
These options are
GROUP BY with ROLLUP
GROUPING SETS
GROUP BY with CUBE

Rollup and grouping sets options


The simplest option here is to use UNION ALL to perform the rollup rather than relying on the explicit
syntax. The result is exactly the same.
The following example uses the GROUP BY statement with the ROLLUP option:

SELECT [SalesTerritoryCountry]
, [SalesTerritoryRegion]
, SUM(SalesAmount) AS TotalSalesAmount
FROM dbo.factInternetSales s
JOIN dbo.DimSalesTerritory t ON s.SalesTerritoryKey = t.SalesTerritoryKey
GROUP BY ROLLUP (
[SalesTerritoryCountry]
, [SalesTerritoryRegion]
)
;

By using ROLLUP, the preceding example requests the following aggregations:


Country and Region
Country
Grand Total
To replace ROLLUP and return the same results, you can use UNION ALL and explicitly specify the required
aggregations:
SELECT [SalesTerritoryCountry]
, [SalesTerritoryRegion]
, SUM(SalesAmount) AS TotalSalesAmount
FROM dbo.factInternetSales s
JOIN dbo.DimSalesTerritory t ON s.SalesTerritoryKey = t.SalesTerritoryKey
GROUP BY
[SalesTerritoryCountry]
, [SalesTerritoryRegion]
UNION ALL
SELECT [SalesTerritoryCountry]
, NULL
, SUM(SalesAmount) AS TotalSalesAmount
FROM dbo.factInternetSales s
JOIN dbo.DimSalesTerritory t ON s.SalesTerritoryKey = t.SalesTerritoryKey
GROUP BY
[SalesTerritoryCountry]
UNION ALL
SELECT NULL
, NULL
, SUM(SalesAmount) AS TotalSalesAmount
FROM dbo.factInternetSales s
JOIN dbo.DimSalesTerritory t ON s.SalesTerritoryKey = t.SalesTerritoryKey;

To replace GROUPING SETS, the same principle applies. You only need to create UNION ALL sections for the
aggregation levels you want to see.

Cube options
It is possible to create a GROUP BY WITH CUBE using the UNION ALL approach. The problem is that the code
can quickly become cumbersome and unwieldy. To mitigate this, you can use this more advanced approach.
Let's use the example above.
The first step is to define the 'cube' that defines all the levels of aggregation that we want to create. It is important
to take note of the CROSS JOIN of the two derived tables. This generates all the levels for us. The rest of the code
is really there for formatting.
CREATE TABLE #Cube
WITH
( DISTRIBUTION = ROUND_ROBIN
, LOCATION = USER_DB
)
AS
WITH GrpCube AS
(SELECT CAST(ISNULL(Country,'NULL')+','+ISNULL(Region,'NULL') AS NVARCHAR(50)) as 'Cols'
, CAST(ISNULL(Country+',','')+ISNULL(Region,'') AS NVARCHAR(50)) as 'GroupBy'
, ROW_NUMBER() OVER (ORDER BY Country) as 'Seq'
FROM ( SELECT 'SalesTerritoryCountry' as Country
UNION ALL
SELECT NULL
) c
CROSS JOIN ( SELECT 'SalesTerritoryRegion' as Region
UNION ALL
SELECT NULL
) r
)
SELECT Cols
, CASE WHEN SUBSTRING(GroupBy,LEN(GroupBy),1) = ','
THEN SUBSTRING(GroupBy,1,LEN(GroupBy)-1)
ELSE GroupBy
END AS GroupBy --Remove Trailing Comma
,Seq
FROM GrpCube;

This CTAS produces one row per aggregation level, containing the column list, the corresponding GROUP BY clause, and a sequence number.

The second step is to specify a target table to store interim results:

DECLARE
@SQL NVARCHAR(4000)
,@Columns NVARCHAR(4000)
,@GroupBy NVARCHAR(4000)
,@i INT = 1
,@nbr INT = 0
;
CREATE TABLE #Results
(
[SalesTerritoryCountry] NVARCHAR(50)
,[SalesTerritoryRegion] NVARCHAR(50)
,[TotalSalesAmount] MONEY
)
WITH
( DISTRIBUTION = ROUND_ROBIN
, LOCATION = USER_DB
)
;

The third step is to loop over our cube of columns, performing the aggregation. The query runs once for every
row in the #Cube temporary table and stores the results in the #Results temp table.
SET @nbr =(SELECT MAX(Seq) FROM #Cube);

WHILE @i<=@nbr
BEGIN
SET @Columns = (SELECT Cols FROM #Cube where seq = @i);
SET @GroupBy = (SELECT GroupBy FROM #Cube where seq = @i);

SET @SQL ='INSERT INTO #Results


SELECT '+@Columns+'
, SUM(SalesAmount) AS TotalSalesAmount
FROM dbo.factInternetSales s
JOIN dbo.DimSalesTerritory t
ON s.SalesTerritoryKey = t.SalesTerritoryKey
'+CASE WHEN @GroupBy <>''
THEN 'GROUP BY '+@GroupBy ELSE '' END

EXEC sp_executesql @SQL;


SET @i +=1;
END

Lastly, you can return the results by simply reading from the #Results temporary table

SELECT *
FROM #Results
ORDER BY 1,2,3
;

By breaking the code up into sections and generating a looping construct, the code becomes more manageable
and maintainable.

Next steps
For more development tips, see development overview.
Using labels to instrument queries in Azure SQL Data
Warehouse
7/24/2019 • 2 minutes to read • Edit Online

Tips for using labels to instrument queries in Azure SQL Data Warehouse for developing solutions.

What are labels?


SQL Data Warehouse supports a concept called query labels. Before going into any depth, let's look at an
example:

SELECT *
FROM sys.tables
OPTION (LABEL = 'My Query Label')
;

The last line tags the string 'My Query Label' to the query. This tag is particularly helpful because the label is
queryable through the DMVs. Querying for labels provides a mechanism for locating problem queries and helps to
identify progress through an ELT run.
A good naming convention really helps. For example, starting the label with PROJECT, PROCEDURE,
STATEMENT, or COMMENT helps to uniquely identify the query among all the code in source control.
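For example, a label that follows such a convention might look like this (the project and procedure names are illustrative):

SELECT  COUNT(*)
FROM    dbo.FactInternetSales
OPTION (LABEL = 'PROJECT:SalesMart PROCEDURE:usp_LoadFactInternetSales STATEMENT:Row count check')
;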
The following query uses a dynamic management view to search by label.

SELECT *
FROM sys.dm_pdw_exec_requests r
WHERE r.[label] = 'My Query Label'
;

NOTE
It is essential to put square brackets or double quotes around the word label when querying. Label is a reserved word and
causes an error when it is not delimited.

Next steps
For more development tips, see development overview.
Workload management with resource classes in
Azure SQL Data Warehouse
10/8/2019 • 16 minutes to read • Edit Online

Guidance for using resource classes to manage memory and concurrency for queries in your Azure SQL Data
Warehouse.

What are resource classes


The performance capacity of a query is determined by the user's resource class. Resource classes are pre-
determined resource limits in Azure SQL Data Warehouse that govern compute resources and concurrency for
query execution. Resource classes can help you manage your workload by setting limits on the number of
queries that run concurrently and on the compute-resources assigned to each query. There's a trade-off
between memory and concurrency.
Smaller resource classes reduce the maximum memory per query, but increase concurrency.
Larger resource classes increase the maximum memory per query, but reduce concurrency.
There are two types of resource classes:
Static resource classes, which are well suited for increased concurrency on a data set size that is fixed.
Dynamic resource classes, which are well suited for data sets that are growing in size and need increased
performance as the service level is scaled up.
Resource classes use concurrency slots to measure resource consumption. Concurrency slots are explained
later in this article.
To view the resource utilization for the resource classes, see Memory and concurrency limits.
To adjust the resource class, you can run the query under a different user or change the current user's
resource class membership.
Static resource classes
Static resource classes allocate the same amount of memory regardless of the current performance level,
which is measured in data warehouse units. Since queries get the same memory allocation regardless of the
performance level, scaling out the data warehouse allows more queries to run within a resource class. Static
resource classes are ideal if the data volume is known and constant.
The static resource classes are implemented with these pre-defined database roles:
staticrc10
staticrc20
staticrc30
staticrc40
staticrc50
staticrc60
staticrc70
staticrc80
Dynamic resource classes
Dynamic Resource Classes allocate a variable amount of memory depending on the current service level.
While static resource classes are beneficial for higher concurrency and static data volumes, dynamic resource
classes are better suited for a growing or variable amount of data. When you scale up to a larger service level,
your queries automatically get more memory.
The dynamic resource classes are implemented with these pre-defined database roles:
smallrc
mediumrc
largerc
xlargerc
The memory allocation for each resource class is as follows, regardless of service level. The minimum number of
concurrent queries is also listed. For some service levels, more than the minimum concurrency can be
achieved.

RESOURCE CLASS PERCENTAGE MEMORY MIN CONCURRENT QUERIES

smallrc 3% 32

mediumrc 10% 10

largerc 22% 4

xlargerc 70% 1

Default resource class


By default, each user is a member of the dynamic resource class smallrc.
The resource class of the service administrator is fixed at smallrc and cannot be changed. The service
administrator is the user created during the provisioning process. The service administrator in this context is
the login specified for the "Server admin login" when creating a new SQL Data Warehouse instance with a new
server.

NOTE
Users or groups defined as Active Directory admin are also service administrators.

Resource class operations


Resource classes are designed to improve performance for data management and manipulation activities.
Complex queries can also benefit from running under a large resource class. For example, query performance
for large joins and sorts can improve when the resource class is large enough to enable the query to execute in
memory.
Operations governed by resource classes
These operations are governed by resource classes:
INSERT-SELECT, UPDATE, DELETE
SELECT (when querying user tables)
ALTER INDEX - REBUILD or REORGANIZE
ALTER TABLE REBUILD
CREATE INDEX
CREATE CLUSTERED COLUMNSTORE INDEX
CREATE TABLE AS SELECT (CTAS )
Data loading
Data movement operations conducted by the Data Movement Service (DMS )

NOTE
SELECT statements on dynamic management views (DMVs) or other system views are not governed by any of the
concurrency limits. You can monitor the system regardless of the number of queries executing on it.

Operations not governed by resource classes


Some queries always run in the smallrc resource class even though the user is a member of a larger resource
class. These exempt queries do not count towards the concurrency limit. For example, if the concurrency limit is
16, many users can be selecting from system views without impacting the available concurrency slots.
The following statements are exempt from resource classes and always run in smallrc:
CREATE or DROP TABLE
ALTER TABLE ... SWITCH, SPLIT, or MERGE PARTITION
ALTER INDEX DISABLE
DROP INDEX
CREATE, UPDATE, or DROP STATISTICS
TRUNCATE TABLE
ALTER AUTHORIZATION
CREATE LOGIN
CREATE, ALTER, or DROP USER
CREATE, ALTER, or DROP PROCEDURE
CREATE or DROP VIEW
INSERT VALUES
SELECT from system views and DMVs
EXPLAIN
DBCC

Concurrency slots
Concurrency slots are a convenient way to track the resources available for query execution. They are like
tickets that you purchase to reserve seats at a concert because seating is limited. The total number of
concurrency slots per data warehouse is determined by the service level. Before a query can start executing, it
must be able to reserve enough concurrency slots. When a query finishes, it releases its concurrency slots.
A query running with 10 concurrency slots can access 5 times more compute resources than a query
running with 2 concurrency slots.
If each query requires 10 concurrency slots and there are 40 concurrency slots, then only 4 queries can run
concurrently.
Only resource governed queries consume concurrency slots. System queries and some trivial queries don't
consume any slots. The exact number of concurrency slots consumed is determined by the query's resource
class.

View the resource classes


Resource classes are implemented as pre-defined database roles. There are two types of resource classes:
dynamic and static. To view a list of the resource classes, use the following query:
SELECT name
FROM sys.database_principals
WHERE name LIKE '%rc%' AND type_desc = 'DATABASE_ROLE';

Change a user's resource class


Resource classes are implemented by assigning users to database roles. When a user runs a query, the query
runs with the user's resource class. For example, if a user is a member of the staticrc10 database role, their
queries run with small amounts of memory. If a database user is a member of the xlargerc or staticrc80
database roles, their queries run with large amounts of memory.
To increase a user's resource class, use sp_addrolemember to add the user to a database role of a large
resource class. The below code adds a user to the largerc database role. Each request gets 22% of the system
memory.

EXEC sp_addrolemember 'largerc', 'loaduser';

To decrease the resource class, use sp_droprolemember. If 'loaduser' is not a member of any other resource
classes, it falls into the default smallrc resource class with a 3% memory grant.

EXEC sp_droprolemember 'largerc', 'loaduser';

Resource class precedence


Users can be members of multiple resource classes. When a user belongs to more than one resource class:
Dynamic resource classes take precedence over static resource classes. For example, if a user is a member
of both mediumrc(dynamic) and staticrc80 (static), queries run with mediumrc.
Larger resource classes take precedence over smaller resource classes. For example, if a user is a member
of mediumrc and largerc, queries run with largerc. Likewise, if a user is a member of both staticrc20 and
staticrc80, queries run with staticrc80 resource allocations.
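To see which resource class roles a given user belongs to, a query along these lines can help (the user name is illustrative):

SELECT  r.name AS resource_class
     ,  m.name AS member_name
FROM    sys.database_role_members  AS rm
JOIN    sys.database_principals    AS r ON rm.role_principal_id   = r.principal_id
JOIN    sys.database_principals    AS m ON rm.member_principal_id = m.principal_id
WHERE   m.name = 'loaduser'
  AND   r.name LIKE '%rc%'
;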

Recommendations
We recommend creating a user that is dedicated to running a specific type of query or load operation. Give
that user a permanent resource class instead of changing the resource class on a frequent basis. Static resource
classes afford greater overall control on the workload, so we suggest using static resource classes before
considering dynamic resource classes.
Resource classes for load users
CREATE TABLE uses clustered columnstore indexes by default. Compressing data into a columnstore index is a
memory-intensive operation, and memory pressure can reduce the index quality. Memory pressure can lead to
needing a higher resource class when loading data. To ensure loads have enough memory, you can create a
user that is designated for running loads and assign that user to a higher resource class.
The memory needed to process loads efficiently depends on the nature of the table loaded and the data size.
For more information on memory requirements, see Maximizing rowgroup quality.
Once you have determined the memory requirement, choose whether to assign the load user to a static or
dynamic resource class.
Use a static resource class when table memory requirements fall within a specific range. Loads run with
appropriate memory. When you scale the data warehouse, the loads do not need more memory. By using a
static resource class, the memory allocations stay constant. This consistency conserves memory and allows
more queries to run concurrently. We recommend that new solutions use the static resource classes first as
these provide greater control.
Use a dynamic resource class when table memory requirements vary widely. Loads might require more
memory than the current DWU or cDWU level provides. Scaling the data warehouse adds more memory to
load operations, which allows loads to perform faster.
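For example (the user name is illustrative, and the choice of static resource class depends on the memory requirement you determined for your tables):

EXEC sp_addrolemember 'staticrc60', 'LoadUserNightly';

Because a static resource class allocates the same memory at every service level, scaling the warehouse out increases load concurrency rather than the per-load memory grant.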
Resource classes for queries
Some queries are compute-intensive and some aren't.
Choose a dynamic resource class when queries are complex, but don't need high concurrency. For example,
generating daily or weekly reports is an occasional need for resources. If the reports are processing large
amounts of data, scaling the data warehouse provides more memory to the user's existing resource class.
Choose a static resource class when resource expectations vary throughout the day. For example, a static
resource class works well when the data warehouse is queried by many people. When scaling the data
warehouse, the amount of memory allocated to the user doesn't change. Consequently, more queries can
be executed in parallel on the system.
Proper memory grants depend on many factors, such as the amount of data queried, the nature of the table
schemas, and various joins, select, and group predicates. In general, allocating more memory allows queries to
complete faster, but reduces the overall concurrency. If concurrency is not an issue, over-allocating memory
does not harm throughput.
To tune performance, use different resource classes. The next section gives a stored procedure that helps you
figure out the best resource class.

Example code for finding the best resource class


You can use the following stored procedure to figure out the concurrency and memory grant per resource class at
a given SLO, and the best resource class for memory-intensive CCI operations on a non-partitioned CCI table at a
given resource class:
Here's the purpose of this stored procedure:
1. To see the concurrency and memory grant per resource class at a given SLO. User needs to provide NULL
for both schema and tablename as shown in this example.
2. To see the best resource class for the memory-intensive CCI operations (load, copy table, rebuild index, etc.)
on non-partitioned CCI table at a given resource class. The stored proc uses table schema to find out the
required memory grant.
Dependencies & Restrictions
This stored procedure isn't designed to calculate the memory requirement for a partitioned cci table.
This stored procedure doesn't take memory requirements into account for the SELECT part of
CTAS/INSERT-SELECT and assumes it's a SELECT.
This stored procedure uses a temp table, which is available in the session where this stored procedure was
created.
This stored procedure depends on the current offerings (for example, hardware configuration, DMS config),
and if any of that changes then this stored proc won't work correctly.
This stored procedure depends on existing concurrency limit offerings and if these change then this stored
procedure won't work correctly.
This stored procedure depends on existing resource class offerings and if these change then this stored
procedure won't work correctly.
NOTE
If you are not getting output after executing stored procedure with parameters provided, then there could be two cases.
1. Either DW Parameter contains an invalid SLO value
2. Or, there is no matching resource class for the CCI operation on the table.
For example, at DW100c the highest memory grant available is 1 GB, and the table schema might be wide enough to exceed that 1-GB requirement.

Usage example
Syntax:
EXEC dbo.prc_workload_management_by_DWU @DWU VARCHAR(7), @SCHEMA_NAME VARCHAR(128), @TABLE_NAME VARCHAR(128)

1. @DWU: Either provide a NULL parameter to extract the current DWU from the DW DB or provide any
supported DWU in the form of 'DW100c'
2. @SCHEMA_NAME: Provide a schema name of the table
3. @TABLE_NAME: Provide the name of the table of interest
Examples executing this stored proc:

EXEC dbo.prc_workload_management_by_DWU 'DW2000c', 'dbo', 'Table1';


EXEC dbo.prc_workload_management_by_DWU NULL, 'dbo', 'Table1';
EXEC dbo.prc_workload_management_by_DWU 'DW6000c', NULL, NULL;
EXEC dbo.prc_workload_management_by_DWU NULL, NULL, NULL;

The following statement creates Table1 that is used in the preceding examples.
CREATE TABLE Table1 (a int, b varchar(50), c decimal (18,10), d char(10), e varbinary(15), f float, g
datetime, h date);

Stored procedure definition

-------------------------------------------------------------------------------
-- Dropping prc_workload_management_by_DWU procedure if it exists.
-------------------------------------------------------------------------------
IF EXISTS (SELECT * FROM sys.objects WHERE type = 'P' AND name = 'prc_workload_management_by_DWU')
DROP PROCEDURE dbo.prc_workload_management_by_DWU
GO

-------------------------------------------------------------------------------
-- Creating prc_workload_management_by_DWU.
-------------------------------------------------------------------------------
CREATE PROCEDURE dbo.prc_workload_management_by_DWU
(@DWU VARCHAR(7),
@SCHEMA_NAME VARCHAR(128),
@TABLE_NAME VARCHAR(128)
)
AS

IF @DWU IS NULL
BEGIN
-- Selecting proper DWU for the current DB if not specified.

SELECT @DWU = 'DW'+ CAST(CASE WHEN Mem> 4 THEN Nodes*500


ELSE Mem*100
END AS VARCHAR(10)) +'c'
FROM (
SELECT Nodes=count(distinct n.pdw_node_id), Mem=max(i.committed_target_kb/1000/1000/60)
FROM sys.dm_pdw_nodes n
CROSS APPLY sys.dm_pdw_nodes_os_sys_info i
WHERE type = 'COMPUTE')A
END

-- Dropping temp table if exists.


IF OBJECT_ID('tempdb..#ref') IS NOT NULL
BEGIN
DROP TABLE #ref;
END;

-- Creating ref. temp table (CTAS) to hold mapping info.


CREATE TABLE #ref
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
WITH
-- Creating concurrency slots mapping for various DWUs.
alloc
AS
(
SELECT 'DW100c' AS DWU,4 AS max_queries,4 AS max_slots,1 AS slots_used_smallrc,1 AS slots_used_mediumrc,2 A
S slots_used_largerc,4 AS slots_used_xlargerc,1 AS slots_used_staticrc10,2 AS slots_used_staticrc20,4 AS sl
ots_used_staticrc30,4 AS slots_used_staticrc40,4 AS slots_used_staticrc50,4 AS slots_used_staticrc60,4 AS s
lots_used_staticrc70,4 AS slots_used_staticrc80
UNION ALL
SELECT 'DW200c', 8, 8, 1, 2, 4, 8, 1, 2, 4, 8, 8, 8, 8, 8
UNION ALL
SELECT 'DW300c', 12, 12, 1, 2, 4, 8, 1, 2, 4, 8, 8, 8, 8, 8
UNION ALL
SELECT 'DW400c', 16, 16, 1, 4, 8, 16, 1, 2, 4, 8, 16, 16, 16, 16
UNION ALL
SELECT 'DW500c', 20, 20, 1, 4, 8, 16, 1, 2, 4, 8, 16, 16, 16, 16
UNION ALL
SELECT 'DW1000c', 32, 40, 1, 4, 8, 28, 1, 2, 4, 8, 16, 32, 32, 32
UNION ALL
SELECT 'DW1500c', 32, 60, 1, 6, 13, 42, 1, 2, 4, 8, 16, 32, 32, 32
UNION ALL
SELECT 'DW2000c', 48, 80, 2, 8, 17, 56, 1, 2, 4, 8, 16, 32, 64, 64
UNION ALL
SELECT 'DW2500c', 48, 100, 3, 10, 22, 70, 1, 2, 4, 8, 16, 32, 64, 64
UNION ALL
SELECT 'DW3000c', 64, 120, 3, 12, 26, 84, 1, 2, 4, 8, 16, 32, 64, 64
UNION ALL
SELECT 'DW5000c', 64, 200, 6, 20, 44, 140, 1, 2, 4, 8, 16, 32, 64, 128
UNION ALL
SELECT 'DW6000c', 128, 240, 7, 24, 52, 168, 1, 2, 4, 8, 16, 32, 64, 128
UNION ALL
SELECT 'DW7500c', 128, 300, 9, 30, 66, 210, 1, 2, 4, 8, 16, 32, 64, 128
UNION ALL
SELECT 'DW10000c', 128, 400, 12, 40, 88, 280, 1, 2, 4, 8, 16, 32, 64, 128
UNION ALL
SELECT 'DW15000c', 128, 600, 18, 60, 132, 420, 1, 2, 4, 8, 16, 32, 64, 128
UNION ALL
SELECT 'DW30000c', 128, 1200, 36, 120, 264, 840, 1, 2, 4, 8, 16, 32, 64, 128
)
-- Creating workload mapping to their corresponding slot consumption and default memory grant.
,map
AS
(
SELECT CONVERT(varchar(20), 'SloDWGroupSmall') AS wg_name, slots_used_smallrc AS slots_used FROM alloc
WHERE DWU = @DWU
UNION ALL
SELECT CONVERT(varchar(20), 'SloDWGroupMedium') AS wg_name, slots_used_mediumrc AS slots_used FROM alloc
WHERE DWU = @DWU
UNION ALL
SELECT CONVERT(varchar(20), 'SloDWGroupLarge') AS wg_name, slots_used_largerc AS slots_used FROM alloc
WHERE DWU = @DWU
UNION ALL
SELECT CONVERT(varchar(20), 'SloDWGroupXLarge') AS wg_name, slots_used_xlargerc AS slots_used FROM alloc
WHERE DWU = @DWU
UNION ALL
SELECT 'SloDWGroupC00',1
UNION ALL
SELECT 'SloDWGroupC01',2
UNION ALL
SELECT 'SloDWGroupC02',4
UNION ALL
SELECT 'SloDWGroupC03',8
UNION ALL
SELECT 'SloDWGroupC04',16
UNION ALL
SELECT 'SloDWGroupC05',32
UNION ALL
SELECT 'SloDWGroupC06',64
UNION ALL
SELECT 'SloDWGroupC07',128
)

-- Creating ref based on current / asked DWU.


, ref
AS
(
SELECT a1.*
, m1.wg_name AS wg_name_smallrc
, m1.slots_used * 250 AS tgt_mem_grant_MB_smallrc
, m2.wg_name AS wg_name_mediumrc
, m2.slots_used * 250 AS tgt_mem_grant_MB_mediumrc
, m3.wg_name AS wg_name_largerc
, m3.slots_used * 250 AS tgt_mem_grant_MB_largerc
, m4.wg_name AS wg_name_xlargerc
, m4.slots_used * 250 AS tgt_mem_grant_MB_xlargerc
, m5.wg_name AS wg_name_staticrc10
, m5.slots_used * 250 AS tgt_mem_grant_MB_staticrc10
, m6.wg_name AS wg_name_staticrc20
, m6.slots_used * 250 AS tgt_mem_grant_MB_staticrc20
, m7.wg_name AS wg_name_staticrc30
, m7.slots_used * 250 AS tgt_mem_grant_MB_staticrc30
, m8.wg_name AS wg_name_staticrc40
, m8.slots_used * 250 AS tgt_mem_grant_MB_staticrc40
, m9.wg_name AS wg_name_staticrc50
, m9.slots_used * 250 AS tgt_mem_grant_MB_staticrc50
, m10.wg_name AS wg_name_staticrc60
, m10.slots_used * 250 AS tgt_mem_grant_MB_staticrc60
, m11.wg_name AS wg_name_staticrc70
, m11.slots_used * 250 AS tgt_mem_grant_MB_staticrc70
, m12.wg_name AS wg_name_staticrc80
, m12.slots_used * 250 AS tgt_mem_grant_MB_staticrc80
FROM alloc a1
JOIN map m1 ON a1.slots_used_smallrc = m1.slots_used and m1.wg_name = 'SloDWGroupSmall'
JOIN map m2 ON a1.slots_used_mediumrc = m2.slots_used and m2.wg_name = 'SloDWGroupMedium'
JOIN map m3 ON a1.slots_used_largerc = m3.slots_used and m3.wg_name = 'SloDWGroupLarge'
JOIN map m4 ON a1.slots_used_xlargerc = m4.slots_used and m4.wg_name = 'SloDWGroupXLarge'
JOIN map m5 ON a1.slots_used_staticrc10 = m5.slots_used and m5.wg_name NOT IN
('SloDWGroupSmall','SloDWGroupMedium','SloDWGroupLarge','SloDWGroupXLarge')
JOIN map m6 ON a1.slots_used_staticrc20 = m6.slots_used and m6.wg_name NOT IN
('SloDWGroupSmall','SloDWGroupMedium','SloDWGroupLarge','SloDWGroupXLarge')
JOIN map m7 ON a1.slots_used_staticrc30 = m7.slots_used and m7.wg_name NOT IN
('SloDWGroupSmall','SloDWGroupMedium','SloDWGroupLarge','SloDWGroupXLarge')
JOIN map m8 ON a1.slots_used_staticrc40 = m8.slots_used and m8.wg_name NOT IN
('SloDWGroupSmall','SloDWGroupMedium','SloDWGroupLarge','SloDWGroupXLarge')
JOIN map m9 ON a1.slots_used_staticrc50 = m9.slots_used and m9.wg_name NOT IN
('SloDWGroupSmall','SloDWGroupMedium','SloDWGroupLarge','SloDWGroupXLarge')
JOIN map m10 ON a1.slots_used_staticrc60 = m10.slots_used and m10.wg_name NOT IN
('SloDWGroupSmall','SloDWGroupMedium','SloDWGroupLarge','SloDWGroupXLarge')
JOIN map m11 ON a1.slots_used_staticrc70 = m11.slots_used and m11.wg_name NOT IN
('SloDWGroupSmall','SloDWGroupMedium','SloDWGroupLarge','SloDWGroupXLarge')
JOIN map m12 ON a1.slots_used_staticrc80 = m12.slots_used and m12.wg_name NOT IN
('SloDWGroupSmall','SloDWGroupMedium','SloDWGroupLarge','SloDWGroupXLarge')
WHERE a1.DWU = @DWU
)
SELECT DWU
, max_queries
, max_slots
, slots_used
, wg_name
, tgt_mem_grant_MB
, up1 as rc
, (ROW_NUMBER() OVER(PARTITION BY DWU ORDER BY DWU)) as rc_id
FROM
(
SELECT DWU
, max_queries
, max_slots
, slots_used
, wg_name
, tgt_mem_grant_MB
, REVERSE(SUBSTRING(REVERSE(wg_names),1,CHARINDEX('_',REVERSE(wg_names),1)-1)) as up1
, REVERSE(SUBSTRING(REVERSE(tgt_mem_grant_MBs),1,CHARINDEX('_',REVERSE(tgt_mem_grant_MBs),1)-1))
as up2
, REVERSE(SUBSTRING(REVERSE(slots_used_all),1,CHARINDEX('_',REVERSE(slots_used_all),1)-1)) as up3
FROM ref AS r1
UNPIVOT
(
wg_name FOR wg_names IN (wg_name_smallrc,wg_name_mediumrc,wg_name_largerc,wg_name_xlargerc,
wg_name_staticrc10, wg_name_staticrc20, wg_name_staticrc30, wg_name_staticrc40, wg_name_staticrc50,
wg_name_staticrc60, wg_name_staticrc70, wg_name_staticrc80)
) AS r2
UNPIVOT
(
tgt_mem_grant_MB FOR tgt_mem_grant_MBs IN (tgt_mem_grant_MB_smallrc,tgt_mem_grant_MB_mediumrc,
tgt_mem_grant_MB_largerc,tgt_mem_grant_MB_xlargerc, tgt_mem_grant_MB_staticrc10,
tgt_mem_grant_MB_staticrc20,
tgt_mem_grant_MB_staticrc30, tgt_mem_grant_MB_staticrc40, tgt_mem_grant_MB_staticrc50,
tgt_mem_grant_MB_staticrc60, tgt_mem_grant_MB_staticrc70, tgt_mem_grant_MB_staticrc80)
) AS r3
UNPIVOT
(
slots_used FOR slots_used_all IN (slots_used_smallrc,slots_used_mediumrc,slots_used_largerc,
slots_used_xlargerc, slots_used_staticrc10, slots_used_staticrc20, slots_used_staticrc30,
slots_used_staticrc40, slots_used_staticrc50, slots_used_staticrc60, slots_used_staticrc70,
slots_used_staticrc80)
) AS r4
) a
WHERE up1 = up2
AND up1 = up3
;

-- Getting current info about workload groups.


WITH
dmv
AS
(
SELECT
rp.name AS rp_name
, rp.max_memory_kb*1.0/1048576 AS rp_max_mem_GB
, (rp.max_memory_kb*1.0/1024)
*(request_max_memory_grant_percent/100) AS max_memory_grant_MB
, (rp.max_memory_kb*1.0/1048576)
*(request_max_memory_grant_percent/100) AS max_memory_grant_GB
, wg.name AS wg_name
, wg.importance AS importance
, wg.request_max_memory_grant_percent AS request_max_memory_grant_percent
FROM sys.dm_pdw_nodes_resource_governor_workload_groups wg
JOIN sys.dm_pdw_nodes_resource_governor_resource_pools rp ON wg.pdw_node_id = rp.pdw_node_id
AND wg.pool_id = rp.pool_id
WHERE rp.name = 'SloDWPool'
GROUP BY
rp.name
, rp.max_memory_kb
, wg.name
, wg.importance
, wg.request_max_memory_grant_percent
)
-- Creating resource class name mapping.
,names
AS
(
SELECT 'smallrc' as resource_class, 1 as rc_id
UNION ALL
SELECT 'mediumrc', 2
UNION ALL
SELECT 'largerc', 3
UNION ALL
SELECT 'xlargerc', 4
UNION ALL
SELECT 'staticrc10', 5
UNION ALL
SELECT 'staticrc20', 6
UNION ALL
SELECT 'staticrc30', 7
UNION ALL
SELECT 'staticrc40', 8
UNION ALL
SELECT 'staticrc50', 9
UNION ALL
SELECT 'staticrc60', 10
UNION ALL
SELECT 'staticrc70', 11
UNION ALL
SELECT 'staticrc80', 12
)
,base AS
( SELECT schema_name
, table_name
, SUM(column_count) AS column_count
, ISNULL(SUM(short_string_column_count),0) AS short_string_column_count
, ISNULL(SUM(long_string_column_count),0) AS long_string_column_count
FROM ( SELECT sm.name AS schema_name
, tb.name AS table_name
, COUNT(co.column_id) AS column_count
, CASE WHEN co.system_type_id IN
(36,43,106,108,165,167,173,175,231,239)
AND co.max_length <= 32
THEN COUNT(co.column_id)
END AS short_string_column_count
, CASE WHEN co.system_type_id IN (165,167,173,175,231,239)
AND co.max_length > 32 and co.max_length <=8000
THEN COUNT(co.column_id)
END AS long_string_column_count
FROM sys.schemas AS sm
JOIN sys.tables AS tb on sm.[schema_id] = tb.[schema_id]
JOIN sys.columns AS co ON tb.[object_id] = co.[object_id]
WHERE tb.name = @TABLE_NAME AND sm.name = @SCHEMA_NAME
GROUP BY sm.name
, tb.name
, co.system_type_id
, co.max_length ) a
GROUP BY schema_name
, table_name
)
, size AS
(
SELECT schema_name
, table_name
, 75497472 AS table_overhead

, column_count*1048576*8 AS column_size
, short_string_column_count*1048576*32 AS short_string_size
, (long_string_column_count*16777216) AS long_string_size
FROM base
UNION
SELECT CASE WHEN COUNT(*) = 0 THEN 'EMPTY' END as schema_name
,CASE WHEN COUNT(*) = 0 THEN 'EMPTY' END as table_name
,CASE WHEN COUNT(*) = 0 THEN 0 END as table_overhead
,CASE WHEN COUNT(*) = 0 THEN 0 END as column_size
,CASE WHEN COUNT(*) = 0 THEN 0 END as short_string_size

,CASE WHEN COUNT(*) = 0 THEN 0 END as long_string_size


FROM base
)
, load_multiplier as
(
SELECT CASE
WHEN FLOOR(8 * (CAST (CAST(REPLACE(REPLACE(@DWU,'DW',''),'c','') AS INT) AS FLOAT)/6000)) > 0
AND CHARINDEX(@DWU,'c')=0
THEN FLOOR(8 * (CAST (CAST(REPLACE(REPLACE(@DWU,'DW',''),'c','') AS INT) AS FLOAT)/6000))
ELSE 1
END AS multiplication_factor
)
SELECT r1.DWU
, schema_name
, table_name
, rc.resource_class as closest_rc_in_increasing_order
, max_queries_at_this_rc = CASE
WHEN (r1.max_slots / r1.slots_used > r1.max_queries)
THEN r1.max_queries
ELSE r1.max_slots / r1.slots_used
END
, r1.max_slots as max_concurrency_slots
, r1.slots_used as required_slots_for_the_rc
, r1.tgt_mem_grant_MB as rc_mem_grant_MB
,
CAST((table_overhead*1.0+column_size+short_string_size+long_string_size)*multiplication_factor/1048576
AS DECIMAL(18,2)) AS est_mem_grant_required_for_cci_operation_MB
FROM size
, load_multiplier
, #ref r1, names rc
WHERE r1.rc_id=rc.rc_id
AND
CAST((table_overhead*1.0+column_size+short_string_size+long_string_size)*multiplication_factor/1048576
AS DECIMAL(18,2)) < r1.tgt_mem_grant_MB
ORDER BY
ABS(CAST((table_overhead*1.0+column_size+short_string_size+long_string_size)*multiplication_factor/1048576
AS DECIMAL(18,2)) - r1.tgt_mem_grant_MB)
GO

Next step
For more information about managing database users and security, see Secure a database in SQL Data
Warehouse. For more information about how larger resource classes can improve clustered columnstore index
quality, see Memory optimizations for columnstore compression.
Azure SQL Data Warehouse workload classification
7/5/2019 • 3 minutes to read • Edit Online

This article explains the SQL Data Warehouse workload classification process of assigning a resource class and
importance to incoming requests.

Classification
Workload management classification allows workload policies to be applied to requests through assigning
resource classes and importance.
While there are many ways to classify data warehousing workloads, the simplest and most common classification
is load and query. You load data with insert, update, and delete statements. You query the data using selects. A
data warehousing solution will often have a workload policy for load activity, such as assigning a higher resource
class with more resources. A different workload policy could apply to queries, such as lower importance compared
to load activities.
You can also subclassify your load and query workloads. Subclassification gives you more control of your
workloads. For example, query workloads can consist of cube refreshes, dashboard queries or ad-hoc queries. You
can classify each of these query workloads with different resource classes or importance settings. Load can also
benefit from subclassification. Large transformations can be assigned to larger resource classes. Higher
importance can be used to ensure key sales data is loaded before weather data or a social data feed.
Not all statements are classified, because some do not require resources or need importance to influence execution.
DBCC commands, BEGIN, COMMIT, and ROLLBACK TRANSACTION statements are not classified.

Classification process
Classification in SQL Data Warehouse is achieved today by assigning users to a role that has a corresponding
resource class assigned to it using sp_addrolemember. The ability to characterize requests beyond a login to a
resource class is limited with this capability. A richer method for classification is now available with the CREATE
WORKLOAD CLASSIFIER syntax. With this syntax, SQL Data Warehouse users can assign importance and a
resource class to requests.
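For example, a minimal sketch of the newer syntax (the classifier name and login are illustrative):

CREATE WORKLOAD CLASSIFIER [wgcELTLoads]
WITH ( WORKLOAD_GROUP = 'staticrc60'     -- resource class to assign
     , MEMBERNAME     = 'ELTLogin'       -- login or user to classify
     , IMPORTANCE     = above_normal )
;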

NOTE
Classification is evaluated on a per request basis. Multiple requests in a single session can be classified differently.

Classification precedence
As part of the classification process, precedence is in place to determine which resource class is assigned.
Classification based on a database user takes precedence over role membership. For example, suppose you create a
classifier that maps the UserA database user to the mediumrc resource class, and another that maps the RoleA
database role (of which UserA is a member) to the largerc resource class. The classifier that maps the database
user to the mediumrc resource class takes precedence over the classifier that maps the RoleA database role to the
largerc resource class.
If a user is a member of multiple roles with different resource classes assigned or matched in multiple classifiers,
the user is given the highest resource class assignment. This behavior is consistent with existing resource class
assignment behavior.

System classifiers
Workload classification has system workload classifiers. The system classifiers map existing resource class role
memberships to resource class resource allocations with normal importance. System classifiers can't be dropped.
To view system classifiers, you can run the below query:

SELECT * FROM sys.workload_management_workload_classifiers where classifier_id <= 12

Mixing resource class assignments with classifiers


System classifiers created on your behalf provide an easy path to migrate to workload classification. Using
resource class role mappings with classification precedence, can lead to misclassification as you start to create new
classifiers with importance.
Consider the following scenario:
An existing data warehouse has a database user DBAUser assigned to the largerc resource class role. The
resource class assignment was done with sp_addrolemember.
The data warehouse is now updated with workload management.
To test the new classification syntax, the database role DBARole (which DBAUser is a member of), has a
classifier created for them mapping them to mediumrc and high importance.
When DBAUser logs in and runs a query, the query is assigned to largerc, because a classification based on a
user takes precedence over one based on a role membership.
To simplify troubleshooting misclassification, we recommended you remove resource class role mappings as you
create workload classifiers. The code below returns existing resource class role memberships. Run
sp_droprolemember for each member name returned from the corresponding resource class.


SELECT  r.name AS [Resource Class]
       ,m.name AS membername
FROM    sys.database_role_members rm
JOIN    sys.database_principals AS r ON rm.role_principal_id = r.principal_id
JOIN    sys.database_principals AS m ON rm.member_principal_id = m.principal_id
WHERE   r.name IN
('mediumrc','largerc','xlargerc','staticrc10','staticrc20','staticrc30','staticrc40','staticrc50','staticrc60','staticrc70','staticrc80');

--For each row returned, run:
sp_droprolemember '[Resource Class]', membername

Next steps
For more information on creating a classifier, see CREATE WORKLOAD CLASSIFIER (Transact-SQL).
See the Quickstart Create a workload classifier for how to create a workload classifier.
See the how-to articles to Configure Workload Importance and how to manage and monitor Workload
Management.
See sys.dm_pdw_exec_requests to view queries and the importance assigned.
Azure SQL Data Warehouse workload importance
7/5/2019 • 2 minutes to read • Edit Online

This article explains how workload importance can influence the order of execution for SQL Data Warehouse
requests.

Importance
Business needs can require data warehousing workloads to be more important than others. Consider a scenario
where mission critical sales data is loaded before the fiscal period close. Data loads for other sources such as
weather data don't have strict SLAs. Setting high importance for a request to load sales data and low importance
to a request to load weather data ensures the sales data load gets first access to resources and completes quicker.
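A minimal sketch of that scenario, assuming hypothetical logins SalesLoader and WeatherLoader submit the respective load jobs:

CREATE WORKLOAD CLASSIFIER wcSalesLoad
WITH ( WORKLOAD_GROUP = 'mediumrc', MEMBERNAME = 'SalesLoader', IMPORTANCE = high );

CREATE WORKLOAD CLASSIFIER wcWeatherLoad
WITH ( WORKLOAD_GROUP = 'mediumrc', MEMBERNAME = 'WeatherLoader', IMPORTANCE = low );

-- Queued requests from SalesLoader are scheduled ahead of requests from WeatherLoader.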

Importance levels
There are five levels of importance: low, below_normal, normal, above_normal, and high. Requests that don't set
importance are assigned the default level of normal. Requests that have the same importance level have the same
scheduling behavior that exists today.

Importance scenarios
Beyond the basic importance scenario described above with sales and weather data, there are other scenarios
where workload importance helps meet data processing and querying needs.
Locking
Access to locks for read and write activity is one area of natural contention. Activities such as partition switching
or RENAME OBJECT require elevated locks. Without workload importance, SQL Data Warehouse optimizes for
throughput. Optimizing for throughput means that when running and queued requests have the same locking
needs and resources are available, the queued requests can bypass requests with higher locking needs that arrived
in the request queue earlier. Once workload importance is applied to requests with higher locking needs, requests
with higher importance are run before requests with lower importance.
Consider the following example:
Q1 is actively running and selecting data from SalesFact. Q2 is queued waiting for Q1 to complete. It was
submitted at 9am and is attempting to partition switch new data into SalesFact. Q3 is submitted at 9:01am and
wants to select data from SalesFact.
If Q2 and Q3 have the same importance and Q1 is still executing, Q3 will begin executing. Q2 will continue to
wait for an exclusive lock on SalesFact. If Q2 has higher importance than Q3, Q3 will wait until Q2 is finished
before it can begin execution.
Non-uniform requests
Another scenario where importance can help meet querying demands is when requests with different resource
classes are submitted. As was previously mentioned, under the same importance, SQL Data Warehouse optimizes
for throughput. When mixed size requests (such as smallrc or mediumrc) are queued, SQL Data Warehouse will
choose the earliest arriving request that fits within the available resources. If workload importance is applied, the
highest importance request is scheduled next.
Consider the following example on DW500c:
Q1, Q2, Q3, and Q4 are running smallrc queries. Q5 is submitted with the mediumrc resource class at 9am. Q6 is
submitted with smallrc resource class at 9:01am.
Because Q5 is mediumrc, it requires two concurrency slots. Q5 needs to wait for two of the running queries to
complete. However, when one of the running queries (Q1-Q4) completes, Q6 is scheduled immediately because
the resources exist to execute the query. If Q5 has higher importance than Q6, Q6 waits until Q5 is running before
it can begin executing.

Next steps
For more information on creating a classifier, see CREATE WORKLOAD CLASSIFIER (Transact-SQL).
For more information about SQL Data Warehouse workload classification, see Workload Classification.
See the Quickstart Create workload classifier for how to create a workload classifier.
See the how-to articles to Configure Workload Importance and how to Manage and monitor Workload
Management.
See sys.dm_pdw_exec_requests to view queries and the importance assigned.
Memory and concurrency limits for Azure SQL Data
Warehouse
10/8/2019 • 3 minutes to read • Edit Online

View the memory and concurrency limits allocated to the various performance levels and resource classes in
Azure SQL Data Warehouse. For more information, and to apply these capabilities to your workload
management plan, see Resource classes for workload management.

Data warehouse capacity settings


The following tables show the maximum capacity for the data warehouse at different performance levels. To
change the performance level, see Scale compute - portal.
Service Levels
The service levels range from DW100c to DW30000c.

| PERFORMANCE LEVEL | COMPUTE NODES | DISTRIBUTIONS PER COMPUTE NODE | MEMORY PER DATA WAREHOUSE (GB) |
| --- | --- | --- | --- |
| DW100c | 1 | 60 | 60 |
| DW200c | 1 | 60 | 120 |
| DW300c | 1 | 60 | 180 |
| DW400c | 1 | 60 | 240 |
| DW500c | 1 | 60 | 300 |
| DW1000c | 2 | 30 | 600 |
| DW1500c | 3 | 20 | 900 |
| DW2000c | 4 | 15 | 1200 |
| DW2500c | 5 | 12 | 1500 |
| DW3000c | 6 | 10 | 1800 |
| DW5000c | 10 | 6 | 3000 |
| DW6000c | 12 | 5 | 3600 |
| DW7500c | 15 | 4 | 4500 |
| DW10000c | 20 | 3 | 6000 |
| DW15000c | 30 | 2 | 9000 |
| DW30000c | 60 | 1 | 18000 |

The maximum service level is DW30000c, which has 60 Compute nodes and one distribution per Compute node.
For example, a 600 TB data warehouse at DW30000c processes approximately 10 TB per Compute node.

Concurrency maximums
To ensure each query has enough resources to execute efficiently, SQL Data Warehouse tracks resource
utilization by assigning concurrency slots to each query. The system puts queries into a queue based on
importance and concurrency slots. Queries wait in the queue until enough concurrency slots are available.
Importance and concurrency slots determine CPU prioritization. For more information, see Analyze your
workload.
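One way to observe queueing and prioritization is to query sys.dm_pdw_exec_requests; the resource_class and importance columns show what each request was assigned. This is a sketch, and column availability depends on your service version:

SELECT  request_id
       ,[status]
       ,submit_time
       ,importance
       ,resource_class
       ,command
FROM    sys.dm_pdw_exec_requests
WHERE   [status] IN ('Queued','Running')
ORDER BY submit_time;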
Static resource classes
The following table shows the maximum concurrent queries and concurrency slots for each static resource class.

| SERVICE LEVEL | MAXIMUM CONCURRENT QUERIES | CONCURRENCY SLOTS AVAILABLE | SLOTS USED BY STATICRC10 | SLOTS USED BY STATICRC20 | SLOTS USED BY STATICRC30 | SLOTS USED BY STATICRC40 | SLOTS USED BY STATICRC50 | SLOTS USED BY STATICRC60 | SLOTS USED BY STATICRC70 | SLOTS USED BY STATICRC80 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DW100c | 4 | 4 | 1 | 2 | 4 | 4 | 4 | 4 | 4 | 4 |
| DW200c | 8 | 8 | 1 | 2 | 4 | 8 | 8 | 8 | 8 | 8 |
| DW300c | 12 | 12 | 1 | 2 | 4 | 8 | 8 | 8 | 8 | 8 |
| DW400c | 16 | 16 | 1 | 2 | 4 | 8 | 16 | 16 | 16 | 16 |
| DW500c | 20 | 20 | 1 | 2 | 4 | 8 | 16 | 16 | 16 | 16 |
| DW1000c | 32 | 40 | 1 | 2 | 4 | 8 | 16 | 32 | 32 | 32 |
| DW1500c | 32 | 60 | 1 | 2 | 4 | 8 | 16 | 32 | 32 | 32 |
| DW2000c | 48 | 80 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 64 |
| DW2500c | 48 | 100 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 64 |
| DW3000c | 64 | 120 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 64 |
| DW5000c | 64 | 200 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
| DW6000c | 128 | 240 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
| DW7500c | 128 | 300 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
| DW10000c | 128 | 400 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
| DW15000c | 128 | 600 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
| DW30000c | 128 | 1200 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |

Dynamic resource classes


The following table shows the maximum concurrent queries and concurrency slots for each dynamic resource
class. Dynamic resource classes use a 3-10-22-70 memory percentage allocation for small-medium-large-xlarge
resource classes across all service levels.

| SERVICE LEVEL | MAXIMUM CONCURRENT QUERIES | CONCURRENCY SLOTS AVAILABLE | SLOTS USED BY SMALLRC | SLOTS USED BY MEDIUMRC | SLOTS USED BY LARGERC | SLOTS USED BY XLARGERC |
| --- | --- | --- | --- | --- | --- | --- |
| DW100c | 4 | 4 | 1 | 1 | 1 | 2 |
| DW200c | 8 | 8 | 1 | 1 | 1 | 5 |
| DW300c | 12 | 12 | 1 | 1 | 2 | 8 |
| DW400c | 16 | 16 | 1 | 1 | 3 | 11 |
| DW500c | 20 | 20 | 1 | 2 | 4 | 14 |
| DW1000c | 32 | 40 | 1 | 4 | 8 | 28 |
| DW1500c | 32 | 60 | 1 | 6 | 13 | 42 |
| DW2000c | 32 | 80 | 2 | 8 | 17 | 56 |
| DW2500c | 32 | 100 | 3 | 10 | 22 | 70 |
| DW3000c | 32 | 120 | 3 | 12 | 26 | 84 |
| DW5000c | 32 | 200 | 6 | 20 | 44 | 140 |
| DW6000c | 32 | 240 | 7 | 24 | 52 | 168 |
| DW7500c | 32 | 300 | 9 | 30 | 66 | 210 |
| DW10000c | 32 | 400 | 12 | 40 | 88 | 280 |
| DW15000c | 32 | 600 | 18 | 60 | 132 | 420 |
| DW30000c | 32 | 1200 | 36 | 120 | 264 | 840 |

When there are not enough concurrency slots free to start query execution, queries are queued and executed
based on importance. If importance is equivalent, queries are executed on a first-in, first-out basis. As
queries finish and the number of running queries and used slots falls below the limits, SQL Data Warehouse
releases queued queries.

Next steps
To learn more about how to leverage resource classes to optimize your workload further, review the
following articles:
Resource classes for workload management
Analyzing your workload
Manageability and monitoring with Azure SQL Data
Warehouse
1/30/2019 • 2 minutes to read • Edit Online

Take a look through what's available to help you manage and monitor SQL Data Warehouse. The following articles
highlight ways to optimize performance and usage of your data warehouse.

Overview
Learn about compute management and elasticity
Understand what metrics and logs are available in the Azure portal
Learn about backup and restore capabilities
Learn about built-in intelligence and recommendations
Learn about maintenance periods and what is available to minimize downtime of your data warehouse
Find common troubleshooting guidance

Next steps
For how-to guides, see Monitor and tune your data warehouse.
Manage compute in Azure SQL Data Warehouse
8/18/2019 • 6 minutes to read • Edit Online

Learn about managing compute resources in Azure SQL Data Warehouse. Lower costs by pausing the data
warehouse, or scale the data warehouse to meet performance demands.

What is compute management?


The architecture of SQL Data Warehouse separates storage and compute, allowing each to scale independently.
As a result, you can scale compute to meet performance demands independent of data storage. You can also
pause and resume compute resources. A natural consequence of this architecture is that billing for compute and
storage is separate. If you don't need to use your data warehouse for a while, you can save compute costs by
pausing compute.

Scaling compute
You can scale out or scale back compute by adjusting the data warehouse units setting for your data warehouse.
Loading and query performance can increase linearly as you add more data warehouse units.
For scale-out steps, see the Azure portal, PowerShell, or T-SQL quickstarts. You can also perform scale-out
operations with a REST API.
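For reference, a minimal T-SQL sketch of a scale operation, assuming a hypothetical data warehouse named mySampleDW (run it while connected to the master database of the logical server):

-- Scale the data warehouse to DW1000c
ALTER DATABASE mySampleDW
MODIFY (SERVICE_OBJECTIVE = 'DW1000c');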
To perform a scale operation, SQL Data Warehouse first kills all incoming queries and then rolls back
transactions to ensure a consistent state. Scaling only occurs once the transaction rollback is complete. For a
scale operation, the system detaches the storage layer from the Compute nodes, adds Compute nodes, and then
reattaches the storage layer to the Compute layer. Each data warehouse is stored as 60 distributions, which are
evenly distributed to the Compute nodes. Adding more Compute nodes adds more compute power. As the
number of Compute nodes increases, the number of distributions per compute node decreases, providing more
compute power for your queries. Likewise, decreasing data warehouse units reduces the number of Compute
nodes, which reduces the compute resources for queries.
The following table shows how the number of distributions per Compute node changes as the data warehouse
units change. DWU6000 provides 60 Compute nodes and achieves much higher query performance than
DWU100.

| DATA WAREHOUSE UNITS | # OF COMPUTE NODES | # OF DISTRIBUTIONS PER NODE |
| --- | --- | --- |
| 100 | 1 | 60 |
| 200 | 2 | 30 |
| 300 | 3 | 20 |
| 400 | 4 | 15 |
| 500 | 5 | 12 |
| 600 | 6 | 10 |
| 1000 | 10 | 6 |
| 1200 | 12 | 5 |
| 1500 | 15 | 4 |
| 2000 | 20 | 3 |
| 3000 | 30 | 2 |
| 6000 | 60 | 1 |

Finding the right size of data warehouse units


To see the performance benefits of scaling out, especially for larger data warehouse units, you want to use at
least a 1-TB data set. To find the best number of data warehouse units for your data warehouse, try scaling up
and down. Run a few queries with different numbers of data warehouse units after loading your data. Since
scaling is quick, you can try various performance levels in an hour or less.
Recommendations for finding the best number of data warehouse units:
For a data warehouse in development, begin by selecting a smaller number of data warehouse units. A good
starting point is DW400 or DW200.
Monitor your application performance, observing the number of data warehouse units selected compared to
the performance you observe.
Assume a linear scale, and determine how much you need to increase or decrease the data warehouse units.
Continue making adjustments until you reach an optimum performance level for your business
requirements.

When to scale out


Scaling out data warehouse units impacts these aspects of performance:
Linearly improves performance of the system for scans, aggregations, and CTAS statements.
Increases the number of readers and writers for loading data.
Maximum number of concurrent queries and concurrency slots.
Recommendations for when to scale out data warehouse units:
Before you perform a heavy data loading or transformation operation, scale out to make the data available
more quickly.
During peak business hours, scale out to accommodate larger numbers of concurrent queries.

What if scaling out does not improve performance?


Adding data warehouse units increases parallelism. If the work is evenly split between the Compute nodes,
the additional parallelism improves query performance. If scaling out is not changing your performance, there
are some reasons why this might happen. Your data might be skewed across the distributions, or queries might
be introducing a large amount of data movement. To investigate query performance issues, see Performance
troubleshooting.

Pausing and resuming compute


Pausing compute causes the storage layer to detach from the Compute nodes. The Compute resources are
released from your account. You are not charged for compute while compute is paused. Resuming compute
reattaches storage to the Compute nodes, and resumes charges for Compute. When you pause a data
warehouse:
Compute and memory resources are returned to the pool of available resources in the data center
Data warehouse unit costs are zero for the duration of the pause.
Data storage is not affected and your data stays intact.
SQL Data Warehouse cancels all running or queued operations.
When you resume a data warehouse:
SQL Data Warehouse acquires compute and memory resources for your data warehouse units setting.
Compute charges for your data warehouse units resume.
Your data becomes available.
After the data warehouse is online, you need to restart your workload queries.
If you always want your data warehouse accessible, consider scaling it down to the smallest size rather than
pausing.
For pause and resume steps, see the Azure portal, or PowerShell quickstarts. You can also use the pause REST
API or the resume REST API.

Drain transactions before pausing or scaling


We recommend allowing existing transactions to finish before you initiate a pause or scale operation.
When you pause or scale your SQL Data Warehouse, behind the scenes your queries are canceled when you
initiate the pause or scale request. Canceling a simple SELECT query is a quick operation and has almost no
impact to the time it takes to pause or scale your instance. However, transactional queries, which modify your
data or the structure of the data, may not be able to stop quickly. Transactional queries, by definition, must
either complete in their entirety or rollback their changes. Rolling back the work completed by a
transactional query can take as long, or even longer, than the original change the query was applying. For
example, if you cancel a query which was deleting rows and has already been running for an hour, it could take
the system an hour to insert back the rows which were deleted. If you run pause or scaling while transactions
are in flight, your pause or scaling may seem to take a long time because pausing and scaling has to wait for the
rollback to complete before it can proceed.
See also Understanding transactions, and Optimizing transactions.

Automating compute management


To automate the compute management operations, see Manage compute with Azure functions.
Each of the scale-out, pause, and resume operations can take several minutes to complete. If you are scaling,
pausing, or resuming automatically, we recommend implementing logic to ensure that certain operations have
completed before proceeding with another action. Checking the data warehouse state through various
endpoints allows you to correctly implement automation of such operations.
To check the data warehouse state, see the PowerShell or T-SQL quickstart. You can also check the data
warehouse state with a REST API.
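As one option, a sketch of polling database state with T-SQL from the master database; the exact state values to wait for should be verified against the quickstart, and mySampleDW is a hypothetical name:

SELECT  name
       ,state_desc
FROM    sys.databases
WHERE   name = 'mySampleDW';  -- hypothetical data warehouse name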

Permissions
Scaling the data warehouse requires the permissions described in ALTER DATABASE. Pause and Resume
require the SQL DB Contributor permission, specifically Microsoft.Sql/servers/databases/action.
Next steps
See the how-to guide to manage compute. Another aspect of managing compute resources is allocating
different compute resources for individual queries. For more information, see Resource classes for workload
management.
Monitoring resource utilization and query activity in
Azure SQL Data Warehouse
9/27/2019 • 2 minutes to read • Edit Online

Azure SQL Data Warehouse provides a rich monitoring experience within the Azure portal to surface insights to
your data warehouse workload. The Azure portal is the recommended tool when monitoring your data warehouse
as it provides configurable retention periods, alerts, recommendations, and customizable charts and dashboards
for metrics and logs. The portal also enables you to integrate with other Azure monitoring services, such as
Operations Management Suite (OMS) and Azure Monitor (logs), to provide a holistic monitoring experience for
not only your data warehouse but also your entire Azure analytics platform. This documentation describes what
monitoring capabilities are available to optimize and manage your analytics platform with SQL Data Warehouse.

Resource utilization
The following metrics are available in the Azure portal for SQL Data Warehouse. These metrics are surfaced
through Azure Monitor.

| METRIC NAME | DESCRIPTION | AGGREGATION TYPE |
| --- | --- | --- |
| CPU percentage | CPU utilization across all nodes for the data warehouse | Maximum |
| Data IO percentage | IO utilization across all nodes for the data warehouse | Maximum |
| Memory percentage | Memory utilization (SQL Server) across all nodes for the data warehouse | Maximum |
| Successful Connections | Number of successful connections to the data warehouse | Total |
| Failed Connections | Number of failed connections to the data warehouse | Total |
| Blocked by Firewall | Number of logins to the data warehouse that were blocked | Total |
| DWU limit | Service level objective of the data warehouse | Maximum |
| DWU percentage | Maximum between CPU percentage and Data IO percentage | Maximum |
| DWU used | DWU limit * DWU percentage | Maximum |
| Cache hit percentage | (cache hits / cache miss) * 100, where cache hits is the sum of all columnstore segment hits in the local SSD cache and cache miss is the columnstore segment misses in the local SSD cache, summed across all nodes | Maximum |
| Cache used percentage | (cache used / cache capacity) * 100, where cache used is the sum of all bytes in the local SSD cache across all nodes and cache capacity is the sum of the storage capacity of the local SSD cache across all nodes | Maximum |
| Local tempdb percentage | Local tempdb utilization across all compute nodes - values are emitted every five minutes | Maximum |
Things to consider when viewing metrics and setting alerts:


Failed and successful connections are reported for a particular data warehouse, not for the logical server.
Memory percentage reflects utilization even if the data warehouse is in an idle state; it does not reflect active
workload memory consumption. Use and track this metric along with others (tempdb, Gen2 cache) to make
a holistic decision about whether scaling for additional cache capacity will increase workload performance to meet
your requirements.

Query activity
For a programmatic experience when monitoring SQL Data Warehouse via T-SQL, the service provides a set of
Dynamic Management Views (DMVs). These views are useful when actively troubleshooting and identifying
performance bottlenecks with your workload.
To view the list of DMVs that SQL Data Warehouse provides, refer to this documentation.
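For example, a common starting point is to list the longest-running active requests. This is a sketch based on sys.dm_pdw_exec_requests:

SELECT TOP 10
        request_id
       ,[status]
       ,submit_time
       ,total_elapsed_time
       ,command
FROM    sys.dm_pdw_exec_requests
WHERE   [status] NOT IN ('Completed','Failed','Cancelled')
ORDER BY total_elapsed_time DESC;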

Metrics and diagnostics logging


Both metrics and logs can be exported to Azure Monitor, specifically the Azure Monitor logs component and can
be programmatically accessed through log queries. The log latency for SQL Data Warehouse is about 10-15
minutes. For more details on the factors impacting latency, visit the following documentation.

Next steps
The following How -to guides describe common scenarios and use cases when monitoring and managing your
data warehouse:
Monitor your data warehouse workload with DMVs
Backup and restore in Azure SQL Data Warehouse
10/22/2019 • 6 minutes to read • Edit Online

Learn how to use backup and restore in Azure SQL Data Warehouse. Use data warehouse restore points to
recover or copy your data warehouse to a previous state in the primary region. Use data warehouse geo-
redundant backups to restore to a different geographical region.

What is a data warehouse snapshot


A data warehouse snapshot creates a restore point you can leverage to recover or copy your data warehouse to a
previous state. Since SQL Data Warehouse is a distributed system, a data warehouse snapshot consists of many
files that are located in Azure storage. Snapshots capture incremental changes from the data stored in your data
warehouse.
A data warehouse restore is a new data warehouse that is created from a restore point of an existing or deleted
data warehouse. Restoring your data warehouse is an essential part of any business continuity and disaster
recovery strategy because it re-creates your data after accidental corruption or deletion. A data warehouse restore is also a
powerful mechanism to create copies of your data warehouse for test or development purposes. SQL Data
Warehouse restore rates can vary depending on the database size and location of the source and target data
warehouse.

Automatic Restore Points


Snapshots are a built-in feature of the service that creates restore points. You do not have to enable this capability.
However, the data warehouse should be in active state for restore point creation. If the data warehouse is paused
frequently, automatic restore points may not be created, so make sure to create a user-defined restore point before
pausing the data warehouse. Automatic restore points currently cannot be deleted by users as the service uses
these restore points to maintain SLAs for recovery.
SQL Data Warehouse takes snapshots of your data warehouse throughout the day creating restore points that are
available for seven days. This retention period cannot be changed. SQL Data Warehouse supports an eight-hour
recovery point objective (RPO ). You can restore your data warehouse in the primary region from any one of the
snapshots taken in the past seven days.
To see when the last snapshot started, run this query on your online SQL Data Warehouse.

select top 1 *
from sys.pdw_loader_backup_runs
order by run_id desc
;

User-Defined Restore Points


This feature enables you to manually trigger snapshots to create restore points of your data warehouse before and
after large modifications. This capability ensures that restore points are logically consistent, which provides
additional data protection in case of any workload interruptions or user errors for quick recovery time. User-
defined restore points are available for seven days and are automatically deleted on your behalf. You cannot
change the retention period of user-defined restore points. 42 user-defined restore points are guaranteed at any
point in time so they must be deleted before creating another restore point. You can trigger snapshots to create
user-defined restore points through PowerShell or the Azure portal.
NOTE
If you require restore points longer than 7 days, please vote for this capability here. You can also create a user-defined
restore point and restore from the newly created restore point to a new data warehouse. Once you have restored, you have
the data warehouse online and can pause it indefinitely to save compute costs. The paused database incurs storage charges
at the Azure Premium Storage rate. If you need an active copy of the restored data warehouse, you can resume which
should take only a few minutes.

Restore point retention


The following lists details for restore point retention periods:
1. SQL Data Warehouse deletes a restore point when it hits the 7-day retention period and when there are at
least 42 total restore points (including both user-defined and automatic)
2. Snapshots are not taken when a data warehouse is paused
3. The age of a restore point is measured by the absolute calendar days from the time the restore point is taken
including when the data warehouse is paused
4. At any point in time, a data warehouse is guaranteed to be able to store up to 42 user-defined restore points
and 42 automatic restore points as long as these restore points have not reached the 7-day retention period
5. If a snapshot is taken, the data warehouse is then paused for greater than 7 days, and then resumes, it is
possible for restore point to persist until there are 42 total restore points (including both user-defined and
automatic)
Snapshot retention when a data warehouse is dropped
When you drop a data warehouse, SQL Data Warehouse creates a final snapshot and saves it for seven days. You
can restore the data warehouse to the final restore point created at deletion. If the data warehouse is dropped in
Paused state, no snapshot is taken. In that scenario, make sure to create a user-defined restore point before
dropping the data warehouse.

IMPORTANT
If you delete a logical SQL server instance, all databases that belong to the instance are also deleted and cannot be
recovered. You cannot restore a deleted server.

Geo-backups and disaster recovery


SQL Data Warehouse performs a geo-backup once per day to a paired data center. The RPO for a geo-restore is
24 hours. You can restore the geo-backup to a server in any other region where SQL Data Warehouse is
supported. A geo-backup ensures you can restore data warehouse in case you cannot access the restore points in
your primary region.

NOTE
If you require a shorter RPO for geo-backups, vote for this capability here. You can also create a user-defined restore point
and restore from the newly created restore point to a new data warehouse in a different region. Once you have restored,
you have the data warehouse online and can pause it indefinitely to save compute costs. The paused database incurs storage
charges at the Azure Premium Storage rate. Should you need an active copy of the data warehouse, you can resume which
should take only a few minutes.

Backup and restore costs


You will notice the Azure bill has a line item for Storage and a line item for Disaster Recovery Storage. The Storage
charge is the total cost for storing your data in the primary region along with the incremental changes captured by
snapshots. For a more detailed explanation of how snapshots are charged, refer to Understanding how Snapshots
Accrue Charges. The geo-redundant charge covers the cost for storing the geo-backups.
The total cost for your primary data warehouse and seven days of snapshot changes is rounded to the nearest TB.
For example, if your data warehouse is 1.5 TB and the snapshots captures 100 GB, you are billed for 2 TB of data
at Azure Premium Storage rates.
If you are using geo-redundant storage, you receive a separate storage charge. The geo-redundant storage is
billed at the standard Read-Access Geographically Redundant Storage (RA-GRS ) rate.
For more information about SQL Data Warehouse pricing, see SQL Data Warehouse Pricing. You are not charged
for data egress when restoring across regions.

Restoring from restore points


Each snapshot creates a restore point that represents the time the snapshot started. To restore a data warehouse,
you choose a restore point and issue a restore command.
You can either keep the restored data warehouse and the current one, or delete one of them. If you want to replace
the current data warehouse with the restored data warehouse, you can rename it using ALTER DATABASE (Azure
SQL Data Warehouse) with the MODIFY NAME option.
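For example, assuming hypothetical database names, the swap could look like the following (run against the master database):

-- Keep the original under a new name, then give the restored copy the original name.
ALTER DATABASE mySampleDW MODIFY NAME = mySampleDW_original;
ALTER DATABASE mySampleDW_restored MODIFY NAME = mySampleDW;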
To restore a data warehouse, see Restore a data warehouse using the Azure portal, Restore a data warehouse
using PowerShell, or Restore a data warehouse using REST APIs.
To restore a deleted or paused data warehouse, you can create a support ticket.

Cross subscription restore


If you need to restore directly across subscriptions, vote for this capability here. As a workaround, restore to a different
logical server and 'Move' the server across subscriptions to perform a cross-subscription restore.

Geo-redundant restore
You can restore your data warehouse to any region supporting SQL Data Warehouse at your chosen performance
level.

NOTE
To perform a geo-redundant restore you must not have opted out of this feature.

Next steps
For more information about disaster planning, see Business continuity overview
SQL Data Warehouse Recommendations
3/15/2019 • 2 minutes to read • Edit Online

This article describes the recommendations served by SQL Data Warehouse through Azure Advisor.
SQL Data Warehouse provides recommendations to ensure your data warehouse is consistently optimized for
performance. Data warehouse recommendations are tightly integrated with Azure Advisor to provide you with
best practices directly within the Azure portal. SQL Data Warehouse analyzes the current state of your data
warehouse, collects telemetry, and surfaces recommendations for your active workload on a daily cadence. The
supported data warehouse recommendation scenarios are outlined below along with how to apply recommended
actions.
If you have any feedback on the SQL Data Warehouse Advisor or run into any issues, reach out to
sqldwadvisor@service.microsoft.com.
Click here to check your recommendations today! Currently this feature is applicable to Gen2 data warehouses
only.

Data Skew
Data skew can cause additional data movement or resource bottlenecks when running your workload. The
following documentation describes how to identify data skew and prevent it from happening by selecting an
optimal distribution key.
Identify and remove skew
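As a quick check, DBCC PDW_SHOWSPACEUSED reports rows and space per distribution, which makes skew visible; the table name below is a hypothetical example:

-- Row counts and space used, reported per distribution
DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');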

No or Outdated Statistics
Having suboptimal statistics can severely impact query performance as it can cause the SQL Data Warehouse
query optimizer to generate suboptimal query plans. The following documentation describes the best practices
around creating and updating statistics:
Creating and updating table statistics
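As a minimal sketch (the table and column names are hypothetical), statistics can be created on a column and refreshed after significant data changes:

-- Create single-column statistics, then refresh all statistics on the table after a large load
CREATE STATISTICS stats_OrderDateKey ON dbo.FactInternetSales (OrderDateKey);
UPDATE STATISTICS dbo.FactInternetSales;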
To see the list of impacted tables by these recommendations, run the following T-SQL script. Advisor continuously
runs the same T-SQL script to generate these recommendations.

Replicate Tables
For replicated table recommendations, Advisor detects table candidates based on the following physical
characteristics:
Replicated table size
Number of columns
Table distribution type
Number of partitions
Advisor continuously leverages workload-based heuristics such as table access frequency, rows returned on
average, and thresholds around data warehouse size and activity to ensure high-quality recommendations are
generated.
The following describes workload-based heuristics you may find in the Azure portal for each replicated table
recommendation:
Scan avg- the average percent of rows that were returned from the table for each table access over the past
seven days
Frequent read, no update - indicates that the table has not been updated in the past seven days while showing
access activity
Read/update ratio - the ratio of how frequent the table was accessed relative to when it gets updated over the
past seven days
Activity - measures the usage based on access activity. This compares the table access activity relative to the
average table access activity across the data warehouse over the past seven days.
Currently, Advisor will only show at most four replicated table candidates at once, limited to tables with clustered
columnstore indexes, prioritizing those with the highest activity.

IMPORTANT
The replicated table recommendation is not foolproof and does not take into account data movement operations. We are
working on adding this as a heuristic, but in the meantime, you should always validate your workload after applying the
recommendation. Please contact sqldwadvisor@service.microsoft.com if you discover replicated table recommendations that
cause your workload to regress. To learn more about replicated tables, visit the following documentation.
Troubleshooting Azure SQL Data Warehouse
8/18/2019 • 4 minutes to read • Edit Online

This article lists common troubleshooting questions.

Connecting
| ISSUE | RESOLUTION |
| --- | --- |
| Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'. (Microsoft SQL Server, Error: 18456) | This error occurs when an AAD user tries to connect to the master database, but does not have a user in master. To correct this issue, either specify the SQL Data Warehouse you wish to connect to at connection time or add the user to the master database. See the Security overview article for more details. |
| The server principal "MyUserName" is not able to access the database "master" under the current security context. Cannot open user default database. Login failed. Login failed for user 'MyUserName'. (Microsoft SQL Server, Error: 916) | This error occurs when an AAD user tries to connect to the master database, but does not have a user in master. To correct this issue, either specify the SQL Data Warehouse you wish to connect to at connection time or add the user to the master database. See the Security overview article for more details. |
| CTAIP error | This error can occur when a login has been created on the SQL server master database, but not in the SQL Data Warehouse database. If you encounter this error, take a look at the Security overview article. This article explains how to create a login and user on master, and then how to create a user in the SQL Data Warehouse database. |
| Blocked by Firewall | Azure SQL databases are protected by server and database level firewalls to ensure only known IP addresses have access to a database. The firewalls are secure by default, which means that you must explicitly enable an IP address or range of addresses before you can connect. To configure your firewall for access, follow the steps in Configure server firewall access for your client IP in the Provisioning instructions. |
| Cannot connect with tool or driver | SQL Data Warehouse recommends using SSMS, SSDT for Visual Studio, or sqlcmd to query your data. For more information on drivers and connecting to SQL Data Warehouse, see the Drivers for Azure SQL Data Warehouse and Connect to Azure SQL Data Warehouse articles. |

Tools
| ISSUE | RESOLUTION |
| --- | --- |
| Visual Studio object explorer is missing AAD users | This is a known issue. As a workaround, view the users in sys.database_principals. See Authentication to Azure SQL Data Warehouse to learn more about using Azure Active Directory with SQL Data Warehouse. |
| Manual scripting, using the scripting wizard, or connecting via SSMS is slow, not responding, or producing errors | Ensure that users have been created in the master database. In scripting options, also make sure that the engine edition is set as "Microsoft Azure SQL Data Warehouse Edition" and engine type is "Microsoft Azure SQL Database". |
| Generate scripts fails in SSMS | Generating a script for SQL Data Warehouse fails if the option "Generate script for dependent objects" is set to "True." As a workaround, users must manually go to Tools -> Options -> SQL Server Object Explorer -> Generate script for dependent options and set it to false. |

Performance
| ISSUE | RESOLUTION |
| --- | --- |
| Query performance troubleshooting | If you are trying to troubleshoot a particular query, start with Learning how to monitor your queries. |
| Poor query performance and plans often is a result of missing statistics | The most common cause of poor performance is lack of statistics on your tables. See Maintaining table statistics for details on how to create statistics and why they are critical to your performance. |
| Low concurrency / queries queued | Understanding Workload management is important in order to understand how to balance memory allocation with concurrency. |
| How to implement best practices | The best place to start to learn ways to improve query performance is the SQL Data Warehouse best practices article. |
| How to improve performance with scaling | Sometimes the solution to improving performance is to simply add more compute power to your queries by Scaling your SQL Data Warehouse. |
| Poor query performance as a result of poor index quality | Sometimes queries can slow down because of poor columnstore index quality. See this article for more information and how to Rebuild indexes to improve segment quality. |

System management
| ISSUE | RESOLUTION |
| --- | --- |
| Msg 40847: Could not perform the operation because server would exceed the allowed Database Transaction Unit quota of 45000. | Either reduce the DWU of the database you are trying to create or request a quota increase. |
| Investigating space utilization | See Table sizes to understand the space utilization of your system. |
| Help with managing tables | See the Table overview article for help with managing your tables. This article also includes links into more detailed topics like Table data types, Distributing a table, Indexing a table, Partitioning a table, Maintaining table statistics and Temporary tables. |
| Transparent data encryption (TDE) progress bar is not updating in the Azure portal | You can view the state of TDE via PowerShell. |

Differences from SQL Database


| ISSUE | RESOLUTION |
| --- | --- |
| Unsupported SQL Database features | See Unsupported table features. |
| Unsupported SQL Database data types | See Unsupported data types. |
| DELETE and UPDATE limitations | See UPDATE workarounds, DELETE workarounds and Using CTAS to work around unsupported UPDATE and DELETE syntax. |
| MERGE statement is not supported | See MERGE workarounds. |
| Stored procedure limitations | See Stored procedure limitations to understand some of the limitations of stored procedures. |
| UDFs do not support SELECT statements | This is a current limitation of our UDFs. See CREATE FUNCTION for the syntax we support. |

Next steps
For more help in finding solution to your issue, here are some other resources you can try.
Blogs
Feature requests
Videos
CAT team blogs
Create support ticket
MSDN forum
Stack Overflow forum
Twitter
Use maintenance schedules to manage service
updates and maintenance
10/4/2019 • 4 minutes to read • Edit Online

Maintenance schedules are now available in all Azure SQL Data Warehouse regions. The maintenance schedule
feature integrates the Service Health Planned Maintenance Notifications, Resource Health Check Monitor, and the
Azure SQL Data Warehouse maintenance scheduling service.
You use maintenance scheduling to choose a time window when it's convenient to receive new features, upgrades,
and patches. You choose a primary and a secondary maintenance window within a seven-day period. To use this
feature you will need to identify a primary and secondary window within separate day ranges.
For example, you can schedule a primary window of Saturday 22:00 to Sunday 01:00, and then schedule a
secondary window of Wednesday 19:00 to 22:00. If SQL Data Warehouse can't perform maintenance during your
primary maintenance window, it will try the maintenance again during your secondary maintenance window.
Service maintenance could occur during both the primary and secondary windows. To ensure rapid completion of
all maintenance operations, DW400(c) and lower data warehouse tiers could complete maintenance outside of a
designated maintenance window.
All newly created Azure SQL Data Warehouse instances will have a system-defined maintenance schedule applied
during deployment. The schedule can be edited as soon as deployment is complete.
Each maintenance window can be between three and eight hours. Maintenance can occur at any time within the
window. When maintenance starts, all active sessions will be canceled and non-committed transactions will be
rolled back. You should expect multiple brief losses in connectivity as the service deploys new code to your data
warehouse. You'll be notified immediately after your data warehouse maintenance is completed.
All maintenance operations should finish within the scheduled maintenance windows. No maintenance will take
place outside the specified maintenance windows without prior notification. If your data warehouse is paused
during a scheduled maintenance, it will be updated during the resume operation.

Alerts and monitoring


Integration with Service Health notifications and the Resource Health Check Monitor allows customers to stay
informed of impending maintenance activity. The new automation takes advantage of Azure Monitor. You can
decide how you want to be notified of impending maintenance events. Also, you can choose which automated
flows will help you manage downtime and minimize operational impact.
A 24-hour advance notification precedes all maintenance events that aren't for the DW400(c) and lower tiers. To
minimize instance downtime, make sure that your data warehouse has no long-running transactions before your
chosen maintenance period.

NOTE
In the event we are required to deploy a time critical update, advanced notification times may be significantly reduced.

If you received an advance notification that maintenance will take place, but SQL Data Warehouse can't perform
maintenance during that time, you'll receive a cancellation notification. Maintenance will then resume during the
next scheduled maintenance period.
All active maintenance events appear in the Service Health - Planned Maintenance section. The Service Health
history includes a full record of past events. You can monitor maintenance via the Azure Service Health check
portal dashboard during an active event.
Maintenance schedule availability
Even if maintenance scheduling isn't available in your selected region, you can view and edit your maintenance
schedule at any time. When maintenance scheduling becomes available in your region, the identified schedule will
immediately become active on your data warehouse.

View a maintenance schedule


Portal
By default, all newly created Azure SQL Data Warehouse instances have an eight-hour primary and secondary
maintenance window applied during deployment. As indicated above, you can change the windows as soon as
deployment is complete. No maintenance will take place outside the specified maintenance windows without prior
notification.
To view the maintenance schedule that has been applied to your data warehouse, complete the following steps:
1. Sign in to the Azure portal.
2. Select the data warehouse that you want to view.
3. The selected data warehouse opens on the overview blade. The maintenance schedule that's applied to the data
warehouse appears below Maintenance schedule.

Change a maintenance schedule


Portal
A maintenance schedule can be updated or changed at any time. If the selected instance is going through an active
maintenance cycle, the settings will be saved. They'll become active during the next identified maintenance period.
Learn more about monitoring your data warehouse during an active maintenance event.
Identifying the primary and secondary windows
The primary and secondary windows must have separate day ranges. An example is a primary window of
Tuesday–Thursday and a secondary window of Saturday–Sunday.
To change the maintenance schedule for your data warehouse, complete the following steps:
1. Sign in to the Azure portal.
2. Select the data warehouse that you want to update. The page opens on the overview blade.
3. Open the page for maintenance schedule settings by selecting the Maintenance Schedule (preview)
summary link on the overview blade. Or, select the Maintenance Schedule option on the left-side
resource menu.

4. Identify the preferred day range for your primary maintenance window by using the options at the top of
the page. This selection determines if your primary window will occur on a weekday or over the weekend.
Your selection will update the drop-down values. During preview, some regions might not yet support the
full set of available Day options.
5. Choose your preferred primary and secondary maintenance windows by using the drop-down list boxes:
Day: Preferred day to perform maintenance during the selected window.
Start time: Preferred start time for the maintenance window.
Time window: Preferred duration of your time window.
The Schedule summary area at the bottom of the blade is updated based on the values that you selected.
6. Select Save. A message appears, confirming that your new schedule is now active.
If you're saving a schedule in a region that doesn't support maintenance scheduling, the following message
appears. Your settings are saved and become active when the feature becomes available in your selected
region.

Next steps
Learn more about creating, viewing, and managing alerts by using Azure Monitor.
Learn more about webhook actions for log alert rules.
Learn more about creating and managing Action Groups.
Learn more about Azure Service Health.
Integrate other services with SQL Data Warehouse
5/17/2019 • 2 minutes to read • Edit Online

In addition to its core functionality, SQL Data Warehouse enables users to integrate with many of the other
services in Azure. Some of these services include:
Power BI
Azure Data Factory
Azure Machine Learning
Azure Stream Analytics
SQL Data Warehouse continues to integrate with more services across Azure, and more Integration partners.

Power BI
Power BI integration allows you to combine the compute power of SQL Data Warehouse with the dynamic
reporting and visualization of Power BI. Power BI integration currently includes:
Direct Connect: A more advanced connection with logical pushdown against SQL Data Warehouse.
Pushdown provides faster analysis on a larger scale.
Open in Power BI: The 'Open in Power BI' button passes instance information to Power BI for a simplified way
to connect.
For more information, see Integrate with Power BI, or the Power BI documentation.

Azure Data Factory


Azure Data Factory gives users a managed platform to create complex extract and load pipelines. SQL Data
Warehouse's integration with Azure Data Factory includes:
Stored Procedures: Orchestrate the execution of stored procedures on SQL Data Warehouse.
Copy: Use ADF to move data into SQL Data Warehouse. This operation can use ADF's standard data
movement mechanism or PolyBase under the covers.
For more information, see Integrate with Azure Data Factory.

Azure Machine Learning


Azure Machine Learning is a fully managed analytics service, which allows you to create intricate models using a
large set of predictive tools. SQL Data Warehouse is supported as both a source and destination for these models
with the following functionality:
Read Data: Drive models at scale using T-SQL against SQL Data Warehouse.
Write Data: Commit changes from any model back to SQL Data Warehouse.
For more information, see Integrate with Azure Machine Learning.

Azure Stream Analytics


Azure Stream Analytics is a complex, fully managed infrastructure for processing and consuming event data
generated from Azure Event Hub. Integration with SQL Data Warehouse allows for streaming data to be
effectively processed and stored alongside relational data enabling deeper, more advanced analysis.
Job Output: Send output from Stream Analytics jobs directly to SQL Data Warehouse.
For more information, see Integrate with Azure Stream Analytics.
Configure and manage Azure Active Directory
authentication with SQL
10/17/2019 • 23 minutes to read • Edit Online

This article shows you how to create and populate Azure AD, and then use Azure AD with Azure SQL Database,
managed instance, and SQL Data Warehouse. For an overview, see Azure Active Directory Authentication.

NOTE
This article applies to Azure SQL server, and to both SQL Database and SQL Data Warehouse databases that are created on
the Azure SQL server. For simplicity, SQL Database is used when referring to both SQL Database and SQL Data Warehouse.

IMPORTANT
Connecting to SQL Server running on an Azure VM is not supported using an Azure Active Directory account. Use a
domain Active Directory account instead.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

IMPORTANT
The PowerShell Azure Resource Manager module is still supported by Azure SQL Database, but all future development is for
the Az.Sql module. For these cmdlets, see AzureRM.Sql. The arguments for the commands in the Az module and in the
AzureRm modules are substantially identical.

Create and populate an Azure AD


Create an Azure AD and populate it with users and groups. Azure AD can be the initial Azure AD managed
domain. Azure AD can also be an on-premises Active Directory Domain Services that is federated with the Azure
AD.
For more information, see Integrating your on-premises identities with Azure Active Directory, Add your own
domain name to Azure AD, Microsoft Azure now supports federation with Windows Server Active Directory,
Administering your Azure AD directory, Manage Azure AD using Windows PowerShell, and Hybrid Identity
Required Ports and Protocols.

Associate or add an Azure subscription to Azure Active Directory


1. Associate your Azure subscription to Azure Active Directory by making the directory a trusted directory
for the Azure subscription hosting the database. For details, see How Azure subscriptions are associated
with Azure AD.
2. Use the directory switcher in the Azure portal to switch to the subscription associated with domain.
Additional information: Every Azure subscription has a trust relationship with an Azure AD instance.
This means that it trusts that directory to authenticate users, services, and devices. Multiple subscriptions
can trust the same directory, but a subscription trusts only one directory. This trust relationship that a
subscription has with a directory is unlike the relationship that a subscription has with all other resources
in Azure (websites, databases, and so on), which are more like child resources of a subscription. If a
subscription expires, then access to those other resources associated with the subscription also stops. But
the directory remains in Azure, and you can associate another subscription with that directory and
continue to manage the directory users. For more information about resources, see Understanding
resource access in Azure. To learn more about this trusted relationship see How to associate or add an
Azure subscription to Azure Active Directory.

Create an Azure AD administrator for Azure SQL server


Each Azure SQL server (which hosts a SQL Database or SQL Data Warehouse) starts with a single server
administrator account that is the administrator of the entire Azure SQL server. A second SQL Server
administrator must be created, that is an Azure AD account. This principal is created as a contained database user
in the master database. As administrators, the server administrator accounts are members of the db_owner role
in every user database, and enter each user database as the dbo user. For more information about the server
administrator accounts, see Managing Databases and Logins in Azure SQL Database.
When using Azure Active Directory with geo-replication, the Azure Active Directory administrator must be
configured for both the primary and the secondary servers. If a server does not have an Azure Active Directory
administrator, then Azure Active Directory logins and users receive a "Cannot connect" to server error.

NOTE
Users that are not based on an Azure AD account (including the Azure SQL server administrator account), cannot create
Azure AD-based users, because they do not have permission to validate proposed database users with the Azure AD.

Provision an Azure Active Directory administrator for your managed


instance
IMPORTANT
Only follow these steps if you are provisioning a managed instance. This operation can only be executed by
Global/Company administrator or a Privileged Role Administrator in Azure AD. The following steps describe the process of
granting permissions for users with different privileges in the directory.

Your managed instance needs permissions to read Azure AD to successfully accomplish tasks such as
authentication of users through security group membership or creation of new users. For this to work, you need
to grant permissions to the managed instance to read Azure AD. There are two ways to do it: from the Azure portal and
PowerShell. The following steps describe both methods.
1. In the Azure portal, in the upper-right corner, select your connection to drop down a list of possible Active
Directories.
2. Choose the correct Active Directory as the default Azure AD.
This step links the subscription associated with Active Directory with managed instance making sure that
the same subscription is used for both Azure AD and the managed instance.
3. Navigate to managed instance and select one that you want to use for Azure AD integration.
4. Select the banner on top of the Active Directory admin page and grant permission to the current user. If
you are logged in as Global/Company administrator in Azure AD, you can do it from the Azure portal or
using PowerShell with the script below.
# Gives Azure Active Directory read permission to a Service Principal representing the managed instance.
# Can be executed only by a "Company Administrator", "Global Administrator", or "Privileged Role Administrator" type of user.

$aadTenant = "<YourTenantId>" # Enter your tenant ID
$managedInstanceName = "MyManagedInstance"

# Get Azure AD role "Directory Readers" and create it if it doesn't exist
$roleName = "Directory Readers"
$role = Get-AzureADDirectoryRole | Where-Object {$_.displayName -eq $roleName}
if ($role -eq $null) {
    # Instantiate an instance of the role template
    $roleTemplate = Get-AzureADDirectoryRoleTemplate | Where-Object {$_.displayName -eq $roleName}
    Enable-AzureADDirectoryRole -RoleTemplateId $roleTemplate.ObjectId
    $role = Get-AzureADDirectoryRole | Where-Object {$_.displayName -eq $roleName}
}

# Get the service principal for the managed instance
$roleMember = Get-AzureADServicePrincipal -SearchString $managedInstanceName
$roleMember.Count
if ($roleMember -eq $null) {
    Write-Output "Error: No Service Principals with name '$($managedInstanceName)', make sure that managedInstanceName parameter was entered correctly."
    exit
}
if (-not ($roleMember.Count -eq 1)) {
    Write-Output "Error: More than one service principal with name pattern '$($managedInstanceName)'"
    Write-Output "Dumping selected service principals...."
    $roleMember
    exit
}

# Check if the service principal is already a member of the Directory Readers role
$allDirReaders = Get-AzureADDirectoryRoleMember -ObjectId $role.ObjectId
$selDirReader = $allDirReaders | where {$_.ObjectId -match $roleMember.ObjectId}

if ($selDirReader -eq $null) {
    # Add the principal to the Directory Readers role
    Write-Output "Adding service principal '$($managedInstanceName)' to 'Directory Readers' role..."
    Add-AzureADDirectoryRoleMember -ObjectId $role.ObjectId -RefObjectId $roleMember.ObjectId
    Write-Output "'$($managedInstanceName)' service principal added to 'Directory Readers' role..."

    #Write-Output "Dumping service principal '$($managedInstanceName)':"
    #$allDirReaders = Get-AzureADDirectoryRoleMember -ObjectId $role.ObjectId
    #$allDirReaders | where {$_.ObjectId -match $roleMember.ObjectId}
}
else {
    Write-Output "Service principal '$($managedInstanceName)' is already a member of the 'Directory Readers' role."
}

5. After the operation completes successfully, a notification appears in the top-right corner.
6. Now you can choose your Azure AD admin for your managed instance. To do so, on the Active Directory
admin page, select the Set admin command.

7. In the AAD admin page, search for a user, select the user or group to be an administrator, and then select
Select.
The Active Directory admin page shows all members and groups of your Active Directory. Users or groups
that are grayed out can't be selected because they aren't supported as Azure AD administrators. See the
list of supported admins in Azure AD Features and Limitations. Role-based access control (RBAC) applies
only to the Azure portal and isn't propagated to SQL Server.
8. At the top of the Active Directory admin page, select Save.

The process of changing the administrator may take several minutes. Then the new administrator appears
in the Active Directory admin box.
After provisioning an Azure AD admin for your managed instance, you can begin to create Azure AD server
principals (logins) (public preview) with the CREATE LOGIN syntax. For more information, see managed
instance Overview.
TIP
To later remove an Admin, at the top of the Active Directory admin page, select Remove admin, and then select Save.

PowerShell for SQL managed instance


To run PowerShell cmdlets, you need to have Azure PowerShell installed and running. For detailed information,
see How to install and configure Azure PowerShell. To provision an Azure AD admin, execute the following Azure
PowerShell commands:
Connect-AzAccount
Select-AzSubscription
Cmdlets used to provision and manage Azure AD admin for SQL managed instance:

CMDLET NAME                                        DESCRIPTION

Set-AzSqlInstanceActiveDirectoryAdministrator      Provisions an Azure AD administrator for SQL managed
                                                   instance in the current subscription. (Must be from the
                                                   current subscription.)

Remove-AzSqlInstanceActiveDirectoryAdministrator   Removes an Azure AD administrator for SQL managed
                                                   instance in the current subscription.

Get-AzSqlInstanceActiveDirectoryAdministrator      Returns information about an Azure AD administrator for
                                                   SQL managed instance in the current subscription.

PowerShell examples for managed instance


The following command gets information about an Azure AD administrator for a managed instance named
ManagedInstance01 that is associated with a resource group named ResourceGroup01.

Get-AzSqlInstanceActiveDirectoryAdministrator -ResourceGroupName "ResourceGroup01" -InstanceName "ManagedInstance01"

The following command provisions an Azure AD administrator group named DBAs for the managed instance
named ManagedInstance01. This server is associated with resource group ResourceGroup01.

Set-AzSqlInstanceActiveDirectoryAdministrator -ResourceGroupName "ResourceGroup01" -InstanceName "ManagedInstance01" `
    -DisplayName "DBAs" -ObjectId "40b79501-b343-44ed-9ce7-da4c8cc7353b"

The following command removes the Azure AD administrator for the managed instance named
ManagedInstanceName01 associated with the resource group ResourceGroup01.

Remove-AzSqlInstanceActiveDirectoryAdministrator -ResourceGroupName "ResourceGroup01" -InstanceName "ManagedInstanceName01" `
    -Confirm -PassThru

CLI for SQL managed instance


You can also provision an Azure AD admin for SQL managed instance by calling the following CLI commands:

COMMAND                      DESCRIPTION

az sql mi ad-admin create    Provisions an Azure Active Directory administrator for SQL
                             managed instance. (Must be from the current subscription.)

az sql mi ad-admin delete    Removes an Azure Active Directory administrator for SQL
                             managed instance.

az sql mi ad-admin list      Returns information about an Azure Active Directory
                             administrator currently configured for SQL managed instance.

az sql mi ad-admin update    Updates the Active Directory administrator for a SQL
                             managed instance.
For more information about CLI commands, see az sql mi.

Provision an Azure Active Directory administrator for your Azure SQL Database server

IMPORTANT
Only follow these steps if you are provisioning an Azure SQL Database server or Data Warehouse.

The following two procedures show you how to provision an Azure Active Directory administrator for your Azure
SQL server in the Azure portal and by using PowerShell.
Azure portal
1. In the Azure portal, in the upper-right corner, select your connection to drop down a list of possible Active
Directories. Choose the correct Active Directory as the default Azure AD. This step links the subscription-
associated Active Directory with the Azure SQL server, making sure that the same subscription is used for both
Azure AD and SQL Server. (The Azure SQL server can be hosting either Azure SQL Database or Azure
SQL Data Warehouse.)
2. In the left banner, select All services, and type SQL server in the filter. Select SQL servers.
NOTE
On this page, before you select SQL servers, you can select the star next to the name to favorite the category and
add SQL servers to the left navigation bar.

3. On the SQL Server page, select Active Directory admin.


4. In the Active Directory admin page, select Set admin.

5. In the Add admin page, search for a user, select the user or group to be an administrator, and then select
Select. The Active Directory admin page shows all members and groups of your Active Directory. Users
or groups that are grayed out cannot be selected because they are not supported as Azure AD
administrators. (See the list of supported admins in the Azure AD Features and Limitations section of
Use Azure Active Directory Authentication for authentication with SQL Database or SQL Data
Warehouse.) Role-based access control (RBAC) applies only to the portal and is not propagated to SQL
Server.
6. At the top of the Active Directory admin page, select SAVE.

The process of changing the administrator may take several minutes. Then the new administrator appears in the
Active Directory admin box.
NOTE
When setting up the Azure AD admin, the new admin name (user or group) cannot already be present in the virtual master
database as a SQL Server authentication user. If it is, the Azure AD admin setup fails, rolling back its creation and
indicating that an admin with that name already exists. Because such a SQL Server authentication user is not part of
Azure AD, any effort to connect to the server using Azure AD authentication fails.

To later remove an Admin, at the top of the Active Directory admin page, select Remove admin, and then
select Save.
PowerShell for Azure SQL Database and Azure SQL Data Warehouse
To run PowerShell cmdlets, you need to have Azure PowerShell installed and running. For detailed information,
see How to install and configure Azure PowerShell. To provision an Azure AD admin, execute the following Azure
PowerShell commands:
Connect-AzAccount
Select-AzSubscription
Cmdlets used to provision and manage Azure AD admin for Azure SQL Database and Azure SQL Data
Warehouse:

CMDLET NAME                                       DESCRIPTION

Set-AzSqlServerActiveDirectoryAdministrator       Provisions an Azure Active Directory administrator for Azure
                                                  SQL server or Azure SQL Data Warehouse. (Must be from the
                                                  current subscription.)

Remove-AzSqlServerActiveDirectoryAdministrator    Removes an Azure Active Directory administrator for Azure
                                                  SQL server or Azure SQL Data Warehouse.

Get-AzSqlServerActiveDirectoryAdministrator       Returns information about an Azure Active Directory
                                                  administrator currently configured for the Azure SQL server
                                                  or Azure SQL Data Warehouse.

Use PowerShell command get-help to see more information for each of these commands. For example,
get-help Set-AzSqlServerActiveDirectoryAdministrator .

PowerShell examples for Azure SQL Database and Azure SQL Data Warehouse
The following script provisions an Azure AD administrator group named DBA_Group (object ID
40b79501-b343-44ed-9ce7-da4c8cc7353f ) for the demo_server server in a resource group named Group-23:

Set-AzSqlServerActiveDirectoryAdministrator -ResourceGroupName "Group-23" `
    -ServerName "demo_server" -DisplayName "DBA_Group"

The DisplayName input parameter accepts either the Azure AD display name or the User Principal Name. For
example, DisplayName="John Smith" and DisplayName="johns@contoso.com" . For Azure AD groups only the Azure
AD display name is supported.

NOTE
The Azure PowerShell command Set-AzSqlServerActiveDirectoryAdministrator does not prevent you from
provisioning Azure AD admins for unsupported users. An unsupported user can be provisioned, but cannot connect to a
database.
The following example uses the optional ObjectID:

Set-AzSqlServerActiveDirectoryAdministrator -ResourceGroupName "Group-23" `
    -ServerName "demo_server" -DisplayName "DBA_Group" -ObjectId "40b79501-b343-44ed-9ce7-da4c8cc7353f"

NOTE
The Azure AD ObjectID is required when the DisplayName is not unique. To retrieve the ObjectID and DisplayName
values, use the Active Directory section of Azure Classic Portal, and view the properties of a user or group.
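
If you want to look up the object ID from PowerShell instead, the following is a minimal sketch. It assumes the Az.Resources module (for Get-AzADGroup), an existing sign-in with Connect-AzAccount, and reuses the example names Group-23, demo_server, and DBA_Group from above.

# Sketch only: look up the group's object ID, then pass it explicitly.
# If more than one group shares the display name, filter the results to the group you intend to use.
$group = Get-AzADGroup -DisplayName "DBA_Group"

Set-AzSqlServerActiveDirectoryAdministrator -ResourceGroupName "Group-23" `
    -ServerName "demo_server" `
    -DisplayName $group.DisplayName `
    -ObjectId $group.Id
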

The following example returns information about the current Azure AD admin for Azure SQL server:

Get-AzSqlServerActiveDirectoryAdministrator -ResourceGroupName "Group-23" -ServerName "demo_server" | Format-List

The following example removes an Azure AD administrator:

Remove-AzSqlServerActiveDirectoryAdministrator -ResourceGroupName "Group-23" -ServerName "demo_server"

NOTE
You can also provision an Azure Active Directory administrator by using the REST APIs. For more information, see Service
Management REST API Reference and Operations for Azure SQL Database.

CLI for Azure SQL Database and Azure SQL Data Warehouse
You can also provision an Azure AD admin by calling the following CLI commands:

COMMAND                          DESCRIPTION

az sql server ad-admin create    Provisions an Azure Active Directory administrator for Azure
                                 SQL server or Azure SQL Data Warehouse. (Must be from the
                                 current subscription.)

az sql server ad-admin delete    Removes an Azure Active Directory administrator for Azure
                                 SQL server or Azure SQL Data Warehouse.

az sql server ad-admin list      Returns information about an Azure Active Directory
                                 administrator currently configured for the Azure SQL server
                                 or Azure SQL Data Warehouse.

az sql server ad-admin update    Updates the Active Directory administrator for an Azure SQL
                                 server or Azure SQL Data Warehouse.

For more information about CLI commands, see az sql server.

Configure your client computers


On all client machines from which your applications or users connect to Azure SQL Database or Azure SQL Data
Warehouse using Azure AD identities, you must install the following software:
.NET Framework 4.6 or later from https://msdn.microsoft.com/library/5a4x27ek.aspx.
Azure Active Directory Authentication Library for SQL Server (ADALSQL.DLL), which is available in multiple
languages (both x86 and amd64) from the download center at Microsoft Active Directory Authentication
Library for Microsoft SQL Server.
You can meet these requirements in the following ways:
Installing either SQL Server 2016 Management Studio or SQL Server Data Tools for Visual Studio 2015
meets the .NET Framework 4.6 requirement.
SSMS installs the x86 version of ADALSQL.DLL.
SSDT installs the amd64 version of ADALSQL.DLL.
The latest Visual Studio from Visual Studio Downloads meets the .NET Framework 4.6 requirement, but does
not install the required amd64 version of ADALSQL.DLL.
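
As a quick, optional check of the .NET Framework prerequisite, the following PowerShell sketch reads the .NET Framework 4.x release key from the registry on a Windows client. A release value of 393295 or higher indicates .NET Framework 4.6 or later.

# Minimal sketch: checks the .NET Framework 4.x release key on a Windows client.
$release = (Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full' -Name Release).Release
if ($release -ge 393295) {
    Write-Host ".NET Framework 4.6 or later is installed (Release = $release)."
}
else {
    Write-Host "Install .NET Framework 4.6 or later before using Azure AD authentication (Release = $release)."
}
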

Create contained database users in your database mapped to Azure AD identities

IMPORTANT
Managed instance now supports Azure AD server principals (logins) (public preview), which enables you to create logins
from Azure AD users, groups, or applications. Azure AD server principals (logins) provide the ability to authenticate to your
managed instance without requiring database users to be created as contained database users. For more information, see
managed instance Overview. For syntax on creating Azure AD server principals (logins), see CREATE LOGIN.

Azure Active Directory authentication requires database users to be created as contained database users. A
contained database user based on an Azure AD identity is a database user that does not have a login in the
master database and that maps to an identity in the Azure AD directory associated with the database.
The Azure AD identity can be either an individual user account or a group. For more information about contained
database users, see Contained Database Users - Making Your Database Portable.

NOTE
Database users (with the exception of administrators) cannot be created using the Azure portal. RBAC roles are not
propagated to SQL Server, SQL Database, or SQL Data Warehouse. Azure RBAC roles are used for managing Azure
Resources, and do not apply to database permissions. For example, the SQL Server Contributor role does not grant
access to connect to the SQL Database or SQL Data Warehouse. The access permission must be granted directly in the
database using Transact-SQL statements.

WARNING
Special characters like colon : or ampersand & when included as user names in the T-SQL CREATE LOGIN and CREATE
USER statements are not supported.

To create an Azure AD-based contained database user (other than the server administrator that owns the
database), connect to the database with an Azure AD identity, as a user with at least the ALTER ANY USER
permission. Then use the following Transact-SQL syntax:

CREATE USER <Azure_AD_principal_name> FROM EXTERNAL PROVIDER;

Azure_AD_principal_name can be the user principal name of an Azure AD user or the display name for an Azure
AD group.
Examples: To create a contained database user representing an Azure AD federated or managed domain user:
CREATE USER [bob@contoso.com] FROM EXTERNAL PROVIDER;
CREATE USER [alice@fabrikam.onmicrosoft.com] FROM EXTERNAL PROVIDER;

To create a contained database user representing an Azure AD or federated domain group, provide the display
name of a security group:

CREATE USER [ICU Nurses] FROM EXTERNAL PROVIDER;

To create a contained database user representing an application that connects using an Azure AD token:

CREATE USER [appName] FROM EXTERNAL PROVIDER;

NOTE
This command requires that SQL access Azure AD (the "external provider") on behalf of the logged-in user. Sometimes,
circumstances will arise that cause Azure AD to return an exception back to SQL. In these cases, the user will see SQL error
33134, which should contain the AAD-specific error message. Most of the time, the error will say that access is denied, or
that the user must enroll in MFA to access the resource, or that access between first-party applications must be handled via
preauthorization. In the first two cases, the issue is usually caused by Conditional Access policies that are set in the user's
AAD tenant: they prevent the user from accessing the external provider. Updating the CA policies to allow access to the
application '00000002-0000-0000-c000-000000000000' (the application ID of the AAD Graph API) should resolve the
issue. In the case that the error says access between first-party applications must be handled via preauthorization, the issue
is because the user is signed in as a service principal. The command should succeed if it is executed by a user instead.

TIP
You cannot directly create a user from an Azure Active Directory other than the Azure Active Directory that is associated
with your Azure subscription. However, members of other Active Directories that are imported users in the associated
Active Directory (known as external users) can be added to an Active Directory group in the tenant Active Directory. By
creating a contained database user for that AD group, the users from the external Active Directory can gain access to SQL
Database.

For more information about creating contained database users based on Azure Active Directory identities, see
CREATE USER (Transact-SQL ).

NOTE
Removing the Azure Active Directory administrator for Azure SQL server prevents any Azure AD authentication user from
connecting to the server. If necessary, unusable Azure AD users can be dropped manually by a SQL Database administrator.

NOTE
If you receive a Connection Timeout Expired, you may need to set the TransparentNetworkIPResolution parameter of
the connection string to false. For more information, see Connection timeout issue with .NET Framework 4.6.1 -
TransparentNetworkIPResolution.

When you create a database user, that user receives the CONNECT permission and can connect to that database
as a member of the PUBLIC role. Initially the only permissions available to the user are any permissions granted
to the PUBLIC role, or any permissions granted to any Azure AD groups that they are a member of. Once you
provision an Azure AD-based contained database user, you can grant the user additional permissions, the same
way as you grant permission to any other type of user. Typically, you grant permissions to database roles and add
users to those roles. For more information, see Database Engine Permission Basics. For more information about special
SQL Database roles, see Managing Databases and Logins in Azure SQL Database. A federated domain user
account that is imported into a managed domain as an external user must use the managed domain identity.

NOTE
Azure AD users are marked in the database metadata with type E (EXTERNAL_USER) and for groups with type X
(EXTERNAL_GROUPS). For more information, see sys.database_principals.

Connect to the user database or data warehouse by using SSMS or SSDT
To confirm the Azure AD administrator is properly set up, connect to the master database using the Azure AD
administrator account. To provision an Azure AD-based contained database user (other than the server
administrator that owns the database), connect to the database with an Azure AD identity that has access to the
database.

IMPORTANT
Support for Azure Active Directory authentication is available with SQL Server 2016 Management Studio and SQL Server
Data Tools in Visual Studio 2015. The August 2016 release of SSMS also includes support for Active Directory Universal
Authentication, which allows administrators to require Multi-Factor Authentication using a phone call, text message, smart
cards with pin, or mobile app notification.

Using an Azure AD identity to connect using SSMS or SSDT


The following procedures show you how to connect to a SQL database with an Azure AD identity using SQL
Server Management Studio or SQL Server Data Tools.
Active Directory integrated authentication
Use this method if you are logged in to Windows using your Azure Active Directory credentials from a federated
domain.
1. Start Management Studio or Data Tools and in the Connect to Server (or Connect to Database
Engine) dialog box, in the Authentication box, select Active Directory - Integrated. No password is
needed or can be entered because your existing credentials will be presented for the connection.
2. Select the Options button, and on the Connection Properties page, in the Connect to database box,
type the name of the user database you want to connect to. (The "AD domain name or tenant ID" option
is only supported for Universal with MFA connection options; otherwise it is grayed out.)
Active Directory password authentication
Use this method when connecting with an Azure AD principal name using the Azure AD managed domain. You
can also use it for federated accounts without access to the domain, for example when working remotely.
Use this method to authenticate to SQL DB/DW with Azure AD for native or federated Azure AD users. A native
user is one explicitly created in Azure AD and authenticated using a user name and password, while a
federated user is a Windows user whose domain is federated with Azure AD. The latter method (using user name and
password) can be used when a user wants to use their Windows credentials, but their local machine is not joined
to the domain (for example, when using remote access). In this case, a Windows user can indicate their domain
account and password and can authenticate to SQL DB/DW using federated credentials.
1. Start Management Studio or Data Tools and in the Connect to Server (or Connect to Database
Engine) dialog box, in the Authentication box, select Active Directory - Password.
2. In the User name box, type your Azure Active Directory user name in the format
username@domain.com. User names must be an account from the Azure Active Directory or an account
from a domain federated with the Azure Active Directory.
3. In the Password box, type your user password for the Azure Active Directory account or federated domain
account.

4. Select the Options button, and on the Connection Properties page, in the Connect to database box,
type the name of the user database you want to connect to. (See the graphic in the previous option.)

Using an Azure AD identity to connect from a client application


The following procedures show you how to connect to a SQL database with an Azure AD identity from a client
application.
Active Directory integrated authentication
To use integrated Windows authentication, your domain’s Active Directory must be federated with Azure Active
Directory. Your client application (or a service) connecting to the database must be running on a domain-joined
machine under a user’s domain credentials.
To connect to a database using integrated authentication and an Azure AD identity, the Authentication keyword in
the database connection string must be set to Active Directory Integrated. The following C# code sample uses
ADO.NET.

string ConnectionString =
    @"Data Source=n9lxnyuzhv.database.windows.net; Authentication=Active Directory Integrated; Initial Catalog=testdb;";
SqlConnection conn = new SqlConnection(ConnectionString);
conn.Open();
The connection string keyword Integrated Security=True is not supported for connecting to Azure SQL
Database. When making an ODBC connection, you will need to remove spaces and set Authentication to
'ActiveDirectoryIntegrated'.
Active Directory password authentication
To connect to a database using Azure AD password authentication and an Azure AD identity, the Authentication keyword
must be set to Active Directory Password. The connection string must contain User ID/UID and Password/PWD
keywords and values. The following C# code sample uses ADO.NET.

string ConnectionString =
    @"Data Source=n9lxnyuzhv.database.windows.net; Authentication=Active Directory Password; Initial Catalog=testdb; UID=bob@contoso.onmicrosoft.com; PWD=MyPassWord!";
SqlConnection conn = new SqlConnection(ConnectionString);
conn.Open();

Learn more about Azure AD authentication methods using the demo code samples available at Azure AD
Authentication GitHub Demo.

Azure AD token
This authentication method allows middle-tier services to connect to Azure SQL Database or Azure SQL Data
Warehouse by obtaining a token from Azure Active Directory (AAD). It enables sophisticated scenarios including
certificate-based authentication. You must complete four basic steps to use Azure AD token authentication:
1. Register your application with Azure Active Directory and get the client ID for your code.
2. Create a database user representing the application. (Completed earlier in step 6.)
3. Create a certificate on the client computer that runs the application.
4. Add the certificate as a key for your application.
Sample connection string:

string ConnectionString = @"Data Source=n9lxnyuzhv.database.windows.net; Initial Catalog=testdb;";
SqlConnection conn = new SqlConnection(ConnectionString);
conn.AccessToken = "Your JWT token";
conn.Open();

For more information, see SQL Server Security Blog. For information about adding a certificate, see Get started
with certificate-based authentication in Azure Active Directory.
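
The same token-based pattern can also be scripted. The following is a sketch only: it assumes the Az.Accounts module (for Get-AzAccessToken) and a contained database user that already exists for the signed-in identity, and it reuses the placeholder server and database names from the samples above.

Connect-AzAccount
# Depending on the Az.Accounts version, Token may be returned as a SecureString and need conversion.
$token = (Get-AzAccessToken -ResourceUrl "https://database.windows.net/").Token

$conn = New-Object System.Data.SqlClient.SqlConnection
$conn.ConnectionString = "Data Source=n9lxnyuzhv.database.windows.net; Initial Catalog=testdb;"
$conn.AccessToken = $token
$conn.Open()

$cmd = $conn.CreateCommand()
$cmd.CommandText = "SELECT SUSER_SNAME();"   # shows which Azure AD identity the connection is using
$cmd.ExecuteScalar()
$conn.Close()
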
sqlcmd
The following statements connect using version 13.1 of sqlcmd, which is available from the Download Center.

NOTE
sqlcmd with the -G command does not work with system identities, and requires a user principal login.

sqlcmd -S Target_DB_or_DW.testsrv.database.windows.net -G
sqlcmd -S Target_DB_or_DW.testsrv.database.windows.net -U bob@contoso.com -P MyAADPassword -G -l 30

Next steps
For an overview of access and control in SQL Database, see SQL Database access and control.
For an overview of logins, users, and database roles in SQL Database, see Logins, users, and database roles.
For more information about database principals, see Principals.
For more information about database roles, see Database roles.
For more information about firewall rules in SQL Database, see SQL Database firewall rules.
Conditional Access (MFA) with Azure SQL Database
and Data Warehouse
7/26/2019 • 2 minutes to read • Edit Online

Azure SQL Database, Managed Instance, and SQL Data Warehouse support Microsoft Conditional Access.

NOTE
This topic applies to Azure SQL server, and to both SQL Database and SQL Data Warehouse databases that are created on
the Azure SQL server. For simplicity, SQL Database is used when referring to both SQL Database and SQL Data Warehouse.

The following steps show how to configure SQL Database to enforce a Conditional Access policy.

Prerequisites
You must configure your SQL Database or SQL Data Warehouse to support Azure Active Directory
authentication. For specific steps, see Configure and manage Azure Active Directory authentication with SQL
Database or SQL Data Warehouse.
When multi-factor authentication is enabled, you must connect with a supported tool, such as the latest SSMS.
For more information, see Configure Azure SQL Database multi-factor authentication for SQL Server
Management Studio.

Configure CA for Azure SQL DB/DW


1. Sign in to the Portal, select Azure Active Directory, and then select Conditional Access. For more
information, see Azure Active Directory Conditional Access technical reference.
2. In the Conditional Access-Policies blade, click New policy, provide a name, and then click Configure
rules.
3. Under Assignments, select Users and groups, check Select users and groups, and then select the user or
group for Conditional Access. Click Select, and then click Done to accept your selection.
4. Select Cloud apps, click Select apps. You see all apps available for Conditional Access. Select Azure SQL
Database, at the bottom click Select, and then click Done.
If you can't find Azure SQL Database listed, complete the following steps:
Sign in to your Azure SQL DB/DW instance using SSMS with an AAD admin account.
Execute CREATE USER [user@yourtenant.com] FROM EXTERNAL PROVIDER .
Sign in to AAD and verify that Azure SQL Database and Data Warehouse are listed in the applications in
your AAD.
5. Select Access controls, select Grant, and then check the policy you want to apply. For this example, we
select Require multi-factor authentication.
Summary
The selected application (Azure SQL Database), which allows connecting to Azure SQL DB/DW using Azure AD
Premium, now enforces the selected Conditional Access policy, Require multi-factor authentication.
For questions about Azure SQL Database and Data Warehouse regarding multi-factor authentication, contact
MFAforSQLDB@microsoft.com.

Next steps
For a tutorial, see Secure your Azure SQL Database.
PowerShell: Create a Virtual Service endpoint and
VNet rule for SQL
8/6/2019 • 12 minutes to read • Edit Online

Virtual network rules are one firewall security feature that controls whether the database server for your single
databases and elastic pool in Azure SQL Database or for your databases in SQL Data Warehouse accepts
communications that are sent from particular subnets in virtual networks.

IMPORTANT
This article applies to Azure SQL server, and to both SQL Database and SQL Data Warehouse databases that are created on
the Azure SQL server. For simplicity, SQL Database is used when referring to both SQL Database and SQL Data Warehouse.
This article does not apply to a managed instance deployment in Azure SQL Database because it does not have a service
endpoint associated with it.

This article provides and explains a PowerShell script that takes the following actions:
1. Creates a Microsoft Azure Virtual Service endpoint on your subnet.
2. Adds the endpoint to the firewall of your Azure SQL Database server, to create a virtual network rule.
Your motivations for creating a rule are explained in: Virtual Service endpoints for Azure SQL Database.

TIP
If all you need is to assess or add the Virtual Service endpoint type name for SQL Database to your subnet, you can skip
ahead to our more direct PowerShell script.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

IMPORTANT
The PowerShell Azure Resource Manager module is still supported by Azure SQL Database, but all future development is for
the Az.Sql module. For these cmdlets, see AzureRM.Sql. The arguments for the commands in the Az module and in the
AzureRm modules are substantially identical.
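
If you still need to install the Az module, a minimal sketch (installs for the current user from the PowerShell Gallery):

Install-Module -Name Az -Scope CurrentUser -Repository PSGallery
Connect-AzAccount   # sign in before running the scripts that follow
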

Major cmdlets
This article emphasizes the New-AzSqlServerVirtualNetworkRule cmdlet that adds the subnet endpoint to the
access control list (ACL) of your Azure SQL Database server, thereby creating a rule.
The following list shows the sequence of other major cmdlets that you must run to prepare for your call to New-
AzSqlServerVirtualNetworkRule. In this article, these calls occur in script 3 "Virtual network rule":
1. New-AzVirtualNetworkSubnetConfig: Creates a subnet object.
2. New-AzVirtualNetwork: Creates your virtual network, giving it the subnet.
3. Set-AzVirtualNetworkSubnetConfig: Assigns a Virtual Service endpoint to your subnet.
4. Set-AzVirtualNetwork: Persists updates made to your virtual network.
5. New-AzSqlServerVirtualNetworkRule: After your subnet is an endpoint, adds your subnet as a virtual network
rule, into the ACL of your Azure SQL Database server.
This cmdlet offers the parameter -IgnoreMissingVNetServiceEndpoint, starting in Azure RM
PowerShell Module version 5.1.1.

Prerequisites for running PowerShell


You can already log in to Azure, such as through the Azure portal.
You can already run PowerShell scripts.

NOTE
Ensure that service endpoints are turned on for the VNet/subnet that you want to add to your server; otherwise,
creation of the VNet firewall rule will fail.

One script divided into four chunks


Our demonstration PowerShell script is divided into a sequence of smaller scripts. The division eases learning and
provides flexibility. The scripts must be run in their indicated sequence. If you do not have time now to run the
scripts, our actual test output is displayed after script 4.
Script 1: Variables
This first PowerShell script assigns values to variables. The subsequent scripts depend on these variables.

IMPORTANT
Before you run this script, you can edit the values, if you like. For example, if you already have a resource group, you might
want to edit your resource group name as the assigned value.
Your subscription name should be edited into the script.

PowerShell script 1 source code


######### Script 1 ########################################
## Log in to your Azure account. ##
## (Needed only one time per powershell.exe session.) ##
###########################################################

$yesno = Read-Host 'Do you need to log into Azure (only one time per powershell.exe session)? [yes/no]';
if ('yes' -eq $yesno) { Connect-AzAccount; }

###########################################################
## Assignments to variables used by the later scripts. ##
###########################################################

# You can edit these values, if necessary.

$SubscriptionName = 'yourSubscriptionName';
Select-AzSubscription -SubscriptionName $SubscriptionName;

$ResourceGroupName = 'RG-YourNameHere';
$Region = 'westcentralus';

$VNetName = 'myVNet';
$SubnetName = 'mySubnet';
$VNetAddressPrefix = '10.1.0.0/16';
$SubnetAddressPrefix = '10.1.1.0/24';
$VNetRuleName = 'myFirstVNetRule-ForAcl';

$SqlDbServerName = 'mysqldbserver-forvnet';
$SqlDbAdminLoginName = 'ServerAdmin';
$SqlDbAdminLoginPassword = 'ChangeYourAdminPassword1';

$ServiceEndpointTypeName_SqlDb = 'Microsoft.Sql'; # Official type name.

Write-Host 'Completed script 1, the "Variables".';

Script 2: Prerequisites
This script prepares for the next script, where the endpoint action happens. This script creates the following
items for you, but only if they do not already exist. You can skip script 2 if you are sure these items already exist:
Azure resource group
Azure SQL Database server
PowerShell script 2 source code

######### Script 2 ########################################


## Ensure your Resource Group already exists. ##
###########################################################

Write-Host "Check whether your Resource Group already exists.";

$gottenResourceGroup = $null;

$gottenResourceGroup = Get-AzResourceGroup `
-Name $ResourceGroupName `
-ErrorAction SilentlyContinue;

if ($null -eq $gottenResourceGroup)
{
    Write-Host "Creating your missing Resource Group - $ResourceGroupName.";

    $gottenResourceGroup = New-AzResourceGroup `
        -Name $ResourceGroupName `
        -Location $Region;

    $gottenResourceGroup;
}
else { Write-Host "Good, your Resource Group already exists - $ResourceGroupName."; }

$gottenResourceGroup = $null;

###########################################################
## Ensure your Azure SQL Database server already exists. ##
###########################################################

Write-Host "Check whether your Azure SQL Database server already exists.";

$sqlDbServer = $null;

$sqlDbServer = Get-AzSqlServer `
-ResourceGroupName $ResourceGroupName `
-ServerName $SqlDbServerName `
-ErrorAction SilentlyContinue;

if ($null -eq $sqlDbServer)


{
Write-Host "Creating the missing Azure SQL Database server - $SqlDbServerName.";

Write-Host "Gather the credentials necessary to next create an Azure SQL Database server.";

$sqlAdministratorCredentials = New-Object `
-TypeName System.Management.Automation.PSCredential `
-ArgumentList `
$SqlDbAdminLoginName, `
$(ConvertTo-SecureString `
-String $SqlDbAdminLoginPassword `
-AsPlainText `
-Force `
);

if ($null -eq $sqlAdministratorCredentials)


{
Write-Host "ERROR, unable to create SQL administrator credentials. Now ending.";
return;
}

Write-Host "Create your Azure SQL Database server.";

$sqlDbServer = New-AzSqlServer `
-ResourceGroupName $ResourceGroupName `
-ServerName $SqlDbServerName `
-Location $Region `
-SqlAdministratorCredentials $sqlAdministratorCredentials;

$sqlDbServer;
}
else { Write-Host "Good, your Azure SQL Database server already exists - $SqlDbServerName."; }

$sqlAdministratorCredentials = $null;
$sqlDbServer = $null;

Write-Host 'Completed script 2, the "Prerequisites".';

Script 3: Create an endpoint and a rule


This script creates a virtual network with a subnet. Then the script assigns the Microsoft.Sql endpoint type to
your subnet. Finally the script adds your subnet to the access control list (ACL ) of your SQL Database server,
thereby creating a rule.
PowerShell script 3 source code

######### Script 3 ########################################
## Create your virtual network, and give it a subnet. ##
###########################################################

Write-Host "Define a subnet '$SubnetName', to be given soon to a virtual network.";

$subnet = New-AzVirtualNetworkSubnetConfig `
-Name $SubnetName `
-AddressPrefix $SubnetAddressPrefix `
-ServiceEndpoint $ServiceEndpointTypeName_SqlDb;

Write-Host "Create a virtual network '$VNetName'." `


" Give the subnet to the virtual network that we created.";

$vnet = New-AzVirtualNetwork `
-Name $VNetName `
-AddressPrefix $VNetAddressPrefix `
-Subnet $subnet `
-ResourceGroupName $ResourceGroupName `
-Location $Region;

###########################################################
## Create a Virtual Service endpoint on the subnet. ##
###########################################################

Write-Host "Assign a Virtual Service endpoint 'Microsoft.Sql' to the subnet.";

$vnet = Set-AzVirtualNetworkSubnetConfig `
-Name $SubnetName `
-AddressPrefix $SubnetAddressPrefix `
-VirtualNetwork $vnet `
-ServiceEndpoint $ServiceEndpointTypeName_SqlDb;

Write-Host "Persist the updates made to the virtual network > subnet.";

$vnet = Set-AzVirtualNetwork `
-VirtualNetwork $vnet;

$vnet.Subnets[0].ServiceEndpoints; # Display the first endpoint.

###########################################################
## Add the Virtual Service endpoint Id as a rule, ##
## into SQL Database ACLs. ##
###########################################################

Write-Host "Get the subnet object.";

$vnet = Get-AzVirtualNetwork `
-ResourceGroupName $ResourceGroupName `
-Name $VNetName;

$subnet = Get-AzVirtualNetworkSubnetConfig `
-Name $SubnetName `
-VirtualNetwork $vnet;

Write-Host "Add the subnet .Id as a rule, into the ACLs for your Azure SQL Database server.";

$vnetRuleObject1 = New-AzSqlServerVirtualNetworkRule `
-ResourceGroupName $ResourceGroupName `
-ServerName $SqlDbServerName `
-VirtualNetworkRuleName $VNetRuleName `
-VirtualNetworkSubnetId $subnet.Id;

$vnetRuleObject1;

Write-Host "Verify that the rule is in the SQL DB ACL.";

$vnetRuleObject2 = Get-AzSqlServerVirtualNetworkRule `
-ResourceGroupName $ResourceGroupName `
-ServerName $SqlDbServerName `
-VirtualNetworkRuleName $VNetRuleName;

$vnetRuleObject2;

Write-Host 'Completed script 3, the "Virtual-Network-Rule".';

Script 4: Clean-up
This final script deletes the resources that the previous scripts created for the demonstration. However, the script
asks for confirmation before it deletes the following:
Azure SQL Database server
Azure Resource Group
You can run script 4 any time after script 1 completes.
PowerShell script 4 source code
######### Script 4 ########################################
## Clean-up phase A: Unconditional deletes. ##
## ##
## 1. The test rule is deleted from SQL DB ACL. ##
## 2. The test endpoint is deleted from the subnet. ##
## 3. The test virtual network is deleted. ##
###########################################################

Write-Host "Delete the rule from the SQL DB ACL.";

Remove-AzSqlServerVirtualNetworkRule `
-ResourceGroupName $ResourceGroupName `
-ServerName $SqlDbServerName `
-VirtualNetworkRuleName $VNetRuleName `
-ErrorAction SilentlyContinue;

Write-Host "Delete the endpoint from the subnet.";

$vnet = Get-AzVirtualNetwork `
-ResourceGroupName $ResourceGroupName `
-Name $VNetName;

Remove-AzVirtualNetworkSubnetConfig `
-Name $SubnetName `
-VirtualNetwork $vnet;

Write-Host "Delete the virtual network (thus also deletes the subnet).";

Remove-AzVirtualNetwork `
-Name $VNetName `
-ResourceGroupName $ResourceGroupName `
-ErrorAction SilentlyContinue;

###########################################################
## Clean-up phase B: Conditional deletes. ##
## ##
## These might have already existed, so user might ##
## want to keep. ##
## ##
## 1. Azure SQL Database server ##
## 2. Azure resource group ##
###########################################################

$yesno = Read-Host 'CAUTION!: Do you want to DELETE your Azure SQL Database server AND your Resource Group? [yes/no]';
if ('yes' -eq $yesno)
{
Write-Host "Remove the Azure SQL DB server.";

Remove-AzSqlServer `
-ServerName $SqlDbServerName `
-ResourceGroupName $ResourceGroupName `
-ErrorAction SilentlyContinue;

Write-Host "Remove the Azure Resource Group.";

Remove-AzResourceGroup `
-Name $ResourceGroupName `
-ErrorAction SilentlyContinue;
}
else
{
Write-Host "Skipped over the DELETE of SQL Database and resource group.";
}

Write-Host 'Completed script 4, the "Clean-Up".';


Actual output from scripts 1 through 4
The output from our test run is displayed next, in an abbreviated format. The output might be helpful in case you
do not want to actually run the PowerShell scripts now.

[C:\WINDOWS\system32\]
0 >> C:\Demo\PowerShell\sql-database-vnet-service-endpoint-powershell-s1-variables.ps1
Do you need to log into Azure (only one time per powershell.exe session)? [yes/no]: yes

Environment : AzureCloud
Account : xx@microsoft.com
TenantId : 11111111-1111-1111-1111-111111111111
SubscriptionId : 22222222-2222-2222-2222-222222222222
SubscriptionName : MySubscriptionName
CurrentStorageAccount :

[C:\WINDOWS\system32\]
0 >> C:\Demo\PowerShell\sql-database-vnet-service-endpoint-powershell-s2-prerequisites.ps1
Check whether your Resource Group already exists.
Creating your missing Resource Group - RG-YourNameHere.

ResourceGroupName : RG-YourNameHere
Location : westcentralus
ProvisioningState : Succeeded
Tags :
ResourceId : /subscriptions/22222222-2222-2222-2222-222222222222/resourceGroups/RG-YourNameHere

Check whether your Azure SQL Database server already exists.


Creating the missing Azure SQL Database server - mysqldbserver-forvnet.
Gather the credentials necessary to next create an Azure SQL Database server.
Create your Azure SQL Database server.

ResourceGroupName : RG-YourNameHere
ServerName : mysqldbserver-forvnet
Location : westcentralus
SqlAdministratorLogin : ServerAdmin
SqlAdministratorPassword :
ServerVersion : 12.0
Tags :
Identity :

Completed script 2, the "Prerequisites".

[C:\WINDOWS\system32\]
0 >> C:\Demo\PowerShell\sql-database-vnet-service-endpoint-powershell-s3-vnet-rule.ps1
Define a subnet 'mySubnet', to be given soon to a virtual network.
Create a virtual network 'myVNet'. Give the subnet to the virtual network that we created.
WARNING: The output object type of this cmdlet will be modified in a future release.
Assign a Virtual Service endpoint 'Microsoft.Sql' to the subnet.
Persist the updates made to the virtual network > subnet.

Get the subnet object.


Add the subnet .Id as a rule, into the ACLs for your Azure SQL Database server.
ProvisioningState Service Locations
----------------- ------- ---------
Succeeded Microsoft.Sql {westcentralus}

Verify that the rule is in the SQL DB ACL.

Completed script 3, the "Virtual-Network-Rule".

[C:\WINDOWS\system32\]
0 >> C:\Demo\PowerShell\sql-database-vnet-service-endpoint-powershell-s4-clean-up.ps1
Delete the rule from the SQL DB ACL.

Delete the endpoint from the subnet.

Delete the virtual network (thus also deletes the subnet).


CAUTION !: Do you want to DELETE your Azure SQL Database server AND your Resource Group? [yes/no]: yes
Remove the Azure SQL DB server.

ResourceGroupName : RG-YourNameHere
ServerName : mysqldbserver-forvnet
Location : westcentralus
SqlAdministratorLogin : ServerAdmin
SqlAdministratorPassword :
ServerVersion : 12.0
Tags :
Identity :

Remove the Azure Resource Group.


True
Completed script 4, the "Clean-Up".

This is the end of our main PowerShell script.

Verify your subnet is an endpoint


You might have a subnet that was already assigned the Microsoft.Sql type name, meaning it is already a Virtual
Service endpoint. You could use the Azure portal to create a virtual network rule from the endpoint.
Or, you might be unsure whether your subnet has the Microsoft.Sql type name. You can run the following
PowerShell script to take these actions:
1. Ascertain whether your subnet has the Microsoft.Sql type name.
2. Optionally, assign the type name if it is absent.
The script asks you to confirm, before it applies the absent type name.
Phases of the script
Here are the phases of the PowerShell script:
1. Log in to your Azure account (needed only once per PS session). Assign variables.
2. Search for your virtual network, and then for your subnet.
3. Is your subnet tagged as Microsoft.Sql endpoint server type?
4. Add a Virtual Service endpoint of type name Microsoft.Sql, on your subnet.

IMPORTANT
Before you run this script, you must edit the values assigned to the $-variables, near the top of the script.

Direct PowerShell source code


This PowerShell script does not update anything unless you respond yes when it asks you for confirmation. The script
can add the type name Microsoft.Sql to your subnet. But the script tries the add only if your subnet lacks the type
name.

### 1. Log in to your Azure account (needed only once per PS session). Assign variables.

$yesno = Read-Host 'Do you need to log into Azure (only one time per powershell.exe session)? [yes/no]';
if ('yes' -eq $yesno) { Connect-AzAccount; }

# Assignments to variables used by the later scripts.


# You can EDIT these values, if necessary.

$SubscriptionName = 'yourSubscriptionName';
Select-AzSubscription -SubscriptionName "$SubscriptionName";

$ResourceGroupName = 'yourRGName';
$VNetName = 'yourVNetName';
$SubnetName = 'yourSubnetName';
$SubnetAddressPrefix = 'Obtain this value from the Azure portal.'; # Looks roughly like: '10.0.0.0/24'

$ServiceEndpointTypeName_SqlDb = 'Microsoft.Sql'; # Do NOT edit. Is official value.

### 2. Search for your virtual network, and then for your subnet.

# Search for the virtual network.


$vnet = $null;
$vnet = Get-AzVirtualNetwork `
-ResourceGroupName $ResourceGroupName `
-Name $VNetName;

if ($vnet -eq $null)


{
Write-Host "Caution: No virtual network found by the name '$VNetName'.";
Return;
}

$subnet = $null;
for ($nn=0; $nn -lt $vnet.Subnets.Count; $nn++)
{
$subnet = $vnet.Subnets[$nn];
if ($subnet.Name -eq $SubnetName)
{ break; }
$subnet = $null;
}

if ($subnet -eq $null)


{
Write-Host "Caution: No subnet found by the name '$SubnetName'";
Return;
}

### 3. Is your subnet tagged as 'Microsoft.Sql' endpoint server type?

$endpointMsSql = $null;
for ($nn=0; $nn -lt $subnet.ServiceEndpoints.Count; $nn++)
{
$endpointMsSql = $subnet.ServiceEndpoints[$nn];
if ($endpointMsSql.Service -eq $ServiceEndpointTypeName_SqlDb)
{
$endpointMsSql;
break;
}
$endpointMsSql = $null;
}

if ($endpointMsSql -ne $null)
{
    Write-Host "Good: Subnet found, and is already tagged as an endpoint of type '$ServiceEndpointTypeName_SqlDb'.";
    Return;
}
else
{
    Write-Host "Caution: Subnet found, but not yet tagged as an endpoint of type '$ServiceEndpointTypeName_SqlDb'.";

# Ask the user for confirmation.


$yesno = Read-Host 'Do you want the PS script to apply the endpoint type name to your subnet? [yes/no]';
if ('no' -eq $yesno) { Return; }
}

### 4. Add a Virtual Service endpoint of type name 'Microsoft.Sql', on your subnet.

$vnet = Set-AzVirtualNetworkSubnetConfig `
-Name $SubnetName `
-AddressPrefix $SubnetAddressPrefix `
-VirtualNetwork $vnet `
-ServiceEndpoint $ServiceEndpointTypeName_SqlDb;

# Persist the subnet update.


$vnet = Set-AzVirtualNetwork `
-VirtualNetwork $vnet;

for ($nn=0; $nn -lt $vnet.Subnets.Count; $nn++)
{ $vnet.Subnets[$nn].ServiceEndpoints; } # Display the endpoints of each subnet.

Actual output
The following block displays our actual feedback (with cosmetic edits).

<# Our output example (with cosmetic edits), when the subnet was already tagged:

Do you need to log into Azure (only one time per powershell.exe session)? [yes/no]: no

Environment : AzureCloud
Account : xx@microsoft.com
TenantId : 11111111-1111-1111-1111-111111111111
SubscriptionId : 22222222-2222-2222-2222-222222222222
SubscriptionName : MySubscriptionName
CurrentStorageAccount :

ProvisioningState : Succeeded
Service : Microsoft.Sql
Locations : {westcentralus}

Good: Subnet found, and is already tagged as an endpoint of type 'Microsoft.Sql'.


#>
Get started with Transparent Data Encryption (TDE)
in SQL Data Warehouse
5/6/2019 • 2 minutes to read • Edit Online

Required Permissions
To enable Transparent Data Encryption (TDE), you must be an administrator or a member of the dbmanager role.

Enabling Encryption
To enable TDE for a SQL Data Warehouse, follow the steps below:
1. Open the database in the Azure portal
2. In the database blade, click the Settings button
3. Select the Transparent data encryption option

4. Select the On setting

5. Select Save
Disabling Encryption
To disable TDE for a SQL Data Warehouse, follow the steps below:
1. Open the database in the Azure portal
2. In the database blade, click the Settings button
3. Select the Transparent data encryption option

4. Select the Off setting

5. Select Save
Encryption DMVs
Encryption can be confirmed with the following DMVs:
sys.databases
sys.dm_pdw_nodes_database_encryption_keys
Get started with Transparent Data Encryption (TDE)
5/6/2019 • 2 minutes to read • Edit Online

Required Permissions
To enable Transparent Data Encryption (TDE), you must be an administrator or a member of the dbmanager role.

Enabling Encryption
Follow these steps to enable TDE for a SQL Data Warehouse:
1. Connect to the master database on the server hosting the database using a login that is an administrator or a
member of the dbmanager role in the master database
2. Execute the following statement to encrypt the database.

ALTER DATABASE [AdventureWorks] SET ENCRYPTION ON;

Disabling Encryption
Follow these steps to disable TDE for a SQL Data Warehouse:
1. Connect to the master database using a login that is an administrator or a member of the dbmanager role in
the master database
2. Execute the following statement to turn off encryption for the database.

ALTER DATABASE [AdventureWorks] SET ENCRYPTION OFF;

NOTE
A paused SQL Data Warehouse must be resumed before making changes to the TDE settings.
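
If your data warehouse is paused, you can also resume it and manage TDE from PowerShell. The following is a sketch only, using the Az.Sql cmdlets and placeholder resource names (the database name reuses the AdventureWorks example above):

# Placeholder names; replace with your own resource group, server, and data warehouse.
Resume-AzSqlDatabase -ResourceGroupName "ResourceGroup01" -ServerName "server01" -DatabaseName "AdventureWorks"

Set-AzSqlDatabaseTransparentDataEncryption -ResourceGroupName "ResourceGroup01" `
    -ServerName "server01" -DatabaseName "AdventureWorks" -State Enabled

Get-AzSqlDatabaseTransparentDataEncryption -ResourceGroupName "ResourceGroup01" `
    -ServerName "server01" -DatabaseName "AdventureWorks"
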

Verifying Encryption
To verify the encryption status for a SQL Data Warehouse, follow the steps below:
1. Connect to the master or instance database using a login that is an administrator or a member of the
dbmanager role in the master database
2. Execute the following statement to check the encryption status of your databases.

SELECT
[name],
[is_encrypted]
FROM
sys.databases;

A result of 1 indicates an encrypted database, 0 indicates a non-encrypted database.

Encryption DMVs
sys.databases
sys.dm_pdw_nodes_database_encryption_keys
Tutorial: Load New York Taxicab data to Azure SQL
Data Warehouse
8/18/2019 • 17 minutes to read • Edit Online

This tutorial uses PolyBase to load New York Taxicab data from a public Azure blob to Azure SQL Data
Warehouse. The tutorial uses the Azure portal and SQL Server Management Studio (SSMS) to:
Create a data warehouse in the Azure portal
Set up a server-level firewall rule in the Azure portal
Connect to the data warehouse with SSMS
Create a user designated for loading data
Create external tables for data in Azure blob storage
Use the CTAS T-SQL statement to load data into your data warehouse
View the progress of data as it is loading
Create statistics on the newly loaded data
If you don't have an Azure subscription, create a free account before you begin.

Before you begin


Before you begin this tutorial, download and install the newest version of SQL Server Management Studio
(SSMS).

Log in to the Azure portal


Log in to the Azure portal.

Create a blank SQL Data Warehouse


An Azure SQL Data Warehouse is created with a defined set of compute resources. The database is created
within an Azure resource group and in an Azure SQL logical server.
Follow these steps to create a blank SQL Data Warehouse.
1. Click Create a resource in the upper left-hand corner of the Azure portal.
2. Select Databases from the New page, and select SQL Data Warehouse under Featured on the New
page.
3. Fill out the SQL Data Warehouse form with the following information:

SETTING           SUGGESTED VALUE           DESCRIPTION

Database name     mySampleDataWarehouse     For valid database names, see Database Identifiers.

Subscription      Your subscription         For details about your subscriptions, see Subscriptions.

Resource group    myResourceGroup           For valid resource group names, see Naming rules and restrictions.

Select source     Blank database            Specifies to create a blank database. Note, a data warehouse is one
                                            type of database.
4. Click Server to create and configure a new server for your new database. Fill out the New server form
with the following information:

SETTING               SUGGESTED VALUE             DESCRIPTION

Server name           Any globally unique name    For valid server names, see Naming rules and restrictions.

Server admin login    Any valid name              For valid login names, see Database Identifiers.

Password              Any valid password          Your password must have at least eight characters and must
                                                  contain characters from three of the following categories:
                                                  upper case characters, lower case characters, numbers, and
                                                  non-alphanumeric characters.

Location              Any valid location          For information about regions, see Azure Regions.

5. Click Select.
6. Click Performance level to specify whether the data warehouse is Gen1 or Gen2, and the number of
data warehouse units.
7. For this tutorial, select Gen2 of SQL Data Warehouse. The slider is set to DW1000c by default. Try
moving it up and down to see how it works.
8. Click Apply.
9. In the SQL Data Warehouse page, select a collation for the blank database. For this tutorial, use the
default value. For more information about collations, see Collations
10. Now that you have completed the SQL Database form, click Create to provision the database.
Provisioning takes a few minutes.
11. On the toolbar, click Notifications to monitor the deployment process.
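
If you prefer to script this step, the following is a sketch of the equivalent provisioning with the Az.Sql cmdlet New-AzSqlDatabase. It assumes the resource group and logical server already exist (for example, created with New-AzResourceGroup and New-AzSqlServer), and the names are placeholders matching the examples used later in this tutorial.

New-AzSqlDatabase -ResourceGroupName "myResourceGroup" `
    -ServerName "mynewserver-20180430" `
    -DatabaseName "mySampleDataWarehouse" `
    -Edition "DataWarehouse" `
    -RequestedServiceObjectiveName "DW1000c"
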

Create a server-level firewall rule


The SQL Data Warehouse service creates a firewall at the server-level that prevents external applications and
tools from connecting to the server or any databases on the server. To enable connectivity, you can add firewall
rules that enable connectivity for specific IP addresses. Follow these steps to create a server-level firewall rule
for your client's IP address.

NOTE
SQL Data Warehouse communicates over port 1433. If you are trying to connect from within a corporate network,
outbound traffic over port 1433 might not be allowed by your network's firewall. If so, you cannot connect to your Azure
SQL Database server unless your IT department opens port 1433.

1. After the deployment completes, click SQL databases from the left-hand menu and then click
mySampleDataWarehouse on the SQL databases page. The overview page for your database opens,
showing you the fully qualified server name (such as mynewserver-20180430.database.windows.net)
and provides options for further configuration.
2. Copy this fully qualified server name for use to connect to your server and its databases in subsequent
quick starts.

3. Click the server name to open server settings.

4. Click Show firewall settings. The Firewall settings page for the SQL Database server opens.
5. Click Add client IP on the toolbar to add your current IP address to a new firewall rule. A firewall rule
can open port 1433 for a single IP address or a range of IP addresses.
6. Click Save. A server-level firewall rule is created for your current IP address opening port 1433 on the
logical server.
7. Click OK and then close the Firewall settings page.
You can now connect to the SQL server and its data warehouses using this IP address. The connection works
from SQL Server Management Studio or another tool of your choice. When you connect, use the ServerAdmin
account you created previously.

IMPORTANT
By default, access through the SQL Database firewall is enabled for all Azure services. Click OFF on this page and then click
Save to disable the firewall for all Azure services.
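
The same server-level rule can also be created from PowerShell. A sketch, assuming the Az.Sql module and the example names used in this tutorial; replace the IP address with your client's public IP address.

New-AzSqlServerFirewallRule -ResourceGroupName "myResourceGroup" `
    -ServerName "mynewserver-20180430" `
    -FirewallRuleName "AllowMyClientIp" `
    -StartIpAddress "203.0.113.25" -EndIpAddress "203.0.113.25"
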

Get the fully qualified server name


Get the fully qualified server name for your SQL server in the Azure portal. Later you will use the fully qualified
name when connecting to the server.
1. Log in to the Azure portal.
2. Select SQL Data warehouses from the left-hand menu, and click your database on the SQL data
warehouses page.
3. In the Essentials pane in the Azure portal page for your database, locate and then copy the Server
name. In this example, the fully qualified name is mynewserver-20180430.database.windows.net.
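
You can also retrieve the fully qualified server name from PowerShell. A sketch, assuming the Az.Sql module and the example names above:

(Get-AzSqlServer -ResourceGroupName "myResourceGroup" -ServerName "mynewserver-20180430").FullyQualifiedDomainName
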
Connect to the server as server admin
This section uses SQL Server Management Studio (SSMS) to establish a connection to your Azure SQL server.
1. Open SQL Server Management Studio.
2. In the Connect to Server dialog box, enter the following information:

SETTING           SUGGESTED VALUE                      DESCRIPTION

Server type       Database engine                      This value is required.

Server name       The fully qualified server name      The name should be something like this:
                                                       mynewserver-20180430.database.windows.net.

Authentication    SQL Server Authentication            SQL Authentication is the only authentication type that we
                                                       have configured in this tutorial.

Login             The server admin account             This is the account that you specified when you created the
                                                       server.

Password          The password for your server admin   This is the password that you specified when you created the
                  account                              server.
3. Click Connect. The Object Explorer window opens in SSMS.
4. In Object Explorer, expand Databases. Then expand System databases and master to view the objects
in the master database. Expand mySampleDataWarehouse to view the objects in your new database.

Create a user for loading data


The server admin account is meant to perform management operations, and is not suited for running queries on
user data. Loading data is a memory-intensive operation. Memory maximums are defined according to which
Generation of SQL Data Warehouse you've provisioned, data warehouse units, and resource class.
It's best to create a login and user that is dedicated for loading data. Then add the loading user to a resource
class that enables an appropriate maximum memory allocation.
Since you are currently connected as the server admin, you can create logins and users. Use these steps to
create a login and user called LoaderRC20. Then assign the user to the staticrc20 resource class.
1. In SSMS, right-click master to show a drop-down menu, and choose New Query. A new query window
opens.
2. In the query window, enter these T-SQL commands to create a login and user named LoaderRC20,
substituting your own password for 'a123STRONGpassword!'.

CREATE LOGIN LoaderRC20 WITH PASSWORD = 'a123STRONGpassword!';


CREATE USER LoaderRC20 FOR LOGIN LoaderRC20;

3. Click Execute.
4. Right-click mySampleDataWarehouse, and choose New Query. A new query window opens.
5. Enter the following T-SQL commands to create a database user named LoaderRC20 for the LoaderRC20
login. The second line grants the new user CONTROL permissions on the new data warehouse. These
permissions are similar to making the user the owner of the database. The third line adds the new user as
a member of the staticrc20 resource class.

CREATE USER LoaderRC20 FOR LOGIN LoaderRC20;


GRANT CONTROL ON DATABASE::[mySampleDataWarehouse] to LoaderRC20;
EXEC sp_addrolemember 'staticrc20', 'LoaderRC20';

6. Click Execute.

Connect to the server as the loading user


The first step toward loading data is to log in as LoaderRC20.
1. In Object Explorer, click the Connect drop down menu and select Database Engine. The Connect to
Server dialog box appears.

2. Enter the fully qualified server name, and enter LoaderRC20 as the Login. Enter your password for
LoaderRC20.
3. Click Connect.
4. When your connection is ready, you will see two server connections in Object Explorer. One connection
as ServerAdmin and one connection as LoaderRC20.
Create external tables for the sample data
You are ready to begin the process of loading data into your new data warehouse. This tutorial shows you how
to use external tables to load New York City taxi cab data from an Azure storage blob. For future reference, to
learn how to get your data to Azure blob storage or to load it directly from your source into SQL Data
Warehouse, see the loading overview.
Run the following SQL scripts to specify information about the data you wish to load. This information includes
where the data is located, the format of the contents of the data, and the table definition for the data.
1. In the previous section, you logged into your data warehouse as LoaderRC20. In SSMS, right-click your
LoaderRC20 connection and select New Query. A new query window appears.
2. Verify your new query window is running as LoaderRC20 and performing queries on your
mySampleDataWarehouse database. Use this query window to perform all of the loading steps.
3. Create a master key for the mySampleDataWarehouse database. You only need to create a master key
once per database.

CREATE MASTER KEY;

4. Run the following CREATE EXTERNAL DATA SOURCE statement to define the location of the Azure
blob. This is the location of the external taxi cab data. To run a command that you have appended to the
query window, highlight the commands you wish to run and click Execute.

CREATE EXTERNAL DATA SOURCE NYTPublic


WITH
(
TYPE = Hadoop,
LOCATION = 'wasbs://2013@nytaxiblob.blob.core.windows.net/'
);

5. Run the following CREATE EXTERNAL FILE FORMAT T-SQL statements to specify formatting
characteristics and options for the external data files. These statements define two formats: one for
uncompressed, comma-delimited text files, and one for pipe-delimited text files compressed with Gzip.
CREATE EXTERNAL FILE FORMAT uncompressedcsv
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '',
DATE_FORMAT = '',
USE_TYPE_DEFAULT = False
)
);
CREATE EXTERNAL FILE FORMAT compressedcsv
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS ( FIELD_TERMINATOR = '|',
STRING_DELIMITER = '',
DATE_FORMAT = '',
USE_TYPE_DEFAULT = False
),
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);

6. Run the following CREATE SCHEMA statement to create a schema for your external file format. The
schema provides a way to organize the external tables you are about to create.

CREATE SCHEMA ext;

7. Create the external tables. The table definitions are stored in SQL Data Warehouse, but the tables
reference data that is stored in Azure blob storage. Run the following T-SQL commands to create several
external tables that all point to the Azure blob we defined previously in our external data source.

CREATE EXTERNAL TABLE [ext].[Date]


(
[DateID] int NOT NULL,
[Date] datetime NULL,
[DateBKey] char(10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[DayOfMonth] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[DaySuffix] varchar(4) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[DayName] varchar(9) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[DayOfWeek] char(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[DayOfWeekInMonth] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[DayOfWeekInYear] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[DayOfQuarter] varchar(3) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[DayOfYear] varchar(3) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[WeekOfMonth] varchar(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[WeekOfQuarter] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[WeekOfYear] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Month] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[MonthName] varchar(9) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[MonthOfQuarter] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Quarter] char(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[QuarterName] varchar(9) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Year] char(4) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[YearName] char(7) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[MonthYear] char(10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[MMYYYY] char(6) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[FirstDayOfMonth] date NULL,
[LastDayOfMonth] date NULL,
[FirstDayOfQuarter] date NULL,
[LastDayOfQuarter] date NULL,
[FirstDayOfYear] date NULL,
[LastDayOfYear] date NULL,
[IsHolidayUSA] bit NULL,
[IsWeekday] bit NULL,
[HolidayUSA] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)
WITH
(
LOCATION = 'Date',
DATA_SOURCE = NYTPublic,
FILE_FORMAT = uncompressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[Geography]
(
[GeographyID] int NOT NULL,
[ZipCodeBKey] varchar(10) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[County] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[City] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[State] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Country] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[ZipCode] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)
WITH
(
LOCATION = 'Geography',
DATA_SOURCE = NYTPublic,
FILE_FORMAT = uncompressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[HackneyLicense]
(
[HackneyLicenseID] int NOT NULL,
[HackneyLicenseBKey] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[HackneyLicenseCode] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)
WITH
(
LOCATION = 'HackneyLicense',
DATA_SOURCE = NYTPublic,
FILE_FORMAT = uncompressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[Medallion]
(
[MedallionID] int NOT NULL,
[MedallionBKey] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[MedallionCode] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
)
WITH
(
LOCATION = 'Medallion',
DATA_SOURCE = NYTPublic,
FILE_FORMAT = uncompressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 0
)
;
CREATE EXTERNAL TABLE [ext].[Time]
(
[TimeID] int NOT NULL,
[TimeBKey] varchar(8) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[HourNumber] tinyint NOT NULL,
[MinuteNumber] tinyint NOT NULL,
[SecondNumber] tinyint NOT NULL,
[TimeInSecond] int NOT NULL,
[HourlyBucket] varchar(15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[DayTimeBucketGroupKey] int NOT NULL,
[DayTimeBucket] varchar(100) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
)
WITH
(
LOCATION = 'Time',
DATA_SOURCE = NYTPublic,
FILE_FORMAT = uncompressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[Trip]
(
[DateID] int NOT NULL,
[MedallionID] int NOT NULL,
[HackneyLicenseID] int NOT NULL,
[PickupTimeID] int NOT NULL,
[DropoffTimeID] int NOT NULL,
[PickupGeographyID] int NULL,
[DropoffGeographyID] int NULL,
[PickupLatitude] float NULL,
[PickupLongitude] float NULL,
[PickupLatLong] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[DropoffLatitude] float NULL,
[DropoffLongitude] float NULL,
[DropoffLatLong] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[PassengerCount] int NULL,
[TripDurationSeconds] int NULL,
[TripDistanceMiles] float NULL,
[PaymentType] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[FareAmount] money NULL,
[SurchargeAmount] money NULL,
[TaxAmount] money NULL,
[TipAmount] money NULL,
[TollsAmount] money NULL,
[TotalAmount] money NULL
)
WITH
(
LOCATION = 'Trip2013',
DATA_SOURCE = NYTPublic,
FILE_FORMAT = compressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[Weather]
(
[DateID] int NOT NULL,
[GeographyID] int NOT NULL,
[PrecipitationInches] float NOT NULL,
[AvgTemperatureFahrenheit] float NOT NULL
)
WITH
(
LOCATION = 'Weather',
DATA_SOURCE = NYTPublic,
FILE_FORMAT = uncompressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 0
)
;

8. In Object Explorer, expand mySampleDataWarehouse to see the list of external tables you just created.
Load the data into your data warehouse
This section uses the external tables you just defined to load the sample data from Azure Storage Blob to SQL
Data Warehouse.

NOTE
This tutorial loads the data directly into the final table. In a production environment, you will usually use CREATE TABLE AS
SELECT to load into a staging table. While data is in the staging table you can perform any necessary transformations. To
append the data in the staging table to a production table, you can use the INSERT...SELECT statement. For more
information, see Inserting data into a production table.
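
For illustration, a minimal sketch of that staging pattern for one of the tables in this tutorial follows. The [stage] schema, the staging table name, and the label are hypothetical, and the final INSERT assumes the production table [dbo].[Trip] already exists.

-- Hypothetical staging pattern; schema, table names, and label are illustrative.
CREATE SCHEMA stage;
GO

-- Load from the external table into a staging table first.
CREATE TABLE [stage].[Trip]
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM [ext].[Trip]
OPTION (LABEL = 'CTAS : Stage [stage].[Trip]');

-- After any transformations, append the staged rows to the existing production table.
INSERT INTO [dbo].[Trip]
SELECT * FROM [stage].[Trip];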

The script uses the CREATE TABLE AS SELECT (CTAS ) T-SQL statement to load the data from Azure Storage
Blob into new tables in your data warehouse. CTAS creates a new table based on the results of a select
statement. The new table has the same columns and data types as the results of the select statement. When the
select statement selects from an external table, SQL Data Warehouse imports the data into a relational table in
the data warehouse.
1. Run the following script to load the data into new tables in your data warehouse.
CREATE TABLE [dbo].[Date]
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM [ext].[Date]
OPTION (LABEL = 'CTAS : Load [dbo].[Date]')
;
CREATE TABLE [dbo].[Geography]
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[Geography]
OPTION (LABEL = 'CTAS : Load [dbo].[Geography]')
;
CREATE TABLE [dbo].[HackneyLicense]
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM [ext].[HackneyLicense]
OPTION (LABEL = 'CTAS : Load [dbo].[HackneyLicense]')
;
CREATE TABLE [dbo].[Medallion]
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM [ext].[Medallion]
OPTION (LABEL = 'CTAS : Load [dbo].[Medallion]')
;
CREATE TABLE [dbo].[Time]
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM [ext].[Time]
OPTION (LABEL = 'CTAS : Load [dbo].[Time]')
;
CREATE TABLE [dbo].[Weather]
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM [ext].[Weather]
OPTION (LABEL = 'CTAS : Load [dbo].[Weather]')
;
CREATE TABLE [dbo].[Trip]
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM [ext].[Trip]
OPTION (LABEL = 'CTAS : Load [dbo].[Trip]')
;

2. View your data as it loads. You’re loading several GBs of data and compressing it into highly performant
clustered columnstore indexes. Run the following query that uses dynamic management views (DMVs)
to show the status of the load. After starting the query, grab a coffee and a snack while SQL Data
Warehouse does some heavy lifting.

SELECT
r.command,
s.request_id,
r.status,
count(distinct input_name) as nbr_files,
sum(s.bytes_processed)/1024/1024/1024.0 as gb_processed
FROM
sys.dm_pdw_exec_requests r
INNER JOIN sys.dm_pdw_dms_external_work s
ON r.request_id = s.request_id
WHERE
r.[label] = 'CTAS : Load [dbo].[Date]' OR
r.[label] = 'CTAS : Load [dbo].[Geography]' OR
r.[label] = 'CTAS : Load [dbo].[HackneyLicense]' OR
r.[label] = 'CTAS : Load [dbo].[Medallion]' OR
r.[label] = 'CTAS : Load [dbo].[Time]' OR
r.[label] = 'CTAS : Load [dbo].[Weather]' OR
r.[label] = 'CTAS : Load [dbo].[Trip]'
GROUP BY
r.command,
s.request_id,
r.status
ORDER BY
nbr_files desc,
gb_processed desc;

3. View all system queries.

SELECT * FROM sys.dm_pdw_exec_requests;

4. Enjoy seeing your data nicely loaded into your data warehouse.
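
As a quick sanity check after the load finishes, you can count the rows in the loaded tables. The queries below are only examples of what to run; the exact counts depend on the dataset.

-- Quick verification of the load: row counts in two of the loaded tables.
SELECT COUNT_BIG(*) AS trip_rows FROM [dbo].[Trip];
SELECT COUNT_BIG(*) AS date_rows FROM [dbo].[Date];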
Authenticate using managed identities to load (optional)
Loading using PolyBase and authenticating through managed identities is the most secure mechanism and
enables you to leverage VNet Service Endpoints with Azure storage.
Prerequisites
1. Install Azure PowerShell using this guide.
2. If you have a general-purpose v1 or blob storage account, you must first upgrade to general-purpose v2
using this guide.
3. You must have Allow trusted Microsoft services to access this storage account turned on under Azure
Storage account Firewalls and Virtual networks settings menu. Refer to this guide for more information.
Steps
1. In PowerShell, register your SQL Database server with Azure Active Directory (AAD ):

Connect-AzAccount
Select-AzSubscription -SubscriptionId your-subscriptionId
Set-AzSqlServer -ResourceGroupName your-database-server-resourceGroup -ServerName your-database-servername -AssignIdentity

2. Create a general-purpose v2 Storage Account using this guide.

NOTE
If you have a general-purpose v1 or blob storage account, you must first upgrade to v2 using this guide.

3. Under your storage account, navigate to Access Control (IAM), and click Add role assignment. Assign
Storage Blob Data Contributor RBAC role to your SQL Database server.

NOTE
Only members with Owner privilege can perform this step. For various built-in roles for Azure resources, refer to
this guide.

4. PolyBase connectivity to the Azure Storage account:


a. Create your database scoped credential with IDENTITY = 'Managed Service Identity':

CREATE DATABASE SCOPED CREDENTIAL msi_cred WITH IDENTITY = 'Managed Service Identity';

NOTE
There is no need to specify SECRET with Azure Storage access key because this mechanism uses
Managed Identity under the covers.
IDENTITY name should be 'Managed Service Identity' for PolyBase connectivity to work with Azure
Storage account.

b. Create the External Data Source specifying the Database Scoped Credential with the Managed
Service Identity (a minimal sketch of this statement follows these steps).
c. Query as normal using external tables.
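
A minimal sketch of step b follows; the data source name, container, and storage account are placeholders, and msi_cred is the database scoped credential created in step a.

CREATE EXTERNAL DATA SOURCE AzureStorageMsi
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://<blob_container_name>@<azure_storage_account_name>.blob.core.windows.net',
    CREDENTIAL = msi_cred -- database scoped credential created in step a
);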
Refer to the following documentation if you'd like to set up virtual network service endpoints for SQL Data
Warehouse.

Clean up resources
You are being charged for compute resources and data that you loaded into your data warehouse. These are
billed separately.
If you want to keep the data in storage, you can pause compute when you aren't using the data warehouse.
By pausing compute you will only be charged for data storage, and you can resume the compute whenever
you are ready to work with the data.
If you want to remove future charges, you can delete the data warehouse.
Follow these steps to clean up resources as you desire.
1. Log in to the Azure portal, click on your data warehouse.
2. To pause compute, click the Pause button. When the data warehouse is paused, you will see a Start
button. To resume compute, click Start.
3. To remove the data warehouse so you won't be charged for compute or storage, click Delete.
4. To remove the SQL server you created, click mynewserver-20180430.database.windows.net on your
data warehouse overview page, and then click Delete. Be careful with this as deleting the server will
delete all databases assigned to the server.
5. To remove the resource group, click myResourceGroup, and then click Delete resource group.

Next steps
In this tutorial, you learned how to create a data warehouse and create a user for loading data. You created
external tables to define the structure for data stored in Azure Storage Blob, and then used the PolyBase
CREATE TABLE AS SELECT statement to load data into your data warehouse.
You did these things:
Created a data warehouse in the Azure portal
Set up a server-level firewall rule in the Azure portal
Connected to the data warehouse with SSMS
Created a user designated for loading data
Created external tables for data in Azure Storage Blob
Used the CTAS T-SQL statement to load data into your data warehouse
Viewed the progress of data as it is loading
Created statistics on the newly loaded data
Advance to the development overview to learn how to migrate an existing database to SQL Data Warehouse.
Design decisions to migrate an existing database to SQL Data Warehouse
Load Contoso Retail data to Azure SQL Data
Warehouse
7/5/2019 • 8 minutes to read • Edit Online

In this tutorial, you learn to use PolyBase and T-SQL commands to load two tables from the Contoso Retail data
into Azure SQL Data Warehouse.
In this tutorial you will:
1. Configure PolyBase to load from Azure blob storage
2. Load public data into your database
3. Perform optimizations after the load is finished.

Before you begin


To run this tutorial, you need an Azure account that already has a SQL Data Warehouse. If you don't have a data
warehouse provisioned, see Create a SQL Data Warehouse and set server-level firewall rule.

1. Configure the data source


PolyBase uses T-SQL external objects to define the location and attributes of the external data. The external object
definitions are stored in SQL Data Warehouse. The data is stored externally.
1.1. Create a credential
Skip this step if you are loading the Contoso public data. You don't need secure access to the public data since it's
already accessible to anyone.
Don't skip this step if you're using this tutorial as a template for loading your own data. To access data through a
credential, use the following script to create a database-scoped credential, and then use it when defining the
location of the data source.
-- A: Create a master key.
-- Only necessary if one does not already exist.
-- Required to encrypt the credential secret in the next step.

CREATE MASTER KEY;

-- B: Create a database scoped credential


-- IDENTITY: Provide any string, it is not used for authentication to Azure storage.
-- SECRET: Provide your Azure storage account key.

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential


WITH
IDENTITY = 'user',
SECRET = '<azure_storage_account_key>'
;

-- C: Create an external data source


-- TYPE: HADOOP - PolyBase uses Hadoop APIs to access data in Azure blob storage.
-- LOCATION: Provide Azure storage account name and blob container name.
-- CREDENTIAL: Provide the credential created in the previous step.

CREATE EXTERNAL DATA SOURCE AzureStorage


WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://<blob_container_name>@<azure_storage_account_name>.blob.core.windows.net',
CREDENTIAL = AzureStorageCredential
);

1.2. Create the external data source


Use this CREATE EXTERNAL DATA SOURCE command to store the location of the data, and the type of data.

CREATE EXTERNAL DATA SOURCE AzureStorage_west_public


WITH
(
TYPE = Hadoop
, LOCATION = 'wasbs://contosoretaildw-tables@contosoretaildw.blob.core.windows.net/'
);

IMPORTANT
If you choose to make your Azure blob storage containers public, remember that as the data owner you will be charged
data egress fees when data leaves the data center.

2. Configure data format


The data is stored in text files in Azure blob storage, and each field is separated with a delimiter. In SSMS, run the
following CREATE EXTERNAL FILE FORMAT command to specify the format of the data in the text files. The
Contoso data is uncompressed and pipe delimited.
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH
( FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS ( FIELD_TERMINATOR = '|'
, STRING_DELIMITER = ''
, DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff'
, USE_TYPE_DEFAULT = FALSE
)
);

3. Create the external tables


Now that you've specified the data source and file format, you're ready to create the external tables.
3.1. Create a schema for the data.
To create a place to store the Contoso data in your database, create a schema.

CREATE SCHEMA [asb]


GO

3.2. Create the external tables.


Run the following script to create the DimProduct and FactOnlineSales external tables. All you're doing here is
defining column names and data types, and binding them to the location and format of the Azure blob storage
files. The definition is stored in SQL Data Warehouse and the data is still in the Azure Storage Blob.
The LOCATION parameter is the folder under the root folder in the Azure Storage Blob. Each table is in a different
folder.

--DimProduct
CREATE EXTERNAL TABLE [asb].DimProduct (
[ProductKey] [int] NOT NULL,
[ProductLabel] [nvarchar](255) NULL,
[ProductName] [nvarchar](500) NULL,
[ProductDescription] [nvarchar](400) NULL,
[ProductSubcategoryKey] [int] NULL,
[Manufacturer] [nvarchar](50) NULL,
[BrandName] [nvarchar](50) NULL,
[ClassID] [nvarchar](10) NULL,
[ClassName] [nvarchar](20) NULL,
[StyleID] [nvarchar](10) NULL,
[StyleName] [nvarchar](20) NULL,
[ColorID] [nvarchar](10) NULL,
[ColorName] [nvarchar](20) NOT NULL,
[Size] [nvarchar](50) NULL,
[SizeRange] [nvarchar](50) NULL,
[SizeUnitMeasureID] [nvarchar](20) NULL,
[Weight] [float] NULL,
[WeightUnitMeasureID] [nvarchar](20) NULL,
[UnitOfMeasureID] [nvarchar](10) NULL,
[UnitOfMeasureName] [nvarchar](40) NULL,
[StockTypeID] [nvarchar](10) NULL,
[StockTypeName] [nvarchar](40) NULL,
[UnitCost] [money] NULL,
[UnitPrice] [money] NULL,
[AvailableForSaleDate] [datetime] NULL,
[StopSaleDate] [datetime] NULL,
[Status] [nvarchar](7) NULL,
[ImageURL] [nvarchar](150) NULL,
[ProductURL] [nvarchar](150) NULL,
[ETLLoadID] [int] NULL,
[LoadDate] [datetime] NULL,
[UpdateDate] [datetime] NULL
)
WITH
(
LOCATION='/DimProduct/'
, DATA_SOURCE = AzureStorage_west_public
, FILE_FORMAT = TextFileFormat
, REJECT_TYPE = VALUE
, REJECT_VALUE = 0
)
;

--FactOnlineSales
CREATE EXTERNAL TABLE [asb].FactOnlineSales
(
[OnlineSalesKey] [int] NOT NULL,
[DateKey] [datetime] NOT NULL,
[StoreKey] [int] NOT NULL,
[ProductKey] [int] NOT NULL,
[PromotionKey] [int] NOT NULL,
[CurrencyKey] [int] NOT NULL,
[CustomerKey] [int] NOT NULL,
[SalesOrderNumber] [nvarchar](20) NOT NULL,
[SalesOrderLineNumber] [int] NULL,
[SalesQuantity] [int] NOT NULL,
[SalesAmount] [money] NOT NULL,
[ReturnQuantity] [int] NOT NULL,
[ReturnAmount] [money] NULL,
[DiscountQuantity] [int] NULL,
[DiscountAmount] [money] NULL,
[TotalCost] [money] NOT NULL,
[UnitCost] [money] NULL,
[UnitPrice] [money] NULL,
[ETLLoadID] [int] NULL,
[LoadDate] [datetime] NULL,
[UpdateDate] [datetime] NULL
)
WITH
(
LOCATION='/FactOnlineSales/'
, DATA_SOURCE = AzureStorage_west_public
, FILE_FORMAT = TextFileFormat
, REJECT_TYPE = VALUE
, REJECT_VALUE = 0
)
;

4. Load the data


There are different ways to access external data. You can query data directly from the external tables, load the data
into new tables in the data warehouse, or add external data to existing data warehouse tables.
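
For example, you can query an external table directly before loading anything; PolyBase reads the files from blob storage at query time, so expect this to be slower than querying a loaded table. The query below is just an illustrative spot check.

SELECT TOP 10 *
FROM [asb].[DimProduct];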
4.1. Create a new schema
CTAS creates a new table that contains data. First, create a schema for the contoso data.

CREATE SCHEMA [cso]


GO

4.2. Load the data into new tables


To load data from Azure blob storage into the data warehouse table, use the CREATE TABLE AS SELECT (Transact-
SQL ) statement. Loading with CTAS leverages the strongly typed external tables you've created. To load the data
into new tables, use one CTAS statement per table.
CTAS creates a new table and populates it with the results of a select statement. CTAS defines the new table to
have the same columns and data types as the results of the select statement. If you select all the columns from an
external table, the new table will be a replica of the columns and data types in the external table.
In this example, we create both the dimension and the fact table as hash distributed tables.

SELECT GETDATE();
GO

CREATE TABLE [cso].[DimProduct]
WITH (DISTRIBUTION = HASH([ProductKey]))
AS SELECT * FROM [asb].[DimProduct]
OPTION (LABEL = 'CTAS : Load [cso].[DimProduct] ');

CREATE TABLE [cso].[FactOnlineSales]
WITH (DISTRIBUTION = HASH([ProductKey]))
AS SELECT * FROM [asb].[FactOnlineSales]
OPTION (LABEL = 'CTAS : Load [cso].[FactOnlineSales] ');

4.3 Track the load progress


You can track the progress of your load using dynamic management views (DMVs).

-- To see all requests


SELECT * FROM sys.dm_pdw_exec_requests;

-- To see a particular request identified by its label


SELECT * FROM sys.dm_pdw_exec_requests as r
WHERE r.[label] = 'CTAS : Load [cso].[DimProduct] '
OR r.[label] = 'CTAS : Load [cso].[FactOnlineSales] '
;

-- To track bytes and files


SELECT
r.command,
s.request_id,
r.status,
count(distinct input_name) as nbr_files,
sum(s.bytes_processed)/1024/1024/1024 as gb_processed
FROM
sys.dm_pdw_exec_requests r
inner join sys.dm_pdw_dms_external_work s
on r.request_id = s.request_id
WHERE
r.[label] = 'CTAS : Load [cso].[DimProduct] '
OR r.[label] = 'CTAS : Load [cso].[FactOnlineSales] '
GROUP BY
r.command,
s.request_id,
r.status
ORDER BY
nbr_files desc,
gb_processed desc;

5. Optimize columnstore compression


By default, SQL Data Warehouse stores the table as a clustered columnstore index. After a load completes, some of
the data rows might not be compressed into the columnstore. There are different reasons why this can happen. To
learn more, see manage columnstore indexes.
To optimize query performance and columnstore compression after a load, rebuild the table to force the
columnstore index to compress all the rows.
SELECT GETDATE();
GO

ALTER INDEX ALL ON [cso].[DimProduct] REBUILD;


ALTER INDEX ALL ON [cso].[FactOnlineSales] REBUILD;

For more information on maintaining columnstore indexes, see the manage columnstore indexes article.
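
To see how the rows are spread across row-group states before and after the rebuild, one option is to query the row-group statistics DMV. This is a sketch that assumes sys.dm_pdw_nodes_db_column_store_row_group_physical_stats exposes the same state_desc and total_rows columns as its SQL Server counterpart.

-- Rows per row-group state (for example COMPRESSED or OPEN) across all distributions.
SELECT state_desc, SUM(total_rows) AS total_rows
FROM sys.dm_pdw_nodes_db_column_store_row_group_physical_stats
GROUP BY state_desc;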

6. Optimize statistics
It's best to create single-column statistics immediately after a load. If you know certain columns aren't going to be
in query predicates, you can skip creating statistics on those columns. If you create single-column statistics on
every column, it might take a long time to rebuild all the statistics.
If you decide to create single-column statistics on every column of every table, you can use the stored procedure
code sample prc_sqldw_create_stats in the statistics article.
The following example is a good starting point for creating statistics. It creates single-column statistics on each
column in the dimension table, and on each joining column in the fact tables. You can always add single or multi-
column statistics to other fact table columns later on.

CREATE STATISTICS [stat_cso_DimProduct_AvailableForSaleDate] ON [cso].[DimProduct]([AvailableForSaleDate]);


CREATE STATISTICS [stat_cso_DimProduct_BrandName] ON [cso].[DimProduct]([BrandName]);
CREATE STATISTICS [stat_cso_DimProduct_ClassID] ON [cso].[DimProduct]([ClassID]);
CREATE STATISTICS [stat_cso_DimProduct_ClassName] ON [cso].[DimProduct]([ClassName]);
CREATE STATISTICS [stat_cso_DimProduct_ColorID] ON [cso].[DimProduct]([ColorID]);
CREATE STATISTICS [stat_cso_DimProduct_ColorName] ON [cso].[DimProduct]([ColorName]);
CREATE STATISTICS [stat_cso_DimProduct_ETLLoadID] ON [cso].[DimProduct]([ETLLoadID]);
CREATE STATISTICS [stat_cso_DimProduct_ImageURL] ON [cso].[DimProduct]([ImageURL]);
CREATE STATISTICS [stat_cso_DimProduct_LoadDate] ON [cso].[DimProduct]([LoadDate]);
CREATE STATISTICS [stat_cso_DimProduct_Manufacturer] ON [cso].[DimProduct]([Manufacturer]);
CREATE STATISTICS [stat_cso_DimProduct_ProductDescription] ON [cso].[DimProduct]([ProductDescription]);
CREATE STATISTICS [stat_cso_DimProduct_ProductKey] ON [cso].[DimProduct]([ProductKey]);
CREATE STATISTICS [stat_cso_DimProduct_ProductLabel] ON [cso].[DimProduct]([ProductLabel]);
CREATE STATISTICS [stat_cso_DimProduct_ProductName] ON [cso].[DimProduct]([ProductName]);
CREATE STATISTICS [stat_cso_DimProduct_ProductSubcategoryKey] ON [cso].[DimProduct]([ProductSubcategoryKey]);
CREATE STATISTICS [stat_cso_DimProduct_ProductURL] ON [cso].[DimProduct]([ProductURL]);
CREATE STATISTICS [stat_cso_DimProduct_Size] ON [cso].[DimProduct]([Size]);
CREATE STATISTICS [stat_cso_DimProduct_SizeRange] ON [cso].[DimProduct]([SizeRange]);
CREATE STATISTICS [stat_cso_DimProduct_SizeUnitMeasureID] ON [cso].[DimProduct]([SizeUnitMeasureID]);
CREATE STATISTICS [stat_cso_DimProduct_Status] ON [cso].[DimProduct]([Status]);
CREATE STATISTICS [stat_cso_DimProduct_StockTypeID] ON [cso].[DimProduct]([StockTypeID]);
CREATE STATISTICS [stat_cso_DimProduct_StockTypeName] ON [cso].[DimProduct]([StockTypeName]);
CREATE STATISTICS [stat_cso_DimProduct_StopSaleDate] ON [cso].[DimProduct]([StopSaleDate]);
CREATE STATISTICS [stat_cso_DimProduct_StyleID] ON [cso].[DimProduct]([StyleID]);
CREATE STATISTICS [stat_cso_DimProduct_StyleName] ON [cso].[DimProduct]([StyleName]);
CREATE STATISTICS [stat_cso_DimProduct_UnitCost] ON [cso].[DimProduct]([UnitCost]);
CREATE STATISTICS [stat_cso_DimProduct_UnitOfMeasureID] ON [cso].[DimProduct]([UnitOfMeasureID]);
CREATE STATISTICS [stat_cso_DimProduct_UnitOfMeasureName] ON [cso].[DimProduct]([UnitOfMeasureName]);
CREATE STATISTICS [stat_cso_DimProduct_UnitPrice] ON [cso].[DimProduct]([UnitPrice]);
CREATE STATISTICS [stat_cso_DimProduct_UpdateDate] ON [cso].[DimProduct]([UpdateDate]);
CREATE STATISTICS [stat_cso_DimProduct_Weight] ON [cso].[DimProduct]([Weight]);
CREATE STATISTICS [stat_cso_DimProduct_WeightUnitMeasureID] ON [cso].[DimProduct]([WeightUnitMeasureID]);
CREATE STATISTICS [stat_cso_FactOnlineSales_CurrencyKey] ON [cso].[FactOnlineSales]([CurrencyKey]);
CREATE STATISTICS [stat_cso_FactOnlineSales_CustomerKey] ON [cso].[FactOnlineSales]([CustomerKey]);
CREATE STATISTICS [stat_cso_FactOnlineSales_DateKey] ON [cso].[FactOnlineSales]([DateKey]);
CREATE STATISTICS [stat_cso_FactOnlineSales_OnlineSalesKey] ON [cso].[FactOnlineSales]([OnlineSalesKey]);
CREATE STATISTICS [stat_cso_FactOnlineSales_ProductKey] ON [cso].[FactOnlineSales]([ProductKey]);
CREATE STATISTICS [stat_cso_FactOnlineSales_PromotionKey] ON [cso].[FactOnlineSales]([PromotionKey]);
CREATE STATISTICS [stat_cso_FactOnlineSales_StoreKey] ON [cso].[FactOnlineSales]([StoreKey]);
Achievement unlocked!
You have successfully loaded public data into Azure SQL Data Warehouse. Great job!
You can now start querying the tables to explore your data. Run the following query to find out total sales per
brand:

SELECT SUM(f.[SalesAmount]) AS [sales_by_brand_amount]


, p.[BrandName]
FROM [cso].[FactOnlineSales] AS f
JOIN [cso].[DimProduct] AS p ON f.[ProductKey] = p.[ProductKey]
GROUP BY p.[BrandName]

Next steps
To load the full data set, run the example load the full Contoso Retail Data Warehouse from the Microsoft SQL
Server Samples repository.
For more development tips, see SQL Data Warehouse development overview.
Load data from Azure Data Lake Storage to SQL
Data Warehouse
8/9/2019 • 7 minutes to read • Edit Online

Use PolyBase external tables to load data from Azure Data Lake Storage into Azure SQL Data Warehouse.
Although you can run adhoc queries on data stored in Data Lake Storage, we recommend importing the data into
the SQL Data Warehouse for best performance.
Create database objects required to load from Data Lake Storage.
Connect to a Data Lake Storage directory.
Load data into Azure SQL Data Warehouse.
If you don't have an Azure subscription, create a free account before you begin.

Before you begin


Before you begin this tutorial, download and install the newest version of SQL Server Management Studio
(SSMS ).
To run this tutorial, you need:
Azure Active Directory Application to use for Service-to-Service authentication. To create, follow Active
directory authentication
An Azure SQL Data Warehouse. See Create and query and Azure SQL Data Warehouse.
A Data Lake Storage account. See Get started with Azure Data Lake Storage.

Create a credential
To access your Data Lake Storage account, you will need to create a Database Master Key to encrypt your
credential secret used in the next step. You then create a Database Scoped Credential. When authenticating using
service principals, the Database Scoped Credential stores the service principal credentials set up in AAD. You can
also use the storage account key in the Database Scoped Credential for Gen2.
To connect to Data Lake Storage using service principals, you must first create an Azure Active Directory
Application, create an access key, and grant the application access to the Data Lake Storage account. For
instructions, see Authenticate to Azure Data Lake Storage Using Active Directory.
-- A: Create a Database Master Key.
-- Only necessary if one does not already exist.
-- Required to encrypt the credential secret in the next step.
-- For more information on Master Key: https://msdn.microsoft.com/library/ms174382.aspx?f=255&MSPPError=-2147217396

CREATE MASTER KEY;

-- B (for service principal authentication): Create a database scoped credential


-- IDENTITY: Pass the client id and OAuth 2.0 Token Endpoint taken from your Azure Active Directory Application
-- SECRET: Provide your AAD Application Service Principal key.
-- For more information on Create Database Scoped Credential: https://msdn.microsoft.com/library/mt270260.aspx

CREATE DATABASE SCOPED CREDENTIAL ADLSCredential


WITH
-- Always use the OAuth 2.0 authorization endpoint (v1)
IDENTITY = '<client_id>@<OAuth_2.0_Token_EndPoint>',
SECRET = '<key>'
;

-- B (for Gen2 storage key authentication): Create a database scoped credential


-- IDENTITY: Provide any string, it is not used for authentication to Azure storage.
-- SECRET: Provide your Azure storage account key.

CREATE DATABASE SCOPED CREDENTIAL ADLSCredential


WITH
IDENTITY = 'user',
SECRET = '<azure_storage_account_key>'
;

-- It should look something like this when authenticating using service principals:
CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH
IDENTITY = '536540b4-4239-45fe-b9a3-629f97591c0c@https://login.microsoftonline.com/42f988bf-85f1-41af-91ab-2d2cd011da47/oauth2/token',
SECRET = 'BjdIlmtKp4Fpyh9hIvr8HJlUida/seM5kQ3EpLAmeDI='
;

Create the external data source


Use this CREATE EXTERNAL DATA SOURCE command to store the location of the data.
-- C (for Gen1): Create an external data source
-- TYPE: HADOOP - PolyBase uses Hadoop APIs to access data in Azure Data Lake Storage.
-- LOCATION: Provide Data Lake Storage Gen1 account name and URI
-- CREDENTIAL: Provide the credential created in the previous step.

CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage


WITH (
TYPE = HADOOP,
LOCATION = 'adl://<datalakestoregen1accountname>.azuredatalakestore.net',
CREDENTIAL = ADLSCredential
);

-- C (for Gen2): Create an external data source


-- TYPE: HADOOP - PolyBase uses Hadoop APIs to access data in Azure Data Lake Storage.
-- LOCATION: Provide Data Lake Storage Gen2 account name and URI
-- CREDENTIAL: Provide the credential created in the previous step.

CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage


WITH (
TYPE = HADOOP,
LOCATION='abfs[s]://<container>@<AzureDataLake account_name>.dfs.core.windows.net', -- Please note the abfss endpoint for when your account has secure transfer enabled
CREDENTIAL = ADLSCredential
);

Configure data format


To import the data from Data Lake Storage, you need to specify the External File Format. This object defines how
the files are written in Data Lake Storage. For the complete list, look at our T-SQL documentation CREATE
EXTERNAL FILE FORMAT

-- D: Create an external file format


-- FIELD_TERMINATOR: Marks the end of each field (column) in a delimited text file
-- STRING_DELIMITER: Specifies the field terminator for data of type string in the text-delimited file.
-- DATE_FORMAT: Specifies a custom format for all date and time data that might appear in a delimited text file.
-- Use_Type_Default: Store missing values as default for datatype.

CREATE EXTERNAL FILE FORMAT TextFileFormat


WITH
( FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS ( FIELD_TERMINATOR = '|'
, STRING_DELIMITER = ''
, DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff'
, USE_TYPE_DEFAULT = FALSE
)
);

Create the External Tables


Now that you have specified the data source and file format, you are ready to create the external tables. External
tables are how you interact with external data. The location parameter can specify a file or a directory. If it specifies
a directory, all files within the directory will be loaded.
-- D: Create an External Table
-- LOCATION: Folder under the Data Lake Storage root folder.
-- DATA_SOURCE: Specifies which Data Source Object to use.
-- FILE_FORMAT: Specifies which File Format Object to use
-- REJECT_TYPE: Specifies how you want to deal with rejected rows. Either Value or percentage of the total
-- REJECT_VALUE: Sets the Reject value based on the reject type.

-- DimProduct
CREATE EXTERNAL TABLE [dbo].[DimProduct_external] (
[ProductKey] [int] NOT NULL,
[ProductLabel] [nvarchar](255) NULL,
[ProductName] [nvarchar](500) NULL
)
WITH
(
LOCATION='/DimProduct/'
, DATA_SOURCE = AzureDataLakeStorage
, FILE_FORMAT = TextFileFormat
, REJECT_TYPE = VALUE
, REJECT_VALUE = 0
)
;

External Table Considerations


Creating an external table is easy, but there are some nuances that need to be discussed.
External Tables are strongly typed. This means that each row of the data being ingested must satisfy the table
schema definition. If a row does not match the schema definition, the row is rejected from the load.
The REJECT_TYPE and REJECT_VALUE options let you define how many rows, or what percentage of the rows, can
be rejected before the load fails. During load, if the reject value is reached, the load fails. The most common
cause of rejected rows is a schema definition mismatch. For example, if a column is incorrectly given the schema of
int when the data in the file is a string, every row will fail to load.
Data Lake Storage Gen1 uses Role Based Access Control (RBAC ) to control access to the data. This means that the
Service Principal must have read permissions to the directories defined in the location parameter and to the
children of the final directory and files. This enables PolyBase to authenticate and load that data.
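
As an example of the reject options, the following is a hypothetical variant of the table above that tolerates up to 5 percent rejected rows, evaluated over samples of 1000 rows; the table name is illustrative.

CREATE EXTERNAL TABLE [dbo].[DimProduct_external_tolerant] (
    [ProductKey] [int] NOT NULL,
    [ProductLabel] [nvarchar](255) NULL,
    [ProductName] [nvarchar](500) NULL
)
WITH
(
    LOCATION='/DimProduct/'
    , DATA_SOURCE = AzureDataLakeStorage
    , FILE_FORMAT = TextFileFormat
    , REJECT_TYPE = PERCENTAGE
    , REJECT_VALUE = 5
    , REJECT_SAMPLE_VALUE = 1000
)
;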

Load the data


To load data from Data Lake Storage use the CREATE TABLE AS SELECT (Transact-SQL ) statement.
CTAS creates a new table and populates it with the results of a select statement. CTAS defines the new table to
have the same columns and data types as the results of the select statement. If you select all the columns from an
external table, the new table is a replica of the columns and data types in the external table.
In this example, we are creating a hash distributed table called DimProduct from our External Table
DimProduct_external.

CREATE TABLE [dbo].[DimProduct]


WITH (DISTRIBUTION = HASH([ProductKey] ) )
AS
SELECT * FROM [dbo].[DimProduct_external]
OPTION (LABEL = 'CTAS : Load [dbo].[DimProduct]');
Optimize columnstore compression
By default, SQL Data Warehouse stores the table as a clustered columnstore index. After a load completes, some of
the data rows might not be compressed into the columnstore. There's a variety of reasons why this can happen. To
learn more, see manage columnstore indexes.
To optimize query performance and columnstore compression after a load, rebuild the table to force the
columnstore index to compress all the rows.

ALTER INDEX ALL ON [dbo].[DimProduct] REBUILD;

Optimize statistics
It is best to create single-column statistics immediately after a load. There are some choices for statistics. For
example, if you create single-column statistics on every column it might take a long time to rebuild all the statistics.
If you know certain columns are not going to be in query predicates, you can skip creating statistics on those
columns.
If you decide to create single-column statistics on every column of every table, you can use the stored procedure
code sample prc_sqldw_create_stats in the statistics article.
The following example is a good starting point for creating statistics. It creates single-column statistics on each
column in the dimension table, and on each joining column in the fact tables. You can always add single or multi-
column statistics to other fact table columns later on.
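
A minimal sketch for the DimProduct table loaded in this tutorial follows; the statistics names are arbitrary, and you would extend the same pattern to any other columns and tables you load.

CREATE STATISTICS [stat_DimProduct_ProductKey] ON [dbo].[DimProduct]([ProductKey]);
CREATE STATISTICS [stat_DimProduct_ProductLabel] ON [dbo].[DimProduct]([ProductLabel]);
CREATE STATISTICS [stat_DimProduct_ProductName] ON [dbo].[DimProduct]([ProductName]);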

Achievement unlocked!
You have successfully loaded data into Azure SQL Data Warehouse. Great job!

Next steps
In this tutorial, you created external tables to define the structure for data stored in Data Lake Storage Gen1, and
then used the PolyBase CREATE TABLE AS SELECT statement to load data into your data warehouse.
You did these things:
Created database objects required to load from Data Lake Storage Gen1.
Connected to a Data Lake Storage Gen1 directory.
Loaded data into Azure SQL Data Warehouse.
Loading data is the first step to developing a data warehouse solution using SQL Data Warehouse. Check out our
development resources.
Learn how to develop tables in SQL Data Warehouse
Tutorial: Load data to Azure SQL Data Warehouse
8/18/2019 • 31 minutes to read • Edit Online

This tutorial uses PolyBase to load the WideWorldImportersDW data warehouse from Azure Blob storage to
Azure SQL Data Warehouse. The tutorial uses the Azure portal and SQL Server Management Studio (SSMS ) to:
Create a data warehouse in the Azure portal
Set up a server-level firewall rule in the Azure portal
Connect to the data warehouse with SSMS
Create a user designated for loading data
Create external tables that use Azure blob as the data source
Use the CTAS T-SQL statement to load data into your data warehouse
View the progress of data as it is loading
Generate a year of data in the date dimension and sales fact tables
Create statistics on the newly loaded data
If you don't have an Azure subscription, create a free account before you begin.

Before you begin


Before you begin this tutorial, download and install the newest version of SQL Server Management Studio
(SSMS ).

Sign in to the Azure portal


Sign in to the Azure portal.

Create a blank SQL Data Warehouse


An Azure SQL Data Warehouse is created with a defined set of compute resources. The database is created within
an Azure resource group and in an Azure SQL logical server.
Follow these steps to create a blank SQL Data Warehouse.
1. Click Create a resource in the upper left-hand corner of the Azure portal.
2. Select Databases from the New page, and select SQL Data Warehouse under Featured on the New
page.
3. Fill out the SQL Data Warehouse form with the following information:

SETTING          SUGGESTED VALUE     DESCRIPTION
Database name    SampleDW            For valid database names, see Database Identifiers.
Subscription     Your subscription   For details about your subscriptions, see Subscriptions.
Resource group   SampleRG            For valid resource group names, see Naming rules and restrictions.
Select source    Blank database      Specifies to create a blank database. Note, a data warehouse is one type of database.
4. Click Server to create and configure a new server for your new database. Fill out the New server form
with the following information:

SETTING              SUGGESTED VALUE            DESCRIPTION
Server name          Any globally unique name   For valid server names, see Naming rules and restrictions.
Server admin login   Any valid name             For valid login names, see Database Identifiers.
Password             Any valid password         Your password must have at least eight characters and must contain characters from three of the following categories: upper case characters, lower case characters, numbers, and non-alphanumeric characters.
Location             Any valid location         For information about regions, see Azure Regions.
5. Click Select.
6. Click Performance tier to specify whether the data warehouse is Gen1 or Gen2, and the number of data
warehouse units.
7. For this tutorial, select the Gen1 service tier. The slider, by default, is set to DW400. Try moving it up and
down to see how it works.

8. Click Apply.
9. In the SQL Data Warehouse page, select a collation for the blank database. For this tutorial, use the default
value. For more information about collations, see Collations
10. Now that you have completed the SQL Database form, click Create to provision the database. Provisioning
takes a few minutes.
11. On the toolbar, click Notifications to monitor the deployment process.

Create a server-level firewall rule


The SQL Data Warehouse service creates a firewall at the server-level that prevents external applications and tools
from connecting to the server or any databases on the server. To enable connectivity, you can add firewall rules
that enable connectivity for specific IP addresses. Follow these steps to create a server-level firewall rule for your
client's IP address.
NOTE
SQL Data Warehouse communicates over port 1433. If you are trying to connect from within a corporate network,
outbound traffic over port 1433 might not be allowed by your network's firewall. If so, you cannot connect to your Azure
SQL Database server unless your IT department opens port 1433.

1. After the deployment completes, click SQL databases from the left-hand menu and then click SampleDW
on the SQL databases page. The overview page for your database opens, showing you the fully qualified
server name (such as sample-svr.database.windows.net) and provides options for further configuration.
2. Copy this fully qualified server name to use when connecting to your server and its databases in
subsequent quick starts.

3. To open the server settings, click the server name.

4. Click Show firewall settings. The Firewall settings page for the SQL Database server opens.
5. To add your current IP address to a new firewall rule, click Add client IP on the toolbar. A firewall rule can
open port 1433 for a single IP address or a range of IP addresses.
6. Click Save. A server-level firewall rule is created for your current IP address opening port 1433 on the
logical server.
7. Click OK and then close the Firewall settings page.
You can now connect to the SQL server and its data warehouses using this IP address. The connection works from
SQL Server Management Studio or another tool of your choice. When you connect, use the serveradmin account
you created previously.

IMPORTANT
By default, access through the SQL Database firewall is enabled for all Azure services. Click OFF on this page and then click
Save to disable the firewall for all Azure services.

Get the fully qualified server name


Get the fully qualified server name for your SQL server in the Azure portal. Later you will use the fully qualified
name when connecting to the server.
1. Sign in to the Azure portal.
2. Select SQL Databases from the left-hand menu, and click your database on the SQL databases page.
3. In the Essentials pane in the Azure portal page for your database, locate and then copy the Server name.
In this example, the fully qualified name is sample-svr.database.windows.net.
Connect to the server as server admin
This section uses SQL Server Management Studio (SSMS ) to establish a connection to your Azure SQL server.
1. Open SQL Server Management Studio.
2. In the Connect to Server dialog box, enter the following information:

SETTING          SUGGESTED VALUE                              DESCRIPTION
Server type      Database engine                              This value is required.
Server name      The fully qualified server name              For example, sample-svr.database.windows.net is a fully qualified server name.
Authentication   SQL Server Authentication                    SQL Authentication is the only authentication type that is configured in this tutorial.
Login            The server admin account                     This is the account that you specified when you created the server.
Password         The password for your server admin account   This is the password that you specified when you created the server.
3. Click Connect. The Object Explorer window opens in SSMS.
4. In Object Explorer, expand Databases. Then expand System databases and master to view the objects in
the master database. Expand SampleDW to view the objects in your new database.

Create a user for loading data


The server admin account is meant to perform management operations, and is not suited for running queries on
user data. Loading data is a memory-intensive operation. Memory maximums are defined according to the
Generation of SQL Data Warehouse you're using, data warehouse units, and resource class.
It's best to create a login and user that is dedicated for loading data. Then add the loading user to a resource class
that enables an appropriate maximum memory allocation.
Since you are currently connected as the server admin, you can create logins and users. Use these steps to create a
login and user called LoaderRC60. Then assign the user to the staticrc60 resource class.
1. In SSMS, right-click master to show a drop-down menu, and choose New Query. A new query window
opens.

2. In the query window, enter these T-SQL commands to create a login and user named LoaderRC60,
substituting your own password for 'a123STRONGpassword!'.

CREATE LOGIN LoaderRC60 WITH PASSWORD = 'a123STRONGpassword!';


CREATE USER LoaderRC60 FOR LOGIN LoaderRC60;

3. Click Execute.
4. Right-click SampleDW, and choose New Query. A new query window opens.
5. Enter the following T-SQL commands to create a database user named LoaderRC60 for the LoaderRC60
login. The second line grants the new user CONTROL permissions on the new data warehouse. These
permissions are similar to making the user the owner of the database. The third line adds the new user as a
member of the staticrc60 resource class.

CREATE USER LoaderRC60 FOR LOGIN LoaderRC60;


GRANT CONTROL ON DATABASE::[SampleDW] to LoaderRC60;
EXEC sp_addrolemember 'staticrc60', 'LoaderRC60';

6. Click Execute.

Connect to the server as the loading user


The first step toward loading data is to log in as LoaderRC60.
1. In Object Explorer, click the Connect drop down menu and select Database Engine. The Connect to
Server dialog box appears.
2. Enter the fully qualified server name, and enter LoaderRC60 as the Login. Enter your password for
LoaderRC60.
3. Click Connect.
4. When your connection is ready, you will see two server connections in Object Explorer. One connection as
ServerAdmin and one connection as LoaderRC60.

Create external tables and objects


You are ready to begin the process of loading data into your new data warehouse. For future reference, to learn
how to get your data to Azure Blob storage or to load it directly from your source into SQL Data Warehouse, see
the loading overview.
Run the following SQL scripts to specify information about the data you wish to load. This information includes
where the data is located, the format of the contents of the data, and the table definition for the data. The data is
located in a global Azure Blob.
1. In the previous section, you logged into your data warehouse as LoaderRC60. In SSMS, right-click
SampleDW under your LoaderRC60 connection and select New Query. A new query window appears.

2. Verify your new query window is running as LoaderRC60 and performing queries on your SampleDW
database. Use this query window to perform all of the loading steps.
3. Create a master key for the SampleDW database. You only need to create a master key once per database.

CREATE MASTER KEY;

4. Run the following CREATE EXTERNAL DATA SOURCE statement to define the location of the Azure blob.
This is the location of the external worldwide importers data. To run a command that you have appended to
the query window, highlight the commands you wish to run and click Execute.

CREATE EXTERNAL DATA SOURCE WWIStorage


WITH
(
TYPE = Hadoop,
LOCATION = 'wasbs://wideworldimporters@sqldwholdata.blob.core.windows.net'
);

5. Run the following CREATE EXTERNAL FILE FORMAT T-SQL statement to specify the formatting
characteristics and options for the external data file. This statement specifies the external data is stored as
text and the values are separated by the pipe ('|') character.
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH
(
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS
(
FIELD_TERMINATOR = '|',
USE_TYPE_DEFAULT = FALSE
)
);

6. Run the following CREATE SCHEMA statements to create a schema for your external file format. The ext
schema provides a way to organize the external tables you are about to create. The wwi schema organizes
the standard tables that will contain the data.

CREATE SCHEMA ext;


GO
CREATE SCHEMA wwi;

7. Create the external tables. The table definitions are stored in SQL Data Warehouse, but the tables reference
data that is stored in Azure blob storage. Run the following T-SQL commands to create several external
tables that all point to the Azure blob you defined previously in the external data source.

CREATE EXTERNAL TABLE [ext].[dimension_City](


[City Key] [int] NOT NULL,
[WWI City ID] [int] NOT NULL,
[City] [nvarchar](50) NOT NULL,
[State Province] [nvarchar](50) NOT NULL,
[Country] [nvarchar](60) NOT NULL,
[Continent] [nvarchar](30) NOT NULL,
[Sales Territory] [nvarchar](50) NOT NULL,
[Region] [nvarchar](30) NOT NULL,
[Subregion] [nvarchar](30) NOT NULL,
[Location] [nvarchar](76) NULL,
[Latest Recorded Population] [bigint] NOT NULL,
[Valid From] [datetime2](7) NOT NULL,
[Valid To] [datetime2](7) NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH (LOCATION='/v1/dimension_City/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[dimension_Customer] (
[Customer Key] [int] NOT NULL,
[WWI Customer ID] [int] NOT NULL,
[Customer] [nvarchar](100) NOT NULL,
[Bill To Customer] [nvarchar](100) NOT NULL,
[Category] [nvarchar](50) NOT NULL,
[Buying Group] [nvarchar](50) NOT NULL,
[Primary Contact] [nvarchar](50) NOT NULL,
[Postal Code] [nvarchar](10) NOT NULL,
[Valid From] [datetime2](7) NOT NULL,
[Valid To] [datetime2](7) NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH (LOCATION='/v1/dimension_Customer/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[dimension_Employee] (
[Employee Key] [int] NOT NULL,
[WWI Employee ID] [int] NOT NULL,
[Employee] [nvarchar](50) NOT NULL,
[Preferred Name] [nvarchar](50) NOT NULL,
[Is Salesperson] [bit] NOT NULL,
[Photo] [varbinary](300) NULL,
[Valid From] [datetime2](7) NOT NULL,
[Valid To] [datetime2](7) NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH ( LOCATION='/v1/dimension_Employee/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[dimension_PaymentMethod] (
[Payment Method Key] [int] NOT NULL,
[WWI Payment Method ID] [int] NOT NULL,
[Payment Method] [nvarchar](50) NOT NULL,
[Valid From] [datetime2](7) NOT NULL,
[Valid To] [datetime2](7) NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH ( LOCATION ='/v1/dimension_PaymentMethod/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[dimension_StockItem](
[Stock Item Key] [int] NOT NULL,
[WWI Stock Item ID] [int] NOT NULL,
[Stock Item] [nvarchar](100) NOT NULL,
[Color] [nvarchar](20) NOT NULL,
[Selling Package] [nvarchar](50) NOT NULL,
[Buying Package] [nvarchar](50) NOT NULL,
[Brand] [nvarchar](50) NOT NULL,
[Size] [nvarchar](20) NOT NULL,
[Lead Time Days] [int] NOT NULL,
[Quantity Per Outer] [int] NOT NULL,
[Is Chiller Stock] [bit] NOT NULL,
[Barcode] [nvarchar](50) NULL,
[Tax Rate] [decimal](18, 3) NOT NULL,
[Unit Price] [decimal](18, 2) NOT NULL,
[Recommended Retail Price] [decimal](18, 2) NULL,
[Typical Weight Per Unit] [decimal](18, 3) NOT NULL,
[Photo] [varbinary](300) NULL,
[Valid From] [datetime2](7) NOT NULL,
[Valid To] [datetime2](7) NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH ( LOCATION ='/v1/dimension_StockItem/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[dimension_Supplier](
[Supplier Key] [int] NOT NULL,
[WWI Supplier ID] [int] NOT NULL,
[Supplier] [nvarchar](100) NOT NULL,
[Category] [nvarchar](50) NOT NULL,
[Primary Contact] [nvarchar](50) NOT NULL,
[Supplier Reference] [nvarchar](20) NULL,
[Payment Days] [int] NOT NULL,
[Postal Code] [nvarchar](10) NOT NULL,
[Valid From] [datetime2](7) NOT NULL,
[Valid To] [datetime2](7) NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH ( LOCATION ='/v1/dimension_Supplier/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[dimension_TransactionType](
[Transaction Type Key] [int] NOT NULL,
[WWI Transaction Type ID] [int] NOT NULL,
[Transaction Type] [nvarchar](50) NOT NULL,
[Valid From] [datetime2](7) NOT NULL,
[Valid To] [datetime2](7) NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH ( LOCATION ='/v1/dimension_TransactionType/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[fact_Movement] (
[Movement Key] [bigint] NOT NULL,
[Date Key] [date] NOT NULL,
[Stock Item Key] [int] NOT NULL,
[Customer Key] [int] NULL,
[Supplier Key] [int] NULL,
[Transaction Type Key] [int] NOT NULL,
[WWI Stock Item Transaction ID] [int] NOT NULL,
[WWI Invoice ID] [int] NULL,
[WWI Purchase Order ID] [int] NULL,
[Quantity] [int] NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH ( LOCATION ='/v1/fact_Movement/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[fact_Order] (
[Order Key] [bigint] NOT NULL,
[City Key] [int] NOT NULL,
[Customer Key] [int] NOT NULL,
[Stock Item Key] [int] NOT NULL,
[Order Date Key] [date] NOT NULL,
[Picked Date Key] [date] NULL,
[Salesperson Key] [int] NOT NULL,
[Picker Key] [int] NULL,
[WWI Order ID] [int] NOT NULL,
[WWI Backorder ID] [int] NULL,
[Description] [nvarchar](100) NOT NULL,
[Package] [nvarchar](50) NOT NULL,
[Quantity] [int] NOT NULL,
[Unit Price] [decimal](18, 2) NOT NULL,
[Tax Rate] [decimal](18, 3) NOT NULL,
[Total Excluding Tax] [decimal](18, 2) NOT NULL,
[Tax Amount] [decimal](18, 2) NOT NULL,
[Total Including Tax] [decimal](18, 2) NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH ( LOCATION ='/v1/fact_Order/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[fact_Purchase] (
[Purchase Key] [bigint] NOT NULL,
[Date Key] [date] NOT NULL,
[Supplier Key] [int] NOT NULL,
[Stock Item Key] [int] NOT NULL,
[WWI Purchase Order ID] [int] NULL,
[Ordered Outers] [int] NOT NULL,
[Ordered Quantity] [int] NOT NULL,
[Received Outers] [int] NOT NULL,
[Package] [nvarchar](50) NOT NULL,
[Is Order Finalized] [bit] NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH ( LOCATION ='/v1/fact_Purchase/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[fact_Sale] (
[Sale Key] [bigint] NOT NULL,
[City Key] [int] NOT NULL,
[Customer Key] [int] NOT NULL,
[Bill To Customer Key] [int] NOT NULL,
[Stock Item Key] [int] NOT NULL,
[Invoice Date Key] [date] NOT NULL,
[Delivery Date Key] [date] NULL,
[Salesperson Key] [int] NOT NULL,
[WWI Invoice ID] [int] NOT NULL,
[Description] [nvarchar](100) NOT NULL,
[Package] [nvarchar](50) NOT NULL,
[Quantity] [int] NOT NULL,
[Unit Price] [decimal](18, 2) NOT NULL,
[Tax Rate] [decimal](18, 3) NOT NULL,
[Total Excluding Tax] [decimal](18, 2) NOT NULL,
[Tax Amount] [decimal](18, 2) NOT NULL,
[Profit] [decimal](18, 2) NOT NULL,
[Total Including Tax] [decimal](18, 2) NOT NULL,
[Total Dry Items] [int] NOT NULL,
[Total Chiller Items] [int] NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH ( LOCATION ='/v1/fact_Sale/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[fact_StockHolding] (
[Stock Holding Key] [bigint] NOT NULL,
[Stock Item Key] [int] NOT NULL,
[Quantity On Hand] [int] NOT NULL,
[Bin Location] [nvarchar](20) NOT NULL,
[Last Stocktake Quantity] [int] NOT NULL,
[Last Cost Price] [decimal](18, 2) NOT NULL,
[Reorder Level] [int] NOT NULL,
[Target Stock Level] [int] NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH ( LOCATION ='/v1/fact_StockHolding/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
CREATE EXTERNAL TABLE [ext].[fact_Transaction] (
[Transaction Key] [bigint] NOT NULL,
[Date Key] [date] NOT NULL,
[Customer Key] [int] NULL,
[Bill To Customer Key] [int] NULL,
[Supplier Key] [int] NULL,
[Transaction Type Key] [int] NOT NULL,
[Payment Method Key] [int] NULL,
[WWI Customer Transaction ID] [int] NULL,
[WWI Supplier Transaction ID] [int] NULL,
[WWI Invoice ID] [int] NULL,
[WWI Purchase Order ID] [int] NULL,
[Supplier Invoice Number] [nvarchar](20) NULL,
[Total Excluding Tax] [decimal](18, 2) NOT NULL,
[Tax Amount] [decimal](18, 2) NOT NULL,
[Total Including Tax] [decimal](18, 2) NOT NULL,
[Outstanding Balance] [decimal](18, 2) NOT NULL,
[Is Finalized] [bit] NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH ( LOCATION ='/v1/fact_Transaction/',
DATA_SOURCE = WWIStorage,
FILE_FORMAT = TextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);

8. In Object Explorer, expand SampleDW to see the list of external tables you created.
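
You can also list the external tables with T-SQL through the catalog views (a minimal sketch):

```sql
-- List the external tables created in the ext schema
SELECT s.name AS schema_name, t.name AS table_name
FROM sys.external_tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE s.name = 'ext';
```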

Load the data into your data warehouse


This section uses the external tables you defined to load the sample data from Azure Blob to SQL Data
Warehouse.
NOTE
This tutorial loads the data directly into the final table. In a production environment, you will usually use CREATE TABLE AS
SELECT to load into a staging table. While data is in the staging table you can perform any necessary transformations. To
append the data in the staging table to a production table, you can use the INSERT...SELECT statement. For more
information, see Inserting data into a production table.
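
As an illustration of that staging pattern, here is a minimal sketch; the staging table name is hypothetical and is not used elsewhere in this tutorial:

```sql
-- Hypothetical staging pattern: land the external data in a staging table first
CREATE TABLE [wwi].[stage_dimension_City]
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT * FROM [ext].[dimension_City];

-- After any transformations, append the staged rows to the production table
INSERT INTO [wwi].[dimension_City]
SELECT * FROM [wwi].[stage_dimension_City];
```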

The script uses the CREATE TABLE AS SELECT (CTAS ) T-SQL statement to load the data from Azure Storage
Blob into new tables in your data warehouse. CTAS creates a new table based on the results of a select statement.
The new table has the same columns and data types as the results of the select statement. When the select
statement selects from an external table, SQL Data Warehouse imports the data into a relational table in the data
warehouse.
This script does not load data into the wwi.dimension_Date and wwi.fact_Sale tables. These tables are generated in a later step so that they contain a sizeable number of rows.
1. Run the following script to load the data into new tables in your data warehouse.

CREATE TABLE [wwi].[dimension_City]


WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[dimension_City]
OPTION (LABEL = 'CTAS : Load [wwi].[dimension_City]')
;

CREATE TABLE [wwi].[dimension_Customer]


WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[dimension_Customer]
OPTION (LABEL = 'CTAS : Load [wwi].[dimension_Customer]')
;

CREATE TABLE [wwi].[dimension_Employee]


WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[dimension_Employee]
OPTION (LABEL = 'CTAS : Load [wwi].[dimension_Employee]')
;

CREATE TABLE [wwi].[dimension_PaymentMethod]


WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[dimension_PaymentMethod]
OPTION (LABEL = 'CTAS : Load [wwi].[dimension_PaymentMethod]')
;

CREATE TABLE [wwi].[dimension_StockItem]


WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[dimension_StockItem]
OPTION (LABEL = 'CTAS : Load [wwi].[dimension_StockItem]')
;

CREATE TABLE [wwi].[dimension_Supplier]


WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[dimension_Supplier]
OPTION (LABEL = 'CTAS : Load [wwi].[dimension_Supplier]')
;

CREATE TABLE [wwi].[dimension_TransactionType]


WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[dimension_TransactionType]
OPTION (LABEL = 'CTAS : Load [wwi].[dimension_TransactionType]')
;

CREATE TABLE [wwi].[fact_Movement]


WITH
(
DISTRIBUTION = HASH([Movement Key]),
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[fact_Movement]
OPTION (LABEL = 'CTAS : Load [wwi].[fact_Movement]')
;

CREATE TABLE [wwi].[fact_Order]


WITH
(
DISTRIBUTION = HASH([Order Key]),
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[fact_Order]
OPTION (LABEL = 'CTAS : Load [wwi].[fact_Order]')
;

CREATE TABLE [wwi].[fact_Purchase]


WITH
(
DISTRIBUTION = HASH([Purchase Key]),
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[fact_Purchase]
OPTION (LABEL = 'CTAS : Load [wwi].[fact_Purchase]')
;

CREATE TABLE [wwi].[seed_Sale]


WITH
(
DISTRIBUTION = HASH([WWI Invoice ID]),
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[fact_Sale]
OPTION (LABEL = 'CTAS : Load [wwi].[seed_Sale]')
;

CREATE TABLE [wwi].[fact_StockHolding]


WITH
(
DISTRIBUTION = HASH([Stock Holding Key]),
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[fact_StockHolding]
OPTION (LABEL = 'CTAS : Load [wwi].[fact_StockHolding]')
;

CREATE TABLE [wwi].[fact_Transaction]


WITH
(
DISTRIBUTION = HASH([Transaction Key]),
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [ext].[fact_Transaction]
OPTION (LABEL = 'CTAS : Load [wwi].[fact_Transaction]')
;

2. View your data as it loads. You’re loading several GBs of data and compressing it into highly performant
clustered columnstore indexes. Open a new query window on SampleDW, and run the following query to
show the status of the load. After starting the query, grab a coffee and a snack while SQL Data Warehouse
does some heavy lifting.

SELECT
r.command,
s.request_id,
r.status,
count(distinct input_name) as nbr_files,
sum(s.bytes_processed)/1024/1024/1024 as gb_processed
FROM
sys.dm_pdw_exec_requests r
INNER JOIN sys.dm_pdw_dms_external_work s
ON r.request_id = s.request_id
WHERE
r.[label] = 'CTAS : Load [wwi].[dimension_City]' OR
r.[label] = 'CTAS : Load [wwi].[dimension_Customer]' OR
r.[label] = 'CTAS : Load [wwi].[dimension_Employee]' OR
r.[label] = 'CTAS : Load [wwi].[dimension_PaymentMethod]' OR
r.[label] = 'CTAS : Load [wwi].[dimension_StockItem]' OR
r.[label] = 'CTAS : Load [wwi].[dimension_Supplier]' OR
r.[label] = 'CTAS : Load [wwi].[dimension_TransactionType]' OR
r.[label] = 'CTAS : Load [wwi].[fact_Movement]' OR
r.[label] = 'CTAS : Load [wwi].[fact_Order]' OR
r.[label] = 'CTAS : Load [wwi].[fact_Purchase]' OR
r.[label] = 'CTAS : Load [wwi].[fact_StockHolding]' OR
r.[label] = 'CTAS : Load [wwi].[fact_Transaction]'
GROUP BY
r.command,
s.request_id,
r.status
ORDER BY
nbr_files desc,
gb_processed desc;
3. View all system queries.

SELECT * FROM sys.dm_pdw_exec_requests;

4. Enjoy seeing your data nicely loaded into your data warehouse.
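
For example, a quick spot check of one of the newly loaded tables (an illustrative query, not part of the load script):

```sql
-- Confirm rows were loaded into one of the new tables
SELECT COUNT(*) AS row_count FROM [wwi].[fact_Order];
```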

Create tables and procedures to generate the Date and Sales tables
This section creates the wwi.dimension_Date and wwi.fact_Sale tables. It also creates stored procedures that can
generate millions of rows in the wwi.dimension_Date and wwi.fact_Sale tables.
1. Create the dimension_Date and fact_Sale tables.
CREATE TABLE [wwi].[dimension_Date]
(
[Date] [datetime] NOT NULL,
[Day Number] [int] NOT NULL,
[Day] [nvarchar](10) NOT NULL,
[Month] [nvarchar](10) NOT NULL,
[Short Month] [nvarchar](3) NOT NULL,
[Calendar Month Number] [int] NOT NULL,
[Calendar Month Label] [nvarchar](20) NOT NULL,
[Calendar Year] [int] NOT NULL,
[Calendar Year Label] [nvarchar](10) NOT NULL,
[Fiscal Month Number] [int] NOT NULL,
[Fiscal Month Label] [nvarchar](20) NOT NULL,
[Fiscal Year] [int] NOT NULL,
[Fiscal Year Label] [nvarchar](10) NOT NULL,
[ISO Week Number] [int] NOT NULL
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED INDEX ([Date])
);
CREATE TABLE [wwi].[fact_Sale]
(
[Sale Key] [bigint] IDENTITY(1,1) NOT NULL,
[City Key] [int] NOT NULL,
[Customer Key] [int] NOT NULL,
[Bill To Customer Key] [int] NOT NULL,
[Stock Item Key] [int] NOT NULL,
[Invoice Date Key] [date] NOT NULL,
[Delivery Date Key] [date] NULL,
[Salesperson Key] [int] NOT NULL,
[WWI Invoice ID] [int] NOT NULL,
[Description] [nvarchar](100) NOT NULL,
[Package] [nvarchar](50) NOT NULL,
[Quantity] [int] NOT NULL,
[Unit Price] [decimal](18, 2) NOT NULL,
[Tax Rate] [decimal](18, 3) NOT NULL,
[Total Excluding Tax] [decimal](18, 2) NOT NULL,
[Tax Amount] [decimal](18, 2) NOT NULL,
[Profit] [decimal](18, 2) NOT NULL,
[Total Including Tax] [decimal](18, 2) NOT NULL,
[Total Dry Items] [int] NOT NULL,
[Total Chiller Items] [int] NOT NULL,
[Lineage Key] [int] NOT NULL
)
WITH
(
DISTRIBUTION = HASH ( [WWI Invoice ID] ),
CLUSTERED COLUMNSTORE INDEX
)

2. Create [wwi].[InitialSalesDataPopulation] to increase the number of rows in [wwi].[seed_Sale] by a factor of eight.
CREATE PROCEDURE [wwi].[InitialSalesDataPopulation] AS
BEGIN
INSERT INTO [wwi].[seed_Sale] (
[Sale Key], [City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], [Invoice Date
Key], [Delivery Date Key], [Salesperson Key], [WWI Invoice ID], [Description], [Package], [Quantity],
[Unit Price], [Tax Rate], [Total Excluding Tax], [Tax Amount], [Profit], [Total Including Tax], [Total
Dry Items], [Total Chiller Items], [Lineage Key]
)
SELECT
[Sale Key], [City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], [Invoice Date
Key], [Delivery Date Key], [Salesperson Key], [WWI Invoice ID], [Description], [Package], [Quantity],
[Unit Price], [Tax Rate], [Total Excluding Tax], [Tax Amount], [Profit], [Total Including Tax], [Total
Dry Items], [Total Chiller Items], [Lineage Key]
FROM [wwi].[seed_Sale]

INSERT INTO [wwi].[seed_Sale] (


[Sale Key], [City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], [Invoice Date
Key], [Delivery Date Key], [Salesperson Key], [WWI Invoice ID], [Description], [Package], [Quantity],
[Unit Price], [Tax Rate], [Total Excluding Tax], [Tax Amount], [Profit], [Total Including Tax], [Total
Dry Items], [Total Chiller Items], [Lineage Key]
)
SELECT
[Sale Key], [City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], [Invoice Date
Key], [Delivery Date Key], [Salesperson Key], [WWI Invoice ID], [Description], [Package], [Quantity],
[Unit Price], [Tax Rate], [Total Excluding Tax], [Tax Amount], [Profit], [Total Including Tax], [Total
Dry Items], [Total Chiller Items], [Lineage Key]
FROM [wwi].[seed_Sale]

INSERT INTO [wwi].[seed_Sale] (


[Sale Key], [City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], [Invoice Date
Key], [Delivery Date Key], [Salesperson Key], [WWI Invoice ID], [Description], [Package], [Quantity],
[Unit Price], [Tax Rate], [Total Excluding Tax], [Tax Amount], [Profit], [Total Including Tax], [Total
Dry Items], [Total Chiller Items], [Lineage Key]
)
SELECT
[Sale Key], [City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], [Invoice Date
Key], [Delivery Date Key], [Salesperson Key], [WWI Invoice ID], [Description], [Package], [Quantity],
[Unit Price], [Tax Rate], [Total Excluding Tax], [Tax Amount], [Profit], [Total Including Tax], [Total
Dry Items], [Total Chiller Items], [Lineage Key]
FROM [wwi].[seed_Sale]
END

3. Create this stored procedure that populates rows into wwi.dimension_Date.

CREATE PROCEDURE [wwi].[PopulateDateDimensionForYear] @Year [int] AS


BEGIN
IF OBJECT_ID('tempdb..#month', 'U') IS NOT NULL
DROP TABLE #month
CREATE TABLE #month (
monthnum int,
numofdays int
)
WITH ( DISTRIBUTION = ROUND_ROBIN, heap )
INSERT INTO #month
SELECT 1, 31 UNION SELECT 2, CASE WHEN (@YEAR % 4 = 0 AND @YEAR % 100 <> 0) OR @YEAR % 400 = 0
THEN 29 ELSE 28 END UNION SELECT 3,31 UNION SELECT 4,30 UNION SELECT 5,31 UNION SELECT 6,30 UNION
SELECT 7,31 UNION SELECT 8,31 UNION SELECT 9,30 UNION SELECT 10,31 UNION SELECT 11,30 UNION SELECT
12,31

IF OBJECT_ID('tempdb..#days', 'U') IS NOT NULL


DROP TABLE #days
CREATE TABLE #days (days int)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)

INSERT INTO #days


SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION
SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 10 UNION SELECT 11 UNION SELECT 12 UNION SELECT 13
UNION SELECT 14 UNION SELECT 15 UNION SELECT 16 UNION SELECT 17 UNION SELECT 18 UNION SELECT 19 UNION
SELECT 20 UNION SELECT 21 UNION SELECT 22 UNION SELECT 23 UNION SELECT 24 UNION SELECT 25 UNION SELECT
26 UNION SELECT 27 UNION SELECT 28 UNION SELECT 29 UNION SELECT 30 UNION SELECT 31

INSERT [wwi].[dimension_Date] (
[Date], [Day Number], [Day], [Month], [Short Month], [Calendar Month Number], [Calendar Month
Label], [Calendar Year], [Calendar Year Label], [Fiscal Month Number], [Fiscal Month Label], [Fiscal
Year], [Fiscal Year Label], [ISO Week Number]
)
SELECT
CAST(CAST(monthnum AS VARCHAR(2)) + '/' + CAST([days] AS VARCHAR(3)) + '/' + CAST(@year AS
CHAR(4)) AS DATE) AS [Date]
,DAY(CAST(CAST(monthnum AS VARCHAR(2)) + '/' + CAST([days] AS VARCHAR(3)) + '/' + CAST(@year AS
CHAR(4)) AS DATE)) AS [Day Number]
,CAST(DATENAME(day, CAST(CAST(monthnum AS VARCHAR(2)) + '/' + CAST([days] AS VARCHAR(3)) + '/'
+ CAST(@year AS CHAR(4)) AS DATE)) AS NVARCHAR(10)) AS [Day]
,CAST(DATENAME(month, CAST(CAST(monthnum AS VARCHAR(2)) + '/' + CAST([days] AS VARCHAR(3)) + '/'
+ CAST(@year as char(4)) AS DATE)) AS nvarchar(10)) AS [Month]
,CAST(SUBSTRING(DATENAME(month, CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as
varchar(3)) + '/' + CAST(@year as char(4)) AS DATE)), 1, 3) AS nvarchar(3)) AS [Short Month]
,MONTH(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' + CAST(@year as
char(4)) AS DATE)) AS [Calendar Month Number]
,CAST(N'CY' + CAST(YEAR(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) +
'/' + CAST(@year as char(4)) AS DATE)) AS nvarchar(4)) + N'-' + SUBSTRING(DATENAME(month,
CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' + CAST(@year as char(4)) AS
DATE)), 1, 3) AS nvarchar(10)) AS [Calendar Month Label]
,YEAR(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' + CAST(@year as
char(4)) AS DATE)) AS [Calendar Year]
,CAST(N'CY' + CAST(YEAR(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) +
'/' + CAST(@year as char(4)) AS DATE)) AS nvarchar(4)) AS nvarchar(10)) AS [Calendar Year Label]
,CASE WHEN MONTH(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' +
CAST(@year as char(4)) AS DATE)) IN (11, 12)
THEN MONTH(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' + CAST(@year
as char(4)) AS DATE)) - 10
ELSE MONTH(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' + CAST(@year
as char(4)) AS DATE)) + 2 END AS [Fiscal Month Number]
,CAST(N'FY' + CAST(CASE WHEN MONTH(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as
varchar(3)) + '/' + CAST(@year as char(4)) AS DATE)) IN (11, 12)
THEN YEAR(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' + CAST(@year as
char(4)) AS DATE)) + 1
ELSE YEAR(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' + CAST(@year as
char(4)) AS DATE)) END AS nvarchar(4)) + N'-' + SUBSTRING(DATENAME(month, CAST(CAST(monthnum as
varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' + CAST(@year as char(4)) AS DATE)), 1, 3) AS
nvarchar(20)) AS [Fiscal Month Label]
,CASE WHEN MONTH(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' +
CAST(@year as char(4)) AS DATE)) IN (11, 12)
THEN YEAR(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' + CAST(@year as
char(4)) AS DATE)) + 1
ELSE YEAR(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' + CAST(@year as
char(4)) AS DATE)) END AS [Fiscal Year]
,CAST(N'FY' + CAST(CASE WHEN MONTH(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as
varchar(3)) + '/' + CAST(@year as char(4)) AS DATE)) IN (11, 12)
THEN YEAR(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' + CAST(@year as
char(4)) AS DATE)) + 1
ELSE YEAR(CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' + CAST(@year as
char(4)) AS DATE))END AS nvarchar(4)) AS nvarchar(10)) AS [Fiscal Year Label]
, DATEPART(ISO_WEEK, CAST(CAST(monthnum as varchar(2)) + '/' + CAST([days] as varchar(3)) + '/' +
CAST(@year as char(4)) AS DATE)) AS [ISO Week Number]
FROM #month m
CROSS JOIN #days d
WHERE d.days <= m.numofdays

DROP table #month;


DROP table #days;
END;
4. Create this procedure that populates the wwi.dimension_Date and wwi.fact_Sale tables. It calls [wwi].
[PopulateDateDimensionForYear] to populate wwi.dimension_Date.

CREATE PROCEDURE [wwi].[Configuration_PopulateLargeSaleTable] @EstimatedRowsPerDay [bigint],@Year [int]


AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;

EXEC [wwi].[PopulateDateDimensionForYear] @Year;

DECLARE @OrderCounter bigint = 0;


DECLARE @NumberOfSalesPerDay bigint = @EstimatedRowsPerDay;
DECLARE @DateCounter date;
DECLARE @StartingSaleKey bigint;
DECLARE @MaximumSaleKey bigint = (SELECT MAX([Sale Key]) FROM wwi.seed_Sale);
DECLARE @MaxDate date;
SET @MaxDate = (SELECT MAX([Invoice Date Key]) FROM wwi.fact_Sale)
IF ( @MaxDate < CAST(@YEAR AS CHAR(4)) + '1231') AND (@MaxDate > CAST(@YEAR AS CHAR(4)) + '0101')
SET @DateCounter = @MaxDate
ELSE
SET @DateCounter= CAST(@Year as char(4)) + '0101';

PRINT 'Targeting ' + CAST(@NumberOfSalesPerDay AS varchar(20)) + ' sales per day.';

DECLARE @OutputCounter varchar(20);


DECLARE @variance DECIMAL(18,10);
DECLARE @VariantNumberOfSalesPerDay BIGINT;

WHILE @DateCounter < CAST(@YEAR AS CHAR(4)) + '1231'


BEGIN
SET @OutputCounter = CONVERT(varchar(20), @DateCounter, 112);
RAISERROR(@OutputCounter, 0, 1);
SET @variance = (SELECT RAND() * 10)*.01 + .95
SET @VariantNumberOfSalesPerDay = FLOOR(@NumberOfSalesPerDay * @variance)

SET @StartingSaleKey = @MaximumSaleKey - @VariantNumberOfSalesPerDay - FLOOR(RAND() * 20000);


SET @OrderCounter = 0;

INSERT [wwi].[fact_Sale] (
[City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], [Invoice Date Key],
[Delivery Date Key], [Salesperson Key], [WWI Invoice ID], [Description], Package, Quantity, [Unit
Price], [Tax Rate], [Total Excluding Tax], [Tax Amount], Profit, [Total Including Tax], [Total Dry
Items], [Total Chiller Items], [Lineage Key]
)
SELECT TOP(@VariantNumberOfSalesPerDay)
[City Key], [Customer Key], [Bill To Customer Key], [Stock Item Key], @DateCounter,
DATEADD(day, 1, @DateCounter), [Salesperson Key], [WWI Invoice ID], [Description], Package, Quantity,
[Unit Price], [Tax Rate], [Total Excluding Tax], [Tax Amount], Profit, [Total Including Tax], [Total
Dry Items], [Total Chiller Items], [Lineage Key]
FROM [wwi].[seed_Sale]
WHERE
--[Sale Key] > @StartingSaleKey and /* IDENTITY DOES NOT WORK THE SAME IN SQLDW AND CAN'T
USE THIS METHOD FOR VARIANT */
[Invoice Date Key] >=cast(@YEAR AS CHAR(4)) + '-01-01'
ORDER BY [Sale Key];

SET @DateCounter = DATEADD(day, 1, @DateCounter);


END;

END;

Generate millions of rows


Use the stored procedures you created to generate millions of rows in the wwi.fact_Sale table, and corresponding
data in the wwi.dimension_Date table.
1. Run this procedure to seed the [wwi].[seed_Sale] with more rows.

EXEC [wwi].[InitialSalesDataPopulation]

2. Run this procedure to populate wwi.fact_Sale with 100,000 rows per day for each day in the year 2000.

EXEC [wwi].[Configuration_PopulateLargeSaleTable] 100000, 2000

3. The data generation in the previous step might take a while as it progresses through the year. To see which
day the current process is on, open a new query and run this SQL command:

SELECT MAX([Invoice Date Key]) FROM wwi.fact_Sale;

4. Run the following command to see the space used.

EXEC sp_spaceused N'wwi.fact_Sale';

Populate the replicated table cache


SQL Data Warehouse replicates a table by caching the data to each Compute node. The cache gets populated
when a query runs against the table. Therefore, the first query on a replicated table might require extra time to
populate the cache. After the cache is populated, queries on replicated tables run faster.
Run these SQL queries to populate the replicated table cache on the Compute nodes.

```sql
SELECT TOP 1 * FROM [wwi].[dimension_City];
SELECT TOP 1 * FROM [wwi].[dimension_Customer];
SELECT TOP 1 * FROM [wwi].[dimension_Date];
SELECT TOP 1 * FROM [wwi].[dimension_Employee];
SELECT TOP 1 * FROM [wwi].[dimension_PaymentMethod];
SELECT TOP 1 * FROM [wwi].[dimension_StockItem];
SELECT TOP 1 * FROM [wwi].[dimension_Supplier];
SELECT TOP 1 * FROM [wwi].[dimension_TransactionType];
```
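
To check whether the cache has been built for each replicated table, you can query the cache-state catalog view (a sketch; it assumes sys.pdw_replicated_table_cache_state is available in your service version):

```sql
-- Shows the cache state (for example, NotReady or Ready) for each replicated table
SELECT o.name AS table_name, c.state AS cache_state
FROM sys.pdw_replicated_table_cache_state c
JOIN sys.objects o ON c.object_id = o.object_id;
```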

Create statistics on newly loaded data


To achieve high query performance, it's important to create statistics on each column of each table after the first
load. It's also important to update statistics after substantial changes in the data.
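
For example, a single statistics object can be created and refreshed like this (illustrative names based on the tables loaded above); the stored procedure in the next step automates this across all columns:

```sql
-- Create single-column statistics on one of the loaded tables
CREATE STATISTICS stat_wwi_fact_Sale_City_Key ON [wwi].[fact_Sale] ([City Key]);

-- Refresh the statistics after substantial data changes
UPDATE STATISTICS [wwi].[fact_Sale] (stat_wwi_fact_Sale_City_Key);
```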
1. Create this stored procedure that updates statistics on all columns of all tables.

CREATE PROCEDURE [dbo].[prc_sqldw_create_stats]


( @create_type tinyint -- 1 default 2 Fullscan 3 Sample
, @sample_pct tinyint
)
AS

IF @create_type IS NULL
BEGIN
SET @create_type = 1;
END;
IF @create_type NOT IN (1,2,3)
BEGIN
THROW 151000,'Invalid value for @create_type parameter. Valid range 1 (default), 2 (fullscan) or 3 (sample).',1;
END;

IF @sample_pct IS NULL
BEGIN;
SET @sample_pct = 20;
END;

IF OBJECT_ID('tempdb..#stats_ddl') IS NOT NULL


BEGIN;
DROP TABLE #stats_ddl;
END;

CREATE TABLE #stats_ddl


WITH ( DISTRIBUTION = HASH([seq_nmbr])
, LOCATION = USER_DB
)
AS
WITH T
AS
(
SELECT t.[name] AS [table_name]
, s.[name] AS [table_schema_name]
, c.[name] AS [column_name]
, c.[column_id] AS [column_id]
, t.[object_id] AS [object_id]
, ROW_NUMBER()
OVER(ORDER BY (SELECT NULL)) AS [seq_nmbr]
FROM sys.[tables] t
JOIN sys.[schemas] s ON t.[schema_id] = s.[schema_id]
JOIN sys.[columns] c ON t.[object_id] = c.[object_id]
LEFT JOIN sys.[stats_columns] l ON l.[object_id] = c.[object_id]
AND l.[column_id] = c.[column_id]
AND l.[stats_column_id] = 1
LEFT JOIN sys.[external_tables] e ON e.[object_id] = t.[object_id]
WHERE l.[object_id] IS NULL
AND e.[object_id] IS NULL -- not an external table
)
SELECT [table_schema_name]
, [table_name]
, [column_name]
, [column_id]
, [object_id]
, [seq_nmbr]
, CASE @create_type
WHEN 1
THEN CAST('CREATE STATISTICS '+QUOTENAME('stat_'+table_schema_name+ '_' + table_name +
'_'+column_name)+' ON
'+QUOTENAME(table_schema_name)+'.'+QUOTENAME(table_name)+'('+QUOTENAME(column_name)+')' AS
VARCHAR(8000))
WHEN 2
THEN CAST('CREATE STATISTICS '+QUOTENAME('stat_'+table_schema_name+ '_' + table_name +
'_'+column_name)+' ON
'+QUOTENAME(table_schema_name)+'.'+QUOTENAME(table_name)+'('+QUOTENAME(column_name)+') WITH FULLSCAN'
AS VARCHAR(8000))
WHEN 3
THEN CAST('CREATE STATISTICS '+QUOTENAME('stat_'+table_schema_name+ '_' + table_name +
'_'+column_name)+' ON
'+QUOTENAME(table_schema_name)+'.'+QUOTENAME(table_name)+'('+QUOTENAME(column_name)+') WITH SAMPLE
'+CONVERT(varchar(4),@sample_pct)+' PERCENT' AS VARCHAR(8000))
END AS create_stat_ddl
FROM T
;

DECLARE @i INT = 1
, @t INT = (SELECT COUNT(*) FROM #stats_ddl)
, @s NVARCHAR(4000) = N''
;

WHILE @i <= @t
BEGIN
SET @s=(SELECT create_stat_ddl FROM #stats_ddl WHERE seq_nmbr = @i);
PRINT @s
EXEC sp_executesql @s
SET @i+=1;
END

DROP TABLE #stats_ddl;

2. Run this command to create statistics on all columns of all tables in the data warehouse.

EXEC [dbo].[prc_sqldw_create_stats] 1, NULL;

Clean up resources
You are being charged for compute resources and data that you loaded into your data warehouse. These are billed
separately.
Follow these steps to clean up resources as you desire.
1. Sign in to the Azure portal, click on your data warehouse.

2. If you want to keep the data in storage, you can pause compute when you aren't using the data warehouse.
By pausing compute, you will only be charged for data storage, and you can resume compute whenever
you are ready to work with the data. To pause compute, click the Pause button. When the data warehouse is
paused, you will see a Start button. To resume compute, click Start.
3. If you want to remove future charges, you can delete the data warehouse. To remove the data warehouse so
you won't be charged for compute or storage, click Delete.
4. To remove the SQL server you created, select the server (sample-svr.database.windows.net), and then click Delete. Be careful with this step, because deleting the server also deletes all databases assigned to the server.
5. To remove the resource group, click SampleRG, and then click Delete resource group.

Next steps
In this tutorial, you learned how to create a data warehouse and create a user for loading data. You created
external tables to define the structure for data stored in Azure Storage Blob, and then used the PolyBase CREATE
TABLE AS SELECT statement to load data into your data warehouse.
You did these things:
Created a data warehouse in the Azure portal
Set up a server-level firewall rule in the Azure portal
Connected to the data warehouse with SSMS
Created a user designated for loading data
Created external tables for data in Azure Storage Blob
Used the CTAS T-SQL statement to load data into your data warehouse
Viewed the progress of the data while it was loading
Created statistics on the newly loaded data
Advance to the development overview to learn how to migrate an existing database to SQL Data Warehouse.
Design decisions to migrate an existing database to SQL Data Warehouse
Connect to Azure SQL Data Warehouse
7/24/2019 • 2 minutes to read • Edit Online

Get connected to Azure SQL Data Warehouse.

Find your server name


The server name in the following example is samplesvr.database.windows.net. To find the fully qualified server
name:
1. Go to the Azure portal.
2. Click on SQL data warehouses.
3. Click on the data warehouse you want to connect to.
4. Locate the full server name.

Supported drivers and connection strings


Azure SQL Data Warehouse supports ADO.NET, ODBC, PHP, and JDBC. To find the latest version and documentation, click on one of the preceding drivers. To automatically generate the connection string for the driver that you are using, click Show database connection strings in the Azure portal. The following are examples of what a connection string looks like for each driver.

NOTE
Consider setting the connection timeout to 300 seconds to allow your connection to survive short periods of unavailability.

ADO.NET connection string example

Server=tcp:{your_server}.database.windows.net,1433;Database={your_database};User ID={your_user_name};Password=
{your_password_here};Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;
ODBC connection string example

Driver={SQL Server Native Client 11.0};Server=tcp:{your_server}.database.windows.net,1433;Database=


{your_database};Uid={your_user_name};Pwd={your_password_here};Encrypt=yes;TrustServerCertificate=no;Connection
Timeout=30;

PHP connection string example

Server: {your_server}.database.windows.net,1433 \r\nSQL Database: {your_database}\r\nUser Name:


{your_user_name}\r\n\r\nPHP Data Objects(PDO) Sample Code:\r\n\r\ntry {\r\n $conn = new PDO (
\"sqlsrv:server = tcp:{your_server}.database.windows.net,1433; Database = {your_database}\", \"
{your_user_name}\", \"{your_password_here}\");\r\n $conn->setAttribute( PDO::ATTR_ERRMODE,
PDO::ERRMODE_EXCEPTION );\r\n}\r\ncatch ( PDOException $e ) {\r\n print( \"Error connecting to SQL Server.\"
);\r\n die(print_r($e));\r\n}\r\n\rSQL Server Extension Sample Code:\r\n\r\n$connectionInfo = array(\"UID\"
=> \"{your_user_name}\", \"pwd\" => \"{your_password_here}\", \"Database\" => \"{your_database}\",
\"LoginTimeout\" => 30, \"Encrypt\" => 1, \"TrustServerCertificate\" => 0);\r\n$serverName = \"tcp:
{your_server}.database.windows.net,1433\";\r\n$conn = sqlsrv_connect($serverName, $connectionInfo);

JDBC connection string example

jdbc:sqlserver://yourserver.database.windows.net:1433;database=yourdatabase;user={your_user_name};password=
{your_password_here};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;lo
ginTimeout=30;

Connection settings
SQL Data Warehouse standardizes some settings during connection and object creation. These settings cannot be
overridden and include:

DATABASE SETTING VALUE

ANSI_NULLS ON

QUOTED_IDENTIFIERS ON

DATEFORMAT mdy

DATEFIRST 7
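
As a quick sanity check after connecting, you can confirm one of these fixed settings with standard T-SQL (a minimal sketch):

```sql
-- Returns 7, matching the fixed DATEFIRST setting above
SELECT @@DATEFIRST AS date_first;
```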

Next steps
To connect and query with Visual Studio, see Query with Visual Studio. To learn more about authentication
options, see Authentication to Azure SQL Data Warehouse.
Connection strings for Azure SQL Data Warehouse
7/24/2019 • 2 minutes to read • Edit Online

You can connect to SQL Data Warehouse with several different application protocols such as, ADO.NET, ODBC,
PHP and JDBC. Below are some examples of connections strings for each protocol. You can also use the Azure
portal to build your connection string. To build your connection string using the Azure portal, navigate to your
database blade, under Essentials click on Show database connection strings.

Sample ADO.NET connection string


Server=tcp:{your_server}.database.windows.net,1433;Database={your_database};User ID={your_user_name};Password=
{your_password_here};Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;

Sample ODBC connection string


Driver={SQL Server Native Client 11.0};Server=tcp:{your_server}.database.windows.net,1433;Database=
{your_database};Uid={your_user_name};Pwd={your_password_here};Encrypt=yes;TrustServerCertificate=no;Connection
Timeout=30;

Sample PHP connection string


Server: {your_server}.database.windows.net,1433 \r\nSQL Database: {your_database}\r\nUser Name:
{your_user_name}\r\n\r\nPHP Data Objects(PDO) Sample Code:\r\n\r\ntry {\r\n $conn = new PDO (
\"sqlsrv:server = tcp:{your_server}.database.windows.net,1433; Database = {your_database}\", \"
{your_user_name}\", \"{your_password_here}\");\r\n $conn->setAttribute( PDO::ATTR_ERRMODE,
PDO::ERRMODE_EXCEPTION );\r\n}\r\ncatch ( PDOException $e ) {\r\n print( \"Error connecting to SQL Server.\"
);\r\n die(print_r($e));\r\n}\r\n\rSQL Server Extension Sample Code:\r\n\r\n$connectionInfo = array(\"UID\"
=> \"{your_user_name}\", \"pwd\" => \"{your_password_here}\", \"Database\" => \"{your_database}\",
\"LoginTimeout\" => 30, \"Encrypt\" => 1, \"TrustServerCertificate\" => 0);\r\n$serverName = \"tcp:
{your_server}.database.windows.net,1433\";\r\n$conn = sqlsrv_connect($serverName, $connectionInfo);

Sample JDBC connection string


jdbc:sqlserver://yourserver.database.windows.net:1433;database=yourdatabase;user={your_user_name};password=
{your_password_here};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;lo
ginTimeout=30;

NOTE
Consider setting the connection timeout to 300 seconds in order to allow the connection to survive short periods of
unavailability.

Next steps
To start querying your data warehouse with Visual Studio and other applications, see Query with Visual Studio.
Getting started with Visual Studio 2019 for SQL Data
Warehouse
10/17/2019 • 2 minutes to read • Edit Online

Visual Studio 2019 SQL Server Data Tools (SSDT) is a single tool allowing you to do the following:
Connect, query, and develop applications for SQL Data Warehouse
Leverage an object explorer to visually explore all objects in your data model, including tables, views, stored procedures, and more
Generate T-SQL data definition language (DDL ) scripts for your objects
Develop your data warehouse using a state-based approach with SSDT Database Projects
Integrate your database project with source control systems such as Git with Azure DevOps Repos
Set up continuous integration and deployment pipelines with automation servers such as Azure DevOps

NOTE
Currently Visual Studio SSDT Database Projects is in preview. To receive periodic updates on this feature, please vote on
UserVoice.

Install Visual Studio 2019 Preview


See Download Visual Studio 2019 to download and install Visual Studio 16.3 and above. During install, select the
data storage and processing workload. Standalone SSDT installation is no longer required in Visual Studio 2019.

Reporting issues with SSDT Visual Studio 2019 (preview)


To report issues when using SSDT with SQL Data Warehouse, email the following email distribution list:
sqldwssdtpreview@service.microsoft.com

Next steps
Now that you have the latest version of SSDT, you're ready to connect to your SQL Data Warehouse.
Source Control Integration for Azure SQL Data
Warehouse
8/28/2019 • 2 minutes to read • Edit Online

This tutorial outlines how to integrate your SQL Server Data tools (SSDT) database project with source control.
Source control integration is the first step in building your continuous integration and deployment pipeline with
SQL Data Warehouse.

Before you begin


Sign up for an Azure DevOps organization
Go through the Create and Connect tutorial
Install Visual Studio 2019

Set up and connect to Azure DevOps


1. In your Azure DevOps Organization, create a project that will host your SSDT database project via an Azure
Repo repository

2. Open Visual Studio and connect to your Azure DevOps organization and project from step 1 by selecting
“Manage Connections”
3. Clone your Azure Repo repository from your project to your local machine

Create and connect your project


1. In Visual Studio, create a new SQL Server Database Project with both a directory and local Git repository in
your local cloned repository
2. Right-click on your empty sqlproject and import your data warehouse into the database project

3. In Team Explorer in Visual Studio, commit all your changes to your local Git repository
4. Now that you have the changes committed locally in the cloned repository, sync and push your changes to
your Azure Repo repository in your Azure DevOps project.

Validation
1. Verify changes have been pushed to your Azure Repo by updating a table column in your database project
from Visual Studio SQL Server Data Tools (SSDT)
2. Commit and push the change from your local repository to your Azure Repo

3. Verify the change has been pushed in your Azure Repo repository

4. (Optional) Use Schema Compare and update the changes to your target data warehouse using SSDT to
ensure the object definitions in your Azure Repo repository and local repository reflect your data warehouse

Next steps
Developing for Azure SQL Data Warehouse
Continuous integration and deployment for Azure
SQL Data Warehouse
8/29/2019 • 2 minutes to read • Edit Online

This simple tutorial outlines how to integrate your SQL Server Data tools (SSDT) database project with Azure
DevOps and leverage Azure Pipelines to set up continuous integration and deployment. This tutorial is the second
step in building your continuous integration and deployment pipeline with SQL Data Warehouse.

Before you begin


Go through the source control integration tutorial
Create a self-hosted agent that has the SSDT preview bits (16.3 preview 2 and higher) installed for SQL Data Warehouse (preview)
Set up and connect to Azure DevOps

NOTE
SSDT is currently in preview where you will need to leverage a self-hosted agent. The Microsoft-hosted agents will be
updated in the next few months.

Continuous integration with Visual Studio build


1. Navigate to Azure Pipelines and create a new build pipeline
2. Select your source code repository (Azure Repos Git) and select the .NET Desktop app template

3. Edit your YAML file to use the agent pool that contains your self-hosted agent.

At this point, you have a simple environment where any check-in to your source control repository master branch
should automatically trigger a successful Visual Studio build of your database project. Validate the automation is
working end to end by making a change in your local database project and checking in that change to your master
branch.

Continuous deployment with the Azure SQL Database deployment task


1. Add a new task using the Azure SQL Database deployment task and fill in the required fields to connect to
your target data warehouse. When this task runs, the DACPAC generated from the previous build process is
deployed to the target data warehouse.
2. When using a self-hosted agent, make sure you set your environment variable to use the correct
SqlPackage.exe for SQL Data Warehouse. The path should look something like this:

C:\Program Files (x86)\Microsoft Visual Studio\2019\Preview\Common7\IDE\Extensions\Microsoft\SQLDB\DAC\150

Run and validate your pipeline. You can make changes locally and check in changes to your source control, which should trigger an automatic build and deployment.

Next steps
Explore Azure SQL Data Warehouse architecture
Quickly create a SQL Data Warehouse
Load sample data.
Explore Videos
Connect to SQL Data Warehouse with Visual Studio
and SSDT
8/18/2019 • 2 minutes to read • Edit Online

Use Visual Studio to query Azure SQL Data Warehouse in just a few minutes. This method uses the SQL Server
Data Tools (SSDT) extension in Visual Studio 2019.

Prerequisites
To use this tutorial, you need:
An existing SQL Data Warehouse. To create one, see Create a SQL Data Warehouse.
SSDT for Visual Studio. If you have Visual Studio, you probably already have this. For installation instructions
and options, see Installing Visual Studio and SSDT.
The fully qualified SQL server name. To find this, see Connect to SQL Data Warehouse.

1. Connect to your SQL Data Warehouse


1. Open Visual Studio 2019.
2. Open SQL Server Object Explorer. To do this, select View > SQL Server Object Explorer.

3. Click the Add SQL Server icon.

4. Fill in the fields in the Connect to Server window.


Server name. Enter the server name previously identified.
Authentication. Select SQL Server Authentication or Active Directory Integrated
Authentication.
User Name and Password. Enter user name and password if SQL Server Authentication was selected
above.
Click Connect.
5. To explore, expand your Azure SQL server. You can view the databases associated with the server. Expand
AdventureWorksDW to see the tables in your sample database.
2. Run a sample query
Now that a connection has been established to your database, let's write a query.
1. Right-click your database in SQL Server Object Explorer.
2. Select New Query. A new query window opens.

3. Copy this TSQL query into the query window:

SELECT COUNT(*) FROM dbo.FactInternetSales;

4. Run the query. To do this, click the green arrow or use the following shortcut: CTRL + SHIFT + E .

5. Look at the query results. In this example, the FactInternetSales table has 60398 rows.

Next steps
Now that you can connect and query, try visualizing the data with PowerBI.
To configure your environment for Azure Active Directory authentication, see Authenticate to SQL Data
Warehouse.
Connect to SQL Data Warehouse with SQL Server
Management Studio (SSMS)
8/18/2019 • 2 minutes to read • Edit Online

Use SQL Server Management Studio (SSMS ) to connect to and query Azure SQL Data Warehouse.

Prerequisites
To use this tutorial, you need:
An existing SQL Data Warehouse. To create one, see Create a SQL Data Warehouse.
SQL Server Management Studio (SSMS ) installed. Install SSMS for free if you don't already have it.
The fully qualified SQL server name. To find this, see Connect to SQL Data Warehouse.

1. Connect to your SQL Data Warehouse


1. Open SSMS.
2. Open Object Explorer. To do this, select File > Connect Object Explorer.

3. Fill in the fields in the Connect to Server window.


Server name. Enter the server name previously identified.
Authentication. Select SQL Server Authentication or Active Directory Integrated
Authentication.
User Name and Password. Enter user name and password if SQL Server Authentication was selected
above.
Click Connect.
4. To explore, expand your Azure SQL server. You can view the databases associated with the server. Expand
AdventureWorksDW to see the tables in your sample database.

2. Run a sample query


Now that a connection has been established to your database, let's write a query.
1. Right-click your database in SQL Server Object Explorer.
2. Select New Query. A new query window opens.
3. Copy this TSQL query into the query window:

SELECT COUNT(*) FROM dbo.FactInternetSales;

4. Run the query. To do this, click Execute or use the following shortcut: F5 .

5. Look at the query results. In this example, the FactInternetSales table has 60398 rows.

Next steps
Now that you can connect and query, try visualizing the data with PowerBI.
To configure your environment for Azure Active Directory authentication, see Authenticate to SQL Data
Warehouse.
Connect to SQL Data Warehouse with sqlcmd
7/24/2019 • 2 minutes to read • Edit Online

Use sqlcmd command-line utility to connect to and query an Azure SQL Data Warehouse.

1. Connect
To get started with sqlcmd, open the command prompt and enter sqlcmd followed by the connection string for
your SQL Data Warehouse database. The connection string requires the following parameters:
Server (-S): Server in the form <Server Name>.database.windows.net
Database (-d): Database name.
Enable Quoted Identifiers (-I): Quoted identifiers must be enabled to connect to a SQL Data Warehouse instance.
To use SQL Server Authentication, you need to add the username/password parameters:
User (-U): Server user in the form <User>
Password (-P): Password associated with the user.
For example, your connection string might look like the following:

C:\>sqlcmd -S MySqlDw.database.windows.net -d Adventure_Works -U myuser -P myP@ssword -I

To use Azure Active Directory Integrated authentication, you need to add the Azure Active Directory parameters:
Azure Active Directory Authentication (-G): use Azure Active Directory for authentication
For example, your connection string might look like the following:

C:\>sqlcmd -S MySqlDw.database.windows.net -d Adventure_Works -G -I

NOTE
You need to enable Azure Active Directory Authentication to authenticate using Active Directory.

2. Query
After connection, you can issue any supported Transact-SQL statements against the instance. In this example,
queries are submitted in interactive mode.

C:\>sqlcmd -S MySqlDw.database.windows.net -d Adventure_Works -U myuser -P myP@ssword -I


1> SELECT name FROM sys.tables;
2> GO
3> QUIT

These next examples show how you can run your queries in batch mode using the -Q option or piping your SQL
to sqlcmd.
sqlcmd -S MySqlDw.database.windows.net -d Adventure_Works -U myuser -P myP@ssword -I -Q "SELECT name FROM sys.tables;"

"SELECT name FROM sys.tables;" | sqlcmd -S MySqlDw.database.windows.net -d Adventure_Works -U myuser -P myP@ssword -I > .\tables.out

Next steps
See sqlcmd documentation for more about details about the options available in sqlcmd.
Analyze your workload in Azure SQL Data
Warehouse
7/5/2019 • 2 minutes to read • Edit Online

Techniques for analyzing your workload in Azure SQL Data Warehouse.

Resource Classes
SQL Data Warehouse provides resource classes to assign system resources to queries. For more information on
resource classes, see Resource classes & workload management. Queries will wait if the resource class assigned to
a query needs more resources than are currently available.

Queued query detection and other DMVs


You can use the sys.dm_pdw_exec_requests DMV to identify queries that are waiting in a concurrency queue.
Queries waiting for a concurrency slot have a status of suspended.

SELECT r.[request_id] AS Request_ID


, r.[status] AS Request_Status
, r.[submit_time] AS Request_SubmitTime
, r.[start_time] AS Request_StartTime
, DATEDIFF(ms,[submit_time],[start_time]) AS Request_InitiateDuration_ms
, r.resource_class AS Request_resource_class
FROM sys.dm_pdw_exec_requests r
;

Workload management roles can be viewed with sys.database_principals .

SELECT ro.[name] AS [db_role_name]


FROM sys.database_principals ro
WHERE ro.[type_desc] = 'DATABASE_ROLE'
AND ro.[is_fixed_role] = 0
;

The following query shows which role each user is assigned to.

SELECT r.name AS role_principal_name


, m.name AS member_principal_name
FROM sys.database_role_members rm
JOIN sys.database_principals AS r ON rm.role_principal_id = r.principal_id
JOIN sys.database_principals AS m ON rm.member_principal_id = m.principal_id
WHERE r.name IN ('mediumrc','largerc','xlargerc')
;

SQL Data Warehouse has the following wait types:


LocalQueriesConcurrencyResourceType: Queries that sit outside of the concurrency slot framework. DMV
queries and system functions such as SELECT @@VERSION are examples of local queries.
UserConcurrencyResourceType: Queries that sit inside the concurrency slot framework. Queries against
end-user tables represent examples that would use this resource type.
DmsConcurrencyResourceType: Waits resulting from data movement operations.
BackupConcurrencyResourceType: This wait indicates that a database is being backed up. The maximum
value for this resource type is 1. If multiple backups have been requested at the same time, the others queue. In
general, we recommend a minimum time between consecutive snapshots of 10 minutes.
The sys.dm_pdw_waits DMV can be used to see which resources a request is waiting for.

SELECT w.[wait_id]
, w.[session_id]
, w.[type] AS Wait_type
, w.[object_type]
, w.[object_name]
, w.[request_id]
, w.[request_time]
, w.[acquire_time]
, w.[state]
, w.[priority]
, SESSION_ID() AS Current_session
, s.[status] AS Session_status
, s.[login_name]
, s.[query_count]
, s.[client_id]
, s.[sql_spid]
, r.[command] AS Request_command
, r.[label]
, r.[status] AS Request_status
, r.[submit_time]
, r.[start_time]
, r.[end_compile_time]
, r.[end_time]
, DATEDIFF(ms,r.[submit_time],r.[start_time]) AS Request_queue_time_ms
, DATEDIFF(ms,r.[start_time],r.[end_compile_time]) AS Request_compile_time_ms
, DATEDIFF(ms,r.[end_compile_time],r.[end_time]) AS Request_execution_time_ms
, r.[total_elapsed_time]
FROM sys.dm_pdw_waits w
JOIN sys.dm_pdw_exec_sessions s ON w.[session_id] = s.[session_id]
JOIN sys.dm_pdw_exec_requests r ON w.[request_id] = r.[request_id]
WHERE w.[session_id] <> SESSION_ID()
;

The sys.dm_pdw_resource_waits DMV shows the wait information for a given query. Resource wait time measures
the time waiting for resources to be provided. Signal wait time is the time it takes for the underlying SQL servers
to schedule the query onto the CPU.

SELECT [session_id]
, [type]
, [object_type]
, [object_name]
, [request_id]
, [request_time]
, [acquire_time]
, DATEDIFF(ms,[request_time],[acquire_time]) AS acquire_duration_ms
, [concurrency_slots_used] AS concurrency_slots_reserved
, [resource_class]
, [wait_id] AS queue_position
FROM sys.dm_pdw_resource_waits
WHERE [session_id] <> SESSION_ID()
;

You can also use the sys.dm_pdw_resource_waits DMV to calculate how many concurrency slots have been granted.
SELECT SUM([concurrency_slots_used]) as total_granted_slots
FROM sys.[dm_pdw_resource_waits]
WHERE [state] = 'Granted'
AND [resource_class] is not null
AND [session_id] <> session_id()
;

The sys.dm_pdw_wait_stats DMV can be used for historic trend analysis of waits.

SELECT w.[pdw_node_id]
, w.[wait_name]
, w.[max_wait_time]
, w.[request_count]
, w.[signal_time]
, w.[completed_count]
, w.[wait_time]
FROM sys.dm_pdw_wait_stats w
;

Next steps
For more information about managing database users and security, see Secure a database in SQL Data
Warehouse. For more information about how larger resource classes can improve clustered columnstore index
quality, see Rebuilding indexes to improve segment quality.
Manage and monitor workload importance in Azure
SQL Data Warehouse
7/5/2019 • 2 minutes to read • Edit Online

Manage and monitor request level importance in Azure SQL Data Warehouse using DMVs and catalog views.

Monitor importance
Monitor importance using the new importance column in the sys.dm_pdw_exec_requests dynamic management
view. The below monitoring query shows submit time and start time for queries. Review the submit time and start
time along with importance to see how importance influenced scheduling.

SELECT s.login_name, r.status, r.importance, r.submit_time, r.start_time


FROM sys.dm_pdw_exec_sessions s
JOIN sys.dm_pdw_exec_requests r ON s.session_id = r.session_id
WHERE r.resource_class is not null
ORDER BY r.start_time

To look further into how queries are being scheduled, use the catalog views.

Manage importance with catalog views


The sys.workload_management_workload_classifiers catalog view contains information on classifiers in your
Azure SQL Data Warehouse instance. To exclude the system-defined classifiers that map to resource classes
execute the following code:

SELECT *
FROM sys.workload_management_workload_classifiers
WHERE classifier_id > 12

The sys.workload_management_workload_classifier_details catalog view contains information on the parameters used to create the classifier. The query below shows that ExecReportsClassifier was created on the membername parameter with the value ExecutiveReports:

SELECT c.name,cd.classifier_type, classifier_value


FROM sys.workload_management_workload_classifiers c
JOIN sys.workload_management_workload_classifier_details cd
ON cd.classifier_id = c.classifier_id
WHERE c.name = 'ExecReportsClassifier'

To simplify troubleshooting misclassification, we recommend you remove resource class role mappings as you create workload classifiers. Run sp_droprolemember for each member name returned by the role-membership query below, and check that a workload classifier exists before dropping it, as in the second example below:
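
A minimal sketch of removing a resource class role mapping, reusing the role-membership query shown in the workload analysis article earlier (the member name passed to sp_droprolemember is hypothetical):

```sql
-- List users mapped directly to resource class roles
SELECT r.name AS role_principal_name, m.name AS member_principal_name
FROM sys.database_role_members rm
JOIN sys.database_principals AS r ON rm.role_principal_id = r.principal_id
JOIN sys.database_principals AS m ON rm.member_principal_id = m.principal_id
WHERE r.name IN ('mediumrc','largerc','xlargerc');

-- Remove one of the mappings returned above (the member name is hypothetical)
EXEC sp_droprolemember 'largerc', 'ExecutiveReports';
```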
IF EXISTS (SELECT 1 FROM sys.workload_management_workload_classifiers WHERE name = 'ExecReportsClassifier')
DROP WORKLOAD CLASSIFIER ExecReportsClassifier;
GO

Next steps
For more information on Classification, see Workload Classification.
For more information on Importance, see Workload Importance
Go to Configure Workload Importance
Configure workload importance in Azure SQL Data
Warehouse
7/5/2019 • 2 minutes to read • Edit Online

Setting importance in the SQL Data Warehouse allows you to influence the scheduling of queries. Queries with
higher importance will be scheduled to run before queries with lower importance. To assign importance to
queries, you need to create a workload classifier.

Create a Workload Classifier with Importance


Often in a data warehouse scenario you have users who need their queries to run quickly. These users could be company executives who need to run reports, or analysts running ad hoc queries. You
create a workload classifier to assign importance to a query. The examples below use the new create workload
classifier syntax to create two classifiers. Membername can be a single user or a group. Individual user
classifications take precedence over role classifications. To find existing data warehouse users, run:

Select name from sys.sysusers

To create a workload classifier, for a user with higher importance run:

CREATE WORKLOAD CLASSIFIER ExecReportsClassifier
WITH (WORKLOAD_GROUP = 'xlargerc'
    ,MEMBERNAME = 'name'
    ,IMPORTANCE = above_normal);

To create a workload classifier for a user running adhoc queries with lower importance run:

CREATE WORKLOAD CLASSIFIER AdhocClassifier
WITH (WORKLOAD_GROUP = 'xlargerc'
    ,MEMBERNAME = 'name'
    ,IMPORTANCE = below_normal);
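
To confirm that both classifiers were created, you can reuse the catalog view query from the previous article:

```sql
-- User-defined classifiers have classifier_id values greater than 12
SELECT *
FROM sys.workload_management_workload_classifiers
WHERE classifier_id > 12;
```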

Next Steps
For more information about workload management, see Workload Classification
For more information on Importance, see Workload Importance
Go to Manage and monitor Workload Importance
Optimize performance by upgrading SQL Data
Warehouse
8/18/2019 • 6 minutes to read • Edit Online

Upgrade Azure SQL Data Warehouse to latest generation of Azure hardware and storage architecture.

Why upgrade?
You can now seamlessly upgrade to the SQL Data Warehouse Compute Optimized Gen2 tier in the Azure portal
for supported regions. If your region does not support self-upgrade, you can upgrade to a supported region or
wait for self-upgrade to be available in your region. Upgrade now to take advantage of the latest generation of
Azure hardware and enhanced storage architecture including faster performance, higher scalability, and unlimited
columnar storage.

Applies to
This upgrade applies to Compute Optimized Gen1 tier data warehouses in supported regions.

Before you begin


1. Check if your region is supported for Gen1 to Gen2 migration. Note the automatic migration dates. To
avoid conflicts with the automated process, plan your manual migration prior to the automated process
start date.
2. If you are in a region that is not yet supported, continue to check for your region to be added or upgrade
using restore to a supported region.
3. If your region is supported, upgrade through the Azure portal
4. Select the suggested performance level for the data warehouse based on your current performance
level on Compute Optimized Gen1 tier by using the mapping below:

COMPUTE OPTIMIZED GEN1 TIER    COMPUTE OPTIMIZED GEN2 TIER

DW100                          DW100c

DW200                          DW200c

DW300                          DW300c

DW400                          DW400c

DW500                          DW500c

DW600                          DW500c

DW1000                         DW1000c

DW1200                         DW1000c

DW1500                         DW1500c

DW2000                         DW2000c

DW3000                         DW3000c

DW6000                         DW6000c

NOTE
Suggested performance levels are not a direct conversion. For example, we recommend going from DW600 to DW500c.
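To confirm your data warehouse's current Gen1 performance level before picking a target from the table above, one option (a sketch, run against the master database on your logical server) is to query sys.database_service_objectives:

-- Run in master: lists the service objective (performance level) of each database on the server
SELECT db.name AS database_name, ds.service_objective
FROM sys.database_service_objectives AS ds
JOIN sys.databases AS db ON ds.database_id = db.database_id;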

Upgrade in a supported region using the Azure portal


Before you begin
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

NOTE
Migration from Gen1 to Gen2 through the Azure portal is permanent. There is not a process for returning to Gen1.

Sign in to the Azure portal


Sign in to the Azure portal.
1. If the Compute Optimized Gen1 tier data warehouse to be upgraded is paused, resume the data
warehouse.

NOTE
Azure SQL Data Warehouse must be running to migrate to Gen2.

2. Be prepared for a few minutes of downtime.


3. Identify any code references to Compute Optimized Gen1 performance levels and modify them to their
equivalent Compute Optimized Gen2 performance level. Below are two examples of where you should
update code references before upgrading:
Original Gen1 PowerShell command:
Set-AzSqlDatabase -ResourceGroupName "myResourceGroup" -DatabaseName "mySampleDataWarehouse" -ServerName "mynewserver-20171113" -RequestedServiceObjectiveName "DW300"

Modified to:

Set-AzSqlDatabase -ResourceGroupName "myResourceGroup" -DatabaseName "mySampleDataWarehouse" -ServerName "mynewserver-20171113" -RequestedServiceObjectiveName "DW300c"

NOTE
-RequestedServiceObjectiveName "DW300" is changed to -RequestedServiceObjectiveName "DW300c"

Original Gen1 T-SQL command:

ALTER DATABASE mySampleDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW300') ;

Modified to:

ALTER DATABASE mySampleDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW300c') ;

NOTE
SERVICE_OBJECTIVE = 'DW300' is changed to SERVICE_OBJECTIVE = 'DW300c'

Start the upgrade


1. Go to your Compute Optimized Gen1 tier data warehouse in the Azure portal. If the Compute Optimized
Gen1 tier data warehouse to be upgraded is paused, resume the data warehouse.
2. Select Upgrade to Gen2 card under the Tasks tab:
NOTE
If you do not see the Upgrade to Gen2 card under the Tasks tab, your subscription type is limited in the current
region. Submit a support ticket to get your subscription whitelisted.

3. Ensure your workload has completed running and quiesced before upgrading. You'll experience downtime
for a few minutes before your data warehouse is back online as a Compute Optimized Gen2 tier data
warehouse. Select Upgrade:

4. Monitor your upgrade by checking the status in the Azure portal:

The first step of the upgrade process goes through the scale operation ("Upgrading - Offline") where all
sessions will be killed, and connections will be dropped.
The second step of the upgrade process is data migration ("Upgrading - Online"). Data migration is an
online trickle background process. This process slowly moves columnar data from the old storage
architecture to the new storage architecture using a local SSD cache. During this time, your data
warehouse will be online for querying and loading. Your data will be available to query regardless of
whether it has been migrated or not. The data migration happens at varying rates depending on your data
size, your performance level, and the number of your columnstore segments.
5. Optional Recommendation: Once the scaling operation is complete, you can speed up the data
migration background process. You can force data movement by running Alter Index rebuild on all primary
columnstore tables you'd be querying at a larger SLO and resource class. This operation is offline
compared to the trickle background process, which can take hours to complete depending on the number
and sizes of your tables. However, once complete, data migration will be much quicker due to the new
enhanced storage architecture with high-quality rowgroups.

NOTE
Alter Index rebuild is an offline operation and the tables will not be available until the rebuild completes.

The following query generates the required Alter Index Rebuild commands to expedite data migration:

SELECT 'ALTER INDEX [' + idx.NAME + '] ON ['
       + Schema_name(tbl.schema_id) + '].['
       + Object_name(idx.object_id) + '] REBUILD '
       + ( CASE
             WHEN ( (SELECT Count(*)
                     FROM   sys.partitions part2
                     WHERE  part2.index_id = idx.index_id
                            AND idx.object_id = part2.object_id) > 1 )
             THEN ' PARTITION = ' + Cast(part.partition_number AS NVARCHAR(256))
             ELSE ''
           END )
       + '; SELECT ''[' + idx.NAME + '] ON [' + Schema_name(tbl.schema_id) + '].['
       + Object_name(idx.object_id) + '] '
       + ( CASE
             WHEN ( (SELECT Count(*)
                     FROM   sys.partitions part2
                     WHERE  part2.index_id = idx.index_id
                            AND idx.object_id = part2.object_id) > 1 )
             THEN ' PARTITION = ' + Cast(part.partition_number AS NVARCHAR(256)) + ' completed'';'
             ELSE ' completed'';'
           END )
FROM   sys.indexes idx
       INNER JOIN sys.tables tbl
               ON idx.object_id = tbl.object_id
       LEFT OUTER JOIN sys.partitions part
                    ON idx.index_id = part.index_id
                       AND idx.object_id = part.object_id
WHERE  idx.type_desc = 'CLUSTERED COLUMNSTORE';
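For reference, a command emitted by the script above might look like the following sketch (the schema, table, and index names here are hypothetical):

-- Example of a generated rebuild command for a hypothetical clustered columnstore index
ALTER INDEX [cci_FactSales] ON [dbo].[FactSales] REBUILD; SELECT '[cci_FactSales] ON [dbo].[FactSales]  completed';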

Upgrade from an Azure geographical region using restore through the


Azure portal
Create a user-defined restore point using the Azure portal
1. Sign in to the Azure portal.
2. Navigate to the SQL Data Warehouse that you want to create a restore point for.
3. At the top of the Overview section, select +New Restore Point.

4. Specify a name for your restore point.


Restore an active or paused database using the Azure portal
1. Sign in to the Azure portal.
2. Navigate to the SQL Data Warehouse that you want to restore from.
3. At the top of the Overview section, select Restore.
4. Select either Automatic Restore Points or User-Defined Restore Points.

5. For User-Defined Restore Points, select a Restore point or Create a new user-defined restore point.
Choose a server in a Gen2 supported geographic region.
Restore from an Azure geographical region using PowerShell
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

To recover a database, use the Restore-AzSqlDatabase cmdlet.

NOTE
You can perform a geo-restore to Gen2! To do so, specify a Gen2 ServiceObjectiveName (e.g. DW1000c) as an optional parameter.

1. Open Windows PowerShell.


2. Connect to your Azure account and list all the subscriptions associated with your account.
3. Select the subscription that contains the database to be restored.
4. Get the database you want to recover.
5. Create the recovery request for the database, specifying a Gen2 ServiceObjectiveName.
6. Verify the status of the geo-restored database.

Connect-AzAccount
Get-AzSubscription
Select-AzSubscription -SubscriptionName "<Subscription_name>"

# Get the database you want to recover
$GeoBackup = Get-AzSqlDatabaseGeoBackup -ResourceGroupName "<YourResourceGroupName>" -ServerName "<YourServerName>" -DatabaseName "<YourDatabaseName>"

# Recover the database, specifying a Gen2 service objective (for example, "DW300c")
$GeoRestoredDatabase = Restore-AzSqlDatabase –FromGeoBackup -ResourceGroupName "<YourResourceGroupName>" -ServerName "<YourTargetServer>" -TargetDatabaseName "<NewDatabaseName>" –ResourceId $GeoBackup.ResourceID -ServiceObjectiveName "<YourTargetServiceLevel>"

# Verify that the geo-restored database is online
$GeoRestoredDatabase.status

NOTE
To configure your database after the restore has completed, see Configure your database after recovery.

The recovered database will be TDE-enabled if the source database is TDE-enabled.
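If you want to verify the encryption state of the restored database, one option (a sketch, run against the master database of the logical server) is to check the is_encrypted flag in sys.databases:

-- Run in master: is_encrypted = 1 indicates TDE is enabled for the database
SELECT name, is_encrypted
FROM sys.databases;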
If you experience any issues with your data warehouse, create a support request and reference “Gen2 upgrade” as
the possible cause.

Next steps
Your upgraded data warehouse is online. To take advantage of the enhanced architecture, see Resource classes for
Workload Management.
Use Azure Functions to manage compute resources
in Azure SQL Data Warehouse
3/15/2019 • 4 minutes to read • Edit Online

This tutorial uses Azure Functions to manage compute resources for a data warehouse in Azure SQL Data
Warehouse.
In order to use Azure Function App with SQL Data Warehouse, you must create a Service Principal Account with
contributor access under the same subscription as your data warehouse instance.

Deploy timer-based scaling with an Azure Resource Manager template


To deploy the template, you need the following information:
Name of the resource group your SQL DW instance is in
Name of the logical server your SQL DW instance is in
Name of your SQL DW instance
Tenant ID (Directory ID) of your Azure Active Directory
Subscription ID
Service Principal Application ID
Service Principal Secret Key
Once you have the preceding information, deploy this template:

Once you've deployed the template, you should find three new resources: a free Azure App Service Plan, a
consumption-based Function App plan, and a storage account that handles the logging and the operations queue.
Continue reading the other sections to see how to modify the deployed functions to fit your need.

Change the time of the scale operation


1. Navigate to your Function App service. If you deployed the template with the default values, this service
should be named DWOperations. Once your Function App is open, you should notice there are five
functions deployed to your Function App Service.
2. Select either DWScaleDownTrigger or DWScaleUpTrigger depending on whether you would like to change
the scale up or scale down time. In the drop-down menu, select Integrate.

3. Currently the value displayed should say either %ScaleDownTime% or %ScaleUpTime%. These values
indicate the schedule is based on values defined in your Application Settings. For now, you can ignore this
value and change the schedule to your preferred time based on the next steps.
4. In the schedule area, enter the CRON expression that reflects how often you want the SQL Data Warehouse
to be scaled up.

The value of schedule is a CRON expression that includes these six fields:

{second} {minute} {hour} {day} {month} {day-of-week}

For example, "0 30 9 * * 1-5" would reflect a trigger every weekday at 9:30am. For more information, visit
Azure Functions schedule examples.

Change the compute level


1. Navigate to your Function App service. If you deployed the template with the default values, this service
should be named DWOperations. Once your Function App is open, you should notice there are five
functions deployed to your Function App Service.
2. Select either DWScaleDownTrigger or DWScaleUpTrigger depending on whether you would like to change
the scale up or scale down compute value. Upon selecting the functions, your pane should show the index.js
file.

3. Change the value of ServiceLevelObjective to the level you would like and hit save. This value is the
compute level that your data warehouse instance will scale to based on the schedule defined in the
Integrate section.

Use pause or resume instead of scale


Currently, the functions that are on by default are DWScaleDownTrigger and DWScaleUpTrigger. If you would like to use
pause and resume functionality instead, you can enable DWPauseTrigger or DWResumeTrigger.
1. Navigate to the Functions pane.

2. Click on the sliding toggle for the corresponding triggers you would like to enable.
3. Navigate to the Integrate tabs for the respective triggers to change their schedule.

NOTE
The functional difference between the scaling triggers and the pause/resume triggers is the message that is sent to
the queue. For more information, see Add a new trigger function.

Add a new trigger function


Currently, there are only two scaling functions included within the template. With these functions, during the
course of a day, you can only scale down once and up once. For more granular control, such as scaling down
multiple times per day or having different scaling behavior on the weekends, you need to add another trigger.
1. Create a new blank function. Select the + button near your Functions location to show the function
template pane.

2. From Language, select Javascript, then select TimerTrigger.

3. Name your function and set your schedule. The image shows how one may trigger their function every
Saturday at midnight (late Friday evening).

4. Copy the content of index.js from one of the other trigger functions.

5. Set your operation variable to the desired behavior as follows:


// Resume the data warehouse instance
var operation = {
    "operationType": "ResumeDw"
}

// Pause the data warehouse instance
var operation = {
    "operationType": "PauseDw"
}

// Scale the data warehouse instance to DW600
var operation = {
    "operationType": "ScaleDw",
    "ServiceLevelObjective": "DW600"
}

Complex scheduling
This section briefly demonstrates what is necessary to get more complex scheduling of pause, resume, and scaling
capabilities.
Example 1:
Daily scale up at 8am to DW600 and scale down at 8pm to DW200.

FUNCTION     SCHEDULE        OPERATION

Function1    0 0 8 * * *     var operation = {"operationType": "ScaleDw", "ServiceLevelObjective": "DW600"}

Function2    0 0 20 * * *    var operation = {"operationType": "ScaleDw", "ServiceLevelObjective": "DW200"}

Example 2:
Daily scale up at 8am to DW1000, scale down once to DW600 at 4pm, and scale down at 10pm to DW200.

FUNCTION     SCHEDULE        OPERATION

Function1    0 0 8 * * *     var operation = {"operationType": "ScaleDw", "ServiceLevelObjective": "DW1000"}

Function2    0 0 16 * * *    var operation = {"operationType": "ScaleDw", "ServiceLevelObjective": "DW600"}

Function3    0 0 22 * * *    var operation = {"operationType": "ScaleDw", "ServiceLevelObjective": "DW200"}

Example 3:
Scale up at 8am to DW1000, scale down once to DW600 at 4pm on the weekdays. Pauses Friday 11pm, resumes
7am Monday morning.

FUNCTION     SCHEDULE         OPERATION

Function1    0 0 8 * * 1-5    var operation = {"operationType": "ScaleDw", "ServiceLevelObjective": "DW1000"}

Function2    0 0 16 * * 1-5   var operation = {"operationType": "ScaleDw", "ServiceLevelObjective": "DW600"}

Function3    0 0 23 * * 5     var operation = {"operationType": "PauseDw"}

Function4    0 0 7 * * 0      var operation = {"operationType": "ResumeDw"}

Next steps
Learn more about timer trigger Azure functions.
Checkout the SQL Data Warehouse samples repository.
Monitor workload - Azure portal
8/22/2019 • 2 minutes to read • Edit Online

This article describes how to use the Azure portal to monitor your workload. This includes setting up Azure
Monitor Logs to investigate query execution and workload trends using log analytics for Azure SQL Data
Warehouse.

Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Data Warehouse: We will be collecting logs for a SQL Data Warehouse. If you don't have a SQL
Data Warehouse provisioned, see the instructions in Create a SQL Data Warehouse.

Create a Log Analytics workspace


Navigate to the browse blade for Log Analytics workspaces and create a workspace

For more details on workspaces, visit the following documentation.

Turn on Diagnostic logs


Configure diagnostic settings to emit logs from your SQL Data Warehouse. Logs consist of telemetry views of
your data warehouse equivalent to the most commonly used performance troubleshooting DMVs for SQL Data
Warehouse. Currently the following views are supported:
sys.dm_pdw_exec_requests
sys.dm_pdw_request_steps
sys.dm_pdw_dms_workers
sys.dm_pdw_waits
sys.dm_pdw_sql_requests

Logs can be emitted to Azure Storage, Stream Analytics, or Log Analytics. For this tutorial, select Log Analytics.

Run queries against Log Analytics


Navigate to your Log Analytics workspace where you can do the following:
Analyze logs using log queries and save queries for reuse
Create log alerts
Pin query results to a dashboard
For details on the capabilities of log queries, visit the following documentation.

Sample log queries


//List all queries
AzureDiagnostics
| where Category contains "ExecRequests"
| project TimeGenerated, StartTime_t, EndTime_t, Status_s, Command_s, ResourceClass_s,
duration=datetime_diff('millisecond',EndTime_t, StartTime_t)

//Chart the most active resource classes
AzureDiagnostics
| where Category contains "ExecRequests"
| where Status_s == "Completed"
| summarize totalQueries = dcount(RequestId_s) by ResourceClass_s
| render barchart

//Count of all queued queries
AzureDiagnostics
| where Category contains "waits"
| where Type_s == "UserConcurrencyResourceType"
| summarize totalQueuedQueries = dcount(RequestId_s)

Next steps
Now that you have set up and configured Azure monitor logs, customize Azure dashboards to share across your
team.
Monitor your workload using DMVs
8/25/2019 • 8 minutes to read • Edit Online

This article describes how to use Dynamic Management Views (DMVs) to monitor your workload. This includes
investigating query execution in Azure SQL Data Warehouse.

Permissions
To query the DMVs in this article, you need either VIEW DATABASE STATE or CONTROL permission. Usually
granting VIEW DATABASE STATE is the preferred permission as it is much more restrictive.

GRANT VIEW DATABASE STATE TO myuser;

Monitor connections
All logins to SQL Data Warehouse are logged to sys.dm_pdw_exec_sessions. This DMV contains the last 10,000
logins. The session_id is the primary key and is assigned sequentially for each new logon.

-- Other Active Connections


SELECT * FROM sys.dm_pdw_exec_sessions where status <> 'Closed' and session_id <> session_id();

Monitor query execution


All queries executed on SQL Data Warehouse are logged to sys.dm_pdw_exec_requests. This DMV contains the
last 10,000 queries executed. The request_id uniquely identifies each query and is the primary key for this DMV.
The request_id is assigned sequentially for each new query and is prefixed with QID, which stands for query ID.
Querying this DMV for a given session_id shows all queries for a given logon.

NOTE
Stored procedures use multiple Request IDs. Request IDs are assigned in sequential order.
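For example, to list every request issued by a particular logon, you can filter this DMV on its session_id (a sketch; replace the placeholder with a session_id taken from sys.dm_pdw_exec_sessions):

-- All queries submitted by a given session
SELECT request_id, submit_time, status, command, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE session_id = 'SID####'
ORDER BY submit_time DESC;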

Here are steps to follow to investigate query execution plans and times for a particular query.
STEP 1: Identify the query you wish to investigate

-- Monitor active queries


SELECT *
FROM sys.dm_pdw_exec_requests
WHERE status not in ('Completed','Failed','Cancelled')
AND session_id <> session_id()
ORDER BY submit_time DESC;

-- Find the top 10 longest running queries


SELECT TOP 10 *
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;

From the preceding query results, note the Request ID of the query that you would like to investigate.
Queries in the Suspended state can be queued due to a large number of active running queries. These queries
also appear in the sys.dm_pdw_waits DMV with a type of UserConcurrencyResourceType. For information
on concurrency limits, see Performance tiers or Resource classes for workload management. Queries can also
wait for other reasons such as for object locks. If your query is waiting for a resource, see Investigating queries
waiting for resources further down in this article.
To simplify the lookup of a query in the sys.dm_pdw_exec_requests table, use LABEL to assign a comment to your
query that can be looked up in the sys.dm_pdw_exec_requests view.

-- Query with Label


SELECT *
FROM sys.tables
OPTION (LABEL = 'My Query')
;

-- Find a query with the Label 'My Query'


-- Use brackets when querying the label column, as it is a keyword
SELECT *
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'My Query';

STEP 2: Investigate the query plan


Use the Request ID to retrieve the query's distributed SQL (DSQL) plan from sys.dm_pdw_request_steps.

-- Find the distributed query plan steps for a specific query.


-- Replace request_id with value from Step 1.

SELECT * FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID####'
ORDER BY step_index;

When a DSQL plan is taking longer than expected, the cause can be a complex plan with many DSQL steps or
just one step taking a long time. If the plan is many steps with several move operations, consider optimizing your
table distributions to reduce data movement. The Table distribution article explains why data must be moved to
solve a query and explains some distribution strategies to minimize data movement.
To investigate further details about a single step, check the operation_type column of the long-running query step
and note the Step Index:
Proceed with Step 3a for SQL operations: OnOperation, RemoteOperation, ReturnOperation.
Proceed with Step 3b for Data Movement operations: ShuffleMoveOperation, BroadcastMoveOperation,
TrimMoveOperation, PartitionMoveOperation, MoveOperation, CopyOperation.
STEP 3a: Investigate SQL on the distributed databases
Use the Request ID and the Step Index to retrieve details from sys.dm_pdw_sql_requests, which contains
execution information of the query step on all of the distributed databases.

-- Find the distribution run times for a SQL step.


-- Replace request_id and step_index with values from Step 1 and 3.

SELECT * FROM sys.dm_pdw_sql_requests
WHERE request_id = 'QID####' AND step_index = 2;

When the query step is running, DBCC PDW_SHOWEXECUTIONPLAN can be used to retrieve the SQL Server
estimated plan from the SQL Server plan cache for the step running on a particular distribution.
-- Find the SQL Server execution plan for a query running on a specific SQL Data Warehouse Compute or Control
node.
-- Replace distribution_id and spid with values from previous query.

DBCC PDW_SHOWEXECUTIONPLAN(1, 78);

STEP 3b: Investigate data movement on the distributed databases


Use the Request ID and the Step Index to retrieve information about a data movement step running on each
distribution from sys.dm_pdw_dms_workers.

-- Find the information about all the workers completing a Data Movement Step.
-- Replace request_id and step_index with values from Step 1 and 3.

SELECT * FROM sys.dm_pdw_dms_workers
WHERE request_id = 'QID####' AND step_index = 2;

Check the total_elapsed_time column to see if a particular distribution is taking significantly longer than
others for data movement.
For the long-running distribution, check the rows_processed column to see if the number of rows being moved
from that distribution is significantly larger than others. If so, this finding might indicate skew of your
underlying data.
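To compare distributions side by side for a single data movement step, one option (a sketch, reusing the request_id and step_index placeholders from the query above) is to aggregate rows_processed per distribution:

-- Compare rows moved per distribution to spot potential data skew
SELECT distribution_id,
       SUM(rows_processed) AS total_rows_processed,
       MAX(total_elapsed_time) AS max_elapsed_time
FROM sys.dm_pdw_dms_workers
WHERE request_id = 'QID####' AND step_index = 2
GROUP BY distribution_id
ORDER BY total_rows_processed DESC;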
If the query is running, DBCC PDW_SHOWEXECUTIONPLAN can be used to retrieve the SQL Server estimated
plan from the SQL Server plan cache for the currently running SQL Step within a particular distribution.

-- Find the SQL Server estimated plan for a query running on a specific SQL Data Warehouse Compute or Control
node.
-- Replace distribution_id and spid with values from previous query.

DBCC PDW_SHOWEXECUTIONPLAN(55, 238);

Monitor waiting queries


If you discover that your query is not making progress because it is waiting for a resource, here is a query that
shows all the resources a query is waiting for.

-- Find queries
-- Replace request_id with value from Step 1.

SELECT waits.session_id,
waits.request_id,
requests.command,
requests.status,
requests.start_time,
waits.type,
waits.state,
waits.object_type,
waits.object_name
FROM sys.dm_pdw_waits waits
JOIN sys.dm_pdw_exec_requests requests
ON waits.request_id=requests.request_id
WHERE waits.request_id = 'QID####'
ORDER BY waits.object_name, waits.object_type, waits.state;

If the query is actively waiting on resources from another query, then the state will be AcquireResources. If the
query has all the required resources, then the state will be Granted.
Monitor tempdb
Tempdb is used to hold intermediate results during query execution. High utilization of the tempdb database can
lead to slow query performance. Each node in Azure SQL Data Warehouse has approximately 1 TB of raw space
for tempdb. Below are tips for monitoring tempdb usage and for decreasing tempdb usage in your queries.
Monitoring tempdb with views
To monitor tempdb usage, first install the microsoft.vw_sql_requests view from the Microsoft Toolkit for SQL
Data Warehouse. You can then execute the following query to see the tempdb usage per node for all executed
queries:

-- Monitor tempdb
SELECT
sr.request_id,
ssu.session_id,
ssu.pdw_node_id,
sr.command,
sr.total_elapsed_time,
es.login_name AS 'LoginName',
DB_NAME(ssu.database_id) AS 'DatabaseName',
(es.memory_usage * 8) AS 'MemoryUsage (in KB)',
(ssu.user_objects_alloc_page_count * 8) AS 'Space Allocated For User Objects (in KB)',
(ssu.user_objects_dealloc_page_count * 8) AS 'Space Deallocated For User Objects (in KB)',
(ssu.internal_objects_alloc_page_count * 8) AS 'Space Allocated For Internal Objects (in KB)',
(ssu.internal_objects_dealloc_page_count * 8) AS 'Space Deallocated For Internal Objects (in KB)',
CASE es.is_user_process
WHEN 1 THEN 'User Session'
WHEN 0 THEN 'System Session'
END AS 'SessionType',
es.row_count AS 'RowCount'
FROM sys.dm_pdw_nodes_db_session_space_usage AS ssu
INNER JOIN sys.dm_pdw_nodes_exec_sessions AS es ON ssu.session_id = es.session_id AND ssu.pdw_node_id =
es.pdw_node_id
INNER JOIN sys.dm_pdw_nodes_exec_connections AS er ON ssu.session_id = er.session_id AND ssu.pdw_node_id =
er.pdw_node_id
INNER JOIN microsoft.vw_sql_requests AS sr ON ssu.session_id = sr.spid AND ssu.pdw_node_id = sr.pdw_node_id
WHERE DB_NAME(ssu.database_id) = 'tempdb'
AND es.session_id <> @@SPID
AND es.login_name <> 'sa'
ORDER BY sr.request_id;

If you have a query that is consuming a large amount of memory or you have received an error message related to
the allocation of tempdb, it could be due to a very large CREATE TABLE AS SELECT (CTAS) or INSERT SELECT
statement that is failing in the final data movement operation. This can usually be identified as a
ShuffleMove operation in the distributed query plan right before the final INSERT SELECT. Use
sys.dm_pdw_request_steps to monitor ShuffleMove operations.
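For instance, a sketch of such a check, listing the slowest ShuffleMove steps recorded in sys.dm_pdw_request_steps:

-- Find the longest-running ShuffleMove steps
SELECT request_id, step_index, operation_type, row_count, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE operation_type = 'ShuffleMoveOperation'
ORDER BY total_elapsed_time DESC;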
The most common mitigation is to break your CTAS or INSERT SELECT statement into multiple load statements
so the data volume does not exceed the 1 TB per node tempdb limit. You can also scale your cluster to a larger size,
which spreads tempdb across more nodes and reduces the tempdb usage on each individual node.
In addition to CTAS and INSERT SELECT statements, large, complex queries running with insufficient memory
can spill into tempdb causing queries to fail. Consider running with a larger resource class to avoid spilling into
tempdb.

Monitor memory
Memory can be the root cause for slow performance and out of memory issues. Consider scaling your data
warehouse if you find SQL Server memory usage reaching its limits during query execution.
The following query returns SQL Server memory usage and memory pressure per node:

-- Memory consumption
SELECT
pc1.cntr_value as Curr_Mem_KB,
pc1.cntr_value/1024.0 as Curr_Mem_MB,
(pc1.cntr_value/1048576.0) as Curr_Mem_GB,
pc2.cntr_value as Max_Mem_KB,
pc2.cntr_value/1024.0 as Max_Mem_MB,
(pc2.cntr_value/1048576.0) as Max_Mem_GB,
pc1.cntr_value * 100.0/pc2.cntr_value AS Memory_Utilization_Percentage,
pc1.pdw_node_id
FROM
-- pc1: current memory
sys.dm_pdw_nodes_os_performance_counters AS pc1
-- pc2: total memory allowed for this SQL instance
JOIN sys.dm_pdw_nodes_os_performance_counters AS pc2
ON pc1.object_name = pc2.object_name AND pc1.pdw_node_id = pc2.pdw_node_id
WHERE
pc1.counter_name = 'Total Server Memory (KB)'
AND pc2.counter_name = 'Target Server Memory (KB)'

Monitor transaction log size


The following query returns the transaction log size on each distribution. If one of the log files is reaching 160
GB, you should consider scaling up your instance or limiting your transaction size.

-- Transaction log size


SELECT
instance_name as distribution_db,
cntr_value*1.0/1048576 as log_file_size_used_GB,
pdw_node_id
FROM sys.dm_pdw_nodes_os_performance_counters
WHERE
instance_name like 'Distribution_%'
AND counter_name = 'Log File(s) Used Size (KB)'

Monitor transaction log rollback


If your queries are failing or taking a long time to proceed, you can check and monitor if you have any
transactions rolling back.

-- Monitor rollback
SELECT
SUM(CASE WHEN t.database_transaction_next_undo_lsn IS NOT NULL THEN 1 ELSE 0 END),
t.pdw_node_id,
nod.[type]
FROM sys.dm_pdw_nodes_tran_database_transactions t
JOIN sys.dm_pdw_nodes nod ON t.pdw_node_id = nod.pdw_node_id
GROUP BY t.pdw_node_id, nod.[type]

Monitor PolyBase load


The following query provides a ballpark estimate of the progress of your load. The query only shows files
currently being processed.
-- To track bytes and files
SELECT
r.command,
s.request_id,
r.status,
count(distinct input_name) as nbr_files,
sum(s.bytes_processed)/1024/1024/1024 as gb_processed
FROM
sys.dm_pdw_exec_requests r
inner join sys.dm_pdw_dms_external_work s
on r.request_id = s.request_id
GROUP BY
r.command,
s.request_id,
r.status
ORDER BY
nbr_files desc,
gb_processed desc;

Next steps
For more information about DMVs, see System views.
How to monitor the Gen2 cache
3/12/2019 • 2 minutes to read • Edit Online

The Gen2 storage architecture automatically tiers your most frequently queried columnstore segments in a cache
residing on NVMe based SSDs designed for Gen2 data warehouses. Greater performance is realized when your
queries retrieve segments that are residing in the cache. This article describes how to monitor and troubleshoot
slow query performance by determining whether your workload is optimally leveraging the Gen2 cache.

Troubleshoot using the Azure portal


You can use Azure Monitor to view Gen2 cache metrics to troubleshoot query performance. First go to the Azure
portal and click on Monitor:

Select the metrics button and fill in the Subscription, Resource group, Resource type, and Resource name of
your data warehouse.
The key metrics for troubleshooting the Gen2 cache are Cache hit percentage and Cache used percentage.
Configure the Azure metric chart to display these two metrics.
Cache hit and used percentage
The matrix below describes scenarios based on the values of the cache metrics:

                              HIGH CACHE HIT PERCENTAGE    LOW CACHE HIT PERCENTAGE

High Cache used percentage    Scenario 1                   Scenario 2

Low Cache used percentage     Scenario 3                   Scenario 4

Scenario 1: You are optimally using your cache. Troubleshoot other areas which may be slowing down your
queries.
Scenario 2: Your current working data set cannot fit into the cache which causes a low cache hit percentage due to
physical reads. Consider scaling up your performance level and rerun your workload to populate the cache.
Scenario 3: It is likely that your query is running slow due to reasons unrelated to the cache. Troubleshoot other
areas which may be slowing down your queries. You can also consider scaling down your instance to reduce your
cache size to save costs.
Scenario 4: You had a cold cache, which could be the reason why your query was slow. Consider rerunning your
query as your working dataset should now be in the cache.
Important: If the cache hit percentage or cache used percentage is not updating after rerunning your
workload, your working set can already be residing in memory. Note only clustered columnstore tables
are cached.
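To see which tables can benefit from the cache, one option (a sketch) is to list the tables that have a clustered columnstore index:

-- Tables with a clustered columnstore index (the only tables served by the Gen2 cache)
SELECT sch.name AS schema_name, tbl.name AS table_name
FROM sys.tables AS tbl
JOIN sys.schemas AS sch ON tbl.schema_id = sch.schema_id
JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id
WHERE idx.type_desc = 'CLUSTERED COLUMNSTORE';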

Next steps
For more information on general query performance tuning, see Monitor query execution.
User-defined restore points
8/18/2019 • 2 minutes to read • Edit Online

In this article, you learn to create a new user-defined restore point for Azure SQL Data Warehouse using
PowerShell and Azure portal.

Create user-defined restore points through PowerShell


To create a user-defined restore point, use the New -AzSqlDatabaseRestorePoint PowerShell cmdlet.
1. Before you begin, make sure to install Azure PowerShell.
2. Open PowerShell.
3. Connect to your Azure account and list all the subscriptions associated with your account.
4. Select the subscription that contains the database to be restored.
5. Create a restore point for an immediate copy of your data warehouse.

$SubscriptionName="<YourSubscriptionName>"
$ResourceGroupName="<YourResourceGroupName>"
$ServerName="<YourServerNameWithoutURLSuffixSeeNote>" # Without database.windows.net
$DatabaseName="<YourDatabaseName>"
$Label = "<YourRestorePointLabel>"

Connect-AzAccount
Get-AzSubscription
Select-AzSubscription -SubscriptionName $SubscriptionName

# Create a restore point of the original database


New-AzSqlDatabaseRestorePoint -ResourceGroupName $ResourceGroupName -ServerName $ServerName -DatabaseName
$DatabaseName -RestorePointLabel $Label

6. See the list of all the existing restore points.

# List all restore points


Get-AzSqlDatabaseRestorePoints -ResourceGroupName $ResourceGroupName -ServerName $ServerName -DatabaseName
$DatabaseName

Create user-defined restore points through the Azure portal


User-defined restore points can also be created through Azure portal.
1. Sign in to your Azure portal account.
2. Navigate to the SQL Data Warehouse that you want to create a restore point for.
3. Select Overview from the left pane, select + New Restore Point. If the New Restore Point button isn't
enabled, make sure that the data warehouse isn't paused.
4. Specify a name for your user-defined restore point and click Apply. User-defined restore points have a
default retention period of seven days.

Next steps
Restore an existing data warehouse
Restore a deleted data warehouse
Restore from a geo-backup data warehouse
Restore an existing Azure SQL Data Warehouse
8/18/2019 • 2 minutes to read • Edit Online

In this article, you learn to restore an existing SQL Data Warehouse through Azure portal and PowerShell:

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Verify your DTU capacity. Each SQL Data Warehouse is hosted by a SQL server (for example,
myserver.database.windows.net) which has a default DTU quota. Verify the SQL server has enough remaining
DTU quota for the database being restored. To learn how to calculate DTU needed or to request more DTU, see
Request a DTU quota change.

Before you begin


1. Make sure to install Azure PowerShell.
2. Have an existing restore point that you want to restore from. If you want to create a new restore, see the
tutorial to create a new user-defined restore point.

Restore an existing data warehouse through PowerShell


To restore an existing data warehouse from a restore point use the Restore-AzSqlDatabase PowerShell cmdlet.
1. Open PowerShell.
2. Connect to your Azure account and list all the subscriptions associated with your account.
3. Select the subscription that contains the database to be restored.
4. List the restore points for the data warehouse.
5. Pick the desired restore point using the RestorePointCreationDate.
6. Restore the data warehouse to the desired restore point using the Restore-AzSqlDatabase PowerShell cmdlet.
   a. To restore the SQL Data Warehouse to a different logical server, make sure to specify the other logical
      server name. This logical server can also be in a different resource group and region.
   b. To restore to a different subscription, use the 'Move' button to move the logical server to another
      subscription.
7. Verify that the restored data warehouse is online.
8. After the restore has completed, you can configure your recovered data warehouse by following configure
your database after recovery.
$SubscriptionName="<YourSubscriptionName>"
$ResourceGroupName="<YourResourceGroupName>"
$ServerName="<YourServerNameWithoutURLSuffixSeeNote>" # Without database.windows.net
#$TargetResourceGroupName="<YourTargetResourceGroupName>" # Uncomment to restore to a different logical server.
#$TargetServerName="<YourtargetServerNameWithoutURLSuffixSeeNote>"
$DatabaseName="<YourDatabaseName>"
$NewDatabaseName="<YourDatabaseName>"

Connect-AzAccount
Get-AzSubscription
Select-AzSubscription -SubscriptionName $SubscriptionName

# Or list all restore points


Get-AzSqlDatabaseRestorePoints -ResourceGroupName $ResourceGroupName -ServerName $ServerName -DatabaseName
$DatabaseName

# Get the specific database to restore


$Database = Get-AzSqlDatabase -ResourceGroupName $ResourceGroupName -ServerName $ServerName -DatabaseName
$DatabaseName

# Pick desired restore point using RestorePointCreationDate "xx/xx/xxxx xx:xx:xx xx"


$PointInTime="<RestorePointCreationDate>"

# Restore database from a restore point


$RestoredDatabase = Restore-AzSqlDatabase –FromPointInTimeBackup –PointInTime $PointInTime -ResourceGroupName
$Database.ResourceGroupName -ServerName $Database.ServerName -TargetDatabaseName $NewDatabaseName –ResourceId
$Database.ResourceID

# Use the following command to restore to a different logical server
#$RestoredDatabase = Restore-AzSqlDatabase –FromPointInTimeBackup –PointInTime $PointInTime -ResourceGroupName $TargetResourceGroupName -ServerName $TargetServerName -TargetDatabaseName $NewDatabaseName –ResourceId $Database.ResourceID

# Verify the status of restored database


$RestoredDatabase.status

Restore an existing data warehouse through the Azure portal


1. Sign in to the Azure portal.
2. Navigate to the SQL Data Warehouse that you want to restore from.
3. At the top of the Overview blade, select Restore.

4. Select either Automatic Restore Points or User-Defined Restore Points. If the data warehouse doesn't
have any automatic restore points, wait a few hours or create a user defined restore point before restoring.
For User-Defined Restore Points, select an existing one or create a new one. For Server, you can pick a
logical server in a different resource group and region or create a new one. After providing all the
parameters, click Review + Restore.
Next Steps
Restore a deleted data warehouse
Restore from a geo-backup data warehouse
Restore a deleted Azure SQL Data Warehouse
7/23/2019 • 2 minutes to read • Edit Online

In this article, you learn to restore a deleted SQL Data Warehouse using Azure portal and PowerShell:

Before you begin


NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Verify your DTU capacity. Each SQL Data Warehouse is hosted by a SQL server (for example,
myserver.database.windows.net) which has a default DTU quota. Verify that the SQL server has enough remaining
DTU quota for the database being restored. To learn how to calculate DTU needed or to request more DTU, see
Request a DTU quota change.

Restore a deleted data warehouse through PowerShell


To restore a deleted SQL Data Warehouse, use the Restore-AzSqlDatabase cmdlet. If the corresponding logical
server has been deleted as well, you can't restore that data warehouse.
1. Before you begin, make sure to install Azure PowerShell.
2. Open PowerShell.
3. Connect to your Azure account and list all the subscriptions associated with your account.
4. Select the subscription that contains the deleted data warehouse to be restored.
5. Get the specific deleted data warehouse.
6. Restore the deleted data warehouse
a. To restore the deleted SQL Data Warehouse to a different logical server, make sure to specify the other
logical server name. This logical server can also be in a different resource group and region.
b. To restore to a different subscription, use the Move button to move the logical server to another
subscription.
7. Verify that the restored data warehouse is online.
8. After the restore has completed, you can configure your recovered data warehouse by following configure your
database after recovery.
$SubscriptionName="<YourSubscriptionName>"
$ResourceGroupName="<YourResourceGroupName>"
$ServerName="<YourServerNameWithoutURLSuffixSeeNote>" # Without database.windows.net
#$TargetResourceGroupName="<YourTargetResourceGroupName>" # Uncomment to restore to a different logical server.
#$TargetServerName="<YourtargetServerNameWithoutURLSuffixSeeNote>"
$DatabaseName="<YourDatabaseName>"
$NewDatabaseName="<YourDatabaseName>"

Connect-AzAccount
Get-AzSubscription
Select-AzSubscription -SubscriptionName $SubscriptionName

# Get the deleted database to restore


$DeletedDatabase = Get-AzSqlDeletedDatabaseBackup -ResourceGroupName $ResourceGroupName -ServerName
$ServerName -DatabaseName $DatabaseName

# Restore deleted database


$RestoredDatabase = Restore-AzSqlDatabase –FromDeletedDatabaseBackup –DeletionDate
$DeletedDatabase.DeletionDate -ResourceGroupName $DeletedDatabase.ResourceGroupName -ServerName
$DeletedDatabase.ServerName -TargetDatabaseName $NewDatabaseName –ResourceId $DeletedDatabase.ResourceID

# Use the following command to restore the deleted data warehouse to a different logical server
#$RestoredDatabase = Restore-AzSqlDatabase –FromDeletedDatabaseBackup –DeletionDate $DeletedDatabase.DeletionDate -ResourceGroupName $TargetResourceGroupName -ServerName $TargetServerName -TargetDatabaseName $NewDatabaseName –ResourceId $DeletedDatabase.ResourceID

# Verify the status of restored database


$RestoredDatabase.status

Restore a deleted database using the Azure portal


1. Sign in to the Azure portal.
2. Navigate to the SQL server your deleted data warehouse was hosted on.
3. Select the Deleted databases icon in the table of contents.

4. Select the deleted SQL Data Warehouse that you want to restore.
5. Specify a new Database name and click OK

Next Steps
Restore an existing data warehouse
Restore from a geo-backup data warehouse
Geo-restore Azure SQL Data Warehouse
7/23/2019 • 2 minutes to read • Edit Online

In this article, you learn to restore your data warehouse from a geo-backup through Azure portal and PowerShell.

Before you begin


NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Verify your DTU capacity. Each SQL Data Warehouse is hosted by a SQL server (for example,
myserver.database.windows.net) which has a default DTU quota. Verify that the SQL server has enough remaining
DTU quota for the database being restored. To learn how to calculate DTU needed or to request more DTU, see
Request a DTU quota change.

Restore from an Azure geographical region through PowerShell


To restore from a geo-backup, use the Get-AzSqlDatabaseGeoBackup and Restore-AzSqlDatabase cmdlet.

NOTE
You can perform a geo-restore to Gen2! To do so, specify a Gen2 ServiceObjectiveName (e.g. DW1000c) as an optional parameter.

1. Before you begin, make sure to install Azure PowerShell.


2. Open PowerShell.
3. Connect to your Azure account and list all the subscriptions associated with your account.
4. Select the subscription that contains the data warehouse to be restored.
5. Get the data warehouse you want to recover.
6. Create the recovery request for the data warehouse.
7. Verify the status of the geo-restored data warehouse.
8. To configure your data warehouse after the restore has completed, see Configure your database after recovery.
$SubscriptionName="<YourSubscriptionName>"
$ResourceGroupName="<YourResourceGroupName>"
$ServerName="<YourServerNameWithoutURLSuffixSeeNote>" # Without database.windows.net
$TargetResourceGroupName="<YourTargetResourceGroupName>" # Restore to a different logical server.
$TargetServerName="<YourtargetServerNameWithoutURLSuffixSeeNote>"
$DatabaseName="<YourDatabaseName>"
$NewDatabaseName="<YourDatabaseName>"
$TargetServiceObjective="<YourTargetServiceObjective-DWXXXc>"

Connect-AzAccount
Get-AzSubscription
Select-AzSubscription -SubscriptionName $SubscriptionName
Get-AzSqlDatabase -ResourceGroupName $ResourceGroupName -ServerName $ServerName

# Get the data warehouse you want to recover


$GeoBackup = Get-AzSqlDatabaseGeoBackup -ResourceGroupName $ResourceGroupName -ServerName $ServerName -
DatabaseName $DatabaseName

# Recover data warehouse


$GeoRestoredDatabase = Restore-AzSqlDatabase –FromGeoBackup -ResourceGroupName $TargetResourceGroupName -
ServerName $TargetServerName -TargetDatabaseName $NewDatabaseName –ResourceId $GeoBackup.ResourceID -
ServiceObjectiveName $TargetServiceObjective

# Verify that the geo-restored data warehouse is online


$GeoRestoredDatabase.status

The recovered database will be TDE-enabled if the source database is TDE-enabled.

Restore from an Azure geographical region through Azure portal


Follow the steps outlined below to restore an Azure SQL Data Warehouse from a geo-backup:
1. Sign in to your Azure portal account.
2. Click + Create a resource and search for SQL Data Warehouse and click Create.

3. Fill out the information requested in the Basics tab and click Next: Additional settings.
4. For Use existing data parameter, select Backup and select the appropriate backup from the scroll down
options. Click Review + Create.

5. Once the data warehouse has been restored, check that the Status is Online.
Next Steps
Restore an existing data warehouse
Restore a deleted data warehouse
Analyze data with Azure Machine Learning
5/17/2019 • 3 minutes to read • Edit Online

This tutorial uses Azure Machine Learning to build a predictive machine learning model based on data stored in
Azure SQL Data Warehouse. Specifically, this builds a targeted marketing campaign for Adventure Works, the
bike shop, by predicting if a customer is likely to buy a bike or not.

Prerequisites
To step through this tutorial, you need:
A SQL Data Warehouse pre-loaded with AdventureWorksDW sample data. To provision this, see Create a SQL
Data Warehouse and choose to load the sample data. If you already have a data warehouse but do not have
sample data, you can load sample data manually.

1. Get the data


The data is in the dbo.vTargetMail view in the AdventureWorksDW database. To read this data:
1. Sign into Azure Machine Learning studio and click on my experiments.
2. Click +NEW on the bottom left of the screen and select Blank Experiment.
3. Enter a name for your experiment: Targeted Marketing.
4. Drag the Import data module under Data Input and output from the modules pane into the canvas.
5. Specify the details of your SQL Data Warehouse database in the Properties pane.
6. Specify the database query to read the data of interest.

SELECT [CustomerKey]
,[GeographyKey]
,[CustomerAlternateKey]
,[MaritalStatus]
,[Gender]
,cast ([YearlyIncome] as int) as SalaryYear
,[TotalChildren]
,[NumberChildrenAtHome]
,[EnglishEducation]
,[EnglishOccupation]
,[HouseOwnerFlag]
,[NumberCarsOwned]
,[CommuteDistance]
,[Region]
,[Age]
,[BikeBuyer]
FROM [dbo].[vTargetMail]

Run the experiment by clicking Run under the experiment canvas.


After the experiment finishes running successfully, click the output port at the bottom of the Reader module and
select Visualize to see the imported data.

2. Clean the data


To clean the data, drop some columns that are not relevant for the model. To do this:
1. Drag the Select Columns in Dataset module under Data Transformation < Manipulation into the canvas.
Connect this module to the Import Data module.
2. Click Launch column selector in the Properties pane to specify which columns you wish to drop.
3. Exclude two columns: CustomerAlternateKey and GeographyKey.

3. Build the model


We will split the data 80-20: 80% to train a machine learning model and 20% to test the model. We will make use
of the “Two-Class” algorithms for this binary classification problem.
1. Drag the Split module into the canvas.
2. In the properties pane, enter 0.8 for Fraction of rows in the first output dataset.
3. Drag the Two-Class Boosted Decision Tree module into the canvas.
4. Drag the Train Model module into the canvas and specify inputs by connecting it to the Two-Class Boosted
Decision Tree (ML algorithm) and Split (data to train the algorithm on) modules.

5. Then, click Launch column selector in the Properties pane. Select the BikeBuyer column as the column to
predict.
4. Score the model
Now, we will test how the model performs on test data. We will compare the algorithm of our choice with a
different algorithm to see which performs better.
1. Drag Score Model module into the canvas and connect it to Train Model and Split Data modules.

2. Drag the Two-Class Bayes Point Machine into the experiment canvas. We will compare how this algorithm
performs in comparison to the Two-Class Boosted Decision Tree.
3. Copy and Paste the modules Train Model and Score Model in the canvas.
4. Drag the Evaluate Model module into the canvas to compare the two algorithms.
5. Run the experiment.

6. Click the output port at the bottom of the Evaluate Model module and click Visualize.
The metrics provided are the ROC curve, precision-recall diagram, and lift curve. Looking at these metrics, we can
see that the first model performed better than the second one. To look at what the first model predicted, click
the output port of the Score Model module and click Visualize.

You will see two more columns added to your test dataset.
Scored Probabilities: the likelihood that a customer is a bike buyer.
Scored Labels: the classification done by the model – bike buyer (1) or not (0). This probability threshold for
labeling is set to 50% and can be adjusted.
Comparing the column BikeBuyer (actual) with the Scored Labels (prediction), you can see how well the model
has performed. As next steps, you can use this model to make predictions for new customers and publish this
model as a web service or write results back to SQL Data Warehouse.

Next steps
To learn more about building predictive machine learning models, refer to Introduction to Machine Learning on
Azure.
Get started quickly with Fivetran and SQL Data
Warehouse
5/17/2019 • 2 minutes to read • Edit Online

This quickstart describes how to set up a new Fivetran user to work with Azure SQL Data Warehouse. The article
assumes that you have an existing instance of SQL Data Warehouse.

Set up a connection
1. Find the fully qualified server name and database name that you use to connect to SQL Data Warehouse.
If you need help finding this information, see Connect to Azure SQL Data Warehouse.
2. In the setup wizard, choose whether to connect your database directly or by using an SSH tunnel.
If you choose to connect directly to your database, you must create a firewall rule to allow access. This
method is the simplest and most secure method.
If you choose to connect by using an SSH tunnel, Fivetran connects to a separate server on your network.
The server provides an SSH tunnel to your database. You must use this method if your database is in an
inaccessible subnet on a virtual network.
3. Add the IP address 52.0.2.4 to your server-level firewall to allow incoming connections to your SQL Data
Warehouse instance from Fivetran.
For more information, see Create a server-level firewall rule.

Set up user credentials


1. Connect to your Azure SQL Data Warehouse by using SQL Server Management Studio or the tool that you
prefer. Sign in as a server admin user. Then, run the following SQL commands to create a user for Fivetran:
In the master database:

CREATE LOGIN fivetran WITH PASSWORD = '<password>';

In SQL Data Warehouse database:

CREATE USER fivetran_user_without_login without login;


CREATE USER fivetran FOR LOGIN fivetran;
GRANT IMPERSONATE on USER::fivetran_user_without_login to fivetran;

2. Grant the Fivetran user the following permissions to your warehouse:

GRANT CONTROL to fivetran;

CONTROL permission is required to create database-scoped credentials that are used when a user loads
files from Azure Blob storage by using PolyBase.
3. Add a suitable resource class to the Fivetran user. The resource class you use depends on the memory that's
required to create a columnstore index. For example, integrations with products like Marketo and Salesforce
require a higher resource class because of the large number of columns and the larger volume of data the
products use. A higher resource class requires more memory to create columnstore indexes.
We recommend that you use static resource classes. You can start with the staticrc20 resource class. The
staticrc20 resource class allocates 200 MB for each user, regardless of the performance level you use. If
columnstore indexing fails at the initial resource class level, increase the resource class.

EXEC sp_addrolemember '<resource_class_name>', 'fivetran';

For more information, read about memory and concurrency limits and resource classes.

Sign in to Fivetran
To sign in to Fivetran, enter the credentials that you use to access SQL Data Warehouse:
Host (your server name).
Port.
Database.
User (the user name should be fivetran@server_name where server_name is part of your Azure host URI:
server_name.database.windows.net).
Password.
Striim Azure SQL DW Marketplace Offering Install
Guide
5/17/2019 • 2 minutes to read • Edit Online

This quickstart assumes that you already have a pre-existing instance of SQL Data Warehouse.
Search for Striim in the Azure Marketplace, and select the Striim for Data Integration to SQL Data Warehouse
(Staged) option

Configure the Striim VM with specified properties, noting down the Striim cluster name, password, and admin
password
Once deployed, click on <VM Name>-masternode in the Azure portal, click Connect, and copy the Login using VM
local account

Download the sqljdbc42.jar from https://www.microsoft.com/en-us/download/details.aspx?id=54671 to your local


machine.
Open a command-line window, and change directories to where you downloaded the JDBC jar. SCP the jar file to
your Striim VM, getting the address and password from the Azure portal
Open another command-line window, or use an ssh utility to ssh into the Striim cluster

Execute the following commands to move the JDBC jar file into Striim’s lib directory, and then stop and restart the Striim server.
1. sudo su
2. cd /tmp
3. mv sqljdbc42.jar /opt/striim/lib
4. systemctl stop striim-node
5. systemctl stop striim-dbms
6. systemctl start striim-dbms
7. systemctl start striim-node
Now, open your favorite browser and navigate to <DNS Name>:9080

Log in with the username and the password you set up in the Azure portal, and select your preferred wizard to get
started, or go to the Apps page to start using the drag and drop UI
Use Azure Stream Analytics with SQL Data
Warehouse
5/17/2019 • 2 minutes to read • Edit Online

Azure Stream Analytics is a fully managed service providing low-latency, highly available, scalable complex event
processing over streaming data in the cloud. You can learn the basics by reading Introduction to Azure Stream
Analytics. You can then learn how to create an end-to-end solution with Stream Analytics by following the Get
started using Azure Stream Analytics tutorial.
In this article, you will learn how to use your Azure SQL Data Warehouse database as an output sink for your
Stream Analytics jobs.

Prerequisites
First, run through the following steps in the Get started using Azure Stream Analytics tutorial.
1. Create an Event Hub input
2. Configure and start event generator application
3. Provision a Stream Analytics job
4. Specify job input and query
Then, create an Azure SQL Data Warehouse database

Specify job output: Azure SQL Data Warehouse database


Step 1
In your Stream Analytics job click OUTPUT from the top of the page, and then click ADD.
Step 2
Select SQL Database.
Step 3
Enter the following values on the next page:
Output Alias: Enter a friendly name for this job output.
Subscription:
If your SQL Data Warehouse database is in the same subscription as the Stream Analytics job, select Use
SQL Database from Current Subscription.
If your database is in a different subscription, select Use SQL Database from Another Subscription.
Database: Specify the name of a destination database.
Server Name: Specify the server name for the database you just specified. You can use the Azure portal to find
this.
User Name: Specify the user name of an account that has write permissions for the database.
Password: Provide the password for the specified user account.
Table: Specify the name of the target table in the database. The target table must already exist; see the sketch after this list.
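The columns of the target table need to line up with the output of your Stream Analytics query. The following is only a minimal sketch; the table name and columns are hypothetical placeholders, so replace them with the shape of your own query output:

CREATE TABLE dbo.StreamOutput
(
    DeviceId    NVARCHAR(50),   -- placeholder columns; match these to your Stream Analytics query output
    EventTime   DATETIME2,
    Temperature FLOAT
)
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );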

Step 4
Click the check button to add this job output and to verify that Stream Analytics can successfully connect to the
database.
When the connection to the database succeeds, you will see a notification in the portal. You can click Test to test the
connection to the database.

Next steps
For an overview of integration, see SQL Data Warehouse integration overview.
For more development tips, see SQL Data Warehouse development overview.
Visualize data with Power BI
5/24/2019 • 2 minutes to read • Edit Online

This tutorial shows you how to use Power BI to connect to SQL Data Warehouse and create a few basic
visualizations.

Prerequisites
To step through this tutorial, you need:
A SQL Data Warehouse pre-loaded with the AdventureWorksDW database. To provision a data warehouse,
see Create a SQL Data Warehouse and choose to load the sample data. If you already have a data warehouse
but do not have sample data, you can load WideWorldImportersDW.

1. Connect to your database


To open Power BI and connect to your AdventureWorksDW database:
1. Sign into the Azure portal.
2. Click SQL databases and choose your AdventureWorks SQL Data Warehouse database.

3. Click the 'Open in Power BI' button.


4. You should now see the SQL Data Warehouse connection page displaying your database web address.
Click next.

5. Enter your Azure SQL server username and password.


6. To open the database, click the AdventureWorksDW dataset on the left blade.

2. Create a report
You are now ready to use Power BI to analyze your AdventureWorksDW sample data. To perform the analysis,
AdventureWorksDW has a view called AggregateSales. This view contains a few of the key metrics for analyzing
the sales of the company.
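If you'd like to preview this view in T-SQL before building the visuals, a quick aggregation works. This is only a sketch and assumes the view lives in the dbo schema and exposes the PostalCode and SalesAmount columns used in the steps below:

-- Preview the sales-by-postal-code aggregation that the map visual will show
SELECT TOP 10 PostalCode, SUM(SalesAmount) AS TotalSales
FROM dbo.AggregateSales
GROUP BY PostalCode
ORDER BY TotalSales DESC;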
1. To create a map of sales amount according to postal code, in the right-hand fields pane, click the
AggregateSales view to expand it. Click the PostalCode and SalesAmount columns to select them.

Power BI automatically recognized geographic data and put it in a map for you.
2. This step creates a bar graph that shows the amount of sales per customer income. To create the bar graph, go
to the expanded AggregateSales view. Click the SalesAmount field. Drag the Customer Income field to the
left and drop it into Axis.

The bar chart appears on the left.


3. This step creates a line chart that shows sales amount per order date. To create the line chart, go to the
expanded AggregateSales view. Click SalesAmount and OrderDate. In the Visualizations column, click the
Line Chart icon, which is the first icon in the second line under visualizations.

You now have a report that shows three different visualizations of the data.
You can save your progress at any time by clicking File and selecting Save.

Using Direct Connect


As with Azure SQL Database, SQL Data Warehouse Direct Connect allows logical pushdown alongside the
analytical capabilities of Power BI. With Direct Connect, queries are sent back to your Azure SQL Data
Warehouse in real-time as you explore the data. This feature, combined with the scale of SQL Data Warehouse,
enables you to create dynamic reports in minutes against terabytes of data. In addition, the introduction of the
Open in Power BI button allows users to directly connect Power BI to their SQL Data Warehouse without
collecting information from other parts of Azure.
When using Direct Connect:
Specify the fully qualified server name when connecting.
Ensure firewall rules for the database are configured to Allow access to Azure services.
Every action such as selecting a column, or adding a filter, directly queries the data warehouse.
Tiles are refreshed automatically, approximately every 15 minutes.
Q&A is not available for Direct Connect datasets.
Schema changes are incorporated automatically.
All Direct Connect queries will time out after 2 minutes.
These restrictions and notes may change as the experiences improve.
Next steps
Now that we've given you some time to warm up with the sample data, see how to develop or load. Or take a
look at the Power BI website.
Troubleshooting connectivity issues
9/10/2019 • 2 minutes to read • Edit Online

This article lists common troubleshooting techniques around connecting to your SQL Data Warehouse.
Check service availability
Check for paused or scaling operation
Check your firewall settings
Check your VNet/Service Endpoint settings
Check for the latest drivers
Check your connection string
Intermittent connection issues
Common error messages

Check service availability


Check to see if the service is available. In the Azure portal, go to the SQL Data Warehouse you're trying to connect to.
In the left TOC panel, click on Diagnose and solve problems.

The status of your SQL Data Warehouse will be shown here. If the service isn't showing as Available, continue
with the checks that follow.
If your Resource health shows that your data warehouse is paused or scaling, follow the guidance to resume your
data warehouse.

Additional information about Resource Health can be found here.

Check for paused or scaling operation


Check the portal to see if your SQL Data Warehouse is paused or scaling.

If you see that your service is paused or scaling, check whether the operation falls within your maintenance schedule. On the
Overview page for your SQL Data Warehouse in the portal, you'll see the selected maintenance schedule.

Otherwise, check with your IT administrator to verify that this maintenance isn't a scheduled event. To resume the
SQL Data Warehouse, follow the steps outlined here.
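If you prefer T-SQL, you can also look for an in-flight pause, resume, or scale operation by connecting to the master database and querying sys.dm_operation_status. The following is a minimal sketch; the data warehouse name MySQLDW is a placeholder:

-- Run against the master database on your logical server
SELECT major_resource_id, operation, state_desc, percent_complete, start_time
FROM sys.dm_operation_status
WHERE major_resource_id = 'MySQLDW'
ORDER BY start_time DESC;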

Check your firewall settings


SQL Data Warehouse communicates over port 1433. If you're trying to connect from within a corporate network,
outbound traffic over port 1433 might not be allowed by your network's firewall. In that case, you can't connect to
your Azure SQL Database server unless your IT department opens port 1433. Additional information on firewall
configurations can be found here.

Check your VNet/Service Endpoint settings


If you're receiving errors 40914 and 40615, see the error description and resolution here.

Check for the latest drivers


Software
Check to make sure you're using the latest tools to connect to your SQL Data Warehouse:
SSMS
Azure Data Studio
SQL Server Data Tools (Visual Studio)
Drivers
Check to make sure you're using the latest driver versions. Using an older version of the drivers could result in
unexpected behaviors as the older drivers may not support new features.
ODBC
JDBC
OLE DB
PHP

Check your connection string


Check to make sure your connection strings are set properly. Below are some samples. You can find additional
information around connection strings here.
ADO.NET connection string

Server=tcp:{your_server}.database.windows.net,1433;Database={your_database};User ID={your_user_name};Password={your_password_here};Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;

ODBC Connection string

Driver={SQL Server Native Client 11.0};Server=tcp:{your_server}.database.windows.net,1433;Database={your_database};Uid={your_user_name};Pwd={your_password_here};Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;

PHP Connection string

Server: {your_server}.database.windows.net,1433 \r\nSQL Database: {your_database}\r\nUser Name: {your_user_name}\r\n\r\nPHP Data Objects(PDO) Sample Code:\r\n\r\ntry {\r\n $conn = new PDO ( \"sqlsrv:server =
tcp:{your_server}.database.windows.net,1433; Database = {your_database}\", \"{your_user_name}\", \"
{your_password_here}\");\r\n $conn->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );\r\n}\r\ncatch (
PDOException $e ) {\r\n print( \"Error connecting to SQL Server.\" );\r\n die(print_r($e));\r\n}\r\n\rSQL
Server Extension Sample Code:\r\n\r\n$connectionInfo = array(\"UID\" => \"{your_user_name}\", \"pwd\" => \"
{your_password_here}\", \"Database\" => \"{your_database}\", \"LoginTimeout\" => 30, \"Encrypt\" => 1,
\"TrustServerCertificate\" => 0);\r\n$serverName = \"tcp:{your_server}.database.windows.net,1433\";\r\n$conn =
sqlsrv_connect($serverName, $connectionInfo);

JDBC connection string

jdbc:sqlserver://yourserver.database.windows.net:1433;database=yourdatabase;user={your_user_name};password={your_password_here};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;

Intermittent connection issues


Check to see if you're experiencing heavy load on the server with a high number of queued requests. You may need
to scale up your data warehouse for additional resources.
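One way to check for queueing is to count running versus suspended requests through the DMVs. A minimal sketch, run against the data warehouse itself:

-- Count running versus queued (suspended) requests
SELECT status, COUNT(*) AS request_count
FROM sys.dm_pdw_exec_requests
WHERE status IN ('Running', 'Suspended')
GROUP BY status;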

Common error messages


For errors 40914 and 40615, see the error description and resolution here.

Still having connectivity issues?


Create a support ticket so the engineering team can support you.
Database collation support for Azure SQL Data
Warehouse
8/9/2019 • 2 minutes to read • Edit Online

You can change the default database collation from the Azure portal when you create a new Azure SQL Data
Warehouse database. This capability makes it even easier to create a new database using one of the 3800
supported database collations for SQL Data Warehouse. Collations provide the locale, code page, sort order and
character sensitivity rules for character-based data types. Once chosen, all columns and expressions requiring
collation information inherit the chosen collation from the database setting. The default inheritance can be
overridden by explicitly stating a different collation for a character-based data type.
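For example, a column-level COLLATE clause overrides the database default. The following is only a sketch; the table name, column names, and the Latin1_General_100_CS_AS collation are illustrative choices:

CREATE TABLE dbo.LocalizedStrings
(
    id                  INT NOT NULL,
    default_text        NVARCHAR(100),                                 -- inherits the database default collation
    case_sensitive_text NVARCHAR(100) COLLATE Latin1_General_100_CS_AS -- explicit collation overrides the default
)
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );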

Changing collation
To change the default collation, you simply update the Collation field in the provisioning experience.
For example, if you wanted to change the default collation to case sensitive, you would change the collation
from SQL_Latin1_General_CP1_CI_AS to SQL_Latin1_General_CP1_CS_AS.

List of unsupported collation types


Japanese_Bushu_Kakusu_140_BIN
Japanese_Bushu_Kakusu_140_BIN2
Japanese_Bushu_Kakusu_140_CI_AI_VSS
Japanese_Bushu_Kakusu_140_CI_AI_WS_VSS
Japanese_Bushu_Kakusu_140_CI_AI_KS_VSS
Japanese_Bushu_Kakusu_140_CI_AI_KS_WS_VSS
Japanese_Bushu_Kakusu_140_CI_AS_VSS
Japanese_Bushu_Kakusu_140_CI_AS_WS_VSS
Japanese_Bushu_Kakusu_140_CI_AS_KS_VSS
Japanese_Bushu_Kakusu_140_CI_AS_KS_WS_VSS
Japanese_Bushu_Kakusu_140_CS_AI_VSS
Japanese_Bushu_Kakusu_140_CS_AI_WS_VSS
Japanese_Bushu_Kakusu_140_CS_AI_KS_VSS
Japanese_Bushu_Kakusu_140_CS_AI_KS_WS_VSS
Japanese_Bushu_Kakusu_140_CS_AS_VSS
Japanese_Bushu_Kakusu_140_CS_AS_WS_VSS
Japanese_Bushu_Kakusu_140_CS_AS_KS_VSS
Japanese_Bushu_Kakusu_140_CS_AS_KS_WS_VSS
Japanese_Bushu_Kakusu_140_CI_AI
Japanese_Bushu_Kakusu_140_CI_AI_WS
Japanese_Bushu_Kakusu_140_CI_AI_KS
Japanese_Bushu_Kakusu_140_CI_AI_KS_WS
Japanese_Bushu_Kakusu_140_CI_AS
Japanese_Bushu_Kakusu_140_CI_AS_WS
Japanese_Bushu_Kakusu_140_CI_AS_KS
Japanese_Bushu_Kakusu_140_CI_AS_KS_WS
Japanese_Bushu_Kakusu_140_CS_AI
Japanese_Bushu_Kakusu_140_CS_AI_WS
Japanese_Bushu_Kakusu_140_CS_AI_KS
Japanese_Bushu_Kakusu_140_CS_AI_KS_WS
Japanese_Bushu_Kakusu_140_CS_AS
Japanese_Bushu_Kakusu_140_CS_AS_WS
Japanese_Bushu_Kakusu_140_CS_AS_KS
Japanese_Bushu_Kakusu_140_CS_AS_KS_WS
Japanese_XJIS_140_BIN
Japanese_XJIS_140_BIN2
Japanese_XJIS_140_CI_AI_VSS
Japanese_XJIS_140_CI_AI_WS_VSS
Japanese_XJIS_140_CI_AI_KS_VSS
Japanese_XJIS_140_CI_AI_KS_WS_VSS
Japanese_XJIS_140_CI_AS_VSS
Japanese_XJIS_140_CI_AS_WS_VSS
Japanese_XJIS_140_CI_AS_KS_VSS
Japanese_XJIS_140_CI_AS_KS_WS_VSS
Japanese_XJIS_140_CS_AI_VSS
Japanese_XJIS_140_CS_AI_WS_VSS
Japanese_XJIS_140_CS_AI_KS_VSS
Japanese_XJIS_140_CS_AI_KS_WS_VSS
Japanese_XJIS_140_CS_AS_VSS
Japanese_XJIS_140_CS_AS_WS_VSS
Japanese_XJIS_140_CS_AS_KS_VSS
Japanese_XJIS_140_CS_AS_KS_WS_VSS
Japanese_XJIS_140_CI_AI
Japanese_XJIS_140_CI_AI_WS
Japanese_XJIS_140_CI_AI_KS
Japanese_XJIS_140_CI_AI_KS_WS
Japanese_XJIS_140_CI_AS
Japanese_XJIS_140_CI_AS_WS
Japanese_XJIS_140_CI_AS_KS
Japanese_XJIS_140_CI_AS_KS_WS
Japanese_XJIS_140_CS_AI
Japanese_XJIS_140_CS_AI_WS
Japanese_XJIS_140_CS_AI_KS
Japanese_XJIS_140_CS_AI_KS_WS
Japanese_XJIS_140_CS_AS
Japanese_XJIS_140_CS_AS_WS
Japanese_XJIS_140_CS_AS_KS
Japanese_XJIS_140_CS_AS_KS_WS
SQL_EBCDIC1141_CP1_CS_AS
SQL_EBCDIC277_2_CP1_CS_AS
Checking the current collation
To check the current collation for the database, you can run the following T-SQL snippet:

SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS Collation;

When passed 'Collation' as the property parameter, the DATABASEPROPERTYEX function returns the current collation for the
database specified. You can learn more about the DATABASEPROPERTYEX function on MSDN.
T-SQL language elements supported in Azure SQL
Data Warehouse
7/24/2019 • 2 minutes to read • Edit Online

Links to the documentation for T-SQL language elements supported in Azure SQL Data Warehouse.

Core elements
syntax conventions
object naming rules
reserved keywords
collations
comments
constants
data types
EXECUTE
expressions
KILL
IDENTITY property workaround
PRINT
USE

Batches, control-of-flow, and variables


BEGIN...END
BREAK
DECLARE @local_variable
IF...ELSE
RAISERROR
SET @local_variable
THROW
TRY...CATCH
WHILE

Operators
+ (Add)
+ (String Concatenation)
- (Negative)
- (Subtract)
* (Multiply)
/ (Divide)
Modulo

Wildcard character(s) to match


= (Equals)
> (Greater than)
< (Less than)
>= (Greater than or equal to)
<= (Less than or equal to)
<> (Not equal to)
!= (Not equal to)
AND
BETWEEN
EXISTS
IN
IS [NOT] NULL
LIKE
NOT
OR
Bitwise operators
& (Bitwise AND)
| (Bitwise OR)
^ (Bitwise exclusive OR)
~ (Bitwise NOT)
^= (Bitwise exclusive OR EQUALS)
|= (Bitwise OR EQUALS)
&= (Bitwise AND EQUALS)

Functions
@@DATEFIRST
@@ERROR
@@LANGUAGE
@@SPID
@@TRANCOUNT
@@VERSION
ABS
ACOS
ASCII
ASIN
ATAN
ATN2
BINARY_CHECKSUM
CASE
CAST and CONVERT
CEILING
CHAR
CHARINDEX
CHECKSUM
COALESCE
COL_NAME
COLLATIONPROPERTY
CONCAT
COS
COT
COUNT
COUNT_BIG
CUME_DIST
CURRENT_TIMESTAMP
CURRENT_USER
DATABASEPROPERTYEX
DATALENGTH
DATEADD
DATEDIFF
DATEFROMPARTS
DATENAME
DATEPART
DATETIME2FROMPARTS
DATETIMEFROMPARTS
DATETIMEOFFSETFROMPARTS
DAY
DB_ID
DB_NAME
DEGREES
DENSE_RANK
DIFFERENCE
EOMONTH
ERROR_MESSAGE
ERROR_NUMBER
ERROR_PROCEDURE
ERROR_SEVERITY
ERROR_STATE
EXP
FIRST_VALUE
FLOOR
GETDATE
GETUTCDATE
HAS_DBACCESS
HASHBYTES
INDEXPROPERTY
ISDATE
ISNULL
ISNUMERIC
LAG
LAST_VALUE
LEAD
LEFT
LEN
LOG
LOG10
LOWER
LTRIM
MAX
MIN
MONTH
NCHAR
NTILE
NULLIF
OBJECT_ID
OBJECT_NAME
OBJECTPROPERTY
OBJECTPROPERTYEX
ODBC scalar functions
OVER clause
PARSENAME
PATINDEX
PERCENTILE_CONT
PERCENTILE_DISC
PERCENT_RANK
PI
POWER
QUOTENAME
RADIANS
RAND
RANK
REPLACE
REPLICATE
REVERSE
RIGHT
ROUND
ROW_NUMBER
RTRIM
SCHEMA_ID
SCHEMA_NAME
SERVERPROPERTY
SESSION_USER
SIGN
SIN
SMALLDATETIMEFROMPARTS
SOUNDEX
SPACE
SQL_VARIANT_PROPERTY
SQRT
SQUARE
STATS_DATE
STDEV
STDEVP
STR
STUFF
SUBSTRING
SUM
SUSER_SNAME
SWITCHOFFSET
SYSDATETIME
SYSDATETIMEOFFSET
SYSTEM_USER
SYSUTCDATETIME
TAN
TERTIARY_WEIGHTS
TIMEFROMPARTS
TODATETIMEOFFSET
TYPE_ID
TYPE_NAME
TYPEPROPERTY
UNICODE
UPPER
USER
USER_NAME
VAR
VARP
YEAR
XACT_STATE

Transactions
transactions

Diagnostic sessions
CREATE DIAGNOSTICS SESSION

Procedures
sp_addrolemember
sp_columns
sp_configure
sp_datatype_info_90
sp_droprolemember
sp_execute
sp_executesql
sp_fkeys
sp_pdw_add_network_credentials
sp_pdw_database_encryption
sp_pdw_database_encryption_regenerate_system_keys
sp_pdw_log_user_data_masking
sp_pdw_remove_network_credentials
sp_pkeys
sp_prepare
sp_spaceused
sp_special_columns_100
sp_sproc_columns
sp_statistics
sp_tables
sp_unprepare

SET statements
SET ANSI_DEFAULTS
SET ANSI_NULL_DFLT_OFF
SET ANSI_NULL_DFLT_ON
SET ANSI_NULLS
SET ANSI_PADDING
SET ANSI_WARNINGS
SET ARITHABORT
SET ARITHIGNORE
SET CONCAT_NULL_YIELDS_NULL
SET DATEFIRST
SET DATEFORMAT
SET FMTONLY
SET IMPLICIT_TRANSACTIONS
SET LOCK_TIMEOUT
SET NUMERIC_ROUNDABORT
SET QUOTED_IDENTIFIER
SET ROWCOUNT
SET TEXTSIZE
SET TRANSACTION ISOLATION LEVEL
SET XACT_ABORT

Next steps
For more reference information, see T-SQL statements in Azure SQL Data Warehouse, and System views in Azure
SQL Data Warehouse.
T-SQL statements supported in Azure SQL Data
Warehouse
7/24/2019 • 2 minutes to read • Edit Online

Links to the documentation for T-SQL statements supported in Azure SQL Data Warehouse.

Data Definition Language (DDL) statements


ALTER DATABASE
ALTER INDEX
ALTER MATERIALIZED VIEW (Preview )
ALTER PROCEDURE
ALTER SCHEMA
ALTER TABLE
CREATE COLUMNSTORE INDEX
CREATE DATABASE
CREATE DATABASE SCOPED CREDENTIAL
CREATE EXTERNAL DATA SOURCE
CREATE EXTERNAL FILE FORMAT
CREATE EXTERNAL TABLE
CREATE FUNCTION
CREATE INDEX
CREATE MATERIALIZED VIEW AS SELECT (Preview )
CREATE PROCEDURE
CREATE SCHEMA
CREATE STATISTICS
CREATE TABLE
CREATE TABLE AS SELECT
CREATE VIEW
CREATE WORKLOAD CLASSIFIER
DROP EXTERNAL DATA SOURCE
DROP EXTERNAL FILE FORMAT
DROP EXTERNAL TABLE
DROP INDEX
DROP PROCEDURE
DROP STATISTICS
DROP TABLE
DROP SCHEMA
DROP VIEW
DROP WORKLOAD CLASSIFIER
RENAME
SET RESULT_SET_CACHING
TRUNCATE TABLE
UPDATE STATISTICS
Data Manipulation Language (DML) statements
DELETE
INSERT
UPDATE

Database Console Commands


DBCC DROPCLEANBUFFERS
DBCC DROPRESULTSETCACHE (Preview )
DBCC FREEPROCCACHE
DBCC SHRINKLOG
DBCC PDW_SHOWEXECUTIONPLAN
DBCC PDW_SHOWMATERIALIZEDVIEWOVERHEAD
DBCC SHOWRESULTCACHESPACEUSED (Preview )
DBCC PDW_SHOWPARTITIONSTATS
DBCC PDW_SHOWSPACEUSED
DBCC SHOW_STATISTICS

Query statements
SELECT
WITH common_table_expression
EXCEPT and INTERSECT
EXPLAIN
FROM
Using PIVOT and UNPIVOT
GROUP BY
HAVING
ORDER BY
OPTION
UNION
WHERE
TOP
Aliasing
Search condition
Subqueries

Security statements
Permissions: GRANT, DENY, REVOKE
ALTER AUTHORIZATION
ALTER CERTIFICATE
ALTER DATABASE ENCRYPTION KEY
ALTER LOGIN
ALTER MASTER KEY
ALTER ROLE
ALTER USER
BACKUP CERTIFICATE
CLOSE MASTER KEY
CREATE CERTIFICATE
CREATE DATABASE ENCRYPTION KEY
CREATE LOGIN
CREATE MASTER KEY
CREATE ROLE
CREATE USER
DROP CERTIFICATE
DROP DATABASE ENCRYPTION KEY
DROP LOGIN
DROP MASTER KEY
DROP ROLE
DROP USER
OPEN MASTER KEY

Next steps
For more reference information, see T-SQL language elements in Azure SQL Data Warehouse, and System views
in Azure SQL Data Warehouse.
System views supported in Azure SQL Data
Warehouse
8/16/2019 • 2 minutes to read • Edit Online

Links to the documentation for the system views supported in Azure SQL Data Warehouse.

SQL Data Warehouse catalog views


sys.pdw_column_distribution_properties
sys.pdw_distributions
sys.pdw_index_mappings
sys.pdw_loader_backup_run_details
sys.pdw_loader_backup_runs
sys.pdw_materialized_view_column_distribution_properties (Preview )
sys.pdw_materialized_view_distribution_properties (Preview )
sys.pdw_materialized_view_mappings (Preview )
sys.pdw_nodes_column_store_dictionaries
sys.pdw_nodes_column_store_row_groups
sys.pdw_nodes_column_store_segments
sys.pdw_nodes_columns
sys.pdw_nodes_indexes
sys.pdw_nodes_partitions
sys.pdw_nodes_pdw_physical_databases
sys.pdw_nodes_tables
sys.pdw_replicated_table_cache_state
sys.pdw_table_distribution_properties
sys.pdw_table_mappings
sys.resource_governor_workload_groups
sys.workload_management_workload_classifier_details (Preview )
sys.workload_management_workload_classifiers (Preview )

SQL Data Warehouse dynamic management views (DMVs)


sys.dm_pdw_dms_cores
sys.dm_pdw_dms_external_work
sys.dm_pdw_dms_workers
sys.dm_pdw_errors
sys.dm_pdw_exec_connections
sys.dm_pdw_exec_requests
sys.dm_pdw_exec_sessions
sys.dm_pdw_hadoop_operations
sys.dm_pdw_lock_waits
sys.dm_pdw_nodes
sys.dm_pdw_nodes_database_encryption_keys
sys.dm_pdw_os_threads
sys.dm_pdw_request_steps
sys.dm_pdw_resource_waits
sys.dm_pdw_sql_requests
sys.dm_pdw_sys_info
sys.dm_pdw_wait_stats
sys.dm_pdw_waits

SQL Server DMVs applicable to SQL Data Warehouse


The following DMVs are applicable to SQL Data Warehouse, but must be executed by connecting to the master
database.
sys.database_service_objectives
sys.dm_operation_status
sys.fn_helpcollations()
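For example, connecting to the master database lets you check the current service objective (data warehouse units) of each database on the server. A minimal sketch:

-- Run against the master database
SELECT d.name, dso.edition, dso.service_objective
FROM sys.database_service_objectives AS dso
JOIN sys.databases AS d ON dso.database_id = d.database_id;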

SQL Server catalog views


sys.all_columns
sys.all_objects
sys.all_parameters
sys.all_sql_modules
sys.all_views
sys.assemblies
sys.assembly_modules
sys.assembly_types
sys.certificates
sys.check_constraints
sys.columns
sys.computed_columns
sys.credentials
sys.data_spaces
sys.database_credentials
sys.database_files
sys.database_permissions
sys.database_principals
sys.database_query_store_options
sys.database_role_members
sys.databases
sys.default_constraints
sys.external_data_sources
sys.external_file_formats
sys.external_tables
sys.filegroups
sys.foreign_key_columns
sys.foreign_keys
sys.identity_columns
sys.index_columns
sys.indexes
sys.key_constraints
sys.numbered_procedures
sys.objects
sys.parameters
sys.partition_functions
sys.partition_parameters
sys.partition_range_values
sys.partition_schemes
sys.partitions
sys.procedures
sys.query_context_settings
sys.query_store_plan
sys.query_store_query
sys.query_store_query_text
sys.query_store_runtime_stats
sys.query_store_runtime_stats_interval
sys.schemas
sys.securable_classes
sys.sql_expression_dependencies
sys.sql_modules
sys.stats
sys.stats_columns
sys.symmetric_keys
sys.synonyms
sys.syscharsets
sys.syscolumns
sys.sysdatabases
sys.syslanguages
sys.sysobjects
sys.sysreferences
sys.system_columns
sys.system_objects
sys.system_parameters
sys.system_sql_modules
sys.system_views
sys.systypes
sys.sysusers
sys.tables
sys.types
sys.views

SQL Server DMVs available in SQL Data Warehouse


SQL Data Warehouse exposes many of the SQL Server dynamic management views (DMVs). These views, when
queried in SQL Data Warehouse, report the state of the SQL databases running on the distributions.
SQL Data Warehouse and Analytics Platform System's Parallel Data Warehouse (PDW) use the same system
views. Each DMV has a column called pdw_node_id, which is the identifier for the Compute node.
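For example, any of the mapped views in the following table can be grouped by pdw_node_id to see per-node behavior. A minimal sketch using the wait statistics view:

-- Top waits per Compute node, using the pdw_nodes_ version of sys.dm_os_wait_stats
SELECT pdw_node_id, wait_type, SUM(wait_time_ms) AS total_wait_time_ms
FROM sys.dm_pdw_nodes_os_wait_stats
GROUP BY pdw_node_id, wait_type
ORDER BY total_wait_time_ms DESC;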

NOTE
To use these views, insert ‘pdw_nodes_’ into the name, as shown in the following table:

DMV NAME IN SQL DATA WAREHOUSE SQL SERVER TRANSACT-SQL ARTICLE

sys.dm_pdw_nodes_db_column_store_row_group_physical_stats sys.dm_db_column_store_row_group_physical_stats

sys.dm_pdw_nodes_db_column_store_row_group_operational_stats sys.dm_db_column_store_row_group_operational_stats

sys.dm_pdw_nodes_db_file_space_usage sys.dm_db_file_space_usage

sys.dm_pdw_nodes_db_index_usage_stats sys.dm_db_index_usage_stats

sys.dm_pdw_nodes_db_partition_stats sys.dm_db_partition_stats

sys.dm_pdw_nodes_db_session_space_usage sys.dm_db_session_space_usage

sys.dm_pdw_nodes_db_task_space_usage sys.dm_db_task_space_usage

sys.dm_pdw_nodes_exec_background_job_queue sys.dm_exec_background_job_queue

sys.dm_pdw_nodes_exec_background_job_queue_stats sys.dm_exec_background_job_queue_stats

sys.dm_pdw_nodes_exec_cached_plans sys.dm_exec_cached_plans

sys.dm_pdw_nodes_exec_connections sys.dm_exec_connections

sys.dm_pdw_nodes_exec_procedure_stats sys.dm_exec_procedure_stats

sys.dm_pdw_nodes_exec_query_memory_grants sys.dm_exec_query_memory_grants

sys.dm_pdw_nodes_exec_query_optimizer_info sys.dm_exec_query_optimizer_info

sys.dm_pdw_nodes_exec_query_resource_semaphores sys.dm_exec_query_resource_semaphores

sys.dm_pdw_nodes_exec_query_stats sys.dm_exec_query_stats

sys.dm_pdw_nodes_exec_requests sys.dm_exec_requests

sys.dm_pdw_nodes_exec_sessions sys.dm_exec_sessions

sys.dm_pdw_nodes_io_pending_io_requests sys.dm_io_pending_io_requests

sys.dm_pdw_nodes_io_virtual_file_stats sys.dm_io_virtual_file_stats

sys.dm_pdw_nodes_os_buffer_descriptors sys.dm_os_buffer_descriptors

sys.dm_pdw_nodes_os_child_instances sys.dm_os_child_instances

sys.dm_pdw_nodes_os_cluster_nodes sys.dm_os_cluster_nodes

sys.dm_pdw_nodes_os_dispatcher_pools sys.dm_os_dispatcher_pools

sys.dm_pdw_nodes_os_dispatchers Transact-SQL Documentation is not available.

sys.dm_pdw_nodes_os_hosts sys.dm_os_hosts

sys.dm_pdw_nodes_os_latch_stats sys.dm_os_latch_stats

sys.dm_pdw_nodes_os_memory_brokers sys.dm_os_memory_brokers

sys.dm_pdw_nodes_os_memory_cache_clock_hands sys.dm_os_memory_cache_clock_hands

sys.dm_pdw_nodes_os_memory_cache_counters sys.dm_os_memory_cache_counters

sys.dm_pdw_nodes_os_memory_cache_entries sys.dm_os_memory_cache_entries

sys.dm_pdw_nodes_os_memory_cache_hash_tables sys.dm_os_memory_cache_hash_tables

sys.dm_pdw_nodes_os_memory_clerks sys.dm_os_memory_clerks

sys.dm_pdw_nodes_os_memory_node_access_stats Transact-SQL Documentation is not available.

sys.dm_pdw_nodes_os_memory_nodes sys.dm_os_memory_nodes

sys.dm_pdw_nodes_os_memory_objects sys.dm_os_memory_objects

sys.dm_pdw_nodes_os_memory_pools sys.dm_os_memory_pools

sys.dm_pdw_nodes_os_nodes sys.dm_os_nodes

sys.dm_pdw_nodes_os_performance_counters sys.dm_os_performance_counters

sys.dm_pdw_nodes_os_process_memory sys.dm_os_process_memory

sys.dm_pdw_nodes_os_schedulers sys.dm_os_schedulers

sys.dm_pdw_nodes_os_spinlock_stats Transact-SQL Documentation is not available.

sys.dm_pdw_nodes_os_sys_info sys.dm_os_sys_info

sys.dm_pdw_nodes_os_sys_memory sys.dm_os_sys_memory

sys.dm_pdw_nodes_os_tasks sys.dm_os_tasks

sys.dm_pdw_nodes_os_threads sys.dm_os_threads

sys.dm_pdw_nodes_os_virtual_address_dump sys.dm_os_virtual_address_dump

sys.dm_pdw_nodes_os_wait_stats sys.dm_os_wait_stats

sys.dm_pdw_nodes_os_waiting_tasks sys.dm_os_waiting_tasks

sys.dm_pdw_nodes_os_workers sys.dm_os_workers

sys.dm_pdw_nodes_tran_active_snapshot_database_transactions sys.dm_tran_active_snapshot_database_transactions

sys.dm_pdw_nodes_tran_active_transactions sys.dm_tran_active_transactions

sys.dm_pdw_nodes_tran_commit_table sys.dm_tran_commit_table

sys.dm_pdw_nodes_tran_current_snapshot sys.dm_tran_current_snapshot

sys.dm_pdw_nodes_tran_current_transaction sys.dm_tran_current_transaction

sys.dm_pdw_nodes_tran_database_transactions sys.dm_tran_database_transactions

sys.dm_pdw_nodes_tran_locks sys.dm_tran_locks

sys.dm_pdw_nodes_tran_session_transactions sys.dm_tran_session_transactions

sys.dm_pdw_nodes_tran_top_version_generators sys.dm_tran_top_version_generators

SQL Server 2016 PolyBase DMVs available in SQL Data Warehouse


The following DMVs are applicable to SQL Data Warehouse, but must be executed by connecting to the master
database.
sys.dm_exec_compute_node_errors
sys.dm_exec_compute_node_status
sys.dm_exec_compute_nodes
sys.dm_exec_distributed_request_steps
sys.dm_exec_distributed_requests
sys.dm_exec_distributed_sql_requests
sys.dm_exec_dms_services
sys.dm_exec_dms_workers
sys.dm_exec_external_operations
sys.dm_exec_external_work

SQL Server INFORMATION_SCHEMA views


CHECK_CONSTRAINTS
COLUMNS
PARAMETERS
ROUTINES
SCHEMATA
TABLES
VIEW_COLUMN_USAGE
VIEW_TABLE_USAGE
VIEWS

Next steps
For more reference information, see T-SQL statements in Azure SQL Data Warehouse, and T-SQL language
elements in Azure SQL Data Warehouse.
PowerShell cmdlets and REST APIs for SQL Data
Warehouse
5/6/2019 • 2 minutes to read • Edit Online

Many SQL Data Warehouse administration tasks can be managed using either Azure PowerShell cmdlets or REST
APIs. Below are some examples of how to use PowerShell commands to automate common tasks in your SQL
Data Warehouse. For some good REST examples, see the article Manage scalability with REST.

NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.

Get started with Azure PowerShell cmdlets


1. Open Windows PowerShell.
2. At the PowerShell prompt, run these commands to sign in to the Azure Resource Manager and select your
subscription.

Connect-AzAccount
Get-AzSubscription
Select-AzSubscription -SubscriptionName "MySubscription"

Pause SQL Data Warehouse Example


Pause a database named "Database02" hosted on a server named "Server01." The server is in an Azure resource
group named "ResourceGroup1."

Suspend-AzSqlDatabase -ResourceGroupName "ResourceGroup1" -ServerName "Server01" -DatabaseName "Database02"

A variation, this example pipes the retrieved object to Suspend-AzSqlDatabase. As a result, the database is paused.
The final command shows the results.

$database = Get-AzSqlDatabase -ResourceGroupName "ResourceGroup1" -ServerName "Server01" -DatabaseName "Database02"
$resultDatabase = $database | Suspend-AzSqlDatabase
$resultDatabase

Start SQL Data Warehouse Example


Resume operation of a database named "Database02" hosted on a server named "Server01." The server is
contained in a resource group named "ResourceGroup1."
Resume-AzSqlDatabase -ResourceGroupName "ResourceGroup1" -ServerName "Server01" -DatabaseName "Database02"

A variation, this example retrieves a database named "Database02" from a server named "Server01" that is
contained in a resource group named "ResourceGroup1." It pipes the retrieved object to Resume-AzSqlDatabase.

$database = Get-AzSqlDatabase -ResourceGroupName "ResourceGroup1" -ServerName "Server01" -DatabaseName "Database02"
$resultDatabase = $database | Resume-AzSqlDatabase

NOTE
If your server is foo.database.windows.net, use "foo" as the -ServerName value in the PowerShell cmdlets.

Other supported PowerShell cmdlets


These PowerShell cmdlets are supported with Azure SQL Data Warehouse.
Get-AzSqlDatabase
Get-AzSqlDeletedDatabaseBackup
Get-AzSqlDatabaseRestorePoint
New-AzSqlDatabase
Remove-AzSqlDatabase
Restore-AzSqlDatabase
Resume-AzSqlDatabase
Select-AzSubscription
Set-AzSqlDatabase
Suspend-AzSqlDatabase

Next steps
For more PowerShell examples, see:
Create a SQL Data Warehouse using PowerShell
Database restore
For other tasks which can be automated with PowerShell, see Azure SQL Database Cmdlets. Note that not all
Azure SQL Database cmdlets are supported for Azure SQL Data Warehouse. For a list of tasks which can be
automated with REST, see Operations for Azure SQL Database.
REST APIs for Azure SQL Data Warehouse
6/3/2019 • 2 minutes to read • Edit Online

REST APIs for managing compute in Azure SQL Data Warehouse.

Scale compute
To change the data warehouse units, use the Create or Update Database REST API. The following example sets the
data warehouse units to DW1000 for the database MySQLDW, which is hosted on server MyServer. The server is
in an Azure resource group named ResourceGroup1.

PATCH https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-
name}/providers/Microsoft.Sql/servers/{server-name}/databases/{database-name}?api-version=2014-04-01-preview
HTTP/1.1
Content-Type: application/json; charset=UTF-8

{
    "properties": {
        "requestedServiceObjectiveName": "DW1000"
    }
}

Pause compute
To pause a database, use the Pause Database REST API. The following example pauses a database named
Database02 hosted on a server named Server01. The server is in an Azure resource group named
ResourceGroup1.

POST https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-
name}/providers/Microsoft.Sql/servers/{server-name}/databases/{database-name}/pause?api-version=2014-04-01-
preview HTTP/1.1

Resume compute
To start a database, use the Resume Database REST API. The following example starts a database named
Database02 hosted on a server named Server01. The server is in an Azure resource group named
ResourceGroup1.

POST https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-
name}/providers/Microsoft.Sql/servers/{server-name}/databases/{database-name}/resume?api-version=2014-04-01-
preview HTTP/1.1

Check database state


NOTE
Currently Check database state might return ONLINE while the database is completing the online workflow, resulting in
connection errors. You might need to add a 2 to 3 minute delay in your application code if you are using this API call to
trigger connection attempts.
GET https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-
name}/providers/Microsoft.Sql/servers/{server-name}/databases/{database-name}?api-version=2014-04-01 HTTP/1.1

Get maintenance schedule


Check the maintenance schedule that has been set for a data warehouse.

GET https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-
name}/providers/Microsoft.Sql/servers/{server-name}/databases/{database-name}/maintenanceWindows/current?
maintenanceWindowName=current&api-version=2017-10-01-preview HTTP/1.1

Set maintenance schedule


Set or update the maintenance schedule for an existing data warehouse.

PUT https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-
name}/providers/Microsoft.Sql/servers/{server-name}/databases/{database-name}/maintenanceWindows/current?
maintenanceWindowName=current&api-version=2017-10-01-preview HTTP/1.1

{
    "properties": {
        "timeRanges": [
            {
                "dayOfWeek": "Saturday",
                "startTime": "00:00",
                "duration": "08:00"
            },
            {
                "dayOfWeek": "Wednesday",
                "startTime": "00:00",
                "duration": "08:00"
            }
        ]
    }
}

Next steps
For more information, see Manage compute.
How to create a support ticket for SQL Data
Warehouse
3/15/2019 • 2 minutes to read • Edit Online

If you are having any issues with your SQL Data Warehouse, create a support ticket so the engineering support
team can assist you.

Create a support ticket


1. Open the Azure portal.
2. On the Home screen, click the Help + support tab.

3. On the Help + Support blade, click New support request and fill out the Basics blade.
Select your Azure support plan.
Billing, quota, and subscription management support are available at all support levels.
Break-fix support is provided through Developer, Standard, Professional Direct, or Premier
support. Break-fix issues are problems experienced by customers while using Azure where there
is a reasonable expectation that Microsoft caused the problem.
Developer mentoring and advisory services are available at the Professional Direct and
Premier support levels.
If you have a Premier support plan, you can also report SQL Data Warehouse related issues on
the Microsoft Premier online portal. See Azure support plans to learn more about the various
support plans, including scope, response times, pricing, etc. For frequently asked questions about
Azure support, see Azure support FAQs.
4. Fill out the Problem blade.
NOTE
By default, each SQL server (for example, myserver.database.windows.net) has a DTU quota of 45,000. This
quota is simply a safety limit. You can increase your quota by creating a support ticket and selecting Quota as
the request type. To calculate your DTU needs, multiply the total DWU needed by 7.5. For example, to host two
DW6000s on one SQL server, you would need 2 x 6,000 x 7.5 = 90,000 DTUs, so you should request a DTU quota
of 90,000. You can view your current DTU consumption from the SQL server blade in the portal. Both paused and
unpaused databases count toward the DTU quota.

5. Fill out your contact information.

6. Click Create to submit the support request.

Monitor a support ticket


After you have submitted the support request, the Azure support team will contact you. To check your request
status and details, click All support requests on the dashboard.
Other resources
Additionally, you can connect with the SQL Data Warehouse community on Stack Overflow or on the Azure
SQL Data Warehouse MSDN forum.
Azure SQL Data Warehouse - Videos
7/5/2019 • 2 minutes to read • Edit Online

Watch the latest Azure SQL Data Warehouse videos to learn about new capabilities and performance
improvements.
To get started, select the overview video below to learn about the new updates to Azure SQL Data Warehouse.
Also, learn how Modern Data Warehouse patterns can be used to tackle real world scenarios such as cybercrime.

Additional videos describing specific functionalities can be viewed on:


YouTube: Advanced Analytics with Azure
Azure Videos
SQL Data Warehouse business intelligence partners
5/17/2019 • 3 minutes to read • Edit Online

To create your end-to-end data warehouse solution, choose from a wide variety of industry-leading tools. This
article highlights Microsoft partner companies with official business intelligence (BI) solutions supporting Azure
SQL Data Warehouse.

Our business intelligence partners


PARTNER DESCRIPTION WEBSITE/PRODUCT LINK

Birst Product page


Birst connects the entire organization Azure Marketplace
through a network of interwoven
virtualized BI instances on-top of a
shared common analytical fabric

ClearStory Data (Continuous Product page


Business Insights)
ClearStory Data enables fast-cycle
analysis across disparate data stored in
SQL Data Warehouse. ClearStory’s
integrated Spark-based platform and
analytics application speed data access
and harmonization of disparate data
sets. They enable fast, collaborative
exploration that empowers business
users to be self-reliant to gain insights.

Dundas BI Product page


Dundas Data Visualization is a leading, Azure Marketplace
global provider of Business Intelligence
and Data Visualization software.
Dundas dashboards, reporting, and
visual data analytics provide seamless
integration into business applications,
enabling better decisions and faster
insights.

IBM Cognos Analytics Product page


Cognos Analytics includes smart self-
service capabilities that make it simple,
clear and easy to use whether you are
an experienced business analyst
examining the kinks in a vast supply
chain or a marketer optimizing a single
campaign. Cognos Analytics uses AI
and other intelligent capabilities to do
the heavy lifting to guide data
exploration, and make it easier for users
to get the answers they need

Information Builders (WebFOCUS) Product page


WebFOCUS business intelligence helps
companies use data more strategically
across and beyond the enterprise. It
allows users and administrators to
rapidly create dashboards that combine
content from multiple data sources and
formats and provides robust security
and comprehensive governance that
enables seamless, secure delivery and sharing of any BI and analytics content.

Jinfonet JReport Product page


JReport is an embeddable BI solution
for the enterprise. The solution
empowers users to create reports,
dashboards, and data analysis on cloud,
big data, and transactional data
sources. By visualizing data, you can
perform your own reporting and data
discovery for agile, on-the-fly decision
making.

Logi Analytics Product page


Together, Logi Analytics and Azure SQL
Data Warehouse enables your
organization to collect, analyze, and
immediately act on the largest and
most diverse data sets in the world.

Looker BI Product page


Looker gives everyone in your company Azure Marketplace
the ability to explore and understand
the data that drives your business.
Looker also gives the data analyst a
flexible and reusable modeling layer to
control and curate that data.
Companies have fundamentally
transformed their culture using Looker
as the catalyst.

MicroStrategy Product page


The MicroStrategy platform offers a Azure Marketplace
complete set of business intelligence
and analytics capabilities that enable
organizations to get value from their
business data. MicroStrategy's powerful
analytical engine, comprehensive
toolsets, variety of data connectors, and
scalable, open architecture ensure you
have everything you need to extend
access to analytics across every team
and business function

Qlik Sense Enterprise Product page


Drive insight discovery with the data
visualization app that anyone can use.
With Qlik Sense, everyone in your
organization can easily create flexible,
interactive visualizations and make
meaningful decisions.

SiSense Product page


SiSense is a full-stack Business
Intelligence software that comes with
tools that a business needs to analyze
and visualize data: a high-performance
analytical database, the ability to join
multiple sources, simple data extraction
(ETL), and web-based data visualization.
Start to analyze and visualize large data
sets with SiSense BI and Analytics today.

Tableau Product page


Tableau’s self-service analytics help Azure Marketplace
anyone see and understand their data,
across many kinds of data from flat files
to databases. Tableau has a native,
optimized connector to Microsoft Azure
SQL Data Warehouse that supports
both live data and in-memory analytics.

Targit (Decision Suite) Product page


Targit Decision Suite provides BI and Azure Marketplace
Analytics platform that delivers real-
time dashboards, self-service analytics,
user-friendly reporting, stunning mobile
capabilities, and simple data-discovery
technology in a single, cohesive
solution. Targit gives companies the
courage to act.

Yellowfin Product page


Yellowfin is a top rated Cloud BI vendor Azure Marketplace
for ad hoc Reporting and Dashboards
by BARC; The BI Survey. Connect to
Azure SQL Data Warehouse, then
create and share beautiful reports and
dashboards with award winning
collaborative BI and location intelligence
features.

Next Steps
To learn more about some of our other partners, see Data Integration partners and Data Management partners.
SQL Data Warehouse data integration partners
5/17/2019 • 3 minutes to read • Edit Online

To create your data warehouse solution, choose from a wide variety of industry-leading tools. This article
highlights Microsoft partner companies with official data integration solutions supporting Azure SQL Data
Warehouse.

Data integration partners


PARTNER DESCRIPTION WEBSITE/PRODUCT LINK

Alooma Product page


Alooma is a ETL solution that enables
data teams to integrate, enrich, and
stream data from various data silos to
SQL Data Warehouse all in real time.

Alteryx Product page


Alteryx Designer provides a repeatable Azure Marketplace
workflow for self-service data analytics
that leads to deeper insights in hours,
not the weeks typical of traditional
approaches! Alteryx Designer helps
data analysts by combining data
preparation, data blending, and
analytics – predictive, statistical, and
spatial – using the same intuitive user
interface.

Attunity (CloudBeam) Product page


Attunity CloudBeam provides an Azure Marketplace
automated solution for loading data
into SQL Data Warehouse. It simplifies
batch loading and incremental
replication of data from many sources -
SQL Server, Oracle, DB2, Sybase,
MySQL, and more.

Denodo Product page


Denodo provide real-time access to Azure Marketplace
data across an organization's diverse
data sources. It uses data virtualization
to bridge data across many sources
without replication. It offers broad
access to structured and unstructured
data residing in enterprise, big data,
and cloud sources in both batch and
real time.

Fivetran Product page


Fivetran helps you centralize data from
disparate sources. It features a zero
maintenance, zero configuration data
pipeline product with a growing list of
built-in connectors to all the popular
data sources. Setup takes five minutes
after authenticating to data sources
and target data warehouse.

1.Informatica Cloud Services for Informatica Cloud services for Azure


Azure Product page
Informatica Cloud offers a best-in-class Azure Marketplace
solution for self-service data migration,
integration, and management Informatica PowerCenter
capabilities. Customers can quickly and Product page
reliably import, and export petabytes of Azure Marketplace
data to Azure from a variety of sources.
Informatica Cloud Services for Azure
provides native, high volume, high-
performance connectivity to Azure SQL
Data Warehouse, SQL Database, Blob
Storage, Data Lake Store, and Azure
Cosmos DB.

2.Informatica PowerCenter
PowerCenter is a metadata-driven data
integration platform that jumpstarts
and accelerates data integration
projects in order to deliver data to the
business more quickly than manual
hand coding. It serves as the
foundation for your data integration
investments

Information Builders (Omni-Gen Product page


Data Management) Azure Marketplace
Information Builder's Omni-Gen data
management platform provides data
integration, data quality, and master
data management solutions. It makes it
easy to access, move, and blend all data
regardless of format, location, volume,
or latency.

Qubole Product page


Qubole provides a cloud-native Azure Marketplace
platform that enables users to conduct
ETL, analytics, and AI/ML workloads. It
supports a variety of open-source
engines - Apache Spark, TensorFlow,
Presto, Airflow, Hadoop, Hive, and
more. It provides easy-to-use end-user
tools for data processing from SQL
query tools, to notebooks, and
dashboards that leverage powerful
open-source engines.

Segment Product page


Segment is a data management and
analytics solutions that helps you make
sense of customer data coming from
various sources. It allows you to
connect your data to over 200 tools to
create better decisions, products, and
experiences. Segment will transform
and load multiple data sources into
your warehouse for you using its built-
in data connectors

Skyvia (data integration) Product page


Skyvia data integration provides a
wizard that automates data imports.
This wizard allows you to migrate data
between a variety of sources - CRMs,
application database, CSV files, and
more.

SnapLogic Product page


The SnapLogic Platform enables Azure Marketplace
customers to quickly transfer data into
and out of Microsoft Azure SQL Data
Warehouse. It offers the ability to
integrate hundreds of applications,
services, and IoT scenarios in one
solution.

Talend Cloud Product page


Talend Cloud is an enterprise data Azure Marketplace
integration platform to connect, access,
and transform any data across the
cloud or on-premises. It is an
integration platform-as-a-service
(iPaaS) offering that provides broad
connectivity, built-in data quality, and
native support for the latest big data
and cloud technologies

Trifacta Wrangler Product page


Trifacta helps individuals and Azure Marketplace
organizations explore, and join together
diverse data for analysis. Trifacta
Wrangler is designed to handle data
wrangling workloads that need to
support data at scale and a large
number of end users.

Wherescape RED Product page


WhereScape RED is an IDE that Azure Marketplace
provides teams with automation tools
to streamline ETL workflows. The IDE
provides best practice, optimized native
code for popular data targets. Use
WhereScape RED to cut the time to
develop, deploy, and operate your data
infrastructure.
Next steps
To learn more about other partners, see Business Intelligence partners and Data Management partners.
SQL Data Warehouse data management partners
5/17/2019 • 2 minutes to read • Edit Online

To create your data warehouse solution, choose from a wide variety of industry-leading tools. This article
highlights Microsoft partner companies with data management tools and solutions supporting Azure SQL Data
Warehouse.

Data management partners


PARTNER DESCRIPTION WEBSITE/PRODUCT LINK

Coffing Data Warehousing Product page


Coffing Data Warehousing provides
Nexus Chameleon, a tool with 10 years
of design dedicated to querying
systems. Nexus is available as a query
tool for SQL Data Warehouse. Use
Nexus to query in-house and cloud
computers and join data across
different platforms. Point-Click-Report!

Inbrein MicroERD Product page


Inbrein MicroERD provides the tools
that you need to create a precise data
model, reduce data redundancy,
improve productivity, and observe
standards. By using its UI, which was
developed based on extensive user
experiences, a modeler can work on DB
models easily and conveniently. You can
continuously enjoy new and improved
functions of MicroERD through prompt
functional improvements and updates.

Infolibrarian (Metadata Product page


Management Server) Azure Marketplace
InfoLibrarian catalogs, stores, and
manages metadata to help you solve
key pain points of data management. In
addition, Infolibrarian provides
metadata management, data
governance, and asset management
software solutions for managing and
publishing metadata from a diverse set
of tools and technologies.

RedPoint Data Management Product page


RedPoint Data Management enables Azure Marketplace
marketers to apply all their data to
drive cross-channel customer
engagement while performing
structured and unstructured data
management. By taking advantage of
Azure SQL Data Warehouse and
RedPoint you can maximize the value of
your structured and unstructured data
to deliver the hyper-personalized,
contextual interactions needed to
engage today’s omni-channel customer.
Drag-and-drop interface makes
designing and executing data
management processes easy.

SentryOne (DW Sentry) Product page


With the intelligent data movement Azure Marketplace
dashboard and event calendar, you
always know exactly what is impacting
your workload. Designed to give you
visibility into your queries and jobs
running to load, backup, or restore your
data, you never worry about making
the most of your Azure resources.

Next Steps
To learn more about other partners, see Business Intelligence partners and Data Integration partners.
