De|G PRESS, the startup born out of one of the world's most venerable publishers, De Gruyter, promises to bring you unbiased, valuable, and meticulously edited works on important topics in the fields of business, information technology, computing, engineering, and mathematics. By selecting the finest authors to present, without bias, the information professionals need on their chosen topics, in the depth you would hope for, we aim to satisfy your needs and earn our five-star ranking.
In keeping with these principles, the books you read from De|G PRESS will be practical and efficient and, if we have done our job right, will yield many returns on their price.
We invite businesses to order our books in bulk in print or electronic form as a cost-effective way to meet the learning needs of your organization, or parts of your organization.
There is no better way to learn about a subject in depth than from a book that
is efficient, clear, well organized, and information rich. A great book can provide
life-changing knowledge. We hope that with De|G PRESS books you will find that
to be the case.
DOI 10.1515/9781547401277-202
Acknowledgments
Thanks to my editor, Jeff Pepper, who worked with me to come up with this quick
start approach, and to Triston Arisawa for jumping in to verify the accuracy of the
numerous exercises that are presented throughout this book.
DOI 10.1515/9781547401277-203
About the Author
DOI 10.1515/9781547401277-204
Contents
Chapter 1: Copying Data to Azure SQL Using Azure Data Factory 1
Creating a Local SQL Instance 1
Creating an Azure SQL Database 3
Building a Basic Azure Data Factory Pipeline 10
Monitoring and Alerts 28
Summary 31
Chapter 3: Azure Data Warehouse and Data Integration using ADF or External
Tables 65
Creating an ADW Instance 66
ADW Performance and Pricing 67
Connecting to Your Data Warehouse 69
Modeling a Very Simple Data Warehouse 72
Load Data Using ADF 77
Using External Tables 89
Summary 97
Index 99
Introduction
A challenge was presented to me: distill the essence of Azure Data Factory (ADF), Azure Data Lake Server (ADLS), and Azure Data Warehouse (ADW) into a short, fast quick start guide. There's a tremendous amount of territory to cover when it comes to diving into these technologies! What I hoped to accomplish in this book is the following:
1. Lay out the steps that will set up each environment and perform basic development within it.
2. Show how to move data between the various environments and components, including local SQL and Azure instances.
3. Demystify some of the elusive aspects of the various features (e.g., check out the overview of External Tables at the end of Chapter 3!).
4. Save you time!
I guarantee that this book will help you fully understand how to set up an ADF pipeline integration with multiple sources and destinations, and that it will require very little of your time. You'll know how to create an ADLS instance and move data in
a variety of formats into it. You’ll be able to build a data warehouse that can be
populated with an ADF process or by using external tables. And you’ll have a fair
understanding of permissions, monitoring the various environments, and doing
development across components.
Dive in. Have fun. There is a ton of value packed into this little book!
DOI 10.1515/9781547401277-206
Chapter 1
Copying Data to Azure SQL Using
Azure Data Factory
In this chapter we’ll build out several components to illustrate how data can be
copied between data sources using Azure Data Factory (ADF). The easiest way to
illustrate this is by using a simple on-premises SQL instance with data that will be copied to a cloud-based Azure SQL instance. We'll create an ADF pipeline
that uses an integration runtime component to acquire the connection with the
local SQL database. A simple map will be created within the pipeline to show how
the columns in the local table map to the Azure table. The full flow of this model
is shown in Figure 1.1.
To build this simple architecture, a local SQL instance will need to be in place.
We’ll create a single table called Customers. By putting a few records into the
table, we can then use it as the base to load data into the new Azure SQL instance
you create in the next section of this chapter. The table script and the script to
load records into this table are shown in Listing 1.1. A screenshot of the local SQL
Server instance is shown in Figure 1.2.
DOI 10.1515/9781547401277-001
Listing 1.1: The Local SQL Customers Table with Data
,N'Jim'
,CAST(N'1980-10-01' AS Date)
,CAST(N'2017-09-01T12:01:04.000' AS DateTime)
,CAST(N'2018-04-01T11:31:45.000' AS DateTime))
GO
INSERT [dbo].[Customers] ([CustomerID]
,[LastName]
,[FirstName]
,[Birthday]
,[CreatedOn]
,[ModifiedOn])
VALUES (N'CUST002'
,N'Smith'
,N'Jen'
,CAST(N'1978-03-04' AS Date)
,CAST(N'2018-01-12T01:34:12.000' AS DateTime)
,CAST(N'2018-01-12T01:45:12.000' AS DateTime))
GO
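The listing above begins mid-statement. A minimal sketch of the full script is shown below; the table definition, the CustomerID value N'CUST001', and the last name N'Jones' for the first record are assumptions that do not appear in the original listing.

```sql
-- Hypothetical reconstruction: column types, 'CUST001', and 'Jones'
-- are assumptions inferred from the visible portion of Listing 1.1.
CREATE TABLE [dbo].[Customers] (
     [CustomerID] NVARCHAR(50)  NOT NULL PRIMARY KEY
    ,[LastName]   NVARCHAR(100) NOT NULL
    ,[FirstName]  NVARCHAR(100) NOT NULL
    ,[Birthday]   DATE          NULL
    ,[CreatedOn]  DATETIME      NULL
    ,[ModifiedOn] DATETIME      NULL
)
GO
INSERT [dbo].[Customers] ([CustomerID]
    ,[LastName]
    ,[FirstName]
    ,[Birthday]
    ,[CreatedOn]
    ,[ModifiedOn])
VALUES (N'CUST001'
    ,N'Jones'
    ,N'Jim'
    ,CAST(N'1980-10-01' AS Date)
    ,CAST(N'2017-09-01T12:01:04.000' AS DateTime)
    ,CAST(N'2018-04-01T11:31:45.000' AS DateTime))
GO
```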
Now we’ll create an Azure SQL database. It will contain a single table that will
be called Contacts. To begin with, this Contacts table will contain its own record,
separate from data in any other location, but will eventually be populated with
the copied data from the local SQL database. To create the Azure database, you'll need to log into portal.azure.com. Once you've successfully logged in, you'll see a list of available actions in the left-hand navigation toolbar. To create the database, click on the SQL databases menu item and then click the Add button, as shown in Figure 1.3.
You can now enter the information that pertains to your new database. You can
get more information on these properties by clicking the icon next to each label.
Some additional details on several of these properties are noted as follows:
1. Database name—the database name will be referenced in a variety of loca-
tions, so name it just like you would a local database (in this case, we’ll refer
to it as InotekDemo).
2. Subscription—you’ll have several potential options here, based on what you
have purchased. Figure 1.3 shows Visual Studio Enterprise, as that is the
MSDN subscription that is available. Your options will look different depend-
ing on licensing.
3. Select source—go with a new blank database for this exercise, but you could
base it on an existing template or backup if there was one available that
matched your needs.
4. Server—this is the name of the database server you will connect to and where
your new database will live. You can use the default or you can create your
own (see Figure 1.4). A database server will allow you to separate your data-
bases and business functions from one another. This server will be called
“Demoserverinotek” with serveradmin as the login name.
Figure 1.4: Configuring the new Azure SQL Server where the database will reside
When you’re ready, click the Create a new server button and the deployment
process in Azure will begin. You’ll see a notification on your toolbar (see Figure
1.5) that shows the status of this deployment. After a minute or two your database
deployment will be completed and you’ll be able to click on the new database and
see information about it.
There are several ways to connect to this new server. You can use the Azure tools or you can connect from a local SQL tool such as SQL Server Management Studio (SSMS). Using SSMS requires that you enter information about the SQL Server you are trying to connect to. To connect to your Azure server, click on the Connection strings property of your database in Azure. You'll want to grab the server name from here and enter it into your server connection window (shown in Figure 1.6).
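The ADO.NET string shown on that page follows the standard Azure SQL pattern. A sketch is shown below, using the Demoserverinotek server and InotekDemo database names from this chapter with placeholder credentials:

```text
Server=tcp:demoserverinotek.database.windows.net,1433;
Initial Catalog=InotekDemo;User ID=serveradmin;Password={your_password};
Encrypt=True;
```

Only the server name portion (demoserverinotek.database.windows.net) is needed in the connection window itself.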
Figure 1.6: Connecting to the new Azure SQL Server from a local SSMS connection window
Next, you'll type in the login and password and click Connect. If you haven't connected to an Azure SQL instance before, you will be asked to log into Azure. If this occurs, click the Sign in button and enter the credentials that you used to connect to the Azure portal (see Figure 1.7). Once authenticated, you'll be able to select whether to add the specific IP you're on or the full subnet. You'll be required to do this each time you connect to your SQL instance from a new IP.
Figure 1.7: The first time you connect from SSMS on a new computer, you will be required to enter this information
With the credentials entered accurately, SSMS will connect to your new Azure SQL instance and you'll be able to create artifacts just like you would with a local SQL instance. For this exercise, we'll create a table called Contact in the InotekDemo database, where we'll eventually upload data from the local SQL instance. The table looks like that shown in Figure 1.8, with the SQL script shown in Listing 1.2.
Listing 1.2: A new table created on the Azure SQL Server database
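Since the mapping later in this chapter is one to one with the local Customers table, a plausible sketch of the Contact table follows; the column names and types here are assumptions rather than the book's exact script.

```sql
-- Assumed layout: mirrors the local Customers table one to one,
-- with ID standing in for CustomerID.
CREATE TABLE [dbo].[Contact] (
     [ID]         NVARCHAR(50)  NOT NULL PRIMARY KEY
    ,[LastName]   NVARCHAR(100) NOT NULL
    ,[FirstName]  NVARCHAR(100) NOT NULL
    ,[Birthday]   DATE          NULL
    ,[CreatedOn]  DATETIME      NULL
    ,[ModifiedOn] DATETIME      NULL
)
GO
```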
In addition to SSMS, you can also use the Query Editor tool
that’s available in the Azure web interface. To illustrate how this tool works, we’ll
use it to insert a record into the table. To do this, click on the Query editor link in
the Azure navigation bar (see Figure 1.9). A window will open where you’ll be able
to see your objects and type in standard SQL commands.
Figure 1.9: Using the Query Editor tool available in the Azure portal
You’ll have basic access to write queries and view SQL objects. To insert a record,
use a standard insert script like that shown in Figure 1.10 and click the Run
button. You’ll see information about your SQL transaction on the two available
tabs, Results and Messages. You can save your query or open a new one. Both of
these actions use your local file path and are not actions that take place within
Azure itself. You can also use the INSERT script shown in Listing 1.3.
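An insert script along these lines would do the job; the values below are illustrative placeholders, not the record actually used in the book.

```sql
-- Illustrative values only; assumes the Contact layout described above.
INSERT [dbo].[Contact] ([ID]
    ,[LastName]
    ,[FirstName]
    ,[Birthday]
    ,[CreatedOn]
    ,[ModifiedOn])
VALUES (N'CUST999'
    ,N'Example'
    ,N'Erin'
    ,CAST(N'1985-05-05' AS Date)
    ,GETDATE()
    ,GETDATE())
GO
```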
You can also edit your data in your tables through an editing interface in Azure by
clicking on the Edit Data button. This will open an editable grid version of your
table where you can modify your existing data or create new records. You can do
this by using the Create New Row button, shown in Figure 1.11.
At this point, you have a local SQL database and a hosted Azure SQL database,
both populated with a small amount of data. We’ll now look at how to pull the
data from your local SQL instance into your Azure SQL instance using Azure Data
Factory (ADF). To create a new ADF instance, click on the Create a resource link
on your main Azure navigation toolbar and then select Analytics. Click on the
Data Factory icon in the right-hand list, as shown in Figure 1.12.
In the configuration screen that opens, you’ll be required to enter a name for your
new ADF process. You’ll also need to select several other properties, one of which
is the Resource Group. For ease of reference and organization, we’ll put this ADF
component in the same Resource Group that was used for the SQL Azure server
instance created earlier in this chapter. Figure 1.13 shows the base configuration
for the data factory. Click the Create button once the configuration has been com-
pleted.
The creation of the ADF will take a few moments, but eventually you’ll see a noti-
fication that your deployment has been completed. You can see this on the noti-
fication bar, where you can also click the Go to resource button (see Figure 1.14).
You can also access this via the All resources option on the main Azure portal
navigation toolbar.
Figure 1.14: Notification indicating that the data factory has been deployed
Clicking this button will take you to an overview screen of your new ADF. You can
also access this overview screen by clicking All resources from the main Azure
navigation toolbar and clicking on the name of the ADF that was just created.
You’ll see a button on this overview screen for Author & Monitor (see Figure 1.15).
Click it to open a new window where the actual ADF development will take place.
One item to note—some functions in Azure will not work in Internet Explorer.
Only Chrome and Edge are officially supported. If you run into functionality
issues, try another browser!
Figure 1.15: To begin development of an ADF, click on the Author & Monitor button
For development within the ADF web framework, there are several options that
will be available for you to choose from (see Figure 1.16). We’ll look at using the
Create pipeline functionality and SSIS Integration Runtime later in this book
(both of which allow for development of custom multistep processes), but for
now we’ll look at Copy Data, which is a templated pipeline that lets you quickly
configure a process to copy data from one location to another. We’ll use this to
copy data from the local SQL instance to the Azure instance.
Note that in cases where it is necessary to move data from a local SQL instance into an Azure instance, we can use any of these approaches (a pipeline, the Copy Data tool, or the SSIS integration runtime), as well as others that we won't detail here (replication, external integration platforms, local database jobs, etc.).
Figure 1.16: Use the Copy Data button for development within the ADF
To use the Copy Data component, click on the Copy Data icon on the ADF screen.
You will see a new area open where you can configure the copy process. Underly-
ing this web interface is a standard ADF pipeline, but instead of using the pipe-
line development interface, you can use a more workflow-oriented view to con-
figure things.
You’ll begin by naming the task and stating whether it will run once or
whether it will run on a recurring schedule. If you select the recurring schedule
option, you’ll see several items that will allow you to define your preferences.
Note that one of the options is a Tumbling Window, which basically means that
the process will complete before the timer starts again so that two processes of
the same type don’t overlap one another. A standard scheduler will run at a given
time, regardless of whether the previous instance has completed or not. As you
can see in Figure 1.17, the schedule options are typical.
The next step in the process, after the base properties have been defined, is to
specify the source of the data that will be copied. There are many options and
many existing connectors that allow for rapid connection to a variety of data
sources. For this exercise we’ll point to the local SQL database with the customer
data. Click on the Database tab, click the Create new connection button, and
select SQL Server from the list of connectors (see Figure 1.18).
When the SQL Server icon is selected, you’ll have an extensive set of options to
determine how to connect to your source database. First, you’ll need to create an
integration runtime to connect to a local SQL instance. The integration runtime
will allow you to install an executable on the local machine that will enable ADF
to communicate with it. Once the integration runtime has been created, you’ll be
able to see your local SQL Server and configure your ADF pipeline to point at the
source table you are after. Here are the steps to set up this source:
1. On the New Linked Service window (Figure 1.19), set the name of your service
and then click the +New link in the Connect via integration runtime drop-
down.
2. On the next screen that opens, select the Self-Hosted button and click Next.
3. Next, give the integration runtime a descriptive name and click Next.
4. There will now be a screen that has several options—you can use the Express
setup or the Manual setup. As you can see in Figure 1.20, the installation of
the integration runtime on your local machine (where the local instance of
SQL Server exists) is secured through two encrypted authentication keys. If
you use the Manual setup, you’ll need to reference these keys. The Express
setup will reference them for you. For this exercise, click the Express setup
link.
5. When you click on the Express setup link, an executable will download, and you'll need to run it. This will take some time to install. A progress indicator will appear on the screen while installation is taking place (see Figure 1.21).
Figure 1.21: The installation progress of the local executable for the integration runtime
6. When the installation has completed, you’ll be able to open the Integration
Runtime Configuration Manager on your local computer. To validate that this
configuration tool is working, click on the Diagnostics tab and test a connec-
tion to your local database that you plan to connect to from Azure. Figure 1.22
shows a validated test (note the checkmark next to the Test button). You can
connect to virtually any type of local database, not just SQL Server (just select
the type of connection you need to test from the dropdown).
Figure 1.22: Testing the connection from the newly installed local integration runtime
7. Finally, back in the Azure portal and the Integration Runtime Setup screen,
click the Finish button.
When you have finished creating and installing the Integration Runtime compo-
nent, you’ll find yourself back on the New Linked Service window. You’ll need
to specify the server name, database name, and connection information. If you
tested your connection in the locally installed Integration Runtime Configuration
Manager as shown in the steps above, you can reuse the same information here.
Figure 1.23 shows the connection to a local database instance configured in the
ADF screen that has been tested and validated.
Figure 1.23: Testing connectivity to the local database from the Azure portal
Click the Finish button. The source data store will now be created. You’ll now be
able to proceed to the next step of the process of setting up the Copy Data pipe-
line. Do this by clicking the Next button on the main screen, which will pop up a
new window where you can indicate what table set or query you will be using to
extract data from the local database. For this exercise, there is only a single table
available to select, which is called Customers. Click this table and a preview of the
current data will be shown (see Figure 1.24).
Figure 1.24: Selecting the source table(s) where data will be copied from
Click Next and you will see a screen where filtering can be applied. We’ll leave
the default setting as no filtering. Clicking Next on the filter page will take you
to the Destination stage of configuring the Copy Data pipeline. For the current
solution, the destination is going to be the SQL Azure database that was created
earlier in this chapter. The setup of the destination is like the setup of the source,
except in this case no additional integration runtime will need to be set up or
configured, since the destination is an Azure database. Click the Azure tab and
press the Create new connection button. Select the Azure SQL Database option,
as shown in Figure 1.25.
Click the Continue button and a new screen will appear within which you can
configure the connection to the Azure database. Here, you will set the name of the
connection and then select the default AutoResolveIntegrationRuntime option
for the integration runtime component. This integration runtime will allow the
pipeline to connect to any Azure SQL servers within the current framework. Select
the Azure subscription that you used to create your Azure SQL Server earlier in
this chapter and then select the server and database from the dropdowns that will
auto-populate based on what is available. Enter the appropriate credentials and
then test the connection. Figure 1.26 shows the configuration for the Azure SQL
connection.
Continue with the configuration process by clicking Next, which will allow you
to define the targeted table in the Azure SQL database where the source data will
be copied to. There is only a single table that we’ve created in the Azure database.
This table will show in the dropdown as Contact. Select this table and click Next.
A screen will open where the table mapping between the source and the destina-
tion takes place. We’re working with two simple tables, so the mapping is one to
one. Figure 1.28 shows the completed mapping for the tables that are being used
in this exercise.
In cases where your source tables don’t match cleanly with the target, you’ll
need to write a process on your local SQL instance that will load the data into a
staging table that will allow for ease of mapping. You can always write custom
mapping logic in a custom Azure pipeline, but leaving the logic at the database
level, when possible, will generally ease your development efforts.
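As a sketch of that staging approach on the local instance, the script below reshapes the source rows so the ADF mapping stays one to one; the staging table name, its columns, and the transformation are all hypothetical.

```sql
-- Hypothetical staging table shaped to match an Azure target
-- whose columns do not line up with the source.
CREATE TABLE [dbo].[Customers_Staging] (
     [ID]       NVARCHAR(50)  NOT NULL
    ,[FullName] NVARCHAR(200) NOT NULL
    ,[Birthday] DATE          NULL
)
GO
-- Reshape the source rows so the pipeline mapping is one to one.
INSERT [dbo].[Customers_Staging] ([ID], [FullName], [Birthday])
SELECT [CustomerID]
      ,[FirstName] + N' ' + [LastName]
      ,[Birthday]
FROM [dbo].[Customers]
GO
```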
Figure 1.28: Perform the mapping of the source columns to the target columns
Click Next to continue. You’ll see options to set fault tolerance, which we’ll leave
defaulted for now. For large datasets, you may want to simply skip rows that
have errors. You can define your preference here and then click Next. You’ll see
a summary of all the configurations that have been done. Review this summary
(shown in Figure 1.29), as the next step you’ll take is to deploy the Copy Data
pipeline. Click Next when you have reviewed the summary. The deployment of
the completed pipeline will now take place.
With the pipeline deployed, you’ll be able to test it. There are several ways to navi-
gate around Azure, but for this exercise click the Pencil icon on the left side of the
screen. This will show the pipeline you just developed (which has a single step
of Copy Data) and the two datasets configured. You’ll want to test your process.
This can be done most easily by clicking the Debug button in the upper toolbar of
your pipeline. To see this button, you’ll need to click on the pipeline name in the
left-hand toolbar. Figure 1.30 shows the various tabs that need to be clicked to be
able to press the Debug button.
When the Debug button is clicked, the process will begin to run. It takes a second
for it to spin up, and you can monitor the progress of the run in the lower output
window. The process will instantiate and begin to execute. When it completes,
it will either show that it has succeeded or that it failed. In either case, you can click the eyeglasses icon to see details about what happened during the run (see Figure 1.31).
Figure 1.31: The process has completed successfully; click the eyeglasses icon to see details
By clicking the Details button, you’ll see the number of records that were read,
the number that were successfully written, and details about rows that failed. As
shown in Figure 1.32, the two records that were in the source database were suc-
cessfully copied to the target. You can verify that the data is truly in the targeted
table by running a select query against the Azure SQL table and reviewing the
records that were loaded.
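That verification query can be as simple as the following, assuming the Contact table described earlier in this chapter:

```sql
-- Run against the Azure SQL database after the pipeline completes.
SELECT COUNT(*) AS CopiedRows FROM [dbo].[Contact];
SELECT * FROM [dbo].[Contact] ORDER BY [ModifiedOn] DESC;
```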
Figure 1.32: Summary showing that the data was successfully copied from the source to the
target
The pipeline that was just created was set up to run on a scheduled basis, which
means the data will continue to copy to the destination table every hour. To see
the status, click on the gauge (Monitor) icon on the left-hand side of the screen.
This will show you a history of the pipelines that have run, along with several
options for sorting and viewing metrics. Figure 1.33 shows the current pipeline's audit history.
From this same monitoring area, you can see the health of your integration run-
times by clicking on the Integration Runtimes tab at the top of the screen. You
can also click on Alerts and Metrics, both of which will open new tabs within
your browser.
When you click Monitoring, you will land on the main Azure portal monitor-
ing page where there are endless options for seeing what is going on within your
various Azure components. Some of the key areas to note here are the Activity Log
(click here to see all activity across your Azure solution) and Metrics (build your
own custom metrics to be able to see details across your deployed components).
But for monitoring your ADF solution itself, you’ll most likely want to remain
within the Monitor branch of the ADF screens.
When you click Alerts, you'll find that you can set up a variety of rules that will allow you to monitor your databases. These rules are primarily intended for administrative monitoring, but you'll see that there are dozens of alert types that can be configured, and you'll want to become familiar with what is in here to decide if you're interested in setting up notifications. Just like the main Monitoring page, these alerts apply more to the generic Azure portal than they do specifically to ADF, but you can certainly monitor events related to your ADF environment. You can also build notifications directly into your ADF pipelines as you build them out.
To set up an alert from the main Alert page, click on +New alert rule on the
toolbar, as shown in Figure 1.34.
A new screen will open where a rule can be created. Three steps need to be taken
to create a rule: define the alert condition, name the alert (which ultimately
gets included in the notification that is sent), and define who gets notified. An
example of a configured alert rule is shown in Figure 1.35. You can set up notifications such as emails, SMS texts, and voice alerts, along with a variety of others. The number of configuration options here is extensive.
Summary
A lot of territory has been covered in this chapter. We've looked at creating an ADF pipeline that uses an integration runtime to connect to a local SQL Server instance and at copying data from a table on that local instance to a table in an Azure SQL database. We've looked at the options for configuring and working with each of these components, as well as how to create basic monitoring and alerting.
In the next chapter, we’ll expand into deeper customization and development of
an ADF pipeline and look at how additional data sources can be brought into the
mix by working with an Azure Data Lake Server.
Chapter 2
Azure Data Lake Server and ADF Integration
Having explored the basic approach to copying data to an Azure SQL instance
using an ADF pipeline in the previous chapter, we’ll now expand our data storage
options by working with Azure Data Lake Server (ADLS) storage. With ADLS, just
about any type of data can be uploaded, such as spreadsheets, CSV files, data-
base dumps, and other formats. In this chapter, we’ll look at uploading CSV and
database extracts containing simple sales information to ADLS. This sales infor-
mation will be related to the customer data that was used in the previous chapter.
For example, the CSV file will contain the customer ID that can be related to the
record in the Azure SQL database. This relationship will be made concrete in the
next chapter when we pull the data from Azure SQL and ADLS into an Azure Data
Warehouse.
To move data into ADLS, we’ll build out a pipeline in ADF. In the previous
chapter we used the “Copy Data” pipeline template to move the data into SQL
Azure; we’ll make a copy of this and alter it to push data to ADLS. To create the
copy, we’ll build a code repository in Git and connect it to the ADF development
instance. Once the data has been loaded into the ADLS, we’ll look at using an
Azure Data Lake Analytics instance for basic querying against the data lake. The
flow of the data and components of this chapter are shown in Figure 2.1.
Figure 2.1: The components of this chapter: a Git repository and an integration runtime feeding an ADF pipeline, which loads TXT and CSV files into ADLS for querying with ADLA
DOI 10.1515/9781547401277-002
It’s possible to create an Azure Data Lake Storage resource as part of the ADF
custom pipeline setup, but we’ll create the ADLS resource first and then incor-
porate it into a copy of the pipeline that was created in the previous chapter. To
create this ADLS instance, click on +Create a resource in the main Azure portal
navigation and then click on Storage. Next, click on the Data Lake Storage Gen 1
option under the featured components (see Figure 2.2).
When you have selected the Data Lake Storage Gen1 resource option, a new screen overlay will open where you can configure the ADLS top-level information. Figure 2.3 shows the configuration used for this exercise. When you are done setting the properties, click Create, which will deploy the ADLS resource. The properties set on this page include:
1. Name—note that you can only name your ADLS resource with lowercase
letters and numbers. The name must be unique across all existing ADLS
instances, so you’ll have to experiment a little!
2. Subscription, Resource Group, and Location—for this example use the same
information that you used to create the Azure components in Chapter 1.
3. Pricing package—pricing on ADLS is inexpensive, but care should be taken when determining what plan you want to utilize. If you click the information circle next to payments, a new window will open where you can calculate your potential usage costs.
4. Encryption settings—by default, encryption is enabled. The security keys are
managed within the ADLS resource. These keys will be used later when the
ADLS resource is referenced from the ADF pipeline that will be created.
When the deployment of the ADLS resource has completed, you'll get a notification (similar to that shown in Figure 2.4). You'll be able to navigate to the new resource by clicking Go to resource.
Figure 2.4: A notification will appear when the resource has been created
Opening the newly created ADLS resource shows several actions that can be taken, as shown in Figure 2.5. The default screen shows usage, costs incurred, and other reporting metrics. The navigation bar to the left shows options for setting up alerts and monitoring usage, similar to what was described at the end of Chapter 1. Most importantly, there's the Data explorer option, which is the heart of configuration and access to data within the data lake itself.
Creating an Azure Data Lake Storage Resource
Clicking the Data explorer button will open your data lake server in a new window.
Your first step will be to create a new folder. This folder will contain the data that
you’ll be uploading to the data lake. For this exercise we’ll call it SalesData. In
Figure 2.6 you can see the navigation frame on the left and the newly created
SalesData folder.
We’ll manually upload a CSV file now. Later in this chapter we’ll use an ADF
pipeline to upload data automatically. Click the SalesData folder to open it. Once
inside the folder, click the Upload button. Figure 2.7 shows the data contained in
the CSV that will be uploaded for this discussion (there’s an image of it in Excel
and Notepad so that you can see it is just a comma separated list of values).
You’ll see three columns in this CSV file. The first column is Customer ID, which
matches the ID column in the Azure SQL Contact table from Chapter 1. The second
column is Sale Date and the third is Sale Amount. There are several sales for each
customer. Of course, you aren’t limited to CSVs in a data lake—you can upload just about anything. A real-world equivalent would be a large export of sales data from an ERP system in flat file format, uploaded hourly or daily to the ADLS instance. We’ll look at this real-world application in more detail in Chapter 3. Figure 2.8 shows the CSV uploaded into the data lake.
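Based on the three columns described above, the contents of the CSV would look something like the following (these rows are illustrative, not the actual values shown in the figures):

```
Customer ID,Sale Date,Sale Amount
1,2018-01-15,150.00
1,2018-02-03,75.50
2,2018-01-22,300.00
2,2018-03-10,42.25
```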
Your uploaded file can now be seen in the main data explorer menu. You can
now click on the context menu of the file. One option is to preview the file, which
lets you see all of the data within the file. In most data lake scenarios, you’ll be
dealing with extremely large files that have a variety of formats, so the preview
functionality is critical as you navigate through your files to find the information
and structure you are after. Figure 2.9 shows the context menu with a preview of
the CSV that was uploaded.
You’ve just created an ADLS instance and uploaded a file manually. This has limited application on its own; in practice you’ll want to automate the process through code. However, before moving on to creating an ADF pipeline and the related components needed for automating the transfer of data into your new ADLS, let’s pause to look at code repositories.
Using Git for Code Repository

With pipelines in your ADF, you’ll likely want to back things up or import pipelines into other instances that you create in the future. To export or import pipelines and related components, you must have a code repository to integrate with. Azure gives you two options: Git and Azure DevOps. We’ll walk through getting Git set up and a repository created so that ADF components can be exported and imported through it.

The first step of this process is to get your Git repository associated with Azure. On the root ADF page, you’ll see an option for Set up Code Repository. Clicking on this will allow you to enter your repository information (see Figure 2.10). If you don’t see this icon, click the repository button in the toolbar at the top of the Author and Monitor window, where pipeline development takes place.

If you don’t already have access to Git, follow these four steps to set up a repository and connect ADF to it:
1. Go to github.com and set up an account (or log into an existing account).
2. Create a new code repository. Once the repository has been set up, you’ll need to initialize it. The repository home page lists the command line steps to initialize your new repository with a readme document. To run these commands, you’ll need to download Git and install it on your local machine. Once it has been installed locally, open a command prompt window and type the commands listed. In the end, you should have a repository set up that contains a single readme document. You may have to experiment a bit with these steps to get them to work. If you don’t want to mess around with command lines (who would?), make sure you select the Initialize this repository with a README option when you create your repository (see Figure 2.11).
3. Once the repository has been set up, you will be able to reference it from
within ADF. Click on the Set up Code Repository button on the main ADF
page.
4. A configuration screen will open within Azure that requests your Git user
account. Entering this will allow you to connect to Git and select the reposi-
tory you’ve just created. Once you’ve connected successfully to the repository,
you’ll see that Git is now integrated into your Authoring area (see Figure 2.12).
Figure 2.12: The repository will show in the ADF screen once added successfully
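The command line initialization mentioned in step 2 typically looks something like the following sketch (the repository name and remote URL are placeholders for your own):

```shell
# Create a local repository containing a single readme and make the first commit.
mkdir my-adf-repo && cd my-adf-repo
git init
echo "# my-adf-repo" > README.md
git add README.md
git commit -m "first commit"

# Then point the local repository at the empty one created on github.com and push.
# (Placeholder URL -- substitute your own account and repository name.)
git remote add origin https://github.com/your-account/my-adf-repo.git
git push -u origin master
```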
As soon as you have your repository connected, the components you’ve created in Azure will automatically be synchronized with your Git repository. After completing the connection and logging into the Git web interface, you should see something like what is shown in Figure 2.13. Here, the dataset, pipeline, linked service, trigger, and integration runtime folders have all been created, based on the components that have been created to date in the ADF solution. Within each of the folders are the component files. These files could be imported into other ADF instances by copying them into the repository for that ADF solution. Azure and Git synchronize continually once connected, so your data is always backed up and accessible.
Figure 2.13: Once connected, the code in your Azure ADF will automatically sync with Git
To understand how to use the repository, let’s take a quick look at importing a
pipeline from another solution that may have been developed. We’ll assume that
this solution already exists somewhere and we are trying to take a copy of that
pipeline and upload it to the current ADF instance. To do this, follow these steps:
1. Determine which files you want to copy over. In the case of the pipeline
created in Chapter 1, for example, there is a single pipeline file, two dataset
files, an integration runtime, linked services, and other components. You can
see in Figure 2.14 the resources on the left in ADF, along with the pipeline files
in the Git folder.
2. You can transfer just the core pipeline file or any of the other associated files (e.g., data sources, integration runtimes, etc.) individually. Each will import with its original configuration. If, for example, you import a pipeline that references several data sources but you don’t import the data sources, the pipeline will still open; you’ll just have to set up new data sources to link into it.
3. Every file in ADF exports as a JSON file. You can edit these files before import-
ing if you want. For example, Listing 2.1 shows the CopyLocalDataToAzure
pipeline as implemented in Chapter 1. If you want to alter the name on this,
just edit the JSON file’s name property (there are two of them you’ll need to
set). If you want to alter the mappings, change the columnMappings node
information. For advanced developers, editing the JSON of existing pipelines
and related components can save time and allow you to build out multiple
processes with common logic with relative ease.
Listing 2.1: The CopyLocalDataToAzure pipeline JSON

{
"name": "CopyLocalDataToAzure", <--- NAME CAN BE CHANGED (EXAMPLE)
"properties": {
"activities": [
{
"name": "CopyLocalDataToAzure", <--- WOULD BE CHANGED HERE AS WELL
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [
{
"name": "Source",
"value": "[dbo].[Customers]"
},
{
"name": "Destination",
"value": "[dbo].[Contact]"
}
],
"typeProperties": {
"source": {
"type": "SqlSource"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 10000
},
"enableStaging": false,
"dataIntegrationUnits": 0,
"translator": {
"type": "TabularTranslator",
"columnMappings": "FirstName: First, LastName: Last, Birthday: DOB, ModifiedOn: LastModified" <--- MAPPING CAN BE CHANGED (ANOTHER EXAMPLE)
}
},
"inputs": [
{
"referenceName": "SourceDataset_sl9",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "DestinationDataset_sl9",
"type": "DatasetReference"
}
]
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
4. When you’ve determined what file(s) you want to add to ADF (and made
any edits manually to the JSON), upload them into the Git repository. This
will automatically cause them to display in the ADF web interface under the
appropriate folder.
Knowing how to navigate a code repository within ADF is important. In the next
sections, you’ll build out a pipeline that moves data from a local SQL instance
directly into ADLS. While there are many ways to build out an ADLS implementa-
tion, you’ll likely start by copying pipelines that you’ve created before to perform
similar functions. For example, let’s say you build out a pipeline that has many steps. You now want to build additional processes that copy data from similar sources and load them into the same target. Being able to copy the original, make minor modifications to the JSON, and then reimport the modified file into ADF as a new component allows you to develop pipelines without having to configure every component from scratch.
Note that you can also use the Clone option (see Figure 2.15) from the context menu if your components are all within a single ADF instance. You’ll do all your editing from within the ADF web framework. The repository is a way to back up files and transfer them between ADF instances.

Building a Simple ADF Pipeline to Load ADLS
We’ll now build a simple pipeline to automatically load data into the ADLS that
was created. Piggybacking off the repository that was just created, take a copy of
the original CopyLocalDataToAzure pipeline (from Chapter 1) and download it
to your local machine. Rename the file to CopyLocalDataToADLS and open the
file in Notepad. Edit the name property so it matches the name of the file. Once
that has been completed, upload the file back into your Git repository. Figure 2.16
shows these steps. When the upload has completed, you can refresh your ADF
screen (use the refresh button on the toolbar) and you’ll see the new pipeline in
your pipeline list.
When the new pipeline shows in the ADF navigation, click on it and an edit
window will open. You will make several changes here. Start with the destination
connection (called the Sink) and convert this from the Azure SQL instance to the
new ADLS instance. To do this, click on the Copy Data box in the development
area and then click on the Sink tab in the properties window (see Figure 2.17).
Click the New button and select the Azure Data Lake Storage Gen1 option.
A new dataset will be created, which must then be configured. On the General
tab, set a unique descriptive name (we’ll call it AzureDataLakeStoreMain for this
exercise). Next, click on the Connection tab. Follow the steps below to configure
this tab.
1. Click on the New button next to the Linked Service property dropdown. In the
configuration screen that opens, set the properties to point to the ADLS that
you created earlier in this chapter. Figure 2.18 shows this screen configured.
2. Click the Test connection button to ensure your connection works. Most likely you’ll get a failure at first: for connections to ADLS, you must first grant your ADF application access to read and write the ADLS instance. To do this, copy the Service identity application ID GUID, which can be found in the text underneath the Authentication Type property on the linked service configuration screen.
3. Next, go into your ADLS instance and click on Data explorer. In the Data explorer, click on Access and then click Add. When the GUID is pasted into the Select field, the name of your ADF should automatically appear. Set the permissions as shown in Figure 2.19: you can select Read, Write, or Execute, or any combination that you need, but the two radio buttons should be set as indicated in the figure.
4. Assuming you have the connection worked out with your linked service, click
the Finish button.
5. Now, set the File path property by clicking on the Browse button. We’ll use
the same SalesData folder that we manually uploaded the CSV file to earlier
in this chapter. Select this folder and click Finish. At this stage the file prop-
erty can be hardcoded—we’ll come back and make this dynamic after getting
the core pipeline functional. For this exercise, create a text file that looks like
that shown in Figure 2.20.
6. Give the file a name like “export.txt”. This will allow us to create a mapping to this structure. We’ll create an SQL table shortly that matches this structure (see Listing 2.2). When this file is created, upload it to your Data Lake storage using the Data explorer and place it in the SalesData folder.
Figure 2.20: Create a dummy instance of a file in the format of the data that you will be migrating so that it can be mapped
7. The rest of the properties can be left to their defaults. See Figure 2.21 for the
configuration of the Sink.
8. Remember to click Save when you’re done configuring this new data set!
The sink is now configured! Next, we need to revise the source connection. Originally, this source pointed to the local SQL Server instance and pulled data from the Customers table. Now we’re going to create a new table that will store sales information. All records in this table relate to the Customer table via the ID that is in place, but this is not meant to be a relational data model. The idea is that this table holds a large volume of data that is being published to it from one or more external sources. Think of it as a large data dump. We’re going to take this dump of sales data and load it into the data lake.
To reconfigure the source of your pipeline, click the Source tab on the Copy
Data box and set up a new connection to a new sales table in your local SQL
instance. Listing 2.2 shows the script for this table while Figure 2.22 shows the
source tab in the pipeline configured to point to this, along with a preview of the
data.
Listing 2.2: New Sales table in the local database (with sample data)
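The script itself isn’t reproduced as text here, but based on the CSV structure described earlier in this chapter, the table in Listing 2.2 would look roughly like the following sketch (the table name, column names, and sample values are assumptions):

```sql
-- A sketch of the Sales table; the actual script in the book may differ.
CREATE TABLE dbo.Sales (
    CustomerID INT NOT NULL,     -- matches the ID column of the Contact table
    SaleDate   DATETIME NOT NULL,
    SaleAmount FLOAT NOT NULL
);

-- Sample data: several sales per customer.
INSERT INTO dbo.Sales (CustomerID, SaleDate, SaleAmount)
VALUES (1, '2018-01-15', 150.00),
       (1, '2018-02-03', 75.50),
       (2, '2018-01-22', 300.00);
```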
Figure 2.22: Previewing the data from the source connection to the source SQL table
The revision of the Sink and the Source is all that must be done to complete the
changes in this new pipeline. When it runs, it should copy the data from the SQL
table and place it into a text file in the ADLS instance. Save everything, including
the pipeline, and click on the Debug button. It will begin processing, and should
succeed (if not, you’ll see errors and will have to address them!).
When the process has completed successfully, you’ll see a new export.txt file
copied out to the ADLS instance. It will have overwritten the “dummy” sample file
that you created earlier in this section. If you want to take snapshots of data and
ensure that files are not overwritten each time the process runs, you can modify
the file name expression in the Connection tab. Follow these steps to make the file
name dynamic (you can apply this approach to dynamic scripting in a variety of
locations in pipeline development):
1. Open the AzureDataLakeStoreMain data set that you created to connect to ADLS from your ADF pipeline.
2. Click on the file name in the File path property. A link will appear below it.
Click that link to open a new screen where you can enter dynamic content.
3. In the Add Dynamic Content screen, you’ll have the opportunity to build out
a script that will allow for a dynamic file name. In this case, we’ll create a
concatenation script that sets the file name and appends a date stamp to it.
Figure 2.23 shows this scripted out with only the seconds included in the date stamp, but you can extend the pattern to include year, month, day, hour, and minute.
Figure 2.23: A script for dynamic file naming out of the pipeline sink
4. Once complete, save the pipeline and run it. Assuming the code was typed
in accurately, you’ll see a new file posted to your ADLS directory with the
dynamic file name.
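As a sketch of the kind of expression built in step 3 (the exact script in Figure 2.23 may differ), ADF’s expression language provides the concat, formatDateTime, and utcnow functions, which can be combined like this:

```
@concat('export_', formatDateTime(utcnow(), 'yyyyMMddHHmmss'), '.txt')
```

This produces a file name such as export_20181105143022.txt, so each pipeline run writes a new snapshot instead of overwriting the previous file.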
Currently, you have two separate pipelines: the Copy Data pipeline created in Chapter 1, which copies information from the local SQL instance into Azure SQL, and the second Copy Data pipeline from this chapter, which copies information into
the ADLS instance. You can combine these into a single pipeline if you want. To do this, open one of the pipelines, right-click the Copy Data box, and select Copy. Next, go to the other pipeline and click Paste. Rename the box and then drag the arrow from the left-hand box to the box on the right. Click the Validate button to make sure no errors were introduced, and then click Debug. You now have a single process that will manage all the movement of data (see Figure 2.24).

Figure 2.24: Combining both pipelines into a single pipeline process with two steps

Data Lake Analytics
By using Azure Data Lake Analytics (ADLA), you can query data stored in your
ADLS instance in a (somewhat) similar way to how you would work with a tra-
ditional SQL database. This is a separate resource that’s created in your Azure
Portal and one which lets you build out queries (both read and write) using
U-SQL. U-SQL allows you to query across data sources and data types, combine
them into a single result set, and pull that result set into a single output. Though
there is a learning curve to U-SQL, the ability to query across data types in your
ADLS instance and other sources is of immense value and well worth the ramp-up
on a new language. In this section, to demonstrate how to use an ADLA resource,
we’ll look at querying data from two sources and compiling them into a single
output.
Here we’ll take a quick look at how to set up a new analytics resource and build out some basic queries against the data that we’ve uploaded to the ADLS instance. To begin, click on +Create a resource in the main navigation bar of your Azure portal. Select Analytics and then click on Data Lake Analytics, as shown in Figure 2.25.
In the screen that opens, you’ll be able to configure the base settings for the ADLA resource. Enter a descriptive name (it must be unique across all ADLAs that have been created, not just your own!). You’ll fill out the standard options for the other properties, which you’ve seen elsewhere, and you’ll also have to select the ADLS instance that this ADLA resource will point to. Figure 2.26 shows a configured ADLA resource using the ADLS created earlier in this chapter. Once complete, click the Create button.
The deployment of the new resource will take a few moments. Once deployed,
you’ll see a new notification and will be able to navigate to the new resource. With
the new ADLA resource opened, you can now write a query against your ADLS
instance. Listing 2.3 shows an example of a U-SQL query. The code extracts data
from the CSV files that were uploaded earlier in this chapter, performs a filter on
them, and writes them out to a target file. To code and run this, follow these steps:
1. Open your data analytics resource and click on the +New job button (see
Figure 2.27).
2. In the New job screen, enter the code from Listing 2.3 and click Submit (see
Figure 2.28).
Listing 2.3: First U-SQL Query to pull data from CSV in ADLS directory
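The script in Listing 2.3 isn’t reproduced as text here; a U-SQL query of the kind described (extract the sales CSV, filter it, and write the result to the Sales folder) might look roughly like this sketch, where the file paths, column names, and filter condition are all assumptions:

```
@sales =
    EXTRACT CustomerId string,
            SaleDate   string,
            SaleAmount string
    FROM "/SalesData/SalesExport.csv"
    USING Extractors.Csv();

// The header row is extracted along with the data, so every column is
// treated as a string and the header row is filtered out here.
@filtered =
    SELECT CustomerId, SaleDate, SaleAmount
    FROM @sales
    WHERE CustomerId != "Customer ID";

OUTPUT @filtered
    TO "/Sales/SalesOutput.csv"
    USING Outputters.Csv();
```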
3. If execution of the script succeeds, you’ll see a flow like that shown in Figure
2.29. You’ll see a variety of information that pertains to your query, including
time to query and size of data. This will provide great insight into where your
data lies and how long it will take to get at various information when dealing
with big data and endless files.
4. If the script fails, you’ll see extensive error information in the output window,
which will hopefully (!) give you enough information to fix your U-SQL. To
work with the query and retest until it works as you would expect, click on
the Reuse script button on the top of the screen.
Figure 2.29: Executing the job not only performs the coded actions but also shows a flow with
metrics
5. The output of the file will go to the ADLS instance into a new folder called
Sales (shown in Figure 2.30). The U-SQL script indicates the output folder and
file. If the folder does not exist, the U-SQL will create one.
There are some robust, easy-to-access tools for monitoring your ADLA jobs. Click the Job management tab in the ADLA navigation. You’ll see a history of jobs that have already run, and clicking on a job will give you detailed information about the run. Figure 2.31 shows the history, along with a breakout of one job that had errors. You can click on any of these jobs, whether they succeeded or failed, and rerun, debug, analyze, or perform other actions on them.
You can expand your U-SQL to include querying data from your Azure SQL instance and other sources. For example, in the code from Listing 2.3 above, you could add a second result set that pulls sales information from a sales table in Azure SQL and then include it in the results that are written to the output file. To do this, you would first need to set up a new connection to your Azure SQL instance using U-SQL, Visual Studio, or PowerShell. At that point, you would be able to include queries against it in the U-SQL code.
As you work with your ADLA queries, you’ll come to realize that while there is
immense value in being able to query across entities and data types, there are also
some real challenges and limitations in structuring your queries. For example, in
the exercise above, one of the fields is “Sales Amount,” which is a float. Ideally, we could extract that value directly as a float, but the first row in the CSV file contains column headers, and these are included in the query. So, we must treat everything as a string and work within U-SQL to convert the data to the specific data types.
Querying directly against your data lake will lead to great frustration when
dealing with huge data sets with differing cleanliness of data. The logical thing
to do is extract clean data from your various data sources (Azure SQL, ADLS, and
other locations) and load them into an Azure Data Warehouse, where true analyt-
ics and reporting can take place. We’ll spend the next chapter looking at how to
get your data into a data warehouse.
Summary
You’ve worked through building an Azure Data Lake Server and you’ve populated
it with CSV and TXT data both manually and via an ADF pipeline. You’ve looked
at creating a connection to ADLS from an Azure pipeline and understand what
it takes to set up a Git repository to synchronize code you’ve built in ADF. Addi-
tionally, several patterns for working with larger data sets have been outlined, as
has working with dynamic settings in a pipeline. With the components that have
been built out in chapters 1 and 2, you now have a framework for pushing data to
a data warehouse and building views into that data (which will be covered in the
next chapter!).
Chapter 3
Azure Data Warehouse and Data Integration
using ADF or External Tables
In the first two chapters you’ve worked through building out a variety of components. Each component can stand on its own, and the data can be queried and reported on. However, to perform complex analysis of data from a variety of disparate sources, especially involving historical data, you’ll need to load all of this data into a data warehouse. An Azure Data Warehouse (ADW) allows for storage of data as well as the dimensional modeling of that data using facts and dimensions.
In this chapter we’ll build out a simple data model in an ADW instance and
then use an ADF pipeline to populate it with information from the sources that
have been developed earlier in this book. We’ll also look at building an external
table that pulls data directly into ADW from an ADLS file without the use of an
ADF integration. The basic data flows that will be developed in this chapter are
shown in Figure 3.1.
Figure 3.1: The basic data flows developed in this chapter (data reaching the ADW from ADLS via an ADF integration runtime, an external table, and a stored procedure)
DOI 10.1515/9781547401277-003
Creating an ADW Instance

The creation of an ADW instance is done through the Azure portal by clicking
on the SQL data warehouses option on the main navigation screen. Click the
+Add button and enter information in the configuration screen that opens. This
information is similar to what you would enter for a new Azure SQL instance,
like you created in Chapter 1. Figure 3.2 shows the configuration for the ADW that
we’ll be referring to in this chapter. Note the Performance level property, which
will be explained in more detail in the next section. For now, choose the Gen1:
DW100 option.
Figure 3.2: Configuring the ADW and setting the Performance level property
When done configuring this screen, click the Create button. Azure will work to deploy the new resource, and you’ll get a notification when it has completed. Note that a new ADW takes longer to create than the other resources you have created so far in this book. Once the notification shows that the deployment has succeeded, click the Refresh button in the main toolbar to see the new data warehouse. Figure 3.3 shows the newly created ADW resource.
Your ADW is now available and running, which means you are paying for it by the hour! Pay close attention to the next section so that you don’t get surprised with a large Azure bill. Before you read the next section, click on your new ADW instance and click the Pause button on the toolbar (refer to Figure 3.4). When you’re ready to work with your ADW instance, you can start it back up (the Pause button will turn into a Resume button). Of course, once you pause your database, you won’t be able to connect to it (we’ll look at connecting to it shortly).

Figure 3.4: Make sure to pause your ADW when you aren’t using it, as you’ll be paying for it by the hour!

ADW Performance and Pricing

The performance level of your ADW can be set at the time you create your ADW or by clicking the Scale button on the main ADW overview tab.
Increasing storage and performance has a direct impact on the hourly cost. There are several tiers of pricing spanning two top-level options (Gen1 and Gen2), with a dozen tiers within those options. These tiers are referred to as “DW” followed by a number. For this exercise we are creating a tiny development DW, so we’ll choose the cheapest option, which is DW100. You can rescale this once the DW has been created if you want to expand capacity. Note that your options may be limited based on your region; for example, some regions may have DW400 as the lowest option.
When your ADW instance is being used in a production setting, you may want
to set up a process that automatically turns on and off your ADW instance. For
example, you may decide you don’t need your data warehouse available during
weekend hours, so you could create a script to pause it from 6:00 pm on Friday
to 8:00 am on Monday. Essentially, you’ll create a new automated scheduled task
(usually through a PowerShell script) that can perform the pause and the restart
of the ADW on a timed basis. You can create a script via the Automation script tab
on your main ADW menu. As you can see in Figure 3.6, this is coding intensive
and requires a lot of work and experimentation. We won’t build a script in this
book, but several starting points for this can be found on the web.
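While we won’t build the full scheduled task, the core pause and resume calls with the Az PowerShell module look roughly like this (the resource group, server, and database names are placeholders):

```powershell
# Pause the data warehouse (names below are placeholders).
Suspend-AzSqlDatabase -ResourceGroupName "my-resource-group" `
    -ServerName "my-sql-server" -DatabaseName "my-adw"

# Resume it again when needed.
Resume-AzSqlDatabase -ResourceGroupName "my-resource-group" `
    -ServerName "my-sql-server" -DatabaseName "my-adw"
```

An Azure Automation runbook with linked schedules is the usual way to run these commands on a timer.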
Figure 3.6: Options are available for building a script to enable/pause your ADW instance to reduce billing costs

Connecting to Your Data Warehouse
Now that you understand pricing and how your ADW will cost you by the hour,
we’ll proceed with development! If you paused your ADW instance after creating
it, you’ll need to make sure to start it back up by clicking the Resume button (see
Figure 3.7). Remember to manually pause it again after you’re done working with
it for the day.
With the ADW instance active, you’ll be able to interact with it in several ways. You’ll want to connect to it through a local SQL client tool like SQL Server Management Studio (SSMS). If you added your ADW to the same Azure database server that you created in Chapter 1, you should see the ADW instance appear under the available Databases when you connect to that server through SSMS (see Figure 3.8). If you created your ADW in a different Azure server, then you’ll need to enter your connection information.
If you’re on a computer that doesn’t have SSMS (or something similar) installed,
then you can use a simple query editor within Azure to work with your ADW. Do
this by clicking on the Query editor button on your overview tab and then connect
to your instance (refer to Figure 3.9). Once opened, you’ll have basic SQL func-
tionality and access to your ADW components (tables, views, etc.).
You can also connect to your ADW through Visual Studio. The easiest way to do this is by clicking the Open in Visual Studio option on the main ADW navigation menu. From here, a page with a large button of the same name will open (see Figure 3.10). When you click it, your local instance of Visual Studio should open. If you don’t have Visual Studio installed, you can download it from this page or install it on your own.
Visual Studio should default to the SQL Server Object Explorer view shown in Figure 3.11. If it doesn’t open, click on the View menu option in the toolbar and select it from there. You’ll then be able to navigate through your database components in much the same way as you would from SQL Server Management Studio.
Modeling a Very Simple Data Warehouse

This chapter is going to keep the modeling of a data warehouse simple. The focus and purpose of this book is to show you how to set up Azure based components and move data between them, not to show you best practices for modeling the ADW objects themselves! However, before we begin with the simple table structure that will be used for this exercise, several high-level notes about data modeling need to be explained:
1. Moving data from an external source to a data warehouse can be done using
external tables. In this way, you can define data sources, credentials, and
table structures that allow you to reference external tables just like you
would internal tables. For example, if you wanted to point to a specific ADLS
file in a directory, you could create an external table in your ADW that points
to this location. The use of external tables is essential in large scale data
warehouse modeling. You can create these tables directly from the source
data using a CREATE TABLE AS SELECT. You can populate them using other
processes as well.
2. You’ll likely want to stage your data before loading it into your final tables.
The raw data in the external tables would be “cleansed” and loaded into your
staging tables. From there, you can write processes to load your final tables,
which will be used to build reports and analytics. For this exercise, we’ll
create staging tables, but we’ll skip the external table discussion.
3. You’ll create dimensions and fact tables, which are referred to as star schemas.
From these, you’ll be able to model data as it changes over time. Also, you’ll
be able to do deep analysis that can’t be done with “point in time” data that
is stored in a relational database.
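To make point 1 concrete, an external table over an ADLS file can be defined in T-SQL roughly as follows; the names, location, and schema here are illustrative assumptions, and a real setup also needs a database scoped credential for the ADLS account:

```sql
-- Hypothetical names throughout; shown only to illustrate the shape of
-- an external table definition over ADLS (PolyBase).
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (TYPE = HADOOP, LOCATION = 'adl://myadls.azuredatalakestore.net');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.ExtSales (
    CustomerID INT,
    SaleDate   DATETIME,
    SaleAmount FLOAT
)
WITH (LOCATION = '/SalesData/', DATA_SOURCE = AzureDataLakeStore,
      FILE_FORMAT = CsvFormat);
```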
We’ll create two staging tables, several dimension tables, and a single fact table. Eventually, we’ll create external tables as well. The fully built out flow will look like that shown in Figure 3.12.
Figure 3.12: The various paths we’re looking at to load data into the ADW (from ADLS, ADF, and local SQL)
Let’s begin by creating the staging tables. These tables are almost exact copies
of those you’ve created before. They are where the data from the source tables
(including the Azure SQL instance and the file dump in the ADLS instance) will
be placed. Once the data is placed there, we’ll have an ADF pipeline component,
called a stored procedure, in the ADW to extract the data from these staging
tables. The data will then be loaded into the final fact and dimension tables in a
star schema. The staging tables are shown in Listing 3.1.
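The full Listing 3.1 does not reproduce here. As a hedged sketch, a staging table of this kind is a near-copy of its source table, typically created as a ROUND_ROBIN-distributed heap so that loads are fast; the column names below are assumptions based on the Customers table used in earlier chapters, not the book’s exact listing:

```sql
-- Sketch of a staging table (column names are assumed for illustration).
-- ROUND_ROBIN distribution and a HEAP keep the load path simple and fast.
CREATE TABLE StageCustomers
(
    CustomerID nvarchar(100) NULL,
    FirstName  nvarchar(100) NULL,
    LastName   nvarchar(100) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);
```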
For the final “production” tables in the ADW, we’ll create a simple star schema,
which is based on the sales and customer data records we’ve used in illustra-
tions earlier in this book. We’ll create three “dimension” tables (called Customer,
Product, and Date) and one “fact” table (called Sales). See Figure 3.13 for the star
schema. The scripts for these tables are shown in Listing 3.2 (you can run these
scripts from within SSMS on your local machine while connected to the ADW
instance).
Figure 3.13: The star schema, with FactSale at the center joined to DimCustomer, DimDate, and DimProduct
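Listing 3.2 itself is not reproduced in this extraction. As a rough, hedged sketch of the pattern involved, small dimension tables are often replicated while the fact table is hash-distributed, both with a clustered columnstore index; the column names here are assumptions, not the book’s exact scripts:

```sql
-- Sketch only: column names are assumptions, not the book's Listing 3.2.
-- Small dimensions are replicated to every node; the fact table is
-- hash-distributed on its join key.
CREATE TABLE DimCustomer
(
    CustomerKey int IDENTITY(1,1) NOT NULL,
    CustomerID  nvarchar(100) NULL,
    LastName    nvarchar(100) NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

CREATE TABLE FactSale
(
    CustomerKey   int NULL,
    DateOfSale    datetime NULL,
    ItemPurchased nvarchar(100) NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);
```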
With the creation of these four tables, you now have a data warehouse that can be
populated with data and eventually reported on. In the next section, we’ll look at
putting records into these tables by pulling data from various sources. The final
data model is shown in Figure 3.14.
Load Data Using ADF
There are a variety of ways to get data loaded into your ADW. We’ll continue with
our process of using an ADF pipeline to move data, as it is one of the easiest and
most versatile ways available. You’ve created several pipelines in the previous
chapters. This pipeline will also use the CopyData process as well as the intro-
duction of a stored procedure. To begin, however, let’s create a baseline pipeline
using the following steps:
1. Open the ADF instance that you’ve been working with in chapters 1 and 2 and
add a new pipeline. Name the pipeline LoadADW.
2. Next, drop a Copy Data shape onto the design page. Name this Copy_To_
Stage_Cust.
3. Set the Source to the Customers table on your local SQL instance that was
created in Chapter 1 (shown in Figure 3.15).
Figure 3.15: Setting the source to the customers table on the local SQL instance
4. Set up the Sink to connect to the ADW instance. Click the New button to select
a data set and click on the Azure SQL Data Warehouse option.
5. In the New Linked Service window that opens (see Figure 3.16), configure
your connection to your ADW server. The only feature that will be new is that
you’ll have several authentication options. We’ll use SQL authentication for
this exercise. Select that option in the Authentication type property drop-
down and enter in the credentials that you used at the very beginning of
Chapter 1 (the username is “serveradmin”).
6. Figure 3.17 shows the server admin account in the SSMS explorer, along with
the SQL script used to create it. The GUI functions that you have with local
databases are not available with Azure databases accessed via SSMS. If you
want to create a new SQL user in your ADW, you’ll have to use scripting. Once
the login has been created, add it to the User name property in the Linked
Service configuration along with the password and then click the Test con-
nection button to make sure you can connect. When validated, click the
Finish button. This will create a new resource.
Figure 3.17: The serveradmin login shown from the security tab within the ADW database
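Because there is no GUI for creating users against Azure databases, this step is scripted. A hedged sketch of the pattern follows; the login name and password below are placeholders (the book’s exact script appears in Figure 3.17), and the first statement runs against the server’s master database while the second runs against the ADW database itself:

```sql
-- Run against the master database on the Azure SQL server:
CREATE LOGIN PipelineLoader WITH PASSWORD = '<StrongPasswordHere>';

-- Then run against the ADW database to map the login to a database user:
CREATE USER PipelineLoader FOR LOGIN PipelineLoader;
```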
7. Back on the Connection tab in your Copy Data pipeline component, select the
table that you are targeting. In this case, it is the StageCustomers table (see
Figure 3.18).
8. The source and sink have been declared. Now you can click on the Mapping
tab to define the column mappings. The mappings will look like what is
shown in Figure 3.19. You’ll need to click the Import Schemas button before
the mapping will show.
Figure 3.19: With the source and sink defined, mapping can take place
9. With this shape complete, click the Validate button on the toolbar. You may
see an error that looks like that shown in Figure 3.20. This can be remedied by
clicking the Enable Staging checkbox on the Settings tab of the shape. You’ll
have to create a new connection, which should be configured like that shown
next to the error in the same figure.
10. Next, you’ll follow similar steps for a second Copy Data shape. This one will
copy data from the Sales file in the ADLS (created in Chapter 2) and push it
into the ADW’s Sales staging table. Name this shape Copy_To_Stage_Sales.
11. Set the source connection to the existing ADLS linked service and point it to
the SalesData folder and the export.txt file (see Figure 3.21).
12. Click the Sink tab and create a new connection to the ADW Sales staging
table.
13. Via the Mapping tab, you’ll need to map the columns from the ADLS file to the
Staging Sales table. The column names, as you can see in Figure 3.22, are not
well named in the source. In order to see what these are, click on the Source
tab and preview the data.
Figure 3.22: Column mapping to a CSV may require referencing the data in preview to see
column names
14. You can now map the columns appropriately. For example, Prop_1, which is
the customer ID, can be mapped to CustomerID (see Figure 3.23). One column
should not be mapped: the identity field in the target table. If need be,
you can leave fields empty or add dynamic content. Note that depending on
the CSV format, this mapping may occur automatically.
15. Return to the main pipeline development area. Link the two Copy shapes so
that they look like what is shown in Figure 3.24.
Figure 3.24: The pipeline with the two copy data processes (one from local SQL, the other from
ADLS)
16. Click on the Validate button to test the full pipeline. The pipeline should
validate.
17. Save everything you’ve been working on.
We’ll pause here for a moment with pipeline development to discuss mapping
options. For simple scenarios like our exercise, the mapping tool is fine. Columns
that map one to one or need minor modifications can be handled through the
little mapping tool in the pipeline. However, for true control over how you map
your data, you’ll want to shell out to a stored procedure on the database side.
Dealing with data at the native database level is good practice and allows you to
easily test and update the process. Integration architects have built complex map-
pings at the integration tier for years; however, keeping that logic at the database
level could have saved tremendous amounts of time and energy.
To illustrate how to do basic mapping at the SQL level, we’ll call a stored
procedure to map the data from the Staging tables into the final fact and dimen-
sion tables in the ADW. First, we’ll write code to pull the data from the Customer
staging table and insert it into the Customer dimension table. Next, we’ll do the
same for Product, and lastly, for Date. Once these three tables have been popu-
lated with data, we can execute the code to populate the fact table and be done.
We’ll use a single stored procedure in the ADW, called from the ADF pipeline.
To ease the development of your SQL, go ahead and run your pipeline (you
can click the Debug button to do this). This will populate the staging tables in the
ADW with data so that you can work with the mapping in SQL more easily.
To create a new stored procedure in your ADW instance, right click Stored
Procedures in your SSMS tree view and select New and then Stored Procedure
(see Figure 3.25). This will create a new script window for you to write your SQL.
Enter the code that is in Listing 3.3 to create a procedure that takes the first step of
loading data into the Customer dimension table.
,DateOfSale
,ItemPurchased
,@currentDate
FROM StageSales
-- clear out the staging tables for next run
TRUNCATE TABLE StageCustomers
TRUNCATE TABLE StageSales
END
GO
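Only the tail of Listing 3.3 survives in this extraction. As a hedged reconstruction, the full procedure likely follows a shape like the one below: the procedure name (LoadDataFromStaging) is confirmed later in this section, and the fact-table insert and TRUNCATE statements match the fragment above, but the dimension-load statements and column names are assumptions for illustration:

```sql
-- Sketch of the full procedure; dimension columns are assumed.
CREATE PROCEDURE LoadDataFromStaging
AS
BEGIN
    DECLARE @currentDate datetime = GETDATE()

    -- Load the dimension tables from staging first.
    INSERT INTO DimCustomer (CustomerID, FirstName, LastName, ModifiedOn)
    SELECT CustomerID, FirstName, LastName, @currentDate
    FROM StageCustomers

    INSERT INTO DimProduct (ProductName, ModifiedOn)
    SELECT DISTINCT ItemPurchased, @currentDate
    FROM StageSales

    INSERT INTO DimDate (SaleDate, ModifiedOn)
    SELECT DISTINCT DateOfSale, @currentDate
    FROM StageSales

    -- Load the fact table once the dimensions are populated
    -- (this portion matches the surviving fragment of Listing 3.3).
    INSERT INTO FactSale (CustomerID, DateOfSale, ItemPurchased, ModifiedOn)
    SELECT CustomerID
        ,DateOfSale
        ,ItemPurchased
        ,@currentDate
    FROM StageSales

    -- Clear out the staging tables for the next run.
    TRUNCATE TABLE StageCustomers
    TRUNCATE TABLE StageSales
END
GO
```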
We’ll now add a call to this stored procedure from the ADF pipeline so that each
time the process runs, the procedure gets called. The flow in the pipeline will be
as follows:
1. Copy data to the stage customer table.
2. Copy data to the stage sales table.
3. Call the stored procedure, which will load the dimension and fact table and
then clear out the stage tables.
To call the stored procedure from your pipeline, expand the General tab in the
pipeline toolbox and drag the “Stored Procedure” shape onto the design surface,
dropping it to the right of the Copy_To_Stage_Sales box. Rename this to “Call_
Stored_Proc.” On the SQL Account tab, set the linked service to the service that
you’ve set up for the connection to ADW. On the Stored Procedure tab, select the
LoadDataFromStaging procedure that you’ve created (Listing 3.3 above). Figure
3.26 shows the configuration for this stage of the pipeline.
Run the process again using the Debug button in the pipeline. The result should
be that the data has loaded into the target data warehouse tables. In the output of
the pipeline run you can see the total time each step took and whether the step
succeeded or failed. Figure 3.27 shows the output from the current exercise.
You can also monitor the pipeline activities that are running (or have run) by
clicking the Overview tab and scrolling through the dashboard controls that are
available (see Figure 3.28).
The ability to use external tables is a key feature of an Azure data warehouse.
You can create an external table that points to any queryable external data
source, allowing you to load data into your data warehouse and then query it
like any other standard SQL table. By using an external table to load that data
source directly into the ADW, you don’t need to use an ADF pipeline or any
other outside tool to load data.
For ADW to connect to ADLS, you must create several items. These include a
key, credential, and data source on the ADW server itself, as well as an Azure App
via the Azure Portal. The first step you need to take is to create the Azure App. By
doing so, you’ll gain access to several data points that must be available in the
credential within the ADW database in order to connect. The key components
needed for an external table over ADLS are shown in Figure 3.29.
Figure 3.29: The components needed for authentication and querying of external data in ADLS
from ADW (an external table built on a master key, credential, data source, and file format in
the ADW, with an Azure App granted directory permissions in ADLS)
To begin, let’s create the Azure App. This can be done by clicking on Azure Active
Directory in the main Azure portal and then clicking on the App registrations
option. You can now click +New application registration from the toolbar.
In the properties that appear, type in a descriptive name for your app (we will call
it inotekadlstoadw for this exercise), leave the Application type property set to
Web app / API, and then enter any valid URL (the URL can be anything, as it isn’t
used for ADW/ADLS integration). Click the Create button when done. Once it
creates, you’ll be able to navigate to your new app registration from the main
Azure page (see Figure 3.30).
Figure 3.30: A new app registration, which is used for connecting ADW to ADLS
There are three unique identifiers that you’re going to need to capture from this
app (or from within Azure) in order to create a valid SQL data source component
in ADW. These include an Application ID, a Key, and an OAuth string. You can
capture the Application ID by clicking on your new App registration and grabbing
it from the overview window that opens (see Figure 3.31).
Next, you’ll need to create a new key. Click on Settings on the registered app
window and then select Keys from the Settings menu. In the Description field,
type a key name (we’ll call it “Example” for now) and then set the expiration (for
this exercise we’ll set it to never expire). Once you save it, you’ll see a key value
generated. You need to copy the key value now because once you leave this page,
you’ll never see it again (see Figure 3.32). This is the Key value that will be used
shortly in an Azure SQL script.
Click on the Properties tab on the main Azure Active Directory navigation bar to
get the last of the three unique identifiers (the OAuth information), and copy the
value from the Directory ID (see Figure 3.33). This will be combined with the string
shown below. The final list of IDs that you’ll want is as follows (you can refer back
here for when you’re building your SQL script later in this section):
1. Application ID (from the App Registration): example value is f2e499a9-ca67-
4855-b027-4adce08ba17d
2. Key value (from the Keys within the settings of the App Registration): example
value is mPO0Q41MqsVPrqOYH04UZBfOF3LNmVqi2ZAtj8GbNf0=
3. OAuth value (a combined string that includes the ID from the Directory ID field
under Properties in Azure Active Directory): example value is OAUTH = https://
login.windows.net/d61c4b5c-858b-4554-821b-2a29fc79003f/oauth2/token
To prepare for the rest of this walk-through, log into your ADLS instance and
create a new directory called Exports. Place the Sales.csv file we worked with
earlier in this book into that directory. This will allow us to easily reference it in
our connection from ADW. The external table in ADW that will be created shortly
will point to this directory and will consume any data within it. Figure 3.34 shows
this folder with the file being previewed.
When the file and folder are created, click on the top level of your folder struc-
ture and click the Access button. You should see the new Application Registra-
tion you’ve just created in the list of available users. Click on this user (inotekadl-
stoadw) and give it full permissions to read, write, and execute in the child
directories. You’ll now have permissions to read from this directory when creating
the external table. Refer to Figure 3.35 to see this user given access.
Figure 3.35: Giving permissions to the app registration to read from the new directory
You can now create a number of items in your ADW database that will allow you
to connect to the ADLS instance. You’ll need to create a Master Key, an External
File Format, a Credential, and a Data source. When the four connectivity objects
in Listing 3.4 are created, you’ll be able to build an External table to pull data from
the ADLS instance.
Listing 3.4: Create ADW components to allow connecting to ADLS from External table
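The body of Listing 3.4 does not appear in this extraction. For ADLS Gen1 accessed via PolyBase, the four objects are typically created as follows; the password, the object names (ADLSCredential, AzureDataLakeStorage, TextFileFormat), and the ADLS account are placeholders, and the Application ID, Directory ID, and Key value are the identifiers you captured earlier in this section:

```sql
-- 1. Master key: protects the secret stored in the credential.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPasswordHere>';

-- 2. Credential: the identity combines the Application ID with the
--    OAuth token endpoint; the secret is the Key value from the app.
CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH
    IDENTITY = '<ApplicationId>@https://login.windows.net/<DirectoryId>/oauth2/token',
    SECRET = '<KeyValue>';

-- 3. Data source: points at the ADLS account using the credential.
CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH
(
    TYPE = HADOOP,
    LOCATION = 'adl://<YourAdlsAccount>.azuredatalakestore.net',
    CREDENTIAL = ADLSCredential
);

-- 4. File format: describes the comma-delimited text being read.
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);
```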
Finally, you can create your external table! The external table will declare the
fields being queried (which should match the fields in the CSV file you uploaded
earlier). It will indicate what data source to use and what file format to expect.
Creating the table will cause the data to populate within it if it runs successfully.
We’ll declare all the field types as strings so that no restrictions are placed
on what gets loaded into the table. When dealing with large data sets, there is
often data that doesn’t match a specific field type. You can clean the data before
loading it into the external table or you can let the external table be an exact rep-
resentation of what is in the source and do the cleanup while loading into staging
tables (or elsewhere). Listing 3.5 shows the code to create the external table.
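The body of Listing 3.5 is likewise missing from this extraction. A hedged sketch of the statement follows: the column names are assumptions based on the Sales data used in this book, all fields are declared as strings per the discussion above, and the data source and file format names assume the objects created in Listing 3.4:

```sql
-- Sketch only: column names are assumptions; every field is a string
-- so that no value in the source file is rejected on load.
CREATE EXTERNAL TABLE ExtSales
(
    CustomerID    nvarchar(200) NULL,
    DateOfSale    nvarchar(200) NULL,
    ItemPurchased nvarchar(200) NULL,
    AmountOfSale  nvarchar(200) NULL
)
WITH
(
    LOCATION = '/Exports/',
    DATA_SOURCE = AzureDataLakeStorage,
    FILE_FORMAT = TextFileFormat
);
```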
Executing this code will cause the data to load from the CSV into the table. You
can query this table just like any other table on your database, but you cannot do
other operations (like writing). As you can see in Figure 3.36, the load included
the header information from the CSV. If you wanted to eliminate this row from a
standard SQL table, you could issue a DELETE SQL command. However, you can
see that this causes an error stating it isn’t a supported operation. You’ll need to
eliminate this data when you load the external data into your staging tables.
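One hedged way to drop the header row during that load is a simple filter that excludes the row whose “value” is just the column name; the table and column names here are assumptions consistent with the sketches above:

```sql
-- Skip the CSV header row while copying from the external table
-- into the staging table (the header row's CustomerID field
-- contains the literal column name).
INSERT INTO StageSales (CustomerID, DateOfSale, ItemPurchased)
SELECT CustomerID, DateOfSale, ItemPurchased
FROM ExtSales
WHERE CustomerID <> 'CustomerID';
```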
Figure 3.36: Querying the external table is possible, but write operations are not
That’s it for using external tables! If you want to automate things, you could pop-
ulate these tables and do the cleansing of the data for staging tables by calling
procedures that you write through a custom ADF pipeline. At this point, you
should have all the tools you need to create that pipeline and populate your ADW
tables with data from a variety of sources!
Summary
This chapter covered the basics of setting up an Azure Data Warehouse and pop-
ulating it with data from several sources. You looked at creating a simple data
warehouse model, building staging tables, and automating that load using an
ADF pipeline. You also worked through setting up external tables and populating
one through an ADLS connection, without the need for an ADF pipeline. Hope-
fully, you now have a good understanding of the interaction between ADF and
the various storage options in Azure, and a clear idea of how to populate data
in your ADW instance. You should now be able to build your own integrations of
data between local repositories, Azure SQL, Azure Data Lake Servers, and Azure
Data Warehouses.