Mark Beckner

Quick Start Guide to Azure Data Factory, Azure Data Lake Server, and Azure Data Warehouse
ISBN 978-1-5474-1735-3
e-ISBN (PDF) 978-1-5474-0127-7
e-ISBN (EPUB) 978-1-5474-0129-1

Library of Congress Control Number: 2018962033

Bibliographic information published by the Deutsche Nationalbibliothek


The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie;
detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2019 Mark Beckner


Published by Walter de Gruyter Inc., Boston/Berlin
Printing and binding: CPI books GmbH, Leck
Typesetting: MacPS, LLC, Carmel
www.degruyter.com
About De|G PRESS

Five Stars as a Rule

De|G PRESS, the startup born out of one of the world’s most venerable publishers,
De Gruyter, promises to bring you an unbiased, valuable, and meticulously edited
work on important topics in the fields of business, information technology, com-
puting, engineering, and mathematics. By selecting the finest authors to present,
without bias, information necessary for their chosen topic for professionals, in
the depth you would hope for, we wish to satisfy your needs and earn our five-star
ranking.
In keeping with these principles, the books you read from De|G PRESS will
be practical, efficient and, if we have done our job right, yield many returns on
their price.
We invite businesses to order our books in bulk in print or electronic form as a
best solution to meeting the learning needs of your organization, or parts of your
organization, in a most cost-effective manner.
There is no better way to learn about a subject in depth than from a book that
is efficient, clear, well organized, and information rich. A great book can provide
life-changing knowledge. We hope that with De|G PRESS books you will find that
to be the case.

DOI 10.1515/9781547401277-202
Acknowledgments
Thanks to my editor, Jeff Pepper, who worked with me to come up with this quick
start approach, and to Triston Arisawa for jumping in to verify the accuracy of the
numerous exercises that are presented throughout this book.

DOI 10.1515/9781547401277-203
About the Author

Mark Beckner is an enterprise solutions expert. With over 20 years of experience, he leads his firm Inotek Group, specializing in business strategy and enterprise application integration with a focus on health care, CRM, supply chain, and business technologies.
He has authored numerous technical books, including Administering, Con-
figuring, and Maintaining Microsoft Dynamics 365 in the Cloud, Using Scribe
Insight, BizTalk 2013 Recipes, BizTalk 2013 EDI for Health Care, BizTalk 2013 EDI
for Supply Chain Management, Microsoft Dynamics CRM API Development, and
more. Beckner also helps up-and-coming coders, programmers, and aspiring
tech entrepreneurs reach their personal and professional goals.
Mark has a wide range of experience, including specialties in BizTalk Server,
SharePoint, Microsoft Dynamics 365, Silverlight, Windows Phone, SQL Server,
SQL Server Reporting Services (SSRS), .NET Framework, .NET Compact Frame-
work, C#, VB.NET, ASP.NET, and Scribe.
Beckner’s expertise has been featured in Computerworld, Entrepreneur, IT
Business Edge, SD Times, UpStart Business Journal, and more.
He graduated from Fort Lewis College with a bachelor’s degree in computer
science and information systems. Mark and his wife, Sara, live in Colorado with
their two children, Ciro and Iyer.

DOI 10.1515/9781547401277-204
Contents
Chapter 1: Copying Data to Azure SQL Using Azure Data Factory
Creating a Local SQL Instance
Creating an Azure SQL Database
Building a Basic Azure Data Factory Pipeline
Monitoring and Alerts
Summary

Chapter 2: Azure Data Lake Server and ADF Integration
Creating an Azure Data Lake Storage Resource
Using Git for Code Repository
Building a Simple ADF Pipeline to Load ADLS
Combining into a Single ADF Pipeline
Data Lake Analytics
Summary

Chapter 3: Azure Data Warehouse and Data Integration using ADF or External Tables
Creating an ADW Instance
ADW Performance and Pricing
Connecting to Your Data Warehouse
Modeling a Very Simple Data Warehouse
Load Data Using ADF
Using External Tables
Summary

Index
Introduction
A challenge was presented to me: distill the essence of Azure Data Factory (ADF), Azure Data Lake Server (ADLS), and Azure Data Warehouse (ADW) into a short, fast quick start guide. There's a tremendous amount
of territory to cover when it comes to diving into these technologies! What I hope
to accomplish in this book is the following:
1. Lay out the steps that will set up each environment and perform basic devel-
opment within it.
2. Show how to move data between the various environments and components,
including local SQL and Azure instances.
3. Eliminate some of the elusive aspects of the various features (e.g., check out
the overview of External Tables at the end of Chapter 3!)
4. Save you time!

I guarantee that this book will help you fully understand how to set up an ADF
pipeline integration with multiple sources and destinations, and that it will require
very little of your time. You'll know how to create an ADLS instance and move data in
a variety of formats into it. You’ll be able to build a data warehouse that can be
populated with an ADF process or by using external tables. And you’ll have a fair
understanding of permissions, monitoring the various environments, and doing
development across components.
Dive in. Have fun. There is a ton of value packed into this little book!

DOI 10.1515/9781547401277-206
Chapter 1
Copying Data to Azure SQL Using
Azure Data Factory
In this chapter we’ll build out several components to illustrate how data can be
copied between data sources using Azure Data Factory (ADF). The easiest way to
illustrate this is by using a simple on-premises local SQL instance with data that
will be copied to a cloud-based Azure SQL instance. We’ll create an ADF pipeline
that uses an integration runtime component to acquire the connection with the
local SQL database. A simple map will be created within the pipeline to show how
the columns in the local table map to the Azure table. The full flow of this model
is shown in Figure 1.1.

Figure 1.1: Components and flow of data being built in this chapter

Creating a Local SQL Instance

To build this simple architecture, a local SQL instance will need to be in place.
We’ll create a single table called Customers. By putting a few records into the
table, we can then use it as the base to load data into the new Azure SQL instance
you create in the next section of this chapter. The table script and the script to
load records into this table are shown in Listing 1.1. A screenshot of the local SQL
Server instance is shown in Figure 1.2.

DOI 10.1515/9781547401277-001

Figure 1.2: Creating a table on a local SQL Server instance

Listing 1.1: The Local SQL Customers Table with Data

CREATE TABLE [dbo].[Customers](
    [CustomerID] [nchar](10) NOT NULL,
    [LastName] [varchar](50) NULL,
    [FirstName] [varchar](50) NULL,
    [Birthday] [date] NULL,
    [CreatedOn] [datetime] NULL,
    [ModifiedOn] [datetime] NULL
) ON [PRIMARY]
GO
INSERT [dbo].[Customers] ([CustomerID]
    ,[LastName]
    ,[FirstName]
    ,[Birthday]
    ,[CreatedOn]
    ,[ModifiedOn])
VALUES (N'CUST001'
    ,N'Jones'
    ,N'Jim'
    ,CAST(N'1980-10-01' AS Date)
    ,CAST(N'2017-09-01T12:01:04.000' AS DateTime)
    ,CAST(N'2018-04-01T11:31:45.000' AS DateTime))
GO
INSERT [dbo].[Customers] ([CustomerID]
    ,[LastName]
    ,[FirstName]
    ,[Birthday]
    ,[CreatedOn]
    ,[ModifiedOn])
VALUES (N'CUST002'
    ,N'Smith'
    ,N'Jen'
    ,CAST(N'1978-03-04' AS Date)
    ,CAST(N'2018-01-12T01:34:12.000' AS DateTime)
    ,CAST(N'2018-01-12T01:45:12.000' AS DateTime))
GO

Creating an Azure SQL Database

Now we'll create an Azure SQL database. It will contain a single table called
Contact. To begin with, this Contact table will hold its own manually entered record,
separate from data in any other location, but it will eventually be populated with
the data copied from the local SQL database. To create the Azure database, you'll
need to log into portal.azure.com. Once you've successfully logged in, you'll see
a list of available actions in the left-hand navigation toolbar. To create the
database, click on the SQL databases menu item and then click the Add
button, as shown in Figure 1.3.

Figure 1.3: Adding a new Azure SQL database

You can now enter in the information that pertains to your new database. You can
get more information on these properties by clicking the icon next to each label.
Some additional details on several of these properties are noted as follows:
1. Database name—the database name will be referenced in a variety of loca-
tions, so name it just like you would a local database (in this case, we’ll refer
to it as InotekDemo).
2. Subscription—you’ll have several potential options here, based on what you
have purchased. Figure 1.3 shows Visual Studio Enterprise, as that is the
MSDN subscription that is available. Your options will look different depend-
ing on licensing.
3. Select source—go with a new blank database for this exercise, but you could
base it on an existing template or backup if there was one available that
matched your needs.
4. Server—this is the name of the database server you will connect to and where
your new database will live. You can use the default or you can create your
own (see Figure 1.4). A database server will allow you to separate your data-
bases and business functions from one another. This server will be called
“Demoserverinotek” with serveradmin as the login name.

Figure 1.4: Configuring the new Azure SQL Server where the database will reside

5. Pricing Tier—pricing in Azure is a little overwhelming, and you'll want to
think about costs across all components before you decide. For now, we'll
select the Basic tier, which allows for up to 2 GB of data.

When you're ready, click the Create button and the deployment
process in Azure will begin. You’ll see a notification on your toolbar (see Figure
1.5) that shows the status of this deployment. After a minute or two your database
deployment will be completed and you’ll be able to click on the new database and
see information about it.

Figure 1.5: Notification of the deployment in process

There are several ways to connect to this new server. You can use the Azure tools or
you can connect from a local SQL tool like SQL Server Management Studio (SSMS). Using
SSMS requires that you enter information about the SQL Server you
are trying to connect to. To connect to your Azure server, click on the Connection
strings property of your database in Azure. You'll want to grab the server name
from here and enter it into your server connection window (shown in Figure 1.6).
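
For reference, Azure SQL server names follow the pattern <servername>.database.windows.net, and the ADO.NET entry on the Connection strings blade looks roughly like the sketch below, using this chapter's names (your portal shows the exact string for your database):

Server=tcp:demoserverinotek.database.windows.net,1433;Initial Catalog=InotekDemo;
User ID=serveradmin;Password={your_password};Encrypt=True;Connection Timeout=30;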

Figure 1.6: Connecting to the new Azure SQL Server from a local SSMS connection window

Next, you’ll type in the login and password and click Connect. If you haven’t con-
nected to an Azure SQL instance before, you will be asked to log into Azure. If this
occurs, click the Sign in button and enter the credentials that you used to connect
to the Azure portal (see Figure 1.7). Once authenticated, you’ll be able to select
whether to add the specific IP you’re on or add the full subnet. You’ll be required
to do this each time you connect to your SQL instance from a new IP.

Figure 1.7: The first time you connect from SSMS on a new computer, you will be required to enter this information

With the credentials entered accurately, SSMS will connect to your
new Azure SQL instance and you'll be able to create artifacts just like you would
with a local SQL instance. For this exercise, we'll create a table called Contact in
the InotekDemo database where we'll eventually upload data from the local SQL
instance. The table looks like that shown in Figure 1.8, with the SQL script shown in
Listing 1.2.

Figure 1.8: Creating a table in the Azure database from SSMS

Listing 1.2: A new table created on the Azure SQL Server database

CREATE TABLE [dbo].[Contact](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [First] [nchar](20) NULL,
    [Last] [nchar](20) NULL,
    [DOB] [date] NULL,
    [LastModified] [datetime] NULL
) ON [PRIMARY]

In addition to SSMS, you can also use the Query Editor tool
that’s available in the Azure web interface. To illustrate how this tool works, we’ll
use it to insert a record into the table. To do this, click on the Query editor link in
the Azure navigation bar (see Figure 1.9). A window will open where you’ll be able
to see your objects and type in standard SQL commands.

Figure 1.9: Using the Query Editor tool available in the Azure portal

You’ll have basic access to write queries and view SQL objects. To insert a record,
use a standard insert script like that shown in Figure 1.10 and click the Run
button. You’ll see information about your SQL transaction on the two available
tabs, Results and Messages. You can save your query or open a new one. Both of
these actions use your local file path and are not actions that take place within
Azure itself. You can also use the INSERT script shown in Listing 1.3.

Listing 1.3: Alternative Insert Script

INSERT INTO dbo.Contact
    (First, Last, DOB)
VALUES
    ('John', 'Doe', '2000-01-01')
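
To confirm the insert, a quick query like the following in the same Query Editor window should return the new John Doe row (LastModified will be NULL, since the script above doesn't set it):

SELECT [ID], [First], [Last], [DOB], [LastModified]
FROM [dbo].[Contact];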

Figure 1.10: Inserting a record using the Azure portal Query tool

You can also edit your data in your tables through an editing interface in Azure by
clicking on the Edit Data button. This will open an editable grid version of your
table where you can modify your existing data or create new records. You can do
this by using the Create New Row button, shown in Figure 1.11.

Figure 1.11: Editing data directly in the Azure Query tool

Building a Basic Azure Data Factory Pipeline

At this point, you have a local SQL database and a hosted Azure SQL database,
both populated with a small amount of data. We’ll now look at how to pull the
data from your local SQL instance into your Azure SQL instance using Azure Data
Factory (ADF). To create a new ADF instance, click on the Create a resource link
on your main Azure navigation toolbar and then select Analytics. Click on the
Data Factory icon in the right-hand list, as shown in Figure 1.12.

Figure 1.12: Creating a new Data Factory resource

In the configuration screen that opens, you’ll be required to enter a name for your
new ADF process. You’ll also need to select several other properties, one of which
is the Resource Group. For ease of reference and organization, we’ll put this ADF
component in the same Resource Group that was used for the SQL Azure server
instance created earlier in this chapter. Figure 1.13 shows the base configuration
for the data factory. Click the Create button once the configuration has been com-
pleted.

Figure 1.13: A new data factory configuration

The creation of the ADF will take a few moments, but eventually you’ll see a noti-
fication that your deployment has been completed. You can see this on the noti-
fication bar, where you can also click the Go to resource button (see Figure 1.14).
You can also access this via the All resources option on the main Azure portal
navigation toolbar.

Figure 1.14: Notification indicating that the data factory has been deployed

Clicking this button will take you to an overview screen of your new ADF. You can
also access this overview screen by clicking All resources from the main Azure
navigation toolbar and clicking on the name of the ADF that was just created.
You’ll see a button on this overview screen for Author & Monitor (see Figure 1.15).
Click it to open a new window where the actual ADF development will take place.
One item to note—some functions in Azure will not work in Internet Explorer.
Only Chrome and Edge are officially supported. If you run into functionality
issues, try another browser!

Figure 1.15: To begin development in an ADF, click on the Author & Monitor button

For development within the ADF web framework, there are several options that
will be available for you to choose from (see Figure 1.16). We’ll look at using the
Create pipeline functionality and SSIS Integration Runtime later in this book
(both of which allow for development of custom multistep processes), but for
now we’ll look at Copy Data, which is a templated pipeline that lets you quickly
configure a process to copy data from one location to another. We’ll use this to
copy data from the local SQL instance to the Azure instance.
Note that in cases where it is necessary to move data from a local SQL instance
into an Azure instance, we can use all of these approaches (a pipeline, the copy
data tab, and the SSIS integration runtime), as well as others, which we won’t
detail here (like replication, external integration platforms, local database jobs,
etc.!).

Figure 1.16: Use the Copy Data button for development within the ADF

To use the Copy Data component, click on the Copy Data icon on the ADF screen.
You will see a new area open where you can configure the copy process. Underly-
ing this web interface is a standard ADF pipeline, but instead of using the pipe-
line development interface, you can use a more workflow-oriented view to con-
figure things.
You’ll begin by naming the task and stating whether it will run once or
whether it will run on a recurring schedule. If you select the recurring schedule
option, you’ll see several items that will allow you to define your preferences.
Note that one of the options is a Tumbling Window, which means that each run
must complete before the next window fires, so two runs of the same process
never overlap one another. A standard scheduler will run at a given
time, regardless of whether the previous instance has completed or not. As you
can see in Figure 1.17, the schedule options are typical.
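
Under the covers, a recurring schedule is stored as a trigger definition in JSON. As a rough sketch (the trigger name and start time below are hypothetical, not taken from this exercise), a tumbling window trigger that runs this chapter's pipeline hourly, one window at a time, would look something like this:

{
    "name": "HourlyTumblingTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2018-11-01T00:00:00Z",
            "maxConcurrency": 1
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "CopyLocalDataToAzure",
                "type": "PipelineReference"
            }
        }
    }
}

The maxConcurrency value of 1 is what keeps one window from starting while another is still running.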

Figure 1.17: Setting the name and schedule attributes of the process



The next step in the process, after the base properties have been defined, is to
specify the source of the data that will be copied. There are many options and
many existing connectors that allow for rapid connection to a variety of data
sources. For this exercise we’ll point to the local SQL database with the customer
data. Click on the Database tab, click the Create new connection button, and
select SQL Server from the list of connectors (see Figure 1.18).

Figure 1.18: Creating a SQL source

When the SQL Server icon is selected, you’ll have an extensive set of options to
determine how to connect to your source database. First, you’ll need to create an
integration runtime to connect to a local SQL instance. The integration runtime
will allow you to install an executable on the local machine that will enable ADF
to communicate with it. Once the integration runtime has been created, you’ll be
able to see your local SQL Server and configure your ADF pipeline to point at the
source table you are after. Here are the steps to set up this source:
1. On the New Linked Service window (Figure 1.19), set the name of your service
and then click the +New link in the Connect via integration runtime drop-
down.

Figure 1.19: Creating a new integration runtime

2. On the next screen that opens, select the Self-Hosted button and click Next.
3. Next, give the integration runtime a descriptive name and click Next.
4. There will now be a screen that has several options—you can use the Express
setup or the Manual setup. As you can see in Figure 1.20, the installation of
the integration runtime on your local machine (where the local instance of
SQL Server exists) is secured through two encrypted authentication keys. If
you use the Manual setup, you’ll need to reference these keys. The Express
setup will reference them for you. For this exercise, click the Express setup
link.

Figure 1.20: Options for installing the integration runtime on a local machine

5. When you click on the Express setup link, an executable will download, and
you'll need to run it. This will take some time to install. A progress indicator will appear
on the screen while installation is taking place (see Figure 1.21).

Figure 1.21: The installation progress of the local executable for the integration runtime

6. When the installation has completed, you’ll be able to open the Integration
Runtime Configuration Manager on your local computer. To validate that this
configuration tool is working, click on the Diagnostics tab and test a connec-
tion to your local database that you plan to connect to from Azure. Figure 1.22
shows a validated test (note the checkmark next to the Test button). You can
connect to virtually any type of local database, not just SQL Server (just select
the type of connection you need to test from the dropdown).

Figure 1.22: Testing the connection from the newly installed local integration runtime

7. Finally, back in the Azure portal and the Integration Runtime Setup screen,
click the Finish button.

When you have finished creating and installing the Integration Runtime compo-
nent, you’ll find yourself back on the New Linked Service window. You’ll need
to specify the server name, database name, and connection information. If you
tested your connection in the locally installed Integration Runtime Configuration
Manager as shown in the steps above, you can repeat the same information here.
Figure 1.23 shows the connection to a local database instance configured in the
ADF screen that has been tested and validated.

Figure 1.23: Testing connectivity to the local database from the Azure portal

Click the Finish button. The source data store will now be created. You’ll now be
able to proceed to the next step of the process of setting up the Copy Data pipe-
line. Do this by clicking the Next button on the main screen, which will pop up a
new window where you can indicate what table set or query you will be using to
extract data from the local database. For this exercise, there is only a single table
available to select, which is called Customers. Click this table and a preview of the
current data will be shown (see Figure 1.24).

Figure 1.24: Selecting the source table(s) where data will be copied from

Click Next and you will see a screen where filtering can be applied. We’ll leave
the default setting as no filtering. Clicking Next on the filter page will take you
to the Destination stage of configuring the Copy Data pipeline. For the current
solution, the destination is going to be the SQL Azure database that was created
earlier in this chapter. The setup of the destination is like the setup of the source,
except in this case no additional integration runtime will need to be set up or
configured, since the destination is an Azure database. Click the Azure tab and
press the Create new connection button. Select the Azure SQL Database option,
as shown in Figure 1.25.

Figure 1.25: Selecting an Azure SQL Database as the destination

Click the Continue button and a new screen will appear within which you can
configure the connection to the Azure database. Here, you will set the name of the
connection and then select the default AutoResolveIntegrationRuntime option
for the integration runtime component. This integration runtime will allow the
pipeline to connect to any Azure SQL servers within the current framework. Select
the Azure subscription that you used to create your Azure SQL Server earlier in
this chapter and then select the server and database from the dropdowns that will
auto-populate based on what is available. Enter the appropriate credentials and
then test the connection. Figure 1.26 shows the configuration for the Azure SQL
connection.

Figure 1.26: The configuration screen for the destination database

Click Finish to complete the creation of the destination connection. You'll be
returned to the main screen of the Copy Data pipeline configuration flow. If
you click on the All tab, you’ll see both the source and destination connections.
You can roll your mouse over them to see a quick view into what the configuration
for each is (see Figure 1.27).

Figure 1.27: Source and destination connections are now available

Continue with the configuration process by clicking Next, which will allow you
to define the targeted table in the Azure SQL database where the source data will
be copied to. There is only a single table that we’ve created in the Azure database.
This table will show in the dropdown as Contact. Select this table and click Next.
A screen will open where the table mapping between the source and the destina-
tion takes place. We’re working with two simple tables, so the mapping is one to
one. Figure 1.28 shows the completed mapping for the tables that are being used
in this exercise.
In cases where your source tables don’t match cleanly with the target, you’ll
need to write a process on your local SQL instance that will load the data into a
staging table that will allow for ease of mapping. You can always write custom
mapping logic in a custom Azure pipeline, but leaving the logic at the database
level, when possible, will generally ease your development efforts.
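
As a sketch of that approach, assuming the Customers and Contact tables used in this chapter (the staging table name is hypothetical), the local process could look something like this:

-- Hypothetical staging table on the local instance, shaped to match the Azure
-- dbo.Contact columns so the ADF mapping becomes a straight one-to-one copy.
CREATE TABLE [dbo].[Contact_Staging](
    [First] [nchar](20) NULL,
    [Last] [nchar](20) NULL,
    [DOB] [date] NULL,
    [LastModified] [datetime] NULL
)
GO
-- Reload the staging table from the source Customers table before each run.
TRUNCATE TABLE [dbo].[Contact_Staging]
GO
INSERT INTO [dbo].[Contact_Staging] ([First], [Last], [DOB], [LastModified])
SELECT [FirstName], [LastName], [Birthday], [ModifiedOn]
FROM [dbo].[Customers]
GO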

Figure 1.28: Perform the mapping of the source columns to the target columns

Click Next to continue. You’ll see options to set fault tolerance, which we’ll leave
defaulted for now. For large datasets, you may want to simply skip rows that
have errors. You can define your preference here and then click Next. You’ll see
a summary of all the configurations that have been done. Review this summary
(shown in Figure 1.29), as the next step you’ll take is to deploy the Copy Data
pipeline. Click Next when you have reviewed the summary. The deployment of
the completed pipeline will now take place.

Figure 1.29: Summary of the work that has been done

With the pipeline deployed, you’ll be able to test it. There are several ways to navi-
gate around Azure, but for this exercise click the Pencil icon on the left side of the
screen. This will show the pipeline you just developed (which has a single step
of Copy Data) and the two datasets configured. You’ll want to test your process.
This can be done most easily by clicking the Debug button in the upper toolbar of
your pipeline. To see this button, you’ll need to click on the pipeline name in the
left-hand toolbar. Figure 1.30 shows the various tabs that need to be clicked to be
able to press the Debug button.

Figure 1.30: Clicking on the Debug button to test the process

When the Debug button is clicked, the process will begin to run. It takes a second
for it to spin up, and you can monitor the progress of the run in the lower output
window. The process will instantiate and begin to execute. When it completes,
it will either show that it has succeeded or that it failed. In either case, you can
click the eyeglasses icon to see details about what happened during the run (see
Figure 1.31).

Figure 1.31: Process has completed successfully, click the eyeglasses icon to see details

By clicking the Details button, you’ll see the number of records that were read,
the number that were successfully written, and details about rows that failed. As
shown in Figure 1.32, the two records that were in the source database were suc-
cessfully copied to the target. You can verify that the data is truly in the targeted
table by running a select query against the Azure SQL table and reviewing the
records that were loaded.
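
For example, a quick query against the Azure InotekDemo database should now return the two copied customers alongside the row that was inserted manually earlier in this chapter:

SELECT [ID], [First], [Last], [DOB], [LastModified]
FROM [dbo].[Contact]
ORDER BY [ID];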

Figure 1.32: Summary showing that the data was successfully copied from the source to the
target

Monitoring and Alerts

The pipeline that was just created was set up to run on a scheduled basis, which
means the data will continue to copy to the destination table every hour. To see
the status, click on the gauge (Monitor) icon on the left-hand side of the screen.
This will show you a history of the pipelines that have run, along with several
options for sorting and viewing metrics. Figure 1.33 shows the current pipeline's
audit history.

Figure 1.33: Seeing the history of the scheduled pipelines

From this same monitoring area, you can see the health of your integration run-
times by clicking on the Integration Runtimes tab at the top of the screen. You
can also click on Alerts and Metrics, both of which will open new tabs within
your browser.
When you click Monitoring, you will land on the main Azure portal monitor-
ing page where there are endless options for seeing what is going on within your
various Azure components. Some of the key areas to note here are the Activity Log
(click here to see all activity across your Azure solution) and Metrics (build your
own custom metrics to be able to see details across your deployed components).
But for monitoring your ADF solution itself, you’ll most likely want to remain
within the Monitor branch of the ADF screens.
When you click Alerts, you’ll find that you can set up a variety of rules that
will allow you to monitor your databases. These alerts are primarily intended
for administrative monitoring, but you’ll see that there are dozens of alert types
that can be configured, and you’ll want to become familiar with what is in here
to decide if you’re interested in setting up notifications. Just like the main Mon-
itoring page, these alerts apply more to the generic Azure portal than they do
specifically to ADF, but you can certainly monitor events related to your ADF envi-
ronment. You can also build notifications directly into your ADF pipelines as you
build them out.
To set up an alert from the main Alert page, click on +New alert rule on the
toolbar, as shown in Figure 1.34.

Figure 1.34: Creating a new alert

A new screen will open where a rule can be created. Three steps need to be taken
to create a rule: define the alert condition, name the alert (which ultimately
gets included in the notification that is sent), and define who gets notified. An
example of a configured alert rule is shown in Figure 1.35. You can set up noti-
fications such as emails, SMS texts, and voice alerts, along with a variety of others. The
number of configurations here is extensive.

Figure 1.35: Creating an alert rule



Summary

A lot of territory has been covered in this chapter. We’ve looked at creating an
ADF pipeline that uses an integration runtime to connect to a local SQL Server
instance and copying data from a table on that local instance to a table in an Azure
SQL database. We’ve looked at the options around configuring and working with
each of these components, as well as how to create basic monitoring and alerting.
In the next chapter, we’ll expand into deeper customization and development of
an ADF pipeline and look at how additional data sources can be brought into the
mix by working with an Azure Data Lake Server.
Chapter 2
Azure Data Lake Server and ADF Integration
Having explored the basic approach to copying data to an Azure SQL instance
using an ADF pipeline in the previous chapter, we’ll now expand our data storage
options by working with Azure Data Lake Server (ADLS) storage. With ADLS, just
about any type of data can be uploaded, such as spreadsheets, CSV files, data-
base dumps, and other formats. In this chapter, we’ll look at uploading CSV and
database extracts containing simple sales information to ADLS. This sales infor-
mation will be related to the customer data that was used in the previous chapter.
For example, the CSV file will contain the customer ID that can be related to the
record in the Azure SQL database. This relationship will be made concrete in the
next chapter when we pull the data from Azure SQL and ADLS into an Azure Data
Warehouse.
To move data into ADLS, we’ll build out a pipeline in ADF. In the previous
chapter we used the “Copy Data” pipeline template to move the data into SQL
Azure; we’ll make a copy of this and alter it to push data to ADLS. To create the
copy, we’ll build a code repository in Git and connect it to the ADF development
instance. Once the data has been loaded into the ADLS, we’ll look at using an
Azure Data Lake Analytics instance for basic querying against the data lake. The
flow of the data and components of this chapter are shown in Figure 2.1.

(Figure 2.1 diagram: a Git repository backs the ADF solution; on the on-premises machine, the integration runtime reads TXT and CSV data from the local SQL Server; in Azure, the ADF pipeline lands that data in ADLS, where it can be queried by ADLA.)

Figure 2.1: Components and flow of data being built in this chapter

DOI 10.1515/9781547401277-002

Creating an Azure Data Lake Storage Resource

It’s possible to create an Azure Data Lake Storage resource as part of the ADF
custom pipeline setup, but we’ll create the ADLS resource first and then incor-
porate it into a copy of the pipeline that was created in the previous chapter. To
create this ADLS instance, click on +Create a resource in the main Azure portal
navigation and then click on Storage. Next, click on the Data Lake Storage Gen 1
option under the featured components (see Figure 2.2).

Figure 2.2: Creating a new ADLS resource

When you have selected the Data Lake Storage Gen1 resource option, a new
screen overlay will open where you can configure the ADLS top-level information.
Figure 2.3 shows the configuration used for this exercise. When done setting the
properties click Create, which will deploy the ADLS resource. The properties set
on this page include:
1. Name—note that you can only name your ADLS resource with lowercase
letters and numbers. The name must be unique across all existing ADLS
instances, so you’ll have to experiment a little!
2. Subscription, Resource Group, and Location—for this example use the same
information that you used to create the Azure components in Chapter 1.
3. Pricing package—pricing on ADLS is relatively inexpensive, but care should be
taken when determining what plan you want to utilize. If you click the information
circle next to payments, a new window will open where you can calculate your
potential usage costs.
4. Encryption settings—by default, encryption is enabled. The security keys are
managed within the ADLS resource. These keys will be used later when the
ADLS resource is referenced from the ADF pipeline that will be created.

Figure 2.3: Configuration of top-level settings for the ADLS instance

When the deployment of the ADLS resource has completed, you'll get a notification
(similar to that shown in Figure 2.4). You'll be able to navigate to the resource either
by clicking the Go to resource button in the notification window or by clicking
All resources in the main Azure toolbar.

Figure 2.4: A notification will appear when the resource has been created

Opening the newly created ADLS resource shows several actions that can be
taken, as shown in Figure 2.5. The default screen shows usage, costs incurred,
and other reporting metrics. The navigation bar to the left shows options for
setting up alerts and monitoring usage, similar to what was described at the end of
Chapter 1. Most importantly, there’s the option Data explorer, which is the heart
of configuration and access to data within the data lake itself.

Figure 2.5: Accessing Data explorer in ADLS

Clicking the Data explorer button will open your data lake server in a new window.
Your first step will be to create a new folder. This folder will contain the data that
you’ll be uploading to the data lake. For this exercise we’ll call it SalesData. In
Figure 2.6 you can see the navigation frame on the left and the newly created
SalesData folder.

Figure 2.6: Creating a new folder

We’ll manually upload a CSV file now. Later in this chapter we’ll use an ADF
pipeline to upload data automatically. Click the SalesData folder to open it. Once
inside the folder, click the Upload button. Figure 2.7 shows the data contained in
the CSV that will be uploaded for this discussion (there’s an image of it in Excel
and Notepad so that you can see it is just a comma-separated list of values).

Figure 2.7: The data being uploaded, shown in Excel and Notepad

You’ll see three columns in this CSV file. The first column is Customer ID, which
matches the ID column in the Azure SQL Contact table from Chapter 1. The second
column is Sale Date and the third is Sale Amount. There are several sales for each
customer. Of course, you aren’t limited to CSVs in a data lake—you can upload
just about anything. A real-world scenario would be a large export of sales data
from an ERP system in flat file format. This data would be uploaded
hourly or daily to the ADLS instance. We’ll look at this real-world application in
more detail in Chapter 3. Figure 2.8 shows the CSV uploaded into the data lake.
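
Structurally, the file is just a header row followed by comma-separated values, along these lines (the rows here are hypothetical and only illustrate the shape; the actual values are the ones shown in Figure 2.7):

Customer ID,Sale Date,Sale Amount
1,2018-09-14,56.00
1,2018-09-20,134.34
2,2018-10-01,100.00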

Figure 2.8: Uploading a file manually to ADLS

Your uploaded file can now be seen in the main Data explorer view. You can
click on the context menu of the file. One option is to preview the file, which
lets you see all of the data within the file. In most data lake scenarios, you’ll be
dealing with extremely large files that have a variety of formats, so the preview
functionality is critical as you navigate through your files to find the information
and structure you are after. Figure 2.9 shows the context menu with a preview of
the CSV that was uploaded.

Figure 2.9: Previewing the data that was uploaded



Using Git for Code Repository

You’ve just created an ADLS instance and uploaded a file manually. This has
limited application, and you really need to figure out how to automate this
through code. However, before moving on to creating an ADF pipeline and related
components needed for automating the transfer of data into your new ADLS, let’s
pause to look at code repositories.
With pipelines in your ADF, you’ll likely want to back things up or import
pipelines into other instances that you create in the future. To export or import
pipelines and related components you must have a code repository to integrate
with. Azure gives you two options: Git and Azure DevOps. We’ll break down getting
Git set up and a repository created so that ADF components can be exported and
imported through it.
The first step of this process is to get your Git repository associated with
Azure. Navigating to the root ADF page, you’ll see an option for Set up Code
Repository. Clicking on this will allow you to enter your repository information
(see Figure 2.10). If you don’t see this icon, then you will need to click on the
repository button in the toolbar at the top of the screen within the Author and
Monitor window where pipeline development takes place.

Figure 2.10: Setting up a code repository

If you don’t already have access to Git, follow these four steps to set up a reposi-
tory and connect ADF to it:
1. Go to github.com and set up an account (or log into an existing account).
2. Create a new code repository. Once the repository has been set up, you'll need
to initialize it. The repository home page will have command line information
to initialize your new repository with a readme document. To run these
commands, you'll need to download Git and install it on your local machine.
Once it has been installed locally, open a command prompt window and type
the commands listed (they look roughly like the sketch that follows this list).
In the end, you should have a repository set up that has a single readme
document. Note that you'll most likely have to play around with these steps
to get them to work—keep at it until you succeed, but it may take you a bit
of work! If you don't want to mess around with command lines (who would?),
make sure you click on the Initialize this repository with a README option
when you create your repository (see Figure 2.11).

Figure 2.11: Make sure to initialize your Git repository

3. Once the repository has been set up, you will be able to reference it from
within ADF. Click on the Set up Code Repository button on the main ADF
page.
4. A configuration screen will open within Azure that requests your Git user
account. Entering this will allow you to connect to Git and select the reposi-
tory you’ve just created. Once you’ve connected successfully to the repository,
you’ll see that Git is now integrated into your Authoring area (see Figure 2.12).
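
As referenced in step 2, the initialization commands that GitHub lists for a new, empty repository look roughly like the following (the account and repository names are placeholders; use the exact commands shown on your own repository's home page):

git init
git add README.md
git commit -m "first commit"
git remote add origin https://github.com/<your-account>/<your-repository>.git
git push -u origin master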

Figure 2.12: The repository will show in the ADF screen once added successfully

As soon as you have your repository connected, the components you’ve created
in Azure will automatically be synchronized with your Git repository. After com-
pleting the connection and logging into the Git web interface, you should see
something like what is shown in Figure 2.13. Here, the dataset, pipeline, linked
service, trigger, and integration runtime folders have all been created, based on
components that have been created to date in the ADF solution. Within each
of the folders are the component files. These files could be imported into other
ADF instances by copying them into the repository for that ADF solution. Azure
and Git perform continual synchronizations once connected so that your data is
always backed up and accessible.

Figure 2.13: Once connected, the code in your Azure ADF will automatically sync with Git

To understand how to use the repository, let’s take a quick look at importing a
pipeline from another solution that may have been developed. We’ll assume that
this solution already exists somewhere and we are trying to take a copy of that
pipeline and upload it to the current ADF instance. To do this, follow these steps:
1. Determine which files you want to copy over. In the case of the pipeline
created in Chapter 1, for example, there is a single pipeline file, two dataset
files, an integration runtime, linked services, and other components. You can
see in Figure 2.14 the resources on the left in ADF, along with the pipeline files
in the Git folder.

Figure 2.14: The JSON pipeline file in Git

2. You can transfer just the core pipeline file or any of the other associated files
(e.g., data sources, integration runtimes, etc.) individually. Each will import
with its original configuration. If, for example, you import a pipeline that
references several data sources but you don’t import the data sources, the
pipeline will still open. You’ll just have to set up new data sources to link into
it.
3. Every file in ADF exports as a JSON file. You can edit these files before import-
ing if you want. For example, Listing 2.1 shows the CopyLocalDataToAzure
pipeline as implemented in Chapter 1. If you want to alter the name on this,
just edit the JSON file’s name property (there are two of them you’ll need to
set). If you want to alter the mappings, change the columnMappings node
information. For advanced developers, editing the JSON of existing pipelines
and related components can save time and allow you to build out multiple
processes with common logic with relative ease.

Listing 2.1: JSON of pipeline

{
    "name": "CopyLocalDataToAzure",    <--- NAME CAN BE CHANGED (EXAMPLE)
    "properties": {
        "activities": [
            {
                "name": "CopyLocalDataToAzure",    <--- WOULD BE CHANGED HERE AS WELL
                "type": "Copy",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [
                    {
                        "name": "Source",
                        "value": "[dbo].[Customers]"
                    },
                    {
                        "name": "Destination",
                        "value": "[dbo].[Contact]"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "SqlSource"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000
                    },
                    "enableStaging": false,
                    "dataIntegrationUnits": 0,
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": "FirstName: First, LastName: Last, Birthday: DOB, ModifiedOn: LastModified"    <--- MAPPING CAN BE CHANGED (ANOTHER EXAMPLE)
                    }
                },
                "inputs": [
                    {
                        "referenceName": "SourceDataset_sl9",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "DestinationDataset_sl9",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}

4. When you’ve determined what file(s) you want to add to ADF (and made
any edits manually to the JSON), upload them into the Git repository. This
will automatically cause them to display in the ADF web interface under the
appropriate folder.

Knowing how to navigate a code repository within ADF is important. In the next
sections, you’ll build out a pipeline that moves data from a local SQL instance
directly into ADLS. While there are many ways to build out an ADLS implementa-
tion, you’ll likely start by copying pipelines that you’ve created before to perform
similar functions. For example, let’s say you build out a pipeline that has many
steps. You now want to build out additional processes that copy data from similar
sources and load them into the same target. Being able to copy the original, make
minor modifications to the JSON, and then reimport the modified file into ADF as
new components, will allow you to develop pipelines without having to configure
every component from scratch.

Note that you can also use the Clone option (see Figure 2.15) from the context
menu if your components are all within a single ADF instance. You’ll do all your
editing from within the ADF web framework. The repository is a way to back up
files and transfer them between ADF instances.

Figure 2.15: Clone a component

Building a Simple ADF Pipeline to Load ADLS

We’ll now build a simple pipeline to automatically load data into the ADLS that
was created. Piggybacking off the repository that was just created, take a copy of
the original CopyLocalDataToAzure pipeline (from Chapter 1) and download it
to your local machine. Rename the file to CopyLocalDataToADLS and open the
file in Notepad. Edit the name property so it matches the name of the file. Once
that has been completed, upload the file back into your Git repository. Figure 2.16
shows these steps. When the upload has completed, you can refresh your ADF
screen (use the refresh button on the toolbar) and you’ll see the new pipeline in
your pipeline list.

Figure 2.16: Copying a file, editing it, and uploading it

When the new pipeline shows in the ADF navigation, click on it and an edit
window will open. You will make several changes here. Start with the destination
connection (called the Sink) and convert this from the Azure SQL instance to the
new ADLS instance. To do this, click on the Copy Data box in the development
area and then click on the Sink tab in the properties window (see Figure 2.17).
Click the New button and select the Azure Data Lake Storage Gen1 option.

Figure 2.17: Altering the sink connection from SQL Azure to ADLS

A new dataset will be created, which must then be configured. On the General
tab, set a unique descriptive name (we’ll call it AzureDataLakeStoreMain for this
exercise). Next, click on the Connection tab. Follow the steps below to configure
this tab.
1. Click on the New button next to the Linked Service property dropdown. In the
configuration screen that opens, set the properties to point to the ADLS that
you created earlier in this chapter. Figure 2.18 shows this screen configured.

Figure 2.18: Setting up the connection

2. Click the Test connection button to ensure your connection works. Most likely
you'll get a failure. For connections to ADLS, you must first grant your ADF
application access to read and write the ADLS instance. To do this, copy the
Service identity application ID GUID. This can be accessed directly from the
text underneath the Authentication Type property on the linked service con-
figuration screen.
3. Next, go into your ADLS instance and click on Data explorer. In the Data
explorer, click on Access and then click Add. When the GUID is placed in the
Select field, the name of your ADF should automatically appear. Click on the
permissions as shown in Figure 2.19. You can select Read, Write, or Execute,
or any combination that you need, but the two radio buttons should be set as
indicated in Figure 2.19 below.

Figure 2.19: Allowing a connection to ADLS from ADF

4. Assuming you have the connection worked out with your linked service, click
the Finish button.
5. Now, set the File path property by clicking on the Browse button. We’ll use
the same SalesData folder that we manually uploaded the CSV file to earlier
in this chapter. Select this folder and click Finish. At this stage the file prop-
erty can be hardcoded—we’ll come back and make this dynamic after getting
the core pipeline functional. For this exercise, create a text file that looks like
that shown in Figure 2.20.
6. Give the file a name like "export.txt". This will allow us to create a mapping
to this structure (a sketch of what this dummy file might contain follows this
list). We'll create a SQL table shortly that matches this structure (see Listing
2.2). When this file is created, upload it to your Data Lake storage using the
Data explorer and place it in the SalesData folder.

Figure 2.20: Create a dummy instance of a file in the format of the data that you will be migrat-
ing so that it can be mapped

7. The rest of the properties can be left to their defaults. See Figure 2.21 for the
configuration of the Sink.
8. Remember to click Save when you’re done configuring this new data set!
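
As referenced in step 6, one plausible shape for the dummy export.txt is a few comma-separated rows matching the columns of the Sales table that Listing 2.2 will create (the values here are hypothetical; the authoritative layout is the one shown in Figure 2.20):

2,CUST001,2018-10-01,100.0000,Goodies
3,CUST002,2018-09-14,56.0000,Parts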

Figure 2.21: Final sink configuration

The sink is now configured! We need to revise the source connection now. Origi-
nally, this source was pointing to the local SQL Server instance and pulling data
from the Customers table. Now we're going to create a new table that will store
Sales information. All records in this table relate to the Customer table via the ID
that is in place, but this is not meant to be a relational data model. The idea here
is that this is a table with a large volume of data being published to it from
some external source or sources. Think of it as a large data dump. We're
going to take this dump of sales data and load it into the data lake.

To reconfigure the source of your pipeline, click the Source tab on the Copy
Data box and set up a new connection to a new sales table in your local SQL
instance. Listing 2.2 shows the script for this table while Figure 2.22 shows the
source tab in the pipeline configured to point to this, along with a preview of the
data.

Listing 2.2: New Sales table in the local database (with sample data)

CREATE TABLE [dbo].[Sales](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [CustomerID] [nchar](10) NOT NULL,
    [DateOfSale] [date] NOT NULL,
    [AmountOfSale] [money] NOT NULL,
    [ItemPurchased] [varchar](50) NOT NULL,
    CONSTRAINT [PK_Sales] PRIMARY KEY CLUSTERED
    (
        [ID] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
        ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET IDENTITY_INSERT [dbo].[Sales] ON
GO
INSERT [dbo].[Sales] ([ID], [CustomerID], [DateOfSale], [AmountOfSale], [ItemPurchased])
VALUES (2, N'CUST001', CAST(N'2018-10-01' AS Date), 100.0000, N'Goodies')
GO
INSERT [dbo].[Sales] ([ID], [CustomerID], [DateOfSale], [AmountOfSale], [ItemPurchased])
VALUES (3, N'CUST002', CAST(N'2018-09-14' AS Date), 56.0000, N'Parts')
GO
INSERT [dbo].[Sales] ([ID], [CustomerID], [DateOfSale], [AmountOfSale], [ItemPurchased])
VALUES (4, N'CUST002', CAST(N'2018-09-20' AS Date), 134.3400, N'AssortedItems')
GO
INSERT [dbo].[Sales] ([ID], [CustomerID], [DateOfSale], [AmountOfSale], [ItemPurchased])
VALUES (5, N'CUST001', CAST(N'2018-10-02' AS Date), 102.5000, N'Parts')
GO
INSERT [dbo].[Sales] ([ID], [CustomerID], [DateOfSale], [AmountOfSale], [ItemPurchased])
VALUES (6, N'CUST001', CAST(N'2018-10-02' AS Date), 43.0000, N'Goodies')
GO
INSERT [dbo].[Sales] ([ID], [CustomerID], [DateOfSale], [AmountOfSale], [ItemPurchased])
VALUES (7, N'CUST002', CAST(N'2018-09-25' AS Date), 13.4500, N'AssortedItems')
GO
INSERT [dbo].[Sales] ([ID], [CustomerID], [DateOfSale], [AmountOfSale], [ItemPurchased])
VALUES (8, N'CUST001', CAST(N'2018-09-30' AS Date), 150.0000, N'Goodies')
GO
SET IDENTITY_INSERT [dbo].[Sales] OFF
GO
ALTER TABLE [dbo].[Sales] WITH CHECK ADD CONSTRAINT [FK_Sales_Sales] FOREIGN KEY([ID])
REFERENCES [dbo].[Sales] ([ID])
GO
ALTER TABLE [dbo].[Sales] CHECK CONSTRAINT [FK_Sales_Sales]
GO

Figure 2.22: Previewing that data from the source connection that is in the source SQL table

The revision of the Sink and the Source is all that must be done to complete the
changes in this new pipeline. When it runs, it should copy the data from the SQL
table and place it into a text file in the ADLS instance. Save everything, including
the pipeline, and click on the Debug button. It will begin processing, and should
succeed (if not, you’ll see errors and will have to address them!).
When the process has completed successfully, you’ll see a new export.txt file
copied out to the ADLS instance. It will have overwritten the “dummy” sample file
that you created earlier in this section. If you want to take snapshots of data and
ensure that files are not overwritten each time the process runs, you can modify
the file name expression in the Connection tab. Follow these steps to make the file
name dynamic (you can apply this approach to dynamic scripting in a variety of
locations in pipeline development):
1. Open the AzureDataLakeStore data set that you created to connect to ADLS
from your ADF pipeline.
2. Click on the file name in the File path property. A link will appear below it.
Click that link to open a new screen where you can enter dynamic content.
3. In the Add Dynamic Content screen, you'll have the opportunity to build out
a script that will allow for a dynamic file name. In this case, we'll create a
concatenation script that sets the file name and appends a date stamp to it.
Figure 2.23 shows this scripted out with only the seconds included in the
date, but you can figure out the broader pattern to include year, month, day,
hour, and minute (a sketch of that broader pattern follows this list).

Figure 2.23: A script for dynamic file naming out of the pipeline sink

4. Once complete, save the pipeline and run it. Assuming the code was typed
in accurately, you’ll see a new file posted to your ADLS directory with the
dynamic file name.
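
As referenced in step 3, a fuller version of the expression might look like the following sketch, which uses the concat, formatDateTime, and utcnow functions of the ADF expression language to stamp year through seconds onto the file name (the exact expression used for Figure 2.23 only includes the seconds):

@concat('export_', formatDateTime(utcnow(), 'yyyyMMddHHmmss'), '.txt')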

Combining into a Single ADF Pipeline

Currently, you have two separate pipelines: the Copy Data pipeline created in
Chapter 1, which copies information from the local SQL instance into Azure SQL,
and the second Copy Data pipeline from this chapter, which copies information into
the ADLS instance. You can combine these into a single pipeline, if you want. To
do this, open one of the pipelines and right click the Copy Data box and select
Copy. Next, go to the other pipeline and click Paste. Rename the box and then
drag the arrow from the left-hand box to the box on the right. Click the Validate
button to make sure no errors were introduced, and then click Debug. You now
have a single process that will manage all the movement of data (see Figure 2.24).

Figure 2.24: Combining both pipelines into a single pipeline process with two steps

Data Lake Analytics

By using Azure Data Lake Analytics (ADLA), you can query data stored in your
ADLS instance in a (somewhat) similar way to how you would work with a tra-
ditional SQL database. This is a separate resource that’s created in your Azure
Portal and one which lets you build out queries (both read and write) using
U-SQL. U-SQL allows you to query across data sources and data types, combine
them into a single result set, and pull that result set into a single output. Though
there is a learning curve to U-SQL, the ability to query across data types in your
ADLS instance and other sources is of immense value and well worth the ramp-up
on a new language. In this section, to demonstrate how to use an ADLA resource,
we’ll look at querying data from two sources and compiling them into a single
output.
In the rest of this section, we'll take a quick look at how to set up a new
analytics resource and build out some basic queries against the data that we've
uploaded to the ADLS instance. To begin, click on +Create a resource in the main
navigation bar of your Azure portal. Select Analytics and then click on Data Lake
Analytics, as shown in Figure 2.25.

Figure 2.25: Creating a new ADLA instance

In the screen that opens, you’ll be able to configure the base settings for the ADLA
resource. Enter a descriptive name (it must be unique across all ADLAs that are
created, not just your own!). When setting up your ADLA, you'll fill out the standard
options for the other properties that you've seen elsewhere. You'll also have
to select the ADLS instance you’ll be pointing this ADLA resource to. Figure 2.26
shows a configured ADLA resource using the ADLS created earlier in this chapter.
Once complete, click the Create button.

Figure 2.26: Core configuration of the ADLA resource pointing to an ADLS instance

The deployment of the new resource will take a few moments. Once deployed,
you’ll see a new notification and will be able to navigate to the new resource. With
the new ADLA resource opened, you can now write a query against your ADLS
instance. Listing 2.3 shows an example of a U-SQL query. The code extracts data
from the CSV files that were uploaded earlier in this chapter, performs a filter on
them, and writes them out to a target file. To code and run this, follow these steps:
1. Open your data analytics resource and click on the +New job button (see
Figure 2.27).

Figure 2.27: Create a new job to code U-SQL

2. In the New job screen, enter the code from Listing 2.3 and click Submit (see
Figure 2.28).

Figure 2.28: Submitting the job



Listing 2.3: First U-SQL Query to pull data from CSV in ADLS directory

//Read from CSV file stored in ADLS directory
@sales =
    EXTRACT [Customer ID] string
           ,[Sale Date] string
           ,[Sale Amount] string
    FROM @"/SalesData/Sales.csv"
    USING Extractors.Csv();

// refine search to get only subset of data
@result =
    SELECT *
    FROM @sales
    WHERE [Customer ID] == "1";

//Write result to file
OUTPUT @result
TO @"/SalesExports/sales_output.txt"
USING Outputters.Tsv();

3. If execution of the script succeeds, you’ll see a flow like that shown in Figure
2.29. You’ll see a variety of information that pertains to your query, including
time to query and size of data. This will provide great insight into where your
data lies and how long it will take to get at various information when dealing
with big data and endless files.
4. If the script fails, you’ll see extensive error information in the output window,
which will hopefully (!) give you enough information to fix your U-SQL. To
work with the query and retest until it works as you would expect, click on
the Reuse script button on the top of the screen.

Figure 2.29: Executing the job not only performs the coded actions but also shows a flow with
metrics

5. The output file will go to the ADLS instance, into a new folder called
SalesExports (shown in Figure 2.30). The U-SQL script indicates the output folder
and file name; if the folder does not exist, the script will create it.

Figure 2.30: The output from the U-SQL script execution



There are some robust, easily accessible tools for monitoring your ADLA jobs.
Click the Job management tab in the ADLA navigation and you'll see a history of
jobs that have already run. Clicking on a job will give you detailed information
about the run. Figure 2.31 shows the history, along with a breakout of one run that
had errors. You can click on any of these jobs, whether they succeeded or failed,
and rerun, debug, analyze, or perform other actions on them.

Figure 2.31: The job history in the Job management tab, including a breakout of a run with errors

You can expand your U-SQL to include querying data from your Azure SQL
instance and other sources. For example, in the code from Listing 2.3 above, you
could add a second result set to pull Sales information from a sales table in Azure
SQL and then include it in the results that are written out to the output file.
To do this, you would first need to set up a new connection to your Azure SQL
instance using U-SQL, Visual Studio, or PowerShell. At that point, you would be
able to include queries against it in the U-SQL code.
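
As a rough sketch of what that federated query can look like once a data source has been registered in the ADLA catalog (the data source name AzureSqlDbSource below is hypothetical, and the CREATE DATA SOURCE and credential setup must already have been done with your own connection details):

// assumes a data source named AzureSqlDbSource was already registered
// against the Azure SQL database (via CREATE DATA SOURCE ... FROM AZURESQLDB
// together with a catalog credential); the table is the Sales table from Chapter 1
@sqlSales =
    SELECT *
    FROM EXTERNAL AzureSqlDbSource LOCATION "dbo.Sales";

The @sqlSales rowset can then be combined with the rowsets extracted from ADLS files and written to the same output.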
As you work with your ADLA queries, you'll come to realize that while there is
immense value in being able to query across entities and data types, there are also
some real challenges and limitations in structuring your queries. For example, in
the exercise above, one of the fields is "Sale Amount," which is really a float. Ideally,
we could extract that value directly as a float, but the first row in the CSV file contains
column headers, and these are included in the query. So, for this exercise we must treat
everything as a string and work within U-SQL to convert the data to the specific
data types.
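
One way around this is to skip the header row at extraction time so that typed columns can be read directly. A minimal sketch, assuming the built-in Csv extractor's skipFirstNRows parameter is available in your ADLA runtime and that the amounts in the file parse cleanly as floats:

// skip the header row so the Sale Amount column can be extracted as a float
@sales =
    EXTRACT [Customer ID] string
           ,[Sale Date] string
           ,[Sale Amount] float
    FROM @"/SalesData/Sales.csv"
    USING Extractors.Csv(skipFirstNRows : 1);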
Querying directly against your data lake can lead to great frustration when
dealing with huge data sets of differing cleanliness. The logical thing
to do is extract clean data from your various data sources (Azure SQL, ADLS, and
other locations) and load it into an Azure Data Warehouse, where true analytics
and reporting can take place. We'll spend the next chapter looking at how to
get your data into a data warehouse.

Summary

You’ve worked through building an Azure Data Lake Server and you’ve populated
it with CSV and TXT data both manually and via an ADF pipeline. You’ve looked
at creating a connection to ADLS from an Azure pipeline and understand what
it takes to set up a Git repository to synchronize code you’ve built in ADF. Addi-
tionally, several patterns for working with larger data sets have been outlined, as
has working with dynamic settings in a pipeline. With the components that have
been built out in chapters 1 and 2, you now have a framework for pushing data to
a data warehouse and building views into that data (which will be covered in the
next chapter!).
Chapter 3
Azure Data Warehouse and Data Integration
using ADF or External Tables
In the first two chapters you've worked through building out a variety of components.
Each of these components can stand on its own, and the data can be
queried and reported on. However, to perform complex analysis of data from a
variety of disparate sources, especially involving historical data, you'll need to
load all of this data into a data warehouse. An Azure Data Warehouse (ADW)
allows for storage of data as well as the dimensional modeling of that data, which
utilizes facts and dimensions.
In this chapter we’ll build out a simple data model in an ADW instance and
then use an ADF pipeline to populate it with information from the sources that
have been developed earlier in this book. We’ll also look at building an external
table that pulls data directly into ADW from an ADLS file without the use of an
ADF integration. The basic data flows that will be developed in this chapter are
shown in Figure 3.1.

(Figure 3.1 diagram: on the on-premise machine, the local SQL Server connects through the Integration Runtime and an ADF pipeline to Azure Data Warehouse staging tables; a stored procedure then loads the dim/fact tables, while ADLS feeds the ADW directly through an external table.)
Figure 3.1: Components and flow of data being built in this chapter

DOI 10.1515/9781547401277-003

Creating an ADW Instance

The creation of an ADW instance is done through the Azure portal by clicking
on the SQL data warehouses option on the main navigation screen. Click the
+Add button and enter information in the configuration screen that opens. This
information is similar to what you would enter for a new Azure SQL instance,
like you created in Chapter 1. Figure 3.2 shows the configuration for the ADW that
we’ll be referring to in this chapter. Note the Performance level property, which
will be explained in more detail in the next section. For now, choose the Gen1:
DW100 option.

Figure 3.2: Configuring the ADW and setting the Performance level property

When done configuring this screen, click the Create button. Azure will work to
deploy the new resource, and when completed you'll get a notification. Note that
a new ADW will take longer to create than the other resources you have provisioned
so far in this book. Once the notification shows that the deployment
has succeeded, click the Refresh button in the main toolbar to see the new data
warehouse. Figure 3.3 shows the newly created ADW resource.

Figure 3.3: The newly created ADW

Your ADW is now available and running, which means you are paying for it by
the hour! Pay close attention to the next section so that you don’t get surprised
with a large Azure bill. Before you read the next section, click on your new ADW
instance and click the Pause button on the toolbar (refer to Figure 3.4). When
you’re ready to work with your ADW instance, you can start it back up (the Pause
button will turn into a Resume button). Of course, once you pause your database,
you won’t be able to connect to it (we’ll look at connecting to it shortly).

Figure 3.4: Make sure to pause your ADW when you aren't using it, as you'll be paying for it by the
hour!

ADW Performance and Pricing

One important aspect of Azure data warehousing is related to performance and
costs. Data warehouses can get expensive very quickly, as they are billed based on
an hourly model. During development, you'll want to shut off your ADW instance
when you are not working with it, and only turn it on when you are actively developing.
As you can see in Figure 3.5, the pricing options are very clear; these can
be set at the time you create your ADW or by clicking the Scale button on the main
ADW overview tab.

Figure 3.5: Pricing models

Increasing storage and performance has a direct impact on the hourly cost. There
are several tiers of pricing that span across two top level options (Gen1 and Gen2)
and a dozen tiers within those options. These tiers are referred to with a "DW"
prefix followed by a number (DW100, DW400, and so on). For this exercise we are creating a
tiny development DW, so we’ll choose the cheapest option, which is DW100. You
can rescale this once the DW has been created if you want to expand capacity.
Note that your options may be limited based on your region. For example, some
regions may have DW400 as the lowest option.
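
If you prefer to rescale from a SQL client rather than the portal, the service objective can also be changed with T-SQL run against the master database of your Azure SQL server. A minimal sketch (the database name InotekDW is a placeholder for your own ADW):

-- run against the master database on your Azure SQL server;
-- InotekDW is a placeholder name for your data warehouse
ALTER DATABASE [InotekDW] MODIFY (SERVICE_OBJECTIVE = 'DW200');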
When your ADW instance is being used in a production setting, you may want
to set up a process that automatically turns on and off your ADW instance. For
example, you may decide you don’t need your data warehouse available during
weekend hours, so you could create a script to pause it from 6:00 pm on Friday
to 8:00 am on Monday. Essentially, you’ll create a new automated scheduled task
(usually through a PowerShell script) that can perform the pause and the restart
of the ADW on a timed basis. You can create a script via the Automation script tab
on your main ADW menu. As you can see in Figure 3.6, this is coding intensive
and requires a lot of work and experimentation. We won’t build a script in this
book, but several starting points for this can be found on the web.
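
While we won't build the full scheduled script here, the core pause and resume commands themselves are straightforward. A minimal sketch, assuming the Az PowerShell module is installed and you're already signed in with Connect-AzAccount (older AzureRM installs use Suspend-AzureRmSqlDatabase and Resume-AzureRmSqlDatabase instead); the resource group, server, and database names below are placeholders:

# pause the data warehouse (e.g., Friday evening)
Suspend-AzSqlDatabase -ResourceGroupName "rg-demo" -ServerName "inotekserver" -DatabaseName "InotekDW"

# resume it again (e.g., Monday morning)
Resume-AzSqlDatabase -ResourceGroupName "rg-demo" -ServerName "inotekserver" -DatabaseName "InotekDW"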

Figure 3.6: Options are available for building a script to enable/pause your ADW instance to
reduce billing costs

Connecting to Your Data Warehouse

Now that you understand pricing and how your ADW will cost you by the hour,
we’ll proceed with development! If you paused your ADW instance after creating
it, you’ll need to make sure to start it back up by clicking the Resume button (see
Figure 3.7). Remember to manually pause it again after you’re done working with
it for the day.

Figure 3.7: Resuming your ADW

With the ADW instance active, you’ll be able to interact with it in several ways.
You’ll want to connect to it through a local SQL client tool like SQL Server Manage-
ment Studio (SSMS). If you added your ADW to the same Azure database server
that you created in Chapter 1, you should see the ADW instance appear under the
available Databases when you connect to that server through SSMS (see Figure
3.8). If you created your ADW on a different Azure server, you'll need to enter that
server's connection information.

Figure 3.8: Connecting to ADW through SSMS

If you’re on a computer that doesn’t have SSMS (or something similar) installed,
then you can use a simple query editor within Azure to work with your ADW. Do
this by clicking on the Query editor button on your overview tab and then connect
to your instance (refer to Figure 3.9). Once opened, you’ll have basic SQL func-
tionality and access to your ADW components (tables, views, etc.).
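
For example, a quick smoke test once you're connected might be to list the tables that exist in the warehouse:

-- confirm connectivity and see which tables exist in the ADW
SELECT name, create_date
FROM sys.tables
ORDER BY name;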

Figure 3.9: Using the Azure portal’s Query editor

You can also connect to your ADW through Visual Studio. The easiest way to do
this is by clicking the Open in Visual Studio option on the main ADW navigation
menu. From here, a screen with a large button of the same name will open (see
Figure 3.10). When you click it, your local instance of Visual Studio should open.
If you don't have Visual Studio installed, you can download it from the link on
that page and install it on your own.

Figure 3.10: Opening your ADW through Visual Studio

Visual Studio should default to the SQL Server Object Explorer view shown in
Figure 3.11. If it doesn’t open, click on the View menu option in the toolbar and
select it from there. You’ll then be able to navigate through your database com-
ponents in much the same way as you would from the SQL Server Management
Studio.

Figure 3.11: The SQL Server Object Explorer window in Visual Studio

Modeling a Very Simple Data Warehouse

This chapter is going to keep the modeling of a data warehouse simple. The focus
and purpose of this book is to show you how to set up Azure based components
and move data between them. It’s not intended to show you best practices around
modeling the ADW objects themselves! However, before we begin with the simple
table structure that will be used for this exercise, several high-level notes about
data modeling need to be explained:

1. Moving data from an external source to a data warehouse can be done using
external tables. In this way, you can define data sources, credentials, and
table structures that allow you to reference external tables just like you
would internal tables. For example, if you wanted to point to a specific ADLS
file in a directory, you could create an external table in your ADW that points
to this location. The use of external tables is essential in large-scale data
warehouse modeling. You can create tables directly from the source
data using a CREATE TABLE AS SELECT (CTAS) statement (see the sketch after
this list), or populate them using other processes as well.
2. You’ll likely want to stage your data before loading it into your final tables.
The raw data in the external tables would be “cleansed” and loaded into your
staging tables. From there, you can write processes to load your final tables,
which will be used to build reports and analytics. For this exercise, we’ll
create staging tables, but we’ll skip the external table discussion.
3. You’ll create dimensions and fact tables, which are referred to as star schemas.
From these, you’ll be able to model data as it changes over time. Also, you’ll
be able to do deep analysis that can’t be done with “point in time” data that
is stored in a relational database.
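
As referenced in the first note above, here is a hedged sketch of a CTAS statement. It assumes an external table named [Sales] (like the one built later in this chapter) already exists; the internal table name SalesFromLake is just an example:

-- materialize the external data as a regular internal table
CREATE TABLE [dbo].[SalesFromLake]
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT [CustomerID], [SaleDate], [AmountOfSale]
FROM [Sales];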

We’ll create two staging tables and several dimension tables with a single fact
table. Eventually, we’ll create external tables as well. The flow will look like that
in Figure 3.12 in a fully built out flow.

(Figure 3.12 diagram: ADLS feeds an external table directly, while ADLS and the local SQL Server also flow through ADF into staging tables; a stored procedure then loads the dim/fact tables.)
Figure 3.12: The various paths we’re looking at to load data into the ADW

Let’s begin by creating the staging tables. These tables are almost exact copies
of those you’ve created before. They are where the data from the source tables
(including the Azure SQL instance and the file dump in the ADLS instance) will
be placed. Once the data is placed there, an ADF pipeline component will call a
stored procedure in the ADW to extract the data from these staging
tables. The data will then be loaded into the final fact and dimension tables in a
star schema. The staging tables are shown in Listing 3.1.

Listing 3.1: Data warehouse staging tables

CREATE TABLE [dbo].[StageCustomers]
(
    [CustomerID] [nchar](10) NOT NULL,
    [LastName] [varchar](50) NULL,
    [FirstName] [varchar](50) NULL,
    [Birthday] [date] NULL,
    [CreatedOn] [datetime] NULL,
    [ModifiedOn] [datetime] NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
)
GO

CREATE TABLE [dbo].[StageSales]
(
    [ID] [nchar](10) NOT NULL,
    [CustomerID] [nchar](10) NOT NULL,
    [DateOfSale] [date] NOT NULL,
    [AmountOfSale] [money] NOT NULL,
    [ItemPurchased] [varchar](50) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
)
GO

For the final “production” tables in the ADW, we’ll create a simple star schema,
which is based on the sales and customer data records we’ve used in illustra-
tions earlier in this book. We’ll create three “dimension” tables (called Customer,
Product, and Date) and one "fact" table (called Sales). See Figure 3.13 for the star
schema. The scripts for these tables are shown in Listing 3.2 (you can run these
scripts from within SSMS on your local machine while connected to the ADW
instance).

(Figure 3.13 diagram: FactSale sits at the center, joined to DimCustomer, DimDate, and DimProduct.)
Figure 3.13: The simple star schema we are building on the ADW

Listing 3.2: Data warehouse dimension and fact tables

CREATE TABLE [dbo].[DimCustomer]
(
    [CustomerID] [varchar](30) NOT NULL,
    [FirstName] [varchar](30) NULL,
    [LastName] [varchar](30) NULL,
    [DateOfBirth] [date] NULL,
    [IsActive] [bit] NULL,
    [ExtractDateTime] [datetime] NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
GO

CREATE TABLE [dbo].[DimDate]
(
    [Date] [date] NOT NULL,
    [IsActive] [bit] NULL,
    [ExtractDateTime] [datetime] NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
GO

CREATE TABLE [dbo].[DimProduct]
(
    [ProductName] [varchar](30) NOT NULL,
    [IsActive] [bit] NULL,
    [ExtractDateTime] [datetime] NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
GO

CREATE TABLE [dbo].[FactSale]
(
    [Amount] [numeric](28, 9) NOT NULL,
    [CustomerKey] [varchar](30) NOT NULL,
    [DateKey] [date] NOT NULL,
    [ProductKey] [varchar](30) NOT NULL,
    [ExtractDateTime] [datetime] NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
GO

With the creation of these four tables you now have a data warehouse that can be
populated with data and eventually reported on. In the next section we’ll look at
putting records into these tables by pulling data from various sources. The final
data model is shown in Figure 3.14.

Figure 3.14: Staging, dimension, and fact tables

Load Data Using ADF

There are a variety of ways to get data loaded into your ADW. We’ll continue with
our process of using an ADF pipeline to move data, as it is one of the easiest and
most versatile ways available. You’ve created several pipelines in the previous
chapters. This pipeline will also use the Copy Data process, and it will introduce
a stored procedure. To begin, let's create a baseline pipeline
using the following steps:
1. Open the ADF instance that you’ve been working with in chapters 1 and 2 and
add a new pipeline. Name the pipeline LoadADW.
2. Next, drop a Copy Data shape onto the design page. Name this Copy_To_
Stage_Cust.
3. Set the Source to the Customers table on your local SQL instance that was
created in Chapter 1 (shown in Figure 3.15).

Figure 3.15: Setting the source to the customers table on the local SQL instance

4. Set up the Sink to connect to the ADW instance. Click the New button to select
a data set and click on the Azure SQL Data Warehouse option.
5. In the New Linked Service window that opens (see Figure 3.16), configure
your connection to your ADW server. The only feature that will be new is that
you’ll have several authentication options. We’ll use SQL authentication for
this exercise. Select that option in the Authentication type property drop-
down and enter in the credentials that you used at the very beginning of
Chapter 1 (the username is “serveradmin”).

Figure 3.16: Creating a new linked service to the ADW

6. Figure 3.17 shows the server admin account in the SSMS explorer, along with
the SQL script used to create it. The GUI functions that you have with local
databases are not available with Azure databases accessed via SSMS. If you
want to create a new SQL user in your ADW, you’ll have to use scripting. Once
the login has been created, add it to the User name property in the Linked
Service configuration along with the password and then click the Test con-
nection button to make sure you can connect. When validated, click the
Finish button. This will create a new resource.

Figure 3.17: The serveradmin login shown from the security tab within the ADW database

7. Back on the Connection tab in your Copy Data pipeline component, select the
table that you are targeting. In this case, it is the StageCustomers table (see
Figure 3.18).

Figure 3.18: Setting the target (Sink) connection

8. The source and sink have been declared. Now you can click on the Mapping
tab to define the column mappings. The mappings will look like what is
shown in Figure 3.19. You’ll need to click the Import Schemas button before
the mapping will show.

Figure 3.19: With the source and sink defined, mapping can take place

9. With this shape complete, click the Validate button on the toolbar. You may
see an error that looks like that shown in Figure 3.20. This can be remedied by
clicking the Enable Staging checkbox on the Settings tab of the shape. You’ll
have to create a new connection, which should be configured like that shown
next to the error in the same figure.

Figure 3.20: Addressing a potential error after pipeline validation

10. Next, you’ll follow similar steps for a second Copy Data shape. This one will
copy data from the Sales file in the ADLS (created in Chapter 2) and push it
into the ADW’s Sales staging table. Name this shape Copy_To_Stage_Sales.
11. Set the source connection to the existing ADLS linked service and point it to
the SalesData folder and the export.txt file (see Figure 3.21).

Figure 3.21: Configuring the source to point to a specific file in ADLS

12. Click the Sink tab and create a new connection to the ADW Sales staging
table.
13. Via the Mapping tab, you’ll need to map the columns from the ADLS file to the
Staging Sales table. The column names, as you can see in Figure 3.22, are not
well named in the source. In order to see what these are, click on the Source
tab and preview the data.

Figure 3.22: Column mapping to a CSV may require referencing the data in preview to see
column names

14. You can now map the columns appropriately. For example, Prop_1, which is
the customer ID, can be mapped to CustomerID (see Figure 3.23). One column
should not be mapped: the identity field in the target table. If need be,
you can leave fields empty or add dynamic content. Note that depending on
the CSV format, this mapping may occur automatically.

Figure 3.23: Fields from CSV to target table are now mapped

15. Return to the main pipeline development area. Link the two Copy shapes so
that they look like what is shown in Figure 3.24.

Figure 3.24: The pipeline with the two copy data processes (one from local SQL, the other from
ADLS)

16. Click on the Validate button to test the full pipeline. The pipeline should validate.
17. Save everything you've been working on.

We’ll pause here for a moment with pipeline development to discuss mapping
options. For simple scenarios like our exercise, the mapping tool is fine. Columns
that map one to one or need minor modifications can be handled through the
little mapping tool in the pipeline. However, for true control over how you map
your data, you’ll want to shell out to a stored procedure on the database side.
Dealing with data at the native database level is good practice and allows you to
easily test and update the process. Integration architects have built complex mappings
at the integration tier for years; in many cases, handling that work at the database
level could have saved tremendous amounts of time and energy.
To illustrate how to do basic mapping at the SQL level, we’ll call a stored
procedure to map the data from the Staging tables into the final fact and dimen-
sion tables in the ADW. First, we’ll write code to pull the data from the Customer
staging table and insert it into the Customer dimension table. Next, we’ll do the
same for Product, and lastly, for Date. Once these three tables have been popu-
lated with data, we can execute the code to populate the fact table and be done.
We’ll use a single stored procedure that will be called from the ADW.
To ease the development of your SQL, go ahead and run your pipeline (you
can click the Debug button to do this). This will populate the staging tables in the
ADW with data so that you can work with the mapping in SQL more easily.
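
For example, a quick check (run from SSMS against the ADW) confirms that the debug run left rows to develop the mapping SQL against:

-- verify the staging tables were populated by the debug run
SELECT TOP 10 * FROM [dbo].[StageCustomers];
SELECT TOP 10 * FROM [dbo].[StageSales];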
To create a new stored procedure in your ADW instance, right click Stored
Procedures in your SSMS tree view and select New and then Stored Procedure
(see Figure 3.25). This will create a new script window for you to write your SQL.
Enter the code that is in Listing 3.3 to create a procedure that takes the first step of
loading data into the Customer dimension table.

Figure 3.25: Creating a new stored procedure on your ADW



Listing 3.3: Stored Procedure on Data Warehouse to Load from Stage to Dims/Fact

CREATE PROCEDURE LoadDataFromStaging
AS
BEGIN
    SET NOCOUNT ON

    DECLARE @currentDate datetime
    SELECT @currentDate = GETDATE()

    -- when loading into the customer dimension, we want only unique customer
    -- info, so check to make sure the data isn't already in the target
    INSERT INTO DimCustomer
    SELECT DISTINCT CustomerID
        ,FirstName
        ,LastName
        ,Birthday
        ,1 -- active flag
        ,@currentDate
    FROM StageCustomers
    WHERE CustomerID NOT IN (SELECT CustomerID FROM DimCustomer)

    -- only load unique dates
    INSERT INTO DimDate
    SELECT DISTINCT DateOfSale
        ,1
        ,@currentDate
    FROM StageSales
    WHERE DateOfSale NOT IN (SELECT [Date] FROM DimDate)

    -- again, we only want unique products
    INSERT INTO DimProduct
    SELECT DISTINCT ItemPurchased
        ,1
        ,@currentDate
    FROM StageSales
    WHERE ItemPurchased NOT IN (SELECT ProductName FROM DimProduct)

    -- we will allow any sale information to go into the sales table
    -- as the source is from the ADLS instance and we want all data
    -- we'll use reporting to filter out duplicates, if needed
    INSERT INTO FactSale
    SELECT AmountOfSale
        ,CustomerID
        ,DateOfSale
        ,ItemPurchased
        ,@currentDate
    FROM StageSales

    -- clear out the staging tables for next run
    TRUNCATE TABLE StageCustomers
    TRUNCATE TABLE StageSales
END
GO
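
Before wiring the procedure into the pipeline, you may want to exercise it manually from SSMS; a minimal check (assuming the staging tables were populated by the earlier debug run) might look like this:

-- run the load manually, then inspect the dimension and fact tables
EXEC [dbo].[LoadDataFromStaging];

SELECT * FROM [dbo].[DimCustomer];
SELECT * FROM [dbo].[FactSale];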

We’ll now add a call to this stored procedure from the ADF pipeline so that each
time the process runs, the procedure gets called. The flow in the pipeline will be
as follows:
1. Copy data to the stage customer table.
2. Copy data to the stage sales table.
3. Call the stored procedure, which will load the dimension and fact table and
then clear out the stage tables.

To call the stored procedure from your pipeline, expand the General tab in the
pipeline toolbox and drag the “Stored Procedure” shape onto the design surface,
dropping it to the right of the Copy_To_Stage_Sales box. Rename this to “Call_
Stored_Proc.” On the SQL Account tab, set the linked service to the service that
you’ve set up for the connection to ADW. On the Stored Procedure tab, select the
LoadDataFromStaging procedure that you’ve created (Listing 3.3 above). Figure
3.26 shows the configuration for this stage of the pipeline.

Figure 3.26: Calling a stored procedure from the pipeline

Run the process again using the Debug button in the pipeline. The result should
be that the data has loaded into the target data warehouse tables. You can see in
the output of the pipeline run process the total amount of time each step takes
and whether the step succeeded or failed. Figure 3.27 shows the output from the
current exercise.

Figure 3.27: Monitoring the status of the pipeline steps



You can also monitor the pipeline activities that are running (or have run) by
clicking the Overview tab and scrolling through the dashboard controls that are
available (see Figure 3.28).

Figure 3.28: A view into the pipeline activities

Using External Tables

The ability to use external tables is a key feature of an Azure data warehouse.
You can create an external table that points to any queryable external data source,
allowing you to bring data into your data warehouse and query it like any other
standard SQL table. By reading that data source directly into the ADW through an
external table, you don't need to use an ADF pipeline or any other outside tool to
load data.
For ADW to connect to ADLS, you must create several items. These include a
key, credential, and data source on the ADW server itself, as well as an Azure App
via the Azure Portal. The first step you need to take is to create the Azure App. By
doing so, you’ll gain access to several data points that must be available in the
credential within the ADW database in order to connect. The key components
needed for an external table over ADLS are shown in Figure 3.29.

(Figure 3.29 diagram: on the ADW side, the external table depends on a master key, a database-scoped credential, a data source, and a file format; the credential authenticates via the Azure App, which has permissions on the ADLS directory.)

Figure 3.29: The components needed for authentication and querying of external data in ADLS
from ADW

To begin, let’s create the Azure App. This can be done by clicking on Azure Active
Directory in the main Azure portal and then clicking on the App registrations
option. You can now click +New application registration from the toolbar.

In the properties that appear, type in a descriptive name for your app (we will call
it inotekadlstoadw for this exercise), leave the Application type property set to
Web app / API, and then enter a valid URL (this URL can be anything; it isn't
actually used for ADW/ADLS integration). Click
the Create button when done. Once the app has been created, you'll be able to navigate to your
new app registration from the main Azure page (see Figure 3.30).

Figure 3.30: A new app registration, which is used for connecting ADW to ADLS

There are three unique identifiers that you’re going to need to capture from this
app (or from within Azure) in order to create a valid SQL data source component
in ADW. These include an Application ID, a Key, and an OAuth string. You can
capture the Application ID by clicking on your new App registration and grabbing
it from the overview window that opens (see Figure 3.31).

Figure 3.31: Accessing the Application ID from the new application registration

Next, you’ll need to create a new key. Click on Settings on the registered app
window and then select Keys from the Settings menu. In the Description field,
type a key name (we’ll call it “Example” for now) and then set the expiration (for
this exercise we’ll set it to never expire). Once you save it, you’ll see a key value
generated. You need to copy the key value now because once you leave this page,
you’ll never see it again (see Figure 3.32). This is the Key value that will be used
shortly in an Azure SQL script.

Figure 3.32: Creating a new Key in the app registration

Click on the Properties tab on the main Azure Active Directory navigation bar to
get the last of the three unique identifiers (the OAuth information), and copy the
value from the Directory ID (see Figure 3.33). This will be combined with the string
shown below. The final list of IDs that you’ll want is as follows (you can refer back
here for when you’re building your SQL script later in this section):
1. Application ID (from the App Registration): example value is f2e499a9-ca67-
4855-b027-4adce08ba17d
2. Key value (from the Keys within the settings of the App Registration): example
value is mPO0Q41MqsVPrqOYH04UZBfOF3LNmVqi2ZAtj8GbNf0=
3. OAuth value (a string combined with the ID from the Directory ID field under
Properties in Azure Active Directory): example value is OAUTH = https://login.
windows.net/d61c4b5c-858b-4554-821b-2a29fc79003f/oauth2/token

Figure 3.33: The Directory ID makes up part of the OAuth entry

To prepare for the rest of this walk-through, log into your ADLS instance and
create a new directory called Exports. Place the Sales.csv file we worked with
earlier in this book into that directory. This will allow us to easily reference it in
our connection from ADW. The external table in ADW that will be created shortly
will point to this directory and will consume any data within it. Figure 3.34 shows
this folder with the file being previewed.

Figure 3.34: Loading a file to be used as the source of the external table

When the file and folder are created, click on the top level of your folder struc-
ture and click the Access button. You should see the new Application Registra-
tion you’ve just created in the list of available users. Click on this user (inotekadl-
stoadw) and give it full permissions to read, write, and execute in the child
directories. You’ll now have permissions to read from this directory when creating
the external table. Refer to Figure 3.35 to see this user given access.

Figure 3.35: Giving permissions to the app registration to read from the new directory

You can now create a number of items in your ADW database that will allow you
to connect to the ADLS instance. You’ll need to create a Master Key, an External
File Format, a Credential, and a Data source. When the four connectivity objects
in Listing 3.4 are created, you’ll be able to build an External table to pull data from
the ADLS instance.

Listing 3.4: Create ADW components to allow connecting to ADLS from External table

CREATE MASTER KEY
GO

-- the external file that we will pull from is the CSV,
-- so file format is declared as follows
CREATE EXTERNAL FILE FORMAT [demofileformat]
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',', USE_TYPE_DEFAULT = True
    )
)
GO

-- Create a new credential, which uses the three keys that we have created
-- the first (starts with "f2e4") is the Application ID
-- the second (which is after the @ sign following the App ID) is the OAuth value
-- the third (the SECRET) is the value from the Key on the App registration
CREATE DATABASE SCOPED CREDENTIAL AppCred
WITH IDENTITY = 'f2e499a9-ca67-4855-b027-4adce08ba17d@https://login.windows.net/d61c4b5c-858b-4554-821b-2a29fc79003f/oauth2/token',
SECRET = 'mPO0Q41MqsVPrqOYH04UZBfOF3LNmVqi2ZAtj8GbNf0=';
GO

-- the final item is the data source
-- this one connects to the ADLS instance
CREATE EXTERNAL DATA SOURCE [demodatalakesource]
WITH (
    TYPE = HADOOP,
    LOCATION = N'adl://inotekdemodatalake.azuredatalakestore.net',
    CREDENTIAL = AppCred)
GO

Finally, you can create your external table! The external table will declare the
fields being queried (which should match the fields in the CSV file you uploaded
earlier). It will indicate what data source to use and what file format to expect.
Creating the table will cause the data to populate within it if it runs successfully.
We’ll declare all the field types as strings so that no restrictions are placed
on what gets loaded into the table. When dealing with large data sets, there is
often data that doesn’t match a specific field type. You can clean the data before
loading it into the external table or you can let the external table be an exact rep-
resentation of what is in the source and do the cleanup while loading into staging
tables (or elsewhere). Listing 3.5 shows the code to create the external table.

Listing 3.5: The External Table

CREATE EXTERNAL TABLE [Sales](
    [CustomerID] [nvarchar](20) NOT NULL,
    [SaleDate] [nvarchar](20) NOT NULL,
    [AmountOfSale] [nvarchar](20) NOT NULL
)
WITH (LOCATION='/Exports/',
    DATA_SOURCE = demodatalakesource, -- points to the data source created earlier
    FILE_FORMAT = demofileformat,     -- points to the file format created earlier
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);

Executing this code will cause the data to load from the CSV into the table. You
can query this table just like any other table on your database, but you cannot do
other operations (like writing). As you can see in Figure 3.36, the load included
the header information from the CSV. If you wanted to eliminate this row from a
standard SQL table, you could issue a DELETE SQL command. However, you can
see that this causes an error stating it isn’t a supported operation. You’ll need to
eliminate this data when you load the external data into your staging tables.
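
A hedged sketch of that filtering step (it assumes the header row carries the literal column names used earlier in this book, so matching on 'Customer ID' identifies it):

-- copy rows from the external table while skipping the CSV header row
-- and casting the string columns to proper types
SELECT [CustomerID],
       CAST([SaleDate] AS date) AS DateOfSale,
       CAST([AmountOfSale] AS money) AS AmountOfSale
FROM [Sales]
WHERE [CustomerID] <> 'Customer ID';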

Figure 3.36: Querying the external table is possible, but write operations are not

That’s it for using external tables! If you want to automate things, you could pop-
ulate these tables and do the cleansing of the data for staging tables by calling
procedures that you write through a custom ADF pipeline. At this point, you
should have all the tools you need to create that pipeline and populate your ADW
tables with data from a variety of sources!

Summary

This chapter covered the basics of setting up an Azure Data Warehouse and pop-
ulating it with data from several sources. You looked at creating a simple data
warehouse model, building staging tables, and automating that load using an
ADF pipeline. You also worked through setting up external tables and populating
one through an ADLS connection, without the need for an ADF pipeline. Hope-
fully, you now have a good understanding of the interaction between ADF and
the various storage options in Azure, and a clear idea of how to populate data
in your ADW instance. You should now be able to build your own integrations of
data between local repositories, Azure SQL, Azure Data Lake Servers, and Azure
Data Warehouses.