
Azure Data Factory

Hands On Exercises

Page 1
Table of Contents
Unit 1: Create Azure Data Factory Workspace...............................................................................4
Exercise 1: Creating Data Factory.............................................................................................4
Unit 2: Understanding Key Components.........................................................................................7
Exercise 2: How to create Azure Integration Runtime................................................................7
Exercise 3: How to create a Linked Self-Hosted Integration Runtime......................................14
Exercise 4: How to create Linked Services...............................................................................18
4.1: Linked Service with SAP Tables.........................................................................................19
4.2: Linked Service with SQL Database....................................................................................22
4.3: Linked Service with Azure Data Lake Storage Gen2.........................................................24
Exercise 5: How to create Datasets...........................................................................................27
5.1: Dataset with SAP Table......................................................................................................28
5.2: Dataset with SQL Database...............................................................................................30
5.3: Dataset with Data Lake Storage Gen2...............................................................................32
Exercise 6: Create a pipeline....................................................................................................35
Unit 3: Activities............................................................................................................................38
Exercise 7: How to create Lookup Activities............................................................................38
Exercise 8: How to create Stored Procedure Activity...............................................................39
Exercise 9: How to create Copy Data Activity..........................................................................41
Exercise 10: How to create Azure Function Activities..............................................................44
Exercise 11: How to create Iteration & Conditional Activities................................................46
Unit 4: Triggers, publishing and monitoring.................................................................................48
Exercise 12: How to create Schedule Trigger...........................................................................48
Exercise 13: How to create Tumbling Window Trigger............................................................50
Exercise 14: How to create Storage Event Trigger.....................................................................52
Exercise 15: Publishing of the Activity......................................................................................56
Exercise 16: Debugging & Monitoring a pipeline....................................................................58
Unit 5: Copy multiple tables in bulk.............................................................................................59
Exercise 1: End-to-end workflow..............................................................................................60
Exercise 2: Prerequisites...........................................................................................................60
Exercise 3: Prepare SQL Database and Azure Synapse Analytics (formerly SQL DW)..........60
Exercise 4: Azure services to access SQL server......................................................................61
Exercise 5: Create a data factory..............................................................................................61
Page 2
Exercise 6: Create the source Azure SQL Database linked service..........................................63
Exercise 7: Create the sink Azure Synapse Analytics (formerly SQL DW) linked service.......64
Exercise 8: Create the staging Azure Storage linked service...................................................65
Exercise 9: Create a dataset for source SQL Database............................................................66
Exercise 10: Create a dataset for sink Azure Synapse Analytics (formerly SQL DW).............66
Exercise 11: Create pipelines....................................................................................................69
Exercise 12: Create the pipeline IterateAndCopySQLTables...................................................69
Exercise 13: Create the pipeline GetTableListAndTriggerCopyData......................................74
Exercise 14: Trigger a pipeline run..........................................................................................77
Exercise 15: Monitor the pipeline run......................................................................................77
Unit 6: Incrementally load data from Azure SQL Database to Azure Blob storage.....................79
Exercise 1: Overview.................................................................................................................79
Exercise 2: Prerequisites...........................................................................................................80
Exercise 3: Create a data source table in your SQL database.................................................81
Exercise 4: Create another table in your SQL database to store the high watermark value.......81
Exercise 5: Create a stored procedure in your SQL database..................................................82
Exercise 6: Create a data factory..............................................................................................83
Exercise 7: Create a pipeline....................................................................................................85
Exercise 8: Monitor the pipeline run........................................................................................94
Exercise 9: Review the results...................................................................................................94
Exercise 10: Trigger another pipeline run................................................................................95
Exercise 11: Monitor the second pipeline run..........................................................................95
Exercise 12: Verify the second output.......................................................................................96

Unit 1: Create Azure Data Factory Workspace


Exercise 1: Creating Data Factory
1. In the Azure portal, select Create a resource > Analytics > Data Factory

Page 3
2. On the Create Data Factory page, provide the values required to create a Data Factory workspace.
After filling in all the required details, click Review + create.

After you see ‘Validation Passed’, click on Create.

Page 4
After a few seconds, you will see the message 'Your deployment is complete'. Click 'Go to
resource'. The screen below will appear:

3. Click Launch Studio to open the Data Factory home page. On the left panel you will see options
such as Author, Monitor, and Manage, which we will use in further exercises.
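
The same workspace can also be created programmatically. The sketch below uses the azure-identity and azure-mgmt-datafactory Python packages; the subscription, resource group and factory names are placeholders, and it assumes the resource group already exists.

Python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"        # placeholder
resource_group = "<resource-group-name>"     # placeholder
factory_name = "adf-hgstft2023-<username>"   # placeholder name, mirroring the course naming

# Authenticate and create the management client.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the Data Factory workspace in the chosen region.
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="northcentralus"))
print(factory.name, factory.provisioning_state)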

Page 5
Unit 2: Understanding Key Components
Exercise 2: How to create Azure Integration Runtime

1. On the Azure Data Factory Homepage, select the Manage tab on the left pane.

Page 6
2. Under Connections, select Integration runtimes and click the 'New' button to create an
integration runtime.

3. Select the Azure, Self-Hosted option and click the 'Continue' button on the screen below.

4. Select the Azure option and click the 'Continue' button on the screen below.

Page 7
5. Set the value of Name (give a description as well) and Region as below. Leave the other options as
they are:
Name: ir-hgstft2023-<username>
Region: "North Central US"

Page 8
6. Click Create. After creation, the new integration runtime will appear like this:
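
For reference, the Azure integration runtime created above corresponds roughly to the JSON definition shown in the Data Factory code view; a sketch of that shape is given below as a Python dict (the property names follow the standard integration runtime JSON format and are an assumption here, not copied from the course environment).

Python
azure_ir = {
    "name": "ir-hgstft2023-<username>",
    "properties": {
        "type": "Managed",  # an Azure IR is a managed integration runtime
        "description": "Azure integration runtime for the hands-on exercises",
        "typeProperties": {
            "computeProperties": {"location": "North Central US"}
        },
    },
}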

Page 9
Exercise 3: How to create a Linked Self-Hosted Integration Runtime

1. Under Connections, select Integration runtimes and click the 'New' button to create an integration
runtime.

2. Select the Azure, Self-Hosted option and click the 'Continue' button on the screen below.

3. Select the Linked Self-Hosted option and click the 'Continue' button on the screen below.

Page 10
Need to add more details

Page 11
Exercise 4: How to create Linked Services

4.1: Linked Service with SAP Tables

1. Under Linked Service, provide the values to create a Linked Service to link SAP Tables to
the Data Factory.

2. Under New linked service, select SAP HANA and then select 'Continue'.

Page 12
Need to add more details

4.2: Linked Service with SQL Database

1. Under Linked Service, provide the values to create a Linked Service to link Azure SQL Database
to the Data Factory

2. Under New linked service, select Azure SQL Database and then select 'Continue'.

Page 13
Need to add more details

4.3: Linked Service with Azure Data Lake Storage Gen2

1. Under Linked Service, provide the values to create a Linked Service to link Azure Data Lake
Storage Gen2 to the Data Factory

Page 14
2. Under New linked service, select Azure Data Lake Storage Gen2 and then select 'Continue'.

3. In the New Linked Service pane, fill in the following fields to connect to Azure Data Lake
Storage Gen2.
Name: ls_adls_hgstft2023_<username>
Azure subscription: Microsoft Azure Sponsorship - 2021_22
Storage Account Name: sahgstft2023

Page 15
Then click on ‘Create’.

4. Once you click Create, the new linked service will be created. We can view all the linked services
created under Manage -> Linked services:
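
A rough sketch of the JSON behind the Azure Data Lake Storage Gen2 linked service created above is shown below as a Python dict. The account-key authentication is an assumption; the UI also supports other authentication types such as managed identity.

Python
ls_adls = {
    "name": "ls_adls_hgstft2023_<username>",
    "properties": {
        "type": "AzureBlobFS",  # connector type used for Azure Data Lake Storage Gen2
        "typeProperties": {
            "url": "https://sahgstft2023.dfs.core.windows.net",
            "accountKey": {"type": "SecureString", "value": "<storage-account-key>"},
        },
    },
}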

Page 16
Exercise 5: How to create Datasets

1. Under the Author tab, in Factory Resources, select Dataset to create a new dataset.

5.1: Dataset with SAP Table


2. In the New Dataset pane, select SAP HANA and click 'Continue'.

Page 17
3. After selecting SAP HANA, provide the values for the following fields: Name, Linked
Service & Table.

Need to add more details

Page 18
5.2: Dataset with SQL Database
4. In the New Dataset pane, select Azure SQL Database and click 'Continue'.

5. Under Properties Panel, set the properties for Azure SQL Database.

Need to add more details

5.3: Dataset with Data Lake Storage Gen2

6. In the New Dataset pane, select Azure Data Lake Storage Gen2 and click 'Continue'.

Page 19
7. On the Select Format page, choose the format type of the data, and then select Continue. In
this case, select DelimitedText, since the files used in these exercises are CSV files.

Page 20
8. In the Set Properties pane, set the properties for Azure Data Lake Storage Gen2.
Name: ds_adls_hgstft2023_<username>
Linked Service: ls_adls_hgstft2023_<username>
File Path: adf-training

Then click ‘OK’

Page 21
9. Once you click ‘OK’ it, new Dataset will be created. We can view all the datasets created under
Factory Resources -> Datasets:
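
The delimited-text dataset configured in 5.3 corresponds roughly to the definition sketched below as a Python dict. The delimiter and header settings are assumptions; only the names and the file path come from the steps above.

Python
ds_adls = {
    "name": "ds_adls_hgstft2023_<username>",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "ls_adls_hgstft2023_<username>",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {"type": "AzureBlobFSLocation", "fileSystem": "adf-training"},
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}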

Page 22
Exercise 6: Create a pipeline

1. Under Factory Resources, select Pipeline. A new tab will appear where you can drag and drop
various activities onto the canvas and also edit the pipeline name.

Page 23
Unit 3: Activities
Exercise 7: How to create Lookup Activities

The Lookup activity can read data stored in a database or file system and pass it to subsequent copy or
transformation activities.

1. In the Activities pane, choose the Lookup activity from the General tab.

2. In the Settings tab of the Lookup activity, define the values for Source dataset and File path
type, and check the box for First row only.
Name: Lookup Activity
Source dataset: ds_adls_hgstft2023_<username>
File path type: Wildcard file path
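
For reference, the Lookup activity configured above looks roughly like the following in JSON form (shown as a Python dict). The wildcard read settings are an assumption based on the 'Wildcard file path' option.

Python
lookup_activity = {
    "name": "Lookup Activity",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {"type": "AzureBlobFSReadSettings", "wildcardFileName": "*"},
        },
        "dataset": {"referenceName": "ds_adls_hgstft2023_<username>",
                    "type": "DatasetReference"},
        "firstRowOnly": True,
    },
}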

Page 24
Exercise 8: How to create Stored Procedure Activity

1. In the Activities pane, choose the Stored Procedure activity from the General tab and assign the
Name.
Name: Stored procedure

Page 25
2. Under the Stored Procedure activity, provide values for the given fields: Linked
Service & Stored procedure name.
Linked Service: ls_sql_demo1
Stored procedure name: [dbo].[CreateNewVisit]
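
A sketch of the corresponding activity JSON is shown below as a Python dict; the linked service and stored procedure names are the values used above.

Python
stored_procedure_activity = {
    "name": "Stored procedure",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {"referenceName": "ls_sql_demo1",
                          "type": "LinkedServiceReference"},
    "typeProperties": {"storedProcedureName": "[dbo].[CreateNewVisit]"},
}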

Page 26
Exercise 9: How to create Copy Data Activity

1. In the pipeline pane, in the Activities toolbox, expand Move & Transform. Drag the Copy
Data activity to the pipeline designer surface.
Name: Copy Activity

Page 27
2. On the pipeline designer surface, provide the values for both Source & Sink and click
Validate to validate the pipeline settings.

Source Connection:
Source dataset: ds_saptable_demo
Row Count: 5

Page 28
Sink Connection:
Sink dataset: ds_dls_demo
Copy behavior: None
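
An approximate JSON shape for this Copy Data activity is sketched below as a Python dict. The source type assumes the SAP table connector behind ds_saptable_demo and the sink type assumes a delimited-text dataset in the data lake; the actual types depend on the connectors chosen for the two datasets.

Python
copy_activity = {
    "name": "Copy Activity",
    "type": "Copy",
    "inputs": [{"referenceName": "ds_saptable_demo", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "ds_dls_demo", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "SapTableSource", "rowCount": 5},
        "sink": {"type": "DelimitedTextSink"},
    },
}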

Page 29
Exercise 10: How to create Azure Function Activities

The Azure Function activity allows you to run Azure Functions in a Data Factory pipeline. To run an Azure
Function, we need to create a linked service connection and an activity that specifies the Azure Function
that we plan to execute.

1. First, we have to link the Azure Function through an Azure Function linked service and provide the specific
values for the given fields.
Page 30
Under the Azure Function activity, provide the specific values for the given fields: name, type, linked
service, function name, method, header & body.

Property          Description
Name              Name of the activity in the pipeline
Type              Type of the activity: AzureFunctionActivity
Linked service    The Azure Function linked service for the corresponding Azure Function App
Function name     Name of the function in the Azure Function App that this activity calls
Method            REST API method for the function call
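
Putting the properties above together, the activity definition looks roughly like the sketch below (a Python dict). The linked service name, function name and body are illustrative placeholders, not values from this course.

Python
azure_function_activity = {
    "name": "Azure Function1",
    "type": "AzureFunctionActivity",
    "linkedServiceName": {"referenceName": "<azure-function-linked-service>",
                          "type": "LinkedServiceReference"},
    "typeProperties": {
        "functionName": "<function-name>",
        "method": "POST",
        "headers": {},
        "body": {"message": "hello from Data Factory"},
    },
}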

Page 31
Exercise 11: How to create Iteration & Conditional Activities
The ForEach activity defines a repeating control flow in your pipeline. It is used to iterate
over a collection and executes the specified activities in a loop. The loop implementation of this activity is
similar to the Foreach looping structure in programming languages.

1. On the Data Factory Activities pane, under Iteration & conditionals, you can choose activities such as Filter,
ForEach, If Condition, Switch & Until to implement iteration or conditions over the data.

Page 32
2. In the ForEach activity, choose the Settings tab to set the Sequential, Batch count, & Items
properties.
Sequential: Unchecked
Batch count: 16
Items: @activity('Get Metadata1').output.childItems
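
The ForEach settings above map roughly onto the JSON sketched below as a Python dict. The inner activities list is left empty here; in a real pipeline it holds the activities that run for each item of the collection.

Python
foreach_activity = {
    "name": "ForEach1",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": False,  # 'Sequential' left unchecked
        "batchCount": 16,
        "items": {"value": "@activity('Get Metadata1').output.childItems",
                  "type": "Expression"},
        "activities": [],  # child activities executed per item
    },
}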

Page 33
Exercise 12: Transformation using Dataflow

In this exercise, we create a data flow, perform aggregate and join transformations in it,
and create a pipeline with a Data Flow activity.

Prerequisites:

1. Employee.csv and department.csv files in the Azure Storage Account.


Storage account name: sahgstft2023
Container name: adf-training
Folder: dataflow-input

Page 34
If the files are not present in the specified location, upload the files given below.

The Employee.csv and department.csv data look like the samples below.

Employee data

Department data

Creating datasets for employee data, department data, and sink data

The employee, department, and sink datasets all point to Azure Data Lake
Storage Gen2. The steps below create these three datasets.

1. Click + (plus) in the left pane, and then click Dataset.

Page 35
2. In the New Dataset window, select Azure Data Lake Storage Gen2, and then
click Continue.

3. Select format as Delimited text

4. In the Set properties window, under Name, give the name as:

EmployeeDataset for employee data
DepartmentDataset for department data
DataflowSinkDataset for sink

Select the Linked Service as ls_adls_hgstft2023_<username> (the linked service created in
Exercise 4 for Azure Data Lake).

Browse the file path:

adf-training/dataflow-input/employee.csv
adf-training/dataflow-input/department.csv
adf-training/dataflow-output

Page 36
Set Properties for EmployeeDataset

Set Properties for DepartmentDataset

Page 37
Set Properties for DataflowSinkDataset

Creating Dataflow

1. Click + (plus) in the left pane, and then click Dataflow.

Page 38
2. In the General panel under Properties, specify Dataflow_demo for Name. Then
collapse the panel by clicking the Properties icon in the top-right corner.

3. Click the down arrow and select Add Source.

4. Switch to Source settings tab

a. Select Output stream name as Employee

b. Source Type as Dataset

c. Select the dataset created for employee data

Page 39
5. Next, to add an Aggregate transformation, click the + (plus) on the Employee stream
and select Aggregate under the Schema modifier section.

Page 40
6. Switch to Aggregate settings

a. Select Output stream name as AggregateOnDep

b. Incoming stream as Employee

c. Click on Group by and select the column name department from the dropdown.

Page 41
d. Click on Aggregates, select column as empid

e. Open expression builder by clicking the Expression box and give the
Column name as TotalEmpCount and Expression as

count(empid)

Click Save and finish

Page 42
7. To see a data preview, turn on Data flow debug and switch to the Data Preview
tab.

Page 43
Here we get the department-wise employee count. The output would be easier to understand
if it showed the department name instead of the department id. For that, we use a Join
transformation, which joins with the department data and pulls in the corresponding
department name.
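
As a rough illustration of what the Aggregate and Join transformations compute, the pandas sketch below does the same thing outside Data Factory. It assumes employee.csv has empid and department columns and that department.csv shares the department column plus a name column; adjust the column names to match the actual files.

Python
import pandas as pd

emp = pd.read_csv("employee.csv")
dept = pd.read_csv("department.csv")

# Aggregate: employee count per department (the AggregateOnDep stream).
agg = emp.groupby("department", as_index=False).agg(TotalEmpCount=("empid", "count"))

# Join: attach the department name (the JoinOnDep stream, inner join).
result = agg.merge(dept, on="department", how="inner")
print(result)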

8. Click on Add Source, switch to the Source Settings tab,

a. Give Output stream name as Department

b. Select Source type as Dataset

c. For Dataset select DepartmentDataset

Page 44
9. Click on + (plus) in the AggregateOnDep stream and select the Join transformation
under the Multiple Inputs/Outputs section.

10. Switch to the Join settings,

Page 45
a. Give Output stream name as JoinOnDep

b. Left stream as AggregateOnDep

c. Right stream as Department

d. Join type Inner

e. Set Join conditions as below.

11. Click on +(plus) on JoinOnDep stream and select Sink.

12. Switch to the Sink tab,

Page 46
a. Output stream name as Sink

b. Incoming stream as JoinOnDep

c. Sink type as Dataset

d. Select Dataset as DataflowSinkDataset

13. Switch to the Settings

a. Select File name option as Output to a single file

b. Give Output to single file as TotalEmpCountByDept

14. Switch to the Mappings tab. Clear the Auto mapping option, delete the unwanted
columns using the delete button, keep only the columns below, and drag to adjust the
column ordering.

Page 47
15. Switch to Optimize tab, select Single partition

16. Switch to Data preview tab to see the final output.

Page 48
17. To validate the dataflow, click Validate on the toolbar. Confirm that there are no
validation errors.

Create Pipeline
1. In the left pane, click + (plus), and click Pipeline.

2. In the General panel under Properties, specify DataFlowPipelineDemo for Name. Then
collapse the panel by clicking the Properties icon in the top-right corner.

Page 49
3. In the Activities toolbox, expand Move & Transform, and drag-drop
the Dataflow activity to the pipeline design surface. You can also search for
activities in the Activities toolbox.

a. In the General tab at the bottom, enter Dataflowpipeline for Name.

b. Switch to the Settings tab and select the data flow Dataflow_demo from the drop-down.

4. To validate the pipeline, click Validate on the toolbar. Confirm that there are no
validation errors.

5. To publish entities (dataflow, datasets, pipelines, etc.) to the Data Factory service,
click Publish all on top of the window. Wait until the publishing succeeds.

6. Click Debug on the toolbar and confirm the pipeline is running fine.

7. Verify the file created in the storage account.

Page 50
Page 51
Unit 4: Triggers, publishing and monitoring
Exercise 12: How to create Schedule Trigger

1. Under Manage, select Trigger and then click New.

2. Under New Trigger pane, we can choose the Type of trigger as Schedule from the given options
Schedule, Tumbling window, Storage events, or Custom events.
Name: Schedule_trigger_demo
Type: Schedule

3. In addition, we can choose the start date, recurrence, end date, and whether or not to activate
the trigger immediately after creating it (Start trigger on creation). The recurrence offers
various options such as Minutes, Hours, Days, and Weeks.

Page 52
4. After providing the details, click 'OK'. The newly created trigger can be seen in the
Manage > Triggers pane. The trigger shows as stopped because we did not select the
option 'Start trigger on creation'. We can activate the trigger whenever we want.

Exercise 13: How to create Tumbling Window Trigger

1. Under Manage, select Trigger and then click New.

Page 53
2. Under New Trigger pane, we can choose the Type of trigger as Tumbling window from the
given options Schedule, Tumbling window, Storage events, or Custom events.
Name: tumbling_trigger_demo
Type: Tumbling Window

3. In addition, we can choose the start date, recurrence, end date, and whether or not to activate
the trigger immediately after creating it (Start trigger on creation). The recurrence setting is
different here: you can only choose Minutes, Hours, or Months in the pane.

Page 54
4. After providing all the details, click 'OK'. The newly created trigger can be seen under the
Manage > Triggers pane.
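
The corresponding tumbling window trigger JSON is sketched below as a Python dict. The hourly frequency, interval and start time are illustrative; note that a tumbling window trigger references exactly one pipeline.

Python
tumbling_trigger = {
    "name": "tumbling_trigger_demo",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2023-01-01T00:00:00Z",
            "delay": "00:00:00",
            "maxConcurrency": 1,
        },
        "pipeline": {
            "pipelineReference": {"referenceName": "<pipeline-name>",
                                  "type": "PipelineReference"}
        },
    },
}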

Exercise 14: How to create Storage Event Trigger

1. Under Manage, select Trigger and then click New.

Page 55
2. Under the New Trigger pane, we can choose the Type of trigger as Storage events from the given
options Schedule, Tumbling window, Storage events, or Custom events.

3. We can choose the start date, end date, and whether or not to activate the trigger immediately
after creating it (Start trigger on creation). The main settings are the Storage account name,
Container name, and blob path; we can also provide Blob path begins with and Blob path ends
with conditions.

Name: storage_event_trigger_demo
Type: Event
Azure Subscription name: Microsoft Azure Sponsorship - 2021_22
Storage Account Name: sahgstft2023
Container name: adf-training
Blob Path ends with: csv
Event: Blob Created

Page 56
Page 57
4. Click on ‘Continue’, ‘Data Preview’ pane will open which shows the preview of blobs which
satisfies the conditions.
Click ‘OK’, a new trigger will get created. We can see the newly created trigger under Mange>
Trigger pane.
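
Approximately, the storage event trigger created above looks like the sketch below (a Python dict). The scope is the resource ID of the storage account; the subscription and resource group values are placeholders, and pipelines are attached later through 'Add trigger'.

Python
storage_event_trigger = {
    "name": "storage_event_trigger_demo",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
                     "/providers/Microsoft.Storage/storageAccounts/sahgstft2023",
            "blobPathBeginsWith": "/adf-training/blobs/",
            "blobPathEndsWith": "csv",
            "ignoreEmptyBlobs": True,
            "events": ["Microsoft.Storage.BlobCreated"],
        },
        "pipelines": [],  # filled in when the trigger is attached to a pipeline
    },
}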

1. Once the trigger has been created, open the pipeline that you want to trigger, then click
'Add trigger', where you can see two options: Trigger now or New/Edit.

Page 58
2. Under Add triggers, we can either create a new trigger or choose one of the existing
triggers.

Page 59
Page 60
Exercise 15: Publishing of the Activity

1. Before triggering a pipeline, you must publish entities to the Data Factory.
To publish, select Publish all on the top.

Page 61
2. In the Publish all pane, you will see all the pending changes; review them and then click Publish.

3. After publishing is done, you will see a pop-up like 'Publishing is completed'.

Page 62
Exercise 16: Debugging & Monitoring a pipeline.

1. Under Pipeline designer surface, click Debug to trigger a test run.

2. To monitor the pipeline, switch to the Monitor tab and use the Refresh button to refresh the
list and see the status of a particular activity run.

Page 63
Unit 5: Copy multiple tables in bulk

This tutorial demonstrates copying a number of tables from Azure SQL Database to
Azure Synapse Analytics. You can apply the same pattern in other copy scenarios as
well. For example, copying tables from SQL Server/Oracle to Azure SQL Database/Azure
Synapse Analytics (formerly SQL DW)/Azure Blob, copying different paths from Blob to
Azure SQL Database tables.

At a high level, this tutorial involves the following steps:

 Create a data factory.


 Create Azure SQL Database, Azure Synapse Analytics, and Azure Storage linked
services.
 Create Azure SQL Database and Azure Synapse Analytics datasets.
 Create a pipeline to look up the tables to be copied and another pipeline to
perform the actual copy operation.
 Start a pipeline run.
 Monitor the pipeline and activity runs.

End-to-end workflow

Prerequisites
 Azure Storage account. The Azure Storage account is used as staging blob
storage in the bulk copy operation.
 Azure SQL Database. This database contains the source data.
 Azure Synapse Analytics. This data warehouse holds the data copied over from
the SQL Database.

Page 64
Prepare SQL Database and Azure Synapse Analytics

Prepare the source Azure SQL Database:

Create a database in SQL Database with the Adventure Works LT sample data by following
the Create a database in Azure SQL Database article. This tutorial copies all the
tables from this sample database to Azure Synapse Analytics (formerly SQL DW).

Note: The above step is for your reference only; participants do not need to do
anything for this step. The Azure SQL Database credentials will be provided in the
following steps.

Prepare the sink Azure Synapse Analytics:

1. If you don't have an Azure Synapse Analytics (formerly SQL DW), see the Create a
SQL Data Warehouse article for steps to create one.

2. Create corresponding table schemas in Azure Synapse Analytics (formerly SQL DW).
You use Azure Data Factory to migrate/copy data in a later step.

Azure services to access SQL server

For both SQL Database and Azure Synapse Analytics, allow Azure services to access SQL
server.

Ensure that Allow Azure services and resources to access this server setting is
turned ON for your server. This setting allows the Data Factory service to read data from
your Azure SQL Database and write data to your Azure Synapse Analytics.

To verify and turn on this setting, go to your server > Security > Networking > set
the Allow Azure services and resources to access this server to ON.

Page 65
Note: The above step is for your reference only; participants do not need to do
anything for this step.

Open Azure Data Factory

1. Go to the Azure portal.

2. Launch the Azure Data Factory created in Exercise 1.

Page 66
Create the source Azure SQL Database linked service.

In this step, you create a linked service to link your database in Azure SQL Database to
the data factory.

1. Open Manage tab from the left pane.

2. On the Linked services page, select +New to create a new linked service.

3. In the New Linked Service window, select Azure SQL Database, and
click Continue.

4. In the New Linked Service (Azure SQL Database) window, do the following
steps:

a. Enter AzureSqlDatabaseLinkedService for Name.

b. Enter Azure Subscription as Microsoft Azure Sponsorship - 2021_23

c. Select your server for Server name

d. Select your database for Database name

e. Enter name of the user to connect to your database

f. Enter password for the user.


Page 67
g. To test the connection to your database using the specified information,
click Test connection.

h. Click Create to save the linked service

Create the sink Azure Synapse Analytics linked service

1. In the Connections tab, click + New on the toolbar again.

2. In the New Linked Service window, select Azure Synapse Analytics, and
click Continue.

3. In the New Linked Service (Azure Synapse Analytics) window, do the following
steps:

a. Enter AzureSqlDWLinkedService for Name.

b. Enter Azure Subscription as Microsoft Azure Sponsorship - 2021_23

c. Select your server for Server name

d. Select your database for Database name

e. Enter User name to connect to your database

f. Enter Password for the user.

g. To test the connection to your database using the specified information,


click Test connection.

h. Click Create.

Page 68
Create the staging Azure Storage linked service

In this tutorial, you use Azure Blob storage as an interim staging area to enable PolyBase
for a better copy performance.

1. In the Connections tab, click + New on the toolbar again.

2. In the New Linked Service window, select Azure Blob Storage, and
click Continue.

3. In the New Linked Service (Azure Blob Storage) window, do the following steps:

a. Enter AzureStorageLinkedService for Name.


b. Select your Azure Storage account for Storage account name.

c. Click Create.

Create a dataset for source SQL Database

1. Click + (plus) in the left pane, and then click Dataset.

Page 69
2. In the New Dataset window, select Azure SQL Database, and then click Continue.

3. In the Set properties window, under Name,

Enter AzureSqlDatabaseDataset. Under Linked service,


select AzureSqlDatabaseLinkedService. Then click OK.

4. Switch to the Connection tab, select any table for Table. This table is a dummy
table. You specify a query on the source dataset when creating a pipeline. The
query is used to extract data from your database. Alternatively, you can
select the Edit check box and enter dbo.dummyName as the table name.

Create a dataset for sink Azure Synapse Analytics

1. Click + (plus) in the left pane and click Dataset.

2. In the New Dataset window, select Azure Synapse Analytics, and then
click Continue.

3. In the Set properties window, under Name, enter AzureSqlDWDataset.


Under Linked service, select AzureSqlDWLinkedService. Then click OK.

4. Switch to the Parameters tab, click + New, and enter DWTableName for the
parameter name. Click + New again and enter DWSchema for the parameter
name. If you copy/paste this name from the page, ensure that there's no trailing
space character at the end of DWTableName and DWSchema.

5. Switch to the Connection tab,

a. For Table, check the Edit option. Select into the first input box and click
the Add dynamic content link below. In the Add Dynamic Content page,
click the DWSchema under Parameters, which will automatically populate
the top expression text box @dataset().DWSchema, and then click Finish.

Page 70
b. Click into the second input box and click the Add dynamic content link below. In
the Add Dynamic Content page, click DWTableName under Parameters, which
will automatically populate the top expression text box @dataset().DWTableName, and
then click Finish.

6. The tableName property of the dataset is set to the values that are passed as
arguments for the DWSchema and DWTableName parameters. The ForEach activity
iterates through a list of tables and passes one by one to the Copy activity.
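
The resulting sink dataset definition is sketched below as a Python dict: the DWSchema and DWTableName parameters are referenced as expressions for the schema and table. The exact type-property names follow the current Azure Synapse Analytics dataset shape and are an assumption here.

Python
azure_sqldw_dataset = {
    "name": "AzureSqlDWDataset",
    "properties": {
        "type": "AzureSqlDWTable",
        "linkedServiceName": {"referenceName": "AzureSqlDWLinkedService",
                              "type": "LinkedServiceReference"},
        "parameters": {"DWSchema": {"type": "String"},
                       "DWTableName": {"type": "String"}},
        "typeProperties": {
            "schema": {"value": "@dataset().DWSchema", "type": "Expression"},
            "table": {"value": "@dataset().DWTableName", "type": "Expression"},
        },
    },
}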

Page 71
Create pipelines.

In this tutorial, you create two pipelines: IterateAndCopySQLTables and GetTableListAndTriggerCopyData.

The GetTableListAndTriggerCopyData pipeline performs two actions:

 Looks up the Azure SQL Database system table to get the list of tables to be
copied.
 Triggers the pipeline IterateAndCopySQLTables to do the actual data copy.

The IterateAndCopySQLTables pipeline takes a list of tables as a parameter. For each
table in the list, it copies data from the table in Azure SQL Database to Azure Synapse
Analytics (formerly SQL DW) using staged copy and PolyBase.

Create the pipeline IterateAndCopySQLTables

1. In the left pane, click + (plus), and click Pipeline.

2. In the General panel under Properties, specify IterateAndCopySQLTables for Name.
Then collapse the panel by clicking the Properties icon in the top-right corner.

3. Switch to the Parameters tab, and do the following actions:

a. Click + New.

Page 72
b. Enter tableList for the parameter Name.

c. Select Array for Type.

4. In the Activities toolbox, expand Iteration & Conditionals, and drag-drop
the ForEach activity to the pipeline design surface. You can also search for
activities in the Activities toolbox.

a. In the General tab at the bottom, enter IterateSQLTables for Name.

b. Switch to the Settings tab, click the input box for Items, then click the Add
dynamic content link below.

c. In the Add Dynamic Content page, click tableList under Parameters, which
will automatically populate the top expression text box
as @pipeline().parameters.tableList. Then click Ok.

Page 73
Page 74
5. Switch to the Activities tab, and click the pencil icon to add a child activity to
the ForEach activity.

6. In the Activities toolbox, expand Move & Transform, and drag-drop the Copy
data activity onto the pipeline designer surface. Notice the breadcrumb menu at
the top. IterateAndCopySQLTables is the pipeline name
and IterateSQLTables is the ForEach activity name. The designer is in the activity
scope. To switch back to the pipeline editor from the ForEach editor, you can click
the link in the breadcrumb menu.

Page 75
7. Switch to the Source tab, and do the following steps:

1. Select AzureSqlDatabaseDataset for Source Dataset.

2. Select Query option for Use query.

3. Click the Query input box -> select the Add dynamic content below -> enter
the following expression for Query -> select Finish.

SQL
SELECT * FROM [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]

8. Switch to the Sink tab, and do the following steps:

1. Select AzureSqlDWDataset for Sink Dataset.

2. Click the input box for the VALUE of DWTableName parameter -> select
the Add dynamic content below, enter @item().TABLE_NAME expression as script,
-> select Finish.

Page 76
3. Click the input box for the VALUE of DWSchema parameter -> select the Add
dynamic content below, enter @item().TABLE_SCHEMA expression as script, ->
select Finish.

4. For Copy method, select PolyBase.

5. Clear the Use type default option.

6. Click the Pre-copy Script input box -> select the Add dynamic content below
-> enter the following expression as script -> select Ok.

SQL
TRUNCATE TABLE [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]

9. Switch to the Settings tab, and do the following steps:

1. Select the checkbox for Enable Staging.


2. Select AzureStorageLinkedService for the Staging account linked service.

Page 77
10. To validate the pipeline settings, click Validate on the top pipeline toolbar. Make
sure that there are no validation errors. To close the Pipeline Validation Report, click
Close.

Create the pipeline GetTableListAndTriggerCopyData

This pipeline does two actions:

 Looks up the Azure SQL Database system table to get the list of tables to be
copied.
 Triggers the pipeline "IterateAndCopySQLTables" to do the actual data copy.

1. In the left pane, click + (plus), and click Pipeline.

2. In the General panel under Properties, change the name of the pipeline
to GetTableListAndTriggerCopyData.

3. In the Activities toolbox, expand General, and drag-drop Lookup activity to the
pipeline designer surface, and do the following steps:

1. Enter LookupTableList for Name.


2. Enter Retrieve the table list from my database for Description.

4. Switch to the Settings tab, and do the following steps:

1. Select AzureSqlDatabaseDataset for Source Dataset.

2. Select Query for Use query.

3. Enter the following SQL query for Query.

SQL
SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.TABLES WHERE
TABLE_TYPE = 'BASE TABLE' and TABLE_SCHEMA = 'SalesLT' and TABLE_NAME <>
'ProductModel'

4. Clear the checkbox for the First row only field.

Page 78
5. Drag-drop Execute Pipeline activity from the Activities toolbox to the pipeline
designer surface, and set the name to TriggerCopy.

6. To connect the Lookup activity to the Execute Pipeline activity, drag the green
box attached to the Lookup activity to the left of the Execute Pipeline activity.

7. Switch to the Settings tab of Execute Pipeline activity, and do the following steps:

1. Select IterateAndCopySQLTables for Invoked pipeline.

2. Clear the checkbox for Wait on completion.

3. In the Parameters section, click the input box under VALUE -> select the Add
dynamic content below -> enter @activity('LookupTableList').output.value as
the table name value -> select Finish. You're setting the result list from the Lookup
activity as an input to the second pipeline. The result list contains the list of
tables whose data needs to be copied to the destination.

Page 79

8. To validate the pipeline, click Validate on the toolbar. Confirm that there are no
validation errors. To close the Pipeline Validation Report, click >>.

9. To publish entities (datasets, pipelines, etc.) to the Data Factory service,
click Publish all on top of the window. Wait until the publishing succeeds.

Trigger a pipeline run

1. Go to pipeline GetTableListAndTriggerCopyData, click Add Trigger on the top
pipeline toolbar, and then click Trigger now.

2. Confirm the run on the Pipeline run page, and then select Ok.
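
The same run can be started programmatically; a minimal sketch with the azure-mgmt-datafactory Python SDK is shown below, where the subscription, resource group and factory names are placeholders.

Python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Start a run of the outer pipeline; parameters can be passed as a dict if needed.
run = adf_client.pipelines.create_run(
    "<resource-group>", "<data-factory-name>",
    "GetTableListAndTriggerCopyData", parameters={})
print("Started pipeline run:", run.run_id)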

Monitor the pipeline run

1. Switch to the Monitor tab. Click Refresh until you see runs for both the pipelines
in your solution. Continue refreshing the list until you see the Succeeded status.

2. To view activity runs associated with the GetTableListAndTriggerCopyData pipeline,
click the pipeline name link for the pipeline. You should see two activity runs for this
pipeline run.

Page 80
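
The same information can be queried programmatically. The sketch below assumes the azure-mgmt-datafactory SDK and a known pipeline run ID (for example the one returned by create_run in the previous sketch); resource names are placeholders.

Python
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
run_id = "<pipeline-run-id>"

# Overall pipeline run status.
pipeline_run = adf_client.pipeline_runs.get("<resource-group>", "<data-factory-name>", run_id)
print("Pipeline run status:", pipeline_run.status)

# Activity runs (Lookup, Execute Pipeline, ForEach, Copy, ...) for that pipeline run.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow() + timedelta(days=1))
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    "<resource-group>", "<data-factory-name>", run_id, filters)
for act in activity_runs.value:
    print(act.activity_name, act.status)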

Page 81
Unit 6: Incrementally load data from Azure SQL Database to Azure
Blob storage

In this tutorial, you create an Azure data factory with a pipeline that loads delta data
from a table in Azure SQL Database to Azure Blob storage.

You perform the following steps in this tutorial:

 Prepare the data store to store the watermark value.


 Create a data factory.
 Create linked services.
 Create source, sink, and watermark datasets.
 Create a pipeline.
 Run the pipeline.
 Monitor the pipeline run.
 Review results
 Add more data to the source.
 Run the pipeline again.
 Monitor the second pipeline run.
 Review results from the second run.

Overview

Here is the high-level solution diagram:

Here are the important steps to create this solution:

Page 82
1. Select the watermark column. Select one column in the source data store, which
can be used to slice the new or updated records for every run. Normally, the data
in this selected column (for example, last_modify_time or ID) keeps increasing
when rows are created or updated. The maximum value in this column is used as a
watermark.

2. Prepare a data store to store the watermark value. In this tutorial, you store the
watermark value in a SQL database.

3. Create a pipeline with the following workflow:

The pipeline in this solution has the following activities:


o Create two Lookup activities. Use the first Lookup activity to retrieve the last
watermark value. Use the second Lookup activity to retrieve the new
watermark value. These watermark values are passed to the Copy activity.
o Create a Copy activity that copies rows from the source data store with the
value of the watermark column greater than the old watermark value and less
than the new watermark value. Then, it copies the delta data from the source
data store to Blob storage as a new file.
o Create a StoredProcedure activity that updates the watermark value for the
pipeline that runs next time.

Prerequisites
 Azure SQL Database. Use the Azure SQL Database used in previous tutorial.
 Azure Storage. Use the storage account sahgstft2023 and the container adf-
training.

Page 83
Create a data source table in your SQL database

1. Open SQL Server Management Studio. In Server Explorer, right-click the database,
and choose New Query.

2. Run the following SQL command against your SQL database to create a table
named data_source_table as the data source store:

SQL
create table data_source_table
(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

INSERT INTO data_source_table
(PersonID, Name, LastModifytime)
VALUES
(1, 'aaaa','9/1/2017 12:56:00 AM'),
(2, 'bbbb','9/2/2017 5:23:00 AM'),
(3, 'cccc','9/3/2017 2:36:00 AM'),
(4, 'dddd','9/4/2017 3:21:00 AM'),
(5, 'eeee','9/5/2017 8:06:00 AM');

Create another table in your SQL database to store the high watermark
value

1. Run the following SQL command against your SQL database to create a table
named watermarktable to store the watermark value:
SQL
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);

2. Set the default value of the high watermark with the table name of source data
store. In this tutorial, the table name is data_source_table.
Page 84
SQL
INSERT INTO watermarktable
VALUES ('data_source_table','1/1/2010 12:00:00 AM')

3. Review the data in the table watermarktable.

SQL
Select * from watermarktable

Output:
TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000

Create a stored procedure in your SQL database

Run the following SQL command to create a stored procedure in your SQL database:

SQL
CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName
varchar(50)
AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END

Page 85
Open Data Factory

1. Go to the Azure portal.

2. Launch the Azure Data Factory created in Exercise 1.

Create a pipeline.

In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and
one StoredProcedure activity chained in one pipeline.

1. Click on Author tab.

Page 86
2. In the left pane, click + (plus), and click Pipeline.

3. In the General panel under Properties, specify IncrementalCopyPipeline for Name.
Then collapse the panel by clicking the Properties icon in the top-right corner.

4. Let's add the first Lookup activity to get the old watermark value. In the Activities
toolbox, expand General, and drag-drop the Lookup activity to the pipeline designer
surface. Change the name of the activity to LookupOldWaterMarkActivity.

Page 87
5. Switch to the Settings tab, and click + New for Source Dataset. In this step, you
create a dataset to represent data in the watermarktable. This table contains the
old watermark that was used in the previous copy operation.

6. In the New Dataset window, select Azure SQL Database, and click Continue. You
see a new window opened for the dataset.

7. In the Set properties window for the dataset, enter WatermarkDataset for Name.

8. For Linked Service, select AzureSqlDatabaseLinkedService (the linked service
created in the previous tutorial).

9. Then click Ok; the new dataset will be created.

10. Open the newly created dataset from Factory Resources.

Page 88
11. In the Connection tab, select [dbo].[watermarktable]. If you want to preview
data in the table, click Preview data.

Page 89
12. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking
the name of the pipeline in the tree view on the left. In the properties window for
the Lookup activity, confirm that WatermarkDataset is selected for the Source
Dataset field.

13. In the Activities toolbox, expand General, drag-drop another Lookup activity
to the pipeline designer surface, and set the name to LookupNewWaterMarkActivity
in the General tab of the properties window. This Lookup activity gets the new
watermark value from the table with the source data to be copied to the destination.

14. In the properties window for the second Lookup activity, switch to the Settings
tab, and click New. You create a dataset to point to the source table that contains
the new watermark value (the maximum value of LastModifytime).

15. In the New Dataset window, select Azure SQL Database, and click Continue.

16. In the Set properties window, enter SourceDataset for Name. Select
AzureSqlDatabaseLinkedService for Linked service.

17. Select [dbo].[data_source_table] for Table. You specify a query on this dataset
later in the tutorial. The query takes precedence over the table you specify in
this step.

18. Click Ok.

19. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking
the name of the pipeline in the tree view on the left. In the properties window for
the Lookup activity, confirm that SourceDataset is selected for the Source
Dataset field.

Page 90
20. Select Query for the Use Query field and enter the following query; you are only
selecting the maximum value of LastModifytime from data_source_table.
Please make sure you have also checked First row only.

SQL

select MAX(LastModifytime) as NewWatermarkvalue from data_source_table

21. In the Activities toolbox, expand Move & Transform, drag-drop the Copy
activity from the Activities toolbox, and set the name to IncrementalCopyActivity.

Page 91
22. Connect both Lookup activities to the Copy activity by dragging the green
button attached to the Lookup activities to the Copy activity. Release the mouse
button when you see the border color of the Copy activity change to blue.

23. Select the Copy activity and confirm that you see the properties for the activity in
the Properties window.

24. Switch to the Source tab in the Properties window, and do the following steps:

1. Select SourceDataset for the Source Dataset field.

2. Select Query for the Use Query field.

3. Enter the following SQL query for the Query field.

SQL

select * from data_source_table where LastModifytime >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}'
and LastModifytime <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'

Page 92
25. Switch to the Sink tab and click + New for the Sink Dataset field.

26. In this tutorial the sink data store is of type Azure Blob Storage. Therefore, select
Azure Blob Storage, and click Continue in the New Dataset window.

27. In the Select Format window, select the format type of your data, and click
Continue.

28. In the Set Properties window, enter SinkDataset for Name. For Linked Service,
select AzureStorageLinkedService and click 'Ok'.

29. Go to the Connection tab of SinkDataset and do the following steps:

1. For the File path field, enter adf-training as the container and
incrementalcopy as the folder name. You can also use the Browse button for
the File path to navigate to a folder in a blob container.

2. For the File part of the File path field, select Add dynamic content
[Alt+P], and then enter @concat('Incremental-', pipeline().RunId, '.txt')
in the opened window. Then click Ok. The file name is dynamically
generated by using the expression. Each pipeline run has a unique ID. The
Copy activity uses the run ID to generate the file name.

Page 93
30. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking
the name of the pipeline in the tree view on the left.

31. In the Activities toolbox, expand General, and drag-drop the Stored Procedure
activity from the Activities toolbox to the pipeline designer surface. Connect the
green (Success) output of the Copy activity to the Stored Procedure activity.

32. Select the Stored Procedure activity in the pipeline designer and change its name to
StoredProceduretoWriteWatermarkActivity.

33. Switch to the Settings tab, select AzureSqlDatabaseLinkedService for Linked
service, and select usp_write_watermark for Stored procedure name.

34. To specify values for the stored procedure parameters, click Import parameter,
and enter the following values for the parameters:

Parameter          Type       Value
LastModifiedtime   DateTime   @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
TableName          String     @{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}
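
Taken together, the stored procedure activity configured above corresponds roughly to the JSON sketched below as a Python dict; the two parameter expressions are exactly the values from the table.

Python
watermark_sp_activity = {
    "name": "StoredProceduretoWriteWatermarkActivity",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {"referenceName": "AzureSqlDatabaseLinkedService",
                          "type": "LinkedServiceReference"},
    "typeProperties": {
        "storedProcedureName": "usp_write_watermark",
        "storedProcedureParameters": {
            "LastModifiedtime": {
                "value": "@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}",
                "type": "DateTime"},
            "TableName": {
                "value": "@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
                "type": "String"},
        },
    },
}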

Page 94
35. To validate the pipeline settings, click Validate on the toolbar. Confirm that there
are no validation errors. To close the Pipeline Validation Report window, click Close.

36. Publish entities (linked services, datasets, and pipelines) to the Azure Data Factory
service by selecting the Publish All button. Wait until you see a message that the
publishing succeeded.

Trigger a pipeline run

1. Go to pipeline IncrementalCopyPipeline, click Add Trigger on the top pipeline
toolbar, and then click Trigger now.

2. Confirm the run on the Pipeline run page, and then select Ok.

Page 95
Monitor the pipeline run

1. Switch to the Monitor tab on the left. You see the status of the pipeline run
triggered by a manual trigger. You can use links under the PIPELINE NAME
column to view run details and to rerun the pipeline.

2. To see activity runs associated with the pipeline run, select the link under the
PIPELINE NAME column. For details about the activity runs, select the Details link
(eyeglasses icon) under the ACTIVITY NAME column. Select All pipeline runs at
the top to go back to the Pipeline Runs view. To refresh the view, select Refresh.

Review the results

1. Connect to your Azure Storage Account by using tools such as Azure Storage
Explorer. Verify that an output file is created in the incrementalcopy folder of the
adf-training container.

2. Open the output file and notice that all the data is copied from the
data_source_table to the blob file.

3. Check the latest value from watermarktable. You see that the watermark value was
updated.

Page 96
SQL
Select * from watermarktable

Here is the output:

TableName WatermarkValue
data_source_table 2017-09-05 8:06:00.000

Insert new data into your database (data source store).


SQL
INSERT INTO data_source_table
VALUES (6, 'newdata','9/6/2017 2:23:00 AM')

INSERT INTO data_source_table
VALUES (7, 'newdata','9/7/2017 9:01:00 AM')

The updated data in your database is:


PersonID | Name | LastModifytime
-------- | ---- | --------------
1 | aaaa | 2017-09-01 00:56:00.000
2 | bbbb | 2017-09-02 05:23:00.000
3 | cccc | 2017-09-03 02:36:00.000
4 | dddd | 2017-09-04 03:21:00.000
5 | eeee | 2017-09-05 08:06:00.000
6 | newdata | 2017-09-06 02:23:00.000
7 | newdata | 2017-09-07 09:01:00.000

Trigger another pipeline run

1. Switch to the Author tab. Click the pipeline in the tree view if it's not opened in
the designer.

2. Click Add Trigger on the toolbar and click Trigger Now.


Monitor the second pipeline run

1. Switch to the Monitor tab on the left. You see the status of the pipeline run
triggered by a manual trigger. You can use links under the PIPELINE NAME
column to view activity details and to rerun the pipeline.

Page 97
2. To see activity runs associated with the pipeline run, select the link under the
PIPELINE NAME column. For details about the activity runs, select the Details link
(eyeglasses icon) under the ACTIVITY NAME column. Select All pipeline runs at
the top to go back to the Pipeline Runs view. To refresh the view, select Refresh.

Verify the second output

1. In the blob storage, you see that another file was created. In this tutorial, the new
file name is Incremental-<GUID>.txt. Open that file, and you see two rows of
records in it.

2. Check the latest value from watermarktable. You see that the watermark value was
updated again.
SQL
Select * from watermarktable

sample output:
TableName WatermarkValue
data_source_table 2017-09-07 09:01:00.000

Page 98
