
Azure Data Factory

A Complete Introduction

Paul Andrew | Principal Consultant & Solution Architect

@MrPaulAndrew In/MrPaulAndrew MrPaulAndrew.com


Azure Data Factory
A Complete Introduction… (to the key things Paul thinks you need to know, based on experience.)



Azure Data Factory
What is Azure Data Factory?

V1 V2
2016 2020
https://azure.microsoft.com/en-gb/services/data-factory/

Copy Data Transform


Data Factory Components
Copy Data Transform

1 Linked Services – How and what to connect to. Like the SSIS connection manager.

SQLDBLinkedService

ConnectionString: Server=MyServer;Database=myDataBase
UserName: “MrPaulAndrew”
Password: ***************
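
A linked service is stored by Data Factory as a JSON definition. The following is a minimal sketch of that shape built as a Python dictionary, not a definitive template. The connector type "AzureSqlDatabase" is an assumption (the slide only says SQLDB), and credentials are deliberately left out; in practice they would normally come from a secure store such as Key Vault.

```python
import json


def sql_linked_service(name: str, server: str, database: str) -> dict:
    """Build a linked service definition as a JSON-serialisable dict."""
    return {
        "name": name,
        "properties": {
            # Assumption: an Azure SQL Database connector.
            "type": "AzureSqlDatabase",
            "typeProperties": {
                # No user name or password here; keep secrets out of code.
                "connectionString": f"Server={server};Database={database}",
            },
        },
    }


linked_service = sql_linked_service("SQLDBLinkedService", "MyServer", "myDataBase")
print(json.dumps(linked_service, indent=2))
```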

2 Data Sets – Where is my data? What format? What file path/table do I need?

dbo.DimCustomer

/RAW/Orders/2018/01/01/Orders.csv
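
The two examples above (a table and a date-partitioned file path) can be sketched in Python as follows. The helper names are hypothetical, the dataset type "AzureSqlTable" and its "tableName" property are assumptions about the connector in use, and the folder convention is simply read off the example path.

```python
from datetime import date


def raw_file_path(entity: str, load_date: date) -> str:
    """Build a date-partitioned path following the /RAW/<entity>/yyyy/MM/dd pattern."""
    return f"/RAW/{entity}/{load_date:%Y/%m/%d}/{entity}.csv"


def table_dataset(name: str, linked_service: str, table: str) -> dict:
    """Build a table dataset definition that points at a linked service."""
    return {
        "name": name,
        "properties": {
            "type": "AzureSqlTable",  # assumption: Azure SQL table dataset
            "linkedServiceName": {
                "referenceName": linked_service,
                "type": "LinkedServiceReference",
            },
            "typeProperties": {"tableName": table},
        },
    }


orders_path = raw_file_path("Orders", date(2018, 1, 1))
customer_dataset = table_dataset("DimCustomer", "SQLDBLinkedService", "dbo.DimCustomer")
print(orders_path)
```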

3 Activities – What do we want to happen? With what conditions?

Databricks Notebook Activity
notebookPath: /Playground/Playing
baseParameters: Testing
libraries[jar]: dbfs:/lib1.jar
linkedServiceName: BricksOfData01
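
As a rough illustration, the Databricks Notebook activity settings listed above could sit in an activity definition like the one below. The activity name and the parameter key "runMode" are hypothetical (the slide only gives the value "Testing"); the remaining values are taken from the slide.

```python
import json

# A sketch of a Databricks Notebook activity as a JSON-serialisable dict.
notebook_activity = {
    "name": "RunPlaygroundNotebook",  # hypothetical activity name
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "BricksOfData01",
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "notebookPath": "/Playground/Playing",
        # Hypothetical key; only the value "Testing" appears on the slide.
        "baseParameters": {"runMode": "Testing"},
        "libraries": [{"jar": "dbfs:/lib1.jar"}],
    },
}

print(json.dumps(notebook_activity, indent=2))
```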

4 Pipelines – What groups of work do I want to do?

Execute Pipeline Activity
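
A pipeline is a named group of activities, and the Execute Pipeline activity lets one pipeline call another. A minimal sketch of a parent pipeline calling a child follows; the pipeline and activity names are hypothetical.

```python
import json


def parent_pipeline(parent_name: str, child_name: str) -> dict:
    """Build a pipeline whose only activity runs another pipeline."""
    return {
        "name": parent_name,
        "properties": {
            "activities": [
                {
                    "name": f"Run {child_name}",
                    "type": "ExecutePipeline",
                    "typeProperties": {
                        "pipeline": {
                            "referenceName": child_name,
                            "type": "PipelineReference",
                        },
                        # Wait for the child to finish before moving on.
                        "waitOnCompletion": True,
                    },
                }
            ]
        },
    }


pipeline = parent_pipeline("ParentPipeline", "ChildPipeline")
print(json.dumps(pipeline, indent=2))
```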

5 Triggers – How are we going to tell our pipeline(s) to execute?

• Manual via UI
• Tumbling Windows - AKA Time Slices (Loading 2019 2020)
• Scheduled - Every 1 minute. - UTC
• Blob File Events - {Path} Created, {Path} Deleted
• Logic App Calls

Data Factory Components
Copy Data Transform

1 Linked Services

2 Data Sets

3 Activities

4 Pipelines

5 Triggers
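
To make the tumbling window (time slice) idea concrete, here is a small sketch that cuts a time range into fixed, contiguous, non-overlapping windows. It only illustrates the concept and is not how the service is implemented; the function name and the six-hour window size are hypothetical.

```python
from datetime import datetime, timedelta


def tumbling_windows(start: datetime, end: datetime, size: timedelta):
    """Return (window_start, window_end) pairs covering [start, end) in whole windows."""
    windows = []
    window_start = start
    # Only complete windows are emitted; a partial trailing window is skipped.
    while window_start + size <= end:
        windows.append((window_start, window_start + size))
        window_start += size
    return windows


slices = tumbling_windows(
    datetime(2019, 1, 1), datetime(2019, 1, 2), timedelta(hours=6)
)
for window_start, window_end in slices:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each window would correspond to one pipeline run, with the window start and end passed in as parameters so the run loads only its own slice.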
Data Factory Control Flow Components
Copy Data Transform

1 ✓ Linked Services
2 ✓ Data Sets
3 ✓ Activities
4 ✓ Pipelines
5 Triggers

Expression Builder
@{………}
Parameters
System Variables
• Collection
• Conversion
• Date
• Logical
• Math
• String
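
For orientation, here is one example expression per function category, held as plain strings in Python. This is an illustrative sketch only: the parameter name "tableList" is hypothetical, and the helper simply shows the @{…} string interpolation wrapper rather than evaluating anything.

```python
# One example expression per function category from the Expression Builder.
EXAMPLE_EXPRESSIONS = {
    "Collection": "@length(pipeline().parameters.tableList)",  # hypothetical parameter
    "Conversion": "@int('42')",
    "Date": "@utcnow()",
    "Logical": "@if(equals(1, 1), 'yes', 'no')",
    "Math": "@add(1, 2)",
    "String": "@concat('RAW/', 'Orders')",
}

# A system variable, for comparison.
SYSTEM_VARIABLE_EXAMPLE = "@pipeline().RunId"


def interpolate(expression: str) -> str:
    """Wrap an expression in the @{...} form used inside a larger string."""
    body = expression[1:] if expression.startswith("@") else expression
    return "@{" + body + "}"


file_name = "Orders_" + interpolate(EXAMPLE_EXPRESSIONS["Date"]) + ".csv"
print(file_name)
```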
Integration Runtimes

1 Azure Integration Runtime
• Flexible Region
• Data Movements (Movement Hours)
• Activity Orchestration

2 SSIS Integration Runtime
• Specified Region
• SSIS Package Execution

3 Self Hosted Integration Runtime
• Virtual Machine
• Gateway Access
• Activity Orchestration
Running SSIS Packages in Azure Data Factory

• Visual Studio – SQL Server Data Tools (SSDT): SSIS Package, Parent/Child Packages
• Data Factory: SSIS Package Activity - Run Package
• SSIS IR: 10x Nodes, 8x Packages per Node, 80x
• SSISDB: Logical or Managed Instance
Single Hosted IR
On Premises | Azure
Multiple Hosted IRs (Failover & Load Balancing)
On Premises | Azure
Hosted IR Linked to Multiple Data Factories
On Premises | Azure
Using a Hosted IR with Express Route
On Premises | Azure
Recap: Data Factory

1 Linked Services
2 Data Sets
3 Activities
4 Pipelines
5 Triggers

1 Azure Integration Runtime
2 SSIS Integration Runtime
3 Self Hosted Integration Runtime
Data Factory Key Activities
Lookup Activity
Get Metadata to Support Other Control Flow Activities

Dataset -> Lookup -> Single Value Or Many Values [array]

https://docs.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity
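
The single value versus many values distinction shows up in the shape of the Lookup output. The sketch below mimics that shape in Python so the difference is easy to see; the function name and sample rows are hypothetical, and returning an empty result for no rows is a simplification.

```python
def lookup_output(rows, first_row_only=True):
    """Mimic the Lookup output: one row under 'firstRow', or all rows under 'value'."""
    if first_row_only:
        # Simplification: an empty result is returned when there are no rows.
        return {"firstRow": rows[0]} if rows else {}
    return {"count": len(rows), "value": rows}


rows = [{"TableName": "DimCustomer"}, {"TableName": "DimProduct"}]

single = lookup_output(rows, first_row_only=True)
many = lookup_output(rows, first_row_only=False)
print(single["firstRow"]["TableName"])
print([row["TableName"] for row in many["value"]])
```

In a pipeline the array form is what would typically be handed to a ForEach activity.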
ForEach Activity
Scaling Out Control Flow Activities

Lookup -> Many Values [array] -> ForEach [array]

IsSequential: true
[0] -> [1] -> [2] -> [3] -> [i]

In parallel
[0] [1] [2] [3] [4] [5] [6] [i]
Batch Count Default: 20
Batch Count Max: 50

Inner activities: Copy Data, Do Stuff
@item().

https://docs.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity
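
The sequential and parallel behaviour can be imitated locally with a thread pool. This is only an analogy for how ForEach fans out its inner activities, not the service's implementation; the function name and the stand-in inner activity are hypothetical, while the batch count default and maximum are the values from the slide.

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_BATCH_COUNT = 20  # default from the slide
MAX_BATCH_COUNT = 50      # maximum from the slide


def for_each(items, inner_activity, is_sequential=False, batch_count=DEFAULT_BATCH_COUNT):
    """Run inner_activity once per item, one at a time or in parallel batches."""
    if is_sequential:
        return [inner_activity(item) for item in items]
    if not 1 <= batch_count <= MAX_BATCH_COUNT:
        raise ValueError(f"batch_count must be between 1 and {MAX_BATCH_COUNT}")
    # At most batch_count items are processed at the same time.
    with ThreadPoolExecutor(max_workers=batch_count) as pool:
        return list(pool.map(inner_activity, items))


def copy_data(item):
    """Stand-in for an inner activity; 'item' plays the role of @item()."""
    return f"copied {item}"


tables = ["DimCustomer", "DimProduct", "FactOrders"]
print(for_each(tables, copy_data, is_sequential=True))
print(for_each(tables, copy_data, batch_count=2))
```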
Custom Activity
Extend Data Factory with Custom Code

Custom
Reference Objects
• Datasets: []
• Linked Services: []

Linked Services
• Azure Batch
• Azure Blob Storage
• ???

https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity
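
Putting those pieces together, a Custom activity definition might look roughly like the sketch below. All names, the command and the folder path are hypothetical placeholders; the point is where the Azure Batch linked service, the Blob Storage linked service and the reference objects sit.

```python
import json

# A sketch of a Custom activity definition as a JSON-serialisable dict.
custom_activity = {
    "name": "RunCustomCode",  # hypothetical activity name
    "type": "Custom",
    # Compute: the Azure Batch linked service that runs the code.
    "linkedServiceName": {
        "referenceName": "AzureBatchLinkedService",  # hypothetical
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "command": "MyConsoleApp.exe",  # hypothetical executable
        # Storage: where the application files are held.
        "resourceLinkedService": {
            "referenceName": "BlobStorageLinkedService",  # hypothetical
            "type": "LinkedServiceReference",
        },
        "folderPath": "customactivity/app",  # hypothetical path
        # Extra objects made available to the custom code at run time.
        "referenceObjects": {"linkedServices": [], "datasets": []},
    },
}

print(json.dumps(custom_activity, indent=2))
```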
Building a Custom Activity
A .Net Console App Executed Using Azure Batch Service.

• Solution & Projects: C# Classes
• Custom Application
• Data Factory Pipeline & Custom Activity
• Batch Service: Compute Pool
• Azure AD
• Key Vault
• Data Lake, Blob Store: Raw Data, Clean Data

https://mrpaulandrew.com/2018/11/12/creating-an-azure-data-factory-v2-custom-activity/
Recap: Data Factory – Useful Activities

1 Lookup – Get value(s) from lots of places to support other activities.

2 ForEach – Iteration of other activities, sequentially or in parallel.

3 Custom – Extensibility - code executed by Azure Batch compute pools.
Simple Dynamic Copy Pipeline
Demo Architecture

• Paul’s Laptop: C:\, Hosted IR
• Data Factory
• SQLDB
• Data Lake Store Gen2
• Function App
• Office 365


Data Factory Data Flows

Data Factory Control Flow Components
Copy Data Transform

1 Linked Services
2 Data Sets
3 Activities
4 Pipelines
5 Triggers

Data Factory Data Flows
• Wrangling Data Flow
• Mapping Data Flow
Mapping Data Flows
What is a Mapping Data Flow?

• Mapping Data Flow -> Azure Databricks
• Data Factory Portal Interface
• Data Factory

https://mrpaulandrew.com/2018/10/22/what-is-azure-data-factory-data-flow/
Mapping Data Flows – Settings & Concepts
Mapping Data Flows – Transformations
Multicast
Merge Join

Script Component
Mapping Data Flows – Expression Builder
Mapping Data Flows – Debug Mode

General Purpose cluster


Wrangling Data Flows
What is a Wrangling Data Flow?

Power BI Desktop

Data Factory Portal


What is a Wrangling Data Flow?

• Wrangling Data Flow -> Azure Databricks
• Data Factory Portal
• Data Factory
Data Flow - Cluster Configuration
Azure
Databricks

• General Purpose
• Memory Optimised
• Compute Optimised
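
The cluster type is chosen on the Azure Integration Runtime used by the Data Flow. A hedged sketch of how that setting might appear in the runtime definition is shown below; the mapping from the slide's labels to computeType values, the runtime name, and the core count and time-to-live values are all assumptions for illustration.

```python
# Assumed mapping from the slide's cluster labels to computeType values.
COMPUTE_TYPES = {
    "General Purpose": "General",
    "Memory Optimised": "MemoryOptimized",
    "Compute Optimised": "ComputeOptimized",
}


def data_flow_runtime(name: str, cluster: str, core_count: int = 8, ttl_minutes: int = 10) -> dict:
    """Build an Azure IR definition carrying Data Flow cluster settings."""
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    "dataFlowProperties": {
                        "computeType": COMPUTE_TYPES[cluster],
                        "coreCount": core_count,      # illustrative value
                        "timeToLive": ttl_minutes,    # illustrative value
                    },
                }
            },
        },
    }


runtime = data_flow_runtime("DataFlowRuntime", "Memory Optimised")
print(runtime["properties"]["typeProperties"]["computeProperties"]["dataFlowProperties"])
```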
Recap: Data Factory – Data Flows

1 Mapping – Similar to SSIS Data Flows in appearance.

2 Wrangling – Similar to Power BI Power Query.

3 Data Flow Cluster Config – Via the Data Factory Azure IR
Data Factory DevOps – CI/CD
PaulsFunFactory

Documentation
Author & Monitor

• When connected to source control (Git/GitHub) switch between development and deployed instances.
• Repository code branches.
• Simple mandatory key value pair checking of JSON components.
• View JSON for current component.
• ZIP extract/import of complete data factory with parameters file.
• Save, creates a commit to the connected repository.
• Discard dirty panels (unsaved changes). Source code ‘Undo’.
• Deploy ARM template from development environment to deployed service.
• Debug the Data Flow. Get a cluster ready.
• Debug the Control Flow. Run the pipeline.
Data Factory Source Code ARM Template Export

Git

master*
adf_publish Deployed
Developer Service
Save
Debug Dev

“ARMTemplateForFactory.json”

* Can only ‘Publish’ from your default code branch.
Data Factory Deployments
Option 1 – Single Data Factory Service Debug
Git branch Service

save
merge

master publish
Deployed
Developer Service

Dev Prod
Option 2 – ARM Templates for Multiple Data Factory Services
Debug
Test Service Live Service
Git branch Service

save
merge

master ARM
Template
Developer
Dev Test Prod
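
With Option 2, the same exported ARM template is deployed to each environment and only the parameter values change. The sketch below generates one ARM parameters file per environment; the factory names are hypothetical, and it assumes the exported template exposes a "factoryName" parameter.

```python
import json

PARAMETERS_SCHEMA = (
    "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#"
)


def arm_parameters(values: dict) -> dict:
    """Wrap plain key/value pairs in the ARM deployment parameters file format."""
    return {
        "$schema": PARAMETERS_SCHEMA,
        "contentVersion": "1.0.0.0",
        "parameters": {key: {"value": value} for key, value in values.items()},
    }


# Hypothetical factory names, one per environment.
environments = {
    "Test": {"factoryName": "PaulsFunFactoryTest"},
    "Prod": {"factoryName": "PaulsFunFactoryProd"},
}

parameter_files = {env: arm_parameters(values) for env, values in environments.items()}
print(json.dumps(parameter_files["Test"], indent=2))
```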
Recap: Data Factory – DevOps

1 Development and Debugging – Via the Portal UI

2 Source Code – 3x choices of JSON files!

3 Deployment – Several options depending on requirements.
Summary
What is Azure Data Factory?

Copy Data Transform

• Lookup
• ForEach
• Copy Data
• Custom
• Do Stuff

1. Orchestrator of our Control Flow operations – with scale out Activities.


2. Orchestrator of our Data Flow transformations – using cloud native services.
3. The scheduler of solutions – using a variety of Pipeline Triggers.
Thank you for listening…
Paul Andrew

Blog: mrpaulandrew.com
Email: paul@mrpaulandrew.com

Twitter: @mrpaulandrew
LinkedIn: In/mrpaulandrew
GitHub: github.com/mrpaulandrew
Slides: Community Events Repository
