Tim Warner
timw.info/dp203
Course Expectations
• We'll learn by doing – at least 80 percent demo
• Case study approach
• Please review the recordings…several times!
• I'm here to answer your questions – take advantage of this
• Use the Q&A panel
Break Schedule (Central Time Zone)
• 07:00am - Start
• 08:00am - 9-minute break
• 09:00am - 9-minute break
• 10:00am - 9-minute break
• 11:00am - 9-minute break
• 12:00pm - Finish
Session Recordings
Mobile Browser: learning.oreilly.com
O'Reilly Mobile App
Exam DP-203 Latest Changes: Nov 2023
What is an Azure Data Engineer?
• Design and implement the management, monitoring, security, and privacy of data using the full stack of data services
• “Builds and tunes data pipelines”
• “Implements, monitors, and optimizes data platforms”
• “Has solid knowledge of SQL, Python, or Scala”
• The Azure Data Scientist consumes the data the Engineer provides
Azure Data Engineer Associate
DP-203: Data Engineering on Microsoft Azure
DP-900: Microsoft Azure Data Fundamentals
Azure Data Scientist Associate
DP-100
Azure Data Analyst Associate
DA-100
Azure Cosmos DB Developer
DP-420
Tim's Certification Study Model
Thank you!
Exam DP-203
Case Study
Contoso Retail - Data Modernization
• Company: Contoso Retail (a large, multinational retailer with online and physical stores)
• Pain Points:
• Siloed data systems
• Manual, time-consuming data preparation
• Difficulty with real-time analytics
• Desire to implement predictive models
• Project Goals:
• Centralized Azure data platform
• Streamlined data transformation
• Optimized analytics data warehouse
• Enable predictive analytics
Contoso Retail Architecture Overview
Visual Diagram:
• Azure Data Lake Storage Gen2 (ADLS Gen2): Central box labeled "Data Lake"; mention hierarchical structure and partitioning.
• Azure Data Factory (ADF): Arrows coming INTO the data lake from boxes labeled "On-Premises SQL Server", "PostgreSQL", and "Sales CSVs", emphasizing multiple sources.
• Azure Databricks: Box with an arrow pointing OUT of the data lake, labeled "Transformations & Feature Engineering (Spark)".
• Azure Synapse Analytics: Box with an arrow pointing IN from the data lake, labeled "Dedicated SQL Pool (Partitioned)", and another connected box labeled "Serverless SQL Pool (Exploration)".
• Azure Stream Analytics: Small box near the top labeled "Social Media Data", with an arrow into the data lake and an arrow out to a box labeled "Recommendations Engine".
• Power BI: Final box, with an arrow from the Azure Synapse Analytics Dedicated SQL Pool.
Data Ingestion and Transformation
• Data Sources
• On-premises databases (SQL Server and PostgreSQL)
• Varied file formats (CSV, etc.)
• Secure Connectivity
• Emphasis on secure methods for accessing on-premises data
• Azure Data Lake Storage Gen2
• Mention of partitioning strategies
• Azure Databricks
• Large-scale transformations with Spark, addressing data quality (missing values, duplicates)
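A common partitioning strategy in ADLS Gen2 is a date-based folder layout, which lets downstream engines prune files by date. A minimal sketch (the container and dataset names are illustrative, not from the case study):

```python
from datetime import date

def partition_path(container: str, dataset: str, d: date) -> str:
    """Build a date-partitioned ADLS Gen2 path (year=/month=/day= layout).

    The container and dataset names here are made up for illustration.
    """
    return (f"{container}/{dataset}/"
            f"year={d.year}/month={d.month:02d}/day={d.day:02d}/")

print(partition_path("datalake", "sales", date(2023, 11, 5)))
# datalake/sales/year=2023/month=11/day=05/
```

The `year=`/`month=`/`day=` key-value style is the layout that Spark and Synapse serverless SQL recognize for partition pruning.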
Data Warehousing, Analytics, & Predictions
• Azure Synapse Analytics:
• Dedicated SQL pool for efficient querying (highlight table partitioning/distribution); serverless SQL pools for exploration
• Security:
• Row-level security, data masking for sensitive information
• Power BI:
• Visualizations, dashboards, and reporting
• Predictive Analytics:
• Azure Stream Analytics for real-time processing for recommendations
• Databricks for demand forecasting and model development
Data Fundamentals
Data Types
[Diagram: structured data shown as two related tables, plus streaming data]
• Customer table: CustomerID, CustomerName, CustomerPhone
• Orders table: OrderID, CustomerID, OrderDate
• Streaming
Data Processing
[Diagram: raw data → data processing (Functions, Cognitive Services) → complex processing / extract → Azure Synapse]
Data Analytics
[Diagram: data sources → data ingestion → data storage → data processing → data visualization]
• On-premises data: SQL Server, Oracle, file shares, SAP
• Cloud data: Azure, AWS, GCP
• SaaS data: Salesforce, Dynamics
Non-Binary Data Formats
• CSV
• Good for bandwidth-sensitive data loads
• JSON
• Clear, structured format with optional validation
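The contrast between the two formats shows up even with Python's standard `csv` and `json` modules; the sample records below are invented for illustration:

```python
import csv
import io
import json

# CSV: compact, no schema; every value arrives as a string.
csv_text = "OrderID,CustomerID,OrderDate\n1001,42,2023-11-05\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["CustomerID"])   # '42' -- a string; typing is up to the reader

# JSON: self-describing structure with native types.
json_text = '{"OrderID": 1001, "CustomerID": 42, "OrderDate": "2023-11-05"}'
order = json.loads(json_text)
print(order["CustomerID"])     # 42 -- an int, preserved by the format
```

This is the practical meaning of "optional validation" for JSON: the structure carries types, and a JSON Schema can validate it, whereas CSV leaves all interpretation to the consumer.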
Binary Data Formats
• Optimized for splitting across compute nodes
• Parquet, ORC: Columnar stores
• Fast read performance (compression) for analytical workloads
• Avro: Row-based store whose schema is defined in JSON
• Schematized
• Optimized for write performance
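A toy sketch of the row-versus-columnar trade-off in pure Python, with invented sample data: summing one column touches a single contiguous list in the column layout, while the row layout must visit every record to pull out one field.

```python
# Row store (Avro-style): each record kept together.
# Good for writes and whole-record reads.
row_store = [
    {"OrderID": 1, "Amount": 250.0, "Region": "EMEA"},
    {"OrderID": 2, "Amount": 125.0, "Region": "APAC"},
    {"OrderID": 3, "Amount": 300.0, "Region": "EMEA"},
]

# Column store (Parquet/ORC-style): each column kept together.
# Good for analytical scans and compression (similar values adjacent).
col_store = {
    "OrderID": [1, 2, 3],
    "Amount":  [250.0, 125.0, 300.0],
    "Region":  ["EMEA", "APAC", "EMEA"],
}

# SUM(Amount): the column store reads one contiguous list...
total_columnar = sum(col_store["Amount"])
# ...while the row store must touch every record for one field.
total_rowwise = sum(r["Amount"] for r in row_store)

print(total_columnar, total_rowwise)  # 675.0 675.0
```

Same answer either way; the difference is how much data each layout has to read, which is why columnar formats dominate analytical workloads.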
Store
Azure Data Lake Storage
High-performance data lake available in all 54 Azure regions
Data Lake Storage Gen 2
Azure Data Lake Storage Gen 2
Access Tiers & Lifecycle Management
PolyBase
Data Warehouse Star Schema
Data Warehouse Snowflake Schema
Azure Synapse
Azure Synapse SQL Pool (DW) Architecture
Synapse SQL Pool Types
Azure Synapse Table Distribution Modes
https://timw.info/0jl
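A rough sketch of how hash distribution behaves, using md5 as a stand-in for the pool's internal (undocumented) hash function; the constant 60 reflects how a dedicated SQL pool spreads every table across 60 distributions:

```python
import hashlib

DISTRIBUTIONS = 60  # a dedicated SQL pool always uses 60 distributions

def assign_distribution(key) -> int:
    """Toy stand-in for Synapse's hash; md5 here is only for the sketch."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % DISTRIBUTIONS

# Hash-distribute some CustomerID values: the same key always lands in the
# same distribution, which is what makes joins and aggregations on the
# distribution key cheap (no data movement).
keys = [1001, 1002, 1003, 1001]
placements = [assign_distribution(k) for k in keys]
assert placements[0] == placements[3]  # identical keys co-locate

# Round-robin (the default) just cycles through distributions evenly,
# with no co-location guarantee.
round_robin = [i % DISTRIBUTIONS for i in range(len(keys))]
print(placements, round_robin)
```

The third mode, replicated, simply copies the whole (small) table to every compute node instead of splitting it.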
Slowly Changing Dimensions (SCD)
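A minimal Python sketch of a Type 2 SCD change, assuming illustrative column names (StartDate/EndDate/IsCurrent): the current row is expired and a new versioned row is appended, preserving history.

```python
from datetime import date

def scd2_upsert(dim_rows, key, new_attrs, today):
    """SCD Type 2: expire the current row and append a new versioned row.

    dim_rows is a list of dicts holding a business key, attributes, and
    start/end/current bookkeeping columns. Names are illustrative.
    """
    for row in dim_rows:
        if row["CustomerID"] == key and row["IsCurrent"]:
            if row["City"] == new_attrs["City"]:
                return dim_rows          # no change: nothing to do
            row["EndDate"] = today       # close out the old version
            row["IsCurrent"] = False
    dim_rows.append({
        "CustomerID": key, **new_attrs,
        "StartDate": today, "EndDate": None, "IsCurrent": True,
    })
    return dim_rows

dim = [{"CustomerID": 7, "City": "Nashville",
        "StartDate": date(2020, 1, 1), "EndDate": None, "IsCurrent": True}]
scd2_upsert(dim, 7, {"City": "Denver"}, date(2023, 11, 5))
print(len(dim), dim[0]["IsCurrent"], dim[1]["IsCurrent"])  # 2 False True
```

Type 1, by contrast, would simply overwrite City in place and lose the Nashville history.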
Azure Databricks
Lambda Architecture
Lambda Architecture with Databricks
Kappa Architecture
Kappa Architecture with Databricks
Data Security
Network security
Securing your network from attacks and unauthorized access is an important part of any architecture.
• Internet protection: Assess the resources that are internet-facing, and only allow inbound and outbound communication where necessary. Make sure you identify all resources that are allowing inbound network traffic of any type.
• Firewalls: To provide inbound protection at the service perimeter, there are several choices: Azure Firewall, Azure Application Gateway, and Azure Storage Firewall.
• DDoS protection: The Azure DDoS Protection service protects your Azure applications by scrubbing traffic at the Azure network edge before it can impact your service's availability.
• Network security groups: Network Security Groups allow you to filter network traffic to and from Azure resources in an Azure virtual network. An NSG can contain multiple inbound and outbound security rules.
Identity and access
Authentication: This is the process of establishing the identity of a person or service looking to access a resource. Azure Active Directory is a cloud-based identity service that provides this capability.
Authorization: This is the process of establishing what level of access an authenticated person or service has. It specifies what data they're allowed to access and what they can do with it. Azure Active Directory also provides this capability.
Azure Active Directory features:
• Single sign-on: Enables users to remember only one ID and one password to access multiple applications.
• Apps & device management: You can manage your cloud and on-premises apps and devices, and the access to your organization's resources.
• Identity services: Manage business-to-business (B2B) and business-to-customer (B2C) identity services.
Encryption
Encryption at rest: Data at rest is data that has been stored on a physical medium. This could be data stored on the disk of a server, data stored in a database, or data stored in a storage account.
Encryption in transit: Data in transit is data actively moving from one location to another, such as across the internet or through a private network. Secure transfer can be handled by several different layers.
Encryption on Azure:
• Raw encryption: Enables the encryption of Azure Storage and VM disks (Disk Encryption).
• Database encryption: Enables the encryption of databases using Transparent Data Encryption.
• Encrypting secrets: Azure Key Vault is a centralized cloud service for storing your application secrets.
Azure SQL Database Firewall Rules
Azure SQL Database Dynamic Data Masking (DDM)
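DDM is enforced by the database engine, not by client code; as a conceptual sketch only, the functions below imitate what a user without the UNMASK permission sees when the built-in email and default masks are applied (the masked shapes follow the documented patterns, but the function names are invented):

```python
def mask_email(value: str) -> str:
    """Approximates DDM's built-in email mask (aXX@XXXX.com):
    only the first character of the real address survives."""
    return value[0] + "XX@XXXX.com"

def mask_default(value: str) -> str:
    """Approximates DDM's default mask for string columns."""
    return "xxxx"

# What SELECT returns for a non-privileged user:
print(mask_email("tim@contoso.com"))        # tXX@XXXX.com
print(mask_default("4111-1111-1111-1111"))  # xxxx
```

On the server this is configured per column with T-SQL (e.g. `ALTER TABLE ... ALTER COLUMN ... ADD MASKED WITH (FUNCTION = 'email()')`); users granted UNMASK see the real values.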
Azure SQL Database Always Encrypted
Azure Data Factory
Power BI
What are data streams?
Data streams: In the context of analytics, data streams are event data generated by sensors or other sources that can be analyzed by another technology.
Data streams are used to:
• Analyze data: Continuously analyze data to detect issues and understand or respond to them.
• Understand systems: Understand component or system behavior under various conditions to fuel further enhancements of said system.
• Trigger actions: Trigger specific actions when certain thresholds are identified.
Data stream processing approaches: There are two approaches. Reference data is streaming data that can be collected over time and persisted in storage as static data. In contrast, streaming data has relatively low storage requirements and runs computations in sliding windows.
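The sliding-window idea can be sketched in a few lines of Python; the event stream and the five-second window below are invented for illustration:

```python
from collections import deque

def sliding_window_counts(events, window_seconds):
    """Count events in a sliding time window over an ordered stream.

    events: iterable of (timestamp_seconds, payload), sorted by timestamp.
    Yields (timestamp, count of events within the last window_seconds).
    """
    window = deque()
    for ts, _payload in events:
        window.append(ts)
        # Evict timestamps that have fallen out of the window.
        while window[0] <= ts - window_seconds:
            window.popleft()
        yield ts, len(window)

stream = [(1, "a"), (2, "b"), (3, "c"), (10, "d"), (11, "e")]
print(list(sliding_window_counts(stream, window_seconds=5)))
# [(1, 1), (2, 2), (3, 3), (10, 1), (11, 2)]
```

Only the events inside the current window are held in memory, which is why streaming computations have low storage requirements compared to persisting the full stream.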
Event processing
The process of consuming data streams, analyzing them, and deriving actionable insights from them is called event processing, and it has three distinct components:
• Event producer: An application or device, such as a sensor, that generates the event data stream.
• Event processor: An engine that consumes event data streams and derives insights from them. Depending on the problem space, event processors either process one incoming event at a time (such as a heart rate monitor) or process multiple events at a time (such as a highway toll lane sensor).
• Event consumer: An application which consumes the data and takes specific action based on the insights. Examples of event consumers include alert generation, dashboards, or even sending data to another event processing engine.
Processing events with Azure Stream Analytics
Microsoft Azure Stream Analytics is an event processing engine. It enables the consumption and analysis of high volumes of streaming data in real time.
Create the Event Hubs namespace:
1. In the Azure portal, select New, type Event Hubs, and then select Event Hubs from the resulting search. Then select Create.
2. Provide a name for the event hub, and then create a resource group. Specify xx-name-eh and xx-name-rg respectively, where xx represents your initials to ensure uniqueness of the Event Hub name and resource group name.
3. Click the checkbox to Pin to the dashboard, then select the Create button.
Create the event hub and grant access:
1. After the deployment is complete, click the xx-name-eh event hub on the dashboard.
2. Then, under Entities, select Event Hubs.
3. To create the event hub, select the + Event Hub button. Provide the name socialstudy-eh, and then select Create.
4. To grant access to the event hub, we need to create a shared access policy. Select the socialstudy-eh event hub when it appears, and then, under Settings, select Shared access policies.
5. Under Shared access policies, create a policy with MANAGE permissions by selecting + Add. Give the policy the name xx-name-eh-sap, check MANAGE, and then select Create.
6. Select your new policy after it has been created, and then select the copy button for the CONNECTION STRING – PRIMARY KEY entity.
7. Paste the CONNECTION STRING – PRIMARY KEY entity into Notepad; this is needed later in the exercise.
8. Leave all windows open.
Azure Stream Analytics workflow
Complex event processing of stream data in Azure
[Diagram: a pipeline with control flow and activities, running on an Integration Runtime (IR), moving data between a Data Lake Store and Azure Databricks; a dataset represents data item(s) stored in the Data Lake Store]
Azure Monitor
Data Pipelines
Azure Diagnostics
Log Analytics
Lambda architectures from a real-time mode perspective
Speed Layer: The Speed layer processes data streams in real or near-real time. This works well when the aim is to minimize the latency from data ingestion to analysis: new data is ingested from the sources, and real-time views of the data are created.
Serving Layer: The serving layer is optional in the real-time architecture. It acts as the storage output of either the Batch or Speed layer and is used by client applications to access the results of the datasets.
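A toy sketch of what the serving layer does in a Lambda architecture: merging a complete-but-stale batch view with an incremental speed view. The region totals are invented for illustration:

```python
def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Lambda serving-layer sketch: combine precomputed batch totals with
    real-time deltas for events the batch layer hasn't processed yet."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Batch layer: complete but hours old. Speed layer: recent events only.
batch_view = {"EMEA": 1200, "APAC": 800}
speed_view = {"EMEA": 50, "AMER": 10}
print(merge_views(batch_view, speed_view))
# {'EMEA': 1250, 'APAC': 800, 'AMER': 10}
```

When the next batch run catches up, the speed view is discarded and rebuilt, which is the essential difference from Kappa, where everything flows through the one streaming path.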
Architect a stream processing pipeline with Azure Stream Analytics
Design a stream processing pipeline with Azure Databricks
Automate an enterprise business intelligence architecture