Ramesh Retnasamy
Data Engineer/ Machine Learning Engineer
https://www.linkedin.com/in/ramesh-retnasamy/
About this course
PySpark
Spark SQL
Azure Databricks
Delta Lake
Azure Key Vault
Power BI
Formula1 Cloud Data Platform
ADF Pipelines
Who is this course for
University students
Data Architects
Who is this course not for
Those not interested in a hands-on learning approach
Prerequisites
Azure Account
Our Commitments
Orchestration
23. Azure Data Factory
24. Connecting Other Tools
Introduction to Azure Databricks
Azure Databricks
Databricks
Microsoft Azure
Apache Spark
Apache Spark is a lightning-fast unified analytics engine for big data processing and machine learning.
[Diagram: the Databricks platform builds on Spark Core with Scala, Python, Java and R APIs, and adds Workspace/ Notebooks, SQL Analytics, MLflow, Administration Controls, Delta Lake, Databases/ Tables, and an optimized Spark runtime (up to 5x faster).]
[Diagram: Azure Databricks = Databricks (Workspace/ Notebooks, MLflow, SQL Analytics, Administration Controls, Delta Lake, Databases/ Tables, optimized Spark up to 5x faster) + Microsoft Azure (unified portal and unified billing, Azure AD, data services, messaging services such as Azure IoT Hub and Azure Event Hub, Azure DevOps, Azure ML, Power BI).]
Creating Azure Databricks Service
Azure Databricks Architecture
[Diagram: the Control Plane (Databricks subscription) hosts the Databricks Cluster Manager and the Databricks UX, and integrates with Azure Resource Manager and Azure AD. The Data Plane (customer subscription) contains a VNet with the cluster VMs and DBFS storage. The workspace exposes notebooks, clusters, jobs, data and models.]
Databricks Clusters
What is a Databricks Cluster
Cluster Types
Cluster Configuration
Creating a cluster
Pricing
Cost Control
Cluster Pools
Cluster Policy
Databricks Cluster
[Diagram: a multi-node cluster has one driver VM and several worker VMs; a single-node cluster runs the driver and the workload on one VM.]
Cluster Configuration
Single/ Multi Node
Access Mode:
Single User: only one user access; supports Python, SQL, Scala, R
Shared: multiple user access; only available in Premium; supports Python, SQL
No Isolation Shared: multiple user access; supports Python, SQL
Custom: legacy configuration
Cluster Configuration
Single/ Multi Node
Access Mode
Databricks Runtime: Spark, Scala, Java, Python, R, Ubuntu, GPU libraries, Delta Lake libraries and other Databricks services
Databricks Runtime ML: everything from the Databricks Runtime, plus popular ML libraries (PyTorch, Keras, TensorFlow, XGBoost etc.)
Photon Runtime: everything from the Databricks Runtime, plus the Photon Engine
Auto Termination
Auto Scaling
Cluster VM Type/ Size: General Purpose, Compute Optimized, Storage Optimized, GPU Accelerated
Cluster Configuration
Single/ Multi Node
Access Mode
Databricks Runtime
Auto Termination
Auto Scaling
Cluster VM Type/ Size
Cluster Policy: simplifies the user interface, achieves cost control; only available on the Premium tier
Creating Databricks Cluster
Azure Databricks Cluster
Pricing
Azure Databricks Pricing Factors
Tier (Premium/ Standard)
Purchase Plan (Pay As You Go/ Pre-Purchase)
[Diagram: a cluster with one driver VM and three worker VMs.]
Azure Databricks Pricing Calculation
A Databricks Unit (DBU) is a normalized unit of processing power on the Databricks Lakehouse Platform, used for measurement and pricing purposes.
Azure Databricks Pricing Calculation
DBU Cost = No of DBUs x Price based on workload/ tier
Driver Node Cost = 1 (driver) x Price of VM
Worker Node Cost = No of workers x Price of VM
Total Cost of the Cluster = DBU Cost + Driver Node Cost + Worker Node Cost
Azure Databricks Pricing Calculation
https://azure.microsoft.com/en-gb/pricing/details/databricks/
Azure Databricks Pricing Calculation
https://azure.microsoft.com/en-gb/pricing/details/virtual-machines/linux/#pricing
Azure Databricks Pricing Calculation
DBU Cost = 0.75 DBUs x $0.55 = $0.4125
Driver Node Cost = 1 (driver) x $0.3510 = $0.3510
Worker Node Cost = 0 workers x $0.3510 = $0.00
Total Cost of the Cluster = $0.4125 + $0.3510 + $0.00 = $0.7635 per hour
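The calculation above can be sketched in plain Python. The rates used are the example values from this slide (a single-node cluster), not current Azure prices:

```python
# Sketch of the Azure Databricks cluster cost calculation shown above.
# Rates are the slide's example values, not current Azure prices.

def cluster_cost_per_hour(dbus_per_node: float, dbu_price: float,
                          vm_price: float, num_workers: int) -> float:
    """Total hourly cost = DBU cost + driver VM cost + worker VM cost."""
    num_nodes = 1 + num_workers                       # driver + workers
    dbu_cost = dbus_per_node * num_nodes * dbu_price  # DBU consumption cost
    vm_cost = num_nodes * vm_price                    # infrastructure cost
    return round(dbu_cost + vm_cost, 4)

# Single-node cluster from the worked example: 0.75 DBU, $0.55/DBU,
# $0.3510/hour VM, no workers.
print(cluster_cost_per_hour(0.75, 0.55, 0.3510, 0))  # 0.7635
```

Adding workers scales both the DBU consumption and the VM cost, which is why a small single-node cluster is the cheapest way to follow the course.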
Estimated cost for doing the course
Azure Data Lake Storage
Azure Data Factory: None - billed only for execution of the pipeline
Azure Databricks Job Cluster: None - destroyed once the job completes
Azure Databricks Cluster Pool: delete the cluster pool at the end of the lesson
Cluster Pools
[Diagram: a pool is created with idle instances 1 and max instances 2. Cluster 1 is created from the pool and acquires VM1 while VM2 stays idle; a second cluster, Cluster 2, then acquires VM2 from the pool.]
Creating Cluster Pool
Cluster Policy
What’s a notebook
Creating a notebook
Magic Commands
Databricks Utilities
Secrets Utilities
Widget Utilities
Access Azure Data Lake - Section Overview
ADLS Gen2
Azure Active Directory
Unity Catalog
abfss://demo@formula1dl.dfs.core.windows.net/
abfss://demo@formula1dl.dfs.core.windows.net/test/
abfss://demo@formula1dl.dfs.core.windows.net/test/circuits.csv
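The abfss URI pattern shown above can be captured in a small helper. This is a sketch; `formula1dl` and `demo` are the course's example storage account and container names:

```python
def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Build an ABFS(S) URI for an ADLS Gen2 container:
    abfss://<container>@<account>.dfs.core.windows.net/<path>"""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path}"

print(abfss_uri("demo", "formula1dl"))
print(abfss_uri("demo", "formula1dl", "test/circuits.csv"))
```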
Authenticate Using Access Keys
dbutils.fs.ls("abfss://demo@formula1dl.dfs.core.windows.net/")
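A minimal sketch of the access-key set-up this slide implies. It is a Databricks notebook fragment (`spark` and `dbutils` are predefined there), and the secret scope/key names are hypothetical; fetch the key from a secret scope rather than hard-coding it:

```python
# Databricks notebook fragment: authenticate to ADLS Gen2 with the
# storage account access key. Scope/key names below are hypothetical.
formula1dl_account_key = dbutils.secrets.get(
    scope="formula1-scope", key="formula1dl-account-key")

spark.conf.set(
    "fs.azure.account.key.formula1dl.dfs.core.windows.net",
    formula1dl_account_key)

# Verify access by listing the container contents
display(dbutils.fs.ls("abfss://demo@formula1dl.dfs.core.windows.net/"))
```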
Access Azure Data Lake
using
Shared Access Signature (SAS Token)
Shared Access Signature
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "<token>")
Access Azure Data Lake
using
Service Principal
Service Principal
3. Set Spark Config with App/ Client Id, Directory/ Tenant Id & Secret
[Diagram: the User runs a Notebook on a Cluster; the cluster reads the service principal credentials from Secrets and accesses ADLS Gen2.]
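Step 3 above amounts to setting the OAuth properties on the Spark session. A sketch as a Databricks notebook fragment; the secret scope/key names and the `formula1dl` account are the course's examples or assumptions:

```python
# Databricks notebook fragment: OAuth (service principal) access to ADLS Gen2.
# client_id, tenant_id and client_secret come from the Azure AD app
# registration; the secret scope/key names here are hypothetical.
client_id     = dbutils.secrets.get(scope="formula1-scope", key="client-id")
tenant_id     = dbutils.secrets.get(scope="formula1-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="formula1-scope", key="client-secret")

configs = {
    "fs.azure.account.auth.type.formula1dl.dfs.core.windows.net": "OAuth",
    "fs.azure.account.oauth.provider.type.formula1dl.dfs.core.windows.net":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id.formula1dl.dfs.core.windows.net": client_id,
    "fs.azure.account.oauth2.client.secret.formula1dl.dfs.core.windows.net": client_secret,
    "fs.azure.account.oauth2.client.endpoint.formula1dl.dfs.core.windows.net":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}
for key, value in configs.items():
    spark.conf.set(key, value)
```

The same properties can instead be set once in the cluster's Spark config, which is the cluster-scoped authentication pattern described next.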
Cluster Scoped Authentication
[Diagram: the User runs a Notebook on a Cluster whose Spark config is set (from Secrets) at cluster level; every notebook on the cluster accesses ADLS Gen2 with those credentials.]
AAD Credential Passthrough
[Diagram: Azure Active Directory assigns RBAC roles (Owner, Contributor, Reader, Custom etc.) on the Azure subscription to the AAD user, whose own identity is passed through to Azure Data Lake Gen2.]
Choosing an authentication approach for this course
Student Subscription or Company Subscription without access to AAD: use Cluster Scoped Authentication (via Access Keys)
Free Subscription, Pay-as-you-go Subscription or any other Subscription with access to AAD: use Service Principal
Securing Secrets
Secret scopes help store the credentials securely and reference them in notebooks, clusters and jobs when required.
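In a notebook, secrets are read through the secrets utility. A sketch (the scope and key names are hypothetical; create the scope against Azure Key Vault, or as a Databricks-backed scope, first):

```python
# Databricks notebook fragment: reading credentials from a secret scope.
dbutils.secrets.listScopes()              # all secret scopes in the workspace
dbutils.secrets.list("formula1-scope")    # keys within one scope (hypothetical name)

account_key = dbutils.secrets.get(scope="formula1-scope",
                                  key="formula1dl-account-key")
# Secret values are redacted if you try to print them in a notebook.
```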
[Diagram: a Databricks workspace with clusters (VMs) and DBFS backed by Azure Blob Storage.]
Databricks File System (DBFS) is a distributed file system mounted on the Databricks workspace
DBFS Root is the default storage for a Databricks workspace created during workspace deployment
DBFS Root
Backed by the default Azure Blob Storage
Can be accessed via the web UI
Default storage location for managed tables
Default storage location, but not recommended for user data
[Diagram: DBFS in the Databricks workspace exposes a /mnt mount point backed by Azure Blob Storage in the customer subscription.]
Benefits
Access files using file semantics rather than storage URLs (e.g. /mnt/storage1)
Stores files to object storage (e.g. Azure Blob), so you get all the benefits from Azure
Databricks Mounts
[Diagram: a mount is created using Service Principal credentials; the service principal is granted the required access on Azure Data Lake Gen2, and notebooks then reach the storage through DBFS at /mnt/storage1.]
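The mount flow above can be sketched as a Databricks notebook fragment. The secret scope/key names and the mount point are assumptions; the config keys and `dbutils.fs.mount` call follow the standard ADLS Gen2 service principal pattern:

```python
# Databricks notebook fragment: mount an ADLS Gen2 container via a
# service principal. Secret scope/key names are hypothetical.
client_id     = dbutils.secrets.get(scope="formula1-scope", key="client-id")
tenant_id     = dbutils.secrets.get(scope="formula1-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="formula1-scope", key="client-secret")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://demo@formula1dl.dfs.core.windows.net/",
    mount_point="/mnt/formula1dl/demo",
    extra_configs=configs)

dbutils.fs.ls("/mnt/formula1dl/demo")   # files now reachable via /mnt
```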
What is formula1
Project Requirements
Solution Architecture
Data Overview
Formula1
Formula1 Overview
Seasons
Race Weekend
Teams/ Constructors
Drivers
Qualifying and Qualifying Results
Race and Race Results
Laps
Pit Stops
Driver Standings
Constructor Standings
Formula1 Data Source
http://ergast.com/mrd/
Formula1 Data Source
Formula1 Data Files
Import Raw Data to Data Lake
Project
Requirements
Data Ingestion Requirements
Ingest All 8 files into the data lake
Driver Standings
Constructor Standings
Analysis Requirements
Dominant Drivers
Dominant Teams
https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/azure-databricks-modern-analytics-architecture
Databricks Architecture
https://databricks.com/solutions/data-pipelines
Spark Architecture
[Diagram: the Driver Node is a VM with multiple cores running the Driver Program.]
Source: https://www.bbc.co.uk/sport/formula1/drivers-world-championship/standings
Spark DataFrame
[Diagram: Data Source (CSV, JSON etc.) -> Data Sources API (DataFrame Reader methods) -> DataFrame -> DataFrame API -> Data Sources API (DataFrame Writer methods) -> Data Sink (ORC, Parquet etc.)]
Spark Documentation
Data Ingestion Overview
Data Ingestion Requirements
Ingest All 8 files into the data lake
Source formats: CSV, JSON, XML, JDBC, …
Read Data -> Transform Data -> Write Data
Sink formats: Parquet, Avro, Delta, JDBC, …
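The Read/ Transform/ Write pattern can be sketched in PySpark. File paths and column names follow the course's Formula1 layout and are assumptions; this runs in a Databricks notebook where `spark` is predefined:

```python
# Sketch of the Read -> Transform -> Write pattern (PySpark).
from pyspark.sql.functions import current_timestamp

# Read: CSV with a header row; inferSchema is fine for development,
# but define an explicit schema for production pipelines.
circuits_df = (spark.read
               .option("header", True)
               .option("inferSchema", True)
               .csv("/mnt/formula1dl/raw/circuits.csv"))

# Transform: rename columns and add an audit column
transformed_df = (circuits_df
                  .withColumnRenamed("circuitId", "circuit_id")
                  .withColumn("ingestion_date", current_timestamp()))

# Write: Parquet into the processed layer
(transformed_df.write
 .mode("overwrite")
 .parquet("/mnt/formula1dl/processed/circuits"))
```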
Data Types
Row
Column
Functions
Window
Grouping
Data Ingestion Overview
Data Ingestion - Circuits
Partition By race_year
Partition By race_id
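Partitioning on write can be sketched as follows (paths and the column name follow the course's layout; a Databricks notebook fragment):

```python
# Sketch: write race results partitioned by race_id; the writer creates
# one folder per distinct race_id value (e.g. race_id=1, race_id=2, ...).
results_df = spark.read.json("/mnt/formula1dl/raw/results.json")

(results_df.write
 .mode("overwrite")
 .partitionBy("race_id")
 .parquet("/mnt/formula1dl/processed/results"))
```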
Data Ingestion - Pitstops
Multi Line JSON: Read Data -> Transform Data -> Write Data (Parquet)
Data Ingestion - Laptimes
Include notebook
Notebook workflow
Databricks Jobs
Include notebook (%run)
Passing Parameters (widgets)
Notebook Workflow
Databricks Jobs
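The two notebook-reuse mechanisms above can be sketched as a Databricks notebook fragment (the widget, parameter and notebook names are hypothetical):

```python
# Sketch: parameterising and chaining notebooks in Databricks.

# Widgets expose parameters at the top of a notebook:
dbutils.widgets.text("p_data_source", "")          # creates a text widget
v_data_source = dbutils.widgets.get("p_data_source")

# A child notebook can be included inline with the %run magic, e.g.:
#   %run ../includes/configuration
# which makes its variables and functions available in this notebook.

# Alternatively, run a notebook as a separate job-like invocation,
# passing parameters that the child reads via its own widgets:
dbutils.notebook.run("ingest_circuits", 600,       # 600s timeout
                     {"p_data_source": "Ergast API"})
```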
Filter/ Join Transformations
Filter Transformation
Join Transformations
Simple Aggregations
Grouped Aggregations
Window Functions
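The three kinds of operations above come together in examples like ranking drivers by points within each race year. A sketch; the table path and column names follow the course's presentation layer and are assumptions:

```python
# Sketch: rank drivers by total points within each race year using a
# window function (Databricks notebook fragment).
from pyspark.sql.functions import desc, rank
from pyspark.sql.window import Window

driver_standings_df = spark.read.parquet(
    "/mnt/formula1dl/presentation/driver_standings")

# Partition by year, order by points descending within each partition
driver_rank_spec = Window.partitionBy("race_year") \
                         .orderBy(desc("total_points"))

ranked_df = driver_standings_df.withColumn(
    "rank", rank().over(driver_rank_spec))
```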
Spark SQL
Hive Meta Store: Databricks default, or an external meta store (Azure SQL, MySQL etc.)
[Diagram: a Databricks Workspace contains Databases; a Database contains Tables (Managed or External) and Views.]
Managed/ External Tables
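The distinction can be sketched as a Databricks notebook fragment (database name and paths follow the course's layout and are assumptions):

```python
# Sketch: managed vs external tables in the Hive metastore.
spark.sql("CREATE DATABASE IF NOT EXISTS f1_processed")

circuits_df = spark.read.parquet("/mnt/formula1dl/processed/circuits")

# Managed table: Spark manages both metadata and data;
# DROP TABLE deletes the underlying files as well.
circuits_df.write.mode("overwrite").saveAsTable("f1_processed.circuits")

# External table: Spark manages only the metadata; the data stays at the
# path you specify, and DROP TABLE leaves the files in place.
circuits_df.write.mode("overwrite") \
    .option("path", "/mnt/formula1dl/processed/circuits_ext") \
    .saveAsTable("f1_processed.circuits_ext")
```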
ADF Pipelines
SQL Basics
Simple Functions
Aggregate Functions
Window Functions
Joins
Dominant Drivers/ Teams Analysis
Implementation
Data Load Types

Full Refresh/ Load
Day 1: the source holds Race 1; ingestion and transformation load Race 1 into the data lake.
Day 2: the source holds Races 1 and 2; every race is re-ingested and re-transformed, and the data lake holds Races 1 and 2.
Day 3: the source holds Races 1 to 3; again every race is re-processed, and the data lake holds Races 1 to 3.

Incremental Load
Day 1: the source delivers Race 1; the data lake holds Race 1.
Day 2: the source delivers only Race 2, which is appended; the data lake holds Races 1 and 2.
Day 3: the source delivers only Race 3, which is appended; the data lake holds Races 1 to 3.

Hybrid Scenarios
Formula1 Scenario/ Solution 1
[Diagram: source files covering Race 2 through Race 1047 flowing through ingestion and transformation.]
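The difference between the two load types can be simulated in plain Python. This is a toy model; the races are just labels and the "data lake" is a list:

```python
# Toy model of the two load types. cumulative_sources[d] is everything
# the source holds on day d; daily_deltas[d] is only what is new on day d.

def full_refresh(cumulative_sources):
    """Full refresh: each day the whole source is re-ingested,
    replacing the data lake's contents."""
    lake = []
    for source_snapshot in cumulative_sources:
        lake = list(source_snapshot)        # wipe and reload everything
    return lake

def incremental_load(daily_deltas):
    """Incremental: each day only the new races arrive and are appended."""
    lake = []
    for delta in daily_deltas:
        lake.extend(delta)                  # append only the new data
    return lake

# Day 1..3 as in the slides:
cumulative = [["Race 1"], ["Race 1", "Race 2"], ["Race 1", "Race 2", "Race 3"]]
deltas     = [["Race 1"], ["Race 2"], ["Race 3"]]

assert full_refresh(cumulative) == incremental_load(deltas)
print(incremental_load(deltas))   # ['Race 1', 'Race 2', 'Race 3']
```

Both strategies end with the same lake contents; the full refresh simply re-processes ever more data each day, which is why incremental loads scale better.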
Lakehouse Architecture
[Diagram: Data Sources (Operational Data, External Data) -> ETL -> Data Warehouse/ Mart -> Data Consumers]
Data Warehouse
Scalability
[Diagram: Data Sources -> Ingest -> Transform -> Data Science/ ML workloads and BI Reports]
Data Lake
No support for ACID transactions
Failed jobs leave partial files
Inconsistent reads
Unable to handle corrections to data
No history or versioning
Poor performance
Poor BI support
Complex to set up
Lambda architecture for streaming workloads
Data Lakehouse
[Diagram: Data Sources (Operational Data, External Data) -> Ingest -> Transform -> Data Science/ ML workloads and BI Reports]
Data Lakehouse
Handles all types of data
Cheap cloud object storage
Uses open source format
Support for all types of workloads
Ability to use BI tools directly
ACID support
History & Versioning
Better performance
Simple architecture
Delta Lake
Overview
Creating Pipelines
Creating Triggers
Azure Data Factory Overview
What is Azure Data Factory
Multi Cloud
SaaS Apps
Data Formats
On Premises
The Data Problem
[Diagram: Data Sources -> Ingest -> Transform/ Analyze -> Publish -> Data Consumers]
Serverless
A fully managed, serverless data integration solution for ingesting, preparing and transforming all of your data at scale.
What Azure Data Factory Is Not
Pipeline
[Diagram: a Pipeline contains Activities; Activities work with Datasets and connect to Storage (ADLS, SQL Database) and Compute (Azure Databricks, Azure HDInsight).]
Connecting from Power BI
Unity Catalog - Introduction
Unity Catalog
[Diagram: Unity Catalog provides Data Access Control, Data Audit, Data Lineage and Data Discoverability.]
Without Unity Catalog
[Diagram: each Databricks workspace has its own Compute, Metastore, Access Control and User Management.]
Unity Catalog
Data Explorer
Audit Logs
Data Lineage
Unity Catalog
[Diagram: Unity Catalog provides Data Access Control, Data Audit, Data Lineage and Data Discoverability.]
Unity Catalog is Databricks' unified solution for implementing data governance in the Data Lakehouse.
Unity Catalog Set-up
[Diagram: a single Unity Catalog Metastore is attached to one or more Databricks workspaces. The metastore has a default storage location (an ADLS Gen2 container), which it reaches through a Managed Identity/ Service Principal configured via the Access Connector for Databricks.]
Unity Catalog Object Model
Metastore: the top level container
Metastore -> Catalog -> Schema (Database) -> Tables (External or Managed)
Unity Catalog Object Model
[Diagram: one Metastore containing a catalog per project and environment, e.g. F1_Dev, F1_Test, F1_Prod, F1_Sandbox, Football_Dev, Football_Test, Football_Prod, Football_Sandbox, Cricket_Dev, Cricket_Test, Cricket_Prod, Cricket_Sandbox; each catalog contains schemas, and each schema contains tables.]
Accessing External Locations
Unity Catalog Configuration
[Diagram: the Unity Catalog Metastore reaches its storage through the Access Connector for Databricks.]
Storage Credential/ External Location
Storage Credential: a credential object stored centrally in the metastore; used on the user's behalf to access cloud storage; created using a Managed Identity/ Service Principal.
External Location: an object stored centrally in the metastore; combines the ADLS path with the storage credential; subject to access control policies.
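An external location can be defined in SQL once the storage credential exists. A sketch as a notebook fragment in a Unity Catalog enabled workspace; the object names and ADLS URL are hypothetical:

```python
# Notebook fragment: define an external location over an ADLS path.
# The storage credential is typically created first (e.g. from the
# Catalog UI) using the Access Connector for Databricks.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS databricks_course_ext_location
    URL 'abfss://demo@formula1dl.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL databricks_course_storage_credential)
""")

# With the external location in place, users with the right privileges
# can read the path directly:
dbutils.fs.ls("abfss://demo@formula1dl.dfs.core.windows.net/")
```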
Accessing External Locations
[Diagram: a user runs compute in the Databricks workspace; Unity Catalog uses a Storage Credential (a Managed Identity/ Service Principal registered via the Access Connector for Databricks) to access the external data lake (an ADLS Gen2 storage container) on the user's behalf.]
Mini Project - Overview
[Diagram: a Workflow reads the Drivers and Results tables and produces the Driver_Wins table.]
Mini Project – Create Catalogs & Schemas
[Diagram: the Metastore has a mandatory default storage location (an ADLS Gen2 container). The formula1_dev Catalog can specify an optional managed location (Managed Location 2), and a Schema can specify its own (Managed Location 3). Tables are either External or Managed.]
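The catalog and schema creation can be sketched in SQL from a notebook. The schema names and the managed location URL are hypothetical; a `MANAGED LOCATION` is optional and must sit under a defined external location:

```python
# Notebook fragment (Unity Catalog enabled workspace): create the mini
# project's catalog and schemas. Names/URL below are hypothetical.
spark.sql("CREATE CATALOG IF NOT EXISTS formula1_dev")
spark.sql("USE CATALOG formula1_dev")

# Schema with its own managed location (optional):
spark.sql("""
    CREATE SCHEMA IF NOT EXISTS bronze
    MANAGED LOCATION 'abfss://bronze@formula1dl.dfs.core.windows.net/'
""")

# Schemas that fall back to the catalog/metastore storage:
spark.sql("CREATE SCHEMA IF NOT EXISTS silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS gold")
```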
Mini Project – Create External Tables
Unity Catalog – Data Discoverability
Unity Catalog – Data Audit
Unity Catalog – Data Lineage
Data Lineage
Data Lineage is the process of following/ tracking the journey of data within a pipeline.
Workflows
Better compliance
Data Lineage - Limitations
Unity Catalog – Data Access Control
Unity Catalog - Securable Objects
Unity Catalog - Security Model
Identities: Users, Service Principals, Groups
Security Model – Role Based Access
Roles: Metastore Admin, Object Owner, Object User
Security Model – Access Control List
Identities (Users, Service Principals, Groups) are granted Privileges (CREATE, USAGE, SELECT, …) on securable objects.
Security Model – Access Control List
[Diagram: privileges are granted on objects in the hierarchy Metastore -> Catalog -> Schema (Database) -> Tables (External or Managed).]
Metastore: CREATE CATALOG
Catalog: USE CATALOG, CREATE SCHEMA
Schema (Database): USE SCHEMA, CREATE TABLE/ FUNCTION
Table/ View: SELECT, MODIFY
Function: EXECUTE
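The privilege grants above map directly onto SQL. A sketch as a notebook fragment; the group and object names are hypothetical:

```python
# Notebook fragment: grant privileges down the Unity Catalog hierarchy.
# Group and object names below are hypothetical.
spark.sql("GRANT USE CATALOG ON CATALOG formula1_dev TO `data_engineers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA formula1_dev.silver TO `data_engineers`")
spark.sql("GRANT SELECT ON TABLE formula1_dev.silver.drivers TO `data_engineers`")

# Inheritance: a grant at catalog level flows down to every schema and table
spark.sql("GRANT SELECT ON CATALOG formula1_dev TO `data_analysts`")
```

Note that reading a table still requires USE CATALOG and USE SCHEMA on its parents, which is why the grants are usually issued level by level as above.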
Access Control List – Inheritance Model
[Diagram: privileges granted at a higher level are inherited downwards; e.g. ALL PRIVILEGES on a Catalog implies USE CATALOG and CREATE SCHEMA on the catalog, USE SCHEMA and CREATE TABLE/ FUNCTION on its schemas, and SELECT, MODIFY and EXECUTE on the tables and functions they contain.]