Design and implement data storage

Azure Storage Accounts
This is a service on the Azure platform that allows you to store different data objects.
Blob storage – This is Microsoft's object storage solution in the cloud.

You can use this service for storing large amounts of unstructured data.
Data services
Azure Data Lake Storage Gen2
This service provides features that are dedicated to big data analytics.

This service is built on top of Azure Blob Storage.

You can use it as a central repository for storing your structured and unstructured data.
Azure Synapse
This is an enterprise analytics service.

Here you can store your data in data warehouses and query it accordingly.

It also has Spark capabilities and Pipelines for data integration.
Azure Data Factory
This is a cloud-based ETL tool. It also acts as a data integration service.

You can create workflows to orchestrate data movement and transform data.

It has support for a variety of data stores.
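For illustration, a pipeline activity in Azure Data Factory is ultimately defined as JSON. The sketch below shows the rough shape of a Copy activity; the dataset names are hypothetical placeholders, not values from this document:

```json
{
    "name": "CopyBlobToSqlPool",
    "type": "Copy",
    "inputs": [ { "referenceName": "SourceBlobDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "SinkSqlDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "SqlDWSink" }
    }
}
```

In practice you rarely author this JSON by hand; the ADF designer generates it from the visual pipeline.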
Azure Stream Analytics
This is a fully managed stream processing engine.
Here you can analyze and process large volumes of streaming data.

You can ingest data from different data sources.
Azure Databricks
This is a set of tools that can be used for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale.

You can use this service to process and analyze your data.

This service makes use of Apache Spark to process data.
Design and implement data storage
- Azure Synapse Analytics
Best Practices
Distributed Tables

Hash-distribution – This helps in improving query performance on large tables.

Round-Robin distribution – This helps in improving loading speed. Choose Round-Robin when there is no clear candidate column for hash distribution.

Fact Tables – Best to use hash-distribution.

Dimension Tables – Best to use replicated tables.
1 Choosing the right column – Ensure that data gets distributed evenly across the distributions.

2 Data Skew – If the data is not distributed properly, it can lead to data skew.

3 Data Movement – Choose a column that minimizes data movement, i.e. one that is used in JOIN and GROUP BY clauses.
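The distribution choices above can be sketched in plain Python. This is a toy model only (not how the Synapse engine is implemented); the one fact it relies on is that a dedicated SQL pool always spreads data across 60 distributions.

```python
# Toy model of table distribution in a dedicated SQL pool.
# Fact from the service: data is always spread across 60 distributions.
NUM_DISTRIBUTIONS = 60

def hash_distribution(rows, key):
    """Hash-distribute: the same key value always lands in the same
    distribution, which makes JOIN/GROUP BY on that key cheap but can
    skew the data if the key has few distinct values."""
    buckets = {}
    for row in rows:
        d = hash(row[key]) % NUM_DISTRIBUTIONS
        buckets.setdefault(d, []).append(row)
    return buckets

def round_robin_distribution(rows):
    """Round-robin: rows are dealt out evenly in turn, regardless of
    content, which loads fast but forces data movement later for
    key-based operations."""
    buckets = {}
    for i, row in enumerate(rows):
        buckets.setdefault(i % NUM_DISTRIBUTIONS, []).append(row)
    return buckets

# A low-cardinality column is a poor hash key: with only 3 distinct
# values, at most 3 of the 60 distributions ever receive data (skew).
rows = [{"region_id": i % 3, "amount": i} for i in range(600)]
skewed = hash_distribution(rows, "region_id")
even = round_robin_distribution(rows)
```

With only three distinct `region_id` values, hash distribution leaves most distributions empty, while round-robin fills all 60 evenly, which is why a high-cardinality column that appears in joins is the better hash key.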
Design and Develop Data Processing
- Azure Data Factory
Mapping Data Flows
This feature helps to visualize the data transformations in Azure Data Factory.

You can write the required transformation logic without actually writing any code.

The data flows run on Apache Spark clusters. Azure Data Factory will manage the transformations in the data flow.
Debug Mode – You can see the results of the data flow while designing the flow.

In the debug mode session, the data flow will run interactively on a Spark cluster.

In debug mode, you will be charged on an hourly basis for the active cluster.
Design and Develop Data Processing
- Scala, Notebooks and Spark
Lake databases
1 Spark pool – These are the databases and tables created in the Spark pool.

2 Shared – These tables can also be accessed from the Serverless SQL pool.

3 Data – Here the data for the databases is stored in data lake storage.

Synapse
Design and Develop Data Processing
- Azure Databricks
More on clusters
Compute – A cluster provides the compute resources on which your workloads run.

Notebooks – You define the workloads as a set of commands in Notebooks.

Terminate – When a cluster is terminated, the details of the cluster are retained for 30 days.

Pin Clusters – Pin the cluster if you want to keep the details of the terminated cluster for a longer duration.

Cluster types – Job Cluster: created to run a job and terminated when the job completes.

Databricks
1 Single User – Supported languages: Python, SQL, Scala, R.

2 Shared – Needs a Premium plan. You can have data isolation amongst users. Python and SQL are supported.

3 No Isolation Shared – Supported languages: Python, SQL, Scala, R.
Cluster

1 Driver node – This maintains the state information for the notebooks attached to the cluster. It also maintains the SparkContext.

2 Worker node – This runs the executors for running the distributed workloads.

3 Spot Instances – To save on costs, you can also make use of Spot Instances.
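The driver/worker split above can be loosely mimicked in pure Python. This is an analogy only, not Databricks or Spark code: a "driver" holds the job state, partitions the data, and combines results, while a pool of "executors" processes partitions in parallel.

```python
# Pure-Python analogy (NOT Spark/Databricks code) for the driver/worker
# roles: the driver splits the workload and aggregates results, the
# executors each process one partition.
from concurrent.futures import ThreadPoolExecutor

def run_job(data, num_workers=4, partition_size=25):
    # Driver: split the dataset into partitions (one task per partition).
    partitions = [data[i:i + partition_size]
                  for i in range(0, len(data), partition_size)]

    def executor_task(partition):
        # Worker/executor: process one partition independently.
        return sum(x * x for x in partition)

    # Dispatch the tasks to the pool of "executors" and collect results.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(executor_task, partitions))

    # Driver: combine the partial results into the final answer.
    return sum(partials)

result = run_job(list(range(100)))  # sum of squares of 0..99
```

In real Spark the partitions live in distributed storage and the executors are JVM processes on worker nodes, but the shape of the computation (split, process in parallel, combine) is the same.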
Design and Implement Data Security
Dynamic Data Masking
Exposure – This helps in limiting the exposure of sensitive data.

Rule – You can create rules to mask the data.

Credit Card masking rule – This is used to mask credit card numbers; only the last four digits are exposed.

Email masking – The first letter of the email address is exposed, and the domain name of the email address is replaced with XXX.com.

Custom text masking – Here you decide which characters to expose for a field.

Random number masking – Here you can generate a random number in place of the original value.

Security
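The masking formats described above can be imitated in a few lines of Python. This is purely illustrative of the output shapes: in Azure SQL the masking is applied by the database engine through masking rules, the exact placeholder patterns may differ, and the function names and prefix/suffix parameters here are invented for the sketch.

```python
# Toy reproductions of the masking formats described above.
# In Azure SQL these are applied server-side by masking rules.

def mask_email(email):
    """Expose the first letter; mask the rest and replace the domain."""
    local, _, _domain = email.partition("@")
    return local[0] + "X" * (len(local) - 1) + "@XXX.com"

def mask_credit_card(number):
    """Expose only the last four digits of the card number."""
    digits = [c for c in number if c.isdigit()]
    return "XXXX-XXXX-XXXX-" + "".join(digits[-4:])

def mask_custom(value, prefix_len, suffix_len, padding="XXXX"):
    """Custom text rule: expose a prefix and suffix, pad the middle."""
    suffix = value[-suffix_len:] if suffix_len else ""
    return value[:prefix_len] + padding + suffix
```

For example, `mask_credit_card("4111-1111-1111-1234")` yields `XXXX-XXXX-XXXX-1234`, matching the rule that only the last four digits are exposed.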
Data classification
1 Classify – This feature provides capabilities for discovering, classifying, labelling, and reporting the sensitive data in your databases.

2 Discovery – The data discovery feature can scan the database and identify columns that contain sensitive data. You can then view and apply the recommendations accordingly.

3 Sensitivity labels – You can then apply sensitivity labels to the columns. This helps to define the sensitivity level of the data stored in the column.
… and data processing
Microsoft Purview
This is a unified data governance service that helps to track your data assets across multi-cloud, on-premises and software-as-a-service data stores.

Data Map – This helps to provide metadata about your enterprise data.

Data Estate Insights – This can give an overview of your data.

Purview
Azure Stream Analytics Metrics
Backlogged Input Events – The number of input events that are backlogged. A non-zero value for this metric implies that your job isn't able to keep up with the number of incoming events. For this, you can decide to scale up the number of streaming units.

Early Input Events – Events whose application time is earlier than their arrival time by more than 5 minutes.

Late Input Events – Events that arrived later than the configured late arrival tolerance window.

Out-of-Order Events – Events that are received out of order, which are either adjusted or dropped according to the event ordering policy.

Data Conversion Errors – Output events that could not be converted to the expected output schema.

Monitoring
1 Watermark delay – The maximum watermark delay across all partitions of all outputs in the job. If the value of the Watermark Delay is greater than 0, it could be due to many reasons.

2 Delays – Inherent processing delay of the streaming pipeline, or clock skew of the processing node generating the metric.

3 Resources – There are not enough processing resources in your Stream Analytics job.
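Two of the input metrics above can be sketched by comparing each event's application time with its arrival time. This is a toy illustration, not the Stream Analytics engine; the one-minute late-arrival tolerance is an invented example value, whereas in a real job it comes from the event ordering configuration.

```python
# Toy derivation of Late Input Events and Out-of-Order Events from
# (application_time, arrival_time) pairs.
from datetime import datetime, timedelta

LATE_ARRIVAL_TOLERANCE = timedelta(minutes=1)  # example; configured per job

def count_late_events(events):
    """Events that arrived later than their application time by more
    than the configured late-arrival tolerance window."""
    return sum(1 for app_time, arrival_time in events
               if arrival_time - app_time > LATE_ARRIVAL_TOLERANCE)

def count_out_of_order(events):
    """Events whose application time goes backwards relative to the
    previous event in arrival order."""
    out_of_order = 0
    last_app_time = None
    for app_time, _arrival in sorted(events, key=lambda e: e[1]):
        if last_app_time is not None and app_time < last_app_time:
            out_of_order += 1
        last_app_time = app_time
    return out_of_order

t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [
    (t0, t0 + timedelta(seconds=5)),                            # on time
    (t0 + timedelta(seconds=10), t0 + timedelta(seconds=120)),  # late (110s > 60s)
    (t0 + timedelta(seconds=5), t0 + timedelta(seconds=121)),   # late and out of order
]
late = count_late_events(events)
ooo = count_out_of_order(events)
```

In the real service the job additionally adjusts or drops such events according to the configured event ordering policy rather than merely counting them.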