Design and implement data storage

Azure Storage Accounts
This is a service on the Azure platform that allows you to store different data objects.
Blob storage – This is Microsoft's object storage solution in the cloud.

You can use this service for storing large amounts of unstructured data.
Data services
Azure Data Lake Storage Gen2
This service provides features that are dedicated to big data analytics.

This service is built on top of Azure Blob Storage.

You can use it as a central repository for storing your structured and unstructured data.
Azure Synapse
This is an enterprise analytics service.

Here you can store your data in data warehouses and query it accordingly.

It also has Spark capabilities and Pipelines for data integration.
Azure Data Factory
This is a cloud-based ETL tool. It also acts as a data integration service.

You can create workflows to orchestrate data movement and transform data.

It has support for a variety of data stores.
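For illustration, a pipeline activity in Azure Data Factory is ultimately defined as JSON. The sketch below shows the rough shape of a Copy activity; the dataset names are hypothetical placeholders, not values from this document:

```json
{
    "name": "CopyBlobToSqlPool",
    "type": "Copy",
    "inputs": [ { "referenceName": "SourceBlobDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "SinkSqlDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "SqlDWSink" }
    }
}
```

In practice you rarely author this JSON by hand; the ADF designer generates it from the visual pipeline.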
Azure Stream Analytics
This is a fully managed stream processing engine.
Here you can analyze and process large volumes of streaming data.

You can ingest data from different data sources.
Azure Databricks
This is a set of tools that can be used for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale.

You can use this service to process and analyze your data.

This service makes use of Apache Spark to process data.
Design and implement data storage
- Azure Synapse Analytics
Best Practices
Distributed Tables

Hash-distribution – This helps in improving query performance on large tables.

Round-Robin distribution – This helps in improving loading speed. Choose Round-Robin when there is no clear candidate column for hash distribution.

Fact Tables – Best to use hash-distribution.

Dimension Tables – Best to use replicated tables.
1 Choosing the right column – Ensure that data gets distributed evenly across the distributions.

2 Data Skew – If the data is not distributed properly, it can lead to data skew.

3 Data Movement – Choose a column that minimizes data movement, i.e. one that is used in JOIN and GROUP BY clauses.
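The distribution choices above can be sketched in plain Python. This is a toy model only (not how the Synapse engine is implemented); the one fact it relies on is that a dedicated SQL pool always spreads data across 60 distributions.

```python
# Toy model of table distribution in a dedicated SQL pool.
# Fact from the service: data is always spread across 60 distributions.
NUM_DISTRIBUTIONS = 60

def hash_distribution(rows, key):
    """Hash-distribute: the same key value always lands in the same
    distribution, which makes JOIN/GROUP BY on that key cheap but can
    skew the data if the key has few distinct values."""
    buckets = {}
    for row in rows:
        d = hash(row[key]) % NUM_DISTRIBUTIONS
        buckets.setdefault(d, []).append(row)
    return buckets

def round_robin_distribution(rows):
    """Round-robin: rows are dealt out evenly in turn, regardless of
    content, which loads fast but forces data movement later for
    key-based operations."""
    buckets = {}
    for i, row in enumerate(rows):
        buckets.setdefault(i % NUM_DISTRIBUTIONS, []).append(row)
    return buckets

# A low-cardinality column is a poor hash key: with only 3 distinct
# values, at most 3 of the 60 distributions ever receive data (skew).
rows = [{"region_id": i % 3, "amount": i} for i in range(600)]
skewed = hash_distribution(rows, "region_id")
even = round_robin_distribution(rows)
```

With only three distinct `region_id` values, hash distribution leaves most distributions empty, while round-robin fills all 60 evenly, which is why a high-cardinality column that appears in joins is the better hash key.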
Design and Develop Data Processing
- Azure Data Factory
Mapping Data Flows
This feature helps to visualize the data transformations in Azure Data Factory.

You can write the required transformation logic without actually writing any code.

The data flows run on Apache Spark clusters. Azure Data Factory will manage the transformations in the data flow.
Debug Mode – You can see the results of the data flow while designing the flow.

In the debug mode session, the data flow will run interactively on a Spark cluster.

In debug mode, you will be charged on an hourly basis for the active cluster.
Design and Develop Data Processing
- Scala, Notebooks and Spark
Lake databases
1 Spark pool – These are the databases and tables created in the Spark pool.

2 Shared – These tables can also be accessed from the Serverless SQL pool.

3 Data – Here the data for the databases is stored in data lake storage.

Synapse
Design and Develop Data Processing
- Azure Databricks
More on clusters
Compute – A cluster provides the compute resources on which your workloads run.

Notebooks – You define the workloads as a set of commands in Notebooks.

Terminate – When a cluster is terminated, the details of the cluster are retained for 30 days.

Pin Clusters – Pin the cluster if you want to keep the details of the terminated cluster for a longer duration.

Cluster types – Job Cluster: created to run a job and terminated when the job completes.

Databricks
1 Single User – Supported languages: Python, SQL, Scala, R.

2 Shared – Needs a Premium plan. You can have data isolation amongst users. Python and SQL are supported.

3 No Isolation Shared – Supported languages: Python, SQL, Scala, R.
Cluster

1 Driver node – This maintains the state information for the notebooks attached to the cluster. It also maintains the SparkContext.

2 Worker node – This runs the executors for running the distributed workloads.

3 Spot Instances – To save on costs, you can also make use of Spot Instances.
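The driver/worker split above can be loosely mimicked in pure Python. This is an analogy only, not Databricks or Spark code: a "driver" holds the job state, partitions the data, and combines results, while a pool of "executors" processes partitions in parallel.

```python
# Pure-Python analogy (NOT Spark/Databricks code) for the driver/worker
# roles: the driver splits the workload and aggregates results, the
# executors each process one partition.
from concurrent.futures import ThreadPoolExecutor

def run_job(data, num_workers=4, partition_size=25):
    # Driver: split the dataset into partitions (one task per partition).
    partitions = [data[i:i + partition_size]
                  for i in range(0, len(data), partition_size)]

    def executor_task(partition):
        # Worker/executor: process one partition independently.
        return sum(x * x for x in partition)

    # Dispatch the tasks to the pool of "executors" and collect results.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(executor_task, partitions))

    # Driver: combine the partial results into the final answer.
    return sum(partials)

result = run_job(list(range(100)))  # sum of squares of 0..99
```

In real Spark the partitions live in distributed storage and the executors are JVM processes on worker nodes, but the shape of the computation (split, process in parallel, combine) is the same.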
Design and Implement Data Security
Dynamic Data Masking
Exposure – This helps in limiting the exposure of sensitive data.

Rule – You can create rules to mask the data.

Credit Card masking rule – This is used to mask credit card numbers; only the last four digits are exposed.

Email masking – The first letter of the email address is exposed, and the domain name of the email address is replaced with XXX.com.

Custom text masking – Here you decide which characters to expose for a field.

Random number masking – Here you can generate a random number in place of the original value.

Security
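The masking formats described above can be imitated in a few lines of Python. This is purely illustrative of the output shapes: in Azure SQL the masking is applied by the database engine through masking rules, the exact placeholder patterns may differ, and the function names and prefix/suffix parameters here are invented for the sketch.

```python
# Toy reproductions of the masking formats described above.
# In Azure SQL these are applied server-side by masking rules.

def mask_email(email):
    """Expose the first letter; mask the rest and replace the domain."""
    local, _, _domain = email.partition("@")
    return local[0] + "X" * (len(local) - 1) + "@XXX.com"

def mask_credit_card(number):
    """Expose only the last four digits of the card number."""
    digits = [c for c in number if c.isdigit()]
    return "XXXX-XXXX-XXXX-" + "".join(digits[-4:])

def mask_custom(value, prefix_len, suffix_len, padding="XXXX"):
    """Custom text rule: expose a prefix and suffix, pad the middle."""
    suffix = value[-suffix_len:] if suffix_len else ""
    return value[:prefix_len] + padding + suffix
```

For example, `mask_credit_card("4111-1111-1111-1234")` yields `XXXX-XXXX-XXXX-1234`, matching the rule that only the last four digits are exposed.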
Data classification
1 Classify – This feature provides capabilities for discovering, classifying, labelling, and reporting the sensitive data in your databases.

2 Discovery – The data discovery feature can scan the database and identify columns that contain sensitive data. You can then view and apply the recommendations accordingly.

3 Sensitivity labels – You can then apply sensitivity labels to the columns. This helps to define the sensitivity level of the data stored in the column.
… and data processing
Microsoft Purview
This is a unified data governance service that helps to track your data assets across multi-cloud, on-premises and software-as-a-service data stores.

Data Map – This helps to provide metadata about your enterprise data.

Data Estate Insights – This can give an overview of your data.

Purview
Azure Stream Analytics Metrics
Backlogged Input Events – The number of input events that are backlogged. A non-zero value for this metric implies that your job isn't able to keep up with the number of incoming events. For this, you can decide to scale up the number of streaming units.

Early Input Events – Events whose application time is earlier than their arrival time by more than 5 minutes.

Late Input Events – Events that arrived later than the configured late arrival tolerance window.

Out-of-Order Events – Events that are received out of order, which are either adjusted or dropped according to the event ordering policy.

Data Conversion Errors – Output events that could not be converted to the expected output schema.

Monitoring
1 Watermark delay – The maximum watermark delay across all partitions of all outputs in the job. If the value of the Watermark Delay is greater than 0, it could be due to many reasons.

2 Delays – Inherent processing delay of the streaming pipeline, or clock skew of the processing node generating the metric.

3 Resources – There are not enough processing resources in your Stream Analytics job.
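Two of the input metrics above can be sketched by comparing each event's application time with its arrival time. This is a toy illustration, not the Stream Analytics engine; the one-minute late-arrival tolerance is an invented example value, whereas in a real job it comes from the event ordering configuration.

```python
# Toy derivation of Late Input Events and Out-of-Order Events from
# (application_time, arrival_time) pairs.
from datetime import datetime, timedelta

LATE_ARRIVAL_TOLERANCE = timedelta(minutes=1)  # example; configured per job

def count_late_events(events):
    """Events that arrived later than their application time by more
    than the configured late-arrival tolerance window."""
    return sum(1 for app_time, arrival_time in events
               if arrival_time - app_time > LATE_ARRIVAL_TOLERANCE)

def count_out_of_order(events):
    """Events whose application time goes backwards relative to the
    previous event in arrival order."""
    out_of_order = 0
    last_app_time = None
    for app_time, _arrival in sorted(events, key=lambda e: e[1]):
        if last_app_time is not None and app_time < last_app_time:
            out_of_order += 1
        last_app_time = app_time
    return out_of_order

t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [
    (t0, t0 + timedelta(seconds=5)),                            # on time
    (t0 + timedelta(seconds=10), t0 + timedelta(seconds=120)),  # late (110s > 60s)
    (t0 + timedelta(seconds=5), t0 + timedelta(seconds=121)),   # late and out of order
]
late = count_late_events(events)
ooo = count_out_of_order(events)
```

In the real service the job additionally adjusts or drops such events according to the configured event ordering policy rather than merely counting them.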