Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics,
built on Azure Blob storage.
Data Lake Storage Gen2 converges the capabilities of Azure Data Lake Storage
Gen1 with Azure Blob storage. For example, Data Lake Storage Gen2 provides file
system semantics, file-level security, and scale. Because these capabilities are built on
Blob storage, you'll also get low-cost, tiered storage, with high availability/disaster
recovery capabilities.
Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise
data lakes on Azure. Designed from the start to service multiple petabytes of
information while sustaining hundreds of gigabits of throughput, Data Lake Storage
Gen2 allows you to easily manage massive amounts of data.
Hadoop compatible access: Data Lake Storage Gen2 allows you to manage
and access data just as you would with a Hadoop Distributed File System
(HDFS). The new ABFS driver (used to access data) is available within all Apache
Hadoop environments. These environments include Azure HDInsight, Azure
Databricks, and Azure Synapse Analytics.
A superset of POSIX permissions: The security model for Data Lake Storage Gen2
supports access control lists (ACLs) and POSIX permissions, along with some extra
granularity specific to Data Lake Storage Gen2. Settings can be configured through
Storage Explorer or through frameworks like Hive and Spark.
Cost effective: Data Lake Storage Gen2 offers low-cost storage capacity and
transactions. Features such as Azure Blob storage lifecycle management help
optimize costs as data transitions through its lifecycle.
Optimized driver: The ABFS driver is optimized specifically for big data
analytics. The corresponding REST APIs are surfaced through the
endpoint dfs.core.windows.net.
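The ABFS addressing scheme described above can be illustrated with a short sketch. A Hadoop-compatible tool reaches Gen2 data through an abfss:// URI resolved against the account's dfs.core.windows.net endpoint; the account, filesystem (container), and path names below are hypothetical, for illustration only.

```python
def abfs_uri(account: str, filesystem: str, path: str, secure: bool = True) -> str:
    """Build a Hadoop-style ABFS URI for a Data Lake Storage Gen2 path.

    The ABFS driver resolves these URIs against the account's
    dfs.core.windows.net endpoint.
    """
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

# Hypothetical account and container names, for illustration only.
uri = abfs_uri("contosoadls", "raw", "/logs/2024/01/web.log")
print(uri)  # abfss://raw@contosoadls.dfs.core.windows.net/logs/2024/01/web.log
```

The same URI works unchanged from HDInsight, Databricks, or Synapse, because all of them load the ABFS driver.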
Short for "Portable Operating System Interface for Unix," POSIX is a set of standards
codified by the IEEE and issued by ANSI and ISO. The goal of POSIX is to ease the
task of cross-platform software development by establishing a set of guidelines for
operating system vendors to follow. Ideally, a developer should have to write a
program only once to run on all POSIX-compliant systems. Most modern
commercial Unix implementations and many free ones are POSIX compliant. There
are actually several different POSIX releases, but the most important are POSIX.1 and
POSIX.2, which define system calls and command-line interface, respectively.
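Data Lake Storage Gen2 ACLs use the familiar POSIX short form, scope:qualifier:permissions. As a sketch only (this is plain Python, not the Azure SDK, and the entries are illustrative), here is how such an ACL string breaks down:

```python
def parse_posix_acl(acl: str) -> list[tuple[str, str, str]]:
    """Parse a POSIX-style ACL string of the kind used by
    Data Lake Storage Gen2, e.g. 'user::rwx,group::r-x,other::---'.

    Each entry is scope:qualifier:permissions, where the qualifier
    (a named user or group) is empty for the owning user/group.
    """
    entries = []
    for entry in acl.split(","):
        scope, qualifier, perms = entry.split(":")
        entries.append((scope, qualifier, perms))
    return entries

# Owning user has rwx, a named user gets read-only, others get nothing.
acl = "user::rwx,user:alice:r--,group::r-x,other::---"
for scope, qualifier, perms in parse_posix_acl(acl):
    print(scope, qualifier or "(owning)", perms)
```

In practice you would set these entries through Storage Explorer or a framework such as Hive or Spark, as noted above; the sketch only shows the entry format itself.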
If you are performing analytics on the data, set up the storage account as an Azure
Data Lake Storage Gen2 account by setting the Hierarchical Namespace option
to Enabled. Because Azure Data Lake Storage Gen2 is integrated into the Azure
Storage platform, applications can use either the Blob APIs or the Azure Data Lake
Storage Gen2 file system APIs to access data.
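Because both API surfaces sit over the same account, the two endpoints differ only in their host suffix. A minimal sketch (the account name is hypothetical):

```python
def storage_endpoints(account: str) -> dict[str, str]:
    """Return the Blob and Data Lake (DFS) endpoints of one storage account.

    With the hierarchical namespace enabled, the same data is reachable
    through either endpoint.
    """
    return {
        "blob": f"https://{account}.blob.core.windows.net",
        "dfs": f"https://{account}.dfs.core.windows.net",
    }

eps = storage_endpoints("contosoadls")
print(eps["blob"])  # https://contosoadls.blob.core.windows.net
print(eps["dfs"])   # https://contosoadls.dfs.core.windows.net
```

An application that already speaks the Blob REST API keeps using the blob endpoint; analytics tools using the ABFS driver go through the dfs endpoint.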
There are four stages for processing big data solutions that are common to all
architectures:
Ingestion - The ingestion phase identifies the technology and processes that
are used to acquire the source data. This data can come from files, logs, and
other types of unstructured data that must be put into the Data Lake Store. The
technology that is used will vary depending on the frequency that the data is
transferred. For example, for batch movement of data, Azure Data Factory may
be the most appropriate technology to use. For real-time ingestion of data,
Apache Kafka for HDInsight or Stream Analytics may be an appropriate
technology to use.
Store - The store phase identifies where the ingested data should be placed. In
this case, we're using Azure Data Lake Storage Gen2.
Prep and train - The prep and train phase identifies the technologies that are
used to perform data preparation and model training and scoring for data
science solutions. The common technologies that are used in this phase are
Azure Databricks, Azure HDInsight, or Azure Machine Learning Services.
Model and serve - Finally, the model and serve phase involves the
technologies that will present the data to users. These can include visualization
tools such as Power BI, or other data stores such as Azure Synapse Analytics,
Azure Cosmos DB, Azure SQL Database, or Azure Analysis Services. Often, a
combination of these technologies will be used depending on the business
requirements.
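The four stages above can be summarized as a simple mapping from stage to candidate Azure services. This is only a sketch of the pairings named in the text, not an exhaustive catalog:

```python
# Stage -> example Azure technologies, as described in the four stages above.
BIG_DATA_STAGES = {
    "ingestion": ["Azure Data Factory", "Apache Kafka for HDInsight",
                  "Stream Analytics"],
    "store": ["Azure Data Lake Storage Gen2"],
    "prep_and_train": ["Azure Databricks", "Azure HDInsight",
                       "Azure Machine Learning"],
    "model_and_serve": ["Power BI", "Azure Synapse Analytics", "Azure Cosmos DB",
                        "Azure SQL Database", "Azure Analysis Services"],
}

for stage, services in BIG_DATA_STAGES.items():
    print(f"{stage}: {', '.join(services)}")
```

A real architecture typically combines several services per stage, as the use cases below show.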
Examine uses for Azure Data Lake Storage Gen2
Creating a modern data warehouse
Imagine you're a Data Engineering consultant for Contoso. In the past, they've
created an on-premises business intelligence solution that used a Microsoft SQL
Server Database Engine, SQL Server Integration Services, SQL Server Analysis
Services, and SQL Server Reporting Services to provide historical reports. They tried
using the Analysis Services Data Mining component to create a predictive analytics
solution to predict the buying behaviour of customers. While this approach worked
well with low volumes of data, it couldn't scale after more than a gigabyte of data
was collected. Furthermore, they were never able to deal with the JSON data that a
third-party application generated when a customer used the feedback module of the
point of sale (POS) application.
Contoso has turned to you for help with creating an architecture that can scale with
the data needs that are required to create a predictive model and to handle the
JSON data so that it's integrated into the BI solution. You suggest the following
architecture:
The architecture uses Azure Data Lake Storage at the center of the solution for a
modern data warehouse. Integration Services is replaced by Azure Data Factory to
ingest data into the Data Lake from a business application. This is the source for the
predictive model that is built into Azure Databricks. PolyBase is used to transfer the
historical data into a big data relational format that is held in Azure Synapse
Analytics, which also stores the results of the trained model from Databricks. Azure
Analysis Services provides a caching layer over Azure Synapse Analytics so that it
can service many users and present the data through Power BI reports.
Advanced analytics for big data
In this second use case, Azure Data Lake Storage plays an important role in providing
a large-scale data store. Your skills are needed by AdventureWorks, which is a global
seller of bicycles and cycling components through a chain of resellers and on the
internet. As their customers browse the product catalog on their websites and add
items to their baskets, a recommendation engine that is built into Azure Databricks
recommends other products. They need to make sure that the results of their
recommendation engine can scale globally. The recommendations are based on the
web log files that are stored on the web servers and transferred to the Azure
Databricks model hourly. The response time for the recommendation should be less
than 1 ms. You propose the following architecture:
In this solution, Azure Data Factory transfers terabytes of web logs from a web server
to the Azure Data Lake on an hourly basis. This data is provided as features to the
predictive model in Azure Databricks, which is then trained and scored. The results
are distributed globally by using Azure Cosmos DB, which the real-time app (the
AdventureWorks website) will use to provide recommendations to customers as they
add products to their online baskets.
To complete this architecture, PolyBase is used against the Data Lake to transfer
descriptive data to Azure Synapse Analytics for reporting purposes. Azure Analysis
Services provides a caching layer over Azure Synapse Analytics so that it can
service many users and display the data through Power BI reports.
Real-time analytical solutions
In this use case, you are a Data Engineer for Trey Research, an organization that is
working with a transport company to monitor the fleet of Heavy Goods Vehicles
(HGV) that drive around Europe. Each HGV is equipped with sensor hardware that
continuously reports metric data on the temperature, the speed, and the oil and
brake fluid levels of the HGV. When the engine is turned off, the sensor also outputs
a file with summary information about the trip, including its mileage and elevation.
A trip is the period between the HGV engine being turned on and turned off.
Both the real-time data and the batch data are processed in a machine learning model to
predict a maintenance schedule for each of the HGVs. This data is made available to
the downstream application that third-party garage companies can use if an HGV
breaks down anywhere in Europe. In addition, historical reports about the HGV
should be visually presented to users. As a result, the following architecture is
proposed:
In this architecture, there are two ingestion streams. Azure Data Factory ingests the
summary files that are generated when the HGV engine is turned off. Apache Kafka
provides the real-time ingestion engine for the telemetry data. Both data streams are
stored in Azure Data Lake Store for use in the future, but they are also passed on to
other technologies to meet business needs. Both streaming and batch data are
provided to the predictive model in Azure Databricks, and the results are published
to Azure Cosmos DB to be used by the third-party garages. PolyBase transfers data
from the Data Lake Store into SQL Data Warehouse where Azure Analysis Services
creates the HGV reports by using Power BI.
Data Factory can orchestrate many data tasks. In this exercise, you'll use it to copy
data from Azure Data Lake Storage Gen1 to Data Lake Storage Gen2.
https://docs.microsoft.com/en-us/learn/modules/upload-data-to-azure-data-lake-storage/5-copy-data-from-gen1-to-gen2
Imagine the network administrator at Contoso is auditing the security of the assets
within the domain. At the end of the audit, she needs to be satisfied that all the data
stored in Azure strictly follows Contoso's security policy. As Contoso's primary data
consultant, you'll help the network administrator understand how Azure Storage can
help her meet Contoso's security requirements.
https://docs.microsoft.com/en-us/learn/modules/secure-azure-storage-account/2-storage-security-features
https://pramodsingla.com/tag/azure-data-lake/
HDFS (the Hadoop Distributed File System) is a file system that is part of the
Apache Hadoop ecosystem.
ADLS (Azure Data Lake Storage) is an Azure storage offering from Microsoft. It
provides a distributed storage format for bulk data processing needs. ADLS has an
internal distributed file system format exposed through the Azure Blob File
System (ABFS) driver. In addition, it provides a Hadoop-like file system API for
addressing files and directories inside ADLS using a URI scheme. This makes it
easy for applications written against HDFS to migrate to ADLS without code
changes: clients that access HDFS through the HDFS driver get a similar
experience accessing ADLS through the ABFS driver.
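The like-for-like URI mapping that makes this migration possible can be sketched as follows; the cluster, account, and container names are hypothetical:

```python
from urllib.parse import urlparse

def hdfs_to_abfs(hdfs_uri: str, account: str, filesystem: str) -> str:
    """Rewrite an hdfs:// URI as the equivalent abfss:// URI.

    Because the ABFS driver exposes the same file and directory semantics
    as HDFS, only the URI authority changes and the path is preserved,
    which is why HDFS applications can move to ADLS without code changes.
    """
    path = urlparse(hdfs_uri).path
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net{path}"

# Hypothetical names, for illustration only.
print(hdfs_to_abfs("hdfs://namenode:8020/data/sales/2024.csv", "contosoadls", "raw"))
# abfss://raw@contosoadls.dfs.core.windows.net/data/sales/2024.csv
```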