
Introduction to Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics,
built on Azure Blob storage.

Data Lake Storage Gen2 converges the capabilities of Azure Data Lake Storage
Gen1 with Azure Blob storage. For example, Data Lake Storage Gen2 provides file
system semantics, file-level security, and scale. Because these capabilities are built on
Blob storage, you'll also get low-cost, tiered storage, with high availability/disaster
recovery capabilities.

Designed for enterprise big data analytics

Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise
data lakes on Azure. Designed from the start to service multiple petabytes of
information while sustaining hundreds of gigabits of throughput, Data Lake Storage
Gen2 allows you to easily manage massive amounts of data.

A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical
namespace to Blob storage. The hierarchical namespace organizes objects/files into a
hierarchy of directories for efficient data access. A common object store naming
convention uses slashes in the name to mimic a hierarchical directory structure.
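
As a rough illustration of the difference, the following hedged sketch uses the
azure-storage-file-datalake Python package with placeholder account, key, container,
and path names; with the hierarchical namespace enabled, directories are real objects
rather than name prefixes with embedded slashes.

# Sketch only: <account-name>, <account-key>, and the paths are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account-name>.dfs.core.windows.net",
    credential="<account-key>",
)

# With the hierarchical namespace enabled, directories are first-class objects.
file_system = service.get_file_system_client(file_system="raw")
directory = file_system.create_directory("sales/2024/01")  # a real directory, not a name prefix

# Create a file inside the directory and upload a small payload.
file_client = directory.create_file("orders.csv")
file_client.upload_data(b"id,amount\n1,9.99\n", overwrite=True)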

Key features of Data Lake Storage Gen2

 Hadoop compatible access: Data Lake Storage Gen2 allows you to manage
and access data just as you would with a Hadoop Distributed File System
(HDFS). The new ABFS driver (used to access data) is available within all Apache
Hadoop environments. These environments include Azure HDInsight, Azure
Databricks, and Azure Synapse Analytics.

 A superset of POSIX permissions: The security model for Data Lake Storage Gen2
supports ACLs and POSIX permissions, along with some extra granularity specific
to Data Lake Storage Gen2. Settings can be configured through Storage
Explorer or through frameworks like Hive and Spark.

 Cost effective: Data Lake Storage Gen2 offers low-cost storage capacity and
transactions. Features such as Azure Blob storage lifecycle management optimize
costs as data transitions through its lifecycle.
 Optimized driver: The ABFS driver is optimized specifically for big data
analytics. The corresponding REST APIs are surfaced through the
endpoint dfs.core.windows.net. (A short usage sketch follows this list.)
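
To make the Hadoop-compatible access and the ABFS driver concrete, here is a hedged
PySpark sketch (placeholder storage account, container, key, and path; it assumes an
environment such as Azure Databricks, HDInsight, or Synapse where Spark is available)
that reads data through an abfss:// URI resolving to the dfs.core.windows.net endpoint.

# Sketch only: <storage-account>, <container>, <account-key>, and the path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Let the ABFS driver authenticate with the storage account key
# (Azure AD / OAuth configurations are also supported).
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<account-key>",
)

# abfss://<container>@<account>.dfs.core.windows.net/<path> is the ABFS URI scheme.
df = spark.read.csv(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/raw/sales/2024/01/orders.csv",
    header=True,
)
df.show()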

Short for "Portable Operating System Interface for uni-X", POSIX is a set of standards
codified by the IEEE and issued by ANSI and ISO. The goal of POSIX is to ease the
task of cross-platform software development by establishing a set of guidelines for
operating system vendors to follow. Ideally, a developer should have to write a
program only once to run on all POSIX-compliant systems. Most modern
commercial Unix implementations and many free ones are POSIX compliant. There
are actually several different POSIX releases, but the most important are POSIX.1 and
POSIX.2, which define system calls and command-line interface, respectively.
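
To relate this back to Data Lake Storage Gen2's security model, the hedged sketch
below reuses the placeholder service client and "raw" file system from the earlier
example (azure-storage-file-datalake package assumed) and applies POSIX-style
permissions plus an extra ACL entry to a directory; the object ID is a hypothetical
Azure AD identity.

# Sketch only: reuses the placeholder DataLakeServiceClient shown earlier.
directory = service.get_file_system_client("raw").get_directory_client("sales/2024/01")

# POSIX-style owner/group/other permissions (rwxr-x---).
directory.set_access_control(permissions="rwxr-x---")

# An additional ACL entry granting a specific Azure AD object ID read/execute access.
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,user:<object-id>:r-x"
)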

Compare Azure Data Lake Store to Azure Blob storage

In Azure Blob storage, you can store large amounts of unstructured ("object") data in
a single hierarchy, also known as a flat namespace. You can access this data by using
HTTP or HTTPS. Azure Data Lake Storage Gen2 builds on Blob storage and optimizes
I/O of high-volume data by using the hierarchical namespace that you turned on in the
previous exercise.

Hierarchical namespaces organize blob data into directories and store metadata
about each directory and the files within it. This structure allows operations, such as
directory renames and deletes, to be performed in a single atomic operation. Flat
namespaces, by contrast, require several operations proportionate to the number of
objects in the structure. Hierarchical namespaces keep the data organized, which
yields better storage and retrieval performance for an analytical use case and lowers
the cost of analysis.
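
As a hedged illustration of that single atomic operation (placeholder names; the
azure-storage-file-datalake package and the service client from the earlier sketch
are assumed), a directory rename on a hierarchical namespace is one metadata
operation rather than a per-blob copy and delete:

# Sketch only: placeholder file system and directory names.
fs = service.get_file_system_client("raw")
staging = fs.get_directory_client("staging/2024-01-31")

# With the hierarchical namespace enabled, this rename is a single atomic
# operation; on a flat namespace the same move would require copying and
# deleting every blob under the prefix.
staging.rename_directory(new_name="raw/published/2024-01-31")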

Azure Blob storage vs. Azure Data Lake Storage

If you want to store data without performing analysis on the data, set
the Hierarchical Namespace option to Disabled to set up the storage account as an
Azure Blob storage account. You can also use blob storage to archive rarely used
data or to store website assets such as images and media.

If you are performing analytics on the data, set up the storage account as an Azure
Data Lake Storage Gen2 account by setting the Hierarchical Namespace option
to Enabled. Because Azure Data Lake Storage Gen2 is integrated into the Azure
Storage platform, applications can use either the Blob APIs or the Azure Data Lake
Storage Gen2 file system APIs to access data.
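
Because both API surfaces sit over the same data, a hedged sketch like the following
(placeholder account name and key; the azure-storage-blob and
azure-storage-file-datalake Python packages are assumed) can read the same object
through either endpoint:

# Sketch only: <account-name> and <account-key> are placeholders; the file was
# written earlier under the "raw" container/file system.
from azure.storage.blob import BlobServiceClient
from azure.storage.filedatalake import DataLakeServiceClient

key = "<account-key>"

# Data Lake Storage Gen2 file system API (dfs endpoint).
dfs = DataLakeServiceClient("https://<account-name>.dfs.core.windows.net", credential=key)
dfs_bytes = (
    dfs.get_file_system_client("raw")
       .get_file_client("sales/2024/01/orders.csv")
       .download_file()
       .readall()
)

# Blob API (blob endpoint) addressing the same data.
blob = BlobServiceClient("https://<account-name>.blob.core.windows.net", credential=key)
blob_bytes = (
    blob.get_blob_client(container="raw", blob="sales/2024/01/orders.csv")
        .download_blob()
        .readall()
)

# Both paths return the same content.
assert dfs_bytes == blob_bytes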

Understand the stages for processing big data by using Azure Data Lake Store

Azure Data Lake Storage Gen2 plays a fundamental role in a wide range of big data
architectures. These architectures can involve the creation of:

 A modern data warehouse.
 Advanced analytics against big data.
 A real-time analytical solution.

There are four stages for processing big data solutions that are common to all
architectures:

 Ingestion - The ingestion phase identifies the technology and processes that
are used to acquire the source data. This data can come from files, logs, and
other types of unstructured data that must be put into the Data Lake Store. The
technology that is used will vary depending on the frequency that the data is
transferred. For example, for batch movement of data, Azure Data Factory may
be the most appropriate technology to use. For real-time ingestion of data,
Apache Kafka for HDInsight or Stream Analytics may be an appropriate
technology to use.
 Store - The store phase identifies where the ingested data should be placed. In
this case, we're using Azure Data Lake Storage Gen2.
 Prep and train - The prep and train phase identifies the technologies that are
used to perform data preparation and model training and scoring for data
science solutions. The common technologies that are used in this phase are
Azure Databricks, Azure HDInsight, and Azure Machine Learning Services. (A
brief sketch of this stage follows the list.)
 Model and serve - Finally, the model and serve phase involves the
technologies that will present the data to users. These can include visualization
tools such as Power BI, or other data stores such as Azure Synapse Analytics,
Azure Cosmos DB, Azure SQL Database, or Azure Analysis Services. Often, a
combination of these technologies will be used depending on the business
requirements.
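
To make the store and prep-and-train stages concrete, here is a hedged
Databricks-style PySpark sketch (placeholder storage account, container, and folder
layout; it assumes ABFS authentication was configured as in the earlier sketch) that
reads ingested raw data from the data lake, prepares it, and writes a curated copy
back for model training and serving:

# Sketch only: placeholder storage account, container, and folder layout.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

base = "abfss://<container>@<storage-account>.dfs.core.windows.net"

# Store: raw data landed by the ingestion stage (for example, by Azure Data Factory).
raw = spark.read.json(f"{base}/raw/pos-feedback/")

# Prep: basic cleansing and feature shaping before model training.
prepared = (
    raw.dropna(subset=["customer_id"])
       .withColumn("feedback_date", F.to_date("timestamp"))
)

# Write the prepared data back to the lake for the train and serve stages.
prepared.write.mode("overwrite").parquet(f"{base}/curated/pos-feedback/")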

Examine uses for Azure Data Lake Storage Gen2
Creating a modern data warehouse

Imagine you're a Data Engineering consultant for Contoso. In the past, they've
created an on-premises business intelligence solution that used a Microsoft SQL
Server Database Engine, SQL Server Integration Services, SQL Server Analysis
Services, and SQL Server Reporting Services to provide historical reports. They tried
using the Analysis Services Data Mining component to create a predictive analytics
solution to predict the buying behaviour of customers. While this approach worked
well with low volumes of data, it couldn't scale after more than a gigabyte of data
was collected. Furthermore, they were never able to deal with the JSON data that a
third-party application generated when a customer used the feedback module of the
point of sale (POS) application.

Contoso has turned to you for help with creating an architecture that can scale with
the data needs that are required to create a predictive model and to handle the
JSON data so that it's integrated into the BI solution. You suggest the following
architecture:

The architecture uses Azure Data Lake Storage at the center of the solution for a
modern data warehouse. Integration Services is replaced by Azure Data Factory to
ingest data into the Data Lake from a business application. This is the source for the
predictive model that is built into Azure Databricks. PolyBase is used to transfer the
historical data into a big data relational format that is held in Azure Synapse
Analytics, which also stores the results of the trained model from Databricks. Azure
Analysis Services provides the caching capability for SQL Data Warehouse to service
many users and to present the data through Power BI reports.

Advanced analytics for big data

In this second use case, Azure Data Lake Storage plays an important role in providing
a large-scale data store. Your skills are needed by AdventureWorks, which is a global
seller of bicycles and cycling components through a chain of resellers and on the
internet. As their customers browse the product catalog on their websites and add
items to their baskets, a recommendation engine that is built into Azure Databricks
recommends other products. They need to make sure that the results of their
recommendation engine can scale globally. The recommendations are based on the
web log files that are stored on the web servers and transferred to the Azure
Databricks model hourly. The response time for the recommendation should be less
than 1 ms. You propose the following architecture:

In this solution, Azure Data Factory transfers terabytes of web logs from a web server
to the Azure Data Lake on an hourly basis. This data is provided as features to the
predictive model in Azure Databricks, which is then trained and scored. The results
are distributed globally by using Azure Cosmos DB, which the real-time app (the
AdventureWorks website) will use to provide recommendations to customers as they
add products to their online baskets.

To complete this architecture, PolyBase is used against the Data Lake to transfer
descriptive data to the SQL Data Warehouse for reporting purposes. Azure Analysis
Services provides the caching capability for SQL Data Warehouse to service many
users and to display the data through Power BI reports.
Real-time analytical solutions

To build real-time analytical solutions, the ingestion phase of the big data
processing architecture changes. In this architecture, note the introduction
of Apache Kafka for Azure HDInsight to ingest streaming data from an Internet of
Things (IoT) device, although this could be replaced with Azure IoT Hub and Azure
Stream Analytics. The key point is that the data is persisted in Data Lake Storage
Gen2 to service other parts of the solution.

In this use case, you are a Data Engineer for Trey Research, an organization that is
working with a transport company to monitor the fleet of Heavy Goods Vehicles
(HGV) that drive around Europe. Each HGV is equipped with sensor hardware that will
continuously report metric data on the temperature, the speed, and the oil and brake
solution levels of an HGV. When the engine is turned off, the sensor also outputs a
file with summary information about a trip, including the mileage and elevation of a
trip. A trip is the period between the engine being turned on and turned off.

Both the real-time data and batch data are processed in a machine learning model to
predict a maintenance schedule for each of the HGVs. This data is made available to
the downstream application that third-party garage companies can use if an HGV
breaks down anywhere in Europe. In addition, historical reports about the HGV
should be visually presented to users. As a result, the following architecture is
proposed:

In this architecture, there are two ingestion streams. Azure Data Factory ingests the
summary files that are generated when the HGV engine is turned off. Apache Kafka
provides the real-time ingestion engine for the telemetry data. Both data streams are
stored in Azure Data Lake Store for use in the future, but they are also passed on to
other technologies to meet business needs. Both streaming and batch data are
provided to the predictive model in Azure Databricks, and the results are published
to Azure Cosmos DB to be used by the third-party garages. PolyBase transfers data
from the Data Lake Store into SQL Data Warehouse where Azure Analysis Services
creates the HGV reports by using Power BI.

Upload data to Azure Data Lake Storage

Learn various ways to upload data to Data Lake Storage Gen2. Upload data through the
Azure portal, Azure Storage Explorer, or .NET, or copy the data by using Azure Data Factory.
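
The SDK route mentioned above has a close Python equivalent; the following hedged
sketch (placeholder account name, key, file system, and local file name; the
azure-storage-file-datalake package is assumed) uploads a local file into the lake:

# Sketch only: placeholder account, key, file system, and local file name.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account-name>.dfs.core.windows.net",
    credential="<account-key>",
)
file_system = service.get_file_system_client("raw")

# Upload a local file into a directory in the data lake.
file_client = file_system.get_file_client("uploads/orders.csv")
with open("orders.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)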

Exercise - Use Azure Data Factory to copy data
from Data Lake Storage Gen1 to Data Lake Storage Gen2

Azure Data Factory is a cloud-based data integration service that creates workflows
in the cloud. These workflows orchestrate batch data movement and transformations.
Use Data Factory to create and schedule workflows (called pipelines) to ingest data
from various data stores. The data can then be processed and transformed with
services like these:

 Azure HDInsight
 Spark
 Azure Data Lake
 Azure Machine Learning

Data Factory can orchestrate many data tasks. In this exercise, you'll use it to copy
data from Azure Data Lake Storage Gen1 to Data Lake Storage Gen2.
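
As a rough outline of what that copy looks like programmatically, here is a hedged
sketch using the azure-mgmt-datafactory Python SDK. The resource group, factory,
dataset, and subscription names are placeholders that would already exist, and the
exact model constructors and required arguments vary between SDK versions, so treat
this as an outline rather than the exercise's own steps.

# Sketch only: assumes the Gen1 and Gen2 datasets and their linked services already
# exist in the factory, and that the names and IDs below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, PipelineResource,
    AzureDataLakeStoreSource, AzureBlobFSSink,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

copy_gen1_to_gen2 = CopyActivity(
    name="CopyFromGen1ToGen2",
    inputs=[DatasetReference(reference_name="Gen1Dataset")],
    outputs=[DatasetReference(reference_name="Gen2Dataset")],
    source=AzureDataLakeStoreSource(),   # Data Lake Storage Gen1 as the source
    sink=AzureBlobFSSink(),              # Data Lake Storage Gen2 (ABFS) as the sink
)

pipeline = PipelineResource(activities=[copy_gen1_to_gen2])
adf.pipelines.create_or_update("<resource-group>", "<factory-name>", "CopyGen1ToGen2", pipeline)
adf.pipelines.create_run("<resource-group>", "<factory-name>", "CopyGen1ToGen2", parameters={})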

https://docs.microsoft.com/en-us/learn/modules/upload-data-to-azure-data-lake-storage/5-copy-data-from-gen1-to-gen2

Secure your Azure Storage account


Learn how Azure Storage provides multi-layered security to protect your data. Find
out how to use access keys, secure your networks, and use Advanced Threat
Protection to proactively monitor your system.

Introduction
Azure Storage accounts provide a wealth of security options that protect your cloud-
based data. Azure services such as Blob storage, Azure Files shares, Table storage, and Data
Lake Store all build on Azure Storage. Because of this foundation, the services benefit
from the fine-grained security controls in Azure Storage.
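
For example, the hedged sketch below (placeholder account name and key; the
azure-identity and azure-storage-file-datalake packages are assumed) contrasts two
of those controls: authorizing with a shared account access key versus an Azure AD
identity.

# Sketch only: <account-name> and <account-key> are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://<account-name>.dfs.core.windows.net"

# Option 1: shared access key - grants full access, so keep it out of source
# control and rotate it regularly.
with_key = DataLakeServiceClient(account_url, credential="<account-key>")

# Option 2: Azure AD identity via azure-identity - lets RBAC and ACLs scope
# what the caller can actually do.
with_aad = DataLakeServiceClient(account_url, credential=DefaultAzureCredential())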

Imagine the network administrator at Contoso is auditing the security of the assets
within the domain. At the end of the audit, she needs to be satisfied that all the data
stored in Azure strictly follows Contoso's security policy. As Contoso's primary data
consultant, you'll help the network administrator understand how Azure Storage can
help her meet Contoso's security requirements.

https://docs.microsoft.com/en-us/learn/modules/secure-azure-storage-account/2-storage-security-features

https://pramodsingla.com/tag/azure-data-lake/

What is the difference between HDFS and ADLS?

 HDFS is a file system. HDFS stands for Hadoop Distributed File System. It is part of
the Apache Hadoop ecosystem.
 ADLS is an Azure storage offering from Microsoft. ADLS stands for Azure Data Lake
Storage. It provides distributed storage for bulk data processing needs.

 ADLS has an internal distributed file system format called the Azure Blob File
System (ABFS). It also provides a Hadoop-like file system API for addressing
files and directories inside ADLS through a URI scheme. This makes it easier for
applications that use HDFS to migrate to ADLS without code changes. Clients
that access HDFS through the HDFS driver get a similar experience when they
access ADLS through the ABFS driver.
