LAKEHOUSE TECHNIQUES
Instructor: Mr. TRAN MINH QUANG
Group 7:
Mai Ngọc Duyên - 2292390
Hồ Hữu Thanh - 2292368
AGENDA
DATA LAKE
DEMO
DATA LAKEHOUSE
PROBLEMS WITH A BIG DATA PLATFORM
The Evolution of The Big Data Infrastructure
DATA LAKE
What is a Data Lake?
Microsoft defines a data lake as follows:
• Azure Data Lake includes all of the capabilities required to make it easy for developers, data
scientists, and analysts to store data of any size, shape, and speed, and do all types of
processing and analytics across platforms and languages.
• Azure Data Lake works with existing IT investments for identity, management, and security for
simplified data management and governance, and solves many of the productivity and scalability
challenges.
Most definitions of the data lake concept
focus on the following characteristics of data
lakes:
1. A data lake provides sufficient data
storage to store all of the data of an
enterprise or organization.
2. A data lake can store massive amounts
of data of all types, including structured,
semi-structured, and unstructured data.
3. The data stored in a data lake is raw data
or a complete replica of business data.
4. A data lake provides full metadata to
manage all types of data-related
elements, including data sources, data
formats, connection information, data
schemas, and permission management
capabilities.
Figure 1: A data lake's basic capabilities
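As a rough illustration of characteristic 4, the metadata kept for one dataset could be modeled as below. This is only a minimal sketch: the `CatalogEntry` class, its field names, and the sample values are all hypothetical and are not taken from any specific data lake product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one dataset in a data lake."""
    name: str
    source: str          # data source the dataset came from
    data_format: str     # data format, e.g. "csv" or "json"
    connection: str      # connection information for the source
    schema: dict[str, str] = field(default_factory=dict)  # column -> type
    readers: set[str] = field(default_factory=set)        # users allowed to read

    def can_read(self, user: str) -> bool:
        # Permission management: only listed users may read the dataset.
        return user in self.readers


entry = CatalogEntry(
    name="orders",
    source="sales-db",
    data_format="csv",
    connection="jdbc:postgresql://example-host/sales",  # placeholder value
    schema={"order_id": "int", "amount": "float"},
    readers={"analyst"},
)
print(entry.can_read("analyst"))  # True
print(entry.can_read("guest"))    # False
```

Keeping sources, formats, connections, schemas, and permissions in one record is what lets the lake manage raw data of any type without changing the data itself.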
The Basic Architecture of Data Lake
Figure 3: Data Lifecycle in a Data Lake
Available Data Lake Solutions
AWS's Data Lake Solution
This solution is based on AWS Lake Formation, which is a management component and
works with AWS's other services to form an enterprise data lake.
1. Data Inflow:
▪ Metadata inflow:
o Data source preparation
o Metadata crawling
--> form a data catalog and generate security settings and access control policies.
--> retrieve metadata from external data sources.
▪ Business data inflow: completed through ETL.
AWS abstracts metadata crawling, ETL, and data preparation into AWS Glue.
Figure 4: Data Lake Solution Provided by AWS (callouts: support disparate data sources; support data mobility)
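To make the idea of metadata crawling more concrete, the sketch below infers a schema from a small CSV sample and stores it as a catalog record. It is a toy illustration in plain Python, not the AWS Glue API: the functions `infer_type` and `crawl_csv`, the record layout, and the sample data are all assumptions made for this example.

```python
import csv
import io


def infer_type(value: str) -> str:
    """Guess a column type from one sample value."""
    for cast, name in ((int, "int"), (float, "float")):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "string"


def crawl_csv(name: str, raw_text: str) -> dict:
    """Read the header and the first data row, and return a catalog record."""
    reader = csv.reader(io.StringIO(raw_text))
    header = next(reader)
    first_row = next(reader)
    schema = {col: infer_type(val) for col, val in zip(header, first_row)}
    return {"table": name, "format": "csv", "schema": schema}


sample = "order_id,amount,country\n1,19.99,VN\n2,5.00,US\n"
catalog = {"orders": crawl_csv("orders", sample)}
print(catalog["orders"]["schema"])
# {'order_id': 'int', 'amount': 'float', 'country': 'string'}
```

The raw data is left untouched; only the description of it (the metadata) flows into the catalog.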
2. Data Accumulation:
▪ Amazon S3 provides centralized
storage for the data lake, with
support for on-demand scale-out.
4. Data Application:
AWS uses external computing engines
to support rich computing modes apart
from batch processing.
For example:
Amazon Athena and Amazon Redshift
provide SQL-based interactive batch
processing capabilities.
Amazon EMR provides Spark-based
computing capabilities, such as stream
computing and machine learning.
Figure 4: Data Lake Solution Provided by AWS (callout: computing engines)
5. Permission Management:
▪ The permissions of AWS Lake Formation are divided into:
▪ Data catalog access permissions --> control access to metadata.
▪ Underlying data access permissions --> control access to stored data, divided into:
§ Data access permissions: a database grants to its tables
§ Data storage permissions: for each specific data catalog in Amazon S3.
▪ For example:
User A is only granted data access permission and cannot create tables in the specified
bucket of Amazon S3.
Figure 5: Permission Separation for AWS's Data Lake Solution
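The separation of permissions can be pictured with a toy model such as the one below. It is a simplified sketch under stated assumptions, not the AWS Lake Formation API: the `Grants` class, the two check functions, and the bucket path are hypothetical names introduced only to mirror the "User A" example.

```python
from dataclasses import dataclass, field


@dataclass
class Grants:
    """Hypothetical per-user grants, split along the lines described above."""
    catalog: set[str] = field(default_factory=set)       # tables whose metadata is visible
    data_access: set[str] = field(default_factory=set)   # tables whose data can be read
    data_storage: set[str] = field(default_factory=set)  # storage locations the user may use


def can_read_table(grants: Grants, table: str) -> bool:
    # Reading stored data needs the data access permission on the table.
    return table in grants.data_access


def can_create_table(grants: Grants, location: str) -> bool:
    # In this toy model, creating a table needs storage permission on the location.
    return location in grants.data_storage


# User A is granted data access only, with no data storage permission.
user_a = Grants(catalog={"orders"}, data_access={"orders"})
print(can_read_table(user_a, "orders"))                          # True
print(can_create_table(user_a, "s3://example-bucket/orders/"))   # False
```

Because the two kinds of permission are checked independently, a user can query existing tables without being able to place new ones in the bucket.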
Figure 5: Mapping of AWS's Data Lake Solution to the Reference Architecture
In short,
• AWS's data lake solution provides full support for metadata management and permission
management.
• AWS's data lake solution provides stream computing and machine learning only as
extended computing capabilities.
• AWS's data lake solution provides all of the functions shown in the reference architecture
except quality management and data governance. Quality management and data
governance are closely related to the organizational structure and business type of an
enterprise, requiring a great deal of customization and development work. Therefore, a
general solution usually does not provide these two functions.
Summary
A data lake is the infrastructure for next-generation big data analytics and processing. It provides
richer functions than a big data platform. Data lake solutions are likely to evolve in the following
directions in the future:
• Cloud-Native Architecture
• Sufficient Data Management Capabilities
• Big Data Capabilities and Database-Like Experience
• Comprehensive Data Integration and Development Capabilities
• Deep Convergence and Integration With Businesses
Demo
Challenges with A Data Lake
1. High Cost
2. Management Difficulty
3. Long Time-to-Value
4. Immature Data Security and Governance
5. Lack of Skills
6. Data is Outpacing Moore’s Law
7. Open Source is Not Going to Save Us
DATA LAKEHOUSE
Definition
• A data lakehouse is an open data management architecture that combines the
flexibility and cost-efficiency of data lakes with the data management and
structure features of data warehouses, all on one data platform.
• Simply put: the data lakehouse is the only data architecture that stores all data
(unstructured, semi-structured, and structured) in your data lake while still
providing the data quality and data governance standards of a data warehouse.
The benefits of a data lakehouse
Traditionally, data warehouses were not optimized for unstructured data
types, making it necessary to manage multiple systems simultaneously: a data
lake, several data warehouses, and other specialized systems.
Disadvantages of a data lakehouse
Possible reduced functionality
The monolithic design of the lakehouse can be challenging to build and
maintain. Besides, universal designs may offer less functionality than those
developed for specific use cases.
Underdeveloped concept
Data lakehouses are a relatively new technology and need further development.
The current state of tech doesn’t allow rolling out all their capabilities.
The Architecture of Data Lakehouse
The data lakehouse has a layered design, with a warehouse layer on top of a data lake. This architecture, which
enables combining structured and unstructured data, makes it efficient for business intelligence and business
analysis.
• A data lakehouse system usually consists of the following layers:
• Ingestion
• Storage
• Metadata
• API
• Consumption
Ingestion layer
The first layer is responsible for collecting data from multiple sources and
delivering it to the storage layer. For that, batch and streaming methods are used.
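The two ingestion methods can be contrasted with a small sketch. Here a plain dictionary stands in for the storage layer, and the function names `ingest_batch` and `ingest_stream`, the object keys, and the sample records are all hypothetical.

```python
import json
from typing import Iterable

# Stand-in for the storage layer: object key -> object body (hypothetical).
object_store: dict[str, str] = {}


def ingest_batch(key: str, records: list[dict]) -> None:
    """Batch: deliver a whole set of records as one JSON Lines object."""
    object_store[key] = "\n".join(json.dumps(r) for r in records)


def ingest_stream(prefix: str, events: Iterable[dict]) -> int:
    """Streaming: deliver each event as its own small object as it arrives."""
    count = 0
    for i, event in enumerate(events):
        object_store[f"{prefix}/event-{i:05d}.json"] = json.dumps(event)
        count += 1
    return count


ingest_batch("raw/orders/batch-0001.jsonl", [{"id": 1}, {"id": 2}])
delivered = ingest_stream("raw/clicks", iter([{"page": "/home"}, {"page": "/cart"}]))
print(delivered)            # 2
print(len(object_store))    # 3
```

Batch ingestion trades latency for fewer, larger objects, while streaming makes each event available sooner at the cost of many small objects.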
Storage layer
The data lakehouse design allows you to keep different types of data as objects in
low-cost object stores like Amazon S3. Client tools can then read these objects
directly from the store using open file formats.
Metadata layer
The metadata layer is a joint catalog that provides metadata (data describing other
data) for all objects stored in a data lake and allows users to apply management
features, including ACID transactions, cache, indexing, and data extraction.
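One way a metadata layer can offer transaction-like behavior on top of plain objects is to keep an append-only log of which data files make up a table. The sketch below is a deliberately simplified, in-memory illustration of that idea; the `TableLog` class and the file names are hypothetical, and it does not reproduce the actual layout of any particular table format.

```python
class TableLog:
    """Toy transaction log: each commit atomically adds and removes data files."""

    def __init__(self):
        self.commits = []  # list of (added_files, removed_files) tuples

    def commit(self, add=(), remove=()):
        # A commit becomes visible all at once when it is appended to the log.
        self.commits.append((tuple(add), tuple(remove)))
        return len(self.commits)  # the new version number

    def snapshot(self, version=None):
        """Return the set of live files as of a version (latest by default)."""
        if version is None:
            version = len(self.commits)
        live = set()
        for added, removed in self.commits[:version]:
            live -= set(removed)
            live |= set(added)
        return live


log = TableLog()
v1 = log.commit(add=["part-0.parquet", "part-1.parquet"])
v2 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])
print(sorted(log.snapshot(v1)))  # ['part-0.parquet', 'part-1.parquet']
print(sorted(log.snapshot()))    # ['part-1.parquet', 'part-2.parquet']
```

Readers resolve a table through the log rather than by listing the object store, so they see either all of a commit or none of it, and older versions remain queryable.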
API layer
The API layer contains various APIs that allow all end users to process tasks faster
and have access to the data necessary for advanced analytics.
Comparison with Data Lake & Data Warehouse
• The differences between the three storage options can be summarized as follows.
• With a data lake, it's easy to ingest and store data, but using and querying it
may pose problems.
• A data lakehouse provides structured storage for some types of data and
unstructured storage for others while keeping all data in one place.
Summary
• A data lakehouse allows you to aggregate and update data in one place.
• You can store both structured and unstructured data in data lakehouses.
A BRIEF OVERVIEW OF POPULAR DATA ARCHITECTURES