
INVESTIGATE DATA LAKE AND LAKEHOUSE TECHNIQUES
Instructor: Mr. TRAN MINH QUANG
Group 7:
Mai Ngọc Duyên - 2292390
Hồ Hữu Thanh    - 2292368
AGENDA

PROBLEMS WITH A BIG DATA PLATFORM (A TRADITIONAL DATA WAREHOUSE)

DATA LAKE

DEMO

DATA LAKEHOUSE

PROBLEMS WITH A BIG DATA PLATFORM
The Evolution of The Big Data Infrastructure

Phase 1: Hadoop Architecture Diagram

Phase 2: Lambda Architecture Diagram

Phase 3: Kappa Architecture Diagram


Big data platforms process the full data of an enterprise or organization while providing a full
range of data processing capabilities to meet application needs.
The big data infrastructure is specially designed for storage and computing, but it ignores data
asset management. A data lake is designed based on a consideration of asset management.
Why do we use the term "data lake" instead of data river or data sea?
To make better use of data, enterprises and organizations must take the following measures to
manage data assets:
• Store data assets as-is over a long term
• Implement effective management and centralized governance of data assets
• Provide multi-modal computing capabilities to meet data processing needs
• Provide unified data views, data schemas, and data processing results for businesses

DATA LAKE
What is a Data Lake?

Wikipedia defines a data lake as:


• A system or repository of data stored in its natural/raw
format, usually object blobs or files.
• Usually a single store of all enterprise data, including raw copies of
source system data and transformed data used for tasks, such as
reporting, visualization, advanced analytics, and machine learning. 
• Can include structured data from relational databases (rows and
columns), semi-structured data (CSV, logs, XML, JSON),
unstructured data (emails, documents, PDFs), and binary data
(images, audio, video).
AWS defines a data lake in a more direct
manner:
• A centralized repository that allows
you to store all of your structured and
unstructured data at any scale.
• You can store your data as-is, without
having to first structure the data, and run
different types of analytics - from
dashboards and visualizations to big
data processing, real-time analytics, and
machine learning to guide better
decisions.

Microsoft defines a data lake as follows:
• Azure Data Lake includes all of the capabilities required to make it easy for developers, data
scientists, and analysts to store data of any size, shape, and speed, and do all types of
processing and analytics across platforms and languages.
• Azure Data Lake works with existing IT investments for identity, management, and security to
simplify data management and governance, and it addresses many productivity and scalability
challenges.

Most definitions of the data lake concept
focus on the following characteristics of data
lakes:
1. A data lake provides sufficient data
storage to store all of the data of an
enterprise or organization.
2. A data lake can store massive amounts
of data of all types, including structured,
semi-structured, and unstructured data.
3. The data stored in a data lake is raw data
or a complete replica of business data.
4. A data lake provides full metadata to
manage all types of data-related
elements, including data sources, data
formats, connection information, data
schemas, and permission management
capabilities.
5. A data lake provides diverse analytics capabilities, including batch processing,
stream computing, interactive analytics, and machine learning, along with job scheduling
and management capabilities.
6. A data lake supports comprehensive data lifecycle management.
7. A data lake provides comprehensive capabilities for data retrieval and publishing.
8. A data lake provides big data capabilities, including the ultra-large storage space and
scalability needed to process data on a large scale.
Figure 1: A data lake's basic capabilities
The Basic Characteristics of Data Lake

A comparison table from the AWS website:

The Basic Architecture of Data Lake

• A data lake not only provides the basic capabilities of a big data platform, but also data
management, data governance, and data asset management capabilities.
• Similar to a big data platform, a typical data lake provides the storage and computing
capabilities needed to process data at an ultra-large scale, as well as multi-modal data
processing capabilities.
Figure 2: The Reference Architecture of Data Lake Components
• The centralized storage provides a unified area for storing the internal data of an
enterprise or organization.
• Most data lake practices recommend using distributed storage systems, such as Amazon S3,
Alibaba Cloud OSS, Huawei OBS, and HDFS, as the data lake's unified storage.


In addition, a data lake provides the following more sophisticated data management
capabilities:
• Improved Data Access Capabilities
  • Allow you to define and manage a variety of disparate external data sources, and to extract
    and migrate data from these sources.
• Improved Data Management Capabilities
  • Basic data management capabilities: metadata management, data access control, and data
    asset management.
  • Extended data management capabilities: job management, process orchestration, and
    capabilities related to data quality and data governance.
• Shared Metadata
  • A data lake provides metadata as the basis for integrating all of its computing engines with
    the stored data.
  • While processing data, computing engines retrieve what they need directly from the metadata:
    the data storage location, data format, data schema, and data distribution.
  • A data lake controls access to the stored data at the levels of the database, table, column,
    and row.
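The shared-metadata idea above can be sketched in a few lines. This is a minimal illustration under assumptions, not any vendor's catalog API: `TableMetadata`, `CATALOG`, `plan_scan`, and the `s3://example-lake/...` location are hypothetical names. It shows an engine resolving location, format, schema, and distribution from the catalog instead of inspecting files.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableMetadata:
    """One catalog entry: what an engine needs to know before reading a table."""
    location: str                      # where the files live (hypothetical URI)
    file_format: str                   # e.g. "parquet" or "csv"
    schema: Dict[str, str]             # column name -> type name
    partitions: List[str] = field(default_factory=list)  # data distribution


# Hypothetical shared catalog, keyed by (database, table).
CATALOG = {
    ("sales", "orders"): TableMetadata(
        location="s3://example-lake/sales/orders/",
        file_format="parquet",
        schema={"order_id": "bigint", "customer": "string", "amount": "double"},
        partitions=["order_date"],
    ),
}


def plan_scan(database: str, table: str, columns: List[str]) -> dict:
    """Resolve a read request against the catalog instead of inspecting files."""
    meta = CATALOG[(database, table)]
    unknown = [c for c in columns if c not in meta.schema]
    if unknown:
        raise KeyError(f"unknown columns: {unknown}")
    return {
        "location": meta.location,
        "format": meta.file_format,
        "columns": {c: meta.schema[c] for c in columns},
        "partitioned_by": meta.partitions,
    }
```

Because every engine would call the same lookup, a batch job and an interactive query agree on where the table lives and what its columns mean.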
In theory, a well-managed data lake retains raw data permanently, while constantly improving
and evolving process data to meet your business needs.

Figure 3: Data Lifecycle in a Data Lake
Available Data Lake Solutions

• AWS's Data Lake Solution
• Huawei's Data Lake Solution
• Alibaba Cloud's Data Lake Solution
• Azure's Data Lake Solution

AWS's Data Lake Solution
This solution is based on AWS Lake Formation, which is a management component and
works with AWS's other services to form an enterprise data lake. 

Figure 4: Data Lake Solution Provided by AWS
(Figure annotations: share the same data catalog; support disparate data sources; support data mobility)

1. Data Inflow:
▪ Metadata inflow:
o Data source preparation
o Metadata crawling
--> form a data catalog and generate security settings and access control policies.
--> retrieve metadata from external data sources.
▪ Business data inflow: is completed through ETL.
AWS abstracts metadata crawling, ETL, and data preparation into AWS Glue.
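To make "metadata crawling" concrete, here is a minimal sketch in plain Python. It does not use AWS Glue: `infer_type`, `crawl`, `demo`, and the type names are assumptions for illustration only. It scans a directory of CSV files, infers a schema from the header and first data row, and records the result as catalog entries.

```python
import csv
import tempfile
from pathlib import Path


def infer_type(value: str) -> str:
    """Guess a column type from one sample value (deliberately simplistic)."""
    for caster, name in ((int, "bigint"), (float, "double")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "string"


def crawl(root: Path) -> dict:
    """Scan a directory of CSV files and build one catalog entry per file."""
    catalog = {}
    for path in sorted(root.glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            sample = next(reader, None)
        if header is None:
            continue  # empty file: nothing to register
        if sample is None:
            # Header only: record the columns but leave the types unknown.
            schema = {name: "unknown" for name in header}
        else:
            schema = {name: infer_type(value) for name, value in zip(header, sample)}
        catalog[path.stem] = {"location": str(path), "format": "csv", "schema": schema}
    return catalog


def demo() -> dict:
    """Crawl a temporary directory that holds one small CSV file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "orders.csv").write_text("order_id,customer,amount\n1,alice,9.5\n")
        return crawl(root)
```

A production crawler would sample many rows and handle more formats. The point here is only that the catalog is derived from the data source rather than typed in by hand.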

2. Data Accumulation:
▪ Amazon S3 provides centralized storage for the data lake, with support for on-demand scale-out.

3. Data Computing:
▪ Use AWS Glue for basic data processing.
▪ AWS Glue implements basic computing through batch ETL tasks, which can be triggered in three
ways: manually, on a schedule, or by an event.
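The three trigger modes can be modeled as a small decision function. This sketch is not the AWS Glue API: `TriggerKind`, `Trigger`, `should_run`, and the `object_created` event name are hypothetical. It only illustrates how a manual, scheduled, or event-based trigger decides whether a batch ETL task should start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class TriggerKind(Enum):
    MANUAL = "manual"
    SCHEDULED = "scheduled"
    EVENT = "event"


@dataclass
class Trigger:
    kind: TriggerKind
    interval: Optional[timedelta] = None   # used by SCHEDULED only
    event_name: Optional[str] = None       # used by EVENT only


def should_run(trigger, now, last_run=None, requested=False, events=()):
    """Decide whether a batch ETL task should start for the given trigger."""
    if trigger.kind is TriggerKind.MANUAL:
        return requested                    # someone explicitly asked for a run
    if trigger.kind is TriggerKind.SCHEDULED:
        # Run on first use, or once the configured interval has elapsed.
        return last_run is None or now - last_run >= trigger.interval
    return trigger.event_name in events     # EVENT: a matching event arrived
```

For example, an hourly trigger whose last run was two hours ago is due, while an event trigger stays idle until a matching event name shows up in `events`.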

4. Data Application:
AWS uses external computing engines to support rich computing modes apart from batch
processing.

For example:
▪ Amazon Athena and Amazon Redshift provide SQL-based interactive batch processing capabilities.
▪ Amazon EMR provides Spark-based computing capabilities, such as stream computing and machine
learning.

5. Permission Management:
▪ Use AWS Lake Formation to provide permission management capabilities at the database, table,
and column levels.
▪ AWS Glue accesses AWS Lake Formation only at the database and table levels.
▪ The permissions of AWS Lake Formation are divided into:
  ▪ Data catalog access permissions --> control access to metadata.
  ▪ Underlying data access permissions --> control access to stored data, divided into:
    § Data access permissions: comparable to the permissions a database grants on its tables.
    § Data storage permissions: defined for each specific data catalog in Amazon S3.
▪ For example: User A is only granted data access permission and cannot create tables in the
specified bucket of Amazon S3.
Figure 5: Permission Separation for AWS's Data Lake Solution
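The permission separation described above can be sketched as three independent grant sets. This is an illustrative model, not the AWS Lake Formation API: `Grants`, `can_select`, `can_create_table`, and the bucket path are hypothetical. The rule that a read needs both catalog visibility and data access is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Grants:
    """Permissions held by one user, split into the three kinds listed above."""
    catalog: set = field(default_factory=set)        # (db, table) metadata the user may see
    data_access: dict = field(default_factory=dict)  # (db, table) -> allowed columns, or "*"
    data_storage: set = field(default_factory=set)   # storage locations the user may write to


def can_select(grants, db, table, columns):
    """Assumed rule: reading needs catalog visibility plus access to every column."""
    if (db, table) not in grants.catalog:
        return False
    allowed = grants.data_access.get((db, table))
    if allowed is None:
        return False
    return allowed == "*" or set(columns) <= set(allowed)


def can_create_table(grants, location):
    """Creating a table writes files, so it needs a data storage permission."""
    return location in grants.data_storage


# User A from the example: data access only, no storage permission.
user_a = Grants(
    catalog={("sales", "orders")},
    data_access={("sales", "orders"): ["order_id", "amount"]},
)
```

With these grants, User A can query the permitted columns of `sales.orders`, but `can_create_table(user_a, "s3://example-lake/sales/")` is false because no storage location was granted.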
Figure 6: Mapping of AWS's Data Lake Solution to the Reference Architecture
In short,
• AWS's data lake solution provides full support for metadata management and permission
management. 
• AWS's data lake solution provides stream computing and machine learning only as
extended computing capabilities.
• AWS's data lake solution provides all of the functions shown in the reference architecture
except quality management and data governance. Quality management and data
governance are closely related to the organizational structure and business type of an
enterprise, requiring a great deal of customization and development work. Therefore, a
general solution usually does not provide these two functions.

Summary
A data lake is the infrastructure for next-generation big data analytics and processing. It provides
richer functions than a big data platform. Data lake solutions are likely to evolve in the following
directions in the future:
• Cloud-Native Architecture
• Sufficient Data Management Capabilities
• Big Data Capabilities and Database-Like Experience
• Comprehensive Data Integration and Development Capabilities
• Deep Convergence and Integration With Businesses

Demo

Challenges with A Data Lake

1. High Cost
2. Management Difficulty
3. Long Time-to-Value
4. Immature Data Security and Governance
5. Lack of Skills
6. Data is Outpacing Moore’s Law
7. Open Source is Not Going to Save Us

DATA LAKEHOUSE
Definition
• A data lakehouse is an open data management architecture that combines the
flexibility and cost-efficiency of data lakes with the data management and
structure features of data warehouses, all on one data platform.

• Simply put: a data lakehouse stores all data (unstructured, semi-structured, and
structured) in your data lake while still providing the data quality and data
governance standards of a data warehouse.

The benefits of a data lakehouse
Traditionally, data warehouses were not optimized for unstructured data types, making it
necessary to manage multiple systems simultaneously: a data lake, several data
warehouses, and other specialized systems. Consolidating these into a single lakehouse
brings the following benefits:

• Less time and effort spent on administration
• Simplified schema and data governance
• Reduced data movement and redundancy
• Direct access to data for analysis tools
• Cost-effective data storage

Disadvantages of a data lakehouse
Possible reduced functionality
The monolithic design of a lakehouse can be challenging to build and maintain. In
addition, general-purpose designs may offer less functionality than systems
developed for specific use cases.
Underdeveloped concept
Data lakehouses are a relatively new technology and need further development.
The current state of the technology does not yet allow all of their capabilities to be rolled out.

The Architecture of Data Lakehouse
The data lakehouse has a layered design, with a warehouse layer on top of a data lake. This architecture, which
combines structured and unstructured data, makes it efficient for business intelligence and business
analysis.
A data lakehouse system usually consists of the following layers:
• Ingestion
• Storage
• Metadata
• API
• Consumption

Ingestion layer
The first layer collects data from multiple sources and delivers it to the storage
layer, using both batch and streaming methods.
Storage layer
The data lakehouse design allows you to keep different types of data as objects in
low-cost object stores such as Amazon S3. Client tools can then read these objects
directly from the store using open file formats.
Metadata layer
The metadata layer is a unified catalog that provides metadata (data describing other
data) for all objects stored in the data lake and allows users to apply management
features, including ACID transactions, caching, indexing, and data extraction.
API layer
The API layer contains various APIs that allow end users to process tasks faster
and access the data necessary for advanced analytics.
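One way a metadata layer can offer transactional behavior on top of plain files is an append-only commit log. The sketch below is loosely inspired by the transaction logs of open table formats, but it is not any product's actual format: `TableLog`, the `_log` directory, and the file names are assumptions. It relies on exclusive file creation on a local filesystem, whereas a real system would need an equivalent atomic "create if absent" operation in its object store.

```python
import json
import tempfile
from pathlib import Path


class TableLog:
    """Append-only commit log that tracks which data files make up a table."""

    def __init__(self, table_dir: Path):
        self.log_dir = table_dir / "_log"
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def latest_version(self) -> int:
        versions = [int(p.stem) for p in self.log_dir.glob("*.json")]
        return max(versions, default=-1)

    def commit(self, expected_version: int, add=(), remove=()) -> int:
        """Write commit expected_version + 1; fail if another writer got there first."""
        version = expected_version + 1
        path = self.log_dir / f"{version:08d}.json"
        # Mode "x" creates the file exclusively, so two writers cannot both
        # claim the same version number.
        with path.open("x") as handle:
            json.dump({"add": list(add), "remove": list(remove)}, handle)
        return version

    def snapshot(self, version=None) -> set:
        """Replay commits up to `version` to get the set of live data files."""
        files = set()
        for path in sorted(self.log_dir.glob("*.json")):
            if version is not None and int(path.stem) > version:
                break
            actions = json.loads(path.read_text())
            files |= set(actions["add"])
            files -= set(actions["remove"])
        return files


def demo():
    """Two commits, then a stale writer that must be rejected."""
    with tempfile.TemporaryDirectory() as tmp:
        log = TableLog(Path(tmp) / "orders")
        v0 = log.commit(log.latest_version(), add=["part-0.parquet"])
        log.commit(v0, add=["part-1.parquet"], remove=["part-0.parquet"])
        try:
            log.commit(v0, add=["stale.parquet"])  # version 1 already exists
            conflict = False
        except FileExistsError:
            conflict = True
        return log.snapshot(v0), log.snapshot(), conflict
```

Readers only ever see files named by a committed log entry, so a half-written data file is invisible until its commit lands. Replaying the log up to an older version also gives a consistent historical snapshot.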

Comparison with Data Lake & Data Warehouse

• The difference between the three storage options can be summarized as follows.

• Compared with a data warehouse, the opposite is true for the data lake: it is easy to ingest
and store data there, but using and querying it may pose problems.

• A data lakehouse provides structured storage for some types of data and unstructured storage
for others while keeping all data in one place.

Summary

• A data lakehouse allows you to aggregate and update data in one place.

• You can store both structured and unstructured data in data lakehouses.

• If your company wishes to utilize pioneering technologies to build effective solutions
based on data analysis, data lakehouses are an option you should not overlook.

A BRIEF OVERVIEW OF POPULAR DATA ARCHITECTURES

REFERENCES

• Data Lake: Concepts, Characteristics, Architecture, and Case Studies by ApsaraDB
(November 2020) from
https://www.alibabacloud.com/blog/data-lake-concepts-characteristics-architecture-and-case-studies_596910
• What is a data lake? by AWS Website from
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/?nc1=h_ls
• Data Lake vs Lakehouse vs Data Mesh: The Evolution of Data Transformation by DevIQ
website from https://www.deviq.io/insights/data-lake-vs-lakehouse-vs-data-mesh
• Data Lake: A Strategy for Building Enterprise Data Asset by Manoj Gopalkrishnan (October
2020) from
https://www.linkedin.com/pulse/data-lake-strategy-building-enterprise-asset-manoj-gopalkrishnan
• Create Data Lake with Amazon S3, Lake Formation and Glue by AWS Tutorials (June 2020)
from https://www.youtube.com/watch?v=l5Hz2qkp4K0
