Available online at www.sciencedirect.com

ScienceDirect
www.elsevier.com/locate/procedia

Procedia Computer Science 214 (2022) 405–411

9th International Conference on Information Technology and Quantitative Management

Observations and Expectations on Recent Developments of Data Lakes
Zhengxin Chen
College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182 USA
Abstract
The concept of the data lake was proposed more than a decade ago. Although progress has been made in data lake research and
applications, numerous issues and challenges still need to be addressed. In this paper, we survey some recent developments and
provide our observations, as well as our expectations for future research and practice in this area. We start with a discussion of
general terminology in data lakes, then review various aspects of data lakes, including an examination of metadata and related
issues, the information granularities involved in data lakes, unique features such as the mixed lazy and eager approach in the data
lake lifecycle, the relationship between data lakes and data mining, the interplay between data lakes and data warehousing
techniques, and some notable developments related to applications of data lakes. We present our observations while we examine
these aspects, and wrap up the paper by expressing our hope for the development of a unified framework for data lake lifecycle
evaluation.
© 2022 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the 9th International Conference on Information Technology and
Quantitative Management
Selection and/or peer-review under responsibility of the organizers of ITQM 2022
Keywords: data lake, metadata management, data profiling, big data analytics, data warehouse

1. Introduction
The concept of the data lake was proposed in 2010 [17]. Starting with a comparison to the data mart, which is a store of bottled
water (cleansed, packaged and structured for easy consumption), James Dixon proposed the concept of the data lake as “a large body
of water in a more natural state.” The contents of the data lake stream in from a source to fill the lake, and various users of the
lake can come to examine, dive in, or take samples [17]. In a sense, the data lake phenomenon resembles what happened in the field
of database management systems in dealing with data beyond the relational model: whereas object-oriented databases strengthened
the philosophy of rigid schemas in the 1980s, XML and the more recent big data phenomena [43] have shown an alternative, rebellious
direction, which liberates data storage and manipulation from rigid schemas. Similarly, whereas data warehousing [9] stays with
a rigid schema to handle consolidated data, data lakes accept all kinds of data and use them as they are, regardless of whether the
data are structured, semi-structured, or unstructured. As an example of the usefulness of data lakes, a public data lake for
COVID-19 research and development is available at https://aws.amazon.com/covid-19-data-lake/.
As time goes by, more and more people are becoming interested in (or curious about) data lakes. However, online materials
related to data lakes are very diverse and their quality varies significantly. It is important to note that, unlike its
“predecessor” the data warehouse, a systematic framework for data lake research is still lacking. Researchers have voiced
concerns about this field, and open gaps have been identified (e.g., [20]). For individuals who have been working in
database-related IT areas but who are new to the specific topic of data lakes, a common question could be: how can one survive the
“flood” in this huge “lake” of data lake literature? Or, how can one get a quick start in learning the key approaches of this
important topic?

1877-0509 © 2022 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the 9th International Conference on Information Technology and
Quantitative Management
10.1016/j.procs.2022.11.192

This paper is an attempt to provide a partial answer from a learner’s perspective. The author of this article is not in a position
(nor has the capacity) to provide a complete review. Yet, we are willing to try to show what we have observed and what can be
expected from our incomplete survey, with the hope that excellent and comprehensive reviews will appear in the near future to
meet our expectations.
The rest of this paper contains an (incomplete) survey of materials related to data lakes that are available online. Since the
quality of online materials related to data lakes is very uneven, one may start with tutorials and surveys from reputable sources
and institutions, along with some widely cited papers. In our view, a good knowledge of the basics of database management systems
(DBMS) [42] and data warehousing techniques [24,25] holds the key to understanding the development and challenges of data
lakes. Among the huge “lake” of data lake articles, we have found that a handful could be used as starters, and [40, 28]
could be a good starting point. Reference [20] provided a critical examination of the landscape of data lakes from a high-level
perspective. As a nice tutorial (but with only an extended abstract available online), reference [33] largely focused on data lake
challenges and opportunities, which include data ingestion, data extraction, data cleaning, dataset discovery, metadata
management, data integration, and dataset versioning.
A much more detailed and in-depth survey of data lakes can be found in [23]. It could be quite challenging for beginners to
fully understand the entire contents; yet the overall structure of the survey presents a clear picture of what the most
important topics in data lakes are. For the convenience of the discussion below, here is a very brief overview
of the structure of that paper. After a brief review of the history of data lakes and the definition of the data lake concept
(including comparisons with data warehouses and data spaces), the authors presented their own data lake architecture and provided
a set of classification criteria for data lake solutions based on this architecture. There are three layers in this architecture:
the ingestion layer provides the functions of metadata extraction and modeling; the maintenance layer provides the functions of
dataset preparation and organization, discovery of related datasets, data interaction, metadata enrichment, data quality
improvement and schema evolution; and finally, the exploration layer provides the functions of query-driven data discovery and
querying of heterogeneous data. A good paper on data lake ingestion management is [48]. [Some terms used here will be
briefly explained in later sections.]
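To make the three layers more concrete, the following is a minimal sketch in Python, written from a learner's perspective. It is not the architecture of [23] itself: the class and method names (DataLake, ingest, maintain, explore) and the sample datasets are our own assumptions, and "relatedness" is reduced to sharing a column name.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Dataset:
    name: str
    rows: list[dict[str, Any]]                      # raw records, kept as they arrived
    metadata: dict[str, Any] = field(default_factory=dict)


class DataLake:
    """Toy in-memory lake (hypothetical names) organised by three layers."""

    def __init__(self) -> None:
        self.datasets: dict[str, Dataset] = {}

    # Ingestion layer: store the raw data and extract basic metadata.
    def ingest(self, name: str, rows: list[dict[str, Any]]) -> Dataset:
        columns = sorted({key for row in rows for key in row})
        ds = Dataset(name, rows, {"columns": columns, "row_count": len(rows)})
        self.datasets[name] = ds
        return ds

    # Maintenance layer: enrich metadata, here by linking related datasets.
    # Assumption: two datasets are "related" if they share a column name.
    def maintain(self) -> None:
        for ds in self.datasets.values():
            cols = set(ds.metadata["columns"])
            ds.metadata["related"] = sorted(
                other.name
                for other in self.datasets.values()
                if other.name != ds.name and cols & set(other.metadata["columns"])
            )

    # Exploration layer: find the datasets that expose a given column.
    def explore(self, column: str) -> list[str]:
        return sorted(
            name for name, ds in self.datasets.items()
            if column in ds.metadata["columns"]
        )


lake = DataLake()
lake.ingest("patients", [{"patient_id": 1, "age": 40}])
lake.ingest("visits", [{"patient_id": 1, "date": "2022-01-05"}])
lake.ingest("sensors", [{"device": "t1", "reading": 21.5}])
lake.maintain()
print(lake.explore("patient_id"))
print(lake.datasets["patients"].metadata["related"])
```

Note that exploration in this sketch consults only the metadata produced by the other two layers, never the raw rows, which reflects the central role of metadata discussed in Section 3.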
In the rest of this paper, we present our observations on recent developments in data lakes, again from a learner’s
perspective. We examine a number of aspects related to data lakes, and along the way we identify various
notable issues, some of which are explicitly stated in the literature, while others are implied. Starting with general
terminology in data lakes (Section 2), we proceed to examine metadata (Section 3), information granules involved in data lakes
(Section 4), the mixed lazy and eager approach in the data lake lifecycle (Section 5), knowledge pattern extraction (Section 6),
as well as the relationship between data warehouses and data lakes (Section 7). Brief remarks about data lake applications are
given in Section 8. We wrap up our paper in Section 9, where we review experts’ opinions on the challenges in data lakes and
describe our expectations for future research in data lakes; in particular, we believe a framework for evaluating the data lake
lifecycle is needed.

2. Terminology

One of the most important things we want to emphasize is the terminology used in the data lake literature. Terms in the data lake
context may have special meanings, and they have to be understood correctly. An example is the concept of data lake architecture.
If we follow the original definition of a data lake, we may envision a set of miscellaneous data coming from various sources in
their original format; so why bother with any kind of architecture at all? To make sense of this term, it is necessary to consider
how data lakes are formed. According to [23], the architecture of a data lake describes the structure and components of the system,
indicating how to store, organize and use the data. First, we should note that although on-premise data lakes are still popular, as
cloud computing has become mainstream, cloud data lakes now offer an alternative, with elastic compute services that analyze data
in cloud storage on demand. Lambda architecture [39] is a popular data lake architecture from industry. Data lake architecture has
also been widely discussed in research papers (e.g., [21]); although there is no consensus on exactly what a data lake
architecture should be, almost all existing proposals involve multiple layers, such as data ingestion, maintenance and
exploration [23]. Another important term is the concept of modeling in data lakes. Since there are no predefined schemas in data
lakes, why would modeling still make sense? In fact, even though there is no common requirement on the structure of the data to be
stored, there are various cases in which we talk about data models and data modeling. For example, when we consider data storage
using Hadoop, we have to consider the underlying model used for data storage. Metadata modeling is another huge topic
addressed in the data lake literature.
There are also some new terms that were introduced due to unique features of the data lake lifecycle. For example, the
concept of query discovery was proposed in [31] to denote the task of discovering a query (or transformation) that translates data
from one form into another, which is intended to find the right operators to join, nest, group, link, and twist data into a desired
form. Furthermore, today’s data science landscape indicates that data analysis requires the discovery of data that joins, unions,
or aggregates with existing data in a precise way, a paradigm referred to as query-driven data discovery. (We will revisit this
newly proposed term in Section 5.)
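As an illustration of what "discovery of data that joins with existing data" can mean in practice, here is a minimal sketch that ranks columns in a lake by how well they could join with a user's query column. The scoring by set containment, the function names, the threshold and the sample lake are all our own assumptions for illustration; they are not taken from [31] or from any specific system.

```python
def containment(query_values, candidate_values):
    """Fraction of distinct query values that also occur in the candidate column."""
    q = set(query_values)
    if not q:
        return 0.0
    return len(q & set(candidate_values)) / len(q)


def joinable_columns(query_values, lake, threshold=0.5):
    """Rank (table, column) pairs in the lake by containment of the query column."""
    hits = []
    for table, columns in lake.items():
        for column, values in columns.items():
            score = containment(query_values, values)
            if score >= threshold:
                hits.append((score, table, column))
    # Highest score first; ties are broken alphabetically for a stable order.
    hits.sort(key=lambda h: (-h[0], h[1], h[2]))
    return hits


# Hypothetical lake: table name -> column name -> values.
lake = {
    "orders": {"customer_id": [1, 2, 3, 4], "amount": [10, 20, 30, 40]},
    "support": {"cust": [2, 3, 9], "ticket": [100, 101, 102]},
    "weather": {"station": [7, 8], "temp": [3, 4]},
}
query_column = [1, 2, 3]
results = joinable_columns(query_column, lake)
for score, table, column in results:
    print(f"{table}.{column}: {score:.2f}")
```

The query here is itself a piece of data (a column), and the answer is a ranked list of datasets rather than a set of tuples, which is the shift in emphasis that the term query-driven data discovery is meant to capture.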

3. Metadata, schema matching and ontology

Since data in data lakes come from various sources, metadata plays an extremely important role in characterizing the
diverse and heterogeneous datasets stored in data lakes. The metadata of a data lake holds the key to understanding and
manipulating that data lake; in fact, it is the only place to start working with any dataset stored in it. Since research
papers on metadata in data lakes are abundant, in this short section we provide only a very brief sketch. An example of
metadata extraction can be found in [29]. The use of metadata is critical for dataset discovery (or data profiling) to handle
users’ information needs [1,34]. In particular, ontologies can be used to handle the semantics involved. A nice discussion of
data profiling in a more general context can be found in [30]. Reference [3] presented an information profiling prototype system,
which has an ontology alignment component, and reference [2] presented a proximity mining approach for pre-filtering
schema matching.
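For readers new to data profiling, the following minimal sketch shows the kind of per-column metadata that a profiler may extract from a raw file. It uses only the Python standard library; the function names, the three profiled properties and the sample CSV are our own assumptions, and real profilers such as those surveyed in [1,34] compute far richer statistics.

```python
import csv
import io


def infer_type(values):
    """Guess 'int', 'float' or 'str' from the string values of one column."""
    if not values:
        return "unknown"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"


def profile(csv_text):
    """Return simple per-column metadata: inferred type, null count, distinct count."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    result = {}
    for column in (rows[0].keys() if rows else []):
        values = [row[column] for row in rows]
        present = [v for v in values if v not in ("", None)]
        result[column] = {
            "type": infer_type(present),
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return result


# Hypothetical raw file with missing values.
csv_text = "id,city,temp\n1,Omaha,21.5\n2,,19\n3,Omaha,\n"
for column, stats in profile(csv_text).items():
    print(column, stats)
```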
The topic of metadata is also closely related to our observation on information granularity, which is to be presented in the
next section.

4. Granularities in data lakes

As noted by many authors, metadata in data lakes involves dealing with data and objects at various levels of
granularity, and similarity measures are needed in forming these granules. For example, [18] proposed a metadata model
that supports the acquisition of metadata at varying granular levels and any metadata categorization, including the acquisition
of both metadata that belongs to a specific data element and metadata that applies to a broader range of data.
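The idea of attaching metadata at different granular levels can be sketched as follows. This is not the model of [18]; it is a minimal illustration under our own assumptions, in which metadata is keyed by a path (lake, dataset, column) and a lookup merges the broader levels with the more specific ones. The class name, method names and sample annotations are hypothetical.

```python
class GranularMetadata:
    """Metadata keyed by a path such as () for the lake, ("sales",) or ("sales", "email")."""

    def __init__(self):
        self._store = {}

    def annotate(self, path, **properties):
        """Attach properties at one granular level."""
        self._store.setdefault(tuple(path), {}).update(properties)

    def lookup(self, path):
        """Merge metadata from the broadest level down to the given element.

        More specific levels override broader ones on conflicting keys.
        """
        path = tuple(path)
        merged = {}
        for depth in range(len(path) + 1):
            merged.update(self._store.get(path[:depth], {}))
        return merged


meta = GranularMetadata()
meta.annotate((), owner="lake-admin")                               # applies to everything
meta.annotate(("sales",), source="crm-export", sensitivity="internal")
meta.annotate(("sales", "email"), sensitivity="personal")           # one specific column

print(meta.lookup(("sales", "email")))
print(meta.lookup(("sales", "price")))
```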
Granularity is also a major concern in other aspects of data lakes. One such aspect is entity resolution (ER), a core
task of data integration that aims at matching different representations of entities coming from various
sources. Reference [4] proposed query-driven entity resolution in data lakes, which explicitly manipulates information granules
through blocking and meta-blocking techniques. To achieve the goal of entity resolution, blocking restricts the executed
comparisons to similar entities by clustering them into entity blocks, while meta-blocking restructures the given
blocks into new ones to reduce redundancy. An ER-enriched query plan can eventually be developed, and queries can be
processed through a newly proposed blockjoin operator.
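The interplay of blocking and meta-blocking can be illustrated with a small sketch. We assume simple token blocking (entities sharing a token fall into the same block) and a meta-blocking step that keeps only the pairs co-occurring in at least a given number of blocks. These choices, the function names and the sample entities are our own assumptions; the sketch does not reproduce the algorithms or the blockjoin operator of [4].

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(entities):
    """Blocking: group entity ids that share at least one token."""
    blocks = defaultdict(set)
    for entity_id, text in entities.items():
        for token in text.lower().split():
            blocks[token].add(entity_id)
    # A block with a single entity produces no comparisons, so drop it.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


def meta_blocking(blocks, min_shared=2):
    """Meta-blocking: keep only the pairs that co-occur in at least `min_shared` blocks."""
    shared = defaultdict(int)
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1
    return sorted(pair for pair, count in shared.items() if count >= min_shared)


# Hypothetical entity descriptions coming from different sources.
entities = {
    "a1": "Apple iPhone 12",
    "b7": "iphone 12 apple",
    "c3": "Apple Watch",
    "d9": "Samsung Galaxy 12",
}
blocks = token_blocking(entities)
candidates = meta_blocking(blocks)
print(sorted(blocks))
print(candidates)
```

In this toy example the blocks alone would suggest several comparisons, some of them repeated across blocks; after the meta-blocking step only the pair that shares several blocks remains as a candidate match.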
In fact, studying this kind of information granule and handling the related uncertainty issues falls within the area of granular
computing (GrC) [46]. Rough set theory, a backbone of GrC, was used in a data lake to integrate a data silo with other
organizations’ data, in order to optimize the operational business processes within an organization and to improve data quality
and efficiency [44]. In general, however, although GrC has the potential to play a significant role in metadata management for
data lakes, this expectation has not yet been realized. In fact, when we google the phrase “GrC for data lakes,” we are likely to
get a bunch of results concerning governance (G), risk (R) and compliance (C), generally referred to as GRC in the data lake
literature [5], but not granular computing! We hope that research on data lakes from the granular computing perspective will
catch up, and that the search results will be very different in the coming years.

5. Mixed lazy and eager approach in data lake lifecycle

Since the philosophy behind the concept of the data lake is to get the data first, before it is processed (transformed), the
practice of data lakes implies a lazy approach. However, unlike the case of data warehouses, where data are consolidated first and
then analysis takes place, data lakes usually require some kind of data analysis method to be applied for the discovery of needed
data (rather than waiting until a later time when all data have arrived); in our view, this implies an eager approach. We should
probably not call this phenomenon a dilemma, but it is critical in the overall data lake lifecycle. Due to the importance of this
issue, a little more elaboration follows.
First of all, as noted in [28], in contrast to the traditional practice of ETL (extract, transform and load) in data warehouses,
data lakes process data in a different order. The data are stored in their original format, and the preprocessing step
is deferred until the data are required by an application or at query time. As a result, data lakes promote the idea of
ELT (extract, load, transform). This is in sharp contrast with data warehouses, which store consolidated historical data for
analysis. As such, data warehousing serves as an enabling technique for effective data mining and analysis. It is important to
note that although data lakes play a role similar to that of data warehouses for data storage, people seldom talk about querying
and mining (together referred to as exploration in the context of data lakes) on the entire data lake. An explanation can be
found in [23]: when we talk about exploration in data lakes, of course we hope that useful information can be retrieved from
them. However, since the volume of data stored in a data lake is so huge, existing solutions address the querying problem in
two directions: exploring the data lake based on the relatedness of datasets, or providing a unified query interface for
heterogeneous data sources. As a result, in our view, querying and analysis of stored data are not necessarily the final stage of
the data lake lifecycle; instead, they are intertwined with the activities of other layers. The emphasis of data mining and machine
learning in the context of data lakes may not be on the discovery of hidden knowledge patterns stored in the entire data
lake, but rather on the task of identifying the data sets relevant to users’ interests. This is the underlying motivation for the
new paradigm of query-driven data discovery (Section 2). The enthusiasm for research along this direction is reflected in
[7,14,15,47,48].
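The difference in ordering between ETL and ELT can be made concrete with a minimal sketch. Records are loaded into a raw zone exactly as they arrive, and parsing and cleaning happen only when a query asks for a field. The names raw_zone, load and query_total, as well as the sample records, are hypothetical and serve only to illustrate the lazy ordering described in [28].

```python
import json

raw_zone = []  # the "lake": records are kept exactly as they arrived


def load(raw_line):
    """Load step: store the raw string without parsing or cleaning it."""
    raw_zone.append(raw_line)


def query_total(field):
    """Transform step, deferred to query time: parse, clean and aggregate one field."""
    total = 0.0
    for raw_line in raw_zone:
        try:
            record = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # malformed input is skipped when read, not rejected when loaded
        value = record.get(field) if isinstance(record, dict) else None
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            total += value
        elif isinstance(value, str):
            try:
                total += float(value)  # numbers that arrived as strings are cleaned here
            except ValueError:
                pass
    return total


for line in ['{"amount": 10}',
             '{"amount": "2.5", "note": "string value"}',
             'not json at all',
             '{"other": 1}']:
    load(line)

print(len(raw_zone), query_total("amount"))
```

In an ETL pipeline the malformed line and the string-typed amount would have been handled (or rejected) before loading; here all four records are retained, and each query decides how to interpret them.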

6. Knowledge pattern extraction and knowledge lakes

As noted in the previous section, unlike the case of data warehouses, data mining does not have to wait until the entire data lake
is constructed. Nevertheless, research on data lakes as a whole has been carried out, as shown in [10], where a network-based
approach is used to extract visual knowledge patterns in a data lake for management effectiveness.
Another notable development can be found in [6], where the concept of the intelligent knowledge lake was introduced (based on
the authors’ previously proposed notion of the knowledge lake, which is a contextualized data lake, and related algorithms) to
facilitate linking artificial intelligence (AI) and data analytics. This should enable AI applications to learn from contextualized
data and use them to automate business processes and develop cognitive assistance for facilitating knowledge-intensive processes
or generating new rules for future business analytics.

7. Data warehouses and data lakes

Relationships and differences between data warehouses and data lakes have been widely addressed in survey papers and
tutorials (e.g., [23]). Reference [32] noted that in contrast to a hierarchical data warehouse with file- or folder-based data
storage, the data lake uses a flat architecture, where each data element has a unique identifier and a set of extended metadata
tags. The data lake does not require a rigid schema or manipulation of data of all shapes and sizes, but it requires maintaining
the order of data arrival. We can view a data lake as a large data pool that brings all of the accumulated historical data and new
data (structured, unstructured and semi-structured, plus binary data from sensors, devices and so on) in near real time into one
single place, in which the schema and data requirements are not defined until the data are queried (that is, schema-on-read is
used). It has been noted that although data warehousing techniques allow updates (and a lot of research has focused on
materialized view maintenance, as in the Stanford data warehousing project [24]), data stored in data warehouses may be considered
static by comparison, because data lakes encourage data dynamicity at a much higher level.
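The flat architecture described in [32] can be pictured with a minimal sketch: every stored object receives a unique identifier and a set of free-form metadata tags, there is no folder hierarchy, and the order of arrival is preserved. The class FlatStore, its methods and the sample objects are our own hypothetical names, not an API from [32] or from any storage product.

```python
import uuid


class FlatStore:
    """Flat namespace: each object has a unique id and free-form metadata tags."""

    def __init__(self):
        # id -> (payload, tags); Python dicts preserve insertion (arrival) order.
        self._objects = {}

    def put(self, payload, **tags):
        """Store a payload of any shape as-is and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (payload, dict(tags))
        return object_id

    def get(self, object_id):
        return self._objects[object_id][0]

    def find(self, **tags):
        """Return ids, in arrival order, whose tags include all requested pairs."""
        return [
            object_id
            for object_id, (_, object_tags) in self._objects.items()
            if all(object_tags.get(k) == v for k, v in tags.items())
        ]


store = FlatStore()
first = store.put(b"\x00\x01", source="sensor", format="binary")
second = store.put("id,price\n1,9.99\n", source="crm", format="csv")
third = store.put('{"temp": 21}', source="sensor", format="json")

print(store.find(source="sensor") == [first, third])
```

Binary, CSV and JSON payloads sit side by side without a shared schema; any structure is imposed only by the code that later reads them, which is the essence of schema-on-read.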
Interplay between data warehouses and data lakes is examined in various research papers [12,27]. There have been different
opinions on how to deal with the future of data warehouses and data lakes, but in our view, it is likely that they will coexist (at
least for now), because they can be complementary to each other, and data stored in data warehouses may be converted or
transferred to data lakes, or vice versa. Therefore, many research opportunities exist in this direction of exploration.

8. Other notable developments related to data lakes

This paper is not intended to cover applications of data lakes. Nevertheless, we wrap up our observations with several notable
directions of data lake applications, because studying these applications can reveal various aspects of the very nature
of data lakes and show how data lakes can be used for:
• Blockchain technology: As a field devoted to studying decentralized ledgers, blockchain technology is gaining popularity,
and the relationship between data lakes and blockchains has been studied [37,38,43], for example, to deal with
security-related issues.
• Life science: Examples of applications of data lakes in life science can be found in [8,11].
• Smart cities and smart computing: Applications of data lakes for smart cities and smart computing have been discussed
in [26,35,36].

9. Our expectation: Towards A Framework for Data Lake Lifecycle Evaluation

To conclude this paper, we review experts’ opinions on the challenges and opportunities in future research on data lakes.
Reference [20] addressed challenges and research gaps in data lake research. Research gaps in three areas of data lakes are
identified: there is a lack of (a) a holistic concept of data lake architecture, (b) data lake governance, and (c) a comprehensive
design and realization strategy. Reference [45] addressed four challenges faced in data lake research and practice, including
handling the impact of the evolution of data source structures on an integration layer, optimizing executions of data processing
workflows, cataloging available data sets and metadata management, and assuring high quality of data (especially duplicate
elimination) in data warehouses and data lakes. Reference [13] noted that there are five major research gaps: 1) unclear data
modelling methods, 2) missing data lake reference architecture, 3) incomplete metadata management strategy, 4) incomplete
data lake governance strategy, and 5) missing holistic implementation and integration strategy.
We highly appreciate the above-mentioned critiques and remarks on future research and development in data lakes. In this paper,
we have also expressed our own observations on recent developments related to data lake research (some of them may be
subject to debate). More specifically, we give the following three items high priority in our wish list:
• A common, “authorized” definition of the data lake will emerge in the near future;
• A better understanding of the relationship between data warehouses and data lakes will be widely shared in the IT community;
• In particular, we are looking for a framework of data lake lifecycle evaluation, which could be based on the architecture
of [23], which includes data ingestion (metadata extraction and modeling), maintenance (dataset preparation and
organization, discovery of related datasets, data interaction, metadata enrichment, data quality improvement and schema
evolution) and exploration (query-driven data discovery and querying of heterogeneous data). More systematic, comparative
and experimental studies on data lakes would be highly appreciated.

References
[1] Abedjan Z, Golab L, Naumann F, Data profiling – a tutorial, Proc. SIGMOD 2017.
[2] Alserafi A, Abelló A, Romero O, Calders T, Keeping the data lake in form: DS-kNN datasets categorization using proximity mining.
Model and Data Engineering, 9th International Conference, MEDI 2019, Toulouse, France, October 28-31, 2019, Proceedings.
Berlin: Springer, 2019, pp. 35-49.
[3] Alserafi A, Abelló A, Romero O, Calders T, Towards Information Profiling: Data Lake Content Metadata Management, Proc. IEEE ICDM Workshops 2016.
[4] Alexiou G, Papastefanatos G, Query Driven Entity Resolution in Data Lakes, ISIP 9 May 2019

[5] Amazon, What is governance, risk, and compliance (GRC)? https://aws.amazon.com/what-is/grc/

[6] Beheshti, A., Benatallah, B., Sheng, Q.Z., Schiliro, F. (2020). Intelligent Knowledge Lakes: The Age of Artificial Intelligence and Big Data. In: U, L., Yang,
J., Cai, Y., Karlapalem, K., Liu, A., Huang, X. (eds) Web Information Systems Engineering. WISE 2020. Communications in Computer and Information
Science, vol 1155. Springer, Singapore. https://doi.org/10.1007/978-981-15-3281-8_3
[7] Bogatu A, Fernandes AAA, Paton NW, Konstantinou N, Dataset Discovery in Data Lakes, ICDE 2020, 709-720
[8] Che H, Duan Y, On the Logical Design of a Prototypical Data Lake System for Biological Resources, Front Bioeng Biotechnol. 2020; 8: 553904.
Published online 2020 Sep 29. doi: 10.3389/fbioe.2020.553904
[9] Chen Z, Intelligent Data Warehousing: From Data Preparation to Data Mining, 2001.
[10] Cheng Z, Wang H, Li H, Extracting knowledge patterns in a data lake for management effectiveness, EBLDM 2020, E3S Web of Conferences 214, 03045
(2020), https://doi.org/10.1051/e3sconf/202021403045
[11] Couto JC, Borges OT, Ruiz DD, Automatized bioinformatics data integration in a Hadoop-based data lake, CSCP 2022, pp. 137-153
[12] Dabbèchi, H., Haddar, N.Z., Elghazel, H., Haddar, K. (2021). Social Media Data Integration: From Data Lake to NoSQL Data Warehouse, ISDA 2020
[13] Daradkeh MK, Enterprise Data Lake Management in Business Intelligence and Analytics: Challenges and Research Gaps in Analytics Practices and
Integration, in Ana Azevedo and Manuel Filipe Santos eds., Integration Challenges for Analytics, Business Intelligence, and Data Mining.
[14] Diamantini C, Potena D, Storti E, A Semantic Data Lake Model for Analytic Query-Driven Discovery, iWAS2021, November 29-December 1, 2021, Linz,
Austria, 183-186
[15] Diamantini C, Giudice PL, Potena D, Storti E, Ursino D, An Approach to Extracting Topic-guided Views from the Sources of a Data Lake, Information
Systems Frontiers (2021) 23:243–262
[16] Dibowski H, Schmid S, Svetashova Y, Henson C, Tran T, Using Semantic Technologies to Manage a Data Lake: Data Catalog, Provenance and Access
Control, Proceedings of the 13th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS 2020),
http://ceur-ws.org/Vol-2757/SSWS2020_paper5.pdf
[17] Dixon J, Pentaho, Hadoop, and Data Lake (14 October 2010). James Dixon’s Blog. Retrieved Aug. 14, 2022.
[18] Eichler R, Giebler C, Gröger C, Schwarz H, Mitschang B, Modeling Metadata in Data Lakes - A Generic Model, In: Data & Knowledge Engineering
(2021), 101931
[19] Farrugia A, Claxton R, Thompson S, Towards Social Network Analytics for Understanding and Managing Enterprise Data Lakes, ACM/IEEE ASONAM
2016
[20] Giebler C, Gröger C, Hoos E, Schwarz H, Mitschang B, Leveraging the Data Lake: Current State and Challenges. In: Ordonez, C., Song, IY.,
Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2019. Lecture Notes in Computer Science, vol 11708.
Springer, Cham. https://doi.org/10.1007/978-3-030-27520-4_13
[21] Giebler C, Gröger C, Hoos E, Eichler R, Schwarz H, Mitschang B, The Data Lake Architecture Framework: A Foundation for Building a Comprehensive
Data Lake Architecture, BTW 2021
[22] Gillet A, Leclercq E, Cullot N, Lambda+, the Renewal of the Lambda Architecture: Category Theory to the Rescue. International Conference on Advanced
Information Systems Engineering, Jun 2021, Melbourne, Australia. pp. 381-396, 10.1007/978-3-030-79382-1_23. hal-03354021
[23] Hai R, Quix C, Jarke M, Data lake concept and systems: a survey. CoRR abs/2106.09592 (2021)
[24] Hammer J, Garcia-Molina H, Widom J, Labio W, Zhuge Y, The Stanford Data Warehousing Project, http://ilpubs.stanford.edu:8090/76/1/1995-10.pdf
[25] Han J, Kamber M, Pei J, Data Mining: Concepts and Techniques (3rd ed.), 2010.

[26] Kafando R, Decoupes R, Sautot L, Teisseire M, Spatial Data Lake for Smart Cities: From Design to Implementation, Proceedings of the 23rd AGILE
Conference on Geographic Information Science, 2020. Editors: Panagiotis Partsinevelos, Phaedon Kyriakidis, and Marinos Kavouras
https://doi.org/10.5194/agile-giss-1-8-2020

[27] Jemmali R, Abdelhedi F, Zurfluh G, Transferring Relational and NoSQL Databases from a Data Lake, SN Computer Science 3(5) July 2022
DOI: 10.1007/s42979-022-01287-7

[28] Khine PP, Wang ZS, Data lake: a new ideology in big data era, ITM Web of Conferences 17, 03025 (2018) https://doi.org/10.1051/itmconf/20181703025
[29] Langenecker S, Sturm C, Schalles C, Binnig C, Towards Learned Metadata Extraction for Data Lakes, in K.-U. Sattler et al. (Hrsg.): Datenbanksysteme für
Business, Technologie und Web (BTW 2021), Lecture Notes in Informatics (LNI), doi:10.18420/btw2021-17
[30] Liu Z, Zhang A, A Survey on Sampling and Profiling over Big Data (Technical Report), Cornell University, 2020.
[31] Miller R, Open Data Integration, Proc. VLDB Endow., 11(12), 2130-2139, 2018.
[32] Miloslavskaya N, Tolstoy A, Big Data, Fast Data and Data Lake Concepts, Proc. 7th Annual International Conference on Biologically Inspired Cognitive
Architectures, BICA 2016, Procedia Computer Science, Volume 88, 2016, Pages 300-305,
https://www.sciencedirect.com/science/article/pii/S1877050916316957
[33] Nargesian F, Zhu E, Miller R, Pu K, Arocena P, Data lake management: Challenges and opportunities, VLDB, 2019.
[34] Naumann F, Data profiling revisited, ACM SIGMOD Record, Volume 42, Issue 4, December 2013, pp 40-49
[35] Nurhadi, Kadir RBA., Surin, ESBM. (2021). Evaluation of NoSQL Databases Features and Capabilities for Smart City Data Lake Management. In: Kim,
H., Kim, K.J., Park, S. (eds) Information Science and Applications. Lecture Notes in Electrical Engineering, vol 739. Springer, Singapore.
https://doi.org/10.1007/978-981-33-6385-4_35
[36] Ouafiq EM, Saadane R, Chehri A, Wahbi M, Data Lake Conception for Smart Farming: A Data Migration Strategy for Big Data Analytics. In: A.
Zimmermann et al. (eds.), Human Centred Intelligent Systems, Smart Innovation, Systems and Technologies 310,
https://doi.org/10.1007/978-981-19-3455-1_15
[37] Panwar A, Bhatnagar V, A cognitive approach for blockchain-based cryptographic curve hash signature (BC-CCHS) technique to secure healthcare data
in Data Lake, Soft Computing, Nov. 2021. https://doi.org/10.1007/s00500-021-06513-7
[38] Panwar A, Bhatnagar V, Khari M, Salehi AW, Gupta G, A Blockchain Framework to Secure Personal Health Record (PHR) in IBM Cloud-Based Data
Lake, Computational Intelligence and Neuroscience Volume 2022, Article ID 3045107
[39] Pérez-Arteaga P, Castellanos C, Castro H, Correal D, Guzmán L, Denneulin Y, Cost Comparison of Lambda Architecture Implementations for
Transportation Analytics using Public Cloud Software as a Service. In Proceedings of the 13th International Conference
on Software Technologies (ICSOFT 2018), pages 855-862. DOI: 10.5220/0006869308550862
[40] Ravat F, Zhao Y, Data Lakes: Trends and Perspectives. International Conference on Database and Expert Systems Applications (DEXA 2019), Aug 2019,
Linz, Austria. pp. 304-313. hal-02397457
[41] Sawadogo P and Darmont J, On data lake architectures and metadata management, J Int Info Sys, vol 56, pp 97-120, 2021.
[42] Silberschatz A, Korth H, Sudarshan S, Database System Concepts (7th ed.), 2019
[43] Wang L, Exploring Blockchain and Big Data with Alibaba Cloud Data Lake Analytics, Alibaba Clouder August 8, 2018

[44] Wibowo M, Sulaiman S, Shamsuddin SM, Machine Learning in Data Lake for Combining Data Silos, International Conference on Data Mining and Big
Data, 2017
[45] Wrembel R, Still Open Problems in Data Warehouse and Data Lake Research: extended abstract, 2021 Eighth International Conference on Social Network
Analysis, Management and Security (SNAMS), 2021, pp. 01-03, doi: 10.1109/SNAMS53716.2021.9732098.
[46] Zadeh LA, Toward a Theory of Fuzzy Information Granulation and its Centrality in Human Reasoning and Fuzzy Logic, Fuzzy Sets and Systems, vol. 90,
111-127
[47] Zhao Y, Aligon J, Ferrettini G, Megdiche I, Ravat F, Soulé-Dupuy C, Analysis-oriented Metadata for Data Lakes, IDEAS 2021, July 14-16, 2021,
194-203.
[48] Zhao Y, Megdiche I, Ravat F, Data Lake Ingestion Management. CoRR abs/2107.02885 (2021)
