
JOURNAL OF ELECTRONIC SCIENCE AND TECHNOLOGY, VOL. 17, NO. 1, MARCH 2019

Digital Object Identifier: 10.11989/JEST.1674-862X.80926105

Components and Development in Big Data System: A Survey

Jing-Huan Yu* | Zi-Meng Zhou

Abstract—With the growth of distributed computing systems, modern Big Data analysis platforms often have diversified characteristics, which makes it hard for users to make decisions when they first come into contact with Big Data platforms. In this paper, we discuss the design principles and research directions of modern Big Data platforms by presenting research on modern Big Data products. We provide a detailed review and comparison of several state-of-the-art frameworks and distill them into a typical structure with five horizontal layers and one vertical layer. Following this structure, this paper presents the components and modern optimization technologies developed for Big Data, which helps readers choose the most suitable components and architecture from various Big Data technologies according to their requirements.

Index Terms—Big Data, cloud computing, data analysis, optimization, system architecture.

1. Introduction
Big Data refers to collections of data that cannot be captured, managed, and processed by traditional software tools within a tolerable time. Big Data technologies refer to the ability to quickly obtain valuable information from various types of data[1]. To keep pace with the need to store and analyze the daily growth of complex data, a significant number of analytical platforms are available for the analysis of complex structured and unstructured data, each of which is designed to handle specific types of data and workloads[2],[3].
This article describes the common features in Big Data platforms' development according to recent research. It also lists some widely used components and applications of Big Data while exploring those products in use[4]-[10].

1.1. Applications in Big Data


Big Data technologies are widely used in our daily life. Data analytics is the primary purpose and core feature of

*Corresponding author
Manuscript received 2018-09-26; revised 2018-11-05.
This work was supported by the Research Fund of Tencent Computer System Co. Ltd. under Grant No. 170125.
J.-H. Yu and Z.-M. Zhou are with the Department of Computer Science, City University of Hong Kong, Hong Kong 999077, China
(e-mail: jinghuanyu2-c@my.cityu.edu.hk; zmzhou4-c@my.cityu.edu.hk).
Color versions of one or more of the figures in this paper are available online at http://www.journal.uestc.edu.cn.
Publishing editor: Xin Huang

Copyright © 2019 University of Electronic Science and Technology of China. Publishing Services provided by Elsevier B.V. on behalf of KeAi.
This is an open access article under the CC BY-NC-ND License (http://creativecommons.org/licenses/by-nc-nd/4.0/ ).

Big Data. With vast amounts of data emanating from various digital sources, the importance of data analytics has grown tremendously in recent years. Many industries have been propelled by Big Data; the most representative ones are as follows.
1) Big Data contributions to public healthcare
Healthcare has always been a pressing social issue and benefits greatly from the rise of Big Data. By keeping track of all patients' histories, Big Data can significantly help physicians conduct long-term experimental projects. Massive data storage technology ensures that once a patient receives treatment, her/his data will be safely stored in the data center.
2) Big Data contributions to public sectors
In public sectors, Big Data techniques can provide wide-ranging search from corner to corner across various public sector units and allied unions. By providing an extensive range of facilities and pipelines for data collection, Big Data can build wide-area monitoring systems covering power grid maintenance, fraud recognition, financial promotion investigation, ecological protection, etc.
3) Big Data contributions to education
Teaching students according to their aptitudes is a critical concept in education, but the shortage of educational resources and large population bases mean this concept is rarely applied in real life. Data-driven classrooms provide a solution that goes much further than coursework reform and grading improvement. By collecting various online resources and offering customized intelligent recommendation services for both students and teachers, these applications can provide adaptive learning solutions and problem-tracking services for educators.
4) Big Data contributions to banking zones and fraud detection
To some extent, Big Data techniques were originally designed for this area. With the rise of anomaly detection, the usage of Big Data has expanded widely. Detecting the misuse of credit or debit cards, inspecting archival audit trails, and managing venture credit risk are basic usages of Big Data technologies.

1.2. Challenges and Complex Architecture in Big Data


With the widespread use of these techniques, several challenges in Big Data have been exposed[11]. Privacy and security, information access and sharing, analytical challenges, and so on make the optimization of Big Data technologies urgently needed. The common motivation for extending these techniques is to find alternative ways to achieve the highest performance in a cost-effective manner, under given limitations, by making the most of the desired factors.
Big Data applications, due to their complex components and frequent updates, form a unique application ecosystem based on the aggregation of computing engines. In this paper, we focus on products and their improvements in the integration of disparate data sources[12]-[14], combination with high-performance hardware[15],[16], and execution process optimization[17]-[19]. Based on the projects studied during the research period, we summarize these components into a five-horizontal-and-one-vertical structure to help readers learn and organize Big Data products from the inside.

2. Overview of the Proposed Structure
Fig. 1 shows the partitioning method proposed in this paper, consisting of four main layers and six different modules. It should be noted that this figure does not imply dependency relationships between the layers. The criteria for this classification are based on the primary use of each layer. Moreover, components playing several roles in different systems may be mentioned only for their most representative function; the exact layer a component is assigned to is based on its typical use case.

Fig. 1. Five-horizontal-and-one-vertical structure: An application layer (integrated access, decision support, intelligent recommendation, fraud recognition, and inspection tracks); a data processing layer consisting of a data access module (real-time query, multi-dimensional query, SQL-based calculating, and data storage interfaces and metadata protocols), a data analysis module (native and secondary development frameworks with algorithm drivers, translators and optimizers, execution schedulers, and cluster metadata requesters over hardware and calculating-module interfaces), and a data calculating module (batch, hybrid, and stream processing components); a data collecting layer (structured, semi-structured, and unstructured data storage fed by offline, real-time, Internet-resource, and third-party collecting); and a vertical management layer (development management toolkit, data management, data quality control, maintenance management, alert detecting, and system monitoring).

2.1. Application Layer


The user application layer is the level that users can directly contact, access, and control. According to enterprise characteristics, we divide applications into types such as Web services, integrated data access[8], and decision support[4]. Most products at this level target the specific needs of customers and the customized development of a company's main business. The specific functions and implementations rely on various development frameworks and message components, which are the primary embodiment of cloud computing products.

2.2. Data Processing Layer


The data processing layer is the most complex part of Big Data applications. System architects always face various choices about where to process their data, each choice carrying possible orders-of-magnitude differences in performance. Though in recent years there have been many related studies, such as cross-platform frameworks[12],[13], full-stack support, and other solutions (discussed further in Section 4), that aim to simplify this layer, the challenges mentioned above remain evident here. In order to have a more universal and flexible architecture that meets different system requirements, we divide this layer into three smaller modules.
1) Data access module
Components in this module mainly achieve read-write separation and application-oriented queries, providing direct data access interfaces for application services like on-line transaction processing (OLTP) and on-line analytical processing (OLAP). In addition to traditional relational database management systems (RDBMSs) and data warehouses like Oracle and DB2, open source databases are an important type of service at this level[20]. Compared with solutions that are maintained by a company and constantly track evolving requirements, open source solutions tend to be built for a specific need.
In modern Big Data platforms, there are three kinds of components for data accessing: The real-time query component, the structured query language (SQL)-based calculating component, and the multi-dimensional query component, which combines characteristics of both.

2) Data analysis module


In the early stage of exploration in this module, Statistical Product and Service Solutions (SPSS) was the first commercial product in the field of data analysis. This module mostly provides the upper-level components of the whole system, which are usually built on top of the calculation engine and are used to implement various data analysis algorithms.
Unlike the user application layer, this module does not commonly provide an interface to system users directly. The generated data usually gets stored or iterated as an intermediate result or as data input for other modules in the system. There are many famous frameworks in this layer, like TensorFlow[21], Apache SINGA[22], and the Spark machine learning library (MLlib)[23], which illustrate that components in this module are mostly designed for certain requirement types.
3) Data calculating module
The calculating module provides the massive data computing foundation of the whole architecture. This paper divides the components according to their processing models, basically covering batch-only, stream-only, and hybrid frameworks. The products developed for this layer have obvious mash-up characteristics, which are embodied in the mutual penetration, reference, and invocation between these components.
4) Conclusion
The processing layer performs processing in the system by reading data from storage systems or other third-party systems. Depending on the implementation details of each specific application, different frameworks may place some of the same components in different layouts. However, we can still map this structure to each of them.

2.3. Data Collecting Layer


The main task of this level is to collect data in batches. Big Data has 4V features: Volume, velocity, variety, and value[1]. As the core module responsible for data collection, the specific requirements this layer should handle can be summarized as follows.
1) Diversified data collecting capabilities: Big Data systems always need to collect data from various sources, so the component set needs to fit the requirements of collecting real-time, incremental, and variously formed data such as tables, files, messages, and third-party systems[24],[25]. It also means that platforms should provide support for all types of distributed data storage systems like RDBMS, Hadoop distributed file system (HDFS), and network attached storage (NAS)[26],[27].
In addition to the traditional extract-transform-load (ETL) technology, the growing requirements demand that platforms provide more flexible functions like real-time collection, Internet crawler analysis, and support for more kinds of data sources.
2) Visual rapid configuration capability: There is an important concept in software engineering that the labor cost of maintaining a system is much higher than the cost of developing it, so it is necessary to provide an easy-to-maintain interface set. Basically, developers prefer graphical development and access interfaces, and drag-and-drop development and management graphical user interface (GUI) suites, in order to reduce hand-written code and lower the learning barrier. These tool suites ultimately reach the goal of cutting the cost wasted on simple, repetitive work[28].
3) Unified scheduling management and control capabilities: Big Data systems always involve multiple data collecting sources, which means it is important for the system to schedule collection tasks and interfaces.
Mostly, software manufacturers or communities providing toolkits in this layer focus on the first feature, namely how to collect massive data efficiently. We list the most famous products in this area in subsection 3.2.

2.4. Platform Management Layer


The management layer of Big Data platforms mainly contains application management tools and system management tools. From the perspective of application management, it is responsible for accessing and monitoring various component instances and for scheduling and configuring external users and systems[29],[30]. The system management part includes deploying the developed applications, internal resource and authority assignment, and visual operation and maintenance. The most representative product may be Amazon Web services (AWS)[29], which sets the standard for cloud computing and virtual cluster management products.

3. Representative Components
In this section, we give a detailed review of the components in the data processing layer and the data collecting layer. We introduce several typical products in use for each layer and then compare them in terms of their fit with the layer's features and their core algorithms or techniques.

3.1. Data Processing Layer


The data processing layer consists of three parts: The data accessing module, the data analysis module, and the data calculating module. According to the use cases and their features, we present several different component sets to show the main usage of these modules.

3.1.1. Data Accessing Module


Data accessing products usually provide direct access to data sources; they are mainly built on top of data storage services and focus on real-time querying, conditional filtering, and data transportation. There are three typical kinds, divided by interfaces and calculating models: The real-time querying component, the SQL-based computing component, and the multi-dimensional query component.
A. Real-Time Querying Component
As for real-time querying products, Facebook Presto[31], Apache Drill/Dremel[32],[33], and Impala[34],[35] are the most
representative ones.
1) Presto: Facebook opened the source code of Presto in November 2013. This distributed SQL query engine is designed to perform high-speed, real-time data analysis. It provides American National Standards Institute (ANSI) structured query language (SQL) as a standard operating interface, including complex queries, aggregations, joins, and window functions. Presto designed a simple abstraction layer for data storage to satisfy queries over data stored in different systems (including HBase[36], HDFS[37], and Scribe[38],[39]). Fig. 2 shows the Presto system architecture.

Fig. 2. Architecture of the Presto system (command-line shell and Thrift/JDBC clients over a Presto coordinator containing the query parser/analyzer, query planner, and query scheduler, backed by the Hive metastore and Hadoop storage/Scribe).
Presto uses a custom query execution engine and corresponding operators to support SQL syntax. In addition to an improved scheduling algorithm, the different processing stages form a pipeline through the network. This mechanism avoids the additional delays caused by unnecessary disk reads and writes. The pipelined execution model runs multiple data processing segments at the same time; as soon as data is available, it is passed from one processing segment to the next. This approach greatly reduces the end-to-end response time of various queries. Meanwhile, it avoids materializing a lot of intermediate results in the underlying operation, improving performance and cutting wasted resources.
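To make this concrete, below is a minimal sketch of issuing an ANSI SQL query to a Presto coordinator from Python. It assumes the community presto-python-client package; the host, catalog, and table names are illustrative, not taken from the paper.

```python
# Minimal sketch: an ANSI SQL query (with a window function) sent to a Presto
# coordinator. Assumes the community "presto-python-client" package
# (pip install presto-python-client); host and table names are illustrative.
import prestodb

conn = prestodb.dbapi.connect(
    host="example-coordinator",  # hypothetical coordinator host
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
# Presto executes this through its pipelined stages rather than batch jobs.
cur.execute("""
    SELECT region,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
""")
for row in cur.fetchall():
    print(row)
```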
2) Apache Drill and Google Dremel: Dremel is Google's interactive data analysis system; Apache Drill is its open-source implementation. The features of the Dremel system are as follows.

a) Large scale: It is possible to form a cluster containing thousands of nodes and process data at the petabyte (PB) level.
b) Nested data models: Internet data is often non-relational. Dremel supports nested data models like the JavaScript object notation (JSON) format. The Drill implementation is also schema-free and can run on HDFS[37], Not Only SQL (NoSQL) stores[40], and cloud storage systems[41].
c) Based on concepts from Internet search and parallel database management systems (DBMSs): The architecture of Dremel and Drill draws on the concept of the service tree used in distributed search engines[39]. Unlike Pig[42] and Hive[43], they execute queries directly instead of translating them into MapReduce (MR) tasks.
Because of these characteristics, Dremel reduces the minute-level overhead required for MapReduce to process data down to the second level, as Figs. 3 and 4 show[33],[44].

Fig. 3. MapReduce and Dremel execution on columnar vs. record-oriented data (execution time of MR-records, MR-columns, and Dremel). Fig. 4. Speed comparison of modern cloud analysis engines (from roughly 10 min for MapReduce in 2004, to 10 s for Hive in 2009, down to about 100 ms for Dremel in 2010, Spark in 2010, and Impala in 2012).
Dremel uses a multi-level service tree to execute queries (see Fig. 5). A root server receives the incoming query, reads the metadata from a table, and routes the query to the next layer. The leaf servers are responsible for communicating with the storage layer and accessing the data directly on local disks. When the root server receives a query, it establishes the horizontal partitioning of the tables involved and rewrites the query into a union statement. This computation model is ideal for aggregation queries that return few results, taking the greatest advantage of combining a large amount of low-cost hardware.

Fig. 5. Multi-level service tree in Dremel (a client submits to the root server of a query execution tree; intermediate servers fan the query out to leaf servers with local storage over a storage layer, e.g., GFS).
3) Impala: Impala is a real-time interactive SQL Big Data query tool developed by Cloudera, inspired by Google's Dremel. It combines the Google Dremel architecture with a massively parallel processing (MPP) architecture[45],[46]. Impala uses a distributed query engine similar to those of commercial parallel relational databases, consisting of a query planner, a query coordinator, and a query exec engine (as shown in Fig. 6).
B. SQL-Based Computing Component
This kind of component is built to suit the working habits of data scientists. These products realize the abstraction of data computing and processing services by providing SQL-like languages. The most common components are Hive[44], Pig[42], and Spark SQL[47]. These products are not real SQL query tools; rather, they provide a SQL-like statement interface to implement operations on each one's computing model (like the MapReduce[2] tasks in Hive and the Spark resilient distributed dataset (RDD)[3] in Spark SQL).
1) Hive: Hive is a data warehouse tool based on Hadoop. It maps structured data files into database tables and provides a simple SQL query capability. Hive is designed to convert SQL statements into MapReduce tasks. Due to the low learning cost, users can quickly implement complex MapReduce statistics through SQL-like statements.

Fig. 6. Architecture of the Impala system (a common Hive SQL interface with ODBC access and unified metadata in the Hive metastore, state store, and HDFS NameNode; fully distributed MPP query planners, coordinators, and exec engines co-located with HDFS DataNodes and HBase for local direct reads).

Fig. 7 shows the main structure of the Hive system, mainly divided into user interfaces, metadata storage, database drivers, and the MapReduce engine.
a) User interfaces: There are a variety of user interfaces, but the most used in practice are the CLI and the protocol interfaces. The CLI is the command-line interface, which starts a copy of Hive when it starts. Hive provides Hive query language (HQL) operation through the Java database connectivity (JDBC) and Thrift protocols.

Fig. 7. Architecture of the Hive system (command-line shells over a driver containing the SQL parser, query optimizer, physical plan, and execution components, with SerDes/UDFs and a metastore database, running on the MapReduce engine over Hadoop storage).

b) Metadata storage: Hive stores metadata in third-party databases, such as MySQL and Derby.
The metadata mainly contains the names, columns, and partitions of the tables in use and their attributes, such as the directories where these tables' data are located.
c) Database driver: Though Hive has evolved through many versions since it was first designed, the main structure has never changed. The driver model mainly includes an interpreter, a compiler, an optimizer, and a physical plan with its executor. An HQL query statement sequentially undergoes lexical and syntax analysis, compilation, optimization, and query plan generation in the driver. The generated query plan is stored in HDFS and subsequently executed by MapReduce calls.
d) MapReduce and storage: Hive stores data in HDFS and uses the MapReduce engine to do the query
executions.
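As a concrete illustration of the HQL-to-MapReduce path described above, the following is a minimal sketch of submitting an HQL query from Python. It assumes the community PyHive package and a HiveServer2 endpoint; host and table names are illustrative.

```python
# Minimal sketch: an HQL aggregation that Hive parses, optimizes, and executes
# as MapReduce stages. Assumes the community PyHive package (pip install pyhive)
# and a HiveServer2 endpoint; host and table names are illustrative.
from pyhive import hive

conn = hive.Connection(host="example-hive", port=10000, username="analyst")
cur = conn.cursor()
cur.execute("SELECT dept, COUNT(*) FROM employees GROUP BY dept")
for dept, n in cur.fetchall():
    print(dept, n)
```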
2) Pig: Pig is a large-scale data analysis platform based on Hadoop. It provides a SQL-like language called Pig Latin. The compiler of this language converts SQL-like data analysis requests into a series of optimized MapReduce operations. Fig. 8 shows the architecture of its main framework. Pig provides a simple operation and programming interface for complex, massive data-parallel computing. Pig's main structure and running process are just like Hive's. Both products convert SQL/SQL-like statements into logical plans and physical plans that can be executed by the MapReduce engine. However, unlike Hive, Pig has its own syntax, Pig Latin, which differs greatly from ANSI SQL in programming ideas and data structures.

a) Programming ideas: Pig Latin is a data flow programming language, while SQL is a declarative programming language. Pig declares the process of processing and converting data, while SQL only defines what data is needed. To some extent, Pig scripts are similar to SQL's query plan, which translates declarative requests into concrete execution steps.
b) Data structure: In SQL, data is stored in a table and bound to a specific schema. Pig's data structure is loose, and it defines schemas during processing. Also, Pig supports complex nested data structures, which is quite different from SQL's flatter table structure. Pig's user-defined functions (UDFs) and streaming operations (for adapting UDFs written in different languages) make Pig more flexible and powerful.

Fig. 8. Architecture of Apache Pig (Pig Latin scripts enter through the Grunt shell or Pig server, pass through the parser, optimizer, compiler, and execution engine, and run as MapReduce on Hadoop/HDFS).

3) Spark SQL: Spark SQL's predecessor is Shark. Fig. 9 shows the evolution between these two components. Just like Hive, Spark SQL provides a quick start-up tool for technicians who are familiar with RDBMS but do not
understand Spark RDD. Since Hive runs on Hadoop, the large number of intermediate disk-landing steps in the MapReduce computing process consumes a lot of I/O. Spark's developers modified the optimizer and execution plan in Hive to target Spark[3], an in-memory computing engine. Building on Shark's foundation, the Spark SQL project got rid of Hive-specific components such as the syntax parser and query optimizer, while expanding data compatibility and optimizing operational performance, thus greatly improving the component's extension capabilities. The benefits of its loose structure are shown in Fig. 10: Spark can adopt several different optimization techniques, such as in-memory columnar storage and byte-code generation. This architecture also makes it much more convenient for developers to redesign the SQL parsers, planners, and optimizers of Spark SQL according to their requirements.
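The declarative interface over the RDD engine can be sketched with a few lines of PySpark. This is a minimal illustration; the file path and column names are invented for the example.

```python
# Minimal sketch: a declarative Spark SQL query; it passes through the
# analysis, optimization, planning, and code generation pipeline of Fig. 10
# before running as RDD operations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()
df = spark.read.json("events.json")   # hypothetical nested JSON input
df.createOrReplaceTempView("events")

top = spark.sql(
    "SELECT user, COUNT(*) AS n FROM events GROUP BY user ORDER BY n DESC"
)
top.explain()   # prints the physical plan selected by the cost model
top.show(10)
```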
Fig. 9. Evolution from Hive to Shark (both share a client with CLI/JDBC access, a metastore, a SQL parser, a query optimizer, and physical plan execution over MapReduce and HDFS; Shark adds a cache manager to the driver).


Fig. 10. Transformations occurring inside Spark SQL (a SQL query or data frame passes through analysis against the catalog, logical optimization, physical planning under a cost model, and code generation: From an unresolved logical plan to a logical plan, an optimized logical plan, selected physical plans, and finally RDDs).



C. Multi-Dimensional Query Component

One of the most representative of these components is Apache Kylin[45],[46], a multi-dimensional query component. Although data warehouses have gradually replaced relational databases as the underlying data storage services for Big Data platforms, relational databases still occupy the core position of the storage layer. Relational on-line analytical processing (ROLAP) is an improvement on the on-line analytical processing (OLAP) technique, which performs dynamic multi-dimensional analysis on data stored in a relational database. These query components can directly perform multi-dimensional operations on contents from the relational database system.
Kylin is designed to reduce the latency of queries over ten-billion-record datasets on Hadoop/Spark, using ANSI SQL to provide an interface that supports most of the query functionality. As shown in Fig. 11, Kylin can integrate directly with a wide range of business intelligence tools such as Tableau, Power BI/Excel, MicroStrategy Incorporated (MSTR), Qlik Sense, Hue, and SuperSet. With Kylin, users can perform sub-second interactions with Hadoop data, providing better performance than Hive. The biggest feature of Kylin is that users can define a data cube and pre-build it in Kylin over massive data records.
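To illustrate how the pre-built cube is consumed, the following sketch queries Kylin over its REST interface. The endpoint, payload shape, and default demo credentials follow Kylin's documented REST API as we understand it; the host and the sample project and table names are illustrative.

```python
# Minimal sketch: querying a pre-built Kylin cube over REST. The
# /kylin/api/query endpoint, payload, and ADMIN/KYLIN demo credentials are
# assumptions based on Kylin's documentation; host and project are illustrative.
import requests

resp = requests.post(
    "http://example-kylin:7070/kylin/api/query",
    auth=("ADMIN", "KYLIN"),
    json={
        "sql": "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
        "project": "learn_kylin",
    },
)
resp.raise_for_status()
print(resp.json()["results"][:5])  # answers served from the pre-built cube
```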

Fig. 11. Architecture of Apache Kylin (third-party apps and SQL-based BI tools such as Tableau reach a REST server and query engine; queries are routed either to the pre-built OLAP cube stored as key-value data in HBase for low, second-level latency, or to Hadoop/Hive for mid, minute-level latency; a cube build engine feeds the cube from star-schema data in Hive, Kafka, and RDBMS sources; only SQL is exposed to end users, and the OLAP cube is transparent to them).

3.1.2. Data Analysis Module


Components in the data analysis module are mostly frameworks developed for different programming languages.
R and Python are similar in that both symbolize a script-based technology stack and both use statistical techniques for data analysis. The main difference between the two technology stacks is that Python leans more toward engineering development and has direct support for mature algorithms, such as the word segmentation and clustering required by many projects, while R focuses more on statistical drawing and provides better visualization support at the native level. The R language is rarely discussed in terms of its own technical framework; rather, it serves as an interface to other frameworks or Big Data engines. Both languages are based on sample statistics, and since scripting languages depend on the characteristics of their execution environments, their support for large-scale data is limited. In most cases, they need to cooperate with C++ or other more efficient, lower-level languages.
The Java virtual machine (JVM) based domain is much more systematic due to its mature cross-platform technology, complete development suite support, and strong engineering management structure. A variety of data analysis tools based on different computing concepts have emerged, most of them dependent on Apache Hadoop and Apache Spark. Both build a whole environment of development suites and provide a powerful, fully functioning interface set.
Unlike the other modules in this layer, products here are more demand-oriented. It is hard to list the frameworks developed for all purposes, so we mention just the two most popular frameworks in machine learning (ML).
1) TensorFlow: For those who have heard of deep learning but have not studied it deeply, TensorFlow may be their favorite deep learning framework. Beyond its usage as a deep learning framework, its official definition is an open source software library for machine intelligence, so it is better to consider TensorFlow an open source library for numerical calculation using data flow graphs[21]. TensorFlow[48] is not just a deep learning framework but a graph compiler. It is a relatively low-level programming framework that supports Python and C++, allows computation to be distributed across CPUs and GPUs, and even supports horizontal scaling with gRPC, the remote procedure call framework produced by Google (see Fig. 12). Most of its kernel code is written in highly optimized C++ and CUDA (NVIDIA's language for programming GPUs)[49], built on Eigen (a high-performance C++ and CUDA library) and NVIDIA's cuDNN (a dedicated deep neural network (DNN)[50] library for NVIDIA GPUs, providing convolution and other functions)[51],[52].
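The data-flow-graph idea can be seen in a few lines. The sketch below uses the TensorFlow 1.x graph-and-session API that was current when this survey was written; shapes and names are illustrative.

```python
# Minimal sketch: TensorFlow as a dataflow-graph compiler (1.x API). Building
# the graph and executing it are separate steps; the session places operations
# on the available CPU/GPU devices.
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
    w = tf.Variable(tf.ones([3, 1]), name="w")
    y = tf.matmul(x, w)              # a graph node, not a computed value
    init = tf.global_variables_initializer()

with tf.Session(graph=g) as sess:    # run the graph on the available devices
    sess.run(init)
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```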
2) Spark MLlib: MLlib[51] is Spark's ML library. It provides a set of algorithms and frameworks for building a practical, scalable, and easy ML development environment. Several kinds of tools are provided, such as[53] (see the sketch after this list):
a) ML algorithms: Common learning algorithms such as classification, regression, clustering, and collaborative filtering.
b) Featurization: Feature extraction, transformation, dimensionality reduction, and selection.
c) Pipelines: Tools for constructing, evaluating, and tuning ML pipelines.
d) Persistence: Saving and loading algorithms, models, and pipelines.
e) Utilities: Linear algebra, statistics, and data handling.
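The sketch below strings several of these tool categories together; it is a minimal illustration with an invented toy dataset, not an example from the paper.

```python
# Minimal sketch: featurization, an ML algorithm, pipelines, and persistence
# combined, per the categories above. The tiny dataset is invented.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()
train = spark.createDataFrame(
    [(1.0, 0.0, 1.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)],
    ["f1", "f2", "label"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),  # featurization
    LogisticRegression(featuresCol="features", labelCol="label"),   # ML algorithm
])
model = pipeline.fit(train)
model.write().overwrite().save("lr_model")  # persistence: reload with PipelineModel.load
```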

Fig. 12. Architecture of the TensorFlow system (client front ends in C++, Python, Java, and other languages over a C API; a distributed runtime with a distributed master, worker services, and dataflow executors; kernel implementations; and a network/device layer spanning RPC, RDMA, GPUs, and CPUs).

Spark MLlib can plug directly into Spark's application programming interfaces (APIs) and interoperate with NumPy in Python and with R libraries. Developers can use any Hadoop data source (e.g., HDFS, HBase, or local files), making it easy to plug into Hadoop workflows (example steps in using MLlib are shown in Fig. 13). Spark uses a new in-memory computation model called RDD, making MLlib run much faster than products based on MapReduce (see Fig. 14). The development group also cares about algorithmic optimization: MLlib's algorithm set uses high-quality algorithms that leverage iteration and can yield better results than one-pass approximations.

Fig. 13. Spark ML architecture (data ingestion into a data store, data transformation, model training and testing, a user model server storing models, and prediction/segmentation results).

3) Conclusion: We have displayed the two mainstream products in the data analysis module. TensorFlow is a typical Python-interfaced, C++-implemented mixture framework; it depends on the cooperation of the front end and the execution system, focusing on providing a high-performance programming library. Spark MLlib is built on the Spark calculating system, transferring common ML algorithms into Spark internal operation chains and taking advantage of this mature system. These are the two main schools of thought in developing data analysis module components.

Fig. 14. Logistic regression running time in Hadoop (about 110 s) and Spark (about 0.9 s).

3.1.3. Data Calculating Module


Big Data is a general term for the technologies designed to collect, organize, and process large datasets that cannot be handled by traditional solutions. With growing demand and popularity, the scale and value of such computing have expanded considerably in recent years.
1) Apache Hadoop
Batch processing has a long history in computing methodology research[52]. Fig. 15 shows the batch processing engine called MapReduce, which performs sequential operations on a large-scale, static, structured or semi-structured dataset and returns the result only when the computation is complete; this makes it well suited to calculations that need access to a complete record set. Due to the benefits of handling large volumes of persistent data, this processing model is frequently used in historical data analysis.
Apache Hadoop[53] is a processing framework designed for distributed batch processing. It was the first Big Data framework and has been greatly improved by the open-source community. Hadoop reimplemented the algorithms and components according to papers and reports from Google, and contains the following parts that work together to process batch data.

a) HDFS: HDFS is a distributed file system module which provides coordinated storage services and data replication across cluster nodes. HDFS is widely used in Big Data frameworks as the storage system due to its efficiency.
b) YARN: YARN is the cluster coordinating component of the Hadoop stack, responsible for coordinating and managing the underlying resources and scheduling the jobs to be run.
c) MapReduce: MapReduce is Hadoop's native batch processing engine. Its processing technique follows the map, shuffle, and reduce algorithm using key-value pairs.

Fig. 15. MapReduce processing engine (a user program forks a master and workers; the master assigns map and reduce tasks; map workers read input splits and write intermediate files to local disks; reduce workers remotely read those files and write the output files).
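The map/shuffle/reduce contract is easy to see in miniature. The toy below imitates the style of Hadoop Streaming, where records arrive on stdin and leave on stdout; in a real job the mapper and reducer would be separate scripts submitted with the hadoop-streaming jar.

```python
# Toy word count in the Hadoop Streaming style: map emits key-value pairs,
# the (simulated) shuffle groups them by key, and reduce aggregates each group.
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: one ("word", 1) pair per token.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive grouped by key after the shuffle/sort.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```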

This methodology tends to be slow because it heavily leverages permanent storage, reading and writing multiple times per task. However, as disk space is one of the most abundant server resources, MapReduce can handle enormous datasets at a relatively low cost. Because it does not encroach on memory, a Hadoop cluster can run on less expensive hardware than some alternatives. With the support of the cluster manager YARN, Hadoop has incredible scalability potential and has already been used in production clusters consisting of thousands of nodes.
2) Apache Storm
Stream processing systems perform computation over data as it enters the system. The main difference between stream processing and batch processing is how operations are defined. Stream processing systems define operations on each item of data, while batch processing systems define operations on an entire dataset. This means stream processors can handle items as they come into the system. The most significant feature of stream processing is its event-based calculation, which does not end until the input explicitly stops. Computation results are immediately available and are continually updated as new data arrives.
Apache Storm[54] is a stream processing framework that focuses on low latency. Apache Storm meets the demands of near real-time processing: It can handle large quantities of data and deliver results with low latency.

Fig. 16 shows a typical cluster structure based on Apache Storm. Apache Storm provides stream processing through topologies, a directed acyclic graph (DAG) based scheduling abstraction. A topology describes the transformations and steps that will be applied to each data item. Topologies are composed of streams, spouts, and bolts. A stream is unbounded data that continuously arrives at the system. Spouts are sources of data streams at the edge of the topology.

Fig. 16. Typical structure of a Storm cluster (Nimbus master nodes, one of which is the leader, coordinate through a ZooKeeper cluster, which also has a leader node; a Storm UI attaches to the masters, and supervisor processes run on the slave nodes).

Moreover, bolts represent processing steps that consume streams, apply an operation to them, and output the result as a new stream. Apache Storm is probably the best solution currently available for near real-time processing. It handles data with extremely low latency for workloads that must be processed with minimal delay, and it is often a good choice when processing time directly affects the user experience[55]. Storm can integrate with Hadoop's YARN cluster manager and the HDFS storage system, making it easy to hook into an existing Hadoop deployment (as shown in Fig. 17).
Apache Storm gives developers the option to use mini-batches instead of pure stream processing over the datasets, providing more flexibility to shape the tool to its intended use; however, doing so tends to negate some of its most significant advantages over other solutions. Moreover, Apache Storm's core modules do not offer ordering guarantees for messages: The processing of each message can be guaranteed, but duplicates may impose a massive extra burden.
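To make the spout/bolt vocabulary concrete, the toy below mimics a topology's dataflow in plain Python. It is not the Storm API (which is Java-based; Python components normally go through Storm's multi-lang protocol or wrappers such as streamparse), only an illustration of streams, spouts, and bolts.

```python
# Toy illustration of Storm's vocabulary: a spout emits an unbounded stream,
# and bolts consume, transform, and re-emit tuples. Dataflow only; not Storm.
import itertools

def sentence_spout():
    # Spout: a stream source at the edge of the topology (endless generator).
    for i in itertools.count():
        yield f"event {i % 3} observed"

def split_bolt(stream):
    # Bolt: consumes a stream, applies an operation, emits a new stream.
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    # Bolt: running totals, continually updated as new tuples arrive.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

topology = count_bolt(split_bolt(sentence_spout()))  # wire spout -> bolt -> bolt
for result in itertools.islice(topology, 6):
    print(result)
```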
3) Apache Spark
In order to fit the different use cases of various application scenarios, some frameworks use plugins and integrated interfaces to handle both batch and stream workloads.
Apache Spark is a batch processing framework combining SQL, streaming, and sophisticated analytics (see Fig. 18). Spark streaming is the component handling stream processing. As for batch processing, Apache Spark focuses on speeding up batch workloads by offering full in-memory computation and processing optimization.
Fig. 17. Usage of Storm in a learning healthcare system (an Apache Storm topology in which a JMS spout feeds business-logic and UIMA NLP bolts whose outputs land in Elasticsearch and HDFS on the Hadoop file system).

Unlike MapReduce, Spark processes all data in-memory and only accesses the storage layer at the beginning and end of the entire process, to initially load the data into memory and to persist the final results. All intermediate results are managed in memory. There are several exceptions, such as cache spills or out-of-memory (OOM) cases, that may force the system to take write operations. Spark uses a model called RDD to implement its in-memory computation. These immutable structures exist in memory and represent collections of data that can be backtracked based on lineage, avoiding access to disk storage each time. Spark can also perform better on disk-related tasks by creating DAGs to represent all the operations to be applied to the data and the relationships between them. The scheduler assigns workloads according to the sequence of the nodes in the DAGs, making the coordination between the processors more intelligent.

Fig. 18. Spark stack (Spark SQL, Spark streaming, MLlib for machine learning, and GraphX for graphs, on top of the Apache Spark core).
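The RDD model described above can be seen in a short PySpark sketch; the numbers are illustrative.

```python
# Minimal sketch: RDDs as immutable, lineage-tracked collections. cache() keeps
# the intermediate result in memory across the two actions below, and the
# recorded lineage allows lost partitions to be recomputed.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")
squares = sc.parallelize(range(1_000_000)) \
            .map(lambda x: x * x) \
            .filter(lambda x: x % 3 == 0)
squares.cache()                  # persist in memory between actions
print(squares.count())           # first action materializes the RDD
print(squares.take(5))           # second action reuses the cached data
print(squares.toDebugString())   # shows this RDD's lineage (its DAG)
```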
Spark streaming, the plugin that buffers the stream in sub-second increments, is the stream processing solution for Apache Spark. Fig. 19 shows the process of splitting the input stream into fixed batches. Spark streaming cannot work as well as native stream processing frameworks, but it does work well in practice.

Fig. 19. Typical usage of Spark streaming (a Kafka input stream is split by the streaming context into batches of input data, cached across two nodes).
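A minimal sketch of the micro-batch model follows, using the DStream API that was current at the time of this survey; the socket source and one-second batch interval are illustrative.

```python
# Minimal sketch: Spark streaming slices an input stream into fixed batches;
# each batch becomes a small RDD job. Feed the socket with e.g. `nc -lk 9999`.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=1)       # 1 s micro-batches

lines = ssc.socketTextStream("localhost", 9999)   # illustrative source
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))  # per-batch word counts
counts.pprint()

ssc.start()
ssc.awaitTermination()
```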

Apache Spark is a universal option for diverse processing workloads. Trading off high memory usage, it provides excellent speed advantages on batch processing workloads and fits well with workloads that value throughput over latency.

4) Conclusion
For batch-only workloads that are not time-sensitive and are of vast scale, Hadoop can be a good choice, as it uses less expensive equipment than other solutions.
For stream-only workloads, Storm has extensive language support and can deliver lower-latency processing.
For mixed workloads, Spark provides high-speed batch processing and micro-batch processing for streaming. It has broad support, integrated libraries and tooling, and flexible integrations.
The options for processing within a Big Data system are plentiful; the best fit depends on the state of the data to process, how time-bound the requirements are, and what kinds of results developers are interested in.

3.2. Data Collecting Layer


Data collecting plays a significant role in the Big Data cycle. The Internet provides almost infinite data on a variety of topics, and the behavior of system users also yields valuable data that helps improve services and discover demands. We introduce some of the most popular products according to the different features of these systems.
1) HDFS
The Hadoop distributed file system (HDFS)[36] stores large files in a streaming data access mode and runs on commodity hardware clusters. It is a file system that stores data across multiple computers in a managed network.
HDFS is the most popular file system built for offline collecting purposes, which means it may not meet demands such as low-latency data access, storage of large numbers of small files, multi-user writes, or arbitrary file modification.
HDFS uses a typical master/slave architecture. As shown in Fig. 20, an HDFS cluster consists of a single NameNode and multiple DataNodes. The NameNode is the master server which stores the file system namespace, executes namespace operations, and determines the mapping of file blocks to DataNodes. As for the DataNodes, they are processes managing the storage attached to the nodes they run on, responsible for serving read and write requests. High availability and disaster backup are implemented through a block replication mechanism: Files inside the storage system are split into blocks, which are stored across a set of DataNodes.

Fig. 20. HDFS system architecture (clients send metadata operations to the NameNode, which holds metadata such as names and replica counts and issues block operations; clients read from and write to DataNodes across racks, where blocks are replicated).
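A small client-side sketch of the NameNode/DataNode division of labor follows. It assumes the community hdfs package talking to the WebHDFS REST gateway; the NameNode URL and paths are illustrative (port 9870 is the WebHDFS default in Hadoop 3).

```python
# Minimal sketch: writing and reading HDFS through WebHDFS. The NameNode
# resolves paths to blocks; the actual bytes stream to/from DataNodes.
# Assumes the community "hdfs" package (pip install hdfs); names illustrative.
from hdfs import InsecureClient

client = InsecureClient("http://example-namenode:9870", user="analyst")

with client.write("/data/raw/events.log", overwrite=True) as writer:
    writer.write(b"first record\n")            # blocks land on DataNodes

with client.read("/data/raw/events.log") as reader:
    print(reader.read())                       # read from replica-holding nodes

print(client.status("/data/raw/events.log"))   # metadata held by the NameNode
```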

2) HBase
HBase[35] is a distributed, column-oriented open source database which implements the BigTable[56] design on top of HDFS. HBase is a subproject of Apache's Hadoop project, and Fig. 21 demonstrates a system built on these components. Different from a general relational database, HBase is suitable for unstructured data storage, and it uses a column-based rather than a row-based model.

Fig. 21. HBase system architecture (a client coordinates through ZooKeeper with the HMaster; HRegionServers each hold an HLog and multiple HRegions, whose stores consist of a MemStore and StoreFiles persisted as HFiles through DFS clients onto Hadoop DataNodes).

A typical HBase cluster consists of an HMaster and several slave nodes running HRegionServers. Each HRegionServer node contains its own log module and multiple HRegions, segments of the whole recorded table divided by keys. The HRegion is also the minimum unit of distributed storage and load balancing.
The cooperation mode inside HBase is also a typical master-slave model consisting of many layers processing different workloads. The HMaster cooperates with ZooKeeper, the cluster managing system, and is responsible for allocating HRegions to each HRegionServer and balancing their loads, discovering invalid HRegionServers and reallocating their regions, garbage collecting on HDFS, and handling requests for schema update operations.
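The column-oriented model is visible from the client side. The sketch below assumes the community happybase package and an HBase Thrift gateway; host, table, and column-family names are illustrative.

```python
# Minimal sketch: HBase cells addressed by row key and "family:qualifier"
# columns, via the happybase Thrift client (pip install happybase).
import happybase

conn = happybase.Connection("example-thrift-gateway")  # hypothetical host
table = conn.table("user_events")

table.put(b"user42", {b"cf:last_login": b"2019-03-01", b"cf:country": b"CN"})
print(table.row(b"user42")[b"cf:country"])

# Row keys are stored sorted, so prefix scans are the natural access path.
for key, data in table.scan(row_prefix=b"user"):
    print(key, data)
```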
3) NoSQL storage system
NoSQL refers to non-relational databases. With the rise of Internet Web 2.0 technology, traditional relational databases cannot fit the ultra-large-scale and high-concurrency requirements that occur at massive data-exchange scales. The NoSQL database was created to solve the challenges brought by the multiple data types of large-scale data collection, especially in Big Data applications. It is hard to classify NoSQL systems, since each kind has its own benefits and focus. In recent years, in-memory databases have developed greatly. This kind of database stores most of its contents in memory instead of on disk or other external storage; LevelDB and RocksDB[57] are the most representative ones in this area. Version control is a main issue in blockchain technology: OrpheusDB[58] provides dataset version control services as its highlight design, while ForkBase[59] implements structurally-invariant reusable indexes (SIRI) as an efficient storage engine for blockchain and fork-able applications.
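As a taste of this family of embedded key-value stores, the sketch below uses the community plyvel binding for LevelDB; the database path and keys are illustrative.

```python
# Minimal sketch: an embedded LevelDB store via plyvel (pip install plyvel).
# Writes go first to an in-memory structure; keys are kept sorted on disk,
# which makes prefix iteration cheap.
import plyvel

db = plyvel.DB("/tmp/example-kv", create_if_missing=True)
db.put(b"user:42:country", b"CN")
print(db.get(b"user:42:country"))

for key, value in db.iterator(prefix=b"user:42:"):
    print(key, value)
db.close()
```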
4) Message queue products
In addition to the above-mentioned offline requirements, real-time acquisition is now an essential component of Big Data platforms. The mainstream architecture uses Flume[60]+Kafka[61] as the data entry point and uses Kafka's customizable message queues to process data in real time. Combining this with stream processing and an in-memory database to form a data aggregation pipeline can achieve real-time collection and analysis. Such architecture schemes are often based on open source systems and have a high degree of reliability. However, since open source systems are often built for a broader range of application scenarios, they may need additional development to meet specific demands, and it is difficult to adjust and improve them in a short time.
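The Kafka entry point of such a pipeline can be sketched with the community kafka-python package; the broker address and topic are illustrative.

```python
# Minimal sketch: producing into and consuming from a Kafka topic that serves
# as the real-time collection entry point (pip install kafka-python).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="example-broker:9092")
producer.send("collected-events", b'{"sensor": 7, "value": 0.93}')
producer.flush()                       # block until the record is acknowledged

consumer = KafkaConsumer(
    "collected-events",
    bootstrap_servers="example-broker:9092",
    auto_offset_reset="earliest",      # start from the beginning of the log
)
for message in consumer:               # a stream processor would sit here
    print(message.value)
    break
```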

5) Other data collecting channels


In addition to the two basic data-collecting demands described above, Internet information collection in the form of enterprise crawler systems and third-party data sources fills out the content of this layer. Most of the data involved in these components needs pretreatment such as indexing and slicing. Projects serving these features, such as Solr[62], Lucene, Nutch, and Elasticsearch (ES), have the primary purpose of converting natural language or document structures into forms more suitable for storing and querying.

4. Platform Optimization Directions


With the explosive growth of data volumes, even Big Data technologies face many challenges. Focusing on the topics of high efficiency and scalability, we summarize some recent research achievements.

4.1. Combination with High-Performance Materials


In recent years, materials technology and heterogeneous computing have developed considerably, and using hybrid architectures to enhance Big Data products is getting more and more attention from industry.
Non-volatile memory materials like Open-Channel solid-state drives (SSDs)[63] and 3D XPoint[64], which are non-volatile and offer higher performance than traditional memory materials, shine in in-memory storage system optimizations[65],[66].
Beyond the new techniques in the storage layer, high-performance computing chips are being built into systems. Solving the performance bottlenecks of traditional databases with dedicated hardware has become a trend[18],[67]. However, current acceleration solutions have the following two shortcomings.
1) Current acceleration schemes are designed for a certain layer, mostly the SQL execution layer, and the co-processor is usually placed between the storage and the host as a filter. Although there have been many attempts to improve OLAP systems with field programmable gate arrays (FPGAs), the design of acceleration solutions is still a challenge for OLTP systems.
2) As chips become smaller and smaller, internal errors are becoming an increasing threat to system reliability. For a single FPGA chip, the probability of an internal error grows greatly over 5 years; the design of fault-tolerance mechanisms is therefore particularly important for large-scale high-availability systems.
In general, these systems enhance performance by taking advantage of the characteristics of new materials. However, the defects, instability, and high price of new materials make such solutions hard to turn into commercial products.

4.2. Internal Process and Algorithm Optimization


Benefiting from the development of network transmission technology, the Internet of Things has also made great progress[68],[69]. Optimizations of processes and algorithms mostly focus on complex-structured data[70], refactoring based on business-logic evaluation[15],[17],[71], or new data structures and algorithms for traditional cases[65],[72]. These optimization solutions provide approaches at a more fundamental level, such as cutting memory cost and delaying data transportation and calculation, which are much more stable and commercially feasible than simply exploiting new materials.

4.3. Scalability Improvements


As this article has sought to explain, the ecosystem of Big Data components is very complex. To provide a single processing pipeline across different environments, Musketeer[14] and Rheem[12],[13] offer solutions for building cross-platform frameworks, helping to increase system scalability. Although these frameworks have provided good cross-platform characteristics over the years, we still have not moved beyond the existing framework structure from 2014 (as shown in Fig. 22)[9]. Besides confirming that the structure we have summarized is correct, we still hope to see a new, expanded framework that achieves improvements by changing the overall structure.

Fig. 22. Stratosphere, a cross-platform framework proposed in 2014 (a Meteor script passes through the Meteor parser and Sopremo's extensible operator packages for warehousing, information extraction, and integration; the PACT program assembler and a logical and physical cost-based optimizer handle operator order, local execution, and data exchange; PACT runtime operators implement external hashing, sorting, and data exchange strategies; the Nephele execution engine provides task scheduling, job/task monitoring, network and memory management, fault tolerance, and I/O services; resource management with multi-tenancy runs over Amazon EC2 and Apache YARN; and persistent distributed data storage spans HDFS, Amazon S3, and other data sources).

5. Conclusion
The modern Big Data platform architecture has the following difficulties in reaching a unified structure: In order to cover different scenarios, more components are adopted, which is the first difficulty; in order to support as many application scenarios as possible, each product is often designed very broadly and consistently requires additional calculation models to adapt to new needs, which is the second difficulty; and on top of the first two, the computing components learn from each other, infiltrate each other, and invoke each other cyclically, making it almost impossible to stratify the components directly.
Due to this complexity, it is hard to give a fixed standard for dividing Big Data platform architectures. Though their architectures are much alike, differing application environments and scenario requirements, frequent updates of the Big Data systems themselves, and the staggered classification of applications mean there has never been a framework suitable even for common cases, and no unified conclusion has been reached. In this paper, by providing a detailed review of the most commonly used components, we try to inspire a common protocol that can aggregate all these processing flows, just like the REST APIs in microservices[73].

References
[1] A. Sheth, “Transforming big data into smart data: Deriving value via harnessing volume, variety, and velocity using
semantic techniques and technologies,” in Proc. of IEEE 30th Intl. Conf. on Data Engineering, 2014, p. 2.
[2] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the
ACM, vol. 51, no. 1, pp. 107-113, 2008.
[3] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in
Proc. of the 2nd USENIX Conf. on Hot Topics in Cloud Computing, 2010, pp. 1-7.
[4] A. E. W. Johnson, M. M. Ghassemi, S. Nemati, K. E. Niehaus, D. A. Clifton, and G. D. Clifford, “Machine learning
and decision support in critical care,” Proc. of the IEEE, vol. 104, no. 2, pp. 444-466, 2016.
[5] F.-H. Guan, D.-M. Zhao, X. Zhang, B.-T. Shan, and Z. Liu, “Study on the intelligent decision support system for
power grid dispatching,” in Proc. of the Intl. Conf. on Sustainable Power Generation and Supply, 2009, pp. 1-4.
[6] X. Wang, W. Dou, Z. Ma, et al., “I-SI: Scalable architecture for analyzing latent topical-level information from social
media data,” Computer Graphics Forum, vol. 31, no. 3, pp. 1275-1284, 2012.
[7] S.-L. He, J.-M. Zhu, P.-J. He, and M. R. Lyu, “Experience report: System log analysis for anomaly detection,” in Proc.
of IEEE 27th Intl. Symposium on Software Reliability Engineering, 2016, pp. 207-218.
[8] J. McHugh, P. E. Cuddihy, J. W. Williams, K. S. Aggour, V. S. Kumar, and V. Mulwad, “Integrated access to big data
polystores through a knowledge-driven framework,” in Proc. of IEEE Intl. Conf. on Big Data, 2017, pp. 1494-1503.
[9] A. Alexandrov, R. Bergmann, S. Ewen, et al., “The stratosphere platform for big data analytics,” The VLDB Journal,
vol. 23, no. 6, pp. 939-964, 2014.
[10] F. Rahman, M. Slepian, and A. Mitra, “A novel big-data processing framework for healthcare applications: Big-data-
healthcare-in-a-box,” in Proc. of IEEE Intl. Conf. on Big Data, 2016, pp. 3548-3555.
[11] C. Roy, S. S. Rautaray, and M. Pandey, “Big data optimization techniques: A survey,” Intl. Journal of Information
Engineering and Electronic Business, vol. 10, no. 4, pp. 41-48, 2018.
[12] D. Agrawal, S. Chawla, B. Contreras-Rojas, et al., “RHEEM: Enabling cross-platform data processing: May the big
data be with you!” Proc. of the VLDB Endowment, vol. 11, no. 11, pp. 1414-1427, 2018.
[13] D. Agrawal, S. Chawla, A. K. Elmagarmid, et al., “Road to freedom in big data analytics,” in Proc. of the 19th Intl.
Conf. on Extending Database Technology, 2016, pp. 479-484.
[14] I. Gog, M. Schwarzkopf, N. Crooks, M. P. Grosvenor, A. Clement, and S. Hand, “Musketeer: All for one, one for all in
data processing systems,” in Proc. of the 10th European Conf. on Computer Systems, 2015, DOI:
10.1145/2741948.2741968
[15] P. Nikitopoulos, A. Vlachou, C. Doulkeridis, and G. A. Vouros, “DiSTrDF: Distributed spatio-temporal RDF queries on
Spark,” in Proc. of Workshops of the EDBT/ICDT 2018 Joint Conf., 2018, pp. 125-132.
[16] L.-Y. Lu, T. S. Pillai, H. Gopalakrishnan, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, “Wisckey: Separating
keys from values in SSD-conscious storage,” ACM Trans. on Storage (TOS), vol. 13, no. 1, pp. 5:1-28, 2017.
[17] R. Barber, C. Garcia-Arellano, R. Grosman, et al. (2017). Evolving databases for new-gen big data applications.
[Online]. Available: https://pdfs.semanticscholar.org/ebc5/2e776b09cf02b063f212a765a0952dc0eff1.pdf
[18] R. D. Chamberlain, M. A. Franklin, R. S. Indeck, and R. K. Cytron, “Intelligent data storage and processing using
FPGA devices,” U.S. Patent 15 388 498, April 13, 2017.
[19] M. Zaharia, R.-S. Xin, P. Wendell, et al., “Apache Spark: A unified analytics engine for large-scale data processing,” Communications of the ACM, vol. 59, no. 11, 2016, DOI: 10.1145/2934664
[20] M. Abadi, P. Barham, J.-M. Chen, et al., “TensorFlow: A system for large-scale machine learning,” in Proc. of the
12th USENIX Conf. on Operating Systems Design and Implementation, 2016, pp. 265-283.
[21] B. C. Ooi, K.-L. Tan, S. Wang, et al., “SINGA: A distributed deep learning platform,” in Proc. of the 23rd ACM Intl.
Conf. on Multimedia, 2015, pp. 685-688.
[22] X.-R. Meng, J. Bradley, B. Yavuz, et al., “MLlib: Machine learning in Apache Spark,” The Journal of Machine
Learning Research, vol. 17, no. 1, pp. 1235-1241, 2016.
[23] B. Feldbaum, “Method and system for access control of a message queue,” U.S. Patent 6 446 206, September 3,
2002.
[24] Oracle. (2015). Oracle GoldenGate for big data 12c. [Online]. Available: https://www.oracle.com/middleware/data-integration/goldengate/big-data/index.html
[25] A. Prahlad and J. Schwartz, “Systems and methods for performing storage operations using network attached
storage,” U.S. Patent 7 546 324, June 9, 2009.
[26] Tableau Inc. (2015). Transform your business with Ask Data: The future of analytics starts now. [Online]. Available: https://www.tableau.com/
[27] Amazon Web Services (AWS). (2018). Cloud computing services. [Online]. Available: https://aws.amazon.com/?nc1=hls
[28] Tencent Cloud. (2018). Get more with Tencent Cloud. [Online]. Available: https://intl.cloud.tencent.com/
[29] Presto. (2018). Distributed SQL query engine for big data. [Online]. Available: https://prestodb.io/
[30] M. Hausenblas and J. Nadeau, “Apache Drill: Interactive ad-hoc analysis at scale,” Big Data, vol. 1, no. 2, pp. 100-104, 2013.
[31] S. Melnik, A. Gubarev, J.-J. Long, et al., “Dremel: Interactive analysis of Web-scale datasets,” Proc. of the VLDB
Endowment, vol. 3, no. 1-2, pp. 330-339, 2010.
[32] M. Kornacker and J. Erickson. (2012). Cloudera Impala: Real-time queries in Apache Hadoop, for real. [Online]. Available: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
[33] Apache Organization. (2018). Impala is the open source, native analytic database for Apache Hadoop. Impala is
shipped by Cloudera, MapR, Oracle, and Amazon. [Online]. Available: https://impala.apache.org/index.html
[34] D. Borthakur, “HDFS architecture guide,” Hadoop Apache Project, vol. 53, pp. 1-13, 2008.
[35] A. Thusoo, Z. Shao, S. Anthony, et al., “Data warehousing and analytics infrastructure at Facebook,” in Proc. of the
ACM SIGMOD Intl. Conf. on Management of Data, 2010, pp. 1013-1020.
[36] Facebook. (2014). Scribe: Scribe is a server for aggregating log data streamed in real time from a large number of
servers. [Online]. Available: https://github.com/facebookarchive/scribe
[37] K. Chodorow, MongoDB: The Definitive Guide: Powerful and Scalable Data Storage, 2nd ed. Beijing: O’Reilly Media,
2013.
[38] M. R. Palankar, A. Iamnitchi, M. Ripeanu, and S. Garfinkel, “Amazon s3 for science grids: A viable solution?” in Proc.
of the Intl. Workshop on Data-aware Distributed Computing, 2008, pp. 55-64.
[39] J. Dean, “Challenges in building large-scale information retrieval systems: Invited talk,” in Proc. of the 2nd ACM Intl.
Conf. on Web Search and Data Mining, 2009, p. 1.
[40] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig Latin: A not-so-foreign language for data
processing,” in Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 2008, pp. 1099-1110.
[41] A. Thusoo, J. S. Sarma, N. Jain, et al., “Hive: A warehousing solution over a map-reduce framework,” Proc. of the
VLDB Endowment, vol. 2, no. 2, pp. 1626-1629, 2009.
[42] H.-K. Chen, F.-Z. Wang, and N. Helian, “Entropy4cloud: Using entropy-based complexity to optimize cloud service
resource management,” IEEE Trans. on Emerging Topics in Computational Intelligence, vol. 2, no. 1, pp. 13-24,
2018.
[43] N. Elgendy and A. Elragal, “Big data analytics: A literature review paper,” in Proc. of Industrial Conf. on Data Mining,
2014, pp. 214-227.
[44] K. K. Das, E. Fratkin, A. Gorajek, K. Stathatos, and M. Gajjar, “Massively parallel in-database predictions using
PMML,” in Proc. of Workshop on Predictive Markup Language Modeling, 2011, pp. 22-27.
[45] M. Armbrust, R.-S. Xin, C. Lian, et al., “Spark SQL: Relational data processing in Spark,” in Proc. of ACM SIGMOD
Intl. Conf. on Management of Data, 2015, pp. 1383-1394.
[46] S. V. Ranawade, S. Navale, and A. Dhamal. (2016). Analytical processing on Hadoop using Apache Kylin. [Online]. Available: http://www.ijais.org/archives/volume12/number2/ranawade-2017-ijais-451682.pdf
[47] Apache Kylin™. (2018). Apache Kylin™ home. [Online]. Available: http://kylin.apache.org/
[48] TensorFlow. (2018). Introduction to TensorFlow. TensorFlow makes it easy for beginners and experts to create
machine learning models for desktop, mobile, Web, and cloud. [Online]. Available: https://www.tensorflow.org/learn
[49] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Upper
Saddle River: Addison-Wesley Professional, 2010.
[50] CUDA Zone. (2018). NVIDIA developer. [Online]. Available: https://developer.nvidia.com/cuda-zone
[51] Apache Spark. (2018). MLlib: Main guide—Spark 2.4.0 documentation. [Online]. Available: https://spark.apache.org/docs/latest/ml-guide.html
[52] J.-H. Chen and K.-C. Liu, “On-line batch process monitoring using dynamic PCA and dynamic PLS models,”
Chemical Engineering Science, vol. 57, no. 1, pp. 63-75, 2002.
[53] Apache™ Hadoop. (2018). Apache™ Hadoop®. [Online]. Available: http://hadoop.apache.org/
[54] Apache Storm. (2018). Apache Storm. [Online]. Available: http://storm.apache.org/
[55] V. C. Kaggal, R. K. Elayavilli, S. Mehrabi, et al., “Toward a learning health-care system—knowledge delivery at the
point of care empowered by big data and NLP,” Biomedical Informatics Insights, vol. 8, no. S1, pp. 13-22, 2016.
[56] F. Chang, J. Dean, S. Ghemawat, et al., “Bigtable: A distributed storage system for structured data,” ACM Trans. on
Computer Systems (TOCS), vol. 26, no. 2, p. 4, 2008.
[57] RocksDB. (2018). A persistent key-value store for fast storage environments. [Online]. Available: https://rocksdb.org/
[58] L.-Q. Xu, S.-L. Huang, S.-L. Hui, A. J. Elmore, and A. Parameswaran, “OrpheusDB: A lightweight approach to relational dataset versioning,” in Proc. of ACM Intl. Conf. on Management of Data, 2017, pp. 1655-1658.
[59] S. Wang, T. T. A. Dinh, Q. Lin, et al. (2018). ForkBase: An efficient storage engine for blockchain and forkable applications. [Online]. Available: http://dl.acm.org/citation.cfm?id=3242934
[60] Apache Flume. (2018). Welcome to Apache Flume. [Online]. Available: https://flume.apache.org/
[61] Apache Kafka. (2018). Apache Kafka® is a distributed streaming platform. What exactly does that mean? [Online].
Available: https://kafka.apache.org/
[62] Apache Solr. (2018). Solr is the popular, blazing-fast, open source enterprise search platform built on Apache
Lucene™. [Online]. Available: http://lucene.apache.org/solr/
[63] M. Bjørling. (2018). Open-channel solid state drives. [Online]. Available: https://openchannelssd.readthedocs.io/en/latest/
[64] Micron. (2018). 3D XPoint™ technology. [Online]. Available: https://www.micron.com/products/advanced-solutions/
3d-xpoint-technology
[65] S.-M. Wu, K.-H. Lin, and L.-P. Chang, “KVSSD: Close integration of LSM trees and flash translation layer for write-
efficient KV store,” in Proc. of Design, Automation & Test in Europe Conf. & Exhibition, 2018, pp. 563-568.
[66] J. Xu and S. Swanson, “Nova: A log-structured file system for hybrid volatile/non-volatile main memories,” in Proc. of
the 14th USENIX Conf. on File and Storage Technologies, 2016, pp. 323-338.
[67] J. Cong, Z.-M. Fang, M.-H. Huang, L.-B. Wang, and D. Wu, “CPU-FPGA coscheduling for big data applications,”
IEEE Design & Test, vol. 35, no. 1, pp. 16-22, 2018.
[68] F. Firouzi, A. M. Rahmani, K. Mankodiya, et al., “Internet-of-things and big data for smarter healthcare: From device to architecture, applications and analytics,” Future Generation Computer Systems, vol. 78, pp. 583-586, 2018.
[69] M. Elhoseny, A. Abdelaziz, A. S. Salama, A. M. Riad, K. Muhammad, and A. K. Sangaiah, “A hybrid model of
Internet of things and cloud computing to manage big data in health services applications,” Future Generation
Computer Systems, vol. 86, pp. 1383-1394, 2018.
[70] H. Vo, Y.-H. Liang, J. Kong, and F.-S. Wang, “iSPEED: A scalable and distributed in-memory based spatial query
system for large and structurally complex 3D data,” Proc. of the VLDB Endowment, vol. 11, no. 12, pp. 2078-2081,
2018.
[71] B. Salimi, C. Cole, P. Li, J. Gehrke, and D. Suciu, “HypDB: A demonstration of detecting, explaining and resolving
bias in OLAP queries,” Proc. of the VLDB Endowment, vol. 11, no. 12, pp. 2062-2065, 2018.
[72] E. Bortnikov, A. Braginsky, E. Hillel, I. Keidar, and G. Sheffi, “Accordion: Better memory organization for LSM key-
value stores,” Proc. of the VLDB Endowment, vol. 11, no. 12, pp. 1863-1875, 2018.
[73] I. Nadareishvili, R. Mitra, M. McLarty, et al., Microservice Architecture: Aligning Principles, Practices, and Culture, Sebastopol: O’Reilly Media, Inc., 2016.
Jing-Huan Yu was born in Jiangxi Province, China in 1996. He received the B.S. degree from
University of Electronic Science and Technology of China, Chengdu, China in 2018. He is currently
pursuing the Ph.D. degree with City University of Hong Kong, Hong Kong, China. His research
interests include data mining, database engine, and non-volatile memory.
Zi-Meng Zhou received the M.E. degree from the Department of Computer Science and
Technology, Shandong University, Jinan, China in 2015. He is currently pursuing the Ph.D. degree
with the Department of Computer Science, City University of Hong Kong. His research interests
include non-volatile memories, embedded systems, and data-intensive computing.