Abstract—With the growth of distributed computing systems, modern Big Data analysis platforms often have diversified characteristics, which makes it hard for users to make decisions when they first encounter Big Data platforms. In this paper, we discuss the design principles and research directions of modern Big Data platforms by presenting research on modern Big Data products. We provide a detailed review and comparison of several state-of-the-art frameworks and distill them into a typical structure with five horizontal layers and one vertical module. Following this structure, the paper presents the components and modern optimization technologies developed for Big Data, which helps readers choose the most suitable components and architecture from the wide range of Big Data technologies according to their requirements.
Index Terms—Big Data, cloud computing, data analysis, optimization, system architecture.
1. Introduction
Big Data refers to collections of data that cannot be captured, managed, and processed by traditional software tools within an acceptable time. Big Data technologies refer to the ability to quickly obtain valuable information from various types of data[1]. To keep pace with the need to store and analyze the daily growth of complex data, a significant number of analytical platforms are available for analyzing complex structured and unstructured data, each designed to handle specific types of data and workloads[2],[3].
This article describes the common features in the development of Big Data platforms according to recent research. It also lists widely used components and applications of Big Data while exploring those products in use[4]-[10].
*Corresponding author
Manuscript received 2018-09-26; revised 2018-11-05.
This work was supported by the Research Fund of Tencent Computer System Co. Ltd. under Grant No. 170125.
J.-H. Yu and Z.-M. Zhou are with the Department of Computer Science, City University of Hong Kong, Hong Kong 999077, China
(e-mail: jinghuanyu2-c@my.cityu.edu.hk; zmzhou4-c@my.cityu.edu.hk).
Color versions of one or more of the figures in this paper are available online at http://www.journal.uestc.edu.cn.
Publishing editor: Xin Huang
Copyright © 2019 University of Electronic Science and Technology of China. Publishing Services provided by Elsevier B.V. on behalf of KeAi.
This is an open access article under the CC BY-NC-ND License (http://creativecommons.org/licenses/by-nc-nd/4.0/ ).
With large amounts of data emanating from various digital sources, the importance of data analytics has grown tremendously over the years. Many industries have been propelled by Big Data; the most representative ones are as follows.
1) Big Data contributions to public healthcare
Healthcare has always been a social concern, and it benefits from the rise of Big Data. By keeping track of all patients' histories, Big Data can significantly help physicians conduct long-term experimental projects. Massive data storage technology ensures that once a patient receives treatment, her or his data will be safely stored in the data center.
2) Big Data contributions to public sectors
In public sectors, Big Data techniques enable wide-ranging searches across various public sector units and allied agencies. By providing an extensive range of facilities and pipelines for data collection, Big Data can build wide-area monitoring systems covering power grid maintenance, fraud detection, financial promotion investigation, environmental protection, etc.
3) Big Data contributions to education
Teaching students according to their aptitudes is a key concept in education, but the shortage of educational resources and large student populations make this concept rarely realized in practice. Data-driven classrooms provide a solution that goes far beyond coursework reformation and grading improvement. By collecting various online resources and offering customized intelligent recommendation services for both students and teachers, these applications can provide adaptive learning solutions and problem-tracking services for educators.
4) Big Data contributions to banking zones and fraud detection
To some extent, Big Data techniques were originally designed for this area. With the rise of anomaly detection, the usage of Big Data has expanded widely. Detecting the misuse of credit or debit cards, tracking audit trails, and managing venture credit risk are basic uses of Big Data technologies.
2. Overview of the Structure
Fig. 1 shows the partitioning method proposed in this paper, consisting of four main layers and six different modules. Note that the figure does not imply dependency relationships between the layers. The criteria for this classification are based on the primary use of each layer. Components that play several roles in different systems are mentioned only for their most representative function, and the layer a component is assigned to reflects its typical use case.
Fig. 1. The proposed platform structure: an application layer (integrated access, decision support, intelligent recommendation, deceit recognition, and inspection tracks); a management layer; a data calculating module (batch, hybrid, and stream processing components); a data storage module (structured, semi-structured, and unstructured data storage); a data collecting layer (offline, real-time, Internet resources, and third-party data collecting); and a vertical maintenance management module (system monitoring and alert detecting).
The management layer comprises application management and system management tools. From the perspective of application management, this layer is responsible for accessing and monitoring various component instances and for scheduling and configuring external users and systems[29],[30]. The system management part includes deploying the developed applications, assigning internal resources and authorities, and visual operation and maintenance. The most representative product may be Amazon Web Services (AWS)[29], which provides the standard for cloud computing and virtual cluster management products.
3. Representative Components
In this section, we give a detailed review of the components in the data processing layer and the data collecting layer. We introduce several typical products in use for each layer and then compare them in terms of how well they fit the layer's features and their core algorithms or techniques.
a) Large-scale: It is possible to form a cluster containing thousands of nodes and process data at the petabyte (PB) level.
b) Nested data models: Internet data is often non-relational. Dremel supports a nested data model such as the JavaScript object notation (JSON) format (a conceptual sketch appears after this list). The implementation of Drill is also schema-free and can run on HDFS[37], Not Only SQL (NoSQL)[40] stores, and cloud storage systems[41].
c) Based on concepts from Internet search and parallel database management systems (DBMSs): The architecture of Dremel and Drill draws on the service-tree concept used in distributed search engines[39]. Unlike Pig[42] and Hive[43], they perform queries directly instead of translating them into MapReduce (MR) tasks.
Because of these characteristics, Dremel reduces the minute-level overhead of MapReduce data processing to the second level, as Figs. 3 and 4 show[33],[44].
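To make the nested data model concrete, the following is a conceptual Python sketch of striping JSON-like records into per-field columns, the idea underlying Dremel-style columnar layouts; the records and field paths are invented for illustration, and Dremel's repetition and definition levels are deliberately omitted.

```python
# Conceptual sketch only: stripe nested (JSON-like) records into per-field columns.
# Records and field paths are illustrative; repetition/definition levels are omitted.
import json

records = [
    {"user": {"id": 1, "tags": ["a", "b"]}, "ts": 100},
    {"user": {"id": 2, "tags": []}, "ts": 101},
]

def stripe(recs, path):
    """Collect the values found at a dotted path into one column."""
    column = []
    for rec in recs:
        node = rec
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
        column.append(node)
    return column

columns = {p: stripe(records, p) for p in ("user.id", "user.tags", "ts")}
print(json.dumps(columns, indent=2))
# {"user.id": [1, 2], "user.tags": [["a", "b"], []], "ts": [100, 101]}
```

Storing each field path as its own column is what lets a columnar engine scan only the fields a query touches instead of whole records.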
Fig. 3. Execution time of MapReduce on records, MapReduce on columns, and Dremel. Fig. 4. Speed comparison of modern cloud analysis engines: MapReduce (2004, about 10 min), Hive (2009), Dremel and Spark (2010), and Impala (2012), down to roughly 100 ms.
Dremel uses a multi-level service tree to execute queries. [Figure residue: the Dremel client and query execution tree over MapReduce and HDFS; the Spark SQL plan pipeline (SQL query → unresolved logical plan → logical plan → optimized logical plan → physical plans → selected physical plan → RDDs, with a catalog and the DataFrame API); and the Apache Kylin architecture (REST server, query engine, routing, and metadata, with an OLAP cube on HBase and a cube build engine over Hadoop, Hive, Kafka, and RDBMS sources, serving queries over star-schema and key-value data at minute- to second-level latency).]
Hadoop and Apache Spark both build a whole environment of development suites and provide a powerful, fully functional interface set.
Unlike the other modules in this layer, the products here are more demand-oriented. It is hard to list the frameworks developed for every purpose, so we only mention the two most popular frameworks in machine learning (ML).
1) TensorFlow: For those who have heard of deep learning but have not explored it deeply, TensorFlow may be their favorite deep learning framework. Beyond its usage as a deep learning framework, its official definition is an open source software library for machine intelligence, so it is better to consider TensorFlow an open source library for numerical computation using data flow graphs[21]. TensorFlow[48] is not just a deep learning framework but a graph compiler. It is a relatively low-level programming framework that supports Python and C++, allows computation to be distributed across CPUs and GPUs, and even supports horizontal scaling with Google's remote procedure call framework (gRPC) (see Fig. 12). Most of its kernel code is written in highly optimized C++ and CUDA (NVIDIA's language for programming GPUs)[49], building on Eigen (a high-performance C++ and CUDA library) and NVIDIA's cuDNN (a deep neural network (DNN)[50] library dedicated to NVIDIA GPUs for convolution and other functions)[51],[52].
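As a small illustration of the data flow graph model described above, the following is a minimal sketch using the TensorFlow 1.x-style graph API; the tensor shapes and names are illustrative only.

```python
# Minimal sketch of TensorFlow's dataflow-graph model (1.x-style API);
# shapes and names are illustrative only.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Building the graph only declares operations; nothing is computed yet.
x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
w = tf.Variable(tf.random_normal([3, 1]), name="w")
y = tf.matmul(x, w, name="y")

# The graph is executed by a session, which can place ops on CPUs or GPUs.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```

Separating graph construction from execution is what allows the same graph to be placed on different devices or distributed across machines.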
2) Spark MLlib: MLlib[51] is Spark's ML library. It provides a set of algorithms and frameworks for building a practical, scalable, and easy-to-use ML development environment. Several tools are provided, such as[53]:
a) ML algorithms: The common learning algorithms such as classification, regression, clustering, and
collaborative filtering.
b) Featurization: Feature extraction, transformation, dimensionality reduction, and selection.
c) Pipelines: Tools for constructing, evaluating, and tuning ML pipelines.
d) Persistence: Saving and loading algorithms, models, and pipelines.
e) Utilities: Linear algebra, statistics, and data handling.
Fig. 12. TensorFlow architecture: client front ends in C++, Python, Java, and other languages over a C API, the distributed runtime, the execution system, and kernel implementations.
Spark MLlib can directly access Spark's application programming interfaces (APIs) and interoperate with NumPy in Python and with R libraries. Developers can use any Hadoop data source (e.g., HDFS, HBase, or local files), making it easy to plug into Hadoop workflows (example steps in using MLlib are shown in Fig. 13). Spark uses an in-memory computation model called the resilient distributed dataset (RDD), making MLlib run much faster than
products based on MapReduce (see Fig. 14). The development group also cares about algorithmic optimization: MLlib leverages high-quality iterative algorithms that can yield better results than one-pass approximations.
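The snippet below is a hypothetical PySpark sketch of such a pipeline (ingestion, featurization, an iterative algorithm, and persistence); the file paths and column names are invented for illustration.

```python
# Hypothetical MLlib pipeline sketch (PySpark); paths and column names are invented.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Ingestion: any Hadoop data source (HDFS, HBase, local files) can be used.
df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

# Featurization plus an iterative learning algorithm, combined in one pipeline.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
model = Pipeline(stages=[assembler, lr]).fit(df)

# Persistence: save the fitted pipeline model for later reuse.
model.write().overwrite().save("hdfs:///models/lr")
```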
Fig. 13. Example steps in using MLlib: data ingestion into a data store, data transformation, model training and storage, and prediction/segmentation results.
a) HDFS: HDFS is a distributed file system module which provides coordinated storage services and data replication across cluster nodes. HDFS is widely used as the storage system in Big Data frameworks due to its efficiency.
b) YARN: YARN is the cluster coordinating component of the Hadoop stack, responsible for coordinating and managing the underlying resources and scheduling the jobs to be run.
c) MapReduce: MapReduce is Hadoop's native batch processing engine. Its processing technique follows the map, shuffle, and reduce algorithm using key-value pairs (a minimal sketch of this flow follows the list).
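The following is a minimal pure-Python sketch of that map, shuffle, and reduce flow over key-value pairs; word count is the illustrative job, and this is not Hadoop's actual Java API.

```python
# Conceptual map-shuffle-reduce sketch in plain Python (not Hadoop's Java API);
# the word-count job is illustrative only.
from collections import defaultdict

def map_phase(document):
    # Map: emit (key, value) pairs for each input record.
    for word in document.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data platforms", "big data systems"]
pairs = (kv for doc in docs for kv in map_phase(doc))
print(reduce_phase(shuffle(pairs)))   # {'big': 2, 'data': 2, 'platforms': 1, 'systems': 1}
```

In Hadoop, the intermediate pairs are written to disk between each phase, which is what the next paragraph refers to.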
[Figure: MapReduce execution overview. The user program forks a master and worker processes; the master assigns map and reduce tasks; map workers read input splits and write intermediate data, which reduce workers read remotely and aggregate into output files.]
This methodology tends to be slow because it leverages permanent storage heavily, reading and writing multiple times per task. However, as disk space is one of the most abundant server resources, MapReduce can handle enormous datasets at a relatively low cost. Because it does not rely heavily on memory, a Hadoop cluster can run on less expensive hardware than some alternatives. With the support of the cluster manager YARN, Hadoop has incredible scalability potential and has already been used in production clusters consisting of thousands of nodes.
2) Apache Storm
Stream processing systems do computation work over data as it enters the system. The main difference between stream processing and batch processing lies in how operations are defined: Stream processing systems define operations on each item of data, while batch processing systems define operations on an entire dataset. This allows stream processors to handle items as they come into the system. The most significant feature of stream processing is its event-based computation, which does not end until the input explicitly stops. Computation results are immediately available and continually updated as new data arrives.
Apache Storm[54] is a stream processing framework that focuses on low latency. Apache Storm meets the demands of near real-time processing: It can handle large quantities of data and deliver results with low latency.
Fig. 16 shows a typical structure based on Apache Storm. Apache Storm provides stream processing through topologies, a directed acyclic graph (DAG) based scheduling scheme. These topologies describe the transformations and steps that will be applied to each data item. Topologies are composed of streams, spouts, and bolts. A stream is unbounded data that continuously arrives at the system. Spouts are sources of data streams at the edge of the topology.
[Fig. 16: a Storm cluster, with a Nimbus master node and worker nodes coordinated through ZooKeeper, and the Storm UI for monitoring.]
Bolts represent processing steps that consume streams, apply an operation to them, and output the results as new streams. Apache Storm is probably the best solution currently available for near real-time processing. It handles data with extremely low latency for workloads that must be processed with minimal delay. Apache Storm is often a good choice when processing time directly affects the user experience[55]. Storm can integrate with Hadoop's YARN cluster manager and the HDFS storage system, making it easy to hook into an existing Hadoop deployment (as shown in Fig. 17).
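To illustrate the spout/bolt topology model, the following is a conceptual Python sketch in which generators stand in for stream components; this is an illustration of the model, not Storm's actual Java API, and all names are invented.

```python
# Conceptual sketch of a spout -> bolt -> bolt topology using Python generators;
# an illustration of the model only, not Storm's Java/Trident API.
import itertools
import time

def spout():
    """Source at the edge of the topology: emits an unbounded stream of tuples."""
    n = 0
    while True:
        yield {"event_id": n, "ts": time.time()}
        n += 1

def filter_bolt(stream):
    """A bolt consumes a stream, applies an operation, and emits a new stream."""
    for item in stream:
        if item["event_id"] % 2 == 0:
            yield item

def sink_bolt(stream):
    """Terminal bolt: results are available as soon as each item is processed."""
    for item in stream:
        print("processed", item["event_id"])
        yield item

# Wire the DAG and pull a few items; a real topology runs until explicitly stopped.
list(itertools.islice(sink_bolt(filter_bolt(spout())), 5))
```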
Apache Storm gives developers the option to use mini-batches instead of pure stream processing over the datasets, providing more flexibility to shape the tool to the intended use. However, doing so tends to negate some of its most significant advantages over other solutions. Moreover, Storm's core modules do not offer ordering guarantees for messages: Processing of each message can be guaranteed, but duplicates may impose a massive extra burden.
3) Apache Spark
To fit different use cases and application scenarios, some frameworks use plugins and integrated interfaces to handle both batch and stream workloads.
Apache Spark is a batch processing framework that combines SQL, streaming, and sophisticated analytics (see Fig. 18). Spark Streaming is the component handling stream processing. For batch processing, Apache Spark focuses on speeding up workloads by offering full in-memory computation and processing optimization.
Unlike MapReduce, Spark processes all data in memory and only interacts with the storage layer to load the data initially and to persist the final results.
[Figure residue: an example Spark deployment combining Elasticsearch, business logic, UIMA NLP, and HDFS, with data cached across two nodes.]
Apache Spark is a universal option for diverse processing workloads. Trading off high memory usage, it provides excellent speed advantages for batch processing workloads and fits well for workloads that value throughput over latency.
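A minimal PySpark sketch of this in-memory model is given below; the log path and filter condition are illustrative only.

```python
# Minimal sketch of Spark's in-memory RDD model (PySpark); the path and the
# filter condition are illustrative only.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# Load from the storage layer once, then keep the working set in memory.
lines = sc.textFile("hdfs:///logs/access.log").cache()
errors = lines.filter(lambda line: "ERROR" in line)

print(errors.count())   # the first action reads HDFS and caches `lines` in memory
print(errors.take(5))   # later actions recompute `errors` from the cached partitions

sc.stop()
```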
4) Conclusion
For batch-only workloads that are not time-sensitive and operate on a vast scale, Hadoop can be a good choice, using less expensive equipment than other solutions.
For stream-only workloads, Storm has extensive language support and can deliver lower-latency processing.
For mixed workloads, Spark provides high-speed batch processing and micro-batch processing for streaming. It has broad support, integrated libraries and tooling, and flexible integrations.
Options for processing within a Big Data system are plentiful; the best fit for a given situation depends on the state of the data to process, how time-bound the requirements are, and what kind of results developers are interested in.
[Figure: HDFS architecture. A NameNode holds the metadata (name, replicas, e.g., /home/foo/data, 3); clients issue metadata operations and writes; data blocks are replicated across racks 1 and 2.]
2) HBase
HBase[35] is a distributed, column-oriented open source database that implements the BigTable[56] model on top of HDFS. HBase is a subproject of Apache's Hadoop project, and Fig. 21 demonstrates a system built on these components. Different from a general relational database, HBase is suitable for unstructured data storage and uses a column-based rather than a row-based model.
[Figure: HBase internal structure. Each HRegionServer holds an HLog and several HRegions; each HRegion's StoreFiles are persisted as HFiles through DFS clients on Hadoop.]
A typical HBase cluster consists of an HMaster and several slave nodes running HRegionServer. Each HRegionServer node contains its own log module and multiple HRegions, segments of the whole table partitioned by keys. The HRegion is also the minimum unit of distributed storage and load balancing.
The cooperation model inside HBase is a typical master-slave model consisting of several layers processing different workloads. HMaster cooperates with ZooKeeper, the cluster coordination system, and is responsible for allocating HRegions to HRegionServers and balancing their loads, discovering invalid HRegionServers and reallocating their regions, garbage collecting on HDFS, and handling requests for schema update operations.
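As a small client-side illustration of HBase's column-oriented, key-addressed model, the following is a hypothetical sketch using the happybase Thrift client; the host, table, and column names are invented.

```python
# Hypothetical HBase access sketch via the happybase Thrift client;
# host, table name, and column families are invented for illustration.
import happybase

conn = happybase.Connection(host="hbase-thrift.example.org")
table = conn.table("patients")

# Rows are addressed by key; cells live under column-family:qualifier names.
table.put(b"patient#0001", {b"info:name": b"Alice", b"vitals:hr": b"72"})
print(table.row(b"patient#0001"))

conn.close()
```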
3) NoSQL storage system
NoSQL refers to non-relational databases. With the rise of Internet Web 2.0 technology, traditional relational databases cannot meet the ultra-large-scale and high-concurrency requirements that occur in massive data exchange. The NoSQL database was created to solve the challenges brought by the multiple data types of large-scale data collection, especially for Big Data applications. It is hard to classify NoSQL systems, since each kind has its own benefits and focus. In recent years, in-memory databases have developed rapidly. This kind of database stores most of its contents in memory instead of on disk or other external storage. LevelDB and RocksDB[57] are the most representative ones in this area. Version control is a main issue in blockchain technology: OrpheusDB[58] provides dataset version control services as its highlight design, while ForkBase[59] implements structurally-invariant reusable indexes (SIRI) as an efficient storage engine for blockchain and forkable applications.
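As a small illustration of the key-value interface these stores expose, here is a hypothetical sketch using the plyvel LevelDB binding; the database path and keys are invented.

```python
# Hypothetical key-value sketch using the plyvel LevelDB binding;
# the path and keys are invented for illustration.
import plyvel

db = plyvel.DB("/tmp/kv-demo", create_if_missing=True)
db.put(b"user:1:name", b"Alice")       # writes land in an in-memory memtable first
db.put(b"user:1:city", b"Hong Kong")   # and are later compacted into on-disk SSTables
print(db.get(b"user:1:name"))

# Keys are kept sorted, so prefix scans are cheap.
for key, value in db.iterator(prefix=b"user:1:"):
    print(key, value)
db.close()
```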
4) Message queue products
In addition to the above-mentioned offline requirements, real-time acquisition is now an essential component of a Big Data platform. A mainstream architecture uses Flume[60]+Kafka[61] as the data entry point and uses Kafka's customized message queue to process data in real time. Combined with stream processing and an in-memory database for data aggregation, this can achieve real-time collection and analysis. Such architecture schemes are often based on open source systems and have a high degree of reliability. However, since open source systems are often designed for a broad range of application scenarios, they may need additional development to meet specific demands, and it is difficult to adjust and improve them in a short time.
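A hypothetical producer/consumer sketch of such a Kafka-based entry point, using the kafka-python client, is shown below; the broker address, topic name, and payload are invented.

```python
# Hypothetical real-time ingestion sketch with the kafka-python client;
# broker address, topic, and payload are invented for illustration.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="broker.example.org:9092")
producer.send("collected-events", b'{"sensor": 7, "value": 42}')  # data entry point
producer.flush()

# A downstream stream processor or in-memory database loader consumes the queue.
consumer = KafkaConsumer("collected-events",
                         bootstrap_servers="broker.example.org:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=1000)
for message in consumer:
    print(message.value)
```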
build cross-platform frameworks, helping to increase system scalability. Although these frameworks have provided good cross-platform characteristics over the years, we still have not moved beyond the existing framework structure from 2014[9] (as shown in Fig. 22). Besides confirming that the structure we have summarized is correct, we still hope to see new and expanded frameworks that achieve improvements by changing the overall structure.
[Fig. 22 (the Stratosphere stack[9]): a Meteor script is parsed into a PACT program and a Nephele job graph. The Sopremo layer provides extensible operator packages for warehousing, information extraction, and integration; the PACT layer performs logical and physical cost-based optimization (operator order, local execution, data exchange) and supplies hashing- and sorting-based runtime operators and data exchange strategies; the Nephele execution engine handles task scheduling, job/task monitoring, network management, fault tolerance, I/O services, and memory management; resource management (multi-tenancy, resource allocation) is delegated to Amazon EC2 or Apache YARN.]
5. Conclusion
The modern Big Data platform architecture faces the following difficulties in reaching a unified structure. First, in order to meet different scenarios, more and more components are adopted. Second, in order to support as many application scenarios as possible, each product is often designed very broadly and consistently requires additional computation models to adapt to specific needs. Third, building on the first two, computing components learn from each other, infiltrate each other, and call each other cyclically, making it almost impossible to directly stratify the components.
Due to this complexity, it is hard to give a fixed standard for dividing Big Data platform architectures. Although their architectures are quite similar to one another, different application environments and scenarios, the authors' differing requirements, frequent updates of the Big Data systems themselves, and the staggered classification of applications mean that no single framework suits even the common cases. In this paper, by providing a detailed review of the most commonly used components, we try to inspire a common protocol that can aggregate all these processing flows, much like the REST APIs in microservices[73].
References
[1] A. Sheth, “Transforming big data into smart data: Deriving value via harnessing volume, variety, and velocity using
semantic techniques and technologies,” in Proc. of IEEE 30th Intl. Conf. on Data Engineering, 2014, p. 2.
[2] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the
ACM, vol. 51, no. 1, pp. 107-113, 2008.
[3] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in
Proc. of the 2nd USENIX Conf. on Hot Topics in Cloud Computing, 2010, pp. 1-7.
[4] A. E. W. Johnson, M. M. Ghassemi, S. Nemati, K. E. Niehaus, D. A. Clifton, and G. D. Clifford, “Machine learning
and decision support in critical care,” Proc. of the IEEE, vol. 104, no. 2, pp. 444-466, 2016.
[5] F.-H. Guan, D.-M. Zhao, X. Zhang, B.-T. Shan, and Z. Liu, “Study on the intelligent decision support system for
power grid dispatching,” in Proc. of the Intl. Conf. on Sustainable Power Generation and Supply, 2009, pp. 1-4.
[6] X. Wang, W. Dou, Z. Ma, et al., “I-SI: Scalable architecture for analyzing latent topical-level information from social
media data,” Computer Graphics Forum, vol. 31, no. 3, pp. 1275-1284, 2012.
[7] S.-L. He, J.-M. Zhu, P.-J. He, and M. R. Lyu, “Experience report: System log analysis for anomaly detection,” in Proc.
of IEEE 27th Intl. Symposium on Software Reliability Engineering, 2016, pp. 207-218.
[8] J. McHugh, P. E. Cuddihy, J. W. Williams, K. S. Aggour, V. S. Kumar, and V. Mulwad, “Integrated access to big data
polystores through a knowledge-driven framework,” in Proc. of IEEE Intl. Conf. on Big Data, 2017, pp. 1494-1503.
[9] A. Alexandrov, R. Bergmann, S. Ewen, et al., “The stratosphere platform for big data analytics,” The VLDB Journal,
vol. 23, no. 6, pp. 939-964, 2014.
[10] F. Rahman, M. Slepian, and A. Mitra, “A novel big-data processing framework for healthcare applications: Big-data-healthcare-in-a-box,” in Proc. of IEEE Intl. Conf. on Big Data, 2016, pp. 3548-3555.
[11] C. Roy, S. S. Rautaray, and M. Pandey, “Big data optimization techniques: A survey,” Intl. Journal of Information
Engineering and Electronic Business, vol. 10, no. 4, pp. 41-48, 2018.
[12] D. Agrawal, S. Chawla, B. Contreras-Rojas, et al., “RHEEM: Enabling cross-platform data processing: May the big
data be with you!” Proc. of the VLDB Endowment, vol. 11, no. 11, pp. 1414-1427, 2018.
[13] D. Agrawal, S. Chawla, A. K. Elmagarmid, et al., “Road to freedom in big data analytics,” in Proc. of the 19th Intl.
Conf. on Extending Database Technology, 2016, pp. 479-484.
[14] I. Gog, M. Schwarzkopf, N. Crooks, M. P. Grosvenor, A. Clement, and S. Hand, “Musketeer: All for one, one for all in
data processing systems,” in Proc. of the 10th European Conf. on Computer Systems, 2015, DOI:
10.1145/2741948.2741968
[15] P. Nikitopoulos, A. Vlachou, C. Doulkeridis, and G. A. Vouros, “DiSTrDF: Distributed spatio-temporal RDF queries on
Spark,” in Proc. of Workshops of the EDBT/ICDT 2018 Joint Conf., 2018, pp. 125-132.
[16] L.-Y. Lu, T. S. Pillai, H. Gopalakrishnan, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, “Wisckey: Separating
keys from values in SSD-conscious storage,” ACM Trans. on Storage (TOS), vol. 13, no. 1, pp. 5:1-28, 2017.
[17] R. Barber, C. Garcia-Arellano, R. Grosman, et al. (2017). Evolving databases for new-gen big data applications.
[Online]. Available: https://pdfs.semanticscholar.org/ebc5/2e776b09cf02b063f212a765a0952dc0eff1.pdf
[18] R. D. Chamberlain, M. A. Franklin, R. S. Indeck, and R. K. Cytron, “Intelligent data storage and processing using
FPGA devices,” U.S. Patent 15 388 498, April 13, 2017.
[19] Y. M. Zaharia, R.-S. Xin, P. Wendell, et al., “Apache Spark: A unified analytics engine for large-scale data
processing”, Communications of the ACM, vol. 59, no. 11, 2016, DOI: 10.1145/2934664
[20] M. Abadi, P. Barham, J.-M. Chen, et al., “TensorFlow: A system for large-scale machine learning,” in Proc. of the
12th USENIX Conf. on Operating Systems Design and Implementation, 2016, pp. 265-283.
[21] B. C. Ooi, K.-L. Tan, S. Wang, et al., “SINGA: A distributed deep learning platform,” in Proc. of the 23rd ACM Intl.
Conf. on Multimedia, 2015, pp. 685-688.
[22] X.-R. Meng, J. Bradley, B. Yavuz, et al., “MLlib: Machine learning in Apache Spark,” The Journal of Machine
Learning Research, vol. 17, no. 1, pp. 1235-1241, 2016.
[23] B. Feldbaum, “Method and system for access control of a message queue,” U.S. Patent 6 446 206, September 3,
2002.
[24] Oracle. (2015). Oracle goldengate for big data 12c. [Online]. Available: https://www.oracle.com/middleware/data-
integration/goldengate/big-data/index.html
[25] A. Prahlad and J. Schwartz, “Systems and methods for performing storage operations using network attached
storage,” U.S. Patent 7 546 324, June 9, 2009.
[26] Tableau Inc. (2015). Transform your business with ask data the future of analytics starts now. [Online]. Available:
https://www.tableau.com/
[27] Amazon Web services (AWS). (2018). Cloud computing services. [Online]. Available: https://aws.amazon.com/?nc1=hls
[28] Tencent cloud. (2018). Get more with tencent cloud. [Online]. Available: https://intl.cloud.tencent.com/
[29] Presto. (2018). Distributed SQL query engine for big data. [Online]. Available: https://prestodb.io/
[30] M. Hausenblas and J. Nadeau, “Apache drill: Interactive Ad-hoc analysis at scale,” Big Data, vol. 1, no. 2, pp. 100-
104, 2013.
[31] S. Melnik, A. Gubarev, J.-J. Long, et al., “Dremel: Interactive analysis of Web-scale datasets,” Proc. of the VLDB
Endowment, vol. 3, no. 1-2, pp. 330-339, 2010.
[32] M. Kornacker and J. Erickson. (2012). Cloudera impala: Real time queries in Apache Hadoop, for real. [Online].
Available: http://Com/blog/201210/cloudera impala real time queries Apache Hadoop real
[33] Apache Organization. (2018). Impala is the open source, native analytic database for Apache Hadoop. Impala is
shipped by Cloudera, MapR, Oracle, and Amazon. [Online]. Available: https://impala.apache.org/index.html
[34] D. Borthakur, “HDFS architecture guide,” Hadoop Apache Project, vol. 53, pp. 1-13, 2008.
[35] A. Thusoo, Z. Shao, S. Anthony, et al., “Data warehousing and analytics infrastructure at Facebook,” in Proc. of the
ACM SIGMOD Intl. Conf. on Management of Data, 2010, pp. 1013-1020.
[36] Facebook. (2014). Scribe: Scribe is a server for aggregating log data streamed in real time from a large number of
servers. [Online]. Available: https://github.com/facebookarchive/scribe
[37] K. Chodorow, MongoDB: The Definitive Guide: Powerful and Scalable Data Storage, 2nd ed. Beijing: O’Reilly Media,
2013.
[38] M. R. Palankar, A. Iamnitchi, M. Ripeanu, and S. Garfinkel, “Amazon s3 for science grids: A viable solution?” in Proc.
of the Intl. Workshop on Data-aware Distributed Computing, 2008, pp. 55-64.
[39] J. Dean, “Challenges in building large-scale information retrieval systems: Invited talk,” in Proc. of the 2nd ACM Intl.
Conf. on Web Search and Data Mining, 2009, p. 1.
[40] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig Latin: A not-so-foreign language for data
processing,” in Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 2008, pp. 1099-1110.
[41] A. Thusoo, J. S. Sarma, N. Jain, et al., “Hive: A warehousing solution over a map-reduce framework,” Proc. of the
VLDB Endowment, vol. 2, no. 2, pp. 1626-1629, 2009.
[42] H.-K. Chen, F.-Z. Wang, and N. Helian, “Entropy4cloud: Using entropy-based complexity to optimize cloud service
resource management,” IEEE Trans. on Emerging Topics in Computational Intelligence, vol. 2, no. 1, pp. 13-24,
2018.
[43] N. Elgendy and A. Elragal, “Big data analytics: A literature review paper,” in Proc. of Industrial Conf. on Data Mining,
2014, pp. 214-227.
[44] K. K. Das, E. Fratkin, A. Gorajek, K. Stathatos, and M. Gajjar, “Massively parallel in-database predictions using
PMML,” in Proc. of Workshop on Predictive Markup Language Modeling, 2011, pp. 22-27.
[45] M. Armbrust, R.-S. Xin, C. Lian, et al., “Spark SQL: Relational data processing in Spark,” in Proc. of ACM SIGMOD
Intl. Conf. on Management of Data, 2015, pp. 1383-1394.
[46] S. V. Ranawade, S. Navale, and A. Dhamal. (2016). Analytical processing on Hadoop using Apache Kylin. [Online].
Available: http://www.ijais.org/archives/volume12/number2/ranawade-2017-ijais-451682.Pdf
[47] Apache Kylin™. (2018). Apache Kylin™ home. [Online]. Available: http://kylin.apache.org/
[48] TensorFlow. (2018). Introduction to TensorFlow. TensorFlow makes it easy for beginners and experts to create
machine learning models for desktop, mobile, Web, and cloud. [Online]. Available: https://www.tensorflow.org/learn
[49] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Upper
Saddle River: Addison-Wesley Professional, 2010.
[50] Cuda Zone. (2018). NVIDIA developer. [Online]. Available: https://developer.nvidia.com/coda-zone.html
[51] Apache Spark. (2018). MLlib: Main guide—Spark 2.4.0 documentation. [Online]. Available: https://spark.apa
che.org/docs/latest/ml-guide.html
[52] J.-H. Chen and K.-C. Liu, “On-line batch process monitoring using dynamic PCA and dynamic PLS models,”
Chemical Engineering Science, vol. 57, no. 1, pp. 63-75, 2002.
[53] ApacheTM Hadoop. (2018). ApacheTM hadoop®. [Online]. Available: http://hadoop.apache.org/
[54] Apache storm. (2018). Apache storm. [Online]. Available: http://storm.apache.org/
[55] V. C. Kaggal, R. K. Elayavilli, S. Mehrabi, et al., “Toward a learning health-care system—knowledge delivery at the
point of care empowered by big data and NLP,” Biomedical Informatics Insights, vol. 8, no. S1, pp. 13-22, 2016.
[56] F. Chang, J. Dean, S. Ghemawat, et al., “Bigtable: A distributed storage system for structured data,” ACM Trans. on
Computer Systems (TOCS), vol. 26, no. 2, p. 4, 2008.
[57] RocksDB. (2018). A persistent key-value store for fast storage environments. [Online]. Available: https://rocksdb.org/
[58] L.-Q. Xu, S.-L. Huang, S.-L. Hui, A. J. Elmore, and A. Parameswaran, “OrpheusDB: A lightweight approach to
relational dataset versioning,” in Proc. ACM Intl. Conf. on Management of Data, 2017, pp. 1655-1658.
[59] S. Wang, T. T. A. Dinh, Q. Lin, et al. (2018). Forkbase: An efficient storage engine for blockchain and forkable
applications. [Online]. Available: http://dl.acm.org/citation.cfm?id=3242934
[60] Apache Flume. (2018). Welcome to Apache Flume. [Online]. Available: https://flume.apache.org/
[61] Apache Kafka. (2018). Apache Kafka® is a distributed streaming platform. What exactly does that mean? [Online].
Available: https://kafka.apache.org/
[62] Apache Solr. (2018). Solr is the popular, blazing-fast, open source enterprise search platform built on Apache
Lucene™. [Online]. Available: http://lucene.apache.org/solr/
[63] MatiasBjorling. (2018). Open-channel solid state drives. [Online]. Available: https://openchannelssd.readthedocs.
io/en/latest/
[64] Micron. (2018). 3D XPoint™ technology. [Online]. Available: https://www.micron.com/products/advanced-solutions/
3d-xpoint-technology
[65] S.-M. Wu, K.-H. Lin, and L.-P. Chang, “KVSSD: Close integration of LSM trees and flash translation layer for write-
efficient KV store,” in Proc. of Design, Automation & Test in Europe Conf. & Exhibition, 2018, pp. 563-568.
[66] J. Xu and S. Swanson, “Nova: A log-structured file system for hybrid volatile/non-volatile main memories,” in Proc. of
the 14th USENIX Conf. on File and Storage Technologies, 2016, pp. 323-338.
[67] J. Cong, Z.-M. Fang, M.-H. Huang, L.-B. Wang, and D. Wu, “CPU-FPGA coscheduling for big data applications,”
IEEE Design & Test, vol. 35, no. 1, pp. 16-22, 2018.
[68] F. Firouzi, A. M. Rahmani, K. Mankodiya, et al., “Internet-of-things and big data for smarter healthcare: from device
to architecture, applications and analytics,” Future Generation Computer Systems, vol. 78, pp. 583-586, 2018.
[69] M. Elhoseny, A. Abdelaziz, A. S. Salama, A. M. Riad, K. Muhammad, and A. K. Sangaiah, “A hybrid model of
Internet of things and cloud computing to manage big data in health services applications,” Future Generation
Computer Systems, vol. 86, pp. 1383-1394, 2018.
[70] H. Vo, Y.-H. Liang, J. Kong, and F.-S. Wang, “iSPEED: A scalable and distributed in-memory based spatial query
system for large and structurally complex 3D data,” Proc. of the VLDB Endowment, vol. 11, no. 12, pp. 2078-2081,
2018.
[71] B. Salimi, C. Cole, P. Li, J. Gehrke, and D. Suciu, “HypDB: A demonstration of detecting, explaining and resolving
bias in OLAP queries,” Proc. of the VLDB Endowment, vol. 11, no. 12, pp. 2062-2065, 2018.
[72] E. Bortnikov, A. Braginsky, E. Hillel, I. Keidar, and G. Sheffi, “Accordion: Better memory organization for LSM key-
value stores,” Proc. of the VLDB Endowment, vol. 11, no. 12, pp. 1863-1875, 2018.
[73] I. Nadareishvili, R. Mitra, M. McLarty, et al., Microservice Architecture: Aligning Principles, Practices, and Culture,
Sebastopol: O'Reilly Media, Inc., 2016.
Jing-Huan Yu was born in Jiangxi Province, China in 1996. He received the B.S. degree from
University of Electronic Science and Technology of China, Chengdu, China in 2018. He is currently
pursuing the Ph.D. degree with City University of Hong Kong, Hong Kong, China. His research
interests include data mining, database engine, and non-volatile memory.
Zi-Meng Zhou received the M.E. degree from the Department of Computer Science and
Technology, Shandong University, Jinan, China in 2015. He is currently pursuing the Ph.D. degree
with the Department of Computer Science, City University of Hong Kong. His research interests
include non-volatile memories, embedded systems, and data-intensive computing.