Big Data Service Architecture: A Survey

Jin Wang1,2, Yaqiong Yang1, Tian Wang3, R. Simon Sherratt4, Jingyu Zhang1
1 School of Computer & Communication Engineering, Changsha University of Science & Technology, China
2 School of Information Science and Engineering, Fujian University of Technology, China
3 College of Computer Science and Technology, Huaqiao University, China
4 Department of Biomedical Engineering, the University of Reading, UK
jinwang@csust.edu.cn, yangyqst@163.com, cs_tianwang@163.com, sherratt@ieee.org, zhangzhang@csust.edu.cn*
Abstract

As one of the main development directions in the information field, big data technology can be applied to data mining, data analysis and data sharing over massive data, and it creates huge economic benefits by exploiting the potential value of data. Meanwhile, it can provide decision-making strategies for social and economic development. Big data service architecture is a new service economic model that takes data as a resource, and it extracts and loads the data collected from different data sources. This service architecture provides various customized data processing methods, data analysis and visualization services for service consumers. This paper first briefly introduces the general big data service architecture and the technical processing framework, covering data collection and storage. Next, we discuss big data processing and analysis according to different service requirements, which can present valuable data for service consumers. Then, we introduce the detailed cloud computing service system based on big data, which provides high performance solutions for large-scale data storage, processing and analysis. Finally, we summarize some big data application scenarios over various fields.

Keywords: Big data, Data processing, Data analysis, Cloud service model, Big data applications

1 Introduction

When the concept of big data first appeared in the journal Nature, it was described as large-scale data that cannot be presented, processed and analyzed using existing technologies, methods and theories [1]. Big data has four typical characteristics, i.e., Volume, Variety, Velocity, and Value [2]. Statistics show that the global big data market reached US$58.9 billion in 2017, an increase of 29.1%. By 2020, the global big data market is expected to exceed US$121.4 billion. It is urgent to develop technologies and platforms with better performance to compute, process and analyze large-scale data [3-4]. Big data technology can improve social governance and production efficiency, and promote scientific research [5-6]. There are complex and challenging tasks that cannot be dealt with by traditional reasoning and learning methods, and they require innovative techniques, algorithms and infrastructure. Therefore, to tackle the new challenges of big data technologies, we undertake an in-depth study of the current big data service architecture.

This paper is devoted to analyzing the current big data service architecture, which is composed of three main layers. In the data collecting and storage layer, data from the data sources of big data services need to be collected by corresponding equipment, and then the data in a "pre-processed" state are stored and processed in a distributed file system or database system. In the data processing layer, different processing frameworks are adopted according to different forms of data. The in-depth analysis of big data is currently based mainly on large-scale machine learning technologies, which can deeply mine the potential value of data. Finally, visualization tools are used to present results to data service consumers. In the application layer, there are applications of big data technology over various fields. In addition, in big data-based cloud computing services, software and infrastructure built on the cloud model (i.e., SaaS, PaaS, IaaS) are utilized to process big data. The big data service architecture is shown in Figure 1.

In the remaining sections of this paper, Section 2 introduces the infrastructure of big data service architecture, which involves the collecting and storage of massive data. Section 3 introduces big data processing and analysis technologies. Section 4 introduces the cloud computing service models based on big data, and the integration of cloud computing and big data technologies. In the last section, we summarize some practical application scenarios of big data services.
* Corresponding Author: Jingyu Zhang; E-mail: zhangzhang@csust.edu.cn
DOI: 10.3966/160792642020032102008
394 Journal of Internet Technology Volume 21 (2020) No.2
Figure 1. Big data service architecture
(The figure shows three layers. Application layer: recommendation systems, smart grid, sentiment analysis, search engine. Processing layer: machine learning, data visualization; batch processing, stream processing, hybrid processing; NoSQL database, NewSQL database, Pig analysis tool, Hive data warehouse. Collecting and storage layer: HDFS distributed file system; ETL tools, Kafka, Flume; batch data from Oracle, MySQL and SQL Server; stream data from logs, click-streams and other systems. Alongside the layers, the big data-based cloud computing service provides SaaS, PaaS and IaaS.)
2 Data Collecting And Storage

In the era of big data, information integration usually needs to extract and load large amounts of data from massive data sources [7-8]. These distributed data need to be collected by appropriate equipment or software, and data storage management schemes should be provided for these massive data in the subsequent processing steps.

2.1 Data Collecting

The forms of big data mainly include static batch data and dynamic stream data. Batch data is stored in a static form, and stream data is a continuous real-time sequence of data instances. Streaming data will not be stored completely, and many elements are discarded directly after processing [9].

Because of the instability of stream data transmission, streaming data collecting is different from traditional batch data collecting. For batch data from different data sources, ETL (Extract-Transform-Load) tools are usually used to realize the transmission and collection of various data types. The ETL process extracts data from the data sources, then transforms and loads it into the storage targets [10]. The process is shown in Figure 2. ETL removes corrupted or dirty data through data processing operations such as connection, transformation, and cleanup. Widely used ETL tools include Kettle, Datastage and Informatica.

Figure 2. The process of ETL
(Step 1: data extracting from batch data; Step 2: data cleaning, which filters uncleaned data into cleaned data; Step 3: data transmission according to a transmission rule; Step 4: data loading, which matches the data into the data warehouse.)

For stream data that needs to be collected in real time, a collection tool that can guarantee timeliness, fault tolerance, stability and reliability is needed. Flume is a reliable and fault-tolerant distributed system that collects, aggregates and transfers large amounts of log data from different sources to a centralized storage area [11]. Flume is usually built around the Hadoop ecosystem and acts as middleware between data sources and receivers [12]. Kafka is a general-purpose open source messaging system, which is mainly used to build real-time data pipelines and streaming applications [13]. In order to further optimize the control and processing speed of stream data, Kafka buffers data in queues, which absorbs the mismatch between the speed of data generation and the speed of data processing [14]. Other well-known systems include Facebook's Scribe and Taobao's TimeTunnel.

2.2 Data Storage

In the past decades, relational databases and structured data management techniques have been widely used [15]. According to the characteristics of big data, the storage systems adopted mainly include distributed file systems, NoSQL, NewSQL and other data management systems.

In 2003, Google developed a scalable file system for large-scale distributed storage called Google File System (GFS) [16]. It provides high aggregate performance over massive data, and meets the demand of users for large-capacity storage [17]. HDFS is a part of the Apache Hadoop core project, and its design follows that of GFS [18-19]. HDFS is at present considered to be the most widely used big data tool, supporting redundancy, reliability and scalability for parallel distributed architecture systems [20].

NoSQL is a general term for databases whose data management mode is non-relational. The NoSQL data model involves key-value pairs, column families, graphs or documents. Table 1 describes several different types of NoSQL databases. For example, because of the simple structure of the key-value pair model, it is not prone to data collisions and the programming model is easier to implement. Another data model is based on documents, using a key to identify each document. Unlike key-value storage, data in a document can be queried. Column family stores are inspired by Google's Bigtable [29], and their data is stored based on the column family model.
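The difference between the key-value model and the document model described above can be illustrated with a small in-memory sketch. The two classes below are hypothetical stand-ins written only for this illustration; they do not mirror the API of Redis, MongoDB or any other real database, and the sample records are invented.

```python
# Toy in-memory stand-ins for two NoSQL data models (illustrative only;
# the class and method names are assumptions, not a real database API).

class KeyValueStore:
    """Values are opaque to the store: the only access path is the key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class DocumentStore:
    """Each document is identified by a key, and its fields can be queried."""

    def __init__(self):
        self._docs = {}

    def insert(self, key, document):
        self._docs[key] = dict(document)

    def get(self, key):
        return self._docs.get(key)

    def find(self, **criteria):
        """Return the sorted keys of documents matching every field given."""
        return sorted(
            key
            for key, doc in self._docs.items()
            if all(doc.get(field) == value for field, value in criteria.items())
        )


# Key-value: the session payload is stored and fetched as a single blob.
kv = KeyValueStore()
kv.put("session:42", '{"user": "alice", "role": "admin"}')

# Document: the same kind of record can be looked up by the value of a field.
docs = DocumentStore()
docs.insert("u1", {"user": "alice", "role": "admin"})
docs.insert("u2", {"user": "bob", "role": "guest"})
docs.insert("u3", {"user": "carol", "role": "admin"})
admins = docs.find(role="admin")
```

In this sketch the key-value store can only return the whole value for a known key, whereas the document store can also answer a field-level question such as which users have the admin role.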
Table 1. Different types of NoSQL database

| Types | NoSQL databases | Characteristics | Advantages | Disadvantages | Scenarios |
| Key-Value | Redis [21] | Key-value pairs; Simple data structure; High scalability; Low query update efficiency | Excellent read-write performance; Data persistence; Read-write separation | Without automatic fault tolerance and recovery function; Not supporting on-line expansion | User information storage related to ID (key), i.e. sessions, configuration files, parameters |
| Key-Value | Memcached [22] | Key-value pairs; Simple data structure; High scalability; Low query update efficiency | Memory-based data processing; Distributed and scalable mode; Fast processing speed | Small single cache data capacity; Not supporting data persistence | User information storage related to ID (key), i.e. sessions, configuration files, parameters |
| Document-Oriented | MongoDB [23] | Document-based data processing unit; Multiple data formats, i.e. JSON, XML | Fast access speed; Multiple data types; High query efficiency | Low read-write efficiency; Large space occupancy | Suitable for storing log files and analyzing data in real time |
| Document-Oriented | CouchDB [24] | Document-based data processing unit; Multiple data formats, i.e. JSON, XML | Availability and concurrency; High flexibility | Low query efficiency; High maintenance difficulty | Suitable for storing log files and analyzing data in real time |
| Column-Family | HBase [25] | Column-based data storage model; High data correlation in the same column family | Fast loading speed; Suitable for storage of large data; Efficient compression rate | Low multi-conditional query efficiency; Not directly supporting SQL statement query | Suitable for high concurrency operation on big data; Real-time read/write access |
| Column-Family | Cassandra [26] | Column-based data storage model; High data correlation in the same column family | Elastic scalability; Flexible schema; Multi-list data structures | Not supporting ACID transactions and atomic operations | Suitable for high concurrency operation on big data; Real-time read/write access |
| Graph-Oriented | Neo4j [27] | Graph-based data storage formats, i.e. entities are vertices; High data correlation between entities; Efficient query performance | Efficient query performance; High performance graphics algorithm | Not supporting large graph partition | Suitable for relational data storage, recommendation engines, community websites |
| Graph-Oriented | GraphDB [28] | Graph-based data storage formats, i.e. entities are vertices; High data correlation between entities; Efficient query performance | ACID transactions and atomic operations; Multi-model objects | High complexity | Suitable for relational data storage, recommendation engines, community websites |
NewSQL represents a new class of relational databases that not only have the same scalability as NoSQL, but also provide ACID and SQL services for transactions [30]. The currently widely used NewSQL databases include Spanner and MemSQL [31-32].

3 Data Processing

At the beginning of data processing, raw data should be cleaned, cropped and integrated in order to provide high performance big data services for customers. Table 2 introduces different data processing modes in the big data processing framework.

Table 2. Data processing modes

| Items | Batch processing | Stream processing | Hybrid processing | Graph processing |
| Data characteristics | Large-scale; High accuracy | Continuous infinite real-time data sequence | Both batch data and stream data exist | Relevant data formed by vertices and edges |
| Processing speed | Minutes level | Millisecond level | Millisecond level | Second level |
| System characteristics | Simple programming mode; Time-insensitivity; Data-intensive | Serialized, low latency, event-driven triggering | Diversified workloads; High fault tolerance; Low latency | Graph data with massive nodes and edges; High data correlation |
| Communication mechanism | RPC/HTTP | Message queue | Shared memory/broadcast | Message queue |
| Data storage | HDFS | Real-time input stream | Memory/disk | Distributed file systems/databases |
| Typical processing frameworks | MapReduce | Storm, Samza | Spark, Flink | Pregel, Giraph |
| Application scenarios | Off-line analysis and processing of massive data, large-scale Web information search | Pure real-time analysis; real-time scheduling, continuous calculation | Iterative machine learning; Incremental calculation | Social network map; traffic road map analysis |

3.1 Data Processing

Because of the huge amount of data, the cost of processing and analysis by accessing all the data is high, and the processing may not even complete within a given time period. This is why the cleaning, cropping and integration of raw data comes first. Table 3 summarizes the characteristics of the currently popular data processing frameworks.

Table 3. Characteristics of data processing frameworks
(The table compares the data processing frameworks MapReduce, Dryad, Storm, Samza, Spark, Flink, Pregel and PowerGraph on the following characteristics: scalability, real-time, reliability, data correlation, concurrency, data persistence, multilingual programming, memory processing, fault-tolerance, checkpoint mechanism, reprocessing mechanism, multi-mode operation, stream processing, micro-batch processing, batch computing and interactive query. The per-framework marks did not survive in this copy.)

3.1.1 Batch Data Processing

Since batch data is static and the amount of data is extremely large, data processing is usually performed using distributed off-line computing methods that are capable of parallel computing. In 2004, Google designed and developed MapReduce [33], which is a popular distributed programming model for processing big data sets. MapReduce allows users to build complex computations on big data sets without
worrying about synchronization, fault tolerance, reliability or availability [34]. MapReduce usually divides the input data set into independent blocks, which are processed by parallel Map tasks. The output of the Map phase is then further processed by the subsequent Reduce tasks. In the Reduce phase, shuffle processing distributes data partitions to Reduce nodes dynamically [35].

3.1.2 Stream Data Processing

The stream data processing pattern is suitable for data that requires a real-time response, so a stream processing framework must achieve low latency. Storm is an open source distributed real-time stream data processing system [36]. Storm is similar to a real-time Hadoop computing system, which does not need to do complicated task scheduling, so it can reduce processing latency [37]. Storm has the following characteristics: (1) Simple programming model; (2) Scalability; (3) High reliability (unlike real-time systems such as S4 [38], Storm ensures that data can be processed completely); (4) High fault tolerance (Storm reassigns the affected processing units if an exception occurs during message processing) [39-40].

Apache Samza is a stream processing framework capable of efficiently processing large amounts of data [41]. The largest Samza deployment, at LinkedIn, can handle millions of messages per second during peak hours [42]. Meanwhile, the combination of Samza and Kafka makes better use of the advantages of the two frameworks. Kafka can provide fault tolerance, data buffering, state storage and other technologies for Samza, and the relationship is similar to the dependence of the MapReduce engine on HDFS [43].

3.1.3 Hybrid Data Processing

Some tasks include both batch data processing and stream data processing. Many data processing frameworks support both types by combining similar or related components and APIs, which simplifies the different data processing workflows.

Apache Spark is a new batch data processing framework with stream data processing capabilities [44]. Spark is built around an abstraction called the Resilient Distributed Dataset (RDD). By using RDDs to process data, Spark can speed up batch data processing by keeping the entire process in memory [45]. When memory capacity is not a limiting factor, Spark has better data processing performance than Hadoop [46]. In addition, the Spark Streaming component implements a method called micro-batch that treats continuous flows as a series of micro-batches of data, and handles these micro-batch jobs continuously [47]. The Spark Streaming component has good fault tolerance and load balancing, but its performance is still insufficient compared to the complete stream processing frameworks.

Flink is an open source data analysis framework managed by Apache for batch and stream data processing [48]. Compared with other big data processing frameworks, Flink has its own data processing methods. It is used with a persistent message queue (such as Kafka) to process data at different points of the persistent streams [49]. Although both Spark and Flink can handle hybrid data, the micro-batch processing architecture adopted by Spark takes more time for data processing than Flink [50].

3.1.4 Graph Data Processing

In massive data, some data, called graph data, are linked together in the form of graphs or networks. The number of vertices and edges of such graphs has reached hundreds of millions. Traditional graph data computing frameworks can no longer meet the huge computing needs. At present, two kinds of graph processing frameworks are mainly used to deal with these large-scale graph data [51]: one is a graph database capable of real-time data processing (e.g., Neo4j, OrientDB, and DEX); the other is a computing engine capable of parallel batch processing (e.g., Hama, Giraph and Pregel).

Pregel is a popular bulk synchronous parallel computing system launched by Google, and it introduces a vertex-centered large-scale graph computing model [52]. The Pregel framework is more efficient than MapReduce in dealing with iterative graph data computation. Pregel can provide a computing engine with excellent performance for traversal, shortest path and PageRank computation on large graph data [53].

3.2 Data Analysis and Visualization

Big data analysis technologies (such as machine learning) are used to mine valuable data in the massive data, so as to support the prediction and analysis of future trends and patterns. Afterward, the information should be presented to data service consumers through data visualization. Figure 3 shows machine learning techniques based on big data.
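Returning briefly to the vertex-centered model of Section 3.1.4, the following single-process sketch illustrates how a shortest path computation can proceed in synchronized supersteps, with vertices exchanging messages along edges. It is only an illustration under stated assumptions: the function name pregel_sssp, the adjacency-list input format and the sample graph are invented here, edge weights are assumed non-negative, and nothing in it reproduces the actual Pregel API or its distributed execution.

```python
import math


def pregel_sssp(edges, source):
    """Single-source shortest paths in a vertex-centric, superstep style.

    edges maps each vertex to a list of (neighbor, weight) pairs, with
    non-negative weights. Returns a dict of shortest distances from source;
    unreachable vertices keep the distance math.inf.
    """
    vertices = set(edges) | {source}
    for neighbors in edges.values():
        vertices.update(n for n, _ in neighbors)
    dist = {v: math.inf for v in vertices}

    # Superstep 0: only the source has a pending message (distance 0).
    inbox = {source: [0]}
    while inbox:  # the computation halts when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < dist[vertex]:
                # The vertex improves its value and notifies its neighbors.
                dist[vertex] = best
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        # Barrier: messages sent now are only visible in the next superstep.
        inbox = outbox
    return dist


# Hypothetical sample graph used only for this illustration.
graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 1)],
    "d": [],
}
distances = pregel_sssp(graph, "a")
```

Each pass of the while loop plays the role of one superstep: a vertex only does work when it has received messages, and a vertex whose value did not improve sends nothing, so the message volume shrinks as the distances converge.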
Figure 3. Big data-based machine learning technology
(The figure divides big data-based machine learning into supervised learning (regression for continuous target variables, classification for discrete target variables), unsupervised learning (clustering and correlation for unknown target variables), semi-supervised learning (classification and clustering when a few target variables are unknown but most are known) and reinforcement learning (classification for discrete target variables, control for unknown target variables). Example applications shown: event prediction, image recognition, customer segmentation, sentiment analysis, text mining, path selection, recommendation systems and unmanned aerial vehicles.)
With the in-depth development of big data technologies, machine learning is widely used for in-depth big data analysis [54-56]. Machine learning is a research field focusing on theories, learning systems and algorithm attributes, which draws on artificial intelligence, information theory, optimal control, cognitive science, mathematics and engineering, data mining, control systems, identification systems, informatics and so on [57-60].

Currently, machine learning has attracted more and more attention in terms of big data analysis. For the analysis of image data, Niu et al. [61] proposed a novel multi-scale depth model, which was used to extract rich and discriminative features that could represent various visual concepts. The authors of [62] proposed a novel unsupervised feature learning approach for mapping pixel reflectance to illumination-invariant encodings. In [63], an efficient image retrieval method was proposed, which improves the overall retrieval rate. In the field of natural language processing, Liu et al. [64] proposed a feature selection method based on correlation analysis and Fisher, which can eliminate redundant features. The authors of [65] adopted an unsupervised learning method to estimate the acoustic model for speech recognition, which does not need transcribed training data.

Data service consumers need to obtain valuable data after the processing and analysis. Data visualization usually uses tables and images to present data information to users. With vivid visual effects, data can be presented in a more understandable and intuitive form, thereby enhancing its attractiveness and persuasiveness. Through data visualization, data analysts use data trends, data patterns, relationships and other information to study the data more deeply from different dimensions, so as to further improve data analysis. Data visualization tools have been widely used, mainly including charting tools (D3, RawGraphs, Google Charts, etc.), map tools (Modest Maps, Openheatmap, ColorBrewer, etc.) and timeline tools (Cube, Timeflow, Dipity, etc.).

4 Big Data-based Cloud Computing Service Systems

The advantages of cloud computing technology lie in its powerful distributed processing engines, distributed databases, cloud storage and virtualization technologies. Establishing a cloud computing service system based on big data forms a high-performance data cloud service platform. Figure 4 shows the architecture of big data-based cloud computing service systems.
Figure 4. The architecture of big data-based cloud computing service
(The figure shows the users (public, company, government agencies, management departments) on top of three service tiers. SaaS application service: application fields such as public service, social management, economic regulation and market supervision. PaaS application operation service: universal application component services (distributed service framework, application server, distributed cache, distributed message queue) and data services (query retrieval, data mining, relationship discovery, data integration, data processing, data management). IaaS application infrastructure service: infrastructure (network, server, storage and safety devices) and resource pools (computing and storage resource pools built from physical machines, virtual machines, distributed storage, SAN, physical and virtual networks). Two vertical columns cover cloud management (centralized monitoring, resource, data, infrastructure, application and IT system management) and cloud security (application, physical, data, cyber, host and system security).)
4.1 Big Data-based Cloud Computing Service Models

Cloud computing is a novel computing paradigm, which implements a shared pool model of configurable computing resources (e.g., networks, servers, storage, applications and services), and makes it easy to quickly configure and release cloud computing tasks [66-67]. Big data workloads often experience repeated changes of scope and size, and powerful processing schemes are often required to cope with these changes. Therefore, the components of the big data computing architecture must be carefully designed considering the characteristics, cost, speed, and scalability of the system [68]. The following three types of cloud computing service models address these problems.

(1) Software as a Service (SaaS): Users can obtain data processing frameworks for different requirements through cloud computing services. Users do not need to maintain software; they only need to deliver the big data to be processed to the specific SaaS cloud services, and then pay on demand after the task is completed [69].

(2) Platform as a Service (PaaS): The PaaS system provides a scalable, distributed, and fault-tolerant cloud service programming platform for big data processing.

(3) Infrastructure as a Service (IaaS): IaaS cloud computing service providers provide users with configurable computing resources (processors, storage, networks, applications, I/O devices, etc.), namely virtualized resource pools. It can deal with the variability, scalability, reliability and efficiency in big data operations.

In recent years, cloud computing service models based on big data technologies have attracted a lot of research attention. With the increasing scale of internet applications, cloud computing becomes more and more important for big data processing. The authors of [70] proposed a new SaaS design and developed a semantic model to guide the data collection process. To meet the new requirements of tenant data replication protection in SaaS, Li et al. [71] proposed a new tenant replication protection mechanism, MT-DIPS, based on tuple sampling. The authors of [72] discussed a context-aware security strategy model, which can be customized according to the specific requirements of PaaS-based applications.

4.2 Big Data and Cloud Computing

Big data cannot be separated from cloud computing. At present, the well-known cloud service providers include Amazon Web Services, Microsoft Azure, Aliyun, etc. These cloud service providers integrate data computing, algorithm development, data service and other technologies according to various data development needs, which realizes a complete big data integrated development environment on the cloud [73-74]. In terms of data processing, batch processing and stream processing are combined to process real-time data streams and historical data collaboratively. In terms of algorithm development, the machine learning platform supports the implementation of regression, classification, clustering and other algorithms, and it supports popular deep learning frameworks such as TensorFlow, MXNet, Caffe and PyTorch. In addition, there are data maps in the data cloud service ecosystem, which can search data information and make data development and maintenance easier. The financial data security system on the cloud, which has passed professional-level testing, provides comprehensive data management and security solutions (e.g., data identification, sensitive data discovery, access monitoring, risk detection and anomaly detection) [75].

The fusion of big data and cloud computing technologies has received increasing attention from researchers. In the field of data collection and storage, Sookhak et al. [76] proposed a new data structure based on the algebraic attributes of outsourced files for cloud computing. Yang et al. [77] proposed a cloud data center energy-saving storage strategy based on a novel hypergraph overlay model. For data processing, a new computing framework called Firework is proposed in [78]. Wang et al. [79] proposed a machine learning framework for cloud computing-assisted resource allocation. Ezenwoke et al. [80] proposed a visualization framework for cloud services.

5 Big Data Application Scenarios

Big data technologies have appeared in every aspect of people's lives, and they have been applied in various industries (e.g., finance, internet, catering, medical treatment, energy, sports and entertainment) [81-86]. In the field of the Internet of Things, the data collection technology of wireless sensor networks and the data processing algorithms of big data can realize practical applications such as the Internet of Vehicles, novel computer architectures, indoor localization and road anomaly detection [87-96]. Figure 5 shows the application of big data in various fields.
Figure 5. Big data application scenarios
(The figure groups applications by field. Telecom industry: telecom customer off-grid analysis, customer relationship management. Retail industry: discovering associated purchase behavior, supply chain management. Energy industry: wind power monitoring, smart grid. Urban management: city planning, intelligent transportation. Security field: national security, crime prevention, defense against cyber attacks. Catering industry: menu recommendations, catering O2O. Agricultural field: precision irrigation, soil monitoring. Sports field: predicting game results, training teams. Automobile industry: unmanned aerial vehicles, self-driving cars. Daily life: sentiment analysis, personalized service. Education industry: personalized learning, online education. Financial industry: credit risk analysis, fraud identification. Biomedical science: bioinformatics, intelligent medical care, epidemic prediction. Internet: targeted advertising, recommendation systems.)
5.1 Recommendation Systems

In recent years, with the exponential growth of all kinds of data, data consumers have to face the problem of excessive information, which makes it more difficult to make correct decisions. This phenomenon is called information overload. A recommendation system can use big data processing technologies to extract potentially valuable information from this overloaded mass of information. Its main idea is to build a model by analyzing the user's historical behavior and preference information, then automatically recommend items or products of interest to users, and finally obtain personalized lists for different users [97-98].

With the exponential growth of data volume, recommendation systems can intelligently analyze information and provide targeted data services for users, so this field has attracted extensive attention from researchers. The authors of [99] studied how to explore the implicit hierarchy in a recommendation system when that hierarchy is not explicitly given. The contextual operating tensor model proposed in [100] provides a new recommendation method. On the application side, Chen et al. [101] proposed a time-aware smart object recommendation model for the social Internet of Things. Because a vehicle trajectory is easily affected by the environment and user behavior, the accuracy of trajectory prediction is relatively low. The authors of [102] proposed a prediction model integrating environmental perception and behavioral preference. Customer information analysis and recommendation are crucial to the development of enterprises in market competition. The authors of [103] proposed an efficient query framework to find and recommend potential customers for target products.

5.2 Smart Grid

In a smart grid network, the infrastructure provides the massive information and detailed data required for automatic decision support through wireless sensor networks and other technologies [104-114]. All of these data need to be processed and stored in real time so that decisions for a given situation can be made from historical or real-time data [115]. Machine learning technologies can be used in the smart grid to predict power consumption, pricing, power generation estimation, fault detection, adaptive control and so on [116]. The existing price forecasting, power load forecasting and other methods may have difficulty processing the huge volume of price data in the power grid. To solve these problems, a new electricity price prediction model was proposed in [117]. At the same time, in order to study the problem of data upload in the communication system between decentralized devices, Li et al. [118] proposed an optimal algorithm with polynomial running time. In addition, for other applications of the smart grid, [119-121] proposed some optimization methods.

5.3 Emotional Analysis

With the rise of social networks, the era of big data has brought a new scenario: emotional analysis over large-scale data. Emotional analysis is a process of analyzing people's emotions, opinions and evaluations, which is used to extract valuable information from massive data [122].

In massive data, high-precision emotional classification is a major challenge in emotional analysis. Recent works have effectively explored different emotion classification techniques, from simple rule-based and dictionary-based approaches to more complicated machine learning algorithms. Methods based on machine learning mainly use algorithms suited to emotion analysis to process data, such as ANN, Random Forests, Genetic Algorithm, the k-nearest neighbor algorithm and Support Vector Machine (SVM). By demonstrating a model with audio, video and text as sources of information, Poria et al. [123] proposed a new multimodal affective analysis method to obtain emotions from video networks.

6 Conclusion

With the rapid development of modern information technologies, data has become an important basis for the development of production materials and technologies. This paper investigated the big data service architecture, cloud computing services based on big data and some current big data application scenarios. First, we investigated big data infrastructures that support the big data collection and storage technologies and tools. Then, according to the different data processing modes, the technical frameworks of four typical kinds of data processing were briefly introduced. Afterwards, this paper introduced the big data applications, including machine learning technologies for deep big data analysis. In addition, we discussed big data visualization technologies. Later, we introduced the cloud computing service models, as well as the combination of cloud computing and big data technologies. Finally, this paper summarized some practical application scenarios of big data technologies.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61772454, 61802031, 61811530332, 61811540410). It was also supported by the open research fund of Key Lab of Broadband Wireless Communication and Sensor Network Technology (Nanjing University of Posts and Telecommunications), Ministry of Education (No. JZNY201905).
References

[1] D. Goldston, Big Data: Data Wrangling, Nature, Vol. 455, No. 7209, pp. 15, September, 2008.
[2] A. Oguntimilehin, E. O. Ademola, A Review of Big Data Management, Benefits and Challenges, Journal of Emerging Trends in Computing and Information Sciences, Vol. 5, No. 6, pp. 433-438, June, 2014.
[3] V. Snášel, J. Nowaková, F. Xhafa, L. Barolli, Geometrical and Topological Approaches to Big Data, Future Generation Computer Systems, Vol. 67, pp. 286-296, February, 2017.
[4] J. Liu, E. Pacitti, P. Valduriez, A Survey of Scheduling Frameworks in Big Data Systems, International Journal of Cloud Computing, Vol. 7, No. 2, pp. 103-128, January, 2018.
[5] Y. Chen, M. Zhou, Z. Zheng, Learning Sequence-Based Fingerprint for Magnetic Indoor Positioning System, IEEE Access, Vol. 7, pp. 163231-163244, November, 2019.
[6] G. Bello-Orgaz, J. J. Jung, D. Camacho, Social Big Data: Recent Achievements and New Challenges, Information Fusion, Vol. 28, pp. 45-59, March, 2016.
[7] Y. Lin, H. Wang, J. Li, H. Gao, Data Source Selection for Information Integration in Big Data Era, Information Sciences, Vol. 479, pp. 197-213, April, 2019.
[8] T. Y. Wu, Z. Lee, M. S. Obaidat, S. Kumari, S. Kumar, C. M. Chen, An Authenticated Key Exchange Protocol for Multi-server Architecture in 5G Networks, IEEE Access, Vol. 8, pp. 28096-28108, January, 2020.
[9] P. Karunaratne, S. Karunasekera, A. Harwood, Distributed Stream Clustering Using Micro-clusters on Apache Storm, Journal of Parallel and Distributed Computing, Vol. 108, pp. 74-84, October, 2017.
[10] J. C. Nwokeji, F. Aqlan, A. Apoorva, A. Olagunju, Big Data ETL Implementation Approaches: A Systematic Literature Review, International Conference on Software Engineering and Knowledge Engineering (SEKE), Redwood, California, USA, 2018, pp. 714-715.
[11] Apache Flume, http://flume.apache.org/.
[12] B. Shu, H. Chen, M. Sun, Dynamic Load Balancing and Channel Strategy for Apache Flume Collecting Real-Time Data Stream, IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Guangzhou, China, 2017, pp. 542-549.
[13] Apache Kafka, http://kafka.apache.org/.
[14] H. Jafarpour, R. Desai, D. Guy, KSQL: Streaming SQL Engine for Apache Kafka, International Conference on Extending Database Technology (EDBT), Lisbon, Portugal, 2019, pp. 524-533.
[15] C. A. D. Deagustini, S. E. F. Dalibón, S. Gottifredi, M. A. Falappa, C. I. Chesnevar, G. R. Simari, Relational Databases as a Massive Information Source for Defeasible Argumentation, Knowledge-Based Systems, Vol. 51, pp. 93-109, October, 2013.
[16] S. Ghemawat, H. Gobioff, S. T. Leung, The Google File System, Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP), Bolton Landing, New York, USA, 2003, pp. 29-43.
[17] M. Wang, B. Li, Y. Zhao, G. Pu, Formalizing Google File System, IEEE 20th Pacific Rim International Symposium on Dependable Computing (PRDC), Singapore, Singapore, 2014, pp. 190-191.
[18] T. Yeh, T. Chien, Building a Version Control System in the Hadoop HDFS, Network Operations and Management Symposium (NOMS), Taipei, Taiwan, 2018, pp. 1-5.
[19] Apache Hadoop, http://hadoop.apache.org/.
[20] K. Bok, H. Oh, J. Lim, Y. Pae, H. Choi, B. Lee, J. Yoo, An Efficient Distributed Caching for Accessing Small Files in HDFS, Cluster Computing, Vol. 20, No. 4, pp. 3579-3592, December, 2017.
[21] Redis, https://redis.io/.
[22] Memcached, http://memcached.org/.
[23] MongoDB, http://mongodb.org/.
[24] CouchDB, http://couchdb.apache.org/.
[25] HBase, http://hbase.apache.org/.
[26] Cassandra, http://cassandra.apache.org/.
[27] J. Hölsch, T. Schmidt, M. Grossniklaus, On the Performance of Analytical and Pattern Matching Graph Queries in Neo4j and a Relational Database, Joint Conference: 6th International Workshop on Querying Graph Structured Data (EDBT/ICDT), Venice, Italy, 2017, pp. 1-8.
[28] GraphDB, http://www.sones.com.
[29] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, R. E. Gruber, Bigtable: A Distributed Storage System for Structured Data, ACM Transactions on Computer Systems, Vol. 26, No. 2, pp. 1-26, June, 2008.
[30] A. Pavlo, M. Aslett, What’s Really New with NewSQL?, ACM Sigmod Record, Vol. 45, No. 2, pp. 45-55, June, 2016.
[31] J. Chen, S. Jindel, R. Walzer, R. Sen, N. Jimsheleishvilli, M. Andrews, The MemSQL Query Optimizer: A Modern Optimizer for Real-time Analytics in a Distributed Database, Proceedings of the VLDB Endowment, Vol. 9, No. 13, pp. 1401-1412, September, 2016.
[32] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, D. Woodford, Spanner: Google’s Globally-distributed Database, ACM Transactions on Computer Systems, Vol. 31, No. 3, pp. 8:1-8:22, August, 2013.
[33] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, Vol. 51, No. 1, pp. 107-113, January, 2008.
[34] Y. Guo, J. Rao, D. Cheng, X. Zhou, iShuffle: Improving Hadoop Performance with Shuffle-on-Write, IEEE Transactions on Parallel and Distributed Systems, Vol. 28, No. 6, pp. 1649-1662, June, 2017.
[35] X. Zhao, J. Zhang, X. Qin, kNN-DP: Handling Data Skewness in kNN Joins Using MapReduce, IEEE Transactions on Parallel and Distributed Systems, Vol. 29, No. 3, pp. 600-613, March, 2018.
[36] Apache Storm, http://storm.apache.org/.
[37] H. Yan, D. Sun, S. Gao, Z. Zhou, Performance Analysis of Storm in a Real-World Big Data Stream Computing Environment, International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), Edinburgh, UK, 2017, pp. 624-634.
[38] R. S. Gallardo, C. Bonacic, M. Marin, S4 Applications Simulator for Performance Evaluation, International Conference on Parallel, Distributed, and Network-Based Processing (PDP), Crete, Greece, 2016, pp. 403-407.
[39] L. Aniello, R. Baldoni, L. Querzoni, Adaptive Online Scheduling in Storm, Proceedings of the 7th ACM international conference on Distributed event-based systems (DEBS), Arlington, Texas, USA, 2013, pp. 207-218.
[40] C. K. Shieh, S. W. Huang, L. D. Sun, M. F. Tsai, N. Chilamkurti, A Topology-based Scaling Mechanism for Apache Storm, International Journal of Network Management, Vol. 27, No. 3, pp. e1933, May-June, 2017.
[41] Apache Samza, http://samza.apache.org/.
[42] Z. Zhuang, T. Feng, Y. Pan, H. Ramachandra, B. Sridharan, Effective Multi-stream Joining in Apache Samza Framework, IEEE International Congress on Big Data (BigData Congress), San Francisco, CA, USA, 2016, pp. 267-274.
[43] M. A. Kleppmann, J. Kreps, Kafka, Samza and the Unix Philosophy of Distributed Data, IEEE Data Engineering Bulletin, Vol. 38, No. 4, pp. 4-14, December, 2015.
[44] Apache Spark, http://spark.apache.org/.
[45] O. Backhoff, E. Ntoutsi, Scalable Online-Offline Stream Clustering in Apache Spark, IEEE 16th International Conference on Data Mining Workshops (ICDM), Barcelona, Spain, 2016, pp. 37-44.
[46] S. Han, W. Choi, R. Muwafiq, Y. Nah, Impact of Memory Size on Bigdata Processing based on Hadoop and Spark, Proceedings of the International Conference on Research in Adaptive and Convergent Systems (RACS), Krakow, Poland, 2017, pp. 275-280.
[47] D. Cheng, X. Zhou, Y. Wang, C. Jiang, Adaptive Scheduling Parallel Jobs with Dynamic Batching in Spark Streaming, IEEE Transactions on Parallel and Distributed Systems, Vol. 29, No. 12, pp. 2672-2685, December, 2018.
[48] Apache Flink, https://flink.apache.org/.
[49] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, K. Tzoumas, Apache Flink: Stream and Batch Processing in a Single Engine, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, Vol. 38, No. 4, pp. 28-38, December, 2015.
[50] B. Akil, Y. Zhou, U. Röhm, On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science, IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 2017, pp. 303-310.
[51] Z. Lin, Principle and Application of Big Data Technology, Posts and Telecom Press, 2015.
[52] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, G. Czajkowski, Pregel: A System for Large-scale Graph Processing, Proceedings of the ACM SIGMOD International Conference on Management of data (SIGMOD), Indiana, USA, 2010, pp. 135-146.
[53] S. Bhatia, R. Kumar, Review of Graph Processing Frameworks, IEEE International Conference on Data Mining Workshops (ICDM), Singapore, Singapore, 2018, pp. 998-1005.
[54] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, I. Brandic, Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility, Future Generation Computer Systems, Vol. 25, No. 6, pp. 599-616, June, 2009.
[55] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature, Vol. 529, No. 7587, pp. 484-489, January, 2016.
[56] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, DeepFM: A Factorization-machine Based Neural Network for CTR Prediction, Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia, 2017, pp. 1725-1731.
[57] G. E. Hinton, S. Osindero, Y. W. Teh, A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, Vol. 18, No. 7, pp. 1527-1554, July, 2006.
[58] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, Vol. 1, No. 4, pp. 541-551, December, 1989.
[59] S. He, Z. Li, Y. Tang, Z. Liao, F. Li, S. J. Lim, Parameters Compressing in Deep Learning, CMC: Computers, Materials & Continua, Vol. 62, No. 1, pp. 321-336, 2020.
[60] J. Zhang, S. Zhong, T. Wang, H.-C. Chao, J. Wang, Blockchain-Based Systems and Applications: A Survey, Journal of Internet Technology, Vol. 21, No. 1, pp. 1-14, January, 2020.
[61] Y. Niu, Z. Lu, J. R. Wen, T. Xiang, S. F. Chang, Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation, IEEE Transactions on Image Processing, Vol. 28, No. 4, pp. 1720-1731, April, 2019.
[62] L. Windrim, R. Ramakrishnan, A. Melkumyan, R. J. Murphy, A Physics-Based Deep Learning Approach to Shadow Invariant Representations of Hyperspectral Images, IEEE Transactions on Image Processing, Vol. 27, No. 2, pp. 665-677, February, 2018.
[63] P. Liu, J. M. Guo, C. Y. Wu, D. Cai, Fusion of Deep Learning and Compressed Domain Features for Content-Based Image Retrieval, IEEE Transactions on Image Processing, Vol. 26, No. 12, pp. 5706-5717, December, 2017.
[64] Z. T. Liu, M. Wu, W. H. Cao, J. W. Mao, J. P. Xu, G. Z. Tan, Speech Emotion Recognition Based on Feature Selection and Extreme Learning Machine Decision Tree, Neurocomputing, Vol. 273, pp. 271-280, January, 2018.
[65] V. Despotovic, O. Walter, R. Haeb-Umbach, Machine Learning Techniques for Semantic Analysis of Dysarthric Speech: An Experimental Study, Speech Communication, Vol. 99, pp. 242-251, May, 2018.
[66] P. Mell, T. Grance, The NIST Definition of Cloud Computing, NIST Special Publication 800-145, September, 2011.
[67] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, M. Zaharia, A View of Cloud Computing, Communications of the ACM, Vol. 53, No. 4, pp. 50-58, April, 2010.
[68] M. Arostegi, A. Torre-Bastida, M. N. Bilbao, J. D. Ser, A Heuristic Approach to the Multicriteria Design of IaaS Cloud Infrastructures for Big Data Applications, Expert Systems, Vol. 35, No. 5, pp. 1-12, October, 2018.
[69] L. Wang, J. Tao, M. Kunze, A. C. Castellanos, D. Kramer, W. Karl, Scientific Cloud Computing: Early Definition and Experience, 10th IEEE International Conference on High Performance Computing and Communications (HPCC), Dalian, China, 2008, pp. 825-830.
[70] I. L. Yen, F. Bastani, Y. Huang, Y. Zhang, X. Yao, SaaS for Automated Job Performance Appraisals Using Service Technologies and Big Data Analytics, IEEE International Conference on Web Services (ICWS), Honolulu, HI, USA, 2017, pp. 412-419.
[71] L. Li, Y. Zhang, Y. Ding, MT-DIPS: a New Data Duplication Integrity Protection Scheme for Multi-tenants Sharing Storage in SaaS, International Journal of Grid and Utility Computing, Vol. 9, No. 1, pp. 26-36, February, 2018.
[72] Y. Verginadis, I. Patiniotakis, G. Mentzas, S. Veloudis, I. Paraskakis, Data Distribution and Encryption Modelling for PaaS-enabled Cloud Security, IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Luxembourg, 2016, pp. 497-502.
[73] H. Li, W. Li, H. Wang, J. Wang, An Optimization of Virtual Machine Selection and Placement by Using Memory Content Similarity for Server Consolidation in Cloud, Future Generation Computer Systems, Vol. 84, pp. 98-107, July, 2018.
[74] H. Li, W. Li, Q. Feng, S. Zhang, H. Wang, J. Wang, Leveraging Content Similarity among VMI Files to Allocate Virtual Machines in Cloud, Future Generation Computer Systems, Vol. 79, pp. 528-542, February, 2018.
[75] T.-Y. Wu, C.-M. Chen, K.-H. Wang, C. Meng, E. K. Wang, A Provably Secure Certificateless Public Key Encryption with Keyword Search, Journal of the Chinese Institute of Engineers, Vol. 42, No. 1, pp. 20-28, January, 2019.
[76] M. Sookhak, F. R. Yu, A. Y. Zomaya, Auditing Big Data Storage in Cloud Computing Using Divide and Conquer Tables, IEEE Transactions on Parallel and Distributed Systems, Vol. 29, No. 5, pp. 999-1012, May, 2018.
[77] T. Yang, H. Pen, W. Li, D. Yuan, A. Y. Zomaya, An Energy-Efficient Storage Strategy for Cloud Datacenters Based on Variable K-Coverage of a Hypergraph, IEEE Transactions on Parallel and Distributed Systems, Vol. 28, No. 12, pp. 3344-3355, December, 2017.
[78] Q. Zhang, Q. Zhang, W. Shi, H. Zhong, Firework: Data Processing and Sharing for Hybrid Cloud-Edge Analytics, IEEE Transactions on Parallel and Distributed Systems, Vol. 29, No. 9, pp. 2004-2017, September, 2018.
[79] J. B. Wang, J. Wang, Y. Wu, J. Y. Wang, H. Zhu, M. Lin, J. Wang, A Machine Learning Framework for Resource Allocation Assisted by Cloud Computing, IEEE Network, Vol. 32, No. 2, pp. 144-151, March-April, 2018.
[80] A. Ezenwoke, O. Daramola, M. Adigun, Towards a Visualization Framework for Service Selection in Cloud E-Marketplaces, IEEE 13th World Congress on Services, Honolulu, HI, USA, 2017, pp. 122-129.
[81] N. Fikri, M. Rida, N. Abghour, K. Moussaid, A. E. Omri, BigData and Regulation in High Frequency Trading, Proceedings of the International Conference on Cloud and Big Data Computing (ICCBDC), London, UK, 2017, pp. 45-49.
[82] S. Tiwari, H. M. Wee, Y. Daryanto, Big Data Analytics in Supply Chain Management between 2010 and 2016: Insights to Industries, Computers & Industrial Engineering, Vol. 115, pp. 319-330, January, 2018.
[83] A. Kamilaris, A. Kartakoullis, F. X. Prenafeta-Boldú, A Review on the Practice of Big Data Analysis in Agriculture, Computers and Electronics in Agriculture, Vol. 143, pp. 23-37, December, 2017.
[84] X. Du, The Application of Big Data Technology in Competitive Sports Research, International Conference on Geo-Spatial Knowledge and Intelligence (GSKI), Chiang Mai, Thailand, 2017, pp. 466-471.
[85] N. Singh, C. B. Kaverappa, J. D. Joshi, Data Mining for Prevention of Crimes, International Conference on Human Interface and the Management of Information (HIMI), Las Vegas, NV, USA, 2018, pp. 705-717.
[86] I. Kalamaras, A. Zamichos, A. Salamanis, A. Drosou, D. D. Kehagias, G. Margaritis, S. Papadopoulos, D. Tzovaras, An Interactive Visual Analytics Platform for Smart Intelligent Transportation Systems Management, IEEE Transactions on Intelligent Transportation Systems, Vol. 19, No. 2, pp. 487-496, February, 2018.
[87] J. Wang, Y. Gao, W. Liu, A. K. Sangaiah, H. J. Kim, An Intelligent Data Gathering Schema with Data Fusion Supported for Mobile Sink in Wireless Sensor Networks, International Journal of Distributed Sensor Networks, Vol. 15, No. 3, pp. 1-9, March, 2019.
[88] J. Wang, X. Gu, W. Liu, A. K. Sangaiah, H.-J. Kim, An Empower Hamilton Loop Based Data Collection Algorithm with mobile Agent for WSNs, Human-centric Computing and Information Sciences, Vol. 9, No. 1, Article 18, May, 2019.
[89] J. Zhang, C. Wu, D. Yang, Y. Chen, X. Meng, HSCS: A Hybrid Shared Cache Scheduling Scheme for Multi-programmed Workloads, Frontiers of Computer Science, Vol. 12, No. 6, pp. 1090-1104, December, 2018.
[90] Y. Chen, J. Wang, R. Xia, Q. Zhang, Z. Cao, K. Yang, The Visual Object Tracking Algorithm Research Based on Adaptive Combination Kernel, Journal of Ambient Intelligence and Humanized Computing, Vol. 10, No. 12, pp. 4855-4867, December, 2019.
[91] Y. Chen, J. Wang, S. Liu, X. Chen, J. Xiong, J. Xie, K. Yang, Multiscale Fast Correlation Filtering Tracking Algorithm Based on a Feature Fusion Model, Concurrency and Computation: Practice and Experience, October, 2019, DOI: 10.1002/cpe.5533.
[92] Y. Chen, M. Zhou, Z. Zheng, M. Huo, Toward Practical Crowdsourcing-Based Road Anomaly Detection with Scale-Invariant Feature, IEEE Access, Vol. 7, pp. 67666-67678, May, 2019.
[93] K. Gao, F. Han, P. Dong, N. Xiong, R. Du, Connected Vehicle as a Mobile Sensor for Real Time Queue Length at Signalized Intersections, Sensors, Vol. 19, No. 9, pp. 2059, May, 2019.
[94] J. Wang, W. Wu, Z. Liao, A. K. Sangaiah, R. S. Sherratt, An Energy-efficient Off-loading Scheme for Low Latency in Collaborative Edge Computing, IEEE Access, Vol. 7, pp. 149182-149190, October, 2019.
[95] W. Li, Z. Chen, X. Gao, W. Liu, J. Wang, Multimodel Framework for Indoor Localization under Mobile Edge Computing Environment, IEEE Internet of Things Journal, Vol. 6, No. 3, pp. 4844-4853, June, 2019.
[96] S. M. H. Rostami, A. K. Sangaiah, J. Wang, X. Liu, Obstacle Avoidance of Mobile Robots Using Modified Artificial Potential Field Algorithm, EURASIP Journal on Wireless Communications and Networking, Vol. 2019, No. 1, Article 70, December, 2019.
[97] H. Wang, N. Wang, D. Y. Yeung, Collaborative Deep Learning for Recommender Systems, International Conference on Knowledge Discovery and Data Mining (KDD), Sydney, Australia, 2015, pp. 1235-1244.
[98] E. Çano, M. Morisio, Hybrid Recommender Systems: A Systematic Literature Review, Intelligent Data Analysis, Vol. 21, No. 6, pp. 1487-1524, November, 2017.
[99] S. Wang, J. Tang, Y. Wang, H. Liu, Exploring Hierarchical Structures for Recommender Systems, IEEE Transactions on Knowledge and Data Engineering, Vol. 30, No. 6, pp. 1022-1035, June, 2018.
[100] S. Wu, Q. Liu, L. Wang, T. Tan, Contextual Operation for Recommender Systems, IEEE Transactions on Knowledge and Data Engineering, Vol. 28, No. 8, pp. 2000-2012, August, 2016.
[101] Y. Chen, M. Zhou, Z. Zheng, D. Chen, Time-aware Smart Object Recommendation in Social Internet of Things, IEEE Internet of Things Journal, December, 2019, DOI: 10.1109/JIOT.2019.2960822.
[102] Z. Xia, Z. Hu, J. Luo, UPTP Vehicle Trajectory Prediction Based on User Preference Under Complexity Environment, Wireless Personal Communications, Vol. 97, No. 3, pp. 4651-4665, December, 2017.
[103] B. Yin, K. Gu, X. Wei, A Cost-Efficient Framework for Finding Prospective Customers Based on Reverse Skyline Queries, Knowledge-Based Systems, Vol. 152, pp. 117-135, July, 2018.
[104] J.-S. Pan, L. Kong, T.-W. Sung, P.-W. Tsai, V. Snášel, A Clustering Scheme for Wireless Sensor Networks Based on Genetic Algorithm and Dominating Set, Journal of Internet Technology, Vol. 19, No. 4, pp. 1111-1118, July, 2018.
[105] Z. Meng, J.-S. Pan, K.-K. Tseng, PaDE: An Enhanced Differential Evolution Algorithm with Novel Control Parameter Adaptation Schemes for Numerical Optimization, Knowledge-Based Systems, Vol. 168, pp. 80-99, March, 2019.
[106] T.-T. Nguyen, J.-S. Pan, T.-K. Dao, An Improved Flower Pollination Algorithm for Optimizing Layouts of Nodes in Wireless Sensor Network, IEEE Access, Vol. 7, pp. 75985-75998, June, 2019.
[107] J.-S. Pan, C.-Y. Lee, A. Sghaier, M. Zeghid, J. Xie, Novel Systolization of Subquadratic Space Complexity Multipliers Based on Toeplitz Matrix-Vector Product Approach, IEEE Transactions on Very Large Scale Integration Systems, Vol. 27, No. 7, pp. 1614-1622, July, 2019.
[108] J.-S. Pan, L. Kong, T.-W. Sung, P.-W. Tsai, V. Snášel, Alpha-Fraction First Strategy for Hierarchical Model in Wireless Sensor Networks, Journal of Internet Technology, Vol. 19, No. 6, pp. 1717-1726, November, 2018.
[109] J. Wang, Y. Gao, C. Zhou, R. S. Sherratt, L. Wang, Optimal Coverage Multi-Path Scheduling Scheme with Multiple Mobile Sinks for WSNs, CMC: Computers, Materials & Continua, Vol. 62, No. 2, pp. 695-711, January, 2020.
[110] J. Wang, Y. Gao, K. Wang, A. K. Sangaiah, S. J. Lim, An Affinity Propagation-based Self-adaptive Clustering Method for Wireless Sensor Networks, Sensors, Vol. 19, No. 11, pp. 2579, June, 2019.
[111] J. Wang, C. Ju, H. Kim, R. S. Sherratt, S. Lee, A Mobile Assisted Coverage Hole Patching Scheme Based on Particle Swarm Optimization for WSNs, Cluster Computing, Vol. 22, No. 1, pp. 1787-1795, January, 2019.
[112] J. Wang, Y. Gao, W. Liu, A. K. Sangaiah, H.-J. Kim, Energy Efficient Routing Algorithm with Mobile Sink Support for Wireless Sensor Networks, Sensors, Vol. 19, No. 7, pp. 1494, April, 2019.
[113] J. Wang, Y. Gao, X. Yin, F. Li, H.-J. Kim, An Enhanced PEGASIS Algorithm with Mobile Sink Support for Wireless Sensor Networks, Wireless Communications and Mobile Computing, Vol. 2018, pp. 1-9, December, 2018.
[114] D. Cao, B. Zheng, B. Ji, Z. Lei, C. Feng, A Robust Distance-based Relay Selection for Message Dissemination in Vehicular Network, Wireless Networks, pp. 1-17, October, 2018, DOI: 10.1007/s11276-018-1863-4.
[115] Q. Tang, K. Yang, D. Zhou, Y. Luo, F. Yu, A Real-Time Dynamic Pricing Algorithm for Smart Grid With Unstable Energy Providers and Malicious Users, IEEE Internet of Things Journal, Vol. 3, No. 4, pp. 554-562, August, 2016.
[116] Z. Xia, Z. Fang, F. Zou, J. Wang, A. K. Sangaiah, Research on defensive Strategy of Real-time Price Attack Based on Multiperson Zero-determinant, Security and Communication Networks, Vol. 2019, pp. 1-13, July, 2019.
[117] K. Wang, C. Xu, Y. Zhang, S. Guo, A. Y. Zomaya, Robust Big Data Analytics for Electricity Price Forecasting in the Smart Grid, IEEE Transactions on Big Data, Vol. 5, No. 1, pp. 34-45, March, 2019.
[118] W. Li, H. Xu, H. Li, Y. Yang, P. K. Sharma, J. Wang, S. Singh, Complexity and Algorithms for Superposed Data Uploading Problem in Networks with Smart Devices, IEEE Internet of Things Journal, October, 2019, DOI: 10.1109/JIOT.2019.2949352.
[119] S. He, W. Zeng, K. Xie, H. Yang, M. Lai, X. Su, PPNC: Privacy Preserving Scheme for Random Linear Network Coding in Smart Grid, KSII Transactions on Internet & Information Systems, Vol. 11, No. 3, pp. 1510-1532, March, 2017.
[120] Q. Tang, M. Xie, K. Yang, Y. Luo, D. Zhou, Y. Song, A Decision Function Based Smart Charging and Discharging Strategy For Electric Vehicle In Smart Grid, Mobile Networks and Applications, Vol. 24, No. 5, pp. 1722-1731, October, 2019.
[121] W. Li, H. Liu, J. Wang, L. Xiang, Y. Yang, An Improved Linear Kernel for Complementary Maximal Strip Recovery: Simpler and Smaller, Theoretical Computer Science, Vol. 786, pp. 55-66, September, 2019.
[122] K. Cheng, J. Li, J. Tang, H. Liu, Unsupervised Sentiment Analysis with Signed Social Networks, Thirty-First AAAI Conference on Artificial Intelligence (AAAI), San Francisco, California, USA, 2017, pp. 3429-3435.
[123] S. Poria, E. Cambria, N. Howard, G. B. Huang, A. Hussain, Fusing Audio, Visual and Textual Clues for Sentiment Analysis from Multimodal Content, Neurocomputing, Vol. 174, pp. 50-59, January, 2016.

Biographies

Jin Wang received the M.S. degree from Nanjing University of Posts and Telecommunications, China in 2005. He received the Ph.D. degree from Kyung Hee University, Korea in 2010. He is now a professor at Changsha University of Science and Technology. His research interests mainly include wireless sensor networks and network security.

Yaqiong Yang is currently a graduate student at the School of Computer & Communication Engineering, Changsha University of Science & Technology, Changsha, China. Her research interests include big data and storage systems.

Tian Wang received his Ph.D. degree from City University of Hong Kong in 2011. Currently, he is a professor at the College of Computer Science and Technology, Huaqiao University, China. His research interests include Internet of Things and edge computing.

R. Simon Sherratt received the Ph.D. degree in video signal processing in 1996 from the University of Salford. He is currently a Senior Lecturer in Consumer Electronics and a Director for Teaching and Learning. His research topic is signal processing in consumer electronic devices, concentrating on equalization and DSP architectures.

Jingyu Zhang received the Ph.D. degree in Computer Science and Technology from Shanghai Jiao Tong University in 2017. He is currently an Assistant Professor at the School of Computer & Communication Engineering, Changsha University of Science and Technology, China. His research interests include blockchain, computer architecture, and mobile networks.