You are on page 1of 18

SPECIAL SECTION ON THEORETICAL FOUNDATIONS FOR BIG

DATA APPLICATIONS: CHALLENGES AND OPPORTUNITIES

Received March 15, 2016, accepted March 28, 2016, date of current version August 15, 2016.
Digital Object Identifier 10.1109/ACCESS.2016.2580581

Energy Big Data: A Survey


HUI JIANG1 , KUN WANG1 , (Member, IEEE), YIHUI WANG1 , MIN GAO2 ,
AND YAN ZHANG3 , (Senior Member, IEEE)
1 Key Laboratory of Broadband Wireless Communication and Sensor Network Technology, Ministry of Education, Nanjing University of Posts and
Telecommunications, Nanjing 210003, China
2 Department of Electrical Engineering, University of California at Los Angeles, Los Angeles, CA 90095, USA
3 Simula Research Laboratory and University of Oslo, Norway

Corresponding author: Y. Zhang (yanzhang@ieee.org)


This work was supported in part by the National Natural Science Foundation of China under Grant 61572262, in part by the National
Science Foundation of Jiangsu Province under Grant BK20141427, in part by the Nanjing University of Posts and
Telecommunications under Grant NY214097, in part by the Open Research Fund within the Key Laboratory of
Broadband Wireless Communication and Sensor Network Technology, Ministry of Education, Nanjing
University of Posts and Telecommunications under Grant NYKL201507, in part by the Qinlan Project
of Jiangsu Province, and in part by the Research Council of Norway under Project 240079/F20.

ABSTRACT Energy is one of the most important parts in human life. As a significant application of energy,
smart grid is a complicated interconnected power grid that involves sensors, deployment strategies, smart
meters, and real-time data processing. It continuously generates data with large volume, high velocity, and
diverse variety. In this paper, we first give a brief introduction on big data, smart grid, and big data application
in the smart grid scenario. Then, recent studies and developments are summarized in the context of integrated
architecture and key enabling technologies. Meanwhile, security issues are specifically addressed. Finally,
we introduce several typical big data applications and point out future challenges in the energy domain.

INDEX TERMS Big data, energy, smart grids, data processing, data mining, energy internet, survey.

I. INTRODUCTION • Data from power utilization habits of users;


Smart grid is a major research and development direction in • Data from Phasor Measurement Units (PMUs) for situ-
today’s energy industry. It refers to the next generation power ation awareness;
grid which integrates conventional power grid and manage- • Data from energy consumption measured by the
ment of advanced communication networks and pervasive widespread smart meters;
computing capabilities to improve control, efficiency, relia- • Data from energy market pricing and bidding collected
bility and safety of the grid [1]. Smart grid delivers electricity by Automated Revenue Metering (ARM) system;
between suppliers and consumers so as to form a bidirectional • Data from management, control and maintenance of
electricity and information flow infrastructure. It fulfills the device and equipment in the electric power generation,
demands of each stakeholder, functionality coordination in transmission and distribution in the grid;
electric power generation, terminal electricity consuming and • Data from operating utilities, like financial data and
power market. It also improves the efficiency in each part of large data sets which are not directly obtained through
the system operation and reduces the cost and environmental the network measurement.
impact. At the same time, the system reliability, self-healing All aforementioned sources increase the grid’s data vol-
ability and stability are improved [2]. ume. According to the investigation in [3], by 2009 the
Integrating telecommunication, automation and electric amount of data in electric utilities’ system has already
network control, smart grid requires reliable real-time data reached the level of TeraBytes (TBs). With every passing day,
processing. To support this requirement and benefit future more TBs of data are emerging at infrastructure company data
analysis and decision making, huge amount of historic data centers. This fact accelerates the pace of big data technology
should be well fetched and stored in a reasonable time budget. applying to the smart grid area.
In addition, there are various sources that huge amount of data To obtain a better understanding of the big data appli-
can be generated through diverse measurements acquired by cation on smart grid, we give an overview of energy big
Intelligent Electronic Devices (IEDs) in the smart grid: data (big data applied in the energy domain) covering the

2169-3536
2016 IEEE. Translations and content mining are permitted for academic research only.
3844 Personal use is also permitted, but republication/redistribution requires IEEE permission. VOLUME 4, 2016
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
H. Jiang, et al.: Energy Big Data: A Survey

recent researches and development in the context of an architectures for complex power grid, effective data analysis
integrated architecture, key enabling technologies, security, to aware the unfavorable situation in advance, and various
typical applications and challenges. data processing methods for different conditions [10]–[13].
The remainder of this survey is organized as follows.
Overview of the energy big data is provided in Section II, C. BIG DATA IN SMART GRID
and related analytics frameworks are presented in Section III. Data size in electric power systems have increased dramati-
In Section IV, key enabling technologies are listed. Section V cally, which results in gaps and challenges. The 4Vs data (data
illustrates the security issues and Section VI looks into main with characteristics of volume, variety, velocity, and veracity)
application areas. Challenges are discussed in Section VII and in smart grid are difficult to be handled within a tolerable
Section VIII concludes this paper. operating time and hardware resources [14]. Therefore, the
big data technology has been introduced to the power system.
II. BACKGROUND
The emerging big data technology does not conflict with
A. SMART GRID
classical analysis or pretreatment. In fact, it has already been
applied successfully as a powerful data-driven tool in numer-
Smart grid is a form of electric power network. It is a combi-
ous research areas such as financial systems [15], quantum
nation of all the power utilities resources, both of renewable
systems [16], biological systems [17], and wireless commu-
and non-renewable. As a significant innovation, smart grid
nication networks [18].
provides automatic monitoring, protecting and optimizing
Big data sets and their relationship with several sample
for the operation of the interconnected systems. Not only
applications are depicted in Fig. 1.
covering the traditional central generator in the transmission
network and emerging renewal distributed generator in the
distribution system, it also serves industrial consumers and
home users in aspects of thermostats, electric vehicles and
intelligent appliances [4]. Characterized by the bidirectional
connection of electricity and information sets, the smart grid
has formed an automated, widely distributed delivery power
network. Taking advantages from modern communications,
it can not only deliver real-time data and information but also
implement the instantaneous management balance of demand
and supply [5].
Operations on energy storage, demand response and com-
munication between terminal users and power companies
will make a real large scale data-communicated grid system.
This adds complexity in the conventional grid and promote
offering sustainable progresses to utilities and customers [6].
In order to match its complexity, many technologies are
FIGURE 1. Big data and applications.
exploited in the smart grid domain. Examples include wire-
less networks in telecommunications, sensor networks in
In the aspect of the field measurement showed in Fig. 1,
manufacturing, and big data in computer science. Especially,
the deployment of PMUs has improved the measurements on
big data is attracting more and more attention in the power
instantaneous values of voltages and currents for operators
grid domain because it can help extract valuable information
with accurate timestamps [19]. PMUs can be imported to
from data sources of wide diversity and high volume.
the state estimator to provide direct state measurement, since
they provide measurements in both magnitude and angle
B. BIG DATA TECHNOLOGY aspects [20]. This is especially interrelated when the system
Big data technology is an emerging technology which applies topology such as operations on automated breaker change
to data sets where data size is so large and common data- rapidly. PMU data can also be applied in protective relaying
related technology tools are hard to capture, manage, and operations [21]. The use of PMU data is helpful for more
operate under multi-limits. Common data-related technology accurate and more rapid tripping response in measurement.
use a consistent, single, traditional data management frame- In the aspect of distribution, the management of big data
work, which is only suitable for data with low diversity and is also crucial in order to facilitate applications in the fields
volume [7], [8]. If coming from diverse data sets, data will not of demand response, Distributed Energy Resources (DERs),
be correlated in space and time, and it is hard for those data and Electric Vehicles (EVs) [22]. Improved and secure
to get correlated to a unified and generalized power system bi-direction communication and management of the associ-
model [9]. Therefore, traditional data-related technology will ated big data are required for turning loads into dispatchable
no longer guarantee reliable data management, operation, resources, thus promoting DERs to participate in electricity
etc. In this way, big data technology provides integrated markets [23].

VOLUME 4, 2016 3845


H. Jiang et al.: Energy Big Data: A Survey

1) MAIN TECHNIQUES data processing. For example, a technology called Storm is


Over the past few years, almost all major companies have designed especially for real-time data stream analysis [39].
launched their projects on big data, and brought a set of In order to analyze and process interactively, the data is
tools which are brand new to the power industry. Those generated from interactive environments. Users are able to
companies include IBM, Oracle from IT domain, General have real-time interaction with computers due to the direct
Electric, Siemens from the power grid domain and start-ups connection to it. Apache Drill is a distributed system used
like AutoGrid, Opower, and C3 [24]. In addition, scientists for interactive analysis of the big data [40]. The purpose of
have proposed a variety of techniques including optimization Drill is to have low latency in response to ad-hoc queries.
and data mining to capture, analyze, process and visualize big Therefore, it has the capability to process PetaBytes (PBs) of
data that is of large volume and needed to be handled within data and trillions of records within seconds.
limited time.
III. ENERGY BIG DATA: ARCHITECTURE
In order to tackle communication problems and data pro-
Supported by many kinds of different techniques, big data
cessing problems in the smart grid area, a variety of emerging
analysis plays an important role in smart grid. One solution
scalable and distributed architectures and frameworks have
architecture provides a three-tier framework for better dealing
been proposed. Rogers et al. [25] proposed an authenti-
with data high in volume, velocity and variety. For various
cated control framework for distributed voltage supported
data generated from smart grid, an integrated architecture can
on the smart grid. A layered architecture was presented
also be used for cloud storage and efficient processing.
for an optimized algorithm to follow a chain of command
from the transmission grid to the distribution grid [26]. A. ROLE OF BIG DATA IN THE SMART GRID
In addition, many optimization methods have been exploited The data obtained from PMUs, smart meters, sensors and
to solve quantitative problems in smart grid branches such other IEDs have opened up a plethora of chances, such
as power distribution, sensor topology, etc. Furthermore, as predictive analytics, real-time vulnerability assessment,
Aquino-Lugo et al. [27] provided another analytical frame- theft detection, demand-side-management, economic dis-
work using agent-based technologies to manage data patch, energy trading, etc [41].
processing. Big data can help improve the smart grid management
Data mining extracts valuable information from data sets. to a higher level. For example, by enhancing the acces-
It includes a series of clustering, classification and regres- sibility of a customer’s electricity consumption data, the
sion. A typical example of data mining under the smart demand response will be expanded and energy efficiency
grid scenario is the grouping process of customers. Those will be improved [42]. Similarly, analyzing data obtained
customers are grouped into different classes according to their from PMUs and IEDs will help to improve customer ser-
electricity consumption types and other characteristics. Then, vice, prevent outages, maximize safety and ensure service
the clustering method is applied to generating residential load reliability [43], [44]. Furthermore, electric utilities are using
documents [28] and differentiating pattern changes caused by prediction data analytics for estimating several parameters
seasonal and temporal factors [29]. which are helpful to operate the smart grid in an efficient,
In smart grid, this data mining technology can be classified economical and reliable way. For instance, whether the excess
into two categories: one is Artificial Intelligence (AI) for energy is available from renewable sources can be predicted
modeling risk and uncertainty of obtained data [30], and the through accessing the ability of the smart grid to transmit it.
other is estimation based on recorded data. Utilizing fuzzy At the same time, calculating equipment downtime, accessing
wavelet neural network [31] and estimating with nonparamet- power system failures, and managing unit commitment can
ric methods [32], data mining methods obtain enforcement also be done effectively in high correlation with integrat-
recently. For example, the fuzzy decision tree has been used ing distributed generation. Thus, the grid management state
to classify disturbances of power quality [33]. will be enhanced, the efficiency will be improved and the
robustness of generation and deployment will be ensured.
2) MAIN TOOLS
With the integration of various distributed generating sources,
Apache Hadoop is one of the most well-known and powerful refined forecasting, load planning, and unit commitment are
big data tools written in JAVA [34]. Providing platforms and more convenient to avoid inefficient energy transmitting or
infrastructures for specific big data applications in business dispatching extra generation.
and commerce, it plays an important role on big data related
companies, such as MapR, Cloudera and Hortonworks [35]. B. A THREE-TIER ENERGY BIG DATA
Hadoop usually consists of two major components, one is the ANALYTICS ARCHITECTURE
Hadoop Distributed File System (HDFS) [36] and the other A widely accepted analytical framework of big data is shown
is the Hadoop execution engine: MapReduce [37]. HDFS is in Fig. 2, which has three layers including data access and
often applied for data storage while MapReduce is used in computing, data privacy, domain knowledge and big data
data processing [38]. mining algorithm [45].
As for applications of data streaming, operation in The inner core data mining platform is mainly responsible
the electric power system needs real-time response for for data access and computation process. With the increasing

3846 VOLUME 4, 2016


H. Jiang, et al.: Energy Big Data: A Survey

• Analyze the historic data and estimate the energy


production.
• Analyze the consumer behavior patterns about
electricity consumption to estimate the demand in
advance.
• Keep record of the energy production from different
sources and make decision to switch demands between
the high/low priority.
• Balance the load of the demand/supply chain.
• Do efficiently on the storage/transfer of the generated
power.
The improved smart grid architecture is illustrated in Fig. 3,
which consists of the smart grid, the HDFS and the related
cloud environment. In order to manage the storage and
retrieval of massive data, HDFS is used in this system. The
HDFS gives concentration on distributed storage to nodes
in racks [48]. This architecture also contains a database
including consumer behavior pattern, historic data, details in
power supply and demand. Each time the system estimates
FIGURE 2. Three-tier big data analytics architecture. the demand and calculates the supply, it will refer to the
consumer behavior patterns and historic data. These data are
stored in a cloud-based Cassandra DataBase (Cassandra DB).
This improved smart grid utilizes a prediction algorithm [49]
growth of the data volume, distributed storage of large scale to estimate the demand and supply of electric power. In a
data need to be taken into account while computing. That is distributed environment, the smart grid uses distributed power
to say, data analysis and task processing are divided into mul- resources like solar, wind, nuclear sources, and is applied
tiple sub tasks and executed on a large number of computing in many areas, such as industrial production and social
nodes through a parallel program. infrastructure.
The middle layer of the structure plays an important role
to connect inner layer and outer one. The data mining tech-
nology in the inner layer provides a platform for data-related
work in middle layer such as the information sharing, privacy
protection, and knowledge acquisition from areas and appli-
cations, etc. In the whole process, the information sharing is
not only the guarantee of each phase, but also the goal of
processing and analyzing with the big data in smart grid.
In the outer layer of the architecture, preprocessing is
necessary for the heterogeneous, uncertain, incomplete, and
multi-source data through data fusion technology. After pre-
processing, complex and dynamic data will be excavated,
and then, pervasive smart grid global knowledge can be
obtained through local learning and model fusion. Finally,
model and its parameters need adjustment according to the
feedback [46].

C. AN INTEGRATED BIG DATA ARCHITECTURE


IN THE SMART GRID
Based on big data analytics and cloud technology, an inte-
grated architecture can be used in the smart grid in many
ways, such as optimizing power transmission, controlling
power consumption, keeping the balance between power
demand and supply, etc [47]. This architecture takes advan-
tages of the three technologies including big data analyt-
ics, smart grid and cloud computing, which composes an
enhanced version of smart grid. The improved smart grid has
the following functionalities: FIGURE 3. An improved smart grid architecture.

VOLUME 4, 2016 3847


H. Jiang et al.: Energy Big Data: A Survey

IV. ENERGY BIG DATA: KEY TECHNOLOGIES the information and ensure the confidentiality. The encryp-
With the remarkable development of big data analysis in tion protocol has three following components:
energy domain recently, some significant technologies have • Key generation: When given a security coefficient, this
emerged to be applied in various fields. The key techniques of algorithm generates public keys and global parameters.
big data in energy system can be divided into four categories: Set n = q1 · q2 , where q1 , q2 are primes. Select N ∈ Z∗n2
data acquisition and storing, data correlation analysis, crowd- such that N has an order which is a multiple of n and
sourced data control and data visualization. Among them the Zn2 = {0, 1, 2, · · · , n, · · · , n2 −1}, where Z∗ represents
data acquisition and storing is the most crucial component. the adjoint matrix of Z. Set λ(n) = lcm(q1 − 1, q2 − 1),
Crowd-sourced data control is applied especially to solve the where lcm means the least common multiple. Let i rep-
data problems which are hard to be processed. To obtain resents the i-th intended receiver that a message is sent
better understanding of recent research trend, we analyze the for. Then the public key of i-th receiver can be expressed
data acquisition and storing technologies in detail and take a as: PK [i] = (n, N ), the secret key can be expressed as:
brief overview on big data technologies. SK [i] = (λ(n)).
• Encryption method: Select m ∈ Zn which represents
A. DATA ACQUISITION AND STORING a message. Choose a random number r ∈ Z∗n . Then
Data acquisition and storing is the initial problem in big data. the ciphertext c can be given by: c = E(m) = N m r n
The big data technologies in acquisition and storing stage mod n2 where r n is an enhancing factor to make the
gather data from various information in energy system. The homomorphic computation indeterministic.
collected data are of different sources, different formats and • Decryption method: Decryption from c to m is accord-
different features that is stored in data repositories. ing to:
The acquisition and storing technology of big data belongs L(cλ(n) mod n2 )
to data management which involves data fusion, data integra- m = D(c) = mod n,
L(N λ(n) mod n2 )
tion, data management, and data transforming which is usu-
ally called Extract Transform Load (ETL) technology [50]. where the L-function takes input from the data set
Fig. 4 shows the data acquisition and storing flow using U = {u|u = 1 mod nandu < n2 } and the output can be
ETL technology. calculated in: L(u) = (u − 1)/n.
Supposing c1 = E(m1 ) and c2 = E(m2 ) are two ciphertexts
for m1 , m2 ∈ Zn respectively, then the sum of the plaintexts
S can be obtained from the ciphertexts: S = D(c1 · c2
mod n2 ) = (m1 + m2 ) mod n.
Except for the additive homomorphism model above, the
same information can be encrypted into different cipher-
texts by various homomorphism encryption techniques.
Rivest et al. [55] used a multiplicative homomorphism
encryption method to keep privacy and authority in an elec-
tronic mail system. Gentry [56] proposed fully homomorphic
scheme which can support complicated functions.

2) SECURE AGGREGATION BY SMART METERS


In power network, data can be aggregated through Home
FIGURE 4. Data acquisition and storing. Area Network (HAN), Building Area Network (BAN), and
Neighboring Area Network (NAN) in the way that customers’
Data acquisition involves data access and collection. Since privacy is protected [57]. For each HAN, there is a gateway
data have private information, its confidentiality and secu- smart meter called han which collects and sends information
rity should be considered during accessing and transmitting. to the BAN. The gateway smart meter ban in BAN aggregates
We investigate several homomorphic encryption and decryp- these information and sends them to the gateway smart meter
tion methods for data access and storing. Further, secure data nan in the neighborhood area. Finally, nan in NAN reports all
aggregation and routing are also studied in this part. the information to the data repository which is monitored by
the Remote Terminal Unit (RTU).
1) HOMOMORPHIC ENCRYPTION SCHEME Ruj and Nayak [58] proposed a new decentralized security
The transmitted information are always private, so encryp- framework to keep the whole aggregation process safe. The
tion technology is necessary to prevent unpredictable i-th RTU Ri in data repository is securely given a public key
attacks [51], [52]. Paillier [53] used an homomorphic PK [i] = (n, N ) and a secret one SK [i] = (λ(n)) as in the
probabilistic encryption scheme to deal with the compos- key generation in additive homomorphism model. Each smart
ite residuosity class problem. Boneh et al. [54] proposed meter knows the public key PK [i] of its nearest RTU data
another homomorphic encryption scheme which can encrypt repository Ri in the network. The j-th gateway smart meter

3848 VOLUME 4, 2016


H. Jiang, et al.: Energy Big Data: A Survey

hanj in HAN sends a data packet that consists of two fields: a mapping function ψ of which rows is a set of attributes:
the attribute field a and the power consumption field pj cor- ψ : {1, 2, · · · , k} → W, where W means a attributes set.
responding with different attributes. The power consumption Obviously, ψ is a permutation. Then the RTU Ri runs the
field can be encrypted with the data repository public key. ABE encryption algorithm to calculate the ciphertext C by
Then the packet can be described as: PPK [hanj ] = a||cj = C = ABE.Encrypt(m, R, ψ). Ciphertext C is thus stored in
a||E(pj ) = a||(N mj rjn mod n2 ), where PPK [hanj ] is short the data repository. If a user u has a valid set of attributes, his
for the packet sent by smart meter gateway hanj and rj ∈ Z∗n request to access to the ciphertext C from the data repository
is randomly chosen by the smart meter. will be accepted and the C will be decrypted into the message
The attribute field will be checked in BAN. When the m by m = ABE.Decrypt(C, {SK [i, u]}), where SK [i, u] rep-
packet is sent to the BAN, the gateway banl will aggregate all resents the secret key of KDC corresponding to the attribute i
the packets which have the same set of attributes into a new given to the user u.
packet and process the aggregated power consumption.
Q This
aggregated result can be given by: cbanl = hanj ∈HAN chanj . 4) DATA STORING AND ROUTING
The new packet can be described as: PPK [banl ] = a||cbanl . The large amount of data generated by the widespread
Packets aggregated by a set of gateway smart meters sensor networks in the smart grid requires scalability,
{ban1 , ban2 , · · · , banj } then will be sent to the NAN. The self-organization, routing algorithms and efficient data dis-
NAN gateway nank aggregates the packets and operations the semination. Generally, the content of the data should be
information at the same way of ban. Q The aggregated result more important than the identity of the sensor used to store
can be figured out by: cnank = banl ∈BAN cbanl . The new data [63]. Therefore, comparing with the common routing
packet PPK [nank ] = a||cnank will be sent to the nearest data technology, the data-centric storing and routing technologies
repository. The RTU Ri in the data repository will receive have been developed more rapidly in recent decades.
packets and decrypt the encrypted results by using its secret In the data storing and routing technology, the data is
key SK [i]. Thus the RTUs can aggregate information without defined and routed referring to their names instead of the
knowing what exact data was send by the smart meters at storage node’s address. The data can be queried for users after
the HANs. The whole aggregation progress is similar to a defining certain conditions and adopting certain name-based
spanning tree, where the HAN is at the bottom and data routing algorithms, like the Data-Centric Storage (DCS)
repository is at the top. and the Greedy Perimeter Stateless Routing (GPSR) algo-
rithm [64]. In these algorithms, each data object belongs to an
3) ACCESS CONTROL SCHEME associated key and each working node stores a group of keys.
All the sensitive information must be sent separately to Any node in the system is allowed to locate the storage node
the specific terminals, thus access should be restricted and for any key and every node can input and output files based
controlled. Sahai and Waters [59] proposed a cryptographic on its keys, via a hash-table-like interface. The following
protocol called Attribute Based Encryption (ABE) which is example shows the data storage algorithm in information
mainly to distribute attributes to senders and receivers so that encryption and decryption.
only the receiver with matching attributes set has the access to • Encrypt2DS: Each terminal user can encrypt informa-
the data. However, this protocol is restricted to only support tion I into a ciphertext CDS with the data storage encryp-
threshold access structures. Therefore, many researches have tion algorithm, which encrypts with params and the
been launched to improve this protocol. Goyal et al. [60] data storage identity DS in the regional cloud nodes.
proposed a novel ABE scheme called Key-Policy Therefore, this encryption process is denoted as:
based (KP-ABE) scheme which can handle any mono-
tonic access structure. Bethencourt et al. [61] proposed CDS ← Encrypt2DS(params, DS, I ).
another type of protocol which is known as Ciphertext-Policy
ABE (CP-ABE). In these schemes, ciphertexts are encrypted • Decrypt2DS: Each regional cloud node can decrypt an
under a given access structure with a set of attributes. Only accepted ciphertext CDS into the information I with the
a receiver has a matching attributes set can it decrypt the data storage decryption algorithm, which decrypts with
information. params and the private key KDS relevant to the data
Lewko and Waters [62] proposed an access control scheme storage identity DS. Therefore, this decryption process
and gave limited access to authorized data users. It is mainly is denoted as:
based on a ABE cryptographic technique where RTUs have I ← Encrypt2DS(params, KDS , CDS ).
attributes and cryptographic keys distributed by multiple Key
Distribution Centers (KDCs). When an RTU Ri wants to
store a message m, Ri defines an access structure S firstly B. DATA CORRELATION ANALYSIS
which helps it to select the authorized users to access the The big data technologies in data analysis involve correlation
message m. Ri then creates a k × l-matrix R (k represents analysis, data mining and machine learning. Through data
the attributes’ number in the access structure and l is the analysis techniques, valuable information can be extracted
leaves’ number in the access spanning tree). Ri also defines from large amount of big data in a power system.

VOLUME 4, 2016 3849


H. Jiang et al.: Energy Big Data: A Survey

Correlation is a well-known statistical and mathematical where Hi means the ith hit node and D refers to the numbers
method for the compatibility analysis of large data sets. of energy environment location phases. When Ei,t exceeds the
Meier et al. [65] successfully used Pearson Product-Moment spread factor, the network structure will be adapted to fit this
correlation to determine how well data is linearly correlated. change, through growing new energy nodes or distributing
Given two independent input data sets of X , Y with the length the error to neighboring nodes. Interim summarization uses
of N , and X , Y being either the phase data values of two PMU dimensional model representing the power environment in
sites or the momentary magnitudes, the Pearson correlation order to generate online digests of flow data. The digests
submits to a correlation coefficient C (C ∈ [−1, 1]), which facilitates data analysis in various directions, such as trend
is based on the following equation: analysis, granularity exploration and ad-hoc querying.
PN PN
i=1 X i=1 Y
PN
i=1 (XY ) −
C. CROWD-SOURCED DATA CONTROL
C=q PN
N
. Although various technologies can be applied for different
X )2 ( N 2
P
2) − ( i=1 Y )
PN
)×( N 2
i=1
P
( i=1 (X N i=1 (Y ) − N ) data processing periods, some vital data management and
analysis tasks in the power system cannot be addressed
Two application-specific improvements and modifications
completely by data processing technologies. However, these
have been made to this mathematical formula. Firstly,
hard tasks such as entity resolution, sentiment analysis, and
because the algorithm was made incremental, each i data
image recognition, can be dealt with human cognitive abil-
point (xi , yi ) could be gained from the Phasor Data Con-
ity [69]. Generally, utilizing the capabilities of the crowd
centrator (PDC) engine and incorporated into its correlation
is an effective method to handle computer-hard tasks in the
coefficient at once. Thus, there is no need to directly calculate
power system. Therefore, crowd-sourced data management
each average, summation, and standard deviation at each time
has aroused increasing attention in energy big data research,
step repeatedly. This helps reduce the operation time. Sec-
such as data integration [70], knowledge construction [71],
ondly, the correlation algorithm is set to maintain correlation
data cleaning [72], etc. In the energy domain, widespread
information through varying windows of time. The stream
sensor nodes are the same as workers in crowd-sourced data
data structure actually maintains multiple separate points to
control. As workers being task operators, sensor nodes are
terminal positions of each defined sliding window. However,
always assigned to deal with diverse data collected from
separate lists for each sliding window are not created. Rather,
power grid.
points from a single list are controlled to minimize memory
Problems in crowd-sourced data control can be divided
usage and copies of data which are already managed.
into two groups: quality control and cost control. The quality
Correlation analysis techniques can figure out correlation
control is the most important because crowd-sourcing often
or causal relationship and have been widely applied in power
generated relatively low-quality results or even noise. Thus,
system for making investment decisions and forecasting the
we will analyze quality control in detail.
electric power load. As mentioned above, data mining ranges
in the most popular techniques of data analysis, which is
1) QUALITY CONTROL
mainly applied for prediction and presentation in power grid.
Machine learning is also widely applied in energy system. The first step of the quality control is to characterize workers’
Pandy et al. [66] used correlation analysis of big data to quality, including modeling workers and defining parameters.
support machine learning through best fit linear regression of The current researches have proposed several methods to
large data sets. In addition, correlation analysis also applies model a worker’s quality. We summarize some as follows.
• Worker Probability [73]–[75]
for monitoring power equipment conditions, assessing tran-
In worker probability, each worker’s quality is taken as a
sient stability, etc [67].
single parameter p ∈ [0, 1], which indicates the accuracy rate
De Silva et al. [68] proposed an Incremental Summariza-
of the worker’s answer:
tion and Pattern Characterization (ISPC) framework using
data mining technology to address problems in energy con- p = Prob(the worker 0 s answer = the task 0 s true answer).
sumption analysis. This framework has two basic functions:
For example, if a worker’s answer has a probability of
incremental learning and interim summarization. Incremental
80 percent to be correct, the workers quality would be char-
learning can extend dynamic self-organization of senor nodes
acterized as p = 0.8. Since the accuracy parameter p is
in power grid with the dynamic topology preservation tech-
defined in the range of [0, 1], some studies extend it with two
nology. Using a structure-adapting feature map, this technol-
approaches:
ogy can ensure natural groupings in the data being exposed.
i) Extending to a wider range of (−∞, +∞), which
The structure adaption is affected by a total quantization error
enables a higher probability to the correct answer [76].
Ei,t , predefined values x and w, a spread factor which controls
ii) Extending another parameter as a confidence interval to
growth of the dynamic feature map:
describe how confident the calculated p is. For example,
D
XX if the confidence interval is ranged among [0.3, 0.8], it
Ei,t = (xj (t) − wj (t))2 , may narrow to [0.55, 0.6] after a worker answering more
Hi j=1 tasks [77].

3850 VOLUME 4, 2016


H. Jiang, et al.: Energy Big Data: A Survey

• Confusion Matrix [78] D. DATA VISUALIZATION


Confusion matrix is often used to represent a worker’s abil- The data visualization of big data has three main components:
ity to process single-choice tasks with n potential answers. historical information visualization, geographical visualiza-
A confusion matrix is a n × n matrix: tion, and 3-dimensional visualization, which can intuitively

P1,1 P1,2 · · · P1,n
 and accurately express information of electric power system.
P2,1 P2,2 · · · P2,n  The visualization technologies are used for monitoring real-
P= . .. .. .. , time power system status, which heighten the automation
 
 .. . . .  degree of power system. If combined with complex system
Pn,1 Pn,2 ··· Pn,n network theories, it also can identify potential relationships
where the i-th (i ∈ [1, n]) row represents the distribution of and patterns in smart grid.
the worker’s answers for the i-th task’s answer and the j-th Zhang et al. [82] proposed a 5ws parallel coordinate visu-
(j ∈ [1, n]) column represents the correct answer probability alization model for big data density analytics. The 5ws is
of the j-th worker’s answer. Therefore, composed of: where did the data generate, when did the
data generate, why did the data generate, what was the data
pi,j = Prob(the worker 0 s answer is j content, how was the data transferred and who received the
| the task 0 s true answer is i). data. Therefore, each data can be denoted into a function:
f (p, t, r, x, y, z), where p represents where data came from,
Some techniques are studied in order to compute the t represents the time stamp for each data incidence, r repre-
parameters in worker models as above, and we just introduce sents the reason data generated, x represents the data content
two typical examples: such as ‘‘attack’’, ‘‘like’’ or ‘‘dislike’’, y represents the way
• Qualification Test [78] data transferred and z represents the data receiver.
The qualification test sets a series of tasks with known Assuming all n data items are in the time period
true answers. Only when the workers pass the qualification T {t1 , t2 , t3 , · · · , tn }, a function set F can be set and contain
test and achieve good marks can they answer the real answer all data items within a certain time interval:
tasks. Referred to the worker’s test answers, the worker’s
quality can be deduced. For instance, if a worker is modeled F = {f1 , f2 , f3 , · · · , fn }.
by Worker Probability and he answers correctly 8 out of 10
test tasks, his quality is p = 0.8. For a specific data item, supposing p = α, r = β,
• Test-Injected Method [79] x = γ , y = δ, z = , the data pattern can be denoted as
The gold-injected method integrates test tasks and real f (p(α) , t, r(β) , x(γ ) , y(δ) , z() ). Therefore, the subset f(α,β,γ ,δ,)
tasks. For instance, if 10 percent of tasks are set as test tasks, can be mapped in the parallel coordinates using 5 polylines,
each time when a worker is assigned with 10 tasks, he will which represent the data in 5ws dimensions at a particular
take 1 test task and 9 real tasks. Based on the accuracy of time. Then, a 5Ws parallel coordinate visualization will be
worker’s answers on test tasks, his quality can be calculated. gained.
However, this method is different from the qualification test All key technologies and related work have been listed
because worker doesn’t know that some tasks are the tests. in Table 1.
When the workers are in the low quality, worker elimina-
tion is essential to improve the whole quality. There are many V. ENERGY BIG DATA: SECURITY
approaches to detect low-quality workers. A simple one is Although the integration big data with power grid realizes
still using qualification test or test-injected method. It uses high degree of network connectivity, it also brings complex
test tasks to compute the worker’s quality and workers with security vulnerabilities into grid [83]. From the perspective
low quality (e.g. p ≤ 0.5) are not allowed to process the real of big data, we discuss the main security vulnerabilities and
tasks. Therefore, the overall quality can keep ranging in a high requirements in privacy, integrity, authentication and third-
level. party protection. A summary of all four security aspects and
related work have been provided in Fig. 5.
2) COST CONTROL
The crowd is not free. When there are a lager number of tasks A. PRIVACY
to do, the crowd-sourcing can be quite expensive. Therefore, The customer power consumption data collected from smart
cost control is challenging for crowd-sourced data control. meters is a certain kind of privacy that showed be protected.
In recent researches, several effective techniques have been From the detailed consumption information, an insight into
proposed for cost control. One is pruning, which first applies a customer’s behavior can be obtained. Smart grid stores
computer algorithms to preprocess all the tasks and remove customer privacy consequences and acts like an information-
unnecessary tasks, guiding workers only to answer the essen- rich channel which is easy to exploit users’ habits and offer
tial tasks [80]. Another technique is task selection, which can intelligence service. As the history shows in [84], a series
improve quality and reduce cost by prioritizing all the tasks of financial or political incentives boost the development of
and selecting the most beneficial task to do first [81]. information mining and behavior learning technology.

VOLUME 4, 2016 3851


H. Jiang et al.: Energy Big Data: A Survey

TABLE 1. Key technologies in energy big data.

Terminal user’s privacy is a well-known security problem the privacy loss via the mutual information, authors would
about big data applied in power system. Many approaches perturb data between two data sequences.
have been proposed to solve the privacy problems during Since continuous amplitude data transmission is lossy
data access. Li et al. [85] presented a distributed incremental through finite capacity network links, a tested sequence of
data aggregation approach by using neighborhood gateways. k-load measurements {xk } is usually quantized and com-
Kalogridis et al. [86] introduced a method to hide information pressed before the transmission. Therefore, data perturbation
contained in the energy consumption data by using battery is needed to guarantee data privacy in some way. However,
storage. Rastogi and Nath [87] proposed a differential private such a perturbation also needs encoding and decoding to
aggregation algorithm for time-series data which offers good maintain a desired level of fidelity.
practical utility without any trusted server. • Encoding: Assuming that a smart meter can collect
Although these solutions manage to support the utility and n(n  1) measurements in a time interval before
privacy in various ways, they lack a robust theoretical basis communication, and n is large enough to capture the
from both utility and privacy. Rajagopalan et al. [88] pro- memory of data sources. Then the encoding function
posed a theoretical framework which integrates most existing performs a mapping of resulting source sequence Xn =
solutions of the trade-off and provides a universal model in {X1 , X2 , · · · , Xn }, where Xk ∈ R, k = 1, 2, · · · , n, an
smart meters. Such a theoretical framework is suitable for index Wi ∈ Wn will be given by:
many current problems of the privacy-utility tradeoff using FE : Xn → Wn = {1, 2, · · · , Wn },
a technology-independent method. Assume that power con-
sumption data is distributed normally, a data sequence {xk } where each index is a quantized sequence.
is defined of random variables xk ∈ x, −∞ < k < ∞, • Decoding: The decoder at the data collector computes
which generated by stationary continuous valued sources an output sequence b X2 , b
Xn = {b X2 , · · · , b Xk ∈ R, for
Xn }, b
with memory got via the autocorrelation function: Cxx (m) = all b
Xk , using the decoding function:
E[xk xk+m ], m = 0, ±1, ±2, · · · . Then, in order to estimate FD : Wn → b Xn .

3852 VOLUME 4, 2016


H. Jiang, et al.: Energy Big Data: A Survey

FIGURE 5. Security branches and related solutions.

The selection of the encoder should promote the input and are also its relevant work. Security issues may be caused by
output sequences to achieve a desired utility which is given the violation of integrity.
by an average distortion constraint: Liu et al. [90] pointed out that the risk of integrity-targeting
n attacks in smart grid is indeed real. Compared to availability-
1X targeting attacks, integrity-targeting attacks are less brute-
Dn = xk )2 ],
E[(Xk − b
n force but more sophisticated. Either customers’ information
k=1
or network operation information can be the objective of the
and a constraint on the privacy leakage about the desired integrity attacks. That is to say, modifying the original infor-
sequence {Yk } from the revealed one b
Xk is quantized via the mation deliberately and then corrupting critical data exchange
leakage function: in smart grid are the attempt of such attacks.
1 Ruj and Pal [91] proposed a targeted attack model in power
Ln = I(Y k ; b
X k ), network where attackers attack nodes by removing them with
n
probability proportional to the nodes’ degree. Therefore, a
where function E[·] computes the expectation and I(·) rep- node with higher degree will be removed with higher prob-
resents the mutual information. Dn and Ln are functions of ability. In the power network, once a node with higher degree
the number of measurements n. When stationary sources con- has been removed, the integrity of energy data collected by
verge to limiting values, the corresponding limiting values for nodes will be affected more. Note øk as the probability that a
utility can be given by D = limn→∞ Dn and L = limn→∞ Ln node i of k degree hasn’t been removed:
for privacy respectively [89].
deg(i) Ak −αA
B. INTEGRITY øk = 1 − P =1− ,
v∈VA deg(v) 2mA
Preventing unauthorized persons or systems modifying infor-
mation is all about integrity. In the smart grid domain the where VA is the nodes set in power network NA , αA is the
integrity focuses on information such as sensor values, prod- power-law coefficient, mA represents the number of edges
uct recipes and control commands. Avoiding modification in NA , deg(i) = Ak −αA represents the degree of node i and
through message delay, message reply and message injection αA = 0 means the random removal of nodes.

VOLUME 4, 2016 3853


H. Jiang et al.: Energy Big Data: A Survey

Several studies have been conducted to address integrity procedure of the envisioned lightweight message authentica-
attack. Xie et al. [92] firstly investigated the impact of tion scheme:
integrity attacks on energy market through virtual bidding. • HAN gateway hani chooses a number a ∈ Z∗ p ) ran-
They showed the process of an attack constructing a prof- domly and computes ga , then sends ga with an encrypted
itable attacking strategy without being detected. This result request packets to BAN gateway banj :
is helpful to examine the potential loss caused by such attack.
Kosut et al. [93] evaluated the proposed data attacks by hani → banj : {ga }{encr}PKbanj .
generated market revenues and the work was further studied • Gateway banj in BAN decrypts the packets and sends a
by Jia et al. [94] in maximizing the revenues. Yuan et al. [95] encrypted response including gb (b ∈ Z∗p ):
found that the data integrity attacks can result in increas-
ing system operating cost due to the inordinate generation banj → hani : {ga k gb }{encr}PKhani .
dispatch and energy routing. In order to control the real-
time Locational Marginal Prices (LMP) directly under data • After receiving banj ’s response packets, hani recovers ga
attacks, Tan et al. [96] employed a control theory based with its Secret Key (SK). If the result is correct, it means
approaches to analyze the attack effect on pricing stability. that the banj is authenticated by hani . Then, with number
Esmalifalak et al. [97] adopted a novel approach to char- a and gb , hani can calculate the shared session key Ki =
acterize the relationship between attackers and defenders. H ((gb )a ), where H : {0, 1}∗ → Z∗p is a cryptographic
Tan et al. [98] proposed a novel systematic online attack hash function. Then hani sends gb to banj in plaintext.
construction strategy which doesn’t need any power grid • Once banj has received the correct gb , banj will authen-
topology or parameter information. They presented a corre- ticate hani and also compute the session key using the
sponding online defense strategy to identify the malicious same way: Ki = H ((ga )b ).
attack in smart grid. Further, Jia et al. [99] proposed an analy- Because Ki is shared only between hani and banj , banj
sis framework to quantify the data quality impact on real-time can verify the authenticity of sender and the integrity of
LMP. All these related researches are based on the assumption message Mi . Thus, it can provide the NAN gateways with the
that attackers fully know about the targeted power systems authenticated messages.
including the grid topology architecture, branch parameters
and other related information. D. THIRD-PARTY PROTECTION
Damages via the communication systems, except safety haz-
C. AUTHENTICATION ards of the controlled plant itself, can be averted by the
In general, authentication means to verify participator’s iden- protection of a third-party, such as a cloud service provider.
tity and map this identity to the existing authentication table The power system can be used for launching various attacks
in power network. A user gains access to a smart grid commu- on the smart grid or the third party after being attacked. This
nication system with a valid account. Most security solutions will do damage to a smart grid system owner’s reputation, or
use authentication as a basis to distinguish legitimate and even reach to legal liability for damaging the third party.
illegitimate identity [100]. Since the third party isn’t listed in authentication table in
Hamlyn et al. [101] proposed a utility network authen- power network, it may have no access permission. Trusted
tication and security management in smart grid operations. Third Party (TTP) has been a promising approach for Person-
A new security architecture was designed for this manage- ally Identifiable Information (PII) in power system security.
ment to deal with actions and commands requests in multiple Ranchal et al. [104] used a predicate encryption scheme and
security domains. However, their research only focused on multi-party computing to process identity data on untrusted
keeping electric power systems and electric circuits secure hosts. Shamir [105] proposed a threshold secret sharing
in host areas. Furthermore, Fouda et al. [102] proposed on private information. First a private data set D is divided
a light-weight message authentication mechanism which is into n shares: {d1 , d2 , · · · , dn }. Then a threshold t is chosen
based on the Diffie-Hellman key establishment protocol. to recover D. This operation may require t or more shares of
This mechanism allows distributed smart meters to make arbitrary Di . Ben-Or et al. [106] defined a multi-party com-
mutual authentication with low latency and has similari- puting protocol which uses secret input from all the parties.
ties with secure aggregation technology above. Assume that This protocol involves n mutual parties which only calculate
HAN gateway i and BAN gateway j both have their pri- partial function outputs. In this protocol, one player from
vate and secret key pairs which are respectively denoted a party has been selected as the dealer(DLR) and provided
as PKhani , SKhani , PKbani , SKbani . The Diffie-Hellman key partial outputs to figure out the full computation results.
establishment protocol will be adopted at the initial hand- Denote a linear function f of n degree which is known to n
shake between HAN and BAN gateways [103]. Define parties, an arbitrary threshold t, the i-th party Pi , and xi as
G = hgi as a group of large prime order p under the Com- the input of Pi for f . By using a set of input {x1 , x2 , · · · , xn },
putational Diffie-Hellman (CDH) assumption. Otherwise, if n parties will calculate the output through f and send the result
given ga , gb , for unknown a, b ∈ Z∗p , it is difficult to calculate to the DLR. Each Pi generates a polynomial hi of t degree
whether gab ∈ G. The following steps are shown the such that hi (0) = xi . When Pi sends one share to Pj in the

3854 VOLUME 4, 2016


H. Jiang, et al.: Energy Big Data: A Survey

TABLE 2. Key applications in energy big data.

left parties, Pi will compute the portion of function f using measurement and management of generated data from elec-
this sent share. tric power become increasingly difficult and complex. Since
the big data technology helps to make better prediction,
VI. ENERGY BIG DATA: APPLICATIONS
management and processing complex big data in the energy
We take a brief overview on three application directions of
domain, it has become increasingly popular and widely
big data: renewable energy, demand response and EVs. Each
applied in renewable energy companies once emerging.
direction contains numerous technical fields, as shown in
Many researches have launched on various branches
TABLE 2. Although we present these three applications sep-
of renewable energy. Paro and Fadigas [107] proposed a
arately, there are no clear boundaries between them. Demand
methodology to measure and calculate the overall biomass
response should be considered in both renewable energy and
energy efficiency. This methodology can be applied in smart
EVs while EVs is also a concrete implementation of renew-
programs about regeneration plants and generator groups
able energy.
connected to a grid which will improve the efficiency contin-
A. RENEWABLE ENERGY uously. MacGillivray et al. [108] concentrated their attention
The new and renewable energy are integrated in the elec- on marine energy. The marine energy have attracted consid-
tric power generation, which is different from the tradi- erable interest but its commercial prospects mainly depend
tional power generation mode. This difference causes the on cost reduction and substantial learning. Therefore, the

VOLUME 4, 2016 3855


H. Jiang et al.: Energy Big Data: A Survey

authors presented an explicit solution on the key uncertainties Demand response is usually based on the classification
involved in marine energy learning rates analyses including analysis [114]. Users can be classified into several types
wave and tidal stream. They provided a simple learning according to climatic features, geographic conditions and
model and describe a range of learning investments which social stratum. Targeting ar different class of users, dif-
made marine energy technologies become cost-competitive. ferent daily load curves of electrical equipments can be
Wool et al. [109] looked at the tribological design of three drawn. These curves indicates the electrical characteris-
green marine energy systems: tidal, offshore wind, and wave tics, including time intervals, effective factors of electricity
machines. They pointed out that most marine renewable consumption, etc.
energy conversion systems need tribological components to Based on the classification analysis, the total demand
promote rotational motion of wind or tidal streams to generate response from a region or a kind of users can be available
electricity. Therefore, they studied and highlighted current through polymerization. Meanwhile, the analysis results will
research thrusts to address present issues related to the tri- work in designing incentive demand response mechanism.
bology of marine energy conversion technologies. As for Liu et al. [115] provided a power demand model which
another important renewable energy sources, wind energy is derived from workload models and cooling demands of
has also attracted wide attention. For instance, in order to the data center. Specifically, they calculated the total power
improve the efficiency of electric power generation, the demand by: d(t) = dIT (t) + c(dIT (t)), where dIT (t) repre-
Vestas Wind Technology Group of Denmark optimized the sents the total Information Technology (IT) workload demand
wind turbine geographical placement by using big data tech- at time t which is the energy necessary to serve demand,
nologies to analyze the data information including weather and function c() represents cooling power demand which is
reports, tidal conditions, satellite images and geographical associated with IT demand dIT . Note that PUE(t) means the
information [110]. Besides, they also updated the under- Power Usage Effectiveness (PUE) at time t, thus, c(dIT (t))
lying architecture for better data collection and used big can be calculated by c(dIT (t)) = (PUE(t) − 1) ∗ dIT (t).
data processing technologies to meet their high-performance Goyena and Acciona [116] proposed an energetic balance
computing needs. Furthermore, Kaldellis [111] proposed an algorithm to warranty the electric demand in a large energy
optimum autonomous wind power system sizing for remote storage system. This system analyzes the measured data
consumers by using long-term wind speed data. Aiming to constantly and commits for high accuracy.In this algo-
improve the life quality level of remote consumers, they used rithm, electric demand situations of overproduction and
the proposed autonomous wind power systems to address underproduction are distinguished. Compared with the
the electricity demand requirements, especially in high wind- none-renewable production, renewable production has been
potential locations. considered as the priority in order to meet the energy demand.
All researches above in different application areas have a According to the results of this balance algorithm, data-
common purpose: cost minimization. In addition, owing to storage level will be increased, decreased or limited.
the over-exploitation of fossil fuels, energy becomes precious
and energy consumption problem has arisen wide attention. C. ELECTRIC VEHICLES
Kung and Wang [112] proposed a recommender system for Electric Vehicles becomes a promising alternative transporta-
the best combination of renewable energy resources with tion method which is related to the smart grid domain. With
cost-benefit analysis, which include analytical module, cloud the increasing number of EVs, many problems in various
data base, and user interface. This study used Markov Chain areas of EVs such as performance evaluation, driving range
to investigate the influences of decision-making related to and battery capacity become great concerns by researchers.
renewable energy and electricity demand in random time. Besides, EV is one of the typical application examples of
Since the historical electricity data is recorded in continuous big data. Therefore, most of the issues above can be handled
time series, Continuous Markov Chain can be applied to with big data technologies. For instance, Wu et al. [117]
analyze energy big data in order to help power enterprises launched the EV grid integration study based on driving
make optimal investment of renewable energy and evaluate patterns analysis. Recently, most researches are focused on
optimal energy configuration. analyzing the energy efficiency, addressing routing prob-
lems by recording driving data, estimating the driving range.
B. ENERGY DEMAND RESPONSE Su and Chow [118] proposed an Estimation of Distribution
The demand of electric power consumers has been expanding Algorithm (EDA) based on big data theories to allocate elec-
along with the increasing use of machines and emerging tric energy intelligently to EVs. The simultaneous connection
types of appliances such as EVs and smart home furniture. of a large number of EVs to the power grid may have a
Thus, the concern on the stability and reliability of electricity influence on quality and stability of the overall power system.
supply has also been aroused. However, since the traditional This proposed algorithm was devoted to optimally manage
power grid lacks real-time response between demand and a large number of EVs charging at a municipal parking lot.
supply, it cannot meet these demands due to its inflexible Midlam-Mohler et al. [119] focused on the the analysis of
distribution [113]. Therefore, demand response is expected energy efficiency via recording driving data. They developed
to be an aspect for the future smart grid. a data collection system to support the study which allows

3856 VOLUME 4, 2016


H. Jiang, et al.: Energy Big Data: A Survey

data acquired not only from disparate data sources that are uncertain factors of the data, which are common in the big
available on most modern vehicles but from supplemental data applications [15], [126]. These techniques often assume
instrumentation as well. a distribution of the data values based on a probabilistic
As for the issues on EVs charging and driving range, method, and then generate the probabilistic results of data
Rahimi-Eichi et al. [120] proposed another range estima- mining. For example, the K-means clustering algorithm was
tion framework based on big data technology which can optimized to the UK-means clustering algorithm through
collect energy data with various structures from various specifying a data object from an uncertain region with
resources and incorporate them via the range estimation an uncertain probabilistic distribution method [15]. In the
algorithm. They also used historical and real-time data for UK-means clustering algorithm, the precise error and
simulation and thus calculated the remaining driving range. the cluster number of a data object are less important.
Zhang and Grijalva [121] proposed a data-driven queuing Tsang et al. [126] developed a decision tree classifier used for
model for EV charging demand through performing big uncertain data and proposed the related improved algorithms.
data analysis on smart meter measurements. In addition, Extension of classical decision tree in building algorithms
Zhang et al. [122] focused on the charging-transaction-record was proposed in to handle data tuples without certain values.
EV data for better charging service, by detecting, analyz-
2) IMPRECISE DATA QUERYING
ing, and correcting the abnormal data with error types and
distributed locations. Soares et al. [123] presented a novel In the traditional data querying techniques used in smart grid,
methodology based on Monte Carlo Simulation (MCS) to the precise results are required in the DataBase Management
estimate the possible EVs states referring to their locations System (DBMS) according to the query conditions. In recent
and periods of connection in the power grid. Futhermore, years, a lot of querying and indexing techniques began to con-
Lee and Wu [124] provided a EV-battery big data modeling sider about the uncertainty of the data. Instead of giving the
method which has been used to improve the driving range precise response, these techniques estimate data uncertainty
estimation of EVs, using data cloud analysis and processing and provide probabilistic guarantee. For instance, the uncer-
technologies. The authors first presented a simple approach tainty of the data has been modeled as the stochastic value
to project life cycles of battery packs based on collected test within certain limits, and the data queries have been classified
data. The operating voltage of a battery pack E(I ,SOC,T ) can into different classes. Different algorithms are then developed
be evaluated by: to compute the probabilistic answers. Chen and Cheng [127]
studied location data with uncertainty and classified the
E(I ,SOC,T ) = OCV(SOC,T ) − I (R(SOC,T ) + ηr(I ,SOC,T ) )r, query issues into two types, according to the degree of data
where I means the current, SOC is noted as the state of uncertainty.
charge, T represents the temperature, OCV is short for the B. DATA QUALITY
open circuit voltage, R is the resistance of battery cell and Due to the multiple origins of big data in the compli-
η represents the resistance caused by polarization. Then they cated power system, the smart grid database always contains
used a machine learning approach to cluster the collected EV incomplete, inconsistent and uncorrected data. Therefore, a
data by ever-increasing hierarchical self-organizing maps. series of data preprocessing technologies are necessary to
VII. ENERGY BIG DATA: CHALLENGES improve the data quality, especially after uncertain data min-
Coming from wide various sources, the volume of energy big ing and imprecise querying [128].
data is increasing at an exponential speed. At the same time, Data preprocessing contains data manipulation, compres-
difficulties also arise up in data storage, mining, querying, sion and normalization of the older data into improved for-
processing, etc. Therefore, cryptography technologies, fuzzy mat [129]. Data manipulation refers to transformation, which
data computing, qualified data processing are all essential for is used to reduce the impact from data fault sensing and pro-
big data applied better in smart grid. vide more reliable energy data. The compression operation
is used for massive data measurements to be reduced into a
A. DATA UNCERTAINTY reasonable extent. Meanwhile, the normalization technology
Data uncertainty comes from data complexity. Because of is used to regulate the collected data by minimizing its noise,
the complicated and distributed smart grid, the sources and inconsistencies and incompleteness of data. These three oper-
forms of energy big data become diverse. There is no need or ations are main components of data preprocessing but the
unrealistic to obtain accurate data samples or give a precise order is not unique.
response. Therefore, uncertain data mining and imprecise Furthermore, different regulations on big data cause con-
data querying are presented for data uncertainty [125]. flicts between collaborators and lead to inconvenience in
smart grid. Therefore, a unified and complete system standard
1) UNCERTAIN DATA MINING
needs to be set up in the future.
Previous applications in smart grid using traditional data
mining techniques require the data samples be accurate C. QUANTUM CRYPTOGRAPHY FOR DATA SECURITY
or with definite values. However, in the latest decades, Since the large amount of data is stored in multiple distributed
there have been some researches conducted to capture the grids, it is difficult to monitor the security of data storage

VOLUME 4, 2016 3857


H. Jiang et al.: Energy Big Data: A Survey

in real time. However, most data involve personal privacy, [4] B. Gu, X. Sun, and V. S. Sheng, ‘‘Structural minimax probability
commercial secrets, financial information, etc. It causes the machine,’’ IEEE Trans. Neural Netw. Learn. Syst., to be published,
Apr. 2016, doi: 10.1109/TNNLS.2016.2544779.
problem of data larceny, which has become increasingly [5] F. Rahimi and A. Ipakchi, ‘‘Demand response as a market resource
serious. Therefore, some kinds of cryptography technologies under the smart grid paradigm,’’ IEEE Trans. Smart Grid, vol. 1, no. 1,
have been proposed by researches to solute security problems pp. 82–88, Jun. 2010.
[6] S. M. Amin and B. F. Wollenberg, ‘‘Toward a smart grid: Power delivery
of big data in energy domain. for the 21st century,’’ IEEE Power Energy Mag., vol. 3, no. 5, pp. 34–41,
Modern cryptography technique is based on the Rack Scale Sep. 2005.
Architecture (RSA), which is especially used for the hardly [7] T. Niimura, M. Dhaliwal, and K. Ozawa, ‘‘Fuzzy regression models to
represent electricity market data in deregulated power industry,’’ in Proc.
factoring large data [130]. In 1994, Peter Shor discovered IFSA World Congr. NAFIPS Int. Conf., vol. 5. Jul. 2001, pp. 2556–2561.
that quantum computers can handle discrete logarithm oper- [8] Z. Fu, X. Sun, Q. Liu, L. Zhou, and J. Shu, ‘‘Achieving efficient cloud
ation [131] and this would lead to the loss of the efficiency search services: Multi-keyword ranked search over encrypted cloud data
supporting parallel computing,’’ IEICE Trans. Commun., vol. E98-B,
of the RSA architecture. Since the quantum computers have no. 1, pp. 190–200, 2015.
been put into use recently, a new data cryptograph technology [9] X. He, Q. Ai, R. C. Qiu, W. Huang, L. Piao, and H. Liu, ‘‘A big data
called quantum cryptography [132] is considered as the next architecture design for smart grids based on random matrix theory,’’ IEEE
Trans. Smart Grid, Jul. 2015, doi: 10.1109/TSG.2015.2445828.
generation cryptography solution. The quantum cryptogra- [10] M. Kezunovic and B. M. Cuka. (2013). Hierarchical Coordinated Protec-
phy can absolutely guarantee the security of smart grid data tion With High Penetration of Smart Grid Renewable Resources. [Online].
based on the quantum mechanics. In this technology, the data Available: http://eppe.tamu.edu
[11] Y. Zhang et al., ‘‘Wide-area frequency monitoring network (FNET)
is encoded as qubits instead of transmitting bits and different architecture and applications,’’ IEEE Trans. Smart Grid, vol. 1, no. 2,
directions will affect the qubit’s polarization. The quantum pp. 159–167, Sep. 2010.
key distribution begins when a large amount of qubits are sent [12] K. Wang, L. Zhuo, Y. Shao, D. Yue, and K. F. Tsang, ‘‘Towards dis-
tributed data processing on intelligent leakpoints prediction in petro-
from the sender to the receiver. If an eavesdropper wants to chemical industries,’’ IEEE Trans. Ind. Informat., Apr. 2016, doi:
get the information about any qubit, he will have to measure 10.1109/TII.2016.2537788.
it. However, having no idea about the qubit’s polarization, the [13] B. Gu, V. S. Sheng, K. Y. Tay, W. Romano, and S. Li, ‘‘Incremental
support vector learning for ordinal regression,’’ IEEE Trans. Neural Netw.
eavesdropper can only process it by guess and then transmit Learn. Syst., vol. 26, no. 7, pp. 1403–1416, Jul. 2015.
another randomly polarized qubit to the receiver. All these [14] M. Kezunovic, L. Xie, and S. Grijalva, ‘‘The role of big data in improving
operations are under the surveillance. At the end of quantum power system operation and protection,’’ in Proc. Bulk Power Syst. Dyn.
Control-IX Optim., Secur. Control Emerg. Power Grid (IREP), Aug. 2013,
key distribution, the key will be identified safe or not by a pp. 1–9.
brief testing. [15] H. Chen, R. H. L. Chiang, and V. C. Storey, ‘‘Business intelligence and
analytics: From big data to big impact,’’ MIS Quart., vol. 36, no. 4,
pp. 1165–1188, Dec. 2012.
VIII. CONCLUSION [16] T. A. Brody, J. Flores, J. B. French, P. A. Mello, A. Pandey, and
In this paper, we gave an introduction on big data and power S. S. M. Wong, ‘‘Random-matrix physics: Spectrum and strength fluc-
grid, presented the background in the aspects of related work, tuations,’’ Rev. Mod. Phys., vol. 53, no. 3, pp. 385–479, Jul. 1981.
[17] D. Howe et al., ‘‘Big data: The future of biocuration,’’ Nature, vol. 455,
techniques and tools for energy big data. Then, we discussed no. 7209, pp. 47–50, Sep. 2008.
the important role of big data, which brought efficiency and [18] C. Zhang and R. C. Qiu. (Apr. 2014). ‘‘Data modeling with
accuracy to the power system. Furthermore, an integrated large random matrices in a cognitive radio network testbed: Ini-
tial experimental demonstrations with 70 nodes.’’ [Online]. Available:
architecture was introduced for the big data analysis, and http://arxiv.org/abs/1404.3788
we discussed the key techniques of big data in the energy [19] A. Bose, ‘‘Smart transmission grid applications and their supporting
system in four categories: data acquisition and storing, data infrastructure,’’ IEEE Trans. Smart Grid, vol. 1, no. 1, pp. 11–19,
Jun. 2010.
correlation analysis, crowd-sourced data control and data [20] A. Abur and F. Galvan, ‘‘Synchro-phasor assisted state estimation
visualization. As another crucial part of the energy big data, (SPASE),’’ in Proc. IEEE PES Innov. Smart Grid Technol., Jan. 2012,
the security in the power system was illustrated in aspects of pp. 1–2.
[21] J. O’Brien et al., ‘‘Use of synchrophasor measurements in protec-
privacy, integrity, authentication and the third-party protec- tive relaying applications,’’ in Proc. Conf. Protective Relay Eng.,
tion. We also discussed the applications of energy big data in Mar./Apr. 2014, pp. 23–29.
branches of renewable energy, energy demand response, and [22] B. Gu and V. S. Sheng, ‘‘A robust regularization path algorithm for
ϑ-support vector classification,’’ IEEE Trans. Neural Netw. Learn. Syst.,
electric vehicles. Finally, we listed some challenges need to to be published, doi: 10.1109/TNNLS.2016.2527796.
be addressed about data uncertainty, quality and security. [23] H. Shen and Y. Zhu, ‘‘The collaboration between demand response
and smart grid,’’ in Proc. Int. Conf. Sustain. Power Generat. Supply,
Sep. 2012, pp. 1–4.
REFERENCES [24] M. Arenas-Martinez et al., ‘‘A comparative study of data storage and
[1] C. W. Gellings, M. Samotyj, and B. Howe, ‘‘The future’s smart delivery processing architectures for the smart grid,’’ in Proc. IEEE Int. Conf.
system [electric power supply],’’ IEEE Power Energy Mag., vol. 2, no. 5, Smart Grid Commun. (SmartGridComm), Oct. 2010, pp. 285–290.
pp. 40–48, Sep./Oct. 2004. [25] K. M. Rogers, R. Klump, H. Khurana, A. A. Aquino-Lugo, and
[2] Y. Yan, Y. Qian, H. Sharif, and D. Tipper, ‘‘A survey on smart grid T. J. Overbye, ‘‘An authenticated control framework for distributed volt-
communication infrastructures: Motivations, requirements and chal- age support on the smart grid,’’ IEEE Trans. Smart Grid, vol. 1, no. 1,
lenges,’’ IEEE Commun. Surveys Tuts., vol. 15, no. 1, pp. 5–20, pp. 40–47, Jun. 2010.
1st Quart., 2013. [26] A. Swarnkar, N. Gupta, and K. R. Niazi, ‘‘Adapted ant colony optimiza-
[3] K. Wang, Y. Shao, L. Shu, Y. Zhang, and C. Zhu, ‘‘Mobile big data fault- tion for efficient reconfiguration of balanced and unbalanced distribution
tolerant processing for eHealth networks,’’ IEEE Netw., vol. 30, no. 1, systems for loss minimization,’’ Swarm Evol. Comput., vol. 1, no. 3,
pp. 1–7, Jan. 2016. pp. 129–137, Sep. 2011.

3858 VOLUME 4, 2016


H. Jiang, et al.: Energy Big Data: A Survey

[27] A. A. Aquino-Lugo, R. Klump, and T. J. Overbye, ‘‘A control framework [51] Z. Xia, X. Wang, X. Sun, and Q. Wang, ‘‘A secure and dynamic multi-
for the smart grid for voltage support using agent-based technologies,’’ keyword ranked search scheme over encrypted cloud data,’’ IEEE Trans.
IEEE Trans. Smart Grid, vol. 2, no. 1, pp. 173–200, Mar. 2011. Parallel Distrib. Syst., vol. 27, no. 2, pp. 340–352, Feb. 2016.
[28] W. Labeeuw and G. Deconinck, ‘‘Residential electrical load model based [52] Z. Fu, K. Ren, J. Shu, X. Sun, and F. Huang, ‘‘Enabling per-
on mixture model clustering and Markov models,’’ IEEE Trans. Ind. sonalized search over encrypted outsourced data with efficiency
Informat., vol. 9, no. 3, pp. 1561–1569, Aug. 2013. improvement,’’ IEEE Trans. Parallel Distrib. Syst., Dec. 2015, doi:
[29] A. M. S. Ferreira, C. A. M. T. Cavalcante, C. H. O. Fontes, and 10.1109/TPDS.2015.2506573.
J. E. S. Marambio, ‘‘A new method for pattern recognition in load profiles [53] P. Paillier, ‘‘Public-key cryptosystems based on composite
to support decision-making in the management of the electric sector,’’ Int. degree residuosity classes,’’ in Proc. Int. Conf. Theory Appl.
J. Elect. Power Energy Syst., vol. 53, no. 4, pp. 824–831, Dec. 2013. Cryptograph. (EUROCRYPT), 1999, pp. 223–238.
[30] C. E. Borges, Y. K. Penya, and I. Fernandez, ‘‘Evaluating combined load [54] D. Boneh, E.-J. Goh, and K. Nissim, ‘‘Evaluating 2-DNF formulas on
forecasting in large power systems and smart grids,’’ IEEE Trans. Ind. ciphertexts,’’ in Proc. 2nd Theory Cryptogr., 2005, pp. 325–341.
Informat., vol. 9, no. 3, pp. 1570–1577, Aug. 2013. [55] R. L. Rivest, A. Shamir, and L. Adleman, ‘‘A method for obtaining digital
[31] M. Amina, V. S. Kodogiannis, I. Petrounias, and D. Tomtsis, ‘‘A hybrid signatures and public-key cryptosystems,’’ Commun. ACM, vol. 21, no. 2,
intelligent approach for the prediction of electricity consumption,’’ Int. J. pp. 120–126, Feb. 1978.
Elect. Power Energy Syst., vol. 43, no. 1, pp. 99–108, Dec. 2012.
[56] C. Gentry, ‘‘Fully homomorphic encryption using ideal lattices,’’ in Proc.
[32] N. Ding, Y. Besanger, F. Wurtz, and G. Antoine, ‘‘Individual nonpara-
ACM STOC, vol. 9. 2009, pp. 169–178.
metric load estimation model for power distribution network planning,’’
[57] Y. Yan, Y. Qian, and H. Sharif, ‘‘A secure data aggregation and dispatch
IEEE Trans. Ind. Informat., vol. 9, no. 3, pp. 1578–1587, Aug. 2013.
scheme for home area networks in smart grid,’’ in Proc. IEEE Global
[33] M. Biswal and P. K. Dash, ‘‘Measurement and classification of simultane-
Telecommun. Conf. (GLOBECOM), Dec. 2011, pp. 1–6.
ous power signal patterns with an S-transform variant and fuzzy decision
tree,’’ IEEE Trans. Ind. Informat., vol. 9, no. 4, pp. 1819–1827, Nov. 2013. [58] S. Ruj and A. Nayak, ‘‘A decentralized security framework for data
[34] A. K. Karun and K. Chitharanjan, ‘‘Locality sensitive hashing based aggregation and access control in smart grids,’’ IEEE Trans. Smart Grid,
incremental clustering for creating affinity groups in Hadoop—HDFS— vol. 4, no. 1, pp. 196–205, Mar. 2013.
An infrastructure extension,’’ in Proc. IEEE Int. Conf. Circuits, Power [59] A. Sahai and B. Waters, ‘‘Fuzzy identity-based encryption,’’ in Proc. 24th
Comput. Technol., Mar. 2013, pp. 1243–1249. Annu. Int. Conf. Adv. Cryptol. (EUROCRYPT), 2005, pp. 457–473.
[35] C. Kaewkasi and W. Srisuruk, ‘‘A study of big data processing con- [60] V. Goyal, O. Pandey, A. Sahai, and B. Waters, ‘‘Attribute-based encryp-
straints on a low-power Hadoop cluster,’’ in Proc. Int. Comput. Sci. Eng. tion for fine-grained access control of encrypted data,’’ in Proc. 13th ACM
Conf. (ICSEC), Jul./Aug. 2014, pp. 267–272. Conf. Comput. Commun. Secur., 2006, pp. 89–98.
[36] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, ‘‘The Hadoop dis- [61] J. Bethencourt, A. Sahai, and B. Waters, ‘‘Ciphertext-policy attribute-
tributed file system,’’ in Proc. IEEE 26th Symp. Mass Storage Syst. based encryption,’’ in Proc. IEEE Symp. Secur. Privacy (SP), May 2007,
Technol. (MSST), May 2014, pp. 1–10. pp. 321–334.
[37] J. Dean and S. Ghemawat, ‘‘MapReduce: Simplified data [62] A. Lewko and B. Waters, ‘‘Decentralizing attribute-based encryption,’’
processing on large clusters,’’ Commun. ACM, vol. 51, no. 1, in Proc. 30th Annu. Int. Conf. Adv. Cryptology (EUROCRYPT), 2011,
pp. 107–113, 2008. pp. 568–588.
[38] K. V. Rama Satish and N. P. Kavya, ‘‘Big data processing with harnessing [63] S. Ratnasamy et al., ‘‘Data-centric storage in sensornets with GHT, a
Hadoop—MapReduce for optimizing analytical workloads,’’ in Proc. Int. geographic hash table,’’ Mobile Netw. Appl., vol. 8, no. 4, pp. 427–442,
Conf. Contemp. Comput. Informat. (IC3I), Nov. 2014, pp. 49–54. Aug. 2003.
[39] M. M. Maske and P. Prasad, ‘‘A real time processing and streaming of [64] B. Karp and H. T. Kung, ‘‘GPSR: Greedy perimeter stateless routing
wireless network data using storm,’’ in Proc. IEEE Int. Conf. Comput. for wireless networks,’’ in Proc. Int. Conf. Mobile Comput. Netw., 2000,
Power, Energy Inf. Commun., Apr. 2015, pp. 244–249. pp. 243–254.
[40] P. Chandarana and M. Vijayalakshmi, ‘‘Big data analytics frameworks,’’ [65] R. Meier et al., ‘‘Power system data management and analysis using
in Proc. IEEE Int. Conf. Circuits, Syst., Commun. Inf. Technol. Appl., synchrophasor data,’’ in Proc. IEEE Conf. Technol. Sustain. (SusTech),
Apr. 2014, pp. 430–434. Jul. 2014, pp. 225–231.
[41] Z. Asad and M. A. R. Chaudhry, ‘‘A two-way street: Green big data [66] R. Pandey, M. Dhoundiyal, and A. Kumar, ‘‘Correlation analysis of big
processing for a greener smart grid,’’ IEEE Syst. J., Feb. 2016, doi: data to support machine learning,’’ in Proc. 5th Int. Conf. Commun. Syst.
10.1109/JSYST.2015.2498639. Netw. Technol. (CSNT), Apr. 2015, pp. 996–999.
[42] S. Maharjan, Q. Zhu, Y. Zhang, S. Gjessing, and T. Basar, ‘‘Demand [67] M. Pavella, D. Ernst, and D. Ruiz-Vega, Transient Stability of Power
response management in the smart grid in a large population regime,’’ Systems: A Unified Approach to Assessment and Control. New York, NY,
IEEE Trans. Smart Grid, vol. 7, no. 1, pp. 189–199, Jan. 2016. USA: Springer, 2013.
[43] M. K. H. Pulok and M. Omar Faruque, ‘‘Utilization of PMU data to
[68] D. De Silva, X. Yu, D. Alahakoon, and G. Holmes, ‘‘A data mining
evaluate the effectiveness of voltage stability boundary and indices,’’ in
framework for electricity consumption analysis from meter data,’’ IEEE
Proc. IEEE North Amer. Power Symp., Oct. 2015, pp. 1–6.
Trans. Ind. Informat., vol. 7, no. 3, pp. 399–407, Aug. 2011.
[44] S. J. Wangeman, ‘‘IED dataset generation: Analysis across theaters,’’ in
Proc. IEEE Syst. Inf. Eng. Design Symp., Apr. 2013, pp. 92–97. [69] T. Yan, V. Kumar, and D. Ganesan, ‘‘CrowdSearch: Exploiting crowds
for accurate real-time image search on mobile phones,’’ in Proc. 8th Int.
[45] H. Hu, Y. Wen, T.-S. Chua, and X. Li, ‘‘Toward scalable systems for big
Conf. Mobile Syst., Appl., Services, 2010, pp. 77–90.
data analytics: A technology tutorial,’’ IEEE Access, vol. 2, pp. 652–687,
2014. [70] J. Whitehill, P. Ruvolo, T.-F. Wu, J. Bergsma, and J. R. Movellan, ‘‘Whose
[46] B. Gu, V. S. Sheng, Z. Wang, D. Ho, S. Osman, and S. Li, ‘‘Incre- vote should count more: Optimal integration of labels from labelers
mental learning for ϑ-support vector regression,’’ Neural Netw., vol. 67, of unknown expertise,’’ in Proc. Adv. Neural Inf. Process. Syst., 2009,
pp. 140–150, Jul. 2015. pp. 2035–2043.
[47] Y. Simmhan et al., ‘‘An informatics approach to demand response opti- [71] Y. Amsterdamer, Y. Grossman, T. Milo, and P. Senellart, ‘‘Crowd min-
mization in smart grids,’’ Natural Gas, vol. 31, p. 60, Mar. 2011. ing,’’ in Proc. ACM SIGMOD Int. Conf. Manage. Data, Jun. 2013,
[48] A. Pal and S. Agrawal, ‘‘An experimental approach towards big data pp. 241–252.
for analyzing memory utilization on a Hadoop cluster using HDFS and [72] J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and
MapReduce,’’ in Proc. IEEE Int. Conf. Netw. Soft Comput. (ICNSC), T. Milo, ‘‘A sample-and-clean framework for fast and accurate query
Aug. 2014, pp. 442–447. processing on dirty data,’’ in Proc. ACM SIGMOD Int. Conf. Manage.
[49] B. Matic-Cuka and M. Kezunovic, ‘‘Improving smart grid operation with Data, Jun. 2014, pp. 469–480.
new hierarchically coordinated protection approach,’’ in Proc. Medit. [73] S. Guo, A. Parameswaran, and H. Garcia-Molina, ‘‘So who won?:
Conf. Power Generat., Transmiss., Distrib. Energy Convers., Oct. 2012, Dynamic max discovery with the crowd,’’ in Proc. ACM SIGMOD Int.
pp. 1–6. Conf. Manage. Data, 2012, pp. 385–396.
[50] S. K. Bansal, ‘‘Towards a semantic extract-transform-load (ETL) frame- [74] L. Kazemi, C. Shahabi, and L. Chen, ‘‘GeoTruCrowd: Trustworthy query
work for big data integration,’’ in Proc. IEEE Int. Congr. Big Data, answering with spatial crowdsourcing,’’ in Proc. 21st ACM SIGSPATIAL
Jun./Jul. 2014, pp. 522–529. Int. Conf. Adv. Geograph. Inf. Syst., 2013, pp. 314–323.

VOLUME 4, 2016 3859


H. Jiang et al.: Energy Big Data: A Survey

[75] C. C. Cao, J. She, Y. Tong, and L. Chen, ‘‘Whom to ask?: Jury selec- [99] L. Jia, J. Kim, R. J. Thomas, and L. Tong, ‘‘Impact of data quality on real-
tion for decision making tasks on micro-blog services,’’ in Proc. VLDB time locational marginal price,’’ IEEE Trans. Power Syst., vol. 29, no. 2,
Endowment, Jul. 2012, vol. 5. no. 11, pp. 1495–1506. pp. 627–636, Mar. 2014.
[76] D. R. Karger, S. Oh, and D. Shah, ‘‘Iterative learning for reliable [100] H. Liu, H. Ning, Y. Zhang, and L. T. Yang, ‘‘Aggregated-proofs based
crowdsourcing systems,’’ in Proc. Adv. Neural Inf. Process. Syst., 2011, privacy-preserving authentication for V2G networks in the smart grid,’’
pp. 1953–1961. IEEE Trans. Smart Grid, vol. 3, no. 4, pp. 1722–1733, Dec. 2012.
[77] M. Joglekar, H. Garcia-Molina, and A. Parameswaran, ‘‘Evaluating the [101] A. Hamlyn, H. Cheung, T. Mander, L. Wang, C. Yang, and
crowd with confidence,’’ in Proc. 19th ACM SIGKDD Int. Conf. Knowl. R. Cheung, ‘‘Network security management and authentication of actions
Discovery Data Mining, 2013, pp. 686–694. for smart grids operations,’’ in Proc. IEEE Elect. Power Conf., Oct. 2007,
[78] V. C. Raykar et al., ‘‘Supervised learning from multiple experts: Whom pp. 31–36.
to trust when everyone lies a bit,’’ in Proc. 26th Annu. Int. Conf. Mach. [102] M. M. Fouda, Z. M. Fadlullah, N. Kato, R. Lu, and X. Shen, ‘‘Towards
Learn., 2009, pp. 889–896. a light-weight message authentication mechanism tailored for smart
[79] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang, ‘‘CDAS: grid communications,’’ in Proc. IEEE Conf. Comput. Commun. Work-
A crowdsourcing data analytics system,’’ in Proc. VLDB Endowment, shops (INFOCOM WKSHPS), Apr. 2011, pp. 1018–1023.
Jun. 2012, vol. 5. no. 10, pp. 1040–1051. [103] D. R. Stinson, Cryptography: Theory and Practice. Boca Raton, FL,
[80] S. Wang, X. Xiao, and C.-H. Lee, ‘‘Crowd-based deduplication: An adap- USA: CRC Press, 2005.
tive approach,’’ in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2015, [104] R. Ranchal et al., ‘‘Protection of identity information in cloud computing
pp. 1263–1277. without trusted third party,’’ in Proc. 29th IEEE Symp. Reliable Distrib.
[81] Y. Zheng, J. Wang, G. Li, R. Cheng, and J. Feng, ‘‘QASCA: A quality- Syst., Oct./Nov. 2010, pp. 368–372.
aware task assignment system for crowdsourcing applications,’’ in Proc. [105] A. Shamir, ‘‘How to share a secret,’’ Commun. ACM, vol. 22, no. 11,
ACM SIGMOD Int. Conf. Manage. Data, 2015, pp. 1031–1046. pp. 612–613, Nov. 1979.
[82] J. Zhang, M. L. Huang, W. B. Wang, L. F. Lu, and Z.-P. Meng, ‘‘Big data [106] M. Ben-Or, S. Goldwasser, and A. Wigderson, ‘‘Completeness theorems
density analytics using parallel coordinate visualization,’’ in Proc. IEEE for non-cryptographic fault-tolerant distributed computation,’’ in Proc.
Int. Conf. Comput. Sci. Eng. (CSE), Dec. 2014, pp. 1115–1120. 12th Annu. ACM Symp. Theory Comput., 1988, pp. 1–10.
[83] D. Dzung, M. Naedele, T. P. V. Hoff, and M. Crevatin, ‘‘Security [107] A. C. Paro and E. A. F. A. Fadigas, ‘‘A methodology for biomass cogen-
for industrial communication systems,’’ Proc. IEEE, vol. 93, no. 6, eration plants overall energy efficiency calculation and measurement—
pp. 1152–1177, Jun. 2005. A basis for generators real time efficiency data disclosure,’’ in Proc.
[84] P. McDaniel and S. McLaughlin, ‘‘Security and privacy challenges IEEE/PES Power Syst. Conf. Expo. (PSCE), Mar. 2011, pp. 1–7.
in the smart grid,’’ IEEE Security Privacy, vol. 7, no. 3, pp. 75–77, [108] A. MacGillivray, H. Jeffrey, M. Winskel, and I. Bryden, ‘‘Innovation
May/Jun. 2009. and cost reduction for marine renewable energy: A learning invest-
ment sensitivity analysis,’’ Technol. Forecasting Social Change, vol. 87,
[85] F. Li, B. Luo, and P. Liu, ‘‘Secure information aggregation for smart grids
pp. 108–124, Sep. 2014.
using homomorphic encryption,’’ in Proc. 1st IEEE Int. Conf. Smart Grid
Commun. (SmartGridComm), Oct. 2010, pp. 327–332. [109] R. J. K. Wood, A. S. Bahaj, S. R. Turnock, L. Wang, and M. Evans,
‘‘Tribological design constraints of marine renewable energy systems,’’
[86] G. Kalogridis, C. Efthymiou, S. Z. Denic, T. A. Lewis, and
Philos. Trans. Roy. Soc. London A, Math. Phys. Sci., vol. 368, no. 1929,
R. Cepeda, ‘‘Privacy for smart meters: Towards undetectable appli-
pp. 4807–4827, Oct. 2010.
ance load signatures,’’ in Proc. 1st IEEE Int. Conf. Smart Grid
Commun. (SmartGridComm), Oct. 2010, pp. 232–237. [110] R. Billinton and Y. Gao, ‘‘Multistate wind energy conversion system
models for adequacy assessment of generating systems incorporating
[87] V. Rastogi and S. Nath, ‘‘Differentially private aggregation of distributed
wind energy,’’ IEEE Trans. Energy Convers., vol. 23, no. 1, pp. 163–170,
time-series with transformation and encryption,’’ in Proc. ACM SIGMOD
Mar. 2008.
Int. Conf. Manage. Data, 2010, pp. 735–746.
[111] J. K. Kaldellis, ‘‘Optimum autonomous wind–power system sizing for
[88] S. R. Rajagopalan, L. Sankar, S. Mohajer, and H. V. Poor, ‘‘Smart meter
remote consumers, using long-term wind speed data,’’ Appl. Energy,
privacy: A utility-privacy framework,’’ in Proc. IEEE Int. Conf. Smart
vol. 71, no. 3, pp. 215–233, Mar. 2002.
Grid Commun. (SmartGridComm), Oct. 2011, pp. 190–195.
[112] L. Kung and H.-F. Wang, ‘‘A recommender system for the optimal combi-
[89] T. M. Cover and J. A. Thomas, Elements of Information Theory. nation of energy resources with cost-benefit analysis,’’ in Proc. Int. Conf.
New York, NY, USA: Wiley, 2012. Ind. Eng. Oper. Manage. (IEOM), Mar. 2015, pp. 1–10.
[90] Y. Liu, P. Ning, and M. K. Reiter, ‘‘False data injection attacks against [113] S. Maharjan, Q. Zhu, Y. Zhang, S. Gjessing, and T. Basar, ‘‘Depend-
state estimation in electric power grids,’’ ACM Trans. Inf. Syst. Secur., able demand response management in the smart grid: A Stackelberg
vol. 14, no. 1, p. 13, Sep. 2009. game approach,’’ IEEE Trans. Smart Grid, vol. 4, no. 1, pp. 120–132,
[91] S. Ruj and A. Pal, ‘‘Analyzing cascading failures in smart grids under Mar. 2013.
random and targeted attacks,’’ in Proc. IEEE 28th Int. Conf. Adv. Inf. Netw. [114] J. Kwac and R. Rajagopal, ‘‘Demand response targeting using big
Appl. (AINA), May 2014, pp. 226–233. data analytics,’’ in Proc. IEEE Int. Conf. Big Data, Oct. 2013,
[92] L. Xie, Y. Mo, and B. Sinopoli, ‘‘Integrity data attacks in power mar- pp. 683–690.
ket operations,’’ IEEE Trans. Smart Grid, vol. 2, no. 4, pp. 659–666, [115] Z. Liu, A. Wierman, Y. Chen, B. Razon, and N. Chen, ‘‘Data center
Dec. 2011. demand response: Avoiding the coincident peak via workload shifting and
[93] O. Kosut, L. Jia, R. J. Thomas, and L. Tong, ‘‘Malicious data attacks local generation,’’ Perform. Eval., vol. 70, no. 10, pp. 770–791, Oct. 2013.
on the smart grid,’’ IEEE Trans. Smart Grid, vol. 2, no. 4, pp. 645–658, [116] S. G. Goyena and S. A. Acciona, ‘‘Sizing and analysis of big scale and
Dec. 2011. isolated electric systems based on renewable sources with energy stor-
[94] L. Jia, R. Thomas, and L. Tong, ‘‘Malicious data attack on real-time age,’’ in Proc. IEEE PES/IAS Conf. Sustain. Alternative Energy (SAE),
electricity market,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Sep. 2009, pp. 1–7.
Process. (ICASSP), May 2011, pp. 5952–5955. [117] Q. Wu et al., ‘‘Driving pattern analysis for electric vehicle (EV) grid inte-
[95] Y. Yuan, Z. Li, and K. Ren, ‘‘Quantitative analysis of load redistribution gration study,’’ in Proc. IEEE Innov. Smart Grid Technol. Conf. (ISGT),
attacks in power systems,’’ IEEE Trans. Parallel Distrib. Syst., vol. 23, Oct. 2010, pp. 1–6.
no. 9, pp. 1731–1738, Sep. 2012. [118] W. Su and M.-Y. Chow, ‘‘Performance evaluation of an EDA-based large-
[96] R. Tan, V. B. Krishna, D. K. Y. Yau, and Z. Kalbarczyk, ‘‘Impact of scale plug-in hybrid electric vehicle charging algorithm,’’ IEEE Trans.
integrity attacks on real-time pricing in smart grids,’’ in Proc. ACM Smart Grid, vol. 3, no. 1, pp. 308–315, Mar. 2012.
SIGSAC Conf. Comput. Commun. Secur., 2013, pp. 439–450. [119] S. Midlam-Mohler, S. Ewing, V. Marano, Y. Guezennec, and G. Rizzoni,
[97] M. Esmalifalak, G. Shi, Z. Han, and L. Song, ‘‘Bad data injection attack ‘‘PHEV fleet data collection and analysis,’’ in Proc. IEEE Vehicle Power
and defense in electricity market using game theory study,’’ IEEE Trans. Propuls. Conf. (VPPC), Sep. 2009, pp. 1205–1210.
Smart Grid, vol. 4, no. 1, pp. 160–169, Mar. 2013. [120] H. Rahimi-Eichi, P. B. Jeon, M.-Y. Chow, and T.-J. Yeo, ‘‘Incorporat-
[98] S. Tan, W.-Z. Song, M. Stewart, J. Yang, and L. Tong, ‘‘Online data ing big data analysis in speed profile classification for range estima-
integrity attacks against real-time electrical market in smart grid,’’ IEEE tion,’’ in Proc. IEEE 13th Int. Conf. Ind. Informat. (INDIN), Jul. 2015,
Trans. Smart Grid, Apr. 2016, doi: 10.1109/TSG.2016.2550801. pp. 1290–1295.

3860 VOLUME 4, 2016


H. Jiang, et al.: Energy Big Data: A Survey

[121] X. Zhang and S. Grijalva, ‘‘An advanced data driven model for residential YIHUI WANG is currently pursuing the degree in
electric vehicle charging demand,’’ in Proc. IEEE Power Energy Soc. Internet of Things with the Nanjing University of
General Meeting, Jul. 2015, pp. 1–5. Posts and Telecommunications, China. Her current
[122] L. Zhang, Y. Chen, J. Zhu, M. Pan, Z. Sun, and W. Wang, ‘‘Data qual- research interests include wireless sensor network,
ity analysis and improved strategy research on operations management embedded systems, and cyber-physical systems.
system for electric vehicles,’’ in Proc. 5th Int. Conf. Electr. Utility Dereg-
ulation Restruct. Power Technol. (DRPT), Nov. 2015, pp. 2715–2720.
[123] J. Soares, N. Borges, B. Canizes, and Z. Vale, ‘‘Probabilistic estimation
of the state of electric vehicles for smart grid applications in big data
context,’’ in Proc. IEEE Power Energy Soc. General Meeting, Jul. 2015,
pp. 1–5.
[124] C.-H. Lee and C.-H. Wu, ‘‘A novel big data modeling method for
improving driving range estimation of EVs,’’ IEEE Access, vol. 3,
pp. 1980–1993, 2015.
[125] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, ‘‘Evaluating probabilistic
queries over imprecise data,’’ in Proc. ACM SIGMOD Int. Conf. Manage.
Data, Jun. 2003, pp. 551–562.
[126] S. Tsang, B. Kao, K. Y. Yip, W.-S. Ho, and S. D. Lee, ‘‘Decision trees for
uncertain data,’’ IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 64–78,
Jan. 2011.
[127] J. Chen and R. Cheng, ‘‘Efficient evaluation of imprecise location-
dependent queries,’’ in Proc. IEEE Int. Conf. Data Eng., Apr. 2007,
pp. 586–595.
[128] W. K. Ngai, B. Kao, C. K. Chui, R. Cheng, M. Chau, and K. Y. Yip, MIN GAO received the B.E. degree in electrical
‘‘Efficient clustering of uncertain data,’’ in Proc. Int. Conf. Data Mining, engineering and automation from the Nanjing Uni-
Dec. 2006, pp. 436–455. versity of Technology in 2011, and M.S. degree
[129] R. Gutierrez-Osuna and H. T. Nagle, ‘‘A method for evaluating data- in electrical engineering from the University of
preprocessing techniques for odour classification with an array of gas California, Los Angeles, in 2012, where he is
sensors,’’ IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 5, currently pursuing the Ph.D. degree in electrical
pp. 626–632, Oct. 1999. engineering. His research interests include smart
[130] J.-L. Sheu, M.-D. Shieh, C.-H. Wu, and M.-H. Sheu, ‘‘A pipelined archi- grid, IoT, and software formal verification.
tecture of fast modular multiplication for RSA cryptography,’’ in Proc.
IEEE Int. Symp. Circuits Syst., vol. 2. May/Jun. 1998, pp. 121–124.
[131] P. W. Shor, ‘‘Algorithms for quantum computation: Discrete logarithms
and factoring,’’ in Proc. Annu. Symp. Found. Comput. Sci., Nov. 1994,
pp. 124–134.
[132] C. H. Bennett, F. Bessette, G. Brassard, L. Salvail, and J. Smolin,
‘‘Experimental quantum cryptography,’’ J. Cryptol., vol. 5, no. 1,
pp. 3–28, Jan. 1992.

HUI JIANG is currently pursuing the degree in


information network with the Nanjing University
of Posts and Telecommunications, China. Her cur-
rent research interests include wireless sensor net-
work, computer network, Internet of Things, social
YAN ZHANG (M’05–SM’10) received the
networks, security, smart grid communications,
Ph.D. degree from the School of Electrical and
and cyber-physical systems.
Electronics Engineering, Nanyang Technological
University, Singapore. He is currently the Head
of the Department of Networks with the Sim-
ula Research Laboratory, Norway, and an Asso-
KUN WANG (M’-13) received the B.Eng. and ciate Professor (part-time) with the Department
Ph.D. degrees from the School of Computer, Nan- of Informatics, University of Oslo, Norway. His
jing University of Posts and Telecommunications, current research interests include next-generation
Nanjing, China, in 2004 and 2009, respectively. wireless networks and reliable and secure cyber-
From 2013 to 2015, he was a Post-Doctoral Fel- physical systems (e.g., smart grid, transport, and healthcare). He is a Senior
low with the Electrical Engineering Department, Member of the IEEE ComSoc, the IEEE VT Society, the IEEE PES, and
University of California at Los Angeles, CA, the IEEE Computer society. He serves as the TPC Member for numerous
USA. In 2016, he was a Research Fellow with international conferences, including the IEEE INFOCOM, the IEEE ICC,
the School of Computer Science and Engineer- the IEEE GLOBECOM, and the IEEE WCNC. He has received eight best
ing, University of Aizu, Fukushima, Japan. He is paper awards. He is a fellow of IET. He serves as the Chair in a number
currently an Associate Professor with the School of Internet of Things, of conferences, including the IEEE PIMRC 2016, the IEEE Cloudcom
Nanjing University of Posts and Telecommunications. He has authored over 2016/2015, the IEEE CCNC 2016, the IEEE ICCC 2016, WICON 2016,
50 papers in referred international conferences and journals, including the and the IEEE SmartGridComm 2015. He is an Associate Editor or on the
IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, ACM Transactions on Mul- Editorial Board of a number of well-established scientific international
timedia Computing, Communications and Applications, ACM Transactions journals, e.g., Wiley Wireless Communications and Mobile Computing.
on Embedded Computing Systems, the IEEE Wireless Communications Mag- He also serves as the Guest Editor of the IEEE TRANSACTIONS ON SMART
azine, the IEEE Communications Magazine, the IEEE Network, the IEEE GRID, the IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, the
ACCESS, and the IEEE SENSORS journal. His current research interests are IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, the IEEE Communications
mainly in the area of big data, wireless communications and networking, Magazine, the IEEE Wireless Communications, the IEEE Network, the IEEE
smart grid, energy Internet, and information security technologies. He is a Systems, the IEEE ACCESS, and the IEEE INTERNET OF THINGS.
member of ACM.

VOLUME 4, 2016 3861

You might also like