
2015 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing

Development of the Big Data Management System on National Virtual Power Plant

Mijeom Kim, Jungin Choi, Jaeweon Yoon


Smart Grid Research Center
Advanced Institutes of Convergence Technology
Suwon-si, Republic of Korea
mijeom, jichoi, jw.yoon@snu.ac.kr

Abstract—We are developing the Business Platform for demand side management in the energy sector. This paper introduces the Business Platform and describes the development of the pilot platform and its main components. In addition, the main applications, Automated Demand Response (AutoDR) and Monitoring Based Commissioning (MBCx), are introduced. One of the main components of the platform is the Big Data Management System, which saves and analyzes the energy big data. We constructed two big data cluster systems: one is a Hadoop cluster and the other has Spark installed. We also implemented software modules to collect and generate the energy big data. The data flow inside the platform is presented and the implementation of Application Programming Interfaces (APIs) to access the big data is described. Furthermore, we developed various applications on top of the platform using the APIs. Among those, this paper presents the monitoring service.

Keywords—Big data management system; National Virtual Power Plant; Demand Response; Demand Side Management

I. INTRODUCTION
Demand resources have played an important role in Korea for more than 20 years. The worldwide paradigm in the energy sector is shifting from supply side management to demand side management, since controlling and reducing energy demand is much more efficient than building more power plants to generate physical energy. Demand side management reduces or shifts energy consumption through efficiency improvements or load shifting on the customer side of the electrical meter. It is often referred to as the least cost resource because the cost of developing defined quantities of energy that can be reallocated or shifted is significantly lower than the cost of constructing new capacity.

In this paper, we introduce the Business Platform for demand side management, which uses Information Technologies (IT) to make it possible to exchange and relocate surplus energy or to benefit from reducing peak load. The platform is referred to as the National Virtual Power Plant (NVPP) Business Platform since it is available to everyone who wants to save electric energy for monetary gain. The NVPP is a cluster of distributed energy resources that are collectively run by a central control entity. The operational focus is to reduce or minimize peak load by balancing power over short time frames or by exchanging energy with consideration of grid bottlenecks. We developed the NVPP Business Platform by utilizing Information and Communication Technologies (ICT) and open source software such as the Hadoop ecosystem [1]. Its major functional components are the Big Data Management System, which stores and manages power and non-power data, and the Business Enable Engine, which is composed of useful web services and Application Program Interface (API) libraries. As killer applications, we implemented the Automated Demand Response (AutoDR) service and the Monitoring Based Commissioning (MBCx) service.

As shown in Fig. 1, our project goal is to develop a Business Platform that supports scalability and standards, based on data analysis using statistical methods and demand resource optimization algorithms. Applications and user interfaces use component based libraries and open data access APIs. Most importantly, the platform has interfaces with a big data system and external systems and provides apps, service APIs and environments that help users and developers create new services and manage apps.

Fig. 1. Concept of NVPP Business Platform

The rest of this paper is organized as follows. Section II introduces the NVPP Business Platform for demand side management in the energy domain. The section also describes the development of the pilot business platform and the components of the platform. We present the Big Data Management System of the platform in detail in Section III, including the construction of the Big Data Management System, the collection of the big data, and the data flows and APIs. Subsequently, Section IV presents the monitoring service (Dash Board), which uses the Big Data APIs for big data analysis and access. Section V describes existing research and

978-1-4673-9473-4/15 $31.00 © 2015 IEEE


DOI 10.1109/3PGCIC.2015.101
methods related to our system. Concluding remarks and future work follow in Section VI.

II. INTRODUCTION OF THE NVPP

Fig. 2 presents the architecture of the NVPP Business Platform. The clients include a variety of metering and controlling/actuating devices. The big data system and API services are components of the middleware layer. The application layer has essential services such as Demand Response (DR) and MBCx, and 3rd party apps including Social Network Service (SNS) DR and Policy Simulation.

Fig. 2. Architecture of NVPP Business Platform

In the middleware layer, there are two components: the Business Enable Engine and the Big Data Management System. The business engine enables 3rd parties to develop new services on the business platform by providing APIs for demand resource management, energy optimization, visualization, and data access, as well as external system interfaces and billing and user management services. Billing and user management services include user registration and management, app usage billing, and app usage report generation. The big data system gathers power and non-power data from a variety of clients (including AutoDR and MBCx clients) and multiple sensors such as smart meters. The data are stored and managed using HBase [2] and operated with the Hadoop ecosystem.

The main applications are the DR services and the MBCx service. Following the definition given by the U.S. Department of Energy [3], demand response is any “change in electric usage by the end-use customers from their normal consumption patterns in response to change in the price of electricity over time, or to incentive payments designed to induce lower electricity use at times of high wholesale market prices or when system reliability is jeopardized.” In other words, the DR service reduces peak demand during summer and winter by enabling commercial and industrial customers to sell their reduced demand in the electricity market like supply resources. For example, buildings can provide capacity to the grid during peak events by reducing consumption where it is curtailable for a period of time. Automating DR makes DR resources ready for dispatch, improves DR reliability and predictability, and simplifies and reduces the cost of DR. The MBCx service, on the other hand, aims not only to reduce peak load but also to monitor and control continuously in order to reduce the power load and improve efficiency. MBCx is a proven approach for achieving persistent energy savings through continuous improvement and optimization. The initial focus is to rapidly identify a set of energy saving measures and to get the no-cost and low-cost measures implemented as quickly as possible. MBCx mainly targets Energy Efficiency (EE) on a daily basis, while AutoDR cuts peak demand and peak load for demand management. Both the AutoDR and MBCx services are client-server applications. The NVPP platform provides the application servers, and on the client side, client programs need to be ready for interoperation. For AutoDR, the server program is referred to as the Virtual Top Node (VTN) and the client program as the Virtual End Node (VEN). Fig. 3 and Fig. 4 explain the service protocols and scenarios of the two applications on the NVPP Business Platform.

Fig. 3. Service Protocols of main applications on the NVPP Platform

Fig. 4. Service Scenarios of main applications on the NVPP Platform

The server architecture and the service stack of the platform are shown in Fig. 5 and Fig. 6. There are two big data clusters and three power data collection database server groups. The first database server group collects data from 2509 smart meters and the second one from 1824 smart meters. The last one saves power data from 88 Electric Management System (EMS) and Energy Storage System (ESS) point meters.

Fig. 6. Service Stack of the NVPP Platform

Table I summarizes the dataset of the first database server group (Data Collection DB server group I), aggregated from 2509 smart meters from December 2012 to April 2015. The power data are mainly aggregated in two tables, History and Demand. The History table is updated every 5 minutes and saves demand data in oBIX [10] format. The Demand table saves power data every 15 minutes.

TABLE I. DATASET OF THE DATA COLLECTION DB SERVER GROUP I

Regions        4 sites (Industry / Techno park)
Smart Meters   2,509
Records        6 billion energy data records (History tables)
               2 billion energy data records (Demand tables)
Period         30 months
Update         5 minutes (energy demand in XML, oBIX)
               15 minutes (energy demand)

Fig. 5. Server Architecture of the NVPP Platform

III. DEVELOPMENT OF THE POWER BIG DATA MANAGEMENT SYSTEM

A. Construction of the Power Big Data Management System

The big data system collects a huge amount of power and non-power data and analyzes the data on a cluster system, that is, a cluster of general-purpose computers connected to each other via a commodity network such as Ethernet, with each machine running a Linux operating system. The power data is collected every 15 minutes from a few thousand smart meters. To support this distributed environment in the big data management system, we first constructed the cluster system shown in the blue box in Fig. 7. The cluster system consists of a manager node, a master node, and eight data nodes. Since this cluster system has one IP address, which corresponds to the MAC address of the manager node, clients outside the cluster system can only communicate with the manager node. The master node controls the cluster system, whereas the data nodes mainly store blocks. Each node is a workstation server with an Intel Xeon 2.20GHz CPU (six cores per CPU) and an 8TB Hard Disk Drive (HDD). The master node has 32GB of main memory, while each data node has 24GB.

We used the Linux operating system (CentOS release 6.4) on each node in the cluster system. In addition, to support the distributed environment, we installed the Hadoop Distributed File System (HDFS) and the MapReduce framework on each machine in the cluster system [1]. Since we use Hadoop 2.2, we can avoid the failure of the entire system. In Hadoop 1.2, there is physically only one name node; if the name node fails, the system can no longer be used. In our big data management system, on the other hand, there are two name nodes: one is active and the other is standby. In particular, we installed and configured the Hadoop name node daemon on the master server and the Hadoop data node daemon on seven data nodes. For high availability, we set up a standby name node on the eighth data node. Overall, the Hadoop ecosystem is composed of data storage, data processing, data access, management, and applications. At the data storage level, there are HDFS and HBase. HDFS is a distributed file system, while HBase is a NoSQL (Not Only SQL) database based on HDFS. The MapReduce framework is in the data processing layer. Through this framework, programmers can write distributed programs in Java that are executed in parallel. However, Hadoop also lets users who are not familiar with Java access big data using Hive, Pig, and Avro, which are the data access modules. The map and reduce tasks are executed in parallel on multiple data nodes within the cluster system. At the management level, there are ZooKeeper and Chukwa. ZooKeeper provides a distributed configuration service, a synchronization service, and a naming registry for large distributed systems. Using ZooKeeper, the active and standby name nodes check the status of each other so that failover can be performed. When there is a fault in the active name node, the standby name node takes over its role. For that purpose, the two name nodes share edit logs via a shared disk, called the journal node, located between them. This failover process keeps the name node service running seamlessly. Meanwhile, Chukwa gives us a service to quickly search for target data. Finally, as applications, there are machine learning tools such as Mahout [12] and RHadoop for data mining and visualization, data warehousing tools, and so on.

We installed Hadoop Database (HBase) [2]. HDFS is sufficient if data is rarely updated in place and an application mainly uses read and append operations. However, HBase is needed when random updates are common. Unlike existing relational database management systems such as Oracle, IBM DB2, and MySQL, HBase is a non-relational database management system, also called a NoSQL database. This means that HBase does not support join operations. Instead, the aim of HBase is to sort large-scale data by time stamp and to quickly search for particular data. In addition, since NoSQL databases do not have a relational schema, it is easy to manage unstructured data such as text data. They have also been designed to satisfy two properties: partition tolerance and either consistency or availability. Therefore, NoSQL databases may not perfectly guarantee the integrity of data. However, storage space can be added to NoSQL databases without interrupting service, so big data is easily stored in the database without paying an enormous sum of money. Unlike other popular NoSQL databases such as Bigtable, Cassandra, MongoDB, and Redis, HBase is built on the Hadoop Distributed File System. In other words, data is actually stored in HDFS. In addition, HBase occupies a buffer space in memory. The buffer contains popular blocks that are frequently used by an application. If the application continues to issue update operations, HBase first goes to the buffer, finds the corresponding block, and then updates it. If the buffer is full, blocks are written to HDFS according to a buffer scheduling policy (e.g., Least Recently Used (LRU)).

The red box in Fig. 7 shows another cluster, which we constructed mainly for using Spark [4], an in-memory large-scale data processing engine, to reduce the parallel distributed processing time for big data analysis and to increase flexibility and usability. This cluster consists of three data nodes and one name node, which have the same hardware specification as the nodes in the first cluster. We installed Cloudera’s Distribution Including Apache Hadoop (CDH 5.4.2) [5] in the second cluster, including Spark [4], Sqoop [7], ZooKeeper, Hive, and so on.

Fig. 7. Clusters of the Big Data Management System
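The buffering behavior described for HBase in this subsection can be illustrated with a toy model. The sketch below is a simplification under stated assumptions, not HBase's actual implementation: the class name BufferedStore, its capacity parameter, and the dictionary that stands in for HDFS are all hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class BufferedStore:
    """Toy model: an in-memory buffer in front of a slower backing store.

    The dict `backing` stands in for HDFS; `capacity` is a hypothetical
    limit on the number of blocks held in memory.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = OrderedDict()   # block_id -> value, least recent first
        self.backing = {}             # stand-in for blocks persisted in HDFS

    def update(self, block_id, value):
        # An update goes to the buffer first and marks the block as most recent.
        self.buffer[block_id] = value
        self.buffer.move_to_end(block_id)
        # When the buffer overflows, evict the least recently used block
        # and write it to the backing store.
        while len(self.buffer) > self.capacity:
            old_id, old_value = self.buffer.popitem(last=False)
            self.backing[old_id] = old_value

    def read(self, block_id):
        # A buffer hit refreshes recency; a miss falls back to the backing store.
        if block_id in self.buffer:
            self.buffer.move_to_end(block_id)
            return self.buffer[block_id]
        return self.backing.get(block_id)


store = BufferedStore(capacity=2)
store.update("meter-1", 10.5)
store.update("meter-2", 7.2)
store.read("meter-1")            # meter-1 becomes the most recently used block
store.update("meter-3", 3.3)     # buffer is full: meter-2 is flushed
```

In this run the block for meter-2 is the least recently used one when the third update arrives, so it is the block written to the backing store, while later reads of it are still served from there.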

Spark was initiated mainly to improve on the performance of Hadoop by processing in memory instead of on disk. The basic configuration includes a SparkContext, a cluster manager, and worker nodes. Usually, the SparkContext requests the cluster manager to distribute jobs to the worker nodes. Commonly, worker nodes run on data nodes. Spark supports the Scala, Java and Python languages and utilizes the Resilient Distributed Dataset (RDD). An RDD is a partitioned, immutable dataset distributed across data nodes; it holds data loaded from external sources or generated by code. Spark keeps an RDD in memory using the persist and cache functions; cache is equivalent to persist with the MEMORY_ONLY option. We have set up the Spark cluster and are implementing APIs on it. After we develop the APIs on the Spark cluster, we plan to evaluate the system and compare its performance with that of the APIs on the Hadoop cluster.

B. Collection of the Power Big Data

We collect energy usage data every 15 minutes from many buildings and factories and have constructed a big data cluster system. The actual power data from the gateways of the smart meters are originally saved in a relational database, PostgreSQL [6], since there is no interface from the gateway to the big data system. Energy usage data are immutable, event-based time series fact datasets. A Relational Database Management System (RDBMS) is not suitable for them because an RDBMS is optimized for transactions (Create, Read, Update, Delete operations) and handling a huge dataset with an RDBMS is expensive. In addition, the power data is truly big data since it is huge in scale (Volume), streams in real time (Velocity), and is unstructured (Variety). Therefore, we needed to construct a big data system for the power data and to transfer the data to it. To move the data from PostgreSQL to HBase on the big data cluster, we use Sqoop [7] and Tajo [8], as shown in Fig. 8. Apache Sqoop is a tool for transferring bulk data between structured data stores such as relational database systems and the Hadoop system. Apache Tajo is a big data warehouse system that processes queries in parallel for performance. Tajo is used to count rows so that Sqoop transfers only new rows to HBase, avoiding redundancy. Fig. 8 explains the process of moving all the data from PostgreSQL to HBase without redundancy. We generated views of the tables to be transferred in order to add a row count attribute. In addition, we ran Tajo on HBase to count the rows already transferred and let Sqoop transfer only newly generated rows. We used Tajo because counting is much faster with it: it takes a few minutes for HBase to count 1,000,000 rows, but less than one minute using Tajo. We need to run this program module (JAR file) periodically since power data is generated continuously.

Fig. 8. Transfer power data to the Big Data System

For some applications such as SNS DR, we need near real-time power data, so we generate fake real-time power data using Internet of Things (IoT) devices to test those applications. Here we used a BeagleBone Black (BBB) [9] to generate virtual power data in oBIX 1.1 [10] format, since the NVPP middleware uses the oBIX 1.1 format for all the power data. Fig. 9 shows the process of the virtual power data generation. First, the BBB generates virtual power data in oBIX 1.1 and Node.js reads them. Node.js then sends the virtual data to the real-time view server using Socket.IO and to the RabbitMQ [11] server using the Advanced Message Queuing Protocol (AMQP).

Fig. 9. Virtual Power Data Generation

C. Data flow and APIs

We analyzed the power data to get useful information for various services by means of baselining, profiling and segmentation. Baselining is for understanding patterns of energy usage, and profiling is for modeling customers’ demographics, psychographics, and behaviors. In addition, segmentation is used to group energy usages for predictive analytics. We performed statistical analysis with MapReduce and demand prediction with machine learning tools such as Mahout [12].
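The map-reduce style of statistical analysis mentioned above can be sketched in plain Python. The example computes an hourly load profile per meter, which is one simple form of baselining. The record layout (meter id, timestamp, demand in kW) and the function names are assumptions made for illustration; the actual jobs run on the Hadoop cluster and are not shown in this paper.

```python
from collections import defaultdict
from datetime import datetime


def map_reading(record):
    """Map step: emit ((meter_id, hour_of_day), kw) for one 15-minute reading."""
    meter_id, timestamp, kw = record
    hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").hour
    return (meter_id, hour), kw


def reduce_mean(values):
    """Reduce step: average all values that share a key."""
    return sum(values) / len(values)


def hourly_profile(records):
    # Shuffle step: group mapped values by key, as a MapReduce framework would.
    groups = defaultdict(list)
    for key, value in map(map_reading, records):
        groups[key].append(value)
    return {key: reduce_mean(values) for key, values in groups.items()}


# Hypothetical readings: (meter id, timestamp, demand in kW).
readings = [
    ("m1", "2015-04-01 09:00", 10.0),
    ("m1", "2015-04-01 09:15", 14.0),
    ("m1", "2015-04-02 09:00", 12.0),
    ("m2", "2015-04-01 09:00", 5.0),
]
profile = hourly_profile(readings)
```

Because the map and reduce steps only depend on one record or one key at a time, the same logic could be distributed over many data nodes; here everything runs in a single process to keep the sketch self-contained.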
Fig. 11. Configuration of API Service Module

We also developed APIs for 3rd party app developers to build services and applications. We developed two kinds of APIs: one for accessing the big data and the other for developing applications. We call the former the Big data API and the latter the App API. Big data APIs consist of APIs for Customer Baseline (CBL) computation, portfolio optimization, energy efficiency, energy loss, power network configuration, etc., while App APIs include settlement, reporting, monitoring, event, user management, etc. Both kinds of APIs are based on open standards and support both SOAP (Simple Object Access Protocol) and REST (Representational State Transfer). Managers handle API registration and distribution, and analyze API usage. Fig. 10 presents all the data flows and API services in the entire platform.
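As an illustration of what a CBL computation API might return, the following sketch assumes a simple rule: the baseline for each metering interval is the mean of the same interval over the most recent days. This rule, the function name customer_baseline, and the n_days parameter are assumptions for illustration only; the method actually used by the platform is not specified here.

```python
def customer_baseline(history, n_days=3):
    """Estimate a baseline load profile from past daily profiles.

    `history` is a list of daily profiles, oldest first; each profile is a
    list of demand values (kW), one per metering interval.
    Assumption: the baseline is the per-interval mean of the last `n_days`.
    """
    if not history:
        raise ValueError("history must contain at least one day")
    recent = history[-n_days:]
    intervals = len(recent[0])
    if any(len(day) != intervals for day in recent):
        raise ValueError("all daily profiles must have the same length")
    return [sum(day[i] for day in recent) / len(recent) for i in range(intervals)]


# Hypothetical data: four days with four intervals each (kW).
history = [
    [90.0, 95.0, 99.0, 91.0],   # oldest day, ignored when n_days=3
    [10.0, 20.0, 30.0, 20.0],
    [12.0, 22.0, 33.0, 20.0],
    [14.0, 24.0, 36.0, 20.0],
]
cbl = customer_baseline(history, n_days=3)
```

A production baseline would typically also exclude event days and holidays; those refinements are left out to keep the example short.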
Fig. 10. Data Flow and APIs Services in the NVPP Platform

The configuration of the API service module is described in Fig. 11. All the power and non-power data is stored in HBase on the Hadoop cluster. HBase has its own RESTful services and APIs to support the services. However, they are quite complex and require complicated Java code. We utilized the RESTful service and APIs of HBase to write simple API modules for NVPP to support external usage. In order to simplify the development of RESTful Web services, we used the Jersey framework [13] managed by Oracle. Jersey is an open source framework for developing RESTful Web services in Java. Each NVPP API function has its own HTTP URL via Tomcat and Jersey, so clients can send their requests to the NVPP API services.

IV. MONITORING SERVICE – DASH BOARD
By utilizing the APIs to access and analyze the big data described in Section III, we developed many killer applications. In this section, we present one of the basic services, the Dash Board, a monitoring service that presents all the important power data analyzed from temporal and spatial points of view. Fig. 12 presents the Dash Board service in the Network Operation Center (NOC) in our research lab, which is composed of 12 monitors divided into 3 sections. The left section presents the spatial monitoring view, which shows the demand resource generation status on a map. The larger picture in Fig. 13 shows it in detail. The red points on the map represent customers that are generating demand resources, and the green points represent customers that are ready to generate. The white points denote customers that cannot generate, and the grey ones represent unconnected customers. The middle section of the NOC shows the generation status of a demand side management business operator, covering all of its customers. The upper graph in the detailed view in Fig. 14 shows the total generation amount of the selected business operator. The blue line shows the CBL and the green line shows the amount of power usage. The red line represents the generation amount of the demand resource. The circle diagrams below it in Fig. 14 describe the generation status of each customer belonging to the operator. Lastly, the right section displays the generation status of all business operators on a nationwide scale using the heat-map view. The details are shown in Fig. 15. The graph on top shows the sum of the generation amounts of all operators. The heat map below shows which resource is the most valuable in terms of the amount of power generation; the resource shown in red is the one that performs best. Some of those data become sources for big data analysis used by various algorithms such as portfolio optimization and energy efficiency. Other data are the output of the analysis or related algorithms such as CBL calculation.
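The relation between the three curves of the operator view (CBL, power usage, and generation amount) can be sketched as follows. The sketch assumes that the generated demand resource in an interval is the amount by which usage falls below the CBL, floored at zero, and that the operator total is the sum over its customers. This formula is an assumption that is not stated in the text, and the function names and sample values are hypothetical.

```python
def generation(cbl, usage):
    """Per-interval demand resource: baseline minus usage, floored at zero."""
    if len(cbl) != len(usage):
        raise ValueError("cbl and usage must cover the same intervals")
    return [max(b - u, 0.0) for b, u in zip(cbl, usage)]


def operator_total(customers):
    """Sum the per-interval generation of all customers of one operator."""
    per_customer = [generation(c["cbl"], c["usage"]) for c in customers]
    return [sum(values) for values in zip(*per_customer)]


# Hypothetical customers of one business operator (values in kW).
customers = [
    {"cbl": [100.0, 100.0, 100.0], "usage": [100.0, 80.0, 110.0]},
    {"cbl": [50.0, 50.0, 50.0], "usage": [45.0, 40.0, 50.0]},
]
total = operator_total(customers)
```

Flooring at zero means that a customer who consumes more than the baseline contributes nothing rather than a negative amount; whether the platform treats over-consumption this way is not specified.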

Fig. 12. Monitoring Service – Dash Board

Fig. 13. Spatial monitoring view

Fig. 14. Demand side management business operator’s generation status

Fig. 15. All the business operators’ generation status using the heat-map view

V. RELATED WORK
The authors in [14] presented a scalable and reliable data middleware layer for smart grids. By tailoring the system specifically to smart grids, they eliminated much of the overhead while still keeping the implementation effort reasonable. That was achieved by using a log-structure-inspired architecture to directly access the block device layer, eliminating the indirection incurred by high level file system interfaces. Ilic et al. [15] proposed a system that takes advantage of existing assets and smart grid services, and enables facility management to actively adjust its energy consumption/production behavior as seen by external stakeholders, while adhering to its internal goals and strategies. The authors’ main contribution is to provide the architecture and concept of energy efficiency in the smart grid era. In [16], the authors introduced a device-oriented IoT energy management system. Household appliances can be added to the IoT system with device recognition technology without any additional identification devices. They presented a management service layer for recognizing the household appliances currently in use, which not only establishes communication services among various appliances, but also deduces human activities for context data using Naive Bayes from the electric appliances in use and the variation of their states. In [17], the authors conducted experiments to measure the performance of different levels of energy data aggregation. Thousands of smart meters were aggregated using energy readings collected in a real-world trial. Using a selected data set, the performance of a traditional database system was compared to the emerging column-based approach in order to assess their suitability for real-time analytics in such scenarios. The main contribution of the authors is to show that the in-memory column-based database system is better suited to aggregating energy data. The authors in [18] investigated the role of distributed storage in residential areas as well as a means of creating groups of prosumers that feature better forecast energy behavior. In their work, the high impact of the storage was demonstrated, even for a simple load forecasting algorithm. Much other research has been done on energy big data systems, and several companies [19, 20] provide services on

energy big data analysis. However, as far as the authors are aware, this is the first paper to gather and analyze real, massive power data using a big data management system and to develop a business platform that provides APIs to access the big data for the development of 3rd party applications. We also developed useful services using the APIs, including the monitoring service for demand management.

VI. CONCLUSION AND FUTURE WORK
We developed the NVPP Business Platform based on ICT and constructed the energy data analytics platform based on the Hadoop Big Data Management System. The power data is collected by transferring it from a relational database or by generating it from IoT devices. We also utilized open source software such as Sqoop [7], Tajo [8] and Mahout [12] to perform those jobs. In addition, we implemented various APIs for developing 3rd party applications and, using those APIs, built a few main services such as AutoDR, MBCx and the Dash Board.

Future work includes exploiting other effective open source software to perform and scale better by integrating advanced technologies into our system. In particular, we are planning to use Kafka [21] instead of RabbitMQ to collect streaming power data from various distributed resources. Kafka is a high-throughput distributed messaging system known for fast operation and high scalability. Meanwhile, we installed and tested Spark [4] to get fast responses to API calls. Although we have made some progress, as further future research we will actively utilize Spark, and more aggressively use it together with Solid State Disks (SSDs), to exploit memory and SSD as a cache. We will evaluate the platform using Spark and compare its performance with that obtained using only MapReduce. In addition, we plan to utilize the big data management system for an integrated heat demand management platform for district heating users based on IoT, which is under development. Since the challenges are similar in both platforms, we believe that it is possible to exploit the big data management system in the integrated heat demand management platform to evaluate and validate our big data system.

ACKNOWLEDGMENT
This work was supported by the Energy Efficiency & Resources Core Technology Program of the Korea Institute of Energy Technology Evaluation and Planning (KETEP) granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (No. 20132010101800 and No. 20152010103160).

REFERENCES
[1] Apache Software Foundation, “Hadoop 2.2”, http://hadoop.apache.org/ (2014)
[2] Apache HBase, http://hbase.apache.org/
[3] Global Energy Partners, LLC, “Types of DR Participation in Organized Wholesale Markets in the U.S. and Load Aggregators Business Model”, 2011
[4] Apache Spark, http://spark.apache.org/
[5] Cloudera’s Distribution Including Apache Hadoop (CDH), http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html
[6] PostgreSQL, http://www.postgresql.org/
[7] Apache Sqoop, http://sqoop.apache.org/
[8] Apache Tajo, http://tajo.apache.org/
[9] BeagleBone Black, http://beagleboard.org/BLACK
[10] oBIX version 1.1, Working Draft 06, June 2010, https://www.oasis-open.org/committees/download.php/38212/oBIX-1-1-spec-wd06.pdf
[11] RabbitMQ, https://www.rabbitmq.com/
[12] Apache Mahout, http://mahout.apache.org/
[13] Jersey, https://jersey.java.net/
[14] Yin, J., Kulkarni, A., Purohit, S., Gorton, I., and Akyol, B., “Scalable real time data management for smart grid”, Middleware Industry Track Workshop 2011
[15] Ilic, D., Karnouskos, S., Silva, P., and Detzler, S., “A system for enabling facility management to achieve deterministic energy behaviour in the smart grid era”, International Conference on Smart Grids and Green IT Systems (Smart-Green 2014), 2014
[16] Lai, C., Lai, Y., Tianruo, L., and Chao, H., “Integration of IoT energy management system with appliance and activity recognition”, IEEE International Conference on Green Computing and Communications (GreenCom 2012), DOI 10.1109
[17] Ilic, D., Karnouskos, S., and Wilhelm, M., “A comparative analysis of smart metering data aggregation performance”, IEEE International Conference on Industrial Informatics, Bochum, Germany, July 2013
[18] Ilic, D., Karnouskos, S., and Silva, P., “Improving load forecast in prosumer clusters by varying energy storage size”, IEEE PowerTech 2013, June 2013
[19] AutoGrid, http://www.Auto-Grid.com
[20] C3 Energy, http://c3energy.com
[21] Apache Kafka, http://kafka.apache.org/
