
2018 IEEE International Conference on Big Data (Big Data)

Distributed Big Data Mining Platform for Smart Grid
Zhixiang Wang∗† , Bin WU∗† , Demeng BAI‡ and Jiafeng QIN‡
∗ Key Laboratory of Intelligent Telecommunications Software and Multimedia
† Beijing University of Posts and Telecommunications, Beijing, China
‡ Electric Power Research Institute, Shandong Power Supply Company of State Grid, Jinan, China

Email: wangzhixiang513@gmail.com,wubin@bupt.edu.cn

Abstract—With the rapid development of information technology and the internet, data in all kinds of industries has exploded, making it difficult to analyze and mine useful information from big data. Traditional analysis systems face bottlenecks of performance and scalability in big data processing, so the research and development of novel and efficient big data analysis and mining platforms has become a focus for many organizations. Along with the development of the smart grid, power data with the characteristics of the power industry needs more targeted and efficient data mining analysis. In this paper, aiming at the shortage of existing work, we propose a distributed big data mining platform based on distributed system infrastructure such as Hadoop and Spark. The platform develops and implements a variety of fast, highly parallel mining algorithms with Spark and Tensorflow, covering machine learning, statistical analysis, deep learning and so on. Using OSGI technology to build a low-coupling component model, the platform improves the reusability of algorithm components, introduces a workflow engine and a user-friendly GUI, reduces the complexity of user operations, and supports user-defined data mining tasks. For the characteristics of smart grid big data, the platform develops and improves dozens of algorithm components for data processing and analysis, and its scalable algorithm and component libraries greatly improve the extensibility of the platform for processing smart grid data. Our platform has already been launched in a State Grid company, satisfying the demand of various smart grid data analysis businesses.

Keywords-Parallel; Data Mining; Components; Spark; Workflow

I. INTRODUCTION

Along with the rapid development of modern information technology and the explosive growth of global data, big data has become one of the most important impetuses for nations, enterprises and even the efficient and sustainable development of society. The world has entered the era of big data.

Effective data analysis cannot do without the support of data processing tools and machine learning platforms. Traditional analytical systems are based on OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing) systems. These systems perform quite well in data analysis, but due to the limitation of the stand-alone operation mode, data processing at the big data level reveals defects such as long processing time and insufficient performance.

Since the release of the formal 1.0.0 version of Hadoop in 2011, distributed processing systems, represented by Hadoop and Spark, have drawn the attention of researchers and been widely used, and data mining platforms have achieved the transition from data analysis to big data mining. With the excellent performance that deep learning has exhibited in recent years in image analysis, speech recognition and target detection, the new goal of data mining analysis systems is to apply and integrate deep learning algorithms. At present, based on distributed computing frameworks such as Hadoop and Spark, and deep learning frameworks such as Tensorflow and Caffe, various general-purpose big data analysis and mining platforms have been developed. However, the algorithms implemented in their analysis libraries are aimed at general-purpose data and lack the ability to make good use of the characteristics of industry data. In the face of professional industry data, it is difficult for them to deal with frequent changes in complex application business and decision-making needs. Therefore, the development of big data mining platforms for domain business data has become the focus of further research.

Power big data resources are similar to but different from other business data. They are large and complex data resources, including internal data such as power grid operation data, equipment inspection data, enterprise marketing data and power enterprise management data, as well as external data related to the grid, such as weather data and national economic operation data. To maximize the potential value of these massive power data resources and make full use of them, it is necessary to obtain comprehensive information for power resource allocation decisions through comprehensive data analysis combined with analysis of industry characteristics. Moreover, power data analysis involves many aspects of power production and operations, and people in different departments often pay attention to specific analysis functions, resulting in the need for a large number of analysis modules. Existing Chinese power data analysis platforms perform well in the targeted processing of power business, but due to their technical frameworks, the speed of data mining analysis cannot keep up with the rapidly increasing power big data.

This article aims to remedy these deficiencies of existing data analysis platforms. Based on Spark, Hadoop, YARN and other frameworks, it proposes a distributed data mining platform oriented to power big data. The platform

978-1-5386-5035-6/18/$31.00 ©2018 IEEE


integrates distributed deep learning algorithms and also supports statistical data analysis and traditional machine learning. The platform introduces a workflow mechanism that provides a DAG (Directed Acyclic Graph) abstraction of the data flow for the big data analysis process, which reduces the development effort and the complexity of user operations and increases the degree of component reuse. Modular building with OSGI technology is adopted to decrease the coupling between modules and simplify function expansion. At the same time, in view of the complex and highly professional applications of the electric power industry, the platform integrates and improves the corresponding data analysis algorithms for power data. Besides, to support frequently changing decision needs, it provides an easily extended modular structure and an interface for extending the algorithm library.

The remainder of the paper is organized as follows. In Section II, we briefly introduce the main related work. Section III presents our improvements in parallel deep learning. Section IV elaborates the details of the parallel distributed power big data mining platform. The performance of the platform's parallel algorithms is given in Section V. Section VI shows the features of the platform. Section VII describes the application of the platform to power big data. Finally, we make a summary in Section VIII.

II. RELATED WORK

The traditional analysis systems are OLAP and OLTP systems based on structured data. For the task of data mining, WEKA [1] provides a large number of machine learning algorithms and an interactive visual interface based on JAVA. RapidMiner [2] is also an open-source data analysis tool; it provides ports to define each step of an operation in order to access and output structured data of different meanings, and supports connecting any two ports to each other. The connections between different interfaces, which are strictly regulated by KNIME [3], reduce errors in user-defined procedures.

Building on the inherent efficiency of distributed processing system design, researchers have developed their own big data mining platforms and tried to introduce deep learning. The Kloud [4] big data platform designs a secondary index for Cassandra schema-less tables based on HIVE, making conditional queries on schema-less tables more effective. Based on Hadoop, BC-PDM [5] develops distributed parallel ETL (Extract, Transform and Load) algorithms, greatly shortening the time of data preprocessing. The SAMOA [6] platform offers a new online mining architecture to rapidly process real-time streaming data. PDMiner [7] flexibly combines analysis algorithms to achieve different mining tasks through a four-layer modular and scalable architecture. ETHINK [8] provides users with an easy-to-use, fast and efficient visual data analysis platform based on Spark. Xiaomi also proposes Minos [9], an open-source data collection and processing platform that improves Hadoop cluster management with its own monitoring system and thereby the overall computing performance of the platform. The Institute of Computing Technology of the Chinese Academy of Sciences proposes a highly reusable distributed analysis engine, named BDA [10], which fuses heterogeneous computing resources with a generic big data analysis algorithm library and reduces the complexity of applying big data analysis. BDAP [11] combines workflow and scheduling-flow technology, freely combines various low-coupling operational modules with front-end visualization, and builds data analysis processes that are easy to maintain and extend. Baidu's deep learning platform PaddlePaddle [12] and Google's open-source Tensorflow framework [13] both provide rich APIs to support massive data training and multi-machine parallel computing, but researchers need to write and run the algorithms themselves. Tencent's DI-X (Data Intelligence X) platform [14] introduces deep learning components that allow users to visually access deep learning algorithms on cloud GPU and CPU servers, effectively lowering the barrier for algorithm engineers and data scientists; however, it has not achieved good results in multi-machine parallel acceleration. Alibaba's Digital-Plus platform [15] provides multiple business function tools, greatly reducing the workload of data-related tasks such as modeling and ETL, and combines tools and engines to provide data services for various industries.

When the above-mentioned general big data platforms face the data analysis needs of a professional industry, they are often not fully applicable and need to be adapted to the industry data. Aiming at the internet of things, a three-tier general big data mining platform based on devices and cloud computing, named UniMiner [16], enables users to perform all data mining operations locally. The big data analysis platform for semiconductor manufacturing [17] exploits the characteristics of the semiconductor manufacturing industry: for the preprocessing of semiconductor data, many ETL algorithms have been integrated to fit the upper-layer business logic.

Like other industries, the development of general power data analysis platforms always follows the complex requirements of power data. A service based on distributed OSGI [18] provides electricity metadata, supporting arbitrary expression and exchange formats for power data. Based on the Portlet framework, the power data analysis platform of [19] uses the MVC pattern and Portlet technology to design a component library containing flexible and scalable mining and business analysis algorithms, and realizes component development and improvement. A distributed power big data calculation and analysis platform based on Hadoop and Spark [20] integrates improved machine learning and statistical analysis algorithms and provides a UI-customizable analysis engine for the user's power big data, supporting the reuse of analysis solutions. While these power data analysis platforms handle the electric power business with good processing performance, the data mining analysis speed of their technical frameworks cannot satisfy the present situation of rapidly growing power big data.

III. PARALLEL DEEP LEARNING NETWORK

A. Deep Learning Network

The basic unit of a neural network is the neuron model, which includes input, output and a computational function. A neuron model is expressed as (1):

Y_j = f(Σ_{i=1}^{n} X_i W_{ij} + B_j)   (1)

where Y_j represents the jth output of the neuron model, X_i represents the ith input element, W_{ij} represents the weight connecting the ith input to the jth neuron, and B_j represents the bias of the jth neuron.

Multiple neurons are combined to form a layer of the neural network structure, and multiple layers are stacked to form a specific neural network. With the pre-training method proposed by Hinton to alleviate the local-optimum problem in neural networks [21], the hidden layers were deepened to 7 layers and the neural network was promoted to a true deep neural network.

Deep neural networks often connect the lower-layer and upper-layer neurons in a fully connected form, which leads to an excessive number of parameters and makes it easy to fall into local optima. The CNN (Convolutional Neural Network) [22], which can reduce the number of free parameters in the network, is a more suitable neural network structure. CNN is inspired by studies of visual cortical electrophysiology in biology. By introducing convolutional layers, CNN uses convolution kernels as intermediaries between neurons and shares parameter weights, which greatly simplifies the model and reduces its parameters.

At the same time, a deep feed-forward neural network cannot model changes at the time-series level, so its accuracy on time-series applications such as natural language processing and speech recognition is relatively low. The RNN (Recurrent Neural Network) acquires historical temporal information by feeding the output of a neuron back as an additional input at the next time step.

B. LeNet-5 and LSTM Parallelization

The LeNet-5 network proposed by Y. LeCun [23] is one of the most classic convolutional neural networks, with 7 layers. Each layer contains trainable connection weights and adopts a weight-sharing strategy between layers. It requires multiple rounds of iterative calculation, and the sharing of parameter weights between layers provides basic conditions and optimization space for distributed computing. Meanwhile, the long distance between neurons in an RNN leads to the long-term dependence problem, which prevents it from learning and using remote information. Based on improvements to the original RNN unit, the LSTM (Long Short-Term Memory) network structure [24] remembers long-term information and can avoid the long-term dependency problem of RNN.

This paper selects the memory-based distributed computing engine Spark together with Tensorflow as the implementation framework for the LeNet-5 and LSTM networks. Distributed computing accelerates the training, while sharing the training parameters improves the training precision and reduces the training loss. In each round of training, one node in the Spark platform is selected as the parameter server, and the other computing nodes perform model training on their assigned data fragments to obtain the model parameter variation ∆α. The parameter server receives the parameter variations calculated by the computing nodes, updates the model parameters and the copies of the model parameters on each computing node, and starts a new round of training until training is completed.

The parallel training of the deep neural network model is shown in Alg. 1:

Algorithm 1 Parallel Deep Neural Network Model Training Based on Spark
Input: Training data set
Output: Trained LeNet-5/LSTM network model
1: Obtain the training data set, determine the data fragment size and the number of partitions, and initialize the model training parameter α.
2: Distribute the data fragments and copies of the model training parameter to each computing node.
3: Each computing node extracts one of its allocated data fragments for network model training.
4: The parameter server receives the model parameter variation ∆α from each computing node.
5: Update the central model parameters and the copies of the model parameters on each computing node, and adjust the network model using the back-propagation mechanism.
6: Repeat from step 2 until training is complete.

Fig. 1 shows a schematic diagram of training the LeNet-5 network model in parallel on Spark.

[Fig. 1: a parameter server initializes α; each computing node runs a copy of LeNet-5 (convolutional layers C1, C3, C5, pooling layers S2, S4, fully connected layer F6, output layer) on its data fragment and returns ∆α_i; the server aggregates Σf(∆α_i), updates the copies α′, and repeats until training is finished.]

Fig. 1. The schematic of parallel LeNet-5 model training based on Spark.

Fig. 2. The overview and architecture of the platform.

IV. PLATFORM OVERVIEW

The distributed parallel data mining platform for power big data proposed in this paper adopts a scalable and easily extended five-layer architecture, as shown in Fig. 2. Covering data visualization, data management and execution monitoring, it deconstructs the characteristics of power big data analysis and provides effective and convenient analysis services.

A. Data Process

The data process layer deals with raw power data originating from different data sources, such as GIS data and EMS data. To process these varied large data sets, the parallel platform adopts MapReduce and implements more than 50 parallel distributed ETL algorithms, which are robust and efficient enough to meet various types of data processing demands. At the same time, tools such as Sqoop are provided to transfer the data extracted from the original data systems to the distributed storage system, which also improves storage performance and security through a multiple-backup strategy.

B. Infrastructure

The infrastructure of the platform consists of two modules. One is the data warehouse module, consisting of NoSQL, HDFS, Alluxio, etc.

The data warehouse module stores structured and unstructured data in data partitions on disk. For the data stored in the module, the parallel platform defines unified metadata information, composed of the data storage type, storage location, storage amount and other information. To support the whole data warehouse module, the metadata is also stored in HDFS.

To meet the demand of the various upper-layer algorithms for quickly reading and writing data, the data warehouse module also combines the in-memory distributed file system Alluxio with HDFS to provide data storage in memory or other storage facilities in the form of files. This service provides a reliable data sharing layer for the upper distributed computing frameworks, while reducing redundant storage and resource recovery time.

The other module is the computational processing framework module, which implements a hybrid multi-computing framework composed of Spark, Hadoop, Tensorflow, etc.

Each computing framework has its own resource management system. To realize overall management and scheduling of the hybrid computing framework, the parallel platform implements a unified resource management module based on Hadoop's YARN resource management framework. At the same time, the TensorflowOnSpark framework is introduced to dock the Tensorflow framework with the Spark computing framework. The module also provides unified resources for the upper-layer applications and avoids conflicts between resource allocations.

C. Parallel Algorithms

The parallel algorithm layer is the core of the distributed parallel mining platform. The algorithms on the parallel platform are mainly based on Spark, MapReduce and Tensorflow. By improving the parallelism of the calculations in these algorithms and designing new calculation processes, dozens of parallel algorithms with a high degree of parallelism and high computational efficiency have been developed. The layer also implements the commonly used distributed parallel machine learning algorithms and various types of ETL algorithms for the data preprocessing modules. At the same time, commonly used

deep learning algorithms, such as the CNN, RNN, LSTM and Bi-LSTM networks, are also implemented. The coupling between the algorithms is low, and an algorithm can call other algorithms through their reserved interfaces, which makes it convenient to improve an algorithm's operation by this method.

For the special services involved in power system data and the unique features of power data, general data mining analysis algorithms can hardly do the trick. To handle the multi-timing, complexity and specialized power-equipment characteristics of power data, more than 20 special algorithms have been developed and added to the platform, such as chromatographic differential warning and time series prediction. At the same time, an algorithm extension interface is provided: the algorithm library can append targeted algorithms and evolve the original algorithms according to the requirements of the business services.

D. Integrated Components

Industry business applications and decision-making needs often change frequently, and different business mining analyses look at different dimensions of the data. The dimensions of power data mining analysis are accompanied by complicated power service characteristics. For example, transformer oil chromatographic data in power data contains gas content data such as CH4 and O2. Non-power-industry personnel may pay attention to the curve and trend of each gas content, while power industry personnel pay attention to whether the amounts of some important gases, such as CH4, CO2 and H2, exceed their thresholds, and to the correlation between each gas content and transformer failure.

For these complex demands, the parallel platform builds a component-based, service-oriented development mechanism and runtime environment, shown in Fig. 3. It decomposes the required development functions into multiple component sets. The information of the component sets is transferred in the form of a DAG. The workflow engine in the platform analyzes the DAG and uses OSGI (Open Service Gateway Initiative) services to execute scheduling operations, such as start-stop, update and uninstallation, of the corresponding functional components. Thanks to this mechanism, the business function application is highly dynamic.

[Fig. 3: the user creates a task in Studio; the task is parsed into a DAG and transferred as XML; the DAG is resolved and OSGI service interfaces are called to execute the components, with waiting and exception processing until all tasks in the DAG finish.]

Fig. 3. The flow chart of scheduling components.

Each component calls the interface of the parallel algorithm layer and obtains the relevant algorithms to implement its operating logic. The platform adopts the OSGI framework to wrap the corresponding operating logic and encapsulate it into the most basic module form, named a bundle. A bundle also provides interfaces to be called and is integrated into the workflow engine. The flow is shown in Fig. 4.

[Fig. 4: the parallel algorithms layer (ETL, clustering, classification, regression and other algorithms) is called through its interfaces; algorithms and functions are built and registered as OSGI services, packaged into bundle modules, and offered as component interfaces integrated into the workflow engine.]

Fig. 4. The flow chart of integrating components.

Each integrated component provides interfaces, supported by OSGI technology, through which the administrator can update and expand the component's functions. The administrator can modify the original components according to the business requirements without affecting other parts, and improve the operating logic to support the business analysis.

E. Application Service

The parallel distributed power big data mining platform is a cloud computing web service application. The user can access and operate the parallel platform through the browser to execute data analysis, while the computing takes place at the cloud computing service nodes.

Fig. 5. The user interactive interface: Studio
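The scheduling flow of Fig. 3 (parse the user's task into a DAG, invoke each ready component through a service call, and stop the data flow when a node fails) can be sketched as follows. The components and the shared context dict are hypothetical stand-ins for the platform's OSGI bundles, used only to make the control flow concrete.

```python
from graphlib import TopologicalSorter

# Hypothetical components; in the platform each would be an OSGI bundle
# invoked through its registered service interface.
def load(ctx):    ctx["data"] = [3, 1, 2]
def clean(ctx):   ctx["data"] = sorted(ctx["data"])
def analyze(ctx): ctx["result"] = sum(ctx["data"])

components = {"load": load, "clean": clean, "analyze": analyze}
# DAG parsed from the user's workflow: node -> set of predecessor nodes.
dag = {"clean": {"load"}, "analyze": {"clean"}}

def run_workflow(dag, components):
    ctx = {}
    for name in TopologicalSorter(dag).static_order():
        try:
            components[name](ctx)  # the "call OSGI service" step
        except Exception as exc:
            # A failed node stops the data flow and records the cause,
            # mirroring the failure reporting described for Studio.
            ctx["failed"] = f"{name}: {exc}"
            break
    return ctx

ctx = run_workflow(dag, components)
print(ctx["result"])  # → 6
```

`graphlib.TopologicalSorter` is in the Python 3.9+ standard library; it yields each node only after all of its predecessors, which is exactly the ready-to-run ordering a workflow engine needs.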

The user interacts with the parallel platform through the interactive interface Studio, shown in Fig. 5. The operations of users are constructed in the form of a workflow. The workflow is built as a data flow diagram in which each component is a node and the data interaction between components connects the nodes. The workflow treats the entire data mining analysis process as data flowing and being converted in a data channel. Each time the data flows through a component node, it is converted into the corresponding data. If a node in the data channel fails or the data fails to be processed, the data flow no longer continues to the subsequent nodes, and the cause of the task failure is reported. The output of each successful node can be viewed. With these features, the user can clearly understand the analysis procedure of the data, any process interruption and its cause, which helps the user to solve the problem and continue the previous data analysis operation.

Besides, the parallel platform provides real-time task monitoring, which can be used to view the running status of the currently submitted data analysis process and to know the processed component nodes and their states, to assist the user.

In addition to the basic data mining functions, the parallel platform also provides various other functional operations, such as scheduling flows, text mining, social network analysis, deep learning and web reporting, so that users can process data analysis from a multi-dimensional perspective.

V. EXPERIMENT AND PERFORMANCE OF ALGORITHMS

This section demonstrates the performance of the distributed parallel algorithms of the platform. The experiment environment is a Spark on YARN cluster consisting of 32 nodes, one of which is the Master node while the rest are Workers. The cluster is equipped with Spark 1.5.1, Hadoop 2.6.0, Tensorflow 1.3.0 and TensorflowOnSpark 1.2.1. Each node is created by OpenStack on Dell R720 servers and has 12 CPU cores at 2.10GHz, 48GB of memory and 1000GB of storage.

A. Performance Of Parallel Deep Neural Network

The parallel platform adopts the aforementioned Hadoop and Spark cluster and performs the parallel optimization of Alg. 1 on the LeNet-5 and LSTM networks based on the Spark and Tensorflow computing frameworks. Two comparative experiments are carried out: 1. comparing the runtime of the stand-alone deep learning networks with the parallel ones; 2. comparing the speedup of the parallel deep neural networks for various numbers of partitions.

Fig. 6. The speedup of parallel LeNet-5 in various size of partitions.

Fig. 7. The speedup of parallel LSTM in various size of partitions.

Fig. 8. The comparison result between stand-alone and parallel LeNet-5.

Fig. 9. The comparison result between stand-alone and parallel LSTM.

Fig. 6 and Fig. 7 show that the parallel networks effectively reduce the runtime as the number of partitions increases. Fig. 8 and Fig. 9 show that the stand-alone networks run faster than the parallel ones on small data sets because of the communication cost of parallelization, while the parallel networks are more efficient on large data sets, where the computational gain far outweighs the communication cost.

The above results show that the runtime of the parallel neural networks is significantly reduced as the number of partitions increases. They have a significant advantage over the stand-alone deep neural networks under large data volumes, and the method of Alg. 1 is proved scalable and efficient.
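For reference, the speedup plotted in Fig. 6 and Fig. 7 is the usual ratio of stand-alone runtime to parallel runtime. The runtimes below are made-up numbers for illustration, not the paper's measurements:

```python
def speedup(t_standalone: float, t_parallel: float) -> float:
    """Speedup = stand-alone runtime / parallel runtime."""
    return t_standalone / t_parallel

def efficiency(t_standalone: float, t_parallel: float, nodes: int) -> float:
    """Speedup per node; 1.0 would be ideal linear scaling."""
    return speedup(t_standalone, t_parallel) / nodes

# Illustrative numbers only: parallelism pays off when the computation
# outweighs the communication cost, as in the large-data-set runs.
s = speedup(1200.0, 400.0)
e = efficiency(1200.0, 400.0, 4)
print(s, e)  # → 3.0 0.75
```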

TABLE I
COMPARISON RESULT BETWEEN SOME ALGORITHMS OF THE PLATFORM AND MLLIB

Data    BP                 NaiveBayes         Linear Regression
Size    Platform  MLlib    Platform  MLlib    Platform  MLlib
0.5G    108s      168s     30s       30s      3501s     3558s
1G      256s      289s     42s       59s      7010s     7204s
5G      781s      1316s    99s       99s      23264s    24009s
10G     1489s     2877s    209s      310s     39657s    42591s

B. Performance Of Parallel Machine Learning Algorithms

In addition to the parallelized deep neural networks, the parallel platform provides parallelized ETL and machine learning algorithms to support a wider range of power data analysis services and efficiently process big data in the power industry. The platform also introduces some of the MLlib algorithms and improves their operating logic. Experiments are carried out on classical data sets of different scales, and the experimental results are shown in Table I.

VI. PLATFORM FEATURES

A. Parallel Algorithms Library

The parallel platform integrates Spark, TensorflowOnSpark and other computing frameworks and develops parallel deep learning algorithms. It realizes data parallelism, parallel model training and iterative averaging of model parameters, forming a central model that runs the algorithm in a distributed CPU environment. This greatly decreases the training runtime of deep neural networks and improves the ability to process large power data.

The parallel platform implements and provides dozens of parallel machine learning and ETL algorithms for the different needs of power data analysis services. Based on Spark, Hadoop and other frameworks, the traditional analysis algorithms are reconstructed; distributed computing is adopted to realize distributed parallelized machine learning and data analysis algorithms, which complements the machine learning library MLlib provided by the Spark computing framework. The platform also improves the operating logic of some algorithms in the MLlib library. For iterative calculation algorithms, such as DBSCAN and CLARA, it uses the in-memory computing and in-memory data storage functions provided by Spark: the calculation is moved into memory, which greatly reduces the data access time. For parallel operations in an algorithm whose degree of parallelism can be improved, new algorithm processes are designed to replace the time-consuming steps, and more reasonable methods, such as dimensionality reduction, are adopted. This reduces the running time of the algorithms and improves their performance.

Spark uses a memory-based storage structure, exploiting the high-speed I/O of memory to achieve fast calculation and improve the performance of various Spark-based data mining algorithms in large-scale data computing. Hadoop uses file storage for intermediate data files, which is inferior to the Spark computing engine in time performance; but it does not depend on memory capacity, so data analysis tasks do not easily fail due to insufficient memory under large data volumes and run extremely stably. Tensorflow is a widely used computational engine framework for deep learning, supporting a variety of deep learning algorithms, but it is still mostly used on single servers.

Based on the characteristics of the above computing engine frameworks, the parallel platform implements distributed parallel machine learning algorithms, distributed parallel ETL algorithms and deep learning algorithms. It then improves the YARN resource management framework in Hadoop and introduces TensorflowOnSpark to enable the combined operation of the Tensorflow and Spark frameworks. At the same time, it abstracts the server computing resources and uniformly allocates and manages the resources that jobs apply for. Separating the various computing tasks avoids resource conflicts between the computing engine frameworks and allows multiple computing engine frameworks to run at the same time.

C. High Applicability For Power Data

In view of the complexity, timing and real-time characteristics of power data, general-purpose big data analysis algorithms are often less applicable. To this end, the parallel platform specifically integrates multiple algorithm components for power data analysis business needs: for example, differential early warning algorithms for chromatographic data, load evaluation algorithms for device load data, text analysis algorithms for device defect data, and gray correlation algorithms for grid data correlation analysis. The parallel platform also provides an extended algorithm interface: when faced with highly professional business analysis requirements, the corresponding algorithms can be extended for the specific services to complete the analysis task.

D. Simple And Convenient Operations

The parallel platform provides an interactive graphical big data analysis and management interface named Studio. It uses a workflow diagram to construct and compose data analysis tasks, with arrow marks indicating the data flow and dependencies. By clicking and dragging, users can complete the creation, configuration, reuse, submission, operation, task monitoring and visualization of the workflow diagram. The platform's friendly operation reduces the threshold for users to perform big data analysis tasks, and efficiently realizes in-depth analysis and value
B. Hybrid Computing Framework mining of big data in the industry.

The most widely used computing frameworks in distributed VII. A PPLICATION O N F IELD O F P OWER
computing are Spark and Hadoop. Spark focuses on memory Based on the advantages of simple and efficient parallel
computing and adopts memory as the intermediate computing platform, the user can realize a variety of power data analysis

2351
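Before turning to the applications, the data-parallel training scheme of Section VI-A (worker replicas trained on their own shards, with parameters iteratively averaged into a central model) can be sketched in pure Python. The quadratic model, shard layout, learning rate and round counts below are illustrative assumptions, not the platform's actual configuration:

```python
import random

def local_sgd(w, shard, lr=0.05, steps=20):
    """One worker: a few SGD steps on its own data shard (model y = w * x)."""
    for _ in range(steps):
        x, y = random.choice(shard)
        grad = 2 * (w * x - y) * x  # derivative of the squared error w.r.t. w
        w -= lr * grad
    return w

def train_parameter_averaging(shards, rounds=30):
    """Iterative parameter averaging: broadcast the central model, train the
    replicas (simulated sequentially here), then average them back."""
    w_central = 0.0
    for _ in range(rounds):
        replicas = [local_sgd(w_central, s) for s in shards]
        w_central = sum(replicas) / len(replicas)  # central model = mean of replicas
    return w_central

random.seed(0)
# Three simulated "workers", each holding a shard of points on the line y = 3x.
shards = [[(x / 10, 3 * x / 10) for x in range(i, 30, 3)] for i in range(3)]
w = train_parameter_averaging(shards)
print(round(w, 2))  # converges near the true slope 3.0
```

On the platform itself the replicas would run as Spark tasks and the averaging step would be a reduce; the sketch only shows the control flow of the averaging loop.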
The platform has been launched at a power grid company, where it supports unstructured text data extraction and differentiated data analysis of power business data. It enriches the methods available for power data analysis and improves the efficiency of power-related data analysis.

A. Differentiated Warning On Oil Chromatography Of Transformers

When a fault occurs inside a transformer, gases such as H2, CH4 and C2H6 are released by decomposition. Real-time monitoring of the transformer, in which a small oil sample is collected and the composition and content of the gases dissolved in the transformer oil are analysed, can detect whether there is a fault inside the transformer and how severe it is. Based on this transformer oil chromatographic analysis method, the parallel platform integrates transformer chromatographic calculation, gray correlation analysis, an RNN network and other model components. The workflow designed for this analysis business is shown in Fig. 10.

Fig. 10. The workflow of differentiated warning on oil chromatography of transformers.

Through this workflow, built in the Studio of the platform, researchers can easily observe the generation of abnormal gas. Abnormal gas generation is reflected not only in the change of an individual gas but also in the overall trend among many related gases. The parallel platform therefore introduces a correlation analysis between the gases ahead of the chromatographic early-warning algorithm to improve the warning accuracy. The experimental results are shown in Table II.

TABLE II
THE RESULT OF DIFFERENTIATED WARNING ON OIL CHROMATOGRAPHY OF TRANSFORMERS

Time        Warning Algorithm   Neural Network   SVM     GM
2017-05-01  0.095               0.122            0.126   0.202
2017-05-20  0.053               -0.115           0.128   0.151
2017-06-20  0.079               -0.126           0.214   0.265
2017-06-30  0.044               -0.07            0.053   0.131
AveError    0.021               0.028            0.036   0.047

The results in Table II show that, in actual oil chromatographic warning, the algorithm provided by the parallel platform has higher accuracy and stronger applicability to power data than other commonly used time-series prediction models.

B. Analysis Of Fault Text Of Power Transmission And Transformation

Power companies have long strengthened the management of power plant equipment defects, strictly controlled equipment operation, implemented defect classification and regular reporting, and had professional power production groups regularly analyze major hidden defects, so that power generation equipment stays at a healthy level. However, defect detection records are still written by hand as free text and, after entering the database, are stored as report text. It is extremely inconvenient for ordinary maintenance personnel to understand the defects and problems in a concise and intuitive manner.

Therefore, the parallel platform improved text segmentation, feature extraction and other methods together with deep learning algorithms such as Bi-LSTM, targeted them at power transmission equipment defect data, and integrated them into functional components. The corresponding workflow diagram is shown in Fig. 11: from dictionary construction to text segmentation, extraction of key word segments, and then text classification and word connection, these operations complete the structured processing of the text data.

Fig. 11. The workflow of fault text analysis of power transmission and transformation.

The algorithm implemented by the parallel platform accurately processes the unstructured text data, obtains power-domain knowledge entities (main transformer, etc.), and raises the F1 value of the data processing to 68%. Compared with algorithms of the same type, it achieves better performance.
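The correlation analysis among related gases in Section VII-A is based on gray correlation. One common formulation of the gray correlation (grey relational) grade can be sketched as follows; the gas series are made-up illustrative numbers, not measured chromatography data:

```python
def gray_correlation(reference, candidate, rho=0.5):
    """Grey relational grade of `candidate` with respect to `reference`.
    Each series is first normalized by its own mean (a common choice)."""
    ref = [v / (sum(reference) / len(reference)) for v in reference]
    cand = [v / (sum(candidate) / len(candidate)) for v in candidate]
    deltas = [abs(r - c) for r, c in zip(ref, cand)]
    d_min, d_max = min(deltas), max(deltas)
    # Grey relational coefficient at each time point, then the mean grade.
    coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in deltas]
    return sum(coeffs) / len(coeffs)

# Hypothetical dissolved-gas trends: H2 compared with two other gases.
h2   = [10, 12, 15, 21, 30]
ch4  = [5, 6, 8, 10, 15]   # rises roughly in step with H2
c2h6 = [7, 7, 6, 7, 7]     # nearly flat
print(gray_correlation(h2, ch4) > gray_correlation(h2, c2h6))  # True: CH4 tracks H2 more closely
```

A gas whose trend tracks the reference gas receives a grade closer to 1, which is how the overall change trend among related gases, rather than any single reading, feeds the early-warning step.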

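The text-extraction pipeline of Section VII-B is evaluated by its F1 value. As a reminder of how that score combines precision and recall at the entity level, here is a small sketch with hypothetical gold and predicted entity sets (the entity names are illustrative, not taken from the platform's data):

```python
def f1_score(gold, predicted):
    """Entity-level F1: harmonic mean of precision and recall over sets."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # correctly extracted entities
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {"main transformer", "bushing", "tap changer", "oil pump"}
pred = {"main transformer", "bushing", "cooling fan"}
print(round(f1_score(gold, pred), 2))  # 2*(2/3)*(2/4) / (2/3 + 2/4) = 4/7 ≈ 0.57
```

The 68% figure reported above is this kind of score computed over the extracted power-domain knowledge entities.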
2352
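Both applications are driven by Studio workflow diagrams (Figs. 10 and 11) whose arrows encode data flow and dependencies. A minimal sketch of how such a diagram can be resolved into an execution order, using a topological sort; the node names loosely follow the stages of the chromatography workflow and are illustrative only:

```python
from collections import deque

def execution_order(edges):
    """Topologically sort a workflow DAG given as (upstream, downstream) arrows."""
    nodes = {n for e in edges for n in e}
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in succ[n]:  # a node becomes runnable once all inputs are done
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("workflow diagram contains a cycle")
    return order

# Hypothetical chromatography workflow, roughly following Fig. 10's stages.
arrows = [("load data", "gray correlation"),
          ("load data", "chromatographic calculation"),
          ("gray correlation", "early warning"),
          ("chromatographic calculation", "early warning")]
print(execution_order(arrows))
```

In Studio each node would be one of the pluggable algorithm components, and the resolved order is what the platform submits to the underlying computing engines.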
The experimental results in Fig. 12 show that the algorithm is more suitable for power data than other algorithms of the same type.

Fig. 12. The result of fault text analysis of power transmission and transformation.

VIII. CONCLUSION

To meet the demand for power big data business analysis and mining that comes with the development of the smart grid, this paper designs and develops a distributed parallel data mining platform for power big data. The parallel platform adopts a highly reusable distributed framework and provides a scalable parallel algorithm library. It integrates nearly 100 well-performing, highly parallelized algorithms, some of which outperform the existing open-source algorithm library MLlib and the Hive tools. The library contains deep neural network, machine learning, statistical analysis and other categories of general-purpose parallel mining algorithms, as well as special algorithms for the needs of power data analysis businesses. In addition, the graphical interactive interface Studio provided by the parallel platform indicates the data flow and dependency relationships with a workflow diagram, and visually displays the analysis process, the current execution status and the analysis results. Studio supports users in customizing data analysis tasks by clicking and dragging, without writing programs, which lowers the threshold for performing big data analysis tasks.

To support the frequently changing business needs of power big data mining, the parallel platform integrates more than 20 special algorithms for power data business analysis and implements pluggable component libraries using OSGi technology. These components not only allow flexible modification but also decouple the functions of the analysis tasks, so that existing component sets can be logically organized according to business requirements to generate the corresponding data analysis function modules. The parallel platform also realizes various algorithms for grid data analysis demands, including differentiated warning on chromatographic data and defect text analysis and extraction, and these have been put into use at a power grid company.

In future work, we will continue this work in terms of deeper development, optimization of the deep learning algorithms and extension to other types of data models in the power field.

ACKNOWLEDGMENT

This work is supported in part by the National Key R&D Program of China (No. 2018YFC0831500).
