Big Data Mining Method of Thermal Power Based On Spark and Optimization Guidance

2018 IEEE 7th Data Driven Control and Learning Systems Conference
May 25-27, 2018, Enshi, Hubei Province, China
Mingcheng Song 1, Li Jia 2
1. Department of Automation, College of Mechatronics Engineering and Automation, Shanghai University, Shanghai 200072, P.R. China
E-mail:649791309@qq.com
2. Shanghai Key Laboratory of Power Station Automation Technology, Department of Automation, College of Mechatronics Engineering
and Automation, Shanghai University, Shanghai 200072, P. R. China
E-mail: jiali@staff.shu.edu.cn
Abstract: As the electric-power industry becomes increasingly information-driven, the volume of thermal power data has grown geometrically. To address the computational bottleneck that traditional data mining faces when dealing with big data of thermal power, this paper presents a big data mining method of thermal power based on Spark. According to the characteristics of the actual operation of the unit, the proposed method determines the steady-state conditions of big data of thermal power and divides the working conditions based on external constraints. In addition, a data mining method based on distributed computing is used to mine big data of thermal power for strong association rules, from which the best parameter values under each working condition are obtained. Lastly, a historical knowledge base is established, which the proposed method uses to guide the operation of the unit. The method is applied to a 300 MW unit in a power plant in Anhui Province to mine 10 days of the unit's operation data within one month. The simulation results show that the proposed method can effectively mine big data of thermal power and is more computationally efficient than traditional data mining on big data.
Key Words: Big data of thermal power, Spark, Big data mining, Strong association rules, Operation optimization
This work was supported by National Natural Science Foundation of China (61773251), and Shanghai Municipal Science and Technology Commission (15510722100, 16111106300, 17511109400).
978-1-5386-2618-4/18/$31.00 ©2018 IEEE DDCLS'18

1 Introduction
Intelligent equipment in thermal power plants collects a large amount of production operating data, and the amount of stored data has increased geometrically. These operating data have the obvious characteristics of big data: big capacity, diversity, processing speed and high value [1, 2]. According to the definition of big data, the operating data collected by the electric-power plant can be considered as big data [3]. These massive amounts of plant operation data contain useful information for operational optimization. Data mining [4] can extract, from the massive historical data of the power plant, the optimal values that the unit has achieved under different working conditions. Compared with the theoretical optimal values of the unit, these optimal values are more easily achieved in the actual operation of the unit and have more practical significance [5].
Many scholars try to find the connection between the unit operating parameters using association rules mining [6]. References [7] ~ [9] improve association rule mining and apply it to electric-power plant operation optimization. The FP-growth (Frequent Pattern-growth) algorithm is an association analysis algorithm that compresses the database into an FP-tree to save computing resources. The FP-growth algorithm [10] only needs to scan the database twice, which is more efficient than the Apriori algorithm.
Traditional data mining algorithms have difficulty meeting performance demands when facing geometrically growing big data of thermal power. Recently, many effective data mining algorithms have been presented to mine and analyze big data by combining big data technology with data mining [11-14]. Wanxiang [15] improved the Apriori algorithm on Hadoop to deal with big data. Spark [16] is a fast and general engine for large-scale data processing, which can deal with the big data of thermal power more effectively than Hadoop, especially for iterative learning. As a result, some algorithms have been developed to achieve parallel computing of data mining by combining data mining with Spark, which can improve the efficiency of computing. Li [17] proposed the PFP algorithm, realizing the parallel computing of the FP-growth algorithm.
Based on the above analysis, a big data mining method of thermal power based on Spark is proposed in this paper. The proposed method can effectively remove the bottleneck that traditional data mining meets when dealing with big data of thermal power, and it improves the efficiency of mining big data of thermal power by parallel computing on the Spark platform with the K-means and FP-growth algorithms. According to the characteristics of the actual operation of the thermal power unit, the proposed method determines the steady-state conditions of big data of thermal power and divides the working conditions based on external constraints to ensure the accuracy and efficiency of data mining. In addition, a method of applying historical knowledge to the optimization of unit operation is proposed in this paper.
2 Big Data Mining Method of Thermal Power Based on Spark
2.1 Framework of New Method
The new method runs on the Spark computing platform based on Hadoop. Big data of thermal power is stored in a distributed manner by HDFS (Hadoop Distributed File System) and analyzed by the Spark computing framework. The new method uses judgment of steady-state conditions, classification of working conditions based on external constraints and the K-means algorithm based on Spark for data preprocessing of big data of thermal power. In addition, the
new method uses target-guided data compression and the FP-growth algorithm based on Spark to mine big data of thermal power. Lastly, the historical knowledge base is established, which the proposed method uses to guide the operation of the unit.
Fig. 1: Architecture diagram of new method
2.2 The Spark Computing Platform Based on Hadoop
Hadoop [18] is a distributed system infrastructure developed by the Apache Foundation. The core of its framework consists of HDFS and MapReduce. HDFS provides storage for massive amounts of data, and MapReduce provides computation over massive amounts of data. The new method stores big data of thermal power and the calculation results on HDFS to realize distributed storage, which solves the storage problem of big data of thermal power and provides the data platform for its distributed computing.
Spark is a parallel computing framework based on in-memory computing, and it outperforms Hadoop in many respects. Spark can run either standalone or on Hadoop YARN (Yet Another Resource Negotiator), and it can read data directly from HDFS. The new method builds Spark on top of Hadoop to construct a Spark computing platform based on Hadoop.
2.3 K-means Clustering Algorithm Based on Spark
Thermal power units operate in a complex environment with strong electrical, magnetic and other noise interference. Coupled with the uncertainty of the system, thermal power data therefore tend to have large deviations. The K-means algorithm is scalable and efficient, and it is a good method for discretizing thermal power data.
The traditional K-means algorithm must iteratively calculate the Euclidean distance between all objects and the centroids, so its time complexity grows rapidly as the amount of data increases [19][20]. Therefore, when dealing with big data of thermal power, it is difficult to load all the data into computer memory at the same time, which causes repeated reads and writes between memory and disk and occupies a lot of computer resources. Secondly, it is difficult to quickly calculate the Euclidean distances between all objects and the cluster centers in stand-alone mode. The traditional K-means algorithm thus has difficulty meeting the computing requirements of big data of thermal power. To solve this problem, the new method combines the traditional K-means algorithm with the Spark distributed computing framework. The new method uses the K-means clustering algorithm based on Spark to process big data of thermal power. Following the concept of distributed computing, big data of thermal power is divided into several computing tasks to solve the problem of loading and computing big data in stand-alone mode.
STEP 1. Extract big data of thermal power from HDFS and create RDDs. By creating an RDD, the data is vectorized and cached.
STEP 2. Randomly generate K initial cluster centers.
STEP 3. Using the idea of "Map", the distance between each data object and each cluster center is calculated at each work node, and each object is assigned to its nearest center.
STEP 4. Through the idea of "Reduce", the outputs of each node are combined to get the global result and update the cluster centers.
STEP 5. Determine whether the clustering has converged or reached the maximum number of iterations; otherwise return to STEP 3.
STEP 6. Finish.
Fig. 2: Flow Chart of K-means clustering algorithm based on Spark
2.4 FP-growth Algorithm Based on Spark
Traditional data mining has difficulty dealing with the big data of the power plant. The Apriori algorithm, which is commonly used at present for data mining in electric-power plants, needs to scan the dataset for every potential frequent item set and takes up a lot of resources, so it is not suitable for processing big data of thermal power. The FP-growth [10] algorithm only needs to scan the database twice: the whole original data set is compressed into a frequent pattern tree, and the data is stored in memory for computing, which speeds up the whole mining process. However, when processing massive data, there is not enough memory to store a frequent pattern tree that contains huge amounts of data, and the memory cannot meet the computing needs of the FP-growth algorithm. Thus traditional FP-growth is also difficult to apply to big data mining.
The new method uses the FP-growth algorithm based on Spark to mine big data of thermal power and can effectively solve the above problems. Each big data
acquisition of thermal power plant is called a transaction, and each parameter acquired each time is called an item.
STEP 1. Extract pre-processed big data of thermal power from HDFS and create RDDs.
STEP 2. The support of frequent items is calculated in parallel. The items are sorted in descending order of support, the items that do not satisfy the minimum support are deleted, and the sorted list is recorded as F_list.
STEP 3. Data grouping. For each transaction, delete the items that are not in F_list and sort the remaining items according to F_list. Then, according to the grouping strategy of the PFP algorithm, F_list is divided into Q groups, and the results are recorded as G_List.
STEP 4. Frequent itemsets are mined in parallel. The Mapper reads the G_List and distributes the transactions to each group. Each work node completes the mining task on its own node independently and gets the frequent patterns of its group.
STEP 5. Aggregating. The frequent patterns of each group obtained in STEP 4 are aggregated to obtain the global result, the strong association rules under each condition, from which the historical working condition H is obtained.
H = [Power, Coal Quality, Economic index weight value, Environmental index weight value, Stable-operation index weight value, Each value of optimal parameter, Evaluation index]
2.5 Optimized Guidance based on Historical Database
Through the big data mining algorithm, the historical database of big data of thermal power is obtained. Applying historical knowledge to the optimization guidance of the actual operation of the unit is the purpose of establishing a historical knowledge base. Optimization guidance based on the historical database is proposed in the new method. By comparing the similarity between real-time working conditions and historical working conditions, the most similar historical working conditions are searched from the historical knowledge base, and then the running of the unit is guided according to the historical knowledge.
STEP 1. Record the unit operation data for a period.
STEP 2. Set the weights of the economic index, environmental index and stable-operation index, determine the optimization target and calculate the evaluation index.
STEP 3. According to the current external conditions, optimization goals and parameters, get the current working condition C.
C = [Power, Coal Quality, Economic index weight value, Environmental index weight value, Stable-operation index weight value, Each parameter value, Evaluation index]
STEP 4. Calculate the similarity between the current working condition C and the working condition of each historical case.
STEP 5. If there is a historical case similar to the current case C, proceed to STEP 6. If there is no similar historical condition, a new combination of working condition and optimization target has appeared: accumulate data for a period of time, mine it to generate new knowledge, and store the new knowledge in the historical knowledge base.
STEP 6. If the evaluation index of the historical working condition is superior to the evaluation index of the current working condition, the operation of the unit is optimized according to the historical knowledge. If the evaluation index of the current working condition is superior to the evaluation index of the historical working condition, the parameter values of the current working condition replace the parameter values of the original historical working condition to update the historical knowledge base.
3 The Process of Big Data Mining Method of Thermal Power Based on Spark
The big data of thermal power contains abundant and valuable unit state information and stores a large amount of unit operation knowledge. Big data of thermal power is mined by the big data mining algorithm based on Spark to obtain historical knowledge, and the historical knowledge base is established. The running state of the unit is recorded for a period and compared with the historical knowledge base to find similar historical conditions. If a similar historical condition is found, the historical knowledge is used to guide the operation of the unit. If there is a new working condition, the records of that working condition are mined for a period and the new knowledge is added to the historical database. If the evaluation index of the current working condition is better, the parameter values of the current working condition replace the parameter values of the original historical working condition to update the historical knowledge base.
Fig. 3: Flow Chart of Big data mining method of thermal power based on Spark
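The guidance process of Sections 2.5 and 3 (find the most similar historical condition, follow it when its evaluation index is better, otherwise overwrite it) can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the paper does not give its similarity measure, so the `similarity` function, the `power_tol` and `threshold` values, and the convention that a lower evaluation index (for example a coal consumption rate) is better are all hypothetical, as are the numbers in the usage note.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    power: float        # external condition: unit load in MW
    coal_quality: str   # external condition: "bad", "good" or "excellent"
    weights: tuple      # (economic, environmental, stable-operation) weights
    params: dict        # operating parameter values
    index: float        # evaluation index; ASSUMPTION: lower is better


def similarity(c, h, power_tol=5.0):
    """Hypothetical similarity in [0, 1]; the paper does not specify a formula."""
    if c.coal_quality != h.coal_quality or c.weights != h.weights:
        return 0.0
    return max(0.0, 1.0 - abs(c.power - h.power) / power_tol)


def guide(current, knowledge_base, threshold=0.5):
    """Return (action, parameters); may update the matched historical case in place."""
    best = max(knowledge_base, key=lambda h: similarity(current, h), default=None)
    if best is None or similarity(current, best) < threshold:
        # No similar case: the caller should accumulate data and run the mining step.
        return "mine_new_condition", None
    if best.index < current.index:
        # The historical case is better: operate according to historical knowledge.
        return "follow_history", best.params
    # The current case is better: it replaces the stored parameter values.
    best.params, best.index = dict(current.params), current.index
    return "update_history", best.params
```

For example, with one stored case at 197 MW and index 310, a current case at 198 MW with index 320 yields `"follow_history"`, while a current case with index 300 overwrites the stored case.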
4 Application
4.1 Build Spark Computing Platform Based on Hadoop
The hardware part of the platform uses 4 PCs to build a Spark cluster based on Hadoop. The 4 machines are in one LAN. One machine serves as the main node, running the Master process, and also acts as a work node for computing tasks. The other three Slave nodes are work nodes. A Hadoop + Spark system is set up on the platform; the software configuration is shown in Table 1.
Tab.1: Software configuration of Spark computing platform based on Hadoop
Software    Version
Ubuntu      14.04
Java        jdk1.7
Scala       2.10.4
Hadoop      2.6.0
Spark       1.5.2
IDE         Eclipse
The new method uses HDFS for data management: only the original dataset needs to be uploaded to HDFS. The system automatically divides the data into multiple data blocks and stores the blocks across the cluster, achieving the distributed storage of big data of thermal power.
4.2 Mining Object
This paper uses the big data mining method of thermal power based on Spark on the big data computing platform to analyze 10 days of operating data of a 300 MW unit in an electric-power plant in Anhui Province.
4.3 Data Preprocessing
The actual operation of the power plant is, strictly speaking, a process of dynamic change, and there are a lot of unstable operating states. Only stable running data are valuable for data mining and can effectively reflect the actual operating state of the unit. The new method adopts the method in reference [?] to determine the steady-state condition of thermal power big data.
There are some external conditions in the actual operation of the thermal power unit. Different external conditions cause the working conditions of the unit to differ, and the optimal values of the operating parameters differ greatly between working conditions. Power and coal quality are the important external conditions that affect the operation of the unit [5]. The new method uses power and coal quality as external constraints to divide the working conditions. It defines the relative coal quality factor as power / total fuel quantity; this factor reflects the coal's work capacity to a certain extent [22]. The results of the division of the working conditions are shown in Table 2. The K-means algorithm based on Spark is then used to discretize the data of each parameter.
4.4 Mining Target
In this paper, the economy, environmental protection and stable operation of the unit are considered. The mining target index is L, where L1 is the economic index, L2 is the environmental protection index, and L3 is the stable-operation index.
L = p1 × L1 + p2 × L2 + p3 × L3 (1)
p1 + p2 + p3 = 1 (2)
The weights are set to determine the optimization goal. This example takes the economy of unit operation as the optimization target and selects the coal consumption rate as the evaluation index.
Parameters closely related to the coal consumption rate are retained, such as main steam pressure, initial steam flow rate, total air flow, outlet gas temperature, oxygen content of inlet air of A-air preheater and oxygen content of inlet air of B-air preheater; the operating parameters unrelated to the coal consumption rate are removed to compress the data space.
4.5 Results
On the Spark computing platform based on Hadoop, the minimum support is set to 3% and the minimum confidence to 85%. The FP-growth algorithm based on Spark was used to mine the discrete data for each condition. The results of some strong association rules are shown in Table 3.
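To make the minimum support and minimum confidence thresholds concrete, the following Python sketch computes both quantities by brute-force counting over a toy transaction set. It is not FP-growth and not the Spark implementation used in the paper; the item labels (such as `pressure=2`) and the transactions are made up for illustration, and only the thresholds of 3% and 85% come from the text.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


def is_strong_rule(transactions, antecedent, consequent,
                   min_support=0.03, min_confidence=0.85):
    """A rule is 'strong' when it meets both thresholds."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return (joint >= min_support
            and confidence(transactions, antecedent, consequent) >= min_confidence)


# Toy discretized records: each transaction is one acquisition, each label one item.
transactions = (
    [frozenset({"pressure=2", "air_flow=1", "economy=A+"})] * 9
    + [frozenset({"pressure=2", "air_flow=1", "economy=B"})] * 1
    + [frozenset({"pressure=0", "air_flow=3", "economy=B"})] * 10
)

rule_ok = is_strong_rule(transactions, {"pressure=2", "air_flow=1"}, {"economy=A+"})
```

In this toy set the antecedent appears in 10 of 20 transactions and leads to `economy=A+` in 9 of them, so the rule passes both thresholds.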
Tab.2: Results of working condition classification
Working condition       Power               Coal quality
Working condition 0     Power Interval 0    Bad
Working condition 1     Power Interval 0    Good
Working condition 2     Power Interval 0    Excellent
……                      ……                  ……
Working condition 14    Power Interval 4    Excellent
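The division in Tab. 2 can be read as a lookup on the two external constraints. The Python sketch below assumes five power intervals and three coal quality classes, numbered so that condition = 3 × power interval + coal class, which agrees with the rows of Tab. 2 that are shown. The first two power boundaries reuse the power range the paper reports for condition 5; the remaining power boundaries and both coal quality thresholds are hypothetical placeholders, since the paper does not list them. Only the definition of the relative coal quality factor (power / total fuel quantity) is taken from the text.

```python
from bisect import bisect_right

# Boundaries (MW) between power intervals 0..4.
# 192.752 and 202.739 are the power range given for condition 5;
# 215.0 and 230.0 are HYPOTHETICAL placeholders.
POWER_EDGES = [192.752, 202.739, 215.0, 230.0]

# HYPOTHETICAL thresholds on the relative coal quality factor.
COAL_EDGES = [1.8, 2.1]
COAL_LABELS = ["bad", "good", "excellent"]


def coal_quality_factor(power, total_fuel):
    """Relative coal quality factor as defined in the text: power / total fuel quantity."""
    return power / total_fuel


def working_condition(power, total_fuel):
    """Return (condition number, power interval, coal quality label)."""
    power_interval = bisect_right(POWER_EDGES, power)              # 0..4
    factor = coal_quality_factor(power, total_fuel)
    coal_class = bisect_right(COAL_EDGES, factor)                  # 0..2
    return 3 * power_interval + coal_class, power_interval, COAL_LABELS[coal_class]
```

With these placeholder thresholds, a record at 195 MW burning 90 units of fuel falls into power interval 1 with excellent coal, that is, working condition 5.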
Tab.3: Results of some strong association rules
Parameter                         Working condition 5     Working condition 8     Working condition 14
Main steam pressure /MPa          <13.6397, 13.899>       <15.821, 16.2114>       <15.1717, 15.3379>
Initial steam flow rate /(t/h)    <514.524, 650.77>       <514.524, 650.77>       <651.094, 727.933>
Total air flow                    <700.723, 775.52>       <700.723, 775.52>       <814.539, 846.718>
Outlet gas temperature /℃         <106.055, 108.711>      <101.554, 106.04>       <110.945, 113.524>
Oxygen content A /%               <4.62987, 6.46174>      <4.62987, 6.46174>      <4.07882, 4.62819>
Oxygen content B /%               <3.6192, 5.47583>       <3.6192, 5.47583>       <3.10999, 3.61691>
Economic evaluation               A+                      A+                      A+
Table 3 shows that the new method can effectively mine the strong association rules between the parameters and the economy under each working condition. The mining result of condition 5 is taken as an example to illustrate the application significance of strong association rules. The external conditions are a power within the range <192.752, 202.739> and excellent coal quality. When the main steam pressure is within <13.6397, 13.899>, the initial steam flow rate is within <514.524, 650.77>, the total air flow is within <700.723, 775.52>, the outlet gas temperature is within <106.055, 108.711>, oxygen content A is within <4.62987, 6.46174> and oxygen content B is within <3.6192, 5.47583>, the economic evaluation of the thermal power unit is A+ with a probability of at least 85%. In this paper, the clustering center of each parameter interval is taken as the optimization target value of that parameter, which is more helpful for guiding the operator to optimize the operation of the unit (see Tab. 4).
H = [<192.752, 202.739>, Excellent, 1, 0, 0, each value of optimal parameter, A+]
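The entries 1, 0, 0 in H are the weights p1, p2, p3 of Eqs. (1) and (2), that is, a purely economic optimization target. A minimal Python sketch of the weighted index follows. The sub-index values are made up for illustration, and the paper does not state how L1, L2 and L3 are scaled, so the numbers carry no physical meaning.

```python
def mining_target(weights, indices, tol=1e-9):
    """Compute L = p1*L1 + p2*L2 + p3*L3, requiring p1 + p2 + p3 = 1 (Eq. 2)."""
    if len(weights) != 3 or len(indices) != 3:
        raise ValueError("expected three weights and three sub-indices")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    return sum(p * l for p, l in zip(weights, indices))


# Illustrative sub-index values (economic, environmental, stable operation).
L1, L2, L3 = 0.82, 0.40, 0.65

economic_only = mining_target((1, 0, 0), (L1, L2, L3))    # reduces to L1
balanced = mining_target((0.5, 0.3, 0.2), (L1, L2, L3))   # mixes all three
```

Changing the weight tuple is the only step needed to switch the optimization goal, which is why the weights are stored with each historical working condition.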
Tab.4: Optimization targeted values
Parameter                         Working condition 5     Working condition 8     Working condition 14
Main steam pressure /MPa          13.765                  16.0005                 15.256
Initial steam flow rate /(t/h)    610.895                 610.895                 691.015
Total air flow                    752.499                 752.499                 830.0002
Outlet gas temperature /℃         107.551                 104.533                 111.995
Oxygen content A /%               4.935                   4.935                   4.323
Oxygen content B /%               3.909                   3.909                   3.327
Economic evaluation               A+                      A+                      A+
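The target values in Tab. 4 are cluster centers produced by the K-means discretization of each parameter. The "Map" and "Reduce" iteration of Section 2.3 can be sketched in plain Python for a single parameter. This is a single-process illustration, not Spark code: `partitions` stands in for the RDD partitions held by the work nodes, the initial centers are passed in rather than drawn at random so that the run is reproducible, and the sample values are made up.

```python
def map_partition(points, centers):
    """'Map': on one partition, assign each point to its nearest center and
    return the partial (sum, count) of every cluster seen on that partition."""
    partial = {}
    for x in points:
        k = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        s, n = partial.get(k, (0.0, 0))
        partial[k] = (s + x, n + 1)
    return partial


def reduce_partials(partials, centers):
    """'Reduce': merge the partial sums of all partitions and recompute the centers."""
    totals = {}
    for partial in partials:
        for k, (s, n) in partial.items():
            ts, tn = totals.get(k, (0.0, 0))
            totals[k] = (ts + s, tn + n)
    # A cluster that received no points keeps its previous center.
    return [totals[k][0] / totals[k][1] if k in totals else c
            for k, c in enumerate(centers)]


def kmeans_1d(partitions, centers, max_iter=100, tol=1e-9):
    """Iterate Map and Reduce until the centers stop moving or max_iter is reached."""
    for _ in range(max_iter):
        partials = [map_partition(p, centers) for p in partitions]
        new_centers = reduce_partials(partials, centers)
        converged = max(abs(a - b) for a, b in zip(new_centers, centers)) < tol
        centers = new_centers
        if converged:
            break
    return centers


# Made-up main steam pressure samples (MPa) split over two "work nodes".
partitions = [[13.7, 13.8, 15.2], [13.9, 15.3, 15.25]]
centers = kmeans_1d(partitions, centers=[13.0, 16.0])
```

Only the per-cluster sums and counts cross partition boundaries, which is what keeps the reduce step cheap when the data is too large for one machine.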
In this paper, the traditional K-means algorithm and K-means based on Spark (SK-means) are tested on 100 thousand, 250 thousand, 500 thousand and 1 million sets of 18-dimensional data. The number of clusters is set to 5, and the calculation time is recorded. The results are shown in Figure 4. Traditional FP-growth and FP-growth based on Spark (SPF) are applied to mining association rules in transaction data sets of 50 thousand, 100 thousand, 200 thousand, 300 thousand and 400 thousand groups. The minimum support is set to 0.15%, the minimum confidence is set to 85%, and the calculation time is recorded, as shown in Figure 5.
Fig. 4: Comparison of experimental results of K-means
Fig. 5: Comparison of experimental results of FP-growth
As can be seen from Figure 4 and Figure 5, when the amount of data is not large, the time difference between the two algorithms is small. Because the Spark cluster spends some time on startup and loading, the traditional data mining algorithm is then even faster than the Spark-based big data mining algorithm. As the amount of data increases, however, the computing efficiency of the big data mining algorithm based on Spark becomes clearly better than that of the traditional data mining algorithm. The greater the amount of data, the more obvious the advantage.
The above analysis shows that, compared with traditional data mining, the Spark-based thermal power data mining method has a large advantage in computational efficiency when dealing with big data of thermal power. It removes the bottleneck of traditional data mining in processing big data of thermal power.
5 Conclusions
In this paper, a new method for mining big data of thermal power is proposed: a big data mining method of thermal power based on Spark. In addition, the proposed method is applied to the economic optimization of thermal power units. Compared with the traditional optimization method, the new method has the following advantages:
(1) By judging the steady-state conditions of thermal power data, the new method improves the data quality and eliminates the interference from dynamic, unstable working condition data, so that the data effectively reflect the actual running status of the unit. In addition, the steady-state data is divided based on the external constraints to realize a fine division of the actual operating conditions of the unit.
(2) By setting the weights of the economic, environmental and stable-operation indicators, the user's different optimization needs are met and the optimization goals are made explicit. According to the optimization goal, the parameters are filtered to compress the data space.
(3) The technology of distributed storage and computing is introduced. The K-means algorithm based on Spark and the FP-growth algorithm based on Spark are used to process big data of thermal power, which improves the capability of processing big data of thermal power and solves the problem that traditional methods cannot effectively deal with it. The new method breaks through the bottleneck of traditional methods in computing big data of thermal power.
References
[1] YH Huang, ZH Yu, C Xie, et al. Study on the Application of Electric Power Big Data Technology in Power System Simulation. Proceedings of the CSEE, 35(1):13-22, 2015.
[2] DX Liu, HH Hu, J Zhang, et al. Research on key issues of big data lifecycle and its applications. Proceedings of the CSEE, 35(1):23-28, 2015.
[3] Chinese Society for Electrical Engineering Informatization Committee. Chinese electric power big data development white paper (2013) [R]. Beijing: Chinese Society for Electrical Engineering, 2013.
[4] JW Han. Data Mining: Concepts and Techniques [M]. Morgan Kaufmann Publishers Inc., 2005.
[5] QP Wang, ZQ Chen, H Wei. The Summary of Optimal Operation Parameters in Power Station Based on the Data Mining. Electric Power Science and Engineering, 7:19-24, 2015.
[6] R Agrawal, T Imieliński, A Swami. Mining association rules between sets of items in large databases. ACM SIGMOD Record, 22(2):207-216, 1993.
[7] JQ Li, JZ Liu, CL Niu, et al. The research and application of data mining in power plant operation optimization. IEEE International Conference on Machine Learning and Cybernetics, Vol. 3:1642-1647, 2005.
[8] JQ Li, CL Niu, JZ Liu. Application of data mining technique in optimizing the operation of power plants. Journal of Power Engineering, 26(6):830-835, 2007.
[9] JQ Li, JZ Liu, LY Zhang. The Research and Application of Fuzzy Association Rule Mining in Power Plant Operation Optimization [J]. Proceedings of the CSEE, 26(20):118-123, 2006.
[10] J Han, J Pei, Y Yin. Mining frequent patterns without candidate generation. ACM SIGMOD International Conference on Management of Data, ACM, 2000:1-12.
[11] F Zhang, M Liu, F Gui, et al. A distributed frequent itemset mining algorithm using Spark for Big Data analytics. Cluster Computing, 18(4):1493-1501, 2015.
[12] W Huang, L Meng, D Zhang, et al. In-Memory Parallel Processing of Massive Remotely Sensed Data Using an Apache Spark on Hadoop YARN Model. IEEE Journal of Selected Topics in Applied Earth Observations & Remote Sensing, 10(1):3-19, 2017.
[13] W Chen, Y Tong, J Zhang, et al. Frequent sequence mining from massive access log for user's behaviour investigation. Proceedings of Science, 2017.
[14] D Zhang, M Xin, L Liu, et al. Research on Development Strategy for Smart Grid Big Data [J]. Proceedings of the CSEE, 35(1):2-12, 2015.
[15] X Wan, N Hu. Research on Application of Big Data Mining Technology in Performance Optimization of Steam Turbines [J]. Proceedings of the CSEE, 36(2):459-467, 2016.
[16] Spark: Apache Spark. https://spark.apache.org/.
[17] H Li, Y Wang, D Zhang, et al. PFP: parallel FP-Growth for query recommendation. Proceedings of the 2008 ACM Conference on Recommender Systems, ACM, 2008:107-114.
[18] T White. Hadoop: The Definitive Guide [M]. 2011.
[19] YH Zhang, FG Li. Kmeans Algorithm Based on the Spark of Parallel Implementation and Optimization. Journal of Xian University, 2017.
[20] P Liu, J Teng, G Zhang. Study of parallelized k-means algorithm on massive text based on Spark. CCF Big Data, 2014.
[21] HK Wei, WZ Song, Q Li. A RBF Network Based Online Molding Method For Realtime Cost Model In Power Plant. Proceedings of the CSEE, 24(7):246-252, 2004.
[22] TT Yang, DL Zeng, JZ Liu. Operation optimization rule extraction method for generator unit base on classification of operation condition. Journal of North China Electric Power University (Natural Science Edition), 36(6):64-68, 2009.
