
2020 International Conference on Computers, Information Processing and Advanced Education (CIPAE)

978-1-7281-8223-0/20/$31.00 ©2020 IEEE | DOI: 10.1109/CIPAE51077.2020.00060

A Discretization Method for Industrial Data Based on Big Data Technology

Xiang Wan*, Cheng Wang, Zhengming Tang, Haijun Sun, Shan Gao, Lei Qiao
Wuhan Second Ship Design and Research Institute, Hubei Wuhan 430064, China
*Corresponding author e-mail: wanwuhan@yeah.net

Abstract—A parallel version of the traditional K-Means clustering algorithm is implemented on the MapReduce architecture, and the resulting parallelized clustering algorithm is used to discretize industrial big data. The new algorithm streamlines the calculation process and reduces both the computational overhead caused by data analysis and the communication cost caused by information transfer.

Keywords—Big data, industrial data, K-Means clustering algorithm

I. INTRODUCTION
Industrial data come from a wide variety of sources, hence there are significant differences between industrial parameters, sometimes of several orders of magnitude. If the differences between parameter values are not processed properly in advance, the workload of parameter identification and difference processing in subsequent data mining increases, which adversely affects the efficiency and accuracy of the mining work. Therefore, it is necessary to discretize industrial data so that they present a unified pattern that is conducive to data mining.

By replacing the initial data with several groups, clusters, intervals, labels, etc., data discretization reduces the differences between the initial data parameters and reduces the size of the initial data. Consequently, the efficiency of subsequent mining work is improved, and the mining results are more practical and easier to understand because the patterns of the mining objects are unified. Common discretization methods include group discretization, histogram discretization, and clustering discretization. In contrast to the first two methods, which require grouping rules and grouping parameters to be formulated manually, the clustering method is more objective. We therefore use the clustering method to discretize industrial data.

II. THE TRADITIONAL CLUSTERING ALGORITHM
A clustering algorithm groups data with common attributes and characteristics into a cluster. Data within the same cluster are highly homogeneous, whereas data in different clusters are highly heterogeneous, which achieves the discretization of the data. The clustering method does not set clustering parameters or discrete targets in advance; it is therefore an unsupervised learning method with good objectivity and reliability.

The K-Means algorithm is a clustering algorithm based on centroid division, in which the mean of the data in a cluster is usually used as the centroid of the cluster. In the K-Means algorithm, the initial data set $D$ is divided into $k$ clusters $C_1, C_2, \ldots, C_k$. Each item in $D$ belongs to exactly one cluster, so for $1 \le i, j \le k$ with $i \ne j$, $C_i \subseteq D$ and $C_i \cap C_j = \emptyset$. The distance between data is characterized by the Euclidean distance:

$$ d(m, c_i) = \lVert m - c_i \rVert \tag{1} $$

In Equation (1), $c_i$ is the centroid of the cluster $C_i$, and $m$ is any item in the cluster $C_i$. The clustering quality function is defined as:

$$ E = \sum_{i=1}^{k} \sum_{m \in C_i} d(m, c_i)^2 \tag{2} $$

By analyzing the quality function, it is evident that the ultimate goal of the K-Means clustering method is to make the data within each cluster highly similar and the boundaries between clusters clear.

The calculation process of the K-Means clustering method is shown in Fig.1.

Fig.1 The Flow of K-Means Clustering Algorithm
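
As a small illustration of Equations (1) and (2), the following Python sketch assigns each item to its nearest centroid and evaluates the clustering quality E. The function names (`assign`, `quality`) and the four sample points are assumptions made for this example only; they are not taken from the paper's experiments.

```python
import math

def assign(data, centroids):
    """Return, for every item m, the index i of the nearest centroid c_i.

    Distances are Euclidean, as in Equation (1).
    """
    return [
        min(range(len(centroids)), key=lambda i: math.dist(m, centroids[i]))
        for m in data
    ]

def quality(data, centroids, labels):
    """Clustering quality E of Equation (2): sum of squared distances
    between every item and the centroid of its own cluster."""
    return sum(math.dist(m, centroids[i]) ** 2 for m, i in zip(data, labels))

# Hypothetical two-dimensional sample with two obvious groups.
data = [(1.0, 1.0), (1.0, 2.0), (10.0, 10.0), (11.0, 10.0)]
centroids = [(1.0, 1.5), (10.5, 10.0)]

labels = assign(data, centroids)
E = quality(data, centroids, labels)
print(labels, E)
```

A smaller E means that the items lie closer to their centroids, which is the sense in which the quality function rewards compact, well separated clusters.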

① Determine the number of clusters $k$, and select $k$ items from the initial data set $D$ as the initial centroids $c_i$;

② For each item $x_p$ in the data set $D$, calculate its distance to each of the $k$ centroids by Equation (1), and assign it to the cluster whose centroid is nearest;

③ After all the data have been assigned to their clusters, calculate the clustering quality $E$ according to Equation (2), and recalculate the new centroid $c_i^*$ of each cluster according to Equation (3),

$$ c_i^* = \frac{1}{n_q} \sum_{q=1}^{n_q} x_q \tag{3} $$

In Equation (3), $x_q$ is any item in cluster $C_i$, and $n_q$ is the total number of items in cluster $C_i$;

④ Then calculate the clustering quality $E^*$ according to Equation (2) and the new centroids $c_i^*$. If

$$ |E^* - E| \le \varepsilon \tag{4} $$

where $\varepsilon$ is a small preset threshold, the quality function is judged to have converged; otherwise, steps ② to ④ are repeated until the convergence condition is met. Sometimes $c_i^* = c_i$ or $\lVert c_i^* - c_i \rVert \le \kappa$ is also used as the convergence condition, meaning that the iteration ends if the centroids have not changed or have moved only slightly.

The K-Means clustering algorithm is simple and easy to understand. Most of its computation time is spent on calculating the distances between the data and the centroids. However, if the traditional K-Means clustering algorithm is applied directly to the discretization of industrial big data, the amount of data to be processed is very large and a large number of distance calculations must be completed. The repeated calculations cause considerable computational and communication overhead. Therefore, how to parallelize the traditional K-Means clustering algorithm and improve its computing efficiency has become an area of intense investigation [1, 2].

III. THE OPTIMIZED PARALLEL ALGORITHM

A. MapReduce Structure
The step that calculates the distance from each item of data to the centroids has the highest time complexity, while the individual distance calculations are independent of one another and show good locality. We therefore parallelize this step on the MapReduce architecture, which greatly reduces the calculation time.

As a parallel programming model with outstanding scalability, fault tolerance, and applicability, MapReduce is well suited to processing various types of data (including structured, semi-structured, and unstructured data). The central idea of MapReduce is “divide and conquer”: through the parallel processing of data blocks, collaborative computing on massive data is achieved [3, 4].

The parallel computing of jobs and tasks in the MapReduce architecture is achieved through the coordination of master and slave nodes [5]. The master node, the JobTracker, is responsible for coordinating and scheduling the computing tasks, and the slave nodes, the TaskTrackers, are responsible for the actual analysis and calculation. The loose coupling between the nodes makes the architecture versatile and convenient to use. Through parallel calculation in MapReduce, the calculation process is streamlined, and the computational overhead caused by data analysis and the communication cost caused by information transfer are reduced.

B. The Parallelization Improvements
The original K-Means algorithm is parallelized on the MapReduce platform by exploiting the independence of the distance calculations between the data and the cluster centroids. The distances between each item of data and the cluster centroids are calculated in the Map stage, and the new centroids are calculated in the Reduce stage. The flow of the parallelized algorithm is shown in Fig.2.

① Organize the initial data set $D$ formally: each item of data is arranged in a row, and the row numbers are listed;

② Determine the number of clusters $k$ and randomly select the initial centroids $c_i$;

③ In the Map phase, the data are randomly distributed to the nodes, which calculate the distance $d(m, c_i)$ between each item of data and each centroid $c_i$. Each item is assigned to the cluster of its closest centroid, and the result is output as an intermediate key-value pair;

④ In the Shuffle process, the intermediate key-value pairs are processed, and the data belonging to the same cluster are collected into a set;

⑤ At the beginning of the Reduce phase, the data of each cluster are merged: the new centroid $c_i^*$ is determined by averaging the data in the cluster, and then the clustering quality $E^*$ is determined by Equation (2);

⑥ The convergence condition determines the direction of the flow. If it is met, the algorithm ends and the results are output; otherwise, the calculation is iterated until the convergence condition is met.

Ignoring the impact of information transmission and resource allocation, we compare the time complexity of the K-Means algorithm with that of the parallelized algorithm. Let $p$ be the total number of items in the data set, $q$ the number of iterations required to reach the convergence criterion, and $t$ the time required to calculate the distance from one item of data to one centroid. The time complexity of the K-Means algorithm can then be expressed simply as $T = pqtk$. In the parallelized algorithm, the data set is divided into $n$ blocks, and each Map task is responsible for the calculation of one data block. Assuming that there are $a$ nodes in total, each node is responsible for completing $n/a$ tasks, and the time complexity can be expressed as $T^* = pqtk/a$. Under these assumptions the calculation time is reduced by a factor of $a$, so the calculation efficiency is greatly improved.
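
The Map, Shuffle and Reduce steps of the parallelized algorithm can be sketched in plain Python as follows. This is only a single-process simulation under stated assumptions: the data blocks are processed one after another instead of on separate nodes, no Hadoop API is used, and the names `map_phase`, `shuffle`, `reduce_phase`, `parallel_kmeans`, the threshold `eps` and the sample blocks are hypothetical choices for this illustration, not the authors' implementation.

```python
import math
from collections import defaultdict

def map_phase(block, centroids):
    """Map: emit (cluster index, item) for the nearest centroid of each item."""
    pairs = []
    for m in block:
        i = min(range(len(centroids)), key=lambda j: math.dist(m, centroids[j]))
        pairs.append((i, m))
    return pairs

def shuffle(pairs):
    """Shuffle: collect the intermediate key-value pairs by cluster index."""
    groups = defaultdict(list)
    for key, m in pairs:
        groups[key].append(m)
    return groups

def reduce_phase(groups, centroids):
    """Reduce: new centroids by Equation (3), then quality E* by Equation (2)."""
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for i, items in groups.items():
        new_centroids[i] = tuple(sum(col) / len(items) for col in zip(*items))
    quality = sum(
        math.dist(m, new_centroids[i]) ** 2
        for i, items in groups.items()
        for m in items
    )
    return new_centroids, quality

def parallel_kmeans(blocks, centroids, eps=1e-9, max_iter=100):
    """Repeat Map, Shuffle and Reduce until |E* - E| <= eps (Equation (4))."""
    previous = math.inf
    for _ in range(max_iter):
        # On a cluster, each block would be handled by its own Map task.
        pairs = [p for block in blocks for p in map_phase(block, centroids)]
        centroids, quality = reduce_phase(shuffle(pairs), centroids)
        if abs(quality - previous) <= eps:
            break
        previous = quality
    return centroids, quality

# Hypothetical data set split into n = 2 blocks.
blocks = [[(1.0, 1.0), (10.0, 10.0)], [(1.0, 2.0), (11.0, 10.0)]]
centroids, E = parallel_kmeans(blocks, [(0.0, 0.0), (12.0, 12.0)])
print(centroids, E)
```

Because each call to `map_phase` depends only on its own block and the current centroids, the blocks can be handed to different nodes without coordination, which is the independence that the idealized estimate T* = pqtk/a relies on.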

Fig.2 The Flow of Parallelization Algorithm

IV. CALCULATION RESULTS AND CONCLUSIONS

A. Calculation Results
Taking industrial data as the research object, the data are discretized using the parallelized algorithm, with the number of clusters set to 3 and 4. The discretization results are shown in Fig.3 and Fig.4.

Fig.3 Schematic diagram of discretization results when the cluster number is 3

Fig.4 Schematic diagram of discretization results when the cluster number is 4

B. Conclusions
By discretizing the industrial data, the data are divided into 3 and 4 clusters. As the clustering results show, the data in each cluster have a high degree of similarity: densely distributed data are grouped into one cluster, and sparse data at the edges are grouped into another, so the goal of data classification is basically achieved. The example verifies that the parallelized algorithm proposed in this paper can be used for the discretization of industrial big data.

ACKNOWLEDGMENTS
This work was financially supported by the Natural Science Foundation of Hubei Province (2019CFB281).

REFERENCES
[1] ZHANG Wang, WANG Hui, Parallel K-Means Clustering Algorithm in Web Personalized Service, Microelectronics & Computer, vol. 24, no. 10, pp. 65-67, 2007.
[2] LU Yi-qing, LIN Jin-xian, Parallel PSO combined with K-means clustering algorithm based on MPI, Journal of Computer Applications, vol. 31, no. 2, pp. 428-431, 2011.
[3] LI Cheng-hua, ZHANG Xin-fang, JIN Hai, XIANG Wen, MapReduce: a New Programming Model for Distributed Parallel Computing, Computer Engineering & Science, vol. 33, no. 3, pp. 129-135, 2011.
[4] LIU Zhi-hui, ZHANG Quan-ling, Research overview of big data technology, Journal of Zhejiang University (Engineering Science), vol. 48, no. 6, pp. 957-972, 2014.
[5] TAO Xue-jiao, HU Xiao-feng, LIU Yang, Overview of Big Data Research, Journal of System Simulation, no. S1, pp. 142-146, 2013.
