Parallel MS-Kmeans clustering algorithm based on MapReduce

Guodong Li (lgd@ncepu.edu.cn)
North China Electric Power University
Chunhong Wang
North China Electric Power University
Kai Li
Xinjiang Information and Telecommunication Company

Research Article

Keywords: Clustering, MapReduce, Hadoop, MS-Kmeans

Posted Date: August 11th, 2022

DOI: https://doi.org/10.21203/rs.3.rs-1857679/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Parallel MS-Kmeans clustering algorithm based on MapReduce

Li Guodong1*, Wang Chunhong1 and Li Kai2

1* School of Control and Computer Engineering, North China Electric Power University, Beijing, 102206, China.
2 Xinjiang Information and Telecommunication Company, Urumqi, 830000, China.

*Corresponding author(s). E-mail(s): lgd@ncepu.edu.cn;


Contributing authors: wchaxx@163.com; likai@xj.sgcc.com.cn;

Abstract
To address the problems of initial center point selection, outlier influence, and clustering instability that affect the traditional K-means algorithm in big data clustering, an MS-Kmeans algorithm based on the MapReduce framework is proposed. The algorithm selects multiple non-outlier points as candidate center points, applies mean shift to the candidate center points, uses the maximum-minimum principle to select K initial center points from the candidates, and then executes the K-means algorithm to find the final center points. To speed up the clustering of large data sets, the algorithm is parallelized with the MapReduce framework on the Hadoop platform. The experimental results show that the parallel MS-Kmeans algorithm is feasible for big data clustering and performs well in terms of performance, speed, and stability.

Keywords: Clustering, MapReduce, Hadoop, MS-Kmeans

1 Introduction
Clustering is an iterative process [1] that divides a data set into clusters, placing objects with high similarity in the same cluster and objects with low similarity in different clusters [2, 3]. Clustering can be applied to data analysis in many computer-related fields, such as text analysis, pattern recognition, and artificial intelligence. The K-means algorithm is one of the classical clustering algorithms and is a distance-based clustering algorithm [4, 5]. Its basic idea is to divide the data set into K subsets, each representing a cluster. The K-means algorithm is highly dependent on the selection of the initial center points, so it is susceptible to the influence of outliers, which can cause the clustering to converge to a local optimum. For this reason, several papers have proposed improved K-means algorithms [6, 7] to solve the initial center point selection problem of the traditional K-means algorithm, and these algorithms achieve good results on small data sets.
The K-means algorithm requires a large amount of computational resources when dealing with large-scale data, and its convergence becomes slower [8]. Therefore, parallel K-means clustering is a popular topic in data mining. To provide storage and computational capacity for massive data, the Apache Foundation developed the Hadoop distributed system infrastructure, which is highly reliable, scalable, fault-tolerant, and low-cost [9, 10]. HDFS and MapReduce are the two core modules of Hadoop. HDFS is a distributed file management system running on multiple machines [11, 12]. MapReduce is a programming framework for distributed computing that can turn serial algorithms into parallel ones, reducing their time complexity [13–15] and computing time; it is centered on three functions: Map, Combine, and Reduce [16, 17]. However, K-means is susceptible to the influence of initial values and outlier points, which makes the results unstable and prone to converging to a local optimum, degrading clustering performance.
In this paper, we propose a K-means variant with improved initial center point selection, named MS-Kmeans. To reduce the influence of outliers, the algorithm first randomly selects M non-outlier points from the data set as candidate center points, and then performs the Mean Shift algorithm on the M candidate center points so that they move toward data-dense areas. To avoid the clustering results falling into a local optimum, the K points that are farthest from each other are selected from the M candidate center points as the initial centers of K-means clustering using the maximum-minimum principle, and the data are then clustered by K-means. In addition, to reduce the running time, a parallel MS-Kmeans algorithm is developed using the MapReduce framework, making the algorithm more suitable for handling large data sets.

2 Related work
Traditional clustering algorithms are efficient and obtain satisfactory clustering results on small data sets, but they perform poorly on large data sets because of memory, data volume, and computational power constraints. To improve clustering performance in large data
environments, many clustering algorithms based on the MapReduce framework have been proposed. In the literature [18], the K-Medoids algorithm was improved by using the population-evolution feature of the genetic algorithm to address the sensitivity of K-Medoids to its initial center points and its tendency to fall into local optima, and a simple and natural encoding that supports context-sensitive genetic operations was proposed. To address the memory and CPU bottlenecks of the K-Medoids algorithm when processing large data, the improved K-Medoids algorithm was parallelized in the MapReduce framework, and the parallel design can prevent premature convergence to a certain extent. The literature [19]
improved the K-means algorithm using a two-stage approach: the first stage used the Canopy algorithm to roughly partition the dataset into several overlapping subsets, each of which was a cluster, in order to obtain rough cluster center points. After the Canopy algorithm, each data point belonged to at least one cluster, and points belonging to multiple clusters were then assigned to the nearest cluster. The second stage used the K-means algorithm to calculate the center points of each new cluster. The algorithm also introduced the maximum-minimum principle, which kept any two Canopy centers as far apart as possible. The literature [20] proposed the SKMeans clustering algorithm, which
was a sampling-based K-means clustering algorithm. To improve the efficiency of sampling, obtain a satisfactory sample set, and reduce the I/O and network overhead of each sampling iteration, a new sampling method combined with grid partitioning was proposed, which made the obtained sample points highly representative. To avoid losing data points that cannot be represented by sample points and to improve the quality of clustering, representative verification was introduced into the K-means algorithm. The proposed algorithm reduced the data computation through sampling while using representative verification to improve the clustering accuracy, and the experiments showed that the MR-SKMeans algorithm has good efficiency, scalability, and accuracy.
An improved parallel Canopy-Kmeans algorithm was proposed in the literature [21], which used statistical ideas to group and sample the data set before clustering, facilitating parallelization and reducing the time complexity. This algorithm used a data-outlier averaging sampling method to ensure that samples were drawn uniformly from the original data, and the maximum-minimum principle was used to select the initial center points of the Canopy algorithm. The clustering accuracy and timeliness of the improved Canopy-Kmeans algorithm were improved to some extent.
A density-based incremental K-means clustering algorithm was proposed in the literature [22], which first calculated the density of each data point; points whose density exceeded a certain threshold, together with the points within their density range, constituted the basic clusters, and then two basic clusters close to each other were merged into one cluster. Finally, points that did not belong to any cluster
were assigned to the nearest cluster. To improve efficiency and reduce complexity, the algorithm was parallelized on the Hadoop platform. The literature [23] used a hash algorithm to map data with high similarity to the same address space and data with low similarity to different address spaces; K initial center points were then selected from the K data-rich address spaces, chosen so that different clusters were as far apart as possible. The hash-function mapping can effectively avoid taking outlier points as the initial cluster centers and helps mine the clustering relationships in the data, leading to better initial center points. A KMEANS-BRMS algorithm based on interval means and sampling was proposed in the literature [24]. To reduce the influence of outliers on the clustering results, the mean range method (MRM) was used to obtain the initial cluster centers. To solve the data skew caused by the uneven distribution of data among nodes during parallelization, a BSA strategy based on pond sampling and a first-adaptation algorithm was proposed, thus improving the overall clustering efficiency.
In this paper, a parallel MS-Kmeans clustering algorithm based on the MapReduce framework is proposed. It selects non-outlier points as candidate center points and uses the Mean Shift algorithm to move the candidates toward data-dense regions [25]; it then uses the maximum-minimum principle to select K points from the candidates as the initial centers of the K-means algorithm, performs K-means clustering on the data to find the final cluster centers, and assigns all data to the nearest clusters. The algorithm effectively reduces the clustering running time and improves clustering performance and stability.
The main contributions of this paper are as follows: (1) We propose a parallel MS-Kmeans algorithm that speeds up the clustering of large data and improves the performance and stability of clustering. (2) By examining the other data points in a point's high-dimensional neighborhood, we can effectively determine whether the point is an outlier, thus reducing the impact of outliers on clustering performance. (3) By moving the candidate center points to dense regions that represent the data distribution through the Mean Shift algorithm, the search for center points is accelerated and the stability of the algorithm is increased. (4) The maximum-minimum principle is introduced to select the initial center points from the candidate center points, preventing K-means from converging to a local optimum.

Fig. 1 Mean Shift process for one candidate center vector. In panel (a), x′m is a candidate center vector; panel (b) draws all shift vectors within the specified area; panel (c) shows the distance and direction of one shift of x′m, where M(x′m) is the shift mean vector and the endpoint is the position of x′m after its first shift; panel (d) shows the result of one shift of a candidate center vector; panel (e) shows the result of multiple shifts; and panel (f) shows the final result of the candidate center vector's mean shift iteration.

3 MS-Kmeans algorithm
Suppose there are N data vectors, each of dimension P. Let $X = \{x_n = (x_{n1}, x_{n2}, \ldots, x_{np}, \ldots, x_{nP}),\ n = 1, 2, \ldots, N;\ p = 1, 2, \ldots, P\}$, where $x_n$ is the n-th P-dimensional data vector. $X' = \{x'_m = (x'_{m1}, x'_{m2}, \ldots, x'_{mp}, \ldots, x'_{mP}),\ m = 1, 2, \ldots, M\}$ is a set of M candidate center vectors randomly selected from X. $R = \{R_1, R_2, \ldots, R_k, \ldots, R_K\}$ denotes the K clusters finally produced, with $R_k$ the k-th cluster. $C = \{c_k = (c_{k1}, c_{k2}, \ldots, c_{kp}, \ldots, c_{kP}),\ k = 1, 2, \ldots, K\}$ is the set of center vectors of the K clusters, with $c_k$ the k-th cluster center, where $K < M < N$.
The steps of the proposed algorithm are as follows:
Select M non-outlier points from the data set X as the candidate center vector set X′. For each candidate center vector, mean shift is performed using the Mean Shift algorithm until the shift distance equals zero or the maximum number of shifts is reached, as shown in Figure 1.
The Mean Shift algorithm for each candidate center vector at time t is described in steps (1)–(4):
(1) Find all data vectors $x_j$ within the high-dimensional sphere of radius D centered at $(x'_m)^t$, where $(x'_m)^t$ is the m-th candidate center vector in state t.
(2) Calculate the mean of the offsets of all vectors $x_j$ within the radius-D sphere using formula (1), obtaining the mean shift vector $M(x'_m)^t$:

$$M(x'_m)^t = \frac{1}{G} \sum_{x_j \in S_D} \left(x_j - (x'_m)^t\right) \tag{1}$$

where $S_D$ is the high-dimensional sphere region with $(x'_m)^t$ as its center vector and D as its radius, $x_j$ is a data vector contained in $S_D$, G is the number of vectors within $S_D$, and $M(x'_m)^t$ is the shift mean vector obtained in state t.
(3) Move the candidate center vector in the direction of the mean shift vector using formula (2):

$$(x'_m)^{t+1} = (x'_m)^t + M(x'_m)^t \tag{2}$$

where $(x'_m)^{t+1}$ is the m-th candidate center vector in state t + 1.
(4) Repeat steps (2) and (3) until the shift distance equals zero or the maximum number of shifts is reached.
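For concreteness, steps (1)–(4) for a single candidate center vector can be sketched as follows. This is an illustrative NumPy sketch added for exposition, not the authors' implementation; the names mean_shift, data, D, and max_shifts are our own:

```python
import numpy as np

def mean_shift(candidate, data, D, max_shifts=100):
    """Shift one candidate center vector toward the data-dense region,
    following steps (1)-(4): average the offsets of all data vectors
    inside the radius-D sphere and move the candidate by that mean."""
    x = candidate.astype(float)
    for _ in range(max_shifts):
        # (1) all data vectors within the high-dimensional sphere S_D
        in_sphere = data[np.linalg.norm(data - x, axis=1) <= D]
        if len(in_sphere) == 0:
            break
        # (2) mean shift vector M(x'_m)^t = (1/G) * sum(x_j - (x'_m)^t)
        shift = (in_sphere - x).mean(axis=0)
        # (4) stop once the shift distance reaches zero
        if np.linalg.norm(shift) == 0:
            break
        # (3) move the candidate in the direction of the mean shift vector
        x = x + shift
    return x
```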
After the Mean Shift algorithm is executed, M candidate center vectors located in high-density regions are obtained. K vectors that are farthest from each other are then selected from the candidate center vectors as the initial center vector set C by the maximum-minimum principle, as described in steps (5)–(7):
(5) Randomly select a vector $x'_m$ from the candidate set X′ and add it to set C.
(6) Calculate the distance from each vector in X′ to all vectors in set C; the vector $x'_m$ that attains

$$\max_{x'_m \in X'} \; \min_{c_k \in C} \; dist(x'_m, c_k)$$

is added to C, where $dist(x'_m, c_k)$ denotes the Euclidean distance between $x'_m$ and $c_k$.
(7) Repeat step (6) until K initial center vectors have been selected as the center vectors of the K clusters.
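Steps (5)–(7) admit an equally compact sketch (again illustrative; the function max_min_init and its parameters are our own naming):

```python
import numpy as np

def max_min_init(candidates, K, seed=0):
    """Pick K mutually distant initial centers from the shifted candidates:
    start from a random one (step (5)), then repeatedly add the candidate
    whose distance to its nearest chosen center is largest (steps (6)-(7))."""
    rng = np.random.default_rng(seed)
    centers = [candidates[rng.integers(len(candidates))]]
    while len(centers) < K:
        # for each candidate, the distance to its nearest chosen center ...
        nearest = np.min(
            [np.linalg.norm(candidates - c, axis=1) for c in centers], axis=0
        )
        # ... and the candidate maximizing that minimum joins C
        centers.append(candidates[int(np.argmax(nearest))])
    return np.array(centers)
```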
All data are then clustered by K-means based on the selected K initial center vectors, as described in steps (8)–(11):
(8) For each vector $x_n$ in X, calculate its Euclidean distance to the K center vectors using formula (3), and assign $x_n$ to the nearest cluster:

$$dis(x_n, c_k) = \sqrt{\sum_{p=1}^{P} (x_{np} - c_{kp})^2} \tag{3}$$

(9) Recalculate each center vector using formula (4) and check whether the distance between the new and old center vectors equals zero:

$$c_k = \frac{1}{|R_k|} \sum_{x_n \in R_k} x_n \tag{4}$$

where $|R_k|$ is the number of vectors in cluster $R_k$.

(10) Repeat steps (8) and (9) until the K center vectors no longer change, indicating that the clustering has converged; the final K center vectors are then obtained.
(11) All data vectors are assigned to the nearest cluster, and the clustering result R is output.
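Steps (8)–(11) are standard Lloyd-style K-means iterations; a minimal serial sketch under the same caveats (all names are ours) is:

```python
import numpy as np

def kmeans(data, centers, max_iter=300):
    """Alternate assignment (formula (3)) and center update (formula (4))
    until the center vectors stop moving."""
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # (8) assign each vector to its nearest center (Euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # (9) recompute each center as the mean of its cluster's members
        new_centers = np.array([
            data[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        # (10) converged when the distance between old and new centers is zero
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # (11) final centers and the cluster assignment of every vector
    return centers, labels
```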

4 Parallelization of the MS-Kmeans algorithm on the Hadoop platform
Computing on a large amount of data often takes a long time. To reduce the running time, the method proposed in this paper implements parallel computing on the MapReduce framework. The execution process is divided into three stages, as shown in Figure 2.

4.1 Mean Shift
The main task of the Mean Shift phase is to select M non-outlier points as candidate center vectors and to shift them toward dense areas. This phase consists of three components: Map, Combine, and Reduce.
Map phase: The map function finds the data object vectors within radius D of the candidate center vectors. The data set is input to the map function as <key1, value1> key-value pairs. At this stage, the selected candidate center vectors and their identifiers are read from the HDFS file, all vectors within radius D of each candidate center vector are found and marked with that candidate's identifier, and finally all vectors within radius D of each candidate center vector are output with their identifiers as <key2, value2> pairs, where value2 is a data vector and key2 is the identifier of the candidate center vector that value2 belongs to.
Combine stage: The combine function merges key-value pairs with the same key to reduce the computation of the reduce function. Its output is <key3, value3> key-value pairs, where key3 is a candidate center vector identifier and value3 is the combination of all vectors sharing that key3.
Reduce stage: The reduce function processes the data vectors within the radius of the same candidate center vector and generates a new candidate center vector. For each candidate center vector, the shift vector is calculated from the data vectors within its radius; moving the candidate in the direction of the shift vector yields a new candidate center vector, which is then written to an HDFS file so that the map function can read it in the next iteration. During the first iteration, if only one data object vector lies within radius D of a candidate center vector, that candidate center vector is discarded. The output format of the reduce function is <key4, value4>, where key4 is the new candidate center vector identifier and value4 is the new candidate center vector. After each iteration, the distance between the new and old candidate center vectors is calculated; if it equals zero or the required number of iterations is reached, the iteration ends.
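The key-value contracts described above can be sketched independently of Hadoop's Java API. The following Python functions are an illustrative rendering of the <key2, value2> to <key4, value4> flow; note that this combiner pre-aggregates vectors into (partial sum, count) pairs, a common optimization, whereas the text describes simple concatenation, and all names are our own:

```python
import numpy as np

def ms_map(x, candidates, D):
    """Map: for one data vector x, emit (candidate id, x) for every candidate
    center whose radius-D sphere contains x -- the <key2, value2> pairs."""
    for m, c in candidates.items():
        if np.linalg.norm(x - c) <= D:
            yield m, x

def ms_combine(m, vectors):
    """Combine: collapse same-key vectors into one (partial sum, count)
    pair to cut the traffic that reaches the reducer (<key3, value3>)."""
    vs = np.asarray(vectors)
    return m, (vs.sum(axis=0), len(vs))

def ms_reduce(m, partials):
    """Reduce: average all vectors inside the sphere; shifting the candidate
    by the mean shift vector lands it exactly on that average (<key4, value4>).
    A candidate with at most one neighbour is discarded as an outlier."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count <= 1:
        return None
    return m, total / count
```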
Fig. 2 Parallelization process of MS-Kmeans. The flowchart comprises three stages: Mean Shift (select M initial candidate center vectors, then iterate Map, Combine, and Reduce tasks until the shift end condition is reached), K-means (select K initial cluster center vectors, then iterate Map, Combine, and Reduce tasks until the center vectors no longer change, yielding the K final center vectors), and generation of the clustering results by a final map pass.

4.2 K-means
The main work of the K-means stage is to select K initial center vectors from the candidate center vectors generated by the Mean Shift stage and then perform K-means clustering on the data. This stage also includes three components: Map, Combine, and Reduce.
Map phase: The map function calculates the center vector closest to each data object vector. The data set is input to the map function as <key5, value5> key-value pairs. The function computes the center vector nearest to each data vector, attaches the corresponding identifier, and outputs <key6, value6> pairs, where value6 is the data vector and key6 is the identifier of the center vector closest to value6. Each iteration of the map function places each data vector in the cluster of its nearest center vector and may change the center vectors, so the center vectors need to be recalculated.
Combine stage: Similar to the combine function in the Mean Shift phase, the combine function merges key-value pairs with the same key to reduce the computation of the reduce function. Its output is a <key7, value7> key-value pair, where key7 is a center vector identifier and value7 is the combination of all vectors sharing that key7.
Reduce stage: The reduce function processes the data vectors belonging to the same cluster and generates a new center vector. The input of the reduce function is the output of the combine function. At this stage, the new center vectors are calculated and written to an HDFS file for the map function to read in the next iteration. The output is <key8, value8>, where value8 is the new center vector and key8 is the identifier of the new center vector. After each iteration, the distance between the new and previous center vectors is calculated; if it equals zero, the iteration ends and the clustering result has converged.
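The K-means stage follows the same pattern; an illustrative sketch with the same (sum, count) combiner convention (names are ours):

```python
import numpy as np

def km_map(x, centers):
    """Map: emit (id of the nearest center, x) -- the <key6, value6> pair."""
    ids = list(centers)
    d = [np.linalg.norm(x - centers[k]) for k in ids]
    return ids[int(np.argmin(d))], x

def km_combine(k, vectors):
    """Combine: collapse same-key vectors into (partial sum, count)."""
    vs = np.asarray(vectors)
    return k, (vs.sum(axis=0), len(vs))

def km_reduce(k, partials):
    """Reduce: the new center (<key8, value8>) is the mean of the cluster."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return k, total / count
```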

4.3 Generation of clustering results


Based on the K final center vectors calculated in the K-means phase, all data
object vectors are assigned to the nearest cluster using the map function, and
K clusters are generated.

5 Experiments and Result Analysis
The experimental platform is built with the open-source distributed software Hadoop to test the performance of the MS-Kmeans clustering algorithm proposed in this paper. Four nodes are built with four computers for the test analysis; the operating system is CentOS 7.
The experimental data are data sets publicly available from UCI and Kaggle; detailed information is given in Table 1. In addition, to avoid the influence of randomness, the average of 20 runs on each data set is taken for analysis.

Table 1 Dataset information (K represents the number of clusters)

Datasets        Number of instances   Number of attributes   K   File size
Iris            150                   4                      3   2.35 KB
Exp-nonscale    600                   10000                  3   143.05 MB
KDD-Cup-1999    6142547               42                     2   749.83 MB

5.1 Cluster performance analysis

In this section, three commonly used measures of clustering effectiveness, Average Silhouette, Davies-Bouldin Index (DBI), and Sum of the Squared Error (SSE), are selected to evaluate the performance of the algorithm, and the proposed algorithm is compared with K-means and K-means++.
To avoid the effect of randomness, each of the three clustering algorithms was run 20 times on each of the three data sets, and the average values of the clustering indicators are shown in Table 2. Compared with the K-means and K-means++ algorithms, the MS-Kmeans algorithm proposed in this paper performs better on all data sets under all three metrics, and the improvement is especially significant on the KDD-Cup-1999 data set. The table also shows that the larger the data set, the more pronounced the performance improvement of the proposed algorithm.
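For reference, the three metrics can be computed as follows. This illustrative snippet uses scikit-learn, which the paper does not mention; smaller DBI and SSE and larger Average Silhouette indicate better clustering:

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

def evaluate(X, labels, centers):
    """Return (Average Silhouette, DBI, SSE) for one clustering run.
    X is the (N, P) data matrix, labels the integer cluster assignment,
    and centers the (K, P) matrix of final cluster centers."""
    sse = float(np.sum((X - centers[labels]) ** 2))
    return silhouette_score(X, labels), davies_bouldin_score(X, labels), sse
```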

Table 2 Comparative analysis of clustering algorithms

Dataset         Algorithm   DBI        Average Silhouette   SSE
Iris            K-means     0.666391   0.550964             111.0037
                K-means++   0.663950   0.551941             120.2962
                MS-Kmeans   0.662933   0.552266             96.79702
Exp-nonscale    K-means     2.648141   0.088575             1.15E+10
                K-means++   2.637542   0.089006             1.19E+10
                MS-Kmeans   2.575566   0.092299             1.13E+10
KDD-Cup-1999    K-means     0.827655   -                    -
                K-means++   0.708568   -                    -
                MS-Kmeans   0.423758   -                    -

5.2 Speedup ratio analysis

To test the parallel computing performance of the proposed MS-Kmeans algorithm on the Hadoop platform, this section uses the speedup ratio to measure the parallelism of the algorithm. The speedup ratios of the MS-Kmeans algorithm on the Iris, Exp-nonscale, and KDD-Cup-1999 datasets are calculated, and the results are shown in Figure 3. The optimized parallel MS-Kmeans algorithm based on the MapReduce framework achieves a good speedup ratio, especially on the KDD-Cup-1999 dataset, where the improvement is most obvious. Figure 3 also shows that the acceleration of the proposed algorithm is greater when the dataset is larger and the number of nodes is higher,
so the parallel MS-Kmeans algorithm proposed in this paper is well suited to large datasets.

Fig. 3 Speedup ratio. The graph depicts the speedup ratios of the MS-Kmeans algorithm on the three datasets for different numbers of nodes.
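The paper does not state how the speedup ratio is defined; presumably the standard definition is used, with $T_1$ the running time on a single node and $T_n$ the running time on $n$ nodes:

$$S_n = \frac{T_1}{T_n}$$

Under this definition, an ideal linear speedup on the four-node cluster would be $S_4 = 4$.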

5.3 Stability analysis

To verify the stability of the clustering algorithm proposed in this paper, this section analyzes the stability of the K-means, K-means++, and MS-Kmeans algorithms on the Iris, Exp-nonscale, and KDD-Cup-1999 datasets with respect to the three evaluation metrics Average Silhouette, DBI, and SSE. Figure 4(a)(b)(c) shows that, on the Iris dataset, the MS-Kmeans algorithm is more stable on all three metrics than both K-means and K-means++ and reaches the best values most of the time. Figure 4(d)(e)(f) shows that, on the Exp-nonscale dataset, the MS-Kmeans algorithm is very stable on Average Silhouette and DBI with better values, and is more stable on SSE than the K-means and K-means++ algorithms. Figure 4(g) shows that, on the KDD-Cup-1999 dataset, the MS-Kmeans algorithm is very stable on DBI. The experiments show that the MS-Kmeans algorithm proposed in this paper is stable and achieves good results.
5.4 Runtime analysis

Figure 5 shows the running times of the three algorithms with different numbers of nodes on the Iris, Exp-nonscale, and KDD-Cup-1999 data. The results show that the MS-Kmeans algorithm proposed in this paper runs faster than the K-means and K-means++ algorithms, and that it runs faster as the number of nodes increases; this is most obvious on the KDD-Cup-1999 dataset.

6 Conclusion
To solve the problems that the traditional K-means algorithm is susceptible to outliers, time-consuming, and unstable in the context of big data, this paper proposes an MS-Kmeans algorithm. The algorithm identifies outliers in high-dimensional space, uses the Mean Shift algorithm to move candidate center points to data-dense regions, selects the initial center points using the maximum-minimum principle, and is parallelized within the MapReduce framework. Experiments on the Iris, Exp-nonscale, and KDD-Cup-1999 public datasets compare the performance of the K-means, K-means++, and MS-Kmeans algorithms and verify the advantages of the proposed algorithm. The experimental results show that the clustering speed, performance, and stability of the MS-Kmeans algorithm on large data sets are effectively improved.

Declarations
• Funding The authors did not receive support from any organization for the
submitted work. No funding was received to assist with the preparation
of this manuscript. No funding was received for conducting this study. No
funds, grants, or other support was received.
• Conflict of interest/Competing interests The authors have no relevant finan-
cial or non-financial interests to disclose. The authors have no competing
interests to declare that are relevant to the content of this article. All authors
certify that they have no affiliations with or involvement in any organization
or entity with any financial interest or non-financial interest in the sub-
ject matter or materials discussed in this manuscript. The authors have no
financial or proprietary interests in any material discussed in this article.
• Ethics approval The manuscript will not be submitted to multiple journals for consideration at the same time. The submitted work is original and has not been published elsewhere in any form or language (in part or in whole). The results are presented clearly and honestly, without fabrication, falsification, or improper data manipulation.
• Consent to participate
The author did not consent to participate in the declaration
• Consent for publication
Submission of work requires that the piece to be reviewed has not been
previously published. Upon acceptance, the Author assigns to the Journal
of Grid Computing the right to publish and distribute the manuscript.
• Availability of data and materials
The datasets analysed during the current study are available in the UCI
and Kaggle. These datasets were derived from the following public domain
resources:
https://archive.ics.uci.edu/ml/datasets/Iris
https://www.kaggle.com/datasets/zechengzhang/cluster-exp-nonscale?select=kappa omega test.txt
http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data
• Authors’ contributions
Li Guodong: Data Curation, Writing, Review, Editing, Supervision, Project
administration.
Wang Chunhong: Methodology, Software, Validation, Investigation, Writing-
Original draft preparation.
Li Kai: Conceptualization, Visualization.
• Acknowledgements
On the completion of this work, I would like to express my deepest gratitude to all those whose kindness and advice have made it possible. I am greatly indebted to my advisor Li Guodong, who gave me valuable instructions and improved my language; his effective advice and shrewd comments have kept the thesis in the right direction.
I would like to thank my partner Li Kai for his friendship and constructive suggestions; he constantly encouraged me when I felt frustrated with this dissertation.

References
[1] Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M.,
Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., et al.: Apache hadoop
yarn: Yet another resource negotiator. In: Proceedings of the 4th Annual
Symposium on Cloud Computing, pp. 1–16 (2013)

[2] Saeed, Z., Abbasi, R.A., Maqbool, O., Sadaf, A., Razzak, I., Daud, A.,
Aljohani, N.R., Xu, G.: What’s happening around the world? a survey
and framework on event detection techniques on twitter. Journal of Grid
Computing 17(2), 279–312 (2019)

[3] da Rosa Righi, R., Lehmann, M., Gomes, M.M., Nobre, J.C., da Costa,
C.A., Rigo, S.J., Lena, M., Mohr, R.F., de Oliveira, L.R.B.: A survey on
global management view: toward combining system monitoring, resource
management, and load prediction. Journal of Grid Computing 17(3), 473–
502 (2019)

[4] Mao, D.: Improved algorithm of canopy-kmeans based on mapreduce. Computer Engineering and Applications 48(27), 22–26 (2012)

[5] Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344(6191), 1492–1496 (2014)

[6] Zhang, G., Zhang, C., Zhang, H.: Improved k-means algorithm based on
density canopy. Knowledge-based systems 145, 289–297 (2018)

[7] Ling, Y., Zhang, X.: An improved k-means algorithm based on multiple
clustering and density. In: 2021 13th International Conference on Machine
Learning and Computing, pp. 86–92 (2021)

[8] Pacella, M.: Unsupervised classification of multichannel profile data using pca: An application to an emission control system. Computers & Industrial Engineering 122, 161–169 (2018)

[9] Feng, D., Zhu, L., Zhang, L.: Review of hadoop performance optimiza-
tion. In: 2016 2nd IEEE International Conference on Computer and
Communications (ICCC), pp. 65–68 (2016). IEEE

[10] Yonggui, W., Chao, W., Wei, D.: Random sampling k-means algorithm
based on mapreduce. Computer Engineering and Applications 52(8), 74–
79 (2016)

[11] Khan, M.A., Memon, Z.A., Khan, S.: Highly available hadoop namenode
architecture. In: 2012 International Conference on Advanced Computer
Science Applications and Technologies (ACSAT), pp. 167–172 (2012).
IEEE

[12] Singh, K., Kaur, R.: Hadoop: addressing challenges of big data. In: 2014
IEEE International Advance Computing Conference (IACC), pp. 686–689
(2014). IEEE

[13] Sardar, T.H., Ansari, Z.: Distributed big data clustering using mapreduce-
based fuzzy c-medoids. Journal of The Institution of Engineers (India):
Series B 103(1), 73–82 (2022)

[14] Alguliyev, R.M., Aliguliyev, R.M., Sukhostat, L.V.: Parallel batch k-means for big data clustering. Computers & Industrial Engineering 152, 107023 (2021)

[15] Sardar, T.H., Ansari, Z.: Mapreduce-based fuzzy c-means algorithm for
distributed document clustering. Journal of The Institution of Engineers
(India): Series B 103(1), 131–142 (2022)

[16] Xiong, K., He, Y.: Power-efficient resource allocation in mapreduce clusters. In: 2013 IFIP/IEEE International Symposium on Integrated Network Management (IM 2013), pp. 603–608 (2013). IEEE

[17] Hanif, M., Lee, C.: Jargon of hadoop mapreduce scheduling techniques: a
scientific categorization. The Knowledge Engineering Review 34 (2019)

[18] Lai, X., Gong, X., Han, L.: Genetic algorithm based k-medoids clustering
within mapreduce framework. Computer Science 44(03), 23–26 (2017)

[19] Zhang, W., Jiang, L.: Parallel computation algorithm for big data clus-
tering based on mapreduce. Application Research of Computers 37(1)
(2018)

[20] Li, H., Liu, R., Wang, J., Wu, Q.: An enhanced and efficient clustering
algorithm for large data using mapreduce. IAENG International Journal
of Computer Science 46(1), 61–67 (2019)

[21] Zhou, G.: Improved optimization of canopy-kmeans clustering algorithm based on hadoop platform. In: Proceedings of the International Conference on Information Technology and Electrical Engineering 2018, pp. 1–6 (2018)

[22] Lu, W.: Improved k-means clustering algorithm for big data mining under
hadoop parallel framework. Journal of Grid Computing 18(2), 239–250
(2020)

[23] Hou, X.: An improved k-means clustering algorithm based on hadoop platform. In: The International Conference on Cyber Security Intelligence and Analytics, pp. 1101–1109 (2019). Springer

[24] Huang, X., Cheng, S.: Optimization of k-means algorithm base on mapre-
duce. In: Journal of Physics: Conference Series, vol. 1881, p. 032069
(2021). IOP Publishing

[25] Chen, Y., Hu, P., Wang, W.: Improved k-means algorithm and its imple-
mentation based on mean shift. In: 2018 11th International Congress on
Image and Signal Processing, Biomedical Engineering and Informatics
(CISP-BMEI), pp. 1–5 (2018). IEEE
Fig. 4 Stability analysis. The stability of the K-means, K-means++, and MS-Kmeans algorithms is compared on three data sets over 20 runs. Panels (a), (b), and (c) show the stability of Iris on Average Silhouette, DBI, and SSE; panels (d), (e), and (f) show the stability of Exp-nonscale on Average Silhouette, DBI, and SSE; panel (g) shows the stability of KDD-Cup-1999 on DBI. The X-axis represents the run number and the Y-axis the value of each index.
Fig. 5 Run time. Panel (a) shows the running time of the three algorithms on the Iris dataset for different numbers of nodes, panel (b) on the Exp-nonscale dataset, and panel (c) on the KDD-Cup-1999 dataset.
