
A Robust Initialization Algorithm for k-Means Clustering in Power Distribution Networks with PMU-based Adaptive Protection System
Pooria Mohammadi and Hassan El-Kishky
Department of Electrical Engineering, University of Texas at Tyler, Tyler, TX, USA

ABSTRACT

k-Means clustering is one of the most popular and influential algorithms for data categorization. Its simple and straightforward formulation has made it a widely accepted method in many fields and applications. This simplicity comes at a price: the number of clusters must be defined by the user, the clusters tend to be uniformly sized, and the final clusters differ between runs because the algorithm is sensitive to the initial centroids. This sensitivity leads to different clusters per execution, with varying and relatively large numbers of iterations. Different applications have their own initialization and improvement techniques for k-Means that rely on their particular data traits. Power systems have recently become involved with data mining and clustering because of the fast increase in the use of PMUs for supervisory, control, and protection purposes in smart grids. The large amount of data streamed by PMUs demands a simple method with minimal computational burden in order to meet the delay tolerance of the various working phases and expectations. This article presents an approach that significantly improves the k-Means clustering algorithm by pre-analyzing the data and finding the best initial centroids. Extensive experiments have been carried out to verify the robustness of the approach in reducing the number of iterations and producing the same clusters in all executions.

Index Terms: k-Means clustering, PMU, power systems, Central Protection Unit

1 INTRODUCTION

k-Means clustering is one of the most popular techniques in data clustering. It has been utilized in applications such as data mining, knowledge discovery, data compression and vector quantization, pattern recognition and classification, medical imaging, and many others involving experimental, statistical, or just large amounts of information [1-3].

The fast data streams of PMUs demand a simple and adaptive algorithm that can keep up with ongoing streams of information. Xiong et al. [2] show that k-Means clustering tends to produce clusters of relatively uniform size even when the input data are distinctly varied. In many applications, the initial centroids can be chosen randomly from within the data [5,6]. In this study, however, random initial centroids may lead to noticeable deviations in the k-Means clustering results.

This article introduces a new initialization algorithm for k-Means clustering. The paper is organized as follows: in Section 2, k-Means is introduced; in Section 3, the proposed initialization algorithm is presented; in Section 4, the results and discussion are presented; and, finally, conclusions are presented in Section 5.

2 K-MEANS CLUSTERING PRINCIPLES

Given an input data set $X = \{x_1, \dots, x_n\}$ consisting of n objects in the d-dimensional space $\mathbb{R}^d$, and the parameter k, the number of clusters into which the input needs to be categorized, k-Means clustering should determine the k clusters together with their centroid vectors. k-Means places no limit on the number of dimensions of the input objects. That is, each $x_i$ can be represented as $x_i = (x_{i1}, \dots, x_{id})$, where d can be any positive integer. The clustering algorithm starts by selecting k points within X as the initial cluster centroids, which should end up close to the centers of mass of their own clusters by the end of the algorithm. One disadvantage of k-Means is its sensitivity to the initial centroids; on the other hand, the degree of this dependency depends on the natural structure of the data [6]. Choosing random centroids may be quite sufficient for some types of data, but it results in dissimilar outputs for PMU data, as shown in Section 4.

The squared Euclidean distance is used for data clustering in basic k-Means. Some k-Means variants use different distance computations, which do not have a significant impact on the results, particularly in this application. An error function is calculated in each iteration, and clustering is done by minimizing this error function. Hence, either the function value or its change from the previous iteration can be used as the termination condition. The most common termination condition, however, is that no more points are transferred between clusters, at which point the k-Means algorithm has converged to a steady state.

Both the centroids and the clusters (groups of objects) are updated in each iteration. The centroids play an important role in the number of iterations k-Means needs to converge and in the quality and stability of the clustered results across executions. The

978-1-4799-4047-9/14/$31.00 2014 IEEE 252

centroids are updated with the mean of the objects assigned to them. The centroid $c_j$ of cluster $C_j$, where $n_j$ is the number of objects in $C_j$, is given by

$$c_j = \frac{1}{n_j} \sum_{x_i \in C_j} x_i \qquad (1)$$

3 PROPOSED INITIALIZATION ALGORITHM

k-Means clustering requires the user to determine the number of clusters and the initial centroids. Both are challenging; however, determining the number of clusters is beyond the scope of this study.

Each category of data has its particular traits owing to the nature of the system. The dashed lines in Figure 1 separate the different operating phases of our system as the scenario events occur over the 0 to 1 second simulation time span. The plotted data fall into four categories, three of which, C-1, C-2, and C-3, correspond to the system when it is stable. The narrow bands in Figure 1, delimited by dashed lines, represent the transient periods in which the system encounters a change or fault and needs some time to be damped. Figure 1 results from a data stream of 24031 3-D points produced by a PMU. Thus each $x_i$ relates to a specific moment and is a dot positioned along the system trace. Figure 2 presents the same data in a 3-D plot, using dots to trace the data. The cyan dots in Figure 2 correspond to the narrow dashed-line periods in Figure 1, and the same correspondence holds for the C regions and their colors. One can deduce that the narrow bands of Figure 1, which span only short times, cover large distances in the actual 3-D space. That is, although the transient parts take a small amount of time, the dots plotted for those periods have a larger step length. This holds for faults, load changes, DG connection or disconnection, and any other disturbance or change of operating condition in power systems.

Figure 1. Typical PMU output data.

PMUs sample the phasors at a constant sampling rate synchronized by GPS satellites. Hence, any change in the step length of the dots in Figure 2 is due solely to the system response, which can be categorized by the rate, magnitude, or average of the changes. For instance, in Figure 2, the upstream utility grid feeds the system in C-1. Connecting a significant amount of DG to the system changes the system operating point, moving it from C-1 to C-2. After stabilizing in C-2, the system encounters a fault and enters C-3. The move from C-1 to C-2 in Figure 2 is marked by closely spaced dots with small steps, whereas the dots jump from C-2 to C-3 with large steps, which indicates a fault. In addition, power system and PMU output data are concentrated during steady-state conditions. That is why each of the C regions contains more than 5000 data points, while all the cyan parts contain fewer than 1500.

Figure 2. PMU5 VIIa 3-D data presentation.

The two PMU data traits, step length and concentration, are used to find the best initial centroids. A discrete function is used to calculate every step length of the system. The function is given by

$$D(i) = \lVert x_{i+1} - x_i \rVert \qquad (2)$$

where $i = 1, \dots, n-1$; the length of D is always one less than the length of the data set. Figure 3 illustrates the resulting function for the PMU output in Figure 1. Figure 3 and its magnified section clearly show that during steady-state operation the step length barely exceeds 0.2, with an average of around 0.05. While the system is changing its operating point, the step length increases significantly, to more than one.

Figure 3. Discrete function D(i) representing the system step length.
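The step-length function can be illustrated with a few lines of Python. This is a minimal sketch, assuming the step length is the Euclidean distance between consecutive samples; the function name `step_lengths` and the synthetic trace are hypothetical stand-ins, not the authors' code or data.

```python
import numpy as np


def step_lengths(points):
    """Return D(i) = ||x[i+1] - x[i]|| for consecutive samples (assumed Euclidean)."""
    points = np.asarray(points, dtype=float)
    # np.diff along axis 0 gives x[i+1] - x[i]; the norm is taken per row.
    return np.linalg.norm(np.diff(points, axis=0), axis=1)


# Synthetic stand-in for a PMU trace: two steady states joined by one jump.
rng = np.random.default_rng(0)
steady_a = np.array([1.0, 60.0, 0.0]) + 0.01 * rng.standard_normal((200, 3))
steady_b = np.array([1.2, 40.0, 30.0]) + 0.01 * rng.standard_normal((200, 3))
trace = np.vstack([steady_a, steady_b])

D = step_lengths(trace)
jump_index = int(np.argmax(D))  # position of the largest step in the trace
```

In this toy trace the steps inside each steady state stay small, while the single step between the two states is large, which mirrors the behaviour described for Figure 3.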

Using the function D and the PMU data traits, we developed a more efficient algorithm for calculating the best initial centroids. In this application the total number of data points is 24031; therefore, batches smaller than 100 points can safely be regarded as noise or, at most, unnecessary. It should be mentioned that we obtained the same results when ignoring batches smaller than 1000 points, but 100 is the most conservative batch reduction size.

The algorithm merges batches whose average points are close. This step is not really critical in real systems, since we assume that enough data are available during normal operating conditions and that abnormal conditions last for shorter periods than the normal steady state. At this stage we have a limited number of batches, each containing data whose average lies a certain distance from the others. These batches are then sorted by size. Obviously, larger batches are more important, or at least indicate larger clusters. Finally, the k largest batches are chosen to represent the clusters, and their centers of mass are calculated. These centroids are used as the initial centroids for k-Means.

4 RESULTS AND DISCUSSION

A simplified 8-bus system is used in this study for the simulations and for extracting the raw PMU data. An SDFT PMU is used to estimate the voltage, current, and phase at the bus where it is installed, with a specific sampling frequency. The 8-bus system is radial and is fed from the upstream utility grid (UG) connection. A downstream feeder has a large amount of DG, which connects in the second period of the scenario. The scenario is designed to show how the system operating state changes and how the PMU data, and subsequently the response of the proposed algorithm, change with it. Figure 1 clearly presents the scenario, which starts at t=0 under normal conditions with the system fed only from the UG. At t=0.3 the downstream DGs connect to the network, improving the voltage, reducing the sensed current, and significantly changing the phase at that specific bus. At t=0.6 a bolted three-phase fault occurs in the middle of the network, which is obvious in the PMU output, and it is cleared at t=0.9. After fault clearance, the system is still fed from both sides. That is, the system returns to the same state as the second period, marked C-2 in the figure.

Figure 2 illustrates the PMU5 VIIa data plotted according to the known scenario periods. The transient parts are in cyan, and the actual clusters are shown in different colors, with arrows indicating the cluster names, which are the same in the other PMU5 cluster results. That is, the cyan data clarify the borders and the system states (at bus 5) in 3-D for the designed scenario. Figure 4 presents the data clustered using the basic k-Means algorithm, and Figure 5 shows the clusters obtained using the proposed algorithm. One can observe in Figure 4 that the basic algorithm could not identify the C-1 cluster region and assigned parts of the transients to it. The actual C-1, however, was included in the green cluster of C-2, which is not correct. On the other hand, the clusters in Figure 5, obtained using the proposed algorithm, have distinct borders and correct cluster regions compared with Figure 1. The basic algorithm took 8 iterations, while the proposed initialization algorithm reduced this to three. Decreasing the number of iterations to roughly one-third has a significant impact on the computational burden of clustering applications.

Figure 2 illustrates three plots of data (VII), each of which can be expanded to phases a, b, and c. The same applies when using the signals' sequences. Any combination of these data can now be used for decision-making and clustering, regardless of the number of dimensions. The proposed code has no limitation on data dimensions, which enables the Central Protection Unit (CPU) to use any combination of preferred data.

Figure 4. PMU5 data clustered using conventional algorithm.

Figure 5. PMU5 data clustered using proposed algorithm.

Figures 6 and 7 present clustering results for the same scenario, but using data from PMU3. Figure 7 depicts the clustering result using the proposed algorithm for the three scenario periods. Based on the bus 3 signals, connecting the DG at t=0.3, which moves the system state from C-1 to C-2, causes only a minor change in location. The PMU installed at bus 3 was chosen precisely because C-1 and C-2 are located close to each other, which makes clustering more challenging. The clustered data in Figure 7 show the correct distinction between the C-1 and C-2 knots; Figure 6, however, illustrates how k-Means may produce a totally wrong result from random initial centroids. That is, both the C-1 and C-2 knots have been classified into a single cluster, and parts of the transients have been mistakenly considered as the third cluster. The basic k-Means method takes 9

iterations, while the proposed algorithm again reduces this to about one-third. Pre-analyzing the data with the proposed algorithm results in precise and stable clustering over any number of executions. As mentioned before, k-Means clustering with the basic algorithm does not have a stable output; Figures 4-7 were selected from many executions run by the authors. The basic code may even produce a really accurate clustering output, or it may end up with totally unexpected clusters because of local minima in the objective function. Its number of iterations, however, is never less than twice the number of iterations needed with the proposed algorithm.

Figure 6. PMU3 data clustered with conventional k-Means.

Figure 8 presents a complicated scenario case that has been clustered using the proposed algorithm. This clustering result was obtained in six iterations, and the same cluster regions and number of iterations were obtained in every execution. The basic algorithm, on the other hand, produces significantly different cluster regions in each execution, with critical mistakes in many runs, such as categorizing two cluster knots into a single cluster. The basic code took between twenty-one and forty-eight iterations, with an average of thirty-four over all runs, which is very high compared with the six iterations of the proposed algorithm.

5 CONCLUSIONS

This article presented a newly developed algorithm that pre-analyzes the data and calculates the initial centroids for k-Means clustering. k-Means clustering is very popular in many applications, but each application's data have their own traits. Basic k-Means clustering simply uses randomly chosen data points as initial centroids, which results in unstable and inaccurate clusters and large numbers of iterations. The proposed algorithm takes advantage of the characteristics of power system PMU data to evaluate the scales of the data and calculate the most efficient initial centroids. This approach significantly reduces the number of iterations and stabilizes the clustering output; that is, the cluster regions no longer vary between executions. The algorithm can be used in the CPU for power system protection applications, since it is simple, does not add a significant computational burden to the system, and can be considered for fast and transient applications.
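The initialization steps described in Section 3 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code. The paper does not spell out how batches are formed, so the sketch assumes that a batch is a run of consecutive samples whose step length stays below a threshold. The names `initial_centroids`, `step_threshold`, and `merge_dist`, their default values, and the synthetic trace are all hypothetical; only the minimum batch size of 100 points comes from the text.

```python
import numpy as np


def initial_centroids(points, k, step_threshold=0.2, min_batch=100, merge_dist=1.0):
    """Sketch of batch-based centroid initialization (assumptions noted inline)."""
    points = np.asarray(points, dtype=float)
    # Step length between consecutive samples (assumed Euclidean).
    d = np.linalg.norm(np.diff(points, axis=0), axis=1)
    # ASSUMPTION: a step above step_threshold ends one batch and starts the next.
    cuts = np.flatnonzero(d > step_threshold) + 1
    batches = np.split(points, cuts)
    # Drop small batches; the text treats batches under 100 points as noise.
    batches = [b for b in batches if len(b) >= min_batch]
    # Merge batches whose average points are close (merge_dist is assumed).
    merged = []
    for b in batches:
        for j, m in enumerate(merged):
            if np.linalg.norm(m.mean(axis=0) - b.mean(axis=0)) < merge_dist:
                merged[j] = np.vstack([m, b])
                break
        else:
            merged.append(b)
    # Keep the k largest batches; their centers of mass are the initial centroids.
    merged.sort(key=len, reverse=True)
    if len(merged) < k:
        raise ValueError("fewer than k batches survived filtering")
    return np.array([b.mean(axis=0) for b in merged[:k]])


# Synthetic trace: three steady states joined by short transients,
# returning to the second state at the end (loosely like C-1, C-2, C-3, C-2).
rng = np.random.default_rng(1)
c1, c2, c3 = [0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 50.0, 0.0]


def steady(center, n):
    return np.asarray(center) + 0.01 * rng.standard_normal((n, 3))


trace = np.vstack([
    steady(c1, 500), np.linspace(c1, c2, 10),
    steady(c2, 500), np.linspace(c2, c3, 10),
    steady(c3, 500), np.linspace(c3, c2, 10),
    steady(c2, 500),
])
centroids = initial_centroids(trace, k=3)
```

The returned array can be passed as the starting centroids to any k-Means implementation that accepts user-supplied initial centers. Because nothing in the procedure is random, repeated calls on the same data give the same centroids, which is the property the paper relies on for stable clustering.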

Figure 7. PMU3 data clustered with proposed algorithm.

Figure 8. PMU data clustered using proposed algorithm.

REFERENCES

[1] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, "An efficient k-means clustering algorithm: Analysis and implementation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 881-892, Jul. 2002.
[2] H. Xiong, J. Wu, and J. Chen, "K-means clustering versus validation measures: A data-distribution perspective," IEEE Trans. Syst., Man, Cybern. B, vol. 39, no. 2, pp. 318-331, Apr. 2009.
[3] G. A. Jiménez-Estévez, L. S. Vargas, and V. Marianov, "Determination of feeder areas for the design of large distribution networks," IEEE Trans. Power Del., vol. 25, no. 3, pp. 1912-1922, Jul. 2010.
[4] M. J. Li, M. K. Ng, Y.-M. Cheung, and J. Z. Huang, "Agglomerative fuzzy k-Means clustering algorithm with selection of number of clusters," IEEE Trans. Knowl. Data Eng., vol. 20, no. 11, pp. 1519-1534, Nov. 2008.
[5] G. F. Tzortzis and C. L. Likas, "The global kernel k-means algorithm for clustering in feature space," IEEE Trans. Neural Netw., vol. 20, no. 7, pp. 1181-1194, Jul. 2009.
[6] A. Arvani and V. S. Rao, "Detection and protection against intrusion on smart grid systems," Int. J. Cyber-Sec. Digit. Forensics, vol. 3, no. 1, pp. 34-38, Apr. 2014.