Phase Identification in Distribution Systems by Data Mining Methods

F. Ni, J. Q. Liu, F. Wei, C. D. Zhu
Energy Internet Division
Tellhow Sci-Tech Co., Ltd.
Shanghai, China
E-mail: [fei.ni; junqi.liu; feng.wei; changdong.zhu]@meinergy.cn

S. X. Xie
Department of Electrical Engineering
Eindhoven University of Technology
Eindhoven, the Netherlands
Email: s.xie@student.tue.nl

Abstract—Data mining is a statistical approach for extracting useful information from very large sets of raw data. Data mining methods are under vigorous development and are commonly used in artificial intelligence fields such as image processing and robotics. Recently, data mining has also been applied in the electric power industry for tasks such as classification, clustering and forecasting. In this work, clustering techniques are adopted to identify the phase connectivity in power systems. Phase identification is studied using smart meter data obtained from end-users on a low-voltage (LV) feeder. Firstly, the LV network is modelled in the simulation tool OpenDSS. Secondly, the phase identification algorithm for the LV network is developed in Matlab using K-means clustering as well as Gaussian Mixture Model (GMM) clustering. Finally, the IEEE European Low Voltage Test Feeder is used to verify the proposed method. Results indicate that both methods achieve the goal of phase identification, namely to identify precisely the active loads and the phase to which each load is connected.

Keywords—Phase identification, distribution system, data mining, clustering, K-means, GMM.

I. INTRODUCTION

From personal computers to lighting systems, from high-voltage transmission grids to low-voltage distribution systems, three-phase systems are widely adopted in modern society. Typically, modern power systems have three voltage levels: high-voltage (HV), medium-voltage (MV) and low-voltage (LV). The HV grid is mainly used for high-capacity, long-range transmission, whereas the MV and LV grids serve regional distribution and household supply, respectively. Of these three levels, the LV grid is the one that matters most in people's daily life, and it is therefore the focus of this paper.

In typical LV distribution networks, electricity reaches the households via a step-down transformer that converts the voltage to 230 V / 50 Hz, either single-phase or three-phase. By the phasor principle of the three-phase system, the three phase currents of an ideal balanced system sum to 0 A. In practice this is not always the case, as system unbalances occur frequently. These unbalances may lead to negative consequences such as reduced asset lifetime, decreased operational efficiency or even damage to the overloaded phase coils. Hence, identifying the phase connections in LV networks is crucial to prevent such consequences.

Traditionally, the phase connection layout of a network can be determined by manual inspection or by signal injection. These methods are normally used in relatively small-scale electrical networks. With the growing demand for utility power, however, they become inefficient for today's large-scale networks.

The introduction of automatic meter reading (AMR) systems, in which sensors and smart meters are embedded in the household grid, allows electrical engineers to monitor unbalance across the whole network, and thus to control and predict household demand. A new non-intrusive and low-cost approach is needed to keep pace with the rapid growth of modern smart grid networks.

II. PROBLEM FORMULATION

In general, the phase identification procedure in this paper is carried out in four steps:

i. Model the LV network (the IEEE LV test feeder) in the simulation tool OpenDSS.
ii. Extract the simulated data from OpenDSS using Matlab and apply the K-means and GMM clustering algorithms to place the simulated data into three clusters.
iii. Perform classification to identify the three clusters and label them precisely based on the a-priori information from the transformer side.
iv. Verify the consistency a posteriori by comparing the labelled clusters with the correct information provided by the IEEE LV test feeder manual.

It is noteworthy that the time series of voltage magnitudes on the transformer side and on the household side are the only data required to perform the above procedure.

III. METHODOLOGY

A. Power flow problem

The clustering method used in this paper requires the time-varying voltage magnitudes as input for the identification algorithm, whereas only load profiles are available. Thus, a power flow calculation is used to obtain the voltages from the loads.


The load buses are modelled as PQ buses for the sake of clarity, meaning that the active power P and reactive power Q injections are fixed at each time point. Hence, the active power P and the reactive power Q measured by the smart meters are used as the inputs of the power flow simulation.

One way to relate the specified loads to the voltages is through the bus-admittance matrix, which expresses the relationship between the currents and the voltages in the power system.

By Kirchhoff's Current Law and nodal analysis, the current injection at bus $k$ and the related voltages can be defined as [1]:

$$I_k = V_k Y_{kG} + \sum_{m \neq k} \frac{V_k - V_m}{Z_{km}} \tag{1}$$

where $V_k$ and $V_m$ are the voltages on buses $k$ and $m$, respectively, $Y_{kG}$ represents the sum of admittances connected at bus $k$ to ground, and $Z_{km}$ is the impedance between buses $k$ and $m$, which can also be represented by the admittance $Y_{km}$:

$$Z_{km} = -\frac{1}{Y_{km}} \tag{2}$$

Moreover, $Y_{kG}$, the admittance connected at bus $k$ to ground, is related to the self-admittance $Y_{kk}$:

$$Y_{kk} = Y_{kG} + \sum_{m \neq k} \frac{1}{Z_{km}} \tag{3}$$

With (3), the bus-admittance matrix $[Y]$ for an $n$-bus system can be defined as [1]:

$$\begin{bmatrix} I_1 \\ I_2 \\ \vdots \\ I_n \end{bmatrix} = \begin{bmatrix} Y_{11} & Y_{12} & \cdots & Y_{1n} \\ Y_{21} & Y_{22} & \cdots & Y_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ Y_{n1} & Y_{n2} & \cdots & Y_{nn} \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \\ \vdots \\ V_n \end{bmatrix} \tag{4}$$

Based on nodal analysis, (4) can be summarized as:

$$I_k = \sum_{m=1}^{n} Y_{km} V_m, \quad \text{where } Y_{km} = G_{km} + jB_{km} \tag{5}$$

Similarly, the complex conjugate of (5) can be written as:

$$I_k^* = \sum_{m=1}^{n} Y_{km}^* V_m^* = \sum_{m=1}^{n} (G_{km} - jB_{km}) V_m^* \tag{6}$$

Then the PQ buses can be specified as:

$$P_k + jQ_k = V_k I_k^* = \sum_{m=1}^{n} \left[ (G_{km} - jB_{km}) (V_k V_m^*) \right] \tag{7}$$

In power systems, the Newton-Raphson (N-R) procedure is commonly used to solve the power flow equations because of its fast convergence. For typical PQ buses, the power flow equations can be summarized as:

$$\begin{cases} P_k = G_{kk} V_k^2 + V_k \displaystyle\sum_{\substack{m=1 \\ m \neq k}}^{n} V_m \left( G_{km} \cos\theta_{km} + B_{km} \sin\theta_{km} \right) \\[2ex] Q_k = -B_{kk} V_k^2 + V_k \displaystyle\sum_{\substack{m=1 \\ m \neq k}}^{n} V_m \left( G_{km} \sin\theta_{km} - B_{km} \cos\theta_{km} \right) \end{cases} \tag{8}$$

where $V_k$ and $V_m$ denote voltage magnitudes and $\theta_{km} = \theta_k - \theta_m$ is the difference between the voltage phase angles at buses $k$ and $m$, obtained by writing the complex voltages in exponential (Euler) form.

B. Simulation Software

OpenDSS

The Open Distribution System Simulator (OpenDSS) is an open-source simulation tool developed by the Electric Power Research Institute (EPRI) for electric utility power distribution systems [2]. One of its most important features is that it can solve radial feeders as well as the power flow equations given in the previous section by applying the Newton-Raphson method, so the power flow results can be obtained easily. Since it is open-source, users can adapt it to their needs, and external software such as Matlab can drive it directly through the Component Object Model (COM) interface.

Matlab

In this paper, Matlab is used for data analysis. To load the results of the OpenDSS model into Matlab, the COM server interface first retrieves the numerical results solved by OpenDSS and then passes the data to Matlab. The data can then be post-processed for higher-level analysis such as clustering.

In short, the interaction between OpenDSS, Matlab and COM is depicted in Fig. 1.

Fig. 1. Proposed interface between OpenDSS and Matlab via COM [3].

As Fig. 1 shows, data are passed from OpenDSS to the COM interface, and Matlab loads them through that interface. Matlab users can thus access all power flow quantities and results via the COM interface for further data processing and analysis, including the time series of the three-phase voltage magnitudes on the transformer side and the voltage magnitudes of the 55 active profiles on the load side.
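
To make the formulation of Section III-A concrete, the following Python sketch builds the bus-admittance matrix of a two-bus system and solves for the voltage at a single PQ load bus. It is only an illustration under stated assumptions: the per-unit line impedance, the 0.5 p.u. load and the function names `build_ybus` and `solve_pq_bus` are hypothetical, and a simple fixed-point (Gauss-Seidel) iteration is used in place of the Newton-Raphson procedure because it needs no Jacobian. It does not use OpenDSS or its COM interface.

```python
import math

def build_ybus(z_line, y_shunt=(0j, 0j)):
    """Bus-admittance matrix of a two-bus system, following (2) and (3)."""
    y_line = 1 / z_line
    y00 = y_shunt[0] + y_line   # self-admittance of bus 0, eq. (3)
    y11 = y_shunt[1] + y_line   # self-admittance of bus 1, eq. (3)
    y01 = -y_line               # off-diagonal element, eq. (2)
    return [[y00, y01], [y01, y11]]

def solve_pq_bus(ybus, v_slack, s_inj, tol=1e-10, max_iter=200):
    """Voltage at PQ bus 1 by fixed-point iteration (not Newton-Raphson).

    s_inj is the complex injection P + jQ at bus 1 (negative for a load).
    """
    v = v_slack  # flat start: begin from the source voltage
    for _ in range(max_iter):
        # (7) gives I1 = conj(S1 / V1); (5) gives I1 = Y10*V0 + Y11*V1.
        # Solving the two for V1 yields the update below.
        v_new = (s_inj.conjugate() / v.conjugate()
                 - ybus[1][0] * v_slack) / ybus[1][1]
        if abs(v_new - v) < tol:
            return v_new
        v = v_new
    raise RuntimeError("power flow did not converge")

# Illustrative per-unit data (assumed values, not taken from the test feeder)
ybus = build_ybus(0.01 + 0.03j)
p_load = 0.5
q_load = p_load * math.tan(math.acos(0.95))  # reactive power at power factor 0.95
s_inj = -complex(p_load, q_load)             # a load draws power: negative injection
v_load = solve_pq_bus(ybus, 1.0 + 0j, s_inj)
print("load-bus voltage magnitude (p.u.):", abs(v_load))
```

Repeating such a calculation for every time step of a load profile yields a voltage-magnitude time series per load, which is the kind of quantity the clustering step takes as input.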

C. Clustering

Clustering is a mathematical technique for grouping a set of objects so that objects with similar features are assigned to the same group [4]. Clustering, a form of unsupervised learning, is also a common method in data mining. Owing to the large amount of data stored in the voltage matrices, clustering is a useful way to simplify the problem. K-means and the Gaussian Mixture Model (GMM) are two commonly used clustering methods.

K-means clustering

For the K-means clustering algorithm, different input feature samples lead to different clustering results. The details of choosing the best training samples can be found in Section V.

Besides the feature samples themselves, it is also important to extract from them the labels that are relevant to phase identification. To obtain the labels of the data points, a clustering algorithm is needed that finds the potential label of every sample and groups samples with similar features under the same label, so that the profiles are clustered into three groups. Three groups are used because each load profile is connected to one of the three phases from the substation. Among clustering algorithms, K-means can be regarded as the easiest and fastest method.

Typically, in K-means clustering, the training samples are denoted as $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, $x^{(m)} \in \mathbb{R}^{1 \times n}$. K-means clustering can be carried out via the following general procedure [5]:

K-MEANS CLUSTERING ALGORITHM

1. Divide the data points into $k$ clusters and randomly initialize the cluster centroids $\mu_1, \mu_2, \dots, \mu_k \in \mathbb{R}^n$.
2. For each sample $i$, calculate the cluster it belongs to by:

$$c^{(i)} := \arg\min_{j} \left\| x^{(i)} - \mu_j \right\|^2 \tag{9}$$

3. For each cluster $j$, calculate its corresponding centroid:

$$\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}} \tag{10}$$

4. Repeat Steps 2-3 until $\mu_j$ converges.

where $k$ is the number of clusters, which has to be known a priori, $c^{(i)}$ is the index of the centroid closest to sample $i$ in Euclidean distance, and $\mu_j$ is the updated centroid, obtained by averaging all samples $i$ assigned to cluster $j$.

K-means has one drawback: although it converges quickly to a local optimum, it does not guarantee convergence to the global optimum [6]. Therefore, the cluster results are not very stable when compared to GMM clustering.

GMM clustering

GMM definition

Another powerful and commonly used data mining method is GMM. When dealing with data in multiple dimensions, the univariate Gaussian distribution has to be generalized to the multivariate Gaussian distribution [4]:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{|2\pi\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) \tag{11}$$

where $\mu$ is the mean and $\Sigma$ is the covariance of the Gaussian distribution. As in K-means, the training samples are denoted as $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$ with $x \in \mathbb{R}^n$.

In GMM, the mixture model is regarded as a combination of multivariate Gaussian distributions:

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{12}$$

where $K$ is the number of mixture components and $\pi_k$ is the mixing coefficient, which satisfies $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$.

Latent variable

For every data point $x$ in GMM, there is an associated latent variable $z_k$, which is unobservable. The relation between $z_k$ and $\pi_k$ is:

$$p(z_k = 1) = \pi_k \tag{13}$$

Then, by using Bayes' theorem and joint probability theory, (12) can be rewritten as:

$$p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k} \tag{14}$$

In order to solve (14), the expectation maximization (EM) algorithm is introduced [4]:

EXPECTATION MAXIMIZATION ALGORITHM

1. Initialize the values of $\mu_k$, $\Sigma_k$ and $\pi_k$, then evaluate the potential value of the latent variable $z_k$.
2. Evaluate the probability of every component by the following equation (E-step):

$$\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)} \tag{15}$$

3. Estimate the new values of $\mu_k$, $\Sigma_k$ and $\pi_k$ (M-step):

$$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n \tag{16}$$

$$\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\text{new}}) (x_n - \mu_k^{\text{new}})^T \tag{17}$$

$$\pi_k^{\text{new}} = \frac{N_k}{N} \tag{18}$$

where

$$N_k = \sum_{n=1}^{N} \gamma(z_{nk}) \tag{19}$$

4. Finally, evaluate the log likelihood:

$$\ln p(x \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{20}$$

5. Repeat Steps 2-4 until $\ln p(x \mid \mu, \Sigma, \pi)$ converges.

D. Identification

After applying the K-means and GMM algorithms, the active profiles on the load side are grouped into three clusters. However, the phase of each cluster is not yet known, since the K-means and EM algorithms cannot label the clusters based solely on the data from the load side.

Therefore, a proper labelling or identification algorithm has to be developed and applied to the three clusters.

Fig. 2. Flowchart of the proposed phase identification algorithm.

The flowchart of the proposed phase identification algorithm is depicted in Fig. 2. The algorithm starts from an a-priori assumption: three randomly chosen profiles from the clusters are assigned phase labels. The same labels are then applied to the similar profiles, and the corresponding correct rate is calculated. A rate of 100% means that the identification is complete; otherwise, a new assumption for the three initial profiles has to be made. The correct rate is calculated from the posterior verification.

IV. TESTS AND RESULTS

In this section, the performance of the K-means and GMM clustering methods is tested. In addition, the application of the phase identification algorithm to the IEEE LV test feeder is investigated.

A. Case study: IEEE LV Test Feeder

In this work, the performance of the proposed method is tested on the IEEE European Low Voltage Test Feeder. It is a typical European three-phase low-voltage (LV) feeder: a radial distribution system with a base frequency of 50 Hz. A transformer at the substation steps the (phase-to-phase) voltage of the main feeder and laterals down from 11 kV to 416 V. The feeder has 906 buses and 55 residential loads in total. Each load has a single-phase connection to the grid. Fig. 3, which is based on the information given in the manual of the IEEE LV test feeder, shows the numbers of the 55 profiles and their coordinates on the map of the LV network.

The external medium-voltage (MV) grid is modelled as a voltage source with an impedance; data for the transformer, lines and loads are given in [7].

Fig. 3. One-line diagram of the IEEE European low voltage test feeder.

In addition to the network model of the IEEE LV test feeder, 55 load profiles with a 1-minute resolution over a 24-hour period are available. In order to identify the phase allocation of each smart meter, the voltage profiles of the 55 end-users and of the LV side of the MV/LV transformer are calculated by time-series power flow in OpenDSS. For simplicity, the power factor of all loads is set to 0.95 over the entire simulation horizon.

The time series of voltage magnitudes at the end-users is stored in a $1440 \times 55$ matrix. Similarly, the three-phase voltage magnitudes at the transformer side are stored in a $1440 \times 3$ matrix.

In the next subsections B and C, the results of identifying the phase connections of the 55 loads by K-means clustering and GMM clustering are presented.
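
As a rough illustration of how the K-means procedure of Section III-C, eqs. (9) and (10), operates on data laid out like the voltage matrix above, the following Python sketch treats each load's voltage time series as one sample. Everything specific here is an assumption made for the example: the six short per-unit profiles are invented, the function name `kmeans` is hypothetical, and the centroids are seeded from three chosen samples instead of randomly so that the run is reproducible. It is not the Matlab implementation used in this paper.

```python
def kmeans(samples, centroids, max_iter=100):
    """Plain K-means: alternate the assignment (9) and centroid update (10)."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Step 2, eq. (9): index of the closest centroid (squared Euclidean)
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(x, centroids[j])))
            for x in samples
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the centroids have converged
        labels = new_labels
        # Step 3, eq. (10): each centroid becomes the mean of its members
        for j in range(len(centroids)):
            members = [x for x, c in zip(samples, labels) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids

# Invented voltage profiles (p.u.): one row per load, one column per time step
profiles = [
    [1.00, 0.990, 0.98, 0.99],
    [1.00, 0.985, 0.98, 0.99],
    [0.97, 1.000, 1.01, 0.98],
    [0.97, 1.005, 1.01, 0.98],
    [1.02, 1.020, 0.96, 0.95],
    [1.02, 1.015, 0.96, 0.95],
]
# Deterministic seeding (an assumption; the algorithm text uses random seeds)
seeds = [profiles[0], profiles[2], profiles[4]]
labels, centers = kmeans(profiles, seeds)
print(labels)
```

The cluster indices returned here carry no phase meaning by themselves, which is why the separate identification step of Section III-D is still required.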
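
The EM iteration of Section III-C, eqs. (15) to (20), can be sketched in the same spirit. The Python example below is restricted to one-dimensional data so that the covariance reduces to a scalar variance. The data values, the starting parameters, the function names and the small variance floor are all assumptions introduced for illustration; they are not taken from the paper's Matlab implementation.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density, the one-dimensional case of (11)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, means, variances, weights):
    """Log likelihood of the mixture, eq. (20)."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for m, v, w in zip(means, variances, weights)))
        for x in data)

def em_gmm_1d(data, means, variances, weights, tol=1e-9, max_iter=500):
    means, variances, weights = list(means), list(variances), list(weights)
    k, n = len(means), len(data)
    prev = log_likelihood(data, means, variances, weights)
    gamma = []
    for _ in range(max_iter):
        # E-step, eq. (15): gamma[i][j] is the responsibility of
        # component j for sample i
        gamma = []
        for x in data:
            num = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                   for j in range(k)]
            total = sum(num)
            gamma.append([g / total for g in num])
        # M-step, eqs. (16)-(19)
        for j in range(k):
            n_j = sum(g[j] for g in gamma)                        # eq. (19)
            means[j] = sum(g[j] * x
                           for g, x in zip(gamma, data)) / n_j    # eq. (16)
            var = sum(g[j] * (x - means[j]) ** 2
                      for g, x in zip(gamma, data)) / n_j         # eq. (17)
            variances[j] = max(var, 1e-6)  # floor: added safeguard, an assumption
            weights[j] = n_j / n                                  # eq. (18)
        cur = log_likelihood(data, means, variances, weights)
        if abs(cur - prev) < tol:  # stop when the log likelihood settles
            break
        prev = cur
    return means, variances, weights, gamma

# Invented one-dimensional feature values forming three groups
data = [-2.1, -2.0, -1.9, -0.1, 0.0, 0.1, 1.9, 2.0, 2.1]
means, variances, weights, gamma = em_gmm_1d(
    data, means=[-1.5, 0.2, 1.5], variances=[1.0] * 3, weights=[1 / 3] * 3)
# Hard labels: the most responsible component from the last E-step
gmm_labels = [max(range(3), key=lambda j: g[j]) for g in gamma]
print(means, gmm_labels)
```

Unlike K-means, the result is a soft assignment: each sample receives a responsibility for every component, and a hard cluster label is obtained only by taking the largest one.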

B. K-means clustering

Fig. 4 compares the sample features before and after applying the K-means clustering algorithm. The Euclidean distance is applied here; however, other distances such as City Block, Cosine or Correlation can also be used.

Fig. 4. Comparison before (left) and after (right) applying the K-means algorithm.

Furthermore, the performance of K-means with the commonly used distances [8] is shown in Table I, where each distance corresponds to a unique clustering method. The K-means algorithm with the sqeuclidean distance performs better than the others, which indicates that the selection of the distance is an important issue for clustering.

TABLE I. K-MEANS CLUSTER RESULTS WITH DIFFERENT DISTANCE CALCULATIONS

| DISTANCE NAME | CLUSTER 1 | CLUSTER 2 | CLUSTER 3 |
|---|---|---|---|
| correlation | 22 | 14 | 19 |
| cityblock | 17 | 20 | 18 |
| cosine | 17 | 15 | 23 |
| sqeuclidean | 21 | 19 | 15 |
| Reference | 21 | 19 | 15 |

C. GMM clustering

The result of applying the GMM clustering algorithm is depicted in Fig. 5. The 55 time series are clearly divided into three clusters; the circles around the clusters are the contours of the Gaussian distributions calculated by the EM algorithm.

Fig. 5. Comparison before (left) and after (right) applying the GMM algorithm.

D. Phase identification

Once the phase identification algorithm has finished, the results for both the K-means and GMM clustering algorithms can be tabulated as in Table II. Comparing the results obtained from K-means and GMM with the reference shows that the two methods converged and obtained the same results, i.e., 21 profiles in Phase A, 19 in Phase B and 15 in Phase C.

TABLE II. DATA MINING RESULTS OF IEEE LV TEST FEEDER

| ALGORITHM | PROF. IN PHASE A | PROF. IN PHASE B | PROF. IN PHASE C | CORRECT RATE |
|---|---|---|---|---|
| K-means and identification | 21 | 19 | 15 | 100% |
| GMM and identification | 21 | 19 | 15 | 100% |

Furthermore, using the same coordinates as in Fig. 3, the geographical locations of these 55 profiles are shown in Fig. 6.

Fig. 6. Geographical locations of the IEEE LV test feeder with phase information of 55 loads.

V. DISCUSSION

A. Descriptive features

When applying the K-means and GMM algorithms to the phase identification problem, it is difficult to identify the optimal input features. In descriptive statistics, seven indicators are commonly used to describe the features of data: maximum ($Max$), minimum ($Min$), mean ($\mu$), mode ($Mo$), median ($Md$), standard deviation ($\sigma$) and variance ($\sigma^2$). However, the clustering results differ depending on which of these seven indicators are supplied to the K-means and GMM algorithms.

B. Optimal feature

As tabulated in Table III and Table IV, the seven typical descriptive features have been applied as the input for the K-means and GMM clustering algorithms, respectively. Regardless of how many of the seven indicators are used as input, none of the combinations yields satisfactory results. However, there are two cases with a correct rate over 92% for both the K-means and GMM algorithms, that is,

$$X = [V_{node}(1{:}50, :)]' \tag{21}$$

which is the truncated time series of the voltage magnitudes of the end-users. It is noteworthy that the first 50 rows of $V_{node}$ are of great importance for achieving a high accuracy with both the K-means and GMM clustering algorithms.

TABLE III. CLUSTERING RESULTS BY K-MEANS WITH DIFFERENT FEATURES

| FEATURES | NUM. IN CLUSTER 1 | NUM. IN CLUSTER 2 | NUM. IN CLUSTER 3 | ACCURACY |
|---|---|---|---|---|
| $X = [\mu; Md]'$ | 6 | 36 | 13 | 36.3636% |
| $X = [\mu; Md; Mo]'$ | 15 | 27 | 13 | 85.454% |
| $X = [\mu; Md; Mo; \sigma]'$ | 22 | 14 | 19 | 76.3636% |
| $X = [\mu; Md; Mo; \sigma; Max; Min; \sigma^2]'$ | 18 | 9 | 28 | 52.7273% |
| $X = [V_{node}(1{:}50, :)]'$ | 19 | 21 | 15 | 96.3636% |
| $d_i = \mathrm{diff}(V_{node}(1{:}50, :))$, $X = [d_i]'$ | 21 | 19 | 15 | 100% |

*X is the input for the K-means algorithm; Vnode is the voltage matrix on the load side.

TABLE IV. CLUSTERING RESULTS BY GMM WITH DIFFERENT FEATURES

| FEATURES | NUM. IN CLUSTER 1 | NUM. IN CLUSTER 2 | NUM. IN CLUSTER 3 | ACCURACY |
|---|---|---|---|---|
| $X = [\mu; Md]'$ | 19 | 17 | 19 | 70.9091% |
| $X = [\mu; Md; Mo]'$ | 19 | 17 | 19 | 70.9091% |
| $X = [\mu; Md; Mo; \sigma]'$ | 18 | 17 | 20 | 70.9091% |
| $X = [\mu; Md; Mo; \sigma; Max; Min; \sigma^2]'$ | 20 | 14 | 21 | 65.4545% |
| $X = [V_{node}(1{:}50, :)]'$ | 17 | 23 | 15 | 92.7273% |
| $d_i = \mathrm{diff}(V_{node}(1{:}50, :))$, $X = [d_i]'$ | 21 | 19 | 15 | 100% |

*X is the input for the GMM algorithm; Vnode is the voltage matrix on the load side.

On the other hand, based on Kirchhoff's Voltage Law, the voltage of a node is time-dependent [9], so the voltage magnitudes of adjacent time steps should be correlated. Furthermore, it is noted that the voltage amplitudes are more stable during the first 50 time steps, which fall in the night hours from 00:00 to 00:50.

VI. CONCLUSION

In recent years, smart meter technologies have become widely used in the electrical system. As a result, a huge amount of data can be collected from the data center or other parties. Data mining techniques therefore offer an efficient way to extract useful information from such large-scale datasets [10].

This paper has introduced two data mining algorithms, K-means clustering and GMM clustering, into power systems. Their performance has been tested on the IEEE LV test feeder with real datasets. Both methods used the time series of voltage magnitudes at the transformer side and at the 55 end-users that have load profiles. Firstly, the required features, such as the voltage magnitudes of the loads and the transformer, were calculated; the selected features were then processed by the K-means and GMM clustering algorithms. Both clustering methods converge and achieve good results compared with the reference. Therefore, the proposed data mining methods are promising candidates for further use in power systems.

However, the optimal features that both clustering algorithms need as input are the first 50 real-time measurements of the voltage profiles, which become bloated for a power network with many end-users: they give rise to a matrix of dimension $N_e \times 50$ ($N_e$ is the number of end-users), which is relatively expensive to process when $N_e$ becomes greater than 100.

Therefore, future work on both the K-means and GMM clustering algorithms might focus on searching for more efficient features.

REFERENCES

[1] N. Mohan, Electric power systems. Hoboken, N.J.: John Wiley & Sons, 2012.
[2] Misa, Ritam, "Impact of Plug-In Electrical Vehicles and Wind Generators on Harmonic Distortion of Electric Distribution Systems," Master's Thesis, Michigan Technological University, 2014.
[3] Meghasai, S. Monger, R. Vega, H. Krisnaswami, "Simulation of Smart Functionalities of Photovoltaic Inverters by Interfacing OpenDSS and Matlab," The University of Texas at San Antonio, Thesis, 1588356, p. 70, 2015.
[4] C. Bishop, Pattern recognition and machine learning. New York: Springer, 2006, pp. 430-439.
[5] A. Ng. CS 229. Class Lecture, Topic: "Unsupervised Learning, k-means clustering." Dept. of Computer Science, Stanford University, California, CA, spring 2015.
[6] S. Z. Selim and M. A. Ismail, "K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-6, no. 1, pp. 81-87, Jan. 1984.
[7] IEEE PES. (2016, Feb.) Distribution test feeders. [Online]. Available: http://www.ewh.ieee.org/soc/pes/dsacom/testfeeders/index.html
[8] Arthur, David, and Sergi Vassilvitskii. "K-means++: The Advantages of Careful Seeding." SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. 2007, pp. 1027–1035.
[9] T. Jones and N. Nenadic, Electromechanics and MEMS. Cambridge University Press Textbooks, 2013, pp. 11-12.
[10] X. Lu, "Data Mining Techniques in Power Quality Analysis," TU/e, Eindhoven, 2015.
