Article in IEEE Transactions on Smart Grid · December 2020
DOI: 10.1109/TSG.2020.3046602


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSG.2020.3046602, IEEE
Transactions on Smart Grid

Real-time Synchrophasor Data Anomaly Detection and Classification using Isolation Forest, KMeans and LoOP

E. Khaledian, Student Member, IEEE, S. Pandey, Student Member, IEEE, P. Kundu, Member, IEEE, and A. K. Srivastava, Senior Member, IEEE

Abstract—Power grid operators assess situational awareness using time-tagged measurements from phasor measurement units (PMUs) placed at multiple locations in a network. However, synchrophasor measurements are prone to anomalies which may impact the performance of phasor-based applications. Anomalies include any deviation from expected measurements resulting from power system events or bad data. Bad data include data errors or loss of information due to failures in the supporting synchrophasor cyber infrastructure. It is necessary to flag bad data before utilizing it for an application. This work proposes a tool for the detection and classification of anomalous data using an unsupervised stacked ensemble learning algorithm. The proposed synchrophasor anomaly detection and classification (SyADC) tool analyzes a selected window of data points using a combination of three unsupervised methods, namely isolation forest, KMeans and LoOP. The method classifies the data as anomalies or normal data with more than 99% recall. The method also provides a probability of the data being an event or bad data with more than 99% recall. Results for the IEEE 14 and 68 bus systems with synchrophasor data obtained using a Real-Time Digital Simulator, and data from industrial PMUs, highlight the superiority of the algorithm in detecting and classifying anomalies.

Index Terms—Phasor Measurement Units (PMU), Kalman Filter, Isolation Forest, Anomaly Detection, Event Detection.

E. Khaledian and A. K. Srivastava are with the School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99163 USA. S. Pandey graduated from Washington State University and is now working with ComEd, Chicago, IL. P. Kundu worked as a research faculty at Washington State University and is now working at the Indian Institute of Technology Mandi, India. The authors would like to acknowledge partial financial support from NSF FW-HTF 1840192. (e-mail: anurag.k.srivastava@wsu.edu)

I. INTRODUCTION

PHASOR measurement units (PMUs) provide time-synchronized measurements, typically at 30-60 samples per second, and can help operators make better decisions [1]–[3]. The supervisory control and data acquisition (SCADA) system obtains measurements from meters, transducers, intelligent electronic devices (IEDs), and similar devices, typically every 4-6 s [4]. PMUs, on the other hand, with time-stamped and fast data rates, can capture the changing power system states in real-time.

Several applications based on PMU measurements have been developed by researchers and implemented in pilot phase. PMUs can be used for faster state estimation with higher accuracy [5], power system restoration [6], islanding detection [7], oscillation monitoring [8], load modeling [9], parameter estimation, event detection [10]–[12], protection system failure diagnosis [13]–[15], system instability detection [16], etc.

The complex infrastructure needed to deliver high-resolution time-stamped data makes the synchrophasor measurements prone to bad data, such as missing data or outliers. Synchrophasor data, with its high reporting rate, tends to capture the power system events occurring in the system, called event data [17]. Event data can be faults in the system, switching operations, load changes, generator drop, and other such power system events. If fed with low-quality or erroneous data, applications might produce misleading results. Therefore, it is pertinent to detect and classify an anomaly into bad data or event data to aid and improve the performance of synchrophasor-based applications.

The anomalies in PMU data can be modeled as outliers in time-series data (i.e., a sequence of data points). Identification of outliers in PMU time-series data is proposed in [18]. A Kalman filter-based algorithm for conditioning of synchrophasor data is developed in [19]. Several clustering-based methods for detecting bad data in PMUs have been proposed in recent times [20], [21]. Statistical, clustering and unsupervised machine learning-based methods to detect bad data have been developed in [22]. Further, [23]–[25] provide more insight into power system event detection and recovery of PMU signals. Historical-data-based classification of faults and events is proposed in [26], and event detection is studied in [27]. The accuracy, in terms of true and false detection of poor data quality, can be questioned for the above methods. Also, these methods are unable to differentiate between bad data and power system event data, and fail to identify errors that can originate due to faults in the phasor data concentrator (PDC). PDCs collect data from multiple PMUs and typically transmit it to control centers or applications. PDCs are pre-configured to work reliably [28]. However, there can be instances when a PDC malfunctions due to a logical or hidden failure, or due to congestion in the communication network delivering data from a specific PDC [29]. In some instances, cyber-attacks can compromise a PDC [30].

In this work, real-time detection of data anomalies in PMU measurements and their classification into bad data, event data, and PDC error is proposed. The proposed tool, synchrophasor anomaly detection and classification (SyADC), uses an ensemble technique. An unsupervised machine learning method, which combines fast and accurate outlier detection algorithms such as isolation forest and local outlier probability,

1949-3053 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Washington State University. Downloaded on December 26,2020 at 03:54:41 UTC from IEEE Xplore. Restrictions apply.
is used. The proposed method detects the anomalies with a recall (true positive rate) of 99%. Furthermore, the correlation between the PMUs is computed to detect the events and PDC errors, thus increasing the algorithm's accuracy. The significant contributions of this work are summarized as follows:
1) Real-time anomaly detection using unsupervised machine learning, with high accuracy and a limited memory bound.
2) Classification of anomalous data into bad data and power system events using data correlation techniques.
3) Detection of anomalies in PMU data due to PDC errors.

Section II provides an overview of PMU data flow and data anomalies. Section III discusses the effect of data anomalies on applications. The proposed methodology is presented in Section IV. Results and concluding remarks are provided in Section V and Section VI, respectively.

II. PMU DATA FLOW AND CAUSES OF DATA ANOMALIES

The synchrophasor data channel is shown in Fig. 1. The measurements obtained by the current transformers (CT) and potential transformers (PT) are transmitted via communication lines to control center applications (PMU Applications Requirements Task Force (PARTF) Framework [28]).

Fig. 1: Synchrophasor data flow to application

There are potentially four sources of error in PMU data:
• Within the PMU: CT and PT bias is a well-known phenomenon [28] and can cause measurement error. Also, different classes of CT and PT (measurement or protection), differences or errors within the PMU and its filtering algorithms, and wrong or missing time-stamps associated with the measured grid conditions can lead to erroneous data. Individual data points are affected if the time-stamps are missing. The whole data set can be affected if there are biases in the CTs and PTs.
• At data aggregators: The data points aggregate at a PDC before being fed to an application (Fig. 1). The PDC can introduce errors due to misalignment, erroneous compression, creation of duplicate data, or loss of data due to late deliveries. There can also be mislabeling or data losses at the aggregator end.
• In the communication network: The loss of communication network nodes can lead to the loss of data packets. Excess congestion can cause latency and delay delivery beyond the acceptable time window of the PDC. Data corruption can further introduce erroneous values.
• In storage: Several applications run offline analysis, which requires aggregation of data for offline usage. The data storage can also cause discrepancies in the data, resulting in bad results for the application due to inconsistencies in archival, such as mixing of PMU IDs, data drop, or corruption.

These issues might corrupt the PMU data, making it unfit for use in an application. Therefore, pre-processing of the data is required before an application uses it. Data received from PMUs can contain outliers, missing data and event data. The data set shown in Fig. 2 contains four outlier points and two instances of missing data in chunks. It also contains three power system events, seen as transients in the plot. SyADC aims to provide the user of PMU data with real-time information about the quality of the data. This additional information will aid the application's performance. For example, if there is bad data, the application can wait for good data to arrive. Similarly, an application designed to work in a quasi-steady state will benefit if SyADC informs it in real-time that the reported data is event data.

Fig. 2: PMU Data Anomaly Plot

III. SYNCHROPHASOR DATA ANOMALY AT POINT OF USE

The SyADC tool, designed as a data pre-processor, can be plugged in before an application to manage data quality issues and enhance performance. Data quality issues have caused reluctance in moving PMU-based applications to the control center. Each application has a threshold on the amount of anomalous data it can take and still perform within an acceptable limit. SyADC is a real-time algorithm that provides comprehensive information on the data quality and classifies each data point as normal, bad data, event data or PDC error.

PMU data obtained at the control centers can be classified according to the errors associated with it. In Fig. 3, PMU data is classified broadly into anomalous and normal data. If an application is fed with normal data, it performs as expected; whereas, if anomalous data feeds into the application, it performs poorly. Anomalous data is further classified into event data and bad data. Event data are transients caused by faults, generation drop, capacitor bank switching, etc. In contrast, bad data can be missing data, outliers, biases in the magnitude of voltage or current, and other possible errors.

Anomalies in data can come from a single PMU or multiple PMUs. Events affect the data of multiple PMUs across the system. Bad data, most of the time, is confined to a single PMU. However, if bad data occurs across multiple PMUs, it can be an instance of an error in the PDC.

A particular application such as load modeling [9] does not get affected if there is single-point bad data or a frequency


Fig. 3: PMU data content and Application

and rate of change of frequency (ROCOF) error. It rejects any transients that may occur due to an event. However, the event detection algorithm [27] will be affected by a single-point anomaly, as it can be a false indicator of an event in the system, corrupting the algorithm's performance. Any application that uses a windowing approach might be less affected by single-point anomalies, if present, as they may average out. For an application's resilient performance, the data feeding into it should be cleaned/flagged of anomalies.

IV. PROPOSED METHODOLOGY

The flow diagram of the proposed method is shown in Fig. 4. The synchrophasor data window from the central PDC is fed into the ensemble method, where the probability of each data point within the window is computed. If the probability score of each data point remains within the set threshold, it suggests two possibilities: either the data window is a normal quasi-steady-state data window, or it can be a case of missing packet data. If it is missing packet data (all values are default 'NaN' or '0'), PDC error is checked; else the normal data is fed to the applications. If there are points in the data window with probability scores above the set threshold, the window is determined to contain anomalous data. There are three possibilities for anomalous data: i) the data window is dominated by missing data, ii) it is a power system event window, iii) it is a case of bad data. First, missing data is checked by flagging the detected anomalies based on their probability scores and removing them from the rest of the normal data in the window. The average of the remaining data is computed; if the average is less than a set threshold, it is a case of missing data. Then, the flagged anomalies are unflagged, and missing data windows are checked for PDC error. If the average is above the set threshold, the initial data window (without removing anomalies), along with the ensemble step, is sent to the correlation block. In this block, the correlation between different measurements of the same PMUs, as well as the correlation between different PMUs, is checked. If there is an error in the estimation of the voltage phasor, it does not mean that the estimated current phasor will also have a measurement error. PMUs from different vendors follow different techniques to estimate the phasors. The current signal is sampled using current transformers, whereas the voltage signals are measured using potential transformers, so the two are independent of each other. If a correlation is detected, the anomaly is determined to be event data; else, it is a case of bad data. If bad data is detected, it is sent to be checked for PDC error.

Fig. 4: Proposed Methodology

The algorithm is designed to classify the data into three groups: bad data, events and PDC error. To classify in real-time, a machine learning algorithm (Fig. 5) is proposed which uses stacking of ensemble-based methods to increase efficiency. Stacking is an ensemble machine learning technique that takes different machine learning algorithms in the first layer as inputs to a second layer [31]. The stacked method improves the result by combining the algorithms of the first and second layers. It also increases the speed of the process by applying the fastest algorithm to the high-dimensional data. Next, the data is reduced to one dimension, following which another algorithm is applied. In the following sections, the layers are explained in detail.

A. Isolation Forest

Any sudden deviation in a metric from the normal data is anomalous behavior. If the information on the previous

anomalies is available, a supervised learning algorithm can be developed to detect anomalies. However, it is assumed here that there is no feedback on the data, and anomalies are to be identified in real-time. To this end, an unsupervised algorithm, requiring no class label at the training phase, is needed for the proposed tool. Unsupervised anomaly detection techniques do not require labeled training data. These techniques are based on two fundamental assumptions: first, most of the incidents are normal, and only a small percentage is abnormal; second, the outlier is statistically different from normal data. Based on these assumptions, data groups of similar instances that appear frequently are assumed to be normal incidents, and infrequent data groups are considered to be anomalies.

Fig. 5: Stacked learning for Anomaly Detection

The outlier detection algorithms can be divided into distribution-based methods (autoencoders and adversarial feature learning), distance-based methods (self-organizing maps, CMeans, and adaptive resonance theory), and density-based methods (local outlier factor, local outlier probability, DBSCAN and isolation forest (iforest)). The distribution-based approach needs to obtain the distribution model of the data in advance, which depends on the global distribution of the dataset. The distance-based approach requires users to select reasonable distance settings and scale parameters, and is less efficient on high-dimensional datasets. The distance-based and distribution-based outlier detection methods all adopt global anomaly standards to process data objects, and cannot perform well on datasets with uneven distribution. In practical applications, the distribution of data tends to be skewed, and there is a lack of indicators that can classify the data. Even if tagged datasets are available, their applicability to outlier detection tasks is often unknown.

The density-based local outlier detection method can effectively solve the above problems by describing the degree of outlierness of data points, quantified by local density. In this work, iforest is selected as the primary technique because of the following advantages: i) it requires only a small sub-sampled dataset at the training phase, and our goal is to find anomalies for the shortest possible window, ii) it has linear time complexity with high scalability and low memory requirement, iii) it reduces the impact of swamping and masking problems in anomaly detection owing to the ability to build a partial model on a subset of data [32]–[34], and iv) simulation results indicate better performance of the proposed technique even with varied test scenarios considering power system dynamics.

In the first layer, the data is fed to the iforest algorithm [35]. Isolation forest is preferred for the first layer as it has linear time complexity and low memory requirement. In this work, iforest is used to explicitly focus on identifying outliers rather than profiling normal data.

Isolation forest takes advantage of the nature of anomalies, which are less frequent than regular observations and different from them in terms of values, in order to isolate them. In a simplified way, it builds an ensemble of decision trees for a given data set. The decision trees, which are called isolation trees (IT), have properties in common with a binary search tree (BST). Therefore, similar to BSTs, in the ITs, anomalies are those instances that have short average path lengths on the tree. In these trees, partitions are created using a split value between the minima and maxima of a randomly selected feature. The algorithm tries to separate each point in the data. As with other outlier detection methods, an anomaly score is required for decision making.

Let $D = \{d_i, d_{i+1}, d_{i+2}, ..., d_{i+w}\}$ be a data window with size $w$. $D$ contains an $N$-dimensional feature space, $D \subset \mathbb{R}^N$. In this case, each data point $d_i$ contains six features: voltage magnitude (V), voltage angle (VA), current magnitude (I), current angle (IA), frequency (F) and ROCOF (R). Let $F_{ij}$ be the $j$th feature of the $i$th data point.

$d_i = \{F_{i1}, F_{i2}, ..., F_{in}\}$    (1)

If $d$ is an observation in $D$ and $w$ is the sub-sampling size, then the anomaly score can be calculated using (2):

$s(d, w) = 2^{-\frac{E(h(d))}{c(w)}}$    (2)

$c(w) = \begin{cases} 2H(w-1) - \frac{2(w-1)}{w}, & \text{if } w > 2, \\ 1, & \text{if } w = 2, \\ 0, & \text{otherwise,} \end{cases}$    (3)

$H(i) \approx \ln(i) + \gamma$    (4)

where $\gamma \approx 0.5772$ is the Euler–Mascheroni constant.

$E(h(d)) = \frac{1}{T_r} \sum_{i=1}^{T_r} h_i(d)$    (5)

where $T_r$ is the total number of trees, $h(d)$ is the path length of observation $d$, $E(h(d))$ is the average $h(d)$ over the collection of ITs (5), and $c(w)$ is the average path length of an unsuccessful search in a BST with $w$ external nodes (3). In a BST, an external node is one without child branches.

Each observation is given an anomaly score $s$, where $0 \le s \le 1$. A score close to 1 represents an anomaly, while observations are normal if the score is less than 0.5. If there are no distinct anomalies in the data, the scores might all be close to 0.5. For any normal distribution of the data, the anomaly score threshold is 0.5 [35], because when the expected path length $E(h(d))$ is equal to the average path length $c(w)$, then $s$ is equal to 0.5, regardless of the number of observations [35].
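As a concrete illustration of the scoring equations (2)-(5), the sketch below computes $c(w)$ and the anomaly score from a set of per-tree path lengths. Tree construction is omitted, and the function names are ours rather than the paper's; this is only a minimal numerical check of the formulas.

```python
import math

def c(w):
    # Eq. (3): average path length of an unsuccessful BST search with w
    # external nodes; H(i) is approximated by ln(i) + Euler's constant, eq. (4).
    if w > 2:
        H = math.log(w - 1) + 0.5772156649
        return 2.0 * H - 2.0 * (w - 1) / w
    if w == 2:
        return 1.0
    return 0.0

def anomaly_score(path_lengths, w):
    # Eq. (2): s(d, w) = 2 ** (-E[h(d)] / c(w)), where E[h(d)] is the mean
    # path length of the observation over all isolation trees, eq. (5).
    e_h = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-e_h / c(w))

# An observation isolated quickly (short paths) scores close to 1 (anomaly);
# one whose path length is near c(w) scores around 0.5 or below (normal).
w = 256
print(anomaly_score([3.0, 4.0, 3.5], w))    # short paths -> high score
print(anomaly_score([12.0, 13.0, 12.5], w)) # near-average paths -> normal
```

Note the score depends only on the ratio of the expected path length to $c(w)$, which is why the threshold of 0.5 holds regardless of the number of observations.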


Applying the iforest to a window magnifies the anomalies in the window. The output of this step is the input for the next algorithm, which gives the probability of each data point being an outlier from the normal data.

B. Probability Using LoOP

In order to compute the probability that a particular data point in a window is a local density outlier, the local outlier probability (LoOP) [36] algorithm is applied. The output obtained from the iforest is fed to the LoOP algorithm. LoOP closely resembles the popular local outlier factor (LOF) [37] algorithm and normalizes the outlier factors to probabilities. It uses statistical theory to determine the final score, which indicates the probability of an observation in a window being a local density outlier. These probabilities facilitate the comparison of a data point with its neighbors in the dataset window.

Let $\rho$ be the output of the first layer, as in (6). $\rho$ contains the anomaly scores of each data point in the window computed by the iforest algorithm.

$\rho = \{s_i, s_{i+1}, ..., s_{i+w}\}$    (6)

Then, LoOP is applied to compute the probability of an observation $s \in \rho$ being an outlier. This probability is derived from a so-called standard distance from $s$ to reference points $R$:

$\sigma(s, R) = \sqrt{\frac{\sum_{r \in R} \mathrm{dist}(s, r)^2}{|R|}}$    (7)

where $\mathrm{dist}(s, r)$ is the distance between $s$ and $r$ given by a distance metric (e.g., Euclidean or Manhattan distance). The probabilistic set distance of a point $s$ to reference points $R$ with 'significance' $\lambda$ (usually 3, corresponding to 98% confidence) is defined as:

$\mathrm{pdist}(\lambda, s, R) = \lambda \cdot \sigma(s, R)$    (8)

In the following step, nearest neighbors (NN) are used as reference sets. NN here refers to the nearest Euclidean distance between the observations resulting from the iforest. For a given neighborhood size $k$ and significance $\lambda$, the probabilistic local outlier factor (PLOF) of a data point $s$ is defined as:

$PLOF_{\lambda,k}(s) = \frac{\mathrm{pdist}(\lambda, s, NN_k(s))}{E_{r \in NN_k(s)}[\mathrm{pdist}(\lambda, r, NN_k(s))]} - 1$    (9)

Finally, this is used to define the local outlier probabilities. Given the previous equations (7), (8), and (9), the probability that a data point $s \in \rho$ is a local outlier is defined as:

$LoOP_{\lambda,k}(s) = \max\left\{0, \mathrm{erf}\left(\frac{PLOF_{\lambda,k}(s)}{nPLOF \cdot \sqrt{2}}\right)\right\}$    (10)

$nPLOF = \lambda \cdot \sqrt{E[PLOF^2]}$    (11)

Let $L$ be the output of the LoOP (10), which is a set of probabilities with size $w$ and $0 \le l_i \le 1$, $\forall l_i \in L$. The value $l_i$ is the probability of a data point being an outlier, and it ranges from 0 to 1.

C. Clustering Using KMeans

The output of the iforest is fed to a KMeans [38] algorithm as the second-layer learner. This learner classifies the results from the iforest into binary clusters of normal/anomaly classes. Let an initial list of cluster centroids of KMeans be $m_1^{(1)}, ..., m_k^{(1)}$. Each $s_i$ is assigned to the cluster with the least squared Euclidean distance, as given by (12).

$C_i^{(t)} = \{s_p : \|s_p - m_i^{(t)}\|^2 \le \|s_p - m_j^{(t)}\|^2 \; \forall j, 1 \le j \le k\}$    (12)

The cluster centroid in each step can be updated using (13).

$m_i^{(t+1)} = \frac{1}{|C_i^{(t)}|} \sum_{s_j \in C_i^{(t)}} s_j$    (13)

Formally, the objective is to find (14):

$\arg\min_C \sum_{i=1}^{k} \sum_{s \in C_i} \|s - \Xi_i\|^2 = \arg\min_C \sum_{i=1}^{k} |C_i| \, \mathrm{Var}\, C_i$    (14)

where $\Xi_i$ is the mean of the points in $C_i$. This is equivalent to minimizing the pairwise squared deviations of points in the same cluster (15):

$\arg\min_C \sum_{i=1}^{k} \frac{1}{2|C_i|} \sum_{x,y \in C_i} \|x - y\|^2$    (15)

Let $C$ be the output of the KMeans, which is a set of cluster labels with size $w$ and $c_i = 1$ or $c_i = 0$, $\forall c_i \in C$.

D. Ensemble of Observations

The final output can be computed by combining the results of the LoOP ($l_i$) and KMeans ($c_i$) for observation $i$. The probability of observation $i$ being an anomaly is obtained as follows (17):

$P = \{p_i, p_{i+1}, p_{i+2}, ..., p_{i+w}\}$    (16)

$P = C \cdot L = \{c_i l_i, c_{i+1} l_{i+1}, ..., c_{i+w} l_{i+w}\}$    (17)

In this problem, $K = 2$ (the number of clusters) is used, which means the KMeans algorithm is asked to divide the observations into two groups. KMeans is sensitive to changes because it minimizes the sum of squares, which puts more weight on instances different from the normal data. It puts normal data into one group. However, it might put normal data with small changes into the second group, where they would be considered outliers. The normal data is annotated as zero and outliers as one. So, $c_i = 0$ denotes normal data, and $c_i = 1$ denotes outliers. Also, $l_i$ closer to 0 (the prediction result from LoOP) indicates normal data. By multiplying the KMeans cluster results with the LoOP results, the normal data detected by KMeans is masked, which increases the precision of the ensemble method because it prevents assigning normal data to the outlier class.
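The second-layer combination can be sketched as follows. A minimal 1-D two-cluster KMeans loop (eqs. (12)-(13)) stands in for a full implementation, and the LoOP probabilities are supplied directly rather than computed; all variable and function names are illustrative, not from the paper.

```python
def two_means_labels(scores, iters=50):
    # Minimal 1-D KMeans with k = 2: assign each score to the nearer
    # centroid (eq. 12), then recompute centroids (eq. 13), until stable.
    m0, m1 = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iters):
        labels = [0 if abs(s - m0) <= abs(s - m1) else 1 for s in scores]
        g0 = [s for s, l in zip(scores, labels) if l == 0]
        g1 = [s for s, l in zip(scores, labels) if l == 1]
        if not g0 or not g1:
            break
        n0, n1 = sum(g0) / len(g0), sum(g1) / len(g1)
        if (n0, n1) == (m0, m1):
            break
        m0, m1 = n0, n1
    # Label 1 marks the cluster with the higher centroid (outlier side).
    return labels if m1 >= m0 else [1 - l for l in labels]

def ensemble(kmeans_labels, loop_probs):
    # Eq. (17): the elementwise product masks points KMeans calls normal,
    # so a point is flagged only when both learners agree.
    return [c * l for c, l in zip(kmeans_labels, loop_probs)]

scores = [0.45, 0.47, 0.46, 0.92, 0.44, 0.88]  # iforest scores, eq. (2)
probs  = [0.05, 0.02, 0.04, 0.97, 0.03, 0.91]  # LoOP outputs, eq. (10)
print(ensemble(two_means_labels(scores), probs))
```

Only the two high-scoring points survive the product; the small LoOP probabilities on normal points are zeroed out by the KMeans mask, which is exactly the precision gain described above.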


E. Data classification

A combination of the results from KMeans and LoOP is used to detect the anomalies. If $p_i \le 0.1$ $\forall p_i \in P$, it indicates either no anomalies in the data, or a PDC error in the whole window. If $p_i > 0.1$, there are anomalies in the window. The threshold with the smallest error is defined as the optimized threshold, where the error is the ratio of the number of falsely detected points to the total number of data points. To obtain this, the algorithm was run on several datasets for a range of thresholds from 0 to 1. For data sets with 5%, 10%, and 15% anomaly rates, it was observed that there is a global minimum when the threshold is about 0.1. If $p_i$ for 50% of the data points is greater than 0.1, then it is not possible to separate the normal data from the anomalies. In such a case, the data window is extended to 1.5 times the original window size and the algorithm is applied again. In this case, to determine the type of anomaly, the correlation between individual PMU data streams such as I, V and F (Table I) is checked. For this, the Pearson correlation (18) is used. The Pearson correlation coefficient is a widely used measure of the linear relationship between two normally distributed variables, and is thus often just called the correlation coefficient [39]. The correlation values vary from -1 to +1. These values are interpreted by considering thresholds inferred from previous studies [39]–[41] and experiments (Fig. 6). A value of $-0.4 \le corr \le 0.4$ indicates no correlation between data streams, and can be inferred as bad data. Correlation values of $0.4 < corr < 0.7$ and $-0.7 < corr < -0.4$ indicate a moderate correlation, which might be the impact of an event from another zone; therefore, the algorithm requires two more steps to detect the events. The same applies to correlation values $0.7 < corr < 1$ and $-1 < corr < -0.7$, which indicate a strong correlation. Therefore, at the next level, the algorithm calculates the correlation between PMUs in the same zone using (19). An event will affect the data streams from all PMUs in a PDC. If the correlation among PMUs belonging to a certain PDC is $-0.4 \le corr \le 0.4$, then the algorithm suggests that there is bad data. Otherwise, the algorithm calculates the correlation between the PMUs of different PDCs (Table I). A value of $|corr| \le 0.4$ indicates no correlation between data streams, and can be inferred as bad data [39]. For $|corr| > 0.4$, two more steps are required to detect the events. First, the correlation between PMUs in the same PDC is calculated using (20). If the correlation between PMUs in a PDC is less than 0.7, then it is concluded to be bad data. Otherwise, the correlation between PMUs of different PDCs is calculated (Table I).

For the correlation between two different PMUs, the matrix correlation is defined as (19). For two PDCs, the correlation is defined as (20).

$corr(x, y) = \frac{cov(X, Y)}{\sigma_x \sigma_y}$    (18)

$corr2(A, B) = \frac{\sum_{i=1}^{n} corr(A_{F_i}, B_{F_i})}{n}$    (19)

$corr2D(Z_1, Z_2) = \frac{\sum_{j=1}^{pm_1} \sum_{k=1}^{pm_2} corr2(PMU_j, PMU_k)}{pm_1 \, pm_2}$    (20)

where $F_i$ in (19) represents the $i$th feature for the PMU, and $PMU_j$ is the $j$th PMU in the PDC. If the correlation between PMUs in adjacent PDCs is less than 0.4, it is labeled as bad data. For a correlation value greater than 0.4, a power system event is identified.

The distribution of the correlation between two PMUs belonging to the same PDC is shown in Fig. 6. First, the correlation for 806 different windows of size 10 (8060 data points) was computed for the test data set. Then, the distribution of the correlations was plotted for the 806 windows using box plots. The median value (indentations in the box) for normal data was about 0.85, which is not shown, as the correlation for normal data is not required. However, for bad data, the correlation value is close to 0.30. Importantly, the boxes for bad data did not overlap with those for PDC error, normal data, and events, which demonstrates the distinct correlation ranges. It was observed that the correlation for anomalies varied from 0 to 0.4, mainly gravitating in the 0.2 to 0.4 interquartile range. For events and PDC errors, it ranged from 0.7 to 1. The correlation for event occurrences ranged from 0.7 to 0.95, and for PDC errors the correlation was very high, more than 0.9. The PDC error correlation is high because the PMUs have almost zero values for all data points. A correlation of 0.4 to 0.7 usually occurs for an event in a neighboring zone. The data set analyzed to derive the thresholds is from the IEEE 14 bus system simulated in a real-time digital simulator (RTDS).

Fig. 6: Correlation distribution between two PMUs

F. PDC error detection

After detecting the anomalies, the next step is the detection of PDC errors.

$\mu(D) = \frac{\sum_{j=1}^{n} \sum_{i=1}^{w} F_{ij}}{nw}$    (21)

If $\mu(D) \le 0.2$ for a PMU in a zone, the data in the PMU might be missing packet data.

If a PDC malfunctions, it reports '0' or 'NaN', depending upon settings, for all data streams in the PMUs. All the 'NaN's are converted to zeros for simplicity. Therefore, if the average

1949-3053 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Washington State University. Downloaded on December 26,2020 at 03:54:41 UTC from IEEE Xplore. Restrictions apply.

µ(D) of all data streams in a window of size 'w' for all PMUs under the same PDC is zero, it is a PDC error. For instance, a case where a PDC error occurs for a PMU with one data stream (voltage) for a duration of one second is shown in Fig. 7. It is considered that the reporting rate is 30 frames per second and that the values are around 1 p.u. The PDC error is from point 3 to point 32 of the data stream. Four data windows (W1-W4) are considered, starting from point 1 to point 40. Therefore, the PDC error starts at point 3 in the time series (W1) for a window size of 10 data points. For this case, the values n = 1 (one data stream) and w = 10 (window size) are used for the calculation of µ(D). This resulted in µ(D) = 0.2 p.u. for W1, µ(D) = 0 p.u. for W2 and W3, and µ(D) = 0.8 p.u. for W4. For the proposed algorithm, the set threshold classifies W1, W2, and W3 as a data packet drop, whereas W4 is a case of a window with missing data points.

Fig. 7: Packet Data Loss and possible PDC error

At this stage, it was also determined whether multiple PMUs have missing packet data, which indicates an error in the local PDC or communication system. To check whether multiple PMUs have missing data at the same time instance, the correlation was computed using (19), as discussed in Section IV-E. This helps in notifying the operator that the PDC or communication channel needs to be fixed (Fig. 4). The flow of the algorithm is given in Algorithm 1.

TABLE I: RULE OF THUMB FOR INTERPRETING THE SIZE OF THE CORRELATION COEFFICIENT

                             |Corr| ≥ 0.7   0.4 < |Corr| < 0.7   |Corr| ≤ 0.4
                             (Strong)       (Moderate)           (Weak)
Data streams (L1)            Check L2       Check L2             Bad data
PMUs, same zone (L2)         Check L3       Check L3 / Bad data  Bad data
PMUs, neighbor zone (L3)     Event          Event / Bad data     Bad data

Algorithm 1 Synchrophasor Anomaly Data Classification (SyADC)

Input: PMU Data
Output: Normal Data, Anomaly Data, and Anomaly Classification
Start Sampled Values Stream
 1: First, feed the moving window data to iforest (result: scores)
 2: Second, feed the output of the iforest to KMeans and LoOP (result: probabilities)
 3: if pi ≤ 0.1 ∀ pi ∈ P then
 4:   if average of data points in window < 0.2 then
 5:     if multiple-PMU correlation > 0.7 then
 6:       PDC error
 7:     else
 8:       PMU packet missing data
 9:     end if
10:   else
        Label data as normal
11:   end if
12: else if ∃ pi ∈ P where pi > 0.1 then
13:   if the number of pi > 0.1 is equal to the number of pi < 0.1 then
14:     Go back to 9 and feed more data to the algorithm
15:   else
16:     if average of data points in window < 0.2 for the larger class then
          Go to 5 and check for PDC error in the larger class
17:     else
18:       Compute the correlation (18) between two features
19:       if no correlation then
20:         if probability > 0.1 (17) then
21:           Label as outlier data
22:         else
23:           Label as normal data
24:         end if
25:       else if correlation then
26:         Compute correlation between PMUs (19)
27:         if no correlation then
28:           if probability > 0.1 (17) then
29:             Label as outlier
30:           else
31:             Label as normal data
32:           end if
33:         else if correlation then
34:           Label as event data
35:         end if
36:       end if
37:     end if
38:   end if
39: end if
40: return Flags (Event, PDC error, Outlier, or Normal)

The proposed approach is focused on the observed anomalies, irrespective of the root causes. Misalignment, erroneous compression, and the creation of duplicate data or loss of data in the PDC may result in synchrophasor data anomalies and will be detected by the proposed algorithm. If all these detected


anomalies are correlated with a specific PDC, that will also be detected.

V. RESULTS

To validate the proposed method, two test systems were modeled in the RTDS, and PMU data from the hardware PMU and OpenPDC software were obtained. Different power system events, such as faults, generator drop, load change, transformer tap change, and capacitor bank switching, were modeled at different buses. The SyADC algorithm was run on each PMU data set, and the results are presented in the subsections below.

Fig. 8: IEEE 14 Bus network with distribution of PMUs

A. Results

The length of the data window was 8060 points at a rate of 30 frames per second. Bad data was injected in five different fractions of the data set. The percentage of bad data injections varied from 5% to 15%. The range of data anomalies varied from 0.07% to 50%. For example, if a selected data point has a voltage of 1 p.u., a random function that follows a normal distribution updates it to a value in the range 0.5–0.93, 1.07–1.5, or zero (missing data).

Outlier detection is an imbalanced classification problem: data is classified as anomalous or normal, with the normal category representing the majority of the data points. For such a problem, accuracy is not a proper measure for assessing model performance. Therefore, two metrics, recall and precision, were used for evaluating the results [42]. First, the false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN) were computed. Following this, the precision and recall were computed using (22) and (23):

precision = TP / (TP + FP)    (22)

recall = TP / (TP + FN)    (23)

where TP is the number of bad data detected as bad data, FN is the number of bad data detected as normal data, TN is the number of normal data detected as normal data, and FP is the number of normal data detected as bad data. Table II presents the recall and precision for the PMUs at Bus 6 and Bus 8 of Fig. 8. As seen in Table II, the rates of bad data ranged from 5% to 15%.

Fig. 9 presents a detailed test analysis of the PMUs at Bus 6 and Bus 10, which belong to the same PDC. In Bus 6 and Bus 10, two instances of PDC errors were simulated, totaling 80 data points. Bus 6 also had an event simulated. First, the events, PDC errors, and bad data were all detected as anomalies. Next, the application of the correlation processing classified the types of anomalies. In scenario A, a small number of data points were normal, and the majority were zeros (PDC error). At first, the unsupervised algorithm detected the smaller cluster as anomalies. The average value µ(D) of the major cluster was calculated, and it was less than the set threshold (0.2 p.u.). Therefore, the major cluster was detected as a PDC error and the smaller one as normal data points. In scenario B, all the data points had the same distribution, so the ensemble algorithm detected them as normal data. However, in the post-processing step, µ(D) < 0.2 p.u.; therefore, the correlation between all the PMUs of the same PDC was computed. A strong correlation was detected, indicating a case of PDC error. In scenarios C and D, an individual missing data point and an outlier were simulated, and both were detected by the unsupervised algorithm. Since these were individual anomalies, a weak correlation was detected, leading to accurate classification of the anomalies. In scenario E, anomalies were first detected by the unsupervised algorithm. In the post-processing step, the correlation was calculated between data streams (current and voltage in Fig. 9). A strong correlation was observed; hence, the correlation with other neighboring PMUs was computed. In this particular case, there was a correlation between the PMUs at Bus 6 and Bus 10, although the event occurred at Bus 6. In scenario F, the data were normal, and the proposed algorithm correctly identified the normal data. It is to be noted that all the other anomalies simulated for the above-described test case shown in Fig. 9 were also detected but were not color-coded, for simplicity of explanation.

TABLE II: TEST RESULTS FOR BUS 6 AND 8

                         Anomaly Rate
                         5%      7.5%    10%     13%     15%
PMU Bus 6   TP           424     608     882     1051    1174
            FP           82      80      67      81      76
            TN           7560    7374    7110    6909    6778
            FN           1       5       8       26      39
            Recall       0.997   0.99    0.99    0.97    0.97
            Precision    0.84    0.89    0.93    0.93    0.94
PMU Bus 8   TP           418     609     875     1042    1180
            FP           67      63      43      81      63
            TN           7580    7390    7142    6920    6788
            FN           2       5       7       24      36
            Recall       0.995   0.99    0.99    0.98    0.97
            Precision    0.86    0.90    0.95    0.93    0.95

Table III shows the data for a window of size 10 with three data streams (voltage, current, and frequency) and the probability results. The probabilities were zero for all data points except for data points 2, 5, and 8. Data points 2 and 5 were detected as anomalies, but the probability of being an anomaly for data point 8 was less than the selected threshold. Therefore, it was considered normal data.
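As a quick illustration of (22) and (23), the sketch below (illustrative Python, not part of the original tool) recomputes the precision and recall for PMU Bus 6 at the 5% anomaly rate from the confusion counts in Table II.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision (22) and recall (23) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Confusion counts for PMU Bus 6 at a 5% anomaly rate (Table II).
p, r = precision_recall(tp=424, fp=82, fn=1)
print(round(p, 2), round(r, 4))  # 0.84 0.9976, matching the 0.84 / 0.997 entries
```

The same two lines applied to the Bus 8 column reproduce the remaining rows of Table II to rounding.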


Fig. 9: Plot for two PMUs showing different scenarios for bad data, events, and PDC errors detected using SyADC. A) A smaller part of the window are normal data points, and the majority are PDC error. B) PDC error. C) Individual missing data. D) Individual outlier. E) Event. F) Normal data.

TABLE III: DATA CLASSIFICATION USING PROBABILITY VALUE

S.No.   Voltage   Current   Frequency   Probability   Flag
1       7.46      1.87      60.02       0             Normal
2       7.46      0.87      60.02       0.53          Anomaly
3       7.46      1.87      60.02       0             Normal
4       7.46      1.87      60.02       0             Normal
5       8.06      1.88      60.02       0.41          Anomaly
6       7.47      1.89      60.02       0             Normal
7       7.47      1.89      60.02       0             Normal
8       7.46      1.87      60.02       0.024         Normal
9       7.47      1.89      60.02       0             Normal
10      7.47      1.89      60.02       0             Normal

If the data is highly polluted (≈ 15%), the recall of the method still remains above 97%, as it detects the anomalies locally over small windows of data.

Table VI shows the processing time for data with four different window sizes. The processing time for a window size of 100 data points and a 15% rate of anomalies is less than 0.13 seconds. A larger window size results in better accuracy of the tool; however, the time for anomaly detection and classification increases, which is not desirable for application in real time. A window size of 10 provides the desired accuracy of data to be fed to PMU applications.
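The per-window average µ(D) from (21) is straightforward to reproduce. The sketch below (illustrative Python, not the authors' code) uses the Fig. 7 walk-through numbers as assumed input: 40 samples near 1 p.u. with a packet loss from point 3 to point 32, split into windows of 10.

```python
import numpy as np

def window_mu(stream, w):
    """Mean of each non-overlapping window of w samples: µ(D) from (21)
    for a single data stream (n = 1). Assumes len(stream) % w == 0."""
    s = np.asarray(stream, dtype=float)
    return s.reshape(-1, w).mean(axis=1)

# Fig. 7 walk-through: points 3..32 (0-based indices 2..31) are lost (zero).
stream = np.ones(40)
stream[2:32] = 0.0
mu = window_mu(stream, w=10)   # µ(D) per window W1..W4
drops = mu <= 0.2              # W1-W3 fall at or below the 0.2 p.u. threshold
print(mu)                      # [0.2 0.  0.  0.8]
```

This reproduces the values reported in the text: µ(D) = 0.2, 0, 0, and 0.8 p.u. for W1 through W4, with the first three windows flagged as a packet drop.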
of data to be fed to PMU applications.
Tables IV and V compare the performance of the proposed tool with other methods. For the data with 15% anomalies, the precision is almost 96%. The recall for the data with 5% error is 99.7%.

TABLE IV: COMPARATIVE ANALYSIS OF SYADC

Anomaly Rate   Method                   Precision   Recall
5%             Linear Regression [43]   0.78        0.857
               DBSCAN [44]              0.80        0.901
               PMU_Ensemble [22]        0.86        0.90
               SyncAD [27]              0.88        0.936
               SyADC                    0.86        0.997
10%            Linear Regression        0.82        0.83
               DBSCAN                   0.81        0.93
               PMU_Ensemble             0.90        0.94
               SyncAD                   0.91        0.947
               SyADC                    0.93        0.99
15%            Linear Regression        0.86        0.80
               DBSCAN                   0.82        0.94
               PMU_Ensemble             0.92        0.95
               SyncAD                   0.93        0.965
               SyADC                    0.95        0.97

TABLE V: SYADC FEATURES

              Reg. [43]   DBSCAN [44]   PMU_Ens. [22]   SyncAD [27]   SyADC
Bad Data      ✓           ✓             ✓               ✓             ✓
Events        ✗           ✗             ✗               ✗             ✓
PDC Faults    ✗           ✗             ✗               ✗             ✓

TABLE VI: PROCESSING TIME PER WINDOW

Window Size (# of data points)   Processing Time (seconds)
10                               0.105
20                               0.113
50                               0.116
100                              0.129

For the selected thresholds, the area under the curve (AUC) of the receiver operating characteristic (ROC) for the proposed method is about 99%. The AUC-ROC of three other approaches compared with the proposed method is shown in Fig. 10.

The principal clustering algorithm, iforest, uses a small sub-sampling size, which allows it to perform anomaly detection efficiently with a minimal memory footprint. Also, iforest has a space complexity of O(n) because, at worst, it needs to traverse (1/2 × n) elements of the array, which requires a low memory bound. The time complexity is O(log n) because the array is cut in half at every iteration. For KMeans and LoOP, the result of iforest is used as the input, which is only one dimensional, and therefore the memory usage is low. For example, the space complexity of KMeans is O((m+k)n), where m is the number of objects and n is the number of attributes for n-dimensional objects, which here is one. Therefore, the space complexity is O(m+k).
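To make the staging concrete, here is a minimal sketch of the iforest-then-cluster idea in Python with scikit-learn. It is not the authors' implementation: the LoOP stage is approximated by min-max scaling the iforest scores into a [0, 1] pseudo-probability, and KMeans splits the one-dimensional scores into a normal group and a candidate-anomaly group.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans

def score_window(window, seed=0):
    """Stage 1: iforest anomaly scores on one data window.
    Stage 2: KMeans on the 1-D scores; the cluster with the larger
    mean score holds the anomaly candidates.
    Stage 3 (LoOP stand-in, an assumption here): min-max scale the
    scores to [0, 1] as a pseudo-probability."""
    X = np.asarray(window, dtype=float).reshape(-1, 1)
    scores = -IsolationForest(random_state=seed).fit(X).score_samples(X)
    km = KMeans(n_clusters=2, n_init=10, random_state=seed)
    labels = km.fit_predict(scores.reshape(-1, 1))
    candidates = labels == np.argmax(km.cluster_centers_.ravel())
    pseudo_prob = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    return candidates, pseudo_prob

# A window of ~1 p.u. voltages with one gross outlier at index 5.
window = [1.0, 1.01, 0.99, 1.0, 1.0, 0.55, 1.0, 0.99, 1.01, 1.0]
candidates, prob = score_window(window)
```

In Algorithm 1, points whose probability exceeds 0.1 proceed to the correlation checks; here `candidates` and `prob` play those roles under the stated simplifications.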


Fig. 10: Area Under Curve (AUC) Receiver Operating Characteristic (ROC) for 10% anomaly rate in the data

B. IEEE 68 bus and Industry Data

1) IEEE 68 Bus System: The IEEE 68 bus system was simulated in RTDS, and PMU measurements were obtained. Five PMUs were placed on Buses 29, 25, 54, 31, and 41, monitoring lines 29-Load, 25-26, 54-55, 31-10, and 41-40, respectively. For clarity, only a part of the network is shown in Fig. 11. A total of 12000 data points at 30 samples per second were gathered for each of the PMUs. A script was written to create different events in real time by changing loads, creating fault events, dropping generators, and disconnecting lines. Once the output PMU files were gathered, ten different data sets were created by adding random anomalies twice in each of the PMU data streams. All the random anomalies were created to give an anomaly rate of 15%. Table VII shows the performance of the SyADC algorithm, with high precision and recall. The algorithm was successful in determining the majority of the anomalies.

TABLE VII: SYADC ON IEEE 68 BUS SYSTEM

S. No.   Precision   Recall
1        0.9560      0.9822
2        0.9578      0.9833
3        0.9636      0.9776
4        0.9468      0.9864
5        0.9588      0.9742
6        0.9780      0.9948
7        0.9764      0.9640
8        0.9834      0.9860
9        0.9356      0.9836
10       0.9615      0.9732

Fig. 11: IEEE 68 Bus system

2) Industry Data: The SyADC algorithm was applied to several industrial PMU data sets. The ground truth of the industrial data sets, their topology, and information regarding the PMUs were unknown for confidentiality reasons. However, it can be mentioned that data from 60 different PMUs were obtained, with each PMU monitoring several channels. The voltage, current, frequency, active power, and reactive power flow for each PMU were analyzed. The SyADC algorithm successfully detected the outliers, single-point missing data, and missing packet data, as seen in Fig. 12.

Fig. 12: Anomaly Detection on Industrial PMU data

VI. CONCLUSIONS

Synchrophasor data anomalies consist mainly of outliers, missing data, and event data. The proposed SyADC tool detects and classifies these data anomalies. The proposed method does not require any training data and can perform with high precision in real time. Its major advantage is the classification of data anomalies into bad data, events, and PDC errors. The SyADC tool can be used to process synchrophasor data online. Based on the test results, the time for anomaly detection is around 0.1-0.2 s, depending on the window size. The performance of PMU-based applications can be improved if the data is processed using the SyADC tool before usage. The time taken by the proposed method is independent of the size of the network, as it uses 10 data points per window and a group of PMUs in a small cluster. Results are shown for the IEEE 14 bus and 68 bus networks with PMU data obtained from the RTDS simulator, and for industrial PMU data, highlighting the advantages of the proposed method in comparison to other available methods. The thresholds and data window used are the same for each case. The threshold values are derived from testing on different data sets, combining machine learning algorithms, the operation of power systems, and the features and architecture of synchrophasor measurements. The thresholds and the algorithm will work for other systems as well. The developed method can be further extended to consider data anomalies due to cyber-attacks. Further, the method can be augmented to retrieve lost PMU data and to classify different power system events with localization.

REFERENCES

[1] S. Pandey, N. Patari, and A. K. Srivastava, "Cognitive flexibility of power grid operator and decision making in extreme events," in 2019 IEEE Power & Energy Society General Meeting (PESGM), pp. 1–5, 2019.


[2] Y. V. Makarov, P. Du, S. Lu, T. B. Nguyen, X. Guo, J. Burns, J. F. Gronquist, and M. Pai, "PMU-based wide-area security assessment: concept, method, and implementation," IEEE Trans. Smart Grid, vol. 3, no. 3, pp. 1325–1332, 2012.
[3] P. Kundu and A. K. Pradhan, "Enhanced protection security using the system integrity protection scheme (SIPS)," IEEE Trans. Power Del., vol. 31, pp. 228–235, Feb. 2016.
[4] M. A. Donolo, "Advantages of synchrophasor measurements over SCADA measurements for power system state estimation," tech. rep., Schweitzer Engineering Laboratories, 2006.
[5] M. Göl and A. Abur, "A hybrid state estimator for systems with limited number of PMUs," IEEE Trans. Power Syst., vol. 30, no. 3, pp. 1511–1517, 2015.
[6] S. A. N. Sarmadi, A. S. Dobakhshari, S. Azizi, and A. M. Ranjbar, "A sectionalizing method in power system restoration based on WAMS," IEEE Trans. Smart Grid, vol. 2, no. 1, pp. 190–197, 2011.
[7] Y. Guo, K. Li, D. Laverty, and Y. Xue, "Synchrophasor-based islanding detection for distributed generation systems using systematic principal component analysis approaches," IEEE Trans. Power Del., vol. 30, no. 6, pp. 2544–2552, 2015.
[8] G. Liu, J. Quintero, and V. M. Venkatasubramanian, "Oscillation monitoring system based on wide area synchrophasors in power systems," in 2007 iREP Symposium - Bulk Power System Dynamics and Control - VII. Revitalizing Operational Reliability, pp. 1–13, IEEE, 2007.
[9] S. Pandey, A. K. Srivastava, P. Markham, M. Patel, et al., "Online estimation of steady-state load models considering data anomalies," IEEE Trans. Ind. App., vol. 54, no. 1, pp. 712–721, 2018.
[10] P. Kundu and A. K. Pradhan, "Real-time event identification using synchrophasor data from selected buses," IET Generation, Transmission & Distribution, vol. 12, no. 7, pp. 1664–1671, 2018.
[11] O. P. Dahal, S. M. Brahma, and H. Cao, "Comprehensive clustering of disturbance events recorded by phasor measurement units," IEEE Trans. Power Del., vol. 29, pp. 1390–1397, Jun. 2014.
[12] C. Sun, X. Wang, Y. Zheng, S. Chen, and Y. Yue, "Early warning system for spatiotemporal prediction of fault events in a power transmission system," IET Generation, Transmission & Distribution, vol. 13, no. 21, pp. 4888–4899, 2019.
[13] A. Gholami, A. K. Srivastava, and S. Pandey, "Data-driven failure diagnosis in transmission protection system with multiple events and data anomalies," Journal of Modern Power Systems and Clean Energy, vol. 7, no. 4, pp. 767–778, 2019.
[14] P. Khaledian, B. K. Johnson, and S. Hemati, "Power grid resiliency improvement through remedial action schemes," in IECON 2018 - 44th Annual Conference of the IEEE Industrial Electronics Society, pp. 774–779, IEEE, 2018.
[15] R. Dubey, S. R. Samantaray, and B. K. Panigrahi, "A spatiotemporal information system based wide-area protection fault identification scheme," International Journal of Electrical Power & Energy Systems, vol. 89, pp. 136–145, 2017.
[16] A. Srivastava, S. Pandey, M. Zhou, P. Banerjee, and Y. Wu, "Ensemble based technique for synchrophasor data quality and analyzing its impact on applications," in North American Synchrophasor Initiative (NASPI), Gaithersburg, MD, pp. 1–24, March 2017.
[17] W. Li, C. Wen, J. Chen, K. Wong, J. Teng, and C. Yuen, "Location identification of power line outages using PMU measurements with bad data," IEEE Trans. Power Syst., vol. 31, pp. 3624–3635, Sep. 2016.
[18] A. Lazarevic and V. Kumar, "Feature bagging for outlier detection," in 11th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., pp. 157–166, 2005.
[19] K. D. Jones, A. Pal, and J. S. Thorp, "Methodology for performing synchrophasor data conditioning and validation," IEEE Trans. Power Syst., vol. 30, pp. 1121–1130, May 2015.
[20] X. Wang, D. Shi, Z. Wang, C. Xu, Q. Zhang, X. Zhang, and Z. Yu, "Online calibration of phasor measurement unit using density-based spatial clustering," IEEE Trans. Power Del., vol. 33, pp. 1081–1090, June 2018.
[21] S. Pandey, S. Chanda, A. Srivastava, and R. Hovsapian, "Resiliency-driven proactive distribution system reconfiguration with synchrophasor data," IEEE Trans. Power Syst., pp. 1–1, 2020.
[22] M. Zhou, Y. Wang, A. K. Srivastava, Y. Wu, and P. Banerjee, "Ensemble based algorithm for synchrophasor data anomaly detection," IEEE Trans. Smart Grid, pp. 1–1, 2018.
[23] T. Wu, Y. J. Zhang, and X. Tang, "Online detection of events with low-quality synchrophasor measurements based on iforest," IEEE Trans. Ind. Inform., 2020.
[24] K. Chatterjee and N. R. Chaudhuri, "Corruption-resilient detection of event-induced outliers in PMU data: A kernel PCA approach," in 2019 IEEE Power & Energy Society General Meeting (PESGM), pp. 1–5, IEEE, 2019.
[25] K. Chatterjee, K. Mahapatra, and N. R. Chaudhuri, "Robust recovery of PMU signals with outlier characterization and stochastic subspace selection," IEEE Trans. Smart Grid, 2019.
[26] E. McCollum, J. Bestebreur, J. Town, and A. Gould, "Correlating protective relay reports for system-wide, post-event analysis," in 2018 71st Annual Conference for Protective Relay Engineers (CPRE), pp. 1–6, Mar. 2018.
[27] S. Pandey, A. Srivastava, and B. Amidan, "A real time event detection, classification and localization using synchrophasor data," IEEE Trans. Power Syst., pp. 1–1, 2020.
[28] P. NASPI, "PMU data quality: A framework for the attributes of PMU data quality and quality impacts to synchrophasor applications," 2017.
[29] H. Gharavi and B. Hu, "Synchrophasor sensor networks for grid communication and protection," Proceedings of the IEEE, vol. 105, no. 7, pp. 1408–1428, 2017.
[30] A. Sundararajan, T. Khan, A. Moghadasi, and A. I. Sarwat, "Survey on synchrophasor data quality and cybersecurity challenges, and evaluation of their interdependencies," Journal of Modern Power Systems and Clean Energy, vol. 7, no. 3, pp. 449–467, 2019.
[31] A. S. Chowdhury, E. Khaledian, and S. L. Broschat, "Capreomycin resistance prediction in two species of mycobacterium using a stacked ensemble method," Journal of Applied Microbiology, 2019.
[32] M. E. Aminanto, L. Zhu, T. Ban, R. Isawa, T. Takahashi, and D. Inoue, "Automated threat-alert screening for battling alert fatigue with temporal isolation forest," in 2019 17th International Conference on Privacy, Security and Trust (PST), pp. 1–3, IEEE, 2019.
[33] T. Zhang, E. Wang, and D. Zhang, "Predicting failures in hard drivers based on isolation forest algorithm using sliding window," in Journal of Physics: Conference Series, vol. 1187, p. 042084, IOP Publishing, 2019.
[34] Z. Zou, Y. Xie, K. Huang, G. Xu, D. Feng, and D. Long, "A docker container anomaly monitoring system based on optimized isolation forest," IEEE Trans. Cloud Computing, 2019.
[35] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422, IEEE, 2008.
[36] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, "LoOP: local outlier probabilities," in Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1649–1652, ACM, 2009.
[37] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local outliers," in ACM SIGMOD Record, vol. 29, pp. 93–104, ACM, 2000.
[38] N. Grira, M. Crucianu, and N. Boujemaa, "Unsupervised and semi-supervised clustering: a brief survey," A Review of Machine Learning Techniques for Processing Multimedia Content, vol. 1, pp. 9–16, 2004.
[39] H. Akoglu, "User's guide to correlation coefficients," Turkish Journal of Emergency Medicine, vol. 18, no. 3, pp. 91–93, 2018.
[40] R. Artusi, P. Verderio, and E. Marubini, "Bravais-Pearson and Spearman correlation coefficients: meaning, test of hypothesis and confidence interval," The International Journal of Biological Markers, vol. 17, no. 2, pp. 148–151, 2002.
[41] Z. Gao, D. Kong, and C. Gao, "Modeling and control of complex dynamic systems: applied mathematical aspects," 2012.
[42] D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation," 2011.
[43] S. Weisberg, Applied Linear Regression, vol. 528. John Wiley & Sons, 2005.
[44] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, vol. 96, pp. 226–231, 1996.

Ehdieh Khaledian (Student Member, IEEE) is a Ph.D. candidate in computer science at Washington State University, Pullman, WA. She received her M.S. degree in computer engineering from the University of Isfahan, Isfahan, in 2014. Her research interests include machine learning and data mining, and extracting important patterns from masses of data.


Shikhar Pandey is a General Engineer at ComEd, IL. He completed his M.S. degree in 2017 and his Ph.D. in 2020, both in Electrical Engineering, at Washington State University, Pullman. His research interests include synchrophasor technology, measurements, data quality issues, event detection, and synchrophasor applications. He received his undergraduate degree in Electrical Engineering from the National Institute of Technology Patna in 2013 and worked as a Sr. Electrical Engineer (2013-2015) at Larsen and Toubro ECC, Kullu, H.P., India.

Pratim Kundu is an assistant professor in the School of Computing and Electrical Engineering at the Indian Institute of Technology Mandi, India. He received his Ph.D. and M.S. degrees in Electrical Engineering from the Indian Institute of Technology Kharagpur, India, in 2018 and 2013, respectively. He worked as a post-doctoral fellow at Washington State University, Pullman, USA, from 2018 to 2019. His research interests are power system monitoring and control, network protection, smart grids, and renewable energy sources.

Anurag K. Srivastava is an associate professor of electric power engineering at Washington State University and the director of the Smart Grid Demonstration and Research Investigation Lab (SGDRIL) within the Energy System Innovation Center (ESIC). He received his Ph.D. degree in electrical engineering from the Illinois Institute of Technology in 2005. His research interests include data-driven algorithms for power system operation and control. Dr. Srivastava is a past editor of the IEEE Transactions on Smart Grid and an editor of the IEEE Transactions on Power Systems. He is an IEEE distinguished lecturer and the co-author of more than 300 technical publications.
