Imputation Method of Missing Values for Dissolved Gas Analysis Data Based on Iterative KNN and XGBoost

Lin Qiao†
State Grid Liaoning Electric Power Supply Co., Ltd
Shenyang, Liaoning, China
18361232838@163.com

Ran Ran
State Grid Liaoning Electric Power Supply Co., Ltd
Shenyang, Liaoning, China
394579129@qq.com

He Wu
State Grid Liaoning Electric Power Supply Co., Ltd
Shenyang, Liaoning, China
505322675@qq.com

Qiaoni Zhou
State Grid Liaoning Electric Power Supply Co., Ltd
Shenyang, Liaoning, China
1466423711@qq.com

Sai Liu
State Grid Electric Power Research Institute, Nari Group Corporation
Nanjing, JiangSu, China
liusai@sgepri.sgcc.com.cn

Yunfei Liu
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
Nanjing, JiangSu, China
18351936098@163.com
ABSTRACT

Power transformers are an important part of the power system. Accurate monitoring of their operating status is particularly important for the normal and stable operation of the entire power system and for the timely diagnosis of potential faults. Dissolved Gas Analysis (DGA) can detect and identify failures of oil-immersed power transformers by comparing the dissolved gas content of the transformer oil in the normal operating state with that in the fault state. However, during the operation of grid transformers, the detection data are often missing. This paper proposes an effective imputation method for missing values based on iterative KNN and XGBoost. First, an XGBoost tree ensemble, which can handle missing values, is trained on the data set. Information obtained from training, such as the number of times each attribute is used for splitting, is used to calculate the importance scores of the attributes and thus their interpolation priority. The missing values are then interpolated in an iterative manner. Experimental results on a DGA data set at different missing rates show that the proposed method is more accurate than existing similar methods, and that the interpolated data set significantly improves the classification performance of the classifier.

KEYWORDS

Iterative KNN, Interpolation Priority, Dissolved Gas Analysis, Missing Values

ACM Reference format:

Lin Qiao, Ran Ran, He Wu, Qiaoni Zhou, Sai Liu and Yunfei Liu. 2018. Imputation Method of Missing Values for Dissolved Gas Analysis Data Based on Iterative KNN and XGBoost. In Proceedings of 2018 International Conference on Algorithms, Computing and Artificial Intelligence (ACAI’18). Sanya, China, 7 pages. https://doi.org/10.1145/3302425.3302447

† Corresponding author: 18351936098@163.com
© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
ACAI '18, December 21–23, 2018, Sanya, China
© 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6625-0/18/12…$15.00
https://doi.org/10.1145/3302425.3302447

1 Introduction

The transformer is the main device of the power system. During operation, a transformer may encounter electrical or thermal stress, causing problems such as arcing, discharge, and thermal failure [1]. These faults produce key characteristic gases such as hydrogen (H2), acetylene (C2H2), ethylene (C2H4), methane (CH4), ethane (C2H6) and carbon monoxide (CO), which dissolve in the insulating oil of the transformer and exceed a certain threshold. If these problems are not dealt with in a timely manner, they may lead to transformer failure and operation interruption, which affects the normal and stable operation of the entire power system and results in incalculable losses. Therefore, timely and accurate judgment of transformer faults is of great significance [2].

Currently, the primary method for diagnosing and detecting failures of oil-immersed power transformers is dissolved gas analysis (DGA). Operating and monitoring experience shows that DGA can detect hidden faults in a transformer at an early stage and warn of impending failures. Traditional DGA methods take the concentrations of key gases, the ratios of gas concentrations, and similar quantities, and predict the transformer fault with rules such as the IEC ratio and Rogers ratio methods. The accuracy of these methods is not high, and different methods tend to give different predictions. In recent years, machine learning methods based on SVM, classification and regression trees, BPNN, etc. have also been used to predict transformer faults from DGA data.
However, in practice, a large amount of disorganized data is generated when the data of each substation are extracted and transmitted, and the volume of data grows exponentially. During transmission and use, a considerable part of the data goes missing. The reasons for the loss can be roughly divided into subjective and objective ones. Subjective reasons are human factors in data collection or transmission, such as data entry errors, job misconduct or intentional falsification of data. Objective reasons include equipment failure and route interruption, such as data storage failure, substation mechanical failure, and truncation of the data transmission route. Missing data not only cause the threshold-based methods above to fail, but also reduce the performance of machine learning algorithms: as the number of missing values in the data set increases, the prediction accuracy of the learning algorithm decreases.

Missing values can usually be handled by simple deletion, for example by deleting the samples that contain them, but this may discard valuable information from the many incomplete samples [3]. A better way is therefore to analyze the missing data and interpolate the missing values according to their characteristics. Related work has also verified that if the information in the incomplete data is fully utilized, the accuracy of machine learning algorithms for transformer fault diagnosis can be improved.

To improve the accuracy of the interpolation results and make full use of the information in both the complete and the incomplete data, this paper proposes a missing value interpolation method based on iterative KNN [4] and interpolation priority. First, the XGBoost [5] ensemble classifier, which can process missing values, is trained on the DGA data set to obtain, for each attribute, the number of times it is used as a split attribute and its average gain and average coverage as a split attribute. These quantities are used to calculate the importance of each attribute to the classification result and hence its interpolation priority, on the basis of which the interpolation priority of the different samples missing the same attribute is calculated. Then, according to the obtained priority, the missing values are iteratively interpolated by KNN until the convergence condition is reached, so that the information in the incomplete data is fully utilized during the iterations. The main contributions of this article are:
(1) Missing values are predicted by an iterative KNN method that reuses previously interpolated values, so the prediction takes advantage of information from both the incomplete and the complete data in the DGA data set.
(2) A method based on XGBoost classification is proposed to determine the interpolation priority from the strength of the relationship between the power transformer attributes and the labels.
(3) The non-parametric, iterative KNN can directly predict the missing values of all attributes instead of building a separate prediction model for each missing attribute, which further reduces the prediction time.

Section 2 reviews related work on the interpolation of DGA data. Section 3 proposes the DGA missing value interpolation method based on iterative KNN and XGBoost. Section 4 evaluates the performance of the algorithm, and Section 5 concludes the paper.

2 Related Work

DGA data interpolation has been extensively studied because of its significance. Many methods have emerged over the years to determine and assign replacement values for missing items in DGA data sets. Simple methods such as eyeballing and mean substitution can only be used in particular situations and may fail in complex ones [6]. DGA-based transformer fault diagnosis methods basically make predictions by establishing a nonlinear model between the dissolved gases and the fault type, so the relationship between the dissolved gases and the fault label is very important for determining the interpolation order. The authors of [7] use KNN to interpolate the missing values, but apply it only once, which is inaccurate. The method in [8] is incremental and tree-based, but it does not take into account an interpolation order based on the relationship between attributes and labels. Linear regression is used in [9] to predict missing values before the diagnosis of power transformers, but it cannot predict missing values in different attributes. Support vector regression (SVR) is used in [10] to estimate and fill in the missing values from the available data. However, that approach requires discretization of the continuous values of the fault gases before estimation, a preprocessing step that can lead to loss of information. Although these methods can interpolate missing values, they do not make full use of the valid information in the incomplete data or of the relationship between attributes and labels, which is what our method addresses.

3 Proposed Method

3.1 Missing Data Set of DGA

In the data set shown in Table 1, $D = \{(X_i, y_i)\}_{i=1}^{n}$ denotes the n samples of the whole DGA data set, and $y_i \in \{1, \dots, c\}$ is the power transformer fault category label. $X_i = \{f_j\}_{j=1}^{m}$ is the attribute vector of each sample, consisting of m gas concentration attributes. The symbol ‘?’ indicates a missing value, that is, the concentration of a gas that was not recorded. A complete instance has no missing value. An incomplete instance may contain one or more missing values, such as X2, X3, and X4.

The data in this data set are continuous values. The ratios of different gases are also used in the data set. These ratios are calculated from the gas content. Therefore, we only interpolate the gas content, and then calculate the ratios of the different gases from the interpolated results and use them as additional attributes.

Table 1. DGA Dataset with Missing Values

instances  f1   f2   f3   f4   f5   y
X1         x11  x12  x13  x14  x15  0
X2         x21  x22  ?    ?    x25  1
X3         x31  x32  x33  ?    x35  2
X4         x41  ?    ?    x44  x45  3
X5         x51  x52  x53  x54  x55  4

3.2 Interpolation Order of Missing Values
In the interpolation of missing data, any attribute of a sample may be missing. For example, whether f2 is interpolated before f3 and f4 affects the quality of the data and the result of fault diagnosis. Studies have shown that different interpolation sequences have an impact on the results of interpolation, and that it is better to interpolate missing values in a certain order [11]. In the DGA data set, one sample may contain multiple missing values, and the same attribute may be missing in more than one sample. This paper therefore proposes a method to determine both the attribute priority (column direction) and the sample priority (row direction) of the DGA data set.

DGA is a technique for classifying transformer faults based on the amounts of dissolved gases in the insulating oil. Therefore, when determining the priority order of interpolation, it is necessary to consider the association between these characteristic gases and the category labels. The stronger the association between an attribute and the category label, the more important that characteristic gas is for the classification, and the earlier its missing values should be interpolated. In addition, we ultimately use XGBoost to classify transformer faults. XGBoost can be trained directly on a data set with missing values and automatically learns the splitting direction for missing attributes, so we use it to obtain the relationship between the characteristic gases and the category labels. The following information is considered:
1) The number of times each attribute is used as a split attribute in all trees. The more often the attribute is used for splitting, the more important it is.
2) The average gain of the attribute as a split feature.
3) The average coverage of samples by the attribute as a split feature.

This is similar to feature selection. The priority of an attribute is determined by eq. (1). Since number, gain, and coverage have different scales, they are normalized first and then averaged to obtain the priority score. number_k is the number of times the k-th attribute is used as a split attribute, number = {number_1, number_2, ..., number_n}, and g and c are the lower and upper bounds of the normalized range. Here, g = 1 and c = 2. gain' and cover' are calculated in the same way as number'. prior_k is the priority score of attribute k.

$$\mathit{number}'_k = g + \frac{(\mathit{number}_k - \min \mathit{number})(c - g)}{\max \mathit{number} - \min \mathit{number}}, \qquad \mathit{prior}_k = \frac{\mathit{number}'_k + \mathit{gain}'_k + \mathit{cover}'_k}{3} \quad (1)$$

The higher prior_k is, the more important the attribute is to the transformer fault classification model, and the earlier its missing values are interpolated.

In addition to the interpolation priority of attributes in the column direction, the same attribute may be missing in multiple samples. In that case the attribute priority alone cannot produce a valid interpolation order. For example, attribute f3 in Table 1 is missing in both X2 and X4, and we cannot decide whether to interpolate x23 or x43 first, so a new metric is needed to order the samples that share a missing attribute. This study introduces eq. (2) to calculate the interpolation priority of multiple to-be-interpolated samples with the same missing attribute. S_complete is the set of priority scores of the non-missing attributes of the current sample, num_miss is the number of missing attributes in the current sample, and num_attr is the number of attributes of the current sample X, so the denominator is the missing rate of the current sample.

$$\mathit{prior}_{sample} = \frac{\max S_{complete}}{\mathit{num}_{miss} / \mathit{num}_{attr}} \quad (2)$$

Assume that the attribute priority in Table 1 of Section 3.1 is f3 > f2 > f4, and that both X2 and X4 are missing f3. The attribute priority cannot determine the interpolation priority of X2 and X4. However, the missing rates of X2 and X4 are the same and f2 > f4, so the priority of X2 is higher than that of X4: at equal missing rates, samples whose more important attributes are present should be interpolated first, so that as much effective information as possible is used.

3.3 Iterative KNN Interpolation Prediction Method

The DGA data set contains complete and incomplete instances. Both carry useful information, and using both can effectively improve the accuracy of the interpolation [12]. After ordering the missing values, we need to select the interpolation algorithm that predicts them. When choosing an interpolation algorithm, we consider the following aspects:
1) The DGA data set contains multiple attributes, so the interpolation algorithm should preferably not need a separate prediction model for each attribute.
2) The DGA data set contains gas concentrations, so the interpolation algorithm must work on continuous values and retain the covariance and correlation with the other variables.
3) The DGA data set contains samples with multiple missing attribute values, so the interpolation algorithm should be able to handle multiple missing values.
4) A non-parametric method should be chosen where possible, since parametric methods are usually based on assumptions, such as the population distribution of the data values and the prior distribution of the model parameters, that are difficult to satisfy in reality.

The KNN algorithm is a non-parametric method that meets all the above requirements and has been widely used [13] [14] [15], so it is our choice. Other algorithms such as linear regression, neural networks and decision trees need different imputation models for different attributes.

The KNN algorithm looks for the k instances in the data set most similar to the sample with missing values and fills in the missing values with the mean or mode of those k instances. As an interpolation method, KNN is efficient and easy to implement. The quality of the values interpolated by KNN mainly depends on 1) the selection of k and 2) the distance metric. The choice of k is related to the size of the data set: if the data set is large, k should be increased appropriately, and if it is small, k should be reduced. The k most similar samples are found with a distance metric. We use two well-known distance metrics to calculate the similarity between samples.

1) City Block Distance (CB). It is based on taxicab geometry: the distance between two samples $X_a$ and $X_b$ is the sum of the absolute values of the differences of each attribute, which makes it robust to outliers, as shown in eq. (3).

$$\mathit{dist}_{ab} = \sum_{j=1}^{m} |x_{aj} - x_{bj}| \quad (3)$$

2) Euclidean Distance (EU). This is the most common metric for the distance between two samples. It is the square root of the sum of the squared differences between the attributes of the two samples.

$$\mathit{dist}_{ab} = \sqrt{\sum_{j=1}^{m} (x_{aj} - x_{bj})^2} \quad (4)$$

The steps of the KNN algorithm are as follows:
1) Determine k and divide the data set D into $D_{com}$ and $D_{incom}$, which contain only the complete and only the incomplete instances, respectively, so that $D = D_{com} \cup D_{incom}$.
2) Select the instance to be interpolated from $D_{incom}$ according to the interpolation order. Use the CB or EU distance metric to calculate the distance between the selected instance and an instance in $D_{com}$. $X_i = \{x_{i1}, \dots, x_{im}\}$ denotes an instance with missing values to be predicted, and $X_q = \{x_{q1}, \dots, x_{qm}\}$ is an instance in $D_{com}$. The distance is calculated according to eq. (3) or eq. (4). m is the number of attributes in a sample, and $x_{ij}$ is the j-th attribute of the i-th instance.
3) Repeat 2) to calculate the distance between $X_i$ and all instances in $D_{com}$, and sort the instances in ascending order of distance.
4) Select the top k instances from the sorted list. $X_{knn} = \{X'_1, \dots, X'_k\}$ denotes the k most similar neighbors selected.
5) Estimate the missing value $x_{ij}$ as the mean of the j-th attribute over the k nearest neighbors, as in eq. (5).

$$x_{ij} = \frac{\sum_{p=1}^{k} x_{pj}}{k} \quad (5)$$

Once $X_i$ has been interpolated and contains no missing values, it is placed in $D_{com}$ to help interpolate the other missing values in $D_{incom}$, until all missing values are interpolated.

If the interpolation is performed according to the above KNN algorithm, each value is interpolated only once. Although the information in the incomplete data is used in the process, the result is still unreliable and often inaccurate, because that information is used insufficiently. Initially, KNN interpolation is performed using only the complete data. If $D_{com}$ grows little or not at all for some time, very little information from the incomplete data enters the process, and the information is not fully used.

To make full use of the information in the incomplete data, we therefore interpolate the missing values iteratively with KNN. Each KNN pass interpolates all missing values in $D_{incom}$ and yields a completed data set; $D_{incom,p-1}$ denotes the completed data obtained after the (p-1)-th pass. In the p-th iteration, we use the complete data set $D_{com} + D_{incom,p-1}$ to interpolate the data in $D_{incom}$ and obtain $D_{incom,p}$, so that the information in the incomplete data is fully used over multiple iterations. The iteration stops when the interpolated values converge, that is, when the change in the interpolated values is less than a manually set threshold.

With iterative interpolation, all available information is used to estimate the missing values in a DGA data set.

4 Experiment and Results

4.1 Data Set

The DGA data set used in our research comes from the Internet. Because few DGA data sets exist, we collected and organized data from multiple online resources [16] [17] [18]. This gives considerably more data than many other studies, although it may also introduce data quality problems, which are not our main concern here. In addition, the State Grid Corporation of China provided us with some data. Each sample in the DGA data consists of a series of dissolved gases in oil, the corresponding ratios and the corresponding fault type. As shown in Table 2, there are 3,949 samples in total, whereas many studies have only a few hundred.

Table 2. DGA Examples in Our Experiment

H2   CH4  C2H4  C2H6  C2H2   C2H2/C2H4  …  Fault
117  17   3     1     1      0.333      …  0
595  32   18    4     65     3.611      …  1
10   63   35    176   0.001  0          …  3

Our DGA data set contains both the gas contents and the ratios of gas contents. The ratios are not targets of our interpolation because they are calculated from the gas contents: after the gas concentrations are interpolated, the ratios can be obtained directly. Our DGA data sets are complete. We simulate real-world missing DGA data by randomly changing some values in the complete DGA data set into missing values. Given the harsh environment of real-world power transformers, values in a DGA data set go missing at random in most cases. We set the random missing rates to 5%, 10%, 15% and 20%.

4.2 Experiment Setup

For the whole experiment, we used a computer with an Intel Core i7-7700HQ 2.8 GHz CPU and 8 GB of memory. Python 3.6 was our programming tool. The missing values of the dissolved gas data of power transformers in the real environment are simulated with random missing rates. We compare our method with EM, linear regression, MEAN, etc.: each method is used independently to interpolate the missing values in the data set, and the error and accuracy of the imputed results are then compared.

In addition to comparing our method with other imputation methods, we also compare the classification accuracy of a classifier before and after interpolation. For this comparison we use XGBoost (Extreme Gradient Boosting), which can directly process DGA data with missing values and learn the default split direction for missing data. We compare the performance of the XGBoost classifier on the DGA data set before and after the interpolation. If the classification accuracy does not improve, the interpolation of the DGA data set is meaningless.
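The setup described here, together with the iterative KNN procedure of Section 3.3, can be sketched in a few lines of Python. This is a minimal, simplified illustration and not the authors' implementation: it omits the XGBoost-based priority ordering, uses only the city-block distance of eq. (3) over the attributes observed in the target row, and assumes that at least one complete row exists. The function names `mask_at_random`, `city_block` and `iterative_knn_impute`, and the defaults `tol` and `max_iter`, are hypothetical.

```python
import random


def mask_at_random(rows, rate, rng):
    """Copy `rows`, replacing each value by None with probability `rate`."""
    return [[None if rng.random() < rate else value for value in row] for row in rows]


def city_block(a, b, cols):
    """City-block distance (eq. 3) restricted to the attribute indices in `cols`."""
    return sum(abs(a[j] - b[j]) for j in cols)


def iterative_knn_impute(rows, k=11, tol=1e-3, max_iter=20):
    """Fill None entries with the mean of the k nearest rows until the values settle."""
    missing = [[j for j, v in enumerate(row) if v is None] for row in rows]
    complete = [i for i, miss in enumerate(missing) if not miss]
    incomplete = [i for i, miss in enumerate(missing) if miss]
    filled = [list(row) for row in rows]
    for iteration in range(max_iter):
        previous = [list(row) for row in filled]
        # Pass 0 draws neighbours from D_com only; later passes also use the
        # rows completed in the previous pass (D_com + D_incom,p-1).
        pool = complete if iteration == 0 else range(len(rows))
        change = 0.0
        for i in incomplete:
            observed = [j for j in range(len(rows[i])) if j not in missing[i]]
            nearest = sorted(
                (q for q in pool if q != i),
                key=lambda q: city_block(previous[i], previous[q], observed),
            )[:k]
            for j in missing[i]:
                # Eq. (5): mean of attribute j over the selected neighbours.
                value = sum(previous[q][j] for q in nearest) / len(nearest)
                if iteration > 0:
                    change = max(change, abs(value - previous[i][j]))
                filled[i][j] = value
        # Convergence test: largest change of any imputed value is below tol.
        if iteration > 0 and change < tol:
            break
    return filled
```

A typical use would mask a complete table at one of the rates above, for example `mask_at_random(table, 0.05, random.Random(0))`, and pass the result to `iterative_knn_impute`. Observed values are never modified; only the `None` entries are replaced.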
When evaluating an interpolation algorithm, we need to check whether it improves the classification accuracy of the classification model. Accuracy is defined as follows:

$$\mathit{accuracy} = \frac{N_c}{N_{all}} \quad (6)$$

$N_c$ is the number of correctly classified instances, and $N_{all}$ is the number of all instances.

Because our method uses KNN, only a very small number of parameters need to be tuned, the most important of which is the number of nearest neighbors k. In our research, k ∈ {1, 3, 5, 7, 9, 11, 13, 15, 17}, because some categories contain few instances. The two distance metrics described in Section 3.3 are also compared. When KNN is used to interpolate missing values, different values of k and different distance metrics are compared to select the optimal combination.

In the EM interpolation method, the number of iterations is set manually. Since EM and our method are both iterative, we also compare the number of iterations the two methods need to converge. Because the gas values vary widely, we use the normalized root-mean-square error (NRMSE) to evaluate the accuracy of the interpolation, as defined in eq. (7).

$$\mathrm{NRMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - y'_i)^2}{\sum_{i=1}^{n} y_i^2}} \quad (7)$$

y is the true value of the missing value, and y' is the interpolated value of the missing value.

4.3 Results and Analysis

4.3.1 Comparison of Different k and Distance Metrics in KNN. Because our method is based on KNN, we need to evaluate k and the distance metric to select the combination most suitable for DGA data interpolation. We randomly select values in the different attributes of the data set and turn them into missing values. Table 3 shows the NRMSE when the City Block Distance is used as the distance metric.

Table 3. NRMSE When City Block Distance (CB) Is Used as the Distance Metric

CB   K=1   K=3   K=5   K=7   K=9   K=11  K=13  K=15  K=17
5%   1.07  1.22  0.97  0.83  0.69  0.58  0.71  0.66  0.74
10%  1.13  1.90  0.92  0.86  0.78  0.67  0.68  0.71  0.77
15%  1.28  0.94  0.91  0.93  0.67  0.74  0.75  0.69  1.12
20%  1.29  1.20  1.09  0.87  0.89  0.82  0.74  0.84  0.90

As shown in Table 3, when CB is used as the distance metric and the missing rate is 5%, 10%, 15%, and 20%, the best k is 11, 11, 9, and 13, respectively. The corresponding NRMSE values are 0.58, 0.67, 0.67, and 0.74. Based on the average NRMSE of each k over the different missing rates, k = 11 is the best choice.

Table 4. NRMSE When Euclidean Distance (EU) Is Used as the Distance Metric

EU   K=1   K=3   K=5   K=7   K=9   K=11  K=13  K=15  K=17
5%   1.90  0.98  0.79  0.81  0.77  0.76  0.82  0.74  0.88
10%  1.82  1.78  0.92  0.68  0.73  0.80  0.91  0.87  0.86
15%  1.10  0.94  0.99  0.95  0.87  0.72  0.77  0.71  0.78
20%  1.81  1.87  1.56  0.98  0.76  0.71  0.84  0.98  1.08

As shown in Table 4, different distance metrics give different results. When EU is used as the distance metric, the best k is 15, 7, 15, and 11, respectively, with corresponding NRMSE values of 0.74, 0.68, 0.71, and 0.71. When k is small, only a few neighbors are available to interpolate the missing values, so some of the most similar instances are easily ignored and the NRMSE tends to be large. When k is too large, the NRMSE also grows, because the selected k neighbors may include less similar instances that reduce the accuracy of the interpolation.

CB performs better than EU as a distance metric because the CB distance is not squared and is therefore more stable than the squared EU distance in the presence of outliers. We therefore chose the KNN model with CB as the distance metric and k = 11. Its smallest NRMSE occurs at the 5% missing rate.

4.3.2 Comparison with Other Methods. In addition to comparing different values of k and distance metrics within our method, we also compare our method with other commonly used interpolation methods, namely MI and linear regression, which are currently among the most commonly used. In the comparison, we used the optimal k and the CB distance metric obtained in 4.3.1. The different interpolation methods are compared on the same DGA data set, and the performance of the three methods is shown in Figure 1. Linear regression predicts missing values by fitting a model on the complete DGA data, with the missing attribute as the prediction target and the other attributes and the label as inputs. MI (multiple imputation) fills the missing values m times to form m complete DGA data sets and then analyzes these data sets to obtain the final statistical inference. In our experiment, the number of repetitions is set to 10.
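The linear regression baseline and the NRMSE of eq. (7) can be illustrated with the following minimal sketch. It is not the code used in the experiments: the helper names `nrmse`, `solve`, `fit_linear` and `predict` are hypothetical, the regression is fitted with the normal equations and plain Gaussian elimination, and the inputs are assumed to be non-collinear so that the system has a unique solution.

```python
import math


def nrmse(true_values, imputed_values):
    """Normalized root-mean-square error as defined in eq. (7)."""
    num = sum((y - y_hat) ** 2 for y, y_hat in zip(true_values, imputed_values))
    den = sum(y ** 2 for y in true_values)
    return math.sqrt(num / den)


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(inputs, targets):
    """Least-squares coefficients (intercept first) from the normal equations."""
    design = [[1.0] + list(row) for row in inputs]
    p = len(design[0])
    xtx = [[sum(r[a] * r[b] for r in design) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * t for r, t in zip(design, targets)) for a in range(p)]
    return solve(xtx, xty)


def predict(coef, row):
    """Predicted value of the missing attribute for one row of input attributes."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))
```

For imputation, `fit_linear` would be fitted on the complete instances, with the missing attribute as the target and the remaining attributes and the label as inputs. `predict` then supplies the value for each incomplete instance, and `nrmse` compares the imputed values with the values that were masked out.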
Figure 1. The Performance of the Three Different Methods on the Same DGA Data Set (NRMSE versus missing rate)

Method             5%    10%   15%   20%
Ours               0.58  0.67  0.74  0.82
Linear regression  3.23  3.19  4.06  4.47
MI                 5.88  4.07  6.22  4.67

Our method is clearly stable under different missing rates, and its NRMSE is smaller than that of the linear regression and MI interpolation methods, because our method is iterative and makes full use of the information in the incomplete DGA data. MI performs worst in Figure 1. Linear regression does not perform as well as expected, but it is more stable than MI. Our method is the most stable of the three and has the best performance.

In our method, the interpolation priority of the attributes is sorted according to their importance. The importance is calculated from the number of times each attribute is used as a split attribute in the trees of the XGBoost classification model, its average gain as a split attribute, and its average coverage as a split attribute. The priority scores obtained for the different attributes in our experiments are shown in Table 5.

Table 5. Priority Scores for Different Attributes

              H2     CH4    C2H4   C2H6    C2H2
Split number  1570   1799   1606   1511    1161
Gain          0.992  1.425  1.75   1.113   2.91
Coverage      53.56  60.64  64.62  61.164  71.05
Prior score   0.213  0.543  0.532  0.348   0.416

The interpolation order of the different attributes we obtained in the experiment is H2 > CH4 > C2H4 > C2H6 > C2H2.

4.3.3 The Effect of Iterative KNN Imputation on XGBoost Model. Since XGBoost can be trained directly on a DGA data set with missing values, we train XGBoost on the incomplete DGA data set before interpolation and on the completed DGA data set after iterative KNN interpolation, and compare the two to measure the effect of iterative KNN interpolation on the classification results. The accuracy of the XGBoost classification model before and after interpolation is shown in Table 6.

Table 6. Accuracy of Classification Models Before and After Interpolation

        Top accuracy  Average accuracy
Before  0.832         0.817
After   0.884         0.857

As shown in Table 6, after the incomplete DGA data set is interpolated with the iterative KNN algorithm, both the average accuracy and the highest accuracy of XGBoost improve, indicating that our iterative KNN interpolation is effective and can be used to interpolate the missing values in a DGA data set.

5 Conclusions

This paper proposed an iterative KNN interpolation method based on interpolation priority to predict missing values in DGA data sets. The experimental comparison with MI and linear regression shows that it outperforms these existing methods. Our method makes full use of the information in both incomplete and complete data, and can interpolate all attributes without building different prediction models for different missing attributes. The experimental results also demonstrate that our method can improve the accuracy of the classifier.

The research was supported by the State Grid Liaoning Electric Power Supply Co., Ltd, and we are grateful for the financial support for the “Key Technology and Application Research of the Self-Service Grid Big Data Governance (SGLNXT00YJJS1800110)”.

REFERENCES
[1] Zhang R, Du Y, Liu Y. 2010. New Challenges to Power System Planning and Operation of Smart Grid Development in China. In International Conference on Power System Technology. IEEE.
[2] Li G, Pu J, Wen F, et al. 2016. A Partial Order Reduction Based Method for Big Data Preprocessing in Smart Grid Environment. Automation of Electric Power Systems.
[3] Himmelspach L, Conrad S. 2010. Clustering approaches for data with missing values: Comparison and evaluation. In International Conference on Digital Information Management. IEEE.
[4] Song Q, Shepperd M, Chen X, et al. 2008. Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation. Journal of Systems & Software.
[5] Chen T, Guestrin C. 2016. XGBoost: A Scalable Tree Boosting System.
[6] Lu Z, Hui Y V. 2003. L1 linear interpolator for missing values in time series. Annals of the Institute of Statistical Mathematics.
[7] Sahri Z, Yusof R. 2014. Support Vector Machine-Based Fault Diagnosis of Power Transformer Using k Nearest-Neighbor Imputed DGA Dataset. Journal of Computer and Communications.
[8] Conversano C, Siciliano R. 2009. Incremental Tree-Based Missing Data Imputation with Lexicographic Ordering. Journal of Classification.
[9] Shi W, Zhu Y, Huang T, et al. 2017. An Integrated Data Preprocessing Framework Based on Apache Spark for Fault Diagnosis of Power Grid Equipment. Journal of Signal Processing Systems.
[10] Yongli Z, Fang W, Lanqin G. 2006. Transformer Fault Diagnosis Based on Naive Bayesian Classifier and SVR. In Tencon IEEE Region 10 Conference. IEEE.
[11] Zhang S, Wu X, Zhu M. 2010. Efficient missing data imputation for supervised learning.
[12] Zhang S, Jin Z, Zhu X. 2011. Missing data imputation by utilizing information within incomplete instances. Journal of Systems & Software.
[13] Sahri Z, Yusof R. 2014. Support Vector Machine-Based Fault Diagnosis of Power Transformer Using k Nearest-Neighbor Imputed DGA Dataset. Journal of Computer and Communications.
[14] Sahri Z, Yusof R, Watada J. 2014. FINNIM: Iterative Imputation of Missing Values in Dissolved Gas Analysis Dataset. IEEE Transactions on Industrial Informatics.
[15] Yu H, Wu Q, Lu Y, et al. 2017. Research on Fault Diagnosis of Power Transformer Equipment Based on KNN Algorithm. In International Conference on Mechatronics and Intelligent Robotics. Springer, Cham.
[16] https://github.com/Saleh860/DGA
[17] https://github.com/piotrmirowski/DGA
[18] https://github.com/srijanee/DGA
