
Optimization Method of Suspected Electricity Theft Topic Model Based on Chi-square Test and Logistic Regression

Jian Dou(&) and Ye Aliaosha

China Electric Power Research Institute, Beijing, China


krauser3a@163.com

Abstract. In recent years, electricity theft has become increasingly high-tech and covert, and the factors that indicate it have become varied and complex. As a result, power grid enterprises identify customers who steal electricity with low efficiency and poor accuracy. In this paper, the Chi-square test and logistic regression are used to optimize the suspected electricity theft topic model. First, the Chi-square test is used to determine the factors interrelated with electricity theft. Then, the logistic regression algorithm is used to optimize the weights of the interrelated factors. Finally, a prediction function is constructed that can predict which customers have been stealing electricity. Experiments show that the proposed method achieves high accuracy, precision and recall rates and can therefore help power grid enterprises identify customers who have been stealing electricity.

Keywords: Anti-electricity theft · Chi-square test · Logistic regression · Power consumption inspection

1 Introduction

For a long time, power grid enterprises have been committed to investigating and punishing electricity theft. With the continuous development of technology, electricity theft has become increasingly high-tech and covert. This makes it inefficient and inaccurate for power grid enterprises to identify customers who steal electricity [1, 2]. Traditional methods of identifying electricity theft users, such as manual analysis, can hardly meet current anti-electricity-theft requirements. Modern techniques are needed to screen the factors interrelated with electricity theft and to help power grid enterprises identify electricity theft accurately [2, 3].
Owing to the wide application of various modern techniques in electricity theft, theft is becoming better concealed, and the factors that can reflect electricity theft behavior are becoming more and more complicated [4–7]. As a result, the demands on the ability of power consumption inspectors have become stricter. Some thieves even use modern equipment, such as interference units, to

© Springer Nature Singapore Pte Ltd. 2018


Q. Zhou et al. (Eds.): ICPCSEE 2018, CCIS 902, pp. 389–400, 2018.
https://doi.org/10.1007/978-981-13-2206-8_32

mislead power consumption inspectors, which makes it more difficult for inspectors to identify thieves from the various factors [8].
In 2015, the State Grid proposed a set of models for the on-line monitoring and
intelligent diagnosis of power metering. The electricity theft model contained in these
models could help the power consumption inspector to identify the electricity theft
users.
However, this model is based on rule matching, and the rules depend heavily on subjective experience. As new methods of electricity theft emerge, it is difficult for this approach to capture the characteristics of diverse theft behavior accurately. Other existing methods for identifying electricity stealing behavior share this limitation and cannot meet the current requirements for recognizing electricity theft users and behavior.
Therefore, it is necessary to optimize the model based on data analysis, in order to find the characteristic factors of electricity theft quickly and effectively and to predict accurately the probability that a customer steals electricity. The optimization model proposed in this paper can reduce the economic losses caused by electricity theft, improve the accuracy of power consumption inspection, and greatly reduce the workload of inspectors.

2 Research on Chi-square Test and Logistic Regression

The Chi-square test is a commonly used hypothesis testing method. Its most common use is to investigate whether the distribution of an unordered categorical variable is consistent between two or more groups. It can also be used to compare the association between two or more samples and categorical variables. The Chi-square test does not depend on the distribution of the population, applies to a wide range of problems, and is easy to carry out, which gives it clear advantages in practical applications [9]. Based on the Chi-square test, this paper identifies the main factors that indicate whether a customer has been stealing electricity, and thus builds a topic model of electricity theft.
A regression model is a mathematical model that describes a statistical relationship quantitatively. Regression analysis studies how a dependent variable depends on one or more independent variables. Based on a set of sample data, regression analysis determines the mathematical relationship between the variables, checks the reliability of that relationship with statistical tests, and finally finds the significant variables among those that affect a specific variable. With the relationship obtained, the value of a specific variable can be predicted or controlled from the values of one or several other variables, and the accuracy of the prediction or control can also be given [10]. Logistic regression analysis is often used for categorical variables [11]. The predicted result of logistic regression is a probability between 0 and 1, which is easy to use and explain. In practical applications, logistic regression plays an important role in classification problems, such as predicting the probability of a disease, predicting the probability of a commodity purchase, or judging the sex of a user [12–14]. However, this analysis method is rarely used in the field of anti-electricity stealing. This paper will screen out the factors that are

significantly associated with electricity theft, and optimize the weights of these factors based on logistic regression, in order to obtain a prediction model with high accuracy, precision and recall rates.

3 Topic Model of Electricity Theft Based on Chi-square Test

The Chi-square test can be used to compare the association between two or more samples and categorical variables. Both the factors that affect electricity theft and the result (whether theft occurred) can be treated as categorical variables. In this paper, the correlation between the factors and the result is obtained by the Chi-square test, so that irrelevant factors can be eliminated.

3.1 Screen Out Interrelated Factors of Suspected Electricity Theft


Suppose that the topic model is built on the electricity theft user sample set $c$ and the normal user sample set $m$.
The data set of factors from the electricity theft users and the normal users is used as the original data set. Define the factor set as $q_i$, $i \in (1, 2, \ldots, n)$. The fourfold table of actual values used for the Chi-square test is shown in Table 1.

Table 1. Fourfold table of actual values


| Group | Electricity theft users | Normal users | Total |
|---|---|---|---|
| $q_i$ occurred | $c_1$ | $m_1$ | $c_1 + m_1$ |
| $q_i$ not occurred | $c_2$ | $m_2$ | $c_2 + m_2$ |
| Total | $c$ | $m$ | $c + m$ |

$c_1$ and $c_2$ are the numbers of electricity theft user samples in which factor $q_i$ occurred and did not occur, respectively. $m_1$ and $m_2$ are the numbers of normal user samples in which factor $q_i$ occurred and did not occur, respectively. Thus $c = c_1 + c_2$ and $m = m_1 + m_2$.
First, suppose that whether $q_i$ occurs is independent of whether the user has been stealing electricity. Select a sample at random from the user data. The probability that this sample belongs to the electricity theft users is $\mu = \frac{c}{c + m}$.
Then, according to the independence hypothesis, a new fourfold table of theoretical values is generated, as shown in Table 2.

Table 2. Fourfold table of theoretical values


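| Group | Electricity theft users | Normal users | Total |
|---|---|---|---|
| $q_i$ occurred | $(c_1 + m_1) \times \mu$ | $(c_1 + m_1) \times (1 - \mu)$ | $c_1 + m_1$ |
| $q_i$ not occurred | $(c_2 + m_2) \times \mu$ | $(c_2 + m_2) \times (1 - \mu)$ | $c_2 + m_2$ |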

If the two variables are independent, the difference between the theoretical values and the actual values in the fourfold table is very small.
The formula of Chi-square is

$$\chi^2 = \sum \frac{(A - T)^2}{T} \tag{1}$$

$A$ is the actual value, taken from Table 1, and $T$ is the theoretical value, taken from Table 2; the sum runs over the four cells of the fourfold table.
After calculating the value of $\chi^2$, we determine whether the independence hypothesis is reliable by consulting the critical value table of the Chi-square distribution. The degree of freedom (DF) of the fourfold table is 1. The corresponding part of the critical value table of the Chi-square distribution is shown in Table 3.

Table 3. Chi-square distribution critical value table (part)


| DF | P = 0.975 | P = 0.2 | P = 0.1 | P = 0.05 | P = 0.025 | P = 0.02 |
|---|---|---|---|---|---|---|
| 1 | $9.82 \times 10^{-4}$ | 1.642 | 2.706 | 3.841 | 5.024 | 5.412 |

By consulting the full table, the probability that each factor is interrelated with electricity theft can be obtained. Let the threshold be $\varepsilon$; when the correlation probability $P > \varepsilon$, the factor is identified as an interrelated factor. In this way, a set of interrelated factors $x_i$, $i \in (1, 2, \ldots, n)$, is screened out.
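
The screening step of this section can be illustrated with a minimal Python sketch. It computes the theoretical values of Table 2 and the statistic of Eq. (1) from the four counts of Table 1. The function names and the sample counts are hypothetical. The sketch also assumes that the correlation probability P is one minus the p-value of the test with DF = 1, which the text does not state explicitly.

```python
import math


def chi_square_fourfold(c1, c2, m1, m2):
    """Chi-square statistic of Eq. (1) for the fourfold table of Table 1.

    c1, c2: theft-user samples in which the factor occurred / did not occur.
    m1, m2: normal-user samples in which the factor occurred / did not occur.
    Every theoretical value must be non-zero, otherwise the division fails.
    """
    c, m = c1 + c2, m1 + m2
    mu = c / (c + m)  # probability that a random sample is a theft user
    actual = [c1, m1, c2, m2]
    theoretical = [
        (c1 + m1) * mu, (c1 + m1) * (1 - mu),
        (c2 + m2) * mu, (c2 + m2) * (1 - mu),
    ]
    return sum((a - t) ** 2 / t for a, t in zip(actual, theoretical))


def association_probability(chi2):
    """Assumed reading of P: one minus the p-value for DF = 1.

    For one degree of freedom the upper-tail probability of the
    Chi-square distribution equals erfc(sqrt(chi2 / 2)).
    """
    return 1.0 - math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts: the factor occurs in 30 of 100 theft users
# and in 10 of 100 normal users.
chi2 = chi_square_fourfold(30, 70, 10, 90)
is_interrelated = association_probability(chi2) > 0.8  # threshold epsilon
```

For a table whose rows have identical proportions in both groups, the statistic is zero, which matches the independence hypothesis.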

3.2 Eliminate High Interrelated Factors


To ensure independence among the interrelated factors, we calculate the degree of correlation between them. If the correlation degree $\rho$ between two factors is greater than the set threshold $\varphi$, the factor with the lower degree of correlation with the result is eliminated.
The linear dependence between continuous and ordinal factors is tested with the Pearson correlation coefficient, which is calculated as follows.

$$\rho_{x_i, x_j} = \frac{\sum (x_i - \bar{x}_i)(x_j - \bar{x}_j)}{\sqrt{\sum (x_i - \bar{x}_i)^2 \sum (x_j - \bar{x}_j)^2}} \tag{2}$$

$$j \in (1, 2, \ldots, n),\ j \neq i, \qquad x_i = \begin{cases} 1, & x_i \text{ occurred} \\ 0, & x_i \text{ not occurred} \end{cases} \tag{3}$$

If $\rho_{x_i, x_j} > \varphi$, the factor of the two that is less associated with the result is eliminated. Let the set of factors that remain after the highly interrelated factors have been eliminated be $y_i$, $i \in (1, 2, \ldots, k)$; then $\{y_i\} \subseteq \{x_i\}$.
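
The elimination rule based on Eq. (2) can be sketched as follows. This is only one possible reading: factors are visited from the most to the least associated with the result, and a factor is kept only if its correlation with every factor already kept does not exceed the threshold. The function names, the dictionary layout and the use of the absolute value of the coefficient are assumptions made for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of Eq. (2) for two equal-length lists.

    Both lists must vary (non-zero variance), otherwise the division fails.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(columns, association, phi):
    """Keep one factor out of every highly correlated pair.

    columns:     hypothetical dict, factor name -> list of 0/1 values (Eq. (3)).
    association: hypothetical dict, factor name -> correlation probability P.
    phi:         threshold on the correlation degree.
    """
    kept = []
    for name in sorted(columns, key=lambda k: association[k], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= phi for k in kept):
            kept.append(name)
    return kept


columns = {
    "a": [1, 1, 0, 0, 1, 0],
    "b": [1, 1, 0, 0, 1, 0],  # identical to "a", so the coefficient is 1
    "c": [1, 0, 1, 0, 1, 0],
}
association = {"a": 0.99, "b": 0.90, "c": 0.95}
remaining = drop_correlated(columns, association, phi=0.9)
```

In this toy data, "b" duplicates "a" and has the lower association with the result, so it is the one removed.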

The dependence between unordered categorical factors, that is, the nonlinear dependence between any two factors in $y_i$, $i \in (1, 2, \ldots, k)$, is tested with the Apriori algorithm.
Calculate the support degree $\sigma_{y_i}$ of each factor in $y_i$, $i \in (1, 2, \ldots, k)$, based on the electricity theft sample.
Define the threshold of the support degree as $\eta$.
If $\sigma_{y_i} \le \eta$, eliminate the $i$-th factor from $y_i$, $i \in (1, 2, \ldots, k)$, to generate a new set $z_i$, $i \in (1, 2, \ldots, l)$, with $\{z_i\} \subseteq \{y_i\}$.
Form the pairs $(z_i, z_j)$ by selecting any two factors from $z_i$, $i \in (1, 2, \ldots, l)$, and on this basis calculate the support degree $\sigma_{z_i z_j}$ of each pair.
If $\sigma_{z_i z_j} \ge \eta$, we judge that factor $z_i$ and factor $z_j$ are correlated. Eliminate from $y_i$ the factor that has the higher entropy after discretization. Finally, we generate a new set $u_i$, $i \in (1, 2, \ldots, r)$, with $\{u_i\} \subseteq \{z_i\}$.
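
The support-based pruning described above might be sketched as follows. This is not a full Apriori implementation: it only computes the support of single factors and of pairs on the theft sample, as the text describes. The row layout, the function names and the tie-breaking rule (the second factor of a pair is dropped when entropies are equal) are assumptions.

```python
import math
from itertools import combinations


def support(rows, items):
    """Fraction of rows in which every factor in `items` occurred."""
    hits = sum(1 for row in rows if all(row[i] for i in items))
    return hits / len(rows)


def entropy(rows, item):
    """Shannon entropy (in bits) of a 0/1 factor."""
    p = support(rows, [item])
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def prune_by_support(rows, factors, eta):
    """Drop rare factors, then one factor of each frequently co-occurring pair."""
    frequent = [f for f in factors if support(rows, [f]) > eta]
    dropped = set()
    for a, b in combinations(frequent, 2):
        if a in dropped or b in dropped:
            continue
        if support(rows, [a, b]) >= eta:
            # The pair is judged correlated: drop the higher-entropy factor.
            dropped.add(a if entropy(rows, a) > entropy(rows, b) else b)
    return [f for f in frequent if f not in dropped]


# Hypothetical theft sample: each row maps a factor name to 0 or 1.
rows = [
    {"a": 1, "b": 1, "c": 0},
    {"a": 1, "b": 1, "c": 0},
    {"a": 1, "b": 0, "c": 1},
    {"a": 1, "b": 0, "c": 0},
]
survivors = prune_by_support(rows, ["a", "b", "c"], eta=0.4)
```

Here "c" is removed for low support, and "b" is removed because it co-occurs with "a" often enough and has the higher entropy.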

3.3 Determine the Combination Correlation


The factors that were eliminated in Sect. 3.1 are combined with one another, subject to the maximum combination threshold $\delta$. The correlation between each combination and the result is verified by the Chi-square test in the same way as described in Sect. 3.1, except that the groups are changed to whether the combined factors occur simultaneously or not.
If the association probability between a combination of factors and the result satisfies $P > \varepsilon$, the combination is regarded as an interrelated factor.
Finally, we obtain the set of interrelated factors $x_i$, $i \in (1, 2, \ldots, h)$, $h \le n$.
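
As a rough illustration of the combination step, the sketch below builds the four counts of the fourfold table for every combination of previously eliminated factors, where a combination counts as "occurred" only if all its factors occur in the same sample. It assumes that $\delta$ is the maximum number of factors in a combination; the function name and the row layout are hypothetical. The resulting counts can then be tested with Eq. (1) exactly as in Sect. 3.1.

```python
from itertools import combinations


def combination_tables(theft_rows, normal_rows, factors, delta=2):
    """Return {combination: (c1, c2, m1, m2)} for combinations of 2..delta factors.

    c1 / m1 count the theft / normal samples in which all factors of the
    combination occur simultaneously; c2 / m2 count the remaining samples.
    """
    tables = {}
    for size in range(2, delta + 1):
        for combo in combinations(factors, size):
            c1 = sum(all(row[f] for f in combo) for row in theft_rows)
            m1 = sum(all(row[f] for f in combo) for row in normal_rows)
            tables[combo] = (c1, len(theft_rows) - c1, m1, len(normal_rows) - m1)
    return tables


# Hypothetical samples: each row maps a factor name to 0 or 1.
theft_rows = [{"a": 1, "b": 1}, {"a": 1, "b": 0}, {"a": 0, "b": 1}]
normal_rows = [{"a": 0, "b": 0}, {"a": 1, "b": 1}]
tables = combination_tables(theft_rows, normal_rows, ["a", "b"], delta=2)
```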

4 Factor Weight Optimization by Logistic Regression

On the basis of all the interrelated factors, we construct the loss function using the logistic regression algorithm and then calculate the optimal weight of each factor by the gradient descent method. Finally, we obtain the prediction function.

4.1 Construction of Logistic Regression Function


Logistic regression is mainly used for binary classification, so we use the Sigmoid function, whose form is as follows.

$$g(z) = \frac{1}{1 + e^{-z}} \tag{4}$$

The Sigmoid function plot is shown in Fig. 1.


It can be seen that the Sigmoid function maps the value of $z$ into the interval (0, 1), giving a value close to 0 or 1 when $|z|$ is large, and that its output changes significantly in the vicinity of $z = 0$.
The prediction function is constructed as follows.

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \cdots + \theta_k x_k)}}, \quad k \le n \tag{5}$$

Fig. 1. Sigmoid function plot

Here, $\theta^T x = \theta_0 + \sum_{i=1}^{k} \theta_i x_i$, and $\theta_i$, $i \in (0, 1, \ldots, k)$, represents the weight of the interrelated factor $x_i$ ($\theta_0$ is the constant term).
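
Eqs. (4) and (5) translate directly into code. The following minimal sketch uses hypothetical function names; the two-branch form of the Sigmoid function is an implementation choice, not part of the paper, and only serves to avoid overflow for large negative $z$.

```python
import math


def sigmoid(z):
    """Sigmoid function of Eq. (4), written to avoid overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so exp(z) < 1
    return ez / (1.0 + ez)


def predict(theta, x):
    """Prediction function of Eq. (5).

    theta: [theta_0, theta_1, ..., theta_k], theta_0 being the constant term.
    x:     [x_1, ..., x_k], the values of the interrelated factors.
    """
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)


# Hypothetical weights and factor values.
probability = predict([0.0, 1.0, -1.0], [1, 1])
```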

4.2 Optimize the Weights Based on Gradient Descent Algorithm


The prediction function $h_\theta(x)$ introduces the sigmoid function on top of linear regression to fit the classification boundary. This function has the following properties.

$$P(y = 1 \mid x; \theta) = h_\theta(x) \tag{6}$$

$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x) \tag{7}$$

If the loss function of linear regression were applied directly when solving for the optimal parameters, the resulting loss function would be non-convex and would not meet the requirements of the gradient descent algorithm. A convex function therefore needs to be constructed as the loss function for optimizing the factor weights of logistic regression.
The logistic regression loss function is thus constructed from the log-likelihood loss function as follows.

$$\mathrm{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases} \tag{8}$$

Based on maximum likelihood estimation, the loss function can be rewritten as

$$J(\theta) = \sum_{i=1}^{k} \left[ -y_i \log(h_\theta(x)) - (1 - y_i) \log(1 - h_\theta(x)) \right] \tag{9}$$
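
The loss of Eq. (9) can be computed as in the sketch below, which sums the per-sample cost of Eq. (8) over all samples. The function name and the clipping constant `eps` are assumptions; clipping is added only so that the logarithm is never evaluated at zero.

```python
import math


def log_loss(theta, samples, labels, eps=1e-12):
    """Loss J(theta) of Eq. (9), summed over all samples.

    theta:   [theta_0, theta_1, ..., theta_k]
    samples: list of factor-value lists [x_1, ..., x_k]
    labels:  list of 0/1 results (1 = electricity theft)
    """
    total = 0.0
    for x, y in zip(samples, labels):
        z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
        h = 1.0 / (1.0 + math.exp(-z))
        h = min(max(h, eps), 1.0 - eps)  # keep log() finite
        total += -y * math.log(h) - (1 - y) * math.log(1.0 - h)
    return total


# With all weights at zero every prediction is 0.5,
# so each sample contributes log(2) to the loss.
loss = log_loss([0.0, 0.0], [[1], [0]], [1, 0])
```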

$J(\theta)$ is a convex function in this situation. The concrete steps for updating the weights by the gradient descent method are as follows.
Initialize: $\theta_0, \theta_1, \ldots, \theta_k$, threshold $\eta$ and learning rate $\alpha$.
(1) Determine the gradient of the loss function at the current position. The gradient with respect to $\theta_i$ is $\frac{\partial}{\partial \theta_i} J(\theta)$.

(2) Multiply the gradient of the loss function by the learning rate to obtain the distance to move from the current position, called the step: $\alpha \frac{\partial}{\partial \theta_i} J(\theta)$.
(3) Check whether the gradient descent distance of every $\theta$ value is less than $\eta$. If so, the algorithm terminates. Otherwise, go to step (4).
(4) Update the $\theta$ values according to the following formula:

$$\theta_i' = \theta_i - \alpha \frac{\partial}{\partial \theta_i} J(\theta) \tag{10}$$

Then return to step (1).


By iterating the above steps, the optimal weights are calculated and the prediction function used to predict the samples is obtained.
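
Steps (1) to (4) can be sketched in Python as follows. For the loss of Eq. (9), the partial derivative with respect to $\theta_j$ is the sum over the samples of $(h_\theta(x) - y)\,x_j$, with $x_0 = 1$ for the constant term. The function names, the toy data and the `max_iter` safety cap are assumptions and not part of the described procedure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_descent(samples, labels, theta, alpha, eta, max_iter=10000):
    """Update the weights by Eq. (10) until every step is smaller than eta."""
    theta = list(theta)
    for _ in range(max_iter):
        # Step (1): gradient of J(theta) at the current position.
        grad = [0.0] * len(theta)
        for x, y in zip(samples, labels):
            features = [1.0] + list(x)  # x_0 = 1 for the constant term
            h = sigmoid(sum(t * f for t, f in zip(theta, features)))
            for j, f in enumerate(features):
                grad[j] += (h - y) * f
        # Step (2): step = learning rate * gradient.
        steps = [alpha * g for g in grad]
        # Step (3): stop when every step is below the threshold.
        if all(abs(s) < eta for s in steps):
            break
        # Step (4): update the weights, then repeat from step (1).
        theta = [t - s for t, s in zip(theta, steps)]
    return theta


# Hypothetical data: one factor that occurs in 3 of 4 theft users
# and in 1 of 4 normal users.
samples = [[1], [1], [1], [1], [0], [0], [0], [0]]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
weights = gradient_descent(samples, labels, [0.5, 0.5], alpha=0.1, eta=1e-6)
```

On this toy data the fitted model reproduces the observed frequencies: the predicted probability is about 0.75 when the factor occurs and about 0.25 when it does not.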
As described above, the topic model is built and optimized based on the electricity theft user sample set $c$ and the normal user sample set $m$. The critical value $\xi$ of the prediction is calculated according to the following formula.

$$\xi = \frac{m}{c + m} \tag{11}$$

If the predicted value belongs to $(\xi, 1]$, the user is predicted to be an electricity theft user. If the predicted value belongs to $(0, \xi]$, the user is predicted to be a normal user.
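
The decision rule based on Eq. (11) is short enough to state as code. In this sketch the function names are hypothetical; the sample sizes in the usage line are the 2000 theft users and 2000 normal users reported in Sect. 5.1, which give a critical value of 0.5.

```python
def critical_value(c, m):
    """Critical value of Eq. (11): share of normal users in the sample."""
    return m / (c + m)


def classify(predicted_value, xi):
    """Values in (xi, 1] indicate theft; values in (0, xi] indicate a normal user."""
    return "electricity theft user" if predicted_value > xi else "normal user"


xi = critical_value(2000, 2000)
label = classify(0.73, xi)  # 0.73 is an illustrative predicted value
```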

5 Experiment Design and Result Analysis


5.1 Experiment Environment
In this experiment, we chose 2000 electricity theft users, with their electricity utilization data during the period of theft, and 2000 normal users, with their electricity utilization data over a period of time, from a provincial power grid enterprise.

5.2 Experiment Procedures and Results


We selected the initial interrelated factors shown in Table 4.
Using the method described in Sect. 3.1, we calculated by the Chi-square test the correlation degree between each suspected factor and the result of whether electricity theft occurred. We set the threshold $\varepsilon = 0.8$. When the calculated correlation probability $P > \varepsilon$, the factor was regarded as an interrelated factor. The conclusion was that the factors with serial numbers 3–5, 7–12, 15–16, 22, 26, 28, 30–35 and 37–38 were interrelated factors, and the others were irrelevant factors.
Then, we used the Pearson correlation coefficient to calculate the correlation degree between the interrelated factors, so as to ensure independence between the factors. We set the threshold $\varphi = 0.9$. If the correlation degree between two factors was greater than 0.9, we removed the factor whose $P$ value was lower.

Table 4. Initial relevance factor


| Sn. | User behavior factors | Sn. | User behavior factors |
|---|---|---|---|
| 1 | Energy Meter Value Unbalanced | 28 | Current Return |
| 2 | Energy Meter Overspeed | 29 | Fee Control Order Failure |
| 3 | Energy Meter Reversed | 30 | Surplus Money Exception |
| 4 | Energy Meter Stopped | 31 | Single Phase Meter Shunted |
| 5 | Energy Meter Fee Rate Exception | 32 | Secondary Circuit Shorted (Shunted) |
| 6 | Voltage Phase Failure | 33 | Secondary Circuit Opened |
| 7 | Voltage Overlimit | 34 | Primary Circuit Shorted |
| 8 | Voltage Unbalanced | 35 | Energy Meter Shorted |
| 9 | Phase B Exception of High Voltage Supply & High Voltage Metering | 36 | Energy Meter Value Error |
| 10 | Current Lost | 37 | Circuit Series Semiconductor |
| 11 | Current Unbalanced | 38 | Magnetic Exception |
| 12 | Energy Meter Covering Opened | 39 | Electricity Address |
| 13 | Meter Terminal Covering Opened | 40 | Power Category |
| 14 | Measuring Door Opened | 41 | Contract Capacity |
| 15 | Magnetic Interference | 42 | Running Capacity |
| 16 | Energy Differential | 43 | User Status |
| 17 | Power Differential | 44 | Power Supply Organization |
| 18 | Power Failure Exception | 45 | Electricity Price Category |
| 19 | Load Overcapacity | 46 | Electricity Price & Industry Category |
| 20 | Requirement Overcapacity | 47 | Voltage Level |
| 21 | Current Overload | 48 | Measurement Method |
| 22 | Power Over Offline Continuously | 49 | Transformer Type |
| 23 | RTU Clock Exception | 50 | Energy Meter Comprehensive ratio |
| 24 | Power factor Exception | 51 | Last Year Energy Status |
| 25 | Energy Meter Clock Exception | 52 | Last Year Fee Status |
| 26 | Reversed Energy Exception | 53 | Last Year Arrears Times |
| 27 | Phase Sequence Exception | 54 | Last Year Arrears Amount |

Afterwards, we calculated by the Chi-square test the correlation degree between the result and combinations of the irrelevant factors, with $\delta = 2$. If the calculated combined correlation probability $P > \varepsilon$, the combined factor was regarded as an interrelated factor. As a result, voltage phase failure and transformer type formed a pair of interrelated combined factors, and the other factors were eliminated.

We then substituted all interrelated factors and interrelated combined factors into the logistic regression model and followed the steps in Sect. 4 for the calculations. We initialized all $\theta$ values to 0.5, set $\eta = 0.1$, and ran the procedure with learning rates $\alpha = 0.1$ and $\alpha = 0.5$ respectively. Each time the $\theta$ values were updated in an iteration, the prediction function was updated and the samples were substituted into it for prediction. We calculated the error rate of the electricity theft users as the prediction result.

$$\sigma = \sqrt{(h_\theta(x) - 1)^2} \tag{12}$$

The error rate curve is shown in Fig. 2.

Fig. 2. Error rate curve (vertical axis: error rate; curves for α = 0.5 and α = 0.1)

Fig. 3. Precision curve (vertical axis: precision; curves for α = 0.5 and α = 0.1)

We substituted the result of each iteration into the model to calculate the prediction precision on the samples. The precision curve is shown in Fig. 3.
The solution no longer changed after 152 iterations. The final weights of the model factors are shown in Table 5.
Finally, we substituted the samples into the prediction function. The average accuracy rate was 94.1%, the average precision rate was 96.6%, and the average recall rate was 95.8%.
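
The paper does not define the three rates, so the sketch below uses their standard definitions in terms of the confusion matrix, with "electricity theft user" as the positive class. The function name and the counts in the usage line are purely illustrative and are not the experimental data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from confusion-matrix counts.

    tp: theft users predicted as theft     fp: normal users predicted as theft
    tn: normal users predicted as normal   fn: theft users predicted as normal
    The denominators must be non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Illustrative counts only.
accuracy, precision, recall = classification_metrics(tp=90, fp=10, tn=80, fn=20)
```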

Table 5. Model factors weight


| Sn. | Factors | Weight |
|---|---|---|
| 1 | Energy Meter Reversed | 0.1 |
| 2 | Energy Meter Stopped | 0.1 |
| 3 | Energy Meter Fee Rate Exception | 0.1 |
| 4 | Voltage Phase Failure (Dedicated Transformer) | 0.1 |
| 5 | Voltage Phase Failure (Public Transformer) | 0.7 |
| 6 | Voltage Overlimit | 0.2 |
| 7 | Voltage Unbalanced | 0.1 |
| 8 | Phase B Exception of High Voltage Supply & High Voltage Metering | 0.2 |
| 9 | Current Lost | 0.4 |
| 10 | Current Unbalanced | 0.1 |
| 11 | Energy Meter Covering Opened | 0.9 |
| 12 | Magnetic Interference | 0.5 |
| 13 | Energy Differential | 0.5 |
| 14 | Power Over Offline Continuously | 0.1 |
| 15 | Reversed Energy Exception | 0.5 |
| 16 | Current Return | 0.2 |
| 17 | Surplus Money Exception | 0.1 |
| 18 | Single Phase Meter Shunted | 0.5 |
| 19 | Secondary Circuit Shorted (Shunted) | 0.9 |
| 20 | Secondary Circuit Opened | 0.5 |
| 21 | Primary Circuit Shorted | 0.9 |
| 22 | Energy Meter Shorted | 0.9 |
| 23 | Circuit Series Semiconductor | 0.5 |
| 24 | Magnetic Exception | 0.9 |

5.3 Model Verification


We selected 100,000 typical customers from the provincial power grid enterprise, including 30,000 high-voltage customers and 70,000 low-voltage customers. We screened out 8110 exception records, of which 7096 were confirmed as electricity theft or metering abnormalities. The precision rate was 87.5%.
The experimental results show that the prediction model constructed in this paper predicts the results accurately. It can effectively help power grid enterprises identify electricity theft users, reduce the economic losses of the power industry, and support the stable development of the power industry.

6 Conclusion

In view of the difficulties of power grid enterprises in inspecting electricity theft, this
paper proposes an optimization method of suspected electricity theft topic model based
on Chi-square test and logistic regression.

Based on an analysis of the factors that affect suspected electricity theft, the Chi-square test is used to find the factors and combined factors that are highly correlated with the result of whether electricity theft occurred. The interrelated factors are screened out and the irrelevant factors are eliminated. The logistic regression algorithm is then used to optimize the weights of the interrelated factors iteratively and to obtain the final prediction function, which predicts whether a user has been stealing electricity from the values of the interrelated factors.
To verify the optimization model presented in this paper, we chose electricity theft users and normal users, with their related electricity utilization data, from a provincial power grid enterprise as the initial data samples. First, we screened out the interrelated factors. Then, we constructed the logistic regression function to optimize the weight of each factor. Finally, we obtained the prediction function and substituted the experimental samples into it; the prediction results performed well in accuracy, precision and recall.
Experimental results show that the Chi-square test and logistic regression algorithms are well suited to selecting the factors interrelated with electricity theft and to predicting whether users have been stealing electricity. This method can support power grid enterprises in combating electricity theft, improve the accuracy of power consumption inspection, promote the stable development of power grid enterprises, and help maintain good order in public electricity use.

References
1. Wang, J., Meng, Y., Yin, S., Zhang, Y.: The present situation and development trend of anti
electric stolen function of power demand information acquisition system. Power Syst.
Technol. 12(S2), 177–178 (2008)
2. Cheng, C., Zhang, H., Jing, Z., Chen, M., Jiao, L., Yang, L.: Study on the anti-electricity
stealing based on outlier algorithm and the electricity information acquisition system. Power
Syst. Prot. Control 43(17), 69–74 (2015)
3. Hu, S., Guan, J., Yang, Z., Yu, H.: Research on electricity quantity metrology and
acquisition system based on embedded system. Modern Electron. Tech. 39(22), 163–166
+170 (2016). https://doi.org/10.16652/j.issn.1004-373x.2016.22.040
4. Wang, Q., Li, S.: Technology analysis and preventive measures of electric larceny
prevention technology based on electric energy data acquisition system. Electr. Meas.
Instrum. (2016)
5. Ren, S.: Strengthen the supervision and management of electric power to combat the theft of
electricity. Global Mark. Inf. Guide 45, 156 (2014)
6. Zhuang, C., Zhang, B., Hu, J., Li, Q., Zeng, R.: Anomaly detection for power consumption
patterns based on unsupervised learning. Proc. CSEE 36(2), 379–387 (2016). https://doi.org/
10.13334/j.0258-8013.pcsee.2016.02.008
7. Zhao, L., Luan, W., Wang, Q.: Accurate line loss analysis of LV distribution network using
AMI data. Power Syst. Technol. 39(11), 78–83 (2015). https://doi.org/10.13335/j.1000-
3673.pst.2015.11.026
8. Ma, S.: Supervision and management of electricity and measures for preventing electricity
theft. Theor. Res. Urban Constr. 11, 2440 (2016)

9. Xu, C., Lu, G., Ye, Y., Mi, Y.: Cooperative spectrum sensing using Chi-square test for
multi-antenna cognitive radio. Chin. High Technol. Lett. 26(7), 650–656 (2016). https://doi.
org/10.3772/j.issn.1002-0470.2016.07.005
10. Wu, D.: Electricity theft identification method based on curve similarity. Electr. Power
50(2), 181–184 (2017). https://doi.org/10.11930/j.issn.1004-9649.2017.02.181.04
11. Chen, A., Xia, F., Zhong, Y.: A new independence test of four grid table. Stat. Decis. 13,
85–88 (2017). https://doi.org/10.13546/j.cnki.tjyjc.2017.13.020
12. Xu, J., Su, W., Wu, S., Wu, X.: Modeling user reliability based on logistic regression in
micro-blog. Comput. Eng. Des. 3, 772–777 (2015). https://doi.org/10.16208/j.issn1000-
7024.2015.03.042
13. Guo, J., Sun, J., Liang, T., Tan, R.: Evaluation model of disruptive design scheme based on
logistic regression. Comput. Integr. Manuf. Syst. 21(6), 1405–1416 (2015). https://doi.org/
10.13196/j.cims.2015.06.001
14. Wang, Z., Liu, K., Zheng, Z., Li, C.: Prediction retweeting of microblog based on logistic
regression model. J. Chin. Comput. Syst. 37(8), 1651–1655 (2016)
