
Neurocomputing 396 (2020) 487–494. https://doi.org/10.1016/j.neucom.2018.10.109


Data augmentation in fault diagnosis based on the Wasserstein generative adversarial network with gradient penalty

Xin Gao, Fang Deng∗, Xianghu Yue

School of Automation, Beijing Institute of Technology, Beijing 100081, China

∗ Corresponding author. E-mail addresses: xin.gao1991@qq.com (X. Gao), dengfang@bit.edu.cn (F. Deng).

ARTICLE INFO

Article history:
Received 24 March 2018
Revised 7 September 2018
Accepted 12 October 2018
Available online 24 April 2019

Keywords:
Data augmentation
Fault diagnosis
Imbalanced data
Low-data domain
GAN
WGAN-GP

ABSTRACT

Fault detection and diagnosis in industrial processes is essential for avoiding undesired events and ensuring the safety of operators and facilities. In the last few decades, various data-based machine learning algorithms have been widely studied to monitor machine condition and detect process faults. However, faulty datasets in industrial processes are hard to acquire, so low-data regimes and imbalanced data distributions are common, making it difficult for many algorithms to accurately identify different faults. Therefore, in this paper, data augmentation approaches based on the Wasserstein generative adversarial network with gradient penalty (WGAN-GP) are studied: they generate data samples that supplement the low-data input sets in the fault diagnosis field and help improve fault diagnosis accuracy. To verify their effectiveness, various classifiers and three industrial benchmark datasets are used to evaluate the GAN-based data augmentation ability. The results show that the fault diagnosis accuracies of the classifiers increase on all datasets after the GAN-based data augmentation techniques are employed.

© 2019 Elsevier B.V. All rights reserved.

1. Introduction

Fault detection and diagnosis in industrial processes is an extremely important part of keeping away from undesired events and ensuring the safety of operators and facilities, so this field has been a hot research topic for a long time. Apart from traditional mathematical-model-based fault diagnosis methods, various data-based machine learning algorithms have been widely studied in the last few decades, such as K-nearest neighbors (KNN), principal component analysis, support vector machines (SVM), and random forests [1,2]. Some of them have even been applied in industrial production systems to monitor machine condition and detect process faults [3,4]. In practice, however, large faulty datasets are hard to acquire, because industrial processes are not allowed to operate in faulty condition, in order to maintain safety, decrease maintenance costs, and avoid catastrophic accidents. Thus, in industrial fault diagnosis, low-data regimes and imbalanced data distributions are common. Moreover, given limited data, some data-based algorithms, e.g., SVM and KNN, suffer from low classification accuracy [5]. When the data samples lie in low-data domains or imbalanced distributions, it is difficult to grasp the features accurately enough to identify different faults, because the categories with little data are easily neglected and submerged by the ones with large quantities of samples.

This problem should be solved fundamentally by increasing the number of training samples, i.e., by generating more data from the raw data. Many previous studies have pointed out that data augmentation functions as a regularizer, helping to prevent overfitting and improving performance on imbalanced data distributions [6–10]. This is demonstrated by many famous competitions, where the winners augmented their data to enlarge the training set and thereby improved training performance [11–13]. A traditional method for synthesizing data samples in the image domain is geometric transformation, such as rotations, brightness changes, crops, flips, and channel alterations. However, data generated in this way are merely simple, surface-level transformations of the original data, and these transformation techniques can only be used on image sets [10]. For more general kinds of datasets, e.g., industrial process data, other data augmentation methods are needed. Therefore, in this paper, GANs are used for data augmentation to increase the number of input data samples in the low-data categories of imbalanced data distributions for industrial fault diagnosis.

To date, GANs have many varieties, for example, the deep convolutional GAN (DCGAN), the Semi-GAN, the Conditional GAN, and the auxiliary classifier GAN (ACGAN). Since the Wasserstein GAN with gradient penalty (WGAN-GP) has a much more stable optimization process and can be applied to more architectures, in this paper WGAN-GP based data augmentation models are built to generate auxiliary data for the low-data original datasets in industrial processes for fault diagnosis.
Since the focus of this paper is the ability of GAN-based data augmentation, the classifiers are not required to be extremely powerful; instead, more effort is devoted to the GAN-based data augmentation model. To research and evaluate the performance, various classifiers are involved, i.e., logistic regression, decision trees, and random forests. Besides, two benchmark datasets from the UCI repository and a universal industrial simulation benchmark, the Tennessee Eastman process, are used. The tests are challenging; nevertheless, the experiments show that the data samples generated by the proposed data augmentation scheme are of high quality and improve the classification accuracy.

The remainder of this paper is organized as follows. In the next section, the development background and the basic GAN algorithm are given. Then the data augmentation methods using WGAN-GP are described in Section 3. In Section 4, comparative experiments are conducted to test the effectiveness of GAN-based data augmentation in terms of classification accuracy. Finally, the last section gives the conclusions of this study and plans for further research.

2. Background

2.1. The introduction of GAN

In the last few decades, thanks to their flexible layered structure and the use of the backpropagation training algorithm, artificial neural networks (ANNs) have been able to approximate functions with relative ease; they therefore attracted the attention of many researchers and found wide application. Note that, in the last ten years, with growing computing capacity and larger amounts of stored data, some inherent problems of ANNs, such as the large number of parameters and the difficult training process, have been relieved to a great extent. Further, ANNs have developed into deep neural networks (DNNs) by being equipped with many layers [14,15], and they have once again become a hot research topic because they have more powerful fitting ability, can grasp and learn the features of data more precisely, and have the potential to solve highly complicated problems. By usage, DNNs can be divided into two categories, i.e., discriminative models and generative models. So far, the biggest successes of DNNs have been discriminative models, whose operating principle is to project high-dimensional input data onto a class label [16]. Nowadays, deep discriminative models have various breakthrough applications, e.g., image segmentation, speech recognition, and natural language processing. On the other hand, deep generative models have had less impact. This is because conventional generative models, such as restricted Boltzmann machines (RBMs), deep Boltzmann machines (DBMs), and their variants [17,18], are built by first assuming a distribution for the input data and then searching for the values of certain variables that make the assumed model fit the real distribution of the sampled data. In this kind of method, the objective functions, e.g., maximizing the log-likelihood, are always very difficult to solve, resulting in numerous approximations in the training process. Worse, the approximation targets a lower bound of the objective function instead of the objective function itself, which introduces larger errors.

Fortunately, a new generative model, the generative adversarial network (GAN), was recently proposed by Ian Goodfellow [16]. It is composed of two parts, a generator and a discriminator, in which the generator learns the distribution of the input data and generates new data, while the discriminator is a binary classifier that tells the generated data apart from the real data. The whole process is optimized by adversarial training, and DNNs can serve as both the generator and the discriminator. As artificial intelligence techniques develop prosperously, the GAN algorithm meets the requirements of research and applications in many fields and can bring them new directions of development. The most widely applied and researched field is images and computer vision, with many compelling results, such as generating pictures of digits or human faces, generating a high-resolution picture from a low-resolution one, and recovering an object in an image from its outline [16,19,20]. Besides, GANs have also been used in speech recognition and language processing [21].

The most direct application of a GAN is to build a model and generate data samples that have the same distribution as the real data samples. Therefore, GANs can be used to cope with learning from insufficient data or insufficiently labeled data. On the other hand, many fields need large amounts of data, especially the applied field of pattern recognition. That is because, with limited data, most classifiers become less effective, owing to overfitting in the training process and reduced generalization in the testing process caused by their large numbers of parameters [5]. Unfortunately, in practice, very large datasets are hard to acquire in some cases, of which the faulty dataset in an industrial process is a typical one. In industrial fault diagnosis, low-data regimes and imbalanced data distributions are common, making it difficult to identify different faults exactly. Therefore, in this paper, GANs are researched as a tool for data augmentation to increase the number of input data samples in the low-data categories of imbalanced data distributions.

To date, GANs have many varieties. For example, the deep convolutional GAN (DCGAN) was proposed by using deep convolutional neural networks in the structure of the adversarial nets [22]; following some training tricks, it can generate better high-resolution images. Besides, unlike the original GAN, whose input is real or fake data and whose expected output is the corresponding label (real or fake), the Semi-GAN and the Conditional GAN take additional information as input to help build the model, e.g., the class of the samples [21,23]. Furthermore, the auxiliary classifier GAN (ACGAN) realizes multi-classification and can output generated data corresponding to an assigned label, because its objective function includes not only a part that identifies real or fake data but also a part that recognizes the class label [24]. On the other hand, the Wasserstein GAN [25] developed the original GAN through the optimization process rather than the structure: its loss function is meaningful, being directly related to the quality of the generated samples. Note that the data samples in fault diagnosis are usually not intuitive to inspect, and WGANs have meaningful loss functions that help determine convergence and generate high-quality samples, so the WGAN is suitable for data augmentation on faulty datasets. Besides, since the advanced version, the Wasserstein GAN with gradient penalty (WGAN-GP), has a much more stable optimization process and can be applied to more architectures, WGAN-GP is chosen in this paper as the basic tool to help realize and improve the fault diagnosis ability.

2.2. Generative adversarial networks (GAN)

The GAN was originally proposed by Ian Goodfellow [16] and consists of two sub-nets, a generator and a discriminator. Through this competitive pair, the algorithm shows a powerful ability to learn representations of data. The training strategy is defined as a minimax game, and the pair are trained against each other at the same time. The generator takes samples from a simple noise distribution, such as a Gaussian or uniform distribution, maps them to the same data space as the input real data, and aims to generate data as realistic as possible. The discriminator, on the other hand, is used to distinguish the input fake data (from the generator) from the real data, and it is trained not to be fooled by the generated fake data. That is to say, as the result of training by the minimax game, the distribution of the generated fake data tends to become as close as possible to the distribution of the real data.

More specifically and technically, the objective, or value function, of the minimax game between the generator G and the discriminator D is shown in Eq. (1):

\min_G \max_D \; \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim P_g}[\log(1 - D(\tilde{x}))] \qquad (1)

where x is real data whose distribution is represented by P_r, while \tilde{x} is the fake data generated by the generator G, whose distribution is represented by P_g. Specifically, \tilde{x} can be denoted by \tilde{x} = G(z), in which z is a noise signal sampled from any simple distribution, e.g., a Gaussian or uniform distribution.

In Eq. (1), at the optimal discriminator the value function is closely related to the Jensen–Shannon divergence between the real data and the generated fake data. From the value function, it can be seen that in each iteration of the training process, with the generator given, the discriminator is trained towards optimality by maximizing the value function so as to distinguish real data from generated fake data; then the discriminator is fixed and the generator is trained by minimizing the value function to make the generated fake data as realistic as possible. Therefore, the training process of the GAN is called a minimax game, and the aforementioned optimization is realized through backpropagation. After a large number of iterations, the fake data generated by the generator G become so realistic that the discriminator can no longer distinguish the real data from the generated fake data.
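As a concrete illustration, the following is a minimal sketch of this alternating optimization in TensorFlow 1.x (the paper's environment lists tensorflow 1.4.0). The multilayer perceptrons, layer sizes, and the 11-dimensional sample space are illustrative assumptions, not the authors' exact networks.

```python
import tensorflow as tf

def build_generator(z):
    # hypothetical 2-layer MLP generator mapping noise to 11-dim samples
    h = tf.layers.dense(z, 64, activation=tf.nn.relu, name='g_h')
    return tf.layers.dense(h, 11, name='g_out')

def build_discriminator(x, reuse=False):
    # hypothetical MLP discriminator returning the logit of D(x)
    with tf.variable_scope('d', reuse=reuse):
        h = tf.layers.dense(x, 64, activation=tf.nn.relu)
        return tf.layers.dense(h, 1)

x_real = tf.placeholder(tf.float32, [None, 11])
z = tf.placeholder(tf.float32, [None, 32])
x_fake = build_generator(z)
d_real = build_discriminator(x_real)
d_fake = build_discriminator(x_fake, reuse=True)

# Eq. (1): E[log D(x)] + E[log(1 - D(G(z)))], written with the numerically
# stable sigmoid cross-entropy. The generator loss below uses the common
# non-saturating form -E[log D(G(z))] instead of literally minimizing
# log(1 - D), a standard practical substitution.
d_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=d_real, labels=tf.ones_like(d_real))
    + tf.nn.sigmoid_cross_entropy_with_logits(logits=d_fake, labels=tf.zeros_like(d_fake)))
g_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=d_fake, labels=tf.ones_like(d_fake)))

d_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='d')
g_vars = [v for v in tf.trainable_variables() if v.name.startswith('g_')]
d_step = tf.train.AdamOptimizer(1e-4).minimize(d_loss, var_list=d_vars)
g_step = tf.train.AdamOptimizer(1e-4).minimize(g_loss, var_list=g_vars)
# In each iteration: run d_step with the generator fixed, then run g_step.
```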
In practice, however, the training process of the minimax game is often unstable, mainly because the Jensen–Shannon divergence saturates when the two distributions barely overlap, which tends to make the discriminator saturate and the gradients vanish during training. To this end, some measures have been proposed. For example, for the DCGAN [22], several training suggestions, i.e., batch normalization, ReLU activations, and few fully connected layers, can be used to make the training process more stable and the quality of the generated data higher. However, the DCGAN needs a carefully designed architecture and some training tricks.

3. Data augmentation using WGAN-GP

3.1. Wasserstein GAN

In the original GAN algorithm, the training process of the minimax game is often unstable, which results from the discontinuous behavior of the Jensen–Shannon divergence. Besides, the original GAN has a convergence problem: one cannot really know when to stop training, because there is no numerical value showing how well the parameters are tuned. Thus, users have to look at the generated samples to tell whether the GAN model is trained well or not. That is why GANs have so far been applied prosperously mainly in the image field.

Considering the above-mentioned problems, the Wasserstein GAN (WGAN) was proposed [25]. Instead of the Jensen–Shannon divergence in the value function of the original GAN, it uses the Wasserstein distance W(P, Q), which is continuous under mild assumptions and describes the distance between the distributions P and Q. More specifically, the Wasserstein distance W(P, Q) represents the minimal cost of moving the mass of distribution P onto distribution Q.
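For reference, this minimal-transport-cost description corresponds to the standard Kantorovich form of the Wasserstein-1 distance (a standard formulation added here for reference; it is not written out in the original text):

\[
W(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \; \mathbb{E}_{(x, y) \sim \gamma}\big[\, \lVert x - y \rVert \,\big],
\]

where \Pi(P, Q) denotes the set of all joint distributions whose marginals are P and Q; each \gamma can be read as a transport plan specifying how much mass is moved from x to y.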
The value function of WGAN is shown in Eq. (2):

\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] \qquad (2)

where the latter part, i.e., the difference of the two mean values, is exactly the Wasserstein distance between the distribution of the real data and the distribution of the generated fake data. As in the original GAN, P_r represents the distribution of the real data and P_g is the distribution of the fake data generated by the generator G via \tilde{x} = G(z) (z is sampled from some simple noise distribution). What is more, the training process is similar too. In each training iteration, the generator is first fixed and the discriminator is trained towards optimality by maximizing the value function to distinguish real data from generated fake data; then the discriminator is fixed and the generator is trained by minimizing the value function, making the Wasserstein distance between the two distributions, i.e., of the generated data and the real data, as small as possible. In the end, after a large number of iterations, the minimax game converges and the discriminator can no longer distinguish the real data from the generated fake data. It is worth mentioning that the Wasserstein distance used in the value function of WGAN is continuous, which not only makes the training process more stable but also reflects the quality of the generated samples: the lower the Wasserstein distance (loss), the higher the quality of the generated sample.

However, in Eq. (2), \mathcal{D} denotes the set of 1-Lipschitz functions; thus, to realize the WGAN algorithm, the discriminator D should belong to the 1-Lipschitz functions. In [25], the authors show that this restriction can be loosened up to a scaling factor: Eq. (2) can be approximated with a discriminator D belonging to the K-Lipschitz functions, which is realized by clipping the weights in each layer of the discriminator D into the range [−c, c]. In this way, the WGAN algorithm obtains its advantages, i.e., a more stable training process and a meaningful value function.
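Continuing the TensorFlow sketch from Section 2.2, the WGAN critic loss and weight clipping could look as follows; c = 0.01 is the clipping threshold suggested in [25], and the tensor names are assumptions carried over from the earlier sketch (with the sigmoid left off the discriminator's output).

```python
# Eq. (2): the critic maximizes E[D(x)] - E[D(x_fake)], i.e., minimizes the
# negative; the generator minimizes -E[D(x_fake)]. Reuses d_real, d_fake
# and d_vars from the sketch in Section 2.2.
w_d_loss = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real)
w_g_loss = -tf.reduce_mean(d_fake)

c = 0.01  # clipping threshold suggested in [25]
clip_ops = [w.assign(tf.clip_by_value(w, -c, c)) for w in d_vars]
# run sess.run(clip_ops) after every critic update to keep D roughly K-Lipschitz
```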
3.2. Wasserstein GAN with gradient penalty (WGAN-GP)

As mentioned above, the WGAN redesigned the loss function, making it related to sample quality, and improved the stability of the optimization process. However, in some settings undesired behaviors still happen, e.g., failure to converge and generation of low-quality data. These problems mainly result from the use of weight clipping to satisfy the Lipschitz constraint, which the WGAN requires in order to function properly. Specifically, weight clipping pushes the WGAN critic towards overly simple functions, so complex data samples cannot be fitted by such simple mappings. Besides, weight clipping restricts the weights of each layer to a small range, which can easily lead to vanishing or exploding gradients. Therefore, in [26], the Wasserstein GAN with gradient penalty (WGAN-GP) was proposed, which improves the value function of WGAN and satisfies the Lipschitz constraint with a gradient penalty instead of weight clipping. The value function of WGAN-GP is as follows:

\underbrace{\mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})]}_{\text{original WGAN}} \; - \; \lambda \underbrace{\mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[ (\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2 \big]}_{\text{the penalty term}} \qquad (3)

From Eq. (3), it can be seen that the main difference from the WGAN algorithm is the last term. It is the newly added penalty term, in which λ is the penalty coefficient and \hat{x} stands for samples on the straight lines between the distributions P_g and P_r. Generally, a 1-Lipschitz function is a function whose gradient norm is at most 1 everywhere. The penalty term in Eq. (3) penalizes gradient norms that deviate from 1 and drives all gradient norms towards 1, experimentally bringing faster convergence and better optima; empirically, this works because the optimal WGAN discriminator has almost unit gradient norm under P_g and P_r. In practice, the WGAN-GP algorithm offers faster convergence and a much more stable training process, even with untuned default parameters and without a carefully designed architecture, along with high-quality samples.
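The following is a sketch of the penalty term of Eq. (3), again continuing the earlier TensorFlow sketch; λ = 10 is the coefficient suggested in [26], and the tensor names are carried-over assumptions.

```python
# Sample x_hat uniformly on straight lines between real and fake points,
# then penalize the critic's gradient norm at x_hat for deviating from 1.
lam = 10.0  # penalty coefficient suggested in [26]
eps = tf.random_uniform([tf.shape(x_real)[0], 1], 0.0, 1.0)
x_hat = eps * x_real + (1.0 - eps) * x_fake
d_hat = build_discriminator(x_hat, reuse=True)

grads = tf.gradients(d_hat, [x_hat])[0]
grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=1) + 1e-12)
penalty = lam * tf.reduce_mean(tf.square(grad_norm - 1.0))

# Eq. (3) written as a loss to be minimized by the critic:
d_loss_gp = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real) + penalty
```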
As this paper focuses on fault diagnosis, and the data samples in this field are not as intuitive and straightforward as images, the meaningful loss function of the WGAN is needed to help generate high-quality samples and determine convergence. Besides, WGAN-GP has the advantage of a much more stable optimization process and suits a wider range of architectures and datasets. Therefore, in this paper, WGAN-GP is chosen as the basic tool to help realize and improve the fault diagnosis ability.

4. Experiments and results discussion

In order to evaluate the ability of the WGAN-GP based data augmentation scheme to improve fault diagnosis performance, comparative experiments are conducted with several commonly used classifiers, i.e., a logistic regression classifier trained by stochastic gradient descent (SGD-LR), a random forest classifier (RF), and a boosting tree classifier (BT).
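A minimal sketch of this comparative setup with scikit-learn defaults follows; the paper does not name its implementations, so SGDClassifier, RandomForestClassifier, and GradientBoostingClassifier are assumed stand-ins for SGD-LR, RF, and BT, and the 3:1 split mirrors the procedure described in Section 4.2 below.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# default parameters throughout, as stated in the paper; loss='log' gives
# logistic regression via SGD (renamed 'log_loss' in newer scikit-learn)
classifiers = {
    'SGD-LR': SGDClassifier(loss='log', random_state=0),
    'RF': RandomForestClassifier(random_state=0),
    'BT': GradientBoostingClassifier(random_state=0),  # boosting-tree stand-in
}

def evaluate(X, y, seed=0):
    # 3 parts of the data for training, 1 part for testing
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=seed)
    return {name: accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
            for name, clf in classifiers.items()}
```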
The experimental environment is listed in Table 1. Since the main purpose of this paper is to explore the ability of WGAN-GP based augmentation to help classification tasks, it is worth mentioning that the parameters of the classifiers in the tests are all defaults, without any optimization. Besides, as mentioned in Section 3.2, the performance of the WGAN-GP algorithm depends little on the specific model structure used [26]; thus, for convenience and ease of handling, both the generator and the discriminator of WGAN-GP in the experimental part of this paper are multilayer perceptrons.

Table 1
The experimental environment.

Experimental tool    Version number
Computer system      Windows 7 (64-bit)
CPU                  Intel i5
Python               3.5.2 (in Anaconda 4.2.0)
tensorflow           1.4.0
numpy                1.13.3
matplotlib           1.5.3

4.1. Datasets

In our experiments, two benchmark datasets from the UCI repository are chosen first and tested on classification tasks. The first is considered a toy dataset about wine quality identification, of which 3 categories are used in this paper. The other is a fault diagnosis dataset (FD dataset); specifically, it is a steel plates faults dataset. Detailed information about the two datasets is listed in Tables 2 and 5.

Table 2
Descriptions of the toy dataset.

Category    No. of each category    Category description (the quality score)
1           1457                    5
2           2198                    6
3           880                     7

Note that the distributions of the data samples in both datasets are imbalanced; for example, in the toy dataset the 3 categories have 1457, 2198, and 880 samples, respectively. With an imbalanced distribution, the accuracies of classifiers generally suffer, and sometimes misclassification happens, because the categories with little data are easily neglected and submerged by the majority class. Therefore, to make full use of the imbalanced datasets, GAN-based data augmentation is conducted. Besides, a universal industrial simulation benchmark for fault diagnosis and process monitoring, the Tennessee Eastman process, is used to further testify that the proposed data augmentation scheme can improve classification accuracy and help industrial fault diagnosis.

4.2. Experiments

To assess the contribution of the GAN-based data augmentation scheme to fault classification on the imbalanced datasets, the straightforward method is to fill the generated data into the original dataset and then compare the classification results.

At first, the classifiers are trained on the original dataset and the classification accuracies are recorded; these accuracies serve as the baselines in the comparative experiments. After that, the categories with fewer data samples are chosen as the target categories to be supplemented. Then, the WGAN-GP based model is trained using the given data samples of the target categories. Owing to the interpretable relation between the loss function and the generated data in WGAN-GP, after a number of iterations, when the loss of the discriminator in WGAN-GP converges and falls into a low value range, the generated data samples are of high quality. The training details of the WGAN-GP, including the hyperparameters, are shown in the following list.

The training details of WGAN-GP
– Batch size: 40
– Learning rate: 0.00005
– Training epochs: 500
– Leaky ReLU slope: 0.3
– Adam optimizer: beta1 = 0.5
– Data normalization: −1 to 1
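The paper states only the target range for the data normalization; a column-wise min-max scaling to [−1, 1], as sketched below, is one assumed implementation.

```python
import numpy as np

def normalize(X):
    # column-wise min-max scaling into [-1, 1]; the epsilon guards
    # against constant features (an implementation detail assumed here)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - lo) / np.maximum(hi - lo, 1e-12) - 1.0
```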
From then on, the generated data begin to be collected and are added into the original dataset, forming the new dataset. At last, the same classifiers are trained on the new dataset, and the results are recorded too. Fig. 1 shows the flowchart of the comparative tests, where test result 1 refers to the accuracies of all the classifiers on the original dataset, while test result 2 means the accuracies on the new dataset (the original data mixed with the WGAN-GP generated ones).

Fig. 1. The flowchart of the comparative experiments.

Specifically, in the baseline experiment, the original data are divided into 4 parts, of which 3 parts serve as training data and the last one is used as testing data. In the comparative experiment, on the other hand, the new data samples generated by the WGAN-GP based model are added into the training data, while the testing data are kept the same as in the baseline experiment. In this way, a comparison is made between the accuracies of all the classifiers on the original dataset and their accuracies on the new dataset (combined with the WGAN-GP generated samples). The accuracy mentioned here means the percentage of correctly predicted samples among all samples of the testing set, and it can be formulated as follows:

\text{Accuracy} = \frac{\bar{N}}{N} \times 100\% \qquad (4)

where \bar{N} is the quantity of testing data correctly predicted and N is the quantity of all testing data.
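A sketch of the mixing step in Fig. 1 follows: the samples generated for a target (minority) category are appended to the training split only, leaving the test split untouched. The function name and label handling are assumptions.

```python
import numpy as np

def augment(X_tr, y_tr, X_gen, target_label):
    # append WGAN-GP samples generated for one minority class to the
    # training data; the test set is never mixed with generated data
    X_mix = np.vstack([X_tr, X_gen])
    y_mix = np.concatenate([y_tr, np.full(len(X_gen), target_label)])
    return X_mix, y_mix
```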
4.3. Case 1: the toy dataset

First, a wine quality identification dataset obtained from the UCI repository is used, of which 3 categories are selected to serve as the toy dataset. This dataset is supplied by the University of Minho, and the details, including the description and sample number of each category, are presented in Table 2. In this set, each sample has 11 physicochemical attributes describing its quality, such as fixed acidity, volatile acidity, citric acid, residual sugar, and so on.

The accuracies of the classifiers on the original dataset are the baselines; for the toy data, the results are recorded in Table 3. Note that in the table each accuracy value is the average of 100 corresponding tests.

Table 3
The performance of classifiers on the original data and the mixed data for the toy dataset.

Classifier    Accuracy (%), seed = 0      Accuracy (%), seed = 1      Accuracy (%), seed = 2
              Original    Mixed           Original    Mixed           Original    Mixed
SGD-LR        48.30       49.27           47.52       48.30           48.06       48.74
RF            68.53       98.79           68.43       98.81           68.34       98.86
BT            63.14       65.59           63.14       65.59           63.14       65.60

Since the imbalanced distribution of the dataset affects classification, the WGAN-GP based model is trained to generate similar data samples to add to the target categories and balance the whole dataset. With category 1 of the toy dataset as the target category for which new data samples are to be generated, the curve in Fig. 2 shows how the exact value of the discriminator loss in WGAN-GP changes, and it can also be seen that the value converges after the 230th iteration. Therefore, the generated data samples to be used are collected after the 230th iteration. Finally, for the toy dataset, 1200 samples in the 3rd category and 500 samples in the 1st category are generated, making all the classes of the toy data own similar numbers of data samples; they are then put into the original dataset to train the classifiers. The results of the classifiers in this test are listed in Table 3, alongside the baselines. As in the baseline experiment, each accuracy value in the table is the average of 100 corresponding tests.

Fig. 2. The loss curve of the discriminator of WGAN for the toy dataset.

By comparing the results in Table 3, Table 4 is obtained, which shows the accuracies improved in the comparative experiments after adding the generated data samples. It is clear that the accuracies of all the classifiers are raised, and the result of the RF-based classifier improves greatly, which also shows that this classifier may have better classification ability. Note that in both the baseline experiment and the comparative experiment, 3 random seeds are set; this shows that the small improvements in the accuracies of the SGD-LR and BT based classifiers also come from the added data samples rather than from random factors. It is clear that the changes caused by random factors are very small, the biggest one being less than 0.50%, while the accuracies improved by the WGAN-GP based data augmentation model are all more than 2.45%. Therefore, the ability of the WGAN-GP based data augmentation model to generate reliable data samples for the low-data region to help classification tasks is validated.

Table 4
The improved accuracy in the comparative test.

Classifier    Accuracy (%),    Accuracy (%),    Accuracy (%),
              seed = 0         seed = 1         seed = 2
SGD-LR        0.70             0.78             0.72
RF            30.26            30.38            30.49
BT            2.45             2.45             2.46

4.4. Case 2: the FD dataset

In this case, a steel plates faults dataset is studied, which is obtained from the UCI repository and supplied by the Research Center of Sciences of Communication in Italy. In this dataset, each record has 27 indicator variables describing the fault type of a stainless steel plate, including pixel areas, sum of luminosity, steel plate thickness, etc. [27], and in this paper 3 different typologies of faults are used. The details of the dataset are listed in Table 5.

Table 5
Descriptions of the FD dataset.

Category    No. of each category    Category description
1           391                     K Scratch
2           402                     Bumps
3           673                     Other Faults

In the FD dataset, when generating new data samples for category 1, the changing loss curves of the generator and the discriminator in WGAN-GP are shown in Figs. 3 and 4, respectively. Fig. 3 shows that the loss value of the generator reaches 0 much more easily. In addition, in Fig. 4 the value converges to 0 after the 210th iteration, which means the discriminator cannot tell the fake data from the real data; thus, the generated data are collected after the 210th iteration as the generated data samples to be used. Finally, 280 samples in the 1st category are generated, and the same number of new samples is generated for the 2nd category, making each class in the FD dataset own a similar number of data samples. They are then put into the original dataset, and the combined set serves as the training set for the classifiers.

Fig. 3. The loss curve of the generator of WGAN for the FD dataset.

Fig. 4. The loss curve of the discriminator of WGAN for the FD dataset.

The accuracies of the classifiers on both the original dataset and the new combined data for the FD data are shown in Table 6. As in Case 1, each accuracy value in the table is the average of 100 corresponding tests. It can be seen that in this case the 3 different faulty conditions are relatively easy to recognise, and the accuracies are relatively high compared with Case 1.

Table 6
The performance of classifiers on the original data and the mixed data.

Classifier    Accuracy (%), seed = 0      Accuracy (%), seed = 1      Accuracy (%), seed = 2
              Original    Mixed           Original    Mixed           Original    Mixed
SGD-LR        72.49       72.78           72.30       73.04           72.37       71.86
RF            80.61       99.32           80.97       99.26           80.79       99.27
BT            83.66       90.87           83.64       90.87           83.61       90.86

The differences between the accuracies of the classifiers on the original dataset and on the new combined data are listed in Table 7, from which we can see that for the RF and BT classifiers the improvements are large after using data augmentation, while for the SGD-LR classifier the improvements are small. Moreover, in Table 7 it can be found that for the SGD-LR classifier with random seed 2, the improvement is −0.51, which indicates that the fault diagnosis accuracy is not improved. The main reason is that the SGD-LR classifier is built for binary classification rather than multi-class classification, and thus it may not have enough fitting and classification ability for multi-classification. Consequently, in Case 2, even with data augmentation, its classification accuracies are mostly improved only a little, or occasionally not at all. Nevertheless, the large improvements for the RF and BT classifiers, as well as the small improvements for the SGD-LR classifier, justify the effectiveness of the WGAN-GP based data augmentation model. In summary, the added newly generated samples generally improve the performance and increase the accuracies. Therefore, the ability of the WGAN-GP based data augmentation model to generate reliable data samples to help classification tasks is validated.

Table 7
The improved accuracy in the comparative test.

Classifier    Accuracy (%),    Accuracy (%),    Accuracy (%),
              seed = 0         seed = 1         seed = 2
SGD-LR        0.29             0.74             −0.51
RF            18.71            18.29            18.48
BT            7.21             7.23             7.28

4.5. Case 3: Tennessee Eastman process

The Tennessee Eastman (TE) process is a simulation benchmark of a real industrial chemical process and is widely used as a universal platform for process monitoring and fault diagnosis. It is composed of 5 major parts, i.e., the reactor, the product condenser, a vapor-liquid separator, a recycle compressor, and a product stripper, and 52 variables are recorded. Besides, 21 operating conditions are set in the simulated process, and for each fault there are 480 samples in the training dataset. In this paper, to decrease calculating time and computing load, 10 operating conditions of the TE simulation process are used to testify the improvement brought by WGAN based data augmentation for fault diagnosis; their detailed information is shown in Table 8.

Table 8
Descriptions of 10 operating conditions in the TE process.

Fault number    Fault description
IDV0            Normal
IDV1            A/C feed ratio, B composition constant
IDV2            B composition, A/C ratio constant
IDV4            Reactor cooling water inlet temperature
IDV5            Condenser cooling water inlet temperature
IDV6            A feed loss (stream 1)
IDV7            C header pressure loss, reduced availability (stream 4)
IDV8            A, B, C feed composition (stream 4)
IDV10           C feed temperature (stream 4)

In this case, for each category, the training data are used to train the WGAN-GP model. When the loss value of the discriminator converges to 0 after several iterations, so that the discriminator cannot tell the fake data from the real data, the newly generated data samples are collected. Finally, 480 samples are generated for each category, and they are then put into the original dataset, the combined set serving as the training set for the classifiers. Fig. 5 shows the distribution of the first 3 principal components of the original data and the mixed data for faults IDV0, IDV1, and IDV2 after dimensionality reduction using the principal component analysis (PCA) algorithm. The performances of the classifiers on both the original dataset and the new combined data are shown in Table 9, where each accuracy value is again the average of 100 corresponding tests. It should be noted that, from Case 2 in the last subsection, the SGD-LR classifier is known to be unsuitable for multi-class classification, lacking fitting and classification ability for multi-classification. Since there are 10 categories in this case, the SGD-LR classifier is substituted by the Gradient Boosting classifier (GBC), which has excellent classification ability and fits multi-class classification. From Table 9 we can see that for all the classifiers, i.e., the RF, BT, and GBC classifiers, the fault diagnosis accuracies are improved after using data augmentation, and this justifies the effectiveness of the WGAN-GP based data augmentation model in generating reliable data samples to help classification tasks.

Fig. 5. The distribution of the first 3 principal components of the original data and the mixed data for faults IDV0, IDV1, and IDV2, after dimensionality reduction using PCA.

Table 9
The performance of classifiers on the original data and the mixed data for the TE process.

Classifier    Accuracy (%), seed = 0      Accuracy (%), seed = 1      Accuracy (%), seed = 2
              Original    Mixed           Original    Mixed           Original    Mixed
RF            94.45       100             94.47       100             94.47       100
BT            96.04       99.90           95.99       99.91           96.05       99.93
GBC           98.88       99.54           98.77       99.91           98.88       99.54
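A sketch of the Fig. 5 visualization follows: both datasets are projected onto the first three principal components with scikit-learn's PCA (matplotlib is listed in Table 1; fitting the PCA basis on the original data and the side-by-side 3-D panels are assumptions, since the paper does not detail the plotting procedure).

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, enables 3-D projection
from sklearn.decomposition import PCA

def plot_first3_pcs(X_orig, X_mixed):
    pca = PCA(n_components=3).fit(X_orig)   # basis from the original data
    fig = plt.figure(figsize=(8, 4))
    for i, (X, title) in enumerate([(X_orig, 'origin data'),
                                    (X_mixed, 'mixed data')]):
        pcs = pca.transform(X)
        ax = fig.add_subplot(1, 2, i + 1, projection='3d')
        ax.scatter(pcs[:, 0], pcs[:, 1], pcs[:, 2], s=4)
        ax.set_title(title)
    plt.show()
```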
5. Conclusion

Since the data obtained in the fault diagnosis field are commonly limited or imbalanced in distribution, the identification and detection of faults can be difficult. Therefore, in this paper, a WGAN-GP based data augmentation approach is used to generate data samples that supplement the low-data input set and improve the performance of fault diagnosis, with the data manifold captured in a real sense through a more stable training process and high-quality generated samples. To evaluate the scheme, three benchmark datasets are used in the testing experiments. The results demonstrate that the fault diagnosis accuracies increase, showing that the data generated by the proposed WGAN-GP based data augmentation approach can help the generalization capacity of the classifiers on the testing dataset and help the classification task.

Although the added newly generated samples generally improve the performance and increase the accuracies, more computing load and calculating time are required. Besides, while the WGAN-GP based data augmentation model is validated in its ability to generate reliable data samples to help classification tasks, when the classifier used does not have enough fitting and classification ability, even with data augmentation the classification accuracies are mostly improved only a little, or occasionally not at all. On the other hand, owing to their powerful fitting ability, DNN based classifiers are expected to identify faults in industrial processes with promising performance; however, the many layers in a DNN mean that many parameters in the classifier have to be determined. Therefore, in the future, the authors will be devoted to improving the GAN based data augmentation scheme to suit DNN based classifiers with larger-size data, so as to improve the classification accuracy.

Conflict of interest

None.

Acknowledgment

This research was supported by the Beijing NOVA Program xx2016B027. The first author would like to express special acknowledgement to the International Graduate Exchange Program of Beijing Institute of Technology for partly funding this paper.

References

[1] F. Deng, S. Guo, R. Zhou, J. Chen, Sensor multifault diagnosis with improved support vector machines, IEEE Trans. Autom. Sci. Eng. PP (99) (2015) 1–11.
[2] X. Gu, F. Deng, X. Gao, R. Zhou, An improved sensor fault diagnosis scheme based on TA-LSSVM and ECOC-SVM, J. Syst. Sci. Complex. (9) (2017) 1–13.
[3] S.X. Ding, S. Yin, K. Peng, H. Hao, B. Shen, A novel scheme for key performance indicator prediction and diagnosis with application to an industrial hot strip mill, IEEE Trans. Ind. Inf. 9 (4) (2013) 2239–2247.
[4] Y. Zhang, Enhanced statistical analysis of nonlinear processes using KPCA, KICA and SVM, Chem. Eng. Sci. 64 (5) (2009) 801–811.
[5] L. Duan, M. Xie, T. Bai, J. Wang, A new support vector data description method for machinery fault diagnosis with unbalanced datasets, Exp. Syst. Appl. 64 (2016) 239–246.
[6] P.Y. Simard, D. Steinkraus, J.C. Platt, Best practices for convolutional neural networks applied to visual document analysis, in: Proceedings of the International Conference on Document Analysis and Recognition, 2003, p. 958.
[7] D.C. Ciresan, U. Meier, L.M. Gambardella, J. Schmidhuber, Deep, big, simple neural nets for handwritten digit recognition, Neural Comput. 22 (12) (2010) 3207–3220.
[8] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (1) (2002) 321–357.
[9] X. Zhu, Y. Liu, Z. Qin, J. Li, Data augmentation in emotion classification using generative adversarial networks, 2017.
[10] S.C. Wong, A. Gatt, V. Stamatescu, M.D. McDonnell, Understanding data augmentation for classification: when to warp?, 2016.
[11] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of the International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.
[12] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, Comput. Sci. (2014).
[13] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[14] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436.
[15] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507, doi:10.1126/science.1127647.
[16] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, Adv. Neural Inf. Process. Syst. 3 (2014) 2672–2680.
[17] G.E. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (7) (2006) 1527–1554.
[18] R. Salakhutdinov, G. Hinton, Deep Boltzmann machines, J. Mach. Learn. Res. 5 (2) (2009) 1967–2006.
[19] J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, D. Jurafsky, Adversarial learning for neural dialogue generation, 2017. arXiv:1701.06547.
[20] Z. Zheng, L. Zheng, Y. Yang, Unlabeled samples generated by GAN improve the person re-identification baseline in vitro, 2017.
[21] K.F. Wang, C. Gou, Y.J. Duan, Y.L. Lin, X.H. Zheng, F.Y. Wang, Generative adversarial networks: the state of the art and beyond, Acta Autom. Sin. 43 (3) (2017).
[22] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, Comput. Sci. (2015).
[23] M. Mirza, S. Osindero, Conditional generative adversarial nets, Comput. Sci. (2014) 2672–2680.
[24] A. Odena, C. Olah, J. Shlens, Conditional image synthesis with auxiliary classifier GANs, 2016. arXiv:1610.09585.
[25] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN, 2017.
[26] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. Courville, Improved training of Wasserstein GANs, 2017.
[27] A. Asuncion, D.J. Newman, UCI machine learning repository, 2007.
Xin Gao received the B.E. degree in automation from North China Electric Power University, Baoding, China, in 2013, and the M.E. degree in control theory and control engineering from Bohai University, Jinzhou, China, in 2016. She is currently pursuing the Ph.D. degree in the School of Automation, Beijing Institute of Technology, Beijing, China. Her research interests include data-driven fault detection, diagnosis, and prediction.

Fang Deng received his B.E. degree and Ph.D. degree in control science and engineering from Beijing Institute of Technology, Beijing, China, in 2004 and 2009, respectively. He is currently an associate professor with the School of Automation, Beijing Institute of Technology. His current research interests include nonlinear estimation, fault diagnosis, control of renewable energy resources, and wireless sensor networks.

Xianghu Yue received his B.E. degree in control science and engineering from Beijing Institute of Technology, Beijing, China, in 2016. He is currently a Ph.D. candidate in the School of Mathematics and Statistics, Beijing Institute of Technology. His current research interests include intelligent information processing, automatic speech recognition, and sound source localization.
