
Progress in Nuclear Energy 152 (2022) 104401


Development of deep reinforcement learning-based fault diagnosis method for rotating machinery in nuclear power plants
Gensheng Qian *, Jingquan Liu
Department of Engineering Physics, Tsinghua University, Beijing, 100084, China

ARTICLE INFO

Keywords:
Deep reinforcement learning
Small sample
Rotating machinery
Fault diagnosis
Nuclear power plant

ABSTRACT

Rotating machinery faults can cause accidents such as loss of flow or turbine trip that seriously threaten the operational safety of nuclear power plants (NPPs). Artificial intelligence algorithms, such as machine learning or deep learning methods, can implement fault diagnosis by sample learning with no reliance on the fault mechanism or a physics model of the equipment. However, the accumulated fault samples are few due to the high operational safety requirements of the plant. Small sample learning is challenging and usually leads to degradation of model performance. The emerging deep reinforcement learning (DRL) algorithm can combine the advantages of automatic feature extraction from deep learning and interactive learning from reinforcement learning, and is expected to have better learning ability and robustness. In this paper, two DRL fault diagnosis models are proposed and compared. Experiment results show that the proposed models can achieve very high diagnosis accuracy of over 99% and outperform all the baseline models (support vector machine, convolutional neural network and gated recurrent unit neural network) in all test cases in this paper.

1. Introduction

Rotating machinery, such as the primary coolant pump, condensate pump, feedwater pump and steam turbine, is key equipment for ensuring the stable and safe operation of the entire thermal-hydraulic circulation system in nuclear power plants (NPPs). Due to long-term operation in the harsh environment of high temperature, high pressure, high load and varying working conditions, the components of rotating machinery inevitably incur damage such as corrosion, wear and cracks, and the risk of fault increases gradually. Some typical rotating machinery faults can cause reactor system accidents, e.g., loss of flow accident (LOFA), loss of feedwater accident (LOFW) and turbine trip, which would result in significant economic loss and could even evolve into nuclear disasters if not handled properly. Effective condition monitoring and fault diagnosis techniques can help improve the safety, reliability and economy of NPPs (Ma and Jiang 2011).

Vibration measurement is an effective method for condition monitoring in NPPs (Ayo-Imoru and Cilliers 2018; Koo and Kim 2000; Lebold et al., 2004). Traditional vibration analysis methods include analysis of the frequency spectrum, orbit plot, Bode plot, 3D waterfall spectra plot and amplitude-phase plot (Sinha 2008), which are labor-intensive and rely heavily on the analysts' experience. With the rapid development of data mining and artificial intelligence techniques (Jordan and Mitchell 2015), data-driven machine learning algorithms can provide a feasible solution for intelligent fault diagnosis.

Generally, machine learning algorithms focus on how to learn and obtain useful insights to make predictions and decisions from datasets (Jordan and Mitchell 2015). Machine learning-based fault diagnosis methods do not rely on the fault mechanism or physics model, which is hard to build or unclear for complicated industrial systems or equipment. To build a machine learning fault diagnosis model, signal processing or statistical methods are first required to extract data features; then an appropriate classifier model, e.g., random forest or support vector machine (SVM), is built to fit the mapping relationship between the data features and their fault labels. A well-trained model can automatically predict the condition type of the monitored equipment based on newly collected sensor data, thus reducing the workload of analysts. Many studies use advanced signal processing and machine learning methods for rotating machinery fault diagnosis in NPPs, e.g., the artificial neural network (ANN) method for reactor coolant pump fault diagnosis (Koo and Kim 2000), K-nearest neighbors (KNN) for turbo-generator rotor fault diagnosis (Biet 2013), the Hilbert-Huang Transform (HHT) for flywheel fault feature extraction (Liu et al., 2015), the Teager energy operator (TEO) for bearing fault frequency identification (Zhu et al.,

* Corresponding author.
E-mail address: qgs19@mails.tsinghua.edu.cn (G. Qian).

https://doi.org/10.1016/j.pnucene.2022.104401
Received 8 June 2022; Received in revised form 23 August 2022; Accepted 24 August 2022
Available online 5 September 2022
0149-1970/© 2022 Elsevier Ltd. All rights reserved.

2021), SVM (Feng et al., 2018), and random forest and Adaboost (Zhong and Ban 2022a) for bearing and gear fault diagnosis.

In the above-mentioned studies, signal analysis methods, such as the Wigner Distribution (WD), Variational Mode Decomposition (VMD), Empirical Mode Decomposition (EMD) and wavelet packet transform (WPT), are first used for feature extraction. In the diagnosis framework of the classic machine learning method, the feature extraction and fault diagnosis procedures are executed separately. Generally, feature extraction inevitably loses some information in the original data that may be helpful for fault diagnosis. Therefore, classic machine learning model performance is heavily limited by manual feature extraction experience. The latest research topic, deep learning algorithms such as the convolutional neural network (CNN), can directly process raw data and automatically extract abstract features with a deep network structure (LeCun et al. 2015). Deep learning can integrate feature extraction and fault diagnosis in one end-to-end framework, reducing the limits of and the reliance on manual feature extraction. For example, (Duan et al., 2020) proposed a fault diagnosis method for air compressors in NPPs based on a vibration observation window (VOW) and a CNN model. (Zou et al., 2021) presented a one-dimensional CNN (1D-CNN) fault diagnosis method for bearings. (Chen et al., 2020) developed two 1D-CNN models for rotor and bearing fault diagnosis, which can reliably identify 48 simulated machine health conditions and even previously unlearned faults. (Zhang et al., 2021) proposed a Gated Recurrent Unit neural network (GRU)-based rotating machinery fault diagnosis model with residual connection and learning rate decay strategies.

The machine learning and deep learning-based fault diagnosis methods mentioned above are all based on labeled sample learning (i.e., supervised learning). Usually, the upper limit of the diagnostic ability of the fitted function (mapping relationship) is determined by the sample size and quality. When the sample size is reduced or the samples are disturbed by noise, underfitting or overfitting occurs and model performance degrades. In industrial practice, especially in the nuclear industry, which requires high safety and reliability of equipment, the plant operates in a normal condition most of the time, and the accumulated fault samples are few. Small sample learning is challenging. Development of fault diagnosis models under small samples is a hot research topic at present (Pan et al., 2021).

Data augmentation (Li et al., 2022; Qian and Liu 2022) and transfer learning (Zhong, Fu, and Lin 2019; Zhong and Ban 2022b) are two viable solutions that deal with small sample learning. Data augmentation methods expand the sample size by over-sampling or by generative models that yield synthetic data similar to the existing samples. The transfer learning approach first pre-trains the model using datasets of similar equipment or another field's dataset (e.g., ImageNet, a general image dataset) to give the model a certain diagnostic capability, and then fine-tunes it on the target small sample dataset to implement the target diagnosis task. Overall, data augmentation and transfer learning methods can improve the diagnostic capability by optimizing the fitted mapping function via synthetic data or data from other equipment or even other fields.

Reinforcement learning (RL) is another branch of artificial intelligence research. An RL model learns by interacting with the environment; more specifically, the RL model continuously adapts its behavior in order to maximize the feedback reward signal from the environment during training (Sutton and Barto 2018). Compared to supervised machine learning algorithms, RL learns an optimal policy (i.e., a response mechanism that yields the most reward) rather than fitting the mapping function between samples and labels. Deep reinforcement learning (DRL), which combines the advantages of the automatic feature extraction capability of deep learning and the interactive learning capability of RL, is a revolutionary advance in the artificial intelligence field and shows promise for solving complex real-world problems (Arulkumaran et al., 2017). For instance, a DRL model can play video games and score at human player level (Mnih et al., 2013). The DRL-based AlphaGo computer program defeated a world champion in the game of Go (Silver et al., 2017). In the nuclear industry, DRL-based frameworks have been developed for the safety function status check task (Park et al. 2020), control automation in the heat-up mode of an NPP (Park et al., 2022), and large-scale design optimization of boiling water reactor bundles (Radaideh et al. 2021), etc.

Recently, a growing number of studies have applied DRL to fault diagnosis problems. For example, (Ding et al., 2019) proposed a DRL method based on a sparse auto-encoder (SAE) for bearing and pump fault diagnosis, which has comparable performance with the deep learning model SAE-softmax. (Wang et al., 2022) developed a planetary gearbox fault diagnosis method based on time-frequency representation (TFR) and a CNN-based DRL model, which shows good performance under multi-work conditions. (Li et al., 2021) developed a DRL model based on a capsule neural network (Cap-net) and an online feature dictionary method, which can adapt to fault diagnosis tasks under variable working conditions. (Zisheng Wang and Xuan 2021) proposed a 1D-CNN based DRL method to implement compound fault diagnosis of bearings and tools under heavy background noise. (Fan et al. 2022) presented a DRL-based fault diagnosis method called "DiagSelect" for the sample imbalance scenario, which selects suitable samples from the initial training set to reduce the imbalance and thus helps improve model performance.

As shown by the above studies, DRL models can achieve excellent fault diagnosis performance in scenarios such as strong noise, varying working conditions and sample imbalance. However, fault diagnosis studies of DRL models under the small sample scenario are relatively lacking. Many existing publications focus on developing CNN-based DRL fault diagnosis models. Since CNN and GRU are 2 popular deep learning models, it is important to evaluate the performance difference between CNN-based and GRU-based DRL fault diagnosis models. In addition, according to our literature survey, there are no relevant publications (by July 2022) discussing the application of DRL-based fault diagnosis methods for NPPs.

To fill this research gap, in this paper we propose two DRL fault diagnosis models. The model training process, diagnostic effectiveness, and performance comparison under the small sample scenario are carefully analyzed. The main contributions of this paper are summarized as follows:

1) This paper is likely the first study of a DRL-based method for NPP fault diagnosis.
2) Two DRL models are presented for fault diagnosis of rotating machinery, based on the CNN and GRU models, respectively, and comparison experiments are designed to evaluate their performance against baseline models.
3) The fault diagnosis effectiveness is carefully studied and elaborated by analysis of the training accuracy curve, cumulative reward, single-sample identification process and hidden layer visualization.
4) The proposed DRL models can achieve excellent fault diagnosis performance with over 99% accuracy, and are more robust in small sample scenarios than the baseline models, making them more suitable for fault diagnosis applications in NPPs.

The rest of this paper is organized as follows: Section 2 introduces the background knowledge of CNN, GRU, DRL and the proposed fault diagnosis method. Section 3 describes the two fault experimental datasets of rotating machinery and the data processing, model parameter setting and evaluation method in detail. Section 4 analyzes and discusses the model training process and evaluation results. Section 5 concludes the research of this paper.

2. Methodology

2.1. CNN

CNN is designed to process data in the form of multiple arrays and has been successfully applied in image, video, speech and audio


processing (LeCun et al. 2015). CNN has four key innovations, namely local connections, shared weights, spatial pooling and deep layer structure, and its key technical elements include convolution, pooling, nonlinear activation, fully-connected (FC) layer processing and softmax regression, as shown in Fig. 1. The original vibration sensor data is 1D time-series data, so this paper uses a 1D-CNN model, whose main functional features are similar to those of a 2D-CNN (usually used for 2D image-format data processing).

The convolution layer contains multiple convolution kernels (also called filters) that scan and convolve the input data to obtain feature maps. The pooling layer performs a subsampling operation to reduce the spatial resolution of the feature map, allowing the network to extract location-invariant features. The maximum pooling method is used in this paper, as shown in Fig. 1(b), which takes the maximum value in each pooling window as the output. After several convolutional and pooling layers, the abstract features of the input data are automatically extracted. Then, the feature maps are flattened into a vector, and the subsequent processing is performed using FC layers. An appropriate activation function is used after the convolutional layer and FC layer to improve the nonlinear fitting ability of the network. Commonly used activation functions are the sigmoid function, hyperbolic tangent function (tanh), and rectified linear unit (ReLU) function, etc. (Rasamoelina et al. 2020), which are shown in Fig. 1(c–e).

For solving fault diagnosis problems, the output layer uses softmax regression (see Fig. 1(g)) to convert the output results into a probability distribution, and the final diagnosis result is determined by the neuron (or node) with the maximum probability. The cross-entropy loss function (see Fig. 1(h)) can quantify the difference between the predicted distribution and the ground-truth label distribution, and is used as the optimization objective of the network model. All network models in this paper are optimized by the popular Adam algorithm (Kingma et al., 2015).

By combining these basic elements, a variety of CNN models can be designed. An effective CNN-based fault diagnosis model design is given in this paper, as shown in Fig. 2, and the model performance analysis is introduced in Section 4.
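For illustration, a minimal TensorFlow/Keras sketch of the 1D-CNN classifier of Fig. 2 with the dataset A settings of Table 2 (input 512 × 1, Conv-1 4@3 × 1, Conv-2 8@3 × 1, max pooling 2 × 1 with stride 2, FC [1024, 32], 10-class softmax) is given below. The padding, activation choices and function names are assumptions of this sketch, not settings reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_1dcnn(input_len=512, n_classes=10):
    """Hedged sketch of the 1D-CNN diagnosis model (Fig. 2, Table 2, dataset A)."""
    model = models.Sequential([
        layers.Input(shape=(input_len, 1)),                        # [L, C] = [512, 1]
        layers.Conv1D(4, 3, padding="same", activation="relu"),    # Conv-1: 4@3x1
        layers.MaxPooling1D(pool_size=2, strides=2),                # Pool-1
        layers.Conv1D(8, 3, padding="same", activation="relu"),    # Conv-2: 8@3x1
        layers.MaxPooling1D(pool_size=2, strides=2),                # Pool-2
        layers.Flatten(),                                           # 128 steps * 8 maps = 1024
        layers.Dense(32, activation="relu"),                        # FC layer [1024, 32]
        layers.Dense(n_classes, activation="softmax"),              # softmax diagnosis output
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With "same" padding, the flattened feature length is 128 × 8 = 1024, which matches the FC layer size [1024, 32] listed in Table 2 for dataset A.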
2.2. GRU

The GRU network is a modified and improved variant of the early vanilla recurrent neural network (RNN) that excels in processing and extracting features from sequential data (Chung et al., 2014). It has been successfully applied in machine translation, speech recognition, etc. (Ravanelli et al., 2018). The key technical elements of the GRU model include recurrent information transfer and two gating mechanisms (the update gate and the reset gate), as shown in Fig. 3. The GRU model treats the input data as a sequence stream (or flow), and the network output depends on the current input data and the last hidden state (the learned features from the whole history of inputs), as shown in Fig. 3(a). The reset and update gates are responsible for manipulating the information flow in the recurrent units (cells), ensuring that important features and long-time dependencies can be accumulated in the hidden state, and that unimportant or unnecessary information can be reset and removed in time.

An effective GRU-based fault diagnosis model design is given in this paper, as shown in Fig. 4. The hidden states of all time steps of the GRU layer are taken and spliced into a feature vector, which contains the feature information of the overall input sequence. FC layers are added at the back-end of the network for subsequent processing and diagnosis output. The model performance analysis is introduced in Section 4.
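A companion sketch of the GRU-based design in Fig. 4 with the dataset A settings of Table 2 (512-point spectrum viewed as 16 time steps of 32 points, 32 hidden units, all hidden states spliced, FC [512, 32]) is given below; the exact layer arrangement is an assumption of this sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_gru(time_steps=16, step_len=32, hidden=32, n_classes=10):
    """Hedged sketch of the GRU-based diagnosis model (Fig. 4, Table 2, dataset A)."""
    model = models.Sequential([
        layers.Input(shape=(time_steps, step_len)),      # [T, S] = [16, 32]
        layers.GRU(hidden, return_sequences=True),       # keep the hidden state of every step
        layers.Flatten(),                                 # splice all hidden states: 16 * 32 = 512
        layers.Dense(32, activation="relu"),              # FC layer [512, 32]
        layers.Dense(n_classes, activation="softmax"),    # diagnosis output
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Here `return_sequences=True` followed by flattening realizes the "splice all hidden states" design of Fig. 4; the 16 × 32 = 512 spliced features match the FC layer size [512, 32] in Table 2.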
2.3. DRL

DRL is a composite framework of deep learning and reinforcement learning (RL) that uses deep neural networks (DNNs) as the intelligent agent to perceive the environment and make appropriate decisions, with feedback rewards from the environment used to train the agent in an RL way. AlphaGo is the most famous DRL model, which defeated the human world champion in the Go game (Silver et al., 2017). Since playing Go is a much more difficult task than image classification, the fact that DRL can master it marks a revolutionary advance in artificial intelligence. Therefore, DRL is considered to have stronger intelligence than deep learning and has the potential to solve complex real-world problems, e.g., electric power system control (Glavic 2019) and cyber security (Nguyen and Reddi 2021). This section provides a brief introduction to the basic principle of DRL.

As shown in Fig. 5, the RL framework consists of a learning agent, the environment and their interactions in terms of states, actions and rewards. The agent is an intelligent body, which can make decisions and take actions based on a certain policy. The environment is the interaction object of the agent. A state s represents one state of the environment, which can be observed and sensed by the agent. An action a represents the agent's behavior, which has some influence on the environment. A reward r

Fig. 1. Key conceptions in CNN model.


Fig. 2. Schematic diagram of a 1D-CNN based fault diagnosis model.

Fig. 3. Key conceptions in GRU model.

represents the feedback signal from the environment, which has some meaning for the agent, like a reward or a penalty. The agent aims to learn an optimal policy π(a|s) from the interaction experience. The optimal policy means how to choose the most appropriate action a based on the currently observed state s to obtain the optimal cumulative reward r from the environment.

Generally, the function Q(s, a) is used to represent the expectation of reward obtained by choosing the action a at state s with policy π:

Q(s, a) = E[r_t | s_t = s, a_t = a; π]   (1)

and its value is updated iteratively using the following equation (Sutton and Barto 2018):

Q′(s, a) = Q(s, a) + α[r(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a)]   (2)

where α is the learning rate and γ is the discount factor, which takes a value from 0 to 1 and indicates the discount applied to delayed rewards of the current action. When γ is 0, only the reward at the current step is considered.

A DNN can be used as a function approximator to estimate the Q function as Q(s, a) ≈ Q(s, a; θ), where θ is the set of weight parameters of the DNN. Such a network is also called a DQN and was first proposed in 2013 by Mnih et al. (2013). The DQN can take full advantage of the DNN's powerful ability to sense the environment state from raw sensor data. For example, the DQN can take appropriate actions based on the screenshot images of Atari video games and achieve a level comparable to professional human game testers on 49 different games (Mnih et al., 2015).

Nowadays, 3 important techniques are used in the training of a DQN: the ε-greedy exploration mechanism, the experience replay (ER) method and network separation (Mnih et al., 2015). The ε-greedy mechanism sets a small probability threshold ε, which is a trade-off between exploration and exploitation and allows the DQN to either take a random action or make a decision greedily. The ER method, which randomly takes small batches of experience from the experience pool to train and update the model parameters, makes full use of the experience pool and reduces the correlation of consecutive input experiences. Network separation means using a copy of the DQN to predict the target Q value, which delays the update of the Q value and improves model stability. The detailed training procedure is described in Appendix Algorithm A1.
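Since Algorithm A1 itself is not reproduced here, the following is a minimal sketch of one ε-greedy action choice and one experience-replay update with a separated target network, in the style of Eqs. (5)–(6) of Section 2.4. The function and variable names, and the use of a plain Python list as the experience pool, are assumptions of this sketch.

```python
import random
import numpy as np
import tensorflow as tf

def choose_action(policy_net, state, epsilon, n_actions):
    """epsilon-greedy: random action with probability epsilon, otherwise greedy w.r.t. Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    q_values = policy_net(state[np.newaxis, ...], training=False)
    return int(np.argmax(q_values[0]))

def dqn_update(policy_net, target_net, optimizer, replay_pool, batch_size=32, gamma=0.0):
    """One experience-replay update of the Policy-Net; gamma=0 as used in this paper."""
    batch = random.sample(replay_pool, batch_size)             # ER: random mini-batch
    s, a, r, s_next = map(np.array, zip(*batch))
    q_next = target_net(s_next, training=False).numpy()        # separated Target-Net
    y = (r + gamma * q_next.max(axis=1)).astype(np.float32)    # target value y_t
    with tf.GradientTape() as tape:
        q = policy_net(s, training=True)
        q_taken = tf.reduce_sum(q * tf.one_hot(a, q.shape[-1]), axis=1)
        loss = tf.reduce_mean(tf.square(y - q_taken))           # MSE loss on the Q estimate
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    return float(loss)

# Periodically (e.g. every 10 training steps, as in Section 3.3):
# target_net.set_weights(policy_net.get_weights())   # network separation / synchronization
```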
2.4. The proposed method

Fault diagnosis is a classification decision problem and can be simulated as a guessing game (Ding et al., 2019; Wang et al., 2022). The agent is the game player, which aims to guess the fault type correctly.


Fig. 4. Schematic diagram of a GRU-based fault diagnosis model.

Fig. 5. Basic conceptions of reinforcement learning.

Fig. 6. The proposed DRL-based fault diagnosis model, where Policy-Net and Target-Net are implemented by the 1D-CNN or GRU model in this paper.

The environment consists of a large number of samples of sensor data, with each state corresponding to one sample. For a K-class fault diagnosis problem, the guessing action space is defined as {0, 1, …, K−1}, where 0 represents normal and the other numbers, such as k, represent the k-th type of fault. The agent gets a reward signal when it guesses the right answer, and a penalty signal otherwise. We expect that after many rounds of the guessing game, the agent can learn an optimal policy to guess the correct condition type from the sensor data of the monitored equipment.

The proposed fault diagnosis method is shown in Fig. 6. The environment emulator is constructed from the dataset D = {x_i, y_i}, i = 1, …, N (x is the sensor data, y is the fault type, and N is the number of samples). The agent (i.e., the DQN) is constructed using a CNN or GRU network, which extracts and senses the sample features and tries to make a correct guess. The agent is further divided into a Policy-Net and a Target-Net, which have the same network structure and synchronize their parameters periodically. To simplify the description, the Policy-Net and Target-Net are denoted by f_θ and f_θ′, where θ and θ′ are the sets of weight parameters of the corresponding networks. The Policy-Net outputs the best estimated


action a and Q value based on the observed state s (see Eq. (3)). The Target-Net outputs the estimated Q′ value corresponding to the next state s′ (see Eq. (4)).

(a, Q) = f_θ(s)   (3)

Q′ = f_θ′(s′)   (4)

The details of one interaction process are described as follows. For a state s (i.e., a vibration data sample), s is first processed using the Fast Fourier Transform (FFT) (Cooley and Tukey 1965) to obtain the frequency spectrum, and then the action is taken according to the ε-greedy mechanism: with probability ε, an action a is taken randomly, and with probability 1−ε, the spectrum is input to the Policy-Net to obtain the predicted best estimated action a. The environment gives a reward signal r according to the action a and switches to a new state s′ randomly. We define the reward signal as 1 when the guess is correct; otherwise, the reward is −1. At the end of each game, the experience tuple (s, a, r, s′) is stored in the experience pool. The ER method is used to update the parameters of the Policy-Net. Specifically, the Mean Square Error (MSE) loss function L(θ) is calculated based on an experience tuple (s_t, a_t, r_t, s_{t+1}) sampled from the experience pool:

L(θ) = [y_t − f_θ(s_t, a_t)]^2   (5)

y_t = r_t + γ max_{a′} f_θ′(s_{t+1}, a′)   (6)

where f_θ(s_t, a_t) is the Q value predicted by the Policy-Net, r_t is the reward value recorded in the experience, and f_θ′(s_{t+1}, a′) is the Q value at step t+1 predicted by the Target-Net.

In the training process, T guesses are taken as one episode (epoch), and the cumulative reward value is recorded. Ideally, after adequate episodes of training, the cumulative reward value in each episode should gradually increase until it converges to T (i.e., the number of game rounds). The cumulative reward can be monitored to determine whether the model has been trained enough or not.

For ease of description, the DRL models proposed in this paper are named CNN-RL and GRU-RL according to the internal network type of the agent, respectively. The performance comparison and validation are described in detail in Section 4.
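A minimal sketch of the guessing-game environment emulator described above (random state switching, FFT half-spectrum as the observation, +1/−1 reward) is given below; the class interface and names are our own assumptions and not taken from the paper's code.

```python
import numpy as np

class FaultDiagnosisEnv:
    """Guessing-game emulator built from a labelled dataset D = {(x_i, y_i)}."""
    def __init__(self, signals, labels, rng=None):
        self.signals = np.asarray(signals)      # raw vibration samples, shape (N, L)
        self.labels = np.asarray(labels)        # condition types in {0, ..., K-1}
        self.rng = rng or np.random.default_rng()
        self._idx = None

    def _observe(self, idx):
        # FFT magnitude; keep only the first half because the spectrum is symmetric
        spectrum = np.abs(np.fft.fft(self.signals[idx]))
        half = spectrum[: len(spectrum) // 2]
        return half[:, np.newaxis].astype(np.float32)    # shape (L/2, 1) for the agent

    def reset(self):
        self._idx = self.rng.integers(len(self.signals))
        return self._observe(self._idx)

    def step(self, action):
        reward = 1.0 if action == self.labels[self._idx] else -1.0   # correct guess / penalty
        self._idx = self.rng.integers(len(self.signals))             # switch state randomly
        return self._observe(self._idx), reward
```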
3. Case study

3.1. Case description

Bearings and gearboxes are common and important rotating machinery components used in NPPs for motors, pumps, fans and turbines (Ma and Jiang 2011; Sinha 2008; Smith et al. 2007). However, for commercial or other considerations, there is currently no public rotating machinery fault dataset from NPPs. As an alternative, related works (Feng et al., 2018; Miki and Demachi 2020; Zhichao Wang et al., 2022; Zhong and Ban 2022a, 2022b) used public datasets to validate their proposed rotating machinery fault diagnosis methods for NPPs. Usually, public datasets are good benchmarks for method validation, and they also facilitate the reproduction and propagation of the proposed methods. On the other hand, validating model performance on laboratory datasets is necessary: only when the model performs well on laboratory datasets does it have the possibility and potential to show good performance on real industrial datasets. Therefore, in this paper, 2 public fault experiment datasets are selected for method validation. They are from 2 types of typical rotating machinery components, and both contain sensor data under normal and multiple abnormal operating conditions, which makes them suitable for constructing multi-classification experiments for fault diagnosis. The validation process and results in this paper can provide a valuable reference for further engineering applications in NPPs. The selected fault datasets are briefly described as follows:

1) Case Western Reserve University (CWRU) bearing fault dataset (Bearing Data Center | Case School of Engineering | Case Western Reserve University n.d.)

The experiment setup is shown in Fig. 7(a). The experiments measured the vibration acceleration signals of motor bearings under normal and fault states. The sampling frequencies include 12 kHz and 48 kHz. Single-point faults were introduced in the inner race, outer race and rolling body of the bearing in different experiments. Fault diameters include 0.18, 0.36, and 0.54 mm, etc.

2) University of Connecticut (UOC) gearbox fault dataset (P. Cao, Zhang, and Tang 2018)

The experiment setup is shown in Fig. 7(b). The experiments measured the vibration signals of a two-stage gearbox in normal and fault states. The sampling frequency was 20 kHz and the operating conditions included health, missing tooth, root crack, spalling, and chipping tip of 5 severity levels.

3.2. Data processing

In this paper, we take the drive-end sensor data in the CWRU dataset with the rotation speed of 1797 rpm and sampling frequency of 12 kHz to construct dataset A; 1024 data points are taken as one sample. In the UOC dataset, 1800 data points are taken as one sample to construct dataset B. Detailed information on the datasets is shown in Table 1. 10 and 9 samples (one sample per condition type) are selected from datasets A and B, respectively, and their original vibration signals and frequency spectra processed by FFT are plotted. As shown in Fig. 8(a)(c), the vibration time-domain waveforms vary drastically and appear cluttered. The frequency spectra shown in Fig. 8(b)(d) indicate the vibration characteristics more clearly, which helps the subsequent information processing and classification. So the frequency spectrum is used as the input to the network model.

The raw frequency spectrum is a one-dimensional vector and needs to be reshaped accordingly when used as network model input. For the CNN and CNN-RL models, the spectrum is reshaped to a three-dimensional tensor of shape [B, L, C], where B represents the batch size of training samples, L represents the sample length, and C represents the number of input channels (equal to 1 in this paper). Similarly, for the GRU and GRU-RL models, the spectrum is reshaped to a three-dimensional tensor of shape [B, T, S], where T is the number of time steps and S is the sequence length of data in each time step. B, L, C, T, and S are set empirically and by trial and error. The specific values used in this paper are shown in Section 3.3 and Table 2.
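A hedged sketch of the sample construction and reshaping just described (non-overlapping 1024-point samples for dataset A, half-spectrum of length 512, then [B, L, C] for the CNN models or [B, T, S] for the GRU models) is shown below; the helper names are ours and the segmentation strategy is an assumption.

```python
import numpy as np

def make_samples(signal, sample_len=1024):
    """Cut a long vibration record into non-overlapping samples (dataset A: 1024 points)."""
    n = len(signal) // sample_len
    return np.reshape(signal[: n * sample_len], (n, sample_len))

def to_spectrum(samples):
    """FFT magnitude; keep only the first half because the spectrum is symmetric."""
    spec = np.abs(np.fft.fft(samples, axis=1))
    return spec[:, : samples.shape[1] // 2].astype(np.float32)      # (B, 512) for dataset A

def reshape_for_cnn(spectra):
    return spectra[..., np.newaxis]                                  # [B, L, C] = [B, 512, 1]

def reshape_for_gru(spectra, time_steps=16):
    b, length = spectra.shape
    return spectra.reshape(b, time_steps, length // time_steps)      # [B, T, S] = [B, 16, 32]
```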
3.3. Model setting

We use TensorFlow (v2.3.0) (Abadi et al., 2016) for network model development. After much trial and error, the parameters shown in Table 2 were finally selected. Due to the symmetry of the FFT results, the input size is half of the original vibration signal sample length, which is 512 for dataset A and 900 for dataset B, respectively. The other parameters are set as follows: the experience pool size is 300, the replay size is 32, the number of game rounds per episode (epoch) is 32, the update period of the target network is 10 training steps, the initial value of ε is 0.5, ε_total is 50, and ε_min is 0.01. The Adam optimizer (Kingma et al., 2015) is selected for updating the network parameters. The learning rate is taken as 0.001 and the batch size is taken as 32.

The discount factor γ (taking a value from 0 to 1) in the DRL model is an important parameter that reflects how much attention the model pays to the delayed rewards of actions. For solving control problems, such as playing Atari video games, consecutive input samples in an episode,


Fig. 7. Experimental setup (a) CWRU bearing fault experiment (Bearing Data Center | Case School of Engineering | Case Western Reserve University n.d.) (b) UOC
gearbox fault experiment (Cao et al. 2018).

Table 1
Detailed information of fault datasets in case study.

Dataset | Machinery type | Fault type | Sample size | Sample length | Label
A | Rolling bearing | H, IR1, IR2, IR3, OR1, OR2, OR3, B1, B2, B3 | 100 in each class, 1000 in total | 1024 | 0,1,2,3,4,5,6,7,8,9
B | Gear | H, MT, RC, SP, CT1, CT2, CT3, CT4, CT5 | 208 in each class, 1872 in total | 1800 | 0,1,2,3,4,5,6,7,8

H = Normal (i.e., health).
IR1–IR3 = Inner race fault with size 0.18 mm, 0.36 mm, 0.54 mm.
OR1–OR3 = Outer race fault with size 0.18 mm, 0.36 mm, 0.54 mm.
B1–B3 = Ball fault with size 0.18 mm, 0.36 mm, 0.54 mm.
MT = missing tooth, RC = root crack, SP = spalling, CT1–CT5 = chipping tip with 5 severity levels.
The label field corresponds to the values of the fault type field.

corresponding to game screenshots at different moments, are strongly correlated, and the game action at the previous step has a great influence on the next action. So the model needs to pay much attention to the delayed reward of each action, and γ is usually taken as 0.99 (Mnih et al., 2015). However, in the fault diagnosis scenario of this paper, each sample contains enough fault information for the DRL model to make a type prediction action in one step, which indicates that the guessing actions are relatively independent. So the model should focus more on the current reward and the discount factor γ should take a small value. Results of the reference (Wang et al., 2022) show that model performance tends to decrease when γ increases from 0.1 to 0.9. In the reference (Ding et al., 2019), γ takes the value of 0. Therefore, the discount factor γ is taken as 0 in this paper.

The baseline models chosen in this paper are CNN, GRU and SVM. The architectures of the CNN and GRU models are shown in Figs. 2 and 4, respectively. To enhance comparability, the main hyper-parameter settings of the CNN and GRU models are consistent with the corresponding DRL models described in Table 2. SVM is a classic machine learning model with powerful nonlinear fitting ability and small sample learning capability (Cortes and Vapnik 1995; Yang et al. 2007). A feature engineering approach similar to (Zhong and Ban 2022a) is selected: the normalized vibration energy percentage distribution is extracted by a 3-layer WPT and used as the input to the SVM model. Scikit-learn (v1.0.1) (Pedregosa et al., 2011) is used to build the SVM model and its hyper-parameters (e.g., regularization parameter, kernel parameter) are automatically optimized by the "hyperopt" toolkit in Python (Bergstra et al., 2015).
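A hedged sketch of this baseline (3-layer wavelet packet energy features, an RBF-kernel SVC, and a hyperopt search over C and gamma) is shown below. The use of the PyWavelets package, the 'db4' mother wavelet and the search ranges are assumptions of this sketch; the paper does not specify them.

```python
import numpy as np
import pywt
from sklearn.svm import SVC
from hyperopt import fmin, tpe, hp

def wpt_energy_features(sample, wavelet="db4", level=3):
    """Normalized energy share of the 2**level terminal nodes of a wavelet packet tree."""
    wp = pywt.WaveletPacket(data=sample, wavelet=wavelet, maxlevel=level)
    energies = np.array([np.sum(node.data ** 2) for node in wp.get_level(level, order="freq")])
    return energies / energies.sum()

def tune_svm(X_train, y_train, X_val, y_val, max_evals=50):
    """Search C and the RBF kernel width with hyperopt's TPE algorithm."""
    space = {"C": hp.loguniform("C", np.log(1e-2), np.log(1e3)),
             "gamma": hp.loguniform("gamma", np.log(1e-4), np.log(1e1))}

    def objective(params):
        clf = SVC(C=params["C"], gamma=params["gamma"], kernel="rbf")
        clf.fit(X_train, y_train)
        return 1.0 - clf.score(X_val, y_val)        # minimize the validation error

    return fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=max_evals)
```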
3.4. Model evaluation

To verify that the model can be applied to multiple types of rotating machinery, we evaluate model performance on the 2 fault datasets, respectively. During the evaluation process, dataset A or B is further divided into 3 sub-datasets: a training set, a validation set and a test set in the sample size ratio of 5:2:3. The models are trained on the training set. The validation set is used to track the model performance during training and to select the best model for every case. The test set is not involved in the model training and parameter optimization, and is only used to evaluate the final model performance.

We use accuracy (see Eq. (7)) on the test set as the main evaluation metric. The accuracy reflects the comprehensive classification ability of the model for all sample types. In each case, 10 experiments are conducted and the mean test accuracy is taken as the evaluation result.

acc = (1/N) Σ_{i=1}^{N} I(ỹ_i = y_i) × 100%   (7)

where I(·) is the indicator function, which returns 1 if the condition is valid and 0 otherwise, N is the test set size, and ỹ_i and y_i are the predicted and true labels of the i-th sample, respectively.
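The 5:2:3 split and the accuracy metric of Eq. (7) can be realized, for example, as follows; this sketch assumes a class-stratified split with scikit-learn, which is one plausible choice and not necessarily the authors' exact routine.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_5_2_3(X, y, seed=0):
    """Stratified 5:2:3 split into training, validation and test sets."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.5, stratify=y, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, train_size=0.4, stratify=y_rest, random_state=seed)  # 0.4 * 0.5 = 0.2
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

def accuracy(y_pred, y_true):
    """Eq. (7): share of correctly predicted test samples, in percent."""
    return 100.0 * np.mean(np.asarray(y_pred) == np.asarray(y_true))
```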
4. Results and discussion

4.1. Analysis of DRL training process

In this section, the training process of DRL is analyzed in detail to verify the effectiveness of fault diagnosis. The specific contents include analysis of the training accuracy curve, cumulative reward, single-sample identification process and hidden layer visualization. According to our experimental results, the training processes of CNN-RL and GRU-RL show high similarity, and GRU-RL achieves better performance (see Sections 4.2 and 4.3), so the training process of the GRU-RL model is selected for presentation here.
7
G. Qian and J. Liu Progress in Nuclear Energy 152 (2022) 104401

Fig. 8. Visualization of samples, 10 types on dataset A and 9 types on dataset B (a) (c) vibration time-domain waveform; (b) (d) frequency spectrum, y label
represents the abbreviation of fault type name.

Table 2
Main parameter settings of the proposed DRL and deep learning models.

Model | Parameter | Value (dataset A) | Value (dataset B)
CNN, CNN-RL | Input size | 512 × 1 | 900 × 1
CNN, CNN-RL | Conv-1 | 4@3 × 1 (4 filters with same size 3 × 1) | 4@3 × 1
CNN, CNN-RL | Pool-1 | Max pooling, size 2 × 1, stride 2 | Max pooling, size 2 × 1, stride 2
CNN, CNN-RL | Conv-2 | 8@3 × 1 | 8@3 × 1
CNN, CNN-RL | Pool-2 | Max pooling, size 2 × 1, stride 2 | Max pooling, size 2 × 1, stride 2
CNN, CNN-RL | FC layer | [1024, 32] | [1800, 32]
GRU, GRU-RL | Input size | 32 | 30
GRU, GRU-RL | Time step | 16 | 30
GRU, GRU-RL | Hidden units | 32 | 32
GRU, GRU-RL | FC layer | [512, 32] | [900, 32]
CNN-RL, GRU-RL | Target output size (Q value) | 1 | 1
CNN-RL, GRU-RL | Policy output size (Action) | 10 | 9
CNN, GRU | Output size | 10 | 9

The number of training epochs (episodes) is set to 200, and the training accuracy at the end of each epoch is recorded. Training accuracy reflects the model's ability of feature extraction and pattern recognition on the training set. As shown in Fig. 9(a)(c), the training accuracy of the GRU model rises faster than that of the GRU-RL model, reaching convergence status first. This is because supervised learning is a strong-feedback training mechanism that uses the sample labels to guide the updating of the GRU model parameters, while the GRU-RL model is trained in a weak-feedback mechanism that uses a scalar reward signal (taking −1 or 1) to evaluate the model behavior and update its parameters in the direction that may maximize the cumulative reward during the interaction. Regarding model stability, as can be seen in the embedded small plot in Fig. 9(a), the GRU-RL model behaves more stably when it converges, while the accuracy of the GRU model fluctuates. Although the training process of the GRU-RL model is slower, its performance is more stable after reaching convergence status.

Next, the variation characteristics of the cumulative reward values are analyzed. In each epoch, the GRU-RL model plays 32 rounds of the guessing game. It gets a reward of 1 point when the guess is correct, and −1 point (penalty signal) for an incorrect guess. Thus, the cumulative reward value lies in the range of [−32, 32], where 32 represents entirely correct and −32 means entirely wrong. A larger reward value implies a more accurate guessing policy or capability embodied in the model. Fig. 9(b)(d) shows the 5-epoch mean cumulative reward value of the GRU-RL model during the training process. As the model is continuously trained, the cumulative reward value gradually increases and finally converges to 32. This implies that the model can learn the correct classification policy and can complete the guessing game excellently.

We further analyze the training process of the GRU-RL model from the perspective of single-sample identification. One fixed sample is


Fig. 9. Training accuracy and mean cumulative reward on (a) (b) dataset A, (c) (d) dataset B.

Fig. 10. Identification results of the GRU-RL model for a single fixed sample on (a) Dataset A, (b) Dataset B.

selected from datasets A and B, respectively. At the end of each training epoch, the type of the fixed sample is predicted and recorded. Fig. 10 shows the identification results for the fixed samples at different training stages. The horizontal coordinate represents the epoch number and the vertical coordinate represents the identification result. At the beginning, the model is in the exploration and trial-and-error phase, and the prediction results are randomly distributed. After enough training epochs, the model prediction results gradually converge to the true labels. This re-verifies that the GRU-RL model can learn the correct classification strategy on the 2 different rotating machinery datasets.

Now, we use the t-SNE algorithm (Van Der Maaten and Hinton, 2008) to reduce the dimensionality of, and visualize, the first FC layer state of the GRU-RL model. The FC layer contains the sample feature information extracted by the model. As shown in Fig. 11, each point in the figure represents a sample, and different colors represent different sample labels (condition types). Different types of samples are scattered in relatively independent local regions, which indicates that the GRU-RL model can accurately extract the abstract representation of fault features in the selected datasets.
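A sketch of how such a visualization can be produced with scikit-learn's TSNE is given below, assuming the first FC layer activations are read out through an intermediate Keras model; the layer name "dense" and the plotting details are assumptions of this sketch.

```python
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.manifold import TSNE

def plot_fc_tsne(model, X, y, fc_layer_name="dense", perplexity=30, seed=0):
    """Project the first FC layer state of a trained model to 2D and color by label."""
    fc_readout = tf.keras.Model(inputs=model.input,
                                outputs=model.get_layer(fc_layer_name).output)
    features = fc_readout.predict(X, verbose=0)            # hidden-layer activations
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=seed).fit_transform(features)   # 2D t-SNE embedding
    plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=8)
    plt.xlabel("t-SNE dim 1"); plt.ylabel("t-SNE dim 2")
    plt.show()
```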


Fig. 11. 2D visualization of the GRU-RL model’s first FC layer state by t-SNE on (a) dataset A, (b) dataset B.

Table 3
Model performance under initial training sample size.

Model | Dataset A (Bearing) | Dataset B (Gear)
SVM | 89.47 ± 1.38 | 90.27 ± 1.00
CNN | 98.73 ± 1.81 | 95.31 ± 1.62
GRU | 99.90 ± 0.20 | 99.11 ± 0.65
CNN-RL | 99.95 ± 0.15 | 99.61 ± 0.50
GRU-RL | 99.98 ± 0.05 | 99.95 ± 0.11

Note: mean test accuracy ± standard deviation [%] of 10 experiments.

Fig. 12. Model performance under small sample scenarios, test accuracy and its standard deviation on (a) (c) dataset A, (b) (d) dataset B.

4.2. Model performance comparison

This section compares the test accuracy of the different models at the initial sample size. Dataset A has 500 training set samples and dataset B has 936 training set samples. The performance of the 5 models is compared according to the evaluation process described in Section 3.4. Table 3 shows the accuracy evaluation results of all models on the test set. The order of model performance is the same on the 2 datasets: GRU-RL > CNN-RL > GRU > CNN > SVM. The 2 presented DRL models achieve very high accuracy of over 99%, showing excellent diagnostic capability.

The DRL model has the ability to learn interactively with the environment and integrates the advantage of automatic feature extraction from the deep learning model, allowing the model to deeply mine and understand the essential pattern features of the environment (i.e., fault


datasets) and learn better fault diagnosis strategies than the deep learning model.

The SVM model uses the vibration signal energy features extracted by the WPT algorithm, which may lose a part of the important fault information contained in the original signal, compared to the other four models (which use the original spectral information as input). Therefore, the SVM has the lowest accuracy, which is consistent with common sense. Nevertheless, the SVM achieves or approaches 90% accuracy and has good diagnostic performance.

4.3. Performance under small samples

In a real industrial environment with high safety requirements, especially in the nuclear industry and NPPs, the rotating machinery operates in the normal state most of the time, forming a fault diagnosis problem under small samples. Model performance under the small sample scenario is therefore more informative for engineering applications. In this section, the small sample scenario is simulated by selecting a small percentage (5%–30%) of the training set for model training.
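One plausible way to generate these reduced training subsets is shown below; the class-stratified selection is an assumption of this sketch, since the paper only states that a small percentage of the training set is selected.

```python
from sklearn.model_selection import train_test_split

def subsample_training_set(X_train, y_train, tsr, seed=0):
    """Keep only a fraction `tsr` (e.g. 0.05-0.30) of the training set, stratified by class."""
    X_small, _, y_small, _ = train_test_split(
        X_train, y_train, train_size=tsr, stratify=y_train, random_state=seed)
    return X_small, y_small

# Example: simulate the 5% training sample ratio (TSR) case
# X_5, y_5 = subsample_training_set(X_train, y_train, tsr=0.05)
```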
Fig. 12 shows the evaluation results under different training sample ratios (TSR). On dataset A, the accuracies of the CNN-RL, GRU-RL and GRU models change little when the TSR decreases from 30% to 5%, while the accuracies of the CNN and SVM models show a larger degree of decrease, which reflects that they are more sensitive to sample reduction. The accuracies of the 2 DRL models are higher than those of the other baseline models at the different TSRs. Overall, GRU-RL is slightly better than the CNN-RL model, with smaller standard deviation and better stability. When the TSR is 5%, GRU-RL achieves an average of 93.9% accuracy and CNN-RL achieves an average of 95.8% accuracy, both of which are very high diagnostic accuracies for the small sample cases.

There is a similar pattern on dataset B. When the TSR decreases from 30% to 5%, the accuracies of the 2 DRL models are higher than those of the other baseline models. When the TSR is 5%, GRU-RL achieves an average of 97.3% accuracy and CNN-RL achieves an average of 89.1% accuracy, both of which are high diagnostic accuracies for the small sample cases.

The proposed DRL method has a strong fault diagnosis capability in the case of small samples, surpassing the deep learning models CNN and GRU and the classic SVM model (although SVM has strong small sample learning capability). The DRL model is trained by continuously interacting with the environment. Through extensive trial and error, DRL adapts its behavior and decision policy to maximize the reward signal from the environment. During this complex training process, the DRL model can learn more essential features of the environment (i.e., the fault dataset) and focus on the most critical information in the data, rather than simply fitting the function relationship between samples and their labels. Therefore, DRL can achieve better performance.

With the network architecture settings in this paper, GRU-RL outperforms the CNN-RL model, especially on dataset B. This may be attributed to the GRU network's mechanism of processing information recurrently. The vibration signal spectrum has strong sequence features. GRU can hold the temporal information features of the whole sequence in the extracted feature vector. In contrast, CNN focuses more on the local features of the data and lacks the mining of the overall temporal features of the information.

4.4. Discussions

Since the verification datasets in this paper are from laboratories, not real datasets from NPPs, we cannot guarantee that the proposed model can achieve 99% accuracy in real situations. On the bright side, this paper gives a specific and effective model architecture design and some valuable experimental results of the DRL-based fault diagnosis method for rotating machinery. The experimental results in this paper demonstrate the relative superiority of the model performance to some extent. This information may be very useful and valuable for NPP engineers and lays a foundation for further study. For further engineering applications, there are two more points, i.e., data source and sample size (total number of samples, and even sample classes), which need to be discussed here.

For point 1: data source. The datasets used in this paper are from laboratories (with low noise and less disturbance compared to real industrial measurements), the monitored equipment types may differ from the specific ones used in NPPs, and the sensors and signal acquisition frequencies may also differ, which means the trained models in this paper cannot be directly deployed and delivered for NPP use. However, the modeling idea and process can be directly migrated and provide a valuable reference for NPP equipment maintenance engineers.

For point 2: data size. Compared to image datasets such as ImageNet, which has more than 15 million images and over 22,000 categories available for building deep learning models (Krizhevsky et al. 2017), the training set used in this paper is small, with a maximum size of 500 (10 categories) for dataset A and 936 (9 categories) for dataset B. Related publications also use a similar size of training set to build deep learning fault diagnosis models; for example, (Gan et al. 2016; Guo et al. 2016) both used 500 samples for model training and (Cao, Zhang, and Tang 2018) used 747 samples for model training. This proves that such a training set size can meet the requirements for building deep learning-based rotating machinery fault diagnosis models, at least for laboratory datasets.

Two more important issues related to data size are the overfitting problem and generalization ability. Overfitting is a common problem in deep learning model training. Due to the powerful fitting ability of deep learning models, it is easy to fit unimportant features and noise in the dataset, leading to overfitting. When overfitting occurs, the model has high accuracy on the training set but low accuracy on the test set. In the experimental results of this paper, the proposed deep learning and DRL models achieve high accuracy on both the training set and the test set most of the time. They do not show significant overfitting until the size of the training set is reduced to a certain level (e.g., TSR = 5%). Generalization ability refers to the ability of the model to adapt to unknown data. Test accuracy is a measure of the model's generalization ability. Although the proposed DRL models achieve high test accuracy, their generalization ability to real NPP datasets still needs further study and validation. Since the safety requirements in NPPs are high and conducting equipment fault experiments is very expensive, the small sample problem is an objective situation and will last for years. Future studies should be dedicated not only to advanced model development, but also to the establishment of large-scale fault datasets for NPPs.

5. Conclusions

In this paper, we propose 2 DRL fault diagnosis models based on the CNN and GRU models, analyze the model training process carefully, and compare their performance with each other and with 3 baseline models. 2 experimental fault datasets of bearings and gears are selected as case studies. The results indicate that: 1) the DRL model converges more slowly than the deep learning model but has better stability after reaching convergence status; 2) the proposed 2 DRL models outperform traditional deep learning models (CNN, GRU) and the classic machine learning model (SVM) in both the initial sample size and small sample scenarios. This is attributed to the combined advantages of powerful feature extraction from deep learning and interactive learning ability from reinforcement learning, which motivates the models to learn more essential features with the weak reward feedback signal; 3) based on the network architecture design in this paper, the GRU-RL performance is slightly better than that of the CNN-RL model, which may be attributed to the intrinsic ability of the GRU network to extract the overall sequence information of the vibration spectrum and the splicing of all hidden states in the recurrent unit.

Subsequent studies will further investigate the fault diagnosis performance of the DRL model under varying working conditions, strong


environmental noise and sample imbalance scenarios using real equipment datasets or simulated system-level fault datasets, and promote the engineering application of the DRL model in NPPs. In addition, there are still many interesting topics and challenges: for example, the Monte Carlo tree search technique (Silver et al., 2017) in DRL has not yet been introduced to fault diagnosis studies, model training efficiency needs to be improved (Ding et al., 2019; Wang et al., 2022), and there is the network architecture optimization problem (Cao et al., 2022). In future studies, we will also consider exploring these techniques.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgement

We appreciate the editors and anonymous reviewers very much for their precious time and valuable comments.

Appendix

Appendix Algorithm A1 describes the training process of the DQN models.

References

Abadi, M., et al., 2016. TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, USA. USENIX Association, pp. 265–283.
Arulkumaran, K., Deisenroth, M.P., Brundage, M., Bharath, A.A., 2017. Deep reinforcement learning: a brief survey. IEEE Signal Process. Mag. 34 (6), 26–38.
Ayo-Imoru, R.M., Cilliers, A.C., 2018. A survey of the state of condition-based maintenance (CBM) in the nuclear power industry. Ann. Nucl. Energy 112, 177–188. https://www.sciencedirect.com/science/article/pii/S0306454917303365.
Bearing Data Center | Case School of Engineering | Case Western Reserve University [WWW Document], n.d. https://engineering.case.edu/bearingdatacenter. (Accessed 2 July 2022).
Bergstra, James, et al., 2015. Hyperopt: a Python library for model selection and hyperparameter optimization. Comput. Sci. Discov. 8 (1), 014008. https://doi.org/10.1088/1749-4699/8/1/014008.
Biet, M., 2013. Rotor faults diagnosis using feature selection and nearest neighbors rule: application to a turbogenerator. IEEE Trans. Ind. Electron. 60 (9), 4063–4073.
Cao, P., Zhang, S., Tang, J., 2018. Preprocessing-free gear fault diagnosis using small datasets with deep convolutional neural network-based transfer learning. IEEE Access 6, 26241–26253.
Cao, Jie, Ma, Jialin, Huang, Dailin, Yu, Ping, 2022. Finding the optimal multilayer network structure through reinforcement learning in fault diagnosis. Measurement 188, 110377. https://www.sciencedirect.com/science/article/pii/S0263224121012707.
Chen, S., et al., 2020. Robust deep learning-based diagnosis of mixed faults in rotating machinery. IEEE ASME Trans. Mechatron. 25 (5), 2167–2176.
Chung, Junyoung, Gulcehre, Caglar, Cho, Kyunghyun, Bengio, Yoshua, 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS 2014 Workshop on Deep Learning, December 2014.
Cooley, James W., Tukey, John W., 1965. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19, 297–301.
Cortes, Corinna, Vapnik, Vladimir, 1995. Support-vector networks. Mach. Learn. 20 (3), 273–297. https://doi.org/10.1007/BF00994018.
Ding, Yu, et al., 2019. Intelligent fault diagnosis for rotating machinery using deep Q-network based health state classification: a deep reinforcement learning approach. Adv. Eng. Inf. 42, 100977. https://www.sciencedirect.com/science/article/pii/S1474034619305506.
Duan, Q., et al., 2020. Fault diagnosis of air compressor in nuclear power plant based on vibration observation window. IEEE Access 8, 222274–222284.
Fan, S., Zhang, X., Song, Z., 2022. Imbalanced sample selection with deep reinforcement learning for fault diagnosis. IEEE Trans. Ind. Inf. 18 (4), 2518–2527.
Feng, Yi, et al., 2018. Pump bearing fault detection based on EMD and SVM. In: Volume 1: Operations and Maintenance, Engineering, Modifications, Life Extension, Life Cycle, and Balance of Plant; Instrumentation and Control (I&C) and Influence of Human Factors; Innovative Nuclear Power Plant Design and SMRs.
Gan, Meng, Wang, Cong, Zhu, Chang'an, 2016. Construction of hierarchical diagnosis network based on deep learning and its application in the fault pattern recognition of rolling element bearings. Mech. Syst. Signal Process. 72–73, 92–104. https://www.sciencedirect.com/science/article/pii/S0888327015005312.
Glavic, Mevludin, 2019. (Deep) reinforcement learning for electric power system control and related problems: a short review and perspectives. Annu. Rev. Control 48, 22–35. https://www.sciencedirect.com/science/article/pii/S1367578819301014.
Guo, Xiaojie, Chen, Liang, Shen, Changqing, 2016. Hierarchical adaptive deep convolution neural network and its application to bearing fault diagnosis. Measurement 93, 490–502. https://www.sciencedirect.com/science/article/pii/S0263224116304249.
Jordan, M.I., Mitchell, T.M., 2015. Machine learning: trends, perspectives, and prospects. Science 349 (6245), 255–260. https://doi.org/10.1126/science.aaa8415.
Kingma, Diederik P., Ba, Jimmy, 2015. Adam: a method for stochastic optimization. CoRR abs/1412.6980.
Koo, In Soo, Kim, Whan Woo, 2000. The development of reactor coolant pump vibration monitoring and a diagnostic system in the nuclear power plant. ISA (Instrum. Soc. Am.) Trans. 39 (3), 309–316. https://www.sciencedirect.com/science/article/pii/S0019057800000197.
Krizhevsky, Alex, Sutskever, Ilya, Hinton, Geoffrey E., 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60 (6), 84–90. https://doi.org/10.1145/3065386.
Lebold, M.S., et al., 2004. Using torsional vibration analysis as a synergistic method for crack detection in rotating equipment. In: 2004 IEEE Aerospace Conference Proceedings (IEEE Cat. No.04TH8720), vol. 6, pp. 3517–3527.
LeCun, Yann, Bengio, Yoshua, Hinton, Geoffrey, 2015. Deep learning. Nature 521 (7553), 436–444. https://doi.org/10.1038/nature14539.
Li, G., et al., 2021. Deep reinforcement learning-based online domain adaptation method for fault diagnosis of rotating machinery. IEEE ASME Trans. Mechatron. 1–10.
Li, Wei, et al., 2022. Multi-mode data augmentation and fault diagnosis of rotating machinery using modified ACGAN designed with new framework. Adv. Eng. Inf. 52, 101552. https://www.sciencedirect.com/science/article/pii/S1474034622000271.
Liu, Meiru, et al., 2015. Vibration signal analysis of main coolant pump flywheel based on Hilbert–Huang transform. Nucl. Eng. Technol. 47 (2), 219–225. https://www.sciencedirect.com/science/article/pii/S1738573315000108.
Ma, Jianping, Jiang, Jin, 2011. Applications of fault detection and diagnosis methods in nuclear power plants: a review. Prog. Nucl. Energy 53 (3), 255–266. https://www.sciencedirect.com/science/article/pii/S0149197010001769.
Miki, Daisuke, Demachi, Kazuyuki, 2020. Bearing fault diagnosis using weakly supervised long short-term memory. J. Nucl. Sci. Technol. 57 (9), 1091–1100. https://doi.org/10.1080/00223131.2020.1761473.
Mnih, Volodymyr, et al., 2013. Playing Atari with deep reinforcement learning. arXiv abs/1312.5602.
Mnih, Volodymyr, et al., 2015. Human-level control through deep reinforcement learning. Nature 518 (7540), 529–533. https://doi.org/10.1038/nature14236.
Nguyen, T.T., Reddi, V.J., 2021. Deep reinforcement learning for cyber security. IEEE Transact. Neural Networks Learn. Syst. 1–17.
Pan, Tongyang, et al., 2021. Generative adversarial network in mechanical fault diagnosis under small sample: a systematic review on applications and future perspectives. ISA Transactions. https://www.sciencedirect.com/science/article/pii/S0019057821006169.
Park, JaeKwan, Kim, TaekKyu, Seong, SeungHwan, 2020. Providing support to operators for monitoring safety functions using reinforcement learning. Prog. Nucl. Energy 118, 103123. https://www.sciencedirect.com/science/article/pii/S014919701930232X.
Park, JaeKwan, Kim, TaekKyu, Seong, SeungHwan, Koo, SeoRyong, 2022. Control automation in the heat-up mode of a nuclear power plant using reinforcement learning. Prog. Nucl. Energy 145, 104107. https://www.sciencedirect.com/science/article/pii/S0149197021004595.
Pedregosa, Fabian, et al., 2011. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Qian, Gensheng, Liu, Jingquan, 2022. Fault diagnosis based on conditional generative adversarial networks in nuclear power plants. Ann. Nucl. Energy 176, 109267. https://www.sciencedirect.com/science/article/pii/S0306454922003024.
Radaideh, Majdi I., Forget, Benoit, Shirvan, Koroush, 2021. Large-scale design optimisation of boiling water reactor bundles with neuroevolution. Ann. Nucl. Energy 160, 108355. https://www.sciencedirect.com/science/article/pii/S0306454921002310.
Rasamoelina, Andrinandrasana David, Adjailia, Fouzia, Sinčák, Peter, 2020. A review of activation function for artificial neural network. In: 2020 IEEE 18th World Symposium on Applied Machine Intelligence and Informatics (SAMI), pp. 281–286.
Ravanelli, M., Brakel, P., Omologo, M., Bengio, Y., 2018. Light gated recurrent units for speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2), 92–102.
Silver, David, et al., 2017. Mastering the game of Go without human knowledge. Nature 550 (7676), 354–359. https://doi.org/10.1038/nature24270.
Sinha, Jyoti K., 2008. Vibration-based diagnosis techniques used in nuclear power plants: an overview of experiences. Nucl. Eng. Des. 238 (9), 2439–2452. https://www.sciencedirect.com/science/article/pii/S0029549308001556.
Smith, H.R., Wiedenbrug, E., Lind, M., 2007. Rotating element bearing diagnostics in a nuclear power plant: comparing vibration and torque techniques. In: 2007 IEEE International Symposium on Diagnostics for Electric Machines, Power Electronics and Drives, pp. 17–22.
Sutton, R.S., Barto, A.G., 2018. Reinforcement Learning: An Introduction, second ed. Cambridge.
Van Der Maaten, Laurens, Hinton, Geoffrey, 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
Wang, Zisheng, Xuan, Jianping, 2021. Intelligent fault recognition framework by using deep reinforcement learning with one dimension convolution and improved actor-critic algorithm. Adv. Eng. Inf. 49, 101315. https://www.sciencedirect.com/science/article/pii/S1474034621000689.
Wang, H., et al., 2022. Intelligent fault diagnosis for planetary gearbox using time-frequency representation and deep reinforcement learning. IEEE ASME Trans. Mechatron. 27 (2), 985–998.
Wang, Zhichao, et al., 2022. Cross-domain fault diagnosis of rotating machinery in nuclear power plant based on improved domain adaptation method. J. Nucl. Sci. Technol. 59 (1), 67–77. https://doi.org/10.1080/00223131.2021.1953630.
Yang, Junyan, Zhang, Youyun, Zhu, Yongsheng, 2007. Intelligent fault diagnosis of rolling element bearing based on SVMs and fractal dimension. Mech. Syst. Signal Process. 21 (5), 2012–2024. https://www.sciencedirect.com/science/article/pii/S0888327006002251.
Zhang, Yahui, et al., 2021. Fault diagnosis of rotating machinery based on recurrent neural networks. Measurement 171, 108774. https://www.sciencedirect.com/science/article/pii/S0263224120312732.
Zhong, Xianping, Ban, Heng, 2022a. Crack fault diagnosis of rotating machine in nuclear power plant based on ensemble learning. Ann. Nucl. Energy 168, 108909. https://www.sciencedirect.com/science/article/pii/S0306454921007866.
Zhong, Xianping, Ban, Heng, 2022b. Pre-trained network-based transfer learning: a small-sample machine learning approach to nuclear power plant classification problem. Ann. Nucl. Energy 175, 109201. https://www.sciencedirect.com/science/article/pii/S0306454922002365.
Zhong, Shi-sheng, Fu, Song, Lin, Lin, 2019. A novel gas turbine fault diagnosis method based on transfer learning with CNN. Measurement 137, 435–453. https://www.sciencedirect.com/science/article/pii/S0263224119300405.
Zhu, Shaomin, et al., 2021. Feature extraction for early fault detection in rotating machinery of nuclear power plants based on adaptive VMD and Teager energy operator. Ann. Nucl. Energy 160, 108392. https://www.sciencedirect.com/science/article/pii/S0306454921002681.
Zou, Fengqian, et al., 2021. An anti-noise one-dimension convolutional neural network learning model applying on bearing fault diagnosis. Measurement 186, 110236. https://www.sciencedirect.com/science/article/pii/S0263224121011453.
