
Measurement 163 (2020) 107929

Contents lists available at ScienceDirect

Measurement
journal homepage: www.elsevier.com/locate/measurement

Deep learning for prognostics and health management: State of the art,
challenges, and opportunities
Behnoush Rezaeianjouybari a,*, Yi Shang b
a Department of Mechanical and Aerospace Engineering, University of Missouri, Columbia, MO 65211, USA
b Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA

a r t i c l e i n f o

Article history:
Received 26 February 2020
Received in revised form 15 April 2020
Accepted 4 May 2020
Available online 21 May 2020

Keywords:
Prognostics and health management
Deep learning
Fault diagnosis
Anomaly detection
Domain adaptation

a b s t r a c t

Improving the reliability of engineered systems is a crucial problem in many applications in various engineering fields, such as the aerospace, nuclear energy, and water desalination industries. This requires efficient and effective system health monitoring methods, including processing and analyzing massive machinery data to detect anomalies and performing diagnosis and prognosis. In recent years, deep learning has been a fast-growing field and has shown promising results for Prognostics and Health Management (PHM) in interpreting condition monitoring signals such as vibration, acoustic emission, and pressure, due to its capacity to mine complex representations from raw data. This paper provides a systematic review of state-of-the-art deep learning-based PHM frameworks. It emphasizes the most recent trends within the field and presents the benefits and potentials of state-of-the-art deep neural networks for system health management. In addition, limitations and challenges of the existing technologies are discussed, which leads to opportunities for future research.

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Recently, prognostics and health management (PHM) has emerged as a key technology to overcome the limitations of traditional reliability analysis. PHM focuses on utilizing sensory signals acquired from an engineered system to monitor the health condition, detect anomalies, diagnose the faults, and more importantly, to predict the remaining useful life (RUL) of the system over its lifetime. This health information provides an advance warning of potential failures and a window of opportunity for implementing measures to avert catastrophic failures by reducing the system downtime and maintenance costs.

In traditional maintenance models, the machinery is investigated and maintained via break-down based or time-based strategies. These two strategies have two main disadvantages: (i) they can be extremely costly; and (ii) their process can pose a safety risk to employees and other assets.

Abbreviations: AAE, Adversarial autoencoders; ACGAN, Auxiliary classifier generative adversarial network; AdaBN, Adaptive batch normalization; AE, Autoencoder; AFSA, Artificial fish swarm algorithm; AHKL, Auto-balanced high-order Kullback-Leibler; AI, Artificial Intelligence; BLSTM, Bi-directional Long Short Term Memory; CAE, Contractive Autoencoder; CBLSTM, CNN-Bi-directional LSTM; CD, Contrastive divergence; CDBN, Convolutional Deep Belief Network; CLSTM, CNN-LSTM; CNN, Convolutional Neural Network; CORAL, Correlation alignment; CPS, Cyber-physical-systems; CPU, Central Processing Unit; CUDA, Compute Unified Device Architecture; CVAE, Conditional variational autoencoder; DA, Domain Adaptation; DAD, Deep Anomaly Detection; DAE, Denoising Autoencoder; DBM, Deep Boltzmann Machine; DBN, Deep Belief Network; DL, Deep Learning; DNN, Deep Neural Network; DQN, Deep Q-Network; EMA, Exponential moving average; FFT, Fast Fourier Transform; GAN, Generative Adversarial Network; GDA, Generalized discriminant analysis; GPU, Graphics Processing Unit; GDBM, Gaussian Bernoulli DBM; GRU, Gated Recurrent Unit; GRU-ED, GRU Encoder-Decoder; HHT, Hilbert-Huang transform; HI, Health Indicator; IaaS, Infrastructure as a Service; IIoT, Industrial Internet of Things; JSD, Jensen-Shannon Divergence; KL, Kullback-Leibler; KNN, k-nearest neighbors; LSTM, Long Short Term Memory; LSTM-ED, LSTM Encoder-Decoder; MCMC, Markov chain Monte Carlo; MLP, Multi-Layer Perceptron; MMD, Maximum mean discrepancy; MSCNN, Multi-scale convolutional neural network; NAS, Neural Architecture Search; PaaS, Platform as a Service; PHM, Prognostics and Health Management; PSO, Particle Swarm Optimization; PSR, Phase Space Representation; RBF, Radial basis function; RBM, Restricted Boltzmann Machine; RKH, Reproducing kernel Hilbert; RL, Reinforcement Learning; RNN, Recurrent Neural Network; SaaS, Software as a Service; SAE, Stacked Autoencoder; SCDA, Smooth conditional distribution alignment; SDAE, Sparse Denoising Autoencoder; SDAE-NCL, Stacked denoising autoencoder network with negative correlation learning; SGD, Stochastic gradient descent; SML, Stochastic maximum likelihood; SNR, Signal to Noise Ratio; SPEV, Spectrum-principal-energy vector; SSAE, Sparse Stacked Autoencoder; SSDAE, Stacked sparse denoising autoencoder; STPN, Spatiotemporal pattern network; SVM, Support Vector Machine; TDConvLSTM, Time-distributed Convolutional LSTM; TL, Transfer Learning; TPU, Tensor Processing Unit; VAE, Variational Autoencoder; WGAN, Wasserstein generative adversarial network; WJDA, Weighted joint distribution alignment; WPI, Wavelet Packet Image; WPT, Wavelet Packet Transform.
* Corresponding author.
E-mail address: b.rezaeianjouybari@mail.missouri.edu (B. Rezaeianjouybari).

https://doi.org/10.1016/j.measurement.2020.107929
0263-2241/© 2020 Elsevier Ltd. All rights reserved.

Conversely, PHM is known to have strong economic benefits for owners, operators, and society. In a PHM-based maintenance strategy, engineers predict when equipment failure might happen, and then perform maintenance to keep machines in operation. Modern systems are extensively complex and built with many interactive components and electronics, which highlights the importance of system reliability. Failure of any component can result in a catastrophic failure of the system. A viable PHM system framework gives early detection and isolation of the incipient fault of components/sub-systems. The outcome of an effective PHM model provides a tool to monitor the progression of the fault and to help in making assessment decisions and maintenance schedules.

The availability of abundant data and exponentially increasing computational power provide significant opportunities for industry and academia to develop advanced data-driven frameworks to determine the patterns, classify faults and assess system degradation trends. Numerous machine learning methods have been used, including support vector machines (SVM) [1], random forest [2], principal component analysis [3], particle filtering [4], Hidden Markov Model (HMM) [5] and so on. However, these techniques require the experience of experts and prior knowledge of signal processing to manually select and extract meaningful features for real fault diagnosis and prognostics issues.

With the evolution of smart sensing, communication technologies, and complex engineered systems, a massive amount of data from various resources is rapidly generated and collected in real-time, which contains useful information about the degradation and health condition of the system. The performance of traditional algorithms is greatly impeded by the proliferation of multi-dimensional and heterogeneous data streams. Thus, more advanced analytic tools are necessary to adaptively and automatically mine the characteristics hidden in the real-time measured streams.

Deep learning, as a breakthrough in artificial intelligence, has been embraced by various areas such as medical image analysis, visual understanding, health care, computer vision, recommender systems, natural language processing, and automatic speech recognition. It can automatically process highly non-linear and complex feature abstraction from raw data via deep neural networks and eliminates the reliance on domain knowledge and manual feature engineering. Deep learning can automatically learn hierarchical representations of large-scale data, which makes it an effective tool for PHM applications, especially in the presence of high-volume and multi-dimensional industrial data. Traditional data-driven frameworks require hand-crafted feature extraction and appropriate feature selection processes, which are highly dependent on the expertise of professionals and signal processing knowledge. Conventional frameworks cannot be updated in real-time and require a great deal of work when dealing with large-scale data sets. In comparison, deep learning makes it possible to integrate PHM tasks such as feature extraction, feature selection, and classification/regression into an end-to-end architecture and jointly optimize all the tasks in a hierarchical fashion.

To date, a few review papers on deep learning and PHM have been published [6-9]. However, they are either component (or system) specific or not updated with more recent deep learning technologies. This is a fast-growing area, and refined solutions and advanced models are being developed every few months. There is a need to present more current reviews to cover recent advances and suggested solutions in the PHM paradigm. In this paper, we review the variety of deep neural networks that have been developed and explicitly deployed for fault diagnosis and RUL prediction of engineered systems. An overview of common deep learning architectures is presented in Section 2, followed by a revisit of traditional data-driven PHM basics in Section 3. An overview of deep learning works for common PHM problems is given in Section 4. We summarize available hardware and computing resources in Section 5, and finally, end the paper with challenges and future research directions in Section 6.

The main contributions of this paper can be summarized as follows:

- We categorize the available deep learning models into three classes, discriminative, generative, and hybrid, and use practical examples to explain how these models, especially generative models, can be effective to solve existing challenges.
- We present the application of transfer learning and domain adaptation techniques in PHM and discuss their characteristics.
- We provide a comprehensive reference of available resources in terms of datasets, hardware, software, and cloud computing.
- We discuss the most significant challenges, such as imbalanced classes, unlabeled data, insufficient data, and domain shift, and explain how various deep learning techniques can be utilized to alleviate these problems.

2. A brief overview of deep neural network architectures

Deep neural networks are inspired by the hierarchical structures of human brains, as they first learn simpler features, and then process them to represent more abstract features. The general structure of a deep neural network (DNN), known as feedforward, mainly consists of an input layer, multiple hidden layers, and an output layer. In the multi-layer perceptron (MLP) network shown in Fig. 1, the simplest form of deep architecture, the output is computed straightforwardly along the successive layers of the model as input data is fed. In each neuron of the hidden layers, the biased weighted sum of the previous layer's outputs is passed through a nonlinear function, aka an activation function, to produce the output of that neuron. The hierarchical nature of representation learning in DL lets it find desired but abstract underlying correlations and patterns among a large amount of data. In this section, we briefly discuss the fundamental concepts of deep learning and typical deep structures commonly applied to PHM in the literature. Some commonly used terminologies are defined in Table 1.
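To make the feedforward computation described above concrete, the following minimal numpy sketch passes an input through successive layers; the layer sizes, ReLU choice, and random initialization are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, weights, biases):
    """Forward pass of an MLP: each hidden neuron applies an activation
    to the biased weighted sum of the previous layer's outputs."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)               # hidden layers
    W_out, b_out = weights[-1], biases[-1]
    return W_out @ a + b_out              # linear output layer (logits)

# Illustrative 2-16-8-3 network with random parameters
rng = np.random.default_rng(0)
sizes = [2, 16, 8, 3]
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
logits = mlp_forward(rng.standard_normal(2), weights, biases)
```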
2.1. Restricted Boltzmann Machine

Restricted Boltzmann Machines (RBMs) are undirected bipartite graphical models consisting of n_x visible (input) units and n_h hidden units, with no intralayer connections allowed, see Fig. 1. RBMs often perform as generative models trying to estimate the probability distribution of the input data. In other words, an RBM learns a reconstructed version of the input data through stochastic processing units. From the supervised perspective, an RBM often acts as a pre-processor for other models to carry out the classification task, and can also be a self-contained classifier [12]. A practical guide to training RBMs is provided by Hinton [13]. In the following two sections, we briefly introduce two generative DNNs based on the RBM, known as deep belief networks and deep Boltzmann machines.
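As a rough illustration of contrastive divergence training for a binary RBM (in the spirit of Hinton's practical guide [13], not code from it), one CD-1 step can be sketched as follows; the learning rate, unit counts, and binary-unit assumption are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b_v, b_h, lr=0.01):
    """One CD-1 update: sample hidden units from the data, reconstruct
    the visible units, and move the weights toward the data statistics."""
    p_h0 = sigmoid(v0 @ W + b_h)                  # P(h=1 | v0)
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0    # sample hidden states
    p_v1 = sigmoid(h0 @ W.T + b_v)                # reconstruction P(v=1 | h0)
    p_h1 = sigmoid(p_v1 @ W + b_h)                # P(h=1 | v1)
    W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b_v += lr * (v0 - p_v1)
    b_h += lr * (p_h0 - p_h1)

W = rng.standard_normal((64, 16)) * 0.1           # 64 visible, 16 hidden units
cd1_step((rng.random(64) < 0.5) * 1.0, W, np.zeros(64), np.zeros(16))
```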
2.1.1. Deep belief network

Deep belief networks (DBNs) are the first successful deep networks. They are formed by stacking multiple RBMs and model the joint distribution of the observed data. As shown in Fig. 1, the top layer of a DBN is non-directional and the connections in the other layers are top-down directed. DBNs are trained through a two-step process: the pre-training step and the fine-tuning step. A greedy layer-wise unsupervised algorithm in a bottom-up manner carries out the pre-training via Contrastive Divergence [14].

Fig. 1. Typical deep architectures, rectangular hidden units represent recurrent cells and can be vanilla RNN, GRU or LSTM cells.

Table 1
Glossary of the terminologies.

Cost function/Loss function: Loss function and cost function have been used interchangeably in the machine learning community. In this paper, the loss function refers to the error term for a single training/validation/test sample, whereas the cost function is the average of the loss function over the entire (or a batch of the) training/validation/test set and may contain penalty terms.
Discriminative neural networks: Modeling the conditional probability of the output, given the observed data.
Graphical models: Probabilistic models in which the probabilistic distributions are expressed by graphs. Each node in the graph represents a random variable (or group of random variables), and the edges express probabilistic relationships such as conditional probability between these variables [10].
Generative neural networks: Modeling the joint probability distribution of the input variables and the output variables. The term "generative" comes from the network's ability to generate random instances.
Machine learning: Programming computers to optimize a performance criterion using example data or experience. A machine learning model is defined based on some parameters, and learning refers to optimizing the parameters of the model. The model may be predictive, to make predictions in the future, or descriptive, to gain knowledge from data [11].
Representation learning: Automatically extracting meaningful representations or features required for machine learning tasks, as opposed to manual feature engineering techniques.
Supervised learning: Machine learning algorithms that use labeled data to infer a mapping function from the input to the output.
Unsupervised learning: Machine learning without labels or specific guidance.

Once the network has been initialized by pre-training, the parameters can be fine-tuned with labeled data via a supervised up-down process [15].

2.1.2. Deep Boltzmann Machine

The Deep Boltzmann Machine (DBM) is another RBM-based deep generative model where layers are again arranged in a hierarchical manner [16]. Unlike the DBN, in a DBM all the connections are undirected, see Fig. 1. A DBM can be regarded as a deep RBM with multiple hidden layers, where units in odd-numbered layers are conditionally independent of even-numbered layers, and vice versa [17]. During the training process, a stochastic maximum likelihood (SML) based algorithm is used to jointly train all the layers by maximizing the lower bound on the likelihood [18]. Salakhutdinov and Hinton proposed a greedy layer-wise pre-training strategy, much similar to the DBN, i.e., treating the network as a stack of RBMs and pre-training them independently [16]. A final SML-based joint fine-tuning updates the parameter space.
2.2. Autoencoder

Autoencoders are unsupervised networks that are trained to reconstruct the input x on the output layer x̂ in a two-phase process: encoding learns a hidden representation of the data h via a feature-extracting function, and decoding maps h back into the input space to obtain a reconstruction of the data (Fig. 1). Similar to RBMs, autoencoders can be stacked in a deep configuration called a stacked autoencoder (SAE), which forwards the latent representation of the layer below as the input to the next layer; training is done in a greedy layer-wise manner. A significant drawback of the standard autoencoder is the tendency to learn identity functions without extracting meaningful information about the data, especially in the overcomplete case in which the hidden layer has a dimension equal to or greater than the input, i.e., n_h ≥ n_x. Alternative variants have been introduced to provide a solution, either by regularization or by training autoencoders via generative modeling approaches. The resulting variants are discussed in Sections 2.2.1–2.2.4.
2.2.1. Sparse autoencoder

The sparse autoencoder exploits the inner structure of the data by including a sparsity constraint on the activation of the hidden units through the addition of a Kullback-Leibler (KL) divergence term to the cost function [19]. Sparse representation improves classification performance by increasing the likelihood that different categories will be easily separable [20].
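A hedged sketch of the KL sparsity penalty described above follows; the target sparsity rho and penalty weight beta are illustrative values, not from the cited work.

```python
import numpy as np

def kl_sparsity_penalty(hidden_activations, rho=0.05, beta=3.0):
    """KL(rho || rho_hat) summed over hidden units, added to the cost.
    rho is the target sparsity; rho_hat is the mean activation per unit."""
    rho_hat = np.clip(hidden_activations.mean(axis=0), 1e-8, 1 - 1e-8)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return beta * kl.sum()

# hidden_activations: (batch, n_h) sigmoid outputs of the hidden layer
penalty = kl_sparsity_penalty(np.random.rand(32, 128))
```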
2.2.2. Denoising autoencoder

The denoising autoencoder (DAE) is another regularized network that prevents the model from learning a trivial identity solution. Instead of adding a penalty to the cost function, the DAE takes a noise-corrupted version of the data x̃ to reconstruct the input x and learns meaningful information by changing the reconstruction error term in the cost function. The input is first corrupted by employing binary or Gaussian noise and then fed to the hidden layer. Therefore, DAEs must undo the corruption process by capturing the input data distribution, rather than simply learning the identity. The learned representation is robust toward slight perturbations [21].
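The corrupt-then-reconstruct idea can be sketched in a few lines of PyTorch; the Gaussian noise level, layer sizes, and MSE reconstruction loss below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Minimal DAE: corrupt the input, reconstruct the clean signal."""
    def __init__(self, n_x=1024, n_h=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_x, n_h), nn.ReLU())
        self.decoder = nn.Linear(n_h, n_x)

    def forward(self, x, noise_std=0.1):
        x_tilde = x + noise_std * torch.randn_like(x)   # Gaussian corruption
        return self.decoder(self.encoder(x_tilde))

model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 1024)                    # stand-in for vibration segments
loss = nn.functional.mse_loss(model(x), x)   # reconstruct the clean input
opt.zero_grad(); loss.backward(); opt.step()
```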
2.2.3. Contractive autoencoder

The contractive autoencoder (CAE) proposed by Rifai et al. [22] adds the Frobenius norm of the Jacobian matrix of the latent space representation of the input to the standard reconstruction loss. Contractive autoencoders encourage the robustness of the representation by penalizing the sensitivity of the features rather than regularizing the reconstruction, which offers better performance compared to other regularized models.
2.2.4. Variational autoencoder

Variational autoencoders (VAEs), proposed by Kingma et al. [23], are directed generative models that use a variational inference framework to approximate the input data distribution p(x) and can be trained with gradient-based methods [24]. VAEs are attractive deep models as they bridge the gap between neural networks and probability models, and make it possible to design generative models of large complex datasets. VAEs have an encoder/decoder architecture, although the math behind the structure has little to do with other well-known autoencoders. In [25], the authors proposed a reparametrization trick for training VAEs.
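A minimal sketch of the reparametrization trick and the standard VAE objective (reconstruction plus KL term) is shown below; this is the textbook formulation, not code from the cited works.

```python
import torch

def reparameterize(mu, log_var):
    """z = mu + sigma * eps keeps sampling differentiable w.r.t. mu, sigma."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x_hat, x, mu, log_var):
    """Reconstruction error plus KL divergence to the standard normal prior."""
    recon = torch.nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```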
2.3. Convolutional neural network

Convolutional neural networks (CNNs) are deep discriminative networks and have shown good results in processing data with a grid-like topology. The key difference between CNNs and standard neural networks is that CNNs benefit from parameter sharing, which allows the network to look for specific features at different positions [24]. Fig. 1 shows the schematic of a typical 2-D CNN characterized by three layer types, i.e., convolutional, pooling and fully-connected layers. The convolutional layer carries out the convolution operation on the input data by sliding a filter (kernel) over the input to produce a feature map. The pooling layer aims to reduce the dimension of the feature map, which reduces the number of parameters and increases the shift-invariance property, leading to better robustness against noise [26]. The final fully connected layers map the data to a 1-D feature vector, which can be either used by a classifier [27] or as a feature vector for further processing [27].
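As an illustration, a toy convolutional stack matching the layer types described above (PyTorch; the channel counts, kernel sizes, single-channel 64x64 input, and ten output classes are assumptions):

```python
import torch.nn as nn

# conv -> pool -> conv -> pool -> flatten -> fully connected classifier
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),                      # 64x64 -> 32x32
    nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),                      # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 10),          # e.g., 10 fault classes
)
# cnn(torch.randn(4, 1, 64, 64)) -> logits of shape (4, 10)
```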
2.4. Recurrent neural network

Recurrent Neural Networks (RNNs) contain feedback loops to remember the information of former units and are the most suitable for sequential data such as natural language and time-series data. During the training process, the hidden unit h_t is sequentially updated based on the activation of the current input x_t at time t and the previous hidden state h_{t-1}. RNNs are capable of capturing long-term temporal dependencies from time series and sequential data, but they suffer from the vanishing or exploding gradient problem, in the sense that during the propagation of the gradients back to the initial layers, small gradients shrink and eventually vanish. On the other hand, if gradients are larger than one, they accumulate through numerous matrix multiplications and result in model collapse [28]. Gated recurrent unit (GRU) and long short-term memory (LSTM) cells are popular variants of the RNN that try to alleviate the aforementioned problem [24], see Fig. 2. Bi-directional recurrent networks (BRNN), as shown in Fig. 1, can increase the model capacity by sequencing the data in both forward and backward directions.
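For illustration, a small LSTM regressor of the kind commonly used for sequential condition monitoring data (the feature count, hidden size, window length, and RUL-style scalar output are placeholders, not from the paper):

```python
import torch
import torch.nn as nn

class SequenceRegressor(nn.Module):
    """LSTM over a window of sensor readings; last hidden state -> scalar."""
    def __init__(self, n_features=14, n_hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_hidden, batch_first=True)
        self.head = nn.Linear(n_hidden, 1)

    def forward(self, x):                 # x: (batch, time, features)
        _, (h_n, _) = self.lstm(x)        # h_n: (1, batch, n_hidden)
        return self.head(h_n[-1])         # one estimate per sequence

y = SequenceRegressor()(torch.randn(8, 30, 14))  # -> (8, 1)
```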
2.5. Generative adversarial network

Generative adversarial networks (GANs), proposed by Goodfellow et al. [30], are powerful generative models consisting of two neural networks: a discriminator and a generator. The generator G_θ(z), as the generative part of the model, learns the distribution of the inputs and creates fake data, while the discriminator D_w(x), as the adversarial part, takes in both fake and real data and evaluates them for authenticity, as shown in Fig. 1.

Fig. 2. Two well-known recurrent cells: the sigmoid function is denoted as σ, and ⊙ is an element-wise operator. h_t and C_t indicate the hidden state and memory state at time t, respectively [29].

The training process is similar to a min-max two-player game between the discriminator and generator in game theory that tries to reach a Nash equilibrium of the players. GANs produce appealing results, but they are commonly challenging to train and suffer from diverging behavior, mode collapse, and vanishing gradient issues [31,32]. The original GAN model uses fully-connected networks for the generator and discriminator. However, many recent studies developed variants using AE, CNN and RNN architectures. For more information about various GAN architectures, the reader may refer to [33].
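A compact sketch of the adversarial min-max training described above follows; it uses the commonly adopted non-saturating generator loss, and the network sizes, latent dimension, and learning rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 128))         # generator
D = nn.Sequential(nn.Linear(128, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(x_real):
    z = torch.randn(x_real.size(0), 16)
    x_fake = G(z)
    # Discriminator: push real toward 1, fake toward 0
    d_loss = bce(D(x_real), torch.ones(x_real.size(0), 1)) + \
             bce(D(x_fake.detach()), torch.zeros(x_real.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: fool the discriminator (non-saturating loss)
    g_loss = bce(D(x_fake), torch.ones(x_real.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

train_step(torch.randn(32, 128))   # stand-in for real signal windows
```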

2.6. Optimization in deep neural networks

Artificial neural networks, as universal function approximators, are designed to learn any function. The multi-layered architecture of deep networks makes them able to handle complex, non-linearly separable problems. However, the performance of deep learning is highly reliant on model and training factors such as activation function selection, weight initialization, hyperparameters (learning rate, number of layers, number of neurons at each layer, etc.), and the optimization and regularization methods used in training.

Fig. 3. Schematic of a single neuron showing the inputs x_n, corresponding weights w_n, bias b, and activation function φ.
selection, weight initialization, hyperparameters (learning rate,

Table 2
Well-known activation functions in deep learning.

Sigmoid: φ(z) = 1/(1 + e^(-z)).
Pros: normalized output between 0 and 1, smooth gradient. Cons: computationally expensive, vanishing gradient, outputs not zero-centered; usually not used in deep models except in the output layer for binary classification.

Hyperbolic Tangent (tanh): φ(z) = tanh(z).
Pros: zero-centered output, smooth gradient, noise-robust representation. Cons: computationally expensive, suffers from vanishing gradient but still better than sigmoid for hidden layers.

Rectified Linear Unit (ReLU): φ(z) = max(0, z).
Pros: computationally efficient, no vanishing gradient, the most common function for hidden layers. Cons: overfitting, bias shift because of non-zero mean activations, dying ReLU problem in which the inputs approach zero or negative values with zero gradients.

Leaky ReLU: φ(z) = max(αz, z), where α is a constant, e.g. 0.01.
Pros: solves the dying ReLU issue. Cons: inconsistent results for negative values, otherwise similar to ReLU.

Parametric ReLU: φ(z) = max(βz, z), where β is a learnable parameter.
Pros: the results are better than Leaky ReLU by finding proper β values as model parameters. Cons: as the number of learnable variables increases, it may increase the computational budget of the optimization.

Exponential Linear Unit (ELU): φ(z) = z if z > 0, α(e^z - 1) if z ≤ 0, where α is a hyperparameter.
Pros: solves the bias shift problem, faster learning for deeper models. Cons: saturates for large negative values and results in inactive neurons.

Softmax: φ(z)_i = e^(z_i) / Σ_{k=1}^{K} e^(z_k) for i = 1, ..., K.
Ranges the output between 0 and 1; different from other activation functions, it gives multiple outputs for a vector of inputs and is usually used in the last layer of multi-class models.
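For reference, the functions in Table 2 take only a few lines of numpy; the softmax below uses the usual max-shift for numerical stability (an implementation detail we add, not part of the table).

```python
import numpy as np

def sigmoid(z):             return 1.0 / (1.0 + np.exp(-z))
def tanh(z):                return np.tanh(z)
def relu(z):                return np.maximum(0.0, z)
def leaky_relu(z, a=0.01):  return np.maximum(a * z, z)
def elu(z, a=1.0):          return np.where(z > 0, z, a * (np.exp(z) - 1))

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()
```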

Table 3
Optimization algorithms in deep architectures*.

Vanilla SGD: θ_{t+1} = θ_t - η g_t.
Reduces the variance of the parameters and leads to more stable convergence compared to batch gradient descent, but the learning is slow.

Momentum: θ_{t+1} = θ_t - v_{t+1}, v_{t+1} = γ v_t + η g_t.
Accelerates the SGD learning process by accumulating the past gradients and moving in their direction at velocity v.

Nesterov momentum: θ_{t+1} = θ_t - v_{t+1}, v_{t+1} = γ v_t + η ḡ_t.
Corrects standard momentum by evaluating the current gradient after applying the velocity at that time step.

AdaGrad: θ_{t+1} = θ_t - (η/√(G_t + δ)) ⊙ g_t, G_t = G_{t-1} + g_t ⊙ g_t.
AdaGrad adapts the learning rates of all the parameters by dividing the rate by the square root of the sum of past and current squared gradients. The accumulation of gradients makes the learning rate shrink to infinitesimally small values for deep networks.

RMSprop: θ_{t+1} = θ_t - (η/√(Ĝ_t + δ)) ⊙ g_t, Ĝ_t = β Ĝ_{t-1} + (1 - β) g_t ⊙ g_t.
Solves the diminishing learning rate issue of AdaGrad by dividing the rate by an exponentially weighted average of squared gradients.

Adam: θ_{t+1} = θ_t - η ŝ_t/(√r̂_t + δ), with s_t = β₁ s_{t-1} + (1 - β₁) g_t, r_t = β₂ r_{t-1} + (1 - β₂) g_t ⊙ g_t, ŝ_t = s_t/(1 - β₁^t), r̂_t = r_t/(1 - β₂^t).
It can be seen as the combination of RMSprop and momentum to enhance the falling learning rate problem in AdaGrad. Adam achieves a huge improvement in terms of speed of training. However, it is shown to have convergence issues for some datasets. Several strategies have been proposed to benefit from the Adam optimizer while solving the convergence issue.

AdaMax: θ_{t+1} = θ_t - η ŝ_t/R_t, R_t = max(β₂ R_{t-1}, |g_t|).
A variation of Adam which uses the exponential moving average of gradients and the past p-norm of gradients, and is mostly used for settings with sparse parameter updates.

* θ: parameters; η: learning rate; m: minibatch size; f(x^(i); θ): the predicted output, where x = {x^(i)}_{i=1}^m is the training minibatch and y^(i) is the ground truth target; L: loss function; g_t = (1/m) Σ_{i=1}^m ∇_θ L(f(x^(i); θ_t), y^(i)): gradient estimate at time-step t; ⊙ denotes element-wise products; ḡ_t = (1/m) Σ_{i=1}^m ∇_θ L(f(x^(i); θ̄_t), y^(i)): gradient estimate at the interim point θ̄_t = θ_t - γ v_t; γ: momentum coefficient; δ: a small constant for numerical stability; β, β₁, β₂: exponential decay rates; s: biased first moment estimator for Adam and AdaMax; r: biased second moment estimator for Adam; R_t: second moment estimator for AdaMax.
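As a worked example, one Adam step from Table 3 can be written in numpy as follows; the default hyperparameters shown are the commonly used values, not prescribed by the paper.

```python
import numpy as np

def adam_update(theta, g, s, r, t, lr=1e-3, b1=0.9, b2=0.999, delta=1e-8):
    """One Adam step following Table 3; returns updated (theta, s, r)."""
    s = b1 * s + (1 - b1) * g            # biased first moment estimate
    r = b2 * r + (1 - b2) * g * g        # biased second moment estimate
    s_hat = s / (1 - b1 ** t)            # bias correction
    r_hat = r / (1 - b2 ** t)
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

theta, s, r = np.ones(3), np.zeros(3), np.zeros(3)
theta, s, r = adam_update(theta, np.array([0.1, -0.2, 0.3]), s, r, t=1)
```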

Activation functions are non-linear functions incorporated into artificial neural units (namely neurons, Fig. 3), which receive the bias term and the weighted sum of the inputs from the previous layer, and let deep neural networks learn highly powerful representations over the forward propagation and backpropagation algorithms. Table 2 summarizes popular activation functions in deep neural networks. Although choosing the proper function depends upon the type of the problem and the depth of the network, it is recommended to start with ReLU for hidden layers, and then move to alternatives if ReLU does not perform well.

The optimization algorithm plays a vital role in training. The gradient descent method is the first-order optimization technique that is widely used for training. It converges more slowly than second-order alternatives such as Newton and conjugate gradient methods. The traditional gradient descent technique runs through the whole training dataset to perform a single update of the model parameters or weights. Hence, it can be slow and time-consuming to train on a very large dataset. To remedy the issue, the usual practice is to perform the update based on one or a sub-set of training samples [34], which is called stochastic gradient descent (SGD) or mini-batch gradient descent.

Despite the effective training process, SGD faces challenges such as proper learning rate selection, dealing with the sparsity of data, and minimizing highly non-convex error functions while avoiding sub-optimal local minima. Various algorithms have been proposed to tackle the challenges of SGD, see Table 3.

Another critical problem in the context of deep learning is to train models that perform well on both training and test datasets. In this context, regularization is defined as "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error" [24]. A summary of the most common regularization algorithms is shown in Table 4.

3. Prognostics and health management revisit

PHM offers a wide range of tools for system health assessment and reliability improvement and involves many subareas in different aspects.

Table 4
Regularization in training deep networks [24].

L2-Regularization: Aka weight decay; calculates the sum of the squared values of the weights. It shrinks the weight vector and is the most common norm penalty term added to the objective function.
L1-Regularization: The norm penalty includes the sum of the absolute values of the weights and encourages sparser weights than L2, which is good for feature selection; as opposed to L2, it does not provide clear algebraic solutions for quadratic approximations of the objective function.
Early stopping: The idea is to find the model parameters with the best validation error by terminating the training as soon as the validation error starts to increase, and returning the model settings to the previous parameters with the lowest validation error.
Data augmentation: Makes the training set larger by generating fake data, if possible.
Bagging: Also called bootstrap aggregating; an ensemble learning technique in which several models are separately trained on bootstrapped samples, and then aggregated to make an averaged model. It reduces the variance and has a regularization effect by reducing the generalization error [35].
Dropout: A widely used regularization method in which the dependency among the units is reduced by randomly ignoring some units during the training.
Parameter sharing and tying: Constrains sets of parameters (in various models or components of one model) to be equal according to prior knowledge of the models' dependency. It is widely used for the domain adaptation task, see Section 4.11. CNN networks also benefit from parameter sharing to reduce the number of parameters.
Manifold regularization: Manifold regularization techniques, such as tangent prop, manifold tangent classifier, etc., are data-dependent methods and are based on the idea that data of the same classes mostly come from the same low-dimensional manifolds.
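A self-contained sketch of the early-stopping rule from Table 4 follows; the patience value and the toy loss sequence are illustrative choices.

```python
def early_stopping(val_losses, patience=10):
    """Return the epoch to stop at: the last epoch that improved the
    validation loss before `patience` consecutive non-improving epochs."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch   # restore the parameters saved at this epoch

stop = early_stopping([1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74], patience=3)  # -> 2
```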

This section presents a brief overview of the standard data-driven PHM framework, including constituent parts, performance assessment metrics, and existing datasets, so that readers might use them for their model evaluation.

3.1. The modules of the traditional prognostics and health management cycle

As illustrated in Fig. 4, PHM is mainly considered as the combination of several tasks to reduce the total life-cycle cost of equipment. The following paragraphs define the terminologies and commonly used techniques:

The Data Acquisition module comprises condition monitoring sensors (e.g., accelerometers, acoustic emission sensors, thermometers, etc.) and data storage and transmission devices, which provide initial monitoring information from machinery.

Feature extraction in PHM mostly refers to signal processing algorithms in the time, frequency and time-frequency domains to transform raw measurement data into informative signatures of the behavior of the system. Statistical time-series features such as RMS, kurtosis, crest factor and skewness, and frequency-domain signatures including spectral, envelope and cepstrum analysis, are extensively used for stationary signals. Time-frequency methods such as the short-time Fourier transform (STFT), empirical mode decomposition (EMD), wavelet packet transform (WPT), Hilbert-Huang transform (HHT), etc., on the other hand, achieve better results for non-stationary signal analysis [36-38].
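A hedged sketch of the statistical time-domain features listed above (numpy/scipy; the segment length is arbitrary, and scipy's kurtosis returns the excess kurtosis by default):

```python
import numpy as np
from scipy.stats import kurtosis, skew

def time_domain_features(x):
    """Common statistical features of a vibration segment (Section 3.1)."""
    rms = np.sqrt(np.mean(x ** 2))
    return {
        "rms": rms,
        "kurtosis": kurtosis(x),                    # excess kurtosis
        "crest_factor": np.max(np.abs(x)) / rms,
        "skewness": skew(x),
    }

features = time_domain_features(np.random.randn(4096))  # stand-in segment
```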
Feature selection algorithms remove irrelevant and redundant features by selecting the optimal feature subset through filters, wrappers, or embedded methods [39]. Furthermore, dimensionality reduction techniques such as principal component analysis (PCA), linear discriminant analysis (LDA) and kernel principal component analysis (KPCA) have been widely adopted to generate a new subset of lower-dimensional features while retaining intuitive information of the original features [40,41].

In the system health management discipline, an anomaly refers to time instances when the system behaves differently from normal, and the reason may or may not lie in an incipient fault or failure. Classical methods of anomaly detection, such as density-based techniques, support vector machines, Hidden Markov models, Bayesian networks, ensemble techniques, etc., have been broadly used in the system health assessment domain [42-44].

Fig. 4. Modules of traditional data-driven PHM cycle vs. deep learning based PHM model.

Table 5
Performance metrics for PHM model evaluation.

Measure Notation Reference


Diagnostics
Confusion matrix criteria:
 Accuracy and Error rate ACC, ER Ali et al. [52]
 Precision PR Shao et al. [53]
 Sensitivity (Recall) SN Shao et al. [53]
 F1-score F1 Shao et al. [53]
 Correlation coefficient CC Lou et al. [54]
Receiver operating characteristic (ROC) curve:
 Area under the curve AUC Batista et al. [55]
Detection error trade-off curve DET Batista et al. [55]
Prognostics
Offline evaluation; ground truth RUL data:
 Mean absolute error MAE Zhu et al. [56]
 Root mean squared error RMSE Deutsch et al. [57]
 Mean absolute percentage error MAPE Deutsch et al. [57]
 Prediction horizon PH Saxena et al. [58]
 α-λ accuracy α-λ ACC Saxena et al. [58]
 Convergence Con Saxena et al. [58]
 Relative accuracy RA Saxena et al. [58]
 Confidence interval CI Chen et al. [59]
 Prognostic accuracy criterion PAC Nguyen et al. [60]
 Hybrid criterion HyC Nguyen et al. [60]
 Exponential Transformed accuracy ETA Nectoux et al. [61]
Offline evaluation; run-to-failure data:
 Mean prediction error and standard deviation E, sd Zemouri et al. [62]
 Overall average bias OAB Zemouri et al. [62]
 Overall average variability OAV Zemouri et al. [62]
 Reproducibility Rep Zemouri et al. [62]
 Predictability Pred Javed et al. [63]
Online evaluation:
 Online root mean squared error Online RMSE Hu et al. [64]
 Coverage rate CR Hu et al. [64]
 Average width AW Hu et al. [64]

Diagnostics is the critical step after anomaly detection to identify the health status of the system by analyzing the severity levels of degradation. Classical supervised machine learning algorithms such as SVMs, random forests, k-nearest neighbors (KNN), artificial neural networks (ANNs), etc. have been trained on labeled datasets to accurately classify fault types [1,2,45].

Prognostics refers to detecting incipient failures and the associated RUL of the equipment to assess the reliability and support timely decision-making for maintenance operations. Numerous data-driven methods have been adopted to address prognosis in the PHM cycle, including ANNs, HMMs, particle filtering, Kalman filter variants, and regression methods [5,46,47].

Decision support represents the "health management" part of PHM, which uses the outputs of Diagnostics and Prognostics for taking timely, appropriate maintenance and logistics decisions [48]. Mathematical programming, Markov decision processes, and Reinforcement Learning (RL) techniques have been widely used to find the optimal maintenance action and the optimal time of applying it [49-51].

3.2. Performance metrics

A variety of indices are used to evaluate the prediction performance of a PHM model. Besides, depending on the complexity of the model, many researchers have proposed new measures for the evaluation of RUL prediction for prognostics. Table 5 summarizes the most commonly used metrics. Readers may refer to the references in the last column for more information.
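For illustration, three of the most common offline RUL metrics from Table 5 (MAE, RMSE, MAPE) can be computed as follows; the example arrays are made up, and the small epsilon guarding MAPE against division by zero is our addition.

```python
import numpy as np

def rul_metrics(y_true, y_pred):
    """Offline RUL prediction metrics against ground-truth RUL (Table 5)."""
    err = y_pred - y_true
    return {
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAPE": 100.0 * np.mean(np.abs(err) / np.maximum(y_true, 1e-8)),
    }

print(rul_metrics(np.array([120., 80., 40.]), np.array([110., 85., 30.])))
```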
3.3. Public datasets

Despite recent advances in data acquisition and sensor technology, acquiring enough high-quality data for data-driven approaches is still difficult and challenging. The long-term deterioration process and machinery break-down in service make it time-consuming and impracticable to collect high-resolution run-to-failure data. Moreover, the measurements collected during the out-of-service period of the machinery usually do not reflect the real working-situation behaviors. To facilitate PHM model development, Table 6 presents public datasets for diagnostics and prognostics.

4. Deep learning and system health management

In this section, we mainly discuss the existing DL-based architectures for PHM tasks, starting with a short description of the study selection criteria to analyze the current status and the research trends.

4.1. Methodology of the survey

The authors have conducted a thorough search in the electronic databases Google Scholar, Scopus, Web of Science, IEEE Xplore, and ScienceDirect using the keywords "fault detection" OR "fault diagnosis" OR "prognostics" OR "condition monitoring" OR "remaining useful life" AND either "deep learning" OR the name of the particular deep network. The search retrieved 227 articles in the period between 2013 and September 2019. We have screened the articles carefully and eliminated the repeated studies. After applying the following exclusion criteria, a total of 137 studies were retained and analyzed:

- EC1. Books, graduate theses, letters, and patents are not considered for review.
- EC2. Conference entries and preprint papers are excluded unless they are highly cited and not published in any journal.

Table 6
Public datasets for system health management.

Dataset Task Comment Link*


Bearing
KAT [65] Diagnostics Motor currents and vibration signals over four different conditions. 1
Case Western Reserve Diagnostics Delivers health-related vibration measurements of bearings at different locations under four 2
University (CWRU) [66] various loading conditions of the motor.
Lu et al., Centrifugal pump Diagnostics Vibration data are collected under normal conditions and fault conditions, including bearing roller 3
bearing/impeller [67] wearing, inner race wearing, and outer race wearing fault conditions, as well as impeller wearing
fault.
PRONOSTIA (IEEE PHM’12) Prognostics: RUL Measures vibration and temperature and the rotating speed is stable. 4
[61] prediction
IMS [68] Prognostics: HI Compared with PRONOSTIA, longer degradation process makes the data closer to the real industrial 4
construction, RUL case. However, the RUL prediction gets more complicated.
prediction
Turbofan engine
CMAPSS (IEEE PHM’08) [69] Prognostics: RUL Provides temperature, speed, pressure, and bleed measurements under six different operating 4
prediction conditions. Therefore, is the right candidate for multi-sensor fusion algorithms.
Gearbox
PHM’09 Diagnostics Unsupervised fault detection 5
Li-ion battery
Idaho national lab [70] Prognostics: State of Battery aging experiment affords operational profiles data of four batteries at room temperature. 4
health estimation
HIRF (IEEE PHM’15) [71] Prognostics: state of Battery current, voltage, and state of charge (SOC) available. 4
health estimation
Randomized battery usage Prognostics: State of Aging voltage and current data are collected over randomized discharge profiles. 4
[72] health estimation
Tool wear prediction
PHM’10 Prognostics: Health Collected data include wear measurements, vibration, acoustic emission, and force readings. 6
assessment
Milling dataset Prognostics: wear Provides measurements from acoustic emission and vibration sensors. 4
prediction, RUL
prediction
Industrial plant
PHM'15 [73] Diagnostics Contains time-series measurements for thirty plants with various components. 7
Bogie
PHM'17 Diagnostics: Fault detection and isolation Provides vibration data from different components of the vehicle. 8
*
1: KAT-Data Center, 2: Case Western Reserve University data center, 3: https://journals.plos.org/plosone/article?id=https://doi.org//10.1371/journal.pone.0164111, 4:
NASA Prognostics Center data repository, 5: https://www.phmsociety.org/references/datasets, 6: https://www.phmsociety.org/competition/phm/10, 7: https://www.phm-
society.org/events/conference/phm/15/data-challenge, 8: https://www.phmsociety.org/events/conference/phm/17/data-challenge.

- EC3. Non-primary studies such as literature survey articles are not included.
- EC4. Only unique studies are analyzed. For repeated studies with minor changes, the other copies of the study are excluded.
- EC5. Studies that do not report the performance metric results are excluded.
- EC6. Studies that do not contain validation or experimental verification are not considered for review.

Fig. 5 shows the popularity of various deep learning architectures among PHM researchers and the distribution of the publications per year considering the variety of categories. There is a significant growth of related published papers in recent years. The figure excludes pre-trained modified CNN architectures, which are discussed in Section 4.11.1.

4.2. Bibliometric analysis

To get further insights into the structure of the paper, co-word analysis was undertaken based on related keywords in the selected 127 research papers. Fig. 6 is the projection of the co-occurrence relationship among the top 29 frequently occurring keywords, which is visualized in the VOSviewer tool [74]. Keywords include well-known system types (battery, bearing, gearbox, bogie, aircraft engine, etc.), PHM tasks (fault detection, fault diagnosis, prognostics, RUL estimation, etc.), deep neural network categories (CNN, RNN, GAN, etc.), data types (vibration, current signal, and acoustic emission), and several learning problems in the realm of machine learning (transfer learning, domain adaptation, and unsupervised learning). Every keyword is represented by a colored circle. The size of the circle indicates the weight of the term occurrence in the literature. Also, the weight of the links between the nodes represents the degree of co-occurrence of connected keywords. VOSviewer uses a modularity-based clustering technique to group the most co-occurred keywords in the same cluster. We have merged the smaller clusters to eliminate unnecessary details. The final map is identified by three clusters (red, green, and blue), and all points with the same color are members of the same cluster. We have found the degree of centrality of each keyword to find the most representative keyword of each cluster: Fault detection (blue), Fault diagnosis (red), and Prognostics (green). Looking at these keywords and the terms within each cluster provides us a glimpse of the interconnection among various tasks and deep neural networks on the basis of available public datasets for practice. For example, at first glance, one can say fault diagnosis is much more studied than prognostics. Also, bearings seem to be the most studied components for PHM, as they are critical components of engineered systems. The other reason lies in the availability of public datasets for rolling bearings. Moreover, the green cluster nodes indicate the nearness of the terms "prognostics", "RUL estimation", "RNN", and "Battery", showing the significance of prognostics for batteries. Furthermore, as expected and will be discussed, RNNs are the most commonly used networks for RUL prediction.

4.3. Taxonomy of deep learning in PHM

In recent years, research on the use of deep learning for representation learning, time series classification and prediction in the

field of PHM has gained growing attention. The DNN models can be mainly divided into three categories: generative models, discriminative models, and hybrid models, as shown in Fig. 7.

Generative models define a joint probability distribution over input and target variables and can be used to generate new instances from the underlying distribution of data. Among the models in this class are VAEs, DBMs, DBNs, and GANs. Discriminative models estimate the conditional probability distribution P(y|x), where y and x are the target variable (discrete class or scalar prediction) and the observation variable, respectively. Discriminative models do not attempt to model the underlying distribution of the variables and only perform a mapping from the inputs to the desired targets [75]. CNNs, RNNs, and autoencoders (excluding VAEs) are common discriminative models in PHM.

In this paper, the hybrid models refer to deep architectures that combine various DNNs (generative and/or discriminative). In such models, the generative part aids the discrimination either in optimization, via providing a good initialization, or by reducing the overall complexity of the models [76]. These architectures can take advantage of the strengths of both discriminative and generative models. The application of the models mentioned above in PHM is discussed in the following subsections. In addition to the networks shown in Fig. 7, several studies proposed new DNN architectures that have generative and/or discriminative components. These models are discussed as emergent models under Section 4.10.

Fig. 5. (a) The density of various deep learning architectures in PHM from 2013 until September 2019, (b) Breakdown of the papers in the year of publication.
4.4. Deep belief networks

Deep belief networks were the first successfully trained deep networks and the first deep models applied in the PHM domain. Tamilselvan and Wang [77] developed a DBN-based multi-sensory fault diagnosis framework and leveraged the hierarchical architecture of the DBN to handle heterogeneous sensory signals. Similarly, Tran et al. [78] used a DBN classifier with Gaussian Bernoulli units for fault diagnosis of reciprocating compressor valves. They extracted time-domain and frequency-domain features of heterogeneous signals and applied generalized discriminant analysis (GDA) to reduce the dimensionality of the feature space.

Despite the substantial improvement of the mentioned studies over conventional models, hyperparameters of the model such as the number of layers, the size of the layers and the learning rate have been randomly selected, which is shown to reduce the model efficiency significantly. To remedy the issue, Shao et al. [79] adopted a particle swarm optimization (PSO) algorithm to decide the optimal hyperparameters for the fault diagnosis of rolling element bearings. In their recent study, Tang et al. [80] proposed an adaptive learning rate with Nesterov momentum to accelerate network training and improve the performance.

Deep belief networks may also act as intermediate feature extractors. Yuan et al. [81] trained two DBNs to learn intermediate representations of vibration and acoustic emission signals separately. They used wavelet packet transform (WPT) features as the inputs of the DBNs. Furthermore, Liang et al. [82] proposed a novel raw signal segmentation method named Grassmann manifold-angular central Gaussian distribution to capture fault impulse information. A DBN is adopted to reduce the feature space dimension and extract more discriminative features. In [83] a new vibration imaging method is used to capture fault information of the rotor system in different directions. Pretraining of a deep belief network with a vibration image is conducted in an unsupervised manner for high-level and scalable feature extraction. Deutsch and He [57] made use of the DBN to predict the remaining useful life of rolling element bearings for prognostics applications.

There are other studies that addressed the application of DBNs in system health assessment [84-89]. While most of them still need hand-crafted features and manual signal processing expertise, a few studies used the DBN as an end-to-end solution from raw input data and achieved comparative performance [90-92].

4.5. Deep Boltzmann machines

Although deep Boltzmann machines are powerful in capturing complex representations of data, particularly in cases of non-stationary signals and multi-sensory data with different modalities, their inference process is slow and costly, and the authors found limited studies that applied DBMs in PHM applications. Li et al. [93] adopted separate Gaussian Bernoulli DBMs (GDBMs) to extract high-level features of vibration signals in three modalities for gearbox fault diagnosis. A support vector classifier is used to fuse the representations towards effective fault classification, and the model is verified with both spur and helical gearboxes. In [94] the authors applied DBMs to learn representations of acoustic emission and vibration signals for fault diagnosis of the gearbox. A collaborative approach is proposed by Hu et al. [95] to deal with industrial fault diagnosis. A DBM turns the raw inputs into binary feature vectors, and a forest ensemble is used to concatenate the features. They utilized sliding windows to truncate the feature vectors, and a complete-random forest performs the classification. In their recent work, Wang et al. [96] leveraged the DBM for the prognosis of centrifugal compressors in smart manufacturing. Raw vibration signals are normalized through the Gaussian neurons of the DBM, and the model can learn the complex representation of the input sequence. The Particle Swarm Optimization algorithm searches the optimal hyperparameters, and a hybrid modified Liu-Storey conjugate gradient accelerated the pre-training step of the model.

Fig. 6. Network visualization for related keywords in the review content according to co-occurrence terms.

Fig. 7. Taxonomy of deep learning architectures in PHM.

4.6. Deep autoencoders

After convolutional neural networks, deep autoencoders are the most studied deep models in PHM applications. The earliest deep AE models specifically stack multiple autoencoders to learn more complex representations of the data. For instance, Zhou et al. [97] utilized the SAE for a bearing fault classification problem. Shao et al. [98] adopted a deep AE with a modified maximum correntropy-based loss function for fault diagnosis of gearboxes and electrical locomotive roller bearings. The modified loss function is more robust to non-stationary noises and enhances the feature learning task. An artificial fish swarm algorithm (AFSA) is used to optimize the hyperparameters. In their other study [53], they proposed an ensemble deep AE model for intelligent fault diagnosis of rolling bearings. Firstly, raw vibration signals are fed to various SSAEs with different activation functions, and the Softmax classifier performs the fault diagnosis. A new combination strategy based on majority voting determines the threshold value for the diagnostic accuracy of individual SSAEs, and the ensemble model is used for feature learning with the training samples. Furthermore, in [53] they proposed a novel deep autoencoder model with Gaussian wavelet activation functions and raw vibratory signals.

Regularized autoencoders enhance the generalization of the model and provide more robust representations. There have been numerous studies that leveraged regularized variants of autoencoders in PHM applications. For example, the rolling bearing fault diagnosis framework described in [99] benefits from a modified DAE with an improved norm penalty and a new preprocessing method. Capturing the temporal dependency of the measurement data is a challenging task in vibration-based fault diagnosis. To address the issue, Jiang et al. [100] proposed a deep DAE-based model for wind turbine fault detection. Firstly, a sliding window is applied to multi-sensory time-series data to capture the current and the past temporal information in a small time frame. Then, a robust multivariate reconstruction of the processed data is built via the DAE.

Most of the studies above are based on the assumption of stationary operating conditions; however, real-case machinery works under varying conditions and the signals are non-stationary, which makes it challenging to extract fault features. Luo et al. [101] built an SSAE for the early fault detection of CNC machines. The vibration signals are divided into fixed-length smaller samples using a sliding frame and labeled into impulse vs. non-impulse classes. The SSAE model is trained to determine the impulse responses of the data. A state-space model is adopted to estimate the dynamics of the machinery using the impulse response data, and a dynamic property similarity-based health indicator is constructed for health monitoring tasks. In [102], the authors utilized the SSAE for fault diagnosis of the gearbox with emerging new fault conditions. The proposed SSAE framework assigns new labels to the samples that deviate from the Gaussian distribution and achieved higher accuracies compared to the standard SSAE model. Liu et al. [103] utilized the SAE for multi-sensor fusion-based fault diagnosis of rotating machinery.

Wang et al. [104] adopted a batch normalization optimization method to reduce the internal covariate shift problem between hidden layers of the SSAE for gearbox fault diagnosis and achieved results superior to the raw SSAE model. Sun et al. [105] used the compressed sensing idea with less measured data for SSAE-based fault diagnosis of rolling bearings. The selective stacked denoising autoencoder network with negative correlation learning (Selective-SDAE-NCL) is proposed by Yu [106] for gearbox fault diagnosis. In their model, the ensemble supervised fine-tuning of SDAE components via NCL is used to account for different aspects of the data. The PSO algorithm produces the optimal subset of SDAE components, Fig. 8. Other PHM studies with stacked sparse denoising autoencoders (SSDAE) have been carried out by Jian et al. [107] for wind turbine fault diagnosis, Lu et al. [108] and Guo et al. [109] for fault diagnosis of rolling bearings, Shi et al. [110] for tool condition monitoring, and Zhang et al. [111] for fault diagnosis of a solid oxide fuel cell system.
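Several of the studies above segment raw signals with a sliding window before training; a minimal sketch of that preprocessing step follows (the window width and stride are arbitrary choices, not values from the cited works).

```python
import numpy as np

def sliding_windows(signal, width=2048, stride=512):
    """Split a 1-D signal into fixed-length, possibly overlapping segments,
    as commonly done before feeding condition monitoring data to a DNN."""
    n = (len(signal) - width) // stride + 1
    return np.stack([signal[i * stride : i * stride + width] for i in range(n)])

segments = sliding_windows(np.random.randn(100_000))  # -> shape (192, 2048)
```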

Fig. 8. Selective-SDAE-NCL for gearbox fault diagnosis [106]. First, the hierarchies of SDAEs are pre-trained with bootstrapped samples in an unsupervised fashion. Then, the ensemble supervised fine-tuning of the SDAE components via NCL is carried out to account for different aspects of the data. Finally, the PSO algorithm produces the optimal subset of SDAE components.

Despite the achievements of the DAE for automatic feature extraction in fault diagnosis applications, it is challenging to select the best corruption level. Some authors used the contractive autoencoder (CAE) for more convenient and robust representation learning. For example, Shen et al. [112] proposed a CAE-based model for automatic feature learning in gearbox and rolling bearing fault diagnosis problems. The input was frequency-domain data, and compared to other regularized autoencoders, they could obtain higher correlation coefficients under different signal-to-noise ratios (SNRs).

Contractive autoencoders penalize the sensitivity of the features and encourage the robustness of the representation rather than of the reconstruction, as in denoising autoencoders. Hence, they may provide better performance and generalization. However, the CAE cannot probe large perturbations of the input. Shao et al. [113] suggested an enhanced feature learning method combining the characteristics of the DAE and CAE for fault diagnosis of electrical locomotive bearings. The raw vibration signals are fed into a DAE to extract low-level fault features. A stack of multiple CAEs is then used for higher-level and robust feature extraction. Similarly, in [114], the authors leveraged a hybrid autoencoder representation learning based on the DAE and CAE for fault diagnosis of rolling bearings from raw vibration signals.

Although the generative variant of autoencoders, known as the variational autoencoder (VAE), has shown outstanding results in complex latent representation learning in varied applications, few studies have used the VAE in the field of PHM. Ping et al. [115] utilized a VAE to extract deterioration features of complex rotary machinery. They proposed log-normally distributed latent variables instead of standard normal units to address the heteroscedasticity issue of degradation data. In another study [116], the authors leveraged a deep VAE for fault diagnosis of rolling bearings using raw vibration measurements. Zhan et al. [117] integrated a VAE in a semi-supervised learning-based network with multiple association layers for fault diagnosis of the planetary gearbox. They applied the wavelet packet transform to capture impulse components of the vibration signals, and trained the model with a combination of labeled and unlabeled samples. A conditional VAE (CVAE) network [118] is adopted for planetary gearbox fault diagnosis under noisy conditions. As opposed to the standard VAE, the CVAE models the features conditioned on some random variables and achieves a better reconstruction. Despite the successful examples above, there is still room for leveraging VAEs in health monitoring applications, especially for dealing with heterogeneous and incomplete data in real industrial machinery [119].
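To make the VAE mechanics concrete, a minimal PyTorch sketch of the reparameterization trick and the usual reconstruction-plus-KL objective is shown below; the layer sizes are hypothetical and the sketch does not reproduce any of the cited models.

import torch
import torch.nn.functional as F

class VAE(torch.nn.Module):
    def __init__(self, in_dim=1024, latent_dim=16):
        super().__init__()
        self.enc = torch.nn.Linear(in_dim, 128)
        self.mu = torch.nn.Linear(128, latent_dim)
        self.logvar = torch.nn.Linear(128, latent_dim)
        self.dec = torch.nn.Sequential(
            torch.nn.Linear(latent_dim, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, in_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: sample z while keeping gradients w.r.t. mu, logvar.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    rec = F.mse_loss(x_hat, x, reduction='sum')                    # reconstruction term
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL term
    return rec + kld

A CVAE differs only in that a condition vector is concatenated to the encoder and decoder inputs.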
4.7. Convolutional neural networks

As shown in Fig. 5, the convolutional neural network (CNN) is the most widely applied deep model in the PHM field. Chen et al. [120] adopted a 1-D CNN for gearbox fault identification. They fed time- and frequency-domain features of the vibration signal to the model and performed parameter tuning to find the optimal architecture of the CNN. Guo et al. [121] demonstrated a CNN-based health indicator (HI) construction method for rolling bearing prognosis. A novel outlier region removal technique is applied to reduce the trend burr effect and enhance the prognostic performance. A new HI assessment metric named scale similarity facilitates picking the proper failure threshold when HIs in the training set have different range scales. Similarly, a deep convolutional neural network was applied by Belmiloud et al. [122] for the RUL estimation of rolling bearings. Many studies have used 1-D CNNs for fault classification of rolling bearings [123-128]. Jing et al. [129] made use of a 1-D CNN for multi-sensory fault diagnosis of the planetary gearbox. They utilized four types of signals, including acoustic, vibration, current, and instantaneous angular speed signals, to integrate data-level, feature-level, and decision-level fusions into an optimized deep CNN. In [130], the authors used raw acoustic signals in time and frequency domains for gear fault diagnosis, and leveraged a multi-channel CNN to fuse information from different microphones. Liu et al. [131] carried out the simultaneous diagnosis and prognosis of rolling bearings using a joint-loss CNN model. Zhang et al. [132] proposed a CNN-based fault diagnosis framework with residual blocks. The identity skip connections in the model allow direct propagation of information throughout the network and enhance high-level feature extraction.

Convolutional neural networks were originally designed for image analysis tasks. Hence, different researchers investigated approaches to preprocess and convert time-series data into 2-D inputs for system health assessment, see Table 7. Several studies have used time-frequency analysis methods to transform vibration signals into image inputs. Han et al. [133] adopted multi-level wavelet packet matrices as the inputs to several parallel CNNs with shared parameters for gearbox fault diagnosis. Multi-level wavelet packet matrices incorporate non-stationary vibration information from multiple resolutions and cancel the need for level selection in the WPT. Verstraete et al. [134] proposed a novel CNN architecture for rolling bearing fault diagnosis and compared the effectiveness of the model with three different time-frequency representations of the raw signals as input images: spectrograms of the short-time Fourier transform (STFT), scalograms of the continuous wavelet transform (CWT), and Hilbert-Huang transform (HHT) plots. The suggested network consists of two consecutive convolutional layers without any pooling layer between them, but there are pooling layers between the stacks of two-layered convolutional blocks. The model achieves the same accuracies as standard CNNs for scalogram images with significantly fewer learnable parameters and lower computational cost, but outperforms alternative CNN models for HHT images and spectrograms. In [135], Yoo and Baek demonstrated the Morlet-based CWT representation of vibration signals fed into a CNN to construct an HI for remaining useful life estimation of rolling bearings. Although there is no defined method to decide the best wavelet for various PHM scenarios, Morlet wavelets have shown effective results and high similarity to the impulse component of the non-stationary signals of faults in mechanical equipment. Zhu et al. [56] made use of binary interpolation to reduce the dimensionality of the CWT image for the bearing RUL estimation problem. Besides, they utilized a multi-scale convolutional neural network (MSCNN) that keeps global and local features synchronously by using the features of the last convolutional layer and the preceding pooling layer for prediction. A comparative study of the model with other CNN-based models verified the effectiveness of the proposed method.

There are other studies that leveraged multi-scale layers in CNNs to capture more levels of abstraction in the data. In [139], Ding and He adopted the phase space reconstruction (PSR) technique to make the phase space image of the wavelet packets (WP), referred to as the wavelet packet image (WPI). Combined with the MSCNN, the proposed multi-scale feature learning method retains the energy fluctuations of the WP nodes and, hence, provides a robust fault diagnosis framework under fluctuating load conditions, Fig. 9. Inspired by the Inception architecture of CNNs [145] and the dynamic routing capsule net [146], a novel deep net with inception blocks is used by Zhu et al. [136] to address the poor generalization of standard CNNs under varying working conditions.
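As a concrete example of such a signal-to-image transformation, the sketch below converts a raw vibration segment into a normalized STFT log-magnitude image suitable as a 2-D CNN input; the sampling rate and STFT settings are hypothetical.

import numpy as np
from scipy.signal import stft

fs = 12_000                                  # sampling rate (hypothetical)
x = np.random.randn(fs)                      # one second of vibration data
f, t, Z = stft(x, fs=fs, nperseg=256, noverlap=192)
img = np.log1p(np.abs(Z))                    # log-magnitude spectrogram
img = (img - img.min()) / (img.max() - img.min())  # scale to [0, 1] "pixels"

Scalograms (CWT) and HHT plots are produced analogously and differ only in the underlying transform.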

Table 7
Summary of 2-D CNN based models for PHM applications. P: Prognostics, D: Diagnostics. NRMSE: Normalized RMSE.

Study | Task | Input | Architecture | Performance
Yoo et al. [135] | Bearing/P | Morlet-based CWT | CNN | ETA: 0.57
Zhu et al. [56] | Bearing/P | Morlet-based CWT | MSCNN | ETA: 0.3624, MAE: 1091.8, NRMSE: 0.3514
Zhu et al. [136] | Bearing/D | STFT | Inception | ACC: 97.15
Verstraete et al. [134] | Bearing/D | Morlet-based CWT / HHT / STFT | Doubled convolutional layers | ACC: 99.4 / 97 / 99.5
Wang et al. [137] | Gearbox/D | Morlet-based CWT | CNN | ACC: 99.58
Han et al. [133] | Gearbox/D | Multi-level WPT | Ensemble CNN | ACC: 96.48
Li et al. [138] | Bearing/P | STFT | Concatenated convolutional layers | -
Ding et al. [139] | Bearing/D | Energy-fluctuated image in WP phase space | MSCNN | ACC: 98.8
Ren et al. [140] | Bearing/P | Spectrum-Principal-Energy-Vector feature map | CNN | RMSE: 0.119
Hoang et al. [141] | Bearing/D | Gray-scale vibration image | CNN | ACC: 100; ACC: 97.74 under noisy and varying working conditions
Wen et al. [142] | Bearing/D | Gray-scale vibration image | LeNet-5 | ACC: 99.79
Lu et al. [143] | Bearing/D | Matrix reconstruction-based feature map | CNN | ACC: 96.48
Hu et al. [144] | Bearing/D | Compressed sensing-based constructed image | Improved MSCNN | ACC: 99.4 with Gaussian measurement matrix

Fig. 9. Deep CNN architecture with three convolutional layers, two max-pooling layers, and a multiscale layer. The multiscale layer combines the output of the last convolutional layer with the previous max-pooling layer [139].

Alternatively, a few studies adopted innovative methods to incorporate time and frequency information into the inputs. For example, Ren et al. [140] presented a new feature extraction approach named the spectrum-principal-energy vector (SPEV) for RUL estimation of rolling bearings. The vibration signal was subjected to the FFT and then divided into 64 blocks, and the maximum amplitude of each block was obtained to build the spectrum-principal-energy vector. The 64-dimensional vectors at 64 time steps were combined into a 64x64 feature map and fed into a CNN, followed by a deep feedforward network and a final smoothing step, to perform the regression task. Hoang and Kang [141] transformed raw vibration signals into gray-scale images, where the normalized amplitude of each sample gives the intensity of the corresponding pixel in the vibration image. In [142], the authors utilized a closely similar method to convert the time-series measurements into a gray-scale vibration image. They used a CNN model based on the LeNet-5 architecture, an early CNN designed for handwritten and machine-printed character recognition [147]. In [132], Zhang et al. proposed a residual learning-based CNN for the bearing fault diagnosis task.
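The gray-scale conversion used in [141,142] essentially reshapes normalized amplitudes into a pixel grid, which a few lines of NumPy illustrate; the 64x64 image size is an example value, not necessarily that of the cited works.

import numpy as np

def to_gray_image(signal, size=64):
    # size*size consecutive samples become a 2-D image; normalized
    # amplitudes serve as pixel intensities.
    seg = signal[: size * size].astype(float)
    seg = (seg - seg.min()) / (seg.max() - seg.min() + 1e-12)
    return seg.reshape(size, size)

img = to_gray_image(np.random.randn(5000))    # 64 x 64 gray-scale image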
Larger-scale fault diagnosis in complex systems involves numerous disparate measurements from diverse sub-systems, which makes it challenging to capture the spatial dependency information between the components, especially under different operating conditions. Inspired by the spatiotemporal pattern network (STPN), Han et al. [148] presented a spatiotemporal representation learning method to handle multivariate time-series data for fault diagnosis of complex systems, such as wind turbines with unseen fault conditions. In the last step of preparing the inputs, Markov machines are utilized to generate self-state and cross-state transition matrices that build the 2-D image of spatiotemporal features. Despite the achievements of the studies above, most of them lack enough information about the reasons for selecting a certain architecture or pre-processing method.

4.8. Recurrent neural networks

Most system health management tasks deal with time-series measurements, and for a reliable diagnosis and prognosis framework, it is essential to capture the temporal information of the data. Owing to their internal memory and feedback loops, recurrent neural networks (RNN) can remember temporal dependencies and learn the dynamic behavior of the failure. However, vanilla RNNs (basic RNNs) suffer greatly from the vanishing/exploding gradient issue and fail in learning long-term temporal dependencies. The gradient clipping technique is usually applied to limit the magnitude of the gradient by determining a threshold value.
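In modern frameworks, gradient clipping is a one-line addition to the training loop. A minimal PyTorch illustration with a placeholder model and loss follows; the clipping threshold of 5.0 is an arbitrary example value.

import torch

model = torch.nn.LSTM(input_size=8, hidden_size=32)   # placeholder RNN
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(100, 4, 8)        # (sequence length, batch, features)
opt.zero_grad()
out, _ = model(x)
loss = out.pow(2).mean()          # dummy loss for illustration
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip gradients
opt.step()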

Also, different gating mechanisms have been proposed to address the vanishing gradient. Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) are the two most well-known variants of the RNN to remedy the issue above. Guo et al. [149] made use of LSTM to build a health indicator for RUL prediction of rolling bearings. Firstly, they proposed a feature extraction method named related-similarity (RS) to range both frequency- and time-domain features from 0 to 1. After combining the RS features with time-frequency information, a linear combination of correlation and monotonicity metrics is adopted to select the sensitive features. Lastly, the sequence of features is fed into an RNN to construct the health indicator.

Adding more hidden layers to the RNN architecture results in a deep RNN, which is much more powerful in learning complex temporal dependencies of sequential data, but introduces computational complexity to the model. Considering prognostics as a regression problem with a sequential degradation index output, and the RNN's ability to handle complex sequence data, deep RNN-based health monitoring frameworks have been proposed by researchers and have shown effective results [150-155]. Zhang et al. [156] adopted a bi-directional LSTM (BLSTM) with two hidden layers to track health index variations of the turbofan engine. Similarly, Huang et al. [157] proposed a BLSTM-based framework for RUL prediction of the engine under multiple operating conditions. Their model consists of two bi-directional LSTM networks, see Fig. 10. The training set contains multi-sensory data, multi-operational data, and actual RUL values at N consecutive observation cycles; p and q denote the number of sensors and operating conditions (control settings, input settings, etc.), respectively. Initially, the multi-variate time-series are normalized and converted into the desired sequenced data through a time-window processing method. Then, the normalized sensory sequences, as the main inputs of the model, are fed into a deep BLSTM to extract degradation information with long-term dependencies. The working condition sequences are also normalized (named auxiliary inputs) and merged into the output feature vector of the first BLSTM to arrange a new concatenated feature vector. The second BLSTM captures higher-level temporal information of the machinery deterioration, and multiple fully-connected layers, followed by a final regression layer, complete the remaining-useful-life prediction task. An extensive comparative study with state-of-the-art deep models demonstrates the effectiveness of the proposed method for machinery prognosis under complex operating variables.
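A compact Keras sketch of this two-stage idea is given below. It loosely follows the structure of [157] (sensor sequences as the main input and operating-condition sequences merged after the first BLSTM), but the window length, numbers of sensors and conditions, and layer widths are hypothetical.

from tensorflow.keras import layers, Model

T, p, q = 30, 14, 3   # window length, sensors, operating conditions (hypothetical)

sensors = layers.Input(shape=(T, p))
ops = layers.Input(shape=(T, q))
h = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(sensors)
h = layers.Concatenate()([h, ops])            # merge the auxiliary inputs
h = layers.Bidirectional(layers.LSTM(32))(h)  # second BLSTM
h = layers.Dense(32, activation='relu')(h)
rul = layers.Dense(1)(h)                      # RUL regression output
model = Model([sensors, ops], rul)
model.compile(optimizer='adam', loss='mse')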

Fig. 10. The BLSTM-based prognostics framework, adapted from [157]. Initially, normalized multi-variate time-series acquired from p sensors are fed into the first deep BLSTM to extract degradation information with long-term dependencies. Then, q operating conditions are sequenced and normalized within the inspection time frame t through a time-window processing method. Finally, the second BLSTM is fed with the concatenated vector of features from the previous steps, accompanying the actual RUL values, to capture more discriminative features.

Standard bi-directional LSTMs process the data sequence in both forward and backward directions, and at any point in time, the network utilizes the earlier processed observation and the upcoming processed observation (by backward cells) simultaneously to perform an intermediate prediction. However, the RUL estimation task requires a single prediction ahead of the whole given sequence. In [158], the authors targeted this requirement and presented a modified LSTM architecture, named bidirectional handshaking LSTM, for RUL estimation from short sequences of the measurement data.

There is a limited number of studies in the literature that used the GRU for PHM tasks. For example, Zhao et al. [159] adopted an enhanced bi-directional GRU model for three health monitoring case studies: tool wear prediction, gearbox fault diagnosis, and incipient fault diagnosis of rolling bearings. Firstly, the multi-sensory time-series are segmented into fixed-size windows, followed by extracting local features in the time, frequency, and time-frequency domains. Then, the local feature sequences are input to a bi-directional GRU to capture higher-level and more discriminative information of the data. The authors concatenated the outputs of the GRU with the weighted average of the local feature sequence to avoid losing the mid-level information in the model. Kernel principal component analysis was used in [160] to fuse the time-, frequency-, and time-frequency-domain information of rolling bearing degradation. Then, the HI was smoothed through an exponentially weighted moving average technique and fed to a hierarchical GRU-based recurrent network for future HI estimation and RUL prediction.

Although LSTM cells are the most used recurrent units in many applications, there is no evidence to show that one cell is superior to another. GRUs are computationally less expensive and are the right choice for training on smaller datasets. On the other hand, LSTMs may work better for bigger datasets to retain longer temporal information.

4.9. Generative adversarial networks

Generative adversarial networks (GANs) have attracted growing interest in various research areas and have shown some advantages over other well-known deep generative models, e.g., the VAE, in synthesizing excellent-quality samples. Besides, they are trained without any explicit density function, and no Markov chains are required either in drawing the samples or in training. Hence, there is no risk of chain breakage in high-dimensional space as in DBMs, and they have demonstrated appealing performance in dealing with high-dimensional distributions of data. However, despite outstanding success in generating sharp synthetic images, followed by exceptional performance in the computer vision area, there are limited studies that have investigated using GANs in other domains for time-series sensor data.

Recently, the PHM community has started to leverage GANs to enhance their models, targeting two major concerns in industrial fault classification: the imbalanced distribution of health classes and insufficient labeled data. Most fault diagnosis frameworks assume equal proportions of the data for all health conditions. However, real machinery mostly works under normal conditions, and faults rarely happen, so there are abundant healthy-class data, while faulty samples are limited. Also, it is unmanageable to stop the machinery during operation and inspect the fault types. Therefore, the majority of the collected data are unlabeled. To face the first challenge, Li et al. [161] proposed an end-to-end 2-D CNN-based GAN model for bearing and gearbox fault diagnosis. In their model, the concatenated vector of the labels and the randomly generated noise is reshaped into 2-D feature maps and input to the generator. The generator consists of three consecutive deconvolution layers that map the inputs into higher-resolution feature maps. It should be noted that in the CNN context, the term deconvolution refers to the transposed convolution (aka fractionally-strided convolution) and conducts up-sampling by padding the feature maps with zeros. The generated data and the real data are fed into the discriminator, which contains three convolutional layers without any pooling. The discriminator has the mission to find the true data and to identify the fault classes. The model was able to enrich the faulty-class data and handle the imbalanced data issue.

In [162], the authors established a semi-supervised anomaly detection algorithm for imbalanced industrial time-series based on an encoder-decoder-encoder structured generator with convolutional layers. The model is trained just with normal samples, and the test phase considers both normal and faulty conditions. A semi-supervised convolutional GAN combined with switchable normalization was used in [163] for vibration-based fault diagnosis of rolling bearings. Canceling the pooling layers and replacing the batch normalization with switchable normalization increased the training stability and introduced a high accuracy rate of 99.93% into the model for the bearing benchmark dataset. The authors in [164] utilized GANs for modeling the trend in a bearing's health indicator (the RMS of the vibration signals) and used the model to generate future trajectories of the health indicator.

The training process of GANs suffers from instability issues, and they are prone to the mode collapse problem, which means the generator learns a limited subset of the modes and generates the same samples repeatedly. It has been shown that the design of the loss function significantly influences training stability and brings consequent issues into the model [33]. The original GAN structures use the Jensen-Shannon divergence (JSD) probability measurement metric, which is proved to incur the vanishing gradient and mode collapse problems. Many studies in various fields designed alternative loss functions combined with enhanced architectures to overcome these challenges. Wang et al. [165] proposed a generalized imbalanced fault diagnosis framework based on the Wasserstein generative adversarial network (WGAN). In the WGAN, the Wasserstein loss provides a continuous gradient for generator training and solves the mode collapse. Cabrera et al. [166] established an unsupervised GAN model selection mechanism to find the best WPT generator for reciprocating machinery fault diagnosis. In their model, the training process is guided by the dissimilarity of the real and fake data clusters to enhance training stability. The obtained generator balances the 99% imbalanced dataset by producing more fault data. Zhou et al. [167] adopted a global optimization GAN framework to address the imbalanced class issue.

In a recent study [168], Shao et al. proposed an auxiliary classifier GAN (ACGAN) framework to augment the fault dataset, Fig. 11. An auxiliary part is attached to the discriminator so that the enhanced discriminator can recognize the fake data and the fault class labels simultaneously. The generator possesses a 1D convolutional structure with batch normalization and generates the artificial data from random noise of latent variables with certain labels. The discriminator receives the generated data mixed with real samples to identify both the source labels (1 or 0) and the fault classes.

Many emerging generative models, such as adversarial autoencoders (AAE) [169] and Wasserstein autoencoders (WAE) [170], are inspired by adversarial learning, and they have shown promising results in various domains [171-173]. However, GANs and adversarial training are somewhat novel concepts and, despite offering great success in producing realistic images, their possible application in the context of time-series data is still very much open for future research opportunities in different directions.
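For reference, the Wasserstein objective used by WGAN-style models reduces to a pair of simple loss terms. The PyTorch-style sketch below shows only the critic and generator losses and omits the Lipschitz constraint (weight clipping or a gradient penalty) that a complete WGAN additionally requires.

def critic_loss(critic, real, fake):
    # The critic maximizes D(real) - D(fake); we minimize the negative.
    return critic(fake).mean() - critic(real).mean()

def generator_loss(critic, fake):
    # The generator tries to raise the critic's score of synthetic samples.
    return -critic(fake).mean()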

Fig. 11. Auxiliary classifier GAN-based fault classification framework. First, the generator produces samples from the latent space. Then, the discriminator is trained with synthetic samples and real data. The modified loss function makes it possible for the discriminator to distinguish source labels (real or generated) and fault category labels simultaneously. Finally, the parameters of the discriminator are frozen, and the generator is updated to produce more realistic samples [168].

4.10. Hybrid and emergent models

Deep learning is a fast-growing field, and there has been an enormous effort to develop new architectures that offer better performance. Many new models are hybrids of standard architectures (i.e., CNN, RNN, AE, etc.) or are rooted in existing designs. A few studies, however, established novel ideas, but those are either mathematically complex or very application-specific. Likewise, the PHM community is actively developing more effective models. For instance, He et al. [174] established a bearing fault diagnosis framework hinged on Large Memory Storage and Retrieval Neural Networks (LAMSTAR). LAMSTAR is a fast and deep dynamic neural network made of self-organizing map (SOM) modules and has shown reliable results in various domains. They fed the STFT of acoustic emission signals to the model and, compared to CNN-based diagnosis, achieved better performance.

The Convolutional Deep Belief Network (CDBN) was originally proposed for visual recognition tasks and benefits from the weight-sharing property of CNNs to address the upscaling problem of DBNs [175]. An improved CDBN with Gaussian visible units was used to learn representative fault features of rolling bearings [176]. The compressed sensing technique was adopted to enhance computation efficiency while preserving meaningful information. The standard CDBN suffers from an error oscillation issue, followed by weak generalization capability, due to the limited number of Gibbs sampling steps in practical cases. An exponential moving average (EMA) weight smoothing method was employed in their other study [177] to tackle this issue and enhance the learning algorithm.

Dealing with multi-dimensional sensory data with internal dependencies is a critical challenge in most practical PHM frameworks. A few studies integrated convolutional and LSTM layers into a unified model to capture both the spatial and temporal information of multi-dimensional time-series. Zhao et al. [178] adopted a CNN-Bi-directional LSTM (CBLSTM) network for the tool wear prediction task. In [179], the authors established a CNN-LSTM (CLSTM) model with a class-imbalance-weighted loss function for imbalanced fault classification of cyber-physical systems (CPS). Despite the satisfying results, the models above extract spatial and temporal information independently and pay less attention to feature changes between time steps. A Time-distributed Convolutional LSTM (TDConvLSTM) was proposed by Qiao et al. [180] to learn the spatiotemporal information of multi-channel time-series measurements. They segmented the normalized raw data into subsequences and fed them into the model with ConvLSTM cells instead of vanilla LSTM units. The first ConvLSTM layer simultaneously learns local spatial and temporal information inside a subsequence, and stacking a holistic ConvLSTM unit on top of the previous layer extracts the spatiotemporal information between the subsequences.
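The division of labor in such hybrids, with a CNN encoding each subsequence and a recurrent layer linking the encodings, can be sketched as follows. This is an illustrative PyTorch module in the spirit of the models above, not an implementation of [180]; all sizes are hypothetical.

import torch
import torch.nn as nn

class ConvThenLSTM(nn.Module):
    def __init__(self, channels=4, hidden=64, n_classes=5):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(channels, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))          # one 16-dim vector per subsequence
        self.lstm = nn.LSTM(16, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                     # x: (batch, n_sub, channels, length)
        b, n, c, l = x.shape
        feats = self.cnn(x.reshape(b * n, c, l)).squeeze(-1).reshape(b, n, 16)
        out, _ = self.lstm(feats)             # temporal relation across subsequences
        return self.head(out[:, -1])          # classify from the last state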
It has been shown that encoder-decoder structured RNNs introduce multiple improvements to complex sequence-to-sequence tasks analogous to RUL estimation from time-series measurements in the prognostics application. Malhotra et al. [181] proposed the LSTM Encoder-Decoder (LSTM-ED) framework for unsupervised health indicator construction of the system. Similarly, the GRU Encoder-Decoder (GRU-ED) network demonstrated significantly robust results on RUL estimation of the turbofan engine dataset when encountering different noise levels [182]. Moreover, a DAE network with GRU hidden units has indicated fault diagnosis accuracy superior to the standard GRU network [183].

Table 8 provides a summary of the hybrid and emerging models. Adversarial twists of the standard CNN, RNN, and AE have recently attracted attention in deep learning research, and there are a few related studies in the PHM field. Table 9 gives a summary of the most well-known deep networks and their characteristics.

4.11. Transfer learning and domain adaptation

Compared to conventional data-driven approaches, deep learning techniques remove the burden of manual feature engineering and achieve state-of-the-art results. Despite the marvelous performance, the majority of the studies are based upon the assumption that the training and test data are drawn from the same distributions. However, in the real industry, the data are collected under different operating and environmental conditions and during different time intervals, often resulting in a feature space difference or a distribution shift across the training and testing datasets. Moreover, labeling the industrial data is costly and error-prone, and requires huge human labor and expertise. Therefore, there is not sufficient annotated data to train reliable models. Transfer Learning (TL) and Domain Adaptation (DA) approaches focus on improving the model by transferring the knowledge, or utilizing the transferable features, from one or more training datasets to execute the relevant new task on the testing dataset.

Deep TL and DA techniques have gained growing attention recently in the computer vision field and achieved excellent results for object classification, object recognition, and semantic segmentation applications.

Table 8
Survey of hybrid and emergent models where measures are rounded to two decimal places.

Publication | Task/Dataset* | Model | Performance
Shao et al. [176] | Bearing fault diagnosis/NA | CDBN | ACC: 97.37
Shao et al. [184] | Bearing fault diagnosis/NA | CDBN | ACC: 97.44
Wu et al. [179] | Fault diagnosis/PHM'15 | CLSTM | PR: 98.42, SN: 98.46, F1: 0.98
Zhao et al. [178] | Tool wear prediction/PHM'10 | CBLSTM | RMSE: 10.8, MAE: 8.1
Qiao et al. [180] | Gearbox fault diagnosis/NA; tool wear prediction/PHM'10 | TDConvLSTM | ACC: 97.56; RMSE: 10.22, MAE: 7.50
Yoon et al. [185] | RUL estimation/CMAPSS | LSTM-VAE | MAE: 28.13.4
Malhotra et al. [181] | HI construction/CMAPSS | LSTM-ED | MAE: 18, MAPE: 39
Gugulothu et al. [182] | HI construction/CMAPSS | GRU-ED | MAE: 17, MAPE: 39
Liu et al. [183] | Fault diagnosis/CWRU | DAE-GRU | No noise: ACC: 99.75; 1 dB SNR: ACC: 96.98
Chen and Li [186] | Bearing fault diagnosis/NA | SSAE-DBN | ACC: 91.76
Lu et al. [187] | Early fault detection/IMS | AE-LSTM | Fault alarm at the 527th signal snapshot
Ellefsen et al. [188] | RUL estimation/CMAPSS | RBM-LSTM | Lowest RMSE: 12.56, highest RMSE: 22.66; lowest ETA: 231, highest ETA: 2840
Li et al. [189] | Cross-domain fault diagnosis/CWRU | CNN-Generative | Lowest ACC: 69.4, highest ACC: 84.5
Zhang et al. [190] | Cross-domain fault diagnosis/CWRU | Adversarial CNN | Lowest SN: 73.75, highest SN: 98.88
Han et al. [173] | Fault diagnosis/PHM'09 | Adversarial CNN | Seen condition: ACC: 99.4, PR: 99.3, F1: 0.99; unseen condition: ACC: 92.5, PR: 92.3, F1: 0.92
He and He [174] | Bearing fault diagnosis/NA | LAMSTAR | Lowest ACC: 96, highest ACC: 100
Yu et al. [191] | RUL estimation/CMAPSS; milling dataset | BLSTM-ED | ETA: 273, RMSE: 14.74; RMSE: 7.14

* Refer to Table 6 for public datasets information.

Table 9
Summary of deep baseline models and their characteristics.

Network | Representation learning | Merits | Demerits
RBM/DBN | Unsupervised/generative | Achieves appealing results with raw vibration data, without considerable preprocessing effort | Model performance is highly reliant on initialization of the parameters
DBM | Unsupervised/generative | Learns complex representations; top-down feedback allows good uncertainty propagation | Intensive computation of joint optimization
SAE | Unsupervised/discriminative | Easy implementation; tractable optimization function; dimensionality reduction technique* | Not good at preserving the relationships among inputs; risk of learning the identity function without extracting meaningful information
SSAE | Unsupervised/discriminative | Better generalization by introducing sparse features; easily separable classes due to sparse features | Less robust than other regularized autoencoders
DAE | Unsupervised/discriminative | Learns a robust representation; reconstructs the clean data from a corrupted input; easy implementation | Stochastic regularization; partial robustness
CAE | Unsupervised/discriminative | More robust representation than DAE (encouraging robust features rather than robust reconstruction); analytical regularization; deterministic gradient; more stable than DAE | High computational cost; does not probe large perturbations
VAE | Unsupervised/generative | Learns the complex probability distributions of the latent space rather than fixed scalars, and can generate new instances; data imputation ability to handle incomplete datasets | Results are dependent on the expressiveness of the inference; bad local optima issue
CNN | Supervised/discriminative | Preserves spatial information; good for high-dimensional data | Overfitting; model accuracy is highly reliant on parameter initialization
RNN/BRNN | Supervised/discriminative | Suitable for sequential and time-series data; captures time dependencies of the data; stores temporal information | Difficult training process; vanishing/exploding gradient
GRU | Supervised/discriminative | Simpler than LSTM and computationally more efficient; vanishing/exploding gradient remedy; better generalization than LSTM with less data | Slightly less control than LSTM over information flow
LSTM | Supervised/discriminative | Better than GRU in dealing with the vanishing/exploding gradient issue; remembers longer sequences than GRU; better generalization than GRU with more data | Complicated structure; slower training process compared to GRU
GAN | Unsupervised/generative | Does not require Markov chains; often generates the most realistic samples among generative models; active research area toward promising results | Training instability; difficult optimization; mode collapse of the generator, which leads to generating samples with limited variety; subjective evaluation

* All the autoencoder variants have this property in common.

Although some studies have investigated TL, and specifically DA-related deep learning approaches, for PHM to address issues such as insufficient training data, class imbalance, cross-domain fault diagnosis, and covariate shift, the field is still in its infancy.

We define a domain D = {X, P(X)} with feature space X and marginal probability distribution P(X), where X = {x_1, x_2, ..., x_n} is drawn from the feature space. Also, the desired task T = {Y, P(Y|X)} consists of a label space Y and a conditional probability distribution P(Y|X), where Y = {y_1, y_2, ..., y_n} belongs to the label space. Assume D_s = {X_s, P(X)_s} is a training dataset with sufficient annotated data (known as the source domain) and D_t = {X_t, P(X)_t} is a test dataset with no or few labeled data (namely, the target domain). Traditional machine learning approaches assume that D_s = D_t and T_s = T_t. However, in reality, the source and target datasets differ by domain (D_s != D_t), by task (T_s != T_t), or both [192]. Transfer learning encompasses all three settings, while DA approaches address the first one. In the following subsections, two techniques within the TL paradigm are discussed. The first subsection focuses on fine-tuning models pre-trained on a source domain for a new target task; in this case, the target and source tasks do not necessarily need to be similar. The second subsection discusses DA techniques for solving the domain shift issue in PHM; for DA, the target and source domains share the same label space, or the source label space should be a subspace of the target label space.

4.11.1. Transfer learning with pre-trained models

As opposed to traditional machine learning algorithms, the performance of deep models highly depends on the availability of massive training data to learn the latent pattern of the data. However, in many domains, including PHM, it is extremely difficult to collect large-scale labeled datasets. Also, it requires extensive computational power to train a model on large-scale datasets. Recently, researchers have leveraged the knowledge learned by various pre-trained models on large benchmark datasets and transferred this knowledge to other applications to tackle the issues mentioned above. A plethora of deep CNN architectures have been trained on large-scale image datasets such as ImageNet [193]; Inception Net, GoogleNet, LeNet, AlexNet, ResNet, and VGG are some examples [194].

The idea behind TL is to fine-tune pre-trained models on new tasks in the target domain. Hence, the new model can be initialized with the transferred parameters instead of training from scratch. In the literature, a recipe has been proposed for using pre-trained CNN architectures based on the similarity of the source (pre-trained) and target domains and the size of the dataset, as shown in Fig. 12 [195]. At present, researchers in the PHM community have begun using pre-trained models for fault diagnosis tasks and achieved promising results. Wen et al. [196] fine-tuned all layers of an AlexNet pre-trained model on the bearing fault diagnosis task. In their model, the final fully connected layer is replaced with a classifier layer with four neurons (the number of bearing fault conditions). They provided a comprehensive comparison of their proposed model with eight different time-frequency image inputs and various training/testing dataset ratios. In another study [197], the authors utilized a VGG-16 pre-trained network for bearing fault diagnosis. They froze the bottom blocks of the network and fine-tuned the last three layers of VGG-16 with a supervised classifier layer. As most of the pre-trained networks require RGB images with three channels as inputs, it is important to pre-process the data accordingly. Wen et al. [196] proposed a signal-to-image method to convert time-domain signals into an RGB image. They transferred the first 49 layers of a pre-trained ResNet-50 and fine-tuned the model after adding a fully connected layer and a softmax classifier. Several studies have taken similar strategies and achieved interesting results, see Table 10. Despite the promising results, more research is needed to capture temporal features for time-series classification/prediction; the TimeNet and ConvTimeNet pre-trained models are two interesting examples [198,199].

Fig. 12. The general criterion to be considered in using pre-trained models for transfer learning [195].
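The freeze-and-replace recipe used in several of these studies can be expressed in a few lines of PyTorch; the sketch below loads torchvision's ImageNet-pretrained ResNet-50 and attaches a hypothetical four-class fault head.

import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True)   # ImageNet weights
for p in model.parameters():
    p.requires_grad = False                             # freeze transferred layers
model.fc = torch.nn.Linear(model.fc.in_features, 4)    # new 4-class fault head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

Depending on the similarity between the source and target domains (Fig. 12), more of the upper layers can be unfrozen and fine-tuned with a small learning rate.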

Table 10
Summary of transfer learning studies for PHM applications with pre-trained CNN architectures*.

Study | Dataset | Input | Architecture | Performance
Shao et al. [197] | CWRU bearing/D | Three-channel augmented WT time-frequency image | VGG-16 | 1. ACC = 100 (training and testing data from the same working loads); 2. ACC = 98.80 (training and test data from different working loads)
Xu et al. [200] | CWRU bearing/D | Gray-scale CWT image | LeNet-5 | ACC = 99.08
Ma et al. [201] | CWRU bearing/D | Frequency slice wavelet transform image | AlexNet | 1. ACC = 99.89 (without SNR); 2. ACC = 82.70-98.18 for SNR from -4 to 4 dB
Wen et al. [202] | 1. CWRU bearing/D; 2. KAT bearing/D; 3. Lu et al. bearing/D | Time-domain RGB image | ResNet-50 | 1. ACC = 99.99; 2. ACC = 98.95; 3. ACC = 99.2
Wang et al. [203] | 1. Unknown bearing/D; 2. CWRU bearing/D | Trained and compared with eight different time-frequency images | AlexNet | 1. Highest ACC = 100, lowest ACC = 92.57 (for the HHT image, with 5% of images trained); 2. highest ACC = 100, lowest ACC = 76.10 (for the Fast Kurtogram image, with 5% of images trained)
Wen et al. [196] | CWRU bearing/D | RGB time-domain image | VGG-19 | ACC = 99.17
Mao et al. [204] | PRONOSTIA bearing/incipient fault detection | Three-channel vibration image | VGG-16 | NA

* Refer to Table 6 for public datasets information.

4.11.2. Domain adaptation

The domain divergence can be caused by a distribution shift or a feature space difference. The first setting is referred to as homogeneous DA, while the latter denotes heterogeneous DA. Also, depending on whether a labeled, partially labeled, or unlabeled target dataset is available in the training stage, the settings can be categorized into supervised, semi-supervised, or unsupervised, respectively. Fig. 13 summarizes the major DA settings and approaches in the machinery health monitoring field.

Fig. 13. Different domain adaptation scenarios: (a) homogeneous supervised, (b) homogeneous semi-supervised, (c) homogeneous unsupervised, (d) heterogeneous supervised, (e) heterogeneous semi-supervised, and (f) heterogeneous unsupervised. Hom: homogeneous, Het: heterogeneous.

Some authors adopted discrepancy-based methods to enhance model performance through fine-tuning with labeled or unlabeled target data. A few studies carried out the fine-tuning by adjusting the architecture of the network through adaptive batch normalization (AdaBN) [205] and by reweighting the weak learner [206]. However, most of the discrepancy-based methods utilize pre-defined distance metrics, such as the maximum mean discrepancy (MMD), KL divergence, and correlation alignment (CORAL), to learn a domain-invariant representation by reducing the shift between the two domains. For example, Zhang et al. [207] initialized the target CNN feature extractor with the parameters of the pre-trained source feature extractor; during the domain-adaptive fine-tuning stage, the higher-level representations of the domains are untied to balance training efficiency and domain-invariant feature learning. An MMD regularizer in reproducing kernel Hilbert (RKH) space, applied to the output layers of the feature extractors, ensures the minimization of the marginal distribution differences after mapping. Lu et al. [208] adopted the MMD and a weight regularization term to learn a shared subspace while preserving the discriminative information of the original data for semi-supervised fault diagnosis of rotating machinery components, with only the normal class of the target domain available during training.

The MMD-based approaches are highly reliant on an appropriate kernel choice to ensure a low testing error. Li et al. [209] leveraged a mixture of radial basis function (RBF) kernels across multiple representation layers, along with a higher-level feature clustering scheme, to enhance fault classification accuracy by optimizing the intra-class and inter-class distances. In another study [189], the authors deployed CNN-based generators to produce fake target-domain fault data under the supervision of the source, using high-level representations of frequency-domain signals, and minimized the multi-kernel MMD between the fake and real high-level features. Qian et al. [210] defined a new discrepancy metric, namely the auto-balanced high-order Kullback-Leibler (AHKL) divergence, to achieve better marginal distribution alignment by evaluating both first- and higher-order discrepancies. Furthermore, their proposed smooth conditional distribution alignment (SCDA), based on soft-labeling, covers large conditional distribution discrepancies, and a novel weighted joint distribution alignment (WJDA) in the fine-tuning process balances the effects of the conditional and marginal distribution alignments in the final model.

Most of the studies above treat the target domain as another operating condition of the same machine. Thus, they may show inaccurate results for fault diagnosis of similar components across different machinery. The deep 1D CNN model in [211] transfers knowledge from laboratory rolling bearings to real-case locomotive bearings, Fig. 14. The domain-shared CNN holds symmetric tied weights to handle the samples of the source and target domains simultaneously, and a Softmax classifier in the last shared layer is used to predict the class label of the source-target samples. The final cost function incorporates the MMD of the learned features in the latent layers with the loss of the target-domain pseudo-labeled samples to maximize the inter-class distance of the features.

Fig. 14. The discrepancy-based transfer model architecture [211]. The domain-shared CNN holds symmetric tied weights to handle the samples of the source and target domains simultaneously. The model is trained by jointly minimizing the cross-entropy loss of the source data, the error between the predicted labels and pseudo labels of the target domain, and the MMD of the learned features in the latent layers.

Adversarial learning inspired by GAN models has achieved great success. The core idea is to ensure that the classifier is fooled with synthetic labeled target data or cannot differentiate between the source and target domains through a generative or discriminative adversarial process. A gradient reversal layer connects the feature extractor to the domain classifier, ensuring that the source regression layer receives domain-invariant features via a domain confusion loss. Zhang et al. [190] adversarially trained the source and target domains with partially tied weights to tackle the trade-off between domain adaptability and training efficiency. However, the target feature extractor is initialized with the pre-trained source model parameters to avoid the target model learning a degenerate solution.

As opposed to the popular probability distances used in the adversarial process, such as the KL and JS divergences, the Wasserstein distance provides a continuous mapping and a usable gradient everywhere [212]. In [172], the domain critic loss uses the Wasserstein distance between the source and target distributions to optimize the shared latent representation on a CNN feature extractor pre-trained on the source data. The objective function incorporates the adversarial loss into the discriminative cross-entropy loss of the labeled source domain (and the target domain in the semi-supervised case). The framework proposed in [173] contains a CNN feature descriptor and a domain discriminator that compete with each other through min-max adversarial learning. The former tends to capture the shared representations of the subsets in such a way that the discriminator fails to differentiate the domain label. Moreover, the supervision of labeled samples through fully-connected layers and Softmax classification prevents deviations of the target during training, see Fig. 15.

Fig. 15. The domain-adversarial fault diagnosis network. It has a shared feature descriptor and two classifiers. The update in the discriminative classifier (stage 1) reduces the loss to improve the domain discriminative ability of the network. On the other hand, the update in the feature descriptor (stage 2) maximizes the loss to fool the discriminator. Besides, the supervised training for all the labeled data with the cross-entropy loss (stage 3) performs the diagnostics task [173].

Reconstruction-based DA approaches use encoder-decoder or GAN architectures to create a shared representation between the domains while preserving the discriminative information of each domain. Li et al. [213] trained and tested the source data on a hierarchy of SAEs in an unsupervised fashion. A nonnegative-constraint term was applied to the reconstruction error of the layer-wise unsupervised training and to the Softmax classifier cost function to enhance the sparsity of the model. Transferring the parameters to a similar model, followed by a fine-tuning process, tackles the scarce annotated data issue, Fig. 16(a). Xie et al. [214] utilized the Cycle-GAN network for bearing fault diagnosis. The network learns one mapping from source to target and a reverse mapping from target to source, and the cycle consistency loss measures the reconstruction error after the two generating steps, see Fig. 16(b).

A few studies utilized several of the mentioned approaches simultaneously to enhance model performance. [215] combined the reconstruction-based SSDAE network with the MMD statistic for bearing fault diagnosis.

Fig. 16. The reconstruction-based fault diagnosis framework: (a) SSAE network [213], (b) Cycle-GAN [214].

Introducing the MMD term into the classification loss of the fine-tuning step, instead of the reconstruction loss, reduces the complexity of the algorithm while preserving the domain adaptation capacity. The MMD distance is highly reliant on the kernel selection and may suffer from low generalization ability. Deep CORAL provides a kernel-free non-linear transformation that is more efficient for large-scale applications. Wang et al. [216] incorporated the CORAL distance loss of both the marginal and conditional distributions into a deep DAE objective function to learn domain-invariant and discriminative features from low-level to higher-level hierarchical latent layers. The model aligns the second-order statistics of the distributions using the CORAL loss between the covariance matrices of the source and target features, Fig. 17.

Most of the studies above have focused on homogeneous deep DA, and not much work has been done in heterogeneous deep DA, even in more DA-active fields such as computer vision. However, a few researchers have used approaches similar to homogeneous DA for heterogeneous DA settings. Table 11 provides a summary of deep domain adaptation approaches for PHM applications.
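To make the discrepancy penalty concrete, a (biased) multi-kernel RBF estimate of the squared MMD between two batches of features can be written as follows; the kernel bandwidths are hypothetical and are, in practice, tuned or set from the median pairwise distance.

import torch

def mmd_rbf(x, y, sigmas=(1.0, 5.0, 10.0)):
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)         # pairwise squared distances
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

src = torch.randn(32, 128)   # source-domain features
tgt = torch.randn(32, 128)   # target-domain features
loss = mmd_rbf(src, tgt)     # added to the task loss during adaptation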

Fig. 17. The hybrid DA network with SSAE architecture and CORAL discrepancy metric [216]. The CORAL distance loss of both marginal and conditional distributions is minimized to learn domain-invariant and discriminative features from low-level to higher-level hierarchical latent layers.

Table 11
Summary of deep domain adaptation approaches for PHM applications, based on the categorization in [192].

Approach | Unsupervised | Semi-supervised | Supervised
Discrepancy-based: MMD | Li et al. [209], Li et al. [217], Zhang et al. [207], Yang et al. [211], Xiao et al. [218], Han et al. [219] | Lu et al. [208], Li et al. [189] | -
Discrepancy-based: AdaBN | Zhang et al. [128] | - | Chen et al. [220]
Discrepancy-based: re-weighting | - | - | Xiao et al. [221]
Discrepancy-based: AHKL divergence | Qian et al. [210] | - | -
Adversarial-based: discriminative | Da Costa et al. [222], Zhang et al. [190], Cheng et al. [172], Li et al. [223] | Han et al. [173], Li et al. [224] | -
Adversarial-based: generative | - | - | -
Reconstruction-based: encoder-decoder | - | - | Li et al. [213]
Reconstruction-based: GAN architectures | Xie et al. [214] | - | -
Hybrid | Wang et al. [216], Sun et al. [215], Wen et al. [225], Sun et al. [226] | - | -

5. Hardware, software and computing resources

Although deep learning has shown good results on PHM problems, its applicability is impaired by its high computational demand. Appropriate hardware and software are required to support effective training in complex settings [227]. In this section, we discuss three main enablers of deep learning, i.e., parallel computing, advanced libraries, and cloud/edge computing.

5.1. Parallel computing

Compared to traditional machine learning algorithms, deep architectures involve a much larger parameter space, which should be updated at each training epoch, requiring a huge number of matrix operations and an abundance of processing power. Parallel computing facilitates executing massive operations simultaneously. Central Processing Units (CPU), even the latest and most powerful chips, have a limited number of processing units (cores) and low parallelism capability. Hence, they are not efficient for implementing deep models, and it may take weeks for them to come up with the results.

Graphics Processing Units (GPU) were originally dedicated to processing graphics and high-quality 3D games. Compared with the CPU, a GPU has thousands of highly specialized cores that are adept at processing matrices. Thanks to the Compute Unified Device Architecture (CUDA) platform and the NVIDIA CUDA Deep Neural Network (cuDNN) library, researchers and data scientists have recently identified that GPUs can be turned into a powerful general-purpose computing engine to accelerate the training process through parallelism [227]. They offer higher memory bandwidth and dramatically speed up the training.
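In practice, exploiting the GPU is largely transparent to the user; in PyTorch, for instance, computation simply follows the device of the tensors and parameters.

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(64, 1).to(device)   # parameters move to the GPU if present
x = torch.randn(8, 64, device=device)
y = model(x)                                # runs on the selected device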
Moreover, Tensor Processing Units (TPU) are application-specific integrated circuits recently developed by Google that act specifically as machine learning accelerators. TPUs have demonstrated higher processing speeds compared to GPUs, but they are less flexible and limited to models in the TensorFlow library. [228] provides a comprehensive evaluation of the TPU and compares it with the GPU and CPU in terms of performance and speed.

5.2. Platforms

The success of deep learning is highly reliant on the development of state-of-the-art tools, including frameworks and libraries. Various frameworks play critical roles in generating new models with high scalability by offering different features in terms of pre-trained models, multi-GPU processing, and training/test speed. Table 12 presents the most well-known frameworks, showing the supported programming language and the strong point of each. The tools are ranked based on the popularity and ratings of users on the GitHub website [229], which is a collaborative code-hosting platform for developers. A comprehensive discussion and comparison of deep learning tools can be found in [230,231].

Table 12
Mainstream deep learning tools; rankings are based on the stars and forks in GitHub [229]. API: application programming interface.

Ranking | Library | API | GitHub URL | Comment
1 | TensorFlow | Python, C++, Java, Go, JavaScript | https://github.com/tensorflow/tensorflow | The most popular library, with complete functionality and several interface supports. In CPU computing, it has shown better scalability compared to other libraries [232].
2 | Keras | Python, R | https://github.com/keras-team/keras | High-level API integrating with TensorFlow, CNTK, and Theano.
3 | Caffe/Caffe2 | Python, Matlab, C++ | https://github.com/BVLC/caffe and https://github.com/facebookarchive/caffe2 | Extended for use in Hadoop with Spark.
4 | MXNet | Python, C++, Matlab, R, Julia, Scala, Perl | https://github.com/apache/incubator-mxnet | Difficult to learn, but highly scalable and memory-efficient.
5 | Theano | Python | https://github.com/Theano/Theano | No longer supported after release 1.0.0 (November 2017). Supports tensor and sparse operations, Python 2 and Python 3, GPU computation, and SIMD parallelism on the CPU.
6 | CNTK | Python, C++, and C# | https://github.com/Microsoft/CNTK | The user can easily combine different models, such as CNNs and RNNs, and it supports transfer between different platforms (Caffe2, MXNet, and PyTorch) [233].
7 | Deeplearning4j (DL4J) | Java, Scala, Clojure, or Kotlin | https://github.com/eclipse/deeplearning4j | Accelerates training through built-in integration with Apache Hadoop and Spark. Using the Keras API bridges the gap between JVM languages and Python, and it can import models from TensorFlow, Theano, CNTK, and Caffe [234].
8 | PyTorch | Python | https://github.com/pytorch/pytorch | Pre-trained models are available. However, no visualization tool is available.
9 | Chainer | Python | https://github.com/chainer/chainer | -
10 | Torch7 | Lua | https://github.com/torch/torch7 | Focused on GPU computation acceleration. Despite that, it does not beat CNTK, Caffe, and MXNet for GPU-accelerated implementation [235].

5.3. Cloud computing

Cloud computing is a general term for on-demand computing resources, including power, storage, and services, delivered over the internet (known as the cloud) with high scalability and reliability; it can be seen as the evolution of cluster and grid computing. In the PHM context, cloud resources can be used by researchers to develop, train, and deploy their deep models in real time at any scale. Public cloud vendors are rapidly improving their capabilities by offering advanced analytics services on a pay-as-you-go basis that are practical for both newcomers and experienced data analysts, see Table 13.

Table 13
Existing cloud environments.

Provider | ML and DL Services | URL
Microsoft Azure | ML Studio | https://azure.microsoft.com
Google Cloud Platform (GCP) | Cloud AutoML, Google ML engine | https://cloud.google.com/
Amazon Web Services (AWS) | Amazon ML, Deep Learning AMI | https://aws.amazon.com/
IBM Cloud | Watson ML Studio | https://www.ibm.com/cloud
Oracle Cloud | Oracle ML | https://www.oracle.com/

gies, alongside other technologies such as "advanced sensors," "wireless communications," "advanced manufacturing," and "robotics" within the cyber-physical systems (CPS) concept, plays a critical role in moving manufacturing systems to an intelligent level known as "smart manufacturing," toward the "Industry 4.0" (fourth industrial revolution) goal [236].
Integrating PHM into the smart manufacturing paradigm goes beyond monitoring and data analysis for an individual component. It is challenging and requires a vast amount of computational resources to determine and manage the interactions between components, sub-systems, and systems. Cloud computing is transforming traditional manufacturing and condition monitoring frameworks into service-oriented models [237,238]. Edge computing and fog computing are recent extensions of cloud computing that tackle high-latency, security, and bandwidth issues by processing the data in the layers of the IIoT that are closer to the data sources [239].

6. Concluding remarks and open research directions

In this paper, a detailed review has been carried out on various aspects of employing deep neural networks in the context of fault detection, diagnostics, and prognostics. From the above discussion, it is clear that DL algorithms have brought new perspectives to data-driven methods in terms of model performance, learning complex representations, big data analysis, and handling raw data with minimum preprocessing effort. However, despite promising results, DL for PHM applications still has a long way to go before it replaces well-established data-driven techniques in industry. In addition to the barriers that are addressed in the literature and thoroughly reviewed throughout this survey, the authors have identified several significant challenges in exploiting the potential of DL toward reliable, scalable and applicable PHM models that realize Industry 4.0 goals. To outline future research directions, we conclude with the key challenges and the associated opportunities for researchers.

1. Data scarcity: DL algorithms are known to be data-hungry, and their superior performance depends upon the availability of abundant data, which is rarely the case in practice. In recent years, several approaches have been suggested to tackle the limitations that small datasets impose on model generalization and optimization. Data augmentation techniques have shown great success in enlarging training datasets by generating synthetic data. Basic augmentation techniques such as window cropping, warping, and flipping have been widely utilized to generate new data sequences from the original time-series data [240] (a minimal sketch of these operations is given after this list). Also, advanced techniques such as generative algorithms can be used to generate new data similar to the real data. However, new generative models are required to create valid time-series data in the time and frequency domains while respecting the temporal dependency of the data. Transfer learning is an active research direction that helps to deal with a small amount of training data by transferring knowledge from one domain to another. It mitigates the need for training a model from scratch by fine-tuning a pre-trained model on a new domain. Moreover, one-shot learning approaches could be effective and support learning from a single annotated training sample, either by defining a new loss function or by creating an external memory that can encode and retrieve new data.
2. Industrial data characteristics: The success of deep models is highly reliant on the quality and variety of the collected data. The evolution of smart sensors and IIoT technologies has somewhat eased the industrial data scarcity issue. Nevertheless, more data means more noise and uncertainty associated with the operating environment, the various data sources, and data transmission, all of which need to be addressed. Moreover, when it comes to big industrial data, the challenges regarding incomplete data, unlabeled data, imbalanced classes, and unseen classes become more critical and severe. As mentioned earlier, a few works have put forth the effort to mitigate these issues using augmentation techniques based on generative algorithms, specifically GANs, and have achieved interesting results. However, most of these methods consider a moderately imbalanced scenario and ignore the challenge of significantly under-represented classes, which is what exists in real industrial applications. Also, real-world data come from various sensors and are mostly non-structured, multi-modal and heterogeneous, which makes the model much more complex. Further research is required to leverage heterogeneous information in deep models without reducing training efficiency.
3. Data analysis: Data preprocessing and visualization play critical roles in machine learning and deep learning. The quality of a model is highly sensitive to the quality of the training data. Preprocessing ranges from simple normalization, standardization, and data segmentation (a minimal preprocessing sketch is given after this list) to more complex tasks such as labeling and dealing with incomplete data, outliers, and missing values. Chen et al. proposed a deep transfer learning-based framework to tackle missing values by transferring a well-trained, structurally complete fault diagnosis model to the missing-data model [241]. In the absence of sufficient labeled data, synthetic data generation with generative deep networks, such as VAEs and GANs, provides a fast and cheap solution to produce new labeled data. Moreover, for deep learning-based fault diagnosis, the spatial distributions of the features reflect the quality of the disparity of the fault features and directly affect classification accuracy. Hence, effective visualization techniques are necessary to analyze the quality of the features. Zemouri et al. proposed a 2D visualization model based on a deep convolutional VAE for the classification task [242].
4. Model selection: Choosing an optimal network architecture is an important issue. Most of the reviewed papers have not justified the use of a certain architecture to solve a specific problem. Up to now, the majority of employed networks have been designed manually by human experts, which is error-prone and time-consuming. Although DL promises considerable benefits in terms of automated solutions, it still depends remarkably on the choice of a wide range of hyperparameters (a simple random-search sketch is given after this list). There is a paucity of literature that uses evolutionary algorithms to optimize the hyperparameter setting. Furthermore, the authors have not found any literature in PHM regarding neural architecture search (NAS), a recent trend in machine learning that has shown significant success in automating network design for image classification and semantic segmentation tasks. It is important to investigate the possibilities of developing more automated models in terms of hyperparameter and architecture selection for large-scale real industrial data.
5. Black-box tool: Despite promising results, many companies are still unwilling to adopt DL. The reason lies in the black-box nature of DL algorithms, which imposes a lack of "transparency" and "interpretability" on the models, especially in the decision-making part of the PHM cycle. There is no sufficient understanding of the underlying process and the reasons for making certain decisions. In other words, companies cannot trust something that they neither understand nor control. In recent years, several efforts have been made to tackle this issue. Explainable deep learning is a new paradigm to open the black box and increase the transparency of the models. These techniques are roughly categorized into two types: (a) utilizing a relatively simple model to interpret complex deep learning models, and (b) building intrinsically interpretable deep architectures by incorporating attention mechanisms in intermediate layers [243].
6. Cross-domain prediction: The majority of the existing work trained models using public data that were collected under laboratory conditions. The same holds for the development of domain adaptation techniques, which have focused on transferring knowledge from one working condition to another within laboratory data. Research on transfer learning to tackle the distribution mismatch among various domains, including real industrial equipment and artificial laboratory faults, is ongoing within the deep learning paradigm (a sketch of one common distribution-alignment penalty is given after this list). More work on domain adaptation with multiple source domains is required to reach superior domain generalization ability, which could yield models that are feasible in practice.
7. Real-time realization: While advanced hardware, DL architectures and computing paradigms (cloud, edge, and fog) have revolutionized large-scale learning in recent years, new computing challenges have arisen for the real-time training and deployment of DL algorithms. A few relevant works in the realm of PHM have exploited cloud computing capabilities for faster offline training, whereas speeding up the inference (i.e., deployment) is of more concern for an applicable PHM model. Real industrial data come in continuous streams, and their distribution characteristics change dynamically over time, which limits accurate and real-time inference. Thus, the model needs to cope with the concept drift of continuously evolving new data within incremental learning settings. However, typical DL algorithms greatly suffer from forgetting, which refers to the complete loss of previously learned knowledge in favor of the new information learned during sequential training. New algorithms and hardware architectures are needed to facilitate continuous learning on non-stationary sequence data while retaining the general-domain knowledge of the pre-trained model.
8. The role of benchmarking: At the moment, many DL architectures, algorithms, platforms and frameworks are being used to solve specific PHM problems that were previously deemed unsolvable. The variety of available algorithms, models, software, and hardware systems raises the need for a benchmarking infrastructure that enables a fair comparison of workloads with respect to the time and cost of both training and inference. Currently, some authors have carried out comparative analyses of different techniques. However, they have compared their deep models with classical machine learning algorithms and have focused solely on generic performance metrics such as accuracy and classification (see Table 5). There is a need to build novel metrics that incorporate runtime performance, model accuracy, and robustness across various architectures and DL frameworks.
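The following minimal sketch illustrates the basic time-series augmentations mentioned in challenge 1 (window cropping, window warping, and flipping). It is our illustration rather than the method of any cited work; the surrogate signal, window length, and warping factor are arbitrary choices.

import numpy as np

def window_crop(x, crop_len):
    # Randomly crop a contiguous sub-window from a 1-D signal.
    start = np.random.randint(0, len(x) - crop_len + 1)
    return x[start:start + crop_len]

def window_warp(x, factor=1.2):
    # Stretch (factor > 1) or compress (factor < 1) the signal in time
    # by linear interpolation onto a resampled grid.
    n = len(x)
    new_n = int(n * factor)
    return np.interp(np.linspace(0, n - 1, new_n), np.arange(n), x)

def flip(x):
    # Reverse the signal in time.
    return x[::-1]

signal = np.sin(np.linspace(0, 20 * np.pi, 4096))  # surrogate vibration signal
augmented = [window_crop(signal, 2048), window_warp(signal, 0.8), flip(signal)]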
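Likewise, for challenge 3, the routine preprocessing steps (per-channel standardization and sliding-window segmentation) can be sketched in a few lines; the window and stride lengths below are illustrative assumptions, not recommendations from the cited works.

import numpy as np

def standardize(x, eps=1e-8):
    # Zero-mean, unit-variance scaling per sensor channel.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def segment(x, win=1024, stride=512):
    # Slice a (time, channels) record into overlapping training windows.
    return np.stack([x[i:i + win] for i in range(0, len(x) - win + 1, stride)])

record = np.random.randn(10000, 3)       # surrogate 3-channel sensor record
windows = segment(standardize(record))   # shape: (num_windows, 1024, 3)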
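For challenge 4, full NAS is beyond a short example, but even the hyperparameter selection that is usually done by hand can be automated with plain random search, as in the sketch below; build_and_score is a hypothetical user-supplied routine that would train a candidate model and return its validation score (stubbed here so the sketch runs).

import random

search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "num_filters": [8, 16, 32, 64],
    "kernel_size": [3, 9, 32, 64],
    "dropout": [0.0, 0.25, 0.5],
}

def sample_config(space):
    # Draw one random value per hyperparameter.
    return {name: random.choice(values) for name, values in space.items()}

def build_and_score(config):
    # Hypothetical: train a model with this configuration and return its
    # validation accuracy; replaced by a random stub in this sketch.
    return random.random()

best = max((sample_config(search_space) for _ in range(20)), key=build_and_score)
print("best configuration found:", best)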
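Finally, for challenge 6, a common ingredient of the surveyed domain adaptation methods (e.g., [208,217]) is a maximum mean discrepancy (MMD) penalty that pulls the source (laboratory) and target (field) feature distributions together. The Gaussian-kernel NumPy sketch below is illustrative (the bandwidth and batch sizes are arbitrary); during training, such a penalty would be added to the supervised loss computed on the labeled source domain.

import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise Gaussian kernel between two batches of feature vectors.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(source, target, sigma=1.0):
    # Squared maximum mean discrepancy between two feature batches.
    k_ss = gaussian_kernel(source, source, sigma).mean()
    k_tt = gaussian_kernel(target, target, sigma).mean()
    k_st = gaussian_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2 * k_st

src = np.random.randn(64, 128)         # features from labeled laboratory data
tgt = np.random.randn(64, 128) + 0.5   # features from unlabeled field data
print("MMD^2 penalty:", mmd2(src, tgt))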
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] V. Sugumaran, G.R. Sabareesh, K.I. Ramachandran, Fault diagnostics of roller bearing using kernel based neighborhood score multi-class support vector machine, Expert Syst. Appl. 34 (4) (2008) 3090–3098.
[2] D. Cabrera et al., Fault diagnosis of spur gearbox based on random forest and wavelet packet decomposition, Front. Mech. Eng. 10 (3) (2015) 277–286.
[3] T. Wang, H. Xu, J. Han, E. Elbouchikhi, M.E.H. Benbouzid, Cascaded H-bridge multilevel inverter system fault diagnosis using a PCA and multiclass relevance vector machine approach, IEEE Trans. Power Electron. 30 (12) (2015) 7006–7018.
[4] M. Jouin, R. Gouriveau, D. Hissel, M.-C. Péra, N. Zerhouni, Particle filter-based prognostics: Review, discussion and perspectives, Mech. Syst. Signal Process. 72 (2016) 2–31.
[5] A. Soualhi, G. Clerc, H. Razik, F. Guillet, Hidden Markov models for the prediction of impending faults, IEEE Trans. Ind. Electron. 63 (5) (2016) 3271–3281.
[6] R. Zhao, R. Yan, Z. Chen, K. Mao, P. Wang, R.X. Gao, Deep learning and its applications to machine health monitoring, Mech. Syst. Signal Process. 115 (2019) 213–237, https://doi.org/10.1016/j.ymssp.2018.05.050.
[7] S. Khan, T. Yairi, A review on the application of deep learning in system health management, Mech. Syst. Signal Process. 107 (2018) 241–265, https://doi.org/10.1016/j.ymssp.2017.11.024.
[8] A.L. Ellefsen, V. Æsøy, S. Ushakov, H. Zhang, A comprehensive survey of prognostics and health management based on deep learning for autonomous ships, IEEE Trans. Reliab. 68 (2) (2019) 720–740.
[9] D.-T. Hoang, H.-J. Kang, A survey on deep learning based bearing fault diagnosis, Neurocomputing 335 (2019) 327–335.
[10] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[11] E. Alpaydin, Introduction to Machine Learning, MIT Press, 2020.
[12] H. Larochelle, M. Mandel, Y. Bengio, Learning algorithms for the classification restricted Boltzmann machine, J. Mach. Learn. Res. 13 (2012) 643–669.
[13] G. Hinton, A practical guide to training restricted Boltzmann machines, 2010.
[14] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, Adv. Neural Inf. Process. Syst. (2007).
[15] J. Xu, H. Li, S. Zhou, An overview of deep generative models, IETE Tech. Rev. (2014) 37–41, https://doi.org/10.1080/02564602.2014.987328.
[16] R. Salakhutdinov, G. Hinton, Deep Boltzmann machines, in: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), 2009, pp. 448–455.
[17] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual understanding: A review, Neurocomputing 187 (2016) 27–48, https://doi.org/10.1016/j.neucom.2015.09.116.
[18] Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828, https://doi.org/10.1109/TPAMI.2013.50.
[19] A. Ng, CS294A lecture notes: Sparse autoencoder, pp. 1–19.
[20] M. Ranzato, C. Poultney, S. Chopra, Y. LeCun, Efficient learning of sparse representations with an energy-based model, Adv. Neural Inf. Process. Syst. (2007) 1137–1144.
[21] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096–1103.
[22] S. Rifai, P. Vincent, X. Muller, X. Glorot, Y. Bengio, Contractive auto-encoders: Explicit invariance during feature extraction, in: Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 833–840.
[23] D.P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114, 2013.
[24] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[25] D.J. Im, S. Ahn, R. Memisevic, Y. Bengio, Denoising criterion for variational auto-encoding framework, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[26] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F.E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26, https://doi.org/10.1016/j.neucom.2016.12.038.
[27] Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882, 2014.
[28] S. Hochreiter, The vanishing gradient problem during learning recurrent neural nets and problem solutions, Int. J. Uncertainty, Fuzziness Knowledge-Based Syst. 6 (2) (1998) 107–116.
[29] R. Zhao, R. Yan, Z. Chen, K. Mao, P. Wang, R.X. Gao, Deep learning and its applications to machine health monitoring: A survey, 2015, pp. 1–14.
[30] I. Goodfellow et al., Generative adversarial nets, Adv. Neural Inf. Process. Syst. (2014) 2672–2680.
[31] N. Kodali, J. Abernethy, J. Hays, Z. Kira, On convergence and stability of GANs, arXiv preprint arXiv:1705.07215, 2017.
[32] I. Goodfellow, NIPS 2016 tutorial: Generative adversarial networks, arXiv preprint arXiv:1701.00160, 2016.
[33] Z. Wang, Q. She, T.E. Ward, Generative adversarial networks: a survey and taxonomy, arXiv preprint arXiv:1906.01529, 2019, pp. 1–16.
[34] S. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747, 2016.
[35] P. Gupta, Deep learning: Regularisation, 2015.
[36] R. Yan, R.X. Gao, X. Chen, Wavelets for fault diagnosis of rotary machines: A review with applications, Signal Process. 96 (2014) 1–15, https://doi.org/10.1016/j.sigpro.2013.04.015.
[37] Z. Feng, M. Liang, F. Chu, Recent advances in time–frequency analysis methods for machinery fault diagnosis: A review with application examples, Mech. Syst. Signal Process. 38 (1) (2013) 165–205.
[38] Y. Wang, J. Xiang, R. Markert, M. Liang, Spectral kurtosis for fault detection, diagnosis and prognostics of rotating machines: A review with applications, Mech. Syst. Signal Process. 66 (2016) 679–698.
[39] A. Jović, K. Brkić, N. Bogunović, A review of feature selection methods with applications, in: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2015, pp. 1200–1205.
[40] R. Zimroz, A. Bartkowiak, Two simple multivariate procedures for monitoring planetary gearboxes in non-stationary operating conditions, Mech. Syst. Signal Process. 38 (1) (2013) 237–247.
[41] G. Cheng, X. Chen, H. Li, P. Li, H. Liu, Study on planetary gear fault diagnosis based on entropy feature fusion of ensemble empirical mode decomposition, Measurement 91 (2016) 140–154.
[42] W.D. Fisher, T.K. Camp, V.V. Krzhizhanovskaya, Anomaly detection in earth dam and levee passive seismic data using support vector machines and automatic feature selection, J. Comput. Sci. 20 (2017) 143–153.
[43] R. Moghaddass, S. Sheng, An anomaly detection framework for dynamic systems using a Bayesian hierarchical framework, Appl. Energy 240 (2019) 561–582.
[44] S. Lee, J.-W. Park, D.-S. Kim, I. Jeon, D.-C. Baek, Anomaly detection of tripod shafts using modified Mahalanobis distance, J. Mech. Sci. Technol. 32 (6) (2018) 2473–2478.
[45] M.S. Safizadeh, S.K. Latifi, Using multi-sensor data fusion for vibration fault diagnosis of rolling element bearings by accelerometer and load cell, Inf. Fusion 18 (2014) 1–8.
[46] R.K. Singleton, E.G. Strangas, S. Aviyente, Extended Kalman filtering for remaining-useful-life estimation of bearings, IEEE Trans. Ind. Electron. 62 (3) (2014) 1781–1790.
[47] F. Di Maio, K.L. Tsui, E. Zio, Combining relevance vector machines and exponential regression for bearing residual life estimation, Mech. Syst. Signal Process. 31 (2012) 405–427.
[48] G. Niu, Data-Driven Technology for Engineering Systems Health Management, Springer, 2017.
[49] N. Aissani, B. Beldjilali, D. Trentesaux, Dynamic scheduling of maintenance tasks in the petroleum industry: A reinforcement approach, Eng. Appl. Artif. Intell. 22 (7) (2009) 1089–1103.
[50] S. Wu, N. Gebraeel, M.A. Lawley, Y. Yih, A neural network integrated decision support system for condition-based optimal predictive maintenance policy, IEEE Trans. Syst. Man Cybern. A Syst. Humans 37 (2) (2007) 226–236.
[51] G.K. Chan, S. Asgarpoor, Optimum maintenance policy with Markov processes, Electr. Power Syst. Res. 76 (6–7) (2006) 452–456.
[52] J. Ben Ali, N. Fnaiech, L. Saidi, B. Chebel-Morello, F. Fnaiech, Application of empirical mode decomposition and artificial neural network for automatic bearing fault diagnosis based on vibration signals, Appl. Acoust. 89 (2015) 16–27.
[53] H. Shao, H. Jiang, Y. Lin, X. Li, A novel method for intelligent fault diagnosis of rolling bearings using ensemble deep auto-encoders, Mech. Syst. Signal Process. 102 (2018) 278–297, https://doi.org/10.1016/j.ymssp.2017.09.026.
[54] X. Lou, K.A. Loparo, Bearing fault diagnosis based on wavelet transform and fuzzy inference, Mech. Syst. Signal Process. 18 (5) (2004) 1077–1095.
[55] L. Batista, B. Badri, R. Sabourin, M. Thomas, A classifier fusion system for bearing fault diagnosis, Expert Syst. Appl. 40 (17) (2013) 6788–6797.
[56] J. Zhu, N. Chen, W. Peng, Estimation of bearing remaining useful life based on multiscale convolutional neural network, IEEE Trans. Ind. Electron. 66 (4) (2018) 190–193, https://doi.org/10.1109/TIE.2018.2844856.
[57] J. Deutsch, D. He, Using deep learning-based approach to predict remaining useful life of rotating components, IEEE Trans. Syst. Man Cybern. Syst. 48 (1) (2018) 11–20.
[58] A. Saxena, J. Celaya, B. Saha, S. Saha, K. Goebel, Metrics for offline evaluation of prognostic performance, Int. J. Progn. Heal. Manag. 1 (1) (2010) 4–23.
[59] N. Chen, K.L. Tsui, Condition monitoring and remaining useful life prediction using degradation signals: Revisited, IIE Trans. 45 (9) (2013) 939–952.
[60] K.T.P. Nguyen, M. Fouladirad, A. Grall, New methodology for improving the inspection policies for degradation model selection according to prognostic measures, IEEE Trans. Reliab. 67 (3) (2018) 1269–1280.
[61] P. Nectoux et al., PRONOSTIA: An experimental platform for bearings accelerated degradation tests, in: IEEE International Conference on Prognostics and Health Management, 2012, pp. 1–8.
[62] R. Zemouri, R. Gouriveau, Towards accurate and reproducible predictions for prognostic: an approach combining a RRBF network and an autoregressive model, IFAC Proc. 43 (3) (2010) 140–145.
[63] K. Javed, R. Gouriveau, N. Zerhouni, A new multivariate approach for prognostics based on extreme learning machine and fuzzy clustering, IEEE Trans. Cybern. 45 (12) (2015) 2626–2639.
[64] Y. Hu, P. Baraldi, F. Di Maio, E. Zio, Online performance assessment method for a model-based prognostic approach, IEEE Trans. Reliab. 65 (2) (2015) 718–735.
[65] C. Lessmeier, J.K. Kimotho, D. Zimmer, W. Sextro, Condition monitoring of bearing damage in electromechanical drive systems by using motor current signals of electric motors: a benchmark data set for data-driven classification, in: Third European Conference of the Prognostics and Health Management Society, 2016, pp. 152–156.
[66] Case Western Reserve University bearing vibration data, 2015.
[67] C. Lu, Y. Wang, M. Ragulskis, Y. Cheng, Fault diagnosis for rotating machinery: A method based on image processing, PLoS One 11 (10) (2016).
[68] J. Lee, H. Qiu, G. Yu, J. Lin, Rexnord technical services: Bearing data set, IMS, Univ. Cincinnati, NASA Ames Prognostics Data Repository, Moffett Field, CA, 2007.
[69] A. Saxena, K. Goebel, D. Simon, N. Eklund, Damage propagation modeling for aircraft engine run-to-failure simulation, in: 2008 International Conference on Prognostics and Health Management (PHM 2008), 2008, https://doi.org/10.1109/PHM.2008.4711414.
[70] B. Saha, K. Goebel, Uncertainty management for diagnostics and prognostics of batteries using Bayesian techniques, in: 2008 IEEE Aerospace Conference, 2008, pp. 1–8.
[71] E.F. Hogge et al., Verification of a remaining flying time prediction system for small electric aircraft, in: Annual Conference of the Prognostics and Health Management Society, PHM 2015, 2015.
[72] B. Bole, C.S. Kulkarni, M. Daigle, Adaptation of an electrochemistry-based li-ion battery model to account for deterioration observed under randomized use, in: Annual Conference of the Prognostics and Health Management Society, PHM 2014, 2014.
[73] W. Xiao, A probabilistic machine learning approach to detect industrial plant faults, arXiv preprint arXiv:1603.05770, 2016.
[74] N.J. Van Eck, L. Waltman, VOSviewer manual, Universiteit Leiden, Leiden, 1 (1) (2013) 1–53.
[75] T. Jebara, Machine Learning: Discriminative and Generative, Springer Science & Business Media, 2012.
[76] L. Deng, A tutorial survey of architectures, algorithms, and applications for deep learning, APSIPA Trans. Signal Inf. Process. 3 (2014).
[77] P. Tamilselvan, P. Wang, Failure diagnosis using deep belief learning based health state classification, Reliab. Eng. Syst. Saf. 115 (2013) 124–135, https://doi.org/10.1016/j.ress.2013.02.022.
[78] V.T. Tran, F. Althobiani, A. Ball, An approach to fault diagnosis of reciprocating compressor valves using Teager–Kaiser energy operator and deep belief networks, Expert Syst. Appl. 41 (9) (2014) 4113–4122, https://doi.org/10.1016/j.eswa.2013.12.026.
[79] H. Shao, H. Jiang, X. Zhang, M. Niu, Rolling bearing fault diagnosis using an optimization deep belief network, Meas. Sci. Technol. 26 (11) (2015), https://doi.org/10.1088/0957-0233/26/11/115002.
[80] S. Tang, C. Shen, D. Wang, S. Li, W. Huang, Z. Zhu, Adaptive deep feature learning network with Nesterov momentum and its application to rotating machinery fault diagnosis, Neurocomputing 305 (2018) 1–14, https://doi.org/10.1016/j.neucom.2018.04.048.
[81] N. Yuan, W. Yang, B. Kang, S. Xu, C. Li, Signal fusion-based deep fast random forest method for machine health assessment, J. Manuf. Syst. 48 (2018) 1–8, https://doi.org/10.1016/j.jmsy.2018.05.004.
[82] J. Liang, Y. Zhang, J. Zhong, H. Yang, A novel multi-segment feature fusion based fault classification approach for rotating machinery, Mech. Syst. Signal Process. 122 (2019) 19–41, https://doi.org/10.1016/j.ymssp.2018.12.009.
[83] H. Oh, J.H. Jung, B.C. Jeon, B.D. Youn, Scalable and unsupervised feature engineering using vibration-imaging and deep learning for rotor system diagnosis, IEEE Trans. Ind. Electron. 65 (4) (2018) 3539–3549, https://doi.org/10.1109/TIE.2017.2752151.
[84] Y. Qin, X. Wang, J. Zou, The optimized deep belief networks with improved logistic sigmoid units and their application in fault diagnosis for planetary gearboxes of wind turbines, IEEE Trans. Ind. Electron. (2018), https://doi.org/10.1109/TIE.2018.2856205.
[85] X. Zhao, M. Jia, A novel deep fuzzy clustering neural network model and its application in rolling bearing fault recognition, Meas. Sci. Technol. 29 (12) (2018) 125005, https://doi.org/10.1088/1361-6501/aae27a.
[86] Q. Tang, Y. Chai, J. Qu, H. Ren, Fisher discriminative sparse representation based on DBN for fault diagnosis of complex system, Appl. Sci. 8 (5) (2018), https://doi.org/10.3390/app8050795.
[87] Z. Gao, C. Ma, D. Song, Y. Liu, Deep quantum inspired neural network with application to aircraft fuel system fault diagnosis, Neurocomputing 238 (2017) 13–23, https://doi.org/10.1016/j.neucom.2017.01.032.
[88] J. He, S. Yang, C. Gan, Unsupervised fault diagnosis of a gear transmission chain using a deep belief network, Sensors 17 (7) (2017) 1–21, https://doi.org/10.3390/s17071564.
[89] K. Peng, R. Jiao, J. Dong, Y. Pi, A deep belief network based health indicator construction and remaining useful life prediction using improved particle filter, Neurocomputing 361 (2019) 19–28.
[90] Z. Liu, Z. Jia, C.M. Vong, S. Bu, J. Han, X. Tang, Capturing high-discriminative fault features for electronics-rich analog system via deep learning, IEEE Trans. Ind. Informatics 13 (3) (2017) 1213–1226, https://doi.org/10.1109/TII.2017.2690940.
[91] J. Xie, G. Du, C. Shen, N. Chen, L. Chen, Z. Zhu, An end-to-end model based on improved adaptive deep belief network and its application to bearing fault diagnosis, IEEE Access 6 (2018) 63584–63596, https://doi.org/10.1109/ACCESS.2018.2877447.
[92] G. Zhao, X. Liu, B. Zhang, Y. Liu, G. Niu, C. Hu, A novel approach for analog circuit fault diagnosis based on deep belief network, Measurement 121 (2018) 170–178, https://doi.org/10.1016/j.measurement.2018.02.044.
[93] C. Li, R. Sanchez, G. Zurita, M. Cerrada, D. Cabrera, R.E. Vásquez, Multimodal deep support vector classification with homologous features and its application to gearbox fault diagnosis, Neurocomputing 168 (2015) 119–127, https://doi.org/10.1016/j.neucom.2015.06.008.
[94] C. Li, R.V. Sanchez, G. Zurita, M. Cerrada, D. Cabrera, R.E. Vásquez, Gearbox fault diagnosis based on deep random forest fusion of acoustic and vibratory signals, Mech. Syst. Signal Process. 76–77 (2016) 283–293, https://doi.org/10.1016/j.ymssp.2016.02.007.
[95] G. Hu, H. Li, Y. Xia, L. Luo, A deep Boltzmann machine and multi-grained scanning forest ensemble collaborative method and its application to industrial fault diagnosis, Comput. Ind. 100 (2018) 287–296, https://doi.org/10.1016/j.compind.2018.04.002.
[96] J. Wang, K. Wang, Y. Wang, Z. Huang, R. Xue, Deep Boltzmann machine based condition prediction for smart manufacturing, J. Ambient Intell. Humaniz. Comput. 10 (3) (2019) 851–861, https://doi.org/10.1007/s12652-018-0794-3.
[97] F. Zhou, Y. Gao, C. Wen, A novel multimode fault classification method based on deep learning, Lect. Notes Comput. Sci., 2017, pp. 442–452, https://doi.org/10.1155/2017/3583610.
[98] H. Shao, H. Jiang, H. Zhao, F. Wang, A novel deep autoencoder feature learning method for rotating machinery fault diagnosis, Mech. Syst. Signal Process. 95 (2017) 187–204, https://doi.org/10.1016/j.ymssp.2017.03.034.
[99] Z. Meng, X. Zhan, J. Li, Z. Pan, An enhancement denoising autoencoder for rolling bearing fault diagnosis, Measurement 130 (2018) 448–454, https://doi.org/10.1016/j.measurement.2018.08.010.
[100] G. Jiang, P. Xie, H. He, J. Yan, Wind turbine fault detection using a denoising autoencoder with temporal information, IEEE/ASME Trans. Mechatronics 23 (1) (2018) 89–100, https://doi.org/10.1109/TMECH.2017.2759301.
[101] B. Luo, H. Wang, H. Liu, B. Li, F. Peng, Early fault detection of machine tools based on deep learning and dynamic identification, IEEE Trans. Ind. Electron. 66 (1) (2019) 509–518, https://doi.org/10.1109/TIE.2018.2807414.
[102] S. Zhang, M. Wang, W. Li, J. Luo, Z. Lin, Deep learning with emerging new labels for fault diagnosis, IEEE Access 7 (2018), https://doi.org/10.1109/ACCESS.2018.2886078.
[103] J. Liu, Y. Hu, Y. Wang, B. Wu, J. Fan, Z. Hu, An integrated multi-sensor fusion-based deep feature learning approach for rotating machinery diagnosis, Meas. Sci. Technol. 29 (5) (2018), https://doi.org/10.1088/1361-6501/aaaca6.
[104] J. Wang, S. Li, Z. An, X. Jiang, W. Qian, S. Ji, Batch-normalized deep neural networks for achieving fast intelligent fault diagnosis of machines, Neurocomputing 329 (2019) 53–65, https://doi.org/10.1016/j.neucom.2018.10.049.
[105] J. Sun, C. Yan, J. Wen, Intelligent bearing fault diagnosis method combining compressed data acquisition and deep learning, IEEE Trans. Instrum. Meas. 67 (1) (2018) 185–195, https://doi.org/10.1109/TIM.2017.2759418.
[106] J. Yu, A selective deep stacked denoising autoencoders ensemble with negative correlation learning for gearbox fault diagnosis, Comput. Ind. 108 (2019) 62–72, https://doi.org/10.1016/j.compind.2019.02.015.
[107] G. Jiang, H. He, P. Xie, Y. Tang, Stacked multilevel-denoising autoencoders: a new representation learning approach for wind turbine gearbox fault diagnosis, IEEE Trans. Instrum. Meas. 66 (9) (2017) 2391–2402.
[108] C. Lu, Z. Wang, W. Qin, J. Ma, Fault diagnosis of rotary machinery components using a stacked denoising autoencoder-based health state identification, Signal Process. 130 (2017) 377–388, https://doi.org/10.1016/j.sigpro.2016.07.028.
[109] X. Guo, C. Shen, L. Chen, Deep fault recognizer: an integrated model to denoise and extract features for fault diagnosis in rotating machinery, Appl. Sci. (2017), https://doi.org/10.3390/app7010041.
[110] C. Shi, G. Panoutsos, B. Luo, H. Liu, B. Li, X. Lin, Using multiple feature spaces-based deep learning for tool condition monitoring in ultra-precision manufacturing, IEEE Trans. Ind. Electron. 66 (5) (2019) 3794–3803, https://doi.org/10.1109/TIE.2018.2856193.
[111] Z. Zhang, S. Li, Y. Xiao, Y. Yang, Intelligent simultaneous fault diagnosis for solid oxide fuel cell system based on deep learning, Appl. Energy 233–234 (2019) 930–942, https://doi.org/10.1016/j.apenergy.2018.10.113.
[112] C. Shen, Y. Qi, J. Wang, G. Cai, Z. Zhu, An automatic and robust features learning method for rotating machinery fault diagnosis based on contractive autoencoder, Eng. Appl. Artif. Intell. 76 (8) (2018) 170–184, https://doi.org/10.1016/j.engappai.2018.09.010.
[113] H. Shao, H. Jiang, F. Wang, H. Zhao, An enhancement deep feature fusion method for rotating machinery fault diagnosis, Knowledge-Based Syst. 119 (2017) 200–220, https://doi.org/10.1016/j.knosys.2016.12.012.
[114] W. Jiang, J. Zhou, H. Liu, Y. Shan, A multi-step progressive fault diagnosis method for rolling element bearing based on energy entropy theory and hybrid ensemble auto-encoder, ISA Trans. 87 (2018) 235–250, https://doi.org/10.1016/j.isatra.2018.11.044.
[115] G. Ping, J. Chen, T. Pan, J. Pan, Degradation feature extraction using multi-source monitoring data via logarithmic normal distribution based variational auto-encoder, Comput. Ind. 109 (2019) 72–82, https://doi.org/10.1016/j.compind.2019.04.013.
[116] G. San Martin, E. López Droguett, V. Meruane, M. das Chagas Moura, Deep variational auto-encoders: A promising tool for dimensionality reduction and ball bearing elements fault diagnosis, Struct. Heal. Monit. (2018), https://doi.org/10.1177/1475921718788299.
[117] K. Zhang, B. Tang, Y. Qin, L. Deng, Fault diagnosis of planetary gearbox using a novel semi-supervised method of multiple association layers networks, Mech. Syst. Signal Process. 131 (2019) 243–260, https://doi.org/10.1016/j.ymssp.2019.05.049.
[118] Y.-R. Wang, Q. Jin, G.-D. Sun, C.-F. Sun, Planetary gearbox fault feature learning using conditional variational neural networks under noise environment, Knowledge-Based Syst. 163 (2019) 438–449, https://doi.org/10.1016/j.knosys.2018.09.005.
[119] A. Nazabal, P.M. Olmos, Z. Ghahramani, I. Valera, Handling incomplete heterogeneous data using VAEs, arXiv preprint arXiv:1807.03653, 2018.
[120] Z. Chen, R.V. Sánchez, C. Li, Gearbox fault identification and classification with convolutional neural networks, Shock Vib. 2015 (2015) 1–18.
[121] L. Guo, Y. Lei, N. Li, T. Yan, N. Li, Machinery health indicator construction based on convolutional neural networks considering trend burr, Neurocomputing 292 (2018) 142–150, https://doi.org/10.1016/j.neucom.2018.02.083.
[122] D. Belmiloud, T. Benkedjouh, M. Lachi, A. Laggoun, J.P. Dron, Deep convolutional neural networks for bearings failure prediction and temperature correlation, J. Vibroengineering 20 (8) (2018) 2878–2891, https://doi.org/10.21595/jve.2018.19637.
[123] L. Eren, T. Ince, S. Kiranyaz, A generic intelligent bearing fault diagnosis system using compact adaptive 1D CNN classifier, J. Signal Process. Syst. (2018) 1–11, https://doi.org/10.1007/s11265-018-1378-3.
[124] J. Pan, Y. Zi, J. Chen, Z. Zhou, B. Wang, LiftingNet: a novel deep learning network with layerwise feature learning from noisy mechanical data for fault classification, IEEE Trans. Ind. Electron. 65 (6) (2018) 4973–4982, https://doi.org/10.1109/TIE.2017.2767540.
[125] H. Jiang, F. Wang, H. Shao, H. Zhang, Rolling bearing fault identification using multilayer deep learning convolutional neural network, J. Vibroengineering 19 (1) (2017) 138–149, https://doi.org/10.21595/jve.2016.16939.
[126] Y. Chen, G. Peng, C. Xie, W. Zhang, C. Li, S. Liu, ACDIN: Bridging the gap between artificial and real bearing damages for bearing fault diagnosis, Neurocomputing 294 (2018) 61–71, https://doi.org/10.1016/j.neucom.2018.03.014.
[127] L. Eren, Bearing fault detection by one-dimensional convolutional neural networks, Math. Probl. Eng. 2017 (2017), https://doi.org/10.1155/2017/8617315.
[128] W. Zhang, G. Peng, C. Li, Y. Chen, Z. Zhang, A new deep learning model for fault diagnosis with good anti-noise and domain adaptation ability on raw vibration signals, Sensors (2017), https://doi.org/10.3390/s17020425.
[129] L. Jing, T. Wang, M. Zhao, P. Wang, An adaptive multi-sensor data fusion method based on deep convolutional neural networks for fault diagnosis of planetary gearbox, Sensors (2017), https://doi.org/10.3390/s17020414.
[130] Y. Yao et al., End-to-end convolutional neural network model for gear fault diagnosis based on sound signals, Appl. Sci. 8 (9) (2018) 1584, https://doi.org/10.3390/app8091584.
[131] M. Xia, T. Li, L. Xu, L. Liu, C.W. De Silva, Fault diagnosis for rotating machinery using multiple sensors and convolutional neural networks, IEEE/ASME Trans. Mechatronics (2017) 1–9, https://doi.org/10.1109/TMECH.2017.2728371.
[132] W. Zhang, X. Li, Q. Ding, Deep residual learning-based fault diagnosis method for rotating machinery, ISA Trans. (2018).
[133] Y. Han, B. Tang, L. Deng, Multi-level wavelet packet fusion in dynamic ensemble convolutional neural network for fault diagnosis, Measurement 127 (2018) 246–255, https://doi.org/10.1016/j.measurement.2018.05.098.
[134] D. Verstraete, A. Ferrada, E.L. Droguett, V. Meruane, M. Modarres, Deep learning enabled fault diagnosis using time-frequency image analysis of rolling element bearings, Shock Vib. 2017 (2017), https://doi.org/10.1155/2017/5067651.
[135] Y. Yoo, J.-G. Baek, A novel image feature for the remaining useful lifetime prediction of bearings based on continuous wavelet transform and convolutional neural network, Appl. Sci. 8 (7) (2018) 1102, https://doi.org/10.3390/app8071102.
[136] Z. Zhu, G. Peng, Y. Chen, H. Gao, A convolutional neural network based on a capsule network with strong generalization for bearing fault diagnosis, Neurocomputing 323 (2019) 62–75, https://doi.org/10.1016/j.neucom.2018.09.050.
[137] P. Wang, Ananya, R. Yan, R.X. Gao, Virtualization and deep recognition for system fault classification, J. Manuf. Syst. 44 (2017) 310–316, https://doi.org/10.1016/j.jmsy.2017.04.012.
[138] X. Li, W. Zhang, Q. Ding, Deep learning-based remaining useful life estimation of bearings using multi-scale feature extraction, Reliab. Eng. Syst. Saf. 182 (2019) 208–218, https://doi.org/10.1016/j.ress.2018.11.011.
[139] X. Ding, Q. He, Energy-fluctuated multiscale feature learning with deep ConvNet for intelligent spindle bearing fault diagnosis, IEEE Trans. Instrum. Meas. 66 (8) (2017) 1926–1935.
[140] L. Ren, Y. Sun, H. Wang, L. Zhang, Prediction of bearing remaining useful life with deep convolution neural network, IEEE Access 6 (2018) 13041–13049, https://doi.org/10.1109/ACCESS.2018.2804930.
[141] D.T. Hoang, H.J. Kang, Rolling element bearing fault diagnosis using convolutional neural network and vibration image, Cogn. Syst. Res. 53 (2019) 42–50, https://doi.org/10.1016/j.cogsys.2018.03.002.
[142] L. Wen, X. Li, L. Gao, Y. Zhang, A new convolutional neural network-based data-driven fault diagnosis method, IEEE Trans. Ind. Electron. 65 (7) (2018) 5990–5998.
[143] C. Lu, Z. Wang, B. Zhou, Intelligent fault diagnosis of rolling bearing using hierarchical convolutional network based health state classification, Adv. Eng. Informatics 32 (2017) 139–151, https://doi.org/10.1016/j.aei.2017.02.005.
[144] Z.-X. Hu, Y. Wang, M.-F. Ge, J. Liu, Data-driven fault diagnosis method based on compressed sensing and improved multi-scale network, IEEE Trans. Ind. Electron. (2019), https://doi.org/10.1109/TIE.2019.2912763.
[145] C. Szegedy et al., Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[146] S. Sabour, N. Frosst, G.E. Hinton, Dynamic routing between capsules, Adv. Neural Inf. Process. Syst. (2017) 3856–3866.
[147] Y. LeCun, LeNet-5, convolutional neural networks, http://yann.lecun.com/exdb/lenet/, 2015.
[148] T. Han, C. Liu, L. Wu, S. Sarkar, D. Jiang, An adaptive spatiotemporal feature learning approach for fault diagnosis in complex systems, Mech. Syst. Signal Process. 117 (2019) 170–187, https://doi.org/10.1016/j.ymssp.2018.07.048.
[149] L. Guo, N. Li, F. Jia, Y. Lei, J. Lin, A recurrent neural network based health indicator for remaining useful life prediction of bearings, Neurocomputing 240 (2017) 98–109, https://doi.org/10.1016/j.neucom.2017.02.045.
[150] W. Peng, Z.-S. Ye, N. Chen, Bayesian deep learning based health prognostics towards prognostics uncertainty, IEEE Trans. Ind. Electron. (2019), https://doi.org/10.1109/TIE.2019.2907440.
[151] R. Ma, T. Yang, E. Breaz, Z. Li, P. Briois, F. Gao, Data-driven proton exchange membrane fuel cell degradation predication through deep learning method, Appl. Energy 231 (2018) 102–115, https://doi.org/10.1016/j.apenergy.2018.09.111.
[152] Y. Zhang, R. Xiong, H. He, M.G. Pecht, Long short-term memory recurrent neural network for remaining useful life prediction of lithium-ion batteries, IEEE Trans. Veh. Technol. 67 (7) (2018) 5695–5705, https://doi.org/10.1109/TVT.2018.2805189.
[153] Y. Wu, M. Yuan, S. Dong, L. Lin, Y. Liu, Remaining useful life estimation of engineered systems using vanilla LSTM neural networks, Neurocomputing 275 (2018) 167–179, https://doi.org/10.1016/j.neucom.2017.05.063.
[154] J. Wu, K. Hu, Y. Cheng, H. Zhu, X. Shao, Y. Wang, Data-driven remaining useful life prediction via multiple sensor signals and deep long short-term memory neural network, ISA Trans. (2019), https://doi.org/10.1016/j.isatra.2019.07.004.
[155] S. Zhao, Y. Zhang, S. Wang, B. Zhou, C. Cheng, A recurrent neural network approach for remaining useful life prediction utilizing a novel trend features construction method, Measurement 146 (2019) 279–288.
[156] J. Zhang, P. Wang, R. Yan, R.X. Gao, Long short-term memory for machine remaining life prediction, J. Manuf. Syst. 48 (2018) 78–86, https://doi.org/10.1016/j.jmsy.2018.05.011.
[157] C.-G. Huang, H.-Z. Huang, Y.-F. Li, A bidirectional LSTM prognostics method under multiple operational conditions, IEEE Trans. Ind. Electron. 66 (11) (2019), https://doi.org/10.1109/TIE.2019.2891463.
[158] A. Elsheikh, S. Yacout, M.S. Ouali, Bidirectional handshaking LSTM for remaining useful life prediction, Neurocomputing 323 (2019) 148–156, https://doi.org/10.1016/j.neucom.2018.09.076.
[159] R. Zhao, D. Wang, R. Yan, K. Mao, F. Shen, J. Wang, Machine health monitoring using local feature-based gated recurrent unit networks, IEEE Trans. Ind. Electron. 65 (2) (2018) 1539–1548.
[160] X. Li, H. Jiang, X. Xiong, H. Shao, Rolling bearing health prognosis using a modified health index based hierarchical gated recurrent unit network, Mech. Mach. Theory 133 (2019) 229–249, https://doi.org/10.1016/j.mechmachtheory.2018.11.005.
[161] Q. Li, L. Chen, C. Shen, B. Yang, Enhanced generative adversarial networks for fault diagnosis of rotating machinery with imbalanced data, Meas. Sci. Technol. 30 (11) (2019).
[162] W. Jiang, C. Cheng, B. Zhou, G. Ma, Y. Yuan, A novel GAN-based fault diagnosis approach for imbalanced industrial time series, arXiv preprint arXiv:1904.00575, 2019, pp. 1–6.
[163] D. Zhao, F. Liu, H. Meng, Bearing fault diagnosis based on the switchable normalization SSGAN with 1-D representation of vibration signals as input, Sensors 19 (9) (2019), https://doi.org/10.3390/s19092000.
[164] S.A. Khan, A.E. Prosvirin, J.-M. Kim, Towards bearing health prognosis using generative adversarial networks: Modeling bearing degradation, in: 2018 International Conference on Advancements in Computational Sciences (ICACS), 2018, pp. 1–6.
[165] J. Wang, S. Li, B. Han, Z. An, H. Bao, S. Ji, Generalization of deep neural networks for imbalanced fault classification of machinery using generative adversarial networks, IEEE Access (2019), https://doi.org/10.1109/ACCESS.2019.2924003.
[166] D. Cabrera et al., Generative adversarial networks selection approach for extremely imbalanced fault diagnosis of reciprocating machinery, IEEE Access 7 (2019) 70643–70653, https://doi.org/10.1109/ACCESS.2019.2917604.
[167] F. Zhou, S. Yang, H. Fujita, D. Chen, C. Wen, Deep learning fault diagnosis method based on global optimization GAN for unbalanced data, Knowledge-Based Syst. 187 (2020) 104837.
[168] S. Shao, P. Wang, R. Yan, Generative adversarial networks for data augmentation in machine fault diagnosis, Comput. Ind. 106 (2019) 85–93, https://doi.org/10.1016/j.compind.2019.01.001.
[169] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, B. Frey, Adversarial autoencoders, arXiv preprint arXiv:1511.05644, 2015.
[170] I. Tolstikhin, O. Bousquet, S. Gelly, B. Schoelkopf, Wasserstein auto-encoders, arXiv preprint arXiv:1711.01558, 2017.
[171] H. Liu, J. Zhou, Y. Xu, Y. Zheng, X. Peng, W. Jiang, Unsupervised fault diagnosis of rolling bearings using a deep neural network based on generative adversarial networks, Neurocomputing 315 (2018) 412–424.
[172] C. Cheng, B. Zhou, G. Ma, D. Wu, Y. Yuan, Wasserstein distance based deep adversarial transfer learning for intelligent fault diagnosis, arXiv preprint arXiv:1903.06753, 2019.
[173] T. Han, C. Liu, W. Yang, D. Jiang, A novel adversarial learning framework in deep convolutional neural network for intelligent diagnosis of mechanical faults, Knowledge-Based Syst. 165 (2019) 474–487, https://doi.org/10.1016/j.knosys.2018.12.019.
[174] M. He, D. He, Deep learning based approach for bearing fault diagnosis, IEEE Trans. Ind. Appl. 53 (3) (2017) 3057–3065.
[175] H. Lee, R. Grosse, R. Ranganath, A.Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 609–616.
[176] H. Shao, H. Jiang, H. Zhang, W. Duan, T. Liang, S. Wu, Rolling bearing fault feature learning using improved convolutional deep belief network with compressed sensing, Mech. Syst. Signal Process. 100 (2018) 743–765, https://doi.org/10.1016/j.ymssp.2017.08.002.
[177] D. Park, S. Kim, Y. An, J.-Y. Jung, LiReD: A light-weight real-time fault detection system for edge computing using LSTM recurrent neural networks, Sensors (Switzerland) 18 (7) (2018), https://doi.org/10.3390/s18072110.
[178] R. Zhao, R. Yan, J. Wang, K. Mao, Learning to monitor machine health with convolutional bi-directional LSTM networks, Sensors 17 (2) (2017) 1–18, https://doi.org/10.3390/s17020273.
[179] Z. Wu, Y. Guo, W. Lin, A weighted deep representation learning model for imbalanced fault diagnosis in cyber-physical systems, Sensors (2018), https://doi.org/10.3390/s18041096.
[180] H. Qiao, T. Wang, P. Wang, S. Qiao, L. Zhang, A time-distributed spatiotemporal feature learning method for machine health monitoring with multi-sensor time series, Sensors 18 (9) (2018), https://doi.org/10.3390/s18092932.
[181] P. Malhotra et al., Multi-sensor prognostics using an unsupervised health index based on LSTM encoder-decoder, arXiv preprint arXiv:1608.06154, 2016.
[182] N. Gugulothu, V. TV, P. Malhotra, L. Vig, G. Shroff, P. Agarwal, Predicting remaining useful life using time series embeddings based on recurrent neural networks, 2017.
[183] H. Liu, J. Zhou, Y. Zheng, W. Jiang, Y. Zhang, Fault diagnosis of rolling bearings with recurrent neural network-based autoencoders, ISA Trans. 77 (2018) 167–178, https://doi.org/10.1016/j.isatra.2018.04.005.
[184] H. Shao, H. Jiang, H. Zhang, T. Liang, Electric locomotive bearing fault diagnosis using novel convolutional deep belief network, IEEE Trans. Ind. Electron. 65 (3) (2018) 2727–2736.
[185] A.S. Yoon et al., Semi-supervised learning with deep generative models for asset failure prediction, arXiv preprint arXiv:1709.00845, 2017.
[186] Z. Chen, W. Li, Multisensor feature fusion for bearing fault diagnosis using sparse autoencoder and deep belief network, IEEE Trans. Instrum. Meas. 66 (7) (2017) 1693–1702, https://doi.org/10.1109/TIM.2017.2669947.
[187] W. Lu, Y. Li, Y. Cheng, D. Meng, B. Liang, P. Zhou, Early fault detection approach with deep architectures, IEEE Trans. Instrum. Meas. 67 (7) (2018) 1679–1689, https://doi.org/10.1109/TIM.2018.2800978.
[188] A. Listou Ellefsen, E. Bjørlykhaug, V. Æsøy, S. Ushakov, H. Zhang, Remaining useful life predictions for turbofan engine degradation using semi-supervised deep architecture, Reliab. Eng. Syst. Saf. 183 (2019) 240–251, https://doi.org/10.1016/j.ress.2018.11.027.
[189] X. Li, W. Zhang, Q. Ding, Cross-domain fault diagnosis of rolling element bearings using deep generative neural networks, IEEE Trans. Ind. Electron. 66 (7) (2019) 5525–5534, https://doi.org/10.1109/TIE.2018.2868023.
[190] B. Zhang, W. Li, J. Hao, X.-L. Li, M. Zhang, Adversarial adaptive 1-D convolutional neural networks for bearing fault diagnosis under varying working condition, arXiv preprint arXiv:1805.00778, 2018, pp. 1–19.
[191] W. Yu, I.Y. Kim, C. Mechefske, Remaining useful life estimation using a bidirectional recurrent neural network based autoencoder scheme, Mech. Syst. Signal Process. 129 (2019) 764–780, https://doi.org/10.1016/j.ymssp.2019.05.005.
[192] M. Wang, W. Deng, Deep visual domain adaptation: A survey, Neurocomputing 312 (2018) 135–153, https://doi.org/10.1016/j.neucom.2018.05.083.
[193] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[194] Model Zoo: Pre-trained networks, https://github.com/BVLC/caffe/wiki/Model-Zoo.
[195] M.Z. Alom et al., The history began from AlexNet: A comprehensive survey on deep learning approaches, arXiv preprint arXiv:1803.01164, 2018.
[196] L. Wen, X. Li, X. Li, L. Gao, A new transfer learning based on VGG-19 network for fault diagnosis, in: 2019 IEEE 23rd International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2019, pp. 205–209.
[197] S. Shao, S. McAleer, R. Yan, P. Baldi, Highly accurate machine fault diagnosis using deep transfer learning, IEEE Trans. Ind. Informatics 15 (4) (2018) 2446–2455.
[198] P. Malhotra, V. TV, L. Vig, P. Agarwal, G. Shroff, TimeNet: Pre-trained deep recurrent neural network for time series classification, arXiv preprint arXiv:1706.08838, 2017.
[199] K. Kashiparekh, J. Narwariya, P. Malhotra, L. Vig, G. Shroff, ConvTimeNet: A pre-trained deep convolutional neural network for time series classification, in: 2019 International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–8.
[200] G. Xu, M. Liu, Z. Jiang, D. Söffker, W. Shen, Bearing fault diagnosis method based on deep convolutional neural network and random forest ensemble learning, Sensors 19 (5) (2019) 1088.
[201] P. Ma, H. Zhang, W. Fan, C. Wang, G. Wen, X. Zhang, A novel bearing fault diagnosis method based on 2D image representation and transfer learning-convolutional neural network, Meas. Sci. Technol. 30 (5) (2019) 55402.
[202] L. Wen, X. Li, L. Gao, A transfer convolutional neural network for fault diagnosis based on ResNet-50, Neural Comput. Appl. (2019) 1–14.
[203] J. Wang, Z. Mo, H. Zhang, Q. Miao, A deep learning method for bearing fault diagnosis based on time-frequency image, IEEE Access 7 (2019) 42373–42383.
[204] W. Mao, L. Ding, S. Tian, X. Liang, Online detection for bearing incipient fault based on deep transfer learning, Measurement 152 (2020) 107278.
[205] Y. Li, N. Wang, J. Shi, J. Liu, X. Hou, Revisiting batch normalization for practical domain adaptation, arXiv preprint arXiv:1603.04779, 2016.
[206] A. Rozantsev, M. Salzmann, P. Fua, Beyond sharing weights for deep domain adaptation, IEEE Trans. Pattern Anal. Mach. Intell. 41 (4) (2018) 801–814.
[207] B. Zhang, W. Li, X.L. Li, S.K. Ng, Intelligent fault diagnosis under varying working conditions based on domain adaptive convolutional neural networks, IEEE Access 6 (2018) 66367–66384, https://doi.org/10.1109/ACCESS.2018.2878491.
[208] W. Lu et al., Deep model based domain adaptation for fault diagnosis, IEEE Trans. Ind. Electron. 64 (3) (2017) 2296–2305.
[209] X. Li, W. Zhang, Q. Ding, A robust intelligent fault diagnosis method for rolling element bearings based on deep distance metric learning, Neurocomputing 310 (2018) 77–95, https://doi.org/10.1016/j.neucom.2018.05.021.
[210] W. Qian, S. Li, X. Jiang, Deep transfer network for rotating machine fault analysis, Pattern Recognit. 96 (2019).
[211] B. Yang, Y. Lei, F. Jia, S. Xing, An intelligent fault diagnosis approach based on transfer learning from laboratory bearings to locomotive bearings, Mech. Syst. Signal Process. 122 (2019) 692–706, https://doi.org/10.1016/j.ymssp.2018.12.051.
[212] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN, arXiv preprint arXiv:1701.07875, 2017.
[213] X. Li, H. Jiang, K. Zhao, R. Wang, A deep transfer nonnegativity-constraint sparse autoencoder for rolling bearing fault diagnosis with few labeled data, IEEE Access 7 (2019) 91216–91224, https://doi.org/10.1109/ACCESS.2019.2926234.
[214] Y. Xie, T. Zhang, A transfer learning strategy for rotation machinery fault diagnosis based on cycle-consistent generative adversarial networks, in: Proc. 2018 Chinese Automation Congress (CAC), 2018, pp. 1309–1313, https://doi.org/10.1109/CAC.2018.8623346.
[215] M. Sun, H. Wang, P. Liu, S. Huang, P. Fan, A sparse stacked denoising autoencoder with optimized transfer learning applied to the fault diagnosis of rolling bearings, Measurement 146 (2019) 305–314, https://doi.org/10.1016/j.measurement.2019.06.029.
[216] X. Wang, H. He, L. Li, A hierarchical deep domain adaptation approach for fault diagnosis of power plant thermal system, IEEE Trans. Ind. Informatics (2019), https://doi.org/10.1109/TII.2019.2899118.
[217] X. Li, W. Zhang, Q. Ding, J.Q. Sun, Multi-layer domain adaptation method for rolling bearing fault diagnosis, Signal Process. 157 (2019) 180–197, https://doi.org/10.1016/j.sigpro.2018.12.005.
[218] D. Xiao, Y. Huang, L. Zhao, C. Qin, H. Shi, C. Liu, Domain adaptive motor fault diagnosis using deep transfer learning, IEEE Access 7 (2019), https://doi.org/10.1109/ACCESS.2019.2921480.
[219] T. Han, C. Liu, W. Yang, D. Jiang, Deep transfer network with joint distribution adaptation: a new intelligent fault diagnosis framework for industry application, ISA Trans. (2019).
[220] Z. Chen, K. Gryllias, W. Li, Intelligent fault diagnosis for rotary machinery using transferable convolutional neural network, IEEE Trans. Ind. Informatics (2019), https://doi.org/10.1109/TII.2019.2917233.
[221] D. Xiao, Y. Huang, C. Qin, Z. Liu, Y. Li, C. Liu, Transfer learning with convolutional neural networks for small sample size problem in machinery fault diagnosis, Proc. Inst. Mech. Eng. Part C J. Mech. Eng. Sci. (2019) 0954406219840381.
[222] P.R. de O. da Costa, A. Akcay, Y. Zhang, U. Kaymak, Remaining useful lifetime prediction via deep domain adaptation, arXiv preprint arXiv:1907.07480, 2019, pp. 1–30.
[223] X. Li, W. Zhang, N.-X. Xu, Q. Ding, Deep learning-based machinery fault diagnostics with domain adaptation across sensors at different places, IEEE Trans. Ind. Electron. (2019).
[224] X. Li, W. Zhang, Q. Ding, X. Li, Diagnosing rotating machines with weakly supervised data using deep transfer learning, IEEE Trans. Ind. Informatics (2019).
[225] L. Wen, L. Gao, X. Li, A new deep transfer learning based on sparse auto-encoder for fault diagnosis, IEEE Trans. Syst. Man Cybern. Syst. 49 (1) (2017) 136–144, https://doi.org/10.1109/TSMC.2017.2754287.
[226] C. Sun, M. Ma, Z. Zhao, S. Tian, R. Yan, X. Chen, Deep transfer learning based on sparse autoencoder for remaining useful life prediction of tool in manufacturing, IEEE Trans. Ind. Informatics 15 (4) (2019) 2416–2425, https://doi.org/10.1109/TII.2018.2881543.
[227] C. Zhang, P. Patras, H. Haddadi, Deep learning in mobile and wireless networking: A survey, IEEE Commun. Surv. Tutorials 21 (3) (2019) 2224–2287.
[228] N.P. Jouppi et al., In-datacenter performance analysis of a tensor processing unit, in: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017, pp. 1–12.
[229] GitHub: The world's leading software development platform, https://github.com/.
[230] Z. Wang, K. Liu, J. Li, Y. Zhu, Y. Zhang, Various frameworks and libraries of machine learning and deep learning: a survey, Arch. Comput. Methods Eng. (2019) 1–24.
[231] J. Zacharias, M. Barz, D. Sonntag, A survey on deep learning toolkits and libraries for intelligent user interfaces, arXiv preprint arXiv:1803.04818, 2018.
[232] S. Shi, Q. Wang, P. Xu, X. Chu, Benchmarking state-of-the-art deep learning software tools, in: 2016 7th International Conference on Cloud Computing and Big Data (CCBD), 2016, pp. 99–104.
[233] The Microsoft Cognitive Toolkit, https://www.microsoft.com/en-us/cognitive-toolkit/.
[234] Deeplearning4j: Open-source distributed deep learning for the JVM.
[235] S. Shi, Q. Wang, P. Xu, X. Chu, Benchmarking state-of-the-art deep learning software tools, in: Proc. 2016 7th International Conference on Cloud Computing and Big Data (CCBD), 2017, pp. 99–104, https://doi.org/10.1109/CCBD.2016.029.
[236] P. Zheng et al., Smart manufacturing systems for Industry 4.0: Conceptual framework, scenarios, and future perspectives, Front. Mech. Eng. 13 (2) (2018) 137–150.
[237] Q. Qi, F. Tao, A smart manufacturing service system based on edge computing, fog computing, and cloud computing, IEEE Access 7 (2019) 86769–86777.
[238] J. Watkins, C. Teubert, J. Ossenfort, Prognostics as-a-service: a scalable cloud architecture for prognostics, in: 11th Annual Conference of the Prognostics and Health Management Society, 2019.
[239] L. Li, K. Ota, M. Dong, Deep learning for smart industry: Efficient manufacture inspection system with fog computing, IEEE Trans. Ind. Informatics 14 (10) (2018) 4665–4673.
[240] X. Li, W. Zhang, Q. Ding, J.-Q. Sun, Intelligent rotating machinery fault diagnosis based on deep learning using data augmentation, J. Intell. Manuf. 31 (2) (2020) 433–452.
[241] D. Chen, S. Yang, F. Zhou, Transfer learning based fault diagnosis with missing data due to multi-rate sampling, Sensors 19 (8) (2019) 1826.
[242] R. Zemouri, M. Lévesque, N. Amyot, C. Hudon, O. Kokoko, A. Tahan, Deep convolutional variational autoencoder as a 2D-visualization tool for partial discharge source classification in hydrogenerators, IEEE Access (2019).
[243] A. Brown, A. Tuor, B. Hutchinson, N. Nichols, Recurrent neural network attention mechanisms for interpretable system log anomaly detection, in: Proceedings of the First Workshop on Machine Learning for Computing Systems, 2018, pp. 1–8.
