
Intelligent Systems with Applications 10–11 (2021) 200049


A Sequence-to-Sequence Approach for Remaining Useful Lifetime Estimation Using Attention-augmented Bidirectional LSTM

Sayed Rafay Bin Shah a,∗, Gavneet Singh Chadha a, Andreas Schwung a, Steven X. Ding b

a Department of Automation Technology, South Westphalia University of Applied Sciences, Luebecker Ring 2, 59494 Soest, Germany
b Department of Automatic Control and Complex Systems, University of Duisburg-Essen, Bismarckstrasse 81, 47057 Duisburg, Germany

Article info
Article history: Received 31 May 2021; Revised 8 July 2021; Accepted 20 July 2021; Available online 31 July 2021
Keywords: Remaining useful lifetime estimation; Encoder-decoder networks; Bidirectional long-short term memory; Attention mechanism

Abstract
We propose a novel sequence-to-sequence prediction approach for the estimation of the remaining useful lifetime (RUL) of technical components. The approach is based on deep recurrent neural network structures, namely bidirectional Long Short Term Memory (LSTM) networks, which we augment with an attention mechanism to allow for a more fine-grained information flow between the input and output sequence. Using the base architecture as a reference, we experiment with various forms of attention mechanisms as well as different forms of additional input embeddings. Further, we analyse the impact of the sequence length on the estimation quality. We apply our approach to the well known C-MAPSS data set, which has previously served as a benchmark dataset for RUL prediction. We obtain state-of-the-art results on the data set and provide a thorough hyperparameter study that underlines that a simpler but well-tuned architecture can achieve comparable or better performance than highly complex architectures.

© 2021 The Authors. Published by Elsevier Ltd.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

1. Introduction

The requirement for manufacturing environments and machines to run in a 24/7 setting poses high production management challenges. Due to the required high workload, machines' unintentional standstill should be avoided, while intentional standstill times should be reduced as much as possible. A common approach to prevent unintentional standstill is the incorporation of recurring maintenance actions in the production schedule. This periodic maintenance prevents machines from severe breakdowns but results in potential machine downtimes due to unnecessary maintenance actions. Predictive maintenance approaches provide a solution based on thorough condition monitoring Martin (1994); Nandi et al. (2005), which generally uses a model of the degradation process in the component, allowing for a more detailed prediction of future downtimes Isermann (2005). Predictive maintenance results in various advantages such as economic inventory management for spare parts, prediction-based planned maintenance of failure-prone equipment, condition monitoring and life cycle optimisation of a system. However, deriving models for predictive maintenance can be a challenging and time-consuming task.

In parallel, the ongoing digitisation augments the modern production environment with more and more information about its machines' actual status. Using various sensor readings and measurements all over the process allows for a detailed view of the overall plant condition, continuous monitoring of the production process, and assists in analysing the production's possible weaknesses. Consequently, the obtained data can be used for data-based maintenance purposes. Various approaches have been proposed for data-based maintenance, see Jardine et al. (2006); Tsui et al. (2015) for overviews. One of the critical methodologies of data-based maintenance is estimating the remaining useful lifetime (RUL) of a machine using historical data Lei et al. (2018). RUL is the estimated duration for which an item, product, component, or system can serve its intended purpose before it needs to be replaced. RUL estimation includes modelling the degradation pattern in a system, which can arise either during normal operation (no fault symptoms) or after the detection of a fault. Therefore, an efficient RUL estimation approach guarantees a safe operation of a product until the end of its life. Among the various approaches for data-driven RUL estimation, neural networks have proven particularly useful due to recent advances in the field of deep learning Goodfellow et al. (2016).

∗ Corresponding author.
E-mail address: shah.sayedrafaybin@fh-swf.de (S.R.B. Shah).

https://doi.org/10.1016/j.iswa.2021.200049
2667-3053/© 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Neural networks have been researched in different application fields including modelling Esen et al. (2008); Hikmet Esen et al. (2009, 2017), forecasting Hikmet Esen et al. (2008a,b), machine health monitoring Zhao et al. (2019a) and RUL estimation. Mainly, convolutional neural networks (CNN) Yann LeCun et al. (1990) and recurrent neural networks (RNN) such as long short term memories (LSTM) Hochreiter and Schmidhuber (1997) and gated recurrent units (GRU) Junyoung Chung et al. (2014) have provided more dynamic and efficient ways of performing RUL estimation.

This paper introduces a novel deep learning architecture for remaining useful lifetime estimation that operates directly on raw data sets. Notably, we propose to apply sequence-to-sequence architectures Sutskever et al. (2014), best known from natural language processing tasks, to extract the complex degradation pattern inherent to RUL estimation from a machine's sensor readings. To this end, we opt for a bidirectional LSTM as the base network, which we operate in an encoder-decoder (ED) style manner. Following the recent improvements to encoder-decoder architectures provided by attention mechanisms, we augment the architecture with different attention modules differing mainly in how they map input and hidden variables to the attention keys and queries. Further, we provide a detailed discussion on the impact of different ED architecture hyperparameters on the RUL estimation performance and the input encoding required for the sequence processing. The approach is applied to the well known C-MAPSS Saxena et al. (06/10/2008 - 09/10/2008) benchmark dataset and compared with other methods. The results obtained with the proposed architecture are superior to or on par with current state-of-the-art approaches, underlining the approach's applicability. Particularly, we show that even simpler architectures can result in comparable or better performance than more complex architectures with a high number of parameters.

The contributions of the paper can be summarised as follows:

• We present a novel approach for remaining useful lifetime estimation based on attention augmented sequence-to-sequence neural networks using bidirectional LSTMs. The sequence-to-sequence framework appears highly suitable to cover the inherent nonlinear trend analysis problem underlying RUL estimation.
• A novel Shared Kernel Convolution (SKC) Neural Network approach as an encoder to the bidirectional LSTM has been proposed to model the inter-dependencies among the different input variables.
• A universally applicable sliding window approach for target sequence generation has been presented to better represent the health status of a system.
• We discuss architectural variations, including different attention mechanisms operating on the hidden representation of the bidirectional LSTM and various types of input embeddings. Further, we analyse the impact of the sequence length on the prediction accuracy and provide a thorough analysis of hyperparameters, resulting in guidelines on architectural components' suitability.
• We apply the approach to the well known C-MAPSS data set serving as a very challenging benchmark study in previous works. We report state-of-the-art results on the data set with comparably lightweight and straightforward architectures.

The paper is organised as follows. Section 2 discusses work related to our approach. Section 3 presents preliminaries concerning the considered neural network structures. Section 4 presents the novel approach for RUL estimation using RNNs, followed by the description of the training data set generation for the Sequence-to-Sequence NN in Section 5 and the dataset description in Section 6. In Section 7 we provide results on the C-MAPSS data set. Section 8 concludes the paper.

2. Related work

We present an overview of remaining useful lifetime estimation methods with particular emphasis on the recently prevalent deep learning approaches, while we refer to Jardine et al. (2006); Tsui et al. (2015) for overviews of classical techniques. A review on the application of deep learning approaches for prognostics and health management is presented in Khan and Yairi (2018).

Notably, two types of Deep Neural Networks (DNN) have been used in RUL estimation, namely RNNs and CNNs. Additionally, various architectures combining both network types have been proposed in the literature. In Zheng et al. (2017), a stacked LSTM network combined with multiple fully connected (FC) layers has been presented, while in Wang et al. (2018) a stacked bidirectional LSTM network with additional FC layers for RUL prediction of the C-MAPSS dataset is shown. Alternatively, Zhao et al. (2019b) first construct trend features that are fed to a stacked LSTM network for predicting RULs. The approach proposed in Listou Ellefsen et al. (2019) introduces an unsupervised pre-training stage using a Restricted Boltzmann Machine (RBM) to extract complex raw input features and subsequently employs a genetic algorithm to tune the hyperparameters. Following the pre-training, supervised training is performed for RUL prediction. A vanilla LSTM neural network model along with dynamic differential technology to achieve good RUL prediction accuracy is proposed in Wu et al. (2018). A bidirectional LSTM model for RUL estimation of the C-MAPSS dataset is presented in Elsheikh et al. (2019).

To further strengthen the learning abilities of RNN-based encoder-decoder layers, attention mechanisms have been proposed in Dzmitry Bahdanau et al. (2015) and Luong et al. (2015) to allow each decoder state to attend to all the encoder hidden states before generating the next output. Please refer to Chaudhari et al. (2019b) for a detailed overview of the different types of attention methods. While these approaches concentrate on machine translation tasks, Chen et al. (2021) use such an attention mechanism for RUL prediction, where handcrafted features are extracted, concatenated with the LSTM output and fed to a regression layer. The attention-based LSTM encoder-decoder network proposed in Ragab et al. (2020) reconstructs the input data and generates the RUL using a parallel RUL predictor layer. The RUL predictor layer receives dual latent feature representations, i.e., the attended encoder latent representations and the latent decoder representations, as inputs.

A survey of 1D CNN models for condition monitoring in machines is presented in Kiranyaz et al. (2021). The first use of CNNs for RUL estimation dates back to Sateesh Babu et al. (2016), while Li et al. (2018) performs the convolution operation with kernels of unit width. These unit width convolution kernels allow for kernel weight sharing across raw sensors, thereby enhancing the network's ability to learn abstract feature information. Recently, a generalized dilation CNN methodology for RUL estimation has been proposed in Gavneet Singh Chadha et al. (2021) to model long term time dependencies. An attention-based CNN approach is proposed in Tan and Teo (2021), where the CNN extracts features across multiple temporal axes that are fed to an attention layer to predict the RUL. The work in Tan and Teo (2021) replaces the Softmax activation with a Sigmoid activation in the attention mechanism to add multivariate RUL estimation features. The global Luong concatenation method to calculate attention alignment scores is proposed in Paulo Roberto De Oliveira Da Costa et al. (2020) for RUL prediction on the C-MAPSS dataset. An attention CNN-LSTM architecture is introduced in Zhang et al. (2017), where a CNN is used for raw input feature extraction before feeding into a stacked LSTM network. The LSTM output from all time-steps is attended


to the last hidden state via an attention layer to finally generate a rotary machine's RUL. The study Kong et al. (2019) proposes a hybrid CNN-LSTM RUL predictor where raw sensor data is pre-processed to generate a 1D health indicator matrix. The study Peng et al. (2021) extracts spatial and temporal features from CNN and LSTM networks that are fused and passed to another CNN layer to predict RULs of the C-MAPSS sub-datasets FD001 and FD003. Another hybrid network of parallel CNN and LSTM paths is proposed in Li et al. (2019b) to reduce the influence of CNN extracted features on series-connected LSTMs. However, the model adds an LSTM network that processes the fused outputs of the previous CNN and LSTM paths and predicts RUL values.

Encoder-decoder models for C-MAPSS RUL prediction are proposed in two separate works by Liu et al. (2019) and Liu et al. (2021). In the model presented in Liu et al. (2019), the encoder is made up of stacked bidirectional LSTM layers, followed by multiple CNNs with an intermediate pooling, and the decoder is simply a network of three FC layers. The proposed architecture in Liu et al. (2021) utilises a self-attention mechanism for feature extraction. The feature-attended output is fed to a bidirectional GRU-CNN encoder, followed by a decoder with flattening and FC layers.

In contrast to existing methods, this study proposes a Sequence-to-Sequence (Seq2Seq) RUL estimation model wherein not a point estimate of the RUL but a sequence estimate is predicted. Furthermore, this work compares the different attention methods and encoder models for the RUL estimation task, which is missing in the literature. It must also be noted that most of the well performing architectures generally require a high number of learnable parameters, which renders their application difficult for small RUL data sets. Therefore, we also provide a thorough comparison of the number of parameters of our proposed models and the models provided in the literature.

3. Recurrent Neural Networks Architectures

We consider long-short term memory networks as the backbone network architecture for processing sequential data and CNNs for input encoding. Particularly, we employ bidirectional LSTMs, first as they generally provide better performance scores, and second as they allow for a change of view on the sequence prediction task as discussed in Gavneet Singh Chadha et al. (2020). Note that other bidirectional, gated RNNs like GRUs can be used similarly.

3.1. Long-Short Term Memory

Originally, LSTMs have been proposed in Hochreiter and Schmidhuber (1997). Recently, LSTMs have shown great success in various domains, including speech recognition and machine translation. The complete network is obtained similarly to the classical RNN by stacking LSTM layers. However, the structure of the neural units varies considerably. A schematic of the LSTM is illustrated in Fig. 1. As can be seen, the LSTM includes three gates, namely the input gate i_t, the forget gate f_t and the output gate o_t, and a cell state c_t, while σ, γ and ψ are pointwise activation functions. In the vanilla LSTM, the logistic sigmoid function σ(x) = 1/(1 + e^{−x}) is used for the gate activations, while the hyperbolic tangent γ(x) = ψ(x) = tanh(x) is used as the candidate cell state and new hidden state activation.

Fig. 1. Structure of the LSTM cell.

The vector equations for the LSTM layer in the forward pass can be written as

i_t = σ(W_i x_t + R_i h_{t−1} + b_i),   (1)
f_t = σ(W_f x_t + R_f h_{t−1} + b_f),   (2)
C̃_t = γ(W_c x_t + R_c h_{t−1} + b_c),   (3)
c_t = C̃_t ⊙ i_t + c_{t−1} ⊙ f_t,   (4)
o_t = σ(W_o x_t + R_o h_{t−1} + b_o),   (5)
h_t = ψ(c_t) ⊙ o_t,   (6)

where x_t, h_{t−1} and c_{t−1} are the new input, previous hidden and previous cell states respectively, and ⊙ denotes the pointwise (Hadamard) product. The input gate i_t from Eq. (1) determines how much new input information should be added or left out while creating the new cell state. Contrarily, the forget gate f_t from Eq. (2) decides how much memory from the previous state should be retained or forgotten. In Eq. (3), a candidate for the new cell state, C̃_t, is generated. The results of the pointwise multiplications between i_t, C̃_t and f_t, c_{t−1} are added to generate the new cell state, c_t, in Eq. (4). The output gate, o_t, in Eq. (5) determines how much of the new information should be used as the instantaneous LSTM cell output. The new hidden state, h_t, is formed by passing the new cell state, c_t, through the hyperbolic tangent activation, ψ, and then performing a pointwise multiplication with the output gate, o_t, as shown in Eq. (6). The weight matrices W_c, W_i, W_f, W_o ∈ R^{H×I}, R_c, R_i, R_f, R_o ∈ R^{H×H} and vectors b_c, b_i, b_f, b_o ∈ R^H are the input, recurrent and bias weights, respectively, where H denotes the hidden dimension of the LSTM blocks and I is the dimensionality of the input vector. The initial states are given by h_0 ∈ R^H and c_0 ∈ R^H.

In contrast to the classical RNN, additional memory cells c_t are introduced in the LSTM framework. Additionally, gates are incorporated to control the information flow from the input, the previous time step and the state, allowing the memory cell to keep its full information even over many time steps. Hence, the LSTM can remove or add information over long periods.
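As an illustration of Eqs. (1)–(6), the following minimal PyTorch sketch writes out one LSTM cell step explicitly; the tensor names and dimensions are chosen for illustration only, and in practice the built-in torch.nn.LSTM realises the same computation.

```python
import torch


def lstm_cell_step(x_t, h_prev, c_prev, W, R, b):
    """One LSTM forward step following Eqs. (1)-(6).

    W: input weights (H x I), R: recurrent weights (H x H), b: biases (H,)
    for the gates/candidate 'i', 'f', 'c', 'o'.
    """
    i_t = torch.sigmoid(W['i'] @ x_t + R['i'] @ h_prev + b['i'])   # Eq. (1) input gate
    f_t = torch.sigmoid(W['f'] @ x_t + R['f'] @ h_prev + b['f'])   # Eq. (2) forget gate
    c_tilde = torch.tanh(W['c'] @ x_t + R['c'] @ h_prev + b['c'])  # Eq. (3) candidate cell state
    c_t = c_tilde * i_t + c_prev * f_t                             # Eq. (4) new cell state
    o_t = torch.sigmoid(W['o'] @ x_t + R['o'] @ h_prev + b['o'])   # Eq. (5) output gate
    h_t = torch.tanh(c_t) * o_t                                    # Eq. (6) new hidden state
    return h_t, c_t


# Toy usage with I = 14 input features and H = 40 hidden units
I, H = 14, 40
W = {k: torch.randn(H, I) for k in 'ifco'}
R = {k: torch.randn(H, H) for k in 'ifco'}
b = {k: torch.zeros(H) for k in 'ifco'}
h, c = torch.zeros(H), torch.zeros(H)           # h_0, c_0
for x in torch.randn(30, I):                    # unroll over a 30-step sequence
    h, c = lstm_cell_step(x, h, c, W, R, b)
```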

3.2. Bidirectional LSTM

Bidirectional LSTM (BiLSTM) structures use two independent LSTM networks, where the first network processes the data in the usual forward sequential order, producing →h_t, and the second network processes it in the reverse order, producing ←h_t Graves and Schmidhuber (2005). At any timestep,


the output from each of the forward and backward cells is concatenated to produce a single output, y_t. A bidirectional LSTM network is shown in Fig. 2. Both the forward and reverse LSTM networks are initiated with the same hidden and cell states. At any timestep t, the equation involved in a BiLSTM cell is Yu et al. (2015),

ŷ_t = →W_hy · →h_t + ←W_hy · ←h_t + b_y   (7)

where →W_hy and ←W_hy denote the forward and reverse hidden-to-output weights and b_y denotes the output bias.

Fig. 2. Structure of the bidirectional LSTM.

By providing a reversed copy of the input data, the individual LSTM cells can learn context from future information. Hence the network, at any time, can process both past and future information, in contrast to the unidirectional LSTM's ability to process only past information. Additionally, BiLSTMs are computationally inexpensive since they use the conventional BPTT training method like LSTM networks Graves and Schmidhuber (2005); Lipton et al. (2015). This work makes use of the bidirectional feature in all the designed LSTM networks.
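A minimal sketch of the bidirectional processing and the output projection of Eq. (7), assuming a PyTorch backend; the dimensions are illustrative, and nn.LSTM with bidirectional=True handles the forward and reverse passes internally before the concatenated states are projected.

```python
import torch
import torch.nn as nn

n_features, hidden = 14, 40
bilstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                 batch_first=True, bidirectional=True)
# hidden-to-output projection corresponding to Eq. (7): the forward and
# reverse hidden states are concatenated before the linear map
W_hy = nn.Linear(2 * hidden, 1)

x = torch.randn(8, 30, n_features)        # (batch, timesteps, features)
y_bidir, (h_n, c_n) = bilstm(x)           # y_bidir: (batch, 30, 2*hidden)
y_hat = W_hy(y_bidir)                     # (batch, 30, 1), one output per timestep
```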
3.3. Shared Kernel Convolutional 2D Neural Networks

Convolutional 2D neural networks (CNN-2D) apply a two-dimensional moving filter across 3D input data to create output feature maps by capturing the data's spatial and temporal representations. We use CNN-2D architectures similar to Li et al. (2018) for extracting temporal features across each spatial dimension, as shown in Fig. 3, which serves as an input encoding upstream of the LSTM.

Fig. 3. Shared Kernel Convolutional 2D Neural Network.

Feature maps, N_fm, refer to the number of channels present in the data. As we consider time-series data, the number of input channels equals 1. After each 2D-convolutional layer, the features extracted from the input are mapped by the corresponding filters into multiple channels through a non-linear activation function. In Fig. 3, the input is 3D data with timesteps N_tw = t_1, t_2, ..., t_T, features N_ft = f_1, f_2, ..., f_F and a channel size of 1.

The information extracted from this input by a CNN-2D(1) layer is mapped into an output of shape Z_1 ∈ R^{N_fm^(1) × N_tw × N_ft}, with N_fm^(1) feature maps Li et al. (2018); Wu (2017). The number of timesteps, N_tw, and the number of features, N_ft, are preserved by the use of padding.

Filters or kernels move over the input data and extract information into output feature maps. In CNN-2D architectures, a kernel is 2-dimensional in shape (K_L × K_W), and it moves over both the temporal and feature dimensions, compared to the 1-dimensional movement of CNN-1D kernels over time Albawi et al. (2017). K_L and K_W are the height and width of the kernel. The number of kernels involved in a convolutional layer also depends on the number of feature maps. For N_fm = 2, there will be two kernels of the same size, extracting input information to map into their respective feature maps. This is illustrated in Fig. 3. These kernels contain trainable weights, and they help extract local or global information based on their size compared to the input sample. The shared-kernel (SK) 2D-CNN architecture used in this work requires the kernel to extract temporal information from each sensor's data individually and finally create feature maps based on all the sensors' information. This is accomplished by setting the kernel's width, K_W, to 1, i.e., each kernel spans only one feature dimension at a time. This structure is shown in Fig. 3, where the kernels on the input data are marked in red and blue.

For a kernel of size (K_L × 1), spanning from the i-th timestep up to the (i+K_L−1)-th timestep of an input sequence x = x_1, x_2, ..., x_T with features f = 0, 1, ..., F, the learned feature representation in the temporal dimension along a single feature, Z_if, mapped onto a


single output feature map is given by Gu et al. (2018),

Z_if = [W^T · x_{i:i+K_L−1} + b]_f   (8)

where W is the kernel weight from the input to the output feature map with W ∈ R^{N_fm^out × N_fm^in × K_L × K_W}, and b ∈ R^{N_fm^out} is the bias. In the case of the SK 2D-CNN with one input channel, W ∈ R^{N_fm^out × 1 × K_L × 1}. For n = 0, 1, ..., N output feature maps, the final convolved output is given by

C_out^{0:N} = [Z_i0, Z_i1, Z_i2, ..., Z_iF]^{0:N}   (9)

The convolution output is flattened and passed through a fully connected (FC) layer with weight matrix W_flatten ∈ R^{(N_fm · N_tw · N_ft) × (N_tw · N_ft)} in order to combine all the extracted information across all feature maps. The FC output is then reshaped to create a vector of the original input size, i.e., R^{N_tw × N_ft}.
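The following sketch shows one possible PyTorch realisation of the shared-kernel convolution of Eqs. (8)–(9) together with the flatten–FC–reshape step; the layer sizes (N_tw, N_ft, N_fm, K_L) are illustrative, not the tuned values used later.

```python
import torch
import torch.nn as nn


class SharedKernelConv2d(nn.Module):
    """Shared-kernel CNN-2D input embedding: kernels of shape (K_L, 1) slide
    over time for every sensor separately; an FC layer then maps the flattened
    feature maps back to the original (N_tw, N_ft) shape."""

    def __init__(self, n_tw, n_ft, n_fm=8, k_l=3):
        super().__init__()
        # padding=(k_l // 2, 0) preserves the temporal length for odd k_l
        self.conv = nn.Conv2d(1, n_fm, kernel_size=(k_l, 1), padding=(k_l // 2, 0))
        self.act = nn.LeakyReLU()
        self.fc = nn.Linear(n_fm * n_tw * n_ft, n_tw * n_ft)   # W_flatten
        self.n_tw, self.n_ft = n_tw, n_ft

    def forward(self, x):                                # x: (batch, n_tw, n_ft)
        z = self.act(self.conv(x.unsqueeze(1)))          # (batch, n_fm, n_tw, n_ft)
        z = self.fc(z.flatten(start_dim=1))              # combine all feature maps
        return z.view(-1, self.n_tw, self.n_ft)          # reshape to the input size


skc = SharedKernelConv2d(n_tw=30, n_ft=14)
embedded = skc(torch.randn(8, 30, 14))    # same shape as the raw input window
```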
Fig. 4. LSTM ED Sequence RUL Predictor.

4. Sequence-to-sequence Neural Network based RUL estimation

In this section, we present the novel sequence-to-sequence neural network architecture for RUL estimation. To this end, we first give an overview of the proposed architecture and subsequently explain the individual components in detail.

4.1. Overview of the sequence-to-sequence architecture

Fig. 4 shows the architecture of the proposed sequence-to-sequence prediction based neural network. At its core, we propose an encoder-decoder structure where a bidirectional LSTM represents the encoder. Simultaneously, the decoder is composed of a unidirectional LSTM and a fully connected layer forming the output of the network, which predicts a sequence of RUL estimation values. Additionally, we propose using an attention mechanism that augments the network with relevance weights between the information at each decoder time-step and the encoder's entire hidden state representation. We experiment with the two types of encoder-decoder attention mechanisms of Dzmitry Bahdanau et al. (2015) and Luong et al. (2015). Optionally, we further add input encoding provided by a shared kernel CNN as described in Section 3.3. In the following, we discuss the components of the architecture in detail.

4.2. Encoder-Decoder sequence-to-sequence predictor

As the core architecture of our proposed Sequence-to-Sequence RUL Predictor, we employ an LSTM-based encoder-decoder (LSTM ED) structure. The ability to map sequences of variable input-output lengths, i.e., T_i ≠ T_o, is exploited in the model shown in Fig. 4. The encoder is a BiLSTM network where the initial hidden and cell states are initialized with zeros. After processing the input information recurrently, the final hidden states →h_T and ←h_T from both directions are concatenated to form a latent representation, s_0. In each direction, h_T ∈ R^{L×d_e}, where L denotes the number of stacked LSTM layers (L = 1 in our case) and d_e denotes the encoder hidden dimension. Therefore, the combined latent representation is s_0 ∈ R^{1×2d_e}. Complex degradation pattern information from the input sequence is encoded in this latent representation, which is to be decoded and mapped into corresponding RUL points. Hence, the hidden state of the first LSTM cell in the decoder is initialized with this encoded latent representation. The input to the first decoder LSTM cell is a tensor of zeros, which serves as a start-of-sequence <SOS> token. The output from every decoder LSTM cell is passed through an FC layer with weight W ∈ R^{d_d×1}, where d_d is the decoder hidden dimension and 1 is the RUL dimension. The output is the predicted RUL at that decoder timestep, and it is passed on as the input to the next decoder LSTM cell. This recurrent operation continues until the outputs of the pre-defined number of decoder LSTM cells, T_o, have been generated. The FC weights and biases are shared across all the decoder LSTM timesteps.
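A condensed sketch of the LSTM ED sequence predictor described above, assuming a PyTorch backend and assuming d_d = 2·d_e so that s_0 directly initialises the decoder (consistent with d_e = 40, d_d = 80 in Table 4): the BiLSTM encoder's final states seed a unidirectional LSTM decoder that starts from a zero <SOS> input and feeds each predicted RUL back as the next input. The class name and sizes are illustrative.

```python
import torch
import torch.nn as nn


class LSTMEncoderDecoder(nn.Module):
    def __init__(self, n_features, d_e=40, t_out=10):
        super().__init__()
        self.d_d = 2 * d_e                       # decoder hidden size matches s_0
        self.encoder = nn.LSTM(n_features, d_e, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTMCell(input_size=1, hidden_size=self.d_d)
        self.fc_rul = nn.Linear(self.d_d, 1)     # shared FC head over all decoder steps
        self.t_out = t_out

    def forward(self, x):                        # x: (batch, T_i, n_features)
        _, (h_n, _) = self.encoder(x)            # h_n: (2, batch, d_e)
        h = torch.cat([h_n[0], h_n[1]], dim=1)   # s_0: concatenated final hidden states
        c = torch.zeros_like(h)
        rul_in = x.new_zeros(x.size(0), 1)       # zero tensor as <SOS> token
        outputs = []
        for _ in range(self.t_out):              # recurrent decoding of T_o RUL points
            h, c = self.decoder(rul_in, (h, c))
            rul_in = self.fc_rul(h)              # prediction is fed back as next input
            outputs.append(rul_in)
        return torch.cat(outputs, dim=1)         # (batch, T_o)


model = LSTMEncoderDecoder(n_features=14)
rul_seq = model(torch.randn(8, 30, 14))
```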
4.3. Shared CNN encoder-decoder sequence-to-sequence predictor

As the multivariate time series input data results in a potentially high dimensional encoder hidden dimension, we optionally augment the encoder BiLSTM with an additional input embedding provided by a CNN-2D architecture with shared kernels (SKC), as introduced in Section 3.3. The combined model is referred to in this work as the Shared-Kernel Convolutional LSTM Encoder-Decoder (SKC LSTM ED) sequence RUL predictor and is shown in Fig. 5.

Fig. 5. SKC LSTM ED Sequence RUL Predictor.

The CNN layer potentially extracts complex features from the raw input before feeding them to the BiLSTM encoder. The SKC layer kernel is of length K_L and unit width, K_W = 1. The number of feature maps is N_fm. A non-linear Leaky ReLU activation is applied after the CNN layer. Zero padding of appropriate length is applied along each feature map's temporal edges to prevent information loss at the sliding kernels. The CNN layer's output is flattened to rearrange the information from the extracted feature maps into a single column. The flattened tensor is then passed through an FC layer with weight W ∈ R^{(N_fm · N_tw · N_ft) × (N_tw · N_ft)}, where N_tw and N_ft are the number of timesteps and the number of features, respectively, in the


input feature map. The output is then reshaped to a form that resembles the shape of the input feature map. This is done to enable the bidirectional LSTM encoder to perform the recurrent operation on an equivalent number of input time-steps and features as in the raw input tensor.

4.4. Attention Mechanism in Neural Networks

The attention mechanism in deep neural networks refers to "attending to" certain parts of the input when generating the output. As discussed in the previous sections, vanilla RNN or LSTM structures use the input data's temporal dynamics and map it to sequential output data. However, there is still the question of relevance between the output generated at a particular time-step and the input sequence used to generate that output. Even though the LSTM reduces the effects of vanishing and exploding gradients for very long sequences, it does not eliminate them entirely. Moreover, NN architectures such as RNN, LSTM or CNN can fall short of processing highly complex feature representations to generate accurate outputs. The attention mechanism counters this problem by finding the relevance between the output data it needs to generate and the encoder's hidden state representations. The attention structure creates a context vector from the encoder hidden states. It then allows the decoder to use the context vector to generate a more precise and relevant output Chaudhari et al. (2019a); Dzmitry Bahdanau et al. (2015); Luong et al. (2015); Swati Meshram (2019). Hence, attention structures are simply an addition to existing encoder-decoder networks. The different types of attention methodologies and their use in this work are detailed in the following sections.

4.4.1. Bahdanau Attention

The attention model proposed by Dzmitry Bahdanau et al. (2015) calculates the decoder output ŷ_j at any time j based on the i = 1, 2, ..., T encoder timesteps as follows:

1. Alignment Score, S_ij:

S_ij = W_align · tanh([(W_h · h̄_i); (W_s · s_{j−1})])   (10)

where W_h ∈ R^{d_e×d_d} and W_s ∈ R^{d_d×d_d} are the weight matrices for all encoder hidden states, h̄_i, and the corresponding previous decoder hidden state, s_{j−1}, respectively, and tanh is the hyperbolic tangent function with tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}).

2. Attention Weights, W_ij:

W_ij = Softmax(S_ij)   (11)

where

Softmax(x_i) = e^{x_i} / Σ_{j=1}^{K} e^{x_j}   (12)

3. Context Vector, C_ij:

C_ij = W_ij^T · h̄_i   (13)

4. Attention Combined, I_{decoder,j−1}: This step performs a weighted concatenation with weight matrix W_applied ∈ R^{(d_d+d_RUL)×d_d} between the context vector, C_ij, and the decoder output from the previous timestep, ŷ_{j−1}:

I_{decoder,j−1} = W_applied · [C_ij; ŷ_{j−1}]   (14)

And finally,

ŷ_j = LSTM(I_{decoder,j−1}, s_{j−1})   (15)

where LSTM is the decoder in the model.

It can be deduced from Eqs. (10)–(15) that the Bahdanau attention model computes the relevance of all encoder hidden states, h̄_i, w.r.t. the previous decoder hidden state, s_{j−1}, to decide the input, I_{decoder,j−1}, for the current decoder timestep. The input is passed through the decoder LSTM cell at that timestep to generate an output, ŷ_j. Moreover, this method deploys a weighted additive alignment technique Dzmitry Bahdanau et al. (2015). These are the key differences between this attention technique and the Luong attention model, which is detailed in the next section.
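A minimal sketch of one Bahdanau attention step (Eqs. (10)–(14)), assuming PyTorch tensors and assuming d_d = 2·d_e (as in Table 4 for this model), so that the context vector of Eq. (13) already has the decoder dimension; the resulting attended input would then drive the decoder LSTM cell of Eq. (15).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T_i, d_e, d_rul = 30, 32, 1
d_d = 2 * d_e                               # decoder size matches the concatenated encoder states
h_bar = torch.randn(T_i, 2 * d_e)           # all encoder hidden states (both directions)
s_prev = torch.randn(1, d_d)                # previous decoder hidden state s_{j-1}
y_prev = torch.randn(1, d_rul)              # previous decoder output (predicted RUL)

W_h = nn.Linear(2 * d_e, d_d, bias=False)   # projects the encoder states
W_s = nn.Linear(d_d, d_d, bias=False)       # projects the decoder state
w_align = nn.Linear(2 * d_d, 1, bias=False)
W_applied = nn.Linear(d_d + d_rul, d_d, bias=False)

# Eq. (10): additive alignment score of every encoder step against s_{j-1}
score = w_align(torch.tanh(torch.cat([W_h(h_bar), W_s(s_prev).expand(T_i, -1)], dim=1)))
alpha = F.softmax(score, dim=0)                              # Eqs. (11)-(12)
context = (alpha * h_bar).sum(dim=0, keepdim=True)           # Eq. (13): context vector
dec_input = W_applied(torch.cat([context, y_prev], dim=1))   # Eq. (14): attended decoder input
# dec_input would now be passed through the decoder LSTM cell as in Eq. (15)
```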
4.4.2. Luong attention

The Luong attention model deploys an approach similar to Bahdanau attention, which considers all encoder hidden states to calculate the context vector and attention weights Luong et al. (2015); this is also called the global attention method. However, this model also suggests using a more localised approach, where the context and attention weights are computed by focusing on a windowed segment of the encoder hidden states. Such attention is known as local attention or window-based attention, since the mechanism deploys a window of a certain length upon the encoder states, depending on a position token, p_i, received from the current decoder time-step. The selection of the window length and its centre can be made in two ways Luong et al. (2015):


Fig. 6. LSTM ED Sequence RUL Predictor with Bahdanau Attention.

• Monotonic Alignment: A fixed window length is selected where the centre point, p_i, depends upon the current decoder position. The window moves along with the progress of each decoder time-step to the next. For an encoder with i timesteps and a decoder with j timesteps, the window length, L_window, in the local attention models in this work is calculated as

L_window = i − j + 1   (16)

• Predictive Alignment: This method predicts the alignment window by applying a Gaussian distribution centred around p_t. Predictive alignment is not applied in this work.

Regardless of global or local attention, the Luong attention model also follows the four previously mentioned steps, but in a different manner. For simplicity, the steps are shown below for global attention Luong et al. (2015).

1. Alignment Score, S_ij: The alignment score, S_ij, in the Luong attention model can be calculated by any of the following three methods:

S_ij = h̄_i · s_j   (Dot)
S_ij = W_h(h̄_i) · s_j   (General)   (17)
S_ij = tanh(W_align · [h̄_i; s_j])   (Concatenate)

where W_h ∈ R^{d_e×d_d}, h̄_i are the complete encoder hidden states and s_j is the current decoder hidden state.

2. Attention Weights, W_ij:

W_ij = Softmax(S_ij)   (18)

3. Context Vector, C_ij:

C_ij = W_ij^T · h̄_i   (19)

4. Attention Combined, ŷ_j: First, the current decoder LSTM timestep, D_j, generates a candidate output, ȳ_j, based on the previous decoder output, ŷ_{j−1}, and hidden state, s_{j−1}:

ȳ_j = LSTM(ŷ_{j−1}, s_{j−1})   (20)

Then, a weighted concatenation with weight matrix W_applied ∈ R^{(d_e+d_d)×d_RUL} is performed between the context vector, C_ij, and the candidate decoder output from the current timestep, ȳ_j, to generate the final attended output, ŷ_j:

ŷ_j = W_applied · [C_ij; ȳ_j]   (21)

It is evident from Eqs. (17)–(21) that the Luong attention model computes the relevance of all encoder hidden states, h̄_i, w.r.t. the current decoder hidden state, s_j, and the candidate output, ȳ_j, to decide the output, ŷ_j, for the current decoder timestep. This method also provides the flexibility to deploy multiplicative, weighted multiplicative and weighted additive alignment calculations Luong et al. (2015). The possibility of aligning the decoder representations with a windowed segment of the encoder hidden states reduces the computational cost and focuses more on the relevant input information. Global and local attention techniques, along with the different alignment formulas, have been used in developing the encoder-decoder architectures in this work, and their performances are compared in the experimental section.
4.5. Attention enhanced encoder-decoder predictor

As previously discussed, the attention mechanism generates relevance weights between the information at each decoder time-step and the encoder's entire hidden state representation. Hence, it allows a stronger emphasis on the beneficial input information, i.e. the degradation information, for the corresponding RUL prediction and improves the prediction performance. We test the proposed models with the two types of encoder-decoder attention mechanisms proposed in Dzmitry Bahdanau et al. (2015) and Luong et al. (2015). In both architectures, the encoder remains the stated BiLSTM network with zero initialisation of the hidden and cell states. The final hidden states are concatenated and then used to initialise the decoder hidden state. The LSTM ED architectures with attention are explained in the following.

4.5.1. RUL-Prediction using Bahdanau Attention

The proposed LSTM ED RUL predictor with Bahdanau attention is shown in Fig. 6. The mechanism of the Bahdanau attention block is implemented according to Section 4.4.1. At any decoder timestep s_{j−1}, an alignment score is created between the entire encoder hidden representation, h̄_i ∈ R^{T_i×2d_e}, and the instantaneous decoder hidden state, s_{j−1} ∈ R^{1×d_d}. This score is softmaxed and an element-wise multiplication is performed between the softmaxed score and h̄_i to generate a context vector. A weighted addition between the context vector and the instantaneous decoder output, ŷ_{j−1}, creates the input for the next decoder cell, s_j. Bahdanau attention is thus performed to create attended inputs for the LSTM operation. Hence, every decoder timestep output is its respective predicted RUL, as well as the candidate to be passed to the attention block for generating the next


Fig. 7. LSTM ED Sequence RUL Predictor with Luong Attention.

timestep input. Thus, for every recurring decoder timestep, the attention mechanism generates attended inputs for that decoder cell using the predicted RUL and hidden state from the previous step, until the end of the sequence is reached. The collective outputs from all decoder timesteps represent the predicted RUL sequence.

4.5.2. RUL-Prediction using Luong Attention

The proposed LSTM ED RUL predictor with Luong attention is shown in Fig. 7. The Luong attention mechanism is implemented according to Section 4.4.2. At any decoder timestep s_{j−1}, an alignment score is created between the entire encoder hidden representation, h̄_i ∈ R^{T_i×2d_e}, and the instantaneous decoder hidden state, s_{j−1} ∈ R^{1×d_d}. This score is softmaxed and an element-wise multiplication is performed between the softmaxed score and h̄_i to generate a context vector. The decoder cell s_{j−1} creates a candidate output, and a weighted addition between the candidate output and the context vector generates the final predicted RUL, ŷ_{j−1}. This predicted RUL, ŷ_{j−1}, also serves as the input to the next decoder timestep, s_j. Thus, for every candidate output generated by a decoder timestep, the Luong attention block creates a final predicted output and the next timestep input by attending to the encoder hidden representation with respect to the candidate output.

The Luong attention mechanism can also attend to a localised segment of the encoder hidden states instead of the global hidden states by performing local attention. In local attention mode, a positional token, P_j, is sent to the attention block from the decoder hidden state, based on which a local window is generated using monotonic alignment (Section 4.4.2). The window "rolls over" the encoder hidden states with the progression of each decoder time-step until the end of the sequence is reached. The RULs collected from the decoder's recurrent operation form the sequence RUL prediction of the Luong attention-based LSTM ED network.

4.6. Teacher Forcing

In encoder-decoder models dealing with sequence-to-sequence predictions, teacher forcing (TF) is a training technique that feeds the ground truth label from the previous time-step, instead of the model's predicted value, as the input for processing the subsequent time-step output. TF is a commonly used feature in NLP applications and is used in recurrent learning in RNNs Williams and Zipser (1989). NN models that heavily depend on previously decoded outputs can often suffer from slow convergence due to processing faulty predictions in the initial training stages. The accumulation of error over longer horizons can be avoided by feeding the true label instead of the predicted outcome. Apart from recurrent models, TF can also be used in auto-regressive models, e.g. the transformer, since it also operates on the principle of an encoder-decoder structure. However, studies in Bengio et al. (2015); Goyal et al. (2016) have shown models to suffer from "exposure bias" during inference, where the model fails to provide accurate and stable predictions due to its high dependency on the exposed ground truth labels. As the model moves farther away from its own predicted outputs, the discrepancy between training and inference increases. This can be avoided by setting a TF-free running ratio, where the model is trained for a certain fraction of the time with right-shifted ground truth labels and for the rest of the time with its predictions from the previous time step.
shifted ground truth labels and for the rest of the time with its
text vector. The decoder cell sj-1 creates a candidate output and a
predictions from the previous time step.
weighted addition between the candidate output and the context
vector generates the final predicted RUL, yˆ j−1 . This predicted RUL,
yˆ j−1 also serves as the input to the next decoder timestep sj . Thus
for every candidate output generated by a decoder timestep, Lu- 5. Sliding Window approach and sequence target RUL
ong attention block creates a final predicted output and the next generation
timestep input by attending to the encoder hidden representation
with respect to the candidate output. Training the sequence-to-sequence model in this work requires
Luong attention mechanism can also attend to a localised seg- feeding the input data in batches of sub-sequences. These sub-
ment of the encoder hidden states instead of the global hidden sequences optimise the model’s learning process since it learns to
states by performing local attention. In local attention mode, a po- map the corresponding health status from a short sequence of lo-
sitional token, Pj is sent to the attention block from the decoder calised data Ye and Keogh (2009). This sub-sequences creation is
hidden state based on which a local window is generated using achieved by using sliding windows, where a window of a specific
monotonic alignment (Section 4.4.2). The window ”rolls over” the length moves across the input signal over time, captures the infor-
encoder hidden states with the progression of each decoder time- mation at a certain instance and feeds it to the model to predict
step until the end of the sequence is reached. The collected RULs the corresponding RUL. The step size of the window is termed in
from the decoder’s recurrent operation is the sequence RUL pre- this work as stride. Additionally, the novelty of this work is the
diction of the Luong attention-based LSTM ED network. generation of RUL outputs in a sequence instead of a single data
point. This is proposed based on the hypothesis that a model can
4.6. Teacher Forcing better represent an engine’s health status in a sequence of RUL
labels. Sudden degradation or fluctuations occurring at a specific
In encoder-decoder models dealing with sequence-to-sequence input time window can be more easily mapped to a sequence of
predictions, teacher forcing (TF) is a technique used in training that RULs instead of averaging the learned information into a unit. The
feeds the ground truth label from the previous time-step instead RUL labels created by using the piece-wise linear function are un-
of the model predicted value as the input for processing the sub- folded to a specific size to achieve this. This size depends on the
sequent time-step output. TF is a commonly used feature in NLP length of the target sliding window, v. The space between two tar-
applications and is used in recurrent learning in RNNs Williams get RUL labels is filled with v-1 equidistant interval points. Hence,
and Zipser (1989). NN models that heavily depend on previously an output RUL sequence from the initial to the final cycle of length,


Table 1
C-MAPSS Dataset Sensor Parameters Saxena et al. (06/10/2008 - 09/10/2008).

Sensor  Parameter                          Unit
S1      Total temperature at fan inlet     °R
S2      Total temperature at LPC outlet    °R
S3      Total temperature at HPC outlet    °R
S4      Total temperature at LPT outlet    °R
S5      Pressure at fan inlet              psia
S6      Total pressure in bypass-duct      psia
S7      Total pressure at HPC outlet       psia
S8      Physical fan speed                 rpm
S9      Physical core speed                rpm
S10     Engine pressure ratio              –
S11     Static pressure at HPC outlet      psia
S12     Ratio of fuel flow to S11          pps/psi
S13     Corrected fan speed                rpm
S14     Corrected core speed               rpm
S15     Bypass ratio                       –
S16     Burner fuel-air ratio              –
S17     Bleed enthalpy                     –
S18     Demanded fan speed                 rpm
S19     Demanded corrected fan speed       rpm
S20     HPT coolant bleed                  lbm/s
S21     LPT coolant bleed                  lbm/s

Fig. 8. Unfolded Target Labels for Sequence RUL generation with v = 5.
Fig. 9. Sliding Window operation and corresponding Sequence target RULs.

t_o is unfolded to a sequence of length T_o = t_o × v. An example is illustrated in Fig. 8 for v = 5.

An illustration of the sliding window operation and the corresponding sequence target RULs is shown in Fig. 9. Consider sensor input data, I ∈ R^{T_i}, and corresponding unfolded RUL labels, O ∈ R^{T_o}, of lengths T_i and T_o respectively. u and v are the sliding window lengths of the input and target outputs. If T_i = T_o, then u = v. Let the maximum number of times the respective windows slide over the input and output data be N. The input sliding window always follows a constant stride of 1, whereas the output window stride is equal to the length of the output window, i.e., v. This is done to ensure that the input sequences fed to the model contain a continuous and uninterrupted flow of temporal information. Hence, the edge of the input window slides over one cycle at a time. To map this accurately to the output RUL sequence, the final point in an output window must contain the integer RUL value corresponding to the windowed input data, and its preceding values in that RUL window are the equidistant interval points.
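A NumPy sketch of the window generation described above, under the stated conventions (input stride 1, and v−1 equidistant points of spacing 1/v preceding the integer RUL of the window's final cycle, equivalent to sliding the output window with stride v over the unfolded labels); the array names and sizes are illustrative.

```python
import numpy as np


def make_sequences(signal, rul, u=30, v=10):
    """Create input windows of length u (stride 1) and unfolded target RUL
    windows of length v whose last entry is the integer RUL at the window's
    final cycle and whose preceding entries are v-1 equidistant points."""
    inputs, targets = [], []
    for end in range(u, len(signal) + 1):
        inputs.append(signal[end - u:end])                    # input window, stride 1
        rul_end = rul[end - 1]                                # integer RUL at final cycle
        # v equidistant points ending exactly at rul_end (spacing 1/v)
        targets.append(rul_end + np.arange(v - 1, -1, -1) / v)
    return np.stack(inputs), np.stack(targets)


# toy engine with 200 cycles and 14 sensors, piece-wise linear RUL labels
signal = np.random.randn(200, 14)
rul = np.minimum(np.arange(200)[::-1], 120).astype(float)
X, Y = make_sequences(signal, rul, u=30, v=10)
print(X.shape, Y.shape)        # (171, 30, 14) (171, 10)
```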
6. Dataset and Evaluation Metrics

In the following, we discuss the results obtained for the RUL estimation framework. We first present the considered data set, followed by the results and a comparison with the existing state of the art.

6.1. C-MAPSS Dataset

Commercial Modular Aero-Propulsion System Simulation, or C-MAPSS, is a model-based simulation program prepared by NASA. This program provides the simulation of a large commercial turbofan engine used in aircraft propulsion Frederick et al. (2007). The C-MAPSS program was used to generate the Turbofan Engine Degradation Simulation Data Set and the PHM 2008 Challenge dataset Saxena et al. (06/10/2008 - 09/10/2008). The Turbofan Engine Degradation dataset (referred to as the C-MAPSS dataset henceforth) is used to evaluate the model performances in this work.

The C-MAPSS program was simulated in four different working environments to generate the C-MAPSS dataset with four sub-datasets, namely FD001, FD002, FD003 and FD004. Each sub-dataset consists of a train and a test dataset, with corresponding RUL data containing the last run-time for each test engine. In each of the four sub-datasets, the engines start operating in a healthy state and run until the system fails due to gradual degradation. In the test samples, the sensor measurements are trimmed a certain period (as provided in the RUL database) before system failure. That corresponding period is taken for evaluation and scoring. Initial wear is considered in all the train and test engines Chatterjee and Litt (2003); Saxena et al. (06/10/2008 - 09/10/2008).

Each of the train and test C-MAPSS sub-datasets contains outputs from 26 sensors (Table 1). These measurements are sensor responses to 14 different health parameter inputs that simulate degradation scenarios in an engine's rotating components Saxena et al. (06/10/2008 - 09/10/2008). For simplicity, the sensor data are referred to as they appear in the sensor columns of the raw dataset, i.e., the column "Sensor" in Table 1. The engines in the C-MAPSS dataset are operated under different operational settings and fault modes. Sub-datasets FD001 and FD003 work with one operational setting, resulting in one unique operating condition, whereas FD002 and FD004 operate with three different operational settings. The three operational settings result in six unique operating conditions. This variation in operating conditions creates fluctuations in the sensor readings, since data points in consecutive cycles may not belong to the same operating condition. The operational settings, the number of resulting sub-conditions and the fault modes for each sub-dataset are shown in Table 2.

It is evident from the raw sensor data that not all signals show a clear degradation trend that may optimise the performance of an NN model. Sensors S1, S5, S6, S10, S16, S18 and S19 show either a non-continuous or monotonous pattern in their readings. These monotonous readings are excluded from all train and test sub-datasets. Additionally, the inclusion of the history of operating regimes as features along with the selected sensor readings to optimise model performance is suggested in Sateesh Babu et al. (2016). This approach is implemented for the sub-dataset FD002.


Table 2
Operational Information of the C-MAPSS Dataset.

Sub-Dataset  Engines (Train / Test)  Operational Settings                     Operating Conditions  Fault Modes
FD001        100 / 100               Sea Level                                1                     HPC Degradation
FD002        260 / 259               Altitude, Throttle Resolver Angle, Mach  6                     HPC Degradation
FD003        100 / 100               Sea Level                                1                     HPC Degradation, Fan Degradation
FD004        249 / 248               Altitude, Throttle Resolver Angle, Mach  6                     HPC Degradation, Fan Degradation

Fig. 10. Piece-wise Linear Function for RUL Labels.

Further, we use the Min-Max Scaler approach to scale the data within a range of [-1, 1]. However, due to the six different operating conditions in FD002 and FD004, the magnitude and distribution of the data points within each condition vary and hence they need to be scaled separately Bektas et al. (2019). The individual sensor data from each of the six conditions are segregated, scaled and then recombined according to their original pre-scaling positions. The same scaling technique is applied separately to FD002 and FD004. Since FD001 and FD003 operate under one operating condition, the multi-regime scaling is not required for these datasets.
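A sketch of the condition-wise Min-Max scaling to [-1, 1] described above, assuming the six operating conditions have already been identified (e.g. by clustering the three operational settings); the variable names are illustrative.

```python
import numpy as np


def minmax_per_condition(sensors, condition_ids, feature_range=(-1.0, 1.0)):
    """Scale each sensor column to feature_range separately for every operating
    condition, then place the rows back in their original order."""
    lo, hi = feature_range
    scaled = np.empty_like(sensors, dtype=float)
    for cond in np.unique(condition_ids):
        rows = condition_ids == cond
        x = sensors[rows]
        x_min, x_max = x.min(axis=0), x.max(axis=0)
        span = np.where(x_max - x_min == 0, 1.0, x_max - x_min)  # guard constant sensors
        scaled[rows] = lo + (x - x_min) * (hi - lo) / span
    return scaled


# toy FD002-like data: 1000 cycles, 14 selected sensors, 6 operating conditions
sensors = np.random.randn(1000, 14)
condition_ids = np.random.randint(0, 6, size=1000)
sensors_scaled = minmax_per_condition(sensors, condition_ids)
```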

6.2. Piece-wise RUL target function

It is not desirable to hypothesize a linearly degrading life cycle from the beginning of the operation, as it implies that a machine deteriorates linearly with usage. A suitable alternative to a linearly degrading RUL is suggested in Heimes (2008), where the life-cycle is divided into two distinct phases: the healthy, constant phase and the linearly degrading phase. This approach is shown in Fig. 10, where it sets a limit on the maximum possible RUL. The machine operates with a healthy status for a certain number of cycles and then starts degrading linearly to the end RUL point. This approach is called the piece-wise linear RUL function estimator. Upon experimentation in Heimes (2008), the suitable maximum RUL limit was found to be between 120 and 130 cycles. In this work, this is termed the start-of-life or initial RUL, RUL_i, and is taken as 120 cycles. The point at which the machine transitions from the healthy phase to the degrading phase is termed the knee-point, Point_knee, and it varies for different engines since the number of cycles, cycles_total, is different for each engine. In the training dataset, the end-of-life or final RUL, RUL_f, is 0 for all engines. In the test dataset, since the sensor data feed for the engines is stopped at certain points before functional failure, RUL_f and Point_knee are not the same throughout all the engines. RUL_f for each engine in a sub-dataset is provided in a separate database. The number of healthy cycles, cycles_healthy, of an engine e of sub-dataset d is determined as

cycles_healthy^{(e,d)} = cycles_total^{(e,d)} + RUL_f^{(e,d)} − RUL_i   (22)

cycles_degrading^{(e,d)} = cycles_total^{(e,d)} − cycles_healthy^{(e,d)}.   (23)
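A sketch of the piece-wise linear target construction and the healthy/degrading split of Eqs. (22)–(23), with RUL_i = 120; `total_cycles` and `rul_f` are illustrative inputs.

```python
import numpy as np

RUL_I = 120  # start-of-life (initial) RUL limit


def piecewise_rul_labels(total_cycles, rul_f=0):
    """Per-cycle RUL labels for one engine: constant at RUL_I during the healthy
    phase, then linearly decreasing to RUL_f at the last recorded cycle."""
    true_rul = np.arange(total_cycles, 0, -1) - 1 + rul_f   # RUL at each recorded cycle
    return np.minimum(true_rul, RUL_I).astype(float)


def healthy_degrading_split(total_cycles, rul_f=0):
    """Eqs. (22)-(23): number of healthy and degrading cycles of an engine."""
    cycles_healthy = max(total_cycles + rul_f - RUL_I, 0)
    cycles_degrading = total_cycles - cycles_healthy
    return cycles_healthy, cycles_degrading


labels = piecewise_rul_labels(total_cycles=250)       # training engine, RUL_f = 0
print(labels[:3], labels[-3:])                        # [120 120 120] ... [2 1 0]
print(healthy_degrading_split(total_cycles=250))      # (130, 120)
```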
6.3. Root Mean-Squared Error (RMSE)

Loss functions represent how a predicted output from an NN model compares against the actual desired output. These functions are differentiable and are used to calculate the gradients in order to update the weights while training. The most common loss functions used in regression are the Mean-Squared Error (MSE), the Root Mean-Squared Error (RMSE) and the Mean Absolute Error (MAE) Botchkarev (2019).

The RMSE loss function is used for the sequence RUL prediction task with the C-MAPSS dataset in this work and is defined as

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)^2 )   (24)

where N is the number of predicted samples, and ŷ_i and y_i are the predicted and target outputs respectively.

The predicted sequence RUL for a specific engine consists of values in the decimal range, beginning after the preceding RUL sequence and ending with the final RUL for that cycle, e.g. [194.8, 194.6, 194.4, 194.2, 194]. This is explained in detail in Section 5. Additionally, RMSE is an important metric for comparison against other model performances proposed in the literature for the C-MAPSS dataset.

6.4. RUL Evaluation Metrics

The scoring function Saxena et al. (06/10/2008 - 09/10/2008) for the regression task of the C-MAPSS dataset is given by

s_i = e^{−d_i/a_1} − 1,  d_i < 0
s_i = e^{d_i/a_2} − 1,   d_i ≥ 0   (25)

S = Σ_{i=1}^{N} s_i   (26)

where a_1 = 13, a_2 = 10, d_i = Predicted RUL − Target RUL for the i-th testing engine, s_i is the score for the i-th engine, N is the total number of engines in a test sub-dataset and S is the score for that sub-dataset Li et al. (2018). The scoring function for the C-MAPSS dataset can be termed an inverse scoring function, i.e. a lower cumulative score signifies better model performance and a higher score signifies the opposite. Hence, late predictions are penalized more than early predictions for an error of the same absolute magnitude. Additionally, the scoring function depends solely on the RUL generated for the final input window, and not on the RULs for the earlier stages of operation.
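Both evaluation metrics can be sketched in a few lines of NumPy, with d = predicted − target and a_1 = 13, a_2 = 10 as in Eqs. (24)–(26); the toy values are illustrative only.

```python
import numpy as np


def rmse(pred, target):
    """Eq. (24): root mean-squared error over all predicted samples."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return np.sqrt(np.mean((pred - target) ** 2))


def phm_score(pred, target, a1=13.0, a2=10.0):
    """Eqs. (25)-(26): asymmetric score that penalises late predictions more."""
    d = np.asarray(pred, float) - np.asarray(target, float)
    s = np.where(d < 0, np.exp(-d / a1) - 1.0, np.exp(d / a2) - 1.0)
    return s.sum()


pred, target = [112.0, 98.0, 20.0], [110.0, 105.0, 25.0]
print(rmse(pred, target), phm_score(pred, target))
```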


Table 3
Hyperparameters for ED RUL Predictors.

Hyperparameter                 Range (Attention Models)   Range (Non-Attention Models)
Predicted RUL Length (TRUL)    {10, 15, 20}               {10, 15, 20}
Batch Size (BS)                {1024}                     {256, 512}
SKC Kernel Length (KL)         {3, 5}                     {3, 5}
Encoder Hidden Size (de)       {32, 48, 64}               {40, 64}
Decoder Hidden Size (dd)       {64, 96, 128}              {80, 128}
Learning Rate (LR)             {1e-1, 5e-2, 1e-2}         {5e-2, 1e-2}
Teacher Forcing (TF%)          {0, 30, 40}                {0, 30, 40}
Table 4
Best Hyperparameters for ED Models.

Hyperparameter  LSTM Bahd. AT ED   SKC LSTM Bahd. AT ED   LSTM ED
TRUL            10                 10                     20
BS              1024               1024                   512
KL              -                  3                      -
de              32                 64                     40
dd              64                 128                    80
LR              1e-2               1e-2                   1e-2
TF%             30                 40                     0

7. Results and Comparison

In this section, we present the results of the proposed architectures on the C-MAPSS dataset. We first report results for the different attention and encoder variations. Second, we present ablation studies on the impact of different hyperparameter settings. Third, we compare the results with existing approaches in the literature. Unless stated otherwise, the Adam optimizer Kingma and Ba (2015) is used for the gradient descent algorithm, and for any initial learning rate (LR), a milestone of 125 epochs is set, after which the initial LR is reduced to 10% of its value. The attention augmented models are trained for 200 epochs while the non-attention models are trained for 150, since they started to overfit the training set. All the weights and biases are initialized using the Xavier uniform distribution Glorot and Bengio (2010). The hyperparameters and their respective ranges for training the different ED Seq2Seq RUL predictor models are shown in Table 3. The table includes the details for attention as well as non-attention models. The ranges for the hyperparameters were chosen based on a random search Bergstra and Bengio (2012) methodology. The input sequence length, T_i, is chosen to be 30, 20, 38 and 18 for the sub-datasets FD001, FD002, FD003 and FD004, respectively.

All the models are tested with the different hyperparameters, and the best performing hyperparameters for each of the models are shown in Table 4. In the next subsections, we will underline the impact of some of these hyperparameters.
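The optimiser and learning-rate schedule described above can be reproduced with a sketch along the following lines; `model`, the batch and the loss computation are placeholders, not the actual training pipeline.

```python
import torch
import torch.nn as nn

model = nn.Linear(14, 1)                      # placeholder for the Seq2Seq RUL predictor
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
# single milestone at epoch 125: the LR drops to 10% of its initial value
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[125], gamma=0.1)
criterion = nn.MSELoss()                      # RMSE is the square root of the MSE

for epoch in range(200):                      # 200 epochs for the attention models
    x, y = torch.randn(32, 14), torch.randn(32, 1)   # placeholder batch
    optimizer.zero_grad()
    loss = torch.sqrt(criterion(model(x), y))        # RMSE loss
    loss.backward()
    optimizer.step()
    scheduler.step()                          # advance the LR schedule once per epoch
```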

7.1. Effect of the type of attention with encoder-decoder architectures

The performance of the LSTM encoder-decoder model with different attention types (LSTM AT ED) is evaluated in this section. The LSTM AT ED model is separately evaluated with Bahdanau attention and the three variants of Luong attention (Dot product, General and Concatenation) for both Global (Gl.) and Local (Loc.) attention operations. The hyperparameters for each of the models are shown in Table 4. As illustrated in Fig. 11, the model with Bahdanau attention performs excellently and much better than the other attention types. A very significant improvement is observed on the sub-datasets FD001, FD003 and FD004 compared to the scores from the previous models, while a very good score is maintained on FD002. Among the Luong attention variants, the best performance is observed for Luong Dot product Global attention and Luong Concat Global attention. The model performance deteriorates when local attention is used. Therefore, it can be concluded that, for this RUL estimation task, the LSTM Bahdanau AT ED performs best. Hence, this model along with its variants will be used for further analysis in this work.

Fig. 11. LSTM Encoder Decoder network with different types of Attention.

7.2. Effect of Attention and No Attention with encoder-decoder architectures

The LSTM ED model without attention is evaluated in this section. The different hyperparameters for this model are shown in Table 3. As illustrated in Fig. 12, the model performance with attention improves significantly on three of the four sub-datasets when compared to the LSTM ED model without attention. Therefore, the attention augmented models improve the predictive performance of the Seq2Seq models.

7.3. Comparison among different encoder structures in the encoder-decoder architectures

The SKC LSTM ED model adds a feature extracting SKC layer on top of the LSTM Bahdanau AT ED model as discussed in Section 3.3. The best performing hyperparameters for achieving the optimum performance from this model are shown in Table 4.

The inclusion of a CNN layer requires new additional hyperparameters to be optimized. A fixed output feature map size, N_fm = 8, is selected, while kernel lengths of 3 and 5 are inspected. All other training parameters remain the same as for the LSTM ED model. The scores of the SKC LSTM Bahdanau AT model deteriorate compared to the LSTM Bahdanau AT ED model, since it fails to strike a balance in any case throughout the four sub-datasets. Therefore, it can be


Fig. 12. Analysis of Bahdanau Attention in LSTM Encoder Decoder Network.

Fig. 14. Effect of Hidden Sizes (d_e, d_d) in LSTM Bahdanau Attention Encoder Decoder Network.

Fig. 13. Analysis of SKC Layer in LSTM Bahdanau Attention Encoder Decoder Network.

Therefore, it can be concluded that the added parameters from the convolutional layer harm the performance in this case.
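A minimal sketch of such a convolutional front end is given below, with N_fm = 8 feature maps and a kernel length of 3 as discussed above. Sharing the same kernels across all sensor channels is one plausible reading of the SKC layer of Section 3.3; the class name, the 14 sensor channels and the encoder dimensions are illustrative assumptions rather than the exact implementation.

```python
# Sketch of a shared-kernel 1D convolution placed in front of the LSTM encoder.
import torch
import torch.nn as nn

class SharedKernelConv(nn.Module):
    def __init__(self, n_feature_maps=8, kernel_length=3):
        super().__init__()
        # in_channels=1: the identical kernels are reused for every sensor channel.
        self.conv = nn.Conv1d(1, n_feature_maps, kernel_length, padding=kernel_length // 2)

    def forward(self, x):                                # x: (batch, T_i, n_sensors)
        b, t, s = x.shape
        z = x.permute(0, 2, 1).reshape(b * s, 1, t)      # process one channel at a time
        z = torch.relu(self.conv(z))                     # (b*s, N_fm, T_i)
        z = z.reshape(b, s * self.conv.out_channels, t)  # stack the feature maps per channel
        return z.permute(0, 2, 1)                        # (batch, T_i, s * N_fm)

skc = SharedKernelConv(n_feature_maps=8, kernel_length=3)
encoder = nn.LSTM(input_size=14 * 8, hidden_size=64, batch_first=True)  # 14 sensors assumed
features = skc(torch.randn(4, 30, 14))                   # T_i = 30 as for FD001
encoder_out, _ = encoder(features)
```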

7.4. Effect of changing hidden sizes in the encoder-decoder architectures

The hidden dimension sizes d_e and d_d are of significant importance in the attention models. This is seen in Fig. 14, where the LSTM Bahdanau AT ED model displays a higher sensitivity to d_e and d_d. Keeping all other hyperparameters constant, increasing d_e and d_d deteriorates the model performance across all four sub-datasets. Again, the added overhead in the number of parameters of the LSTM layer leads to a degradation of the generalization capability of the model.
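The sensitivity to d_e and d_d is easy to relate to the parameter count: a single LSTM layer holds 4·(d·(d + n_in) + 2·d) weights and biases, so the number of parameters grows roughly quadratically with the hidden dimension. The short snippet below illustrates this; the input size of 14 sensor channels is an assumption.

```python
# Parameter count of a single LSTM layer as a function of the hidden size d.
import torch.nn as nn

for d in (32, 64, 128, 256):
    lstm = nn.LSTM(input_size=14, hidden_size=d, batch_first=True)
    n_params = sum(p.numel() for p in lstm.parameters())
    print(d, n_params)   # 32 -> 6144, ..., 256 -> 278528
```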
Fig. 15. Effect of predicted sequence RUL lengths (T_RUL) in LSTM Bahdanau Attention Encoder Decoder Network.

7.5. Effect of output sequence length on encoder-decoder architectures

The length of the output sequence, T_RUL, is a hyperparameter of significant importance in the attention models, as it represents the output sampling frequency. In the case of the C-MAPSS dataset, an output sequence length of 10 works the best across all operating conditions. A comparison among the different lengths is illustrated in Fig. 15, where increasing the sequence length reduces the performance across all the operating conditions.
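The role of T_RUL can be made explicit through the sliding-window sample generation: each input window of length T_i is paired with the RUL values of its last T_RUL cycles. The sketch below is illustrative only; the array names, the 14 sensor channels and the piece-wise linear RUL cap of 125 are assumptions.

```python
# Sliding-window generation of (input window, RUL sequence) training pairs.
import numpy as np

def make_windows(sensors, rul, t_i=30, t_rul=10):
    """sensors: (n_cycles, n_features) of one engine, rul: (n_cycles,) target values."""
    inputs, targets = [], []
    for end in range(t_i, len(sensors) + 1):
        inputs.append(sensors[end - t_i:end])          # input window of length T_i
        targets.append(rul[end - t_rul:end])           # RUL sequence of length T_RUL
    return np.stack(inputs), np.stack(targets)

sensors = np.random.rand(192, 14)                      # one FD001 engine, 14 sensors assumed
rul = np.minimum(np.arange(len(sensors))[::-1], 125)   # assumed piece-wise linear RUL cap
X, Y = make_windows(sensors, rul, t_i=30, t_rul=10)
print(X.shape, Y.shape)                                # (163, 30, 14) (163, 10)
```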


Table 5
Comparison of Model Performance (Scores) with Related Literature.

Model Description                                      Score: FD001 | FD002 | FD003 | FD004

Deep LSTM Zheng et al. (2017) 338 4450 852 5550


Stacked BiLSTM Wang et al. (2018) 295 4130 317 5430
Handcrafted Features + LSTM Chen et al. (2021) 322 - - 5649
CEEMD + DLSTM Zhao et al. (2019b) 262 6953 452 15069
RBM + LSTM Listou Ellefsen et al. (2019) 231 3366 251 2840
CNN Sateesh Babu et al. (2016) 1287 13,570 1596 7886
CNN + Attention Tan and Teo (2021) 198 1144 251 2072
HICNN-LSTM-NN Kong et al. (2019) 303 3440 1420 4630
FCLCNN Peng et al. (2021) 204 - 234 -
BLCNN Liu et al. (2019) 302 1558 381 3859
DAG Li et al. (2019a) 229 2730 535 3370
DCNN Li et al. (2018) 274 10,412 284 12466
AGCNN (Self-Att.) Liu et al. (2021) 225 1492 227 3392
LSTM + Luong Att. Paulo Roberto De Oliveira Da Costa et al. (2020) 320 2102 223 3100
BiLSTM 473 1223 676 2684
LSTM ED 254 898 243 2161
SKC LSTM ED 296 1182 185 2096
LSTM Bahd. AT ED 215 999 187 2081
LSTM Luong Concat (Gl.) AT ED 228 1063 214 2234
SKC LSTM Bahd. AT ED 318 1682 216 2111
SKC LSTM Luong(Gl.) AT ED 280 1224 200 2228
SKC LSTM Luong(Loc.) AT ED 294 1751 244 2462

Fig. 16. Effect of Teacher Forcing (TF%) in LSTM Bahdanau Attention Encoder Decoder Network.

7.7. Qualitative Proposed Model Performance Analysis

The advantage of predicting sequence RULs with encoder-decoder models can be observed by comparing the scores of the LSTM ED architecture with those of the vanilla BiLSTM model in Table 5. The significant improvement in performance can be attributed to the generation of the latent space representation by the encoder, which is decoded to predict a sequence of RULs. The inclusion of a Bahdanau attention layer in the LSTM ED helps the model to find the most relevant input for the next decoder step.
Conversely, the Luong attention layer focuses its operation on the
decoder output to generate a final output using attention opera-
tion. According to Table 5, the best set of balanced scores between
the proposed models can be found in the LSTM Bahdanau AT ED
model whereas, the LSTM Luong Concat (Gl.) AT ED model dete-
riorates in performance. The decoder LSTM cells perform a much
better task at generating final RULs when fed with relevant infor-
mation from Bahdanau attention, instead of generating intermedi-
ate output from raw encoder latent representations and allowing
the Luong attention layer to produce final RULs.
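For orientation, the compact sketch below shows how these pieces fit together: a bidirectional LSTM encoder, a Bahdanau attention layer that builds a context vector for every decoder step, and an LSTM decoder cell that emits a sequence of T_RUL values. The dimensions, the zero initial decoder state and the concrete layer names are illustrative assumptions rather than the exact configuration of the evaluated models.

```python
# Illustrative Seq2Seq RUL predictor: BiLSTM encoder + Bahdanau attention + LSTM decoder.
import torch
import torch.nn as nn

class Seq2SeqRUL(nn.Module):
    def __init__(self, n_features=14, d_e=64, d_d=64, t_rul=10):
        super().__init__()
        self.t_rul = t_rul
        self.encoder = nn.LSTM(n_features, d_e, batch_first=True, bidirectional=True)
        self.attn_enc = nn.Linear(2 * d_e, d_d, bias=False)   # Bahdanau: W2 * h
        self.attn_dec = nn.Linear(d_d, d_d, bias=False)       # Bahdanau: W1 * s
        self.attn_v = nn.Linear(d_d, 1, bias=False)
        self.decoder = nn.LSTMCell(2 * d_e + 1, d_d)
        self.out = nn.Linear(d_d, 1)

    def forward(self, x):                                      # x: (batch, T_i, n_features)
        H, _ = self.encoder(x)                                 # (batch, T_i, 2*d_e)
        s = torch.zeros(x.size(0), self.decoder.hidden_size)
        c = torch.zeros_like(s)
        y = torch.zeros(x.size(0), 1)
        outputs = []
        for _ in range(self.t_rul):
            scores = self.attn_v(torch.tanh(self.attn_enc(H) + self.attn_dec(s).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)               # weights over the T_i inputs
            context = (alpha * H).sum(dim=1)                   # (batch, 2*d_e)
            s, c = self.decoder(torch.cat([context, y], dim=-1), (s, c))
            y = self.out(s)
            outputs.append(y)
        return torch.cat(outputs, dim=1)                       # (batch, T_RUL)

model = Seq2SeqRUL()
print(model(torch.randn(8, 30, 14)).shape)                     # torch.Size([8, 10])
```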
The inclusion of a feature extracting SKC layer in the LSTM ED model, at the cost of an increased number of trainable parameters, fails to improve the model performance any further. As ob-
served in Table 5, this persists even when SKC LSTM ED models
are enhanced with attention layers. In every case, the LSTM en-
coder tends to perform much better when fed with raw sequential
input data rather than a feature extracted input from a preceding
SKC layer. An example of the RUL predictions by the proposed models, compared to the actual RUL of engine ID 100 of sub-dataset FD001, is plotted in Fig. 17. The BiLSTM model follows the overall degra-
dation pattern of the engine, however, the predicted RULs deviate
significantly from the true RULs throughout the cycles (Fig. 17a).
The LSTM ED model improves this deviation with certain outliers
in the healthy phase of the prediction horizon (Fig. 17b). The pre-
diction capability of the model improves when incorporated with
a Bahdanau attention layer (Fig. 17c). The LSTM Bahdanau AT ED
model also minimizes the occurrence of large over- or under-predictions, which is not observed in the performance of the other architectures. Further models, such as the LSTM Luong Cat Gl. AT ED (Fig. 17d), the SKC LSTM ED (Fig. 17e) and the SKC LSTM AT ED models with Bahdanau and Luong attention layers (Fig. 17f-h), also maintain a good prediction capability, however with increased fluctuations throughout the prediction horizon.

The testing loss comparison between the LSTM ED model, the LSTM Bahdanau AT ED model and the LSTM Luong Concat Gl. AT ED model is shown in Fig. 18. Although all three models eventually converge to the same minimum loss, the main difference lies in the initial epochs.

Fig. 17. Remaining Useful Lifetime (RUL) Estimation Plots by Proposed Models for sub-dataset FD001 Engine ID 100.


Fig. 18. Testing Loss comparison between LSTM ED, LSTM Bahdanau AT ED and LSTM Luong Concat Gl. AT ED architectures for all C-MAPSS sub-datasets.

Fig. 19. Testing Loss comparison between LSTM Bahdanau AT ED and SKC LSTM Bahdanau AT ED architectures for all C-MAPSS sub-datasets.

Luong attention takes place at the end segment of every decoder cell, allowing each candidate RUL to directly attend to the encoder hidden representations. This evidently benefits the model, as it begins at a much lower loss magnitude than the LSTM ED or the LSTM Bahdanau AT ED. The LSTM AT ED models, especially with Bahdanau attention, appear to converge to the minimum loss magnitude in later epochs than the vanilla LSTM ED model. The testing loss comparison between the LSTM Bahdanau AT ED and the SKC LSTM Bahdanau AT ED is shown in Fig. 19. While both models require a fair number of epochs to converge to a minimum loss, the SKC LSTM Bahdanau AT ED accomplishes this faster than the LSTM Bahdanau AT ED. This reflects the contribution of the feature extracting SKC layer, which helps the model to reach early convergence and to predict RULs that are closer to the actual values from an earlier stage. These loss plots reflect the overall prediction capability of the models throughout the life cycle of the engines. They do not, however, represent the score-based performance of the model architectures, since the scoring function depends on the final RUL value of each engine instead of the RUL at every instance.
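For completeness, the two metrics referred to here can be written down compactly: the asymmetric scoring function commonly used for the C-MAPSS benchmark (Saxena et al., 2008), which penalises late predictions more heavily than early ones and is evaluated on the final RUL of each test engine, and the RMSE over the predictions. The snippet below is a straightforward sketch of these definitions; the example values are illustrative only.

```python
# C-MAPSS scoring function (constants 13 and 10 follow the standard benchmark) and RMSE.
import numpy as np

def cmapss_score(rul_true, rul_pred):
    """Asymmetric exponential score, summed over the final RUL of each test engine."""
    d = np.asarray(rul_pred, dtype=float) - np.asarray(rul_true, dtype=float)
    return float(np.sum(np.where(d < 0, np.exp(-d / 13.0) - 1.0, np.exp(d / 10.0) - 1.0)))

def rmse(rul_true, rul_pred):
    d = np.asarray(rul_pred, dtype=float) - np.asarray(rul_true, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

print(cmapss_score([112, 98, 69], [110, 105, 60]), rmse([112, 98, 69], [110, 105, 60]))
```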
7.8. Comparison with existing literature

The sequence RUL predictor models presented in this work are evaluated against the model performances reported in the literature, based on their generated scores (Table 5) for the C-MAPSS turbofan engine dataset. As evident in Table 5, the proposed LSTM ED and LSTM Bahdanau AT ED models outperform the best literature score Tan and Teo (2021) for FD002. The proposed SKC LSTM ED and LSTM Bahdanau AT ED models outperform the best literature score Paulo Roberto De Oliveira Da Costa et al. (2020) for FD003.


Table 6
Comparison of Model Performance (RMSE) with Related Literature.

Model Description                                      RMSE: FD001 | FD002 | FD003 | FD004

Deep LSTM Zheng et al. (2017) 16.14 24.49 16.18 28.17


Stacked BiLSTM Wang et al. (2018) 13.65 23.18 13.74 24.86
Handcrafted Features + LSTM Chen et al. (2021) 14.53 - - 27.08
CEEMD + DLSTM Zhao et al. (2019b) 14.72 29 17.72 33.43
RBM + LSTM Listou Ellefsen et al. (2019) 12.56 22.73 12.10 22.66
CNN Sateesh Babu et al. (2016) 18.45 30.29 19.82 29.16
CNN + Attention Tan and Teo (2021) 11.48 17.25 12.31 20.58
HI CNN-LSTM-NN Kong et al. (2019) 16.13 20.46 17.12 23.26
FCLCNN Peng et al. (2021) 11.17 - 9.99 -
BLCNN Liu et al. (2019) 13.18 19.09 13.75 20.97
DAG Li et al. (2019a) 11.96 20.34 12.46 22.43
DCNN Li et al. (2018) 12.64 22.36 12.64 23.31
AGCNN (Self-Attention) Liu et al. (2021) 12.42 19.43 13.39 21.50
LSTM + Luong Attention Paulo Roberto De Oliveira Da Costa et al. (2020) 13.95 17.65 12.72 20.21
Transformer Encoder + Gated CNN Mo et al. (2021) 11.27 22.81 11.42 24.86
BiLSTM 15.87 16.59 15.10 14.36
LSTM ED 11.67 15.78 10.86 12.98
SKC LSTM ED 11.71 16.76 10.05 13.62
LSTM Bahdanau AT ED 11.69 15.47 11.42 13.24
LSTM Luong Concat (Global) AT ED 13.23 17.60 10.98 13.92
SKC LSTM Bahdanau AT ED 12.30 18.51 9.86 13.78
SKC LSTM Luong Concat (Global) AT ED 12.02 15.60 9.8 14.11
SKC LSTM Luong Concat (Local) AT ED 12.04 17.33 9.5 13.8

The performance of the LSTM Bahdanau AT ED model for FD001 and FD004 is very closely matched to the best literature scores in Tan and Teo (2021). A thorough observation of Table 5 confirms that the proposed LSTM Bahdanau AT ED model outmatches every literature model in providing a highly balanced performance throughout all the C-MAPSS sub-datasets. Among the literature, the attention-based temporal CNN architecture Tan and Teo (2021) provides a good overall performance. Additionally, an overall comparison between attention and non-attention architectures in Table 5 shows the significant improvement yielded by the attention-based models. The Transformer encoder model with gated CNN Mo et al. (2021) does not deploy the scoring function as a performance metric and hence, scores from this work are not available for comparison.

A distinctive analysis can be made by comparing the models based on their performance throughout the engines' life cycles. This is evaluated by the RMSE losses of the models during the testing phase. The RMSE values for all the implemented and literature models are shown in Table 6. Even in this case, the RMSE of the proposed LSTM Bahdanau AT ED model is either very close to, or much lower than, the lowest RMSE from the literature on all sub-datasets. In addition, the attention-based SKC LSTM ED models also produce excellent losses, reflecting the ability of the attention mechanism to attend to specific parts of the input for predicting nearly accurate RULs throughout the prediction horizon.

The average score over all the operating conditions of the C-MAPSS dataset, in relation to the total number of trainable parameters, is shown in Fig. 20 for the best performing models from the literature and our best performing model. The best performing LSTM Bahdanau AT ED based models require a similar number of trainable parameters as the ones in the literature but perform drastically better, with an average score of 847. The obtained results are superior to the current state of the art.

Fig. 20. Average score of our best architecture LSTM Bahdanau Attention Encoder Decoder Network on the C-MAPSS dataset compared to existing work in literature.

8. Conclusions

In this paper we propose a novel approach for remaining useful lifetime estimation based on sequence-to-sequence LSTM encoder-decoder structures augmented with Bahdanau and Luong attention. The different Luong attention variants are examined in both global and local format. The contribution of a shared kernel CNN (SKC) layer for feature extraction in the LSTM-based attention and non-attention encoder-decoders is also evaluated. Not all attention variants perform equally well, with Bahdanau attention outperforming the rest in terms of balanced sub-dataset scores.

A novel network architecture is proposed which allows for reading sequences of the multidimensional time series using a sliding window for the input as well as the output sequences. This significantly increases the sample efficiency of the training procedure and hence allows for the estimation of RUL also in sparse data sets with strongly limited test cases of machine degradation. Therefore, the primary hypothesis of enhanced performance with a sequence of RULs is achieved in this work. A comparison of the obtained results with the actual state-of-the-art results from the literature shows the superior performance of the proposed architectures.


In future work, we will further optimize the training procedure for the proposed architecture, especially by using deeper CNN models with more feature maps and other kernel lengths. Furthermore, we will test alternative neural network architectures, particularly transformer networks, for their ability to tackle remaining useful lifetime estimation problems.
lifetime estimation problems. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feed-
forward neural networks. In Proceedings of the Thirteenth International Confer-
ence on Artificial Intelligence and Statistics. In Proceedings of Machine Learning
Declaration of competing interest Research: 9 (pp. 249–256). PMLR.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Adaptive computation
and machine learning. Cambridge, Massachusetts: The MIT Press.
• All authors have participated in the conception of the
Goyal, A., Lamb, A., Zhang, Y., Zhang, S., Courville, A., & Bengio, Y. (2016). Professor
manuscript, including the analysis and interpretation of the re- forcing: A new algorithm for training recurrent networks. In Proceedings of the
sults, drafting and revising the article for important intellectual 30th International Conference on Neural Information Processing Systems. In NIPS’16
content. (p. 46084616). Red Hook, NY, USA: Curran Associates Inc..
Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidi-
• The enclosed manuscript has not been submitted to, nor is con- rectional LSTM and other neural network architectures. Neural Networks : The
currently under review at another journal or publishing venue. Official Journal of the International Neural Network Society, 18(5–6), 602–610.
• The authors have no affiliation or involvement with any orga- https://doi.org/10.1016/j.neunet.2005.06.042.
Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G.,
nization or entity with a direct or indirect financial interest in Cai, J., & Chen, T. (2018). Recent advances in convolutional neural networks. Pat-
the subject matter discussed in the manuscript. tern Recognition, 77, 354–377. https://doi.org/10.1016/j.patcog.2017.10.013.
• The name of the authors below have affiliations or involvement Heimes, F. O. (2008). Recurrent neural networks for remaining useful life estimation.
In 2008 International Conference on Prognostics and Health Management (pp. 1–
with organizations with a direct or indirect financial interest in 6). https://doi.org/10.1109/PHM.2008.4711422.
the subject matter discussed in the manuscript Hikmet Esen, Filiz Ozgen, Mehmet Esen, & Abdulkadir Sengur (2009). Artificial neu-
ral network and wavelet neural network approaches for modelling of a solar air
heater. Expert Systems with Applications, 36(8), 11240–11248. https://doi.org/10.
CRediT authorship contribution statement 1016/j.eswa.2009.02.073.
Hikmet Esen, Mehmet Esen, & Onur Ozsolak (2017). Modelling and experimental
Sayed Rafay Bin Shah: Software, Conceptualization, Method- performance analysis of solar-assisted ground source heat pump system. Journal
of Experimental & Theoretical Artificial Intelligence, 29(1), 1–17. https://doi.org/10.
ology, Validation, Formal analysis, Resources, Writing – original 1080/0952813X.2015.1056242.
draft, Writing – review & editing, Visualization. Gavneet Singh Hikmet Esen, Mustafa Inalli, Abdulkadir Sengur, & Mehmet Esen (2008a). Forecast-
Chadha: Conceptualization, Methodology, Validation, Formal anal- ing of a ground-coupled heat pump performance using neural networks with
statistical data weighting pre-processing. International Journal of Thermal Sci-
ysis, Resources, Writing – original draft, Writing – review & ences, 47(4), 431–441. https://doi.org/10.1016/j.ijthermalsci.20 07.03.0 04.
editing, Project administration. Andreas Schwung: Conceptualiza- Hikmet Esen, Mustafa Inalli, Abdulkadir Sengur, & Mehmet Esen (2008b). Perfor-
tion, Writing – original draft, Supervision, Project administration. mance prediction of a ground-coupled heat pump system using artificial neural
networks. Expert Systems with Applications, 35(4), 1940–1948. https://doi.org/10.
Steven X. Ding: Supervision, Project administration. 1016/j.eswa.2007.08.081.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computa-
References tion, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.
Isermann, R. (2005). Model-based fault-detection and diagnosis – status and ap-
Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017). Understanding of a convolutional plications. Annual Reviews in Control, 29(1), 71–85. https://doi.org/10.1016/j.
neural network. In I. C. o. E. &. Technology (Ed.), Proceedings of 2017 Interna- arcontrol.20 04.12.0 02.
tional Conference on Engineering & Technology (icet’2017) (pp. 1–6). Piscataway, Jardine, A. K., Lin, D., & Banjevic, D. (2006). A review on machinery diagnos-
NJ: IEEE. https://doi.org/10.1109/ICEngTechnol.2017.8308186. tics and prognostics implementing condition-based maintenance. Mechanical
Bektas, O., Jones, J. A., Sankararaman, S., Roychoudhury, I., & Goebel, K. (2019). A Systems and Signal Processing, 20(7), 1483–1510. https://doi.org/10.1016/j.ymssp.
neural network filtering approach for similarity-based remaining useful life es- 2005.09.012.
timation. The International Journal of Advanced Manufacturing Technology, 101(1– Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, & Yoshua Bengio (2014). Empir-
4), 87–103. https://doi.org/10.10 07/s0 0170-018-2874-0. ical evaluation of gated recurrent neural networks on sequence modeling. CoRR,
Bengio, S., Vinyals, O., Jaitly, N., & Shazeer, N. (2015). Scheduled sampling for se- abs/1412.3555.
quence prediction with recurrent neural networks. In Proceedings of the 28th Khan, S., & Yairi, T. (2018). A review on the application of deep learning in sys-
International Conference on Neural Information Processing Systems - volume 1. In tem health management. Mechanical Systems and Signal Processing, 107, 241–
NIPS’15 (p. 11711179). Cambridge, MA, USA: MIT Press. 265. https://doi.org/10.1016/j.ymssp.2017.11.024.
Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Kingma, D. P., & Ba, J. (2015). Adam: A Method for stochastic optimization. CoRR,
Journal of Machine Learning Research, 13(Feb), 281–305. abs/1412.6980.
Botchkarev, A. (2019). A new typology design of performance metrics to measure Kiranyaz, S., Avci, O., Abdeljaber, O., Ince, T., Gabbouj, M., & Inman, D. J. (2021).
errors in machine learning regression algorithms. Interdisciplinary Journal of In- 1D convolutional neural networks and applications: A survey. Mechanical Sys-
formation, Knowledge, and Management, 14, 045–076. https://doi.org/10.28945/ tems and Signal Processing, 151, 107398. https://doi.org/10.1016/j.ymssp.2020.
4184. 107398.
Chatterjee, S., & Litt, J. (2003). Online model parameter estimation of jet engine Kong, Z., Cui, Y., Xia, Z., & Lv, H. (2019). Convolution and long short-Term mem-
degradation for autonomous propulsion control. In Guidance, Navigation, and ory hybrid deep neural networks for remaining useful life prognostics. Applied
Control and Co-Located Conferences. https://doi.org/10.2514/6.2003-5425. Sciences, 9(19), 4156. https://doi.org/10.3390/app9194156.
Chaudhari, S., Polatkan, G., Ramanath, R., & Mithal, V. (2019a). An attentive survey Lei, Y., Li, N., Guo, L., Li, N., Yan, T., & Lin, J. (2018). Machinery health prognostics:
of attention models. ArXiv, abs/1904.02874. A systematic review from data acquisition to rul prediction. Mechanical Systems
Chaudhari, S., Polatkan, G., Ramanath, R., & Mithal, V. (2019b). An attentive survey and Signal Processing, 104, 799–834. https://doi.org/10.1016/j.ymssp.2017.11.016.
of attention models. arXiv preprint arXiv:1904.02874. Li, J., Li, X., & He, D. (2019a). A directed acyclic graph network combined with
Chen, Z., Wu, M., Zhao, R., Guretno, F., Yan, R., & Li, X. (2021). Machine remain- CNN and LSTM for remaining useful life prediction. IEEE Access, 7, 75464–75475.
ing useful life prediction via an attention-Based deep learning approach. IEEE https://doi.org/10.1109/ACCESS.2019.2919566.
Transactions on Industrial Electronics, 68(3), 2521–2531. https://doi.org/10.1109/ Li, X., Ding, Q., & Sun, J.-Q. (2018). Remaining useful life estimation in prognostics
TIE.2020.2972443. using deep convolution neural networks. Reliability Engineering & System Safety,
Dzmitry Bahdanau, Kyunghyun Cho, & Yoshua Bengio (2015). Neural machine trans- 172, 1–11. https://doi.org/10.1016/j.ress.2017.11.021.
lation by jointly learning to align and translate. CoRR, abs/1409.0473. Li, X., Zhang, W., & Ding, Q. (2019b). Deep learning-based remaining useful life es-
Elsheikh, A., Yacout, S., & Ouali, M.-S. (2019). Bidirectional handshaking lstm for re- timation of bearings using multi-scale feature extraction. Reliability Engineering
maining useful life prediction. Neurocomputing, 323, 148–156. https://doi.org/10. & System Safety, 182, 208–218. https://doi.org/10.1016/j.ress.2018.11.011.
1016/j.neucom.2018.09.076. Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A Critical Review of Recurrent Neural
Esen, H., Inalli, M., Sengur, A., & Esen, M. (2008). Artificial neural networks and Networks for Sequence Learning. https://arxiv.org/pdf/1506.0 0 019.
adaptive neuro-fuzzy assessments for ground-coupled heat pump system. En- Listou Ellefsen, A., Bjørlykhaug, E., Æsøy, V., Ushakov, S., & Zhang, H. (2019). Re-
ergy and Buildings, 40(6), 1074–1083. https://doi.org/10.1016/j.enbuild.2007.10. maining useful life predictions for turbofan engine degradation using semi-
002. Supervised deep architecture. Reliability Engineering & System Safety, 183, 240–
Frederick, D. K., DeCastro, J. A., & Litt, J. S. (2007). Users Guide for the Commercial 251. https://doi.org/10.1016/j.ress.2018.11.027.
Modular Aero-Propulsion System Simulation (C-MAPSS). Liu, H., Liu, Z., Jia, W., & Lin, X. (2019). A novel deep learning-based encoder-decoder
Gavneet Singh Chadha, Ambarish Panambilly, Andreas Schwung, & Steven X. Ding model for remaining useful life prediction. In 2019 International Joint Conference
(2020). Bidirectional deep recurrent neural networks for process fault classifica- on Neural Networks (ijcnn) (pp. 1–8). IEEE. https://doi.org/10.1109/IJCNN.2019.
tion. ISA Transactions, 106, 330–342. https://doi.org/10.1016/j.isatra.2020.07.011. 8852129.

17
S.R.B. Shah, G.S. Chadha, A. Schwung et al. Intelligent Systems with Applications 10–11 (2021) 200049

Liu, H., Liu, Z., Jia, W., & Lin, X. (2021). Remaining useful life prediction using a Wang, J., Wen, G., Yang, S., & Liu, Y. (2018). Remaining useful life estimation in
novel feature-attention-based end-to-end approach. IEEE Transactions on Indus- prognostics using deep bidirectional LSTM neural network. In 2018 Prognos-
trial Informatics, 17(2), 1197–1207. https://doi.org/10.1109/TII.2020.2983760. tics and System Health Management Conference (phm-chongqing) (pp. 1037–1042).
Luong, T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention- https://doi.org/10.1109/PHM-Chongqing.2018.00184.
based Neural Machine Translation. In Proceedings of the 2015 Conference on Em- Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully
pirical Methods in Natural Language Processing (pp. 1412–1421). Lisbon, Portugal: recurrent neural networks. Neural Computation, 1(2), 270–280. https://doi.org/
Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1166. 10.1162/neco.1989.1.2.270.
Martin, K. F. (1994). A review by discussion of condition monitoring and fault diag- Wu, J. (2017). Introduction to convolutional neural networks. National Key Lab for
nosis in machine tools. International Journal of Machine Tools and Manufacture, Novel Software Technology. Nanjing University. China, 5, 23.
34(4), 527–551. https://doi.org/10.1016/0890- 6955(94)90083- 3. Wu, Y., Yuan, M., Dong, S., Lin, L., & Liu, Y. (2018). Remaining useful life estimation
Mo, Y., Wu, Q., Li, X., & Huang, B. (2021). Remaining useful life estimation via trans- of engineered systems using vanilla lstm neural networks. Neurocomputing, 275,
former encoder enhanced by a gated convolutional unit. Journal of Intelligent 167–179. https://doi.org/10.1016/j.neucom.2017.05.063.
Manufacturing. https://doi.org/10.1007/s10845- 021- 01750- x. Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E.
Nandi, S., Toliyat, H. A., & Li, X. (2005). Condition monitoring and fault diagnosis of Howard, Wayne E. Hubbard, & Lawrence D. Jackel (1990). Handwritten digit
electrical motors—A review. IEEE Transactions on Energy Conversion, 20(4), 719– recognition with a back-propagation network denver, colorado, usa, november
729. https://doi.org/10.1109/TEC.2005.847955. 27–30, 1989]. In David S. Touretzky (Ed.), Advances in Neural Information Pro-
Paulo Roberto De Oliveira Da Costa, Alp Akcay, Yingqian Zhang, & Uzay Kay- cessing Systems 2 (pp. 396–404). Morgan Kaufmann.
mak (2020). Attention and long short-Term memory network for remaining use- Ye, L., & Keogh, E. (2009). Time Series Shapelets: A New Primitive for Data Mining.
ful lifetime predictions of turbofan engine degradation. International Journal of In KDD ’09 (p. 947956). New York, NY, USA: Association for Computing Machin-
Prognostics and Health Management, 10, 12. ery. https://doi.org/10.1145/1557019.1557122.
Peng, C., Chen, Y., Chen, Q., Tang, Z., Li, L., & Gui, W. (2021). A remaining useful life Yu, Z., Ramanarayanan, V., Suendermann-Oeft, D., Wang, X., Zechner, K., Chen, L.,
prognosis of turbofan engine using temporal and spatial feature fusion. Sensors Tao, J., Ivanou, A., & Qian, Y. (2015). Using bidirectional LSTM recurrent neural
(Basel, Switzerland), 21(2). https://doi.org/10.3390/s21020418. networks to learn high-level abstractions of sequential features for automated
Ragab, M., Chen, Z., Wu, M., Kwoh, C.-K., Yan, R., & Li, X. (2020). Attention scoring of non-native spontaneous speech. In 2015 IEEE Workshop on Automatic
Sequence to Sequence Model for Machine Remaining Useful Life Prediction. Speech Recognition and Understanding (asru) (pp. 338–345). IEEE. https://doi.org/
https://arxiv.org/pdf/2007.09868. 10.1109/ASRU.2015.7404814.
Sateesh Babu, G., Zhao, P., & Li, X.-L. (2016). Deep convolutional neural network Zhang, H., Zhang, Q., Shao, S., Niu, T., & Yang, X. (2017). Attention-Based LSTM net-
based regression approach for estimation of remaining useful life. In Database work for rotatory machine remaining useful life prediction. IEEE Access, 1–9.
Systems for Advanced Applications (pp. 214–228). Springer International Publish- https://doi.org/10.1109/ACCESS.2020.3010066.
ing. https://doi.org/10.1007/978- 3- 319- 32025- 0_14. Zhao, R., Yan, R., Chen, Z., Mao, K., Wang, P., & Gao, R. X. (2019a). Deep learning and
Saxena, A., Goebel, K., Simon, D., & Eklund, N. (06/10/2008 - 09/10/2008). Dam- its applications to machine health monitoring. Mechanical Systems and Signal
age propagation modeling for aircraft engine run-to-failure simulation. In 2008 Processing, 115, 213–237. https://doi.org/10.1016/j.ymssp.2018.05.050.
International Conference on Prognostics and Health Management (pp. 1–9). IEEE. Zhao, S., Zhang, Y., Wang, S., Zhou, B., & Cheng, C. (2019b). A recurrent neural net-
https://doi.org/10.1109/PHM.2008.4711414. work approach for remaining useful life prediction utilizing a novel trend fea-
Sutskever, I., Vinyals, O., & Le V, Q. Sequence to sequence learning with neural net- tures construction method. Measurement, 146, 279–288. https://doi.org/10.1016/
works. http://arxiv.org/pdf/1409.3215v3. j.measurement.2019.06.004.
Swati Meshram (2019). Survey on attention neural network models for natural lan- Zheng, S., Ristovski, K., Farahat, A., & Gupta, C. (2017). Long short-term memory net-
guage processing: 8. https://doi.org/10.15680/IJIRSET.2019.0810037. work for remaining useful life estimation. In 2017 IEEE International Conference
Tan, W. M., & Teo, T. H. (2021). Remaining useful life prediction using temporal con- on Prognostics and Health Management (icphm) (pp. 88–95). https://doi.org/10.
volution with attention. Artificial Intelligence (AI), 2(1), 48–70. https://doi.org/10. 1109/ICPHM.2017.7998311.
3390/ai2010 0 05.
Tsui, K. L., Chen, N., Zhou, Q., Hai, Y., & Wang, W. (2015). Prognostics and health
management: A review on data driven approaches. Mathematical Problems in
Engineering, 2015(6), 1–17. https://doi.org/10.1155/2015/793161.
