
Neural Networks 154 (2022) 455–468


Using source data to aid and build variational state–space autoencoders with sparse target data for process monitoring

Yi Shan Lee, Junghui Chen*
Department of Chemical Engineering, Chung-Yuan Christian University, Chung-Li, Taoyuan, 32023, Taiwan, Republic of China
* Corresponding author. E-mail address: jason@wavenet.cycu.edu.tw (J. Chen).

Article info

Article history:
Received 19 December 2021
Received in revised form 18 April 2022
Accepted 9 June 2022
Available online 18 June 2022

Keywords:
Gaussian mixture model
Multi-grade process
Process dynamics
Process monitoring
Sparse data
Variational autoencoder

Abstract

In industrial processes, different operating conditions and ratios of ingredients are used to produce multi-grade products in the same production line. Yet, the production grade changes quickly as the demand from customers varies from time to time. As a result, the process data collected in certain operating regions are often scarce. Process dynamics, nonlinearity, and process uncertainty increase the difficulty of developing a reliable model to monitor the process status. In this paper, the source-aided variational state–space autoencoder (SA-VSSAE) is proposed. It integrates the variational state–space autoencoder with the Gaussian mixture. With the additional information from the source grades, SA-VSSAE can be used for monitoring processes with sparse target data by performing information sharing to enhance the reliability of the target model. Unlike past works, which perform information sharing and modeling in a two-step procedure, the proposed model performs information sharing and modeling in a one-step procedure without causing information loss. In contrast to the traditional state–space model, which is linear and deterministic, the variational state–space autoencoder (VSSAE) extracts the dynamic and nonlinear features in the process variables using neural networks. Also, by taking process uncertainty into consideration, VSSAE describes the features in a probabilistic form. Probability density estimates of the residual and latent variables are given to design the monitoring indices for fault detection. A numerical example and an industrial polyvinyl chloride drying process are presented to show the advantages of the proposed method over the comparative methods.

© 2022 Elsevier Ltd. All rights reserved.
https://doi.org/10.1016/j.neunet.2022.06.010

1. Introduction

Product quality and economic benefits are increasingly important due to the growing complexity and size of the modern chemical industry. To follow the dynamic market changes and technology trends, manufacturers tend to produce products of multiple grades to meet the market demands and customer needs. During the production of products of different grades, each type of product is processed by changing the operating conditions of raw materials in the same production line (Liu, 2007). In fact, manufacturers of polyethylene produce over 50 grades of products per year, so the grade may be changed to another one every 6 days. Hence, model overfitting would definitely occur if the limited samples collected in those 6 days were used to construct individual models for each grade (Jaeckle, MacGregor, & MacGregor, 1998). Fortunately, even if the operating condition for each grade is different, the products undergo the same production flow and even come from the same production line. Jaeckle and MacGregor (2000) showed that multi-grade processes may have similar physical properties, so there must be some common information between the data of different grades. Sharing common information is important as it enriches the information of the individual grades and improves the model reliability.

Transfer learning can adapt to the variation of the data distribution between source and target domains. Zhang, Zuo, and Zhang (2016) proposed the latent sparse domain transfer (LSDT) for domain adaptation and visual categorization of heterogeneous data. In LSDT, sparse subspace clustering (SSC), instead of the low-rank representation, was introduced with the joint learning mechanism to achieve an optimal subspace representation. SSC was then employed in a reproducing kernel Hilbert space to solve low-level nonlinear problems. The selection of the kernel function highly affects the performance of the constructed model, and the kernel approximation can only represent a low-level nonlinearity of the process. To describe highly nonlinear feature representations in industrial processes, deep model structures instead of shallow ones were developed for transfer learning. Yu, Wang, Chen, and Huang (2019) developed the dynamic adversarial adaption network (DAAN) based on the concept of the generative adversarial network for transfer learning, proposing the dynamic adversarial factor to perform an easy, dynamic, and quantitative evaluation of the adaptation between marginal and


conditional distributions. Wang et al. (2019) proposed a concept called dynamic distribution adaptation (DDA) under the framework of structural risk minimization to solve transfer learning problems. DDA dynamically learns the distribution weights to obtain the relative importance of the marginal and conditional distributions. In addition, Lee and Chen (2021) proposed a Gaussian mixture prior variational autoencoder (GMPVAE) to obtain common and special features in a one-step procedure. Although DAAN, DDA, and GMPVAE were developed for deep transfer learning, they are not suitable to describe the dynamic process behavior and time-lagged property in industrial processes because they assume that the processes are operated in the steady-state condition.

In contrast with the processes in the fields of robotics and informatics, large-scale chemical processes are slow, so they have a large time constant when the operation is changed from one condition to another. Also, there is process nonlinearity in conjunction with the time-series correlation between samples because of feedback control (Wang, Chen, & Song, 2017). The neural network-based models (Wozniak, Silka et al., 2021; Wozniak, Wieczorek et al., 2021) developed for robotics and informatics are not suitable for a large-scale operating plant because they do not consider the large time constant and the process uncertainty inherent in operating processes. Several dynamic models were developed, including recurrent neural networks (RNNs) (Vijaya Raghavan, Radhakrishnan, & Srinivasan, 2011), the long short-term memory (LSTM), and the gated recurrent unit (GRU). They are widely used for process modeling because of their nonlinear dynamic feature representation ability (Yuan, Li, & Wang, 2020). To address the process and measurement uncertainty caused by unmeasured disturbances, measurement errors, or missing values, Wang, Yuan, Chen, and Wang (2021) proposed a semi-supervised VAE (SS-VAE) structure for concurrent process-quality monitoring. Moreover, the temporal-spatial neighborhood enhanced sparse stack autoencoder (TS-SSAE) (Li, Shi, Song, & Tao, 2020) was developed in a weighted just-in-time concept to extract the important features between process variables. Furthermore, Lin, Clark, Birke, and Sch (2020) proposed the VAE with long short-term memory (VAE-LSTM) and took it as a latent probabilistic model, which can handle nonlinear, high-dimensional, complex, and dynamic problems. Still, it is impractical to implement the above-mentioned dynamic nonlinear models in practical applications if the samples available for training the models are insufficient. To solve the problem of insufficient training data, a transfer learning method integrating the autoregressive exogenous input (ARX) model with the Gaussian process (GP) model was proposed to incorporate source knowledge for quick dynamic target learning (Wang, Chen, Xie, & Su, 2020). However, the selection of the number of input and output windows in the ARX model influences the model's reliability. Thus, Wu and Zhao (2020) developed a transfer learning fault detection and diagnosis method based on the maximum mean discrepancy (MMD), a distribution distance metric taken as a transfer criterion for domain adaptation. As the data in the observation space are highly noise-contaminated, applying MMD in the observation space would certainly affect the accuracy of the model. Also, describing the process dynamics in the high-dimensional observation space would decrease the model reliability.

Conventionally, Markovian models are directly applied to the dynamic observation data. However, in the field of chemical process systems engineering, the data collected from the plant are highly noise-contaminated and inherit process uncertainties. A Markovian model in the noise-free latent space can express the true dynamic behavior of the process system. Instead of modeling the process dynamics in the noise-contaminated space, a transfer learning model which describes the process dynamics at a lower noise level and in a lower-dimensional feature space is necessary. To our best knowledge, learning process dynamics in the feature space is still lacking in the field of transfer learning. This paper designs a novel method named the source-aided variational state–space autoencoder (SA-VSSAE), which not only makes use of the source data but also integrates the variational state–space autoencoder with the Gaussian mixture to deal with the data scarcity problem and increase the reliability of the target grade model. In the past, state–space models (SSMs) were often used to describe process dynamics, while GMMs were used to describe unevenly distributed data. SSMs were usually linear, and GMMs were applied to the noise-contaminated observation space. In this paper, the objective of constructing SA-VSSAE is to search for the common mapping that extracts the useful common information between the source and target data for target model enhancement. The encoder of the SA-VSSAE, in the form of an SSM, is responsible for extracting the common dynamic features between the target and source variables. In the meantime, the GMM embedded in the lower-dimensional feature space separates the noise-free target features from the source features so that their unique grade information can be preserved. Thus, the extracted common dynamic features, which can increase the reliability of the target grade, can be used for modeling the target grade in a one-step procedure. The target grade accepts the useful features from the source so that the data-sparse target model is enhanced. Subsequently, process monitoring is designed for abnormality detection with 3 kinds of performance metrics. SA-VSSAE is detailed in the following sections. Section 2 presents the background knowledge about the VAE. Next, the process modeling concept of the proposed method is described in Section 3. In Section 4, the indices of process monitoring are proposed. Then a numerical case and an industrial case are provided in Section 5 and Section 6, respectively, to demonstrate the effectiveness of the proposed method. Section 7 draws the conclusions of the paper.

2. Background study

Fig. 1. Structure of the variational autoencoder (VAE).

The variational autoencoder (VAE) consists of an encoder and a decoder, as shown in Fig. 1. X is the input of the VAE with M dimensions on each sample. z is the latent variable compressed to a lower dimension from the observation X. The network which compresses and maps X to z is called the encoder. The encoded distribution can be represented by the posterior distribution p(z|X). As the main characteristic of the VAE, the latent variable z of the VAE is a probabilistic representation of the compressed features, which is subject to a standard normal distribution with the orthogonal property for independence, p(z). The prior distribution p(z) ~ N(0, I) acts as a constraint to be followed during the training of the VAE model. Finally, the decoder of the VAE reconstructs the latent variables back to the observation space. The conditional distribution p(X|z) is defined to represent the reconstruction result. The loss function of the VAE is defined to maximize the likelihood of the reconstructed observation samples p(X) as follows.

p(X) = \int p(X|z)\, p(z)\, dz    (1)

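To see why Eq. (1) is awkward to evaluate directly, the sketch below estimates p(X) by naive Monte Carlo sampling from the prior p(z). The decoder used here is a fixed toy mapping, not the paper's network (an assumption for illustration only); the variational treatment derived in the next paragraphs replaces this brute-force average with an inferred posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_mean(z):
    # Stand-in decoder: maps latent points to observation means.
    # In the paper this is a neural network; here it is a fixed toy mapping.
    return np.tanh(z @ np.array([[0.5, -0.3, 0.8, 0.1]]))

def log_gaussian(x, mean, var):
    # Log density of a diagonal Gaussian evaluated at x.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def log_p_x_monte_carlo(x, n_samples=10_000, obs_var=0.1):
    # Eq. (1): p(X) = ∫ p(X|z) p(z) dz, approximated by averaging p(X|z)
    # over latent samples drawn from the standard normal prior p(z).
    z = rng.standard_normal((n_samples, 1))
    log_px_given_z = log_gaussian(x, decoder_mean(z), obs_var)
    m = log_px_given_z.max()                      # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(log_px_given_z - m)))

x = np.array([0.2, -0.1, 0.4, 0.0])               # one observation with M = 4 variables
print("estimated ln p(X):", log_p_x_monte_carlo(x))
```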
To solve the integral in Eq. (1), an inferred posterior distribution q(z|X) is used to estimate the posterior distribution p(z|X), as the latent variables are unobserved. The inferred posterior distribution q(z|X) is required to be as similar to the real posterior distribution p(z|X) as possible. The Kullback–Leibler (KL) divergence can be used to measure the similarity of two distributions, so the KL divergence between q(z|X) and p(z|X) has to be low to reach a good inference. Thus, the lowest divergence value 0 is assumed to be achieved during the development of the VAE.

KL[q(z|X) \| p(z|X)] = \int q(z|X) \ln \frac{q(z|X)}{p(z|X)}\, dz \approx 0    (2)

Then, with the Bayes rule and the Jensen inequality, Eq. (2) can be rearranged into the variational lower bound given by

L_{VAE} = E_{q(z|X)}[\ln p(X|z)] - KL[q(z|X) \| p(z)]    (3)

Hence, maximizing the variational lower bound is equivalent to maximizing the reconstruction likelihood of p(X). The expectation term of L_{VAE} represents the reconstruction performance of the VAE model. The KL divergence term acts as a regularization term that forces the inferred posterior close to the prior distribution.

One can see that the VAE developed above is not suitable for a dynamic system, as it only considers the relations between variables while the autocorrelations are neglected. Most importantly, the VAE is a deep model that requires a sufficient number of data to ensure its model reliability. In multi-grade processes, the data collected during a production grade are definitely insufficient for a highly complex and nonlinear system. Therefore, process modeling for the dynamic system with limited target data is the topic of this paper. Its description is given in the next section.

3. Process modeling for insufficient target data

To simply describe the information sharing between target and source, assume that there are only one target and one source observation sequence. X^t is defined as the observation sequence from the target grade as in Eq. (4) and X^s as the observation sequence from the source grade as in Eq. (5).

X^t = \{x_k^t \in R^M, k = 1, 2, \ldots, K^t\}    (4)
X^s = \{x_k^s \in R^M, k = 1, 2, \ldots, K^s\}    (5)
X = \{X^t, X^s\}    (6)

The subscript k represents the sample index in a sequence, M denotes the number of variables for a data point, and K^t and K^s represent the total numbers of samples in the target and source observation sequences, respectively. A union set of the target and source observation sequences is defined as X in Eq. (6), in which the superscripts t and s denote the target and source sequences, respectively. Fig. 2 shows the structure of the proposed SA-VSSAE. It is set up and constructed based on the derivation result of the objective function, which will be mentioned later, with the variational inference approach and Bayes rules. Like an ordinary GMPVAE, the encoder and the decoder of the VAE structure in SA-VSSAE act as the source-aided networks to represent the common mapping structure for common dynamic feature extraction between the source and target grades. Before common dynamic feature extraction, the forward–backward recurrent neural network (FB-RNN) in the SA-VSSAE plays an important role in representing the accumulated information from the past to the current observation. Then the encoder of the SA-VSSAE maps the accumulated information to the latent space and performs the transition mapping from the past to the current latent feature. Hence, the transition mapping procedure is done by a nonlinear state–space model and it mimics the dynamics in the process. In SA-VSSAE, a GMM is embedded in the latent space to preserve the unique information of the target and source grades. Thus, the individual dynamic information in the target and source grades is obtained while common dynamic feature extraction is performed. Then the decoder of the SA-VSSAE reconstructs the observation sequence to ensure the rationality of the extracted latent features. The details of each network structure are given in the following explanation.

As the event of data sampling in the target process would not affect the data sampled from the source process, the log-probability of all samples can be represented by adding up the log-probabilities of the target and source samples; that is,

\ln p(X) = \ln p(X^t) + \ln p(X^s)    (7)

Based on the union set of the target and source observation sequences, SA-VSSAE is developed to boost the modeling performance of the target model with insufficient target data samples by using useful common features extracted from the source-aided encoder and decoder networks. However, the goal while optimizing the SA-VSSAE model is to improve the data-sparse target model by putting more emphasis on the target data. Thus, the objective function of SA-VSSAE is designed to give more weighting to the target data by maximizing the log-likelihood of p(X)[p(X^t)]^{\alpha} as in Eq. (8), instead of just maximizing the log-likelihood of p(X).

\max_{\beta, \phi, \theta, \pi, \mu, \Sigma} \ln p(X)[p(X^t)]^{\alpha} = \ln p(X^t) + \ln p(X^s) + \alpha \ln p(X^t) = (\alpha + 1) \ln p(X^t) + \ln p(X^s)    (8)

p(X^t) represents the probability likelihood of the samples in the observation sequence from the target. α is a weighting factor. β, φ, and θ are the model parameters of the forward–backward RNN, the encoder network, and the decoder network, respectively. π, µ, and Σ are the model parameters of the GMM. The structure and parameters in SA-VSSAE are trained to have the maximum probability likelihood with more weighting on the target samples and with the help of the auxiliary source samples to enhance the target model.

To simplify the derivations, the superscript (s or t) that describes the domain is ignored henceforth and replaced by the superscript g (g = s, t), because the samples from the target or source at every time point are formed from the same formulations. To build an LV model, the ways to generate observations from LVs and to infer LVs from observations are very important. The unobserved state LVs are defined as Z^g = \{z_k^g \in R^D, k = 1, 2, \ldots, K^g\} and the categorical LVs as W^g = \{w_k^g \in Z^+, k = 1, 2, \ldots, K^g\}. The posterior distribution q(Z^g, W^g|X^g) is the inference result of the LVs from the observation X^g. To restrict the LVs in the latent space, the similarity of the posterior distribution to the prior distribution p(Z^g, W^g|X^g) is defined using the KL divergence.

KL[q(Z^g, W^g|X^g) \| p(Z^g, W^g|X^g)] = \int_{Z^g, W^g} q(Z^g, W^g|X^g) \ln \frac{q(Z^g, W^g|X^g)}{p(Z^g, W^g|X^g)}\, dZ^g\, dW^g    (9)

With the Bayes rule applied to the joint distribution p(Z^g, W^g, X^g) = p(Z^g, W^g|X^g)\, p(X^g), Eq. (9) can be written as

KL[q(Z^g, W^g|X^g) \| p(Z^g, W^g|X^g)]
= \int_{Z, W} q(Z^g, W^g|X^g) \ln \frac{q(Z^g, W^g|X^g)\, p(X^g)}{p(Z^g, W^g, X^g)}\, dZ^g\, dW^g
= E_{q(Z^g,W^g|X^g)} \ln q(Z^g, W^g|X^g) + E_{q(Z^g,W^g|X^g)} \ln p(X^g) - E_{q(Z^g,W^g|X^g)} \ln p(Z^g, W^g, X^g)    (10)
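Recalling the weighted objective of Eq. (8), the toy sketch below shows how per-domain log-likelihood (or lower-bound) values would be combined into the SA-VSSAE objective. The numeric values are placeholders; in the actual model each term would be the sequence lower bound derived below, evaluated on that domain.

```python
def sa_vssae_objective(ln_p_target, ln_p_source, alpha=1.0):
    """Weighted objective of Eq. (8): (alpha + 1) * ln p(X^t) + ln p(X^s).

    ln_p_target / ln_p_source are the estimated log-likelihoods or
    variational lower bounds of the target and source sequences.
    """
    return (alpha + 1.0) * ln_p_target + ln_p_source

# Example: emphasizing a sparse target sequence over a richer source one.
print(sa_vssae_objective(ln_p_target=-120.5, ln_p_source=-480.2, alpha=2.0))
```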
Fig. 2. Structure of the proposed SA-VSSAE. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

As the marginal distribution p(X^g) in Eq. (10) is not affected by the latent variables Z^g and W^g, it can be taken out of the expectation E_{q(Z^g,W^g|X^g)}. Because the KL divergence term is non-negative and it is assumed to have a value of 0 for the posterior distribution to be similar to the prior, Eq. (10) can be rearranged into the variational lower bound L^g of that particular domain g.

\ln p(X^g) \ge L^g = E_{q(Z^g,W^g|X^g)}\big[\ln p(Z^g, W^g, X^g)\big] - E_{q(Z^g,W^g|X^g)}\big[\ln q(Z^g, W^g|X^g)\big]    (11)

Thus, the variational inference mechanism can be described by the variational lower bound of Eq. (11). The logarithmic probability terms \ln p(Z^g, W^g, X^g) and \ln q(Z^g, W^g|X^g) in the variational lower bound can be factorized according to the Bayes theorem as

\ln p(X^g, Z^g, W^g) = \ln p(X^g|Z^g) + \ln p(Z^g|W^g) + \ln p(W^g)    (12)
\ln q(Z^g, W^g|X^g) = \ln q(W^g|Z^g) + \ln q(Z^g|X^g)    (13)

To calculate each term in Eq. (12), the probability of the observation sequence sample at each time point x_k^g comes from the state LV (z_k^g), and z_k^g comes from the categorical LV (w_k^g). Both can be described as follows.

p(z_k^g|z_{k-1}^g, w_k^g) = N\big(\mu_w(z_{k-1}^g, w_k^g), \Sigma_w(z_{k-1}^g, w_k^g)\big)    (14)
p(x_k^g|z_k^g) = N\big(\mu_x(z_k^g), \Sigma_x(z_k^g)\big)    (15)

Eq. (14) shows the transition prior distribution p(z_k^g|z_{k-1}^g, w_k^g) and it represents the resulting probability of the D-dimensional continuous state LV z_k^g ∈ R^D obtained from the previous state to the current state through the transition prior distribution DNN network. µ_w and Σ_w are defined as the mean and the variance of the transition prior distribution p(z_k^g|z_{k-1}^g, w_k^g). After the transformation, Eq. (15) shows the emission distribution p(x_k^g|z_k^g) and it represents the resulting generation probability of the data sample x_k^g in an observation sequence. µ_x and Σ_x are the mean and the variance of the emission distribution p(x_k^g|z_k^g). The emission distribution DNN is used to generate the observation from the latents. Meanwhile, Eq. (16) shows the resulting probability of the categorical latent variable w_k^g ∈ Z^+ parameterized by \pi_{w_k}^g.

p(w_k^g) = Cat\big(w_k^g \,\big|\, \pi_{w_k}^g\big)    (16)

where \pi_{w_k}^g is the prior probability for the specific domain g, \pi^g = \{\pi_{w_k}^g, k = 1, 2, \ldots, K^g\} \in R^W with \sum_{w_k^g=1}^{W} \pi_{w_k}^g = 1, and W denotes the total number of domains. They both play the role of defining the characteristics of the sub-distributions in the GMM. If only the state latent variable Z were inferred in the proposed model, the latent space would be defined as a single Gaussian distribution and it might not be able to extract the unique features in the unevenly distributed data. Subsequently, it would be impossible for the model to separate the individual grades. The initial state LV distribution is assumed to be

p(z_0) = N(\mu_0, \Lambda_0)    (17)

µ_0 and Λ_0 are the mean and the variance of the initial state latent variable distribution p(z_0^g). With the generative process described by Eqs. (14)–(17), each term in the joint probability in Eq. (12) can be factorized as

\ln p(X^g, Z^g, W^g) = \ln p(z_0^g) + \sum_{k=1}^{K^g} \ln p(x_k^g|z_k^g) + \sum_{k=1}^{K^g} \ln p(z_k^g|z_{k-1}^g, w_k^g) + \sum_{k=1}^{K^g} \ln p(w_k^g)    (18)

On the other side, each term in Eq. (13) for the inference process going from the observation space to the latent spaces can be factorized as

\ln q(Z^g, W^g|X^g) = \ln q(z_0^g) + \sum_{k=1}^{K^g} \ln q(z_k^g|z_{k-1}^g, x_k^g) + \sum_{k=1}^{K^g} \ln q(w_k^g|z_k^g)    (19)

Therefore, in terms of sequence samples, the variational lower bound is rewritten as

L^g = E_{q(z_0^g)} \ln \frac{p(z_0^g)}{q(z_0^g)} + \sum_{k=1}^{K^g} E_{q(z_k^g|x_k^g)} \ln p(x_k^g|z_k^g) + \sum_{k=1}^{K^g} E_{q(z_k^g|z_{k-1}^g,x_k^g)\,q(w_k^g|z_k^g)} \ln \frac{p(z_k^g|z_{k-1}^g, w_k^g)}{q(z_k^g|z_{k-1}^g, x_k^g)} + \sum_{k=1}^{K^g} E_{q(z_k^g|z_{k-1}^g,x_k^g)\,q(w_k^g|z_k^g)} \ln \frac{p(w_k^g)}{q(w_k^g|z_k^g)}    (20)

Eqs. (14)–(17) describe the details of the distribution terms p(z_0^g), p(x_k^g|z_k^g), p(z_k^g|z_{k-1}^g, w_k^g), and p(w_k^g) in Eq. (20). The distribution terms q(z_0^g), q(z_k^g|z_{k-1}^g, x_k^g), and q(w_k^g|z_k^g) in Eq. (20) also have to be determined and taken as the result of the inference process.
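To ground the generative side of the model (Eqs. (14)–(17)), the following minimal NumPy sketch draws a mixture component from the categorical prior, propagates the state through a Gaussian transition, and emits an observation from a Gaussian whose parameters depend on the current state. The linear maps used for the transition and emission parameters are placeholders for the DNNs in the paper (an assumption for illustration only).

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, W = 2, 4, 2                      # latent dim, observation dim, mixture components
pi = np.array([0.5, 0.5])              # categorical prior of Eq. (16)
A = [np.eye(D) * 0.9, np.eye(D) * 0.7] # per-component transition maps (stand-ins for the prior DNN)
C = rng.standard_normal((M, D)) * 0.5  # emission map (stand-in for the emission DNN)
trans_var, emis_var = 0.01, 0.05

def generate_sequence(K, mu0=np.zeros(D), var0=1.0):
    """Draw (W, Z, X) following the generative story of Eqs. (14)-(17)."""
    z = mu0 + np.sqrt(var0) * rng.standard_normal(D)   # z_0 ~ N(mu_0, Lambda_0), Eq. (17)
    ws, zs, xs = [], [], []
    for _ in range(K):
        w = rng.choice(W, p=pi)                                        # Eq. (16)
        z = A[w] @ z + np.sqrt(trans_var) * rng.standard_normal(D)     # Eq. (14)
        x = C @ z + np.sqrt(emis_var) * rng.standard_normal(M)         # Eq. (15)
        ws.append(w); zs.append(z); xs.append(x)
    return np.array(ws), np.array(zs), np.array(xs)

w_seq, z_seq, x_seq = generate_sequence(K=100)
print(x_seq.shape)   # (100, 4)
```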
Each term in the variational lower bound can explain its role in handling the complications of the dynamic nonlinear system, especially the auto-correlation in the features. The terms p(z_k^g|z_{k-1}^g, w_k^g) and q(z_k^g|z_{k-1}^g, x_k^g) describe the state transition of the latent variables in the lower-dimensional space. The auto-correlation within the process data is characterized by the state transition procedure. Before the state transition proceeds, the past and current observations x_{1:k}^g are mapped and compressed into an RNN cell state h_k^g ∈ R^l as follows.

h_k^g = \Lambda(h_{k-1}^g, x_k^g)    (21)

The cell state h_k^g is obtained from the previous state h_{k-1}^g. With this iterative substitution, the forward final cell state h_{K^g}^g is obtained. To prevent overfitting, the backward recursion in the RNN structure can mimic the state transition between the sequence samples to improve the cell states in the RNN window. Eq. (22) represents the backward RNN recursion.

\tilde{h}_k^g = \Lambda(\tilde{h}_{k+1}^g, x_{1:k}^g)    (22)

The forward–backward RNN (FB-RNN) structure is illustrated by the pink nodes in Fig. 3. It is used to depict the accumulated attributes of the observations from the past to the current observation in the form of cell states. The final cell state of the forward RNN (h_{K^g}^g) is set as the initial cell state of the backward RNN (\tilde{h}_{K^g}^g). The initial cell state h_0^g, which is used to estimate the initial smoothed posterior distribution q(z_0^g), can be corrected at the end of the backward recursion because h_0^g is assumed to be equal to \tilde{h}_0^g in the FB-RNN. Λ in Eqs. (21) and (22) represents a set of neurons for feature compression. For dimension reduction, the cell states are projected into the latent space to form the smoothed posterior distribution q_s(z_k^g), k = 1, \ldots, K^g, with a DNN. z_k^g is obtained from a Gaussian distribution characterized by the mean \tilde{\mu}(\tilde{h}_k^g) and the variance \tilde{\Lambda}(\tilde{h}_k^g).

q_s(z_k^g) = N\big(\tilde{\mu}(\tilde{h}_k^g), \tilde{\Lambda}(\tilde{h}_k^g)\big)    (23)

Fig. 3. Forward–backward RNN structure (pink nodes) for information accumulation and the smoothed posterior distribution DNN (blue nodes) for the construction of the smoothed posterior distribution q_s(z_k^g). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Hence, the information of the observations in the sequence is compressed into the latent variable z_k^g. It represents the extracted information from the observation sequence data x_{1:k}^g. To interpret the latent state transition process of SA-VSSAE, the smoothed transition posterior distribution q(z_k^g|z_{k-1}^g, x_k^g) is estimated by mapping the sample z_{k-1}^{g,s} from the smoothed posterior q_s(z_k^g) through a DNN network.

q_s(z_k^g|z_{k-1}^g, x_k^g) = N\big(\tilde{v}(z_{k-1}^{g,s}, \tilde{h}_{k-1}^g), \tilde{\Phi}(z_{k-1}^{g,s}, \tilde{h}_{k-1}^g)\big)    (24)

\tilde{v} and \tilde{\Phi}, which are defined according to the previous latent state z_{k-1}^{g,s} and the backward RNN cell state \tilde{h}_{k-1}^g, are the mean and the variance of the transition posterior distribution q(z_k^g|z_{k-1}^g, x_k^g). The process dynamics is authentically described and the auto-correlation in the features is extracted within the LV transition in a lower-dimensional latent space with lower noise contamination. In short, the smoothed posterior distribution q_s(z_k^g) has to be obtained first; then the smoothed transition posterior distribution q(z_k^g|z_{k-1}^g, x_k^g) is estimated.

Because a general sampler would not favor the backpropagation of the gradient, as the sampling procedure is non-differentiable, z_k^{g,s} is a reparameterized sampling point from the smoothed posterior distribution.

z_k^{g,s} = \tilde{\mu}(\tilde{h}_k^g) + \tilde{\Lambda}(\tilde{h}_k^g)^{-1/2}\, \varepsilon_s    (25)

where ε_s is sampled from the unit Gaussian distribution. Thus, each term in the variational lower bound of Eq. (20) can eventually be approximated by Eqs. (26)–(31).

E_{q(z_k^g|x_k^g)}\big[\ln p(x_k^g|z_k^g)\big] = \frac{1}{S}\sum_{s=1}^{S}\left[ -\frac{1}{2}\big(x_k - \mu_x(z_k^{g,s})\big)^{\top} \Sigma_x(z_k^{g,s})^{-1} \big(x_k - \mu_x(z_k^{g,s})\big) - \frac{1}{2}\ln\big|\Sigma_x(z_k^{g,s})\big| - \frac{M}{2}\ln(2\pi) \right]    (26)

E_{q(z_k^g|z_{k-1}^g,x_k^g)\,q(w_k^g|z_k^g)}\big[\ln p(z_k^g|z_{k-1}^g, w_k^g)\big] = -\sum_{w_k^g=1}^{W} q(w_k^g|z_k^g)\left[ \frac{n}{2}\ln(2\pi) + \frac{1}{2}\sum_{j=1}^{n}\left( \ln \Sigma_w(z_{k-1}^{g,s}, w_k^g)_j + \frac{\tilde{\Phi}_j(z_{k-1}^{g,s}, \tilde{h}_{k-1}^g)}{\Sigma_w(z_{k-1}^{g,s}, w_k^g)_j} + \frac{\big(\mu_w(z_{k-1}^{g,s}, w_k^g)_j - \tilde{v}_j(z_{k-1}^{g,s}, \tilde{h}_{k-1}^g)\big)^2}{\Sigma_w(z_{k-1}^{g,s}, w_k^g)_j} \right) \right]    (27)

E_{q(z_k^g|z_{k-1}^g,x_k^g)\,q(w_k^g|z_k^g)}\big[\ln p(w_k^g)\big] = \sum_{w_k^g=1}^{W} q(w_k^g|z_k^g)\,\ln \pi_{w_k}^g    (28)

E_{q(z_k^g|z_{k-1}^g,x_k^g)\,q(w_k^g|z_k^g)}\big[\ln q(z_k^g|z_{k-1}^g, x_k^g)\big] = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{j=1}^{n}\big(1 + \ln \tilde{\Phi}_j(z_{k-1}^{g,s}, \tilde{h}_{k-1}^g)\big)    (29)

E_{q(z_k^g|z_{k-1}^g,x_k^g)\,q(w_k^g|z_k^g)}\big[\ln q(w_k^g|z_k^g)\big] = \sum_{w_k^g=1}^{W} q(w_k^g|z_k^g)\,\ln q(w_k^g|z_k^g)    (30)

From the Bayes equation and the reparameterization trick, q(w_k^g|z_k^g) can be formulated as

q(w_k^g|z_k^g) = E_{q(z_k^g|z_{k-1}^g,x_k^g)\,q(w_k^g|z_k^g)}\big[p(w_k^g|z_k^g)\big] \equiv \frac{1}{S}\sum_{s=1}^{S} \frac{p(w_k^g)\, p(z_k^{g,s}|z_{k-1}^{g,s}, w_k^g)}{\sum_{w_k^{g\prime}=1}^{W} p(w_k^{g\prime})\, p(z_k^{g,s}|z_{k-1}^{g,s}, w_k^{g\prime})}    (31)
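The sketch below shows, in a deliberately simplified NumPy form, how the cell-state recursions of Eqs. (21)–(22), the smoothed posterior of Eq. (23), and a reparameterized latent draw fit together. The single dense layers stand in for the paper's RNN cells and DNNs, the backward cell is fed only the current observation rather than x_{1:k}, and the draw uses the usual mean plus square-root-of-variance times noise form (all implementation assumptions, not the authors' exact networks; cf. Eq. (25)).

```python
import numpy as np

rng = np.random.default_rng(2)
M, l, D = 4, 8, 2                                # observation, cell-state, latent dimensions
Wf = rng.standard_normal((l, l + M)) * 0.1       # forward cell weights (stand-in for Eq. (21))
Wb = rng.standard_normal((l, l + M)) * 0.1       # backward cell weights (stand-in for Eq. (22))
Wmu = rng.standard_normal((D, l)) * 0.1          # mean head of the smoothed posterior, Eq. (23)
Wlv = rng.standard_normal((D, l)) * 0.1          # log-variance head of the smoothed posterior

def forward_backward_states(X):
    K = len(X)
    h = np.zeros((K + 1, l))
    for k in range(K):                           # forward recursion, Eq. (21)
        h[k + 1] = np.tanh(Wf @ np.concatenate([h[k], X[k]]))
    h_tilde = np.zeros((K + 1, l))
    h_tilde[K] = h[K]                            # backward pass starts from the forward final state
    for k in range(K - 1, -1, -1):               # backward recursion, Eq. (22), simplified input
        h_tilde[k] = np.tanh(Wb @ np.concatenate([h_tilde[k + 1], X[k]]))
    return h_tilde

def sample_smoothed_posterior(h_tilde_k):
    mu = Wmu @ h_tilde_k                         # mean of q_s(z_k), Eq. (23)
    var = np.exp(Wlv @ h_tilde_k)                # variance of q_s(z_k), kept positive via exp
    eps = rng.standard_normal(D)
    return mu + np.sqrt(var) * eps, mu, var      # reparameterized draw (cf. Eq. (25))

X = rng.standard_normal((50, M))
h_tilde = forward_backward_states(X)
z_sample, mu, var = sample_smoothed_posterior(h_tilde[10])
print(z_sample)
```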
Eq. (20) represents one specific domain of data (target or source). When the target and source observation sequences are considered together, the objective function of SA-VSSAE in Eq. (8) is rewritten as the variational lower bound:

L_{SA-VSSAE} = (\alpha + 1)\Bigg[ E_{q(z_0^t)} \ln \frac{p(z_0^t)}{q(z_0^t)} + \sum_{k=1}^{K^t} E_{q(z_k^t|x_k^t)} \ln p(x_k^t|z_k^t) + \sum_{k=1}^{K^t} E_{q(z_k^t|z_{k-1}^t,x_k^t)\,q(w_k^t|z_k^t)} \ln \frac{p(z_k^t|z_{k-1}^t, w_k^t)}{q(z_k^t|z_{k-1}^t, x_k^t)} + \sum_{k=1}^{K^t} E_{q(z_k^t|z_{k-1}^t,x_k^t)\,q(w_k^t|z_k^t)} \ln \frac{p(w_k^t)}{q(w_k^t|z_k^t)} \Bigg]
+ \Bigg[ E_{q(z_0^s)} \ln \frac{p(z_0^s)}{q(z_0^s)} + \sum_{k=1}^{K^s} E_{q(z_k^s|x_k^s)} \ln p(x_k^s|z_k^s) + \sum_{k=1}^{K^s} E_{q(z_k^s|z_{k-1}^s,x_k^s)\,q(w_k^s|z_k^s)} \ln \frac{p(z_k^s|z_{k-1}^s, w_k^s)}{q(z_k^s|z_{k-1}^s, x_k^s)} + \sum_{k=1}^{K^s} E_{q(z_k^s|z_{k-1}^s,x_k^s)\,q(w_k^s|z_k^s)} \ln \frac{p(w_k^s)}{q(w_k^s|z_k^s)} \Bigg]    (32)

In Eq. (32), the coefficient of the first bracket of the variational lower bound is (α + 1) because the performance of the target model defined in Eq. (8) is emphasized. To show how the distributions in the variational lower bound are obtained, the constructed network for learning the SA-VSSAE model is shown in Fig. 4, while Fig. 2 is just a simplified version of the model structure. In Fig. 4, the neural network models for the target domain data are connected by black lines (g = t) while the neural network models for the source domain data are connected by blue lines (g = s). In general, to obtain the transition terms q(z_k^g|z_{k-1}^g, x_k^g) and p(z_k^g|z_{k-1}^g, w_k^g) and the categorical mapping q(w_k^g|z_k^g) in the variational lower bound, the RNN is first used to map and accumulate the past and current observations x_{1:k}^g into an RNN cell state h_k^g ∈ R^l, like the green part on the left side of Fig. 2 and the bottom of Fig. 4, by Eqs. (21)–(22). Then the smoothed posterior distribution DNN reduces the dimension of the observation data and those data are taken as the latent samples z_k^g by Eq. (23). Then, to favor the backpropagation of the gradient in the model network, the sampler is used to sample latent data from the smoothed posterior distribution (Eq. (25)). The state transitions q(z_k^g|z_{k-1}^g, x_k^g) and p(z_k^g|z_{k-1}^g, w_k^g) are interpreted by mapping the sample z_{k-1}^{g,s} to z_k^g through the transition posterior and prior distribution DNNs (the yellow part of Figs. 2 and 4) by Eqs. (24) and (14). Also, the categorical mapping q(w_k^g|z_k^g) can be obtained by mapping the sample z_k^{g,s} to w_k^g (the gray part of Figs. 2 and 4) by Eq. (31). Finally, the emission term p(x_k^g|z_k^g) can be obtained by the emission distribution DNN (the blue part of Figs. 2 and 4) by Eq. (15). Note that all the distributions are assumed to be Gaussian except the categorical distribution q(w_k^g|z_k^g). The Gaussian distribution networks are designed to output the means and the variances of the corresponding distributions. This is why SA-VSSAE is considered a probabilistic model. It not only considers the deterministic mapping value, which is the mean of the distributions, but also the variation or uncertainty of the mapping represented by the variance of the distributions. Hence, SA-VSSAE provides a range for the output value in the form of the mean and variance of a Gaussian distribution rather than a deterministic output.

SA-VSSAE can also be easily extended to multiple source data sets. The variational lower bound of the multi-source SA-VSSAE can be expressed as

L_{SA-VSSAE} = (\alpha + 1)\Bigg[ E_{q(z_0^t)} \ln \frac{p(z_0^t)}{q(z_0^t)} + \sum_{k=1}^{K^t} E_{q(z_k^t|x_k^t)} \ln p(x_k^t|z_k^t) + \sum_{k=1}^{K^t} E_{q(z_k^t|z_{k-1}^t,x_k^t)\,q(w_k^t|z_k^t)} \ln \frac{p(z_k^t|z_{k-1}^t, w_k^t)}{q(z_k^t|z_{k-1}^t, x_k^t)} + \sum_{k=1}^{K^t} E_{q(z_k^t|z_{k-1}^t,x_k^t)\,q(w_k^t|z_k^t)} \ln \frac{p(w_k^t)}{q(w_k^t|z_k^t)} \Bigg]
+ \sum_{s=1}^{S_{max}} \Bigg[ E_{q(z_0^s)} \ln \frac{p(z_0^s)}{q(z_0^s)} + \sum_{k=1}^{K^s} E_{q(z_k^s|x_k^s)} \ln p(x_k^s|z_k^s) + \sum_{k=1}^{K^s} E_{q(z_k^s|z_{k-1}^s,x_k^s)\,q(w_k^s|z_k^s)} \ln \frac{p(z_k^s|z_{k-1}^s, w_k^s)}{q(z_k^s|z_{k-1}^s, x_k^s)} + \sum_{k=1}^{K^s} E_{q(z_k^s|z_{k-1}^s,x_k^s)\,q(w_k^s|z_k^s)} \ln \frac{p(w_k^s)}{q(w_k^s|z_k^s)} \Bigg]    (33)

where S_max denotes the total number of source domains. The summary of learning the SA-VSSAE is provided in Algorithm 1.

4. Process monitoring

Like most latent-variable models performing process monitoring, SA-VSSAE monitors the process not only in the residual space but also in the latent space. The KL divergence, defined in Eq. (34), between the smoothed transition posterior distribution described in Eq. (24), q(z_k|z_{k-1}, x_k) = N(\tilde{v}(z_{k-1}, \tilde{h}_{k-1}), \tilde{\Phi}(z_{k-1}, \tilde{h}_{k-1})), and the transition prior distribution described in Eq. (14), p(z_k|z_{k-1}, w_k) = N(\mu_w(z_{k-1}, w_k), \Sigma_w(z_{k-1}, w_k)), measures the similarity of these two distributions. Both q(z_k|z_{k-1}, x_k) and p(z_k|z_{k-1}, w_k) are the transition mapping results of the normal observation data x_k to the latent space. If a sample is abnormal, the KL divergence value would exceed the control limit constructed from the normal latent distribution because the abnormal transition posterior would be dissimilar to the normal transition prior.

D_k = E_{q(w_k|z_k)}\, KL\big[q(z_k|z_{k-1}, x_k) \,\|\, p(z_k|z_{k-1}, w_k)\big]
= \sum_{w_k=1}^{W} q(w_k|z_k)\,\frac{1}{2}\left( \ln \Sigma_w(z_{k-1}, w_k) + \frac{\tilde{\Phi}(z_{k-1}, \tilde{h}_{k-1})}{\Sigma_w(z_{k-1}, w_k)} + \frac{\big(\tilde{v}(z_{k-1}, \tilde{h}_{k-1}) - \mu_w(z_{k-1}, w_k)\big)^2}{\Sigma_w(z_{k-1}, w_k)} \right) - \frac{1}{2}\big(1 + \ln \tilde{\Phi}(z_{k-1}, \tilde{h}_{k-1})\big)    (34)

The superscript of the sample is ignored because the new sample, which may or may not belong to the target, is being monitored.
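As a reference sketch of the latent-space index in Eq. (34), the function below computes the expected KL divergence between two diagonal Gaussians, weighted by the categorical posterior q(w_k|z_k). It uses the standard closed-form diagonal-Gaussian KL, which is what Eq. (34) expands to; the toy numbers at the end are placeholders.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    # KL[ N(mu_q, var_q) || N(mu_p, var_p) ] for diagonal covariances, summed over dimensions.
    return 0.5 * np.sum(np.log(var_p / var_q) + var_q / var_p
                        + (mu_q - mu_p) ** 2 / var_p - 1.0)

def latent_index_D(q_w, mu_post, var_post, mu_prior, var_prior):
    """Monitoring index D_k of Eq. (34).

    q_w                : mixture responsibilities q(w_k | z_k), shape (W,)
    mu_post/var_post   : mean/variance of the transition posterior q(z_k | z_{k-1}, x_k)
    mu_prior/var_prior : per-component mean/variance of the transition prior
                         p(z_k | z_{k-1}, w_k), shapes (W, D)
    """
    kls = np.array([kl_diag_gaussians(mu_post, var_post, mu_prior[w], var_prior[w])
                    for w in range(len(q_w))])
    return float(np.dot(q_w, kls))

# Toy usage with W = 2 mixture components and D = 2 latent dimensions.
D_k = latent_index_D(q_w=np.array([0.7, 0.3]),
                     mu_post=np.array([0.1, -0.2]), var_post=np.array([0.05, 0.04]),
                     mu_prior=np.array([[0.0, 0.0], [1.0, 1.0]]),
                     var_prior=np.array([[0.06, 0.05], [0.06, 0.05]]))
print(D_k)
```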
Fig. 4. SA-VSSAE network training structure. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Additionally, with the Gaussian characteristic of the emission distribution p(x_k|z_k) = N(g(z_k), \Sigma_x(z_k)), the monitoring index in the residual space is obtained by calculating the conditional log-likelihood:

R_k = E_{q(z_k|x_k)}[\ln p(x_k|z_k)] = \frac{1}{S}\sum_{s=1}^{S} \left[ -\frac{1}{2}\big(x_k - g(z_k)\big)^{\top} \Sigma_x(z_k)^{-1} \big(x_k - g(z_k)\big) - \frac{1}{2}\ln\big|\Sigma_x(z_k)\big| \right]    (35)

It measures the similarity between the reconstructed observation samples and the original observation samples. If a reconstructed observation sample is dissimilar to its true observation x_k, R_k would take a larger value and the residual in the observation space would be large. In short, if the value of D_k exceeds the control limit of the latent space or R_k exceeds the control limit of the residual space, x_k is expected to be an abnormal point. Note that the proposed model is designed for fault detection instead of fault diagnosis. SA-VSSAE is trained with the normal data. It extracts the important features of the process at its normal operating condition. In this case, the process monitoring indices are designed based on the normal situation. There is no data imbalance scenario between the normal and fault data. The entire monitoring procedure is summarized as follows:

1. Off-line (Modeling Stage):
   a. For the normal data, standardize each sample with the sample means and variances and generate q(w_k^t|z_k^t), q_s(z_k^t|z_{k-1}^t, x_k^t), p(z_k^t|z_{k-1}^t, w_k^t), and p(x_k^t|z_k^t) for k = 1, 2, ..., N.
   b. Iteratively calculate the monitoring indices by Eqs. (34) and (35) with the learned initial cell states h_0 and the network parameters. The control limits of the latent and residual spaces can be obtained from the estimated kernel density function (KDE) (Turlach, 1993) for D and R.
2. On-line (Implementation Stage): Verify whether the new sample is normal.

5. Numerical case study

To show the monitoring performance of the proposed SA-VSSAE, a nonlinear dynamic system with 4 observation variables in 2 grades ({x_k^{v_x g}}, v_x = 1, ..., 4, g = 1, 2) is considered in this example. The first superscript v_x represents the index of the observed variable and the second superscript g represents the domain the observation samples come from. Between the two grades, grade 1 is chosen as the target domain and grade 2 is set as the source domain. The numbers of sequence samples in the source and target domains are both 500. The latent sequences ({z_k^{v_z g}}, v_z = 1, 2, g = 1, 2) from the target and the source that generate the observation samples follow different dynamic behaviors, given by Eqs. (36) and (38), respectively. The first superscript represents the latent variable number and the second superscript still represents the domain the latent variables come from. The generation of the observation samples from the 2-dimensional latent variables for the target and the source domains follows Eqs. (37) and (39), respectively.

Grade 1 (Target, g = 1):

z_k^{11} = \cos^3(z_{k-1}^{11}) + 0.1\,e^{\sin(z_{k-1}^{11})} + \varepsilon_k^1
z_k^{21} = \sin\big(e^{z_{k-1}^{21}} + (z_{k-1}^{21})^2\big) + \varepsilon_k^2    (36)

x_k^{11} = 0.1 z_k^{11} + z_k^{11}\big/\sqrt{(z_k^{11})^2 + (z_k^{21})^2} + \varpi_k^1
x_k^{21} = 0.1 z_k^{11} z_k^{21} + z_k^{21}\big/\sqrt{(z_k^{11})^2 + (z_k^{21})^2} + \varpi_k^2
x_k^{31} = \cos^3(z_k^{11}) + 0.1\,e^{\sin(z_k^{21})} + \varpi_k^3
x_k^{41} = \sin^3(z_k^{11}) + \ln\big(2 + \cos(z_k^{21})\big) + \varpi_k^4    (37)

Grade 2 (Source, g = 2):

z_k^{12} = \cos^3(z_{k-1}^{12}) + 0.1\,e^{\sin(z_{k-1}^{12})} - 1 + \varepsilon_k^1
z_k^{22} = \sin\big(e^{z_{k-1}^{22}} + (z_{k-1}^{22})^2 + 1\big) + \varepsilon_k^2    (38)

x_k^{12} = 0.1 z_k^{12} + z_k^{12}\big/\sqrt{(z_k^{12})^2 + (z_k^{22})^2} + \varpi_k^1
x_k^{22} = 0.1 z_k^{12} z_k^{22} + z_k^{22}\big/\sqrt{(z_k^{12})^2 + (z_k^{22})^2} + \varpi_k^2
x_k^{32} = \cos^3(z_k^{12}) + 0.1\,e^{\sin(z_k^{22})} + \varpi_k^3
x_k^{42} = \sin^3(z_k^{12}) + \ln\big(2 + \cos(z_k^{22})\big) + \varpi_k^4    (39)

\varepsilon_k^{v_z}, v_z = 1, 2 represent the process noises, which follow zero-mean Gaussian distributions with a standard deviation of 0.01. Meanwhile, \varpi_k^{v_x} represents the measurement noises, which follow zero-mean Gaussian distributions with standard deviations of 0.05, 0.16, 0.02, and 0.05 for v_x = 1, 2, 3, 4.

The true observation samples and the observation samples with measurement noises in the target and source sequences are shown in Fig. 5. The left plot shows the samples of the first three observation variables (x_k^{1g}, x_k^{2g}, x_k^{3g}) and the right plot shows the samples of the last three observation variables (x_k^{2g}, x_k^{3g}, x_k^{4g}). The red star markers represent the target samples with process noise and the blue circle markers represent the source samples with process noise. The maroon star markers represent the true target samples and the green circle markers represent the true source samples.

To construct SA-VSSAE, the first 400 samples in the observation sequences for the target and the source, respectively, are used as the training set. Then the following 100 samples for the target and the source, respectively, are used as the testing set. The networks in Fig. 4 are designed with the following configuration:

• The forward–backward RNN has only 1 layer with 40 neurons.
• All the DNNs consist of 3 layers and each hidden layer has 20 neurons. They are fully connected. The hyperbolic tangent is the activation function for every layer.

The model is optimized by the Adam optimizer with a learning rate of 0.01 in TensorFlow. The dimension of the latent variable in the model is chosen to be 2. Cross-validation and early stopping (Prechelt, 1998) are used to prevent model overfitting, and the training loss saturates after 60 iterations.
Fig. 5. The dispersion of the observation samples in the target and source sequences, (left) [x_k^{1g}, x_k^{2g}, x_k^{3g}] and (right) [x_k^{2g}, x_k^{3g}, x_k^{4g}]. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The characteristic of a well-trained, noise-independent model is that the reconstructed target samples and the noise-free true target samples are as similar to each other as possible. To visualize the signal reconstruction ability of SA-VSSAE for the tested noisy samples, the scatter plots of the reconstructed observation samples are shown in Fig. 6. The left figure in Fig. 6(g) shows the observation samples of the first three observation variables. The right figure in Fig. 6(g) shows the observation samples of the last three observation variables. The blue circle markers represent the observed noise-contaminated target samples, the green circle markers represent the noise-free true target samples, and the red star markers represent the reconstructed target samples. Compared with the noise-contaminated samples (blue circles), the reconstructed samples (red stars) are closer to the true samples (green circles). This shows that SA-VSSAE can effectively reject the effect of measurement noise and perform well in terms of reconstruction.

To show the advantages of SA-VSSAE, it is compared with four different models, including the constructive polynomial mapping kernel PCA (CPM-KPCA) model, the dynamic kernel PCA (DKPCA) model, the variational autoencoder (VAE), and the VAE with long short-term memory (VAE-LSTM). CPM-KPCA and DKPCA are representative models from the kernel family and they are shallow. VAE is a representative of deep models, which can handle more complex systems, but it cannot describe the autocorrelations in the system. VAE-LSTM is a deep model which can extract dynamic and nonlinear features. Nevertheless, it describes the process dynamics in the noise-contaminated observation space. In addition to the past developed models, several solutions to the problem of this paper are also compared with the proposed SA-VSSAE. They are the VSSAE model complemented by the target and source data (COM-VSSAE) and the VSSAE model built only with the target data (VSSAE). Fig. 6(a, b, c, d, e, f, g) shows the reconstructed observation results of CPM-KPCA, DKPCA, VAE, VAE-LSTM, VSSAE, COM-VSSAE, and the proposed SA-VSSAE, respectively. The plots on the left side show the reconstructed observation results in the dimensions x1, x2, and x3. The plots on the right side show the reconstructed observation results in the dimensions x2, x3, and x4. In the plots, the blue circles represent the noise-contaminated target observations, the green circles represent the noise-free target observations, and the red circles represent the reconstructed target samples.

In Fig. 6(g), the proposed SA-VSSAE can reconstruct the observations (red) near the noise-free, true observations (green) rather than the noisy observations (blue). Moreover, in Fig. 6(a), one can see that the reconstructed target samples of CPM-KPCA in the observation space overfit the noise. (The red points are closer to the blue points than to the green points.) The polynomial construction is too poor to extract the real dynamic behavior of the complex nonlinear dynamic system. Furthermore, the reconstruction result of the DKPCA model in Fig. 6(b) shows that the reconstructed observation sequence is unlike the true observation sequences. The DKPCA model is shallow. It cannot describe such a highly nonlinear system and the model accuracy can be highly affected by imperfection. In Fig. 6(c), the VAE model cannot accurately catch the dynamics of the system, as its reconstruction trend does not follow the true observations. With a model structure that can extract the nonlinear and dynamic characteristics of the process, VAE-LSTM can reconstruct better than CPM-KPCA, DKPCA, and VAE (Fig. 6(d)). However, because it can only express the process dynamics of the noisy observation samples, its reconstruction result overfits the noisy samples instead of the noise-free samples, as the red points are close to the blue points. Then, for the variations of the solution to the problem, the reconstructed target samples of VSSAE are shown in Fig. 6(e). The reconstruction result of VSSAE is poorer than that of SA-VSSAE (Fig. 6(g)), as the reconstructed samples (red stars) slightly deviate from the noise-free, true observation samples (green circles) in the right sub-figure of Fig. 6(e). This is mainly because the target model constructed with limited samples is not good enough without the association information of the source data. Next, COM-VSSAE forces both the target and source data to map into a unit Gaussian prior in the latent space, so COM-VSSAE can only extract the common information between the target and the source rather than the unique information of the target. In Fig. 6(f), COM-VSSAE has a poorer reconstruction result than SA-VSSAE (Fig. 6(g)).

Fig. 6. The reconstructed observation samples of (a) CPM-KPCA, (b) DKPCA, (c) VAE, (d) VAE-LSTM, (e) VSSAE, (f) COM-VSSAE, and (g) SA-VSSAE in the observation space. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fault 1, g = f1:

z_k^{1f_1} = 2\cos^3(z_{k-1}^{1f_1}) + 0.1\,e^{\sin(z_{k-1}^{1f_1})} + \varepsilon_k^1
z_k^{2f_1} = \sin\big(e^{z_{k-1}^{2f_1}} + (z_{k-1}^{2f_1})^2\big) + \varepsilon_k^2    (40)

x_k^{1f_1} = z_k^{1f_1} + z_k^{1f_1}\big/\sqrt{(z_k^{1f_1})^2 + (z_k^{2f_1})^2} + \varpi_k^1
x_k^{2f_1} = 0.1 z_k^{1f_1} z_k^{2f_1} + z_k^{2f_1}\big/\sqrt{(z_k^{1f_1})^2 + (z_k^{2f_1})^2} + \varpi_k^2
x_k^{3f_1} = \cos^3(z_k^{1f_1}) + 0.1\,e^{\sin(z_k^{2f_1}) + 1} + \varpi_k^3
x_k^{4f_1} = \sin^3(z_k^{1f_1}) + \ln\big(2 + \cos(z_k^{2f_1})\big) + \varpi_k^4    (41)

Fault 2, g = f2:

z_k^{1f_2} = \cos^3(z_{k-1}^{1f_2}) + 0.1\,e^{\sin(z_{k-1}^{1f_2})} + \varepsilon_k^1
z_k^{2f_2} = 2\sin\big(e^{z_{k-1}^{2f_2}} + (z_{k-1}^{2f_2})^2\big) + \varepsilon_k^2    (42)

x_k^{1f_2} = 0.1 z_k^{1f_2} + z_k^{1f_2}\big/\sqrt{(z_k^{1f_2})^2 + (z_k^{2f_2})^2} + 0.8 + \varpi_k^1
x_k^{2f_2} = 0.1 z_k^{1f_2} z_k^{2f_2} + z_k^{2f_2}\big/\sqrt{(z_k^{1f_2})^2 + (z_k^{2f_2})^2} + \varpi_k^2
x_k^{3f_2} = \cos^3(z_k^{1f_2}) + 0.1\,e^{\sin(z_k^{2f_2})} + \varpi_k^3
x_k^{4f_2} = \sin^3(z_k^{1f_2}) + \ln\big(2 + \cos(z_k^{2f_2})\big) + \varpi_k^4    (43)

To show the effectiveness of the proposed SA-VSSAE in process monitoring, Fault 1 and Fault 2 are generated based on Eqs. (40)–(43) to test the models. The differences between the target and the faults are marked in red. The faults are tested with all the comparison methods. In Table 1, the false alarm rates (FARs) of the models indicate each model's reliability in detecting faults. The FAR is the probability of indicating an anomaly when no anomaly occurs. With a 95% confidence level, a reliable model must have its FAR close to 5%. One can see in Table 1 that SA-VSSAE is the most reliable process monitoring method because it has FAR values closest to 5% in both the latent and residual spaces. Because of the shallowness of DKPCA and CPM-DPCA, the lack of a dynamic feature extraction structure in VAE, the overfitting problem of VAE-LSTM, the insufficient information provided to VSSAE, and the forced Gaussian latent distribution of COM-VSSAE, those comparative models are not as reliable as the proposed SA-VSSAE. Table 2 shows the effectiveness of the proposed SA-VSSAE and the comparative methods in detecting Fault 1, Fault 2, and the samples from the source grade (SG) in terms of fault detection rates (FDRs). Note that Fault 1 and Fault 2 consist of discrepancies in both the latent and observation variables of the target grade. All the comparative models should detect the faults in the latent and residual spaces. It is found that VAE, VAE-LSTM, and COM-VSSAE cannot perfectly detect Fault 1 in the latent space. VAE and VAE-LSTM are insensitive to Fault 1 in the latent space because their model structures cannot extract the true process dynamics in the noise-free latent space. Also, COM-VSSAE cannot detect Fault 1 because the forced Gaussian latent distribution in COM-VSSAE is unable to efficiently differentiate Fault 1 from the target data. Next, the comparative models are compared using Fault 2. Only DKPCA and SA-VSSAE have a 100% fault detection rate in the latent and residual spaces. CPM-DPCA, VAE, VAE-LSTM, VSSAE, and COM-VSSAE are not sensitive to Fault 2, especially in the latent space, because they are either shallow or unreliable in terms of model structure and insufficient samples. Although the proposed SA-VSSAE is constructed with the aid of the source grade, it can separate the source grade in the latent space with the embedded Gaussian mixture model. Thus, SA-VSSAE can perfectly detect the source grade while DKPCA, CPM-DPCA, VAE, VAE-LSTM, and COM-VSSAE are insensitive to the source grade. VSSAE is perfect for detecting the source grade because the model considers the auto-correlation of the process in the lower-dimensional space and it extracts only the features of the normal target data. Overall, VSSAE is mediocre because its FAR is far from 5% and it cannot perfectly detect Fault 2 in its latent space. In summary, the proposed SA-VSSAE is the best among the comparative methods as it can perfectly detect faults and has the false alarm rate nearest to the expected rate.

A missing detection denotes that faulty data are misclassified as faultless data. The missing detection rate (MDR) is the number of faulty data points that do not exceed the control limits (missing detections) over the total number of faulty data. Thus, the MDRs in the latent and residual spaces are summed up to provide a single value index to show the performance of the models. The summed MDRs should be as low as possible to prove that a particular model is representative of a process. Table 3 shows the MDR indices of the proposed SA-VSSAE and the comparative models with respect to the tested faults.
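For completeness, the sketch below shows one way to turn monitoring-index values into FAR, FDR, and MDR figures like those in Tables 1–3 for an index that increases under faults (such as D): a kernel density estimate of the normal-data index gives a 95% control limit (in the spirit of the KDE-based limit mentioned in Section 4), and the rates are simple exceedance counts. The use of scipy's gaussian_kde and a grid-based quantile is an implementation choice, not the authors' exact procedure, and the index values here are synthetic placeholders.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_control_limit(normal_index, confidence=0.95):
    """Control limit of a monitoring index estimated from normal data via KDE."""
    kde = gaussian_kde(normal_index)
    grid = np.linspace(normal_index.min(), normal_index.max() + 3 * normal_index.std(), 2000)
    cdf = np.cumsum(kde(grid))
    cdf /= cdf[-1]
    return grid[np.searchsorted(cdf, confidence)]

def far_fdr_mdr(normal_index, fault_index, limit):
    far = np.mean(normal_index > limit) * 100      # false alarms on normal data (%)
    fdr = np.mean(fault_index > limit) * 100       # detected faulty samples (%)
    mdr = 100.0 - fdr                              # missed detections (%)
    return far, fdr, mdr

# Toy usage with synthetic index values.
rng = np.random.default_rng(4)
normal_D = rng.chisquare(df=2, size=400)           # D index on normal data
fault_D = rng.chisquare(df=2, size=100) + 6.0      # D index on faulty data (shifted)
limit = kde_control_limit(normal_D)
print(far_fdr_mdr(normal_D, fault_D, limit))
```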
Table 1
The D and R false alarm rates (FARs, %) of the normal samples based on the proposed SA-VSSAE and the comparative methods in the numerical case.
Indices   CPM-DPCA   DKPCA   VAE     VAE-LSTM   VSSAE   COM-VSSAE   SA-VSSAE
D         0          0       1.00    12.36      1.00    11.00       7.00
R         1.00       4.17    0       4.49       12.00   6.00        4.00

Table 2
The D and R fault detection rates (FDRs, %) of the tested faults based on the proposed SA-VSSAE and the comparative methods in the numerical case.
Methods   CPM-KPCA    DKPCA       VAE         VAE-LSTM    VSSAE       COM-VSSAE   SA-VSSAE
Indices   D     R     D     R     D     R     D     R     D     R     D     R     D     R
Fault 1   100   100   100   100   50    100   84    100   100   100   94    100   100   100
Fault 2   35    49    100   100   1.6   100   97    100   74    100   94    100   100   100
SG        66    45    54    100   99    100   100   100   100   100   0     13    100   100

Table 3
The combined missing detection rates (MDRs, %) of the fault samples based on the proposed SA-VSSAE and the comparative methods in the numerical case.
Methods   CPM-KPCA   DKPCA   VAE   VAE-LSTM   VSSAE   COM-VSSAE   SA-VSSAE
Fault 1   0          0       25    7          0       3           0
Fault 2   58         0       49    1          13      3           0
SG        43         23      1     0          0       94          0

In Table 3, the summed MDR index for the proposed SA-VSSAE model is significantly smaller than those of the other comparative models. This means that the SA-VSSAE model is the best in terms of missing detection among the comparative models.

6. Industrial case study

In our daily life, basic necessities like food, clothing, accommodation, and transportation are highly tied to plastic materials. In particular, there are many products made from polyvinyl chloride (PVC), such as food containers, sportswear, building materials, vehicle covers, etc. The manufacturers of these products usually get the PVC in the form of powder from their raw material suppliers. In the PVC powder synthesis process, after the polymerization step, the PVC slurry, which contains only water and PVC particles, is obtained. Then most of the water in the slurry can be removed through a centrifugal step. Hence, a wet porous powder, called cake, can be produced. Subsequently, the cake is sent into a fluidized bed dryer to further eliminate the residual cake humidity so that the moisture content of the PVC powder is below a specific limit before the powder is stored or sent to the customers. This drying process consumes almost 800 kJ per day. Consequently, it is important to monitor the drying procedure in the PVC synthesis process to keep the energy usage at its lowest level. The flowchart of the drying process is shown in Fig. 7. The drying process consists of 4 major units: a centrifuge, a fluidized bed dryer (FBD), a bag filter, and a mesh unit. To be specific, the FBD is the core device in this drying procedure. It consists of two parts: the backflow mixing (left) and the plug flow mixing (right). The wet PVC slurry (green piping) is fed into the drying system through the centrifuge and the dried PVC powder is discharged from the drying system through the mesh unit. The core mechanism of the drying process involves energy transfer from the supplied hot water (blue piping) and the hot air (pink piping) to evaporate the residual water in the pores of the PVC particles. The moisture in the pores of the PVC particles leaves the drying system with the hot air. Table 4 lists the measured variables in the drying process.

Because of the polymerization procedure upstream of the PVC drying process, different grades of PVC powder are produced and the material grades at the inlet of the centrifuge are different. To keep the drying process at its lowest energy consumption, process monitoring has to be done by studying the dynamic behavior of the drying process according to each grade. However, the data collected in each grade are insufficient as the grades are changed every 10 h. Thus, in this study, the drying process of one grade is selected as the target process and the other grade process is set as the source process.

In the target and source datasets, 600 sequence samples from each of the datasets with the 14 process variables in Table 4 are collected from the DCS database. In the 600 sequence samples of the target and source, the initial 540 samples are set as the training sequence samples and the remaining 60 samples are set as the testing sequence samples. Thus, there are a total of 1080 training samples and 120 testing samples for process modeling. Certainly, before process modeling, the sequence data undergo data pre-treatment to remove the outliers. Then the pre-treated data are sent into the proposed model structure for model training. Two latent variables are chosen by trial-and-error for the highest lower bound value. The structure of the forward–backward RNN has only 1 layer with 45 neurons. The DNNs in the proposed model have 3 layers and each hidden layer has 30 neurons. The layers are fully connected and the activation function used in each layer is the hyperbolic tangent. In addition, a learning rate of 0.01 is used to optimize the model with the Adam optimizer in TensorFlow. To prevent model overfitting, cross-validation is done; the training loss converges after 20 iterations and the computational time is 125 s.

The six models, including CPM-DPCA, DKPCA, VAE, VAE-LSTM, VSSAE, and COM-VSSAE, are compared with the proposed SA-VSSAE. The latent dimension structures of VAE, VAE-LSTM, VSSAE, and COM-VSSAE are set to be the same as those in the SA-VSSAE model for a fair comparison. CPM-DPCA uses the 2nd-order polynomial kernel and its mapping number is set to be 3 with 15 principal components in this industrial case. Also, for DKPCA, the sigmoid function with β0 = 1 and β1 = 0.0005 is selected as the kernel type.

During the drying process of the PVC dryer, the data from two source grades are used to aid the target grade process model. Yet, after feature sharing, the source data should be isolated from the target data. Thus, not only the fault data (F1), which are seldom recorded and stored, but also the aiding source data (SG1 and SG2) of the target model are used to test the effectiveness of the proposed SA-VSSAE model in terms of process monitoring. The corresponding faults are then defined as follows:
Fig. 7. Flowchart of the PVC drying process. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 4
Descriptions of the selected 14 important process variables.
Variables Description
V1 Input hot air temperature (backflow mixing)
V2 Fluidized dryer temperature (backflow mixing)
V3 Fluidized dryer temperature (plug flow mixing)
V4 Percentage of the input hot air steam valve opening (backflow mixing)
V5 Input hot air flow rate (plug flow mixing)
V6 Input hot air pump output signal (backflow mixing)
V7 Input hot air pressure (backflow mixing)
V8 Input hot air flow rate (backflow mixing)
V9 Percentage of the input hot air steam valve opening (plug flow mixing)
V10 Input hot air valve opening (plug flow mixing)
V11 Input hot water temperature
V12 Input hot water pressure
V13 Electric current of centrifuge
V14 Slurry input flow rate

• F1: The bias is caused by the change of the controller.
• F2: The data sequence comes from the first source grade.
• F3: The data sequence comes from the second source grade.

The FAR results of the proposed SA-VSSAE model and the comparative models are listed in Table 5. With a 95% confidence level, the FARs of SA-VSSAE in the latent (D) and residual (R) spaces are both close to 5%, whereas the FARs of the other models deviate much further from 5%. Hence, the proposed SA-VSSAE model is much better than the other comparative methods in terms of model reliability. Likewise, the FDRs in the latent space (D) and the residual space (R) for the fault and source grade sequences are given in Table 6. F1 is a fault resulting from the change of the controller, so a large bias is contained in the collected data sequence; not only the residual space but also the latent space should be able to detect this kind of fault. From the fault detection results in Table 6, although CPM-DPCA and VAE-LSTM have higher FDR values in both the latent and residual spaces than the proposed SA-VSSAE model, they are not reliable models because their FARs are far from 5%. In addition, as the proposed SA-VSSAE model can extract the unique features of each source grade through the GMM embedded in its latent space, SA-VSSAE can detect the assisting source grade data even though it extracts the common features of the source and target grades to enhance the target grade model. Thus, in view of the fault detection results of the models in detecting the first (SG1) and the second (SG2) source sequences, the proposed SA-VSSAE model outperforms the other comparative models because it has not only eligible FARs but also high fault detection rates (FDRs) in the latent and residual spaces. Table 7 shows the MDRs of the fault samples in this industrial case.
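As an illustration of how the rates reported in Tables 5–7 can be obtained from a monitoring index and its 95% control limit, a minimal sketch is given below. The index values, the control-limit rule, and the variable names are placeholders, not the paper's actual monitoring statistics.

```python
import numpy as np

def monitoring_rates(index_normal, index_test, control_limit):
    # FAR: percentage of normal samples whose index exceeds the control limit (false alarms)
    # FDR: percentage of test (faulty) samples whose index exceeds the limit (detections)
    # MDR: percentage of test (faulty) samples whose index stays below the limit (misses)
    far = 100.0 * np.mean(index_normal > control_limit)
    fdr = 100.0 * np.mean(index_test > control_limit)
    mdr = 100.0 - fdr
    return far, fdr, mdr

rng = np.random.default_rng(0)
d_normal = rng.chisquare(df=5, size=300)   # placeholder D index on normal target data
d_fault = d_normal + 8.0                   # placeholder D index on a biased (F1-like) sequence
d_limit = np.percentile(d_normal, 95)      # 95% control limit estimated from normal data
print(monitoring_rates(d_normal, d_fault, d_limit))
```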
Table 5
The D and R false alarm rates (FARs, %) of the normal target samples based on the proposed SA-VSSAE and the
comparative methods in the industrial case.
Indices DKPCA CPM-DPCA VAE VAE-LSTM VSSAE COM-VSSAE SA-VSSAE
D 10.00 48.28 53.33 31.22 30.00 5.00 3.00
R 5.00 75.86 5.00 16.33 43.33 38.33 5.00

Table 6
The D and R fault detection rates (FDRs, %) of the corresponding faults based on the proposed SA-VSSAE and the comparative methods in the industrial case.
Methods DKPCA CPM-DPCA VAE VAE-LSTM VSSAE COM-VSSAE SA-VSSAE
Indices D R D R D R D R D R D R D R
F1 1.67 100 100 100 1.67 100 100 100 15 100 0 64.67 86.67 100
SG1 0 100 100 100 0 100 100 100 0 100 0 0 50.00 100
SG2 0 100 100 100 0 100 100 100 0 100 0 0 73.33 100

Table 7
The summed missed detection rates (MDRs, %) of the corresponding faults based on the proposed SA-VSSAE and the comparative
methods in the industrial case.
Methods DKPCA CPM-DPCA VAE VAE-LSTM VSSAE COM-VSSAE SA-VSSAE
F1 49.17 0 49.17 0 42.50 69.17 6.67
SG1 50 0 50 0 50 100 25
SG2 50 0 50 0 50 100 13.33

Although CPM-DPCA and VAE-LSTM have the lowest summed MDRs, their reliability is questionable because of their high FARs. This means that normal sequence samples have a high probability of being determined as faulty. As a brief conclusion of the process monitoring results, the proposed SA-VSSAE is the best at detecting all kinds of faults: it outperforms the other comparative models because it has not only eligible FARs but also high fault detection rates (FDRs) and a low summed missed detection rate (MDR) in the latent and residual spaces.

7. Conclusions

In this paper, a novel process monitoring model called SA-VSSAE is proposed based on the concept of information sharing. The proposed SA-VSSAE is a source-aided probabilistic model with dynamic latent state variable inference in the latent space of the model structure. It can effectively extract the important features of the target grade with the help of the source grades by introducing the Gaussian mixture model into its latent space structure. The following points enumerate the merits of the proposed SA-VSSAE:

• Compared with the GMPVAE, the SA-VSSAE can not only effectively extract the common information from the source and share it with the target grade through the DNNs, but also describe the dynamic behavior of a process through the contribution of the RNN and the state–space model.
• With the Gaussian mixture model embedded in the latent space of SA-VSSAE, the unique information of the target and source grades can be extracted simultaneously. This facilitates SA-VSSAE in enhancing the target grade from the assisting source grades after common feature extraction.
• With the information sharing-based and weighting-based design of the objective function of the proposed SA-VSSAE, the model can not only overcome the circumstances of limited data but also ensure that the model structure focuses on the target grade during training for process modeling and process monitoring.

Subsequently, in both the numerical case and the industrial case of the PVC drying process, the proposed method outperforms the other comparative models. However, some problems are worth studying by extending the proposed method, as follows:

• The proposed SA-VSSAE focuses on simultaneously extracting the common and unique features for target grade enhancement. In fact, the information in the assisting source grade data is a key to the reliability of the constructed model. Thus, the selection of the source grade data may be critical to the level of target grade enhancement.
• As the proposed SA-VSSAE is used only for unsupervised learning, it does not take into account the relationship between the process and quality variables. The extracted features may not be useful in reflecting the quality condition of the process system.
• Although SA-VSSAE uses the source grade data to aid the target grade model as a solution to the insufficient data problem, the total number of data must still reach a certain level for deep model training. If extremely few data are available, SA-VSSAE may not perform well.
• In real operating processes, the quality data are not always measured due to equipment limitations and laboratory cost savings. SA-VSSAE may not be applicable to such a case as it does not consider the problem of missing data.

Thus, in the future, SA-VSSAE can be developed using supervised and semi-supervised learning so that the features relating the process and quality variables can be extracted and the problem of missing quality data can be solved. Also, few-shot learning has become prominent recently; it can be integrated with the proposed method to facilitate SA-VSSAE in environments with extremely few data.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to gratefully acknowledge the Ministry of Science and Technology, Taiwan, R.O.C. (MOST 109-2221-E-033-013-MY3; MOST 110-2221-E-007-014). They would also like to thank China General Plastic Corporation, Taiwan, which provided the industrial data for the case study.
References

Jaeckle, C., & MacGregor, J. (2000). Product transfer between plants using historical process data. AIChE Journal, 46(10), 1989–1997. http://dx.doi.org/10.1002/aic.690461011.
Jaeckle, C., MacGregor, J., & MacGregor, J. (1998). Product design through multivariate statistical analysis of process data. AIChE Journal, 44(5), 1105–1118. http://dx.doi.org/10.1002/aic.690440509.
Lee, Y., & Chen, J. (2021). Enhancing monitoring performance of data sparse nonlinear processes through information sharing among different grades using Gaussian mixture prior variational autoencoders. Chemometrics and Intelligent Laboratory Systems, 208(2020), Article 104219. http://dx.doi.org/10.1016/j.chemolab.2020.104219.
Li, N., Shi, H., Song, B., & Tao, Y. (2020). Temporal-spatial neighborhood enhanced sparse autoencoder for nonlinear dynamic process monitoring. Processes, 8(1079), 1–19. http://dx.doi.org/10.3390/pr8091079.
Lin, S., Clark, R., Birke, R., & Sch, S. (2020). Anomaly detection for time series using VAE-LSTM hybrid model. In IEEE international conference on acoustics, speech and signal processing (ICASSP 2020) (pp. 4322–4326).
Liu, J. (2007). On-line soft sensor for polyethylene process with multiple production grades. Control Engineering Practice, 15(7), 769–778. http://dx.doi.org/10.1016/j.conengprac.2005.12.005.
Prechelt, L. (1998). Automatic early stopping using cross validation: Quantifying the criteria. Neural Networks, 11(4), 761–767. http://dx.doi.org/10.1016/S0893-6080(98)00010-0.
Turlach, B. (1993). Bandwidth selection in kernel density estimation: A review. CORE and Institut de Statistique, 23–493, https://doi.org/1011446770.
Vijaya Raghavan, S., Radhakrishnan, T., & Srinivasan, K. (2011). Soft sensor based composition estimation and controller design for an ideal reactive distillation column. ISA Transactions, 50(1), 61–70. http://dx.doi.org/10.1016/j.isatra.2010.09.001.
Wang, J., Chen, Y., Feng, W., Han, Y., Huang, M., & Yang, Q. (2019). Transfer learning with dynamic distribution adaptation. ArXiv, 11(1).
Wang, K., Chen, J., & Song, Z. (2017). Data-driven sensor fault diagnosis systems for linear feedback control loops. Journal of Process Control, 54, 152–171. http://dx.doi.org/10.1016/j.jprocont.2017.03.001.
Wang, K., Chen, J., Xie, L., & Su, H. (2020). Transfer learning based on incorporating source knowledge using Gaussian process models for quick modeling of dynamic target processes. Chemometrics and Intelligent Laboratory Systems, 198(2019), Article 103911. http://dx.doi.org/10.1016/j.chemolab.2019.103911.
Wang, K., Yuan, X., Chen, J., & Wang, Y. (2021). Supervised and semi-supervised probabilistic learning with deep neural networks for concurrent process-quality monitoring. Neural Networks, 136, 54–62. http://dx.doi.org/10.1016/j.neunet.2020.11.006.
Wozniak, M., Silka, J., Wieczorek, M., & Alrashoud, M. (2021). Recurrent neural network model for IoT and networking malware threat detection. IEEE Transactions on Industrial Informatics, 17(8), 5583–5594. http://dx.doi.org/10.1109/TII.2020.3021689.
Wozniak, M., Wieczorek, M., Silka, J., & Polap, D. (2021). Body pose prediction based on motion sensor data and recurrent neural network. IEEE Transactions on Industrial Informatics, 17(3), 2101–2111. http://dx.doi.org/10.1109/TII.2020.3015934.
Wu, H., & Zhao, J. (2020). Fault detection and diagnosis based on transfer learning for multimode chemical processes. Computers and Chemical Engineering, 135, Article 106731. http://dx.doi.org/10.1016/j.compchemeng.2020.106731.
Yu, C., Wang, J., Chen, Y., & Huang, M. (2019). Transfer learning with dynamic adversarial adaptation network. In Proceedings of the IEEE international conference on data mining (ICDM) (pp. 778–786). http://dx.doi.org/10.1109/ICDM.2019.00088.
Yuan, X., Li, L., & Wang, Y. (2020). Nonlinear dynamic soft sensor modeling with supervised long short-term memory network. IEEE Transactions on Industrial Informatics, 16(5), 3168–3176. http://dx.doi.org/10.1109/TII.2019.2902129.
Zhang, L., Zuo, W., & Zhang, D. (2016). LSDT: Latent sparse domain transfer learning for visual adaptation. IEEE Transactions on Image Processing, 25(3), 1177–1191. http://dx.doi.org/10.1109/TIP.2016.2516952.