IFAC SAFEPROCESS 2018
Warsaw, Poland, August 29-31, 2018
Carlos A. Muñoz et al. / IFAC PapersOnLine 51-24 (2018) 433–440
model is trained based on data scaled using variable scaling; this keeps the original covariance and preserves meaningful data trends. Next, the residual matrix is used to rescale the data. Since the aim is to learn evenly from every variable, the variable-wise unfolded matrix X_(2) is scaled using the error variance for each variable, i.e., the variance of each row in the variable-wise unfolded residual matrix. Then, a new model is trained based on the scaled data, and the loop is repeated until convergence is achieved. The condition for convergence is defined as achieving unit variance in the approximation error of every variable. Therefore, this algorithm guarantees having the model that best approximates the scaled data and produces unit-variance residuals for every variable. The advantages of this approach with respect to traditional scaling can be pointed out. Since it is based on an initial variable scaling, the elements in the data related to the deterministic behavior of the system are not lost or weakened. But since the data is rescaled based on the variance of the residual for each variable, it is guaranteed that the same weight is given to the variance present in each variable, so the model is trained uniformly.

Table 1. Algorithm for simultaneous error scaling and PCA training.

1. Unfold X into X_(1)
2. Mean center each column of the data
3. Scale the variable-wise data X_(2) to the range 0-1
4. Compute T, P via PCA for the scaled data, equation (5)
5. While σ²_{E_(2)} = Variance(E_(2)) ≠ 1:
   5.1 Establish new scaling parameters as M = 1/σ²_{E_(2)}
   5.2 Scale the data via X_(2)^T M
   5.3 Compute T, P via PCA for the new scaled data
End

An alternative formulation results from considering the problems of scaling the data and training the model as a single optimization problem. In this form, a scaling parameter and a regularization term are added to the original least squares problem according to equation (8). This formulation is based on training a bilinear model such as PCA, but it can be applied equivalently to multilinear methods. In this equation, M is the vector containing the scaling factors for each variable; the scaling is applied as the product between M and the unfolded data. The regularization term, on the other hand, aims at making the variance of the residual of each variable equal to 1. In this term, N is the number of elements along the variable-wise unfolded matrix, which corresponds to the number of batches × the number of time points. When applying the algorithm presented in Table 1, the solution to the optimization problem is obtained via an iterative procedure that guarantees first making the regularization term as close as possible to zero, and secondly finding the optimal least squares solution to the problem given by the first term in equation (8), with M being the cumulative product of the variance of the residuals in each iteration.

\min_{M,T,P} \left\| (X_{(2)}^{T} M)^{T} - [T, P] \right\|^{2} + \lambda \left\| \frac{\operatorname{diag}(E_{(2)}^{T} E_{(2)})}{N} - 1 \right\|^{2} \quad (8)

5. CONSTRAINED TENSOR DECOMPOSITION

The use of tensor based decomposition methods to train models that reproduce the variability present in batch process data is driven by the interest in keeping the structure of the data intact. However, the added value of such tensor based methods has been investigated only recently, and several aspects still have to be developed and understood further to exploit the comparative advantages of a tensor based approach. In this contribution, the higher interpretability that can be obtained from a tensor decomposition is investigated to drive the traditional data mining approach towards the extraction of meaningful features from the process data, which in turn can result in better performance and interpretation regarding data approximation, process monitoring and fault detection. Thus, the use of structural sparsity in the factor matrices and/or core tensor is explored to constrain certain linear combinations of the loadings. The final aim is to guarantee independence between the loadings in the time mode that approximate the dependent variables and those that approximate the independent variables.

When (multi)linear methods are used to decompose a given data set, a numerical search for the loadings that best approximate the data is performed. This results in finding the basis or features that numerically produce the best results, i.e., the minimum approximation error in the least squares sense. However, this could imply the emergence of numerical correlations between variables which are non-existing in the physical system. A clear example is the distinction between dependent and independent variables. Since there exists a clear causality relation between these two subsets of variables, the multilinear decomposition could be adjusted to reflect the independence of the inputs with respect to the states of the system. The strategy of imposing independence between loadings in tensor based models has already been explored in chemometrics when modeling data from chemical reactions that are monitored via UV-vis spectroscopy (Gurden et al., 2001). Equivalently, in this contribution the proposed approach is investigated aiming at extracting more interpretable features for the variability present in batch processes. Based on CPD and Tucker3, two different approaches are presented to impose the structural constraints on the models. In the case of CPD, since this model requires the fewest number of parameters and the combinations are restricted to be only between corresponding loadings and scores, it is enough to impose structural sparsity on the factor matrix B to generate the desired independence. This sparsity is structured as shown in equation (9), where zeros are added in the rows n to n + k, which correspond to the position of the independent variables in the data set, and in the first i columns, to guarantee independence between the set of independent variables and the first i time loadings. Thus, those first i time loadings will only influence the dependent variables.

B = \begin{bmatrix}
b_{11} & \cdots & b_{1i} & b_{1(i+1)} & \cdots & b_{1R} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & b_{n(i+1)} & \cdots & b_{nR} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & b_{(n+k)(i+1)} & \cdots & b_{(n+k)R} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
b_{J1} & \cdots & b_{Ji} & b_{J(i+1)} & \cdots & b_{JR}
\end{bmatrix} \quad (9)

In case of Tucker3, given the more complex combination pattern between loadings, it is, according to equation (4), not possible to determine a unique set of parameters in the variable loadings which are related with a unique variable.
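The structural sparsity patterns of equations (9) and (10) can likewise be expressed as boolean masks. The dimensions below (J, R, i, n, k and the core sizes) are hypothetical, chosen only to make the pattern visible; the paper itself does not prescribe these values.

```python
import numpy as np

# Hypothetical dimensions for illustration: J variables and R components
# for the CPD factor matrix B, and a (r1 x r2 x r3) core for Tucker3.
J, R = 11, 6          # rows/columns of B
i = 3                 # components reserved for the dependent variables
n, k = 4, 3           # independent variables occupy rows n .. n+k (0-based)
r1, r2, r3 = 4, 11, 6

# Equation (9): zeros in rows n..n+k of the first i columns of B.
mask_B = np.ones((J, R), dtype=bool)
mask_B[n:n + k + 1, :i] = False

# Equation (10): in the first i frontal slices of the core G only columns
# outside n..n+k may be nonzero; in slices i+1..r3 only columns n..n+k may.
mask_G = np.zeros((r1, r2, r3), dtype=bool)
mask_G[:, :, :i] = True
mask_G[:, n:n + k + 1, :i] = False
mask_G[:, n:n + k + 1, i:] = True

# In an alternating least squares fit, the pattern would be enforced by
# projecting after each update, e.g. B *= mask_B and G *= mask_G.
```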
Therefore, only adding sparsity to the factor matrix V is not enough to guarantee the independence relations as in the case of CPD. To fully establish independence between the loadings that approximate the independent and the dependent variables, structural sparsity has to be imposed both in the factor matrix V and in the core tensor G. As the elements of the core tensor can be interpreted as the weighting factors for the combinations applied between loadings, making these parameters equal to zero avoids certain combinations. The sparsity imposed on the factor matrix V is equivalent to that applied to matrix B in the case of CPD. Equation (10) represents the unfolded core tensor G with the structural sparsity required to guarantee the desired independence.

G(:,:,1) = \begin{bmatrix}
g_{1,1,1} & \cdots & 0 & \cdots & 0 & \cdots & g_{1,r_2,1} \\
\vdots & \ddots & \vdots & & \vdots & \ddots & \vdots \\
g_{r_1,1,1} & \cdots & 0 & \cdots & 0 & \cdots & g_{r_1,r_2,1}
\end{bmatrix}

G(:,:,i) = \begin{bmatrix}
g_{1,1,i} & \cdots & 0 & \cdots & 0 & \cdots & g_{1,r_2,i} \\
\vdots & \ddots & \vdots & & \vdots & \ddots & \vdots \\
g_{r_1,1,i} & \cdots & 0 & \cdots & 0 & \cdots & g_{r_1,r_2,i}
\end{bmatrix}

G(:,:,i+1) = \begin{bmatrix}
0 & \cdots & g_{1,n,(i+1)} & \cdots & g_{1,(n+k),(i+1)} & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & g_{r_1,n,(i+1)} & \cdots & g_{r_1,(n+k),(i+1)} & \cdots & 0
\end{bmatrix}

G(:,:,r_3) = \begin{bmatrix}
0 & \cdots & g_{1,n,r_3} & \cdots & g_{1,(n+k),r_3} & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & g_{r_1,n,r_3} & \cdots & g_{r_1,(n+k),r_3} & \cdots & 0
\end{bmatrix} \quad (10)

6. CASE STUDY

The in-silico Pensim case study (Birol et al., 2002) included in the Matlab based software tool RAYMOND (Gins et al., 2014) is used in this contribution as a benchmark to evaluate and illustrate the advantages that the novel data driven monitoring framework offers. The Pensim model consists of a fed-batch reactor that is used in the production of penicillin. In the first stage the reactor operates in batch condition; then the initial concentration decreases until the point when the feeding is started and the substrate in the reactor reaches an equilibrium. All 11 variables depicted in Table 2 are measured during the reactor operation. Disturbances and noise are introduced to simulate the variability expected in a real process. The other conditions of the process were set as in the original benchmark and are presented in detail by Birol et al. (2002). A total of 130 batches were simulated for a period equivalent to 400 time points. 100 batches were used as training data set and 30 batches for validation of the NOC and the definition of the control limits for SPE and T².

7. RESULTS

7.1 Training and validation

The dimensionality reduction via (multi)linear data driven methods requires estimating the rank to be used in the approximation. The rank selected is equivalent to the number of latent variables or, in other words, the dimensionality of the latent space. Different methods have been investigated in order to determine the best rank approximation for a given data set. Traditionally, the relative increase in the explained variance with each extra latent variable has been considered as one possible criterion. However, in all cases the challenge is to identify a well defined limit for the rank where the approximation is good enough to reproduce the desired systematic variability while avoiding non-systematic behavior that can lead to overfitting. Regularization techniques have been used to reduce the risk of overfitting. The extra regularization term is traditionally formulated to reduce the complexity of the model, e.g., by introducing non-structural sparsity in the factor matrices. In this way an equilibrium between the least squares minimum error of the approximation and the model complexity is obtained. As mentioned before, the novel scaling approach results in the addition of a regularization term to the optimization problem. Thus, the first aspect investigated in the application of this method to the Pensim case study is the rank estimation.

In Figs. 4 and 5 the training curves for the standard PCA of autoscaled data and the proposed alternative using the simultaneous scaling-training approach are presented. In the figures, the blue curve corresponds to the ratio between the variance explained by the given rank and the one obtained if a lower rank is used. The orange curve is the relative variability explained. Wold's criterion (Gins et al., 2014) defines the limit in terms of the relative gain in the variance explained with each extra latent variable, to establish an equilibrium between model complexity and the best rank approximation. As can be seen in Figs. 4 and 5, the proposed novel approach produces a steeper increase in the variance explained at low ranks and a sharper change when the point of no more significant improvement is reached. Thus, while for the autoscaled data the standard threshold for Wold's criterion results in requiring 8 latent variables, in the case of the proposed scaling-training procedure a rank 3 approximation explains sufficient variance, and it is clear that higher complexity will not generate any significant improvement of the model.

Based on these results the rank approximation is fixed to three latent variables for all applied methods. This provides a common ground to compare them. In case of CPD

Table 2. Measured variables and initial conditions for the Pensim case study.

Variable               | Type of variable | Initial condition | Sensor noise (SN) / Disturbance (D)
Dissolved O2 [mmol/L]  | Dependent        | 1.16-1.18         | σ = 0.002 (SN)
Volume [L]             | Dependent        | 90-115            | -
pH                     | Dependent        | 5                 | -
Temperature [K]        | Dependent        | 298               | -
Feed rate [L/h]        | Independent      | 0                 | σ = 0.005 (D)
Aeration rate [L/h]    | Independent      | 8                 | σ = 0.3 (D)
Agitation power [W]    | Independent      | 30                | σ = 1 (D)
Feed temp. [K]         | Independent      | 296               | σ = 0.5 (D)
Cooling water [L/h]    | Dependent        | -                 | -
Base flow [L/h]        | Dependent        | -                 | -
Acid flow [L/h]        | Dependent        | -                 | -
produced by PCA of autoscaled data has a higher bias but a good approximation of the trend behavior. On the other hand, the standard Tucker3 of variable scaled data results in a lower bias but with inconsistencies in the systematic behavior of the variable. The latter deviation probably results from over-combination of loadings in the standard Tucker3 model. This means that certain dynamic behaviors that do not have a significant correlation in the physical system are numerically combined to achieve a better data approximation. In contrast, the two results using the proposed simultaneous approach are significantly better and similar, independently of the method applied. At this level, the relative advantages of using the proposed structurally constrained Tucker3 method do not play a big role in comparison to the results obtained via PCA. A clear advantage is found when comparing the proposed approach with the standard Tucker3. In these results it can be seen how the imposed constraint reduces the risk of over-combination of the loadings, and therefore the wrong dynamic behavior disappears from the estimation.

Finally, in Figs. 6, 7 and 8 it is observed how the proposed novel scaling approach results not only in a more homogeneous error distribution but also in the standardization of the distribution. It can be seen in Fig. 6 that this has particular importance, because applying autoscaling results in indirectly giving more weight to the variability of one variable and its residual error. The dissolved oxygen is approximated more accurately because any error in this variable has a higher scale than the variability in the volume. In Table 4 two parameters are evaluated over the residual matrices to determine how well the different scaling-decomposition methods perform at learning uniformly from the set of variables. First, the overall relative error of the approximation for the validation batches is presented. As can be seen, the overall approximation is equivalent for all methods applied, since the same rank approximation was used. However, the results for the modified E-criterion show a clear difference between the methods. In OED this criterion is used to determine how well distributed the uncertainty of the estimation is along all parameters (Telen et al., 2012). Equivalently in this case, being computed over the variance-covariance matrix of the residuals, it represents how well distributed the approximation error is over all variables. Thus, the results show that the models trained using the proposed novel approach for simultaneous scaling and training result in a better global distribution of the error along all variables. This implies that those models have learned the variability of the system evenly from all the variables.

Table 4. Residuals evaluation for (multi)linear decomposition.

                    | PCA / autoscal. | PCA / sim. scaling | Tucker3 / Variable scaled | Const. Tucker3 / sim. scaling
Relative error val. | 0.167           | 0.166              | 0.157                     | 0.166
Mod. E-crit.        | 166.8           | 4.71               | 208.8                     | 9.04

7.2 Monitoring and fault detection

First, the interpretability of the features extracted from the data was evaluated. The proposed novel approach for simultaneous error based scaling and training of the structurally constrained Tucker3 was compared with the results of the standard PCA of autoscaled data. In Fig. 9 the features (loadings) extracted by PCA are presented. As expected, since these features only represent the directions of highest variability of the data in the batch-wise unfolded version, it is very difficult to extract any further meaningful information from them. In contrast, the features extracted in the time mode of the model trained using the proposed novel approach (Fig. 10) show a clear connection with the trends of the physical variables. Additionally, a clear distinction can be made between the features that approximate the dependent variables and those that approximate the independent variables. The three with a clear, strong deterministic behavior are the features extracted for the dependent variables, while the other three correspond to the disturbances that were introduced for the independent variables.

Fig. 9. Loadings of the PCA based model for autoscaled data.

Fig. 10. Time loadings of the constrained Tucker3 model for data scaled and trained simultaneously.

A set of 10 new batches of the process was simulated to evaluate the fault detection performance of the trained models. For these batches, the parameter that determines the oxygen uptake for maintenance of the microbial culture in the kinetic model was modified (i.e., from the standard value 0.467 to 0.867). This deviation was intended to simulate a change in the process that was not directly related to a change in one variable but to the dynamic system itself. This deviation simulates biological variability that has an impact on the dynamics. Graphical results are presented for the two online monitoring statistics, SPE and T². Fig. 11 corresponds to the case using standard PCA of autoscaled data, while Fig. 12 presents the case using the novel proposed approach combining the simultaneous scaling and training of the constrained Tucker3 decomposition.
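For reference, the two monitoring statistics and the eigenvalue-ratio reading of the modified E-criterion used in Table 4 can be sketched as follows. This is a generic formulation of SPE and Hotelling's T² for a bilinear model, not the paper's exact implementation, and control limits are omitted.

```python
import numpy as np

def spe_t2(x_new, P, T_train):
    """SPE and Hotelling T2 for one new, already scaled and unfolded
    observation, given loadings P and the training scores T_train."""
    t = P.T @ x_new                          # projection on the latent space
    residual = x_new - P @ t                 # part not captured by the model
    spe = float(residual @ residual)         # squared prediction error
    S_inv = np.linalg.inv(np.cov(T_train, rowvar=False))
    t2 = float(t @ S_inv @ t)                # Hotelling T2
    return spe, t2

def modified_e_criterion(E):
    """Modified E-criterion over the residual variance-covariance matrix
    (cf. Telen et al., 2012): ratio of the largest to the smallest
    eigenvalue. Values near 1 indicate the approximation error is evenly
    spread over the variables."""
    w = np.linalg.eigvalsh(np.cov(E, rowvar=False))
    return float(w[-1] / w[0])
```

A residual matrix with one dominant variable yields a large criterion value, while evenly distributed residuals give a value close to 1, consistent with the comparison reported in Table 4.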