Feature Learning For Fault Detection in High-Dimensional Condition Monitoring Signals

Original Article
Proc IMechE Part O:

J Risk and Reliability
2020, Vol. 234(1) 104–115
Feature learning for fault detection in Ó IMechE 2019
Article reuse guidelines:
high-dimensional condition monitoring sagepub.com/journals-permissions
DOI: 10.1177/1748006X19868335
signals journals.sagepub.com/home/pio
Gabriel Michau1 , Yang Hu2, Thomas Palmé3 and Olga Fink1
Abstract
Complex industrial systems are continuously monitored by a large number of heterogeneous sensors. The diversity of
their operating conditions and the possible fault types make it impossible to collect enough data for learning all the possi-
ble fault patterns. This article proposes an integrated automatic unsupervised feature learning and one-class classification
for fault detection that uses data on healthy conditions only for its training. The approach is based on stacked extreme
learning machines (namely hierarchical extreme learning machines) and comprises an autoencoder, performing unsuper-
vised feature learning, stacked with a one-class classifier monitoring the distance of the test data to the training healthy
class, thereby assessing the health of the system. This study provides a comprehensive evaluation of hierarchical extreme
learning machines fault detection capability compared to other machine learning approaches, such as stand-alone one-
class classifiers (extreme learning machines and support vector machines); these same one-class classifiers combined
with traditional dimensionality reduction methods (principal component analysis) and a deep belief network. The perfor-
mance is first evaluated on a synthetic dataset that encompasses typical characteristics of condition monitoring data.
Subsequently, the approach is evaluated on a real case study of a power plant fault. The proposed algorithm for fault
detection, combining feature learning with the one-class classifier, demonstrates a better performance, particularly in
cases where condition monitoring data contain several non-informative signals.
Keywords
Feature learning, representation learning, hierarchical extreme learning machines, fault detection, generators
Date received: 13 November 2018; accepted: 12 July 2019
Introduction growing. These challenges have fostered the develop-

ment of data-driven approaches with automatic feature
Machine learning and artificial intelligence have learning, such as deep learning that allows for end-to-
recently achieved some major breakthroughs leading to end learning.5
significant progress in many domains, including indus- Feature learning techniques can be classified into
trial applications1–4 One of the major enablers has been supervised and unsupervised learning. While in its tra-
the progress achieved on automatic feature learning, ditional meaning, supervised learning requires examples
also known as representation learning.1 It improves the of ‘labels’ to be learned, supervised feature learning
performance of machine learning algorithms while lim- refers instead to cases where features are learned while
iting the need for human intervention. Feature learning performing a supervised learning task, as for example
aims at transforming raw data into more meaningful
representations simplifying the main learning task.
Traditionally, features were engineered manually 1
ETH Zürich – Swiss Federal Institute of Technology, Zürich, Switzerland
and were thereby highly dependent on the experience 2
Science and Technology on Complex Aviation Systems Simulation
and expertise of domain experts. This has limited the Laboratory, Beijing, China
3
transferability, generalisation ability and scalability of General Electric (GE), Baden, Switzerland
the developed machine learning applications.5 The fea-
Corresponding author:
ture engineering task is also becoming even more chal- Olga Fink, ETH Zürich – Swiss Federal Institute of Technology, Stefano-
lenging as the size and complexity of data streams Franscini-Platz 5, 8093 Zürich, Switzerland.
capturing the condition of industrial assets are Email: ofink@ethz.ch
Michau et al. 105
fault classification or remaining useful life (RUL) esti- estimated thanks to healthy data not used in the train-
mation in prognostics and health management (PHM) ing of the one-class classifier. This health monitoring
applications. Recently, deep neural networks have been concept is different to other approaches that proposed
used for such integrated supervised learning tasks, to use HIs to monitor the system condition,15 as the
including convolutional neural networks (CNN)6–8 or HIs were learned in a supervised way, contrary to the
different types of recurrent neural networks, for exam- proposed unsupervised learning approach. Thanks to
ple, long short-term memory (LSTM) networks.9 The the distance to the healthy class, the proposed
performance achieved with such end-to-end feature approach can distinguish different degrees of fault
learning architectures on commonly used PHM bench- severity.
mark datasets has shown to be superior to traditional Results presented in this article demonstrate that in
feature engineering applications.4,10 high dimension, the one-class classification is a challen-
Contrary to the supervised learning, unsupervised ging task, and that results are improved by first learn-
feature learning does not require any additional infor- ing the informative features. As healthy data only is
mation on the input data and aims at extracting relevant used for the training, features have to be obtained in an
characteristics of the input data itself. Examples of unsupervised way and we propose to use a compressive
unsupervised feature learning include clustering and dic- AE to perform this task. To the best of our knowledge,
tionary learning.1 Deep neural networks have also been this is the first deep learning approach enabling efficient
used for unsupervised feature learning tasks for PHM health monitoring trained solely with nominal condi-
applications, for example, with deep belief networks tion monitoring data. This approach is unsupervised in
(DBNs),11,12 and with different types of autoencoders the sense that it is used to detect degraded conditions
(AEs),11 for example, variational autoencoders (VAE).13 that were not part of the training dataset.
Even though the approaches described above applied The approach presented here takes advantage of the
either supervised or unsupervised feature learning recent advances in using multilayer extreme learning
within their applications, the subsequent step of using machine (ELM) for representation learning.16 The
the learned features within fault detection, diagnostics novelty here is also in the use of hierarchical extreme
or prognostics applications has always required labelled learning machine (HELM).17–19 HELM consists in
data to train the algorithms. Yet, while data collected stacking sparse AEs, whose neurons can directly be
by condition monitoring devices are certainly massive, interpreted as features and a one-class classifier, aggre-
they often span over a very small fraction of the equip- gating the features in a single indicator, representing
ment lifetime and/or operating conditions. In addition, the health of the system. The very good learning abil-
when systems are complex, the potential fault types can ities of HELM have already been used in other fields,
be extremely numerous, of varied natures and conse- for example, in medical applications20 and in PHM.18
quences. In such cases, it is unrealistic to assume data This study provides a comprehensive evaluation of
availability on each and every fault that may occur, and HELM detection capability compared to other machine
these characteristics challenge the ability of data-driven learning approaches, including DBNs.
approaches to actually learn the fault patterns without This article is organised as follows: section
having access to a representative dataset of possible ‘Framework of HELM’ details the HELM theory and
fault types. the algorithms used for this research. Section
The focus of the methodology proposed in this arti- ‘Simulated case study’ presents a simulated case study
cle is to account for this possibility and to use healthy designed to test and compare the HELM to other tradi-
data only for the training. This is contrary to the widely tional PHM models. Controlled experiments allow for
used approaches applied for such problems: defining quantified results and comparisons. A real application
the problem as a two-class classification task and trying is analysed in section ‘Real-application case study: gen-
to take the imbalance of the two classes (abundant erator fault detection’, with data from a power plant
healthy data and few faulty cases) into account. We experiencing a generator inter-turn failure. Finally,
propose to define this problem as a one-class classifica- results and perspectives are discussed in section
tion task, designed to identify outliers to the class used ‘Discussion and conclusion’.
during the training without any knowledge on the char-
acteristics of the faulty system conditions. One-class
classification comprises the steps of comparing the test- Framework of HELM
ing data with a reference dataset, here a healthy data-
set, and the design of a decision boundary for outlier
HELM for fault detection in brief
detection.14 In this problem definition, we propose to HELM is a multilayer perceptron network, which was
interpret the raw output of the one-class classifier as a first proposed in Tang et al.21 The structure of HELM
distance to the healthy class and therefore as a health consists in staking single-layer neural networks hier-
indicator (HI): the larger the deviation from the healthy archically so that each one can be trained indepen-
system conditions, the larger the distance to the healthy dently. In this article, we propose to stack AEs and a
class, and therefore the larger the value of the HI. The one-class classifier. The compressive AEs are used for
decision boundary is a threshold on this distance, feature learning. The one-class classifier takes the
106 Proc IMechE Part O: J Risk and Reliability 234(1)
Table 1. Notations.
Superscript notations
: Train Training, healthy, dataset to train the neural
networks
: Thr Healthy dataset for finding the decision boundary
(equation (7))
: Test Test data
Data variables
X Input data to the model
Y Output values of a neural network
T Target data when training a neural network
Z Labels of datapoints
Dataset sizes
D Input dataset dimension
Figure 1. HELM for anomaly detection flow chart. DY Dimension of the model output
K Number of sample in the dataset
Neural networks variables

features of the AE as inputs and learns to output an Li Number of neurons in the ith hidden layer of
indicator18 that can be interpreted as a HI. Figure 2 the HELM
presents the proposed architecture, comprising a single A Weight matrix between input and hidden layer
AE for feature learning and a one-class classifier. The B Bias vector between input and hidden layer
b Weight matrix between hidden layer and output
framework proposed here is illustrated in Figure 1 and g Activation function
can be simplified as follows (Table 1): H Hidden layer matrix: H = g(A, X, B)
C Ridge regularisation parameter in the one-class
Split the training dataset (containing only observa- classifier
tions from healthy system conditions) in two sub- l LASSO regularisation parameter in the
autoencoder
sets: XTrain for training the neural network and XThr
for finding the optimal decision boundary of the Experiments
one-class classification. Nexp Number of experiments
Train the AE with XTrain such as to minimise the TP True positives: percentage of valid detections
FP False positives: percentage of false detections
reconstruction error. TN True negatives: percentage of non-detected
Train the one-class classifier network with the fea- healthy inputs
tures of XTrain from the previous AE such as to FN False negatives: percentage of missed detections
minimise the distance between the output and the Acc Accuracy: (TP + TN)/2
healthy class represented by the value 1. Mag Magnification coefficient
N ðm, sÞ Gaussian distribution (average m, standard
Run both networks with XThr . Infer a minimal deviation s)
threshold on the output of the one-class classifier U ð½a, bÞ Uniform distribution in the interval [a, b]
such that most points in XThr belong to the healthy U ð½½a, bÞ Uniform distribution of integers in [a, b]
class.
HELM: hierarchical extreme learning machine; LASSO: least absolute
When using testing data XTest , if the output of the
shrinkage and selection operator.
one-class classifier is above the threshold, consider
these data as abnormal.
The value of the output of the one-class classifier
provides an indication of the severity of the where X represents the input, Y the output, g the acti-
abnormality. vation function, A and b are weights, and B are the
biases.
The remainder of this section presents the theoretical A mathematical proof23 demonstrates that given a
background of HELM in more detail and highlights the linear kernel g, a training set X and its target T, an
adaptations needed to match the specific requirements SLFN with enough neurons and randomly sampled A
of PHM applications. and B can approximate T with any accuracy. Such net-
works with randomly sampled weights and biases are
called ELM. Formally, for any E . 0, for s1 . 0, u 2 N
Theory and training of ELM and for enough neurons, b ^ exists such that
ELM is a single-hidden layer feed-forward neural net-
work (SLFN).22 ^ Ts1 \ E
gðA X + BÞ b ð2Þ
u
SLFN equation is usually written as
^ T ks 1 \ E
kHb ð3Þ
u
Y = g(A, X, B) b ð1Þ
Michau et al. 107
where H denotes the hidden layer matrix to track the changes in the features, by linking them to
g(A XTrain + B). the parts of the system potentially affected. Sparsity can
By random sampling of the input weights A and the be achieved using the ‘0 norm for the regularisation
bias B, training an ELM only consists in computing the term (v = 0, s2 = 1). Yet, this would make problem (4)
optimal output weights b. This simplifies problem (3) non-convex and require an integer programming solver.
considerably as it is now formulated as the single- A surrogate is the ‘1 norm, which leads to the least
variable regularised convex optimisation in problem (4) absolute shrinkage and selection operator (LASSO)
problem, one of the classic convex optimisation prob-
^ = Arg minkHb Tks1 + Ckbks2
b ð4Þ lems in the literature.26 One can note that the ‘2 norm
u v
b
has a closed-form solution and is, thus, much easier to
where s1 , s2 . 0, u, v 2 Nn0 and C ø 0 is the weight of compute (cf. equation (5)). Yet, the solution is likely to
the regularisation. C represents a compromise between be dense and redundant,27 which is not compatible with
the learning objective and properties of b that one our idea of a good set of features.
would like to impose (e.g. sparsity, non-diverging coef- The AE-ELM consists finally in solving the typical
ficient). This is the strong advantage of ELM over reg- LASSO problem
ular SLFN. For traditional feed-forward neural
networks, the weights A, b and B are optimised over ^ = Arg min kHbX k2 + l k bk1
b ð6Þ
2
b
several iterations, usually using a back-propagation
(BP) algorithm, while for ELM, the optimisation of b In this article, we solve this problem using the fast
solely is a much faster and easier process and has a
iterative shrinkage-thresholding algorithm (FISTA)
unique solution.
known for both its fast convergence and computational
In problem (4), any combination of s1 , s2 , u, v leads
simplicity.28 The process for solving problem (4) is
to a different solution of problem (4); it is however
detailed in Algorithm 1.
usual to take (s1 , s2 , u, v) 2 ½½1, 2. When s1 = s2 = 2
and u = v = 2, equation (4) is known as the ridge
regression problem (also referred to as the Tikhonov
regularisation problem),24 and it has an explicit Algorithm 1. FISTA.
solution
Input: H, X, l ø 0, d 20, 1½, E ø 0
1 1: g d
b = C I + H> H H> T ð5Þ 1 + kHk2
2: k 0, bk 0, y k 0,
3: tk 1, Crit ‘,
4: while Crit ø E do
ELM-AE for feature learning 5: ck = yk 2 gHT ðHy k X Þ
6: bk + 1 = max (jc j l g, 0) sign (ck )
AEs are unsupervised machine learning tools trained qkffiffiffiffiffiffiffiffiffiffiffiffiffi
7: tk + 1 = 12 1 + 1 + 4t2k
for the reconstruction of their input. Structural or
mathematical constraints make sure that they do not 8: yk + 1 = bk + 1 + ttkk1
+1
ðbk bk + 1 Þ
simply learn the identity operation, such as adding 9: Crit = k bk bk + 1 k2
10: k=k+1
noise to the input (the AE is then qualified as denois-
Output: bk
ing), imposing dimensionality reduction (compressive
AE), dimensionality increase or adding a kernel regres-
sion hidden layer (VAE).
HI and magnification for fault detection
Within neural networks, AEs are often part of multi-
layer architectures: they can be trained independently In the context of critical and complex systems, the high
to the problem at hand and are often used for feature number of possible faults due to the large number of
learning.25 Conceptually, in a compressive AE, the hid- parts involved in the system, the rarity of certain faults
den layers learn the best few combinations of the differ- (if ever experienced at all), and, overall, the (under-
ent measurements that can explain all the signals. The standable) lack of willingness from operators to let the
ELM framework can be used to train single-layer AE faults happen repeatedly for data collection, make it
neural networks. In this case, it consists in solving prob- impossible to aim for fault recognition or classification.
lem (4) with the target T equal to the input X. Instead, we propose to focus on abnormality detection,
One concern for practitioners is the possibility to training a one-class classifier ELM29 on features from
understand how the signals are linked to the features healthy data solely.
and assess their impact on the results. This is a motiva- In this case, we use an ELM-based one-class classi-
tion to foster sparse connectivity between features and fier trained considering the ridge regression formulation
reconstructed inputs. Intuitively, if each feature is con- in equation (5) with output T = (1)14k4K (DY = 1).
nected to a few measurements only, each feature should Then, the distance between the output and the normal
maximise the information it contains on only a part of class, represented by the value 1, is monitored. A
the system. This would enable the monitoring engineers threshold on this distance discriminates between the
normal and the faulty class. As such, this distance is

similar to a HI.30 We propose here to base the thresh-
old on another set XThr , that is, a set of datapoints from
healthy conditions not used in the training. This step
ensures higher robustness of the threshold to variations
in the healthy conditions and mitigates the potential
risk of overfitting during training, which would result
in a too small threshold, causing thereby too many false
positives during the application. Experiments have
shown that a good threshold can be designed as
Figure 2. HELM architecture: example with a single AE whose
Thrd = g percentile p j1 YThr j ð7Þ hidden layer is used as the input of the next layer: a one-class
classifier.
where percentile p is the pth percentile function with
p, g ø 0 as hyperparameters.
Then the actual class Z of a sample i (1 for healthy, – Algorithm 2. Function HELMTrain (training of a HELM).
1 otherwise) can be devised with the following equation
later denoted as the decision rule Input: C ø 0, l ø 0, N 2 N X Train , T Train
1: x1 X Train
ZTest
i = sgn Thrd j1 YTesti j ð8Þ 2: for i = 1, . . . , N do . Stacked AE ELM
3: Generate: Ai , Bi , random weights
The choice of both p and g is to some extent a single 4: Hi = g(xi , Ai , Bi )
problem, consisting in fact in choosing a robust thresh- 5: bi = Arg min l k bk1 + k Hi b x i k22 . (cf. Alg. 1)
b
old. In this article, we chose to fix p = 99:5 and experi- 6: xi + 1 = xi bTi . Upper layer ELM
ments will show that g 2 ½1, 3 provides good results. 7: Generate: AN + 1 , BN + 1 , random weights
We denote by magnification coefficient the ratio 8: HN + 1 = g(xN + 1 , AN + 1 , BN + 1 )
between the distance of the one-class classifier output 9: bN + 1 = Arg min C k b k22 + k HN + 1 b T Train k22
b
to 1 and the threshold Output: fbi g14i4N + 1 , ðAN + 1 , BN + 1 Þ, g

j1 YTest j
Mag = ð9Þ
Thrd
Algorithm 3. Function HELMRun (running a HELM).
This ratio is above 1 for points detected as abnormal
and its value represents the robustness of the detection. Input: X, HELM Train (C, l, N, X Train , T Train )
A magnification coefficient well above the value 1 cor- 1: x1 X
responds to detection with low sensitivity to the choice 2: for i = 1, . . . , N do
of g and p and therefore higher confidence in the 3: xi + 1 = xi bTi
4: H = g(xN + 1 , AN + 1 , BN + 1 )
detection.
Output: Y = HbN + 1
Stacking AE(s) and a one-class classifier for HELM benefits of feature learning, HELM is compared to two
A traditional approach is to pre-train an AE before one-class classifiers (ELM and support vector machine
using it in a deeper classification architecture, which is (SVM)). Second, features are devised through one of
subsequently trained and fine-tuned using BP.31 To the the most traditional ways, a principal component anal-
opposite, in HELM, each layer is trained sequentially ysis (PCA), and are then used as input to two one-class
and independently. The hidden layer of each successive classifiers (ELM and SVM). Finally, to mimic the
ELM becomes the input of the next. This avoids well- HELM architecture with another commonly used
known limitations of the BP, in particular, the risk of neural network, HELM is compared with a DBN com-
exploding or vanishing gradient and intensive computa- posed of two staked restricted Boltzmann machines
tions required for the training.25,32 In our proposed (RBM): one for feature learning and a second whose
architecture of HELM, the lower layers are AEs, while hidden layer is fully connected to a single output neu-
the last layer is a one-class classifier, as illustrated in ron trained as a one-class classifier.
Figure 2.
If N denotes the number of stacked AEs, then train-
ing the HELM corresponds to Algorithm 2. Running Remark about the one-class classifier SVM. Note that usu-
the HELM for threshold computation and testing con- ally, the one-class classifier SVM output is interpreted
sists in applying Algorithm 3. as the likelihood of a point to belong to the main class,
In the following, HELM with a single AE is tested under two assumptions: First, that the data follow the
on a simulation and on a real case study. HELM is distribution used as a kernel (Gaussian most of the
compared to five other approaches: First, to assess the times), second, that the number of outliers in the
Michau et al. 109
training data was correctly specified. In this traditional Algorithm 4. Simulated case study.
interpretation, 0 is the decision boundary and points
with positive likelihood are labelled as belonging to the 1: K 14000 samples
2: D 200 sensors
training class (healthy), while points with negative like- 3: for n 2 ½5, 10, f 2 ½Id(), log() do
lihood are labelled as outliers. The quality of the classi- 4: repeat N exp = 150 times
fication depends therefore strongly on the validity of 5: Draw: a Base 2 ½N ð0, 1Þn , af 2 N ð0, 1Þ
the two above assumptions. In our experiments, better 6: Draw: E Base 2 ½U ð½0, 3Þn , Ef 2 U ð½0, 3Þ
detection performances could be achieved by choosing 7: Draw: X Base 2 ½N ð0, 1Þn3K
131000
a decision boundary on the negative likelihood with 8: Draw: Xf 2 ½N ð0, 1Þ
Base Base Base
equation (7) instead of 0. This modification compared 9: X a X + E Base
Base
to the traditional implementation provides a fairer 10: X ½9000 : 10000, 1 = 1:2
comparison to other approaches. 11: X Base ½10000 : 11000, 1 = 1:5
12: X Base ½11000 : 12000, 1 + = 0:2 a Base ½0
13: X Base ½12000 : 13000, 1 af Xf + Ef
Simulated case study 14: Draw: a 2 ½U ð½0, 1ÞD
15: Draw: b 2 ½Uð½½1, nÞD
To evaluate the performances of the proposed 16: Draw:E 2 N 0, 0:01 f (a a Base ½b)
approaches under known and controlled conditions, a 17: X f a X Base ½: , b + E
simulated case study is designed. The synthetically 18: Draw: if 2 ½U ð½½1, DÞ10
generated datasets aim to represent behaviours 19: X½13000 : 14000, if = 1:2
observed in real condition monitoring signals and to 20: for Every Model and every set of hyperparameters do
simulate varied faults. All the following experiments 21: Train Model on X½0 : 7000, :
have been performed with MATLAB on a laptop (Intel 22: Compute Thrd in Eq.(7) with X½7000 : 8000, :
23: Compute FP on X½8000 : 9000, :
Core i7-4600M 2.90 GHz, 16 GB of DDR3 RAM).
24: Compute TP on each fault type (fault 1 on
X½9000 : 10000, :, ., fault 5 on X½13000 : 14000, :)
25: Compute Mag, the average over true positives of
Description the output value of the one-class classifier,
Methodology. We generate 150 experiments according to normalised by Thrd.
26: Average TN over N exp for each Model, hyperparameter
Algorithm 4. The core of the experiment is to simulate
set, value n and function f .
measurements from D = 200 sensors, on a system with 27: Average TP and Mag over N exp for each Model,
n intrinsic physical properties (e.g. temperature, rota- hyperparameter set, value n, function f and fault type.
tion, voltage). Each of the D = 200 sensors, denoted s, 28: Compute Prec = TPTP+ FP for each model, hyperparameter
is reading one of the n intrinsic properties according to set, n value, function f and fault type.
an internal reading function f, a sensitivity as and a
noise Es . At a given time, we impose one of the follow-
ing five faults on the measurements:
tuning and N = 100 for the reported results). The typi-
cal performance metrics are thus computed as follows:
1. One intrinsic property is multiplied by 1.2.
2. One intrinsic property is multiplied by 1.5.
The true positive rate: TPR = TP=N;
3. A constant value computed as 20% of the signal
The true negative rate: TNR = TN=N;
amplitude is added to one intrinsic property.
The false positive rate: FPR = FP=N = 1 TNR;
4. The intrinsic property changes.
The false negative rate: FNR = FN=N = 1 TPR;
5. The measurements of 10 sensors are multiplied by
The precision is TP=(TP + FP);
1.2.
The accuracy is (TP + TN)=(2N);
The f1-score is 2TP=(N + FP + TP).
Choice of the metrics. For each experiment, we generate In the context of one-class classification, the practi-
one negative set (healthy dataset) and one positive set tioners have two concerns. The first one is the minimi-
for each fault. In the context of one-class classification, sation of false alarm occurrences. In practical
we have the following relationships: positive (faulty) applications, verifying the occurred alarms takes time,
sets are either detected (true positive sets TP) or missed expertise and is often associated with high costs. Too
(false negative sets FN). Similarly, negative sets are many of them will make the model impracticable to
either not detected as faulty (true negative sets TN) or use. Second, as abnormalities are unexpected, random,
falsely detected (false positive sets FP). Therefore, we rare and might have large consequences, the probabil-
have ity of raising an alarm in case of abnormal behaviour
N = TP + FN = TN + FP ð10Þ should be as high as possible. The two corresponding
indicators are the FPR and the TPR. This is different
where N is the number of experiments in which the to a two-class classification problem with imbalanced
metrics are computed (N = 50 for the hyperparameter classes, where the f1 score would be favoured, reflecting
that the classification of both classes is equivalently find the best set of hyperparameters with respect to the
important and that the results need to account for pos- accuracies, we then report aggregated results on further
sible class imbalance. 100 simulations.
In the following, we report the results with the
metrics TPR and FPR. An optimal model would at the
same time minimise the FPR while maximising the Impact of randomness on (H)ELM efficiency. Prior experi-
TPR. Considering that TNR = 1 – FPR, we choose to ments to this study demonstrated that for ELM-based
report the results for the models maximising the aver- methods, averaging the results over five runs with inde-
age between the TPR and the TNR, that is, the accu- pendent training reduces deviations in accuracy without
racy. Depending on the priority of practitioners, one or changing its average. This is important for practi-
the other could be weighted more, or other metrics tioners, as HELM is inexpensive to train and run and
could be used. In such a one-class classification prob- as reducing the randomness impact of the random
lem, most of the commonly used metrics can be weights and biases (A and B) on the results is a valid
computed with the knowledge of the TPR and FPR. concern.
We report in addition the magnification coefficient
(Mag), as an indicator of the confidence in the fault Results
detection.
First, to demonstrate the ability of each model to detect
the different faults and to give the model performance
Model hyperparameters. The hyperparameter search is upper limits, the hyperparameters are optimised in
performed according to the following grid: order to maximise the accuracy on each fault indepen-
dently over the first 50 experiments. Results on the fol-
g 2 ½1:1, 1:2, 1:5, 1:7, 2, 2:5, 3, the factor used in the lowing 100 experiments are reported in Table 2 both
decision rule in equation (7); for n = 5, n = 10 and for both reading functions f.
L1 2 ½5, 10, 20, 40, 70, 100, the number of neurons Second, to demonstrate the robustness of the model
for the AE (or DBN first layer); to detect with a single training different kinds of faults,
L2 2 ½20, 50, 100, 200, 400, 800, the number of neu- the hyperparameters are tuned over the first 50 experi-
rons for the one-class classifier (or DBN second ments in order to maximise the average accuracy over
layer); the five faults. Results aggregated over the following
l 2 ½105 , 103 , 102 , 101 , 1, the LASSO regulari- 100 experiments and selected hyperparameters are
sation parameter; reported in Table 3 and the sensitivity to the threshold
C 2 ½105 , 103 , 102 , 101 , 1, the ridge regularisa- g is presented as a receiver operating characteristic
tion parameter; (ROC) curve in Figure 3.
LPCA 2 ½5, 10, 15, the number of principal compo- In Tables 2 and 3, for each column, the best results
nents selected in the PCA (up to 99% of explained are presented in bold. From Table 2, two main results
variance); can be observed. First, according to expectations,
S 2 ½105 , 101 , 1, 5, 10, the SVM Gaussian kernel accuracies are decreasing for higher signal complexity
scale; (increasing number n of base signals and logarithmic
%out 2 ½0:5%, 1%, 2%, 5%, the number of outliers measurements). Second, HELM achieves very good
allowed in the training of SVM. detection performances on all fault types and is the
most efficient approach to detect faults that impact the
Simulation parameters. Each of the n intrinsic properties base signals. HELM provides consistent results both
is simulated with a distinct random ‘base’ signals XBase . for the different number of affected base signals and
We report results for n = 5 and n = 10. The sensor also for the two different measurement functions.
reading function f is either the identity (linear measure- The benefits from the feature encoding layer are pro-
ment of the physical property) or the logarithm (often ven by the highest performance of HELM over the
the case in intensity measurement). The measurements standard ELM. The robustness of HELM is also pro-
have an additive random Gaussian noise Es , corre- ven when comparing the results of the one-class classi-
sponding to 1% of the reading amplitude. fier SVM. While SVM performs equally well as HELM
For each experiment, K = 14, 000 datapoints are for n = 5, its accuracy is strongly reduced when the
generated. In total, 7000 are used for training the mod- inherent dimensionality of the measurements increases
els, 1000 are used for computing the decision threshold to n = 10. Using PCA as the first step for dimensional-
in equation (7) and another 1000 are used to compute ity reduction did not overall improve the results com-
the FPR. The remaining 5000 datapoints are divided pared to the one-class classifiers alone. Only on fault 5,
into five subgroups of size 1000, each subgroup being PCA tends to improve the results, yet not significantly.
impacted by one of the 5 faults. The model is tested on This fault corresponds to the sensors being directly
each faulty set to compute the TPR per fault type. The impacted and thus would have the most direct impact
statistics are aggregated over multiple repetitions of the on the principal components. Otherwise, the non-linear
experiment. We first repeat the experiments 50 times to relationships between measurements and faults seem to
Michau et al. 111
100 (100/0)
100 (100/0)
100 (100/0)
100 (100/0)
100 (100/0)
92 (84/1)
HELM: hierarchical extreme learning machine; ELM: extreme learning machine; SVM: support vector machine; PCA: principal component analysis; DBN: deep belief network; TPR: true positive rate; FPR: false positive rate.
100 (100/0)
100 (100/0)
100 (100/0)
98 (97/0)
98 (97/2)
89 (86/8)
10
100 (100/0)
100 (100/0)
100 (100/0)
100 (100/0)
100 (100/0)
90 (82/2)
100 (100/1)
100 (100/0)
100 (100/0)
99 (98/0)
98 (96/0)
87 (74/0)
5
5
94 (87/0)
92 (84/0)
89 (78/0)
93 (88/2)
91 (82/0)
84 (69/1)
94 (87/0)
91 (83/1)
91 (86/5)
92 (86/2)
90 (84/4)
85 (78/8)
Figure 3. ROC curve (f = log and n = 5). Each point of each

curve is labelled with the corresponding value of g, increasing
10
from right to left.

96 (91/0)
94 (89/2)
92 (84/1)
95 (90/1)
94 (89/1)
82 (65/1)
limit the efficiency and impact of PCA. Finally, the

94 (88/0)
93 (86/0)
91 (85/3)
93 (85/0)
93 (87/1)
84 (68/1)
DBN shows robust results for n = 5 and n = 10, but

Each cell contains three values: accuracy (TPR/FPR). The bold-font numbers highlight the best performing models in terms of the evaluation measures.
with low accuracies.

77 (74/20)
4
5
69 (50/12)
61 (53/32)
61 (35/14)
60 (35/16)
Table 3 presents the accuracies and magnification

60 (22/2)
coefficient for the set of hyperparameters with best

average accuracy Accd over the five fault types, the cor-
77 (78/24)
66 (61/30)
66 (54/23)
67 (53/20)
65 (55/25)
64 (30/2)
responding set of hyperparameters and the training

time. Results discussed so far are confirmed, in particu-
10
lar that HELM is the most robust model: with a single

76 (86/34)
58 (47/31)
94 (91/3)
93 (94/9)
65 (39/9)
64 (37/9)
set of hyperparameters, it achieves almost optimal

results on all faults, with accuracies consistently around
Table 2. Performance of the different models, n = 5 and n = 10 base signals, for fault types 1 to 5.
85 (81/12)
69 (67/30)
88 (94/19)
76 (77/25)
67 (61/27)
95 (96/6)
95% and very strong magnification coefficients (up to

around 100). SVM is up to 100 times slower to train
and its magnification is very low. Performing PCA
3
5
92 (95/12)
78 (81/25)
90 (93/13)
71 (72/30)
98 (98/2)
97 (97/3)
before the SVM reduces the time needed for training

and increases the magnification, but lowers the average
accuracy.
90 (95/15)
85 (93/23)
99 (99/1)
98 (97/2)
94 (91/4)
80 (67/8)
As shown in Figure 3, the sensitivity analysis con-

firms that SVM, with a low magnification coefficient,
10
100 (100/0)
100 (100/1)
84 (81/13)
78 (67/11)
has strongly decreasing performances as gamma

99 (99/2)
98 (97/1)
increases. For HELM, g 2 ½1:52:5 outperforms most

of the other model settings.
100 (100/0)
100 (100/0)
84 (81/14)
99 (98/0)
88 (83/8)
98 (97/1)
Real-application case study: generator

73 (76/30)
2
5
67 (67/34)
64 (59/32)
58 (29/14)
62 (54/30)
59 (19/2)
fault detection
Power plant condition monitoring
75 (64/14)
64 (57/30)
63 (49/23)
64 (47/20)
67 (59/25)
60 (21/2)
For the real case study, an H2-cooled generator from

an electricity production plant is evaluated. The genera-
10
73 (79/34)
58 (49/33)
66 (42/10)
65 (67/37)
93 (91/6)
tor is working on nominal load, with changing operat-

92 (93/9)
ing conditions and thus, variations in the parameters.

The evaluated dataset covers 9 months (275 days) with
84 (79/12)
64 (57/30)
86 (90/19)
69 (62/24)
70 (67/27)
96 (96/4)
X = log (a X Base ½: , b) + E
310 sensors recording every 5 min (K = 55774 and

D = 310). Sensors are grouped into five families:
½: , b + E
1
5
1. The rotor shaft voltage is used to detect shaft

PCA-SVM
PCA-SVM
Base
PCA-ELM
PCA-ELM
grounding problems, shaft rubbing, electro-ero-

Fault type
X=a X
HELM
HELM
DBN
DBN
SVM
SVM
ELM
ELM
sion, bearing isolation problems and rotor inter-

turn shorts.
n
Time (s) 2. Rotor flux is used to detect the occurrence, the
0.16
10.5
magnitude and the location of the rotor winding
0.1
0.2
3.3
8.2
TP: true positives; FP: false positives; HELM: hierarchical extreme learning machine; ELM: extreme learning machine; SVM: support vector machine; PCA: principal component analysis; DBN: deep belief network.
inter-turn short circuit.
10
10
3. Partial discharge is used to detect ageing of the
S
main insulation, loose bars or contact as well as

% Out
contamination.
1
1
4. End winding vibration is mainly used to detect
deterioration in mechanical stiffness of the over-
LPCA
15
15 hang support system.

Hyperparameters
5. Finally, the stator bar water temperature is also

105
105
C
monitored.
102
l
The generator failure mode tested here has been a

posteriori explained by experts as, first, at day 169, an
100
400
100
100
intermittent short circuits in the rotor that remained

L2
undetected, denoted in the following by lower level

20
70
L1
fault, developing in a second faulty state or upper level

fault at day 247, when the short circuit worsened to a
1.5
1.5
1.5
1.1
1.2
1.1
g
continuous one. This led to the power plant shut down.

105
Mag
1.5
2.0
70
28
6
Methodology
95 (100/10)
95 (100/10)
Acc (TP/FP)
99 (100/3)
99 (100/2)
96 (100/9)
5
Similarly, as in the previous case study, the data are

80 (68/9)
split into three parts: for training and threshold compu-

tation (first 120 days with a random sampling
c for fault types 1 to 5.
94%=6%) and for testing (from day 120 to day 169 to

Mag
compute FPR, the rest to compute TPR). The dataset

5.5
1.7
1.8
88
28
13
is hence split as follows: KTrain = 22, 017, KThr = 1406

Acc (TP/FP)
and KTest = 4524+27, 827. Remind that D = 310. As

89 (89/12)
89 (88/10)
90 (90/10)
95 (92/3)
4
93 (94/9)
81 (70/9)
results need to be assessed on a single experiment, in

the present case study, the TP and FP are the propor-
tion of points for which the model output exceeds the
The bold-font numbers highlight the best performing models in terms of each evaluation measure.
Table 3. Accuracies and magnification for the set of hyperparameters maximising Acc
threshold after and before day 169, respectively.

Mag
2.2
1.4
1.7
1.2
1.4
1.1
Acc (TP/FP)
54 (17/10)
65 (40/10)
94 (91/3)
Results
3
68 (38/2)
93 (94/9)
62 (32/9)
Table 4 synthesises the accuracy, percentage of TP and

FR, the magnification coefficient and the training time
for the best performing model of each family. In addi-
Mag
6.3
1.9
1.3
1.5
1.3
tion, the distance to the normal class (j1 Yj) of these

2
models is illustrated in Figure 4. In accordance with the

99 (100/3)
Acc (TP/FP)
81 (72/10)
96 (100/9)
94 (98/10)
results obtained so far, HELM provides the best per-

2
95 (95/4)
76 (61/9)
formance. Its accuracy is higher and it is a robust

model, with a strong magnification (Mag = 20) and a
short training time. The one-class classifier alone per-
Mag
forms sensibly worse: numerous FP are raised; TP are

2.3
1.4
1.8
1.2
1.3
1.2
scarce, which is consistent with its much lower magnifi-

cation. SVM provides good performances with no FP
Acc (TP/FP)
X = log (a X Base ½: , b + E (n = 5)
56 (22/10)
66 (42/10)
91 (85/3)
68 (37/2)
91 (92/9)
63 (35/9)
and a good rate of TP; yet its accuracy is smaller than

1
that of HELM and its magnification is very small

(Mag = 1.04). SVM is the least robust model with
respect to the threshold definition and is harder to train
Acc
c
95
85
75
94
82
72
(300 times slower). Performing a PCA before the classi-

fication also worsened the results. This could be the
PCA-SVM
PCA-ELM
consequence of the inherent non-linear relationships

HELM
DBN
SVM
between the measurements. The DBN has also a lower

ELM
accuracy compared to HELM. However, it has a stron-

ger magnification.
Michau et al. 113
Table 4. Performances on real dataset.
Acc TP (%) FP (%) Mag Time (s) Hyperparameters

g L1 L2 l C LPCA % Out S
HELM 95 89.1 0.0 20 0.5 1.2 20 200 103 105

ELM 63 26.6 1.0 4.8 0.4 1.2 200 105
PCA-ELM 57 14.6 0.5 3.1 1.4 1.1 400 15
SVM 87 73.6 0.0 1.04 169 1.1 1 5
PCA-SVM 69 39.8 1.2 1.03 41 1.1 15 1 1
DBN 78 56.2 0.0 40 5.5 1.1 10 50
TP: true positives; FP: false positives; HELM: hierarchical extreme learning machine; ELM: extreme learning machine; SVM: support vector machine; PCA:
principal component analysis; DBN: deep belief network. The bold-font numbers highlight the best performing models in terms of each evaluation measure.
Figure 4. Generator data case study: distance to normal conditions for the six approaches against time (days): (a) HELM (TP:
89.1%; FP: 0.0%), (b) ELM (TP: 26.6%; FP: 1.0%), (c) PCA-ELM (TP: 35.6%; FP: 3.2%), (d) SVM (TP: 73.6%; FP: 0.0%), (e) PCA-SVM
(TP: 39.8%; FP: 1.2%) and (f) DPN (TP: 56.3%; FP: 0.0%).
Blue points represent the training, yellow points represent the points used for computing the threshold and red points represent the testing. The
black horizontal lines correspond to the threshold in equation (7). The two vertical black lines represent t = 169 days (lower level fault) and
t = 247 days (upper level fault). The Y-axis is rescaled such that the threshold is 1.
Based on Figure 4, for all models but SVM, the dis- learning approach using engineered features as input.
tance to normal class increases after the lower level fault Doing so, ELM achieves an accuracy of 88% (TP:
(from day 169 to day 247) and again after the upper 75.1%; FP: 0.0%) and a magnification of 20 and SVM
level fault (after day 247). The output of the one-class achieves an accuracy of 80% (TP: 60.2%; FP: 0.0%)
classifier behaves similarly to a HI. and a magnification of 6. This approach brings ELM
results very close to those achieved by HELM in the
previous section. For SVM, accuracy decreases but the
HELM versus expertise higher magnification allows now to distinguish between
Experts who analysed the power plant operation could lower and higher level faults.
identify the fault starting at day 169 thanks to eight par- This approach, however, is based on the measure-
ticular signals. Using these signals as inputs to the ELM ments selected by the experts a posteriori to the fault. It
and SVM models mimics the traditional machine could probably detect similar faults in the future but it
may miss other types of faults, impacting other mea- particularly beneficial for industrial applications with
surements. This flaw is not shared by HELM which high-dimensional signals having complex relationships,
does not require any a posteriori knowledge. experiencing very few but critical faults and lacking
representative datasets covering all possible operating
conditions and fault types.
Discussion and conclusion For future work, the question of a priori hyperpara-
meter setting remains open. While in this study, results
Both case studies analysed in this research demonstrate
a better performance of HELM compared to that of were presented for the models that would maximise
other methods. Tested on high-dimensional signals the accuracies a posteriori, it is crucial for diagnostics
with varied characteristics and faults, HELM raises engineers to know how to set the model prior to the
consistent and robust alarms in faulty system condi- occurrence of faulty system conditions. Nonetheless,
tions. It also enables tracking in time of the evolution we have seen in the simulated case that HELM trained
of the unhealthy system states. The performance of with a good set of parameters is able to detect many
other tested methods is more impacted by the dimen- different kinds of faults. Therefore, inducing artificial
sionality of the input data, the characteristics of the sig- faults on real condition monitoring data could help find-
nals and faults. The dimensionality reduction step with ing a good set of hyperparameters that will likely be valid
PCA, for example, has proven in other experiments to for other types of faults. The investigation of the number
be particularly suitable for signals with linear relation- of AE layers on the feature learning performance is also
ships. However, in the presence of non-linear relation- left for future works. In this contribution, a single feature
ships, the ability of PCA to extract non-correlated learning layer was sufficient to demonstrate a superior
components decreases significantly. Other deep learn- performance. Additional feature learning layers could
ing approaches, such as DBN, can also provide excel- improve results or be necessary if the dimensionality of
lent results but are usually highly dependent on the the dataset is higher than in the present case.
selection of hyperparameters. Experimentally, training
an efficient DBN necessitates a lot of fine-tuning and Declaration of conflicting interests
has proven to be a difficult task. The author(s) declared no potential conflicts of interest
In both case studies, HELM demonstrates a perfor- with respect to the research, authorship and/or publica-
mance that is either superior or similar to that of the tion of this article.
other approaches on all the datasets. It is, in addition,
much faster and more robust. The generator case study
demonstrated, in addition, that HELM is able to learn Funding
the relevant features also from real condition monitor- The author(s) disclosed receipt of the following finan-
ing signals and to detect the fault at its earliest stage, cial support for the research, authorship and/or publi-
with higher accuracy compared to the manually selected cation of this article: This study was supported by the
features. This is a promising advantage as the dimen- Swiss Commission for Technology and Innovation
sionality of the condition monitoring data tends to under grant number 17833.2 PFES-ES.
increase rapidly since more parameters tend to be moni-
tored due to the decreasing costs of sensors. Features ORCID iD
are, therefore, harder to engineer manually and feature
Gabriel Michau https://orcid.org/0000-0001-6882-
selection becomes more challenging with the increasing
2906
number of features.
HELM integrates a feature learning step which
extracts the high-level representations of the condition References
monitoring signals and improves the detection accu- 1. Bengio Y, Courville A and Vincent P. Representation
racy. HELM is able to handle raw condition monitor- learning: a review and new perspectives. IEEE T Pattern
ing signals and does not require knowledge on the Anal 2013; 35: 1798–1828.
faults as it is trained on normal system state data solely. 2. Shao H, Jiang H, Zhao H, et al. A novel deep autoenco-
Its output can be interpreted as a distance to the nor- der feature learning method for rotating machinery fault
mal system condition, or a HI, and provides, as such, diagnosis. Mech Syst Signal Pr 2017; 95: 187–204.
more information over time than the typically applied 3. Valada A, Spinello L and Burgard W. Deep feature learn-
ing for acoustics-based terrain classification. In: Bicchi A
binary decision process (in cases for which the problem
and Burgard W (eds) Robotics Research. Cham: Springer,
is formulated as a binary classification task). Detecting
2018, pp.21–37.
very early anomalies (months before the faults occur, as 4. Li P, Chen Z, Yang LT, et al. Deep convolutional com-
in the generator case study) enables to realise savings in putation model for feature learning on big data in Inter-
terms of operational costs that can reach millions of net of Things. IEEE T Ind Inform 2018; 14: 790–798.
euros. HELM satisfies, therefore, a lot of the require- 5. Khan S and Yairi T. A review on the application of deep
ments of a good decision support tool for diagnostics learning in system health management. Mech Syst Signal
engineers. The proposed approach has shown to be Pr 2018; 107: 241–265.
Michau et al. 115
6. Ince T, Kiranyaz S, Eren L, et al. Real-time motor fault of the Prognostics and Health Management Society,
detection by 1-D convolutional neural networks. IEEE T (Vol. 4), 4–6 July 2018, Utrecht, The Netherlands.
Ind Electron 2016; 63: 7067–7075. 20. Miotto R, Li L, Kidd BA, et al. Deep patient: an unsu-
7. Krummenacher G, Ong CS, Koller S, et al. Wheel defect pervised representation to predict the future of patients
detection with machine learning. IEEE T Intell Transp from the electronic health records. Sci Rep 2016; 6:
2018; 19: 1176–1187. 26094.
8. Li X, Ding Q and Sun JQ. Remaining useful life estima- 21. Tang J, Deng C and Huang GB. Extreme learning
tion in prognostics using deep convolution neural net- machine for multilayer perceptron. IEEE T Neur Net
works. Reliab Eng Syst Safe 2018; 172: 1–11. Lear 2016; 27: 809–821.
9. Zhao R, Yan R, Wang J, et al. Learning to monitor 22. Huang GB, Zhu QY and Siew CK. Extreme learning
machine health with convolutional bi-directional LSTM machine: a new learning scheme of feed-forward neural
networks. Sensors 2017; 17: 273. networks. In: 2004 IEEE international joint conference on
10. Zhang J, Wang P, Yan R, et al. Deep learning for neural networks (IEEE Cat. No.04CH37541), vol. 2,
improved system remaining life prediction. Proc CIRP Budapest, 25–29 July 2004, pp.985–990. New York:
2018; 72: 1033–1038. IEEE.
11. Chen Z and Li W. Multisensor feature fusion for bearing 23. Huang GB, Chen L, Siew CK, et al. Universal approxi-
fault diagnosis using sparse autoencoder and deep belief mation using incremental constructive feedforward net-
network. IEEE T Instrum Meas 2017; 66: 1693–1702. works with random hidden nodes. IEEE Trans Neural
12. Zhao G, Zhang G, Liu Y, et al. Lithium-ion battery Netw 2006; 17: 879–892.
remaining useful life prediction with Deep Belief Net- 24. Huang GB. What are extreme learning machines? Filling
work and Relevance Vector Machine. In: 2017 IEEE the gap between frank Rosenblatt’s dream and john von
international conference on prognostics and health manage- Neumann’s puzzle. Cogn Comput 2015; 7: 263–278.
ment (ICPHM), Dallas, TX, 19–21 June 2017, pp.7–13. 25. Vincent P, Larochelle H, Lajoie I, et al. Stacked denoising
New York: IEEE. autoencoders: learning useful representations in a deep
13. Yoon AS, Lee T, Lim Y, et al. Semi-supervised learning network with a local denoising criterion. J Mach Learn
with deep generative models for asset failure prediction, Res 2010; 11: 3371–3408.
2017, https://arxiv.org/abs/1709.00845 26. Hastie T, Tibshirani R and Tibshirani RJ. Extended
14. Moya MM and Hush DR. Network constraints and comparisons of best subset selection, forward stepwise
multi-objective optimization for one-class classification. selection, and the Lasso, 2017, https://arxiv.org/abs/
Neural Netw 1996; 9: 463–474. 1707.08692
15. Malhotra P, Vishnu TV, Ramakrishnan A, et al. Multi- 27. Cambria E, Huang GB, Kasun LLC, et al. Extreme learn-
sensor prognostics using an unsupervised health index ing machines [trends & controversies]. IEEE Intell Syst
based on LSTM encoder-decoder, 2016, https://arxiv.org/ 2013; 28: 30–59.
abs/1608.06154 28. Beck A and Teboulle M. A fast iterative shrinkage-
16. Yang Y, Wu QMJ and Wang Y. Autoencoder with inver- thresholding algorithm for linear inverse problems.
tible functions for dimension reduction and image recon- SIAM J Imaging Sci 2009; 2: 183–202.
struction. IEEE T Syst Man Cy-S 2017: 48: 1065–1079. 29. Leng Q, Qi H, Miao J, et al. One-class classification with
17. Cao L-L, Huang W-B and Sun F-C. Building feature extreme learning machine. Math Probl Eng 2015; 2015:
space of extreme learning machine with sparse denoising e412957.
stacked-autoencoder. Neurocomputing 2016; 174(Part A): 30. Hu Y, Palmé T and Fink O. Deep health indicator extrac-
60–71. tion: a method based on auto-encoders and extreme
18. Michau G, Palmé T and Fink O. Deep feature learning learning machines. In: Annual conference of the prognos-
network for fault detection and isolation. In: Annual con- tics and health management society, vol. 7, Denver, CO,
ference of the prognostics and health management society, USA, 3–6 October 2016, pp. 446–452. PMH Society.
St. Petersburg, FL, 2–5 October 2017, pp. 108–118. PHM 31. LeCun Y, Bengio Y and Hinton G. Deep learning.
Society. Nature 2015; 521: 436–444.
19. Michau G, Palmé T and Fink O. Fleet PHM for critical 32. Hinton GE, Osindero S and Teh YW. A fast learning
systems: bi-level deep learning approach for fault detec- algorithm for deep belief nets. Neural Comput 2006; 18:
tion. In: Proceedings of the Fourth European Conference 1527–1554.

Feature Learning For Fault Detection in High-Dimensional Condition Monitoring Signals

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Feature Learning For Fault Detection in High-Dimensional Condition Monitoring Signals

Uploaded by

Copyright:

Available Formats

Original Article

Proc IMechE Part O:

Gabriel Michau1 , Yang Hu2, Thomas Palmé3 and Olga Fink1

Date received: 13 November 2018; accepted: 12 July 2019

Introduction growing. These challenges have fostered the develop-

Neural networks variables

normal and the faulty class. As such, this distance is

Figure 3. ROC curve (f = log and n = 5). Each point of each

from right to left.

limit the efficiency and impact of PCA. Finally, the

DBN shows robust results for n = 5 and n = 10, but

with low accuracies.

Table 3 presents the accuracies and magnification

coefficient for the set of hyperparameters with best

responding set of hyperparameters and the training

lar that HELM is the most robust model: with a single

set of hyperparameters, it achieves almost optimal

95% and very strong magnification coefficients (up to

before the SVM reduces the time needed for training

As shown in Figure 3, the sensitivity analysis con-

has strongly decreasing performances as gamma

increases. For HELM, g 2 ½1:52:5 outperforms most

Real-application case study: generator

For the real case study, an H2-cooled generator from

tor is working on nominal load, with changing operat-

ing conditions and thus, variations in the parameters.

X = log (a X Base ½: , b) + E

310 sensors recording every 5 min (K = 55774 and

1. The rotor shaft voltage is used to detect shaft

grounding problems, shaft rubbing, electro-ero-

sion, bearing isolation problems and rotor inter-

Time (s) 2. Rotor flux is used to detect the occurrence, the

main insulation, loose bars or contact as well as

15 hang support system.

5. Finally, the stator bar water temperature is also

The generator failure mode tested here has been a

intermittent short circuits in the rotor that remained

undetected, denoted in the following by lower level

fault, developing in a second faulty state or upper level

continuous one. This led to the power plant shut down.

Similarly, as in the previous case study, the data are

split into three parts: for training and threshold compu-

94%=6%) and for testing (from day 120 to day 169 to

compute FPR, the rest to compute TPR). The dataset

is hence split as follows: KTrain = 22, 017, KThr = 1406

and KTest = 4524+27, 827. Remind that D = 310. As

results need to be assessed on a single experiment, in

threshold after and before day 169, respectively.

Table 4 synthesises the accuracy, percentage of TP and

tion, the distance to the normal class (j1 Yj) of these

models is illustrated in Figure 4. In accordance with the

results obtained so far, HELM provides the best per-

formance. Its accuracy is higher and it is a robust

forms sensibly worse: numerous FP are raised; TP are

scarce, which is consistent with its much lower magnifi-

and a good rate of TP; yet its accuracy is smaller than

that of HELM and its magnification is very small

(300 times slower). Performing a PCA before the classi-

consequence of the inherent non-linear relationships

between the measurements. The DBN has also a lower

accuracy compared to HELM. However, it has a stron-

Table 4. Performances on real dataset.

Acc TP (%) FP (%) Mag Time (s) Hyperparameters

increases. For HELM, g 2 ½1:52:5 outperforms most

X = log (a X Base ½: , b) + E