
State-of-the-Art Review

Probabilistic Inference for Structural Health Monitoring: New Modes of Learning from Data
Lawrence A. Bull, Ph.D.1; Paul Gardner, Ph.D.2; Timothy J. Rogers, Ph.D.3; Elizabeth J. Cross4; Nikolaos Dervilis, Ph.D.5; and Keith Worden6

1Dept. of Mechanical Engineering, Univ. of Sheffield, Mappin St., Sheffield S1 3JD, UK (corresponding author). ORCID: https://orcid.org/0000-0002-0225-5010. Email: l.a.bull@sheffield.ac.uk
2Dept. of Mechanical Engineering, Univ. of Sheffield, Mappin St., Sheffield S1 3JD, UK. ORCID: https://orcid.org/0000-0002-1882-9728. Email: p.gardner@sheffield.ac.uk
3Dept. of Mechanical Engineering, Univ. of Sheffield, Mappin St., Sheffield S1 3JD, UK. ORCID: https://orcid.org/0000-0002-3433-3247. Email: tim.rogers@sheffield.ac.uk
4Professor, Dept. of Mechanical Engineering, Univ. of Sheffield, Mappin St., Sheffield S1 3JD, UK. ORCID: https://orcid.org/0000-0001-5204-1910. Email: e.j.cross@sheffield.ac.uk
5Dept. of Mechanical Engineering, Univ. of Sheffield, Mappin St., Sheffield S1 3JD, UK. Email: n.dervilis@sheffield.ac.uk
6Professor, Dept. of Mechanical Engineering, Univ. of Sheffield, Mappin St., Sheffield S1 3JD, UK. Email: k.worden@sheffield.ac.uk

Abstract: In data-driven structural health monitoring (SHM), the signals recorded from systems in operation can be noisy and incomplete. Data corresponding to each of the operational, environmental, and damage states are rarely available a priori; furthermore, labeling to describe the measurements is often unavailable. In consequence, the algorithms used to implement SHM should be robust and adaptive, while accommodating missing information in the training data—such that new information can be included if it becomes available. By reviewing novel techniques for statistical learning (introduced in previous work), it is argued that probabilistic algorithms offer a natural solution to the modeling of SHM data in practice. In three case studies, probabilistic methods are adapted for applications to SHM signals, including semisupervised learning, active learning, and multitask learning. DOI: 10.1061/AJRUA6.0001106. © 2020 American Society of Civil Engineers.
Author keywords: Structural health monitoring (SHM); Statistical machine learning; Pattern recognition; Semisupervised learning; Active
learning; Multitask learning; Transfer learning.

Introduction: Probabilistic SHM

Under the pattern recognition paradigm associated with structural health monitoring (SHM) (Farrar and Worden 2012), data-driven methods have been established as a primary focus of research. Various machine learning tools have been applied in the literature (for example, Vanik et al. 2000; Sohn et al. 2003; Chatzi and Smyth 2009) and used to infer the health or performance state of the monitored system, either directly or indirectly. Generally, algorithms for regression, classification, density estimation, or clustering learn patterns in the measured signals (available for training), and the associated patterns can be used to infer the state of the system in operation, given future measurements (Worden and Manson 2006).

Unsurprisingly, there are numerous ways to apply machine learning to SHM. Notably (and categorized generally), advances have focused on various probabilistic (e.g., Vanik et al. 2000; Ou et al. 2017; Flynn and Todd 2010) and deterministic (e.g., Bornn et al. 2009; Zhao et al. 2019; Janssens et al. 2018) methods. Each approach has its advantages; however, considering certain challenges associated with SHM data (outlined in the next section), the current work focuses on probabilistic (i.e., statistical) tools: these algorithms appear to offer a natural solution to some key issues, which can otherwise prevent practical implementation. Additionally, probabilistic methods can lead to predictions under uncertainty (Papoulis 1965)—a significant advantage in risk-based applications.

SHM, Uncertainty, and Risk

It should be clear that measured/observed data in SHM will be inherently uncertain to some degree. Uncertainties can enter via experimental sources, including limitations to sensor accuracy, precision, or human error; further uncertainties will be associated with the model—machine learning or otherwise—including parametric variability, model discrepancy, and interpolation uncertainty. Considering the implications of risk, financially and in terms of safety, uncertainty should be mitigated (during data acquisition) and quantified (within models) as far as possible to inform decision-making (Zonta et al. 2014; Cappello et al. 2015). That is, when supporting a financial or safety-critical decision, predictions should be presented with confidence: clearly, a certain prediction, which implies a system is safe to use, differs significantly from an uncertain prediction supporting the same decision. If there is no attempt to quantify the associated uncertainties, there is no distinction between these scenarios.

Various methods can return predictions with confidence (or credibility) (Murphy 2012). The current work focuses on probabilistic models, which—under Kolmogorov's axioms (Papoulis 1965)—allow for predictions under well-defined uncertainty, provided the model assumptions are appropriate.

Probabilistic Approach

Discussions in this work will consider the general strategy illustrated in Fig. 1. That is, SHM is viewed as a multiclass problem, which categorizes measured data into groups, corresponding to the condition of the monitored system. The ith input, denoted by x_i, is defined by a d-dimensional vector of variables, which represents an observation of the system, such that x_i ∈ R^d. The data labels y_i are used to specify the condition of the system, directly or indirectly. Machine learning is introduced via the pattern recognition model, denoted f(·), and is used to infer relationships between the input and output variables to inform predictive maintenance.
Fig. 1. Simplified framework for pattern recognition within SHM.

The inputs x_i are assumed to be represented by some random vector X (in this case, a continuous random vector), which can take any value within a given feature-space 𝒳. The random vector is therefore associated with an appropriate probability density function (p.d.f.), denoted p(·), such that the probability P of X falling within the interval a < X ≤ b is

$$P(a < X \leq b) = \int_a^b p(x_i)\,dx_i, \quad \text{where } p(x_i) \geq 0 \text{ and } \int_{\mathcal{X}} p(x_i)\,dx_i = 1$$

For a discrete classification problem, the labels y_i are represented by a discrete random variable Y, which can take any value from the finite set y_i ∈ 𝒴 = {1, ..., K}. Note that discrete classification is presented in this work, although SHM is regularly informed by regression models, i.e., y_i is continuous; this is application-specific, and most of the motivational arguments remain the same. Here, K is the number of classes defining the (observed) operational, environmental, and health conditions, while 𝒴 denotes the label-space. An appropriate probability mass function (p.m.f.), also denoted p(·), is such that

$$P(Y = y_i) = p(y_i), \quad \text{where } 0 \leq P(Y = y_i) \leq 1 \text{ and } \sum_{y_i \in \mathcal{Y}} P(Y = y_i) = 1$$

Note that the context should make the distinction between p.m.f.s and p.d.f.s clear. Further details regarding the probability theory for pattern recognition can be found in a number of well-written textbooks—for example, Murphy (2012), Barber (2012), and Gelman et al. (2013).
Layout

The section "Incomplete Data and Missing Information" summarizes the most significant challenges for data-driven SHM, while the section "New Modes of Probabilistic Inference" suggests probabilistic methods to mitigate these issues. The section "Directed Graphical Models" introduces the theory behind directed graphical models (DGMs), which will be used to introduce each method formally. The section "Case Studies" collects four case studies to highlight the advantages of probabilistic inference. Active learning and Dirichlet process clustering are applied to the Z24 bridge data. Semisupervised learning is applied to data recorded during ground vibration tests of a Gnat aircraft. Multitask learning is applied to simulated and experimental data from shear-building structures.

Note that the applications presented in this study were introduced in previous work by the authors. The related SHM literature is referenced in the descriptions of each mode of inference.

Incomplete Data and Missing Information

Arguably, the most significant challenge when implementing pattern recognition for SHM is missing information. Primarily, it is difficult to collect data that might represent damage states or the system in extreme environments (such as earthquakes) a priori; data are usually only available for a limited subset of the possible conditions for training algorithms (Farrar and Worden 2012). As a result, conventional methods are restricted to novelty detection, as the information that is required to inform multiclass predictive models [that can localize and classify damage, as well as detect it (Worden and Manson 2006)] is unavailable or not obtained.

For the measurements x_i that are available—as well as those that are recorded during operation (in situ)—labels to describe what the signals represent, y_i, are rarely at hand. This missing information is usually due to the cost associated with manually inspecting structures (or data), as well as the practicality of investigating each observation. The absence of labels makes defining and updating (multiclass) machine learning models difficult, particularly in the online setting, as it can become difficult to determine if/when novel, valuable information has been recorded and what it represents (Bull et al. 2019b). For example, consider streaming data recorded from a subsea pipeline. Comparisons of measured data to the model might indicate novelty; however, without labels, it is difficult to include this new information in a supervised manner: the measurements might represent another operational condition, abnormal wave loads, actual damage, or some other condition.

New Modes of Probabilistic Inference

New modes of probabilistic inference are being proposed to address challenges with SHM data. Specifically, the algorithms focus on probabilistic frameworks to deal with limited labeled data, as well as incomplete measured data that only correspond to a subset of the expected conditions in situ.

Partially-Supervised Learning

Partially-supervised learning allows multiclass inference in cases in which labeled data are limited. Missing label information is especially relevant to practical applications of SHM: while fully labeled data are often infeasible, it can be possible to include labels for a limited set (or budget) of measurements. Typically, the budget is limited by some expense incurred when investigating the signals; this might include direct costs associated with inspection or loss of income due to downtime (Bull et al. 2020b).

Generally speaking, partially-supervised methods can be used to perform multiclass classification while utilizing both labeled D_l and unlabeled D_u signals within a unifying training scheme (Schwenker and Trentin 2014). As such, the training set D becomes

$$D = D_l \cup D_u \tag{1}$$

$$= \{X, y\} \cup \tilde{X} \tag{2}$$

$$\{X, y\} \triangleq \{x_i, y_i\}_{i=1}^{n} \tag{3}$$

$$\tilde{X} \triangleq \{\tilde{x}_i\}_{i=1}^{m} \tag{4}$$

Active and semisupervised techniques are suggested—as two variants of partially-supervised learning—to combine/include information from labeled and unlabeled SHM data (Bull et al. 2018, 2019b, 2020b).

Semisupervised Learning

Semisupervised learning utilizes both the labeled and unlabeled data to inform a classification mapping, f: 𝒳 ↦ 𝒴. Often, a semisupervised learner will use information in D_u to further update/constrain a classifier learned from D_l (McCallum and Nigam 1998), or, alternatively, partial supervision can be implemented as constraints on an unsupervised clustering algorithm (Chapelle et al. 2006). This work focuses on classifier-based methods; however, constraints on clustering algorithms are discussed in subsequent sections.
Arguably, the most simple/intuitive method to introduce unlabeled data is self-labeling (Zhu 2005). In this case, a classifier is trained using D_l, which is used to predict labels for the unlabeled set D_u. This defines a new training-set—some labels in D are the ground truth from the supervised data, and the others are pseudo-labels, predicted by the classifier. Self-labeling is simple, and it can be applied to any supervised method; however, the effectiveness is highly dependent on the method of implementation and the supervised algorithm within it (Chapelle et al. 2006).
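To make the self-labeling heuristic concrete, a minimal sketch is given below, assuming a Gaussian naive Bayes model as the (interchangeable) supervised classifier; the class and function names are illustrative, not code from the cited studies.

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian naive Bayes: one diagonal Gaussian per class."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == k].mean(axis=0) for k in self.classes])
        self.var = np.array([X[y == k].var(axis=0) + 1e-6 for k in self.classes])
        self.prior = np.array([np.mean(y == k) for k in self.classes])
        return self

    def predict(self, X):
        # per-class log-likelihood log p(x|y=k), features assumed independent
        ll = -0.5 * (((X[:, None, :] - self.mu) ** 2) / self.var
                     + np.log(2 * np.pi * self.var)).sum(axis=2)
        return self.classes[np.argmax(ll + np.log(self.prior), axis=1)]

def self_label(X_l, y_l, X_u):
    """Train on labeled data, pseudo-label the unlabeled set, retrain on both."""
    clf = GaussianNB().fit(X_l, y_l)
    y_pseudo = clf.predict(X_u)              # pseudo-labels for D_u
    X_all = np.vstack([X_l, X_u])            # D = D_l ∪ D_u
    y_all = np.concatenate([y_l, y_pseudo])
    return GaussianNB().fit(X_all, y_all)
```

Any supervised model could be swapped in for the naive Bayes classifier, which is precisely why the effectiveness of self-labeling depends so strongly on the algorithm inside the loop.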
Generative mixture models offer a formal probabilistic framework to incorporate unlabeled data (Cozman et al. 2003; Nigam et al. 1998). Generative mixtures apply the cluster assumption: if points are in the same cluster, they are likely to be of the same class. Note that the cluster assumption does not necessarily imply that each class is represented by a single, compact cluster; instead, the implication is that observations from different classes are unlikely to appear in the same cluster (Chapelle et al. 2006). Through density estimation (Barber 2012), a mixture of base-distributions can be used to estimate the underlying distribution of the data, p(x_i, y_i), and unlabeled observations can be included in various ways (McCallum and Nigam 1998; Vlachos et al. 2009). For example, the expectation-maximization (EM) algorithm [used to learn mixture models in the unsupervised case (Murphy 2012)] can be modified to incorporate labeled observations (Nigam et al. 1998; McCallum and Nigam 1998). Fig. 2 demonstrates how a Gaussian mixture, given acoustic emission (AE) data (Rippengill et al. 2003), can be improved by considering the surrounding unlabeled examples (via EM).

Fig. 2. Semisupervised GMM for three-class AE data: (a) supervised learning, given the labeled data only, closed circle markers; and (b) semisupervised learning, given the labeled and unlabeled data, closed circle/open circle markers. [Adapted from Bull (2019a).]

To summarize, semisupervised methods allow algorithms to learn from the information in the available unlabeled measurements as well as a limited set of labeled data. In practice, semisupervised inference implies that the cost associated with labeling data could be managed in SHM (Chen et al. 2013, 2014), as the information in a small set of labeled signals is combined with larger sets of unlabeled data (Bull et al. 2019c).

Active Learning

Active learning is an alternative partially-supervised method; the key hypothesis is that an algorithm can provide improved performance, using fewer training labels, if it is allowed to select the data from which it learns (Settles 2012). As with semisupervised techniques, the learner utilizes D_l and D_u; however, active algorithms query/annotate the unlabeled data in D_u to extend the labeled set D_l. Therefore, an active learner attempts to define an accurate mapping, f: 𝒳 ↦ 𝒴, while keeping queries to a minimum (Dasgupta 2011); general (and simplified) steps are illustrated in Fig. 3.

Fig. 3. General/simplified active learning heuristic.

The critical step for active algorithms is how to select the most informative signals to investigate (Wang et al. 2017; Schwenker and Trentin 2014). For example, query by committee (QBC) methods build an ensemble/committee of classifiers using a small, initial (random) sample of labeled data, leading to multiple predictions for unlabeled instances. Observations with the most conflicted label predictions are viewed as informative, and thus, they are queried (Wang et al. 2017). On the other hand, uncertainty sampling usually refers to a framework that is based around a single classifier (Kremer et al. 2014; Settles 2012), in which signals with the least confident predicted label, given the model, are queried. (It is acknowledged that QBC methods can also be viewed as a type of uncertainty sampling.) Uncertainty sampling is (perhaps) most interpretable when considering probabilistic algorithms, as the posterior probability over the class-labels p(y_i|x_i) can be used to quantify uncertainty/confidence (Bull et al. 2020c). For example, consider a binary (two-class) problem: intuitively, uncertain samples could be instances whose posterior probability is nearest to 0.5 for both classes. This view can be extended to multiple (>2) classes using the Shannon entropy (MacKay 2003) as a measure of uncertainty; high entropy (uncertain) signals given the GMM of the acoustic emission data (Rippengill et al. 2003) are illustrated in Fig. 4(a).

Fig. 4. Uncertainty sampling for the AE data: right, left, and down arrow markers show the training set, and the closed circle markers show the unlabeled data; the circles indicate queries by the active learner: (a) based on entropy; and (b) based on likelihood. [Adapted from Bull (2019a).]

In summary, as label information is limited by cost implications in practical SHM (Bull et al. 2019a), active algorithms can be utilized to automatically administer the label budget by selecting the most informative data to be investigated, such that the performance of predictive models is maximized (Bull et al. 2019d).
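A minimal sketch of the entropy-based query rule is given below; the helper names are illustrative, and any probabilistic classifier that returns the class posteriors p(y_i|x_i) could supply the `posterior` array.

```python
import numpy as np

def shannon_entropy(posterior):
    """Entropy of each row of class posteriors; posterior has shape (n, K)."""
    p = np.clip(posterior, 1e-12, 1.0)       # guard against log(0)
    return -(p * np.log(p)).sum(axis=1)

def query_most_uncertain(posterior, budget):
    """Indices of the `budget` highest-entropy (most uncertain) samples."""
    H = shannon_entropy(posterior)
    return np.argsort(H)[::-1][:budget]

# usage: posterior = model.predict_proba(X_u); idx = query_most_uncertain(posterior, 10)
```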

Dirichlet Process Mixture Models for Nonparametric Clustering

Dirichlet process (DP) mixture models (Neal 2000) offer another probabilistic framework to deal with limited labels as well as incomplete data a priori. The DP is suggested as an (unsupervised) Bayesian algorithm for nonparametric clustering, used to perform inference online such that the need for extensive training data (before implementing the SHM strategy) is mitigated (Rogers et al. 2019). As such, unlike partially-supervised methods, labels are always an additional latent variable (they are never observed); thus, the ground truth of y_i is not known during inference. However, label information has the potential to be incorporated, either within the SHM strategy (Rogers et al. 2019) or at the algorithm level to define a semisupervised DP (Vlachos et al. 2009). Conveniently, Bayesian properties of the DP allow the incorporation of prior knowledge and updates of belief, given the observed data. The aim is to avoid the need for comprehensive training data, while retaining the flexibility to include any available data formally as prior knowledge. Additionally, as there is a reduction in the number of user-tuned parameters, models can be implemented to perform powerful online learning with minimal a priori input/knowledge in terms of access to data or a physical model (Rogers et al. 2019).

Dirichlet Process Clustering

A popular analogy to describe the DP (for clustering) considers a restaurant with an infinite number of tables (Aldous 1985) (i.e., clusters in 𝒴). Customers—resembling observations in 𝒳—arrive and sit at one of the tables (according to some probability), which are either occupied or vacant. As a table becomes more popular, the probability that customers join it increases. The seating arrangement can be viewed to represent a DP mixture. Importantly, the probability that a new vacant table is chosen (over an existing table) is defined by a hyperparameter α, associated with the DP. In consequence, α is sometimes referred to as the dispersion value—high values lead to an increased probability that new tables (clusters) are formed, while low values lead to fewer tables, as new tables are less likely to be initiated.

The analogy should highlight a useful property of DP mixtures: the number of clusters K (i.e., tables) does not need to be defined in advance; instead, this is determined by the model and the data (as well as α) (Vlachos et al. 2009). As a result, the algorithm can be particularly useful when clustering SHM signals online, as the model can adapt and update, selecting the most appropriate value for K as new information becomes available.
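The table-assignment probabilities of this analogy are simple to state: an existing table k is chosen with probability proportional to its occupancy n_k, while a new table is chosen with probability proportional to α. A minimal simulation (illustrative code, not from the cited implementations) is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_assign(counts, alpha):
    """Sample a table for one new customer under the Chinese restaurant process.

    counts: current table occupancies, e.g. [5, 2]
    alpha:  dispersion hyperparameter of the DP
    Returns the chosen table index; an index of len(counts) means a new table.
    """
    weights = np.array(counts + [alpha], dtype=float)
    probs = weights / weights.sum()          # p(existing k) ∝ n_k, p(new) ∝ α
    return rng.choice(len(probs), p=probs)

# simulate 100 customers: higher alpha tends to create more tables (clusters)
counts = []
for _ in range(100):
    k = crp_assign(counts, alpha=1.0)
    if k == len(counts):
        counts.append(1)                     # open a new table
    else:
        counts[k] += 1
print(counts)  # table occupancies; the number of tables was inferred, not fixed
```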
To demonstrate, consider a mixture of Gaussian base-distributions; a conventional finite mixture (a GMM) requires the number of components K to be defined a priori, as in the supervised Gaussian mixture model (GMM) with K = 3, shown in Figs. 2 and 4. As suggested by the analogy, a DP can be interpreted as an infinite mixture, such that K → ∞ (Rasmussen 2000); this allows for the probabilistic inference of K through the DP prior. An example DP-GMM for the same AE data (Rippengill et al. 2003) is shown in Fig. 5(a); the most likely number of components has been automatically found, K = 3, given the data and the model for α = 0.1. The effect of the dispersion hyperparameter α can be visualized in Fig. 5(b), which shows the posterior-predictive-likelihood of K given the data for various values of α. Considering that K = 3, an appropriate hyperparameter range appears to be 0.01 ≤ α ≤ 0.1; although, as each class is clearly non-Gaussian, higher values of K are arguably more appropriate to approximate the underlying density of the data. Interestingly, for low values of α, three components appear significantly more likely to describe the data than two (or one).

Fig. 5. Unsupervised Dirichlet process Gaussian mixture model for the three-class AE data: (a) unsupervised DP clustering, closed circle/open circle markers are the ground-truth/predicted values for y_i; and (b) predictive likelihood for the number of clusters K given α, i.e., p(K|D, α).

For SHM in practice, the implementation of the DP for online clustering means that an operator does not need to specify an expected number of normal, environmental, or damage conditions (components K) in order to build the model, which can be difficult or impossible to define for a structure in operation (Rogers et al. 2019).

Transfer and Multitask Learning

Finally, methods for transfer (Gao and Mosalam 2018; Gardner et al. 2020c; Jang et al. 2019) and multitask (Wan and Ni 2019; Huang et al. 2019) learning are proposed for inference with incomplete or limited training-data. In general terms, the idea for SHM applications is that valuable information might be transferred or shared, in some sense, between similar systems (via measured and/or simulated data). By considering shared information, the performance of predictive models might improve, despite insufficient training observations (Chakraborty et al. 2011; Ye et al. 2017; Dorafshan et al. 2018). For example, consider wind turbines in an offshore wind-farm; one system may have comprehensively labeled measurements, investigated by the engineer, corresponding to a range of environmental effects; other turbines within the farm are likely to experience similar effects. However, the measured signals might be incomplete, with partial labeling or no labels at all.

Various tools (Pan and Yang 2010) offer frameworks to transfer different aspects of shared information. For the methods discussed in this study, it is useful to define two objects (Gardner et al. 2020c):
• A domain D = {𝒳, p(x_i)} is an object that consists of a feature space 𝒳 and a marginal probability distribution p(x_i) over a finite sample of feature data {x_i}_{i=1}^n ∈ 𝒳.
• A task T = {𝒴, f(·)} is a combination of a label space 𝒴 and a predictive model/function f(·).
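In code, these two objects are little more than containers; the sketch below (illustrative names and types, assumed for this review rather than taken from the cited work) makes the notation concrete for the transfer settings that follow.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Domain:
    """D = {X, p(x_i)}: a feature space with a finite sample drawn from it.

    The marginal p(x_i) is represented implicitly by the sample X.
    """
    X: np.ndarray            # feature samples {x_i}, shape (n, d)

@dataclass
class Task:
    """T = {Y, f(.)}: a label space and a predictive model."""
    labels: tuple            # label space Y, e.g. (1, 2, 3)
    f: Callable              # predictive model f: X -> Y

# source domain with labels; target domain unlabeled (domain adaptation);
# note the feature dimensions may differ between domains
source = Domain(X=np.random.randn(100, 4))
target = Domain(X=np.random.randn(80, 6))
```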
Domain adaptation is one approach to transfer learning, following a framework that maps the distributions from feature/label spaces (i.e., 𝒳/𝒴) associated with different structures into a shared (more consistent) space. The observations are typically labeled for one structure only, and therefore, a predictive model f(·) can be learned, such that label information is transferred between domains. The domain with labeled data is referred to as the source domain D_s [shown in Fig. 6(a)], while the unlabeled data correspond to the target domain D_t [shown in Fig. 6(b)].

Fig. 6. Visualization of knowledge transfer via domain adaptation. Ellipses represent clusters of data: (a and b) are the source and target domains, respectively, in their original sample spaces; and (c) shows the source and target data mapped into a shared, more consistent latent space.

Importantly, a classifier f(·) applied in the projected latent space of Fig. 6(c) should generalize to the target structure, despite missing label information.

Multitask learning considers shared information from an alternative perspective. As with domain adaptation, knowledge from multiple domains is used to improve tasks (Pan and Yang 2010); however, in this case, each domain is weighted equally (Zhang and Yang 2018). Therefore, the goal is to generate an improved predictive function f(·) across multiple tasks by utilizing labeled feature data from several different source domains. This approach to inference is particularly useful when labeled training data are insufficient across multiple tasks or systems. By considering the shared knowledge across various labeled domains, the amount of the training data can, in effect, be increased.

This work suggests kernelized Bayesian transfer learning (KBTL) (Gönen and Margolin 2014) to model shared information. KBTL is a particular form of multitask learning, which can be viewed as a method for heterogeneous transfer; i.e., at least one feature space 𝒳_j for a domain D_j is not the same dimension as another feature space 𝒳_k (in the set of domains), such that d_j ≠ d_k (Gardner et al. 2020c). KBTL is a probabilistic method that performs two tasks: (1) finding a shared latent subspace for each domain; and (2) inferring a discriminative classifier in the shared latent subspace in a Bayesian manner. It is assumed that there is a relationship between the feature space and the label space for each domain, and that all domains provide knowledge that will improve the predictive function f(·) for all domains (Gardner et al. 2020c).

In practice, methods such as KBTL should be particularly useful for SHM, as the (labeled) training data are often insufficient or incomplete across structures. If, through multitask/transfer learning, tasks from different structures can be considered together, this should increase the amount of information available to train algorithms. In turn, this should increase the performance of predictive models, utilizing the shared information between systems.

Directed Graphical Models

It will be useful to introduce basic concepts behind directed graphical models (DGMs), as these will be used to (visually) introduce each probabilistic algorithm. The terminology in this study follows that of Murphy (2012). Generally speaking, DGMs can be used to represent the joint distribution of the variables in a statistical model by making assumptions of conditional independence. For these ideas to make sense, the chain rule is needed; that is, the joint distribution of a probabilistic model can be represented as follows, using any ordering of the variables {X_1, X_2, ..., X_V}:

$$p(X_{1:V}) = p(X_1)\,p(X_2 \mid X_1)\,p(X_3 \mid X_1, X_2) \cdots p(X_V \mid X_{1:V-1}), \qquad X_{1:V} \triangleq \{X_1, X_2, \ldots, X_V\} \tag{5}$$

In practice, a problem with Eq. (5) is that it becomes difficult to represent the conditional distribution p(X_V | X_{1:V−1}) as V gets large. Therefore, to efficiently approximate large joint distributions, assumptions of conditional independence in Eq. (6) are critical. Specifically, conditional independence is denoted with ⊥, and it implies that

$$A \perp B \mid C \iff p(A, B \mid C) = p(A \mid C)\,p(B \mid C) \tag{6}$$

Considering these ideas, nodes in a graphical model can be used to represent variables, while edges represent conditional dependencies. For example, for the AE data [in Figs. 2, 4, or 5(a)], one can consider a random vector x_i to describe the (two-dimensional)

measured features x_i = {x_i^(1), x_i^(2)} and a random variable y_i to represent the class label {1, 2, 3}. As a result, the joint distribution of an appropriate model might be p(x_i, y_i). To simplify matters, the features can be considered to be independent (an invalid but often acceptable assumption), i.e., x_i^(1) ⊥ x_i^(2) | y_i. This leads to the following approximation of the distribution of the model (for a single observation)

$$p(x_i, y_i) = p(x_i^{(1)} \mid y_i)\,p(x_i^{(2)} \mid y_i)\,p(y_i) \tag{7}$$

An appropriate distribution function p(·) can now be assigned to each of these densities (or masses). The DGM resulting from Eq. (7) is plotted in Fig. 7(a). In many cases, the features in x_i are the observed variables (measured), while the labels y_i are the latent (or hidden) variables that one wishes to infer. To visualize this, the observed and latent variables are shown by shaded/unshaded nodes, respectively, in Fig. 7(a). For high-dimensional feature vectors (e.g., d ≫ 2), plates can be used to represent conditionally-independent variables and avoid a cluttered graph, as shown in Fig. 7(b). Another plate with i = {1, ..., n} is included to represent independent and identically distributed data with n observations. The DGM now represents the whole dataset, which is a matrix of observed variables X = {x_1, ..., x_n} and the vector of labels denoted y = {y_1, ..., y_n}. This assumption implies that each sample was drawn independently from the same underlying distribution, such that the order in which data arrive makes no difference to the belief in the model, i.e., the likelihood of the dataset is

$$p(X, y) = \prod_{i=1}^{n} p(x_i^{(1)} \mid y_i)\,p(x_i^{(2)} \mid y_i)\,p(y_i) \tag{8}$$

Fig. 7. Examples of directed graphical models (DGMs) based on the AE data. Shaded and unshaded nodes represent observed/latent variables, respectively, arrows represent conditional dependencies, and boxes represent plates.

The corresponding DGM can be used to describe a (maximum likelihood) naïve Bayes classifier—a simplified version of the generative classifiers applied subsequently in this work.
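As a sketch of how the factorization of Eqs. (7) and (8) is evaluated in practice, the snippet below computes the joint log-likelihood of a small dataset under the feature-independence assumption; the distributions and parameter values are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import norm

# illustrative class-conditional parameters: two features, K = 3 classes
mu = np.array([[0.0, 1.0], [2.0, -1.0], [4.0, 0.5]])   # mu[k, j]: mean of feature j, class k
sd = np.ones((3, 2))                                   # unit standard deviations
prior = np.array([1 / 3, 1 / 3, 1 / 3])                # p(y = k)

def log_joint(X, y):
    """log p(X, y) = sum_i [log p(x_i1|y_i) + log p(x_i2|y_i) + log p(y_i)], Eq. (8)."""
    ll = np.log(prior[y])                               # log p(y_i)
    for j in range(X.shape[1]):                         # feature independence, Eq. (7)
        ll += norm.logpdf(X[:, j], mu[y, j], sd[y, j])
    return ll.sum()

X = np.array([[0.1, 0.9], [2.2, -1.3]])
y = np.array([0, 1])
print(log_joint(X, y))
```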
Case Studies

Semisupervised, active, and multitask learning, as well as DP clustering, are now demonstrated in case studies. A brief overview of the theory for each algorithm is provided with the corresponding DGMs; for details behind each algorithm, the reader is referred to the SHM application papers (Bull et al. 2019b, 2020b; Rogers et al. 2019; Gardner et al. 2020a, c).

Active Learning with Gaussian Mixture Models

A generative classifier is used to demonstrate probabilistic active learning. In this example—originally shown in Bull et al. (2020b)—a Gaussian mixture model (GMM) is used to monitor streaming data from a motorway bridge as if the signals were recorded online. The model defines a multiclass classifier to aid both damage detection and identification while limiting the number of (costly) system inspections.

Directed Graphical Model

As the data are being approximated by a Gaussian mixture model, when a new class k is discovered from the streaming data (following inspection), it is assigned a Gaussian distribution—Gaussian clusters like this can be visualized for the AE data in Fig. 2. Note that the first DGM is explained in detail to introduce the theory that is used throughout. The conditional distribution of the observations x_i given label y_i = k is, therefore

$$p(x_i \mid y_i = k) = \mathcal{N}(x_i; \mu_k, \Sigma_k) \tag{9}$$

where the semicolon notation (;) is used to indicate that a function is parameterized by the variables that follow—this is distinct from the bar notation (|) that implies a conditional probability; and k is used to index the class group, given the number of observed clusters at that time, k ∈ {1, ..., K}. As such, μ_k is the mean (center) and Σ_k is the covariance (scatter) of the cluster of data x_i with label k, for K Gaussian base-distributions.

A discrete random variable is used to represent the labels y_i, which is categorically distributed, parameterized by a vector of mixing proportions λ

$$p(y_i) = \mathrm{Cat}(y_i; \lambda) \tag{10}$$

The mixing proportions can be viewed as a histogram over the label values, such that λ = {λ_1, ..., λ_K} and p(y_i = k) = P(y_i = k) = λ_k.

The collected parameters of the model (from each component) are denoted by θ, such that θ = {Σ, μ, λ} = {Σ_k, μ_k, λ_k}_{k=1}^K; therefore, the joint distribution of the model could be written

$$p(x_i, y_i; \theta) = p(x_i \mid y_i; \theta)\,p(y_i; \theta) \tag{11}$$

However, to consider a complete model, a Bayesian approach is adopted. That is, the parameters θ themselves are considered to be random variables, and, therefore, they are included in the joint distribution (rather than simply parameterizing it)

$$p(x_i, y_i, \theta) = p(x_i \mid y_i, \theta)\,p(y_i \mid \theta)\,p(\theta) \tag{12}$$

$$= p(x_i \mid y_i, \Sigma, \mu)\,p(\Sigma, \mu)\,p(y_i \mid \lambda)\,p(\lambda) \tag{13}$$

This perspective has various advantages; importantly, it allows for the incorporation of prior knowledge regarding the parameters via the prior distribution p(θ). Additionally, when implemented correctly, Bayesian methods lead to robust, self-regularizing models (Rasmussen and Ghahramani 2001).

To provide analytical solutions, it is convenient to assign conjugate (prior) distributions over the parameters p(θ) = p(Σ, μ)p(λ). In this study, it is assumed that {Σ, μ} are independent from λ to define two conjugate pairs: one associated with the observations x_i and another with the labels y_i. For the mean μ_k and covariance Σ_k, a conjugate (hierarchical) prior is the normal inverse Wishart (NIW) distribution

$$p(\mu_k, \Sigma_k) = \mathrm{NIW}(\mu_k, \Sigma_k; m_0, \kappa_0, \nu_0, S_0) \tag{14}$$

This introduces the hyperparameters {m_0, κ_0, ν_0, S_0} associated with the prior, which can be interpreted as follows: m_0 is the prior mean for the location of each class μ_k, and κ_0 determines the strength of the prior; S_0 is (proportional to) the prior mean of the covariance Σ_k, and ν_0 determines the strength of that prior (Murphy 2012). Considering that the streaming data will be normalized (online), it is reasonable that the hyperparameters are defined such that the prior belief states that each class is represented by a zero-mean and unit-variance Gaussian distribution. For the mixing proportions, the conjugate prior is a Dirichlet (Dir) distribution, parameterized by α, which encodes the prior belief of the mixing proportion (or weight) of each class. In this case, each class is assumed equally weighted a priori for generality—although care should be taken when setting this prior, as it is application-specific, particularly for streaming data (Bull et al. 2019b)

$$p(\lambda) = \mathrm{Dir}(\lambda; \alpha) \propto \prod_{k=1}^{K} \lambda_k^{\alpha_k - 1} \tag{15}$$

$$\alpha \triangleq \{\alpha_1, \ldots, \alpha_K\} \tag{16}$$

With this information, the joint distribution of the model p(x_i, y_i, θ) can be approximated, such that p(X, y, θ) = ∏_{i=1}^n p(x_i, y_i, θ). The associated DGM can be drawn, including conditional dependencies and hyperparameters, for n (supervised) training data in Fig. 8.

Fig. 8. Directed graphical model for the GMM p(x_i, y_i, θ) over the labeled data D_l. As training data are supervised, both x_i and y_i are observed variables. Shaded and white nodes are the observed and latent variables, respectively, the arrows represent conditional dependencies, and the dots represent constants (i.e., hyperparameters). [Adapted from Bull (2019a).]

Having observed the labeled training data D_l = {X, y}, the posterior distributions can be defined by applying Bayes' theorem to each conjugate pair, where X_k denotes the observations x_i ∈ X with the labels y_i = k

$$p(\mu_k, \Sigma_k \mid X_k) = \frac{p(X_k \mid \mu_k, \Sigma_k)\,p(\mu_k, \Sigma_k)}{p(X_k)} \tag{17}$$

$$p(\lambda \mid y) = \frac{p(y \mid \lambda)\,p(\lambda)}{p(y)} \tag{18}$$

In general terms, while the prior p(θ) was the distribution over the parameters before any data were observed, the posterior distribution p(θ|D_l) describes the parameters given the training data (i.e., conditioned on the training data). Conveniently, each of these has analytical solutions (Barber 2012; Murphy 2012).
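For the label/mixing-proportion conjugate pair, the updates have a particularly simple closed form: the Dirichlet posterior adds the class counts to α, and the posterior predictive for a new label is the normalized (pseudo-)count. A minimal sketch of this standard conjugate result follows (illustrative code).

```python
import numpy as np

def dirichlet_posterior(alpha, y, K):
    """Posterior Dir(λ; α + counts) after observing labels y ∈ {0, ..., K-1}."""
    counts = np.bincount(y, minlength=K)
    return alpha + counts

def label_predictive(alpha_post):
    """Posterior predictive p(ỹ = k | D_l) = α_k / Σ_j α_j, cf. Eq. (20)."""
    return alpha_post / alpha_post.sum()

alpha = np.ones(3)                      # classes equally weighted a priori
y = np.array([0, 0, 1, 0, 2, 0])        # observed training labels
print(label_predictive(dirichlet_posterior(alpha, y, K=3)))
# -> [0.556 0.222 0.222]; classes seen more often are predicted more often
```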
Fig. 9 demonstrates how streaming SHM signals might be que-
With this information, the joint distribution of the model ried using these uncertainty measures. The (unlabeled) data arrive
online in batches of size B; the data that appear most uncertain
Qn i ; yi ; θÞ can be approximated, such that pðX; y; θÞ ¼
pðx
(given the current model) are investigated. The number of investi-
i¼1 pðxi ; yi ; θÞ. The associated DGM can be drawn, including
conditional dependencies and hyperparameters, for n (supervised) gations per batch qb is determined by the label budget, which, in
training data in Fig. 8. turn, is limited by cost implications. Once labeled by the engineer,
Having observed the labeled training data Dl ¼ fX; yg, the pos- these data can be added to Dl and used to update the classifica-
terior distributions can be defined by applying Bayes’ theorem to tion model.
each conjugate pair, where Xk denotes the observations xi ∈ X
with the labels yi ¼ k Z24 Bridge Dataset
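The batch-querying loop of Fig. 9 can be sketched as follows, under stated assumptions: `model.fit`, `model.predict_proba`, and `model.marginal_likelihood` are placeholder names for the posterior-predictive computations of Eqs. (19)–(21), and `inspect` stands in for the engineer returning ground-truth labels.

```python
import numpy as np

def active_monitoring(model, batches, inspect, q_b=2):
    """Online active learning over streaming batches (cf. Fig. 9).

    model:   classifier exposing fit(X, y), predict_proba(X), marginal_likelihood(X)
    batches: iterable of arrays, each of shape (B, d)
    inspect: callable returning ground-truth labels for queried samples
    q_b:     label budget per batch
    """
    X_l, y_l = [], []
    for X_b in batches:
        if len(X_l) > 0:
            model.fit(np.vstack(X_l), np.concatenate(y_l))
            post = model.predict_proba(X_b)                 # p(ỹ|x̃, D_l), Eq. (21)
            H = -(post * np.log(np.clip(post, 1e-12, 1))).sum(axis=1)
            lik = model.marginal_likelihood(X_b)            # p(x̃|D_l)
            # split the budget: high-entropy queries and low-likelihood queries
            queries = set(np.argsort(H)[::-1][: q_b // 2])
            queries |= set(np.argsort(lik)[: q_b - q_b // 2])
        else:
            queries = set(range(q_b))                       # cold start: first samples
        idx = sorted(queries)
        X_l.append(X_b[idx])
        y_l.append(inspect(X_b[idx]))                       # engineer labels queries
    return model
```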
Z24 Bridge Dataset

The Z24 bridge was a concrete highway bridge in Switzerland, connecting the villages of Koppigen and Utzenstorf. Before its demolition in 1998, the bridge was used for experimental SHM purposes (de Roeck 2003). Over a 12-month period, a series of sensors were used to capture dynamic response measurements to extract the first four natural frequencies of the structure. Air/deck temperature, humidity, and wind speed were also recorded (Peeters and de Roeck 2001). There are a total of 3,932 observations in the dataset.

Before demolition, different types of damage were artificially introduced, starting from observation 3,476 (Dervilis et al. 2014). The natural frequencies and deck temperature are shown in Fig. 10.

Fig. 9. Flow chart to illustrate the online active learning process. [Adapted from Bull et al. (2019b).]

Visible fluctuations in the natural frequencies can be observed in Fig. 10, for 1,200 ≤ n ≤ 1,500, while there is little variation following the introduction of damage at observation 3,476. It is believed that the asphalt layer in the deck experienced very low temperatures during this time, leading to increased structural stiffness.

Fig. 10. Z24 bridge data and time history of natural frequencies, representing three classes of data: normal data, outlying data due to environmental effects, and damage.

In the analysis, the four natural frequencies are the observation data, such that x_i ∈ R^4. The damage data are assumed to represent their own class from observation 3,476. Outlying observations within the remaining dataset are determined using the robust minimum covariance determinant (MCD) algorithm (Rousseeuw and Driessen 1999; Dervilis et al. 2014). In consequence, a three-class classification problem is defined, according to Fig. 10: normal data, outlying data due to environmental effects, and damage, corresponding to y_i ∈ {1, 2, 3}, respectively.

Clearly, it is undesirable for an engineer to investigate the bridge following each data acquisition. Therefore, if active learning can provide an improved classification performance, compared to passive learning (random sampling) with the same sample budget, this demonstrates the relevance of active methods to SHM.

Results: Active Learning

The model is applied online to the frequency data from the Z24 bridge. To provide an online performance metric, the dataset is divided into two equal subsets: one is used for training and querying by the active learner {D_l, D_u}; the other is used as a distinct/independent test set. The f_1 score is used as the performance metric (throughout this work). This is a weighted average of precision and recall (Murphy 2012), with values between 0 and 1; a perfect score corresponds to f_1 = 1. Precision (P) and recall (R) can be defined in terms of the numbers of true positives (TP), false positives (FP), and false negatives (FN) for each class, k ∈ 𝒴 (Murphy 2012)

$$P_k = \frac{TP_k}{TP_k + FP_k} \tag{23a}$$

$$R_k = \frac{TP_k}{TP_k + FN_k} \tag{23b}$$

The (macro) f_1 score is then defined by Murphy (2012)

$$f_{1,k} = \frac{2 P_k R_k}{P_k + R_k} \tag{24a}$$

$$f_1 = \frac{1}{K} \sum_{k \in \mathcal{Y}} f_{1,k} \tag{24b}$$
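Eqs. (23a)–(24b) translate directly into code; the sketch below is illustrative (a library routine, e.g., scikit-learn's `f1_score` with macro averaging, computes the same quantity).

```python
import numpy as np

def macro_f1(y_true, y_pred, K):
    """Macro f1 over classes 0..K-1, following Eqs. (23a)-(24b)."""
    scores = []
    for k in range(K):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        P = tp / (tp + fp) if (tp + fp) > 0 else 0.0        # precision, Eq. (23a)
        R = tp / (tp + fn) if (tp + fn) > 0 else 0.0        # recall, Eq. (23b)
        scores.append(2 * P * R / (P + R) if (P + R) > 0 else 0.0)  # Eq. (24a)
    return np.mean(scores)                                  # Eq. (24b)
```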

Fig. 11 illustrates improvements in classification performance when active learning is used to label 25% and 12.5% of the measured data. Active learning is compared to the passive learning benchmark, in which the same number of data are labeled according to a random sample, rather than uncertainty measures. Throughout the monitoring regime, if the GMM is used to select the training data, the predictive performance increases. Most notably, drops in the f_1 score (corresponding to new classes being discovered) are less significant when active learning is used to select data, particularly when class two (environmental effects) is introduced. This is because new classes are unlikely given the current model, i.e., uncertainty measure (1), the marginal likelihood. Intuitively, novel classes are discovered sooner via uncertainty sampling. For a range of query budgets

Fig. 11. Online classification performance (f_1 score) for the Z24 data, for query budgets of (a) 25%; and (b) 12.5% of the total dataset. [Adapted from Bull et al. (2019b).]
and additional SHM applications, refer to the study by Bull et al. (2019b). Code and animations of uncertainty sampling for the Z24 data are available (Bull 2019b).

Semisupervised Updates to Gaussian Mixture Models

While active learning considered the unlabeled data D_u for querying, the observations only contribute to the model once labeled, i.e., once included in the labeled set D_l. However, a semisupervised model can consider both the labeled and unlabeled data when approximating the parameters. Therefore, θ is estimated given both labeled and unlabeled observations, such that the posterior becomes p(θ|D_l, D_u). This is advantageous for SHM, as unlabeled observations can also contribute to the model estimate, reducing the dependence on costly supervised data. Continuing the probabilistic approach, the original DGM in Fig. 8 can be updated (relatively simply) to become semisupervised (Fig. 12). The inclusion of D_u introduces another latent variable ỹ_i, and, as a result, obtaining the posterior distribution over the parameters becomes less simple. One solution adopts an expectation-maximization approach (Dempster et al. 1977). The implementation in this study involves finding the maximum a posteriori (MAP) estimate of the parameters θ̂ (the mode of the full posterior distribution) while maximizing the likelihood of the model. Specifically, from the joint distribution and using Bayes' theorem, the MAP estimate of the parameters θ given the labeled and unlabeled subsets is

$$\hat{\theta} \mid D = \operatorname*{argmax}_{\theta} \left\{ \frac{p(D \mid \theta)\,p(\theta)}{p(D)} \right\} = \operatorname*{argmax}_{\theta} \left\{ \frac{p(D_u \mid \theta)\,p(D_l \mid \theta)\,p(\theta)}{p(D_u, D_l)} \right\}, \qquad D \triangleq D_u \cup D_l \tag{25}$$

Fig. 12. DGM of the semisupervised GMM, given the labeled D_l and unlabeled data D_u. For the unsupervised set, x̃_i is the only observed variable, while ỹ_i is a latent variable. [Adapted from Bull (2019a).]

Again, it is assumed that the data are i.i.d., so that D_l and D_u can be factorized. Thus, the marginal likelihood of the model [the denominator of Eq. (25)] considers both the labeled and unlabeled data. This is referred to as the joint likelihood, and it is the value that is maximized while inferring the parameters of the model through EM.

The EM algorithm iterates E and M steps until convergence in the joint (log) likelihood. During each E-step, the parameters are fixed, and the unlabeled observations are classified using the current model estimate p(ỹ|X̃, D). The M-step corresponds to finding θ̂, given the predicted labels from the E-step and the absolute labels for the supervised data. This involves some minor modifications to the conventional MAP estimates, such that the contribution of the unlabeled data is shared between classes, weighted according to the posterior distribution p(ỹ|X̃, D) (Barber 2012; Bull et al. 2020b). Pseudocode is provided in Algorithm 1; a MATLAB (version 2019b) code for the semisupervised GMM is also available (Bull 2019c).

Algorithm 1. Semisupervised EM for a Gaussian mixture model
Input: Labeled data D_l, unlabeled data D_u
Output: Semisupervised MAP estimates of θ̂ = {μ̂, Σ̂}
1 Initialize θ̂ using the labeled data, θ̂ = argmax_θ {p(θ|D_l)}
2 while the joint log-likelihood log{p(D_l, D_u)} improves do
3   E-step: use the current model θ̂|D to estimate class-membership for the unlabeled data D_u, i.e., p(ỹ|X̃, D)
4   M-step: update the MAP estimate of θ̂ given the component membership for all observations, θ̂ := argmax_θ {p(θ|D_l, D_u)}
5 end
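A compact numerical sketch of Algorithm 1 is given below, in its maximum-likelihood form for brevity (the MAP variant adds the conjugate prior terms of Eqs. (14) and (15) to the M-step); the names are illustrative, not the released MATLAB implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def semisupervised_em(X_l, y_l, X_u, K, n_iter=100, tol=1e-6):
    """EM for a GMM over labeled (X_l, y_l) and unlabeled X_u data (cf. Algorithm 1).

    Labels y_l fix the responsibilities of D_l; D_u contributes soft
    responsibilities re-estimated at each E-step. Assumes every class
    k in {0, ..., K-1} appears at least once in y_l.
    """
    d = X_l.shape[1]
    X = np.vstack([X_l, X_u])
    R_l = np.eye(K)[y_l]                          # hard responsibilities for D_l
    R_u = np.zeros((len(X_u), K))                 # first M-step uses D_l only
    prev_ll = -np.inf
    for _ in range(n_iter):
        # M-step: weighted updates of mixing proportions, means, covariances
        R = np.vstack([R_l, R_u])
        Nk = R.sum(axis=0)
        lam = Nk / Nk.sum()
        mu = (R.T @ X) / Nk[:, None]
        Sig = []
        for k in range(K):
            Xc = X - mu[k]
            Sig.append((R[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d))
        # E-step: classify the unlabeled data under the current model
        log_pdf = np.stack([mvn.logpdf(X_u, mu[k], Sig[k]) for k in range(K)], axis=1)
        log_post = np.log(lam) + log_pdf
        log_norm = np.logaddexp.reduce(log_post, axis=1)
        R_u = np.exp(log_post - log_norm[:, None])
        # joint log-likelihood: complete-data (labeled) + marginal (unlabeled)
        ll = log_norm.sum()
        for k in range(K):
            Xk = X_l[y_l == k]
            if len(Xk):
                ll += len(Xk) * np.log(lam[k]) + mvn.logpdf(Xk, mu[k], Sig[k]).sum()
        if ll - prev_ll < tol:                    # convergence in joint likelihood
            break
        prev_ll = ll
    return mu, Sig, lam
```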

Semisupervised Learning with the Gnat Aircraft Data

A visual example of improvements to a GMM via semisupervision was shown in Fig. 2. To quantify potential advantages for SHM, the method is also applied to experimental data from aircraft experiments, originally presented by Bull et al. (2020b). For details behind the Gnat aircraft data, refer to the study by Manson et al. (2003). Briefly, during the tests, the aircraft was excited with an electrodynamic shaker and band-limited white noise. Transmissibility data were recorded using a network of sensors distributed over the wing. Artificial damage was introduced by sequentially removing one of nine inspection panels in the wing. A total of 198 measurements were recorded for the removal of each panel, such that the total number of (frequency domain) observations was 1,782. Over the network of sensors, nine transmissibilities were recorded (Manson et al. 2003). Each transmissibility was converted to a one-dimensional novelty detector, with reference to a distinct set of normal data, where all the panels were intact (Worden et al. 2008). Therefore, the data represent a nine-class classification problem, one class for the removal of each panel, such that y_i ∈ {1, ..., 9}. The measurements are nine-dimensional, x_i ∈ R^9; each feature is a novelty index, representing one of nine transmissibilities.

When applying semisupervised learning, 1/3 of the total data were set aside as an independent test-set. The remaining 2/3 were used for training, i.e., D = D_l ∪ D_u. Of the training data D, the number of labeled observations n was increased (in 5% increments) until all the observations were labeled. The results are compared to standard supervised learning for the same budget n. The changes in the classification performance through semisupervised updates are shown in Fig. 13; the inclusion of the unlabeled data consistently improves the f_1 score. For very low proportions of labeled data, <1.26% (m > n), semisupervised updates can decrease the predictive performance; this is likely due to the unlabeled data outweighing the labeled instances in the likelihood cost function. Notably, the maximum increase in the f_1 score is 0.0405, corresponding to a 3.83% reduction in the classification error for the 2.94% labeled data. Such improvements to the classification performance for low proportions of labeled data should highlight significant advantages for SHM, reducing the dependence on large sets of costly supervised data.

Fig. 13. Classification performance (f_1 score) for the supervised GMM versus the semisupervised GMM: (a) f_1 for an increasing proportion of labeled data; and (b) gain in f_1 score through semisupervised updates; the horizontal line highlights zero-gain. [Adapted from Bull et al. (2020b).]

Dirichlet Process Clustering of Streaming Data

Returning to the streaming data recorded from the Z24 bridge, an alternative perspective considers that labels are not needed to infer the model. In this case, an unsupervised algorithm could be used to cluster data online, and labels could be assigned to the resulting clusters outside of the inference, within the wider SHM scheme—as suggested by Rogers et al. (2019). However, if y_i is unobserved for the purposes of inference, the number of class components K becomes an additional latent variable, unlike the GMM from previous case studies.

As aforementioned, the Dirichlet process Gaussian mixture model (DPGMM) is one solution to this problem. The DPGMM allows for the probabilistic selection of K through a Dirichlet process prior. Initially, this involves defining a GMM in a Bayesian manner, using the same priors as before; however, by following Rasmussen (2000), it is possible to take the limit K → ∞ to form an infinite Gaussian mixture model. Surprisingly, this concept can be shown through another simple modification to the first DGM in Fig. 8, leading to Fig. 14. The generative equations remain the same as Eqs. (9), (10), (14), and (15).

A collapsed Gibbs sampler can be used to perform efficient online inference over this model (Neal 2000). Although potentially faster algorithms for variational inference exist (Blei and Jordan 2006), it can be more practical to implement the Gibbs sampler when performing inference online. The nature of the Gibbs sampling solution is that each data point is assessed conditionally in the sampler, which allows the addition of new points online, rather than batch updates (Rogers et al. 2019).

Within the Gibbs sampler, only components k = {1, ..., K + 1} need to be considered to cover the full set of possible clusters


(Rasmussen 2000). As with the GMM, there are two conjugate pairs in the model; therefore, the predictive equations remain analytical (leading to a collapsed Gibbs sampler). In brief/general terms, while fixing the parameters, the Gibbs scheme determines the likelihood of an observation x̃_i being sampled from an existing cluster k = {1, ..., K} or an (as of yet) unobserved cluster k = K + 1 (i.e., the prior). Given the posterior over the K + 1 classes, the cluster assignment ỹ_i is sampled, and the model parameters are updated accordingly. This process is iterated until convergence.
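The per-observation assignment step of the sampler can be sketched as follows; for clarity, fixed Gaussian parameters are used for each cluster rather than the fully collapsed (marginalized) predictive densities of the cited implementation, so the conjugate bookkeeping is omitted.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(1)

def gibbs_assign(x, clusters, counts, alpha, prior):
    """Sample a cluster assignment for observation x under a DP mixture.

    clusters: list of (mu, Sigma) for the K existing components
    counts:   occupancy of each existing component
    alpha:    DP dispersion; prior = (mu0, Sigma0), the base distribution
    Returns an index in {0, ..., K}; K means 'open a new cluster'.
    """
    K = len(clusters)
    w = np.empty(K + 1)
    for k, (mu, Sig) in enumerate(clusters):
        w[k] = counts[k] * mvn.pdf(x, mu, Sig)   # existing cluster k
    mu0, Sig0 = prior
    w[K] = alpha * mvn.pdf(x, mu0, Sig0)         # unobserved cluster K + 1
    return rng.choice(K + 1, p=w / w.sum())
```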
Fig. 14. DGM for the infinite Gaussian mixture model.

Applications to the Z24 Bridge Data

In terms of monitoring the streaming Z24 data, any new observations that relate to existing clusters will update the associated parameters. If a new cluster is formed, indicating novelty, this triggers an alarm. In this case, the cluster must contain at least 50 observations to indicate novelty; for details, refer to the study by Rogers et al. (2019). Upon investigating the structure, an appropriate description can be assigned to the unsupervised cluster index (outside of the inference). As before, the Z24 data are normalized in an online manner, and thus, the hyperparameters of the prior p(μ, Σ) encode this knowledge. The choice of the dispersion value α, defining p(λ), is more application dependent, as discussed in the restaurant analogy; this determines the likelihood that new clusters will be generated. In the study by Rogers et al. (2019), sensible values for online SHM applications were found to be between 0 < α < 20; for the Z24 data, this is set to α = 10. As with the active GMM, a small set of data from the start of the monitoring regime make up an initial training set.

Fig. 15 shows the algorithm's progress for the streaming data. A normal condition cluster is quickly established. As the temperature cools, three more clusters are created, corresponding to the progression of freezing of the deck. Two additional clusters are also created: one around point 800 and one close to point 1,700. From an inspection of the feature space (Rogers et al. 2019), it is hypothesized that the close-to-point-1,700 cluster corresponds to a shift and rotation in the normal condition; therefore, this leads to another normal cluster. As the corresponding normal data are now non-Gaussian, they are better approximated by two mixture components. Finally, the last cluster is created following two observations of damage, showing the ability of the DPGMM implementation to detect a change in behavior corresponding to damage, as well as environmental effects.

The DPGMM has automatically inferred seven clusters, given the data and the model. While three classes were originally defined (as in the active and semisupervised case), this representation is equally interpretable following system inspections to describe each component. Additionally, the DPGMM is likely to better approximate the underlying density, as each class of data can be described by a number of Gaussian components rather than one. That is, in this case, three clusters describe the normal condition, three clusters cover various environmental effects, and one represents the damage condition.

The results shown on the Z24 data demonstrate the ability of the online DP algorithm to deal with recurring environmental conditions while remaining sensitive to damage.
normalised freq. normalised freq. normalised freq. normalised freq.

0 500 1000 1500 2000 2500 3000 3500 4000


5

-5
0 500 1000 1500 2000 2500 3000 3500 4000

4
2
0
-2
0 500 1000 1500 2000 2500 3000 3500 4000

4
2
0
-2
0 500 1000 1500 2000 2500 3000 3500 4000
observations

Fig. 15. Figure showing online DP clustering applied to the Z24 bridge data using the first four natural frequencies as the features. Vertical lines
indicate that a new cluster has been formed. [Adapted from Rogers et al. (2019).]
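For readers who wish to experiment with this mode of inference, a minimal sketch is given below. It approximates the workflow offline using the truncated variational treatment of the DP mixture available in scikit-learn (in the spirit of Blei and Jordan 2006), rather than the collapsed Gibbs sampler used by Rogers et al. (2019); the two-regime synthetic data are purely illustrative, while the dispersion value $\alpha = 10$ and the 50-observation novelty threshold follow the settings quoted above.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Hypothetical features: four natural frequencies per observation,
# with a second regime (e.g., a frozen deck) shifting the means.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([4.0, 5.2, 9.8, 10.3], 0.05, size=(800, 4)),   # normal condition
    rng.normal([4.3, 5.6, 10.4, 11.0], 0.05, size=(200, 4)),  # second regime
])

# Truncated DP mixture: n_components is an upper bound on K;
# the concentration (dispersion) parameter plays the role of alpha = 10.
dpgmm = BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=10.0,
    covariance_type="full",
    random_state=0,
).fit(X)

# A cluster counts as 'established' (and a new one would raise an alarm)
# only once it claims at least 50 observations.
labels = dpgmm.predict(X)
counts = np.bincount(labels, minlength=dpgmm.n_components)
established = np.flatnonzero(counts >= 50)
print("established clusters:",
      dict(zip(established.tolist(), counts[established].tolist())))
```

In a streaming setting, the same check would be applied each time the model is updated with a new observation, raising an alarm only when a newly created cluster passes the threshold.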

Multitask Learning

In the final case study, supervised data from different structures (each represented by their own domain) are considered simultaneously to improve the performance of an SHM task. In the following example, each domain $\mathcal{D}_t$ corresponds to supervised training data recorded from a different system; the task $\mathcal{T}$ corresponds to a predictive SHM model. By considering the data from a group (or population) of similar structures in a latent space, the amount of training data can (in effect) be increased. Multitask learning should be particularly useful in SHM, where training data are often incomplete for individual systems. If a predictive model can be improved by considering the data collected from various similar structures, this should highlight the potential benefit of multitask learning.

Kernelized Bayesian Transfer Learning

Referring back to the task $\mathcal{T}$ and domain $\mathcal{D}$ objects, it is assumed that there are $T$ (binary) classification tasks over the heterogeneous domains $\{\mathcal{D}_t\}_{t=1}^T$. In other words, the label space $\mathcal{Y}$ is consistent across all tasks (in this case, normal or damaged), while the feature space $\mathcal{X}_t$ can change dimensionality, potentially leading to $d_t \neq d_{t'}$. For each task, there is an i.i.d. training set of observations $\mathbf{X}_t$ and labels $\mathbf{y}_t$, where $\mathbf{X}_t = \{\mathbf{x}_i^{(t)} \in \mathbb{R}^{d_t}\}_{i=1}^{n_t}$ and $\mathbf{y}_t = \{y_i^{(t)} \in \{-1, +1\}\}_{i=1}^{n_t}$. Each domain has a task-specific kernel function $k_t$ to determine the similarities between observations and the associated kernel matrix $\mathbf{K}_t[i, j] = k_t(\mathbf{x}_i^{(t)}, \mathbf{x}_j^{(t)})$, such that $\mathbf{K}_t \in \mathbb{R}^{n_t \times n_t}$. Note that when subscripts/superscripts are cluttered, the square-bracket notation is used to index matrices and vectors.

Fig. 16 is useful to visualize KBTL. The model can be split into two main parts: (1) the first projects data from different tasks into a shared subspace using kernel-based dimensionality reduction; and (2) the second performs coupled binary classification in the shared subspace, using common classification parameters. In terms of notation, the kernel embedding for each domain $\mathbf{K}_t$ is projected into a shared latent subspace by an optimal projection matrix $\mathbf{A}_t \in \mathbb{R}^{n_t \times R}$, where $R$ is the dimensionality of the subspace. Following projection, there is a representation of each domain in the shared latent subspace, $\{\mathbf{H}_t = \mathbf{A}_t^\top \mathbf{K}_t\}_{t=1}^T$. In this shared space, a coupled discriminative classifier is inferred for the projected data from each domain, $\{\mathbf{f}_t = \mathbf{H}_t^\top \mathbf{w} + \mathbf{1}b\}_{t=1}^T$. This implies that the same set of parameters $\{\mathbf{w}, b\}$ is used across all tasks.

Fig. 16. Visualization of KBTL. [Adapted from Gönen and Margolin (2014).]

In a Bayesian manner, prior distributions are associated with the parameters of the model. For the $n_t \times R$ task-specific projection matrices, $\mathbf{A}_t$, there is an $n_t \times R$ matrix of priors denoted $\Lambda_t$. For the weights of the coupled classifier, the prior is $\boldsymbol{\eta}$, and for the bias $b$, the prior is $\gamma$. These are standard priors given the parameter types in the model—for details, refer to the study by Gönen and Margolin (2014). Collectively, the priors are $\Xi = \{\{\Lambda_t\}_{t=1}^T, \boldsymbol{\eta}, \gamma\}$ and the latent variables are $\Theta = \{\{\mathbf{H}_t, \mathbf{A}_t, \mathbf{f}_t\}_{t=1}^T, \mathbf{w}, b\}$; the observed variables (training data) are given by $\{\mathbf{K}_t, \mathbf{y}_t\}_{t=1}^T$. The DGM associated with the model is shown in Fig. 17; this highlights the variable dependencies and the associated prior distributions.

Fig. 17. Directed graphical model for binary classification KBTL.

The distributional assumptions are briefly summarized; for details, refer to the study by Gönen and Margolin (2014). The priors for the elements $\mathbf{A}_t[i, s]$ of the projection matrix are (zero-mean) normally distributed, with variance $\Lambda_t[i, s]^{-1}$; in turn, the prior over $\Lambda_t[i, s]$ is Gamma distributed. As a result, the observations are normally distributed in the latent space, i.e., $\mathbf{H}_t[s, i]$. For the coupled classifier, the prior for the bias $b$ is assumed to be (zero-mean) normally distributed, with variance $\gamma^{-1}$, such that $\gamma$ is Gamma distributed. Similarly, the weights $\mathbf{w}[s]$ are (zero-mean) normally distributed, with variance $\boldsymbol{\eta}[s]^{-1}$, such that $\boldsymbol{\eta}[s]$ is Gamma distributed. This leads to normal distributions over the functional classifier $\mathbf{f}_t[i]$. The label predictive equations are given by $p(y^{(t)} \mid f^{(t)})$, passing $f^{(t)}$ through a truncated Gaussian, parameterized by $\nu$ (Gardner et al., forthcoming).

The hyperparameters associated with these assumptions are shown in the DGM (Fig. 17). To infer the parameters of the model, approximate inference is required. Following Gönen and Margolin (2014), a variational inference scheme is used; this utilizes a lower bound on the marginal likelihood to infer an approximation, denoted $q$, of the full joint distribution of the parameters $p(\Theta, \Xi \mid \{\mathbf{K}_t, \mathbf{y}_t\}_{t=1}^T)$ of the model. To achieve this, the posterior distribution is factorized as follows:

$$p(\Theta, \Xi \mid \{\mathbf{K}_t, \mathbf{y}_t\}_{t=1}^T) \approx q(\Theta, \Xi) = \prod_{t=1}^{T} \big( q(\Lambda_t)\, q(\mathbf{A}_t)\, q(\mathbf{H}_t) \big)\, q(\gamma)\, q(\boldsymbol{\eta})\, q(b, \mathbf{w}) \prod_{t=1}^{T} q(\mathbf{f}_t) \qquad (26)$$

Each approximated factor is defined as in the full conditional distribution (Gönen and Margolin 2014). The lower bound can be optimized with respect to each factor separately while fixing the remaining factors (iterating until convergence).
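To make the two-stage structure explicit, the NumPy sketch below implements only the deterministic forward mapping of KBTL: projection of each task's kernel embedding into the shared subspace, followed by the coupled linear discriminant. The data, dimensions, and common RBF kernel are hypothetical, and the projection matrices and classifier parameters are random placeholders standing in for the variational posteriors described above.

```python
import numpy as np

rng = np.random.default_rng(1)
R = 2                                    # dimensionality of the shared subspace
n_obs = {1: 30, 2: 45}                   # observations per task (illustrative)
d_t = {1: 3, 2: 4}                       # differing feature dimensions (heterogeneous domains)

def rbf_kernel(X, Y, lengthscale=1.0):
    """A stand-in task-specific kernel k_t (here, a common RBF)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

X = {t: rng.normal(size=(n, d_t[t])) for t, n in n_obs.items()}
K = {t: rbf_kernel(X[t], X[t]) for t in X}           # K_t, shape (n_t, n_t)
A = {t: rng.normal(size=(n_obs[t], R)) for t in X}   # placeholder projections A_t
w, b = rng.normal(size=R), 0.0                       # coupled classifier {w, b}

for t in X:
    H_t = A[t].T @ K[t]                  # shared-subspace representation, (R, n_t)
    f_t = H_t.T @ w + b                  # discriminant scores f_t, length n_t
    y_hat = np.where(f_t >= 0, 1, -1)    # hard labels; KBTL instead passes f_t
                                         # through a truncated Gaussian for p(y|f)
    print(f"task {t}: {np.mean(y_hat == 1):.2f} predicted as class +1")
```

Because $\{\mathbf{w}, b\}$ is shared, every domain's decision boundary lives in the same $R$-dimensional space; this is the mechanism by which scarce damage labels in one domain can borrow strength from the others.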

Numerical + Experimental Example: Shear-Building Structures

A numerical case study, supplemented with experimental data, is used for demonstration—an extension of the work by Gardner et al. (2020a). A population of six different shear-building structures is considered: five are simulated, and one is experimental. A domain and task are associated with each structure (such that $T = 6$), and the experimental rig and (simulated) lumped-mass models are shown in Fig. 18. For each structure (domain), there is a two-class classification problem (task), which is viewed as binary damage detection (normal or damaged).

Each simulated structure is represented by $d$ mass, stiffness, and damping coefficients, i.e., $\{m_i, k_i, c_i\}_{i=1}^{d}$. The masses have length $l_m$, width $w_m$, thickness $t_m$, and density $\rho$. The stiffness elements are calculated from four cantilever beams in bending, $4k_b = 4(3EI/l_b^3)$, where $E$ is the elastic modulus, $I$ the second moment of area, and $l_b$ the length of the beam. The damping coefficients are specified rather than derived from a physical model. Damage is simulated via an open crack, using a reduction in $EI$ (Christides and Barr 1984). For each structure, each observation is a random draw from a base distribution for $E$, $\rho$, and $c$. The properties of the five simulated structures are shown in Table 1.

The experimental structure is constructed from aluminum 6082, with dimensions nominally similar to those in Table 1. Observational data (the first three natural frequencies) were collected via modal testing, in which an electrodynamic shaker applied broadband white-noise excitation up to 6,553.6 Hz, containing 16,384 spectral lines (0.2 Hz resolution). Forcing was applied to the first story, and three uniaxial accelerometers measured the response at all stories. The damage was artificially introduced as a 50% saw-cut to the midpoint of the front-right beam in Fig. 18(a).

In each domain, the damped natural frequencies act as features, such that $\mathbf{X}_t[i, :] = \{\omega_i\}_{i=1}^{d}$. Therefore, as each domain has different DOFs/dimensions, a heterogeneous transfer is required. The label set is consistent across all domains, corresponding to normal or damaged, i.e., $y_i \in \{-1, 1\}$, respectively. The training and test data for each domain are summarized in Table 2. The training data have various degrees of class imbalance to reflect scenarios in which certain structures in SHM provide more information about a particular state.

Fig. 19 shows the coupled binary classifier in the (expected) shared latent subspace for all the data $\{\mathbf{H}_t\}_{t=1}^T$. The observations associated with each of the six domains are distinguished via different markers. The left plot shows the test data and their predicted labels given $\mathbf{f}_t$, while the right plot shows the ground-truth labels. KBTL has successfully embedded and projected data from different domains into a shared latent space ($R = 2$), where the data can be categorized by a coupled discriminative classifier. It can also be seen that, due to class imbalance (weighted toward the undamaged class $-1$ for each structure), there is greater uncertainty in the damaged class ($+1$), leading to more significant scatter in the latent space.

The classification results for each domain are presented in Fig. 20. An observation is considered to belong to class $+1$ if $p(\mathbf{y}_t[i] = +1 \mid \mathbf{f}_t[i]) \geq 0.5$. KBTL is compared to a relevance vector machine (RVM) (Tipping 2000) as a benchmark, learned for each domain independently. It is acknowledged that the RVM differs in implementation; however, similarities make it useful for comparison as a standard (nonmultitask) alternative to KBTL.

Multitask learning has accurately inferred a general model. For domains $\{1, 2, 3, 5, 6\}$, the SHM task is improved by considering the data from all structures in a shared latent space. In particular, extending the (effective) training data has improved the classification for Domain 5. This is because there are few training data associated with the damage class for Domain 5 (Table 2); therefore, considering damage data from similar structures (in the latent space) has proved beneficial. Interestingly, for Domain 4 ($t = 4$), there is a marginal decrease in the classification performance. Like Domain 1, Domain 4 has a less severe class imbalance, and thus, it appears that the remaining domains (with severe class imbalance) have negatively impacted the score for this specific domain/task. These results highlight that the data from a group (or population) of similar structures can be considered together to increase the (effective) amount of training data (Bull et al. 2020a; Gosliga et al. 2020; Gardner et al. 2020b). This can lead to significant improvements in the predictive performance of SHM tools—particularly those learned from small sets of supervised data.
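To indicate how such training data might be generated in practice, the sketch below assembles a $d$-DOF shear-building model with storey stiffness $4k_b = 4(3EI/l_b^3)$, emulates an open crack as a reduction in $EI$ at one storey, and extracts damped natural frequencies from the state-space eigenvalues. All numerical values are nominal stand-ins (loosely in the spirit of Table 1), not the values used in the study.

```python
import numpy as np

def damped_natural_frequencies(m, k, c):
    """Damped natural frequencies (Hz) of a chain-like (shear-building) lumped-mass model."""
    d = len(k)
    M = np.eye(d) * m
    K = np.zeros((d, d))
    C = np.zeros((d, d))
    for i in range(d):
        # k[i] (c[i]) acts between storey i and the storey (or ground) below it
        K[i, i] = k[i] + (k[i + 1] if i + 1 < d else 0.0)
        C[i, i] = c[i] + (c[i + 1] if i + 1 < d else 0.0)
        if i + 1 < d:
            K[i, i + 1] = K[i + 1, i] = -k[i + 1]
            C[i, i + 1] = C[i + 1, i] = -c[i + 1]
    # First-order (state-space) form: eigenvalues s = -zeta*omega_n +/- i*omega_d
    A = np.block([[np.zeros((d, d)), np.eye(d)],
                  [-np.linalg.solve(M, K), -np.linalg.solve(M, C)]])
    omega_d = np.abs(np.linalg.eigvals(A).imag)
    # Conjugate pairs duplicate each omega_d; deduplicate and convert to Hz
    return np.sort(np.unique(np.round(omega_d[omega_d > 1e-9], 6))) / (2 * np.pi)

# Nominal (illustrative) properties
E, rho = 71e9, 2700.0                 # elastic modulus (Pa), density (kg/m^3)
lb, wb, tb = 185e-3, 25e-3, 6.35e-3   # cantilever-beam dimensions (m)
lm, wm, tm = 350e-3, 254e-3, 25e-3    # storey-mass dimensions (m)
I = wb * tb**3 / 12                   # second moment of area
k_storey = 4 * (3 * E * I / lb**3)    # four cantilever beams per storey
m = rho * lm * wm * tm                # storey mass

d = 4
k = np.full(d, k_storey)
c = np.full(d, 50.0)                  # specified damping coefficients (Ns/m)

k_damaged = k.copy()
k_damaged[0] *= 0.8                   # e.g., 20% EI reduction: open crack at storey 1

print("normal :", damped_natural_frequencies(m, k, c))
print("damaged:", damped_natural_frequencies(m, k_damaged, c))
```

Repeating such evaluations with $E$, $\rho$, and $c$ drawn from base distributions (as in Table 1) yields the per-domain feature sets $\mathbf{X}_t$ used for training and testing.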
Fig. 18. Shear structures: (a) test rig; (b) a nominal representation of the five simulated systems; and (c) depiction of the cantilever beam component, where $\{k_i\}_{i=1}^{d} = 4k_b$.

Table 1. Properties of the five simulated structures

Domain (t) | DOF (d_t) | Beam dim. {l_b, w_b, t_b} (mm) | Mass dim. {l_m, w_m, t_m} (mm) | Elastic mod. E (GPa) | Density ρ (kg/m³) | Damping coeff. c (Ns/m)
1 | 4  | {185, 25, 6.35} | {350, 254, 25} | N(71, 1.0 × 10⁻⁹) | N(2,700, 10) | G(50, 0.1)
2 | 8  | {200, 35, 6.25} | {450, 322, 35} | N(70, 1.2 × 10⁻⁹) | N(2,800, 22) | G(8, 0.8)
3 | 10 | {177, 45, 6.15} | {340, 274, 45} | N(72, 1.3 × 10⁻⁹) | N(2,550, 25) | G(25, 0.2)
4 | 3  | {193, 32, 5.55} | {260, 265, 32} | N(75, 1.5 × 10⁻⁹) | N(2,600, 15) | G(20, 0.1)
5 | 5  | {165, 46, 7.45} | {420, 333, 46} | N(73, 1.4 × 10⁻⁹) | N(2,650, 20) | G(50, 0.1)

Note: Degrees of freedom (DOF) are denoted d; dim. = dimension; mod. = modulus; and coeff. = coefficient. N(·, ·) and G(·, ·) denote normal and Gamma base distributions, respectively.

Table 2. Number of data for all domains

Domain (t) | Training y = −1 | Training y = +1 | Testing y = −1 | Testing y = +1
1  | 250 | 100 | 500 | 500
2  | 100 | 25  | 500 | 500
3  | 120 | 20  | 500 | 500
4  | 200 | 150 | 500 | 500
5  | 500 | 10  | 500 | 500
6* | 3   | 3   | 2   | 2

Note: Domain 6 (asterisk) is the experimental structure.

Fig. 19. KBTL probabilistic decision boundary for the coupled classification model in the shared subspace. Markers {×, □, ⋆, , ⋄, Δ, •} correspond to tasks and domains {1, 2, 3, 4, 5, 6}, respectively.

Fig. 20. KBTL classification performance, given an independent test set: $f_1$-scores across each domain compared to an RVM benchmark.

Conclusions

Three new techniques for statistical inference with SHM signals have been collected and summarized (originally introduced in previous work), including partially-supervised learning (semisupervised/active learning), Dirichlet process clustering, and multitask learning. Primarily, each approach looks to address, from a different perspective, the issues of incomplete datasets and missing information, which lead to incomplete training data. The algorithms consider that: (1) label information (to describe what measurements represent) is likely to be incomplete; and (2) the available data a priori will usually correspond to a subset of the expected in situ conditions only. Considering the importance of uncertainty quantification in SHM, probabilistic methods are suggested, which can be (intuitively) updated to account for missing information.

The case study applications for each mode of inference highlight the potential advantages for SHM. Partially-supervised methods for active and semisupervised learning were utilized to manage the cost of system inspections (to label data) while considering the unlabeled instances, both offline and online. Dirichlet process clustering has been applied to streaming data as an unsupervised method for automatic damage detection and classification. Finally, multitask learning was applied to model shared information between systems—to extend the data available for training, this approach considers multiple (potentially incomplete) datasets associated with different tasks (structures).

Data Availability Statement

Some or all data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors gratefully acknowledge the support of the UK Engineering and Physical Sciences Research Council (EPSRC) through Grant references EP/R003645/1, EP/R004900/1, EP/S001565/1, and EP/R006768/1.

References
Aldous, D. J. 1985. “Exchangeability and related topics.” In École d’Été de
Probabilités de Saint-Flour XIII—1983, 1–198. Berlin: Springer.
Barber, D. 2012. Bayesian reasoning and machine learning. Cambridge,
UK: Cambridge University Press.
Blei, D. M., and M. I. Jordan. 2006. “Variational inference for Dirichlet
process mixtures.” Bayesian Anal. 1 (1): 121–143. https://doi.org/10
.1214/06-BA104.
Bornn, L., C. R. Farrar, G. Park, and K. Farinholt. 2009. “Structural health
monitoring with autoregressive support vector machines.” J. Vib.
Acoust. 131 (2). https://doi.org/10.1115/1.3025827.
Bull, L. A. 2019a. “Towards probabilistic and partially-supervised struc-
tural health monitoring.” Ph.D. thesis, Univ. of Sheffield.
Bull, L. A. 2019b. “labull/probabilistic_active_learning_GMM.” Ac-
cessed October 1, 2019. https://github.com/labull/probabilistic_active
_learning_GMM.
Bull, L. A. 2019c. “labull/semi_supervised_GMM.” Accessed January 1,
2019. https://github.com/labull/semi_supervised_GMM.
Bull, L. A., P. A. Gardner, J. Gosliga, N. Dervilis, E. Papatheou, A. E. Maguire, C. Campos, T. J. Rogers, E. J. Cross, and K. Worden. 2020a. "Foundations of population-based structural health monitoring. Part I: Homogeneous populations and forms." Mech. Syst. Sig. Process. 148 (Feb): 107141. https://doi.org/10.1016/j.ymssp.2020.107141.
Bull, L. A., G. Manson, K. Worden, and N. Dervilis. 2019a. "Active learning approaches to structural health monitoring." In Vol. 5 of Special topics in structural dynamics, edited by N. Dervilis, 157–159. Cham, Switzerland: Springer.
Bull, L. A., T. J. Rogers, C. Wickramarachchi, E. J. Cross, K. Worden, and N. Dervilis. 2019b. "Probabilistic active learning: An online framework for structural health monitoring." Mech. Syst. Sig. Process. 134 (Dec): 106294. https://doi.org/10.1016/j.ymssp.2019.106294.

Bull, L. A., K. Worden, and N. Dervilis. 2019c. "Damage classification using labelled and unlabelled measurements." In Structural health monitoring 2019. Lancaster: Destech Publications.
Bull, L. A., K. Worden, and N. Dervilis. 2020b. "Towards semi-supervised and probabilistic classification in structural health monitoring." Mech. Syst. Sig. Process. 140 (Jun): 106653. https://doi.org/10.1016/j.ymssp.2020.106653.
Bull, L. A., K. Worden, G. Manson, and N. Dervilis. 2018. "Active learning for semi-supervised structural health monitoring." J. Sound Vib. 437 (Dec): 373–388. https://doi.org/10.1016/j.jsv.2018.08.040.
Bull, L. A., K. Worden, T. J. Rogers, E. J. Cross, and N. Dervilis. 2020c. "Investigating engineering data by probabilistic measures." In Vol. 5 of Special topics in structural dynamics and experimental techniques, 77–81. Cham, Switzerland: Springer.
Bull, L. A., K. Worden, T. J. Rogers, C. Wickramarachchi, E. J. Cross, T. McLeay, W. Leahy, and N. Dervilis. 2019d. "A probabilistic framework for online structural health monitoring: Active learning from machining data streams." In Vol. 1264 of Proc., Journal of Physics: Conf. Series, 012028. Bristol, UK: Institute of Physics Publishing.
Cappello, C., D. Bolognani, and D. Zonta. 2015. "Mechanical equivalent of logical inference from correlated uncertain information." In Proc., 7th Int. Conf. on Structural Health Monitoring of Intelligent Infrastructure. New York: Curran Associates.
Chakraborty, D., N. Kovvali, B. Chakraborty, A. Papandreou-Suppappola, and A. Chattopadhyay. 2011. "Structural damage detection with insufficient data using transfer learning techniques." In Sensors and smart structures technologies for civil, mechanical, and aerospace systems, 798147. New York: Curran Associates.
Chapelle, O., B. Scholkopf, and A. Zien. 2006. Semi-supervised learning. Boca Raton, FL: MIT Press.
Chatzi, E. N., and A. W. Smyth. 2009. "The unscented Kalman filter and particle filter methods for nonlinear structural system identification with non-collocated heterogeneous sensing." Struct. Control Health Monit. 16 (1): 99–123.
Chen, S., F. Cerda, J. Guo, J. B. Harley, Q. Shi, P. Rizzo, J. Bielak, J. H. Garrett, and J. Kovacevic. 2013. "Multiresolution classification with semi-supervised learning for indirect bridge structural health monitoring." In Proc., 2013 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 3412–3416. New York: IEEE.
Chen, S., F. Cerda, P. Rizzo, J. Bielak, J. H. Garrett, and J. Kovacevic. 2014. "Semi-supervised multiresolution classification using adaptive graph filtering with application to indirect bridge structural health monitoring." IEEE Trans. Signal Process. 62 (11): 2879–2893. https://doi.org/10.1109/TSP.2014.2313528.
Christides, S., and A. Barr. 1984. "One-dimensional theory of cracked Bernoulli-Euler beams." Int. J. Mech. Sci. 26 (11–12): 639–648. https://doi.org/10.1016/0020-7403(84)90017-1.
Cozman, F. G., I. Cohen, and M. C. Cirelo. 2003. "Semi-supervised learning of mixture models." In Proc., 20th Int. Conf. on Machine Learning (ICML-03), 99–106. Washington, DC: Association for the Advancement of Artificial Intelligence Press.
Dasgupta, S. 2011. "Two faces of active learning." Theor. Comput. Sci. 412 (19): 1767–1781. https://doi.org/10.1016/j.tcs.2010.12.054.
de Roeck, G. 2003. "The state-of-the-art of damage detection by vibration monitoring: The SIMCES experience." Struct. Control Health Monit. 10 (2): 127–134. https://doi.org/10.1002/stc.20.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. "Maximum likelihood from incomplete data via the EM algorithm." J. R. Stat. Soc. Ser. B (Methodol.) 39 (1): 1–22.
Dervilis, N., E. Cross, R. Barthorpe, and K. Worden. 2014. "Robust methods of inclusive outlier analysis for structural health monitoring." J. Sound Vib. 333 (20): 5181–5195. https://doi.org/10.1016/j.jsv.2014.05.012.
Dorafshan, S., R. J. Thomas, and M. Maguire. 2018. "Comparison of deep convolutional neural networks and edge detectors for image-based crack detection in concrete." Constr. Build. Mater. 186 (Oct): 1031–1045. https://doi.org/10.1016/j.conbuildmat.2018.08.011.
Farrar, C. R., and K. Worden. 2012. Structural health monitoring: A machine learning perspective. Chichester, UK: Wiley.
Flynn, E. B., and M. D. Todd. 2010. "A Bayesian approach to optimal sensor placement for structural health monitoring with application to active sensing." Mech. Syst. Sig. Process. 24 (4): 891–903. https://doi.org/10.1016/j.ymssp.2009.09.003.
Gao, Y., and K. M. Mosalam. 2018. "Deep transfer learning for image-based structural damage recognition." Comput.-Aided Civ. Infrastruct. Eng. 33 (9): 748–768.
Gardner, P., L. A. Bull, N. Dervilis, and K. Worden. 2020a. "Kernelised Bayesian transfer learning for population-based structural health monitoring." In Proc., 38th Int. Modal Analysis Conf. London: Springer.
Gardner, P., L. A. Bull, N. Dervilis, and K. Worden. Forthcoming. "A sparse Bayesian approach to heterogeneous transfer learning for population-based structural health monitoring." Mech. Syst. Sig. Process.
Gardner, P., L. A. Bull, J. Gosliga, N. Dervilis, and K. Worden. 2020b. "Foundations of population-based structural health monitoring. Part III: Heterogeneous populations—Mapping and transfer." Mech. Syst. Sig. Process. 149 (Feb): 107142. https://doi.org/10.1016/j.ymssp.2020.107142.
Gardner, P., X. Liu, and K. Worden. 2020c. "On the application of domain adaptation in structural health monitoring." Mech. Syst. Sig. Process. 138 (Apr): 106550. https://doi.org/10.1016/j.ymssp.2019.106550.
Gelman, A., H. S. Stern, J. B. Carlin, D. B. Dunson, A. Vehtari, and D. B. Rubin. 2013. Bayesian data analysis. Boca Raton, FL: Chapman and Hall/CRC.
Gönen, M., and A. Margolin. 2014. "Kernelized Bayesian transfer learning." In Proc., 28th AAAI Conf. on Artificial Intelligence. Palo Alto, CA: Association for the Advancement of Artificial Intelligence Press.
Gosliga, J., P. Gardner, L. Bull, N. Dervilis, and K. Worden. 2020. "Foundations of population-based structural health monitoring. Part II: Heterogeneous populations—Graphs, networks and communities." Mech. Syst. Sig. Process. 148 (Feb): 107144. https://doi.org/10.1016/j.ymssp.2020.107144.
Huang, Y., J. L. Beck, and H. Li. 2019. "Multitask sparse Bayesian learning with applications in structural health monitoring." Comput.-Aided Civ. Infrastruct. Eng. 34 (9): 732–754. https://doi.org/10.1111/mice.12408.
Jang, K., N. Kim, and Y. An. 2019. "Deep learning-based autonomous concrete crack evaluation through hybrid image scanning." Struct. Health Monit. 18 (5–6): 1722–1737. https://doi.org/10.1177/1475921718821719.
Janssens, O., R. Van de Walle, M. Loccufier, and S. Van Hoecke. 2018. "Deep learning for infrared thermal image based machine health monitoring." IEEE/ASME Trans. Mechatron. 23 (1): 151–159. https://doi.org/10.1109/TMECH.2017.2722479.
Kremer, J., K. P. Steenstrup, and C. Igel. 2014. "Active learning with support vector machines." Wiley Interdiscip. Rev.: Data Min. Knowl. Discovery 4 (4): 313–326.
MacKay, D. J. 2003. Information theory, inference and learning algorithms. Cambridge, UK: Cambridge University Press.
Manson, G., K. Worden, and D. Allman. 2003. "Experimental validation of a structural health monitoring methodology. Part III: Damage location on an aircraft wing." J. Sound Vib. 259 (2): 365–385. https://doi.org/10.1006/jsvi.2002.5169.
McCallum, A. K., and K. Nigam. 1998. "Employing EM and pool-based active learning for text classification." In Proc., Int. Conf. on Machine Learning (ICML), 359–367. Princeton, NJ: Citeseer.
Murphy, K. P. 2012. Machine learning: A probabilistic perspective. Cambridge, MA: MIT Press.
Neal, R. M. 2000. "Markov chain sampling methods for Dirichlet process mixture models." J. Comput. Graphical Stat. 9 (2): 249–265.
Nigam, K., A. McCallum, S. Thrun, and T. Mitchell. 1998. "Learning to classify text from labeled and unlabeled documents." AAAI/IAAI 792 (6): 792–799.
Ou, Y., E. N. Chatzi, V. K. Dertimanis, and M. D. Spiridonakos. 2017. "Vibration-based experimental damage detection of a small-scale wind turbine blade." Struct. Health Monit. 16 (1): 79–96. https://doi.org/10.1177/1475921716663876.
Pan, S. J., and Q. Yang. 2010. "A survey on transfer learning." IEEE Trans. Knowl. Data Eng. 22 (10): 1345–1359. https://doi.org/10.1109/TKDE.2009.191.

Papoulis, A. 1965. Probabilities, random variables, and stochastic processes. New York: McGraw-Hill.
Peeters, B., and G. de Roeck. 2001. "One-year monitoring of the Z24-bridge: Environmental effects versus damage events." Earthquake Eng. Struct. Dyn. 30 (2): 149–171. https://doi.org/10.1002/1096-9845(200102)30:2<149::AID-EQE1>3.0.CO;2-Z.
Rasmussen, C. E. 2000. "The infinite Gaussian mixture model." In Advances in neural information processing systems, 554–560. Cambridge, MA: MIT Press.
Rasmussen, C. E., and Z. Ghahramani. 2001. "Occam's razor." In Advances in neural information processing systems, 294–300. Cambridge, MA: MIT Press.
Rippengill, S., K. Worden, K. M. Holford, and R. Pullin. 2003. "Automatic classification of acoustic emission patterns." Strain 39 (1): 31–41. https://doi.org/10.1046/j.1475-1305.2003.00041.x.
Rogers, T. J., K. Worden, R. Fuentes, N. Dervilis, U. T. Tygesen, and E. J. Cross. 2019. "A Bayesian non-parametric clustering approach for semi-supervised structural health monitoring." Mech. Syst. Sig. Process. 119 (Mar): 100–119. https://doi.org/10.1016/j.ymssp.2018.09.013.
Rousseeuw, P. J., and K. V. Driessen. 1999. "A fast algorithm for the minimum covariance determinant estimator." Technometrics 41 (3): 212–223. https://doi.org/10.1080/00401706.1999.10485670.
Schwenker, F., and E. Trentin. 2014. "Pattern classification and clustering: A review of partially supervised learning approaches." Pattern Recognit. Lett. 37 (1): 4–14. https://doi.org/10.1016/j.patrec.2013.10.017.
Settles, B. 2012. "Active learning." Synth. Lect. Artif. Intell. Mach. Learn. 6 (1): 1–114. https://doi.org/10.2200/S00429ED1V01Y201207AIM018.
Sohn, H., C. R. Farrar, F. M. Hemez, D. D. Shunk, D. W. Stinemates, B. R. Nadler, and J. J. Czarnecki. 2003. A review of structural health monitoring literature: 1996–2001. Los Alamos, NM: Los Alamos National Laboratory.
Tipping, M. E. 2000. "The relevance vector machine." In Advances in neural information processing systems, 652–658. Cambridge, MA: MIT Press.
Vanik, M. W., J. L. Beck, and S. Au. 2000. "Bayesian probabilistic approach to structural health monitoring." J. Eng. Mech. 126 (7): 738–745. https://doi.org/10.1061/(ASCE)0733-9399(2000)126:7(738).
Vlachos, A., A. Korhonen, and Z. Ghahramani. 2009. "Unsupervised and constrained Dirichlet process mixture models for verb clustering." In Proc., Workshop on Geometrical Models of Natural Language Semantics, 74–82. Stroudsburg, PA: Association for Computational Linguistics.
Wan, H., and Y. Ni. 2019. "Bayesian multi-task learning methodology for reconstruction of structural health monitoring data." Struct. Health Monit. 18 (4): 1282–1309. https://doi.org/10.1177/1475921718794953.
Wang, M., F. Min, Z.-H. Zhang, and Y.-X. Wu. 2017. "Active learning through density clustering." Expert Syst. Appl. 85 (Nov): 305–317. https://doi.org/10.1016/j.eswa.2017.05.046.
Worden, K., and G. Manson. 2006. "The application of machine learning to structural health monitoring." Philos. Trans. R. Soc. London, Ser. A 365 (1851): 515–537. https://doi.org/10.1098/rsta.2006.1938.
Worden, K., G. Manson, G. Hilson, and S. Pierce. 2008. "Genetic optimization of a neural damage locator." J. Sound Vib. 309 (3): 529–544. https://doi.org/10.1016/j.jsv.2007.07.035.
Ye, J., T. Kobayashi, H. Tsuda, and M. Murakawa. 2017. "Robust hammering echo analysis for concrete assessment with transfer learning." In Proc., 11th Int. Workshop on Structural Health Monitoring, 943–949. Stanford, CA: Stanford Univ.
Zhang, Y., and Q. Yang. 2018. "An overview of multi-task learning." Natl. Sci. Rev. 5 (1): 30–43. https://doi.org/10.1093/nsr/nwx105.
Zhao, R., R. Yan, Z. Chen, K. Mao, P. Wang, and R. X. Gao. 2019. "Deep learning and its applications to machine health monitoring." Mech. Syst. Sig. Process. 115 (Jan): 213–237. https://doi.org/10.1016/j.ymssp.2018.05.050.
Zhu, X. J. 2005. Semi-supervised learning literature survey. Report. Madison, WI: Univ. of Wisconsin–Madison.
Zonta, D., B. Glisic, and S. Adriaenssens. 2014. "Value of information: Impact of monitoring on decision-making." Struct. Control Health Monit. 21 (7): 1043–1056. https://doi.org/10.1002/stc.1631.
