
Rev Environ Sci Biotechnol (2021) 20:985–1009

https://doi.org/10.1007/s11157-021-09592-y

REVIEW PAPER

A review of data-driven modelling in drinking water treatment
Atefeh Aliashrafi · Yirao Zhang · Hannah Groenewegen · Nicolas M. Peleato

Received: 24 March 2021 / Accepted: 10 September 2021 / Published online: 18 September 2021
© The Author(s), under exclusive licence to Springer Nature B.V. 2021

Abstract There are significant opportunities to optimize drinking water treatment and water resource management using data-driven models. Modelling can help define complex system behaviour, such as water quality and environmental systems, giving insight into expected outcomes from changing conditions. Many water treatment models have been developed, such as predicting treated water quality based on coagulant addition or disinfection by-product formation from chlorination, which can be used to better inform operators of optimal treatment parameters to limit risk and reduce cost. Data-driven models, in particular, present an opportunity to learn relationships from patterns in historical data without the need to pre-define mechanisms or variable interactions. Furthermore, models built on currently monitored data are likely easier to implement since they utilize water quality measures that are already in place. However, data-driven approaches have significant challenges, including increased uncertainty in model validity, challenges in interpreting model behaviour and decision logic, and increased likelihood of incorporating biases from training data. This article presents a review of data-driven model applications in drinking water treatment to highlight opportunities related to protecting source water quality, optimizing treatment processes, and interpreting sensor data. There is a focus on identifying approaches and algorithms best suited for specific applications and the interpretability of trained models. Successful implementation of data-driven models in critical systems, such as water treatment, requires that models be validated and that a model’s decision-making logic can be identified and scrutinized.

Keywords Data-driven modelling · Drinking water · Water quality · Machine learning · Artificial intelligence

A. Aliashrafi · Y. Zhang · H. Groenewegen · N. M. Peleato (✉)
School of Engineering, University of British Columbia Okanagan, Kelowna, BC, Canada
e-mail: nicolas.peleato@ubc.ca

1 Introduction

Water resources management and water treatment are of growing importance due to increased demand for clean water coupled with deteriorating water quality from the contamination of water bodies (Li et al. 2013; Mei et al. 2014; Aghel et al. 2019). While augmenting treatment efforts or using alternative water sources can offset deteriorating natural water quality, this comes with increasing energy and cost requirements (Brookes et al. 2014). It is, therefore, of economic and social interest to ensure that not only is sufficient treatment being applied for various water uses, but that treatment and management efforts are optimized to
minimize cost. Furthermore, optimized treatment and management require an in-depth understanding of source or environmental water quality and the factors that may impact water quality changes.

Generating sufficient knowledge about water quality can be challenging due to limitations with monitoring methods, frequency, and analysis. As such, modelling approaches can help represent knowledge about systems with limited and uncertain inputs (Reckhow 1999). The increasing availability of water quality data from improved monitoring technologies, such as bio, miniaturized, and wireless sensors, offers an opportunity to discover and develop enhanced models for improving water management (Hey 2009; Juntunen et al. 2012; Eggimann et al. 2017). As demonstrated by the success of applying data science in other fields of science and engineering such as computer vision and health care (Hey 2009; Gilpin et al. 2019; Montáns et al. 2019), data-driven methods are becoming critical tools for scientific research and can be applied to help inform water treatment and resources management.

Increased availability of large water quality datasets is insufficient in itself to realize optimized water treatment. Modelling is required to extract valuable information from observed data that can predict future conditions or find patterns that can elucidate underlying mechanisms or associations between observations. Two general approaches can be taken to building water quality models: process/physics-based models with pre-defined frameworks and variables or data-driven models that define the model framework and variable interactions based on patterns in historical observations. All simulation or modelling processes, including process-based models, can learn or be calibrated to data. However, process models define the boundaries of the problem domain using a priori knowledge or assumptions of a system’s governing physical mechanisms. The framework of process-based models is not data-dependent but rather stems from implied knowledge or hypotheses about the system that will impart their own biases (Montáns et al. 2019). A principal advantage of data-driven approaches is simulating processes without the explicit need to define relevant features or variables. Relevant features and relationships between variables can be estimated from observed data, with varying degrees of bias, constraints, and prior knowledge being incorporated (Marton et al. 2013; Bikmukhametov and Jäschke 2020). It should be considered that although data-driven approaches are often more flexible and contain fewer constraints, the characteristics of the model chosen and data chosen for training will impart constraints on learned relationships (Karniadakis et al. 2021). Furthermore, process-based models can be challenging to implement as they may rely on design parameters that are difficult to know or measure precisely (Juntunen et al. 2013). For example, flocculation models may include parameters such as collision efficiency factors, turbulent flow regimes, and floc strength, which are challenging to measure during operation. Still, knowledge of these parameters may be integral to accurate predictions using physics-based models (Bridgeman et al. 2009). Data-driven models learn relationships from already observed data. Therefore, future measurements of those same variables are more likely to be feasible, making it more probable that data-driven models can be easily implemented in practice. Water process modelling from a physics-based perspective is generally challenging, as is evident from the routine use of empirical testing over modelling approaches in day-to-day water treatment operations. For example, jar tests are routinely used to determine optimum coagulant dosages (Wu and Lo 2008), and the chlorine dose needed to maintain residuals in a distribution system is largely based on trial and error (Soyupak et al. 2011).

There are several challenges with data-driven models that need to be considered. With increased reliance on historical data to learn system behaviour, biased or non-representative training data is more likely to create a poorly performing model (Wang et al. 2020a). The success of data-driven approaches in other fields is in part due to the availability of large datasets, which can be more difficult and expensive to generate in fields such as water quality monitoring. Modelling tasks that aim to predict parameters measured by grab samples and laboratory-based methods (e.g. disinfection by-products, pharmaceutical concentrations, genotoxicity) are particularly challenging due to the low number of available training samples (Peleato et al. 2018; Lin et al. 2020).

The decreased influence of a priori knowledge in data-driven approaches results in uncertainty in the model’s validity. While highly accurate predictions can seem promising, there is the risk that data-driven models are biased, have memorized training samples, or have not captured a true representation of system
behaviour (Guidotti et al. 2019). There are numerous examples of powerful machine learning or data-driven models applied to image processing, language interpretation, risk of crime recidivism, and financial analyses that have been found to have significant biases that undermine their practical use and have caused harm (Doshi-Velez and Kim 2017; Guidotti et al. 2019). With increasing interest in implementing data-driven models in water treatment modelling and operations, it is necessary that an understanding of the model logic is available. The use of black-box models in critical systems directly impacting public health, such as water treatment operations, brings significant concerns around accountability, safety, and liability. With increasing interest in applying data-driven models, there is a need to ensure that there is a focus on creating explanatory or interpretable models (Gilpin et al. 2019). Interpretable models can explain, in terms understandable to a human, the reasoning behind a specific output or model outcome (Doshi-Velez and Kim 2017).

There are several strategies that can be used to improve model interpretability and transparency. One approach is to explore trained models to identify how decisions are made. After training, decision boundaries can be explored using local models around predictions (Ribeiro et al. 2016), decision rules can be extracted, or sensitivity analyses can be applied, where changes in model outputs are observed as the input is manipulated to gain insight into model behaviour (Gilpin et al. 2019). Alternatively, there can be a focus on types of models or modifications that result in more interpretable structures. For example, symbolic regression algorithms aim to build functions from observed data by selecting mathematical operators and operands that maximize fit on training data (Quade et al. 2016; El Hasadi and Padding 2019), although they have seen limited application in water quality modelling (Jagupilla et al. 2015). Recent work has investigated the incorporation of first principles or the blending of data-driven and physics-based models to increase interpretability and to better utilize small data sets (Qin and Chiang 2019; Karniadakis et al. 2021), with successful application to modelling fluid flow (Raissi et al. 2019).

Several reviews of data-driven methods applied in water resources are available in the literature, focused on specific algorithms or application areas. The majority of these reviews focus on the use of Artificial Neural Networks (ANNs), with more limited discussion of more descriptive models, such as Decision Trees or Bayesian Networks, or of model interpretation (Maier and Dandy 2000; Khataee and Kasiri 2011; O’Reilly et al. 2018; Tyralis et al. 2019). Furthermore, application areas have most commonly pertained to hydrological modelling of water quantity and flows, such as flood forecasting, rainfall-runoff modelling, and streamflow (Maier et al. 2010; Tyralis et al. 2019). As such, this review aims to capture the use of data-driven models more specifically for water quality and drinking water treatment modelling. Furthermore, efforts are taken to present alternative data-driven models that focus on descriptive or explanatory aspects of the modelling process. Common data-driven methods are categorized and discussed based on their application, along with a discussion of the similarities and unique features. A review of the application of these methods in water treatment modelling focuses on how specific models or approaches are best suited for specific modelling tasks. Finally, a discussion of challenges, limitations, and best practices for building data-driven approaches is presented.

2 Review of data-driven models

Data-driven modelling is a fast-expanding approach to tackling a variety of challenges in many fields. There are a significant number of different algorithms and models discussed in the literature. For this review, a few general classes of algorithms, based on their objectives or tasks and input data format, are covered. For each class, the most common and widely used models are reviewed to present an overview of possible approaches without muddying the discussion with minutiae.

A high-level categorization of model types can be based on the characteristics of the dataset they are built on or learn from: supervised vs. unsupervised learning. Supervised models learn from input data that is coupled with an associated target or label, while unsupervised models are given input data with no associated target (Murphy 2012). Unsupervised models typically are used to identify underlying structures or properties in the dataset and can help identify groupings of samples without prior knowledge of how to categorize the data (Goodfellow et al. 2016).
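
As a minimal illustration of this distinction, the following Python sketch fits a supervised model to simulated inputs paired with a target and, separately, groups the same inputs with no target at all. The data, the two unnamed input features, the coefficients, and the small `kmeans` helper are hypothetical and are not drawn from any study cited here; ordinary least squares and K-means simply stand in for the broader supervised and unsupervised families.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: two arbitrary water quality features per sample.
# Two well-separated groups are simulated so there is structure to discover.
group_a = rng.normal(loc=[1.0, 0.5], scale=0.1, size=(50, 2))
group_b = rng.normal(loc=[3.0, 1.5], scale=0.1, size=(50, 2))
X = np.vstack([group_a, group_b])

# Supervised learning: every input row is paired with a known target y.
# Here y is simulated as a linear function of the inputs plus small noise.
y = X @ np.array([2.0, 10.0]) + 0.5 + rng.normal(scale=0.01, size=len(X))
X1 = np.column_stack([X, np.ones(len(X))])      # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)   # ordinary least squares fit


# Unsupervised learning: only X is available, no target is supplied.
def kmeans(X, k, n_iter=20):
    """Plain K-means; initial centres are k evenly spaced rows of X."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        # Distance from every sample to every centre, then nearest centre.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):             # leave empty clusters as-is
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers


labels, centers = kmeans(X, k=2)
print("fitted coefficients:", np.round(coef, 2))
print("cluster sizes:", np.bincount(labels, minlength=2))
```

The supervised fit can be scored against the known targets, whereas the cluster labels have no reference answer and must be interpreted by inspecting what the grouped samples have in common.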

A frequent objective of applying data-driven models to water quality data is prediction. Prediction implies a target variable associated with each sample. Therefore, predictive models aim to learn underlying relationships from historical data to determine the correct label or target value of a new sample (Flach 2012; Finlay 2014). Predictive models have wide-ranging applications to water quality modelling, including identifying the optimal treatment needed for changing source water conditions, predicting treatment outcomes, and anticipating changes to water quality due to interventions or land-use changes (Maier et al. 2004; Wan et al. 2014). Regression tasks refer to the prediction of a continuous numerical value given some input. Examples of regression tasks would be the prediction of dissolved oxygen levels in rivers (Elkiran et al. 2018), disinfection by-product formation (Sadiq and Rodriguez 2004), or settled water turbidity after coagulation (Juntunen et al. 2012). Classification tasks differ in that the prediction is discrete, and the goal is to assign a category to each sample. For instance, classification tasks can include the use of a subset of water quality measures to predict a water quality index (Abba et al. 2020) or the conformance/exceedance of some regulatory threshold (Yang et al. 2019).

Regression and classification both assume that labelled data is available (i.e. supervised learning). In many cases, large and high-quality labelled datasets are challenging to generate. Therefore, an alternative objective can be clustering, where the aim is to identify patterns or homogeneous subsets of samples with similar patterns or behaviours without labelled data (Mohri et al. 2018). Clustering can be applied for either predictive tasks, where an accurate grouping of new samples is wanted based on structures discerned from unlabelled historical data, or for descriptive or exploratory tasks, where the process of identifying groups within samples is the ultimate goal (Flach 2012). As an example of an illustrative clustering study, Juntunen et al. (2013) utilized self-organized maps (SOM) and K-means clustering to identify typical and atypical process states in a water treatment plant. By identifying these clusters, further analysis can be carried out to determine the nature of water quality deviations due to changing conditions.

Water quality conditions are a product of complex environmental interactions and are highly temporally and spatially variable. It can be problematic to apply deterministic modelling approaches with sharp boundaries between classes in complex and imprecise systems, and fuzzy logic can provide a useful approach to account for vague, qualitative, and inaccurate information (Chau 2006; Li et al. 2016). The fuzzy logic approach involves defining subsets of the data without sharp boundaries using membership values. Any given input can be assigned to several subsets with a membership value, identifying the degree of conformance with that subset (Harris et al. 2006). For several water quality applications such as coagulation dose prediction (Wu and Lo 2008; Heddam et al. 2012), pollution detection (Sahoo et al. 2005), and solids concentrations (Banadkooki et al. 2020), researchers have noted improved performance of Adaptive Neuro-Fuzzy Inference Systems (ANFIS), a neural network-type algorithm that incorporates a fuzzy logic approach.

An alternative objective of data-driven models applied in water quality assessments is dimensionality reduction. Dimensionality reduction aims to transform some original representation to a lower-dimensional representation while, hopefully, retaining important features. For example, spectroscopic water quality monitoring techniques, including UV–Vis, near infrared, or fluorescence, produce high-dimensional water quality representations. These high-dimensional representations can be challenging to use as inputs in models, and it is often beneficial to identify only a few key wavelength combinations that capture the significant features of the sample (Trueman et al. 2016). Unsupervised reduction of dimensionality can be used to improve subsequent predictive modelling tasks. For instance, reduction of input data by principal component analysis (PCA) or self-organized maps (SOM) can be useful for improving neural network performance (Maier et al. 2004).

2.1 Popular models

The selection of an appropriate model type is dependent on the task as well as the kind and size of the dataset. For each type of model, a large number of possible modifications could be applied, and it is impossible to identify one superior model for all tasks. In this review, we focus on a few popular models, identify some attractive and challenging characteristics of these methods, and describe some modifications that can be applied.

An overview of models used in water resource and treatment was generated based on keyword searches in the Web of Science database since 1996. A total of 5,546 articles in “Water Resources” have utilized or studied ANNs or neural networks (NNs), 1,127 have utilized support vector machines or support vector regression, 1,360 refer to decision trees, regression trees or random forests, and 85 to Bayesian Networks. Of the total 8,118 identified articles referencing data-driven models in water resources since 1996, ~12% (971) include keywords associated with water quality. Most data-driven applications in water resources have therefore focused on quantity, supply, and hydrologic modelling challenges. The list was then manually refined to better identify articles that are more directly associated with drinking water treatment. Based on our review of the articles, ~1% of data-driven applications in water resources (102 articles since 1996) were directly applied to model aspects of water quality during drinking water treatment. Three subdivisions of water treatment application areas were defined: [1] water treatment operations, [2] source water quality modelling (in the context of impacts on treatment), and [3] distribution system water quality. The distribution of articles that are applied to these subcategories is presented in Fig. 1. ANNs have been the most popular data-driven model applied to water quality modelling in drinking water treatment, particularly for water treatment process or operation modelling.

2.1.1 Artificial neural networks

The application of ANNs to water quality and treatment modelling has been practiced since the early 1990s (Maier and Dandy 1996). They have been the most utilized type of data-driven algorithm applied in the field (Tyralis et al. 2019), as evident from keyword searches in the Web of Science database. ANNs are general function approximators that map or transform a set of input values to output values. The overall mapping is formed by combining many simpler functions that occur within the network. These simpler functions, referred to as activation functions, typically involve summed weighted input values being processed by simple functions within nodes/neurons to produce an output value that feeds into another layer of nodes (Fig. 2). By arranging layers of nodes, input values pass through a number of hidden layers to an output layer. There is a high degree of flexibility in the number of hidden layers, neurons per layer, and other structural considerations. Networks with several hidden layers can be referred to as deep networks. Deep learning rests on the concept that high-level representations of system properties, or of factors that influence a system as a whole, can be represented in terms of several simpler representations. By having numerous hidden layers, the network is capable of learning nested representations of the data that could provide a more complete picture of how high-level factors may influence an observed output (Goodfellow et al. 2016).
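
The node and layer operations described above can be sketched in a few lines of Python. This is an illustrative sketch only: the bias term `b`, the layer sizes, the random weights, and the input values are assumptions made for the example and are not taken from the reviewed studies.

```python
import numpy as np


def sigmoid(z):
    """Sigmoidal activation; output lies between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    """Rectified linear activation; negative inputs give 0."""
    return np.maximum(0.0, z)


def neuron(x, w, b, f):
    """Single node: activation f applied to the weighted sum of inputs.

    The bias b is an assumed extra term, commonly included in practice.
    """
    return f(np.dot(w, x) + b)


def forward(x, layers):
    """Pass an input vector through a list of (W, b, activation) layers."""
    a = x
    for W, b, f in layers:
        a = f(W @ a + b)    # each row of W holds the weights of one node
    return a


# Hypothetical network: 3 inputs -> two hidden layers of 4 nodes -> 1 output.
rng = np.random.default_rng(0)
x = np.array([0.8, 0.1, 0.5])
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4), relu),
    (rng.normal(size=(4, 4)), np.zeros(4), relu),
    (rng.normal(size=(1, 4)), np.zeros(1), sigmoid),
]
y = forward(x, layers)
print("network output:", y)
```

Adding or removing tuples in `layers` changes the depth of the network, which is the structural flexibility referred to in the text; the weights here are untrained and would be adjusted during training.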
Fig. 1 Distribution of published articles directly applied to model water quality associated with drinking water treatment [bar chart: published article count (axis 0 to 55) for Source Water Quality, Water Treatment Operations, Distribution System, and Total; series: Neural Networks, Support Vector Machines, Decision Trees and Random Forests, Bayesian Networks]

Fig. 2 Example schematic of a neuron in a feed-forward network. x are inputs, w are connection weights, f(x) is the activation function, and y is the layer output

ANNs are trained iteratively in a supervised way, where network outputs are compared to known or desired labels/values and weights are then slightly adjusted within the network to minimize error on the next pass (Aggarwal 2018). The most common algorithm for optimizing ANNs by adjusting weights
is error backpropagation (Eq. 1), where all the n weights in a network are each adjusted ($\Delta w_i$) using the gradient of the error with respect to the weight ($dE/dw_i$) and a learning rate constant ($\gamma$), which defines the rate of weight adjustment (Rojas 1996).

\[ \Delta w_i = -\gamma \frac{dE}{dw_i}, \quad \text{for } i = 1, \ldots, n \tag{1} \]

The choice of loss function explicitly defines the objective of the task and gives the measure of error (E) used in weight adjustment. Mean squared error (MSE) would be appropriate for regression tasks where it is desired for the network to predict a continuous output accurately. As an example, the MSE error function can be defined as shown in Eq. 2.

\[ E = \frac{1}{2} \sum_{i=1}^{p} \left( y_i - \hat{y}_i \right)^2 \tag{2} \]

where $y_i$ is the network output for sample i, $\hat{y}_i$ is the known value for sample i, and p is the total number of samples in the dataset.

For networks applied for categorization, cross-entropy is used as a loss function to measure the difference between two probability distributions (i.e. the output probabilities of each class and the known class). However, there is flexibility in defining loss functions that can provide advantages to specific modelling tasks. For instance, sparsity in the network can be encouraged, which helps prevent overfitting, by encouraging small weights using weight decay or penalties, where the square ($L_2 = \sum_{i=1}^{n} w_i^2$) or absolute value ($L_1 = \sum_{i=1}^{n} |w_i|$) of the n weights are added to the overall loss function, or by adding in Kullback-Leibler divergence criteria (Hosseini-Asl et al. 2016). Many other modifications are available, and custom loss functions should be considered when designing a network. For example, in order to improve explanations of NNs and promote interpretability, Ross et al. (2017) included an additional term in the loss function that penalizes the network when considering pre-annotated irrelevant input parameters.

Both the activation and the loss functions play a key role in defining the structure and ability of the network to model a specific system. Activation functions (Fig. 3) provide non-linearity in the model and can take many forms including sigmoidal, hyperbolic tangent (tanh), or rectified linear functions (ReLU) (Aggarwal 2018). The sensitivity of response for a specific input range is unique to each activation function, and certain functions are best suited for specific situations. For instance, if it is known that values should only take positive values (i.e. a water quality measurement where negatives are not meaningful, like turbidity), use of ReLU units that output 0 for any negative input $(-\infty, 0]$ or a sigmoidal function that always produces an output > 0 could be beneficial.

Simple and shallow NNs have been successful for various water quality prediction tasks; however, there is a wide range of network structures and learning approaches that may be more appropriate for specific tasks. Recurrent Neural Networks (RNNs) are a type of network structure where outputs consider the system’s previous state. As such, they are ideal for sequential data such as time-series. The structure’s recurrent aspect can take several forms, including the neurons in the hidden layers at time step t being influenced by the values in the hidden layers at times < t. Long short-term memory (LSTM) networks are a specific form of RNN that are well suited for modelling time series and have recently been applied successfully in natural water monitoring and predictions (Wang et al. 2017, 2019; Zhou et al. 2018; Barzegar et al. 2020). Convolutional Neural Networks (CNNs) train ‘filters’ or spatial patterns in inputs and therefore consider the local context of data. As such, this approach is effective for image processing-type tasks. CNNs have recently been shown to be useful for processing spectral measures of water quality (Chen et al. 2020a) and have application in utilizing satellite imagery to aid water quality predictions (Pu et al. 2019).

2.1.2 Decision trees

Decision trees (DT) are simple but effective supervised learning methods most commonly applied to classification tasks; however, they can be adapted for regression. DTs model variability in the target variable by sequentially splitting input variables at learned thresholds (Everaert et al. 2016). A node in the tree represents a feature or variable, and the branches from that node are possible values that the variable can take (Kotsiantis 2007). The overall goal is to split the target variable into homogenous groups based on a sequence of decisions (De’ath and Fabricius 2000). Splits can
take several forms, depending on the variable type. For instance, numerical inputs are split based on being greater or less than some threshold value, and categorical inputs are split based on the levels associated with that variable (De’ath and Fabricius 2000). Figure 4 shows a simple decision tree structure, where categories of a response variable are ‘decided’ based on binary splits of several inputs. In the example shown, colony forming units (CFU) of Escherichia coli in a source water (Fraser River, British Columbia) are predicted based on several water quality parameters, including coliform counts, water temperature, turbidity, and conductivity. The visual representation of the tree (Fig. 4) shows the undivided data at the top (the root node), and each split is called a branch with associated leaves, which are subsets of the data based on the division.

Fig. 3 Example common activation functions that provide non-linear transformation of some input (x) to an output (y)

Fig. 4 Decision tree applied to predict Escherichia coli levels (either > or < 20 CFU/100 mL) based on source water characteristics. End nodes (bottom row) represent classification of targets; input variables (coliform counts, conductivity, hardness, turbidity, and water temperature) are split based on a learned threshold to best categorize E. coli levels

Decision trees have several unique advantages and are well suited for several water quality modelling tasks. They can handle diverse types of input data, are computationally efficient to train, are easy to interpret, and can effectively handle missing or incomplete data (Rokach and Maimon 2015). Interpretability is important to many modelling tasks, and through easily observed splits in data based on learned branches and leaves, decision trees provide an interpretable description of the systematic structure in datasets. For example, in reference to Fig. 4, when coliforms are low (≤ 20.5 CFU/100 mL or furthest left branches), the model predicts low E. coli concentrations. However, when coliforms are above 21.5 CFU/100 mL and when temperature > 19.95 °C, we can expect high E. coli concentrations (furthest right branches).

There are several approaches to building trees that aim to identify features that best divide the data and ways to recursively partition the data into child nodes using splitting criteria that maximize information gain and minimize node impurity (Gokgoz and Subasi 2015). Splitting criteria depend on a measure of impurity in a given set of data at any given node ($I(S)$). When a set of data, $S$ with $n$ samples, is split into two subsets $S_1$ with $n_1$ samples and $S_2$ with $n_2$ samples, the
information gain can be calculated as shown in Eq. 3. The impurity is a representation of how homogenous each node is, and the Gini index is a common metric of impurity (Eq. 4) (Krzywinski and Altman 2017). The impurity measure will be at a minimum when only a single class is present in any given node.

\[ \text{Information Gain}(S_1, S_2) = I(S) - \frac{1}{n} \left( n_1 I(S_1) + n_2 I(S_2) \right) \tag{3} \]

\[ \text{Gini index}: \quad I(S) = \sum_{i=1}^{d} p_i \left( 1 - p_i \right) \tag{4} \]

where $p_i$ is the fraction of samples belonging to class i in the subset and d is the total number of classes.

Trees can be grown until a split does not improve the homogeneity or information gain of subsequent nodes over some cut-off value. Frequently the resulting tree determined by the splitting criteria will be overly large since the algorithm has maximized the ability to categorize samples correctly. Large trees are more complex to interpret and can lead to issues of overfitting. As such, tree pruning is carried out after building the tree to remove redundant splits and re-evaluate the trade-off between prediction accuracy and tree size.

A common type of decision tree method is the Random Forest (RF) approach. RFs are an ensemble type approach to decision tree classification/regression. Multiple trees are grown from the same dataset, and predictions are the most frequent (classification) or average (regression) output from the parallel models (Breiman 2001). For each independent tree grown in the ensemble, randomness is introduced by taking a bootstrapped sample of the input data. At each node, only a random subset of input variables is considered for splitting (Gokgoz and Subasi 2015). As such, each tree will be unique since not all data (i.e. from bootstrapping) and variables are considered in each tree. An ensemble method with relatively uncorrelated trees increases the ability for the model to generalize since errors between trees should also be independent (Qi 2012). Furthermore, since splitting is only being carried out on a subset of features, the dimensionality is effectively reduced. As such, RFs perform well in situations with high likelihood of overfitting, such as limited dataset sizes, complex interactions between variables, and high dimensionality (Gokgoz and Subasi 2015). This often results in improvements to prediction accuracy on test data. For example, in reference to the example above (Fig. 4), prediction accuracy on a test set was improved from 71.4% (decision tree) to 85.7% when using an RF approach. Although RFs can be more challenging to interpret directly compared to single decision trees, several approaches to identifying variable importance in an RF are also available (Tyralis et al. 2019).

2.1.3 Bayesian approaches

The Bayesian approach involves updating beliefs about a system or probability of hypotheses with the inclusion of new evidence. A more detailed account of differences between classical statistical approaches and a Bayesian approach can be found in Ellison (2004), but can be briefly summarized as follows. Bayesian inference identifies the probability of a hypothesis given observed data and incorporates prior belief about the system, while the frequentist viewpoint assesses the probability of observed data given a hypothesis. The Bayesian definition of probability is the degree of belief in the likelihood of an event, while a frequentist approach defines probability as the frequency of an event. Some of these characteristics are evident in Bayes’ theorem, which updates the probability of a hypothesis (H) based on new evidence or observation (E), written as $P(H|E)$ and referred to as the posterior probability:

\[ P(H|E) = \frac{P(E|H) \times P(H)}{P(E)} \tag{4} \]

where $P(H)$ is the probability of the hypothesis prior to new evidence (E) and is referred to as the prior probability, $P(E)$ is the probability of that evidence, and $P(E|H)$ is the probability of evidence given the hypothesis.

The Bayesian approach can be well suited to environmental systems since the uncertainty of input parameters and the hypothesis can be easily incorporated or interpreted. Given the complexity of environmental systems, modelling approaches that emphasize uncertainty in measurements and outputs and flexibility to incorporate new evidence may be ideal in many situations (Reckhow 1999). Furthermore, evidence or
observation of environmental systems is not collected without prior knowledge about the system, which
can be readily captured in the prior probability (Abbaspour et al. 1996). Two types of Bayesian models
are commonly used for modelling environmental systems: hierarchical models and Bayesian Networks. Both
look to capture conditional dependencies between variables and utilize Bayes’ theorem to update beliefs
but differ in how they are practically applied (Uusitalo 2007).

Hierarchical Bayesian models have proven to be highly valuable for modelling complex environmental
processes (Wikle 2003; Wan et al. 2014; Mei et al. 2014). The hierarchical approach looks to model
environmental systems, not from a joint perspective (i.e. all observations together as a joint
probability), but rather as a collection of random variables decomposed into several conditional models
(Wikle 2003). This approach can be particularly useful when incorporating spatial or temporal
variability into models since the regression parameters and distributions of parameters can be varied
over spatial or temporal scales (Wan et al. 2014). As an example of application to water quality
modelling, this modelling approach has been utilized for identifying the role of land use or temporal
variability on source water quality (Wan et al. 2014; Mei et al. 2014). A classical approach to
hierarchical modelling is possible; however, the Bayesian paradigm has been most often used in this
context and may be better suited for complex scenarios by allowing for subjective priors (Wikle 2003).

Bayesian Networks (BNs) or Bayesian Belief Networks (BBNs) have also seen successful environmental and
water quality modelling applications (Aguilera et al. 2011). BNs are graphical models that capture
probabilistic relationships between variables of interest. The structure of BNs comprises nodes
representing variables and directed arcs or links between variables that encode dependence between
nodes (Kabir et al. 2015). The arcs or links are unidirectional in influence, showing the dependence of
a child node on the condition of the parent node. BNs allow for both predictive and diagnostic
evaluations. Probability distributions can be modified in both directions based on evidence, meaning
the probability of causes (parent nodes) can be calculated given the consequences (child nodes) as well
as consequences given causes (Uusitalo 2007). Once developed, BNs can be useful for knowledge
exploration since with observation of evidence related to one variable, the probability distributions
for other nodes can be updated and a sense of impact can be discerned (Aguilera et al. 2011). The
ability to update probability distributions for all variables given one observation also makes
exploration or explanation of BN-based decisions interpretable.

BNs have several other characteristics that are well suited for modelling environmental processes. Due
to the probabilistic model output, confidence or uncertainty in the hypothesis or prediction can be
easily interpreted (Uusitalo 2007). Furthermore, BNs can utilize incomplete data to make predictions
and are efficient with small sample sizes (Aguilera et al. 2011; Fenton and Neil 2012). In particular,
incomplete data with missing values is common in water quality monitoring and can be a major issue when
building predictive models. The major limitations of BNs include the inability to include cyclic
relationships (i.e. there can be no loops or cycles to dependencies between variables), and it is more
cumbersome to deal with continuous variables. As such, BN modelling is often only applied to discrete
data, and discretizing continuous variables can lead to the introduction of biases.

3 Review of data-driven modelling for drinking water treatment

In addition to the wide variety of modelling approaches available, there is a significant breadth of
water quality applications that have seen benefit from data-driven approaches. This section aims to
give an overview of areas of opportunity for applying data-driven models to model source water quality
and water treatment operations. The focus is on how specific models or approaches are best suited to
varying situations and the possible role data-driven methods have in improving water treatment.

3.1 Source water quality

Knowledge of source water quality is crucial for optimal water resources management and water
treatment process design and operation. There have been several studies that take a data-driven
approach to model environmental or source water quality.

Data-driven source water monitoring is an area of opportunity for identifying key factors driving water
quality changes in source waters that may impact treatment operations. For example, Mei et al. (2014)
applied factor analysis and clustering methods to identify predominant organic pollution sources and
causes of variations in surface water quality. Tesoriero et al. (2017) used RFs to predict
redox-sensitive contaminants in groundwaters and estimated variable importance based on the sensitivity
of model accuracy to permutations in input variables. This approach identified major factors driving
levels of nitrate, iron, and arsenic in groundwater, such as crop coverage, water table depth, and
geological formations. Bayesian hierarchical models can account for interactions between variables and
allow for regression parameter distributions to be varied on spatial or temporal scales. As such, this
approach has been useful for identifying key parameters that influence temporal and spatial changes in
surface water quality (Mei et al. 2014; Guo et al. 2019). The factors driving water quality changes
identified by these data-driven models could be utilized to improve policy decisions for water
resources management and the design or operation of water treatment systems, which can benefit from
future predictions under various scenarios. For example, Debnath et al. (2015) used an ANN to predict a
surface water vulnerability index to climate change impacts for various potential drinking water
sources. These predictions were utilized to best select a water source for a drinking water supply.

Identification of disruptions or pollution in water sources is also a major area of data-driven model
application. The premise is that algorithms would be capable of identifying signatures of different
‘water types’ that would indicate their origin. Rapidly identifying anomalous events could serve as an
early warning system for water treatment operations and help mitigate risk from short-term
contamination events. ANN-type approaches have been popular for this application, including identifying
signatures of water originating from various geological formations or associated with agricultural or
mining activities (Keskin et al. 2015), predicting the presence of contaminants such as pesticides
(Sahoo et al. 2005), or wastewater in groundwater (Stedmon et al. 2011), as well as identification of
contaminants such as polycyclic aromatic hydrocarbons (Ferretto et al. 2014; Li and Peleato 2021).
Detection of abnormal water quality or contamination events can also be accomplished by analysis of
model residuals. The idea is that during normal water quality conditions, a model built on historical
data should produce accurate predictions of some easy-to-measure parameter; however, large errors will
result during low-frequency and abnormal conditions (Perelman et al. 2012). Other approaches to
measuring conformance or similarity between time series have also been utilized to detect anomalous
water quality periods in source waters, such as conformance with normal distributions (Deng and Wang
2017).

ANNs have been utilized to predict future water quality conditions in sources, including turbidity,
conductivity, COD, TOC, chlorophyll-a, or ammonia several hours in advance to act as early warning
systems for challenging treatment scenarios (Mulia et al. 2013; Burchard-Levine et al. 2014; Delpla
et al. 2019; Jin et al. 2019). To predict future water conditions in the source, it is important to
consider the time-dependence of observations. Researchers have investigated several modifications to
ANN structures that include lagged inputs to predict future conditions. For example, Burchard-Levine
et al. (2014) utilized defined time-lagged inputs to predict future water quality conditions (e.g.
ammonia or chemical oxygen demand). However, time-lags will be variable depending on season and other
hydrological considerations (Delpla et al. 2019). Recurrent neural network structures include
connections between observations at different time-steps and are possibly more suitable for analyzing
sequential or time-series data. A persistent issue with recurrent models is the capability to capture
details in short time scales (e.g. day-to-day fluctuations) while also considering long-term
dependencies (e.g. seasonal fluctuations). LSTMs are a type of recurrent neural network that includes
gates between steps that allow for retaining previous information only if relevant and modulating the
information that is added to a current state (Goodfellow et al. 2016). This approach can therefore
effectively retain long-term dependencies, if relevant, while also considering short-term changes.
Recent LSTM-type models have produced promising results for predicting water quality parameters such as
dissolved oxygen levels and cyanobacteria blooms (Wang et al. 2017; Lee and Lee 2018; Zhou et al. 2018;
Liu et al. 2019; Barzegar et al. 2020). Jia et al. (2021) recently demonstrated the integration of
physical laws
with an LSTM to predict lake temperatures. The method developed integrates the law of energy
conservation (i.e. the balance between incoming and outgoing heat fluxes results in a change in thermal
energy) as additional states in an LSTM structure. The approach of blending machine learning with
physics-based models and constraints is a promising approach to address major challenges with
data-driven methods, including ensuring models conform to known physical laws, more efficient use of
small datasets, and improving model generalization (Jia et al. 2021).

Data-driven approaches have also been used to estimate the occurrence of pathogens or indicator
bacteria in source waters. Standard methods to quantify microbial water quality involve the cultivation
of organisms, such as coliforms or Escherichia coli (E. coli), in a lab, and therefore, results are
severely delayed (18 h+) (Avila et al. 2018). There could be a significant benefit to recreational
water management and setting drinking water treatment objectives if models could produce accurate
microbial water quality predictions day-to-day. Several different modelling approaches have been
investigated, including Bayesian hierarchical regression (Farnham and Lall 2015), Bayesian networks
(Avila et al. 2018; Panidhapu et al. 2020), symbolic regression (Jagupilla et al. 2015), neural
networks (Thoe et al. 2014; Zhang et al. 2015), and decision tree methods (Stidson et al. 2012; Thoe
et al. 2014; Brooks et al. 2016) to predict the concentrations of indicator microorganisms in waters.
In many of these studies, several methods were compared, and a common finding was the superior
performance of decision tree type methods, such as random forests (Brooks et al. 2016; Mohammed et al.
2018) or classification trees (Thoe et al. 2014). Decision trees generate predictions based on
splitting at specific variable thresholds. This categorical or fuzzy approach may provide specific
advantages for modelling systems with significant variabilities, such as predictions based on microbial
indicator measures. Since there is no perfect indicator, some level of variability or uncertainty is
inherent, and a fuzzy approach may be well suited to these types of situations. Jagupilla et al. (2015)
demonstrated the use of symbolic regression for water quality modelling to estimate E. coli
concentrations based on river flows. The symbolic regression approach allowed for defining smooth
non-linear functions that best represented observed data and were used to explore the response of
E. coli to variations in river and tributary flows (Jagupilla et al. 2015).

To date, studies analyzing source water quality and influencing factors have not regularly integrated
results with treatment operations. Ultimately the efficiency, treatment needed, and quality of treated
water will depend heavily on source water quality. For example, the growing popularity of risk-based
water management and treatment objectives requires credible and reliable knowledge of source water
quality (e.g. pathogen concentrations) (Hamilton et al. 2018). As such, there is an opportunity to both
inform water treatment operations of short-term water quality events (e.g. turbidity spikes (Delpla
et al. 2019)) and help address long-term impacts on water quality due to land management and climate
change using projections from source water quality models.

3.2 Treatment process modelling

Early data-driven water quality modelling applications focused on predicting effluent or treated water
quality from water treatment facilities. A data-driven approach allows for not explicitly defining the
physics of process models, which can be cumbersome and rely on difficult-to-measure parameters
(Juntunen et al. 2013). Predicting effluent quality during treatment could help reduce time delay in
process feedback loops and help define operational parameters to optimize effluent quality. Through
constructing a model that relates influent quality, an applied treatment, and resulting effluent
quality, adjustments to the applied treatment can be carried out with better knowledge of the ultimate
impacts on effluent quality. Furthermore, there is a significant opportunity to improve process
efficiency and enable automation if effluent quality can be accurately predicted (Fig. 5).
Overtreatment or undertreatment can be avoided by reducing the need for conservative operational
decisions to ensure adherence to effluent quality thresholds. Previous studies have looked to increase
the speed and accuracy of effluent quality prediction and evaluation without considering complex
treatment phenomena. For example, data-driven models can be implemented to improve predictive process
control to ensure chlorine is not underdosed (public health risk) or overdosed (forming excessive DBPs,
taste and odour) during treatment (Wang et al. 2020b).
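
The idea of relating influent quality, applied treatment, and effluent quality can be illustrated with
a minimal sketch. It is not taken from any of the cited studies: the data are synthetic, the linear
model form, the variable names (`influent`, `dose`, `effluent`), the generating coefficients, and the
target value are all assumptions chosen only to show how a fitted model could be queried for the lowest
dose predicted to meet an effluent target.

```python
import numpy as np


def fit_effluent_model(influent, dose, effluent):
    """Least-squares fit of effluent ~ b0 + b1 * influent + b2 * dose."""
    X = np.column_stack([np.ones_like(influent), influent, dose])
    coef, *_ = np.linalg.lstsq(X, effluent, rcond=None)
    return coef


def predict_effluent(coef, influent, dose):
    """Predicted effluent quality for a given influent quality and dose."""
    return coef[0] + coef[1] * influent + coef[2] * dose


def smallest_adequate_dose(coef, influent, target, candidate_doses):
    """Lowest candidate dose whose predicted effluent meets the target, else None."""
    for d in sorted(candidate_doses):
        if predict_effluent(coef, influent, d) <= target:
            return float(d)
    return None


# Synthetic, purely illustrative records (arbitrary units, hypothetical relationship).
rng = np.random.default_rng(0)
influent = rng.uniform(2.0, 20.0, size=200)
dose = rng.uniform(5.0, 40.0, size=200)
effluent = 1.0 + 0.05 * influent - 0.02 * dose + rng.normal(0.0, 0.02, size=200)

coef = fit_effluent_model(influent, dose, effluent)
candidates = np.arange(5.0, 40.5, 0.5)

# Compare the suggested dose for a cleaner and a more challenging source water.
dose_clean = smallest_adequate_dose(coef, influent=5.0, target=1.0, candidate_doses=candidates)
dose_turbid = smallest_adequate_dose(coef, influent=15.0, target=1.0, candidate_doses=candidates)
print(dose_clean, dose_turbid)
```

In this toy setting the suggested dose rises with poorer influent quality, whereas a constant dose
would overtreat the cleaner water or undertreat the more challenging one. A real application would
need a validated, likely non-linear model and operational safety margins.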

Fig. 5 Example of treatment scenarios based on constant applied treatments versus use of models
relating influent quality, treatment applied, and treated water quality. No adjustment of treatment due
to changing source water quality leads to over-treatment and variable drinking water quality.
Automation requires improved modelling methods to strengthen relationships with treatment

Initially, simple and straightforward feedforward ANNs with one hidden layer were used for predicting
several effluent parameters, including colour, turbidity, and performance of filtration, coagulation,
and softening processes (Gagnon et al. 1997; Baxter et al. 1999, 2001; Maier et al. 2004). Simple ANNs
remain a very popular machine learning algorithm for effluent quality prediction (Abba et al. 2020),
likely due to the simplicity of building and training dense neural networks with few hidden layers.
There are several limitations with the historical application of simple ANNs for treatment prediction.
For instance, the structure is not amenable to modelling time-series, and accounting for the variable
delay between effluent water quality and the monitored parameters in the source water or during
treatment is challenging. Furthermore, shallow networks with few hidden layers are limited in their
ability to describe overall mappings through a nested hierarchy of lower-dimensional representations
(Goodfellow et al. 2016).

Chemical coagulation is a complex water treatment process applied to remove suspended particles and
organic matter and has received significant attention from data-driven models (Gagnon et al. 1997;
Maier et al. 2004; Chen and Hou 2006; Wu and Lo 2008; Griffiths and Andrews 2011; Heddam et al. 2012;
Kim and Parnichkun 2017). Many water quality parameters that are highly variable in source waters,
including temperature, turbidity, ionic strength, zeta potential of particulates, conductivity, organic
matter concentrations, and particulate composition, will directly or indirectly impact the process’s
overall efficiency and determine the optimal coagulant dose (Gomes et al. 2015). Most commonly, applied
coagulant doses are set by operator experience and/or bench-scale jar tests that identify optimal
conditions through trial and error (Wu and Lo 2008; Kim and Parnichkun 2017). This approach can be
problematic for responding to rapid source water changes, and it relies on operator expertise to fully
capture the complexities of the coagulation process. Therefore, data-driven methods are used for
coagulation process monitoring to speed up the responses to raw water changes. While several studies
have reported high accuracy for predicting coagulant doses based on water quality indicators, there has
been more limited attention on how to implement the models to inform process decisions. Maier et al.
(2004) and Griffiths and Andrews (2011) described inverse ANNs, where coagulant doses are predicted
based on treated water quality objectives (i.e. settled water). However, these inverse models predict
coagulant doses that were applied (based on jar tests), rather than directly suggesting optimum doses
to control water quality parameters. The indirect use of inverse ANNs in plant operations was reported
by Baxter et al. (2001). More recently, Wang (2016) utilized an SVM to provide real-time assessments of
source water quality that were then utilized to provide feedforward control of alum and ozone doses.

Consideration of time-dependence is a key component of predicting the performance of coagulation, or of
any other process. Accounting for time dependence has
often been accomplished by pre-defined lagged inputs, such as values from yesterday or previous hours
to account for process times (Wu and Lo 2008; Griffiths and Andrews 2011). As such, the models are
highly dependent on time-delay, and pre-specified input lags will impose rigidity on the network. Any
changes to the system that would alter process time-delays (such as flow rate) may invalidate any
generated model. In cases where a lagged target variable (i.e. coagulant dose) is used to predict a
current target variable in systems with relatively slow change rates, significant overfitting may
occur. For example, reported predictions in several studies using lagged inputs appear to replicate
actual data with the same lag (i.e. prediction for today = yesterday’s value) (Wu and Lo 2008; Liu
et al. 2019; Barzegar et al. 2020). The unclear dependence of predictions on previous values further
illustrates the need for interpreting data-driven models and identifying variables most influential on
decisions.

Prediction of disinfection by-product (DBP) formation is a significant area of application for
data-driven and process-based modelling efforts (Chen and Westerhoff 2010; Singh and Gupta 2012). The
formation of DBPs from the reactions between natural organic matter (NOM) and chlorine residuals has
been studied (Sadiq and Rodriguez 2004). DBP formation prediction is an attractive area of application
for data-driven models since DBP concentrations are relatively difficult to measure compared to many
other control parameters, and data collection frequency is low (i.e. months between measurements).
Therefore, it is difficult to change day-to-day operations to limit DBP formation using monitored
values. Furthermore, the reaction between chlorine and the highly chemically variable organic compounds
present in natural waters makes the reaction and the ultimate formation of DBPs complex to predict
(Pifer and Fairey 2012). This chemical diversity of NOM ultimately results in a large number of
possible DBPs that can be formed (Wagner and Plewa 2017). Due to difficulties in measurement methods
providing information on NOM characteristics (Matilainen et al. 2011) and complexities of chlorine
reaction, data-driven approaches have significant theoretical advantages over mechanistic modelling of
DBP formation.

Data-driven approaches, including ANNs and support vector machines (SVMs), have been implemented to
predict regulated and unregulated DBPs with good results. However, data used in assessing these models
are often generated in controlled conditions where contact times, chlorine doses, and water quality
are consistent (Kulkarni and Chellam 2010; Singh and Gupta 2012; Peleato et al. 2018). Ultimately the
concern with DBPs is the levels at consumers’ taps or in the distribution system rather than exiting a
centralized treatment system. Generally, distribution system water quality is difficult to model, due
to large variations in water age, pipe conditions, nonlinear interactions of water quality, and often
rapidly changing hydraulics (May et al. 2008). Previous studies applying data-driven approaches to the
prediction of turbidity in distribution systems have demonstrated that predicting distribution system
water quality with appreciable lead times is challenging and future work in this area is needed (Meyers
et al. 2017). As such, contact time between chlorine and NOM in real distribution systems is
challenging to predict and adds considerable complexity (Kulkarni and Chellam 2010). Recent work has
shown considerably better performance of Radial Basis Function (RBF) ANNs for predicting haloacetic
acids in distribution systems compared to conventional linear or log-linear models (Lin et al. 2020).
Further research is needed to assess data-driven models’ ability to predict DBPs in distribution
systems with consideration for variable contact times and water quality.

In addition to DBP formation predictions in distribution systems, there is significant opportunity in
other applications to predicting distribution system water quality. A common water quality concern in
distribution systems is maintaining adequate disinfectant residuals (e.g. chlorine or chloramines),
while avoiding excessive levels that may result in taste and odour issues or high DBP concentrations
(Cordoba et al. 2014). Chlorine decay and reactions with pipe biofilms, corrosion, and other processes
are difficult to predict by process-based models (Soyupak et al. 2011). As such, data-driven
approaches, such as ANNs, have been applied to predict chlorine levels at specific points in the
distribution system to aid in informing dosing levels at treatment plants or booster stations (May
et al. 2008; D’Souza and Kumar 2010; Soyupak et al. 2011). There is also interest in utilizing data
from distributed sensors in distribution systems to identify deviations from baseline to indicate
anomalous or contamination events (Dogo et al. 2019). Several approaches have been taken to identify
these events, including analysis of ANN residuals followed
by Bayesian sequential analysis (Perelman et al. 2012), SVMs (Oliker and Ostfeld 2014a; Tinelli and
Juran 2019), NN-SVM hybrid models (Zou et al. 2019), SVMs to identify outliers followed by sequence
classification (Oliker and Ostfeld 2014b), and Bayesian Networks (Murray et al. 2012).

3.3 Applications to processing sensor data

Some water quality sensors generate a significant amount of data; however, results can be challenging
to interpret or utilize in subsequent models without preprocessing by data-driven algorithms. Notably,
data-driven modelling of water quality sensor data has focused on the processing of spectroscopic data,
such as ultraviolet absorbance or fluorescence data. Spectroscopic measures can provide several
advantages, including low sample acquisition times, non-destructive analysis, and real-time monitoring
capability. For example, the use of fluorescence spectroscopy has gained significant traction in water
quality assessments and is utilized in source water monitoring and water treatment process monitoring
(Bridgeman et al. 2011; Heibati et al. 2017). However, the measured fluorescence or absorbance spectra
are often high-dimensional (thousands of variables or monitored wavelengths), which can complicate
analysis and the use as model inputs (Murphy et al. 2014). Furthermore, many chromophores or
fluorophores in water absorb and emit light in overlapping wavelength bands, giving rise to issues
separating independent signals (Murphy et al. 2013).

To address these issues, unsupervised processing techniques have been applied, such as parallel factor
analysis (PARAFAC) (Murphy et al. 2013) or principal component analysis (PCA) (Peiris et al. 2010).
Unsupervised or self-supervised dimensionality reduction approaches using neural networks such as
self-organizing maps (SOMs) (Bieroza et al. 2011) or autoencoders (Peleato et al. 2018) have also been
used. These algorithms can provide a reduced representation of dataset characteristics with minimal
loss of essential information. The type of algorithm applied will significantly influence the
representation and subsequent analysis. PARAFAC is often constrained to provide non-negative
representations, and the model is developed with expert guidance to ensure representations conform with
the physical understanding of light absorption and fluorescence emissions, such as the impossibility of
several fluorescence emission peaks for one compound. As such, models based on PARAFAC analysis are
more interpretable, and model outputs can be more easily traced back to specific fluorescence spectra
that resemble real fluorophores (Murphy et al. 2014). This interpretability can be useful for improved
understanding of chemical characteristics, but comes at the cost of increased error or loss of
information moving between the original and reduced dimensionality (Bro 1997). Algorithms such as
PARAFAC or PCA include several assumptions, including tri-linearity of the spectral data, which may not
hold true for all systems (e.g. peak shifts with pH or interactions with other compounds). Furthermore,
PARAFAC assumes that components absorb and emit photons independently of each other, which may be
invalid if processes such as charge-transfer interactions are significant between chromophores
(Sharpless and Blough 2014; McKay et al. 2018).

Coupled with unsupervised dimensionality reduction techniques, fluorescence analysis has seen
significant application in the characterization of organic matter in water. These representations have
been utilized in improved understanding of process changes (Sanchez et al. 2013; Shutova et al. 2014),
prediction of disinfection by-product formation (Pifer and Fairey 2012; Trueman et al. 2016), and
identification of pollution in source waters (Stedmon et al. 2011; Yang et al. 2019). Researchers have
also employed supervised learning algorithms that are well suited for multivariate datasets such as
partial least squares (PLS) regression or the use of ANNs. For example, Bieroza et al. (2011) compared
the use of various dimensionality reduction techniques and regression models to predict TOC removal
during water treatment. The authors identified similar performance with decomposed data (i.e. using
PARAFAC or a SOM) followed by regression by PLS or an ANN. Trueman et al. (2016) investigated several
approaches to identify fluorescence features for the prediction of disinfection by-product formation
potentials and found that boosted regression trees were particularly successful at identifying a
smaller subset of fluorescence features needed to provide an accurate prediction of trihalomethane
formation potentials.

In addition to characterizing NOM, data-driven interpretation of spectroscopic data may also be used to
detect several aqueous pollutants,
including constituents of wastewater (Stedmon et al. 2011), polycyclic aromatic hydrocarbons (PAHs)
(Yang et al. 2019), organic contaminants such as phenol (Yu et al. 2018), and cyanobacterial pigments
(Harris and Graham 2017). Novel modelling approaches such as utilizing Convolutional Neural Networks
coupled with decision trees to interpret NIR data for predicting COD levels have been recently
investigated (Chen et al. 2020a).

4 Challenges, limitations, and opportunities

4.1 Availability of reliable data and small dataset challenges

A pervasive challenge in the application of data-driven or machine learning methods in water quality
modelling is the typically small size of water quality datasets. From other fields, there is an
association of increasing model performance with increasing sample size (Zhang and Ling 2018). In
studies that utilize variables collected via grab samples or laborious analysis techniques, such as
pesticide concentrations (Sahoo et al. 2005) or disinfection by-products (Lin et al. 2020), there is a
significant limit on the number of labelled samples available. With small sample sizes, issues of
overfitting (or loss of generalization) are accentuated. Given the number of trainable parameters in
neural networks, or other data-driven algorithms contrasted with low numbers of high-dimensional data,
the ability to accurately capture real system behaviour is challenging (Raissi et al. 2019). Real-time
data from water treatment operations is increasingly available and can be used to generate large,
high-frequency datasets. However, it should be considered if the increase in dataset size increases
knowledge regarding a specific modelling task. For example, high-frequency data of relatively constant
water temperatures do not necessarily lead to more informed models, despite a large number of samples.
Dataset sizes for select articles identified through Web of Science searches are summarized in Fig. 6.
These articles directly deal with data-driven water quality modelling in drinking water treatment
operations and reported the dataset size used. Small dataset sizes (< 500) are most prevalent; however,
several studies utilized large sample numbers (5,000+). Predomi-

Fig. 6 Distribution of data set sizes used in data-driven modelling of water treatment. Studies
included are those where sample numbers were available in the paper

real-time measures (e.g. chlorine residuals, turbidity, pH) of distribution systems (Murray et al.
2012; Zou et al. 2019) and source waters (Chen et al. 2020b).

Another challenging characteristic of some water quality datasets is the relatively high dimensionality
compared to the number of samples. Increasing dimensionality can be enticing since more measured
variables may describe the system more completely; however, with more dimensions comes the need for
more samples to adequately cover the entire sample space (i.e. the curse of dimensionality) (Bishop
1995). Pre-processing techniques and approaches to model validation are particularly important when
datasets contain a limited number of complete records. Several water quality studies have noted
improved performance from first applying methods to simplify input parameters (e.g. variable selection;
May et al. 2008; Tomperi and Leiviskä 2019), simplifying the problem domain by clustering (Kim and
Parnichkun 2017), or pre-processing through data-driven algorithms such as SVMs or ANNs (Oliker and
Ostfeld 2014b; Zou et al. 2019).

4.2 Data preparation and preprocessing considerations

A model’s performance and the validity of the results can be highly dependent on how the data was
collected and processed prior to learning or prediction. Water quality data is often incomplete, noisy,
and not necessarily collected in a consistent way. Therefore, data preparation steps to interpolate
missing values (if appropriate), correct measurement error, remove outliers, and ensure observed data
is representative of the
nantly large sample sizes were generated from routine system being studied are needed to ensure data-driven

123
1000 Rev Environ Sci Biotechnol (2021) 20:985–1009

approaches are able to identify useful patterns in the data (Zhang et al. 2003).

There are several questions that should be raised when preparing datasets. Beyond the size of the dataset, as discussed previously, applying data-driven models requires thought about overall balance or bias within the data sets (Wang et al. 2020a). Are some specific sample types or classes overrepresented? Have sufficient examples of a characteristic that you want to differentiate been included? Does the dataset represent expected variance in water quality over seasons? To aid in data preparation, dataset exploration and visualization is a crucial step to identify any possible issues, aid in removing outliers, and ensure dataset balance is considered. These considerations are task, data, and objective specific, and it is therefore challenging to define hard-and-fast data preparation rules. Expert judgment should be applied when considering when to remove outliers and how best to prepare the data. For example, care should be taken not to remove outliers simply because they deviate strongly from other observations. The observed outlier may be an accurate observation and should then be considered part of the modelling task. There is a risk in always removing highly deviant samples since this can significantly bias the training data to represent a possible system behaviour and artificially inflate model performance.

Once the data is collected, prepared, and formatted, preprocessing is used to adjust the data for learning or prediction tasks. This step is sometimes neglected or carried out without much thought, but strongly impacts the overall performance of the model and ultimately the value of predictions (García et al. 2015). One of the most common preprocessing steps is data normalization. The attributes or measured values for specific variables may be in scales that are not appropriate for specific algorithms being used. Furthermore, if several variables are measured on different scales, algorithms may incorrectly favour variables with higher variability. Min–max normalization rescales a variable from its original scale to some new specified interval, often [0,1] or [-1,1]. Normalization to [0,1] is common for ANN inputs and often results in better training speed and performance. Mean centering or z-score normalization expresses values relative to the mean of the observed data, with z-scores additionally scaled by the standard deviation. Compared to min–max normalization, this approach can be better suited for situations where outliers (i.e. high max or min values) would strongly skew the normalized data.

4.3 Overfitting, underfitting, and validation of models

Many powerful data-driven approaches can yield impressive results using training data while performing poorly on new or unseen samples. When using only observed data to build a model, it is not reasonable to assume that all system behaviours were captured, and therefore model validation for a specific purpose is a critical step in establishing useful environmental or water quality models (Humphrey et al. 2017).

The overall goal is to produce an accurate and generalized model, or one that has captured true underlying processes and will therefore perform well on previously unobserved inputs (Goodfellow et al. 2016). Poor performance on unseen samples can be referred to as high generalization error and generally describes issues of overfitting. Overfitted models may produce low error on data used in the training process, but fail to accurately predict behaviour in a more general sense. However, while overfitting should be avoided, there is a balance with the objective to also generate a model that can provide accurate predictions. Models can perform equally on training and unseen data (not overfitted), while also having overall poor performance. When the model is not able to capture all the behaviour of the system, it is underfitted and therefore has limited predictive accuracy (van der Aalst et al. 2010).

When assessing a predictive model, it is imperative to separate a test and/or validation subset of samples that are not used during the training process. The general idea is that the training set should be used to develop the model and the test set for reporting model performance. Without this split, or if test-set performance is observed during model development, significant bias can be introduced and it would be difficult to assess any issues with overfitting. Studies that evaluate model performance using only training data present a limited representation of predictive ability or performance. The tendency to memorize training data is often easily detected if unique samples in the test set are used to assess the performance of the model. The expectation is that overfitted models will have large differences between the error using the
training set and the test set. Models that are underfitted will likely show small differences between training and test error; however, they may not represent the system with adequate complexity to provide accurate predictions (Goodfellow et al. 2016). As such, it is important to consider balancing the goal of minimizing training error and the difference between training and test error (overfitting) in order to develop an optimal model (van der Aalst et al. 2010). Figure 7 shows an example history of training and test error over neural network training iterations. It can be seen that prior to a minimum in test data error, the model is considered underfitted (accuracy could be increased). As the number of training iterations or epochs increases, the difference between training error and test error increases, indicating overfitting. Note that training error continues to decline as iterations increase, despite a decrease in generalization or ability to accurately predict unseen data.

Fig. 7 Example of training (gray line) and test set (red line) error during training of a neural network to illustrate underfitting and overfitting. Epoch refers to the number of training iterations

Using only training and test subsets can be cumbersome if any model selection or tuning is required, such as the selection of hyperparameters. One approach is to make a third initial subset, a validation set, along with training and test splits. During model development, changes to model parameters or types of models can be assessed using this validation set without compromising the separated test set. Alternatively, a cross-validation approach can be used for model tuning and selection prior to validation. The most common approach is k-fold cross-validation, where the training dataset is split into K subsets (usually equal in size) (Wang et al. 2020a). The model is trained using a unique combination of K-1 subsets of data and tested on the remaining subset. This will produce K unique models with performance assessed on K subsets of the training data. An average performance from k-fold cross-validation can be the basis to compare models for a specific objective while always making assessments on unseen data. A cross-validation approach is particularly useful in circumstances where there are a limited number of samples, and initial separation of three subsets would compromise the strength of model training (Handelman et al. 2019). Several modifications to k-fold cross-validation can be used, including leaving only one sample out or splitting the dataset into 2 equal parts (García et al. 2015).

When separating the training, validation, and test datasets, it is important to consider how to split the data. The chosen validation and test sets can introduce significant bias into model selection and performance assessment if they are not equally representative of the problem domain (May et al. 2010). Furthermore, it should be considered whether it is desirable for all splits to span the entire domain or whether performance testing should focus on out-of-domain samples (i.e. samples not similar to those used in training). A few general options are available to split the data, such as random splitting, splitting by specific sampling days or periods of time, or more sophisticated splitting techniques that incorporate pre-processing to stratify or organize the samples (May et al. 2010; Zheng et al. 2018). To avoid bias and variance in model performance, DUPLEX splitting methods, where samples are assigned to training/test sets based on maximizing Euclidean distance between them (May et al. 2010), or SOM-based stratified sampling methods have been used (Snee 1977). The splitting method applied will have a significant effect on the performance of data-driven models of environmental systems, as illustrated by Wu et al. (2012) using hydrological models. In all cases, consideration should be given to what biases are introduced by various splitting regimes and to the overall objective of the data-driven model. For instance, creating a test set of time-series data by a random selection of individual datapoints throughout the observed time period is unlikely to present a realistic assessment of the model performance for predicting future time-dependent conditions.
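
The splitting and normalization considerations of Sects. 4.2 and 4.3 can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed studies: all function names (k_fold_indices, chronological_split, fit_min_max, cross_validate) are hypothetical, the least-squares line merely stands in for any data-driven model, and the data are synthetic. The sketch assumes a single input variable and that K does not exceed the number of samples. It shows K-fold cross-validation in which min–max scaling parameters are derived from the training folds only, alongside a chronological hold-out that may be more appropriate than random selection for time-series data.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def chronological_split(n_samples, test_fraction=0.2):
    """Hold out the most recent samples; assumes samples are ordered in time."""
    n_test = max(1, round(n_samples * test_fraction))
    cut = n_samples - n_test
    return list(range(cut)), list(range(cut, n_samples))


def fit_min_max(values):
    """Build a [0, 1] scaler from the given (training) values only."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for a constant variable
    return lambda v: (v - low) / span


def fit_line(xs, ys):
    """Ordinary least squares line, standing in for any data-driven model."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    """Evaluate the fitted line at a (scaled) input value."""
    slope, intercept = model
    return slope * x + intercept


def cross_validate(x, y, k, seed=0):
    """Mean squared error averaged over k folds, each scored on held-out data."""
    folds = k_fold_indices(len(x), k, seed)
    fold_errors = []
    for held_out in folds:
        train = [i for fold in folds if fold is not held_out for i in fold]
        # Scaling parameters come from the training folds only, so no
        # information from the held-out fold leaks into model fitting.
        scale = fit_min_max([x[i] for i in train])
        model = fit_line([scale(x[i]) for i in train], [y[i] for i in train])
        squared = [(predict_line(model, scale(x[i])) - y[i]) ** 2 for i in held_out]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


if __name__ == "__main__":
    x = [float(i) for i in range(20)]
    y = [2.0 * v + 1.0 for v in x]  # synthetic, noise-free data
    print("5-fold CV error:", cross_validate(x, y, k=5))
    print("chronological split:", chronological_split(len(x), 0.2))
```

With real monitoring data, the cross-validated error would typically be used for model selection, while a separate chronological hold-out gives a more realistic estimate of performance on future conditions.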


4.4 Interpretability

It should also be considered that, to assess the utility or validity of a model for a specific purpose, the predictive power on a test set may not be sufficient. Measures of fit or prediction accuracy using appropriate dataset splits ensure models are working well; however, they do not often contribute to scientific understanding or result in improved knowledge of a system (Rosé et al. 2019). In order to build trust in data-driven modelling outputs and ultimately develop deployable models, there needs to be a focus on generating models that provide explanations for, or at least interpretability of, model outputs (Gilpin et al. 2019). As part of model validation, the reproducibility of the trained model and the accuracy of its results should be reported, as well as what underlying physical processes are being represented by the model, or its scientific validity (Biondi et al. 2012; Humphrey et al. 2017). Establishing an assessment of the scientific validity of a model prediction is particularly important if the analysis task is exploratory or aims to further knowledge about some physical process (Biondi et al. 2012). Previous work has demonstrated that sophisticated models with state-of-the-art performance can be easily fooled into misclassifying samples (Szegedy et al. 2014; Nguyen et al. 2015; Guidotti et al. 2019; Gilpin et al. 2019), which emphasizes the need to assess model validity on more than prediction accuracy alone. Many popular data-driven modelling approaches, such as neural networks, are commonly referred to as ‘black-box’ models in recognition of the difficulties explaining the inner workings or the basis for how functions are being approximated (Razavi and Tolson 2011).

Assessment of model validity can be, in part, realized by careful examination of the consistency of model outputs with known relationships or physical laws. Furthermore, several tools have been developed to help explain model behaviour, including developing local models around predictions (Ribeiro et al. 2016), rule-extraction techniques, and sensitivity analysis of changes in model output or activations as the input is manipulated (Gilpin et al. 2019). Sensitivity analysis generally refers to the observation of changes in output due to perturbations in the input for a trained model (Pianosi et al. 2016). There are several uses of sensitivity analysis, including scientific validation (or comparison to known properties of the system), identifying key or important factors that drive changes in output, elucidating conditions where the model is most sensitive to changes, uncertainty analysis, and identifying possible interactions between factors (Razavi and Gupta 2015). For example, mean deviation in model accuracy due to permutations in inputs can be used to estimate variable importance (Tesoriero et al. 2017). Due to the complexity of response surfaces (or how the output varies with changes in variable input), especially when dealing with complex high-dimensional systems, varying approaches to sensitivity analysis on the same model can produce conflicting results (Razavi and Gupta 2015). Clearly defining the objectives of sensitivity analysis and carefully considering the methodology used to generate response surfaces are needed to best utilize this valuable approach for model validation. Several reports on the development of data-driven models for water quality include some form of sensitivity analysis or description of variable importance (Kulkarni and Chellam 2010; Singh and Gupta 2012; Pianosi et al. 2016; Harris and Graham 2017; Mohammed et al. 2018; Huang et al. 2020).

It may be of interest to have sample-specific assessments of the impact of variables on predictions. For example, the Local Interpretable Model-Agnostic Explanations (LIME) approach creates linear models around sample-specific predictions and assesses the impact of variable changes on how the model categorizes any given sample (Ribeiro et al. 2016). Sample-specific assessments may be particularly important to build trust in a model. Not only would model users receive a predicted output, but the weightings or factors that influenced a particular decision would also be available for scrutinizing adherence to physical laws and general expectations. For example, the LIME method was applied (lime library in Python) to classification trees, random forests, and a neural network to predict E. coli greater than or less than 20 CFU/100 mL (previous example). Figure 8 shows differences between how models come to decisions on a specific prediction, despite having overall similar accuracy (decision trees: 71.4%; random forests: 85.7%; neural network: 85.7%). In particular, the neural network model weights the variable ‘day of the week’ (i.e. Monday, Tuesday, etc.) with greater importance. The day of the week that the sample was collected is unlikely to be a variable of importance determining E. coli
levels in a surface water. As such, there is some indication that the neural network has learned a somewhat false representation of system behaviour. This ‘model-agnostic’ method can be applied to any prediction model to help identify models that make decisions based on incorrect premises.

Fig. 8 Variable importance for an example prediction identified by the LIME method. The model being investigated predicts E. coli > or < 20 CFU/100 mL in a natural water. Positive values indicate the variable was positively associated with the prediction, while negative values indicate evidence against the overall prediction

Previously discussed methods aim to identify how trained models come to specific decisions in order to check validity. However, it can also be of interest to select or design models that incorporate biases that help ensure models achieve better generalization and physically realistic solutions (Karniadakis et al. 2021). Physics-informed neural networks can be trained by integrating partial differential equations into loss functions using automatic differentiation to solve a wide range of problems (Raissi et al. 2019; Karniadakis et al. 2021). For example, Raissi et al. (2019) introduced the idea that physics-informed neural networks can be utilized to efficiently solve Navier–Stokes equations for incompressible fluid flow, even with noisy and sparse data, which could be used to aid in estimating pollutant transport. Techniques such as genetic programming to carry out symbolic regression can be used to learn functions based on observed data (Koza 1994). A set of possible mathematical operators and independent variables can be searched and modified to suggest a function that maximizes fit to some dependent variable (Quade et al. 2016). Since the method identifies a mathematical function, the solution is transparent, easily referenced to known physical laws or boundary conditions, and the solution space can be effectively explored. The use of symbolic regression has seen success in fluid flow problems such as drag around particles (El Hasadi and Padding 2019), solar power data and other dynamical systems (Quade et al. 2016), and E. coli concentrations based on river flows (Jagupilla et al. 2015). The interpretability of this approach is demonstrated by Jagupilla et al. (2015) by visualizing the smooth non-linear responses of E. coli to variations in flows in rivers and tributaries in the system they studied.

5 Conclusions and future research directions

There are significant opportunities to improve the management and operation of drinking water treatment systems using data-driven modelling approaches. Water quality is complex to predict, and learning patterns from historical data provides an opportunity to define relationships without specific prior knowledge of the mechanistic processes. Previous studies have illustrated that various algorithms show high prediction accuracy for identifying source water quality changes, effluent treated water quality, and quality within distribution systems. Furthermore, with increasing use of real-time sensors and size of water quality datasets, data-driven applications will continue to grow for water treatment optimization. Based on a review of literature, we suggest a few areas of research focus. Often, data-driven models have not considered the sequential or time-series nature of water quality in source waters or in drinking water systems, and use of sequential-type algorithms is more limited. In work that has considered time-delays, they are often static or pre-determined, adding a degree of rigidity to the model that may not be representative of real conditions. Furthermore, data-driven applications to distribution water quality have illustrated that this is a more challenging but important area of application. Water quality at the edges of the network and at the customer point of use is ultimately the most relevant to public health and customer satisfaction; however, it is often the least known.

There are also several challenges and areas of future research that are needed to ensure that developed models are useful, easily implemented, and produce valid predictions. Beyond prediction accuracy, the value of modelling and data-driven approaches is also predicated on the produced models being explainable. For more direct implementation and stakeholder buy-in of these powerful models, the decision-making process must be validated, and performance assessments purely based on prediction accuracy will be insufficient. Studies that compare several data-driven algorithms for one task show performance between models is often similar, and optimal models are identified based on small absolute changes in accuracy. Ultimately, a singular focus on performance neglects other model features that are needed for proper implementation and use in water treatment optimization. For example, implementation of the results or the model is largely dependent on human decisions and interpretation of the model. While this can be improved by focusing on ensuring exploratory models are being developed, algorithms are not generally formulated to actively recommend changes to impact the real world or test their predictions of optimal values (Dahan et al. 2014). For example, coagulation models based on current operational practices can be inverted to predict a coagulant dose needed to result in a specific treated water quality. However, the model has possibly learned to predict only human-determined optimal values, and the cost and benefit of adjusting coagulant doses should also be considered. As such, it is suggested that future research should focus on developing and applying methodologies that allow for the exploration of data-driven model decision making. Furthermore, although physics-informed and interpretable modelling methods such as symbolic regression have been used successfully in other domains, there has been limited use in water quality monitoring. In addition to techniques that explore already trained models, future research should also consider the use of more interpretable and physics-informed modelling approaches.

Funding Natural Sciences and Engineering Research Council (NSERC) Discovery Grant.

Data availability Canada’s National Long-term Water Quality Monitoring database (open data).

Declarations

Conflict of interest None.

References

Abba SI, Pham QB, Saini G et al (2020) Implementation of data intelligence models coupled with ensemble machine learning for prediction of water quality index. Environ Sci Pollut Res. https://doi.org/10.1007/s11356-020-09689-x
Abbaspour KC, Schulin R, Schläppi E, Flühler H (1996) A Bayesian approach for incorporating uncertainty and data worth in environmental projects. Environ Model Assess 1:151–158. https://doi.org/10.1007/BF01874902
Aggarwal CC (2018) An introduction to neural networks. In: Aggarwal CC (ed) Neural networks and deep learning: a textbook. Springer International Publishing, Cham, pp 1–52
Aghel B, Rezaei A, Mohadesi M (2019) Modeling and prediction of water quality parameters using a hybrid particle swarm optimization–neural fuzzy approach. Int J Environ Sci Technol 16:4823–4832. https://doi.org/10.1007/s13762-018-1896-3
Aguilera PA, Fernández A, Fernández R et al (2011) Bayesian networks in environmental modelling. Environ Model Softw 26:1376–1388. https://doi.org/10.1016/j.envsoft.2011.06.004
Avila R, Horn B, Moriarty E et al (2018) Evaluating statistical model performance in water quality prediction. J Environ Manage 206:910–919. https://doi.org/10.1016/j.jenvman.2017.11.049
Banadkooki FB, Ehteram M, Panahi F et al (2020) Estimation of total dissolved solids (TDS) using new hybrid machine learning models. J Hydrol 587:124989. https://doi.org/10.1016/j.jhydrol.2020.124989
Barzegar R, Aalami MT, Adamowski J (2020) Short-term water quality variable prediction using a hybrid CNN–LSTM deep learning model. Stoch Environ Res Risk Assess 34:415–433. https://doi.org/10.1007/s00477-020-01776-2
Baxter CW, Stanley SJ, Zhang Q (1999) Development of a full-scale artificial neural network model for the removal of natural organic matter by enhanced coagulation. J Water Supply Res Technol AQUA 48:129–136. https://doi.org/10.2166/aqua.1999.0013
Baxter CW, Zhang Q, Stanley SJ et al (2001) Drinking water quality and treatment: the use of artificial neural networks. Can J Civ Eng 28:26–35. https://doi.org/10.1139/l00-053
Bieroza M, Baker A, Bridgeman J (2011) Classification and calibration of organic matter fluorescence data with multiway analysis methods and artificial neural networks: an operational tool for improved drinking water treatment. Environmetrics 22:256–270. https://doi.org/10.1002/env.1045
Bikmukhametov T, Jäschke J (2020) Combining machine learning and process engineering physics towards enhanced accuracy and explainability of data-driven models. Comput Chem Eng 138:106834. https://doi.org/10.1016/j.compchemeng.2020.106834
Biondi D, Freni G, Iacobellis V et al (2012) Validation of hydrological models: conceptual basis, methodological approaches and a proposal for a code of practice. Phys Chem Earth Parts A/B/C 42–44:70–76. https://doi.org/10.1016/j.pce.2011.07.037
Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press
Breiman L (2001) Random Forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
Bridgeman J, Bieroza M, Baker A (2011) The application of fluorescence spectroscopy to organic matter characterisation in drinking water treatment. Rev Environ Sci Biotechnol 10:277. https://doi.org/10.1007/s11157-011-9243-x
Bridgeman J, Jefferson B, Parsons SA (2009) Computational fluid dynamics modelling of flocculation in water treatment: a review. Eng Appl Comput Fluid Mech 3:220–241. https://doi.org/10.1080/19942060.2009.11015267
Bro R (1997) PARAFAC. Tutorial and applications. Chemom Intell Lab Syst 38:149–171. https://doi.org/10.1016/S0169-7439(97)00032-4
Brookes JD, Carey CC, Hamilton DP et al (2014) Emerging challenges for the drinking water industry. Environ Sci Technol 48:2099–2101. https://doi.org/10.1021/es405606t
Brooks W, Corsi S, Fienen M, Carvin R (2016) Predicting recreational water quality advisories: a comparison of statistical methods. Environ Model Softw 76:81–94. https://doi.org/10.1016/j.envsoft.2015.10.012
Burchard-Levine A, Liu S, Vince F et al (2014) A hybrid evolutionary data driven model for river water quality early warning. J Environ Manage 143:8–16. https://doi.org/10.1016/j.jenvman.2014.04.017
Chau K (2006) A review on integration of artificial intelligence into water quality modelling. Mar Pollut Bull 52:726–733. https://doi.org/10.1016/j.marpolbul.2006.04.003
Chen B, Westerhoff P (2010) Predicting disinfection by-product formation potential in water. Water Res 44:3755–3762. https://doi.org/10.1016/j.watres.2010.04.009
Chen C-L, Hou P-L (2006) Fuzzy model identification and control system design for coagulation chemical dosing of potable water. Water Supply 6:97–104. https://doi.org/10.2166/ws.2006.782
Chen H, Chen A, Xu L et al (2020a) A deep learning CNN architecture applied in smart near-infrared analysis of water pollution for agricultural irrigation resources. Agric Water Manag 240:106303. https://doi.org/10.1016/j.agwat.2020.106303
Chen K, Chen H, Zhou C et al (2020b) Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. Water Res 171:115454. https://doi.org/10.1016/j.watres.2019.115454
Cordoba GAC, Tuhovčák L, Tauš M (2014) Using artificial neural network models to assess water quality in water distribution networks. Proc Eng 70:399–408. https://doi.org/10.1016/j.proeng.2014.02.045
Dahan H, Cohen S, Rokach L, Maimon O (2014) Proactive data mining: a general approach and algorithmic framework. In: Dahan H, Cohen S, Rokach L, Maimon O (eds) Proactive Data Mining with Decision Trees. Springer, New York, NY, pp 15–20
De’ath G, Fabricius KE (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81:3178–3192. https://doi.org/10.1890/0012-9658(2000)081[3178:CARTAP]2.0.CO;2
Debnath A, Majumder M, Pal M (2015) A cognitive approach in selection of source for water treatment plant based on climatic impact. Water Resour Manag 29:1907–1919
Delpla I, Florea M, Rodriguez MJ (2019) Drinking water source monitoring using early warning systems based on data mining techniques. Water Resour Manag 33:129
Deng W, Wang G (2017) A novel water quality data analysis framework based on time-series data mining. J Environ Manage 196:365–375. https://doi.org/10.1016/j.jenvman.2017.03.024
Dogo EM, Nwulu NI, Twala B, Aigbavboa C (2019) A survey of machine learning methods applied to anomaly detection on drinking-water quality data. Urban Water Journal 16:235–248. https://doi.org/10.1080/1573062X.2019.1637002
Doshi-Velez F, Kim B (2017) Towards a rigorous science of interpretable machine learning. [cs, stat]
D’Souza CD, Kumar MSM (2010) Comparison of ANN models for predicting water quality in distribution systems. J AWWA 102:92–106. https://doi.org/10.1002/j.1551-8833.2010.tb10152.x
Eggimann S, Mutzner L, Wani O et al (2017) The Potential of knowing more: a review of data-driven urban water management. Environ Sci Technol 51:2538–2553. https://doi.org/10.1021/acs.est.6b04267
El Hasadi YMF, Padding JT (2019) Solving fluid flow problems using semi-supervised symbolic regression on sparse data. AIP Adv 9:115218. https://doi.org/10.1063/1.5116183
Elkiran G, Nourani V, Abba SI, Abdullahi J (2018) Artificial intelligence-based approaches for multi-station modelling of dissolve oxygen in river. GJESM. https://doi.org/10.22034/gjesm.2018.04.005
Ellison AM (2004) Bayesian inference in ecology. Ecol Lett 7:509–520. https://doi.org/10.1111/j.1461-0248.2004.00603.x
Everaert G, Bennetsen E, Goethals PLM (2016) An applicability index for reliable and applicable decision trees in water quality modelling. Eco Inform 32:1–6. https://doi.org/10.1016/j.ecoinf.2015.12.004
Farnham DJ, Lall U (2015) Predictive statistical models linking antecedent meteorological conditions and waterway bacterial contamination in urban waterways. Water Res 76:143–159. https://doi.org/10.1016/j.watres.2015.02.040
Fenton N, Neil M (2012) Risk Assessment and Decision Analysis with Bayesian Networks. CRC Press
Ferretto N, Tedetti M, Guigue C et al (2014) Identification and quantification of known polycyclic aromatic hydrocarbons and pesticides in complex mixtures using fluorescence excitation–emission matrices and parallel factor analysis. Chemosphere 107:344–353. https://doi.org/10.1016/j.chemosphere.2013.12.087
Finlay S (2014) Predictive analytics, data mining and big data: myths, misconceptions and methods. Springer
Flach P (2012) Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press, Cambridge
Gagnon C, Grandjean BPA, Thibault J (1997) Modelling of coagulant dosage in a water treatment plant. Artif Intell Eng 11:401–404. https://doi.org/10.1016/S0954-1810(97)00010-1
García S, Luengo J, Herrera F (2015) Data Preprocessing in Data Mining. Springer International Publishing, Cham
Gilpin LH, Bau D, Yuan BZ, et al (2019) Explaining explana- network models. Environ Model Softw 92:82–106. https://
tions: an overview of interpretability of machine learning. doi.org/10.1016/j.envsoft.2017.01.023
[cs, stat] Jagupilla SCK, Vaccari DA, Miskewitz R et al (2015) Symbolic
Gokgoz E, Subasi A (2015) Comparison of decision tree algo- regression of upstream, stormwater, and tributary E. Coli
rithms for EMG signal classification using DWT. Biomed concentrations using river flows. Water Environ Res 87:26–34.
Signal Process Control 18:138–144. https://doi.org/10. https://doi.org/10.1002/j.1554-7531.2015.tb00138.x
1016/j.bspc.2014.12.005 Jia X, Willard J, Karpatne A et al (2021) Physics-guided
Gomes LS, Souza FAA, Pontes RST et al (2015) Coagulant machine learning for scientific discovery: an application in
dosage determination in a water treatment plant using simulating lake temperature profiles. ACM/IMS Trans
dynamic neural network models. Int J Comp Intel Appl Data Sci 2:1–26. https://doi.org/10.1145/3447814
14:1550013. https://doi.org/10.1142/S1469026815500133 Jin T, Cai S, Jiang D, Liu J (2019) A data-driven model for real-
Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. time water quality prediction and early warning by an
MIT Press integration method. Environ Sci Pollut Res
Griffiths KA, Andrews RC (2011) The application of artificial 26:30374–30385. https://doi.org/10.1007/s11356-019-
neural networks for the optimization of coagulant dosage. 06049-2
Water Supply 11:605–611. https://doi.org/10.2166/ws. Juntunen P, Liukkonen M, Lehtola M, Hiltunen Y (2013)
2011.028 Cluster analysis by self-organizing maps: an application to
Guidotti R, Monreale A, Ruggieri S et al (2019) A survey of the modelling of water quality in a treatment process. Appl
methods for explaining black box models. ACM Comput Soft Comput J 13:3191–3196. https://doi.org/10.1016/j.
Surv 51:1–42. https://doi.org/10.1145/3236009 asoc.2013.01.027
Guo D, Lintern A, Webb JA et al (2019) Key factors affecting Juntunen P, Liukkonen M, Pelo M et al. (2012) Modelling of
temporal variability in stream water quality. Water Resour Water Quality: an application to a water treatment process.
Res 55:112–129. https://doi.org/10.1029/2018WR023370 In: Applied Computational Intelligence and Soft Comput-
Hamilton KA, Waso M, Reyneke B et al (2018) Cryptosporid- ing. https://www.hindawi.com/journals/acisc/2012/
ium and Giardia in wastewater and surface water envi- 846321/. Accessed 15 Sep 2020
ronments. J Environ Qual 47:1006–1023. https://doi.org/ Kabir G, Tesfamariam S, Francisque A, Sadiq R (2015) Eval-
10.2134/jeq2018.04.0132 uating risk of water mains failure using a Bayesian belief
Handelman GS, Kok HK, Chandra RV et al (2019) Peering into network model. Eur J Oper Res 240:220–234. https://doi.
the black box of artificial intelligence: evaluation metrics org/10.1016/j.ejor.2014.06.033
of machine learning methods. Am J Roentgenol Karniadakis GE, Kevrekidis IG, Lu L et al (2021) Physics-in-
212:38–43. https://doi.org/10.2214/AJR.18.20224 formed machine learning. Nat Rev Phys 3:422–440. https://
Harris J, Tzafestas SG, Chen CS, et al (eds) (2006) Comments doi.org/10.1038/s42254-021-00314-5
and definitions. In: Fuzzy Logic Applications in Engi- Keskin TE, Düğenci M, Kaçaroğlu F (2015) Prediction of water
neering Science. Springer Netherlands, Dordrecht, pp 1–10 pollution sources using artificial neural networks in the
Harris TD, Graham JL (2017) Predicting cyanobacterial abun- study areas of Sivas, Karabük and Bartın (Turkey). Environ
dance, microcystin, and geosmin in a eutrophic drinking- Earth Sci 73:5333–5347. https://doi.org/10.1007/s12665-
water reservoir using a 14-year dataset. Lake Reser Man- 014-3784-6
age 33:32–48. https://doi.org/10.1080/10402381.2016. Khataee AR, Kasiri MB (2011) Modeling of biological water
1263694 and wastewater treatment processes using artificial neural
Heddam S, Bermad A, Dechemi N (2012) ANFIS-based mod- networks. Clean: Soil, Air, Water 39:742–749. https://doi.
elling for coagulant dosage in drinking water treatment org/10.1002/clen.201000234
plant: a case study. Environ Monit Assess 184:1953–1971. Kim CM, Parnichkun M (2017) Prediction of settled water
https://doi.org/10.1007/s10661-011-2091-x turbidity and optimal coagulant dosage in drinking water
Heibati M, Stedmon CA, Stenroth K et al (2017) Assessment of treatment plant using a hybrid model of k-means clustering
drinking water quality at the tap using fluorescence spec- and adaptive neuro-fuzzy inference system. Appl Water
troscopy. Water Res 125:1–10. https://doi.org/10.1016/j. Sci 7:3885–3902. https://doi.org/10.1007/s13201-017-
watres.2017.08.020 0541-5
Hey T (2009) The Fourth Paradigm: Data-Intensive Scientific Kotsiantis SB (2007) Supervised machine learning: a review of
Discovery, 1st Edition. Microsoft Research, Redmond, classification techniques. Informatica 249–268
Washington JohnR K (1994) Genetic programming as a means for pro-
Hosseini-Asl E, Zurada JM, Nasraoui O (2016) Deep learning of gramming computers by natural selection. Stat Comput.
part-based representation of data using sparse autoencoders https://doi.org/10.1007/BF00175355
with nonnegativity constraints. IEEE Trans Neural Netw Krzywinski M, Altman N (2017) Classification and regression
Learn Syst 27:2486–2498. https://doi.org/10.1109/ trees. Nat Methods 14:757–758. https://doi.org/10.1038/
TNNLS.2015.2479223 nmeth.4370
Huang J, Zhang Y, Arhonditsis GB et al (2020) The magnitude Kulkarni P, Chellam S (2010) Disinfection by-product forma-
and drivers of harmful algal blooms in China’s lakes and tion following chlorination of drinking water: artificial
reservoirs: a national-scale characterization. Water Res neural network models and changes in speciation with
181:115902. https://doi.org/10.1016/j.watres.2020.115902 treatment. Sci Total Environ 408:4202–4210. https://doi.
Humphrey GB, Maier HR, Wu W et al (2017) Improved vali- org/10.1016/j.scitotenv.2010.05.040
dation framework and R-package for artificial neural

123
Rev Environ Sci Biotechnol (2021) 20:985–1009 1007

Lee S, Lee D (2018) Improved prediction of harmful algal Neural Netw 23:283–294. https://doi.org/10.1016/j.neunet.
blooms in four major south Korea’s rivers using deep 2009.11.009
learning models. Int J Environ Res Public Health 15:1322. McKay G, Korak JA, Erickson PR et al (2018) The case against
https://doi.org/10.3390/ijerph15071322 charge transfer interactions in dissolved organic matter
Li J, Liu H, Li Y et al (2013) Monitoring and modeling dissolved photophysics. Environ Sci Technol 52:406–414. https://
oxygen dynamics through continuous longitudinal sam- doi.org/10.1021/acs.est.7b03589
pling: a case study in wen-rui tang river, wenzhou, china. Mei K, Liao L, Zhu Y et al (2014) Evaluation of spatial-tem-
Hydrol Process 27:3502–3510. https://doi.org/10.1002/ poral variations and trends in surface water quality across a
hyp.9459 rural-suburban-urban interface. Environ Sci Pollut Res
Li R, Zou Z, An Y (2016) Water quality assessment in Qu River 21:8036–8051. https://doi.org/10.1007/s11356-014-2716-
based on fuzzy water pollution index method. J Environ Sci z
50:87–92. https://doi.org/10.1016/j.jes.2016.03.030 Meyers G, Kapelan Z, Keedwell E (2017) Short-term forecast-
Li Z, Peleato NM (2021) Comparison of dimensionality ing of turbidity in trunk main networks. Water Res
reduction techniques for cross-source transfer of fluores- 124:67–76. https://doi.org/10.1016/j.watres.2017.07.035
cence contaminant detection models. Chemosphere. Mohammed H, Hameed IA, Seidu R (2018) Comparative pre-
https://doi.org/10.1016/j.chemosphere.2021.130064 dictive modelling of the occurrence of faecal indicator
Lin H, Dai Q, Zheng L et al (2020) Radial basis function arti- bacteria in a drinking water source in Norway. Sci Total
ficial neural network able to accurately predict disinfection Environ 628–629:1178–1190. https://doi.org/10.1016/j.
by-product levels in tap water: taking haloacetic acids as a scitotenv.2018.02.140
case study. Chemosphere 248:125999. https://doi.org/10. Mohri M, Rostamizadeh A, Talwalkar A (2018) Foundations of
1016/j.chemosphere.2020.125999 machine learning, 2nd edn. MIT Press
Liu P, Wang J, Sangaiah AK et al (2019) Analysis and predic- Montáns FJ, Chinesta F, Gómez-Bombarelli R, Kutz JN (2019)
tion of water quality using LSTM deep neural networks in Data-driven modeling and learning in science and engi-
IoT environment. Sustainability 11:2058. https://doi.org/ neering. Comptes Rendus Mécanique 347:845–855.
10.3390/su11072058 https://doi.org/10.1016/j.crme.2019.11.009
Maier HR, Dandy GC (2000) Neural networks for the prediction Mulia IE, Tay H, Roopsekhar K, Tkalich P (2013) Hybrid
and forecasting of water resources variables: a review of ANN–GA model for predicting turbidity and chlorophyll-a
modelling issues and applications. Environ Model Softw concentrations. J Hydro-Environ Res 7:279–299. https://
15:101–124. https://doi.org/10.1016/S1364- doi.org/10.1016/j.jher.2013.04.003
8152(99)00007-9 Murphy KP (2012) Machine learning: a probabilistic perspec-
Maier HR, Dandy GC (1996) The use of artificial neural net- tive, Illustrated. The MIT Press, Cambridge, MA
works for the prediction of water quality parameters. Water Murphy KR, Bro R, Stedmon CA (2014) Chemometric analysis
Resour Res 32:1013–1022. https://doi.org/10.1029/ of organic matter fluorescence. In: Coble P, Lead J, Baker
96WR03529 A et al (eds) Aquatic Organic Matter Fluorescence. Cam-
Maier HR, Jain A, Dandy GC, Sudheer KP (2010) Methods used bridge University Press, Cambridge, pp 339–375
for the development of neural networks for the prediction Murphy KR, Stedmon CA, Graeber D, Bro R (2013) Fluores-
of water resource variables in river systems: current status cence spectroscopy and multi-way techniques. Parafac
and future directions. Environ Model Softw 25:891–909. Anal Methods 5:6557–6566. https://doi.org/10.1039/
https://doi.org/10.1016/j.envsoft.2010.02.003 C3AY41160E
Maier HR, Morgan N, Chow CWK (2004) Use of artificial Murray S, Ghazali M, McBean EA (2012) Real-time water
neural networks for predicting optimal alum doses and quality monitoring: assessment of multisensor data using
treated water quality parameters. Environ Model Softw Bayesian belief networks. J Water Resour Plan Manag
19:485–494. https://doi.org/10.1016/S1364- 138:63–70. https://doi.org/10.1061/(ASCE)WR.1943-
8152(03)00163-4 5452.0000163
Marton I, Sánchez AI, Carlos S, Martorell S (2013) Application Nguyen A, Yosinski J, Clune J (2015) Deep neural networks are
of data driven methods for condition monitoring mainte- easily fooled: high confidence predictions for unrecogniz-
nance. Chem Eng Trans 33:301–306. https://doi.org/10. able images. pp 427–436
3303/CET1333051 Oliker N, Ostfeld A (2014a) Comparison of two multivariate
Matilainen A, Gjessing ET, Lahtinen T et al (2011) An overview classification models for contamination event detection in
of the methods used in the characterisation of natural water quality time series. J Water Supply Res Technol
organic matter (NOM) in relation to drinking water treat- AQUA 64:558–566. https://doi.org/10.2166/aqua.2014.
ment. Chemosphere 83:1431–1442. https://doi.org/10. 033
1016/j.chemosphere.2011.01.018 Oliker N, Ostfeld A (2014b) A coupled classification – evolu-
May RJ, Dandy GC, Maier HR, Nixon JB (2008) Application of tionary optimization model for contamination event
partial mutual information variable selection to ANN detection in water distribution systems. Water Res
forecasting of water quality in water distribution systems. 51:234–245. https://doi.org/10.1016/j.watres.2013.10.060
Environ Model Softw 23:1289–1299. https://doi.org/10. O’Reilly G, Bezuidenhout CC, Bezuidenhout JJ (2018) Artifi-
1016/j.envsoft.2008.03.008 cial neural networks: applications in the drinking water
May RJ, Maier HR, Dandy GC (2010) Data splitting for artificial sector. Water Supply 18:1869–1887. https://doi.org/10.
neural networks using SOM-based stratified sampling. 2166/ws.2018.016

123
1008 Rev Environ Sci Biotechnol (2021) 20:985–1009

Panidhapu A, Li Z, Aliashrafi A, Peleato NM (2020) Integration Rojas R (1996) The Backpropagation Algorithm. Neural Net-
of weather conditions for predicting microbial water works. Springer, Berlin Heidelberg, Berlin, Heidelberg,
quality using Bayesian Belief Networks. Water Res pp 149–182
170:115349. https://doi.org/10.1016/j.watres.2019.115349 Rokach L, Maimon O (2015) Data mining with decision trees:
Peiris RH, Hallé C, Budman H et al (2010) Identifying fouling theory and applications, 2nd edn. World Scientific, Hack-
events in a membrane-based drinking water treatment ensack, New Jersey
process using principal component analysis of fluorescence Rosé CP, McLaughlin EA, Liu R, Koedinger KR (2019)
excitation-emission matrices. Water Res 44:185–194. Explanatory learner models: why machine learning (alone)
https://doi.org/10.1016/j.watres.2009.09.036 is not the answer. Br J Edu Technol 50:2943–2958. https://
Peleato NM, Legge RL, Andrews RC (2018) Neural networks doi.org/10.1111/bjet.12858
for dimensionality reduction of fluorescence spectra and Ross AS, Hughes MC, Doshi-Velez F (2017) Right for the right
prediction of drinking water disinfection by-products. reasons: training differentiable models by constraining
Water Res 136:84–94. https://doi.org/10.1016/j.watres. their explanations. [cs, stat]
2018.02.052 Sadiq R, Rodriguez MJ (2004) Disinfection by-products (DBPs)
Perelman L, Arad J, Housh M, Ostfeld A (2012) Event detection in drinking water and predictive models for their occur-
in water distribution systems from multivariate water rence: a review. Sci Total Environ 321:21–46. https://doi.
quality time series. Environ Sci Technol 46:8212–8219. org/10.1016/j.scitotenv.2003.05.001
https://doi.org/10.1021/es3014024 Sahoo GB, Ray C, Wade HF (2005) Pesticide prediction in
Pianosi F, Beven K, Freer J et al (2016) Sensitivity analysis of ground water in North Carolina domestic wells using
environmental models: a systematic review with practical artificial neural networks. Ecol Model 183:29–46. https://
workflow. Environ Model Softw 79:214–232. https://doi. doi.org/10.1016/j.ecolmodel.2004.07.021
org/10.1016/j.envsoft.2016.02.008 Sanchez NP, Skeriotis AT, Miller CM (2013) Assessment of
Pifer AD, Fairey JL (2012) Improving on SUVA254 using flu- dissolved organic matter fluorescence PARAFAC com-
orescence-PARAFAC analysis and asymmetric flow-field ponents before and after coagulation–filtration in a full
flow fractionation for assessing disinfection byproduct scale water treatment plant. Water Res 47:1679–1690.
formation and control. Water Res 46:2927–2936. https:// https://doi.org/10.1016/j.watres.2012.12.032
doi.org/10.1016/j.watres.2012.03.002 Sharpless CM, Blough NV (2014) The importance of charge-
Pu F, Ding C, Chao Z et al (2019) Water-quality classification of transfer interactions in determining chromophoric dis-
inland lakes using landsat8 images by convolutional neural solved organic matter (CDOM) optical and photochemical
networks. Remote Sens 11:1674. https://doi.org/10.3390/ properties. Environ Sci Process Impacts 16:654–671.
rs11141674 https://doi.org/10.1039/C3EM00573A
Qi Y (2012) Random Forest for Bioinformatics. In: Zhang C, Shutova Y, Baker A, Bridgeman J, Henderson RK (2014)
Ma Y (eds) Ensemble Machine Learning: Methods and Spectroscopic characterisation of dissolved organic matter
Applications. Springer, US, Boston, MA, pp 307–323 changes in drinking water treatment: from PARAFAC
Qin SJ, Chiang LH (2019) Advances and opportunities in analysis to online monitoring wavelengths. Water Res
machine learning for process data analytics. Comput Chem 54:159–169. https://doi.org/10.1016/j.watres.2014.01.053
Eng 126:465–473. https://doi.org/10.1016/j. Singh KP, Gupta S (2012) Artificial intelligence based modeling
compchemeng.2019.04.003 for predicting the disinfection by-products in water. Che-
Quade M, Abel M, Shafi K et al (2016) Prediction of dynamical mom Intell Lab Syst 114:122–131. https://doi.org/10.1016/
systems by symbolic regression. Phys Rev E 94:012214. j.chemolab.2012.03.014
https://doi.org/10.1103/PhysRevE.94.012214 Snee RD (1977) Validation of regression models: methods and
Raissi M, Perdikaris P, Karniadakis GE (2019) Physics-in- examples. Null 19:415–428
formed neural networks: a deep learning framework for Soyupak S, Kilic H, Karadirek IE, Muhammetoglu H (2011) On
solving forward and inverse problems involving nonlinear the usage of artificial neural networks in chlorine control
partial differential equations. J Comput Phys 378:686–707. applications for water distribution networks with high
https://doi.org/10.1016/j.jcp.2018.10.045 quality water. J Water Supply Res Technol AQUA
Razavi S, Gupta HV (2015) What do we mean by sensitivity 60:51–60. https://doi.org/10.2166/aqua.2011.086
analysis? The need for comprehensive characterization of Stedmon CA, Seredyńska-Sobecka B, Boe-Hansen R et al
‘‘global’’ sensitivity in Earth and Environmental systems (2011) A potential approach for monitoring drinking water
models. Water Resour Res 51:3070–3092. https://doi.org/ quality from groundwater systems using organic matter
10.1002/2014WR016527 fluorescence as an early warning for contamination events.
Razavi S, Tolson BA (2011) A new formulation for feedforward Water Res 45:6030–6038. https://doi.org/10.1016/j.watres.
neural networks. IEEE Trans Neural Netw 22:1588–1598. 2011.08.066
https://doi.org/10.1109/TNN.2011.2163169 Stidson RT, Gray CA, McPhail CD (2012) Development and use
Reckhow KH (1999) Water quality prediction and probability of modelling techniques for real-time bathing water quality
network models. 56:9 predictions. Water Environ J 26:7–18. https://doi.org/10.
Ribeiro MT, Singh S, Guestrin C (2016) ‘‘Why Should I Trust 1111/j.1747-6593.2011.00258.x
You?’’: Explaining the Predictions of Any Classifier. [cs, Szegedy C, Zaremba W, Sutskever I et al. (2014) Intriguing
stat] properties of neural networks. [cs]
Tesoriero AJ, Gronberg JA, Juckem PF et al (2017) Predicting
redox-sensitive contaminant concentrations in

123
Rev Environ Sci Biotechnol (2021) 20:985–1009 1009

groundwater using random forest classification. Water water quality pollutants. Sci Total Environ 693:133440.
Resour Res 53:7316–7331. https://doi.org/10.1002/ https://doi.org/10.1016/j.scitotenv.2019.07.246
2016WR020197 Wang Y, Zhou J, Chen K et al. (2017) Water quality prediction
Thoe W, Gold M, Griesbach A et al (2014) Predicting water method based on LSTM neural network. In: 2017 12th
quality at Santa Monica Beach: evaluation of five different international conference on intelligent systems and
models for public notification of unsafe swimming condi- knowledge engineering (ISKE). pp 1–5
tions. Water Res 67:105–117. https://doi.org/10.1016/j. Wikle CK (2003) Hierarchical models in environmental science.
watres.2014.09.001 Int Stat Rev 71:181–199. https://doi.org/10.1111/j.1751-
Tinelli S, Juran I (2019) Artificial intelligence-based monitoring 5823.2003.tb00192.x
system of water quality parameters for early detection of Wu G-D, Lo S-L (2008) Predicting real-time coagulant dosage
non-specific bio-contamination in water distribution sys- in water treatment by artificial neural networks and adap-
tems. Water Supply 19:1785–1792. https://doi.org/10. tive network-based fuzzy inference system. Eng Appl Artif
2166/ws.2019.057 Intell 21:1189–1195. https://doi.org/10.1016/j.engappai.
Tomperi J, Leiviskä K (2019) Utilizing variable selection 2008.03.015
methods in modelling potable water quality. Water Supply Wu W, May R, Dandy GC, Maier HR (2012) A method for
19:1187–1194. https://doi.org/10.2166/ws.2018.173 comparing data splitting approaches for developing
Trueman BF, MacIsaac SA, Stoddart AK, Gagnon GA (2016) hydrological ANN models. International Congress on
Prediction of disinfection by-product formation in drinking Environmental Modelling and Software 394
water via fluorescence spectroscopy. Environ Sci Water Yang YZ, Peleato NM, Legge RL, Andrews RC (2019) Fluo-
Res Technol 2:383–389. https://doi.org/10.1039/ rescence excitation emission matrices for rapid detection of
C5EW00285K polycyclic aromatic hydrocarbons and pesticides in surface
Tyralis H, Papacharalampous G, Langousis A (2019) A brief waters. Environ Sci Water Res Technol 5:315–324. https://
review of random forests for water scientists and practi- doi.org/10.1039/C8EW00821C
tioners and their recent history in water resources. Water Yu Q, Yin H, Wang K et al (2018) Adaptive detection method
11:910. https://doi.org/10.3390/w11050910 for organic contamination events in water distribution
Uusitalo L (2007) Advantages and challenges of Bayesian net- systems using the UV-Vis spectrum based on semi-super-
works in environmental modelling. Ecol Model vised learning. Water 10:1566. https://doi.org/10.3390/
203:312–318. https://doi.org/10.1016/j.ecolmodel.2006. w10111566
11.033 Zhang S, Zhang C, Yang Q (2003) Data preparation for data
van der Aalst WMP, Rubin V, Verbeek HMW et al (2010) mining. Appl Artif Intell 17:375–381. https://doi.org/10.
Process mining: a two-step approach to balance between 1080/713827180
underfitting and overfitting. Softw Syst Model 9:87–111. Zhang Y, Ling C (2018) A strategy to apply machine learning to
https://doi.org/10.1007/s10270-008-0106-z small datasets in materials science. Npj Comput Mater
Wagner ED, Plewa MJ (2017) CHO cell cytotoxicity and 4:1–8
genotoxicity analyses of disinfection by-products: an Zhang Z, Deng Z, Rusch KA (2015) Modeling fecal coliform
updated review. J Environ Sci 58:64–76. https://doi.org/10. bacteria levels at gulf coast beaches. Water Qual Expo
1016/j.jes.2017.04.021 Health 7:255–263. https://doi.org/10.1007/s12403-014-
Wan R, Cai S, Li H et al (2014) Inferring land use and land cover 0145-3
impact on stream water quality using a Bayesian hierar- Zheng F, Maier HR, Wu W et al (2018) On lack of robustness in
chical modeling approach in the Xitiaoxi River Watershed, hydrological model development due to absence of
China. J Environ Manage 133:1–11. https://doi.org/10. guidelines for selecting calibration and evaluation data:
1016/j.jenvman.2013.11.035 demonstration for data-driven models. Water Resour Res
Wang AY-T, Murdock RJ, Kauwe SK et al (2020a) Machine 54:1013–1030. https://doi.org/10.1002/2017WR021470
learning for materials scientists: an introductory guide Zhou J, Wang Y, Xiao F et al (2018) Water quality prediction
toward best practices. Chem Mater 32:4954–4965. https:// method based on IGRA and LSTM. Water 10:1148. https://
doi.org/10.1021/acs.chemmater.0c01907 doi.org/10.3390/w10091148
Wang D (2016) Research on raw water quality assessment ori- Zou X-Y, Lin Y-L, Xu B et al (2019) A novel event detection
ented to drinking water treatment based on the SVM model for water distribution systems based on data-driven
model. Water Supply 16:746–755. https://doi.org/10.2166/ estimation and support vector machine classification.
ws.2015.186 Water Resour Manage 33:4569–4581. https://doi.org/10.
Wang D, Shen J, Zhu S, Jiang G (2020b) Model predictive 1007/s11269-019-02317-5
control for chlorine dosing of drinking water treatment
based on support vector machine model. DWT
Publisher’s Note Springer Nature remains neutral with
173:133–141. https://doi.org/10.5004/dwt.2020.24144
regard to jurisdictional claims in published maps and
Wang P, Yao J, Wang G et al (2019) Exploring the application
institutional affiliations.
of artificial intelligence technology for identification of
water pollution characteristics and tracing the source of

123

You might also like