Article

Application of machine learning in modeling the relationship between catchment attributes and instream water quality in data scarce regions

Miljan Kovačević 1,*, Bahman Jabbarian Amiri 2, Silva Lozančić 3, Marijana Hadzima-Nyarko 3, Dorin Radu 4, Emmanuel Karlo Nyarko 5

1 Faculty of Technical Sciences, University of Pristina, Knjaza Milosa 7, 38220 Kosovska Mitrovica, Serbia; miljan.kovacevic@pr.ac.rs (M.K.)
2 Faculty of Economics and Sociology, Department of Regional Economics and the Environment, 3/5 P.O.W. Street, 90-255 Lodz, Poland; bahman.amiri@uni.lodz.pl (B.J.A.)
3 Faculty of Civil Engineering and Architecture Osijek, Josip Juraj Strossmayer University of Osijek, Vladimira Preloga 3, 31000 Osijek, Croatia; lozancic@gfos.hr (S.L.); mhadzima@gfos.hr (M.H.-N.)
4 Faculty of Civil Engineering, Transilvania University of Brașov, Romania; dorin.radu@unitbv.ro (D.R.)
5 Faculty of Electrical Engineering, Computer Science and Information Technology, Josip Juraj Strossmayer University of Osijek, Kneza Trpimira 2B, 31000 Osijek, Croatia; karlo.nyarko@ferit.hr (E.K.N.)

* Correspondence: miljan.kovacevic@pr.ac.rs (M.K.); Tel.: +381 606173801.

Abstract: In order to manage water quality, it is necessary to determine the relationship between the physical attributes of the catchment, such as geological permeability and hydrologic soil groups, and in-stream water quality parameters. The paper gives relations for 11 in-stream water quality parameters (SAR, Na+, Mg2+, Ca2+, SO42−, Cl−, HCO3−, K+, pH, EC, and TDS) using machine learning methods. For each parameter, RMSE, MAE, R, and MAPE accuracy criteria were used to assess the accuracy of each model, and the optimal model was defined. Furthermore, the most significant influencing variables were determined when modeling each parameter. Research has shown that with methods based on artificial intelligence, a quick and sufficiently accurate water quality assessment is possible, depending on the attributes of the watershed.

Keywords: machine learning; water quality; land use; land cover; hydrologic soil groups; geological permeability


31 1. Introduction
32 River water quality plays a crucial role in ensuring the sustainability and health of
33 freshwater ecosystems. Traditional monitoring methods often have spatial and temporal
34 coverage limitations, leading to difficulties in effectively assessing and managing water
35 quality [1]. However, recent developments in machine learning techniques have
36 indicated the potential to predict water quality accurately based on catchment
37 characteristics [1, 2]. One of the underlying reasons for the growing interest in applying
38 machine learning techniques to predict river water quality is the ability to
39 simultaneously consider a wide range of catchment characteristics. These characteristics
40 include various features that affect water quality, including land use, soil properties,
41 climate data, topography, and hydrological characteristics [3]. These variables interact in
42 complex ways, and their associations may not easily be distinguished using prevalent
43 analytical methods. A more comprehensive understanding of the factors influencing
44 water quality can be achieved by applying machine learning algorithms, which can
45 disclose hidden patterns and capture nonlinear relationships in large and diverse
46 datasets [4].
47 The performance of supervised machine learning algorithms has been proved by
48 recent studies in predicting river water quality [5]. These algorithms are trained on
49 historical water quality data, in line with associating catchment characteristics, to
50 explore patterns and relationships between them [6]. Researchers have been able to
51 develop accurate predictive models by applying algorithms such as decision trees,
52 random forests, support vector machines, and neural networks [7-9]. Water quality
53 parameters, such as nutrient concentrations, pollutant levels, and biological indicators,
54 can then be applied by these models to predict using catchment characteristics [10]. Such
55 predictions can help determine potential pollution points, prioritise management
56 actions, and support decision-making processes for water resource management.
57 In addition to catchment characteristics, integrating remote sensing data has
58 emerged as a valuable tool by which the accuracy of water quality predictions can
59 significantly be enhanced [11, 12] because the spatially-explicit information about land
60 cover/land use, vegetation, and surface characteristics can be provided by remote
61 sensing techniques, which include satellite imagery and aerial photographs. Researchers
can improve the predictive performance of machine learning models by combining remote
63 sensing data with catchment characteristics [13]. For example, satellite data can provide
64 insight into vegetation dynamics, land use/land cover changes, and the extent of
65 impervious surfaces, which encompass urban and semi-urban areas, all of which can
66 influence water quality. Integrating these additional spatial data sources into the
67 machine learning models can result in more accurate and spatially explicit predictions,
68 allowing a more comprehensive assessment of water quality dynamics across large river
69 basins.

70 Furthermore, applying machine learning techniques to predict river water quality


71 has led to collaborative efforts between researchers and stakeholders [14, 15], which can
72 be crucial to facilitate achieving the objectives of water resources management. These
73 efforts aim to develop standardised frameworks and models that can be applied across
74 different catchments and regions. Data sharing, methodologies, and best practices can
75 collectively improve the accuracy and reliability of predictive models which researchers
76 develop. Collaborative initiatives also facilitate the identification of common challenges,
77 such as data availability and quality issues, and foster the development of innovative
78 solutions.
79 Advances in machine learning applications to predict river water quality
80 underscore the growing importance of this field. Researchers are moving towards a
81 more sustainable and informed water resource management by integrating advanced
82 machine learning algorithms with catchment characteristics and remote sensing data.
83 These predictive models can support policymakers, water resource managers and
84 environmental authorities in making evidence-based decisions, implementing targeted
85 pollution control measures, and maintaining the ecological integrity of river ecosystems.
86 Integrating machine learning techniques with catchment characteristics gives
87 researchers, water resource engineers, planners, and managers’ immense potential to
88 predict river water quality. These models can provide accurate and timely information
89 to support water resource management, pollution mitigation efforts, and the
90 preservation of freshwater ecosystems by leveraging the power of advanced algorithms
91 and incorporating diverse environmental data sources [16]. The advancements made in
92 this field highlight the growing significance of machine learning in addressing the
93 challenges associated with water quality prediction and paving the way for a more
94 sustainable and informed management of our precious water resources.
95 Although machine learning applications in predicting river water quality based on
96 catchment characteristics have shown promising findings, several gaps in this research
field warrant further investigation. These include, but are not limited to, 1) data availability and quality, 2) incorporation of temporal dynamics, 3) uncertainty estimation, 4) across-catchment transferability, 5) integration of socio-economic factors, and 6) interpretability and transparency, which are briefly addressed as follows:
101 Data availability and quality, particularly historical water quality data and
102 comprehensive catchment characteristics data remain challenges in many regions.
103 Limited data may lead to biased or incomplete models, limiting the accuracy and
104 generality of predictions. Efforts should be made to improve data collection methods,
105 establish standardized data protocols, and improve data sharing among researchers and
106 stakeholders.
107 Current machine learning models often ignore the temporal dimension and assume
108 static relationships between catchment characteristics and water quality parameters [17].
109 While integrating temporal dynamics into machine learning models could increase their

110 predictive capacity and allow for more accurate short-term and long-term water quality
111 forecasts, various temporal factors, such as seasonal variations, climate change, and
112 short-term events such as precipitation events or pollution incidents, affect river water
113 quality [18].
114 Machine learning models typically provide point predictions, but quantifying and
115 communicating uncertainties associated with these predictions is crucial for decision-
116 making and risk assessment [19]. Developing methods to quantify and propagate
117 uncertainties through the modelling process, considering sources of uncertainty such as
118 data quality, model structure, and parameter estimation, would enhance the reliability
119 and applicability of predictive models.
120 In addition, models developed for a particular catchment cannot be applied directly
121 to another due to variations in catchment characteristics, land use/cover, soil, geology
and climatic conditions. The development of transferable models that can account for catchment-specific variation while capturing general patterns would be valuable for the management of water resources on a larger scale [20]. On the other hand, human activities, such as agricultural practices, urbanisation and industrial activities, have a significant impact on water quality. Incorporating socio-
127 economic factors into machine learning models can improve their predictive power and
128 enable more comprehensive water quality assessments. However, the integration of
129 socio-economic data and understanding the complex interactions between human
130 activity and water quality present challenges that must be addressed.
It should be noted that the lack of interpretability and transparency in machine learning models can limit the adoption and acceptance of these models by policymakers and stakeholders [21]. The development of interpretable machine learning techniques that provide insights into the model decision-making process and highlight the most influential catchment characteristics would improve the reliability and usability of predictive
136 models.
137 Addressing these gaps requires interdisciplinary collaborations among
138 hydrologists, ecologists, data scientists, policymakers, and other relevant stakeholders.
139 Furthermore, focussing on data-driven approaches, data-sharing initiatives, and
140 advances in computational methods will be critical to advancing the field and
141 harnessing the full potential of machine learning in predicting river water quality based
on catchment characteristics. In this study, we intend to address how spatial variations in the characteristics of the catchment can be used to explain river water quality
144 using machine learning techniques.
145 This paper explores the relationship between catchment attributes and in-stream
146 water quality parameters using machine learning methods. It evaluates model accuracy
147 with RMSE, MAE, R, and MAPE, identifies optimal models for 11 parameters, and
148 determines significant influencing variables. The study highlights the potential of

149 artificial intelligence for quick and accurate water quality assessment, tailored to
150 watershed attributes.

151 2. Materials and Methods


152 To establish relationships between the catchment attributes and water quality
153 parameters, machine learning methods were employed. Multiple algorithms, such as
154 regression trees, treebagger, random forests, and Gaussian process regression models,
155 were applied to construct predictive models for each water quality parameter. The
156 models were trained using historical water quality data and catchment attribute
157 information.
158
159 2.1. Regression trees models
160 The fundamental concept behind regression trees is to partition the input space into
161 distinct regions and assign predictive values to these regions. This segmentation enables
162 the model to make predictions based on the most relevant conditions and characteristics
163 of the data.
164 A regression tree (RT) is a simple and comprehensible machine learning model
165 applicable to both regression and classification problems. It follows a tree-like structure
166 composed of nodes and branches. Each node corresponds to a specific condition related
167 to the input data, and this condition is evaluated at each node as the data progresses
168 through the tree.
169 To predict an outcome for a given input, the starting point is the root node of the
170 tree (Figure 1). Here, the initial condition associated with the input feature(s) is
171 considered. Depending on whether this condition is deemed true or false, the branches
172 are followed to reach the next node. This process is repeated recursively until a leaf node
173 is arrived at. At the leaf node, a value is found, which serves as the predicted result for
174 the input instance. For regression tasks, this value is typically a numeric prediction.
175 As the tree is traversed, the input space undergoes changes. Initially, all instances
176 are part of a single set represented by the root node. However, as the algorithm
177 progresses, the input space is gradually divided into smaller subsets. These divisions are
178 based on conditions that help in tailoring predictions to different regions within the
179 input space.

180
181
182 Figure 1. The partitioning of input space into distinct regions and the representation of a 3D
183 regression surface within a regression tree [22].

184 The process of constructing regression trees involves determining the optimal split
185 variable ("j") and split point ("s") to partition the input space effectively. These variables
186 are chosen by minimizing a specific expression (Equation 1) that considers all input
187 features. The goal is to minimize the sum of squared differences between observed
188 values and predicted values in resulting regions [23 -25].

\min_{j,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i-c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i-c_2)^2\right]    (1)
190 Once "j" and "s" are identified, the tree-building process continues by iteratively
191 dividing regions. This process is referred to as a "greedy approach" because it prioritizes
192 local optimality at each step. The binary recursive segmentation approach divides the
193 input space into non-overlapping regions characterized by their mean values.
194 The depth of a regression tree serves as a critical factor in preventing overfitting
195 (too much detail) or underfitting (too simplistic).
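To make Equation (1) concrete, the following sketch (an illustrative Python implementation on hypothetical data, not the MATLAB code used in this study) performs the greedy search for the split variable j and split point s that minimize the total squared error of the two resulting regions.

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the split (j, s) that minimizes Equation (1).

    X : (n_samples, n_features) array of predictors
    y : (n_samples,) array of targets
    Returns (j, s, sse) for the best axis-aligned split.
    """
    best = (None, None, np.inf)
    n, p = X.shape
    for j in range(p):                      # candidate split variable
        for s in np.unique(X[:, j]):        # candidate split point
            left = y[X[:, j] <= s]
            right = y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # sum of squared deviations from the region means c1 and c2
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = 2.0 * X[:, 2] + rng.normal(scale=0.1, size=30)
print(best_split(X, y))   # expected to select feature index 2
```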
196
197 2.2. Ensembles of regression trees: Bagging, Random Forest, and Boosted Trees
198 Bagging is another ensemble method that involves creating multiple subsets of the
199 training dataset through random sampling with replacement (bootstrap samples). Each
200 subset is used to train a separate regression tree model, and their predictions are
201 aggregated to make the final prediction (Figure 2). Bagging helps reduce variance by

202 averaging predictions from multiple models, making it particularly effective when the
203 base models are unstable or prone to overfitting [23-25].
204 In Bagging, multiple training sets are generated by repeatedly selecting samples
205 from the original dataset, and this process involves sampling with replacement. This
206 technique is utilized to create diverse subsets of data. The primary goal is to reduce the
207 variance in the model's predictions by aggregating the results from these different
208 subsets. Consequently, each subset contributes to the final prediction, and the averaging
209 of multiple models enhances the model's robustness and predictive accuracy.
210

211

212
213 Figure 2. Creating regression tree ensembles using bagging approach [26].

214 Random Forests, a variant of Bagging, stands out by introducing diversity among
215 the constituent models within the ensemble. This diversity is achieved through the
216 creation of multiple regression trees, each trained on a distinct bootstrap sample from
217 the data. Moreover, before making decisions at each split within these trees, only a
218 randomly selected subset of available features is considered. This approach helps in
219 decorrelating the individual trees within the ensemble, thereby further reducing
220 variance. The ensemble's final prediction is generated by aggregating the predictions
221 from these decorrelated trees, resulting in a robust and high-performing model [22–25].
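As an illustration of the bagging and random forest ideas described above, the sketch below (a Python/scikit-learn analogue of the MATLAB TreeBagger setup used in the study, run on hypothetical data) shows how restricting the number of randomly selected splitting variables turns bagging into a random forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data standing in for the catchment-attribute matrix (16 predictors).
rng = np.random.default_rng(1)
X = rng.normal(size=(88, 16))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=88)

# Bagging is the limiting case max_features=16 (all variables considered at each split);
# a random forest decorrelates the trees by restricting max_features (e.g. sqrt(16) = 4).
bagging_like = RandomForestRegressor(n_estimators=500, max_features=16,
                                     min_samples_leaf=1, oob_score=True, random_state=0)
random_forest = RandomForestRegressor(n_estimators=500, max_features=4,
                                      min_samples_leaf=1, oob_score=True, random_state=0)
for name, model in [("bagging", bagging_like), ("random forest", random_forest)]:
    model.fit(X, y)
    print(name, "OOB R^2:", round(model.oob_score_, 3))
```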

222 Boosting Trees is a sequential training method where the addition of each new
223 regression tree to the ensemble is aimed at enhancing the overall model's performance
224 (Figure 3). In the context of Gradient Boosting, a widely utilized Boosting technique,
225 submodels are introduced iteratively. These submodels are selected based on their
226 ability to best estimate the residuals or errors of the previous model in the sequence. As
227 this iterative process continues, a definitive ensemble model emerges, composed of
228 multiple submodels that collectively provide highly accurate predictions.
229

230

231
232 Figure 3. The application of gradient boosting within regression tree ensembles [26].

233 The fundamental concept is rooted in gradient-based optimization techniques,


234 which involve refining the current solution to an optimization problem by incorporating
235 a vector that is directly linked to the negative gradient of the function under
236 minimization, as referenced in previous works [27-29]. This approach is logical because a
237 negative gradient signifies the direction in which the function decreases. When it is
238 applied a quadratic error function, each subsequent model aims to correct the
239 discrepancies left by its predecessors, essentially reinforcing and improving the model
240 with a focus on the residual errors from earlier stages.
241 In the context of gradient-boosting trees, the learning rate is a crucial
242 hyperparameter that controls the contribution of each tree in the ensemble to the final
243 prediction. It's often denoted as "lambda" (λ). The learning rate determines how quickly
244 or slowly the model adapts to the errors from the previous trees during the boosting
245 process. A lower learning rate means that the model adjusts more gradually and may
246 require a larger number of trees to achieve the same level of accuracy, while a larger
247 learning rate leads to faster adaptation but may risk overfitting with too few trees.
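A minimal sketch of the role of the learning rate λ, again using scikit-learn on hypothetical data rather than the study's MATLAB implementation, compares several λ values at a fixed ensemble size.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(88, 16))
y = X[:, 0] - X[:, 3] + rng.normal(scale=0.2, size=88)

# Smaller learning rates need more trees; larger ones adapt faster but can overfit.
for lam in [0.001, 0.01, 0.1, 0.5, 1.0]:
    model = GradientBoostingRegressor(n_estimators=100, learning_rate=lam,
                                      max_depth=2, random_state=0)
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"lambda={lam}: CV MSE={mse:.3f}")
```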
248
249 2.3. Gaussian process regression (GPR)
250
251 Gaussian processes provide a probabilistic framework for modeling functions,
252 capturing uncertainties, and making predictions in regression tasks. The choice of

253 covariance functions and hyperparameters allows for flexibility in modeling


254 relationships among variables [30].
255 Gaussian process modeling involves estimating an unknown function f(∙) in
256 nonlinear regression problems. It assumes that this function follows a Gaussian
257 distribution characterized by a mean function μ(∙) and a covariance function k(∙,∙). The
258 covariance matrix K is a fundamental component of GPR and is determined by the
259 kernel function (k).
260 The kernel function (k) plays a pivotal role in capturing the relationships between
261 input data points (x and x'). This function is essential for quantifying the covariance or
262 similarity between random values f(x) and f(x'). One of the most widely used kernels is
263 defined by the following expression:
k(x, x') = \sigma^2 \exp\left(-\frac{(x-x')^2}{2l^2}\right)    (2)
266 In this expression, several elements are critical:
σ² represents the signal variance, a model parameter that quantifies the overall variability or magnitude of the function values.
269 The exponential function "exp" is used to model the similarity between x and x'. It
270 decreases as the difference between x and x' increases, capturing the idea that values
271 close to each other are more strongly correlated.
272 The parameter l, known as the lengthscale, is another model parameter. It controls
273 the smoothness and spatial extent of the correlation. A smaller l results in more rapid
274 changes in the function, while a larger l leads to smoother variations.
The observations in a dataset y = {y_1, ..., y_n} can be viewed as a sample from a multivariate Gaussian distribution,

(y_1, \ldots, y_n)^T \sim N(\mu, K),    (3)
Gaussian processes are employed to model the relationship between input variables x and target variable y, considering the presence of additive noise ε ~ N(0, σ²). The goal is to estimate the unknown function f(∙). The observations y are treated as a sample from a multivariate Gaussian distribution with mean vector μ and covariance matrix K. This distribution captures the relationships among the data points. The conditional distribution of a test point's response value y*, given the observed data y = (y_1, ..., y_n)^T, is represented as N(ŷ*, σ̂*²) with:

\hat{y}^* = \mu(x^*) + K^{*T} K^{-1} (y - \mu),    (4)
\hat{\sigma}^{*2} = K^{**} + \sigma^2 - K^{*T} K^{-1} K^*.    (5)
In traditional GPR, a single length-scale parameter (l) and signal variance (σ²) are used for all input dimensions. In contrast, the ARD (automatic relevance determination) approach employs a separate length-scale parameter (l_i) for each input dimension, where 'i' represents a specific dimension.
289 This means that for a dataset with 'm' input dimensions, you have 'm' individual length-
290 scale parameters [30].
291 The key advantage of ARD is that it automatically determines the relevance of each
292 input dimension in the modeling process. By allowing each dimension to have its own
293 length scale parameter, the model can assign different degrees of importance to each
294 dimension. This means that the model can adapt and focus more on dimensions that are
295 more relevant to the target variable and be less influenced by less relevant dimensions.
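The following sketch illustrates the ARD idea with a scikit-learn Gaussian process on hypothetical data (an assumption of this illustration, not the authors' implementation): giving the squared-exponential kernel one length scale per input dimension lets the fitted length scales indicate which dimensions the model treats as relevant.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(3)
X = rng.normal(size=(88, 5))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=88)  # dims 2-4 irrelevant

# ARD: one length scale per input dimension (squared-exponential / RBF kernel).
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(X.shape[1])) \
         + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Large fitted length scales mark dimensions the model treats as irrelevant.
print(gpr.kernel_.k1.k2.length_scale)
mean, std = gpr.predict(X[:5], return_std=True)   # predictive mean and uncertainty
print(mean.round(2), std.round(2))
```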

296 3. Case study of Caspian Sea catchment area


297 This study took place in the Caspian Sea catchment area (Figure 4) in Northern Iran,
298 covering approximately 618 square kilometers with coordinates ranging from 49°48′to
299 54°41′ longitude and 35°36′ to 37°19′latitude. The majority of this area, approximately
300 65.10%, is forested, while the rest consists of rangelands (24.41%), agricultural land
301 (9.41%), urban areas (0.88%), water bodies (0.0126%), and bare land (0.186%) [31].
302 Initially, 108 water quality monitoring stations scattered across the Caspian Sea
303 catchment were selected for analysis. To define the upstream catchment boundaries,
304 digital elevation models (DEMs) with a resolution of 30 meters by 30 meters from the
305 USGS database were used, with boundary refinement achieved through a user digitizing
306 technique. Macro-sized catchments, those exceeding 1000 square kilometers, totaling 18
307 catchments, were excluded from the modeling process due to their significant impact on
308 hydrological dynamics.
309 Water quality data, including parameters like SAR, Na+, Mg2+, Ca2+, SO42−, Cl−,
310 HCO3−, K+, pH, EC, and TDS, were obtained from the Iran Water Resource Management
311 Company (WRMC) through monthly sampling. Collection adhered to the WRMC
312 Guidelines for Surface Water Quality Monitoring (2009) and EPA-841-B-97-003
313 standards [31]. For statistical analysis, the 5-year means (1998–2002) of water quality
314 data were calculated. After scrutinizing for normality and identifying outliers, 88 final
315 stations were used in the study. The geographic scope of the study area is illustrated in
316 Figure 4.

317

318 Figure 4. The study region and catchment areas situated within the southern Caspian Sea basin.

319 A land cover dataset was created using a 2002 digital land cover map (Scale
320 1:250,000) from the Forest, Ranges, and Watershed Management Organization of Iran
321 (http://frw.org.ir). The original land cover categories were consolidated into six classes:
322 bare land, water bodies, urban areas, agriculture, rangeland, and forests, following [32]
323 land use and land cover classification systems. Furthermore, digital geological and soil
324 feature maps (1:250,000 scale) were obtained from the Geological Survey of Iran
325 (www.gsi.ir). Detailed information about the characteristics of the catchments and their
326 statistical attributes can be found in Table 1 and Table 2.
327 In this study, hydrologic soil groups and geological permeability classes were
328 developed and applied in conjunction with land cover types within the modeling
329 process. Hydrologic soil groups are influenced by runoff potential and can be used to
330 determine runoff curve numbers. They consist of four classes (A, B, C, and D), with A

331 having the highest runoff potential and D the lowest. Notably, soil profiles can undergo
332 significant alterations due to changes in land cover. In such cases, the soil textures of the
333 new surface soil can be employed to determine the hydrologic soil groups as described
334 in Table 1 [34]. Furthermore, the study incorporates the application of geological
335 permeability attributes related to catchments, with the development of three geological
336 permeability classes: Low, Medium, and High. These classes are associated with various
337 characteristics of geological formations, such as effective porosity, cavity type and size,
338 their connectivity, rock density, pressure gradient, and fluid properties like viscosity.
339
340

341 Table 1. Statistical properties of input variables used for modelling.

Input parameter (acronym) min max average std


Hydrometric Station Elevation (HSE) -23.0000 2360.0000 163.3333 363.3326
Catchment Area (CA) 22.0300 6328.2800 422.7320 1085.4846
Stream Order (SO) 1.0000 4.0000 2.6275 1.3261
Percentage of Land Use or Land Cover Types:
Barren Land (BL) 0.0000 3.1825 0.1246 0.6083
Forest (F) 1.1805 100.0000 70.0401 29.5955
Rangeland (RL) 0.0000 90.3170 17.0144 23.9666
Urban Area (UA) 0.0000 20.2095 1.1367 3.5475
Water Body (WB) 0.0000 0.3567 0.0074 0.0499
Agricultural Area (AA) 0.0000 84.3857 11.6768 20.3896
Hydrological Soil Group:
A - Sand, loamy sand, or sandy loam (HSGA) 0.0000 79.3654 8.0039 16.2217
B - Silt loam or loam (HSGB) 0.0000 48.4653 3.0354 8.4226
C - Sandy clay loam (HSGC) 12.9196 100.0000 80.4068 27.0174
D - Clay loam, silty clay loam, sandy clay, silty clay,
or clay (HSGD) 0.0000 56.4129 8.5539 15.9743
Geological Permeability:
Low (Geological hydrological group M – GHGM) 0.0143 100.0000 69.2656 28.0000
Average (Geological hydrological group N –
GHGN) 0.0000 96.9436 23.4102 24.3102
High (Geological hydrological group T – GHGT) 0.0000 90.9015 7.3243 15.3979
342

343 Table 2. Statistical properties of output variables used for modelling.



Parameter min max average std


SAR 7.1500 9.0900 7.5318 0.3976
Na+ 0.1200 15.8400 0.9978 2.3957
Mg2+ 0.4100 4.4100 1.0331 0.6790
Ca2+ 1.0600 5.8800 2.3584 0.9521
SO42− 0.2100 4.4500 0.6643 0.8449
Cl− 0.1900 18.2000 1.1861 2.7131
HCO3− 1.3500 4.0900 2.5978 0.7729
pH 172.0500 2879.9700 453.7716 428.3365
EC 108.8900 3892.8200 375.2543 579.1442
TDS 0.1000 3.4400 0.4750 0.5971
K+ 0.0200 0.1400 0.0447 0.0316
344
345 The range and statistical properties of training and test data play a fundamental
346 role in the development and evaluation of machine learning models. They impact the
347 model's generalization, robustness, fairness, and ability to perform effectively in diverse
348 real-world scenarios. Statistical properties of input and output data are given in Table 1
349 and Table 2.
350 The machine learning methods used in this paper were assessed using five-fold
351 cross-validation. In this approach, the dataset was randomly divided into five subsets,
352 with four of them dedicated to training the model and the remaining subset utilized for
353 model validation (testing). This five-fold cross-validation process was repeated five
354 times, ensuring that each subset was used exactly once for validation. Subsequently, the
355 results from these five repetitions were averaged to produce a single estimation.
356 All models were trained and tested under identical conditions, ensuring a fair and
357 consistent evaluation of their performance. This practice is essential in machine learning
358 to provide a level playing field for comparing different algorithms and models.
359 When machine learning models are trained and tested under equal conditions, it
360 means that they are exposed to the same datasets, preprocessing steps, and evaluation
361 metrics.
362 The quality of the model was assessed using several evaluation and performance
363 measures, which include the root mean square error (RMSE), mean absolute error
364 (MAE), Pearson's Linear Correlation Coefficient (R), and mean absolute percentage error
365 (MAPE).
366 The RMSE criterion, expressed in the same units as the target values, serves as a
367 measure of the model's general accuracy. It is calculated as the square root of the
368 average squared differences between the actual values (d k ) and the model's predictions (
369 o k ) across the training samples (N).


RMSE = \sqrt{\frac{1}{N}\sum_{k=1}^{N}(d_k - o_k)^2},    (6)
371 The MAE criterion represents the mean absolute error of the model, emphasizing
372 the absolute accuracy. It calculates the average absolute differences between the actual
373 values and the model's predictions.
MAE = \frac{1}{N}\sum_{k=1}^{N}|d_k - o_k|.    (7)
375 Pearson's linear correlation coefficient R provides a relative measure of accuracy
376 assessment. It considers the correlation between the actual values ( d k ) and the model's
predictions (o_k) relative to their respective means (d̄ and ō). Values of R greater than
378 0.75 indicate a strong correlation between the variables.

R = \frac{\sum_{k=1}^{N}(d_k - \bar{d})(o_k - \bar{o})}{\sqrt{\sum_{k=1}^{N}(d_k - \bar{d})^2 \sum_{k=1}^{N}(o_k - \bar{o})^2}}.    (8)
380 The mean absolute percentage error (MAPE) is a relative criterion that evaluates
381 accuracy by calculating the average percentage differences between the actual values
382 and the model's predictions.

MAPE = \frac{100}{N}\sum_{k=1}^{N}\left|\frac{d_k - o_k}{d_k}\right|.    (9)
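The four criteria and the five-fold cross-validation scheme described above can be summarized in the following sketch (illustrative Python with hypothetical data; the study itself was carried out in MATLAB).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def rmse(d, o):  return np.sqrt(np.mean((d - o) ** 2))          # Eq. (6)
def mae(d, o):   return np.mean(np.abs(d - o))                  # Eq. (7)
def r(d, o):     return np.corrcoef(d, o)[0, 1]                 # Eq. (8)
def mape(d, o):  return 100.0 * np.mean(np.abs((d - o) / d))    # Eq. (9)

rng = np.random.default_rng(4)
X = rng.normal(size=(88, 16))
y = 2 + X[:, 0] + rng.normal(scale=0.3, size=88)

scores = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X[train], y[train])
    o, d = model.predict(X[test]), y[test]
    scores.append([rmse(d, o), mae(d, o), mape(d, o), r(d, o)])

print(np.mean(scores, axis=0).round(4))   # RMSE, MAE, MAPE, R averaged over the five folds
```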

383 4. Results
384 The paper analyzes the application of regression trees, bagging, gradient boosting
385 and Gaussian random process models using a systemic approach (Figure 5). For each of
386 the models, the hyperparameters of the model were varied in the appropriate range. The
387 following values were analyzed:
388
389  Regression trees model
390
391 The depth of a regression tree is a crucial factor in preventing overfitting, which
392 occurs when the tree becomes too detailed and fits the training data too closely, as well
393 as underfitting, which happens when the tree is too simplistic and fails to capture the
394 underlying patterns. Table 3 illustrates the impact of the minimum leaf size on the
395 accuracy of the regression tree (RT) model. It shows how changing the minimum leaf

396 size affects key performance metrics such as RMSE (Root Mean Square Error), MAE
(Mean Absolute Error), MAPE (Mean Absolute Percentage Error), and R (Pearson's correlation coefficient). Accordingly, the leaf size is almost positively associated with the RMSE
399 but inversely correlated with R values.
400
401  TreeBagger (TB) model
402
403 1. Number of generated trees (B): Investigated values up to a maximum of 500
404 trees, with 100 as the standard setting in Matlab. The bootstrap aggregation method was
405 employed, generating a specific number of samples in each iteration. The study
406 considered an upper limit of 500 trees.
407 2. Minimum number of data/samples per tree leaf: Analyzed values ranged
408 from 2 to 15 samples per leaf, with a step size of 1 sample. The standard setting in
409 Matlab is 5 samples per leaf for regression, but here, a broader range was examined to
410 assess its impact on model generalization.
411
412  Random Forest (RF) model:
413
414 1. Number of generated trees (B): Analyzed within a range of 100 to 500 trees,
415 with 100 as the standard setting in Matlab. Cumulative MSE values for all base models
416 in the ensemble were presented. Bootstrap aggregation was used to create trees,
417 generating 181 samples per iteration. The study explored an extended ensemble of up to
418 500 regression trees, aligning with recommended Random Forest practices.
419 2. Number of variables used for tree splitting: Based on guidance, the study
420 selected a subset of approximately √ p predictors for branching, where p is the number
of input variables. With 16 predictors, this translated to a subset of 4 variables, but in this research a wider range of values, from 1 to 16, was investigated.
423 3. Minimum number of samples per leaf: The study considered values from 2 to
424 10 samples per tree leaf, with a 1-sample increment.
425  Boosted trees model:
426 1. Number of generated trees (B): Analyzed within a range of 1 to 100 trees.
427 2. Learning rate (λ): Explored a range, including 0.001, 0.01, 0.1, 0.5, 0.75, and 1.0.
428 3. Tree splitting levels (d): Analyzed from 1 (a decision stump) to 2^7=128 in an
429 exponential manner.
430
431  Gaussian process regression (GPR) model
432
With the GPR method, the application of different kernel functions was analyzed:
1. Exponential, squared exponential, Matérn 3/2, Matérn 5/2, rational quadratic
2. ARD exponential, ARD squared exponential, ARD Matérn 3/2, ARD Matérn 5/2, ARD rational quadratic
All models were first compared in terms of mean square error, and the optimal model among all those analysed was then evaluated on the test data using the RMSE, MAE, MAPE and R criteria.
The paper illustrates the detailed procedure for determining the optimal model for the prediction of the SAR parameter, while for all other parameters the procedure is given in a more concise form. Accompanying results for the parameters other than SAR can be found in the appendix of the paper.

444

445 Figure 5. Applied methodology for creating prediction models.

446

447 4.1 Prediction of SAR parameter values:


• Regression tree models
449
450 Table 3 illustrates the impact of the minimum leaf size on the accuracy of the
regression tree (RT) model. It shows how changing the minimum leaf size affects key performance metrics such as RMSE (Root Mean Square Error), MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error), and R (Pearson's correlation coefficient). In this particular case, it was found that models with less complexity, i.e., with the number of data per terminal leaf equal to 10, have higher accuracy (Figure 6).
456

Table 3. Influence of minimum leaf size on regression tree (RT) model accuracy.

Min leaf size RMSE MAE MAPE R


1 0.5646 0.2753 0.5948 0.4363
2 0.6077 0.2923 0.6322 0.3226
3 0.6096 0.2875 0.6311 0.2909
4 0.5984 0.2914 0.6188 0.3186
5 0.5894 0.2931 0.6728 0.3621
6 0.5833 0.2925 0.6616 0.3593
7 0.5813 0.2949 0.6968 0.3542
8 0.5841 0.3095 0.7603 0.3226
9 0.5821 0.2990 0.7078 0.3432
10 0.5969 0.3083 0.7369 0.2906
457
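A sweep of this kind could be reproduced, for example, with the following sketch (scikit-learn on hypothetical data, not the study's actual dataset or MATLAB code), which evaluates a single regression tree for minimum leaf sizes 1 to 10 under five-fold cross-validation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(5)
X = rng.normal(size=(88, 16))
y = X[:, 0] + rng.normal(scale=0.3, size=88)

for leaf in range(1, 11):                       # minimum leaf sizes 1..10, as in Table 3
    tree = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0)
    pred = cross_val_predict(tree, X, y, cv=5)  # out-of-fold predictions
    rmse = np.sqrt(np.mean((y - pred) ** 2))
    r = np.corrcoef(y, pred)[0, 1]
    print(f"min leaf {leaf:2d}: RMSE={rmse:.4f}, R={r:.4f}")
```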
458  TreeBagger models and random forest models
459
460 Application of TB and RF models was analyzed simultaneously (Figure 7). The
461 figure shows the dependence of the achieved accuracy of the model on the
hyperparameter values. The TB model represents the borderline case of the RF model in which all variables are considered as candidate splitting variables.
464

465 Figure 6. Optimal individual model for SAR parameter prediction based on regression tree.

466 Among the optimal models in this group, the random forest models with 500
generated trees proved to be the best. The model that uses a subset of 8 variables and has a minimum number of data per terminal leaf equal to 1 has higher accuracy according to the RMSE and R criteria, while the model that uses a subset of six variables and has a minimum number of data per terminal leaf equal to 1 has higher accuracy according to the MAE and MAPE criteria (Table 4).
472

473 Table 4. Accuracy of obtained models for SAR parameter prediction according to defined criteria.

Criteria RMSE MAE MAPE R


RF 1 (var 8, leaf 1) 0.3668 0.2328 0.5679 0.8236
RF 2 (var 6, leaf 1) 0.3696 0.2244 0.5379 0.8012


474 Figure 7. Comparison of different accuracy criteria for RF model for SAR parameter as a function of the number of randomly
475 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.

476
477 With the BT model (Figure 8), it was shown that the highest accuracy is obtained by
478 applying complex models with a large number of branches. The optimal obtained model
had a structure of 32 branches and a learning rate (λ) value equal to 0.01.
480

481 Figure 8. Dependence of the MSE value on the reduction parameter λ and the number of trees (base models) in the Boosted Trees
model for the SAR parameter.

• GPR models


The optimal values for the parameters of the applied models with different kernel functions were obtained using the marginal likelihood (Table 5 and Table 6). The marginal likelihood is a function influenced by the observed data (y(X)) and the model parameters {l, σ², η²}. The determination of model parameters is achieved through the maximization of this function. Importantly, when the marginal likelihood is transformed by taking the logarithm, identical results are achieved as when optimizing the original likelihood. Therefore, model parameter optimization is typically carried out by employing gradient-based procedures on the log marginal likelihood expression, simplifying the optimization process without altering the final outcomes.
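As a sketch of this kernel-selection step (scikit-learn analogues of the listed kernels on hypothetical data; the exponential kernel is expressed here as a Matérn kernel with ν = 1/2), the candidate covariance functions can be compared through the log marginal likelihood obtained after fitting, which is what the fit itself maximizes.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (Matern, RBF, RationalQuadratic,
                                              ConstantKernel, WhiteKernel)

rng = np.random.default_rng(6)
X = rng.normal(size=(88, 16))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=88)

candidates = {
    "exponential (Matern 1/2)": Matern(nu=0.5),
    "squared exponential":      RBF(),
    "Matern 3/2":               Matern(nu=1.5),
    "Matern 5/2":               Matern(nu=2.5),
    "rational quadratic":       RationalQuadratic(),
}
for name, base in candidates.items():
    kernel = ConstantKernel(1.0) * base + WhiteKernel(0.1)
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
    # Hyperparameters are set by maximizing the log marginal likelihood during fit().
    print(f"{name:26s} log marginal likelihood = {gpr.log_marginal_likelihood_value_:.2f}")
```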

Table 5. Values of optimal parameters in GPR models with different covariance functions.

GP model              Covariance function                                                     Parameters
Exponential           k(x_i, x_j | Θ) = σ_f² exp(−r / (2σ_l²))                                σ_l = 111.9371, σ_f = 1.8001
Squared exponential   k(x_i, x_j | Θ) = σ_f² exp(−(x_i − x_j)ᵀ(x_i − x_j) / (2σ_l²))          σ_l = 8.3178, σ_f = 1.2040
Matérn 3/2            k(x_i, x_j | Θ) = σ_f² (1 + √3·r/σ_l) exp(−√3·r/σ_l)                    σ_l = 14.7080, σ_f = 1.4023
Matérn 5/2            k(x_i, x_j | Θ) = σ_f² (1 + √5·r/σ_l + 5r²/(3σ_l²)) exp(−√5·r/σ_l)      σ_l = 9.9890, σ_f = 1.1947
Rational quadratic    k(x_i, x_j | Θ) = σ_f² (1 + r²/(2ασ_l²))^(−α)                           σ_l = 8.3178, α = 3156603.8854, σ_f = 1.2040
where r = √((x_i − x_j)ᵀ(x_i − x_j)).

Table 6. Values of optimal parameters in GPR ARD models with different covariance functions.

ARD covariance function                                                                        Parameters
ARD exponential: k(x_i, x_j | Θ) = σ_f² exp(−r)                                               length scales σ_m: 119.8634, 300.3391, 158.5921, 109.1181, 121.9050, 114.0696, 245.9937, 29.7051, 15.3385, 89.5438, 6.9058
ARD squared exponential: k(x_i, x_j | Θ) = σ_f² exp(−(1/2) Σ_{m=1}^{d} (x_im − x_jm)²/σ_m²)   σ_f = 1.0577; length scales σ_m: 14.6207, 15.1518, 32.6135, 4.9648, 2.3307, 1.1875, 3.4365
ARD Matérn 3/2: k(x_i, x_j | Θ) = σ_f² (1 + √3·r) exp(−√3·r)                                  σ_f = 1.3295; length scales σ_m: 131.4068, 32.1227, 28.2837, 62.5967, 9.2294, 9.6482, 4.7958, 4.3381
ARD Matérn 5/2: k(x_i, x_j | Θ) = σ_f² (1 + √5·r + 5r²/3) exp(−√5·r)                          σ_f = 1.0452; length scales σ_m: 11.2644, 30.9904, 3.3999, 3.4057, 7.7608
ARD rational quadratic: k(x_i, x_j | Θ) = σ_f² (1 + (1/(2α)) Σ_{m=1}^{d} (x_im − x_jm)²/σ_m²)^(−α)   α = 0.7452; σ_f = 1.07; length scales σ_m: 18.4361, 30.2548, 2.2446, 0.7733
where r = √(Σ_{m=1}^{d} (x_im − x_jm)²/σ_m²).
495 • Comparative analysis of results

496 Table 7. Comparative analysis of results of different machine learning models.

Model RMSE MAE MAPE/100 R


Decision Tree 0.5646 0.2753 0.5948 0.4363
TreeBagger 0.4021 0.2513 0.6413 0.7652
RF 1 (var 8, leaf 1) 0.3668 0.2328 0.5679 0.8236

RF 2 (var 6, leaf 1) 0.3696 0.2244 0.5379 0.8012


Boosted Trees 0.5592 0.3348 0.6047 0.5867
GP exponential 0.4625 0.2104 0.4998 0.6317
GP Sq.exponential 0.4868 0.2393 0.5810 0.5733
GP matern 3/2 0.4757 0.2293 0.5406 0.5992
GP matern 5/2 0.4779 0.2307 0.5520 0.5941
GP Rat. quadratic 0.4868 0.2393 0.5810 0.5733
GP ARD exponential 0.5917 0.2873 0.6991 0.3302
GP ARD Sq. exponential 0.5669 0.2788 0.7736 0.3568
GP ARD matern 3/2 0.5276 0.2707 0.7206 0.4702
GP ARD matern 5/2 0.5464 0.2875 0.8794 0.4223
GP ARD Rat. quadratic 0.6573 0.3285 0.9059 0.2349
497
The values of all accuracy criteria on the test data set are shown in Table 7. According to the RMSE and R criteria, the RF model had the highest accuracy (it uses a subset of 8 variables for splitting, and the number of data per terminal leaf is equal to 1), while according to the MAE and MAPE criteria, the GP model with an exponential kernel function stood out as the most accurate. For the optimal random forest model, the significance of each input variable was determined by permuting the values of the considered variable within the training data and recalculating the out-of-bag error for the permuted data. The significance of a variable (Figure 9) is then determined by calculating the mean difference in error before and after permutation, divided by the standard deviation of these differences. Variables with higher values are ranked as more significant than variables with smaller values.
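A sketch of this permutation-based importance measure is given below; it uses scikit-learn's permutation_importance on held-out data as an approximation of the out-of-bag procedure described above, with hypothetical data standing in for the catchment attributes.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(88, 16))
y = X[:, 1] + 0.5 * X[:, 4] + rng.normal(scale=0.2, size=88)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# Permute each variable, measure the increase in error, and scale the mean
# increase by its standard deviation (analogous to the OOB measure in the text).
res = permutation_importance(rf, X_te, y_te, n_repeats=30, random_state=0,
                             scoring="neg_mean_squared_error")
importance = res.importances_mean / (res.importances_std + 1e-12)
for idx in np.argsort(importance)[::-1][:5]:
    print(f"variable {idx}: scaled importance {importance[idx]:.2f}")
```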

511

512 Figure 9. Significance of individual variables for SAR parameter prediction in optimal RF model.

513 The parameter of Hydrometric Station Elevation (HSE) is influential in SAR predic-
514 tion with an importance value of 0.188610351. Higher elevations often correspond to
515 different geological and hydrological conditions. For example, at higher elevations, there
516 might be different soil types, vegetation, and land use patterns. These variations can sig-
517 nificantly impact SAR values. Typically, elevated hydrometric stations are associated
518 with more complex terrain and geological features. The positive importance value im-
519 plies that water sources at higher elevations may have distinct characteristics or geologi-
520 cal properties that result in higher SAR values. These may include factors such as min-
521 eral content, weathering processes, or the influence of precipitation at higher altitudes.
522 The Catchment Area (CA) is a critical factor in SAR prediction, as indicated by its
523 substantial importance value of 0.335748728. A larger catchment area implies a more ex-
524 tensive land area contributing to the water source. This expansive catchment area results
525 in more significant dilution effects on water chemistry. The greater the catchment area,
526 the more effectively it reduces the concentration of ions in the water. This dilution effect
leads to more stable and predictable SAR values. It means that catchment areas with a larger geographical footprint have the potential to mitigate variations in SAR, making this parameter vital for accurate predictions.
530 Forested areas, denoted by the parameter Percentage of Forest (F), have a notable
531 positive influence on SAR predictions (importance: 0.260431787). Forests often contain
532 well-developed organic matter in the soil, which acts as a buffer to ions. This organic
533 matter helps to regulate the concentration of ions, potentially resulting in more stable
534 and predictable SAR values. Additionally, forested areas are typically associated with
535 lower human activity and reduced introduction of additional ions into the water. The

536 high importance value highlights the significance of forest cover in enhancing the preci-
537 sion of SAR predictions.
538 The presence of Rangeland (RL) plays a crucial role in SAR predictions, with an im-
539 portance value of 0.261470407. Rangeland areas often feature various types of vegetation
540 that can influence the ion content in water. This influence helps to stabilize SAR values.
541 Furthermore, the low-intensity land use associated with rangeland areas results in fewer
542 ion inputs compared to more heavily developed areas. As a result, rangeland cover is a
543 positive factor in SAR predictions, contributing to the accuracy and reliability of the
544 models.
545 Urban areas, represented by the Percentage of Urban Area (UA), have a significant
546 impact on SAR predictions, as indicated by the high importance value of 0.292730718.
547 Urbanization introduces various pollutants, including salts, into the water. These pollu-
548 tants can significantly influence SAR, potentially resulting in higher SAR values in ur-
549 banized areas. The high importance of this parameter emphasizes the need to consider
550 urban land use when predicting SAR, as it can lead to substantial variations in water
551 chemistry.
552 The Percentage of Agricultural Area (AA) has a substantial positive impact on SAR
553 predictions, with an importance value of 0.342586975. Agricultural areas often involve
554 various practices such as irrigation, fertilization, and pesticide application, which con-
555 tribute to the ion and salt content in the water. These agricultural activities significantly
556 influence SAR, potentially leading to higher SAR values. The high importance of this pa-
557 rameter underscores the need to consider agricultural land use when predicting SAR, as
558 it can have a profound effect on water chemistry.
Soil with characteristics represented by Hydrological Soil Group B (HSGB) has a
560 positive impact on SAR predictions, as reflected in its importance value of 0.235552135.
561 The soil in this group may include silt loam or loam, which has properties that facilitate
562 the retention of ions and salts. As a result, this soil group positively affects SAR values,
563 contributing to more reliable predictions.
564 Geological Permeability High (GHGN) is a critical factor in SAR predictions, with
565 an importance value of 0.280770699. High geological permeability affects how water in-
566 teracts with subsurface minerals, potentially leading to ion leaching or retention. This
567 parameter plays a crucial role in predicting SAR due to its substantial importance, em-
568 phasizing the significance of geological properties in understanding and predicting SAR
569 values.
570
571 4.2 Prediction of Na+ parameter values:
572 RF models proved to be the optimal models for predicting Sodium ion (Na+)
573 concentrations, while the analysis of all models in terms of accuracy is given in the
574 appendix. The dependence of the adopted accuracy criteria on the model parameters is

575 shown in Figure 10. Based on the defined accuracy criteria, four models with the
576 following criteria values were selected (Table 8).

577 Table 8. Accuracy of obtained models for Na+ parameter prediction according to defined criteria.

Criteria RMSE MAE MAPE R


RF 1 (var13, leaf 10) 1.6073 0.8086 1.1651 0.5817
RF 2 (var 11, leaf 5) 1.6755 0.7481 0.9734 0.5919
RF 3 (var 11, leaf 6) 1.6595 0.7516 0.9714 0.5923
RF 4 (var 6, leaf 2) 1.6385 0.8772 1.3929 0.6780

578

579

580 The "Weighted Sum Model" or "Simple Multi-Criteria Ranking" method was used
581 to select the optimal model. For the minimization objectives (RMSE, MAE, MAPE), Min-
582 Max normalization is applied, and for the maximization objective (R), Max-Min
583 normalization is applied to ensure that all metrics are on the same scale. Equal weights
584 are assigned to the normalized evaluation metrics to indicate their relative importance in
the decision-making process. The weighted sum method calculated an aggregated
586 value for each model, which considers all four normalized metrics. All models are
587 ranked based on their aggregated values, with the lower aggregated value indicating
588 better overall performance (Table 9).
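The ranking scheme described above can be sketched as follows (illustrative Python with placeholder metric values, not the study's numbers; the actual values and resulting ranks are those reported in Tables 8 and 9).

```python
import numpy as np

# Hypothetical metric values for three candidate models (for illustration only).
names = ["model A", "model B", "model C"]
#                    RMSE   MAE   MAPE    R
metrics = np.array([[1.61, 0.81, 1.17, 0.58],
                    [1.68, 0.75, 0.97, 0.59],
                    [1.64, 0.88, 1.39, 0.68]])

lo, hi = metrics.min(axis=0), metrics.max(axis=0)
norm = (metrics - lo) / (hi - lo)                         # Min-Max for RMSE, MAE, MAPE (lower = better)
norm[:, 3] = (hi[3] - metrics[:, 3]) / (hi[3] - lo[3])    # Max-Min for R (higher R = lower score)
weights = np.full(4, 0.25)                                # equal weights for the four criteria
agg = norm @ weights                                      # aggregated value per model
for name, value in sorted(zip(names, agg), key=lambda t: t[1]):
    print(f"{name}: aggregated value = {value:.4f}")      # lowest aggregated value ranks first
```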

589 Table 9. Determining the optimal prediction model for Na+ parameter using Simple Multi-Criteria Ranking

Weighted criteria         w.RMSE   w.MAE    w.MAPE   w.R      Agg. Value
RF 1 (var 13, leaf 10)    0.2500   0.1328   0.1351   0.0000   0.5180
RF 2 (var 11, leaf 5)     0.0000   0.2500   0.2488   0.0265   0.5253
RF 3 (var 11, leaf 6)     0.0587   0.2432   0.2500   0.0275   0.5794
RF 4 (var 6, leaf 2)      0.1356   0.0000   0.0000   0.2500   0.3856

591 Figure 10. Comparison of different accuracy criteria for RF model for Na+ parameter as a function of the number of randomly
592 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.

593 As the optimal model, the RF model with 500 trees was obtained, which uses a
subset of 6 variables, where the minimum number of data per leaf is 6. The assessment of the significance of individual input variables on the prediction accuracy was performed on the obtained model with the highest accuracy (Figure 11).
597

598

599 Figure 11. Significance of individual variables for Na+ parameter prediction in optimal RF model

600 The significant importance value of Catchment Area (CA) (0.204129833) indicates
601 that the size of the catchment area has a strong influence on Na+ concentrations. A larger
602 catchment area can encompass a wider range of geological features, influencing mineral
603 content in water and, subsequently, Na+ levels. The highly positive importance of CA
604 underscores its crucial role in shaping Na+ predictions.
605 The highly positive importance value for Forest (F) (0.698694744) highlights the
606 substantial impact of forested areas on Na+ concentrations. Forests can contribute to
607 lower Na+ levels due to the presence of organic matter that can buffer ions and minerals
608 in the water. This buffering effect results in stable and predictable Na+ concentrations in
609 forested regions.
610 Rangeland (RL) positively influences Na+ predictions, as indicated by its impor-
611 tance value (0.201545333). Rangeland areas may contain various types of vegetation that
612 stabilize Na+ levels by influencing ion content in the water. The lower-intensity land use
613 in rangeland areas leads to fewer ion inputs, enhancing the precision of Na+ predictions.
614 The positive importance value of the percentage of Agricultural Area (AA)
615 (0.262775223) highlights the significant influence of agricultural areas on Na+ predic-
616 tions. Agriculture involves practices such as irrigation, fertilization, and pesticide appli-
617 cation, contributing to ion and mineral content in the water. The high importance of AA
618 underscores the need to consider agricultural land use when predicting Na+ levels.
Soil characteristics represented by Hydrological Soil Group B (HSGB) (0.168774875),
620 including silt loam or loam, have a moderate positive impact on Na+ predictions. Soils in
621 this group may retain sodium ions and salts, contributing to the overall accuracy of Na+
622 predictions.

Hydrological Soil Group C (HSGC) positively influences Na+ predictions, with an
624 importance value of 0.109796372. This group includes soils like sandy clay loam that can
625 influence Na+ concentrations by affecting ion leaching, making it a valuable parameter in
626 understanding and predicting Na+ levels.
627
628 4.3 Prediction of Magnesium (Mg2+) parameter values:
RF models proved to be the optimal models for predicting magnesium ion (Mg2+)
630 concentrations. An analysis of all models in terms of accuracy is given in the appendix.
631 The dependence of the adopted accuracy criteria on the model parameters is shown in
632 Figure 12. Based on the defined accuracy criteria, 3 models with the following values
633 were selected (Table 10).
634

635 Figure 12. Comparison of different accuracy criteria for RF model for Mg2+ parameter as a function of the number of randomly
636 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.

637 Table 10. Accuracy of obtained models for Mg2+ prediction according to defined criteria.

Criteria RMSE MAE MAPE R


RF 1 (var13, leaf 1) 0.3988 0.2640 0.2662 0.7377
RF 2 (var 12, leaf 1) 0.4014 0.2608 0.2706 0.7516
RF 3 (var 10, leaf 1) 0.4020 0.2631 0.2717 0.7567
638
"Simple Multi-Criteria Ranking" was applied again when selecting the optimal model (Table 11).

Table 11. Determining the optimal prediction model for Mg2+ parameter using Simple Multi-Criteria Ranking.

Weighted criteria        w.RMSE   w.MAE    w.MAPE   w.R      Agg. Value
RF 1 (var 13, leaf 1)    0.2500   0.0000   0.2500   0.0000   0.5000
RF 2 (var 12, leaf 1)    0.0469   0.2500   0.0500   0.1829   0.5298
RF 3 (var 10, leaf 1)    0.0000   0.0703   0.0000   0.2500   0.3203
As the optimal model, the RF model with 500 trees was obtained, which uses a subset of 12 variables, where the minimum number of data per leaf is 1. The assessment of the importance of individual input variables on the prediction accuracy was performed on the obtained model with the highest accuracy (Figure 13).

645
646 Figure 13. Significance of individual variables for Mg2+ parameter prediction in optimal RF model

647 The positive importance value of Catchment Area (CA) (0.163764386) suggests that
648 the size of the catchment area has a moderate influence on Mg2+ concentrations. A larger
649 catchment area can encompass a wider range of geological features, influencing mineral
650 content in water and, subsequently, Mg2+ levels. The positive importance of CA indicates
651 its role in shaping Mg2+ predictions.
652 The highly positive importance value for Forest (F) (0.590270347) highlights the
653 substantial impact of forested areas on Mg2+ concentrations. Forests can contribute to
654 higher Mg2+ levels due to the decomposition of organic matter and the release of magne-
655 sium ions. This effect results in stable and predictable Mg2+ concentrations in forested re-
656 gions.
657 Rangeland (RL) positively influences Mg2+ predictions, as indicated by its impor-
658 tance value (0.302241325). Rangeland areas may contain various types of vegetation that
659 stabilize Mg2+ levels by influencing ion content in the water. The lower-intensity land use
660 in rangeland areas leads to fewer ion inputs, enhancing the precision of Mg2+ predic-
661 tions.
662 Urban areas, represented by Percentage of Urban Area (UA), significantly affect
663 Mg2+ concentrations, as suggested by the importance value (0.29234405). Urbanization
664 introduces various pollutants and ions into the water, potentially leading to higher Mg2+
665 concentrations.
666 The positive importance value of the percentage of Agricultural Area (AA)
667 (0.200998202) highlights the significant influence of agricultural areas on Mg2+ predic-
668 tions. Agriculture involves practices such as irrigation, fertilization, and pesticide appli-
669 cation, contributing to ion and mineral content in the water. The high importance of AA
670 underscores the need to consider agricultural land use when predicting Mg2+ levels.
671 Soil characteristics represented by Hydrological Soil Group B (SHGB) (0.180483735),
672 including silt loam or loam, have a moderate positive impact on Mg2+ predictions. Soils
673 in this group may retain magnesium ions and salts, contributing to the overall accuracy
674 of Mg2+ predictions.
675 Geological permeability, represented by geological hydrological group N (GHGN)
676 (0.2735), suggests that areas or regions with low-permeability geological features have
677 a significant positive impact on Mg2+ levels in water.
678 4.4 Prediction of Ca2+ parameter values:
679 RF models proved to be the optimal models for predicting Ca2+ (calcium ion)
680 concentrations. An analysis of all models in terms of accuracy is given in the appendix.
681 The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 14.
682 According to all the defined accuracy criteria, only 1 model stood out with values for
683 RMSE, MAE, MAPE and R of 0.5847, 0.4500, 0.2007, 0.7496, respectively.
684

685 Figure 14. Comparison of different accuracy criteria for RF model for Ca2+ parameter as a function of the number of randomly
686 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.

687 The assessment of the significance of individual input variables on the accuracy of
688 the prediction was performed on the obtained model with the highest accuracy
689 (Figure 15).
690 The positive importance value of HSE (0.197664268) suggests that higher elevations
691 have a significant impact on predicting Ca2+ concentrations. Elevation can influence Ca2+
692 levels due to several factors. For example, in mountainous regions, the geological com-
693 position can change with elevation, affecting the calcium content in the surrounding soil
694 and water. Additionally, higher elevations may experience different weather patterns
695 and precipitation, which can leach calcium from rocks and soil. Therefore, the positive
696 importance of HSE highlights the need to consider the elevation of monitoring stations
697 when predicting Ca2+ concentrations.
698
699 Figure 15. Significance of individual variables for Ca2+ parameter prediction in optimal RF model.

700 The importance value of Catchment Area (0.317181573) indicates that the size of the
701 catchment area significantly influences Ca2+ concentrations. A larger catchment area of-
702 ten implies a greater geographical footprint contributing to water sources. This expan-
703 sive catchment area can lead to higher dilution effects on water chemistry, potentially re-
704 ducing Ca2+ concentrations. The positive importance of CA underscores the need to con-
705 sider catchment size when predicting Ca2+, as it plays a pivotal role in shaping the ion
706 levels in water.
707 The substantial positive importance value for Forest (F) (0.550159839) highlights the
708 significant impact of forested areas on Ca2+ concentrations. Forests often contain organic
709 matter that can buffer ions, including calcium. This buffering effect helps maintain stable
710 and predictable Ca2+ levels. The presence of forests can also reduce human activity and
711 the introduction of additional ions into water, contributing to the high importance of this
712 parameter in Ca2+ predictions.
713 Rangeland (RL) positively influences Ca2+ predictions, as indicated by its impor-
714 tance value (0.300511477). Rangeland areas may contain various types of vegetation that
715 stabilize Ca2+ levels by influencing ion content in the water. Additionally, the lower-in-
716 tensity land use in rangeland areas results in fewer ion inputs, enhancing the precision
717 of Ca2+ predictions.
718 Urban areas, represented by the percentage of Urban Area (UA), significantly affect
719 Ca2+ concentrations, as suggested by the substantial importance value (0.310613586). Ur-
720 banization introduces various pollutants into the water, potentially leading to altered
721 Ca2+ levels. The high importance of UA underscores the need to consider urban land use
722 when predicting Ca2+, as it can result in substantial variations in ion levels.
723 The substantial positive importance value of Percentage of Agricultural Area (AA)
724 (0.319123329) highlights the significant influence of agricultural areas on Ca2+ predic-
725 tions. Agriculture involves practices such as irrigation, fertilization, and pesticide appli-
726 cation, contributing to ion and calcium content in the water. The high importance of AA
727 underscores the need to consider agricultural land use when predicting Ca2+, as it has a
728 profound effect on water chemistry.
729 Hydrological Soil Group C (SHGC) (0.174279524) positively influences Ca2+ predic-
730 tions. This group includes soils like sandy clay loam that can influence ion leaching. The
731 importance of SHGC suggests that the specific characteristics of soil in this group have a
732 notable impact on Ca2+ predictions.
733 Geological Permeability High (GHGT) (0.517513106) is a crucial factor in Ca2+ pre-
734 dictions, with a high importance value. High geological permeability affects water's in-
735 teraction with subsurface minerals, potentially leading to ion leaching or retention. This
736 parameter plays a pivotal role in shaping Ca2+ levels in water due to its substantial im-
737 portance.
738 Geological Permeability Medium (GHGM) (0.176935951) positively contributes to
739 Ca2+ predictions, albeit to a lesser extent. Medium geological permeability influences ion
740 interactions with subsurface minerals, making it a valuable parameter in understanding
741 and predicting Ca2+ levels.
742 Low geological permeability (GHGN) (0.145602833) has a positive impact on Ca2+
743 predictions, as indicated by its importance value. Lower permeability can reduce ion
744 leaching or retention, impacting Ca2+ levels and contributing to the overall accuracy of
745 predictions.
746
747 4.5 Prediction of SO42− parameter values:
748 The RF models proved to be the optimal models for predicting SO42− levels. An
749 analysis of all models in terms of accuracy is given in Appendix A. The dependence
750 of the adopted accuracy criteria on the model parameters is shown in Figure 16.
751 According to all defined accuracy criteria, only 1 model was singled out with values for
752 RMSE, MAE, MAPE and R of 0.5526, 0.3122, 0.5050 and 0.6148, respectively.

753 Figure 16. Comparison of different accuracy criteria for RF model for SO42− parameter as a function of the number of randomly
754 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.

755 The assessment of the significance of individual input variables on the accuracy of
756 the prediction was performed directly on the obtained model with the highest accuracy
757 (Figure 17).

758
759 Figure 17. Significance of individual variables for SO42− parameter prediction in optimal RF model.

760 The positive importance value of HSE (0.13739278) suggests that elevation plays a
761 moderate role in predicting sulfate (SO42−) concentrations. Elevation can affect SO42−
762 levels by influencing factors like geological composition and the presence of minerals in
763 the soil. The positive importance value indicates that higher elevations may lead to
764 slightly increased SO42− concentrations, making it a moderately important parameter in
765 SO42− predictions.
766 The significant importance value of Catchment Area (CA) (0.179220323) indicates
767 that the size of the catchment area has a notable influence on SO42− concentrations. A
768 larger catchment area can encompass a wider range of geological features, affecting min-
769 eral content in water and, consequently, SO42− levels.
770 The highly positive importance value for Forest (F) (0.393199832) highlights the
771 substantial impact of forested areas on SO42− concentrations. Forests can contribute to
772 lower SO42− levels due to the presence of organic matter that can buffer ions and miner-
773 als in the water. This buffering effect results in stable and predictable SO42− concentra-
774 tions in forested regions.
775 Rangeland (RL) positively influences SO42− predictions, as indicated by its high im-
776 portance value (0.463039895). Rangeland areas may contain various types of vegetation
777 that stabilize SO42− levels by influencing ion content in the water. The lower-intensity
778 land use in rangeland areas leads to fewer ion inputs, enhancing the precision of SO42−
779 predictions.
780 Urban areas, represented by Percentage of Urban Area (UA), significantly affect
781 SO42− concentrations, as suggested by the importance value (0.146355872). Urbanization
782 introduces various pollutants and ions into the water, potentially leading to higher SO4 2−
783 concentrations.
784 The positive importance value of Percentage of Agricultural Area (AA) (0.12677091)
785 suggests that agricultural areas have a minor positive impact on SO42− predictions. Agri-
786 culture practices may introduce ions and minerals into the water, leading to slightly
787 higher SO42− levels.
788 Soil characteristics represented by Hydrological Soil Group B (SHGB) (0.077407654),
789 including silt loam or loam, have a minor positive impact on SO42− predictions. Soils in
790 this group may retain ions and salts, contributing to the overall accuracy of SO42− predic-
791 tions.
792
793 4.6 Prediction of Cl− parameter values:
794 RF models proved to be the optimal models for predicting Cl− concentrations. An
795 analysis of all models in terms of accuracy is given in the appendix. The dependence of
796 the adopted accuracy criteria on the model parameters is shown in Figure 18. Based
797 on the defined accuracy criteria, four models were selected (Table 12).
798 Figure 18. Comparison of different accuracy criteria for RF model for Cl− parameter as a function of the number of randomly
799 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.

800 Table 12. Accuracy of obtained models for Cl− prediction according to defined criteria.

RF Model RMSE MAE MAPE R


RF 1 (var 13, leaf 10) 1.7999 0.9111 1.1120 0.5691
RF 2 (var 11, leaf 5) 1.8831 0.8316 0.8589 0.5964
RF 3 (var 11, leaf 4) 1.8904 0.8323 0.8557 0.5933
RF 4 (var 6, leaf 2) 1.8473 0.9370 1.0288 0.6940

801 Table 13. Determining the optimal prediction model for Cl− parameter using Simple Multi-Criteria Ranking.

Agg. Value w.R w.MAPE w.MAE w.RMSE RF Model


0.3114 0.0000 0.0000 0.0614 0.2500 RF 1 (var 13, leaf 10)
0.5717 0.0546 0.2469 0.2500 0.0202 RF 2 (var 11, leaf 5)
0.5468 0.0484 0.2500 0.2483 0.0000 RF 3 (var 11, leaf 4)
0.4502 0.2500 0.0812 0.0000 0.1191 RF 4 (var 6, leaf 2)
802
803 As the optimal model, the RF model with 500 trees was obtained, which uses a
804 subset of 11 variables, where the minimum number of data per leaf is 5 (Table 13). The
805 assessment of the importance of individual input variables on the accuracy of the
806 prediction was performed on the obtained model with the highest accuracy
807 (Figure 19).
808
809

810
811 Figure 19. Significance of individual variables for Cl− parameter prediction in optimal RF model.

812 Catchment Area (CA) (Importance: 0.2225) has a relatively high positive
813 importance, indicating that the size of the catchment area strongly influences chloride
814 levels. Larger catchment areas tend to have higher chloride concentrations due to
815 increased runoff and potential pollutant sources.
816 Stream Order (SO) (Importance: 0.0783) has a positive importance, suggesting that
817 higher stream order corresponds to higher chloride levels. This could be due to factors
818 such as the accumulation of pollutants and ions as water moves downstream.
819 Forest (F) (Importance: 0.5505) has a very high positive importance, indicating that
820 forests strongly influence chloride levels, likely due to their role in filtering and
821 modifying water chemistry.
822 Rangeland (RL) (Importance: 0.1992) has a positive impact, possibly due to the
823 composition of the soil and vegetation.
824 Agricultural Area (AA) (Importance: 0.2720) has a relatively high positive impact,
825 possibly related to farming practices, irrigation, and potential chloride runoff.
826
827 4.7 Prediction of HCO3− parameter values:
828 GPR models proved to be the optimal models for predicting HCO3− concentrations.
829 Very similar values in terms of accuracy were also given by the RF models. However,
830 since the difference between the GPR model and the RF model is practically negligible,
831 and since it is not possible to obtain the significance value for individual input variables
832 on the obtained GPR model because it has the same length scale parameter for all
833 variables, RF models were used for the analysis. An analysis of all models in terms of
834 accuracy is given in the appendix.
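To make the length-scale issue concrete, the sketch below contrasts a GPR model with a single shared length scale (so no per-variable relevance can be read from the fitted kernel) with an ARD variant that fits one length scale per input. It is an illustration in scikit-learn under the assumption that the exponential covariance corresponds to a Matern kernel with nu = 0.5; it is not a reproduction of the study's implementation, and X and y stand in for the catchment attributes and HCO3− observations.

```python
# Hedged illustration: isotropic GPR (one shared length scale, so no per-variable
# relevance can be read from the fitted kernel) versus an ARD GPR that fits one length
# scale per input. The exponential covariance is taken as a Matern kernel with nu = 0.5.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

def fit_gpr(X, y, ard=False):
    """X and y are placeholders for the catchment attributes and HCO3- observations."""
    n_features = X.shape[1]
    length_scale = np.ones(n_features) if ard else 1.0   # vector -> ARD, scalar -> isotropic
    kernel = ConstantKernel(1.0) * Matern(length_scale=length_scale, nu=0.5)
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, alpha=1e-3)
    gpr.fit(X, y)
    # With ARD, a large fitted length scale indicates a variable with little influence.
    return gpr, gpr.kernel_.k2.length_scale
```

As Table A6 in the appendix shows, the ARD variants did not outperform the isotropic exponential kernel for HCO3−, which is why the variable-importance analysis falls back on the RF model.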
835 The dependence of the adopted accuracy criteria on the parameters of the RF model
836 is shown in Figure 20.
837 In the specific case of applying the RF model, two models were distinguished,
838 namely the RF model that uses ten variables as a subset for analysis and where the
839 number of data per terminal leaf is equal to 1, which is optimal according to the RMSE,
840 MAE, MAPE criteria, and the model that uses 13 variables as a subset for analysis and
841 where the number of data per terminal leaf is equal to 2, which is optimal according to
842 the R criterion. Since the first-mentioned model is optimal according to the three
843 adopted accuracy criteria, RMSE, MAE, and MAPE, and the difference compared to the
844 R criterion is practically negligible, the first model can be considered optimal.
845 The optimal model has the following criterion values for RMSE, MAE, MAPE, and
846 R of 0.5174, 0.4252, 0.1822, and 0.7721, respectively.
847
848 Figure 20. Comparison of different accuracy criteria for RF model for HCO3− parameter as a function of the number of randomly
849 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.

850 The assessment of the importance of individual input variables on the accuracy of
851 the prediction was performed on the obtained RF model with the highest
852 accuracy (Figure 21).
853 The positive importance value of HSE (0.191186813) suggests that elevation plays a
854 moderate role in predicting bicarbonate ion (HCO3−) concentrations. Elevation can affect
855 HCO3− levels by influencing factors like geological composition and the presence of
856 minerals in the soil. The positive importance value indicates that higher elevations may
857 lead to increased HCO3− concentrations, making it an important parameter in HCO3−
858 predictions.
859 The significant importance value of the Catchment Area (CA) (0.704530351)
860 indicates that the size of the catchment area has a strong influence on HCO3−
861 concentrations. A larger catchment area can encompass a wider range of geological
862 features, influencing mineral content in water and, subsequently, HCO3− levels. The
863 highly positive importance of CA underscores its crucial role in shaping HCO3−
864 predictions.
865 The highly positive importance value for Forest (F) (0.717278924) highlights the
866 substantial impact of forested areas on HCO3− concentrations. Forests can contribute to
867 lower HCO3− levels due to the presence of organic matter that can buffer ions and
868 minerals in the water. This buffering effect results in stable and predictable HCO3−
869 concentrations in forested regions.
870 Rangeland (RL) positively influences HCO3− predictions, as indicated by its
871 importance value (0.175212138). Rangeland areas may contain various types of
872 vegetation that stabilize HCO3− levels by influencing ion content in the water. The lower-
873 intensity land use in rangeland areas leads to fewer ion inputs, enhancing the precision
874 of HCO3− predictions.

875
876 Figure 21. Significance of individual variables for HCO3− parameter prediction in optimal RF
877 model.

878 Urban areas, represented by the percentage of Urban Area (UA), significantly affect
879 HCO3− concentrations, as suggested by the importance value (0.21786019). Urbanization
880 introduces various pollutants and ions into the water, potentially leading to higher
881 HCO3− concentrations. The positive importance of UA underscores the need to consider
882 urban land use when predicting HCO3− levels.
883 The positive importance value of the percentage of Agricultural Area (AA)
884 (0.243760535) highlights the significant influence of agricultural areas on HCO3−
885 predictions. Agriculture involves practices such as irrigation, fertilization, and pesticide
886 application, contributing to ion and mineral content in the water. The high importance
887 of AA underscores the need to consider agricultural land use when predicting HCO3−
888 levels.
889 Hydrological Soil Group A (SHGA) (0.154084653) is of moderate importance in
890 HCO3− predictions. Soils in this group, which may include sand, loamy sand, or sandy
891 loam, can influence HCO3− concentrations by affecting ion content in the water. The
892 positive importance suggests that SHGA plays a notable role in shaping HCO3− levels.
893 Hydrological Soil Group C (SHGC) (0.167034059) positively influences HCO3−
894 predictions. This group includes soils like sandy clay loam that can influence ion
895 leaching, making it a valuable parameter in understanding and predicting HCO3− levels.
896
897 4.8 Prediction of K+ parameter values:
898 The RF models proved to be the optimal models for predicting K+ levels. An
899 analysis of all models in terms of accuracy is given in the appendix. The dependence of
900 the adopted accuracy criteria on the model parameters is shown in Figure 23.
901 In terms of accuracy, three models were singled out, and the optimal model was
902 obtained by applying the Simple Multi-Criteria Ranking method (Table 14 and Table 15).
903

904 Table 14. Accuracy of obtained models for K+ parameter prediction according to defined criteria.

RF Model RMSE MAE MAPE R


Var 13, leaf 8 0.0231 0.0172 0.3755 0.5689
Var 3, leaf 1 0.0236 0.0174 0.3700 0.6397
Var 12, leaf 4 0.0241 0.0166 0.3476 0.6024

905 Table 15. Determining the optimal prediction model for K+ parameter using Simple Multi-Criteria Ranking.

Agg. Value w.R w.MAPE w.MAE w.RMSE RF Model


0.3125 0.0000 0.0000 0.0625 0.2500 Var 13, leaf 8
0.4243 0.2500 0.0493 0.0000 0.1250 Var 3, leaf 1
0.6183 0.1183 0.2500 0.2500 0.0000 Var 12, leaf 4
906
907 The analysis of the significance of individual input variables of the model was
908 performed on the optimal RF model, whose hyperparameter values for the number of
909 trees, the subset of splitting variables, and the minimum number of data per terminal
910 leaf are 500, 12, and 4, respectively (Table 15).
911

912 Figure 23. Comparison of different accuracy criteria for RF model for K+ parameter as a function of the number of randomly
913 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.

914 The positive importance value of HSE (0.22352202) suggests that elevation plays a
915 significant role in predicting potassium ion (K+) concentrations. Elevation can impact K+
916 levels by influencing geological composition and the presence of minerals in the soil.
917 The positive importance value indicates that higher elevations may lead to increased K+
918 concentrations, making it an important parameter in K+ predictions (Figure 24).
919 The highly positive importance value for Forest (F) (0.303494299) highlights the
920 substantial impact of forested areas on K+ concentrations. Forests can contribute to
921 higher K+ levels due to the decomposition of organic matter and the release of potassium
922 ions. This effect results in stable and predictable K+ concentrations in forested regions.
923 Rangeland (RL) positively influences K+ predictions, as indicated by its importance


924 value (0.13811474). Rangeland areas may contain various types of vegetation that
925 stabilize K+ levels by influencing ion content in the water. The lower-intensity land use
926 in rangeland areas leads to fewer ion inputs, enhancing the precision of K+ predictions.

927
928 Figure 24. Significance of individual variables for K+ parameter prediction in optimal RF model.

929 Urban areas, represented by the percentage of Urban Area (UA), have a positive
930 influence on K+ concentrations, as suggested by the importance value (0.125652185).
931 Urbanization introduces various pollutants and ions into the water, potentially leading
932 to higher K+ concentrations.
933 The highly positive importance value of Percentage of Agricultural Area (AA)
934 (0.352999482) underscores the significant influence of agricultural areas on K+
935 predictions. Agriculture involves practices such as irrigation, fertilization, and pesticide
936 application, contributing to ion and mineral content in the water. The high importance
937 of AA emphasizes the need to consider agricultural land use when predicting K+ levels.
938 Soil characteristics represented by Hydrological Soil Group B (SHGB) (0.232769812),
939 including silt loam or loam, have a positive impact on K+ predictions. Soils in this group
940 may retain potassium ions and salts, contributing to the overall accuracy of K+
941 predictions.
942 Hydrological Soil Group C (SHGC) (0.107293215) positively influences K+
943 predictions. This group includes soils like sandy clay loam that can influence K+
944 concentrations by affecting ion leaching, making it a valuable parameter in


945 understanding and predicting K+ levels.
946 Hydrological Soil Group D (SHGD) (0.112585447) has a positive impact on
947 K+predictions. This group may include clayey soils that can influence K+ levels by
948 retaining potassium ions, contributing to the accuracy of K+ predictions.
949 Geological hydrological group N (GHGN) (0.3149): Average permeability
950 geological features have a high positive importance, indicating a strong positive impact
951 on potassium levels. This suggests that moderate permeability allows for significant
952 potassium interactions. Geological formations with average permeability enable water to
953 percolate through the subsurface. During this percolation, water comes into contact with
954 minerals present in the geological strata. These minerals may contain potassium-bearing
955 compounds. Also, in areas with moderate permeability, there is a significant potential
956 for ion exchange. As water moves through the geological layers, it can exchange ions
957 with minerals. This exchange process can lead to the release of potassium ions into the
958 water, thus increasing potassium concentrations.
959
960 4.9 Prediction of pH parameter values:
961 The RF models proved to be the optimal models for predicting pH values. An
962 analysis of all models in terms of accuracy is given in the appendix. The dependence of
963 the adopted accuracy criteria on the model parameters is shown in Figure 25.
964 In terms of accuracy, three models were singled out, and the optimal model was
965 obtained by applying the Simple Multi-Criteria Ranking method (Table 16 and Table 17).
966
967 Figure 25. Comparison of different accuracy criteria for RF model for pH parameter as a function of the number of randomly
968 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.

969

970

971

972 Table 16. Accuracy of obtained models for pH parameter prediction according to defined criteria.

RF Model RMSE MAE MAPE R


RF 1 (var 7, leaf 1) 0.3469 0.2383 0.0306 0.3331
RF 2 (var 15, leaf 3) 0.3554 0.2326 0.0296 0.3476
RF 3 (var 6, leaf 5) 0.3531 0.2338 0.0298 0.3906
973
974 Using the weighted sum method, an aggregated value is calculated for each
975 model, taking into account all four normalized metrics.

976 Table 17. Determining the optimal prediction model for pH parameter using Simple Multi-Criteria Ranking.

Agg. Value w.R w.MAPE w.MAE w.RMSE RF Model

0.2500 0.0000 0.0000 0.0000 0.2500 RF 1 (var 7, leaf 1)
0.5630 0.0630 0.2500 0.2500 0.0000 RF 2 (var 15, leaf 3)
0.7150 0.2500 0.2000 0.1974 0.0676 RF 3 (var 6, leaf 5)


977
978 The analysis of the significance of individual input variables of the model was
979 performed on the optimal RF model and shown in Figure 26.

980
981 Figure 26. Significance of individual variables for pH parameter prediction in optimal RF model.

982 Stream Order (SO) has a highly positive importance value (0.467367002),
983 highlighting its strong impact on pH predictions. Stream order represents the hierarchy
984 of streams, and it significantly influences pH values. The hierarchical structure
985 of streams can influence the flow, geochemistry, and organic matter content of water, all
986 of which can affect pH levels.
987 Rangeland (RL) positively influences pH predictions, with an importance value of
988 0.248377956. Rangeland areas may contain various types of vegetation that stabilize pH
989 levels by influencing ion content in the water. The lower-intensity land use in rangeland
990 areas leads to fewer ion inputs, enhancing the precision of pH predictions.
991
992 4.10 Prediction of EC parameter values:
993 RF models proved to be the optimal models for EC parameter prediction. An analysis
994 of all models in terms of accuracy is given in the appendix. The dependence of the
995 adopted accuracy criteria on the model parameters is shown in Figure 27. According to
996 all defined accuracy criteria, only 1 model was singled out with values for RMSE, MAE,
997 MAPE, and R of 271.5346, 149.3192, 0.3013, and 0.7665, respectively. The obtained
998 hyperparameter values for the number of trees, the subset of splitting variables, and the
999 minimum number of data per leaf are 500, 6, and 1, respectively. The analysis of the
1000 significance of individual input variables of the model was performed on the optimal RF
1001 model and shown in Figure 28.


1002 Figure 27. Comparison of different accuracy criteria for RF model for EC parameter as a function of the number of randomly
1003 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.

1004 The substantial positive importance value for Forest (F) (0.311749351) highlights the
1005 significant impact of forested areas on electrical conductivity. Forests often contribute to
1006 lower electrical conductivity due to the presence of organic matter that can buffer ions
1007 and minerals in the water. This buffering effect results in stable and predictable electrical
1008 conductivity in forested regions.
1009 Rangeland (RL) positively influences electrical conductivity predictions, as


1010 indicated by its importance value (0.178610139). Rangeland areas may contain various
1011 types of vegetation that stabilize electrical conductivity by influencing ion content in the
1012 water. Additionally, the lower-intensity land use in rangeland areas leads to fewer ion
1013 inputs, enhancing the precision of electrical conductivity predictions.
1014 Urban areas, represented by Percentage of Urban Area (UA), significantly affect
1015 electrical conductivity, as suggested by the importance value (0.125117033).
1016 Urbanization introduces various pollutants and ions into the water, potentially leading
1017 to higher electrical conductivity. The positive importance (0.125117033) of UA
1018 underscores the need to consider urban land use when predicting electrical conductivity.
1019 The positive importance value of Percentage of Agricultural Area (AA)
1020 (0.128402506) highlights the significant influence of agricultural areas on electrical
1021 conductivity. Agriculture involves practices such as irrigation, fertilization, and
1022 pesticide application, which contribute to ion and mineral content in the water. The high
1023 importance (0.128402506) of AA underscores the need to consider agricultural land use
1024 when predicting electrical conductivity.
1025 Soil characteristics represented by Hydrological Soil Group B (SHGB) (0.188209984),
1026 including silt loam or loam, positively influence electrical conductivity predictions. Soils
1027 in this group may retain ions and salts, contributing to the overall accuracy of electrical
1028 conductivity predictions.

1029
1030 Figure 28. Significance of individual variables for EC parameter prediction in optimal RF model

1031 4.11 Prediction of TDS parameter values:


1032 The RF models proved to be the optimal models for predicting TDS levels. An
1033 analysis of all models in terms of accuracy is given in the appendix. The dependence of
1034 the adopted accuracy criteria on the model parameters is shown in Figure 29.
1035 In terms of accuracy, three models were singled out, and the optimal model was
1036 obtained by applying the Simple Multi-Criteria Ranking method (Table 18 and Table 19).

1037 Table 18. Accuracy of obtained models for TDS parameter prediction according to defined criteria

RF Model RMSE MAE MAPE R


RF 1 (var 9, leaf 10) 417.2155 201.8572 0.4863 0.5467
RF 2 (var 12, leaf 7) 422.6822 196.7117 0.4578 0.5521
RF 3 (var 12, leaf 5) 435.3533 198.3639 0.4562 0.5502

1038 Table 19. Determining the optimal prediction model for TDS parameter using Simple Multi-Criteria Ranking

Agg. Value w.R w.MAPE w.MAE w.RMSE RF Model


0.2500 0.0000 0.0000 0.0000 0.2500 RF 1 (var 9, leaf 10)
0.9114 0.2500 0.2367 0.2500 0.1747 RF 2 (var 12, leaf 7)
0.5818 0.1620 0.2500 0.1697 0.0000 RF 3 (var 12, leaf 5)
1039
1040
1041
1042
1043

1044 Figure 29. Comparison of different accuracy criteria for RF model for TDS parameter as a function of the number of randomly
1045 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.

1046 The substantial importance value of HSE (0.320185209) suggests that elevation
1047 plays a significant role in predicting Total Dissolved Solids (TDS) concentrations (Figure
1048 30). Elevation can affect TDS levels by influencing factors like geological composition,
1049 mineral content, and the presence of dissolved ions. The positive importance value
1050 indicates that higher elevations may lead to higher TDS concentrations, making it a
1051 highly important parameter in TDS predictions.
1052 The highly positive importance value for Forest (F) (0.633926137) highlights the
1053 substantial impact of forested areas on TDS concentrations. Forests can contribute to
1054 lower TDS levels due to the presence of organic matter that can buffer ions and minerals
1055 in the water. This buffering effect results in stable and predictable TDS concentrations in
1056 forested regions.
1057 The moderate importance value of the percentage of Agricultural Area (AA)
1058 (0.191967355) suggests that agricultural areas have a moderate impact on TDS
1059 predictions. Agriculture practices introduce ions and minerals into the water,
1060 influencing TDS levels to a moderate extent.
1061 Hydrological Soil Group A (SHGA) has a minor positive importance (0.126901692)
1062 in TDS predictions. Soils in this group, which may include sand, loamy sand, or sandy
1063 loam, can influence TDS concentrations by affecting ion content in the water.
1064
1065 Figure 30. Significance of individual variables for TDS parameter prediction in optimal RF model

1066 5. Discussion
1067 In this research, satisfactory accuracy values were obtained for the majority of
1068 models with respect to the defined criteria, although some models fell short on
1069 certain criteria. Model accuracy is more readily interpreted through the relative
1070 accuracy criteria, R and MAPE, than through the absolute criteria, RMSE and MAE
1071 (Table 20).

1072 Table 20. Accuracy of the ML model in predicting individual water parameters.

Output parameter Best Model RMSE MAE MAPE R


SAR RF 0.3668 0.2328 0.5679 0.8236
Na+ RF 1.6385 0.8772 1.3929 0.6780
Mg2+ RF 0.4020 0.2631 0.2717 0.7567
Ca2+ RF 0.5847 0.4500 0.2007 0.7496
SO42- RF 0.5526 0.3122 0.5050 0.6148
Cl- RF 1.8831 0.8316 0.8589 0.5964
HCO3- GP 0.5056 0.4144 0.1782 0.7668
K+ RF 0.0241 0.0166 0.3476 0.6024
pH RF 0.3531 0.2338 0.0298 0.3906
EC RF 271.5346 149.3192 0.3013 0.7665
TDS RF 422.6822 196.7117 0.4578 0.5521


1073
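For reference, the four criteria reported in Table 20 can be computed as in the short sketch below; it is a generic implementation rather than the study's code, it reports MAPE as a fraction in line with the MAPE/100 convention of Appendix A, and it assumes that R denotes the Pearson correlation between measured and predicted values.

```python
# Generic computation of the four accuracy criteria used throughout the paper.
# MAPE is returned as a fraction, matching the MAPE/100 convention of Appendix A,
# and R is taken as the Pearson correlation between measured and predicted values.
import numpy as np

def accuracy_criteria(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    mape = float(np.mean(np.abs(err / y_true)))   # assumes no zero-valued observations
    r = float(np.corrcoef(y_true, y_pred)[0, 1])
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R": r}
```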
1074 If we look at the value of R, we can see that practically all models, except for the
1075 model for predicting the pH parameter, have satisfactory values. Regarding the value of
1076 MAPE, five models can be observed where this value can be considered higher (SAR,
1077 Na+, SO42-, Cl-, TDS). However, the main goal of the research was to determine the
1078 importance of individual input variables as reliably as possible under conditions of
1079 limited data, so that the optimal model could be selected from a larger set of machine
1080 learning models and implemented. This was largely achieved through the research. The
1081 direction of future research is the collection of more data and the expansion of the
1082 database used for creating the models. In this way, the limitations of the models are
1083 reduced, enabling higher accuracy for those parameters where somewhat lower
1084 accuracy was obtained.
1085 Regarding the importance of individual input variables, several conclusions can be
1086 drawn (Table 21).
1087 Forest Cover ('F'): Forested areas have a significant impact on various water quality
1088 parameters. They contribute to the release of organic matter, which can influence the
1089 concentrations of potassium, sodium, magnesium, calcium, sulfate, bicarbonate, and
1090 chloride ions. Additionally, forest cover plays a role in pH regulation.
1091 Catchment Area ('CA'): The size and characteristics of the catchment area play a
1092 role in controlling the transport of ions. Larger catchment areas may dilute ion
1093 concentrations, affecting parameters such as sodium and sulfate levels.
1094 Stream Order ('SO'): Stream order is important for several parameters, including
1095 sodium, sulfate, and pH. It can affect the mixing and transport of ions in the water.
1096 Barren Land ('BL'): Barren land areas can influence ion levels, including sulfate,
1097 chloride, bicarbonate, and calcium. They may contribute to ion concentrations through
1098 geological and land cover factors.
1099 Urban Area ('UA'): Urban areas can impact ion levels, such as sodium and SAR, due
1100 to urbanization and land use changes. They can introduce contaminants and alter water
1101 chemistry.
1102

1103 Table 21. The most important input variables for predicting the output parameters of water.

Input variable Output

2 4 1 5 3 SAR
5 2 4 1 3 Na+
4 5 3 2 1 Mg2+
2 3 5 1 4 Ca2+
3 5 1 2 4 SO42-
4 2 3 1 5 Cl-
1 4 2 3 5 HCO3-
2 4 1 3 5 K+
3 5 4 2 1 pH
3 4 5 2 1 EC
5 4 3 1 2 TDS
1104
1105 Hydrological Soil Groups (HSGA, HSGB, HSGC, HSGD): Different soil types have
1106 varying abilities to retain and release ions. Sandy soils (HSGA and HSGB) may allow for
1107 easier ion movement, while clayey soils (HSGD) can retain ions. Soil characteristics
1108 influence potassium, magnesium, calcium, and bicarbonate levels.
1109 Geological Permeability (GHGM, GHGN, GHGT): Geological features can affect ion
1110 concentrations. Moderate permeability (GHGN) has a positive impact on potassium
1111 levels, possibly due to ion interactions. Low permeability (GHGM) may positively
1112 influence magnesium levels by slowing water flow and allowing for more mineral
1113 dissolution.

1114 6. Conclusions
1115 The study demonstrated that machine learning methods can be effectively used to
1116 predict and assess water quality parameters in a catchment area. These models provide a
1117 valuable tool for efficiently and accurately evaluating water quality.
1118 Random Forest (RF) emerged as the optimal model for predicting various water
1119 quality parameters. These models exhibited the best performance in terms of accuracy
1120 and predictive power.
1121 The study highlighted the significant influence of environmental variables on water
1122 quality parameters.
1123 The findings underscore the practical utility of these models in water resource
1124 management and environmental conservation efforts. They enable a rapid and
1125 sufficiently accurate assessment of water quality based on the attributes of the watershed
1126 and its environmental variables.
1127
1128 Author Contributions: Conceptualization, M.K., B.J.A., S.L., E.K.N., M.H.-N. and D.R.;
1129 methodology, M.K.; software, M.K. and E.K.N.; validation, M.K., S.L., M.H.-N. and E.K.N.;
1130 formal analysis, M.K. and E.K.N.; investigation, M.K., B.J.A. and M.H.-N.; resources, B.J.A., S.L.,
1131 M.H.-N. and D.R.; data curation, M.K., S.L. and E.K.N.; writing—original draft preparation, M.K.,
1132 S.L., E.K.N. and M.H.-N.; writing—review and editing, M.K., S.L., E.K.N. and M.H.-N.; funding
1133 acquisition, D.R. All authors have read and agreed to the published version of the manuscript.
1134 Funding: This research received no external funding.
1135 Institutional Review Board Statement: Not Applicable.
1136 Informed Consent Statement: Not Applicable.
1137 Data Availability Statement: Not Applicable.
1138 Conflicts of Interest: The authors declare no conflict of interest.

1139

1140 References
1141 1. Dyer, F., Elsawah, S., Croke, B., Griffiths, R., Harrison, E., Lucena-Moya, P., Jakeman, A. (2013). The Effects of Climate Change
1142 On Ecologically-relevant Flow Regime and Water Quality Attributes. Stoch Environ Res Risk Assess, 1(28), 67-82.
1143 https://doi.org/10.1007/s00477-013-0744-8.
1144 2. Bisht, A., Singh, R., Bhutiani, R., Bhatt, A. (2019). Application Of Predictive Intelligence In Water Quality Forecasting Of the
1145 River Ganga Using Support Vector Machines. 206-218. https://doi.org/10.4018/978-1-5225-6210-8.ch009
1146 3. Liu, S., Ryu, D., Webb, J. A., Lintern, A., Guo, D., Waters, D., & Western, A. W. (2021). A multi-model approach to assessing
1147 the impacts of catchment characteristics on spatial water quality in the Great Barrier Reef catchments. Environmental
1148 Pollution, 288, 117337.
1149 4. Mazher, A. (2020). Visualization Framework for High-Dimensional Spatio-Temporal Hydrological Gridded Datasets using
1150 Machine-Learning Techniques. Water 12, no. 2: 590. https://doi.org/10.3390/w12020590.
1151 5. Nasir, N., Kansal, A., Alshaltone, O., Barneih, F., Sameer, M., Shanableh, A., & Al-Shamma'a, A. (2022). Water quality
1152 classification using machine learning algorithms. Journal of Water Process Engineering, 48, 102920.
1153 6. Wang, R., Kim, J. H., & Li, M. H. (2021). Predicting stream water quality under different urban development pattern scenarios
1154 with an interpretable machine learning approach. Science of the Total Environment, 761, 144057.
1155 7. Rodriguez-Galiano, V., Sanchez-Castillo, M., Chica-Olmo, M., & Chica-Rivas, M. J. O. G. R. (2015). Machine learning
1156 predictive models for mineral prospectivity: An evaluation of neural networks, random forest, regression trees and support
1157 vector machines. Ore Geology Reviews, 71, 804-818.
1158 8. Koranga, M., Pant, P., Kumar, T., Pant, D., Bhatt, A. K., & Pant, R. P. (2022). Efficient water quality prediction models based on
1159 machine learning algorithms for Nainital Lake, Uttarakhand. Materials Today: Proceedings.
1160 9. Kovačević, M.; Ivanišević, N.; Dašić, T.; Marković, L. Application of artificial neural networks for hydrological modelling in
1161 Karst. Gradjevinar 2018, 70, 1–10.
1162 10. Zhu, M., Wang, J., Yang, X., Zhang, Y., Zhang, L., Ren, H., Ye, L. (2022). A review of the application of machine learning in
1163 water quality evaluation. Eco-Environment & Health.
1164 11. Leggesse, E. S., Zimale, F. A., Sultan, D., Enku, T., Srinivasan, R., & Tilahun, S. A. (2023). Predicting Optical Water Quality
1165 Indicators from Remote Sensing Using Machine Learning Algorithms in Tropical Highlands of Ethiopia. Hydrology, 10(5),
1166 110.
1167 12. Zhu, Y., Liu, K., Liu, L., Myint, S. W., Wang, S., Liu, H., & He, Z. (2017). Exploring the potential of worldview-2 red-edge
1168 band-based vegetation indices for estimation of mangrove leaf area index with machine learning algorithms. Remote Sensing,
1169 9(10), 1060.
1170 13. Cai, J., Chen, J., Dou, X., & Xing, Q. (2022). Using machine learning algorithms with in situ hyperspectral reflectance data to
1171 assess comprehensive water quality of urban rivers. IEEE Transactions on Geoscience and Remote Sensing, 60, 1-13.
1172 14. Castrillo, M., & García, Á. L. (2020). Estimation of high frequency nutrient concentrations from water quality surrogates using
1173 machine learning methods. Water Research, 172, 115490.
1174 15. Gladju, J., Kamalam, B. S., & Kanagaraj, A. (2022). Applications of data mining and machine learning framework in
1175 aquaculture and fisheries: A review. Smart Agricultural Technology, 100061.
1176 16. Chen, K., Chen, H., Zhou, C., Huang, Y., Qi, X., Shen, R. , Ren, H. (2020). Comparative analysis of surface water quality
1177 prediction performance and identification of key water parameters using different machine learning models based on big
1178 data. Water research, 171, 115454.
1179 17. Zhang, H., Xue, B., Wang, G., Zhang, X., & Zhang, Q. (2022). Deep Learning-Based Water Quality Retrieval in an Impounded
1180 Lake Using Landsat 8 Imagery: An Application in Dongping Lake. Remote Sensing, 14(18), 4505.
1181 18. Jin, T., Cai, S., Jiang, D., & Liu, J. (2019). A data-driven model for real-time water quality prediction and early warning by an
1182 integration method. Environmental Science and Pollution Research, 26, 30374-30385.
1183 19. Uddin, M. G., Nash, S., Rahman, A., & Olbert, A. I. (2023). A novel approach for estimating and predicting uncertainty in
1184 water quality index model using machine learning approaches. Water Research, 229, 119422.
1185 20. Cheng, S., Cheng, L., Qin, S., Zhang, L., Liu, P., Liu, L., Wang, Q. (2022). Improved understanding of how catchment
1186 properties control hydrological partitioning through machine learning. Water Resources Research, 58(4), e2021WR031412.
1187 21. Krishnan, M. (2020). Against interpretability: a critical examination of the interpretability problem in machine learning.
1188 Philosophy & Technology, 33(3), 487-502.
1189 22. Kovačević, M.; Lozančić, S.; Nyarko, E.K.; Hadzima-Nyarko, M. Application of Artificial Intelligence Methods for Predicting
1190 the Compressive Strength of Self-Compacting Concrete with Class F Fly Ash. Materials 2022, 15, 4191.
1191 https://doi.org/10.3390/ma15124191
1192 23. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2009.
1193 24. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [CrossRef]
1194 25. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
1195 26. Kovačević, M.; Ivanišević, N.; Petronijević, P.; Despotović, V. Construction cost estimation of reinforced and prestressed
1196 concrete bridges using machine learning. Građevinar 2021, 73, 727.
1197 27. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput.
1198 Syst. Sci. 1997, 55, 119–139.
1199 28. Elith, J.; Leathwick, J.R.; Hastie, T. A working guide to boosted regression trees. J. Anim. Ecol. 2008, 77, 802–813.
1200 29. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
1201 30. Rasmussen, C.E.; Williams, C.K. Gaussian Processes for Machine Learning; The MIT Press: Cambridge, MA, USA, 2006.
1202 31. Fatehi, I., Amiri, B.J., Alizadeh, A. et al. Modeling the Relationship between Catchment Attributes and In-stream Water
1203 Quality. Water Resour Manage 29, 5055–5072 (2015). https://doi.org/10.1007/s11269-015-1103-y
1204 32. Dohner E, Markowitz A, Barbour M, Simpson J, Byrne J, Dates G (1997) Volunteer Stream Monitoring: A Methods Manual
1205 Environmental Protection Agency: Office of Water (EPA 841-B-97-003).
1206 33. Anderson JR (1976) A land use and land cover classification system for use with remote sensor data. Vol. 964. US Government
1207 Printing Office.
1208 34. Brakensiek D, Rawls W (1983) Agricultural management effects on soil-water Processes 2: Green and Ampt Parameters for
1209 Crusting Soils. Trans ASAE 26:1753–1757.
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224

1225 Appendix A

1226 • Na+ parameter
1227

1228 Table A1. Comparative analysis of results of different machine learning models.

Model RMSE MAE MAPE/100 R


Decision Tree 2.1568 0.7220 0.8754 0.4381
TreeBagger 1.6809 0.7675 1.0412 0.5813
Random Forest 1.6385 0.8772 1.3929 0.6780
Boosted Trees 1.8603 0.9529 1.5436 0.5024
GP exponential 2.5655 1.0096 1.8659 0.0642
GP Sq.exponential 2.8037 1.1203 2.1615 0.0133
GP matern 3/2 2.7865 1.0860 2.0261 0.0314
GP matern 5/2 2.8302 1.1018 2.0680 0.0240
GP Rat. quadratic 2.8037 1.1203 2.1615 0.0133
GP ARD exponential 3.3385 1.3350 2.8318 -0.0282
GP ARD Sq. exponential 3.6399 1.4212 2.5305 -0.0099
GP ARD matern 3/2 4.2629 1.5668 2.8746 0.0350
GP ARD matern 5/2 4.4450 1.6865 3.0962 0.0170
GP ARD Rat. quadratic 4.2855 1.5028 2.6716 0.0366
1229

1230 • Mg2+ parameter
1231

1232 Table A2. Comparative analysis of results of different machine learning models.

Model RMSE MAE MAPE/100 R


Decision Tree 0.6043 0.3313 0.3001 0.4911
TreeBagger 0.4048 0.2641 0.2735 0.7472
Random Forest 0.4020 0.2631 0.2717 0.7567
GP exponential 0.6173 0.3524 0.3443 0.4355
GP Sq.exponential 0.6711 0.4076 0.4150 0.3733
GP matern 3/2 0.6532 0.3786 0.3731 0.3915
GP matern 5/2 0.6633 0.3907 0.3875 0.3802
GP Rat. quadratic 0.6711 0.4076 0.4150 0.3733
GP ARD exponential 0.6925 0.4114 0.3927 0.3953
GP ARD Sq. exponential 0.7831 0.4364 0.4129 0.3459
GP ARD matern 3/2 0.6877 0.4295 0.4213 0.4068
GP ARD matern 5/2 0.7180 0.4323 0.4207 0.3371
GP ARD Rat. quadratic 0.7291 0.4207 0.4208 0.3987

1233

1234

1235 • Ca2+ parameter

1236

1237 Table A3. Comparative analysis of results of different machine learning models.

Model RMSE MAE MAPE/100 R


Decision Tree 0.7057 0.5504 0.2511 0.6804
TreeBagger 0.5949 0.4642 0.2054 0.7496
Random Forest 0.5847 0.4500 0.2007 0.7496
Boosted Trees 0.7730 0.6435 0.2808 0.5093
GP exponential 0.9379 0.5540 0.2225 0.3910
GP Sq.exponential 0.8241 0.5888 0.2566 0.5166


GP matern 3/2 0.7853 0.5538 0.2374 0.5662
GP matern 5/2 0.7989 0.5687 0.2447 0.5505
GP Rat. quadratic 0.8093 0.5755 0.2498 0.5364
GP ARD exponential 0.8347 0.6113 0.2617 0.5626
GP ARD Sq. exponential 0.8156 0.5878 0.2515 0.5497
GP ARD matern 3/2 0.7873 0.5772 0.2391 0.5873
GP ARD matern 5/2 0.7825 0.5809 0.2409 0.5948
GP ARD Rat. quadratic 0.9471 0.6581 0.2822 0.4985
1238

1239 • SO42− parameter

1240

1241 Table A4. Comparative analysis of results of different machine learning models.

Model RMSE MAE MAPE/100 R


Decision Tree 0.6585 0.3319 0.4713 0.6249
TreeBagger 0.5997 0.3228 0.5064 0.5535
Random Forest 0.5526 0.3122 0.5050 0.6148
Boosted trees 0.6421 0.4283 0.8503 0.5900
GP exponential 0.8183 0.4002 0.6450 0.3296
GP Sq.exponential 0.9090 0.4751 0.8683 0.1869
GP matern 3/2 0.8811 0.4453 0.7749 0.2489
GP matern 5/2 0.8930 0.4592 0.8144 0.2260
GP Rat. quadratic 0.9060 0.4713 0.8580 0.1918
GP ARD exponential 0.8579 0.4061 0.5984 0.2182
GP ARD Sq. exponential 1.0228 0.5025 0.8036 0.1891
GP ARD matern 3/2 0.9006 0.4476 0.8242 0.3506
GP ARD matern 5/2 0.8340 0.4127 0.7664 0.3954
GP ARD Rat. quadratic 0.8370 0.4631 0.8300 0.3486
1242

1243
1244

1245 • Cl− parameter
1246

1247 Table A5. Comparative analysis of results of different machine learning models.

Model RMSE MAE MAPE/100 R


Decision Tree 2.4687 0.8348 0.9213 0.4090
TreeBagger 1.9022 0.8556 0.6413 0.5878
Random Forest 1.8831 0.8316 0.8589 0.5964
Boosted Trees 2.2544 1.0919 1.3900 0.4431
GP exponential 2.9457 1.1626 1.8895 0.0196
GP Sq.exponential 3.2492 1.2872 2.1291 -0.0253
GP matern 3/2 3.2183 1.2545 2.0769 -0.0135
GP matern 5/2 3.2735 1.2793 2.1214 -0.0177
GP Rat. quadratic 3.2492 1.2871 2.1291 -0.0253
GP ARD exponential 3.8178 1.5359 2.7443 -0.0370
GP ARD Sq. exponential 4.1299 1.5817 2.4557 -0.0557
GP ARD matern 3/2 4.9069 1.8010 2.7176 -0.0523
GP ARD matern 5/2 5.3636 2.0316 3.6072 -0.0620
GP ARD Rat. quadratic 4.1299 1.5817 2.4557 -0.0557
1248

1249 • HCO3− parameter

1250

1251 Table A6. Comparative analysis of results of different machine learning models.

Model RMSE MAE MAPE/100 R


Decision Tree 0.6782 0.5473 0.2228 0.5477
TreeBagger 0.5231 0.4400 0.1875 0.7386
Random Forest 0.5174 0.4252 0.1822 0.7280
GP exponential 0.5056 0.4144 0.1782 0.7668
GP Sq.exponential 0.5803 0.4791 0.2006 0.6541
GP matern 3/2 0.5312 0.4309 0.1827 0.7287


GP matern 5/2 0.5437 0.4404 0.1859 0.7109
GP Rat. quadratic 0.5516 0.4473 0.1872 0.6994
GP ARD exponential 0.6389 0.5309 0.2225 0.5902
GP ARD Sq. exponential 0.5596 0.4529 0.1860 0.6829
GP ARD matern 3/2 0.5692 0.4750 0.2001 0.6773
GP ARD matern 5/2 0.5986 0.4949 0.2026 0.6417
GP ARD Rat. quadratic 0.5951 0.4967 0.2070 0.6340
1252

1253

1254

1255

1256 • K+ parameter
1257

1258 Table A7. Comparative analysis of results of different machine learning models.

Model RMSE MAE MAPE/100 R


Decision Tree 0.0308 0.0201 0.4365 0.3385
TreeBagger 0.0238 0.0172 0.3722 0.5919
Random Forest 0.0241 0.0166 0.3476 0.6024
GP exponential 0.0275 0.0181 0.4008 0.4880
GP Sq.exponential 0.0306 0.0205 0.4568 0.3393
GP matern 3/2 0.0292 0.0196 0.4344 0.4168
GP matern 5/2 0.0297 0.0199 0.4434 0.3924
GP Rat. quadratic 0.0299 0.0201 0.4486 0.3769
GP ARD exponential 0.0293 0.0192 0.4225 0.4182
GP ARD Sq. exponential 0.0301 0.0189 0.3988 0.3328
GP ARD matern 3/2 0.0311 0.0207 0.4621 0.3086
GP ARD matern 5/2 0.0307 0.0206 0.4682 0.3449
GP ARD Rat. quadratic 0.0315 0.0209 0.4728 0.3148

1259
1260 • pH parameter

1261

1262 Table A8. Comparative analysis of results of different machine learning models.

Model RMSE MAE MAPE/100 R


Decision Tree 0.3719 0.2457 0.0316 0.4241
TreeBagger 0.3558 0.2376 0.0303 0.3380
Random Forest 0.3531 0.2338 0.0298 0.3906
Boosted Trees 0.3817 0.2576 0.0330 0.3187
GP exponential 0.4155 0.2586 0.0330 -0.0197
GP Sq.exponential 0.4201 0.2622 0.0335 0.0009
GP matern 3/2 0.4183 0.2573 0.0328 -0.0002
GP matern 5/2 0.4172 0.2560 0.0326 0.0114
GP Rat. quadratic 0.4192 0.2613 0.0334 -0.0247
GP ARD exponential 0.4972 0.2970 0.0381 -0.0636
GP ARD Sq. exponential 0.5655 0.3499 0.0453 -0.0511
GP ARD matern 3/2 0.5025 0.3098 0.0400 0.0272
GP ARD matern 5/2 0.5096 0.3127 0.0403 -0.0118
GP ARD Rat. quadratic 0.4154 0.2497 0.0318 0.1367
1263

1264

1265

1266

1267 • EC parameter

1268

1269 Table A9. Comparative analysis of results of different machine learning models.

Model RMSE MAE MAPE/100 R


Decision Tree 352.6501 168.7962 0.3289 0.5627
TreeBagger 286.5049 151.5407 0.2797 0.7664
Random Forest 271.5346 149.3192 0.3013 0.7665
Boosted Trees 297.9335 170.2860 0.3620 0.6393


GP exponential 432.0779 200.1383 0.4037 0.2465
GP Sq.exponential 474.4095 227.5836 0.4762 0.1730
GP matern 3/2 467.6136 217.1865 0.4402 0.1915
GP matern 5/2 472.1707 221.6132 0.4518 0.1856
GP Rat. quadratic 474.4095 227.5836 0.4762 0.1730
GP ARD exponential 461.6100 216.0258 0.4514 0.2684
GP ARD Sq. exponential 674.4802 269.8017 0.5526 0.1942
GP ARD matern 3/2 631.0788 275.7947 0.5870 0.1756
GP ARD matern 5/2 674.8450 287.7216 0.5782 0.1310
GP ARD Rat. quadratic 470.4831 237.1822 0.4714 0.2560
1270

1271 • TDS parameter

1272

1273 Table A10. Comparative analysis of results of different machine learning models.

Model RMSE MAE MAPE/100 R


Decision Tree 509.6578 212.4422 0.5367 0.4610
TreeBagger 422.7209 199.4986 0.4718 0.5535
Random Forest 422.6822 196.7117 0.4578 0.5521
Boosted trees 457.8293 235.2232 0.6106 0.5367
GP exponential 617.8458 274.5861 0.7459 0.0383
GP Sq.exponential 663.3802 302.3803 0.8247 0.0276
GP matern 3/2 662.2055 297.1371 0.8050 0.0191
GP matern 5/2 666.2775 299.6533 0.8082 0.0207
GP Rat. quadratic 663.3802 302.3803 0.8247 0.0276
GP ARD exponential 765.3184 377.8162 1.1147 -0.0131
GP ARD Sq. exponential 818.4166 408.9820 1.1664 -0.0367
GP ARD matern 3/2 881.9864 460.2564 1.4128 -0.0318
GP ARD matern 5/2 828.6321 416.7358 1.3426 0.0524
GP ARD Rat. quadratic 785.2499 392.9596 1.2369 0.0238

1274
