Professional Documents
Culture Documents
6 Miljan Kovačević 1*, Bahman Jabbarian Amiri 2, Silva Lozančić 3, Marijana Hadzima-Nyarko 3, Dorin Radu 4,
7 Emmanuel Karlo Nyarko 5,
8 1
Faculty of Technical Sciences, University of Pristina, Knjaza Milosa 7, 38220 Kosovska Mitrovica, Serbia; miljan.kovacevic@pr.ac.rs (M.K)
9 2
Faculty of Economics and Sociology, Department of Regional Economics and the Environment, 3/5 P.O.W. Street, 90-255 Lodz, Poland;
10 bahman.amiri@uni.lodz.pl (B.J.A)
11 3
Faculty of Civil Engineering and Architecture Osijek, Josip Juraj Strossmayer University of Osijek,
12 Vladimira Preloga 3, 31000 Osijek, Croatia; mhadzima@gfos.hr, lozancic@gfos.hr (S.L.); (M.H-N., S.L))
13 4
Faculty of Civil Engineering, Transilvania University of Brașov, Romania; dorin.radu@unitbv.ro (D.R)
14 5
Josip Juraj Strossmayer University of Osijek, Faculty of Faculty of Electrical Engineering, Computer Science
15 and Information Technology, Kneza Trpimira 2B, 31000 Osijek, Croatia; karlo.nyarko@ferit.hr (E.K.N.)
16
17
18 * Correspondence: miljan.kovacevic@pr.ac.rs (M.K); Tel.: +381 606173801.
19 Abstract: In order to be able to manage water quality, it is necessary to determine the relationship
20 between the physical attributes of the catchment, such as geological permeability and hydrologic
21 soil groups, and in-stream water quality parameters. The paper gives relations for 11 in-stream
22 water quality parameters SAR , Na+, Mg2+, Ca2+, SO42−, Cl−, HCO3−, K+ , pH, EC, and TDS using
Citation: To be added by editorial
23 staff during production.
machine learning methods. For each parameter, RMSE, MAE, R, and MAPE accuracy criteria were
24 used to assess the accuracy of each model, and the optimal model was defined. Furthermore, the
25 Academic Editor: Firstname most significant influencing variables were determined when modeling each parameter. Research
Lastname
26 has shown that with methods based on artificial intelligence, a quick and sufficiently accurate
27 Received: date water quality assessment is possible, depending on the attributes of the watershed.
Revised: date
28 Accepted: date Keywords: machine learning; water quality; land use; land cover; hydrologic soil groups;
29 Published: date geological permeability
30
31 1. Introduction
32 River water quality plays a crucial role in ensuring the sustainability and health of
33 freshwater ecosystems. Traditional monitoring methods often have spatial and temporal
34 coverage limitations, leading to difficulties in effectively assessing and managing water
35 quality [1]. However, recent developments in machine learning techniques have
36 indicated the potential to predict water quality accurately based on catchment
37 characteristics [1, 2]. One of the underlying reasons for the growing interest in applying
38 machine learning techniques to predict river water quality is the ability to
39 simultaneously consider a wide range of catchment characteristics. These characteristics
40 include various features that affect water quality, including land use, soil properties,
41 climate data, topography, and hydrological characteristics [3]. These variables interact in
42 complex ways, and their associations may not easily be distinguished using prevalent
43 analytical methods. A more comprehensive understanding of the factors influencing
44 water quality can be achieved by applying machine learning algorithms, which can
45 disclose hidden patterns and capture nonlinear relationships in large and diverse
46 datasets [4].
47 The performance of supervised machine learning algorithms has been proved by
48 recent studies in predicting river water quality [5]. These algorithms are trained on
49 historical water quality data, in line with associating catchment characteristics, to
50 explore patterns and relationships between them [6]. Researchers have been able to
51 develop accurate predictive models by applying algorithms such as decision trees,
52 random forests, support vector machines, and neural networks [7-9]. Water quality
53 parameters, such as nutrient concentrations, pollutant levels, and biological indicators,
54 can then be applied by these models to predict using catchment characteristics [10]. Such
55 predictions can help determine potential pollution points, prioritise management
56 actions, and support decision-making processes for water resource management.
57 In addition to catchment characteristics, integrating remote sensing data has
58 emerged as a valuable tool by which the accuracy of water quality predictions can
59 significantly be enhanced [11, 12] because the spatially-explicit information about land
60 cover/land use, vegetation, and surface characteristics can be provided by remote
61 sensing techniques, which include satellite imagery and aerial photographs. Researchers
62 can improve the predictive performance of machine learning models by combing remote
63 sensing data with catchment characteristics [13]. For example, satellite data can provide
64 insight into vegetation dynamics, land use/land cover changes, and the extent of
65 impervious surfaces, which encompass urban and semi-urban areas, all of which can
66 influence water quality. Integrating these additional spatial data sources into the
67 machine learning models can result in more accurate and spatially explicit predictions,
68 allowing a more comprehensive assessment of water quality dynamics across large river
69 basins.
6 Toxics 2023, 11, x FOR PEER REVIEW 3 of 66
7
110 predictive capacity and allow for more accurate short-term and long-term water quality
111 forecasts, various temporal factors, such as seasonal variations, climate change, and
112 short-term events such as precipitation events or pollution incidents, affect river water
113 quality [18].
114 Machine learning models typically provide point predictions, but quantifying and
115 communicating uncertainties associated with these predictions is crucial for decision-
116 making and risk assessment [19]. Developing methods to quantify and propagate
117 uncertainties through the modelling process, considering sources of uncertainty such as
118 data quality, model structure, and parameter estimation, would enhance the reliability
119 and applicability of predictive models.
120 In addition, models developed for a particular catchment cannot be applied directly
121 to another due to variations in catchment characteristics, land use/cover, soil, geology
122 and climatic conditions. The development of transferable models that can account for
123 specific variation in catchment while taking general patterns would be valuable for the
124 management of water resources on a larger scale [20]. On the other hand, capturing
125 characteristics related to human activities, such as agricultural practices, urbanisation
126 and industrial activities, have a significant impact on water quality. Incorporating socio-
127 economic factors into machine learning models can improve their predictive power and
128 enable more comprehensive water quality assessments. However, the integration of
129 socio-economic data and understanding the complex interactions between human
130 activity and water quality present challenges that must be addressed.
131 It should be noted that the need for greater ambiguity and transparency in machine
132 learning models can limit the adoption and acceptance of these models by policymakers
133 and stakeholders [21]. The development of logical machine learning techniques that
134 provide insights into the model decision-making process and highlight the most
135 influential catch characteristics would improve the reliability and usability of predictive
136 models.
137 Addressing these gaps requires interdisciplinary collaborations among
138 hydrologists, ecologists, data scientists, policymakers, and other relevant stakeholders.
139 Furthermore, focussing on data-driven approaches, data-sharing initiatives, and
140 advances in computational methods will be critical to advancing the field and
141 harnessing the full potential of machine learning in predicting river water quality based
142 on catchment characteristics. In the study, we intend to address how we can address
143 spatial variations in the characteristics of the catchment to explain the river water quality
144 using machine learning techniques.
145 This paper explores the relationship between catchment attributes and in-stream
146 water quality parameters using machine learning methods. It evaluates model accuracy
147 with RMSE, MAE, R, and MAPE, identifies optimal models for 11 parameters, and
148 determines significant influencing variables. The study highlights the potential of
10 Toxics 2023, 11, x FOR PEER REVIEW 5 of 66
11
149 artificial intelligence for quick and accurate water quality assessment, tailored to
150 watershed attributes.
180
181
182 Figure 1. The partitioning of input space into distinct regions and the representation of a 3D
183 regression surface within a regression tree [22].
184 The process of constructing regression trees involves determining the optimal split
185 variable ("j") and split point ("s") to partition the input space effectively. These variables
186 are chosen by minimizing a specific expression (Equation 1) that considers all input
187 features. The goal is to minimize the sum of squared differences between observed
188 values and predicted values in resulting regions [23 -25].
202 averaging predictions from multiple models, making it particularly effective when the
203 base models are unstable or prone to overfitting [23-25].
204 In Bagging, multiple training sets are generated by repeatedly selecting samples
205 from the original dataset, and this process involves sampling with replacement. This
206 technique is utilized to create diverse subsets of data. The primary goal is to reduce the
207 variance in the model's predictions by aggregating the results from these different
208 subsets. Consequently, each subset contributes to the final prediction, and the averaging
209 of multiple models enhances the model's robustness and predictive accuracy.
210
211
212
213 Figure 2. Creating regression tree ensembles using bagging approach [26].
214 Random Forests, a variant of Bagging, stands out by introducing diversity among
215 the constituent models within the ensemble. This diversity is achieved through the
216 creation of multiple regression trees, each trained on a distinct bootstrap sample from
217 the data. Moreover, before making decisions at each split within these trees, only a
218 randomly selected subset of available features is considered. This approach helps in
219 decorrelating the individual trees within the ensemble, thereby further reducing
220 variance. The ensemble's final prediction is generated by aggregating the predictions
221 from these decorrelated trees, resulting in a robust and high-performing model [22–25].
16 Toxics 2023, 11, x FOR PEER REVIEW 8 of 66
17
222 Boosting Trees is a sequential training method where the addition of each new
223 regression tree to the ensemble is aimed at enhancing the overall model's performance
224 (Figure 3). In the context of Gradient Boosting, a widely utilized Boosting technique,
225 submodels are introduced iteratively. These submodels are selected based on their
226 ability to best estimate the residuals or errors of the previous model in the sequence. As
227 this iterative process continues, a definitive ensemble model emerges, composed of
228 multiple submodels that collectively provide highly accurate predictions.
229
230
231
232 Figure 3. The application of gradient boosting within regression tree ensembles [26].
( )
2
−( x−x )
'
k ( x , x )=σ exp
' 2
(2) 2
2l
265
266 In this expression, several elements are critical:
2
267 σ represents the signal variance, a model parameter that quantifies the overall
268 variability or magnitude of the function values.
269 The exponential function "exp" is used to model the similarity between x and x'. It
270 decreases as the difference between x and x' increases, capturing the idea that values
271 close to each other are more strongly correlated.
272 The parameter l, known as the lengthscale, is another model parameter. It controls
273 the smoothness and spatial extent of the correlation. A smaller l results in more rapid
274 changes in the function, while a larger l leads to smoother variations.
275 The observations in a dataset y= { y 1 , . . ., y n } can be viewed as a sample from a
276 multivariate Gaussian distribution.
T
(3) ( y 1 , .. . , y n ) N ( μ , Κ) ,
277
278 Gaussian processes are employed to model the relationship between input variables
2
279 x and target variable y, considering the presence of additive noise ε Ν ( 0 , σ ) .. The
280 goal is to estimate the unknown function f(∙). The observations y are treated as a sample
281 from a multivariate Gaussian distribution with mean vector μ and covariance matrix Κ.
282 This distribution captures the relationships among the data points. The conditional
¿ T
283 distribution of a test point's response value y , given the observed data y=( y 1 , . . ., y n )
¿ ¿2
284 y , σ^ ) with:
, is represented as N ( ^
20 Toxics 2023, 11, x FOR PEER REVIEW 10 of 66
21
(4) ^y ¿ =μ ( x¿ ) + K ¿T K −1 ( y−μ ) ,
2 ¿T −1 ¿
(5) σ^ ¿2=K ¿∗¿+σ −K K K ¿
.
285
2
286 In traditional GPR, a single length-scale parameter (l) and signal variance (σ ) are
287 used for all input dimensions. In contrast, the ARD approach employs a separate length-
288 scale parameter (l i) for each input dimension, where 'i' represents a specific dimension.
289 This means that for a dataset with 'm' input dimensions, you have 'm' individual length-
290 scale parameters [30].
291 The key advantage of ARD is that it automatically determines the relevance of each
292 input dimension in the modeling process. By allowing each dimension to have its own
293 length scale parameter, the model can assign different degrees of importance to each
294 dimension. This means that the model can adapt and focus more on dimensions that are
295 more relevant to the target variable and be less influenced by less relevant dimensions.
317
318 Figure 4. The study region and catchment areas situated within the southern Caspian Sea basin.
319 A land cover dataset was created using a 2002 digital land cover map (Scale
320 1:250,000) from the Forest, Ranges, and Watershed Management Organization of Iran
321 (http://frw.org.ir). The original land cover categories were consolidated into six classes:
322 bare land, water bodies, urban areas, agriculture, rangeland, and forests, following [32]
323 land use and land cover classification systems. Furthermore, digital geological and soil
324 feature maps (1:250,000 scale) were obtained from the Geological Survey of Iran
325 (www.gsi.ir). Detailed information about the characteristics of the catchments and their
326 statistical attributes can be found in Table 1 and Table 2.
327 In this study, hydrologic soil groups and geological permeability classes were
328 developed and applied in conjunction with land cover types within the modeling
329 process. Hydrologic soil groups are influenced by runoff potential and can be used to
330 determine runoff curve numbers. They consist of four classes (A, B, C, and D), with A
24 Toxics 2023, 11, x FOR PEER REVIEW 12 of 66
25
331 having the highest runoff potential and D the lowest. Notably, soil profiles can undergo
332 significant alterations due to changes in land cover. In such cases, the soil textures of the
333 new surface soil can be employed to determine the hydrologic soil groups as described
334 in Table 1 [34]. Furthermore, the study incorporates the application of geological
335 permeability attributes related to catchments, with the development of three geological
336 permeability classes: Low, Medium, and High. These classes are associated with various
337 characteristics of geological formations, such as effective porosity, cavity type and size,
338 their connectivity, rock density, pressure gradient, and fluid properties like viscosity.
339
340
√
N
1
(6) RMSE= ∑
N k=1
(
2
d k −ok ) ,
370
371 The MAE criterion represents the mean absolute error of the model, emphasizing
372 the absolute accuracy. It calculates the average absolute differences between the actual
373 values and the model's predictions.
N
1
(7) MAE= ∑ |d k −o k|.
N k=1
374
375 Pearson's linear correlation coefficient R provides a relative measure of accuracy
376 assessment. It considers the correlation between the actual values ( d k ) and the model's
377 predictions (o k ) relative to their respective means (d and o ). Values of R greater than
378 0.75 indicate a strong correlation between the variables.
(2) R=√ ¿ ¿ ¿
379
380 The mean absolute percentage error (MAPE) is a relative criterion that evaluates
381 accuracy by calculating the average percentage differences between the actual values
382 and the model's predictions.
| |
d k −o k
N
100
(2) MAPE= ∑
N k=1 d k
.
383 4. Results
384 The paper analyzes the application of regression trees, bagging, gradient boosting
385 and Gaussian random process models using a systemic approach (Figure 5). For each of
386 the models, the hyperparameters of the model were varied in the appropriate range. The
387 following values were analyzed:
388
389 Regression trees model
390
391 The depth of a regression tree is a crucial factor in preventing overfitting, which
392 occurs when the tree becomes too detailed and fits the training data too closely, as well
393 as underfitting, which happens when the tree is too simplistic and fails to capture the
394 underlying patterns. Table 3 illustrates the impact of the minimum leaf size on the
395 accuracy of the regression tree (RT) model. It shows how changing the minimum leaf
30 Toxics 2023, 11, x FOR PEER REVIEW 15 of 66
31
396 size affects key performance metrics such as RMSE (Root Mean Square Error), MAE
397 (Mean Absolute Error), MAPE (Mean Absolute Percentage Error), and R (Coefficient of
398 Determination).Accordingly, the leaf size is almost positively associated with the RMSE
399 but inversely correlated with R values.
400
401 TreeBagger (TB) model
402
403 1. Number of generated trees (B): Investigated values up to a maximum of 500
404 trees, with 100 as the standard setting in Matlab. The bootstrap aggregation method was
405 employed, generating a specific number of samples in each iteration. The study
406 considered an upper limit of 500 trees.
407 2. Minimum number of data/samples per tree leaf: Analyzed values ranged
408 from 2 to 15 samples per leaf, with a step size of 1 sample. The standard setting in
409 Matlab is 5 samples per leaf for regression, but here, a broader range was examined to
410 assess its impact on model generalization.
411
412 Random Forest (RF) model:
413
414 1. Number of generated trees (B): Analyzed within a range of 100 to 500 trees,
415 with 100 as the standard setting in Matlab. Cumulative MSE values for all base models
416 in the ensemble were presented. Bootstrap aggregation was used to create trees,
417 generating 181 samples per iteration. The study explored an extended ensemble of up to
418 500 regression trees, aligning with recommended Random Forest practices.
419 2. Number of variables used for tree splitting: Based on guidance, the study
420 selected a subset of approximately √ p predictors for branching, where p is the number
421 of input variables. With 16 predictors, this translated to a subset of 4 variables, but in
422 those research, a wider number of variables, ranging from 1 to 16, is investigated.
423 3. Minimum number of samples per leaf: The study considered values from 2 to
424 10 samples per tree leaf, with a 1-sample increment.
425 Boosted trees model:
426 1. Number of generated trees (B): Analyzed within a range of 1 to 100 trees.
427 2. Learning rate (λ): Explored a range, including 0.001, 0.01, 0.1, 0.5, 0.75, and 1.0.
428 3. Tree splitting levels (d): Analyzed from 1 (a decision stump) to 2^7=128 in an
429 exponential manner.
430
431 Gaussian process regression (GPR) model
432
433 With the GPR method, the application of different kernel functions:
434 1. Exponential, quadratic exponential, Mattern 3/2, Mattern 5/2, rational quadratic
32 Toxics 2023, 11, x FOR PEER REVIEW 16 of 66
33
435 2. ARD Exponential, ARD quadratic exponential, ARD Mattern 3/2, ARD Mattern
436 5/2, ARD rational quadratic
437 All models were evaluated in terms of optimality in terms of mean square error,
438 and then the optimal model obtained from all the analysed ones was evaluated on the
439 test data using the RMSE, MAE, MAPE and R criteria.
440 In the paper, a detailed procedure is illustrated for determining the optimal model
441 for the prediction of the SAR parameter. In contrast, for all other models for the
442 prediction, it is given in a more concise form. Accompanying results for other models,
443 except the SAR parameter, can be found in the appendix of the paper.
34 Toxics 2023, 11, x FOR PEER REVIEW 17 of 66
35
444
36 Toxics 2023, 11, x FOR PEER REVIEW 18 of 66
37
446
Table 3. Influence of minimum leaf size on regression tree (RT) model accuracy.
465 Figure 6. Optimal individual model for SAR parameter prediction based on regression tree.
466 Among the optimal models in this group, the random forest models with 500
467 generated trees proved to be the best. In contrast, the model that uses a subset of 8
468 variables and has a minimum number of data per terminal leaf equal to 1 has a higher
469 accuracy according to the RMSE and R criteria, while the model that uses a subset with
470 six variables and has a minimum number of data per terminal sheet equal to 1 and has a
471 higher accuracy according to MAE and MAPE criteria (Table 4).
472
473 Table 4. Accuracy of obtained models for SAR parameter prediction according to defined criteria.
a) b)
c) d)
474 Figure 7. Comparison of different accuracy criteria for RF model for SAR parameter as a function of the number of randomly
475 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.
476
477 With the BT model (Figure 8), it was shown that the highest accuracy is obtained by
478 applying complex models with a large number of branches. The optimal obtained model
479 had a structure of 32 branches and a ya Learning Rate value equal to 0.01.
480
42 Toxics 2023, 11, x FOR PEER REVIEW 21 of 66
43
481 Figure 8. Dependence of the MSE value on the reduction parameter λ and the number of trees (base models) in the Boosted Trees
482 model for SAR parameter:
Table 5. Values of optimal parameters in GPR models with different covariance functions.
Exponential
k ( ( xi , x j|Θ ) ) =σ f exp
2
[ −1 r
2 σ l2 ]
σ l=¿111.9371 σ f =¿1.8001
[ ]
Squared Exponential T
−1 ( x i−x j ) ( x i−x j )
k ( ( xi , x j|Θ ) ) =σ exp
2
f 2
2 σl
44 Toxics 2023, 11, x FOR PEER REVIEW 22 of 66
45
σ l=¿8.3178 σ f =¿1.2040
Matern 3/2
k ( ( xi , x j|Θ ) ) =σ 2f 1+ √
3r
σl
exp √
(
− 3r
σl ) [ ]
σ l=¿14.7080 σ f =¿1.4023
Matern 5/2
k ( ( xi , x j|Θ ) ) =σ f 1+
2
( √ 5 r + 5 r 2 exp −√ 5 r
σl 2
3 σl ) [ σl ]
σ l=¿9.9890 σ f =¿1.1947
( )
2 −α
r
k ( ( xi , x j|Θ ) ) =σ f 1+
2
2
; r =0
Rational Quadratic 2 a σl
σ l=¿8.3178 a=¿3156603.8854 σ f =¿1.2040
493 where r = √( x − x ) ( x −x ).
i j
T
i j
Table 6. Values of optimal parameters in GPR ARD models with different covariance functions.
√
2
k ( ( xi , x j|Θ ) ) =σ exp (−r ) ;σ F =1.; r =
d
( x ℑ−x jm)
∑
2
f
2
m =1 σm
119.8634
300.3391
158.5921
109.1181
121.9050
114.0696
245.9937
29.7051
15.3385
89.5438
6.9058
[ ]
2
−1
d
( x ℑ−x jm )
k ( ( xi , x j|Θ ) ) =σ exp
2
f ∑
2 m=1 σm
2
; σ f =1.0577
14.6207
15.1518
32.6135
4.9648
2.3307
1.1875
3.4365
Table 6 (continued). Values of optimal parameters in GPR ARD models with different covariance functions.
46 Toxics 2023, 11, x FOR PEER REVIEW 23 of 66
47
131.4068
32.1227
28.2837
62.5967
9.2294
9.6482
4.7958
4.3381
ARD Matern 5/2:
( )
2
5r
k ( ( xi , x j|Θ ) ) =σ f 1+ √ 5 r + exp [ −√ 5 r ] ; σ f =1.0452
2
3
11.2644
30.9904
3.3999
3.4057
7.7608
ARD Rational quadratic:
( )
2 −α
1
d
( x ℑ−x jm )
k ( ( xi , x j|Θ ) ) =σ 1+
2
f ∑
2 α m=1 σm
2
; α =0.7452; σ f =1.07
18.4361
30.2548
2.2446
0.7733
√
2
d
( x ℑ−x jm)
494 where
r= ∑ σm
2 .
m =1
495 • Comparative analysis of results
511
512 Figure 9. Significance of individual variables for SAR parameter prediction in optimal RF model.
513 The parameter of Hydrometric Station Elevation (HSE) is influential in SAR predic-
514 tion with an importance value of 0.188610351. Higher elevations often correspond to
515 different geological and hydrological conditions. For example, at higher elevations, there
516 might be different soil types, vegetation, and land use patterns. These variations can sig-
517 nificantly impact SAR values. Typically, elevated hydrometric stations are associated
518 with more complex terrain and geological features. The positive importance value im-
519 plies that water sources at higher elevations may have distinct characteristics or geologi-
520 cal properties that result in higher SAR values. These may include factors such as min-
521 eral content, weathering processes, or the influence of precipitation at higher altitudes.
522 The Catchment Area (CA) is a critical factor in SAR prediction, as indicated by its
523 substantial importance value of 0.335748728. A larger catchment area implies a more ex-
524 tensive land area contributing to the water source. This expansive catchment area results
525 in more significant dilution effects on water chemistry. The greater the catchment area,
526 the more effectively it reduces the concentration of ions in the water. This dilution effect
527 leads to more stable and predictable SAR values. It means that using catchment area pa-
528 rameter catchment areas with a larger geographical footprint has the potential to miti-
529 gate variations in SAR, making this parameter vital for accurate predictions.
530 Forested areas, denoted by the parameter Percentage of Forest (F), have a notable
531 positive influence on SAR predictions (importance: 0.260431787). Forests often contain
532 well-developed organic matter in the soil, which acts as a buffer to ions. This organic
533 matter helps to regulate the concentration of ions, potentially resulting in more stable
534 and predictable SAR values. Additionally, forested areas are typically associated with
535 lower human activity and reduced introduction of additional ions into the water. The
52 Toxics 2023, 11, x FOR PEER REVIEW 26 of 66
53
536 high importance value highlights the significance of forest cover in enhancing the preci-
537 sion of SAR predictions.
538 The presence of Rangeland (RL) plays a crucial role in SAR predictions, with an im-
539 portance value of 0.261470407. Rangeland areas often feature various types of vegetation
540 that can influence the ion content in water. This influence helps to stabilize SAR values.
541 Furthermore, the low-intensity land use associated with rangeland areas results in fewer
542 ion inputs compared to more heavily developed areas. As a result, rangeland cover is a
543 positive factor in SAR predictions, contributing to the accuracy and reliability of the
544 models.
545 Urban areas, represented by the Percentage of Urban Area (UA), have a significant
546 impact on SAR predictions, as indicated by the high importance value of 0.292730718.
547 Urbanization introduces various pollutants, including salts, into the water. These pollu-
548 tants can significantly influence SAR, potentially resulting in higher SAR values in ur-
549 banized areas. The high importance of this parameter emphasizes the need to consider
550 urban land use when predicting SAR, as it can lead to substantial variations in water
551 chemistry.
552 The Percentage of Agricultural Area (AA) has a substantial positive impact on SAR
553 predictions, with an importance value of 0.342586975. Agricultural areas often involve
554 various practices such as irrigation, fertilization, and pesticide application, which con-
555 tribute to the ion and salt content in the water. These agricultural activities significantly
556 influence SAR, potentially leading to higher SAR values. The high importance of this pa-
557 rameter underscores the need to consider agricultural land use when predicting SAR, as
558 it can have a profound effect on water chemistry.
559 Soil with characteristics represented by Hydrological Soil Group B (SHGB) has a
560 positive impact on SAR predictions, as reflected in its importance value of 0.235552135.
561 The soil in this group may include silt loam or loam, which has properties that facilitate
562 the retention of ions and salts. As a result, this soil group positively affects SAR values,
563 contributing to more reliable predictions.
564 Geological Permeability High (GHGN) is a critical factor in SAR predictions, with
565 an importance value of 0.280770699. High geological permeability affects how water in-
566 teracts with subsurface minerals, potentially leading to ion leaching or retention. This
567 parameter plays a crucial role in predicting SAR due to its substantial importance, em-
568 phasizing the significance of geological properties in understanding and predicting SAR
569 values.
570
571 4.2 Prediction of Na+ parameter values:
572 RF models proved to be the optimal models for predicting Sodium ion (Na+)
573 concentrations, while the analysis of all models in terms of accuracy is given in the
574 appendix. The dependence of the adopted accuracy criteria on the model parameters is
54 Toxics 2023, 11, x FOR PEER REVIEW 27 of 66
55
575 shown in Figure 10. Based on the defined accuracy criteria, four models with the
576 following criteria values were selected (Table 8).
577 Table 8. Accuracy of obtained models for Na+ parameter prediction according to defined criteria.
578
579
580 The "Weighted Sum Model" or "Simple Multi-Criteria Ranking" method was used
581 to select the optimal model. For the minimization objectives (RMSE, MAE, MAPE), Min-
582 Max normalization is applied, and for the maximization objective (R), Max-Min
583 normalization is applied to ensure that all metrics are on the same scale. Equal weights
584 are assigned to the normalized evaluation metrics to indicate their relative importance in
585 your decision-making process. The weighted sum method calculated an aggregated
586 value for each model, which considers all four normalized metrics. All models are
587 ranked based on their aggregated values, with the lower aggregated value indicating
588 better overall performance (Table 9).
589 Table 9. Determining the optimal prediction model for Na+ parameter using Simple Multi-Criteria Ranking
c) d)
591 Figure 10. Comparison of different accuracy criteria for RF model for Na+ parameter as a function of the number of randomly
592 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.
593 As the optimal model, the RF model with 500 trees was obtained, which uses a
594 subset of 6 variables, where the minimum number of data per sheet is 6. The assessment
595 of the significance of individual input variables on the accuracy of the prediction was
596 performed precisely on the obtained model with the highest accuracy (Figure 11).
597
58 Toxics 2023, 11, x FOR PEER REVIEW 29 of 66
59
598
599 Figure 11. Significance of individual variables for Na+ parameter prediction in optimal RF model
600 The significant importance value of Catchment Area (CA) (0.204129833) indicates
601 that the size of the catchment area has a strong influence on Na+ concentrations. A larger
602 catchment area can encompass a wider range of geological features, influencing mineral
603 content in water and, subsequently, Na+ levels. The highly positive importance of CA
604 underscores its crucial role in shaping Na+ predictions.
605 The highly positive importance value for Forest (F) (0.698694744) highlights the
606 substantial impact of forested areas on Na+ concentrations. Forests can contribute to
607 lower Na+ levels due to the presence of organic matter that can buffer ions and minerals
608 in the water. This buffering effect results in stable and predictable Na+ concentrations in
609 forested regions.
610 Rangeland (RL) positively influences Na+ predictions, as indicated by its impor-
611 tance value (0.201545333). Rangeland areas may contain various types of vegetation that
612 stabilize Na+ levels by influencing ion content in the water. The lower-intensity land use
613 in rangeland areas leads to fewer ion inputs, enhancing the precision of Na+ predictions.
614 The positive importance value of the percentage of Agricultural Area (AA)
615 (0.262775223) highlights the significant influence of agricultural areas on Na+ predic-
616 tions. Agriculture involves practices such as irrigation, fertilization, and pesticide appli-
617 cation, contributing to ion and mineral content in the water. The high importance of AA
618 underscores the need to consider agricultural land use when predicting Na+ levels.
619 Soil characteristics represented by Hydrological Soil Group B (SHGB) (0.168774875),
620 including silt loam or loam, have a moderate positive impact on Na+ predictions. Soils in
621 this group may retain sodium ions and salts, contributing to the overall accuracy of Na+
622 predictions.
60 Toxics 2023, 11, x FOR PEER REVIEW 30 of 66
61
623 Hydrological Soil Group C (SHGC) positively influences Na+ predictions, with an
624 importance value of 0.109796372. This group includes soils like sandy clay loam that can
625 influence Na+ concentrations by affecting ion leaching, making it a valuable parameter in
626 understanding and predicting Na+ levels.
627
628 4.3 Prediction of Magnesium (Mg2+) parameter values:
629 RF models proved to be the optimal models for predicting Sodium ion (Mg2+)
630 concentrations. An analysis of all models in terms of accuracy is given in the appendix.
631 The dependence of the adopted accuracy criteria on the model parameters is shown in
632 Figure 12. Based on the defined accuracy criteria, 3 models with the following values
633 were selected (Table 10).
634
a) b)
c) d)
62 Toxics 2023, 11, x FOR PEER REVIEW 31 of 66
63
635 Figure 12. Comparison of different accuracy criteria for RF model for Mg2+ parameter as a function of the number of randomly
636 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.
637 Table 10. Accuracy of obtained models for Mg2+ prediction according to defined criteria.
640 Table11. Determining the optimal prediction model for Mg2+ parameter using Simple Multi-Criteria Ranking.
645
646 Figure 13. Significance of individual variables for Mg2+ parameter prediction in optimal RF model
647 The positive importance value of Catchment Area (CA) (0.163764386) suggests that
648 the size of the catchment area has a moderate influence on Mg2+ concentrations. A larger
649 catchment area can encompass a wider range of geological features, influencing mineral
650 content in water and, subsequently, Mg2+ levels. The positive importance of CA indicates
651 its role in shaping Mg2+ predictions.
652 The highly positive importance value for Forest (F) (0.590270347) highlights the
653 substantial impact of forested areas on Mg2+ concentrations. Forests can contribute to
654 higher Mg2+ levels due to the decomposition of organic matter and the release of magne-
655 sium ions. This effect results in stable and predictable Mg2+ concentrations in forested re-
656 gions.
657 Rangeland (RL) positively influences Mg2+ predictions, as indicated by its impor-
658 tance value (0.302241325). Rangeland areas may contain various types of vegetation that
659 stabilize Mg2+ levels by influencing ion content in the water. The lower-intensity land use
660 in rangeland areas leads to fewer ion inputs, enhancing the precision of Mg2+ predic-
661 tions.
662 Urban areas, represented by Percentage of Urban Area (UA), significantly affect
663 Mg2+ concentrations, as suggested by the importance value (0.29234405). Urbanization
664 introduces various pollutants and ions into the water, potentially leading to higher Mg2+
665 concentrations.
666 The positive importance value of the percentage of Agricultural Area (AA)
667 (0.200998202) highlights the significant influence of agricultural areas on Mg2+ predic-
66 Toxics 2023, 11, x FOR PEER REVIEW 33 of 66
67
668 tions. Agriculture involves practices such as irrigation, fertilization, and pesticide appli-
669 cation, contributing to ion and mineral content in the water. The high importance of AA
670 underscores the need to consider agricultural land use when predicting Mg2+ levels.
671 Soil characteristics represented by Hydrological Soil Group B (SHGB) (0.180483735),
672 including silt loam or loam, have a moderate positive impact on Mg2+ predictions. Soils
673 in this group may retain magnesium ions and salts, contributing to the overall accuracy
674 of Mg2+ predictions.
675 Geological Permeability represented by Geological hydrological group N (GHGN)
676 (0.2735) with their importance suggests that areas or regions with low permeability geo-
677 logical features have a significant positive impact on Mg2+ levels in water.
678 4.4 Prediction of Ca2+ parameter values:
679 RF models proved to be the optimal models for Ca2+ (calcium ion concentration). An
680 analysis of all models in terms of accuracy is given in the appendix. The dependence of
681 the adopted accuracy criteria on the model parameters is shown in the Figure 14.
682 According to all the defined accuracy criteria, only 1 model stood out with values for
683 RMSE, MAE, MAPE and R of 0.5847, 0.4500, 0.2007, 0.7496, respectively.
684
a) b)
c) d)
68 Toxics 2023, 11, x FOR PEER REVIEW 34 of 66
69
R=0.7496
685 Figure 14. Comparison of different accuracy criteria for RF model for Ca2+ parameter as a function of the number of randomly
686 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.
687 The assessment of the significance of individual input variables on the accuracy of
688 the prediction was performed precisely on the obtained model with the highest accuracy
689 (Figure 15).
690 The positive importance value of HSE (0.197664268) suggests that higher elevations
691 have a significant impact on predicting Ca2+ concentrations. Elevation can influence Ca2+
692 levels due to several factors. For example, in mountainous regions, the geological com-
693 position can change with elevation, affecting the calcium content in the surrounding soil
694 and water. Additionally, higher elevations may experience different weather patterns
695 and precipitation, which can leach calcium from rocks and soil. Therefore, the positive
696 importance of HSE highlights the need to consider the elevation of monitoring stations
697 when predicting Ca2+ concentrations.
70 Toxics 2023, 11, x FOR PEER REVIEW 35 of 66
71
698
699 Figure 15. Significance of individual variables for Ca2+ parameter prediction in optimal RF model.
700 The importance value of Catchment Area (0.317181573) indicates that the size of the
701 catchment area significantly influences Ca2+ concentrations. A larger catchment area of-
702 ten implies a greater geographical footprint contributing to water sources. This expan-
703 sive catchment area can lead to higher dilution effects on water chemistry, potentially re-
704 ducing Ca2+ concentrations. The positive importance of CA underscores the need to con-
705 sider catchment size when predicting Ca2+, as it plays a pivotal role in shaping the ion
706 levels in water.
707 The substantial positive importance value for Forest (F) (0.550159839) highlights the
708 significant impact of forested areas on Ca2+ concentrations. Forests often contain organic
709 matter that can buffer ions, including calcium. This buffering effect helps maintain stable
710 and predictable Ca2+ levels. The presence of forests can also reduce human activity and
711 the introduction of additional ions into water, contributing to the high importance of this
712 parameter in Ca2+ predictions.
713 Rangeland (RL) positively influences Ca2+ predictions, as indicated by its impor-
714 tance value (0.300511477). Rangeland areas may contain various types of vegetation that
715 stabilize Ca2+ levels by influencing ion content in the water. Additionally, the lower-in-
716 tensity land use in rangeland areas results in fewer ion inputs, enhancing the precision
717 of Ca2+ predictions.
718 Urban areas, represented by the percentage of Urban Area (UA), significantly affect
719 Ca2+ concentrations, as suggested by the substantial importance value (0.310613586). Ur-
72 Toxics 2023, 11, x FOR PEER REVIEW 36 of 66
73
720 banization introduces various pollutants into the water, potentially leading to altered
721 Ca2+ levels. The high importance of UA underscores the need to consider urban land use
722 when predicting Ca2+, as it can result in substantial variations in ion levels.
723 The substantial positive importance value of Percentage of Agricultural Area (AA)
724 (0.319123329) highlights the significant influence of agricultural areas on Ca2+ predic-
725 tions. Agriculture involves practices such as irrigation, fertilization, and pesticide appli-
726 cation, contributing to ion and calcium content in the water. The high importance of AA
727 underscores the need to consider agricultural land use when predicting Ca2+, as it has a
728 profound effect on water chemistry.
729 Hydrological Soil Group C (SHGC) (0.174279524) positively influences Ca2+ predic-
730 tions. This group includes soils like sandy clay loam that can influence ion leaching. The
731 importance of SHGC suggests that the specific characteristics of soil in this group have a
732 notable impact on Ca2+ predictions.
733 Geological Permeability High (GHGN) (0.517513106) is a crucial factor in Ca2+ pre-
734 dictions, with a high importance value. High geological permeability affects water's in-
735 teraction with subsurface minerals, potentially leading to ion leaching or retention. This
736 parameter plays a pivotal role in shaping Ca2+ levels in water due to its substantial im-
737 portance.
738 Geological Permeability Medium (GHGM) (0.176935951) positively contributes to
739 Ca2+ predictions, albeit to a lesser extent. Medium geological permeability influences ion
740 interactions with subsurface minerals, making it a valuable parameter in understanding
741 and predicting Ca2+ levels.
742 Low geological permeability (GHGN) (0.145602833) has a positive impact on Ca2+
743 predictions, as indicated by its importance value. Lower permeability can reduce ion
744 leaching or retention, impacting Ca2+ levels and contributing to the overall accuracy of
745 predictions.
746
747 4.5 Prediction of SO42− parameter values:
748 The RF models proved to be the optimal models for predicting SO42− levels. An
749 analysis of all models in terms of accuracy is given in the appendix A. The dependence
750 of the adopted accuracy criteria on the model parameters is shown in the Figure 16.
751 According to all defined accuracy criteria, only 1 model was singled out with values for
752 RMSE, MAE, MAPE and R of 0.5526, 0.3122, 0.5050 and 0.7381, respectively.
a) b)
74 Toxics 2023, 11, x FOR PEER REVIEW 37 of 66
75
c) d)
753 Figure 16. Comparison of different accuracy criteria for RF model for SO42− parameter as a function of the number of randomly
754 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.
755 The assessment of the significance of individual input variables on the accuracy of
756 the prediction was performed directly on the obtained model with the highest accuracy
76 Toxics 2023, 11, x FOR PEER REVIEW 38 of 66
77
758
759 Figure 17. Significance of individual variables for SO42− parameter prediction in optimal RF model.
760 The positive importance value of HSE (0.13739278) suggests that elevation plays a
761 moderate role in predicting sulfate (SO42−) concentrations. Elevation can affect SO42−
762 levels by influencing factors like geological composition and the presence of minerals in
763 the soil. The positive importance value indicates that higher elevations may lead to
764 slightly increased SO42− concentrations, making it a moderately important parameter in
765 SO42− predictions.
766 The significant importance value of Catchment Area (CA) (0.179220323) indicates
767 that the size of the catchment area has a notable influence on SO42− concentrations. A
768 larger catchment area can encompass a wider range of geological features, affecting min-
769 eral content in water and, consequently, SO42− levels.
770 The highly positive importance value for Forest (F) (0.393199832) highlights the
771 substantial impact of forested areas on SO42− concentrations. Forests can contribute to
772 lower SO42− levels due to the presence of organic matter that can buffer ions and miner-
773 als in the water. This buffering effect results in stable and predictable SO42− concentra-
774 tions in forested regions.
775 Rangeland (RL) positively influences SO42− predictions, as indicated by its high im-
776 portance value (0.463039895). Rangeland areas may contain various types of vegetation
777 that stabilize SO42− levels by influencing ion content in the water. The lower-intensity
778 land use in rangeland areas leads to fewer ion inputs, enhancing the precision of SO42−
779 predictions.
78 Toxics 2023, 11, x FOR PEER REVIEW 39 of 66
79
780 Urban areas, represented by Percentage of Urban Area (UA), significantly affect
781 SO42− concentrations, as suggested by the importance value (0.146355872). Urbanization
782 introduces various pollutants and ions into the water, potentially leading to higher SO4 2−
783 concentrations.
784 The positive importance value of Percentage of Agricultural Area (AA) (0.12677091)
785 suggests that agricultural areas have a minor positive impact on SO42− predictions. Agri-
786 culture practices may introduce ions and minerals into the water, leading to slightly
787 higher SO42− levels.
788 Soil characteristics represented by Hydrological Soil Group B (SHGB) (0.077407654),
789 including silt loam or loam, have a minor positive impact on SO42− predictions. Soils in
790 this group may retain ions and salts, contributing to the overall accuracy of SO42− predic-
791 tions.
792
793 4.6 Prediction of Cl− parameter values:
794 RF models proved to be the optimal models for predicting Cl− concentrations. An
795 analysis of all models in terms of accuracy is given in the appendix. The dependence of
796 the adopted accuracy criteria on the model parameters is shown in the Figure 18. Based
797 on the defined accuracy criteria, 3 models were selected (Table 12).
a) b)
c) d)
80 Toxics 2023, 11, x FOR PEER REVIEW 40 of 66
81
798 Figure 18. Comparison of different accuracy criteria for RF model for Cl− parameter as a function of the number of randomly
799 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.
800 Table 12. Accuracy of obtained models for Cl− prediction according to defined criteria.
801 Table 13. Determining the optimal prediction model for Cl− parameter using Simple Multi-Criteria Ranking.
809
810
811 Figure 19. Significance of individual variables for Cl− parameter prediction in optimal RF model.
812 Catchment Area (CA) (Importance: 0.2225) has a relatively high positive
813 importance, indicating that the size of the catchment area strongly influences chloride
814 levels. Larger catchment areas tend to have higher chloride concentrations due to
815 increased runoff and potential pollutant sources.
816 Stream Order (SO) (Importance: 0.0783) has a positive importance, suggesting that
817 higher stream order corresponds to higher chloride levels. This could be due to factors
818 such as the accumulation of pollutants and ions as water moves downstream.
819 Forest (F) (Importance: 0.5505) have a very high positive importance, indicating that
820 forests strongly influence chloride levels, likely due to their role in filtering and
821 modifying water chemistry.
822 Rangeland (RL) (Importance: 0.1992) have a positive impact, possibly due to the
823 composition of the soil and vegetation.
824 Agricultural Area (AA) (Importance: 0.2720) have a relatively high positive impact,
825 possibly related to farming practices, irrigation, and potential chloride runoff.
826
827 4.7 Prediction of HCO3− parameter values:
828 GPR models proved to be the optimal models for predicting HCO3− concentrations.
829 Very similar values in terms of accuracy were also given by the RF models. However,
830 since the difference between the GPR model and the RF model is practically negligible,
831 and since it is not possible to obtain the significance value for individual input variables
832 on the obtained GPR model because it has the same length scale parameter for all
84 Toxics 2023, 11, x FOR PEER REVIEW 42 of 66
85
833 variables, RF models were used for the analysis. An analysis of all models in terms of
834 accuracy is given in the appendix.
835 The dependence of the adopted accuracy criteria on the parameters of the RF model
836 is shown in Figure 20.
837 In the specific case of applying the RF model, two models were distinguished,
838 namely the RF model that uses ten variables as a subset for analysis and where the
839 number of data per terminal sheet is equal to 1, which is optimal according to the RMSE,
840 MAE, MAPE criteria, and the model that uses 13 variables as a subset for analysis and
841 where the number of data per terminal sheet is equal to 2, which is optimal according to
842 the R criterion. Since the first-mentioned model is optimal according to the three
843 adopted accuracy criteria, RMSE, MAE, and MAPE, and the difference compared to the
844 R criterion is practically negligible, the first model can be considered optimal.
845 The optimal model has the following criterion values for RMSE, MAE, MAPE, and
846 R of 0.5174, 0.4252, 0.1822, and 0.7721, respectively.
847
a) b)
c) d)
86 Toxics 2023, 11, x FOR PEER REVIEW 43 of 66
87
848 Figure 20. Comparison of different accuracy criteria for RF model for HCO3− parameter as a function of the number of randomly
849 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.
850 The assessment of the importance of individual input variables on the accuracy of
851 the prediction was performed precisely on the obtained RF model with the highest
852 accuracy (Figure 21).
853 The positive importance value of HSE (0.191186813) suggests that elevation plays a
854 moderate role in predicting bicarbonate ion (HCO3−) concentrations. Elevation can affect
855 HCO3− levels by influencing factors like geological composition and the presence of
856 minerals in the soil. The positive importance value indicates that higher elevations may
857 lead to increased HCO3− concentrations, making it an important parameter in HCO3−
858 predictions.
859 The significant importance value of the Catchment Area (CA) (0.704530351)
860 indicates that the size of the catchment area has a strong influence on HCO3−
861 concentrations. A larger catchment area can encompass a wider range of geological
862 features, influencing mineral content in water and, subsequently, HCO3− levels. The
863 highly positive importance of CA underscores its crucial role in shaping HCO3−
864 predictions.
865 The highly positive importance value for Forest (F) (0.717278924) highlights the
866 substantial impact of forested areas on HCO3− concentrations. Forests can contribute to
867 lower HCO3− levels due to the presence of organic matter that can buffer ions and
868 minerals in the water. This buffering effect results in stable and predictable HCO3−
869 concentrations in forested regions.
870 Rangeland (RL) positively influences HCO3− predictions, as indicated by its
871 importance value (0.175212138). Rangeland areas may contain various types of
872 vegetation that stabilize HCO3− levels by influencing ion content in the water. The lower-
88 Toxics 2023, 11, x FOR PEER REVIEW 44 of 66
89
873 intensity land use in rangeland areas leads to fewer ion inputs, enhancing the precision
874 of HCO3− predictions.
875
876 Figure 21. Significance of individual variables for HCO3− parameter prediction in optimal RF
877 model.
878 Urban areas, represented by the percentage of Urban Area (UA), significantly affect
879 HCO3− concentrations, as suggested by the importance value (0.21786019). Urbanization
880 introduces various pollutants and ions into the water, potentially leading to higher
881 HCO3− concentrations. The positive importance of UA underscores the need to consider
882 urban land use when predicting HCO3− levels.
883 The positive importance value of the percentage of Agricultural Area (AA)
884 (0.243760535) highlights the significant influence of agricultural areas on HCO3−
885 predictions. Agriculture involves practices such as irrigation, fertilization, and pesticide
886 application, contributing to ion and mineral content in the water. The high importance
887 of AA underscores the need to consider agricultural land use when predicting HCO3−
888 levels.
889 Hydrological Soil Group A (SHGA) (0.154084653) is of moderate importance in
890 HCO3− predictions. Soils in this group, which may include sand, loamy sand, or sandy
891 loam, can influence HCO3− concentrations by affecting ion content in the water. The
892 positive importance suggests that SHGA plays a notable role in shaping HCO3− levels.
893 Hydrological Soil Group C (SHGC) (0.167034059) positively influences HCO3−
894 predictions. This group includes soils like sandy clay loam that can influence ion
895 leaching, making it a valuable parameter in understanding and predicting HCO3− levels.
90 Toxics 2023, 11, x FOR PEER REVIEW 45 of 66
91
896
897 4.8 Prediction of K+ parameter values:
898 The RF models proved to be the optimal models for predicting K+ levels. An
899 analysis of all models in terms of accuracy is given in the appendix. The dependence of
900 the adopted accuracy criteria on the model parameters is shown in Figure 22.
901 In terms of accuracy, three models were singled out, and the optimal model was
902 obtained by applying the Simple Multi-Criteria Ranking method (Table 14 and Table 15).
903
904 Table 14. Accuracy of obtained models for K+ parameter prediction according to defined criteria.
905 Table 15. Determining the optimal prediction model for K+ parameter using Simple Multi-Criteria Ranking.
c) d)
912 Figure 23. Comparison of different accuracy criteria for RF model for K+ parameter as a function of the number of randomly
913 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.
914 The positive importance value of HSE (0.22352202) suggests that elevation plays a
915 significant role in predicting potassium ion (K+) concentrations. Elevation can impact K+
916 levels by influencing geological composition and the presence of minerals in the soil.
917 The positive importance value indicates that higher elevations may lead to increase K+
918 concentrations, making it an important parameter in K+ predictions (Figure 24).
919 The highly positive importance value for Forest (F) (0.303494299) highlights the
920 substantial impact of forested areas on K+ concentrations. Forests can contribute to
921 higher K+ levels due to the decomposition of organic matter and the release of potassium
922 ions. This effect results in stable and predictable K+ concentrations in forested regions.
94 Toxics 2023, 11, x FOR PEER REVIEW 47 of 66
95
927
928 Figure 24. Significance of individual variables for K+ parameter prediction in optimal RF model.
929 Urban areas, represented by the percentage of Urban Area (UA), have a positive
930 influence on K+ concentrations, as suggested by the importance value (0.125652185).
931 Urbanization introduces various pollutants and ions into the water, potentially leading
932 to higher K+ concentrations.
933 The highly positive importance value of Percentage of Agricultural Area (AA)
934 (0.352999482) underscores the significant influence of agricultural areas on K+
935 predictions. Agriculture involves practices such as irrigation, fertilization, and pesticide
936 application, contributing to ion and mineral content in the water. The high importance
937 of AA emphasizes the need to consider agricultural land use when predicting K+ levels.
938 Soil characteristics represented by Hydrological Soil Group B (SHGB) (0.232769812),
939 including silt loam or loam, have a positive impact on K+ predictions. Soils in this group
940 may retain potassium ions and salts, contributing to the overall accuracy of K+
941 predictions.
942 Hydrological Soil Group C (SHGC) (0.107293215) positively influences K+
943 predictions. This group includes soils like sandy clay loam that can influence K+
96 Toxics 2023, 11, x FOR PEER REVIEW 48 of 66
97
c) d)
98 Toxics 2023, 11, x FOR PEER REVIEW 49 of 66
99
967 Figure 25. Comparison of different accuracy criteria for RF model for pH parameter as a function of the number of randomly
968 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.
969
970
971
972 Table 16. Accuracy of obtained models for pH parameter prediction according to defined criteria.
976 Table 17. Determining the optimal prediction model for PH parametr using Simple Multi-Criteria Ranking
980
981 Figure 26. Significance of individual variables for PH parameter prediction in optimal RF model.
982 Stream Order (SO) has a highly positive importance value (0.467367002),
983 highlighting its strong impact on pH predictions. Stream order represents the hierarchy
984 of streams, and it significantly influences pH concentrations. The hierarchical structure
985 of streams can influence the flow, geochemistry, and organic matter content of water, all
986 of which can affect pH levels.
987 Rangeland (RL) positively influences pH predictions, with an importance value of
988 0.248377956. Rangeland areas may contain various types of vegetation that stabilize pH
989 levels by influencing ion content in the water. The lower-intensity land use in rangeland
990 areas leads to fewer ion inputs, enhancing the precision of pH predictions.
991
992 4.10 Prediction of EC parameter values:
993 RF models proved to be optimal models for EC parameter prediction. An analysis
994 of all models in terms of accuracy is given in the appendix. The dependence of the
995 adopted accuracy criteria on the model parameters is shown in Figure 27. According to
996 all defined accuracy criteria, only 1 model was singled out with values for RMSE, MAE,
997 MAPE, and R of 271.5346, 149.3192, 0.2779, and 0.7665, respectively. The obtained
102 Toxics 2023, 11, x FOR PEER REVIEW 51 of 66
103
998 hyperparameter values for the number of trees, the subset of splitting variables, and the
999 minimum number of data per leaf are 500, 6, and 1, respectively. The analysis of the
1000 significance of individual input variables of the model was performed on the optimal RF
1001 model and shown in Figure 28.
a) b)
c) d)
1002 Figure 27. Comparison of different accuracy criteria for RF model for EC parameter as a function of the number of randomly
1003 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.
1004 The substantial positive importance value for Forest (F) (0.311749351) highlights the
1005 significant impact of forested areas on electrical conductivity. Forests often contribute to
1006 lower electrical conductivity due to the presence of organic matter that can buffer ions
1007 and minerals in the water. This buffering effect results in stable and predictable electrical
1008 conductivity in forested regions.
104 Toxics 2023, 11, x FOR PEER REVIEW 52 of 66
105
1029 .
1030 Figure 28. Significance of individual variables for EC parameter prediction in optimal RF model
1032 The RF models proved to be the optimal models for predicting SO4 2- levels. An
1033 analysis of all models in terms of accuracy is given in the appendix. The dependence of
1034 the adopted accuracy criteria on the model parameters is shown in the Figure 29.
1035 In terms of accuracy, three models were singled out, and the optimal model was
1036 obtained by applying the Simple Multi-Criteria Ranking method (Table 18 and Table 19).
1037 Table 18. Accuracy of obtained models for TDS parameter prediction according to defined criteria
1038 Table 19. Determining the optimal prediction model for TDS parameter using Simple Multi-Criteria Ranking
c) d)
1044 Figure 29. Comparison of different accuracy criteria for RF model for TDS parameter as a function of the number of randomly
1045 selected splitting variables and minimum leaf size: a) RMSE, b) MAE, c) MAPE, d) R.
1046 The substantial importance value of HSE (0.320185209) suggests that elevation
1047 plays a significant role in predicting Total Dissolved Solids (TDS) concentrations (Figure
1048 30). Elevation can affect TDS levels by influencing factors like geological composition,
1049 mineral content, and the presence of dissolved ions. The positive importance value
1050 indicates that higher elevations may lead to higher TDS concentrations, making it a
1051 highly important parameter in TDS predictions.
1052 The highly positive importance value for Forest (F) (0.633926137) highlights the
1053 substantial impact of forested areas on TDS concentrations. Forests can contribute to
1054 lower TDS levels due to the presence of organic matter that can buffer ions and minerals
1055 in the water. This buffering effect results in stable and predictable TDS concentrations in
1056 forested regions.
1057 The moderate importance value of the percentage of Agricultural Area (AA)
1058 (0.191967355) suggests that agricultural areas have a moderate impact on TDS
1059 predictions. Agriculture practices introduce ions and minerals into the water,
1060 influencing TDS levels to a moderate extent.
1061 Hydrological Soil Group A (SHGA) has a minor positive importance (0.126901692)
1062 in TDS predictions. Soils in this group, which may include sand, loamy sand, or sandy
1063 loam, can influence TDS concentrations by affecting ion content in the water.
110 Toxics 2023, 11, x FOR PEER REVIEW 55 of 66
111
1064
1065 Figure 30. Significance of individual variables for TDS parameter prediction in optimal RF model
1066 5. Discussion
1067 In the considered research, satisfactory accuracy values were obtained for the
1068 majority of models in terms of the defined criteria, but certain models also failed in
1069 terms of certain criteria. The accuracy of the model can be seen more easily through the
1070 relative criteria of accuracy R and MAPE than in relation to the absolute criteria such as
1071 RMSE and MAE (Table 20).
1072 Table 20. Accuracy of the ML model in predicting individual water parameters.
1103 Table 21. The most important input variables for predicting the output parameters of water.
Input variable
Output
2 4 1 5 3 SAR
114 Toxics 2023, 11, x FOR PEER REVIEW 57 of 66
115
5 2 4 1 3 Na+
4 5 3 2 1 Mg2+
2 3 5 1 4 Ca2+
3 5 1 2 4 SO42-
4 2 3 1 5 Cl-
1 4 2 3 5 HCO3-
2 4 1 3 5 K+
3 5 4 2 1 pH
3 4 5 2 1 EC
5 4 3 1 2 TDS
1104
1105 Hydrological Soil Groups (HSGA, HSGB, HSGC, HSGD): Different soil types have
1106 varying abilities to retain and release ions. Sandy soils (HSGA and HSGB) may allow for
1107 easier ion movement, while clayey soils (HSGD) can retain ions. Soil characteristics
1108 influence potassium, magnesium, calcium, and bicarbonate levels.
1109 Geological Permeability (GHGM, GHGN, GHGT): Geological features can affect ion
1110 concentrations. Moderate permeability (GHGN) has a positive impact on potassium
1111 levels, possibly due to ion interactions. Low permeability (GHGM) may positively
1112 influence magnesium levels by slowing water flow and allowing for more mineral
1113 dissolution.
1114 6. Conclusions
1115 The study demonstrated that machine learning methods can be effectively used to
1116 predict and assess water quality parameters in a catchment area. These models provide a
1117 valuable tool for efficiently and accurately evaluating water quality.
1118 Random Forest (RF) emerged as the optimal model for predicting various water
1119 quality parameters. These models exhibited the best performance in terms of accuracy
1120 and predictive power.
1121 The study highlighted the significant influence of environmental variables on water
1122 quality parameters.
1123 The findings underscore the practical utility of these models in water resource
1124 management and environmental conservation efforts. They enable a rapid and
1125 sufficiently accurate assessment of water quality based on the attributes of the watershed
1126 and its environmental variables.
1127
1128 Author Contributions: Conceptualization, M.K., B.J.A, S.L., E.K.N. M.H.-N and D.R..;
1129 methodology, M.K.; software, M.K. and E.K.N.; validation, M.K., S.L., M.H.-N and E.K.N. and.;
1130 formal analysis, M.K. and E.K.N.; investigation, M.K. B.J.A. and M.H.-N.; resources, B.J.A, S.L.
116 Toxics 2023, 11, x FOR PEER REVIEW 58 of 66
117
1131 M.H.-N and D.R..; data curation, M.K., S.L. and E.K.N.; writing—original draft preparation, M.K.,
1132 S.L., E.K.N. and M.H.-N.; writing—review and editing, M.K., S.L., E.K.N. and M.H.-N.; funding
1133 acquisition, D.R. All authors have read and agreed to the published version of the manuscript.
1134 Funding: This research received no external funding.
1135 Institutional Review Board Statement: Not Applicable.
1136 Informed Consent Statement: Not Applicable.
1137 Data Availability Statement: Not Applicable.
1138 Conflicts of Interest: The authors declare no conflict of interest.
1139
1140 References
1141 1. Dyer, F., Elsawah, S., Croke, B., Griffiths, R., Harrison, E., Lucena-Moya, P., Jakeman, A. (2013). The Effects of Climate Change
1142 On Ecologically-relevant Flow Regime and Water Quality Attributes. Stoch Environ Res Risk Assess, 1(28), 67-82.
1143 https://doi.org/10.1007/s00477-013-0744-8.
1144 2. Bisht, A., Singh, R., Bhutiani, R., Bhatt, A. (2019). Application Of Predictive Intelligence In Water Quality Forecasting Of the
1145 River Ganga Using Support Vector Machines. 206-218. https://doi.org/10.4018/978-1-5225-6210-8.ch009
1146 3. Liu, S., Ryu, D., Webb, J. A., Lintern, A., Guo, D., Waters, D., & Western, A. W. (2021). A multi-model approach to assessing
1147 the impacts of catchment characteristics on spatial water quality in the Great Barrier Reef catchments. Environmental
1148 Pollution, 288, 117337.
1149 4. Mazher, A. (2020), Visualization Framework for High-Dimensional Spatio-Temporal Hydrological Gridded Datasets using
1150 Machine-Learning Techniques" Water 12, no. 2: 590. https://doi.org/10.3390/w12020590.
1151 5. Nasir, N., Kansal, A., Alshaltone, O., Barneih, F., Sameer, M., Shanableh, A., & Al-Shamma'a, A. (2022). Water quality
1152 classification using machine learning algorithms. Journal of Water Process Engineering, 48, 102920.
1153 6. Wang, R., Kim, J. H., & Li, M. H. (2021). Predicting stream water quality under different urban development pattern scenarios
1154 with an interpretable machine learning approach. Science of the Total Environment, 761, 144057.
1155 7. Rodriguez-Galiano, V., Sanchez-Castillo, M., Chica-Olmo, M., & Chica-Rivas, M. J. O. G. R. (2015). Machine learning
1156 predictive models for mineral prospectivity: An evaluation of neural networks, random forest, regression trees and support
1157 vector machines. Ore Geology Reviews, 71, 804-818.
1158 8. Koranga, M., Pant, P., Kumar, T., Pant, D., Bhatt, A. K., & Pant, R. P. (2022). Efficient water quality prediction models based on
1159 machine learning algorithms for Nainital Lake, Uttarakhand. Materials Today: Proceedings.
1160 9. Kovačević, M.; Ivanišević, N.; Dašić, T.; Marković, L. Application of artificial neural networks for hydrological modelling in
1161 Karst. Gradjevinar 2018, 70, 1–10.
1162 10. Zhu, M., Wang, J., Yang, X., Zhang, Y., Zhang, L., Ren, H., Ye, L. (2022). A review of the application of machine learning in
1163 water quality evaluation. Eco-Environment & Health.
1164 11. Leggesse, E. S., Zimale, F. A., Sultan, D., Enku, T., Srinivasan, R., & Tilahun, S. A. (2023). Predicting Optical Water Quality
1165 Indicators from Remote Sensing Using Machine Learning Algorithms in Tropical Highlands of Ethiopia. Hydrology, 10(5),
1166 110.
118 Toxics 2023, 11, x FOR PEER REVIEW 59 of 66
119
1167 12. Zhu, Y., Liu, K., Liu, L., Myint, S. W., Wang, S., Liu, H., & He, Z. (2017). Exploring the potential of worldview-2 red-edge
1168 band-based vegetation indices for estimation of mangrove leaf area index with machine learning algorithms. Remote Sensing,
1169 9(10), 1060.
1170 13. Cai, J., Chen, J., Dou, X., & Xing, Q. (2022). Using machine learning algorithms with in situ hyperspectral reflectance data to
1171 assess comprehensive water quality of urban rivers. IEEE Transactions on Geoscience and Remote Sensing, 60, 1-13.
1172 14. Castrillo, M., & García, Á. L. (2020). Estimation of high frequency nutrient concentrations from water quality surrogates using
1173 machine learning methods. Water Research, 172, 115490.
1174 15. Gladju, J., Kamalam, B. S., & Kanagaraj, A. (2022). Applications of data mining and machine learning framework in
1175 aquaculture and fisheries: A review. Smart Agricultural Technology, 100061.
1176 16. Chen, K., Chen, H., Zhou, C., Huang, Y., Qi, X., Shen, R. , Ren, H. (2020). Comparative analysis of surface water quality
1177 prediction performance and identification of key water parameters using different machine learning models based on big
1178 data. Water research, 171, 115454.
1179 17. Zhang, H., Xue, B., Wang, G., Zhang, X., & Zhang, Q. (2022). Deep Learning-Based Water Quality Retrieval in an Impounded
1180 Lake Using Landsat 8 Imagery: An Application in Dongping Lake. Remote Sensing, 14(18), 4505.
1181 18. Jin, T., Cai, S., Jiang, D., & Liu, J. (2019). A data-driven model for real-time water quality prediction and early warning by an
1182 integration method. Environmental Science and Pollution Research, 26, 30374-30385.
1183 19. Uddin, M. G., Nash, S., Rahman, A., & Olbert, A. I. (2023). A novel approach for estimating and predicting uncertainty in
1184 water quality index model using machine learning approaches. Water Research, 229, 119422.
1185 20. Cheng, S., Cheng, L., Qin, S., Zhang, L., Liu, P., Liu, L., Wang, Q. (2022). Improved understanding of how catchment
1186 properties control hydrological partitioning through machine learning. Water Resources Research, 58(4), e2021WR031412.
1187 21. Krishnan, M. (2020). Against interpretability: a critical examination of the interpretability problem in machine learning.
1188 Philosophy & Technology, 33(3), 487-502.
1189 22. Kovačević, M.; Lozančić, S.; Nyarko, E.K.; Hadzima-Nyarko, M. Application of Artificial Intelligence Methods for Predicting
1190 the Compressive Strength of Self-Compacting Concrete with Class F Fly Ash. Materials 2022, 15, 4191.
1191 https://doi.org/10.3390/ma15124191
1192 23. Hastie, T.; Tibsirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2009.
1193 24. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [CrossRef]
1194 25. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
1195 26. Kovačević, M.; Ivanišević, N.; Petronijević, P.; Despotović, V. Construction cost estimation of reinforced and prestressed
1196 concrete bridges using machine learning. Grad¯evinar 2021, 73, 727.
1197 27. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput.
1198 Syst.Sci. 1997, 55, 119–139.
1199 28. Elith, J.; Leathwick, J.R.; Hastie, T. A working guide to boosted regression trees. J. Anim. Ecol. 2008, 77, 802–813.
1200 29. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
1201 30. Rasmussen, C.E.; Williams, C.K. Gaussian Processes for Machine Learning; The MIT Press: Cambridge, MA, USA, 2006.
1202 31. Fatehi, I., Amiri, B.J., Alizadeh, A. et al. Modeling the Relationship between Catchment Attributes and In-stream Water
1203 Quality. Water Resour Manage 29, 5055–5072 (2015). https://doi.org/10.1007/s11269-015-1103-y
1204 32. Dohner E, Markowitz A, Barbour M, Simpson J, Byrne J, Dates G (1997) Volunteer Stream Monitoring: A Methods Manual
1205 Environmental Protection Agency: Office of Water (EPA 841-B-97-003).
1206 33. Anderson JR (1976) A land use and land cover classification system for use with remote sensor data. Vol. 964. US Government
1207 Printing Office.
120 Toxics 2023, 11, x FOR PEER REVIEW 60 of 66
121
1208 34. Brakensiek D, Rawls W (1983) Agricultural management effects on soil-water Processes 2: Green and Ampt Parameters for
1209 Crusting Soils. Trans ASAE 26:1753–1757.
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225 Appendix A
1226 NA parameter
1227
1228 Table A1. Comparative analysis of results of different machine learning models.
1229
1230 Mg parameter
1231
1232 Table A2. Comparative analysis of results of different machine learning models.
1233
1234
1235 CA parameter
1236
1237 Table A3. Comparative analysis of results of different machine learning models.
1240
1241 Table A4. Comparative analysis of results of different machine learning models.
1243
126 Toxics 2023, 11, x FOR PEER REVIEW 63 of 66
127
1244
1245 CL parameter
1246
1247 Table A5. Comparative analysis of results of different machine learning models.
1250
1251 Table A6. Comparative analysis of results of different machine learning models.
1253
1254
1255
1256 K parameter
1257
1258 Table A7. Comparative analysis of results of different machine learning models.
1259
130 Toxics 2023, 11, x FOR PEER REVIEW 65 of 66
131
1260 PH parameter
1261
1262 Table A8. Comparative analysis of results of different machine learning models.
1264
1265
1266
1267 EC parameter
1268
1269 Table A9. Comparative analysis of results of different machine learning models.
1272
1273 Table A10. Comparative analysis of results of different machine learning models.
1274