
Chemosphere 303 (2022) 135265

Contents lists available at ScienceDirect

Chemosphere
journal homepage: www.elsevier.com/locate/chemosphere

Mapping of groundwater productivity potential with machine learning algorithms: A case study in the provincial capital of Baluchistan, Pakistan
Umair Rasool a, b, Xinan Yin a, Zongxue Xu b, *, Muhammad Awais Rasool c,
Venkatramanan Senapathi d, Mureed Hussain e, Jamil Siddique f, Juan Carlos Trabucco g
a State Key Laboratory of Water Environment Simulation, School of Environment, Beijing Normal University, No. 19 Xinjiekouwai Street, Beijing, 100875, China
b College of Water Sciences, Beijing Normal University, Beijing Key Laboratory of Urban Hydrological Cycle and Sponge City Technology, Beijing, 100875, China
c University of Agriculture, Faisalabad, Burewala Sub-campus, Punjab, Pakistan
d Department of Disaster Management, Alagappa University, Kariakudi, 630003, Tamil Nadu, India
e Lasbela University of Agriculture, Water and Marine Sciences, Uthal, Lasbela, Pakistan
f Earth Science Department, Quaid-I-Azam University, Islamabad, Pakistan
g Universidad Metropolitana - Department of Mathematics, Caracas, Venezuela

HIGHLIGHTS

• The provincial capital Quetta, Baluchistan, Pakistan was selected for present study.
• For GW productivity potential mapping, geology and slope degree are very important.
• ANN, RF and XGB models are best fitted models in the prediction.
• The ROC and AUC provided the prediction performance accuracy of the selected models.
• Correlation coefficient, RMSE and MAE provided the useful information for model performance accuracy.

ARTICLE INFO

Handling Editor: Derek Muir

Keywords:
Quetta valley
Machine learning
Groundwater productivity potential
GW factors
ROC/AUC
Standard error

ABSTRACT

Although groundwater (GW) potential zoning can be beneficial for water management, it is currently lacking in several places around the world, including Pakistan’s Quetta Valley. Due to ever increasing population growth and industrial development, GW is being used indiscriminately all over the world. Recognizing the importance of GW potential for sustainable growth, this study used 16 GW driver factors and evaluated their effectiveness with six machine learning algorithms (MLAs): artificial neural networks (ANN), random forest (RF), support vector machine (SVM), K-Nearest Neighbor (KNN), Naïve Bayes (NB) and Extreme Gradient Boosting (XGBoost). The GW yield data were collected and divided into 70% for training and 30% for validation. The training data of GW yields were integrated into the MLAs along with the GW driver variables, and the projected results were checked using the Receiver Operating Characteristic (ROC) curve and the validation data. Out of the six ML algorithms, the ROC curve showed that the XGBoost, RF and ANN models performed well, with 98.3%, 96.8% and 93.5% accuracy respectively. In addition, the accuracy of the models was evaluated using the mean absolute error (MAE), root mean square error (RMSE), F-score and correlation coefficient. Hydro-chemical data were evaluated, and the water quality index (WQI) was also calculated. The final GW productivity potential (GWPP) maps were created using the MLAs' output and the WQI, as they identify the different classification zones that can be used by the government and other agencies to locate new GW wells and provide a basis for water management in rocky terrain.

* Corresponding author.
E-mail address: zxxu@bnu.edu.cn (Z. Xu).

https://doi.org/10.1016/j.chemosphere.2022.135265
Received 1 February 2022; Received in revised form 31 May 2022; Accepted 4 June 2022
Available online 9 June 2022
0045-6535/© 2022 Published by Elsevier Ltd.

1. Introduction

Combining satellite data and remote sensing with the GIS domain offers adequate knowledge of the surface and subsurface features needed for groundwater (GW) potential exploration (Mogaji et al., 2011), and provides a highly cost- and time-effective way to manage GW resources (Adiat et al., 2012; Verma and Singh, 2013). Numerous researchers have reported various anthropogenic and morphometric factors such as land use/land cover, geomorphology, soil cover, slope, drainage density, and surface temperature as critical variables that contribute to the GW potential (Verbesselt et al., 2006; Avtar et al., 2010). Highly weathered and fractured rocks (Amudu et al., 2008) offer opportunities to search for suitable aquifers in the bedrock because weathered and fractured areas generally describe the intensity and nature of hydrodynamic activities in individual aquifer joints in the ground (Amadi and Olasehinde, 2010). Studies also reported that structural hills, residual hills, and linear ridge sites of low GW potential possess low infiltration capacity, while bedrock and valley-fill areas offer high GW potential due to their high infiltration and water recharge (Rajaveni et al., 2017; Berhanu and Hatiye, 2020).

Geological structures and anthropogenic land-use practices (mountains, roads, urban buildings and agricultural areas) (Gu et al., 2016; Kaushal et al., 2017) can considerably impact GW flow and storage in the surface and subsurface, and should therefore be considered in GW studies (Elmahdy and Mohamed, 2014). The study of the nature, origin, and features of springs in a certain geological structure can help to predict the hard rock aquifers (Rehrl and Birk, 2010). These aquifers provide significant clean water either in fractured zones or through carbonate rock dissolution channels (Robert et al., 2012). The hard rock (bedrock) and unconsolidated alluvial aquifers are hydraulically connected (Kazmi et al., 2005a; TCI, 2008), have relatively uniform mechanical behavior and are characterized mainly by fracture permeability (Boutaleb et al., 2008). The structure and hydraulic characteristics of superficial geology are also considered to play an important role in influencing surface-GW interactions in drylands (Wheater et al., 2010).

Machine learning algorithms (MLAs) have recently emerged for precise forecast modeling that involves identifying complex structures, especially in irregular data, and developing forecast models (Olden et al., 2008; Marjanović et al., 2011). Moreover, compared to conventional statistical models, the prediction rate of MLAs is very accurate (Naghibi et al., 2015; Chen et al., 2018), and MLAs are beneficial due to their capacity to include a number of predictor variables and missing values, and their simplicity of constructing suitable connections among predictors (Friedman and Meulman, 2003). GW mapping via MLAs utilizing random forest (RF) (Sameen et al., 2019), support vector machine (SVM) (Panahi et al., 2020), naïve Bayes (NB) (Pham et al., 2021) and extreme gradient boosting (XGBoost) has been continuously improved (Lee et al., 2018; Arabameri et al., 2019). The goal of each is to find the most economical and accurate model to reduce the economic cost of GW potential mapping. Machine learning methods have attracted the attention of the scientific community, specifically in the last couple of years (Mojaddadi Rizeei et al., 2019; Prasad et al., 2020); for example, one study yielded 86% accuracy for GW potential mapping using RF and XGBoost in Iran (Naghibi et al., 2020).

The water table of the Quetta Valley has been declining at an alarming rate (1 m/year) since 2007, possibly due to overuse of the aquifer system (Alam and Ahmad, 2014; Kakar et al., 2019), which may lead to complete depletion of the aquifer system if overuse is not properly controlled. GW evaluation and advancement at both local and regional levels require multidisciplinary strategies, with numerous data resources, and by mapping all these localized geographical structures (Olubusola et al., 2018). To evaluate GW productivity potential (GWPP) in Quetta valley, we used six MLA models including ANN, RF, SVM, KNN, NB and XGBoost and evaluated their performance with multiple statistical approaches. Finally, we combined the model outputs with water quality data to produce the final GWPP maps. By including the element of highway proximity, which is often overlooked in mapping GWPP, this study is the first of its kind to map GW potential in the capital city of Quetta of the least considered province, Baluchistan.

2. Material and methodology description

2.1. Materials and description

2.1.1. Study area description

The total watershed area of the provincial capital of Baluchistan, Quetta, is 1756 km², of which 792 km² is covered by alluvium (Fig. 1). The valley is divided into two basins by Mian Ghundi and Landi’s hills: one is the Dasht plain (southern), and the other is the central Quetta valley (northern).

Slip faulting and zones of convergence characterize its geology (Alam and Ahmad, 2014). The valley comprises semi-consolidated and unconsolidated rocks (valley fill), which are underlain by consolidated rocks. Shale and limestone belong to Jurassic age (Chiltan and Shirinab Formations) rocks (Kazmi et al., 2005a, 2005b), whereas siltstone, shale, and limestone of the Parh group belong to Cretaceous age rocks. The Urak group of siltstone, shale, conglomerates, and sandstone is the thick sequence of Tertiary age rocks (Kazmi et al., 2005a). The Chaman fault (north-south) is the main fault in the area, a strike-slip fault between the Indian and Eurasian plates.

In the valley, alluvial aquifers are present in unconsolidated and semi-consolidated rocks, whereas bedrock aquifers are present in consolidated rocks (Khan and Mian, 2000). The alluvial aquifer consists of sand, silt, and gravel deposits and is referred to as the primary aquifer, replenished by precipitation, runoff, and inflow from foothill regions with bedrock aquifers. Bedrock aquifers mainly consist of conglomerates of the Urak formation and limestone of the Chiltan and Shirinab formations and are recharged by surrounding mountains exposed in the area (Khan and Mian, 2000).

2.1.2. Data Collection

A geological map, Landsat 8 OLI and Sentinel 2 imagery, the SRTM DEM, precipitation data, yield information for 167 home-based wells and data from 35 geo-hydrological wells monitored for at least the last 10 years were acquired from the Pakistan Council of Research in Water Resources (PCRWR). The description of the dataset is presented in Table 1. In this study, 16 GWPP conditioning factors or thematic maps were used. All thematic maps or layers were prepared in ArcGIS 10.8 and then resampled to create a uniform grid size of 30 × 30 m for the final GWPP maps using MLAs.

2.1.3. Wells inventory data

The home-based wells inventory data, or yield data, were acquired from PCRWR, and it was ensured that at least 70% of the wells had been providing 3–4 gallons of water per minute during the last 5 years. In total, data for 167 wells were collected. A proportion of 70:30 was used to split the points into training and validation points by using the “create random points” tool in ArcGIS.

2.1.4. GWPP conditioning factors and mapping

We selected 16 GW conditioning factors including aspect, rainfall, altitude, stream power index (SPI), geology, land use/landcover,


lineaments density, drainage density, slope in degree, curvature, topographic roughness index (TRI), topographic wetness index (TWI), distance to fault, distance to lineaments, distance to stream and distance to highways. The conditioning factors are described in detail in the literature (Islam et al., 2021; Mirzaei et al., 2021; Yuan et al., 2021) and in the supplementary material (see S1).

Table 1
Material type and source used in the present study for GWPP mapping.

Data Type | Source | Scale | Time period
Yield data | PCRWR | Randomly selected | 2015–20
Hydro-chemical data | PCRWR | Randomly selected | 2010–20
Rainfall data | Pakistan Meteorological Department | Yearly average data | 2000–20
Geology | Geological Survey of Pakistan | 1:250,000 | 2011
DEM data | SRTM, USGS Earth Explorer website | 1 arc second | 2014
Satellite image | Landsat 8 OLI and Sentinel 2, USGS Earth Explorer website | 30 m and 10 m spatial resolutions | December 6, 2020

Fig. 1. Location map of the study area.

Morphometric explanatory variables such as TWI, TRI, altitude, drainage density, aspect, slope in degree, SPI, and curvature were extracted from the 30 m Digital Elevation Model (DEM) downloaded from the Shuttle Radar Topographic Mission (SRTM). As a result, all of these morphometric explanatory variables have a spatial resolution of 30 × 30 m. Distance to fault, distance to lineaments, distance to stream and distance to highways were created from their respective layers by using the ‘Euclidean distance’ tool under the spatial analyst toolbox in ArcGIS. The geological map was derived from the geological toposheet in ArcGIS. Sentinel 2 data with 10 m spatial resolution was used for land use and lineaments extraction.

The remaining conditioning factors were resampled into 30 × 30 m

spatial resolution by using the ‘resample’ tool under the conversion toolbox in ArcGIS. We extracted the conditioning factor values by using the ‘extract multi values to point’ tool under the spatial analyst toolbox in ArcGIS. The extracted point values were then imported into R software for the ML models.

3. Methodology

The importance of the selected GWPP parameters was calculated using the variable importance and built-in feature functions of each ML model. First, we created thematic maps in ArcGIS software and extracted the values at the sample locations for all 16 factors; then, we imported them as CSV files into R software and applied the above models. Within R, we first transformed the data values to the range 0 to 1, where 1 is close to the higher value and 0 is close to the lower value for each parameter. Then we split the data into training and validation sets. Different MLAs were then applied to the training set, followed by generating ROC curves with the predicted values and calculating the AUC using SPSS software. Second, the validation data were used to check the result of the training data; ROC curves with predicted values were again generated and the AUC was calculated (Lee et al., 2018; Prasad et al., 2020). The predicted values were converted to 0 and 1, where we took predicted values below 0.5 as 0 and values above 0.5 as 1; here 1 represents the possibility of GWPP, whereas 0 means the opposite. Lastly, the predicted values of the training models were imported into the ArcGIS environment for further analysis.

3.1. Machine learning algorithm

3.1.1. Artificial neural network (ANN)

ANNs are an effective method for learning linear and nonlinear relationships between input and output pairs, and consist of an input layer, intermediate hidden layers, and an output layer. When constructed from several layers, neural networks may use simplified representations formed by previous layers to represent complex features in later layers (Goodfellow et al., 2016). For predictive analytics, the “h2o” package fits multi-layer, feedforward neural network models (Candel et al., 2016). ANN has the advantage that it can achieve better outcomes with less training evidence than other predictive methods (Kim et al., 2018a). This study used the “h2o” package in R statistical software to determine the GWPP areas.

3.1.2. Random forest (RF)

RF is another strong and effective MLA, developed by Breiman (2001) as an extension of “classification and regression trees” to improve their prediction proficiency (Razavi-Termeh et al., 2019). This approach consists of several decision trees and integrates them to illustrate the spatial relationship between GW control variables, spring, and well inventory (Kim et al., 2018b). The algorithm’s prediction is based on the highest number of votes cast by the decision trees (Micheletti et al., 2014). The RF method does not use all the information available for growing a tree but uses 66% of the bootstrap data, and the remaining 34% is used to evaluate the fitted data (Razavi-Termeh et al., 2019). A predictor variable is then applied randomly during the growing process, and this variable is used to create a node in a tree. The ‘randomForest’ package was applied (Cutler, 2020) using R statistical software.

3.1.3. Support vector machine (SVM)

This model’s methodology is based on the principle of statistical learning; it minimizes errors and identifies the optimal solution (Vapnik et al., 1995). Marjanovic et al. also noted that a separating hyperplane is constructed in the original n-dimensional space between the points of two distinct groups (Marjanovic et al., 2009). A point that lies above the hyperplane is categorized as +1; otherwise it is categorized as −1. SVM obtains its output estimate by solving a convex optimization problem (Nansen and Elliott, 2016). A significant feature of the SVM model is its fast layer recognition and analysis (Micheletti et al., 2011). It is commonly used to solve classification and regression problems while reducing the algorithm’s overfitting (Gayen et al., 2019). In this study, the “svmLinear” method was applied using the ‘caret’ package in R statistical software.

3.1.4. K-nearest neighbors (KNN)

The KNN technique is a non-parametric MLA classification, which eliminates class density estimates (Motevalli et al., 2019). This technique has been implemented and found useful in GW, landslide, gully erosion and flood mapping (Naghibi et al., 2019; Shahabi et al., 2020). The main hypothesis of the nearest neighbor method is that unknown cells are classified based on their similarity with known cells in the training set (He and Wang, 2007), which can be measured by the Euclidean distance (Betrie et al., 2013). Fig. 2 shows an example of the classification process in the k-nearest neighbors. The unknown cell in the figure is a black star, and the objective is to determine if it is a well with yield or not. Since there are two wells with yield (red squares) and one well without yield (green circle) in the inner circle, it will be classified as a well with yield if the value of k (the number of nearest neighbors) is equal to 3 (the inner circle). However, since there are five wells without yield and four wells with yield in the outer circle, it will be classified as a well without yield if the value of k is equal to 9 (the outer circle). The “caret” package was used in the R statistical software to run the KNN model.

Fig. 2. The basic concept of K-Nearest Neighbor.

3.1.5. Naïve bayes (NB)

NB is a method of statistical classification that assigns the class with the maximum posterior probability, under the assumption that the attributes do not depend on one another (Soni et al., 2011). The NB model gathers examples of an event and then estimates the prior probability for each class. The mean of each class is calculated to create a covariance matrix, which is then used to create a discriminant function for each class using Bayes’ theorem (Bhargavi and Jyothi, 2009). The NB model has the advantage of estimating the parameters needed for classification with a limited amount of training data (Bhargavi and Jyothi, 2009). The “naive Bayes” package was used in the R statistical software to run the NB model.

3.1.6. Extreme gradient boosting (XGBoost)

XGBoost is a boosting ensemble algorithm and an enhanced form of the gradient boosting machine (GBM). The XGBoost model uses additive learning to create a strong learner by combining several weak learners (each tree) (Fan et al., 2021). As a result, the gradient boosting concept is followed by both XGBoost and GBM, but XGBoost improves the training process and prevents overfitting. To minimize the loss function and obtain more accurate trees, XGBoost uses second-order derivatives, while ordinary GBM uses first-order derivatives (Dietterich, 1995; Chen and Guestrin, 2016). In XGBoost, parallel computation is carried out automatically during training to improve computational efficiency (Fan et al., 2021). It also


has several regularization features to avoid overfitting. XGBoost’s final prediction is the sum of the weighted contributions of all decision trees (Dietterich, 1995). The “xgboost” package was used in the R statistical software to run the XGBoost model.

3.2. Models evaluation and validation

Evaluating the algorithms’ predictions with the aid of validation samples is a prominent part of the modeling process that validates the predictions’ accuracy (Garosi et al., 2019; Kariminejad et al., 2020). In this study, the area under the Receiver Operating Characteristic (ROC) curve was selected to validate the model’s accuracy. ROC is a graphical plot that characterizes the diagnostic ability of the models (Golkarian et al., 2018). The area under the curve (AUC) takes a value between 0 and 1, and a higher value represents better performance of the model (Chen et al., 2018; Golkarian et al., 2018). The curve traces the true-positive rate (Sensitivity) on the Y-axis and the false-positive rate (1 − Specificity) on the X-axis (Youssef et al., 2016; Golkarian et al., 2018).

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{6}$$

$$\text{Specificity} = \frac{TN}{FP + TN} \tag{7}$$

where TP is “true positives,” FN is “false negatives,” TN is “true negatives,” and FP is “false positives.” TP and TN are the numbers of accurately categorized pixels, whereas FP and FN are the numbers of pixels incorrectly classified. Accuracy is calculated as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$

To check whether the performance of the prediction model is overestimated and to avoid bias, we evaluated the models not only by recall, precision, and F-score, but also by the mean absolute error (MAE), root mean square error (RMSE), and correlation coefficient of the delivered prediction model metrics.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{9}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{10}$$

$$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{11}$$

The recall score indicates the percentage of actual positive records that are correctly predicted, while precision indicates the percentage of positive predictions that are correct. The F-score is a single measure that uses the harmonic mean to combine recall and precision. The F-score ranges from 0 to 1, with 1 representing higher precision and 0 representing lower precision (Inglada et al., 2015; Powers, 2020). The MAE, RMSE, R2 and adjusted R2 provide information about the score and the performance of the models, where values close to 0 represent the best model performance, while values close to 1 represent the worst model performance (Moriasi et al., 2007; Chicco et al., 2021). The correlation coefficient indicates how strong or weak the relationship between two variables is and whether this relationship is positive or negative (Ratner, 2009). The correlation coefficient ranges from −1.0 to 1.0.

3.3. Determination of water quality

For this study, we used the mean values of all variables, especially the 15 selected hydro-chemical variables HCO3, Cl, Na, NO3, Ca, K, pH, Mg, As, SO4, alkalinity (Al), turbidity (Tu), total dissolved solids (TDS), hardness (H) and electrical conductivity (E.C), for the WQI (see Table 1 in S2). We used Python software to calculate different statistics of the water quality variables and the WQI by referring to WHO water quality standards (Edition, 2011). WQI is a helpful technique to indicate the combined behavior of water quality. We classified the final WQI based on the Tiwari and Mishra classification (Tiwari and Mishra, 1985). To check the distribution of WQI throughout the study area, the IDW interpolation technique of ArcGIS software is very useful because it has been widely used to study the spatial distribution of parameters (Kanagaraj and Elango, 2016).

3.4. Final GWPP maps

The final maps of GWPP were generated using the MLA output and the WQI. The raster calculator was used to extract the final GWPP maps for the study area. The natural break method was used to classify the final GWPP maps into four classes: excellent, good, moderate, and poor.

The detailed methodological flow chart of the study is shown in Fig. 3, which is organized into three sections.

4. Results

4.1. Factors importance

In the present study, six ML models were used for prediction, with SVM, KNN, and NB using the “caret” package to analyze the importance of features during model development. The “caret” package analyses feature importance based on the wrapper method (Sánchez-Maroño et al., 2009). The wrapper method evaluates the best subset of features for a given ML algorithm. It is also known as a greedy algorithm because it searches for the combination of features that gives the best performance of the model (Sánchez-Maroño et al., 2009). The “h2o” package was used for the ANN model, which analyses feature importance based on the Gedeon method (Gedeon, 1997). The method’s major objective is to exclude extraneous features from the training model (Gedeon, 1997).

Random forests are typically used in data science workflows for feature selection, since they naturally adopt tree-based strategies that rank features by how much they improve the node’s purity (Archer and Kimes, 2008). This is measured as the mean decrease in impurity (called Gini impurity) over all trees: the impurity decreases the most in the first stage of the trees, while it decreases the least in the last (Archer and Kimes, 2008). So, we can construct a subset of the essential features by pruning trees below a particular node. Using the “xgboost” package, the importance is determined for a single decision tree by the amount by which each attribute split point improves the performance measure, weighted by the number of node-related observations (Hastie et al., 2009). The performance measure may be the purity (Gini index) used to pick the split points, or some more specific error function. The feature importances are then averaged across all decision trees in the model (Hastie et al., 2009). An advantage of gradient boosting is that it is relatively simple to obtain importance scores for each attribute after the boosted trees are built.

The XGBoost model obtained the highest accuracy; in it, slope degree was the most important factor, and slope degree was also an important factor in the other ML models. Rainfall was found to be the 2nd most important factor in the XGBoost model and was also important in the ANN model. TWI was found to be the 3rd most important factor in XGBoost and remained important in the ANN model. Land use was identified as the least significant component in XGBoost and was similarly less significant in the RF model but was somewhat significant in the ANN model. Fig. 4 represents the importance of each factor for all ML models.

4.2. MLA’s models output

We produced initial maps of GWPP from the results of the models, and three MLAs, ANN, RF, and XGBoost, showed a similar pattern of GWPP, while the other three MLAs, SVM, KNN, and NB, showed some incorrect predictions. In all the prediction models, the capital city where

urbanization is very dense shows a high value of 1, which means that the possibility of the GWPP occurring in this area is very high, since 1 represents the maximum possibility of the GWPP according to the algorithms, while 0 represents the minimum possibility of the GWPP. According to the results of the models, the probability of occurrence of the GWPP in the northeast and southwest directions is very low (Fig. 5).

Fig. 3. The methodological flow chart of the present study.

Fig. 4. The relative importance of the influencing factors for MLAs.

4.3. Models evaluation and validation

Validation of predictive models is essential for evaluating predictive accuracy. First, the ROC curves were created using the training data, representing 70% of the total sample size. The ROC-AUC curves show that all six MLAs had good to excellent prediction accuracy (Fig. 6a). However, three of the six models showed excellent accuracy, with an AUC value of 0.983 for XGBoost, 0.968 for RF, and 0.935 for ANN. Similarly, SVM showed a very good AUC value of 0.801, while KNN and NB showed lower accuracy, with 0.719 and 0.714, respectively. To validate the prediction on the training data, we used the remaining 30% of the data for validation. The ROC-AUC curves showed that all six MLAs have good to excellent predictive accuracy (Fig. 6b). XGBoost, RF and ANN validated the highest accuracy of the training data and showed AUC values of 0.985, 0.969 and 0.971, respectively. SVM, KNN and NB also validated the accuracy of the training data and showed AUC values of 0.831, 0.735 and 0.719, respectively (see Table 2 in S2).

In these plots, the Y-axis represents sensitivity, which is the proportion of all actual positive outputs that are predicted correctly, and the X-axis represents 1 − specificity, where specificity is likewise the proportion of all actual negative outputs that are predicted correctly.

Table 2 summarizes the different statistical measures of the predictive models, such as recall, precision, F-score, MAE, RMSE and correlation coefficient values. Precision and recall are inversely related,

6
U. Rasool et al. Chemosphere 303 (2022) 135265

Fig. 5. The GWPP maps created from different MLAs for the study area: (a) ANN, (b) SVM, (c) RF, (d) KNN, (e) NB and (f) XGBoost.

which implies that increasing recall decreases precision and vice versa. carbonates dissolved in the water, and in the water samples the H value
We found that the F-scores of all predictive models were close to 1, ranged from 55 to 88. Alkalinity was slightly elevated in the water
indicating that their performance is relatively strong. In particular, the samples due to slightly elevated Ph values and ranged from 123 to 188
ANN, RF, and XGBoost models displayed F-scores more than 90%, while mg/L. Turbidity (Tu) depends on the solids present in suspended form,
KNN, NB, and SVM models revealed F-scores larger than 80%. and the water samples had values between 0.6 and 2.7 NTU.
The XGBoost, RF, and ANN models were evaluated as the best-fitting models because their MAE and RMSE values were close to 0, whereas the other three models (SVM, KNN, and NB) had higher values. The correlation coefficients of XGBoost, RF and ANN showed strong positive correlation (0.7–1.0), while those of SVM, KNN and NB showed moderate positive correlation (0.3–0.7).

4. Hydro-chemical analysis and water quality assessment

Various inorganic and organic minerals or salts dissolved in water are referred to as total dissolved solids (TDS). TDS values in the study area samples ranged from 146.7 to 398.7 mg/L. The water samples were slightly acidic to alkaline, with pH values ranging from 6.28 to 7.98. Electrical conductivity (E.C.) is the ability of water to conduct an electric current; in the water samples the values ranged from 226 to 487 μS/cm. Hardness (H) of water indicates the amount of calcium

The resulting WQI map ranged from 26.49 to 99.86. Well Nos. 6, 21, and 29 had higher values (91.43–99.86), evidencing poor water quality for drinking purposes in and around these wells due to the presence of agricultural wastes in the surrounding area (Sahu and Sikdar, 2008). Several secondary sources can also affect GW quality, including irrigation, flooding, industrial pollution, fungicides, fertilizers, pesticides, and untreated municipal waste (Tareen et al., 2013). Most of the wells indicated good GW conditions (e.g., well numbers 2, 3, 12, 21, 22, 25, 34, 42 and 43), with WQI values ranging from 26.49 to 48.62. Nine sites were found to be unsuitable for drinking purposes. Khan et al. reported that water in the Quetta valley is fit for consumption with respect to physico-chemical and aesthetic water quality parameters but is generally poor owing to bacterial contamination (Khan et al., 2017). In the present study area, most of the area falls in the medium WQI class. A very small area had poor GW conditions and was mostly on agricultural land. The distribution of WQI in the study area is shown in Fig. 7.


Fig. 6. Assessment of the different MLAs performance with ROC curves and AUC: (a) AUC of training data, (b) AUC of validation data.
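
The AUC values summarized in Fig. 6 can be read as the probability that a randomly chosen positive sample (a well with yield) receives a higher model score than a randomly chosen negative one. The sketch below computes AUC directly from that definition. It is a hedged illustration, not the authors' implementation; the function name `roc_auc` and the toy labels and scores are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = well with yield, 0 = well without yield (hypothetical).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 pairs are ranked correctly
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a data set of a few hundred wells; a rank-based formulation gives the same value for larger sets.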

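The metrics reported in Table 2 can be sketched in a few lines of Python. The example assumes binary labels (1 = yield, 0 = no yield) and hard 0/1 predictions; the function names and the toy vectors are hypothetical and are not taken from the study.

```python
import math


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 score for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mae_rmse(y_true, y_pred):
    """Mean absolute error and root mean squared error."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return mae, rmse


# Toy example: two true positives, one false positive, one false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
mae, rmse = mae_rmse(y_true, y_pred)
```

With hard 0/1 predictions every error has magnitude 1, so RMSE equals the square root of MAE. The training values in Table 2 are close to that relationship (for example RF: MAE 0.017, RMSE 0.130), which is consistent with errors computed on class labels, although the source does not say how the errors were computed.
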
Table 2
MAE, RMSE and correlation coefficient of predictive models along with recall, precision and F-score values for training and validation datasets.

Metric                   ANN          RF           SVM          KNN          NB           XGBoost
                         Ta/Vb        Ta/Vb        Ta/Vb        Ta/Vb        Ta/Vb        Ta/Vb
Recall                   0.959/0.965  0.964/1      0.905/0.888  0.835/0.852  0.808/0.823  0.988/0.941
Precision                0.934/0.933  0.987/0.971  0.846/0.865  0.789/0.763  0.828/0.778  0.977/0.970
F1 Score                 0.946/0.949  0.976/0.985  0.875/0.876  0.811/0.805  0.818/0.8    0.982/0.955
MAE                      0.094/0.04   0.017/0.02   0.119/0.14   0.153/0.18   0.145/0.18   0.025/0.02
RMSE                     0.306/0.2    0.130/0.141  0.345/0.374  0.392/0.424  0.381/0.424  0.160/0.141
Correlation Coefficient  0.804/0.914  0.956/0.914  0.686/0.673  0.6/0.57     0.617/0.588  0.935/0.955
R2                       0.363/0.326  0.381/0.346  0.648/0.837  0.916/911    0.471/454    0.875/0.913
Adjusted R2              0.353/0.312  0.376/0.332  0.645/8.33   0.915/0.909  0.466/0.442  0.874/0.912

a Training.
b Validation.

4.5. Final GWPP maps

GWPP maps are primarily used to manage aquifer systems by determining which areas are more productive. However, successful aquifer management also requires information on GW quality, because quality determines the suitability of the water for applications such as drinking, agriculture, and industry. Salinity (Prinos et al., 2014), non-point-source pollutants such as nitrate from agriculture (Puckett et al., 2011), and geogenic pollutants such as arsenic may affect water quality in highly productive aquifers (Schaefer et al., 2017).

When the final GWPP map was compared with the models' output, different patterns in the distribution of potential classes emerged, indicating that GW quality should be considered when analyzing GW potential (Al-Abadi et al., 2021). The excellent GWPP class was mainly located in mountainous areas, while the good to moderate GWPP classes were mainly located in urban areas with dense population and artificial vegetation cover. In urban areas, agricultural and livestock wastes mixed with GW, which reduced GW quality. The poor class areas were mostly located in agricultural and industrial areas, where water infiltrates into the subsurface carrying dissolved solids and other toxic chemicals, ultimately affecting water quality.

However, the maps of potential GW productivity resulting from ANN, RF, and XGBoost with WQI were ranked high because of their excellent results. In Fig. 8, the results show that the areas near the capital and the city center had good to moderate GWPP in all maps. However, SVM (Fig. 8b), KNN (Fig. 8d), and NB (Fig. 8e) incorrectly predicted GW yields in some areas, such as northeast to north (SVM model) and northwest (KNN and NB models), where GW yields were less than 3 to 4 gallons per minute, the threshold used in this study to mark an area as having no water potential.

Fig. 7. The IDW interpolation using GIS of WQI results at 35 different locations.

5. Discussion

Many studies have reported the significance and predictive capability of spring-related variables used in GW spring potential assessments, which are determined by the area's geographical, morphological, hydrological, and climatic conditions (Naghibi and Pourghasemi, 2015; Chen et al., 2019a). In Ozdemir's interpretation, topographic characteristics such as elevation and slope have a negative effect on GW spring capacity, whereas TWI and drainage density have a positive effect (Ozdemir, 2011). Similar studies report that topographical characteristics, along with soil cover characteristics, tectonic characteristics (fault depth and distance to faults), and hydrological characteristics (drainage density), influence the rates of rainfall-runoff and infiltration, and thus the potential occurrence of GW springs (Ozdemir, 2011; Naghibi and Pourghasemi, 2015). Chen et al. reported that lithology, altitude, and distance to streams had the most significant effect, while land use, NDVI, plan curvature, and profile curvature had the least influence (Chen et al., 2018).

5.1. Model influencing factors co-vary with different feature selection

The comparative significance of the 16 GW influencing factors was determined using the "variable importance" function and/or the built-in feature importance function of each model (ANN, SVM, RF, KNN, NB and XGBoost). The results revealed that the importance of the influencing factors differed among MLAs because of their different feature selection methods or feature importance criteria. In the ANN model, aspect had an influence of 100% during model development, but in the XGBoost and RF
models, it decreased to 29.77% and 15.73%, respectively. In the RF model, altitude had an influence of 87.05% during model development, but in the ANN and XGBoost models it decreased to 85.12% and 16.29%, respectively. Similarly, in the XGBoost model, slope degree had an influence of 80.92% during model development, which increased to 91.85% in the ANN model and decreased to 47.16% in the RF model.

Because the influencing factors behaved differently across models, we calculated the overall significance of each variable to understand its overall importance (Fig. 9). In the present study, geology, slope, drainage density, distance to highway, distance to river, distance to lineaments, elevation and lineament density were more important than the other factors. These factors reflect the study area's geomorphology: more than 50% of the area is mountainous. Geomorphology is the most important element, since the geomorphic features of each research area's diverse landforms either enhance or limit the potential for GW (Prasad et al., 2020). Curvature, SPI, aspect and TRI were found to be the least important factors.

5.2. Model performance

Several studies have shown that RF models achieve better precision than other models. Naghibi et al. used RF, RF-optimized genetic algorithm (RFGA) and SVM methods to assess GW potential from spring locations (Naghibi and Dashtpagerdi, 2017). The XGBoost model has been used in recent studies to estimate GW level (Ibrahem Ahmed Osman et al., 2021), analyze GW salinity (Sahour et al., 2020), and assess GW quality (Bedi et al., 2020). Naghibi et al. used RF, parallel random forest, and XGBoost models to determine GW spring potential (Naghibi et al., 2020). The AUROC values for all models were about 86%, with the XGBoost model having the highest value. Pourghasemi et al. applied three machine learning models, of which RF achieved an accuracy of 0.987 (Pourghasemi et al., 2020). Chen et al. applied NB, SVM and RF models and obtained accuracies of 75.22%, 94.55% and 95.22%, respectively (Chen et al., 2019b).

In the present study, based on the ROC curve performance on training data, the ANN, RF and XGBoost models showed the highest accuracy (93.5%, 96.8% and 98.3%, respectively), while the SVM, KNN and NB models showed lower accuracy (80.1%, 71.9% and 71.4%, respectively). On the validation data, the ANN, RF and XGBoost models confirmed the accuracy of the trained models, with accuracies of 97.1%, 96.9% and 98.5%, respectively, while the SVM, KNN and NB models again showed lower accuracy (83.1%, 73.5% and 71.9%, respectively). According to the F-score values, the prediction models exhibited high accuracy; in particular, ANN, RF, and XGBoost had F-scores above 90%. MLAs are data-driven methods that rely primarily on the quantity and quality of the available data. The models make decisions based only on data, resulting in adjusted threshold lines for different ML models. Each MLA has pros and cons that should be understood during model building.

Based on previous studies, the natural breaks (Jenks) method was used to classify the GWPP maps (Razavi-Termeh et al., 2019; Al-Djazouli et al., 2020). We calculated the area in km2 of all four GWPP classes from the natural-breaks classification and discuss only the good GWPP class, as it covers most of the urbanized area. Statistics showed that the ANN, RF and XGBoost models predicted 646.91 km2, 683.76 km2 and 655.67 km2, respectively, in the good GWPP class, while the SVM, KNN and NB models predicted 542.01 km2, 443.42 km2 and 541.17 km2, respectively; the prediction of these three models for this class was slightly wrong.

Fig. 8. Classified GWPP maps based on natural breaks (Jenks); the maps were classified into 5 classes (Very Poor, Poor, Moderate, Good and Very Good): (a) ANN, (b) SVM, (c) RF, (d) KNN, (e) NB and (f) XGBoost.

6. Conclusions

GWPP maps were prepared for the Quetta Valley, Baluchistan, Pakistan, based on various geological, anthropogenic, morphometric and hydrological influencing factors (precipitation, distance to highways, distance to rivers, distance to lineaments, distance to faults, SPI, elevation, land use/land cover, geology, lineament density, drainage density, slope, TRI, TWI, curvature and aspect) with six MLAs (ANN, SVM, RF, KNN, NB and XGBoost). Data from 167 wells, with and without yield, were used for the selected models. 70% of the data
was used to train the models and the remaining 30% was used to validate the results. The performance of the MLAs on the training data was satisfactory based on the ROC curves and AUC, and showed that the XGBoost, RF and ANN models performed excellently, with accuracies of 98.3%, 96.8% and 93.5%, respectively. Hydro-chemical data were also analyzed and the WQI was calculated, giving satisfactory results. We classified the WQI of the study area into four major classes and found that most of the samples demonstrated good to moderate water quality. The final GWPP maps were prepared using the MLAs' output and the WQI, and indicated that mountainous regions have high GWPP. The final GWPP maps could be used to plan and install new GW pumping stations, avoiding extra boring costs and, most importantly, depletion of the GW level through unnecessary usage. In addition, the current study found that places near the city center and urbanized areas exhibited moderate to good GWPP, while places with dense vegetation and industrial sites exhibited poor GWPP.

Fig. 9. Overall importance of the selected conditioning factors, assessed with the six MLAs used in this study; each MLA worked with different selected conditioning factors.

7. Recommendations

- With GW levels declining at an alarming rate (1 m/year) since 2007, using MLAs to investigate GWPP is time- and cost-effective in GW studies.
- These results can be used with high confidence in future studies in other areas of Pakistan to analyze the spatial distribution of GWPP areas and to support constructive management by GW-related agencies.
- Deep learning and hybrid models are the best tools in areas where there are too few sample points to investigate GW potential mapping.

Author contributions statement

Umair Rasool: Conceptualization, Data collection, Writing - original draft, Investigation, Data curation; Xinan Yin: Conceptualization, Writing - original draft; Zongxue Xu: Reviewing Manuscript, Supervision; Muhammad Awais Rasool: Writing - original draft, Methodology,
Reviewing Manuscript; Venkatramanan Senapathi: Reviewing Manuscript, Data Curation; Mureed Hussain: Data Collection, Data Curation; Jamil Siddique: Investigation, Data Curation; Juan Carlos Trabucco: Reviewing Manuscript, Helping ML models building.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors are thankful to the Pakistan Center of Research in Water Resources (PCRWR) for providing GW yield data and hydro-chemical data, the Pakistan Meteorological Department (PMD) for providing precipitation data, the Survey of Pakistan for providing different types of vector data for the present study, and the United States Geological Survey (USGS) for providing the free Sentinel-2 images and (SRTM) Global Digital Elevation Model via https://earthexplorer.usgs.gov/. The authors are also thankful to the National Key R&D Program of China (2017YFC1502701) for providing the funding for this research.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.chemosphere.2022.135265.

References

Adiat, K., Nawawi, M., Abdullah, K., 2012. Assessing the accuracy of GIS-based elementary multi criteria decision analysis as a spatial prediction tool–a case of predicting potential zones of sustainable groundwater resources. J. Hydrol. 440, 75–89.
Al-Abadi, A.M., Fryar, A.E., Rasheed, A.A., Pradhan, B., 2021. Assessment of groundwater potential in terms of the availability and quality of the resource: a case study from Iraq. Environ. Earth Sci. 80, 1–22.
Al-Djazouli, M.O., Elmorabiti, K., Rahimi, A., Amellah, O., Fadil, O.A.M., 2020. Delineating of Groundwater Potential Zones Based on Remote Sensing, GIS and Analytical Hierarchical Process: a Case of Waddai, Eastern Chad. GeoJournal, pp. 1–14.
Alam, K., Ahmad, N., 2014. Determination of aquifer geometry through geophysical methods: a case study from Quetta Valley, Pakistan. Acta Geophys. 62, 142–163.
Amadi, A., Olasehinde, P., 2010. Application of remote sensing techniques in hydrogeological mapping of parts of Bosso Area, Minna, North-Central Nigeria. Int. J. Phys. Sci. 5, 1465–1474.
Amudu, G., Onwuemesi, A., Ajaegwu, N., Onuba, L., Omali, A., 2008. Electrical resistivity investigation for groundwater in the Basement Complex terrain: a case study of Idi-Ayunre and its environs, Oyo State, Southwestern Nigeria. Nat. Appl. Sci. J. 9, 1–12.
Arabameri, A., Rezaei, K., Cerda, A., Lombardo, L., Rodrigo-Comino, J., 2019. GIS-based groundwater potential mapping in Shahroud plain, Iran. A comparison among statistical (bivariate and multivariate), data mining and MCDM approaches. Sci. Total Environ. 658, 160–177.
Archer, K.J., Kimes, R.V., 2008. Empirical characterization of random forest variable importance measures. Comput. Stat. Data Anal. 52, 2249–2260.
Avtar, R., Singh, C., Shashtri, S., Singh, A., Mukherjee, S., 2010. Identification and analysis of groundwater potential zones in Ken–Betwa river linking area using remote sensing and geographic information system. Geocarto Int. 25, 379–396.
Bedi, S., Samal, A., Ray, C., Snow, D., 2020. Comparative evaluation of machine learning models for groundwater quality assessment. Environ. Monit. Assess. 192, 776. https://doi.org/10.1007/s10661-020-08695-3.
Berhanu, K.G., Hatiye, S.D., 2020. Identification of groundwater potential zones using proxy data: case study of Megech watershed, Ethiopia. J. Hydrol.: Reg. Stud. 28, 100676.
Betrie, G.D., Tesfamariam, S., Morin, K.A., Sadiq, R., 2013. Predicting copper concentrations in acid mine drainage: a comparative analysis of five machine learning techniques. Environ. Monit. Assess. 185, 4171–4182. https://doi.org/10.1007/s10661-012-2859-7.
Bhargavi, P., Jyothi, S., 2009. Applying naive bayes data mining technique for classification of agricultural land soils. Int. J. Comput. Sci. Netw. Secur. 9, 117–122.
Boutaleb, S., Boualoul, M., Bouchaou, L., Oudra, M., 2008. Application of remote-sensing and surface geophysics for groundwater prospecting in a hard rock terrain, Morocco. In: Applied Groundwater Studies in Africa. CRC Press, pp. 225–240.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Candel, A., Parmar, V., LeDell, E., Arora, A., 2016. Deep Learning with H2O. H2O.ai Inc.
Chen, T., Guestrin, C., 2016. Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd Acm Sigkdd International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
Chen, W., Li, H., Hou, E., Wang, S., Wang, G., Panahi, M., Li, T., Peng, T., Guo, C., Niu, C., 2018. GIS-based groundwater potential analysis using novel ensemble weights-of-evidence with logistic regression and functional tree models. Sci. Total Environ. 634, 853–867.
Chen, W., Pradhan, B., Li, S., Shahabi, H., Rizeei, H.M., Hou, E., Wang, S., 2019a. Novel hybrid integration approach of bagging-based Fisher’s linear discriminant function for groundwater potential analysis. Nat. Resour. Res. 28, 1239–1258.
Chen, W., Tsangaratos, P., Ilia, I., Duan, Z., Chen, X., 2019b. Groundwater spring potential mapping using population-based evolutionary algorithms and data mining methods. Sci. Total Environ. 684, 31–49.
Chicco, D., Warrens, M.J., Jurman, G., 2021. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 7, e623.
Cutler, F., 2020. Original by LB and A, Wiener R Port by AL and M. randomForest. Breiman and Cutler’s Random Forests for Classification and Regression, 2015.
Dietterich, T., 1995. Overfitting and undercomputing in machine learning. ACM Comput. Surv. 27, 326–327.
Edition, F., 2011. Guidelines for drinking-water quality. WHO Chron. 38, 104–108.
Elmahdy, S.I., Mohamed, M.M., 2014. Relationship between geological structures and groundwater flow and groundwater salinity in Al Jaaw Plain, United Arab Emirates; mapping and analysis by means of remote sensing and GIS. Arabian J. Geosci. 7, 1249–1259.
Fan, J., Zheng, J., Wu, L., Zhang, F., 2021. Estimation of daily maize transpiration using support vector machines, extreme gradient boosting, artificial and deep neural networks models. Agric. Water Manag. 245, 106547.
Friedman, J.H., Meulman, J.J., 2003. Multiple additive regression trees with application in epidemiology. Stat. Med. 22, 1365–1381. https://doi.org/10.1002/sim.1501.


Garosi, Y., Sheklabadi, M., Conoscenti, C., Pourghasemi, H.R., Van Oost, K., 2019. Assessing the performance of GIS-based machine learning models with different accuracy measures for determining susceptibility to gully erosion. Sci. Total Environ. 664, 1117–1132. https://doi.org/10.1016/j.scitotenv.2019.02.093.
Gayen, A., Pourghasemi, H.R., Saha, S., Keesstra, S., Bai, S., 2019. Gully erosion susceptibility assessment and management of hazard-prone areas in India using different machine learning algorithms. Sci. Total Environ. 668, 124–138.
Gedeon, T.D., 1997. Data mining of inputs: analysing magnitude and functional measures. Int. J. Neural Syst. 8, 209–218. https://doi.org/10.1142/s0129065797000227.
Golkarian, A., Naghibi, S.A., Kalantar, B., Pradhan, B., 2018. Groundwater potential mapping using C5.0, random forest, and multivariate adaptive regression spline models in GIS. Environ. Monit. Assess. 190, 1–16.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT press.
Gu, C., Mu, X., Zhao, G., Gao, P., Sun, W., Yu, Q., 2016. Changes in stream flow and their relationships with climatic variations and anthropogenic activities in the Poyang Lake Basin, China. Water 8, 564.
Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.
He, Q.P., Wang, J., 2007. Fault detection using the k-nearest neighbor rule for semiconductor manufacturing processes. IEEE Trans. Semicond. Manuf. 20, 345–354.
Ibrahem Ahmed Osman, A., Najah Ahmed, A., Chow, M.F., Feng Huang, Y., El-Shafie, A., 2021. Extreme gradient boosting (Xgboost) model to predict the groundwater levels in Selangor Malaysia. Ain Shams Eng. J. 12, 1545–1556. https://doi.org/10.1016/j.asej.2020.11.011.
Inglada, J., Arias, M., Tardy, B., Hagolle, O., Valero, S., Morin, D., Dedieu, G., Sepulcre, G., Bontemps, S., Defourny, P., 2015. Assessment of an operational system for crop type map production using high temporal and spatial resolution satellite optical imagery. Rem. Sens. 7, 12356–12379.
Islam, A.R.M.T., Talukdar, S., Mahato, S., Kundu, S., Eibek, K.U., Pham, Q.B., Kuriqi, A., Linh, N.T.T., 2021. Flood susceptibility modelling using advanced ensemble machine learning models. Geosci. Front. 12, 101075.
Kakar, N., Kakar, D.M., Khan, A.S., Khan, S.D., 2019. Land subsidence caused by groundwater exploitation in Quetta Valley, Pakistan. Int. J. Econ. Env. Geol. 10–19.
Kanagaraj, G., Elango, L., 2016. Hydrogeochemical processes and impact of tanning industries on groundwater quality in Ambur, Vellore district, Tamil Nadu, India. Environ. Sci. Pollut. Res. Int. 23, 24364–24383. https://doi.org/10.1007/s11356-016-7639-4.
Kariminejad, N., Hosseinalizadeh, M., Pourghasemi, H.R., Ownegh, M., Rossi, M., Tiefenbacher, J.P., 2020. Optimizing collapsed pipes mapping: effects of DEM spatial resolution. Catena 187, 104344.
Kaushal, S.S., Gold, A.J., Mayer, P.M., 2017. Land Use, Climate, and Water Resources—Global Stages of Interaction. Multidisciplinary Digital Publishing Institute, p. 815.
Kazmi, A., Abbas, G., Younas, S., 2005a. Water Resources and Hydrogeology of Quetta Basin, Balochistan, Pakistan. Geological Survey of Pakistan, Quetta.
Kazmi, A., Abbas, S., Younas, M., 2005b. Water Resources and Hydrogeology of Quetta Basin. Geological Survey of Pakistan, special publication.
Khan, A., Mian, B., 2000. Groundwater development issues of Baluchistan. In: Proceedings of the Global Water Partnership Seminar on Regional Groundwater Management.
Khan, M.W., Khalid, M., HabibUllah, H.U.R., Ayaz, Y., Ullah, F., Jadoon, M.A., Afridi, S., 2017. Detection of arsenic (As), antimony (Sb) and bacterial contamination in drinking water. Bio, Form 9, 133–238.
Kim, J.-C., Jung, H.-S., Lee, S., 2018a. Groundwater productivity potential mapping using frequency ratio and evidential belief function and artificial neural network models: focus on topographic factors. J. Hydroinf. 20, 1436–1451.
Kim, J.-C., Lee, S., Jung, H.-S., Lee, S., 2018b. Landslide susceptibility mapping using random forest and boosted tree models in Pyeong-Chang, Korea. Geocarto Int. 33, 1000–1015.
Lee, S., Hong, S.-M., Jung, H.-S., 2018. GIS-based groundwater potential mapping using artificial neural network and support vector machine models: the case of Boryeong city in Korea. Geocarto Int. 33, 847–861.
Marjanovic, M., Bajat, B., Kovacevic, M., 2009. Landslide susceptibility assessment with machine learning algorithms. In: 2009 International Conference on Intelligent Networking and Collaborative Systems. IEEE, pp. 273–278.
Marjanović, M., Kovačević, M., Bajat, B., Voženílek, V., 2011. Landslide susceptibility assessment using SVM machine learning algorithm. Eng. Geol. 123, 225–234.
Micheletti, N., Foresti, L., Kanevski, M., Pedrazzini, A., Jaboyedoff, M., 2011. Landslide Susceptibility Mapping Using Adaptive Support Vector Machines and Feature Selection. Master Thesis submitted to University of Lausanne Faculty of Geosciences and Environment for the Degree of Master of Science in Environmental Geosciences, p. 99.
Micheletti, N., Foresti, L., Robert, S., Leuenberger, M., Pedrazzini, A., Jaboyedoff, M., Kanevski, M., 2014. Machine learning feature selection methods for landslide susceptibility mapping. Math. Geosci. 46, 33–57.
Mirzaei, S., Vafakhah, M., Pradhan, B., Alavi, S.J., 2021. Flood susceptibility assessment using extreme gradient boosting (EGB), Iran. Earth Sci. India 14, 51–67.
Mogaji, K., Aboyeji, O., Omosuyi, G., 2011. Mapping of lineaments for groundwater targeting in the basement complex region of Ondo State, Nigeria, using remote sensing and geographic information system (GIS) techniques. Int. J. Water Resour. Environ. Eng. 3, 150–160.
Mojaddadi Rizeei, H., Pradhan, B., Saharkhiz, M.A., 2019. Urban object extraction using Dempster Shafer feature-based image analysis from worldview-3 satellite imagery. Int. J. Rem. Sens. 40, 1092–1119.
Moriasi, D.N., Arnold, J.G., Van Liew, M.W., Bingner, R.L., Harmel, R.D., Veith, T.L., 2007. Model evaluation guidelines for systematic quantification of accuracy in watershed simulations. Trans. ASABE (Am. Soc. Agric. Biol. Eng.) 50, 885–900.
Motevalli, A., Naghibi, S.A., Hashemi, H., Berndtsson, R., Pradhan, B., Gholami, V., 2019. Inverse method using boosted regression tree and k-nearest neighbor to quantify effects of point and non-point source nitrate pollution in groundwater. J. Clean. Prod. 228, 1248–1263.
Naghibi, S.A., Dashtpagerdi, M.M., 2017. Evaluation of four supervised learning methods for groundwater spring potential mapping in Khalkhal region (Iran) using GIS-based features. Hydrogeol. J. 25, 169–189.
Naghibi, S.A., Dolatkordestani, M., Rezaei, A., Amouzegari, P., Heravi, M.T., Kalantar, B., Pradhan, B., 2019. Application of rotation forest with decision trees as base classifier and a novel ensemble model in spatial modeling of groundwater potential. Environ. Monit. Assess. 191, 1–20.
Naghibi, S.A., Hashemi, H., Berndtsson, R., Lee, S., 2020. Application of extreme gradient boosting and parallel random forest algorithms for assessing groundwater spring potential using DEM-derived factors. J. Hydrol. 589, 125197.
Naghibi, S.A., Pourghasemi, H.R., 2015. A comparative assessment between three machine learning models and their performance comparison by bivariate and multivariate statistical methods in groundwater potential mapping. Water Resour. Manag. 29, 5217–5236.
Naghibi, S.A., Pourghasemi, H.R., Pourtaghi, Z.S., Rezaei, A., 2015. Groundwater qanat potential mapping using frequency ratio and Shannon’s entropy models in the Moghan watershed, Iran. Earth Sci. India 8, 171–186.
Nansen, C., Elliott, N., 2016. Remote sensing and reflectance profiling in entomology. Annu. Rev. Entomol. 61, 139–158. https://doi.org/10.1146/annurev-ento-010715-023834.
Olden, J.D., Lawler, J.J., Poff, N.L., 2008. Machine learning methods without tears: a primer for ecologists. Q. Rev. Biol. 83, 171–193.
Olubusola, I.S., Adebo, B.A., Oladimeji, O.K., Ayodele, A., 2018. Application of gis and multi-criteria decision analysis to geoelectric parameters for modeling of groundwater potential around ilesha, southwestern Nigeria. Eur. J. Acad. Ess. 5, 105–123.
Ozdemir, A., 2011. Using a binary logistic regression method and GIS for evaluating and mapping the groundwater spring potential in the Sultan Mountains (Aksehir, Turkey). J. Hydrol. 405, 123–136.
Panahi, M., Sadhasivam, N., Pourghasemi, H.R., Rezaie, F., Lee, S., 2020. Spatial prediction of groundwater potential mapping based on convolutional neural network (CNN) and support vector regression (SVR). J. Hydrol. 588, 125033.
Pham, B.T., Jaafari, A., Van Phong, T., Mafi-Gholami, D., Amiri, M., Van Tao, N., Duong, V.-H., Prakash, I., 2021. Naïve Bayes ensemble models for groundwater potential mapping. Ecol. Inf. 64, 101389.
Pourghasemi, H.R., Sadhasivam, N., Yousefi, S., Tavangar, S., Nazarlou, H.G., Santosh, M., 2020. Using machine learning algorithms to map the groundwater recharge potential zones. J. Environ. Manag. 265, 110525.
Powers, D.M., 2020. Evaluation: from Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation. arXiv preprint arXiv:2010.16061.
Prasad, P., Loveson, V.J., Kotha, M., Yadav, R., 2020. Application of machine learning techniques in groundwater potential mapping along the west coast of India. GIScience Remote Sens. 57, 735–752.
Prinos, S.T., Wacker, M.A., Cunningham, K.J., Fitterman, D.V., 2014. Origins and Delineation of Saltwater Intrusion in the Biscayne Aquifer and Changes in the Distribution of Saltwater in Miami-Dade County, Florida. US Geological Survey.
Puckett, L.J., Tesoriero, A.J., Dubrovsky, N.M., 2011. Nitrogen Contamination of Surficial Aquifers A Growing Legacy. ACS Publications.
Rajaveni, S., Brindha, K., Elango, L., 2017. Geological and geomorphological controls on groundwater occurrence in a hard rock region. Appl. Water Sci. 7, 1377–1389.
Ratner, B., 2009. The correlation coefficient: its values range between +1/−1, or do they? J. Target Meas. Anal. Market. 17, 139–142.
Razavi-Termeh, S.V., Sadeghi-Niaraki, A., Choi, S.-M., 2019. Groundwater potential mapping using an integrated ensemble of three bivariate statistical models with random forest and logistic model tree models. Water 11, 1596.
Rehrl, C., Birk, S., 2010. Hydrogeological characterisation and modelling of spring catchments in A changing environment. Aust. J. Earth Sci. 103.
Robert, T., Caterina, D., Deceuster, J., Kaufmann, O., Nguyen, F., 2012. A salt tracer test monitored with surface ERT to detect preferential flow and transport paths in fractured/karstified limestones. Geophysics 77, B55–B67.
Sahour, H., Gholami, V., Vazifedan, M., 2020. A comparative analysis of statistical and machine learning techniques for mapping the spatial distribution of groundwater salinity in a coastal aquifer. J. Hydrol. 591, 125321.
Sahu, P., Sikdar, P., 2008. Hydrochemical framework of the aquifer in and around East Kolkata Wetlands, West Bengal, India. Environ. Geol. 55, 823–835.
Sameen, M.I., Pradhan, B., Lee, S., 2019. Self-learning random forests model for mapping groundwater yield in data-scarce areas. Nat. Resour. Res. 28, 757–775.
Sánchez-Maroño, N., Alonso-Betanzos, A., Calvo-Estévez, R.M., 2009. A wrapper method for feature selection in multiple classes datasets. In: International Work-Conference on Artificial Neural Networks. Springer, pp. 456–463.
Schaefer, M.V., Guo, X., Gan, Y., Benner, S.G., Griffin, A.M., Gorski, C.A., Wang, Y., Fendorf, S., 2017. Redox controls on arsenic enrichment and release from aquifer sediments in central Yangtze River Basin. Geochem. Cosmochim. Acta 204, 104–119.
Shahabi, H., Shirzadi, A., Ghaderi, K., Omidvar, E., Al-Ansari, N., Clague, J.J., Geertsema, M., Khosravi, K., Amini, A., Bahrami, S., 2020. Flood detection and susceptibility mapping using sentinel-1 remote sensing data and a machine learning approach: hybrid intelligence of bagging ensemble based on k-nearest neighbor classifier. Rem. Sens. 12, 266.


Soni, J., Ansari, U., Sharma, D., Soni, S., 2011. Predictive data mining for medical diagnosis: an overview of heart disease prediction. Int. J. Comput. Appl. 17, 43–48.
Tareen, A.K., Sultan, I.N., Khan, M.W., Khan, A., 2013. Determination of heavy metals found in different sizes of tube wells of district pishin balochistan, Pakistan. Asian J. Inf. Technol. 4, 17–21.
TCI, C., 2008. ARD. 2004. Techno Consult International Corporation, Cameous and Arab Resources Development. Research for Water and Sanitation Authority, Quetta. Quetta water supply and environmental improvement project 2.
Tiwari, T., Mishra, M., 1985. A preliminary assignment of water quality index of major Indian rivers. Indian J. Environ. Protect. 5, 276–279.
Vapnik, V., Guyon, I., Hastie, T., 1995. Support vector machines. Mach. Learn. 20, 273–297.
Verbesselt, J., Jonsson, P., Lhermitte, S., Van Aardt, J., Coppin, P., 2006. Evaluating satellite and climate data-derived indices as fire risk indicators in savanna ecosystems. IEEE Trans. Geosci. Rem. Sens. 44, 1622–1632.
Verma, A., Singh, T., 2013. Prediction of water quality from simple field parameters. Environ. Earth Sci. 69, 821–829.
Wheater, H.S., Mathias, S.A., Li, X., 2010. Groundwater Modelling in Arid and Semi-arid Areas. Cambridge University Press.
Youssef, A.M., Pourghasemi, H.R., Pourtaghi, Z.S., Al-Katheeri, M.M., 2016. Landslide susceptibility mapping using random forest, boosted regression tree, classification and regression tree, and general linear models and comparison of their performance at Wadi Tayyah Basin, Asir Region, Saudi Arabia. Landslides 13, 839–856.
Yuan, F., Mobley, W., Farahmand, H., Xu, Y., Blessing, R., Dong, S., Mostafavi, A., Brody, S.D., 2021. Predicting Road Flooding Risk with Machine Learning Approaches Using Crowdsourced Reports and Fine-grained Traffic Data. arXiv preprint arXiv:2108.13265.
