
Hydrogeology Journal (2023) 31:1387–1404

https://doi.org/10.1007/s10040-023-02686-7

REPORT

Spatial prediction of groundwater levels using machine learning and geostatistical models: a case study of coastal faulted aquifer systems in southeastern Tunisia
Hayet Chihi1 · Iyadh Ben Cheikh Larbi2,3

Received: 12 November 2022 / Accepted: 15 July 2023 / Published online: 22 August 2023
© The Author(s), under exclusive licence to International Association of Hydrogeologists 2023

Abstract
Developing efficient methods for groundwater level (GWL) prediction is essential for identifying the groundwater flow pattern, characterizing the spatial extent of contaminant plumes, and enhancing water resources management. Recently, significant advances have been made in predicting GWL using machine learning (ML) models, but these do not consider hydrogeological heterogeneities that condition the flow pattern. This study develops and evaluates the applicability of advanced geostatistics and ML models to characterize the spatial variability of the GWL, taking into account the discontinuities induced by complex geological environments and leveraging only piezometer positions and monitored GWL. Geostatistical-based ordinary kriging (G/OK) and kernel ridge regression (KRR) were conducted on joint-faulted coastal aquifer systems in southeastern Tunisia. Geological knowledge was incorporated into the characterization process, achieving better function modeling, and optimizing both geostatistical and ML models. The present work counts among the first ML applications that take into account the spatial variability modeling constrained by geological heterogeneities. The task is especially challenging as actual data points are scarce. The results are evaluated using cross-validation with several error and evaluation metrics. Comparative analyses were performed to assess the consistency with the hydrogeological reality. The proposed approaches generated credible GWL maps that reproduce the regional and local flow patterns. A comprehensive interpretation provides a range of essential insights on the spatial variation of the groundwater flow path and the hydraulic behavior of faults acting as conduits, barriers, or conduit-barriers. The implemented model could be applied to other analogous areas to assess GWL and other hydraulic parameters efficiently.

Keywords Groundwater level · Geostatistics · Machine learning · Kernel ridge regression · Spatial variability · Heterogeneity

Published in the special issue “Geostatistics and hydrogeology”

* Hayet Chihi
hayet_chihi@yahoo.fr; hayet.chihi@certe.rnrt.tn

1 Georesources Laboratory, Centre for Water Research and Technologies, University of Carthage, Borj Cedria Ecopark, Soliman, Tunisia
2 Technical University of Berlin, Berlin, Germany
3 German Research Center for Artificial Intelligence (DFKI), Berlin, Germany

Introduction

Groundwater resources meet a substantial part of the water needs for human life. Particularly in arid and semiarid countries, where surface water is scarce and evaporation rates are high, groundwater becomes a highly valuable resource, playing a significant role in various sectors such as agriculture, industry and drinking water supply. Global changes, including natural and anthropogenic influences on climate and the hydrological cycle, are increasingly altering surface-water levels and groundwater recharge (De Marsily 2021). Depletion of aquifers may involve drastic reductions in yield and deviation of groundwater flow pathways, which may eventually entail degradation of water quality, such as through seawater intrusion in coastal areas (Custodio 2013; UNESCO 2022).
More precisely, the groundwater level (GWL) surface geometry is expected to fluctuate over time as a consequence of short-term, seasonal, and long-term influences. Thus, the GWL serves as a basic and practical parameter for assessing the potential availability of groundwater. Therefore,


efficient methods for regionalizing GWL, and thus providing a thorough understanding of the generated groundwater flow patterns, can guide decision-makers to implement effective water resources management strategies and to protect aquifers.
Spatial estimates of GWL, however, are significantly constrained by an array of factors that include the data sources, patterns, timing, and associated uncertainties, in addition to any influences due to the interpolation procedures, which must be customized to the geologic environment. In fact, the changes in hydraulic properties primarily depend on the environmental heterogeneity imposed by the geological history (deposition, diagenesis, tectonic deformation, etc.; Matheron and De Marsily 1980; De Marsily et al. 2005; Chihi and De Marsily 2009; Chihi et al. 2015).
Groundwater level mapping usually relies on data generated from boreholes and piezometer networks that survey GWL variations. It is common, nonetheless, to witness a lack of groundwater measurements, attributed mainly to the high costs of piezometer installation and ongoing maintenance, sampling missions, and analysis. Consequently, this has affected the spatial and temporal distribution of GWL measurements. Accordingly, considering the hydrogeological heterogeneities inducing stochastic and nonstationary characteristics of GWL fluctuations, and the pattern of the related data that are often incomplete or nonexistent, it is not easy to make an accurate assessment of GWL (Boezio et al. 2006; Pirot et al. 2015; Chihi and De Marsily 2009; Chihi et al. 2016; Koch et al. 2019).
Several approaches have been implemented in previous studies to adapt more advanced interpolation algorithms, to enable increasingly efficient predictions of GWL. Geostatistical techniques, which have been widely developed, are being implemented to identify the structure of the spatial and/or temporal variability of GWL and to optimize the estimation at locations where there are no measurements, using the neighboring sampled data (Matheron 1963, 1965; Isaaks and Srivastava 1989; Delhomme 1978; Delhomme and De Marsily 2006; Goovaerts 1997; Chiles and Delfiner 2012).
Alternatively, the last decade has seen major advances in the use of machine learning (ML) models for GWL prediction and related hydrogeological modeling issues. Indeed, several authors have implemented and compared different ML models in order to improve the efficiency of GWL prediction in the time domain, ranging from annual, monthly, and daily to even hourly lead time. Most of them developed their predictive models by integrating previous GWL data and other related meteorological variables such as precipitation, air temperature, humidity and evapotranspiration. For example, artificial neural networks (ANN) (Lallahem et al. 2005; Mohanty et al. 2013; Moghaddam et al. 2019; Moghaddam et al. 2021; Kayhomayoon et al. 2022), extreme learning machines (ELM) (Yadav 2017), support vector regression (SVR) and Gaussian process regression (GPR) (Band et al. 2021), random forest (RF), alongside ANN and SVR (Tang et al. 2019; Koch et al. 2019; Ghordoyee Milan et al. 2023), and kernel models (Guzman et al. 2019) have been applied with success for temporal prediction of GWL. A more comprehensive review of the literature on ML models used for efficient estimation of GWL and other hydraulic parameters can be found in Tao et al. (2022).
While ML models have shown promising potential for temporal and time series prediction, their application in spatial modeling of hydraulic parameters, such as GWL, has been limited. To overcome this limitation, some authors make use of hybrid models that combine ML models for time domain prediction with geostatistical methods for spatial domain estimation. Nourani et al. (2011) developed a hybrid ANN–geostatistical methodology for spatiotemporal prediction of GWL in two separate stages. First, an artificial neural network is trained for each piezometer, allowing the model to forecast the GWL for the following month. Then, in the second step, the predicted GWL value at each of the piezometers is integrated into geostatistical modeling to accurately estimate the spatial distribution of GWL. Recently, Varouchakis et al. (2022) implemented a geostatistical analysis based on the Gaussian process method, using space-time GWL observations, to generate reliable spatial maps of GWL.
Upon reviewing the literature, and according to the authors’ knowledge, no ML model has been directly applied to estimate GWL in the space domain, specifically. In the context of addressing these specific issues, kernel ridge regression (KRR) is proposed as an appropriate approach. KRR is a powerful machine learning algorithm that combines the advantages of ridge regression and kernel methods to provide a more accurate and robust prediction of nonlinear functions. Kernel ridge regression makes use of a kernel function to map the data into a higher dimensional space, allowing it to capture more complex nonlinear relationships between the input and output variables (Zhang et al. 2013). It, therefore, allows better modeling of the target function and more accurate predictions of GWL; furthermore, it uses regularization to reduce the variance of the model and prevent overfitting. In the present case study, KRR, based on a Gaussian function, was guided by geostatistical thinking and the hydrostratigraphic configuration to deal with the geological complexity of the study area and the GWL data scarcity. Faults were incorporated as screens within the estimation procedure to capture the GWL local variability while reproducing the heterogeneous flow pattern at a local and regional scale. Additionally, a model selection procedure was conducted to optimize the parameters of the model and maximize the accuracy. A cross-validation procedure was also used to assess the performance of the models and prevent any potential overfitting of the data. Furthermore, the


kriging map data and the estimation error, along with cross-validation, were all employed to supplement the database with secondary GWL data. This, in turn, provides greater insight and guidance for the kernel ridge regression model, enabling more accurate predictions.
The main aim of this work is to develop ML spatial models to evaluate the spatial variability of GWL. More specifically, the implementation of spatial ML models is based on KRR, leveraging the methodology and principles of geostatistics, in particular ordinary kriging (OK), which has proven effective even in complex geological environments. Four key issues are addressed: (1) ensuring the reliability of ML models when dealing with a complex geological environment characterized by a regional fault system, (2) handling the challenge of modeling with uneven data distribution, (3) capturing the GWL variability at both regional and local scales, and (4) determining the hydraulic properties of the faults and deducing their impact on the flow pattern, featuring the regional and local water flow paths induced by the geological heterogeneities and more specifically the fault structures. The methodology was implemented and tested over the largely extended aquifer system located in the Jeffara coastal plain in southeastern Tunisia (Fig. 1). The area was chosen for a range of reasons, including the geological complexity of the basin, the presence of various joined aquifer systems that underlie this basin and the influence of tectonic processes. By considering these constraints, the developed ML model aims to accurately estimate GWL, facilitating a better understanding and management of groundwater resources and the hydrogeological environment.

The study area

The Jeffara basin, situated in southeastern Tunisia, was chosen as an ideal case study to investigate the distribution of GWL and the hydrogeological characteristics of the included aquifer systems (Fig. 1). The Jeffara basin is characterized by an almost continuous stratigraphic succession from the Permian to the Quaternary. The hydrostratigraphic studies conducted in the region (Mammou 1990; Ben Baccar 1982; Chihi et al. 2013, 2016, 2023) concluded that a huge multilayered aquifer system, combining porous and karst reservoir formations, extends along the Jeffara coastal plain. Overall, four ensembles were distinguished:

1. The Miocene sandy/detrital deposits, which constitute a relatively continuous aquifer over the entire coastal Jeffara basin
2. The Upper Cretaceous (Turonian and Lower Senonian) carbonate aquifer formations, which extend along the Mareth and Jorf regions
3. The Lower Cretaceous (Neocomian, Barremian, Aptian and Albian) continental formations, which form a complex succession of clastic sediments within the Zeuss-Koutine aquifer, where they overlie the Jurassic carbonate aquifers
4. The Lower and Middle Triassic formations that host the sandstone Sahel El Ababsa aquifer system, which is locally confined where overlain by the intermediate Triassic layer

Figure 1b displays the distribution of these aquifer systems in relation to the major normal faults (Chihi et al. 2015). The figure also specifies which parts of the aquifers are confined or unconfined. Repetitive tectonic movements have significantly influenced the aquifer-system configurations and their hydrodynamic functioning. The Jeffara basin is intensely affected by a complex system of NW–SE major faults crossed by minor NE–SW ones displaying variable normal displacement. Deep E–W-trending fault systems are also featured in the northern and southern sectors, such as the Tebaga fault. These tectonic structures have a direct impact on groundwater flow and divide the huge Jeffara hydrologic system into distinct aquifer systems that have been designated as “Mareth aquifer”, “Zeuss-Koutine aquifer”, “Jorf aquifer” and “Sahel El Ababsa aquifer” (Fig. 1b). These aquifer systems are compartmentalized into interconnected subsystems and exhibit significant lateral variability in terms of depth, thickness, facies and reservoir connectivity (Chihi et al. 2014; Soua and Chihi 2014; Hammami et al. 2018a; Mezni et al. 2022a, b). Each aquifer is thus characterized by its specific recharge, boundary, and flow conditions.
Furthermore, the tectonic activity in the region has also affected the topographic surface and shaped the major surface outcropping structures. The western part of the study area, particularly within the Dhahar Mountains, exhibits high topographic elevations, ranging from 300 to 700 m above sea level (asl). Elevations between 200 and 300 m are limited to scattered mounds, such as Jebel Tebaga, Jebel Matmata, Jebel Zemlet Leben, and Jebel Tejra (Figs. 1 and 2). Elevations from 0 to 200 m are identified in the Jeffara plain and downhill areas, including the main valley floor (Mezni et al. 2022b).
The region is dominated by an arid to semiarid climate, influenced by continental dry air masses emanating from the desert and maritime humid air masses originating from the Mediterranean Sea. The annual average rainfall is about 200 mm, while the potential evaporation is around 2,700 mm. The temperature ranges from an average of 12.5 °C in January to a maximum of 30.4 °C in August, characterizing a Mediterranean-type climate.
The region is drained by ephemeral rivers (wadis) that have a direct impact on the hydrology and geomorphology of the drainage network (Fig. 2). The main wadis are Zigzaou,


Fig. 1  a Study area location in Tunisia; b Regional geological map, showing the main geological outcrops, faults, and the nature and distribution of the aquifer systems

Oum Zessar, Zeuss, Sidi Makhlouf, and El Morra. Due to the infrequent rainfall and the prevalence of permeable and semipermeable soils in the upper geologic layers for a significant part of the area, surface flows and drainage in these rivers are very restricted. Therefore, the rivers appear to have surface flows only for a short period of time after intense precipitation events.
Accordingly, surface water is scarcely available, so instead groundwater serves as the primary resource for domestic and drinking water supply, as well as for agricultural and industrial activities. The accompanying exponential increase in the use of these water resources, especially for agricultural and other irrigation purposes, has contributed to pervasive aquifer overexploitation and to groundwater quality deterioration. Therefore, there is a need for accurate estimation and an in-depth understanding of water flow in such complex aquifer systems.

Materials and methods

The traditional methods of directly interpolating GWL measurements often result in inconsistent maps that are hydrogeologically not coherent, for two main reasons. Firstly, measurements are rarely sufficient to reconstruct the true GWL map. Secondly, existing computational


Fig. 2  Digital elevation model, stream network, main compartments and (monitored and secondary) data locations within the study area

procedures and algorithms do not adequately consider the local variability of the hydraulic parameter induced by the complex geological environment and by the hydrogeological heterogeneities prevalent in the study area. To address these issues, a comprehensive methodology is proposed, and the different steps are described in Fig. 3.

The database

The data used in this work are derived from the research project “Geological modeling for vulnerability characterization of aquifer systems”, developed by the Georesources Laboratory of the Centre for Water Research and Technologies, Tunisia. The database includes various sources (Figs. 1 and 2): (1) six geological maps covering the entire study area, acquired from the National Office of Mines (ONM), which were assembled and digitized; (2) a digital elevation model (DEM) derived from the Shuttle Radar Topography Mission (SRTM) with a resolution of about 30 m, obtained from the US Geological Survey’s Earth Explorer database (USGS 2014); and (3) observations from boreholes and piezometers installed for monitoring GWL fluctuations and geochemistry, provided by the General Directorate of Water Resources (DGRE). Water level measurements from 70 monitoring wells were recorded at the end of both the dry and wet seasons from 1974 to 2018.
All spatial data, such as digital elevation, borehole information, geologic maps, faults, and the river network, were compiled, harmonized and georeferenced using ArcGIS software (Figs. 1 and 2). The dataset used for implementing the interpolation procedures in the space domain consists of the GWL records of the year 2015 (wet season), as it provides the most complete set of records (displayed as red dots in Fig. 2). The majority of GWL records are taken from the interconnected aquifers covering the Sahel El Ababsa, Zeuss-Koutine, Mareth and Jorf aquifer systems. To address data gaps in certain areas, the measured data were complemented with secondary GWL data (indicated by blue dots in Fig. 2). Extending the dataset was necessary to supply the ML model with sufficient information to improve its performance and enhance the accuracy of the GWL prediction.

Methods

Geostatistical modeling

Geostatistical methods are based on the regionalized variables theory (Matheron 1963). The interpolation procedure, commonly called “kriging”, is distinguished from other basic methods, as it involves the spatial correlation between data points to estimate the studied regionalized variable at


[Flowchart. Input: GWL data and hydrostratigraphic configuration. G/OK model: variography (experimental variogram, theoretical variogram); GWL estimation (neighborhood search, fault parameters); kriging; cross-validation; hydrogeological realism check; final model. ML/KRR model: leave-one-out cross-validation over the whole region or compartment dataset (split according to the current fold, train the KRR model on the spatial coordinates of the training points, compare predicted and actual GWL, evaluation metrics averaged over all cross-validation folds); inference (generate grid points, trained KRR model, predicted GWLs, piezometric map); check for satisfying results and hydrogeological realism; if not satisfied, add secondary data to better model the piezometric function and/or repeat model selection; otherwise final model.]

Fig. 3  Workflow of the methodology used for groundwater level (GWL) prediction. This methodology can be applied to the whole region or to individual compartments, depending on the specific cases of spatial variability being considered. G geostatistics, OK ordinary kriging, ML machine learning, KRR kernel ridge regression
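
The ML/KRR branch of the workflow in Fig. 3 (leave-one-out cross-validation, model selection, then inference on grid points, per region or per compartment) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation, which relied mainly on scikit-learn. The function names, the synthetic well coordinates and GWL values, the two-compartment split at x = 30 km, the candidate hyperparameter grids, and the centering of GWL on the training mean are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, eta):
    """Gaussian (RBF) kernel exp(-eta * ||a - b||^2) between all rows of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-eta * sq_dist)

def krr_predict(X_train, y_train, X_new, eta, zeta):
    """Fit KRR on the training wells and predict GWL at X_new.
    Assumption: GWL is centered on the training mean before fitting."""
    mean = y_train.mean()
    K = rbf_kernel(X_train, X_train, eta)
    coef = np.linalg.solve(K + zeta * np.eye(len(X_train)), y_train - mean)
    return rbf_kernel(X_new, X_train, eta) @ coef + mean

def loo_metrics(X, y, eta, zeta):
    """Leave-one-out cross-validation: predict each well from all the others."""
    pred = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        pred[i] = krr_predict(X[keep], y[keep], X[i:i + 1], eta, zeta)[0]
    err = pred - y
    return {"mae": float(np.abs(err).mean()),
            "rmse": float(np.sqrt((err ** 2).mean()))}

def select_hyperparameters(X, y, etas, zetas):
    """Model selection: keep the (eta, zeta) pair with the lowest LOO RMSE."""
    return min(((e, z) for e in etas for z in zetas),
               key=lambda p: loo_metrics(X, y, p[0], p[1])["rmse"])

def predict_by_compartment(X, y, comp, X_grid, comp_grid, etas, zetas):
    """Fit one model per compartment so grid nodes never use wells across a fault."""
    gwl = np.full(len(X_grid), np.nan)
    for c in np.unique(comp):
        wells, nodes = comp == c, comp_grid == c
        eta, zeta = select_hyperparameters(X[wells], y[wells], etas, zetas)
        gwl[nodes] = krr_predict(X[wells], y[wells], X_grid[nodes], eta, zeta)
    return gwl

# Synthetic stand-in data (hypothetical): two compartments split at x = 30 km.
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform([0, 0], [30, 60], (20, 2)),
               rng.uniform([30, 0], [60, 60], (20, 2))])
comp = np.repeat([0, 1], 20)
y = 100.0 - 1.2 * X[:, 0] + 0.3 * X[:, 1] - 15.0 * comp  # head drop across the fault

# Regular grid of target locations, labeled with the same compartment rule.
gx, gy = np.meshgrid(np.linspace(0, 60, 13), np.linspace(0, 60, 13))
X_grid = np.column_stack([gx.ravel(), gy.ravel()])
comp_grid = (X_grid[:, 0] >= 30).astype(int)

etas, zetas = [0.001, 0.01, 0.1], [0.01, 0.1, 1.0]
metrics = loo_metrics(X, y, eta=0.01, zeta=0.1)
gwl_grid = predict_by_compartment(X, y, comp, X_grid, comp_grid, etas, zetas)
print(metrics)
```

Fitting each compartment separately is one simple way to treat faults as screens: a grid node is estimated only from wells on its own side of the fault. In scikit-learn, a comparable estimator is `KernelRidge(kernel="rbf", alpha=..., gamma=...)`, where `alpha` plays the role of the regularization strength and `gamma` the role of the RBF hyperparameter.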


unsampled locations. Only a concise overview of the ordinary kriging method is provided, as it is the specific method employed in this work. A comprehensive description of geostatistical concepts and their application to hydraulic parameter estimation can be found in several publications (e.g. Delhomme 1979; De Marsily et al. 1984; De Marsily 1986; Goovaerts 1997; Chiles and Delfiner 2012). The geostatistical modeling was performed using the ISATIS software (ISATIS 2020).
Kriging involves a series of advanced linear regression algorithms, which enable the implementation of the various kriging predictors, such as ordinary, simple, and universal kriging (Journel and Huijbregts 1978; Matheron 1963; Chihi 1998; Chihi et al. 2000, 2007). These variants of kriging can be processed as a basic linear regression expressed as follows:

$$Z^*(x_0) = \sum_{\alpha=1}^{n} \lambda_\alpha Z(x_\alpha) \quad \text{with} \quad \sum_{\alpha=1}^{n} \lambda_\alpha = 1 \tag{1}$$

In Eq. (1), $Z^*(x_0)$ is the estimate at the point $x_0$; $Z(x_\alpha)$ are the values measured at points $x_\alpha$; $\lambda_\alpha$ are the kriging weights; and $\alpha = 1, \ldots, n$, where $n$ denotes the number of samples.
The weights $\lambda_\alpha$ are estimated by minimizing the estimation variance $\sigma_k^2 = \mathrm{Var}[Z^*(x) - Z(x)]$ while ensuring the unbiasedness of the estimator, $E[Z^*(x) - Z(x)] = 0$.
In the case of ordinary kriging, which is the kriging variant employed in the present study, the weights are determined by solving a set of $n + 1$ linear equations known as the “ordinary kriging system”:

$$\begin{cases} \sum_{\beta=1}^{n} \lambda_\beta \, \gamma(x_\alpha - x_\beta) + \mu = \gamma(x_\alpha - x_0), & \alpha = 1, \ldots, n \\ \sum_{\alpha=1}^{n} \lambda_\alpha = 1 \end{cases} \tag{2}$$

In Eq. (2), $\mu$ is the Lagrange parameter that expresses the constraint on the weights $\lambda_\alpha$.
The kriging procedure requires the estimation and modeling of a variogram function $\gamma(h)$ that characterizes the spatial variability of the regionalized variable $Z(x)$:

$$\gamma(h) = \frac{1}{2} \mathrm{Var}[Z(x) - Z(x + h)] \tag{3}$$

In Eq. (3), $x$ and $x + h$ are two locations separated by the lag distance $h$.
Ordinary kriging was performed using the GWL readings from 70 monitoring wells during the winter of 2015. Figure 2 displays a map identifying the locations of the monitoring wells (shown as red dots), overlaid on the DEM with the major watercourses. Ordinary kriging (OK) was implemented using ISATIS software to make estimates of GWL, based on a calculated variogram model. The primary analysis of GWL spatial continuity and mapping revealed a SW–NE trend that was addressed using ordinary kriging, assuming local stationarity with a restricted angular-sector moving neighborhood. To capture the local variability along the major aquifer compartments, major normal faults were incorporated as screens within the estimation procedure. Each step of the constrained kriging procedure was validated using the error metrics.
This approach allows one to honor the data, generate reliable spatial maps that accurately depict the GWL variability, and identify groundwater flow patterns over the entire Jeffara of Medenine. The kriging procedures and the resulting map were used as a basis to guide the ML modeling implementation and to generate additional GWL data to compensate for the scarcity of data at specific locations.

Machine learning model: kernel ridge regression (KRR)

In this work, the applicability of machine learning (ML) is explored to automatically estimate GWL using a small dataset that only contains selected spatial information on the GWL. To improve the ML performance, a technique from geostatistical domain knowledge is applied, as explained in section ‘Modeling constraints’ and section ‘Modeling results’. In addition, a few more secondary data points that are well representative of the GWL function were incorporated in the database (Fig. 2). This approach allows one to apply and enhance ML algorithms under the geological settings and data constraints. All implementations of the ML procedures were developed using the Python programming language and mainly the scikit-learn library (Pedregosa et al. 2011).
In this regard, a more abstract reformulation of the task is outlined to clarify the ML process based on kernel ridge regression. First, there are $n$ data positions where each position $x_i \in \mathbb{R}^2$ has a GWL value $y_i \in \mathbb{R}$, which is either a real measurement or a “secondary” one. Note that $y_i$ is equivalent to $Z(x)$ in the geostatistical notation. Similarly, the estimated GWL value $y^*$ for a data point $x$ is equivalent to $Z^*(x)$. This builds up the training data required for the ML model to achieve the study’s goal, namely estimating the GWL as correctly as possible at these and other new points in the plane.
This problem is addressed as a regression task, in which the goal is to find the function, based on the training data, that best approximates the real-life function and depicts the actual groundwater hydrodynamic behavior at all points. There are several ML techniques to solve similar regression tasks, e.g. linear/polynomial regression, support vector machines, artificial neural networks, etc. (Tao et al. 2022). However, considering the sparsity and low dimensionality of the dataset, where only spatial coordinates are used as input to predict the GWL, significant importance is given to both the simplicity and modeling robustness of a potential ML technique. Advanced or complex ML models


often require substantial training data and computational resources. Indeed, the preliminary trials conducted here have also shown that the available sparse data would not give satisfactory predictions even with simple feed-forward neural networks. Therefore, it was considered more appropriate to use simple models that have great modeling capabilities given a well-representative dataset of the real function. Furthermore, the model should be intuitive, efficient to train, and easily reproducible for the prediction and mapping of other hydraulic parameters. Simple models are highly suitable to achieve all these objectives (Li et al. 2022). In fact, KRR turned out to be the most appropriate approach, given the modeling constraints explained above.
Kernel ridge regression combines ridge regression with the kernel trick; therefore, it learns a nonlinear function that best fits the data based on the chosen kernel and the data itself. This is elucidated in more detail and progressively in the following paragraphs.
First, KRR is based on ridge regression, which involves modeling a linear function so that it best represents the relationship between the input variable $X$ and the target variable $Y$, both of which are continuous (Zhang et al. 2013). This consists in finding the weights, $w$, of the model that minimize the loss function $L$ as defined in Eq. (4). The prediction $y_i^*$ given by the dot product $w^T x_i$ of the model, for the $i$th training data point, should be as close as possible to the real value $y_i$:

$$L(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - y_i^* \right)^2 = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - w^T x_i \right)^2 \tag{4}$$

To avoid overfitting, a regularization term is added to penalize the norm of $w$, resulting in Eq. (5):

$$L(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - w^T x_i \right)^2 + \frac{1}{2} \zeta \left( \|w\|^2 - C \right) \tag{5}$$

Here, $\zeta$ is a Lagrange multiplier and $C$ is a hyperparameter that controls the regularization strength. The solution for this problem, obtained through differentiation, is expressed by Eq. (6):

$$w = \left( \sum_{i=1}^{n} x_i x_i^T + n\zeta I \right)^{-1} \left( \sum_{i=1}^{n} x_i y_i \right) = \left( X X^T + n\zeta I \right)^{-1} X y, \tag{6}$$

where $X$ is a matrix containing each data point $x_i$ as a column vector and $y$ is a vector containing each of the corresponding $y_i$. $I$ represents the identity matrix.
In practice, $\zeta$ can be directly treated as the hyperparameter that controls the regularization strength, omitting the usage of the hyperparameter $C$. Furthermore, Eq. (6) can be simplified by replacing the regularization strength $n\zeta$ by $\zeta$. Now, to predict the GWL value $y^*$ for a new data point $x$, simply calculate as follows:

$$y^* = w^T x = \left( \left( X X^T + \zeta I \right)^{-1} X y \right)^T x \tag{7}$$

To tackle nonlinear problems, kernels and the kernel trick are employed. This involves mapping the data to a higher-dimensional space (which could be infinite) using an appropriate feature map and solving the problem linearly in the new feature space. The obtained solution corresponds to a nonlinear solution in the original space. The convenient aspect of this method is that all calculations in the high-dimensional feature space are performed implicitly thanks to the kernel trick.
This is explained in more detail under the following considerations. Indeed, in the kernel method setting, the feature map is defined as $\Phi : \mathbb{R}^d \to \mathbb{R}^h;\ x \mapsto \Phi(x)$, where $\mathbb{R}^d$ is the input space with $d \in \mathbb{N}$ features and $\mathbb{R}^h$ is the higher-dimensional feature space with $h \in \mathbb{N}$ (or infinite) dimensions. The kernel trick involves directly calculating the defined kernel function $k(x, x')$ for two points $x$ and $x'$ instead of computing both the feature map and the dot product $\Phi(x)^T \Phi(x')$. Accordingly, the kernel function used in this implementation is the radial basis function (RBF) kernel, expressed in Eq. (8):

$$k(x, x') = \exp\left( -\eta \, \|x - x'\|^2 \right). \tag{8}$$

Here, $\eta$ is a hyperparameter of the RBF kernel, which is equal to $\frac{1}{2\sigma^2}$, and this kernel is also known as the Gaussian kernel with variance $\sigma^2$.
This kernel function will be used in the prediction function, which is first redefined as:

$$y^* = w^T \Phi(x), \tag{9}$$

where $w \in \mathbb{R}^h$.
Consequently, the loss function $L$ becomes:

$$L(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - w^T \Phi(x_i) \right)^2 + \frac{1}{2} \zeta \left( \|w\|^2 - C \right). \tag{10}$$

The solution to minimize this loss function is given by Eq. (11):

$$w = \left( \Phi(X) \Phi(X)^T + \zeta I \right)^{-1} \Phi(X) y. \tag{11}$$

Here $\Phi(X)$ is a matrix containing $\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_n)$ as column vectors, and $\zeta$ is an appropriate hyperparameter.
The GWL $y^*$ for a new data point $x$ is then predicted using the kernel trick as follows:

$$y^* = w^T \Phi(x) = \Phi(x)^T w = \Phi(x)^T \left( \Phi(X) \Phi(X)^T + \zeta I \right)^{-1} \Phi(X) y = k(x, X) \left( K + \zeta I \right)^{-1} y \tag{12}$$
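
The step from the primal solution in Eq. (7) to the kernel form in Eq. (12) can be checked numerically. The sketch below is only an illustration under stated assumptions: the data are random numbers (not GWL measurements), the variable names are hypothetical, and the linear kernel $k(x, x') = x^T x'$ (feature map equal to the identity) is used so that both forms can be computed explicitly and compared. The RBF kernel of Eq. (8) is then plugged into the same dual formula, with $k(x, X)$ taken as the row vector of kernel values between the new point and the training points.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, zeta = 12, 2, 0.5
X = rng.normal(size=(d, n))   # columns are the data points x_i, as in Eq. (6)
y = rng.normal(size=n)        # target values y_i
x_new = rng.normal(size=d)    # new location x

# Primal ridge prediction, Eq. (7): w = (X X^T + zeta I)^-1 X y, a d x d system.
w = np.linalg.solve(X @ X.T + zeta * np.eye(d), X @ y)
y_primal = w @ x_new

# Dual prediction, Eq. (12), with the linear kernel (Phi is the identity map).
K_lin = X.T @ X               # n x n matrix with K_ij = k(x_i, x_j)
k_lin = x_new @ X             # row vector of k(x_new, x_i)
y_dual = k_lin @ np.linalg.solve(K_lin + zeta * np.eye(n), y)

def rbf(a, B, eta):
    """RBF kernel of Eq. (8) between point a and every column of B."""
    return np.exp(-eta * ((B - a[:, None]) ** 2).sum(axis=0))

# Same dual formula with the RBF kernel: no explicit feature map is needed.
eta = 0.5
K_rbf = np.stack([rbf(X[:, i], X, eta) for i in range(n)])
y_rbf = rbf(x_new, X, eta) @ np.linalg.solve(K_rbf + zeta * np.eye(n), y)

print(np.isclose(y_primal, y_dual), y_rbf)
```

With the linear kernel the two predictions agree, because $(XX^T + \zeta I)^{-1} X = X (X^T X + \zeta I)^{-1}$. The dual form only requires an $n \times n$ system in the kernel values, which is why it remains usable when the feature space is very high or infinite dimensional, as with the RBF kernel.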


In Eq. (12), $\Phi(x)^T \Phi(X) = k(x, X) = \sum_{i}^{n} k(x, x_i)$ and $K = \Phi(X)^T \Phi(X)$, so $K_{i,j} = k(x_i, x_j)$.
Considering the previous development, it is evident that following training, the GWL function of the model depends on and is determined by the training data points and the selected kernel function (including its hyperparameters). Using KRR makes it possible to optimize the models based on the available sparse dataset while considering the faults, spatial location, and GWL information. This approach helps to define the nonlinear function that estimates the GWL as best as possible.

Modeling constraints

The implementation of each of the modeling algorithms, whether based on geostatistical or ML/KRR methods, incorporated two main constraints: (1) the size and spatial distribution of the dataset and (2) the geological heterogeneity as manifested, in this work, by fault discontinuities.
Data constraint. As observed in Fig. 2, monitoring wells measuring GWLs are dispersed unevenly. The limited number of measurements makes it challenging to reconstruct an accurate GWL map, particularly in the eastern part of the study area including the Jorf and in the Sahel El Ababsa aquifers. Consequently, GWL predictions may be highly biased in some locations. Therefore, the lack of measurements was addressed by integrating auxiliary information.
An effective strategy was adopted to define the location of secondary GWL data. This was, naturally, guided by a comprehensive understanding of aquifer characteristics, as synthesized in Fig. 1b, including the hydro-stratigraphic pattern, the fault network, the soil surface topography, and the hydrographic network, to properly discretize and generate the secondary dataset.
By considering this knowledge for each compartment, the kriging map, the estimation error, and cross-validation were analyzed and employed to supplement the database with secondary GWL data. This was achieved by, first, situating the secondary data close to and along the boundaries of the streams, as well as in specific locations with comparatively low data density.
For each of these sites, a GWL is derived taking into account the nearest piezometer records while reproducing a water pathway pattern similar to that of the corresponding river and respecting the local variability identified through the geostatistical analysis. This process allows for the database to be supplemented, to provide further guidance for the kernel ridge regression model, to ensure an accurate prediction of water levels, and to generate a map that respects the hydrogeological constraints.
Geological faults. These are defined as discontinuities that can interrupt the continuity of any studied variable. To respect the local variability of the GWL, the studied area was segmented into “homogeneous geological domains” taking into account the hydrogeological environment and the structural configuration as extensively discussed in previous works (Chihi et al. 2015; Hammami et al. 2018a, b) (Fig. 1b). In the present work, only three NW–SE major faults were involved, Dhahar fault (DF), Medenine fault (MF), and the joined Tebaga fault (TbF)-Tejra fault (TF) (Figs. 1b and 2), considering the aquifer systems’ limits but also maintaining an adequate set of data to ensure a credible estimation process within each delineated domain. These are designated as the “southwestern”, “intermediate” and “northeastern” compartments. The interpolation procedure was carried out separately within each of them; that is, only the neighboring samples located within the considered compartment, between the fault boundaries, are involved in the estimation process of each target location.

Cross validation

To accurately assess the performance of the prediction models, it is essential to employ an appropriate procedure that ensures reliable and unbiased evaluations, especially when dealing with a small dataset. Therefore, a leave-one-out cross-validation (Goovaerts 1997; Chiles and Delfiner 2012) was conducted, involving multiple iterations where the model is trained to predict the GWL using the set of training points, within the same compartment, excluding one. The model then tries to predict the target value at the excluded point. Throughout this process, four distinct evaluation and error metrics (Eqs. 13–16) are calculated:

– The root mean square error (RMSE) (Eq. 13) is the square root of the mean squared error. It is a measure of the overall prediction error of the model. The closer it is to 0, the more accurate is the model.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{\alpha=1}^{n}\left(Z^*(x_\alpha) - Z(x_\alpha)\right)^2} \tag{13}$$

– The mean absolute error (MAE) (Eq. 14) is the arithmetic mean of all absolute errors. It measures the average magnitude of the error. A lower value denotes a more accurate model.

$$\mathrm{MAE} = \frac{1}{n}\sum_{\alpha=1}^{n}\left|Z^*(x_\alpha) - Z(x_\alpha)\right| \tag{14}$$

– The linear correlation coefficient (R) (Eq. 15) specifies the degree of correlation that exists between the observed and estimated values. The closer it is to 1, the more the observed and predicted data are correlated, and the better the model is.

$$R = \frac{\sum_{\alpha=1}^{n}\left(Z(x_\alpha) - \bar{Z}(x_\alpha)\right)\left(Z^*(x_\alpha) - \bar{Z}^*(x_\alpha)\right)}{\sqrt{\sum_{\alpha=1}^{n}\left(Z(x_\alpha) - \bar{Z}(x_\alpha)\right)^2 \sum_{\alpha=1}^{n}\left(Z^*(x_\alpha) - \bar{Z}^*(x_\alpha)\right)^2}} \tag{15}$$

– The Nash-Sutcliffe efficiency coefficient (NSE) (Eq. 16) measures the relative magnitude of the estimation error variance compared to the observed data variance. NSE ranges between −∞ and 1. The optimal fit between the observed and estimated data would be equal to 1.

$$\mathrm{NSE} = 1 - \left[\frac{\sum_{\alpha=1}^{n}\left(Z^*(x_\alpha) - Z(x_\alpha)\right)^2}{\sum_{\alpha=1}^{n}\left(Z(x_\alpha) - \bar{Z}(x_\alpha)\right)^2}\right] \tag{16}$$

In the preceding equations, $Z^*(x_\alpha)$ is the estimate at point $x_\alpha$; $Z(x_\alpha)$ is the observed value, $\bar{Z}(x_\alpha)$ is the mean of the observed values, and $\bar{Z}^*(x_\alpha)$ is the mean of the estimated values.

Results and discussion

This section presents the results of the GWL modeling successively for the geostatistical and then for the ML/KRR approaches. Then all the different prediction procedures are comparatively evaluated based on three different aspects: (1) error metrics, (2) scatter diagrams comparing observed and estimated values, and (3) consistency with the hydrogeological reality.

Modeling results

The geostatistical model. The assessment of the GWL in the area was carried out using the ordinary kriging method, based on the estimated variogram model and constrained by fault parameters.
The investigation of the GWL spatial continuity revealed a stationary behavior complying with the local variability within each compartment (Fig. 4a). Two authorized theoretical models were tried to fit the experimental variogram of the GWL variable. The spherical model (variogram function 17) exhibits a linear behavior at the origin, while the Gaussian model (variogram function 18) displays a parabolic behavior at small distances (Fig. 4a).

$$\gamma(h) = 450\ \mathrm{Spherical}(29400) \tag{17}$$

$$\gamma(h) = 450\ \mathrm{Gaussian}(24000) \tag{18}$$

The cross-validation carried out indicated that the spherical model generated a lower root mean square error (RMSE = 2.7 m) than that of the Gaussian model (RMSE = 3 m). Accordingly, the spherical structure was considered as the best-fitting model. Its range parameter (a = 29,400 m) and sill parameter (C = 450 m²) were subsequently used in the kriging procedure.
The estimation process of the GWL was implemented on a regular grid with a 500 m × 500 m mesh. It was constrained by the integrated fault system. A moving neighborhood with six angular sectors was adopted, allowing one to ensure the best configuration of the estimating points. These are selected exclusively within the corresponding compartment where the unknown GWL is being estimated. The kriged map of the GWL is shown in Fig. 4b.
KRR models. The KRR models were developed taking into account the data availability, and the regional and local variability of the GWL variable, which is actually impacted by the cutting faults. However, due to the scarcity of GWL observations in some domains, such as the southeastern side of the study area, the initial prediction attempts, as well as the modeled maps, were considered somewhat unsatisfactory in terms of accuracy.
Accordingly, the database was extended by incorporating “secondary” GWL data. These are thoroughly identified, as already explained, and added (1) to help the model learn what the function should look like at their corresponding positions, and (2) more importantly to indirectly capture the spatial variability of GWLs, induced by various factors such as climate, topography, hydrogeology, etc.
A dataset of about 100 data points is established, with features consisting only of piezometer positions (2D spatial coordinates) and the GWL values as a target variable. Again, these are either real or auxiliary and are selected to be sufficiently representative of the real function to reproduce the spatial variability of the GWL. On the other hand, to improve the prediction accuracy at all locations in the study area, it was deemed beneficial to adapt the geostatistical reasoning and test the KRR model potentials, to reproduce both the regional and local variability of the GWL variable.
The models are trained according to the defined ML approach, considering three cases of spatial variability and estimating the adequate corresponding functions. Firstly, predictions were performed considering the whole region without taking into account the faults; in this case, a single KRR model (designated as “KRR_G”, G: Global or whole area) is implemented. Here, the model is trained on all data points of the area. Secondly, predictions were performed considering the whole region, taking into account the faults. The data points


are separated into their corresponding compartments defined by the faults and outer borders. However, a single KRR model (designated as “KRR_F”, F: integrating Faults) is chosen for the whole region, but trained separately on each compartment. Thirdly, predictions were performed considering each compartment individually. Here, a specific KRR model (designated as “KRR_I”, I: Individual compartment) is implemented and optimized for each corresponding compartment. That is, each model is trained exclusively on the data points belonging to the respective compartment. In the end, the models can be used to estimate the GWL at a specific point in the study area.

Fig. 4 a Experimental and theoretical variograms (spherical model, Gaussian model, experimental variogram); b Estimated GWL map using ordinary kriging, taking into account the major faults (axes: X (km), Y (km); color scale: GWL (m))

To select the best models under each setting, a systematic trial-and-error procedure, involving a leave-one-out cross-validation for each model, is conducted. For each trial, different hyperparameters ζ and η are selected, and the error metrics are calculated to indicate how accurate the model is in predicting the real value for a new unseen test point. This procedure makes it possible to choose the best model while avoiding overfitting.
That said, the η parameter is constrained to be within a specific range before the trial-and-error procedure. In fact, this kernel parameter plays a crucial role in correctly modeling the spatial variability, the regional trend, and the local GWL behavior expressed through the “wavy” contouring, therefore getting a credible GWL map and a realistic reconstruction of the hydrogeological reality. Whereas the ζ parameter controls the regularization strength, with higher values enforcing the smoothness of the function.
The best hyperparameters for all the models are stated in Table 1. The models are trained using these hyperparameters on all the data points of their corresponding compartments. The predicted functions are then modeled, and the generated maps are shown in Fig. 5.

Results validation

The results of the G/OK, ML/KRR_G, ML/KRR_F and ML/KRR_I models are examined using scatter plots (Fig. 6), and error metrics including RMSE, MAE, R and NSE (Table 1). Most notably, the models were carefully checked for consistency with local and regional hydrogeological contexts.
Figure 6 displays the different scatter plots depicting the observed and estimated values of the GWL variable as calculated from geostatistical and ML/KRR models. Overall, most of the predictions follow the 45° line, which indicates that each of the tried models can consistently predict the GWL variable.
As indicated previously, the Kriging cross-validation process showed compatibility between the used data set, the assessed theoretical variogram representing the GWL spatial variability, and the involved local neighborhood configuration. This is evident in Fig. 6a where the majority of GWL predictions closely align with the 45° line, demonstrating their accuracy, except for only two data points. These two points were verified using the ISATIS software tools, and found to be situated within poorly sampled locations eastward, on the southwestern domain. The error metrics RMSE, MAE, R and NSE are reported as 2.70 m, 1.68 m, 0.98 and 0.9860 respectively (Table 1). These values indicate satisfactory results and confirm the accuracy of the predictions.
Concerning the ML methods, the ML/KRR_G model is able to map the GWL with relatively good results; however, the RMSE and MAE values (3.53 and 2.40 m respectively) are the


highest of all other models (Table 1). This is also highlighted when looking at the scatter plot (Fig. 6b); points exhibit a more scattered cloud that is roughly symmetric around the diagonal line, indicating less accurate predictions. The main reason for this is that the KRR function is defined to fit all data along the entire study area, but faults are not included in the modeling procedure; thus, the local variability is not respected. Actually, the estimate of a point, situated on one side of a fault, may be heavily influenced by a lower or a higher variability of the neighborhood situated on a downlifted or an uplifted adjacent compartment.
For the proposed ML/KRR_F and the ML/KRR_I, specific codes were developed, using the Python programming language, for incorporating faults into the prediction process. This allowed one to respect the local variability, since the observations taken as estimating neighborhood belong exclusively to the delimited compartment where estimation is being processed. This “fault constraint” improves the accuracy of the prediction as evidenced through the scatter diagram (Fig. 6c) displaying most of the ML/KRR_F predictions along the 45° line. In addition, the error metrics show lower RMSE (1.77 m) and MAE (1.24 m) values, and higher R (0.99) and NSE (0.9926) values, indicating improved prediction accuracy.
Defining a specific KRR function for each compartment through the ML/KRR_I model allows one to better identify the local variability and accurately capture the contouring pattern of the GWL map in every compartment compared with the ML/KRR_F model. Actually, each compartment requires its specific parameters to correctly model its function, which ensures a more precise representation of the GWL behavior.
The ML/KRR_I model demonstrates its effectiveness in ensuring high accuracy, as evidenced by the scatter diagram in Fig. 6d and the corresponding error metrics. The model achieves the lowest RMSE (1.74 m) and MAE (1.21 m) values and the highest R (0.99) and NSE (0.9929) values, among all geostatistical and ML models. Its performance is also revealed through its ability to capture spatial variability and reproduce GWL patterns in each compartment.
To recapitulate, the results and maps obtained in this study (Fig. 5a,b,c) clearly demonstrate that the ML/KRR_I model is more capable of estimating the GWL compared with the other (ML/KRR_G and ML/KRR_F) models. It reproduces the local spatial variability and reveals the main regional trends towards the NE and the East, as well as the contouring details in faulted zones as observed in the G/OK. In fact, the G/OK, while only being based on the monitored data, shows great efficiency in inferring structural variability. This is obviously ensured by (1) the assumed local stationarity through the spherical variogram model, (2) the suitably designed neighborhood pattern, and (3) the imposed “screen-fault” constraint. All these interpolation procedures allowed the G/OK model to capture the local variability within each compartment, and the regional trend from one compartment to another one.
It can therefore be confirmed that the advanced ML/KRR_I model, developed in this work, is performing consistently as well as geostatistical methods. Indeed, it allows

Table 1 General statistics of the observed and estimated groundwater level (GWL) from the different models with their spatial variability constraints

| Approach | Spatial variability constraint | Model | Parameters |
|---|---|---|---|
| ML/KRR_G | Considering the area overall, without faults | One model | ζ = 1e-04, η = 1e-09 |
| ML/KRR_F | Considering the area overall, with faults | One model | ζ = 1e-04, η = 1e-09 |
| ML/KRR_I | Considering each compartment individually | Model 1 (NEC) | ζ = 1e-04, η = 1e-09 |
| ML/KRR_I | Considering each compartment individually | Model 2 (IC) | ζ = 1e-05, η = 1e-09 |
| ML/KRR_I | Considering each compartment individually | Model 3 (SWC) | ζ = 1e-04, η = 1e-09 |
| G/OK | Considering the area overall, with faults | Theoretical variogram | C = 450 m², a = 29,400 m |

| Error metric | Scope | ML/KRR_G | ML/KRR_F | ML/KRR_I | G/OK |
|---|---|---|---|---|---|
| RMSE (m) | BC, NEC | - | 0.97 | 0.97 | - |
| RMSE (m) | BC, IC | - | 2.29 | 2.19 | - |
| RMSE (m) | BC, SWC | - | 2.03 | 2.03 | - |
| RMSE (m) | All | 3.53 | 1.77 | 1.74 | 2.70 |
| R | BC, NEC | - | 0.98 | 0.98 | - |
| R | BC, IC | - | 0.75 | 0.77 | - |
| R | BC, SWC | - | 0.96 | 0.96 | - |
| R | All | 0.97 | 0.99 | 0.99 | 0.98 |
| MAE (m) | BC, NEC | - | 0.65 | 0.65 | - |
| MAE (m) | BC, IC | - | 1.79 | 1.68 | - |
| MAE (m) | BC, SWC | - | 1.50 | 1.50 | - |
| MAE (m) | All | 2.40 | 1.24 | 1.21 | 1.66 |
| NSE | All | 0.971 | 0.9926 | 0.9929 | 0.9860 |

Parameters: ζ regularization hyperparameter; η RBF kernel’s hyperparameter; C sill of the theoretical variogram; a range of theoretical variogram
NEC northeastern compartment; IC intermediate compartment; SWC southwestern compartment; BC by compartment. ‘All’ denotes taking the average over all the data points
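The error metrics of Table 1 follow Eqs. (13)–(16) and are obtained by leave-one-out cross-validation restricted to each compartment. The sketch below shows one way this procedure could be organized in plain Python. The names `error_metrics`, `leave_one_out` and `nearest_neighbour` are hypothetical, the data are invented for illustration, and `nearest_neighbour` is only a stand-in predictor, not the KRR or kriging estimators used in the study.

```python
import math

def error_metrics(observed, estimated):
    """RMSE, MAE, R and NSE computed as in Eqs. (13)-(16)."""
    n = len(observed)
    z_bar = sum(observed) / n          # mean of observed values
    e_bar = sum(estimated) / n         # mean of estimated values
    sq_err = sum((e - z) ** 2 for z, e in zip(observed, estimated))
    abs_err = sum(abs(e - z) for z, e in zip(observed, estimated))
    cov = sum((z - z_bar) * (e - e_bar) for z, e in zip(observed, estimated))
    var_z = sum((z - z_bar) ** 2 for z in observed)
    var_e = sum((e - e_bar) ** 2 for e in estimated)
    return {
        "RMSE": math.sqrt(sq_err / n),
        "MAE": abs_err / n,
        "R": cov / math.sqrt(var_z * var_e),
        "NSE": 1.0 - sq_err / var_z,
    }

def leave_one_out(points, predict):
    """Predict each point from the other points of its own compartment only.

    points: list of (x, y, gwl, compartment); predict(train, x, y) -> estimate.
    """
    observed, estimated = [], []
    for i, (x, y, gwl, comp) in enumerate(points):
        train = [p for j, p in enumerate(points) if j != i and p[3] == comp]
        observed.append(gwl)
        estimated.append(predict(train, x, y))
    return observed, estimated

def nearest_neighbour(train, x, y):
    """Stand-in predictor: GWL of the closest training point."""
    return min(train, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)[2]

# Hypothetical toy data: (x_km, y_km, gwl_m, compartment label)
points = [
    (0.0, 0.0, 100.0, "SWC"), (1.0, 0.0, 98.0, "SWC"), (2.5, 0.0, 96.0, "SWC"),
    (0.0, 5.0, 60.0, "NEC"), (1.0, 5.0, 58.0, "NEC"), (2.5, 5.0, 57.0, "NEC"),
]
obs, est = leave_one_out(points, nearest_neighbour)
metrics = error_metrics(obs, est)
```

Passing the predictor as a function keeps the fault constraint (the filter on the compartment label) separate from the estimator, so the same loop could be reused with any model that is trained per compartment.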


the GWL contours to be drawn robustly and the flow pattern to be captured consistently, especially along rivers and more importantly along secondary tributaries. A comprehensive hydrogeological interpretation was conducted to provide a solid basis for understanding the hydrogeological environment of the region. This interpretation is detailed in the following section.

Fig. 5 Estimated GWL maps using: a ML/KRR considering the whole region without taking into account the faults; b ML/KRR considering the whole region taking into account the faults; c ML/KRR considering each compartment individually

Hydrogeological interpretations

The estimated GWL maps provide valuable insights regarding the general patterns and the local variation of the groundwater flow regime along the faulted groundwater systems. This would constitute crucial support for developing a conceptual model of the groundwater flow system in the Jeffara of Medenine.
Accordingly, a comprehensive hydrogeological interpretation was conducted focusing on four key aspects: the contour pattern, the hydraulic gradient, the flow direction and the effect of faults on these different hydrogeological properties. The analysis was primarily based on the ML/KRR_I results, which demonstrated high reliability in reproducing the specific features of the GWL spatial variability within each compartment and along rivers. To enhance the interpretative effectiveness, the contours of the estimated GWL maps were visualized then superimposed onto the rivers and streams network (Fig. 7).
As regards the GWL, the potentiometric surface displays a comparable trend with the variation of elevation (Figs. 2, 4b, 5c, and 7). Generally, areas of high potentiometric surface coincide with topographic highs along the Dhahar-Matmata Mountains, where the aquifer is recharged. Conversely, depressions in the potentiometric surface indicate areas of discharge along the coastal area. Consequently, the regional groundwater flow follows the down-dip direction from southwest to northeast.
A thorough analysis of the GWL map indicates heterogeneous patterns in the groundwater flow regime of the aquifer system, expressed through variations of hydraulic gradients and flow directions. The calculated maps (Figs. 4b, 5c, and 7) show clearly that the flow direction depends on the drainage pattern. Globally the direction of flow is parallel to the rivers, demonstrating a hydraulic connection between the aquifers and the main rivers as well as their tributaries. In the upstream basin, flow directions are influenced by topographic elevation. Regional


recharge occurs, predominantly, in the western part of the study area, originating from both the Dhahar and Matmatas mountains.

Fig. 6 Scatter plots comparing observed and estimated values using: a geostatistical modeling considering the whole region, taking into account the faults; b ML/KRR_G considering the whole region without taking into account the faults; c ML/KRR_F considering the whole region, taking into account the faults; d ML/KRR_I considering each compartment individually (axis label: True value (m))

On the other hand, it is imperative to underline that the flow pattern, as revealed above from the judiciously constructed maps (Figs. 4b, 5c, and 7), indicates an inter-aquifer-formations groundwater flow that constitutes a substantial part of recharge and discharge in the major compartments of the adjacent aquifers. In fact, tectonic structures exert considerable control over the partitioning of the aquifer systems, the connectivity of aquifer formations, and the regional and local groundwater flow pattern (Chihi et al. 2015; Mezni et al. 2022a). Structural control over GWL is exerted through discontinuities, such as faults. The actual effect depends on the magnitude of displacement, as well as on the degree of permeability on both sides of the fault. In fact, the location and also the amount of inter-aquifer flow increase with increasing permeability, as the water can flow more easily, and decrease or stop with decreasing permeability. Based on these relevant results, further significant insights were formulated regarding the hydrodynamic characteristics specific to each compartment.
In the Southwestern compartment (SWC), the Sahel El Ababsa is recharged from Jebel Dhahar oriented NW–SE, and where the aquifer sandstone unit outcrops (Figs. 1b and 7). The flow directions are oriented to the NE, following the course of rivers. However, deviations, indicated by the green arrows in Fig. 7, are observed along secondary tributaries. In some areas, such as the northern aquifer limit, defined by the Tebaga fault and on the southern edge, the flow movement is locally oriented E–W to ENE–WSW. A steep hydraulic gradient seems to dominate the northern and central parts of the SWC where the permeable sandstone unit is exposed (Figs. 1b and 7). This is attributed to the relatively sloping land surface. The hydraulic gradient is larger in the southern part of the compartment because of the gently sloping land surface and where the aquifer unit is buried by the intermediate clay unit (Hammami et al. 2018a). Moreover, it is particularly relevant here to highlight the local recharge from the Medenine High zone (Fig. 1b) that constitutes the secondary source of supply to the water units.


Fig. 7 Estimated GWL map by ML/KRR_I, superimposed on the rivers and secondary streams map, showing the heterogeneous groundwater flow pattern. MF: Medenine Fault, TbF: Tebaga Fault, TF: Tejra Fault, DF: Dhahar Fault. SWC: southwestern compartment, IC: intermediate compartment, NEC: northeastern compartment. HF: Henchir Fraj, AM: Ain Mjirda, Zs: Zeuss, OZ: Oum Zassar, Gh: Ghabbay, Sm: Smar. Legend: regional flow direction; local flow direction along each compartment; flow direction; GWL contours (m asl); faults; rivers. Axes: X (km), Y (km)

Within the Intermediate compartment (IC), hosting the Zeuss-Koutine aquifer, the groundwater flow patterns are more complex. This is obviously revealed through the heterogeneous GWL contours on both sides of the limiting faults and in between, from the relatively large area in the NW to the restricted one on the SE (Fig. 7). On the NW side, the aquifer is supplied mainly from Jebel Matmatas, oriented N–S. The flow directions are initially eastward and then shift to the NE, aligning with the rivers. The hydraulic gradient is gentle; this is to be expected, since recharge from rainfall occurs throughout the rivers but also directly from the upper karstified sedimentary unit, at the beginning of the flow path. On the SE side (Fig. 7), water cannot keep flowing to the NE, because of a barrier permeability (Chihi et al. 2015), but rather is forced to shift in an exceptional fashion towards the NW, then retrieve the major NE direction as it comes close to the junction with Tebaga Fault (TbF). This reveals that in this central area, there is an anticipated water inflow originating specifically from Jebel Tejra.
Along the Northeastern compartment (NEC), including the Mareth aquifer on the west and the Jorf aquifer on the east, the global flow direction is oriented to the NE, with contour lines predominantly perpendicular to the rivers, thereby revealing that the direction of groundwater flow is closely parallel to the main rivers. The equipotential lines are widely spaced, indicating a noticeably low hydraulic gradient attributed to both the low land surface (Fig. 2) and an overall high permeability of the aquifer formations (Chihi et al. 2015; Mamou 1990; Ben Baccar 1982).
The comparison of the GWL contour lines on both sides of the Medenine fault (MF) highlights the lateral change of its hydraulic properties, thus conditioning the hydraulic connection between the two major compartments, IC and NEC. Actually, the flow pattern is the same on both sides of the MF, which acts as a conduit between the permeable formations in Henchir Fraj (HF), Zeuss (Zs) and Ghabbay (Gh) regions (Fig. 7).
Within the eastern zone, however, along Smar (Sm) domain, the equipotential lines do not conform across the fault. Within the NEC, equipotential lines are largely spaced and intersect the MF. Additionally, inside the IC, contours are closely spaced and their pattern indicates that the waterways are diverted laterally from the NE to the NW direction


as previously mentioned. This can be attributed to the varying permeability across the aquifer system units on either side of the MF, which thus acts as a barrier to groundwater flow in this region.
In some other areas, such as Ain Mjirda (AM) and Oum Zassar (OZ), the contours slightly bend downstream because of a lower permeability of the water units on the north side of the MF. The groundwater flow is slowed, whilst the water discharge to the NE compartment is maintained parallel to the rivers.
In this respect, it is particularly important to note the change of the hydraulic behavior of the MF from the east to the west of the aquifer system. This is partly explained by the increasing displacement amount towards the NW direction as demonstrated in Chihi et al. (2013) and (2015). From the present study, an alternating barrier/conduit behavior of the MF within respectively Sm, Gh, OZ, Zs, AM and HF was further demonstrated (Fig. 7). More importantly along these domains, the contour patterns also alternate between convergent and divergent shapes downstream while maintaining the common NE flow direction. These changes in contour shape are induced by the combined effect of the NE–SW trending normal faults (Fig. 1), generating horst and graben structures. Accordingly, the connectivity between the water units depends on the amount of fault displacement, the thickness of the different units and their resulting intersection along the MF zone.
These significant observations are consistent with the GWL contours and flow pattern along the coastline. The focused groundwater discharge is expected to reach the Mediterranean Sea, at the Gulf of Gabes, earlier in the HF domain than in the AM domain (Fig. 7). It is important to note that the reduced flow velocity due to the hydraulic properties of the medium should be distinguished from an excessive water withdrawal and/or a possible marine intrusion.
These compelling results on contour shape change at different scales, revealed by the effective implementation of advanced prediction methods, have allowed a significant characterization of the hydrodynamic behavior of the joined-faulted aquifer systems extending along the Jeffara of Medenine. Consequently, more attention has to be focused on the hydrogeological environment and on the geological configuration to get an accurate reading of any predicted GWL map. A comprehensive analysis and an in-depth understanding of the GWL contoured lines pattern would then help to properly define the boundary conditions, ensuring the most representative conceptual model is built for flow simulations and for a thorough study of the marine water intrusion.

Conclusions

This study investigated the potential of ML methods to predict the spatial variability of groundwater level (GWL). The implementation of the proposed kernel ridge regression (KRR) approach was guided by geostatistical-based ordinary kriging (G/OK) reasoning and constrained by geological discontinuities and scarce data.
Several mathematical models were trained, taking into account hydrological heterogeneities and the underlying local and regional spatial variability, to reliably produce a GWL map for joint-faulted aquifer systems extending along the southeastern coast of Tunisia. The incorporation of geological discontinuities supported by parametric analysis has significantly improved the accuracy of the interpolation results, while ensuring hydrogeological consistency. This highlights the importance of combining advanced ML techniques with geological knowledge in the realm of hydrogeological parameter modeling, specifically for predicting GWL. Extensive interpretation of the GWL maps constructed, underpinned by geological knowledge, provided valuable insights into the spatial variation of the groundwater flow paths as well as the hydraulic behavior of faults acting as conduits, barriers or conduit-barriers.
The significant observations regarding the heterogeneous flow pattern along both sides of the NW–SE faults align with the observed groundwater level (GWL) contours and flow pattern along the coastline. This study revealed that the variations in relative permeability between the horst and graben compartments, shaped by the NE–SW faults, influence flow velocity and ultimately affect the timing of groundwater discharge into the Mediterranean Sea. These effects of relative permeability variations need to be distinguished from other factors, such as excessive water withdrawal and the potential occurrence of marine intrusions. Such considerations are essential (1) to properly define the boundary conditions, (2) to ensure the most representative conceptual model is built for flow simulation and (3) to facilitate a thorough study of the marine water intrusion.
Furthermore, the developed combined machine-learning/geostatistics/hydrogeology-based approach has led to highly relevant prediction results and effective perception of the actual hydrogeological phenomenon characterizing a set of aquifer systems in a highly complex geological environment. In addition, the model should be intuitive, efficient to train, and easily reproducible on other hydraulic parameters in terms of prediction and mapping. These methods are potentially extendable to larger areas, enabling the validation of regional models in complex environments and providing valuable information for the sustainable management of water resources at regional, national and transboundary levels.

Acknowledgements The authors would like to thank the “General Directorate of Water Resources” (DGRE) for providing GWL data.

Declarations

Conflicts of interest The authors state that there is no conflict of interest.


References Delhomme JP (1979) Spatial variability and uncertainty in ground-


water flow parameters: a geostatistical approach. Water Resour
Res 15(2):269–280. https://​doi.​org/​10.​1029/​WR015​i002p​00269
Band SS, Heggy E, Bateni SM, Karami H, Rabiee M, Samadianfard Delhomme JP, De Marsily G (2006) Flow in porous media: an attempt
S, Chau KW, Mosavi A (2021) Groundwater level prediction in to outline Georges Matheron’s contributions. In: Bilodeau M,
arid areas using wavelet analysis and Gaussian process regression. Meyer F, Schmitt M (eds) Space, structure and randomness: con-
Eng Appl Comput Fluid Mech 15(1):1147–1158. https://​doi.​org/​ tributions in honor of Georges Matheron in the fields of Geosta-
10.​1080/​19942​060.​2021.​19449​13 tistics. Random Sets and Mathematical Morphology: lecture notes
Ben Baccar B (1982) Contribution à L’étude Hydrogéologique de in statistics. Springer, Heidelberg, Germany, pp 69–88
L’aquifère Multicouche de Gabès Sud [Contribution to the hydro- De Marsily G (1986) Quantitative hydrogeology. Academic Pres,
geological study of the multilayer aquifer of Gabes Sud]. PhD New York
Thesis, University of Paris Sud, Orsay, France De Marsily G (2021) Will we soon run out of water? Ann Nutr Metab
Boezio MNMB, Costa JFCL, Koppe JC (2006) Accounting for extensive 76(1):10–16. https://​doi.​org/​10.​1159/​00051​5019
secondary information to improve watertable mapping. Nat Resour De Marsily G, Lavedan G, Boucher M, Fadanino G (1984) Inter-
Res 15(1):33–48. https://​doi.​org/​10.​1007/​s11053-​006-​9014-5 pretation of interference tests in a well field using geostatisti-
Chihi H (1998) Modélisation 3-D des unités stratigraphiques et simu- cal techniques to fit the permeability distribution in a reservoir
lation des faciès sismiques dans la marge du Golfe du Lion [3-D model. In: Verly G, David M, Journel A, Marechal A (eds) Geo-
modeling of stratigraphic units and simulation of seismic facies statistics for Natural Resources Characterizations. Part 2, D.
in the margin of the Gulf of Lion]. Technip, Rueil-Malmaison, Reidel, Dordrecht, pp 831–849
France. http://c​ atalo​ gue.b​ nf.f​ r/a​ rk:/1​ 2148/c​ b1332​ 5096k. Accessed De Marsily G, Delay F, Goncalves J, Renard P, Teles V, Violette S
July 2023 (2005) Dealing with spatial heterogeneity. Hydrogeol J 13:161–
Chihi H, de Marsily G (2009) Simulating non-stationary seismic facies 183. https://​doi.​org/​10.​1007/​s10040-​004-​0432-3
distribution in a prograding shelf environment, gas science and Ghordoyee Milan S, Kayhomayoon Z, Arya Azar N, Berndtsson
technology. Oil Gas Sci Tech Rev IFP 64(4):451–467. https://​doi.​ R, Reza Ramezani M, Moghaddam HK (2023) Using machine
org/​10.​2516/​ogst/​20090​17 learning to determine acceptable levels of groundwater con-
Chihi H, Alain G, Ravenne C, Tesson M, de Marsily G (2000) Estimat- sumption in Iran. Sustain Product Consump 35:388–400. https://​
ing the depth of stratigraphic units from marine seismic profiles doi.​org/​10.​1016/j.​spc.​2022.​11.​018
using non-stationary geostatistics. Nat Resour Res 9(1):77–95. Guzman SM, Paz JO, Tagert MLM, Mercer AE (2019) Evaluation of
https://​doi.​org/​10.​1023/A:​10101​65914​840 seasonally classified inputs for the prediction of daily groundwater
Chihi H, Tesson M, Alain G, de Marsily G, Ravenne C (2007) Geo- levels: Narx networks vs support vector machines. Environ Model
statistical modelling (3D) of the stratigraphic unit surfaces of the Assess 24(2):223–234. https://​doi.​org/​10.​1007/​s10666-​018-​9639-x
Gulf of Lion western margin (Mediterranean Sea) based on seis- Goovaerts P (1997) Geostatistics for natural resources evaluation.
mic profiles. Bull Soc Géol France 178(1):25–38. https://​doi.​org/​ Oxford University Press, New York
10.​2113/​gssgf​bull.​178.1.​25 Hammami MA, Chihi H, Ben Mammou A, Yahyaoui H (2018a)
Chihi H, Bedir M, Belayouni H (2013) Variogram identification aided Aquifer structure identification through geostatistical integra-
by a structural framework for improved geometric modeling of tion of geological parameters: case of the Triassic sandstone
faulted reservoirs: Jeffara basin, southeastern Tunisia. Nat Resour aquifer system (SE Tunisia). Arab J Geosci 11(248):1–18.
Res 22(2):139–161. https://​doi.​org/​10.​1007/​s11053-​013-​9201-0 https://​doi.​org/​10.​1007/​s12517-​018-​3591-6
Chihi H, Jeannée N, Yahayoui H, Belayouni H, Bedir M (2014) Geo- Hammami MA, Chihi H, de Marsily G (2018b) Building constrained
statistical optimization of water reservoir characterization case (3D) geostatistical models case of the Triassic sandstone aquifer
of the Jeffra de Medenine aquifer system (SE Tunisia). Desalin system (SE Tunisia). In: Kallel A, Ksibi M, Ben Dhia H, Khélifi N
Water Treat 52(10–12):2009–1016. https://d​ oi.o​ rg/1​ 0.1​ 080/1​ 9443​ (eds) Euro-Mediterranean conference for environmental integra-
994.​2013.​812988 tion (EMCEI-1). Springer, Cham. https://​doi.​org/​10.​1007/​978-3-​
Chihi H, de Marsily G, Belayouni H, Yahyaoui H (2015) Relationship 319-​70548-4_​192
between tectonic structures and hydrogeochemical compartmen- Isaaks EH, Srivastava RM (1989) An introduction to applied geosta-
talization in aquifers: example of the “Jeffara of Medenine” sys- tistics. Oxford University Press, New York
tem, south-east Tunisia. J Hydrol Reg Stud 4(part B):410–430. ISATIS (2020) Geovariances technical references. ISATIS, Fon-
https://​doi.​org/​10.​1016/j.​ejrh.​2015.​07.​004 tainebleau France
Chihi H, de Marsily G, Bourges M, Sbeaa M (2016) A constrained Journel A, Huijbregts C (1978) Mining geostatistics. Academic, New York
geostatistical approach for efficient multilevel aquifer system char- Kayhomayoon Z, Ghordoyee Milan K, Jaafari A, Arya-Azar NM,
acterization. J Water Resour Hydraul Eng 5(3):80–95. https://​doi.​ Melesse A, Moghaddam HK (2022) How does a combination of
org/​10.​5963/​JWRHE​05030​02 numerical modeling, clustering, artificial intelligence, and evo-
Chihi H, Hammami MA, Mezni I, Belayouni H, Ben Mammou A lutionary algorithms perform to predict regional groundwater
(2023) Multiscale modeling of reservoir systems using geosta- levels? Comput Electron Agric 203:107482. https://​doi.​org/​10.​
tistical methods. C R Géoscience (355_S1):1–31. https://​doi.​org/​ 1016/j.​compag.​2022.​107482
10.​5802/​crgeos.​210 Koch J, Berger H, Henriksen HJ, Sonnenborg TO (2019) Modelling
Chiles J P, Definer D (2012) Geostatistics: modeling spatial uncer- of the shallow water table at high spatial resolution using ran-
tainty. In: Wiley series in probability and statistics, 2nd edn. dom forests. Hydrol Earth Syst Sci 23(11):4603–4619. https://​
Wiley, Hoboken, NJ doi.​org/​10.​5194/​hess-​23-​4603-​2019
Custodio E (2013) Loss of groundwater quality & related services: Lallahem S, Mania J, Hani A, Najjar Y (2005) On the use of neural net-
trends in groundwater pollution—a global framework for country works to evaluate groundwater levels in fractured media. J Hydrol
action GEFID 3726. Environ Sci. www.​groun​dwate​rgove​rnance.​ 307:92–111. https://​doi.​org/​10.​1016/j.​jhydr​ol.​2004.​10.​005
org. Accessed July 2023 Li Z, Yoon J, Zhang R, Rajabipour F, Srubar WV, Dabo I, Radlińska
Delhomme JP (1978) Kriging in hydrosciences advances. Adv Water A (2022) Machine learning in concrete science: applications,
Resour 1(5):251–266. https://​doi.​org/​10.​1016/​0309-​1708(78)​ challenges, and best practices. npj Comput Mat 8(1):1–17.
90039-8 https://​doi.​org/​10.​1038/​s41524-​022-​00810-x

13
1404 Hydrogeology Journal (2023) 31:1387–1404

Mammou A (1990) Caractéristiques, évaluation et gestion des res- Pirot G, Renard P, Huber E, Straubhaar J, Huggenberger P (2015)
sources en eau du Sud-tunisien [Characteristics, evaluation and Influence of conceptual model uncertainty on contaminant
management of water resources in southern Tunisia]. PhD The- transport forecasting in braided river aquifers. J Hydrol 531(part
sis, University of Paris-Sud, Orsay, France 1):124–141. https://​doi.​org/​10.​1016/j.​jhydr​ol.​2015.​07.​036
Matheron G (1963) Principles of geostatistics. Econ Geol 58:1246–1266 Soua M, Chihi H (2014) Optimizing exploration procedure using
Matheron G (1965) Les variables régionalisées et leur estimation oceanic anoxic events as new tool for hydrocarbon strategy in
[Regionalized variables and their estimation]. Masson, Paris Tunisia. In: Gaci S, Hachay O (eds) Advances in data, methods,
Matheron G, De Marsily G (1980) Is transport in porous media models and their applications in oil/gas exploration. Cambridge,
always diffusive? a counterexample. Water Resour Res 16:901– New York, pp 25–89
917. https://​doi.​org/​10.​1029/​WR016​I005P​00901 Tang Y, Zang C, Wei Y, Jiang M (2019) Data-driven modeling of
Mezni I, Chihi H, Bounasri M, Ben Salem A, Ayfer S (2022a) Com- groundwater level with least-square support vector machine and
bined geophysical–geological investigation for 3D geological spatial–temporal analysis. Geotech Geol Eng 37(3):1661–1670.
modeling: case of the Jeffara reservoir systems, Medenine https://​doi.​org/​10.​1007/​s10706-​018-​0713-6
Basin, SE Tunisia. Nat Resour Res 3:1329–1350. https://​doi.​ Tao H, Hameed MM, Marhoon HA, Zounemat-Kermani M, Hed-
org/​10.​1007/​s11053-​022-​10067-2 dam S, Kim S, Sulaiman SO, Tan ML, Sa’adi Z, Mehr AD,
Mezni I, Chihi H, Hammami MA, Gabtni H, Baba Sy B (2022b) Allawi MF, Abba SI, Zain JM, Falah MW, Jamei M, Bokde
Regionalization of natural recharge zones using analytical hier- ND, Bayatvarkeshi M, Al-Mukhtar M, Bhagat SK et al (2022)
archy process in an arid Hydrologic Basin: a contribution for Groundwater level prediction using machine learning models: a
managed aquifer recharge. Nat Resour Res 3:867–895. https://​ comprehensive review. Neurocomputing 489:271–308. https://​
doi.​org/​10.​1007/​s11053-​022-​10023-0 doi.​org/​10.​1016/j.​neucom.​2022.​03.​014
Moghaddam HK, Moghaddam HK, Kivi ZR, Bahreinimotlagh UNESCO (2022) Groundwater: making the invisible visible. UN
M, Alizadeh MJ (2019) Developing comparative mathematic World Water Development Rep 2022, UNESCO, Paris
models, BN and ANN for forecasting of groundwater levels. USGS (2014) Earth explorer. US Geological Survey. earth​explo​rer.​
Groundw Sustain Dev 9:100237. https://​doi.​org/​10.​1016/j.​gsd.​ usgs.​gov. Accessed November 9, 2022
2019.​100237 Varouchakis EA, Guardiola-Albert C, Karatzas GP (2022) Spati-
Moghaddam KH, Ghordoyee Milan S, Kayhomayoon Z, Kivi ZR, otemporal geostatistical analysis of groundwater level in aquifer
Azar A (2021) The prediction of aquifer groundwater level systems of complex hydrogeology. Water Resour Res 58:1–14.
based on spatial clustering approach using machine learn- https://​doi.​org/​10.​1029/​2021W​R0299​88
ing. Environ Monit Assess 193:173. https://​doi.​org/​10.​1007/​ Yadav B, Ch S, Mathur S, Adamowski J (2017) Assessing the suit-
s10661-​021-​08961-y ability of extreme learning machines (ELM) for groundwater level
Mohanty S, Jha MK, Kumar A, Panda DK (2013) Comparative prediction. J WaterLand Dev 32:103–112. https://d​ oi.o​ rg/1​ 0.1​ 515/​
evaluation of numerical model and artificial neural network for jwld-​2017-​0012
simulating groundwater flow in Kathajodi–Surua inter-basin of Zhang Y, Duchi J, Wainwright M (2013) Divide and conquer kernel
Odisha, India. J Hydrol 495:38–51. https://​doi.​org/​10.​1016/j.​ ridge regression. Proceedings of the 26th annual conference on
jhydr​ol.​2013.​04.​041 learning theory. PMLR 30:592–617
Nourani N, Goli Ejlali R, Taghi Alami M (2011) Spatiotemporal
groundwater level forecasting in coastal aquifers by hybrid arti- Publisher’s note Springer Nature remains neutral with regard to
ficial neural network-Geostatistics model: a case study. Environ jurisdictional claims in published maps and institutional affiliations.
Eng Sci 28(3):217–225. https://​doi.​org/​10.​1089/​ees.​2010.​0174
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel Springer Nature or its licensor (e.g. a society or other partner) holds
O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas exclusive rights to this article under a publishing agreement with the
J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E author(s) or other rightsholder(s); author self-archiving of the accepted
(2011) Scikit-learn: machine learning in python. J Mach Learn manuscript version of this article is solely governed by the terms of
Res 12:2825–2830 such publishing agreement and applicable law.
