
Journal of Rock Mechanics and Geotechnical Engineering 13 (2021) 1231e1245


Full Length Article

Prediction of rockhead using a hybrid N-XGBoost machine learning framework
Xing Zhu a,b, Jian Chu a,*, Kangda Wang a, Shifan Wu a, Wei Yan a, Kiefer Chiam c

a School of Civil and Environmental Engineering, Nanyang Technological University, 618798, Singapore
b State Key Laboratory of Geohazard Prevention and Geoenvironment Protection, Chengdu University of Technology, Chengdu, 610059, China
c Building and Construction Authority, 200 Braddell Road, 579700, Singapore

Article history:
Received 16 April 2021
Received in revised form 6 June 2021
Accepted 20 June 2021
Available online 14 September 2021

Keywords:
Rockhead
Machine learning (ML)
Probabilistic model
Gradient boosting

Abstract: The spatial information of rockhead is crucial for the design and construction of tunneling or underground excavation. Although conventional site investigation methods (i.e. borehole drilling) can provide local engineering geological information, the accurate prediction of the rockhead position with limited borehole data is still challenging due to its spatial variation and the great uncertainties involved. With the development of computer science, machine learning (ML) has proved to be a promising way to avoid subjective human judgment and to establish complex relationships from large datasets automatically. However, few studies have been reported on the adoption of ML models for the prediction of the rockhead position. In this paper, we propose a robust probabilistic ML model for predicting the rockhead distribution using spatial geographic information. The framework of the natural gradient boosting (NGBoost) algorithm combined with the extreme gradient boosting (XGBoost), which is used as the basic learner, is adopted. The XGBoost model was also compared with some other ML models such as the gradient boosting regression tree (GBRT), the light gradient boosting machine (LightGBM), the multivariate linear regression (MLR), the artificial neural network (ANN), and the support vector machine (SVM). The results demonstrate that the XGBoost algorithm, the core algorithm of the probabilistic N-XGBoost model, outperformed the other conventional ML models with a coefficient of determination (R²) of 0.89 and a root mean squared error (RMSE) of 5.8 m for the prediction of rockhead position based on limited borehole data. The probabilistic N-XGBoost model not only achieved a higher prediction accuracy, but also provided a predictive estimation of the uncertainty. Thus, the proposed N-XGBoost probabilistic model has the potential to be used as a reliable and effective ML algorithm for the prediction of rockhead position in rock and geotechnical engineering.

© 2021 Institute of Rock and Soil Mechanics, Chinese Academy of Sciences. Production and hosting by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Rockhead or depth to bedrock (DTB) refers to the interface between soil (or completely weathered rock) and fresh rock. DTB is a critical design parameter for tunneling or underground construction (e.g. Cremasco, 2013; Wei et al., 2017; Cho et al., 2019; Du et al., 2019; Zhang et al., 2020a, b). A reliable mapping of the rockhead position could help reduce construction risks or project cost (Wei et al., 2017). The DTB is conventionally identified from borehole data or by geophysical methods such as ground-penetrating radar (GPR), reflection seismology, and electrical resistivity (Adepelumi and Fayemi, 2012; Yu and Xu, 2015; Nath et al., 2018; Pan et al., 2018; Du et al., 2019; Moon et al., 2019; Bačić et al., 2020; Bressan et al., 2020). However, it is expensive or labor-intensive to drill many boreholes for DTB determination. On the other hand, the estimation of the DTB between boreholes may be less reliable if the number of boreholes is insufficient or the spacing between the boreholes is too large (Nath et al., 2018).

In Singapore, a huge number of borehole data have been collected and integrated into a three-dimensional (3D) geological model for cost-effective future urban planning (Pan et al., 2018). The geological strata are usually interpreted using commercial software based on the limited borehole information. The geological condition between boreholes is roughly estimated by the Kriging interpolation method (Zhu et al., 2012; Themistocleous et al., 2016;

* Corresponding author. E-mail address: cjchu@ntu.edu.sg (J. Chu).
Peer review under responsibility of Institute of Rock and Soil Mechanics, Chinese Academy of Sciences.
https://doi.org/10.1016/j.jrmge.2021.06.012
1674-7755 © 2021 Institute of Rock and Soil Mechanics, Chinese Academy of Sciences.

Fig. 1. Boreholes along MRT project line and two-dimensional (2D) geological map of Singapore.

Pan et al., 2018). Although the Kriging interpolation method is widely used, its performance is lower than expected when the dataset is nonlinear or sparse (Qi et al., 2020a). Meanwhile, in complicated cases, the results interpreted by this approach can conflict significantly with engineers' geological knowledge of the site. Different from geological strata, the rockhead is normally distributed in the same formation. Due to weathering or other geological complications, the prediction of rockhead through Kriging interpolation based on limited borehole data is still challenging.

With the rapid development of artificial intelligence (AI), machine learning (ML) can provide a promising and effective way to deal with challenges in engineering prediction (Dixit et al., 2020; Fuentes et al., 2020; Huang et al., 2020; Zhang et al., 2021a, b; Zhao et al., 2021). A good ML system could reduce the cost of manpower and provide an accurate reference for decision-making by learning the inherent laws from a big dataset. For instance, a general regression neural network was introduced to present the spatial distribution of soil type using borehole data (Zhou et al., 2018). The method is able to predict a simple soil distribution in an area of 72 m × 40 m with only spatial coordinates. The support vector machine (SVM) method has also been applied for interpreting sparse geological information (Smirnoff et al., 2008), where the task is regarded as a pure classification problem and a cross-validation procedure is conducted for describing findings from different training sets. In this case, SVM can be considered as a novel learning method for treating small data samples, especially when boreholes and cross-section data are limited. Wei et al. (2017) built a global spatial bedrock prediction model based on the random forest and gradient boosting tree algorithms, but the data among sparse global areas strongly impacted the precision of the proposed models in a local region. Qi et al. (2020b) employed polynomial regression, spline interpolation, one-dimensional (1D) spline regression, and Bayesian-based conditional random field algorithms to spatially predict the soil-rock interfaces. The results indicated that the spline regression method outperformed the other three algorithms. Nevertheless, the prediction was based merely on statistical methods, and attributes (e.g. location, topographical features) were not considered in the modeling process. Quantile regression forests (QRF) were used by Chen et al. (2020) in a spatial model to predict the soil thickness of loess deposits in central France, but the prediction accuracy was poor and only a mean coefficient of determination R² of 0.33 was achieved. Overall, most of the existing studies only adopted geoscience statistical methods or a single ML regressor as their core algorithms, and the prediction accuracy of those models still has room for improvement.

In 2016, Chen and Guestrin (2016) proposed the extreme gradient boosting (XGBoost) model, which is a powerful scalable tree boosting ML framework and a sparsity-aware algorithm for sparse data. For efficient performance, XGBoost implements the architecture of gradient boosted decision trees, which can yield high accuracy in both classification and regression tasks. It has been applied in disease prediction (Budholiya et al., 2020; Davagdorj et al., 2020), gene expression prediction (Li et al., 2019), casualty prediction for terrorist attacks (Feng et al., 2020), industrial prediction (Zheng and Wu, 2019), and construction engineering (Zhao et al., 2019; Duan et al., 2020a; Zhang et al., 2020a, c, 2021a, b). However, the application of XGBoost in rockhead prediction with limited and sparse borehole data has not been reported so far.

Motivated by the increasing demand for underground development in Singapore, this study proposed a hybrid ML framework titled N-XGBoost based on the XGBoost and the natural gradient boosting (NGBoost) methods to improve the predictive accuracy of DTB. In this framework, XGBoost is used as the base learner of the NGBoost algorithm. Borehole data and local terrain parameters from a tunneling project in Singapore were chosen as the data source to train, validate, and evaluate the proposed ML framework. Other existing ML algorithms such as multivariate linear regression

(MLR), artificial neural network (ANN), SVM, gradient boosting regression tree (GBRT), and light gradient boosting machine (LightGBM) were also evaluated on the same dataset in this study for the purpose of comparison. The main contributions of this study include: (1) the proposal and application, for the first time, of an XGBoost-based hybrid ML model for the accurate prediction of bedrock depth with a limited borehole dataset, and (2) the ability of the proposed model to provide not only accurate point prediction but also estimation of the predictive uncertainty for reliable decision-making.

2. Geological conditions

Understanding the geological formation is necessary for construction and for evaluation of the proposed ML model. From a regional-scale view, Singapore and its several smaller islands lie in the southern extension of the Malaysian Peninsula, with a total land area of about 650 km² (Sharma et al., 1999; Qi et al., 2020a). As shown in Fig. 1, the geological formation of Singapore contains three main parts: sedimentary rocks (Jurong formation, JF) in the west, igneous rocks (Bukit Timah granite, BTG) in the central part, and quaternary deposits (Old Alluvium soils and soft soil deposits called Kallang formation, KF) in the east. BTG is the largest physiographic area of Singapore, characterized by hills and valleys of both high and low relief. Most of the hills in this area are less than 60 m in height; however, the granite near its western contact with JF formed steeper and more prominent hills, the highest of which rises up to 163 m. The BTG is a general name for the acid rocks including granite, adamellite, granodiorite, and the acid and intermediate hybrids (Qi et al., 2020a). Due to the humid tropical climate in Singapore, the acid rocks in the BTG formation have been heavily weathered. The thickness of residual soils derived from weathered BTG ranges from a few meters up to 70 m, and the average thickness is 30 m (Sharma et al., 1999; Wee and Zhou, 2009). With the aim of ensuring safe underground development in Singapore, borehole investigations were carried out in the area of interest to investigate the geological conditions. Based on practical engineering experience in Singapore, rock mass weathered to Grades I to III is classified as rock, whereas rock mass weathered beyond Grade III is usually regarded as soil-like material (Sharma et al., 1999). The rockhead is normally regarded as the elevation between Grades III and IV in engineering practice (Qi et al., 2020a).

To support the underground space development in Singapore, the Singapore Land Authority is working to develop a 3D map of subsurface utilities. In this context, GeM2S was established as a web-based 3D design tool for managing the shallow borehole data to present the subsurface formation for future underground projects in Singapore (Pan et al., 2020). However, in the 3D geological model construction process, interpolation methods like Kriging followed by expert justification can be time-consuming and introduce unexpected uncertainties when the geo-model is complex (Smirnoff et al., 2008). Therefore, it is desirable to utilize ML techniques to estimate the unseen geological information between boreholes and to automatically update the 3D geological model when new data are obtained. In this paper, a hybrid ML

Fig. 2. XGBoost-based ML framework for spatial rockhead prediction. GSE: ground surface elevation; EVRS: explained variance regression score; MAE: mean absolute error.
framework was proposed, and a comprehensive comparison with other conventional ML methods was discussed.

3. Methodology

3.1. Overall framework

Fig. 2 illustrates the framework of this study. The framework consists of three parts:

(1) Data preparation: To identify the DTB for each borehole and prepare covariates; to polish up the quality of the dataset using the synthetic minority over-sampling technique for regression (SMOTER) with introduction of Gaussian noise (SMOGN); and to change the values of the data to a common scale by min-max normalization, without distorting differences in the ranges of values or losing information;
(2) ML model establishment: To build the N-XGBoost and other existing ML models; and
(3) ML model evaluation: To compare the performance of the proposed hybrid ML model with other conventional ML models.

3.2. Conventional machine learning (ML) models

3.2.1. Multivariate linear regression (MLR)
MLR is an elegant algorithm for modeling the linear relationship between covariates and a target variable. The quality of an MLR model depends on the degree of correlation between the input and predicted values (Prion and Haerling, 2020). For the output y with predictor variables {X₁, X₂, …, X_p}, the model can be expressed as

y = β₀ + X₁β₁ + X₂β₂ + … + X_p β_p + ε = β₀ + Xᵀβ + ε    (1)

where β_p represents the regression parameters, and ε is a Gaussian random variable following ε ~ N(0, σ²) (Olive, 2017). The estimation of the regression parameters is based on the criterion of minimizing the sum of squared errors (SSE) to achieve the best performance. If the number of predictors is greater than two, the regression equation defines a hyperplane (Young, 2017). The simplicity and interpretability of the linear regression method makes it the basis of many other ML algorithms.

3.2.2. Artificial neural network (ANN)
ANN is widely applied to nonparametric and nonlinear problems with complex mappings from input to output. It contributes to identifying the relationship between known variables and unknown parameters (Thanh et al., 2019). The multi-layer perceptron (MLP) network, as one category of ANN, is the most common feedforward network, and it gains its advantages of nonlinearity and robustness from its mapping process from inputs to outputs (Svozil et al., 1997). The forward propagation of signals and backpropagation of errors allow the weights to be updated effectively according to the learning rule. Similar to the biological nervous system, an MLP consists of many neurons that interact with each other through the corresponding links. As shown in Fig. 3, the neurons are organized in layers, which can be categorized as the input layer, hidden layers, and output layer (Ahmadi, 2015). The ultimate results are obtained by integrating the solution of each output layer with an activation function f. Thus, the integrated output of the forward process for inputs Xᵢ can be expressed with weights w_ij and bias b_j as

O_j(k) = f(b_j + Σᵢ w_ij Xᵢ)    (2)

However, the selection of the ANN structure, which is usually done by trial and error, is still crucial since there is no specification for the selection of hyperparameters that guarantees the model's performance (Lawal and Kwon, 2020). In this study, an MLP-ANN model was built for the prediction of DTB based on six input features as shown in Fig. 3.

Fig. 3. Architecture of MLP-ANN.

3.2.3. Support vector machine (SVM) for regression
SVM was first developed by Vapnik and Cortes (1995) as a new approach in ML technology. The basic idea of SVM is to build a linear hyperplane, based on a kernel function, that separates samples of different classes in a high-dimensional space. Theoretically, for a given training dataset {Xᵢ, yᵢ}ₙ, where Xᵢ is the high-dimensional input and n is the number of training data, the output yᵢ can be described as

y = f(x) = ω·X + b    (3)

where y ∈ (−1, 1), ω is the weight vector normal to the hyperplane, and b is the hyperplane bias.

By introducing the kernel trick, SVM can be generalized to nonlinear classification and regression problems. Eq. (3) can be modified to

f(x) = ω·φ(X) + b    (4)

where φ(X) is the kernel function that converts the input space X into a higher-dimensional feature space; typically considered kernel functions are the linear, polynomial, radial basis, and sigmoid functions. As shown in Fig. 4, the principle of SVM for regression is to consider the sample points that fall within the following range:

y − ε ≤ ω·φ(X) + b ≤ y + ε    (5)

The best-fit line is the hyperplane that covers the maximum number of points within this ε-tube, minimizing the error between prediction and observation. Hence, the SVM model can be trained with a large number of training data to obtain the optimal model for prediction.
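As an illustration of Eq. (1), the sketch below fits an MLR model by minimizing the SSE through the normal equations. It is a minimal pure-Python example on synthetic data, not the paper's borehole dataset or code.

```python
# Minimal illustration of Eq. (1): multivariate linear regression fitted by
# minimizing the sum of squared errors (SSE) via the normal equations.

def fit_mlr(X, y):
    """Solve (A^T A) beta = A^T y, where A = [1 | X], by Gaussian elimination."""
    n, p = len(X), len(X[0])
    A = [[1.0] + list(row) for row in X]          # prepend the intercept column
    # Build the normal equations
    AtA = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(p + 1)]
           for r in range(p + 1)]
    Aty = [sum(A[i][r] * y[i] for i in range(n)) for r in range(p + 1)]
    # Gaussian elimination with partial pivoting
    for col in range(p + 1):
        piv = max(range(col, p + 1), key=lambda r: abs(AtA[r][col]))
        AtA[col], AtA[piv] = AtA[piv], AtA[col]
        Aty[col], Aty[piv] = Aty[piv], Aty[col]
        for r in range(col + 1, p + 1):
            f = AtA[r][col] / AtA[col][col]
            for c in range(col, p + 1):
                AtA[r][c] -= f * AtA[col][c]
            Aty[r] -= f * Aty[col]
    # Back substitution
    beta = [0.0] * (p + 1)
    for r in range(p, -1, -1):
        s = sum(AtA[r][c] * beta[c] for c in range(r + 1, p + 1))
        beta[r] = (Aty[r] - s) / AtA[r][r]
    return beta  # [b0, b1, ..., bp]

# Noise-free synthetic data generated by y = 2 + 3*x1 - 1*x2,
# so the fit should recover the coefficients exactly.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 7], [6, 5]]
y = [2 + 3 * a - b for a, b in X]
print([round(v, 6) for v in fit_mlr(X, y)])  # → [2.0, 3.0, -1.0]
```

With ε = 0 in the generating process, the SSE minimum is exactly the true coefficient vector, which is why the recovered β matches the generating equation.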

Fig. 4. SVM for regression.

3.3. The proposed hybrid N-XGBoost model

3.3.1. XGBoost algorithm
Chen and Guestrin (2016) proposed a highly scalable end-to-end tree boosting system, i.e. XGBoost, which has been widely applied and optimized in many research fields (Li et al., 2019; Zhao et al., 2019; Zheng and Wu, 2019; Feng et al., 2020; Wang et al., 2020; Zhang et al., 2020b, 2021a). XGBoost is an improved framework of the GBRT model. As shown in Fig. 5, GBRT is a boosting model consisting of a series of basic regression trees combined through a sequential ensemble technique. It can adaptively add more trees to enlarge the model capacity. The final prediction of the model can therefore be expressed as

ŷ^m = ŷ^(m−1) + a f_m(X; θ_m) = a Σ_{j=1}^{m} f_j(X; θ_j)    (6)

where m is the number of regression trees for boosting; θ_j is the parameter controlling the structure of the j-th tree; a is the shrinkage factor or learning rate of an individual regression tree; X is the predictor and ŷ^j is the prediction of the j-th regression tree; and f_j(X; θ_j) is the output of the j-th regression tree with structure θ_j before shrinkage, which takes the predictor X and the residual y − ŷ^(j−1) as its inputs. Consequently, the residual generally reduces as the number of regression trees increases. The objective of gradient boosting regression is to find the optimal θ_j and build f_j(X; θ_j) at the j-th step to minimize the objective function

L = Σᵢ l(ŷᵢ, yᵢ) = Σᵢ l(ŷᵢ^(j−1) + a f_j(Xᵢ; θ_j), yᵢ)    (7)

where l is the loss function, usually the squared error between the predictive value ŷ and the ground truth y.

Compared to the conventional GBRT algorithm, a regularization term was introduced into the loss function of XGBoost by Chen and Guestrin (2016) to penalize the complexity of the model and prevent overfitting. In XGBoost, Eq. (7) is rewritten as

L = Σᵢ l(ŷᵢ, yᵢ) + Σⱼ Ω(θ_j) = Σᵢ l(ŷᵢ^(j−1) + a f_j(Xᵢ; θ_j), yᵢ) + Σⱼ Ω(θ_j)    (8)

where Ω(θ_j) is the regularization term on the j-th regression tree to prevent overfitting:

Ω(θ_j) = γ T_j + (1/2) λ ‖w^(j)‖² = γ T_j + (1/2) λ Σ_{k=1}^{T_j} (w_k^(j))²    (9)

where T_j is the number of leaves in the j-th regression tree, γ is the minimum loss reduction needed for a further node partition in a regression tree, λ is the regularization factor on the weights of the leaves, and w_k^(j) is the weight of the k-th leaf in the j-th regression tree. It is evident that more leaves (larger T_j) will be penalized by a larger factor γ to minimize the objective function. Therefore, the XGBoost method uses a greedy algorithm to build the regression trees according to the objective function. Based on the last-step predicted residuals, all regression trees are gradually determined through training using a forward stepwise method, and such an XGBoost model is completed.

In the XGBoost model, the tree parameter θ_j can be determined through the training process, but some of the hyperparameters like γ, λ, m, a and d_max should be specified before training. Herein, d_max is the maximum depth of a regression tree (e.g. d_max = 4 in Fig. 5). More details of the XGBoost algorithm can be found in Chen and Guestrin (2016).

Many researchers have found that the hyperparameters can significantly affect the final performance of ML models (Rodriguez-Galiano et al., 2015; Duan et al., 2020a; Feng et al., 2020; Zhang et al., 2020c, 2021a). Hence, the hyperparameters should be determined for the best performance of the ML model. The commonly used methods to fine-tune the hyperparameters of an ML model include grid search, random search, and Bayesian optimization (Wang and Sherry Ni, 2019; Zhang et al., 2021a). The first two methods roam the full space of available parameter values in an isolated way, whereas the Bayesian optimization method finds the optimal parameter combination more efficiently by considering past evaluations (Gao and Ding, 2020; Zhang et al., 2021a). In this study, the Bayesian optimization method was adopted to adjust the following four key hyperparameters, which have a high impact on the performance of the XGBoost model:

(1) max_depth (d_max): it controls the complexity of the model. A more complicated model is more easily overfitted.
(2) learning rate (a in Eq. (6)): it is a crucial hyperparameter in most ML algorithms. It can be adjusted to make the model more robust.
(3) gamma (γ): it controls the regularization in Eq. (9), and the optimal value of γ could help prevent overfitting.
(4) lambda (λ): it controls the regularization on the weights to avoid overfitting.

Fig. 6 shows the diagram of ten-fold cross-validation. In ten-fold cross-validation, the training dataset is divided into ten subsets; nine of the ten subsets are taken to train the model whereas the remaining one is used to validate the model in each iteration. In the end, the evaluation indicator (e.g. the root mean squared error (RMSE) for a regression problem) over the ten iterations demonstrates the overall performance of the ML model for the current hyperparameter combination. With the combination of Bayesian optimization and cross-validation, the final ML model tuned with the optimal hyperparameters can ensure a better generalization performance on unseen data.

Furthermore, LightGBM, another advanced GBRT-based framework, is also compared in this study. Unlike most other implementations that grow trees level-wise, LightGBM grows trees in a leaf-wise pattern instead: it chooses the leaf that it believes will yield the largest decrease in loss. As a result, the development of LightGBM focuses on performance and scalability. Due to the similarity in the base theory, the details of LightGBM are not presented here and can be found in Ke et al. (2017) and Liang et al. (2020).

Fig. 5. Schematic diagram of the GBRT.

Fig. 6. Ten-fold cross-validation for evaluation of the model with specific hyperparameters.
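To make the boosting recursion of Eqs. (6) and (7) concrete, the pure-Python sketch below boosts depth-1 regression "stumps" on the current residuals with a shrinkage factor a. It is a toy illustration of the GBRT/XGBoost mechanics only; the regularization of Eqs. (8) and (9) and the actual XGBoost tree construction are omitted.

```python
# Toy gradient boosting: each stump fits the residuals y - y_hat^{j-1},
# and predictions accumulate with shrinkage a as in Eq. (6).

def fit_stump(X, residual):
    """Find the single (feature, threshold) split minimizing squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [r for row, r in zip(X, residual) if row[j] <= t]
            right = [r for row, r in zip(X, residual) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    _, j, t, lm, rm = best
    return lambda row, j=j, t=t, lm=lm, rm=rm: lm if row[j] <= t else rm

def gradient_boost(X, y, m=50, a=0.3):
    pred = [0.0] * len(y)
    trees = []
    for _ in range(m):
        residual = [yi - pi for yi, pi in zip(y, pred)]   # y - y_hat^{j-1}
        tree = fit_stump(X, residual)
        trees.append(tree)
        pred = [pi + a * tree(row) for pi, row in zip(pred, X)]  # Eq. (6)
    return lambda row: sum(a * tr(row) for tr in trees)

X = [[i] for i in range(10)]
y = [0.0] * 5 + [10.0] * 5                 # a step function
model = gradient_boost(X, y)
print(round(model([2]), 3), round(model([8]), 3))  # → 0.0 10.0
```

Because each stump keeps only a fraction a of its correction, the residual on the right half shrinks geometrically (10 × 0.7^m), which is the residual-reduction behavior described below Eq. (6).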

3.3.2. Natural gradient boosting (NGBoost) for probabilistic prediction
In addition to an accurate point prediction, the predictive interval of an ML model is crucial in real practice. Duan et al. (2020b) proposed NGBoost, a supervised ML algorithm for generic probabilistic prediction. It outputs a full probability distribution over the entire outcome space. The core of NGBoost is that it utilizes the boosting technique to estimate the parameters of a conditional probability distribution P_θ(y|X) as a function of X. Fig. 7 shows the conceptual work flowchart of NGBoost, which includes three main parts: the base learner (f), the parametric probability distribution (P_θ), and the scoring rule (S).

In this study, a hybrid NGBoost module with XGBoost base learners was designed to perform both point prediction and probabilistic prediction. In this hybrid ML model, the input features were fitted to the XGBoost base learners to produce a probability distribution of the predictions P_θ(y|X) over the entire outcome space of y. The scoring rule S(P_θ, y) is used to optimize the NGBoost model by a maximum likelihood estimation (MLE) function, which provides calibrated uncertainty and point predictions. The input features of this model include the borehole coordinates, GSE, and parameters of the digital elevation model (DEM) such as slope, aspect, and curvature. The target value is the elevation of the rockhead. During the modeling process, the XGBoost base learner was fine-tuned prior to the training of the proposed hybrid model for a better point prediction performance.

Fig. 7. The hybrid NGBoost model with base learner XGBoost.

4. Verification of the proposed method

4.1. Data source

A huge number of geotechnical borehole data have been collected over the years from various construction projects carried out in Singapore. In this study, the data of 502 boreholes along one MRT line (shown in Fig. 1 as green points) were sampled as DTB observations. The DTB of each borehole was identified manually according to the borehole logs.

Fig. 8a shows the distribution of the target variable DTB, where a long-tailed distribution with DTBs lower than −33 m was found.
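The uncertainty output of the hybrid model described in Section 3.3.2 can be sketched as follows. Assuming, as is common in NGBoost setups, a Normal distribution P_θ(y|X) = N(μ, σ²), the negative log-likelihood serves as the scoring rule S, and a prediction interval follows directly from the predicted parameters. The numbers below are hypothetical placeholders, not model outputs from this study.

```python
# Sketch: from a predicted (mu, sigma) to the scoring rule S and a ~95%
# prediction interval. Assumes a Normal predictive distribution.
import math

def normal_nll(y, mu, sigma):
    """Scoring rule S(P_theta, y): negative log-likelihood under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def prediction_interval(mu, sigma, z=1.96):
    """Central ~95% predictive interval from the distribution parameters."""
    return (round(mu - z * sigma, 2), round(mu + z * sigma, 2))

mu, sigma = -12.0, 3.5   # hypothetical rockhead elevation prediction (m) and spread
print(round(normal_nll(-10.0, mu, sigma), 3))
print(prediction_interval(mu, sigma))  # → (-18.86, -5.14)
```

Minimizing this scoring rule over the training data is exactly the MLE optimization mentioned above; the interval width scales with the predicted σ, which is what makes the uncertainty estimate decision-relevant.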

Fig. 8. Distribution of DTB in this study: (a) Histogram of DTB, and (b) Distribution of DTB in boxplot.

As can be further seen, the number of DTBs lower than −33 m was small. Thus, the boreholes with DTB deeper than −33 m were recognized as outliers in the boxplot view in Fig. 8b. However, as a purely data-driven methodology, ML could be strongly affected by these outliers due to the imbalanced distribution of DTB in the original samples. To solve this problem, a new data preprocessing method called SMOGN was adopted.

Several studies have claimed that the thickness of the soil is likely to be related to the local terrain (Themistocleous et al., 2016; Wei et al., 2017; Simon et al., 2020). Therefore, both the borehole data from site investigation and the local terrain features (i.e. slope, aspect, and curvature) derived from the high-precision DEM of Singapore were utilized to create the dataset for the ML models established in this study. Table 1 presents the summary statistics of the dataset.

Table 1
Summary statistics of the dataset.

Feature                               Mean        Max        Min        Median
Coordinate x of borehole (m)          21,410.84   24,965.65  20,029.06  20,960.72
Coordinate y of borehole (m)          37,129.76   39,987.84  34,305.65  36,952.76
GSE (m)                               16.3        36.85      5.45       15.63
Slope of ground surface (°)           2.68        17.14      0          2.02
Aspect of ground surface (°)          124.12      351.87     1          84.81
Curvature of ground surface (m⁻¹)     0.12        3.56       −4.45      0
DTB (m)                               7.94        22.1       −55.3      6.38

4.2. Data preprocessing

In this study, the observations of DTB, borehole locations, and local terrain features under the Singapore coordinate system formed a dataset which was then randomly divided into a training set (80% of the whole data) and a testing set (20% of the whole data).

In order to overcome the performance degradation caused by imbalanced data, Torgo et al. (2013) proposed the SMOTER algorithm, which changes the distribution of the given training dataset to balance the rare and the most frequent cases. Branco et al. (2017) further introduced Gaussian noise into SMOTER, i.e. SMOGN, for dealing with imbalanced regression problems where the cases most important to the user are poorly represented in the available data. SMOGN generates new synthetic examples with SMOTER only when the seed example and the selected k-nearest neighbors (KNN) are 'close enough', and uses the introduction of Gaussian noise when the two examples are 'more distant'. As shown in Fig. 9, the key idea of the SMOGN algorithm is to generate new synthetic samples from the three nearest neighbors of a seed case, which are supposed to have similar DTBs and local terrain features (e.g. slope and aspect). Therefore, SMOGN was adopted to oversample the rare data points (DTB < −33 m) in this study to help improve the robustness of the ML model in predicting a deeper DTB. More details on the SMOTER and SMOGN algorithms can be found in Torgo et al. (2013) and Branco et al. (2017).

As a usual data preprocessing method, normalization not only enhances the overall predictive performance of ML models, but also improves the computing efficiency (Pu et al., 2019; Yu et al., 2020; Zhang et al., 2021a). In this study, min-max normalization was adopted to convert the dataset to a range from 0 to 1:

f′ᵢ(k) = (fᵢ(k) − minᵢ fᵢ(k)) / (maxᵢ fᵢ(k) − minᵢ fᵢ(k))    (10)

where i is the sample index, and fᵢ(k) denotes the i-th sample in the k-th feature domain.

In summary, the innovative idea of the SMOGN method is to oversample the minority in the training data to improve the predictive ability of the ML model, based on two oversampling techniques chosen by the KNN distances in the feature space around a given observation (Branco et al., 2017). If the distance between the given observations is close enough, SMOTER is applied; if the distance is too far, Gaussian noise is introduced into SMOTER for oversampling.

Fig. 9. Synthetic example of the application of SMOGN.
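The min-max rescaling of Eq. (10) can be sketched in a few lines of pure Python, applied column-wise to a toy feature matrix (not the study's actual preprocessing code):

```python
# Eq. (10): rescale each feature column to [0, 1] by its own min and max.

def min_max_normalize(rows):
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0   # guard constant columns
             for v, l, h in zip(row, lo, hi)]
            for row in rows]

# Toy (GSE, slope) pairs loosely echoing Table 1's ranges
data = [[16.3, 2.68], [36.85, 17.14], [5.45, 0.0]]
for row in min_max_normalize(data):
    print([round(v, 3) for v in row])
```

Note that the same per-feature min and max fitted on the training set should be reused to transform the testing set, so that no information leaks from test to train.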

4.3. Model establishment

As presented above, the primary objective of this study is to build a novel AI method, called the N-XGBoost methodology, for predicting the rockhead elevation in engineering practice based on the XGBoost model and the NGBoost probability prediction algorithm. Accordingly, borehole data together with local ground surface parameters such as slope, aspect, and curvature were prepared as predictors. The rockhead elevation was carefully identified by experts from the borehole logs and was regarded as the target variable. The data from 502 boreholes were randomly divided into two parts for training and testing: 80% of the total samples were used for training the N-XGBoost model with a ten-fold cross-validation strategy, whereas the remaining 20% were used for testing the precision of the developed N-XGBoost model. To overcome the bias caused by rare samples in regression, a preprocessing method called SMOGN was introduced before training to improve the predictive capability of N-XGBoost over a larger space. Fig. 10 shows the flowchart of the N-XGBoost model in this study.

In the developed hybrid N-XGBoost model, the XGBoost algorithm was introduced into the NGBoost probability prediction algorithm as the base learner. To establish the model, an initial XGBoost model was first fitted to the training dataset. The four key hyperparameters of the XGBoost model were chosen by trial and error, and were further optimized by the Bayesian optimization algorithm. With the optimal XGBoost model, the predictive accuracy can be enhanced to some extent.

For comparison, popular ML models such as MLR, MLP-ANN, and SVM were also trained on the same training dataset. All the developed ML models were evaluated by indices such as R2, MAE, and RMSE on the same testing dataset. R2 represents the correlation and goodness of fit between the predicted and observed values. MAE measures the average error over all predictions. RMSE is widely applied when a sensible error estimate is required. In rockhead elevation prediction, RMSE and MAE are in units of meters. The explained variance regression score (EVRS) was also used to evaluate the explained variance of a model; the higher the EVRS, the better the model explains the variance of the data.

For n target values, the statistical criteria stated above can be calculated by

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}   (11)

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|   (12)

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}   (13)

EVRS = \left[ 1 - \frac{\mathrm{var}(\hat{y}_i - y_i)}{\mathrm{var}(y_i)} \right] \times 100   (14)

where \bar{y} is the mean observed value, and y_i and \hat{y}_i are the i-th observed and predicted values, respectively.

4.3.1. SMOGN in preprocessing

Fig. 11a shows the distribution of rockhead elevation in the original training data. It demonstrates that the samples deeper than 33 m are so rare that they can be considered abnormal points in a boxplot. To overcome this data imbalance problem, SMOGN was adopted to change the distribution of the minority samples, as shown in Fig. 11b. The benefits of SMOGN in improving the performance of the ML models are presented in Section 4.4. Additionally, to reduce the effects of different feature scales on the performance of the ML models, the training dataset should also be normalized by Eq. (1).

4.3.2. Training ML models

As mentioned in Section 3.3.1, an initial XGBoost model serving as the N-XGBoost base learner was developed with four important hyperparameters: maximum tree depth (d_max), learning rate (α), minimum loss reduction (γ), and L2 regularization factor (λ). The optimum values of these four parameters were selected by Bayesian optimization, which searches for the minimizer of an objective function by building a surrogate function from previous evaluation results. It is a powerful tool when the objective function is unknown and its evaluation is expensive (Zhou et al., 2018), thereby eliminating much wasted effort. The other hyperparameters were set to their default values. Table 2 shows the hyperparameter tuning results. All the other ML models were likewise trained with the best hyperparameter sets obtained by Bayesian optimization on the same training dataset.

After the optimum hyperparameters were set, the training dataset was used to fit the ML models described above. The performance of each ML model was evaluated by ten-fold cross-validation. In the training stage, the developed XGBoost model was systematically compared with three conventional ML models and GBRT. Fig. 12 shows the comparison in terms of predictive accuracy and robustness under ten-fold cross-validation. The developed XGBoost model, with an average R2 of 0.895, performed better than GBRT and LightGBM, and significantly outperformed the other three ML models. Meanwhile, the R2 curves of the three conventional ML models (MLR, MLP-ANN, and SVM) indicate that they have poor robustness across different training data subsets.

Fig. 13 illustrates the accuracy of the values predicted by the different ML algorithms for estimating the rockhead position (DTB). Overall, the tree-based ML models (LightGBM, GBRT, and XGBoost) have higher prediction accuracy than SVM, MLR, and MLP-ANN for determining DTB values with limited data. The developed XGBoost model achieved the highest R2 and the lowest RMSE among these ML models.

4.4. Model evaluation

The performance of rockhead position (DTB) prediction by the developed XGBoost model and the other five ML models (MLR, MLP-ANN, SVM, LightGBM, and GBRT) was evaluated on both the training and testing datasets using the four indicators (R2, RMSE, MAE, and EVRS). Fig. 14 shows the predictive results on the testing dataset for the different ML models, which demonstrates that the tree-based models are more suitable than the other conventional ML models for predicting the DTB from sparse borehole data. The performance of the different models with and without SMOGN preprocessing is presented in Table 3. The developed XGBoost model achieved the best performance among all the ML models on both the training and testing datasets. It can also be concluded that the SMOGN preprocessing method helps improve the overall predictive performance by oversampling the rare samples distributed below 30 m.

Based on the results shown in Table 3, the developed XGBoost model was selected for combination with the NGBoost algorithm to make probabilistic predictions.
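The four indices defined in Eqs. (11)-(14) above are straightforward to compute; a minimal pure-Python sketch on synthetic DTB values (the numbers are illustrative, not from the study):

```python
from statistics import mean, pvariance

def scores(y, y_hat):
    """R2, MAE, RMSE and EVRS (Eqs. (11)-(14)) for observed y and predicted y_hat."""
    n = len(y)
    y_bar = mean(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    mae = sum(abs(yi - yh) for yi, yh in zip(y, y_hat)) / n
    rmse = (ss_res / n) ** 0.5
    # EVRS: percentage of target variance not left in the residuals
    evrs = (1 - pvariance([yh - yi for yi, yh in zip(y, y_hat)]) / pvariance(y)) * 100
    return r2, mae, rmse, evrs

# Synthetic rockhead elevations (m) and predictions
y     = [-6.4, -12.1, -3.0, -25.7, -8.8]
y_hat = [-6.0, -11.5, -3.9, -24.2, -9.1]
r2, mae, rmse, evrs = scores(y, y_hat)
```

A perfect prediction gives R2 = 1, MAE = RMSE = 0 and EVRS = 100, which is a quick sanity check for the implementation.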

Fig. 10. Establishment of N-XGBoost model in this study.
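The data handling in Fig. 10 — a random 80/20 split of the 502 boreholes followed by ten-fold cross-validation on the training part — amounts to index bookkeeping that can be sketched as follows (illustrative code, not the authors' implementation):

```python
import random

def ten_fold_indices(n, seed=1):
    """Yield (train_idx, val_idx) pairs for ten-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]   # ten roughly equal folds
    for k in range(10):
        val = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, val

# 502 boreholes: 80% for training/cross-validation, 20% held out for testing
all_idx = list(range(502))
random.Random(0).shuffle(all_idx)
split = int(0.8 * 502)
train_idx, test_idx = all_idx[:split], all_idx[split:]
pairs = list(ten_fold_indices(len(train_idx)))
```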

Fig. 11. Preprocessing results: (a) Before SMOGN, and (b) After SMOGN.
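The shift from Fig. 11a to Fig. 11b comes from SMOGN-style oversampling, which generates synthetic rare cases by interpolating between samples with rare target values and perturbing them with Gaussian noise (Branco et al., 2017). The toy sketch below imitates only that interpolation step; the threshold, noise scale, and data are made up for illustration and do not reproduce the full SMOGN algorithm:

```python
import random

def oversample_rare(samples, threshold=-30.0, k=3):
    """Crude SMOGN-like step: synthesize (x, y) pairs by interpolating
    between pairs of rare samples whose target lies below `threshold`."""
    rare = [s for s in samples if s[1] < threshold]
    synthetic = []
    for _ in range(k * len(rare)):
        (xa, ya), (xb, yb) = random.sample(rare, 2)
        t = random.random()                            # interpolation factor
        x = xa + t * (xb - xa)
        y = ya + t * (yb - ya) + random.gauss(0, 0.5)  # small Gaussian perturbation
        synthetic.append((x, y))
    return samples + synthetic

random.seed(0)
# (chainage along the line, rockhead elevation) -- illustrative values
data = [(100.0, -5.0), (120.0, -8.0), (140.0, -35.0), (160.0, -40.0), (180.0, -48.0)]
augmented = oversample_rare(data)
```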

Table 2
Hyperparameters tuning for XGBoost in this study.

Hyperparameter | Default | Search range | Explanation | Optimal value
Learning rate, α | 0.3 | [0.01, 0.3] | The weight of each step is tuned to improve the robustness of the model | 0.098
Max tree depth, d_max | 6 | [3, 10] | Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit | 8
Minimum loss reduction, γ | 0 | [0.1, 5] | Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger γ is, the more conservative the algorithm will be | 0.012
L2 regularization factor, λ | 1 | [0.1, 20] | Increasing this value will make the model more conservative | 6.996
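The search summarized in Table 2 would normally be run with a Bayesian-optimization library fitting a surrogate over the four ranges; the stand-in below uses plain random search over the same ranges with a made-up objective, only to show the shape of the tuning loop (the objective function here is a synthetic bowl, not a real cross-validation loss):

```python
import random

SEARCH_SPACE = {                 # ranges from Table 2
    "learning_rate": (0.01, 0.3),
    "max_depth":     (3, 10),    # integer-valued
    "gamma":         (0.1, 5.0),
    "reg_lambda":    (0.1, 20.0),
}

def cv_loss(params):
    """Placeholder for the ten-fold cross-validation loss of an XGBoost model;
    a smooth synthetic bowl stands in for real training here."""
    return ((params["learning_rate"] - 0.1) ** 2
            + (params["max_depth"] - 8) ** 2 / 100
            + (params["gamma"] - 0.1) ** 2 / 25
            + (params["reg_lambda"] - 7.0) ** 2 / 400)

def tune(n_trials=500, seed=42):
    rng = random.Random(seed)
    best, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": rng.uniform(*SEARCH_SPACE["learning_rate"]),
            "max_depth":     rng.randint(*SEARCH_SPACE["max_depth"]),
            "gamma":         rng.uniform(*SEARCH_SPACE["gamma"]),
            "reg_lambda":    rng.uniform(*SEARCH_SPACE["reg_lambda"]),
        }
        loss = cv_loss(params)
        if loss < best_loss:
            best, best_loss = params, loss
    return best, best_loss

best_params, best_loss = tune()
```

Swapping the random draw for a surrogate-guided proposal (e.g. expected improvement) turns this loop into Bayesian optimization proper; the bookkeeping around the objective stays the same.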

Fig. 12. The average coefficient of determination R2 (AVG) for training data under ten-fold CV.

A predictive interval quantifies the uncertainty around a predicted value in regression analysis. For example, the 95% predictive interval of a model for a given input is a range (L, H) such that 95% of samples with that input are predicted to fall within (L, H). In this study, the NGBoost algorithm with the developed XGBoost model was used to estimate the predictive interval. The comparison between the N-XGBoost predicted DTB values and the actual DTB observations is shown in Fig. 15, where 82.3% of the actual DTB values in the testing dataset fell within the 95% predictive interval. Note that, to better present the distribution of DTB, the predicted and observed values are plotted in a cross-section view whose origin is the westernmost point in Fig. 1. It can be concluded from Fig. 15 that the N-XGBoost model not only achieved good point prediction accuracy, but also provided a reliable predictive interval that accounts for the uncertainty caused by the quality of the data and the number of samples.

However, an imbalanced training dataset can seriously degrade the model's performance. Fig. 16 shows the predictive results of the N-XGBoost model without SMOGN preprocessing. The predictive interval in the right part of the cross-section is much wider than that in Fig. 15 because only a few samples are present in this area without SMOGN. Figs. 15 and 16 clearly demonstrate that adopting the SMOGN method in the preprocessing stage can greatly improve the predictive performance of an ML model under sparse data conditions.

4.5. Feature importance

Feature importance is a useful indicator of the contribution of a feature to the target value. A benefit of using ensembles of decision trees such as gradient boosting methods (e.g. LightGBM, XGBoost, and random forest) is that they can automatically provide estimates of feature importance from a trained predictive model. Generally, feature importance provides scores that indicate how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more frequently a feature is used to make decisions, the higher its importance. Therefore, a trained XGBoost model can provide a ranking of feature importance for a specific application. Features with higher scores can be regarded as affecting the predictive results more than those with lower scores (Gao and Ding, 2020). Fig. 17 shows the relative importance of the six features in terms of 'weight' in this study. The weight is the normalized number of times a feature is used to split the data in the nodes of the trees, which indicates the relative importance of the features. Coordinate x is the most important feature, followed by ground elevation, coordinate y, slope, aspect, and curvature. The ranking in Fig. 17 indicates that the spatial distribution of the rockhead position varies greatly along the x-direction (west to east), which is consistent with the actual situation. The GSE also contributes considerably to the prediction of rockhead because the depth of rockhead is positively correlated with the altitude.

5. Conclusions

The position of rockhead is an important design parameter for tunneling and underground construction. In this paper, a hybrid XGBoost-based ML model was proposed for predicting the rockhead position from limited borehole data. To improve the performance of the XGBoost model, the hyperparameters were fine-tuned by the Bayesian optimization algorithm and SMOGN was introduced to balance the data distribution. The predictive results of the hybrid XGBoost model were compared with those of five different ML methods (i.e. MLR, MLP-ANN, SVM, LightGBM, and GBRT). The comparison demonstrated that the proposed hybrid XGBoost model has the highest prediction accuracy for both the training and testing datasets. It is worth noting that SMOGN can effectively solve the imbalanced data distribution problem and significantly improve the performance of ML models, especially for predicting rare extreme values of a numeric target variable. Furthermore, the developed XGBoost model was combined with the NGBoost method as a base learner to estimate predictive uncertainty. The results demonstrated that 82.3% of the actual rockhead values in the testing dataset successfully dropped into the 95% predictive interval of the hybrid N-XGBoost model.

Lastly, the feature importance was analyzed using the XGBoost algorithm and the results showed that coordinate x, GSE, and

Fig. 13. Comparison of predictive results among ML models in training phase.
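The 'weight' importance type discussed in Section 4.5 is simply a normalized count of how often each feature is used in a split, pooled over all trees; a toy illustration with hypothetical split records (not the paper's actual trees):

```python
from collections import Counter

def weight_importance(split_features):
    """Normalized number of times each feature is used to split a node,
    pooled over all trees (the 'weight' importance of Section 4.5)."""
    counts = Counter(f for tree in split_features for f in tree)
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()}

# Hypothetical per-tree split records for three boosted trees
trees = [
    ["x", "GSE", "x", "y"],
    ["x", "slope", "GSE", "x"],
    ["y", "x", "aspect", "GSE"],
]
w = weight_importance(trees)
```

A trained XGBoost model exposes the same kind of per-feature split counts, which is what the ranking in Fig. 17 is built from.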



Fig. 14. Predictive results of ML models for test dataset.



Table 3
Comparison of the predictive models in both training and testing datasets.

Preprocessing   Model     R2 (train/test)   RMSE (m) (train/test)   MAE (m) (train/test)   EVRS (%) (train/test)

With SMOGN      MLR       0.765 / 0.79      8.28 / 7.99             6.53 / 6.32            76.5 / 79
                MLP-ANN   0.832 / 0.822     7.02 / 7.35             5.34 / 5.83            83.2 / 82.3
                SVM       0.825 / 0.816     7.15 / 7.47             5.04 / 5.66            82.9 / 81.8
                LightGBM  0.965 / 0.856     3.19 / 6.62             2.32 / 4.94            96.5 / 86
                GBRT      0.954 / 0.862     2.63 / 6.47             2.76 / 4.83            95.5 / 86.7
                XGBoost   0.985 / 0.889     2.06 / 5.81             1.45 / 4.35            98.5 / 89.2
Without SMOGN   MLR       0.646 / 0.632     7.57 / 7.88             5.99 / 5.87            64.6 / 63.7
                MLP-ANN   0.646 / 0.655     7.57 / 7.63             6.01 / 5.63            64.6 / 66.9
                SVM       0.705 / 0.659     6.91 / 7.59             5.23 / 5.59            70.6 / 67.1
                LightGBM  0.903 / 0.74      3.97 / 6.63             2.96 / 4.97            90.3 / 74
                GBRT      0.891 / 0.805     4.2 / 5.74              3.06 / 4.39            89.1 / 80.6
                XGBoost   0.981 / 0.812     1.76 / 5.64             1.27 / 4.43            98.1 / 81.4

Fig. 15. Plots of the predictive capability of the hybrid N-XGBoost model with SMOGN: (a) Prediction with uncertainty estimation of test data along the MRT line, and (b) Prediction
of training data along the MRT line.
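The 95% band shown in Fig. 15 is the central interval of the Normal distribution that the probabilistic model predicts for each input; given a predicted mean and standard deviation, the band follows directly from the Normal quantiles. A standard-library sketch of that step (illustrative values; this is not the NGBoost API itself):

```python
from statistics import NormalDist

def central_interval(mu, sigma, level=0.95):
    """Central predictive interval (L, H) of a Normal(mu, sigma) prediction."""
    tail = (1.0 - level) / 2.0
    d = NormalDist(mu, sigma)
    return d.inv_cdf(tail), d.inv_cdf(1.0 - tail)

# Hypothetical DTB prediction: mean -12.0 m, standard deviation 2.5 m
low, high = central_interval(-12.0, 2.5)
```

Coverage statistics such as the reported 82.3% then follow from counting how many observed values land inside their per-point intervals.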

coordinate y were the top three features that most affect the predictive results of the rockhead position in this study. The reason may be that the variation of DTB is greatly affected by the spatial coordinates and GSE in the study area.

Although the proposed model obtains desirable predictive results, some limitations need to be addressed in future studies:

(1) Because the predictive performance of ML is greatly affected by the number and quality of the observation dataset, increasing the number of high-quality borehole samples, as well as the available features, would help to level up the prediction accuracy.

(2) Since the borehole information is regarded as discrete data samples in this study, the prediction of rockhead is strongly related to the current GSE and to the limited spatial relationships among the rockhead points. Other features that were not considered may also influence the predictive results, such as the seismic velocity of the rock, the mechanical parameters of rock samples, and the rock quality. To further improve the performance of the proposed model, these continuous features can be included in the training model.

Fig. 16. Plots of the predictive capability of the hybrid N-XGBoost model without SMOGN: (a) Prediction with uncertainty estimation of test data along the MRT line, and (b)
Prediction of training data along the MRT line.

(3) The zone of interest in this study is line-shaped. The performance of the proposed model for more complex area shapes (e.g. rectangular, circular, and irregular) needs to be further validated.

Declaration of competing interest

The authors wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Acknowledgments

This work is supported by National Research Foundation (NRF) of Singapore, under its Virtual Singapore program (Grant No. NRF2019VSG-GMS-001), and by the Singapore Ministry of National Development and the National Research Foundation, Prime Minister's Office under the Land and Livability National Innovation Challenge (L2 NIC) Research Program (Grant No. L2NICCFP2-2015-1). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the Singapore Ministry of National Development and National Research Foundation, Prime Minister's Office, Singapore.

References
Fig. 17. Feature importance ranking.
Adepelumi, A.A., Fayemi, O., 2012. Joint application of ground penetrating radar and electrical resistivity measurements for characterization of subsurface stratigraphy in Southwestern Nigeria. J. Geophys. Eng. 9 (4), 397–412.
Ahmadi, M.A., 2015. Developing a robust surrogate model of chemical flooding based on the artificial neural network for enhanced oil recovery implications. Math. Probl. Eng. https://doi.org/10.1155/2015/706897.
Bačić, M., Librić, L., Kaćunić, D.J., Kovačević, M.S., 2020. The usefulness of seismic surveys for geotechnical engineering in karst: some practical examples. Geosciences 10 (10), 406.
Branco, P., Torgo, L., Ribeiro, R.P., 2017. SMOGN: a pre-processing approach for imbalanced regression. Proceed. Mach. Learn. Res. 74, 36–50.
Bressan, T.S., de Souza, M.K., Girelli, T.J., Junior, F.C., 2020. Evaluation of machine learning methods for lithology classification using geophysical data. Comput. Geosci. 139.

Budholiya, K., Shrivastava, S.K., Sharma, V., 2020. An optimized XGBoost based diagnostic system for effective prediction of heart disease. J. King Saud Univ. - Comput. Inform. Sci. https://doi.org/10.1016/j.jksuci.2020.10.013.
Chen, T.Q., Guestrin, C., 2016. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference, pp. 785–794.
Chen, S.C., Richer-De-Forges, A.C., Mulder, V.L., Martelet, G., Loiseau, T., Lehmann, S., Arrouays, D., 2020. Digital mapping of the soil thickness of loess deposits over a calcareous bedrock in central France. Catena 198. https://doi.org/10.1016/j.catena.2020.105062.
Cho, E., Jacobs, J.M., Jia, X., Kraatz, S., 2019. Identifying subsurface drainage using satellite big data and machine learning via Google Earth Engine. Water Resour. Res. 55 (10), 8028–8045.
Cremasco, D., 2013. Estimating Depth to Bedrock in Weathered Terrains Using Ground Penetrating Radar: A Case Study in the Adelaide Hills. BSc Thesis. University of Adelaide.
Davagdorj, K., Pham, V.H., Theera-Umpon, N., Ryu, K.H., 2020. XGBoost-based framework for smoking-induced noncommunicable disease prediction. Int. J. Environ. Res. Publ. Health 17 (18).
Dixit, N., Mccolgan, P., Kusler, K., 2020. Machine learning-based probabilistic lithofacies prediction from conventional well logs: a case from the Umiat oil field of Alaska. Energies 13 (18).
Du, Y., Xu, P., Ling, S., Tian, B., You, Z., Zhang, R., 2019. Determining the soil-bedrock interface and fracture-zone scope in the central urban area of the Jinan city, China, by using microtremor signals. J. Geophys. Eng. 16 (4), 680–689.
Duan, J., Asteris, P.G., Nguyen, H., Bui, X.N., Moayedi, H., 2020a. A novel artificial intelligence technique to predict compressive strength of recycled aggregate concrete using ICA-XGBoost model. Eng. Comput. https://doi.org/10.1007/s00366-020-01003-0.
Duan, T., Anand, A., Ding, D.Y., Thai, K.K., Basu, S., Ng, A., Schuler, A., 2020b. NGBoost: natural gradient boosting for probabilistic prediction. In: Proceedings of the 37th International Conference on Machine Learning, vol. 119, pp. 2690–2700.
Feng, Y., Wang, D., Yin, Y., Li, Z., Hu, Z., 2020. An XGBoost-based casualty prediction method for terrorist attacks. Complex Intell. Syst. 6 (3), 721–740.
Fuentes, I., Padarian, J., Iwanaga, T., Willem Vervoort, R., 2020. 3D lithological mapping of borehole descriptions using word embeddings. Comput. Geosci. 141.
Gao, L., Ding, Y., 2020. Disease prediction via Bayesian hyperparameter optimization and ensemble learning. BMC Res. Notes 205 (13), 1–6.
Huang, H.W., Zhao, S., Zhang, D.M., Chen, J.Y., 2020. Deep learning-based instance segmentation of cracks from shield tunnel lining images. Struct. Infrastruct. Eng. https://doi.org/10.1080/15732479.2020.1838559.
Ke, G.L., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.Y., 2017. LightGBM: a highly efficient gradient boosting decision tree. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Proceedings of the NIPS2017.
Lawal, A.I., Kwon, S., 2020. Application of artificial intelligence to rock mechanics: an overview. J. Rock Mech. Geotech. Eng. 13 (1), 248–266.
Li, W., Yin, Y., Quan, X., Zhang, H., 2019. Gene expression value prediction based on XGBoost algorithm. Front. Genet. 10. https://doi.org/10.3389/fgene.2019.01077.
Liang, W., Luo, S., Zhao, G., Wu, H., 2020. Predicting hard rock pillar stability using GBDT, XGBoost, and LightGBM algorithms. Mathematics 8 (5).
Moon, S.W., Subramaniam, P., Zhang, Y., Vinoth, G., Ku, T., 2019. Bedrock depth evaluation using microtremor measurement: empirical guidelines at weathered granite formation in Singapore. J. Appl. Geophys. 171.
Nath, R.R., Kumar, G., Sharma, M.L., Gupta, S.C., 2018. Estimation of bedrock depth for a part of Garhwal Himalayas using two different geophysical techniques. Geosci. Lett. 5 (1).
Olive, D.J., 2017. Linear Regression, first ed. Springer, USA.
Pan, X., Guo, W., Aung, Z., Nyo, A.K., Chiam, K., Wu, D., Chu, J., 2018. Procedure for establishing a 3D geological model for Singapore. Proceedings of the GeoShanghai 2018 International Conference: Transportation Geotechnics and Pavement Engineering, pp. 81–89.
Pan, X., Chu, J., Aung, Z., Chiam, K., Wu, D., 2020. 3D geological modelling: a case study for Singapore. Information Technology in Geo-Engineering, pp. 161–167.
Prion, S.K., Haerling, K.A., 2020. Making sense of methods and measurements: simple linear regression. Clin. Simul. Nurs. 48, 94–95.
Pu, Y., Apel, D.B., Liu, V., Mitri, H., 2019. Machine learning methods for rockburst prediction: state-of-the-art review. Int. J. Min. Sci. Technol. 29 (4), 565–570.
Qi, X.H., Pan, X., Chiam, K., Lim, Y.S., Lau, S.G., 2020a. Comparative spatial predictions of the locations of soil-rock interface. Eng. Geol. 272.
Qi, X.H., Wang, H., Pan, X.H., Chu, J., Chiam, K., 2020b. Prediction of interfaces of geological formations using the multivariate adaptive regression spline method. Undergr. Space 6 (3), 252–266.
Rodriguez-Galiano, V., Sanchez-Castillo, M., Chica-Olmo, M., Chica-Rivas, M., 2015. Machine learning predictive models for mineral prospectivity: an evaluation of neural networks, random forest, regression trees and support vector machines. Ore Geol. Rev. 71, 804–818.
Sharma, J.S., Chu, J., Zhao, J., 1999. Geological and geotechnical features of Singapore: an overview. Tunn. Undergr. Space Technol. 14 (4), 419–431.
Simon, A., Geitner, C., Katzensteiner, K., 2020. A framework for the predictive mapping of forest soil properties in mountain areas. Geoderma 371.
Smirnoff, A., Boisvert, E., Paradis, S.J., 2008. Support vector machine for 3D modelling from sparse geological information of various origins. Comput. Geosci. 34 (2), 127–143.
Svozil, D., Kvasnicka, V., Pospichal, J., 1997. Introduction to multi-layer feed-forward neural networks. Chemometr. Intell. Lab. Syst. 39, 43–62.
Thanh, H.V., Sugai, Y., Nguele, R., Sasaki, K., 2019. Integrated workflow in 3D geological model construction for evaluation of CO2 storage capacity of a fractured basement reservoir in Cuu Long Basin, Vietnam. Int. J. Greenh. Gas Control 90. https://doi.org/10.1016/j.ijggc.2019.102826.
Themistocleous, K., Hadjimitsis, D.G., Michaelides, S., Papadavid, G., Kavoura, K., Konstantopoulou, M., Kyriou, A., Nikolakopoulos, K.G., Sabatakakis, N., Depountis, N., 2016. 3D subsurface geological modeling using GIS, remote sensing, and boreholes data. Proceedings of Fourth International Conference on Remote Sensing and Geoinformation of the Environment, RSCy2016.
Torgo, L., Ribeiro, R.P., Pfahringer, B., Branco, P., 2013. SMOTE for regression. Progress in Artificial Intelligence, pp. 378–389.
Vapnik, V., Cortes, C., 1995. Support-vector networks. Mach. Learn. 20, 273–297.
Wang, Y., Sherry Ni, X., 2019. A XGBoost risk model via feature selection and Bayesian hyper-parameter optimization. Int. J. Database Manag. Syst. 11 (1), 1–17.
Wang, L., Wu, C., Tang, L., Zhang, W., Lacasse, S., Liu, H., Gao, L., 2020. Efficient reliability analysis of earth dam slope stability using extreme gradient boosting method. Acta Geotech. 15 (11), 3135–3150.
Wee, L.K., Zhou, Y., 2009. Geology of Singapore, second ed. Defence Science and Technology Agency, Singapore.
Wei, S., Hengl, T., Mendes De Jesus, J., Yuan, H., Dai, Y., 2017. Mapping the global depth to bedrock for land surface modeling. J. Adv. Model. Earth Syst. 9 (1), 65–88.
Young, D.S., 2017. Handbook of Regression Methods. CRC Press.
Yu, X., Xu, Y., 2015. A methodology for automatically 3D geological modeling based on geophysical data grids. Proceedings of 2015 8th International Conference on Intelligent Computation Technology and Automation (ICICTA), pp. 40–43.
Yu, H., Chen, G., Gu, H., 2020. A machine learning methodology for multivariate pore-pressure prediction. Comput. Geosci. 143.
Zhang, W., Zhang, R., Wu, C., Goh, A.T.C., Lacasse, S., Liu, Z., Liu, H., 2020a. State-of-the-art review of soft computing applications in underground excavations. Geosci. Front. 11, 1095–1106.
Zhang, X., Zhang, Y., Xu, L., Zhang, J., Tian, Y., Wang, S., Li, Z., 2020b. Urban geological 3D modeling based on papery borehole log. ISPRS Int. J. Geo-Inf. 9 (6).
Zhang, W., Zhang, R., Wu, C., Goh, A.T.C., Wang, L., 2020c. Assessment of basal heave stability for braced excavations in anisotropic clay using extreme gradient boosting and random forest regression. Undergr. Space. https://doi.org/10.1016/j.undsp.2020.03.001.
Zhang, W., Wu, C., Zhong, H., Li, Y., Wang, L., 2021a. Prediction of undrained shear strength using extreme gradient boosting and random forest based on Bayesian optimization. Geosci. Front. 12 (1), 469–477.
Zhang, W., Li, H., Li, Y., Liu, H., Chen, Y., Ding, X., 2021b. Application of deep learning algorithms in geotechnical engineering: a short critical review. Artif. Intell. Rev. https://doi.org/10.1007/s10462-021-09967-1.
Zhao, J., Shi, M., Hu, G., Song, X., Zhang, C., Tao, D., Wu, W., 2019. A data-driven framework for tunnel geological-type prediction based on TBM operating data. IEEE Access 7, 66703–66713.
Zhao, S., Shadabfar, M., Zhang, D., Chen, J., Huang, H., 2021. Deep learning-based classification and instance segmentation of leakage-area and scaling images of shield tunnel linings. Struct. Contr. Health Monit. 28 (6).
Zheng, H., Wu, Y., 2019. A XGBoost model with weather similarity analysis and feature engineering for short-term wind power forecasting. Appl. Sci. 9 (15).
Zhou, W.H., Zhao, L.S., Chen, G.M., Yuen, K.V., 2018. 3D geologic modelling with borehole data by general regression neural network. Proceedings of the 6th International Symposium on Reliability Engineering and Risk Management (6ISRERM).
Zhu, L.F., Zhang, C.J., Li, M.J., Pan, X., Sun, J.Z., 2012. Building 3D solid models of sedimentary stratigraphic systems from borehole data: an automatic method and case studies. Eng. Geol. 127, 1–13.

Dr. Xing Zhu is Postdoc Research Fellow at Nanyang Technological University, Singapore. His research interests are in the application of Artificial Intelligence (AI) in engineering geology, Wireless Sensor Networks, Geological Information and Data Mining. He is also Associate Professor at Chengdu University of Technology, China, where he received his PhD in Geotechnical Engineering in 2014.