
Engineering Geology 281 (2021) 105972

Contents lists available at ScienceDirect

Engineering Geology
journal homepage: www.elsevier.com/locate/enggeo

Assessment of landslide susceptibility mapping based on Bayesian hyperparameter optimization: A comparison between logistic regression and random forest
Deliang Sun a, Jiahui Xu a, Haijia Wen b, c, d, *, Danzhou Wang e
a Key Laboratory of GIS Application Research, Chongqing Normal University, Chongqing 401331, China
b Key Laboratory of New Technology for Construction of Cities in Mountain Area, Ministry of Education, Chongqing 400045, China
c National Joint Engineering Research Center of Geohazards Prevention in the Reservoir Areas, Chongqing 400044, China
d School of Civil Engineering, Chongqing University, Chongqing 400045, China
e Key Laboratory of Environmental Change and Natural Disaster, Ministry of Education, Beijing Normal University, Beijing 100875, China

ARTICLE INFO

Keywords:
Landslide
Bayesian hyperparameter optimization
Logistic regression
Random forest
Landslide susceptibility mapping

ABSTRACT

This study aims to develop two optimized models for landslide susceptibility mapping (LSM), i.e., logistic regression (LR) and random forest (RF) models, premised on hyperparameter optimization using the Bayesian algorithm, and to compare their applicability in a typical landslide-prone area (Fengjie County, China). First, data for 1520 historical landslides were collected from field investigations and literature reviews to construct a spatial database of 16 conditioning factors. Subsequently, the Bayesian algorithm was adopted to optimize the hyperparameters of the LR and RF models, premised on the dataset of all cells (including landslides and non-landslides). Finally, the two optimized models were evaluated and compared using the area under the curve (AUC) and the confusion matrix. Based on the Bayesian algorithm, the AUC value of the test dataset in the LR model is improved by 4%, while that in the RF model is improved by 10%, indicating that hyperparameter optimization premised on the Bayesian algorithm has a considerable impact on the accuracy of both models; hyperparameter optimization is therefore very important for LSM models. Although both models exhibit reasonable performance, the optimized RF model has better stability and predictive capability in the study area. These findings supply the crucial step in LSM (hyperparameter optimization) through the Bayesian algorithm, and provide a comparison case between LR and RF models after comprehensive consideration of hyperparameter optimization, so as to make the comparison of these models more convincing and provide a knowledge base for model comparison: comparison premised on hyperparameter optimization.

1. Introduction

Landslides, a type of geological hazard, frequently occur around the world, leading to severe destructive consequences (Lombardo and Mai 2018). According to the Emergency Events Database (EM-DAT) for 2014–2018, landslides resulted in 4914 deaths, led to 27,110 people becoming homeless, and caused economic losses of $2.1 billion (USD). A report of the Safe Land-FP 7 project states that China includes vast areas classified as high landslide risk zones, which lead to more than 700 deaths and result in property and infrastructure damage worth RMB 20 billion every year (http://www.laram.unisa.it/initiatives/safeland). Therefore, developing efficient solutions to reduce and mitigate landslide-related destruction is an urgent need. Landslide susceptibility mapping (LSM), which describes the spatial distribution of landslide occurrence probability in a certain area according to the geographical environment, is considered a common countermeasure for mitigating the effects of landslides (Huang and Zhao 2018; Merghadi et al. 2020).

At present, various models have been designed based on Geographic Information Systems (GIS) and data mining technology, with a large body of research applying statistical analysis and machine learning methods (Li et al. 2019; Zhao et al. 2019). Meanwhile, the comparison of different models could facilitate better assessment of the abilities and limitations of each method and the statistical reliability of the LSM

* Corresponding author: Key Laboratory of New Technology for Construction of Cities in Mountain Area, Ministry of Education, Chongqing 400045, China.
E-mail address: jhw@cqu.edu.cn (H. Wen).

https://doi.org/10.1016/j.enggeo.2020.105972
Received 19 May 2020; Received in revised form 9 December 2020; Accepted 10 December 2020
Available online 16 December 2020
0013-7952/© 2020 Elsevier B.V. All rights reserved.

generated (Wang et al. 2020a). Logistic Regression (LR) and Random Forest (RF), the two most frequently adopted models for LSM, are both suitable for analyzing the presence/absence of a landslide, and a few studies have compared these two models. For example, Tsangaratos et al. reported that the RF model has a slightly higher predictive capability than the LR model in Nancheng (China) (Tsangaratos et al. 2016), while Hong et al. demonstrated that the LR model exhibits a higher predictive capability than the RF model in Lianhua (China) (Hong et al. 2016). Be that as it may, these studies have overlooked a crucial step: they have failed to consider the hyperparameters of their models.

Unlike general model parameters obtained through data training, hyperparameters are set before model training. For instance, the coefficients of the LR model are obtained by training on the training dataset, so they are general model parameters; the number of decision trees in the RF model cannot be obtained through data training and must be set before model training, so it is a hyperparameter. In machine learning, the performance of models is closely related to their hyperparameters. By adjusting the hyperparameter settings, the accuracy, operating speed, and reliability of models can be greatly improved (Xie et al. 2021). For this reason, the accuracy of models depends not only on the algorithm used but also on the hyperparameters, rendering optimization of hyperparameters indispensable in any model. However, discussions on hyperparameter optimization mostly appear in computer algorithm science. Based on the Gaussian kernel, Wang et al. proposed a Support Vector Machine hyperparameter selection method, which includes two stages: selecting the kernel parameters and training the optimal penalty factors (Wang et al. 2014). This method has

Fig. 1. Location and landslide distribution of the study area.


the advantages of low computational complexity, high classification accuracy, and short training time. Kang et al. proposed a non-inertial particle thermal optimization method based on precise variance Gaussian Process (GP) regression (Kang et al. 2019). In the field of landslide assessment, few scholars have explored hyperparameter optimization for landslide susceptibility models. Sun et al. developed an optimized RF method based on hyperparameter optimization using the Bayesian algorithm (Sun et al. 2020a). Accordingly, in the relevant comparative literature, comparisons of two or more un-optimized models are not convincing, because the accuracy of the models described in these studies could be further improved through hyperparameter optimization. In fact, their comparisons are unable to reflect the strengths and weaknesses of each model in a particular study area (Hong et al. 2016; Tsangaratos et al. 2016).

The survey area of this study, Fengjie County, is a mountainous region located in the Three Gorges Reservoir area, southwest China, where landslides occur frequently. A Bayesian algorithm was applied in the present study to optimize the hyperparameters of the LR and RF models, and the optimized models were then explored and compared in Fengjie County. This study aims to: (1) supply the crucial step in LSM (hyperparameter optimization) through the Bayesian algorithm; (2) provide a comparison case for LR and RF models after comprehensive consideration of hyperparameter optimization, so as to make the comparison of these models more convincing; and (3) provide a knowledge base for model comparison: comparison premised on hyperparameter optimization.

2. Materials

2.1. Description of the study area

Fengjie County, the area of this study, spans 109°1′17″–109°45′58″E and 30°29′19″–31°22′33″N. As the east gate of Chongqing (Fig. 1), Fengjie County has mountainous landforms and is located at the junction between the Dabashan arc fold fault zone and the eastern Sichuan fold belt, with complex structural stress fields. The lithologies in Fengjie County largely include those of Quaternary Q, Jurassic J, Triassic T, Permian P, Carboniferous C, Devonian D, and Silurian S (Sun et al. 2020a). Under a humid subtropical monsoon climate, the Fengjie area has an annual rainfall of approximately 1132 mm and an average annual temperature of approximately 16.5 °C. Rainfall predominantly occurs from May to September, accounting for 69% of the yearly total.

2.2. Distribution of landslides in the study area

Data regarding the historical distribution of landslides in the Fengjie area were acquired from the Chongqing Municipal Geological Environment Agency, a governmental institution mainly responsible for monitoring and investigating geological hazards. A total of 1520 landslides were identified (for 2001–2016), with records including landslide names, latitude and longitude coordinates, areas and volumes, and occurrence times. The data on these landslides were classified according to their types and triggers. The three types of small, shallow, and soil landslides accounted for 82% of all landslides in Fengjie County, while large, deep, and bedrock landslides accounted for only 18%. Considering triggers, 75.88% of all the landslides were triggered by rainfall and the remaining 24.12% by coupling, groundwater, or human engineering activities. The locations of the 1520 historical landslides are mapped in Fig. 1.

On April 17, 2006, the Linshuifang landslide (Fig. 1, right) occurred in Guojia village, Kangle Town, Fengjie County, with soil as the sliding mass and rainstorm as the dominant factor; it is mostly stable at present. In 2015, the Miaobao landslide (Fig. 1, left) occurred in Jiuliu village, Anping Town, Fengjie County, with soil as the sliding mass and rainfall and earthquakes as the dominant factors; it is potentially unstable at present.

3. Methods

The assessment procedure consisted of four phases: (a) the construction of the spatial database, (b) formulation of the training and test datasets and hyperparameter optimization for the two models, (c) generation of LSMs, and (d) evaluation and comparison of the two models (Fig. 2). The main operating software and platform were ArcGIS and ENVI, and the programming language was Python.

3.1. Construction of a spatial database

Selection of the conditioning factors to be included as input variables in a universal model is a crucial step in LSM. However, in the absence of sufficient evaluation criteria and technical specifications, the selection of landslide-conditioning factors is partially dependent on subjective human judgment. Reichenbach et al. analyzed 565 works of literature published from 1983 to 2016 related to the evaluation of landslide susceptibility; they found that 596 factors had been used for such evaluations, and that these factors could be classified into five categories: geology, hydrology, land cover, landforms, and others (Reichenbach et al. 2018). Premised on field investigations, literature reviews (Sun et al. 2020a; Wang et al. 2020a), and the available data for the study area, the following 16 conditioning factors were selected: elevation, slope, aspect, slope position, micro-landform, profile curvature, topographic wetness index (TWI), lithology, distance from faults, combination reclassification of stratum dip direction and slope aspect (CRDS), normalized difference vegetation index (NDVI), distance from rivers, annual average rainfall, land cover, distance from roads, and distance from buildings. Notably, little consideration has been given to the impact of human activities; however, such activities cannot be ignored in the determination of landslide susceptibility. For instance, improper human engineering activities, such as the construction of roads and houses, often lead to landslides because they involve the excavation and filling of mountains, destruction of surrounding mountains, and a decrease in the overall stability of slopes (Bourenane et al. 2014). As reported in existing literature, the distance from buildings is rarely considered as a conditioning factor. Therefore, considering the dense population of Fengjie County and the high concentration of human activities, this study selected the distance from buildings to capture the impact of human activities on the occurrence of landslides.

The primary data source information is summarized in Table 1. The slope, aspect, slope position, micro-landform, profile curvature, and TWI data (Hong et al. 2016) were acquired by processing the digital elevation model (DEM) from ASTER GDEM with a 30-m resolution. Landsat 8 OLI satellite images were used to extract the NDVI and land cover data. Land cover, in this research, was generated by a Support Vector Machine model in ENVI 5.3 with a radial basis function kernel as well as field investigations. Data on lithology and faults were extracted by vectorizing geological maps at a scale of 1:200,000. Google Earth was used to obtain information on rivers, roads, and buildings. The distances from faults, rivers, roads, and buildings were generated after buffering the faults, river networks, and roads. The selection of buffer distance was based on field surveys and imagery resolution. CRDS was extracted by the subtraction and reclassification of slope aspects and stratum dip directions (Sun et al. 2020b). The annual average rainfall was derived from observations of the local climate stations by applying the spatial interpolation method.

As the 16 landslide-conditioning factors were expressed at different intervals or scales, all the factor maps had to be unified. To that end, all factors were converted to a raster grid (with 30 × 30 m cells) that corresponded to the DEM resolution. In addition, continuous factors had to be classified. Premised on field investigations, expert experience, and multiple examples in the literature, a classification scheme was established for each continuous factor (Table 2). In summary, a spatial database of landslide-conditioning factors after reclassification was constructed with a 30-m resolution grid unit.

To reduce data discreteness, the 16 factors were normalized after


Fig. 2. Flowchart of this study.
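
As a minimal sketch of one preprocessing step in this workflow, the min-max normalization of Eq. (1) in Section 3.1 can be written in plain Python. The function name `min_max_normalize` and the example class indices are illustrative assumptions and are not taken from the authors' code.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly to the interval [0, 1], as in Eq. (1)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant factor has no spread; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical input: class indices 1..7 of one reclassified factor
# (for example, the seven elevation classes).
elevation_classes = [1, 2, 3, 4, 5, 6, 7]
normalized = min_max_normalize(elevation_classes)
print(normalized)
```

With this input the smallest class maps to 0, the largest to 1, and the middle class (4) to 0.5.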

reclassification, and then transformed linearly such that their values were reduced to the [0,1] interval. The normalization formula used is as follows:

$$Y^{*} = \frac{Y - Y_{\min}}{Y_{\max} - Y_{\min}} \tag{1}$$

where $Y^{*}$ indicates the normalized data, $Y$ indicates the original data, $Y_{\min}$ is the minimum value of the original data, and $Y_{\max}$ is the maximum value of the original data.

Table 1
Data and data sources.

Data name | Data sources | Type | Scale
Historical landslide | Chongqing Geological Monitoring Station | Datasheet |
DEM | ASTER satellite | Grid | 30 m
Geological data | National Geological Data Center | Grid | 1:200,000
Land cover | Chongqing Municipal Bureau of Land and Resources | Vector | 1:100,000
Administrative division | Chongqing Municipal Bureau of Land and Resources | Vector | 1:100,000
River network | Chongqing Water Resources Bureau | Vector | 1:100,000
Satellite image | Geospatial Data Cloud platform | Grid | 30 m
Annual rainfall | Chongqing Meteorological Administration | Datasheet | 30 m
Road | Chongqing Transportation Commission | Vector | 1:100,000

3.2. Preparation of the sample datasets

In this study, the positive cells (with landslides) and the negative ones (with no landslide) comprised all the datasets. Positive cells comprised 1520 historical landslide points, with each point regarded as a single cell. Although additional non-landslide cells could expand the data size for machine learning, prior research has highlighted that no ideal fixed percentage or ratio exists between landslide and non-landslide cells; the ratio depends on the modeling employed in the susceptibility analysis (Hussin et al. 2016). Heckmann et al. summarized the types of regression analysis, determining that the ratios of "positive: negative" cells often range between 1:1 and 1:10 (Heckmann et al. 2014). When different ratios of positive to negative cells were considered, the accuracy rate was determined to be the highest at a ratio of 1:10. Hence, this ratio of positive to negative cells was adopted, and 15,200 non-landslide points were randomly extracted from the non-landslide area. According to a 7:3 ratio, all sample datasets were divided into training (11720) and test datasets (5022). The training dataset was utilized for model training, and the test dataset was used for testing.

3.3. Landslide susceptibility models

3.3.1. Logistic Regression

LR is a generalized linear regression analysis model that is suitable for multivariable control. In contrast to common linear regression models, the LR model restricts the output value to the interval [0,1] through a sigmoid function. The relationship between the probability of landslide occurrence and the independent variables is

$$f(z) = \frac{1}{1 + e^{-z}} \tag{2}$$

where $z = w_1x_1 + w_2x_2 + \dots + w_Mx_M + b$ refers to a weighted linear combination model, and $b$ indicates the intercept of the function; $w_M$ ($M = 1, 2, 3, \dots, 16$) denotes the coefficients of the function; the independent variables $x_M$ ($M = 1, 2, 3, \dots, 16$) represent the 16 conditioning factors; and $f(z)$ signifies the probability of landslide hazard, within the range of [0,1]. A function value of 1 suggests that a landslide will surely occur at this point, and a value of 0 implies that no landslide will occur (Zhao et al. 2019).

3.3.2. Random Forest

As one of the most widely used classifier methods that have been successfully utilized for regression, classification, and feature selection, RF represents an ensemble of individually trained binary decision trees (Chen et al. 2018). Unlike other machine learning models, RF provides several measures of variable importance, and the most reliable measure is the reduction in classification accuracy when the values of the


variables in the nodes of the tree are randomly permuted (Breiman 2001). In the model-building process, RF grows multiple decision trees. The generalization error of RF depends on the accuracy of a single tree and the correlation between the trees; the final prediction result is determined by a majority vote of the decision trees (Sahin et al. 2018).

RF ranks the importance of factors based on the Gini index, which represents the probability that a sample randomly selected from the sample set is misclassified. It is determined as follows:

$$G = p(s) \times p(m) \tag{3}$$

where $G$ is the Gini index (the larger the value of $G$, the higher the uncertainty of the data); $p(s)$ is the probability of a sample being selected; and $p(m)$ is the probability of a sample being misclassified.

Table 2
Classification of landslide-conditioning factors.

Conditioning factor | Class | Classification standard
Elevation / (m) | 7 | 1. <340; 2. 340–595; 3. 595–850; 4. 850–1105; 5. 1105–1360; 6. 1360–1615; 7. >1615
Slope / (°) | 6 | 1. <10°; 2. 10°–20°; 3. 20°–30°; 4. 30°–40°; 5. 40°–50°; 6. 50°–60°
Aspect / (°) | 9 | 1. flat; 2. north; 3. northeast; 4. east; 5. southeast; 6. south; 7. southwest; 8. west; 9. northwest
Slope position | 6 | 1. valley; 2. lower slope; 3. flats slope; 4. middle slope; 5. upper slope; 6. ridge
Micro-landform | 10 | 1. canyons, and deeply incised streams; 2. midslope drainages, and shallow valleys; 3. upland drainages, and headwaters; 4. U-shape valleys; 5. plains; 6. open slopes; 7. upper slopes, and mesas; 8. local ridges/hills in valleys; 9. midslope ridges, and small hills in plains; 10. mountain tops, and high narrow ridges
Profile curvature | 7 | 1. <−1.0; 2. −1 to −0.5; 3. −0.5–0; 4. 0–0.5; 5. 0.5–1.0; 6. 1.0–1.5; 7. >1.5
TWI | 7 | 1. <10; 2. 10–12; 3. 12–14; 4. 14–16; 5. 16–18; 6. 18–20; 7. >20
Lithology | 12 | 1. J3p/J3s; 2. J2s/J2xs; 3. J1-2z/J1z; 4. T3xj; 5. T3b1; 6. T2b2; 7. T1d; 8. T1j; 9. P2; 10. P1; 11. D3/D2; 12. S1-2
Distance from faults / (m) | 6 | 1. <500; 2. 500–1000; 3. 1000–1500; 4. 1500–2000; 5. 2000–2500; 6. 2500–3000; 7. >3000
CRDS | 7 | 1. dip-slope I; 2. dip-slope II; 3. outward slope; 4. oblique slope; 5. tangential slope; 6. reverse slope; 7. flat
NDVI | 7 | 1. <0.10; 2. 0.10–0.20; 3. 0.20–0.30; 4. 0.30–0.40; 5. 0.40–0.50; 6. 0.50–0.60; 7. >0.60
Distance from rivers / (m) | 7 | 1. <100; 2. 100–200; 3. 200–300; 4. 300–400; 5. 400–500; 6. 500–600; 7. >600
Annual average rainfall / (mm) | 8 | 1. <1221; 2. 1221–1251; 3. 1251–1276; 4. 1276–1308; 5. 1308–1343; 6. 1343–1389; 7. 1389–1440; 8. >1440
Land cover | 6 | 1. farmland; 2. forest; 3. grassland; 4. building; 5. water; 6. not used
Distance from roads / (m) | 7 | 1. <100; 2. 100–200; 3. 200–300; 4. 300–400; 5. 400–500; 6. 500–600; 7. >600
Distance from buildings / (m) | 7 | 1. <100; 2. 100–200; 3. 200–300; 4. 300–400; 5. 400–500; 6. 500–600; 7. >600

3.3.3. Bayesian Optimization

The Bayesian Optimization algorithm is extensively adopted to determine the optimal hyperparameter values of a model, owing to its ability to rapidly obtain optimal values (Garrido-Merchán and Hernández-Lobato 2020). Owing to the use of the GP, the Bayesian algorithm can make full use of prior knowledge and is highly robust. By increasing the number of samples, this algorithm can fit the posterior distribution of the objective function, thereby obtaining the optimal value to realize the hyperparameter optimization of the models.

The Bayesian algorithm relies on fitting a probabilistic model to the observed values of the black-box target to be optimized. By considering the predictive distribution (specifying the potential value of the target at each point in the input space), the Bayesian optimization method guides the search to focus only on the areas of the input space that are expected to provide the most effective information about the solution to the optimization problem. In particular, the probabilistic model for Bayesian optimization is a GP because GPs can easily calculate the predictive distribution of the target.

When a GP is adopted as the basic model, the assumption is that the black-box objective function $f(x)$ to be optimized is randomly sampled from a GP. To be specific, $f(x) \sim GP(m(x), k(x, x'))$, where $k(x, x')$ represents a covariance function, and $m(x)$ refers to a mean function. The covariance function $k(x, x')$ specifies the intrinsic characteristics of the objective function $f(x)$ (such as smoothness and level of additive noise). Its output is the covariance of $f(x)$ and $f(x')$. In a general GP model, the probability of each feature is calculated and added, whereas a multivariate GP model requires constructing a covariance matrix and using the probability values of all feature vectors. The multivariate GP model is as follows:

$$P(x) = \frac{1}{(2\pi)^{n/2}\,|\mathrm{cov}|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^{T}\,\mathrm{cov}^{-1}\,(x - \mu)\right) \tag{4}$$

where $\mu$ (a mean) and $\mathrm{cov}$ (a covariance) are as follows:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{5}$$

$$\mathrm{cov} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^{T} \tag{6}$$

In the present research, the Bayesian algorithm was employed with cross-validation accuracy as the objective function to be optimized.

3.4. Model performance and validation

Assessments require validation to ensure scientific significance; hence, it is necessary to evaluate the validity of the landslide susceptibility models used. The confusion matrix and AUC were used to analyze the accuracy. In the confusion matrix, examples are divided into positive and negative ones: a sample point that is a landslide is positive, and a sample point that is a non-landslide is negative. If an instance is a "landslide" and is predicted as "landslide," it is recorded as TP (true positive); if it is a "non-landslide" and is predicted as "non-landslide," it is recorded as TN (true negative); if it is a "non-landslide" and is predicted as "landslide," it is recorded as FP (false positive); and if it is a "landslide" and is predicted to be "non-landslide," it is recorded as FN (false negative). The formulae for accuracy and precision are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{8}$$

Confusion-matrix-based statistical measures such as accuracy, precision, and recall were also used to evaluate the predictive capabilities of the models. The receiver operating characteristic (ROC) is a type of curve based on confusion matrices, which plots sensitivity (the true positive rate) on the vertical axis against 1 − specificity (the false positive rate) on the horizontal axis. The AUC value under the ROC curve can quantitatively express the accuracy of model predictions: the value "1" represents an ideal model, while "0" represents a model that lacks sufficient information (Wang et al. 2020b).

4. Results and analyses

4.1. Model hyperparameter optimization

Table 3 lists the five main hyperparameters included in the LR model. Tol and max_iter were kept at their default values (1e-4 and 100, respectively; max_iter takes an integer value), whereas solver, penalty, and C were optimized, and the hyperparameter values obtained in each iteration were output. As can be observed in Fig. 3, the AUC values ranged from 0.755 to 0.799 under different hyperparameter values. When the AUC value


reached the maximum (0.799), the optimal hyperparameter values were as follows: ['C': '0.1', 'penalty': 'l2', 'solver': 'lbfgs'].

Table 3
Main hyperparameters involved in logistic regression.

Hyperparameters | Explanation
solver | Loss function minimization algorithm, including sag, Newton CG, lbfgs, and liblinear.
penalty | Regularization methods include L1 and L2. For sag, Newton CG, and lbfgs, only L2 can be chosen.
C | Regularization intensity value.
max_iter | Maximum number of iterations.
Tol | Tolerance to stop the iteration.

Fig. 3. Receiver operating characteristic curves of logistic regression under different hyperparameter values.

The RF model mainly contains seven hyperparameters, namely n_estimators, criterion, min_samples_split, max_depths, max_features, min_samples_leaf, and bootstrap (Sun et al. 2020a). Among the hyperparameters, Gini sample segmentation was selected as the principal criterion. As this study had a size of 16,720 samples, the default value of 1 could be used directly for min_samples_leaf. Because samples were drawn with replacement during sample selection, the bootstrap value was set as "True." The remaining four hyperparameters (n_estimators, max_depths, max_features, and min_samples_splits) were optimized, and the hyperparameter values obtained in each iteration were output. As shown in Fig. 4, the AUC values range from 0.917 to 0.946 under different hyperparameter values. When the AUC value reached the maximum (0.946), the optimal hyperparameter values were as follows: ['n_estimators': '50', 'max_depths': '16', 'max_features': '10', 'min_samples_splits': '4'].

Fig. 4. Receiver operating characteristic curves of random forest under different hyperparameter values.

Both models showed that their hyperparameters had a great impact on their accuracy. Under 10-fold cross-validation, the AUC values of the LR and RF models varied by 5.8% and 3.2%, respectively, across the hyperparameter values tested. Therefore, the RF model was determined to be relatively stable.

4.2. Validation of the logistic regression model

Multicollinearity refers to the existence of near-linear relationships between different factors. In the LR model, any problem with multicollinearity will result in certain negative influences (Lee et al. 2018). In this study, we used the stepwise regression method to address the multicollinearity problem in the LR model, determining that five factors (aspect, slope position, micro-landform, lithology, and distance from faults) should be eliminated. The remaining 11 factors were used to build the LR model. The final LR model with coefficients and function intercept values is as follows:

$$z = -6.067x_1 - 1.309x_2 - 2.216x_3 + 0.183x_4 + 1.519x_5 - 0.375x_6 + 0.503x_7 + 1.141x_8 - 0.425x_9 - 1.359x_{10} - 0.427x_{11} - 0.227 \tag{9}$$

where $x_1$ is the normalized elevation value, $x_2$ is the normalized slope value, $x_3$ is the normalized profile curvature value, $x_4$ is the normalized TWI value, $x_5$ is the normalized lithology value, $x_6$ is the normalized CRDS value, $x_7$ is the normalized NDVI value, $x_8$ is the normalized distance from rivers value, $x_9$ is the normalized annual average rainfall value, $x_{10}$ is the normalized distance from roads value, and $x_{11}$ is the normalized distance from buildings value.

After construction, the LR model based on the optimized hyperparameters was applied to the entire study area. Then, the resulting LSM was reclassified by defining the limits of the cumulative distribution of the prediction values supplied by the LR. Finally, to allow easy visual interpretation and comparison of the areas, the LSM was classified into five categories covering 10%, 10%, 10%, 20%, and 50% of the area, corresponding to very-high-, high-, moderate-, low-, and very-low-susceptibility regions, respectively (Pradhan and Lee 2010). Fig. 5 shows the LSM of Fengjie County generated using the LR model, in which most areas of Fengjie County were located in the low-susceptibility regions, concentrated in the south. The high-susceptibility regions were concentrated along the Yangtze River and its tributaries, a result that matched well with the distribution of actual historical landslides.

4.3. Validation of the random forest model

Fig. 6 shows the LSM produced by the RF model based on the optimized hyperparameters for Fengjie County, overlaid with landslides. This indicates that the LSM matched well with the distribution of the actual historical landslides.

4.4. Comparison of models

The confusion matrix and ROC plots of the LR and RF models are depicted in Table 4 and Fig. 7. Further, three evaluation metrics, namely, the AUC, overall accuracy, and precision, were utilized to evaluate the models based on the training and test datasets. The AUC of the training dataset is an indication of the success rate of the models, while the AUC of the test dataset is an indication of the predictive capabilities of the models (Tsangaratos et al. 2016). As shown in Fig. 7, the AUC values of the training dataset of the LR and RF models were 0.78 and 1.00, respectively, while the AUC values of the test dataset of the LR model and the RF model were 0.80 and 0.95, respectively. In addition,


Fig. 5. Landslide susceptibility mapping based on the logistic regression model.
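
A map such as Fig. 5 results from evaluating the fitted LR model at every cell. The sketch below is illustrative rather than the authors' implementation: it combines the coefficients printed in Eq. (9) with the sigmoid of Eq. (2). The function names and the all-zero example cell are hypothetical.

```python
import math

# Coefficients w1..w11 and intercept b, copied from Eq. (9).
WEIGHTS = [-6.067, -1.309, -2.216, 0.183, 1.519, -0.375,
           0.503, 1.141, -0.425, -1.359, -0.427]
INTERCEPT = -0.227


def sigmoid(z):
    """Eq. (2): map any real number z to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def landslide_probability(factors):
    """Evaluate Eq. (9) followed by Eq. (2) for one cell.

    `factors` holds the 11 normalized factor values x1..x11, each in [0, 1].
    """
    if len(factors) != len(WEIGHTS):
        raise ValueError("expected 11 normalized factor values")
    z = sum(w * x for w, x in zip(WEIGHTS, factors)) + INTERCEPT
    return sigmoid(z)


# Hypothetical cell whose normalized factors are all 0, so z equals the intercept.
baseline = landslide_probability([0.0] * 11)
# Same cell but with the highest normalized elevation (x1 = 1).
high_elevation = landslide_probability([1.0] + [0.0] * 10)
print(baseline, high_elevation)
```

Because the elevation coefficient is strongly negative, raising x1 from 0 to 1 lowers the predicted probability in this sketch.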

Fig. 6. Landslide susceptibility mapping based on the random forest model.
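
Both maps group the predicted probabilities into five susceptibility classes that cover fixed shares of the study area (50%, 20%, 10%, 10%, and 10% from very low to very high, as listed in Table 5). One way to sketch this reclassification, assuming it amounts to ranking cells by predicted probability, is shown below. The function name and the synthetic probabilities are hypothetical, and ties between equal probabilities are split by sort order.

```python
# Area shares in percent, ordered from lowest to highest susceptibility.
CLASS_SHARES = [("very low", 50), ("low", 20), ("moderate", 10),
                ("high", 10), ("very high", 10)]


def classify_by_area(probabilities):
    """Label each cell so that the classes cover the shares in CLASS_SHARES."""
    n = len(probabilities)
    # Cell indices sorted by ascending predicted probability.
    order = sorted(range(n), key=lambda i: probabilities[i])
    labels = [None] * n
    start = 0
    cumulative = 0
    for name, share in CLASS_SHARES:
        cumulative += share
        end = (cumulative * n) // 100  # integer arithmetic avoids rounding drift
        for i in order[start:end]:
            labels[i] = name
        start = end
    return labels


# Synthetic example: 100 cells with probabilities 0.00, 0.01, ..., 0.99.
example = [i / 100 for i in range(100)]
labels = classify_by_area(example)
print({name: labels.count(name) for name, _ in CLASS_SHARES})
```

For the 100 synthetic cells the class counts come out as 50, 20, 10, 10, and 10.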


Table 4
Confusion matrix of the logistic regression (LR) and random forest (RF) models.

Model | Predicted value | Actual value: Landslide (1) | Actual value: Non-landslide (0) | Metric
LR | Landslide (1) | 2 | 5 | Accuracy: 0.909
LR | Non-landslide (0) | 1518 | 15,195 | Precision: 0.286
RF | Landslide (1) | 1317 | 13 | Accuracy: 0.987
RF | Non-landslide (0) | 203 | 15,187 | Precision: 0.990

Table 5
Statistics of the susceptibilities classified based on the logistic regression (LR) and random forest (RF) models.

Class | Model | Very low | Low | Middle | High | Very high
Coverage (%) | LR | 50 | 20 | 10 | 10 | 10
Coverage (%) | RF | 50 | 20 | 10 | 10 | 10
Landslide proportion (%) | LR | 14.14 | 20.86 | 15.00 | 20.26 | 29.74
Landslide proportion (%) | RF | 7.30 | 12.89 | 12.70 | 19.61 | 47.50
Landslide density (point/km2) | LR | 0.11 | 0.39 | 0.56 | 0.76 | 1.12
Landslide density (point/km2) | RF | 0.05 | 0.24 | 0.48 | 0.74 | 1.78

were located in the same area (10%). These results reveal that, for the same area, the proportion of landslides located in the low-susceptibility regions of the LSM generated by RF was lower, whereas the proportion of those in high-susceptibility regions was higher, showing that the RF model indicates the landslide susceptibility of the whole area better than the LR model.

Furthermore, new landslide data from the study area in 2017 were collected for further comparison. By overlaying the 25 landslides with recorded coordinates in 2017 on the LSMs generated by the LR and RF models (Fig. 8), it can be observed that most of the new landslides fell into the high- or very-high-susceptibility regions of the LSMs generated by both the LR and RF models, showing a certain prediction ability. In January 2017, Zhoulaoshiliangzi suffered a medium-sized landslide, with 185 people affected, 180 buildings damaged, and properties worth 80 million yuan compromised. In the same month, a medium-sized landslide occurred in Houpoli, with 10 people affected and properties worth 3 million yuan damaged. Comparative analyses indicated that the landslides in both areas were located in very-low-susceptibility regions of the LSM generated by the LR model but fell within the high- or very-high-susceptibility regions of the LSM generated by the RF model. For this reason, the prediction ability of the RF model was shown to be significantly better than that of the LR model.

5. Discussion

5.1. Importance of conditioning factors


Fig. 7. Receiver operating characteristic curves of the logistic regression and
random forest models. The impact of each conditioning factor on the occurrence of land­
slides varies; hence, analyzing the importance of the factors for landslide
for all the datasets shown in Table 4, the overall accuracy and precision occurrence can provide valuable guidance for landslide disaster man­
of the LR model were 0.909 and 0.286, respectively, while those of the agement. The above analyses have indicated that the RF model exhibits
RF model were 0.987 and 0.990, respectively. All of the metrics revealed better performance in the case of this study; therefore, the “Mean
that, although both models demonstrated a reasonable goodness of fit, Decrease Gini” in the RF model would be used to identify the critical
the RF model performed better in terms of the training and test datasets. order of factors. Fig. 9 illustrates the importance of the factors premised
Therefore, the RF model had a better prediction than the LR model in on the “Mean Decrease Gini”. Here, the annual average rainfall, eleva­
this case. tion, and distance from buildings were found to be the most vital factors
The LR and RF models were also compared to each other based on the conditioning to the occurrence of landslides.
generated LSMs. From a qualitative perspective, two typical regions The correlation between landslide and elevation and between land­
were selected for comparison for the generated LSMs (Figs. 5 and 6). For slide and average annual rainfall is consistent with the findings of pre­
region A, there was a strong correlation between the landslides and the vious researches (Sun et al. 2020a). Concerning distance from buildings,
two sets of LSM results. Here, the results exhibit that almost all of the the landslide density had a significantly negative correlation within 400
historical landslides fell into the high- or very-high-susceptibility re­ m of buildings. However, a positive correlation can be observed beyond
gions. Yet, the area of high- or very-high susceptibility of LSM generated 400 m (Fig. 10). In areas 100 m away from buildings, the landslide
by RF was smaller (Fig. 6), indicating a better correspondence. For re­ density was as high as 0.76/km2, which is much higher than that in other
gion B, the historical landslides largely fell into the high- or very-high- areas (Fig. 11a). This may be attributed to heavy human development
susceptibility regions of LSM generated by RF, revealing a good corre­ activities that cause a great degree of damage to the soil mass, thus
spondence, while more historical landslides fell into the low- or very- sharply increasing the probability of the occurrence of landslides. Pre­
low-susceptibility regions of LSM generated by LR, exhibiting a rela­ vious research has shown little consideration to the impact of human
tively poor correspondence. From a quantitative perspective (Table 5), activities, and the influence of buildings on landslides has not been well
in the very-low-susceptibility region, 14.14% of the landslides of LSM explored. The present research, however, observed that the distance
generated by LR and 7.30% of those generated by RF were located in the from buildings was among the top three contributing factors, and thus
same area (50%). In the very-high-susceptibility region, 29.74% of the cannot be ignored in the formation of landslides. The introduction of this
landslides of LSM generated by LR and 47.50% of those generated by RF factor provides a new perspective for exploring the impact of human
activities on the occurrence of landslides.
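
The “Mean Decrease Gini” used to rank the factors sums, for each factor, the reduction in Gini impurity achieved by the splits made on that factor, averaged over the trees of the forest. As a rough illustration of the quantity being averaged, the sketch below computes the impurity decrease for one split on a toy sample. The rainfall values, labels, threshold, and function names are hypothetical and are not taken from the study.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity decrease obtained by splitting a node at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_children


# Hypothetical toy sample: annual rainfall (mm) per cell, 1 = landslide cell.
rainfall = [900, 950, 1000, 1100, 1150, 1200, 1250, 1300]
landslide = [0, 0, 0, 1, 0, 1, 1, 1]

decrease = gini_decrease(rainfall, landslide, threshold=1050)
print(f"Gini decrease for the split at 1050 mm: {decrease:.3f}")
```

In a random forest this decrease is accumulated over every node that splits on a given factor and then averaged across trees, so factors that repeatedly produce purer child nodes receive higher scores.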

Fig. 8. Distribution of two new landslides in 2017.

The “Mean Decrease Gini” in the RF model also indicated that the
distance from faults had little effect on the occurrence of landslides in
the study region (Figs. 9 and 10). As seen in Fig. 11b, only a few small
faults exist in the study area. It is necessary to note that some scholars
arrived at a contrasting conclusion: faults were essential for the
occurrence of landslides (Guo et al. 2015). By comprehensively
exploring the mechanism of landslides, we determined that the landslides
in the study area were triggered by rainfall, while earthquakes
triggered most of the landslides in the areas considered in other studies.
Consequently, the present study maintained that there was no strong
correlation between fault distribution and landslides, aside from
earthquake-triggered landslides. At least for this study area, it was
therefore difficult to use the factor of distance from faults to explain
the spatial distribution of landslides. Hence, the elimination of this
factor should be considered in future research.

Fig. 9. Importance of conditioning factors based on the mean decrease Gini.

Fig. 10. Landslide density charts for distance from buildings and distance
from faults.

5.2. Role of hyperparameter optimization

The purpose of hyperparameter optimization in machine learning is
to find the hyperparameter values that give the machine learning
algorithm its best performance on the verification dataset. Unlike general
model parameters obtained from data training, hyperparameters are set
before training. Hyperparameter optimization determines a set of
hyperparameters that yields an optimized model, one that reduces the
predefined loss function and improves the prediction or classification
accuracy for the given independent data. Therefore, the performance of the
model is changed by the optimization of the hyperparameters, which leads to
a change in the accuracy of the model results. The results of the RF model
before and after hyperparameter optimization are summarized in Table 6 and
shown in Fig. 12.

As can be observed from Fig. 12, the AUC values of the RF model
training dataset before and after hyperparameter optimization are both
1.00, while the corresponding AUC values of the test dataset are 0.85
and 0.95, respectively. Additionally, for all the datasets summarized in
Table 6, the overall accuracy and precision of RF before the optimization
are 0.866 and 0.962, respectively, and the overall accuracy and precision
of the RF model after optimization are 0.987 and 0.990, respectively.
The results highlight that the RF model indexes increased after
the hyperparameter optimization, in both the test dataset and the whole
dataset. For this reason, hyperparameter optimization can improve the
performance of the model to a certain extent, thereby increasing its
accuracy.

5.3. Comparison of logistic regression and random forest models

The present study demonstrated that RF performed better than LR in
the study area, a conclusion similar to the findings of other related
comparative studies. For example, Trigila et al., who compared the results
of the RF model with a Frequency Ratio model and an LR model, reported
that the RF model shows higher accuracy than the other models (Trigila et
al. 2015). Similar results were also reported by a comparative study in
Austria, which indicated that the RF model is more accurate than LR and
other data-mining methods (Goetz et al. 2015). In contrast, Hong et al.
reported poor performance for the RF model as compared with that of
Evidential Belief Function, Frequency Ratio, and LR models in Lianhua
County, China (Hong et al. 2016). These contrasting conclusions may be due
to a series of reasons: for example, the differences in the geographical
environments of the study areas, the factors considered in selecting
indicators, and the volume of data provided for building models, among
others. Be that as it may, there is a crucial step that should not be
overlooked in LSM: the optimization of the hyperparameters of the models.
In the cited studies, model optimization was not performed, and only the
built-in parameters of the models were adopted. Thus, the accuracy of the
models in these studies could be further improved, as they were not
necessarily optimized models. Comparisons of two or more un-optimized
models are not convincing because they are unable to reflect the strengths
and weaknesses of each model in a particular study area. In contrast, this
study implements hyperparameter optimization to obtain the optimized LR
and RF models and carries out comparisons based on the optimized
models. Accordingly, the results of this study provide a case for the
comparison of two landslide susceptibility modeling methods (i.e., the
statistical LR modeling and the data-mining RF modeling) after
comprehensive consideration of hyperparameter optimization, so as to

enhance the convincing power of the comparison of these models and
provide a knowledge base for model comparison: comparison premised
on hyperparameter optimization.

Fig. 11. Thematic layers of distance from buildings and distance from faults.

Table 6
Confusion matrix of the random forest (RF) models before and after
hyperparameter optimization.

Model        Predicted value    Actual landslide (1)  Actual non-landslide (0)  Accuracy  Precision
RF (before)  Landslide (1)      1118                  44                        0.866     0.962
RF (before)  Non-landslide (0)  402                   15,156
RF (after)   Landslide (1)      1317                  13                        0.987     0.990
RF (after)   Non-landslide (0)  203                   15,187
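
The accuracy and precision values reported with the confusion matrices in Tables 4 and 6 can be recomputed from the printed counts. The sketch below is a minimal illustration, assuming the standard definitions accuracy = (TP + TN) / total and precision = TP / (TP + FP); the class and function names are illustrative and not from the original study. Recomputed this way, the LR and post-optimization RF values match the reported ones. For the pre-optimization RF matrix as printed in Table 6, plain accuracy comes out near 0.973, while the reported 0.866 coincides with balanced accuracy (the mean of sensitivity and specificity); the sketch reports both and does not settle which definition was intended.

```python
from typing import NamedTuple


class Confusion(NamedTuple):
    tp: int  # predicted landslide, actual landslide
    fp: int  # predicted landslide, actual non-landslide
    fn: int  # predicted non-landslide, actual landslide
    tn: int  # predicted non-landslide, actual non-landslide


def accuracy(c: Confusion) -> float:
    """Share of all cells classified correctly."""
    return (c.tp + c.tn) / sum(c)


def precision(c: Confusion) -> float:
    """Share of predicted landslide cells that are actual landslides."""
    return c.tp / (c.tp + c.fp)


def balanced_accuracy(c: Confusion) -> float:
    """Mean of sensitivity (landslide recall) and specificity."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    return (sensitivity + specificity) / 2


# Counts copied from Tables 4 and 6 of the text.
MATRICES = {
    "LR": Confusion(tp=2, fp=5, fn=1518, tn=15195),
    "RF (before)": Confusion(tp=1118, fp=44, fn=402, tn=15156),
    "RF (after)": Confusion(tp=1317, fp=13, fn=203, tn=15187),
}

for name, c in MATRICES.items():
    print(
        f"{name}: accuracy={accuracy(c):.3f} "
        f"precision={precision(c):.3f} "
        f"balanced accuracy={balanced_accuracy(c):.3f}"
    )
```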

Fig. 12. Receiver operating characteristic curves of the random forest models
(a) before and (b) after hyperparameter optimization.

6. Conclusion

In this study, the optimized LR and RF landslide susceptibility
models were proposed through hyperparameter optimization. A comparison
of these two models was conducted for a typical landslide-prone area,
Fengjie County, China. The following conclusions were drawn:

(1) Based on the Bayesian algorithm, the AUC value of the test
    dataset in the LR model is improved by 4%, while the AUC value of
    the test dataset in the RF model is improved by 10%, indicating that
    hyperparameter optimization premised on the Bayesian algorithm
    had a considerable impact on the accuracy of both models;
    hyperparameter optimization is therefore very important for
    models of LSM.
(2) Although both models exhibit reasonable performance, the
    optimized RF model premised on hyperparameter optimization
    has better stability and predictive capability in the case area.

(3) Three major conditioning factors, i.e., annual average rainfall,
    elevation, and distance from buildings, played an important role
    in landslide occurrence, whereas distance from faults was
    impractical for explaining the spatial distribution of landslides in
    this study region.

These findings make up for the crucial step in LSM (hyperparameter
optimization) through the Bayesian algorithm, and provide a comparison
case between LR and RF models after comprehensive consideration
of hyperparameter optimization, so as to increase the convincing power
of the comparison of these models and provide a knowledge base for
model comparison: comparison premised on hyperparameter
optimization.

Funding

This research was funded by the National Key Research and Development
Program of China (No. 2018 YFC 1505501), the Natural Science
Foundation of Chongqing (Grant No. cstc2020jcyj-msxmX0841), and
Humanities and Social Sciences Foundation of the Ministry of Education
of China (Grant No. 20XJAZH002).

Declaration of Competing Interest

The authors declare no conflict of interest.

Acknowledgments

We want to express our gratitude to Chongqing Meteorological
Administration for providing essential meteorological data and also to
Chongqing Institute of Geology and Mineral Resources for providing
valuable research data on historical landslides. We are also grateful to
the editors and anonymous reviewers for their valuable comments on
this manuscript.

Appendix A. Supplementary data

Supplementary data to this article can be found online at
https://doi.org/10.1016/j.enggeo.2020.105972.

References

Bourenane, H., Bouhadad, Y., Guettouche, M.S., et al., 2014. GIS-based landslide susceptibility zonation using bivariate statistical and expert approaches in the city of Constantine (Northeast Algeria). Bull. Eng. Geol. Environ. 74 (2), 337–355. https://doi.org/10.1007/s10064-014-0616-6.
Breiman, L., 2001. Random Forests. Mach. Learn. 45 (1), 5–32. https://doi.org/10.1023/A:1010933404324.
Chen, W., Zhang, S., Li, R., et al., 2018. Performance evaluation of the GIS-based data mining techniques of best-first decision tree, random forest, and naive Bayes tree for landslide susceptibility modeling. Sci. Total Environ. 644, 1006–1018. https://doi.org/10.1016/j.scitotenv.2018.06.389.
Garrido-Merchán, E.C., Hernández-Lobato, D., 2020. Dealing with categorical and integer-valued variables in Bayesian Optimization with Gaussian processes. Neurocomputing 380, 20–35. https://doi.org/10.1016/j.neucom.2019.11.004.
Goetz, J.N., Brenning, A., Petschko, H., et al., 2015. Evaluating machine learning and statistical prediction techniques for landslide susceptibility modeling. Comput. Geosci. 81, 1–11. https://doi.org/10.1016/j.cageo.2015.04.007.
Guo, C., Montgomery, D.R., Zhang, Y., et al., 2015. Quantitative assessment of landslide susceptibility along the Xianshuihe fault zone, Tibetan Plateau, China. Geomorphology 248, 93–110. https://doi.org/10.1016/j.geomorph.2015.07.012.
Heckmann, T., Gegg, K., Gegg, A., et al., 2014. Sample size matters: investigating the effect of sample size on a logistic regression susceptibility model for debris flows. Nat. Hazards Earth Syst. Sci. 14 (2), 259–278. https://doi.org/10.5194/nhess-14-259-2014.
Hong, H., Pourghasemi, H.R., Pourtaghi, Z.S., 2016. Landslide susceptibility assessment in Lianhua County (China): a comparison between a random forest data mining technique and bivariate and multivariate statistical models. Geomorphology 259, 105–118. https://doi.org/10.1016/j.geomorph.2016.02.012.
Huang, Y., Zhao, L., 2018. Review on landslide susceptibility mapping using support vector machines. Catena 165, 520–529. https://doi.org/10.1016/j.catena.2018.03.003.
Hussin, H.Y., Zumpano, V., Reichenbach, P., et al., 2016. Different landslide sampling strategies in a grid-based bi-variate statistical susceptibility model. Geomorphology 253, 508–523. https://doi.org/10.1016/j.geomorph.2015.10.030.
Kang, L., Chen, R.-S., Xiong, N., et al., 2019. Selecting Hyper-Parameters of Gaussian Process Regression based on Non-Inertial Particle Swarm Optimization in Internet of Things. IEEE Access 7, 59504–59513. https://doi.org/10.1109/access.2019.2913757.
Lee, J.-H., Sameen, M.I., Pradhan, B., et al., 2018. Modeling landslide susceptibility in data-scarce environments using optimized data mining and statistical methods. Geomorphology 303, 284–298. https://doi.org/10.1016/j.geomorph.2017.12.007.
Li, C., Fu, Z., Wang, Y., et al., 2019. Susceptibility of reservoir-induced landslides and strategies for increasing the slope stability in the Three Gorges Reservoir Area: Zigui Basin as an example. Eng. Geol. 261. https://doi.org/10.1016/j.enggeo.2019.105279.
Lombardo, L., Mai, P.M., 2018. Presenting logistic regression-based landslide susceptibility results. Eng. Geol. 244, 14–24. https://doi.org/10.1016/j.enggeo.2018.07.019.
Merghadi, A., Yunus, A.P., Dou, J., et al., 2020. Machine learning methods for landslide susceptibility studies: a comparative overview of algorithm performance. Earth Sci. Rev. 207. https://doi.org/10.1016/j.earscirev.2020.103225.
Pradhan, B., Lee, S., 2010. Landslide susceptibility assessment and factor effect analysis: backpropagation artificial neural networks and their comparison with frequency ratio and bivariate logistic regression modelling. Environ. Model Softw. 25 (6), 747–759. https://doi.org/10.1016/j.envsoft.2009.10.016.
Reichenbach, P., Rossi, M., Malamud, B.D., et al., 2018. A review of statistically-based landslide susceptibility models. Earth Sci. Rev. 180, 60–91. https://doi.org/10.1016/j.earscirev.2018.03.001.
Sahin, E.K., Colkesen, I., Kavzoglu, T., 2018. A comparative assessment of canonical correlation forest, random forest, rotation forest and logistic regression methods for landslide susceptibility mapping. Geocarto Int. 35 (4), 341–363. https://doi.org/10.1080/10106049.2018.1516248.
Sun, D., Wen, H., Wang, D., et al., 2020a. A random forest model of landslide susceptibility mapping based on hyperparameter optimization using Bayes algorithm. Geomorphology 362. https://doi.org/10.1016/j.geomorph.2020.107201.
Sun, D., Wen, H., Zhang, Y., et al., 2020b. An optimal sample selection-based logistic regression model of slope physical resistance against rainfall-induced landslide. Natural Hazards. https://doi.org/10.1007/s11069-020-04353-6.
Trigila, A., Iadanza, C., Esposito, C., et al., 2015. Comparison of Logistic Regression and Random Forests techniques for shallow landslide susceptibility assessment in Giampilieri (NE Sicily, Italy). Geomorphology 249, 119–136. https://doi.org/10.1016/j.geomorph.2015.06.001.
Tsangaratos, P., Ilia, I., Hong, H., et al., 2016. Applying Information Theory and GIS-based quantitative methods to produce landslide susceptibility maps in Nancheng County, China. Landslides 14 (3), 1091–1111. https://doi.org/10.1007/s10346-016-0769-4.
Wang, X., Huang, F., Cheng, Y., 2014. Super-parameter selection for Gaussian-Kernel SVM based on outlier-resisting. Measurement 58, 147–153. https://doi.org/10.1016/j.measurement.2014.08.019.
Wang, Y., Sun, D.L., Wen, H.J., et al., 2020a. Comparison of Random Forest Model and Frequency Ratio Model for Landslide Susceptibility Mapping (LSM) in Yunyang County (Chongqing, China). Int. J. Environ. Res. Public Health 17, 4206. https://doi.org/10.3390/ijerph17124206.
Wang, Y.M., Feng, L.W., Li, S.J., et al., 2020b. A hybrid model considering spatial heterogeneity for landslide susceptibility mapping in Zhejiang Province, China. Catena 188. https://doi.org/10.1016/j.catena.2019.104425.
Xie, W., Chen, W., Shen, L., et al., 2021. Surrogate network-based sparseness hyper-parameter optimization for deep expression recognition. Pattern Recogn. 111. https://doi.org/10.1016/j.patcog.2020.107701.
Zhao, Y., Wang, R., Jiang, Y., et al., 2019. GIS-based logistic regression for rainfall-induced landslide susceptibility mapping under different grid sizes in Yueqing, Southeastern China. Eng. Geol. 259. https://doi.org/10.1016/j.enggeo.2019.105147.
