A Bayesian unsupervised learning approach for identifying soil stratification using cone penetration data

Published in Canadian Geotechnical Journal (authors' pre-print version)

Hui Wang1; Xiangrong Wang2; J. Florian Wellmann3; Robert Y. Liang4

1 Assistant Professor, Department of Civil and Environmental Engineering and Engineering Mechanics, The University of Dayton, Dayton, OH 45469-0243, USA, Email: hwang12@udayton.edu.

2 Corresponding Author, Visiting Professor, Department of Civil and Environmental Engineering and Engineering Mechanics, The University of Dayton, Dayton, OH 45469-0243, USA, Email: xwang2@udayton.edu, Tel: +1 (937)-229-3847.

3 Assistant Professor, The Aachen Institute for Advanced Study in Computational Engineering Science (AICES), RWTH Aachen University, 52062 Aachen, Germany, Email: wellmann@aices.rwth-aachen.de.

4 Professor, Department of Civil and Environmental Engineering and Engineering Mechanics, The University of Dayton, Dayton, OH 45469-0243, USA, Email: rliang1@udayton.edu.

Abstract

This paper presents a novel perspective for understanding the spatial and statistical patterns of a cone penetration dataset and using them to identify soil stratification. Both local consistency in physical space (i.e., along depth) and statistical similarity in feature space (i.e., logQt – logFr space, or the Robertson chart) between data points are considered simultaneously. The proposed approach consists, in essence, of two parts: 1) a pattern detection approach using a Bayesian inferential framework, and 2) a pattern interpretation protocol using the Robertson chart. The first part is the mathematical core of the proposed approach, which infers both the spatial pattern in physical space and the statistical pattern in feature space from the input dataset; the second part converts the abstract patterns into intuitive spatial configurations of multiple soil layers having different soil behavior types. The advantages of the proposed approach include probabilistic soil classification and the identification of soil stratification in an automatic and fully unsupervised manner. The proposed approach has been implemented in MATLAB R2015b and Python 3.6, and tested using various datasets including both synthetic and real-world CPT soundings. The results show that the proposed approach can accurately and automatically detect soil layers with quantified uncertainty and reasonable computational cost.

Key words: unsupervised learning; soil stratification; CPT; Bayesian inferential framework; soil behavior type.

Introduction

The cone penetration test (CPT) is one of the commonly adopted in-situ tests for determining subsoil stratigraphy (Lunne et al. 1997). It is less disruptive and provides nearly continuous, repeatable and reliable data with cost savings (Robertson 2009). In recent years, CPT-based stratigraphic profiling has received considerable attention. Many different strategies have been investigated in previous studies, such as fuzzy sets (Zhang and Tumay 1999), clustering analysis (Das and Basudhar 2009; Depina et al. 2016; Hegazy and Mayne 2002; Liao and Mayne 2007), statistical analysis (Phoon et al. 2003; Wickremesinghe and Campanella 1991), Bayesian inference (Wang et al. 2013), and the wavelet transform modulus maxima (WTMM) method (Ching et al. 2015).

Although these strategies all have sound theoretical grounds and certain advantages, their performance may vary considerably from case to case due to some limitations. A comparison study among several of the above methods was performed by Ching et al. (2015). The results indicate that the Bayesian method proposed by Wang et al. (2013) (hereafter referred to as the Bayesian method) and the WTMM method (Ching et al. 2015) generally outperform other methods in some cases with complex soil stratigraphy configurations. However, the Bayesian method is computationally expensive because it requires solving high-dimensional integrals and high-dimensional non-convex optimization problems, while the WTMM method operates only on the soil behavior index and hence may not make full use of the original “two-dimensional” information (i.e., tip resistance and friction ratio). To be more concrete, different combinations of tip resistance and friction ratio may result in the same soil behavior index even though they represent different (albeit possibly similar) soil mechanical properties. Moreover, the statistical correlation between tip resistance and friction ratio cannot be reflected by the soil behavior index. In addition, the soil behavior index does not apply to all soil behavior types (Robertson and Cabal 2010). Neither of the two methods covers uncertainty quantification of the soil classification result at each sampling location along depth, possibly because of computational cost considerations or the limitations of the adopted mathematical tools in solving optimization problems. In this work, we aim to develop a Bayesian unsupervised learning approach for automatic layer detection and soil classification that quantifies uncertainty and takes both tip resistance and friction ratio into consideration. It is expected to have reasonable accuracy and computational cost.

As electric cones convert the raw signal into digital form at selected intervals (Robertson and Cabal 2010), we discretize the physical space (i.e., a vertical line in the direction of depth) into elements matching the measurement resolution. Each element is assigned a set of original observations. Two features can be derived from the observations: the normalized friction ratio, $F_r = 100 f_s / (q_t - \sigma_{v0})$, and the normalized tip resistance, $Q_t = (q_t - \sigma_{v0}) / \sigma'_{v0}$, where $f_s$, $q_t$, $\sigma_{v0}$, and $\sigma'_{v0}$ are the sleeve friction, corrected tip resistance, vertical total stress, and vertical effective stress, respectively. After taking logarithms, the two features form a two-dimensional feature space. A portion of this feature space defines a soil classification chart frequently referred to as the Robertson chart (Robertson 1990), which is shown in Figure 1. The chart is divided into nine zones corresponding to nine soil behavior types (SBTs) as listed in Table 1. In this study, the feature space simply reduces to the Robertson chart with a finite span of $\log_{10} F_r$ and $\log_{10} Q_t$. The soil elements are represented by feature pairs ($\log_{10} F_r$, $\log_{10} Q_t$) and visualized as point clouds in the Robertson chart.
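The two feature definitions above can be sketched in a few lines of Python for a single measurement. This is only an illustration: the function name, the argument names, and the decision to reject non-positive inputs are assumptions, and the four inputs merely need to share one stress unit.

```python
import math


def cpt_features(q_t, f_s, sigma_v0, sigma_v0_eff):
    """Return (log10 F_r, log10 Q_t) for one soil element.

    q_t: corrected tip resistance, f_s: sleeve friction,
    sigma_v0: vertical total stress, sigma_v0_eff: vertical effective stress.
    All four inputs must be expressed in the same stress unit.
    """
    net = q_t - sigma_v0  # net cone resistance
    if net <= 0 or f_s <= 0 or sigma_v0_eff <= 0:
        # Assumption: readings that cannot be log-transformed are rejected.
        raise ValueError("non-positive input; log10 feature is undefined")
    f_r = 100.0 * f_s / net       # normalized friction ratio, in percent
    q_tn = net / sigma_v0_eff     # normalized tip resistance
    return math.log10(f_r), math.log10(q_tn)
```

Applying the function to every depth increment yields the point cloud that is plotted in the Robertson chart.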


To interpret the soil profile, we need to evaluate the similarity between any two soil elements in physical space and between the corresponding data points in feature space, since a section of several similar consecutive soil elements can be interpreted as a potential soil layer. More specifically, the similarity is measured based on the vertical distance between the two soil elements along depth and the relative locations of the corresponding data points in the Robertson chart. Similar soil elements are assigned the same soil state index, which is an abstract description of the soil type. Note that the index number itself has no meaning; it only indicates that all soil elements with the same index number are similar to each other. The spatial distribution of the soil state index along depth is named the spatial pattern, which indicates the physical locations of similar soil elements along depth. The statistical characteristics (i.e., mean and covariance matrix) of the data points with the same soil state index are named the statistical pattern, which indicates the average level, variation, and correlation of the two soil features corresponding to a specific soil state index. In this way, the similarity between any two soil elements can be quantitatively described in terms of the two patterns, which can be intuitively visualized in both physical space and feature space and jointly interpreted as multiple soil layers with different SBTs by using the Robertson chart. In other words, the spatial pattern describes an intuitive layer configuration of the soil profile (indicated by soil state indices, but without knowing the specific SBT of each potential layer), whereas the statistical pattern contains information regarding the mean and covariance matrix of the soil mechanical properties of the similar soil elements, and further indicates the possible SBTs of each potential layer. Combining the two patterns in the context of the Robertson chart then provides a reasonable interpretation of the soil stratification (indicated by SBTs). The two patterns are the core concepts of this work and are detailed in the following sections.

In this work, a Markov random field (MRF) model (Besag 1974) is adopted to represent the spatial pattern; the corresponding statistical pattern in feature space is modeled as a finite mixture (FM) model (McLachlan and Peel 2004). The local consistency in the spatial pattern is encoded through a neighborhood system, as we expect neighboring soil elements to be similar to each other in physical space. On the other hand, the similarity among soil elements is measured by the finite mixture density, as feature pairs of similar soil elements tend to cluster together in the feature space.

We propose a novel unsupervised learning approach for identifying soil stratification using cone penetration data. A hidden Markov random field (HMRF) model is employed to integrate the MRF model (representing the spatial pattern) with the FM model (representing the statistical pattern). The novelty of this work is that, for the first time, we not only extract the subsoil heterogeneity and identify the stratification based on the extracted spatial pattern, but also detect and interpret the corresponding statistical pattern in the context of the Robertson chart, which is familiar to practicing engineers. It will be demonstrated that taking both spatial and statistical patterns into consideration provides a better strategy for identifying soil stratification. In addition, by using real-world examples, we would like to highlight the trend from the original Bayesian framework (Cao and Wang 2012; Wang et al. 2013) to the machine learning-based Bayesian framework (e.g., the proposed method in this study) in the methodological development for interpreting site investigation data.


Proposed unsupervised learning approach

Hidden Markov random field model and mean field-like approximation

An HMRF model is a comprehensive description of both the statistical pattern in feature space and the corresponding latent field in physical space. The features and the latent field are not necessarily of the same nature. In our case, the features are two measured soil mechanical properties, $\log_{10} F_r$ and $\log_{10} Q_t$, while the latent field consists of abstract soil state labels. The features are assumed to be generated from the latent field via a family of predefined probability density functions known as the emission functions. Figure 2 shows the diagram of an HMRF model: a one-dimensional lattice with the first-order neighborhood system (Besag 1986). The latent field is a configuration of all soil states $\mathbf{x} = (x_1, x_2, x_3, \ldots, x_s)$, $x_j \in L$, where $L = \{1, 2, \ldots, K\}$ is the set of all possible soil states. For each soil element $j$, the neighbors $\partial j$ are defined as the nearest two soil elements $\{j-1, j+1\}$ (see Figure 2). The MRF–Gibbs equivalence (Besag 1974) provides an explicit formula for the local conditional probability (Besag 1986; Geman and Geman 1984):

$$P(x_j \mid \mathbf{x}_{\partial j}, \beta) = \frac{P(x_j, \mathbf{x}_{\partial j} \mid \beta)}{\sum_{x_j' \in L} P(x_j', \mathbf{x}_{\partial j} \mid \beta)} = \frac{\exp[-U(x_j, \mathbf{x}_{\partial j}; \beta)]}{\sum_{x_j' \in L} \exp[-U(x_j', \mathbf{x}_{\partial j}; \beta)]} \tag{1}$$

where $U(\cdot)$ is the local energy function. Equation (1) is the cornerstone of the classical Gibbs sampler (Geman and Geman 1984), which is usually used to generate sample realizations of a latent field within a Markov chain Monte Carlo (MCMC) framework. We adopt the widely accepted Potts model (Koller and Friedman 2009) to calculate the local energy,

$$U(x_j, \mathbf{x}_{\partial j}; \beta) = \sum_{i \in \partial j} V_{i,j}(x_i, x_j; \beta), \qquad V_{i,j}(x_i, x_j; \beta) = \begin{cases} 0 & \text{if } x_i = x_j \\ \beta & \text{if } x_i \neq x_j \end{cases}$$

where $V_{i,j}$ is the potential function, $\beta$ is referred to as the granularity coefficient, and the subscript $i$ indexes the neighboring elements of $j$. The physical meaning behind Equation (1) is the assumption that neighboring soil elements generally tend to have the same soil state. To this end, the spatial constraint is introduced by tuning the local energy function, and the strength of the spatial constraint is controlled by the granularity coefficient $\beta$: a greater $\beta$ means a stronger constraint.
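As a small illustration of Equation (1) with the Potts energy, the sketch below evaluates the local conditional probability of every state given the labels of the neighbors. The function names are hypothetical, and states are numbered 1 to K as in the text.

```python
import math


def potts_energy(state, neighbor_states, beta):
    """Local Potts energy: beta for every neighbor carrying a different state."""
    return sum(beta for n in neighbor_states if n != state)


def local_conditional(neighbor_states, n_states, beta):
    """P(x_j = l | neighbors, beta) for l = 1..n_states, following Equation (1)."""
    weights = [math.exp(-potts_energy(l, neighbor_states, beta))
               for l in range(1, n_states + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

With both neighbors in state 1 and beta = 1, state 1 receives weight 1 while every other state receives exp(-2), so the element is pulled toward the label of its neighbors; with beta = 0 the distribution is uniform and the spatial constraint vanishes.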


As mentioned above, the features $\mathbf{y}$ are assumed to be generated from a particular realization $\mathbf{x}$ and the emission functions $f(\mathbf{y}_j; \boldsymbol{\theta}_{x_j})$, where $\mathbf{y}_j$ is the feature pair ($\log_{10} F_r$, $\log_{10} Q_t$) of a specific soil element $j$ and $\boldsymbol{\theta}_{x_j}$ is the parameter of the emission function corresponding to $x_j$. Assuming, for the sake of mathematical convenience, that the observations are conditionally independent, the conditional probability of the observed features can be expressed as

$$P(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}_{l \in L}) = \prod_{j=1}^{s} P(\mathbf{y}_j \mid x_j) = \prod_{j=1}^{s} f(\mathbf{y}_j; \boldsymbol{\theta}_{x_j}) \tag{2}$$

However, in practical problems, we usually do not know the configuration of $\mathbf{x}$ in advance, hence we are more interested in the marginal probability of the observed features $\mathbf{y}$. Although directly computing the marginal probability of the entire observed feature field is intractable due to the large number of possible realizations of $\mathbf{x}$, the local marginal probability of a feature vector $\mathbf{y}_j$ (i.e., when the configuration of its neighbors is known) follows the equation below:

$$P(\mathbf{y}_j \mid \mathbf{x}_{\partial j}, \boldsymbol{\theta}_{l \in L}, \beta) = \sum_{l \in L} P(l \mid \mathbf{x}_{\partial j}, \beta)\, f(\mathbf{y}_j; \boldsymbol{\theta}_l) \tag{3}$$

In Equations (2) and (3), $l$ denotes one of the possible soil states; this index is not linked to a specific soil element (i.e., it differs from $x_j$, which indicates the soil state assigned to soil element $j$). In reality, the observed features $\mathbf{y}_j$ are spatially correlated, and hence the conditional independence assumption could be too strong. As will be demonstrated later in the synthetic examples, in the presence of moderate vertical correlation the estimate of $\boldsymbol{\theta}_l$ may be biased, but the estimated parameters are still considered to be reasonably accurate for a specific realization. If $\boldsymbol{\theta}_l = (\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)$ is the parameter set of the Gaussian component density function, where $\boldsymbol{\mu}_l$ is a vector that contains the mean of every feature for state $l \in L$ and $\boldsymbol{\Sigma}_l$ is the covariance matrix of all the features for state $l$, then the so-called Gaussian hidden Markov random field (GHMRF) model is defined by Equations (1)-(3).
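The local marginal in Equation (3) is a Gaussian mixture whose weights come from the Potts prior of Equation (1). The following Python sketch evaluates it for the two-feature case; the function names and the data layout (a list of (mean, covariance) pairs, one per state, with states numbered from 1) are assumptions made for illustration.

```python
import math


def gaussian_pdf_2d(y, mean, cov):
    """Bivariate Gaussian density; cov is given as ((a, b), (c, d))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    # Quadratic form (y - mean)^T cov^{-1} (y - mean) for a 2x2 matrix.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def potts_prior(neighbor_states, n_states, beta):
    """P(l | neighbors, beta) for l = 1..n_states (Equation 1, Potts energy)."""
    weights = [math.exp(-beta * sum(1 for n in neighbor_states if n != l))
               for l in range(1, n_states + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def local_marginal(y, neighbor_states, components, beta):
    """Equation (3): mixture density of y given the neighboring labels."""
    prior = potts_prior(neighbor_states, len(components), beta)
    return sum(p * gaussian_pdf_2d(y, mean, cov)
               for p, (mean, cov) in zip(prior, components))
```

The neighbors only change the mixing weights, not the component densities, which is why a feature pair that sits between two clusters is resolved in favor of the state of the adjacent elements.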

Equation (3) provides a way to calculate the marginal probability of the observed feature vector at a specific element given the neighboring state labels, which makes it feasible to infer both the latent field and the model parameters within a Bayesian inferential framework. More concretely, based on the principle of the mean field-like approximation (Celeux et al. 2003; Forbes and Peyrard 2003), given a good choice of configuration $\tilde{\mathbf{x}}$, which is an approximated expectation of the latent field, $\mathbf{x}_{\partial j}$ can be approximated by $\tilde{\mathbf{x}}_{\partial j}$ (the local neighborhood configuration of element $j$ in the field $\tilde{\mathbf{x}}$). Then the corresponding density distribution of the entire observed field can be approximated as

$$P(\mathbf{y} \mid \tilde{\mathbf{x}}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta) \approx \prod_{j \in S} P(\mathbf{y}_j \mid \tilde{\mathbf{x}}_{\partial j}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta) = \prod_{j \in S} \sum_{l \in L} P(l \mid \tilde{\mathbf{x}}_{\partial j}, \beta)\, f(\mathbf{y}_j; \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l) \tag{4}$$

where $S = \{1, 2, 3, \ldots, s\}$ is the set of all element indices. A reasonable candidate for $\tilde{\mathbf{x}}$ would be one that leads to a good approximation of the joint probability $P(\mathbf{x}, \mathbf{y}, \Phi)$, where $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta\}$, $\boldsymbol{\mu} = \{\boldsymbol{\mu}_l\}$, $\boldsymbol{\Sigma} = \{\boldsymbol{\Sigma}_l\}$, for all $l \in L$ (Forbes and Peyrard 2003). We recommend employing the simulated field algorithm described in Celeux et al. (2003), in which $\tilde{\mathbf{x}}$ is a realization of the conditional distribution $p(\mathbf{x} \mid \mathbf{y}, \Phi)$ given an estimate of $\Phi$, generated using a Gibbs sampler. The following section describes how to implement Equation (4) iteratively.


Probabilistic pattern extraction

In the current problem setting, both the soil state sequence $\mathbf{x}$ and the corresponding model parameters $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta\}$ are unknown. An MCMC method is employed to optimize Equation (4) by sampling $\mathbf{x}$ and $\Phi$ alternately from the two conditional a posteriori distributions $p(\mathbf{x} \mid \mathbf{y}, \Phi)$ and $p(\Phi \mid \mathbf{y}, \mathbf{x})$.

1. Simulation of the conditional random field $p(\mathbf{x} \mid \mathbf{y}, \Phi)$

Conditional on a specific setting of the model parameters $\Phi$ and a set of observations $\mathbf{y}$, $p(\mathbf{x} \mid \mathbf{y}, \Phi)$ is a Gibbs distribution (Wang et al. 2016), and the local energy is

$$U_j'(x_j, \mathbf{x}_{\partial j}; \Phi) = U_j(x_j, \mathbf{x}_{\partial j}; \beta) + U_j(\mathbf{y}_j; \boldsymbol{\mu}_{x_j}, \boldsymbol{\Sigma}_{x_j}) \tag{5}$$

The local energy is separated into two parts: 1) the MRF energy $U_j(x_j, \mathbf{x}_{\partial j}; \beta)$, which represents the prior distribution (i.e., Gibbs distribution) of the latent field $\mathbf{x}$ and can be evaluated by using the Potts model, and 2) the likelihood energy

$$U_j(\mathbf{y}_j; \boldsymbol{\mu}_{x_j}, \boldsymbol{\Sigma}_{x_j}) = \frac{1}{2} (\mathbf{y}_j - \boldsymbol{\mu}_{x_j})^T \boldsymbol{\Sigma}_{x_j}^{-1} (\mathbf{y}_j - \boldsymbol{\mu}_{x_j}) + \frac{1}{2} \log |\boldsymbol{\Sigma}_{x_j}| \tag{6}$$

By adopting Equation (5) as the measure of local energy and substituting $U_j'(x_j, \mathbf{x}_{\partial j}; \Phi)$ into Equation (1), we can calculate the local conditional probability of choosing a specific soil state given the local neighborhood system and the observations:

$$P(x_j \mid \mathbf{x}_{\partial j}, \mathbf{y}_j, \Phi) = \frac{P(x_j, \mathbf{x}_{\partial j} \mid \mathbf{y}_j, \Phi)}{\sum_{x_j' \in L} P(x_j', \mathbf{x}_{\partial j} \mid \mathbf{y}_j, \Phi)} = \frac{\exp[-U_j'(x_j, \mathbf{x}_{\partial j}; \Phi)]}{\sum_{x_j' \in L} \exp[-U_j'(x_j', \mathbf{x}_{\partial j}; \Phi)]} \tag{7}$$

Realizations $\tilde{\mathbf{x}}$ of the conditional random field $p(\mathbf{x} \mid \mathbf{y}, \Phi)$ can be iteratively simulated via the Gibbs sampler proposed by Geman and Geman (1984). For efficiency, we follow Algorithm 1, in which a parallel strategy named the chromatic sampler is used to increase the computational speed. More details on the chromatic sampler can be found in Wang et al. (2016).
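One sweep of a Gibbs sampler built on Equations (5)-(7) might look like the sketch below. It is not a reproduction of Algorithm 1: it is a minimal sequential version that only mimics the chromatic ordering by visiting even-indexed and then odd-indexed elements, which on a one-dimensional chain share no neighbors within a color and could therefore be updated in parallel. States are numbered from 0 here for list indexing, and all function names are hypothetical.

```python
import math
import random


def likelihood_energy(y, mean, cov):
    """Equation (6) for two features; cov is ((a, b), (c, d))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return 0.5 * quad + 0.5 * math.log(det)


def local_posterior(j, x, y, components, beta):
    """Equation (7): P(x_j = l | neighbors, y_j, Phi) for every state l."""
    neighbors = [x[i] for i in (j - 1, j + 1) if 0 <= i < len(x)]
    energies = []
    for l, (mean, cov) in enumerate(components):
        mrf = sum(beta for n in neighbors if n != l)  # Potts energy
        energies.append(mrf + likelihood_energy(y[j], mean, cov))
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def chromatic_gibbs_sweep(x, y, components, beta, rng):
    """Update all elements once, one color (even, then odd indices) at a time."""
    x = list(x)
    states = range(len(components))
    for color in (0, 1):
        for j in range(color, len(x), 2):
            probs = local_posterior(j, x, y, components, beta)
            x[j] = rng.choices(states, weights=probs)[0]
    return x
```

Calling `chromatic_gibbs_sweep` repeatedly with `rng = random.Random(seed)` yields a chain of label sequences that plays the role of the realizations of the latent field described above.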

2. Simulation of the model parameters from $p(\Phi \mid \mathbf{y}, \mathbf{x})$

In this step, $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta\}$ is sampled iteratively following the conditional posterior distributions:

$$\mathrm{post}(\boldsymbol{\mu} \mid \mathbf{y}, \mathbf{x}, \boldsymbol{\Sigma}, \beta) \propto \mathrm{prior}(\boldsymbol{\mu})\, L(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta) \tag{8}$$

$$\mathrm{post}(\boldsymbol{\Sigma} \mid \mathbf{y}, \mathbf{x}, \boldsymbol{\mu}, \beta) \propto \mathrm{prior}(\boldsymbol{\Sigma})\, L(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta) \tag{9}$$

$$\mathrm{post}(\beta \mid \mathbf{y}, \mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) \propto \mathrm{prior}(\beta)\, L(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta) \tag{10}$$

The likelihood function $L(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta) = \prod_{j \in S} P(\mathbf{y}_j \mid \tilde{\mathbf{x}}_{\partial j}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta)$ is evaluated by using Equation (4), where $\tilde{\mathbf{x}}_{\partial j}$ is the simulated local neighborhood configuration from step 1. To reflect the lack of prior knowledge, non-informative priors are used in this work. For all $l \in L$: 1) flat multivariate Gaussian priors are used for $\boldsymbol{\mu}_l$, with arbitrarily chosen center $\boldsymbol{\eta}_l$ and large covariance $\boldsymbol{\Xi}_l$; 2) a flat Gaussian prior is used for $\beta$, with arbitrarily chosen mean $\mu_\beta$ and large standard deviation $\sigma_\beta$; 3) the separation strategy (Barnard et al. 2000) is used to construct the prior distribution of $\boldsymbol{\Sigma}_l$. More specifically, $\boldsymbol{\Sigma}_l$ is decomposed as $\boldsymbol{\Sigma}_l = \boldsymbol{\Lambda}_l \mathbf{R}_l \boldsymbol{\Lambda}_l$, where $\boldsymbol{\Lambda}_l$ is a diagonal matrix whose $i$-th diagonal element is $\sigma_l^{ii}$ (i.e., the standard deviation of the $i$-th feature of cluster $l$), and $\mathbf{R}_l$ is the correlation matrix of cluster $l$. $\sigma_l^{ii}$ is assumed to follow a log-normal prior, $\ln(\sigma_l^{ii}) \sim N(b_l^{ii}, \xi)$, and the prior density for the correlation matrix is

$$P(\mathbf{R}_l) \propto |\mathbf{R}_l|^{-\frac{1}{2}(\nu + d + 1)} \left( \prod_{i=1}^{d} r_l^{ii} \right)^{-\nu/2}$$

where $d$ is the number of features and $r_l^{ii}$ is the $i$-th diagonal element of $\mathbf{R}_l^{-1}$. We use the notation $\boldsymbol{\Sigma}_l \sim BSS(\nu, \mathbf{b}_l, \xi)$ to denote this prior. Other choices for the prior of $\boldsymbol{\Sigma}$ are also possible, for example the hierarchical half-t prior (Huang and Wand 2013) and the inverse-Wishart distribution (Gelman et al. 2014). A comprehensive comparison study by Alvarez et al. (2014) shows that the BSS prior is the most flexible and outperforms the other priors.

To illustrate the relations among the hyperparameters, model parameters, latent field and features, a plate diagram is shown in Figure 3. The plate notation represents fixed hyperparameters with squares and random variables with circles, while gray shapes indicate known or predefined values. K denotes the total number of soil states and N is the total number of elements. Step 1 generates a realization $\tilde{\mathbf{x}}$ using the parameters from the last iteration, so all $\tilde{\mathbf{x}}_{\partial j}$ are available in step 2. Step 2 proposes candidates of $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta\}$ and accepts or rejects them according to the change in the approximated marginal log-likelihood of the observed features. The plate diagram in Figure 3 shows the intrinsic relationships among all variables in Equation (4). The widely used Metropolis-Hastings (M-H) algorithm is employed in this step. Algorithm 2 shows the pseudocode of a single M-H iteration for updating $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta\}$.
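Algorithm 2 is not reproduced here, so the sketch below should be read as a generic random-walk M-H update for a single scalar parameter such as the granularity coefficient, not as the authors' pseudocode. It assumes a symmetric Gaussian proposal, a flat Gaussian prior, and a caller-supplied function `log_likelihood` that would evaluate the logarithm of Equation (4) for a candidate value; all names are hypothetical.

```python
import math
import random


def log_gaussian_prior(value, mean, std):
    """Log-density of a Gaussian prior, up to an additive constant."""
    return -0.5 * ((value - mean) / std) ** 2


def mh_step(current, log_likelihood, prior_mean, prior_std, step, rng):
    """One random-walk Metropolis-Hastings update of a scalar parameter.

    Returns (new_value, accepted). The proposal is symmetric, so the
    acceptance ratio reduces to the ratio of unnormalized posteriors.
    """
    candidate = current + rng.gauss(0.0, step)
    log_ratio = (log_likelihood(candidate)
                 + log_gaussian_prior(candidate, prior_mean, prior_std)
                 - log_likelihood(current)
                 - log_gaussian_prior(current, prior_mean, prior_std))
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return candidate, True
    return current, False
```

In the full scheme the same accept/reject logic would be applied in turn to the means, the covariance matrices and the granularity coefficient, each with its own prior and proposal scale.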


Uncertainty quantification of soil states and mixture parameters

By iteratively running the parallel Gibbs sampler (Algorithm 1) and the Metropolis-Hastings algorithm (Algorithm 2), a series of realizations from the conditional a posteriori distributions $p(\mathbf{x} \mid \mathbf{y}, \Phi)$ and $p(\Phi \mid \mathbf{y}, \mathbf{x})$ can be generated. To quantify the uncertainty of the latent soil states $\mathbf{x}$ on a per-element basis, information entropy (Wellmann 2013; Wellmann and Regenauer-Lieb 2012) is adopted as a quantitative measure. It has the following expression:

$$H(j) = -\sum_{l \in L} P_l(j) \log P_l(j) \tag{11}$$

where $P_l(j)$ is the empirical probability of assigning label $l \in L$ to element $j$, calculated from the ensemble of $\tilde{\mathbf{x}}$. High information entropy indicates a high level of uncertainty. The maximum a posteriori (MAP) estimate of the soil state at a specific element, $x_j^*$, is obtained by taking the soil state with the highest membership (i.e., the probability of having a specific state), and the MAP estimate of the latent field, $\mathbf{x}^*$, then consists of all $x_j^*$, $j \in \{1, 2, 3, \ldots, s\}$. To quantify the uncertainty of the model parameters $\Phi$, kernel density estimation (Bowman and Azzalini 1997) with Gaussian kernels is employed to fit the simulated samples of each model parameter from step 2. The fitted densities are taken as approximations of the posterior density functions. The sample mean is chosen as the representative value of each model parameter.
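The per-element quantities in Equation (11) can be computed directly from an ensemble of sampled label sequences. The sketch below is illustrative only; the function names and the representation of the ensemble as a list of equally long label lists are assumptions.

```python
import math
from collections import Counter


def membership_probabilities(realizations, states):
    """Empirical P_l(j) for every element j from an ensemble of label sequences."""
    n = len(realizations)
    n_elements = len(realizations[0])
    probabilities = []
    for j in range(n_elements):
        counts = Counter(r[j] for r in realizations)
        probabilities.append({l: counts[l] / n for l in states})
    return probabilities


def information_entropy(p):
    """Equation (11); terms with zero probability contribute nothing."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def map_state(p):
    """State with the highest membership probability."""
    return max(p, key=p.get)
```

An element that receives the same label in every realization has zero entropy, while an element split evenly between two states has entropy log 2, which is the largest value possible with two states.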

Determining the number of soil states

In this study, the total number of soil states is determined by the Bayesian information criterion (BIC), which is generally recommended for model selection when no prior knowledge is available (Fraley and Raftery 1998; Fraley and Raftery 2002), and has the following form:

$$\mathrm{BIC} = -2\,\mathrm{loglike}(\mathbf{y}, \vartheta) + M \log(n) \tag{12}$$

where $\mathrm{loglike}(\mathbf{y}, \vartheta)$ is the maximized log-likelihood; $\vartheta$ is the MAP estimate of the model parameters, obtained using the EM algorithm (McLachlan and Peel 2004); $M$ is the number of independent parameters to be estimated; and $n$ is the number of data points. Previous studies recommend choosing the model with the minimum BIC (Fraley and Raftery 1998; Fraley and Raftery 2002; McLachlan and Basford 1988; McLachlan and Krishnan 2007). We follow the same approach to choose the optimal number of clusters. More specifically, the observed feature pairs ($\log_{10} F_r$, $\log_{10} Q_t$) of all soil elements are fitted to a family of finite Gaussian mixture models with different numbers of clusters, and the corresponding BIC is calculated for each model using the MATLAB Statistics and Machine Learning Toolbox (MathWorks 2014) or the scikit-learn package in Python (Pedregosa et al. 2011). The EM algorithm is implemented in both software packages. The number of clusters with the lowest BIC value is chosen as the optimum.
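To make Equation (12) concrete, the sketch below computes the BIC from a maximized log-likelihood and picks the number of states with the lowest value. It assumes full covariance matrices, so a K-component mixture with d features has (K - 1) + K d + K d (d + 1) / 2 free parameters; the function names are hypothetical, and the log-likelihoods are expected to come from an EM fit carried out elsewhere (for example with the toolboxes named above).

```python
import math


def n_free_parameters(n_components, n_features):
    """Free parameters of a Gaussian mixture with full covariance matrices."""
    weights = n_components - 1
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2
    return weights + means + covariances


def bic(max_log_likelihood, n_components, n_features, n_points):
    """Equation (12): -2 loglike + M log(n)."""
    m = n_free_parameters(n_components, n_features)
    return -2.0 * max_log_likelihood + m * math.log(n_points)


def select_n_states(log_likelihoods, n_features, n_points):
    """Return (best K, BIC per K); log_likelihoods maps K to its log-likelihood."""
    scores = {k: bic(ll, k, n_features, n_points)
              for k, ll in log_likelihoods.items()}
    return min(scores, key=scores.get), scores
```

For the two CPT features, going from K to K + 1 states adds six free parameters, so an extra state is only selected when it raises the log-likelihood by more than 3 log(n).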

Usually, the exact number of clusters is not known a priori. A suitable choice for this number results in a clustering in which data points have high similarity within each cluster and low similarity between clusters in the feature space (Celeux and Govaert 1995; McLachlan and Peel 2004). In certain cases, this choice may come down to subjective expert knowledge or ad hoc procedures. Several studies have addressed model selection using data-based approaches, including the Akaike information criterion (AIC) (Bozdogan 1987), the Bayesian information criterion (BIC) (Fraley and Raftery 1998), the integrated completed likelihood (ICL) criterion (Biernacki et al. 2000), a BIC approach based on mean field theory (Forbes and Peyrard 2003), and a Bayesian approach combined with Monte Carlo integration (Yuen and Mu 2011). The advantage of BIC is that it accounts for the dimension of the input space by including the number of features in the penalty (Fraley and Raftery 1998; Fraley and Raftery 2002); it is therefore more sensitive to the number of unknown parameters and the model complexity, and hence effectively avoids possible overfitting.

Because the similarity and possible clustered patterns are reflected in the feature space, we are able to choose the number of states that maximizes the likelihood of the observations without considering the spatial constraint. We employ a conventional finite Gaussian mixture (FGM) model to measure the likelihood and find the best number of components, since this avoids running the Gibbs sampler within each iterative loop and thus makes it faster to try many cases with different numbers of components.

Interpreting clusters in the Robertson chart

1. Understand the spatial pattern and the statistical pattern

At this stage, all soil elements are clustered into multiple groups in feature space; accordingly, segments of soil elements having the same state along depth are extracted. Hence both the spatial pattern and the statistical pattern are revealed. Since the two patterns are linked via the HMRF model, the number of soil segments is automatically determined based on the statistical pattern in feature space. It is worth noting that the segments in physical space and the clusters in feature space may not be in one-to-one correspondence: a cluster in feature space may consist of several segments in physical space, whereas a segment in physical space must be a subset of (or equal to) a cluster in feature space. Segments belonging to the same cluster are statistically similar to each other (i.e., they may be subsets generated by the same Gaussian distribution representing this cluster) but are located at different depths. It is also worth noting that statistical similarity does not guarantee that these segments have the same soil behavior type, as it is entirely possible that the parent cluster spans two or more SBT zones. Therefore, compared with a cluster in feature space, a segment in physical space should be considered the basic unit (i.e., a potential homogeneous soil layer) for soil profile interpretation, since the soil elements within it are not only physically grouped together but also statistically similar to each other. A simple example is depicted in Figure 4, in which all soil elements are divided into three segments in physical space but two clusters in feature space. Note that cluster 1 consists of Segment 1 and Segment 3 at different depths, whereas cluster 2 equals Segment 2.
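The relation between clusters and segments can be illustrated by splitting a label sequence into runs of identical consecutive states. The helper below is a hypothetical sketch: it takes the estimated state of every element, ordered by depth, and returns each segment as a (first index, last index, state) tuple.

```python
from itertools import groupby


def extract_segments(states):
    """Split a depth-ordered label sequence into runs of identical states."""
    segments = []
    start = 0
    for state, run in groupby(states):
        length = len(list(run))
        segments.append((start, start + length - 1, state))
        start += length
    return segments
```

A sequence such as `[1, 1, 1, 2, 2, 1, 1]` produces three segments but only two distinct states, mirroring the situation described for Figure 4 in which one cluster is made up of two separate segments.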


2. Interpret soil segments and detect layer boundaries

The Robertson chart (Robertson 1990; Robertson and Cabal 2010) directly links the feature space with the mechanical behavior of the subsoil. We consider soil segments as the basic units for identifying soil stratification. Each segment can be interpreted according to the one or more zones it occupies. The probability that a segment belongs to SBT zone n is determined as the proportion of the elements within this segment that lie in SBT zone n. Hence each segment has its own dominant SBT zone. It is worth noting that multiple soil segments having the same soil state are similar to each other only in a statistical sense; they are separated by other segments with different soil states in physical space and may possess different dominant SBT zones (e.g., Segment 1 and Segment 3 in Figure 4). Conversely, soil segments belonging to different clusters may be located near each other in feature space and hence have the same dominant SBT zone (e.g., Segment 2 and Segment 3 in Figure 4). Although they are not statistically similar to each other, they share similar mechanical properties.

A boundary is detected between any two neighboring segments. If two neighboring segments have the same dominant SBT zone, the boundary is defined as an internal boundary. If they have different dominant SBT zones, the boundary is defined as a layer boundary. An example of the two types of boundaries is shown in Figure 4, in which the dominant SBT zone of Segment 1 is 6, whereas the dominant zone of Segment 2 and Segment 3 is 5. Hence a layer boundary is detected between Segment 1 and Segment 2, and an internal boundary is detected between Segment 2 and Segment 3.
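The boundary rule can be sketched as follows, assuming the SBT zone of every element has already been looked up in the Robertson chart (that lookup is not implemented here). The function names and the input layout, one list of zone numbers per segment ordered by depth, are hypothetical.

```python
from collections import Counter


def zone_probabilities(zones):
    """Proportion of a segment's elements that fall in each SBT zone."""
    counts = Counter(zones)
    return {zone: count / len(zones) for zone, count in counts.items()}


def dominant_zone(zones):
    """SBT zone occupied by the largest share of the segment's elements."""
    return Counter(zones).most_common(1)[0][0]


def classify_boundaries(segment_zones):
    """Label each boundary between consecutive segments as 'internal' or 'layer'."""
    dominants = [dominant_zone(zones) for zones in segment_zones]
    return ["internal" if upper == lower else "layer"
            for upper, lower in zip(dominants, dominants[1:])]
```

With three segments whose dominant zones are 6, 5 and 5, as in the Figure 4 example, the first boundary is classified as a layer boundary and the second as an internal boundary.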


Implementation procedure

Figure 5 shows a flowchart of the proposed Bayesian unsupervised learning approach. To summarize, the procedure is as follows:
1. Obtain a set of CPT data and preprocess it (i.e., correction and normalization). Then convert the data to the logarithms of the normalized friction ratio, $\log_{10} F_r$, and the normalized cone tip resistance, $\log_{10} Q_t$.

2. Construct the first-order neighborhood system. Fit the feature pairs of all elements to a family of finite Gaussian mixture models with different numbers of clusters (say, from 1 to $k_{max}$) and calculate the corresponding BIC values. The optimal number of soil states is the one with the lowest BIC.

3. Determine the hyperparameters $\{\mu_\beta, \sigma_\beta, \eta_l, \Xi_l, \upsilon, b_l, \xi\}$ for the priors. Set the number of iterations $N$ and initialize the mixture density parameters $\mathbf{x}^{(0)}, \boldsymbol{\mu}_l^{(0)}, \boldsymbol{\Sigma}_l^{(0)}$ and the granularity coefficient $\beta^{(0)}$. If no prior information is available, 1) set $\beta^{(0)} = 1$; 2) initialize $\mathbf{x}^{(0)}, \boldsymbol{\mu}_l^{(0)}, \boldsymbol{\Sigma}_l^{(0)}$ with the output of the best-fitted finite Gaussian mixture model from step 2; 3) use the non-informative prior mentioned in Section 2.4, whose default setting is shown in Table 2. The predefined constant 100 in Table 2 indicates a high level of variability and represents ignorance of prior knowledge. In the authors' experience, a much greater number does not make a significant difference.

4. Draw random samples $\mathbf{x}^{(t)}, \boldsymbol{\mu}_l^{(t)}, \boldsymbol{\Sigma}_l^{(t)}, \beta^{(t)}$ from the two conditional posterior distributions $p(\mathbf{x} \mid \mathbf{y}, \Phi)$ and $p(\Phi \mid \mathbf{y}, \mathbf{x})$ iteratively via the parallel Gibbs sampler (Algorithm 1) and the M-H algorithm (Algorithm 2).

5. Calculate the membership probabilities of the different soil states for each soil element, and perform kernel density fitting for all independent parameters in $\Phi$. The MAP estimate $\mathbf{x}^*$ is determined by the highest membership probability at each soil element, and the sample mean is taken as the MAP estimator of all independent parameters in $\Phi$.

6. Extract soil segments from $\mathbf{x}^*$. Determine the SBT number for each soil element, then label each soil segment with its dominant SBT number.

7. Determine all layer boundaries and internal boundaries as described in Section 2.5. The SBT number of each layer is simply the common SBT number of all soil segments in this layer.
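As an illustration of the preprocessing in step 1, the following sketch converts one corrected CPT reading to the feature pair used throughout the procedure. It assumes the standard Robertson (1990) normalization, $Q_t = (q_t - \sigma_{v0})/\sigma'_{v0}$ and $F_r = 100\, f_s/(q_t - \sigma_{v0})$, which this section does not spell out; the function name and the numerical values are hypothetical.

```python
import math

def normalize_cpt(qt, fs, sigma_v0, sigma_v0_eff):
    """Return (log10 Fr, log10 Qt) for a single CPT reading.

    qt           corrected cone tip resistance
    fs           sleeve friction
    sigma_v0     total vertical stress
    sigma_v0_eff effective vertical stress
    All four quantities must be in the same stress unit (e.g., kPa).
    """
    q_net = qt - sigma_v0
    if q_net <= 0 or fs <= 0 or sigma_v0_eff <= 0:
        raise ValueError("logarithms require positive net resistance, friction, and stress")
    Qt = q_net / sigma_v0_eff   # normalized cone tip resistance (dimensionless)
    Fr = 100.0 * fs / q_net     # normalized friction ratio (percent)
    return math.log10(Fr), math.log10(Qt)

# Hypothetical reading: qt = 2100 kPa, fs = 20 kPa, total stress 100 kPa, effective stress 50 kPa
log_fr, log_qt = normalize_cpt(qt=2100.0, fs=20.0, sigma_v0=100.0, sigma_v0_eff=50.0)
print(log_fr, log_qt)
```

Readings with non-positive net resistance or sleeve friction cannot be mapped to the log-log feature space, so the sketch rejects them explicitly; how such readings are treated in the authors' package is not stated here.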

The proposed approach has been implemented in MATLAB R2015b and Python 3.6, and tested using various datasets including both synthetic and real-world CPT logs. Details are shown in the following sections. Interested readers may contact the first author for the MATLAB implementation or the Python package. Although the proposed method is mathematically complicated, the integrated implementation is automatic and fully unsupervised. If there is no prior information, in most cases, with the recommended default setting of hyperparameters (see Table 2), practicing engineers only need to provide project-specific CPT data with both depth coordinates and measurements as input. The output includes the two patterns in physical space and feature space as well as the identified stratification and the corresponding SBT classification results. This not only improves the practicability for engineers who are not familiar with the mathematics under the hood, but also allows geotechnical researchers who are familiar with Bayesian unsupervised learning techniques to examine the detected patterns more closely.

Model validation using synthetic examples

Description of stochastic simulation method and measure of classification quality

To validate the proposed approach and illustrate the model behavior, synthetic cases with known spatial configuration (i.e., the SBT profile and stratification) and model parameters (i.e., the means $\boldsymbol{\mu}_l$ and standard deviations $\boldsymbol{\sigma}_l$ of $\log_{10} F_r$ and $\log_{10} Q_t$, and the correlation coefficient $\rho_l$ between $\log_{10} F_r$ and $\log_{10} Q_t$) are studied. We focus on a simple soil profile with two SBT layers but three different soil states, as shown in Figure 6. In this configuration, each soil state corresponds to a segment in physical space. The $\log_{10} F_r$ and $\log_{10} Q_t$ values for the three soil states are modeled by three Gaussian random fields with respective thicknesses of 4 m, 6 m, and 5 m. As mentioned in Section 2.2, the probabilistic clustering algorithm is based on the conditional independence assumption; hence it is worthwhile to investigate the effect of spatial correlation more closely. To this end, two synthetic experiments are conducted, namely, a conditionally independent case and a vertically correlated case. We set the depth interval (i.e., the element size) of the synthetic CPT logs to 0.05 m. The features ($\log_{10} F_r$, $\log_{10} Q_t$) of each soil element in the first case are generated independently using a multivariate Gaussian random number generator with the parameters shown in Figure 6. Simulated realizations of the second case are generated in a segment-wise manner. Within each segment, all observations are considered to be spatially correlated and follow an exponential correlation function (Wang et al. 2010):

$$r_n(\mathbf{y}_i, \mathbf{y}_j) = \exp\left(-\frac{2 d_{i,j}}{\lambda_n}\right), \quad i, j \in \text{Segment } n \tag{13}$$

where $d_{i,j}$ is the distance between Element i and Element j, and $\lambda_n$ is the correlation length of Segment n. In the second case, the correlation lengths are set to $\lambda_1 = 1$ m, $\lambda_2 = 2$ m, and $\lambda_3 = 1$ m.
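To make Equation (13) concrete, the sketch below simulates one feature within a single segment. It relies on a standard property rather than on the authors' generator: on a regular depth grid, a stationary Gaussian process with exponential correlation is equivalent to a first-order autoregressive sequence with coefficient $\phi = \exp(-2\Delta z/\lambda_n)$. Only a single feature is simulated, so the cross-correlation $\rho_l$ between the two features is not reproduced; the mean and standard deviation are hypothetical placeholders, not the values in Figure 6.

```python
import math
import random

def exp_correlation(d, lam):
    """Exponential correlation of Equation (13) for separation distance d and correlation length lam."""
    return math.exp(-2.0 * abs(d) / lam)

def simulate_segment(n, dz, lam, mean, std, rng):
    """Simulate n equally spaced values of one Gaussian feature within a segment.

    Uses the AR(1) recursion z[i+1] = phi * z[i] + sqrt(1 - phi^2) * eps, which
    yields corr(z[i], z[j]) = phi^|i-j| = exp(-2 * |i-j| * dz / lam).
    """
    phi = exp_correlation(dz, lam)
    innovation_sd = math.sqrt(1.0 - phi * phi)
    z = rng.gauss(0.0, 1.0)          # stationary start: unit-variance draw
    values = [mean + std * z]
    for _ in range(n - 1):
        z = phi * z + innovation_sd * rng.gauss(0.0, 1.0)
        values.append(mean + std * z)
    return values

rng = random.Random(0)               # fixed seed for reproducibility
# A 4 m thick segment sampled every 0.05 m with a 1 m correlation length
profile = simulate_segment(n=80, dz=0.05, lam=1.0, mean=0.0, std=0.1, rng=rng)
print(len(profile), exp_correlation(0.05, 1.0))
```

With $\lambda_n = 1$ m and a 0.05 m element size, neighboring elements have a correlation of about 0.90, which illustrates how strongly the vertically correlated case departs from conditional independence.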


To measure the accuracy of the classification result, we define the misclassification ratio (MCR) as:

$$\mathrm{MCR} = \frac{\text{number of misclassified elements}}{\text{total number of elements}} \tag{14}$$

To evaluate the classification quality and convergence behavior during the sampling process, the MCR is calculated for each $\mathbf{x}^{(t)}$. To evaluate the overall classification quality, the MCR is calculated using the MAP estimate $\mathbf{x}^*$.
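Equation (14) can be evaluated as in the sketch below. Because cluster labels produced by an unsupervised method are arbitrary, the second function also reports the MCR minimized over all relabelings; this matching step is an assumption added for illustration and is not described in the text above. The function names and toy labels are hypothetical.

```python
from itertools import permutations

def mcr(true_labels, est_labels):
    """Misclassification ratio of Equation (14) for labels given in the same naming."""
    if len(true_labels) != len(est_labels):
        raise ValueError("label sequences must have the same length")
    wrong = sum(t != e for t, e in zip(true_labels, est_labels))
    return wrong / len(true_labels)

def mcr_best_relabeling(true_labels, est_labels):
    """Smallest MCR over all one-to-one renamings of the estimated labels.

    Exhaustive search over permutations; adequate for the small numbers of
    soil states considered here, but factorial in the number of labels.
    """
    labels = sorted(set(true_labels) | set(est_labels))
    best = 1.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        best = min(best, mcr(true_labels, [mapping[e] for e in est_labels]))
    return best

# Hypothetical toy profile: three segments of 4, 6, and 5 elements
truth = [0] * 4 + [1] * 6 + [2] * 5
estimate = [2] * 4 + [0] * 5 + [1] * 6   # same pattern, different label names, one boundary shifted
raw = mcr(truth, estimate)
matched = mcr_best_relabeling(truth, estimate)
print(raw, matched)
```

Without the relabeling step, a nearly perfect segmentation can appear almost entirely wrong simply because the clusters are numbered differently.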

Conditionally independent case

Figure 7 shows a simulated dataset. As the vertical correlation is not considered, this dataset perfectly satisfies the conditional independence assumption. The default setting of hyperparameters in Table 2 is adopted. It is worth mentioning that although the optimal number of clusters and non-informative priors are used by default, it is possible (and sometimes necessary) to incorporate prior knowledge, if any, into the proposed Bayesian inferential framework by manually setting a proper number of clusters and the initial values $\mathbf{x}^{(0)}, \boldsymbol{\mu}_l^{(0)}, \boldsymbol{\Sigma}_l^{(0)}, \beta^{(0)}$. In this example, the BIC is calculated for the number of clusters from 1 to 10, and the result is shown in Figure 8a. The minimum BIC value corresponds to the finite Gaussian mixture model with three clusters. After setting the number of soil states $k = 3$, 5000 MCMC iterations are performed. The extracted spatial pattern $\mathbf{x}^*$ and the statistical pattern $E(\boldsymbol{\mu}_l), E(\boldsymbol{\sigma}_l), E(\rho_l)$, $l \in L$ (i.e., the sample means of the mixture parameters) are visualized in Figure 8b-c. Note that the soil state labels are only used for identifying different soil clusters, and hence there is no one-to-one correspondence between soil states and soil segments.
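The BIC-based choice of the number of clusters can be sketched as follows, assuming the common definition $\mathrm{BIC} = -2 \ln \hat{L} + p \ln n$ (consistent with selecting the lowest value) and a full-covariance bivariate Gaussian mixture, for which $p = 6k - 1$. The maximized log-likelihoods would normally come from fitting each mixture model; the values used here are hypothetical placeholders, not results from the paper.

```python
import math

def gmm_num_params(k, dim=2):
    """Free parameters of a k-component Gaussian mixture with full covariance matrices."""
    weights = k - 1                      # mixing proportions sum to one
    means = k * dim
    covariances = k * dim * (dim + 1) // 2
    return weights + means + covariances

def bic(log_likelihood, k, n, dim=2):
    """Bayesian information criterion; lower is better under this sign convention."""
    return -2.0 * log_likelihood + gmm_num_params(k, dim) * math.log(n)

def select_k(log_likelihoods, n, dim=2):
    """Return the number of clusters with the lowest BIC, plus all BIC values."""
    scores = {k: bic(ll, k, n, dim) for k, ll in log_likelihoods.items()}
    return min(scores, key=scores.get), scores

# Hypothetical maximized log-likelihoods for k = 1..5 on n = 300 elements
hypothetical_ll = {1: -420.0, 2: -310.0, 3: -250.0, 4: -243.0, 5: -238.0}
best_k, scores = select_k(hypothetical_ll, n=300)
print(best_k)
```

The example shows the typical behavior: beyond the best model, the small gains in likelihood no longer offset the penalty of six additional parameters per cluster.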

The information entropy and the identified soil stratification are plotted together in Figure 8b. By checking the probability that each segment belongs to the various SBT zones (Figure 8d), a layer boundary is detected at 4 m and an internal boundary is identified at 10 m, which match the original profile in Figure 6 well. In this synthetic example, high information entropy occurs at the boundaries between neighboring soil segments, which is reasonable since higher uncertainty is expected at boundary locations.
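The link between the MCMC samples, the membership probabilities, the MAP labels, and the information entropy can be sketched as below. The sketch assumes that membership probabilities are estimated as label frequencies over the retained samples and that the entropy is the unnormalized Shannon entropy with natural logarithm; the exact definition used in the paper is given in an earlier section and may differ by a normalizing constant. The sample array is a hypothetical toy example.

```python
import math
from collections import Counter

def membership_probabilities(samples, n_states):
    """Per-element label frequencies over a list of sampled label sequences x^(t)."""
    n_iter = len(samples)
    n_elem = len(samples[0])
    probs = []
    for i in range(n_elem):
        counts = Counter(s[i] for s in samples)
        probs.append([counts.get(l, 0) / n_iter for l in range(n_states)])
    return probs

def map_labels(probs):
    """Label with the highest membership probability at each element."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]

def entropy(p):
    """Shannon entropy (natural log) of one membership probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Four hypothetical retained samples of a 5-element profile with two soil states
samples = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
probs = membership_probabilities(samples, n_states=2)
x_map = map_labels(probs)
H = [entropy(p) for p in probs]
print(x_map, H)
```

Only the element next to the boundary changes label between samples, so it is the only element with nonzero entropy, mirroring the entropy spikes at segment boundaries noted above.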


In feature space, the extracted statistical pattern is represented by the estimated mixture parameters with quantified uncertainty, which are shown in Table 3. Both $\boldsymbol{\mu}_l$ and $\boldsymbol{\sigma}_l$ are well inferred from the synthetic data, whereas $\rho_l$ is estimated with lower accuracy (e.g., $\rho_3$). To visualize the convergence behavior and variability of the mixture parameters, their sample realizations and the corresponding fitted kernel density functions are plotted in Figure 9. The Markov chains of $\boldsymbol{\mu}_l$ are stable and well mixed; however, those of $\boldsymbol{\sigma}_l$ and $\rho_l$ are less stable, and some traces may need more iterations to converge. This indicates that obtaining a good estimate of the higher-order moments (i.e., the standard deviation and correlation coefficient) is generally more challenging than estimating the first-order moment (i.e., the mean). The posterior distribution of $\beta$ shows significant skewness and variation. This is because the simple inferred spatial configuration of soil segments may yield comparable likelihoods of the observations over a wide range of $\beta$. Figure 10 shows the MCR curve. The overall MCR is below 3% and convergence is extremely fast in this case, as the separability of the different soil clusters (Figure 8c) is fairly good.

To examine the robustness of the proposed approach more closely, 100 synthetic datasets are analyzed. The boxplots of all parameter estimators and the histogram of MCR are shown in Figure 11, in which the true values are indicated by gray lines. Similar to the result from a single dataset, the sample estimators of $\boldsymbol{\mu}_l$ and $\boldsymbol{\sigma}_l$ are generally closer to the true values and have less variation than the sample estimators of $\rho_l$; however, the true correlation coefficients are still covered by the 50% credible interval (i.e., within the blue box). In addition, the MCR values of most synthetic datasets are smaller than 0.5%. Therefore, the performance of the proposed approach is very satisfactory under the conditional independence condition.

Vertically correlated case

In this section, we test the proposed approach in the presence of vertical correlation. Figure 12 shows a synthetic dataset. As in the conditionally independent case, the default setting is adopted (Table 2). Figure 13a shows that the lowest BIC still corresponds to the model with three clusters. This suggests that if the vertical correlation is moderate (say, $\lambda_n \le 2$ m) and the soil layer thickness is several times larger than the correlation length, the vertical correlation does not have a significant impact on the separability among clusters in feature space. After 5000 MCMC iterations, the MAP estimate of the soil states and the extracted segments are shown in Figure 13b together with the spatial distribution of information entropy and the interpreted soil stratification. The corresponding fitted three-cluster pattern in feature space is visualized in Figure 13c. For reference, Figure 13d shows the dominant SBT number of each soil segment. The estimated mixture parameters with quantified uncertainty are listed in Table 4. The high z-scores of some mixture parameters (e.g., $\boldsymbol{\mu}_1$ and $\boldsymbol{\sigma}_1$) indicate possibly biased estimation from insufficient information, owing to the ergodicity issue introduced by the joint effect of spatial correlation and limited layer depth. Summarizing the results, two observations can be made. First, although the conditional independence assumption is not fully satisfied, the identified subsoil stratification is still considerably accurate. Second, for a specific CPT log the realization may not be ergodic; hence the estimated parameters are considered to be case specific (or to represent only the local condition in real-world datasets).


Figure 14 presents the Markov chains of all mixture parameters and their corresponding fitted kernel density distributions. As in the conditionally independent case, the means are well inferred with stable and well-mixed chains; however, the traces of the higher-order moments converge more slowly and may need more iterations to yield better estimates. The MCR curve of the first 100 iterations is presented in Figure 15. Although good parameter estimation requires more computation, the segmentation of the latent field generally converges very fast and with high accuracy in this specific case.

The robustness of the proposed approach in the presence of vertical correlation is evaluated by applying it to 100 synthetic CPT logs. The results are shown in Figure 16. Compared with the conditionally independent case, the variability of all estimators is greater. The 50% credible intervals generally cover the true values. Careful inspection of Figure 16 reveals that the means and standard deviations of different soil segments are well separated from each other; however, the variability of the correlation coefficients is generally large and the boxplots overlap considerably. This may be explained by the ergodicity problem of a single dataset. The histogram of MCR shows satisfactory overall accuracy, as most of the synthetic datasets are well classified with an MCR below 5%.

Application using real CPT data

The NGES at Texas A&M University

We first apply the proposed approach to the well-known data collected at the National Geotechnical Experimentation Site (NGES) at Texas A&M University (clay site) (Briaud 2000). The same dataset has been analyzed in several previous studies, including Zhang and Tumay (1999), Wang et al. (2013), and Ching et al. (2015). This site comprises a sequence of sandy clay, clay, silty clay, and clay with silt seams. The depth of the entire CPT log is 15 m and the groundwater table is about 6 m below the ground surface. A comprehensive description of this site can be found in Briaud (2000). The data are visualized in Figure 17.

This CPT log is discretized into 296 soil elements with a 0.05 m depth interval. Assuming no prior information is available, we use the default prior setting. Figure 18a shows the BIC calculated for 1 to 15 clusters, with the best option at 7. With the number of soil states set to $k = 7$, the data are processed and the inferred mixture parameters are listed in Table 5. Ten detected soil segments are depicted in Figure 18b; these segments are then merged into five soil layers according to their dominant SBT numbers (Figure 18d). The information entropy shows significant spikes only at boundary locations, as the boundary elements may fall in the overlapping regions of multiple (≥2) clusters in feature space; the overall separability is nevertheless fairly good, as shown in Figure 18c.

For reference, Figure 19a shows the soil profile obtained from the boring log (reproduced from Ching et al. (2015)). We compare our result with the results from Ching et al. (2015), which include the WTMM method, the Bayesian method (Wang et al. 2013), the clustering method (Hegazy and Mayne 2002; Liao and Mayne 2007), and the T ratio method (Wickremesinghe and Campanella 1991). For the WTMM method (Figure 19c), the clustering method (Figure 19f), and the T ratio method (Figure 19g-h), the dominant SBT number for each soil layer is shown. For the results of Zhang and Tumay (1999) (Figure 19d), the soil type (HPS, HPC, or HPM) is shown. For the results of Wang et al. (2013), the most probable SBT number of each soil layer is shown. A more detailed description of the settings for the different approaches (Figure 19c-h) can be found in Ching et al. (2015). Our result (Figure 19b) not only agrees with the results from the other methods, but also quantifies the uncertainty of the soil classification via the information entropy. The proposed approach, the WTMM method, the clustering method, and the T ratio method (with a small window size) are all sensitive to thin layers and capable of detecting internal boundaries in a single run. The clustering method is extremely sensitive, and hence has some difficulty producing consistent layers in the depth ranges of [0, 1] m and [6.5, 8] m. The proposed approach, the Bayesian method, and the clustering method take information from both Qt and Fr. The fuzzy method operates on a normally distributed soil classification index. The WTMM and T ratio methods analyze the Ic (i.e., the SBT index (Robertson and Cabal 2010)) profile. A recent study on Ic-based probabilistic soil stratification (Cao et al. 2018) demonstrates that soil stratification can also be identified reasonably well from the Ic profile. Moreover, that study explicitly quantifies the identification uncertainty of soil stratification, specifically the uncertainty in soil layer boundaries, using the Ic profile under a general Bayesian framework. Hence Ic-based probabilistic soil stratification methods can be alternatives to the proposed method. The clustering method only considers the similarity in feature space. The Bayesian method takes both spatial and statistical similarity into consideration. However, the primary layers (the most probable stratification result as shown in Figure 19) and possible internal layers (not emphasized in this example) are detected in different runs by gradually zooming in on local differences with improved resolution. Each run requires solving high-dimensional integrals and a high-dimensional non-convex optimization problem. Thus, as mentioned in Ching et al. (2015), a stratification problem with many potential layers (especially with many primary layers) can be computationally challenging. The proposed approach detects both primary layers and internal layers within a single run. Moreover, equipped with the parallel Gibbs sampler and explicit expressions for calculating the likelihood function, the sampling process of the proposed approach is efficient and converges very fast. The following example provides further evidence of the computational efficiency of the proposed method compared with the Bayesian method.
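For readers unfamiliar with the SBT index analyzed by the WTMM and T ratio methods, the sketch below computes it from the same feature pair used in this paper. It assumes the widely used expression $I_c = \sqrt{(3.47 - \log_{10} Q_t)^2 + (\log_{10} F_r + 1.22)^2}$ and the commonly quoted $I_c$ thresholds for approximating SBT zones 2 to 7; neither is stated in this section, and the circular $I_c$ contours only approximate the zone boundaries of the Robertson chart. Function names are hypothetical.

```python
import math

def sbt_index(log_qt, log_fr):
    """Soil behavior type index Ic from log10(Qt) and log10(Fr), with Fr in percent."""
    return math.sqrt((3.47 - log_qt) ** 2 + (log_fr + 1.22) ** 2)

def approximate_zone(ic):
    """Approximate SBT zone (2 to 7) from Ic; zones 1, 8, and 9 are not identified by Ic."""
    if ic < 1.31:
        return 7
    if ic < 2.05:
        return 6
    if ic < 2.60:
        return 5
    if ic < 2.95:
        return 4
    if ic < 3.60:
        return 3
    return 2

# Hypothetical element with Qt = 40 and Fr = 1%
ic = sbt_index(math.log10(40.0), math.log10(1.0))
zone = approximate_zone(ic)
print(round(ic, 2), zone)
```

Collapsing the two features into a single index is what allows one-dimensional signal methods to be applied to the profile, at the cost of discarding the direction in feature space in which two elements differ.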

Lukang, Taiwan case

To further compare the proposed method with the WTMM method, the Bayesian method, the clustering method, and the T ratio method, a CPT dataset collected in Lukang Township in Changhua County (Taiwan) is used. The same dataset has also been analyzed and published in Ching et al. (2015). The site is located on reclaimed land; the entire depth is around 40 m and the groundwater table is approximately 2 m below the current ground surface. The first 4~5 m is artificial hydraulic fill. The profiles of $\log_{10} F_r$ and $\log_{10} Q_t$ as well as the data points on the Robertson chart are shown in Figure 20.

The CPT log is discretized into 808 soil elements with a 0.05 m depth interval. Five clusters are found to be optimal (Figure 21a). The estimated mixture parameters are shown in Table 6 and visualized in Figure 21c. The soil segments and the corresponding identified stratification are shown in Figure 21b. Clusters 1, 2, and 4 overlap considerably in feature space and are also mixed together in physical space. This increases the uncertainty and causes frequent fluctuation of the information entropy from approximately 17 m to 40 m; hence numerous potential thin layers are detected within this depth range. Overall, the stratification identified by the proposed approach is generally consistent with the boring log in Figure 22a. As mentioned in Ching et al. (2015), when applying the Bayesian method, the CPT log was divided into two chunks (0~20 m and 20~40 m) because of the computational cost. There is no such problem when applying the WTMM method or the proposed approach, since these two approaches do not need to solve high-dimensional integrals or high-dimensional non-convex optimization problems multiple times. The results from the first three methods (i.e., the proposed approach, the WTMM method, and the Bayesian method) generally agree with each other, except that the proposed approach identifies all the potential thin layers, which are identified only partially by the WTMM method and only partially by the Bayesian method. In addition, three thin layers having SBT 4 are detected by the proposed approach, whereas the WTMM method and the Bayesian method do not consider them primary layers. The information entropy indicates that the inferred soil states of these thin layers are highly uncertain, which means the evidence supporting the existence of these thin layers is not conclusive. The performance of the other two methods (i.e., the clustering method and the T ratio method) is generally not satisfactory, as both need a fixed scale parameter (i.e., the number of clusters or the window size); hence a trade-off between capturing fine details and avoiding spurious layer boundaries always exists (Ching et al. 2015). The WTMM method is multiscale in nature and its result is controlled by the interpretation of the WTMM ridges. By checking the L and M parameters that characterize a WTMM ridge, it essentially transforms a soil profile interpretation problem into a linear discriminant analysis in a two-dimensional L-M space. A logistic regression is performed using L-M points from fifty manually interpreted real CPT logs to obtain a binary (i.e., jump/not jump) classifier. Therefore, the WTMM method can be considered a supervised method that needs considerable training data from manual interpretation. In contrast, the Bayesian method, the clustering method, and the proposed approach are unsupervised learning methods that do not require training information.

From this example, it can be seen that one potential issue of the proposed method is the possible misinterpretation of very thin soil layers. Conceptually, if an identified soil layer is very thin (say, containing fewer than 10 consecutive CPT data points), the statistical uncertainty will be significant. Under this condition, the existence of the potential thin layers is difficult to justify owing to the high level of information entropy. Moreover, we have modeled the feature pairs as correlated random variables rather than as a vector random field. This idealized assumption may result in biased estimation of both spatial and statistical patterns owing to the ergodicity problem. Besides, even with more advanced and realistic model assumptions, recent studies by Ching and co-workers (Ching and Phoon 2017; Ching et al. 2017) have shown that there are major identifiability issues in random field parameters. Therefore, although the integrated implementation is automatic and fully unsupervised, the performance of the proposed method in identifying very thin layers should not be overestimated, and more thorough investigation of this point is still needed.

Summary and conclusions

In this work, we present a Bayesian unsupervised learning approach for identifying soil stratification using cone penetration data. The proposed approach, in essence, consists of two parts: 1) a pattern detection approach using a Bayesian inferential framework, and 2) a pattern interpretation protocol using the Robertson chart. The first part is the mathematical core of the proposed approach, which infers both the spatial pattern in physical space and the statistical pattern in feature space from the input dataset; the second part converts the abstract patterns into intuitive spatial configurations of multiple soil layers with different soil behavior types. The advantages of the proposed approach include probabilistic soil classification and the simultaneous consideration of the spatial consistency and statistical similarity of soil elements. Although the proposed approach is mathematically sophisticated, the integrated implementation is automatic and fully unsupervised.


The proposed approach has been validated by two numerical studies. The validation results indicate that the proposed approach can accurately extract both the spatial pattern and the statistical pattern under the conditional independence condition. In the presence of moderate vertical correlation (e.g., $\lambda_n \le 2$ m), which is common in real-world cases (Phoon and Kulhawy 1999), biased estimation may occur owing to the ergodicity issue introduced by the joint effect of vertical correlation and limited layer depth. In this case, the estimated parameters are considered to be reasonably accurate for describing a given realization. The identified soil stratification is still considerably accurate, as neither the local consistency in physical space nor the statistical separability in feature space is compromised too much.

The proposed approach has been applied to two real-world cases and compared with five different methods. Four merits of the proposed approach can be identified: 1) it provides uncertainty quantification of the soil stratification; 2) similar to the WTMM method and the Bayesian method, it is sufficiently sensitive to thin layers and yet does not detect spurious layer boundaries; 3) similar to the Bayesian method, it operates on the physical space and the feature space simultaneously; 4) the computational cost is higher than that of the WTMM method but considerably lower than that of the Bayesian method, and the classification process generally converges very fast. The last merit results from formulating the CPT-based soil stratification problem as an “unsupervised learning” problem under the machine learning-based Bayesian framework, so that advanced probabilistic models (e.g., the HMRF model) and algorithms (e.g., the chromatic sampler) can be integrated and applied to solve the problem. It has been demonstrated that the trend from the original Bayesian framework to the machine learning-based Bayesian framework is very promising for interpreting geotechnical site investigation data.


However, the proposed approach has the following potential limitations. First, the conditional independence assumption may be too strong, as a CPT sounding record generally exhibits vertical correlation; second, although the extracted soil segments in physical space and the corresponding clustered pattern in feature space have explicit statistical meaning, they may be less intuitive for practicing engineers to fully understand; third, the performance of the proposed method in identifying very thin layers may be less satisfactory, and more thorough investigation of this point is still needed.


Acknowledgement

The authors would like to thank Dr. J. Ching for providing the source code of the WTMM method and the dataset of the Lukang case. The authors would also like to thank Tianqi Zhang, M.Sc., for conducting part of the coding work in implementing the developed method in Python 3.6. The editors and two anonymous reviewers are greatly appreciated for their constructive comments, which have helped to improve the paper significantly.

sio
631 References

er
tv
632 Alvarez, I., Niemi, J., and Simpson, M. 2014. Bayesian inference for a covariance matrix. arXiv

in
633 preprint arXiv:1408.4050.

pr
634 Barnard, J., McCulloch, R., and Meng, X.-L. 2000. Modeling covariance matrices in terms of
635 standard deviations and correlations, with application to shrinkage. Statistica Sinica:

e-
636 1281-1311.

pr
637 Besag, J. 1974. Spatial interaction and the statistical analysis of lattice systems. Journal of the


638 Royal Statistical Society. Series B (Methodological), 36(2): 192-236.

rs
639 Besag, J. 1986. On the statistical analysis of dirty pictures. Journal of the Royal Statistical

ho
640 Society, 48(3): 259-302.

ut
641 Biernacki, C., Celeux, G., and Govaert, G. 2000. Assessing a mixture model for clustering with

(a
642 the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine
643 Intelligence, 22(7): 719-725.
644 al
Bowman, A.W., and Azzalini, A. 1997. Applied smoothing techniques for data analysis: the
kernel approach with S-Plus illustrations. OUP Oxford.
Bozdogan, H. 1987. Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3): 345-370.
Briaud, J.-L. 2000. The national geotechnical experimentation sites at Texas A&M University: clay and sand. Geotechnical Special Publication: 26-51.
Cao, Z.-J., Zheng, S., Li, D., and Phoon, K.-K. 2018. Bayesian Identification of Soil Stratigraphy based on Soil Behaviour Type Index. Canadian Geotechnical Journal, (ja).
Cao, Z., and Wang, Y. 2012. Bayesian approach for probabilistic site characterization using cone penetration tests. Journal of Geotechnical and Geoenvironmental Engineering, 139(2): 267-276.
Celeux, G., and Govaert, G. 1995. Gaussian parsimonious clustering models. Pattern Recognition, 28(5): 781-793.
Celeux, G., Forbes, F., and Peyrard, N. 2003. EM procedures using mean field-like approximations for Markov model-based image segmentation. Pattern Recognition, 36(1): 131-144.
Ching, J., and Phoon, K.-K. 2017. Characterizing uncertain site-specific trend function by sparse Bayesian learning. Journal of Engineering Mechanics, 143(7): 04017028.
Ching, J., Wang, J.-S., Juang, C.H., and Ku, C.-S. 2015. Cone penetration test (CPT)-based stratigraphic profiling using the wavelet transform modulus maxima method. Canadian Geotechnical Journal, 52(12): 1993-2007.
Ching, J.Y., Phoon, K.K., Beck, J.L., and Huang, Y. 2017. On the identification of geotechnical site-specific trend functions. ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part A: Civil Engineering, 3(4): 04017021.
Das, S.K., and Basudhar, P.K. 2009. Utilization of self-organizing map and fuzzy clustering for site characterization using piezocone data. Computers and Geotechnics, 36(1): 241-248.
Depina, I., Le, T.M.H., Eiksund, G., and Strøm, P. 2016. Cone penetration data classification with Bayesian Mixture Analysis. Georisk: Assessment and Management of Risk for Engineered Systems and Geohazards, 10(1): 27-41.
Forbes, F., and Peyrard, N. 2003. Hidden Markov random field model selection criteria based on mean field-like approximations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9): 1089-1101.
Fraley, C., and Raftery, A.E. 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8): 578-588.
Fraley, C., and Raftery, A.E. 2002. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458): 611-631.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., and Rubin, D.B. 2014. Bayesian data analysis. CRC Press, Boca Raton, FL.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6): 721-741.
Hegazy, Y.A., and Mayne, P.W. 2002. Objective site characterization using clustering of piezocone data. Journal of Geotechnical and Geoenvironmental Engineering, 128(12): 986-996.
Huang, A., and Wand, M.P. 2013. Simple marginally noninformative prior distributions for covariance matrices. Bayesian Analysis, 8(2): 439-452.
Koller, D., and Friedman, N. 2009. Probabilistic graphical models: principles and techniques. MIT Press.
Liao, T., and Mayne, P. 2007. Stratigraphic delineation by three-dimensional clustering of piezocone data. Georisk, 1(2): 102-119.
Lunne, T., Robertson, P., and Powell, J. 1997. Cone penetration testing. Geotechnical Practice.
MathWorks. 2014. Statistics and Machine Learning Toolbox.
McLachlan, G., and Peel, D. 2004. Finite mixture models. John Wiley & Sons, Hoboken, N.J.
McLachlan, G.J., and Basford, K.E. 1988. Mixture models. Inference and applications to clustering. Statistics: Textbooks and Monographs, New York: Dekker, 1988, 1.
McLachlan, G.J., and Krishnan, T. 2007. The EM algorithm and extensions. Wiley-Interscience, New York.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., and Dubourg, V. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct): 2825-2830.
Phoon, K.-K., and Kulhawy, F.H. 1999. Characterization of geotechnical variability. Canadian Geotechnical Journal, 36(4): 612-624.
Phoon, K.-K., Quek, S.-T., and An, P. 2003. Identification of statistically homogeneous soil layers using modified Bartlett statistics. Journal of Geotechnical and Geoenvironmental Engineering, 129(7): 649-659.
Robertson, P. 1990. Soil classification using the cone penetration test. Canadian Geotechnical Journal, 27(1): 151-158.
Robertson, P. 2009. Interpretation of cone penetration tests - a unified approach. Canadian Geotechnical Journal, 46(11): 1337-1355.
Robertson, P., and Cabal, K. 2010. Guide to cone penetration testing for geotechnical engineering. Gregg Drilling and Testing Inc., USA: 6-15.
Wang, H., Wellmann, J.F., Li, Z., Wang, X., and Liang, R.Y. 2016. A Segmentation Approach for Stochastic Geological Modeling Using Hidden Markov Random Fields. Mathematical Geosciences: 1-33.
Wang, Y., Au, S.-K., and Cao, Z. 2010. Bayesian approach for probabilistic characterization of sand friction angles. Engineering Geology, 114(3): 354-363.
Wang, Y., Huang, K., and Cao, Z. 2013. Probabilistic identification of underground soil stratification using cone penetration tests. Canadian Geotechnical Journal, 50(7): 766-776.
Wellmann, J.F. 2013. Information Theory for Correlation Analysis and Estimation of Uncertainty Reduction in Maps and Models. Entropy, 15(4): 1464-1485.
Wellmann, J.F., and Regenauer-Lieb, K. 2012. Uncertainties have a meaning: Information entropy as a quality measure for 3-D geological models. Tectonophysics, 526: 207-216.
Wickremesinghe, D., and Campanella, R. 1991. Statistical methods for soil layer boundary location using the cone penetration test. Proc. ICASP6, Mexico City, 2: 636-643.
Yuen, K.V., and Mu, H.Q. 2011. Peak ground acceleration estimation by linear and nonlinear models with reduced order Monte Carlo simulation. Computer-Aided Civil and Infrastructure Engineering, 26(1): 30-47.
Zhang, Z., and Tumay, M.T. 1999. Statistical to fuzzy approach toward CPT soil classification. Journal of Geotechnical and Geoenvironmental Engineering, 125(3): 179-186.

Algorithms

Algorithm 1: Parallel Gibbs sampler

Input: 2-colored MRF (Wang et al. 2016), $y$, $\Phi^{(t-1)} = (\mu^{(t-1)}, \Sigma^{(t-1)}, \beta^{(t-1)})$, $x^{(t-1)}$
1  for the 2 colors $\kappa_i : i \in \{1, 2\}$ do
2      parafor all elements in the $i$-th color, $j \in S_{\kappa_i}$
3          calculate the local a posteriori distribution $p(X_j \mid x_{\partial j}^{(t-1)}, y_j; \mu^{(t-1)}, \Sigma^{(t-1)}, \beta^{(t-1)})$;
4          draw candidate $x_j^{(t)} \sim p(X_j \mid x_{\partial j}^{(t-1)}, y_j; \mu^{(t-1)}, \Sigma^{(t-1)}, \beta^{(t-1)})$;
5      end parafor
6  end for
Return $x^{(t)}$

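The two-color sweep of Algorithm 1 can be illustrated with a minimal sketch for a one-dimensional chain. Several things here are assumptions rather than the paper's implementation: the function names `log_gauss2` and `gibbs_sweep` are hypothetical, the neighborhood is taken to be the sites directly above and below, the prior is taken to be a Potts-type term that adds `beta` to the log-probability for every neighbor sharing the candidate label, and the `parafor` loop is written serially. In this sketch the second color sees the labels just updated for the first color, which is the usual chromatic Gibbs convention.

```python
import math
import random


def log_gauss2(y, mu, cov):
    """Log-density of a bivariate normal N(mu, cov) evaluated at point y."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = y[0] - mu[0], y[1] - mu[1]
    # Quadratic form (y - mu)^T cov^{-1} (y - mu) via the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * (quad + math.log(det)) - math.log(2.0 * math.pi)


def gibbs_sweep(x, y, mus, covs, beta, rng):
    """One two-color Gibbs sweep over a 1-D chain; returns a new label list."""
    x = list(x)
    n, k = len(x), len(mus)
    for colour in (0, 1):  # even-indexed sites first, then odd-indexed sites
        # Sites of one color share no neighbors, so this loop could run in
        # parallel; it is serial here for clarity.
        for j in range(colour, n, 2):
            nbrs = [x[i] for i in (j - 1, j + 1) if 0 <= i < n]
            logp = [
                log_gauss2(y[j], mus[l], covs[l])
                + beta * sum(1 for v in nbrs if v == l)  # assumed Potts term
                for l in range(k)
            ]
            m = max(logp)
            weights = [math.exp(v - m) for v in logp]  # stabilised, unnormalised
            x[j] = rng.choices(range(k), weights=weights)[0]
    return x
```

A call such as `gibbs_sweep(x_prev, y, mus, covs, beta, random.Random(0))` would then play the role of one pass through steps 1 to 6, with `y` a list of `(log10 Fr, log10 Qt)` pairs ordered by depth.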
Algorithm 2: Bayesian parameter estimation using M-H algorithm

Input: $y$, $\Phi^{(t-1)} = (\mu^{(t-1)}, \Sigma^{(t-1)}, \beta^{(t-1)})$, $x^{(t)}$
1   for the label $l \in \{1, 2, \ldots, k\}$ do
2       propose $\mu_l^{*} = \mu_l^{(t-1)} + \varepsilon_{jump}$, where $\varepsilon_{jump} \sim N(0, \sigma_{jump1})$;
3       compare $\mu_l^{(t-1)}$ and $\mu_l^{*}$ by evaluating Equation (8);
4       accept/reject $\mu_l^{*}$ and update $\mu_l^{(t)}$;
5   end for
6   for the label $l \in \{1, 2, \ldots, k\}$ do
7       eigen decomposition $\Sigma_l^{(t-1)} = V_l^{(t-1)} D_l^{(t-1)} {V_l^{(t-1)}}^{T}$;
8       propose $\Sigma_l^{*} = V_l^{*} D_l^{*} {V_l^{*}}^{T}$, with $V_l^{*} = R(\phi_{jump}) V_l^{(t-1)}$ and $D_l^{*} = D_l^{(t-1)} + \mathrm{diag}(\gamma_{jump})$, where $\phi_{jump} \sim N(0, \sigma_{jump2})$ and $\gamma_{jump} \sim N(0, \sigma_{jump3})$;
9       compare $\Sigma_l^{*}$ and $\Sigma_l^{(t-1)}$ by evaluating Equation (9);
10      accept/reject $\Sigma_l^{*}$ and update $\Sigma_l^{(t)}$;
11  end for
12  propose $\beta^{*} = \beta^{(t-1)} + \delta_{jump}$, where $\delta_{jump} \sim N(0, \sigma_{jump4})$;
13  compare $\beta^{*}$ and $\beta^{(t-1)}$ by evaluating Equation (10);
14  accept/reject $\beta^{*}$ and update $\beta^{(t)}$;
Return $\Phi^{(t)} = (\mu^{(t)}, \Sigma^{(t)}, \beta^{(t)})$

Note: $V_l$ is an orthogonal matrix and $D_l$ is a diagonal matrix whose diagonal entries are the eigenvalues of $\Sigma_l$. $R(\phi_{jump})$ is a rotation matrix, defined as
$$R(\phi_{jump}) = \begin{bmatrix} \cos(\phi_{jump}) & -\sin(\phi_{jump}) \\ \sin(\phi_{jump}) & \cos(\phi_{jump}) \end{bmatrix}$$

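Steps 7 to 10 of Algorithm 2 can be sketched for the two-feature case as follows. This is an illustration under stated assumptions, not the authors' code: the helper names (`eig2`, `compose`, `propose_cov`, `mh_step_cov`) are hypothetical, the jump parameters are interpreted as standard deviations, the proposal is treated as symmetric, and `log_post` is a user-supplied stand-in for the quantity compared in Equation (9), which is not reproduced here. Candidates with a non-positive eigenvalue are rejected outright.

```python
import math
import random


def eig2(cov):
    """Return (theta, (d1, d2)) with cov = R(theta) diag(d1, d2) R(theta)^T
    for a symmetric 2x2 matrix."""
    (a, b), (_, c) = cov
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    d1 = a * cs * cs + 2.0 * b * cs * sn + c * sn * sn
    d2 = a * sn * sn - 2.0 * b * cs * sn + c * cs * cs
    return theta, (d1, d2)


def compose(theta, d):
    """Rebuild R(theta) diag(d) R(theta)^T as a nested tuple."""
    cs, sn = math.cos(theta), math.sin(theta)
    a = d[0] * cs * cs + d[1] * sn * sn
    b = (d[0] - d[1]) * cs * sn
    c = d[0] * sn * sn + d[1] * cs * cs
    return ((a, b), (b, c))


def propose_cov(cov, s_rot, s_eig, rng):
    """Random-walk proposal: rotate the eigenvectors, perturb the eigenvalues."""
    theta, d = eig2(cov)
    phi = rng.gauss(0.0, s_rot)  # rotation angle; in 2-D, R(phi) R(theta) = R(theta + phi)
    d_new = tuple(v + rng.gauss(0.0, s_eig) for v in d)
    if min(d_new) <= 0.0:
        return None  # candidate is not positive definite
    return compose(theta + phi, d_new)


def mh_step_cov(cov, log_post, s_rot, s_eig, rng):
    """One accept/reject update of a covariance matrix."""
    cand = propose_cov(cov, s_rot, s_eig, rng)
    if cand is None:
        return cov
    diff = log_post(cand) - log_post(cov)
    accept = rng.random() < math.exp(min(0.0, diff))
    return cand if accept else cov
```

Working in the angle and eigenvalue parametrization keeps every accepted candidate symmetric and positive definite by construction, which is presumably the motivation for proposing through the eigen decomposition rather than perturbing the matrix entries directly.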
Table 1. Description of soil types in Robertson's soil classification chart (after Robertson 1990)

Zone  Soil description
1     Sensitive, fine-grained
2     Organic soils (peats)
3     Clays (clay to silty clay)
4     Silt mixtures (clayey silt to silty clay)
5     Sand mixtures (silty sand to sandy silt)
6     Sands (clean sand to silty sand)
7     Gravelly sand to sand
8     Very stiff sand to clayey sand
9     Very stiff, fine-grained

Table 2. Default setting for hyperparameters

Parameter  Value
η_l        μ_l^(0)
Ξ_l        100 * I_d
υ          d + 1
b_l        log(σ_l^(0))
ξ          100
μ_β        β^(0)
σ_β        100

Note: d is the number of features; I_d is a d-dimensional identity matrix.

Table 3. Estimated parameters of the conditional independent case

Segment  Soil   Parameter  True value          Mean                Std.                |z-score|
ID       state             log10 Fr  log10 Qt  log10 Fr  log10 Qt  log10 Fr  log10 Qt  log10 Fr  log10 Qt
1        1      μ1         -0.0969   2.0308    -0.1003   1.9868    0.0107    0.0285    0.3170    1.5451
                σ1         0.1       0.25      0.0993    0.2517    0.0053    0.0067    0.1285    0.2559
                ρ1         0.3                 0.4385              0.0767              1.8063
2        2      μ2         -0.3010   1.3046    -0.2969   1.2962    0.0061    0.0096    0.6712    0.8657
                σ2         0.07      0.1       0.0712    0.1073    0.0036    0.0070    0.3327    1.0389
                ρ2         0.2                 0.2991              0.0976              1.0152
3        3      μ3         -0.0458   1.4862    -0.0324   1.4892    0.0107    0.0100    1.2463    0.2977
                σ3         0.1       0.1       0.1081    0.1038    0.0068    0.0065    1.1884    0.5918
                ρ3         0.5                 0.4021              0.0436              2.2430
                β          N/A                 80.68               59.87               N/A

Note: |z-score| = |(True value - Mean)/Std.|. Rows for ρ and β report a single value per column group.

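The |z-score| column of Table 3 follows directly from the formula in the table note. A small worked check for the first row is given below; the function name `abs_z` is hypothetical, and because the tabulated mean and standard deviation are rounded, the recomputed value agrees with the tabulated 0.3170 only to about two decimal places.

```python
def abs_z(true_value, mean, std):
    """|z-score| = |(True value - Mean) / Std.|, as in the note to Table 3."""
    return abs((true_value - mean) / std)


# Mean of log10 Fr for soil state 1 (first row of Table 3).
z_mu1_fr = abs_z(-0.0969, -0.1003, 0.0107)

# Correlation coefficient for soil state 3 (largest |z-score| in Table 3).
z_rho3 = abs_z(0.5, 0.4021, 0.0436)
```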
Table 4. Estimated parameters of the vertically correlated case

Segment  Soil   Parameter  True value          Mean                Std.                |z-score|
ID       state             log10 Fr  log10 Qt  log10 Fr  log10 Qt  log10 Fr  log10 Qt  log10 Fr  log10 Qt
1        2      μ2         -0.0969   2.0308    -0.0484   2.0307    0.0109    0.0253    4.4457    0.0034
                σ2         0.1       0.25      0.0951    0.2110    0.0064    0.0163    0.7649    2.3847
                ρ2         0.3                 0.3378              0.1040              0.3629
2        1      μ1         -0.3010   1.3046    -0.3325   1.2701    0.0045    0.0062    7.0421    5.5543
                σ1         0.07      0.1       0.0534    0.0660    0.0028    0.0023    5.8903    14.6243
                ρ1         0.2                 -0.1073             0.0781              3.9339
3        3      μ3         -0.0458   1.4862    -0.0290   1.5354    0.0095    0.0084    1.7741    5.8881
                σ3         0.1       0.1       0.0937    0.0834    0.0050    0.0040    1.2492    4.1789
                ρ3         0.5                 0.0657              0.0766              5.6678
                β          N/A                 81.09               57.96               N/A

Table 5. Estimated parameters of the NGES case

Soil   Parameter  Mean                Std.
state             log10 Fr  log10 Qt  log10 Fr  log10 Qt
1      μ1         0.30      1.45      0.024     0.063
       σ1         0.08      0.21      0.010     0.0062
       ρ1         -0.87               0.036
2      μ2         0.34      1.79      0.0085    0.018
       σ2         0.051     0.096     0.0028    0.0039
       ρ2         0.21                0.16
3      μ3         0.42      2.37      0.043     0.052
       σ3         0.19      0.21      0.0090    0.0076
       ρ3         -0.95               0.0028
4      μ4         0.47      1.60      0.0075    0.0088
       σ4         0.07      0.081     0.0037    0.0037
       ρ4         -0.71               0.016
5      μ5         0.63      1.05      0.0063    0.027
       σ5         0.035     0.15      0.0017    0.0034
       ρ5         0.25                0.15
6      μ6         0.73      1.44      0.0061    0.0081
       σ6         0.052     0.059     0.0029    0.0029
       ρ6         -0.55               0.065
7      μ7         0.78      1.69      0.0092    0.040
       σ7         0.038     0.16      0.0025    0.0035
       ρ7         0.0018              0.27
       β          22.58               9.22

Table 6. Estimated parameters of the Lukang case

Soil   Parameter  Mean                Std.
state             log10 Fr  log10 Qt  log10 Fr  log10 Qt
1      μ1         0.48      0.96      0.025     0.030
       σ1         0.21      0.25      0.0094    0.0060
       ρ1         -0.82               0.0075
2      μ2         -0.28     1.75      0.0058    0.0059
       σ2         0.12      0.11      0.0060    0.0048
       ρ2         -0.48               0.020
3      μ3         0.43      0.51      0.020     0.017
       σ3         0.18      0.15      0.0068    0.0081
       ρ3         0.22                0.074
4      μ4         -0.066    1.44      0.029     0.019
       σ4         0.28      0.16      0.0054    0.0068
       ρ4         -0.57               0.038
5      μ5         -0.20     2.01      0.033     0.043
       σ5         0.18      0.27      0.014     0.011
       ρ5         0.82                0.021
       β          4.44                0.61

[Figure 1 plot: normalized friction ratio Fr (horizontal axis, 10^-1 to 10^1) against normalized tip resistance Qt (vertical axis, 10^0 to 10^3), both on log scales, with zones 1 to 9 marked.]

Figure 1. Robertson soil classification chart (after Robertson 1990 and Wang et al. 2013)

[Figure 2 diagram: a latent field x1, x2, ..., xs arranged along depth in physical space, with a local neighborhood system linking adjacent nodes; each xi is paired with an observation yi = (log10 Fri, log10 Qti) in feature space, where the observations are grouped into cluster 1 and cluster 2.]

Figure 2. Sketch diagram of a one-dimensional hidden Markov random field model for clustering CPT data

[Figure 3 diagram: plate notation with plates labeled N and K.]

Figure 3. Plate diagram for the proposed hidden Markov random field model.

[Figure 4 diagram: left, the spatial pattern along depth with Segment 1, Segment 2, and Segment 3; right, the statistical pattern with {Segment 1, Segment 3} forming cluster 1 and {Segment 2} forming cluster 2; the legend distinguishes layer boundaries from internal boundaries.]

Figure 4. Illustration of spatial pattern and statistical pattern of a simple configuration, and corresponding stratification.
[Figure 5 flowchart steps:
(1) Load log10 Fr, log10 Qt, and depth.
(2) Construct the neighborhood system using depth coordinates; determine the number of clusters k using BIC.
(3) Define hyperparameters, and set the number of iterations N and the initial values of the model parameters.
(4) Draw samples using the parallel Gibbs sampler and the M-H algorithm iteratively.
(5) Quantify uncertainty using the samples and report the MAP estimate.
(6) Determine the SBT number for each extracted soil segment.
(7) Determine all layer boundaries, internal boundaries, and SBT numbers for each layer.]

Figure 5. Flowchart for the proposed Bayesian unsupervised approach.

[Figure 6 diagram: Segment 1 (4 m, SBT 6), a layer boundary, Segment 2 (6 m, SBT 5), an internal boundary, and Segment 3 (5 m).]

Figure 6. Synthetic soil profile with two SBT layers but three soil segments corresponding to three soil states.
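The distinction between a layer boundary and an internal boundary in Figure 6 can be written down as a short rule: an interface between two consecutive segments is a layer boundary when their dominant SBT numbers differ, and an internal boundary when they are the same. The sketch below applies that rule to the synthetic profile. The function name `classify_boundaries` is hypothetical, and the reading of the figure (thicknesses of 4 m, 6 m, and 5 m, with Segments 2 and 3 both in SBT 5) is an assumption based on the caption and the labels in the diagram.

```python
def classify_boundaries(thicknesses, sbt):
    """Return (depth, kind) for each interface between consecutive segments.

    kind is "layer" when the dominant SBT number changes across the
    interface and "internal" when it stays the same.
    """
    out, depth = [], 0.0
    for i in range(len(sbt) - 1):
        depth += thicknesses[i]
        kind = "layer" if sbt[i] != sbt[i + 1] else "internal"
        out.append((depth, kind))
    return out


# Assumed reading of Figure 6: segment thicknesses in metres and dominant SBT.
boundaries = classify_boundaries([4.0, 6.0, 5.0], [6, 5, 5])
# boundaries == [(4.0, 'layer'), (10.0, 'internal')]
```

With this rule, the three segments collapse into two SBT layers, which matches the caption of Figure 6.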

Figure 7. A set of simulated CPT logs and the corresponding point clouds in feature space (conditional independent case)

Figure 8. Extracted patterns using (a) the optimal number of clusters = 3 in (b) physical space and (c) feature space. The stratification in (b) is interpreted based on (d) the dominant SBT number of each soil segment. Note: C n = Cluster n in (c). (conditional independent case)

[Figure 9 panels: (a) trace plots of the density parameters against iterations (1000 to 5000) for soil cluster 1, soil cluster 2, and soil cluster 3.]

Figure 9. (a) Realizations of all density parameters for three soil clusters and (b) the corresponding posterior distributions (conditional independent case)

[Figure 10 plot: MCR on the vertical axis.]

Figure 10. The MCR curve during the sampling process (conditional independent case).

[Figure 11 panels: boxplots for Segment 1, Segment 2, and Segment 3 of the mean of log10 Fr, mean of log10 Qt, SD of log10 Fr, SD of log10 Qt, and correlation coefficient; histogram of MCR (probability against MCR, 0 to 0.015).]

Figure 11. Boxplot of parameter estimators and histogram of MCR for 100 synthetic datasets (conditional independent case)

Figure 12. A set of simulated CPT logs and the corresponding point clouds in feature space (vertically correlated case).

[Figure 13 panels: (a) model selection curve against the number of clusters (2 to 10); the feature-space panel uses log10(Qt) as an axis.]

Figure 13. Extracted patterns using (a) the optimal number of clusters = 3 in (b) physical space and (c) feature space. The stratification in (b) is interpreted based on (d) the dominant SBT number of each soil segment (vertically correlated case)

[Figure 14 panels: (a) trace plots of the density parameters against iterations (1000 to 5000) for soil cluster 1, soil cluster 2, and soil cluster 3; (b) posterior distributions of the mean of log10 Fr, mean of log10 Qt, SD of log10 Fr, SD of log10 Qt, and correlation coefficient for the same three clusters.]

Figure 14. (a) Realizations of all density parameters for three soil clusters and (b) the corresponding posterior distributions (vertically correlated case)

Figure 15. The MCR curve during the sampling process (vertically correlated case).

[Figure 16 panels: boxplots for Segment 1, Segment 2, and Segment 3 of the mean of log10 Fr, mean of log10 Qt, SD of log10 Fr, SD of log10 Qt, and correlation coefficient; histogram of MCR (probability against MCR, 0 to 0.08).]

Figure 16. Boxplot of parameter estimators and histogram of MCR for 100 synthetic datasets (vertically correlated case)

Figure 17. NGES data: (a) log10 Fr profile; (b) log10 Qt profile; (c) Robertson chart.

Figure 18. NGES analysis results: extracted patterns using (a) the optimal number of clusters = 7 in (b) physical space and (c) feature space; the stratification in (b) is based on (d) the dominant SBT number of each soil segment

[Figure 19 panels, left to right after the borehole log: This study; Ching et al. (2015); Zhang & Tumay (1999); Wang et al. (2013); Clustering (nc = 40); T ratio method (window = 1 m); T ratio method (window = 2 m). The layers identified in this study are labeled SBT* = 6, 3, 5, 4, and 5, and panel (b) includes an information entropy profile on a 0 to 1 axis.]

Note: panels (a) and (c)-(h) are reproduced from Ching et al. (2015); HPS stands for highly probable sandy soil, HPC stands for highly probable clayey soil, and HPM stands for highly probable mixed soil. In this study, SBT* indicates the dominant SBT number of each soil layer.

Figure 19. Comparisons for the NGES stratification results: (a) the borehole log; (b) the result from the proposed approach; (c) the results of Ching et al. (2015); (d) the results of Zhang and Tumay (1999); (e) the results of Wang et al. (2013); (f) the results obtained based on clustering with nc = 40; (g) the results obtained based on the T ratio method with a 1 m window; (h) the results obtained based on the T ratio method with a 2 m window.
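The information entropy profile shown on a 0 to 1 axis in panel (b) of Figure 19 can be understood through a small sketch. The definition used below is an assumption, since the formula is not restated in this part of the paper: the Shannon entropy of the sampled label frequencies at one depth, divided by log k so that the value lies between 0 (all samples agree) and 1 (all k labels equally frequent). The function name `label_entropy` is hypothetical.

```python
import math
from collections import Counter


def label_entropy(samples, k):
    """Normalized Shannon entropy of the labels sampled at one depth.

    samples: labels drawn for this depth across the retained iterations.
    k: number of clusters; dividing by log(k) is an assumed normalization.
    """
    n = len(samples)
    h = -sum((c / n) * math.log(c / n) for c in Counter(samples).values())
    return h / math.log(k)
```

Under this reading, depths close to a boundary, where the sampled label flips between two clusters, would show high entropy, while depths inside a well-defined segment would show values near zero.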
Figure 20. CPT data for the Lukang case: (a) log10 Fr profile; (b) log10 Qt profile; (c) the Robertson chart.

Figure 21. Analysis results for the Lukang case: extracted patterns using (a) the optimal number of clusters = 5 in (b) physical space and (c) feature space; the stratification in (b) is based on the dominant SBT number of each soil segment (not shown)

[Figure 22 panels, left to right after the borehole log: This study; Ching et al. (2015); Wang et al. (2013); Clustering (nc = 40); Clustering (nc = 80); Clustering (nc = 200); T ratio method (window = 1 m); T ratio method (window = 2 m); T ratio method (window = 5 m). Layers are labeled with their SBT* numbers, and panel (b) includes an information entropy profile on a 0 to 1 axis.]

Note: panels (a) and (c)-(j) are reproduced from Ching et al. (2015). In this study, SBT* indicates the dominant SBT number of each soil layer.

Figure 22. Comparisons of stratification results for the Lukang case: (a) the borehole log; (b) the results obtained using the proposed approach; (c) the results of Ching et al. (2015); (d) the results of Wang et al. (2013); (e-g) the results obtained based on clustering with nc = 40, 80, and 200, respectively; (h-j) the results obtained based on the T ratio method with window size = 1 m, 2 m, and 5 m, respectively.
