CPT Manuscript

A Bayesian unsupervised learning approach for identifying soil stratification using cone penetration data

Hui Wang1; Xiangrong Wang2; J. Florian Wellmann3; Robert Y. Liang4

1 Assistant Professor, Department of Civil and Environmental Engineering and Engineering Mechanics, The University of Dayton, Dayton, OH 45469-0243, USA, Email: hwang12@udayton.edu.

2 Corresponding Author, Visiting Professor, Department of Civil and Environmental Engineering and Engineering Mechanics, The University of Dayton, Dayton, OH 45469-0243, USA, Email: xwang2@udayton.edu, Tel: +1 (937)-229-3847.

3 Assistant Professor, The Aachen Institute for Advanced Study in Computational Engineering Science (AICES), RWTH Aachen University, 52062 Aachen, Germany, Email: wellmann@aices.rwth-aachen.de.

4 Professor, Department of Civil and Environmental Engineering and Engineering Mechanics, The University of Dayton, Dayton, OH 45469-0243, USA, Email: rliang1@udayton.edu.

Published in Canadian Geotechnical Journal (authors' pre-print version)

Abstract
This paper presents a novel perspective for understanding the spatial and statistical patterns of a cone penetration dataset and for using them to identify soil stratification. Both local consistency in physical space (i.e., along depth) and statistical similarity in feature space (i.e., logQt–logFr space, or the Robertson chart) between data points are considered simultaneously. The proposed approach consists, in essence, of two parts: 1) a pattern detection approach using a Bayesian inferential framework, and 2) a pattern interpretation protocol using the Robertson chart. The first part is the mathematical core of the proposed approach, which infers both the spatial pattern in physical space and the statistical pattern in feature space from the input dataset; the second part converts the abstract patterns into intuitive spatial configurations of multiple soil layers having different soil behavior types. The advantages of the proposed approach include probabilistic soil classification and the identification of soil stratification in an automatic and fully unsupervised manner. The proposed approach has been implemented in MATLAB R2015b and Python 3.6, and tested using various datasets including both synthetic and real-world CPT soundings. The results show that the proposed approach can accurately and automatically detect soil layers with quantified uncertainty and reasonable computational cost.

Key words: unsupervised learning; soil stratification; CPT; Bayesian inferential framework; soil behavior type.

Introduction
The cone penetration test (CPT) is one of the commonly adopted in-situ tests for determining subsoil stratigraphy (Lunne et al. 1997). It is less disruptive and provides nearly continuous, repeatable, and reliable data with cost savings (Robertson 2009). In recent years, CPT-based stratigraphic profiling has received considerable attention. Many different strategies have been investigated in previous studies, such as fuzzy sets (Zhang and Tumay 1999), clustering analysis (Das and Basudhar 2009; Depina et al. 2016; Hegazy and Mayne 2002; Liao and Mayne 2007), statistical analysis (Phoon et al. 2003; Wickremesinghe and Campanella 1991), Bayesian inference (Wang et al. 2013), and the wavelet transform modulus maxima (WTMM) method (Ching et al. 2015).

Although these strategies all have sound theoretical ground and certain advantages, their performance may vary considerably from case to case because of several limitations. A comparison of several of the above methods was performed by Ching et al. (2015). The results indicate that the Bayesian method proposed by Wang et al. (2013) (hereafter referred to as the Bayesian method) and the WTMM method (Ching et al. 2015) generally outperform the other methods in some cases with complex soil stratigraphy configurations. However, the Bayesian method is computationally expensive because it requires solving high-dimensional integrals and high-dimensional non-convex optimization problems, whereas the WTMM method operates only on the soil behavior index and hence may not make full use of the original “two-dimensional” information (i.e., tip resistance and friction ratio). More concretely, different combinations of tip resistance and friction ratio may yield the same soil behavior index while representing different (though possibly similar) soil mechanical properties. Moreover, the statistical correlation between tip resistance and friction ratio cannot be reflected by the soil behavior index. In addition, the soil behavior index does not apply to all soil behavior types (Robertson and Cabal 2010). Neither method covers uncertainty quantification of the soil classification result at each sampling location along depth, possibly because of computational cost considerations or limitations of the adopted mathematical tools in solving optimization problems. In this work, we aim to develop a Bayesian unsupervised learning approach for automatic layer detection and soil classification that quantifies uncertainty and takes both tip resistance and friction ratio into consideration. It is expected to have reasonable accuracy and computational cost.

As electric cones convert the raw signal into digital form at selected intervals (Robertson and Cabal 2010), we discretize the physical space (i.e., a vertical line in the direction of depth) into elements matching the measurement resolution. Each element is assigned a set of original observations. Two features can be derived from the observations: the normalized friction ratio, $F_r = 100 f_s / (q_t - \sigma_{v0})$, and the normalized tip resistance, $Q_t = (q_t - \sigma_{v0}) / \sigma'_{v0}$, where $f_s$, $q_t$, $\sigma_{v0}$, and $\sigma'_{v0}$ are the sleeve friction, corrected tip resistance, vertical total stress, and vertical effective stress, respectively. After taking logarithms, the two features form a two-dimensional feature space. A portion of this feature space defines a soil classification chart that is frequently referred to as the Robertson chart (Robertson 1990), which is shown in Figure 1. The chart is divided into nine zones corresponding to nine soil behavior types (SBTs), as listed in Table 1. In this study, the feature space simply reduces to the Robertson chart with finite spans of $\log_{10} F_r$ and $\log_{10} Q_t$. The soil elements are represented by feature pairs ($\log_{10} F_r$, $\log_{10} Q_t$) and visualized as point clouds in the Robertson chart.
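
The feature construction above can be illustrated with a minimal Python sketch. The function name `cpt_features` and the numeric reading are hypothetical and are not taken from the paper; the only assumption about the inputs is that all four quantities share one stress unit.

```python
import math

def cpt_features(qt, fs, sigma_v0, sigma_v0_eff):
    """Return (log10 Fr, log10 Qt) for one soil element.

    qt: corrected tip resistance; fs: sleeve friction;
    sigma_v0 / sigma_v0_eff: vertical total / effective stress.
    All four inputs must be expressed in the same unit (e.g., kPa).
    """
    net = qt - sigma_v0                 # net tip resistance, qt - sigma_v0
    if net <= 0 or fs <= 0 or sigma_v0_eff <= 0:
        raise ValueError("logarithm undefined for non-positive inputs")
    fr = 100.0 * fs / net               # normalized friction ratio (percent)
    qt_norm = net / sigma_v0_eff        # normalized tip resistance
    return math.log10(fr), math.log10(qt_norm)

# Hypothetical reading (kPa): qt = 2100, fs = 20, sigma_v0 = 100, sigma'_v0 = 50
log_fr, log_qt = cpt_features(2100.0, 20.0, 100.0, 50.0)
```

For this made-up reading, Fr is 1 and Qt is 40, so the element plots at ($\log_{10} F_r$, $\log_{10} Q_t$) = (0, $\log_{10} 40$) in the feature space.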

For interpreting the soil profile, we need to evaluate the similarity between any two soil elements in physical space and between the corresponding data points in feature space, since a section of several similar consecutive soil elements can be interpreted as a potential soil layer. More specifically, the similarity is measured based on the vertical distance between the two soil elements along depth and the relative locations of the corresponding data points in the Robertson chart. Similar soil elements are assigned the same soil state index, which is an abstract description of the soil type. It should be noted that the index number itself has no meaning; it only indicates that all soil elements with the same index number are similar to each other. The spatial distribution of the soil state index along depth is named the spatial pattern, which indicates the physical locations of similar soil elements along depth. The statistical characteristics (i.e., mean and covariance matrix) of the data points with the same soil state index are named the statistical pattern, which indicates the average level, variation, and correlation of the two soil features corresponding to a specific soil state index. In this way, the similarity between any two soil elements can be quantitatively described in terms of the two patterns, which can be intuitively visualized in both physical space and feature space and jointly interpreted as multiple soil layers with different SBTs by using the Robertson chart. In other words, the spatial pattern describes an intuitive layer configuration of the soil profile (indicated by soil state indices, but without knowing the specific SBT of each potential layer), whereas the statistical pattern contains information regarding the mean and covariance matrix of the soil mechanical properties of the similar soil elements, and further indicates the possible SBTs of each potential layer. Combining the two patterns in the context of the Robertson chart then provides a reasonable interpretation of the soil stratification (indicated by SBTs). The two patterns are the core concepts of this work, and we go into detail in the following sections.

In this work, a Markov random field (MRF) model (Besag 1974) is adopted to represent the spatial pattern; the corresponding statistical pattern in feature space is modeled as a finite mixture (FM) model (McLachlan and Peel 2004). The local consistency in the spatial pattern is encoded through a neighborhood system, as we expect neighboring soil elements to be similar to each other in physical space. On the other hand, the similarity among soil elements is measured by the finite mixture density, as feature pairs of similar soil elements tend to cluster together in the feature space.

We propose a novel unsupervised learning approach for identifying soil stratification using cone penetration data. A hidden Markov random field (HMRF) model is employed to integrate the MRF model (representing the spatial pattern) with the FM model (representing the statistical pattern). The novelty of this work is that, for the first time, we not only extract the subsoil heterogeneity and identify the stratification based on the extracted spatial pattern, but also detect and interpret the corresponding statistical pattern in the context of the Robertson chart, which is familiar to practicing engineers. It will be demonstrated that taking both the spatial and the statistical pattern into consideration provides a better strategy for identifying soil stratification. In addition, by using real-world examples, we highlight the trend from the original Bayesian framework (Cao and Wang 2012; Wang et al. 2013) to the machine learning-based Bayesian framework (e.g., the method proposed in this study) in the methodological development for interpreting site investigation data.

Proposed unsupervised learning approach

Hidden Markov random field model and mean field-like approximation
An HMRF model is a comprehensive description of both the statistical pattern in feature space and the corresponding latent field in physical space. The features and the latent field are not necessarily of the same nature. In our case, the features are two measured soil mechanical properties, $\log_{10} F_r$ and $\log_{10} Q_t$, while the latent field consists of abstract soil state labels. The features are assumed to be generated from the latent field via a family of predefined probability density functions known as the emission functions. Figure 2 shows the diagram of an HMRF model: a one-dimensional lattice with the first-order neighborhood system (Besag 1986). The latent field is a configuration of all soil states $\mathbf{x} = (x_1, x_2, x_3, \ldots, x_s)$, $x_j \in L$, where $L = \{1, 2, \ldots, K\}$ is the set of all possible soil states. For each soil element $j$, the neighbors $\partial j$ are defined as the nearest two soil elements $\{j-1, j+1\}$ (see Figure 2). The MRF-Gibbs equivalence (Besag 1974) provides an explicit formula for the local conditional probability (Besag 1986; Geman and Geman 1984):

$$P(x_j \mid \mathbf{x}_{\partial j}, \beta) = \frac{P(x_j, \mathbf{x}_{\partial j} \mid \beta)}{\sum_{x_j' \in L} P(x_j', \mathbf{x}_{\partial j} \mid \beta)} = \frac{\exp[-U(x_j, \mathbf{x}_{\partial j}; \beta)]}{\sum_{x_j' \in L} \exp[-U(x_j', \mathbf{x}_{\partial j}; \beta)]} \tag{1}$$

where $U(\cdot)$ is the local energy function. Equation (1) is the cornerstone of a classical Gibbs sampler (Geman and Geman 1984), which is usually used to generate sample realizations of a latent field within a Markov chain Monte Carlo (MCMC) framework. We adopt the widely accepted Potts model (Koller and Friedman 2009) to calculate the local energy

$$U(x_j, \mathbf{x}_{\partial j}; \beta) = \sum_{i \in \partial j} V_{i,j}(x_i, x_j; \beta), \qquad V_{i,j}(x_i, x_j; \beta) = \begin{cases} 0 & \text{if } x_i = x_j \\ \beta & \text{if } x_i \neq x_j \end{cases}$$

where $V_{i,j}$ is the potential function, $\beta$ is referred to as the granularity coefficient, and the subscript $i$ is the index of a neighboring element of $j$. The physical meaning behind Equation (1) is that we assume neighboring soil elements generally tend to have the same soil state. To this end, the spatial constraint is introduced by tuning the local energy function, and the strength of the spatial constraint is controlled by the granularity coefficient $\beta$. A greater $\beta$ means a stronger constraint.
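
To make the role of the granularity coefficient concrete, the following sketch evaluates Equation (1) with the Potts energy for a single element on the one-dimensional lattice. The function names and the chosen values of $\beta$ and $K$ are illustrative assumptions.

```python
import math

def potts_energy(state, neighbor_states, beta):
    """Local energy U: a penalty of beta for every neighbor in a different state."""
    return sum(beta for nb in neighbor_states if nb != state)

def local_conditional(neighbor_states, beta, n_states):
    """P(x_j = l | neighbors, beta) for l = 1..n_states, following Equation (1)."""
    weights = [math.exp(-potts_energy(l, neighbor_states, beta))
               for l in range(1, n_states + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Element j whose two neighbors are both in state 1; K = 3 states, beta = 1.
probs = local_conditional([1, 1], beta=1.0, n_states=3)
# With beta = 0 the spatial constraint vanishes and all states are equally likely.
uniform = local_conditional([1, 1], beta=0.0, n_states=3)
```

In this example state 1 receives probability $1 / (1 + 2e^{-2})$, and the preference for agreeing with the neighbors grows as $\beta$ increases.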

As mentioned above, the features $\mathbf{y}$ are assumed to be generated from a particular realization $\mathbf{x}$ and the emission functions $f(\mathbf{y}_j; \theta_{x_j})$, where $\mathbf{y}_j$ is the feature pair ($\log_{10} F_r$, $\log_{10} Q_t$) of a specific soil element $j$ and $\theta_{x_j}$ is the parameter of the emission function corresponding to $x_j$. Assuming, for mathematical convenience, that the observations are conditionally independent, the conditional probability of the observed features can be expressed as

$$P(\mathbf{y} \mid \mathbf{x}, \theta_{l \in L}) = \prod_{j=1}^{s} P(\mathbf{y}_j \mid x_j) = \prod_{j=1}^{s} f(\mathbf{y}_j; \theta_{x_j}) \tag{2}$$

However, in practical problems, we usually do not know the configuration of $\mathbf{x}$ in advance; hence we are more interested in the marginal probability of the observed features $\mathbf{y}$. Although directly computing the marginal probability of the entire observed feature field is intractable because of the large number of possible realizations of $\mathbf{x}$, the local marginal probability of a feature vector $\mathbf{y}_j$ (i.e., when the configuration of its neighbors is known) follows the equation below:

$$P(\mathbf{y}_j \mid \mathbf{x}_{\partial j}, \theta_{l \in L}, \beta) = \sum_{l \in L} P(l \mid \mathbf{x}_{\partial j}, \beta)\, f(\mathbf{y}_j; \theta_l) \tag{3}$$

In Equations (2) and (3), $l$ denotes one of the possible soil states, and this index is not linked to a specific soil element (i.e., it is different from $x_j$, which indicates the soil state assigned to soil element $j$). In reality, the observed features $\mathbf{y}_j$ should be spatially correlated, and hence the conditional independence assumption could be too strong. As will be demonstrated later in the synthetic examples, in the presence of moderate vertical correlation the estimate of $\theta_l$ may be biased, but the estimated parameters are still considered to be reasonably accurate for a specific realization. Let $\theta_l = (\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)$ be the parameter set of the Gaussian component density function, where $\boldsymbol{\mu}_l$ is a vector that contains the mean of every feature for state $l \in L$, and $\boldsymbol{\Sigma}_l$ is the covariance matrix of all the features for state $l$. The so-called Gaussian hidden Markov random field (GHMRF) model is then defined by Equations (1)-(3).
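
As a hedged illustration of Equation (3) for the Gaussian case, the sketch below mixes bivariate Gaussian emission densities with the Potts prior of Equation (1). The two states, their means and covariances, and the helper names are hypothetical; the closed-form 2x2 inverse is used only because the feature space here has two dimensions.

```python
import math

def gaussian_pdf_2d(y, mean, cov):
    """Density of a bivariate Gaussian; cov is [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    # Quadratic form (y - mean)^T cov^{-1} (y - mean) via the 2x2 inverse.
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def potts_prior(neighbor_states, beta, n_states):
    """P(l | neighbors, beta) under the Potts model of Equation (1)."""
    w = [math.exp(-beta * sum(nb != l for nb in neighbor_states))
         for l in range(1, n_states + 1)]
    total = sum(w)
    return [v / total for v in w]

def local_marginal(y, neighbor_states, beta, means, covs):
    """Equation (3): sum over states of P(l | neighbors, beta) * f(y; theta_l)."""
    prior = potts_prior(neighbor_states, beta, len(means))
    return sum(p * gaussian_pdf_2d(y, m, c)
               for p, m, c in zip(prior, means, covs))

# Two hypothetical soil states in (log10 Fr, log10 Qt) space.
means = [(0.0, 1.5), (0.5, 0.8)]
covs = [[[0.04, 0.0], [0.0, 0.04]], [[0.04, 0.01], [0.01, 0.09]]]
y_j = (0.1, 1.4)
p_local = local_marginal(y_j, [1, 1], 1.0, means, covs)
```

Taking the product of such local marginals over all elements, with the neighbors fixed at a simulated field, gives the mean field-like approximation of the likelihood.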
Equation (3) provides a path to calculate the marginal probability of the observed feature vector at a specific element given the neighboring state labels, which makes it feasible to infer both the latent field and the model parameters within a Bayesian inferential framework. More concretely, based on the principle of the mean field-like approximation (Celeux et al. 2003; Forbes and Peyrard 2003), given a good choice of configuration $\tilde{\mathbf{x}}$, which is an approximated expectation of the latent field, $\mathbf{x}_{\partial j}$ can be approximated by $\tilde{\mathbf{x}}_{\partial j}$ (the local neighborhood configuration of element $j$ in the field $\tilde{\mathbf{x}}$). Then the corresponding density distribution of the entire observed field can be approximated as

$$P(\mathbf{y} \mid \tilde{\mathbf{x}}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta) \approx \prod_{j \in S} P(\mathbf{y}_j \mid \tilde{\mathbf{x}}_{\partial j}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta) = \prod_{j \in S} \sum_{l \in L} P(l \mid \tilde{\mathbf{x}}_{\partial j}, \beta)\, f(\mathbf{y}_j; \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l) \tag{4}$$

where $S = \{1, 2, 3, \ldots, s\}$ is the set of all element indices. A reasonable candidate for $\tilde{\mathbf{x}}$ would be one that leads to a good approximation of the joint probability $P(\mathbf{x}, \mathbf{y}, \Phi)$, where $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta\}$, $\boldsymbol{\mu} = \{\boldsymbol{\mu}_l\}$, $\boldsymbol{\Sigma} = \{\boldsymbol{\Sigma}_l\}$, for all $l \in L$ (Forbes and Peyrard 2003). We recommend employing the simulated field algorithm described in Celeux et al. (2003), in which $\tilde{\mathbf{x}}$ is a realization of the conditional distribution $p(\mathbf{x} \mid \mathbf{y}, \Phi)$ given an estimate of $\Phi$, generated using a Gibbs sampler. We describe how to implement Equation (4) iteratively in the following section.

Probabilistic pattern extraction
In the current problem setting, both the soil state sequence $\mathbf{x}$ and the corresponding model parameters $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta\}$ are unknown. An MCMC method is employed to optimize Equation (4) by sampling $\mathbf{x}$ and $\Phi$ alternately from the two conditional a posteriori distributions $p(\mathbf{x} \mid \mathbf{y}, \Phi)$ and $p(\Phi \mid \mathbf{y}, \mathbf{x})$.

1. Simulation of the conditional random field $p(\mathbf{x} \mid \mathbf{y}, \Phi)$
Conditional on a specific setting of the model parameters $\Phi$ and a set of observations $\mathbf{y}$, $p(\mathbf{x} \mid \mathbf{y}, \Phi)$ is a Gibbs distribution (Wang et al. 2016), and the local energy is

$$U_j'(x_j, \mathbf{x}_{\partial j}; \Phi) = U_j(x_j, \mathbf{x}_{\partial j}; \beta) + U_j(\mathbf{y}_j; \boldsymbol{\mu}_{x_j}, \boldsymbol{\Sigma}_{x_j}) \tag{5}$$

The local energy is separated into two parts: 1) the MRF energy $U_j(x_j, \mathbf{x}_{\partial j}; \beta)$, which represents the prior distribution (i.e., Gibbs distribution) of the latent field $\mathbf{x}$ and can be evaluated by using the Potts model, and 2) the likelihood energy:

$$U_j(\mathbf{y}_j; \boldsymbol{\mu}_{x_j}, \boldsymbol{\Sigma}_{x_j}) = \frac{1}{2} (\mathbf{y}_j - \boldsymbol{\mu}_{x_j})^T \boldsymbol{\Sigma}_{x_j}^{-1} (\mathbf{y}_j - \boldsymbol{\mu}_{x_j}) + \frac{1}{2} \log |\boldsymbol{\Sigma}_{x_j}| \tag{6}$$

By adopting Equation (5) as the measure of local energy and substituting $U_j'(x_j, \mathbf{x}_{\partial j}; \Phi)$ into Equation (1), we can calculate the local conditional probability of choosing a specific soil state given the local neighborhood system and the observations:

$$P(x_j \mid \mathbf{x}_{\partial j}, \mathbf{y}_j, \Phi) = \frac{P(x_j, \mathbf{x}_{\partial j} \mid \mathbf{y}_j, \Phi)}{\sum_{x_j' \in L} P(x_j', \mathbf{x}_{\partial j} \mid \mathbf{y}_j, \Phi)} = \frac{\exp[-U_j'(x_j, \mathbf{x}_{\partial j}; \Phi)]}{\sum_{x_j' \in L} \exp[-U_j'(x_j', \mathbf{x}_{\partial j}; \Phi)]} \tag{7}$$

The realizations $\tilde{\mathbf{x}}$ of the conditional random field $p(\mathbf{x} \mid \mathbf{y}, \Phi)$ can be iteratively simulated via the Gibbs sampler proposed by Geman and Geman (1984). For greater efficiency, we follow Algorithm 1, in which a parallel strategy named the chromatic sampler is used to increase the computational speed. More details on the chromatic sampler can be found in Wang et al. (2016).
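
The sketch below shows one Gibbs sweep built from Equations (5)-(7) with a two-color (even/odd) update order on the one-dimensional lattice. It is an illustration under stated assumptions and not a reproduction of Algorithm 1: states are indexed from 0 for convenience, the half-sweeps are run serially rather than in parallel, and the synthetic data and all function names are hypothetical.

```python
import math
import random

def likelihood_energy(y, mean, cov):
    """Equation (6) for a 2-D feature vector; cov is [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return 0.5 * quad + 0.5 * math.log(det)

def state_probabilities(j, x, y, beta, means, covs):
    """Equation (7): P(x_j = l | neighbors, y_j) for l = 0..K-1."""
    neighbors = [x[i] for i in (j - 1, j + 1) if 0 <= i < len(x)]
    energies = [beta * sum(nb != l for nb in neighbors)       # Potts term
                + likelihood_energy(y[j], means[l], covs[l])  # data term
                for l in range(len(means))]
    e_min = min(energies)                  # shift for numerical stability
    w = [math.exp(-(e - e_min)) for e in energies]
    total = sum(w)
    return [v / total for v in w]

def chromatic_sweep(x, y, beta, means, covs, rng):
    """One Gibbs sweep: update even-indexed elements, then odd-indexed ones.

    On a 1-D chain, same-parity elements are never neighbors, so each
    half-sweep could be run in parallel; it is done serially here.
    """
    for parity in (0, 1):
        for j in range(parity, len(x), 2):
            probs = state_probabilities(j, x, y, beta, means, covs)
            x[j] = rng.choices(range(len(means)), weights=probs)[0]
    return x

# Synthetic two-layer profile with well-separated clusters.
rng = random.Random(0)
means = [(0.0, 0.0), (10.0, 10.0)]
covs = [[[0.01, 0.0], [0.0, 0.01]]] * 2
y = [(0.0, 0.0)] * 3 + [(10.0, 10.0)] * 3
x_tilde = chromatic_sweep([0] * 6, y, 1.0, means, covs, rng)
```

Repeating the sweep yields a sequence of realizations of the latent field; because the clusters in this toy profile are far apart, the data term dominates and the labels follow the two layers.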
2. Simulation of the model parameters from $p(\Phi \mid \mathbf{y}, \mathbf{x})$
In this step, $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta\}$ is sampled iteratively following the conditional posterior distributions:

$$\mathrm{post}(\boldsymbol{\mu} \mid \mathbf{y}, \mathbf{x}, \boldsymbol{\Sigma}, \beta) \propto \mathrm{prior}(\boldsymbol{\mu})\, L(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta) \tag{8}$$

$$\mathrm{post}(\boldsymbol{\Sigma} \mid \mathbf{y}, \mathbf{x}, \boldsymbol{\mu}, \beta) \propto \mathrm{prior}(\boldsymbol{\Sigma})\, L(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta) \tag{9}$$

$$\mathrm{post}(\beta \mid \mathbf{y}, \mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) \propto \mathrm{prior}(\beta)\, L(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta) \tag{10}$$

The likelihood function $L(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta) = \prod_{j \in S} P(\mathbf{y}_j \mid \tilde{\mathbf{x}}_{\partial j}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta)$ is evaluated by using Equation (4), where $\tilde{\mathbf{x}}_{\partial j}$ is the simulated local neighborhood configuration from step 1. To reflect the inadequate prior knowledge, non-informative priors are used in this work: for all $l \in L$, 1) flat multivariate Gaussian priors are used for $\boldsymbol{\mu}_l$ with arbitrarily chosen center $\boldsymbol{\eta}_l$ and large covariance $\boldsymbol{\Xi}_l$; 2) a flat Gaussian prior is used for $\beta$ with arbitrarily chosen mean $\mu_\beta$ and large standard deviation $\sigma_\beta$; 3) the separation strategy (Barnard et al. 2000) is used to construct the prior distribution of $\boldsymbol{\Sigma}_l$. More specifically, $\boldsymbol{\Sigma}_l$ is decomposed as $\boldsymbol{\Sigma}_l = \boldsymbol{\Lambda}_l \mathbf{R}_l \boldsymbol{\Lambda}_l$, where $\boldsymbol{\Lambda}_l$ is a diagonal matrix with the $i$-th diagonal element $\sigma_l^{ii}$ (i.e., the standard deviation of the $i$-th feature of cluster $l$), and $\mathbf{R}_l$ is the correlation matrix of cluster $l$. $\sigma_l^{ii}$ is assumed to follow a log-normal prior, $\ln(\sigma_l^{ii}) \sim N(b_l^{ii}, \xi)$, and the prior density for the correlation matrix is

$$P(\mathbf{R}_l) \propto |\mathbf{R}_l|^{-\frac{1}{2}(\nu + d + 1)} \left( \prod_{i=1}^{d} r_l^{ii} \right)^{\nu/2}$$

where $d$ is the number of features and $r_l^{ii}$ is the $i$-th diagonal element of $\mathbf{R}_l^{-1}$. We use the notation $\boldsymbol{\Sigma}_l \sim BSS(\nu, \mathbf{b}_l, \xi)$ to denote this prior. Other choices for the prior of $\boldsymbol{\Sigma}$ are also possible, for example the hierarchical half-t prior (Huang and Wand 2013) and the inverse-Wishart distribution (Gelman et al. 2014). A recent comprehensive comparison by Alvarez et al. (2014) shows that the BSS prior offers the most flexibility and outperforms the other priors.

To illustrate the relations among the hyperparameters, model parameters, latent field, and features, a plate diagram is shown in Figure 3. The plate notation represents fixed hyperparameters with squares and random variables with circles, while gray shapes indicate known or predefined values. K denotes the total number of soil states and N is the total number of elements. In step 1, a realization of $\tilde{\mathbf{x}}$ is generated using the parameters from the last iteration; hence all $\tilde{\mathbf{x}}_{\partial j}$ are available in step 2. In step 2, candidates of $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta\}$ are proposed and then accepted or rejected according to the change in the approximated marginal log-likelihood of the observed features. The plate diagram in Figure 3 shows the intrinsic relationships among all variables in Equation (4). The widely used Metropolis-Hastings (M-H) algorithm is employed in this step. Algorithm 2 shows the pseudocode of a single M-H iteration for updating $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta\}$.
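
Since Algorithm 2 is not reproduced in this section, the following is only a generic sketch of a single M-H update for one scalar parameter such as $\beta$. The symmetric Gaussian random-walk proposal, the step size, and the toy log-posterior are assumptions made for illustration; in the approach described above, `log_post` would be the log prior plus the approximated log-likelihood of Equation (4).

```python
import math
import random

def mh_step(current, log_post, step, rng):
    """One random-walk Metropolis-Hastings update of a scalar parameter.

    log_post(value) returns log prior + log likelihood up to a constant.
    The Gaussian proposal is symmetric, so the acceptance ratio reduces
    to the ratio of posterior densities.
    """
    candidate = current + rng.gauss(0.0, step)
    log_ratio = log_post(candidate) - log_post(current)
    # Accept with probability min(1, exp(log_ratio)).
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return candidate, True
    return current, False

def toy_log_post(b):
    """Stand-in target N(2, 0.5^2); NOT the posterior of Equation (10)."""
    return -0.5 * ((b - 2.0) / 0.5) ** 2

rng = random.Random(0)
beta, chain = 1.0, []
for _ in range(20000):
    beta, accepted = mh_step(beta, toy_log_post, 0.5, rng)
    chain.append(beta)
chain_mean = sum(chain) / len(chain)
```

The chain of sampled values is what the later kernel density fitting and sample-mean summaries operate on.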

Uncertainty quantification of soil states and mixture parameters
By iteratively implementing the parallel Gibbs sampler (Algorithm 1) and the Metropolis-Hastings algorithm (Algorithm 2), a series of realizations from the conditional a posteriori distributions $p(\mathbf{x} \mid \mathbf{y}, \Phi)$ and $p(\Phi \mid \mathbf{y}, \mathbf{x})$ can be generated. To quantify the uncertainty of the latent soil states $\mathbf{x}$ on a per-element basis, information entropy (Wellmann 2013; Wellmann and Regenauer-Lieb 2012) is adopted as a quantitative measure of uncertainty. It has the following expression:

$$H(j) = -\sum_{l \in L} P_l(j) \log P_l(j) \tag{11}$$

where $P_l(j)$ is the empirical probability of assigning label $l \in L$ to element $j$, calculated from the ensemble of $\tilde{\mathbf{x}}$. High information entropy indicates a high level of uncertainty. The maximum a posteriori (MAP) estimate of the soil state at a specific element, $x_j^*$, is obtained by taking the soil state with the highest membership (the probability of having a specific state), and the MAP estimate of the latent field $\mathbf{x}^*$ then consists of all $x_j^*$, $j \in \{1, 2, 3, \ldots, s\}$. To quantify the uncertainty of the model parameters $\Phi$, kernel density estimation (Bowman and Azzalini 1997) with Gaussian kernels is employed to fit the simulated samples of each model parameter from step 2. The fitting results are treated as approximations of the posterior density functions. The sample mean is chosen as the representative value of each model parameter.
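
A small worked example of Equation (11) and of the per-element MAP estimate is sketched below. The four sampled fields are made up, the helper names are hypothetical, and the natural logarithm is an assumption because the base of the logarithm is not stated here.

```python
import math
from collections import Counter

def membership(samples, j, states):
    """Empirical P_l(j): fraction of sampled fields assigning state l to element j."""
    counts = Counter(field[j] for field in samples)
    return {l: counts[l] / len(samples) for l in states}

def entropy(probs):
    """Equation (11); terms with P = 0 contribute 0 by convention."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def map_state(probs):
    """State with the highest membership probability."""
    return max(probs, key=probs.get)

# Four hypothetical sampled fields over three elements, states {1, 2}.
samples = [[1, 1, 2], [1, 2, 2], [1, 1, 2], [1, 2, 2]]
states = [1, 2]
entropies = [entropy(membership(samples, j, states)) for j in range(3)]
x_star = [map_state(membership(samples, j, states)) for j in (0, 2)]
```

Elements 0 and 2 are labeled identically in every sample, so their entropy is zero; element 1 is split evenly between the two states and attains the maximum entropy for two states, $\log 2$.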
Determine the number of soil states
In this study, the total number of soil states is determined by the Bayesian information criterion (BIC), which is generally recommended for model selection when no prior knowledge is available (Fraley and Raftery 1998; Fraley and Raftery 2002), and has the following form:

$$\mathrm{BIC} = -2\, \mathrm{loglike}(\mathbf{y}, \vartheta) + M \log(n) \tag{12}$$

where $\mathrm{loglike}(\mathbf{y}, \vartheta)$ is the maximized log-likelihood; $\vartheta$ is the MAP estimate of the model parameters, which are estimated by using the EM algorithm (McLachlan and Peel 2004); $M$ is the number of independent parameters to be estimated; and $n$ is the number of data points. Previous studies recommend choosing the model with the minimum BIC (Fraley and Raftery 1998; Fraley and Raftery 2002; McLachlan and Basford 1988; McLachlan and Krishnan 2007). We follow the same approach to choose the optimal number of clusters. More specifically, the observed feature pairs ($\log_{10} F_r$, $\log_{10} Q_t$) of all soil elements are fitted to a family of finite Gaussian mixture models having different numbers of clusters, and the corresponding BIC is calculated for each model using the MATLAB Statistics and Machine Learning Toolbox (MathWorks 2014) or the scikit-learn package in Python (Pedregosa et al. 2011). The EM algorithm is implemented in both software packages. The number of clusters with the lowest BIC value is chosen as the optimum.
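
The selection rule of Equation (12) can be sketched in a few lines. The parameter count assumes mixture components with full covariance matrices, and the log-likelihood values are invented for illustration; in practice they would come from EM fits of each candidate model.

```python
import math

def gmm_free_parameters(k, d):
    """Independent parameters of a k-component, d-feature Gaussian mixture
    with full covariance matrices (an assumption about the model form)."""
    weights = k - 1                      # mixing proportions sum to one
    means = k * d
    covariances = k * d * (d + 1) // 2   # symmetric d x d matrices
    return weights + means + covariances

def bic(loglike, k, d, n):
    """Equation (12): -2 loglike + M log(n)."""
    return -2.0 * loglike + gmm_free_parameters(k, d) * math.log(n)

def select_k(loglikes, d, n):
    """Return the number of clusters with the lowest BIC.

    loglikes maps k -> maximized log-likelihood of the k-component fit.
    """
    return min(loglikes, key=lambda k: bic(loglikes[k], k, d, n))

# Hypothetical maximized log-likelihoods for k = 1..4 on n = 500 points.
loglikes = {1: -900.0, 2: -700.0, 3: -640.0, 4: -635.0}
best_k = select_k(loglikes, d=2, n=500)
```

Here the gain in log-likelihood from three to four components is too small to offset the six extra parameters, so three components are selected. When scikit-learn is used, a fitted `GaussianMixture` exposes a `bic(X)` method that returns this criterion directly.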
Usually, the exact number of clusters is not known a priori. A suitable choice for this number results in a clustering in which data points have high similarity within each cluster and low similarity between clusters in the feature space (Celeux and Govaert 1995; McLachlan and Peel 2004). In certain cases, this choice may come down to using subjective expert knowledge or ad hoc procedures. Several studies have been performed on model selection using data-based approaches, including the Akaike information criterion (AIC) (Bozdogan 1987), the Bayesian information criterion (BIC) (Fraley and Raftery 1998), the integrated completed likelihood (ICL) criterion (Biernacki et al. 2000), a BIC approach based on mean field theory (Forbes and Peyrard 2003), and a Bayesian approach combined with Monte Carlo integration (Yuen and Mu 2011). The advantage of the BIC is that it accounts for the dimension of the input space by including the number of features in the penalty (Fraley and Raftery 1998; Fraley and Raftery 2002); thus the BIC is more sensitive to the number of unknown parameters and the model complexity, and hence effectively avoids possible overfitting.

Because the similarity and possible clustered patterns are reflected in the feature space, we are able to choose a number that maximizes the likelihood of the observations without considering the spatial constraint. We employ a conventional finite Gaussian mixture (FGM) model to measure the likelihood and find the best number of components, since it avoids running the Gibbs sampler within each iterative loop and is thus faster when trying many cases with different numbers of components.

Interpreting clusters in the Robertson chart
1. Understand the spatial pattern and the statistical pattern
At this stage, all soil elements are clustered into multiple groups in feature space; accordingly, segments of soil elements having the same state along depth are extracted. Hence both the spatial pattern and the statistical pattern are revealed. Since the two patterns are linked together via the HMRF model, the number of soil segments is automatically determined based on the statistical pattern in feature space. It is worth noting that the segments in physical space and the clusters in feature space may not be in one-to-one correspondence, as a cluster in feature space may consist of several segments in physical space; however, a segment in physical space must be a subset of (or equal to) a cluster in feature space. Segments belonging to a specific cluster are statistically similar to each other (i.e., they may be subsets generated by the same Gaussian distribution representing this cluster) but are located at different depths. It is also worth noting that statistical similarity does not guarantee that these segments have the same soil behavior type, as it is entirely possible that the parent cluster spans two or more SBT zones. Therefore, compared with a cluster in feature space, a segment in physical space should be considered the basic unit (i.e., a potential homogeneous soil layer) for soil profile interpretation, since the soil elements within it are not only physically grouped together but also statistically similar to each other. A simple example is depicted in Figure 4, in which all soil elements are divided into three segments in physical space but two clusters in feature space. Note that cluster 1 consists of Segment 1 and Segment 3 at different depths, whereas cluster 2 equals Segment 2.

2. Interpret soil segments and detect layer boundaries
The Robertson chart (Robertson 1990; Robertson and Cabal 2010) directly links the feature space with the mechanical behavior of the subsoil. We consider soil segments as the basic units for identifying soil stratification. Each segment can be interpreted according to the one or more zones it occupies. The probability that a segment belongs to SBT zone n is determined as the proportion of the elements within this segment that lie in SBT zone n. Hence each segment has its own dominant SBT zone. It is worth noting that multiple soil segments having the same soil state are similar to each other only in a statistical sense; they are separated by other segments with different soil states in physical space and may possess different dominant SBT zones (e.g., Segment 1 and Segment 3 in Figure 4). Conversely, soil segments belonging to different clusters may be located near each other in feature space and hence have the same dominant SBT zone (e.g., Segment 2 and Segment 3 in Figure 4). Although they are not statistically similar to each other, they share similar mechanical properties.
A boundary is detected between any two neighboring segments. If two neighboring segments have the same dominant SBT zone, the boundary is defined as an internal boundary. If two neighboring segments have different dominant SBT zones, the boundary is defined as a layer boundary. An example of the two types of boundaries is shown in Figure 4, in which the dominant SBT zone of Segment 1 is 6, whereas the dominant zones of Segment 2 and Segment 3 are both 5. Hence a layer boundary is detected between Segment 1 and Segment 2, and an internal boundary is detected between Segment 2 and Segment 3.
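
The interpretation rules above can be sketched as follows. The sketch assumes that the SBT zone of every element has already been read from the Robertson chart (the chart lookup itself is not reproduced), and the nine-element profile is a made-up example arranged to mimic the configuration described for Figure 4; all function names are hypothetical.

```python
from collections import Counter
from itertools import groupby

def extract_segments(states):
    """Split a state sequence into runs of equal state.

    Returns (start, end, state) tuples with end exclusive.
    """
    segments, start = [], 0
    for state, run in groupby(states):
        length = len(list(run))
        segments.append((start, start + length, state))
        start += length
    return segments

def dominant_sbt(sbt_zones, start, end):
    """Most frequent SBT zone in a segment and its share of the elements."""
    zone, count = Counter(sbt_zones[start:end]).most_common(1)[0]
    return zone, count / (end - start)

def classify_boundaries(states, sbt_zones):
    """Label each boundary between neighboring segments as 'layer' or 'internal'."""
    segs = extract_segments(states)
    doms = [dominant_sbt(sbt_zones, s, e)[0] for s, e, _ in segs]
    return [(segs[i + 1][0], "internal" if doms[i] == doms[i + 1] else "layer")
            for i in range(len(segs) - 1)]

# Hypothetical profile: soil states 1, 2, 1 along depth; dominant zones 6, 5, 5.
states = [1, 1, 1, 2, 2, 2, 1, 1, 1]
sbt_zones = [6, 6, 5, 5, 5, 5, 5, 5, 6]
segments = extract_segments(states)
boundaries = classify_boundaries(states, sbt_zones)
```

For this profile the boundary at element index 3 is a layer boundary (zone 6 above, zone 5 below) and the boundary at index 6 is an internal boundary, even though the first and third segments share the same soil state.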

Implementation procedure
Figure 5 shows a flowchart of the proposed Bayesian unsupervised learning approach. To summarize, the procedure is as follows:
1. Obtain a set of CPT data and preprocess it (i.e., correction and normalization). Then convert the data to the logarithms of the normalized friction ratio, $\log_{10} F_r$, and the normalized cone tip resistance, $\log_{10} Q_t$.
2. Construct the first-order neighborhood system. Fit the feature pairs of all elements to a family of finite Gaussian mixture models with different numbers of clusters (say, from 1 to $k_{max}$) and calculate the corresponding BIC values. The optimal number of soil states is the one with the lowest BIC.
3. Determine the hyperparameters $\{\mu_\beta, \sigma_\beta, \eta_l, \Xi_l, \upsilon, b_l, \xi\}$ for the priors. Set the number of iterations N and initialize $x^{(0)}$, the mixture density parameters $\mu_l^{(0)}$, $\Sigma_l^{(0)}$, and the granularity coefficient $\beta^{(0)}$. If no prior information is available: 1) set $\beta^{(0)} = 1$; 2) initialize $x^{(0)}$, $\mu_l^{(0)}$, $\Sigma_l^{(0)}$ with the output of the best-fitting finite Gaussian mixture model from step 2; 3) use the non-informative priors mentioned in Section 2.4, with the default settings shown in Table 2. The predefined constant 100 in Table 2 indicates a high level of variability and represents the absence of prior knowledge. In the authors’ experience, a much greater value does not make a significant difference.
4. Draw random samples $x^{(t)}$, $\mu_l^{(t)}$, $\Sigma_l^{(t)}$, $\beta^{(t)}$ from the two conditional posterior distributions $p(x \mid y, \Phi)$ and $p(\Phi \mid y, x)$ iteratively via the parallel Gibbs sampler (Algorithm 1) and the M-H algorithm (Algorithm 2).
5. Calculate the membership probabilities of the different soil states for each soil element, and perform kernel density fitting for all independent parameters in $\Phi$. The MAP estimate $x^*$ is determined by the highest membership probability at each soil element, and the sample mean is taken as the MAP estimator of all independent parameters in $\Phi$.
6. Extract soil segments from $x^*$. Determine the SBT number of each soil element, then label each soil segment with its dominant SBT number.
7. Determine all layer boundaries and internal boundaries as described above in Section 2.5. The SBT number of each layer is the common dominant SBT number of all soil segments in this layer.
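Steps 5 and 6 above can be illustrated with a short Python sketch that turns sampled label sequences into membership probabilities, takes the most probable state at each element, and extracts contiguous segments. The function names, the integer coding of soil states, and the toy samples are assumptions for illustration; in practice the samples would be the post-burn-in draws $x^{(t)}$ from step 4.

```python
from collections import Counter


def membership_probabilities(label_samples, n_states):
    """Relative frequency of each soil state at each element.

    label_samples: list of sampled label sequences, all of equal length,
    with states coded as integers 0 .. n_states - 1.
    """
    n_samples = len(label_samples)
    n_elements = len(label_samples[0])
    probs = []
    for i in range(n_elements):
        counts = Counter(sample[i] for sample in label_samples)
        probs.append([counts.get(s, 0) / n_samples for s in range(n_states)])
    return probs


def map_labels(probs):
    """Pick the state with the highest membership probability per element."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


def extract_segments(labels):
    """Return runs of equal labels as (start, end_exclusive, state) tuples."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return segments


# Three hypothetical samples of a five-element latent field with two states.
samples = [[0, 0, 1, 1, 0], [0, 0, 1, 1, 0], [0, 1, 1, 1, 0]]
probs = membership_probabilities(samples, n_states=2)
x_map = map_labels(probs)
segments = extract_segments(x_map)
```

Note that states 0 at the top and bottom of this toy profile form two separate segments, which mirrors the point that one soil state may correspond to several segments.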
The proposed approach has been implemented in MATLAB R2015b and Python 3.6, and tested using various datasets including both synthetic and real-world CPT logs. Details are given in the following sections. Interested readers may contact the first author for the MATLAB implementation or the Python package. Although the proposed method is mathematically complicated, the integrated implementation is automatic and fully unsupervised. If there is no prior information, in most cases, with the recommended default hyperparameter settings (see Table 2), practicing engineers only need to provide project-specific CPT data with both depth coordinates and measurements as input. The output includes the two patterns in physical space and feature space as well as the identified stratification and the corresponding SBT classification results. This not only improves the practicability for engineers who are not familiar with the mathematics under the hood, but also allows geotechnical researchers who are familiar with Bayesian unsupervised learning techniques to examine the detected patterns more closely.

Model validation using synthetic examples

Description of stochastic simulation method and measure of classification quality

To validate the proposed approach and illustrate the model behavior, synthetic cases with known spatial configuration (i.e., the SBT profile and stratification) and model parameters (i.e., the means $\mu_l$ and standard deviations $\sigma_l$ of $\log_{10} F_r$ and $\log_{10} Q_t$, and the correlation coefficient $\rho_l$ between $\log_{10} F_r$ and $\log_{10} Q_t$) are studied. We focus on a simple soil profile with two SBT layers but three different soil states, as shown in Figure 6. In this configuration, each soil state corresponds to one segment in physical space. The $\log_{10} F_r$ and $\log_{10} Q_t$ of the three soil states are modeled by three Gaussian random fields with respective thicknesses of 4 m, 6 m, and 5 m. As mentioned in Section 2.2, the probabilistic clustering algorithm is based on the conditional independence assumption; hence it is worthwhile to investigate the effect of spatial correlation more closely. To this end, two synthetic experiments are conducted: a conditionally independent case and a vertically correlated case. We set the depth interval (i.e., the element size) of the synthetic CPT logs to 0.05 m. In the first case, the features ($\log_{10} F_r$, $\log_{10} Q_t$) of each soil element are simply generated independently using a multivariate Gaussian random number generator with the parameters shown in Figure 6. Simulated realizations of the second case are generated in a segment-wise manner. Within each segment, all observations are considered to be spatially correlated and follow an exponential correlation function (Wang et al. 2010)
$$ r_n(y_i, y_j) = \exp\left(-\frac{2 d_{i,j}}{\lambda_n}\right), \quad i, j \in \text{Segment } n \qquad (13) $$
te
392 where d i , j is the distance between Element i and Element j, λn is the correlation length. In the
eo
G
393 second case, they are set to be λ1 = 1m , λ2 = 2m , λ3 = 1m .
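To make Eq. (13) concrete, the sketch below evaluates the exponential correlation function and simulates one feature within a single segment. It relies on the fact that, on an equally spaced depth grid, an exponential correlation function is reproduced exactly by a first-order autoregressive recursion with coefficient $\exp(-2\Delta/\lambda_n)$, where $\Delta$ is the element size. This is a simplified, univariate illustration: the function names, the mean, and the standard deviation are hypothetical, and the cross-correlation between the two features is not modeled, so it should not be read as the authors' simulation code.

```python
import math
import random


def exponential_correlation(d, corr_length):
    """Correlation of Eq. (13) for separation distance d."""
    return math.exp(-2.0 * abs(d) / corr_length)


def simulate_segment(n_elements, spacing, corr_length, mean, std, rng):
    """Simulate one correlated Gaussian feature on a regular depth grid.

    Uses an AR(1) recursion whose lag-k correlation equals
    exponential_correlation(k * spacing, corr_length).
    """
    phi = exponential_correlation(spacing, corr_length)
    innovation_std = math.sqrt(1.0 - phi * phi)
    z = rng.gauss(0.0, 1.0)  # standardized value at the first element
    values = [mean + std * z]
    for _ in range(n_elements - 1):
        z = phi * z + innovation_std * rng.gauss(0.0, 1.0)
        values.append(mean + std * z)
    return values


rng = random.Random(0)
spacing = 0.05                        # element size in metres (from the text)
n_elements = int(round(4.0 / spacing))  # 4 m thick segment
# mean and std below are hypothetical, not the values of Figure 6.
values = simulate_segment(n_elements, spacing, corr_length=1.0,
                          mean=1.5, std=0.1, rng=rng)
```

With a 0.05 m element size and a 1 m correlation length, neighboring elements have a correlation of $\exp(-0.1) \approx 0.905$.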

To measure the accuracy of the classification result, we define the misclassification ratio (MCR) as:
$$ \mathrm{MCR} = \frac{\text{number of misclassified elements}}{\text{total number of elements}} \qquad (14) $$
To evaluate the classification quality and convergence behavior during the sampling process, the MCR is calculated for each $x^{(t)}$. To evaluate the overall classification quality, the MCR is calculated using the MAP estimate $x^*$.
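Eq. (14) translates directly into code. The sketch below also includes a variant that minimizes the MCR over all relabelings of the predicted states; this variant is an added assumption (the text does not say how arbitrary cluster labels are matched to the true states), and it is only practical for a small number of states. All names and the toy labels are illustrative.

```python
from itertools import permutations


def misclassification_ratio(true_labels, predicted_labels):
    """Eq. (14): misclassified elements divided by total elements."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must be non-empty and equally long")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


def mcr_up_to_relabeling(true_labels, predicted_labels):
    """Smallest MCR over all permutations of the state labels (assumed
    convention for handling arbitrary cluster numbering)."""
    states = sorted(set(true_labels) | set(predicted_labels))
    best = 1.0
    for perm in permutations(states):
        mapping = dict(zip(states, perm))
        relabeled = [mapping[p] for p in predicted_labels]
        best = min(best, misclassification_ratio(true_labels, relabeled))
    return best


truth = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
predicted = [1, 1, 1, 1, 0, 0, 0, 2, 2, 2]
raw_mcr = misclassification_ratio(truth, predicted)
aligned_mcr = mcr_up_to_relabeling(truth, predicted)
```

In this toy case the raw MCR is 0.8 only because states 0 and 1 are numbered differently; after relabeling, a single element out of ten is misclassified.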

Conditionally independent case
Figure 7 shows a simulated dataset. As vertical correlation is not considered, this dataset perfectly satisfies the conditional independence assumption. The default hyperparameter settings in Table 2 are adopted. It is worth mentioning that although the optimal number of clusters and non-informative priors are used by default, it is possible (and sometimes necessary) to incorporate prior knowledge, if any, into the proposed Bayesian inferential framework by manually setting a proper number of clusters and the initial values $x^{(0)}$, $\mu_l^{(0)}$, $\Sigma_l^{(0)}$, $\beta^{(0)}$. In this example, the BIC is calculated for numbers of clusters from 1 to 10, and the result is shown in Figure 8a. The minimum BIC value corresponds to the finite Gaussian mixture model with three clusters. After setting the number of soil states $k = 3$, 5000 MCMC iterations are performed. The extracted spatial pattern $x^*$ and the statistical pattern $E(\mu_l)$, $E(\sigma_l)$, $E(\rho_l)$, $l \in L$ (i.e., the sample means of the mixture parameters) are visualized in Figure 8b-c. Recall that the soil state labels are only used for identifying different soil clusters, and hence there is no one-to-one correspondence between soil states and soil segments.
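The BIC-based choice of the number of clusters used in this example can be sketched as follows. The snippet assumes the common convention BIC = -2 ln L + p ln n (lower is better, consistent with selecting the lowest BIC) and a Gaussian mixture with full covariance matrices; both are assumptions, as are the function names and the log-likelihood values, which are made up for illustration and are not results from the paper.

```python
import math


def gmm_free_parameters(n_clusters, n_features=2):
    """Free parameters of a full-covariance Gaussian mixture model."""
    weights = n_clusters - 1
    means = n_clusters * n_features
    covariances = n_clusters * n_features * (n_features + 1) // 2
    return weights + means + covariances


def bic(log_likelihood, n_clusters, n_samples, n_features=2):
    """BIC = -2 ln L + p ln n (assumed convention; lower is better)."""
    p = gmm_free_parameters(n_clusters, n_features)
    return -2.0 * log_likelihood + p * math.log(n_samples)


def select_n_clusters(log_likelihoods, n_samples):
    """log_likelihoods: {k: maximized log-likelihood of the k-cluster fit}."""
    scores = {k: bic(ll, k, n_samples) for k, ll in log_likelihoods.items()}
    return min(scores, key=scores.get), scores


# Hypothetical maximized log-likelihoods for k = 1..5 and 300 elements.
fits = {1: -520.0, 2: -410.0, 3: -330.0, 4: -325.0, 5: -322.0}
best_k, scores = select_n_clusters(fits, n_samples=300)
```

For bivariate features, each additional cluster adds six free parameters, so the small likelihood gains beyond three clusters in this made-up example do not offset the penalty term.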
The information entropy and the identified soil stratification are plotted together in Figure 8b. By checking the probability that each segment belongs to the various SBT zones (Figure 8d), a layer boundary is detected at 4 m and an internal boundary is identified at 10 m, which match the original profile in Figure 6 well. In this synthetic example, high information entropy occurs at the boundaries between neighboring soil segments, which makes sense since a higher uncertainty is expected at boundary locations.
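The behavior of the information entropy at boundaries can be illustrated with a small sketch. It assumes the Shannon form, $H = -\sum_l p_l \ln p_l$, computed from the membership probabilities of one soil element with the natural logarithm; the logarithm base and the function name are assumptions, and the probabilities are made up.

```python
import math


def information_entropy(probabilities):
    """Shannon entropy (natural log) of one element's membership
    probabilities; terms with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


# An element well inside a segment versus one close to a boundary.
interior = information_entropy([1.0, 0.0, 0.0])
boundary = information_entropy([0.5, 0.5, 0.0])
maximum = information_entropy([1.0 / 3.0] * 3)
```

An element whose state is certain has zero entropy, an element split evenly between two states has entropy $\ln 2$, and the upper bound for three states is $\ln 3$.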

In feature space, the extracted statistical pattern is represented by the estimated mixture parameters with quantified uncertainty, which are shown in Table 3. Both $\mu_l$ and $\sigma_l$ are well inferred from the synthetic data, yet the $\rho_l$ are inferred with lower accuracy (e.g., $\rho_3$). To visualize the convergence behavior and variability of the mixture parameters, their sample realizations and the corresponding fitted kernel density functions are plotted in Figure 9. The Markov chains of $\mu_l$ are stable and well mixed; however, those of $\sigma_l$ and $\rho_l$ are less stable, and some traces may need more iterations to converge. This indicates that obtaining a good estimate of higher-order moments (i.e., the standard deviation and correlation coefficient) is generally more challenging than estimating the first-order moment (i.e., the mean). The posterior distribution of $\beta$ shows significant skewness and variation. This is because the inferred simple spatial configuration of soil segments may result in comparable likelihoods of the observations over a wide range of $\beta$. Figure 10 shows the MCR curve. The overall MCR is below 3% and the convergence is extremely fast in this case, as the separability of the different soil clusters (Figure 8c) is fairly good.
To examine the robustness of the proposed approach more closely, 100 synthetic datasets are analyzed. The boxplots of all parameter estimators and the histogram of MCR are shown in Figure 11, in which the true values are indicated by gray lines. Similar to the result from a single dataset, the sample estimators of $\mu_l$ and $\sigma_l$ are generally closer to the true values and have less variation than the sample estimators of $\rho_l$; however, the true correlation coefficients are still covered by the 50% credible interval (i.e., within the blue box). In addition, the MCR values of most synthetic datasets are smaller than 0.5%. Therefore, the performance of the proposed approach is very satisfactory under the conditional independence condition.

Vertically correlated case

In this section, we test the proposed approach in the presence of vertical correlation. Figure 12 shows a synthetic dataset. As in the conditionally independent case, the default settings are adopted (Table 2). Figure 13a shows that the lowest BIC still corresponds to the model with three clusters, which means that, if the vertical correlation is moderate (say, $\lambda_n \le 2$ m) and the soil layer thickness is several times larger than the correlation length, the vertical correlation does not have a significant impact on the separability among clusters in feature space. After 5000 MCMC iterations, the MAP estimate of the soil states and the extracted segments are shown in Figure 13b together with the spatial distribution of information entropy and the interpreted soil stratification. The corresponding fitted three-cluster pattern in feature space is visualized in Figure 13c. For reference, Figure 13d shows the dominant SBT number of each soil segment. The estimated mixture parameters with quantified uncertainty are listed in Table 4. The high z-scores of some mixture parameters (e.g., $\mu_1$ and $\sigma_1$) indicate possibly biased estimation from insufficient information, owing to the ergodicity issue introduced by the joint effect of spatial correlation and limited layer depth. Summarizing all results, it can be noticed that, first, although the conditional independence assumption is not fully satisfied, the identified subsoil stratification is still considerably accurate; second, it is likely that for a specific CPT log the realization is not ergodic, hence the estimated parameters are considered to be case specific (or to represent only the local condition in real-world datasets).

Figure 14 presents the Markov chains of all mixture parameters and their corresponding fitted kernel density distributions. As in the conditionally independent case, the means are well inferred with stable and well-mixed chains; however, the stochastic traces of the higher-order moments converge more slowly and may need more iterations to yield better estimates. The MCR curve of the first 100 iterations is presented in Figure 15. It shows that although good parameter estimation requires more computation, the segmentation of the latent field generally converges very fast with high accuracy in this specific case.
The robustness of the proposed approach in the presence of vertical correlation is evaluated by applying it to 100 synthetic CPT logs. The results are shown in Figure 16. Compared with the conditionally independent case, the variability of all estimators is greater. The 50% credible intervals generally cover the true values. Careful inspection of Figure 16 reveals that the means and standard deviations of different soil segments are well separated from each other; however, the variability of the correlation coefficients is generally large and the boxplots overlap heavily. This may be explained by the ergodicity problem of a single dataset. The histogram of MCR shows a satisfactory overall accuracy, as most of the synthetic datasets are well classified with an MCR below 5%.

Application using real CPT data

The NGES at Texas A&M University

We first apply the proposed approach to the well-known data collected at the National Geotechnical Experimentation Site (NGES) at Texas A&M University (clay site) (Briaud 2000). The same dataset has been analyzed in several previous studies, including Zhang and Tumay (1999), Wang et al. (2013), and Ching et al. (2015). This site comprises a sequence of sandy clay, clay, silty clay, and clay with silt seams. The depth of the entire CPT log is 15 m and the groundwater table is about 6 m below the ground surface. A comprehensive description of this site can be found in Briaud (2000). The data are visualized in Figure 17.
This CPT log is discretized into 296 soil elements with a 0.05 m depth interval. Assuming no prior information is available, we use the default prior settings. Figure 18a shows the BIC calculated for numbers of clusters from 1 to 15, with the best option at 7. Setting the number of soil states $k = 7$, the data are processed and the inferred mixture parameters are listed in Table 5. Ten detected soil segments are depicted in Figure 18b; these segments are then merged into five soil layers according to their dominant SBT numbers (Figure 18d). The information entropy shows significant spikes only at boundary locations, as the boundary elements may fall in the overlapping regions of multiple (≥2) clusters in feature space, but the overall separability is still fairly good, as shown in Figure 18c.
For reference, Figure 19a shows the soil profile obtained from the boring log (reproduced from Ching et al. (2015)). We compare our result with the results reported in Ching et al. (2015), which include the WTMM method, the Bayesian method (Wang et al. 2013), the clustering method (Hegazy and Mayne 2002; Liao and Mayne 2007), and the T ratio method (Wickremesinghe and Campanella 1991). For the WTMM method (Figure 19c), the clustering method (Figure 19f), and the T ratio method (Figure 19g-h), the dominant SBT number of each soil layer is shown. For the results of Zhang and Tumay (1999) (Figure 19d), the soil type (HPS, HPC, or HPM) is shown. For the results of Wang et al. (2013), the most probable SBT number of each soil layer is shown. A more detailed description of the settings for the different approaches (Figure 19c-h) can be found in Ching et al. (2015). Our result (Figure 19b) not only agrees with the results from the other methods, but also provides uncertainty quantification of the soil classification via the information entropy. The proposed approach, the WTMM method, the clustering method, and the T ratio method (with a small window size) are all sensitive to thin layers and capable of detecting internal boundaries in a single run. The clustering method is extremely sensitive, and hence has some difficulty producing consistent layers in the depth ranges of [0, 1] m and [6.5, 8] m. The proposed approach, the Bayesian method, and the clustering method take information from both Qt and Fr. The fuzzy method operates on a normally distributed soil classification index. The WTMM and T ratio methods analyze the Ic (i.e., the SBT index (Robertson and Cabal 2010)) profile. A recent study on Ic-based probabilistic soil stratification (Cao et al. 2018) demonstrates that the soil stratification can also be identified reasonably well from the Ic profile. Moreover, that study explicitly quantifies the identification uncertainty of soil stratification, specifically the uncertainty in soil layer boundaries, by using the Ic profile under a general Bayesian framework. Hence Ic-based probabilistic soil stratification methods can be alternatives to the proposed method. The clustering method only considers the similarity in feature space. The Bayesian method takes both spatial and statistical similarity into consideration. However, the primary layers (the most probable stratification result as shown in Figure 19) and possible internal layers (not emphasized in this example) are detected in different runs by gradually zooming in on local differences with improved resolution. Each run requires solving high-dimensional integrals and a high-dimensional non-convex optimization problem. Thus, as mentioned in Ching et al. (2015), a stratification problem with many potential layers (especially with many primary layers) can be computationally challenging. The proposed approach detects both primary layers and internal layers within a single run. In addition, equipped with the parallel Gibbs sampler and explicit expressions for calculating the likelihood function, the sampling process of the proposed approach is efficient and converges very fast. The following example provides further evidence of the computational efficiency of the proposed method compared with the Bayesian method.

Lukang, Taiwan case

To further compare the proposed method with the WTMM method, the Bayesian method, the clustering method, and the T ratio method, a CPT dataset collected in Lukang Township in Changhua County (Taiwan) is used. The same dataset has also been analyzed in Ching et al. (2015). The site is located on reclaimed land; the entire depth is around 40 m and the groundwater table is approximately 2 m below the current ground surface. The first 4 to 5 m is artificial hydraulic fill. The profiles of $\log_{10} F_r$ and $\log_{10} Q_t$ as well as the data points on the Robertson chart are shown in Figure 20.
The CPT log is discretized into 808 soil elements with a 0.05 m depth interval. Five clusters are found to be optimal (Figure 21a). The estimated mixture parameters are listed in Table 6 and visualized in Figure 21c. The soil segments and the corresponding identified stratification are shown in Figure 21b. It can be noticed that Clusters 1, 2, and 4 overlap considerably in feature space and are also mixed together in physical space. This increases the uncertainty and causes frequent fluctuation of the information entropy from approximately 17 m to 40 m; hence numerous potential thin layers are detected within this depth range. Overall, the stratification identified by the proposed approach is generally consistent with the boring log in Figure 22a. As mentioned in Ching et al. (2015), when applying the Bayesian method, the CPT log was divided into two chunks (0 to 20 m and 20 to 40 m) because of the computational cost. However, there is no such problem when applying the WTMM method and the proposed approach, since these two approaches do not need to solve high-dimensional integrals or high-dimensional non-convex optimization problems multiple times. The results from the first three methods (i.e., the proposed approach, the WTMM method, and the Bayesian method) generally agree with each other, except that the proposed approach identifies all potential thin layers, which are only partially identified by the WTMM method and partially identified by the Bayesian method. In addition, three thin layers having SBT 4 are detected by the proposed approach, whereas the WTMM method and the Bayesian method do not consider them primary layers. The information entropy indicates that the inferred soil states of these thin layers are highly uncertain, which means the evidence supporting the existence of these thin layers is not conclusive. The performance of the other two methods (i.e., the clustering method and the T ratio method) is generally not satisfactory, as they both need a fixed scale parameter (i.e., the number of clusters or the window size); hence a trade-off between capturing fine details and avoiding spurious layer boundaries always exists (Ching et al. 2015). The WTMM method is multiscale in nature and its result is controlled by the interpretation of the WTMM ridges. By checking the L and M parameters that characterize a WTMM ridge, it essentially transforms a soil profile interpretation problem into a linear discriminant analysis in a two-dimensional L-M space. A logistic regression is performed using L-M points from fifty manually interpreted real CPT logs to obtain a binary (i.e., jump/not jump) classifier. Therefore, the WTMM method can be considered a supervised method which needs considerable training data from manual interpretation. In contrast, the Bayesian method, the clustering method, and the proposed approach are unsupervised learning methods which do not require training information.
From this example, it can be noticed that one potential issue of the proposed method is the possible misinterpretation of very thin soil layers. Conceptually speaking, if the identified soil layer is very thin (say, containing fewer than 10 consecutive CPT data points), the statistical uncertainty will be significant. Under this condition, the existence of the potential thin layers will be difficult to justify owing to the high level of information entropy. Moreover, we have modeled the feature pairs as correlated random variables rather than a vector random field. This idealized assumption may result in possibly biased estimation of both spatial and statistical patterns due to the ergodicity problem. Besides, even with more advanced and realistic model assumptions, recent studies by Ching and co-workers (Ching and Phoon 2017; Ching et al. 2017) have shown that there are major identifiability issues in random field parameters. Therefore, although the integrated implementation is automatic and fully unsupervised, the performance of the proposed method in identifying very thin layers should not be overestimated, and more thorough investigations on this point are still needed.
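One simple way to act on this caveat is to flag segments that contain fewer than a chosen number of elements so that they can be reviewed against the information entropy. The sketch below uses the threshold of 10 elements suggested in the text as an example; the function name, the tuple layout of a segment, and the toy segments are assumptions for illustration.

```python
def thin_segments(segments, min_elements=10):
    """Return segments with fewer than min_elements elements.

    segments: (start_index, end_index_exclusive, state) tuples.
    """
    return [seg for seg in segments if seg[1] - seg[0] < min_elements]


# Hypothetical segmentation of a log with a 0.05 m element size.
segments = [(0, 80, 0), (80, 86, 2), (86, 200, 1)]
flagged = thin_segments(segments)
thickness_m = [(end - start) * 0.05 for start, end, _ in flagged]
```

Here only the six-element segment, about 0.3 m thick, is flagged for closer inspection.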

Summary and conclusions
In this work, we present a Bayesian unsupervised learning approach for identifying soil stratification using cone penetration data. The proposed approach, in essence, consists of two parts: 1) a pattern detection approach using a Bayesian inferential framework, and 2) a pattern interpretation protocol using the Robertson chart. The first part is the mathematical core of the proposed approach, which infers both the spatial pattern in physical space and the statistical pattern in feature space from the input dataset; the second part converts the abstract patterns into intuitive spatial configurations of multiple soil layers with different soil behavior types. The advantages of the proposed approach include probabilistic soil classification and the simultaneous consideration of the spatial consistency and statistical similarity of soil elements. Although the proposed approach is mathematically sophisticated, the integrated implementation is automatic and fully unsupervised.

The proposed approach has been validated by two numerical studies. The validation results indicate that the proposed approach can accurately extract both the spatial pattern and the statistical pattern under the conditional independence condition. In the presence of moderate vertical correlation (e.g., $\lambda_n \le 2$ m), which is common in real-world cases (Phoon and Kulhawy 1999), biased estimation may occur owing to the ergodicity issue introduced by the joint effect of vertical correlation and limited layer depth. In this case, the estimated parameters are considered to be reasonably accurate for describing a given realization. The identified soil stratification is still considerably accurate, as neither the local consistency in physical space nor the statistical separability in feature space is compromised too much.
The proposed approach has been applied to two real-world cases and compared with five different methods. Four merits of the proposed approach can be identified: 1) it provides uncertainty quantification of the soil stratification; 2) similar to the WTMM method and the Bayesian method, it is sufficiently sensitive to thin layers and yet does not detect spurious layer boundaries; 3) similar to the Bayesian method, it operates simultaneously on the physical space and the feature space; 4) the computational cost is higher than that of the WTMM method but considerably lower than that of the Bayesian method, and the classification process generally converges very fast. The last merit results from formulating the CPT-based soil stratification problem as an “unsupervised learning” problem under the machine learning-based Bayesian framework, so that advanced probabilistic models (e.g., the HMRF model) and algorithms (e.g., the chromatic sampler) can be integrated and applied to solve the problem. It has been demonstrated that the trend from the original Bayesian framework to the machine learning-based Bayesian framework is very promising for interpreting geotechnical site investigation data.

However, the proposed approach has the following potential limitations. First, the conditional independence assumption may be too strong, as a CPT sounding record generally exhibits vertical correlation; second, although the extracted soil segments in physical space and the corresponding clustered pattern in feature space have explicit statistical meaning, they may be less intuitive for practicing engineers; third, the performance of the proposed method in identifying very thin layers may be less satisfactory, and more thorough investigations on this point are still needed.

Acknowledgement
The authors would like to thank Dr. J. Ching for providing the source code of the WTMM method and the dataset of the Lukang case. The authors would also like to thank Tianqi Zhang, M.Sc., for conducting part of the coding work in implementing the developed method in Python 3.6. The editors and two anonymous reviewers are greatly appreciated for their constructive comments, which have helped to improve the paper significantly.

631 References
er
tv
632 Alvarez, I., Niemi, J., and Simpson, M. 2014. Bayesian inference for a covariance matrix. arXiv
in
633 preprint arXiv:1408.4050.
pr
634 Barnard, J., McCulloch, R., and Meng, X.-L. 2000. Modeling covariance matrices in terms of
635 standard deviations and correlations, with application to shrinkage. Statistica Sinica:
e-
636 1281-1311.
pr
637 Besag, J. 1974. Spatial interaction and the statistical analysis of lattice systems. Journal of the
’
638 Royal Statistical Society. Series B (Methodological), 36(2): 192-236.
rs
639 Besag, J. 1986. On the statistical analysis of dirty pictures. Journal of the Royal Statistical
ho
640 Society, 48(3): 259-302.
ut
641 Biernacki, C., Celeux, G., and Govaert, G. 2000. Assessing a mixture model for clustering with
(a
642 the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine
643 Intelligence, 22(7): 719-725.
644 al
Bowman, A.W., and Azzalini, A. 1997. Applied smoothing techniques for data analysis: the
rn
645 kernel approach with S-Plus illustrations. OUP Oxford.
ou
646 Bozdogan, H. 1987. Model selection and Akaike's information criterion (AIC): The general
theory and its analytical extensions. Psychometrika, 52(3): 345-370.
lJ
647
648 Briaud, J.-L. 2000. The national geotechnical experimentation sites at Texas A&M University:
ca
649 clay and sand. Geotechnical Special Publication: 26-51.

ni
650 Cao, Z.-J., Zheng, S., Li, D., and Phoon, K.-K. 2018. Bayesian Identification of Soil Stratigraphy
ch
651 based on Soil Behaviour Type Index. Canadian geotechnical journal,(ja).

652 Cao, Z., and Wang, Y. 2012. Bayesian approach for probabilistic site characterization using cone
te
653 penetration tests. Journal of Geotechnical and Geoenvironmental Engineering, 139(2):

eo
654 267-276.
G
655 Celeux, G., and Govaert, G. 1995. Gaussian parsimonious clustering models. Pattern
n
656 Recognition, 28(5): 781-793.

a
657 Celeux, G., Forbes, F., and Peyrard, N. 2003. EM procedures using mean field-like
di
658 approximations for Markov model-based image segmentation. Pattern Recognition, 36(1):
na
659 131-144.
Ca
660 Ching, J., and Phoon, K.-K. 2017. Characterizing uncertain site-specific trend function by sparse
661 Bayesian learning. Journal of Engineering Mechanics, 143(7): 04017028.
in
662 Ching, J., Wang, J.-S., Juang, C.H., and Ku, C.-S. 2015. Cone penetration test (CPT)-based
663 stratigraphic profiling using the wavelet transform modulus maxima method. Canadian
ed
664 geotechnical journal, 52(12): 1993-2007.

ish
665 Ching, J.Y., Phoon, K.K., Beck, J.L., and Huang, Y. 2017. On the identification of geotechnical
666 site-specific trend functions. ASCE-ASME Journal of Risk and Uncertainty in
bl
667 Engineering Systems, Part A: Civil Engineering, 3(4): 04017021.

Pu
668 Das, S.K., and Basudhar, P.K. 2009. Utilization of self-organizing map and fuzzy clustering for
669 site characterization using piezocone data. Computers and Geotechnics, 36(1): 241-248.
30
Depina, I., Le, T.M.H., Eiksund, G., and Strøm, P. 2016. Cone penetration data classification with Bayesian Mixture Analysis. Georisk: Assessment and Management of Risk for Engineered Systems and Geohazards, 10(1): 27-41.
Forbes, F., and Peyrard, N. 2003. Hidden Markov random field model selection criteria based on mean field-like approximations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9): 1089-1101.
Fraley, C., and Raftery, A.E. 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8): 578-588.
Fraley, C., and Raftery, A.E. 2002. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458): 611-631.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., and Rubin, D.B. 2014. Bayesian data analysis. CRC Press, Boca Raton, FL.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6): 721-741.
Hegazy, Y.A., and Mayne, P.W. 2002. Objective site characterization using clustering of piezocone data. Journal of Geotechnical and Geoenvironmental Engineering, 128(12): 986-996.
Huang, A., and Wand, M.P. 2013. Simple marginally noninformative prior distributions for covariance matrices. Bayesian Analysis, 8(2): 439-452.
Koller, D., and Friedman, N. 2009. Probabilistic graphical models: principles and techniques. MIT Press.
Liao, T., and Mayne, P. 2007. Stratigraphic delineation by three-dimensional clustering of piezocone data. Georisk, 1(2): 102-119.
Lunne, T., Robertson, P., and Powell, J. 1997. Cone penetration testing. Geotechnical Practice.
MathWorks. 2014. Statistics and Machine Learning Toolbox.
McLachlan, G., and Peel, D. 2004. Finite mixture models. John Wiley & Sons, Hoboken, N.J.
McLachlan, G.J., and Basford, K.E. 1988. Mixture models. Inference and applications to clustering. Statistics: Textbooks and Monographs, New York: Dekker, 1988, 1.
McLachlan, G.J., and Krishnan, T. 2007. The EM algorithm and extensions. Wiley-Interscience, New York.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., and Dubourg, V. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct): 2825-2830.
Phoon, K.-K., and Kulhawy, F.H. 1999. Characterization of geotechnical variability. Canadian Geotechnical Journal, 36(4): 612-624.
Phoon, K.-K., Quek, S.-T., and An, P. 2003. Identification of statistically homogeneous soil layers using modified Bartlett statistics. Journal of Geotechnical and Geoenvironmental Engineering, 129(7): 649-659.
Robertson, P. 1990. Soil classification using the cone penetration test. Canadian Geotechnical Journal, 27(1): 151-158.
Robertson, P. 2009. Interpretation of cone penetration tests - a unified approach. Canadian Geotechnical Journal, 46(11): 1337-1355.
Robertson, P., and Cabal, K. 2010. Guide to cone penetration testing for geotechnical engineering. Gregg Drilling and Testing Inc., USA: 6-15.
Wang, H., Wellmann, J.F., Li, Z., Wang, X., and Liang, R.Y. 2016. A Segmentation Approach for Stochastic Geological Modeling Using Hidden Markov Random Fields. Mathematical Geosciences: 1-33.
Wang, Y., Au, S.-K., and Cao, Z. 2010. Bayesian approach for probabilistic characterization of sand friction angles. Engineering Geology, 114(3): 354-363.
Wang, Y., Huang, K., and Cao, Z. 2013. Probabilistic identification of underground soil stratification using cone penetration tests. Canadian Geotechnical Journal, 50(7): 766-776.
Wellmann, J.F. 2013. Information Theory for Correlation Analysis and Estimation of Uncertainty Reduction in Maps and Models. Entropy, 15(4): 1464-1485.
Wellmann, J.F., and Regenauer-Lieb, K. 2012. Uncertainties have a meaning: Information entropy as a quality measure for 3-D geological models. Tectonophysics, 526: 207-216.
Wickremesinghe, D., and Campanella, R. 1991. Statistical methods for soil layer boundary location using the cone penetration test. Proc. ICASP6, Mexico City, 2: 636-643.
Yuen, K.V., and Mu, H.Q. 2011. Peak ground acceleration estimation by linear and nonlinear models with reduced order Monte Carlo simulation. Computer-Aided Civil and Infrastructure Engineering, 26(1): 30-47.
Zhang, Z., and Tumay, M.T. 1999. Statistical to fuzzy approach toward CPT soil classification. Journal of Geotechnical and Geoenvironmental Engineering, 125(3): 179-186.

Algorithms

Algorithm 1: Parallel Gibbs sampler

Input: 2-colored MRF (Wang et al. 2016), $y$, $\Phi^{(t-1)} = (\mu^{(t-1)}, \Sigma^{(t-1)}, \beta^{(t-1)})$, $x^{(t-1)}$
for the 2 colors $\kappa_i$, $i \in \{1, 2\}$ do
    parafor all elements in the $i$-th color, $j \in S_{\kappa_i}$
        calculate the local a posteriori distribution $p(X_j \mid x_{\partial j}^{(t-1)}, y_j; \mu^{(t-1)}, \Sigma^{(t-1)}, \beta^{(t-1)})$;
        draw candidate $x_j^{(t)} \sim p(X_j \mid x_{\partial j}^{(t-1)}, y_j; \mu^{(t-1)}, \Sigma^{(t-1)}, \beta^{(t-1)})$;
    end parafor
end for
Return $x^{(t)}$

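As an illustration of Algorithm 1, the following minimal sketch performs one two-color sweep on a one-dimensional chain. It rests on assumptions that are not stated in the algorithm box: a Potts-type prior in which each neighbor with a different label adds an energy $\beta$, nearest-neighbor interaction along depth only, and Gaussian emission densities. The function names `gaussian_logpdf` and `gibbs_sweep` are hypothetical, and NumPy vectorization stands in for the `parafor` loop. The second color is drawn conditional on the freshly updated first color, as is usual for a chromatic Gibbs sampler.

```python
import numpy as np

def gaussian_logpdf(y, mu, cov):
    """Log-density of N(mu, cov) evaluated at every row of y."""
    d = y.shape[1]
    diff = y - mu
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + d * np.log(2.0 * np.pi))

def gibbs_sweep(x, y, mus, covs, beta, rng):
    """One two-color Gibbs sweep over a 1-D chain (even sites, then odd sites)."""
    n, k = len(x), len(mus)
    labels = np.arange(k)
    # Emission log-likelihood of every observation under every label: shape (n, k).
    loglik = np.column_stack([gaussian_logpdf(y, mus[l], covs[l]) for l in range(k)])
    x = x.copy()
    for color in (0, 1):
        sites = np.arange(color, n, 2)
        # Assumed Potts energy: beta for each neighbor whose label differs.
        energy = np.zeros((len(sites), k))
        has_up = sites > 0
        has_down = sites < n - 1
        energy[has_up] += beta * (x[sites[has_up] - 1][:, None] != labels)
        energy[has_down] += beta * (x[sites[has_down] + 1][:, None] != labels)
        # Local posterior, normalized in a numerically stable way.
        logp = loglik[sites] - energy
        logp -= logp.max(axis=1, keepdims=True)
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)
        # Inverse-CDF draw for all sites of this color at once.
        u = rng.random(len(sites))
        draws = (p.cumsum(axis=1) < u[:, None]).sum(axis=1)
        x[sites] = np.minimum(draws, k - 1)
    return x
```

Sites of the same color share no neighbors on a chain, so their local posteriors can be evaluated and sampled simultaneously; this is the property the two-coloring exploits.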
Algorithm 2: Bayesian parameter estimation using M-H algorithm

Input: $y$, $\Phi^{(t-1)} = (\mu^{(t-1)}, \Sigma^{(t-1)}, \beta^{(t-1)})$, $x^{(t)}$
for the label $l \in \{1, 2, \ldots, k\}$ do
    propose $\mu_l^{*} = \mu_l^{(t-1)} + \varepsilon_{jump}$, where $\varepsilon_{jump} \sim N(0, \sigma_{jump1})$;
    compare $\mu_l^{(t-1)}$ and $\mu_l^{*}$ by evaluating Equation (8);
    accept/reject $\mu_l^{*}$ and update $\mu_l^{(t)}$;
end for
for the label $l \in \{1, 2, \ldots, k\}$ do
    eigen decomposition $\Sigma_l^{(t-1)} = V_l^{(t-1)} D_l^{(t-1)} V_l^{(t-1)T}$;
    propose $\Sigma_l^{*} = V_l^{*} D_l^{*} V_l^{*T}$, with $V_l^{*} = R(\phi_{jump}) V_l^{(t-1)}$ and $D_l^{*} = D_l^{(t-1)} + \mathrm{diag}(\gamma_{jump})$, where $\phi_{jump} \sim N(0, \sigma_{jump2})$ and $\gamma_{jump} \sim N(0, \sigma_{jump3})$;
    compare $\Sigma_l^{*}$ and $\Sigma_l^{(t-1)}$ by evaluating Equation (9);
    accept/reject $\Sigma_l^{*}$ and update $\Sigma_l^{(t)}$;
end for
propose $\beta^{*} = \beta^{(t-1)} + \delta_{jump}$, where $\delta_{jump} \sim N(0, \sigma_{jump4})$;
compare $\beta^{*}$ and $\beta^{(t-1)}$ by evaluating Equation (10);
accept/reject $\beta^{*}$ and update $\beta^{(t)}$;
Return $\Phi^{(t)} = (\mu^{(t)}, \Sigma^{(t)}, \beta^{(t)})$

Note: $V_l$ is an orthogonal matrix and $D_l$ is a diagonal matrix whose diagonal entries are the eigenvalues of $\Sigma_l$. $R(\phi_{jump})$ is a rotation matrix, defined as
$R(\phi_{jump}) = \begin{pmatrix} \cos(\phi_{jump}) & -\sin(\phi_{jump}) \\ \sin(\phi_{jump}) & \cos(\phi_{jump}) \end{pmatrix}$

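The covariance proposal in Algorithm 2 can be sketched as follows for the two-feature case. This is only an illustration under stated assumptions: the jump parameters are treated as standard deviations, a proposal with a non-positive eigenvalue is rejected outright, and the acceptance test is written as a plain posterior ratio for a symmetric proposal rather than the specific form of Equations (8) to (10). The names `rotation`, `propose_covariance` and `mh_accept` are hypothetical.

```python
import numpy as np

def rotation(phi):
    """2-D rotation matrix R(phi)."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s], [s, c]])

def propose_covariance(cov, sigma_phi, sigma_gamma, rng):
    """Perturb a 2x2 covariance by rotating eigenvectors and shifting eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(cov)            # cov = V D V^T
    v_new = rotation(rng.normal(0.0, sigma_phi)) @ eigvecs
    d_new = eigvals + rng.normal(0.0, sigma_gamma, size=2)
    if np.any(d_new <= 0.0):
        return None                                   # not positive definite (assumed handling)
    return v_new @ np.diag(d_new) @ v_new.T

def mh_accept(log_post_new, log_post_old, rng):
    """Metropolis acceptance test in log space, assuming a symmetric proposal."""
    return np.log(rng.random()) < log_post_new - log_post_old
```

Working with the eigen-decomposition keeps every accepted proposal symmetric, and positive definiteness only requires checking the sign of the perturbed eigenvalues.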
Table 1. Description of soil types in Robertson's soil classification chart (after Robertson 1990)

Zone  Soil description
1     Sensitive, fine-grained
2     Organic soils (peats)
3     Clays (clay to silty clay)
4     Silt mixtures (clayey silt to silty clay)
5     Sand mixtures (silty sand to sandy silt)
6     Sands (clean sand to silty sand)
7     Gravelly sand to sand
8     Very stiff sand to clayey sand
9     Very stiff, fine-grained

Table 2. Default setting for hyperparameters

Parameter       Value
$\eta_l$        $\mu_l^{(0)}$
$\Xi_l$         $100\, I_d$
$\upsilon$      $d + 1$
$b_l$           $\log(\sigma_l^{(0)})$
$\xi$           100
$\mu_\beta$     $\beta^{(0)}$
$\sigma_\beta$  100

Note: $d$ is the number of features; $I_d$ is the $d$-dimensional identity matrix.

Table 3. Estimated parameters of the conditional independent case

Segment  Soil   Parameter  True value           Mean                 Std.                 |z-score|
ID       state             log10 Fr  log10 Qt   log10 Fr  log10 Qt   log10 Fr  log10 Qt   log10 Fr  log10 Qt
1        1      μ1         -0.0969   2.0308     -0.1003   1.9868     0.0107    0.0285     0.3170    1.5451
                σ1         0.1       0.25       0.0993    0.2517     0.0053    0.0067     0.1285    0.2559
                ρ1         0.3                  0.4385               0.0767               1.8063
2        2      μ2         -0.3010   1.3046     -0.2969   1.2962     0.0061    0.0096     0.6712    0.8657
                σ2         0.07      0.1        0.0712    0.1073     0.0036    0.0070     0.3327    1.0389
                ρ2         0.2                  0.2991               0.0976               1.0152
3        3      μ3         -0.0458   1.4862     -0.0324   1.4892     0.0107    0.0100     1.2463    0.2977
                σ3         0.1       0.1        0.1081    0.1038     0.0068    0.0065     1.1884    0.5918
                ρ3         0.5                  0.4021               0.0436               2.2430
                β          N/A                  80.68                59.87                N/A

Note: |z-score| = |(True value - Mean)/Std.|. ρ and β are scalars, so a single value is reported per column group.

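The |z-score| definition in the note to Table 3 can be checked against the tabulated entries with a few lines of Python. The helper name `abs_z_score` is hypothetical, and the inputs are the rounded values printed in the μ1 row, so the recomputed scores are expected to match the tabulated 0.3170 and 1.5451 only to within about 0.002.

```python
def abs_z_score(true_value, mean, std):
    """|z-score| as defined in the note to Table 3."""
    return abs((true_value - mean) / std)

# Row mu_1 of Table 3, using the rounded table entries.
z_fr = abs_z_score(-0.0969, -0.1003, 0.0107)   # log10 Fr column
z_qt = abs_z_score(2.0308, 1.9868, 0.0285)     # log10 Qt column
print(z_fr, z_qt)
```

A score below roughly 2 indicates that the true value lies within about two posterior standard deviations of the posterior mean.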
Table 4. Estimated parameters of the vertically correlated case

Segment  Soil   Parameter  True value           Mean                 Std.                 |z-score|
ID       state             log10 Fr  log10 Qt   log10 Fr  log10 Qt   log10 Fr  log10 Qt   log10 Fr  log10 Qt
1        2      μ2         -0.0969   2.0308     -0.0484   2.0307     0.0109    0.0253     4.4457    0.0034
                σ2         0.1       0.25       0.0951    0.2110     0.0064    0.0163     0.7649    2.3847
                ρ2         0.3                  0.3378               0.1040               0.3629
2        1      μ1         -0.3010   1.3046     -0.3325   1.2701     0.0045    0.0062     7.0421    5.5543
                σ1         0.07      0.1        0.0534    0.0660     0.0028    0.0023     5.8903    14.6243
                ρ1         0.2                  -0.1073              0.0781               3.9339
3        3      μ3         -0.0458   1.4862     -0.0290   1.5354     0.0095    0.0084     1.7741    5.8881
                σ3         0.1       0.1        0.0937    0.0834     0.0050    0.0040     1.2492    4.1789
                ρ3         0.5                  0.0657               0.0766               5.6678
                β          N/A                  81.09                57.96                N/A

Table 5. Estimated parameters of the NGES case

Soil   Parameter  Mean                 Std.
state             log10 Fr  log10 Qt   log10 Fr  log10 Qt
1      μ1         0.30      1.45       0.024     0.063
       σ1         0.08      0.21       0.010     0.0062
       ρ1         -0.87                0.036
2      μ2         0.34      1.79       0.0085    0.018
       σ2         0.051     0.096      0.0028    0.0039
       ρ2         0.21                 0.16
3      μ3         0.42      2.37       0.043     0.052
       σ3         0.19      0.21       0.0090    0.0076
       ρ3         -0.95                0.0028
4      μ4         0.47      1.60       0.0075    0.0088
       σ4         0.07      0.081      0.0037    0.0037
       ρ4         -0.71                0.016
5      μ5         0.63      1.05       0.0063    0.027
       σ5         0.035     0.15       0.0017    0.0034
       ρ5         0.25                 0.15
6      μ6         0.73      1.44       0.0061    0.0081
       σ6         0.052     0.059      0.0029    0.0029
       ρ6         -0.55                0.065
7      μ7         0.78      1.69       0.0092    0.040
       σ7         0.038     0.16       0.0025    0.0035
       ρ7         0.0018               0.27
       β          22.58                9.22

Table 6. Estimated parameters of the Lukang case

Soil   Parameter  Mean                 Std.
state             log10 Fr  log10 Qt   log10 Fr  log10 Qt
1      μ1         0.48      0.96       0.025     0.030
       σ1         0.21      0.25       0.0094    0.0060
       ρ1         -0.82                0.0075
2      μ2         -0.28     1.75       0.0058    0.0059
       σ2         0.12      0.11       0.0060    0.0048
       ρ2         -0.48                0.020
3      μ3         0.43      0.51       0.020     0.017
       σ3         0.18      0.15       0.0068    0.0081
       ρ3         0.22                 0.074
4      μ4         -0.066    1.44       0.029     0.019
       σ4         0.28      0.16       0.0054    0.0068
       ρ4         -0.57                0.038
5      μ5         -0.20     2.01       0.033     0.043
       σ5         0.18      0.27       0.014     0.011
       ρ5         0.82                 0.021
       β          4.44                 0.61

Figure 1. Robertson soil classification chart (after Robertson 1990 and Wang et al. 2013)
[Log-log chart with zones 1 to 9. Horizontal axis: normalized friction ratio Fr, from 10^-1 to 10^1. Vertical axis: normalized tip resistance Qt, from 10^0 to 10^3.]

Figure 2. Sketch diagram of a one-dimensional hidden Markov random field model for clustering CPT data
[Diagram: in physical space, a latent field x1, x2, ..., xs-1, xs arranged along depth with a local neighborhood system; in feature space, the observations ys = (log10 Frs, log10 Qts) grouped into cluster 1 and cluster 2.]

Figure 3. Plate diagram for the proposed hidden Markov random field model.
[Plate diagram with plates of size N and K.]

Figure 4. Illustration of the spatial pattern and statistical pattern of a simple configuration, and the corresponding stratification.
[Diagram: the spatial pattern shows Segments 1, 2 and 3 along depth, separated by a layer boundary and an internal boundary; the statistical pattern groups {Segment 1, Segment 3} into cluster 1 and {Segment 2} into cluster 2.]
Figure 5. Flowchart for the proposed Bayesian unsupervised approach.
[Flowchart boxes, in order: load the data and depth; construct the neighborhood system using depth coordinates, and determine the number of clusters k using BIC; define hyperparameters, set the number of iterations N and the initial values; draw samples using the parallel Gibbs sampler and the M-H algorithm iteratively; quantify uncertainty using the samples and report the MAP estimate; determine the SBT number for each extracted soil segment; determine all layer boundaries, internal boundaries and SBT numbers for each layer.]

Figure 6. Synthetic soil profile with two SBT layers but three soil segments corresponding to three soil states.
[Diagram labels, top to bottom: Segment 1 (4 m, SBT 6); layer boundary; Segment 2 (6 m, SBT 5); internal boundary; Segment 3 (5 m).]

Figure 7. A set of simulated CPT logs and the corresponding point clouds in feature space (conditional independent case)

Figure 8. Extracted patterns using (a) the optimal number of clusters = 3 in (b) physical space and (c) feature space. The stratification in (b) is interpreted based on (d) the dominant SBT number of each soil segment. Note: Cn = Cluster n in (c). (conditional independent case)

Figure 9. (a) Realizations of all density parameters for three soil clusters and (b) the corresponding posterior distributions (conditional independent case)
[Panel (a): trace plots against iterations (1000 to 5000); legend: soil cluster 1, soil cluster 2, soil cluster 3.]

Figure 10. The MCR curve during the sampling process (conditional independent case).

Figure 11. Boxplot of parameter estimators and histogram of MCR for 100 synthetic datasets (conditional independent case)
[Boxplot panels, each by Segment 1 to 3: mean of log10 Fr, mean of log10 Qt, SD of log10 Fr, SD of log10 Qt, correlation coefficient. Histogram panel: probability against MCR.]

Figure 12. A set of simulated CPT logs and the corresponding point clouds in feature space (vertically correlated case).

Figure 13. Extracted patterns using (a) the optimal number of clusters = 3 in (b) physical space and (c) feature space. The stratification in (b) is interpreted based on (d) the dominant SBT number of each soil segment (vertically correlated case)
[Panel (a): model selection curve against the number of clusters (2 to 10).]

Figure 14. (a) Realizations of all density parameters for three soil clusters and (b) the corresponding posterior distributions (vertically correlated case)
[Panel (a): trace plots against iterations (1000 to 5000). Panel (b): posterior distributions of the mean of log10 Fr, mean of log10 Qt, SD of log10 Fr, SD of log10 Qt and correlation coefficient.]

Figure 15. The MCR curve during the sampling process (vertically correlated case).

Figure 16. Boxplot of parameter estimators and histogram of MCR for 100 synthetic datasets (vertically correlated case)
[Boxplot panels, each by Segment 1 to 3: mean of log10 Fr, mean of log10 Qt, SD of log10 Fr, SD of log10 Qt, correlation coefficient. Histogram panel: probability against MCR.]

Figure 17. NGES data: (a) log10 Fr profile; (b) log10 Qt profile; (c) Robertson chart.

Figure 18. NGES analysis results: extracted patterns using (a) the optimal number of clusters = 7 in (b) physical space and (c) feature space; the stratification in (b) is based on (d) the dominant SBT number of each soil segment

Figure 19. Comparisons for the NGES stratification results: (a) the borehole log; (b) the result from the proposed approach; (c) the results of Ching et al. (2015); (d) the results of Zhang and Tumay (1999); (e) the results of Wang et al. (2013); (f) the results obtained based on clustering with nc = 40; (g) the results obtained based on the T ratio method with a 1 m window; (h) the results obtained based on the T ratio method with a 2 m window.
[Column headings: This study; Ching et al. (2015); Zhang & Tumay (1999); Wang et al. (2013); Clustering (nc = 40); T ratio method (window = 1 m); T ratio method (window = 2 m). SBT* labels for this study, in order of appearance: 6, 3, 5, 4, 5. Panel (b) also plots information entropy on a scale from 0 to 1.]
Note: (a, c-h) are reproduced from Ching et al. (2015). HPS stands for highly probable sandy soil, HPC for highly probable clayey soil, and HPM for highly probable mixed soil. In this study, SBT* indicates the dominant SBT number of each soil layer.

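Panel (b) of Figure 19 reports information entropy along depth. A minimal sketch of how such a profile could be computed from posterior samples of the soil-state labels is given below. It assumes Shannon entropy of the empirical label frequencies at each depth; dividing by log(k) to map the result onto [0, 1] is an assumption suggested by the 0 to 1 axis of the figure, not a detail confirmed here. The function name `label_entropy` is hypothetical.

```python
import numpy as np

def label_entropy(label_samples, k, normalise=True):
    """Per-site Shannon entropy of sampled labels.

    label_samples: integer array of shape (n_samples, n_sites), labels in 0..k-1.
    k: number of soil states (k >= 2 when normalise is True).
    """
    samples = np.asarray(label_samples)
    # Empirical probability of each label at each site: shape (k, n_sites).
    probs = np.stack([(samples == l).mean(axis=0) for l in range(k)])
    # Treat 0 * log(0) as 0 by replacing zero probabilities inside the log.
    safe = np.where(probs > 0.0, probs, 1.0)
    h = -(probs * np.log(safe)).sum(axis=0)
    return h / np.log(k) if normalise else h
```

Sites where every sample agrees have zero entropy, while sites near a boundary, where the sampled labels alternate, have entropy close to the maximum.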
Figure 20. CPT data for the Lukang case: (a) log10 Fr profile; (b) log10 Qt profile; (c) the Robertson chart.

Figure 21. Analysis results for the Lukang case: extracted patterns using (a) the optimal number of clusters = 5 in (b) physical space and (c) feature space; the stratification in (b) is based on the dominant SBT number of each soil segment (not shown)

Figure 22. Comparisons of stratification results for the Lukang case: (a) the borehole log; (b) the results obtained using the proposed approach; (c) the results of Ching et al. (2015); (d) the results of Wang et al. (2013); (e-g) the results obtained based on clustering with nc = 40, 80, and 200, respectively; (h-j) the results obtained based on the T ratio method with window size = 1 m, 2 m, and 5 m, respectively.
[Column headings: This study; Ching et al. (2015); Wang et al. (2013); Clustering (nc = 40); Clustering (nc = 80); Clustering (nc = 200); T ratio method (window = 1 m); T ratio method (window = 2 m); T ratio method (window = 5 m). Panel (b) also plots information entropy on a scale from 0 to 1.]
Note: (a, c-j) are reproduced from Ching et al. (2015). In this study, SBT* indicates the dominant SBT number of each soil layer.
