ARTICLE INFO

Keywords: Lithological group; Pattern recognition; Multivariate data; Machine learning; Sedimentary rocks

ABSTRACT

Specific computational tools assist geologists in identifying and sorting lithologies in well surveys and reducing operational costs and practical working time. This allows for the management of professional output, the efficient interpretation of data, and completion of scientific research on data collected in geologically distinct regions. Machine learning methods and applications integrate large sets of information with the goal of efficient pattern recognition and the capability of leveraging accurate decision making. The objective of this study is to apply machine learning methods to the supervised classification of lithologies using multivariate log parameter data from offshore wells from the International Ocean Discovery Program (IODP). According to the analysis of the lithologies proposed in the IODP Expeditions and for the application of our methods, the lithologies were divided into four groups. The IODP Expeditions were organized into four templates for better results in analyzing the set of expeditions and practical application of the methods. The templates were submitted to training, validation, and testing by multilayer perceptron (MLP), decision tree, random forest, and support vector machine (SVM) methods. The evaluation was randomly divided into training (70%), validation (10%), and testing (20%) using the classification methods as an evaluation of the results. In the results, it was observed that Template1 (IODP Expedition 362) obtained better results with the MLP method, Template2 (IODP Expeditions 354, 355, and 359) and Template3 (IODP Expeditions 354, 355, 359, and 362) obtained better results with the random forest method with greater than 80.00% accuracy. For cross-validation, the random forest method performed well in all scenarios. In the practical template, the G2 group obtained a better result with the MLP method with an average accuracy above 85.00%. It is expected that machine learning methods can help improve the study of geology with accurate and rapid answers related to interpreting collected data in different study regions.
* Corresponding author.
E-mail addresses: thiago.bressan@iffarroupilha.edu.br (T.S. Bressan), marcelo.k.souza@gmail.com (M. Kehl de Souza), tjgirelli@gmail.com (T.J. Girelli),
faridchemale@gmail.com (F.C. Junior).
1 Instituto Federal de Educação, Ciência e Tecnologia Farroupilha - IFFar
https://doi.org/10.1016/j.cageo.2020.104475
Received 2 October 2019; Received in revised form 6 March 2020; Accepted 18 March 2020
Available online 23 March 2020
0098-3004/© 2020 Elsevier Ltd. All rights reserved.
T.S. Bressan et al. Computers and Geosciences 139 (2020) 104475
This study differs by its integration of geology and neural networks in intelligent learning to assist professional geologists with practical work in the laboratory or the field. This paper applies the methods of machine learning in the supervised classification of lithologies using multivariate data of log parameters of offshore wells from the International Ocean Discovery Program (IODP).

2. Materials

mentary rocks, and identification of specific minerals in sedimentary rocks. Machine learning creates responses from the intensive processing of information, predicting highly reliable outputs for decision making (Raschka, 2015). The supervised learning method processes reference data input to create a model for the prediction of new data. For this training, the algorithm requires data in a standard format and type, as well as data with reliable and accurate values, extracted from relevant sources with the ability to improve feedback. In this study, we used a dataset organized from the division of lithologies into GP (group GP), G1 (group 1), G2 (group 2), and G3 (group 3) according to Table 1. These lithologies are linked to IODP Expeditions 349, 354, 355, 356, 359, 361, and 362 and organized into four templates according to Table 4. They were processed by the application of machine learning methods for supervised classification, including MLP, decision tree, random forest, and SVM.

2.1.1. Machine learning methods for the classification of lithology

In addition to the methods used in this work (MLP, decision tree, random forest, and SVM) for supervised data, lithological classification covers the use of several other important methods, such as naïve Bayes (Rosid et al., 2019; Kong et al., 2014), probabilistic neural networks (Al-Mudhafar, 2017a), logistic boosted regression (Al-Mudhafar, 2017a), kernel support vector machine (Al-Mudhafar, 2015, 2017b), support vector regression (Awad and Khanna, 2015), methods for unsupervised data, such as cluster analysis (Lee and Datta-Gupta, 1999; Pirrone et al., 2014; McCreery and Al-Mudhafar, 2017) and Gaussian mixture models (Wallet and Hardisty, 2019), and methods that integrate functions of math and multivariate statistics with supervised and unsupervised data such as principal components analysis (Lee and Datta-Gupta, 1999), linear discriminant analysis (Lee and Datta-Gupta, 1999; Al-Mudhafar, 2015c; Hong et al., 2004), multinomial logistic regression (Hong et al., 2004), singular value decomposition (Romppanem et al., 2017) and fuzzy logic (Devanand and Kumar, 2015; Orozco-del-Castillo et al., 2011), with application-specific characteristics and support for certain data types.

2.1.1.1. Decision tree. The decision tree is a practical, fast, and robust learning method for supervised inductive learning (Maimon and Rokach, 2010). It is a useful method in the process of previously unknown information extraction from the analysis of a large volume of data. Examples of applications that use a decision tree as a learning algorithm include landslides (Alkhasawneh et al., 2014), classification and identification of natural minerals (Akkas et al., 2015) and image classification (Loussaief and Abdelkrim, 2018).

A decision tree is essentially a series of if-else statements aligned through a structure of nodes and leaves. When applied to database records, it results in the classification of records and is a robust method for data with considerable noise, as well as nonstandard data (Sáez et al., 2013). Configurations such as maximum tree depth, number of features for the best split, maximum number of nodes, maximum number of leaves, and the function for division and choice of nodes can be defined

Table 1
Division of lithology into groups. Each group contains its composition models, lithological composition, and lithological code.

| Groups | Models | Lithological Composition | Lithology Code |
|---|---|---|---|
| GP | Litho1 | Very fine sand/sandstone, Fine sand/sandstone, Medium sand/sandstone, Coarse sand/sandstone, Sand, Sand/sandstone | 10 |
| G1 | Litho1 | Very fine sand/sandstone, Fine sand/sandstone, Medium sand/sandstone, Coarse sand/sandstone, Sand, Sand/sandstone | 10 |
| G1 | Litho2 | Silt/siltstone, Silty clay/claystone, Clay/claystone, Clay, Silt, Alternating silt/siltstone and clay/claystone layers, Clayey silt/siltstone | 14 |
| G1 | Litho3 | Calcareous silty clay/claystone, Calcareous silt/siltstone, Calcareous ooze, Chalk, Marlstone, Rudstone, Floatstone, Grainstone, Packstone, Wackestone, Boundstone, Limestone, Calcareous claystone | 15 |
| G2 | Litho1 | Very fine sand/sandstone, Fine sand/sandstone, Medium sand/sandstone, Coarse sand/sandstone, Sand, Sand/sandstone | 10 |
| G2 | Litho2 | Sand/sandstone-silt/siltstone-clay/claystone, Clayey sand/sandstone, Silty sand/sandstone, Alternating sand/sandstone and mud/mudstone layers, Sandy clay/claystone, Sandy silt/siltstone | 13 |
| G2 | Litho3 | Silt/siltstone, Silty clay/claystone, Clay/claystone, Clay, Silt, Alternating silt/siltstone and clay/claystone layers, Clayey silt/siltstone | 14 |
| G2 | Litho4 | Calcareous silty clay/claystone, Calcareous silt/siltstone, Calcareous ooze, Chalk, Marlstone, Rudstone, Floatstone, Grainstone, Packstone, Wackestone, Boundstone, Limestone, Calcareous claystone | 15 |
| G3 | Litho1 | Very fine sand/sandstone, Fine sand/sandstone, Medium sand/sandstone, Coarse sand/sandstone, Sand, Sand/sandstone | 10 |
| G3 | Litho2 | Alternating sand/sandstone and mud/mudstone layers, Sandy clay/claystone, Sandy silt/siltstone, Sand/sandstone-silt/siltstone-clay/claystone, Clayey sand/sandstone, Silty sand/sandstone | 11 |
| G3 | Litho3 | Alternating silt/siltstone and clay/claystone layers, Clayey silt/siltstone, Silty clay/claystone | 12 |
| G3 | Litho4 | Silt/siltstone, Silt | 13 |
| G3 | Litho5 | Clay/claystone, Clay | 14 |
| G3 | Litho6 | Calcareous silty clay/claystone, Calcareous silt/siltstone, Calcareous ooze, Chalk, Marlstone, Rudstone, Floatstone, Grainstone, Packstone, Wackestone, Boundstone, Limestone, Calcareous claystone | 15 |
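
The if-else view of a decision tree described above can be illustrated with a small hand-written sketch. The thresholds, the choice of three inputs (GRA, MS, L*), and the function name `classify_lithology` are hypothetical and are not taken from the study; only the lithology codes 10, 14, and 15 follow the G1 rows of Table 1. A trained tree would learn its own splits from the log data.

```python
# Hand-written stand-in for a trained decision tree.
# All thresholds below are hypothetical and chosen only for illustration.

def classify_lithology(gra, ms, lightness):
    """Return a group G1 lithology code (Table 1) from three log values."""
    if lightness > 60.0:      # root node: bright sediment, assumed carbonate-rich
        return 15             # leaf: Litho3 (calcareous lithologies)
    if gra > 1.9:             # internal node: denser sediment
        if ms > 50.0:         # internal node: higher magnetic susceptibility
            return 10         # leaf: Litho1 (sand/sandstone)
        return 14             # leaf: Litho2 (silt and clay)
    return 14                 # leaf: Litho2 (silt and clay)


# Each record follows one path from the root to a leaf.
for record in [(1.7, 20.0, 70.0), (2.0, 80.0, 40.0), (1.6, 10.0, 30.0)]:
    print(record, "->", classify_lithology(*record))
```

Limiting the nesting depth or the number of return statements in such a function corresponds to the maximum tree depth and maximum number of leaves mentioned in the text.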
$$H(E) = -\sum_{j=1}^{c} p_j \log p_j \tag{2}$$
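
As a minimal sketch of Eq. (2), the entropy of a set of lithology labels can be computed as follows. It assumes that p_j is the fraction of records in class j and that the logarithm is base 2 (the equation does not state the base); the function name `entropy` and the sample labels are illustrative only.

```python
import math


def entropy(labels, base=2.0):
    """Entropy H(E) of a list of class labels, following Eq. (2).

    p_j is taken as the fraction of records with label j.
    The base-2 logarithm is an assumption, not stated in the text.
    """
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = len(labels)
    return -sum((n / total) * math.log(n / total, base) for n in counts.values())


# Two equally frequent lithology codes give the maximum entropy for two classes.
mixed = entropy([10, 10, 14, 14])
# A node holding a single lithology code is pure and has zero entropy.
pure = entropy([15, 15, 15])
print(mixed, pure)
```

A split that lowers this value in the child nodes separates the lithology codes more cleanly, which is why entropy can serve as the function for choosing nodes.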
Fig. 2. Definition of the confusion matrix. The matrix is divided into real data and predicted data (rows and columns), combining true positive (TP) data, false positive (FP) data, false negative (FN) data and true negative (TN) data. Figure modified from Navin and Pankaja (2016).

false method performance because this metric calculates the average between the return of the classes of a dataset (Hossin and Sulaiman, 2015).

Precision (Eq. (4)) and recall (Eq. (5)) belong to the F1-score metric. Precision is calculated by the division of true positive values by the sum between true positive values and false positive values. The recall is calculated by the division of the true positive values by the sum between true positive and false negative values. Their equations are presented below:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$

F1-score or F-measure (Eq. (6)) is the harmonic mean between precision and recall. Its use is beneficial in dataset processing with diversified classes that are highly disproportionate. This equation is given below:

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$

The cross-validation method is used to evaluate the performance of the data. This method randomly partitions the total untrained dataset into k smaller groups of equal size (Haykin, 2009). Processing of the data is repeated k times until all groups are trained and tested. Processing returns are described through rating metrics such as accuracy, precision, recall, and F1-score according to Fig. 3. In this way, the entire set of available data is evaluated, returning precise classification of the data and integrating the various characteristics of data formation and grouping.

$$FPR = \frac{FP}{FP + TN} \tag{8}$$

The ROC curve plots TPR vs. FPR at different classification thresholds. Its multidimensional capability allows for better visualization of the result variables throughout the spectrum of the graph. The descending diagonal (0,1) represents the classification model that plays equally in both classes. Points belonging to the upper left triangle of this diagonal represent the best results, and points belonging to the lower right triangle represent the worst results. Its origin is related to the detection of signals and the evaluation of the transmission quality of a noise signal (Egan, 1975). ROC graphics are used in medicine (Tilaki-Hajian, 2013), economics (Gajowniczek et al., 2014), weather forecasting (Zhao et al., 2011), and geology (Vakhshoori and Zare, 2018; Chen and Wu, 2016; Airola et al., 2018).

2.2. Geological setting

This study is based on IODP Expeditions 349, 354, 355, 356, 359, 361, and 362. The holes were drilled in different regions of the Indian Ocean. As a result, there is a large amount of information resulting in a good data group for this study in machine learning. Further description of the study areas can be found in the supplementary material.

All of the expeditions described in the present study contain similar successions of sediments/sedimentary rocks from the Bengal-Nicobar Fan. Therefore, it is possible to perform a grouping of lithologies and to map a pattern between depth and lithology, as well as the distribution of sedimentary rocks in all of the sites and holes surveyed for the seven expeditions.

The grouping of lithologies seeks to organize the sets of lithologies present in the IODP Expeditions by separating the records, creating a wide combination of data and directions for the heterogeneous sedimentary rocks identified and described in the visual core description (VCD). Fan et al. (2019), Korolev et al. (2018), and Rahim et al. (2009) described the division into groups as integral to the heterogeneous reality of lithological analysis in the field as it is related to the multivariate physical characteristics of the site or core sampled.
For each lithological group, models and lithological code were assigned for the classification of the lithology by machine learning methods. Table 1 presents the model of the division of lithologies into four groups, denominated GP (group GP), G1 (group 1), G2 (group 2), and G3 (group 3). The core images of different lithologic groups can be found in the supplementary material.

During IODP Expeditions, the data collected onboard provided the first steps for working with the sampled rocks. The collection of these data occurred in core or collected samples. Among the measurements of the log parameters of rocks, we can highlight the gamma-ray attenuation bulk density (GRA), P-wave velocity logger system (PWL or P-wave), magnetic susceptibility (MS), reflectance spectrophotometry and colorimetry (RSC), and shock remanent magnetization (SRM).

Fig. 3. Graphical representation of cross-validation with the selected data in the datasets of the templates. In each round, the dataset is divided into n groups of equal size, with n groups for training and n groups for validation. The cross-validation result produces the measurements of accuracy, precision, recall, and F1-score. Figure modified from Haykin (2009).

GRA is a measure of the bulk density, expressed in g/cm3, obtained from gamma rays, which have a high degree of penetration and are emitted spontaneously from an atomic nucleus (137Cs) during radioactive decay. The source has a gamma-ray peak of 662 keV, and the beam is attenuated as it passes through the core. This attenuation is related to Compton scattering: for a known sample thickness, it is proportional to the bulk density. Bulk density can also be affected by vertical compaction during the collection of cores. This measure is used
in the identification or classification of rocks, mineral composition, grain size, and porosity calculation (ODP, 2007).

PWL values are measurements of the sound wave velocity through a sample. The PWL velocity varies according to the physical composition, porosity, density, and degree of fracture. In marine environments, PWL values are influenced by the degree of consolidation and lithification, fractures, and hydrocarbon occurrence (Brckovic et al., 2017; Doveton, 1994). Together with the GRA, the PWL measurements are used to calculate the acoustic impedance and reflection coefficients to construct synthetic seismic profiles and to estimate the depths of seismic horizons.

MS is the intensity with which the material can be magnetized in an external magnetic field (Blum, 1997; Brckovic et al., 2017; Mcneill et al., 2017). The ratio of magnetization is expressed in units of volume, defined as:

k = M/H

where M is the volume magnetization induced in a material of magnetic susceptibility k by an applied external field (H). Susceptibility is measured with the main recording devices, for which calibration factors must be met for geometry and effects of transport and core coatings. Materials can be classified by the magnetization volume value (M) into three groups: diamagnetic materials (−1 < M < 0), paramagnetic materials (0 < M ≤ 1), and ferromagnetic materials (M ≥ 1). MS varies according to the type and concentration of magnetic grains and corresponds to variation in sediment composition, mainly the granulometry and mineralogical composition. Sediments with the presence of clay have relatively lower magnetic susceptibility, and materials with the presence of water tend to have slightly negative values.

RSC is a unit of measurement related to two widely used techniques in the visual identification of rock characteristics: colorimetry and reflectance spectrophotometry. Colorimetry is used to measure the color value of a surface. Many numerical systems have been developed to express the visual values of colors. The International Commission on Illumination (CIE) proposed a standard method for the numerical measurement of colors, the L*, a*, b* system, considering the nonlinear perception of the human eye and the combination of illumination and basic colors (Hughes and Langlois, 2010; CIE, 2004). L* defines the brightness value of the respective sample, with values between L* = 0 (totally dark) and L* = 100 (totally bright). The a* defines the value of the red-green coordinate (a* positive defines values with more red and a* negative defines values with more green) and b* defines the value of the blue-yellow coordinate (b* positive defines values with more yellow and b* negative defines values with more blue) (Blum, 1997; CIE, 2004). Colorimetry is widely used to calculate relative brightness and colors on a surface or sample and is combined with other properties to create a more accurate evaluation of the analyzed object (Hughes and Langlois, 2010).

Magnetization is correlated with the proportional response of magnetic susceptibility to a magnetic field. In some cases, for pure magnetization, this relationship may undergo changes, where a medium exhibits a magnetic field, even with the absence of a magnetic field applied to it. This process is called shock remanent magnetization (SRM) (Buschow and Boer, 2004; Goodrich, 2007; Jovane et al., 2013). The vector of magnetization of an object is the sum of the values of the induced magnetic field and the magnetic field remanence. Magnetic remanence can be classified into five types, of which the main type, natural remanent magnetization (NRM), was used in the IODP. NRM is a more reliable method since remanence is transmitted from an object naturally by its favorable chemical compositions and without the interference of equipment or sensors.

The definition of use for these seven log parameters is based on the fact that the data acquisition resolution is closest to the depth, and the joint capability of these logs facilitates the recognition and classification of lithologies. Through analysis of the GRA, PWL, MS, L*, a*, b*, and SRM in joint processing using machine learning, it is possible to identify which member of the lithological group belongs to the value sought.

Fig. 4 shows a model of the log parameters for IODP Expedition 362, Site U1480, Holes E, F, and G, which form Template1, lithology group GP, with a plot of the classification of the lithologies in relation to the depth of the wells. The other expeditions and sites follow the model in Fig. 4, and adjust depth with the lithological classification.

2.3. Methods and data configuration

2.3.1. Data preparation

The total number of records for the IODP Expeditions, divided into groups, is shown in Table 2. Each record includes values for the seven log parameters GRA, PWL, MS, RSC (L*, a*, b*), and SRM. The dataset formed by the log parameters is transformed into a matrix. A table of the complete dataset with IODP Expeditions, groups, and lithologies is found in Table S1 of the supplementary material.

2.3.2. Programming language and library for machine learning

Python is a high-level, interpreted programming language that is easy to learn and use and supports numerous add-ons, which makes it a powerful language for calculating and analyzing large quantities of information (Spronck, 2017). In this machine learning study, the scikit-learn library was used, which integrates methods for learning processing, with the support of supervised and unsupervised training on data, creating elaborate and understandable outputs. The modules used in scikit-learn are the MLPClassifier module for the MLP algorithm, the DecisionTreeClassifier module for the decision tree algorithm, the RandomForestClassifier module for the random forest algorithm, and the SVC module for the SVM algorithm.

2.3.3. Training

To carry out the training, the data were divided according to Table 4. Three templates were created with data combinations between IODP Expeditions 354, 355, 359, and 362, using 70.00% for training data, 10.00% for validation data and 20.00% for test data, randomly separated. A practical template was created to simulate a real exercise that applies the methods and predicts the lithologies of a hole through the training of the entire dataset. The composition of the expeditions in the templates was given by the proximity of the expeditions in the same geological region to direct the context as practical analysis in the field or directly on the ship.

2.3.4. Prediction

For each of the templates, the training described in the Train column in Table 4 was performed using the MLP, decision tree, random forest, and SVM methods. To carry out the prediction, data were used according to the Test column in Table 4.

The configurations of the MLP, decision tree, random forest, and SVM methods are shown in Table 3. The configurations use features of the methods to create algorithms in the Python language, and the main points were adjusted for use with possible characteristics such as the number of nodes, trees, depth, and activation function. The SVM method was configured with the default configuration provided by the library. The default setting was maintained because it presented the best result compared to adjustments in the method parameters.

The distribution between training, validation, and testing in templates 1, 2, and 3 was related to the performance of practical tests with the methods and in accordance with Storkey (2013) and Korjus et al. (2016) with a percentage of 10% for validation and 10–20% for testing. For the practical template, training, validation, and testing were divided according to the segmentation of the expeditions.

After training and predictions, a table was compiled comparing the output result for each template and each method of the analyzed dataset. For each processing, classification metrics and confusion matrix values were tabulated for better organization and presentation. A detailed table with the results is found in Table S1 of the supplementary material.
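
The training and prediction steps described in Sections 2.3.2 to 2.3.4 can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' code: the data are synthetic stand-ins for the seven log parameters, the hyperparameters are placeholders (the Table 3 settings are not reproduced here), and names such as `results` are hypothetical. Only the four classifier classes, the 70/10/20 split, and the default SVM configuration follow the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the log-parameter matrix (assumption: no IODP data here).
rng = np.random.default_rng(0)
codes = np.array([10, 12, 13, 14])        # lithology codes seen in the G3 practical template
y = rng.choice(codes, size=400)
# Seven columns mimic GRA, PWL, MS, L*, a*, b*, SRM; a class-dependent shift makes them learnable.
X = rng.normal(size=(400, 7)) + (y[:, None] - 10) * 0.8

# Random 70% / 10% / 20% split: hold out 20% for testing,
# then take 12.5% of the remaining 80% (10% of the total) for validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.125, random_state=0)

# Placeholder hyperparameters; only SVC() deliberately keeps the library defaults.
models = {
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    "decision tree": DecisionTreeClassifier(max_depth=8, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    results[name] = {
        "val_accuracy": accuracy_score(y_val, model.predict(X_val)),
        "test_accuracy": accuracy_score(y_test, predicted),
        "f1_macro": f1_score(y_test, predicted, average="macro"),
        "confusion": confusion_matrix(y_test, predicted, labels=codes),
    }

for name, row in results.items():
    print(f"{name}: test accuracy {row['test_accuracy']:.2f}, macro F1 {row['f1_macro']:.2f}")
```

The validation split is used here only to report an accuracy before touching the test set; in practice it would guide the adjustment of the method parameters. Scores on this synthetic data say nothing about the results reported in the study.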
Fig. 4. Log parameters for the IODP Expedition 362, Site U1480, Holes E, F, and G, which form Template1, lithology group GP. Lithology group GP is divided into
Litho1, Litho2, Litho3, Litho4, Litho5, and Litho6. Litho1 consists of sand with variable granulation; Litho2 is composed of interlayered sand and mud with layers of
clay and silt; Litho3 is composed of layers of clay and silt; Litho4 consists of mixed layers of sand, silt, and clay; Litho5 is composed of lithologies divided between silt,
clay, and mudstone; and finally, Litho6 is composed of lithologies based on carbonate composition and its derivatives. The color legend in the figure refers to the
lithologies present in the GP group. The unit of measurement for GRA is g/cm3, P-wave is m/s, MS is m3/kg, L* is %, a* is %, and b* is %. The SRM log parameter is
not included.
Table 4
Division for training according to IODP Expeditions.
Groups Template Composition IODP Expeditions – Sites Train Validation Test Methods
(10, 12, 13, and 14). The results are good in all plots, with ROC curves above the main diagonal (0,1), and they corroborate the precise identification (from 70.00% to 86.00%) of the lithologies, as seen in Fig. 7. The MLP method shows better and growing stability of the plotted results relative to the other methods.
The best result for the lithological classification was obtained with the G2 group because the grouping was organized into four lithologies (Litho1, Litho2, Litho3, and Litho4), which formed a larger organization, agglomerating more lithologies and facilitating recognition by the methods. The GP group did not obtain good results in the general context due to the greater refinement of the division and grouping of the lithologies, which made recognition by the analyzed methods difficult.
Table 5 summarizes the results of machine learning processing methods on the groups and data templates proposed in this article. The best method was random forest. The best data group was the G2 group. The best template was Template2. Referring to group vs. method, in the GP group, the best method was random forest; in the G1 group, the best method was random forest; in the G2 group, the best method was MLP; and in the G3 group, the best method was random forest. In the combination of template vs. method, in Template1, Template2, and Template3, the best method was random forest, and in the practical template, the best method was MLP. The best result in applying cross-validation was the random forest method.
In all processes, there are limitations that create restrictions on achieving results with more than 80.00% accuracy (Peng and Bai, 2017; Papernot et al., 2016), especially due to the characteristics of the datasets, the quantity of data in the datasets, the number of features (GRA, PWL, MS, L*, a*, b*), and the inability to store training and testing for future use, which would strengthen datasets and improve results.
Regarding the performance and complexity of the execution of the algorithms, the algorithms that use decision trees, random trees, nodes and edges are executed more quickly and accurately than algorithms that use mathematical calculations or statistical functions (Maniriho and Ahmad, 2018). The MLP algorithm, whose organizational structure involves sequential node-to-node and layer-by-layer execution, runs quickly with optimal performance in relation to the hardware involved. Finally, the SVM algorithm, which acts directly on variables and vectors and performs calculations in a two-dimensional or three-dimensional plane using an appropriate kernel (Gaussian, polynomial or sigmoid), requires compatible hardware and high processing time compared to other algorithms. It is observed that in the results presented in this article, the SVM algorithm always obtained inferior results to the algorithms that use decision trees or nodes and layers.
The ash layer was not part of the division of the lithological groups since its presence in the lithological profile of the wells is restricted to mm- and cm-thick layers, with few log records and inadequate resolution, which made machine learning applications impossible in all of the configurations used at this stage. Further details on the ash layer records for IODP Expedition 362, Site U1480, are found in Table S2 of the supplementary material.
4. Conclusions
Fig. 6. Resulting lithological classification – practical template, G3 group, IODP Expedition 362, Site U1481. The log parameters of the well GRA, P-wave, MS, L*, a*,
and b* are displayed with respect to the real lithology related to the depth presented. Graph A displays the total number of records per method vs. lithology. Graph B
displays the resulting lithological classification by method relative to the real lithology. The methods are MLP, random forest, decision tree, and SVM, presented in
the sequence of identification of the lithologies. (Aa) to (Dd) correspond to the method distribution by percentage of hits by lithology. Litho1 refers to lithology code
10, Litho3 refers to lithology code 12, Litho4 refers to lithology code 13 and Litho5 refers to lithology code 14. Litho2 and Litho6 are not recorded in this hole range.
The SRM log parameter is not included.
increased the accuracy according to the quantity of the dataset, as it facilitated the execution of the classifications by the methods proposed.

The SVM method obtained poor results in cross-validation analysis and accuracy. The geological data analyzed do not have a highly defined pattern, and this method depends on many external and natural factors to obtain excellent results in its application. To successfully use machine learning with this method, it is necessary to have a dataset with a large quantity of data, and the data must be balanced regarding each training for a specific group of data. Indeed, we must perform an adjustment of the configuration parameters, making it difficult to use this for the classification of lithologies as proposed in this study.

The best result for cross-validation was obtained by the random forest method in all groups and templates analyzed. The characteristics of the random forest method allowed for better results of lithological classification, as it has a simple configuration and its design includes a set of several decision trees, leading to accurate results with smaller
Fig. 7. Confusion matrix for the G3 group, practical template (IODP Expedition 362, Site U1481). (A) MLP method, (B) random forest method, (C) decision tree
method, (D) SVM method. A to D represent the diagonal matrix for the result of each method, where the MLP method had the best result. (A*) represents the number
of true and predicted G3 group (lithology codes 10, 12, 13, 14) lithology records without normalization. (B*) highlights the percentage of normalized data between
true and predicted. Colors represent the frequency of normalized and nonnormalized records. Lithology code 11 and lithology code 15 are not recorded in this
hole range.
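Table 5
Summary table with the best accuracy per method, group, and template.

| Criterion | Case | Best accuracy |
|---|---|---|
| Method | | RandomForest |
| Group | | G2 |
| Template | | Template2 |
| Group vs. Method | GP | RandomForest |
| Group vs. Method | G1 | RandomForest |
| Group vs. Method | G2 | MLP |
| Group vs. Method | G3 | RandomForest |
| Template vs. Method | Template1 | RandomForest |
| Template vs. Method | Template2 | RandomForest |
| Template vs. Method | Template3 | RandomForest |
| Template vs. Method | Practical Template | MLP |
| Cross-Validation | | RandomForest |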
statistical variance.
The main contributions of this work are a rapid approach to lithological classification in offshore wells, the proposal of a lithological classification using supervised training methods, lithological classification using multivariate data and the support of a large number of variables, and strengthened and improved learning using neural network methods and machine learning.
Fig. 8. ROC curve of the MLP, random forest, decision tree, and SVM methods applied to the G3 group, practical template (IODP Expedition 362, Site U1481). Lithology code 10 refers to the G3 group, models Litho1. Lithology code 12 refers to the G3 group, models Litho3. Lithology code 13 refers to the G3 group, models Litho4. Lithology code 14 refers to the G3 group, models Litho5. Lithology code 11 and lithology code 15 are not recorded in this hole range.

Authorship statement

Acknowledgement

This work was financed through a grant from the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (grant # 88887.091717/2014–01). This research used data provided by the International Ocean Discovery Program (IODP) (www.iodp.org/access-data-and-samples).
Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.cageo.2020.104475.

Abbreviations used in this manuscript

ANN Artificial Neural Network
CIE International Commission on Illumination
GRA Gamma Ray Attenuation
IODP International Ocean Discovery Program
MS Magnetic Susceptibility
MLP Multi-Layer Perceptron
NRM Natural Remanence Magnetization
PWL P-Wave Velocity Logger System
RSC Reflectance Spectrophotometry
ROC Receiver Operating Characteristic
SRM Magnetic Remanence
SVM Support Vector Machine
VCD Visual Core Description

References

Airola, A., Pohjankukka, J., Torppa, J., Middleton, M., Nykänen, V., Heikkonen, J., Pahikkala, T., 2018. The spatial leave-pair-out cross-validation method for reliable AUC estimation of spatial classifiers. Data Min. Knowl. Discov. 33, 730–747. https://doi.org/10.1007/s10618-018-00607-x.
Akkas, E., Akin, L., Çubukçu, E., Artuner, H., 2015. Application of Decision Tree Algorithm for classification and identification of natural minerals using SEM–EDS. Comput. Geosci. 80, 38–48. https://doi.org/10.1016/j.cageo.2015.03.015, 2015.
Al-Mudhafar, W.J., 2015. Integrating component analysis & classification techniques for comparative prediction of continuous & discrete lithofacies distributions. In: OTC-25806-MS, the Offshore Technology Conference. https://doi.org/10.4043/25806-MS (4-7 May), Houston, TX, USA.
Al-Mudhafar, W.J., 2017a. Integrating well log interpretations for lithofacies classification and permeability modeling through advanced machine learning algorithms. J. Petrol. Explor. Prod. Technol. (2017) 7, 1023–1033. https://doi.org/10.1007/s13202-017-0360-0.
Al-Mudhafar, W.J., 2017b. Integrating kernel support vector machines for efficient rock facies classification in the main pay of Zubair formation in South Rumaila oil field, Iraq. Model. Earth Syst. Environ. (2017) 3, 12. https://doi.org/10.1007/s40808-017-0277-0.
Alkhasawneh, M.S., Ngah, U.K., Tay, L.T., Isa, N.A.M., Al-Batah, M.S., 2014. Modeling and testing landslide hazard using decision tree. J. Appl. Math. 2014 https://doi.org/10.1155/2014/929768, 929768.
Awad, M., Khanna, R., 2015. Support Vector Regression. Efficient Learning Machines. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4302-5990-9_4.
Bhattacharya, S., Carr, T.R., 2019. Integrated data-driven 3D shale lithofacies modeling of the Bakken Formation in the Williston basin, North Dakota, United States. J. Petrol. Sci. Eng. 177, 1072–1086. https://doi.org/10.1016/j.petrol.2019.02.036, 2019.
Bhattacharya, S., Mishra, S., 2018. Applications of machine learning for facies and fracture prediction using Bayesian Network Theory and Random Forest: case studies from the Appalachian basin, USA. J. Petrol. Sci. Eng. 170, 1005–1017. https://doi.org/10.1016/j.petrol.2018.06.075, 2018.
Bhattacharya, S., Carr, T.R., Pal, M., 2016. Comparison of supervised and unsupervised approaches for mudstone lithofacies classification: case studies from the Bakken and Mahantango-Marcellus Shale, USA. J. Nat. Gas Sci. Eng. 33, 1119–1133. https://doi.org/10.1016/j.jngse.2016.04.055, 2016.
Bhattacharya, S., Ghahfarokhi, P.K., Carr, T.R., Pantaleone, S., 2019. Application of predictive data analytics to model daily hydrocarbon production using petrophysical, geomechanical, fiber-optic, completions, and surface data: a case study from the Marcellus Shale, North America. J. Petrol. Sci. Eng. 176, 702–715. https://doi.org/10.1016/j.petrol.2019.01.013, 2017.
Blum, P., 1997. Physical properties handbook: a guide to the shipboard measurement of physical properties of deep-sea cores. Coll. Stat. Texas, USA http://www-odp.tamu.edu.
Brckovic, A., Kovacevic, M., Cvetkovic, M., Mocilac, I.K., Rukavina, D., Saftic, B., 2017. Application of artificial neural networks for lithofacies determination based on limited well data. Central Eur. Geol. 3 (60), 299–315. https://doi.org/10.1556/24.60.2017.012.
Brieuc, M.S.O., Waters, C.D., Drinan, D.P., Naish, K.A., 2018. A practical introduction to Random Forest for genetic association studies in ecology and evolution. Mol. Ecol. Resour. 18 (4), 755–766. https://doi.org/10.1111/1755-0998.12773.
Bruno, A.E., Charbonneau, P., Newman, J., Snell, E.H., So, D.R., Vanhoucke, V., Watkins, C.J., Williams, S., Wilson, J., 2018. Classification of crystallization outcomes using deep convolutional neural networks. PloS One 13, 6. https://doi.org/10.1371/journal.pone.0198883.
Buschow, K.H.J., Boer, F. R. de, 2004. Physics of Magnetism and Magnetic Materials. Kluwer Academic Publishers, ISBN 0-306-47421-2, pp. 105–115. https://doi.org/10.1007/b100503.
Castro, W., Oblitas, J., Santa-Cruz, R., Avila-George, H., 2017. Multilayer perceptron architecture optimization using parallel computing techniques. PloS One 12, 12. https://doi.org/10.1371/journal.pone.0189369.
Chen, Y., Wu, W., 2016. A prospecting cost-benefit strategy for mineral potential mapping based on ROC curve analysis. Ore Geol. Rev. 74, 26–38. https://doi.org/10.1016/j.oregeorev.2015.11.011, 2016.
CIE, 2004. Technical Report, third ed. COLORIMETRY, ISBN 3901906339, p. 82pp https://archive.org/details/gov.law.cie.15.2004.
De Boissieu, F., Sevin, B., Cudahy, T., Mangeas, M., Chevrel, S., Ong, C., Rodger, A., Maurizot, P., Laukamp, C., Lau, I., Touraivane, T., Cluzel, D., Despinoy, M., 2017. Regolith-geology mapping with support vector machine: a case study over weathered Ni-bearing peridotites, New Caledonia. Int. J. Appl. Earth Obs. Geoinf. (64), 377–385. https://doi.org/10.1016/j.jag.2017.05.012.
Devanand, Kumar, N., 2015. Prediction of CMRS rock mass rating using fuzzy logic. International Conference on Advances in Computer Engineering and Applications. https://doi.org/10.1109/ICACEA.2015.7164685. ICACEA - 2015.
Doveton, J.H., 1994. Geological log interpretation. SEPM Society for Sedimentary Geology (29). https://doi.org/10.2110/scn.94.29.
Dubeau, P., King, D.J., Unbushe, D.G., Rebelo, L., 2017. Mapping the dabus wetlands, Ethiopia, using random forest classification of landsat, PALSAR and topographic data. Rem. Sens. (9), 1056. https://doi.org/10.3390/rs9101056.
Egan, J.P., 1975. Signal Detection Theory and ROC Analysis. Academic Press, New York, 2017.
Fajana, A.O., Ayuk, M.A., Enikanselu, P.A., 2019. Application of multilayer perceptron neural network and seismic multiattribute transforms in reservoir characterization of Pennay field, Niger Delta. J. Petrol. Explor. Prod. Technol. (9), 31–49. https://doi.org/10.1007/s13202-018-0485-9, 2019.
Fan, B., Shi, L., Li, Y., Zhang, T., Lv, L., Shikai, T., 2019. Lithologic heterogeneity of lacustrine shale and its geological significance for shale hydrocarbon-a case study of Zhangjiatan Shale. Open Geosci. 2019 (11), 101–112. https://doi.org/10.1515/geo-2019-0009.
Gajowniczek, K., Zabkowski, T., Szupiluk, R., 2014. Estimating the roc curve and its significance for classification models’ assessment. Quantit. Methods Econ. 15 (2), 382–391, 2014.
Goodrich, W.E., 2007. Characterization and Quantification of Magnetic Remanence in Unexploded Ordnance. Master of Science (Geophysics). Faculty and Board of Trustees of the Colorado School. https://mountainscholar.org/bitstream/handle/11124/79225/T06342.pdf?sequence=1.
Haykin, S., 2009. Neural Networks and Learning Machines, third ed. Pearson Prentice Hall, New York City, U. S., p. 938pp
Hong, T., White, C.D., Gani, M.R., Bhattacharya, J., 2004. Comparison of multivariate statistical algorithms for wireline log facies classification. AAPG Ann. Meet Abst. (88), 13.
Hossin, M., Sulaiman, M.N., 2015. A review on evaluation metrics for data classification evaluations. Int. J. Data Min. Knowl. Manag. Process 2 (5), 1–11. https://doi.org/10.5121/ijdkp.2015.5201.
Hsieh, B., Lewis, C., Lin, Z., 2005. Lithology identification of aquifers from geophysical well logs and fuzzy logic analysis: shui-Lin Area, Taiwan. Comput. Geosci. 31 (3), 263–275. https://doi.org/10.1016/j.cageo.2004.07.004.
Hughes, V.K., Langlois, N.E.I., 2010. Use of reflectance spectrophotometry and colorimetry in a general linear model for the determination of the age of bruises. Forensic Sci. Med. Pathol. 6 https://doi.org/10.1007/s12024-010-9171-z, 275–28.
Jahdhami, N.A., Anboori, A., 2017. The Application of Specific Drilling Energy to Identify Overburden Lithological Boundaries and Aid Well Operations - Oman Khazzan Field. Abu Dhabi International Petroleum Exhibition & Conference, Abu Dhabi, UAE. https://doi.org/10.2118/188413-MS.
James, G., Witten, D., Hastie, T., Tibshirani, R., 2013. An Introduction to Statistical Learning. Springer, New York City, U.S, p. 187. https://doi.org/10.1007/978-1-4614-7138-7.
Jing, S., Liu, C., Li, G., Yan, G., Zhang, Y., 2017. An efficient algorithm for parallel computation of rough entropy using CUDA. Int. Conf. Comput. Intell. Sec. 13 https://doi.org/10.1109/CIS.2017.00009.
Jovane, L., Hinnov, L., Housen, B.A., Herrero-Barvera, E., 2013. Magnetic Methods and the Timing of Geological Processes, vol. 373. Geological Society Special Publication Nº https://doi.org/10.1144/SP373.17.
Kong, G., Xia, Y., Qiu, C., 2014. Cost-sensitive bayesian network classifiers and their applications in rock burst prediction. ICIC 2014, LNCS 8588, 101–112. https://doi.org/10.1007/978-3-319-09333-8_12.
Korjus, K., Hebart, M.N., Vicente, R., 2016. An efficient data partitioning to improve classification performance while keeping parameters interpretable. PloS One 11 (8). https://doi.org/10.1371/journal.pone.0161788.
Korolev, E.A., Usmanov, S.A., Nikolaev, D.S., Gabdelvaliyeva, R.R., 2018. Effect of lithological heterogeneity of bitumen sandstones on SAGD reservoir development. IOP Conf. Ser. Earth Environ. Sci. 155 https://doi.org/10.1088/1755-1315/155/1/012019, 2018.
Latifovic, R., Pouliot, D., Campbell, J., 2018. Assessment of convolution neural networks for surficial geology mapping in the south rae geological region, northwest territories, Canada. Rem. Sens. 10 (2), 307. https://doi.org/10.3390/rs10020307.
Lee, S.H., Datta-Gupta, A., 1999. Electrofacies characterization and permeability predictions in carbonate reservoirs: role of multivariate analysis and nonparametric regression. In: SPE Annual Technical Conference and Exhibition (3–6 October). https://doi.org/10.2118/56658-MS. Houston, Texas.
Li, J., Tran, M., Siwabessy, J., 2016. Selecting optimal random forest predictive models: a case study on predicting the spatial distribution of seabed hardness. Geosci. Australia GPO Box 378. https://doi.org/10.1371/journal.pone.0149089.
Loussaief, S., Abdelkrim, A., 2018. Machine Learning framework for image classification. ASTESJ 3, 1. https://doi.org/10.1109/SETIT.2016.7939841.
Maimon, O., Rokach, L., 2010. Data Mining and Knowledge Discovery Handbook, 2 editions. Springer, New York City, U.S. https://doi.org/10.1007/978-0-387-09823-4 (Chapter 9).
Maniriho, P., Ahmad, T., 2018. Analyzing the Performance of Machine Learning Algorithms in Anomaly Network Intrusion Detection Systems. In: 4th International Conference on Science and Technology. https://doi.org/10.1109/ICSTC.2018.8528645. ICST 2018.
McCreery, E., Al-Mudhafar, W., 2017. Geostatistical classification of lithology using partitioning algorithms on well log data - a case study in forest hill oil field, east Texas basin. 79th eage conference and exhibition. At Paris, France. https://doi.org/10.3997/2214-4609.201700905.
Mcneill, L.C., Dugan, B., Petronotis, K.E., Backman, J., Bourlange, S., Chemale, F., Chen, W., Colson, T.A., Frederik, M.C.G., Guerin, G., Hamahashi, M., Henstick, T., House, B.M., Hupers, A., Jeppson, T.N., Kachovich, S., Kenigsberg, A.R., Kuranaga, M., Kutterolf, S., Milliken, K.L., Mitchison, F.L., Mukoyoshi, H., Nair, N., Owari, S., Pickering, K.T., Pouderoux, H.F.A., Yehua, S., Song, I., Torres, M.E., Vannucchi, P., Vrolijk, P.J., Yang, T., Zhao, X., 2017. Expedition 362 summary. Proc. Int. Ocean Discov. Program 362. https://doi.org/10.14379/iodp.proc.362.101.2017.
Navin, M., Pankaja, R., 2016. Performance analysis of text classification algorithms using confusion matrix. Int. J. Eng. Tech. Res. 4 (6), 75–78.
ODP, 2007. ODP Prime Scientific Data: Collection, Archive, and Quality ODP. Technical Note 37. URL: http://www-odp.tamu.edu/publications/tnotes/tn37/TNOTE_37.PDF.
Orozco-del-Castillo, M.G., Ortiz-Aleman, C., Urrutia-Fucugauchi, J., Rodríguez-Castellanos, A., 2011. Fuzzy logic and image processing techniques for the interpretation of seismic data. J. Geophys. Eng. 8, 185–194. https://doi.org/10.1088/1742-2132/8/2/006, 2017.
Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A., 2016. The limitations of deep learning in adversarial settings. In: European Symposium on Security and Privacy. https://doi.org/10.1109/EuroSP.2016.36, 2017, Germany.
Peng, H., Bai, X., 2017. Limits of Machine Learning Approach on Improving Orbit Prediction Accuracy Using Support Vector Machine. Advanced Maui Optical and Space Surveillance (AMOS) Technologies Conference, Hawaii, 2017.
Peng, J., Zhou, Y., Chen, C.L.P., 2015. Region-kernel-based support vector machines for hyperspectral image classification. IEEE Trans. Geosci. Rem. Sens. 9 (53), 4810–4824. https://doi.org/10.1109/TGRS.2015.2410991.
Pirrone, M., Battigelli, A., Ruvo, L., 2014. Lithofacies Classification of Thin Layered Turbidite Reservoirs through the Integration of Core Data and Dielectric Dispersion Log Measurements. https://doi.org/10.2118/170748-MS.
Puggini, L., Doyle, J., Mcloone, S., 2015. fault detection using random forest similarity distance. IFAC-PapersOnLine 48 (21), 583–588. https://doi.org/10.1016/j.ifacol.2015.09.589.
Rafik, B., Kamel, B., 2017. Prediction of permeability and porosity from well log data using the nonparametric regression with multivariate analysis and neural network, Hassi R’Mel Field, Algeria. Egyptian J. Petrol. 26 (3), 763–778. https://doi.org/10.1016/j.ejpe.2016.10.013.
Rahim, I.A., Tahir, S. Hj, Musta, B., Omang, S.A.K., 2009. Lithological unit thickness approach for determining intact rock strength (IRS) of slope forming rock material of crocker formation. Borneo Sci. 25, 23–32.
Raschka, S., 2015. Python Machine Learning. Packt Publishing Ltd, England, p. 454pp.
Romppanem, S., Häkkänem, H., Kaski, S., 2017. Singular value decomposition approach to the yttrium occurrence in mineral maps of rare earth element ores using laser-induced breakdown spectroscopy. Spectrochim. Acta, Part B 134, 69–74. https://doi.org/10.1016/j.sab.2017.06.002, 2017.
Rosid, M.S., Haikel, S., Haidar, M.W., 2019. Carbonate reservoir rock type classification using comparison of Naïve Bayes and Random Forest method in field “S” East Java. AIP Conf. Proc. 2168, 020019 https://doi.org/10.1063/1.5132446, 2017.
Sáez, J.A., Galar, M., Luengo, J., Herrera, F., 2013. Tackling the problem of classification with noisy data using Multiple Classifier Systems: analysis of the performance and robustness. Inf. Sci. 247 https://doi.org/10.1016/j.ins.2013.06.002.
Sahoo, S., Jha, M.K., 2017. Pattern recognition in lithology classification: modeling using neural networks, self-organizing maps and genetic algorithms. Hydrogeol. J. 25, 311–330. https://doi.org/10.1007/s10040-016-1478-8, 2017.
Souza, J.F.L., Santos, M.D., Magalhães, R.M., Neto, E.M., Oliveira, G.P., Roque, W.L., 2019. Automatic classification of hydrocarbon “leads” in seismic images through artificial and convolutional neural networks. Comput. Geosci. 132, 23–32. https://doi.org/10.1016/j.cageo.2019.07.002.
Spronck, P., 2017. The Coder’s Apprentice - Learning Programming with Python 3. Spronck Create Commons Licences, p. 398. Version 1.0.16. http://www.spronck.net/pythonbook/pythonbook.pdf.
Storkey, A.J., 2013. When Training and Test Sets Are Different: Characterising Learning Transfer. https://doi.org/10.7551/mitpress/9780262170055.003.0001.
Strauß, S., 2018. From big data to deep learning: a leap towards strong AI or ‘intelligentia obscura’? Big Data Cognitive Computing 2 (3), 16. https://doi.org/10.3390/bdcc2030016.
Tharwat, A., 2018. Classification assessment methods. Appl. Comput. Inform. https://doi.org/10.1016/j.aci.2018.08.003.
Tilaki-Hajian, K., 2013. Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian J Intern Med 4 (2), 627–635pp.
Vakhshoori, V., Zare, M., 2018. Is the ROC curve a reliable tool to compare the validity of landslide susceptibility maps? Geomatics, Nat. Hazards Risk 1 (9). https://doi.org/10.1080/19475705.2018.1424043, 2018.
Vapnik, V.N., 1998. Statistical Learning Theory. A Wiley-Interscience Publication, New York City, U. S., p. 740pp
Venkataraman, S., 2017. System Design for Large Scale Machine Learning. Electrical. University of California at Berkeley, EECS Department. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-219.html.
Wallet, B., Hardisty, R., 2019. Unsupervised seismic facies using Gaussian mixture models. SEG Library 3 (7), 1A–T725. https://doi.org/10.1190/INT-2018-0119.1.
Wang, G., Carr, T.R., Ju, Y., Li, C., 2014. Identifying organic-rich Marcellus Shale lithofacies by support vector machine classifier in the Appalachian basin. Comput. Geosci. 64, 52–60. https://doi.org/10.1016/j.cageo.2013.12.002, 2014.
YU, L., Porwal, A., Holden, E., Dentith, M.C., 2012. Towards automatic lithological classification from remote sensing data using support vector machines. Comput. Geosci. 45 https://doi.org/10.1016/j.cageo.2011.11.019.
Zhao, L.N., Tian, F.Y., Wu, H., Qi, D., Wang, Z., 2011. Verification and comparison of probabilistic precipitation forecasts using the TIGGE data in the upriver of Huaihe Basin. Adv. Geosci. (29), 95–102pp. https://doi.org/10.5194/adgeo-29-95-2011, 2011.