
Computers & Geosciences 139 (2020) 104475
Contents lists available at ScienceDirect
Computers and Geosciences

journal homepage: www.elsevier.com/locate/cageo
Evaluation of machine learning methods for lithology classification using geophysical data
Thiago Santi Bressan 1, *, Marcelo Kehl de Souza, Tiago J. Girelli, Farid Chemale Junior
Universidade do Vale do Rio dos Sinos – Unisinos, São Leopoldo, RS, Brazil
ARTICLE INFO

Keywords: Lithological group; Pattern recognition; Multivariate data; Sedimentary rocks

ABSTRACT

Specific computational tools assist geologists in identifying and sorting lithologies in well surveys and reducing operational costs and practical working time. This allows for the management of professional output, the efficient interpretation of data, and completion of scientific research on data collected in geologically distinct regions. Machine learning methods and applications integrate large sets of information with the goal of efficient pattern recognition and the capability of leveraging accurate decision making. The objective of this study is to apply machine learning methods to the supervised classification of lithologies using multivariate log parameter data from offshore wells from the International Ocean Discovery Program (IODP). According to the analysis of the lithologies proposed in the IODP Expeditions and for the application of our methods, the lithologies were divided into four groups. The IODP Expeditions were organized into four templates for better results in analyzing the set of expeditions and practical application of the methods. The templates were submitted to training, validation, and testing by multilayer perceptron (MLP), decision tree, random forest, and support vector machine (SVM) methods. The evaluation was randomly divided into training (70%), validation (10%), and testing (20%) using the classification methods as an evaluation of the results. In the results, it was observed that Template1 (IODP Expedition 362) obtained better results with the MLP method, Template2 (IODP Expeditions 354, 355, and 359) and Template3 (IODP Expeditions 354, 355, 359, and 362) obtained better results with the random forest method with greater than 80.00% accuracy. For cross-validation, the random forest method performed well in all scenarios. In the practical template, the G2 group obtained a better result with the MLP method with an average accuracy above 85.00%. It is expected that machine learning methods can help improve the study of geology with accurate and rapid answers related to interpreting collected data in different study regions.
1. Introduction

The accurate classification of information intensifies work, improves scientific results, and automates the execution of activities with reduced time and work hours. Appropriate methods and computational tools assist geologists in identifying lithologies in onshore or offshore well surveys by allowing for efficient data interpretation. The efficiency of lithological classification can be improved using applications capable of producing more objective decisions rather than more conventional methods of interpretation (Hsieh et al., 2005; Jahdhami and Anboori, 2017).

Integration between machine learning and geology with numerous scientific and technological studies, search forms and processes aids in the identification and recognition of patterns in sedimentary rocks. The collection of physical samples of sedimentary rocks is essential, as the professional needs material available for visual interpretations and laboratory analysis to generate information with a high degree of validity and accuracy. This process is slow and costly, requiring highly trained and dedicated professionals. In addition, the geophysical properties of sedimentary rocks acquired directly in the field with specific equipment or in the laboratory can be utilized, reducing time and costs in the identification of lithologies. These methods include applications in the estimation of permeability and porosity using well profiles (Rafik and Kamel, 2017), determination of lithofacies with seismic sections (Brckovic et al., 2017), and geological mapping of surfaces and minerals (Latifovic, Pouliot and Campbell, 2018; Bruno et al., 2018).
* Corresponding author.
E-mail addresses: thiago.bressan@iffarroupilha.edu.br (T.S. Bressan), marcelo.k.souza@gmail.com (M. Kehl de Souza), tjgirelli@gmail.com (T.J. Girelli),
faridchemale@gmail.com (F.C. Junior).
1 Instituto Federal de Educação, Ciência e Tecnologia Farroupilha - IFFar
https://doi.org/10.1016/j.cageo.2020.104475
Received 2 October 2019; Received in revised form 6 March 2020; Accepted 18 March 2020
Available online 23 March 2020
0098-3004/© 2020 Elsevier Ltd. All rights reserved.
This study differs by its integration of geology and neural networks in intelligent learning to assist professional geologists with practical work in the laboratory or the field. This paper applies the methods of machine learning in the supervised classification of lithologies using multivariate data of log parameters of offshore wells from the International Ocean Discovery Program (IODP).

2. Materials

2.1. Machine learning

Machine learning can be defined as the most current method of using a computer to discover new knowledge. In past decades, computers were used only for simple learning or information storage. With the advancement of computer science and the creation of new, ultrafast hardware, one can enhance algorithms and exponentially increase processing by considering computational thinking from the analysis of large quantities of data in a short time. Machine learning is used in areas such as natural language processing, speech recognition, stock prediction, identification of diseases in healthcare, calculation of porosity in sedimentary rocks, and identification of specific minerals in sedimentary rocks. Machine learning creates responses from the intensive processing of information, predicting highly reliable outputs for decision making (Raschka, 2015). The supervised learning method processes reference data input to create a model for the prediction of new data. For this training, the algorithm requires data in a standard format and type, as well as data with reliable and accurate values, extracted from relevant sources with the ability to improve feedback. In this study, we used a dataset organized from the division of lithologies into GP (group GP), G1 (group 1), G2 (group 2), and G3 (group 3) according to Table 1. These lithologies are linked to IODP Expeditions 349, 354, 355, 356, 359, 361, and 362 and organized into four templates according to Table 4. They were processed by the application of machine learning methods for supervised classification, including MLP, decision tree, random forest, and SVM.

Table 1
Division of lithology into groups. Each group contains its composition models, lithological composition, and lithological code.

Group | Model | Lithological composition | Lithology code
GP | Litho1 | Very fine sand/sandstone, Fine sand/sandstone, Medium sand/sandstone, Coarse sand/sandstone, Sand, Sand/sandstone | 10
GP | Litho2 | Alternating sand/sandstone and mud/mudstone layers, Sandy clay/claystone, Sandy silt/siltstone | 11
GP | Litho3 | Alternating silt/siltstone and clay/claystone layers, Clayey silt/siltstone | 12
GP | Litho4 | Sand/sandstone-silt/siltstone-clay/claystone, Clayey sand/sandstone, Silty sand/sandstone | 13
GP | Litho5 | Silt/siltstone, Silty clay/claystone, Clay/claystone, Clay, Silt, Mudstone | 14
GP | Litho6 | Calcareous silty clay/claystone, Calcareous silt/siltstone, Calcareous ooze, Chalk, Marlstone, Rudstone, Floatstone, Grainstone, Packstone, Wackestone, Boundstone, Limestone, Calcareous claystone | 15
G1 | Litho1 | Very fine sand/sandstone, Fine sand/sandstone, Medium sand/sandstone, Coarse sand/sandstone, Sand, Sand/sandstone | 10
G1 | Litho2 | Silt/siltstone, Silty clay/claystone, Clay/claystone, Clay, Silt, Alternating silt/siltstone and clay/claystone layers, Clayey silt/siltstone | 14
G1 | Litho3 | Calcareous silty clay/claystone, Calcareous silt/siltstone, Calcareous ooze, Chalk, Marlstone, Rudstone, Floatstone, Grainstone, Packstone, Wackestone, Boundstone, Limestone, Calcareous claystone | 15
G2 | Litho1 | Very fine sand/sandstone, Fine sand/sandstone, Medium sand/sandstone, Coarse sand/sandstone, Sand, Sand/sandstone | 10
G2 | Litho2 | Sand/sandstone-silt/siltstone-clay/claystone, Clayey sand/sandstone, Silty sand/sandstone, Alternating sand/sandstone and mud/mudstone layers, Sandy clay/claystone, Sandy silt/siltstone | 13
G2 | Litho3 | Silt/siltstone, Silty clay/claystone, Clay/claystone, Clay, Silt, Alternating silt/siltstone and clay/claystone layers, Clayey silt/siltstone | 14
G2 | Litho4 | Calcareous silty clay/claystone, Calcareous silt/siltstone, Calcareous ooze, Chalk, Marlstone, Rudstone, Floatstone, Grainstone, Packstone, Wackestone, Boundstone, Limestone, Calcareous claystone | 15
G3 | Litho1 | Very fine sand/sandstone, Fine sand/sandstone, Medium sand/sandstone, Coarse sand/sandstone, Sand, Sand/sandstone | 10
G3 | Litho2 | Alternating sand/sandstone and mud/mudstone layers, Sandy clay/claystone, Sandy silt/siltstone, Sand/sandstone-silt/siltstone-clay/claystone, Clayey sand/sandstone, Silty sand/sandstone | 11
G3 | Litho3 | Alternating silt/siltstone and clay/claystone layers, Clayey silt/siltstone, Silty clay/claystone | 12
G3 | Litho4 | Silt/siltstone, Silt | 13
G3 | Litho5 | Clay/claystone, Clay | 14
G3 | Litho6 | Calcareous silty clay/claystone, Calcareous silt/siltstone, Calcareous ooze, Chalk, Marlstone, Rudstone, Floatstone, Grainstone, Packstone, Wackestone, Boundstone, Limestone, Calcareous claystone | 15

2.1.1. Machine learning methods for lithology classification

In addition to the methods used in this work (MLP, decision tree, random forest, and SVM) for supervised data, lithological classification covers the use of several other important methods, such as naïve Bayes (Rosid et al., 2019; Kong et al., 2014), probabilistic neural networks (Al-Mudhafar, 2017a), logistic boosted regression (Al-Mudhafar, 2017a), kernel support vector machine (Al-Mudhafar, 2015, 2017b), support vector regression (Awad and Khanna, 2015), methods for unsupervised data, such as cluster analysis (Lee and Datta-Gupta, 1999; Pirrone et al., 2014; McCreery and Al-Mudhafar, 2017) and Gaussian mixture models (Wallet and Hardisty, 2019), and methods that integrate functions of math and multivariate statistics with supervised and unsupervised data such as principal components analysis (Lee and Datta-Gupta, 1999), linear discriminant analysis (Lee and Datta-Gupta, 1999; Al-Mudhafar, 2015c; Hong et al., 2004), multinomial logistic regression (Hong et al., 2004), singular value decomposition (Romppanem et al., 2017) and fuzzy logic (Devanand and AuthorAnonymous, 2015; Orozco-del-Castillo et al., 2011), with application-specific characteristics and support for certain data types.

2.1.1.1. Decision tree. The decision tree is a practical, fast, and robust learning method for supervised inductive learning (Maimon and Rokach, 2010). It is a useful method for extracting previously unknown information from the analysis of a large volume of data. Examples of applications that use a decision tree as a learning algorithm include landslides (Alkhasawneh et al., 2014), classification and identification of natural minerals (Akkas et al., 2015) and image classification (Loussaief and Abdelkrim, 2018).

A decision tree is essentially a series of if-else statements aligned through a structure of nodes and leaves. When applied to database records, it results in the classification of records and is a robust method for data with considerable noise, as well as nonstandard data (Sáez et al., 2013). Configurations such as maximum tree depth, number of features for the best split, maximum number of nodes, maximum number of leaves, and the function for division and choice of nodes can be defined and improved with training.
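The if-else structure described above can be made visible by printing the rules of a small fitted tree. The following is only a minimal sketch using scikit-learn on synthetic data: the feature names `feature_a` and `feature_b`, the data, and the chosen limits (`max_depth=2`, `max_leaf_nodes=4`) are illustrative assumptions and are not the log parameters or the calibrated settings of this study.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): two features, two classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)  # the class depends only on the first feature

# Configuration options named in the text: split criterion, maximum depth,
# and maximum number of leaves. The values here are illustrative only.
tree = DecisionTreeClassifier(
    criterion="gini", max_depth=2, max_leaf_nodes=4, random_state=0
)
tree.fit(X, y)

# export_text renders the fitted tree as nested threshold tests
# (the "if-else statements") ending in class assignments (the leaves).
rules = export_text(tree, feature_names=["feature_a", "feature_b"])
print(rules)
```

Each indented line of the printed rules is one threshold test on a feature, and each `class:` line is a leaf.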

In relation to the criteria for division and choice of nodes, two
methods are available: Gini (1) and entropy (2).
$$\mathrm{Gini}(E) = 1 - \sum_{j=1}^{c} p_j^{2} \tag{1}$$

$$H(E) = -\sum_{j=1}^{c} p_j \log p_j \tag{2}$$
The choice of criterion depends on the quantity and format of the data available for training. For large quantities of data, the entropy criterion requires evaluating a logarithm, which demands more computational processing and makes it relatively slower than the Gini criterion (Jing et al., 2017). With respect to the data format, the Gini criterion works well with binary values.
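As an illustration of the two split criteria, the sketch below computes the Gini index and the entropy for the class labels held in a single node, where p_j is taken as the proportion of records of class j in the node. It assumes a base-2 logarithm (the text does not state the base), and the two example nodes are hypothetical, using lithology codes from Table 1 only as labels.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the proportion p_j of each class present in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [n / total for n in counts.values()]


def gini(labels):
    """Gini index of a node: 1 - sum_j p_j^2."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy of a node: -sum_j p_j * log2(p_j) (base 2 is an assumption)."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


# Hypothetical nodes: one holding a single class, one evenly split.
pure_node = [10, 10, 10, 10]
mixed_node = [10, 10, 14, 14]

print(gini(pure_node), entropy(pure_node))
print(gini(mixed_node), entropy(mixed_node))
```

Both criteria are zero for a node with a single class and grow as the classes become more evenly mixed; the entropy needs one logarithm per class, which is the extra cost mentioned above.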
2.1.1.2. Random forest. Random forest is an ensemble method for pattern classification developed based on decision trees. The random forest creates a set of multiple decision trees and calculates the average of all processed trees. In this context, ensemble methods comprise two groups of calculations: averaging calculations, which aggregate several estimators and return the mean of their predictions with reduced variance, and boosting calculations, which integrate several small, specific estimators and seek an appropriately large return from the sum of their results (James et al., 2013; Raschka, 2015). Random forest belongs to the averaging group.

The main configuration settings of a random forest are the criterion for division and choice of nodes (Gini or entropy), the depth of the decision trees, and the number of decision trees.

The averaging method includes a bootstrap aggregation procedure as a statistical approach used to quantify the degree of uncertainty associated with a certain learning estimator.

The random forest method is applied in several areas, such as genetic association studies in ecology (Brieuc et al., 2018), and specifically in geology and geoscience in group mapping of vegetation (Dubeau et al., 2017), predicting the spatial distribution of seabed hardness (Li et al., 2016), fault detection (Puggini et al., 2015), facies and fracture prediction (Bhattacharya and Mishra, 2018) and data analytics to model daily hydrocarbon production (Bhattacharya et al., 2019).

2.1.1.3. Support vector machine (SVM). The SVM is a supervised learning method that constructs a border (hyperplane) on a representation of the data, thus improving its presentation, grouping, and separation between different instances of data classes (Vapnik, 1998).

Its application depends on the format and the relationship between the data, with a possible application on linearly separable and linearly nonseparable data. The method calculates the vector that best represents the hyperplane. Support vectors (Fig. 1) are the points closest to the hyperplane, and the distance between these vectors represents the margins. The method chooses the hyperplane that has the greatest margin, which is considered the most robust model and the least prone to errors.

For a model with linearly separable data (as seen in Fig. 1), the representation performed by SVM is fast and accurate, resulting in an excellent practical return for classification output. A model with linearly nonseparable data requires an additional step: a kernel method is applied to find patterns and relationships in the data so that the groups can be separated by a hyperplane.

SVM is applied in several areas, such as geology and geoscience for 3D geological modeling (De Boissieu et al., 2017; Bhattacharya and Carr, 2019), regolith-geology mapping (De Boissieu et al., 2017), hyperspectral classification of images (Peng et al., 2015), automated lithological classification using ASTER remote sensing data (Yu et al., 2012) and identification of organic-rich lithofacies (Wang et al., 2014).

Fig. 1. Representation of SVM in a linearly separable model. The groups are converted into data vectors and separated by a hyperplane calculated by the method that best represents the separation of the data. Margins represent the distance between the data vectors. Figure modified from Vapnik (1998).

2.1.1.4. Multilayer perceptron (MLP). An artificial neural network (ANN) can be defined as a computational model based on brain architecture, developed in an attempt to transfer learning capacity to a computerized system (Castro et al., 2017; Souza et al., 2019). The best-known architecture of an ANN is the MLP, which is made up of a series of layers, neurons, and connections.

It is important to note that, due to its definition, the MLP architecture depends on the adjustment of two main points: architecture configuration and training. The configuration of the architecture includes the definition of the connections and the number of neurons necessary for good learning and adequate resolution of the proposed problem. Consequently, training MLP networks can be very time consuming for large datasets. The determination of the ideal architecture configuration is a constant research objective, in which the quadratic error, prediction, precision, and related metrics are adjusted for the expected results. In these points and using these metrics, the results are related to the type and quality of the data used in the process of forming the network.

Because it is a traditional network in the context of the generation and application of knowledge, areas such as mudstone classification (Bhattacharya et al., 2016), reservoir characterization (Fajana et al., 2019) and pattern recognition in lithology classification of complex aquifer systems (Sahoo and Jha, 2017) utilize the practical organization of this important method.

2.1.1.5. Calibration of methods used. Further description of the method hyperparameter configurations, as well as the result of calibration and choice of values, can be found in the accompanying material.

2.1.1.6. Validation and evaluation. A confusion matrix is the representation of real values and predicted values that allows for visualization of the performance of a machine learning classifier method on the proposed templates (Navin and Pankaja, 2016). The confusion matrix is a table generated for the classification of a binary dataset and is used to describe the performance of a classification method. The main diagonal indicates the accuracy of the evaluated records, combining the true results in an organized structure in the matrix. Its organization is presented in Fig. 2.

Accuracy (Eq. (3)) is the sum of the true positive and true negative values divided by the total number of positive and negative values:

$$\mathrm{Accuracy} = \frac{TP + TN}{P + N} \tag{3}$$

It is important to highlight that this classification metric can result in
a misleading picture of method performance, because this metric calculates the average over the results of all classes of a dataset (Hossin and Sulaiman, 2015).

Fig. 2. Definition of the confusion matrix. The matrix is divided into real data and predicted data (rows and columns), combining true positive (TP) data, false positive (FP) data, false negative (FN) data and true negative (TN) data. Figure modified from Navin and Pankaja (2016).

Precision (Eq. (4)) and recall (Eq. (5)) are the components of the F1-score metric. Precision is calculated by dividing the true positive values by the sum of the true positive and false positive values. Recall is calculated by dividing the true positive values by the sum of the true positive and false negative values. Their equations are presented below:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$

F1-score or F-measure (Eq. (6)) is the harmonic mean of precision and recall. Its use is beneficial in dataset processing with diversified classes that are highly disproportionate. This equation is given below:

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$

The cross-validation method is used to evaluate the performance of the methods on the data. This method randomly partitions the total untrained dataset into k smaller groups of equal size (Haykin, 2009). Processing of the data is repeated k times until all groups are trained and tested. Processing returns are described through rating metrics such as accuracy, precision, recall, and F1-score according to Fig. 3. In this way, the entire set of available data is evaluated, returning precise classification of the data and integrating the various characteristics of data formation and grouping.

Fig. 3. Graphical representation of cross-validation with the selected data in the datasets of the templates. In each round, the dataset is divided into n groups of equal size, with n groups for training and n groups for validation. The cross-validation result produces the measurements of accuracy, precision, recall, and F1-score. Figure modified from Haykin (2009).

The receiver operating characteristic (ROC) is a graphical method for the evaluation, presentation, and selection of prediction systems (Tharwat, 2018). It uses the confusion matrix to construct the results and represents two parameters of the probability of accuracy: the true positive rate (TPR) and the false positive rate (FPR). TPR (Eq. (7)) and FPR (Eq. (8)) are defined as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{7}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{8}$$

The ROC curve plots TPR vs. FPR at different classification thresholds. Its multidimensional capability allows for better visualization of the result variables throughout the spectrum of the graph. The diagonal from (0,0) to (1,1) represents a classification model that performs equally in both classes, that is, no better than random guessing. Points belonging to the upper left triangle of this diagonal represent the best results, and points belonging to the lower right triangle represent the worst results. Its origin is related to the detection of signals and the evaluation of the transmission quality of a noisy signal (Egan, 1975). ROC graphics are used in medicine (Tilaki-Hajian, 2013), economics (Gajowniczek et al., 2014), weather forecasting (Zhao et al., 2011), and geology (Vakhshoori and Zare, 2018; Chen and Wu, 2016; Airola et al., 2018).

2.2. Geological setting

This study is based on IODP Expeditions 349, 354, 355, 356, 359, 361, and 362. The holes were drilled in different regions of the Indian Ocean. As a result, there is a large amount of information resulting in a good data group for this study in machine learning. Further description of the study areas can be found in the supplementary material.

All of the expeditions described in the present study contain similar successions of sediments/sedimentary rocks from the Bengal-Nicobar Fan. Therefore, it is possible to perform a grouping of lithologies and to map a pattern between depth and lithology, as well as the distribution of sedimentary rocks in all of the sites and holes surveyed for the seven expeditions.

The grouping of lithologies seeks to organize the sets of lithologies present in the IODP Expeditions by separating the records, creating a wide combination of data and directions for the heterogeneous sedimentary rocks identified and described in the visual core description (VCD). Fan et al. (2019), Korolev et al. (2018), and Rahim et al. (2009) described the division into groups as integral to the heterogeneous reality of lithological analysis in the field as it is related to the multivariate physical characteristics of the site or core sampled.

For each lithological group, models and lithological code were assigned for the classification of the lithology by machine learning methods. Table 1 presents the model of the division of lithologies into four groups, denominated GP (group GP), G1 (group 1), G2 (group 2), and G3 (group 3). The core images of different lithologic groups can be found in the supplementary material.

During IODP Expeditions, the data collected onboard provided the first steps for working with the sampled rocks. These data were collected on cores or on collected samples. Among the measurements of the log parameters of rocks, we can highlight the gamma-ray attenuation bulk density (GRA), P-wave velocity logger system (PWL or P-wave), magnetic susceptibility (MS), reflectance spectrophotometry and colorimetry (RSC), and shock remanent magnetization (SRM).

GRA is a measure of density, expressed in g/cm³, based on gamma rays with a high degree of penetration that are emitted spontaneously from an atomic nucleus (137Cs) during radioactive decay. The gamma-ray peak is at 662 keV, and the beam is attenuated as it passes through the core. This attenuation is related to Compton scattering, where, for a known sample thickness, the attenuation is proportional to the bulk density. Bulk density can also be affected by vertical compaction during the collection of cores. This measure is used
in the identification or classification of rocks, mineral composition, which member of the lithological group belongs to the value sought.
grain size, and porosity calculation (ODP, 2007). Fig. 4 shows a model of the log parameters for IODP Expedition 362,
PWL values are measurements of the sound wave velocity through a sites U1480, and holes E, F, and G, which form Template1, lithology
sample. The PWL velocity varies according to the physical composition, group GP, with a plot of the classification of the lithologies in relation to
porosity, density, and degree of fracture. In marine environments, PWL the depth of the wells. The other expeditions and sites follow the model
values are influenced by the degree of consolidation and lithification, in Fig. 4, and adjust depth with the lithological classification.
fractures, and hydrocarbon occurrence (Brckovic et al., 2017; Doveton,
1994). Together with the GRA, the PWL measurements are used to 2.3. Methods and data configuration
calculate the acoustic impedance and reflection coefficients to construct
synthetic seismic profiles and to estimate the depths of seismic horizons. 2.3.1. Data preparation
MS is the intensity with which the material can be magnetized in an The total number of records for the IODP Expeditions, divided into
external magnetic field (Blum, 1997; Brckovic et al., 2017; Mcneill et al., groups, is shown in Table 2. Each record includes values for the seven
2017). The ratio of magnetization is expressed in units of volume, log parameters GRA, PWL, MS, RSC (L*, a*, b*), and SRM. The dataset
defined as: formed by the log parameters is transformed into the matrix. A table of
the complete dataset with IODP Expeditions, groups, and lithologies is
k ¼ M=H
found in Table S1 of the supplementary material.
where M is the volume of magnetization applied to a magnetic suscep
tibility k by an applied external field (H). Susceptibility is measured with 2.3.2. Programming language and library for machine learning
the main recording devices, for which calibration factors must be met for Python is a high-level programming language that is interpretable,
geometry and effects of transport and core coatings. They can be clas easy to learn and use, and supports numerous add-ons, which makes it a
sified by the magnetization volume value (M) into three groups: powerful language for calculating and analyzing large quantities of in
diamagnetic materials ( 1 < M < 0), paramagnetic materials (0 < M � formation (Spronck, 2017). In this machine learning study, the
1), and ferromagnetic materials (M � 1). MS varies according to the type scikit-learn library was used, which integrates all methods for learning
and concentration of magnetic grains and corresponds to variation in processing, with the support of supervised and unsupervised training on
sediment composition, mainly the granulometry and mineralogical data, creating extremely elaborate and understandable outputs. The
composition. Sediments with the presence of clay have relatively lower modules used in scikit-learn are the MLPClassifier module for the MLP
magnetic susceptibility, and materials with the presence of water tend to algorithm, the DecisionTreeClassifier module for the decision tree al
have slightly negative values. gorithm, the RandomForestClassifier module for the random forest al
RSC is a unit of measurement related to two widely used techniques gorithm, and the SVC module for the SVM algorithm.
in the visual identification of rock characteristics: colorimetry and
reflectance spectrophotometry. Colorimetry is used to measure the color 2.3.3. Training
value of a surface. Many numerical systems have been developed to To carry out the training, the data were divided according to Table 4.
express the visual values of colors. The International Commission on Three templates were created with data combinations between IODP
Illumination (CIE) proposed a standard method for the numerical Expeditions 354, 355, 359, and 362, using 70.00% for training data,
measurement of colors, the L*, a*, b* system, considering the nonlinear 10.00% for validation data and 20.00% for test data, randomly sepa
perception of the human eye and the combination of illumination and rated. A practical template was created to simulate a real exercise that
basic colors (Hughes and Langlois, 2010; CIE, 2004). L* defines the applies the methods and predicts the lithologies of a hole through the
brightness value of the respective sample, with values between L* ¼ training of the entire dataset. The composition of the expeditions in the
0 (totally dark) and L* ¼ 100 (totally bright). The a* defines the value of templates was given by the proximity of the expeditions in the same
the red-green coordinate (a* positive defines values with more red and geological region to direct the context as practical analysis in the field or
a* negative defines values with more green) and b* defines the value of directly on the ship.
the blue-yellow coordinate (b* positive defines values with more yellow
and b* negative defines values with more blue) (Blum, 1997; CIE, 2004). 2.3.4. Prediction
Colorimetry is widely used to calculate relative brightness and colors on For each of the templates, the training described in the Train column
a surface or sample and is combined with other properties to create a in Table 4 was performed using the MLP, decision tree, random forest,
more accurate evaluation of the analyzed object (Hughes and Langlois, and SVM methods. To carry out the prediction, data were used according
2010). to the test column in Table 4.
Magnetization is correlated with the proportional response of magnetic susceptibility to a magnetic field. In some cases, for pure magnetization, this relationship may undergo changes, where a medium exhibits a magnetic field, even with the absence of a magnetic field applied to it. This process is called shock remanent magnetization (SRM) (Buschow and Boer, 2004; Goodrich, 2007; Jovane et al., 2013). The vector of magnetization of an object is the sum of the values of the induced magnetic field and the magnetic field remanence. Magnetic remanence can be classified into five types, of which the main type, natural remanent magnetization (NRM), was used in the IODP. NRM is a more reliable method since remanence is transmitted from an object naturally by its favorable chemical compositions and without the interference of equipment or sensors.
The definition of use for these seven log parameters is based on the fact that the data acquisition resolution is closest to the depth, and the joint capability of these logs facilitates the recognition and classification of lithologies. Through analysis of the GRA, PWL, MS, L*, a*, b*, and SRM in joint processing using machine learning, it is possible to identify
The configurations of the MLP, decision tree, random forest, and SVM methods are shown in Table 3. The configurations use features of the methods to create algorithms in the Python language, and the main points were adjusted for use with possible characteristics such as the number of nodes, trees, depth, and activation function. The SVM method was configured with the default configuration provided by the programming language. The default setting maintenance is because it presents the best result compared to adjustments in the method parameters.
The distribution between training, validation, and testing in templates 1, 2, and 3 was related to the performance of practical tests with the methods and in accordance with Storkey (2013) and Korjus et al. (2016) with a percentage of 10% for validation and 10–20% for testing. For the practical template, training, validation, and testing were divided according to the segmentation of the expeditions.
After training and predictions, a table was compiled comparing the output result for each template and each method of the analyzed dataset. For each processing, classification metrics and confusion matrix values were tabulated for better organization and presentation. A detailed table with the results is found in Table S1 of the supplementary material.
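As an illustration of the configuration listed in Table 3 and of the 70/10/20 division into training, validation, and testing, the following is a minimal sketch using scikit-learn. It is not the LithoPy code: the synthetic data from `make_classification`, the stratified random splitting, and the reading of "Default setting" as the `SVC()` defaults are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data standing in for six log parameters and four lithology classes.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=8,
)

# 70% training, 10% validation, 20% testing, expressed as absolute sizes.
n_test = len(X) // 5
n_val = len(X) // 10
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_test, stratify=y, random_state=8)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, stratify=y_rest, random_state=8)

# Settings copied from Table 3; every other parameter keeps the library default.
models = {
    "MLP": MLPClassifier(solver="lbfgs", activation="relu", random_state=8),
    "DecisionTree": DecisionTreeClassifier(
        random_state=8, max_depth=20, criterion="entropy"),
    "RandomForest": RandomForestClassifier(
        max_depth=20, n_estimators=1000, random_state=8, n_jobs=1),
    "SVM": SVC(),  # assumption: "Default setting" means the SVC defaults
}

# Fit on the training part and score each model on the validation part.
val_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_accuracy[name] = model.score(X_val, y_val)

for name, acc in val_accuracy.items():
    print(f"{name}: validation accuracy = {acc:.3f}")
```

The test part (`X_test`, `y_test`) is deliberately left untouched here so that it can be used once, after any parameter adjustment on the validation part.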
Fig. 4. Log parameters for the IODP Expedition 362, Site U1480, Holes E, F, and G, which form Template1, lithology group GP. Lithology group GP is divided into
Litho1, Litho2, Litho3, Litho4, Litho5, and Litho6. Litho1 consists of sand with variable granulation; Litho2 is composed of interlayered sand and mud with layers of
clay and silt; Litho3 is composed of layers of clay and silt; Litho4 consists of mixed layers of sand, silt, and clay; Litho5 is composed of lithologies divided between silt,
clay, and mudstone; and finally, Litho6 is composed of lithologies based on carbonate composition and its derivatives. The color legend in the figure refers to the
lithologies present in the GP group. The unit of measurement for GRA is g/cm³, P-wave is m/s, MS is m³/kg, L* is %, a* is %, and b* is %. The SRM log parameter is
not included.
Table 2
Total number of records by lithological groups.
Number of Records – IODP-Expeditions    Group
24,504                                  GP
23,967                                  G1
24,504                                  G2
24,504                                  G3
Total number: 97,479                    GP, G1, G2 and G3
Table 3
Configuration of the methods for training, validation and testing.
Methods         Configuration
MLP             solver = lbfgs
                activation = relu
                random_state = 8
DecisionTree    random_state = 8
                max_depth = 20
                criterion = entropy
RandomForest    max_depth = 20
                n_estimators = 1000
                random_state = 8
                n_jobs = 1
SVM             Default setting
3. Results and discussions
In this section, the results of the application of the four methods are discussed, according to the templates and dataset presented.
The sum of the records of the dataset used by the templates in this study was 97,474 records. According to Strauß (2018) and Venkataraman (2017), to obtain results above 70.00% accuracy using machine learning, it is necessary to have thousands or millions of records for training. Such results depend on the organization of a precise and reliable historical database that adjusts itself with training and testing. In this case, it was determined that for the GP group, Template1 obtained an accuracy of 76.00% in the SVM method, Template2 obtained an accuracy of 82.61% in the random forest method, and Template3 obtained an accuracy greater than 82.74% in the random forest method. For the G1 group, Template1 obtained an accuracy of 89.51% in the random forest method, Template2 obtained an accuracy of over 75.00% in the four methods analyzed, and Template3 obtained an accuracy of over 73.00% in the four methods analyzed. For the G2 group, Template1 obtained 82.84% accuracy in the random forest method, Template2 obtained 82.61% accuracy in the random forest method, and Template3 obtained 84.20% accuracy in the random forest method. Finally, for the G3 group, Template1 obtained 79.00% accuracy in the SVM method, Template2 obtained 77.89% accuracy in the random forest method, and Template3 obtained 79.42% accuracy in the random forest method. A detailed table with these results is found in Table S1 of the supplementary material.
The results of the application of cross-validation are shown in Fig. 5. They were divided into five folds, performing training and testing for all templates and datasets. Accuracy, precision, recall, and F1-score were used to measure the performance of templates for five-fold cross-validation. The decision tree method in the entropy criterion obtained a better result in relation to the Gini criterion, with an increase in accuracy of 5.00%, as well as a proportional increase in cross-validation with 5 k-folds.
In Fig. 6, log parameters are presented for the practical template in the defined groups and methods (data of G3 group, IODP Expedition 362, Site U1481). For the practical template, the execution of the methods obtained excellent results, which can be observed in Fig. 6 and Table S1 of the supplementary material.
It should be noted that the G2 group obtained better mean results in accuracy for the four methods. The random forest method obtained an excellent result for the practical template, with more than 79.00% accuracy for the classification of lithologies in the GP and G1 groups. Emphasis is placed on the MLP method with 85.08% accuracy in the G2 group.
IODP Expedition 362, Site U1481 obtained the best result for the lithological classification in the G3 group, MLP method. IODP Expedition 361, Site U1478 obtained the best result in the GP group, SVM method and site U1477 obtained the best result in the G2 group, MLP method. IODP Expedition 356, Site U1462 obtained the best result in the G2 group, MLP method. IODP Expedition 349, Site U1431 obtained the best result in the G2 group, MLP method, Site U1432 obtained the best results in the G2 group, MLP and decision tree methods, Site U1433 obtained excellent results in the G2 group, MLP method. Therefore, it was concluded that the best method for the practical template is the MLP, with the best results in 85.70% of the total practical template.
The confusion matrix highlighted in Fig. 7 (related to Fig. 6) and the supplementary material: boxplot represents the method result of the respective lithological classification for IODP Expedition 362, Site U1481 (G3 group). The MLP method (A) has the best average result for the hole, with an accuracy of 76.30%. For Litho1 (lithology code 10), the best accuracy was obtained by the random forest method (B). For Litho3 (lithology code 12), the best accuracy was obtained by the MLP method (A). For Litho4 (lithology code 13), the best accuracy was obtained by the MLP (A) and SVM (D) methods. Finally, for Litho5 (lithology code 14), the best accuracy was obtained by the MLP (A) and decision tree (C) methods. The complete table with classification metrics is found in
Table 4
Division for training according to IODP Expeditions.
Groups: GP, G1, G2 and G3
Methods: MLP, DecisionTree, RandomForest, SVM
Template            Composition IODP Expeditions – Sites                     Train     Validation   Test
Template1           362 (U1480)                                              70.00%    10.00%       20.00%
Template2           354 (U1449, U1450, U1451)                                70.00%    10.00%       20.00%
                    355 (U1456)
                    359 (U1465, U1466, U1467, U1468, U1470, U1471, U1472)
Template3           354 (U1449, U1450, U1451)                                70.00%    10.00%       20.00%
                    355 (U1456)
                    359 (U1465, U1466, U1467, U1468, U1470, U1471, U1472)
                    362 (U1480)
Practical Template  349 (U1431, U1432, U1433)
                    354 (U1449, U1450, U1451)
                    355 (U1456)
                    356 (U1462)
                    359 (U1465, U1466, U1467, U1468, U1470, U1471, U1472)
                    361 (U1477, U1478)
                    362 (U1480, U1481)
                    Train: Expeditions 354, 355, 359 and 362 (U1480)
                    Validation: Expedition 349 (U1433)
                    Test: Expeditions 349, 356, 361 and 362 (U1481)
(10, 12, 13, and 14). The results are good in all plots, with ROC curves above the main diagonal (0,1), and they corroborate the precise identification (from 70.00% to 86.00%) of the lithologies seen in Fig. 7. The MLP method shows better and increasing stability in the plotted results relative to the other methods.
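A per-lithology ROC analysis of this kind can be sketched with a one-vs-rest scheme, in which each lithology code is scored against all the others. The example below is only an illustration under stated assumptions: the data are synthetic, the classifier is a small random forest rather than any of the trained models in this study, and the one-vs-rest construction is one common way of drawing multi-class ROC curves, not necessarily the one used for Fig. 8.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Lithology codes recorded in the hole (Litho1, Litho3, Litho4, Litho5).
codes = np.array([10, 12, 13, 14])

# Assumption: synthetic six-feature data standing in for the log parameters.
X, y_idx = make_classification(
    n_samples=800, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=8,
)
y = codes[y_idx]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=8)

clf = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=8)
clf.fit(X_train, y_train)

# Class-membership probabilities; columns follow the order of clf.classes_.
scores = clf.predict_proba(X_test)
# One indicator column per lithology code (one-vs-rest).
y_bin = label_binarize(y_test, classes=clf.classes_)

roc_auc = {}
for i, code in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_bin[:, i], scores[:, i])
    roc_auc[int(code)] = auc(fpr, tpr)  # area under the ROC curve for this code

for code, value in roc_auc.items():
    print(f"lithology code {code}: AUC = {value:.3f}")
```

A curve lying above the main diagonal corresponds to an area under the curve greater than 0.5, which is the property described in the text.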
The G2 group gave the best lithological classification results because its grouping was organized into four lithologies (Litho1, Litho2, Litho3, and Litho4), each agglomerating more of the original lithologies, which facilitated recognition by the methods. The GP group did not obtain good results overall because its finer division and grouping of the lithologies made recognition by the analyzed methods more difficult.
Table 5 summarizes the results of the machine learning methods on the groups and data templates proposed in this article. The best method was random forest, the best data group was the G2 group, and the best template was Template2. For group vs. method, random forest was the best method in the GP, G1, and G3 groups, and MLP was the best method in the G2 group. For template vs. method, random forest was the best method in Template1, Template2, and Template3, and MLP was the best method in the practical template. The best result in applying cross-validation was obtained by the random forest method.
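The five-fold cross-validation with accuracy, precision, recall, and F1-score could be set up roughly as follows. This is a hedged sketch rather than the study's procedure: the data are synthetic, the folds are stratified and shuffled, and the macro averaging of precision, recall, and F1 across classes is an assumption, since the averaging scheme is not stated in the text. The decision tree is run with both the Gini and entropy criteria only to show how such a comparison is made; no outcome is implied.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic six-feature, four-class data in place of the log parameters.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=8,
)

# Five folds; stratification and shuffling are choices made for this sketch.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
# Macro averaging across classes is an assumption.
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

cv_means = {}
fold_counts = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(
        criterion=criterion, max_depth=20, random_state=8)
    scores = cross_validate(tree, X, y, cv=cv, scoring=scoring)
    # Mean of each metric over the five folds.
    cv_means[criterion] = {m: scores[f"test_{m}"].mean() for m in scoring}
    fold_counts[criterion] = len(scores["test_accuracy"])

for criterion, metrics in cv_means.items():
    summary = ", ".join(f"{m}={v:.3f}" for m, v in metrics.items())
    print(f"{criterion}: {summary}")
```

The same loop can be run over the other three classifiers by swapping the estimator passed to `cross_validate`.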
In all processes, there are limitations that restrict achieving results with more than 80.00% accuracy (Peng and Bai, 2017; Papernot et al., 2016), especially the characteristics of the datasets, the quantity of data in the datasets, the number of features (GRA, PWL, MS, L*, a*, b*), and the inability to store training and testing for future use, which would strengthen the datasets and improve results.
Regarding the performance and complexity of the execution of the algorithms, the algorithms that use decision trees, random trees, nodes and edges are executed more quickly and accurately than algorithms that use mathematical calculations or statistical functions (Maniriho and Ahmad, 2018). The MLP algorithm, whose organizational structure executes sequentially node by node and layer by layer, runs quickly with optimal performance in relation to the hardware involved. Finally, the SVM algorithm, which acts directly on variables and vectors and performs calculations in a two-dimensional or three-dimensional plane using an appropriate kernel (Gaussian, polynomial, or sigmoid), requires compatible hardware and a high processing time compared to the other algorithms. It is observed that in the results proposed by this article, the SVM algorithm always obtained inferior results to the algorithms that use decision trees or nodes and layers.
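Relative training cost can be checked empirically by timing the fit of each method on the same data. The sketch below is an assumption-laden illustration: it uses synthetic data and the Table 3 settings, and the measured times depend on the hardware, the number of records, and the library version, so no ranking of the methods should be read from it.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic six-feature, four-class data in place of the log parameters.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=8,
)

models = {
    "MLP": MLPClassifier(solver="lbfgs", activation="relu", random_state=8),
    "DecisionTree": DecisionTreeClassifier(
        random_state=8, max_depth=20, criterion="entropy"),
    "RandomForest": RandomForestClassifier(
        max_depth=20, n_estimators=1000, random_state=8, n_jobs=1),
    "SVM": SVC(),  # assumption: library defaults
}

# Wall-clock training time per method, measured with a monotonic timer.
fit_seconds = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds[name] = time.perf_counter() - start

for name, seconds in sorted(fit_seconds.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.3f} s")
```

Repeating the measurement several times and on datasets of increasing size gives a more reliable picture than a single run.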
The ash layer was not part of the division of the lithological groups since its presence in the lithological profile of the wells is restricted to mm- and cm-thick layers, with few log records and inadequate resolution, which made machine learning applications impossible in all of the configurations used at this stage. Further details on the ash layer records for IODP Expedition 362, Site U1480, are found in Table S2 of the supplementary material.
Fig. 5. Results of the application of cross-validation in the four defined groups: GP, G1, G2, and G3, divided into Template1, Template2, and Template3. Template1 covers IODP Expedition 362, Site U1480; Template2 covers IODP Expeditions 354, 355, and 359; and Template3 covers IODP Expeditions 354, 355, 359, and 362. The methods applied to the data were MLP, decision tree, random forest, and SVM.
Table S1 of the supplementary material.
Fig. 8 shows the ROC curve representing four methods for the G3 group, practical template in IODP Expedition 362, Site U1481, divided by models (Litho1, Litho3, Litho4, Litho5) and lithological code
4. Conclusions
In this study, four different machine learning methods were applied to three standard data templates and a practical data template in a lithological classification problem for wells from IODP Expeditions.
The results indicate that the MLP method had better results in the lithological classification for the practical template, considering the lithology of the G2 group proposed in this research and the characteristics of the datasets used.
In datasets Template1, Template2, and Template3, the best results were in the G1 group with the random forest method, with an accuracy of more than 85.00%. The G1 group had better results for lithological classification for the organized templates due to the advantageous grouping of the three lithologies (Litho1, Litho2, and Litho3). The Litho2 grouping composite for joining lithologies between silt and clay
Fig. 6. Resulting lithological classification – practical template, G3 group, IODP Expedition 362, Site U1481. The log parameters of the well GRA, P-wave, MS, L*, a*,
and b* are displayed with respect to the real lithology related to the depth presented. Graph A displays the total number of records per method vs. lithology. Graph B
displays the resulting lithological classification by method relative to the real lithology. The methods are MLP, random forest, decision tree, and SVM, presented in
the sequence of identification of the lithologies. (Aa) to (Dd) correspond to the method distribution by percentage of hits by lithology. Litho1 refers to lithology code
10, Litho3 refers to lithology code 12, Litho4 refers to lithology code 13 and Litho5 refers to lithology code 14. Litho2 and Litho6 are not recorded in this hole range.
The SRM log parameter is not included.
increased the accuracy according to the quantity of the dataset, as it facilitated the execution of the classifications by the methods proposed.
The SVM method obtained poor results in cross-validation analysis and accuracy. The geological data analyzed do not have a highly defined pattern, and this method depends on many external and natural factors to obtain excellent results in its application. To successfully use machine learning with this method, it is necessary to have a dataset with a large quantity of data, and the data must be balanced regarding each training for a specific group of data. Indeed, we must perform an adjustment of the configuration parameters, making it difficult to use this for the classification of lithologies as proposed in this study.
The best result for cross-validation was obtained by the random forest method in all groups and templates analyzed. The characteristics of the random forest method allowed for better results of lithological classification, as it has a simple configuration and its design includes a set of several decision trees, leading to accurate results with smaller
Fig. 7. Confusion matrix for the G3 group, practical template (IODP Expedition 362, Site U1481). (A) MLP method, (B) random forest method, (C) decision tree
method, (D) SVM method. A to D represent the diagonal matrix for the result of each method, where the MLP method had the best result. (A*) represents the number
of true and predicted G3 group (lithology codes 10, 12, 13, 14) lithology records without normalization. (B*) highlights the percentage of normalized data between
true and predicted. Colors represent the frequency of normalized and nonnormalized records. Lithology code 11 and lithology code 15 are not recorded in this
hole range.
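The two views of the confusion matrix described in the caption, raw record counts and row-normalized percentages, can be computed as in the sketch below. It is an illustration only: the data are synthetic, the MLP settings are taken from Table 3, and normalizing each row by the number of true records of that lithology is an assumption about how the normalized panels were built.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Lithology codes of the G3 group recorded in this hole range.
codes = np.array([10, 12, 13, 14])

# Assumption: synthetic six-feature data standing in for the log parameters.
X, y_idx = make_classification(
    n_samples=800, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=8,
)
y = codes[y_idx]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=8)

clf = MLPClassifier(solver="lbfgs", activation="relu", random_state=8)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true lithology codes, columns are predicted codes (raw counts).
cm = confusion_matrix(y_test, y_pred, labels=codes)
# Each row divided by its total: fraction of true records assigned to each code.
cm_norm = cm / cm.sum(axis=1, keepdims=True)
# The diagonal of the normalized matrix is the hit rate per lithology.
hit_rate = dict(zip(codes.tolist(), np.diag(cm_norm)))

print(cm)
print(np.round(cm_norm, 2))
```

Reading the diagonal of the normalized matrix per method is what allows statements such as "the best accuracy for Litho3 was obtained by the MLP method".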
Table 5
Summary table with the best accuracy per method, group, and template.
Table summary best accuracy
Method:                 RandomForest
Group:                  G2
Template:               Template2
Group vs. Method:       GP                    RandomForest
                        G1                    RandomForest
                        G2                    MLP
                        G3                    RandomForest
Template vs. Method:    Template1             RandomForest
                        Practical Template    MLP
Cross-Validation:       RandomForest
statistical variance.
The main contributions of this work are a rapid approach to lithological classification in offshore wells, the proposal of a lithological classification using supervised training methods, lithological classification using multivariate data and the support of a large number of variables, and strengthened and improved learning using neural network methods and machine learning.
Computer code availability
Name of code: LithoPy. Developer: Thiago Santi Bressan. Contact address: Programa de Pós-Graduação em Geologia, Universidade do Vale do Rio dos Sinos (Unisinos), Av. Unisinos, 950, Cristo Rei, São Leopoldo, RS, Brasil. Telephone number: +55-55-99601-6723. E-mail: tsbressan@gmail.com. Year of first release: 2019. Hardware required: i3 CPU or better with 16 GB of RAM. Software required: Anaconda (>=5.3), Jupyter Notebook (>=5.2), Sklearn (>=0.20), NumPy (>=1.8.2), SciPy (>=0.13.3), Matplotlib (>=3.0.0) and Windows (10) or Linux (Ubuntu or a similar system). Program language: Python. Program size: 10 MB. How to access the source code: Available at https://github.com/tsbressan/LithoPy.
Authorship statement
Thiago Santi Bressan designed and developed the algorithms and worked on the main writing of the manuscript. Marcelo Kehl de Souza contributed to the statistical analysis, algorithm tests, review and writing of the manuscript. Tiago J. Girelli contributed to the development of figures and tables and writing of the manuscript. Farid Chemale Junior contributed to the writing and revision of the manuscript.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Fig. 8. ROC curve of the MLP, random forest, decision tree, and SVM methods applied to the G3 group, practical template (IODP Expedition 362, Site U1481). Lithology code 10 refers to the G3 group, models Litho1. Lithology code 12 refers to the G3 group, models Litho3. Lithology code 13 refers to the G3 group, models Litho4. Lithology code 14 refers to the G3 group, models Litho5. Lithology code 11 and lithology code 15 are not recorded in this hole range.
Acknowledgement
This work was financed through a grant from the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (grant # 88887.091717/2014–01). This research used data provided by the International Ocean Discovery Program (IODP) (www.iodp.org/access-data-and-samples).
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.cageo.2020.104475.
Abbreviations used in this manuscript
ANN     Artificial Neural Network
CIE     International Commission on Illumination
GRA     Gamma Ray Attenuation
IODP    International Ocean Discovery Program
MS      Magnetic Susceptibility
MLP     Multi-Layer Perceptron
NRM     Natural Remanence Magnetization
PWL     P-Wave Velocity Logger System
RSC     Reflectance Spectrophotometry
ROC     Receiver Operating Characteristic
SRM     Magnetic Remanence
SVM     Support Vector Machine
VCD     Visual Core Description
References
Airola, A., Pohjankukka, J., Torppa, J., Middleton, M., Nykänen, V., Heikkonen, J., Pahikkala, T., 2018. The spatial leave-pair-out cross-validation method for reliable AUC estimation of spatial classifiers. Data Min. Knowl. Discov. 33, 730–747. https://doi.org/10.1007/s10618-018-00607-x.
Akkas, E., Akin, L., Çubukçu, E., Artuner, H., 2015. Application of Decision Tree Algorithm for classification and identification of natural minerals using SEM–EDS. Comput. Geosci. 80, 38–48. https://doi.org/10.1016/j.cageo.2015.03.015, 2015.
Al-Mudhafar, W.J., 2015. Integrating component analysis & classification techniques for comparative prediction of continuous & discrete lithofacies distributions. In: OTC-25806-MS, the Offshore Technology Conference. https://doi.org/10.4043/25806-MS (4-7 May), Houston, TX, USA.
Al-Mudhafar, W.J., 2017a. Integrating well log interpretations for lithofacies classification and permeability modeling through advanced machine learning algorithms. J. Petrol. Explor. Prod. Technol. (2017) 7, 1023–1033. https://doi.org/10.1007/s13202-017-0360-0.
Al-Mudhafar, W.J., 2017b. Integrating kernel support vector machines for efficient rock facies classification in the main pay of Zubair formation in South Rumaila oil field, Iraq. Model. Earth Syst. Environ. (2017) 3, 12. https://doi.org/10.1007/s40808-017-0277-0.
Alkhasawneh, M.S., Ngah, U.K., Tay, L.T., Isa, N.A.M., Al-Batah, M.S., 2014. Modeling and testing landslide hazard using decision tree. J. Appl. Math. 2014 https://doi.org/10.1155/2014/929768, 929768.
Awad, M., Khanna, R., 2015. Support Vector Regression. Efficient Learning Machines. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4302-5990-9_4.
Bhattacharya, S., Carr, T.R., 2019. Integrated data-driven 3D shale lithofacies modeling of the Bakken Formation in the Williston basin, North Dakota, United States. J. Petrol. Sci. Eng. 177, 1072–1086. https://doi.org/10.1016/j.petrol.2019.02.036, 2019.
Bhattacharya, S., Mishra, S., 2018. Applications of machine learning for facies and fracture prediction using Bayesian Network Theory and Random Forest: case studies from the Appalachian basin, USA. J. Petrol. Sci. Eng. 170, 1005–1017. https://doi.org/10.1016/j.petrol.2018.06.075, 2018.
Bhattacharya, S., Carr, T.R., Pal, M., 2016. Comparison of supervised and unsupervised approaches for mudstone lithofacies classification: case studies from the Bakken and Mahantango-Marcellus Shale, USA. J. Nat. Gas Sci. Eng. 33, 1119–1133. https://doi.org/10.1016/j.jngse.2016.04.055, 2016.
Bhattacharya, S., Ghahfarokhi, P.K., Carr, T.R., Pantaleone, S., 2019. Application of predictive data analytics to model daily hydrocarbon production using petrophysical, geomechanical, fiber-optic, completions, and surface data: a case study from the Marcellus Shale, North America. J. Petrol. Sci. Eng. 176, 702–715. https://doi.org/10.1016/j.petrol.2019.01.013, 2017.
Blum, P., 1997. Physical properties handbook: a guide to the shipboard measurement of physical properties of deep-sea cores. Coll. Stat. Texas, USA http://www-odp.tamu.edu.
Brckovic, A., Kovacevic, M., Cvetkovic, M., Mocilac, I.K., Rukavina, D., Saftic, B., 2017. Application of artificial neural networks for lithofacies determination based on limited well data. Central Eur. Geol. 3 (60), 299–315. https://doi.org/10.1556/24.60.2017.012.
Brieuc, M.S.O., Waters, C.D., Drinan, D.P., Naish, K.A., 2018. A practical introduction to Random Forest for genetic association studies in ecology and evolution. Mol. Ecol. Resour. 18 (4), 755–766. https://doi.org/10.1111/1755-0998.12773.
Bruno, A.E., Charbonneau, P., Newman, J., Snell, E.H., So, D.R., Vanhoucke, V., Watkins, C.J., Williams, S., Wilson, J., 2018. Classification of crystallization outcomes using deep convolutional neural networks. PloS One 13, 6. https://doi.org/10.1371/journal.pone.0198883.
Buschow, K.H.J., Boer, F. R. de, 2004. Physics of Magnetism and Magnetic Materials. Kluwer Academic Publishers, ISBN 0-306-47421-2, pp. 105–115. https://doi.org/10.1007/b100503.
Castro, W., Oblitas, J., Santa-Cruz, R., Avila-George, H., 2017. Multilayer perceptron architecture optimization using parallel computing techniques. PloS One 12, 12. https://doi.org/10.1371/journal.pone.0189369.
Chen, Y., Wu, W., 2016. A prospecting cost-benefit strategy for mineral potential mapping based on ROC curve analysis. Ore Geol. Rev. 74, 26–38. https://doi.org/10.1016/j.oregeorev.2015.11.011, 2016.
CIE, 2004. Technical Report, third ed. COLORIMETRY, ISBN 3901906339, p. 82pp https://archive.org/details/gov.law.cie.15.2004.
De Boissieu, F., Sevin, B., Cudahy, T., Mangeas, M., Chevrel, S., Ong, C., Rodger, A., Maurizot, P., Laukamp, C., Lau, I., Touraivane, T., Cluzel, D., Despinoy, M., 2017. Regolith-geology mapping with support vector machine: a case study over weathered Ni-bearing peridotites, New Caledonia. Int. J. Appl. Earth Obs. Geoinf. (64), 377–385. https://doi.org/10.1016/j.jag.2017.05.012.
Devanand, Kumar, N., 2015. Prediction of CMRS rock mass rating using fuzzy logic. International Conference on Advances in Computer Engineering and Applications. https://doi.org/10.1109/ICACEA.2015.7164685. ICACEA - 2015.
Doveton, J.H., 1994. Geological log interpretation. SEPM Society for Sedimentary Geology (29). https://doi.org/10.2110/scn.94.29.
Dubeau, P., King, D.J., Unbushe, D.G., Rebelo, L., 2017. Mapping the dabus wetlands, Ethiopia, using random forest classification of landsat, PALSAR and topographic data. Rem. Sens. (9), 1056. https://doi.org/10.3390/rs9101056.
Egan, J.P., 1975. Signal Detection Theory and ROC Analysis. Academic Press, New York, 2017.
Fajana, A.O., Ayuk, M.A., Enikanselu, P.A., 2019. Application of multilayer perceptron neural network and seismic multiattribute transforms in reservoir characterization of Pennay field, Niger Delta. J. Petrol. Explor. Prod. Technol. (9), 31–49. https://doi.org/10.1007/s13202-018-0485-9, 2019.
Fan, B., Shi, L., Li, Y., Zhang, T., Lv, L., Shikai, T., 2019. Lithologic heterogeneity of lacustrine shale and its geological significance for shale hydrocarbon-a case study of Zhangjiatan Shale. Open Geosci. 2019 (11), 101–112. https://doi.org/10.1515/geo-2019-0009.
Gajowniczek, K., Zabkowski, T., Szupiluk, R., 2014. Estimating the roc curve and its significance for classification models’ assessment. Quantit. Methods Econ. 15 (2), 382–391, 2014.
Goodrich, W.E., 2007. Characterization and Quantification of Magnetic Remanence in Unexploded Ordnance. Master of Science (Geophysics). Faculty and Board of Trustees of the Colorado School. https://mountainscholar.org/bitstream/handle/11124/79225/T06342.pdf?sequence=1.
Haykin, S., 2009. Neural Networks and Learning Machines, third ed. Pearson Prentice Hall, New York City, U. S., p. 938pp
Hong, T., White, C.D., Gani, M.R., Bhattacharya, J., 2004. Comparison of multivariate statistical algorithms for wireline log facies classification. AAPG Ann. Meet Abst. (88), 13.
Hossin, M., Sulaiman, M.N., 2015. A review on evaluation metrics for data classification evaluations. Int. J. Data Min. Knowl. Manag. Process 2 (5), 1–11. https://doi.org/10.5121/ijdkp.2015.5201.
Hsieh, B., Lewis, C., Lin, Z., 2005. Lithology identification of aquifers from geophysical well logs and fuzzy logic analysis: shui-Lin Area, Taiwan. Comput. Geosci. 31 (3), 263–275. https://doi.org/10.1016/j.cageo.2004.07.004.
Hughes, V.K., Langlois, N.E.I., 2010. Use of reflectance spectrophotometry and colorimetry in a general linear model for the determination of the age of bruises. Forensic Sci. Med. Pathol. 6 https://doi.org/10.1007/s12024-010-9171-z, 275–28.
Jahdhami, N.A., Anboori, A., 2017. The Application of Specific Drilling Energy to Identify Overburden Lithological Boundaries and Aid Well Operations - Oman Khazzan Field. Abu Dhabi International Petroleum Exhibition & Conference, Abu Dhabi, UAE. https://doi.org/10.2118/188413-MS.
James, G., Witten, D., Hastie, T., Tibshirani, R., 2013. An Introduction to Statistical Learning. Springer, New York City, U.S, p. 187. https://doi.org/10.1007/978-1-4614-7138-7.
Jing, S., Liu, C., Li, G., Yan, G., Zhang, Y., 2017. An efficient algorithm for parallel computation of rough entropy using CUDA. Int. Conf. Comput. Intell. Sec. 13 https://doi.org/10.1109/CIS.2017.00009.
Jovane, L., Hinnov, L., Housen, B.A., Herrero-Barvera, E., 2013. Magnetic Methods and the Timing of Geological Processes, vol. 373. Geological Society Special Publication Nº https://doi.org/10.1144/SP373.17.
Kong, G., Xia, Y., Qiu, C., 2014. Cost-sensitive bayesian network classifiers and their applications in rock burst prediction. ICIC 2014, LNCS 8588, 101–112. https://doi.org/10.1007/978-3-319-09333-8_12.
Korjus, K., Hebart, M.N., Vicente, R., 2016. An efficient data partitioning to improve classification performance while keeping parameters interpretable. PloS One 11 (8). https://doi.org/10.1371/journal.pone.0161788.
Korolev, E.A., Usmanov, S.A., Nikolaev, D.S., Gabdelvaliyeva, R.R., 2018. Effect of lithological heterogeneity of bitumen sandstones on SAGD reservoir development. IOP Conf. Ser. Earth Environ. Sci. 155 https://doi.org/10.1088/1755-1315/155/1/012019, 2018.
Latifovic, R., Pouliot, D., Campbell, J., 2018. Assessment of convolution neural networks for surficial geology mapping in the south rae geological region, northwest territories, Canada. Rem. Sens. 10 (2), 307. https://doi.org/10.3390/rs10020307.
Lee, S.H., Datta-Gupta, A., 1999. Electrofacies characterization and permeability predictions in carbonate reservoirs: role of multivariate analysis and nonparametric regression. In: SPE Annual Technical Conference and Exhibition (3–6 October). https://doi.org/10.2118/56658-MS. Houston, Texas.
Li, J., Tran, M., Siwabessy, J., 2016. Selecting optimal random forest predictive models: a case study on predicting the spatial distribution of seabed hardness. Geosci. Australia GPO Box 378. https://doi.org/10.1371/journal.pone.0149089.
Loussaief, S., Abdelkrim, A., 2018. Machine Learning framework for image classification. ASTESJ 3, 1. https://doi.org/10.1109/SETIT.2016.7939841.
Maimon, O., Rokach, L., 2010. Data Mining and Knowledge Discovery Handbook, 2 editions. Springer, New York City, U.S. https://doi.org/10.1007/978-0-387-09823-4 (Chapter 9).
Maniriho, P., Ahmad, T., 2018. Analyzing the Performance of Machine Learning Algorithms in Anomaly Network Intrusion Detection Systems. In: 4th International Conference on Science and Technology. https://doi.org/10.1109/ICSTC.2018.8528645. ICST 2018.
McCreery, E., Al-Mudhafar, W., 2017. Geostatistical classification of lithology using partitioning algorithms on well log data - a case study in forest hill oil field, east Texas basin. 79th eage conference and exhibition. At Paris, France. https://doi.org/10.3997/2214-4609.201700905.
Mcneill, L.C., Dugan, B., Petronotis, K.E., Backman, J., Bourlange, S., Chemale, F., Chen, W., Colson, T.A., Frederik, M.C.G., Guerin, G., Hamahashi, M., Henstick, T., House, B.M., Hupers, A., Jeppson, T.N., Kachovich, S., Kenigsberg, A.R., Kuranaga, M., Kutterolf, S., Milliken, K.L., Mitchison, F.L., Mukoyoshi, H., Nair, N., Owari, S., Pickering, K.T., Pouderoux, H.F.A., Yehua, S., Song, I., Torres, M.E., Vannucchi, P., Vrolijk, P.J., Yang, T., Zhao, X., 2017. Expedition 362 summary. Proc. Int. Ocean Discov. Program 362. https://doi.org/10.14379/iodp.proc.362.101.2017.
Navin, M., Pankaja, R., 2016. Performance analysis of text classification algorithms using confusion matrix. Int. J. Eng. Tech. Res. 4 (6), 75–78.
ODP, 2007. ODP Prime Scientific Data: Collection, Archive, and Quality ODP. Technical Note 37. URL: http://www-odp.tamu.edu/publications/tnotes/tn37/TNOTE_37.PDF.
Orozco-del-Castillo, M.G., Ortiz-Aleman, C., Urrutia-Fucugauchi, J., Rodríguez-Castellanos, A., 2011. Fuzzy logic and image processing techniques for the interpretation of seismic data. J. Geophys. Eng. 8, 185–194. https://doi.org/10.1088/1742-2132/8/2/006, 2017.
Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A., 2016. The limitations of deep learning in adversarial settings. In: European Symposium on Security and Privacy. https://doi.org/10.1109/EuroSP.2016.36, 2017, Germany.
Peng, H., Bai, X., 2017. Limits of Machine Learning Approach on Improving Orbit Prediction Accuracy Using Support Vector Machine. Advanced Maui Optical and Space Surveillance (AMOS) Technologies Conference, Hawaii, 2017.
Peng, J., Zhou, Y., Chen, C.L.P., 2015. Region-kernel-based support vector machines for hyperspectral image classification. IEEE Trans. Geosci. Rem. Sens. 9 (53), 4810–4824. https://doi.org/10.1109/TGRS.2015.2410991.
Pirrone, M., Battigelli, A., Ruvo, L., 2014. Lithofacies Classification of Thin Layered Turbidite Reservoirs through the Integration of Core Data and Dielectric Dispersion Log Measurements. https://doi.org/10.2118/170748-MS.
Rahim, I.A., Tahir, S. Hj, Musta, B., Omang, S.A.K., 2009. Lithological unit thickness approach for determining intact rock strength (IRS) of slope forming rock material of crocker formation. Borneo Sci. 25, 23–32.
Raschka, S., 2015. Python Machine Learning. Packt Publishing Ltd, England, p. 454pp.
Romppanem, S., Häkkänem, H., Kaski, S., 2017. Singular value decomposition approach to the yttrium occurrence in mineral maps of rare earth element ores using laser-induced breakdown spectroscopy. Spectrochim. Acta, Part B 134, 69–74. https://doi.org/10.1016/j.sab.2017.06.002, 2017.
Rosid, M.S., Haikel, S., Haidar, M.W., 2019. Carbonate reservoir rock type classification using comparison of Naïve Bayes and Random Forest method in field “S” East Java. AIP Conf. Proc. 2168, 020019 https://doi.org/10.1063/1.5132446, 2017.
Sáez, J.A., Galar, M., Luengo, J., Herrera, F., 2013. Tackling the problem of classification with noisy data using Multiple Classifier Systems: analysis of the performance and robustness. Inf. Sci. 247 https://doi.org/10.1016/j.ins.2013.06.002.
Sahoo, S., Jha, M.K., 2017. Pattern recognition in lithology classification: modeling using neural networks, self-organizing maps and genetic algorithms. Hydrogeol. J. 25, 311–330. https://doi.org/10.1007/s10040-016-1478-8, 2017.
Souza, J.F.L., Santos, M.D., Magalhães, R.M., Neto, E.M., Oliveira, G.P., Roque, W.L., 2019. Automatic classification of hydrocarbon "leads" in seismic images through artificial and convolutional neural networks. Comput. Geosci. 132, 23–32. https://doi.org/10.1016/j.cageo.2019.07.002.
Spronck, P., 2017. The Coder’s Apprentice - Learning Programming with Python 3. Spronck Create Commons Licences, p. 398. Version 1.0.16. http://www.spronck.net/pythonbook/pythonbook.pdf.
Storkey, A.J., 2013. When Training and Test Sets Are Different: Characterising Learning Transfer. https://doi.org/10.7551/mitpress/9780262170055.003.0001.
Strauß, S., 2018. From big data to deep learning: a leap towards strong AI or ‘intelligentia obscura’? Big Data Cognitive Computing 2 (3), 16. https://doi.org/10.3390/bdcc2030016.
Tharwat, A., 2018. Classification assessment methods. Appl. Comput. Inform. https://doi.org/10.1016/j.aci.2018.08.003.
Tilaki-Hajian, K., 2013. Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian J Intern Med 4 (2), 627–635pp.
Vakhshoori, V., Zare, M., 2018. Is the ROC curve a reliable tool to compare the validity of landslide susceptibility maps? Geomatics, Nat. Hazards Risk 1 (9). https://doi.org/10.1080/19475705.2018.1424043, 2018.
Vapnik, V.N., 1998. Statistical Learning Theory. A Wiley-Interscience Publication, New York City, U. S., p. 740pp
Venkataraman, S., 2017. System Design for Large Scale Machine Learning. Electrical. University of California at Berkeley, EECS Department. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-219.html.
Wallet, B., Hardisty, R., 2019. Unsupervised seismic facies using Gaussian mixture models. SEG Library 3 (7), 1A–T725. https://doi.org/10.1190/INT-2018-0119.1.
Wang, G., Carr, T.R., Ju, Y., Li, C., 2014. Identifying organic-rich Marcellus Shale lithofacies by support vector machine classifier in the Appalachian basin. Comput.
Puggini, L., Doyle, J., Mcloone, S., 2015. fault detection using random forest similarity Geosci. 64, 52–60. https://doi.org/10.1016/j.cageo.2013.12.002, 2014.
distance. IFAC-PapersOnLine 48 (21), 583–588. https://doi.org/10.1016/j. YU, L., Porwal, A., Holden, E., Dentith, M.C., 2012. Towards automatic lithological
ifacol.2015.09.589. classification from remote sensing data using support vector machines. Comput.
Rafik, B., Kamel, B., 2017. Prediction of permeability and porosity from well log data Geosci. 45 https://doi.org/10.1016/j.cageo.2011.11.019.
using the nonparametric regression with multivariate analysis and neural network, Zhao, L.N., Tian, F.Y., Wu, H., Qi, D., Wang, Z., 2011. Verification and comparison of
Hassi R’Mel Field, Algeria. Egyptian J. Petrol. 26 (3), 763–778. https://doi.org/ probabilistic precipitation forecasts using the TIGGE data in the upriver of Huaihe
10.1016/j.ejpe.2016.10.013. Basin. Adv. Geosci. (29), 95–102pp. https://doi.org/10.5194/adgeo-29-95-2011,
2011.

