Paolo Dell’Aversana
Eni SpA - San Donato Milanese (MI)
Email: dellavers@tiscali.it
Abstract
Rock sample classification is commonly performed through mineralogical and/or chemical analysis, often combined with additional information such as laboratory measurements and well logs. All of these data sets can be combined (Data Fusion) in a single automatic process based on the application of Machine Learning algorithms. This automatic approach can speed up the entire classification/interpretation process while increasing the reliability of the classification results. In this paper, I introduce a Machine Learning framework and a complete workflow for rapid and reliable rock classification based on mineralogical and chemical composition. With simple tests, I show the effectiveness of this framework for quick classification of sedimentary and magmatic rock samples, with significant possible implications in operations geology.
Introduction
Fast classification and/or clustering of rock samples based on mineralogical and/or chemical composition can be efficiently performed through supervised and unsupervised Machine Learning (ML) methods (Bishop, 2006; Samuel, 1959). If a supervised learning approach is used, one or more automatic classifiers are trained on a labelled subset of data representing the training data set; then, the learnt classification models are generalized to the entire unlabeled data set, significantly speeding up the entire interpretation workflow. There are many algorithms allowing fast and reliable automation of the classification process traditionally performed by geologists (Aminzadeh and de Groot, 2006; Hall, 2016). These algorithms use some type of relevant feature by which data can be grouped and/or classified.
The ML framework discussed in this paper has been developed using open source Python libraries. It allows multidisciplinary applications (not necessarily confined to the geosciences domain), such as seismic facies classification, combination of seismic and non-seismic data, well log analysis, statistical analysis of seismic attributes, medical diagnosis, musical genre classification, analysis of financial trends, and analysis of social media data. Before showing applications of the system, I first introduce the general aspects of the methodology. I discuss some of the key steps of the workflow, which includes the main algorithms commonly used for Big Data Analytics, clustering and classification.
Besides the scientific goal of introducing this multidisciplinary ML framework, the paper has pragmatic, tutorial and didactic purposes too. Consequently, I prefer focusing the discussion on applicative cases, in order to show the practical aspects of Machine Learning. Theoretical and computational aspects of the ML methods used in my framework can be found in the books of Russell and Norvig (2016) and Raschka and Mirjalili (2017). In the following sections, I discuss classification tests in different geological domains. For didactic reasons, I start with a very simple test of binary classification, for better controlling/explaining each step of the workflow. Then, I progressively increase the complexity of the examples, including classification tests of many types of sedimentary and magmatic samples.
Figure 1 shows a schematic representation of the Machine Learning framework that I have
developed and that I have used in the classification tests discussed in the following sections. It
includes a set of supervised learners, such as CN2 Rule Induction, Naïve Bayes, Decision Tree, Random Forest, Support Vector Machine, and Adaptive Boosting (as schematically shown in the central part of figure 1). Detailed descriptions of these algorithms and their implementation in the Python language can be found, for instance, in Raschka and Mirjalili (2017). The same framework also includes semi-supervised and unsupervised algorithms, such as K-means and many other methods (Barnes and Laughlin, 2002), although they are not explicitly shown in this schematic figure. These
are useful when we do not have any training data set and we desire, for instance, to cluster our data
without necessarily assigning a specific class label. Furthermore, the same framework includes a
suite of algorithms for data pre-processing, statistical analysis, Principal Component Analysis (PCA),
feature ranking analysis and so forth (see the left part of figure 1). Some of these algorithms, like PCA, are applied and briefly discussed in the following examples. I will show how they work on real data.
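As an illustration of how such a suite of learners might be assembled, the sketch below instantiates scikit-learn counterparts of the supervised methods named above; the specific classes and parameters are my assumptions, not the paper's actual configuration (CN2 Rule Induction has no scikit-learn implementation, so Logistic Regression, which appears later in the paper, stands in for it):

```python
# Hypothetical scikit-learn counterparts of the learners named in the text.
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

learners = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Adaptive Boosting": AdaBoostClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Train every learner on the same labelled feature matrix (toy values).
X_train = [[0.80, 0.70], [0.90, 0.60], [0.10, 0.20], [0.20, 0.10]]
y_train = ["claystone", "claystone", "sandstone", "sandstone"]
for name, model in learners.items():
    model.fit(X_train, y_train)
```

Training all learners on the same matrix makes it straightforward to compare their cross-validation performances later in the workflow.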
In the applications described in this paper, the input of the workflow consists of a multi-feature matrix including the percentages of the various mineralogical types and/or the chemical composition recognized in each sample. Part of the rock samples has been analyzed under the microscope by an expert geologist and labelled, for creating a training data set. Then, the remaining classification work is performed automatically using one or more classifiers. The training (labelled) data set is also useful for comparing the “classification performances” of the various classifiers through cross-validation tests, as discussed in the following. Finally, the workflow is completed by applying a suite of post-processing algorithms to the classified data. These algorithms can have different functions, such as detection of outliers, creation of interpolated maps and so forth.
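As an illustration of the post-processing stage, a minimal z-score outlier detector could look like the following (this is one common choice; the paper does not specify which detection algorithm the framework actually uses, and the illite values below are invented):

```python
from math import sqrt

def zscore_outliers(values, threshold=3.0):
    """Return indices of values lying more than `threshold`
    standard deviations away from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# Toy example: one anomalous illite percentage among ordinary ones
illite = [21.0, 22.5, 20.8, 23.1, 21.9, 95.0]
print(zscore_outliers(illite, threshold=2.0))  # → [5]
```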
The workflow commonly starts with training the various ML algorithms using the labelled data set (left block of figure 1) and then continues by applying the algorithms to the unlabeled data (right block). The double arrows linking the three blocks horizontally simply indicate that, as a first attempt, I try to use the same algorithms for training and for classifying the data. Then, after evaluating the performance of the different algorithms (see the following paragraphs), I select only the best-performing classifiers for classifying the unlabeled data. After this brief overview of the ML framework, we can move to the first classification test.
Figure 1. Block diagram of the Machine Learning framework.
I start by discussing a simple test of binary classification based on the mineralogical composition of a small data set of sedimentary rock samples. The initial steps of the ML workflow concern statistical data analysis and feature processing, normalization and comparison. Figure 2 shows three examples of the statistical distributions of three mineralogical types in the rock samples (claystone and sandstone) used in this classification test: quartz, kaolinite and illite. As expected, illite is a mineralogical feature that shows good “discrimination power” between the two types of rocks. In addition, the quartz content shows two distinct curves, although the two distributions for sandstone and claystone partially overlap. A large range of overlap can be observed for the probability density functions of kaolinite; thus, we can expect that this feature is not the most relevant for classification purposes. In total, I used nine mineralogical species, showing variable discrimination power: illite, pyrite, chlorite, k-feldspar, kaolinite, quartz, calcite, dolomite, and
plagioclase. We can use all of them in the automatic classification. However, a different approach is to identify and select only those features (minerals) that allow the sharpest distinction between the different rock classes. For this reason, an important part of the workflow is “feature ranking”.
Figure 2. Three examples of statistical distribution of minerals in the rock samples used in
the test.
Table 1 shows the feature ranking, calculated using different ranking indexes, such as Information Gain, Gain Ratio, Gini Index and others. For instance, Information Gain (IG) and Gain Ratio (GR) provide a measure of how much information a feature gives us about the class (Gain Ratio is a simple modification of the Information Gain aimed at reducing its bias). The “rule” is that features allowing a perfect class partition should give maximal information, whereas unrelated features should give no information. In other words, the IG of a feature measures the reduction in entropy obtained when using that feature, where entropy reflects the level of heterogeneity (impurity) in an arbitrary collection of examples. Analogously, the features with the maximum Gain Ratio are the ones showing the highest power of splitting the data into separate classes.
The Gini Index (GI) follows a ranking criterion not very different from those of IG and GR. It reflects how often a randomly chosen example would be misclassified if it were labelled randomly according to the class distribution. A quantitative description of IG, GR, GI and all the other ranking criteria can be found in Raschka and Mirjalili (2017). From table 1, we can see that illite and pyrite show the highest indexes, whereas dolomite and plagioclase show the lowest values. Just by looking at this table, we are able to select in advance the most relevant features for our clustering and/or classification purposes.
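To make the ranking idea concrete, here is a minimal pure-Python sketch of Information Gain for a single binary split of a continuous feature; the threshold and mineral percentages are illustrative values, not taken from the paper's data set:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a collection of class labels (impurity measure)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels, threshold):
    """Reduction in class entropy given a binary split at `threshold`."""
    left = [l for v, l in zip(feature_values, labels) if v <= threshold]
    right = [l for v, l in zip(feature_values, labels) if v > threshold]
    n = len(labels)
    h_cond = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - h_cond

# Toy example: illite content (%) that separates claystone from sandstone
illite = [22, 25, 30, 28, 5, 7, 4, 6]
rock = ["clay", "clay", "clay", "clay", "sand", "sand", "sand", "sand"]
print(information_gain(illite, rock, threshold=15))  # → 1.0 (perfect split)
```

A feature whose best split yields an IG close to the class entropy, as illite does here, ranks at the top of the table; an unrelated feature yields an IG close to zero.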
A standard procedure for comparing the performance of the different learning algorithms is the cross-validation test. This uses the labelled data forming the training data set, which is further partitioned into two complementary subsets. First, we apply the various learning algorithms on one subset (called the training sub-set), and then we validate their generalization power on the other subset (called the validation sub-set or testing sub-set).
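The partitioning step can be sketched as a simple k-fold index generator; the paper does not state how many folds were used, so the number of folds below is an arbitrary choice:

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) pairs for k-fold
    cross-validation over n_samples items."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        valid = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, valid

# Example: 10 labelled samples split into 5 folds of 2 validation samples
for train, valid in k_fold_splits(10, 5):
    assert len(valid) == 2 and len(train) == 8
```

Each learner is fitted on every training subset and evaluated on the corresponding validation subset, so all labelled samples contribute to the performance estimate.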
Using “confusion matrices”, we can verify the performance of each classification algorithm examined in the cross-validation tests. Each row of the confusion matrix represents the instances in a predicted class, while each column represents the instances in an actual class. Thus, we can estimate the effectiveness of each algorithm in generalizing the classification results (obtained on the training sub-set) by verifying the percentage of cases properly classified (on the validation sub-set).
For instance, figure 3 compares the confusion matrices of the Random Forest, Adaptive Boosting, Logistic Regression and Naïve Bayes methods. We can see that all four algorithms show relatively high values on the principal diagonal, suggesting a good capability of generalization. In this cross-validation test I used all nine features simultaneously.
Table 2 collects a set of indexes for evaluating the prediction results of the cross-validation test, for each individual method (Raschka and Mirjalili, 2017). For example, Classification Accuracy (CA) is the proportion of correctly classified examples, whereas Precision is the proportion of true positives among all the instances classified as positive.
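The confusion matrix and the derived indexes can be reproduced in a few lines of pure Python, following the convention used in the paper (rows = predicted class, columns = actual class); the toy labels below are invented for illustration:

```python
def confusion_matrix(actual, predicted, classes):
    """Rows = predicted class, columns = actual class."""
    counts = {(p, a): 0 for p in classes for a in classes}
    for a, p in zip(actual, predicted):
        counts[(p, a)] += 1
    return [[counts[(p, a)] for a in classes] for p in classes]

def accuracy(actual, predicted):
    """Classification Accuracy: proportion of correctly classified examples."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def precision(actual, predicted, positive):
    """Proportion of true positives among instances classified as positive."""
    flagged = [a for a, p in zip(actual, predicted) if p == positive]
    return sum(a == positive for a in flagged) / len(flagged)

actual    = ["clay", "clay", "clay", "sand", "sand", "sand"]
predicted = ["clay", "clay", "sand", "sand", "sand", "clay"]
print(confusion_matrix(actual, predicted, ["clay", "sand"]))  # → [[2, 1], [1, 2]]
```

High values on the principal diagonal, as in figure 3, correspond directly to high accuracy and precision.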
Figure 3. Confusion matrices calculated for four different methods, applied to the labelled data set.
Table 2. Indexes of the classification performance for the various methods.
Figure 4 shows a plot of the classification results, shown here for four different methods, mapped versus the illite and pyrite mineralogical features. All the methods provide comparable classification results. Naïve Bayes and Random Forest show a sharp separation between the two classes. The remaining two methods (and the other two not mapped here: CN2 Rule Induction and Decision Tree) show one apparent classification outlier in the upper right corner of the map.
Figure 4. Classification results for four different methods (crosses = sandstone; circles =
claystone).
Another useful tool of the framework is Principal Component Analysis (PCA), which transforms the data into a set of values of linearly uncorrelated variables (Jolliffe, 2002). These are called “principal components”. The optimal matrix transformation is defined in such a way that the first principal component has the largest possible variance; in other words, it accounts for as much of the variability in the data as possible. Each succeeding component, in turn, has the highest variance possible under the constraint that it is orthogonal to the preceding components.
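The PCA step can be sketched with NumPy as an eigendecomposition of the covariance matrix; this is a minimal version of what library routines do, and the random matrix below merely stands in for a real mineralogical feature matrix:

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto its top principal components."""
    Xc = X - X.mean(axis=0)               # center each feature
    cov = np.cov(Xc, rowvar=False)        # feature covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:n_components]
    return Xc @ eigvec[:, order], eigval[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 9))              # e.g. 50 samples x 9 mineral features
scores, explained = pca(X, n_components=2)
print(scores.shape)                       # → (50, 2)
```

Plotting the first two columns of `scores` against each other produces maps like figures 5, 6 and 9, where each sample is located by its PC1 and PC2 values instead of any single mineral percentage.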
Figure 5 shows a PCA map (where the axes are not any of the original features, but the first two principal components) of the results classified into the two classes, sandstone and claystone. This classification was obtained through Naïve Bayes. We can see that the distinction between the two classes is properly performed, in the sense that the automatic classifier allowed a clear separation of the two classes in the principal-component domain.
Figure 5. Principal Component Analysis map of the Naïve Bayes classification, using the two principal components PC1 and PC2.
After having acquired some insight into the ML workflow, we can now apply the same framework to slightly more complex data sets. In the next test, I added some carbonate samples with, of course, significant mineralogical differences with respect to the claystones and sandstones. In this case, using features like the calcite and dolomite content was fundamental for obtaining a correct classification. For instance, figure 6 shows the classification results obtained using the Naïve Bayes method, applied to all nine features and plotted in the PCA domain formed by the first two principal components (PC1 and PC2). The three classes are properly distinguished, excluding just one ambiguous sample. Finally, an expert geologist checked all these classification results, confirming their geological reliability.
Figure 6. Naïve Bayes classification of three different rock types based on nine mineralogical features. Horizontal axis: PC1; vertical axis: PC2 (crosses = sandstone; circles = claystone; triangles = carbonates).
The next test concerns the classification of magmatic samples based on their chemical composition (oxides). I have used the public data set available on the GEOROC (Geochemistry of Rocks of the Oceans and Continents) web site (http://georoc.mpch-mainz.gwdg.de/georoc/). For this application, I used only a part of the available data set, consisting of 235 samples and 11 features, creating an input matrix for the ML workflow including 2585 instances. The samples come from different parts of the world and represent different types of volcanic rocks, belonging to four classes: andesite, dacite, rhyolite and shoshonite (Mamani et al., 2010).
Table 3 shows the ranking of the chemical features used for training and classification.
Figure 7 shows two examples of statistical distributions for two of these features, extracted from the training data set (almost 100 samples). We can see that the SiO2 composition, especially, shows a good discrimination power among the classes.
Figure 7. Normalized probability density distributions of two chemical features in the training data set.
Following the same workflow applied in the previous examples, I applied a supervised learning procedure involving six different learners. I performed the training phase using different percentages of samples (ranging from 10% to 40%). Figure 8 shows the confusion matrices for the Random Forest and Naïve Bayes methods. The prediction performances range between 85% and 100% (looking at the principal diagonal of the matrices). The classification performances of all the methods are better summarized in Table 4, where they are expressed using the same statistical indexes introduced above.
Figure 8. Confusion matrix for Random Forest (left panel) and for Naïve Bayes (right panel).
Table 4. Table of classification performance for all six methods used in this test.
Finally, figure 9 shows an example of the classification of all the unlabeled samples (for the Naïve Bayes method), plotted after applying Principal Component Analysis (PCA), using the first two principal components (PC1 and PC2). We can see that all the samples are separated into distinct fields, excluding a few ambiguous cases falling at the transition border between different rock types.
Figure 9. Naïve Bayes classification of the unlabeled magmatic samples in the PC1-PC2 plane (classes: andesite, dacite, rhyolite, shoshonite).
A challenging classification test
Finally, I discuss a classification test including 1670 samples, 10 chemical features and 9 rock classes, for 16700 instances. There are four prevalent rock types: rhyolite, dacite, basalt and andesite. Furthermore, there are 5 other “minor” classes (with significantly fewer samples). These minor classes partially overlap the distributions of the previously mentioned rock types (figure 10). The few samples belonging to classes different from rhyolite, dacite, andesite and basalt create a sort of “random noise” that negatively affects the classification process. In other words, their chemical composition is partially similar to that of one or more of the four major classes. This overlap in the feature space makes the classification more challenging. Despite this difficulty, the ML framework introduced here allows a satisfactory classification, separating the main classes into four distinct areas of the feature space, with only partial (unavoidable) overlap. In fact, there is a partial overlap between the basalt and andesite rocks, as we can expect by looking at the distributions of chemical compositions in figure 10. The points belonging to the “minor” classes (few samples) are hidden in the background and are only partially visible (see, for instance, the few orange and purple circles). Figure 11 shows the plot produced through the Naïve Bayes classifier, as an example. In this case, the plot is shown for the K2O and TiO2 oxides, but the classification was performed using all ten chemical features available in the database.
Although I used a relatively large and complex data set, the whole classification test required a small computation time, as in the previous examples. About 3 seconds were necessary for running the entire workflow on a standard PC (dual-core Intel processor; 2.5 GHz; 12.0 GB RAM; Windows 10, 64-bit), including statistical analysis, feature ranking, data pre-processing, normalization of the features, PCA, training (using about 20% of the data set), performance evaluation, calculation of the confusion matrices, and classification of the whole unlabeled data set, using all the features and six different classifier algorithms.
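The feature-normalization step mentioned in the workflow above can be sketched as a per-column min-max rescaling; this is one common choice, as the paper does not specify which normalization is applied, and the oxide values below are invented:

```python
def normalize_features(matrix):
    """Min-max rescale each column (feature) of a samples-by-features
    matrix into the [0, 1] range."""
    columns = list(zip(*matrix))
    rescaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0          # guard against constant features
        rescaled.append([(v - lo) / span for v in col])
    return [list(row) for row in zip(*rescaled)]

X = [[52.0, 1.2], [63.0, 0.8], [74.0, 0.3]]   # e.g. SiO2 and TiO2 wt%
X_norm = normalize_features(X)
print([row[0] for row in X_norm])  # → [0.0, 0.5, 1.0]
```

Normalizing each feature to a common range prevents oxides measured on large scales (such as SiO2) from dominating distance-based learners and the PCA step.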
Figure 10. Normalized probability density distributions of K2O and TiO2 for the four major rock classes (rhyolite, andesite, dacite, basalt).
Figure 11. Classification using Naïve Bayes and K2O-TiO2 chemical features.
Final remarks and possible applications
In this paper, I have introduced a Machine Learning framework and workflow for the automatic classification of rocks based on mineralogical content and chemical composition. I have used six different classifiers, and all of them showed good classification performances. Comparing the automatic classification with the “manual” classification performed by an expert geologist, all the classifiers produced geologically reasonable results. The main benefit of this automatic approach is that it can support the “manual” mineralogical analysis of the rock samples, speeding up the classification process. In fact, the computation time is extremely short: I estimated a few seconds, on a standard PC, for classifying thousands of samples using six classifiers at the same time, based on 10 or more mineralogical features. The same work commonly requires many days with “traditional” (non-automatic) methods. Consequently, the Machine Learning approach presented here can be helpful for geologists who have to classify rock samples in a short time, for instance during drilling operations.
For instance, efficient methods (like Raman spectroscopy) for measuring the mineralogical composition of rock samples exist (Carey et al., 2015). During well site and/or field operations, mineralogical (and chemical) data could be combined with composite logs in the same Machine Learning workflow, supporting operational decisions in quasi-real time. Of course, a similar practical impact is possible in other fields where quick and reliable rock classification of samples forming large databases is required, such as in volcanology and in the study of magmatic systems.
References
1) Aminzadeh, F. and de Groot, P., 2006. Neural Networks and Other Soft Computing Techniques with Applications in the Oil Industry. EAGE Publications.
2) Barnes, A. E. and Laughlin, K. J., 2002. Investigation of methods for unsupervised classification of seismic facies: SEG Technical Program Expanded Abstracts, 2221–2224.
3) Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer.
4) Carey, C., Boucher, T., Mahadevan, S., Bartholomew, P., and Dyar, M. D., 2015. Machine learning tools for mineral recognition and classification from Raman spectroscopy. Journal of Raman Spectroscopy, 46, 894–903.
5) Hall, B., 2016. Facies classification using machine learning: The Leading Edge, 35, 906–909.
6) Jolliffe, I. T., 2002. Principal Component Analysis, Springer Series in Statistics, 2nd ed. Springer.
7) Mamani, M., Wörner, G., and Sempere, T., 2010. Geochemical variations in igneous rocks of the Central Andean orocline (13°S to 18°S): Tracing crustal thickening and magma generation through time and space. GSA Bulletin, v. 122, no. 1–2, 162–182.
8) Raschka, S. and Mirjalili, V., 2017. Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition. Packt Publishing.
9) Russell, S. and Norvig, P., 2016. Artificial Intelligence: A Modern Approach, Global Edition. Pearson.
10) Samuel, A. L., 1959. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.
Note from the author
The framework discussed in this paper has been created using software libraries developed privately by the author and open source libraries (such as the Python scikit-learn package), readapted for the scopes of this work. The classification tests are based on public data sets (see the web reference below):
http://georoc.mpch-mainz.gwdg.de/georoc/webseite/Expert_Datasets.htm