
Machine Learning for rock classification based on mineralogical and chemical composition. A tutorial

Preprint · October 2018


DOI: 10.13140/RG.2.2.32886.04168


1 author:

Paolo Dell’Aversana
Eni SpA - San Donato Milanese (MI)

All content following this page was uploaded by Paolo Dell’Aversana on 02 October 2018.




Paolo Dell’Aversana
Email: dellavers@tiscali.it

Abstract
Rock sample classification is commonly performed through mineralogical and/or chemical
analysis, often combined with additional information, such as laboratory measurements and well
logs. All data sets can be combined (Data Fusion) in the same automatic process based on the
application of Machine Learning algorithms. This automatic approach can speed up the entire
classification/interpretation process, increasing at the same time the reliability of the classification
results. In this paper, I introduce a Machine Learning framework and a complete workflow for rapid
and reliable rock classification based on mineralogical and chemical composition. I show, with simple
tests, the effectiveness of this framework for quick rock classification of sedimentary and magmatic
samples, with significant possible implications in operation geology.

Keywords: Machine Learning, Supervised, Unsupervised, Rock classification, Chemical composition,
Mineralogy.

Introduction

Fast classification and/or clustering of rock samples based on mineralogical and/or chemical
composition can be performed efficiently through supervised and unsupervised Machine Learning
(ML) methods (Bishop, 2006; Samuel, 1959). If a supervised learning approach is used, one or more
automatic classifiers are trained on a labelled subset of the data (the training data set); then,
the learnt classification models are generalized to the entire unlabeled data set, significantly
speeding up the whole interpretation workflow. Many algorithms provide fast and reliable
classification/clustering results, supporting the “manual” interpretation/classification process
performed by geologists (Aminzadeh and de Groot, 2006; Hall, 2016). These algorithms rely on
relevant features by which data can be grouped and/or classified.

In this paper, I discuss a Machine Learning framework to classify rock samples on a
mineralogical/chemical basis. I used a multi-purpose Machine Learning workflow that I have
developed using open source Python libraries. It allows multidisciplinary applications (not
necessarily confined to the geosciences domain), such as seismic facies classification, combination
of seismic and non-seismic data, well log analysis, statistical analysis of seismic attributes,
medical diagnosis, musical genre classification, analysis of financial trends, analysis of social
media information, and speech and sentiment analysis. Considering the multidisciplinary
applicability of my ML system, I first introduce the general aspects of the methodology. I discuss
some of the key steps of the workflow, which includes the main algorithms commonly used for Big
Data Analytics, Clustering, Classification and Prediction of continuous outcomes.

Besides the scientific goal of introducing that multidisciplinary ML framework, this paper also has
pragmatic, tutorial and didactic purposes. Consequently, I prefer to focus the discussion on
applicative cases, in order to show the practical aspects of Machine Learning. Theoretical and
computational aspects of my ML framework can be found in the book by Russell and Norvig (2016),
which I used as the principal scientific reference.

I discuss several applications aimed at classifying rock samples belonging to different geological
domains. For didactic reasons, I start with a very simple test of binary classification, in order
to control and explain each step of the workflow. Then, I progressively increase the complexity of
the examples, including classification tests of many types of sedimentary and magmatic samples
based on a multitude of mineralogical/chemical features.

Methodology and workflow

Figure 1 shows a schematic representation of the Machine Learning framework that I have developed
and used in the classification tests discussed in the following sections. It includes a set of
supervised learners, such as CN2 Rule Induction, Naïve Bayes, Decision Tree, Random Forest, Support
Vector Machine, and Adaptive Boosting (as schematically shown in the central part of figure 1).
Detailed descriptions of these algorithms and their implementation in the Python language can be
found, for instance, in Raschka and Mirjalili (2017). The same framework also includes
semi-supervised and unsupervised algorithms, such as K-means and many other methods (Barnes and
Laughlin, 2002), although they are not explicitly shown in this schematic figure. These are useful
when no training data set is available and we want, for instance, to cluster the data without
necessarily assigning a specific class label. Furthermore, the framework includes a suite of
algorithms for data pre-processing, statistical analysis, Principal Component Analysis (PCA),
feature ranking and so forth (see the left part of figure 1). Some of these algorithms, like PCA,
are applied and briefly discussed in the following examples. I will show how they work on real
data and the benefits that they produce.
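
As an illustration only, the suite of supervised learners listed above could be assembled with scikit-learn roughly as follows. This is a minimal sketch and not the author's framework: the helper name `build_learners`, the synthetic data and all parameter values are assumptions, and CN2 Rule Induction is omitted because scikit-learn does not provide it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_learners(seed=0):
    """Hypothetical helper: map a readable name to a scikit-learn estimator."""
    return {
        "Naive Bayes": GaussianNB(),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "Support Vector Machine": SVC(random_state=seed),
        "Adaptive Boosting": AdaBoostClassifier(random_state=seed),
    }


# Synthetic stand-in for a samples-by-minerals matrix (9 features, 2 rock classes).
X, y = make_classification(n_samples=120, n_features=9, n_informative=4, random_state=0)

# Train every learner on the same labelled data, then collect its predictions.
learners = build_learners()
predictions = {name: model.fit(X, y).predict(X) for name, model in learners.items()}
```

Keeping the learners in one dictionary makes it straightforward to run the same training and evaluation steps over all of them, which is the pattern the workflow in figure 1 suggests.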

In the applications described in this paper, the input of the workflow consists of a multi-feature
matrix containing the percentages of the various mineralogical types and/or chemical components
recognized in each sample. A subset of the rock samples was analyzed under the microscope by an
expert geologist and labelled to create a training data set. The remaining classification work is
then performed automatically using one or more classifiers. The training (labelled) data set is
also useful for comparing the “classification performances” of the various classifiers through
cross-validation tests, as discussed below. Finally, the workflow is completed by applying a suite
of post-processing algorithms to the classified data. These algorithms can have different
functions, such as detection of outliers, creation of interpolated maps and so forth (see the right
block in figure 1).

The workflow commonly starts with training the various ML algorithms on the labelled data set (left
block) and then continues by applying the algorithms to unlabeled data (right block). The double
arrows linking the three blocks horizontally simply indicate that, as a first attempt, I try to use
the same algorithms for training and for classifying the data. Eventually, after evaluating the
performance of the different algorithms (see the following paragraphs), I select only the
best-performing classifiers for classifying the unlabeled data. After this brief overview of the ML
framework, let us see how it works through real classification examples.
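
The train, evaluate, select and classify sequence described above might look like the following minimal sketch. It assumes scikit-learn, synthetic data in place of real samples, and 5-fold cross-validated accuracy as the selection criterion; none of these choices is stated in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 300 samples, of which only the first 60 are treated as labelled.
X, y = make_classification(n_samples=300, n_features=9, n_informative=4, random_state=1)
X_labelled, y_labelled = X[:60], y[:60]
X_unlabelled = X[60:]

candidates = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

# Score each candidate on the labelled data only (mean accuracy over 5 folds).
scores = {
    name: cross_val_score(model, X_labelled, y_labelled, cv=5).mean()
    for name, model in candidates.items()
}

# Keep the best-scoring learner, refit it on all labelled data, classify the rest.
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X_labelled, y_labelled)
predicted_labels = best_model.predict(X_unlabelled)
```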

Figure 1. Block diagram of the Machine Learning framework.

Statistical analysis, probability distributions and feature ranking

I start by discussing a simple test of binary classification based on the mineralogical composition
of a small data set of sedimentary rock samples. The initial steps of the ML workflow concern
statistical data analysis and the processing, normalization and comparison of the features. Figure
2 shows the statistical distributions of three mineralogical types in the rock samples (claystone
and sandstone) used in this classification test: quartz, kaolinite and illite. As expected, illite
is a mineralogical feature that shows a good “discrimination power” between the two types of rocks.
The quartz content also shows two distinct curves, although the distributions for sandstone and
claystone partially overlap. A large range of overlap can be observed for the probability density
functions of kaolinite; thus, we can expect that this feature is not the most relevant for
classification purposes. In total, I used nine mineralogical species, showing variable
discrimination power: illite, pyrite, chlorite, k-feldspar, kaolinite, quartz, calcite, dolomite,
and plagioclase. We can use all of them in the automatic classification. However, a different
approach is to identify and select only those features (minerals) that allow the sharpest
distinction between the different rock classes. For that reason, an important part of the workflow
is “features’ ranking”.

Figure 2. Three examples of statistical distribution of minerals in the rock samples used in
the test.

Table 1 shows the features’ ranking. This was calculated using different ranking indexes, such as
Information Gain, Gain Ratio, Gini Index and others. For instance, Information Gain (IG) and Gain
Ratio (GR) provide a measure of how much information a feature gives us about the class (Gain Ratio
is a simple modification of Information Gain aimed at reducing its bias towards features with many
distinct values). The “rule” is that features that allow perfect class partition should give
maximal information, whereas unrelated features should give no information. In other words, the IG
of a feature measures the reduction in entropy obtained when using that feature, where entropy
reflects the level of heterogeneity (impurity) in an arbitrary collection of examples. Analogously,
the features with the maximum Gain Ratio are the ones showing the highest power of splitting the
data into separate classes.

Gini Index (GI) follows a ranking criterion not very different from IG and GR. It reflects the
homogeneity of the samples with regard to a given feature.

A quantitative description of IG, GR, GI and all other ranking criteria can be found in Raschka
and Mirjalili (2017). From table 1, we can see that illite and pyrite show the highest indexes,
whereas dolomite and plagioclase show the lowest values. Just by looking at this table, we are able
to select in advance the most relevant features for our clustering and/or classification purposes.
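
To make the ranking indexes concrete, the sketch below computes Information Gain, Gain Ratio and the decrease in Gini impurity for a single binary split of one feature, using only the Python standard library. The mineral percentages, the thresholds and the function name `split_scores` are invented for illustration; how the framework actually discretizes continuous features is not stated in the text.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_scores(values, labels, threshold):
    """Return (information gain, gain ratio, Gini decrease) for a split at threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    # Split information: entropy of the partition itself, used to normalize the gain.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    gini_decrease = gini(labels) - (w_left * gini(left) + w_right * gini(right))
    return info_gain, gain_ratio, gini_decrease


# Invented percentages: illite separates the two classes, kaolinite does not.
labels = ["claystone"] * 4 + ["sandstone"] * 4
illite = [35, 40, 30, 45, 5, 8, 3, 10]
kaolinite = [10, 12, 15, 11, 9, 14, 13, 10]

illite_scores = split_scores(illite, labels, threshold=20)
kaolinite_scores = split_scores(kaolinite, labels, threshold=11.5)
```

In this toy case the illite split partitions the classes perfectly and reaches the maximum scores for a balanced binary problem (1 bit of Information Gain, Gain Ratio of 1, Gini decrease of 0.5), while the kaolinite split leaves both halves evenly mixed and scores zero on all three indexes.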

Table 1. Features’ ranking based on various indexes.

Training, cross-validation tests and classifiers’ selection

A usual approach for estimating the generalization effectiveness of different classification
algorithms is the cross-validation test. It uses the labelled data forming the training data set,
which is further partitioned into two complementary subsets. First, we apply the various learning
algorithms to one subset (called the training sub-set), and then we validate their generalization
power on the other subset (called the validation sub-set or testing sub-set).
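
The two-subset partition described above can be sketched with scikit-learn's `train_test_split`. The 70/30 split, the stratification, the synthetic data and the choice of Naïve Bayes are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the labelled data set (100 samples, 9 features, 2 classes).
X, y = make_classification(n_samples=100, n_features=9, n_informative=4, random_state=2)

# Partition the labelled data into complementary training and validation sub-sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=2
)

# Fit on the training sub-set, then measure accuracy on the held-out validation sub-set.
model = GaussianNB().fit(X_train, y_train)
validation_accuracy = model.score(X_val, y_val)
```

Repeating this with several different partitions (for example k folds) and averaging the scores gives a more stable estimate than a single split.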

Using “confusion matrices”, we can verify the performance of each classification algorithm examined
in the cross-validation tests. Each row of the confusion matrix represents the instances in a
predicted class, while each column represents the instances in an actual class. Thus, we can
estimate the effectiveness of each algorithm in generalizing the classification results (obtained
on the training sub-set) by verifying the percentage of cases correctly classified (on the
validation sub-set).
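
A minimal, standard-library sketch of a confusion matrix with the orientation used here (rows = predicted class, columns = actual class) is given below. The labels are invented. Note that scikit-learn's `confusion_matrix` uses the transposed convention (rows = actual, columns = predicted), so its output would need transposing to match this layout.

```python
def confusion_matrix(actual, predicted, classes):
    """Count matrix with one row per predicted class and one column per actual class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1
    return matrix


# Invented validation labels: one sandstone sample is misclassified as claystone.
classes = ["claystone", "sandstone"]
actual = ["sandstone", "sandstone", "sandstone", "claystone", "claystone", "claystone"]
predicted = ["sandstone", "sandstone", "claystone", "claystone", "claystone", "claystone"]

cm = confusion_matrix(actual, predicted, classes)

# Correctly classified cases lie on the principal diagonal.
correct = sum(cm[i][i] for i in range(len(classes)))
percent_correct = 100.0 * correct / len(actual)
```

Here `cm` is `[[3, 1], [0, 2]]`: the off-diagonal 1 in the claystone row is the sandstone sample predicted as claystone, and 5 of the 6 cases fall on the diagonal.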

For instance, figure 3 compares the confusion matrices of the Random Forest, Adaptive Boosting,
Logistic Regression and Naïve Bayes methods. All four algorithms show relatively high values on the
principal diagonal, suggesting a good capability of generalization. In this cross-validation test,
I used all nine features simultaneously.

Table 2 collects a set of indexes for evaluating the prediction results of the cross-validation
test for each individual method (Raschka and Mirjalili, 2017). For example, Classification Accuracy
(CA) is the proportion of correctly classified examples, whereas Precision is the proportion of
true positives among the instances classified as positive.
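
The two indexes defined above can be written out directly. This is a standard-library sketch on invented labels; the function names are hypothetical, and returning 0.0 when no instance is predicted as positive is an assumption.

```python
def classification_accuracy(actual, predicted):
    """Proportion of examples whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def precision(actual, predicted, positive):
    """Proportion of true positives among the instances classified as positive."""
    flagged = [a for a, p in zip(actual, predicted) if p == positive]
    if not flagged:
        return 0.0  # Assumption: report 0 when nothing was predicted as positive.
    return sum(a == positive for a in flagged) / len(flagged)


# Invented labels: one sandstone sample is misclassified as claystone.
actual = ["sandstone", "sandstone", "sandstone", "claystone", "claystone", "claystone"]
predicted = ["sandstone", "sandstone", "claystone", "claystone", "claystone", "claystone"]

ca = classification_accuracy(actual, predicted)
precision_claystone = precision(actual, predicted, positive="claystone")
precision_sandstone = precision(actual, predicted, positive="sandstone")
```

With these labels, CA is 5/6, the precision for claystone is 3/4 (four samples predicted as claystone, three of them correct), and the precision for sandstone is 1.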

Figure 3. Confusion matrices calculated for four different methods, applied to the labelled data
set used for the training phase.

Table 2. Indexes of the classification performance for the various methods.

Classification and mapping

Figure 4 shows a plot of the classification results for four different methods, mapped against the
illite and pyrite mineralogical features. All the methods yield comparable classification results.
Naïve Bayes and Random Forest show a sharp separation between the two classes. The remaining two
methods (and the other two not mapped here, CN2 Rule Induction and Decision Tree) show one apparent
classification outlier in the upper right corner of the map.

[Figure 4: four panels (Adaptive Boosting, Logistic Regression, Naive Bayes, Random Forest); axes:
Illite (normalized) and Pyrite (normalized).]

Figure 4. Classification results for four different methods (crosses = sandstone; circles =
claystone).

Principal Component Analysis on labeled data set

Principal Component Analysis (PCA) is a statistical approach that uses an orthogonal transformation
to convert a set of observations of possibly correlated variables into a set of values of linearly
uncorrelated variables (Jolliffe, 2002). These are called “principal components”. The
transformation is defined in such a way that the first principal component has the largest possible
variance; that is, it accounts for as much of the variability in the data as possible. Each
succeeding component in turn has the highest variance possible under the constraint that it is
orthogonal to the preceding components.
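
The properties listed above (orthogonal directions, decreasing variance, uncorrelated components) can be checked numerically. The following is a minimal NumPy sketch on synthetic correlated data, using an eigendecomposition of the covariance matrix; it is one common way to compute PCA and is not necessarily the routine used in the framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, the first two strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Centre the data and diagonalize its covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort the components by decreasing variance: PC1 first, then PC2, and so on.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the samples onto the principal components.
scores = X_centered @ eigvecs
explained_variance_ratio = eigvals / eigvals.sum()
```

Plotting the first two columns of `scores` gives a PC1 versus PC2 map of the kind used in the figures that follow.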

Figure 5 shows a PCA map of the results classified into the sandstone and claystone classes. The
axes of this map are not any of the original features, but the first two principal components. This
classification was obtained through Naïve Bayes. We can see that the two classes are clearly
separated by the automatic classifier.

[Figure 5: Naive Bayes panel; axes: PC1 and PC2.]

Figure 5. Principal Component Analysis map, using the two principal components, PC1 and PC2
(crosses = sandstone; circles = claystone).

Expanding the data set

Having acquired some insight into the ML workflow, we can now apply the same framework to slightly
more complex data sets. In the next test, I added some carbonate samples which, of course, show
significant mineralogical differences with respect to claystones and sandstones. In that case,
using features like calcite and dolomite content was fundamental for obtaining a correct
classification. For instance, figure 6 shows the classification results obtained with the Naïve
Bayes method, applied to all nine features and plotted in the PCA domain formed by the first two
principal components (PC1 and PC2). The three classes are properly distinguished, except for one
sample that probably represents an outlier.

Finally, an expert geologist checked all these classification results, confirming their geological
reliability.

[Figure 6: Naive Bayes panel; axes: PC1 and PC2.]

Figure 6. Naïve Bayes classification of three different rock types based on nine mineralogical
features. Horizontal axis: PC1; vertical axis: PC2 (crosses = sandstone; circles = claystone;
triangles = carbonates).

Classification of magmatic rocks

In this section, I discuss a further example of automatic rock classification based on the chemical
composition (oxides) of magmatic samples. I used the public data set available on the GEOROC
(Geochemistry of Rocks of the Oceans and Continents) web site
(http://georoc.mpch-mainz.gwdg.de/georoc/). For this application, I used only a part of the
available data set, consisting of 235 samples and 11 features, which gives an input matrix for the
ML workflow of 2585 entries (235 × 11). The samples come from different parts of the world and
represent different types of volcanic rocks, belonging to four classes: andesite, dacite, rhyolite
and shoshonite (Mamani et al., 2010).

Table 3 shows the ranking of the chemical features used for training and classification.

Table 3. Ranking of the chemical features of the volcanic rocks.

Figure 7 shows the statistical distributions of two of these features, extracted from the training
data set (almost 100 samples). We can see that the SiO2 content, in particular, is a relevant
feature for distinguishing the different types of rocks.

[Figure 7: two panels showing the normalized probability density distributions of andesite, dacite,
rhyolite and shoshonite against Normalized SiO2 and Normalized K2O.]

Figure 7. Normalized probability density distribution for SiO2 and K2O.

Following the same workflow applied in the previous examples, I applied a supervised learning
procedure involving six different learners. I performed the training phase using different
percentages of the samples (ranging from 10% to 40%). Figure 8 shows the confusion matrices for the
Random Forest and Naïve Bayes methods. The prediction performances range between 85% and 100%
(looking at the principal diagonal of the matrices). The classification performances of all the
methods are summarized in Table 4, where they are expressed using different types of index.
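
Training with different percentages of the samples can be sketched as a simple loop over the training fraction. The example below assumes scikit-learn, a synthetic four-class data set with the same shape as the one described (235 samples, 11 features) and a Naïve Bayes learner; the accuracies it produces say nothing about the real GEOROC data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in with the same shape as the data set in the text: 235 x 11, 4 classes.
X, y = make_classification(
    n_samples=235, n_features=11, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=3,
)

results = {}
for fraction in (0.1, 0.2, 0.3, 0.4):
    # Use the given fraction for training and evaluate on all remaining samples.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=fraction, stratify=y, random_state=3
    )
    model = GaussianNB().fit(X_train, y_train)
    results[fraction] = (len(X_train), model.score(X_rest, y_rest))
```

Each entry of `results` holds the number of training samples and the accuracy on the remaining samples, so the effect of the training fraction can be read off directly.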

Figure 8. Confusion matrix for Random Forest (left panel) and for Naïve Bayes (right panel).

Table 4. Table of classification performance for all six methods used in this test.

Finally, figure 9 shows an example of the classification of all the unlabeled samples (for the
Naïve Bayes method), plotted after applying PCA, using the first two principal components (PC1 and
PC2). We can see that the samples are separated into distinct fields, except for a few ambiguous
cases falling at the transition between different rock types.

[Figure 9: Naïve Bayes panel with fields labelled Shoshonite, Rhyolite, Andesite and Dacite; axes:
PC1 and PC2.]

Figure 9. Classification obtained by Naïve Bayes method, plotted after PCA.

A challenging classification test

Finally, I discuss a classification test including 1670 samples, 10 chemical features and 9 rock
classes, for an input matrix of 16700 entries (1670 × 10). There are four prevalent rock types:
rhyolite, dacite, basalt and andesite. In addition, there are five “minor” classes with
significantly fewer samples. Their chemical composition shows statistical distributions that
largely overlap with those of the four major rock types (figure 10). The few samples belonging to
the minor classes therefore act as a sort of “random noise” that negatively affects the
classification process, because their chemical composition is partially similar to that of one or
more of the four major classes. This overlap in the feature space makes the classification more
challenging. Despite that difficulty, the ML framework introduced here allows a satisfactory
classification, separating the main classes into four distinct areas of the feature space, with
only partial (unavoidable) overlap. In fact, there is a partial overlap between basalt and
andesite, as we can expect from the distribution of chemical composition in figure 10. The points
belonging to the “minor” classes are hidden in the background and are only partially visible (see,
for instance, the few orange and purple circles). Figure 11 shows the plot produced through the
Naïve Bayes classifier, as an example. In this case, the plot is shown for the K2O and TiO2 oxides,
but the classification was performed using all ten chemical features available in the database.

Although I used a relatively large and complex data set, the whole classification test required
little computation time, as in the previous examples. About 3 seconds were necessary for running
the entire workflow on a standard PC (dual-core Intel processor, 2.5 GHz, 12.0 GB RAM, Windows 10,
64 bit), including statistical analysis, feature ranking, data pre-processing, normalization of the
features, PCA, training (using about 20% of the data set), performance evaluation, calculation of
the confusion matrices, and classification of the whole unlabeled data set using all the features
and six different classifier algorithms.

[Figure 10: normalized probability density distributions of rhyolite, andesite, dacite and basalt
against Normalized TiO2.]

Figure 10. Statistical distribution of rock types based on TiO2.

[Figure 11: classes rhyolite, andesite, dacite and basalt; axes: Normalized TiO2 and Normalized
K2O.]

Figure 11. Classification using Naïve Bayes and K2O-TiO2 chemical features.

Final remarks and possible applications

I developed and discussed a tutorial Machine Learning framework for automatic classification of
rocks based on mineralogical content and chemical composition. I used six different classifiers,
and all of them showed good classification performances. When the automatic classification was
compared with the “manual” classification performed by an expert geologist, all the classifiers
produced geologically reasonable results. The main benefit of this automatic approach is that it
can support the “manual” mineralogical analysis of the rock samples and speed up the classification
process. In fact, the computation time is extremely short. I estimated a few seconds on a standard
PC for classifying thousands of samples, using six classifiers at the same time, based on 10 or
more mineralogical features. The same work commonly takes many days with “traditional”
(non-automatic) methods. Consequently, the Machine Learning approach presented here can be helpful
for geologists who have to classify rock samples, especially if that classification must be
performed quickly.

For instance, efficient methods (like Raman Spectroscopy) exist for measuring the mineralogical
composition of rock samples (Carey et al., 2015). During well site and/or field operations,
mineralogical (and chemical) data could be combined with composite logs in the same Machine
Learning workflow. The result would be an “in-situ/real-time” litho-facies classification. Of
course, a similar practical impact is possible in other fields where quick and reliable
classification of rock samples forming large databases is required, such as volcanology, the study
of metamorphic rocks and so forth.

References

1) Aminzadeh, F. and de Groot, P., 2006. Neural Networks and Other Soft Computing Techniques with
Applications in the Oil Industry: EAGE Publications.

2) Barnes, A. E. and Laughlin, K. J., 2002. Investigation of methods for unsupervised
classification of seismic data: 72nd Annual International Meeting, SEG, Expanded Abstracts,
2221–2224.

3) Bishop, C. M., 2006. Pattern Recognition and Machine Learning: Springer.

4) Carey, C., Boucher, T., Mahadevan, S., Bartholomew, P. and Dyar, M. D., 2015. Machine learning
tools for mineral recognition and classification from Raman spectroscopy: Journal of Raman
Spectroscopy, 46, 894–903, doi: 10.1002/jrs.4757.

5) Hall, B., 2016. Facies classification using machine learning: The Leading Edge, 35, 906–909.

6) Jolliffe, I. T., 2002. Principal Component Analysis, 2nd ed.: Springer Series in Statistics,
Springer, NY, XXIX, 487 p., 28 illus., ISBN 978-0-387-95442-4.

7) Mamani, M., Wörner, G. and Sempere, T., 2010. Geochemical variations in igneous rocks of the
Central Andean orocline (13°S to 18°S): Tracing crustal thickening and magma generation through
time and space: GSA Bulletin, January/February 2010, v. 122, no. 1/2, p. 162–182,
doi: 10.1130/B26538.1.

8) Raschka, S. and Mirjalili, V., 2017. Python Machine Learning: Machine Learning and Deep
Learning with Python, scikit-learn, and TensorFlow, 2nd ed.: Packt Publishing.

9) Russell, S. and Norvig, P., 2016. Artificial Intelligence: A Modern Approach, Global Edition:
Pearson Education, Inc., publishing as Prentice Hall.

10) Samuel, A. L., 1959. Some studies in machine learning using the game of checkers: IBM Journal
of Research and Development.

Note from the author

The framework discussed in this paper was created using software libraries developed privately by
the author and open source libraries (such as Python Scikit-learn) adapted for the purposes of this
work. The classification tests are based on public data sets (see the web reference below):
http://georoc.mpch-mainz.gwdg.de/georoc/webseite/Expert_Datasets.htm
