
Geological Data Analysis
Introduction to Machine Learning and Compositional Data Analysis
Dr. Ir. I Wayan WARMADA
Machine Learning
 Machine learning (ML) is the study of computer algorithms that can
improve automatically through experience and by the use of data.
 Machine learning algorithms build a model based on sample data,
known as training data, in order to make predictions or decisions
without being explicitly programmed to do so.
 Machine learning algorithms are used in a wide variety of
applications, such as in medicine, email filtering, speech recognition,
and computer vision, where it is difficult or unfeasible to develop
conventional algorithms to perform the needed tasks.
Machine Learning
 Applications of machine learning in earth sciences include geological
mapping, gas leakage detection, geological features identification,
autocorrelation, etc.
 Machine learning (ML) is a type of artificial intelligence (AI) that
enables computer systems to classify, cluster, identify and analyze
vast and complex sets of data while eliminating the need for explicit
instructions and programming.
 Earth science is the study of the origin, evolution, and future of the
planet Earth.
Machine Learning
 A variety of algorithms may be applied depending on the nature of
the earth science exploration.
 Some algorithms may perform significantly better than others for
particular objectives.
 For example, convolutional neural networks (CNN) are good at
interpreting images, and artificial neural networks (ANN) perform well in
soil or rock classification but are more computationally expensive to train
than support-vector machines (SVM).
Machine Learning
 The application of machine learning has become popular in recent
decades, as the development of other technologies such as
unmanned aerial vehicles (UAVs), ultra-high-resolution remote
sensing technology and high-performance computing units has led to
the availability of large, high-quality datasets and more advanced
algorithms.
 Problems in earth science are often complex. It is difficult to apply
well-described mathematical models to the natural environment, so
machine learning is often a better alternative for such non-linear
problems.
ML for Mapping
 Geological or lithological mapping produces maps showing
geological features and geological units.
 Mineral prospectivity mapping utilizes a variety of datasets such as
geological maps, aeromagnetic imagery, etc to produce maps that
are specialized for mineral exploration.
 Geological/lithological mapping and mineral prospectivity mapping
can be carried out by processing spectral imagery obtained from
remote sensing, together with geophysical data, using machine-learning
techniques.
ML for Mapping

a) Tectonic compartmentalization of the Southern Amazonian craton
proposed by Vasquez et al. (2008), highlighting the region of the
Cinzento Lineament. b) Lithological map of the Cinzento Lineament
(Oliveira et al. 2018). (Costa et al. 2019)
ML for Mapping

The data used as input in the machine learning algorithms in this
study: a) Magnetic Vector Inversion, b) ternary RGB map (K, eTh and
eU), c) Shuttle Radar Topography Mission (SRTM) and d) false color
(RGB) of Landsat 8 combining bands 4 (red), 3 (green) and 2 (blue).
(Costa et al. 2019)

Location of training points used as input in the machine learning
algorithms. The training points database is composed of 100 random
samples from each class (or lithological unit), resulting in 1400
samples.
ML for Mapping

Comparison between a) the Lithological Map of Cinzento Lineament
from Oliveira et al. (2018), b) training points, c) the predictive
lithological map using remote data only, and d) the predictive map
with remote data and spatial correlation. The limits of the geological
map were superimposed on the predictive map to improve the
association.

An example of class probability (in percentage) for the lithological
unit A4γ2gl from the predictive lithological map with remote data
only. (Costa et al. 2019)
ML for Landslide
 Landslide susceptibility refers to the probability of a landslide
occurring at a given place, which is affected by the local terrain conditions.
 Landslide susceptibility mapping can highlight areas prone to
landslide risks which are useful for urban planning and disaster
management works.
 Input dataset for machine learning algorithms usually includes
topographic information, lithological information, satellite images,
etc. and some may include land use, land cover, drainage
information, vegetation cover according to their study needs.
ML for Landslide

Landslide susceptibility and hazard zonation techniques (Shano et al. 2020)
ML for Landslide
Top countries using machine
learning for landslide
susceptibility modeling;
corresponding number of
publications marked for each
popular model in the box
(colour code: LR = green,
NNET/ANN = magenta,
DT = black, SVM = orange,
and RF = blue; letters ‘a’, ‘b’,
‘c’, … ‘r’ in alphabetical
order represent the largest to
smallest share of articles by
country) (Merghadi et al.
2020).
ML for Landslide
Top ten journals in which studies using the
most popular machine learning models for landslide
susceptibility appear, in terms of
number of articles listed in the Web of Science
database. Number indicates the percentage
share (EES – Environmental Earth Science,
GEM – Geomorphology, NAT – Natural
Hazards, LAN – Landslides, ARA – Arabian
Journal of Geosciences, NHE – Natural
Hazard and Earth System Sciences, ENG –
Engineering Geology, GNH – Geomatics
Natural Hazard Risks, GEO – Geocarto
International, CAT - Catena, JMS - Journal of
Mountain Science, SCT – Science of Total
Environment, REM – Remote Sensing, BOE –
Bulletin of Engineering Geology and the
Environment, ISP – ISPRS International
Journal of Geo-Information, RSE – Remote
Sensing of Environment, ASB – Applied
Science, COM – Computer and Geosciences).
ML for Digital Petrography

A) Photomosaic of a thin section; B)
its segmented image obtained from
the mineralogical model validated by
chemical microanalysis from
QEMSCAN equipment – the model
did not detect any occurrence of the
classes “clays” or “others” for this
photomosaic; C) Mineralogical
composition of the thin section
created by interpreted QEMSCAN
measures. Relative occurrences for
mineral phases are presented for each
method (Rubo et al. 2019).
ML for Digital Petrography

(Figure: Rubo et al. 2019)
ML for Digital Petrography
Examples of classification provided by fine-
tuned ResNet50 for the smaller cropped
images in the test set. Images in the same
row were extracted from the same
microfacies as labeled by the interpreter.
The left column shows examples of smaller
cropped images in which the classification
provided by the CNN model is the same as
the classification provided by the
petrographer. In contrast, the right column
shows examples of smaller cropped images
in which the classification provided by the
CNN is not the same as the classification
provided by the petrographer. Source: de
Lima et al. (2020).
ML for Facies Classification

Typical well section: logs, stratigraphic zone indexes, and target facies labels as raw numbers (Kuvichko et al. 2019).
ML for Facies Classification

A comparison of human (first facies track) and machine learning
(second facies track) facies logs (Kuvichko et al. 2019).
Multivariate Data Analysis
 Many statistical techniques focus on
just one or two variables
 Multivariate analysis (MVA) techniques
allow more than two variables to be
analysed at once
 Multiple regression is not typically
included under this heading, but can
be thought of as a multivariate
analysis
Many Variables
 Commonly have many relevant variables in market research surveys
 E.g. one not atypical survey had ~2000 variables
 Typically researchers pore over many crosstabs
 However it can be difficult to make sense of these, and the
crosstabs may be misleading
 MVA can help summarise the data
 E.g. factor analysis and segmentation based on agreement ratings
on 20 attitude statements
 MVA can also reduce the chance of obtaining spurious results
Multivariate Analysis Methods
 Two general types of MVA technique
 Analysis of dependence
 Where one (or more) variables are dependent variables, to be
explained or predicted by others
 E.g. Multiple regression, PLS, MDA
 Analysis of interdependence
 No variables thought of as “dependent”
 Look at the relationships among variables, objects or cases
 E.g. cluster analysis, factor analysis
Multivariate Analysis – How
 The model underlying the different effects is quantifiably measured
for each individual.
 Then the different effects are solved for using mathematical
techniques (calculus and numerical methods) to quantify their effects
and to test for their significance.
Advantages of Multivariate Analysis
 It ensures that the results are not biased or influenced by other
factors that are not accounted for.
 Case in point: checking whether a difference in cancer progression
due to different therapies is not merely a reflection of differences in
prior therapies, stage at diagnosis, or demographic factors.
Which Technique?
 Statistical techniques are like tools: the task to be accomplished
determines the selection of the tool.
 The selection of a statistical technique is greatly based on the type of
data under analysis (e.g., measurement level, type of distribution)
and the reason (research question) for the analysis.
Choice of Technique

(Figure: a normal curve)
Choice of Technique
Level of Measurement

Independent Variable    | Dependent Variable          | Technique
Numerical               | Numerical                   | Multiple Regression
Nominal or Numerical    | Nominal                     | Logistic Regression
Nominal or Numerical    | Numerical (censored)        | Cox Regression
Nominal or Numerical    | Numerical                   | ANOVA, MANOVA
Nominal or Numerical    | Nominal (2 or more values)  | Discriminant Analysis
Numerical               | No Dependent Variable       | Factor and Cluster Analysis
Principal Components
 Identify underlying dimensions or principal components of a
distribution
 Helps understand the joint or common variation among a set of
variables
 Probably the most commonly used method of deriving “factors” in
factor analysis (before rotation)
Principal Components
 The first principal component is identified as the vector (or
equivalently the linear combination of variables) on which the most
data variation can be projected
 The 2nd principal component is a vector perpendicular to the first,
chosen so that it contains as much of the remaining variation as
possible
 And so on for the 3rd principal component, the 4th, the 5th etc.
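The successive-projection idea above can be sketched in a few lines of NumPy (this is an illustrative sketch, not code from the cited references; the function name `pca` is ours):

```python
import numpy as np

def pca(X):
    """Principal component analysis via SVD of the centered data matrix.

    Returns (components, explained_variance): each row of `components`
    is a principal direction; variances are in decreasing order.
    """
    Xc = X - X.mean(axis=0)                 # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = s ** 2 / (len(X) - 1)
    return Vt, explained_variance

# Correlated synthetic data: most variation lies along one direction
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t, 0.5 * t]) + 0.1 * rng.normal(size=(200, 3))

components, var = pca(X)
# Components are mutually perpendicular, and each captures the most
# remaining variation after the previous ones
```

Because the three variables share a single latent factor, the first component should absorb most of the total variance, mirroring the description above.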
Principal Components

Example of a compositional biplot


Principal Components

Biplot variants (left, covariance biplot; right, form biplot) of a principal component analysis
of the glacial sediment dataset. The right-hand biplot shows a graphical representation of
the communalities
Cluster Analysis
 Techniques for identifying separate groups of similar cases
 Similarity of cases is either specified directly in a distance matrix, or
defined in terms of some distance function
 Also used to summarise data by defining segments of similar cases in
the data
 This use of cluster analysis is known as “dissection”
Clustering Technique
 Two main types of cluster analysis methods
 Hierarchical cluster analysis
 Each cluster (starting with the whole dataset) is divided into two, then
divided again, and so on
 Iterative methods
 k-means clustering (PROC FASTCLUS)
 Analogous non-parametric density estimation method
 Also other methods
 Overlapping clusters
 Fuzzy clusters
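The iterative k-means method listed above can be sketched as a minimal Lloyd's algorithm (NumPy assumed; the `kmeans` function and the synthetic blobs are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """A minimal sketch of Lloyd's k-means (an iterative method).

    Alternates assigning points to the nearest centroid and moving
    each centroid to the mean of its assigned points.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance of every point to every centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs: k-means should recover them
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

As noted later in the slides, k-means is sensitive to local minima, so in practice multiple runs with different random initializations are advisable.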
Applications
 Market segmentation is usually conducted using some form of
cluster analysis to divide people into segments
 Other methods such as latent class models or archetypal analysis
are sometimes used instead
 It is also possible to cluster other items such as products/SKUs, image
attributes, brands
Cluster Analysis
Dendrogram of association of
variables of the dataset of
geochemical composition of
glacial sediments, obtained
with the Ward algorithm using the
variation array as the distance matrix
between parts.
Cluster Analysis Issues
 Distance definition
 Weighted Euclidean distance often works well, if weights are chosen intelligently
 Cluster shape
 Shape of clusters found is determined by method, so choose method appropriately
 Hierarchical methods usually take more computation time than k-means
 However multiple runs are more important for k-means, since it can be badly
affected by local minima
 Adjusting for response styles can also be worthwhile
 Some people give more positive responses overall than others
 Clusters may simply reflect these response styles unless this is adjusted for, e.g. by
standardising responses across attributes for each respondent
Correspondence Analysis
 Provides a graphical summary of the interactions in a table
 Also known as a perceptual map
 But so are many other charts
 Can be very useful
 E.g. to provide overview of cluster results
 However the correct interpretation is less than intuitive, and this leads
many researchers astray
Four Clusters (imputed, normalised)

(Correspondence analysis map of the Usage and Reason attributes against
the four clusters; the two plotted dimensions account for 53.8% and 25.3%
of the variation, 2D fit = 79.1%; points with correlation < 0.50 are flagged.)
Interpretation
 Correspondence analysis plots should be interpreted by looking at
points relative to the origin
 Points that are in similar directions are positively associated
 Points that are on opposite sides of the origin are negatively
associated
 Points that are far from the origin exhibit the strongest associations
 Also the results reflect relative associations, not just which rows are
highest or lowest overall
Discriminant Analysis
 Discriminant analysis also refers to a wider family of techniques
 Still for discrete response, continuous predictors
 Produces discriminant functions that classify observations into
groups
 These can be linear or quadratic functions
 Can also be based on non-parametric techniques
 Often train on one dataset, then test on another
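A two-class linear discriminant function of the kind described above can be sketched as follows (NumPy assumed; `fisher_lda` and the synthetic classes are hypothetical examples, not the method of any cited study):

```python
import numpy as np

def fisher_lda(X0, X1):
    """A sketch of Fisher's linear discriminant for two classes.

    The discriminant direction is w = Sw^{-1} (mu1 - mu0), where Sw is
    the within-class covariance; the threshold is placed at the
    midpoint of the projected class means.
    """
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    w = np.linalg.solve(Sw, mu1 - mu0)
    threshold = 0.5 * (X0 @ w).mean() + 0.5 * (X1 @ w).mean()
    return w, threshold

rng = np.random.default_rng(2)
X0 = rng.normal(0.0, 1.0, (100, 2))   # e.g. samples of one rock class
X1 = rng.normal(3.0, 1.0, (100, 2))   # a second class, shifted mean
w, threshold = fisher_lda(X0, X1)
# classify: a sample x is assigned to the second class if x @ w > threshold
```

Following the train/test practice mentioned above, `w` and `threshold` would normally be estimated on one dataset and evaluated on another.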
Partial Least Squares (PLS)
 Multivariate generalisation of regression
 Have model of form Y = XB + E
 Also extract factors underlying the predictors
 These are chosen to explain both the response variation and the
variation among predictors
 Results are often more powerful than principal components
regression
 PLS also refers to a more general technique for fitting general path
models, not discussed here
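The idea that PLS factors explain both the response and the predictors can be illustrated with a minimal single-component NIPALS sketch (NumPy assumed; `pls1_one_component` is an illustrative name, and a real analysis would extract several components):

```python
import numpy as np

def pls1_one_component(X, y):
    """A minimal sketch of single-component PLS regression (NIPALS).

    The weight vector w maximizes covariance between the score t = Xw
    and the response, unlike PCA, which ignores y entirely.
    Assumes X and y are already mean-centered.
    """
    w = X.T @ y
    w /= np.linalg.norm(w)      # direction of max covariance with y
    t = X @ w                   # scores
    q = (y @ t) / (t @ t)       # regress y on the score
    return w, q

rng = np.random.default_rng(3)
latent = rng.normal(size=200)
# predictors share one latent factor; the response depends on it too
X = np.outer(latent, [1.0, 0.5, -0.8]) + 0.1 * rng.normal(size=(200, 3))
y = 2.0 * latent + 0.1 * rng.normal(size=200)
X -= X.mean(axis=0)
y -= y.mean()

w, q = pls1_one_component(X, y)
y_hat = q * (X @ w)
```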
Structural Equation Modeling (SEM)
 General method for fitting and testing path analysis models, based
on covariances
 Also known as LISREL
 Implemented in SAS in PROC CALIS
 Fits specified causal structures (path models) that usually involve
factors or latent variables
 Confirmatory analysis
SEM Example:
Relationship between Academic and Job Success
Multivariate Analysis
 Clr biplot associations:
 Association 1: Ca-Mg
 Association 2: Na-K-P-B-Rb-Ba
 Association 3: Zn-Cd-Pb-Hg-Ag-Au
 Association 4: As-U-Th-Zr-La-Li
 Association 5: Fe-Ti-Co-V
 Association 6: Ni-Cr
Multivariate Analysis
 Clr biplot (amalgamation)
 Most of the SAR-SAR samples (samples along the Sarno river) are
linked to the clr-amalgamation (Cr-Ni) vertex
 The As-U-Th-Zr-La-Li and Fe-Ti-Co-V associations might be
related to the same origin
Multivariate Analysis
Raw data vs. centered log-transformed data
 Factor analysis
 Total variance for the clr transformation (74.99%) is greater than
that obtained using raw data (69.54%)
Multivariate Analysis
 Factor score maps
 This method tries to locate and group pixels on the map sharing concentrations
apparently varying according to a spatial law
Multivariate Analysis

Raw data: elevated factor scores near the slopes of Mts. Somma–Vesuvius

Clr data: factor score values decrease gradually from the slopes of
Mt. Vesuvius to the inner basin

The Factor 1 element association is related to the igneous formations and volcanic soils
Multivariate Analysis

Raw data: elevated factor scores are mapped near Solofra and its
surroundings

Clr data: low factor score values correspond to the main urban cities

Sedimentary materials dominated by silico-clastic deposits

Anthropogenic sources (high rates of fossil fuel combustion and
vehicle emissions)
Compositional Data
 Compositional data are those which contain only relative
information.
 Compositions are positive vectors whose components represent the
relative contribution of different parts to a whole; therefore their sum
is a constant, usually 1 or 100.
 If one component increases, others must, perforce, decrease,
whether or not there is a genetic link between these components.
 Compositions are a familiar and important kind of data for geologists
because they appear in many geological datasets (chemical analyses,
geochemical compositions of rocks, sand-silt-clay sediments, etc).
Compositional Data
QQ normality plots of all pairwise
log ratios. This is the result of
qqnorm(x) with an acomp object.
The accessory argument
alpha=0.01 (or any other
significance level) marks which log
ratios deviate from normality at that
significance.
Compositional Data Analysis in Geology
 For illustration consider three major oxides, SiO2 = 80, TiO2 = 5 and Al2O3
= 12, with units in weight percent. The residual with respect to 100% would
be R = 100 - (80 + 5 + 12) = 3 and the four-part composition to be
analysed would be [SiO2, TiO2, Al2O3, R]. If the residual part is not of
interest, the subcomposition C[SiO2, TiO2, Al2O3] can be obtained from the
four-part composition dividing by the sum of the three parts:
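Continuing the worked example, the closure of the subcomposition can be computed directly (NumPy assumed; `closure` is an illustrative helper, not code from the references):

```python
import numpy as np

def closure(x, total=100.0):
    """Rescale a vector of parts so they sum to a constant (here 100%)."""
    x = np.asarray(x, dtype=float)
    return total * x / x.sum()

# Worked example from the text: SiO2 = 80, TiO2 = 5, Al2O3 = 12 wt%
oxides = [80.0, 5.0, 12.0]
residual = 100.0 - sum(oxides)     # R = 3
subcomp = closure(oxides)          # C[SiO2, TiO2, Al2O3]
# subcomp ≈ [82.47, 5.15, 12.37], summing to 100
```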
Compositional Data Analysis in Geology
 Log-ratio representation: Aitchison (1982) introduced the additive-log-
ratio (alr) and centred-log-ratio (clr) transformations and Egozcue et
al. (2003) the isometric-log-ratio (ilr) transformation.
 For the three-part composition x = [80, 15, 5] ∈ S3, the three
representations would be written as a vector with two components for the
alr and ilr coordinates, and with three components for the clr coefficients,
as follows:
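A minimal sketch of the three representations for x = [80, 15, 5] (NumPy assumed; note that ilr coordinates depend on the chosen orthonormal basis, so the `ilr` values below correspond to one common choice):

```python
import numpy as np

def alr(x):
    """Additive log-ratio: divide the first D-1 parts by the last."""
    x = np.asarray(x, dtype=float)
    return np.log(x[:-1] / x[-1])

def clr(x):
    """Centred log-ratio: divide each part by the geometric mean."""
    x = np.asarray(x, dtype=float)
    g = np.exp(np.mean(np.log(x)))
    return np.log(x / g)

def ilr(x):
    """Isometric log-ratio for a 3-part composition, using one common
    choice of orthonormal basis (other valid bases exist)."""
    x = np.asarray(x, dtype=float)
    return np.array([
        np.sqrt(1 / 2) * np.log(x[0] / x[1]),
        np.sqrt(2 / 3) * np.log(np.sqrt(x[0] * x[1]) / x[2]),
    ])

x = np.array([80.0, 15.0, 5.0])
# alr(x) ≈ [2.773, 1.099]; clr(x) sums to zero by construction,
# and the ilr representation preserves the length of the clr vector
```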
Compositional Data Analysis in Geology
 The three representations have different properties.
 The alr coordinates are simple: D - 1 of the components are divided by
the remaining component and logarithms taken.
 The clr coefficients are obtained by dividing the components by the
geometric mean of the parts and then taking logarithms.
 Expressions for the calculation of ilr coordinates are more complex,
and there are different rules on how to generate them.
 Balances are a particular form of ilr coordinates (Egozcue & Pawlowsky-
Glahn 2005). They frequently have a simple interpretation as they
represent the relative variation of two groups of parts.
Compositional Data Analysis in Geology
 For example, consider the composition of a sediment described in terms of
the proportions of coarse sand (Sc), fine sand (Sf), silt (S), and clay (C).
Assume interest lies in the relative proportion of coarse and fine sand, (Sc,
Sf), to silt and clay, (S, C); the relative proportion of coarse sand, Sc, to fine
sand, Sf; and the relative proportion of silt, S, to clay, C. The analysis would
require three balances:
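Using the standard balance formula of Egozcue & Pawlowsky-Glahn (2005), in which a balance of an r-part group against an s-part group is sqrt(rs/(r+s)) times the log-ratio of their geometric means, the three balances described above would be:

```latex
b_1 = \sqrt{\frac{2\cdot 2}{2+2}}\,\ln\frac{(S_c\,S_f)^{1/2}}{(S\,C)^{1/2}},\qquad
b_2 = \sqrt{\frac{1}{2}}\,\ln\frac{S_c}{S_f},\qquad
b_3 = \sqrt{\frac{1}{2}}\,\ln\frac{S}{C}
```

The square-root coefficients make the balances orthonormal ilr coordinates, so ordinary Euclidean geometry applies to them.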
Compositional Data Analysis in Geology
 For illustration consider two
samples, x1 = (20, 30, 40, 10)
and x2 = (25, 35, 30, 10), where
the parts represent their (Sc, Sf,
S, C) composition.
 The three balance coordinates
of each sample are:
Compositional Data Analysis in Geology
 Distances between samples can be computed simply with these
coordinates using the usual Euclidean distance. The squared
compositional or Aitchison distance between samples x1 and x2 is
the sum of the squared differences of their balance coordinates,
and taking the square root, da(x1, x2) = 0.3928 is obtained.
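The balance coordinates and the quoted distance can be checked numerically (NumPy assumed; `balances` is an illustrative helper implementing the three balances defined in the text):

```python
import numpy as np

def balances(x):
    """The three balances for a (Sc, Sf, S, C) composition:
    (coarse+fine sand) vs (silt+clay), then Sc vs Sf, then S vs C."""
    sc, sf, s, c = np.asarray(x, dtype=float)
    return np.array([
        np.sqrt(4 / 4) * np.log(np.sqrt(sc * sf) / np.sqrt(s * c)),
        np.sqrt(1 / 2) * np.log(sc / sf),
        np.sqrt(1 / 2) * np.log(s / c),
    ])

x1 = [20, 30, 40, 10]
x2 = [25, 35, 30, 10]
b1, b2 = balances(x1), balances(x2)
# Aitchison distance = Euclidean distance between balance coordinates
d_a = np.linalg.norm(b1 - b2)
# d_a ≈ 0.3928, matching the value quoted in the text
```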
Visualization
 Multivariate Data Analysis:
 Hierarchical clustering: clustering analysis using clr-transformed
data enhances groups of elements and the correlation between
variables (use of the Aitchison distance)
Software
 Commercial software: SPSS, S, SAS, Minitab
 Free OpenSource for Statistics: R, Octave
 Machine Learning: Python, R
 Special Purpose Software: CoDaPack, compositions (R package),
GCDkit (R package).
Further Readings
 Buccianti, A., Mateu-Figuras, G., and Pawlowsky-Glahn, V. (eds) 2006. Compositional Data
Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special
Publications, 264p.
 Hair, J.F. Jr., Black, W.C., Babin, B.J., and Anderson, R.E., 2014. Multivariate Data Analysis. 7th
Edition. Pearson Education Ltd. 734p.
 Martin-Fernández, J.A. and Thió-Henestrosa (eds), 2016. Compositional Data Analysis –
CoDaWork, L’Escala, Spain, June 2015. Springer-Verlag. 209p.
 Misra, S., Li, H., and He., J., 2020. Machine Learning for Subsurface Characterization. Gulf
Professional Publishing – Elsevier, Oxford, 412p.
 Rahman, W., 2020. AI and Machine Learning. SAGE Publications India Pvt Ltd, New Delhi, 157p.
 Raut, R., Kautish, S., Polkowski, Z., Kumar, A., Liu, C.-M., 2022. Green Internet of Things and
Machine Learning: Toward a Smart Sustainable World. Wiley Scrivener Publishing, Massachusetts,
354p.
THANK YOU
