
Exploratory Data Analysis

Statistical Summaries & Graphical Analysis

Prepared by
ioGlobal

ioGeochemistry
a division of ioGlobal

EDA – Stats & Graphical Analysis


© ioGlobal Pty Ltd
Exploratory Data Analysis

Note:

If you are reading papers on various statistical methods, bear in mind that a sample, in statistical terminology, refers to a group of observations drawn from a population, not an individual result.

Also, in a geochemical context, element and variable are usually used interchangeably.

Exploratory Data Analysis
Exploratory data analysis (EDA) is the analysis of geochemical data
for the purpose of recognising trends and structures that provide
insight into geochemical/geological processes.

Conventional methods of data analysis are based on the classical statistics of estimation and hypothesis testing, which assume the normal distribution model. For example, classical estimators of central tendency and dispersion, the arithmetic mean and standard deviation, assume that the underlying data are normally distributed.

For geochemical data, influences affecting the absolute value and variability of an element are manifold and generally uncontrollable.

Exploratory Data Analysis
Such effects can be due to sampling and analytical procedures, physicochemical environment, changing lithology and mineralisation processes.

As a result, Kurzl (1988) eloquently concluded that geochemical data reflect certain rather disagreeable properties, such as considerable departure from the normal model, data inconsistencies, polymodal behaviour, and upper and lower outliers. Classical statistics are clearly not designed for such situations.

EDA makes use of order statistics, histograms, box and whisker plots, probability plots, summary tables and innovative data visualisation aids.

Lake Water - Example of EDA

18,307 lake water samples from southern Ontario

Mineralisation > sulphide > acid > low lake pH in vicinity > prospecting tool??

But Q: Is there a relationship between size of lake and pH?

i.e. is pH affected by processes other than mineralisation or rock type?

….go to example
Worked Example
Reminder
<10m distribution

Special Problems That Need to be Dealt
With in Geochemical Data Pre/during EDA
Less than detection data

Over-range data

Non-normal distributions

Missing values

Combined groups of data that have inconsistent digest/instrument finish/analytical laboratory

Coordinate recording

Closure affected data (constant sum problem)

….But first, data cleaning & QC
Detection Limits. Values below the detection limit may be replaced by:
the detection limit itself
zero
a fixed portion of the detection limit (usually half), PREFERRED, or
a random number between zero and the detection limit

If greater than 30% of the data values fall below the detection limit, the variable should be used with
caution, especially if using multivariate techniques.
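
The preferred half-detection-limit replacement and the 30% rule of thumb can be sketched as follows. This is a minimal illustration only: it assumes censored results arrive as strings such as "<5" and that missing results are stored as None. The function name and the Au values are invented for the example.

```python
def replace_below_detection(values, fraction=0.5):
    """Replace "<DL" strings with fraction * DL and report the censored share.

    None is treated as NULL and passed through untouched.
    """
    cleaned, censored, valid = [], 0, 0
    for v in values:
        if v is None:
            cleaned.append(None)  # NULL must remain unambiguously null
            continue
        valid += 1
        if isinstance(v, str) and v.strip().startswith("<"):
            detection_limit = float(v.strip()[1:])
            cleaned.append(fraction * detection_limit)
            censored += 1
        else:
            cleaned.append(float(v))
    share = censored / valid if valid else 0.0
    return cleaned, share


# Invented Au results in ppb, with a detection limit of 5 ppb
au_ppb = ["<5", 12, "<5", None, 30, 8, "<5", 15, 22, 9]
cleaned, share = replace_below_detection(au_ppb)
print(cleaned)
print(f"{share:.0%} below detection; use with caution: {share > 0.30}")
```

Keeping the censored share alongside the cleaned values makes it easy to drop or down-weight a variable before any multivariate work.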

The Problem of Closed Data (covered in Lithogeochemistry)

NULL data must remain unambiguously null

Units must be consistent (ppm, ppb, %)

Data must be attributed consistently (soil, stream sed., rockchip etc)

Analytical method must be recorded, at least generically, e.g. so that fire assay Au can be separated from aqua regia Au

QC: Data Cleaning
Spotting inconsistencies
Often a problem when storing partial leach data

It ALWAYS ends up corrupt in databases

<dl missing
Null?

Detection??

QC: Data Cleaning - spotting inconsistencies

Watch for meta-data problems as well

Different size fractions (same location) in stream sediments with identical Cu-Pb-Zn analyses (location and sample numbers omitted)

E.g. this could double count data in stats

QC: Data Cleaning
Spotting inconsistencies

Here we have a public domain database that has the metadata (including location!!!!) offset by 1 row across the entire dataset

As samples were commonly disparate in a spatial sense, this is a potential disaster!

The same data compilation had these Eu ppb values

QC: Data Cleaning
Spotting inconsistencies

Published mineral probe data: CaO is actually meant to be cobalt! The giveaway, apart from Co being more sensible for this mineral type, is the differing columns in the spreadsheet that calculate the weight percent values from the raw probe results

QC: Data Cleaning - spotting inconsistencies
Mixed quality data

QC: Data Cleaning – Units Errors – ‘spatial QC’

QC: Batch Effect – Different sheets with different DL values (in this instance) – ‘spatial QC’

QC: - ‘spatial QC’ – checking projection information

Projection issues between two partially overlapping sources of data… which is correct?

CLOSURE EXAMPLE!

[Figure: two bar charts comparing an Original Sample and an Altered Sample for elements A to E. Upper panel (Extensive): moles in original and altered samples. Lower panel (Intensive): mole % in original and altered samples.]

Note how there are differences between concentration values, and the absolute values.
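
The closure effect can be reproduced with a few lines of arithmetic. The sketch below uses invented mole values (they are not read from the slide's charts) and assumes that alteration adds only element A. Every other element keeps its absolute amount, yet its mole % drops because the total has grown.

```python
def mole_percent(moles):
    """Convert absolute (extensive) moles to mole % (intensive, sums to 100)."""
    total = sum(moles.values())
    return {element: 100.0 * m / total for element, m in moles.items()}


# Invented values for illustration only
original = {"A": 4.0, "B": 6.0, "C": 2.0, "D": 5.0, "E": 3.0}
altered = dict(original, A=12.0)  # assumption: only A is added during alteration

original_pct = mole_percent(original)
altered_pct = mole_percent(altered)

print("element  moles(orig)  moles(alt)  mole%(orig)  mole%(alt)")
for element in original:
    print(
        f"{element:7}  {original[element]:11.1f}  {altered[element]:10.1f}"
        f"  {original_pct[element]:11.1f}  {altered_pct[element]:10.1f}"
    )
```

Elements B to E appear depleted in the mole % columns although nothing was removed, which is the constant sum problem in miniature.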
Suggested Flow for Exploratory Data Analysis
1. Preliminary Data Analysis
a Histogram, Box and Whisker, Q-Q Plots, Normal Probability Plots, Scatter Plot Matrix, Ranked Data
b Summary Statistics
c Maps of percentile values for individual elements
d Investigation of outliers (don’t necessarily remove them); they may reflect analytical error or mineralisation
e Transform data based on samples below the 90th-95th percentile, if necessary
f Select thresholds
2. Exploratory Multivariate Data Analysis
a Robust estimates of mean and covariance, which enhance outlier detection and prevent outliers affecting subsequent analysis
b Dimension reduction, e.g. PCA, Factor Analysis
c Maps of large Mahalanobis distances (>95th percentile), which may identify anomalous areas
3. Specific Multivariate Data Analysis and Modelled Multivariate Analysis
a Calculation of empirical indices specifically tailored to areas and targets in which multi-element associations are understood
b Multiple regression where a linear model of multi-element association can be computed with good results (high R² coefficients). Residuals may represent mineralisation
c Establishment of background and target groups that characterise the geochemical variation of the regional geochemistry and the mineral deposits
d The use of all possible subsets to compare reference groups with each other and determine which group of elements enhances the group separations
e Use of allocation/typicality procedures to test unknown samples from an exploration program against established groups. Each sample should be assigned a probability of belonging to one of the reference groups. Maps of typicality or posterior probability can be made to indicate group membership
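
Step 2c of the flow can be illustrated for two elements using only the standard library. This is a sketch under stated assumptions: it uses the classical mean and covariance for brevity, whereas step 2a recommends robust estimates, which would replace the five statistics computed at the top of the function. The Cu and Zn values are invented, with the last sample deliberately placed off the main trend.

```python
import statistics


def mahalanobis_2d(xs, ys):
    """Mahalanobis distance of each (x, y) pair from the centre of the data.

    Classical mean and covariance are used here; robust estimates would be
    substituted for mx, my, sxx, syy and sxy in a real workflow.
    """
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T for a 2x2 matrix
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2 ** 0.5)
    return distances


# Invented Cu and Zn values (ppm); the last sample sits off the main trend
cu = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 20]
zn = [21, 25, 27, 33, 35, 41, 43, 49, 51, 57, 59, 65, 67, 73, 75, 81, 83, 89, 91, 90]

distances = mahalanobis_2d(cu, zn)
threshold = statistics.quantiles(distances, n=20)[-1]  # 95th percentile
flagged = [i for i, d in enumerate(distances) if d > threshold]
print("samples above the 95th percentile of distance:", flagged)
```

In practice the flagged samples would be plotted on a map rather than printed, so that spatial clusters of large distances stand out.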
EDA – Preliminary Data Analysis

Data Transformations
As a prelude to interpretation, data are often mathematically transformed to
a more normal form so that underlying statistical assumptions are met. This
also improves the reliability of statistical results and enhances relationships
present in the bulk of the data.

It is an important concept that transformations are a requirement of the interpretation, not a property of a particular element distribution. They should not be viewed as a fix for recalcitrant data, but rather as tools of the interpreter used to extract meaning from the data

linear scaling: $y = kx$ or $y = x/k$
standardisation (z-score): $y = (x_i - \bar{x})/s$
logarithmic: $y = \log_{10}(x)$, $y = \ln(a + x)$
exponential: $y = e^{x}$
Box-Cox Generalised Power Transform: $y = (x^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$, $y = \ln(x)$ for $\lambda = 0$
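
The transformations listed above translate directly into code. The sketch below is illustrative only: the function names and the Cu values are invented, and the Box-Cox parameter is simply passed in rather than estimated from the data.

```python
import math
import statistics


def z_scores(xs):
    """Standardise: subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)
    return [(x - mean) / s for x in xs]


def box_cox(xs, lam):
    """Box-Cox power transform; every x must be greater than zero."""
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]


# Invented Cu values in ppm with one very high result
cu_ppm = [4, 6, 9, 15, 18, 22, 30, 38, 46, 990]

log10_cu = [math.log10(x) for x in cu_ppm]
print("z-scores of raw data:  ", [round(z, 2) for z in z_scores(cu_ppm)])
print("z-scores of log10 data:", [round(z, 2) for z in z_scores(log10_cu)])
print("Box-Cox, lambda = 0.25:", [round(y, 2) for y in box_cox(cu_ppm, 0.25)])
```

Comparing the z-scores before and after the log transform shows how much a single high value dominates the raw scale.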

EDA – Preliminary Data Analysis

Data Transformations

Go to Datadesk demonstration

Histograms

Very popular: they reflect the shape of the underlying distribution of the data

However, their appearance changes with the subjective choice of the number of bars to use

The data are binned, which can result in a loss of information (compared with probability plots, for example)
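
The sensitivity to the number of bars is easy to demonstrate. This sketch bins the same invented Ni values with three different bin counts, using equal-width bins between the minimum and maximum (one common convention, assumed here).

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning the minimum to the maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[i] += 1
    return counts


# Invented Ni values in ppm containing two separate groups
ni_ppm = [12, 14, 15, 15, 16, 18, 19, 21, 22, 24, 40, 42, 43, 45, 46, 48]

for n_bins in (2, 4, 9):
    print(f"{n_bins} bins:", histogram_counts(ni_ppm, n_bins))
```

With two bins the gap between the groups is invisible; with more bins the empty interval between them appears.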

Tukey Box & Whisker Plot
A method of graphically displaying order statistics (percentiles)

There are many variants

The example shown to the left is a ‘Tukey’ boxplot

Box = 25th, 50th (median) & 75th percentiles

Upper and lower whiskers = hinge +/- 1.5 times the IQR (interquartile range)

Data that plot beyond the hinge +/- 3 times the IQR are called far outliers

Basis of an automatically derived thematic outlier map
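
The fences described above can be computed as in the sketch below. It approximates the hinges with the 25th and 75th percentiles from Python's `statistics.quantiles` (inclusive method), which can differ slightly from Tukey's original hinge definition. The function names and Cu values are invented for the example.

```python
import statistics


def tukey_fences(values):
    """Quartiles, IQR and the 1.5 x IQR and 3 x IQR fences."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }


def classify(values):
    """Split values into outliers (beyond the inner fences) and far outliers."""
    fences = tukey_fences(values)
    inner_lo, inner_hi = fences["inner"]
    outer_lo, outer_hi = fences["outer"]
    far = [v for v in values if v < outer_lo or v > outer_hi]
    out = [v for v in values if (v < inner_lo or v > inner_hi) and v not in far]
    return out, far


# Invented Cu values in ppm
cu_ppm = [4, 6, 9, 15, 18, 22, 30, 38, 46, 100, 990]
fences = tukey_fences(cu_ppm)
outliers, far_outliers = classify(cu_ppm)
print(fences)
print("outliers:", outliers, "far outliers:", far_outliers)
```

Attaching the outlier class to each sample's coordinates is one way to build the thematic outlier map mentioned above.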

EDA – Preliminary Data Analysis

Histograms and Boxplots

Go to ioGAS demo file


Split Ni by regolith and geol
Histograms, Summary Stats

Normal Probability Plots - Construction
Probability plots are used to screen data for non-normality, outliers (anomalies) and multiple populations.
….data are plotted in a scatterplot against the values "expected from the normal distribution."
The data are ranked, and from these ranks z values (ie, standardised values of the normal
distribution) are calculated based on the assumption that the data come from a normal
distribution…. The proportion of data that fall to the left of, or is less than the ith ordered
observation, if the data are normally distributed, is given by p=(i-0.5)/N. For each probability
level, an expected z score is calculated…The expected z values are plotted on the Y-axis. If the
data (plotted on the X-axis) are normally distributed, then all values should fall onto a straight line
extending from the lower left to the upper right corner… (***axis assignments may vary)
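
The construction described above can be sketched in a few lines: rank the data, compute p = (i - 0.5)/N for each rank, and convert p to an expected z score with the inverse of the standard normal cumulative distribution. The function name and data values are invented; `NormalDist.inv_cdf` is part of the Python standard library.

```python
from statistics import NormalDist


def expected_z_scores(values):
    """Pair each ranked observation with its expected normal z score."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for i, x in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # proportion of data expected below the ith value
        pairs.append((x, standard_normal.inv_cdf(p)))
    return pairs


# Invented example data
data = [0.2, -1.1, 0.9, -0.3, 1.6]
pairs = expected_z_scores(data)
for x, z in pairs:
    print(f"x = {x:5.1f}   expected z = {z:6.3f}")
```

Plotting the pairs as a scatterplot gives the probability plot; which variable goes on which axis is a matter of convention, as noted above.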

A chart comparing the various grading methods in a normal distribution. Includes: Standard deviations, cumulative percentages, percentile
equivalents, Z-scores, T-scores. Modified from Figure 4.3 on Page 74 of Ward, A. W., Murray-Ward, M. (1999). Assessment in the
Classroom. Belmont, CA: Wadsworth. ISBN 0534527043

Normal Probability Plots - Construction
[Figure: two panels plotting Expected z Score (vertical axis) against X(i) (horizontal axis). A. Normal Probability Plot of Example Data. B. As for A, But With Outlier.]

Normal probability plot of data in Table. A. Points plot near to a straight line
and are likely to be normally distributed. B. If the last data point in the Table
(10) is changed to 4.5, it plots away from the line and is obviously an outlier.

Normal Probability Plots

Example probability plots. A. Normally distributed data. B. Normally distributed data set with outliers. C. Data with longer tails than would be expected from normally distributed data. D. Rounded, normally distributed data. E. Lognormally distributed data. F. Two normally distributed populations with overlap.
Normal Probability Plots – Example Data

Reminders

Go to ioGAS file

Discrete examples – change bin size

Gaussian and long tails distribution together

700-50-500-50 example

Small upper population example

Demo: eg, Nabberu Data

Demo: eg Nabberu Data

Demo: Nabberu

Reminders
Go to Nabberu Data
Log transform
Split plots by rego and geol
Examples of
over transformation
outliers
multiple populations
dl data

Data Cleaning – Spotting Units Errors
MMI STD Partial Leach A

Zn Raw

Zn Log

Two orders of magnitude jump


Prob plots also good for QC
Thallium might have been useful even with small numbers of samples. However, it is plagued by high detection limits in many rock samples, which are well above expected background abundances.

In contrast, Li may be worth retaining despite the low values

Useful Summary Statistics

Summary Statistics – Example Report
Summary For CU_PPM

[Figure: histograms and normal probability plots (Expected Normal Value on the vertical axis) for the Raw Data and the Log10 Transformed Data.]

Summary Stats
Parameter      Value
Total Cases    4900
Valid Cases    4896
Num>0          4886
Num=0          0
Num<0          10
Max            990.00
Min            -1.000
Range          991.00
Min>0          1.000
Mean           21.055
Median         18.000
Geo Mean       16.183
Harm Mean      11.501
10% Trim M     19.130
Skew Raw       22.574
Skew Log       -.870

Percentiles
Percentile     Value
99             66.000
98             57.000
95             46.00
90             38.00
80             30.00
60             22.00
50             18.00
40             15.00
20             9.000
10             6.000
5              4.000
2              2.000
1              1.000

Detection Limits
Replacement Value ( .5000)
Value          Count
-1.000         10
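
Most of the parameters in the report above can be reproduced with the Python standard library. The sketch below is an approximation rather than the reporting software's own method: it assumes the trimmed mean drops 10% of cases from each tail, uses the population form of skewness, and computes the geometric and harmonic means on positive values only. The Cu values are invented.

```python
import math
import statistics


def skewness(xs):
    """Population skewness: mean cubed deviation over the cubed standard deviation."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)


def summary_stats(values, trim=0.10):
    xs = sorted(values)
    positive = [x for x in xs if x > 0]
    k = int(len(xs) * trim)  # cases dropped from EACH tail (an assumption)
    trimmed = xs[k:len(xs) - k] if k else xs
    return {
        "valid_cases": len(xs),
        "num_gt_0": len(positive),
        "min": xs[0],
        "max": xs[-1],
        "mean": statistics.fmean(xs),
        "median": statistics.median(xs),
        "geo_mean": statistics.geometric_mean(positive),
        "harm_mean": statistics.harmonic_mean(positive),
        "trim_mean": statistics.fmean(trimmed),
        "skew_raw": skewness(xs),
        "skew_log": skewness([math.log10(x) for x in positive]),
    }


# Invented Cu values in ppm, evenly spaced on a log scale
cu_ppm = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
stats = summary_stats(cu_ppm)
for name, value in stats.items():
    print(f"{name:12} {value:.3f}")
```

As in the report, the ordering harmonic mean < geometric mean < arithmetic mean and the drop in skewness after the log transform are quick indicators of a right-skewed distribution.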

Summary Statistics – Example Table
Bizarre data

The minima for Cu and Mn cannot possibly be real DL values with a ‘-’ instead of a ‘<’

100 percent in ppm is 1,000,000….

Thus many of the maxima are problematic as well! Even values approaching 100 percent
should be viewed with caution, unless people were sampling and analysing gold nuggets or
concentrates!

What has usually occurred is a units mixup e.g. ppb put into a ppm or percent column, units not
converted during compilations etc

This problem can occur the other way too, e.g. if you see 0.001ppb in historical Cu data, alarm
bells should ring!
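
Two of the checks above (negative values and values exceeding 100 percent) are simple to automate. This is a hypothetical sketch: the function name, the unit labels and the conversion table are invented for the example, and the check for implausibly low values is left out because a sensible floor depends on the element and method.

```python
# Conversion factors to ppm (1 percent = 10,000 ppm, so 100 percent = 1,000,000 ppm)
PPM_PER_UNIT = {"ppb": 0.001, "ppm": 1.0, "pct": 10_000.0}


def flag_suspect(value, unit):
    """Return a list of reasons why a reported concentration looks wrong."""
    reasons = []
    ppm = value * PPM_PER_UNIT[unit]
    if value < 0:
        reasons.append("negative: possibly a detection limit stored with '-'")
    if ppm > 1_000_000:
        reasons.append("exceeds 100 percent")
    return reasons


# Invented values for illustration
for value, unit in [(250, "ppm"), (-1, "ppm"), (1_500_000, "ppm"), (120, "pct")]:
    print(value, unit, flag_suspect(value, unit) or "ok")
```

Running such checks on the minimum and maximum of every column is a cheap first pass before any statistics are calculated.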

Summary Statistics – Example Table
Reasonable data – with intrinsic limitations
Look for zeros

• Median = Minimum value is bad for multivariate analysis (rule of thumb is that greater than 30% DL data means little can be gained in a multivariate sense)
• However, binary results can still be important, e.g. Au, Pt, Pd
• No negatives or ludicrous max values, which is good

ioGAS Demo
• Reminders
• Uni/multi demo
• Table and spatial

Summary Statistics – Frequency Tables

Multiple detection limits: be careful that replacing with ‘positive half’ values is sensible

Large ‘negatives’ could swamp responses of ‘real’ values, particularly if old and new data are combined (new data commonly have much lower detection limits)

Frequency tables are useful for assessing such problems
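
A frequency table of the negative entries makes the multiple detection limit problem visible. The sketch below assumes below-detection results are stored as the negative of the detection limit; the Au values are invented, combining an old data set (detection limit 50) with a newer one (detection limit 1).

```python
from collections import Counter


def negative_value_table(values):
    """Frequency table of negative entries, i.e. '<DL' results stored as -DL."""
    return sorted(Counter(v for v in values if v < 0).items())


def replace_negatives_with_half(values):
    """Replace each -DL entry with the 'positive half' value DL / 2."""
    return [abs(v) / 2 if v < 0 else v for v in values]


# Invented Au values in ppb from two combined surveys
au_ppb = [-50, -50, 3, 8, -1, -1, -1, 12, -50, 5]

print("value  count")
for value, count in negative_value_table(au_ppb):
    print(f"{value:5}  {count:5}")

replaced = replace_negatives_with_half(au_ppb)
print("largest real value:       ", max(v for v in au_ppb if v > 0))
print("largest value after DL/2: ", max(replaced))
```

Here the replaced values from the old survey (25) are larger than every real result, so they would swamp the genuine responses.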

Multivariate Graphical Summary 1.
Spider Plot

Multivariate Graphical Summary 2.
Parallel Coordinate Plot

Summary Statistics – Contingency Table of Geol/Rego
Important for levelling work
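
A contingency table of this kind is a cross-tabulation of sample counts by geology and regolith class. The sketch below is illustrative only: the field names `geol` and `rego` and the class labels are assumptions, and the samples are invented.

```python
from collections import Counter


def contingency_table(samples, row_key, col_key):
    """Count samples for every combination of two categorical attributes."""
    counts = Counter((s[row_key], s[col_key]) for s in samples)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    return rows, cols, counts


# Invented samples with hypothetical geology and regolith classes
samples = [
    {"geol": "basalt", "rego": "residual"},
    {"geol": "basalt", "rego": "residual"},
    {"geol": "basalt", "rego": "transported"},
    {"geol": "granite", "rego": "residual"},
    {"geol": "granite", "rego": "transported"},
    {"geol": "granite", "rego": "transported"},
    {"geol": "granite", "rego": "transported"},
]

rows, cols, counts = contingency_table(samples, "geol", "rego")
print("geol".ljust(10) + "".join(c.ljust(14) for c in cols))
for r in rows:
    print(r.ljust(10) + "".join(str(counts[(r, c)]).ljust(14) for c in cols))
```

Cells with very few samples are the ones to watch, since statistics computed within such groups will be poorly constrained.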

Summary: Preliminary Data Analysis
Really only have looked at univariate methods

Methods of visualising data in ‘chemical’ space

Spent a fair amount of time looking at problems with data

…not because geochemists are masochists, but because geochemical data are usually problematic in some way, and if the QC is not done, conclusions that are drawn from the data may not be valid.

Further, esoteric processing of data that have not been subject to rigorous QC is a waste of time, and possibly much worse, a waste of resources if the erroneous results lead to further work

