4 EDA - Stats & Graphical Techniques

Exploratory Data Analysis
Statistical Summaries & Graphical Analysis
Prepared by
ioGlobal
ioGeochemistry
a division of ioGlobal
EDA – Stats & Graphical Analysis

© ioGlobal Pty Ltd
Note:
If you are reading papers on various statistical
methods, bear in mind that a sample, in statistical
terminology, refers to a group of observations drawn
from a population, not an individual result.
Also, in a geochemical context, element and variable
are usually used interchangeably.

Exploratory data analysis (EDA) is the analysis of geochemical data
for the purpose of recognising trends and structures that provide
insight into geochemical/geological processes.
Conventional methods of data analysis are based on the classical
statistics of estimation and hypothesis testing, which assume the
normal distribution model. For example, classical estimators of
central tendency and dispersion, the arithmetic mean and standard
deviation, assume that the underlying data are normally
distributed.
For geochemical data, influences affecting the absolute value
and variability of an element are manifold and generally
uncontrollable.

Such effects can be due to sampling and analytical procedures, the
physicochemical environment, changing lithology and mineralisation
processes.
As a result, Kurzl (1988) eloquently concluded that geochemical
data reflect certain rather disagreeable properties, such as
considerable departure from the normal model, data
inconsistencies, polymodal behaviour, and upper and lower
outliers. Classical statistics are clearly not designed for such
situations.
EDA makes use of order statistics, histograms, box and
whisker plots, probability plots, summary tables and
innovative data visualisation aids.

Exploratory Data Analysis

Lake Water - Example of EDA
18,307 lake water samples from southern Ontario
Mineralisation > sulphide > acid > low lake pH in vicinity > prospecting tool??
But Q: Is there a relationship between size of lake and pH?
ie, is pH affected by processes other than mineralisation or rock type?
….go to example
Worked Example
Reminder
<10m distribution

Special Problems That Need to be Dealt With in Geochemical Data Pre/during EDA
Less than detection data
Over-range data
Non-normal distributions
Missing values
Combined groups of data that have inconsistent digest/instrument finish/analytical laboratory
Coordinate recording
Closure affected data (constant sum problem)

….But first, data cleaning & QC
Detection limits. Values below the detection limit may be replaced by:
the detection limit itself
zero
a fixed proportion of the detection limit (usually half), PREFERRED, or
a random number between zero and the detection limit
If greater than 30% of the data values fall below the detection limit, the variable should be used with
caution, especially if using multivariate techniques.
The Problem of Closed Data (covered in Lithogeochemistry)
NULL data must remain unambiguously null
Units must be consistent (ppm, ppb, %)
Data must be attributed consistently (soil, stream sed., rockchip etc)
Analytical method recorded, at least generically, eg, so that fire assay
Au can be separated from aqua regia Au
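
The detection-limit substitution and the 30% rule of thumb above can be sketched in Python. This is a minimal illustration, not a prescribed workflow. It assumes that below-detection results are stored as the negative of the detection limit (for example -1.0 for "<1") and that null assays are stored as None; the function names and the Cu values are hypothetical.

```python
def replace_below_dl(values, fraction=0.5):
    """Replace below-detection codes with a fixed fraction of the detection limit.

    Assumes a below-detection result is stored as the negative of its
    detection limit (e.g. -1.0 means "<1"); None marks a null assay.
    """
    cleaned = []
    for v in values:
        if v is None:
            cleaned.append(None)               # nulls stay unambiguously null
        elif v < 0:
            cleaned.append(abs(v) * fraction)  # e.g. half the detection limit
        else:
            cleaned.append(v)
    return cleaned


def fraction_below_dl(values):
    """Proportion of non-null values that are below detection."""
    valid = [v for v in values if v is not None]
    if not valid:
        return 0.0
    return sum(1 for v in valid if v < 0) / len(valid)


# Hypothetical Cu assays in ppm: two below-detection codes and one null.
cu_ppm = [12.0, -1.0, 18.0, None, 25.0, -1.0, 9.0]

cu_clean = replace_below_dl(cu_ppm)
share = fraction_below_dl(cu_ppm)
use_with_caution = share > 0.30  # the 30% rule of thumb
```

Keeping the null as None, rather than coding it as zero or as a negative number, is what keeps missing data distinguishable from below-detection data.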

QC: Data Cleaning
Spotting inconsistencies
Often a problem when storing partial leach data
It ALWAYS ends up corrupt in databases
<dl missing
Null?
Detection??

QC: Data Cleaning - spotting inconsistencies
Watch for meta-data problems as well
Different size fractions (same location) in stream sediments with identical Cu-Pb-Zn analyses (location and sample numbers omitted)
Eg, could double count data in stats

QC: Data Cleaning - spotting inconsistencies
Here we have a public domain database that has the metadata (including location!!!!) offset by 1 row across the entire dataset
As samples were commonly disparate in a spatial sense, this is a potential disaster!
The same data compilation had these Eu ppb values

QC: Data Cleaning - spotting inconsistencies
Published mineral probe data: CaO is actually meant to be cobalt (Co)! The
giveaway, apart from Co being more sensible for this mineral type, is
the differing columns in the spreadsheet that calculate the weight
percent values from the raw probe results

QC: Data Cleaning - spotting inconsistencies
Mixed quality data

QC: Data Cleaning – Units Errors – ‘spatial QC’

QC: Batch Effect – Different sheets with different DL values (in this instance) - ‘spatial QC’

QC: - ‘spatial QC’ – checking projection information
Projection issues between two partially overlapping sources of data… which is correct?

CLOSURE EXAMPLE!
[Figure: two bar charts for elements A to E, each comparing the Original Sample with the Altered Sample. Extensive: moles in original and altered samples. Intensive: mole % in original and altered samples.]
Note how there are differences between concentration values, and the absolute values.
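
The closure effect on this slide can be reproduced with a few lines of Python. The mole values below are hypothetical (the numbers behind the slide's charts are not given), and `to_mole_percent` is an illustrative helper: only element A is changed in absolute terms, yet every mole % value shifts because the percentages must sum to 100.

```python
def to_mole_percent(moles):
    """Convert absolute (extensive) amounts to percentages that sum to 100."""
    total = sum(moles.values())
    return {element: 100.0 * amount / total for element, amount in moles.items()}


# Hypothetical amounts in moles; only element A changes during alteration.
original = {"A": 10.0, "B": 10.0, "C": 10.0, "D": 10.0, "E": 10.0}
altered = dict(original, A=2.0)

original_pct = to_mole_percent(original)
altered_pct = to_mole_percent(altered)

# Print moles and mole % side by side for each element.
for element in original:
    print(element, original[element], altered[element],
          round(original_pct[element], 2), round(altered_pct[element], 2))
```

In this sketch elements B to E are unchanged in moles, but their mole % rises from 20 to about 23.8 purely because A was lost.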
Suggested Flow for Exploratory Data Analysis
1. Preliminary Data Analysis
a Histogram, Box and Whisker, Q-Q Plots, Normal Probability Plots, Scatter Plot Matrix, Ranked Data
b Summary Statistics
c Maps of percentile values for individual elements
d Investigation of outliers (don’t necessarily remove them); they may be analytical error or mineralisation
e Transform data based on samples below 90-95th percentile, if necessary
f Select thresholds
2. Exploratory Multivariate Data Analysis
a Robust estimates of mean and covariance, which enhance outlier detection and prevent outliers affecting
subsequent analysis
b Dimension reduction e.g. PCA, Factor Analysis
c Maps of large Mahalanobis distances (>95th percentile) may identify anomalous areas
3. Specific Multivariate Data Analysis and Modelled Multivariate Analysis
a Calculation of empirical indices specifically tailored to areas and targets in which multi-element
associations are understood
b Multiple regression where a linear model of multi-element association can be computed with good
results (high R² coefficients). Residuals may represent mineralisation
c Establishment of background and target groups that characterise the geochemical variation of the
regional geochemistry and the mineral deposits
d The use of all possible subsets to compare reference groups with each other and determine which
group of elements enhances the group separations
e Use of allocation/typicality procedures to test unknown samples from an exploration program against
established groups. Each sample should be assigned a probability of belonging to one of the
reference groups. Maps of typicality or posterior probability can be made to indicate group
membership
EDA – Preliminary Data Analysis
Data Transformations
As a prelude to interpretation, data are often mathematically transformed to
a more normal form so that underlying statistical assumptions are met. This
also improves the reliability of statistical results and enhances relationships
present in the bulk of the data.
It is an important concept that transformations are a requirement of the
interpretation, not a property of a particular element distribution. They
should not be viewed as a fix for recalcitrant data, but rather as tools of the
interpreter used to extract meaning from the data
linear scaling: $y = kx$ or $y = x/k$
standardisation (z-score): $y = (x_i - \bar{x})/s$
logarithmic: $y = \log_{10}(x)$, $y = \ln(a + x)$
exponential: $y = e^{x}$, or the
Box-Cox Generalised Power Transform: $y = (x^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$, $y = \ln(x)$ for $\lambda = 0$
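
As a rough sketch of the transformations listed above, the following Python uses only the standard library. The function names and the Cu values are hypothetical, and the z-score here uses the sample standard deviation, which is an assumption about what s denotes.

```python
import math
import statistics


def z_scores(values):
    """Standardise: subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


def box_cox(values, lam):
    """Box-Cox power transform; requires strictly positive values."""
    if any(x <= 0 for x in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(x) for x in values]   # limiting case: natural log
    return [(x ** lam - 1.0) / lam for x in values]


# Hypothetical, positively skewed concentrations in ppm.
cu_ppm = [4.0, 6.0, 9.0, 15.0, 18.0, 22.0, 30.0, 990.0]

log10_cu = [math.log10(x) for x in cu_ppm]
z = z_scores(cu_ppm)
bc_half = box_cox(cu_ppm, 0.5)   # square-root-like transform
bc_zero = box_cox(cu_ppm, 0)     # same as the natural log
```

Which transform (if any) to use remains a choice made for the interpretation at hand, in line with the point made on this slide.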

Data Transformations
Go to Datadesk demonstration

Histograms
Very popular – they reflect the shape of the underlying distribution of the data
However, their appearance changes with the subjective choice of the number of bars to use
The data are binned, which can result in a loss of information (compared with probability plots for example)
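
The dependence of a histogram on the number of bars can be shown with a small sketch. The data and the `histogram_counts` helper are hypothetical; the function uses equal-width bins between the minimum and maximum, which is only one of several possible binning conventions.

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical data with two clusters (around 2-4 and around 8-10).
data = [2, 3, 3, 4, 4, 4, 8, 9, 9, 9, 10, 10]

coarse = histogram_counts(data, 2)   # two bars
fine = histogram_counts(data, 4)     # four bars
```

With two bars the counts are equal and the histogram looks flat; with four bars an empty interval appears between the two clusters. The same data give two different impressions.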

Tukey Box & Whisker Plot
A method of graphically displaying order statistics (percentiles)
There are many variants
The example shown to the left is a ‘Tukey’ boxplot
Box = 25th, 50th (median) & 75th percentiles
Upper and lower whiskers = limits at hinge +/- 1.5 times the IQR (in Tukey's construction, drawn to the most extreme data value inside that limit)
Data that plot beyond the hinge +/- 3 times the IQR are called far outliers
Basis of an automatically derived thematic outlier map
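
The fence arithmetic behind a Tukey boxplot can be sketched as follows. This is an illustration under stated assumptions: quartiles from Python's `statistics.quantiles` (inclusive method) stand in for the hinges, which may differ slightly from the hinge definition used by any particular package, and the Ni values are hypothetical.

```python
import statistics


def tukey_fences(values):
    """Return quartiles, inner/outer fences and flagged values for a Tukey boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # whisker limits
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # far-outlier limits
    outliers = [v for v in values
                if (outer[0] <= v < inner[0]) or (inner[1] < v <= outer[1])]
    far_outliers = [v for v in values if v < outer[0] or v > outer[1]]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "inner": inner, "outer": outer,
            "outliers": outliers, "far_outliers": far_outliers}


# Hypothetical Ni values in ppm.
ni_ppm = [10, 12, 14, 15, 16, 18, 20, 22, 24, 40, 120]
result = tukey_fences(ni_ppm)
```

Here 40 falls between the inner and outer fences and is flagged as an outlier, while 120 lies beyond the outer fence and is flagged as a far outlier. Classes of this kind are what a thematic outlier map would colour.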

Histograms and Boxplots
Go to ioGAS demo file

Split Ni by regolith and geol
Histograms, Summary Stats

Normal Probability Plots - Construction
Probability plots are used to screen data for non-normality, outliers (anomalies) and
multiple populations.
….data are plotted in a scatterplot against the values "expected from the normal distribution."
The data are ranked, and from these ranks z values (ie, standardised values of the normal
distribution) are calculated based on the assumption that the data come from a normal
distribution…. The proportion of data that fall to the left of, or is less than the ith ordered
observation, if the data are normally distributed, is given by p=(i-0.5)/N. For each probability
level, an expected z score is calculated…The expected z values are plotted on the Y-axis. If the
data (plotted on the X-axis) are normally distributed, then all values should fall onto a straight line
extending from the lower left to the upper right corner… (***axis assignments may vary)
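
The construction described above can be sketched in Python using the standard library's `NormalDist`. The data are hypothetical and `expected_z_scores` is an illustrative name; the plotting position p=(i-0.5)/N is the one quoted in the text.

```python
from statistics import NormalDist


def expected_z_scores(values):
    """Pair each sorted observation with its expected standard-normal z score."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for i, x in enumerate(ordered, start=1):
        p = (i - 0.5) / n                 # proportion below the i-th ordered value
        z = standard_normal.inv_cdf(p)    # expected z score for that proportion
        pairs.append((x, z))
    return pairs


# Hypothetical data; plot x on one axis and z on the other.
data = [1.2, 0.4, -0.3, 2.1, 0.9, -1.1, 0.1, 1.6, -0.6]
points = expected_z_scores(data)
```

Plotting the pairs in `points` as a scatterplot gives the probability plot; approximately normal data should fall close to a straight line.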

A chart comparing the various grading methods in a normal distribution. Includes: Standard deviations, cumulative percentages, percentile
equivalents, Z-scores, T-scores. Modified from Figure 4.3 on Page 74 of Ward, A. W., Murray-Ward, M. (1999). Assessment in the
Classroom. Belmont, CA: Wadsworth. ISBN 0534527043

Normal Probability Plots - Construction
[Figure: two normal probability plots with X(i) on the x-axis and Expected z Score on the y-axis. A. Normal Probability Plot of Example Data. B. As for A, But With Outlier.]
Normal probability plot of data in Table. A. Points plot near to a straight line
and are likely to be normally distributed. B. If the last data point in the Table
(10) is changed to 4.5, it plots away from the line and is obviously an outlier.

Normal Probability Plots
Example probability plots. A. Normally distributed data. B. Normally
distributed data set with outliers. C. Data with longer tails than would be
expected from normally distributed data. D. Rounded, normally
distributed data. E. Lognormally distributed data. F. Two normally
distributed populations with overlap.
Normal Probability Plots – Example Data
Reminders
Go to ioGAS file
Discrete examples – change bin size
Gaussian and long tails distribution together
700-50-500-50 example
Small upper population example

Demo: eg, Nabberu Data

Demo: eg Nabberu Data

Demo: Nabberu
Reminders
Go to Nabberu Data
Log transform
Split plots by rego and geol
Examples of
over transformation
outliers
multiple populations
dl data

Data Cleaning – Spotting Units Errors
MMI STD Partial Leach A
Zn Raw
Zn Log
Two orders of magnitude jump

Prob plots also good for QC
Thallium might have been useful even with small numbers of samples. However, it is plagued by high detection limits in many rock samples, which are well above expected background abundances.
In contrast, Li may be worth retaining despite the low values

Useful Summary Statistics

Summary Statistics – Example Report
Summary For CU_PPM
[Figure: histograms and normal probability plots (y-axis: Expected Normal Value) of the raw data and the log10 transformed data.]

Summary Stats
Parameter     Description
Total Cases   4900
Valid Cases   4896
Num>0         4886
Num=0         0
Num<0         10
Max           990.00
Min           -1.000
Range         991.00
Min>0         1.000
Mean          21.055
Median        18.000
Geo Mean      16.183
Harm Mean     11.501
10% Trim M    19.130
Skew Raw      22.574
Skew Log      -.870

Percentiles
Percentile    Value
99            66.000
98            57.000
95            46.00
90            38.00
80            30.00
60            22.00
50            18.00
40            15.00
20            9.000
10            6.000
5             4.000
2             2.000
1             1.000

Detection Limits
Replacement Value ( .5000)
Value         Count
-1.000        10
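
Several of the parameters in this report can be computed with Python's standard library, as sketched below. The exact definitions behind the report are not stated, so two are assumptions here: the 10% trimmed mean drops 10% of the values from each end, and skewness is the moment-based population form. The Cu values are hypothetical and assume detection-limit codes have already been replaced.

```python
import math
import statistics


def trimmed_mean(values, proportion=0.10):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


def skewness(values):
    """Moment-based skewness (population form): m3 / m2**1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def summarise(values):
    """Summary statistics for strictly positive data."""
    logs = [math.log10(x) for x in values]
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "geo_mean": statistics.geometric_mean(values),
        "harm_mean": statistics.harmonic_mean(values),
        "trim_mean_10": trimmed_mean(values, 0.10),
        "skew_raw": skewness(values),
        "skew_log": skewness(logs),
    }


# Hypothetical Cu values in ppm, after detection-limit replacement.
cu_ppm = [0.5, 4.0, 6.0, 9.0, 15.0, 18.0, 22.0, 30.0, 38.0, 990.0]
report = summarise(cu_ppm)
```

As in the report above, the single extreme value pulls the arithmetic mean well above the median, while the geometric mean, trimmed mean and log-transformed skewness are much less affected.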

Summary Statistics – Example Table
Bizarre data
The minima for Cu and Mn cannot possibly be real DL values with a ‘-’ instead of a ‘<’
100 percent in ppm is 1,000,000….
Thus many of the maxima are problematic as well! Even values approaching 100 percent
should be viewed with caution, unless people were sampling and analysing gold nuggets or
concentrates!
What has usually occurred is a units mixup e.g. ppb put into a ppm or percent column, units not
converted during compilations etc
This problem can occur the other way too, e.g. if you see 0.001ppb in historical Cu data, alarm
bells should ring!
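
A simple screening check for the units problems described above might look like the following sketch. The 50% warning level (`warn_fraction`) is an arbitrary illustrative threshold, not a recommendation from this deck, and the column values are hypothetical.

```python
PPM_PER_PERCENT = 10_000         # 1 % = 10,000 ppm
MAX_PPM = 100 * PPM_PER_PERCENT  # 100 % = 1,000,000 ppm


def flag_suspect_ppm(values, warn_fraction=0.5):
    """Sort ppm values into impossible, suspiciously high, and negative groups."""
    impossible = [v for v in values if v > MAX_PPM]
    high = [v for v in values if warn_fraction * MAX_PPM <= v <= MAX_PPM]
    negative = [v for v in values if v < 0]
    return {"impossible": impossible, "high": high, "negative": negative}


# Hypothetical column labelled "Cu ppm" that mixes in ppb values and DL codes.
cu_ppm = [35.0, 120.0, -5.0, 2_500_000.0, 640_000.0, 18.0]
flags = flag_suspect_ppm(cu_ppm)
```

Values in the "impossible" group exceed 100 percent and point to a units mix-up; values in the "high" group are physically possible but should be checked against the sample type.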

Summary Statistics – Example Table
Reasonable data – with intrinsic limitations
Look for zeros
• Median = Minimum is bad for multivariate work (rule of thumb is greater than 30% DL data means little can be gained in a multivariate sense)
• However, binary results can still be important e.g. Au, Pt, Pd
• No negatives or ludicrous max values, which is good

ioGAS Demo
• Reminders
• Uni/multi demo
• Table and spatial

Summary Statistics – Frequency Tables
Multiple detection limits… be careful that replacing with ‘positive half’ values is sensible
Large ‘negatives’ could swamp responses of ‘real’ values, particularly if old and new data are combined (new data commonly have much lower detection limits)
Frequency tables are useful for assessing such problems
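
A frequency table of detection-limit codes can be built in a few lines, as in this sketch. It assumes below-detection results are stored as the negative of the detection limit and nulls as None; the Au values, combining a hypothetical old survey (detection limit 5) with a new one (detection limit 1), are invented for illustration.

```python
from collections import Counter


def detection_limit_table(values):
    """Frequency table of negative codes, assumed to be minus the detection limit."""
    counts = Counter(v for v in values if v is not None and v < 0)
    # Report each detection limit as a positive number, smallest first.
    return sorted((abs(code), n) for code, n in counts.items())


# Hypothetical Au ppb data compiled from an old survey (DL 5) and a new one (DL 1).
au_ppb = [-5, -5, -5, 7, 12, -1, 2, -1, 3, -5, 4, None]
table = detection_limit_table(au_ppb)

# Half of the largest detection limit (2.5) exceeds some real values in the new data.
real_values = [v for v in au_ppb if v is not None and v >= 0]
largest_dl = max(dl for dl, _ in table)
swamped = [v for v in real_values if v < 0.5 * largest_dl]
```

The table shows two detection limits in one column, and the `swamped` list shows real measurements that would plot below the half-detection-limit replacement for the older data.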

Multivariate Graphical Summary 1.
Spider Plot

Multivariate Graphical Summary 2.
Parallel Coordinate Plot

Summary Statistics - Contingency Table of Geol/Rego
Important for Levelling work
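
A contingency table of this kind is a cross-tabulation of sample counts. The sketch below builds one with the standard library; the sample records, the field names `geol` and `rego`, and the category labels are all hypothetical.

```python
from collections import Counter


def contingency_table(rows, row_key, col_key):
    """Count samples for every combination of two categorical attributes."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_labels = sorted({r[row_key] for r in rows})
    col_labels = sorted({r[col_key] for r in rows})
    # Combinations with no samples are reported as 0.
    return {rl: {cl: counts[(rl, cl)] for cl in col_labels} for rl in row_labels}


# Hypothetical samples; the geology and regolith codes are illustrative only.
samples = [
    {"geol": "granite", "rego": "residual"},
    {"geol": "granite", "rego": "transported"},
    {"geol": "basalt", "rego": "residual"},
    {"geol": "basalt", "rego": "residual"},
    {"geol": "granite", "rego": "residual"},
]
table = contingency_table(samples, "geol", "rego")
```

Cells with few or no samples (here basalt under transported cover) indicate groups that may be too small to level on their own.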

Summary: Preliminary Data Analysis
We have really only looked at univariate methods
Methods of visualising data in ‘chemical’ space
Spent a fair amount of time looking at problems with data
…not because geochemists are masochists, but because geochemical data are usually problematic in some way, and if the QC is not done, conclusions that are drawn from the data may not be valid.
Further, esoteric processing of data that have not been subject to rigorous QC is a waste of time, and possibly much worse, a waste of resources if the erroneous results lead to further work
