4 EDA - Stats & Graphical Techniques
Prepared by
ioGlobal
ioGeochemistry
a division of ioGlobal
Note: 18,307 lake water samples from southern Ontario
Mineralisation > sulphide > acid > low lake pH in vicinity > prospecting tool??
….go to example
EDA – Stats & Graphical Analysis
© ioGlobal Pty Ltd
Worked Example
Reminder
<10m distribution
Over-range data
Non-normal distributions
Missing values
Coordinate recording
If greater than 30% of the data values fall below the detection limit, the variable should be used with
caution, especially if using multivariate techniques.
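The 30% rule of thumb can be checked before any multivariate work. The sketch below is illustrative only: the function names, the Au values and the 1 ppb detection limit are hypothetical, and missing values are assumed to be stored as None.

```python
def fraction_below_dl(values, detection_limit):
    """Return the fraction of non-missing values that fall below the detection limit."""
    measured = [v for v in values if v is not None]
    if not measured:
        return 0.0
    below = sum(1 for v in measured if v < detection_limit)
    return below / len(measured)


def use_with_caution(values, detection_limit, threshold=0.30):
    """True when more than `threshold` of the measured values are below detection."""
    return fraction_below_dl(values, detection_limit) > threshold


# Hypothetical Au assays in ppb with an assumed detection limit of 1 ppb.
au_ppb = [0.5, 0.5, 0.5, 0.5, 2.0, 3.0, None, 0.5, 7.0, 1.5]
frac = fraction_below_dl(au_ppb, detection_limit=1.0)
caution = use_with_caution(au_ppb, detection_limit=1.0)
```

Missing values are excluded from the denominator here so that a sparsely assayed element is not made to look better than it is.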
It ALWAYS ends up corrupt in databases
<dl missing
Null?
Detection??
As samples were commonly disparate in a spatial sense, this is a potential disaster!
[Figure: two bar charts comparing the Original Sample and the Altered Sample for elements A to E. Extensive: moles in original and altered samples. Intensive: mole % in original and altered samples.]
Data Transformations
As a prelude to interpretation, data are often mathematically transformed to
a more normal form so that underlying statistical assumptions are met. This
also improves the reliability of statistical results and enhances relationships
present in the bulk of the data.
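As a rough illustration of the idea, the sketch below applies a log10 transform to a small, hypothetical set of Zn values and compares the skewness before and after. The data and function names are assumptions for illustration; the log transform is only one of several possible transformations, and it requires strictly positive values.

```python
import math
import statistics


def skewness(values):
    """Population (moment) skewness: mean cubed deviation divided by sigma cubed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


def log10_transform(values):
    """Log10-transform strictly positive values; raise on anything else."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform needs strictly positive values")
    return [math.log10(v) for v in values]


# Hypothetical Zn values in ppm: mostly background with a few high values.
zn_ppm = [12, 15, 18, 20, 22, 25, 30, 45, 120, 800]
zn_log = log10_transform(zn_ppm)

raw_skew = skewness(zn_ppm)   # strongly right-skewed
log_skew = skewness(zn_log)   # less skewed after the transform
```

The transformed values are still somewhat skewed in this example, which is a reminder to inspect the result rather than assume the transform has produced a normal distribution.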
Go to Datadesk demonstration
[Figure: normal probability plots, panels A and B. Vertical axis: expected z score. Horizontal axis: X(i).]
Normal probability plot of the data in the Table. A. The points plot close to a straight line, so the data
are likely to be normally distributed. B. If the last data point in the Table (10) is changed to 4.5, it plots
well away from the line and is obviously an outlier.
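The construction of a normal probability plot can be sketched in a few lines: sort the data, assign each rank an expected z score, and see how close the points lie to a straight line. The sketch below is an assumption-laden illustration, not the method used to draw the slide: it uses Blom plotting positions (one of several conventions), hypothetical data rather than the Table, and a correlation coefficient as a simple stand-in for judging straightness by eye.

```python
from statistics import NormalDist, fmean


def expected_z_scores(n):
    """Expected z scores for ranks 1..n using Blom plotting positions (assumed convention)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def probability_plot_points(values):
    """Return (x, z) pairs: sorted data against expected z scores."""
    ordered = sorted(values)
    return list(zip(ordered, expected_z_scores(len(ordered))))


def straightness(points):
    """Pearson correlation of the plotted points; near 1 means close to a straight line."""
    xs = [p[0] for p in points]
    zs = [p[1] for p in points]
    mx, mz = fmean(xs), fmean(zs)
    sxz = sum((x - mx) * (z - mz) for x, z in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    return sxz / (sxx * szz) ** 0.5


# Hypothetical data, not the Table from the slide.
data = [-1.2, -0.8, -0.5, -0.2, 0.0, 0.3, 0.5, 0.9, 1.3, 1.8]
with_outlier = data[:-1] + [4.5]   # last point replaced by an outlier

r_clean = straightness(probability_plot_points(data))
r_outlier = straightness(probability_plot_points(with_outlier))
```

With the outlier in place the correlation drops, which mirrors the point in panel B plotting away from the line.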
Reminders
Go to ioGAS file
700-50-500-50 example
Reminders
Go to Nabberu Data
Log transform
Split plots by rego and geol
Examples of
over transformation
outliers
multiple populations
dl data
[Figure: Zn Raw and Zn Log plots]
Detection Limits
Replacement Value: 0.5000
Value     Count
-1.000    10
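The detection limit summary above (a value of -1.000 occurring 10 times, replaced by 0.5) can be reproduced with a short sketch. It assumes that below-detection results are coded as the negative of the detection limit and are replaced by half the detection limit; the function names and the Zn values are hypothetical.

```python
from collections import Counter


def dl_summary(values):
    """Count each distinct negative-coded detection limit value."""
    return Counter(v for v in values if v is not None and v < 0)


def replace_below_dl(values, fraction=0.5):
    """Replace negative-coded '<DL' entries with fraction * DL; leave the rest untouched."""
    cleaned = []
    for v in values:
        if v is not None and v < 0:
            cleaned.append(abs(v) * fraction)   # -1.0 (meaning <1) becomes 0.5
        else:
            cleaned.append(v)
    return cleaned


# Hypothetical Zn column where -1.0 is assumed to encode "<1".
zn = [-1.0, 4.0, -1.0, 12.0, None, 250.0, -1.0]
summary = dl_summary(zn)
zn_clean = replace_below_dl(zn)
```

Running the summary first is worthwhile: an unexpected negative value in it is more likely a data entry problem than a detection limit.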
The minima for Cu and Mn cannot possibly be real DL values recorded with a '-' instead of a '<'.
Many of the maxima are problematic as well! Even values approaching 100 percent
should be viewed with caution, unless people were sampling and analysing gold nuggets or
concentrates!
What has usually occurred is a units mix-up, e.g. ppb put into a ppm or percent column, or units not
converted during compilations.
This problem can occur the other way too, e.g. if you see 0.001 ppb in historical Cu data, alarm
bells should ring!
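A simple range check catches many of these problems before any analysis. The sketch below is a minimal illustration under stated assumptions: the unit labels, the function name and the optional `min_plausible` threshold are hypothetical, and the only hard rule encoded is that a concentration cannot exceed 100 percent of the sample.

```python
# Upper bound of a concentration in each unit: 100 percent of the sample.
MAX_BY_UNIT = {"pct": 100.0, "ppm": 1_000_000.0, "ppb": 1_000_000_000.0}


def flag_suspect_values(values, unit, min_plausible=None):
    """Return (index, value, reason) for values that cannot be real or look like a units mix-up."""
    upper = MAX_BY_UNIT[unit]
    flags = []
    for i, v in enumerate(values):
        if v is None:
            continue
        if v > upper:
            flags.append((i, v, "exceeds 100 percent of sample"))
        elif min_plausible is not None and 0 < v < min_plausible:
            flags.append((i, v, "below plausible reporting precision"))
    return flags


# Hypothetical Cu column labelled as percent; 4500 looks like ppm in a percent column.
cu_pct = [0.02, 0.15, 4500.0, 1.2, None]
flags = flag_suspect_values(cu_pct, "pct")
```

Values that pass this check can still be wrong (a percent value of 45 may well be 45 ppm), so the flags are a starting point for review, not a substitute for it.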
• Median = Minimum is bad for multivariate work (rule of thumb: more than 30% DL data means little can be gained in a multivariate sense)
• However, binary results can still be important, e.g. Au, Pt, Pd
• No negatives or ludicrous max values, which is good
Further, esoteric processing of data that has not been subject to rigorous
QC is a waste of time and, possibly much worse, a waste of resources if
the erroneous results lead to further work.