
GEMS Basic Statistics Concepts

Copyright 2009 Gemcom Software International Inc.


All Rights Reserved. This publication, or parts thereof, may not be reproduced in any form, by any
method, in whole or in part, for any purpose.
Gemcom Software International Inc. makes no warranty, either expressed or implied, including but not
limited to implied warranties of merchantability or fitness for a particular purpose, regarding these
materials.
In no event shall Gemcom Software International Inc. be liable to anyone for special, collateral, incidental,
or consequential damages in connection with or arising out of the use of these materials. The sole and
exclusive liability to Gemcom Software International Inc., regardless of the form of action, shall not
exceed the purchase price of the materials described herein.
Gemcom Software International Inc. reserves the right to revise and improve its products as it deems
appropriate. This publication describes the state of this product at the time of publication for the version
number stated, and may not reflect the product at all times in the future.
Gemcom Software International Inc.
Suite 1100, 1066 West Hastings Street
Vancouver, BC, Canada V6E 3X1
Tel: +1 604.684.6550
Fax: +1 604.684.3541
Web site: www.gemcomsupport.com
Gemcom, the Gemcom logo, combinations thereof, and GEMS are trademarks of Gemcom Software
International Inc.
Revision date: 4/23/2009

Table of Contents

Introduction
  Overview
  Requirements
  Workflow

General Summary
  Understand the Domains
  Validate the Input Data
  Understand Estimation Methods and Parameters
  Validate the Output Model

Domains
  Overview
  The Impact of Domains on Estimated Values

Composites

Basic Statistics
  Overview
  Descriptive statistics
    What is typical?
    How are they different?
  Histograms
  Cumulative Probability Plots
  Probability Plots
  Bimodal Distributions

Outliers
  Overview
  Outliers and Topcuts
  Methods of Determining a Topcut Value
    Histogram
    Confidence interval
    Percentile
    Experience


Introduction
Overview
Statistics helps build our understanding of the data we are working with. It is useful to know what
values are typical, how different our samples are from one another, whether there are any patterns or
spatial relationships between our samples, and how these observations correlate with our geological
knowledge. We use statistics for:

QAQC
Validating, comparing and combining domains
Validating resource models

Requirements
You should be able to do the following:

Create, edit and import data into a geological drillhole database
Manipulate drillhole data
Composite drillhole data
Interpret geological sections and plans
Perform surface and solid modelling
Display and view data in 2D and 3D.

Caution: If you do not have a good background in these subjects, many parts of this training may be
difficult to follow.

Workflow
There can be many different workflows for applying basic statistics. The decision of which techniques to
apply, and the order in which to apply them, is usually something that only experience will teach you.
Here is a generic process for performing basic statistics:
1. Identify and model domains
For each domain:
2. Create composites
3. Perform basic statistics
4. Topcut outliers
This would be followed by geostatistics, which includes the following:
5. Calculate anisotropy and variogram parameters
6. Perform ID2 and ordinary kriging
7. Validate the model
Steps 5-7 are covered in the Variography Manual.


General Summary
In order to reduce estimation errors, you should:

understand the domains
validate the input data
understand estimation methods and parameters
validate the output model.

Understand the Domains


It is important to recognise separate regions or domains within a model. Once you have identified the
domains, it is important to group all sample data contained within each domain into distinct subsets. After
that, you can analyse each subset individually, and use data from each separate domain to make
estimations within that domain.

Validate the Input Data


The saying "Garbage in = garbage out" is certainly true in geostatistics. Although sampling theory and
laboratory quality control practices are important concepts which impact the quality of any estimation
made using a set of data values, these subjects are outside the scope of this tutorial.
Assuming that the quality of the data is as good as you're going to get, there are two potentially
hazardous characteristics of the data which you should look for: bimodalism and outliers. You can
look for both of these features with a histogram. A data set is said to be unimodal if the histogram
shows a single peak. If there are two peaks, the data is said to be bimodal. If you use some of the
more common estimation techniques to create a model based on a bimodal distribution, it is likely to
contain more estimation errors than a model created from a unimodal data set. Additionally, outliers
(values which are significantly distant from the majority of the data) can cause estimation errors.

Understand Estimation Methods and Parameters


There are a large number of estimation methods, and a large number of parameters within each method.
Before using a particular estimation method, you should have a good background in basic statistics, as
well as basic geostatistical principles.
Using geostatistics can be likened to flying a jet plane. Although there are autopilot modes, where you
just press a few buttons and something happens, it is important that the pilot understands the theory of
aerodynamics in order to know what impact a particular control has on the end result.

Validate the Output Model


A final method you should use to check the quality of estimation is to take time to examine the output.
Histograms of estimated values, contours of plans, cross sections of block models, colour coded and
rotated in three-dimensional space are all methods which can be used to verify the output values.


Domains
Overview
One of the most important aspects of geostatistics is to ensure that any data set is correctly classified into
a set of homogeneous domains. A domain is a 2D or 3D region within which all data is related.
Mixing data from more than one domain, or not classifying data into the correct domains, is often a
source of estimation errors.
You will learn about:

the impact of domains on estimated values


viewing and using domains.

The Impact of Domains on Estimated Values


Imagine that you are a meteorologist, and you are given three air temperatures measured at locations A,
B, and C, as displayed below. Based on the values shown, what would you guess the temperature is at
location X? Would you guess that the temperature at location X was greater than 25?

Using the information above, you may have the following thoughts:
1. Since location A is relatively distant from X, the value at A may have little or no influence on the
estimated temperature at X.
2. Since locations B and C are about the same distance from X, they will probably have equal
influence on the estimated temperature.
3. Given the previous two points, the temperature at X would probably be the average of the
temperatures at B and C: (18 + 32) / 2 = 25 degrees.
4. Since the influence of A has not been accounted for at all, and the estimate is exactly 25 degrees,
it is difficult to say with certainty if the temperature at X is above 25 degrees.

Now consider the following: Imagine that you want to go to your favourite beach, but only if the
temperature is 25 degrees or more. You have three friends who live near the beach you want to go to,
and you call them up and ask each one what the temperature is at each of their homes. You draw the
map below, with the locations of each friend (A, B, and C) and the temperatures they give you. Your
favourite beach is at location X. Note that the friend at location B lives high up in the mountains, while
friends at A and C live near the beach.


Using the information above, you may have the following thoughts:
1. The data from B can be ignored, because temperatures high up in the mountains are usually not
good estimates of temperatures on the beach.
2. A and C are on the beach, so they can be used to guess the temperature at X.
3. Since X is between A and C on the map, the temperature at X will probably be somewhere
between the temperature at A and the temperature at C.
4. Therefore, the temperature at X will be somewhere between 28 and 32 degrees.
5. Since the temperature range of 28 to 32 degrees is greater than the minimum value of 25 degrees,
you would probably decide "Yes, I'm going to the beach!"

Compare this example with the first one. In both cases, all of the locations and temperatures are exactly
the same. However, in the second case, when you took account of the domain which contains the data,
you came up with a considerably different result. The point is that separating data into similar regions, or
domains, is a very important part of making any geostatistical estimation.


Composites
Compositing is designed to ensure that samples have comparable weight in the estimation. Samples
inside a geological domain should be length weighted to create equal-length composite grades. Basic
statistics on grade versus sample interval length can help determine the optimum sample length for
compositing.
The composite length should be as close to the original sampling interval as possible to preserve the
original standard. However, the new regular length of composites may be a projected length, for example
by bench. Additional weighting factors may be used, such as density (SG) or a sample recovery factor. The
outcome is to have equally weighted samples for estimation. When compositing, database users
should consider that the altered samples or composites may lose some of the properties of the original
samples that are useful for QA/QC. For instance, detection limits will be affected and outliers will be more
in line with the average. This is one of the reasons why the industry practice is to perform the topcut before
compositing. In effect, this alters the data further, and it is not always desirable. For example, if an ore
deposit does have zones (pockets) of high grade represented by clusters of high-grade samples, cutting
the high grades will particularly hurt the quality of the estimation of those features.
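Length weighting means that each sample's grade contributes to the composite in proportion to its interval length. The idea can be sketched as follows; the (grade, length) pairs are hypothetical, not from any real data set:

```python
def length_weighted_grade(intervals):
    """Composite grade as the length-weighted average of the interval grades.

    Each element of `intervals` is a (grade, length) pair; longer intervals
    contribute proportionally more to the composite grade.
    """
    total_length = sum(length for _, length in intervals)
    return sum(grade * length for grade, length in intervals) / total_length

# Hypothetical samples inside one composite: (grade in g/t, length in m)
samples = [(2.0, 1.0), (4.0, 1.0), (3.0, 2.0)]
print(length_weighted_grade(samples))  # → 3.0
```

A simple arithmetic mean of the grades would give (2.0 + 4.0 + 3.0) / 3 ≈ 3.0 here only by coincidence; with unequal lengths the two answers generally differ, which is why length weighting matters.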

A histogram showing the sample interval lengths helps identify the most common sampling interval. In the
example above, we can see that 1m and 4m are the dominant sample lengths. Note that in this case, if the
4m samples are reduced to the size of the majority (1m), the grade continuity will be artificially
amplified by the repeated values from each 4m sample being broken down into numerous smaller 1m
samples. This is why, in many cases, compositing should be avoided or tailored to make large samples
(4m in this case). However, using large composites instead of small ones can reduce the number of data
points significantly, which can in turn degrade the quality of estimation where grade changes rapidly,
i.e., cause oversmoothing.


A scatter plot showing the relationship between length and grade is also useful. In the example above, we
can see that the 1m sample lengths carry ore-grade values, whereas the 4m lengths most probably occur in
areas of waste.
Read the Compositing manual for more information about different compositing methods.


Basic Statistics
Overview
One of the important preliminary steps in performing a geostatistical evaluation is to understand the
statistical properties of the data. You will learn about:

descriptive statistics
histograms, cumulative frequency plots and probability curves
bimodal distributions

outliers.

Descriptive statistics
We use descriptive statistics to answer questions such as:

What are the typical values?
How variable is the data?
How skewed is our data - will extreme values bias our data?
How similar/different is each domain?

There are two types of statistics used to describe a set of data:

Measures of what is typical
Measures of difference

What is typical?
Measures of what is typical include:

Mean - the sum of the sample values divided by the number of samples.
Median - the middle value when the samples are sorted in order.
Mode - the most frequent sample value.

How are they different?
Measures of difference include:

Range - the difference between the maximum and minimum sample values.
Inter-quartile range - the difference between the values bounding the middle half of the data
(the 75th and 25th percentiles) when the data is sorted in grade order.
Variance - the average squared difference between each sample and the mean grade:
variance = sum of (each sample value - mean)^2 / (number of samples - 1)
Standard deviation - the square root of the variance (this brings the number back into
grade units).
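The measures above can be computed directly. A minimal sketch in Python, using an illustrative set of gold grades (the values are hypothetical, chosen only to show the calculations):

```python
import statistics

# Hypothetical gold grades (g/t) for one domain
grades = [1.2, 1.5, 1.5, 1.8, 2.0, 2.3, 2.3, 2.3, 3.1, 9.4]

mean = sum(grades) / len(grades)
median = statistics.median(grades)       # middle value of the sorted data
mode = statistics.mode(grades)           # most frequent value
value_range = max(grades) - min(grades)  # maximum minus minimum

# Sample variance: sum of squared deviations from the mean,
# divided by (number of samples - 1)
variance = sum((g - mean) ** 2 for g in grades) / (len(grades) - 1)
std_dev = variance ** 0.5                # back into grade units

print(mean, median, mode, value_range, std_dev)
```

Note how the single high value (9.4) pulls the mean (2.74) well above the median (2.15); this gap between mean and median is a quick indicator of skewed data.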


Histograms
A histogram is a graphical representation of the distribution of values: it shows what proportion of
cases fall into each of several non-overlapping intervals of some variable.
For example, a distribution of gold grades could be represented by the following table:

Gold (g/t)    Number of samples (frequency)
0.0 - 0.5     0
0.5 - 1.0     40
1.0 - 1.5     58
1.5 - 2.0     82
2.0 - 2.5     40
2.5 - 3.0     29
3.0 - 3.5     18
3.5 - 4.0     10
4.0 - 4.5     12
4.5 - 5.0     5
5.5 - 6.0     5
6.0 - 6.5     5
6.5 - 7.0     5
7.0 - 7.5     8
7.5 - 8.0     5

This same data can be displayed in a histogram as shown:
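The grouping behind a table like this can also be reproduced in code. A minimal sketch that bins a set of hypothetical grades into 0.5 g/t classes and counts the frequency in each (values at exact bin edges can be sensitive to floating-point rounding, so this is illustrative only):

```python
def histogram_counts(values, bin_width=0.5):
    """Count how many values fall into each non-overlapping interval.

    Returns a dict mapping each interval's lower edge to its frequency.
    """
    counts = {}
    for v in values:
        lower = (v // bin_width) * bin_width  # floor to the lower bin edge
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical gold grades (g/t)
grades = [0.6, 0.7, 1.1, 1.2, 1.3, 1.6, 1.7, 1.8, 1.9, 2.2, 2.6, 3.4]
print(histogram_counts(grades))
```

Plotting the counts as bars over the intervals gives the histogram itself; the counting step above is the statistical content.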

Cumulative Probability Plots


The cumulative probability plot is a summary of the proportion of samples that occur below each grade.

Grades associated with the steep part of the curve are the more frequent grades.
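The cumulative curve is built from the proportion of samples at or below each grade. A minimal sketch of that calculation, with hypothetical grades:

```python
# Hypothetical gold grades (g/t), sorted
grades = sorted([0.8, 1.1, 1.4, 1.4, 1.9, 2.3, 3.0, 5.6])

def proportion_below(values, grade):
    """Fraction of samples less than or equal to the given grade."""
    return sum(1 for v in values if v <= grade) / len(values)

print(proportion_below(grades, 1.4))  # → 0.5
print(proportion_below(grades, 3.0))  # → 0.875
```

Evaluating this proportion at every sample grade and plotting the results gives the cumulative probability curve; the curve rises steeply wherever many samples share similar grades.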


Probability Plots
This is a cumulative probability plot with the axis adjusted.

If the plot presents as a straight line, then the data set reflects a normal distribution. If the grade scale is
converted to a log scale, a straight line indicates a log-normal population. With the log scale making the
line straight, it is possible to distinguish various domains by a change of slope. This is generally easier to
see on a probability plot than on a histogram, so the probability plot may be more useful for selecting
indicators for analysis. Inflection points can represent different data populations.

Bimodal Distributions
Two characteristics which can potentially reduce the quality of your estimations are bimodalism and
outliers.
The mode is the most commonly occurring value in a data set. For example, in the following data set,
the number 8 is the mode:
1 3 5 5 8 8 8 9
Bimodal means that there are two most common values which are not adjacent to one
another. In the following data set, the numbers 2 and 8 are equally common, and the distribution is said
to be bimodal:
1 2 2 2 3 5 5 8 8 8 9
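The modes of a small data set like the one above can be found with a simple frequency count. A minimal sketch:

```python
from collections import Counter

data = [1, 2, 2, 2, 3, 5, 5, 8, 8, 8, 9]

counts = Counter(data)                 # frequency of each value
top_frequency = max(counts.values())   # how often the most common value occurs
modes = sorted(v for v, c in counts.items() if c == top_frequency)

print(modes)  # → [2, 8]: two equally common, non-adjacent values, i.e. bimodal
```

Two (or more) values tying for the top frequency is the tabular counterpart of the two humps you would see on the histogram.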
Imagine that you are studying the average specific gravity, or density of rocks in a coal deposit. A
histogram of all rock samples might look like this:


Any histogram which displays two humps, as in the example above, is said to be bimodal. The bimodal
distribution in the example above can be explained by the fact that the data set is composed of coal
samples as well as intervening sandstone and mudstone bands. The specific gravity values between 1
and 2 are representative of the coal, while specific gravity values between 2 and 3 represent the
intervening rock.
Often the source of a bimodal distribution can be two domains being mixed into a single data set. In
order to minimise estimation errors, you should make every attempt to separate any data set which has a
bimodal distribution. In the example above, merely segregating the data based on rock type would result
in two separate normal distributions.


Outliers
Overview
Outliers are data values that are much higher (or much lower) than most data values in a single domain.
For a number of reasons, you should either "cut" them down (or up) to some value or remove them.
You will learn about:

outliers and topcuts

methods of determining a topcut value

Outliers and Topcuts


An outlier is a statistical term for a value which is significantly different from the majority of all other
values in the data set. For example, in the following data set, the number 236 would be considered to be
an outlier:
1 3 5 5 8 8 8 236
Outliers can cause "noisy" experimental variograms, which are difficult to model. Additionally, if used in
an estimation, outliers can produce unrealistic results. One technique used to reduce the impact of
outliers is to apply a cutoff, or "topcut", to them. In the example above, the value of 236 could be cut,
or changed to a value of 9:
1 3 5 5 8 8 8 9
Another alternative is to remove the outlier value(s).
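Applying a topcut is simply capping values at the chosen cutoff. A minimal sketch using the example above:

```python
def apply_topcut(values, topcut):
    """Cut any value above the topcut down to the topcut itself."""
    return [min(v, topcut) for v in values]

samples = [1, 3, 5, 5, 8, 8, 8, 236]
print(apply_topcut(samples, 9))  # → [1, 3, 5, 5, 8, 8, 8, 9]
```

Removing the outlier instead would be a filter, `[v for v in samples if v <= topcut]`, which also reduces the number of samples; capping keeps the sample count unchanged.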

Another point of view on topcuts!

Topcutting samples could be seen as saying those samples are no good and should be thrown out. That is
a rather strong statement, and it must be justified by geological evidence. Geological interpretation is
not a game with numbers. If outliers are anomalous values, then, since ore deposits are all anomalies, all
ore deposits are outliers.
One of the problems with altering an outlier's value to make it more "normal" is that it makes the data
false: the grade will look more continuous and smooth than it really is. Hence an estimation based
on it will not be exact, and perhaps less correct. Kriging, for instance, automatically considers the grade
variation of each set of data per block and corrects for outliers by giving them a proper weight. If a block
is surrounded by extremely high values, perhaps it really is a high-value block of ore. Only kriging can
process outliers in this way. The point is to have many values, so that you can understand what a normal
value is at the block scale or at the larger domain scale.

Methods of Determining a Topcut Value


Some methods used to determine a topcut value use concepts such as:

Histogram

Confidence interval
Percentile
Experience

Histogram
The point at which the cumulative frequency curve "flattens out" can be used as the cutoff. In the case
below, the curve appears to be fairly flat at a value of 25.
A histogram can also help visually substantiate the choice of a topcut value obtained by other means.


Confidence interval
A confidence interval is an estimated range of values which is likely to include a given percentage of the
data values, assuming that the data is normally distributed. The calculation for the upper limit of a 95%
confidence interval (CI) is:
95% CI = mean + (1.96 * standard deviation)
For example, if a data set has mean = 6.49 and standard deviation = 9.30:
95% CI = 6.49 + (1.96 * 9.30)
95% CI = 24.718
For simplicity, you could choose to use the nearest integer value of 25 as your cutoff.
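The same calculation, sketched in Python with the values from the example:

```python
mean = 6.49
std_dev = 9.30

# Upper limit of the 95% confidence interval, assuming normally distributed data
upper_95 = mean + 1.96 * std_dev
print(round(upper_95, 3))  # → 24.718
```

In practice the mean and standard deviation would come from the domain's descriptive statistics rather than being typed in by hand.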

Percentile
A percentile is the data value below which a given percentage of all other data values fall. Any given
percentile could be selected as the outlier cutoff, such as the 90th, 95th, or 99th percentile. For
example, you could choose one of the following (from the previous Basic Statistics report):
90.0 Percentile: 22.5
95.0 Percentile: 27.2
97.5 Percentile: 38.8
99.0 Percentile: 42.4
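A percentile can be read from the sorted data. This sketch uses a simple nearest-rank definition on hypothetical grades; note that software such as GEMS may use a different interpolation method, so the numbers here are illustrative:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct percent
    of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted data
    return ordered[max(rank, 1) - 1]

grades = list(range(1, 101))   # hypothetical grades 1..100
print(percentile(grades, 90))  # → 90
print(percentile(grades, 99))  # → 99
```

Any of these percentile values could then be used directly as the topcut passed to a capping step.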

Experience
Topcut values are often chosen based on knowledge of a deposit.
If part of an ore zone has been mined, information from grade control samples and reconciliation studies
may provide a good idea of what the maximum mined block value will be. Comparing sampling statistics
to production results is another method for choosing a topcut: one can lower the topcut value until the
model estimation matches the production grade. Again, this is empirical rather than strictly scientific. As
mine production goes through zones of different-quality ore, the process will have to be repeated.
If the deposit has not yet been mined, information from similar deposits may be useful in determining the
outlier cutoff.
