Imperial College Press
ICP
Michael Thompson • Philip J Lowthian
Birkbeck University of London, UK
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
Published by
Imperial College Press
57 Shelton Street
Covent Garden
London WC2H 9HE
Distributed by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
Printed in Singapore.
For photocopying of material in this volume, please pay a copying fee through the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to
photocopy is not required from the publisher.
ISBN-13 978-1-84816-616-5
ISBN-10 1-84816-616-8
ISBN-13 978-1-84816-617-2 (pbk)
ISBN-10 1-84816-617-6 (pbk)
Typeset by Stallion Press
Email: enquiries@stallionpress.com
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the Publisher.
Copyright © 2011 by Imperial College Press
NOTES ON STATISTICS AND DATA QUALITY FOR ANALYTICAL CHEMISTS
December 23, 2010 13:32 9in x 6in Notes on Statistics and Data Quality for Analytical Chemists b1050fm
Preface
This book is based on the experience of teaching statistics for many years to
analytical chemists and the specific difficulties that they encounter. Many
analytical chemists find statistics difficult and burdensome. That is a
perception that we hope to dispel. Statistics is straightforward and it adds a
fascinating extra dimension to the science of chemical measurement. In fact
it is hardly possible to conceive of a measurement science such as analytical
chemistry that does not have statistics as both its conceptual foundation
and its everyday tool. Measurement results necessarily have uncertainty and
statistics shows us how to make valid inferences in the face of this
uncertainty. But, over the years, it has become apparent to us that statistics is
much more interesting when it makes full use of the computer revolution.

It would be hard to overstate the effect that easily available computing
has had on the practice of statistics. It is now possible to undertake,
often in milliseconds and with perfect accuracy, calculations that previously
would have been impracticably long-winded and error-prone. A simple
example is the calculation of the probabilities associated with density
functions. Moreover, we can now produce in seconds several accurate graphical
images of our datasets and select the most informative. These capabilities
have transformed the applicability of both standard statistical methods
and more recent computer-intensive methods. Textbooks for the most part
have not caught up with this revolution and, to an unnecessary degree,
are still stressing pencil-and-paper methods of calculation. Of course, a
small number of pencil-and-paper calculations on some elementary examples
can assist learning, but they are too prone to mistakes for ‘real-life’
application. Many textbooks place a heavy stress on the mathematical basis of
statistics. We regard this as inappropriate in an applied text. Analytical
chemists do not need too many details of statistical theory, so we have kept
these to a minimum. Drivers don’t need to know exactly how every part of
a car works in order to drive competently.
With ease of computation, there is, of course, a concomitant danger that
people are tempted to use one of the many excellent computer statistics
packages (or perhaps one of the not-so-excellent ones) without
understanding what the output means or whether an appropriate method has
been used. Analytical chemists have to guard against that serious
shortcoming by exercising a proper scientific attitude. There are several ways
of developing that faculty in relation to statistics. A key practice is the
habitual careful consideration of the data before any statistics is
undertaken. A visual appraisal of a graphical image is of paramount importance
here, and the book is profusely illustrated with them: almost every dataset
discussed or analysed is depicted. Another essential is developing an
understanding of the exact meaning of the results of statistical operations.
Finally, practitioners need the experience of both guided and unsupervised
consideration of many examples of relevant datasets. Drivers don’t need too
many details of how the car works, but they do need the Highway Code, a road
map, some driving lessons and as much practice as they can get.

The book is divided into quite short sections, each dealing with a single
topic, headed by a ‘Key points’ box. Most sections are terminated by ‘Notes
and further reading’ with useful references for those wishing to pursue topics
in more detail. The sections are as far as possible self-contained, but are
extensively cross-referenced. The book can therefore be used either in a
systematic way by reading the sections sequentially, or as a quick reference by
going directly to the topic of interest. Every statistical method and
application covered has at least one example where the results are analysed in
detail. This enables readers to emulate this analysis on their own examples.
The statistical results on these examples have been cross-checked by at least
two different statistics packages. All of the datasets used in examples are
available for download, so that readers can compare the output of their own
favourite statistical package with that shown in the book and thus verify
that they are entering data correctly.
Statistics is a huge subject, and a problem with writing a book such as
this is knowing where to stop. We have concentrated on providing a selection
of techniques and applications that will satisfy the needs of most
analytical chemists most of the time, and we make no apology for omitting any
mention of the numerous other interesting methods and applications.
Statisticians may be surprised at the relative emphasis placed on different topics.
We have simply used heavier weighting on the topics that experience has
told us that analytical chemists have most difficulty with. The book is cast
in two parts, a technique-based approach followed by an application-based
approach. This engenders a certain amount of overlap and duplication.
Analytical chemists are thereby encouraged to create their own overview
of the subject and see for themselves the relationship between tasks and
techniques.
Statisticians will also notice that we use the ‘p-value’ approach to
significance testing. This was adopted after careful consideration of the needs
of analytical chemists. It greatly improves the transparency of significance
testing so long as the exact meaning of the p-value is borne in mind, and we
stress that meaning repeatedly. The alternative approaches tend to cause
more difficulty. Experience has shown that analytical chemists find the
somewhat convoluted logic of using statistical tables confusing and hard
to remember. The confidence interval approach is simple but almost
universally misunderstood among non-statisticians. Both of these methods also
have the disadvantage of engendering the idea that the significance test can
validly dichotomise reality — once you have set a level of confidence the
test tells you ‘yes’ or ‘no’. This tempts practitioners to use statistics to
replace judgement rather than to assist it.
Finally, here are ten basic rules for analytical chemists undertaking a
statistical analysis.
1. If you can, plan the experiment or data collection before you start the
practical work. Make sure that it will have suﬃcient statistical power for
your needs. Ensure that the data collection is randomised appropriately,
so that any inferences drawn will be valid.
2. Make sure that you know how to enter the data correctly into your
statistical software. After you have entered it, print it out for checking.
3. Examine the data as one or more graphical displays. This will often
tell you all that you need to know. In addition it will tell you if your
dataset is unlikely to conform to the statistical model that underlies
the statistical test that you are proposing to use. Important features
to look out for are: suspect data; lack of ﬁt to linear calibrations; and
uneven variance in regression and analysis of variance.
4. Select the correct statistical test, e.g., one-tailed or two-tailed,
one-sample, two-sample or paired.
5. Make sure that you know exactly what the statistical output means,
especially the p-value associated with a test of significance.
6. Be careful how you express the outcome in words. Avoid attributing
probabilities to hypotheses (unless you are making a Bayesian
analysis — not within the scope of this book).
7. Report the magnitude of an effect as well as its significance.
Distinguish between effects that are statistically significant and those
that are practically important.
8. After a regression always make plots of the residuals against the
predictor variable. This will give you valuable information about lack of
fit and uneven variance. It is sometimes useful to make other plots of
the residuals, e.g., as a time series to detect drift in the measurement
process.
9. If in doubt, ask a statistician.
10. Have fun!
Michael Thompson & Philip J Lowthian
Birkbeck University of London, UK
May 2010
* * * * * *
Data ﬁles used in the book can be downloaded
from http://www.icpress.co.uk/chemistry/p739.html
Contents
Preface v
Part 1: Statistics 1
1. Preliminaries 3
1.1 Measurement Variation . . . . . . . . . . . . . . . . . . . 3
1.2 Conditions of Measurement and the Dispersion
(Spread) of Results . . . . . . . . . . . . . . . . . . . . . 4
1.3 Variation in Objects . . . . . . . . . . . . . . . . . . . . . 5
1.4 Data Displays . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Levels of Probability . . . . . . . . . . . . . . . . . . . . 7
1.7 An Example — Ethanol in Blood — a One-Tailed Test . 8
1.8 What Exactly Does the Probability Mean? . . . . . . . . 9
1.9 Another Example — Accuracy of a Nitrogen
Analyser — a Two-Tailed Test . . . . . . . . . . . . . . . 10
1.10 Null Hypotheses and Alternative Hypotheses . . . . . . . 12
1.11 Statements about Statistical Inferences . . . . . . . . . . 12
2. Thinking about Probabilities and Distributions 15
2.1 The Properties of the Normal Curve . . . . . . . . . . . . 15
2.2 Probabilities Relating to Means of n Results . . . . . . . 17
2.3 Probabilities from Data . . . . . . . . . . . . . . . . . . . 19
2.4 Probability and Statistical Inference . . . . . . . . . . . . 20
2.5 Pre-computer Statistics . . . . . . . . . . . . . . . . . . . 21
2.6 Conﬁdence Intervals . . . . . . . . . . . . . . . . . . . . . 22
3. Simple Tests of Signiﬁcance 25
3.1 One-Sample Test — Example 1: Mercury in Fish . . . . 25
3.2 One-Sample Test — Example 2: Alumina (Al2O3)
in Cement . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Comparing Two Independent Datasets — Method . . . . 28
3.4 Comparing Means of Two Datasets
with Equal Variances . . . . . . . . . . . . . . . . . . . . 30
3.5 The Variance Ratio Test or F-Test . . . . . . . . . . . . 31
3.6 Two-Sample Two-Tailed Test — An Example . . . . . . 32
3.7 Two-Sample One-Tailed Test — An Example . . . . . . 34
3.8 Paired Data . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.9 Paired Data — One-Tailed Test . . . . . . . . . . . . . . 38
3.10 Potential Problems with Paired Data . . . . . . . . . . . 40
4. Analysis of Variance (ANOVA) and Its Applications 43
4.1 Introduction — The Comparison of Several Means . . . . 43
4.2 The Calculations of One-Way ANOVA . . . . . . . . . . 45
4.3 Example Calculations with One-Way ANOVA . . . . . . 47
4.4 Applications of ANOVA: Example 1 — Catalysts
for the Kjeldahl Method . . . . . . . . . . . . . . . . . . 49
4.5 Applications of ANOVA: Example 2 —
Homogeneity Testing . . . . . . . . . . . . . . . . . . . . 51
4.6 ANOVA Application 3 — The Collaborative Trial . . . . 54
4.7 ANOVA Application 4 — Sampling and Analytical
Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.8 More Elaborate ANOVA — Nested Designs . . . . . . . 60
4.9 Two-Way ANOVA — Crossed Designs . . . . . . . . . . 63
4.10 Cochran Test . . . . . . . . . . . . . . . . . . . . . . . . 65
4.11 Ruggedness Tests . . . . . . . . . . . . . . . . . . . . . . 66
5. Regression and Calibration 71
5.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 How Regression Works . . . . . . . . . . . . . . . . . . . 73
5.3 Calibration: Example 1 . . . . . . . . . . . . . . . . . . . 76
5.4 The Use of Residuals . . . . . . . . . . . . . . . . . . . . 78
5.5 Suspect Patterns in Residuals . . . . . . . . . . . . . . . 80
5.6 Eﬀect of Outliers and Leverage Points
on Regression . . . . . . . . . . . . . . . . . . . . . . . . 82
5.7 Variances of the Regression Coeﬃcients: Testing
the Intercept and Slope for Signiﬁcance . . . . . . . . . . 83
5.8 Regression and ANOVA . . . . . . . . . . . . . . . . . . 85
5.9 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.10 A Statistically-Sound Test for Lack of Fit . . . . . . . . . 90
5.11 Example Data/Calibration for Manganese . . . . . . . . 91
5.12 A Regression Approach to Bias Between Methods . . . . 94
5.13 Comparison of Analytical Methods: Example . . . . . . . 97
6. Regression — More Complex Aspects 101
6.1 Evaluation Limits — How Precise is an Estimated
x-value? . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2 Reducing the Conﬁdence Interval Around
an Estimated Value of Concentration . . . . . . . . . . . 104
6.3 Polynomial Regression . . . . . . . . . . . . . . . . . . . 105
6.4 Polynomial Calibration — Example . . . . . . . . . . . . 108
6.5 Multiple Regression . . . . . . . . . . . . . . . . . . . . . 109
6.6 Multiple Regression — An Environmental Example . . . 111
6.7 Weighted Regression . . . . . . . . . . . . . . . . . . . . 116
6.8 Example of Weighted Regression — Calibration
for 239Pu by ICP–MS . . . . . . . . . . . . . . . . . . . 119
6.9 Nonlinear Regression . . . . . . . . . . . . . . . . . . . . 122
6.10 Example of Regression with Transformed Variables . . . 124
7. Additional Statistical Topics 127
7.1 Control Charts . . . . . . . . . . . . . . . . . . . . . . . . 127
7.2 Suspect Results and Outliers . . . . . . . . . . . . . . . . 130
7.3 Dixon’s Test for Outliers . . . . . . . . . . . . . . . . . . 132
7.4 The Grubbs Test . . . . . . . . . . . . . . . . . . . . . . 133
7.5 Robust Statistics — MAD Method . . . . . . . . . . . . 135
7.6 Robust Statistics — Huber’s H15 Method . . . . . . . . 137
7.7 Lognormal Distributions . . . . . . . . . . . . . . . . . . 139
7.8 Rounding . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.9 Nonparametric Statistics . . . . . . . . . . . . . . . . . . 142
7.10 Testing for Specific Distributions — the
Kolmogorov–Smirnov One-Sample Test . . . . . . . . . 144
7.11 Statistical Power and the Planning of Experiments . . . 146
Part 2: Data Quality in Analytical Measurement 151
8. Quality in Chemical Measurement 153
8.1 Quality — An Overview . . . . . . . . . . . . . . . . . . 153
8.2 Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . 154
8.3 Why Uncertainty is Important . . . . . . . . . . . . . . . 156
8.4 Estimating Uncertainty by Modelling
the Analytical System . . . . . . . . . . . . . . . . . . . . 158
8.5 The Propagation of Uncertainty . . . . . . . . . . . . . . 160
8.6 Estimating Uncertainty by Replication . . . . . . . . . . 162
8.7 Traceability . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.8 Fitness for Purpose . . . . . . . . . . . . . . . . . . . . . 165
9. Statistical Methods Involved in Validation 167
9.1 Precision of Analytical Methods . . . . . . . . . . . . . . 167
9.2 Experimental Conditions for Observing Precision . . . . 169
9.3 External Calibration . . . . . . . . . . . . . . . . . . . . 171
9.4 Example — A Complete Regression Analysis
of a Calibration . . . . . . . . . . . . . . . . . . . . . . . 172
9.5 Calibration by Standard Additions . . . . . . . . . . . . 175
9.6 Detection Limits . . . . . . . . . . . . . . . . . . . . . . . 177
9.7 Collaborative Trials — Overview . . . . . . . . . . . . . 180
9.8 The Collaborative Trial — Outlier Removal . . . . . . . 182
9.9 Collaborative Trials — Summarising the Results
as a Function of Concentration . . . . . . . . . . . . . . . 184
9.10 Comparing two Analytical Methods by Using
Paired Results . . . . . . . . . . . . . . . . . . . . . . . . 186
10. Internal Quality Control 189
10.1 Repeatability Precision and the Analytical Run . . . . . 189
10.2 Examples of WithinRun Quality Control . . . . . . . . . 192
10.3 Internal Quality Control (IQC) and Run-to-Run
Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
10.4 Setting Up a Control Chart . . . . . . . . . . . . . . . . 197
10.5 Internal Quality Control — Example . . . . . . . . . . . 198
10.6 Multivariate Internal Quality Control . . . . . . . . . . . 201
11. Proﬁciency Testing 205
11.1 Proﬁciency Tests — Purpose and Organisation . . . . . . 205
11.2 Scoring in Proﬁciency Tests . . . . . . . . . . . . . . . . 207
11.3 Setting the Value of the Assigned Value x_A
in Proficiency Tests . . . . . . . . . . . . . . . . . . . . . 208
11.4 Calculating a Participant Consensus . . . . . . . . . . . . 210
11.5 Setting the Value of the ‘Target Value’ σ_p
in Proficiency Tests . . . . . . . . . . . . . . . . . . . . . 215
11.6 Using Information from Proﬁciency Test Scores . . . . . 216
11.7 Occasional Use of Certiﬁed Reference Materials
in Quality Assurance . . . . . . . . . . . . . . . . . . . . 219
12. Sampling in Chemical Measurement 221
12.1 Traditional and Modern Approaches to Sampling
Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . 221
12.2 Sampling Uncertainty in Context . . . . . . . . . . . . . 223
12.3 Random and Systematic Sampling . . . . . . . . . . . . . 226
12.4 Random Replication of Sampling . . . . . . . . . . . . . 227
12.5 Sampling Bias . . . . . . . . . . . . . . . . . . . . . . . . 229
12.6 Sampling Precision . . . . . . . . . . . . . . . . . . . . . 231
12.7 Precision of the Estimated Value of σ_s . . . . . . . . . . 234
12.8 Quality Control of Sampling . . . . . . . . . . . . . . . . 236
Index 239
PART 1
Statistics
Chapter 1
Preliminaries
This chapter sets the scene for statistical thinking, covering variation
in measurement results and in the properties of objects, and their
graphical representation. The basis for statistical inference derived
from analytical data is associated with the probability of obtaining
the observed results under the assumption of appropriate hypotheses.
1.1 Measurement Variation
Key points
— Variation is inherent in results of measurements.
— We must avoid excessive rounding to draw valid conclusions about the
magnitude of variation.
The results of replicated measurements vary. If we measure the same thing
repeatedly, we get a diﬀerent result each time. For example, if we measured
the proportion of sodium in a ﬁnely powdered rock, we might get results
such as 2.335, 2.281, 2.327, 2.308, 2.311, 2.264, 2.299, 2.295 per cent mass
fraction (%). This variation is not the outcome of carelessness, but simply
caused by the uncontrolled variation in the activities that comprise the
measurement, which is often a complex multistage procedure in chemical
measurement. Sometimes it may appear that results of repeated measurements
are identical, but this is always a false impression, brought about
by a limited digit resolution available in the instruments used to make the
measurement or by excessive rounding by the person recording or reporting
the data. If the above data are rounded to two signiﬁcant ﬁgures, they all
turn into an identical 2.3%, which tells us nothing about the magnitude of
the variation. Excessive rounding of data must be avoided if we want to
draw valid inferences that depend on the variation (see §7.8 for guidance
on rounding).
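The point is easy to verify with the sodium results quoted above; a minimal sketch in Python:

```python
import statistics

# The eight replicate sodium results (% mass fraction) from the text.
results = [2.335, 2.281, 2.327, 2.308, 2.311, 2.264, 2.299, 2.295]

mean = statistics.mean(results)
sd = statistics.stdev(results)          # sample standard deviation
print(f"mean = {mean:.4f}, sd = {sd:.4f}")

# Rounding to two significant figures destroys all trace of the variation:
rounded = {round(r, 1) for r in results}
print("distinct values after rounding:", rounded)   # a single value, 2.3
```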
1.2 Conditions of Measurement and the Dispersion
(Spread) of Results
Key points
— The dispersion (spread) of results varies with the conditions of
measurement.
— The shape of the distribution of replicated results is characteristic,
with a single peak tailing away to zero on either side.
— We must remember the difference between repeatability and
reproducibility conditions.
The scale of variation in results depends on the conditions under which
the measurements are repeated. If the analysis of the rock powder were
repeated many times by the same analyst, with the same equipment and
reagents, in the same laboratory, within a short period of time (that is,
the conditions are kept constant as far as possible) we might see the results
represented in Fig. 1.2.1. These results were obtained under what is called
repeatability conditions. If, in contrast, each measurement on the same rock
powder is made by the same method in a diﬀerent laboratory (obviously
by a different analyst with different equipment and reagents and at a
different time) we observe a wider dispersion of results (Fig. 1.2.2). These
results were obtained in what we call reproducibility conditions. Notice
Fig. 1.2.1. Results from the analysis of a rock powder for sodium under
repeatability conditions.
Fig. 1.2.2. Results from the analysis of a rock powder for sodium under
reproducibility conditions.
the characteristic shape of these distributions, roughly symmetrical with a
single peak tailing away to zero on either side. There are other conditions
of measurement encountered by analytical chemists, but repeatability and
reproducibility are the most important.
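The two conditions can be mimicked by a simple simulation. The sketch below assumes invented values for the true concentration and for the within- and between-laboratory standard deviations; it merely illustrates why reproducibility conditions give the wider spread.

```python
import random
import statistics

random.seed(1)  # reproducible illustration

TRUE_VALUE = 2.30   # % sodium; all numbers here are invented for illustration
SD_REPEAT = 0.02    # within-laboratory (repeatability) standard deviation
SD_LAB = 0.05       # between-laboratory component

# Repeatability conditions: one laboratory, many repeats.
one_lab = [TRUE_VALUE + random.gauss(0, SD_REPEAT) for _ in range(200)]

# Reproducibility conditions: each result from a different laboratory,
# so each carries its own laboratory bias as well as repeatability error.
many_labs = [TRUE_VALUE + random.gauss(0, SD_LAB) + random.gauss(0, SD_REPEAT)
             for _ in range(200)]

print(f"repeatability sd   = {statistics.stdev(one_lab):.4f}")
print(f"reproducibility sd = {statistics.stdev(many_labs):.4f}")
```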
1.3 Variation in Objects
Key points
— We must distinguish between the two sources of variation (between
objects and between measurement results on a single object).
— Variation among objects often gives rise to asymmetric distributions.
If we measure a quantity (such as a concentration) in many diﬀerent objects
in a speciﬁc category, we obtain a dispersion of results, but this is largely
because the objects really do diﬀer. Distinguishing between diﬀerent objects
is one of the reasons why we make measurements. Figure 1.3.1 shows the
concentrations of copper measured in samples of sediment from nearly every
square mile of England and Wales. As the concentrations displayed are
actually the results of measurements, some of the variation (but only a small
part) must derive from the measurement process. Note that this distribution
is far from symmetrical — it has a strong positive skew. This skew is often
observed in collections of data from natural objects.
Fig. 1.3.1. Concentrations of copper in 49,300 stream sediments from England
and Wales. The distribution has a strong positive skew.
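Such positive skew is often modelled with a lognormal distribution (see §7.7). A small simulation, with invented parameters not fitted to the copper data, shows one hallmark of positive skew: the long upper tail drags the mean above the median.

```python
import random
import statistics

random.seed(7)

# Concentrations in natural objects are often roughly lognormal
# (the parameters below are invented, not fitted to the copper data).
sample = [random.lognormvariate(3.3, 0.5) for _ in range(5000)]

mean = statistics.mean(sample)
median = statistics.median(sample)
# In a positively skewed distribution the mean exceeds the median.
print(f"mean = {mean:.1f}, median = {median:.1f}")
```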
1.4 Data Displays
Key points
— Always look at a graphical display of your data before you carry out
any statistics.
— Use visual appraisal to make a preliminary judgement about the
question you are asking and to select appropriate statistical techniques.
Graphic representations of data, such as the histograms in §1.2, are essential
tools in handling variation. They should always be the ﬁrst resort for anyone
with data to interpret. An appropriate diagram, coupled with a certain
amount of experience, can tell us much of what we need to infer from data
without resort to statistical methods. Indeed, a diagram can nearly always
tell us which statistical techniques would suit our purpose and which of
them would lead us to an incorrect conclusion.
There are several ways of representing simple replicated data. A histogram
is a suitable tool to inspect the location (the central tendency) and
dispersion of data all of the same type, when the number of observations
is large, say 50 or more. With smaller amounts of data, histograms either
look unduly ragged or do not show the shape of distribution adequately.
In such circumstances the dotplot is often more helpful. (There is no exact
dividing line — we have to use our own judgement!) Figure 1.4.1 shows a
dotplot of the rock powder data from §1.1.
Fig. 1.4.1. Repeated results for the concentration of sodium in a rock powder, presented
as a dotplot.
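A statistics package would normally draw the dotplot, but the idea can be sketched in a few lines of text-mode Python using the sodium results from §1.1:

```python
# A crude text-mode dotplot of the sodium results from Section 1.1; each
# result is marked with an 'x' at its position on a scaled number line.
results = [2.335, 2.281, 2.327, 2.308, 2.311, 2.264, 2.299, 2.295]

lo, hi, width = 2.25, 2.35, 50
line = [" "] * (width + 1)
for r in results:
    line[round((r - lo) / (hi - lo) * width)] = "x"

print(f"{lo}|{''.join(line)}|{hi}")
```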
1.5 Statistics
Key points
— The main reason for using statistics is the estimation of summary
statistics to describe datasets in a compact way.
— Another important reason is to assign probabilities to events and assist
you in making judgements in the presence of uncertainty.
Statistics is the mathematical science of dealing with variation both in
objects and in measurement results. It provides a logical way of drawing
conclusions in the presence of uncertainty of measurement. It helps us in two
main ways. First, it enables us to summarise data concisely, an essential step
in seeing what data are telling us, especially important with large datasets.
For example, the data in Fig. 1.3.1 can be summarised by very simple
statistics by saying that 95% of the results fall between 15.74 and 45.36.
We have condensed information about 50,000 data into three numbers. Of
course this summary does not tell the whole story, but it tells us a lot
more than we could ﬁnd simply by looking at a list of 50,000 numbers. The
histogram Fig. 1.3.1 tells us much more again but is a summary speciﬁed
by about 60 numbers, namely the heights and boundaries of the histogram
bars.
The other way that statistics helps us is by allowing us to assign
probabilities to events. For example, it could tell us that the results obtained
in an experiment were very unlikely to be obtained if certain assumptions
were true. That in turn would allow us to infer that at least some of the
assumptions are very likely to be untrue.
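A summary like the one quoted for the copper data is just a pair of percentiles. A sketch, using the integers 1–100 as a stand-in dataset:

```python
import statistics

# Summarising a dataset by the interval containing the central 95% of
# results, as done for the copper data in the text. The data here are
# simply the integers 1..100, for illustration.
data = list(range(1, 101))

# quantiles(n=40) returns 39 cut points at 2.5%, 5%, ..., 97.5%.
cuts = statistics.quantiles(data, n=40, method="inclusive")
lower, upper = cuts[0], cuts[-1]
print(f"95% of the results lie between {lower} and {upper}")
```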
1.6 Levels of Probability
Key points
— Statistics tells us, as a probability, whether an event is likely or unlikely
under stated assumptions.
— Scientists normally accept a conﬁdence level of 95% as convincing
(‘statistically signiﬁcant’), that is, we would observe the event with
a probability of 0.95 under the assumptions, and fail to observe the
event with a probability of 0.05.
— We might need a higher conﬁdence level for certain applications.
Notice in the previous section that we are not dealing with absolutes such
as ‘true’ or ‘false’, but with qualiﬁed judgements such as ‘likely’ or ‘unlikely’
etc. This uncertainty is inherent in making deductions from measurements.
However, the level of probability that we accept as convincing varies both
with the person making the judgement and the area of application of the
result. For many scientific purposes we can accept 95% confidence.
Typically we conduct an experiment to test some assumptions. Imagine that
we could repeat such an experiment many times. If the results obtained
in these experiments supported the assumptions only about half of the
time, we wouldn’t be convinced about the validity of those assumptions.
If the experiment supported the assumptions 99 times out of a hundred,
we would almost certainly be convinced. Somewhere between is a dividing
line between ‘convinced’ and ‘not convinced’. For many practical purposes
that level is one time in twenty repeats (i.e., 95% conﬁdence). However, we
might want a much higher level of conﬁdence under certain circumstances,
in forensic science for example. To secure a prosecution based on an
analytical result, such as the concentration of ethanol in a blood sample, we
would want a very high level of conﬁdence. We would not accept a situation
where our result gave rise to the wrong conclusion one time in twenty. We
will see later (§2.3) how these probabilities are estimated.
1.7 An Example — Ethanol in Blood — a One-Tailed Test
Key points
— If we are concerned about whether results are significantly greater than
a legal or contractual limit, we calculate ‘one-tailed’ probabilities.
— One-tailed probabilities would apply also to other instances where
there was an interest in results falling below such a limit.
Suppose that we had repeated measurement results such as shown in
Fig. 1.7.1. A sample of blood is analysed four times by a forensic scientist
and the results compared with the legal maximum limit for driving
of 80 mg ethanol per 100 ml of blood. The mean of the results is above
the limit, but the variation among the individual results raises the
possibility that the mean is above the limit only by chance. We need to
estimate how large or small that probability is. Using methods based on
standard statistical assumptions to be explained in §1.8, we find that the
probability of obtaining that particular mean result if the blood sample
contained exactly 80 mg/100 ml is 0.0005. This low probability means that it
is very unlikely that such a high mean result could be obtained if the true
concentration were 80 mg/100 ml. (An even lower probability would apply
if the true concentration were lower than the limit.) As the probability
is very low, we infer that the sample contained a level of ethanol higher
than the limit, probably high enough in this instance to support a legal
prosecution.
Notice that, if we repeated the calculation for a different set of results
that were closer to the limit (Fig. 1.7.2), we would obtain a higher
probability estimate of 0.02. We would not, in that case, use the data to support
December 23, 2010 13:33 9in x 6in Notes on Statistics and Data Quality for Analytical Chemists b1050ch01
Preliminaries 9
Fig. 1.7.1. Results for the determination of ethanol in a sample of blood.
Fig. 1.7.2. Results for the determination of ethanol in a different sample of blood.
a prosecution, even though the suspect was probably over the limit. The
reason is that the probability of getting those results if the suspect were
innocent would be unacceptably high.
Notice also that we are interested here only in probabilities of data
falling above a limit: this is called a one-tailed probability. (In other
circumstances we might be interested only in probabilities of data falling
below a limit. An example might be testing a dietary supplement for a
guaranteed minimum level of a vitamin. That would also entail calculating
one-tailed probabilities.)
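A one-tailed test of this kind is easy to reproduce by computer. The sketch below uses Python with the scipy library (neither is used in the book), and the four blood-ethanol values are purely illustrative, since the section does not list the individual results:

```python
from scipy import stats

# Illustrative replicate results (mg ethanol per 100 ml blood) -- the
# chapter does not give the individual values, so these are made up.
results = [86.0, 84.0, 83.0, 85.0]
legal_limit = 80.0

# One-tailed one-sample t-test: H0: mu = 80, HA: mu > 80.
t_stat, p_value = stats.ttest_1samp(results, popmean=legal_limit,
                                    alternative="greater")
print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.4f}")
```

A very small p-value here would, as in the text, be taken as evidence that the true concentration exceeds the limit.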
1.8 What Exactly Does the Probability Mean?
Key points
— The null hypothesis is an assumption that we make about the inﬁnite
number of possible results that we could obtain by replicating the
measurement under unchanged conditions.
— We can calculate from the data the probability of observing the actual
(or more extreme) results given the null hypothesis, but not the
probability of the null hypothesis given the results.
The probabilities calculated in §1.7 have a very speciﬁc meaning. It is
essential for the analyst to keep that meaning in mind when using statistics.
First, it is a value calculated under a number of assumptions. A crucial
assumption is that our results comprise a random sample from an inﬁnite
number of possible values. We further assume that the mean value µ of this
infinite set is exactly equal to the legal limit x_L. This latter assumption is
called the null hypothesis and, in this instance, is written H₀: µ = x_L.
Second, it is the probability of observing the ‘event’ (the results obtained
or more extreme results) under these assumptions, not the probability of
the null hypothesis being true given the results. In terms of the forensic
example, broadly speaking it is the probability of getting the results
assuming innocence, not the probability of innocence given the results.
There is a crucial difference between the two probabilities. They are
logically related, but can be rather different in value. The latter probability
can be calculated from the former, but only if we have some extra
information that cannot be derived from the data. (Using such extra information
gives rise to 'Bayesian statistics', which is beyond the scope of these
notes.)
1.9 Another Example — Accuracy of a Nitrogen Analyser — a Two-Tailed Test
Key point
— If we are concerned about whether results are significantly different
from a reference value (such as a true value or other reference value),
we need to calculate 'two-tailed' probabilities.
Suppose that we want to test the accuracy of an instrument that
automatically measures the nitrogen content of food (from which we can estimate
the protein content). We could do that by inserting into the instrument a
sample of pure glycine, an amino acid for which we can calculate the true
nitrogen content (x_true = 18.92%), and observing the result x. Because
results vary, we would probably want to repeat the experiment a number of
times and compare the mean result x̄ with x_true. Suppose that we obtain the
results: 18.95, 18.86, 18.74, 18.93, 19.00% nitrogen (shown in Fig. 1.9.1).
Fig. 1.9.1. Set of measurement results showing no signiﬁcant diﬀerence between the
true value and the mean.
The mean result is 18.90 (to four significant figures). We want to know
whether the absolute difference between the true value and the mean result,
|x̄ − x_true| = |18.90 − 18.92| = 0.02%, plausibly represents an inaccuracy
in the machine or is more probably due to the usual variation among the
individual results.¹ In other words, we are asking whether |x̄ − x_true| is
significantly greater than zero. In this case the null hypothesis is
H₀: µ = 18.92 or, more generally, H₀: µ = x_true. Under H₀ (and the other
assumptions) we calculate the probability of getting the observed value of
|x̄ − x_true| or a greater one (i.e., a mean result even further from x_true in
either direction). This particular probability has the value of p = 0.62. We
could expect a difference as large as (or greater than) |x̄ − x_true| as often
as six times in ten
repeat experiments if there were no inaccuracy in the instrument. As this
is a large probability, greater than 0.05 for example, we say that there is
no signiﬁcant diﬀerence, no compelling reason to believe that the machine
is inaccurate.
However, if the results had been as depicted in Fig. 1.9.2, the probability
would have been p = 0.033, indicating a quite unusual event under the null
hypothesis. We would have drawn the opposite inference, namely that there
was a real bias in the results.
In these examples, in contrast to that in §1.7, we are interested in
probabilities related to deviations from x_true in either direction, positive or
negative: we want to know whether the mean is significantly different from
the reference value x_true. This calls for the calculation of a two-tailed
probability. (Contrast this with §1.7 where we were concerned with whether
the mean was significantly greater than the reference value, i.e., a
one-tailed probability.)
Fig. 1.9.2. Set of measurement results showing a signiﬁcant diﬀerence between the true
value and the mean.
¹ The notation |a| signifies the absolute value of a, the magnitude of the number without
regard to its sign, so that |−3| = |3| = 3.
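The probability quoted above can be reproduced in a few lines of Python. The sketch below uses the scipy library, which is not part of the book, applied to the five glycine results of this section:

```python
from scipy import stats

# Replicate results for nitrogen in glycine (% N), from Section 1.9.
results = [18.95, 18.86, 18.74, 18.93, 19.00]
x_true = 18.92  # calculated true nitrogen content of glycine

# Two-tailed one-sample t-test: H0: mu = 18.92, HA: mu != 18.92.
t_stat, p_value = stats.ttest_1samp(results, popmean=x_true)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.2f}")  # p comes out near 0.62
```

The large p-value confirms the conclusion in the text: no significant difference between the mean and the true value.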
1.10 Null Hypotheses and Alternative Hypotheses
Key points
— The alternative hypothesis for a one-tailed test is H_A: µ > x_L or
H_A: µ < x_L.
— The alternative hypothesis for a two-tailed test is H_A: µ ≠ x_true.
We have seen (§1.9) that calculating probabilities from results depends
on the specification of a null hypothesis H₀. To distinguish between
one-tailed and two-tailed situations and to allow the calculation of the correct
probability we have to invoke some extra information, called the 'alternative
hypothesis', which is designated H_A (or H₁ in some texts).
For a one-tailed probability the null hypothesis is H₀: µ = x_L for a limit
value x_L. For the alternative hypothesis, we would use either H_A: µ > x_L
or H_A: µ < x_L, depending respectively on whether x_L was an upper limit
for the quantity of interest or a lower limit. For a two-tailed probability, the
null hypothesis is H₀: µ = x_true for a reference value x_true. The alternative
hypothesis is then H_A: µ ≠ x_true.
The role of the alternative hypothesis is to remind us of what we are
trying to establish when we are undertaking statistical calculations. It is
also what we accept by default if the evidence is such as to reject the null
hypothesis, that is, to ﬁnd that the outcome is statistically signiﬁcant.
1.11 Statements about Statistical Inferences
Key point
— We have to be very careful in our choice of words to avoid misleading
statements about statistical inference.
Having settled on a level of probability that we designate as convincing for
the particular inference that we wish to make (a critical level, p_crit), and
then having estimated the probability p associated with our data under
H₀ and H_A, we may wish to express the finding in words. Acceptable and
unacceptable forms of words are tabulated below. They should be qualified
by referencing p_crit in the form 100(1 − p_crit)%, so if p_crit = 0.05 we would
say 'at the 95% confidence level'.
When p ≥ p_crit:
— We can say: We cannot reject the null hypothesis. There are no grounds
for considering an alternative hypothesis.
— We might say: We accept the null hypothesis.
— We can say [for two-tailed probabilities]: We find no significant difference
between the mean result and the reference value.
— We can say [for one-tailed probabilities]: (i) We do not find the mean
result to be significantly greater than the reference value; or (ii) We do
not find the mean result to be significantly less than the reference value.
— We cannot say: The null hypothesis is true [because statistical inference
is probabilistic: there is a small chance that the null hypothesis is false].
— We cannot say: The null hypothesis is true with a probability p [because
we need additional information to estimate probabilities of hypotheses].

When p < p_crit:
— We can say: We can reject the null hypothesis and consider the
alternative.
— We can say [for two-tailed probabilities]: We find a significant difference
between the mean value and the reference value.
— We can say [for one-tailed probabilities]: (i) We find the mean result to
be significantly greater than the reference value; or (ii) We find the mean
result to be significantly less than the reference value.
— We cannot say: The null hypothesis is false [because statistical inference
is probabilistic: there is a small chance that the null hypothesis is true].
— We cannot say: The null hypothesis is false with a probability (1 − p)
[because we need additional information to make inferences about
hypotheses].
We should note that it is misleading to regard 95% confidence as a
kind of absolute dividing line between significance and non-significance. A
confidence level of 90% would be convincing in many situations or, at least,
suggest that further experimentation would be profitable.
Chapter 2
Thinking about Probabilities
and Distributions
This chapter covers the estimation of probabilities relating to data
from the assumption of the normal distribution of analytical error.
Various approaches are covered but the main thrust is the use of
the p-value to determine how likely the data are under the various
assumptions.
2.1 The Properties of the Normal Curve
Key points
— Probabilities of random results falling into various regions of the
normal distribution are determined by the values of µ and σ.
— To apply the normal model to estimating probabilities, we have to
assume that our data comprise a random sample from the inﬁnite
population represented by the normal curve. Those may or may not
be reasonable assumptions.
— Real repeated datasets encountered in analytical chemistry seldom
resemble the smooth normal curve, but often provide ragged
histograms.
We can estimate the probabilities involved in statistical inference in several
quite diﬀerent ways, but most often by reference to a mathematical model
of the variation. The model most widely applicable in analytical chemistry
is the normal distribution, which has a probability density (height of the
curve y for a given value of x) given by
y = exp(−(x − µ)²/(2σ²)) / (σ√(2π))   (2.1.1)
and a unit area. The appearance of the normal distribution depends on
the values of the parameters, µ (mu) and σ (sigma). µ is called the mean
of the function, and σ the standard deviation; σ² is called the variance.
The shape of the normal curve is shown in Fig. 2.1.1. It is symmetrical
about µ where the highest density lies, and falls to nearzero density at
distances outside the range µ ±3σ.
A key feature of a density function such as Eq. (2.1.1) is that the area
deﬁned by any two values of x represents the probability of a randomly
distributed variable falling in that range. In the normal curve (Figs. 2.1.2–
2.1.4), we find that about 2/3 of results fall within the range µ ± σ. Close
to 19/20 results (or 95%) fall within the range µ ± 2σ, while about 99.7%
fall within µ ± 3σ. Exact limits for 95% probability are µ ± 1.96σ. If we are
interested in one-tailed probabilities, we note that 95% of results fall below
µ + 1.645σ (Fig. 2.1.4) and 95% fall above µ − 1.645σ.
Analysts should commit these ranges and probabilities to memory.
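These coverage figures are easy to verify numerically. A short sketch using scipy's normal-distribution functions (scipy is not used in the book) checks the areas and the exact 95% points:

```python
from scipy.stats import norm

# Fraction of a normal population within mu +/- k*sigma is
# cdf(k) - cdf(-k); exact percentage points come from the inverse cdf (ppf).
print(norm.cdf(1) - norm.cdf(-1))   # ~0.683, about 2/3
print(norm.cdf(2) - norm.cdf(-2))   # ~0.954, close to 19/20
print(norm.cdf(3) - norm.cdf(-3))   # ~0.997
print(norm.ppf(0.975))              # ~1.96, the exact two-tailed 95% limit
print(norm.ppf(0.95))               # ~1.645, the one-tailed 95% point
```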
Fig. 2.1.1. The normal curve.
Fig. 2.1.2. The normal curve. The region within the range µ ± σ (unshaded) occupies about 2/3 of the total area.
Fig. 2.1.3. The normal curve. The region within the range µ ± 2σ (unshaded) occupies about 95% of the total area.
Fig. 2.1.4. The normal curve. The region below µ + 1.645σ (unshaded) occupies 95% of the total area.

The normal curve is widely used in statistics, partly because of theory
and partly because replicated results often approximate to it. The Central
Limit Theorem shows that the combination of numerous small independent
errors tends to give rise to such a curve. This combination of errors is
exactly what we would expect in analytical operations, which comprise a
lengthy succession of separate actions by the analyst, each prone to its own
variation. In practice, repeated analytical data usually take a form that
is indistinguishable from a random selection from the normal curve. Of
course, histograms of real datasets are not smooth like the normal curve.
Histograms are 'steppy' if representing a large dataset (e.g., Fig. 1.3.1) but
tend to be ragged for the size of dataset usually encountered by analytical
chemists. Genuine random selections from a normal distribution, even quite
large samples, seldom closely resemble the parent curve. Figure 2.1.5 shows
six such selections.

Fig. 2.1.5. Six random samples of 100 results from a normal distribution with mean 10
and standard deviation 1.
2.2 Probabilities Relating to Means of n Results
Key points
— σ/√n is called the 'standard error of the mean'. 'Standard error'
simply means the standard deviation of a statistic (such as a mean)
as opposed to that of an individual result.
— Means of even a small number of results tend to be close to normally
distributed even if the original results are not.
We are often interested in probabilities relating to the mean of two or more
results. Statistical theory tells us that the mean x̄ of n random results from
the normal curve with mean µ and standard deviation σ is also normally
distributed, with a mean of µ but a standard deviation of σ/√n. Even if
the parent distribution of the individual results differs from a normal
distribution, the means will be much closer to normally distributed, especially
for higher n. Again, this is an outcome of the Central Limit Theorem. The
term σ/√n is known as 'the standard error of the mean'.
We can now apply the previously established properties of the normal
curve to means. For example, the probability of an observed mean x̄ falling
above µ + 1.645σ/√n is 0.05. This is a one-tailed probability (§1.7) for an
upper limit. In statistical notation¹ we have

Pr[x̄ − µ > 1.645σ/√n] = Pr[(x̄ − µ)/(σ/√n) > 1.645] = 0.05.
Likewise, for a lower limit, we have
Pr[x̄ − µ < −1.645σ/√n] = Pr[(x̄ − µ)/(σ/√n) < −1.645] = 0.05.
For a probability p other than 0.05 we simply need to replace 1.645 by the
appropriate coverage factor k derived from the normal distribution, to give
the general formulae

Pr[(x̄ − µ)/(σ/√n) > k] = p   (2.2.1)

and

Pr[(x̄ − µ)/(σ/√n) < −k] = p.   (2.2.2)
For the two-tailed case, the probability of a mean falling outside the range
µ ± 1.96σ/√n is 0.05. From this we can deduce that

Pr[|x̄ − µ|/(σ/√n) > 1.96] = 0.05,

or, in general,

Pr[|x̄ − µ|/(σ/√n) > k] = p.   (2.2.3)

¹ The notation Pr[ ] signifies the probability of whatever expression is within the square
brackets.
However, we must remember that the relationship between p and k is
different in one-tailed and two-tailed probabilities.
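The standard error of the mean is easy to demonstrate by simulation. The sketch below, using numpy (not part of the book), draws many sets of n = 4 results from a normal distribution and shows that the means scatter with standard deviation close to σ/√n = 0.5:

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw 20000 sets of n = 4 results from a normal distribution with
# mu = 10, sigma = 1, and take the mean of each set.
n = 4
means = rng.normal(10, 1, size=(20000, n)).mean(axis=1)

# The standard deviation of the means approaches sigma/sqrt(n) = 0.5.
print(means.std())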
2.3 Probabilities from Data
Key points
— Equations deﬁning normal probabilities have to be modiﬁed if you are
using standard deviations s estimated from a small number (less than
about 50) of data instead of the population value σ.
— The variable t (or ‘Student’s t’) replaces the coverage factor k in these
equations.
— The value of t depends on the probability required and the number of
results used to calculate the mean.
To estimate probabilities from data, ﬁrst we have to ﬁnd which particular
normal distribution (deﬁned by µ and σ) is the best representation of our
data x₁, x₂, . . . , xᵢ, . . . , xₙ. We do that by calculating the corresponding
'sample statistics' x̄ and s. The 'sample mean' x̄ is the ordinary arithmetic
mean given by

x̄ = (1/n) Σ xᵢ   (summing over i = 1, . . . , n),

while the 'sample standard deviation' s is

s = √[ Σ (xᵢ − x̄)² / (n − 1) ].
The statistics x̄ and s are estimates of the unknown parameters µ and
σ, and (usually) approach them more closely as n increases. We must
remember that x̄ and s are variables, in the sense that if a series of
measurements were repeated the resultant values of x̄ and s would be different
each time. As an outcome, they cannot be used directly to substitute for
the parameters µ and σ in the probabilities given in Eqs. (2.2.1)–(2.2.3).
Instead we have to use modified equations, in which we substitute a variable
t (also called 'Student's t') for the normal coverage factor k, giving

Pr[x̄ − ts/√n < µ < x̄ + ts/√n] = 1 − p,   (2.3.1)
from which we obtain for two-tailed probabilities

Pr[|x̄ − µ|/(s/√n) > t] = p.   (2.3.2)

In some contexts we assume a null hypothesis H₀: µ = x_ref and, in
such an instance, we can substitute a reference value x_ref for µ. This gives
the corresponding expression,

Pr[|x̄ − x_ref|/(s/√n) > t] = p.   (2.3.3)
Corresponding expressions can be obtained for one-tailed probabilities
but, again, the relationship between t and p is different.
Like k, the value of t depends on the probability p. Unlike k, however,
the value of t also depends on n and gets closer to k as n increases. For
small n this difference is important. With n > 50 there is little difference: for
n = 50 results t = 2.01 compared with k = 1.96 for a two-tailed probability
of 0.05. Corresponding values of t and p are tabulated in statistics texts,
and can be calculated from each other, quickly by computer but with great
diﬃculty by hand.
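The convergence of t towards k is quick to check by computer. A short sketch using scipy (which is not part of the book) prints the two-tailed 95% points for a few sample sizes:

```python
from scipy.stats import norm, t

# Two-tailed 95% points: Student's t for n results has n - 1 degrees of
# freedom, and approaches the normal coverage factor k as n grows.
for n in (5, 10, 50):
    print(n, round(t.ppf(0.975, df=n - 1), 2))   # 2.78, 2.26, 2.01
print("k =", round(norm.ppf(0.975), 2))          # 1.96
```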
2.4 Probability and Statistical Inference
Key points
— Probabilities about means of repeated results can be calculated simply
by computer under H₀ and H_A.
— Care should be taken in interpreting the exact value of a probability.
— Statistics should be used to assist a decision, not to make it
automatically.
We can calculate a probability associated with specific data by using
Eq. (2.3.2). If we set |x̄ − x_true|/(s/√n) = t, we can obtain the value of p
associated with this 'sample t-value' derived from the data under H₀ and
H_A. As a two-tailed example we take the nitrogen analyser data from §1.9.
We have H₀: µ = 18.92 and H_A: µ ≠ 18.92. Most statistics packages give
the probability p = 0.62 directly. This is the probability of obtaining our
mean value (or a value more distant from 18.92) if H₀ is true. As this is a
large value (e.g., much larger than 0.05), the event would be common under
repeated experiments, so there are no grounds for suspecting the truth of H₀.
These calculated probabilities are an essential guide to the assessment of
experimental results. All analysis is conducted to inform a decision. Often
the decision amounts to comparing some experimental results with an
independent reference value of some kind — a legal or contractual limit, or a
true value. The probability of our results under stated hypotheses tells
us whether those assumptions are plausible or not. The decision depends
on comparing our calculated probability with a predetermined ‘critical’
probability, typically 0.05 for general purposes. Using such critical levels
is designed to help us avoid drawing unsafe conclusions but it does not
relieve us of the responsibility of making the decision. A probability of, say,
0.07 still means that it is considerably more likely that the null hypothesis
could reasonably be rejected than otherwise.
We must further avoid placing too much reliance on the exact value
of the calculated probability. The calculation is based on assumptions, in
particular, of randomness, independence and normality. While these are
sensible assumptions, they are unlikely to be exactly true. This fact is
especially important in the consideration of very small probabilities, which are
related to the extreme tails of a distribution. Probabilities lower than 0.001
are likely to be accurate only to within an order of magnitude and are best
simply regarded as ‘very low’, except in the hands of an expert.
Finally, we must always remember that statistics provides a method of
making optimal decisions in the face of uncertainty in our data. It is not a
magical way of converting uncertainty into certainty.
2.5 Pre-computer Statistics
Key points
— Before the computer age, statisticians used tabulated values of t
calculated for certain fixed levels of p. To proceed with the test, the
sample t value is calculated from the data and H₀. If it is greater
than the tabulated value for the preselected p, the result is regarded
as 'statistically significant'.
— Nowadays, most statistics packages calculate p directly from the data,
H₀ and H_A. This is the simplest way of looking at probability. If p is
less than some small predetermined level, the outcome is unlikely
under H₀ and is regarded as 'statistically significant'.
Before computers were readily available it was not practicable to
calculate p from the sample t for each problem. The alternative was to use
pre-calculated values of t called 'critical values' for a small number of
specific probabilities. The sample t could then be compared with the
tabulated value for a specific probability (usually p = 0.05). If the sample t
exceeded the tabulated t, we would know that the probability of the event
was less than 0.05, so it would be reasonable to reject H₀ at a confidence
level of 100(1 − 0.05) = 95%. Working this out for the previous example
data from §1.9, we have the statistics: n = 5; x̄ = 18.8960; s = 0.1006.
These provide a sample t given by

t = |x̄ − x_true|/(s/√n) = |18.896 − 18.92|/(0.1006/√5) = 0.53

under H₀: µ = 18.92. We need to compare this value with Student's t for a
specific probability. Values of Student's t are tabulated according to the
number of 'degrees of freedom',² which in this case equals n − 1 = 4. For
four degrees of freedom and a two-tailed p-value of 0.05 the tabulated value
of t is 2.78. Our sample value of 0.53 is well below the critical value of 2.78
for p = 0.05. Again this tells us that the event is not significant at the 95%
confidence level.
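The same critical-value comparison can be carried out in a few lines of Python; in the sketch below, scipy (not used in the book) supplies the tabulated value, while the sample t is computed from first principles:

```python
import math
from scipy.stats import t

results = [18.95, 18.86, 18.74, 18.93, 19.00]  # nitrogen data, Section 1.9
x_true = 18.92

n = len(results)
mean = sum(results) / n
s = math.sqrt(sum((x - mean) ** 2 for x in results) / (n - 1))

# Sample t under H0: mu = 18.92.
t_sample = abs(mean - x_true) / (s / math.sqrt(n))

# Critical two-tailed value for p = 0.05 with n - 1 = 4 degrees of freedom.
t_crit = t.ppf(1 - 0.05 / 2, df=n - 1)

print(f"sample t = {t_sample:.2f}, critical t = {t_crit:.2f}")
print("significant" if t_sample > t_crit else "not significant")
```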
2.6 Conﬁdence Intervals
Key points
— 100(1 − p)% confidence limits can be calculated from the data and a
t value corresponding to the required value of p.
— If an H₀ value falls outside this interval, we feel justified in rejecting
the null hypothesis.
— Using the confidence interval approach to significance testing is
mathematically related to calculating a p-value, and it provides the same
answer. Calculating the actual p-value provides more information: for
example, it tells you whether you are near or far from a selected critical
level.
Equation (2.3.1), Pr[x̄ − ts/√n < µ < x̄ + ts/√n] = 1 − p, tells us the range
in which the unknown population mean µ falls with a probability of (1 − p)
² The number of degrees of freedom (n − m) designates the number of independent
values after m statistics have been estimated from n results. If we have, say, ten results,
the mean has nine degrees of freedom because we can calculate any one result from the
other nine and the mean.
and is known as the 100(1−p)% conﬁdence interval. So if p = 0.05, we have
a 95% conﬁdence interval. The ends of the range are called the ‘conﬁdence
limits’. The 95% conﬁdence interval is a convenient and currently popular
method of expressing the uncertainty in our measurement results. The
meaning of the conﬁdence interval is subtle, however, and widely misun
derstood. Its exact meaning is as follows: if the experiment were repeated a
large number of times, µ would fall in the calculated interval on an average
of 100(1 −p)% occasions.
Confidence intervals provide an alternative way of assessing and
visualising tests of significance. As we assume that µ = x_true under the null
hypothesis H₀, x_true should fall within the calculated 95% confidence region
most of the time. If it doesn't fall within the confidence interval, we feel
justified in rejecting the null hypothesis. So calculating a confidence interval
is one way of attributing a limiting probability to our data. All we need to
do is to settle on a desired level of conﬁdence and ﬁnd the corresponding
value of t from a table.
Again using the nitrogen analyser data in §1.9, we calculate the
95% confidence limits (for a two-tailed test) as x̄ ± ts/√n = 18.896 ±
2.78 × 0.1006/√5 = (18.77, 19.02). The H₀ value of 18.92 falls close
to the middle of this range (Fig. 2.6.1), so there are no grounds for
rejecting H₀.
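This interval is easy to reproduce by computer. A minimal Python sketch (using scipy, which is not the package used in the book) gives the same limits:

```python
import math
from scipy import stats

results = [18.95, 18.86, 18.74, 18.93, 19.00]  # nitrogen data, Section 1.9

n = len(results)
mean = sum(results) / n
se = stats.tstd(results) / math.sqrt(n)  # s/sqrt(n), the standard error

# 95% confidence interval: mean +/- t * s/sqrt(n), with n - 1 = 4 df.
lcl, ucl = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(f"95% CI: ({lcl:.2f}, {ucl:.2f})")  # about (18.77, 19.02)
```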
If we calculate 95% conﬁdence limits for the sodium data (§1.1) and, in
addition, use the information that the rock powder is a reference material
Fig. 2.6.1. The nitrogen analyser data (points), showing the 95% confidence interval
and the H₀ value falling inside the interval. LCL = lower confidence limit; UCL = upper
confidence limit.
Fig. 2.6.2. The sodium data (points), showing the 95% confidence interval and the
H₀ value falling outside the interval. LCL = lower confidence limit; UCL = upper
confidence limit.
with a certiﬁed value of 2.33% for the sodium content, we ﬁnd that
the certiﬁed value falls outside the conﬁdence interval (Fig. 2.6.2), and
consequently we ﬁnd that there is a signiﬁcant diﬀerence and justiﬁably
consider that the analytical method is providing a biased result.
Chapter 3
Simple Tests of Signiﬁcance
This chapter contains worked examples of simple tests of significance
of means, incorporating one-sample tests, two-sample tests and tests
on paired data, in one-tailed and two-tailed forms. Its main purpose
is to demonstrate the use of appropriate methods and a critical
appraisal of the outcome, and to allow readers to check that they
are using their own statistical computer software correctly and
interpreting the output correctly. There is also a small amount of theory.
3.1 One-Sample Test — Example 1: Mercury in Fish
Key points
— This is a one-tailed test because we are concerned with concentrations
above a regulatory limit.
— The usual symmetrical 95% confidence limits are not applicable here.
We should beware that some statistical software may confusingly
produce two-tailed confidence limits in combination with a one-tailed
test of significance.
European Regulation 629/2008 sets a maximum concentration of mercury
in ﬁsh of 1.0 ppm. A laboratory analyses a suspect sample four times and
obtains the results
1.34, 1.44, 1.42, 1.14 ppm.
Fig. 3.1.1. Results for mercury content with null hypothesis, mean and 95% conﬁdence
region.
Can we conclude that the concentration of mercury is above the
allowable level?
This requires a one-tailed probability because we are testing the mean
result against an upper limit. A display of the data is in Fig. 3.1.1: we see
that the null hypothesis value does not lie in the 95% confidence region
(which has only a lower limit for the one-tailed test). The statistics are
shown in Box 3.1.1. The p-value of 0.0082 is low: the mean result is
significantly greater than the regulatory limit. The chance of obtaining the data
(or a set with a higher mean) if H₀ were true is less than one in a hundred,
so we are justified in rejecting it.
Box 3.1.1 One-tailed t-test of the mean
H₀: µ = 1.0; H_A: µ > 1.0
Variable n Mean St Dev SE Mean t p
Mercury, ppm 4 1.3350 0.1370 0.0685 4.89 0.0082
Lower 95% conﬁdence limit = 1.17
Note
• The ﬁle containing this dataset is named Mercury.
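Readers checking their own software against Box 3.1.1 could, for example, use a Python sketch like the following (scipy is an assumption here, not the package that produced the box):

```python
from scipy import stats

mercury = [1.34, 1.44, 1.42, 1.14]  # ppm, four replicate analyses
limit = 1.0  # regulatory maximum (EC Regulation 629/2008)

# One-tailed test: H0: mu = 1.0, HA: mu > 1.0.
t_stat, p_value = stats.ttest_1samp(mercury, popmean=limit,
                                    alternative="greater")
print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.4f}")  # t ~ 4.89, p ~ 0.008
```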
3.2 One-Sample Test — Example 2: Alumina (Al₂O₃) in Cement
Key points
— This is a two-tailed test because we are considering whether an
observed mean diﬀers signiﬁcantly from a target value.
— It is worth looking for a trend in sequential data.
A special cement has a target for the alumina content of 4.00%. In its
manufacture, the composition of the product is monitored by regular
automatic analysis. Twelve successive results for alumina were:
3.91, 4.14, 3.87, 4.21, 4.05, 4.19, 4.16, 3.86, 4.13, 4.18, 4.14, 4.04.
Is there any evidence that the alumina content diﬀers from the target
4.00% during the measurement period?
There is no obvious trend in the data (Fig. 3.2.1), so a straightforward
test of significance on the mean is appropriate. The investigation calls for
a two-tailed test as we are interested in a significant difference. A display
of the data is shown in Fig. 3.2.2. We see that the H₀ value lies in the 95%
confidence region (just).
The statistics are shown in Box 3.2.1. As the p-value is greater than
0.05, we see that the null hypothesis is not rejected at the usual 95%
confidence level. On the face of it, there is no compelling evidence to suggest
that the true level in the sample is diﬀerent from the target value (assuming
Fig. 3.2.1. The results plotted in time sequence.
Fig. 3.2.2. Results for alumina (points), showing null hypothesis (H₀), mean and 95% confidence region.
that the analytical method is unbiased). Intervention would not be called
for, because there is a small (but not negligible) probability of obtaining
these results when the concentration of alumina did not diﬀer from 4.00.
Box 3.2.1 Two-tailed test of the mean
H₀: µ = 4.00; H_A: µ ≠ 4.00
Variable n Mean St Dev SE Mean t p
Alumina % 12 4.0733 0.1274 0.0368 1.99 0.071
95% Conﬁdence limits (3.9924, 4.1543)
Notes
• The file containing this dataset is named Alumina.
• In an instance like this one, where data have been collected sequentially, it is worth looking for a trend in the data, in case the process seems to be drifting away from the target value. That circumstance would invalidate a significance test on the mean: the data would not be independent. However, the drift itself would suggest that the process needed adjustment.
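The statistics in Box 3.2.1 can be reproduced from the raw data in a few lines. The sketch below (Python, standard library only; variable names are ours) computes the t statistic for the alumina data; obtaining the p-value of 0.071 additionally requires the t-distribution CDF from a statistics package, which we omit here.

```python
from statistics import mean, stdev
from math import sqrt

# Alumina results (% mass fraction) from the text; the target is 4.00%
alumina = [3.91, 4.14, 3.87, 4.21, 4.05, 4.19, 4.16, 3.86, 4.13, 4.18, 4.14, 4.04]
target = 4.00

n = len(alumina)
x_bar = mean(alumina)        # 4.0733
s = stdev(alumina)           # 0.1274
se = s / sqrt(n)             # standard error of the mean, 0.0368
t = (x_bar - target) / se    # 1.99, as in Box 3.2.1
print(f"mean = {x_bar:.4f}, s = {s:.4f}, se = {se:.4f}, t = {t:.2f}")
```

Comparing t = 1.99 with the two-tailed critical value for 11 degrees of freedom (2.20 at 95% confidence) leads to the same conclusion as the p-value: the null hypothesis is not rejected.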
3.3 Comparing Two Independent Datasets — Method
Key points
— If we want to test the significance of a difference between the means of two sets of results, we have to use a two-sample test.
— If we do not assume that the standard deviations of the x-data and the y-data are equal, the number of degrees of freedom has to be reduced according to a complex formula.
— Two-sample tests can provide one-tailed or two-tailed probabilities as required.
Sometimes we want to know whether the means of two separate datasets should be regarded as different. This is known as a ‘two-sample’ test.
Suppose that we had access to two different methods for the determination of protein nitrogen in food. The Kjeldahl method is based on the conversion of the protein nitrogen to ammonia, which is then measured by titration. The Dumas method relies on the combustion of the foodstuff with the formation of elemental nitrogen, which is measured volumetrically or otherwise. It is suspected that the Dumas method gives on average a slightly different result, because other nitrogenous substances (e.g., nitrate, nitrogenous organic bases) present in the food may affect the results differently. In a simple experiment we could test for this suspected bias by applying both methods repeatedly to a homogeneous sample of a foodstuff.
Suppose the Dumas results are $x_1, x_2, \ldots, x_i, \ldots, x_n$ and the Kjeldahl results are $y_1, y_2, \ldots, y_j, \ldots, y_m$. The estimated bias is $\bar{x} - \bar{y}$. We need to know whether $\bar{x} - \bar{y}$ is significantly greater than zero, or whether the measured difference is simply a result of random variations in the data. The null hypothesis for this test is $H_0: \mu_x = \mu_y$. We set up an equation analogous to Eq. (2.3.2) by recognising that, for any statistic $\theta$ expected to be normally distributed, $\hat{\theta}/\mathrm{se}(\hat{\theta})$ has a $t$-distribution. (The function $\mathrm{se}(\,)$ means the standard error of whatever is in the brackets, and $\hat{\theta}$ signifies an estimate of $\theta$.) So for the difference between the means $\bar{x} - \bar{y}$ we have, just like Eq. (2.3.2):
$$\Pr\left(\frac{\bar{x} - \bar{y}}{\mathrm{se}(\bar{x} - \bar{y})} > t\right) = p.$$
We can readily show that $\mathrm{se}(\bar{x} - \bar{y}) = \sqrt{s_x^2/n + s_y^2/m}$, so we have
$$\Pr\left(\frac{\bar{x} - \bar{y}}{\sqrt{s_x^2/n + s_y^2/m}} > t\right) = \Pr\left((\bar{x} - \bar{y})\left(s_x^2/n + s_y^2/m\right)^{-\frac{1}{2}} > t\right) = p.$$
As before, we can either use a computer to calculate $p$ from the observed value of $t$, namely $(\bar{x} - \bar{y})\big/\sqrt{s_x^2/n + s_y^2/m}$, or manually find whether this quantity exceeds the critical value of $t$ obtained from tables for the selected level of $p$ and the appropriate number of degrees of freedom. However, in either instance, the number of degrees of freedom for $t$ is not $(n + m - 2)$ as might be expected: we have to make an adjustment using the complex-looking formula, which gives a smaller number of degrees of freedom:
$$\text{adjusted degrees of freedom} = \left(\frac{s_x^2}{n} + \frac{s_y^2}{m}\right)^2 \bigg/ \left(\frac{s_x^4}{n^2(n-1)} + \frac{s_y^4}{m^2(m-1)}\right), \qquad (3.3.1)$$
rounded to the nearest integer.
The above is an example of a two-tailed test, as we are looking for a bias between the mean results of the two methods, that is, a difference regardless of which method gives the higher result. One-tailed two-sample tests are equally possible (see §3.7).
If we can reasonably assume that the population standard deviations of the datasets, $\sigma_x$ and $\sigma_y$, are equal, we can use a somewhat simpler procedure (see §3.4 below), which may be quicker for manual calculations but is of no advantage if a computer package is used.
3.4 Comparing Means of Two Datasets with Equal Variances
Key points
— If we can assume that the two datasets come from populations with the same precision, we can use the pooled standard deviation to provide a simpler version of the two-sample test. This procedure is simpler for hand calculations.
— There is no practical advantage in using a pooled standard deviation if probabilities or t-values are calculated by computer.
If we can assume that the two datasets are from distributions with the same variance, the mathematics of the two-sample test of means is somewhat simpler for hand calculation. The standard error of the difference is now given by $\mathrm{se}(\bar{x} - \bar{y}) = s\sqrt{\frac{1}{n} + \frac{1}{m}}$, where
$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{j=1}^{m}(y_j - \bar{y})^2}{n + m - 2}},$$
a standard deviation derived from both datasets, which is called the ‘pooled standard deviation’. We derive probabilities from a $t$-value with a simple $n + m - 2$ degrees of freedom, to give
$$\Pr\left(\frac{\bar{x} - \bar{y}}{\mathrm{se}(\bar{x} - \bar{y})} > t\right) = p.$$
Of course the assumption that both datasets come from equal-variance populations has to be reasonable, and there is a simple method of testing that, called Fisher’s variance ratio test or the F-test (see §3.5).
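The pooled calculation can be verified numerically. The sketch below (Python, standard library only) applies it to the protein-nitrogen data analysed in §3.6 and reproduces the pooled t-value and degrees of freedom reported there in Box 3.6.1.

```python
from statistics import mean
from math import sqrt

# Protein-nitrogen results (%) for the same wheat flour sample (see §3.6)
dumas    = [3.12, 3.01, 3.05, 3.04, 3.04, 2.98, 3.08, 3.09]
kjeldahl = [3.07, 2.92, 3.01, 3.00, 3.02, 3.02, 2.98, 2.92, 3.01, 3.05]

n, m = len(dumas), len(kjeldahl)
x_bar, y_bar = mean(dumas), mean(kjeldahl)
ss_x = sum((x - x_bar) ** 2 for x in dumas)
ss_y = sum((y - y_bar) ** 2 for y in kjeldahl)

# Pooled standard deviation, then t with a simple n + m - 2 degrees of freedom
s_pooled = sqrt((ss_x + ss_y) / (n + m - 2))
se_diff = s_pooled * sqrt(1 / n + 1 / m)
t = (x_bar - y_bar) / se_diff   # 2.29 on 16 degrees of freedom (Box 3.6.1)
df = n + m - 2
```

Note that the two groups need not have equal sizes: the pooled sum of squares simply combines all the within-group deviations.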
3.5 The Variance Ratio Test or F-Test
Key point
— The F-test is used to determine whether two independent variances (or two independent standard deviations) are significantly different.
We may want to test whether two standard deviations $s_x$ and $s_y$, calculated from two independent datasets $x_1, x_2, \ldots, x_i, \ldots, x_n$ and $y_1, y_2, \ldots, y_j, \ldots, y_m$, are significantly different from each other. The test statistic in this case is the ratio of the estimated variances, $F = s_x^2/s_y^2$, with $s_x > s_y$, and its value depends on two separate degrees of freedom, $(n-1)$ for $x$ and $(m-1)$ for $y$. As with $t$-tests, the F-test can be conducted by calculating a probability corresponding with $F$, or by comparing the sample value of $F$ with tabulated values for predetermined probabilities. These probabilities depend on the assumption that the original datasets were normally distributed samples. The test can be used to determine whether it is sensible to use a pooled standard deviation for a two-sample test (§3.4) but, as will be seen, it is more widely used to determine significance in analysis of variance (§4.4).
As an example we use the nitrogen data from §3.6: the variance ratio is $F = 1.193$ and the corresponding probability under $H_0: \sigma_x^2 = \sigma_y^2$ (against $H_A: \sigma_x^2 > \sigma_y^2$) is $p = 0.58$. In other words, if there were no difference between the variances, we would expect a ratio at least as large as 1.193 with a probability of 0.58. As this is a high probability, we conclude that the variances are not significantly different. The situation is illustrated in Fig. 3.5.1.
Fig. 3.5.1. The F-distribution for nine and seven degrees of freedom. The shaded area shows the probability of obtaining the nitrogen data under H0.
Note
• The file containing the dataset is named Wheat flour.
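The variance ratio for the nitrogen data can be computed directly. A sketch (Python, standard library only; the data are those listed in §3.6):

```python
from statistics import variance

# Nitrogen data from §3.6 (file Wheat flour)
dumas    = [3.12, 3.01, 3.05, 3.04, 3.04, 2.98, 3.08, 3.09]
kjeldahl = [3.07, 2.92, 3.01, 3.00, 3.02, 3.02, 2.98, 2.92, 3.01, 3.05]

var_d = variance(dumas)      # sample variance, 7 degrees of freedom
var_k = variance(kjeldahl)   # sample variance, 9 degrees of freedom

# Larger variance on top, so F >= 1 by construction
F = max(var_d, var_k) / min(var_d, var_k)   # 1.193, as in the text
```

Here the Kjeldahl variance is the larger, so the quoted F-distribution has nine and seven degrees of freedom, matching Fig. 3.5.1.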
3.6 Two-Sample Two-Tailed Test — An Example
Key points
— This is a two-tailed test because we are interested in the difference between the mean results regardless of which is the greater.
— If we assume variances are unequal when they are in fact close, we get an outcome very similar to that obtained by the pooled-variance method, so no harm results.
— If we assume variances are equal when they are not, we could get a misleading outcome, so we use the pooled-variance method only in special circumstances.
A laboratory, possessing the two different methods (Dumas and Kjeldahl) for determining protein nitrogen in food, repeatedly analysed the same sample of wheat flour, and obtained the results given below. A dotplot of the data is shown in Fig. 3.6.1. We can see that the means of the two sets of data are different, but are they significantly different, given the spread of the data? We also observe that the dispersions of the two datasets are similar.

Fig. 3.6.1. Dotplot of the nitrogen data (points), showing the means of the two datasets.
Dumas, % Kjeldahl, %
3.12 3.07
3.01 2.92
3.05 3.01
3.04 3.00
3.04 3.02
2.98 3.02
3.08 2.98
3.09 2.92
3.01
3.05
In this instance, without assuming that the variances are identical, Eq. (3.3.1) tells us to use 15 degrees of freedom, and we obtain p = 0.035 as the probability of obtaining the absolute difference that we observe (or a greater difference). We conclude that the observed difference of 0.0513% mass fraction is significant at 95% confidence.
Using a pooled standard deviation (and the full 16 degrees of freedom) provides an almost identical probability (p = 0.036). Notice that we do not need equal numbers of observations in each dataset for the two-sample test.
Box 3.6.1 Two-sample t-test and confidence interval
H0: µDumas = µKjeldahl : HA: µDumas ≠ µKjeldahl
Assuming unequal variances:
n Mean St Dev SE Mean
Dumas, % 8 3.0513 0.0449 0.016
Kjeldahl, % 10 3.0000 0.0490 0.015
t = 2.31 p = 0.035 DF = 15
95% confidence interval for difference: (0.004, 0.099)
Using a pooled standard deviation:
t = 2.29 p = 0.036 DF = 16
95% confidence interval for difference: (0.004, 0.099)
Notes and further reading
• The file containing this dataset is named Wheat flour.
• The outcome of this experiment may not be relevant outside the laboratory concerned. Other laboratories, conducting the same methods (but somewhat differently), may have different biases. A discussion of this point can be found in Thompson, M., Owen, L., Wilkinson, K. et al. (2002). A Comparison of the Kjeldahl and Dumas Methods for the Determination of Protein in Foods, using Data from a Proficiency Testing Scheme. Analyst, 127, pp. 1666–1668.
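The unequal-variance statistics in Box 3.6.1 can be checked from the raw data. A sketch (Python, standard library only); the truncation of the adjusted degrees of freedom to an integer follows common statistics-package behaviour:

```python
from statistics import mean, variance
from math import sqrt, floor

dumas    = [3.12, 3.01, 3.05, 3.04, 3.04, 2.98, 3.08, 3.09]
kjeldahl = [3.07, 2.92, 3.01, 3.00, 3.02, 3.02, 2.98, 2.92, 3.01, 3.05]

n, m = len(dumas), len(kjeldahl)
vx, vy = variance(dumas), variance(kjeldahl)

# Standard error of the difference without assuming equal variances
se_diff = sqrt(vx / n + vy / m)
t = (mean(dumas) - mean(kjeldahl)) / se_diff   # 2.31

# Adjusted degrees of freedom, Eq. (3.3.1)
df = (vx / n + vy / m) ** 2 / (
    vx ** 2 / (n ** 2 * (n - 1)) + vy ** 2 / (m ** 2 * (m - 1)))
df = floor(df)   # 15, as in Box 3.6.1
```

The adjusted value (15) is only one less than the full n + m − 2 = 16 here, because the two variances are close; the penalty grows as the variances diverge.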
3.7 Two-Sample One-Tailed Test — An Example
Key points
— This is a one-tailed test because we are interested in whether the modified conditions were better, not just different.
— Note the assumption of unequal standard deviations.
Ethanol is made industrially by the catalytic hydration of ethene. A process supervisor wants to test whether a change in the reaction conditions improves the one-pass yield of ethanol. The conversion efficiency is measured with ten successive runs under the original conditions and ten more under the modified conditions. The results are as follows. There are no apparent trends in the data, so a simple t-test is appropriate.

Fig. 3.7.1. Results for the conversion of ethene to ethanol under original and modified plant conditions.
Efficiency % original conditions Efficiency % modified conditions
7.0 7.4
7.3 7.5
7.0 7.3
6.9 8.2
6.7 7.2
7.3 7.8
7.1 7.8
7.1 7.4
7.1 7.2
6.8 7.0
The dotplots in Fig. 3.7.1 show the means well separated and the dispersion of the results under modified conditions greater. The p-value of 0.002 (Box 3.7.1) for the one-tailed test shows a very low probability of obtaining the data if the null hypothesis is true, so we can reject it and accept that the modified conditions give a significantly greater conversion efficiency. Note that by assuming unequal variances, the test uses 13 degrees of freedom in place of the original 18.
Box 3.7.1 Two-sample t-test and confidence interval
H0: µMod = µOrig : HA: µMod > µOrig
Conditions n Mean St Dev SE Mean
Modified 10 7.480 0.358 0.11
Original 10 7.030 0.195 0.062
95% confidence interval for difference: (0.17, 0.729)
t = 3.49 p = 0.0020 DF = 13
Note
• The file containing this dataset is named Ethene.
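The statistics in Box 3.7.1 follow from the same unequal-variance calculation as in §3.6; only the alternative hypothesis (one-tailed) differs, and that affects the p-value, not t. A sketch (Python, standard library only):

```python
from statistics import mean, variance
from math import sqrt, floor

original = [7.0, 7.3, 7.0, 6.9, 6.7, 7.3, 7.1, 7.1, 7.1, 6.8]
modified = [7.4, 7.5, 7.3, 8.2, 7.2, 7.8, 7.8, 7.4, 7.2, 7.0]

n, m = len(modified), len(original)
vx, vy = variance(modified), variance(original)

# One-tailed direction: modified minus original, under HA: mu_Mod > mu_Orig
t = (mean(modified) - mean(original)) / sqrt(vx / n + vy / m)   # 3.49

# Adjusted degrees of freedom, Eq. (3.3.1), truncated to an integer
df = floor((vx / n + vy / m) ** 2 / (
    vx ** 2 / (n ** 2 * (n - 1)) + vy ** 2 / (m ** 2 * (m - 1))))   # 13
```

The reduction from 18 to 13 degrees of freedom reflects the noticeably larger spread under the modified conditions (0.358 versus 0.195).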
3.8 Paired Data
Key points
— Paired data arise when there is an extra source of variation that has a common effect on both members of corresponding pairs of data.
— Paired data are treated by calculating the differences between corresponding pairs and testing the differences under H0: µdiff = 0. In this instance we have a two-tailed test, so HA: µdiff ≠ 0.
— It is essential to recognise paired data: a two-sample test will usually provide a misleading answer.
Here we consider again the results of two methods for determining nitrogen in food. We want to find whether the outcome in §3.6 (i.e., no significant difference between the means) was valid for wheat in general and not just specific to a particular type of wheat. One way to do that would be to arrange for the comparison of the methods to be made with a number of different types of wheat. Suppose that in such an experiment, in a single laboratory, the results below were obtained.
C1 C2 C3 C4
Type of wheat Kjeldahl result % Dumas result % Difference %
A 2.98 3.08 0.10
B 2.81 2.88 0.07
C 2.97 3.02 0.05
D 3.15 3.17 0.02
E 3.03 3.08 0.05
F 3.05 3.21 0.16
G 3.24 3.20 −0.04
H 3.14 3.12 −0.02
I 3.04 3.11 0.07
J 3.08 3.16 0.08
Fig. 3.8.1. Results obtained by the analysis of ten different types of wheat for the protein nitrogen content of a sample of a foodstuff. Kjeldahl result ◦. Dumas result •.
Fig. 3.8.2. Differences between corresponding results.
We must recognise that, as well as a possible difference between results of the methods, there is an extra source of variation in the results, due to variation in the true concentrations of protein in the various types of wheat. In fact both of these variations show up clearly in Fig. 3.8.1. For instance, wheat type B has provided a particularly low pair of results. It would be futile to attempt to compare the methods by comparing the mean results. If we tried to do that, any bias between methods might be swamped by the variation between the wheat types. However, we can see that it is the differences between respective pairs of results which will tell us what we want to know. In fact we can see in Figs. 3.8.1 and 3.8.2 that for most of the wheat types (8/10), the Dumas method gives the higher result, which in itself suggests that the difference between methods may be significant.
We deal with the situation statistically by calculating a list of differences between corresponding results. We then apply a one-sample test to the differences, with H0: µdiff = 0, i.e., we expect the mean of the differences to be zero if there is no bias between the methods. The statistics are shown in Box 3.8.1. The observed mean difference is 0.054% and we find that p = 0.016 (for a two-tailed test). This low value tells us that such a large observed difference is unlikely under the null hypothesis, so we feel justified in rejecting it and accepting that, for this range of wheat types, there is a bias between the methods of about 0.05%.
Box 3.8.1 Paired data two-tailed t-test
H0: µdiff = 0 : HA: µdiff ≠ 0
n Mean St Dev SE Mean
Dumas result % 10 3.1030 0.0981 0.0310
Kjeldahl result % 10 3.0490 0.1176 0.0372
Difference % 10 0.0540 0.0578 0.0183
t = 2.96 p = 0.016 DF = 9
95% confidence interval for mean difference: (0.0127, 0.0953)
Notes
• The file containing this dataset is named Wheat types.
• The bias observed in this experiment may be valid only for analyses conducted in a single laboratory. (See also the Notes in §3.6.)
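The paired calculation reduces to a one-sample test on the column of differences. A sketch (Python, standard library only) reproducing the t-value in Box 3.8.1:

```python
from statistics import mean, stdev
from math import sqrt

# Wheat types A-J, from the table above
kjeldahl = [2.98, 2.81, 2.97, 3.15, 3.03, 3.05, 3.24, 3.14, 3.04, 3.08]
dumas    = [3.08, 2.88, 3.02, 3.17, 3.08, 3.21, 3.20, 3.12, 3.11, 3.16]

# Paired test: form the differences, then test them against zero
diffs = [d - k for d, k in zip(dumas, kjeldahl)]
n = len(diffs)
d_bar = mean(diffs)            # 0.054
s_d = stdev(diffs)             # 0.0578
t = d_bar / (s_d / sqrt(n))    # 2.96 on 9 degrees of freedom
```

Note that the between-wheat variation cancels within each pair, which is exactly why the test is sensitive to the small bias between methods.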
3.9 Paired Data — One-Tailed Test
Key point
— This is a one-tailed test involving paired data.
Nitrogen oxides (NOx) are harmful atmospheric pollutants produced largely by vehicle engines. Their concentration is monitored in large cities. This experiment attempts to tell whether concentrations measured at face level are higher than those measured with a monitor at a height of 5 m (where it is safe from vandalism). One-hour average concentrations were measured every hour for a day at a particular location, with the results shown in units of parts per billion by volume (i.e., mole ratio).
The results are plotted by the hour in Fig. 3.9.1. Most of the differences between heights are positive (21/24), showing a higher reading at face level. A dotplot of the differences (Fig. 3.9.2) shows the null hypothesis value well below the lower 95% confidence limit. The statistics (Box 3.9.1) show a p-value well below 0.05, so we conclude that the null hypothesis can be safely rejected in favour of the alternative: the concentration of NOx at face level is significantly higher than at 5 m above the pavement.
Hour NOx, ppb at 1.5 m NOx, ppb at 5 m Difference, ppb
1 11 10 1
2 15 13 2
3 16 13 3
4 13 11 2
5 19 15 4
6 16 14 2
7 15 13 2
8 20 15 5
9 18 17 1
10 19 21 −2
11 26 24 2
12 22 19 3
13 26 22 4
14 27 24 3
15 24 25 −1
16 26 22 4
17 28 24 4
18 23 25 −2
19 23 19 4
20 22 20 2
21 21 19 2
22 18 12 6
23 14 14 0
24 19 18 1
Fig. 3.9.1. Results for the concentration of NOx in air at a site measured at two heights above ground: 1.5 m (•) and 5 m (◦).
Fig. 3.9.2. Differences between pairs of results for NOx.
Box 3.9.1 t-test of the mean difference
H0: µdiff = 0 : HA: µdiff > 0
Variable n Mean St Dev SE Mean t p
Difference, ppb 24 2.167 2.036 0.416 5.21 0.0000
Lower 95% confidence boundary: 1.455
Notes
• The file containing this dataset is named NOx in air.
• We find the difference significantly greater than zero, but in such cases we must always remember to consider the separate question of whether the difference is of important magnitude. Such a question cannot be answered without reference to an external criterion based on the use to which the data will be put. In this case we might consider that a difference of about 2 ppb would be unlikely to affect decisions strongly, so it could be safely ignored.
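As in §3.8, the hourly pairing removes the strong diurnal variation, leaving only the height effect in the differences. A sketch of the calculation behind Box 3.9.1 (Python, standard library only):

```python
from statistics import mean, stdev
from math import sqrt

# One-hour average NOx concentrations (ppb), hours 1-24, from the table above
face = [11, 15, 16, 13, 19, 16, 15, 20, 18, 19, 26, 22,
        26, 27, 24, 26, 28, 23, 23, 22, 21, 18, 14, 19]   # at 1.5 m
high = [10, 13, 13, 11, 15, 14, 13, 15, 17, 21, 24, 19,
        22, 24, 25, 22, 24, 25, 19, 20, 19, 12, 14, 18]   # at 5 m

diffs = [a - b for a, b in zip(face, high)]
n = len(diffs)
d_bar = mean(diffs)                        # 2.167 ppb
t = d_bar / (stdev(diffs) / sqrt(n))       # 5.21, as in Box 3.9.1
```

For the one-tailed alternative HA: µdiff > 0, only large positive t-values count as evidence against H0, and here t = 5.21 is far beyond any plausible critical value for 23 degrees of freedom.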
3.10 Potential Problems with Paired Data
Key points
— For a valid test, the differences must be a random sample from a single distribution, or a reasonable approximation to that.
— That is likely to occur only if the concentrations involved are drawn from a relatively short range. If a few anomalous concentrations are present, those data can be safely deleted before the paired-data test.
— It is important to recognise paired data. Treating paired data by a two-sample test will probably lead to an incorrect inference.
The paired-data test is based on the hidden assumption that the differences form a coherent set. As an example, it is a reasonable assumption that the results of the Dumas method in §3.8 are taken from distributions with different means but the same (unknown) standard deviation $\sigma_D$. Likewise the Kjeldahl results are taken to represent values taken from distributions with unknown standard deviation $\sigma_K$, probably different from $\sigma_D$. These are reasonable assumptions because there is no great variation among the concentrations of nitrogen in the types of wheat. The differences then have a standard deviation of $\sigma_{\mathrm{diff}} = \sqrt{\sigma_D^2 + \sigma_K^2}$, which we estimated for the t-test from the observed differences as $s_{\mathrm{diff}} = 0.0578$.
If the assumption of a single standard deviation is unrealistic, the
outcome of the test may be misleading. Consider the following data, which
refer to the determination of beryllium in 11 diﬀerent types of rock by two
diﬀerent analytical methods.
Rock type ICP result Be, ppm AAS result Be, ppm Difference, ppm
A 2.6 2.9 0.3
B 1.7 1.9 0.2
C 1.7 1.6 −0.1
D 2.4 2.5 0.1
E 0.4 1.1 0.7
F 1.2 1.5 0.3
G 1.6 1.9 0.3
H 2.9 2.9 0.0
J 2.2 2.6 0.4
K 2.1 2.4 0.3
L 56.2 61.5 5.3
If the data are treated naively by conducting a t-test under H0: µdiff = 0 : HA: µdiff ≠ 0 on the complete dataset (all 11 differences), we obtain the result p = 0.16 (Box 3.10.1), an apparently non-significant result that might deceive an inexperienced person. However, an inspection of the data shows that one difference is much greater than any of the others, and this is apparently owing to a much higher concentration of beryllium in Rock type ‘L’. At this concentration (about 60 ppm) we would expect the standard deviation of the determination to be much greater than for the rest of the rocks (at concentrations of 3 ppm or less). Consequently, it is appropriate to delete the results pertaining to Rock ‘L’ from the list and then repeat the test with the remaining ten differences. This gives us the result p = 0.0062, a clearly significant result, contrasting sharply with the naïve result.
Analysts need have no fear that this procedure amounts to improperly ‘adjusting the results’. The difference is not deleted because it is an outlier or anomalous per se (although it clearly is), but because Rock type ‘L’ obviously differs from the others by virtue of the very high concentration of beryllium present.
Box 3.10.1 Beryllium data
H0: µdiff = 0 : HA: µdiff ≠ 0
‘Naïve’ statistics
n Mean St Dev SE Mean t p
Difference, ppm 11 0.709 1.537 0.463 1.53 0.16
‘Sensible’ statistics
n Mean St Dev SE Mean t p
Difference, ppm 10 0.2500 0.2224 0.0703 3.56 0.0062
Another problem can arise when data are paired but the fact is overlooked. Treating paired datasets by a two-sample test will often eliminate any sign of a real significant difference. This happens because differences between any two paired results will often be considerably smaller than differences between sets of pairs. For instance, the paired beryllium results above, if incorrectly treated as two-sample results under the hypotheses H0: µICP = µAAS : HA: µICP ≠ µAAS, would give a p-value of 0.42, erroneously indicating a non-significant difference. Likewise, the paired results in §3.8 show a difference significant at 95% confidence. However, if the data are incorrectly treated as two-sample results under the hypotheses H0: µDumas = µKjeldahl : HA: µDumas ≠ µKjeldahl, we obtain an apparently non-significant p-value of 0.28, because of the relatively large differences among the nitrogen contents of the wheat types.
Pairing is not difficult to spot if we are looking out for it. The diagnostic sign of paired data is that there is extra information available about how they were collected. Often this information is explicit. In this section, for example, we have a column telling us that each pair of data is obtained from the analysis of a different rock type. However, we do not need to know the actual rock type to do the statistics. Moreover, sometimes the information about pairing is implicit or stated separately rather than as part of the dataset. In analytical data, key signs of pairing might be that the observations were made on different types of test material, or by different analysts, by different laboratories, or on different days.
Note
• The file containing the dataset is named Beryllium methods.
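The ‘naïve’ and ‘sensible’ statistics in Box 3.10.1 can be reproduced from the differences column. A sketch (Python, standard library only; the helper function name is ours):

```python
from statistics import mean, stdev
from math import sqrt

# Differences (AAS - ICP), ppm, rock types A-L from the table above;
# the last entry is the anomalous high-beryllium rock 'L'
diffs = [0.3, 0.2, -0.1, 0.1, 0.7, 0.3, 0.3, 0.0, 0.4, 0.3, 5.3]

def t_vs_zero(d):
    """One-sample t statistic for H0: mu_diff = 0."""
    return mean(d) / (stdev(d) / sqrt(len(d)))

t_naive    = t_vs_zero(diffs)        # 1.53 -- apparently non-significant
t_sensible = t_vs_zero(diffs[:-1])   # 3.56 after dropping rock 'L'
```

The single extreme difference inflates the standard deviation from 0.22 to 1.54 ppm, which is what suppresses the naïve t-value; removing rock ‘L’ restores a coherent set of differences.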
Chapter 4
Analysis of Variance (ANOVA)
and Its Applications
This chapter treats the statistical technique of analysis of variance (ANOVA) and its most prominent applications in analytical science. ANOVA has many applications in various sectors that utilise (rather than produce) analytical data, agricultural studies for example, but these are not covered in this book.
4.1 Introduction — The Comparison of Several Means
Key points
— Analysis of variance (ANOVA) is a broad method for analysis of data affected by two (or more) separate sources of variation.
— Typically the sources of variation are between and within subsets of results.
— Important applications of ANOVA in analytical science are in (a) homogeneity testing, (b) sampling uncertainty and (c) collaborative trials.
There are four or more recognised methods for the determination of ‘fat’ in foodstuffs. They are thought to give different results from each other, because ‘fat’ is not a clearly defined analyte (although its determination is very important commercially). Suppose that we applied the four most popular of these methods ten times each to a well-homogenised batch of dog food, and produced the results shown in Fig. 4.1.1 (Dataset A). Visually there is no apparent reason to believe that the mean results of the four methods differ significantly among themselves. We could quite readily accept the hypothesis that the four sets each comprise random selections from a single normal distribution $N(\mu, \sigma_0^2)$ (Fig. 4.1.2). We could account for all of the variation by means of one mean $\mu$ and one standard deviation $\sigma_0$.

Fig. 4.1.1. Dataset A. Results from the determination of fat in dog food by using four different methods.
Fig. 4.1.2. Hypothesis that all of the results could be accounted for as four random samples of data from a common normal distribution.
Fig. 4.1.3. Dataset B. Results from the analysis of dog food, showing differing mean values of results from four different methods.
Fig. 4.1.4. Hypothesis that accounts for the results by assuming that the sets of results came from distributions with different means.
In contrast, we would entertain no such hypothesis if we obtained the results shown in Fig. 4.1.3 (Dataset B). It would seem far more plausible that the four sets of results came from four separate normal distributions, all with the same standard deviation $\sigma_0$ but with distinct means $\mu_A, \mu_B, \mu_C, \mu_D$, as in Fig. 4.1.4. In this set of data there are two separate sources of variation. There is still the common within-method variation designated by $\sigma_0$, but there is an independent variation due to the dispersion of the means. The standard deviation of the means is designated $\sigma_1$. By the principle of addition of independent variances, the variance of a single observation is $\sigma_0^2 + \sigma_1^2$, and the variance of a mean of $n$ results from a single group is $\sigma_0^2/n + \sigma_1^2$.
The statistical method called ANOVA handles a range of problems like the one illustrated here. More complex versions of ANOVA have been devised, but only the simpler ones have regular application in analytical science. ANOVA enables us to do two things. First, it enables us to estimate separately the values of $\sigma_0$ and $\sigma_1$ from datasets such as those illustrated. This procedure has several important applications in analytical science, including: (a) testing a divided material for homogeneity (§4.5); (b) studying the uncertainty caused by sampling (§4.7); and (c) analysing the results of collaborative trials (interlaboratory studies of method performance) (§4.6).
The second range of applications of ANOVA provides us with a test of whether the differences among a number of groups of results (like the results of different methods of analysis above) are statistically significant. One way of writing the null hypothesis for this test is $H_0: \mu_A = \mu_B = \mu_C = \cdots$. What we actually test, however, is $H_0: \sigma_1 = 0$ versus $H_A: \sigma_1 > 0$. There are fewer important applications of this aspect of ANOVA in analytical science, partly because analysts usually need information about chemical systems that is deeper than the simple fact that there are significant differences among them.
Note
The datasets used in this section (and §4.3) can be found in files Dogfood dataset A and Dogfood dataset B. The raw data are shown in §4.3.
4.2 The Calculations of One-Way ANOVA
Key points
— One-way ANOVA considers a number of groups each containing several results.
— The two primary statistics calculated from the data are: (a) the within-group mean square, MSW; and (b) the between-group mean square, MSB.
— Estimates of $\sigma_0$ and $\sigma_1$ can be calculated from the mean squares.
— The null hypothesis $H_0: \sigma_1 = 0$ can be tested versus $H_A: \sigma_1 > 0$ by calculating the probability associated with the value of $F = \mathrm{MSB}/\mathrm{MSW}$.
One-way ANOVA is concerned with situations such as those shown in §4.1 where there is a source of variation (between groups) beyond the usual measurement errors. We consider a general case with $m$ groups each containing $n$ results. It is not necessary to have equal numbers of results in each group, but it simplifies the explanation and the notation. Each result $x_{ji}$ has two subscripts, the first referring to the $j$th row of data (the group) and the second to the $i$th result in the row. The mean and variance of the results in the $j$th row are shown as $\bar{x}_j, s_j^2$ (notice the single subscript for the row statistics). They have the usual definitions, although the notation is simplified here: $\sum_i x_i$ is used as shorthand for $\sum_{i=1}^{n} x_i$, while $SS_j$ is shorthand for the $j$th sum of squares, namely $\sum_i (x_{ji} - \bar{x}_j)^2$.
Group 1: $x_{11}\; x_{12}\; \cdots\; x_{1i}\; \cdots\; x_{1n}$, with $\bar{x}_1 = \sum_i x_{1i}/n$ and $s_1^2 = SS_1/(n-1)$
Group 2: $x_{21}\; x_{22}\; \cdots\; x_{2i}\; \cdots\; x_{2n}$, with $\bar{x}_2 = \sum_i x_{2i}/n$ and $s_2^2 = SS_2/(n-1)$
...
Group j: $x_{j1}\; x_{j2}\; \cdots\; x_{ji}\; \cdots\; x_{jn}$, with $\bar{x}_j = \sum_i x_{ji}/n$ and $s_j^2 = SS_j/(n-1)$
...
Group m: $x_{m1}\; x_{m2}\; \cdots\; x_{mi}\; \cdots\; x_{mn}$, with $\bar{x}_m = \sum_i x_{mi}/n$ and $s_m^2 = SS_m/(n-1)$
For one-way ANOVA we need to calculate two variances. First, we can calculate a within-group estimate by pooling the information from the $m$ groups. This variance estimate is usually called the ‘mean square within group’, designated MSW and given by
$$\mathrm{MSW} = \frac{\sum_j SS_j}{m(n-1)}, \quad \text{which estimates } \sigma_0^2.$$
In this equation, the numerator is the total of all the group sums of squares. The denominator is the total number of degrees of freedom: as each row has $(n-1)$ degrees of freedom and there are $m$ rows, the total is $m(n-1)$. (Note that this definition of pooled variance is consistent with that used in §3.4.)
The second estimate is called the ‘mean square between groups’, designated by MSB. First, we calculate the grand mean $\bar{x}$ (with no subscript), which is the mean of the row means, $\bar{x} = \sum_j \bar{x}_j/m$. The variance of the row means is obtained by applying the ordinary formula for variance, giving $\sum_j (\bar{x}_j - \bar{x})^2/(m-1)$. This latter statistic, being based on the means of $n$ results, estimates $\sigma_1^2 + \sigma_0^2/n$, but is $n$ times smaller than we want. Finally, multiplying by $n$, we have
$$\mathrm{MSB} = \frac{n}{m-1}\sum_j (\bar{x}_j - \bar{x})^2, \quad \text{which estimates } n\sigma_1^2 + \sigma_0^2.$$
From these considerations we see that $\sqrt{\mathrm{MSW}}$ estimates $\sigma_0$, and $\sqrt{\dfrac{\mathrm{MSB} - \mathrm{MSW}}{n}}$ estimates $\sigma_1$.
When σ₁² is zero (as it would be under H₀), but only then, MSB also
estimates σ₀². Thus, if H₀ is true, the expected ratio F = MSB/MSW would
be unity, but the value calculated from data would vary according to the
F-distribution with the appropriate number of degrees of freedom (§3.5).
We need to see whether the deviation of F from unity is significantly large
by observing the corresponding probability.
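The calculation above is easy to reproduce in software. As a minimal illustrative sketch (Python with NumPy, not part of the original text), the two mean squares and the F-ratio for m groups of n results can be computed directly from the defining formulae:

```python
import numpy as np

def mean_squares(groups):
    """One-way ANOVA mean squares for an m-by-n array of results
    (m groups, n replicate results per group)."""
    x = np.asarray(groups, dtype=float)
    m, n = x.shape
    # MSW: pooled within-group variance, total of the SS_j over m(n - 1)
    ss_within = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum()
    msw = ss_within / (m * (n - 1))
    # MSB: n times the variance of the group means about the grand mean
    group_means = x.mean(axis=1)
    msb = n * ((group_means - group_means.mean()) ** 2).sum() / (m - 1)
    return msw, msb, msb / msw
```

Under H₀ the returned ratio MSB/MSW should be close to unity.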
Notes
• The denominator in the expression estimating σ₁ is n (the number of
repeat results in a group), not m (the number of groups).
• ANOVA calculates probabilities under the assumptions that: (a) the
groups all have a common standard deviation σ₀; (b) errors are normally
distributed; and (c) the variations within-groups and between-groups are
independent. Unlike two-sample tests, ANOVA cannot readily take proper
account of different variances among groups of data. Fortunately, it is
often possible through good experimental design to ensure that assumption
(a) is more or less correct. Where no such assurance is possible, the scientist
has to use judgement about the applicability of the pooled variances
and probabilities resulting from the use of ANOVA.
4.3 Example Calculations with One-Way ANOVA
Key points
— We can use one-way ANOVA to conduct a significance test by calculating
the probability associated with F = MSB/MSW.
— We can estimate σ₀ as s₀ = √MSW.
— Where the F-ratio is big enough, we can estimate σ₁ as
s₁ = √[(MSB − MSW)/n].
Suppose we apply these calculations to the two datasets used in §4.1. First
we look at Dogfood dataset A (% mass fraction).
Dataset A
Method A 11.08 11.19 11.17 11.50 11.14 11.29 11.28 10.80 11.01 11.17
Method B 11.35 10.76 10.63 11.13 11.14 11.30 10.74 10.95 10.96 11.01
Method C 10.75 11.33 11.14 11.55 11.34 11.13 11.12 11.17 11.10 10.99
Method D 11.11 11.04 11.43 10.98 10.92 11.26 10.91 10.69 11.42 11.06
If we calculate the mean squares using the formulae in §4.2, we
obtain MSW = 0.048 and MSB = 0.063. Under the null hypothesis
H₀: µ_A = µ_B = µ_C = µ_D, σ₁² must be zero, so both mean squares would be
independent estimates of σ₀². We would expect the ratio F = MSB/MSW
to follow the F-distribution with m − 1 and m(n − 1) degrees of freedom
(see §3.5), so we can use the probability associated with F as a test of
significance regarding H₀. If the value of p is low, lower than 0.05 say, the
results would be unlikely to occur under H₀, so we would be justified in
rejecting it.
For Dogfood dataset A we find that F = 0.063/0.048 = 1.30, and the
corresponding probability is p = 0.29. With this high probability there
would be no justification for rejecting H₀, so we infer that there are no
significant differences among the means of the method results.
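The same test is available ready-made. As an illustrative sketch (Python with SciPy, an assumed dependency not used in the book), the one-way analysis of Dogfood dataset A can be run with a single call:

```python
from scipy.stats import f_oneway

# Dogfood dataset A (% mass fraction), one list per method
method_a = [11.08, 11.19, 11.17, 11.50, 11.14, 11.29, 11.28, 10.80, 11.01, 11.17]
method_b = [11.35, 10.76, 10.63, 11.13, 11.14, 11.30, 10.74, 10.95, 10.96, 11.01]
method_c = [10.75, 11.33, 11.14, 11.55, 11.34, 11.13, 11.12, 11.17, 11.10, 10.99]
method_d = [11.11, 11.04, 11.43, 10.98, 10.92, 11.26, 10.91, 10.69, 11.42, 11.06]

# F is about 1.3 with p about 0.29, so H0 is retained
F, p = f_oneway(method_a, method_b, method_c, method_d)
```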
The corresponding statistics from dataset B are: MSW = 0.048, MSB =
1.89, F = MSB/MSW = 1.89/0.048 = 39.3, and p < 0.0005. With this high
value of F and the small probability, the results are very unlikely to occur
under the assumption of H₀, so we infer that there is a genuine difference
among the means of the results of the methods, that is, σ₁² > 0.
Dataset B
Method A 10.68 10.79 10.77 11.10 10.74 10.89 10.88 10.40 10.61 10.77
Method B 11.85 11.26 11.13 11.63 11.64 11.80 11.24 11.45 11.46 11.51
Method C 10.65 11.23 11.04 11.45 11.24 11.03 11.02 11.07 11.00 10.89
Method D 10.51 10.44 10.83 10.38 10.32 10.66 10.31 10.09 10.82 10.46
For dataset B we can estimate σ₁ with some confidence. As MSW estimates
σ₀², and MSB estimates nσ₁² + σ₀², we see that s₀ = √MSW is the
estimate of σ₀, while s₁ = √[(MSB − MSW)/n] is the estimate of σ₁. This
gives us:

    s₀ = √0.048 = 0.22%;
    s₁ = √[(1.89 − 0.048)/10] = 0.43%.
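These estimates follow mechanically from the two mean squares. A small helper (illustrative Python, incorporating the customary clamp at zero discussed in the notes that follow) might look like:

```python
import math

def variance_components(msw, msb, n):
    """Estimate s0 and s1 from one-way ANOVA mean squares;
    n is the number of repeat results per group.  If MSB < MSW the
    between-group variance estimate is negative, so s1 is set to 0
    by convention."""
    s0 = math.sqrt(msw)
    s1 = math.sqrt(max(msb - msw, 0.0) / n)
    return s0, s1
```

For Dogfood dataset B, `variance_components(0.048, 1.89, 10)` returns s₀ ≈ 0.22 and s₁ ≈ 0.43, matching the values above.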
Notes
• The ﬁles containing the datasets in this section are named Dogfood
dataset A and Dogfood dataset B.
• The denominator in the expression for s₁ is n (the number of results in
a group) not m (the number of groups).
• Unless σ₁ is somewhat greater than σ₀, estimates s₁ will tend to be very
variable and not much use.
• In instances where the expression for s₁ provides a square root of a negative
number, it is customary to set s₁ = 0. This simply means that σ₁ is
too small to estimate meaningfully (or alternatively, that σ₀ is too large
to do the job adequately).
• There are essentially two distinct types of application of ANOVA. The
first type is where primarily we want to test for significant differences
among a number (> 2) of means. The second type is where we are not
interested in testing for significance (which we can often take for granted)
but in estimating the separate variances σ₀² and σ₁². It is important to be
aware of this difference.
4.4 Applications of ANOVA: Example 1 — Catalysts
for the Kjeldahl Method
Key points
— In this fixed-effect experiment, there were predetermined categorical
differences between the groups of results.
— As such the experiment left unanswered some important, more
general, questions about the methods.
In this experiment we consider the effects of different catalysts on the results
of the Kjeldahl method for determining the protein content of a meat
product. It is hoped that the commonly-used catalysts HgO and SeO₂,
which leave toxic residues that are difficult to dispose of, can be replaced
by copper-containing catalysts that do not have that problem. The
determination is carried out by one analyst, ten times with each catalyst,
keeping all of the conditions as similar as possible. The results and a dotplot
are shown below (Fig. 4.4.1). The groups have similar dispersions so the
assumption of a common group variance (see §4.2) is reasonable, although
there is a low suspect result among those for CuO/TiO₂.
Fig. 4.4.1. Results (points) for different catalysts showing variation among means
(arrows).
Variation among the group means is visible but, given the dispersion of the
results, it is not clear whether the variation is significant.
 HgO    SeO₂    CuSO₄   CuO/TiO₂
29.93 30.12 29.47 30.47
30.68 30.89 30.51 30.36
29.59 30.64 29.90 29.47
29.81 30.57 30.14 30.41
30.75 31.56 29.51 30.16
30.85 30.30 30.22 29.61
30.26 30.15 30.35 30.54
29.84 30.96 29.11 29.97
30.35 30.03 30.12 29.86
30.07 29.86 30.15 28.23
ANOVA gives the following results.
Box 4.4.1
Source of variation   Degrees of freedom   Sum of squares   Mean square   F      p
Between group                  3               2.314           0.771      2.70   0.060
Within group                  36              10.293           0.286
Total                         39              12.607
The probability is p = 0.06, which is low enough to make us suspect
that there is a real diﬀerence among the mean results of the catalysts, even
though the conﬁdence does not quite reach the 95% level.
Notes
• The dataset used in this section can be found in the ﬁle Kjeldahl
catalysts.
• There are a number of practical problems with this experiment. First,
we do not know what would happen if different types of foodstuffs were
used instead of the meat product: the catalysts might behave differently
with other types of food, or at other concentrations of protein, or simply
under the different conditions found in other laboratories. Second, we
do not know enough about how the analyses were done. There might
be systematic effects due to changed conditions if the results for each
catalyst were obtained in a temporal sequence, especially if they were
done on different days. The only way to avoid that would be to do the
determinations in a randomised sequence (which could be difficult to
organise in practice!).
• The above is an example of a fixed-effect experiment, which is conducted
when we wish to see whether there is a significant effect when categorical
differences between the groups have been deliberately brought
about by the experimenter. There are relatively few instances in analytical
science where this type of experiment is conducted. More frequently (§4.5–4.8)
we encounter experiments with random effects between the groups.
4.5 Applications of ANOVA: Example 2 —
Homogeneity Testing
Key points
— A simple test based on ANOVA permits us to test for lack of heterogeneity
in a material.
— The usefulness of such a test depends critically on the precision of the
analytical method.
A commonly encountered application of ANOVA in analytical chemistry is
where we are testing a material (usually some kind of reference material)
for homogeneity, i.e., to make sure that there is no measurable difference
between different portions of the material. (‘Measurable’ here signifies under
appropriate test conditions — we can usually find some difference given
sufficient resources.) This is a very common exercise carried out by
proficiency test providers and laboratories that manufacture reference
materials. The bulk material is carefully homogenised (ground to a fine powder
if solid and thoroughly mixed). It is then divided and packed into the
portions that are to be distributed. A number of the packaged portions
(10–20) are selected at random and analysed in duplicate (with the analysis
being carried out in a random order). The generalised scheme is as shown
in Fig. 4.5.1.
In this example, silicon as SiO₂ is determined in a rock powder. The
results are:
Random portion Result 1, % Result 2, %
1 71.04 71.17
2 71.05 71.04
3 70.91 71.09
4 70.98 71.04
5 70.91 70.99
6 71.11 71.06
7 70.93 70.96
8 71.13 71.15
9 71.06 71.01
10 71.04 71.09
Fig. 4.5.1. General experimental layout in a homogeneity test, with m > 9.
Fig. 4.5.2. Result of duplicate determination of SiO₂ in ten random samples from a
bulk powdered rock.
Figure 4.5.2 shows that there are no ‘outlying’ sample sets and no
unusually large differences between corresponding duplicate results. (At
first glance we might suspect the large difference in the sample 3 results,
but that is only slightly greater than that of sample 1, which in turn is
only slightly greater than that of sample 5, so there is a gradation of
differences rather than a single outstanding difference.) So we don’t see any
suspect results or obvious significant differences among the mean results.
That, of course, is the expected outcome, as the material has been
carefully homogenised — the test is simply to make sure that nothing has gone
wrong. We now carry out the ANOVA and obtain the following results.
Box 4.5.1
Source of variation   Degrees of freedom   Sum of squares   Mean square   F      p
Between samples                9              0.07302         0.00811     2.38   0.097
Between analyses              10              0.03410         0.00341
Total                         19              0.10712
Here we see that p = 0.097 > 0.05, so we cannot reject H₀ at the 95%
level of confidence. As expected we find no significant differences among
the ten mean results and hence the material passes the homogeneity test.
More strictly, we should say that no significant heterogeneity was found
by using that particular analytical method. Use of a more precise analytical
method, resulting in a smaller analytical mean square (the denominator
of the F-ratio), would enhance the F-value and could easily provide a
significant outcome for the same material. In practice, a more sophisticated
test for lack of homogeneity, based on the same analysis of variance,
is preferable to a simple test of H₀: σ₁² = 0. The reason is that the test
above compares the between-sample variation with the variation of the
analytical method used in the test. This latter variation could be irrelevant
to the users of the material. It is better to compare σ₁ with a criterion
based on user needs. That, however, is beyond the scope of the current text.
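The homogeneity test itself is just a one-way ANOVA with the portions as groups. As an illustrative check (Python with SciPy, an assumed dependency), the table above can be reproduced from the raw duplicate results:

```python
from scipy.stats import f_oneway

# Duplicate SiO2 results (%) on the ten randomly selected portions
portions = [
    (71.04, 71.17), (71.05, 71.04), (70.91, 71.09), (70.98, 71.04),
    (70.91, 70.99), (71.11, 71.06), (70.93, 70.96), (71.13, 71.15),
    (71.06, 71.01), (71.04, 71.09),
]

# F is about 2.38 with p about 0.097, so the material passes the test
F, p = f_oneway(*portions)
```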
Notes and further reading
• The dataset used in this section can be found in the ﬁle named
Silica.
• This example demonstrates one-way ANOVA with ‘random effects’.
There are two sources of random variation acting independently, namely
variation between the true concentrations of silica in the samples and
variation between the replicated results on each sample. This contrasts
with ‘fixed effects’ (§4.4) where the variation between groups is imposed
by the experimenter.
• Fearn, T. and Thompson, M. (2001). A New Test for Sufficient
Homogeneity. Analyst, 126, pp. 1414–1417.
• ‘Test for sufficient homogeneity in a reference material’. (2008). AMC
Technical Brief No. 17A. (Free download via www.rsc.org/amc.)
4.6 ANOVA Application 3 — The Collaborative Trial
Key point
— A collaborative trial (interlaboratory method performance study)
allows us to estimate the repeatability standard deviation σ_r and the
reproducibility (between-laboratory) standard deviation σ_R.
The collaborative trial is an interlaboratory study to determine the
characteristics of an analytical method. The main characteristics determined
are the repeatability (average within-laboratory) standard deviation and
the reproducibility (between-laboratory) standard deviation. These are
Fig. 4.6.1. Experimental layout for one material in a collaborative trial.
obtained from the results of an experiment where each participant
laboratory (of at least eight) analyses a number of materials (at least five) in
duplicate by a method prescribed in considerable detail (i.e., they all use
the same method as far as possible). The layout of the experiment for each
material is shown in Fig. 4.6.1.
The results for each material are subjected separately to one-way
ANOVA. A typical set of results (ppm), for the concentration of copper
in one type of sheep feed, is as follows.
Lab ID Result 1, ppm Result 2, ppm
1 2.0 3.6
2 1.4 1.7
3 2.2 2.3
4 2.6 2.7
5 2.8 3.0
6 1.3 2.4
7 2.1 2.7
8 1.7 1.3
9 3.7 3.3
10 2.4 2.2
11 1.4 2.3
We can see in Fig. 4.6.2 that the variation among laboratories is
roughly comparable with that between the duplicate results of any one
laboratory. There are no clearly outlying laboratories or suspect duplicate
Fig. 4.6.2. Results of the duplicate analysis of a batch of sheep feed for copper in 11
diﬀerent laboratories.
differences, so we accept the data as they are. Laboratory 1 has produced
the biggest difference between duplicate results, but it is not much bigger
than those of Laboratories 6 and 11. (In the collaborative trial outliers, up to a
certain proportion according to a strict protocol, are conventionally deleted,
because we are interested in the properties of the method, not the behaviour
of the laboratories — see §9.7.) ANOVA gives us:
Box 4.6.1
Source of variation    Degrees of freedom   Sum of squares   Mean square   F      p
Between laboratories           10               7.574           0.757      3.06   0.040
Within laboratories            11               2.725           0.248
Total                          21              10.299
We calculate:

    s₀ = √MSW = √0.248 = 0.50 ppm
    s₁ = √[(MSB − MSW)/n] = √[(0.757 − 0.248)/2] = 0.504 ppm.

The ‘repeatability standard deviation’ is s_r = s₀ = 0.50 ppm.
The ‘reproducibility standard deviation’ is defined as

    s_R = √(s₀² + s₁²) = 0.71 ppm.
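The whole calculation for one material can be sketched from the raw duplicate results (illustrative Python with NumPy, not part of the original text):

```python
import numpy as np

# Duplicate copper results (ppm) from the 11 laboratories
labs = np.array([
    [2.0, 3.6], [1.4, 1.7], [2.2, 2.3], [2.6, 2.7], [2.8, 3.0], [1.3, 2.4],
    [2.1, 2.7], [1.7, 1.3], [3.7, 3.3], [2.4, 2.2], [1.4, 2.3],
])
m, n = labs.shape

# One-way ANOVA mean squares with laboratories as groups
msw = ((labs - labs.mean(axis=1, keepdims=True)) ** 2).sum() / (m * (n - 1))
lab_means = labs.mean(axis=1)
msb = n * ((lab_means - lab_means.mean()) ** 2).sum() / (m - 1)

s_r = np.sqrt(msw)                       # repeatability standard deviation
s_1 = np.sqrt(max(msb - msw, 0.0) / n)   # between-laboratory component
s_R = np.sqrt(s_r**2 + s_1**2)           # reproducibility standard deviation
```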
Notes
• The dataset used in this section can be found in the ﬁle Copper.
• Notice that n = 2 as we have duplicate determinations (see §4.3).
• There is more information about collaborative trials in §9.7
and 9.8.
4.7 ANOVA Application 4 — Sampling
and Analytical Variance
Key points
— Sampling almost always precedes analysis.
— Sampling introduces error into the final result. Because sampling
targets are heterogeneous, samples from the same target differ in
composition.
— ANOVA, applied to the results from a properly designed experiment,
can give useful estimates of the sampling standard deviation.
Another type of application involving sampling is where we want to
quantify the variance associated with sampling. Nearly all analysis requires
the taking of a sample, a procedure that itself introduces uncertainty into
the final result. Suppose that a routine procedure calls for the taking of a
sample of soil from a field by a carefully described method and the analysis
of the sample by another carefully described analytical method. We can
design a suitable experiment to estimate the separate variances associated
with sampling and analysis. We take a number of samples from the field
by the given procedure, but randomise the method each time. We then
analyse the samples in duplicate in a completely random order. This looks
like a similar experiment to that in §4.5, and the schematic for the
experiment (Fig. 4.7.1) is the same, but there is a difference: here we expect the
samples to vary in composition, because soil is often very heterogeneous.
So we are not really interested in a significance test — we can assume from
the start that the samples are different but we want to know by how much
they differ.
Fig. 4.7.1. A simple design for the determination of sampling variance and analytical
variance.
For an example we use the following results for cadmium (ppm). (Note:
cadmium levels are exceptionally high in this ﬁeld.)
Soil sample Result 1 Result 2
1 11.8 9.8
2 6.4 6.3
3 11.9 10.3
4 12.2 10.2
5 7.5 7.3
6 6.4 6.4
7 10.1 10.0
8 11.3 9.9
9 14.0 12.5
10 16.5 15.1
Figure 4.7.2 shows (as is to be expected) considerable differences between
the samples and rather less between the duplicate results on each sample.
There are no suspect data (i.e., seriously discrepant duplicate results
on a sample, or wildly outlying samples — as judged from the mean
result of the duplicates). There is no obvious reason to doubt the usual
assumptions.
December 23, 2010 13:33 9in x 6in Notes on Statistics and Data Quality for Analytical Chemists b1050ch04
Analysis of Variance (ANOVA) and Its Applications 59
Fig. 4.7.2. Duplicate results from the analysis of ten samples of soil from a ﬁeld.
(For Sample 6 the results coincide.)
Oneway ANOVA gives the following.
Box 4.7.1
Source of variation   Degrees of freedom   Sum of squares   Mean square   F       p
Between sample                 9              160.055          17.784     21.18   0.000
Between analysis              10                8.395           0.840
Total                         19              168.449
From the mean squares we calculate the analytical standard deviation as

    s_a = s₀ = √MSW = √0.84 = 0.92 ppm.

The sampling standard deviation is given by:

    s_s = s₁ = √[(MSB − MSW)/n] = √[(17.78 − 0.84)/2] = 2.9 ppm.

(Notice that n = 2 because we have duplicate analysis — see §4.3.)
Here we see, as often happens, that the sampling standard deviation
is substantially greater than the analytical standard deviation. We can see
whether these variances are ‘balanced’. The total standard deviation for a
combined operation of sampling and analysis is going to be

    s_tot = √(s_s² + s_a²) = √(2.91² + 0.92²) = 3.05.

The analytical variation makes hardly any contribution to this total
variation — the sampling variation dominates. If the analytical standard
deviation were much smaller at 0.46 (e.g., if the analytical method were twice as
precise as it is), the total standard deviation would be √(2.91² + 0.46²) = 2.94,
that is, hardly changed. There is no point trying to improve the precision
of the analytical method, because it would cost much more money with no
effective improvement in the uncertainty of the overall result. As a rule of thumb
analysts should try to get

    1/3 < σ_a/σ_s < 3.
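The arithmetic behind this rule of thumb is a simple quadrature sum; a minimal illustrative sketch in Python:

```python
import math

def total_sd(s_sampling, s_analytical):
    """Combined standard deviation of a result from sampling plus analysis."""
    return math.sqrt(s_sampling**2 + s_analytical**2)

def balanced(s_analytical, s_sampling):
    """Rule of thumb: the ratio sigma_a / sigma_s should lie between 1/3 and 3."""
    ratio = s_analytical / s_sampling
    return 1 / 3 < ratio < 3
```

With the values above, `total_sd(2.91, 0.92)` gives about 3.05, while halving the analytical standard deviation gives `total_sd(2.91, 0.46)` of about 2.94, hardly any change.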
Notes
• The dataset used in this section can be found in the ﬁle Cadmium.
• There is more information about sampling variance in Chapter 12.
4.8 More Elaborate ANOVA — Nested Designs
Key points
— Nested designs are used in combination with ANOVA when there are
two or more sources of measurement error.
— They are typically used by analytical chemists for orientation surveys
(method validation) and studying uncertainty from sampling.
Hierarchical (or ‘nested’) designs accommodate datasets that have more
than one independent source of error beyond the simple measurement error.
They have a range of applications in analytical science. In an example like
Fig. 4.8.1, we see that multiple fields have been sampled in duplicate, and
each sample has also been analysed in duplicate. From the results of such
an experiment, we can estimate three variances: the analytical variance
σ_a² (= σ₀²), the sampling variance σ_s² (= σ₁²) and the between-site variance
σ_site² (= σ₂²). In such an example, all three levels of variation can be regarded
as random.
Fig. 4.8.1. Balanced nested design for an ‘orientation survey’.
This is a classic design (known as an ‘orientation survey’) for validating
a measurement procedure comprising sampling plus analysis, and giving
some guidance to its fitness for purpose. For example, if the complete
measurement variance σ_a² + σ_s² is considerably smaller than σ_site², there is a
reasonable chance of differentiating between different sites by sampling and
analysis. The design also provides a more rugged estimate of the sampling
variance σ_s² than the design shown in §4.7, and is favoured for the study
of sampling uncertainty. This is because the estimate will be averaged over
a number of typical sites, rather than just one site, and so will be more
representative of sites in general.
As an example we consider the capabilities of protocols for analysis and
sampling proposed for a major survey of lead in playground dust. Ten
playgrounds were selected in an inner city area and were sampled in duplicate.
Each sample was then analysed in duplicate. The results were as follows
(ppm of lead in dried dust). In the column headings, S1A1 indicates the
first analytical result on the first sample, S1A2 the second analytical result
on the first sample, and so on. (These very high results refer to a period
before leaded petrol was banned.)
Playground S1A1 S1A2 S2A1 S2A2
1 714 719 644 602
2 414 387 482 499
3 404 357 380 408
4 759 767 711 636
5 833 777 748 788
6 621 602 532 520
7 455 472 498 389
8 635 708 694 684
9 589 609 591 606
10 783 764 857 803
Box 4.8.1 Analysis of variance for results for lead
Source of variation   Degrees of freedom   Sum of squares   Mean square    F       p
Between sites                  9              764525         MS₂ = 84947   22.58   0.000
Between samples               10               37613         MS₁ = 3761     3.94   0.004
Analytical                    20               19115         MS₀ = 956
Total                         39              821253
From the mean squares we can calculate the following.

    Analytical standard deviation = √MS₀ = √956 = 31 ppm.
    Between-sample standard deviation = √[(MS₁ − MS₀)/2] = √[(3761 − 956)/2] = 37 ppm.
    Between-site standard deviation = √[(MS₂ − MS₁)/4] = √[(84947 − 3761)/4] = 143 ppm.
In this instance the proposed methods would be able to diﬀerentiate
reasonably well between diﬀerent playgrounds of the type in the survey.
This can be seen in a plot of the results (Fig. 4.8.2).
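The three standard deviations can be extracted from any balanced nested ANOVA table in the same mechanical way. A hedged Python sketch (the divisors 2 and 4 reflect the duplicate analyses and duplicate samples of this particular design):

```python
import math

def nested_sds(ms0, ms1, ms2):
    """Standard deviations from a balanced two-level nested design with
    duplicate samples per site and duplicate analyses per sample.
    ms0: analytical MS, ms1: between-sample MS, ms2: between-site MS.
    Negative variance estimates are clamped to zero by convention."""
    s_analytical = math.sqrt(ms0)
    s_sample = math.sqrt(max(ms1 - ms0, 0.0) / 2)
    s_site = math.sqrt(max(ms2 - ms1, 0.0) / 4)
    return s_analytical, s_sample, s_site
```

`nested_sds(956, 3761, 84947)` reproduces (to rounding) the 31, 37 and 143 ppm quoted above.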
Fig. 4.8.2. Results from duplicated sampling and analysis of ten playgrounds,
distinguishing between results from first sample (solid circles) and second sample (open circles).
Notes
• The dataset used in this section can be found in the ﬁle Playground.
• There is another example of a nested experiment in §12.6.
4.9 Two-Way ANOVA — Crossed Designs
Key points
— Crossed designs are used when we want to study results classified by
two factors.
— There are relatively few applications of this technique in analytical
chemistry itself, but numerous examples using analytical data in
various application sectors.
Crossed designs are seldom used in analytical science as such, but the
following experiment serves as an example. An investigation into the loss of
weight on drying of a foodstuff subjected to different temperatures and
times of heating provided the data below (% loss), illustrated in Fig. 4.9.1.
There are two imposed sources of variation plus the random measurement
variation. Inspection suggests that there is little difference in weight loss
between one and three hours at any temperature, but that at 15 hours’
heating there is an extra weight loss at all three temperatures. Likewise,
there is little difference between corresponding results at 80° and 100°C,
but the loss is more marked at 120°C.
Fig. 4.9.1. Results showing weight loss at different temperatures and durations of
heating.
          1 hour   3 hours   15 hours
 80°C      7.57     6.61       8.02
           6.74     7.32       8.05
100°C      7.03     7.72       8.40
           6.55     7.59       8.65
120°C      9.04     8.48      10.99
           9.60     9.20      11.76

Two-way crossed ANOVA gives the following results.
Box 4.9.1 Analysis of variance of the moisture data
Source of variation   Degrees of freedom   Sum of squares   Mean square   F       p
Time                           2               9.3050          4.6525     28.60   0.000
Temp                           2              21.8284         10.9142     67.08   0.000
Time*Temp                      4               2.2619          0.5655      3.48   0.056
Error                          9               1.4643          0.1627
Total                         17              34.8596
This output is telling us that both temperature and time separately have
significant effects on the result of the drying, with p-values well below 0.05.
Moreover, as we have duplicated results in all cells of the table, ANOVA
can provide an estimate of the interaction between temperature and time
(Time*Temp in the table) and in fact finds that this interaction should
probably be regarded as significant with p = 0.056 (although the level of
confidence does not quite reach 95%). We can see that the interaction term
is largely due to the fact that the loss of weight at 15 hours for 120°C
is greater than would be predicted by adding the individual effects of
time and temperature. While providing a test of significance, the ANOVA
itself does little to help us — we have to use the diagram to see what
it means.
Note
• The dataset used in this section can be found in the ﬁle Moisture.
4.10 Cochran Test
Key points
— The Cochran test compares the variances for a number of datasets.
Its purpose is to determine if the largest variance is significantly larger
than the others.
— It is primarily used to test for uniformity before analysis of variance.
The Cochran test compares the highest individual variance with the sum of
the variances for all the datasets. If each dataset contains only two values
then the standard deviation (s) is replaced by the range (d) in the equation
below. The test statistic is calculated as:

    C = s²_max / Σᵢ₌₁ᵐ sᵢ²

where m is the number of groups. If the calculated value exceeds the critical
value the largest variance is considered to be inconsistent with the variances
of the other datasets. There are two parameters in this test, the number of
groups (m) and the degrees of freedom (ν = n − 1).
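The statistic itself is trivial to compute; an illustrative Python helper (the comparison with the tabulated critical value is left to the reader):

```python
def cochran_C(variances):
    """Cochran's test statistic: the largest variance divided by the sum
    of the variances of all m groups."""
    return max(variances) / sum(variances)
```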
An example using the Kjeldahl catalysts data from §4.4 gives the
following variances for the four catalysts.

    HgO        0.1915
    SeO₂       0.2721
    CuSO₄      0.1982
    CuO/TiO₂   0.4820

The test statistic is C = 0.4820/1.1438 = 0.4214. In this example m = 4 and
n = 10 (so ν = 9). The critical value from tables is 0.5017 so as 0.4214 is less
than this value, the null hypothesis is retained. The variance for catalyst
CuO/TiO₂ is not regarded as inconsistent with the other variances.
Notes
• Tables of critical values for this test are available both in textbooks and
online.
• Use of the Cochran test in collaborative trials can be found in §9.7
and 9.8.
4.11 Ruggedness Tests
Ruggedness is the capacity of a method to provide accurate results in
the face of minor variations in procedure such as might occur when the
procedure is undertaken in different laboratories. Subjecting an analytical
method to a collaborative (interlaboratory) trial (§4.6) is costly (around
£50,000 at 2010 prices), so it is important that the methods tested have no
unexpected defects. A ruggedness test comprises a relatively inexpensive
means of screening a method for such defects.
An analytical method is made up of a moderate number of separate
steps, carried out sequentially, each step carefully defined. However,
analysts are expected to use some judgement in the execution of a
prescribed method. For instance, if the procedure says ‘boil for one hour’,
most analysts would expect the method to be equally accurate if the actual
period was between 55 and 65 minutes, and act accordingly. Ruggedness
can be tested by subjecting each separate step in the method to plausible
levels of such variation and observing the measurement results. As there
are many stages, this may take some time. A very economical alternative
Table 4.11.1. Experimental design for a ruggedness test.
                      Experiment number
            1    2    3    4    5    6    7    8
Factor 1    1    1    1    1   −1   −1   −1   −1
Factor 2    1    1   −1   −1    1    1   −1   −1
Factor 3    1   −1    1   −1    1   −1    1   −1
Factor 4    1    1   −1   −1   −1   −1    1    1
Factor 5    1   −1    1   −1   −1    1   −1    1
Factor 6    1   −1   −1    1    1   −1   −1    1
Factor 7    1   −1   −1    1   −1    1    1   −1
Result     x₁   x₂   x₃   x₄   x₅   x₆   x₇   x₈
is to test variations in all of the steps simultaneously. A special design for
such an experiment has been developed by Youden.
The Youden design requires 2ⁿ experiments for testing up to 2ⁿ − 1
factors (i.e., the steps of the analytical procedure that are under test). A
widely useful size for analytical chemistry is eight experiments, which can
test up to seven factors. Each experiment comprises the method with a
particular combination of the factors at perturbed levels. In Table 4.11.1,
the lower perturbed levels are indicated by −1 and the higher levels by 1.
For instance, if the method protocol said ‘boil for one hour’, the respective
perturbed levels tested might be 50 and 70 minutes. The combinations
shown are a special subset of the 128 possible different combinations. The
result of each experiment is the apparent concentration of the analyte.
If the original method were completely rugged (completely insensitive to
the combinations of changes) the variation in the results would estimate the
repeatability standard deviation of the method. Given an appropriate choice
of perturbed levels and minor effects, the variation might be somewhat
larger.
The effect of a factor is estimated by the mean result for the higher level
minus the mean result for the lower (or vice versa — it doesn’t matter).
So for (say) Factor 2 the effect is

    (x₁ + x₂ + x₅ + x₆)/4 − (x₃ + x₄ + x₇ + x₈)/4.
The results for the higher level group and the lower level group contain
the eﬀects of each of the other six factors exactly twice, so the extraneous
eﬀects ‘cancel out’. The design ensures that this cancelling occurs with
every factor.
The effects of all the factors should be listed and compared. If there are no significant effects, they will all be of comparable magnitude. A significant effect would be much greater than the majority. The method is most effective if there are no significant effects or only one. That would be the expected outcome for a properly developed method. A minor ambiguity of interpretation would occur if there were an interaction between two (or more) of the factors. For example, an interaction between Factors 6 and 7 could be mistaken for (or masked by) a main effect due to Factor 1. Such interactions should be rare in a null test (i.e., with no main effects) or a test with only one significant main effect.
Example

The analysis under consideration is the determination of patulin (a natural contaminant) in apple juice. The critical factors in the analytical method and their perturbed levels are shown in Table 4.11.2. As there are six factors we need an eight-determination layout. The seventh factor (labelled ‘Dummy’) needed to bring the number of factors to the nominal seven can be regarded as the null operation, i.e., ‘Do nothing’. The perturbed levels were arranged as in Table 4.11.1, showing the conditions for the eight experiments and the corresponding analytical results.

Table 4.11.2. Result of the ruggedness test applied to a method for the determination of patulin.

                                             Experiment number
Procedure, quantity, unit              1     2      3     4     5     6     7     8
1. Extract with ethyl acetate,
   volume (ml)                        12    12     12    12     8     8     8     8
2. Clean-up with Na2CO3 solution,
   concentration (g/100 ml)          1.8   1.8    1.4   1.4   1.8   1.8   1.4   1.4
3. Clean-up duration (s)              80    40     80    40    80    40    80    40
4. Evaporate solvent, final
   temperature (°C)                   50    50     30    30    30    30    50    50
5. Dissolve residue in dilute
   acetic acid, volume (ml)         1.05  0.95   1.05  0.95  0.95  1.05  0.95  1.05
6. Determination by HPLC,
   injection volume (µl)             100    80     80   100   100    80    80   100
7. (Dummy)                             —     —      —     —     —     —     —     —
Analytical result, ppm              95.8  93.6  101.4  95.4  77.8  78.4  85.0  91.4

Fig. 4.11.1. Paired boxplots of the results for each factor (as numbered in Table 4.11.2). The unshaded boxes represent results for the lower perturbed levels, the shaded boxes the upper perturbed levels.
We can make a good appraisal of this outcome simply by observing the
results as boxplots paired for each factor (Fig. 4.11.1). We immediately see
that Factor 1 (volume of ethyl acetate) seems to indicate a signiﬁcant eﬀect,
Factor 2 (concentration of the cleanup reagent) a possible eﬀect and the
other factors no eﬀect. The dummy factor (No. 7) gives a useful simulation
of a null eﬀect.
We can also calculate the effects. As an example we consider the effect of Factor 1, that is:

    (95.8 + 93.6 + 101.4 + 95.4)/4 − (77.8 + 78.4 + 85.0 + 91.4)/4 = 13.4 ppm.
A complete list of factors and effects, in decreasing magnitude, is as follows.

Procedure                                                     Effect, ppm
1. Extract with ethyl acetate, volume (ml)                       13.4
2. Clean-up with Na2CO3 solution, concentration (g/100 ml)       −6.9
5. Dissolve residue in dilute acetic acid, volume (ml)            3.8
4. Evaporate solvent, final temperature (°C)                      3.2
7. (Dummy)                                                       −2.4
6. Determination by HPLC, injection volume (µl)                   0.5
3. Clean-up duration (s)                                          0.3
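These effect calculations are easy to script. The sketch below (plain Python; the variable names and layout are ours, not from the book's files) recomputes every effect from the design in Table 4.11.1 and the eight analytical results.

```python
# Youden ruggedness test: effect of each factor from the eight-run design.
# Design matrix rows follow Table 4.11.1 (+1 = higher perturbed level).
design = [
    [+1, +1, +1, +1, -1, -1, -1, -1],  # Factor 1
    [+1, +1, -1, -1, +1, +1, -1, -1],  # Factor 2
    [+1, -1, +1, -1, +1, -1, +1, -1],  # Factor 3
    [+1, +1, -1, -1, -1, -1, +1, +1],  # Factor 4
    [+1, -1, +1, -1, -1, +1, -1, +1],  # Factor 5
    [+1, -1, -1, +1, +1, -1, -1, +1],  # Factor 6
    [+1, -1, -1, +1, -1, +1, +1, -1],  # Factor 7 (dummy)
]
results = [95.8, 93.6, 101.4, 95.4, 77.8, 78.4, 85.0, 91.4]  # ppm

def effect(levels, x):
    """Mean result at the higher level minus mean at the lower level."""
    hi = [xi for li, xi in zip(levels, x) if li == +1]
    lo = [xi for li, xi in zip(levels, x) if li == -1]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

effects = [round(effect(row, results), 1) for row in design]
print(effects)  # [13.4, -6.9, 0.3, 3.2, 3.8, 0.5, -2.4]
```

Factor 1 gives 13.4 ppm and the dummy factor −2.4 ppm, in agreement with the listed effects.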
The repeatability standard deviation of the method was separately determined to be 8 ppm at this concentration. Each effect is a difference between two means of four results, so the expected standard deviation of a null effect should be about 8√(1/4 + 1/4) = 8/√2 ≈ 5.7 ppm. This shows that the
volume of ethyl acetate probably has a signiﬁcant eﬀect and should be
more carefully controlled. None of the other factors are signiﬁcant on that
basis. As there is only one signiﬁcant eﬀect, the possibility of confounding
interactions can be ignored in this context.
Further reading
• Youden, W.J. and Steiner, E.H. (1975). Statistical Manual of the AOAC, AOAC International, Washington DC. ISBN 0935584153.
Chapter 5
Regression and Calibration
Linear regression is the natural approach to analytical calibration:
it is a method of fitting a line of best fit (in some defined sense) to
calibration data. Regression is capable of telling us nearly all that
we need to know about the quality of a well-designed calibration
dataset, including the likely uncertainty in unknown concentrations
estimated thereby. It also has an application in comparing the results
of different analytical methods.
5.1 Regression
Key points
— Regression is a method for ﬁtting a line to experimental points.
— Regression uses experimental points (x_i, y_i), (i = 1, . . . , n), and these points are taken as fixed numbers when the experiment is complete.
— Linear regression makes use of a specific set of assumptions about the data, known as the ‘model’, as follows: (a) there is an unknown true functional relationship y = α + βx; (b) the x-values are fixed by the experimenter; (c) the y-values are experimental results and therefore subject to variation under repeated measurement; and (d) in simple regression a single unknown variance σ² describes the variation of the y-values around the true line.
Regression is sometimes loosely described as ‘ﬁtting the best straight line’
to a set of points. There are in fact a number of ways of ﬁtting a line to such
a set of points, and which method is ‘best’ depends on both the intentions of
the scientist and the nature of the data. Simple ‘least-squares’ regression is
perhaps the method most widely used, and one that has especial relevance for calibration in analytical chemistry. Simple linear regression is based on a specific model of the data. It is worth making the effort to understand how it works because, if misapplied, it can provide misleading results.
First, we assume that there is a true linear relationship between two
variables x and y, namely y = α+βx, which represents a straight line with
slope β and intercept α. (The intercept is the value of y when x is zero.)
Second, we assume that n values of the x-variable x_1, x_2, x_3, . . . , x_i, . . . , x_n are fixed exactly by the experimenter. At each value x_i the corresponding value y_i is measured. As y_i is the result of a measurement, its value will not fall exactly on the line y = α + βx, and will be different each time the measurement is repeated. (Remember that the x-values are exactly set: they are not the results of measurements.) Third, we assume that y_i is normally distributed with a mean value of α + βx_i and a variance of σ². This model can be written as

    y_i = α + βx_i + ε_i,    ε_i ~ N(0, σ²).
The model is illustrated in Fig. 5.1.1. For the marked x-value, the corresponding y-value will be somewhere within the range of the indicated normal distribution. The point is more likely to fall closer to the centre of the distribution (on the line) than in the tails. Other x-values give rise to corresponding y-values, each one independently distributed around the line as N(0, σ²).

Fig. 5.1.1. Model used for linear regression. The x-values are taken as fixed and the y-values subject to measurement variation.

When the scaffolding of the model is stripped away, we are left with the bare experimental points (Fig. 5.1.2). We have no information about the true relationship (even whether it is a straight line), nor the size of σ². The task of regression is to estimate the values of α and β as best we can, and check that these estimated values provide a plausible model of the data.

Fig. 5.1.2. A set of experimental points produced under the linear regression model (same points as in Fig. 5.1.1).
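The model lends itself to a quick simulation. The sketch below (Python; the values of α, β, σ and the x-values are arbitrary choices of ours, purely for illustration) generates one set of ‘experimental’ points of the kind shown in Fig. 5.1.2.

```python
import random

# Simulate data under the linear regression model y_i = alpha + beta*x_i + e_i,
# with e_i ~ N(0, sigma^2). alpha, beta and sigma are illustrative values only.
alpha, beta, sigma = 75.0, 496.0, 190.0
xs = [0, 2, 4, 6, 8, 10]           # x-values fixed by the experimenter

random.seed(1)                      # for a reproducible 'experiment'
ys = [alpha + beta * x + random.gauss(0.0, sigma) for x in xs]

# Re-running the list comprehension gives a new random set of y-values each
# time (the situation depicted in Fig. 5.2.3); the x-values never change.
for x, y in zip(xs, ys):
    print(f"x = {x:2d}   y = {y:8.1f}")
```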
5.2 How Regression Works
Key points
— The regression line y = a + bx is calculated from the points (x_i, y_i) by the method of least squares, that is, by finding the minimum value of the sum of the squared residuals.
— The regression coefficients are given by:

      b = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²,    a = ȳ − b x̄.

— a and b are estimates of (but not identical with) the respective α and β.
— Regression is not symmetric: strictly speaking, we cannot exchange the roles of x and y in the equations.
Imagine that a straight line y = a + bx is drawn through the data. The line could be regarded as a ‘trial fit’ (Fig. 5.2.1). We need to adjust the values of a and b in this equation until the line is a ‘best fit’ in some defined sense. The ‘fitted points’ ŷ_i fall exactly on the line vertically above the points x_i, so that we have ŷ_i = a + bx_i. (In statistics, the notation ŷ_i [spoken as ‘y-hat’] implies that the quantity y_i is an estimate.) The residuals are defined as r_i = y_i − ŷ_i. We define the regression line by finding values of a and b that provide the smallest possible value of the sum of the squared residuals, Q. Thus we have

    Q = Σ_i r_i² = Σ_i (y_i − ŷ_i)² = Σ_i (y_i − a − bx_i)².

Fig. 5.2.1. Key aspects of regression, showing a trial fit with fitted values and the residuals. Same points as Fig. 5.1.1.

Fig. 5.2.2. Regression line (solid line) for the data shown in Figs 5.1.2 and 5.2.1. The ‘true line’ used to generate the data is shown dashed.
Fig. 5.2.3. Regression lines from repeated experiments under the conditions shown in Fig. 5.1.1, that is, with the same x-values, but a different random selection of y-values each time.

(Remember: in this equation the x_i and y_i are constants fixed by the experiment, and here we are treating a and b as variables.) We find the minimum value of Q by setting the first derivatives equal to zero, that is,

    ∂Q/∂a = 0,    ∂Q/∂b = 0.

Solving these two simultaneous equations gives expressions for a and b, namely

    b = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²,    a = ȳ − b x̄.
Notes
• The values of a and b are not the same as the respective unknown α and β (Fig. 5.2.2), but they are unbiased: if the experiment were repeated many times (that is, with new random y_i values each time, as in Fig. 5.2.3), the average values of a and b would tend towards α and β.
• The procedure is called the ‘regression of y on x’. It is not symmetric: we get a different line if we exchange x and y in the equations for a and b. That would be an incorrect line, because the assumption that the x-values were error-free would be violated.
• x is called the ‘independent variable’ or the ‘predictor variable’. y is called the ‘dependent variable’ or the ‘response variable’.
• The procedure outlined above is called the ‘method of least squares’. There are other procedures for fitting a line to data, but least squares is simple mathematically and meets most requirements so long as the data are produced in well-designed experiments.
• Always make the x, the error-free independent variable, the horizontal axis in x–y scatter plots.
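Translated into code, the least-squares formulas look like this (a minimal Python sketch; the function name is ours, not from the text):

```python
# Least-squares estimates for the simple linear regression y = a + b*x.
def fit_line(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # b = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2);  a = ybar - b*xbar
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Quick check on points lying exactly on y = 3 + 2x: recovers a = 3, b = 2.
a, b = fit_line([0, 1, 2, 3], [3, 5, 7, 9])
print(a, b)  # 3.0 2.0
```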
5.3 Calibration: Example 1
Key points
— Regression is suited to estimating calibration functions because we
can usually regard the concentrations as ﬁxed and the responses as
variables.
— Unknown concentrations can be estimated from the transformed
calibration function.
The process of calibrating an analytical instrument, by measuring the
responses obtained from solutions of several known concentrations of the
analyte, conforms closely to the assumptions of regression. The independent
variable is the concentration of the analyte, accurately known because the concentration is determined by gravimetric and volumetric operations only. It is a reasonable presumption that the concentrations of the analyte in the calibration solutions are known with a small relative uncertainty, often as good as 0.001. The dependent variable is the response of the instrument, and that is seldom known with a relative uncertainty less than 0.01.

Fig. 5.3.1. Calibration data (points) and regression line for manganese.
Consider the following data obtained by constructing a short-range calibration for manganese determined by inductively-coupled plasma atomic emission spectrometry. Here we are making the responses the dependent variable y, and concentration the independent variable x. The regression line (Fig. 5.3.1) is given by the function y = 75 + 496x and, visually, seems like a good fit to the data. We can use this function to estimate unknown concentrations in any other matrix-matched solution by inverting the equation to give x = (y − 75)/496. A test solution providing a response of (say) 3000 units would indicate a concentration of manganese of 5.9 µg l⁻¹.
We can also use the b value (the slope of the line) to convert response data for estimating the detection limit (according to the simple definition in §9.6). If we record n > 10 repeated responses when the concentration of the analyte is zero (that is, with a blank solution) and calculate the standard deviation s of these responses, the detection limit is given by c_L = 3s/b. Other important features of the calibration function can be tested after regression. In practice we would need to check the validity of various assumptions underlying the regression and these items are considered in the following sections.
Concentration of Mn, ppb Response, arbitrary units
0 114
2 870
4 2087
6 3353
8 3970
10 4950
Note
The dataset is available in the ﬁle named Manganese1.
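A short script confirms the quoted coefficients (Python sketch; the blank standard deviation used for the detection limit is a hypothetical value, not from the text):

```python
# Calibration line for the manganese data, then inversion for an unknown.
conc = [0, 2, 4, 6, 8, 10]                      # ppb (µg/l)
resp = [114, 870, 2087, 3353, 3970, 4950]       # arbitrary units

n = len(conc)
xbar, ybar = sum(conc) / n, sum(resp) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(conc, resp)) / \
    sum((x - xbar) ** 2 for x in conc)
a = ybar - b * xbar
print(round(a, 1), round(b, 2))                 # 75.5 496.37

# Invert the calibration function to estimate an unknown concentration.
x_unknown = (3000 - a) / b
print(round(x_unknown, 1))                      # 5.9 (µg/l)

# Detection limit from blank responses: c_L = 3*s/b, where s is the standard
# deviation of n > 10 blank responses (60 units here is a made-up example).
s_blank = 60.0
print(round(3 * s_blank / b, 2))                # c_L in µg/l
```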
5.4 The Use of Residuals
Key points
— Examining a plot of the residuals is an essential part of using
regression.
— If there is no lack of ﬁt between the observations and the ﬁtted model,
the residuals should resemble a random sample of independent values
from a normal distribution.
If the assumptions of regression are fulfilled and the measured data are truly derived from a straight line function, then we expect the residuals to behave very like a random selection from a normal distribution centred on zero. The variance of the residuals s²_{y·x} is given by

    s²_{y·x} = Σ_i (y_i − ŷ_i)² / (n − 2).

Note that this expression is similar to the ordinary expression for variance, except that here the deviations are measured from the fitted values ŷ_i (instead of from the mean ȳ) and there are now n − 2 degrees of freedom. (There are n − 2 degrees of freedom because we have calculated two statistics (a and b) from n pairs of observations.) This variance s²_{y·x} estimates σ² (see §5.1) if the regression line is a good fit to the data. The standard deviation of the residuals is of course the square root of the variance. So if we divide the residuals by s_{y·x}, these ‘scaled residuals’ should resemble a sample from the standard normal distribution N(0, 1). This provides a useful method of checking visually whether the line produced by regression is an acceptable fit and whether the data plausibly conform to the assumptions underlying regression.
For the manganese calibration data we find that s_{y·x} = 190 and obtain Figs. 5.4.1 and 5.4.2 when (a) the residuals and (b) the scaled residuals are plotted against the x-values. The pattern of the residuals is seen to correspond to results that fall below or above the regression line. In other instances the deviations from the line may be too small to see on the plot of response against concentration, but they will always be apparent in the residual plot. In this case, the successive residuals could plausibly be a random selection from a normal distribution, so we conclude that there is no reason to suspect any lack of fit. In other words, linear regression has provided a line that fits the data well. It is somewhat easier to see this from the scaled residuals, because all of the values fall within bounds of about ±1.6. In a genuine N(0, 1) distribution, we expect about 90% of data to fall within these limits on average.

Fig. 5.4.1. Residuals from the linear regression of the manganese calibration data.

Fig. 5.4.2. Scaled residuals from the linear regression of the manganese calibration data.
Producing and considering residual plots is an essential part of using
regression. It helps to avoid using inappropriate methods and making
incorrect decisions. However, it is important not to over-interpret these plots when the number of data points is small, as in most examples from
analytical calibration. Patterns indicating lack of ﬁt or other problems (see
§5.5) must deviate strikingly from a random, independent sample from
a normal distribution before lack of ﬁt is inferred from small numbers of
residuals. Where possible, such inferences should be backed up by numerical
tests of signiﬁcance (see §5.10, 5.11).
Notes
• The dataset is available in the ﬁle named Manganese1.
• The least squares calculation ensures that the mean of the residuals is
exactly zero, so the residuals are not quite independent.
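The residual diagnostics are equally easy to reproduce (Python sketch, refitting the manganese data from §5.3 rather than reading any data file):

```python
# Residuals and scaled residuals for the manganese calibration.
conc = [0, 2, 4, 6, 8, 10]
resp = [114, 870, 2087, 3353, 3970, 4950]
n = len(conc)

xbar, ybar = sum(conc) / n, sum(resp) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(conc, resp)) / \
    sum((x - xbar) ** 2 for x in conc)
a = ybar - b * xbar

fitted = [a + b * x for x in conc]
residuals = [y - f for y, f in zip(resp, fitted)]

# s_{y.x}: residual standard deviation with n - 2 degrees of freedom.
s_yx = (sum(r * r for r in residuals) / (n - 2)) ** 0.5
print(round(s_yx))                         # about 190

scaled = [r / s_yx for r in residuals]
print([round(z, 2) for z in scaled])       # all within about +/-1.6
```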
5.5 Suspect Patterns in Residuals

Key points

— There are several ways in which a residual plot can deviate from that expected for a good fit.
— There are diagnostic patterns for outliers, lack of fit and non-uniform variance.

Where plots of residuals show patterns that deviate strongly from a random normal sequence, it is likely that the original data deviate from the assumptions of the regression model used, and that the results of the regression might be misleading. A different model might provide a more accurate and useful outcome. There are several patterns that analytical chemists should be aware of.
The first type of pattern (Figs. 5.5.1, 5.5.2) occurs when there is an outlying point among the data. This is apparent as a residual with a standardised value of 2.2. (This example is only marginally outlying.) The remaining residuals seem acceptable, although there is a negative mean. An outlier can bias the regression by drawing the regression line towards itself and thus may provide misleading information (§5.11). If they can be clearly identified as such, outliers should be deleted from calibration data. Statistical testing for outliers in calibration data is not simple, but applying Dixon’s test (§7.3) or Grubbs’ test (§7.4) to the residuals would probably provide a reasonable guide for testing a single suspect data point.
The second type of pattern occurs when the regression model does not fit the data properly. In the instance shown (Figs. 5.5.3, 5.5.4) a linear regression has been applied to data with a curved trend, and the residuals show a corresponding bow-shaped pattern. It is important to avoid this type of lack of fit in calibration, because of the relatively large discrepancy at low concentrations between the regression line and the true trend of the data. The occurrence is dealt with by the use of a more complex regression model, such as a polynomial (§6.3, 6.4) or a non-linear model (§6.9, 6.10). The pattern may be unconvincing (but also prone to over-interpretation) if only a small number of points are represented. Statistical tests for non-linearity are therefore a useful adjunct to residual plots, and are discussed below (§5.10).

Fig. 5.5.1. Data and regression line with a suspect point at x = 5.

Fig. 5.5.2. Standardised residuals from regression line in Fig. 5.5.1, showing a single suspect point at x = 5.

Fig. 5.5.3. Data and linear regression line showing lack of fit due to curvature in the true relationship. The lack of fit at very low concentrations could be seriously misleading.

Fig. 5.5.4. Standardised residuals showing a strongly bowed pattern, indicating systematic lack of fit of the data to the regression line.

Fig. 5.5.5. Heteroscedastic data and regression line giving rise to residuals that tend to increase with x.

Fig. 5.5.6. Standardised residuals showing a heteroscedastic tendency: the residuals tend to increase with x. (They also suggest a possibly low bias in the regression line below about x = 3.)
The third type of suspect residual pattern shows residuals that vary in size with the value of the independent variable. In the example illustrated (Figs. 5.5.5 and 5.5.6) there is a tendency for the residuals to increase in magnitude with increasing x. This phenomenon is called ‘heteroscedasticity’. The tendency is contrary to the assumption of simple regression that the variance σ² of the y-values is constant across the range of the independent variable x. The feature is common in analytical calibrations unless the range of concentration is quite small (for example, 1–2 orders of magnitude of concentration above the detection limit). Where several orders of magnitude of concentration are covered in a calibration set, heteroscedasticity may be pronounced. If simple regression (i.e., as discussed so far) is used on markedly heteroscedastic data, the resulting line will have a magnified uncertainty towards zero concentration (§6.7). This may have a disastrous effect on the apparent detection limit and the accuracy of results in this region (see Fig. 6.8.3). The correct approach to heteroscedastic data is to use a statistical technique called weighted regression (§6.7). As with other suspicious patterns in residuals, there is a tendency among the inexperienced to see random data as patterns when the number of points is small. The complication of using weighted regression is best avoided unless the need for it is completely clear.
5.6 Eﬀect of Outliers and Leverage Points on Regression
Key points
— Outliers and leverage points can adversely aﬀect the outcome of
regression.
— They are often easy to deal with or avoid in analytical calibration.
There are two kinds of suspect data points that can adversely affect the outcome of regression, namely outliers and leverage points. Outliers are essentially anomalous in the value of the dependent variable, that is, in the y-direction in a graph (Fig. 5.6.1). They have the effect of drawing the fitted line towards the outlying point and thus rendering it unrepresentative of the other (valid) points. Because of this, regression should never be undertaken without a prior visual appraisal of the data or a retrospective residual plot. Extreme outliers will be immediately obvious and should be removed from the dataset. Marginal outliers are more difficult to deal with. Some statistical software packages give an indication of which points could reasonably be regarded as outliers. In any event, practitioners must avoid deleting the point with the largest residual without careful thought. That would always reduce the variance of the residuals, but would not improve the fit unless the point is a genuine outlier. In calibration it will normally be possible for the analyst quickly to check the data or the calibrators for mistakes, so it is probably better to repeat the whole calibration if an outlier is encountered.

Leverage points are anomalous in respect of the independent variable (Fig. 5.6.2). Because of their distance from the other points they can draw the fitted line towards them. Even if they are close to the same trend as the rest of the points, they will have an undue influence on the regression line. Again, leverage points should be treated with caution. They can easily be avoided in calibration. Again, some statistical software will identify points that have undue leverage.

Fig. 5.6.1. Effect of an outlier on regression. When one of the original points (solid circles) is moved to an outlying position (open circle), the original regression line (solid) moves to an unrepresentative position (dashed line).

Fig. 5.6.2. Effect of a leverage point on regression. The well-spaced points (solid circles) give rise to the original regression line (solid). The extra leverage point (open circle) draws the regression line (dashed) to an unrepresentative position.
5.7 Variances of the Regression Coeﬃcients: Testing the
Intercept and Slope for Signiﬁcance
Key points
— The variances of the regression coeﬃcients can be calculated from the
data.
— They can be used to conduct signiﬁcance tests on the coeﬃcients.
Because the y-values in regression are variables (that is, subject to random error of measurement), the estimates a and b are also variables, and it is of interest to note how large that variability is. This gives us information about how much reliance we can place on the calculated values of a and b. We can obtain the required statistics from the data itself. Thus we have an expression for the variance of the slope b, namely:

    var(b) = s²_{y·x} / Σ_i (x_i − x̄)²,

and for the intercept a:

    var(a) = var(b) Σ_i x_i² / n.

The standard errors se(b) and se(a) are simply the respective square roots of these variances. Various null hypotheses about the coefficients can be tested for significance by using these standard errors.
We can test whether there is evidence to discount the idea that a calibration line plausibly passes through the origin (that is, that the instrumental response is zero when the concentration of the analyte is zero). For that purpose we consider the null hypothesis H_0: α = 0 by calculating the value of Student’s t, namely t = (a − α)/se(a) = a/se(a), and the corresponding probability. A sufficiently low value (say p < 0.05) should convince us that we can safely reject the null hypothesis. (Note: a test for H_0: α = 0 is pointless unless there is a good fit between the data and the regression line: for example, Figs. 5.5.3 and 5.11.3 show a lack of fit situation where the intercept of the regression line differs obviously from the true trend of the data.)
Hypotheses about the slope b can likewise be tested by calculating t = (b − β)/se(b) and the corresponding probability under various hypotheses about β. If we consider H_0: β = 0, we are asking if there is any relationship at all between x and y, that is, whether the slope is zero (where the value of y is unaffected by the value of x). That circumstance is irrelevant in calibration, but might arise in exploratory studies of data, where we want to see which (if any) of a large number of possible predictors has an effect on the response (§6.5, 6.6). In calibration it is possible that we might want to compare an experimental value of b with a literature value b_lit, in which case we have H_0: β = b_lit. Finally, we might consider H_0: β = 1 in studies of bias over extended ranges, but in such instances we have to take care that the assumptions of regression are not seriously violated (see §5.12, 5.13).
For our manganese calibration data (§5.3) we ﬁnd the following.
Coefficient     Value     Standard error      t        p
Intercept a     75.5      137.7               0.55     0.613
Slope b         496.37    22.74               21.83    0.000
The high p-value for the intercept means that there is no evidence to reject the idea that the true calibration line passes through the origin. A very low p-value for the slope is inevitable in analytical calibration and is of no inferential value.
Notes
• The data used can be found in the ﬁle named Manganese1.
• Most computer packages give t-values and corresponding values of p alongside the estimates of the regression coefficients. The t-value usually relates to the null hypothesis that the respective coefficient has a zero population mean.
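These standard errors and t-values can be checked directly (Python sketch; computing the p-values needs the t-distribution and is left to a statistics package):

```python
# Standard errors of the regression coefficients for the manganese data.
conc = [0, 2, 4, 6, 8, 10]
resp = [114, 870, 2087, 3353, 3970, 4950]
n = len(conc)

xbar, ybar = sum(conc) / n, sum(resp) / n
sxx = sum((x - xbar) ** 2 for x in conc)
b = sum((x - xbar) * (y - ybar) for x, y in zip(conc, resp)) / sxx
a = ybar - b * xbar

residuals = [y - (a + b * x) for x, y in zip(conc, resp)]
var_yx = sum(r * r for r in residuals) / (n - 2)

var_b = var_yx / sxx                           # var(b) = s^2 / sum((xi - xbar)^2)
var_a = var_b * sum(x * x for x in conc) / n   # var(a) = var(b) * sum(xi^2) / n

se_b, se_a = var_b ** 0.5, var_a ** 0.5
print(round(se_a, 1), round(se_b, 2))          # 137.7 22.74
print(round(a / se_a, 2), round(b / se_b, 2))  # t-values: 0.55 21.83
```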
5.8 Regression and ANOVA
Key point
— In regression, the variance of the yvalues can be analysed into
a component attributed to the regression and a residual component.
In regression the variation among the y-values can be split between the component due to the regression and that due to the residuals. This enables us to compare the relative magnitude of the components and make deductions about the success of the regression. The overall variance of the y-values is given by the normal expression for variance, namely:

    Σ_i (y_i − ȳ)² / (n − 1).

The variance of the residuals we have seen (§5.4) is

    Σ_i (y_i − ŷ_i)² / (n − 2).

For the variance due to the regression (that is, of the fitted values ŷ_i around ȳ), there is a denominator of one because there is only one degree of freedom remaining, namely (n − 1) − (n − 2) = 1, so the variance is simply Σ_i (ŷ_i − ȳ)². Computer packages provide an ANOVA table composed as follows.
Source of      Degrees of    Sum of               Mean square
variation      freedom       squares              (variance)                    F
Regression     1             Σ_i (ŷ_i − ȳ)²       Σ_i (ŷ_i − ȳ)²                Regression MS / Residual MS
Residuals      n − 2         Σ_i (y_i − ŷ_i)²     Σ_i (y_i − ŷ_i)² / (n − 2)
Total          n − 1         Σ_i (y_i − ȳ)²       Σ_i (y_i − ȳ)² / (n − 1)
The value of F can be used as another test of a significant relationship between y and x, and is mathematically equivalent to testing H_0: β = 0 with a t-test. Another statistic provided by ANOVA is designated R² and is the ratio of the regression sum of squares to the total sum of squares (usually expressed as a percentage). This can be thought of as the proportion of the variation in the y-values that is accounted for by the regression. It is numerically equal to the square of the correlation coefficient (§5.9) between the y-values and the fitted values. It can be used as a rough guide to the success of the regression, because a value approaching 100% means that most of the variation has been accounted for. But the statistic must be treated with caution: a model providing (say) a 99% value of R² is not necessarily better than one that provides a value of 95%, for the same reasons that apply to the correlation coefficient.

The example calibration data for manganese (§5.3) provides the following ANOVA table.

Source of      Degrees of    Sum of       Mean square
variation      freedom       squares      (variance)      F        p
Regression     1             17246922     17246922        476.3    0.0000
Residuals      4             144830       36207
Total          5             17391751
We see that R² = 99.2%. Such a high value is usual in analytical calibration unless a grossly inappropriate model has been used. Likewise, the high F-value and the corresponding small probability are typical of calibration and of no real interest.
5.9 Correlation
Key points
— Correlation is a measure of the relationship between two variables.
— A perfect linear relationship provides a correlation coeﬃcient of exactly
1 or −1.
— The correlation coeﬃcient is not a reliable indicator of linearity in
calibration.
Correlation is a measure of the relationship between two variables. It is related
to, but distinct from, regression, and it is prone to misinterpretation unless
great care is taken. Valid inferences that can be made with the help of
the correlation coefficient are few in analytical science, and the statistic
is best avoided. This section is essentially a warning. Unfortunately, some
computer packages routinely provide the correlation coefficient as a by-product
of regression, which gives it a false appearance of applicability.
The correlation coefficient is defined as

r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)² ].
The value of r must fall between ±1, regardless of the actual x and yvalues.
Unlike regression, it is symmetric in x and y, the identical value of r being
produced if the roles of x and y are interchanged in the equation.
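A short sketch (with invented data) confirms this symmetry: computing r with the roles of x and y swapped gives the identical value.

```python
# Invented data to illustrate the symmetry of r in x and y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def corr(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    num = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    den = (sum((ui - u_bar) ** 2 for ui in u)
           * sum((vi - v_bar) ** 2 for vi in v)) ** 0.5
    return num / den

r_xy = corr(x, y)
r_yx = corr(y, x)  # identical: the formula is symmetric in x and y
```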
When there is no relationship between the variables, that is, the y-values
do not depend on the x-values in any way, the correlation coefficient will
be zero. When the points fall exactly on a straight line the coeﬃcient takes
a value of +1 for lines with a positive slope or −1 for lines with a negative
slope. For intermediate situations (points scattered at a greater or lesser
Fig. 5.9.1. Scatterplot of 20 points
with zero correlation.
Fig. 5.9.2. Scatterplot of 20 points
with a correlation coeﬃcient of 0.80.
Fig. 5.9.3. Scatterplot of 20 points
with a correlation coeﬃcient of 0.95.
Fig. 5.9.4. Scatterplot of 20 points
with a correlation coeﬃcient of 0.99.
distance from a straight line) the coefficient takes intermediate values, 0 < |r| < 1. Some
examples are shown in Figs. 5.9.1–5.9.4.
Some problems of interpreting r are as follows.
• Outliers have a strong eﬀect on the value. In Fig. 5.9.5 we see the same
data as in Fig. 5.9.2 but with an outlier added. The coeﬃcient has
increased from 0.80 to 0.92, despite the fact that the outlier does not
lie on the same trend as the original data.
• Points with an exact functional relationship that is not linear do not
necessarily give values of r above zero (Fig. 5.9.6).
• While points very close to a straight line provide a coeﬃcient of almost 1
(or −1) by deﬁnition, the converse is not true. Points with r ≈ 1 do not
have to be scattered around a straight line. This kind of ambiguity is
illustrated in Figs. 5.9.7 and 5.9.8, where points with a straight tendency and
a curved tendency provide identical values of r. This ambiguity extends
to values of r that are even closer to unity.
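The second point in the list is easy to demonstrate: for x-values symmetric about zero, the exact relationship y = x² gives r = 0, because the positive and negative cross-products cancel in the numerator. A minimal sketch:

```python
# An exact functional relationship (y = x squared) with zero correlation:
# the x-values are symmetric about zero, so the cross-products cancel.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
den = (sum((xi - x_bar) ** 2 for xi in x)
       * sum((yi - y_bar) ** 2 for yi in y)) ** 0.5
r = num / den  # exactly zero despite the perfect functional relationship
```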
Fig. 5.9.5. Scatterplot of data with an outlier, which tends to increase the
correlation coefficient.
Fig. 5.9.6. Bivariate dataset with an
exact functional relationship but zero
correlation.
Fig. 5.9.7. Data with a linear trend and
a correlation coeﬃcient of 0.95.
Fig. 5.9.8. Data with a nonlinear trend
and a correlation coeﬃcient of 0.95.
All of this shows that r has little to commend it in the context of
chemical measurement, and that a scatterplot is much more informative
and nearly always to be preferred. It is particularly unfortunate when r
is used as a test for suﬃcient linearity in calibration graphs in analytical
science. It is quite possible to have a calibration with r = 0.999 and still
demonstrate signiﬁcant lack of ﬁt to a linear function (§5.11). Moreover,
it is dangerous to compare diﬀerent values of r. For example, it would be
wrong to say that a dataset with r = 0.999 was more linear than a set with
r = 0.99. (Other tests for lack of ﬁt are statistically sound: for example,
the pure error test [§5.10] and polynomial ﬁtting.)
Note
• The use of r for the correlation coeﬃcient must not be confused with the
same symbol used for ‘residual’ or for repeatability.
5.10 A StatisticallySound Test for Lack of Fit
Key points
— Lack of ﬁt between data and a regression line can be studied by making
an independent estimate of the ‘pure error’.
— This requires replicated responses to be recorded.
Calibration lines that are straight or very close to straight occur often in
analytical measurement, and such a calibration is highly beneﬁcial in that it
tends to reduce the uncertainty of predicted concentrations as well as being
easiest to ﬁt. Consequently, there is a tendency for analytical chemists to
assume linearity and ignore small deviations from it. In many instances
that is completely justiﬁed. However, ﬁtting a straight line to data that
have a curving trend will produce errors that are likely to be serious at the
bottom end of the calibration. It is therefore of considerable importance to
check whether calibration lines actually are straight, and if they are not,
to determine the magnitude of the discrepancy. As a result, tests for lin
earity are written into procedures for method validation in many analytical
sectors. Unfortunately, these tests are mostly based on the correlation coef
ﬁcient and therefore are statistically unsound (see §5.9). Here we consider
a test for lack of ﬁt that is both sound and suitable for method validation.
It requires an independent estimate of σ², which can be obtained by
replicating some or all of the response measurements.

A general scheme for the independent estimation of σ² (called the ‘pure
error mean square’ MS_PE) is given below. Suppose there are m different
concentrations and the response at each concentration is measured n times.
Then we have the following data layout from which we can calculate sums
of squares and degrees of freedom.
Concentration   Repeated responses                 Sum of squares               Degrees of freedom
x_1             y_11, . . . , y_1i, . . . , y_1n   SS_1 = Σᵢ(y_1i − ȳ_1)²       n − 1
  ⋮                         ⋮                                ⋮                     ⋮
x_j             y_j1, . . . , y_ji, . . . , y_jn   SS_j = Σᵢ(y_ji − ȳ_j)²       n − 1
  ⋮                         ⋮                                ⋮                     ⋮
x_m             y_m1, . . . , y_mi, . . . , y_mn   SS_m = Σᵢ(y_mi − ȳ_m)²       n − 1
Totals          —                                  SS_PE = Σⱼ SS_j              m(n − 1)
Then we have:

MS_PE = SS_PE / (m(n − 1)).

The sum of squares for lack of fit is the residual sum of squares SS_RES
minus the pure error sum of squares, so the mean square due to lack of
fit is

MS_LOF = (SS_RES − SS_PE) / (m − 2).

The test statistic is

F = MS_LOF / MS_PE,

which has (m − 2) and m(n − 1) degrees of freedom.
This looks a bit formidable, but the calculations are quite straightforward
and will always be carried out by computer.
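As a sketch of the whole procedure (a hypothetical helper, not code from the book), the straight-line fit, pure error sum of squares and lack-of-fit F statistic can be computed as follows; the call at the end applies it to the duplicated manganese data of §5.11.

```python
def lack_of_fit(x_levels, replicates):
    """Pure-error lack-of-fit test for a straight-line fit.

    x_levels:   the m distinct concentrations
    replicates: for each concentration, a list of its n response values
    Returns (F, dof_lof, dof_pe); compare F with the F distribution
    on (m - 2) and m*(n - 1) degrees of freedom.
    """
    m = len(x_levels)
    n = len(replicates[0])
    # Flatten so the straight line is fitted to all m*n points.
    xs = [x for x, group in zip(x_levels, replicates) for _ in group]
    ys = [y for group in replicates for y in group]
    N = len(ys)
    x_bar = sum(xs) / N
    y_bar = sum(ys) / N
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
         / sum((xi - x_bar) ** 2 for xi in xs))
    a = y_bar - b * x_bar
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))
    # Pure error: scatter of replicates about their own group means.
    ss_pe = sum(sum((yv - sum(g) / len(g)) ** 2 for yv in g)
                for g in replicates)
    ms_pe = ss_pe / (m * (n - 1))
    ms_lof = (ss_res - ss_pe) / (m - 2)
    return ms_lof / ms_pe, m - 2, m * (n - 1)

# Duplicated manganese data of §5.11: F should be about 0.15 on 4 and 6 dof.
F, dof_lof, dof_pe = lack_of_fit(
    [0.0, 2.0, 4.0, 6.0, 8.0, 10.0],
    [[114.0, 14.0], [870.0, 1141.0], [2087.0, 2212.0],
     [3353.0, 2633.0], [3970.0, 4299.0], [4950.0, 5207.0]],
)
```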
5.11 Example Data/Calibration for Manganese
Key points
— Residual plots and tests for lack of ﬁt serve to detect nonlinearity in
a calibration plot.
— These are more reliable tests than a consideration of the correlation
coeﬃcient.
— Outlying results can perturb the conclusions for such tests.
The calibration data for manganese previously considered are actually part
of a larger set of duplicated results at each concentration. The data set up
to concentration 10 ppb is given here.
Concentration, ppb Response 1 Response 2
0 114 14
2 870 1141
4 2087 2212
6 3353 2633
8 3970 4299
10 4950 5207
December 23, 2010 13:34 9in x 6in Notes on Statistics and Data Quality for Analytical Chemists b1050ch05
92 Notes on Statistics and Data Quality for Analytical Chemists
The regression and residual plots are shown in Figs. 5.11.1 and 5.11.2.
By visual judgement there is no suggestion of lack of ﬁt in the residual plot.
The following simple calculations lead to the pure error estimate; d_a² + d_b²
is the pure error sum of squares for duplicated results. (This data layout
is for demonstration purposes only: the calculations will always be carried
out by computer.)
x      y_a    y_b    ȳ        d_a = y_a − ȳ   d_b = y_b − ȳ   d_a² + d_b²   DOF
0      114    14     64.0     50.0            −50.0           5000          1
2      870    1141   1005.5   −135.5          135.5           36721         1
4      2087   2212   2149.5   −62.5           62.5            7813          1
6      3353   2633   2993.0   360.0           −360.0          259200        1
8      3970   4299   4134.5   −164.5          164.5           54121         1
10     4950   5207   5078.5   −128.5          128.5           33025         1
Totals —      —      —        —               —               395878        6
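The duplicate arithmetic above can be checked in a few lines; this sketch reproduces the pure error sum of squares of 395878 on 6 degrees of freedom.

```python
# Duplicated manganese responses at each concentration (from the table above).
duplicates = [
    (114, 14), (870, 1141), (2087, 2212),
    (3353, 2633), (3970, 4299), (4950, 5207),
]

ss_pe = 0.0
dof = 0
for y_a, y_b in duplicates:
    y_mean = (y_a + y_b) / 2.0
    # d_a^2 + d_b^2 for a duplicate pair; each pair contributes 1 dof.
    ss_pe += (y_a - y_mean) ** 2 + (y_b - y_mean) ** 2
    dof += 1
```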
This gives rise to the following ANOVA table. The probability associated
with the lack of ﬁt F statistic is high and conﬁrms the absence of signiﬁcant
lack of ﬁt.
Source of        Degrees of   Sum of      Mean square
variation        freedom      squares     (variance)   F        p
Regression       1            35608623    35608623     819.34   0.000
Residual error   10           434603      43460
  Lack of fit    4            38725       9681         0.15     0.958
  Pure error     6            395878      65980
Total            11           36043226
If we now consider the full dataset, up to a concentration of 20 ppb, we
have the following data.

Concentration, ppb   Response 1   Response 2
0                    114          14
2                    870          1141
4                    2087         2212
6                    3353         2633
8                    3970         4299
10                   4950         5207
12                   5713         5898
14                   6496         6736
16                   7550         7430
18                   8241         8120
20                   8862         8909
Fig. 5.11.1. Calibration data for manganese, with duplicated results for
response.

Fig. 5.11.2. Residuals from the simple calibration of the manganese data.
There is no suggestive pattern (although there is a single suspect point).

Fig. 5.11.3. Calibration data for manganese, up to 20 ppb.

Fig. 5.11.4. Residuals (points) from the calibration using simple regression.
There is a strong visual suggestion of lack of fit, which is perhaps weakened
by an outlying point (encircled).
The calibration plot (Fig. 5.11.3) and residual plot (Fig. 5.11.4) suggest
a slight curvature in the relationship, but it is diﬃcult to be sure because of
the relatively large variability in the responses at each level of concentration.
On completion of the linear regression with the test for lack of ﬁt,
we have the following ANOVA table. The probability of 0.096 associated
with the lack of ﬁt test is low enough to substantiate the visual appraisal,
although not signiﬁcant at the 95% level of conﬁdence. The probability may
have been aﬀected unduly by one apparently discrepant value of response
at 6 ppb concentration, which inﬂates the estimate of pure error variance
and thereby reduces the Fvalue. If this discrepant value is deleted before
the test, the lack of ﬁt becomes signiﬁcant with p = 0.007.
Source of        Degrees of   Sum of      Mean square
variation        freedom      squares     (variance)    F         p
Regression       1            173714191   173714191     2631.62   0.000
Residual error   20           1320205     66010
  Lack of fit    9            862790      95866         2.31      0.096
  Pure error     11           457416      41583
Total            21           175034397
Notes
• The datasets used in this section can be found in files named Manganese2
and Manganese3.
• The pure error test for lack of ﬁt takes no account of the order in which
the residuals are arranged. The test statistic would have the same result if
the order of the residuals were randomised. This has two corollaries: (a)
nonlinearity may be somewhat more likely than suggested by the p-value;
and (b) lack of ﬁt, if detected, may have a cause other than nonlinearity.
Hence it is essential to examine a residual plot.
• The coefficient of correlation between the concentrations and the
responses in dataset Manganese3 is r = 0.996. This would often incorrectly
be taken as a demonstration of linearity in the calibration function.
This example substantiates the previous comments (§5.9) about the
shortcomings of the correlation coefficient as a test for linearity.
5.12 A Regression Approach to Bias Between Methods
Key points
— Linear regression can be used to test for bias between two analytical
methods by using them in tandem to analyse a set of test materials.
— The inference will be safe so long as the results of the more precise
method are used as the independent (x) variable.
The results of two analytical methods can be compared by using both
of them to analyse the same set of test materials. When the range of
concentrations determined is small, a t-test on the differences between
paired results is usually appropriate (§3.8, 3.9). When the range is greater,
a number of different outcomes are possible, and the data can sometimes be
analysed by regression or a related method. The regression approach is safe
so long as the variance of the independent (x) variable is somewhat smaller
than that of the dependent (y) variable. If the variances are comparable,
or that of y exceeds that of x, regression is likely to provide misleading
statistics, because a basic assumption of regression (invariant xvalues) has
been notably violated. Such instances can, however, be readily managed by
Fig. 5.12.1. Possible outcomes of experiments for the comparison of two methods of
analysis, showing the regression line (solid) and the theoretical line of no bias (dashed).
Each point represents the two results on a particular material. (a) No bias between
the methods. (b) Translational bias only between the methods. (c) Rotational bias only
between the methods. (d) Both rotational and translational bias between the methods.
Fig. 5.12.2. A complex outcome of an experimental comparison between analytical
methods where the test materials fall into two subsets, one subset with no bias (solid
circles) and the remainder with a serious (but not statistically characterised) bias (open
circles).
the use of a more complex method known as functional relationship ﬁtting,
which is beyond the scope of the present book.
Bias between two analytical methods can adopt one of several diﬀerent
‘styles’, the most common of which are well modelled by linear regression
in paired experiments. If both methods gave identical results (apart from
random measurement variation, of course) the trend of results would follow
a line with zero intercept and a slope of unity, that is, y = α + βx where
α = 0, β = 1 (Fig. 5.12.1a). As the methods are intended to address the
same measurand, the two methods should produce results quite close to
the ideal model (y = x). An obvious statistical approach is to test the
hypotheses H
0
: α = 0 vs H
A
: α = 0 and H
0
: β = 1 vs H
A
: β = 1.
Figure 5.12.1(b) shows a dataset where α = 0, β = 1. This type of bias
is called ‘translational’ or ‘constant’ bias, and is commonly associated with
baseline interference in analytical signals. Another common type of bias is
characterised by α = 0, β = 1, that is, the slope has departed from unity.
This style of bias (Fig. 5.12.1c) is called ‘rotational bias’ or proportional
bias. It is quite possible for both types of bias to be present simultaneously,
that is, α = 0, β = 1 (Fig. 5.12.1d).
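A hedged sketch of this approach, with invented comparison data: the t statistics for H₀: α = 0 and H₀: β = 1 follow from the usual standard errors of the intercept and slope, and are referred to the t distribution on n − 2 degrees of freedom.

```python
# Invented method-comparison data (not from the book).
x = [2.0, 5.0, 9.0, 14.0, 20.0, 27.0]   # reference method results
y = [2.4, 5.1, 9.8, 15.2, 21.5, 29.0]   # test method results

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a = y_bar - b * x_bar
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_yx = (ss_res / (n - 2)) ** 0.5        # residual standard deviation

# Standard errors of slope and intercept from simple regression theory.
se_b = s_yx / s_xx ** 0.5
se_a = s_yx * (1.0 / n + x_bar ** 2 / s_xx) ** 0.5

t_alpha = (a - 0.0) / se_a   # H0: alpha = 0 (no translational bias)
t_beta = (b - 1.0) / se_b    # H0: beta = 1 (no rotational bias)
# Compare both with the t distribution on n - 2 degrees of freedom.
```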
In some instances a more complex behaviour may be seen. Fig. 5.12.2
shows an example where the results from the majority of the test materials
follow a simple trend line, but a subset of test materials give results that
show quite diﬀerent behaviour.
Notes and further reading
• Ripley, B. and Thompson, M. (1987). Regression Techniques for the
Detection of Analytical Bias. Analyst, 112, pp. 377–383.
• ‘Fitting a linear functional relationship to data with error on both
variables’. (March 2002). AMC Technical Briefs No. 10. Free download via
www.rsc.org/amc.
• Software also available for Excel and Minitab. Free download via
www.rsc.org/amc.
5.13 Comparison of Analytical Methods: Example
Key point
— Regression shows the bias between a rapid field method for the
determination of uranium and an accurate laboratory method.
As an example we consider the determination of uranium in a number of
stream waters by a wellestablished laboratory method and by a newly
developed rapid ﬁeld method. The purpose is to test the accuracy of the
ﬁeld method, regarding the laboratory method as a reference point for
accuracy. The results, in units of µg l⁻¹ (ppb), are as follows.
Site code Field result Laboratory result
1 24 19
2 8 8
3 2 2
4 1 1
5 10 9
6 26 23
7 31 27
8 17 11
9 4 4
10 0 0
11 6 4
12 40 26
13 2 3
14 2 4
Fig. 5.13.1. Comparison between two methods for the determination of uranium in
natural waters, showing the ﬁtted regression line (solid) and the theoretical line for zero
bias between the methods (dashed line). Each point is a separate test material.
Figure 5.13.1 shows the ﬁeld results plotted against the laboratory
results. We see that the trend of the results deviates from the theoretical
line for unbiased methods where both methods give the same result (apart
from random measurement variation). It seems reasonable to estimate this
trend by linear regression of the ﬁeld results against the laboratory results.
This action can be justiﬁed in this instance because the precision of the
laboratory method is known to be small in comparison with that of the
ﬁeld method.
The results of the regression are as follows. We see that the intercept
is not signiﬁcantly diﬀerent from zero (p = 0.412), but the slope of 1.32 is
clearly signiﬁcantly diﬀerent from unity, which is the slope required for no
bias of any kind between the methods. The ﬁeld method is giving results
that are on average 1.32 times greater than the established laboratory
method.
Predictor       Coefficient   Standard error   H₀, H_A        t      p
Intercept (a)   −0.944        1.111            α = 0, α ≠ 0   0.85   0.412
Slope (b)       1.32065       0.08120          β = 1, β ≠ 1   3.95   0.001

s_y/x = 2.82,   R² = 95.7%
Fig. 5.13.2. Residuals from the regression.
Notes
• The dataset used in this section is in a ﬁle named Uranium.
• It is interesting to observe that the residuals of the regression (Fig. 5.13.2)
tend to suggest a small degree of heteroscedastic variation in the results
of the ﬁeld method. Weighted regression should ideally have been used for
this exercise (§6.7), although in this instance it would not have changed
the interpretation of the data. In any event the estimation of the weights
would have called for assumptions outside the data.
Chapter 6
Regression — More
Complex Aspects
This chapter examines some more complex aspects of regression in
relation to analytical applications, specifically polynomial and multiple
regression, weighted regression and non-linear regression.
6.1 Evaluation Limits — How Precise is an Estimated x-value?
Key points
— ‘Evaluation limits’ around concentrations estimated from calibration
graphs can be calculated from the data and are often unexpectedly
wide.
— Evaluation limits give rise to an alternative way of thinking about
detection limits.
Values of concentration x′ in unknown solutions, estimated from calibration
lines, are subject to two sources of variation: first, the variation in the
position of the regression line, and second, the variation in the new measured
response y′. The way that these two variances interact is shown schematically
in Fig. 6.1.1. (Remember that y′ and x′ are not part of the calibration
data.) The resultant variation in x′ is, at first encounter, surprisingly large,
but the reason for this is apparent from the diagram.
Confidence intervals around x′ = (y′ − a)/b are given by x′ ± t·s_x′ with
an appropriate value of t, where

s_x′ = (s_y/x / b) √[ 1/m + 1/n + (y′ − ȳ)² / (b² Σᵢ(xᵢ − x̄)²) ],   (6.1.1)
Fig. 6.1.1. Estimation of an unknown concentration x′ from a response y′
and an established calibration function. The shaded areas illustrate
schematically the uncertainties in y′, the calibration function (solid line)
and x′.
Fig. 6.1.2. Evaluation interval calculated from the manganese data (points), showing
the 95% conﬁdence interval around an estimated value of concentration.
and where y′ is the mean of m observations, and the regression line is
based on n pairs of x–y data. The confidence limits around x′ can be
calculated for any value of y′ and shown as two continuous lines which
are gentle curves; Fig. 6.1.2 shows the curves calculated from
the manganese calibration data. This procedure looks complex, but all of
the statistics stem from the calibration data, and the limit lines can be
calculated rapidly by computer. The manganese data show that an unknown
solution producing a response of (say) 1600 units provides an estimated
concentration of manganese of 3.1 ppb with 95% confidence limits of 2.1
and 4.1.

Fig. 6.1.3. Part of the previous graph around the origin, illustrating how a
detection limit can be estimated. At point A, the lower confidence limit
intersects the zero concentration line. Concentrations below the corresponding
detection limit c_L are not significantly greater than zero.
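A sketch of the calculation behind Eq. (6.1.1), using a hypothetical helper function (not code from the book); the final lines check it on exact straight-line data, where the residual scatter, and hence s_x′, is zero.

```python
def evaluation_se(x, y, y_new, m=1):
    """Estimated concentration and its standard error from Eq. (6.1.1).

    x, y:  the n calibration points; y_new: the mean of m new responses.
    The confidence limits are x_new +/- t * s_x on n - 2 dof.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = (ss_res / (n - 2)) ** 0.5
    x_new = (y_new - a) / b
    s_x = (s_yx / b) * (1.0 / m + 1.0 / n
                        + (y_new - y_bar) ** 2 / (b ** 2 * s_xx)) ** 0.5
    return x_new, s_x

# With exact straight-line data (y = 2x) the residual scatter is zero,
# so a response of 4.0 gives x' = 2.0 exactly, with s_x = 0.
x_cal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_cal = [2.0 * xi for xi in x_cal]
x_new, s_x = evaluation_se(x_cal, y_cal, y_new=4.0)
```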
Assuming that an appropriate model has been fitted, this provides a
useful way of estimating the contribution that the calibration and evaluation
procedures make to the total uncertainty of the measurement result. It also
provides us with an alternative means of conceptualising detection limits
(Fig. 6.1.3). If we consider the lower confidence limit for a continuously
diminishing concentration x′, at some response (A in Fig. 6.1.3) the limit
intersects the zero concentration line and, at that point and below, the
estimated concentration is not significantly greater than zero. In this region
we are not sure that there is any of the analyte present in the test solution.
Notes
• The dataset used in the example diagrams is in the file named
Manganese2.
• Terminology in this area is not stabilised. Here we call the limits
‘evaluation limits’; others call them ‘inverse confidence limits’ or ‘fiducial
limits’.
• Many textbooks correctly state that Eq. (6.1.1) is approximate. However,
the error involved in the approximation is negligible in any realistic ana
lytical calibration.
6.2 Reducing the Conﬁdence Interval Around an Estimated
Value of Concentration
Key points
— Reducing the uncertainty in an estimated concentration is addressed
by concentrating on the largest term in Eq. (6.1.1).
— The largest term is often 1/m, where m is the number of replicate
measurements of response for the unknown solution.
If the uncertainty in an estimated x′ is too large, the equation for s_x′
above allows us to see which aspect of the calibration and evaluation
needs attention, by comparing the separate magnitudes of the three terms
under the square root sign. The magnitude of 1/m can be reduced by
increasing m, the number of results averaged to obtain y′. This is normally
the most effective strategy. Likewise, the magnitude of 1/n can be reduced
by increasing n, the number of calibrators, and this also has the effect of
reducing the value of t, which depends on the degrees of freedom, (n − 2).

The third term, (y′ − ȳ)² / (b² Σᵢ(xᵢ − x̄)²), disappears when y′ = ȳ (usually
around the centre of the calibration line) and, in most instances of
calibration, is small compared with the other terms. (This is why the
confidence limits usually look like straight lines parallel to the calibration
function: strictly they are gentle curves.) It could be reduced somewhat by
moving the calibrators to the extremes of the existing concentration range
or by increasing the range of the existing calibration.
Common sense is necessary here: there is no point in worrying about
the magnitude of errors contributed by calibration/evaluation if errors
introduced at other stages of the analytical procedure exceed them to any
extent. This is usually the case in chemical analysis, except at concentrations
near the detection limit, where calibration errors assume a dominant
magnitude.
6.3 Polynomial Regression
Key points
— Polynomial regression allows us to construct models that can ﬁt data
with a curved trend.
— It is sometimes suitable for calibration data, with models up to order
two (i.e., with squared terms).
— Models of order higher than two are nearly always unsuitable for
analytical calibrations, and give unreliable extrapolations.
When calibration data xᵢ, yᵢ, (i = 1, . . . , n) (or any other data) fall on a
curved trend it is often possible to account for them successfully with a
polynomial model, which takes the form

yᵢ = β₀ + β₁xᵢ + β₂xᵢ² + β₃xᵢ³ + · · · + εᵢ,   ε ~ N(0, σ²).

Notice here that the coefficients β are distinguished by a subscript that
corresponds with the power to which the predictor variable is raised. The
intercept is now called β₀ instead of α, so we can use a compact notation
for the model, yᵢ = Σ_{j=0}^{m} βⱼxᵢʲ + εᵢ, where m is the order of the
polynomial, the highest power used. (Note that β₀xᵢ⁰ = β₀.) The
least-squares coefficients can be calculated by an extension of the procedure
used in §5.2 so long as n > m + 2, but in practice it is wise to use n > 2m.
It is very unlikely that a power greater than three would be suitable
for analytical calibration purposes. We can see this clearly in Fig. 6.3.1,
Fig. 6.3.1. Various polynomial ﬁts to ﬁve calibration data points.
which shows ﬁts, of up to order four, to ﬁve data points. The straight
line ﬁt is omitted for clarity. The second order ‘quadratic’ curve (part of
a parabola) is a better fit to the points (in a least-squares sense) and
provides a plausible calibration line. We find this plausible because we often
observe slightly curved calibrations in practice and can usually account for
them in terms of known physical processes taking place in the analytical
system. The third order curve (part of a cubic equation) is still closer to
the points, but now the ﬁt is less plausible. We would be unhappy to use
a calibration curve like that because there is an inﬂection in the curve
that would be diﬃcult to account for by physical theory. Finally, the fourth
order ﬁt passes exactly through each point, but is obviously nonsensical as a
calibration.
In general, as the order of the ﬁt is increased, the variance of the
residuals becomes smaller. However, there is a point beyond which the
improvement in ﬁt is meaningless. There are statistical tests that can
identify that point, but common sense is usually good enough in analytical
calibration. A suitable procedure for analytical calibration is the
following.
1. Determine at least six equally spaced calibration points.
2. Try a ﬁrst order (linear) ﬁt.
3. Examine the resulting residual plot.
4. If there is a lack of ﬁt through curvature, try a quadratic (order
two) ﬁt.
5. Examine the new residual plot.
6. If there is still lack of ﬁt, abandon the attempt to ﬁt a polynomial.
All of these operations are easily accomplished in statistical packages.
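Steps 2 and 4 of the procedure can be sketched with NumPy's polyfit, using invented, slightly curved data; the residual arrays are what one would plot in steps 3 and 5.

```python
import numpy as np

# Hypothetical, slightly curved calibration data (invented for illustration).
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([5.0, 1080.0, 2100.0, 3020.0, 3880.0, 4600.0])

# Step 2: first order (straight line) fit.
lin = np.polyfit(x, y, 1)
res_lin = y - np.polyval(lin, x)     # step 3: inspect for curvature

# Step 4: quadratic (order two) fit.
quad = np.polyfit(x, y, 2)
res_quad = y - np.polyval(quad, x)   # step 5: inspect again

ss_lin = float(np.sum(res_lin ** 2))
ss_quad = float(np.sum(res_quad ** 2))
```

Because the quadratic model contains the straight line as a special case, its residual sum of squares can never exceed that of the linear fit.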
While polynomials can be used satisfactorily to model calibrations with
slight curves and short ranges, the fact remains that they are often
inherently the wrong shape to describe the physical processes going on in
an analytical system. This incompatibility is likely to become apparent
in even small extrapolations. Consider the calibration data shown in
Figs. 6.3.2–6.3.4. The graphs show fitted lines of various orders extrapolated,
with the corresponding 95% confidence intervals. The first order fit shows
that the uncertainty in the extrapolation remains reasonably small. The higher
Fig. 6.3.2. Linear ﬁt to calibration data,
with 95% conﬁdence interval.
Fig. 6.3.3. Second order ﬁt to calibration
data, with 95% conﬁdence interval.
Fig. 6.3.4. Third order ﬁt to calibration data, with 95% conﬁdence interval.
order fits model the data rather more closely, but show extrapolations with
a strongly curved trend and immediately very wide confidence intervals.
Extrapolation is very risky because we cannot infer the correct shape of
the relationship from the data but only look for lack of fit within the
range.
Note
• In technical mathematical terminology, these polynomial models are all
called ‘linear’, even though they describe curves: they are linear in the
coefficients. There is another class of models, technically called ‘non-linear’,
that are more difficult to fit; these are discussed briefly in §6.9 and
§6.10.
6.4 Polynomial Calibration — Example
Key point
— Polynomial regression order two (quadratic) provides a good ﬁt for
the manganese data.
If we re-examine the complete manganese data from §5.11 by using
quadratic regression, we obtain the following output, residual plot and
analysis of variance table. The regression equation is:

Response = 3.03 + 550.24 c_Mn − 5.297 c_Mn²,

where c_Mn is the concentration of manganese. The table of coefficients is
as follows.

Predictor    Coefficient   Standard error   t       p
Intercept    3.03          91.64            0.03    0.974
Mn           550.24        21.32            25.81   0.000
Mn squared   −5.297        1.027            5.16    0.000

s_y/x = 170.1,   R² = 99.7%
We see that the squared term is highly significant in the test for H₀:
β₂ = 0, so we are disposed to think that the quadratic model will have
improved the fit.
Source of        Degrees of   Sum of      Mean square
variation        freedom      squares     (variance)   F         p
Regression       2            174484616   87242308     3015.03   0.000
Residual error   19           549780      28936
  Lack of fit    8            92365       11546        0.28      0.960
  Pure error     11           457416      41583
Total            21           175034397
The residual plot (Fig. 6.4.1) shows no trace of lack of ﬁt (although the
suspect value is now more obviously an outlier), and there is no apparent
lack of ﬁt near the origin. The analysis of variance shows no signiﬁcant lack
Fig. 6.4.1. Standardised residuals after carrying out quadratic regression on the
manganese calibration data.
of ﬁt, with p = 0.96. Quadratic regression therefore provides a good model
for this particular calibration.
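As a check (a sketch, assuming the Manganese3 values tabulated in §5.11), fitting the quadratic with NumPy should reproduce coefficients close to those above, with a negative squared term and a far smaller residual sum of squares than the straight-line fit.

```python
import numpy as np

# Duplicated manganese data from §5.11 (file Manganese3).
conc = np.repeat(np.arange(0.0, 21.0, 2.0), 2)   # 0, 0, 2, 2, ..., 20, 20
resp = np.array([114, 14, 870, 1141, 2087, 2212, 3353, 2633,
                 3970, 4299, 4950, 5207, 5713, 5898, 6496, 6736,
                 7550, 7430, 8241, 8120, 8862, 8909], dtype=float)

# Quadratic (order two) and straight-line fits for comparison.
quad = np.polyfit(conc, resp, 2)   # [b2, b1, b0]; b2 expected negative
lin = np.polyfit(conc, resp, 1)

ss_quad = float(np.sum((resp - np.polyval(quad, conc)) ** 2))
ss_lin = float(np.sum((resp - np.polyval(lin, conc)) ** 2))
ss_tot = float(np.sum((resp - resp.mean()) ** 2))
```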
Note
• The dataset used in this section is named Manganese3.
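A quadratic calibration fit of this kind is easy to reproduce in any statistics environment. The sketch below uses numpy with made-up illustrative data (not the Manganese3 dataset); the responses are constructed from an exact quadratic so the recovered coefficients are easy to check:

```python
import numpy as np

# Hypothetical calibration data lying exactly on response = 2 + 550c - 5c^2
# (illustrative placeholders, NOT the Manganese3 dataset)
c = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
response = 2.0 + 550.0 * c - 5.0 * c**2

# Fit response = b0 + b1*c + b2*c^2; polyfit returns the highest power first
b2, b1, b0 = np.polyfit(c, response, deg=2)

# Residual standard deviation s_yx with n - 3 degrees of freedom
fitted = b0 + b1 * c + b2 * c**2
s_yx = np.sqrt(np.sum((response - fitted)**2) / (len(c) - 3))
```

A t-test of b2 against its standard error (as in the table of coefficients above) then decides whether the quadratic term is justified.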
6.5 Multiple Regression
Key points
— Multiple regression allows the exploration of datasets with more than
one predictor variable.
— The technique has applications in analytical calibration in multivariate
methods such as principal components regression.
— It is also useful in general exploratory data analysis.
Multiple regression is used when two or more independent predictors
combine to determine the size of a response variable. The dataset layout is
as follows, where the ﬁrst subscript on each value of x indicates a separate
variable.
Response variable    Predictor variables
                     1        2        ...    m
y_1                  x_11     x_21     ...    x_m1
y_2                  x_12     x_22     ...    x_m2
y_3                  x_13     x_23     ...    x_m3
...                  ...      ...      ...    ...
y_n                  x_1n     x_2n     ...    x_mn
The statistical model is therefore

y_i = β_0 + β_1 x_1i + β_2 x_2i + β_3 x_3i + ··· + ε_i,    ε ~ N(0, σ²),
and the equations to calculate the least squares estimates b_j of the
coefficients β_j are derived in a manner similar to those for simple regression.
As with polynomial regression, we should aim to have n > 2m at least to
obtain stable results.
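The least-squares estimates can be obtained from the design matrix in one line of numpy. A minimal sketch with fabricated data (the response is constructed as exactly y = 1 + 2x_1 + 0.5x_2, so the recovered coefficients are known in advance):

```python
import numpy as np

# Hypothetical predictors (m = 2) and a response built from known coefficients
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = 1.0 + 2.0 * x1 + 0.5 * x2

# Design matrix: a column of ones (intercept b0) plus one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution of y = X b
b, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With real (noisy) data the residuals y − Xb would then be plotted against each predictor in turn, as recommended below.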
Regression with two independent variables, that is, ŷ = b_0 + b_1 x_1 + b_2 x_2,
can be readily visualised by means of perspective projections, and we
can see such a model represented by a tilted plane OABC in the three-
dimensional representations (Figs. 6.5.1, 6.5.2). When x_2 = 0 we have the
simple regression ŷ = b_0 + b_1 x_1, shown as line OA. Likewise, when x_1 = 0
we have ŷ = b_0 + b_2 x_2, shown as line OC. At non-zero values of both x_1,
x_2, values of ŷ are represented by points on the plane OABC. (Note: in
these two figures the value of b_0 is zero, but generally this will not be so.)
When regression is executed on a set of points such as shown in Fig. 6.5.3,
the residuals are the vertical distances (that is, in the ydirection) from the
points to the fitted plane (Fig. 6.5.4). As in simple regression, it is essential
good practice to examine the residuals, by plotting them against the value
of each predictor variable in turn.

Fig. 6.5.1. Representation of a fitted function of two predictor variables as a
tilted plane OABC in a three-dimensional space.

Fig. 6.5.2. Representation of a fitted function of two predictor variables as a
tilted plane OABC in a three-dimensional space. (Same as Fig. 6.5.1, but with
the origin at the bottom right corner.)

Fig. 6.5.3. Swarm of data points in three-dimensional space.

Fig. 6.5.4. Data points as in Fig. 6.5.3 and a fitted regression model (plane),
showing the residuals. Points above the plane are shown in black, below the
plane in grey.
Multiple regression plays an essential part in the more complex types of
calibration (for example in principal components regression [PCR]), but is
perhaps more often used in exploratory data analysis. In the latter instance,
it is important to check that the predictor variables used are not strongly
correlated. (In PCR the predictors have zero correlation by deﬁnition.)
If such correlation exists, the regression model may be unstable and the
coeﬃcients misleading. In other words, we might get very diﬀerent results
if two randomly selected subsets of the data were both separately treated
by multiple regression.
6.6 Multiple Regression — An Environmental Example
Key points
— Multiple regression has been used to explore possible relationships
between the concentration of lead in garden soils and predictors:
(a) distance from a smelter; (b) the age of the house; and (c) the
underlying geology.
— A preliminary regression gave a promising outcome but the residuals
showed a lack of ﬁt to distance.
— Replacing the distance predictor by a negative exponential
transformation gave an improved fit with no significant lack of fit.
The data listed below show lead concentrations found in the soil of 18
gardens from houses in the vicinity of a smelter. Also listed for each garden
are some corresponding environmental factors that might serve to explain
the lead concentrations. The explanatory factors are as follows.
• The distance of each house from the smelter.
• The age of the house in years.
• The geological formation underlying the garden. (This is simply identiﬁed
by codes 1 or 2 which show which one of two rocks is present.)
The task is to try multiple linear regression with lead concentration as the
dependent variable and distance, age and type of geology as the predictors
(independent variables). Note that while the geology code can only take
one of two values, we can still use this variable as a predictor in multiple
regression.
Lead, ppm Distance, m Age, Yr Geology Lead, ppm Distance, m Age, Yr Geology
10 9609 22 1 136 6064 166 2
25 9283 58 1 146 3895 25 1
69 7369 61 2 164 4051 74 1
79 9887 132 2 170 6827 187 2
86 9887 118 1 184 4466 149 2
94 6130 121 1 198 4919 199 1
100 8328 125 1 201 6821 295 2
132 4790 42 2 219 3598 77 2
132 6248 139 2 275 3438 84 2
The ﬁrst stage is to ensure that the intended predictors are not strongly
correlated. The correlation matrix is as follows.
Distance Age
Age 0.040
Geology −0.243 0.298
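This check is a one-liner with numpy; the sketch below uses the predictor values transcribed from the table of 18 gardens above:

```python
import numpy as np

# Predictor values transcribed from the table of 18 gardens
distance = np.array([9609, 9283, 7369, 9887, 9887, 6130, 8328, 4790, 6248,
                     6064, 3895, 4051, 6827, 4466, 4919, 6821, 3598, 3438], float)
age = np.array([22, 58, 61, 132, 118, 121, 125, 42, 139,
                166, 25, 74, 187, 149, 199, 295, 77, 84], float)
geology = np.array([1, 1, 2, 2, 1, 1, 1, 2, 2,
                    2, 1, 1, 2, 2, 1, 2, 2, 2], float)

# Pairwise Pearson correlations between the intended predictors
r = np.corrcoef([distance, age, geology])
r_da, r_dg, r_ag = r[0, 1], r[0, 2], r[1, 2]
```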
All of these coefficients are small and not significant, so the three
variables can be safely used in the intended regression. The next stage is to scan
scatter plots (Figs. 6.6.1–6.6.3) of lead concentration against each variable
in turn to see if the data are consistent with the proposed model. We can
clearly see that the concentration of lead decreases with increasing distance,
as would be expected (Fig. 6.6.1). There are no obvious trends suggesting
nonlinearity and no suspect points that might have an undue inﬂuence on
the regression. There is no clear relationship between lead concentration
and the age of the house (Fig. 6.6.2), and the correlation is small and
not significant. However, we must not be tempted to omit the age of the
house from the regression at this stage. Quite often, real relationships are
obscured by the influence of other variables. Figure 6.6.3 illustrates the
effect of geology. In this diagram it seems that the mean of the results for
the gardens coded as 2 is higher than the results for gardens coded as 1, but
it is not clear whether the difference is significant. (Notice that the data
points have been 'jittered' slightly in the x-direction in Fig. 6.6.3 so that
overlapping points in the y-direction are separated.) There is no reason not
to conduct the multiple regression with all three variables.

Fig. 6.6.1. Lead concentrations plotted against distance from the smelter
(r = −0.80; p < 0.0005).

Fig. 6.6.2. Lead concentrations plotted against the age of the house (r = 0.39;
p = 0.11).

Fig. 6.6.3. Lead concentrations plotted against code for the underlying geology.
The outcome of the multiple regression is shown in the table. The R²
value (proportion of variance accounted for by the regression) is 83%, which
is very good for an environmental study. Both the distance from the smelter
and the age of the house give values of t that are signiﬁcantly diﬀerent from
zero at the 95% level of conﬁdence. It is noteworthy that the age of the
house is signiﬁcant in the multiple regression when it seemed unpromising
as a predictor when considered alone. The underlying geology apparently
has no signiﬁcant eﬀect on the lead content of the soil, as the pvalue
is high.
Predictor   β_j         se(β_j)     t = β_j/se(β_j)   p
Constant    221.23      37.00       5.98              0.000
Distance    −0.024179   0.003494    −6.92             0.000
Age         0.3835      0.1146      3.35              0.005
Geology     15.63       16.00       0.98              0.345
While this outcome is promising, we now have to examine the residuals
for lack of ﬁt or other features that might throw doubt on the suitability of
the ﬁrst regression. The residuals plotted against predictors are shown in
Figs. 6.6.4 and 6.6.5.
The residuals plotted against the age of the house display no obvious
deviation from a random sample except for one suspect value (roughly at
90 years). Plotted against the distance from the smelter, however, there
is a distinct suggestion of a curved trend in the residuals, showing that
the linear representation was not completely adequate. Intuition supports
this outcome, because we would expect the lead contamination to fall with
Fig. 6.6.4. Residuals plotted against the age of the house.
Fig. 6.6.5. Residuals plotted against the distance from the smelter.
distance from the smelter but in a roughly exponential manner. Notice that
there is no exact way of using the pure error as a test for lack of ﬁt, because
there are no replicated results.
A possible improvement on the modelling might be obtained by
transforming the distance variable d in some way that provides a suitably curved
function, and using that as a predictor in a new multiple regression. We
can try an exponential decay to give the new variable exp(−d/1000).
(The division by 1000 is simply to provide results of a handy magnitude.)
We now repeat the multiple regression and obtain the following
results.
Predictor       β_j       se(β_j)    t = β_j/se(β_j)   p
Intercept       −7.31     14.03      −0.52             0.611
exp(−d/1000)    6096.1    446.0      13.67             0.000
Age             0.63813   0.06815    9.36              0.000
Geology         14.661    8.848      1.66              0.120
The variance accounted for (R²) is now 95%, which is exceptionally
high for an environmental study. Both the transformed distance and the
high for an environmental study. Both the transformed distance and the
age of the house are still highly signiﬁcant predictors: the geology as a
predictor remains not signiﬁcant at the 95% level of conﬁdence, but the
p-value of 0.12 shows that its influence is not implausible, and a relationship
might be revealed by a larger or more focused study. None of the residual
plots from the new regression (Figs. 6.6.6 and 6.6.7) now shows any
obvious deviation from a random pattern. (Notice that the lead residuals
have been plotted against the original distance measure rather than the
transformed value, but that merely spaces them conveniently: it is not
essential.)

Fig. 6.6.6. Residuals from the second regression (with transformed distance)
plotted against age of the house.

Fig. 6.6.7. Residuals from the second regression (with transformed distance)
plotted against distance from the smelter.
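The whole second-stage fit can be reproduced with numpy. The sketch below uses the data transcribed from the table in this section; it is an illustrative reconstruction, not the authors' original computation:

```python
import numpy as np

# Data for the 18 gardens, transcribed from the table above
lead = np.array([10, 25, 69, 79, 86, 94, 100, 132, 132,
                 136, 146, 164, 170, 184, 198, 201, 219, 275], float)
distance = np.array([9609, 9283, 7369, 9887, 9887, 6130, 8328, 4790, 6248,
                     6064, 3895, 4051, 6827, 4466, 4919, 6821, 3598, 3438], float)
age = np.array([22, 58, 61, 132, 118, 121, 125, 42, 139,
                166, 25, 74, 187, 149, 199, 295, 77, 84], float)
geology = np.array([1, 1, 2, 2, 1, 1, 1, 2, 2,
                    2, 1, 1, 2, 2, 1, 2, 2, 2], float)

# Replace distance by the negative exponential transform exp(-d/1000)
expd = np.exp(-distance / 1000.0)
X = np.column_stack([np.ones_like(lead), expd, age, geology])
b, *_ = np.linalg.lstsq(X, lead, rcond=None)

# R^2 = 1 - SS(residual)/SS(total about the mean)
resid = lead - X @ b
r2 = 1.0 - np.sum(resid**2) / np.sum((lead - lead.mean())**2)
```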
Note
• The dataset for this example is found in the ﬁle named Lead.
6.7 Weighted Regression
Key points
— Weighted regression is designed for heteroscedastic data, i.e., the
variance of the yvalue varies with the xvalue, as opposed to the
assumption of uniform variance for simple regression.
— Using simple regression where weighted regression should be used can
have a deleterious eﬀect on the statistics.
— Analytical calibrations are often heteroscedastic, and weighted
regression is often beneﬁcial.
— Weighted regression is a standard part of statistical packages.
Quite often in analytical calibration the range of the concentration of
the analyte extends over several orders of magnitude. In such instances
we usually find that the variance of the analytical response increases
with concentration. This probably happens also with shorter-range
calibrations, but the change is small enough to escape detection, or small
enough to ignore. The phenomenon is called heteroscedasticity, and it
violates one of the assumptions on which the simple regression model
is based. If heteroscedasticity is sufficiently marked, an adaptation of
the model is required for best results. The model for this situation is
shown in Fig. 6.7.1. The y-values (in calibration the analytical responses)
are drawn at random from a distribution of which the standard deviation
varies with the x-variable (concentration). (Compare this with Fig. 5.1.1.)

Fig. 6.7.1. The heteroscedastic model of regression.
This adaptation, which takes account of the changes in variance, is
called weighted regression. It works by loading each observation with a
weight that is inversely proportional to the variance at the respective
concentration. In that way, the regression ‘takes more notice’ of points
with larger weights (smaller variances). The regression formulae tabulated
below are similar to (and a generalisation of) those for simple regression.
The important thing here is not to remember the equations, but to see
the analogies between weighted and unweighted regression. Note that
the weights are scaled so that the sum is equal to n, which simpliﬁes
the formulae somewhat. It should also be apparent that simple regression
is a special case of weighted regression in which all of the weights are
equal.
Data
  Standard:  x_1, x_2, ..., x_i, ..., x_n; y_1, y_2, ..., y_i, ..., y_n
  Weighted:  x_1, x_2, ..., x_i, ..., x_n; y_1, y_2, ..., y_i, ..., y_n,
             with standard deviations s_1, s_2, ..., s_i, ..., s_n

Weight
  Standard:  w_i = 1
  Weighted:  w_i = n(1/s_i²)/Σ_i(1/s_i²)  (i.e., Σ_i w_i = n)

Mean
  Standard:  x̄ = Σ_i x_i / n
  Weighted:  x̄_w = Σ_i w_i x_i / n

Slope of regression
  Standard:  b = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²
  Weighted:  b_w = Σ_i w_i (x_i − x̄_w)(y_i − ȳ_w) / Σ_i w_i (x_i − x̄_w)²

Intercept
  Standard:  a = ȳ − b x̄
  Weighted:  a_w = ȳ_w − b_w x̄_w

Residual variance
  Standard:  s²_yx = Σ_i (y_i − ŷ_i)² / (n − 2)
  Weighted:  s²_yx(w) = Σ_i w_i (y_i − ŷ_w,i)² / (n − 2)

Variance of slope
  Standard:  s²_b = s²_yx / Σ_i (x_i − x̄)²
  Weighted:  s²_b(w) = s²_yx(w) / Σ_i w_i (x_i − x̄_w)²

Variance of intercept
  Standard:  s²_a = s²_b Σ_i x_i² / n
  Weighted:  s²_a(w) = s²_b(w) Σ_i x_i² / n

Variance of evaluated x-value*
  Standard:  s²_xe = (s²_yx / b²) × [1/m + 1/n + (y′ − ȳ)² / (b² Σ_i (x_i − x̄)²)]
  Weighted:  s²_xe(w) = (s²_yx(w) / b_w²) × [1/w′ + 1/n + (y′ − ȳ_w)² / (b_w² Σ_i w_i (x_i − x̄_w)²)]

* This is the variance of an unknown x-value x′ calculated from the regression
equation as x′ = (y′ − a)/b by means of a measured response y′, which is the
mean of m measurements or has a corresponding weight w′.
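The weighted formulae above translate directly into code. A minimal sketch (the function is our own illustration, and the data in the demonstration call are hypothetical):

```python
import numpy as np

def weighted_regression(x, y, s):
    """Weighted least-squares line, following the formulae tabulated above.
    x, y: calibration data; s: standard deviation estimated at each point."""
    n = len(x)
    w = n * (1.0 / s**2) / np.sum(1.0 / s**2)       # weights scaled so sum(w) = n
    xw = np.sum(w * x) / n                           # weighted mean of x
    yw = np.sum(w * y) / n                           # weighted mean of y
    bw = np.sum(w * (x - xw) * (y - yw)) / np.sum(w * (x - xw)**2)  # slope
    aw = yw - bw * xw                                # intercept
    s2_yxw = np.sum(w * (y - (aw + bw * x))**2) / (n - 2)  # residual variance
    s2_bw = s2_yxw / np.sum(w * (x - xw)**2)         # variance of slope
    s2_aw = s2_bw * np.sum(x**2) / n                 # variance of intercept
    return aw, bw, np.sqrt(s2_aw), np.sqrt(s2_bw)

# Hypothetical heteroscedastic calibration: s grows with concentration
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.0, 810.0, 1590.0, 2410.0, 3250.0, 4010.0])
s = np.array([5.0, 40.0, 80.0, 120.0, 160.0, 200.0])
aw, bw, se_a, se_b = weighted_regression(x, y, s)
```

With all s_i equal, the function reduces exactly to simple regression, which is a convenient self-check.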
The calculations of weighted regression are available in the usual
statistical packages, so can be executed with ease. The only extra labour
comprises the estimation of the weights. This may not be justified when the
degree of heteroscedasticity is small. However, in long-range calibration
the change in precision is often suﬃcient to have a deleterious impact on
the resulting statistics. In that case, use of a weighted regression is
recommended but, fortunately, even rough estimates of the weights improve the
outcome substantially.
Weights can be estimated either from repeated measurement results or
from a general experience of the performance of the analytical system. (For
an example of the latter case, we might assume that the measurement
standard deviation is 1% of the result except at zero concentration, where
it is half of the detection limit.) In the example given below, weights are
estimated from a small number of repeat measurements.
Further reading
• ‘Why are we weighting?’ (2007). AMC Technical Briefs. No. 27. Free
download via www.rsc.org/amc
6.8 Example of Weighted Regression — Calibration for ²³⁹Pu by ICP–MS

Key points
— The ICP–MS data for calibration of ²³⁹Pu are markedly heteroscedastic.
— Weights were calculated by smoothing the initial estimates.
— Weighted regression gives statistics that are clearly more appropriate
for the measurement of small concentrations.

The calibration data and calculation of the weights are shown in the table
below (units are ng l⁻¹).
Concentration R1 R2 R3 Raw SD Smoothed SD Variance Weight
0 548 662 1141 315 231 53504 16.7015
2 6782 9661 9316 1572 1055 1112196 0.8035
4 15966 14067 17063 1516 1878 3526528 0.2534
6 25612 30337 26987 2430 2701 7296499 0.1225
8 30483 32143 35701 2666 3525 12422109 0.0719
10 42680 46291 35968 5239 4348 18903359 0.0473
Columns R1 to R3 show three repeat responses for each concentration.
The next column shows the standard deviations calculated from the three
responses. Such estimates are, of course, very variable and a better estimate
might be obtained by smoothing them by simple regression, as in Fig. 6.8.1,
or even by eye. The ﬁtted values are shown in the following column. When
conducting this smoothing it is important to check that all of the resulting
estimates are reasonable and greater than zero. The variance is the square
of the smoothed standard deviation and the weights are calculated from
w_i = n(1/s_i²)/Σ_i(1/s_i²), so that the sum of the weights is 18 (the total
number of measured responses).

Fig. 6.8.1. Smoothing the raw standard deviation estimates (circles) by regression
(line).
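The weight calculation can be checked in a few lines. The smoothed standard deviations below are the (rounded) values printed in the table, so the recomputed weights agree with the tabulated ones only to within rounding:

```python
import numpy as np

# Smoothed standard deviations from the table (rounded printed values)
s = np.array([231.0, 1055.0, 1878.0, 2701.0, 3525.0, 4348.0])
n = 18  # total number of measured responses (3 replicates x 6 levels)

# w_i = n (1/s_i^2) / sum_i (1/s_i^2), so that the weights sum to n
inv_var = 1.0 / s**2
w = n * inv_var / inv_var.sum()
```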
The statistics from the weighted regression are as follows.
Predictor Coeﬃcient Standard error t p
Intercept 767.8 145.8 5.27 0.000
Slope 4055.2 135.0 30.05 0.000
s_yx = 142.0, R² = 98.3%
Analysis of variance

Source of variation   Degrees of freedom   Sum of squares   Mean square   F        p
Regression            1                    18208348         18208348      902.70   0.000
Residual error        16                   322734           20171
 – Lack of fit        4                    72606            18152         0.87     0.509
 – Pure error         12                   250128           20844
Total                 17                   18531082
Note the following.
• The intercept is signiﬁcantly diﬀerent from zero (p < 0.0005).
• As we have repeated measurements at each concentration, we can also
conduct the test for lack of ﬁt, and ﬁnd that there is no signiﬁcant lack
of ﬁt (p = 0.509).
• We can estimate a detection limit (§9.6) from c_L = 3 se(a)/b =
(3 × 146)/4055 = 0.1 ng l⁻¹.
• The regression line and the 95% conﬁdence limits are shown in Fig. 6.8.2.
The conﬁdence interval is least near the origin of the graph.
We can see the beneﬁcial eﬀects of weighted regression in this instance
by repeating the regression without weighting. The statistics are given
below.
Predictor Coeﬃcient Standard error t p
Intercept 559 1119 0.50 0.624
Slope 4126.1 184.8 22.32 0.000
s_yx = 2679, R² = 96.9%
Analysis of variance

Source of variation   Degrees of freedom   Sum of squares   Mean square   F        p
Regression            1                    3575212011       3575212011    498.31   0.000
Residual error        16                   114794611        7174663
 – Lack of fit        4                    24146489         6036622       0.80     0.548
 – Pure error         12                   90648122         7554010
Total                 17                   3690006622
The most obvious diﬀerences from the weighted statistics are as follows.
• The standard error of the intercept is now incorrectly much greater
(1119 instead of 146) and as a consequence we are tempted to make
the incorrect inference that it is not signiﬁcantly diﬀerent from zero
(p = 0.624).
• The apparent detection limit is now degraded to (3 × 1119)/4126 =
0.8 ng l⁻¹, that is, magnified by a factor of about eight.
Fig. 6.8.2. Calibration data (circles) for ²³⁹Pu by ICP–MS, showing weighted
regression line and its 95% confidence interval.

Fig. 6.8.3. Same calibration data as Fig. 6.8.2, with regression line and
confidence intervals calculated by unweighted regression.
• The regression line is not much changed (Fig. 6.8.3) but the 95%
confidence limit is much wider at low concentrations.
Therefore, for accurate work, especially at low concentrations, using
weighted regression in calibration may be worth the minor degree of extra
eﬀort involved.
Further reading
• ‘Why are we weighting?’ (2007). AMC Technical Briefs No. 27. Free
download via www.rsc.org/amc
6.9 Nonlinear Regression
Occasionally in analytical science we want to study the relationship between
variables where the proposed model is nonlinear in the coeﬃcients (β).
(This is ‘nonlinear’ in the technical mathematical sense.) For instance, a
model proposed for the variation in the uncertainty of measurement (u) in
a particular method as a function of the concentration of the analyte (c) is
as follows:
u = √(β_0² + β_1² c²).
This expression is not directly tractable by standard regression methods. In
other words, if we minimise the sum of the squared residuals, the resulting
normal equations cannot be solved for β_0, β_1 by straight algebraic methods.
However, if we simply square the expression, we have

u² = β_0² + β_1² c²,
and writing α_0 = β_0², α_1 = β_1² gives us

u² = α_0 + α_1 c²,

which is linear in the coefficients and can be handled by simple regression
by regarding u² as the dependent variable and c² as the independent
variable.
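As a sketch of this linearisation (with fabricated (c, u) values generated from β_0 = 0.5, β_1 = 0.2 and rounded to two decimals, so the back-transformed estimates should land near those values):

```python
import numpy as np

# Hypothetical uncertainty data consistent with u = sqrt(b0^2 + b1^2 c^2)
c = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
u = np.array([0.51, 0.54, 0.64, 0.94, 1.68, 3.24])

# Linearise: regress u^2 on c^2, then back-transform the coefficients
a1, a0 = np.polyfit(c**2, u**2, deg=1)   # u^2 = a0 + a1 c^2
beta0, beta1 = np.sqrt(a0), np.sqrt(a1)
```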
Another example of this type is encountered in exploring how
reproducibility standard deviation varies with concentration of the analyte in
interlaboratory studies such as proﬁciency tests. We might suspect that
the data could follow a generalised version of the Horwitz function (see
§9.7) with unknown parameters, namely
σ_R = β_0 c^β_1.
Again this function cannot be tackled directly by regression. However,
transforming the variables simply by taking logarithms gives us
log σ_R = log β_0 + β_1 log c.

Now we can regress the dependent variable (log σ_R) against the independent
variable (log c) to obtain estimates of the parameters α_0 = log β_0 and β_1.
There is a slight diﬃculty in that the transformed error term may not be
normally distributed, making the tests of signiﬁcance nonexact, but this
is seldom a serious objection. As usual we should look at the residual plots
to ensure that the model is adequate.
However, there is another class of nonlinear equations that cannot be
treated by transformation. For example, there are theoretical reasons for
expecting a calibration curve in ICP–AES to follow the pattern

r = β_0 + β_1 c − e^(β_2 + β_3 c).
There is no transformation that will reduce this equation to a linear form.
There are several ways of finding the estimates of the β_i, but all of these
numerical methods depend on iterative procedures starting from initial
approximations. Such methods are beyond the scope of these notes.
6.10 Example of Regression with Transformed Variables
Key points
— Reproducibility standard deviation was related to concentration of the
analyte (protein nitrogen in feed ingredients) using a nonlinear model
analogous to the Horwitz function.
— The data were logtransformed before regression.
— No lack of ﬁt was detected.
In a collaborative trial (method performance study) 14 diﬀerent animal
feed ingredients were subjected to the determination of the concentration c
of nitrogen (as an indicator of protein content) by the Dumas method in
the participating laboratories. The standard deviation of reproducibility σ_R
was calculated for each material. The investigator wished to see whether
the results conformed to the Horwitz function (see §9.7),

σ_R = 0.02 c^0.8495,

or, transforming to logarithms base 10,

log σ_R = log 0.02 + 0.8495 log c = −1.699 + 0.8495 log c.
The primary statistics obtained were as follows. (All the data are expressed
as mass fractions, as required by the Horwitz function so, for example,
0.0946 ≡ 9.46% mass fraction.)
c          s_R          log c          log s_R
0.0946 0.0011 −1.02411 −2.95861
0.1197 0.002488 −0.92191 −2.60415
0.0271 0.0007 −1.56703 −3.1549
0.1867 0.003062 −0.72886 −2.51399
0.0188 0.000328 −1.72584 −3.48413
0.0827 0.001444 −1.08249 −2.84043
0.0141 0.000404 −1.85078 −3.39362
0.0168 0.00046 −1.77469 −3.33724
0.0267 0.000572 −1.57349 −3.2426
0.0443 0.001022 −1.3536 −2.99055
0.1049 0.001498 −0.97922 −2.82449
0.0111 0.000284 −1.95468 −3.54668
0.0636 0.001306 −1.19654 −2.88406
0.0436 0.000814 −1.36051 −3.08938
Fig. 6.10.1. Nitrogen data (solid circles) with fitted line (solid) and Horwitz
function (dashed).

Fig. 6.10.2. Nitrogen data (solid circles) with fitted line (solid) and Horwitz
function (dashed). Same data and functions as Fig. 6.10.1 but on logarithmic axes.
Fig. 6.10.3. Residuals from the ﬁt to the logtransformed nitrogen data.
Regression of log s_R on log c gave the following output.

Predictor    Coefficient    Standard error    t         p
Intercept    −1.97935       0.08363           −23.67    0.000
Slope        0.79366        0.05915           13.42     0.000

s_yx = 0.0824, R² = 93.8%
The regression therefore tells us that

log s_R = −1.979 + 0.794 log c.

As 10^−1.979 = 0.0105, transforming back to mass fractions gives us

s_R = 0.0105 c^0.794.
The exponent found is somewhat lower than that of the Horwitz function,
although probably not significantly so. The coefficient of 0.0105 is
considerably lower than the Horwitz value, however, showing that the
determination is more precise than predicted. These features can be seen in
Figs. 6.10.1 and 6.10.2, together with the Horwitz function. The residual
plot (Fig. 6.10.3) suggests a reasonable ﬁt — the residuals look a bit
skewed — but the deviation from normality is not signiﬁcant at the 95%
level of conﬁdence.
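The log–log fit can be reproduced directly from the table of primary statistics (a sketch; the data are transcribed from the table above):

```python
import numpy as np

# (c, s_R) pairs transcribed from the table (mass fractions)
c = np.array([0.0946, 0.1197, 0.0271, 0.1867, 0.0188, 0.0827, 0.0141,
              0.0168, 0.0267, 0.0443, 0.1049, 0.0111, 0.0636, 0.0436])
sR = np.array([0.0011, 0.002488, 0.0007, 0.003062, 0.000328, 0.001444,
               0.000404, 0.00046, 0.000572, 0.001022, 0.001498, 0.000284,
               0.001306, 0.000814])

# Fit log10(sR) = a0 + b1 * log10(c), then back-transform the intercept
b1, a0 = np.polyfit(np.log10(c), np.log10(sR), deg=1)
beta0 = 10.0**a0   # multiplicative coefficient in sR = beta0 * c**b1
```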
Chapter 7
Additional Statistical Topics
This chapter covers a number of practical topics related to the normal
distribution and deviations from it that are relevant to the work of
analytical chemists. Outlier tests are covered, but the superiority of
robust methods in this area is emphasised.
7.1 Control Charts
Key points
— Control charts are used to check that a system is operating 'in
statistical control'.
— Shewhart charts traditionally have 'warning limits' at µ ± 2σ and
'action limits' at µ ± 3σ.
— A system is regarded as out of control on the basis of a result outside
the action limits.
An analytical system where all the factors that aﬀect the magnitude of
errors are kept constant is said to be in 'statistical control'. Under those
conditions it would be reasonable to assume that results obtained by repeated
analysis of a single test material would resemble independent random values
taken from a normal distribution N(µ, σ²). Thus we would expect in the
long term about 95% of results to fall within a range of µ ± 2σ and about
99.7% to fall within the range µ ± 3σ. A result obtained outside the latter
range would be so unusual under the assumption of statistical control that
it is conventionally taken to indicate that the assumption is invalidated, i.e.,
that conditions determining the size of errors have changed, and the
analytical system is behaving in a new and unacceptable fashion. Either a new
value of µ prevails, or a larger value of σ, perhaps because of instrument
malfunction or a new batch of reagents, or failure on the part of the analyst
to observe some aspects of the operating procedure. This requires some
investigation of the system and remediation where necessary to restore the
initial conditions.
A convenient way to monitor an analytical system in this way is via a
control chart. This is based on the results obtained by the analysis of one
or more special test materials that have been homogenised and tested for
stability. These ‘control materials’ must be typical of the material under
routine test and are analysed exactly as if they were normal samples in
every run of the analytical system. The results are plotted on a chart that
shows the results as a function of run number. The chart conventionally has
lines at µ, µ±2σ (‘Warning limits’), and µ±3σ (‘Action limits’) (Fig. 7.1.1).
This type of chart is called a Shewhart chart.
Under statistical control, a result is very unlikely to fall outside the
action limits. Such a point is taken to show that the system is ‘out of
control’, requiring the results obtained in that run to be regarded as
suspect, and the analytical system to be halted until statistical control has
been restored. Some other occurrences are about equally unlikely, and are
also taken to indicate outofcontrol conditions, namely: (a) two successive
results outside the warning limits; or (b) nine consecutive results on the
same side of the mean line.
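These decision rules are easy to automate. A minimal sketch (the function name and return format are our own, not from the text):

```python
def shewhart_violations(results, mu, sigma):
    """Return (run number, rule) pairs for the out-of-control rules above."""
    flags = []
    for i, v in enumerate(results):
        run = i + 1
        if abs(v - mu) > 3 * sigma:                      # outside action limits
            flags.append((run, "action limit"))
        if i >= 1 and all(abs(u - mu) > 2 * sigma for u in results[i-1:i+1]):
            flags.append((run, "two outside warning limits"))
        if i >= 8 and (all(u > mu for u in results[i-8:i+1])
                       or all(u < mu for u in results[i-8:i+1])):
            flags.append((run, "nine on one side of the mean"))
    return flags
```

For example, with µ = 0 and σ = 1, the sequence [0.1, −0.2, 3.5, 0.0, 2.5, 2.6] is flagged at Run 3 (outside the action limits) and Run 6 (two successive results outside the warning limits).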
There are many different kinds of control chart with differing
capabilities. The Shewhart chart is good for detecting abrupt changes in the
analytical system. Other control charts are better at detecting smaller
changes or a drift. The Cusum chart is one such. The Zone chart (Fig. 7.1.2)
has roughly the combined capabilities of the Shewhart chart and the Cusum
chart and is simple to plot and interpret. In this chart, results are converted
into scores that depend on which zone of the chart the result falls into.
These scores are labelled weights in the figure. With each successive result
the scores are aggregated. If a new result falls on the opposite side of the
mean from the previous one, the total score is reset to zero before the new
score is aggregated. If the aggregate score gets to eight or more, the system
is deemed out of control. The Zone chart detects all of the out-of-control
conditions that the Shewhart chart does, but also detects the smaller change
that results in the score of eight at Run 37.

Fig. 7.1.1. Shewhart control chart showing results for a control material in successive
runs. The arrows show the system going out of control at Run 7 and Run 26.

Fig. 7.1.2. Zone chart showing part of the same data as Fig. 7.1.1. Symbol positions
(numbered circles) show the zone of the current result. Numbers show the accumulated
score. The arrows show Run 26 to be out of control as before, but also show Run 37 to
be out of control.
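The Zone-chart scoring can be sketched in the same way. The text does not tabulate the zone weights, so the sketch below assumes the commonly used values 0, 2, 4 and 8 for results within 1σ, 1–2σ, 2–3σ and beyond 3σ of the mean:

```python
def zone_chart(results, mu, sigma):
    """Accumulated zone scores and out-of-control runs (score >= 8).
    Zone weights 0/2/4/8 are assumed here; the text does not tabulate them."""
    totals, flags = [], []
    total, prev_side = 0, 0
    for run, v in enumerate(results, start=1):
        z = abs(v - mu) / sigma
        weight = 8 if z > 3 else 4 if z > 2 else 2 if z > 1 else 0
        side = 1 if v >= mu else -1
        if side != prev_side:
            total = 0                  # reset on crossing the mean line
        total += weight
        prev_side = side
        totals.append(total)
        if total >= 8:
            flags.append(run)
    return totals, flags
```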
Further reading
• Thompson, M. and Wood, R. (1995). Harmonised Guidelines for Internal
Quality Control in Analytical Chemistry Laboratories, Pure Appl Chem,
67, pp. 649–666.
• 'The J-chart: a simple plot that combines the capabilities of Shewhart
and Cusum charts, for use in analytical quality control’. (2003). AMC
Technical Briefs No. 12. Free download via www.rsc.org/amc.
• ‘Internal quality control in routine analysis’. (2010). AMC Technical
Briefs No. 46. Free download via www.rsc.org/amc.
7.2 Suspect Results and Outliers
Key points
— Analytical datasets often contain suspect values that seem inconsistent
with the majority.
— Outliers can have a large inﬂuence on classical statistics, especially
standard deviations.
— It is often diﬃcult to identify suspect values visually as outliers.
— Deleting identiﬁed outliers before calculating statistics needs an
informed judgement.
While a set of repeated analytical results will often broadly resemble a
normal distribution, it is not uncommon to ﬁnd that a small proportion of
the results are discrepant, that is suﬃciently diﬀerent from the rest of the
results to make the analyst suspect that they are the outcome of a large
uncontrolled variation (i.e., a mistake) in procedure. Data given below and
shown in Fig. 7.2.1 can be taken as a typical example.
15.1 24.9 26.7 27.1 28.4 31.1
Outliers have a large effect on classical statistics (especially the standard
deviation), which could invalidate decisions depending on probabilities.
This effect can be seen by comparing Figs. 7.2.2 and 7.2.3. Where there
is a documented mistake that accounts for the suspect value, it can be
corrected or deleted from the dataset without any question. When there is no
such explanation, analysts differ about whether deletion is justified. Those
against deletion argue that the discrepant result is still part of the
analytical system, so should be retained if the summary statistics are to be fully
descriptive and suitable for prediction of future results from the analytical
system. In any event, deletion seems like an unhealthy subjectivity creeping
into the science. Other scientists maintain that deletion is appropriate when
Fig. 7.2.1. Results of a determination repeated by six analysts. The result at 15.1 is suspected of being an outlier.
Fig. 7.2.2. The normal distribution modelling the data if the suspect value is excluded. Most of the data are modelled well.

Fig. 7.2.3. The normal distribution modelling the data if the suspect value is included. The mean seems biased low and the standard deviation is inflated. All of the data are modelled poorly.
the suspect result is clearly discrepant, so that summary statistics
accurately represent the behaviour of the great majority of the results. These
scientists have to bear in mind that outliers may occur in the future, although
they will have no indication of their probability or magnitude. Figures 7.2.2
and 7.2.3 illustrate these points. Both of these arguments have some virtue
in particular circumstances, but the crucial decision is whether the suspect
point is really discrepant or is simply a slightly unusual selection of results
from a normal distribution. Our visual judgement of this is notoriously
poor when, as is typical in analytical science, there are only a few data
available.
Statistical tests for outliers abound, but they tend to suﬀer from various
defects. The simple version of Dixon’s test (§7.3), for example, will not give
a sensible outcome if there are two outliers present, at either end of the
distribution. A better way of handling suspect data is the use of robust
statistics (see §7.5, 7.6).
Notes and further reading
• ‘Rogues and suspects: how to tackle outliers’. (April 2009). AMC Technical Briefs No. 39. Free download via www.rsc.org/amc.
• The dataset can be found in the ﬁle named Suspect.
7.3 Dixon’s Test for Outliers
Key points
— Dixon’s test compares the distance from the suspect value to its
nearest value with the range of the data.
— The simple version of Dixon’s test is foiled by the presence of a second
outlier.
A simple test of a suspect value is Dixon’s Q test. The test statistic Q is
the distance from the suspect point to its nearest neighbour divided by the
range of the data: in terms of Fig. 7.3.1 we have Q = A/B. For our example
data we have
Q = A/B = (24.9 −15.1)/(31.1 −15.1) = 0.61.
The probability of a value of Q exceeding 0.61 arising from random samples
of six observations from a normal distribution is about 0.07. This is almost
small enough to reject the null hypothesis (no outlier) at 95% conﬁdence
so we could reasonably treat 15.1 as an outlier. (In this case a Q value of
greater than 0.63 would be required to provide 95% conﬁdence.)
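The calculation can be reproduced in a few lines; `dixon_q` is an illustrative name, and this simple form examines whichever end of the ordered data is more isolated from its neighbour.

```python
def dixon_q(values):
    # Q = gap between the suspect extreme and its nearest neighbour,
    # divided by the range of the data
    v = sorted(values)
    return max(v[1] - v[0], v[-1] - v[-2]) / (v[-1] - v[0])

aflatoxin = [15.1, 24.9, 26.7, 27.1, 28.4, 31.1]
print(round(dixon_q(aflatoxin), 2))           # → 0.61
print(round(dixon_q(aflatoxin + [37.1]), 2))  # → 0.45 with the extra high value
```

The second call anticipates the extended dataset discussed next: the extra value at 37.1 stretches the range, so the same gap of 9.8 yields a smaller, non-significant Q.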
A problem with this simple test is that it can be foiled by the presence
of a second outlier at either end of the range. Suppose in addition to the
previous results there was an extra value at 37.1 ppb (Fig. 7.3.2). The test
statistic would then be

Q = A′/B′ = (24.9 − 15.1)/(37.1 − 15.1) = 0.45.
A value as high as 0.45 would arise by chance with a probability as high as
0.17, so we should now be unwilling to regard the low value as an outlier.
Indeed, it is not an outlier even though it is the same distance as before from
the closest value! A similar problem would arise if the extra value were on
Fig. 7.3.1. Dixon’s test Q = A/B applied to the aflatoxin data.

Fig. 7.3.2. Dixon’s test applied to an extended dataset.
the same extreme as the original suspect value. There are modiﬁed versions
of Dixon’s test that can be applied to this new situation, but these and
similar problems aﬀect many outlier tests. A more sophisticated treatment
of suspect values is preferable (§7.5, 7.6).
Notes and further reading
• ‘Rogues and suspects: how to tackle outliers’. (April 2009). AMC Technical Briefs No. 39. Free download via www.rsc.org/amc.
• Critical values of Q can be found in many statistical texts.
• The dataset is in the file named Suspect.
7.4 The Grubbs Test
Key points
— The Grubbs test in its simplest form tests for an outlier by calculating the
difference between the largest (or smallest) value and the mean of a
dataset, in units of the standard deviation.
— It can be expanded to test for multiple outliers.
This is a more sophisticated test for outliers than Dixon’s test. It is used to
detect outliers in a dataset by testing for one outlier at a time. Any outlier
which is detected is deleted from the data and the test is repeated until no
outliers are detected. However, multiple iterations change the probabilities
of detection, and the test should not be used for sample sizes of six or less
since it frequently tags most of the points as outliers. The basic assumption
underlying the Grubbs test is that, outliers aside, the data are normally
distributed. The null hypothesis is that there are no outliers in the dataset.
The test statistic G is calculated for each result xᵢ from the sample
mean x̄ and standard deviation s as

G = max|xᵢ − x̄|/s.

This statistic is the largest absolute deviation from the sample mean, in
units of the sample standard deviation. This form of the Grubbs test is
therefore a two-tailed test. There are other forms of the test, including
one-tailed versions.
As with Dixon’s test, the aﬂatoxin data can be used as an example.
15.1 24.9 26.7 27.1 28.4 31.1
We carry out a two-tailed test as above under the null hypothesis that there
is no outlier in the dataset. The test statistic is calculated as

G = |15.1 − 25.55|/5.5186 = 1.894.
This is compared with the 95% critical value for n = 6 of 1.933. As
the calculated value of G is less than the critical value, the null hypothesis
is not rejected at 95% conﬁdence and the result 15.1 is not identiﬁed as
an outlier. However, G is high enough at least to warrant checking the
calculations leading to the result.
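This calculation can be checked with the standard library alone; `grubbs_g` is an illustrative name, and the 95% critical value for n = 6 is taken from the table in the notes below.

```python
from statistics import mean, stdev

def grubbs_g(values):
    # largest absolute deviation from the sample mean,
    # in units of the sample standard deviation
    m, s = mean(values), stdev(values)
    return max(abs(x - m) for x in values) / s

aflatoxin = [15.1, 24.9, 26.7, 27.1, 28.4, 31.1]
g = grubbs_g(aflatoxin)
print(round(g, 3), g > 1.933)  # → 1.894 False: below the 95% critical value
```

As in the text, G = 1.894 falls just short of the tabulated critical value of 1.933, so 15.1 is not flagged as an outlier.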
Notes and further reading
• ‘Grubbs’ is the name of the originator of this test, and is not a possessive
case.
• ‘Rogues and suspects: how to tackle outliers’. (April 2009). AMC Technical Briefs No. 39. Free download via www.rsc.org/amc.
• Tables of critical levels of the test statistic can be found in some statistical
texts. A brief table for the two-tailed test is given below, with N the number
of observations.

  N    G(95%)   G(99%)
  6    1.933    1.993
  7    2.081    2.171
  8    2.201    2.316
  9    2.300    2.438
  10   2.383    2.542
  11   2.455    2.631
  12   2.518    2.709
  13   2.574    2.778
  14   2.624    2.840
  15   2.669    2.895
  16   2.710    2.945
  17   2.748    2.991
  18   2.782    3.032
  19   2.814    3.071
  20   2.843    3.106
• In the absence of tables, critical levels can be calculated from the
equation

G_crit = ((N − 1)/√N) √( t²₍α/2N₎,N−2 / (N − 2 + t²₍α/2N₎,N−2) ),

with t₍α/2N₎,N−2 denoting the critical value of the t distribution with
(N − 2) degrees of freedom and a significance level of α/2N. (α is the
overall significance level so, for G(95%), α = 0.05.)
7.5 Robust Statistics — MAD Method
Key points
— Robust methods reduce the inﬂuence of outlying results and provide
statistics that describe the distribution of the central or ‘good’ part
of the data.
— The methods are applicable to data that seem to be unimodal and
roughly symmetrically distributed.
— Robust statistics must be used with care for prediction.
— The MAD method is a quick way of calculating robust mean and
standard deviation, without requiring any decisions about rejecting
outliers.
The use of robust statistics enables us to circumvent the sometimes
contentious issue of the deletion of outliers and provides perhaps the best
method of identifying them. Robust methods of estimating statistics such
as the mean and standard deviation (and many others) reduce the influence of
outlying results and heavy tails in distributions. Robust statistics requires
the original data to be similar to normal (i.e., roughly symmetrical and
unimodal) but with a small proportion of outliers or heavy tails. It
cannot be used meaningfully to describe strongly skewed or multimodal
datasets. Robust statistics must be used with care in prediction. They
will not enable the user to predict the probability or likely magnitude of
outliers.
There are a number of robust methods in use. One of the simplest is the
MAD (Median Absolute Diﬀerence) method. Suppose we have replicated
results as follows:
145 130 157 153 183 148 151 143 147 163.
Putting these in ascending order we have:
130 143 145 147 148 151 153 157 163 183.
We need the median of these results, that is, the central value. In this
instance the median is the mean of 148 and 151, namely 149.5. This median
is a robust estimate of the mean, that is, µ̂ = 149.5. It is unaffected by
making the extreme values more extreme, for instance, by changing the
size of the lowest value to any value lower than 148, and/or the highest
value to any value greater than 151.
The next stage is to subtract the median of the data from each value
and ignore the sign of the diﬀerence, giving the absolute diﬀerences (in the
same order as immediately above):
19.5 6.5 4.5 2.5 1.5 1.5 3.5 7.5 13.5 33.5
If we sort these absolute diﬀerences into increasing order, we have:
1.5 1.5 2.5 3.5 4.5 6.5 7.5 13.5 19.5 33.5.
The median of these results (the median absolute diﬀerence) is the mean
of 4.5 and 6.5, namely 5.5. This also is unaﬀected by the magnitude of
the extreme values. We multiply this median by the factor 1.4825, which is
derived from the properties of the normal distribution. The product is the
robust estimate of the standard deviation, namely σ̂ = 5.5 × 1.4825 = 8.2
(to two significant figures). (Note: the robust statistics are designated µ̂, σ̂
to distinguish them from the classical estimators x̄, s, which are used only
as defined in §2.3.)

The robust estimates are thus µ̂ = 149.5, σ̂ = 8.2.
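The whole MAD procedure fits in a few lines of code; `mad_estimates` is an illustrative name.

```python
from statistics import median

def mad_estimates(values):
    mu = median(values)                            # robust mean
    mad = median(abs(x - mu) for x in values)      # median absolute difference
    return mu, 1.4825 * mad                        # robust standard deviation

data = [145, 130, 157, 153, 183, 148, 151, 143, 147, 163]
mu, sigma = mad_estimates(data)
print(mu, round(sigma, 2))  # → 149.5 8.15
```

The unrounded σ̂ of 8.15 is the value carried forward as the starting estimate in §7.6.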
The MAD method is quick and has a negligible deleterious eﬀect on the
statistics if the dataset does not include outliers. It therefore can be used in
emergencies (i.e., when there isn’t a calculator handy). There are somewhat
better ways of estimating robust means and standard deviations, but they
all require special programs to do the calculations.
Notes and further reading
• A factor of 1.5 applied to the MAD (instead of 1.4825) is accurate enough
for all analytical applications.
• ‘Rogues and suspects: how to tackle outliers’. (April 2009). AMC Technical Briefs No. 39. Free download from www.rsc.org/amc.
• Rousseeuw, P.J. (1991). Tutorial to Robust Statistics, J Chemomet, 5,
pp. 1–20. This paper can be downloaded gratis from ftp://ftp.win.ua.ac.
be/pub/preprints/91/Tutrob91.pdf.
• The dataset used in this example is named Outlier.
7.6 Robust Statistics — Huber’s H15 Method
Key points
— Huber’s H15 method is a procedure for calculating robust mean and
standard deviation.
— It is an iterative procedure, starting with initial estimates of the
statistics.
— At each iteration, the data are ‘winsorised’ (modiﬁed) by using the
current values of the statistics.
— Comparing data with a robust ﬁt is one of the best ways of identifying
suspect values.
Several methods for robustifying statistical estimates depend on
down-weighting extreme values in some way, to give them less influence on the
outcome of the calculation. Huber’s H15 method is one such that has been
widely used in the analytical community. Like most of these methods, it
relies on taking initial rough estimates of the statistics (µ̂₀, σ̂₀) and
refining them by an iterated procedure.

Using the data from §7.5, we subject them to a process called ‘Winsorisation’.
This involves replacing original datapoints x falling outside the range
µ̂₀ ± kσ̂₀ with the actual range limits. This creates pseudo-values x̃ thus:

x̃₁ = µ̂₀ + kσ̂₀, if x > µ̂₀ + kσ̂₀;
x̃₁ = µ̂₀ − kσ̂₀, if x < µ̂₀ − kσ̂₀;
x̃₁ = x, if µ̂₀ − kσ̂₀ < x < µ̂₀ + kσ̂₀.

The first revised estimates of the statistics are then given by µ̂₁ = mean(x̃₁)
and σ̂₁ = sd(x̃₁)/θ. For a moderate proportion of outlying results (and
most analytical applications) k can be set at 1.5. The corresponding value
of θ, derived from the properties of the normal distribution, is 0.882. The
process is then repeated using µ̂₁, σ̂₁ to winsorise the data and calculate the
improved estimates µ̂₂, σ̂₂ in the same manner, and so on until a sufficient
degree of convergence is obtained. Convergence is slow, so a computer is
required.
The table below shows the application of this to the previously-used
suspect data (row x₀), starting with the MAD estimates µ̂₀ = 149.5,
σ̂₀ = 8.15. The first replacement limits are µ̂₀ ± kσ̂₀ = (137.27, 161.73),
so any value less than 137.27 becomes 137.27 and any value greater than
161.73 becomes 161.73. Thus in the first Winsorisation (row x̃₁) three values
are replaced, producing estimates of µ̂₁ = 150.47, σ̂₁ = 9.11 to
two decimal places.
       Data                                                        µ̂_rob    σ̂_rob
x₀     130.00  143  145  147  148  151  153  157  163.00  183.00   149.50    8.15
x̃₁     137.27  143  145  147  148  151  153  157  161.73  161.73   150.47    9.11
x̃₂     136.81  143  145  147  148  151  153  157  163.00  164.14   150.80    9.87
x̃₃     135.98  143  145  147  148  151  153  157  163.00  165.61   150.86   10.33
x̃₄     135.36  143  145  147  148  151  153  157  163.00  166.36   150.87   10.62
...
x̃₁₇    134.21  143  145  147  148  151  153  157  163.00  167.54   150.88   11.11
In subsequent iterations only two values are replaced. The results have
stabilised sufficiently by the 17th iteration, giving final estimates of
µ̂_rob = 150.88, σ̂_rob = 11.11. These can be compared with the classical
statistics for the complete data (x̄ = 152.0, s = 14.0) and for the data with
the suspect value deleted (x̄′ = 148.6, s′ = 9.33). Simply deleting the suspect
value gives a standard deviation that is too low.
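The iteration is straightforward to code. This sketch starts from the MAD estimates of §7.5 and stops when successive estimates agree to a set tolerance; the tolerance and iteration cap are our choices for illustration, not part of the published procedure.

```python
from statistics import mean, median, stdev

def huber_h15(values, k=1.5, theta=0.882, tol=1e-6, max_iter=500):
    # initial rough estimates from the MAD method of section 7.5
    mu = median(values)
    sigma = 1.4825 * median(abs(x - mu) for x in values)
    for _ in range(max_iter):
        lo, hi = mu - k * sigma, mu + k * sigma
        w = [min(max(x, lo), hi) for x in values]     # winsorise
        mu_new, sigma_new = mean(w), stdev(w) / theta
        if abs(mu_new - mu) < tol and abs(sigma_new - sigma) < tol:
            break
        mu, sigma = mu_new, sigma_new
    return mu_new, sigma_new

data = [130, 143, 145, 147, 148, 151, 153, 157, 163, 183]
mu, sigma = huber_h15(data)
print(round(mu, 2), round(sigma, 2))  # close to the quoted 150.88 and 11.11
```

Run to full convergence the estimates settle fractionally beyond the 17th-iteration values quoted in the table, which is why the text describes that point as "stabilised sufficiently" rather than exactly converged.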
Robust statistics is probably one of the best ways of identifying outliers.
If we pseudo-standardise the dataset as z = (x − µ̂_rob)/σ̂_rob, the
‘good’ results should resemble a sample from the standard normal
distribution. Any results with a magnitude greater than about 2.5 can therefore
be regarded as at least suspect, if not outlying. If we apply this
transformation to our example data (in increasing order) we obtain:

z:  −1.9  −0.7  −0.5  −0.3  −0.3  0.0  0.2  0.6  1.1  2.9.
The value of 2.9 suggests that the original result of 183 is suspect and that
its provenance should be investigated further at least.
Notes and further reading
• ‘Rogues and suspects: how to tackle outliers’. (April 2009). AMC Technical Briefs No. 39. Free download from www.rsc.org/amc.
• Analytical Methods Committee. (1989). Robust Statistics — How Not to
Reject Outliers. Part 1. Basic Concepts, Analyst, 114, pp. 1693–1697.
• There is Excel software for conducting the H15 method in the AMC Software
collection, free download via www.rsc.org/amc.
• Rousseeuw, P.J. (1991). Tutorial to Robust Statistics, J Chemomet, 5,
pp. 1–20. This paper can be downloaded gratis from ftp://ftp.win.ua.ac.
be/pub/preprints/91/Tutrob91.pdf.
• The dataset used in this example is named Outlier.
7.7 Lognormal Distributions
Key points
— A variable x is lognormally distributed if log x is normally distributed.
— The shape of a lognormal distribution depends on its relative standard
deviation.
— Lognormal distributions of error are rare in chemical measurement.
— Some analytical circumstances give rise to quasi-lognormal distributions.
— Log-transformation can sometimes be safely used to stabilise variance
before regression or ANOVA.
A variable x is lognormally distributed if log x is normally distributed.
Figure 7.7.1 shows a lognormal distribution with a mean of two and a
standard deviation of one, that is, with a relative standard deviation (RSD)
of 50%. It has zero density (height) when x is zero, and a positive skew.
All lognormal distributions have these two properties (but so do many other distributions). A
plot of density against log x (Fig. 7.7.2) has the familiar shape of the normal
distribution. The shape (degree of skewness) of the lognormal distribution
depends on the RSD. For instance, a distribution with an RSD of 10%
Fig. 7.7.1. A lognormal distribution with
a relative standard deviation of 50%.
Fig. 7.7.2. The same distribution plotted against log₁₀ x.
Fig. 7.7.3. Lognormal distribution with an RSD of 10%, showing little visible sign of
asymmetry.
(Fig. 7.7.3) or lower is very diﬀerent from one with an RSD of 50%, and
hard to distinguish visually from a normal distribution.
A lognormal distribution in measurement implies that the errors are
multiplicative. The physical circumstances of a chemical measurement
rarely give rise to results that genuinely follow that rule. However, in
one currently-important type of measurement — the determination of
specific DNA sequences from genetically-modified food by the real-time
polymerase chain reaction (PCR) — that circumstance is approximately
realised. Because the fundamental procedure of PCR is multiplicative, the
errors tend to follow the same pattern. So we might find the 95%
confidence limits of repeated measurements at (say) 0.5 and 2.0 times the mean
concentration. For this particular measurement a log-transformation of the
results has been found to be justified and helpful.
In nearly all other types of chemical measurement, however, these
conditions do not apply and log-transformation should be used with due caution.
This is important to remember because appearances can be deceptive at
concentrations near the detection limit, where the RSD is high — by
definition greater than about 30%. Repeat results at low concentrations may
sometimes appear to be similar to the lognormal because they have been
censored or truncated at zero. The confusion arises because true
concentrations cannot be below zero. However, the results of measurements are not
true concentrations — they include errors — and they can and sometimes
do fall below zero. Some analysts are uncomfortable with this apparent
conflict and as a consequence do not record negative results. Despite the
appearance of censored results, log-transformation will be misleading in this
situation.
Another mathematical operation that gives rise to a skewed distribution
(and causes the same type of confusion) is the division of one imprecise
variable by another. That might happen in the correction of raw analytical
results for recovery. It should not cause noticeable skewness except near the
detection limit. In that region, and in combination with censoring results
at zero, the resulting distribution could on casual inspection be taken as
lognormal by the unwary.
Analytical chemists sometimes encounter other datasets where a
variable is genuinely strictly positive and skewed. Concentrations of trace
analytes in more-or-less random collections of natural materials (e.g.,
copper in sediments [§1.3]) usually have that property and sometimes
approach lognormal. Again, samples taken from a contaminated site or
area may show a similar distribution if the contamination is patchy.
Log-transformation is sometimes useful in the characterisation of such data but,
as always, should be used with due caution.
There are, however, situations where log-transformation can be helpful,
namely in regression and analysis of variance where the precision of
the response varies widely but its RSD can be taken as constant. Of
course, weighted regression will serve if the weights can be estimated,
but often the information is not available. Log-transformation will stabilise
the variance of the response in that situation, because different variables
with the same RSD all have the same absolute standard deviation when
log-transformed. For example, an RSD of 10% becomes a constant SD of
0.045 under log-transformation, regardless of the concentration. The
transformed data will still have a distribution close to normal (unless the RSD is
much higher than 10%), so the usual assumptions of simple regression can
be made.
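A quick simulation illustrates the variance-stabilising effect. It assumes normally distributed results with a constant 10% RSD at three widely spaced concentrations; the concentrations and sample size are arbitrary choices for illustration. The standard deviation of log₁₀ x comes out roughly constant, near RSD/ln 10 ≈ 0.044, whatever the concentration.

```python
import math
import random
from statistics import stdev

random.seed(42)
rsd = 0.10
log_sds = []
for true_conc in (10.0, 100.0, 1000.0):
    # normally distributed results with a constant 10% relative standard deviation
    xs = [random.gauss(true_conc, rsd * true_conc) for _ in range(20000)]
    log_sds.append(stdev(math.log10(x) for x in xs))
print([round(s, 3) for s in log_sds])  # roughly constant, near rsd / ln(10)
```

Because the three log-scale standard deviations agree, unweighted regression or ANOVA on the transformed response treats all concentration levels with comparable precision.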
7.8 Rounding
Key point
— Round the standard deviation to two significant figures and round the
mean to the same number of decimal places (or trailing zeros).
Modern computers use a large number of significant figures in calculations,
and applications often provide statistics with an excessive number of
significant figures. For example, a computer might tell us that the statistics
for the data in §7.2 are: mean 25.5500; and standard deviation 5.5186. Such
data need to be rounded before reporting, because it is obvious that the
fourth decimal place (at least) is quite meaningless.
We are normally taught to retain only the ﬁrst ﬁgure that is uncertain.
But it is important not to overdo the rounding, which can destroy useful
information. The commonly-used rule is sometimes too rigorous. This may
be important if the results are to be used in further statistical operations.
There is a simple rule for appropriate rounding when we report such data:
round the standard deviation to two significant figures and round the mean
to the same number of decimal places (or trailing zeros). The rationale here
is that estimated standard deviations are hardly ever more accurate than
that. Applying this rule to the statistics above gives us: mean 25.6; and
standard deviation 5.5. This rule nearly always leaves a generous number
of signiﬁcant ﬁgures.
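The rule can be sketched as a small helper; the function name is illustrative. A negative number of decimal places (the trailing-zeros case) is handled directly by Python's `round`.

```python
import math

def report(mean_value, sd):
    # round sd to two significant figures, then the mean to the same place
    decimals = 1 - math.floor(math.log10(abs(sd)))
    return round(mean_value, decimals), round(sd, decimals)

print(report(25.5500, 5.5186))  # → (25.6, 5.5)
print(report(1234.4, 233.0))    # → (1230.0, 230.0), the trailing-zeros case
```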
7.9 Nonparametric Statistics
Key points
— Nonparametric tests require less stringent assumptions than
parametric tests. The assumption of normality is not required.
— Many parametric tests have nonparametric equivalents.
Nonparametric tests are sometimes called distribution-free statistics
because they do not depend on the data being drawn from normal
distributions. More generally, nonparametric tests require less restrictive
assumptions about the data. Many statistical tests have nonparametric
equivalents. An important reason for using these tests is that they allow the
analysis of rank data. Despite these useful features, and their widespread
use in the social sciences, nonparametric tests are seldom used by analytical
chemists because parametric methods are usually more powerful,
and the normal distribution is often a reasonable assumption in physical
measurement.
The most commonly used nonparametric test in analytical chemistry is
perhaps the Mann–Whitney U test, which is explained in detail below. The
Mann–Whitney test is the nonparametric equivalent of the two-sample
t-test for comparing the central tendencies of two independent datasets.
One-tailed tests can also be carried out, but for a two-tailed test on
datasets P and Q the null and alternative hypotheses are:

H₀: Median(P) = Median(Q)
H_A: Median(P) ≠ Median(Q)
The test statistic is obtained by calculating the lesser of U_P and U_Q, where

U_P = nm + m(m + 1)/2 − S_P,
U_Q = nm + n(n + 1)/2 − S_Q,

and data samples of size m (the larger set P) and n (the smaller set Q) are
pooled. S_P, S_Q are the sums of the pooled ranks for the respective datasets.
The test is based only on the following assumptions: the datasets are
independent random samples from the respective populations and the
measurement scale is at least ordinal. A confidence interval for the
difference between the population medians can be estimated with the further
assumption that the two population distribution functions are identical
apart from a possible difference in location.
An example is shown using the wheat ﬂour data from §1.3.6, in which
two methods are used to test for nitrogen in a sample of wheat ﬂour. The
ranked data (that is, tabulated in increasing order) are shown below along
with the associated method (K=Kjeldahl, D=Dumas). Note that tied
results (e.g., 2.92, 2.92) are each given the mean rank (1.5).
Result % Rank Result % Rank
2.92 (K) 1.5 3.02 (K) 9.5
2.92 (K) 1.5 3.04 (D) 11.5
2.98 (K) 3.5 3.04 (D) 11.5
2.98 (D) 3.5 3.05 (K) 13.5
3.00 (K) 5 3.05 (D) 13.5
3.01 (K) 7 3.07 (K) 15
3.01 (K) 7 3.08 (D) 16.5
3.01 (D) 7 3.08 (D) 16.5
3.02 (K) 9.5 3.12 (D) 18
For the Kjeldahl (larger) dataset, m = 10 and the sum of ranks is S_K = 73.
For the Dumas (smaller) dataset, n = 8 and the sum of ranks is S_D = 98.
From this we have U_K = 80 + 55 − 73 = 62 and U_D = 80 + 36 − 98 = 18,
so the value of the test statistic, U = 18, is the lesser of these.

For the sample sizes in this example, the critical value for 95% confidence
is 17, and the null hypothesis is rejected only if U is less than or equal to
this value. As 18 > 17, the null hypothesis cannot be rejected at 95%
confidence, although the result is borderline. Most statistical software will
provide this information plus a p-value, and confidence limits for the
difference between the medians.
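The rank bookkeeping can be checked with a short script. The helper assigns tied values the mean of the ranks they span, as in the table above; the function and variable names are illustrative.

```python
def pooled_ranks(values):
    # rank the pooled data, giving tied values the mean of their ranks
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

kjeldahl = [2.92, 2.92, 2.98, 3.00, 3.01, 3.01, 3.02, 3.02, 3.05, 3.07]
dumas = [2.98, 3.01, 3.04, 3.04, 3.05, 3.08, 3.08, 3.12]
m, n = len(kjeldahl), len(dumas)
r = pooled_ranks(kjeldahl + dumas)
s_k, s_d = sum(r[:m]), sum(r[m:])
u_k = n * m + m * (m + 1) / 2 - s_k
u_d = n * m + n * (n + 1) / 2 - s_d
print(s_k, s_d, min(u_k, u_d))  # → 73.0 98.0 18.0
```

Statistical packages follow the same ranking convention, though some report the U for the first sample rather than the lesser of the two; the two conventions always sum to nm.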
Notes and further reading
• There are a number of other tests that are counterparts of a parametric
test. These include the Wilcoxon Matched Pairs Signed Ranks test, which
is the equivalent of the paired t-test, and the Kruskal–Wallis test, which is
a method for comparing several independent random samples and which
can be used as a nonparametric alternative to the one-way ANOVA.
• Further details and tables of critical values can be found in standard
reference books.
• Most statistical software packages provide all of the common
nonparametric tests.
7.10 Testing for Specific Distributions — the Kolmogorov–Smirnov One-Sample Test
Key points
— The Kolmogorov–Smirnov test is a nonparametric test used to test
whether or not a single sample of data is consistent with a speciﬁed
distribution function.
— The data values are ordered and compared with the equivalent value
from the distribution.
The Kolmogorov–Smirnov statistic quantifies the distance between the
empirical cumulative distribution function of the sample and the cumulative
distribution function of the hypothesised distribution, often the normal
distribution. A graphical representation is shown in Fig. 7.10.1.

To calculate the test statistic it is necessary to calculate the values of

F(x): the empirical cumulative distribution function;
G(x): the cdf of the hypothesised distribution — in this example, the
normal distribution.
Fig. 7.10.1. Cumulative distribution functions of a test dataset (step function) and a
normal distribution (dashed line) with mean zero and unit standard deviation.
Observed data    F(x)      G(x)      F(x) − G(x)
−3               0         0.0013    −0.0013
−2               0.1429    0.0228     0.1201
−1               0.2857    0.1587     0.1270
 0               0.4286    0.5000    −0.0714
 1               0.5714    0.8413    −0.2699
 2               0.7143    0.9772    −0.2629
 3               0.8571    0.9987    −0.1416
 4               1.0000    1.0000     0.0000
The test statistic is

D = max|F(x) − G(x)| = 0.27.
The hypotheses for the Kolmogorov–Smirnov test are defined as

H₀: the data follow the specified distribution;
H_A: the data do not follow the specified distribution.
The null hypothesis is rejected if the test statistic, D, is greater than
the critical value, which is provided by most statistical software. Tables can
be found in textbooks and online.
For n = 8 the critical value at the 95% level is 0.457. As 0.27 < 0.457,
the null hypothesis cannot be rejected.
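The tabulated calculation can be reproduced as follows. Two assumptions are made to match the worked example: F(x) is taken exactly as tabulated (steps of 1/(n − 1)), and G(x) is the standard normal CDF. Many software implementations instead evaluate the empirical CDF on both sides of each step, which can give a somewhat different D for the same data.

```python
from math import erf, sqrt

def norm_cdf(x):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

observed = [-3, -2, -1, 0, 1, 2, 3, 4]
n = len(observed)
# empirical cdf as tabulated in the worked example: steps of 1/(n - 1)
F = [i / (n - 1) for i in range(n)]
G = [norm_cdf(x) for x in observed]
D = max(abs(f - g) for f, g in zip(F, G))
print(round(D, 2))  # → 0.27
```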
Assumptions:
• It only applies to samples from continuous distributions.
• The distribution must be fully specified. That is, if location, scale and
shape parameters are estimated from the data, the critical region of the
Kolmogorov–Smirnov test is no longer valid; it typically must be
determined by simulation.
The distribution specified in the null hypothesis is often the normal
distribution. Hence the Kolmogorov–Smirnov test is often used as a test for
normality. Here the test compares the cumulative distribution of the data with
the expected cumulative normal distribution, with the test statistic being
based on the largest discrepancy. Other tests for normality also depend on
the comparison of cumulative distributions.
Testing for normality should be treated with caution. If a dataset fails a
normality test, it does not always follow that parametric tests cannot
be applied. Consider the following points.
• Small samples almost always pass a normality test. Normality tests have
little power to tell whether or not a small sample of data comes from a
normal distribution.
• With large samples, minor deviations from normality may be flagged
as statistically significant, even though small deviations from a normal
distribution will not affect the results of a t-test or ANOVA.
7.11 Statistical Power and the Planning of Experiments
Key points
— The power of a statistical test is the probability of rejecting a null
hypothesis when it is in fact false.
— Power calculations provide a way of checking whether a proposed
experiment is capable of delivering a useful result at a minimal cost.
The outcome of a statistical test stems from a balance between various cir
cumstances, namely the magnitude of the eﬀect being tested, the precision
of the measurements and the number of measurements made. (An eﬀect is
the deviation of the test statistic from H0.) Suppose the outcome was ‘not
significant’. It might well be that the opposite outcome would have been
recorded if the analytical method had been more precise or more repeat
measurements made. Alternatively, an effect might be significant but of a
magnitude that is of no importance in the context of the test. It is
worthwhile considering this balance before any measurements are undertaken:
there is no point in undertaking an experiment that is unlikely to provide a
useful outcome. Equally, as measurements cost money, and higher precision
costs more than lower, it is important to commit the least resource that will
provide a useful outcome. These considerations come under the heading of
statistical power.¹ It is therefore very good practice to estimate the power
of a proposed experiment before it is undertaken.
Section 1.11 described a critical level of probability that we regard as
convincing for the particular inference that we wish to make. In statistical
power terminology, this critical probability is referred to as α. A ‘Type I’
error occurs when a true null hypothesis is rejected. The probability of a
Type I error is equal to α.
A Type II error occurs when a false null hypothesis is accepted. However,
the probability of a Type II error depends on the specific alternative
hypothesis. For a particular HA this probability is often represented by β.
The probability of rejecting a null hypothesis when it is false is called the
power of a test. The power is therefore a probability with the value 1 − β,
and clearly also depends on HA and is related to a Type II error. The
position is summarised in this table.
            Decision: accept H0                Decision: reject H0
H0 true     Correct decision.                  Type I error, probability α.
            The confidence level is 1 − α.
HA true     Type II error, probability β.      Correct decision, probability
                                               1 − β, the power of the test.
Calculating the power initially requires specifying the effect size that
needs to be detected. (This will be demonstrated in the following
example.) The greater the effect size, the greater the power. Power can
also be increased by improving the precision of the measurements and
by increasing the number of replicated results. Increasing the number of
replications is the most commonly used method for increasing statistical
power. Although there are no formal standards for power, a value of more
than 0.80 is sometimes regarded as satisfactory.

¹ Ethical considerations are involved as well as money. Experiments involving
people or animals should be as small as is consistent with a useful outcome.
Example

Suppose that we wish to see whether the concentration of an analyte in a
test material is significantly different from 20 ppm at the 95% confidence
level, so that α = 0.05. We consider making four measurements of
concentration on the material by using a method with a known standard
deviation of 2 ppm. We wish to know whether this experiment would be
likely to provide the information that we want. The null hypothesis H0
for the test is represented by the upper graph in Fig. 7.11.1, which shows
the t-distribution with three (n − 1) degrees of freedom, with the critical
regions for 95% confidence shaded black. The upper limit U of the confidence
region falls at µ + tσ/√n = 23.18 for data drawn originally from a normal
distribution. An observed mean above this limit would occur with a
probability of 0.025 for this two-tailed test and would be regarded (falsely)
as significant under H0. Note that we are doing this calculation before
making any measurements, and have to rely on previous experience of the
precision of the analytical method.
Now suppose that, in the context of the application, a deviation of less
than 4 ppm from the H0 value of 20 could be regarded as inconsequential
or unimportant. We can then focus on the specific alternative hypothesis
HA: µ = 24 as the upper limit of this acceptable range.

Fig. 7.11.1. Representation of a null hypothesis (upper graph, H0: µ = 20,
with lower (L) and upper (U) 95% confidence limits) and an alternative
hypothesis (lower graph, HA: µ = 24) showing the power (1 − β) of the test.

Fig. 7.11.2. Estimated power of the significance test as a function of the
number of repeat measurements.

This is represented
by the lower graph in Fig. 7.11.1. If HA were true, we would make a Type
II error with a probability of β = 0.24. (This is shown as the area below
U, shaded light grey in the lower graph, and calculated from the t3
distribution.) The test would therefore have a power of (1 − β) = 0.76 (area
shaded dark grey in the graph). A material containing 24 ppm would be
detected as significantly different in only 76% of experiments as originally
proposed, which would be unsatisfactory in many applications.
If we needed a more powerful test we could increase the number of repeat
measurements. The relationship between power and number of measurements
in this experiment is shown in Fig. 7.11.2. It is clear that a sample
of six repeat measurements would nearly always provide the information
we wanted (with a power of 0.97), while an experiment with fewer than five
measurements would not. It is a good idea to err on the safe side, as the
power of the actual experiment may not be as good as predicted. (We could
also increase the power by using a more precise analytical method, if one
were available.)
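The power figures in this example can be reproduced with a short script (a sketch assuming scipy is available; it follows the construction used above, taking the Type II error as the probability that the observed mean falls below the upper critical limit U):

```python
from scipy import stats

def power_two_sided(n, mu0=20.0, muA=24.0, sigma=2.0, alpha=0.05):
    """Approximate power of the two-sided one-sample t-test for detecting
    a shift from mu0 to muA, given a known standard deviation sigma."""
    se = sigma / n ** 0.5
    U = mu0 + stats.t.ppf(1 - alpha / 2, df=n - 1) * se  # 23.18 for n = 4
    beta = stats.t.cdf((U - muA) / se, df=n - 1)         # Type II error probability
    return 1 - beta

power4 = power_two_sided(4)   # about 0.76, as in the example
power6 = power_two_sided(6)   # about 0.97: six repeats nearly always suffice
```

Evaluating the function over a range of n reproduces the rising curve of Fig. 7.11.2.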
Notes and further reading
• Power can be estimated for all of the usual tests of signiﬁcance, including
those in analysis of variance and regression.
• The logic of estimating power is intricate but statistical software packages
usually provide power calculations for the most common tests.
• ‘Signiﬁcance, importance and power’. (March 2009). AMC Technical
Briefs No. 38. Free download via www.rsc.org/amc.
PART 2
Data Quality in Analytical
Measurement
Chapter 8
Quality in Chemical Measurement
This chapter reviews the topic of ‘quality’ in analytical measurement
as a basis for the following discussion of the statistical methods
involved. Quality concepts and practices are summarised under three
main headings: fitness for purpose, method validation and quality
control. However, the overarching idea is the uncertainty attached to
an analytical result, what it means, why it is important and how to
estimate and use it.
8.1 Quality — An Overview
Key point
— The principles and practices relating to the quality of analytical data
can be systematised under three related headings: ﬁtness for purpose,
method validation and quality control.
At first sight the number of concepts and practices applied to quality in
analytical chemistry is dauntingly large and, moreover, apparently not
connected adequately into a coherent whole by overarching principles.
However, a simple scheme that provides an inclusive overview can be
formulated in terms of just three basic ideas, which should be applied in
the following order.
• Fitness for purpose: what uncertainty in the analytical result is
acceptable to, and best suited for, the needs of the customer?
Fig. 8.1.1. A schematic view of the three principal aspects of quality in analytical
chemistry and the contributory practices that relate to them.
• Method validation: can the method under consideration for the analytical
task produce a suitably low uncertainty when executed in a particular
environment; in other words, is it apparently fit for purpose?
• Quality control: have the environmental factors that determine
uncertainty changed since the validation demonstrated that fitness for
purpose was achievable? (i.e., does the method still work well day after day?)
The logical sequence and the practices that contribute to each are shown
in Fig. 8.1.1. Each of these aspects of quality will be considered in turn,
but ﬁrst we have to consider brieﬂy the idea of the uncertainty of a result
and how it can be estimated.
8.2 Uncertainty
Key point
— The meanings of the following terms are discussed: uncertainty,
standard uncertainty, expanded uncertainty, coverage factor, measurand
and traceability.
The purpose of analysis is to reduce uncertainty about the chemical
composition of the test material. Before any analysis is undertaken, we
might be in a state of complete uncertainty about what a material is, but
that would be unusual. We are far more likely to have some indication of
what it
might contain a priori. For example, a sample of (dried) cabbage would
nearly always have a copper content between 1 and 20 ppm. After analysis
this uncertainty would be far smaller: we might be conﬁdent that the
concentration fell, say, in the interval between 10 and 12 ppm. But, however
careful the analyst, there would always be some uncertainty remaining in
the analytical result. How do we quantify this uncertainty?
First, we need to know exactly what it is that we are estimating: here
are the current internationally recognised definitions.
• Measurement uncertainty: non-negative parameter characterising the
dispersion of the quantity values being attributed to a measurand, based on
the information used.
• Measurand: quantity intended to be measured.
• Quantity: property of a phenomenon, body or substance where the
property has a magnitude that can be expressed by a number and a
reference.
• Standard uncertainty: measurement uncertainty expressed as a standard
deviation.
• Expanded uncertainty: product of a combined standard measurement
uncertainty and a factor larger than the number one.
• Metrological traceability: property of a measurement result whereby the
result can be related to a reference through a documented unbroken chain
of calibrations, each contributing to the measurement uncertainty.
These formal definitions are not uniformly intelligible and do not
immediately convey their meaning. Their application to chemical measurement
needs elucidation. Expanded uncertainty (U) defines a concentration interval
around the result of the measurement within which we expect the true value
to lie with a reasonably high probability, usually 95%. The standard
uncertainty (u) is the basic value that is used to calculate U as U = ku,
where k is the coverage factor, usually between two and three.
is treated and used in the same way as standard deviation. Traceability
describes the relationship between the result and the units of the SI (Le
Système International d’Unités). The measurand is a quantity that is being
measured (e.g., mass, length, time, concentration), not a chemical substance
(that is the analyte), nor the numerical outcome of a measurement (that is
the result).
It is noteworthy that uncertainty is the property of a measurement
result, while bias and precision are properties of measurement methods.
Broadly speaking, we should try to use all of these words correctly so as
to reduce misunderstanding, especially when our words may be translated
into another language. It is especially important not to confuse uncertainty
with error.
Further reading
• Evaluation of measurement data — guide to the expression of uncertainty
in measurement (GUM). (2008). Document produced by Working Group 1 of the
Joint Committee for Guides in Metrology. This document can be downloaded
gratis from the BIPM website www.bipm.org/utils/common/documents.
• International vocabulary of metrology — basic and general concepts and
associated terms (VIM). (2008). Document produced by Working Group 2 of the
Joint Committee for Guides in Metrology. This document can be downloaded
gratis from the BIPM website www.bipm.org/utils/common/documents.
• Quantifying uncertainty in analytical measurement. (2000).
EURACHEM/CITAC Guide CG 4 Second edition. This document can
be downloaded gratis via www.eurachem.org/guides.
8.3 Why Uncertainty is Important
Key points
— Analysis is conducted to inform decisions.
— Logically we cannot make a valid decision without knowing the
uncertainty associated with the result.
The result of an analytical measurement is incomplete without a statement
(or at least an implicit knowledge) of its uncertainty. This is because we
cannot make a valid decision based on the result alone, and nearly all
analysis is conducted to inform a decision. Typical decisions based on
analysis mostly come in one of the following forms. They all require a
knowledge of uncertainty for a rational outcome.
• Does this batch of material contain less than the maximum allowed
concentration of an impurity?
• Does this batch of material contain at least the minimum required
concentration of a named ingredient?
• How much is this batch of material worth?

Fig. 8.3.1. Possible results of an analysis (solid circles) and expanded
uncertainties (vertical bars) in relation to a legal or contractual upper
limit for the concentration of an impurity.
Figure 8.3.1 shows a variety of instances affecting decisions about
externally imposed limits. The error bars can be taken as expanded
uncertainties, effectively intervals containing the true value of the
concentration of the analyte with 95% confidence.
Result A clearly indicates a material that is below the limit, as even the
highest extremity of the uncertainty interval is below the limit. Result B
is below the limit but the upper end of the uncertainty is above the limit,
so we are not sure that the true value is below. Result C is above the limit
but the lower end of the uncertainty is below the limit, so we are not sure
that the true value is above. It is interesting to compare the equal results
D and E. Both results are above the limit but, while D is clearly above the
limit, E is not, because its greater uncertainty interval extends below
the limit. Organisations affected by such decisions have to agree in advance
how to act upon results B, C and E.
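The decision logic illustrated by results A–E can be sketched as follows (an illustrative fragment; the function name and the wording of the outcomes are ours, not a standard API):

```python
def compliance_call(result, U, limit):
    """Classify a result against an upper concentration limit, using its
    expanded uncertainty U as a 95% interval around the result."""
    if result + U < limit:
        return "below limit"    # like result A: the whole interval is below
    if result - U > limit:
        return "above limit"    # like result D: the whole interval is above
    return "inconclusive"       # like B, C and E: the interval straddles the limit

# Results D and E are numerically equal, but E's larger uncertainty
# makes the call inconclusive.
call_D = compliance_call(11.0, 0.5, 10.0)   # "above limit"
call_E = compliance_call(11.0, 1.5, 10.0)   # "inconclusive"
```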
Notes and further reading
• Accreditation agencies require estimates of uncertainty before a method
can be accepted as validated.
• Use of uncertainty information in compliance assessment. (2007).
EURACHEM/CITAC Guide. This document can be downloaded gratis
via www.eurachem.org/guides.
• The main normative documents on uncertainty are listed in §8.2.
8.4 Estimating Uncertainty by Modelling
the Analytical System
Key points
— Uncertainty can be estimated by the metrological (‘bottom-up’)
method by creating a complete model of the measurement procedure
and combining the fundamental uncertainties of the ultimate operations.
— In chemical analysis, modelling often gives a value that is too small
because there are nearly always unidentiﬁable sources of error.
Chemical measurement usually involves a complex multistage procedure.
Each stage of the procedure is potentially subject to variation in execution,
and therefore makes its own contribution to the uncertainty of the result.
If the procedure can be completely characterised as a statistical model, the
uncertainties related to each separate stage can be estimated and combined
to give the uncertainty of the result. These contributions are best seen as
hierarchical. So the determination of copper in a sample of cabbage could
be modelled, as a first stage, as in Fig. 8.4.1.
Each of the three first-level contributions can be further broken down,
as exemplified in Fig. 8.4.2 for one of them.
Even more detail can be built into the model (Fig. 8.4.3): for example,
weighing introduces uncertainty in the calibration of the weights (or
balance) and in correction for buoyancy, absorption of moisture from the
atmosphere, and so on. Ultimately the calibration of the weights can
be traced back to the SI unit of mass, the kilogramme.

Fig. 8.4.1. First level of factors that contribute to uncertainty in an
analytical result.

Fig. 8.4.2. Second level of factors that contribute to uncertainty in the
analyte concentration in the test solution.

Fig. 8.4.3. Third level of factors that contribute to uncertainty in the dry
content of the primary sample.

Such ultimate
traceability is seldom if ever a practical issue for analytical chemistry.
Transfer of the SI unit to the analyst’s bench gives rise to a negligible
proportion of the analytical uncertainty in nearly all instances.
When the measurement procedure is completely broken down in this
fashion, the influences of the individual uncertainties are estimated in
various ways and combined in the manner prescribed by error propagation
theory (§8.5) to give the uncertainty of the result. This is clearly
a lengthy operation, although it is simplified by the fact that, because of
the way uncertainties are propagated, small uncertainties make a negligible
contribution to the outcome.
This approach to uncertainty is called the ‘cause and effect’ method
or, informally, the ‘bottom-up’ method. A benefit of the method is that
spreadsheets are available that can carry out the calculations once the
model is defined. The effect of a change in one of the factors contributing
to uncertainty can be rapidly seen in the combined uncertainty. A drawback
is that it is difficult to detect structural mistakes and omissions in the
model itself. In chemical measurement this defect often results in estimates of
uncertainty that are too low. We know this to be the case by studying the
results of interlaboratory comparisons. If individual uncertainty estimates
were correct, they would account for all of the observed interlaboratory
variation. In cases where this has been checked, it is nearly always found
that interlaboratory variation is greater than expected on the basis of
individual uncertainties. This demonstrates that there are typically unknown
causes of uncertainty in chemical analysis. An unknown cause cannot be
included in a ‘cause and effect’ model.
Notes and further reading
• This method of estimating uncertainty is covered in detail in the ‘GUM’
and the Eurachem Guide (see §8.2).
• Ellison, S.L.R. and Mathieson, K. (2008). Performance of Uncertainty
Evaluation Strategies in a Food Proﬁciency Scheme, Accred. Qual.
Assur., 13, pp. 231–238.
8.5 The Propagation of Uncertainty
Key points
— Simple mathematical rules are available for combining intermediate
uncertainties contributing to a ﬁnal result.
— Broadly speaking, the outcome will be dominated by the major
uncertainties.
The propagation of uncertainty through calculations from intermediate
measurement results (A, B, C, etc.) to the final result (x) is handled in the
same way as general error propagation. The mathematical rules for combining
independent features are as follows. The first three are of greatest
importance to analytical chemists.
1. If x = A ± B ± C ± ···, with respective standard uncertainties uA, uB,
uC, ··· (i.e., for addition or subtraction), the standard uncertainty on x
is given by ux = √(uA² + uB² + uC² + ···).
2. If the results are multiplied or divided, it is the relative uncertainties
that are combined. So if x = (A × B × ···)/(C × D × ···), then
ux/x = √((uA/A)² + (uB/B)² + (uC/C)² + (uD/D)² + ···).
As a special case of this, if x = A^k, k being a constant, then
ux/x = √k (uA/A).
3. If x = kA, k being a constant, then ux = kuA.
4. Generally, if x = f(A), then ux = uA (dx/dA).
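Rules 1 and 2 are easy to capture as small helper functions (a sketch; the function names are ours):

```python
from math import sqrt

def u_sum(*us):
    """Rule 1: standard uncertainty of a sum or difference of
    independent quantities with standard uncertainties us."""
    return sqrt(sum(u * u for u in us))

def rel_u_product(*pairs):
    """Rule 2: relative standard uncertainty of a product or quotient;
    each pair is (value, standard uncertainty) of an independent factor."""
    return sqrt(sum((u / v) ** 2 for v, u in pairs))
```

For example, u_sum(0.03, 0.03, 0.04) combines two burette readings and an end-point uncertainty, giving the 0.0583 ml used in the titration example below.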
These rules are used sequentially when applied to complex equations.
As an example we consider the standardisation of hydrochloric acid by
titration of a weighed quantity of pure sodium carbonate. The primary
measurements are as follows.
• First weighing of sodium carbonate (w1): 15.2086 g.
• Second weighing (w2): 15.0501 g.
• Uncertainty of a single weighing (uw): 0.0002 g.
• Initial burette reading (v1): 0.53 ml.
• Final burette reading (v2): 38.83 ml.
• Uncertainty in a single burette reading (uv): 0.03 ml.
• Uncertainty in end-point recognition (ue): 0.04 ml.
• Relative molecular mass (R) of Na2CO3: 106.00.
• Uncertainty of the relative molecular mass (uR): 0.01.
The concentration of the acid (mol l⁻¹) is given by
2000(w1 − w2)/[R(v2 − v1)] = 0.07808.
By Rule 1 the uncertainty on the weight (W) of Na2CO3 is given by
uW = √(2uw²) = √2 × 0.0002 = 0.00028. (Note that there are two weighings
to obtain this weight.)
By Rule 1, the uncertainty on the volume (V) of acid is given by
uV = √(2uv² + ue²) = √(2 × 0.03² + 0.04²) = 0.0583. (Note that there are
two volume readings and the uncertainty in the end-point recognition to
account for.)
As the remaining operations are multiplication and division, we use
Rule 2 to complete the calculation.

uM/M = √((uW/W)² + (uV/V)² + (uR/R)²)
     = √((0.00028/0.1585)² + (0.0583/38.3)² + (0.01/106)²)
     = √(3.12 × 10⁻⁶ + 2.32 × 10⁻⁶ + 9 × 10⁻⁹)
     = 0.00233.
uM = 0.07808 × 0.00233 = 0.00018 mol l⁻¹. (Note that the uncertainty term
for the relative molecular mass is negligible.)

Thus, the acid concentration is 0.07808 mol l⁻¹ with a standard uncertainty
of 0.00018.
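The whole calculation can be checked with a few lines (a sketch using the figures above; the variable names are ours):

```python
from math import sqrt

# Primary measurements from the worked example
w1, w2, u_w = 15.2086, 15.0501, 0.0002   # weighings of Na2CO3, g
v1, v2, u_v = 0.53, 38.83, 0.03          # burette readings, ml
u_e = 0.04                                # end-point recognition, ml
R, u_R = 106.00, 0.01                     # relative molecular mass

W, V = w1 - w2, v2 - v1
M = 2000 * W / (R * V)                    # acid concentration, mol/l

u_W = sqrt(2 * u_w ** 2)                  # Rule 1: two weighings
u_V = sqrt(2 * u_v ** 2 + u_e ** 2)      # Rule 1: two readings plus end-point
u_M = M * sqrt((u_W / W) ** 2 + (u_V / V) ** 2 + (u_R / R) ** 2)  # Rule 2

# M is about 0.07808 mol/l with a standard uncertainty of about 0.00018
```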
Notes and further reading
• If the features are not independent, the covariances have to be taken into
account. Details can be found in GUM (§8.2).
• It is clear from the way that uncertainties combine that a single
contribution that is less than a quarter of the dominant contribution will
make a negligible contribution to the combined uncertainty. We see that
√(u² + (0.25u)²) ≈ 1.03u ≈ u. Estimates of uncertainty will hardly ever
be as accurate as two significant figures, so the difference between 1.03u
and u is negligible. This will often simplify the calculations in
‘bottom-up’ estimation.
8.6 Estimating Uncertainty by Replication
Key point
— The reproducibility standard deviation is easily obtained and is often
a serviceable (‘top-down’) estimate of uncertainty.
An alternative way of looking at uncertainty is to attempt to replicate
the whole analytical process and calculate the uncertainty as the standard
deviation. This is called the ‘top-down’ approach. However, the standard
deviation will not be a good estimate of the uncertainty unless two
conditions are fulfilled.
• First, there must be no perceptible bias in the procedure; that is, the
difference between the expectation of the result and the true value must be
negligible in relation to the standard deviation. This condition is usually
(but not always) fulfilled in analytical chemistry.
• Second, the replication has to explore all of the possible variations in
the execution of the method (or at least all of the variations of important
magnitude). This latter condition cannot usually be met by replication
under repeatability conditions (i.e., within-laboratory, see §9.2), because
variations in execution of the procedure would be laboratory-specific to a
substantial extent. This fact is clearly seen in the results of collaborative
trials (§9.7, 9.8), where the reproducibility standard deviation σR is on
average about twice the repeatability standard deviation.
Experiments have shown that the between-laboratory standard deviation
σR is often a good estimate of uncertainty, better than many laboratories
estimate from within-laboratory validation exercises. Indeed, estimates of
standard uncertainty less than σR should be considered suspect unless the
laboratory concerned can provide evidence of unusually careful procedures
or special methods, such as might be found in national reference
laboratories. However, σR may well be an underestimate of uncertainty in
other laboratories. Therefore, laboratories claiming an uncertainty equal
to σR should attempt to justify that, for instance, by reference to the
results from proficiency tests (§11.1).
In practice, laboratories seldom use purely bottom-up or top-down
methods for estimating uncertainty, but more often a combination of
the two.
Further reading
• Ellison, S.L.R. and Mathieson, K. (2008). Performance of Uncertainty
Evaluation Strategies in a Food Proﬁciency Scheme, Accred. Qual.
Assur., 13, pp. 231–238.
8.7 Traceability
Key points
— The traceability of a result shows that its units are properly related to
the corresponding SI units with an appropriate uncertainty.
— The uncertainty involved in delivering the SI unit to the analyst’s
bench is usually negligible.
— Other sources of uncertainty tend to be dominant in chemical
measurement.
Fig. 8.7.1. Steps showing how an analytical result is traceable to its
ultimate origins. The black arrows leading to shaded boxes indicate actions
where relatively large uncertainties may be incurred.

Traceability for a result shows how any unit in which the result is expressed
is compared with the parent SI unit. To express a result in (say) moles per
litre, the analyst has to know what a mole is, and what a litre is, and show
that the measurement comprises a complete chain of comparisons from the
definition of the unit to the result. For analytical chemists the necessary
connection with SI units is easy. The relative uncertainty in the journey
from the SI unit to the analyst’s bench is small and nearly always
negligible in comparison with the relative uncertainty in the final result.
This is because the dominant sources of error in analysis come from
elsewhere. The most obvious are shown in the schematic of an analytical
measurement in Fig. 8.7.1.
• The sampling error is introduced by taking the sample from the ‘target’,
the bulk of material of which we need to know the composition (§12.1).
Sampling errors can be relatively large and sometimes exceed those
incurred during the whole of the remaining analytical procedure.
• Preparing the test solution from the test portion often involves chemical
manipulations, such as dissolution and separation, in which the recovery
of the analyte is incomplete. Correcting for incomplete recovery (or failing
to correct) can introduce a relatively large uncertainty in some instances.
• Comparing concentrations via the calibration function and the analytical
signal from the test solution is subject to extra error (i.e., beyond that
described in §8.4) if there is an unrecognised matrix mismatch with the
calibrators.
In addition to these ‘non-SI’ comparisons, we have to recognise that in
many instances the result is calculated from ratios of measurements made
in the same unit, and in forming such a ratio the link with the SI unit is
annulled. For instance, if we are expressing our result as a mass fraction, the
most commonly required outcome of chemical measurement, we would get
the same numerical result (within limits determined by random variation)
if we used a diﬀerent base unit for mass, the Imperial pound for instance.
Such a result can hardly be said to be traceable to the kilogramme even
when, as usual, SI weights are used throughout the procedure.
Further reading
• Traceability in chemical measurement. (2003). Eurachem/CITAC Guide.
This document can be downloaded gratis via www.eurachem.org/guides.
8.8 Fitness for Purpose
Key points
— A fit-for-purpose result is optimal for the end-user.
— The uncertainty of a result is inversely related to its cost.
— The probability of an incorrect decision based on a result is directly
related to uncertainty.
An analytical result can be said to be fit for purpose when its uncertainty
is optimal. The optimality stems from the balance between the cost of
conducting the analysis and the cost and probability of making an incorrect
decision based on the result. Such a balance is usually determined by expert
opinion on the basis of experience of the application. In favourable
instances (i.e., where sufficient information about costs etc. is available)
it may be possible to calculate the optimum directly. Even when this cannot
be done, decision theory provides a useful conceptual framework for a more
transparent estimate.
The cost of an analytical result (including sampling where appropriate)
is related to the uncertainty required. Smaller uncertainty costs more
money. Indeed, a useful rule of thumb is that of an inverse square
relationship; that is, a procedure that halves the uncertainty costs four
times as much. The probability of an incorrect decision also depends on
the uncertainty: the greater the uncertainty, the greater the probability of
an incorrect decision and therefore of a financial penalty stemming from
the decision. Such penalties might be the result of mistakenly condemning
a batch of satisfactory material, or incorrectly shipping a batch of material
that was out of specification.

Fig. 8.8.1. Costs versus uncertainty, showing costs of measurement (dashed
line), costs of incorrect decisions (dotted line) and total costs (solid
line). The uncertainty uf at minimal cost is regarded as defining fitness
for purpose.
The sum of the two costs (one a decreasing function of uncertainty and
the other increasing) must necessarily have a minimum value, and this point
provides a rational definition for a fit-for-purpose uncertainty
(Fig. 8.8.1). Of course, an uncertainty smaller than that demanded for
fitness for purpose would give an even smaller proportion of mistaken
decisions, but would be unnecessarily expensive.
Notes and further reading
• Fearn, T., Fisher, S., Thompson, M. et al. (2002). A Decision Theory
Approach to Fitness for Purpose in Analytical Measurement, Analyst,
127, pp. 818–824.
• ‘Optimising your uncertainty — a case study’. (July 2008). AMC
Technical Briefs No. 32. Free download via www.rsc.org/amc.
Chapter 9
Statistical Methods Involved
in Validation
The statistical methods involved in validation are straightforward, but applying them effectively is more involved than generally appreciated. Regression is the natural method to apply to analytical calibration, but a good design is needed for an informative outcome. Particular attention is needed in selecting the appropriate conditions under which precision is estimated for various purposes. Moreover, setting proper control limits for control charts involves a moderately long series of measurements and review at suitable intervals.
9.1 Precision of Analytical Methods
Precision is the smallness of variation in the results of replicated measurements. It is characterised in terms of standard deviation, so high precision is equivalent to low standard deviation. Analytical results for different purposes vary in precision: a relative standard deviation (RSD) of 0.1% is 'high precision' in analysis and is seldom attained except for special purposes, such as finding the commercial value of materials containing gold. Most analytical results have repeatability RSDs of about 1–5% except in the measurement of very low concentrations, where even higher levels prevail. RSDs higher than about 30% are problematic, because the expanded uncertainty is of the same magnitude as or greater than the result.

There is little technical difficulty about estimating a standard deviation — the measurement has to be replicated under appropriate conditions and the standard deviation of the results calculated, in the absence of complications, as s = √[Σᵢ(xᵢ − x̄)²/(n − 1)]. The analyst should, however, initially examine the dataset to ensure the absence of features that could create a false impression, such as outlying data points (see §7.2 ff) or trends.
A simple plot of the results in the order measured is usually enough to show such problems. If appropriate, the influence of outliers can be accommodated by the use of a robust estimate (§7.6).
The effect of trends, even complicated ones, can also be reduced by various methods. Perhaps the simplest is to consider the signed differences between successive results, that is, the sequence d₁ = x₁ − x₂, d₂ = x₂ − x₃, …, dₙ₋₁ = xₙ₋₁ − xₙ. The standard deviation of the signed differences divided by √2 is equal to that of the original data detrended. In the following example (Fig. 9.1.1), the raw results (upper series) show a strong trend and have a standard deviation of 3.02. The signed differences show no trend and have a standard deviation of 1.59. The detrended data would therefore have an estimated standard deviation of 1.59/√2 = 1.12.
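The successive-differences calculation is easy to script; the following is a minimal sketch (the function name and example data are ours, not from the book):

```python
import math
import statistics

def detrended_sd(results):
    """Estimate the SD of replicated results after removing a slow
    trend, using the signed differences between successive results,
    d_i = x_i - x_(i+1).  The SD of these differences, divided by
    sqrt(2), estimates the SD of the detrended data."""
    diffs = [a - b for a, b in zip(results, results[1:])]
    return statistics.stdev(diffs) / math.sqrt(2)

# A pure linear trend with no noise gives successive differences that
# are all equal, so the detrended SD is zero:
# detrended_sd([1.0, 2.0, 3.0, 4.0, 5.0]) -> 0.0
```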
Another important feature of estimates of standard deviation is that they are very variable with small to moderate numbers of observations. The standard error se(s) of an estimate s based on n normally-distributed results is given by se(s) = σ/√(2n). With the usual ten replicated analytical results, we could expect to see 95% confidence intervals of around ±40% around the estimated standard deviation. (Note: the confidence limits will not be symmetrically disposed around s for small n.) Two estimates of the same standard deviation based on separate sets of ten results will differ by more than 30% half of the time and by more than 77% in one in ten experiments. Because of this it is rarely worth quoting standard deviations or uncertainties to more than two significant figures, and there is no point in discriminating between minor gradations in precision.
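The variability of small-sample SD estimates is easy to confirm by simulation; this hypothetical check (ours, not from the book) draws pairs of ten-result sets from a normal distribution and counts how often the two estimates differ by more than 30%:

```python
import random
import statistics

def fraction_differing(n=10, trials=5000, ratio=1.30, seed=42):
    """Fraction of paired experiments in which two SD estimates, each
    from n standard-normal results, differ by more than 30% (i.e. the
    larger estimate exceeds the smaller by a factor above 1.3)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        s1 = statistics.stdev(rng.gauss(0, 1) for _ in range(n))
        s2 = statistics.stdev(rng.gauss(0, 1) for _ in range(n))
        if max(s1, s2) / min(s1, s2) > ratio:
            count += 1
    return count / trials
```

Running this gives a fraction close to one half, consistent with the statement above.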
Fig. 9.1.1. Replicated results as a time series (upper plot) and as successive signed
diﬀerences (lower plot).
Notes and further reading
• The dataset is available from the ﬁle named Drift.
• There is often a practical problem in estimating the standard deviation when the concentration of the analyte is close to zero. A proportion of the results (sometimes a substantial proportion) will fall below zero unless they are censored. A true concentration cannot fall below zero, but the result of a measurement can. Recording such results as zero, or repeating the measurement until a non-negative result is obtained, will bias the estimates of both the mean (upwards) and the standard deviation (downwards). There are statistical techniques (maximum likelihood estimation) for handling this situation, but they are beyond the scope of this book.
• ‘Measurement uncertainty and conﬁdence intervals near natural
limits’. (2008). AMC Technical Briefs No. 26A. Free download from
www.rsc.org/amc.
• Analytical Methods Committee. (2008). Measurement Uncertainty Evaluation for a Non-negative Measurand: an Alternative to Limit of Detection, Accred. Qual. Assur., 13, pp. 29–32.
9.2 Experimental Conditions for Observing Precision
The precision of the results of a method depends on the conditions under which the measurement is replicated. There are many conceivable conditions and, regrettably, the terminology used is often confusing (Table 9.2.1). However, only a few of these conditions have a practical bearing on quality practice. The key point here is to identify conditions that are genuinely useful to analytical chemists. Of these, reproducibility precision is an important consideration in method validation, as it is a prominent feature of estimating uncertainty. Run-to-run precision and repeatability precision are mainly relevant to internal quality control.
Further reading
• International vocabulary of metrology — basic and general concepts and
associated terms (VIM). (2008). Document produced by Working Group
2 of the Joint Committee for Guides in Metrology. This document can be
downloaded gratis from the BIPM website www.bipm.org/utils/common/
documents.
Table 9.2.1. Conditions for assessing precision and the utility of the resulting estimates. (Names defined in normative documents are in boldface type.)

Instrumental
  Conditions of replication: Replication as quickly as possible, with no change of test solution nor adjustment of instrument.
  Comments: Not very useful but often seen in research papers and brochures. Does not include variation originating from chemical manipulations.

Repeatability
  Conditions of replication: Replication on separate test portions of the same material, with the same instrument and reagents, in the same laboratory, by the same analyst, in a 'short' period of time.
  Comments: The 'short period of time' is the length of an analytical 'run', or period in which we assume that the factors affecting error have not changed. Limited applicability, but used in restricted types of quality control.

Run-to-run; Intermediate; Within-laboratory reproducibility
  Conditions of replication: Replication in separate runs. Same method and laboratory, but may be different analysts, instruments and batches of reagent.
  Comments: This type of precision is addressed in internal quality control.

Reproducibility (1)
  Conditions of replication: Replication by the same detailed method in different laboratories.
  Comments: This is the estimate provided by the collaborative (interlaboratory) trial.

Reproducibility (2)
  Conditions of replication: Replication by the same nominal method but with variation in details in different laboratories.
  Comments: Estimate can often be obtained from the results of a single round of a proficiency test. The SD may be greater or smaller than that of Reproducibility (1).

Reproducibility (3)
  Conditions of replication: Replication by various methods in different laboratories.
  Comments: Estimate can often be obtained from the results of a single round of a proficiency test. The SD is usually greater than that of Reproducibility (1).
9.3 External Calibration
Regression is the natural way to explore the behaviour of an analytical calibration and its likely effect on the uncertainty of the result (see Chapters 5 and 6). Topics that can be studied by regression are as follows.
• Does the calibration depart significantly from linearity? If it does, is the deviation of a magnitude that will affect the uncertainty of the results of a complete measurement, that is, including the chemical treatment of the test portion? This is best addressed by measuring responses in duplicate from calibrators equally spaced over the range and conducting a test for significant lack of fit. Simply inspecting a plot of the residuals for a curved pattern is also a powerful way of detecting lack of fit. However, the correlation coefficient commonly used as a test for linearity is ambiguous in this context and potentially misleading (§5.9). An important aspect of unacknowledged calibration curvature is that its effects on estimated concentrations will be relatively very large at concentrations near zero. If low concentrations are to be measured, lack of fit cannot be tolerated.
• If there is no lack of fit to the selected calibration function, is there a significant intercept? In other words, in a calibration function such as r = β̂₀ + β̂₁c, are there grounds to reject β₀ = 0? If there are not, the analyst can use the hypothesis that the calibration passes through the origin.
• Do the residuals display signs of heteroscedasticity? This would be a
quite common occurrence in analytical calibration and suggests that
the use of weighted regression would give better results than simple
regression.
• What are the confidence limits on an unknown concentration found by using the calibration function (also called 'evaluation limits' or 'fiducial limits')? It is salutary to calculate these limits, as they are considerably larger than expected by intuition and may make a substantial contribution to the combined uncertainty of the result.
• What is the detection limit of the calibration? When the evaluation confidence interval includes zero concentration, the evaluated concentration is not significantly greater than zero, so we are unsure whether the analyte is present at all.
But how does all this information about calibration relate to the combined uncertainty of a result? The shape of the calibration function is of great importance. If a rectilinear function can be assumed, calculations become simpler, errors of interpolation become smaller and the method of standard additions (§9.5) becomes available. Heteroscedastic residuals are a quite common occurrence in chemical measurement, and their presence suggests that the use of weighted regression would give better results than simple regression. As a rule of thumb, if the calibration is to be restricted to concentrations up to about 200 times the detection limit, simple regression will be good enough. For concentrations outside this range, weighted regression will give more accurate values, especially at low concentrations. Lack of fit, if present, may be judged negligible, but care must be taken (see §5.10) if we need to measure concentrations in the lower quartile of the range.

We must bear in mind that in analysis there are many sources of uncertainty other than calibration and evaluation, and these will usually overwhelm calibration uncertainty. In such instances, calibration uncertainty per se can be ignored or subsumed into uncertainties that are readily estimated by replication.
9.4 Example — A Complete Regression Analysis
of a Calibration
Here we examine the calibration of zinc by atomic absorption spectrometry,
with duplication of responses, which were measured in the random order
listed.
Concentration (mg l⁻¹)    Absorbance
4    0.237
0    0.004
5    0.291
3    0.177
2    0.124
1    0.061
2    0.124
1    0.064
3    0.182
5    0.300
0    0.009
4    0.241
Fig. 9.4.1. Calibration data (points) with linear function estimated by regression (solid line) and 95% evaluation limits (dashed lines).
Fig. 9.4.2. Residuals from the regression.
The data show no apparent deviation from the regression line (Fig. 9.4.1). The residuals plotted against concentration show no obvious pattern (Fig. 9.4.2) and the test for lack of fit provides a p-value of 0.867. Thus we can conclude that there is no significant lack of fit and that the regression has provided a satisfactory account of the data.
Source of variation    Degrees of freedom    Sum of squares    Mean square    F    p
Regression          1    0.11774    0.11774    12570.11    0.000
Residual error     10    0.00009    0.00001
  Lack of fit       4    0.00002    0.00000    0.30    0.867
  Pure error        6    0.00008    0.00001
Total              11    0.11783
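The regression coefficients and the lack-of-fit F ratio above can be reproduced with a short script (standard library only; the variable names are ours):

```python
from collections import defaultdict

conc = [4, 0, 5, 3, 2, 1, 2, 1, 3, 5, 0, 4]
absorb = [0.237, 0.004, 0.291, 0.177, 0.124, 0.061,
          0.124, 0.064, 0.182, 0.300, 0.009, 0.241]

n = len(conc)
cbar = sum(conc) / n
ybar = sum(absorb) / n
sxx = sum((c - cbar) ** 2 for c in conc)
sxy = sum((c - cbar) * (y - ybar) for c, y in zip(conc, absorb))
slope = sxy / sxx                  # ~0.0580
intercept = ybar - slope * cbar    # ~0.006167

# Split the residual sum of squares into pure error (within duplicate
# pairs) and lack of fit, then form the F ratio for lack of fit.
groups = defaultdict(list)
for c, y in zip(conc, absorb):
    groups[c].append(y)
ss_pure = sum(sum((y - sum(g) / len(g)) ** 2 for y in g)
              for g in groups.values())
ss_resid = sum((y - intercept - slope * c) ** 2
               for c, y in zip(conc, absorb))
ss_lof = ss_resid - ss_pure
f_lof = (ss_lof / (len(groups) - 2)) / (ss_pure / (n - len(groups)))  # ~0.30
```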
Visually the regression line passes close to the origin. As there is no observed lack of fit, the p-value of 0.003 for the intercept shows that we can reject H₀: α = 0 in favour of H_A: α ≠ 0, that is, the intercept is significantly different from zero. The deviation of 0.006 absorbance would be large for modern instrumentation, but presumably arose from an instrumental drift after the initial setting of the zero point.
Predictor        Coefficient    Standard deviation    t        p
Constant         0.006167       0.001566              3.94     0.003
Concentration    0.0580000      0.0005173             112.12   0.000
s_yx = 0.0031    R² = 99.9%
Fig. 9.4.3. Bottom end of the estimated calibration function, showing the detection limit c_L of the calibration.

The 95% confidence interval for a predicted concentration (the evaluation interval) amounts to about 0.3 ppm (mg l⁻¹) over the whole of the calibrated range (Fig. 9.4.1). In effect, that would be the contribution of the calibration procedure to the combined uncertainty. By zooming in to the bottom end of the calibration (Fig. 9.4.3), we see that the calibration detection limit c_L will be about 0.13 mg l⁻¹.
Note
• An unusual feature of this calibration dataset demonstrates the value of a
thorough examination of the residuals. If the residuals are plotted against
the order in which the measurements were made (Fig. 9.4.4), we see a
systematic eﬀect in that the earlier residuals tend to be negative and the
later ones positive. This is a signiﬁcant eﬀect (p = 0.004) and shows that
the instrument is still drifting during the calibration. This eﬀect would
not have been detectable had the calibrators been analysed in order of
increasing concentration.
Fig. 9.4.4. Residuals plotted against the order in which the solutions were measured.
9.5 Calibration by Standard Additions
To make a valid comparison between calibrators and test solutions, the intercept and sensitivity of the calibration function must match those we would observe in the matrix of the test solution. A sufficiently close matrix matching is readily contrived in respect of reagents such as mineral acids added during the chemical treatment of the test portion. In some instances, however, the matrix of the test materials (that is, all of the constituents other than the analyte) is not readily predictable, and the matrix varies enough within a class of material to prevent matrix matching being complete. This can lead to an unacceptable addition to the uncertainty of the result if the analytical signal is sensitive to such changes. A number of strategies are available to overcome such interference effects.

Calibration by standard additions is a widely applicable method that overcomes changes in sensitivity (or rotational effects) caused by the matrix (Fig. 9.5.1). The method, however, does not overcome changes to the baseline of the signal, which the analyst has to deal with separately. Standard additions, therefore, has to be applied to the net signal: any baseline signal must be subtracted from the gross signal before the method can be applied.
The conventional paradigm of the method, as presented in most textbooks, requires the addition to a test solution of several different exactly known amounts of the analyte. This has to be done in such a way that the overall concentration of the test material remains the same in all of them, so that the matrix is identical in each solution. The analytical response (corrected for baseline shift) is measured for each solution, and the line fitted to the points is extrapolated down to zero

Fig. 9.5.1. The possible effect of matrix mismatch on the comparison between calibrators and test solution.
Fig. 9.5.2. The method of standard additions (conventional paradigm), with five different added concentrations of analyte.
Fig. 9.5.3. Standard additions with the spike added at only one level (triplicate measurements of response).
net response (Fig. 9.5.2). The negative reading on the concentration axis
is the concentration estimate. The extrapolation could be done graphically,
but the application of regression to estimating the original concentration is
obvious. The resulting function y = a+bx with the response (y) set at zero
gives x = −a/b.
Standard additions is a valid method when the calibration is known to be linear. Non-linear extrapolation is unwise (§6.3). The standard paradigm of the method, with several different levels of added analyte (Fig. 9.5.2), is featured in most texts because it is thought (wrongly) to allow the analyst to check that the calibration is linear. However, standard additions should not be attempted unless non-linearity is known to be undetectable from previous experimentation during validation. In any event, testing for non-linearity is unlikely to be fruitful without an inordinate number of measurements to obtain just one result.
As we can assume linearity, a simpler experimental design is preferable, with only one level of added analyte, perhaps with replicated measurement (Fig. 9.5.3). The added level should be the highest possible concentration consistent with a linear calibration function. This design not only cuts down the analyst's workload, but also improves the precision of the final result for the same number of measurements. Moreover, regression is not needed in the calculations: a line passing through the means of the responses at both concentrations is identical with a regression line, either simple or weighted. Standard additions is sometimes regarded as problematic because the extrapolation step was thought to introduce extra imprecision in comparison with external calibration. A careful study, with realistic models of uncertainty, has shown that this fear is unfounded.
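For the single-level design, the calculation reduces to a ratio involving the mean responses; here is a minimal sketch (the function name and the example figures are invented for illustration):

```python
import statistics

def standard_additions(y_unspiked, y_spiked, added_conc):
    """Concentration estimate from a single-level standard-additions
    design.  y_unspiked and y_spiked are replicate net responses
    (baseline already subtracted) before and after adding a known
    concentration of analyte.  The line through the two mean
    responses has slope b and intercept a = y0; setting the response
    to zero gives x = -a/b, so the estimate is y0 / b."""
    y0 = statistics.mean(y_unspiked)
    y1 = statistics.mean(y_spiked)
    b = (y1 - y0) / added_conc    # sensitivity in the actual matrix
    return y0 / b

# Invented triplicate data with 2.0 mg/l added; the mean response
# doubles, so the original concentration equals the added amount:
# standard_additions([0.120, 0.118, 0.122], [0.240, 0.236, 0.244], 2.0) -> ~2.0
```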
Further reading
• ‘Standard additions — myth and reality’. (2009). AMC Technical Briefs
No. 37. Free download via www.rsc.org/amc.
• Ellison, S.L.R. and Thompson, M. (2008). Standard Additions: Myth and
Reality. Analyst, 133, pp. 992–997.
9.6 Detection Limits
Key points
— There are numerous possible definitions of detection limit. The simplest is: the concentration c_L of analyte that corresponds with a signal of R_0 + 3σ, where R_0 and σ describe the variation in the analytical signal when the actual concentration of analyte is zero.
— Modern, more complicated definitions provide very similar estimates.
— Detection limits provide only a rough guide to method performance and should not be taken too seriously.
— Detection limits cited in the literature and in instrument manufacturers' brochures may be misleadingly low.
A detection limit c_L is the smallest concentration of analyte that can be reliably detected by the analytical system. Detection limits are usually given undue attention in relation to their usefulness. The main use of a detection limit is to warn the analyst of a concentration level that is better avoided if at all possible. However, in the determination of undesirable contaminants — a very common activity — analysts often have to work near detection limits.

Detection limits are encountered when the expanded uncertainty of measurement is roughly comparable with the concentration of the analyte. But there are complications that affect this basically simple idea.
• There are several different ideas about how the detection limit can be defined in statistical terms. All of these ideas are based on the standard deviation of replicated results near zero concentration.
• The magnitude of the detection limit estimate will depend on the conditions of replication of the measurements. Detection limits quoted in descriptions of methods and instrument brochures are often far too small because they are estimated under unrealistic conditions of replication
such as 'instrumental conditions' (§9.2). Real-life detection limits (based on reproducibility statistics) are sometimes as much as 10 times higher than these instrumental detection limits.
• The accuracy of an estimated detection limit will be poor because it is typically based on ten replicated measurements, while the standard error of an estimated standard deviation is given by σ/√(2n). With n = 10 measurements the 95% confidence limits on s (and therefore on c_L) will fall at about ±40% of the true value.
• Detection limits give rise to an artifactual dichotomy of the concentration axis that distorts perception of reality. Many analysts and end-users alike regard a result of 1.1c_L as a valid result and 0.9c_L as invalid. Modern thought is moving to the position that detection limits are unnecessary — all that the end-user needs is the result and its uncertainty.
The simplest (and therefore the most commendable) definition of detection limit is this: the concentration c_L of analyte that corresponds to an analytical signal of R_L = R_0 + 3σ, where R_0 (mean) and σ (standard deviation) describe the distribution of the analytical signal when the actual concentration of analyte is zero. We can see the meaning of this by reference to a short calibration graph (Fig. 9.6.1). The normal distribution describes the variation in the analytical signal (response) when the blank solution is repeatedly presented for measurement. A response larger than R_0 + 3σ will occur rarely if no analyte is present (about one time in a thousand, as we are dealing with one-tailed probabilities), so if we saw

Fig. 9.6.1. Schematic diagram of a calibration function at low concentrations showing the variation in the response (analytical signal) for zero concentration of analyte, and the detection limit c_L.
Fig. 9.6.2. Schematic diagram of a calibration function at low concentrations of analyte,
showing a more complex deﬁnition of detection limit.
a response greater than that we could be conﬁdent that some analyte
was actually present. We could say that, at that point, the concentration
of analyte was signiﬁcantly greater than zero at a conﬁdence level of
about 99.9%.
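Under this simplest definition, the calculation from replicate blank responses takes one line; a sketch (the function name and data are ours, not from the book):

```python
import statistics

def detection_limit(blank_responses, slope):
    """c_L under the simplest definition: the concentration whose
    expected signal is R_0 + 3*sigma, where sigma is the SD of
    replicate blank responses and `slope` is the calibration
    sensitivity (response per unit concentration).  Since the signal
    excess over R_0 is 3*sigma, c_L = 3*sigma / slope."""
    return 3 * statistics.stdev(blank_responses) / slope

# With a blank SD of 1.0 and a sensitivity of 3.0 response units per
# unit concentration, c_L = 3 * 1.0 / 3.0:
# detection_limit([0.0, 1.0, 2.0], 3.0) -> 1.0
```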
International standards nowadays prefer a more complex idea (and terminology) of detection limit. For this purpose we consider the previous calibration graph in slightly more detail (Fig. 9.6.2). As before we have the distribution of responses at zero concentration with mean R_0 and standard deviation σ. We now arbitrarily define a 'critical level of response' R_c that defines a probability α in the upper tail of the distribution. This corresponds via the calibration function to a 'critical concentration' x_c. If we looked at the distribution of responses at concentration x_c, half of the responses would be below R_c and therefore, in some sense, 'not detected'. x_c is clearly too low to act as a serviceable detection limit. However, at some higher concentration, where the response was R_L, the area β in the tail of the distribution below a response R_c could be made suitably small to define a sensible detection limit x_L. In practice, we usually make both α and β equal to 0.05, so that we have

R_c = R_0 + 1.63σ,
R_L = R_0 + 3.26σ.

x_L therefore corresponds closely with the previous definition of detection limit c_L.
Further reading
• Capability of detection — Part 1: Terms and definitions. (1997). ISO 11843-1. International Standards Organisation, Geneva.
• ‘Measurement uncertainty and conﬁdence intervals near natural
limits’. (2008). AMC Technical Briefs No. 26A. Free download from
www.rsc.org/amc.
• Analytical Methods Committee. (2008). Measurement Uncertainty Evaluation for a Non-negative Measurand: an Alternative to Limit of Detection, Accred. Qual. Assur., 13, pp. 29–32.
9.7 Collaborative Trials — Overview
Key points
— A collaborative trial is an interlaboratory study to characterise the performance of an analytical method.
— The main performance features are repeatability and reproducibility precision.
Collaborative trials are interlaboratory studies to characterise the performance of a well-defined analytical method applied to a well-defined type of test material. The performance features characterised are repeatability precision, reproducibility precision and, if certified reference materials are included, trueness. Usually each of these features will vary as a function of concentration of the analyte, so the tests need to be carried out using at least five different materials of the defined type, with concentrations spanning the relevant range. The organising body selects and prepares these materials and distributes them to the participating laboratories, which should be at least eight in number (and preferably considerably more). The participant laboratories analyse each of the materials in duplicate, preferably 'blind' (i.e., without knowing the identity of the duplicates during the analysis). The participant laboratories should be proficient in the type of analytical test involved.
The participants report the results obtained to the organiser, who then carries out the statistical analysis of the results to estimate the various performance indicators. The basic statistical technique in collaborative trials is one-way analysis of variance applied separately to the results from each material (§4.6). However, there are several refinements that are regarded
as essential. An important aspect is the initial removal of results identified as outliers. There is an elaborate procedure for doing that, described below (§9.8). Alternatively, an approach based on robust analysis of variance has been found to provide very similar outcomes without resort to outlier tests (but see Note below). The justification for rejecting outliers in collaborative trials is that the resulting statistics are regarded as describing the essential properties of the method, not those of the participant laboratories.
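For the common design of duplicate results per laboratory, the one-way ANOVA reduces to a few lines; this sketch (ours, not code from any protocol) returns the repeatability and reproducibility standard deviations for one material:

```python
import statistics

def trial_precision(duplicates):
    """Repeatability (s_r) and reproducibility (s_R) standard
    deviations for one material of a collaborative trial, given one
    (result1, result2) duplicate pair per laboratory, by one-way
    ANOVA with two replicates per laboratory."""
    p = len(duplicates)
    lab_means = [(a + b) / 2 for a, b in duplicates]
    grand = statistics.mean(lab_means)
    ms_within = sum((a - b) ** 2 / 2 for a, b in duplicates) / p
    ms_between = 2 * sum((m - grand) ** 2 for m in lab_means) / (p - 1)
    s_r2 = ms_within
    s_lab2 = max((ms_between - ms_within) / 2, 0.0)  # between-lab variance
    return s_r2 ** 0.5, (s_r2 + s_lab2) ** 0.5

# If the labs agree but duplicates differ, s_R equals s_r; if the
# duplicates agree but the labs differ, s_r is zero and s_R reflects
# only the between-laboratory spread.
```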
When the standard deviations of repeatability and reproducibility have been obtained for the various test materials, it is common practice to attempt to find a functional relationship between precision and concentration of the analyte (§9.9). This relationship can provide a compact description of the capabilities of the method and a means of interpolation to concentrations other than those actually represented in the study.
The results of collaborative trials are often compared with the well-known 'Horwitz function'. This function stems from an empirical observation, in the food analysis sector, about the trend of the reproducibility relative standard deviation σ_R as a function of concentration c. Perhaps the most useful formulation of the function is σ_R = 0.02c^0.8495, with both variables expressed as mass fractions. The trend of results from collaborative trials follows this function closely over seven orders of magnitude (between concentrations of about 10⁻⁸ and 10%), although it must be stressed that there are both random and systematic deviations from the trend. The function is therefore not necessarily a good descriptor of individual methods. It is, however, often used as an independent fitness-for-purpose criterion in method validation and proficiency testing. In that context it is used to prescribe the performance required, not describe the performance obtained.
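The Horwitz function is easily evaluated; a sketch (the function name is ours), taking c as a mass fraction and returning both the predicted standard deviation and the corresponding relative standard deviation:

```python
def horwitz(c):
    """Predicted reproducibility SD (as a mass fraction) and the
    corresponding RSD for concentration c expressed as a mass
    fraction, from sigma_R = 0.02 * c**0.8495.  At c = 1 (a pure
    substance) the predicted RSD is 2%; at c = 1e-6 (1 ppm) it is
    about 16%."""
    sigma_r = 0.02 * c ** 0.8495
    return sigma_r, sigma_r / c
```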
Notes and further reading
• Robust ANOVA must be used as a strict alternative to outlier deletion. The practice of applying both methods and then choosing the smaller outcome must be avoided.
• ‘The amazing Horwitz function’. (2004). AMC Technical Briefs No. 17.
Free download via www.rsc.org/amc.
• Precision of test methods — Part 1: Guide for the determination of repeatability and reproducibility for a standard test method. (1979). ISO 5725. International Standards Organisation, Geneva.
9.8 The Collaborative Trial — Outlier Removal
Key points
— Outliers are conventionally removed from collaborative trial datasets
before analysis of variance is carried out.
— This is justiﬁed because the study is meant to characterise the method
rather than the participant laboratories.
The most frequently used method for purging the initial valid data of outliers is defined in the 1995 IUPAC Harmonised Protocol. This procedure consists of the alternating use of the Cochran and Grubbs tests until no further outliers are flagged or until the proportion of dropped laboratories would exceed two-ninths of the original number of laboratories providing valid data (Fig. 9.8.1).
Cochran test (see §4.10)
First apply the Cochran outlier test (one-tail test at p = 2.5%) and remove any laboratory whose test statistic exceeds the tabulated critical value.
Grubbs tests (see §7.4)
Apply the single-value Grubbs test (two-tail test at p = 2.5%) and remove any outlying laboratory. If no laboratory is flagged, then apply the pair-value tests (two-tail) with two values at the same end and one value at each end, with p = 2.5% overall. Remove any result flagged by these tests whose test statistic exceeds the tabulated critical value. Stop removal when the next application of the test would flag as outliers more than two-ninths of the laboratories. (Note: the Grubbs tests are to be applied one material at a time to the set of replicate means from all laboratories, and not to the individual values from replicated designs, because the distribution of all the values taken together is multimodal, not Gaussian, i.e., their differences from the overall mean for that material are not independent.)
Final estimation
Recalculate the ANOVA statistics after the laboratories flagged by the preceding procedure have been removed.
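The iterative purge described above can be sketched in Python. The full Harmonised Protocol alternates the Cochran and all three Grubbs tests against tabulated critical values; the sketch below implements only the single-value Grubbs step, using the standard t-distribution expression for the Grubbs critical value, together with the two-ninths cap. The function names `grubbs_critical` and `purge_outliers` are ours, not the protocol's.

```python
import math
import statistics
from scipy import stats

def grubbs_critical(n, alpha=0.025):
    """Two-tail Grubbs critical value for n observations at significance alpha."""
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / math.sqrt(n) * math.sqrt(t * t / (n - 2 + t * t))

def purge_outliers(lab_means, alpha=0.025, max_fraction=2 / 9):
    """Repeatedly remove the most extreme laboratory mean flagged by the
    single-value Grubbs test, stopping when nothing is flagged or when
    dropping another laboratory would exceed the two-ninths cap."""
    kept = list(lab_means)
    n0 = len(kept)
    while len(kept) > 3:
        if (n0 - len(kept) + 1) / n0 > max_fraction:
            break  # one more removal would breach the cap
        mean = statistics.fmean(kept)
        sd = statistics.stdev(kept)
        g = [abs(x - mean) / sd for x in kept]
        worst = max(range(len(kept)), key=g.__getitem__)
        if g[worst] > grubbs_critical(len(kept), alpha):
            kept.pop(worst)
        else:
            break
    return kept
```

In a real collaborative trial the test would be applied, one material at a time, to the replicate means from all laboratories, exactly as the note above prescribes.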
[Flow chart summary: start with all valid data; test for a Cochran outlying laboratory, then a single-Grubbs outlier, then a paired-Grubbs outlier, in each case dropping the laboratory unless the fraction dropped would exceed 2/9; if any laboratory was dropped in this loop, repeat; otherwise execute one-way ANOVA on the remaining data and calculate the precision measures.]
Fig. 9.8.1. Flow chart of procedure for removing outliers from collaborative trial data before ANOVA.
Further reading
• Horwitz, W. (1995). Protocol for the Design, Conduct and Interpretation
of Method Performance Studies, Pure Appl. Chem., 67, pp. 331–343.
9.9 Collaborative Trials — Summarising the Results
as a Function of Concentration
Key points
— Expressing precision data as a function of analyte concentration is a
useful way of summarising performance information.
— Finding a good ﬁt is not always straightforward.
It is usually beneﬁcial to summarise the statistics obtained for each
material in a collaborative trial by treating the precision as a function
of concentration. This provides a compact summary of the study. However,
it may not be as simple as it first appears. The main problems are as follows.
• There are few data points, and the relative uncertainty in the standard
deviation estimate at each point is large. There will be a large relative
uncertainty in the estimated functional relationship.
• The data may be markedly heteroscedastic. Some form of weighted procedure should be used (see §6.7).
• Lack of fit may be apparent if there is a significant variation in the matrices of the test materials.
• The intrinsic shape of the true functional relationship is unlikely to be
linear and its value must be strictly positive. Standard regression methods
may not be applicable, i.e., we cannot use a function that imputes a
negative standard deviation.
These features are apparent in Fig. 9.9.1. A number of functional forms are suggested in ISO 5725, but they are all fundamentally flawed. In particular, simple linear regression will be suspect, as it will not take account of the heteroscedasticity and will tend to give a negative or otherwise misleading intercept.
Some simple methods that automatically take account of the heteroscedasticity may be conditionally appropriate in specific circumstances.
When all of the materials have analyte concentrations well above the
detection limit of the method, a constant relative standard deviation (RSD)
is a reasonable assumption unless the data are clearly at odds with that. In
such an instance a suitable expedient might be to calculate the average of
the RSDs. Interpolation could then be executed by applying this average
value to new concentration values. Alternatively, a similar outcome could
be obtained by applying simple regression to the log-transformed data. As
Fig. 9.9.1. Standard deviation vs. concentration (points) for a collaborative trial of a method for determining propyl gallate. The vertical bars show the 95% confidence intervals for the estimate. Two fits are shown, simple linear regression (dotted line) and log-log regression (solid curve). Both functions show lack of fit to some points and indicate inappropriate values near zero concentration.
the model for constant RSD is σ = βc, where σ is the standard deviation at concentration c, the transformed equation is log σ = log β + log c. This function should have a slope of one (unity) if the model is correct, and a t-test on the slope should be able to confirm that. Given a unit slope, the intercept would be log β̂, and we would calculate β̂ as its antilog. The value of β̂ is then the RSD for interpolation, that is, for calculating values of σ from new values of c within the range of the original data and well away from the detection limit.
The method of regressing logarithms will also work if the functional relationship has features similar to the Horwitz function, that is, of the form σ = βc^γ, γ ≠ 1, so that log σ = log β + γ log c. The regression coefficient will then equal the unknown exponent γ, which may or may not differ from the Horwitz exponent of 0.8495 (see §9.7). An example of this method is shown in §6.10.
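The log-log fit described above is a one-line regression. A minimal sketch (the function name `fit_loglog` is ours): regress log σ on log c, so the slope estimates γ and the antilog of the intercept estimates β.

```python
import numpy as np

def fit_loglog(c, s):
    """Fit log s = log(beta) + gamma * log(c) by simple regression.
    Returns (beta, gamma); the slope gamma estimates the exponent and
    exp(intercept) estimates beta."""
    gamma, log_beta = np.polyfit(np.log(c), np.log(s), 1)
    return float(np.exp(log_beta)), float(gamma)
```

A t-test on the slope against unity would distinguish the constant-RSD model (γ = 1) from a Horwitz-like relationship.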
However, taking logarithms will tend to be misleading if the concentration values are lower than about 100 times the detection limit. The fundamental reason is that the relationship between standard deviation of measurement and concentration of the analyte must have a positive intercept. In other words, the standard deviation at zero concentration must be greater than zero, because it is describing a measurement result.
Many analytical systems can be better described by a function of the form σ = √(α² + (βc)²), which is technically appropriate as it has an intercept of α and tends towards a constant RSD of β at high
concentrations. However, a weighted fitting of this type of function goes beyond elementary statistical considerations.
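A crude but serviceable estimate of α and β can nonetheless be obtained by noting that the model is linear in the squares: σ² = α² + β²c². The sketch below (our own simplification, not the weighted procedure the text alludes to) fits that linearised form by ordinary least squares.

```python
import numpy as np

def fit_sigma_model(c, s):
    """Estimate alpha and beta in s = sqrt(alpha^2 + (beta*c)^2) by
    unweighted least squares on the linearised form s^2 = alpha^2 + (beta^2)c^2.
    A proper treatment would weight the points; this is only a sketch."""
    b2, a2 = np.polyfit(np.asarray(c) ** 2, np.asarray(s) ** 2, 1)
    return float(np.sqrt(a2)), float(np.sqrt(b2))
```

With heteroscedastic real data the unweighted fit over-weights the high-concentration points, which is exactly why the text defers the weighted version.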
Further reading
• Thompson, M., Mathieson, K., Damant, A. P. et al. (2008). A General Model for Interlaboratory Precision Accounts for Statistics from Proficiency Testing in Food Analysis, Accred. Qual. Assur., 13, pp. 223–230.
9.10 Comparing Two Analytical Methods by Using
Paired Results
Key points
— Comparison between two analytical methods can be undertaken with paired results, using either a simple t-test or a regression-type method.
— If the comparison is based on results from a single laboratory, the
outcome may not be generally applicable.
A common analytical task is the comparison of results from two methods
for measuring the concentration of the same analyte in a number n of test
materials. Usually one of the methods is recognised as a reference method
and the other, perhaps more rapid or convenient, is under trial. This ‘paired
results’ method is a particularly valuable approach because the comparison
between methods is based on the behaviour of ‘real-life’ test materials, not
on reference materials or spiked solutions which might behave atypically.
The comparison is obviously a statistical matter, but the correct technique
for the interpretation of the results depends on the concentration range
spanned by the test materials and on the precisions of the two methods. If the concentration range is relatively small it might be possible to assume a single variance for each method, in which case a t-test of the differences between corresponding pairs would be suitable (§3.8–§3.10). If the concentration range is wider, it is probably advantageous to make the comparison a function of concentration, as in §5.12. Quite commonly, the reference method will produce results x_i, i = 1, ..., n, that are substantially more precise than the corresponding trial method results y_i, i = 1, ..., n. In such instances regression of y on x (but not x on y) will probably be a suitable statistical technique (§5.12). If the precisions are comparable, however, regression may lead to a misleading interpretation because a basic assumption of regression
lead to a misleading interpretation because a basic assumption of regression
is violated. In such instances, a more general technique that accommodates
variable precision on both variables would be required. This ‘functional
relationship’ ﬁtting will provide an unbiased outcome, but an account of
the method is beyond the scope of the present work.
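For the small-concentration-range case, the paired t-test mentioned above is directly available in scipy. The sketch below uses invented illustrative numbers (six hypothetical test materials), not data from the text.

```python
from scipy import stats

# Hypothetical paired results for the same six test materials:
# x from a reference method, y from a trial method (illustrative only).
x = [4.2, 7.9, 12.1, 15.4, 20.3, 24.8]
y = [4.4, 8.1, 12.0, 15.9, 20.8, 25.3]

# Paired (related-samples) t-test on the differences y - x.
t_stat, p_value = stats.ttest_rel(y, x)
# A small p-value suggests a systematic difference between the methods.
```

Remember the caveat of this section: with single-laboratory data such a test characterises that laboratory's use of the methods, not the methods in general.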
Some care is needed in the design of such experiments, especially in regard to the required generality of the conclusions. Often the reference method and the trial method are quite different in the analytical procedures and the physical principles behind the procedures. In such instances the potential existence of laboratory bias has to be taken into account. Laboratory bias, small or large, is always present in results, as can clearly be seen in results of proficiency tests. Paired results from a single laboratory will not be descriptive of the methods because of the unique biases in that laboratory. In short, the conclusions of the study would apply only to the laboratory concerned. There is, however, an important exception to this limitation: when the two methods differ only in one particular, the effect under study, the laboratory bias (apart from the effect under study) will be the same for both methods and thus cancel out.
If a general (rather than a laboratory-specific) conclusion is required, so that it applies to the methods (and therefore to all laboratories), it is necessary to compare the mean results of a number of laboratories, for both methods and for each material. This requires a large study, comparable in size (and indeed cost) to a double collaborative trial. There is one mitigating circumstance where suitable data can be obtained at no cost, and that is where both methods are well-characterised and well-represented in a large proficiency test. Then it is possible to compare the means of medium-sized datasets (e.g., 20–50 laboratories) and use the respective variances as weights for fitting a functional relationship.
Notes and further reading
• Requirements for a valid comparison of the results of methods are discussed in Thompson, M., Owen, L., Wilkinson, K. et al. (2002). A Comparison of the Kjeldahl and Dumas Methods for the Determination of Protein in Food, Using Data from a Proficiency Testing Scheme, Analyst, 127, pp. 1666–1668.
Chapter 10
Internal Quality Control
This chapter is concerned with the statistical aspects of internal quality control, with special emphasis on the correct setting up of control charts. It covers the meaning of statistical control, within-run and run-to-run precision, and the use of multiple-analyte control charts.
10.1 Repeatability Precision and the Analytical Run
Key points
— Duplication can be used to estimate or control repeatability (within-run) variation.
— The key statistic is the absolute difference between corresponding duplicated results.
— It is usually necessary to take account of how precision depends on the analyte concentration.
— ‘Maps’ of absolute difference against concentration can utilise control lines for a prescribed repeatability precision.
Repeatability conditions of precision are formally deﬁned as those under
which replicate measurements are made on the same test material by the
same analyst using the same method, equipment and batch of reagents and
within a ‘short’ period of time (§9.2). The undeﬁned short period can be
most usefully interpreted as the length of an analytical ‘run’, a continuous
period, involving anything from one to a large number of measurements,
during which factors contributing to the magnitude of errors are deemed
to remain constant. Of course, the conditions are never really constant and
some systematic changes can be expected in a typical run. The eﬀect of
undetected changes of this kind can be handled by conducting the sequence
of analyses in a random order. The errors can then be regarded as part of
the repeatability variation.
In any event, repeatability standard deviation (σ_r) is of only limited use to analytical chemists as it is usually considerably smaller than the uncertainty of the measurement. Its main value is in enabling the analyst to gauge whether results replicated within a run are consistent with each other or with some externally-derived criterion. Duplication within a run
thus provides the analyst with a restricted type of quality control, which
is executed by consideration of the diﬀerence between the duplicate results
on the test materials. The method has the advantage that the variation
observed is that of materials that are entirely typical, both in composition
and state of comminution. To represent the true variation within the run,
however, the duplicated test portions must be at random positions in the
analytical sequence — if they were adjacent, or simply close in relation to
the length of the run, the variation observed between pairs would tend to
be too small, because they would not account for unremarked systematic
changes.
The test statistic is the difference d = x₁ − x₂ between corresponding pairs of results, and this has a standard deviation σ_d = √2 σ_r. The difference d has an expectation (long-term average) of zero only if there is no consistent trend in the instrumental performance. (For instance, if the sensitivity of the instrument were consistently falling during the run, the first of a pair of duplicated results would tend to be greater even if the duplicate test portions were analysed in a random order.) This apparent difficulty can be overcome simply by considering only the absolute difference |d| between the corresponding results.
A complication arises when, as often happens, the concentration of the analyte varies considerably among the test materials comprising the run. In such an instance we would expect σ_r, and therefore σ_d, to vary with concentration. If we postulated a functional relationship of the form σ_r = √(α² + (βc)²), we could predict that σ_d = √(2(α² + (βc)²)). If we knew or prescribed values for the parameters α, β, we could calculate σ_d at any concentration c by inserting c = (x₁ + x₂)/2 into the equation. Under
the normal assumption, the scaled differences d/σ_d should behave like a sample from a standard normal distribution, regardless of any varying concentration in the materials. There is no point in plotting these results on a Shewhart control chart because the results are not a temporal sequence. A dotplot is sufficient to show the distribution.
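The scaled difference is a one-line calculation. A minimal sketch (the function name is ours; the default α and β are the selenium values quoted in §10.2, used here only for illustration):

```python
import math

def scaled_difference(x1, x2, alpha=0.015, beta=0.04):
    """Return |d| / sigma_d for a duplicate pair, where
    sigma_d = sqrt(2 * (alpha^2 + (beta*c)^2)) is evaluated at
    the pair mean c = (x1 + x2) / 2."""
    c = (x1 + x2) / 2
    sigma_d = math.sqrt(2 * (alpha**2 + (beta * c) ** 2))
    return abs(x1 - x2) / sigma_d
```

Under the normal assumption the resulting values should look like draws from the absolute value of a standard normal variable, whatever the concentrations involved.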
More information, however, can be extracted by adding a concentration dimension to the plot. A ‘control map’ based on this idea is popular in the geoanalytical community but could be more widely useful. In this map (it is not strictly a ‘chart’) d is plotted directly against c, on linear or logarithmic axes as convenient. The control lines are functions of concentration and have to be plotted as such at 2σ_d and 3σ_d. In addition, the median absolute difference is close in numerical value to σ_r, so adding this line to the map gives the analyst an extra test of the data, by counting the data points above and below this line — they should be roughly equal apart from random differences. The map is not a control chart because we are not presenting the data as a sequence. That usage is consistent with the concept of the run as an unchanging analytical system.
In some instances it may be preferable to use fitness-for-purpose uncertainty rather than repeatability standard deviation as a criterion of performance. The most likely circumstance is when the analyst has no previous knowledge of σ_d, such as might happen with determinations that are rarely undertaken. An example is discussed in the next section (§10.2).
Repeatability duplicates can be used to estimate the relationship
between precision and the concentration of the analyte. However, a large
number of results are needed to do this adequately.
Notes and further reading
• If comparability within run is the customer’s sole requirement then repeatability standard deviation can be used as the basis for uncertainty. The rationale for this usage lies in the definition of the measurand. In the circumstances referred to, the measurand is not the absolute concentration of the analyte, but the concentration differentials among two or more test materials analysed in the same run. If such results are released into the public domain, this limitation of the uncertainty should be made clear.
• For results that are normally distributed, the median of d falls at 0.954σ_r.
10.2 Examples of WithinRun Quality Control
Key points
— A ‘control map’ is produced by plotting d against concentration. Zero differences are set to a small positive value for plotting on logarithmic axes.
— Control lines are inserted at σ_r, 2σ_d and 3σ_d.
— Overall, the precision of the results conforms well to the expected variation, and exceeds that of the external (10%) criterion, except possibly at low concentrations.
A large batch of samples of soil was analysed for selenium in one run
in a completely random order. One in ten of the samples was analysed
in duplicate. The duplicate results (ppm) are shown in the table. From
previous experience, the repeatability standard deviation was expected to conform to the function σ_r = √(0.015² + (0.04c)²).
Se 1 Se 2 Se 1 Se 2
0.92 0.92 0.32 0.29
0.39 0.45 0.30 0.31
0.33 0.34 2.53 2.63
0.18 0.18 0.17 0.17
0.20 0.15 2.95 2.95
0.22 0.23 0.04 0.09
0.46 0.39 2.71 2.35
1.94 2.07 1.67 1.63
0.42 0.42 0.38 0.38
0.20 0.20 0.22 0.24
0.45 0.41 0.16 0.16
Differences between the pairs of results and corresponding values of σ_d were calculated. A dotplot of d/σ_d is shown in Fig. 10.2.1. The
Fig. 10.2.1. Dotplot of scaled differences d/σ_d for the selenium data.
bunching of points at zero is the outcome of excessive rounding. Despite that, the data do not differ significantly from the standard normal distribution, so there is no evidence here that the original data deviated from expectations.
For further work, absolute differences d were plotted against concentration. Control lines were calculated at σ_r, 2σ_d and 3σ_d. The resulting map is shown in Fig. 10.2.2. It is perhaps easier to interpret the same features plotted on logarithmic axes (Fig. 10.2.3). (Differences of zero were set to 0.05 to allow them to be plotted on logarithmic axes: that does not affect the interpretation.) Thirteen points fall below the median line (dotted) and
Fig. 10.2.2. Control map for results (points) duplicated under repeatability conditions, with control lines at 3σ_d (solid), 2σ_d (solid) and σ_r (dashed).
Fig. 10.2.3. Same data and lines as Fig. 10.2.2, plotted on logarithmic axes.
Fig. 10.2.4. The selenium results (points) plotted on a 10% control map. The control lines represent 10% RSD_r.
nine above, indicating that, if anything, the repeatability precision overall may be slightly better than expected. However, three points fall above the 2σ_d line, as compared with the expectation of 1.1. The probability of observing three or more points above this line, assuming the data conform to the expected repeatability, is p = 0.02. As these points are concentrated at lower concentrations, this suggests that the detection limit of the method was not as low as expected.
An alternative approach is to use an independent fitness-for-purpose criterion to judge the results. Figure 10.2.4 shows an absolute difference map with control lines for a repeatability relative standard deviation (RSD_r) of 10%, i.e., for σ_r/c = 0.1. (This analytical precision would be sufficient for most environmental applications.) In this instance the results easily fulfil the requirement, with only four points above the theoretical median for 10% RSD_r. There is one discrepant result (that is, exceeding the 3σ_d line), at a low concentration, again suggesting that the detection limit may be higher than expected.
A singular advantage of using log-log plots with fixed RSDs is that general-purpose map blanks can be printed in advance in large numbers and the results quickly entered by hand. It is also relatively easy to write a macro that does the same job by computer.
Notes and further reading
• The dataset is in the file named Selenium.
• ‘A simple fitness-for-purpose control chart based on duplicate results obtained from routine test materials’. (2002). AMC Technical Briefs No. 9. Free download via www.rsc.org/amc.
10.3 Internal Quality Control (IQC)
and Run-to-Run Precision
Key points
— To apply statistical control to routine analysis, we have to use a surrogate, the control material, which resembles the test materials closely.
— Control limits are set according to the run-to-run precision for the system.
The purpose of internal quality control in analysis (IQC) is to ensure as far
as possible that the magnitude of the errors aﬀecting the analytical system
is not changing during its routine use. The timebase for IQC is therefore
the analytical run. During method validation we estimate the uncertainty
of the method and show that it is ﬁt for purpose. When the method is
in use, every run of analysis should be checked to show that the errors of
measurement are probably no larger than they were at validation time. For
this purpose we employ the concept of statistical control, which means in general that some critical feature of the system is behaving like a normally distributed variable. In industrial production, the critical feature is normally part of the specification, such as the length of a screw, and is readily available for measurement.
For chemical analysis, however, we have to generate separately a representative feature of the system. This is done by adding one or more ‘control materials’ to the run of test materials. The control materials are treated throughout in exactly the same manner as the test materials, from the weighing of the test portion to the final measurement. Clearly the control materials must be of the same type as the materials for which the analytical system was validated, in respect of matrix composition and analyte concentration. In that way the control materials act as a surrogate and their behaviour is a proper indicator of the performance of the system. The results obtained in successive runs can be plotted on a control chart (§10.4),
which shows when the system becomes out of control and, by implication,
needs investigation and possible remedial action before analysis resumes.
Such a chart has to be set up with control lines determined by run-to-run precision.
When we undertake a number of successive runs in the same laboratory, the conditions of measurement will inevitably be different in each run: the instrument will be set up differently or a different instrument of the same type may be used. Newly prepared reagents or a new calibrator set may be used, perhaps by a different analyst. The environmental conditions in the laboratory may be different. Each run thus has a slightly different set of circumstances, giving rise to a ‘run bias’ effect. In the long term this variation looks like a random ‘between-runs’ effect in addition to the repeatability variation. Results replicated (that is, with the same material being analysed by the same method) in successive runs will therefore be more variable than repeatability replicates. This combined effect of repeatability and between-run variations is referred to here as run-to-run variation (and elsewhere, rather unhelpfully, as ‘intermediate’ variation). It is run-to-run standard deviation that should be used to set up control charts for internal quality control. An incorrect use of repeatability standard deviation for this purpose would result in too frequent an indication of loss of statistical control, whereas use of reproducibility standard deviation or standard uncertainty would result in too low a proportion.
It must be remembered that the parameters that define statistical control and are used to set up control charts should refer only to the behaviour of the process itself. External criteria, such as certified reference values and their uncertainties, are irrelevant to quality control per se. For the purposes of internal quality control, we need to know only whether the process (the analytical system) has changed since validation.
Notes
• There is more information on control charts in §7.1.
• Thompson, M. and Wood, R. (1995). Harmonised Guidelines for Internal
Quality Control in Analytical Chemistry Laboratories, Pure Appl. Chem.,
67, pp. 649–666.
• ‘The J-chart: a simple plot that combines the capabilities of Shewhart and cusum charts, for use in analytical quality control’. (2003). AMC Technical Briefs No. 12. Free download via www.rsc.org/amc.
10.4 Setting Up a Control Chart
Key points
— It is diﬃcult to estimate runtorun standard deviation accurately
unless routine conditions prevail during the measurements.
— Early results may be atypically disperse.
— An interim chart can be set up immediately after validation by using
a repeatability mean and an inﬂated repeatability precision.
— The interim chart should be replaced after ten or more runs with
runtorun statistics and reviewed periodically after that.
There are practical problems in setting up a control system for analysis. Run-to-run standard deviation cannot be estimated adequately in the usual type of one-off validation. Real-life replication is required, over an extended series of runs. To achieve this realism, any control material has to be in a random position in a run-length sequence of typical test materials. Many observations are needed to estimate the standard deviation with suitable accuracy, far more than the usual ten. These conditions cannot be realised except during routine use of the method. Moreover, the analysts will not be familiar with the method at initial validation time, and will produce results of atypically low precision: almost invariably an improved precision comes with experience of the method.
How then does the analyst actually start the control chart? The best
strategy is to use an interim control chart and update it as more information
becomes available, as follows.
• Start an interim control chart with the mean result for the control material established at validation time. Define the control limits on the basis of 1.6 times a repeatability standard deviation estimated from the results at validation. (The factor of 1.6 is derived from a broadly applicable empirical observation of the magnitude of run-to-run variation.) Do not use uncertainty values attached to a certified value of a reference material: that is a description of knowledge about the material, not about the analytical system.
• After results have accumulated from ten runs, replace the control limits with those based on the robust estimates of the mean and standard deviation of the new results. After further (say about 30) results have accumulated these estimates should be checked.
• Review the control limits occasionally, as may be necessary if the mean of
the process has clearly changed. This requires the exercise of judgement
and should not be done without careful consideration. A substantive
change may demand a partial revalidation of the process.
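The interim limits are simple arithmetic. A minimal sketch (the function name `interim_limits` is ours): inflate the validation repeatability standard deviation by the empirical factor of 1.6 and set warning and action lines at ±2 and ±3 of the result.

```python
def interim_limits(mean, s_r, inflation=1.6):
    """Interim control lines from validation statistics. The run-to-run
    standard deviation is approximated as inflation * s_r (the factor of
    1.6 suggested in the text); warning lines sit at +/-2 sigma and
    action lines at +/-3 sigma."""
    s = inflation * s_r
    return {
        "warning": (mean - 2 * s, mean + 2 * s),
        "action": (mean - 3 * s, mean + 3 * s),
    }
```

For the zinc example of §10.5 (mean 370, repeatability 8.9 ppm) this gives an inflated standard deviation of about 14.2 and action limits near 327 and 413.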
Notes and further reading
• Howarth, R. J. (1995). Quality Control Charting for the Analytical Laboratory. Part 1. Univariate Methods. A Review, Analyst, 120, pp. 1851–1873.
• Internal quality control in routine analysis. (2010). AMC Technical
Briefs No. 46. Free download via www.rsc.org/amc.
10.5 Internal Quality Control — Example
Key points
— The best option for a control chart for this dataset is with control lines based on robust statistics of more than about 15 runs of analysis.
— The interim chart based on inflating an estimate of repeatability standard deviation was very similar to the ‘permanent’ control chart based on run-to-run precision.
Here we consider some quality control data derived from the routine
analysis of soil samples by inductively-coupled plasma atomic emission
spectrometry. The element of interest is zinc, and the main emphasis is on
the typical (but perhaps unexpected) diﬃculty in determining the control
limits. Note that we do not proceed on the assumption that the data will
be exactly normally distributed. A more likely occurrence would be that
the majority of points are roughly normally distributed and a minority are
outliers.
Four methods are illustrated in Fig. 10.5.1. Graph (a) shows the mean
and action limits, determined from rolling statistics (simple mean and
standard deviation), from each set of ten successive results from 50 runs.
(For example, the limits at Run 50 are based on data from Runs 41–50.)
The positions of the control lines are very variable, showing that any choice
of just ten runs would give rise to highly questionable outcomes. Graph (b)
shows the corresponding control lines determined from all data up to the
current round (that is, simple statistics based on runs 1 to n). The lines
Fig. 10.5.1. Positions of control lines (mean and action limits) estimated by various
methods at diﬀerent points in the accumulation of data. (a) Rolling groups of ten data,
simple statistics. (b) All data up to run n, simple statistics. (c) Rolling groups of ten
data, robust statistics. (d) All data up to run n, robust statistics.
are more stable in position after Run 14, but noticeably aﬀected by some
following individual runs. Graph (c) shows the statistics estimated by a
robust method (Huber’s proposal H15, see §7.6) with rolling groups of ten
successive results. The lines are very unstable until Run 25, and after that
they are narrower but still somewhat ragged. When the robust statistics
are calculated from all data up to Run n (Graph d), the resulting lines are
narrow and stable almost from the start. In this example at least, robust
statistics from a long succession of data would give the best outcome for a
control chart. Using statistics from data from the ﬁrst run up to any point
after run 20 would give a serviceable control chart.
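The robust method used above can be sketched as an iterated winsorisation. This is our rendering of a Huber-type estimator of the H15 kind (§7.6): values further than 1.5σ from the current mean are pulled in to the limit, and the standard deviation of the winsorised values is rescaled by 1.134, the consistency factor conventionally paired with the 1.5σ cut-off. Treat the details as an assumption of this sketch rather than the book's exact algorithm.

```python
import statistics

def huber_h15(values, c=1.5, max_iter=50, tol=1e-9):
    """Robust mean and standard deviation by iterated winsorisation
    (a Huber H15-type estimator). Values beyond c sigma from the current
    mean are clipped to the limit; 1.134 is the consistency factor
    conventionally used with c = 1.5."""
    mu = statistics.median(values)
    sigma = statistics.stdev(values)
    for _ in range(max_iter):
        lo, hi = mu - c * sigma, mu + c * sigma
        w = [min(max(v, lo), hi) for v in values]  # winsorised copy
        new_mu = statistics.fmean(w)
        new_sigma = 1.134 * statistics.stdev(w)
        if abs(new_mu - mu) < tol and abs(new_sigma - sigma) < tol:
            mu, sigma = new_mu, new_sigma
            break
        mu, sigma = new_mu, new_sigma
    return mu, sigma
```

The point of the estimator, as the graphs illustrate, is that a minority of outlying runs barely move the control lines, whereas they drag the simple mean and standard deviation badly.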
For setting up an interim control chart, the initial repeatability statistics were a mean of 370 and a standard deviation of 8.9 ppm. The interim estimate of run-to-run standard deviation was therefore 1.6 × 8.9 = 14.2 (see §10.4). A chart with control lines based on 370 ± k × 14.2, k = 2, 3, was
Fig. 10.5.2. Interim control chart based on a standard deviation of 1.6 times the repeatability value.
Fig. 10.5.3. ‘Permanent’ control chart for zinc with control limits based on robust statistics from the whole dataset, namely a mean of 370.6 and a standard deviation of 14.8.
used as the interim chart. Figure 10.5.2 shows this chart used for the ﬁrst
25 rounds. The chart has the appearance of well-behaved data apart from
Runs 13 and 14, which are out of control.
A ‘permanent’ control chart based on option (d) (all data up to Run
50, robust statistics) is shown in Fig. 10.5.3. It is remarkably similar to the
interim chart in this instance. There are a number of clear outliers signifying
out-of-control conditions (Runs 13, 14, 27, 34 and 47), and one marginal
case (Run 41). This has every appearance of a sensible control chart.
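The interim-chart arithmetic above is simple enough to script; this sketch merely reproduces the calculation (the 1.6 multiplier is the factor cited from §10.4):

```python
# Interim control lines from the repeatability statistics quoted above.
mean, s_r = 370.0, 8.9        # run mean and repeatability sd, ppm
s_run = 1.6 * s_r             # interim run-to-run sd (see section 10.4)
lines = {k: (mean - k * s_run, mean + k * s_run) for k in (2, 3)}
for k, (lo, hi) in lines.items():
    label = "warning" if k == 2 else "action"
    print(f"{label} lines: {lo:.1f} .. {hi:.1f}")
```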
Notes and further reading
• The data for this example were a subset of the file AGRGIQC, downloadable
from AMC Datasets via www.rsc.org/amc. This dataset comprises
multi-element results from 156 successive runs (including rejected
runs), with a variable number of repeat results for each run. The subset
used here comprised the first result for zinc from each run up to Run 50.
The repeatability standard deviation for the interim chart was obtained
independently.
• The closeness of agreement between the control lines of the interim and
permanent control charts was in this instance unusually good.
10.6 Multivariate Internal Quality Control
Key points
— Results from simultaneous multi-analyte methods are likely to be
correlated because of variations in procedure that affect all analytes.
— Such correlations may be diagnostic and point to speciﬁc causes of
problems.
— Multiple symbolic control charts are useful because they clearly show
runs where many channels are aﬀected simultaneously and analytes
that are aﬀected in many runs.
Chemical analysis with multiple outputs, either simultaneous (e.g.,
ICP-AES) or nearly simultaneous (e.g., HPLC), is commonplace nowadays.
The question of how best to apply IQC principles to such systems is often
discussed. The use of multivariate statistical methods in this context has
seldom been reported so far, and these methods are beyond the scope of
this book. In any event, such methods may provide a robust account of the
analytical system as a whole, but both the analyst and the customer will
require information about the validity of results for each individual analyte.
The present discussion is therefore limited to the multiple use of univariate
methods.
One preliminary consideration is the extent to which the variables
(the results for the analytes) are correlated. Variation in parts of the
analytical procedure that are common to all analytes will tend to cause
correlation. Variation in the volume of test solution injected into
a chromatograph would be of that kind: it would tend to aﬀect all analytes
in proportion. Other variations in method may have outcomes that are
limited to a subset of analytes. In an acid decomposition of a soil sample,
variation in the ﬁnal temperature of drying would cause variation in the
recovery of a suite of volatile elements (e.g., Hg, As, Se), while other
elements would be completely unaffected. An extreme example with only one
analyte aﬀected could be caused by malfunction in a single channel of a
multichannel instrument. These features may be worth some consideration
for their potential diagnostic value in instances of outofcontrol runs.
Another aspect of this is the well-known ‘Bonferroni’ problem. Suppose
we have a system measuring 30 analytes simultaneously, and we are
considering results that fall outside the warning limits of a control chart (that
is, the approximate 95% confidence limits). If all of the channels were
independent (not correlated), in every run we would expect to see results for one
or more analytes in the warning zone purely by chance. Isolated instances
almost certainly would mean nothing. The probability (under the usual
assumptions) of observing a result for a single analyte outside the action
limits is about one in 300, which is so rare in an in-control system that we
are justified in assuming that the system has changed. But in an uncorrelated
30-channel system operating in control we could expect to encounter
a result outside the action limits with a probability of about one in ten. That
might lead the unwary to reject data at a quite unnecessary rate.
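The ‘Bonferroni’ arithmetic above can be checked directly (the 30 channels and the one-in-300 action-limit probability are the figures from the text):

```python
# False-alarm rates for an uncorrelated 30-channel in-control system.
n_channels = 30
p_warn = 0.05        # per-channel probability outside warning limits
p_action = 1 / 300   # per-channel probability outside action limits

# expected number of analytes per run in the warning zone:
expected_warnings = n_channels * p_warn
# probability of at least one action-limit excursion per run:
p_any_action = 1 - (1 - p_action) ** n_channels

print(expected_warnings)       # 1.5
print(round(p_any_action, 2))  # about one run in ten
```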
Fortunately, largely uncorrelated multi-analyte systems are seldom
encountered. When things go wrong the effects tend to be visible on a
number of analytes simultaneously. Such occurrences are clearly visible
in a multiple symbolic control chart. To make such a chart the results
x for each channel are standardised as z = (x − μ̂)/σ̂, where μ̂ and σ̂ are
respectively the estimated mean and standard deviation for that channel.
The values of z are then converted into symbols indicating the zone into
which the result would fall on the corresponding Shewhart chart. The
symbols are then plotted according to analyte (rows) and run number
(columns).
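A minimal sketch of building such a chart, assuming illustrative per-channel means, standard deviations and zone symbols of our own choosing (the text does not prescribe particular symbols):

```python
import numpy as np

def zone_symbol(z):
    """Map a standardised result to a Shewhart-chart zone symbol."""
    if abs(z) > 3:
        return "A"   # outside action limits
    if abs(z) > 2:
        return "W"   # warning zone
    return "."       # in control

# results: rows = runs, columns = analytes (toy data, 5 runs x 3 analytes)
x = np.array([[370., 12., 55.],
              [368., 11., 54.],
              [372., 13., 56.],
              [430., 19., 70.],   # a run where every channel goes wrong
              [371., 12., 55.]])
mu = np.array([370., 12., 55.])   # per-channel estimated means
sd = np.array([14.2, 1.0, 4.0])   # per-channel estimated standard deviations

z = (x - mu) / sd
chart = [[zone_symbol(v) for v in row] for row in z]
for run, row in enumerate(chart, 1):
    print(run, " ".join(row))
```

Run 4 stands out with every channel outside the action limits, which is exactly the pattern the multiple symbolic chart is designed to reveal.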
Figure 10.6.1 shows such a chart for the results of 25 elements in a
control material over the ﬁrst 50 runs of a routine procedure. (The example
used previously [§10.5.3] represents one row of this data.) On this chart
the great majority of instances of results falling outside action limits are
concentrated into a few runs: a large number of instances occur together
(and usually in the same direction), pre-eminently in Runs 13, 14, 15, 24, 27,
34, 41 and 47. (Another phenomenon potentially present in multiple
symbolic charts is where a particular analyte shows signs of persistent problems
Fig. 10.6.1. Symbolic multipleanalyte control chart.
over a number of runs. That would be indicative of a speciﬁc problem with
that element. No example is present in Fig. 10.6.1.)
It is clear that the channels in this example are mostly highly correlated.
Even when the anomalous runs were deleted from the dataset, there is still
a degree of correlation among the variations. The following correlations are
typical of the whole correlation matrix.
      Li    Na    K     Rb
Li   1.0   0.5   0.5   0.3
Na   0.5   1.0   0.9   0.4
K    0.5   0.9   1.0   0.4
Rb   0.3   0.4   0.4   1.0
Notes and further reading
• When a number of analytes in a material are determined by quite
separate methods on separate test portions, the results can be assumed to be
independent.
• The data for this example were a subset from ﬁle named AGRGIQC,
downloadable from AMC Datasets on the website www.rsc.org/amc.
This dataset comprises multi-element results from 156 successive runs
(including rejected runs), with a variable number of repeat results for
each run.
Chapter 11
Proﬁciency Testing
Proficiency testing, an externally provided means for participants to
check the accuracy of their routine measurements, is now incumbent
on all laboratories seeking accreditation. The conversion of participants’
results into meaningful and readily understandable scores is
almost universal. Nearly all such scoring systems are based on the
properties of the normal distribution, but some statistical methods
needing special software may be required in the process. While it is
the job of the scheme provider to execute these methods, it is useful
for the participant to be aware of what is involved.
11.1 Proﬁciency Tests — Purpose and Organisation
Key points
— Proﬁciency tests are regular interlaboratory comparisons of results
obtained by ‘blind’ analysis.
— The main purpose is to help laboratories to achieve a suitably low
uncertainty.
— Participation in a scheme (where available) is an almost universal
requirement for accreditation.
— Participants’ results are usually converted into scores that give an
indication of accuracy.
Proﬁciency tests are interlaboratory exercises, provided on a regular basis,
to allow participating laboratories to check the accuracy of their results. For
each round of the scheme, the scheme provider sends portions of one or more
test materials to the participants, who analyse the materials ‘blind’ (that
is, with no indication of the concentrations of the analytes) by their routine
methods. The materials and analytes should be typical of the participants’
normal work. The materials should be suﬃciently close to homogeneous and
stable, so that variations among the results reﬂect accurately the variations
in the participants’ performance rather than variations in the test material.
After the reporting deadline, the provider processes the results, usually
converting them into scores that give an indication of accuracy, often
in relation to a predetermined maximum level of uncertainty. The provider
sends a report of the round to the participants, showing the results and/or
scores of all participants.
Proficiency test rounds are provided at various frequencies, most
commonly several times per year. They cannot therefore act as a substitute for
internal quality control (§10.3), which should be carried out in every run of
analysis.
The primary purpose of proﬁciency testing is to enable participants
to be conﬁdent about their normal analytical methods. If an unexpected
inaccuracy in their routine results is detected, an investigation can be
triggered and remedial actions instituted where necessary. This function is so
important that participation in a proficiency test, where one is available,
has been made a universal requirement for accreditation. Moreover,
accreditation agencies expect participants to have and apply a written procedure
for dealing with poor scores. However, accreditation has had the
unfortunate effect of encouraging participants to try to excel in accuracy rather
than merely to assess the performance of routine operations. This
tendency is reinforced when laboratories use their scores in promotional
activities, for example by quoting favourable scores in tenders for work, or
for monitoring the performance of individual analysts. These secondary
uses have to a certain extent subverted the original ethos of proficiency
testing.
Notes and further reading
• Thompson, M., Ellison, S. L. R. and Wood, R. (2006). The International
Harmonised Protocol for the Proficiency Testing of Analytical Chemistry
Laboratories, Pure Appl. Chem., 78, pp. 145–196. (Free download from
IUPAC website.)
• ISO Guide 43. Proﬁciency testing by interlaboratory comparisons –
Part 1: development and operation of proﬁciency testing schemes. (1994).
International Organisation for Standardisation, Geneva.
• Lawn, R. E., Thompson, M. and Walker, R. F. (1997). Proﬁciency
Testing in Analytical Chemistry, The Royal Society of Chemistry,
Cambridge.
• ISO 13528. Statistical methods for use in proficiency testing by
interlaboratory comparisons. (2005). International Organisation for
Standardisation, Geneva.
• ILAC-G13. Guidelines for the requirements for the competence of
providers of proﬁciency testing schemes. (2000). (Available free online
at www.ilac.org/documents/)
11.2 Scoring in Proﬁciency Tests
Key points
— Converting a participant’s result into a score is pointless unless it adds
information about the accuracy of the result.
— An ideal score should convey the same information about accuracy
regardless of the nature of the analytical measurement.
— The z-score is close to ideal.
A participant’s result x for an analyte in a round of a proficiency test is
usually converted into a score that reflects the accuracy of the result. The
ideal score should be universally applicable: a particular value, say 1.5,
should convey the same information about the accuracy of a result,
regardless of the analyte, its concentration, the test material or the physical
principle underlying the measurement. In fact, scoring is pointless unless it
has this property. The z-score, given by z = (x − x_A)/σ_p, is appropriate,
where the ‘assigned value’ x_A is the scheme provider’s best estimate of the
true value of the measurand, and σ_p is the ‘standard deviation for
proficiency’ (also known informally as the ‘target value’). However, the efficacy
of a scheme depends critically on the selection of appropriate values for x_A
and σ_p.
A hypothetical laboratory using an unbiased method producing results
with an uncertainty u = σ_p would tend to produce z-scores that are a
random sample from a standard normal distribution N(0, 1), that is, with
a mean of zero and a variance of unity. Consequently, it is appropriate
to interpret z-scores on this basis, as we would expect about 95% of
z-scores from exactly compliant laboratories to fall between ±2 and very
few to fall outside ±3. Real-life laboratories will not be exactly compliant,
however. Laboratories operating with poor uncertainty (u > σ_p) and/or
with a bias tend to produce higher proportions of scores outside these
limits. In contrast, laboratories operating with no bias and an uncertainty
Fig. 11.2.1. Bar chart of results for alumina showing an approximately symmetrical distribution in a round of the GeoPT proficiency test.
smaller than σ_p tend to produce a smaller proportion of scores outside the
limits.
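The z-score calculation itself can be sketched as follows; the participants’ results, assigned value and σ_p below are invented for illustration:

```python
import numpy as np

def z_scores(x, x_assigned, sigma_p):
    """z = (x - x_A) / sigma_p for each participant's result."""
    return (np.asarray(x, dtype=float) - x_assigned) / sigma_p

# invented results for one analyte in one round of a scheme:
results = [14.8, 15.2, 15.0, 14.1, 16.9, 15.3, 12.2, 15.1]
z = z_scores(results, x_assigned=15.0, sigma_p=0.5)
print(np.round(z, 1))
print(int(np.sum(np.abs(z) <= 2)), "of", len(z), "scores within +/-2")
```

Two of the eight invented results score outside ±3 and would call for investigation by those participants.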
Some examples of results and z-scores obtained in rounds of some
proficiency tests are shown in Figs. 11.2.1, 11.4.2 and 11.4.4. In typical reports,
the participants’ scores are shown as a bar chart or ordered bar chart,
with the individual laboratory identified by an anonymised code. In some
instances the number of participants is so large that a bar chart is
impracticable and is replaced by a histogram.
11.3 Setting the Value of the Assigned Value x_A in Proficiency Tests
Key points
— There are a number of possible ways of determining an ‘assigned
value’.
— The value most often used is the consensus of the participants’ results.
There are several possibilities for the choice of assigned value x_A.
• The certified value of the analyte in a certified reference material
(CRM). Although metrologically sound, this value is seldom used because the
cost of a CRM is usually too great for use in a routine manner. In
addition, the uncertainty of the certiﬁed value is often too large to be
useful.
• A value from a national reference laboratory obtained by a method such
as isotope dilution mass spectrometry. Similar comments apply to this
value.
• A value obtained by analysis alongside a number of matrix-matched
CRMs in the same analytical run (that is, using the CRMs as calibrators).
There are seldom enough appropriate CRMs available.
• A value based on formulation. This can be used where the analyte is
added gravimetrically or volumetrically to a base material containing
none. Although sometimes applicable, this approach often presents difficulties
in accurately spiking the base material with low trace levels of analyte.
• A consensus of expert laboratories. One problem with this assigned value
stems from the diﬃculty of identifying the expert laboratories to the
satisfaction of every stakeholder. A practical problem is that the variation
between the experts’ results is often comparable with that of the whole
participant set, and the assigned value does not have a suﬃciently small
uncertainty.
• A consensus of all participants. This is by far the most commonly used
assigned value, and it costs nothing. A consensus is usually easy to
identify and has a suﬃciently small standard error if there are more than
about 20 participants. The consensus has been criticised on metrological
grounds, as it is perfectly possible for the great majority of the
participants to be using a biased analytical method. In such instances, which
are occasionally detected, there would be a latent uncertainty in the
assigned value, and participants using an unbiased method could receive
poor z-scores. However, at present there is seldom an economically viable
alternative. Methods of determining a consensus are considered below
(§11.4).
Further reading
• Thompson, M., Ellison, S. L. R. and Wood, R. (2006). The International
Harmonised Protocol for the Proficiency Testing of Analytical Chemistry
Laboratories, Pure Appl. Chem., 78, pp. 145–196. (Free download from
IUPAC website.)
11.4 Calculating a Participant Consensus
Key points
— A robust mean is usually a good estimator of a consensus if the results
from a round of a proﬁciency test seem unimodal and (outliers aside)
roughly symmetrical.
— If the dataset seems to be unimodal but skewed, a mode estimated
by kernel density methods may be a suitable consensus.
— Where the results are apparently multimodal, it may be impossible to
ﬁnd a consensus.
In the context of proﬁciency testing, ‘consensus’ does not mean absolute
concordance, but an identiﬁable and unique point of maximum agreement
between the participants’ results. All of the usual measures of central
tendency have been considered in this context. In addition to the value of the
selected statistic itself, an estimate of its uncertainty is required to ensure
that the assigned value is suﬃciently stable. Methods for estimating these
statistics abound, but experience and judgement are needed to select the
method appropriate for particular datasets.
• The mean. The almost inevitable presence of outliers and heavy tails in
sets of results from proficiency tests means that the arithmetic mean may
be biased and the variance inflated. One of several robust estimates is
suitable to avoid these problems if the dataset is unimodal and reasonably
close to symmetric (e.g., Fig. 11.4.1). In such datasets the various
estimates of central tendency are almost coincident, and the robust mean is a
good estimator. The standard error of the robust mean can be estimated
as σ̂_rob/√n from the robust standard deviation σ̂_rob and the number of
participants n. (This is a reasonable estimate if the robustification is not
too severe: otherwise the value of n should be adjusted for any
downweighting.)
• The median. The median is a type of robust mean but is more resistant
than some estimators to the influence of skewness, which may appear in
proficiency test datasets through the use of a number of methods with
differing detection limits. However, the mode is usually preferable for
skewed distributions.
Fig. 11.4.1. Histogram of results for alumina in a rock test material from a round of
the GeoPT proficiency test. Extreme outliers have been omitted. The results are
heavy-tailed in comparison with the robustly fitted normal distribution (solid line). Same data
as Fig. 11.2.1.
• The mode. The mode is intuitively attractive as a consensus estimator,
and serves well even if the dataset shows a moderate degree of skewness,
e.g., Fig. 11.4.2. The mode of a smooth distribution is the point of
highest density. Real datasets presented as histograms or dotplots are
not smooth, however, owing to the class boundaries or because of the
limited digit resolution of the data. A degree of smoothing is therefore
required to identify the mode, and to check that there is indeed only one
mode. This smoothing can be conveniently carried out by kernel density
estimation (Fig. 11.4.3). The standard error of the mode can be estimated
via the bootstrap (a computer-intensive method of estimation).
In some instances more than one mode (other than outliers) may be
apparent (Figs. 11.4.4–11.4.6). This could happen if substantial subsets
of the participants used one of several discrepant analytical methods or
variants of a single method. In such instances it is usually not possible
to identify a consensus. Occasionally, however, there may be additional
evidence (such as use by participants of a prescribed method) that enables
the provider to determine that one such mode represents reliable results
and other modes suspect results. In that case the provider can, with due
Fig. 11.4.2. Bar chart of results for lead, from the GeoPT proficiency test, showing a strong positive skew.
Fig. 11.4.3. Kernel density representation of results for lead from a round of the GeoPT
proﬁciency test, showing a positive skew and the single mode at about 6 ppm. (Same data
as Fig. 11.4.2.) (The sub-zero density in the low tail is the result of smoothing.)
Fig. 11.4.4. Bar chart of results for niobium from a round of the GeoPT proﬁciency
test. Despite the high proportion of results with good z-scores, there is a suggestion of
multimodality.
Fig. 11.4.5. Kernel density representation of results for niobium from a round of the
GeoPT proﬁciency test (same data as Fig. 11.4.4). The tendency to multimodality is
clear.
Fig. 11.4.6. The results for niobium from a round of the GeoPT proﬁciency test are
well described as a mixture model of three normally distributed subsets. (Same data as
Fig. 11.4.4.)
caution, use the reliable mode as the consensus. A mixture model is a useful
additional technique for characterising such subsets of the data.
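The kernel-density mode and its bootstrap standard error described above can be sketched with NumPy alone; the skewed dataset, the bandwidth rule (Silverman’s) and the number of resamples are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# invented positively skewed results, mimicking the lead data of Fig. 11.4.3
results = rng.lognormal(mean=np.log(6.0), sigma=0.4, size=80)

def kde_mode(x, grid):
    """Mode located as the peak of a Gaussian kernel density estimate."""
    n = len(x)
    # Silverman-type bandwidth from sd and interquartile range
    h = 0.9 * min(x.std(ddof=1),
                  (np.percentile(x, 75) - np.percentile(x, 25)) / 1.34) * n ** -0.2
    density = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2).sum(axis=1)
    return grid[np.argmax(density)]

grid = np.linspace(results.min(), results.max(), 500)
mode = kde_mode(results, grid)

# bootstrap: resample with replacement and re-estimate the mode each time
boot = np.array([kde_mode(rng.choice(results, size=len(results)), grid)
                 for _ in range(200)])
print(round(mode, 2), "standard error", round(boot.std(ddof=1), 2))
```

Inspecting the density over the whole grid, rather than just its maximum, is also how one checks that the dataset has only one mode.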
Notes and further reading
• Some statistical methods mentioned in this section are beyond the scope of
this book, but straightforward accounts can be read in the AMC Technical
Briefs listed below. A kernel density provides a smooth representation of
data by replacing each data point with a small (usually normal) distribution and
then adding the resulting densities at each point on the measurement axis.
The bootstrap estimates distributions of statistics by resampling the data
with replacement a large number of times. A mixture model regards the
dataset as a mixture of random samples of observations from two or more
diﬀerent distributions.
• ‘Representing data distributions with kernel density estimates’. (Revised
March 2006). AMC Technical Briefs No. 4. Free download from
www.rsc.org/amc.
• ‘The bootstrap: a simple approach to estimating standard errors and
confidence intervals when theory fails’. (August 2001). AMC Technical Briefs
No. 8. Free download from www.rsc.org/amc.
• ‘Mixture models for describing multimodal data’. (March 2006). AMC
Technical Briefs No. 23. Free download from www.rsc.org/amc.
11.5 Setting the Value of the ‘Target Value’ σ_p in Proficiency Tests
Key points
— In best practice the target value should be a criterion known to
participants in advance and describing the uncertainty regarded as fit for
purpose in the relevant application sector.
— Target values based on a robustified standard deviation of the
participants’ results do not provide any useful additional information.
The best value of σ_p is simply the uncertainty that is regarded as fit for
purpose in the application sector. That information should be available
to the participants before the analysis. Under this convention z-scores
will be comparable over any analytical method, analyte or matrix. For
example, a z-score outside the range ±3 will always show that the original
result was not fit for purpose. It is important to emphasise (because it
is widely misunderstood) that σ_p is not intended to predict the
uncertainty of individual laboratories. Equally it does not imply that the
collected z-scores of all of the participants will be a random sample from
the standard normal distribution N(0, 1). Individual laboratories will tend to
have different precisions and biases that jointly contribute to the between-
laboratory variation. Thus real datasets deviate from N(0, 1) to a greater
or lesser extent, often with heavy tails, a proportion of outliers and,
occasionally, skews or multiple modes. The value of σ_p is not intended to
predict that diversity: rather it is set to prescribe in advance the
uncertainty that is required by the scheme provider. It is up to the
participants to attempt to comply. The z-scores then give a good idea of
the degree of compliance. If the fit-for-purpose uncertainty varies with
the concentration of the analyte, it should be expressed as a functional
relationship.
Some proficiency tests use a robust standard deviation of the
participants’ results in a round to define σ_p. The resulting z-scores will nearly
always show in excess of 90% of laboratories with scores between ±2. This may
be comforting for the participants, the great majority of whom could claim
that their result was ‘satisfactory’, and equally so for the provider, who
could claim that the scheme was achieving its purpose. In fact such a score
tells us nothing about whether the results are accurate enough in relation to
their intended purpose. Moreover, it does not tend to reduce the variation
among the great majority of the participant laboratories. Finally, it does
not give the participant any prior guidance as to the standard of accuracy
that is required: such guidance is necessary so that appropriate analytical
methods can be chosen in advance or modiﬁed to meet the requirement. In
short, scoring is pointless unless it adds information to that already present
in the raw results.
11.6 Using Information from Proﬁciency Test Scores
Key points
— z-scores should be regarded as defining action limits rather than as
a method of classifying participants.
— Laboratories should have a documented strategy for examining and
acting upon z-scores.
In a scheme where the σ_p value is determined by fitness for purpose, z-scores
in the range ±2 can be regarded as calling for no action by the participant.
Scores outside the range ±3 would be very unusual for a participant
conforming to the fitness criterion and can be taken as calling for investigation
and possibly remedial action, such as modification of the analytical system.
Scores in the intermediate range would not be especially uncommon and
isolated instances could be ignored. So z = ±3 can best be regarded as
defining action limits.
There is a temptation amongst practitioners to regard these arbitrary
limits as class boundaries and to name the classes accordingly, for instance,
‘satisfactory’ for |z| < 2 and ‘unsatisfactory’ for |z| > 3. These class labels
are best avoided. There is a danger that they will be interpreted literally
and misapplied, especially by those not familiar with statistical inference.
There is also a tendency among non-scientists to want to construct a ranked
‘league table’ from a set of zscores. That is especially invidious, and should
be strongly discouraged if encountered, as ranks can change greatly from
round to round without any underlying change in the performances of the
participants.
There is also an understandable desire for people to try to summarise
performance by averaging z-scores, for a single analyte over a period of
time or for a number of different analytes at a particular time. There are
several problems associated with creating these summary scores. One such
is that a single z-score of large magnitude will have a persistent effect in
time, possibly long after the problem giving rise to it has been rectified.
Another is that, in an average score from a number of analytes, a particular
analyte may consistently attract a score of large magnitude that is hidden
in the average. Finally, when several analytes are determined
‘simultaneously’, the results are unlikely to be independent and the average misleading
(unless corrected for covariance).
No such problems are attached to interpreting z-scores by standard
univariate control chart procedures, using the usual rules of
interpretation (§7.1). Either Shewhart charts or zone charts are suitable for this
purpose. In Fig. 11.6.1 we see no trend in the scores for an analyte in
successive rounds, but the score in Round 10 is less than −3, so the analytical
system needs investigating. If several analytes are determined together
in successive rounds, parallel symbolic control charts are especially
informative. In Fig. 11.6.2 we see five instances where |z| > 3, each calling
for investigation. Under the usual rules, Analyte 2 in Round 11 would
also trigger investigation, because there are two successive results where
−3 < z < −2. We can also see that Analyte 2 alone is unduly prone to
low results while, in Round 4, five out of the six analytes attract unduly
Fig. 11.6.1. z-scores for a single analyte from successive rounds of a proficiency test.
Fig. 11.6.2. Multiple univariate control chart used to summarise the results for a number
of analytes determined in successive rounds of a proﬁciency test. (Analyte 1 results are
also shown in Fig. 11.6.1.)
high results, possibly because of a procedural mistake affecting all of them in the
same way.
Notes and further reading
• A z-score of unexpectedly large magnitude shows that the internal quality
control system also needs investigation: a problem causing the
troublesome z-score (unless due to a sporadic mistake) should have been
detected promptly by internal quality control and the result rejected by the
analyst.
• It is important to realise that a poorly performing participant can still
receive a majority of ‘satisfactory’ z-scores. If a participant’s method were
unbiased but its standard deviation were twice the target value (i.e., 2σ_p),
the laboratory would on average still receive a z-score between ±2 on about
68% of occasions and between ±3 on about 87% of occasions.
• ‘Understanding and acting on scores obtained in proﬁciency testing
schemes’. (December 2002). AMC Technical Briefs No. 11. Free download
from www.rsc.org/amc.
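The figures in the second note above can be checked against the normal model, since z ~ N(0, 4) for an unbiased laboratory whose standard deviation is 2σ_p (the exact normal-theory values round to 68% and 87%):

```python
from math import erf, sqrt

def p_within(k, sd_ratio):
    """P(|z| < k) when the lab is unbiased and its standard deviation is
    sd_ratio times sigma_p, so that z ~ N(0, sd_ratio**2)."""
    return erf(k / (sd_ratio * sqrt(2.0)))

# unbiased laboratory with standard deviation 2 * sigma_p:
print(round(100 * p_within(2, 2)))   # percentage of z-scores within +/-2
print(round(100 * p_within(3, 2)))   # percentage within +/-3
```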
11.7 Occasional Use of Certiﬁed Reference Materials
in Quality Assurance
Key points
— Certified reference materials (CRMs) are not ideal for internal quality control, for several reasons, despite their providing a route to traceability.
— CRMs as an occasional check on accuracy are best regarded as akin
to proﬁciency testing.
Certiﬁed reference materials (CRMs) are sometimes advocated for use as
control materials in internal quality control (IQC) of analysis, in that the
CRM is a direct route to an acceptable traceability. In many ways, however,
it is better to keep the concepts of quality control and traceability distinct.
The use of a CRM (where available) on a scale appropriate for IQC would
usually be too expensive but on a lesser scale ineﬀectual. Moreover, there is
a discrepancy in concept between IQC and the CRM. The control chart is
based on the properties (i.e., the mean and variance) of the whole analytical
system. For the CRM, in contrast, the certiﬁed value and its uncertainty
describe the material alone.
The analysis of a CRM, however, can provide a useful occasional check of an ongoing analytical system, if a sufficiently close match to the test materials can be found. In such instances it is better to regard the action as a kind of one-laboratory proficiency test rather than part of IQC. The outcome could be assessed by calculating a ‘pseudo-z-score’ z = (x − x_crm)/√(u_f² + u_crm²) from the result x, the uncertainty u_f regarded as fit for purpose for the result, the certified value x_crm and its uncertainty u_crm. The fit-for-purpose uncertainty u_f would have to be specified in advance, possibly as a function of concentration. It is very important to notice that any such test would be pointless unless the uncertainty on the certified value is negligible. Unless u_crm ≲ u_f/2, the score z would reflect the uncertainty in the certified value to an undue extent and mask the behaviour of the analytical system.
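The pseudo-z-score is easy to compute; the guard below encodes the u_crm ≲ u_f/2 condition from the paragraph above (function and variable names, and the illustrative numbers, are ours):

```python
import math

def pseudo_z(x, u_f, x_crm, u_crm):
    """Pseudo-z-score z = (x - x_crm) / sqrt(u_f**2 + u_crm**2) for an
    occasional CRM check treated as a one-laboratory proficiency test."""
    if u_crm > u_f / 2.0:
        # The certificate uncertainty would then dominate the score and
        # mask the behaviour of the analytical system.
        raise ValueError("certificate uncertainty too large: need u_crm <~ u_f/2")
    return (x - x_crm) / math.hypot(u_f, u_crm)

# Illustrative numbers only: result 10.4, fit-for-purpose uncertainty 0.5,
# certified value 10.0 with certificate uncertainty 0.2.
print(round(pseudo_z(10.4, 0.5, 10.0, 0.2), 2))  # 0.74
```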
Chapter 12
Sampling in Chemical Measurement
This chapter is concerned with the statistics involved in the neglected topic of uncertainty from sampling. In many application sectors sampling uncertainty is a substantial or even dominant term in the uncertainty budget and therefore very important. The end-user of analytical results needs to know the combined uncertainty (analytical plus sampling) to make valid decisions about the sampling ‘target’, decisions such as the commercial value of a batch or lot of a material, or whether material conforms with a legal or contractual specification.
12.1 Traditional and Modern Approaches to Sampling
Uncertainty
Key points
— Sampling uncertainty is traditionally ignored if the sample is ‘representative’.
— The modern approach regards sampling as an integral part of the
measurement process and includes its contribution to the combined
uncertainty.
— Analytical chemists should use the recommended terminology of
sampling.
Nearly all analysis is preceded by sampling, the process of taking a small portion (the sample) from the much larger amount (the target), the composition of which is in question. The sample is small enough to be removed to the laboratory for further physical preparation such as grinding before
analysis, while the target usually is not. Of course, taking a sample is pointless unless it reasonably approximates the average composition of the target: such a sample is called ‘representative’. Obtaining a representative sample is sometimes very difficult, especially when the target is very large, a shipload of ore for example, and parts of the target are difficult to access. Procedures for obtaining representative samples of nearly all materials of commerce have been produced and often form parts of contractual agreements or legal requirements. Such samples are usually accepted by analytical chemists and end-users of analytical data without further consideration. In effect, the uncertainty introduced into the final analytical result by sampling is ignored in the traditional approach.
A more modern approach to sampling avoids the idea of a representative sample, but considers the uncertainty introduced by the sampling as an integral part of the measurement uncertainty. This approach provides a quantitative basis for the often repeated saying among analytical chemists, that ‘the result is only as good as the sample’. It is an important development because the end-user of analytical results needs information about the target, not the sample. Moreover, we cannot ensure that we are using resources optimally unless we can compare the uncertainty contributions from sampling and analysis. Despite the modern trend towards considering sampling and analysis as parts of a single measurement operation, their contributions usually have to be estimated separately because sampling is often conducted at a location remote from the analytical laboratory and seldom under the complete control of analytical chemists.
The following sections in this chapter refer to the estimation and use of
uncertainty from sampling. They do not provide instructions for obtaining
representative samples of speciﬁc materials. However, many textbooks refer
to the general principles of sampling for chemical analysis.
The word ‘sample’ is often used informally among analytical chemists to indicate ‘analyte’, ‘test portion’, ‘test material’, ‘test solution’, ‘aliquot’, ‘matrix’, and so on. Such usage should be discontinued to avoid confusion. The recommended terminology for the various stages of sampling is shown here. ‘Sampling’ is usually taken to include all operations down to the preparation of the test sample. ‘Analysis’ is taken to mean all of the subsequent steps, starting with the selection and weighing of the test portion. Any residual heterogeneity in the test sample gives rise to an uncertainty that is attributed to the analytical variation. Some stages (subsample, laboratory sample) are omitted or merged in many instances. Key terms are underlined.
Notes and further reading
• ‘Terminology – the key to understanding analytical science. Part 2:
Sampling and sample preparation’. (March 2005). AMC Technical Briefs
No. 19.
• Crosby, N.T. and Patel, I. (1995). General Principles of Good Sampling
Practice, Royal Society of Chemistry, Cambridge.
12.2 Sampling Uncertainty in Context
Key points
— Only the combined uncertainty (sampling plus analytical) is relevant
to the customer’s needs.
— Taking proper account of sampling uncertainty can aﬀect decisions
based on analysis.
— The concepts of bias and precision can be applied to sampling.
All sampling targets are actually or potentially heterogeneous: the chemical
composition can vary from point to point in the material. This implies that
replicate test samples from a single target will diﬀer in composition from
each other and from the target. This variation gives rise to uncertainty from sampling u_s that is additional to and independent of the uncertainty u_a derived from purely analytical activities. The combined uncertainty on the composition of the target is thus u = √(u_s² + u_a²). It is this combined uncertainty that is relevant to the needs of the end-user of the data, who is required to make rational decisions about the target (not about the laboratory sample). Typical decisions are the probable commercial value of a batch of material, or whether it is probably within specification. In a high proportion of instances, u_s makes a substantial contribution to the combined uncertainty. In environmental studies, and in the examination of raw materials such as food or ores, u_s could even be the dominant contribution. This fact has important implications for decision making by end-users of analytical data, legislators, enforcers of legislation and analytical chemists (§8.3).
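Because the two terms add in quadrature, the larger one dominates: when u_s is several times u_a, even halving the analytical uncertainty barely improves the combined figure. A quick illustration with invented numbers:

```python
import math

def combined_uncertainty(u_s, u_a):
    """u = sqrt(u_s**2 + u_a**2): sampling and analytical uncertainties
    combined in quadrature."""
    return math.hypot(u_s, u_a)

# Sampling term dominates: halving u_a from 4 to 2 changes u very little.
print(round(combined_uncertainty(10.0, 4.0), 2))  # 10.77
print(round(combined_uncertainty(10.0, 2.0), 2))  # 10.2
```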
Analysis is often conducted to determine whether the material in a target conforms to a legislative or contractual limit, either an upper limit or a lower limit. Hitherto, only the analytical uncertainty was taken into account in making such decisions. The sampling uncertainty was in effect taken to be zero. That basis for decision making is illustrated in Fig. 12.2.1 as examples A and A′. As the analytical uncertainties do not encompass the limit value, the results are regarded as definitive. Result A is clearly below the limit and result A′ above. That, however, is not a logical standpoint unless the sample is defined as the target by law. (For instance, a single bottle of milk may have to conform to a regulation, rather than the whole consignment from which the bottle is an example.) The uncertainty from sampling should be taken into proper account in decision making. If sampling uncertainty is taken into account, so that the combined uncertainty is larger (B, B′), the same analytical result would give rise to a different decision.

Fig. 12.2.1. Measurement results (points) and expanded uncertainties (vertical bars) in relation to a decision limit: A, A′ when only analytical uncertainty is considered; B, B′ when combined uncertainty (analytical plus sampling) is considered.
To create a conceptual framework for sampling uncertainty, we must consider ideas like precision and bias applied to sampling, and carry out operations such as the validation of methods and quality control. These terms are familiar when applied to analytical methods but in sampling there are differences in the way that they can be tackled. This is because sampling uncertainty is partly the outcome of the heterogeneity of the material under test as well as the process of collecting it. Moreover, successive targets of the same type of material can vary in the degree of heterogeneity, so that the value of u_s could vary from target to target. A working value of u_s will have to be a robust average that is typical of the material as a whole. This potential variation in the degree of heterogeneity implies that quality control is especially important in sampling. An analytical result on a sample from a target with an anomalously high value of u_s could be unfit for purpose, even though the validated sampling protocol was scrupulously followed.
Notes and further reading
• At the time of writing, the subject of sampling uncertainty is immature: to
date very little practical experience has accrued in using tools for dealing
with uncertainty from sampling, and even less has been documented.
• ‘What is uncertainty from sampling, and why is it important?’
(July 2008). AMC Technical Brief No. 16A. Free download via
www.rsc.org/amc.
• Ramsey, M. H. and Thompson, M. (2007). Sampling Uncertainty in the
Context of Fitness for Purpose, Accred. Qual. Assur., 12, pp. 503–513.
• Ramsey, M. H. and Ellison, S. L. R. (eds). (2007). Measurement
Uncertainty Arising from Sampling — a Guide to Methods and
Approaches. The Guide is the joint production, under the Chairmanship
of Prof. M. H. Ramsey, of Eurachem, CITAC, Eurolab, Nordtest and
the Analytical Methods Committee. It contains chapters on fundamental
concepts, estimation of sampling uncertainty and management issues. Six
practical examples are examined in detail. Free download available from
the Eurachem website www.eurachem.org/guides.
12.3 Random and Systematic Sampling
Key points
— In random sampling all parts of the target have an equal chance of
being selected.
— Random sampling may be diﬃcult or impossible to carry out.
— Systematic sampling is often used instead of random sampling.
— It is not possible to make valid inferences about the target unless the
whole of it is accessible for sampling.
By definition, an unbiased sampling procedure, under replication, should provide samples with an expectation (a long-term average) of composition equal to that of the target. Individual samples will vary, but the mean of a large number of samples will approach the target composition. This outcome can be ensured only if the sample is a random sample. This implies that every part of the target must have an equal chance of being incorporated in the sample. Random sampling, however, is often impossible, too costly or too time consuming to carry out, and some kind of systematic sampling is used in its stead. The difference between random and systematic sampling is shown in Figs. 12.3.1–12.3.3, using as an example the collection of a composite sample of 20 increments of soil taken from a field. ‘Stratified random’ sampling is a compromise between the two forms, in which the target is divided into equal parts (‘strata’), and each part sampled
Fig. 12.3.1. Example of random sampling of soil: increments (points) taken from a field. A substantial hotspot (shaded ellipse) could be overlooked in this instance, but a second random sampling would be likely to detect it.

Fig. 12.3.2. Example of systematic sampling of soil from a field: increments (points) taken at the intersections of a rectangular grid. A substantial hotspot (shaded ellipse) could be systematically overlooked.
Fig. 12.3.3. Example of stratiﬁed random sampling of soil from a ﬁeld: increments
(points) taken at random positions within artiﬁcial strata.
randomly. The strata could be real, as when a shipment of coal arrives in a
number of railway trucks, or notional, as when a ﬁeld is divided into equal
rectangles (Fig. 12.3.3).
Systematic samples are often as good as random samples in practice,
although both are capable of missing an important ‘hotspot’ (a region of
anomalously high concentration of the analyte) in a target. Only increasing
the number of increments taken (and therefore the cost of the sampling
operation) could reduce this possibility. Sampling procedures that cannot
access all parts of the target, however, cannot possibly produce a random
sample. Inferences from such samples should therefore be treated with
caution, as they may not be based on sound statistical principles.
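The three schemes of Figs. 12.3.1–12.3.3 can be sketched for a rectangular field; the function names are ours, and a real protocol would also specify increment mass, depth and so on:

```python
import random

def random_increments(n, width, height, rng):
    """Simple random sampling: n increments anywhere in the field."""
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]

def systematic_increments(nx, ny, width, height):
    """Systematic sampling: increments at the nodes of a regular grid
    (offset by half a cell so no point lies on the field boundary)."""
    return [((i + 0.5) * width / nx, (j + 0.5) * height / ny)
            for i in range(nx) for j in range(ny)]

def stratified_random_increments(nx, ny, width, height, rng):
    """Stratified random sampling: one random increment in each of
    nx * ny equal rectangular strata."""
    return [((i + rng.random()) * width / nx, (j + rng.random()) * height / ny)
            for i in range(nx) for j in range(ny)]

rng = random.Random(42)
pts = stratified_random_increments(5, 4, 100.0, 80.0, rng)
print(len(pts))  # 20 increments, one per stratum
```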
Notes and further reading
• Ramsey, M. H. and Thompson, M. (2007). Sampling Uncertainty in the
Context of Fitness for Purpose, Accred. Qual. Assur., 12, pp. 503–513.
12.4 Random Replication of Sampling
Key points
— Sampling protocols do not provide instructions for collecting randomly
replicated samples.
— Ideas for several regular sampling scenarios are presented.
Estimating uncertainty from sampling involves the replication of the established sampling procedure. To encompass the potential variation fully, the replication has to be done in a randomised way. Sampling protocols, however, do not provide instructions for how that can be carried out. That may tax the ingenuity of the sampler in some instances, although certain ideas are broadly applicable. If, for example, the protocol demands that increments for a sample be collected at random positions within the target, the increments for the second sample should be collected at new random positions. This is illustrated for stratified random sampling in Fig. 12.4.1. If the target is usually sampled by coning and quartering, it should be re-coned after the first sample is taken and then quartered again. In sampling soil or crops from a field, a common practice is to walk the field in a W-shaped path (Fig. 12.4.2) and collect increments at roughly equally spaced intervals along each leg. To duplicate this adequately, the sampler
Fig. 12.4.1. Schematic method for duplication of stratiﬁed random sampling of soil in
a ﬁeld. Solid circles show the positions of increments for the ﬁrst sample, while open
circles show the positions for the increments for the second.
Fig. 12.4.2. Duplicate sampling from a ﬁeld. Increments (solid circles) are taken at
roughly equal intervals along each leg of a ‘W’shaped path.
Fig. 12.4.3. Schematic method for duplication of random sampling from a conveyor
belt. Increments are taken according to two independent lists of random numbers or
times.
could walk a second ‘W’ in a diﬀerent orientation. If the test material is
presented in a systematic way, on a conveyor belt for example, two diﬀerent
random selections can be made (Fig. 12.4.3).
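For the stratified random case of Fig. 12.4.1, duplication simply means drawing a fresh random position in every stratum for the second sample. A sketch under that reading (names ours):

```python
import random

def duplicate_stratified_samples(nx, ny, width, height, rng):
    """Two independent stratified-random increment sets (cf. Fig. 12.4.1):
    each stratum contributes one increment to each duplicate sample, at
    freshly drawn random positions."""
    first, second = [], []
    for i in range(nx):
        for j in range(ny):
            for sample in (first, second):
                sample.append(((i + rng.random()) * width / nx,
                               (j + rng.random()) * height / ny))
    return first, second

a, b = duplicate_stratified_samples(5, 4, 100.0, 80.0, random.Random(7))
print(len(a), len(b))  # 20 20
```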
Notes and further reading
• Ramsey, M. H. and Thompson, M. (2007). Sampling Uncertainty in the
Context of Fitness for Purpose, Accred. Qual. Assur., 12, pp. 503–513.
12.5 Sampling Bias
Key points
— The existence of sampling bias is denied by some scientists.
— It is diﬃcult to address sampling bias experimentally.
— A useful method for comparing sampling methods is the ‘paired
sample’ approach.
Sampling bias is a contentious subject. Some authorities claim that it does not exist: if the sample is taken ‘correctly’ (that is, according to the agreed protocol) there is no bias by definition. This is equivalent to saying that the sampling method is akin to an empirical analytical method, where the method defines the measurand. This is a comforting point of view, as the sampler need not worry about bias. Those holding this view are encouraged by the fact that it is difficult to address sampling bias adequately in practice. However, it is easy to see how bias could arise, for instance if the sample was contaminated by the sampling tools or if the sampler misinterpreted the protocol. Bias in sampling could be general (method bias) or specific (sampler bias).
We can consider analogues of methods used to study bias in analytical
methods as potential tools for handling sampling bias.
• A certified sampling reference target (analogue of a certified reference material) could in principle address all sources of sampling bias, but would be extremely costly to create, difficult to maintain and could not be distributed to users. Very few examples have been reported.
• Intersampler studies with a single sampling protocol (analogue of a collaborative trial) could address between-sampler variation. In these trials, the collection of biases of individual samplers is regarded as a random factor. A small number of such studies have been carried out on an experimental basis, and in some a significant difference between samplers has been found. The ‘reproducibility sampling variance’ could be used as an extra term in the combined uncertainty. These trials are costly to organise, as all of the samplers have to travel to a number of targets.
• The ‘paired samples’ approach (analogue of the paired methods approach [§9.10]) is carried out by sampling a (preferably large) series of typical targets by two methods, the method under scrutiny and by an established reference method. All of the samples are then analysed by the same method, so that any analytical bias is cancelled out. The ‘paired samples’ method is simple to carry out. As a single sampler would normally be involved, any bias detected will reflect the method bias plus an unknown term from the personal bias (if any) of the sampler. The bias between the methods can be characterised statistically by the methods used for comparing two analytical methods (§5.12).
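Statistically, the ‘paired samples’ comparison reduces to a paired t-test on the per-target differences; a minimal sketch, in which the function name and the illustrative results are ours:

```python
import math
import statistics

def paired_t(results_new, results_ref):
    """Paired comparison of two sampling methods applied to the same
    targets, all samples analysed by one method so analytical bias
    cancels. Returns the mean difference, its t statistic, and the
    degrees of freedom; compare t against Student's t with n - 1 df."""
    diffs = [a - b for a, b in zip(results_new, results_ref)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return d_bar, d_bar / se, n - 1

# Invented results (mg/kg) for five targets sampled by both methods:
d_bar, t_stat, df = paired_t([10.2, 11.5, 9.8, 12.1, 10.6],
                             [10.0, 11.0, 10.0, 11.5, 10.1])
print(round(d_bar, 2), round(t_stat, 2), df)  # 0.32 2.19 4
```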
Notes and further reading
• If sampling bias is ignored, precision alone determines uncertainty and
random replication is suﬃcient to quantify it. Consequently, standard
uncertainty and standard deviation are treated almost as identical in what
follows.
• Ramsey, M. H. and Thompson, M. (2007). Sampling Uncertainty in the
Context of Fitness for Purpose, Accred. Qual. Assur., 12, pp. 503–513.
12.6 Sampling Precision
Key points
— Replicated sampling and analysis followed by ANOVA is required for estimating sampling standard deviation.
— A multiple-target nested design and hierarchical ANOVA is needed for representative results.

Sampling precision can be quantified as a standard deviation σ_s estimated by a replicated experiment. If the act of sampling is replicated in a randomised way, the variation in the composition of the samples obtained is a measure of the precision. However, we have to estimate the composition by analysis and that introduces analytical variation characterised as σ_a. To separate the two sources of variation we have to replicate the measurement as well and use analysis of variance (§4.7). A simple balanced design for this experiment is shown in Fig. 12.6.1, with multiple samples (n ≥ 8) taken by the same procedure (but randomised, see §12.4) from a single target and with duplicate analysis of each sample. However, this
Fig. 12.6.1. Simple balanced design for estimating sampling standard deviation.
Fig. 12.6.2. Nested (multipletarget) balanced design for estimating sampling standard
deviation.
simple design is based on the assumption that the particular target under study is typical of all targets in the class of material. It disregards the possibility that targets may vary in their degree of heterogeneity and therefore in the value of σ_s.
A greatly preferable estimate, characterising a whole class of material, may be obtained by taking duplicate samples from a succession of different targets of the same type (Fig. 12.6.2). This procedure is also straightforward and involves no extra work, although a greater time span may be required to accumulate the results. Moreover, the procedure points directly to a method for the quality control of sampling (§12.8) in a natural way. The results are treated by hierarchical analysis of variance to obtain the sampling standard deviation. Estimates of the between-target variation and the analytical variation are also obtained.

As an example we can consider the sampling of animal feedstuff and its analysis for aluminium. Twelve successive targets were sampled in duplicate and each sample analysed in duplicate. The results are as shown in the following table and are illustrated in Fig. 12.6.3. In the table
Fig. 12.6.3. Results from a nested duplication exercise to validate the sampling protocol
for aluminium in animal feed. There are no indications of (a) anomalous targets or (b)
discrepant analytical duplication. Target 9 gave rise to possibly discrepant samples.
the column heading ‘s1a1’ refers to the ﬁrst result on the ﬁrst sample,
and so on.
Target s1a1 s1a2 s2a1 s2a2
1 128 114 124 101
2 109 98 120 110
3 113 121 110 106
4 76 65 86 74
5 96 88 121 122
6 95 104 84 84
7 111 110 115 110
8 124 114 122 115
9 59 72 113 122
10 87 95 97 101
11 91 109 88 95
12 102 105 91 93
Hierarchical ANOVA gave the following statistics.

Source of variation   Degrees of freedom   Sum of squares   Mean square   F      p
Between sites         11                   7242.25          658.39        1.77   0.170
Between samples       12                   4463.00          371.92        7.67   0.000
Analytical            24                   1164.00          48.50
Total                 47                   12869.25
From the mean squares we can calculate these estimates:
• The analytical standard deviation component, σ̂_a = 7.0 ppm;
• The sampling standard deviation component, σ̂_s = 12.7 ppm;
• The between-target standard deviation component, σ̂_t = 8.5 ppm. (This statistic is of no importance in the present context except to show that there was little variation in the concentration.)
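These components follow from the tabulated data using the standard expected mean squares for this nested design, E[MS_anal] = σ_a², E[MS_samp] = σ_a² + 2σ_s², E[MS_targ] = σ_a² + 2σ_s² + 4σ_t². The script below is our own restatement of that arithmetic, not the package the results were produced with:

```python
# Aluminium in animal feed (ppm): 12 targets x 2 samples x 2 analyses,
# copied from the table above (columns s1a1, s1a2, s2a1, s2a2).
data = [
    [[128, 114], [124, 101]], [[109, 98], [120, 110]],
    [[113, 121], [110, 106]], [[76, 65], [86, 74]],
    [[96, 88], [121, 122]],   [[95, 104], [84, 84]],
    [[111, 110], [115, 110]], [[124, 114], [122, 115]],
    [[59, 72], [113, 122]],   [[87, 95], [97, 101]],
    [[91, 109], [88, 95]],    [[102, 105], [91, 93]],
]

def mean(xs):
    return sum(xs) / len(xs)

t = len(data)                                              # 12 targets
grand = mean([x for tgt in data for smp in tgt for x in smp])
tmeans = [mean([x for smp in tgt for x in smp]) for tgt in data]
# Hierarchical sums of squares: analytical (24 df), between-samples-
# within-targets (12 df), between-targets (11 df).
ss_anal = sum((x - mean(smp)) ** 2 for tgt in data for smp in tgt for x in smp)
ss_samp = 2 * sum((mean(smp) - tm) ** 2
                  for tgt, tm in zip(data, tmeans) for smp in tgt)
ss_targ = 4 * sum((tm - grand) ** 2 for tm in tmeans)

ms_anal, ms_samp, ms_targ = ss_anal / (2 * t), ss_samp / t, ss_targ / (t - 1)
sigma_a = ms_anal ** 0.5                    # analytical component
sigma_s = ((ms_samp - ms_anal) / 2) ** 0.5  # sampling component
sigma_t = ((ms_targ - ms_samp) / 4) ** 0.5  # between-target component
print(round(sigma_a, 1), round(sigma_s, 1), round(sigma_t, 1))  # 7.0 12.7 8.5
```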
Notes and further reading
• The dataset can be found in the file named Aluminium.
• The above estimates are likely to be rather variable from this (in statistical terms) small experiment. (See §12.7.)
• In the example it was reasonable to treat the sampling standard deviation as homoscedastic as there was no indication in the plot of the data (Fig. 12.6.3) of a wide concentration range or a correspondingly wide variation in σ_s. Had such a variation been apparent, some attempt at scaling the data should have been made, for instance by using log-transformation, to render it reasonably close to homoscedastic.
• Ramsey, M. H. and Thompson, M. (2007). Sampling Uncertainty in the Context of Fitness for Purpose, Accred. Qual. Assur., 12, pp. 503–513.
12.7 Precision of the Estimated Value of σ_s

Key points
— The sampling precision estimated from eight duplicated samples will itself be very variable.
— For best outcomes, an analytical method with standard deviation σ_a < σ_s/2 should be used.
The sampler and analytical chemist must be aware that an estimated sampling precision will be uncomfortably variable. A suggested minimum of n = 8 samples is the usual compromise between an acceptable estimate of the sampling precision and the cost of carrying out the experiment, for which 2n analyses would be required. If we assume that both errors (sampling and analytical) are normally distributed we can estimate
Table 12.7.1. Confidence limits (95%) for an estimate σ̂_s of a true sampling standard deviation of unity (i.e., σ_s = 1). The calculations are based on an experiment with eight targets sampled in duplicate, and analysed by methods with various analytical standard deviations σ_a.

Analytical standard deviation σ_a   95% confidence limits on σ̂_s   Proportion of zero estimates
σ_a << σ_s                          0.5–1.5                         0%
σ_a = σ_s/2                         0.35–1.55                       1%
σ_a = σ_s                           0.0–1.7                         8%
σ_a = 2σ_s                          0.0–2.2                         32%
this variability (Table 12.7.1). With eight replicate samples, analysed in duplicate by using an analytical method of high precision (σ_a << σ_s), the relative standard error of the estimated sampling standard deviation σ̂_s will be about 25%. Thus the 95% confidence limits will be about 0.5σ_s and 1.5σ_s. With σ_a > σ_s/2 far worse precisions will be obtained, to the extent that it may be impossible to estimate σ_s. There would be a wide and highly asymmetric distribution of outcomes, with a high proportion of zero results.
The practical rule of thumb is to use if possible an analytical method with a precision σ_a < σ_s/2. Of course, the analyst will not know if this criterion is fulfilled until after the experiment. If σ_s happens to be very small (i.e., the target is close to homogeneous), it may be impossible to estimate its value for lack of a suitable analytical method. In such instances, however, the sampling standard deviation will make a negligible contribution to the combined uncertainty of the measurement and can be safely ignored.
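A small Monte Carlo experiment makes this variability concrete. The sketch below is our own, simulating the nested design of §12.6 with eight targets (duplicate samples, duplicate analyses) and truncating negative variance estimates at zero:

```python
import random
import statistics

def estimate_sigma_s(rng, n_targets=8, sigma_s=1.0, sigma_a=0.1):
    """One simulated nested experiment; returns the sigma_s estimate
    from the mean squares, truncated at zero."""
    ss_samp = ss_anal = 0.0
    for _ in range(n_targets):
        smeans = []
        for _ in range(2):                    # two samples per target
            s = rng.gauss(0.0, sigma_s)       # sampling deviation
            a1 = s + rng.gauss(0.0, sigma_a)  # two analyses per sample
            a2 = s + rng.gauss(0.0, sigma_a)
            m = (a1 + a2) / 2
            smeans.append(m)
            ss_anal += (a1 - m) ** 2 + (a2 - m) ** 2
        tm = (smeans[0] + smeans[1]) / 2
        ss_samp += 2 * ((smeans[0] - tm) ** 2 + (smeans[1] - tm) ** 2)
    ms_anal = ss_anal / (2 * n_targets)
    ms_samp = ss_samp / n_targets
    return max(0.0, (ms_samp - ms_anal) / 2) ** 0.5

rng = random.Random(1)
precise = [estimate_sigma_s(rng, sigma_a=0.1) for _ in range(2000)]
noisy = [estimate_sigma_s(rng, sigma_a=2.0) for _ in range(2000)]
print(statistics.stdev(precise))            # roughly 0.25: ~25% relative SE
print(sum(e == 0.0 for e in noisy) / 2000)  # roughly a third are zero
```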
The precision of the estimated sampling standard deviation improves with the number of duplicated samples. Unfortunately, it improves only slowly: in comparison with an eight-sample experiment, 32 samples would be required to reduce the standard error by half. That would usually be impracticable as a one-off method validation. However, if the sampling method is in routine use, data can be collected over many sampling events and the estimate gradually refined. The procedure would be analogous to establishing limits for a control chart while it is in use, as in §10.4 and 10.5. The results would have to be robustified in some way against the possible incidence of atypically heterogeneous targets.
12.8 Quality Control of Sampling
Key point
— A combined sampling/analytical control chart can be constructed from
the results of duplicate samples.
We have seen (§12.6) that a ‘generally-applicable’ estimated value σ̂_s can be attached to the sampling standard deviation for a typical target in a defined class. However, the sampler may encounter particular targets, apparently within the defined class, for which the ‘general’ σ̂_s is not appropriate. For such particular targets, the sampling precision may be poor, because the sampling has been carried out ineptly or, more probably, because the target is more heterogeneous than is usual for the type of material. Such instances should be detected if possible, because an incorrect assumption about the heterogeneity will tend to invalidate decisions about the target. Even if the sampling is carried out exactly according to a validated protocol, excessive heterogeneity could make the result unfit for purpose. Quality control of sampling can alleviate this situation.

A simple way of conducting sampling QC is to take duplicate samples A and B at random from each target. Each sample is analysed once, and the mean result (x_A + x_B)/2 can be taken as the result for the target. This design is shown in Fig. 12.8.1.
Meanwhile, the difference between the results can be used as an indicator of compliance. The standard deviation of a signed difference d = x_A − x_B for a compliant (in control) outcome would be σ_d = √(2(σ_s² + σ_a²)). This value can be used to define control lines for a Shewhart or other control chart so that a single point falling outside the ±3σ_d limit indicates a system out of control. However, as the order in which the results are obtained is arbitrary, it is preferable to use a one-sided control chart with control lines at zero,
Fig. 12.8.1. Design for routine sampling quality control.
Fig. 12.8.2. Routine internal quality control chart for combined analytical and sampling
variation for the determination of aluminium in animal feed.
2σ_d and 3σ_d, on which the absolute difference |x_A − x_B| should be plotted. The lines will have the same implications as in an ordinary Shewhart chart.

An example of such a chart is shown in Fig. 12.8.2. The control lines were set according to σ_d = √(2(σ̂_s² + σ̂_a²)) = √(2(12.7² + 7.0²)) = 20.5, using values for σ̂_s, σ̂_a established previously (§12.6). Two out-of-control conditions were detected, at target 21 with a difference outside the action limit, and at target 27 because two successive targets gave differences above the warning limit. It is not clear whether these excursions resulted from an analytical problem, a sampling problem or a combination of the two. A more elaborate design with duplicate analyses of both samples (as in Fig. 12.6.2) would enable this ambiguity to be clarified, but would obviously cost more to execute as a routine practice.
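The control lines and the Shewhart-style rules used here are easy to automate; the helper names below are ours, and the values 12.7 and 7.0 are the §12.6 estimates:

```python
def sampling_qc_limits(sigma_s, sigma_a):
    """Warning (2 sigma_d) and action (3 sigma_d) lines for a one-sided
    chart of |x_A - x_B|, where sigma_d = sqrt(2 (sigma_s**2 + sigma_a**2))."""
    sigma_d = (2.0 * (sigma_s ** 2 + sigma_a ** 2)) ** 0.5
    return 2.0 * sigma_d, 3.0 * sigma_d

def out_of_control(abs_diffs, warning, action):
    """Indices flagged: a single point above the action line, or two
    successive points above the warning line."""
    flagged = []
    for i, d in enumerate(abs_diffs):
        if d > action or (i > 0 and d > warning and abs_diffs[i - 1] > warning):
            flagged.append(i)
    return flagged

warning, action = sampling_qc_limits(12.7, 7.0)
print(round(warning, 1), round(action, 1))  # 41.0 61.5
# Invented |d| series: index 2 breaches the action line, index 5 is the
# second of two successive points above the warning line.
print(out_of_control([5.0, 30.0, 65.0, 20.0, 42.0, 45.0], warning, action))  # [2, 5]
```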
Notes and further reading
• The dataset can be found in the ﬁle named Alsamiqc.
• If the concentration of the analyte varies substantially in successive
targets, it may be preferable to construct a control chart for relative
absolute diﬀerence.
• An alternative approach to sampling IQC is sometimes applicable, the
Split Absolute Diﬀerence (SAD) method, which does not require duplicate
samples. See: Analyst, 2004, 129, 359–363.
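For the variable-concentration case mentioned in the second bullet, a minimal sketch of the plotted statistic (our own illustration, not a procedure spelled out in the text): scale the absolute difference by the pair mean, so that targets with very different analyte levels can share one chart, with control lines then expressed in relative terms.

```python
def relative_abs_difference(xa, xb):
    # |xA - xB| divided by the pair mean (xA + xB)/2, giving a
    # dimensionless statistic comparable across concentration levels.
    return abs(xa - xb) / ((xa + xb) / 2)

# Duplicate results from two targets at quite different levels give the
# same relative difference:
print(relative_abs_difference(110.0, 90.0))    # 0.2
print(relative_abs_difference(1100.0, 900.0))  # 0.2
```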
Index
F, 48, 86
F-distribution, 47
F-test, 31
R², 86, 113
239Pu, 119
p-value, 15, 22, 26, 27
t, see Student's t
t-test, 35
absolute diﬀerence map, 194
accreditation agencies, 157
acid decomposition of a soil sample,
202
action limits, 127, 216
adjusted degrees of freedom, 30
adjusting the results, 41
aﬂatoxin data, 133
AGRGIQC, 201, 203
Alsamiqc, 237
alternative hypothesis, 12, 147
Alumina, 27
Aluminium, 232, 234
aluminium in animal feed, 237
Analysis of Variance, see ANOVA
analyte, 155
analytical response, 175
analytical run, 189, 195
analytical standard deviation, 59
anomalous concentrations, 40
ANOVA, 43, 46, 47, 51, 56, 60, 86,
92, 181, 182, 231
ANOVA application, 49, 51, 54, 57
assigned value, 207, 208
assumptions, 21
averaging zscores, 217
bar chart, 208
baseline interference, 96
basic assumption of regression, 186
Bayesian statistics, 10
Beryllium methods, 42
between-group mean square, 45
between-sampler variation, 230
between-site variance, 60
bias, 29, 155, 162, 207
bias between two analytical methods,
94
‘Bonferroni’ problem, 202
bootstrap, 214
bottom-up estimation, 162
bottom-up method, 159
boxplots, 69
Cadmium, 58, 60
calculating a participant consensus,
210
calibration, 76
calibration data, 105
calibration detection limit, 174
calibration function at low
concentrations, 178
calibration functions, 76
calibration uncertainty, 172
capability of detection, 180
catalysts for the Kjeldahl method, 49
‘cause and eﬀect’ method, 159
cement, 27
censored or truncated, 140
central limit theorem, 17, 18
certiﬁed reference materials in quality
assurance, 219
certiﬁed sampling reference target,
230
certiﬁed value, 208
Cochran test, 65, 182
collaborative trial, 54, 124, 180, 184
collaborative trial outliers, 56
collaborative trial — outlier removal,
182
combined sampling/analytical control
chart, 236
combined uncertainty, 221, 223
combining intermediate uncertainties,
160
comparing two analytical methods,
186
comparison of analytical methods, 97
comparison of several means, 43
compliance, 236
conceptual framework for sampling
uncertainty, 225
conditions of measurement, 4, 196
conditions of replication, 170
conﬁdence, 169
conﬁdence interval for a predicted
concentration, 174
conﬁdence intervals, 22
conﬁdence level, 7, 12
conﬁdence limits, 22
confidence limits around x, 102
consensus of all participants, 209
consensus of expert laboratories, 209
consensus of the participants' results, 208
constant bias, 96
contractual agreements, 222
contractual limit, 8
control chart, 127, 217, 219
control limits, 197
control lines, 192
control map, 191, 192
control materials, 195
Copper, 57
correlation, 87
correlation coeﬃcient, 87, 91
correlation matrix, 112
cost and probability of making an
incorrect decision, 165
cost of an analytical result, 165
coverage factor, 18, 19, 155
critical concentration, 179
critical level, 12, 21
critical level of response, 179
critical values, 22
crossed designs, 63
cumulative distribution function, 144
curved trend, 81
Cusum chart, 129
data displays, 6
data with a curved trend, 105
detrended data, 168
decision, 21, 156, 223
decision theory, 165
decisions about the target, 236
deﬁnition of detection limit, 178
degrees of freedom, 22, 28, 29, 78
deletion of outliers, 135
dependent variable, 75, 77
detection limit, 77, 101, 103, 121,
171, 177
discrepant duplicate results, 58
distribution-free statistics, 142
distributions, 15
Dixon’s test, 131
Dixon’s test for outliers, 132
Dogfood dataset A, 45, 49
Dogfood dataset B, 45, 49
downweighting extreme values, 137
Drift, 169
Dumas, 36, 143
Dumas method, 29
eﬀect of an outlier on regression, 83
eﬀect of trends, 168
empirical cumulative distribution
function, 144
end-user of analytical results, 221
end-user of the data, 224
environmental studies, 224
equal variances, 30
error, 156
error propagation theory, 159
estimate run-to-run standard deviation, 197
estimated value of concentration, 104
estimates of central tendency, 210
estimating uncertainty, 227
estimating uncertainty by replication,
162
estimation of the weights, 118
ethanol, 8, 34
Ethene, 36
EURACHEM/CITAC Guide, 156,
165
evaluation interval, 174
evaluation limits, 101, 171
example calculations, 47
example of regression with
transformed variables, 124
example of weighted regression, 119
examples of withinrun quality
control, 192
expanded uncertainty, 155, 157
experience and judgement, 210
exploratory data analysis, 111
external calibration, 171
factors that contribute to uncertainty,
158
‘functional relationship’ ﬁtting, 187
fat in foodstuﬀs, 43
ﬁducial limits, 103
ﬁtness for purpose, 153, 165, 216
fitness-for-purpose control chart, 195
ﬁtted values, 78
ﬁtting a line, 71
ﬁxed eﬀect experiment, 49
formulation, 209
functional relationship ﬁtting, 96
genetically-modified food, 140
graphical display, 6
Grubbs test, 133, 182
guide to the expression of uncertainty
in measurement (GUM), 156, 160,
162
harmonised guidelines for internal
quality control, 196
heterogeneity, 51, 225
heteroscedastic, 119
heteroscedastic data, 116
heteroscedasticity, 82, 171, 184
hierarchical analysis of variance, 232
homogeneity testing, 51
homoscedastic, 234
Horwitz function, 123, 124, 181
hotspot, 227
Huber’s H15 method, 137
identifying suspect values, 137
important magnitude, 40
inaccuracy in their routine results,
206
incorrect assumption about the
heterogeneity, 236
independent datasets, 28
independent predictors, 109
independent variable, 75, 76
inﬂuence of outliers, 168
inﬂuence of outlying results, 135
inﬂuence on classical statistics, 130
information from proﬁciency test
scores, 216
instrumental drift, 173
inter-sampler studies, 230
intercept, 72
interference eﬀects, 175
interim control chart, 197, 199
interlaboratory comparisons, 205
interlaboratory method performance
study, 54
interlaboratory study, 180
internal quality control, 195, 198, 218,
219
International Vocabulary of
Metrology, 156
intervals, 169
inverse conﬁdence limits, 103
ISO Guide 43, 206
J-chart, 129
kernel density estimation, 211
Kjeldahl, 36, 143
Kjeldahl catalysts, 51, 66
Kjeldahl method, 29, 49
Kolmogorov–Smirnov one-sample test, 144
laboratory bias, 187
lack of ﬁt, 78, 90, 92, 108, 114
lead, 116
lead in garden soils, 111
lead in playground dust, 61
least-squares, 71, 105
legal requirements, 222
leverage points, 82
limits, 169
linear regression, 71
linear regression in paired
experiments, 96
linearity, 171
log-transformed data, 184
lognormal distributions, 139
log-transformation, 139
MAD method, 135
manganese, 77, 78, 85, 86, 91, 108
manganese calibration data, 102
Manganese1, 77, 79, 85
Manganese2, 94, 103
Manganese3, 94, 109
Mann–Whitney test, 142
mass fraction, 165
materials, 206
matrix matching, 175
matrix mismatch, 164, 175
matrix-matched CRMs, 209
mean, 210
mean square, 45, 46
meaning, 9
measurand, 155
measurement, 155
measurement uncertainty, 155
measurement variation, 3
median, 135, 210
Median Absolute Diﬀerence (MAD),
135
mercury in ﬁsh, 25
method of least squares, 73, 76
method validation, 60, 90, 154, 195
metrological traceability, 155
mixture model, 214
mode, 211
model of the measurement procedure,
158
modern approach to sampling, 222
Moisture, 65
multimodal, 210
multiple regression, 111
multiple symbolic control charts,
201
multivariate internal quality control,
201
national reference laboratory, 209
natural, 169
near, 169
nested design, 60, 231
nitrogen, 143
nitrogen analyser, 10, 23
nitrogen analyser data, 20
nitrogen data, 31
nitrogen oxides, 38
nonlinear equations, 123
nonlinear extrapolation, 176
nonlinear regression, 122
nonlinearity, 91
nonparametric statistics, 142
nonparametric test, 144
normal curve, 18
normal distribution, 15, 127
NOx in air, 40
null hypothesis, 9, 20, 147
occasional check of an ongoing
analytical system, 219
one-sample test, 25, 27, 37
one-tailed probability, 9, 18, 20
one-tailed test, 12, 25, 34, 38
one-way ANOVA, 45, 54, 55, 180
orientation surveys, 60
original ethos of proﬁciency testing,
206
out-of-control conditions, 128, 200, 237
outlier, 82, 130, 136, 139
outlier deletion, 181
outliers and heavy tails, 210
outliers in calibration, 80
outlying data points, 167
outlying samples, 58
‘paired results’ method, 186
‘paired sample’ approach, 229, 230
paired data, 25, 36, 38
paired datasets, 42
parameters, 19
patterns in residuals, 80
patulin, 68
performance of an analytical method,
180
planning of experiments, 146
Playground, 63
polymerase chain reaction, 140
polynomial calibration, 108
polynomial regression, 105, 108
pooled standard deviation, 30, 31
pooled variance, 46
population, 15
power calculations, 146
pre-computer statistics, 21
precision, 155
precision as a function of concentration, 184
precision of analytical methods, 167
precision of test methods, 181
precision of the estimated sampling
standard deviation, 235
precision of the measurements, 147
predictor variable, 75, 109
principal components regression, 111
probabilities from data, 19
probability, 7, 9, 20
probability of incorrect decision, 165
problems of interpreting r, 88
proﬁciency test, 52, 123, 215
proﬁciency tests — purpose and
organisation, 205
propagation of uncertainty, 160
proportional bias, 96
protein nitrogen, 29, 32
‘pseudo-z-score’, 219
pure error, 90
pure error mean square, 90
quadratic regression, 108
quality control, 154, 225
quality control data, 198
quality control of sampling, 232, 236
quality — an overview, 153
quantity, 155
random eﬀects, 54
random replication of sampling, 227
random sample, 15, 40
random sampling, 226
rapid ﬁeld method, 97
raw materials, 224
recognise paired data, 40
recommended terminology, 222
recommended terminology of
sampling, 221
recovery, 164
reducing the uncertainty, 104
reference material, 23, 51
reference method, 186
reference value, 21
regression, 71, 73, 76, 92, 94, 98
regression analysis of a calibration,
172
regression and ANOVA, 85
regression or a related method, 95
regulatory limit, 25
relationship between precision and
concentration, 181
relative standard deviation, 139, 167,
184
repeatability conditions, 4
repeatability precision, 169, 189
repeatability standard deviation, 56,
192
replicated sampling, 231
replication of the established
sampling procedure, 227
representative sample, 222
reproducibility conditions, 4
reproducibility precision, 169, 180
reproducibility standard deviation,
57, 123, 124, 162
requirement for accreditation, 205
residual plots, 79, 80, 91, 92
residuals, 74, 78, 99, 114, 173
response variable, 75
result, 155
robust analysis of variance, 181
robust estimates, 210
robust mean, 210
robust standard deviation, 215
robust statistics, 135, 137
rolling statistics, 198
rotational bias, 95
rounding, 3, 141
ruggedness test, 66
‘run bias’ eﬀect, 196
run-to-run precision, 169, 195, 196
run-to-run standard deviation, 199
sample t, 22
sample t-value, 20
sampling, 221
sampling and analytical variance, 57
sampling bias, 229
sampling error, 164
sampling of animal feedstuﬀ, 232
sampling precision, 231
sampling standard deviation, 57, 234
sampling uncertainty, 61, 223
scaled diﬀerences, 191
scaled residuals, 78
scatter plots, 112
scoring in proﬁciency tests, 207
scoring systems, 205
second order ‘quadratic’ curve, 106
Selenium, 192, 195
setting up a control chart, 197
shape of the calibration function, 172
Shewhart charts, 127, 217
SI, 155
SI units, 163
signed diﬀerences, 168
signiﬁcant intercept, 171
Silica, 54
simple linear regression, 72
skewed, 210
sodium data, 23
soil, 58
standard additions, 175
standard deviation, 16, 167
standard deviation for proﬁciency,
207
standard error of the intercept, 121
standard error of the mean, 17
standard error of the mode, 211
standard error of the robust mean,
210
standard errors, 84
standard uncertainty, 155
standardisation of hydrochloric acid,
161
statistical computer software, 25
statistical control, 127, 195
statistical inference, 12, 15, 20
statistical power, 146
statistical tests for outliers, 131
‘stratiﬁed random’ sampling, 226
Student’s t, 19, 22
suﬃcient linearity in calibration, 89
summary statistics, 6
Suspect, 131, 133
suspect results, 130
suspect value, 130, 133
symbolic control charts, 217
Système International d'Unités, see SI
systematic sampling, 226
tabulated values, 21
target, 221, 225
target value, 207, 215
test for lack of ﬁt, 89, 90, 121, 173
test for linearity, 90
test of signiﬁcance, 23, 25, 27
testing for normality, 146
testing for speciﬁc distributions, 144
top-down estimate of uncertainty, 162
traceability, 163
traditional and modern approaches to
sampling uncertainty, 221
traditional approach, 222
transfer of the SI unit, 159
transforming to logarithms, 124
translational bias, 95
trend, 27
true value, 21
true value of the measurand, 207
two independent variables, 110
two-sample, 32, 34
two-sample t-test, 142
two-sample test, 25, 28, 31
two-tailed probabilities, 20
two-tailed test, 10, 12, 27, 30, 32, 36, 37
two-way ANOVA, 63
unacknowledged calibration
curvature, 171
uncertainty, 153, 154, 156, 207
uncertainty budget, 221
uncertainty evaluation strategies, 160,
163
uncertainty from sampling, 221, 225
uncertainty information in
compliance assessment, 157
unimodal, 210
Uranium, 97, 99
using resources optimally, 222
valid decision, 156
variance, 16, 65
variance due to the regression, 86
variance of the residuals, 85
variance ratio test, 31
variances of the regression
coeﬃcients, 83
variation in objects, 5
VIM, see International Vocabulary of
Metrology, 156
visual appraisal, 6
warning limits, 127
weighing of the test portion, 195
weighted regression, 82, 116, 119, 171
Wheat ﬂour, 34
wheat ﬂour data, 143
Wheat types, 38
Winsorisation, 137
within-group mean square, 45
Youden design, 67
z-score, 207, 216
zinc, 172, 198
Zone chart, 129
chemistry that does not have statistics as both its conceptual foundation and its everyday tool. Measurement results necessarily have uncertainty, and statistics shows us how to make valid inferences in the face of this uncertainty.

But, over the years, it has become apparent to us that statistics is much more interesting when it makes full use of the computer revolution. It would be hard to overstate the effect that easily-available computing has had on the practice of statistics. It is now possible to undertake, often in milliseconds and with perfect accuracy, calculations that previously would have been impracticably long-winded and error-prone. A simple example is the calculation of the probabilities associated with density functions. Moreover, we can now produce in seconds several accurate graphical images of our datasets and select the most informative. These capabilities have transformed the applicability of both standard statistical methods and more recent computer-intensive methods. Textbooks for the most part have not caught up with this revolution and are still stressing pencil-and-paper methods of calculation. Of course, a small number of pencil-and-paper examples of some elementary calculations can assist learning, but they are too prone to mistakes for 'real-life' application.

Many textbooks place a heavy stress on the mathematical basis of statistics, to an unnecessary degree. We regard this as inappropriate in an applied text. Analytical chemists do not need too many details of statistical theory, so we have kept these to a minimum. Drivers don't need to know exactly how every part of a car works in order to drive competently.

With ease of computation, there is, of course, a concomitant danger that people are tempted to use one of the many excellent computer statistics packages (or perhaps one of the not-so-excellent ones) without understanding what the output means or whether an appropriate method has been used. Analytical chemists have to guard against that serious shortcoming by exercising a proper scientific attitude. Drivers don't need too many details of how the car works, but they do need the Highway Code, a road map, some driving lessons and as much practice as they can get. There are several ways of developing that faculty in relation to statistics. A key practice is the habitual careful consideration of the data before any statistics is undertaken. A visual appraisal of a graphical image is of paramount importance here, and the book is profusely illustrated with them: almost every dataset discussed or analysed is depicted. Another essential is developing an understanding of the exact meaning of the results of statistical operations, especially the p-value associated with a test of significance, and we stress that meaning repeatedly. Finally, practitioners need the experience of both guided and unsupervised consideration of many examples of relevant datasets.

Statistics is a huge subject, and a problem with writing a book such as this is knowing where to stop. We have concentrated on providing a selection of techniques and applications that will satisfy the needs of most analytical chemists most of the time, and we make no apology for omitting any mention of the numerous other interesting methods and applications.

The book is divided into quite short sections, each dealing with a single topic, headed by a 'Key points' box. The sections are as far as possible self-contained, but are extensively cross-referenced. The book can therefore be used either in a systematic way by reading the sections sequentially, or as a quick reference by going directly to the topic of interest. Most sections are terminated by 'Notes and further reading' with useful references for those wishing to pursue topics in more detail.

Every statistical method and application covered has at least one example where the results are analysed in detail. This enables readers to emulate this analysis on their own examples. All of the datasets used in examples are available for download. The statistical results on these examples have been cross-checked by at least two different statistics packages, so that readers can compare the output of their own favourite statistical package with that shown in the book and thus verify that they are entering data correctly.

The book is cast in two parts, a technique-based approach followed by an application-based approach. This engenders a certain amount of overlap and duplication. Analytical chemists are thereby encouraged to create their own overview of the subject and see for themselves the relationship between tasks and techniques. Statisticians may be surprised at the relative emphasis placed on different topics. We have simply used heavier weighting on the topics that experience has told us that analytical chemists have most difficulty with.

Statisticians will also notice that we use the 'p-value' approach to significance testing. This was adopted after careful consideration of the needs of analytical chemists. The alternative approaches tend to cause more difficulty. Experience has shown that analytical chemists find the somewhat convoluted logic of using statistical tables confusing and hard to remember. The confidence interval approach is simple but almost universally misunderstood among non-statisticians. Both of these methods also have the disadvantage of engendering the idea that the significance test can validly dichotomise reality — once you have set a level of confidence the test tells you 'yes' or 'no'. This tempts practitioners to use statistics to replace judgement rather than to assist it. The p-value approach greatly improves the transparency of significance testing so long as the exact meaning of the p-value is borne in mind.

Finally, here are ten basic rules for analytical chemists undertaking a statistical analysis.

1. If you can, plan the experiment or data collection before you start the practical work. Make sure that it will have sufficient statistical power for your needs.
2. Ensure that the data collection is randomised appropriately, so that any inferences drawn will be valid.
3. Make sure that you know how to enter the data correctly into your statistical software. After you have entered it, print it out for checking.
4. Examine the data as one or more graphical displays. This will often tell you all that you need to know. In addition it will tell you if your dataset is unlikely to conform to the statistical model that underlies the statistical test that you are proposing to use. Important features to look out for are: suspect data, lack of fit to linear calibrations, and uneven variance in regression and analysis of variance.
5. Select the correct statistical test, e.g., one sample, two sample or paired, one-tailed or two-tailed. If in doubt, ask a statistician.
6. Make sure that you know exactly what the statistical output means.
7. Be careful how you express the outcome in words. Avoid attributing probabilities to hypotheses (unless you are making a Bayesian analysis — not within the scope of this book).
8. After a regression always make plots of the residuals against the predictor variable. This will give you valuable information about lack of fit and uneven variance. It is sometimes useful to make other plots of the residuals, e.g., as a time series to detect drift in the measurement process.
9. Report the magnitude of an effect as well as its significance.
10. Distinguish between effects that are statistically significant and those that are practically important.

Have fun!

Michael Thompson & Philip J Lowthian
Birkbeck University of London, UK
May 2010

* * *

Data files used in the book can be downloaded from http://www.icpress.co.uk/chemistry/p739.html
8 1. . . . . . . Levels of Probability . . . . . . . . . . . . . . Simple Tests of Signiﬁcance 3. . . . Another Example — Accuracy of a Nitrogen Analyser — a TwoTailed Test . . . . An Example — Ethanol in Blood — a OneTailed What Exactly Does the Probability Mean? . . 2. .1 2. . .3 1. . . .2 2. . . . . . . . . . . . . . . . . . . . . . . . . 1 3 3 4 5 6 6 7 8 9 10 12 12 15 .5 2.Contents Preface v Part 1: Statistics 1. . . . . . . . . . . . . . . .11 Measurement Variation . . . . . . . . . .1 1. . . . . . . . . . . . . Thinking about Probabilities and Distributions 2. . . . . . . . . . . . . . . . . . . . OneSample Test — Example 2: Alumina (Al2 O3 ) in Cement . . . . . Statements about Statistical Inferences . . .3 2. . . . . Data Displays . . . .4 2. . . . . . . . . . . . . . . . . . Statistics . . . . . . . . Variation in Objects . . . . Conﬁdence Intervals . . . 15 17 19 20 21 22 25 25 27 . . . . . . . . . . . . .10 1. . . . Precomputer Statistics . . . . Conditions of Measurement and the Dispersion (Spread) of Results . . . . . . . . . . . .4 1. . . .6 The Properties of the Normal Curve . . . . Test . . 3. . . . . . . . . . .2 1. . . . . . . . . . . . . . . . . . . . . . . Preliminaries 1. . . . . . Probability and Statistical Inference . . . . . . . . . . .9 1. . . . . . . . . . . . . . . . . . . .7 1. . . . Probabilities Relating to Means of n Results Probabilities from Data . . . Null Hypotheses and Alternative Hypotheses .6 1.5 1. .1 3. . . . . ix . . . . . .2 OneSample Test — Example 1: Mercury in Fish . . . .
. . . . . . . . .2 5. . . . .4 3. . .10 4. . . . Cochran Test . . . . . . .4 5. . .9 Regression . . . . . . . . . . . . . . . The Use of Residuals . . . . The Calculations of OneWay ANOVA . . . . . . . . . . . . . . . . . ANOVA Application 3 — The Collaborative Trial . . . . . . . . . . . . .5 4. . .7 5.6 3. . . . . .7 4. Correlation . . . . . . . . . . . Potential Problems with Paired Data . . . .6 4. . .8 4. . . . . . . . . . . . . . . . . . . . . . . . . .9 4. . . . . Variances of the Regression Coeﬃcients: Testing the Intercept and Slope for Signiﬁcance . . . . . . . . . . . . . . . . . . . . . . . . .1 5. . . . . . .11 Introduction — The Comparison of Several Means . . . . . . . . .10 Comparing Two Independent Datasets — Method Comparing Means of Two Datasets with Equal Variances . . . . . . . Example Calculations with OneWay ANOVA . . . . . . . . . Paired Data . . . Regression and ANOVA . . . . . . . . . . . . . . .8 3. .8 5. . . . Analysis of Variance (ANOVA) and Its Applications 4. .5 3. . . . . . . . . . . .3 5. . . . . . . . . . . More Elaborate ANOVA — Nested Designs . . . . . . . . . . . .9 3. . . . . . . . ANOVA Application 4 — Sampling and Analytical Variance . . . . . . . Regression and Calibration 5. . . . . . Applications of ANOVA: Example 2 — Homogeneity Testing . . TwoWay ANOVA — Crossed Designs . . . . . . . . . . . . . . . . . . . Ruggedness Tests . . . . . . . . . . . . . . . . . Calibration: Example 1 . TwoSample TwoTailed Test — An Example . . . . . . . . . . . .1 4. . . . .6 5. . . . . . . . . . . . . .7 3. . . . . . . . . . . . . . . . . Paired Data — OneTailed Test . . . . . . . . . . . . . . . . . . . . . . . .x Notes on Statistics and Data Quality for Analytical Chemists 3. TwoSample OneTailed Test — An Example . . . . . Applications of ANOVA: Example 1 — Catalysts for the Kjeldahl Method . . . How Regression Works . . .4 4. . . . . . . . . . . . . . 71 73 76 78 80 82 83 85 87 . . . . 
Eﬀect of Outliers and Leverage Points on Regression . . . . 43 45 47 49 51 54 57 60 63 65 66 71 5. Suspect Patterns in Residuals . . . .3 3. . . . . . 28 30 31 32 34 36 38 40 43 4. . . . . . . .5 5. . . . .2 4. . . . The Variance Ratio Test or F Test . .3 4. . . .
.2 7. . . .6 6. . . . . . . . . . . . .10 Evaluation Limits — How Precise is an Estimated xvalue? . . .2 6. . . . . . . . Robust Statistics — Huber’s H15 Method . . . . . . . . . . . . . . . . . Example of Regression with Transformed Variables .11 Control Charts . . . .1 8. . . . . .10 5. . . . . . . . . . . .1 7. . . . . Nonlinear Regression . . . . . . . . . . . . 101 . . A Regression Approach to Bias Between Methods Comparison of Analytical Methods: Example .1 6. . Lognormal Distributions . . . . . . Statistical Power and the Planning of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . Robust Statistics — MAD Method . . . . . .9 6. . . . . . . .8 6. . . . . Polynomial Calibration — Example . . . Reducing the Conﬁdence Interval Around an Estimated Value of Concentration . . . . . .6 7. . . . 144 . . . 127 130 132 133 135 137 139 141 142 7. . . . 124 127 . . . .7 7. . . . . . . . . . . . . . . . . . 146 Part 2: Data Quality in Analytical Measurement 8. . .11 5. . . . . .4 6. . . . 153 Uncertainty . . . . . . . . 104 105 108 109 111 116 . . .3 6. . . Multiple Regression — An Environmental Example Weighted Regression . . . . . . . . . . . The Grubbs Test . . . . . . . . . . 119 . . . . . . . 122 . . . . . . .3 7. . .7 6. . . Additional Statistical Topics 7. . . . . . . . . . . . . . . . . . . . . . . . 154 . . . . . Suspect Results and Outliers .8 7. . . . . .13 A StatisticallySound Test for Lack of Fit . . . . . . . . . . . . . . Quality in Chemical Measurement 8. .2 151 153 Quality — An Overview . . . . 90 91 94 97 101 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . Multiple Regression . . . . . . Regression — More Complex Aspects 6. . . . .5 7. .Contents xi 5. . .9 7. . . . . . . . . . Testing for Speciﬁc Distributions — the Kolmogorov–Smirnov OneSample Test . . . . Example Data/Calibration for Manganese . Nonparametric Statistics .5 6. . . Rounding . . . . . . . . . . . . . .4 7. . . . . Polynomial Regression . . . . . . . . . .10 7. . . . 
. . . Example of Weighted Regression — Calibration for 239 Pu by ICP–MS . Dixon’s Test for Outliers . . . . . .12 5. . . . . . . . . . . . . . . . . . . .
Contents (continued)

8.1 Precision of Analytical Methods
8.2 Experimental Conditions for Observing Precision
8.3 External Calibration
8.4 Calibration by Standard Additions
8.5 Comparing Two Analytical Methods by Using Paired Results
8.6 Example — A Complete Regression Analysis of a Calibration
8.7 Detection Limits

9.1 Why Uncertainty is Important
9.2 The Propagation of Uncertainty
9.3 Estimating Uncertainty by Modelling the Analytical System
9.4 Estimating Uncertainty by Replication
9.5 Traceability
9.6 Statistical Methods Involved in Validation
9.7 Fitness for Purpose
9.8 Collaborative Trials — Overview
9.9 The Collaborative Trial — Outlier Removal
9.10 Collaborative Trials — Summarising the Results as a Function of Concentration

10. Internal Quality Control
10.1 Repeatability Precision and the Analytical Run
10.2 Examples of Within-Run Quality Control
10.3 Internal Quality Control (IQC) and Run-to-Run Precision
10.4 Setting Up a Control Chart
10.5 Internal Quality Control — Example
10.6 Multivariate Internal Quality Control

11. Proficiency Testing
11.1 Proficiency Tests — Purpose and Organisation
11.2 Scoring in Proficiency Tests
11.3 Setting the Value of the Assigned Value xA in Proficiency Tests
11.4 Setting the Value of the 'Target Value' σp in Proficiency Tests
11.5 Calculating a Participant Consensus
11.6 Using Information from Proficiency Test Scores
11.7 Occasional Use of Certified Reference Materials in Quality Assurance

12. Sampling in Chemical Measurement
12.1 Traditional and Modern Approaches to Sampling Uncertainty
12.2 Random and Systematic Sampling
12.3 Random Replication of Sampling
12.4 Sampling Bias
12.5 Sampling Precision
12.6 Precision of the Estimated Value of σs
12.7 Quality Control of Sampling
12.8 Sampling Uncertainty in Context

Index
PART 1
Statistics
Chapter 1
Preliminaries
This chapter sets the scene for statistical thinking. It covers variation, both in measurement results and in the properties of objects, and its graphical representation. It then introduces the basis of statistical inference from analytical data: the probability of obtaining the observed results under the assumption of appropriate hypotheses.
1.1 Measurement Variation
Key points
— Variation is inherent in results of measurements.
— We must avoid excessive rounding to draw valid conclusions about the magnitude of variation.
The results of replicated measurements vary. If we measure the same thing repeatedly, we get a diﬀerent result each time. For example, if we measured the proportion of sodium in a ﬁnely powdered rock, we might get results such as 2.335, 2.281, 2.327, 2.308, 2.311, 2.264, 2.299, 2.295 per cent mass fraction (%). This variation is not the outcome of carelessness, but simply caused by the uncontrolled variation in the activities that comprise the measurement, which is often a complex multistage procedure in chemical measurement. Sometimes it may appear that results of repeated measurements are identical, but this is always a false impression, brought about by a limited digit resolution available in the instruments used to make the measurement or by excessive rounding by the person recording or reporting the data. If the above data are rounded to two signiﬁcant ﬁgures, they all turn into an identical 2.3%, which tells us nothing about the magnitude of
the variation. Excessive rounding of data must be avoided if we want to draw valid inferences that depend on the variation (see §7.8 for guidance on rounding).

1.2 Conditions of Measurement and the Dispersion (Spread) of Results
Key points
— The dispersion (spread) of results varies with the conditions of measurement.
— The shape of the distribution of replicated results is characteristic, with a single peak tailing away to zero on either side.
— We must remember the difference between repeatability and reproducibility conditions.
The scale of variation in results depends on the conditions under which the measurements are repeated. If the analysis of the rock powder were repeated many times by the same analyst, with the same equipment and reagents, in the same laboratory, within a short period of time (that is, with the conditions kept constant as far as possible), we might see the results represented in Fig. 1.2.1. These results were obtained under what are called repeatability conditions. If, in contrast, each measurement on the same rock powder is made by the same method in a different laboratory (obviously by a different analyst, with different equipment and reagents, and at a different time), we observe a wider dispersion of results (Fig. 1.2.2). These results were obtained under what we call reproducibility conditions.
Fig. 1.2.1. Results from the analysis of a rock powder for sodium under repeatability conditions.
Fig. 1.2.2. Results from the analysis of a rock powder for sodium under reproducibility conditions.
Notice the characteristic shape of these distributions: roughly symmetrical, with a single peak tailing away to zero on either side. There are other conditions of measurement encountered by analytical chemists, but repeatability and reproducibility are the most important.

1.3 Variation in Objects
Key points
— We must distinguish between the two sources of variation (between objects and between measurement results on a single object).
— Variation among objects often gives rise to asymmetric distributions.
If we measure a quantity (such as a concentration) in many diﬀerent objects in a speciﬁc category, we obtain a dispersion of results, but this is largely because the objects really do diﬀer. Distinguishing between diﬀerent objects is one of the reasons why we make measurements. Figure 1.3.1 shows the concentrations of copper measured in samples of sediment from nearly every square mile of England and Wales. As the concentrations displayed are actually the results of measurements, some of the variation (but only a small part) must derive from the measurement process. Note that this distribution is far from symmetrical — it has a strong positive skew. This skew is often observed in collections of data from natural objects.
Fig. 1.3.1. Concentrations of copper in 49,300 stream sediments from England and Wales. The distribution has a strong positive skew.
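The positive skew typical of natural-object data can be illustrated by simulation. A minimal sketch (the lognormal model and its parameters are our assumption, not taken from the text): the long upper tail drags the mean above the median.

```python
import numpy as np

rng = np.random.default_rng(7)
# Simulated concentrations with a strong positive skew (an arbitrary
# lognormal model standing in for the stream-sediment copper data)
conc = rng.lognormal(mean=3.0, sigma=0.8, size=49_300)

print(f"mean   = {conc.mean():.1f}")
print(f"median = {np.median(conc):.1f}")
# The mean exceeds the median, a hallmark of positive skew
```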
1.4 Data Displays

Key points
— Always look at a graphical display of your data before you carry out any statistics.
— Use visual appraisal to make a preliminary judgement about the question you are asking and to select appropriate statistical techniques.

Graphic representations of data, such as the histograms in §1.2, are essential tools in handling variation. They should always be the first resort for anyone with data to interpret. An appropriate diagram, coupled with a certain amount of experience, can tell us much of what we need to infer from data without resort to statistical methods. Indeed, a diagram can nearly always tell us which statistical techniques would suit our purpose and which of them would lead us to an incorrect conclusion.

There are several ways of representing simple replicated data. A histogram is a suitable tool to inspect the location (the central tendency) and dispersion of data all of the same type, when the number of observations is large, say 50 or more. With smaller amounts of data, histograms either look unduly ragged or do not show the shape of the distribution adequately. In such circumstances the dotplot is often more helpful. (There is no exact dividing line — we have to use our own judgement!) Figure 1.4.1 shows a dotplot of the rock powder data from §1.1.

Fig. 1.4.1. Repeated results for the concentration of sodium in a rock powder, presented as a dotplot.

1.5 Statistics

Key points
— The main reason for using statistics is the estimation of summary statistics to describe datasets in a compact way.
— Another important reason is to assign probabilities to events and assist you in making judgements in the presence of uncertainty.

Statistics is the mathematical science of dealing with variation both in objects and in measurement results. It helps us in two main ways. First, it enables us to summarise data concisely, an essential step in seeing what data are telling us, especially important with large datasets. For example, the data in Fig. 1.3.1 can be summarised by very simple statistics by saying that 95% of the results fall between 15.74 and 45.36. Of course this summary does not tell the whole story, but it tells us a lot more than we could find simply by looking at a list of 50,000 numbers: we have condensed information about 50,000 data into three numbers. The histogram of Fig. 1.3.1 tells us much more again, but is a summary specified by about 60 numbers, namely the heights and boundaries of the histogram bars.

The other way that statistics helps us is by allowing us to assign probabilities to events. It provides a logical way of drawing conclusions in the presence of uncertainty of measurement. For example, it could tell us that the results obtained in an experiment were very unlikely to be obtained if certain assumptions were true. That in turn would allow us to infer that at least some of the assumptions are very likely to be untrue. This uncertainty is inherent in making deductions from measurements.

1.6 Levels of Probability

Key points
— Statistics tells us, as a probability, whether an event is likely or unlikely under stated assumptions.
— Scientists normally accept a confidence level of 95% as convincing ('statistically significant').
— We might need a higher confidence level for certain applications.

Notice in the previous section that we are not dealing with absolutes such as 'true' or 'false', but with qualified judgements such as 'likely' or 'unlikely'. Typically we conduct an experiment to test some assumptions. Imagine that we could repeat such an experiment many times. If the results obtained in these experiments supported the assumptions only about half of the time, we wouldn't be convinced about the validity of those assumptions. If the experiment supported the assumptions 99 times out of a hundred, we would almost certainly be convinced. Somewhere between is a dividing line between 'convinced' and 'not convinced'. However, the level of probability that we accept as convincing varies both with the person making the judgement and the area of application of the result. For many scientific purposes we can accept 95% confidence: we would observe the event with a probability of 0.95 under the assumptions, and fail to observe the event with a probability of 0.05.
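The three-number summary described in §1.5 (95% of results between two limits) is simple to compute. A sketch under assumed data, with a simulated skewed dataset standing in for the real 50,000 results:

```python
import numpy as np

rng = np.random.default_rng(42)
# 50,000 simulated results standing in for the dataset of Fig. 1.3.1
# (the lognormal parameters are arbitrary assumptions)
results = rng.lognormal(mean=3.2, sigma=0.25, size=50_000)

# Two percentiles plus the coverage level condense 50,000 numbers into three
lo, hi = np.percentile(results, [2.5, 97.5])
print(f"95% of results fall between {lo:.2f} and {hi:.2f}")
```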
1.7 An Example — Ethanol in Blood — a One-Tailed Test

Key points
— If we are concerned about whether results are significantly greater than a legal or contractual limit, we calculate 'one-tailed' probabilities.
— One-tailed probabilities would apply also to other instances where there was an interest in results falling below such a limit.

A sample of blood is analysed four times by a forensic scientist and the results compared with the legal maximum limit for driving of 80 mg ethanol per 100 ml of blood. Suppose that we had repeated measurement results such as shown in Fig. 1.7.1. The mean of the results is above the limit, but the variation among the individual results raises the possibility that the mean is above the limit only by chance. We need to estimate how large or small that probability is. Using methods based on standard statistical assumptions (to be explained in §1.8), we find that the probability of obtaining that particular mean result, if the blood sample contained exactly 80 mg/100 ml, is 0.0005. (An even lower probability would apply if the true concentration were lower than the limit.) As the probability is very low, we infer that the sample contained a level of ethanol higher than the limit, with a confidence probably high enough in this instance to support a legal prosecution. In forensic science we would want a very high level of confidence: we would not accept a situation where our result gave rise to the wrong conclusion one time in twenty.

However, if we repeated the calculation for a different set of results that were closer to the limit (Fig. 1.7.2), we would obtain a higher probability estimate of 0.02. We would not, in that case, use the data to support a prosecution, even though the suspect was probably over the limit. The reason is that the probability of getting those results if the suspect were innocent would be unacceptably high. We will see later (§2.3) how these probabilities are estimated.

Notice also that we are interested here only in probabilities of data falling above a limit: this is called a one-tailed probability. (In other circumstances we might be interested only in probabilities of data falling below a limit. An example might be testing a dietary supplement for a guaranteed minimum level of a vitamin. That would also entail calculating one-tailed probabilities.)

Fig. 1.7.1. Results for the determination of ethanol in a sample of blood.

Fig. 1.7.2. Results for the determination of ethanol in a different sample of blood.

1.8 What Exactly Does the Probability Mean?

Key points
— The null hypothesis is an assumption that we make about the infinite number of possible results that we could obtain by replicating the measurement under unchanged conditions.
— We can calculate from the data the probability of observing the actual (or more extreme) results given the null hypothesis, but not the probability of the null hypothesis given the results.

The probabilities calculated in §1.7 have a very specific meaning. First, such a probability is a value calculated under a number of assumptions. A crucial assumption is that our results comprise a random sample from an infinite number of possible values. We further assume that the mean value µ of this infinite set is exactly equal to the legal limit xL. This latter assumption is called the null hypothesis and, in this instance, is written H0: µ = xL. Second, the probability is that of observing the 'event' (the results obtained, or more extreme results) under these assumptions, but not the probability of the null hypothesis being true given the results. There is a crucial difference between the two probabilities. In terms of the forensic example, broadly speaking, it is the probability of getting the results assuming innocence, not the probability of innocence given the results. The two are logically related, but can be rather different in value. The latter probability can be calculated from the former, but only if we have some extra information that cannot be derived from the data. (Using such extra information gives rise to 'Bayesian statistics', which is beyond the scope of these notes.) It is essential for the analyst to keep that meaning in mind when using statistics.
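The one-tailed reasoning of §1.7 is exactly what a one-sample t-test performs. A sketch with invented replicate values (the book does not list the raw ethanol results, so the numbers below are hypothetical):

```python
from scipy import stats

limit = 80.0                        # legal limit, mg ethanol per 100 ml
results = [88.2, 89.5, 87.9, 88.9]  # hypothetical replicate measurements

# One-tailed p: probability of a mean this high (or higher) if the
# true concentration were exactly at the limit (H0: mu = 80, HA: mu > 80)
t_stat, p_one = stats.ttest_1samp(results, popmean=limit,
                                  alternative='greater')
print(f"sample t = {t_stat:.1f}, one-tailed p = {p_one:.5f}")
```

A p-value this small would, as in the text, support the inference that the true concentration exceeds the limit.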
1.9 Another Example — Accuracy of a Nitrogen Analyser — a Two-Tailed Test

Key point
— If we are concerned about whether results are significantly different from a reference value (such as a true value or other reference value), we need to calculate 'two-tailed' probabilities.

Suppose that we want to test the accuracy of an instrument that automatically measures the nitrogen content of food (from which we can estimate the protein content). We could do that by inserting into the instrument a sample of pure glycine, an amino acid for which we can calculate the true nitrogen content (xtrue = 18.92%). Because results vary, we would probably want to repeat the experiment a number of times and compare the mean result x̄ with xtrue. Suppose that we obtain the results 18.86, 18.93, 18.74, 18.95, 19.00% nitrogen (shown in Fig. 1.9.1).

The mean result is x̄ = 18.90 (to four significant figures), so that |x̄ − xtrue| = |18.90 − 18.92| = 0.02%.¹ We want to know whether this absolute difference between the true value and the mean plausibly represents an inaccuracy in the machine, or is more probably due to the usual variation among the individual results. In this case the null hypothesis is H0: µ = 18.92, and we are asking whether |x̄ − xtrue| is significantly greater than zero. (Contrast this with §1.7, where we were concerned with whether the mean was significantly greater than the reference value.) Here we are interested in probabilities related to deviations from xtrue in either direction, positive or negative: we want to know whether the mean is significantly different from the reference value xtrue. This calls for the calculation of a two-tailed probability: under H0 (and the other assumptions) we calculate the probability of getting the observed value of |x̄ − xtrue| or a greater one (i.e., a mean result even further from xtrue in either direction).

This particular probability has the value of p = 0.62. In other words, we could expect a difference as large as (or greater than) |x̄ − xtrue| as often as six times in ten repeat experiments if there were no inaccuracy in the instrument. As this is a large probability (greater than 0.05, for example), we say that there is no significant difference: there is no compelling reason to believe that the machine is inaccurate. However, if the results had been as depicted in Fig. 1.9.2, the probability would have been p = 0.033, indicating a quite unusual event under the null hypothesis. We would then have drawn the opposite inference, namely that there was a real bias in the results.

¹ The notation |a| signifies the absolute value of a, i.e., the magnitude of the number without regard to its sign, so that |−3| = |3| = 3.

Fig. 1.9.1. Set of measurement results showing no significant difference between the true value and the mean.

Fig. 1.9.2. Set of measurement results showing a significant difference between the true value and the mean.
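The nitrogen-analyser comparison is a two-tailed one-sample t-test. A sketch, using values consistent with the summary statistics quoted in §2.5 (n = 5, x̄ = 18.896, s = 0.1006):

```python
from scipy import stats

x_true = 18.92  # calculated true nitrogen content of glycine, %
results = [18.86, 18.93, 18.74, 18.95, 19.00]

# Two-tailed test of H0: mu = x_true against HA: mu != x_true
t_stat, p_two = stats.ttest_1samp(results, popmean=x_true)
print(f"sample t = {t_stat:.2f}, two-tailed p = {p_two:.2f}")
# p is about 0.62: a deviation this large would be common under H0,
# so there is no significant difference
```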
1.10 Null Hypotheses and Alternative Hypotheses

Key points
— The alternative hypothesis for a one-tailed test is HA: µ > xL or HA: µ < xL.
— The alternative hypothesis for a two-tailed test is HA: µ ≠ xtrue.

We have seen (§1.9) that calculating probabilities from results depends on the specification of a null hypothesis H0. To distinguish between one-tailed and two-tailed situations, and to allow the calculation of the correct probability, we have to invoke some extra information, called the 'alternative hypothesis', which is designated HA (or H1 in some texts). For a one-tailed probability the null hypothesis is H0: µ = xL for a limit value xL. For the alternative hypothesis we would use either HA: µ > xL or HA: µ < xL, depending respectively on whether xL was an upper limit or a lower limit for the quantity of interest. For a two-tailed probability, the null hypothesis is H0: µ = xtrue for a reference value xtrue; the alternative hypothesis is then HA: µ ≠ xtrue. The role of the alternative hypothesis is to remind us of what we are trying to establish when we are undertaking statistical calculations. It is also what we accept by default if the evidence is such as to reject the null hypothesis.

1.11 Statements about Statistical Inferences

Key point
— We have to be very careful in our choice of words to avoid misleading statements about statistical inference.

Having settled on a level of probability that we designate as convincing for the particular inference that we wish to make (a critical level, pcrit), and then having estimated the probability p associated with our data under H0 and HA, we may wish to express the finding in words. Acceptable and unacceptable forms of words are tabulated below. They should be qualified by referencing pcrit in the form 100(1 − pcrit)%, so if pcrit = 0.05 we would say 'at the 95% confidence level'.

If p ≥ pcrit:
We can say: We cannot reject the null hypothesis. There are no grounds for considering an alternative hypothesis.
We might say: We accept the null hypothesis.
We can say [for two-tailed probabilities]: We find no significant difference between the mean result and the reference value.
We can say [for one-tailed probabilities]: (i) We do not find the mean result to be significantly greater than the reference value; or (ii) we do not find the mean result to be significantly less than the reference value.
We cannot say: The null hypothesis is true [because statistical inference is probabilistic: there is a small chance that the null hypothesis is false].
We cannot say: The null hypothesis is true with a probability p [because we need additional information to estimate probabilities of hypotheses].

If p < pcrit:
We can say: We can reject the null hypothesis and consider the alternative; that is, we find the outcome to be statistically significant.
We can say [for two-tailed probabilities]: We find a significant difference between the mean value and the reference value.
We can say [for one-tailed probabilities]: (i) We find the mean result to be significantly greater than the reference value; or (ii) we find the mean result to be significantly less than the reference value.
We cannot say: The null hypothesis is false [because statistical inference is probabilistic: there is a small chance that the null hypothesis is true].
We cannot say: The null hypothesis is false with a probability (1 − p) [because we need additional information to make inferences about hypotheses].

We should note that it is misleading to regard 95% confidence as a kind of absolute dividing line between significance and non-significance. A confidence level of 90% would be convincing in many situations or, at least, suggest that further experimentation would be profitable.
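The permitted forms of words above amount to a simple rule on p versus pcrit. A sketch (the function name and exact phrasing are ours):

```python
def inference_statement(p, p_crit=0.05, two_tailed=True):
    """Return a statistically defensible form of words (per Section 1.11)."""
    level = f"{100 * (1 - p_crit):.0f}% confidence level"
    if p < p_crit:
        if two_tailed:
            return f"We find a significant difference at the {level}."
        return f"We find the mean significantly above (or below) the limit at the {level}."
    # Note: we may NOT say the null hypothesis is true, or quote a
    # probability for it; we can only fail to reject it
    return f"We cannot reject the null hypothesis at the {level}."

print(inference_statement(0.62))    # nitrogen analyser: not significant
print(inference_statement(0.033))   # biased instrument: significant
```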
Chapter 2

Thinking about Probabilities and Distributions

This chapter covers the estimation of probabilities relating to data from the assumption of the normal distribution of analytical error. Various approaches are covered, but the main thrust is the use of the p-value to determine how likely the data are under the various assumptions.

2.1 The Properties of the Normal Curve

Key points
— Probabilities of random results falling into various regions of the normal distribution are determined by the values of µ and σ.
— To apply the normal model to estimating probabilities, we have to assume that our data comprise a random sample from the infinite population represented by the normal curve.
— Real repeated datasets encountered in analytical chemistry seldom resemble the smooth normal curve, but often provide ragged histograms.

We can estimate the probabilities involved in statistical inference in several quite different ways, but most often by reference to a mathematical model of the variation. The model most widely applicable in analytical chemistry is the normal distribution, which has a probability density (height of the curve y for a given value of x) given by

y = exp(−(x − µ)²/2σ²) / (√(2π) σ)    (2.1.1)

and a unit area. A key feature of a density function such as Eq. (2.1.1) is that the area defined by any two values of x represents the probability of a randomly distributed variable falling in that range. To apply the model we have to assume that our data comprise a random sample from the infinite population represented by the normal curve; those may or may not be reasonable assumptions.

The appearance of the normal distribution depends on the values of the parameters µ (mu) and σ (sigma): µ is called the mean of the function and σ the standard deviation, while σ² is called the variance. The shape of the normal curve is shown in Fig. 2.1.1. It is symmetrical about µ, where the highest density lies, and falls to near-zero density at distances outside the range µ ± 3σ. In the normal curve (Figs. 2.1.2–2.1.4) we find that about 2/3 of results fall within the range µ ± σ. Close to 19/20 results (or 95%) fall within the range µ ± 2σ; exact limits for 95% probability are µ ± 1.96σ. About 99.7% fall within µ ± 3σ. If we are interested in one-tailed probabilities, we note that 95% of results fall below µ + 1.645σ (Fig. 2.1.4) and 95% fall above µ − 1.645σ. Analysts should commit these ranges and probabilities to memory.

Fig. 2.1.1. The normal curve.

Fig. 2.1.2. The normal curve. The region within the range µ ± σ (unshaded) occupies about 2/3 of the total area.

Fig. 2.1.3. The normal curve. The region within the range µ ± 2σ (unshaded) occupies about 95% of the total area.

Fig. 2.1.4. The normal curve. The region below µ + 1.645σ (unshaded) occupies 95% of the total area.

The normal curve is widely used in statistics, partly because of theory and partly because replicated results often approximate to it. The Central Limit Theorem shows that the combination of numerous small independent errors tends to give rise to such a curve. This combination of errors is exactly what we would expect in analytical operations, which comprise a lengthy succession of separate actions by the analyst, each prone to its own variation. In practice, repeated analytical data usually take a form that is indistinguishable from a random selection from the normal curve. Of course, histograms of real datasets are not smooth like the normal curve. Histograms are 'steppy' if representing a large dataset (e.g., Fig. 1.3.1) but tend to be ragged for the size of dataset usually encountered by analytical chemists. Genuine random selections from a normal distribution, even quite large samples, seldom closely resemble the parent curve. Figure 2.1.5 shows six such selections.

Fig. 2.1.5. Six random samples of 100 results from a normal distribution with mean 10 and standard deviation 1.

2.2 Probabilities Relating to Means of n Results

Key points
— σ/√n is called the 'standard error of the mean'.
— Means of even a small number of results tend to be close to normally distributed even if the original results are not.

We are often interested in probabilities relating to the mean of two or more results. Statistical theory tells us that the mean x̄ of n random results from the normal curve with mean µ and standard deviation σ is also normally distributed, with a mean of µ but a standard deviation of σ/√n.
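The claim that means of n results have standard deviation σ/√n can be checked by simulation. A sketch under assumed parameters (µ = 10, σ = 1, n = 4, none of them from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 10.0, 1.0, 4

# 20,000 simulated analytical runs of n replicate results each;
# take the mean of every run
means = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)

print(f"standard deviation of the means: {means.std(ddof=1):.3f}")
print(f"theoretical sigma/sqrt(n):       {sigma / np.sqrt(n):.3f}")
```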
The term σ/√n is known as 'the standard error of the mean'; 'standard error' simply means the standard deviation of a statistic (such as a mean) as opposed to that of an individual result. Again, this is an outcome of the Central Limit Theorem: even if the parent distribution of the individual results differs from a normal distribution, the means will be much closer to normally distributed, especially for higher n.

We can now apply the previously established properties of the normal curve to means. For example, the probability of an observed mean x̄ falling above µ + 1.645σ/√n is 0.05. This is a one-tailed probability (§1.7). In statistical notation¹ we have

Pr[x̄ − µ > 1.645σ/√n] = Pr[(x̄ − µ)/(σ/√n) > 1.645] = 0.05.    (2.2.1)

Likewise, for a lower limit,

Pr[x̄ − µ < −1.645σ/√n] = Pr[(x̄ − µ)/(σ/√n) < −1.645] = 0.05.    (2.2.2)

For a probability p other than 0.05 we simply need to replace 1.645 by the appropriate coverage factor k derived from the normal distribution, to give the general formulae

Pr[(x̄ − µ)/(σ/√n) > k] = p  and  Pr[(x̄ − µ)/(σ/√n) < −k] = p.    (2.2.3)

From this we can deduce, for the two-tailed case in general,

Pr[|x̄ − µ|/(σ/√n) > k] = p.

However, we must remember that the relationship between p and k is different in one-tailed and two-tailed probabilities.

¹ The notation Pr[·] signifies the probability of whatever expression is within the square brackets.

2.3 Probabilities from Data

Key points
— Equations defining normal probabilities have to be modified if you are using standard deviations s estimated from a small number (less than about 50) of data instead of the population value σ.
— The variable t (or 'Student's t') replaces the coverage factor k in these equations.
— The value of t depends on the probability required and the number of results used to calculate the mean.

To estimate probabilities from data, first we have to find which particular normal distribution (defined by µ and σ) is the best representation of our data x1, x2, ..., xn. We do that by calculating the corresponding 'sample statistics' x̄ and s. The 'sample mean' x̄ is the ordinary arithmetic mean, given by

x̄ = (1/n) Σᵢ xᵢ,

while the 'sample standard deviation' s is

s = √( Σᵢ (xᵢ − x̄)² / (n − 1) ).

The statistics x̄ and s are estimates of the unknown parameters µ and σ. We must remember that x̄ and s are variables, in the sense that if a series of measurements were repeated the resultant values of x̄ and s would be different each time, and (usually) approach the parameters more closely as n increases. As an outcome, they cannot be used directly to substitute for the parameters µ and σ in the probabilities given in Eqs. (2.2.1)–(2.2.3). Instead we have to use modified equations, in which we substitute a variable t (also called 'Student's t') for the normal coverage factor k, giving

Pr[x̄ − ts/√n < µ < x̄ + ts/√n] = 1 − p.    (2.3.1)

From this we obtain, for two-tailed probabilities,

Pr[|x̄ − µ|/(s/√n) > t] = p.    (2.3.2)

In some contexts we assume a null hypothesis H0: µ = xref and, in such an instance, we can substitute the reference value xref for µ. This gives the corresponding expression

Pr[|x̄ − xref|/(s/√n) > t] = p.    (2.3.3)

Corresponding expressions can be obtained for one-tailed probabilities but, again, the relationship between t and p is then different. Like k, the value of t depends on the probability p; corresponding values of t and p are tabulated in statistics texts, and can be calculated from each other. Unlike k, the value of t also depends on n, and gets closer to k as n increases. For small n this difference is important; with n > 50 there is little difference: for n = 50 results, t = 2.01 compared with k = 1.96 for a two-tailed probability of 0.05.

2.4 Probability and Statistical Inference

Key points
— Probabilities about means of repeated results can be calculated simply by computer under H0 and HA.
— Care should be taken in interpreting the exact value of a probability.
— Statistics should be used to assist a decision, not to make it automatically.

As a two-tailed example we take the nitrogen analyser data from §1.9. We have H0: µ = 18.92 and HA: µ ≠ 18.92. If we equate the 'sample t-value' |x̄ − xref|/(s/√n), derived from the data under H0 and HA, with t, we can obtain the value of p associated with Eq. (2.3.3), quickly by computer but with great difficulty by hand. Most statistics packages give the probability p = 0.62 directly. This is the probability of obtaining our mean value (or a value more distant from 18.92) if H0 is true. As this is a large value (e.g., much larger than 0.05), the event would be common under repeated experiments, so there are no grounds for suspecting the truth of H0.
Often the decision amounts to comparing some experimental results with an independent reference value of some kind: a legal or contractual limit, or a true value. To proceed with the test, the sample t-value is calculated from the data and H0, and compared with the critical value for the preselected p. This is the simplest way of looking at probability, but statistics is not a magical way of converting uncertainty into certainty. Nowadays, most statistics packages calculate p directly from the data, and calculating the actual p-value provides more information: for example, it tells you whether you are near or far from a selected critical level.
pre-calculated values of t, called 'critical values', for a small number of specific probabilities. The sample t could then be compared with the tabulated value for a specific probability (usually p = 0.05). If the sample t exceeded the tabulated t, it would be reasonable to reject H0 at a confidence level of 100(1 − 0.05)% = 95%: we would know that the probability of the event was less than 0.05.

Working this out for the previous example data from §1.9, we have the statistics n = 5, x̄ = 18.896, s = 0.1006, with H0: µ = 18.92. These provide a sample t given by

    t = |x̄ − x_true|/(s/√n) = |18.896 − 18.92|/(0.1006/√5) = 0.53.

We need to compare this value with Student's t for a specific probability. Values of Student's t are tabulated according to the number of 'degrees of freedom',2 which in this case equals n − 1 = 4. (If we have say ten results, the mean has nine degrees of freedom, because we can calculate any one result from the other nine and the mean.) For four degrees of freedom and a two-tailed p-value of 0.05, the tabulated value of t is 2.78. Our sample value of 0.53 is well below the critical value of 2.78, so there are no grounds for rejecting H0. Again this tells us that the event is not significant at the 95% confidence level.

2.6 Confidence Intervals

Key points
— 100(1 − p)% confidence limits can be calculated from the data and a t-value corresponding to the required value of p.
— If an H0 value falls outside this interval, we feel justified in rejecting the null hypothesis.
— Using the confidence interval approach to significance testing is mathematically related to calculating a p-value, and it provides the same answer.

The confidence interval uses the same statistics in a different way. All we need to do is to settle on a desired level of confidence and find the corresponding value of t from a table. The probability statement

    Pr( x̄ − ts/√n < µ < x̄ + ts/√n ) = 1 − p
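The sample t-value in this worked example is easy to check by computer. The following is a minimal sketch using only the Python standard library; the summary statistics (n = 5, x̄ = 18.896, s = 0.1006, H0: µ = 18.92) are taken from the text, and the function name `sample_t` is our own.

```python
import math

def sample_t(mean, mu0, s, n):
    # One-sample t statistic: |mean - mu0| / (s / sqrt(n)).
    return abs(mean - mu0) / (s / math.sqrt(n))

# Nitrogen analyser example: n = 5, mean 18.896, s = 0.1006, H0: mu = 18.92.
t = sample_t(18.896, 18.92, 0.1006, 5)
print(round(t, 2))  # 0.53, well below the tabulated t = 2.78 (4 df, p = 0.05)
```

A statistics package would go one step further and convert this t into the p-value 0.62 quoted above.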
tells us the range in which the unknown population mean µ falls with a probability of (1 − p)

2 The number of degrees of freedom (n − m) designates the number of independent values remaining after m statistics have been estimated from n results.
and is known as the 100(1 − p)% confidence interval. So if p = 0.05, we have a 95% confidence interval. The ends of the range are called the 'confidence limits'. Again using the nitrogen analyser data in §1.9, we calculate the 95% confidence limits (for a two-tailed test) as x̄ ± ts/√n = 18.896 ± 2.78 × 0.1006/√5 = (18.77, 19.02). The H0 value of 18.92 falls close to the middle of this range (Fig. 2.6.1), so there are no grounds for rejecting H0. If an H0 value doesn't fall within the confidence interval, we feel justified in rejecting the null hypothesis. Confidence intervals thus provide an alternative way of assessing and visualising tests of significance, and calculating a confidence interval is one way of attributing a limiting probability to our data.

The 95% confidence interval is a convenient and currently popular method of expressing the uncertainty in our measurement results. The meaning of the confidence interval is subtle, however, and widely misunderstood. Its exact meaning is as follows: if the experiment were repeated a large number of times, µ would fall in the calculated interval on an average of 100(1 − p)% of occasions. As we assume that µ = x_true under the null hypothesis H0, x_true should fall within the calculated 95% confidence region most of the time.

If we calculate 95% confidence limits for the sodium data (§1.2) and, in addition, use the information that the rock powder is a reference material

Fig. 2.6.1. The nitrogen analyser data (points), showing the 95% confidence interval and the H0 value falling inside the interval. UCL = upper confidence limit, LCL = lower confidence limit.
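The confidence limits quoted here can be recomputed from the summary statistics. A minimal standard-library sketch; the critical value t = 2.78 (4 degrees of freedom, two-tailed p = 0.05) is taken from the text, and `conf_interval` is a name of our own choosing.

```python
import math

def conf_interval(mean, s, n, t_crit):
    # 100(1 - p)% confidence limits: mean +/- t * s / sqrt(n).
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width

lcl, ucl = conf_interval(18.896, 0.1006, 5, 2.78)
print(round(lcl, 2), round(ucl, 2))  # 18.77 19.02
```

The H0 value 18.92 lies between these limits, which is the confidence-interval way of seeing that the result is not significant.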
with a certified value of 2.33% for the sodium content, we find that the certified value falls outside the confidence interval (Fig. 2.6.2). Consequently we find that there is a significant difference, and justifiably consider that the analytical method is providing a biased result.

Fig. 2.6.2. The sodium data (points), showing the 95% confidence interval and the H0 value falling outside the interval. UCL = upper confidence limit, LCL = lower confidence limit.
Chapter 3

Simple Tests of Significance

This chapter contains worked examples of simple tests of significance of means, incorporating one-sample tests, two-sample tests and tests on paired data, in one-tailed and two-tailed forms. Its main purpose is the demonstration of the use of appropriate methods and a critical appraisal of the outcome, and to allow readers to check that they are using their own statistical computer software correctly and interpreting the output correctly. There is also a small amount of theory.

3.1 One-Sample Test — Example 1: Mercury in Fish

Key points
— This is a one-tailed test because we are concerned with concentrations above a regulatory limit.
— The usual symmetrical 95% confidence limits are not applicable here.
— We should beware that some statistical software may confusingly produce two-tailed confidence limits in combination with a one-tailed test of significance.

European Regulation 629/2008 sets a maximum concentration of mercury in fish of 1.0 ppm. A laboratory analyses a suspect sample four times and obtains the results 1.34, 1.44, 1.42 and 1.14 ppm.
Can we conclude that the concentration of mercury is above the allowable level? This requires a one-tailed probability because we are testing the mean result against an upper limit. A display of the data is in Fig. 3.1.1. The statistics are shown in Box 3.1.1: we see that the null hypothesis value does not lie in the 95% confidence region (which has only a lower limit for the one-tailed test). The p-value of 0.0082 is low: the mean result is significantly greater than the regulatory limit. The chance of obtaining the data (or a set with a higher mean) if H0 were true is less than one in a hundred, so we are justified in rejecting it.

Box 3.1.1 One-tailed t-test of the mean
H0: µ = 1.0 : HA: µ > 1.0
Variable: Mercury, ppm   n = 4   Mean = 1.3350   St Dev = 0.1370   SE Mean = 0.0685
t = 4.89   p = 0.0082
Lower 95% confidence limit = 1.17

Fig. 3.1.1. Results for mercury content with null hypothesis, mean and 95% confidence region.

Note
• The file containing this dataset is named Mercury.
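The boxed statistics can be reproduced from the four raw results. A minimal standard-library sketch; the one-tailed critical value t ≈ 2.35 for 3 degrees of freedom mentioned in the comment is from standard tables, not from the text.

```python
import math
from statistics import mean, stdev

results = [1.34, 1.44, 1.42, 1.14]   # mercury, ppm
limit = 1.0                          # regulatory limit, ppm

n = len(results)
m, s = mean(results), stdev(results)
t = (m - limit) / (s / math.sqrt(n))  # one-tailed: is the mean ABOVE the limit?
print(round(m, 3), round(s, 3), round(t, 2))  # 1.335 0.137 4.89
```

The sample t of 4.89 far exceeds the tabulated one-tailed value of about 2.35, agreeing with the very small p-value in Box 3.1.1.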
3.2 One-Sample Test — Example 2: Alumina (Al2O3) in Cement

Key points
— This is a two-tailed test because we are considering whether an observed mean differs significantly from a target value.
— It is worth looking for trend in sequential data.

A special cement has a target for the alumina content of 4.00%. In its manufacture, the composition of the product is monitored by regular automatic analysis. Twelve successive results for alumina were: 3.91, 3.86, 4.13, 4.04, 4.14, 4.19, 4.16, 3.87, 4.18, 4.21, 4.14, 4.05. Is there any evidence that the alumina content differed from the target 4.00% during the measurement period?

There is no obvious trend in the data (Fig. 3.2.1), so a straightforward test of significance on the mean is appropriate. The investigation calls for a two-tailed test, as we are interested in a significant difference regardless of direction. A display of the data is shown in Fig. 3.2.2, and the statistics are shown in Box 3.2.1. We see that the H0 value lies in the 95% confidence region (just). As the p-value is greater than 0.05, the null hypothesis is not rejected at the usual 95% confidence. On the face of it, there is no compelling evidence to suggest that the true level in the sample is different from the target value (assuming

Fig. 3.2.1. The results plotted in time sequence.
Fig. 3.2.2. Results for alumina (points), showing the null hypothesis (H0), mean and 95% confidence region.
that the analytical method is unbiased). Intervention would not be called for, because there is a small (but not negligible) probability of obtaining these results when the concentration of alumina did not differ from 4.00.

Box 3.2.1 Two-tailed test of the mean
H0: µ = 4.00 : HA: µ ≠ 4.00
Variable: Alumina, %   n = 12   Mean = 4.0733   St Dev = 0.1274   SE Mean = 0.0368
t = 1.99   p = 0.071
95% confidence limits: (3.9924, 4.1543)

Notes
• The file containing this dataset is named Alumina.
• In an instance like this one, where data have been collected sequentially, it is worth looking for a trend in the data, in case the process seems to be drifting away from the target value. That circumstance would invalidate a significance test on the mean: the data would not be independent. However, the drift itself would suggest that the process needed adjustment.

3.3 Comparing Two Independent Datasets — Method

Key points
— If we want to test the significance of a difference between the means of two sets of results, we have to use a 'two-sample' test.
— Two-sample tests can provide one-tailed or two-tailed probabilities as required.
— If we do not assume that the standard deviations of the x data and the y data are equal, the number of degrees of freedom has to be reduced according to a complex formula.

Sometimes we want to know whether the means of two separate datasets should be regarded as different.
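The same calculation for the alumina example, as a standard-library sketch. The twelve values below reproduce the boxed mean (4.0733) and standard deviation (0.1274); their time order, which matters for the trend check but not for the t statistic, is as recovered from the text.

```python
import math
from statistics import mean, stdev

alumina = [3.91, 3.86, 4.13, 4.04, 4.14, 4.19,
           4.16, 3.87, 4.18, 4.21, 4.14, 4.05]  # %, twelve successive results
target = 4.00

m, s, n = mean(alumina), stdev(alumina), len(alumina)
t = abs(m - target) / (s / math.sqrt(n))  # two-tailed test against the target
print(round(m, 4), round(s, 4), round(t, 2))  # 4.0733 0.1274 1.99
```

With 11 degrees of freedom the two-tailed critical value at p = 0.05 is about 2.20, so t = 1.99 is (just) not significant, matching the boxed p = 0.071.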
Suppose that we had access to two different methods for the determination of protein nitrogen in food. The Kjeldahl method is based on the conversion of the protein nitrogen to ammonia, and measuring the ammonia by titration. The Dumas method relies on the combustion of the foodstuff with the formation of elemental nitrogen, which is measured volumetrically or otherwise. It is suspected that the Dumas method gives on average a slightly different result, because there are other nitrogenous substances (e.g., nitrate, nitrogenous organic bases) present in the food, which may affect the results differently. In a simple experiment we could test for this suspected bias by applying both methods repeatedly to a homogeneous sample of a foodstuff.

Suppose the Dumas results are x1, x2, ..., xi, ..., xn and the Kjeldahl results are y1, y2, ..., yj, ..., ym. The estimated bias is x̄ − ȳ. We need to know whether |x̄ − ȳ| is significantly greater than zero, or whether the measured difference is simply a result of random variations in the data. The null hypothesis for this test is H0: µx = µy.

We can readily show that se(x̄ − ȳ) = √(sx²/n + sy²/m). (The function se( ) means the standard error of whatever is in the brackets.) We set up an equation analogous to Eq. (2.2) by recognising that, for any statistic θ̂ expected to be normally distributed, θ̂/se(θ̂) has a t-distribution. (θ signifies the statistic and θ̂ an estimate of θ.) So for the difference between the means we have, just like Eq. (2.2),

    Pr( |x̄ − ȳ| / se(x̄ − ȳ) > t ) = Pr( |x̄ − ȳ| / √(sx²/n + sy²/m) > t ) = p.

As before, we can either use a computer to calculate p from the observed value of t, namely |x̄ − ȳ| / √(sx²/n + sy²/m), or manually find whether that value exceeds the critical value of t obtained from tables for the selected level of p and the appropriate number of degrees of freedom. However, in either instance, the number of degrees of freedom for t is not (n + m − 2) as might be expected: we have to make an adjustment using
the complex-looking formula which gives a smaller number of degrees of freedom:

    adjusted degrees of freedom = (sx²/n + sy²/m)² / [ sx⁴/(n²(n − 1)) + sy⁴/(m²(m − 1)) ],    (3.1)

rounded to the nearest integer.

The above is an example of a two-tailed test, as we are looking for a bias between the mean results of the two methods, that is, a difference regardless of which method gives the higher result. One-tailed two-sample tests are equally possible (see §3.7). If we can reasonably assume that the population standard deviations of the datasets, σx and σy, are equal, we can use a somewhat simpler procedure (see §3.4 below).

3.4 Comparing Means of Two Datasets with Equal Variances

Key points
— If we can assume that the two datasets come from populations with the same precision, we can use the pooled standard deviation to provide a simpler version of the two-sample test. This procedure is simpler for hand calculations.
— There is no practical advantage in using a pooled standard deviation if probabilities or t-values are calculated by computer.

If we can assume that the two datasets are from distributions with the same variance, the mathematics of the two-sample test of means is somewhat simpler for hand calculation, though of no advantage if a computer package is used. We derive s, the 'pooled standard deviation', that is, a standard deviation derived from both datasets:

    s = √( [ Σᵢ(xᵢ − x̄)² + Σⱼ(yⱼ − ȳ)² ] / (n + m − 2) ).

The standard error of the difference is now given by se(x̄ − ȳ) = s√(1/n + 1/m). This lets us calculate
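Equation (3.1) and the pooled standard deviation are easy to mis-key by hand, so here is a small standard-library sketch of both; the function names are our own.

```python
import math

def welch_df(sx, n, sy, m):
    # Adjusted degrees of freedom, Eq. (3.1), before rounding.
    numerator = (sx**2 / n + sy**2 / m) ** 2
    denominator = sx**4 / (n**2 * (n - 1)) + sy**4 / (m**2 * (m - 1))
    return numerator / denominator

def pooled_sd(x, y):
    # Pooled standard deviation and standard error of (xbar - ybar).
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    ss = sum((v - xbar) ** 2 for v in x) + sum((v - ybar) ** 2 for v in y)
    s = math.sqrt(ss / (n + m - 2))
    return s, s * math.sqrt(1.0 / n + 1.0 / m)

# Wheat flour example of Sec. 3.6: Dumas s = 0.0449 (n = 8), Kjeldahl s = 0.0490 (m = 10).
print(welch_df(0.0449, 8, 0.0490, 10))  # about 15.6; the software output in Sec. 3.6 quotes DF = 15
```

Note that packages may truncate rather than round the adjusted degrees of freedom, which is consistent with the 15 degrees of freedom reported later for these data.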
probabilities from a t-value with a simple n + m − 2 degrees of freedom, to give

    Pr( |x̄ − ȳ| / se(x̄ − ȳ) > t ) = p.

Of course the assumption that both datasets come from equal-variance populations has to be reasonable, and there is a simple method of testing that, called Fisher's variance ratio test or the F-test (see §3.5).

3.5 The Variance Ratio Test or F-Test

Key point
— The F-test is used to determine whether two independent variances (or two independent standard deviations) are significantly different.

We may want to test whether two standard deviations sx and sy, calculated from two independent datasets x1, x2, ..., xi, ..., xn and y1, y2, ..., yj, ..., ym, are significantly different from each other. The test statistic in this case is the ratio of the estimated variances, F = sx²/sy², arranged so that sx > sy. Its value depends on two separate degrees of freedom, (n − 1) for x and (m − 1) for y. The test can be used to determine whether it is sensible to use a pooled standard deviation for a two-sample test (§3.4) but, as will be seen, it is more widely used to determine significance in analysis of variance (§4.2). As with t-tests, the F-test can be conducted by calculating a probability corresponding with F, or by comparing the sample value of F with tabulated values for predetermined probabilities. These probabilities depend on the assumption that the original datasets were normally distributed samples.

As an example we use the nitrogen data from §3.6. The situation is illustrated in Fig. 3.5.1: the variance ratio is F = 1.193 and the corresponding probability under H0: σx = σy (against HA: σx > σy) is far above 0.05. In other words, if there were no difference between the variances, a ratio at least as large as this would occur quite often in repeated experiments. As this is a high probability, we conclude that the variances are not significantly different.
Fig. 3.5.1. The F distribution for nine and seven degrees of freedom. The shaded area shows the probability of obtaining the nitrogen data under H0 .
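The variance ratio for this example can be checked directly from the §3.6 data. A standard-library sketch; the critical value F ≈ 3.7 for (9, 7) degrees of freedom at p = 0.05, quoted in the comment, is from standard tables rather than from the text.

```python
from statistics import stdev

dumas    = [3.12, 3.01, 3.05, 3.04, 3.04, 2.98, 3.08, 3.09]               # 7 df
kjeldahl = [3.07, 2.92, 3.01, 3.00, 3.02, 3.02, 2.98, 2.92, 3.01, 3.05]   # 9 df

# Larger variance on top (here Kjeldahl), so F >= 1 by construction.
F = stdev(kjeldahl) ** 2 / stdev(dumas) ** 2
print(round(F, 3))  # 1.193, far below the 5% critical value of about 3.7
```

Since 1.193 is far below the critical value, the equal-variance assumption of §3.4 would be acceptable for these data.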
Note
• The file containing the dataset is named Wheat flour.

3.6 Two-Sample Two-Tailed Test — An Example
Key points
— This is a two-tailed test because we are interested in the difference between the mean results regardless of which is the greater.
— If we assume variances are unequal when they are in fact close, we get an outcome very similar to that obtained by the pooled-variance method, so no harm results.
— If we assume variances are equal when they are not, we could get a misleading outcome, so we use the pooled-variance method only in special circumstances.
A laboratory, possessing the two diﬀerent methods (Dumas and Kjeldahl) for determining protein nitrogen in food, repeatedly analysed the same sample of wheat ﬂour, and obtained the results given below. A dotplot of the data is shown in Fig. 3.6.1. We can see that the means of the two sets
Fig. 3.6.1. Dotplot of the nitrogen data (points), showing the means of the two datasets.
of data are diﬀerent, but are they signiﬁcantly diﬀerent, given the spread of the data? We also observe that the dispersions of the two datasets are similar.
Dumas, % 3.12 3.01 3.05 3.04 3.04 2.98 3.08 3.09
Kjeldahl, % 3.07 2.92 3.01 3.00 3.02 3.02 2.98 2.92 3.01 3.05
In this instance, without assuming that the variances are identical, Eq. (3.1) tells us to use 15 degrees of freedom, and we obtain p = 0.035 as the probability of obtaining the absolute difference that we observe (or a greater difference). We conclude that the observed difference of 0.0513% mass fraction is significant at 95% confidence. Using a pooled standard deviation (and the full 16 degrees of freedom) provides an almost identical probability (p = 0.036). Notice that we do not need equal numbers of observations in each dataset for the two-sample test.
Box 3.6.1 Two-sample t-test and confidence interval
H0: µDumas = µKjeldahl : HA: µDumas ≠ µKjeldahl
Assuming unequal variances:
              n    Mean     St Dev   SE Mean
Dumas, %      8    3.0513   0.0449   0.016
Kjeldahl, %   10   3.0000   0.0490   0.015
t = 2.31   p = 0.035   DF = 15
95% confidence interval for difference: (0.004, 0.099)
Using a pooled standard deviation:
t = 2.29   p = 0.036   DF = 16
95% confidence interval for difference: (0.004, 0.099)
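The boxed t statistic can be reproduced from the raw data. A standard-library sketch of the unequal-variance form (converting t into the p-value of 0.035, with the adjusted 15 degrees of freedom, still needs tables or a statistics package):

```python
import math
from statistics import mean, stdev

dumas    = [3.12, 3.01, 3.05, 3.04, 3.04, 2.98, 3.08, 3.09]
kjeldahl = [3.07, 2.92, 3.01, 3.00, 3.02, 3.02, 2.98, 2.92, 3.01, 3.05]

# Two-sample t without assuming equal variances: se = sqrt(sx^2/n + sy^2/m).
se = math.sqrt(stdev(dumas) ** 2 / len(dumas) + stdev(kjeldahl) ** 2 / len(kjeldahl))
t = abs(mean(dumas) - mean(kjeldahl)) / se
print(round(t, 2))  # 2.31, as in Box 3.6.1
```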
Notes and further reading
• The file containing this dataset is named Wheat flour.
• The outcome of this experiment may not be relevant outside the laboratory concerned. Other laboratories, conducting the same methods (but somewhat differently), may have different biases. A discussion of this point can be found in Thompson, M., Owen, L., Wilkinson, K. et al. (2002). A Comparison of the Kjeldahl and Dumas Methods for the Determination of Protein in Foods, using Data from a Proficiency Testing Scheme. Analyst, 12, pp. 1666–1668.

3.7 Two-Sample One-Tailed Test — An Example
Key points
— This is a one-tailed test because we are interested in whether the modified conditions were better, not just different.
— Note the assumption of unequal standard deviations.
Ethanol is made industrially by the catalytic hydration of ethene. A process supervisor wants to test whether a change in the reaction conditions improves the onepass yield of ethanol. The conversion eﬃciency is measured with ten successive runs under the original conditions and ten
Fig. 3.7.1. Results for the conversion of ethene to ethanol under original and modiﬁed plant conditions.
more under the modiﬁed conditions. The results are as follows. There are no apparent trends in the data, so a simple ttest is appropriate.
Eﬃciency % original conditions 7.0 7.3 7.0 6.9 6.7 7.3 7.1 7.1 7.1 6.8 Eﬃciency % modiﬁed conditions 7.4 7.5 7.3 8.2 7.2 7.8 7.8 7.4 7.2 7.0
The dotplots in Fig. 3.7.1 show the means well separated and the dispersion of the results under modified conditions greater. The p-value of 0.002 (Box 3.7.1) for the one-tailed test shows a very low probability of obtaining the data if the null hypothesis is true, so we can reject it and accept that the modified conditions give a significantly greater conversion efficiency. Note that by assuming unequal variances, the test uses 13 degrees of freedom in place of the original 18.

Box 3.7.1 Two-sample t-test and confidence interval
H0: µMod = µOrig : HA: µMod > µOrig
Conditions   n    Mean    St Dev   SE Mean
Modified     10   7.480   0.358    0.11
Original     10   7.030   0.195    0.062
95% confidence interval for difference: (0.17, 0.729)
t = 3.49   p = 0.0020   DF = 13
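The one-tailed t statistic in Box 3.7.1 can be checked in the same way (standard-library sketch; the sign convention puts the modified conditions first because HA is µMod > µOrig):

```python
import math
from statistics import mean, stdev

original = [7.0, 7.3, 7.0, 6.9, 6.7, 7.3, 7.1, 7.1, 7.1, 6.8]
modified = [7.4, 7.5, 7.3, 8.2, 7.2, 7.8, 7.8, 7.4, 7.2, 7.0]

# One-tailed two-sample t, unequal variances assumed.
se = math.sqrt(stdev(modified) ** 2 / 10 + stdev(original) ** 2 / 10)
t = (mean(modified) - mean(original)) / se
print(round(t, 2))  # 3.49
```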
Note • The ﬁle containing this dataset is named Ethene.
3.8 Paired Data
Key points
— Paired data arise when there is an extra source of variation that has a common effect on both members of corresponding pairs of data.
— Paired data are treated by calculating the differences between corresponding pairs and testing the differences under H0: µdiff = 0. In this instance we have a two-tailed test, so HA: µdiff ≠ 0.
— It is essential to recognise paired data: a two-sample test will usually provide a misleading answer.
Here we consider again the results of two methods for determining nitrogen in food. We want to find whether the outcome in §3.6 (i.e., no significant difference between the means) was valid for wheat in general, and not just specific to a particular type of wheat. One way to do that would be to arrange for the comparison of the methods to be made with a number of different types of wheat. Suppose that in such an experiment, in a single laboratory, the results below were obtained.
C1: Type of wheat   C2: Kjeldahl result, %   C3: Dumas result, %   C4: Difference, %
A                   2.98                     3.08                  0.10
B                   2.81                     2.88                  0.07
C                   2.97                     3.02                  0.05
D                   3.15                     3.17                  0.02
E                   3.03                     3.08                  0.05
F                   3.05                     3.21                  0.16
G                   3.24                     3.20                  −0.04
H                   3.14                     3.12                  −0.02
I                   3.04                     3.11                  0.07
J                   3.08                     3.16                  0.08
We must recognise that, as well as a possible difference between results of the methods, there is an extra source of variation in the results, due to variation in the true concentrations of protein in the various types of wheat. It would be futile to attempt to compare the methods by comparing the mean results: if we tried to do that, any bias between methods might be swamped by the variation between the wheat types. However, we can see that it is the differences between respective pairs of results which will tell us what we want to know. In fact both of these variations show up clearly in Fig. 3.8.1: for instance, wheat type B has provided a particularly low pair of results. We can also see in Figs. 3.8.1 and 3.8.2 that for most of the wheat types (8/10) the Dumas method gives the higher result, which in itself suggests that the difference between methods may be significant.

We deal with the situation statistically by calculating a list of differences between corresponding results: we expect the mean of the differences to be zero if there is no bias between the methods. We then apply a one-sample test to the differences, with H0: µdiff = 0. The observed mean difference is 0.054% and we find that p = 0.016 (for a two-tailed test). This low value tells us that such a large observed difference is unlikely under the null hypothesis, so we feel justified in rejecting it and accepting that, for this range of wheat types, there is a bias between the methods of about 0.05%. The statistics are shown in Box 3.8.1.

Fig. 3.8.1. Results obtained by the analysis of ten different types of wheat for protein nitrogen content. Kjeldahl result ◦, Dumas result •.

Fig. 3.8.2. Differences between corresponding results.
Box 3.8.1 Paired-data two-tailed t-test
H0: µdiff = 0 : HA: µdiff ≠ 0
                     n    Mean     St Dev   SE Mean
Dumas result, %      10   3.1030   0.0981   0.0310
Kjeldahl result, %   10   3.0490   0.1176   0.0372
Difference, %        10   0.0540   0.0578   0.0183
t = 2.96   p = 0.016   DF = 9
95% confidence interval for mean difference: (0.0127, 0.0953)

Notes
• The file containing this dataset is named Wheat types.
• The bias observed in this experiment may be valid only for analyses conducted in a single laboratory. (See also the Notes in §3.6.)

3.9 Paired Data — One-Tailed Test

Key point
— This is a one-tailed test involving paired data.

Nitrogen oxides (NOx) are harmful atmospheric pollutants produced largely by vehicle engines. Their concentration is monitored in large cities. This experiment attempts to tell whether concentrations measured at face level are higher than those measured with a monitor at a height of 5 m (where it is safe from vandalism). One-hour average concentrations were measured every hour for a day at a particular location, with the results shown in units of parts per billion by volume (i.e., mole ratio). The results are plotted by the hour in Fig. 3.9.1. Most of the differences between the two heights are positive (21/24), showing a higher reading at face level. The statistics (Box 3.9.1) show a p-value well below 0.05, so we conclude that the null hypothesis can be safely rejected in favour of the alternative: the
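The paired-data calculation reduces to a one-sample test on the differences, which is easy to verify with a short standard-library sketch:

```python
import math
from statistics import mean, stdev

kjeldahl = [2.98, 2.81, 2.97, 3.15, 3.03, 3.05, 3.24, 3.14, 3.04, 3.08]
dumas    = [3.08, 2.88, 3.02, 3.17, 3.08, 3.21, 3.20, 3.12, 3.11, 3.16]

diffs = [d - k for d, k in zip(dumas, kjeldahl)]   # pairwise differences
t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
print(round(mean(diffs), 3), round(t, 2))  # 0.054 2.96
```

With 9 degrees of freedom this t corresponds to the two-tailed p = 0.016 quoted in the box.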
9. 3. ppb at 5 m 10 13 13 11 15 14 13 15 17 21 24 19 22 24 25 22 24 25 19 20 19 12 14 18 Diﬀerence.1.2. 3. Diﬀerences between pairs of results for NOx .5 m 11 15 16 13 19 16 15 20 18 19 26 22 26 27 24 26 28 23 23 22 21 18 14 19 NOx .5 m (•) and 5 m (◦). ppb at 1. Results for the concentration of NOx in air at a site measured at two heights above ground: 1. Hour 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 NOx . ppb 1 2 3 2 4 2 2 5 1 −2 2 3 4 3 −1 4 4 −2 4 2 2 6 0 1 Fig. .9. Fig.Simple Tests of Signiﬁcance 39 concentration of NOx at face level is signiﬁcantly higher than at 5 m above the pavement.
Box 3.9.1 t-test of the mean difference
H0: µdiff = 0 : HA: µdiff > 0
Variable          n    Mean    St Dev   SE Mean   t      p
Difference, ppb   24   2.167   2.036    0.416     5.21   0.0000
Lower 95% confidence boundary: 1.455

Notes
• The file containing this dataset is named NOx in air.
• We find the difference significantly greater than zero, but in such cases we must always remember to consider the separate question of whether the difference is of important magnitude. Such a question cannot be answered without reference to an external criterion based on the use to which the data will be put. In this case we might consider that a difference of about 2 ppb would be unlikely to affect decisions strongly, so could be safely ignored.

3.10 Potential Problems with Paired Data

Key points
— For a valid test, the differences must be a random sample from a single distribution, or a reasonable approximation to that.
— That is likely to occur only if the concentrations involved are drawn from a relatively short range. If a few anomalous concentrations are present, those data can be safely deleted before the paired-data test.
— It is important to recognise paired data. Treating paired data by a two-sample test will probably lead to an incorrect inference.

The paired-data test is based on the hidden assumption that the differences form a coherent set. As an example, it is a reasonable assumption that the results of the Dumas method in §3.8 are taken from distributions with different means but the same (unknown) standard deviation σD. Likewise the Kjeldahl results are taken to represent values from distributions with unknown standard deviation σK, probably different from σD. These are reasonable assumptions because there is no great variation among the
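Again the test is just a one-sample t-test on the 24 differences. Standard-library sketch; the one-tailed critical value t = 1.714 (23 degrees of freedom, p = 0.05), used for the lower confidence boundary, is from standard tables rather than from the text.

```python
import math
from statistics import mean, stdev

diffs = [1, 2, 3, 2, 4, 2, 2, 5, 1, -2, 2, 3,
         4, 3, -1, 4, 4, -2, 4, 2, 2, 6, 0, 1]  # face level minus 5 m, ppb

se = stdev(diffs) / math.sqrt(len(diffs))
t = mean(diffs) / se
lower_boundary = mean(diffs) - 1.714 * se  # one-sided 95% boundary
print(round(t, 2), round(lower_boundary, 2))  # 5.21 1.45
```

The lower boundary agrees with the boxed value of 1.455 to rounding, and is well above zero, consistent with the very small p-value.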
concentrations of nitrogen in the types of wheat. The differences then have a standard deviation of σdiff = √(σD² + σK²), which we estimated for the t-test from the observed differences as sdiff = 0.0578 (Box 3.8.1). If the assumption of a single standard deviation is unrealistic, the outcome of the test may be misleading.

Consider the following data, which refer to the determination of beryllium in 11 different types of rock (A to L) by two different analytical methods (ICP and AAS), with the difference listed for each rock. Rock type 'L' stands out sharply from the others: about 61.4 ppm Be by ICP and 56.1 ppm by AAS (a difference of 5.3 ppm), while the remaining rocks contain about 3 ppm or less.

If the data are treated naively by conducting a t-test under H0: µdiff = 0 : HA: µdiff ≠ 0 on the complete dataset (all 11 differences), we obtain the result p = 0.16 (Box 3.10.1), an apparently non-significant result that might deceive an inexperienced person. However, an inspection of the data shows that one difference is much greater than any of the others, and this is apparently owing to the much higher concentration of beryllium in Rock type 'L'. At this concentration (about 60 ppm) we would expect the standard deviation of the determination to be much greater than for the rest of the rocks (at concentrations of 3 ppm or less). Consequently, it is appropriate to delete the results pertaining to Rock 'L' from the list and then repeat the test with the remaining ten differences. This gives us the result p = 0.0062, a clearly significant result, contrasting sharply with the naive result. Analysts need have no fear that this procedure amounts to improperly 'adjusting the results': the difference is not deleted because it is an outlier or anomalous per se (although it clearly is), but because Rock type 'L' obviously differs from the others by virtue of the very high concentration of beryllium present.
10.709 St Dev 1.8 show a diﬀerence signiﬁcant at 95% conﬁdence.2224 SE Mean 0.0703 t 1. ppm 11 Mean 0. The diagnostic sign of paired data is that there is extra information available about how they were collected.28. Moreover. In this section. key signs of pairing might be that the observations were made on diﬀerent types of test material. This happens because diﬀerences between any two paired results will often be considerably smaller than differences between sets of pairs.463 SE Mean 0.1 Beryllium data H0 : µdiﬀ = 0 : HA : µdiﬀ = 0 ‘Na¨ ıve’ statistics n Diﬀerence. ppm 10 0. we have a column telling us that each pair of data is obtained from the analysis of a diﬀerent rock type.16 p 0. In analytical data. or by diﬀerent analysts. Treating paired datasets by a twosample test will often eliminate any sign of a real signiﬁcant diﬀerence.537 St Dev 0. For instance. . However. However.2500 Another problem can arise when data are paired but the fact is overlooked. Often this information is explicit.53 t 3. by diﬀerent laboratories. erroneously indicating a nonsigniﬁcant diﬀerence. if incorrectly treated as twosample results under the hypotheses H0 : µICP = µAAS : HA : µICP = µAAS . Note • The ﬁle containing the dataset is named Beryllium methods.42 Notes on Statistics and Data Quality for Analytical Chemists Box 3. for example. the paired results in §3. if the data are incorrectly treated as twosample results under the hypotheses H0 : µDumas = µKjeldahl : HA : µDumas = µKjeldahl we obtain an apparently nonsigniﬁcant pvalue of 0. because of the relatively large diﬀerences among the nitrogen contents of the wheat types.42. the paired beryllium results above. would give a pvalue of 0. Pairing is not diﬃcult to spot if we are looking out for it.0062 ‘Sensible’ statistics n Mean Diﬀerence. sometimes the information about pairing is implicit or stated separately rather than as part of the dataset. Likewise.56 p 0. 
we do not need to know the actual rock type to do the statistics: it is the pairing itself that matters.
Chapter 4

Analysis of Variance (ANOVA) and Its Applications

This chapter treats the statistical technique of analysis of variance (ANOVA) and its most prominent applications in analytical science. ANOVA has many applications in various sectors that utilise (rather than produce) analytical data, agricultural studies for example, but these are not covered in this book.

4.1 Introduction — The Comparison of Several Means

Key points
— Analysis of variance (ANOVA) is a broad method for analysis of data affected by two (or more) separate sources of variation.
— Typically the sources of variation are between and within subsets of results.
— Important applications of ANOVA in analytical science are in (a) homogeneity testing, (b) sampling uncertainty and (c) collaborative trials.

There are four or more recognised methods for the determination of 'fat' in foodstuffs, because 'fat' is not a clearly defined analyte (although its determination is very important commercially). They are thought to give different results from each other. Suppose that we applied the four most popular of these methods ten times each to a well-homogenised batch of dog food, and produced the results shown in Fig. 4.1.1 (Dataset A). Visually there is no apparent reason to believe that the mean results of the four methods differ significantly among themselves. We could quite readily accept the hypothesis that the four sets each comprise random selections
σ0 ) (Fig.1.1.3 (Dataset B).4. We could account for all of the variation by means of one mean µ and one standard deviation σ0 .3. Fig. Results from the analysis of dog food. . Fig. 2 from a single normal distribution N (µ.44 Notes on Statistics and Data Quality for Analytical Chemists Fig. In this set of data there are two separate sources of variation.2). Hypothesis that accounts for the results by assuming that the sets of results came from distributions with different means. µD .1.1. It would seem far more plausible that the four sets of results came from four separate normal distributions. In contrast. 4. we would entertain no such hypothesis if we obtained the results shown in Fig. 4. and the variance of a mean of n results from a single 2 2 group is σ0 /n + σ1 . 4.1. µC . Results from the determination of fat in dog food by using four diﬀerent methods. showing diﬀering mean values of results from four diﬀerent methods. 4. the variance of a single 2 2 observation is σ0 + σ1 .1. Fig. Dataset A. as in Fig. By the principle of addition of independent variances. Dataset B. but there is an independent variation due to the dispersion of the means. There is still the common withinmethod variation designated by σ0 . 4.1.1. Hypothesis that all of the results could be accounted for as four random samples of data from a common normal distribution. 4.2. 4. all with the same standard deviation σ0 but with distinct means µA .4. µB . The standard deviation of the means is designated σ1 .
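The addition of independent variances can be checked by simulation; a minimal sketch, with illustrative values σ0 = 0.2 and σ1 = 0.5 chosen arbitrarily for the demonstration:

```python
import random
from statistics import variance

random.seed(1)
sigma0, sigma1 = 0.2, 0.5      # within-group and between-group SDs (illustrative)
mu, n_groups, n = 10.0, 20000, 10

# Each simulated result is a group mean (drawn with SD sigma1) plus
# a within-group measurement error (SD sigma0).
singles, group_means = [], []
for _ in range(n_groups):
    gmean = random.gauss(mu, sigma1)
    results = [gmean + random.gauss(0, sigma0) for _ in range(n)]
    singles.append(results[0])
    group_means.append(sum(results) / n)

# Variance of a single observation ~ sigma0^2 + sigma1^2
print(variance(singles), sigma0**2 + sigma1**2)
# Variance of a group mean of n results ~ sigma0^2/n + sigma1^2
print(variance(group_means), sigma0**2 / n + sigma1**2)
```

With many simulated groups the empirical variances settle close to the theoretical sums.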
The statistical method called ANOVA handles a range of problems like the one illustrated here. More complex versions of ANOVA have been devised, but only the simpler ones have regular application in analytical science. ANOVA enables us to do two things. First, it enables us to estimate separately the values of σ0 and σ1 from datasets such as those illustrated. This procedure has several important applications in analytical science, including: (a) testing a divided material for homogeneity (§4.5); (b) studying the uncertainty caused by sampling (§4.7); and (c) analysing the results of collaborative trials (interlaboratory studies of method performance) (§4.6).

The second range of applications of ANOVA provides us with a test of whether the differences among a number of groups of results (like the results of the different methods of analysis above) are statistically significant. One way of writing the null hypothesis for this test is H0: µA = µB = µC = ···. What we actually test, however, is H0: σ1 = 0 versus HA: σ1 > 0. There are fewer important applications of this aspect of ANOVA in analytical science, partly because analysts usually need information about chemical systems that is deeper than the simple fact that there are significant differences among them.

Note
• The datasets used in this section (and §4.3) can be found in the files Dogfood dataset A and Dogfood dataset B.

4.2 The Calculations of One-Way ANOVA

Key points
— One-way ANOVA considers a number of groups each containing several results.
— The two primary statistics calculated from the data are: (a) the within-group mean square, MSW, and (b) the between-group mean square, MSB.
— The null hypothesis H0: σ1 = 0 can be tested versus HA: σ1 > 0 by calculating the probability associated with the value of F = MSB/MSW.
— Estimates of σ0 and σ1 can be calculated from the mean squares.

One-way ANOVA is concerned with situations such as those shown in §4.1, where there is a source of variation (between groups) beyond the usual measurement errors. We consider a general case with m groups each containing n results. (It is not necessary to have equal numbers of results in each group, but it simplifies the explanation and the notation.) Each result xji has two subscripts, the first referring to the jth row of data (the group) and the second to the ith result in the row.

Group 1:  x11 x12 ··· x1i ··· x1n    x̄1 = Σi x1i/n    s1² = SS1/(n − 1)
Group 2:  x21 x22 ··· x2i ··· x2n    x̄2 = Σi x2i/n    s2² = SS2/(n − 1)
···
Group j:  xj1 xj2 ··· xji ··· xjn    x̄j = Σi xji/n    sj² = SSj/(n − 1)
···
Group m:  xm1 xm2 ··· xmi ··· xmn    x̄m = Σi xmi/n    sm² = SSm/(n − 1)

The mean and variance of the results in the jth row are shown as x̄j, sj² (notice the single subscript for the row statistics). They have the usual definitions, although the notation is simplified here: Σi xi is used as shorthand for the sum over i = 1 to n of xi, while SSj is shorthand for the jth sum of squares, namely Σi (xji − x̄j)².

For one-way ANOVA we need to calculate two variances. First, we can calculate a within-group estimate of variance by pooling the information from the m groups. This estimate is usually called the 'mean square within-group', designated MSW and given by

MSW = Σj SSj / (m(n − 1)).

The numerator is the total of all the group sums of squares. The denominator is the total number of degrees of freedom: as each row has (n − 1) degrees of freedom and there are m rows, the total is m(n − 1). MSW estimates σ0². (Note that this definition of pooled variance is consistent with that used in §3.4.)

The second estimate is called the 'mean square between-groups', designated MSB. First we calculate the grand mean x̄ (with no subscript), which is the mean of the row means, x̄ = Σj x̄j/m. The variance of the row means is obtained by applying the ordinary formula for variance, giving Σj (x̄j − x̄)²/(m − 1). This latter statistic, being based on means of n results, estimates σ1² + σ0²/n, which is n times smaller than we want. Multiplying by n, we have

MSB = n Σj (x̄j − x̄)²/(m − 1),

which estimates nσ1² + σ0². When σ1 is zero (as it would be under H0), but only then, MSB also estimates σ0². Thus, if H0 is true, the expected ratio F = MSB/MSW would be unity, but the value calculated from data would vary according to the F distribution with the appropriate number of degrees of freedom (§3.5). We need to see whether the deviation of F from unity is significantly large by observing the corresponding probability. From these considerations we see that MSW estimates σ0², and (MSB − MSW)/n estimates σ1².

Notes
• The denominator in the expression estimating σ1² is n (the number of repeat results in a group), not m (the number of groups).
• ANOVA calculates probabilities under the assumptions that: (a) the groups all have a common standard deviation σ0; (b) errors are normally distributed; and (c) the variations within-groups and between-groups are independent. Unlike two-sample tests, ANOVA cannot readily take proper account of different variances among groups of data. Fortunately, it is often possible through good experimental design to ensure that assumption (a) is more or less correct. Where no such assurance is possible, the scientist has to use judgement about the applicability of the pooled variances and probabilities resulting from the use of ANOVA.

4.3 Example Calculations with One-Way ANOVA

Key points
— We can use one-way ANOVA to conduct a significance test by calculating the probability associated with F = MSB/MSW.
— We can estimate σ0 as s0 = √MSW.
— Where the F-ratio is big enough, we can estimate σ1 as s1 = √((MSB − MSW)/n).

Suppose we apply these calculations to the two datasets used in §4.1. First we look at Dogfood dataset A (% mass fraction).
Dataset A comprises ten results (% mass fraction) for each of Methods A–D, with all four groups clustered closely together; Dataset B comprises ten results for each of the same four methods, but with visibly dispersed group means. (The full values are in the files Dogfood dataset A and Dogfood dataset B.)

If we calculate the mean squares using the formulae in §4.2, for Dataset A we obtain MSW = 0.048 and MSB = 0.063. We would expect the ratio F = MSB/MSW to follow the F distribution with m − 1 and m(n − 1) degrees of freedom (see §3.5), so we can use the probability associated with F as a test of significance regarding H0. Under the null hypothesis H0: µA = µB = µC = µD, σ1 must be zero, so both mean squares would be independent estimates of σ0². If the value of p is low, lower than 0.05 say, the results would be unlikely to occur under H0. For Dogfood dataset A we find that F = 0.063/0.048 = 1.30, and the corresponding probability is p = 0.29. With this high probability there would be no justification for rejecting H0, so we infer that there are no significant differences among the means of the method results.

The corresponding statistics from Dataset B are: MSW = 0.048, MSB = 1.89, F = MSB/MSW = 1.89/0.048 = 39.4, and p < 0.0005. With this high value of F and the small probability, the results are very unlikely to occur under the assumption of H0, so we would be justified in rejecting it. We infer that there is a genuine difference among the means of the results of the methods, that is, σ1 > 0.

For Dataset B we can therefore estimate σ1 with some confidence. As MSW estimates σ0² and MSB estimates nσ1² + σ0², we see that s0 = √MSW is the estimate of σ0, while s1 = √((MSB − MSW)/n) is the estimate of σ1. This gives us:

s0 = √0.048 = 0.22%,
s1 = √((1.89 − 0.048)/10) = 0.43%.
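The mean-square calculations of §4.2 and §4.3 can be sketched in a few lines; this is a generic one-way ANOVA for m equal-sized groups, demonstrated on a small invented dataset rather than a transcription of the dog-food data.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (MSW, MSB, F) for a list of equal-sized groups of results."""
    m, n = len(groups), len(groups[0])
    group_means = [mean(g) for g in groups]
    grand_mean = mean(group_means)
    # Within-group mean square: pooled sum of squares over m(n - 1) df.
    ss_within = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)
    msw = ss_within / (m * (n - 1))
    # Between-group mean square: n times the variance of the group means.
    msb = n * sum((gm - grand_mean) ** 2 for gm in group_means) / (m - 1)
    return msw, msb, msb / msw

# Small illustrative dataset: three groups of four results.
groups = [[10.1, 10.3, 9.9, 10.2], [11.0, 11.2, 10.8, 11.1], [10.5, 10.6, 10.4, 10.6]]
msw, msb, f = one_way_anova(groups)
s0 = msw ** 0.5                   # estimate of sigma_0
s1 = ((msb - msw) / 4) ** 0.5     # estimate of sigma_1 (n = 4 here)
print(msw, msb, f, s0, s1)
```

A large F, as here, signals that MSB is dominated by real between-group dispersion, so the s1 estimate is meaningful.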
Notes
• The files containing the datasets in this section are named Dogfood dataset A and Dogfood dataset B.
• The denominator in the expression for s1 is n (the number of results in a group), not m (the number of groups).
• In instances where the expression for s1 requires the square root of a negative number (i.e., where MSB < MSW), it is customary to set s1 = 0. This simply means that σ1 is too small to estimate meaningfully (or, alternatively, that σ0 is too large to do the job adequately).
• Unless σ1 is somewhat greater than σ0, estimates s1 will tend to be very variable and not much use.
• There are essentially two distinct types of application of ANOVA. The first type is where primarily we want to test for significant differences among a number (> 2) of means. The second type is where we are not interested in testing for significance (which we can often take for granted) but want to estimate the separate variances σ0² and σ1². It is important to be aware of this difference.

4.4 Applications of ANOVA: Example 1 — Catalysts for the Kjeldahl Method

Key points
— In this fixed-effect experiment, there were predetermined categorical differences between the groups of results.
— As such, the experiment left unanswered some important, more general, questions about the methods.

In this experiment we consider the effects of different catalysts on the results of the Kjeldahl method for determining the protein content of a meat product. It is hoped that the commonly-used catalysts HgO and SeO2, which leave toxic residues that are difficult to dispose of, can be replaced by the copper-containing catalysts that do not have that problem. The determination is carried out by one analyst, ten times with each catalyst, keeping all of the conditions as similar as possible. The results and a dotplot are shown below (Fig. 4.4.1): ten results for each of the catalysts HgO, SeO2, CuSO4 and CuO/TiO2. The groups have similar dispersions, so the assumption of a common group variance (see §4.2) is reasonable, although there is a low suspect result among those for CuO/TiO2. Variation among the group means is visible but, given the dispersion of the results, it is not clear whether the variation is significant.

Fig. 4.4.1. Results (points) for different catalysts showing variation among means (arrows).

ANOVA gives the following results.

Box 4.4.1
Source of variation    Degrees of freedom    Sum of squares    Mean square    F       p
Between group          3                     2.314             0.771          2.70    0.060
Within group           36                    10.293            0.286
Total                  39                    12.607
The probability is p = 0.06, which is low enough to make us suspect that there is a real difference among the mean results of the catalysts, even though the confidence does not quite reach the 95% level.

Notes
• The dataset used in this section can be found in the file Kjeldahl catalysts.
• The above is an example of a fixed-effect experiment, which is conducted when we wish to see whether there is a significant effect when categorical differences between the groups have been deliberately brought about by the experimenter. There are relatively few instances in analytical science where this type of experiment is conducted. More frequently (§§4.5–4.8) we encounter experiments with random effects between the groups.
• There are a number of practical problems with this experiment. First, we do not know what would happen if different types of foodstuffs were used instead of the meat product: the catalysts might behave differently with other types of food, or at other concentrations of protein, or simply under the different conditions found in other laboratories. Second, we do not know enough about how the analyses were done. There might be systematic effects due to changed conditions if the results for each catalyst were obtained in a temporal sequence, especially if they were done on different days. The only way to avoid that would be to do the determinations in a randomised sequence (which could be difficult to organise in practice!).

4.5 Applications of ANOVA: Example 2 — Homogeneity Testing

Key points
— A simple test based on ANOVA permits us to test for lack of homogeneity in a material.
— The usefulness of such a test depends critically on the precision of the analytical method.

A commonly encountered application of ANOVA in analytical chemistry is where we are testing a material (usually some kind of reference material) for homogeneity, to make sure that there is no measurable difference between different portions of the material. ('Measurable' here signifies under appropriate test conditions — we can usually find some difference given sufficient resources.) This is a very common exercise carried out by proficiency test providers and laboratories that manufacture reference materials. The bulk material is carefully homogenised (ground to a fine powder if solid and thoroughly mixed). It is then divided and packed into the portions that are to be distributed. A number of the packaged portions (10–20) are selected at random and analysed in duplicate (with the analyses being carried out in a random order). The generalised scheme is as shown in Fig. 4.5.1, with m > 9.

Fig. 4.5.1. General experimental layout in a homogeneity test.

In this example, silicon as SiO2 is determined in a rock powder: ten randomly selected portions are each analysed in duplicate (Result 1 and Result 2, %). (The full results are in the file Silica.) Figure 4.5.2 shows that there are no 'outlying' sample sets and no unusually large differences between corresponding duplicate results. (At first glance we might suspect the large difference between the sample 3 results, but it is only slightly greater than that of sample 1, which in turn is only slightly greater than that of sample 5, so there is a gradation of differences rather than a single outstanding difference.) So we don't see any suspect results or obvious significant differences among the mean results.

Fig. 4.5.2. Result of duplicate determination of SiO2 in ten random samples from a bulk powdered rock.

We now carry out the ANOVA and obtain the following results.

Box 4.5.1
Source of variation    Degrees of freedom    Sum of squares    Mean square    F       p
Between samples        9                     0.07302           0.00811        2.38    0.097
Between analyses       10                    0.03410           0.00341
Total                  19                    0.10712

Here we see that p = 0.097 > 0.05, so we cannot reject H0 at the 95% level of confidence. As expected, we find no significant differences among the ten mean results, and hence the material passes the homogeneity test. That, of course, is the expected outcome, as the material has been carefully homogenised — the test is simply to make sure that nothing has gone wrong.
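With duplicate analyses (n = 2), each sample's within-sample sum of squares reduces to d²/2, where d is the difference between the two results. A sketch of the homogeneity calculation using that shortcut; the duplicate values below are hypothetical stand-ins for the SiO2 results, not the book's dataset.

```python
from statistics import mean

# Hypothetical duplicate results (%) on ten randomly selected portions.
pairs = [(71.04, 71.09), (71.00, 70.91), (71.15, 70.98), (71.05, 71.06),
         (70.99, 71.04), (71.11, 71.06), (70.96, 71.04), (71.04, 71.09),
         (71.06, 70.99), (71.17, 71.09)]

m, n = len(pairs), 2
sample_means = [mean(p) for p in pairs]
grand_mean = mean(sample_means)

# For duplicates, each pair contributes d^2/2 to the within-sample SS.
ss_within = sum((a - b) ** 2 / 2 for a, b in pairs)
msw = ss_within / (m * (n - 1))
msb = n * sum((x - grand_mean) ** 2 for x in sample_means) / (m - 1)
f = msb / msw
print(msw, msb, f)  # compare F with the critical value of F(9, 10)
```

If F falls below the tabulated critical value, the material passes the test, exactly as in the worked example above.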
Notes and further reading
• The dataset used in this section can be found in the file named Silica.
• This example demonstrates one-way ANOVA with 'random effects'. There are two sources of random variation acting independently, namely variation between the true concentrations of silica in the samples and variation between the replicated results on each sample. This contrasts with 'fixed effects' (§4.4), where the variation between the groups is imposed by the experimenter.
• More strictly, we should say that no significant heterogeneity was found by using that particular analytical method. The test above compares the between-sample variation with the variation of the analytical method used in the test. Use of a more precise analytical method, resulting in a smaller mean square MSW, would enhance the F value and could easily provide a significant outcome for the same material. The between-sample variation could nevertheless be irrelevant to the users of the material, so it is better to compare σ1 with a criterion based on user needs. In practice, a more sophisticated test for lack of homogeneity, based on the same analysis of variance, is preferable to a simple test of H0: σ1 = 0. That, however, is beyond the scope of the current text.
• 'Test for sufficient homogeneity in a reference material', AMC Technical Brief No. 17A (2008). (Free download via www.rsc.org/amc.)
• Fearn, T. and Thompson, M. (2001). A New Test for Sufficient Homogeneity. Analyst, 126, pp. 1414–1417.

4.6 ANOVA Application 3 — The Collaborative Trial

Key point
— A collaborative trial (interlaboratory method performance study) allows us to estimate the repeatability standard deviation σr and the reproducibility (between-laboratory) standard deviation σR.

The collaborative trial is an interlaboratory study to determine the characteristics of an analytical method. The main characteristics determined are the repeatability (average within-laboratory) standard deviation and the reproducibility (between-laboratory) standard deviation. These are obtained from the results of an experiment where each participant laboratory (of at least eight) analyses a number of materials (at least five) in duplicate by a method prescribed in considerable detail (i.e., they all use the same method as far as possible). The results for each material are subjected separately to one-way ANOVA. The layout of the experiment for each material is shown in Fig. 4.6.1.

Fig. 4.6.1. Experimental layout for one material in a collaborative trial.

A typical set of results, for the concentration of copper (ppm) in one type of sheep feed, comprises duplicate results (Result 1 and Result 2) from 11 laboratories. (The full values are in the file Copper.) We can see in Fig. 4.6.2 that the variation among laboratories is roughly comparable with that between the duplicate results of any one laboratory. There are no clearly outlying laboratories or suspect duplicate differences. Laboratory 1 has produced the biggest difference between results, but it is not much bigger than those of Laboratories 6 and 11, so we accept the data as they are. (In the collaborative trial, outliers are conventionally deleted, up to a certain proportion, according to a strict protocol, because we are interested in the properties of the method, not the behaviour of the laboratories — see Chapter 9.)

Fig. 4.6.2. Results of the duplicate analysis of a batch of sheep feed for copper in 11 different laboratories.

ANOVA gives us:

Box 4.6.1
Source of variation     Degrees of freedom    Sum of squares    Mean square    F       p
Between laboratories    10                    7.574             0.757          3.06    0.040
Within laboratories     11                    2.725             0.248
Total                   21                    10.299

We calculate:

s0 = √MSW = √0.248 = 0.50 ppm,
s1 = √((MSB − MSW)/2) = √((0.757 − 0.248)/2) = 0.504 ppm.

The 'repeatability standard deviation' is sr = s0 = 0.50 ppm.
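The precision estimates follow directly from the mean squares in Box 4.6.1; a sketch (n = 2 because each laboratory reports duplicate results), with the reproducibility combining the within- and between-laboratory components:

```python
from math import sqrt

msw, msb, n = 0.248, 0.757, 2    # mean squares from Box 4.6.1; duplicate results

s0 = sqrt(msw)                   # repeatability SD, s_r
s1 = sqrt((msb - msw) / n)       # between-laboratory SD
sR = sqrt(s0**2 + s1**2)         # reproducibility SD combines both components

print(round(s0, 2), round(s1, 3), round(sR, 2))  # -> 0.5 0.504 0.71
```

The same three lines apply to any collaborative-trial material once its ANOVA table is available.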
The 'reproducibility standard deviation' is defined as sR = √(s0² + s1²) = 0.71 ppm.

Notes
• The dataset used in this section can be found in the file Copper.
• Notice that n = 2, as we have duplicate determinations (see §4.3).
• There is more information about collaborative trials in Chapter 9.

4.7 ANOVA Application 4 — Sampling and Analytical Variance

Key points
— Sampling almost always precedes analysis. Because sampling targets are heterogeneous, samples from the same target differ in composition.
— Sampling therefore introduces error into the final result.
— ANOVA, applied to the results from a properly designed experiment, can give useful estimates of the sampling standard deviation.

Nearly all analysis requires the taking of a sample, a procedure that itself introduces uncertainty into the final result. Another type of application of ANOVA involving sampling is therefore where we want to quantify the variance associated with sampling. Suppose that a routine procedure calls for the taking of a sample of soil from a field by a carefully described method, and the analysis of the sample by another carefully described analytical method. We can design a suitable experiment to estimate the separate variances associated with sampling and analysis. We take a number of samples from the field by the given procedure, and then analyse the samples in duplicate in a completely random order. This looks like a similar experiment to that in §4.5, and the schematic for the experiment (Fig. 4.7.1) is the same, but there is a difference: here we expect the samples to vary in composition, because soil is often very heterogeneous. So we are not really interested in a significance test — we can assume from the start that the samples are different — but we want to know by how much they differ.

Fig. 4.7.1. A simple design for the determination of sampling variance and analytical variance.

For an example we use results for cadmium (ppm) from ten soil samples, each analysed in duplicate. (Note: cadmium levels are exceptionally high in this field. The full values are in the file Cadmium.) Figure 4.7.2 shows, as is to be expected, considerable differences between the samples and rather less between the duplicate results on each sample. There are no suspect data (i.e., seriously discrepant duplicate results on a sample, or wildly outlying samples, as judged from the mean results of the duplicates), and there is no obvious reason to doubt the usual assumptions. (For sample 6 the two results coincide.)

Fig. 4.7.2. Duplicate results from the analysis of ten samples of soil from a field.

One-way ANOVA gives the following.

Box 4.7.1
Source of variation    Degrees of freedom    Sum of squares    Mean square    F       p
Between samples        9                     160.055           17.784         21.18   0.000
Between analyses       10                    8.395             0.840
Total                  19                    168.449

From the mean squares we calculate the analytical standard deviation as

sa = s0 = √MSW = √0.840 = 0.92 ppm.

The sampling standard deviation is given by

ss = s1 = √((MSB − MSW)/n) = √((17.78 − 0.84)/2) = 2.9 ppm.

(Notice that n = 2 because we have duplicate analyses — see §4.3.) Here we see, as often happens, that the sampling standard deviation is substantially greater than the analytical standard deviation. We can see whether these variances are 'balanced'. The total standard deviation for a combined operation of sampling and analysis is going to be

stot = √(ss² + sa²) = √(2.91² + 0.92²) = 3.05 ppm.

The analytical variation makes hardly any contribution to this total variation — the sampling variation dominates. There is no point in trying to improve the precision of the analytical method, because it will cost much more money with no effective improvement in the uncertainty of the overall result. If the analytical standard deviation were much smaller at 0.46 (e.g., if the analytical method were twice as precise as it is), the total standard deviation would be √(2.9² + 0.46²) = 2.94, hardly changed. As a rule of thumb, analysts should try to get 1/3 < σa/σs < 3.

Notes
• The dataset used in this section can be found in the file Cadmium.
• There is more information about sampling variance in Chapter 12.

4.8 More Elaborate ANOVA — Nested Designs

Key points
— Nested designs are used in combination with ANOVA when there are two or more sources of measurement error.
— They are typically used by analytical chemists for orientation surveys (method validation) and for studying uncertainty from sampling.

Hierarchical (or 'nested') designs accommodate datasets that have more than one independent source of error beyond the simple measurement error. They have a range of applications in analytical science. In an example like Fig. 4.8.1, multiple fields have been sampled in duplicate, and each sample has also been analysed in duplicate. From the results of such an experiment, we can estimate three variances: the analytical variance σa² (= σ0²), the sampling variance σs² (= σ1²) and the between-site variance σsite² (= σ2²).
Ten playgrounds were selected in an inner city area and were sampled in duplicate. there is a reasonable chance of diﬀerentiating between diﬀerent sites by sampling and analysis. In the column headings.7. In such an example.1. rather than just one site. (These very high results refer to a period before leaded petrol was banned. The results were as follows (ppm of lead in dried dust). if the complete mea2 2 2 surement variance σa + σs is considerably smaller than σsite . Each sample was then analysed in duplicate. S1A2 the second analytical result on the ﬁrst sample and so on.Analysis of Variance (ANOVA) and Its Applications 61 Fig. Balanced nested design for an ‘orientation survey’. 4. S1A1 indicates the ﬁrst analytical result on the ﬁrst sample. and giving some guidance to its ﬁtness for purpose. and is favoured for the study of sampling uncertainty. For example. This is a classic design (known as an ‘orientation survey’) for validating a measurement procedure comprising sampling plus analysis. As an example we consider the capabilities of protocols for analysis and sampling proposed for a major survey of lead in playground dust.8. 2 2 σsite (= σ2 ). This is because the estimate will be averaged over a number of typical sites.) . The design also provides a more rugged estimate of the sampling 2 variance σs than the design shown in §4. all three levels of variation can be regarded as random. and so will be more representative of sites in general.
58 3. 4 In this instance the proposed methods would be able to diﬀerentiate reasonably well between diﬀerent playgrounds of the type in the survey.94 p 0.8.000 0.004 From the mean squares we can calculate the following.8.2).62 Notes on Statistics and Data Quality for Analytical Chemists Playground S1A1 1 2 3 4 5 6 7 8 9 10 714 414 404 759 833 621 455 635 589 783 S1A2 719 387 357 767 777 602 472 708 609 764 S2A1 644 482 380 711 748 532 498 694 591 857 S2A2 602 499 408 636 788 520 389 684 606 803 Box 4. 4. M S1 − M S0 2 3761 − 956 = 37 ppm. Analytical standard deviation = Betweensample standard deviation = = Betweensite standard deviation = = M S0 = √ 956 = 31 ppm.1 Analysis of variance for results for lead Source of variation Between sites Between samples Analytical Total Degrees of freedom 9 10 20 39 Sum of squares 764525 37613 19115 821253 Mean square MS2 = 84947 MS1 = 3761 MS0 = 956 F 22. . 2 M S2 − M S1 4 84947 − 3761 = 143 ppm. This can be seen in a plot of the results (Fig.
Fig. 4.8.2. Results from duplicated sampling and analysis of ten playgrounds, distinguishing between results from the first sample (solid circles) and the second sample (open circles).

Notes
• The dataset used in this section can be found in the file Playground.
• There is another example of a nested experiment in §12.6.

4.9 Two-Way ANOVA — Crossed Designs

Key points
— Crossed designs are used when we want to study results classified by two factors. There are two imposed sources of variation plus the random measurement variation.
— There are relatively few applications of this technique in analytical chemistry itself, but numerous examples using analytical data in various application sectors.

Crossed designs are seldom used in analytical science as such, but the following experiment serves as an example. An investigation into the loss of weight on drying of a foodstuff subjected to different temperatures and times of heating provided duplicated results (% loss) at each combination of three temperatures (80, 100 and 120°C) and three heating times (1, 3 and 15 hours), illustrated in Fig. 4.9.1. Inspection suggests that there is little difference in weight loss between one and three hours at any temperature, but that at 15 hours there is an extra weight loss at all three temperatures, the loss being more marked at 120°C. Likewise, there is little difference between corresponding results at 80° and 100°C.

Fig. 4.9.1. Results showing weight loss at different temperatures and durations of heating.

Two-way crossed ANOVA gives the following results.

Box 4.9.1 Analysis of variance of the moisture data
Source of variation    Degrees of freedom    p
Time                   2                     0.000
Temp                   2                     0.000
Time*Temp              4                     0.056
Error                  9
Total                  17
This output is telling us that both temperature and time separately have significant effects on the result of the drying, with p-values well below 0.05. Moreover, because we have duplicated results in all cells of the table, ANOVA can provide an estimate of the interaction between temperature and time (Time*Temp in the table), and in fact it finds that this interaction should probably be regarded as significant, with p = 0.056 (although the confidence does not quite reach 95%). We can see that the interaction term is largely due to the fact that the loss of weight at 15 hours for 120°C is greater than would be predicted by adding the individual effects of time and temperature. While providing a test of significance, however, the ANOVA itself does little to help us — we have to use the diagram to see what it means.

Note
• The dataset used in this section can be found in the file Moisture.

4.10 Cochran Test

Key points
— The Cochran test compares the variances for a number of datasets.
— It is primarily used to test for uniformity of variance before analysis of variance.

The Cochran test compares the highest individual variance with the sum of the variances for all the datasets. Its purpose is to determine whether the largest variance is significantly larger than the others. The test statistic is calculated as

C = s²max / Σ(i=1 to m) s²i

where m is the number of groups. There are two parameters in this test, the number of groups (m) and the degrees of freedom (ν = n − 1). If each dataset contains only two values, then the standard deviation (s) is replaced by the range (d) in the equation. If the calculated value exceeds the critical value, the largest variance is considered to be inconsistent with the variances of the other datasets.
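The statistic is simple enough to sketch as a one-line function; the group variances below are hypothetical values for illustration, and the critical value would be taken from published Cochran tables for the appropriate m and ν.

```python
def cochran_c(variances):
    """Cochran test statistic: largest variance over the sum of all variances."""
    return max(variances) / sum(variances)

# Hypothetical group variances for m = 4 groups.
variances = [0.27, 0.19, 0.20, 0.48]
c = cochran_c(variances)
print(round(c, 3))  # compare with the tabulated critical value for (m, nu)
```

If C exceeds the tabulated critical value, the group with the largest variance is flagged as inconsistent with the rest.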
An example using the Kjeldahl catalysts data from §4.4 gives the following variances for the four catalysts.

    HgO       SeO2      CuSO4     CuO/TiO2
    0.2721    0.1915    0.1982    0.4820

The test statistic is C = 0.4820/1.1438 = 0.4214. In this example m = 4 and n = 10 (so ν = 9). The critical value from tables is 0.5017, so as 0.4214 is less than this value, the null hypothesis is retained: the variance for catalyst CuO/TiO2 is not regarded as inconsistent with the other variances.

Notes
• Tables of critical values for this test are available both in textbooks and online.
• Use of the Cochran test in collaborative trials is described in Chapter 9.

4.11 Ruggedness Tests

Ruggedness is the capacity of a method to provide accurate results in the wake of minor variations in procedure, such as might occur when the procedure is undertaken in different laboratories. An analytical method is made up of a moderate number of separate steps, carried out sequentially, each step carefully defined. However, analysts are expected to use some judgement in the execution of a prescribed method and act accordingly. For instance, if the procedure says 'boil for one hour', most analysts would expect the method to be equally accurate if the actual period was between 55 and 65 minutes. Ruggedness can be tested by subjecting each separate step in the method to plausible levels of such variation and observing the measurement results; as there are many stages, this may take some time. Subjecting an analytical method to a collaborative (interlaboratory) trial (Chapter 9) is costly (around £50,000 at 2010 prices), so it is important that the methods tested have no unexpected defects. A ruggedness test comprises a relatively inexpensive means of screening a method for such defects. A very economical alternative
is to test variations in all of the steps simultaneously. A special design for such an experiment has been developed by Youden. The Youden design requires 2n experiments for testing up to 2n − 1 factors (i.e., the steps of the analytical procedure that are under test). A widely useful size for analytical chemistry is eight experiments, which can test up to seven factors. Each experiment comprises the method with a particular combination of the factors at perturbed levels. In Table 4.11.1 the lower perturbed levels are indicated by −1 and the higher levels by 1; the combinations shown are a special subset of the 128 possible different combinations.

Table 4.11.1.  Experimental design for a ruggedness test.

                  Experiment number
             1    2    3    4    5    6    7    8
Factor 1     1    1    1    1   −1   −1   −1   −1
Factor 2     1    1   −1   −1    1    1   −1   −1
Factor 3     1   −1    1   −1    1   −1    1   −1
Factor 4     1    1   −1   −1   −1   −1    1    1
Factor 5     1   −1    1   −1   −1    1   −1    1
Factor 6     1   −1   −1    1    1   −1   −1    1
Factor 7     1   −1   −1    1   −1    1    1   −1
Result       x1   x2   x3   x4   x5   x6   x7   x8

The result of each experiment is the apparent concentration of the analyte. The effect of a factor is estimated by the mean result for the higher level minus the mean result for the lower (or vice versa — it doesn't matter). So for (say) Factor 2 the effect is

    (x1 + x2 + x5 + x6)/4 − (x3 + x4 + x7 + x8)/4.

The results for the higher level group and the lower level group contain the effects of each of the other six factors exactly twice, so the extraneous effects 'cancel out'. The design ensures that this cancelling occurs with every factor. If the original method were completely rugged (completely insensitive to the combinations of changes) the variation in the results would estimate the repeatability standard deviation of the method. Given an appropriate choice of perturbed levels and minor effects, the variation might be somewhat larger.
For instance, if the method protocol said 'boil for one hour', the respective perturbed levels tested might be 50 and 70 minutes.
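The design in Table 4.11.1 and the effect calculation can be expressed compactly. The sketch below uses hypothetical results x1 to x8 (not the patulin data of the example that follows):

```python
import numpy as np

# Youden design from Table 4.11.1: rows are factors, columns are the
# eight experiments; +1 marks the higher perturbed level.
design = np.array([
    [ 1,  1,  1,  1, -1, -1, -1, -1],  # Factor 1
    [ 1,  1, -1, -1,  1,  1, -1, -1],  # Factor 2
    [ 1, -1,  1, -1,  1, -1,  1, -1],  # Factor 3
    [ 1,  1, -1, -1, -1, -1,  1,  1],  # Factor 4
    [ 1, -1,  1, -1, -1,  1, -1,  1],  # Factor 5
    [ 1, -1, -1,  1,  1, -1, -1,  1],  # Factor 6
    [ 1, -1, -1,  1, -1,  1,  1, -1],  # Factor 7
])

# The rows are mutually orthogonal, which is why the other six factors
# cancel out of every effect estimate.
assert np.all(design @ design.T == 8 * np.eye(7))

# Hypothetical analytical results x1..x8 (ppm)
x = np.array([95.0, 93.0, 78.0, 77.0, 85.0, 91.0, 80.0, 101.0])

# Effect of each factor: mean at the higher level minus mean at the lower
effects = design @ x / 4.0
print(effects.round(2))
```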
The effects of all the factors should be listed and compared. If there are no significant effects, they will all be of comparable magnitude; a significant effect would be much greater than the majority. The method is most effective if there are no significant effects or only one. That would be the expected outcome for a properly developed method. A minor ambiguity of interpretation would occur if there were an interaction between two (or more) of the factors. For example, an interaction between Factors 6 and 7 could be mistaken for (or masked by) a main effect due to Factor 1. Such interactions should be rare in a null test (i.e., with no main effects) or a test with only one significant main effect.

Example
The analysis under consideration is the determination of patulin (a natural contaminant) in apple juice. The critical factors in the analytical method and their perturbed levels are shown in Table 4.11.2. The procedure comprises:

1. Extract with ethyl acetate, volume (ml): 12 or 8.
2. Cleanup with Na2CO3 solution, concentration (g/100 ml): 1.4 or 0.95.
3. Cleanup duration (s).
4. Evaporate solvent, final temperature (°C): 100 or 80.
5. Dissolve residue in dilute acetic acid, volume (ml): 1.05 or 0.8.
6. Determination by HPLC, injection volume (µl).
7. (Dummy).

Table 4.11.2.  Result of the ruggedness test applied to a method for the determination of patulin. [The table gives, for each of the eight experiments, the level of every factor and the analytical result, ppm. The cell values are only partly recoverable from this extraction; the ethyl acetate volume (Factor 1) was 12 ml in experiments 1 to 4 and 8 ml in experiments 5 to 8.]

As there are six factors we need an eight-determination layout. The seventh factor (labelled 'Dummy'), needed to bring the number of factors to the nominal seven,
can be regarded as the null operation, i.e., 'Do nothing'. The perturbed levels were arranged as in Table 4.11.2, showing the conditions for the eight experiments and the corresponding analytical results.

We can make a good appraisal of this outcome simply by observing the results as boxplots paired for each factor (Fig. 4.11.1). The unshaded boxes represent results for the lower perturbed levels, the shaded boxes the upper perturbed levels. We immediately see that Factor 1 (volume of ethyl acetate) seems to indicate a significant effect, Factor 2 (concentration of the cleanup reagent) a possible effect and the other factors no effect.

Fig. 4.11.1. Paired boxplots of the results for each factor (as numbered in Table 4.11.2).

We can also calculate the effects. As an example we consider the effect of the Factor 7: from the design, this is

    (x1 + x4 + x6 + x7)/4 − (x2 + x3 + x5 + x8)/4.

The dummy factor (No. 7) gives a useful simulation of a null effect. A complete list of factors and effects, in decreasing magnitude, is as follows.

1. Extract with ethyl acetate, volume (ml): 13.4 ppm
2. Cleanup with Na2CO3 solution, concentration (g/100 ml): −6.4 ppm
5. Dissolve residue in dilute acetic acid, volume (ml)
4. Evaporate solvent, final temperature (°C)
7. (Dummy)
6. Determination by HPLC, injection volume (µl)
3. Cleanup duration (s)

[The five smallest effects are of the order of a few ppm or less; their individual values are not recoverable from this extraction.]
The repeatability standard deviation of the method was separately determined to be 8 ppm at this concentration. An effect is the difference between two means of four results each, so the expected standard deviation of a null effect should be about 8√(1/4 + 1/4) ≈ 5.7 ppm. This shows that the volume of ethyl acetate probably has a significant effect and should be more carefully controlled. None of the other factors are significant on that basis. As there is only one significant effect, the possibility of confounding interactions can be ignored in this context.

Further reading
• Youden, W.J. and Steiner, E.H. (1975). Statistical Manual of the AOAC. AOAC International, Washington DC. ISBN 0-935584-15-3.
Chapter 5

Regression and Calibration

Linear regression is the natural approach to analytical calibration: it is a method of fitting a line of best fit (in some defined sense) to calibration data. Regression is capable of telling us nearly all that we need to know about the quality of a well-designed calibration dataset, including the likely uncertainty in unknown concentrations estimated thereby. It also has an application in comparing the results of different analytical methods.

5.1 Regression

Key points
— Regression is a method for fitting a line to experimental points.
— Regression uses experimental points xi, yi (i = 1, …, n).
— Linear regression makes use of a specific set of assumptions about the data, known as the 'model', as follows: (a) there is an unknown true functional relationship y = α + βx; (b) the x-values are fixed by the experimenter, and these points are taken as fixed numbers when the experiment is complete; (c) the y-values are experimental results and therefore subject to variation under repeated measurement; and (d) in simple regression a single unknown variance σ² describes the variation of the y-values around the true line.

Regression is sometimes loosely described as 'fitting the best straight line' to a set of points. There are in fact a number of ways of fitting a line to such a set of points, and which method is 'best' depends on both the intentions of the scientist and the nature of the data. Simple 'least-squares' regression is
perhaps the method most widely used, and one that has especial relevance for calibration in analytical chemistry. It is worth taking the effort to understand how it works, because if misapplied it can provide misleading results.

Simple linear regression is based on a specific model of the data. First, we assume that there is a true linear relationship between two variables x and y, namely y = α + βx, which represents a straight line with slope β and intercept α. (The intercept is the value of y when x is zero.) Second, we assume that n values of the x-variable, x1, x2, x3, …, xn, are fixed exactly by the experimenter. (Remember that the x-values are exactly set: they are not the results of measurements.) At each value xi the corresponding value yi is measured. As yi is the result of a measurement, its value will not fall exactly on the line y = α + βx, and will be different each time the measurement is repeated. Third, we assume that yi is normally distributed with a mean value of α + βxi and a variance of σ². This model can be written as

    yi = α + βxi + εi,    εi ∼ N(0, σ²),

the εi each independently distributed around the line as N(0, σ²).

The model is illustrated in Fig. 5.1.1. For the marked x-value, the corresponding y-value will be somewhere within the range of the indicated normal distribution. The point is more likely to fall closer to the centre of the distribution (on the line) than in the tails. Other x-values give rise to corresponding y-values. When the scaffolding of the model is stripped away, we are left with the bare experimental points (Fig. 5.1.2).

Fig. 5.1.1. Model used for linear regression. The x-values are taken as fixed and the y-values subject to measurement variation.
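The model is easy to illustrate by simulating data from it; α, β and σ below are arbitrary illustrative values, not taken from the text:

```python
import numpy as np

# Simulate the regression model y_i = alpha + beta*x_i + eps_i,
# with eps_i ~ N(0, sigma^2)
rng = np.random.default_rng(42)
alpha, beta, sigma = 2.0, 1.5, 0.5   # illustrative true parameters

x = np.arange(0.0, 10.0)             # x-values fixed by the experimenter
y = alpha + beta * x + rng.normal(0.0, sigma, size=x.size)

# The y-values scatter around the true line; repeating the "measurement"
# would give a different set of y-values for the same fixed x-values.
print(np.column_stack([x, y]).round(2))
```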
Fig. 5.1.2. A set of experimental points produced under the linear regression model (same points as in Fig. 5.1.1).

We have no information about the true relationship (even whether it is a straight line), nor the size of σ². The task of regression is to estimate the values of α and β as best we can, and check that these estimated values provide a plausible model of the data.

5.2 How Regression Works

Key points
— The regression line y = a + bx is calculated from the points (xi, yi) by the method of least squares, that is, by finding the minimum value of the sum of the squared residuals.
— The regression coefficients are given by: b = Σi(xi − x̄)(yi − ȳ)/Σi(xi − x̄)², a = ȳ − bx̄.
— a and b are estimates of (but not identical with) the respective α and β.
— Regression is not symmetric: strictly speaking, we cannot exchange the roles of x and y in the equations.

Imagine that a straight line y = a + bx is drawn through the data. The line could be regarded as a 'trial fit' (Fig. 5.2.1). We need to adjust the values of a and b in this equation until the line is a 'best fit' in some defined sense. The 'fitted points' ŷi fall exactly on the line vertically above the points
xi, so that we have ŷi = a + bxi. (In statistics, the notation ŷi [spoken as 'y-hat'] implies that the quantity ŷi is an estimate.) The residuals are defined as ri = yi − ŷi. We define the regression line by finding values of a and b that provide the smallest possible value of the sum of the squared residuals, Q. Thus we have

    Q = Σi ri² = Σi (yi − ŷi)² = Σi (yi − a − bxi)².

Fig. 5.2.1. Key aspects of regression. Same points as Fig. 5.1.2, showing a trial fit with fitted values and the residuals.

Fig. 5.2.2. Regression line (solid line) for the data shown in Figs 5.1.2 and 5.2.1. The 'true line' used to generate the data is shown dashed.
5. that is. that is. and here we are treating a and b as variables. • The procedure is called the ‘regression of y on x’. It is not symmetric: we get a diﬀerent line if we exchange x and y in the equations for a and b.2. ∂Q ∂Q = 0. as in Fig. the average values of a and b would tend towards α and β.3). ¯ x Notes • The values of a and b are not the same as the respective unknown α and β (Fig. but they are unbiased: if the experiment were repeated many times (that is. 5.1. That would be an incorrect line. but a diﬀerent random selection of yvalues each time. • x is called the ‘independent variable’ or the ‘predictor variable’. . with new random yi values each time.3. Regression lines from repeated experiments under the conditions shown in Fig.2. 2 ¯ i (xi − x) a = y − b¯. y is called the ‘dependent variable’ or the ‘response variable’.2). because the assumption that the xvalues were errorfree would be violated.) We ﬁnd the minimum value of Q by setting the ﬁrst derivatives equal to zero.1. 5. = 0.2. 5. (Remember: in this equation the xi and yi are constants ﬁxed by the experiment. namely b= i (xi − x)(yi − y ) ¯ ¯ . ∂a ∂b Solving these two simultaneous equations gives expressions for a and b. with the same xvalues.Regression and Calibration 75 Fig.
• The procedure outlined above is called the 'method of least squares'. There are other procedures for fitting a line to data, but least squares is simple mathematically and meets most requirements so long as the data are produced in well-designed experiments.
• Always make x, the error-free independent variable, the horizontal axis in x–y scatter plots.

5.3 Calibration: Example 1

Key points
— Regression is suited to estimating calibration functions because we can usually regard the concentrations as fixed and the responses as variables.
— Unknown concentrations can be estimated from the transformed calibration function.

The process of calibrating an analytical instrument, by measuring the responses obtained from solutions of several known concentrations of the analyte, conforms closely to the assumptions of regression. The independent variable is the concentration of the analyte, accurately known because the

Fig. 5.3.1. Calibration data (points) and regression line for manganese.
concentration is determined by gravimetric and volumetric operations only. It is a reasonable presumption that the concentrations of the analyte in the calibration solutions are known with a small relative uncertainty, often as good as 0.001. The dependent variable is the response of the instrument, and that is seldom known with a relative uncertainty less than 0.01. Here we are making the responses the dependent variable y, and concentration the independent variable x.

Consider the following data obtained by constructing a short-range calibration for manganese determined by inductively-coupled plasma atomic emission spectrometry.

Concentration of Mn, ppb     0     2     4      6      8      10
Response, arbitrary units    114   870   2087   3353   3970   4950

The regression line (Fig. 5.3.1) is given by the function y = 75 + 496x and, visually, seems like a good fit to the data. We can use this function to estimate unknown concentrations in any other matrix-matched solution by inverting the equation to give x = (y − 75)/496. A test solution providing a response of (say) 3000 units would indicate a concentration of manganese of 5.9 µg l⁻¹. We can also use the b value (the slope of the line) to convert response data for estimating the detection limit (according to the simple definition given in Chapter 9). If we record n > 10 repeated responses when the concentration of the analyte is zero (that is, with a blank solution) and calculate the standard deviation s of these responses, the detection limit is given by cL = 3s/b. Other important features of the calibration function can be tested after regression. In practice we would need to check the validity of various assumptions underlying the regression, and these items are considered in the following sections.

Note
The dataset is available in the file named Manganese1.
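The calibration and inverse prediction just described can be reproduced directly from the data above:

```python
import numpy as np

# Manganese calibration data from this section
conc = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])          # ppb, fixed x
resp = np.array([114.0, 870.0, 2087.0, 3353.0, 3970.0, 4950.0])

# Least-squares fit (polyfit returns slope first for degree 1)
b, a = np.polyfit(conc, resp, 1)
print(round(a, 1), round(b, 2))    # approx. 75.5 and 496.37

# Inverse prediction: x = (y - a)/b for a test-solution response of 3000
x_unknown = (3000.0 - a) / b
print(round(x_unknown, 1))         # approx. 5.9 ppb
```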
5.4 The Use of Residuals

Key points
— Examining a plot of the residuals is an essential part of using regression.
— If there is no lack of fit between the observations and the fitted model, the residuals should resemble a random sample of independent values from a normal distribution.

The variance of the residuals s²y|x is given by

    s²y|x = Σi (yi − ŷi)²/(n − 2).

Note that this expression is similar to the ordinary expression for variance, except that here the deviations are measured from the fitted values ŷi (instead of from the mean ȳ) and there are now n − 2 degrees of freedom. (There are n − 2 degrees of freedom because we have calculated two statistics (a and b) from n pairs of observations.) This variance s²y|x estimates σ² (see §5.1) if the regression line is a good fit to the data. The standard deviation of the residuals is of course the square root of the variance.

If the assumptions of regression are fulfilled and the measured data are truly derived from a straight line function, then we expect the residuals to behave very like a random selection from a normal distribution centred on zero. So if we divide the residuals by sy|x, these 'scaled residuals' should resemble a sample from a standard normal distribution N(0, 1). This provides a useful method of checking visually whether the line produced by regression is an acceptable fit and whether the data plausibly conform to the assumptions underlying regression. For the manganese calibration data we find that sy|x = 190, and obtain Figs. 5.4.1 and 5.4.2 when (a) the residuals and (b) the scaled residuals are plotted against the x-values. The pattern of the residuals is seen to correspond to results that fall below or above the regression line. In other instances the deviations from the line may be too small to see on the plot of response against concentration, but they will always be apparent in the
residual plot.

Fig. 5.4.1. Residuals from the linear regression of the manganese calibration data.

Fig. 5.4.2. Scaled residuals from the linear regression of the manganese calibration data.

In this case, the successive residuals could plausibly be a random selection from a normal distribution, so we conclude that there is no reason to suspect any lack of fit. It is somewhat easier to see this from the scaled residuals, because all of the values fall within bounds of about ±1.6; in a genuine N(0, 1) distribution, we expect about 90% of data to fall within these limits on average. In other words, linear regression has provided a line that fits the data well. However, it is important not to over-interpret these plots when the number of data points is small, as in most examples from analytical calibration. Patterns indicating lack of fit or other problems (see §5.5) must deviate strikingly from a random, independent sample from a normal distribution before lack of fit is inferred from small numbers of residuals. Where possible, such inferences should be backed up by numerical tests of significance (see §§5.10 and 5.11).

Producing and considering residual plots is an essential part of using regression. It helps to avoid using inappropriate methods and making incorrect decisions.

Notes
• The dataset is available in the file named Manganese1.
• The least squares calculation ensures that the mean of the residuals is exactly zero, so the residuals are not quite independent.
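The residual diagnostics for the manganese data can be reproduced as follows:

```python
import numpy as np

# Manganese calibration data from Section 5.3
conc = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
resp = np.array([114.0, 870.0, 2087.0, 3353.0, 3970.0, 4950.0])

b, a = np.polyfit(conc, resp, 1)
fitted = a + b * conc
residuals = resp - fitted

# Residual standard deviation s_y|x, with n - 2 degrees of freedom
n = conc.size
s_yx = np.sqrt(np.sum(residuals ** 2) / (n - 2))
print(round(float(s_yx)))          # approx. 190

# Scaled residuals should resemble a sample from N(0, 1)
scaled = residuals / s_yx
print(scaled.round(2))             # all within about +/-1.6
```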
5.5 Suspect Patterns in Residuals

Key points
— There are several ways in which a residual plot can deviate from that expected for a good fit.
— There are diagnostic patterns for outliers, lack of fit and non-uniform variance.

Where plots of residuals show patterns that deviate strongly from a random normal sequence, it is likely that the original data deviate from the assumptions of the regression model used, and that the results of the regression might be misleading. A different model might provide a more accurate and useful outcome. There are several patterns that analytical chemists should be aware of.

The first type of pattern (Figs. 5.5.1, 5.5.2) occurs when there is an outlying point among the data. An outlier can bias the regression by drawing the regression line towards itself and thus may provide misleading information (§5.6). In the instance shown, this is apparent as a residual with a standardised value greater than 2, although there is a negative mean. (This example is only marginally outlying.) The remaining residuals seem acceptable. Statistical testing for outliers in calibration data is not simple, but applying Dixon's test (§7.3) or Grubbs' test (§7.4) to the residuals would probably provide a reasonable guide for testing a single suspect data point. If they can be clearly identified as such, outliers should be deleted from calibration data.

Fig. 5.5.1. Data and regression line with a suspect point at x = 5.

Fig. 5.5.2. Standardised residuals from the regression line in Fig. 5.5.1, showing a single suspect point at x = 5.

The second type of pattern occurs when the regression model does not fit the data properly. In the instance shown (Figs. 5.5.3, 5.5.4) a linear
regression has been applied to data with a curved trend, and the residuals show a corresponding bow-shaped pattern, indicating systematic lack of fit of the data to the regression line. It is important to avoid this type of lack of fit in calibration, because of the relatively large discrepancy at low concentrations between the regression line and the true trend of the data. The lack of fit at very low concentrations could be seriously misleading. The pattern may be unconvincing (but also prone to over-interpretation) if only a small number of points are represented. Statistical tests for non-linearity are therefore a useful adjunct to residual plots, and are discussed below (§5.10). The occurrence is dealt with by the use of a more complex regression model, such as a polynomial (§6.4) or a non-linear model (§6.10).

Fig. 5.5.3. Data and linear regression line showing lack of fit due to curvature in the true relationship.

Fig. 5.5.4. Standardised residuals showing a strongly bowed pattern, indicating systematic lack of fit of the data to the regression line. (They also suggest a possibly low bias in the regression line below about x = 3.)

Fig. 5.5.5. Heteroscedastic data and regression line giving rise to residuals that tend to increase with x.

Fig. 5.5.6. Standardised residuals showing a heteroscedastic tendency: the residuals tend to increase with x.

The third type of suspect residual pattern shows residuals that vary in size with the value of the independent variable. In the example illustrated (Figs. 5.5.5 and 5.5.6) there is a tendency for the residuals to increase
in magnitude with increasing x. The tendency is contrary to the assumption of simple regression that the variance σ² of the y-values is constant across the range of the independent variable x. This phenomenon is called 'heteroscedasticity'. The feature is common in analytical calibrations unless the range of concentration is quite small (for example, 1–2 orders of magnitude of concentration above the detection limit). Where several orders of magnitude of concentration are covered in a calibration set, heteroscedasticity may be pronounced. If simple regression (i.e., as discussed so far) is used on markedly heteroscedastic data, the resulting line will have a magnified uncertainty towards zero concentration. This may have a disastrous effect on the apparent detection limit and the accuracy of results in this region (see Chapter 6). The correct approach to heteroscedastic data is to use a statistical technique called weighted regression (§6.7). The complication of using weighted regression is best avoided unless the need for it is completely clear. As with other suspicious patterns in residuals, there is a tendency among the inexperienced to see random data as patterns when the number of points is small. In any event, regression should never be undertaken without a prior visual appraisal of the data or a retrospective residual plot.
5.6 Effect of Outliers and Leverage Points on Regression

Key points
— Outliers and leverage points can adversely affect the outcome of regression.
— They are often easy to deal with or avoid in analytical calibration.

There are two kinds of suspect data points that can adversely affect the outcome of regression, namely outliers and leverage points. Outliers are essentially anomalous in the value of the dependent variable, that is, in the y-direction in a graph (Fig. 5.6.1). They have the effect of drawing the fitted line towards the outlying point and thus rendering it unrepresentative of the other (valid) points. Extreme outliers will be immediately obvious and should be removed from the dataset. Marginal outliers are more difficult to deal with, although some statistical software packages give an indication of which points could reasonably be regarded as outliers. Because of this, practitioners must avoid deleting the point with the largest residual without careful thought. Such a deletion
would always reduce the variance of the residuals, but would not improve the fit unless the point is a genuine outlier. In calibration it will normally be possible for the analyst quickly to check the data or the calibrators for mistakes, so it is probably better to repeat the whole calibration if an outlier is encountered.

Leverage points are anomalous in respect of the independent variable (Fig. 5.6.2). Because of their distance from the other points they can draw the fitted line towards them. Even if they are close to the same trend as the rest of the points, they will have an undue influence on the regression line. Again, some statistical software will identify points that have undue leverage. Leverage points should be treated with caution, but they can easily be avoided in calibration.

Fig. 5.6.1. Effect of an outlier on regression. When one of the original points (solid circles) is moved to an outlying position (open circle), the original regression line (solid) moves to an unrepresentative position (dashed line).

Fig. 5.6.2. Effect of a leverage point on regression. The well-spaced points (solid circles) give rise to the original regression line (solid). The extra leverage point (open circle) draws the regression line (dashed) to an unrepresentative position.

5.7 Variances of the Regression Coefficients: Testing the Intercept and Slope for Significance

Key points
— The variances of the regression coefficients can be calculated from the data.
— They can be used to conduct significance tests on the coefficients.
Because the y-values in regression are variables (that is, subject to random error of measurement), the estimates a and b are also variables, and it is of interest to note how large that variability is. This gives us information about how much reliance we can place on the calculated values of a and b. We can obtain the required statistics from the data itself. Thus we have an expression for the variance of the slope b, namely

    var(b) = s²y|x / Σi(xi − x̄)²,

and for the intercept a:

    var(a) = var(b) Σi xi²/n.

The standard errors se(b) and se(a) are simply the respective square roots of these variances.

Various null hypotheses about the coefficients can be tested for significance by using these standard errors. We can test whether there is evidence to discount the idea that a calibration line plausibly passes through the origin (that is, that the instrumental response is zero when the concentration of the analyte is zero). For that purpose we consider the null hypothesis H0: α = 0 by calculating the value of Student's t, namely t = (a − α)/se(a) = a/se(a), and the corresponding probability. A sufficiently low value (say p < 0.05) should convince us that we can safely reject the null hypothesis. (Note: a test for H0: α = 0 is pointless unless there is a good fit between the data and the regression line: for example, Figs. 5.5.3 and 5.5.4 show a lack of fit situation where the intercept of the regression line differs obviously from the true trend of the data.)

Hypotheses about the slope b can likewise be tested by calculating t = (b − β)/se(b) and the corresponding probability under various hypotheses about β. If we consider H0: β = 0, we are asking if there is any relationship at all between x and y, that is, whether the slope is zero (where the value of y is unaffected by the value of x). That circumstance is irrelevant in calibration, but might arise in exploratory studies of data, where we want to see which (if any) of a large number of possible predictors has an effect on the response (Chapter 6), but in such instances we have to take care that the assumptions of regression are not seriously violated (see §5.5).
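Applying these formulas to the manganese calibration data gives the t statistics quoted in the next section; the 5% two-sided critical t-value for 4 degrees of freedom, 2.776, is taken from tables:

```python
import numpy as np

# Manganese calibration data from Section 5.3
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([114.0, 870.0, 2087.0, 3353.0, 3970.0, 4950.0])
n = x.size

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
s2_yx = np.sum((y - a - b * x) ** 2) / (n - 2)    # residual variance

# Variances and standard errors of the coefficients
var_b = s2_yx / np.sum((x - x.mean()) ** 2)
var_a = var_b * np.sum(x ** 2) / n
se_b, se_a = np.sqrt(var_b), np.sqrt(var_a)
print(round(float(se_a), 1), round(float(se_b), 2))  # approx. 137.7 and 22.74

# t statistics for H0: alpha = 0 and H0: beta = 0
t_a, t_b = a / se_a, b / se_b
print(round(float(t_a), 2), round(float(t_b), 2))    # approx. 0.55 and 21.83

# Compare with the 5% critical value for n - 2 = 4 degrees of freedom
t_crit = 2.776
print(abs(t_a) > t_crit, abs(t_b) > t_crit)          # False True
```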
In calibration it is possible that we might want to compare an experimental value of b with a literature value blit, in which case we have H0: β = blit. Finally, we might consider H0: β = 1 in studies of bias over extended ranges.
For our manganese calibration data (§5.3) we find the following.

Coefficient    Estimate   Standard error   t       p
Intercept a    75.5       137.7            0.55    0.613
Slope b        496.37     22.74            21.83   0.000

The high p-value for the intercept means that there is no evidence to reject the idea that the true calibration line passes through the origin. A very low p-value for the slope is inevitable in analytical calibration and is of no inferential value.

Notes
• The data used can be found in the file named Manganese1.
• Most computer packages give t-values and corresponding values of p alongside the estimates of the regression coefficients. The t-value usually relates to the null hypothesis that the respective coefficient has a zero population mean.

5.8 Regression and ANOVA

Key point
— In regression, the variance of the y-values can be analysed into a component attributed to the regression and a residual component.

In regression the variation among the y-values can be split between the component due to the regression and that due to the residuals. This enables us to compare the relative magnitude of the components and make deductions about the success of the regression. The overall variance of the y-values is given by the normal expression for variance, namely Σi(yi − ȳ)²/(n − 1). The variance of the residuals we have seen (§5.4) is Σi(yi − ŷi)²/(n − 2).
For the variance due to the regression (that is, of the fitted values ŷi around ȳ) there is a denominator of one, because there is only one degree of freedom remaining, namely (n − 1) − (n − 2) = 1; so the variance is simply Σi(ŷi − ȳ)². Computer packages provide an ANOVA table composed as follows.

Source of     Degrees of   Sum of          Mean square
variation     freedom      squares         (variance)              F
Regression    1            Σ(ŷi − ȳ)²      Σ(ŷi − ȳ)²              Regression mean square /
Residuals     n − 2        Σ(yi − ŷi)²     Σ(yi − ŷi)²/(n − 2)     Residual mean square
Total         n − 1        Σ(yi − ȳ)²      Σ(yi − ȳ)²/(n − 1)

The value of F can be used as another test of a significant relationship between y and x, and is mathematically equivalent to testing H0: β = 0 with a t-test.

Another statistic provided by ANOVA is designated R² and is the ratio of the regression sum of squares to the total sum of squares (usually expressed as a percentage). This can be thought of as the proportion of the variation in the y-values that is accounted for by the regression. It is numerically equal to the square of the correlation coefficient (§5.9) between the y-values and the fitted values. It can be used as a rough guide to the success of the regression, because a value approaching 100% means that most of the variation has been accounted for. But the statistic must be treated with caution: a model providing (say) a 99% value of R² is not necessarily better than one that provides a value of 95%, for the same reasons that apply to the correlation coefficient.

The example calibration data for manganese (§5.3) provides the following ANOVA table.

Source of     Degrees of   Sum of      Mean square
variation     freedom      squares     (variance)    F       p
Regression    1            17246922    17246922      476.3   0.0000
Residuals     4            144830      36207
Total         5            17391751
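A short sketch reproducing this decomposition and the R² statistic for the manganese data:

```python
import numpy as np

# Manganese calibration data from Section 5.3
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([114.0, 870.0, 2087.0, 3353.0, 3970.0, 4950.0])
n = y.size

b, a = np.polyfit(x, y, 1)
fitted = a + b * x

ss_reg = np.sum((fitted - y.mean()) ** 2)   # 1 degree of freedom
ss_res = np.sum((y - fitted) ** 2)          # n - 2 degrees of freedom
ss_tot = np.sum((y - y.mean()) ** 2)        # n - 1 degrees of freedom
assert np.isclose(ss_reg + ss_res, ss_tot)  # the decomposition is exact

F = ss_reg / (ss_res / (n - 2))
R2 = 100 * ss_reg / ss_tot
print(round(float(F), 1), round(float(R2), 1))   # approx. 476.3 and 99.2
```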
We see a value of R² = 99.2%. Such a high value is usual in analytical calibration unless a grossly inappropriate model has been used. Likewise, the high F value and the corresponding small probability are typical of calibration and of no real interest.

5.9 Correlation

Key points — Correlation is a measure of the relationship between two variables. It is related to, but distinct from, regression. — A perfect linear relationship provides a correlation coefficient of exactly 1 or −1. — The correlation coefficient is not a reliable indicator of linearity in calibration.

This section is essentially a warning. Correlation is a measure of the relationship between two variables. Valid inferences that can be made with the help of the correlation coefficient are few in analytical science, and it is prone to misinterpretation unless great care is taken. Unfortunately, some computer packages provide the correlation coefficient as a routine by-product of regression, which gives it a false appearance of applicability, and the statistic is best avoided.

The correlation coefficient is defined as

    r = Σi((xi − x̄)(yi − ȳ)) / √( Σi(xi − x̄)² Σi(yi − ȳ)² ).

The value of r must fall between ±1. Unlike regression, the statistic is symmetric in x and y, the identical value of r being produced if the roles of x and y are interchanged in the equation. When there is no relationship between the variables, that is, the y-values do not depend on the x-values in any way, the correlation coefficient will be zero, regardless of the actual x- and y-values. When the points fall exactly on a straight line the coefficient takes a value of +1 for lines with a positive slope or −1 for lines with a negative slope. For intermediate situations (points scattered at a greater or lesser distance from a straight line) the coefficient takes values 0 < |r| < 1.

Some problems of interpreting r are as follows.
• While points very close to a straight line provide a coefficient of almost 1 (or −1) by definition, the converse is not true: points with r ≈ 1 do not have to be scattered around a straight line. Some examples are shown in Figs. 5.9.1–5.9.4. This kind of ambiguity is illustrated in Figs. 5.9.7 and 5.9.8, where points with a straight tendency and a curved tendency provide identical values of r. The ambiguity extends to values of r that are even closer to unity.
• Outliers have a strong effect on the value. In Fig. 5.9.5 we see the same data as in Fig. 5.9.2 but with an outlier added. The coefficient has increased from 0.80 to 0.92, despite the fact that the outlier does not lie on the same trend as the original data.
• Points with an exact functional relationship that is not linear do not necessarily give values of r above zero (Fig. 5.9.6).

Fig. 5.9.1. Scatterplot of 20 points with zero correlation.

Fig. 5.9.2. Scatterplot of 20 points with a correlation coefficient of 0.80.

Fig. 5.9.3. Scatterplot of 20 points with a correlation coefficient of 0.90.

Fig. 5.9.4. Scatterplot of 20 points with a correlation coefficient of 0.99.

Fig. 5.9.5. Scatterplot of data with an outlier.

Fig. 5.9.6. Bivariate dataset with an exact functional relationship but zero correlation.

Fig. 5.9.7. Data with a linear trend and a correlation coefficient of 0.95.

Fig. 5.9.8. Data with a nonlinear trend and a correlation coefficient of 0.95.

It is particularly unfortunate when r is used as a test for sufficient linearity in calibration graphs in analytical science. It is quite possible to have a calibration with r = 0.999 and still demonstrate significant lack of fit to a linear function (§5.11). Moreover, it is dangerous to compare different values of r: for example, it would be wrong to say that a dataset with r = 0.999 was more linear than a set with r = 0.95, because r also reflects features such as the range of the data, which tends to increase the correlation coefficient. All of this shows that r has little to commend it in the context of chemical measurement, and that a scatterplot is much more informative and nearly always to be preferred. (Other tests for lack of fit are statistically sound: for example, the pure error test [§5.10] and polynomial fitting.)

Note
• The use of r for the correlation coefficient must not be confused with the same symbol used for 'residual' or for repeatability.
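Two of these warnings are easy to demonstrate numerically. In the sketch below (Python, with made-up illustrative data), an exactly linear set gives r = 1, while an exactly functional but curved set still gives r ≈ 0.97, easily high enough to be mistaken for evidence of linearity.

```python
import math

def corr(x, y):
    """Pearson correlation coefficient, as defined in section 5.9."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = list(range(1, 11))
line = [2 * v + 1 for v in x]      # exact straight line
curve = [v ** 2 for v in x]        # exact parabola, no random scatter at all

print(corr(x, line))                     # 1.0
print(round(corr(x, curve), 3))          # 0.975: high r despite curvature
print(corr(curve, x) == corr(x, curve))  # True: r is symmetric in x and y
```

A scatterplot of the second dataset would reveal the curvature instantly, which is the point of the warning above.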
5.10 A Statistically-Sound Test for Lack of Fit

Key points — Lack of fit between data and a regression line can be studied by making an independent estimate of the 'pure error'. — This requires replicated responses to be recorded.

Calibration lines that are straight or very close to straight occur often in analytical measurement, and such a calibration is highly beneficial in that it tends to reduce the uncertainty of predicted concentrations as well as being easiest to fit. It is therefore of considerable importance to check whether calibration lines actually are straight, and if they are not, to determine the magnitude of the discrepancy. Fitting a straight line to data that have a curving trend will produce errors that are likely to be serious at the bottom end of the calibration. However, there is a tendency for analytical chemists to assume linearity and ignore small deviations from it. In many instances that is completely justified. Consequently, tests for linearity are written into procedures for method validation in many analytical sectors. Unfortunately, these tests are mostly based on the correlation coefficient and therefore are statistically unsound (see §5.9). Here we consider a test for lack of fit that is both sound and suitable for method validation.

The test requires an independent estimate of σ², which can be obtained by replicating some or all of the response measurements. A general scheme for the independent estimation of σ² (called the 'pure error mean square', MSPE) is given below. Suppose there are m different concentrations and the response at each concentration is measured n times. Then we have the following data layout, from which we can calculate sums of squares and degrees of freedom.

  Concentration    Repeated responses         Sum of squares            Degrees of freedom
  x1               y11, …, y1i, …, y1n        SS1 = Σi(y1i − ȳ1)²       n − 1
  …                …                          …                         …
  xj               yj1, …, yji, …, yjn        SSj = Σi(yji − ȳj)²       n − 1
  …                …                          …                         …
  xm               ym1, …, ymi, …, ymn        SSm = Σi(ymi − ȳm)²       n − 1
  Totals                                      SSPE = Σj SSj             m(n − 1)
Then we have MSPE = SSPE/(m(n − 1)). The sum of squares for lack of fit is the residual sum of squares SSRES minus the pure error sum of squares, so the mean square due to lack of fit is

    MSLOF = (SSRES − SSPE)/(m − 2).

The test statistic is F = MSLOF/MSPE, which has (m − 2) and m(n − 1) degrees of freedom. This looks a bit formidable, but the calculations are quite straightforward and will always be carried out by computer.

5.11 Example Data/Calibration for Manganese

Key points — Residual plots and tests for lack of fit serve to detect nonlinearity in a calibration plot. — These are more reliable tests than a consideration of the correlation coefficient. — Outlying results can perturb the conclusions of such tests.

The calibration data for manganese previously considered are actually part of a larger set of duplicated results at each concentration. The data set up to concentration 10 ppb is given here.

  Concentration, ppb    Response 1    Response 2
  0                     114           14
  2                     870           1141
  4                     2087          2212
  6                     3353          2633
  8                     3970          4299
  10                    4950          5207
The regression and residual plots are shown in Figs. 5.11.1 and 5.11.2. By visual judgement there is no suggestion of lack of fit in the residual plot. The following simple calculations lead to the pure error estimate. (This data layout is for demonstration purposes only: the calculations will always be carried out by computer.)

  x      ya      yb      ȳ         da = ya − ȳ    db = yb − ȳ    da² + db²    DOF
  0      114     14      64.0       50.0           −50.0          5000         1
  2      870     1141    1005.5    −135.5           135.5         36721        1
  4      2087    2212    2149.5     −62.5            62.5         7813         1
  6      3353    2633    2993.0     360.0          −360.0         259200       1
  8      3970    4299    4134.5    −164.5           164.5         54121        1
  10     4950    5207    5078.5    −128.5           128.5         33025        1
  Totals                                                          395878       6

Σ(da² + db²) is the pure error sum of squares for duplicated results. This gives rise to the following ANOVA table.

  Source of        Degrees of    Sum of       Mean square      F        p
  variation        freedom       squares      (variance)
  Regression       1             35608623     35608623        819.34   0.000
  Residual error   10            434603       43460
    Lack of fit    4             38725        9681            0.15     0.958
    Pure error     6             395878       65980
  Total            11            36043226

The probability associated with the lack of fit F statistic is high and confirms the absence of significant lack of fit.

If we now consider the full dataset, up to a concentration of 20 ppb, we have the following data.

  Concentration, ppb    Response 1    Response 2
  0                     114           14
  2                     870           1141
  4                     2087          2212
  6                     3353          2633
  8                     3970          4299
  10                    4950          5207
  12                    5713          5898
  14                    6496          6736
  16                    7550          7430
  18                    8241          8120
  20                    8862          8909
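The pure-error arithmetic just tabulated is easy to verify. The sketch below (Python, as an illustrative stand-in for the statistical package the book assumes) runs the whole lack-of-fit test on the duplicated 0–10 ppb manganese data.

```python
# Pure-error lack-of-fit test (section 5.10) on the duplicated manganese data, 0-10 ppb.
conc = [0, 2, 4, 6, 8, 10]
resp = [(114, 14), (870, 1141), (2087, 2212), (3353, 2633), (3970, 4299), (4950, 5207)]

x = [c for c, pair in zip(conc, resp) for _ in pair]   # each concentration twice
y = [v for pair in resp for v in pair]
n = len(x)                                             # 12 responses in all
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))           # residual SS, 10 df
ss_pe = sum(sum((v - sum(p) / len(p)) ** 2 for v in p) for p in resp)    # pure error SS
df_pe = sum(len(p) - 1 for p in resp)                                    # m(n - 1) = 6
df_lof = len(conc) - 2                                                   # m - 2 = 4
F = ((ss_res - ss_pe) / df_lof) / (ss_pe / df_pe)
print(round(ss_pe), round(F, 2))
# -> 395878 0.15
```

This reproduces the pure error sum of squares (395878) and the lack-of-fit F value (0.15) of the ANOVA table above.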
Fig. 5.11.1. Calibration data for manganese, with duplicated results for response.

Fig. 5.11.2. Residuals from the simple calibration of the manganese data. There is no suggestive pattern (although there is a single suspect point).

Fig. 5.11.3. Calibration data for manganese, up to 20 ppb.

Fig. 5.11.4. Residuals (points) from the calibration using simple regression. There is a strong visual suggestion of lack of fit, which is perhaps weakened by an outlying point (encircled).

The calibration plot (Fig. 5.11.3) and residual plot (Fig. 5.11.4) suggest a slight curvature in the relationship, but it is difficult to be sure because of the relatively large variability in the responses at each level of concentration. On completion of the linear regression with the test for lack of fit, we have the following ANOVA table. The probability of 0.096 associated with the lack of fit test, although not significant at the 95% level of confidence, is low enough to substantiate the visual appraisal. The probability may have been affected unduly by one apparently discrepant value of response at 6 ppb concentration, which inflates the estimate of pure error variance and thereby reduces the F value. If this discrepant value is deleted before the test, the lack of fit becomes significant, with p = 0.007.
  Source of        Degrees of    Sum of        Mean square       F         p
  variation        freedom       squares       (variance)
  Regression       1             173714191     173714191        2631.62   0.000
  Residual error   20            1320205       66010
    Lack of fit    9             862790        95866            2.31      0.096
    Pure error     11            457416        41583
  Total            21            175034397

Notes
• The datasets used in this section can be found in files named Manganese2 and Manganese3.
• The pure error test for lack of fit takes no account of the order in which the residuals are arranged: the test statistic would have the same result if the order of the residuals were randomised. This has two corollaries: (a) nonlinearity may be somewhat more likely than suggested by the p-value; and (b) lack of fit, if detected, may have a cause other than nonlinearity. Hence it is essential to examine a residual plot.
• The coefficient of correlation between the concentrations and the responses in dataset Manganese3 is r = 0.996. This would often incorrectly be taken as a demonstration of linearity in the calibration function. This example substantiates the previous comments (§5.9) about the shortcomings of the correlation coefficient as a test for linearity.

5.12 A Regression Approach to Bias Between Methods

Key points — Linear regression can be used to test for bias between two analytical methods by using them in tandem to analyse a set of test materials. — The inference will be safe so long as the results of the more precise method are used as the independent (x) variable.

The results of two analytical methods can be compared by using both of them to analyse the same set of test materials. When the range of concentrations determined is small, a t-test on the differences between
paired results is usually appropriate (§3.9). When the range is greater, the data can sometimes be analysed by regression or a related method, and a number of different outcomes are possible. The regression approach is safe so long as the variance of the independent (x) variable is somewhat smaller than that of the dependent (y) variable. If the variances are comparable, or that of x exceeds that of y, regression is likely to provide misleading statistics, because a basic assumption of regression (invariant x-values) has been notably violated. Such instances can, however, be readily managed by

Fig. 5.12.1. Possible outcomes of experiments for the comparison of two methods of analysis, showing the regression line (solid) and the theoretical line of no bias (dashed). Each point represents the two results on a particular material. (a) No bias between the methods. (b) Translational bias only between the methods. (c) Rotational bias only between the methods. (d) Both rotational and translational bias between the methods.
the use of a more complex method known as functional relationship fitting, which is beyond the scope of the present book.

Bias between two analytical methods can adopt one of several different 'styles', the most common of which are well modelled by linear regression in paired experiments. As the methods are intended to address the same measurand, the two methods should produce results quite close to the ideal model (y = x). If both methods gave identical results (apart from random measurement variation, of course) the trend of results would follow a line with zero intercept and a slope of unity, that is, y = α + βx where α = 0, β = 1 (Fig. 5.12.1a). An obvious statistical approach is to test the hypotheses H0 : α = 0 vs HA : α ≠ 0 and H0 : β = 1 vs HA : β ≠ 1.

Figure 5.12.1(b) shows a dataset where α ≠ 0, β = 1. This type of bias is called 'translational' or 'constant' bias, and is commonly associated with baseline interference in analytical signals. Another common type of bias is characterised by α = 0, β ≠ 1, that is, the slope has departed from unity. This style of bias (Fig. 5.12.1c) is called 'rotational' or 'proportional' bias. It is quite possible for both types of bias to be present simultaneously, that is, α ≠ 0, β ≠ 1 (Fig. 5.12.1d).

In some instances a more complex behaviour may be seen. Figure 5.12.2 shows an example where the results from the majority of the test materials follow a simple trend line, but a subset of test materials give results that show quite different behaviour.

Fig. 5.12.2. A complex outcome of an experimental comparison between analytical methods where the test materials fall into two subsets, one subset with no bias (solid circles) and the remainder with a serious (but not statistically characterised) bias (open circles).
Notes and further reading
• Ripley, B. and Thompson, M. (1987). Regression Techniques for the Detection of Analytical Bias. Analyst, 112, pp. 377–383.
• 'Fitting a linear functional relationship to data with error on both variables', AMC Technical Briefs No. 10 (March 2002). Free download via www.rsc.org/amc.
• Software also available for Excel and Minitab. Free download via www.rsc.org/amc.

5.13 Comparison of Analytical Methods: Example

Key point — Regression shows the bias between a rapid field method for the determination of uranium and an accurate laboratory method.

As an example we consider the determination of uranium in a number of stream waters by a well-established laboratory method and by a newly-developed rapid field method. The purpose is to test the accuracy of the field method, regarding the laboratory method as a reference point for accuracy. The results, in units of µg l−1 (ppb), are as follows.

  Site code    Field result    Laboratory result
  1            24              19
  2            8               8
  3            2               2
  4            1               1
  5            10              9
  6            26              23
  7            31              27
  8            17              11
  9            4               4
  10           0               0
  11           6               4
  12           40              26
  13           2               3
  14           2               4
Fig. 5.13.1. Comparison between two methods for the determination of uranium in natural waters, showing the fitted regression line (solid) and the theoretical line for zero bias between the methods (dashed line). Each point is a separate test material.

Figure 5.13.1 shows the field results plotted against the laboratory results. We see that the trend of the results deviates from the theoretical line for unbiased methods, where both methods give the same result (apart from random measurement variation). It seems reasonable to estimate this trend by linear regression of the field results against the laboratory results. This action can be justified in this instance because the precision of the laboratory method is known to be small in comparison with that of the field method. The results of the regression are as follows.

  Predictor        Coefficient    Standard error    H0, HA            t       p
  Intercept (a)    −0.944         1.111             α = 0, α ≠ 0      0.85    0.412
  Slope (b)        1.32065        0.08120           β = 1, β ≠ 1      3.95    0.001

  sy/x = 2.82;  R² = 95.7%

We see that the intercept is not significantly different from zero (p = 0.412), but the slope of 1.32 is clearly significantly different from unity, which is the slope required for no bias of any kind between the methods. The field method is giving results that are on average 1.32 times greater than those of the established laboratory method.
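The slope test here differs from default package output: the null hypothesis is β = 1, not β = 0, so the statistic is t = (b − 1)/se(b). A sketch of the calculation (Python, reconstructed from the uranium data tabulated above):

```python
import math

# Regress the field results (y) on the laboratory results (x) and test H0: beta = 1.
lab   = [19, 8, 2, 1, 9, 23, 27, 11, 4, 0, 4, 26, 3, 4]   # reference method, ug/l
field = [24, 8, 2, 1, 10, 26, 31, 17, 4, 0, 6, 40, 2, 2]  # method under test, ug/l

n = len(lab)
xbar, ybar = sum(lab) / n, sum(field) / n
sxx = sum((x - xbar) ** 2 for x in lab)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(lab, field))
b = sxy / sxx
a = ybar - b * xbar
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(lab, field))
s_yx = math.sqrt(ss_res / (n - 2))        # residual standard deviation
se_b = s_yx / math.sqrt(sxx)
t_slope = (b - 1) / se_b                  # tests beta = 1, i.e. no rotational bias
print(f"b={b:.5f} se={se_b:.5f} t={t_slope:.2f}")
# -> b=1.32065 se=0.08120 t=3.95
```

The computed slope, standard error and t-value reproduce the table above; with 12 degrees of freedom, t = 3.95 is clearly significant.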
Fig. 5.13.2. Residuals from the regression.

Notes
• The dataset used in this section is in a file named Uranium.
• It is interesting to observe that the residuals of the regression (Fig. 5.13.2) tend to suggest a small degree of heteroscedastic variation in the results of the field method. Weighted regression should ideally have been used for this exercise (§6.7), although in this instance it would not have changed the interpretation of the data. In any event, the estimation of the weights would have called for assumptions outside the data.
Chapter 6

Regression — More Complex Aspects

This chapter examines some more complex aspects of regression in relation to analytical applications, specifically polynomial and multiple regression, weighted regression and nonlinear regression.

6.1 Evaluation Limits — How Precise is an Estimated x-value?

Key points — 'Evaluation limits' around concentrations estimated from calibration graphs can be calculated from the data and are often unexpectedly wide. — Evaluation limits give rise to an alternative way of thinking about detection limits.

Values of concentration x′ in unknown solutions, estimated from calibration lines, are subject to two sources of variation: first, the variation in the position of the regression line and, second, the variation in the new measured response y′. (Remember that y′ and x′ are not part of the calibration data.) The way that these two variances interact is shown schematically in Fig. 6.1.1. The resultant variation in x′ is, at first encounter, surprisingly large, but the reason for this is apparent from the diagram. Confidence intervals around x′ = (y′ − a)/b are given by x′ ± t·sx′ with an appropriate value of t, where

    sx′ = (sy/x / b) √( 1/m + 1/n + (y′ − ȳ)² / (b² Σi(xi − x̄)²) ),    (6.1.1)
and where y′ is the mean of m observations and the regression line is based on n pairs of x–y data. This procedure looks complex, but all of the statistics stem from the calibration data, and the limit lines can be calculated rapidly by computer. The confidence limits around x′ can be calculated for any value of y′ and shown as two continuous lines, which are gentle curves (see Fig. 6.1.2, which shows the curves calculated from the manganese calibration data).

Fig. 6.1.1. Estimation of an unknown concentration x′ from a response y′ and an established calibration function. The shaded areas illustrate schematically the uncertainties in y′, the calibration function (solid line) and x′.

Fig. 6.1.2. Evaluation interval calculated from the manganese data (points), showing the 95% confidence interval around an estimated value of concentration.

The manganese data shows that an unknown
solution producing a response of (say) 1600 units provides an estimated concentration of manganese of 3.1 ppb, with 95% confidence limits of 2.1 and 4.1. Assuming that an appropriate model has been fitted, this provides a useful way of estimating the contribution that stems from the calibration and evaluation procedures to the total uncertainty of the measurement result.

It also provides us with an alternative means of conceptualising detection limits (Fig. 6.1.3). If we think about the lower confidence limit for a continuously diminishing concentration x′, at some response (A in Fig. 6.1.3) the limit intersects the zero concentration line and, at that point and below, the estimated concentration is not significantly greater than zero. In this region we are not sure that there is any of the analyte present in the test solution.

Fig. 6.1.3. Part of the previous graph around the origin, illustrating how a detection limit can be estimated. At point A, the lower confidence limit intersects the zero concentration line. Concentrations below the corresponding detection limit cL are not significantly greater than zero.

Notes
• The dataset used in the example diagrams is in the file named Manganese2.
• Terminology in this area is not stabilised. Here we call the limits 'evaluation limits'; others call them 'inverse confidence limits' or 'fiducial limits'.
• Many textbooks correctly state that Eq. (6.1.1) is approximate. However, the error involved in the approximation is negligible in any realistic analytical calibration.

6.2 Reducing the Confidence Interval Around an Estimated Value of Concentration

Key points — Reducing the uncertainty in an estimated concentration is addressed by concentrating on the largest term in Eq. (6.1.1). — The largest term is often 1/m, where m is the number of replicate measurements of response for the unknown solution.

If the uncertainty in an estimated x′ is too large, the equation for sx′ above allows us to see which aspect of the calibration and evaluation needs attention, by comparing the separate magnitudes of the three terms under the square root sign. The magnitude of 1/m can be reduced by increasing m, the number of results averaged to obtain y′. This is normally the most effective strategy. Likewise, the magnitude of 1/n can be reduced by increasing n, the number of calibrators, and this also has the effect of reducing the value of t, which depends on the degrees of freedom, (n − 2). The third term, (y′ − ȳ)²/(b² Σi(xi − x̄)²), disappears when y′ = ȳ (usually around the centre of the calibration line) and, in most instances of calibration, is small compared with the other terms. (This is why the confidence limits usually look like straight lines parallel to the calibration function: strictly they are gentle curves.) It could be reduced somewhat by moving the calibrators to the extremes of the existing concentration range or by increasing the range of the existing calibration.

Common sense is necessary here: there is no point in worrying about the magnitude of errors contributed by calibration/evaluation if errors introduced at other levels of the analytical procedure exceed them to any extent. This is usually the case in chemical analysis, except at concentrations near the detection limit, where calibration errors assume a dominant magnitude.
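The relative magnitudes of the three terms, and the resulting evaluation interval, can be inspected numerically. The sketch below (Python; the data are assumed to be the duplicated manganese results of §5.11, i.e. the contents of Manganese2, with a single new response of 1600 units, so m = 1, and the t critical value is hard-coded) comes out close to the §6.1 example of about 3.1 ppb with limits near 2.1 and 4.1 ppb.

```python
import math

# Evaluation limits per Eq. (6.1.1) for the manganese calibration (0-10 ppb, duplicated).
x = [c for c in (0, 2, 4, 6, 8, 10) for _ in range(2)]
y = [114, 14, 870, 1141, 2087, 2212, 3353, 2633, 3970, 4299, 4950, 5207]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
s_yx = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

y_new, m = 1600.0, 1                           # one measurement of the unknown
x_new = (y_new - a) / b
term1, term2 = 1 / m, 1 / n
term3 = (y_new - ybar) ** 2 / (b ** 2 * sxx)   # small except far from the centre
s_x = (s_yx / b) * math.sqrt(term1 + term2 + term3)
t_crit = 2.228                                 # two-tailed 95% t for n - 2 = 10 df
lo, hi = x_new - t_crit * s_x, x_new + t_crit * s_x
print(f"terms: {term1:.3f} {term2:.3f} {term3:.3f}")
print(f"x'={x_new:.2f} ppb, 95% limits ({lo:.2f}, {hi:.2f})")
```

The printed terms show 1/m dominating (1 versus about 0.08 and 0.03 here), which is why increasing m is normally the most effective strategy.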
6.3 Polynomial Regression

Key points — Polynomial regression allows us to construct models that can fit data with a curved trend. — It is sometimes suitable for calibration data, with models up to order two (i.e., with squared terms). — Models of order higher than two are nearly always unsuitable for analytical calibrations, and give unreliable extrapolations.

When calibration data xi, yi (i = 1, …, n) (or any other data) fall on a curved trend it is often possible to account for them successfully with a polynomial model, which takes the form

    yi = β0 + β1 xi + β2 xi² + β3 xi³ + ⋯ + εi,    ε ~ N(0, σ²).

Notice here that the coefficients β are distinguished by a subscript that corresponds with the power to which the predictor variable is raised; the intercept is now called β0 instead of α. (Note that β0 xi⁰ = β0.) So we can use a compact notation for the model:

    yi = Σj=0…m βj xi^j + εi,

where m is the order of the polynomial, that is, the highest power used. The least-squares coefficients can be calculated by an extension of the procedure used in §5.2, so long as n > m + 2, but in practice it is wise to use n > 2m. It is very unlikely that a power of greater than three would be suitable for analytical calibration purposes. We can see this clearly in Fig. 6.3.1, which shows fits, of up to order four, to five data points.

Fig. 6.3.1. Various polynomial fits to five calibration data points.
(The straight line fit is omitted from the figure for clarity.) The second order 'quadratic' curve (part of a parabola) is a better fit to the points (in a least-squares sense) and provides a plausible calibration line. We find this plausible because we often observe slightly curved calibrations in practice and can usually account for them in terms of known physical processes taking place in the analytical system. The third order curve (part of a cubic equation) is still closer to the points, but now the fit is less plausible. We would be unhappy to use a calibration curve like that because there is an inflection in the curve that would be difficult to account for by physical theory. Finally, the fourth order fit passes exactly through each point, but is obviously nonsensical as a calibration.

In general, as the order of the fit is increased, the variance of the residuals becomes smaller. However, there is a point beyond which the improvement in fit is meaningless. There are statistical tests that can identify that point, but common sense is usually good enough in analytical calibration. A suitable procedure for analytical calibration is the following.
1. Determine at least six equally spaced calibration points.
2. Try a first order (linear) fit. Examine the resulting residual plot.
3. If there is a lack of fit through curvature, try a quadratic (order two) fit.
4. Examine the new residual plot.
5. If there is still lack of fit, abandon the attempt to fit a polynomial.
All of these operations are easily accomplished in statistical packages.

While polynomials can be used satisfactorily to model calibrations with slight curves and short ranges, the fact remains that they are often inherently the wrong shape to describe the physical processes going on in an analytical system. This incompatibility is likely to become apparent in even small extrapolations. Consider the calibration data shown in Figs. 6.3.2–6.3.4. The graphs show various order fitted lines extrapolated, with the corresponding 95% confidence intervals. The first order fit shows the uncertainty in the extrapolation remains reasonably small. The higher order fits model the data rather more closely, but show extrapolations with a strongly curved trend and an immediately very wide confidence interval. Extrapolation is very risky because we cannot infer the correct shape of the relationship from the data but only look for lack of fit within the range.

Fig. 6.3.2. Linear fit to calibration data, with 95% confidence interval.

Fig. 6.3.3. Second order fit to calibration data, with 95% confidence interval.

Fig. 6.3.4. Third order fit to calibration data, with 95% confidence interval.

Note
• In technical mathematical terminology, these polynomial models are all called 'linear', even though they describe curves: they are linear in the coefficients. There is another class of models, technically called 'nonlinear', that are more difficult to fit; these are discussed briefly in §6.9 and §6.10.
6.4 Polynomial Calibration — Example

Key point — Polynomial regression of order two (quadratic) provides a good fit for the manganese data.

If we re-examine the complete manganese data from §5.11 by using quadratic regression, we obtain the following output. The regression equation is:

    Response = 3.03 + 550.24 × cMn − 5.297 × cMn²,

where cMn is the concentration of manganese. The table of coefficients is as follows.

  Predictor      Coefficient    Standard error      t        p
  Intercept      3.03           91.64              0.03     0.974
  Mn             550.24         21.32             25.81     0.000
  Mn squared     −5.297         1.027             −5.16     0.000

  sy/x = 170.1;  R² = 99.7%

We see that the squared term is highly significant in the test for H0 : β2 = 0, so we are disposed to think that the quadratic model will have improved the fit. The residual plot (Fig. 6.4.1) shows no trace of lack of fit (although the suspect value is now more obviously an outlier), and there is no apparent lack of fit near the origin.

  Source of        Degrees of    Sum of        Mean square      F         p
  variation        freedom       squares       (variance)
  Regression       2             174484616     87242308        3015.01   0.000
  Residual error   19            549780        28936
    Lack of fit    8             92365         11546           0.28      0.960
    Pure error     11            457416        41583
  Total            21            175034397

The analysis of variance shows no significant lack
of fit, with p = 0.96. Quadratic regression therefore provides a good model for this particular calibration.

Fig. 6.4.1. Standardised residuals after carrying out quadratic regression on the manganese calibration data.

Note
• The dataset used in this section is named Manganese3.

6.5 Multiple Regression

Key points — Multiple regression allows the exploration of datasets with more than one predictor variable. — The technique has applications in analytical calibration in multivariate methods such as principal components regression. — It is also useful in general exploratory data analysis.

Multiple regression is used when two or more independent predictors combine to determine the size of a response variable. The dataset layout is as follows, where the first subscript on each value of x indicates a separate variable.
  Response variable    Predictor variables
                       1        2        ⋯        m
  y1                   x11      x21      ⋯        xm1
  y2                   x12      x22      ⋯        xm2
  y3                   x13      x23      ⋯        xm3
  …                    …        …        …        …
  yn                   x1n      x2n      ⋯        xmn

The statistical model is therefore

    yi = β0 + β1 x1i + β2 x2i + β3 x3i + ⋯ + εi,    ε ~ N(0, σ²),

and the equations to calculate the least-squares estimates bj of the coefficients βj are derived in a manner similar to those for simple regression. As with polynomial regression, we should aim to have n > 2m at least to obtain stable results.

A model with two predictors, ŷ = b0 + b1x1 + b2x2, can be readily visualised by means of perspective projections, and we can see such a model represented by a tilted plane OABC in the three-dimensional representations (Figs. 6.5.1 and 6.5.2). When x2 = 0 we have the simple regression ŷ = b0 + b1x1, shown as line OA. Likewise, when x1 = 0 we have ŷ = b0 + b2x2, shown as line OC. At non-zero values of both x1 and x2, values of ŷ are represented by points on the plane OABC. (Note: in these two figures the value of b0 is zero, but generally this will not be so.)

Fig. 6.5.1. Representation of a fitted function of two predictor variables as a tilted plane OABC in a three-dimensional space.

Fig. 6.5.2. Representation of a fitted function of two predictor variables as a tilted plane OABC in a three-dimensional space. (Same as Fig. 6.5.1, but with the origin at the bottom right corner.)

As in simple regression, when regression is executed on a set of points such as shown in Fig. 6.5.3, the residuals are the vertical distances (that is, in the y-direction) from the points to the fitted plane (Fig. 6.5.4). As with polynomial regression, it is essential good practice to examine the residuals.
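The tilted-plane model ŷ = b0 + b1x1 + b2x2 can be fitted in a few lines with a design matrix. In the sketch below (numpy; the dataset and the true coefficients 0.5, 2 and −1 are made up for illustration, with the two predictors kept only moderately correlated, in the spirit of the warning in this section), least squares recovers the plane from noisy responses.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 5.0, 1.0, 6.0, 3.0, 8.0, 4.0, 7.0])      # r(x1, x2) ~ 0.57 only
rng = np.random.default_rng(1)
y = 0.5 + 2.0 * x1 - 1.0 * x2 + rng.normal(0, 0.1, x1.size)  # known plane + noise

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef
residuals = y - X @ coef                          # vertical distances to the plane
print(np.round(coef, 2))                          # close to [0.5, 2.0, -1.0]
```

Lines OA and OC of the figures correspond to evaluating the fitted plane at x2 = 0 and x1 = 0 respectively.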
Multiple regression plays an essential part in the more complex types of calibration (for example in principal components regression [PCR]), but is perhaps more often used in exploratory data analysis. In the latter instance, it is important to check that the predictor variables used are not strongly correlated. (In PCR the predictors have zero correlation by definition.) If such correlation exists, the regression model may be unstable and the coefficients misleading. In other words, we might get very different results if two randomly selected subsets of the data were both separately treated by multiple regression.

Fig. 6.5.3. Swarm of data points in three-dimensional space.

Fig. 6.5.4. Data points as in Fig. 6.5.3 and a fitted regression model (plane), showing the residuals. Points above the plane are shown in black, below the plane in grey.

6.6 Multiple Regression — An Environmental Example

Key points
— Multiple regression has been used to explore possible relationships between the concentration of lead in garden soils and predictors: (a) distance from a smelter; (b) the age of the house; and (c) the underlying geology.
— A preliminary regression gave a promising outcome but the residuals showed a lack of fit to distance.
— Replacing the distance predictor by a negative exponential transformation gave an improved fit with no significant lack of fit.

The data listed below show lead concentrations found in the soil of 18 gardens from houses in the vicinity of a smelter. Also listed for each garden
are some corresponding environmental factors that might serve to explain the lead concentrations. The explanatory factors are as follows.

• The distance of each house from the smelter.
• The age of the house in years.
• The geological formation underlying the garden. (This is simply identified by codes 1 or 2, which show which one of two rocks is present.)

The task is to try multiple linear regression with lead concentration as the dependent variable and distance, age and type of geology as the predictors (independent variables). Note that while the geology code can only take one of two values, we can still use this variable as a predictor in multiple regression.

Lead, ppm  Distance, m  Age, yr  Geology      Lead, ppm  Distance, m  Age, yr  Geology
  10          9609         22       1            136         6064        166       2
  25          9283         58       1            146         3895         25       1
  69          7369         61       2            164         4051         74       1
  79          9887        132       2            170         6827        187       2
  86          9887        118       1            184         4466        149       2
  94          6130        121       1            198         4919        199       1
 100          8328        125       1            201         6821        295       2
 132          4790         42       2            219         3598         77       2
 132          6248        139       2            275         3438         84       2

The first stage is to ensure that the intended predictors are not strongly correlated. The correlation matrix is as follows.

             Distance      Age
Age           −0.243
Geology        0.040     −0.298

All of these coefficients are small and not significant, so the three variables can be safely used in the intended regression.

The next stage is to scan scatter plots (Figs. 6.6.1–6.6.3) of lead concentration against each variable in turn to see if the data are consistent with the proposed model. We can clearly see that the concentration of lead decreases with increasing distance, as would be expected (Fig. 6.6.1). There are no obvious trends suggesting nonlinearity and no suspect points that might have an undue influence on the regression. There is no clear relationship between lead concentration and the age of the house (Fig. 6.6.2).
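The correlation check between intended predictors can be sketched numerically. A minimal illustration in Python, using a small subset of the garden predictors (the row pairing is assumed for the sketch, so the resulting coefficients are illustrative rather than those quoted for the full dataset):

```python
import numpy as np

# Checking that intended predictors are not strongly correlated.
# Illustrative subset of the garden predictors: distance (m), age (yr), geology code.
distance = np.array([9609.0, 9283, 7369, 9887, 6130, 8328, 4790, 6248])
age = np.array([22.0, 58, 61, 132, 118, 121, 125, 42])
geology = np.array([1.0, 1, 2, 2, 1, 1, 1, 2])

# Off-diagonal entries are the pairwise correlations between predictors;
# values near zero suggest the predictors can safely be used together.
r = np.corrcoef(np.vstack([distance, age, geology]))
print(np.round(r, 3))
```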
The correlation between lead and the age of the house is small and not significant (r = 0.39, p = 0.11). Quite often, however, real relationships are obscured by the influence of other variables, so we must not be tempted to omit the age of the house from the regression at this stage. Figure 6.6.3 illustrates the effect of geology. In this diagram it seems that the mean of the results for the gardens coded as 2 is higher than the results for gardens coded as 1, but it is not clear whether the difference is significant. (Notice that the data points have been 'jittered' slightly in the x-direction in Fig. 6.6.3 so that overlapping points in the y-direction are separated.) There is no reason not to conduct the multiple regression with all three variables.

Fig. 6.6.1. Lead concentrations plotted against distance from the smelter (r = −0.80, p < 0.0005).

Fig. 6.6.2. Lead concentrations plotted against the age of the house (r = 0.39, p = 0.11).

Fig. 6.6.3. Lead concentrations plotted against code for the underlying geology.

The outcome of the multiple regression is shown in the table. The R² value (proportion of variance accounted for by the regression) is 83%, which is very good for an environmental study. Both the distance from the smelter and the age of the house give values of t that are significantly different from
zero at the 95% level of confidence.

Predictor    βj           se(βj)      t = βj/se(βj)    p
Constant     221.98       37.50         5.92           0.000
Distance      −0.024179    0.003494    −6.92           0.000
Age            0.3835      0.1146       3.35           0.005
Geology      −15.63       16.00        −0.98           0.345

It is noteworthy that the age of the house is significant in the multiple regression when it seemed unpromising as a predictor when considered alone. The underlying geology apparently has no significant effect on the lead content of the soil, as the p-value is high.

While this outcome is promising, we now have to examine the residuals for lack of fit or other features that might throw doubt on the suitability of the first regression. The residuals plotted against predictors are shown in Figs. 6.6.4 and 6.6.5. The residuals plotted against the age of the house display no obvious deviation from a random sample except for one suspect value (roughly at 90 years). Plotted against the distance from the smelter, however, there is a distinct suggestion of a curved trend in the residuals, showing that the linear representation was not completely adequate. Intuition supports this outcome, because we would expect the lead contamination to fall with distance from the smelter, but in a roughly exponential manner.

Fig. 6.6.4. Residuals plotted against the age of the house.
Fig. 6.6.5. Residuals plotted against the distance from the smelter.

A possible improvement on the modelling might be obtained by transforming the distance variable d in some way that provides a suitably curved function, and using that as a predictor in a new multiple regression. We can try an exponential decay to give the new variable exp(−d/1000). (The division by 1000 is simply to provide results of a handy magnitude.) We now repeat the multiple regression and obtain the following results.

Predictor        βj          se(βj)     t = βj/se(βj)    p
Intercept          −7.03      13.52      −0.52           0.611
exp(−d/1000)     6096.31     446.3       13.66           0.000
Age                 0.63813    0.06815    9.36           0.000
Geology            14.85       8.89       1.67           0.120

The variance accounted for (R²) is now 95%, which is exceptionally high for an environmental study. Notice that there is no exact way of using the pure error as a test for lack of fit, because there are no replicated results. Both the transformed distance and the age of the house are still highly significant predictors; the geology as a predictor remains not significant at the 95% level of confidence, but the p-value of 0.120 shows that its influence is not implausible, and a relationship might be revealed by a larger or more focused study.
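The distance transformation itself is a one-liner. As a minimal sketch in Python, using a few of the garden distances for illustration:

```python
import numpy as np

# Replace distance d (in metres) by exp(-d/1000), so that predicted lead
# falls off roughly exponentially with distance from the smelter.
d = np.array([9609.0, 6130.0, 4790.0, 3438.0])
x = np.exp(-d / 1000.0)  # the new predictor for the second multiple regression
print(np.round(x, 5))    # small distances give values nearer 1, large distances nearer 0
```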
Fig. 6.6.6. Residuals from the second regression (with transformed distance) plotted against distance from the smelter.

Fig. 6.6.7. Residuals from the second regression (with transformed distance) plotted against age of the house.

None of the residual plots from the new regression (Figs. 6.6.6 and 6.6.7) now shows any obvious deviation from a random pattern. (Notice that the lead residuals have been plotted against the original distance measure rather than the transformed value, but that merely spaces them conveniently: it is not essential.)

Note • The dataset for this example is found in the file named Lead.

6.7 Weighted Regression

Key points
— Weighted regression is designed for heteroscedastic data, i.e., data where the variance of the y value varies with the x value, as opposed to the assumption of uniform variance in simple regression.
— Analytical calibrations are often heteroscedastic, and weighted regression is often beneficial.
— Using simple regression where weighted regression should be used can have a deleterious effect on the statistics.
— Weighted regression is a standard part of statistical packages.

Quite often in analytical calibration the range of the concentration of the analyte extends over several orders of magnitude. In such instances we usually find that the variance of the analytical response increases with concentration. This probably happens also with shorter-range calibrations, but the change is small enough to escape detection. The phenomenon is called heteroscedasticity, and it violates one of the assumptions on which the simple regression model is based. If heteroscedasticity is sufficiently marked, an adaptation of the model is required for best results. This adaptation, which takes account of the changes in variance, is called weighted regression.

The model for this situation is shown in Fig. 6.7.1. The y-values (in calibration the analytical responses) are drawn at random from a distribution of which the standard deviation varies with the x-variable (concentration). (Compare this with the corresponding figure for the simple regression model.)

Fig. 6.7.1. The heteroscedastic model of regression.

Weighted regression works by loading each observation with a weight that is inversely proportional to the variance at the respective concentration. In that way, the regression 'takes more notice' of points with larger weights (smaller variances). Note that the weights are scaled so that the sum is equal to n, which simplifies the formulae somewhat. It should also be apparent that simple regression is a special case of weighted regression in which all of the weights are equal. The regression formulae tabulated below are similar to (and a generalisation of) those for simple regression. The important thing here is not to remember the equations, but to see the analogies between weighted and unweighted regression.
Data
  Standard:  x1, ..., xi, ..., xn;  y1, ..., yi, ..., yn
  Weighted:  x1, ..., xi, ..., xn;  y1, ..., yi, ..., yn;  s1, ..., si, ..., sn

Weight
  Standard:  wi = 1
  Weighted:  wi = n(1/si²) / Σi(1/si²),  i.e. Σi wi = n

Mean
  Standard:  x̄ = Σi xi / n
  Weighted:  x̄w = Σi wi xi / n

Slope of regression
  Standard:  b = Σi(xi − x̄)(yi − ȳ) / Σi(xi − x̄)²
  Weighted:  bw = Σi wi(xi − x̄w)(yi − ȳw) / Σi wi(xi − x̄w)²

Intercept
  Standard:  a = ȳ − b x̄
  Weighted:  aw = ȳw − bw x̄w

Residual variance
  Standard:  s²yx = Σi(yi − ŷi)² / (n − 2)
  Weighted:  s²yx(w) = Σi wi(yi − ŷi)² / (n − 2)

Variance of slope
  Standard:  s²b = s²yx / Σi(xi − x̄)²
  Weighted:  s²b(w) = s²yx(w) / Σi wi(xi − x̄w)²

Variance of intercept
  Standard:  s²a = s²b Σi xi² / n
  Weighted:  s²a(w) = s²b(w) Σi xi² / n

Variance of evaluated x-value*
  Standard:  s²x′ = (s²yx / b²) × [1/m + 1/n + (y′ − ȳ)² / (b² Σi(xi − x̄)²)]
  Weighted:  s²x′(w) = (s²yx(w) / bw²) × [1/w′ + 1/n + (y′ − ȳw)² / (bw² Σi wi(xi − x̄w)²)]

* This is the variance of an unknown x-value x′ calculated from the regression equation as x′ = (y′ − a)/b by means of a measured response y′, which is the mean of m measurements or has a corresponding weight w′.

The calculations of weighted regression are available in the usual statistical packages, so can be executed with ease. The only extra labour comprises the estimation of the weights. Weights can be estimated either from repeat measurement results or from a general experience of the performance of the analytical system. In long-range calibration the change in precision is often sufficient to have a deleterious impact on the resulting statistics, so use of a weighted regression is recommended; this may not be justified when the degree of heteroscedasticity is small. Fortunately, even rough estimates of the weights improve the outcome substantially. (For an example of the latter case, we might assume that the measurement standard deviation is 1% of the result except at zero concentration, where it is half of the detection limit.)
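The analogy between the two columns of formulae can be seen by coding the weighted versions directly. A minimal sketch in Python (synthetic data; with equal standard deviations the weighted formulae reduce exactly to simple regression):

```python
import numpy as np

# Weighted slope and intercept, following the tabulated formulae.
def weighted_fit(x, y, s):
    """Return (slope, intercept) with weights inversely proportional to variance."""
    n = len(x)
    w = (1.0 / s**2) * n / np.sum(1.0 / s**2)  # scaled so that sum(w) == n
    xw = np.sum(w * x) / n                      # weighted mean of x
    yw = np.sum(w * y) / n                      # weighted mean of y
    b = np.sum(w * (x - xw) * (y - yw)) / np.sum(w * (x - xw) ** 2)
    a = yw - b * xw
    return b, a

# With all standard deviations equal, the result is the simple regression fit.
x = np.array([0.0, 1, 2, 3, 4])
y = 2.0 * x + 1.0
b, a = weighted_fit(x, y, np.ones_like(x))
print(round(b, 6), round(a, 6))  # prints 2.0 1.0
```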
Further reading
• 'Why are we weighting?' (2007). AMC Technical Briefs, No. 27. Free download via www.rsc.org/amc

6.8 Example of Weighted Regression — Calibration for 239Pu by ICP–MS

Key points
— The ICP–MS data for calibration of 239Pu are markedly heteroscedastic.
— Weights were calculated by smoothing the initial estimates of standard deviation.
— Weighted regression gives statistics that are clearly more appropriate for the measurement of small concentrations.

In the example given below, weights are estimated from a small number of repeat measurements. Such estimates are, of course, very variable, and a better estimate might be obtained by smoothing them by simple regression, as in Fig. 6.8.1, or even by eye. When conducting this smoothing it is important to check that all of the resulting estimates are reasonable and greater than zero. The calibration data and calculation of the weights are shown in the table below (units are ng l−1).

Concentration     R1      R2      R3    Raw SD  Smoothed SD    Variance    Weight
 0                548     662    1141    315       231            53504    16.7015
 2               6782    9661    9316   1572      1055          1112196     0.8035
 4              15966   14067   17063   1516      1878          3526528     0.2534
 6              25612   30337   26987   2430      2701          7296499     0.1225
 8              30483   32143   35701   2666      3525         12422109     0.0719
10              42680   46291   35968   5239      4348         18903359     0.0473

Columns R1 to R3 show three repeat responses for each concentration. The next column shows the standard deviations calculated from the three responses, and the fitted (smoothed) values are shown in the following column. The variance is the square of the smoothed standard deviation and the weights are calculated from wi = n(1/si²)/Σi(1/si²), so that the sum of the weights is 18 (the total number of measured responses).
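The Weight column can be recomputed directly from the smoothed variances. A minimal check in Python:

```python
import numpy as np

# Weights from the smoothed variances, scaled so that the weights sum to 18.
variance = np.array([53504.0, 1112196, 3526528, 7296499, 12422109, 18903359])
n = 18  # total number of measured responses (3 replicates x 6 concentrations)
w = n * (1.0 / variance) / np.sum(1.0 / variance)
print(np.round(w, 4))  # compare with the Weight column of the table
```

Note how strongly the lowest concentration dominates: its weight (about 16.7) far exceeds the rest combined.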
Fig. 6.8.1. Smoothing the raw standard deviation estimates (circles) by regression (line).

The statistics from the weighted regression are as follows.

Predictor    Coefficient    Standard error      t       p
Intercept       767.8          145.8            5.27    0.000
Slope          4055.0          135.0           30.05    0.000

syx = 142.0    R² = 98.3%

Analysis of variance

Source of variation    Degrees of freedom    Sum of squares    Mean square       F       p
Regression                      1              18208348          18208348      902.70   0.000
Residual error                 16                322734             20171
–Lack of fit                    4                 72606             18152        0.87   0.509
–Pure error                    12                250128             20844
Total                          17              18531082
Note the following.

• We can estimate a detection limit (§9.6) from cL = (3se(a))/b = (3 × 146)/4055 = 0.1 ng l−1.
• The intercept is significantly different from zero (p < 0.0005).
• As we have repeated measurements at each concentration, we can also conduct the test for lack of fit, and find that there is no significant lack of fit (p = 0.509).
• The regression line and the 95% confidence limits are shown in Fig. 6.8.2. The confidence interval is least near the origin of the graph.

We can see the beneficial effects of weighted regression in this instance by repeating the regression without weighting. The statistics are given below.

Predictor    Coefficient    Standard error      t       p
Intercept       559            1119             0.50    0.624
Slope          4126.1           184.9          22.32    0.000

syx = 2679    R² = 96.9%

Analysis of variance

Source of variation    Degrees of freedom    Sum of squares    Mean square       F       p
Regression                      1            3575212011        3575212011     498.3    0.000
Residual error                 16             114794611           7174663
–Lack of fit                    4              24146489           6036622       0.80   0.548
–Pure error                    12              90648122           7554010
Total                          17            3690006622

The most obvious differences from the weighted statistics are as follows.

• The standard error of the intercept is now incorrectly much greater (1119 instead of 146), magnified by a factor of about eight, and as a consequence we are tempted to make the incorrect inference that the intercept is not significantly different from zero (p = 0.624).
• The apparent detection limit is now degraded to (3 × 1119)/4126 = 0.8 ng l−1.
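The detection-limit comparison is simple arithmetic on the quoted statistics. A minimal sketch:

```python
# Detection limit cL = 3 * se(a) / b, using the intercept standard errors
# and slopes quoted above for the two fits.
def detection_limit(se_intercept, slope):
    return 3.0 * se_intercept / slope

weighted = detection_limit(146.0, 4055.0)     # weighted regression
unweighted = detection_limit(1119.0, 4126.0)  # unweighted regression
print(round(weighted, 2), round(unweighted, 2))  # prints 0.11 0.81
```

The unweighted fit degrades the apparent detection limit by roughly a factor of eight, exactly mirroring the inflation of the intercept standard error.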
• The regression line is not much changed (Fig. 6.8.3), but the 95% confidence limit is much wider at low concentrations. Therefore, for accurate work, especially at low concentrations, using weighted regression in calibration may be worth the minor degree of extra effort involved.

Fig. 6.8.2. Calibration data (circles) for 239Pu by ICP–MS, showing the weighted regression line and its 95% confidence interval.

Fig. 6.8.3. Same calibration data as Fig. 6.8.2, with regression line and confidence intervals calculated by unweighted regression.

Further reading
• 'Why are we weighting?' (2007). AMC Technical Briefs, No. 27. Free download via www.rsc.org/amc

6.9 Nonlinear Regression

Occasionally in analytical science we want to study the relationship between variables where the proposed model is nonlinear in the coefficients (β). (This is 'nonlinear' in the technical mathematical sense.) For instance, a model proposed for the variation in the uncertainty of measurement (u) in a particular method as a function of the concentration of the analyte (c) is as follows:

u = (β0² + β1²c²)^(1/2).

This expression is not directly tractable by standard regression methods. In other words, if we minimise the sum of the squared residuals, the resulting
normal equations cannot be solved for β0, β1 by straight algebraic methods. There are several ways of finding the estimates of the βi, but all of these numerical methods depend on iterative procedures starting from initial approximations. Such methods are beyond the scope of these notes.

There is no transformation that will reduce this equation to a linear form. However, if we simply square the expression, we have

u² = β0² + β1²c²,

and writing α0 = β0², α1 = β1² gives us

u² = α0 + α1c²,

which is linear in the coefficients and can be handled by simple regression by regarding u² as the dependent variable and c² as the independent variable.

Another example of this type is encountered in exploring how reproducibility standard deviation varies with concentration of the analyte in interlaboratory studies such as proficiency tests. We might suspect that the data could follow a generalised version of the Horwitz function (see §9.7) with unknown parameters, namely

σR = β0 c^β1.

Again this function cannot be tackled directly by regression. However, transforming the variables simply by taking logarithms gives us

log σR = log β0 + β1 log c.

Now we can regress the dependent variable (log σR) against the independent variable (log c) to obtain estimates of the parameters α0 = log β0 and β1. There is a slight difficulty in that the transformed error term may not be normally distributed, making the tests of significance non-exact, but this is seldom a serious objection. As usual we should look at the residual plots to ensure that the model is adequate.

However, there is another class of nonlinear equations that cannot be treated by transformation. For example, there are theoretical reasons for expecting a calibration curve in ICP–AES to follow the pattern

r = β0 + β1c − e^(β2+β3c).
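The log-log linearisation can be sketched numerically. A minimal illustration in Python (the parameter values 0.02 and 0.85 are illustrative; exact, noise-free data are used so that the regression recovers them precisely):

```python
import numpy as np

# Linearising sigma_R = b0 * c^b1 by taking logs:
# log10(sigma_R) = log10(b0) + b1 * log10(c), a simple regression in the logs.
b0_true, b1_true = 0.02, 0.85
c = np.array([0.001, 0.01, 0.05, 0.1, 0.5])
sigma = b0_true * c ** b1_true

slope, intercept = np.polyfit(np.log10(c), np.log10(sigma), 1)
print(round(slope, 3), round(10 ** intercept, 3))  # recovers b1 and b0
```

Back-transforming the fitted intercept (10^intercept) returns the coefficient on the original mass-fraction scale, exactly as done in the worked example that follows.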
6.10 Example of Regression with Transformed Variables

Key points
— Reproducibility standard deviation was related to concentration of the analyte (protein nitrogen in feed ingredients) using a nonlinear model analogous to the Horwitz function.
— The data were log-transformed before regression.
— No lack of fit was detected.

In a collaborative trial (method performance study) 14 different animal feed ingredients were subjected to the determination of the concentration c of nitrogen (as an indicator of protein content) by the Dumas method in the participating laboratories. The standard deviation of reproducibility σR was calculated for each material. The investigator wished to see whether the results conformed to the Horwitz function (see §9.7), which requires

σR = 0.02c^0.8495,

or, transforming to logarithms base 10,

log σR = log 0.02 + 0.8495 log c = −1.699 + 0.8495 log c.

The primary statistics obtained were the values of c and σR for each material, together with their common logarithms. (All the data are expressed as mass fractions; for example, 0.0946 ≡ 9.46%.)

c (in increasing order): 0.0111, 0.0141, 0.0188, 0.0267, 0.0271, 0.0436, 0.0443, 0.0636, 0.0827, 0.0946, 0.1049, 0.1197, 0.1867, 0.3536

σR (in increasing order): 0.000284, 0.000328, 0.000404, 0.00046, 0.000572, 0.0007, 0.000814, 0.001022, 0.0011, 0.001306, 0.001444, 0.001498, 0.002488, 0.003062
Regression of log σR on log c gave the following output.

Predictor    Coefficient    Standard error      t       p
Intercept     −1.97935        0.08363         −23.67    0.000
Slope          0.79366        0.05915          13.42    0.000

syx = 0.0824    R² = 93.8%

The regression therefore tells us that

log sR = −1.979 + 0.794 log c.

As 10^−1.979 = 0.0105, transforming back to mass fractions gives us

sR = 0.0105c^0.794.

The exponent found is somewhat lower than that of the Horwitz function, although probably not significantly so. The coefficient of 0.0105 is considerably lower than the Horwitz value, showing that the determination is more precise than predicted. These features can be seen in Figs. 6.10.1 and 6.10.2. The residual plot (Fig. 6.10.3) suggests a reasonable fit (the residuals look a bit skewed), but the deviation from normality is not significant at the 95% level of confidence.

Fig. 6.10.1. Nitrogen data (solid circles) with fitted line (solid) and Horwitz function (dashed).

Fig. 6.10.2. Same data and functions as Fig. 6.10.1 but on logarithmic axes.

Fig. 6.10.3. Residuals from the fit to the log-transformed nitrogen data.
Chapter 7

Additional Statistical Topics

This chapter covers a number of practical topics related to the normal distribution and deviations from it that are relevant to the work of analytical chemists. Outlier tests are covered, but the superiority of robust methods in this area is emphasised.

7.1 Control Charts

Key points
— Control charts are used to check that a system is operating 'in statistical control'.
— Shewhart charts traditionally have 'warning limits' at µ ± 2σ and 'action limits' at µ ± 3σ.
— A system is regarded as out of control on the basis of a result outside the action limits.

An analytical system where all the factors that affect the magnitude of errors are kept constant is said to be in 'statistical control'. Under those conditions it would be reasonable to assume that results obtained by repeated analysis of a single test material would resemble independent random values taken from a normal distribution N(µ, σ²). Thus we would expect in the long term about 95% of results to fall within a range of µ ± 2σ and about 99.7% to fall within the range µ ± 3σ. A result obtained outside the latter range would be so unusual under the assumption of statistical control that it is conventionally taken to indicate that the assumption is invalidated, that is, that conditions determining the size of errors have changed, and the analytical system is behaving in a new and unacceptable fashion. Either a new
value of µ prevails, or a larger value of σ, perhaps because of instrument malfunction or a new batch of reagents, or failure on the part of the analyst to observe some aspects of the operating procedure. This requires some investigation of the system and remediation where necessary to restore the initial conditions.

A convenient way to monitor an analytical system in this way is via a control chart. This is based on the results obtained by the analysis of one or more special test materials that have been homogenised and tested for stability. These 'control materials' must be typical of the material under routine test and are analysed exactly as if they were normal samples in every run of the analytical system. The results are plotted on a chart that shows the results as a function of run number. The chart conventionally has lines at µ, µ ± 2σ ('warning limits'), and µ ± 3σ ('action limits') (Fig. 7.1.1). This type of chart is called a Shewhart chart.

Fig. 7.1.1. Shewhart control chart showing results for a control material in successive runs. The arrows show the system going out of control at Run 7 and Run 26.

Under statistical control, a result is very unlikely to fall outside the action limits. Such a point is taken to show that the system is 'out of control', requiring the results obtained in that run to be regarded as suspect, and the analytical system to be halted until statistical control has been restored. Some other occurrences are about equally unlikely, and are also taken to indicate out-of-control conditions, namely: (a) two successive results outside the warning limits; or (b) nine consecutive results on the same side of the mean line.

There are many different kinds of control chart with differing capabilities. The Shewhart chart is good for detecting abrupt changes in the analytical system.
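The Shewhart decision rules can be sketched as a short function. A minimal illustration in Python, assuming known µ and σ for the control material (the nine-consecutive-results rule is omitted for brevity, and the example data are invented):

```python
# Shewhart out-of-control rules: a single result beyond the action limits,
# or two successive results outside the warning limits.
def out_of_control(results, mu, sigma):
    """Return run indices flagged by the action-limit and warning-limit rules."""
    flagged = []
    for i, x in enumerate(results):
        if abs(x - mu) > 3 * sigma:                  # beyond action limits
            flagged.append(i)
        elif i > 0 and abs(x - mu) > 2 * sigma and \
                abs(results[i - 1] - mu) > 2 * sigma:  # two successive beyond warning
            flagged.append(i)
    return flagged

# Run 2 breaches the action limit; runs 4 and 5 are two successive
# results outside the warning limits, so run 5 triggers the second rule.
print(out_of_control([0.1, 0.2, 3.5, 0.0, 2.3, 2.4], mu=0.0, sigma=1.0))  # [2, 5]
```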
Other control charts are better at detecting smaller changes or a drift. The Cusum chart is one such. The Zone chart (Fig. 7.1.2) has roughly the combined capabilities of the Shewhart chart and the Cusum chart and is simple to plot and interpret. In this chart, results are converted into scores that depend on which zone of the chart the result falls into. (These scores are labelled weights in the figure.) With each successive result the scores are aggregated. If a new result falls on the opposite side of the mean from the previous one, the total score is reset to zero before the new score is aggregated. If the aggregate score gets to eight or more, the system is deemed out of control.

Fig. 7.1.2. Zone chart showing part of the same data as Fig. 7.1.1. Symbol positions (numbered circles) show the zone of the current result. Numbers show the accumulated score.

The Zone chart detects all of the out-of-control conditions that the Shewhart chart does, but also detects the smaller change that results in the score of eight at Run 37. The arrows show Run 26 to be out of control as before, but also show Run 37 to be out of control.

Further reading
• Thompson, M. and Wood, R. (1995). 'Harmonised guidelines for internal quality control in analytical chemistry laboratories'. Pure Appl. Chem., 67, pp. 649–666.
• 'The J-chart: a simple plot that combines the capabilities of Shewhart and Cusum charts, for use in analytical quality control' (2003). AMC Technical Briefs, No. 12. Free download via www.rsc.org/amc.
• 'Internal quality control in routine analysis' (2010). AMC Technical Briefs, No. 46. Free download via www.rsc.org/amc.
7.2 Suspect Results and Outliers

Key points
— Analytical datasets often contain suspect values that seem inconsistent with the majority.
— It is often difficult to identify suspect values visually as outliers.
— Outliers can have a large influence on classical statistics.
— Deleting identified outliers before calculating statistics needs an informed judgement.

While a set of repeated analytical results will often broadly resemble a normal distribution, it is not uncommon to find that a small proportion of the results are discrepant, that is, sufficiently different from the rest of the results to make the analyst suspect that they are the outcome of a large uncontrolled variation (i.e., a mistake) in procedure. The data given below and shown in Fig. 7.2.1 can be taken as a typical example. The result at 15.1 is suspected of being an outlier.

15.1  24.9  26.7  27.4  28.9  31.1

Fig. 7.2.1. Results of a determination repeated by five analysts. The result at 15.1 is suspected of being an outlier.

Where there is a documented mistake that accounts for the suspect value, it can be corrected or deleted from the dataset without any question. When there is no such explanation, analysts differ about whether deletion is justified. Those against deletion argue that the discrepant result is still part of the analytical system, so should be retained if the summary statistics are to be fully descriptive and suitable for prediction of future results from the analytical system; to them, deletion seems like an unhealthy subjectivity creeping into the science. Other scientists maintain that deletion is appropriate when
the suspect result is clearly discrepant, so that summary statistics accurately represent the behaviour of the great majority of the results. These scientists have to bear in mind that outliers may occur in the future, although they will have no indication of their probability or magnitude. Both of these arguments have some virtue in particular circumstances, but the crucial decision is whether the suspect point is really discrepant or is simply a slightly unusual selection of results from a normal distribution. Our visual judgement of this is notoriously poor when, as typically in analytical science, there are only a few data available.

Outliers have a large effect on classical statistics (especially the standard deviation), which could invalidate decisions depending on probabilities. This effect can be seen by comparing Figs. 7.2.2 and 7.2.3.

Fig. 7.2.2. The normal distribution modelling the data if the suspect value is included. All of the data are modelled poorly. The mean seems biased low and the standard deviation is inflated.

Fig. 7.2.3. The normal distribution modelling the data if the suspect value is excluded. Most of the data are modelled well, but the suspect result is clearly discrepant.

Statistical tests for outliers abound, but they tend to suffer from various defects. The simple version of Dixon's test (§7.3), for example, will not give a sensible outcome if there are two outliers present at either end of the distribution. A better way of handling suspect data is the use of robust statistics (see §7.5 and §7.6).

Notes and further reading
• 'Rogues and suspects: how to tackle outliers' (April 2009). AMC Technical Briefs, No. 39. Free download via www.rsc.org/amc.
• The dataset can be found in the file named Suspect.
7.3 Dixon's Test for Outliers

Key points
— Dixon's test compares the distance from the suspect value to its nearest value with the range of the data.
— The simple version of Dixon's test is foiled by the presence of a second outlier.

A simple test of a suspect value is Dixon's Q test. The test statistic Q is the distance from the suspect point to its nearest neighbour divided by the range of the data: in terms of Fig. 7.3.1 we have Q = A/B. For our example data we have

Q = A/B = (24.9 − 15.1)/(31.1 − 15.1) = 0.61.

The probability of a value of Q exceeding 0.61 arising from random samples of six observations from a normal distribution is about 0.07. This is almost small enough to reject the null hypothesis (no outlier) at 95% confidence, so we could reasonably treat 15.1 as an outlier. (In this case a Q value of greater than 0.63 would be required to provide 95% confidence.)

Fig. 7.3.1. Dixon's test Q = A/B applied to the aflatoxin data.

A problem with this simple test is that it can be foiled by the presence of a second outlier at either end of the range. Suppose in addition to the previous results there was an extra value at 37.1 ppb (Fig. 7.3.2). The test statistic would then be

Q = A/B = (24.9 − 15.1)/(37.1 − 15.1) = 0.45.

A value as high as 0.45 would arise by chance with a probability as high as 0.17, so we should now be unwilling to regard the low value as an outlier. Indeed, it is not an outlier even though it is the same distance as before from the closest value! A similar problem would arise if the extra value were on the same extreme as the original suspect value.

Fig. 7.3.2. Dixon's test applied to an extended dataset.
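The simple form of the test is a one-line calculation. A minimal sketch in Python, using the values from the worked example:

```python
# Simple Dixon Q test for a suspect lowest value: the gap from the suspect
# value to its nearest neighbour, divided by the range of the data.
def dixon_q_low(data):
    s = sorted(data)
    return (s[1] - s[0]) / (s[-1] - s[0])

aflatoxin = [15.1, 24.9, 26.7, 27.4, 28.9, 31.1]
q = dixon_q_low(aflatoxin)
print(round(q, 2))  # 0.61, as in the worked example

q2 = dixon_q_low(aflatoxin + [37.1])  # with the extra value at 37.1
print(round(q2, 2))  # 0.45: the same gap, but the inflated range foils the test
```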
There are modified versions of Dixon's test that can be applied to this new situation, but these and similar problems affect many outlier tests. A more sophisticated treatment of suspect values is preferable (§7.6).

Notes and further reading
• 'Rogues and suspects: how to tackle outliers'. (April 2009). AMC Technical Briefs No. 39. Free download via www.rsc.org/amc.
• Critical values of Q can be found in many statistical texts.
• The dataset is in the file named Suspect.

7.4 The Grubbs Test

Key points
— The Grubbs test in its simplest form tests for an outlier by calculating the difference between the largest (or smallest) value and the mean of a dataset, with respect to the standard deviation.
— It can be expanded to test for multiple outliers.

This is a more sophisticated test for outliers than Dixon's test. It is used to detect outliers in a dataset by testing for one outlier at a time. Any outlier which is detected is deleted from the data and the test is repeated until no outliers are detected. However, multiple iterations change the probabilities of detection, and the test should not be used for sample sizes of six or less, since it frequently tags most of the points as outliers.

The basic assumption underlying the Grubbs test is that, outliers aside, the data are normally distributed. The null hypothesis is that there are no outliers in the dataset. The test statistic G is calculated for each result xi from the sample mean x̄ and standard deviation s as G = max|xi − x̄|/s, that is, the largest absolute deviation from the sample mean in units of the sample standard deviation. This form of the Grubbs test is therefore a two-tailed test. There are other forms of the test, including one-tailed versions.

As with Dixon's test, the aflatoxin data can be used as an example:

15.1  24.9  26.7  27.1  28.4  31.1
We carry out a two-tailed test as above under the null hypothesis that there is no outlier in the dataset. The test statistic is calculated as G = |15.1 − 25.55|/5.5186 = 1.894. This is compared with the 95% critical value for n = 6 of 1.933. As the calculated value of G is less than the critical value, the null hypothesis is not rejected at 95% confidence and the result 15.1 is not identified as an outlier. However, G is high enough at least to warrant checking the calculations leading to the result.

Notes and further reading
• 'Grubbs' is the name of the originator of this test, and is not a possessive case.
• 'Rogues and suspects: how to tackle outliers'. (April 2009). AMC Technical Briefs No. 39. Free download via www.rsc.org/amc.
• Tables of critical levels of the test statistic can be found in some statistical texts. A brief table for the two-tailed test is given below, with N the number of observations.
N     G95%     G99%
6     1.933    1.993
7     2.081    2.171
8     2.201    2.316
9     2.300    2.438
10    2.383    2.542
11    2.455    2.631
12    2.518    2.709
13    2.574    2.778
14    2.624    2.840
15    2.669    2.895
16    2.710    2.945
17    2.748    2.991
18    2.782    3.032
19    2.814    3.071
20    2.843    3.106
• In the absence of tables, critical levels can be calculated from the equation

      Gcrit = ((N − 1)/√N) · √[ t² / (N − 2 + t²) ],   where t = t(α/2N, N−2),

  with t(α/2N, N−2) denoting the critical value of the t distribution with (N − 2) degrees of freedom and a significance level of α/2N. (α is the overall significance level so, for G95%, α = 0.05.)
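The Grubbs statistic itself is just as short to compute as Dixon's Q. A sketch in Python (standard library only), using the aflatoxin data of the example and the 95% critical value 1.933 for N = 6 from the table above:

```python
from statistics import mean, stdev

def grubbs_g(values):
    """Two-tailed Grubbs statistic: largest absolute deviation from the
    sample mean, in units of the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return max(abs(x - m) for x in values) / s

aflatoxin = [15.1, 24.9, 26.7, 27.1, 28.4, 31.1]
g = grubbs_g(aflatoxin)
print(round(g, 3))   # 1.894
print(g > 1.933)     # False: 15.1 is not flagged at 95% confidence
```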
7.5 Robust Statistics — MAD Method
Key points
— Robust methods reduce the influence of outlying results and provide statistics that describe the distribution of the central or 'good' part of the data.
— The methods are applicable to data that seem to be unimodal and roughly symmetrically distributed.
— Robust statistics must be used with care for prediction.
— The MAD method is a quick way of calculating a robust mean and standard deviation, without requiring any decisions about rejecting outliers.
The use of robust statistics enables us to circumvent the sometimes contentious issue of the deletion of outliers and provides perhaps the best method of identifying them. Robust methods of estimating statistics such as the mean and standard deviation (and many others) reduce the influence of outlying results and heavy tails in distributions. Robust statistics requires the original data to be similar to normal (i.e., roughly symmetrical and unimodal) but with a small proportion of outliers or heavy tails. It cannot be used meaningfully to describe strongly skewed or multimodal datasets. Robust statistics must be used with care in prediction: they will not enable the user to predict the probability or likely magnitude of outliers.

There are a number of robust methods in use. One of the simplest is the MAD (Median Absolute Difference) method. Suppose we have replicated results as follows: 145 130 157 153 183 148 151 143 147 163.
Putting these in ascending order we have: 130 143 145 147 148 151 153 157 163 183.
We need the median of these results, that is, the central value. In this instance the median is the mean of 148 and 151, namely 149.5. This median is a robust estimate of the mean, that is, µ̂ = 149.5. It is unaffected by making the extreme values more extreme, for instance, by changing the
size of the lowest value to any value lower than 148, and/or the highest value to any value greater than 151. The next stage is to subtract the median of the data from each value and ignore the sign of the diﬀerence, giving the absolute diﬀerences (in the same order as immediately above): 19.5 6.5 4.5 2.5 1.5 1.5 3.5 7.5 13.5 33.5
If we sort these absolute diﬀerences into increasing order, we have: 1.5 1.5 2.5 3.5 4.5 6.5 7.5 13.5 19.5 33.5.
The median of these results (the median absolute difference) is the mean of 4.5 and 6.5, namely 5.5. This also is unaffected by the magnitude of the extreme values. We multiply this median by the factor 1.4825, which is derived from the properties of the normal distribution. The product is the robust estimate of the standard deviation, namely σ̂ = 5.5 × 1.4825 = 8.2 (to two significant figures). (Note: the robust statistics are designated µ̂, σ̂ to distinguish them from the classical estimators x̄, s, which are used only as defined in §2.3.) The robust estimates are thus µ̂ = 149.5, σ̂ = 8.2.

The MAD method is quick and has a negligible deleterious effect on the statistics if the dataset does not include outliers. It therefore can be used in emergencies (i.e., when there isn't a calculator handy). There are somewhat better ways of estimating robust means and standard deviations, but they all require special programs to do the calculations.

Notes and further reading
• A factor of 1.5 applied to the MAD (instead of 1.4825) is accurate enough for all analytical applications.
• 'Rogues and suspects: how to tackle outliers'. (April 2009). AMC Technical Briefs No. 39. Free download from www.rsc.org/amc.
• Rousseeuw, P.J. (1991). Tutorial to Robust Statistics, J Chemomet, 5, pp. 1–20. This paper can be downloaded gratis from ftp://ftp.win.ua.ac.be/pub/preprints/91/Tutrob91.pdf.
• The dataset used in this example is named Outlier.
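The whole MAD calculation takes only a few lines of Python (standard library only); the data are the replicate results used above and the factor 1.4825 is the normal-consistency constant discussed in the notes:

```python
from statistics import median

def mad_estimates(values, factor=1.4825):
    """Robust mean and sd by the MAD method: the median estimates the
    mean; the scaled median absolute difference estimates the sd."""
    m = median(values)
    mad = median(abs(x - m) for x in values)
    return m, factor * mad

data = [145, 130, 157, 153, 183, 148, 151, 143, 147, 163]
mu, sigma = mad_estimates(data)
print(mu, round(sigma, 1))   # 149.5 8.2
```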
7.6 Robust Statistics — Huber's H15 Method
Key points
— Huber's H15 method is a procedure for calculating a robust mean and standard deviation.
— It is an iterative procedure, starting with initial estimates of the statistics.
— At each iteration, the data are 'winsorised' (modified) by using the current values of the statistics.
— Comparing data with a robust fit is one of the best ways of identifying suspect values.

Several methods for robustifying statistical estimates depend on downweighting extreme values in some way, to give them less influence on the outcome of the calculation. Huber's H15 method is one such that has been widely used in the analytical community. Like most of these methods, it relies on taking initial rough estimates of the statistics (µ̂0, σ̂0) and refining them by an iterated procedure.

Using the data from §7.5, we subject it to a process called 'winsorisation'. This involves replacing original datapoints x falling outside the range µ̂0 ± kσ̂0 with the actual range limits. This creates pseudo-values x̃ thus:

    x̃1 = µ̂0 + kσ̂0,   if x > µ̂0 + kσ̂0
    x̃1 = µ̂0 − kσ̂0,   if x < µ̂0 − kσ̂0
    x̃1 = x,           if µ̂0 − kσ̂0 < x < µ̂0 + kσ̂0
The first revised estimates of the statistics are then given by µ̂1 = mean(x̃1) and σ̂1 = sd(x̃1)/θ. For a moderate proportion of outlying results (and most analytical applications) k can be set at 1.5; the corresponding value of θ, derived from the properties of the normal distribution, is 0.882. The process is then repeated, using µ̂1, σ̂1 to winsorise the data and calculate the improved estimates µ̂2, σ̂2 in the same manner, and so on until a sufficient degree of convergence is obtained. Convergence is slow, so a computer is required.

The table below shows the application of this to the previously-used suspect data (row x0), starting with the MAD estimates µ̂0 = 149.5, σ̂0 = 8.15. The first replacement limits are µ̂0 ± kσ̂0 = (137.27, 161.73), so any value less than 137.27 becomes 137.27 and any value greater than
161.73 becomes 161.73. Thus in the first winsorisation (row x̃1) three values are replaced, producing estimates of µ̂1 = 150.47, σ̂1 = 9.11 to two decimal places.
        Data (winsorised pseudo-values)                                  µ̂rob     σ̂rob
x0      130.00  143  145  147  148  151  153  157  163.00  183.00   149.50    8.15
x̃1      137.27  143  145  147  148  151  153  157  161.73  161.73   150.47    9.11
x̃2      136.81  143  145  147  148  151  153  157  163.00  164.14   150.80    9.87
x̃3      135.98  143  145  147  148  151  153  157  163.00  165.61   150.86   10.33
x̃4      135.36  143  145  147  148  151  153  157  163.00  166.36   150.87   10.62
...
x̃17     134.21  143  145  147  148  151  153  157  163.00  167.54   150.88   11.11
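The iteration is a short loop in Python (standard library only). This sketch uses k = 1.5, θ = 0.882 and the MAD starting values given above, and reproduces the figures in the table to within rounding:

```python
from statistics import mean, stdev

def huber_h15(data, mu, sigma, k=1.5, theta=0.882, iterations=17):
    """Huber H15: repeatedly winsorise the original data at mu +/- k*sigma,
    then re-estimate mu and sigma from the winsorised pseudo-values."""
    for _ in range(iterations):
        lo, hi = mu - k * sigma, mu + k * sigma
        w = [min(max(x, lo), hi) for x in data]   # winsorised pseudo-values
        mu, sigma = mean(w), stdev(w) / theta
    return mu, sigma

data = [130, 143, 145, 147, 148, 151, 153, 157, 163, 183]
mu, sigma = huber_h15(data, mu=149.5, sigma=8.15)
print(round(mu, 2), round(sigma, 2))   # approaches 150.88 and 11.11
```

Note that each winsorisation is applied to the original data with the current limits, not to the previously winsorised values, as the table shows.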
In subsequent iterations only two values are replaced. The results have stabilised sufficiently by the 17th iteration, giving final estimates of µ̂rob = 150.88, σ̂rob = 11.11. These can be compared with the classical statistics, for the complete data (x̄ = 152.0, s = 14.0) and for the data with the suspect value deleted (x̄ = 148.6, s = 9.33). Simply deleting the suspect value gives a standard deviation that is too low.

Robust statistics is probably one of the best ways of identifying outliers. If we pseudo-standardise the dataset as z = (x − µ̂rob)/σ̂rob, the 'good' results should resemble a sample from the standard normal distribution. Any results with a magnitude greater than about 2.5 can therefore be regarded as at least suspect, if not outlying. If we apply this transformation to our example data (in increasing order) we obtain:

z:  −1.9  −0.7  −0.5  −0.3  −0.3  0.0  0.2  0.6  1.1  2.9
The value of 2.9 suggests that the original result of 183 is suspect and that its provenance should be investigated further at least.

Notes and further reading
• 'Rogues and suspects: how to tackle outliers'. (April 2009). AMC Technical Briefs No. 39. Free download from www.rsc.org/amc.
• Analytical Methods Committee. (1989). Robust Statistics — How Not to Reject Outliers. Part 1. Basic Concepts, Analyst, 114, pp. 1693–1697.
• There is Excel software for conducting the H15 method in the AMC Software collection, free download via www.rsc.org/amc.
• Rousseeuw, P.J. (1991). Tutorial to Robust Statistics, J Chemomet, 5, pp. 1–20. This paper can be downloaded gratis from ftp://ftp.win.ua.ac.be/pub/preprints/91/Tutrob91.pdf.
• The dataset used in this example is named Outlier.

7.7 Lognormal Distributions

Key points
— A variable x is lognormally distributed if log x is normally distributed.
— The shape of a lognormal distribution depends on its relative standard deviation.
— Lognormal distributions of error are rare in chemical measurement.
— Some analytical circumstances give rise to distributions with a quasi-lognormal distribution.
— Log-transformation sometimes can be safely used to stabilise variance before regression or ANOVA.

A variable x is lognormally distributed if log x is normally distributed. Figure 7.7.1 shows a lognormal distribution with a mean of two and a standard deviation of one, that is, with a relative standard deviation (RSD) of 50%. It has zero density (height) when x is zero, and a positive skew. All lognormal distributions (but many others) have these two properties. A plot of density against log x (Fig. 7.7.2) has the familiar shape of the normal distribution.

[Fig. 7.7.1. A lognormal distribution with a relative standard deviation of 50%.]
[Fig. 7.7.2. The same distribution plotted against log10 x.]

The shape (degree of skewness) of the lognormal distribution depends on the RSD. For instance, a distribution with an RSD of 10%
(Fig. 7.7.3) or lower is very different from one with an RSD of 50%: it shows little visible sign of asymmetry and is hard to distinguish visually from a normal distribution.

[Fig. 7.7.3. Lognormal distribution with an RSD of 10%.]

A lognormal distribution in measurement implies that the errors are multiplicative. The physical circumstances of a chemical measurement rarely give rise to results that genuinely follow that rule. However, in one currently-important type of measurement — the determination of specific DNA sequences from genetically-modified food by the real-time polymerase chain reaction (PCR) — that circumstance is approximately realised. Because the fundamental procedure of PCR is multiplicative, the errors tend to follow the same pattern, so we might find the 95% confidence limits of repeated measurements at (say) 0.5 and 2.0 times the mean concentration. For this particular measurement a log-transformation of the results has been found to be justified and helpful. In nearly all other types of chemical measurement these conditions do not apply, and log-transformation should be used with due caution.

Repeat results at low concentrations may sometimes appear to be similar to the lognormal because they have been censored or truncated at zero. This is important to remember because appearances can be deceptive at concentrations near the detection limit, where the RSD is high — by definition greater than about 30%. The confusion arises because true concentrations cannot be below zero. However, the results of measurements are not true concentrations — they include errors — and they can and sometimes do fall below zero. Some analysts are uncomfortable with this apparent conflict and as a consequence do not record negative results. Despite the appearance of censored results, however, log-transformation will be misleading in this situation.
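These properties are easy to explore by simulation. The sketch below (standard library only; the concentration levels and sample size are arbitrary) draws lognormal samples with the RSDs used in the figures and shows that the standard deviation of log10 x depends only on the RSD, which is the basis of the variance stabilisation discussed below:

```python
import math, random

random.seed(1)

def lognormal_sample(mean, rsd, n):
    """Draw n lognormal values with a given mean and relative standard
    deviation, via the parameters of the underlying normal of ln(x)."""
    s2 = math.log(1 + rsd ** 2)      # variance of ln(x)
    mu = math.log(mean) - s2 / 2     # mean of ln(x)
    return [random.lognormvariate(mu, math.sqrt(s2)) for _ in range(n)]

def sd(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

for rsd in (0.50, 0.10):
    xs = lognormal_sample(2.0, rsd, 50_000)
    logs = [math.log10(x) for x in xs]
    print(rsd, round(sd(xs) / (sum(xs) / len(xs)), 2), round(sd(logs), 2))
# the sample RSDs come out near 0.50 and 0.10, while the sd of log10(x)
# is about 0.21 and 0.04 respectively, whatever the mean concentration
```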
Another mathematical operation that gives rise to a skewed distribution (and causes the same type of confusion) is the division of one imprecise variable by another. That might happen in the correction of raw analytical results for recovery. Near the detection limit, and in combination with censoring results at zero, the resulting distribution could on casual inspection be taken as lognormal by the unwary.

Analytical chemists sometimes encounter other datasets where a variable is genuinely strictly positive and skewed. Concentrations of trace analytes in more-or-less random collections of natural materials (e.g., copper in sediments [§1.3]) usually have that property and sometimes approach lognormal. For example, samples taken from a contaminated site or area may show a similar distribution if the contamination is patchy. Log-transformation is sometimes useful in the characterisation of such data but, as always, should be used with due caution.

There are, however, situations where log-transformation can be helpful, and that is in regression and analysis of variance where the precision of the response varies widely but its RSD can be taken as constant. Log-transformation will stabilise the variance of the response in that situation, because different variables with the same RSD all have the same absolute standard deviation when log-transformed. For example, an RSD of 10% becomes a constant SD of 0.045 under log-transformation, regardless of the concentration. The transformed data will still have a distribution close to normal (unless the RSD is much higher than 10%), and the transformation should not cause noticeable skewness except near the detection limit, so the usual assumptions of simple regression can be made. Of course, weighted regression will serve if the weights can be estimated, but often the information is not available.

7.8 Rounding

Key point
— Round the standard deviation to two significant figures and round the mean to the same number of decimal points (or trailing zeros).
Modern computers use a large number of significant figures in calculations, and applications often provide statistics with an excessive number of significant figures. For example, a computer might tell us that the statistics
for the data in §7.2 are: mean 25.5500, and standard deviation 5.5186. Such data need to be rounded before reporting. We are normally taught to retain only the first figure that is uncertain, but this commonly-used rule is sometimes too rigorous, and it is obvious that the fourth decimal place (at least) is quite meaningless.

There is a simple rule for appropriate rounding when we report such data: round the standard deviation to two significant figures and round the mean to the same number of decimal points (or trailing zeros). Applying this rule to the statistics above gives us: mean 25.6, and standard deviation 5.5. The rationale here is that estimated standard deviations are hardly ever more accurate than that. This rule nearly always leaves a generous number of significant figures. But it is important not to overdo the rounding, which can destroy useful information. This may be important if the results are to be used in further statistical operations.

7.9 Nonparametric Statistics

Key points
— Nonparametric tests require less stringent assumptions than parametric tests.
— Many parametric tests have nonparametric equivalents.

Nonparametric tests are sometimes called distribution-free statistics because they do not depend on the data being drawn from normal distributions: the assumption of normality is not required. More generally, nonparametric tests require less restrictive assumptions about the data. An important reason for using these tests is that they allow the analysis of rank data, hence their widespread use in the social sciences. Many statistical tests have nonparametric equivalents. Despite these useful features, nonparametric tests are seldom used by analytical chemists, because parametric methods are usually more powerful and the normal distribution is often a reasonable assumption in physical measurement. The most commonly used nonparametric test in analytical chemistry is perhaps the Mann–Whitney U test, which is explained in detail below.
The Mann–Whitney test is the nonparametric equivalent of the two-sample t-test for comparing the central tendencies of two independent datasets.
For datasets P and Q, the null and alternative hypotheses for a two-tailed test are:

H0: Median(P) = Median(Q)
HA: Median(P) ≠ Median(Q)

One-tailed tests can also be carried out. The test is based only on the following assumptions: the datasets are independent random samples from the respective populations, and the measurement scale is at least ordinal. A confidence interval for the difference between the population medians can be estimated with the further assumption that the two population distribution functions are identical apart from a possible difference in location.

The data samples, of size m (the larger set P) and n (the smaller set Q), are pooled and ranked. The test statistic is obtained by calculating the lesser of UP and UQ, where

UP = nm + m(m + 1)/2 − SP,
UQ = nm + n(n + 1)/2 − SQ,

and SP, SQ are the sums of the pooled ranks for the respective datasets.

An example is shown using the wheat flour data from §1.3, in which two methods are used to test for nitrogen in a sample of wheat flour. The ranked data (that is, tabulated in increasing order) are shown below along with the associated method (K = Kjeldahl, D = Dumas). Note that tied results (e.g., 2.92) are each given the mean rank (e.g., 1.5).

Result %      Rank        Result %      Rank
2.92 (K)       1.5        3.02 (K)       9.5
2.92 (K)       1.5        3.04 (D)      11.5
2.98 (K)       3.5        3.04 (D)      11.5
2.98 (D)       3.5        3.05 (K)      13.5
3.00 (K)       5          3.05 (D)      13.5
3.01 (K)       7          3.07 (K)      15
3.01 (K)       7          3.08 (D)      16.5
3.01 (D)       7          3.08 (D)      16.5
3.02 (K)       9.5        3.12 (D)      18

For the Kjeldahl (larger) dataset, m = 10 and the sum of ranks is SK = 73. For the Dumas (smaller) dataset, n = 8 and the sum of ranks is SD = 98.
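The pooling and mid-ranking bookkeeping is easy to get wrong by hand. A sketch in Python (standard library only) that computes U for the two wheat-flour samples as ranked in the example:

```python
def mann_whitney_u(p, q):
    """Mann-Whitney U via mid-rank pooling. Compare the returned U with
    a tabulated critical value; reject H0 if U is small enough."""
    pooled = sorted(p + q)

    def mid_rank(v):
        lo = pooled.index(v) + 1           # first rank position of v
        hi = lo + pooled.count(v) - 1      # last rank position of v
        return (lo + hi) / 2               # tied values share the mean rank

    m, n = len(p), len(q)
    sp = sum(mid_rank(v) for v in p)
    sq = sum(mid_rank(v) for v in q)
    up = m * n + m * (m + 1) / 2 - sp
    uq = m * n + n * (n + 1) / 2 - sq
    return min(up, uq)

kjeldahl = [2.92, 2.92, 2.98, 3.00, 3.01, 3.01, 3.02, 3.02, 3.05, 3.07]
dumas = [2.98, 3.01, 3.04, 3.04, 3.05, 3.08, 3.08, 3.12]
print(mann_whitney_u(kjeldahl, dumas))   # 18.0
```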
From this we have UK = 80 + 55 − 73 = 62 and UD = 80 + 36 − 98 = 18, so the value of the test statistic U = 18 is the lesser of these. For the sample sizes in this example, the critical value for 95% confidence is 17, and the null hypothesis is rejected only if U is less than or equal to this value. As 18 > 17, the null hypothesis cannot be rejected: the medians are not significantly different at 95% confidence. Most statistical software will provide this information plus a p-value, and confidence limits for the difference between the medians.

Notes and further reading
• There are a number of other tests that are counterparts of a parametric test. These include the Wilcoxon Matched Pairs Signed Ranks test, which is the equivalent of the paired t-test, and the Kruskal–Wallis test, which is a method for comparing several independent random samples and which can be used as a nonparametric alternative to the one-way ANOVA.
• Further details and tables of critical values can be found in standard reference books.
• Most statistical software packages provide all of the common nonparametric tests.

7.10 Testing for Specific Distributions — the Kolmogorov–Smirnov One-Sample Test

Key points
— The Kolmogorov–Smirnov test is a nonparametric test used to test whether or not a single sample of data is consistent with a specified distribution function, often the normal distribution.
— The data values are ordered and compared with the equivalent value from the distribution.

The Kolmogorov–Smirnov statistic quantifies a distance between the empirical cumulative distribution function of the sample and the cumulative distribution function of the hypothesised distribution. To calculate the test statistic it is necessary to calculate the values of F(x), the empirical cumulative distribution function, and G(x), the cdf from the hypothesised distribution — in this example the normal distribution. A graphical representation is shown in Fig. 7.10.1.
[Fig. 7.10.1. Cumulative distribution functions of a test dataset (step function) and a normal distribution (dashed line) with mean zero and unit standard deviation.]

Observed data     F(x)      G(x)      F(x) − G(x)
−3                0         0.0010    −0.0010
−2                0.1429    0.0183     0.1246
−1                0.2857    0.1478     0.1379
 0                0.4286    0.5000    −0.0714
 1                0.5714    0.8414    −0.2700
 2                0.7143    0.9773    −0.2630
 3                0.8571    0.9987    −0.1416
 4                1.0       1.0        0

The test statistic is the largest absolute discrepancy:

D = max|F(x) − G(x)| = 0.27

The hypotheses for the Kolmogorov–Smirnov test are defined as

H0: The data follow the specified distribution
HA: The data do not follow the specified distribution

The null hypothesis is rejected if the test statistic, D, is greater than the critical value, which is provided by most statistical software. Tables can be found in textbooks and online.
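A sketch of the calculation in Python (standard library only). It takes the empirical cdf as (i − 1)/(n − 1) at the i-th ordered point, matching the F(x) column of the example, and uses the exact standard normal cdf via the error function; the statistic agrees with the 0.27 above:

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def ks_statistic(data, cdf):
    """One-sample Kolmogorov-Smirnov distance, with the empirical cdf
    taken as (i - 1)/(n - 1) at the i-th ordered point."""
    xs = sorted(data)
    n = len(xs)
    return max(abs(i / (n - 1) - cdf(x)) for i, x in enumerate(xs))

data = [-3, -2, -1, 0, 1, 2, 3, 4]
print(round(ks_statistic(data, norm_cdf), 2))   # 0.27
```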
For n = 8 the critical value is 0.457 at the 95% level. As 0.27 < 0.457, the null hypothesis cannot be rejected.

Assumptions:
• It only applies to samples from continuous distributions.
• The distribution must be fully specified. If location, scale and shape parameters are estimated from the data, the critical region of the Kolmogorov–Smirnov test is no longer valid: it typically must be determined by simulation.

The distribution specified in the null hypothesis is often the normal distribution. Hence the Kolmogorov–Smirnov test is often used as a test for normality. Testing for normality should nonetheless be treated with caution. Consider the following points:
• Small samples almost always pass a normality test: normality tests have little power to tell whether or not a small sample of data comes from a normal distribution.
• With large samples, minor deviations from normality may be flagged as statistically significant, even though small deviations from a normal distribution will not affect the results of a t-test or ANOVA.
• If a dataset fails the null hypothesis it is not always the case that parametric tests cannot be applied.

Here the test compares the cumulative distribution of the data with the expected cumulative normal distribution, with the test statistic being based on the largest discrepancy.

7.11 Statistical Power and the Planning of Experiments

Key points
— The power of a statistical test is the probability of rejecting a null hypothesis when it is in fact false.
— Power calculations provide a way of checking whether a proposed experiment is capable of delivering a useful result at a minimal cost.

The outcome of a statistical test stems from a balance between various circumstances, namely the magnitude of the effect being tested, the precision of the measurements and the number of measurements made. (An effect is
the deviation of the test statistic from H0; this will be demonstrated in the following example.) Suppose the outcome of a test was 'not significant'. It might well be that the opposite outcome would have been recorded if the analytical method had been more precise or more repeat measurements made. Equally, an effect might be significant but of a magnitude that is of no importance in the context of the test. It is worthwhile considering this balance before any measurements are undertaken: there is no point in undertaking an experiment that is unlikely to provide a useful outcome. However, as measurements cost money, and higher precision costs more than lower, it is important to commit the least resource that will provide a useful outcome.¹ It is therefore very good practice to estimate the power of a proposed experiment before it is undertaken. These considerations come under the heading of statistical power.

Section 1.11 described a critical level of probability that we regard as convincing for the particular inference that we wish to make. In statistical power terminology, this critical probability is referred to as α; the confidence level is 1 − α. A 'Type I' error occurs when a true null hypothesis is rejected: the probability of a Type I error is equal to α. A Type II error occurs when a false null hypothesis is accepted. However, the probability of a Type II error depends on the specific alternative hypothesis; for a particular HA this probability is often represented by β. The probability of rejecting a null hypothesis when it is false is called the power of a test. The power is therefore a probability with the value 1 − β, and clearly also depends on HA. The position is summarised in this table.

                          H0 true                HA true
Decision: accept H0       Correct decision,      Type II error,
                          probability 1 − α      probability β
Decision: reject H0       Type I error,          Correct decision,
                          probability α          probability 1 − β,
                                                 the power of the test

Calculating the power initially requires specifying the effect size that is required to be detected. The greater the effect size, the greater the power. Power can also be increased by improving the precision of the measurements and by increasing the number of replicated results. Increasing the number of

¹ Ethical considerations are involved as well as money.
Experiments involving people or animals should be as small as is consistent with a useful outcome.

replications is the most commonly used method for increasing statistical power.

Example

Suppose that we wish to see whether the concentration of an analyte in a test material is significantly different from 20 ppm at the 95% confidence level. We consider making four measurements of concentration on the material by using a method with a known standard deviation of 2 ppm. We wish to know whether this experiment would be likely to provide the information that we want. Note that we are doing this calculation before making any measurements, and have to rely on previous experience of the precision of the analytical method.
Although there are no formal standards for power, a value of more than 0.80 is sometimes regarded as satisfactory.

The null hypothesis for the test, H0: µ = 20, is represented by the upper graph in Fig. 7.11.1, which shows the t-distribution with three (n − 1) degrees of freedom, with the critical regions for 95% confidence shaded black, so that α = 0.05 for this two-tailed test. The upper limit U of the confidence region falls at µ + tσ/√n = 23.18 for data drawn originally from a normal distribution. An observed mean above this limit would occur with a probability of 0.025 and would be regarded (falsely) as significant under H0.

[Fig. 7.11.1. Representation of a null hypothesis (upper graph, H0: µ = 20, with lower (L) and upper (U) 95% confidence limits) and an alternative hypothesis (lower graph, HA: µ = 24) showing the power (1 − β) of the test.]

Now suppose that, in the context of the application, a deviation of less than 4 ppm from the H0 value of 20 could be regarded as inconsequential or unimportant. We can then focus on the specific alternative hypothesis HA: µ = 24 as the upper limit of this acceptable range. This is represented
by the lower graph in Fig. 7.11.1. If HA were true, we would make a Type II error with a probability of β = 0.24. (This is shown as the area below U, shaded light grey in the lower graph, and calculated from the t3 distribution.) The test would therefore have a power of (1 − β) = 0.76 (the area shaded dark grey in the graph). A material containing 24 ppm would be detected as significantly different in only 76% of experiments as originally proposed, which would be unsatisfactory in many applications.

If we needed a more powerful test we could increase the number of repeat measurements. The relationship between power and number of measurements in this experiment is shown in Fig. 7.11.2. It is clear that a sample of six repeat measurements would nearly always provide the information we wanted (with a power of 0.97), while an experiment with less than five measurements would not. (We could also increase the power by using a more precise analytical method, if one were available.)

[Fig. 7.11.2. Estimated power of the significance test as a function of the number of repeat measurements.]

Notes and further reading
• Power can be estimated for all of the usual tests of significance, including those in analysis of variance and regression. The logic of estimating power is intricate, but statistical software packages usually provide power calculations for the most common tests.
• It is a good idea to err on the safe side, as the power of the actual experiment may not be as good as predicted.
• 'Significance, importance and power'. (March 2009). AMC Technical Briefs No. 38. Free download via www.rsc.org/amc.
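The β = 0.24 figure in this example can be checked by brute force. This sketch (standard library only) simulates the proposed experiment many times with the true mean at 24 ppm and counts how often a two-tailed one-sample t-test against µ0 = 20 rejects at 95% confidence; the two-tailed critical value t = 3.182 for three degrees of freedom is taken from standard tables:

```python
import math, random

random.seed(7)

def t_statistic(xs, mu0):
    """One-sample t statistic for H0: mean = mu0."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return (m - mu0) / (s / math.sqrt(n))

def power(true_mean, mu0=20.0, sigma=2.0, n=4, t_crit=3.182, trials=50_000):
    """Monte Carlo power: fraction of simulated experiments in which a
    two-tailed t test rejects H0 when the true mean is true_mean."""
    hits = sum(
        abs(t_statistic([random.gauss(true_mean, sigma) for _ in range(n)], mu0)) > t_crit
        for _ in range(trials)
    )
    return hits / trials

print(round(power(24.0), 2))   # close to the 0.76 derived in the text
```

Rerunning with n set to 5 or 6 reproduces the trend of Fig. 7.11.2.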
PART 2 Data Quality in Analytical Measurement .
Chapter 8

Quality in Chemical Measurement

This chapter reviews the topic of 'quality' in analytical measurement as a basis for the following discussion of the statistical methods involved. At first sight the number of concepts and practices applied to quality in analytical chemistry is dauntingly large and, moreover, apparently not connected adequately into a coherent whole by overarching principles. However, a simple scheme that provides an inclusive overview can be formulated in terms of just three basic ideas. The overarching idea is the uncertainty attached to an analytical result: what it means, why it is important and how to estimate and use it. Quality concepts and practices are summarised under three main headings, which should be applied in the following order: fitness for purpose, method validation and quality control.

8.1 Quality — An Overview

Key point
— The principles and practices relating to the quality of analytical data can be systematised under three related headings: fitness for purpose, method validation and quality control.

• Fitness for purpose: what uncertainty in the analytical result is acceptable to, and best suited for, the needs of the customer?
• Method validation: can the method under consideration for the analytical task produce a suitably low uncertainty when executed in a particular environment; in other words, is it apparently fit for purpose?
• Quality control: have the environmental factors that determine uncertainty changed since the validation demonstrated that fitness for purpose was achievable? (i.e., did the method work well day after day?)

The logical sequence and the practices that contribute to each are shown in Fig. 8.1.1. Each of these aspects of quality will be considered in turn, but first we have to consider briefly the idea of the uncertainty of a result and how it can be estimated.

Fig. 8.1.1. A schematic view of the three principal aspects of quality in analytical chemistry and the contributory practices that relate to them.

8.2 Uncertainty

Key point — The meanings of the following terms are discussed: uncertainty, standard uncertainty, expanded uncertainty, coverage factor, measurand and traceability.

The purpose of analysis is to reduce uncertainty about the chemical composition of the test material. Before any analysis is undertaken, we might be in a state of complete uncertainty about what a material is, but that would be unusual. We are far more likely to have some indication of what it
might contain a priori. For example, a sample of (dried) cabbage would nearly always have a copper content between, say, 1 and 20 ppm, however careful the analyst. After analysis this uncertainty would be far smaller: we might be confident that the concentration fell in the interval between 10 and 12 ppm. But there would always be some uncertainty remaining in the analytical result. How do we quantify this uncertainty? First, we need to know exactly what it is that we are estimating: here are the current internationally-recognised definitions.

• Quantity: property of a phenomenon, body or substance where the property has a magnitude that can be expressed by a number and a reference.
• Measurand: quantity intended to be measured.
• Measurement uncertainty: non-negative parameter characterising the dispersion of the quantity values being attributed to a measurand, based on the information used.
• Standard uncertainty: measurement uncertainty expressed as a standard deviation.
• Expanded uncertainty: product of a combined standard measurement uncertainty and a factor larger than the number one.
• Metrological traceability: property of a measurement result whereby the result can be related to a reference through a documented unbroken chain of calibrations, each contributing to the measurement uncertainty.

These formal definitions are not uniformly intelligible and do not immediately convey their meaning, so their application to chemical measurement needs elucidation. The measurand is a quantity that is being measured (e.g., mass, length, time, concentration), not a chemical substance (that is the analyte), nor the numerical outcome of a measurement (that is the result). Expanded uncertainty (U) defines a concentration interval around the result of the measurement within which we expect the true value to lie with a reasonably high probability, usually 95%. The standard uncertainty (u) is the basic value that is used to calculate U as U = ku, where k is the coverage factor, usually between two and three. Standard uncertainty is treated and used in the same way as standard deviation. It is noteworthy that uncertainty is the property of a measurement result, while bias and precision are properties of measurement methods. Traceability describes the relationship between the result and the units of the SI (Le Système International d'Unités).
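As a small worked illustration of the relation U = ku (the numerical values below are invented for demonstration):

```python
def expanded_uncertainty(u, k=2.0):
    """Expanded uncertainty U = k * u, where k is the coverage factor
    (typically 2 for approximately 95% coverage)."""
    return k * u

# Invented example: a result of 11.2 ppm with standard uncertainty 0.4 ppm.
result, u = 11.2, 0.4
U = expanded_uncertainty(u)
print(f"{result} +/- {U} ppm")  # interval expected to contain the true value
```

With k = 2, a standard uncertainty of 0.4 ppm gives an expanded uncertainty of 0.8 ppm about the result.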
We should try to use all of these words correctly so as to reduce misunderstanding, especially when our words may be translated into another language. It is especially important not to confuse uncertainty with error.

Further reading
• Evaluation of measurement data — guide to the expression of uncertainty in measurement (GUM). Document produced by Working Group 1 of the Joint Committee for Guides in Metrology (2008). This document can be downloaded gratis from the BIPM website www.bipm.org/utils/common/documents.
• International vocabulary of metrology — basic and general concepts and associated terms (VIM). Document produced by Working Group 2 of the Joint Committee for Guides in Metrology (2008). This document can be downloaded gratis from the BIPM website www.bipm.org/utils/common/documents.
• Quantifying uncertainty in analytical measurement. EURACHEM/CITAC Guide CG 4, Second edition (2000). This document can be downloaded gratis via www.eurachem.org/guides.

8.3 Why Uncertainty is Important

Key points — Analysis is conducted to inform decisions. — Logically we cannot make a valid decision without knowing the uncertainty associated with the result.

The result of an analytical measurement is incomplete without a statement (or at least an implicit knowledge) of its uncertainty. This is because we cannot make a valid decision based on the result alone, and nearly all analysis is conducted to inform a decision. Typical decisions based on analysis mostly come in one of the following forms; they all require a knowledge of uncertainty for a rational outcome.

• Does this batch of material contain less than the maximum allowed concentration of an impurity?
• Does this batch of material contain at least the minimum required concentration of a named ingredient?
• How much is this batch of material worth?

Figure 8.3.1 shows a variety of instances affecting decisions about externally imposed limits. The error bars can be taken as expanded uncertainties, effectively intervals containing the true value of the concentration of the analyte with 95% confidence. Result A clearly indicates a material that is below the limit, as even the highest extremity of the uncertainty interval is below the limit. Result B is below the limit, but the upper end of the uncertainty is above the limit, so we are not sure that the true value is below. Result C is above the limit, but the lower end of the uncertainty is below the limit, so we are not sure that the true value is above. It is interesting to compare the equal results D and E. Both results are above the limit but, while D is clearly above the limit, E is not so, because the greater uncertainty interval extends below the limit. Organisations affected by such decisions have to agree in advance how to act upon results B, C and E.

Fig. 8.3.1. Possible results of an analysis (solid circles) and expanded uncertainties (vertical bars) in relation to a legal or contractual upper limit for the concentration of an impurity.

Notes and further reading
• Accreditation agencies require estimates of uncertainty before a method can be accepted as validated.
• The main normative documents on uncertainty are listed in §8.2.
• Use of uncertainty information in compliance assessment. EURACHEM/CITAC Guide (2007). This document can be downloaded gratis via www.eurachem.org/guides.
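The decision logic of Fig. 8.3.1 can be written down directly. A minimal sketch (Python; the wording of the returned labels is our own, not normative):

```python
def compliance(result, U, limit):
    """Classify a result against an upper limit using its expanded
    uncertainty U, following the cases of Fig. 8.3.1."""
    if result + U < limit:
        return "clearly below limit"              # like results A
    if result - U > limit:
        return "clearly above limit"              # like result D
    return "inconclusive: apply the agreed rule"  # like results B, C and E

print(compliance(8.0, 1.0, 10.0))
print(compliance(11.0, 0.5, 10.0))
print(compliance(9.5, 1.0, 10.0))
```

The third case is exactly the situation in which the parties must have agreed in advance how to act.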
8.4 Estimating Uncertainty by Modelling the Analytical System

Key points — Uncertainty can be estimated by the metrological ('bottom-up') method, by creating a complete model of the measurement procedure and combining the fundamental uncertainties of the ultimate operations. — In chemical analysis, modelling often gives a value that is too small, because there are nearly always unidentifiable sources of error.

Chemical measurement usually involves a complex multistage procedure. Each stage of the procedure is potentially subject to variation in execution, and therefore makes its own contribution to the uncertainty of the result. If the procedure can be completely characterised as a statistical model, the uncertainties related to each separate stage can be estimated and combined to give the uncertainty of the result. These contributions are best seen as hierarchical. So the determination of copper in a sample of cabbage could be modelled as in Fig. 8.4.1 as a first stage. Each of the three first-level contributions can be further broken down, as exemplified in Fig. 8.4.2 for one of them. Even more detail can be built into the model (Fig. 8.4.3): for example, weighing introduces uncertainty in the calibration of the weights (or balance), in correction for buoyancy, in absorption of moisture from the atmosphere, and so on. Ultimately the calibration of the weights can be traced back to the SI unit of mass, the kilogramme. Such ultimate traceability is seldom if ever a practical issue for analytical chemistry: transfer of the SI unit to the analyst's bench gives rise to a negligible proportion of the analytical uncertainty in nearly all instances.

Fig. 8.4.1. First level of factors that contribute to uncertainty in an analytical result.

Fig. 8.4.2. Second level of factors that contribute to uncertainty in the analyte concentration in the test solution.

Fig. 8.4.3. Third level of factors that contribute to uncertainty in the dry content of the primary sample.

When the measurement procedure is completely broken down in this fashion, the influences of the individual uncertainties are estimated in various ways and combined in the manner prescribed by error propagation theory (§8.5) to give the uncertainty of the result. This approach to uncertainty is called the 'cause and effect' method or, informally, the 'bottom-up' method. It is clearly a lengthy operation, although it is simplified by the fact that, because of the way uncertainties are propagated, small uncertainties provide a negligible contribution to the outcome. A benefit of the method is that spreadsheets are available that can carry out the calculations once the model is defined: the effect of a change in one of the factors contributing to uncertainty can be rapidly seen in the combined uncertainty. A drawback is that it is difficult to detect structural mistakes and omissions in the model itself. In chemical measurement this defect often results in estimates of
uncertainty that are too low. We know this to be the case by studying the results of interlaboratory comparisons. If individual uncertainty estimates were correct, they would account for all of the observed interlaboratory variation. In cases where this has been checked, it is nearly always found that interlaboratory variation is greater than expected on the basis of individual uncertainties. This demonstrates that there are typically unknown causes of uncertainty in chemical analysis, and an unknown cause cannot be included in a 'cause and effect' model.

Notes and further reading
• This method of estimating uncertainty is covered in detail in the 'GUM' and the Eurachem Guide (see §8.2).
• Ellison, S.L.R., and Mathieson, K. (2008). Performance of Uncertainty Evaluation Strategies in a Food Proficiency Scheme. Accred. Qual. Assur., 13, pp. 231–238.

8.5 The Propagation of Uncertainty

Key points — Simple mathematical rules are available for combining intermediate uncertainties contributing to a final result. — Broadly speaking, the outcome will be dominated by the major uncertainties.

The propagation of uncertainty through calculations from intermediate measurement results (A, B, C, etc.), with respective standard uncertainties uA, uB, uC, etc., to the final result (x) is handled in the same way as general error propagation. The mathematical rules for combining independent features are as follows; the first three are of greatest importance to analytical chemists.

1. If x = A ± B ± C ± · · · (i.e., for addition or subtraction), the standard uncertainty on x is given by ux = √(uA² + uB² + uC² + · · ·).
2. If the results are multiplied or divided, it is the relative uncertainties that are combined. So if x = (A × B × · · ·)/(C × D × · · ·), then ux/x = √((uA/A)² + (uB/B)² + (uC/C)² + (uD/D)² + · · ·).
3. If x = kA, k being a constant, then ux = kuA.
4. Generally, if x = f(A), then ux = uA|dx/dA|; as a special case of this, if x = A^k, k being a constant, then ux/x = k(uA/A).

These rules are used sequentially when applied to complex equations. As an example we consider the standardisation of hydrochloric acid by titration of a weighed quantity of pure sodium carbonate. The primary measurements are as follows.

• First weighing of sodium carbonate (w1): 15.2086 g.
• Second weighing (w2): 15.0501 g.
• Uncertainty of a single weighing (uw): 0.0002 g.
• Initial burette reading (v1): 0.53 ml.
• Final burette reading (v2): 38.83 ml.
• Uncertainty in a single burette reading (uv): 0.03 ml.
• Uncertainty in endpoint recognition (ue): 0.04 ml.
• Relative molecular mass (R) of Na2CO3: 106.00.
• Uncertainty of the relative molecular mass (uR): 0.01.

The concentration M of the acid (mol l−1) is given by

   M = 2000(w1 − w2)/[R(v2 − v1)] = 0.07808.

By Rule 1 the uncertainty on the weight W = w1 − w2 = 0.1585 g of Na2CO3 is given by uW = √(2uw²) = √2 × 0.0002 = 0.00028. (Note that there are two weighings to obtain this weight.) By Rule 1 again, the uncertainty on the volume V = v2 − v1 = 38.30 ml of acid is given by uV = √(2uv² + ue²) = √(2 × 0.03² + 0.04²) = 0.0583. (Note that there are two volume readings and the uncertainty on the endpoint recognition to account for.) As the remaining operations are multiplication and division, we use Rule 2 to complete the calculation:

   uM/M = √((uW/W)² + (uV/V)² + (uR/R)²)
        = √((0.00028/0.1585)² + (0.0583/38.30)² + (0.01/106)²)
        = √(3.12 × 10−6 + 2.32 × 10−6 + 9 × 10−9) = 0.00233.

Thus the acid concentration is 0.07808 mol l−1 with a standard uncertainty of uM = 0.07808 × 0.00233 = 0.00018 mol l−1. (Note that the uncertainty term for the relative molecular mass is negligible.) Estimates of uncertainty will hardly ever be as accurate as two significant figures.

Notes and further reading
• If the features are not independent, the covariances have to be taken into account. Details can be found in GUM (§8.2).
• It is clear from the way that uncertainties combine that a single contribution that is less than a quarter of the dominant contribution will make a negligible contribution to the combined uncertainty: we see that √(u² + (0.25u)²) ≈ 1.03u, and the difference between 1.03u and u is negligible. This will often simplify the calculations in 'bottom-up' estimation.

8.6 Estimating Uncertainty by Replication

Key point — The reproducibility standard deviation is easily obtained and is often a serviceable ('top-down') estimate of uncertainty.

An alternative way of looking at uncertainty is to attempt to replicate the whole analytical process and calculate the uncertainty as the standard deviation of the replicated results. This is called the 'top-down' approach. However, the standard deviation will not be a good estimate of the uncertainty unless two conditions are fulfilled.

• First, there must be no perceptible bias in the procedure; that is, the difference between the expectation of the result and the true value must be negligible in relation to the standard deviation. This condition is usually (but not always) fulfilled in analytical chemistry.
• Second, the replication has to explore all of the possible variations in the execution of the method (or at least all of the variations of important
magnitude). This latter condition cannot usually be met by replication under repeatability conditions (i.e., within-laboratory; see §9.2), because variations in execution of the procedure would be laboratory-specific to a substantial extent. Experiments have shown that the between-laboratory standard deviation σR is often a good estimate of uncertainty, better than many laboratories estimate from within-laboratory validation exercises. This fact is clearly seen in the results of collaborative trials (§9.8), where the reproducibility standard deviation σR is on average about twice the repeatability standard deviation. Therefore, estimates of standard uncertainty less than σR should be considered suspect unless the laboratory concerned can provide evidence of unusually careful procedures or special methods, such as might be found in national reference laboratories. Laboratories claiming an uncertainty equal to σR should attempt to justify that, for instance by reference to the results from proficiency tests (§11.1). Indeed, σR may well be an underestimate of uncertainty in other laboratories. In practice, laboratories seldom use pure bottom-up or top-down methods for estimating uncertainty, but more often a combination of the two.

Further reading
• Ellison, S.L.R., and Mathieson, K. (2008). Performance of Uncertainty Evaluation Strategies in a Food Proficiency Scheme. Accred. Qual. Assur., 13, pp. 231–238.

8.7 Traceability

Key points — The traceability of a result shows that its units are properly related to the corresponding SI units with an appropriate uncertainty. — The uncertainty involved in delivering the SI unit to the analyst's bench is usually negligible. — Other sources of uncertainty tend to be dominant in chemical measurement.

Traceability for a result shows how any unit in which the result is expressed is compared with the parent SI unit. To express a result in (say) moles per
litre, the analyst has to know what a mole is, and what a litre is, and show that the measurement comprises a complete chain of comparisons from the definition of the unit to the result. For analytical chemists the necessary connection with SI units is easy. The relative uncertainty in the journey from the SI unit to the analyst's bench is small and nearly always negligible in comparison with the relative uncertainty in the final result. This is because the dominant sources of error in analysis come from elsewhere. The most obvious are shown in the schematic of an analytical measurement in Fig. 8.7.1.

Fig. 8.7.1. Steps showing how an analytical result is traceable to its ultimate origins. The black arrows leading to shaded boxes indicate actions where relatively large uncertainties may be incurred.

• The sampling error is introduced by taking the sample from the 'target', i.e., the bulk of material of which we need to know the composition (§12.1). Sampling errors can be relatively large and sometimes exceed those incurred during the whole of the remaining analytical procedure.
• Preparing the test solution from the test portion often involves chemical manipulations, such as dissolution and separation, in which the recovery of the analyte is incomplete. Correcting for incomplete recovery (or failing to correct) can introduce a relatively large uncertainty in some instances.
• Comparing concentrations via the calibration function and the analytical signal from the test solution is subject to extra error (i.e., beyond that described in §8.4) if there is an unrecognised matrix mismatch with the calibrators.

In addition to these 'non-SI' comparisons, we have to recognise that in many instances the result is calculated from ratios of measurements made
in the same unit, and in forming such a ratio the link with the SI unit is annulled. For instance, if we are expressing our result as a mass fraction (the most commonly required outcome of chemical measurement), we would get the same numerical result (within limits determined by random variation) if we used a different base unit for mass, the Imperial pound for instance. Such a result can hardly be said to be traceable to the kilogramme even when, as usual, SI weights are used throughout the procedure.

Further reading
• Traceability in chemical measurement. Eurachem/CITAC Guide (2003). This document can be downloaded gratis via www.eurachem.org/guides.

8.8 Fitness for Purpose

Key points — A fit-for-purpose result is optimal for the end-user. — The uncertainty of a result is inversely related to its cost: smaller uncertainty costs more money. — The probability of an incorrect decision based on a result is directly related to uncertainty.

An analytical result can be said to be fit for purpose when its uncertainty is optimal. The optimality stems from the balance between the cost of conducting the analysis and the cost and probability of making an incorrect decision based on the result. Such a balance is usually determined by expert opinion on the basis of experience of the application, but in favourable instances (i.e., where sufficient information about costs etc. is available) it may be possible to calculate the optimum directly. Even when this cannot be done, decision theory provides a useful conceptual framework for a more transparent estimate.

The cost of an analytical result (including sampling where appropriate) is related to the uncertainty required. Indeed, a useful rule of thumb is that of an inverse square relationship: a procedure that reduces uncertainty by one half costs four times the amount. Conversely, the greater the uncertainty the greater the probability of an incorrect decision and therefore of a financial penalty stemming from the decision. Such penalties might be the result of mistakenly condemning
a batch of satisfactory material, or incorrectly shipping a batch of material that was out of specification. The sum of the two costs (one a decreasing function of uncertainty and the other increasing) must necessarily have a minimum value, and this point provides a rational definition for a fit-for-purpose uncertainty (Fig. 8.8.1). Of course, an uncertainty smaller than that demanded for fitness for purpose would give an even smaller proportion of mistaken decisions, but would be unnecessarily expensive.

Fig. 8.8.1. Costs versus uncertainty, showing costs of measurement (dashed line), costs of incorrect decisions (dotted line) and total costs (solid line). The uncertainty uf at minimal cost is regarded as defining fitness for purpose.

Notes and further reading
• Fearn, T., Fisher, S., Thompson, M., et al. (2002). A Decision Theory Approach to Fitness for Purpose in Analytical Measurement. Analyst, 127, pp. 818–824.
• 'Optimising your uncertainty — a case study'. AMC Technical Briefs No. 32 (July 2008). Free download via www.rsc.org/amc.
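Before leaving the subject of uncertainty, the propagation rules of §8.5 are easy to mechanise. The following sketch (Python) reproduces the titration example of §8.5 from the primary measurements:

```python
def u_sum(*us):
    """Rule 1: combined standard uncertainty of a sum or difference."""
    return sum(u * u for u in us) ** 0.5

def rel_u_product(*pairs):
    """Rule 2: relative uncertainty of a product/quotient of (value, u) pairs."""
    return sum((u / v) ** 2 for v, u in pairs) ** 0.5

# Primary measurements from the titration example of Section 8.5.
w1, w2, uw = 15.2086, 15.0501, 0.0002     # weighings, g
v1, v2, uv, ue = 0.53, 38.83, 0.03, 0.04  # burette readings, ml
R, uR = 106.00, 0.01                      # relative molecular mass of Na2CO3

W, V = w1 - w2, v2 - v1
uW = u_sum(uw, uw)        # two weighings
uV = u_sum(uv, uv, ue)    # two readings plus endpoint recognition
M = 2000 * W / (R * V)
uM = M * rel_u_product((W, uW), (V, uV), (R, uR))
print(round(M, 5), round(uM, 5))  # ~0.07808 and ~0.00018 mol per litre
```

The output agrees with the hand calculation, and changing any single input immediately shows its effect on the combined uncertainty, which is the main practical attraction of the 'bottom-up' approach.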
Chapter 9

Statistical Methods Involved in Validation

The statistical methods involved in validation are straightforward, but applying them effectively is more involved than generally appreciated. Particular attention is needed in selecting the appropriate conditions under which precision is estimated for various purposes. Regression is the natural method to apply to analytical calibration, in the absence of complications, but a good design is needed for an informative outcome. Moreover, setting proper control limits for control charts involves a moderately long series of measurements and review at suitable intervals.

9.1 Precision of Analytical Methods

Precision is the smallness of variation in the results of replicated measurements. It is characterised in terms of standard deviation, as s = √(Σi (xi − x̄)²/(n − 1)), so high precision is equivalent to low standard deviation. Analytical results for different purposes vary in precision: a relative standard deviation (RSD) of 0.1% is 'high precision' in analysis and is seldom attained except for special purposes, such as finding the commercial value of materials containing gold. Most analytical results have repeatability RSDs of about 1–5%, except in the measurements of very low concentrations, where even higher levels prevail. RSDs higher than about 30% are problematic, because the expanded uncertainty is of the same magnitude as or greater than the result.

There is little technical difficulty about estimating a standard deviation — the measurement has to be replicated under appropriate conditions and the standard deviation of the results calculated. The analyst should, however, initially examine the dataset to ensure the absence of features that could create a false impression, such as outlying data points (see §7.2 ff) or trends.
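A minimal sketch of the calculation (Python; the replicate results below are invented for illustration):

```python
from statistics import mean, stdev

def rsd_percent(xs):
    """Relative standard deviation: 100 * s / mean, where
    s = sqrt(sum((x - xbar)^2) / (n - 1))."""
    return 100.0 * stdev(xs) / mean(xs)

# Invented replicate results for a single test material:
replicates = [10.2, 9.9, 10.1, 10.4, 9.8, 10.0]
print(round(rsd_percent(replicates), 2))  # a repeatability RSD of about 2%
```

An RSD of this size is typical of routine analytical work at moderate concentrations.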
A simple plot of the results in the order measured is usually enough to show such problems. If appropriate, the influence of outliers can be accommodated by the use of a robust estimate (§7.6). The effect of trends, even complicated ones, can also be reduced by various methods. Perhaps the simplest is to consider the signed differences between successive results, that is, the sequence d1 = x1 − x2, d2 = x2 − x3, . . . , dn−1 = xn−1 − xn. The standard deviation of the signed differences divided by √2 is equal to that of the original data detrended. In the following example (Fig. 9.1.1), the raw results (upper series) show a strong trend and have a standard deviation of 3.02. The signed differences show no trend and have a standard deviation of 1.59. The detrended data would therefore have an estimated standard deviation of 1.59/√2 = 1.12.

Fig. 9.1.1. Replicated results as a time series (upper plot) and as successive signed differences (lower plot).

Another important feature of estimates of standard deviation is that they are very variable with small to moderate numbers of observations. The standard error se(s) of an estimate s based on n normally-distributed results is given by se(s) = σ/√(2n). (Note: the confidence limits will not be symmetrically disposed around s for small n.) With the usual ten replicated analytical results, we could expect to see 95% confidence intervals of around ±40% around the estimated standard deviation. Two estimates of the same standard deviation based on separate sets of ten results will differ by more than 30% half of the time, and by more than 77% in one in ten experiments. Because of this it is rarely worth quoting standard deviations or uncertainties to more than two significant figures, and there is no point in discriminating between minor gradations in precision.
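The successive-differences device described above can be sketched as follows (Python; the drifting series is simulated for illustration, not the dataset of Fig. 9.1.1):

```python
from statistics import stdev

def detrended_sd(xs):
    """SD estimate with a (locally linear) trend removed: the SD of
    successive signed differences divided by sqrt(2)."""
    diffs = [a - b for a, b in zip(xs, xs[1:])]
    return stdev(diffs) / 2 ** 0.5

# A drifting series (invented): noise with SD about 1, plus a steady trend.
noise = [0.3, -1.2, 0.8, 0.1, -0.9, 1.1, -0.4, 0.6, -1.0, 0.2]
series = [i * 0.5 + e for i, e in enumerate(noise)]
print(round(stdev(series), 2), round(detrended_sd(series), 2))
```

The raw standard deviation is inflated by the trend, while the detrended estimate recovers something close to the underlying noise level.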
Notes and further reading
• The dataset is available from the file named Drift.
• There is often a practical problem in estimating the standard deviation when the concentration of the analyte is close to zero. A true concentration cannot fall below zero, but the result of a measurement can. A proportion of the results (sometimes a substantial proportion) will fall below zero unless they are censored. Recording such results as zero, or repeating the measurement until a non-negative result is obtained, will bias the estimates of both the mean (upwards) and the standard deviation (downwards). There are statistical techniques (maximum likelihood estimation) for handling this situation, but they are beyond the scope of this book.
• 'Measurement uncertainty and confidence intervals near natural limits'. AMC Technical Briefs No. 26A (2008). Free download from www.rsc.org/amc.
• Analytical Methods Committee (2008). Measurement Uncertainty Evaluation for a Non-negative Measurand: an Alternative to Limit of Detection. Accred. Qual. Assur., 13, pp. 29–32.

9.2 Experimental Conditions for Observing Precision

The precision of the results of a method depends on the conditions under which the measurement is replicated. There are many conceivable conditions and, regrettably, the terminology used is often confusing (Table 9.2.1). The key point here is to identify conditions that are genuinely useful to analytical chemists: only a few of these conditions have a practical bearing on quality practice. Run-to-run precision and repeatability precision are mainly relevant to internal quality control. However, reproducibility precision is an important consideration in method validation, as it is a prominent feature of estimating uncertainty.

Further reading
• International vocabulary of metrology — basic and general concepts and associated terms (VIM). Document produced by Working Group 2 of the Joint Committee for Guides in Metrology (2008). This document can be downloaded gratis from the BIPM website www.bipm.org/utils/common/documents.
Table 9.2.1. Conditions for assessing precision and the utility of the resulting estimates. (Names defined in normative documents are in boldface type.)

• Instrumental — Replication as quickly as possible, with no change of test solution nor adjustment of instrument. Not very useful, but often seen in research papers and brochures; does not include variation originating from chemical manipulations.

• Repeatability — Replication on separate test portions of the same material, by the same analyst, with the same instrument and reagents, in a 'short' period of time. The 'short period of time' is the length of an analytical 'run', or period in which we assume that the factors affecting error have not changed.

• Run-to-run (also called intermediate or within-laboratory reproducibility) — Replication in separate runs: same method and laboratory, but may be different analysts, instruments and batches of reagent. This type of precision is addressed in internal quality control.

• Reproducibility (1) — Replication by the same detailed method in different laboratories. This is the estimate provided by the collaborative (interlaboratory) trial.

• Reproducibility (2) — Replication by the same nominal method, but with variation in details, in different laboratories. The SD is usually greater than that of Reproducibility (1). Estimate can often be obtained from the results of a single round of a proficiency test.

• Reproducibility (3) — Replication by various methods in different laboratories. The SD may be greater or smaller than that of Reproducibility (1). Estimate can often be obtained from the results of a single round of a proficiency test. Limited applicability, but used in restricted types of quality control.
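The repeatability and reproducibility estimates behind Reproducibility (1) are usually obtained by one-way analysis of variance of collaborative-trial results. A minimal sketch for a balanced design (Python; the laboratory results are invented for illustration):

```python
from statistics import mean

def precision_estimates(groups):
    """One-way ANOVA estimates from replicate results grouped by laboratory:
    returns (s_r, s_R), the repeatability and reproducibility SDs.
    Assumes a balanced design (equal replicates per laboratory)."""
    n = len(groups[0])                       # replicates per laboratory
    lab_means = [mean(g) for g in groups]
    grand = mean(lab_means)
    # within-laboratory (repeatability) mean square
    msw = (sum(sum((x - m) ** 2 for x in g)
               for g, m in zip(groups, lab_means))
           / (len(groups) * (n - 1)))
    # between-laboratory mean square
    msb = n * sum((m - grand) ** 2 for m in lab_means) / (len(groups) - 1)
    s_r = msw ** 0.5
    var_between = max(0.0, (msb - msw) / n)  # truncate at zero if negative
    s_R = (msw + var_between) ** 0.5
    return s_r, s_R

# Invented duplicate results from five laboratories:
labs = [[10.1, 10.3], [9.6, 9.8], [10.7, 10.6], [9.9, 10.2], [10.4, 10.1]]
s_r, s_R = precision_estimates(labs)
print(round(s_r, 3), round(s_R, 3))
```

As is typical of collaborative trials, the reproducibility SD comfortably exceeds the repeatability SD because of the between-laboratory component.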
Statistical Methods Involved in Validation

9.3 External Calibration

Regression is the natural way to explore the behaviour of an analytical calibration and its likely effect on the uncertainty of the result (see Chapters 5 and 6). Topics that can be studied by regression are as follows.

• Does the calibration depart significantly from linearity? If it does, is the deviation of a magnitude that will affect the uncertainty of the results of a complete measurement? Linearity is best addressed by measuring responses in duplicate from calibrators equally spaced over the range and conducting a test for significant lack of fit. Simply inspecting a plot of the residuals for a curved pattern is also a powerful way of detecting lack of fit. However, the correlation coefficient commonly used as a test for linearity is ambiguous in this context and potentially misleading (§5.9). An important aspect of unacknowledged calibration curvature is that its effects on estimated concentrations will be relatively very large at concentrations near zero. If low concentrations are to be measured, lack of fit cannot be tolerated.

• If there is no lack of fit to the selected calibration function, is there a significant intercept? In other words, in a calibration function such as r̂ = β̂0 + β̂1c, are there grounds to reject β0 = 0? If there are not, the analyst can use the hypothesis that the calibration passes through the origin.

• Do the residuals display signs of heteroscedasticity? This would be a quite common occurrence in analytical calibration, and its presence suggests that the use of weighted regression would give better results than simple regression.

• What are the confidence limits on an unknown concentration found by using the calibration function (also called 'evaluation limits' or 'fiducial limits')? It is salutary to calculate these limits, as they are considerably larger than expected by intuition and may make a substantial contribution to the combined uncertainty of the result.

• What is the detection limit of the calibration? When the evaluation confidence interval includes zero concentration, the evaluated concentration is not significantly greater than zero, so we are unsure whether the analyte is present at all.
But how does all this information about calibration relate to the combined uncertainty of a result? The shape of the calibration function is of great importance, especially at low concentrations. If a rectilinear function can be assumed, calculations become simpler, errors of interpolation become smaller and the method of standard additions (§9.5) becomes available. Heteroscedastic residuals are a quite common occurrence in chemical measurement, and their presence suggests that the use of weighted regression would give better results than simple regression. As a rule of thumb, if the calibration is to be restricted to concentrations up to about 200 times the detection limit, simple regression will be good enough, but care must be taken (see §5.10) if we need to measure concentrations in the lower quartile of the range. For concentrations outside this range, weighted regression will give more accurate values.

We must bear in mind that in analysis there are many sources of uncertainty other than calibration and evaluation, and these will usually overwhelm calibration uncertainty. Lack of fit, if present, may then be judged negligible. In such instances, calibration uncertainty per se can be ignored or subsumed into uncertainties that are readily estimated by replication.

9.4 Example — A Complete Regression Analysis of a Calibration

Here we examine the calibration of zinc by atomic absorption spectrometry, with duplication of responses, which were measured in the random order listed.

Concentration (mg l−1):  4      0      5      3      2      1      2      1      3      5      0      4
Absorbance:              0.237  0.004  0.291  0.177  0.124  0.061  0.124  0.064  0.182  0.300  0.009  0.241
The estimated regression is summarised below.

Predictor        Coefficient    Standard deviation    t         p
Constant         0.006167       0.001566              3.94      0.003
Concentration    0.0580000      0.0005173             112.12    0.000

syx = 0.0031    R² = 99.9%

Source of variation    Degrees of freedom    Sum of squares    Mean square    F           p
Regression             1                     0.11774           0.11774        12570.30    0.000
Residual error         10                    0.00009           0.00001
  Lack of fit          4                     0.00002           0.00000        0.30        0.867
  Pure error           6                     0.00008           0.00001
Total                  11                    0.11783

Fig. 9.4.1. Calibration data (points) with linear function estimated by regression (solid line) and 95% evaluation limits (dashed lines).

The data show no apparent deviation from the regression line (Fig. 9.4.1), and the test for lack of fit provides a p-value of 0.867. Thus we can conclude that there is no significant lack of fit and that the regression has provided a satisfactory account of the data.

Visually the regression line passes close to the origin. However, the p-value of 0.003 for the intercept shows that we can reject H0: α = 0 in favour of HA: α ≠ 0; that is, the intercept is significantly different from zero. The deviation of 0.006 absorbance would be large for modern instrumentation, but presumably arose from an instrumental drift after the initial setting of the zero point.

Fig. 9.4.2. Residuals from the regression.

The residuals plotted against concentration show no obvious pattern (Fig. 9.4.2).
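The lack-of-fit partition in the ANOVA table can be reproduced in a few lines of code. The sketch below (Python, not part of the original text) uses the zinc data as tabulated above, splitting the residual sum of squares into pure error (within duplicate pairs) and lack of fit:

```python
# Lack-of-fit test for the duplicated zinc calibration tabulated above.
conc = [4, 0, 5, 3, 2, 1, 2, 1, 3, 5, 0, 4]                     # mg/l
resp = [0.237, 0.004, 0.291, 0.177, 0.124, 0.061,
        0.124, 0.064, 0.182, 0.300, 0.009, 0.241]               # absorbance

n = len(conc)
mx, my = sum(conc) / n, sum(resp) / n
sxx = sum((x - mx) ** 2 for x in conc)
sxy = sum((x - mx) * (y - my) for x, y in zip(conc, resp))
b = sxy / sxx                       # slope
a = my - b * mx                     # intercept

# Split the residual sum of squares into pure error (scatter of each
# duplicate pair about its own mean) and lack of fit (the remainder).
groups = {}
for x, y in zip(conc, resp):
    groups.setdefault(x, []).append(y)
ss_pe = sum(sum((y - sum(g) / len(g)) ** 2 for y in g)
            for g in groups.values())
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(conc, resp))
ss_lof = ss_res - ss_pe

df_pe = n - len(groups)             # 12 - 6 = 6
df_lof = len(groups) - 2            # 6 - 2 = 4
F_lof = (ss_lof / df_lof) / (ss_pe / df_pe)
```

F_lof comes out near 0.30 with 4 and 6 degrees of freedom; referring it to the F(4, 6) distribution gives the p-value of 0.867 quoted above.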
Fig. 9.4.3. Residuals plotted against the order in which the solutions were measured.

Note
• An unusual feature of this calibration dataset demonstrates the value of a thorough examination of the residuals. If the residuals are plotted against the order in which the measurements were made (Fig. 9.4.3), we see a systematic effect: the earlier residuals tend to be negative and the later ones positive. This is a significant effect (p = 0.004) and shows that the instrument was still drifting during the calibration. This effect would not have been detectable had the calibrators been analysed in order of increasing concentration.

Fig. 9.4.4. Bottom end of the estimated calibration function, showing the detection limit cL of the calibration.

The 95% confidence interval for a predicted concentration (the evaluation interval) amounts to about 0.3 ppm (mg l−1) over the whole of the calibrated range (Fig. 9.4.1). In effect, that would be the contribution of the calibration procedure to the combined uncertainty. By zooming in to the bottom end of the calibration (Fig. 9.4.4), we see that the calibration detection limit cL will be about 0.13 mg l−1.
9.5 Calibration by Standard Additions

To make a valid comparison between calibrators and test solutions, the intercept and sensitivity of the calibration function must match those we would observe in the matrix of the test solution. A sufficiently close matrix matching is readily contrived in respect of reagents such as mineral acids added during the chemical treatment of the test portion. In some instances, however, the matrix of the test materials (that is, all of the constituents other than the analyte) is not readily predictable, and the matrix varies enough within a class of material to prevent matrix matching being complete. This can lead to an unacceptable addition to the uncertainty of the result if the analytical signal is sensitive to such changes. A number of strategies are available to overcome such interference effects.

Fig. 9.5.1. The possible effect of matrix mismatch on the comparison between calibrators and test solution.

Calibration by standard additions is a widely applicable method that overcomes changes in sensitivity (or rotational effects) caused by the matrix (Fig. 9.5.1). The method does not, however, overcome changes to the baseline of the signal, which the analyst has to deal with separately. The conventional paradigm of the method, therefore, has to be applied to the net signal: it requires any baseline signal to be subtracted from the gross signal before the method can be applied.

The method, as presented in most textbooks, requires the addition to a test solution of several different exactly known amounts of the analyte. This has to be done in such a way that the overall concentration of the test material remains the same in all of the solutions, so that the matrix is identical in each. The analytical response (corrected for baseline shift) is measured for each solution, and the line fitted to the points is extrapolated down to zero net response (Fig. 9.5.2).

Fig. 9.5.2. The method of standard additions (conventional paradigm), with five different added concentrations of analyte. The negative reading on the concentration axis is the concentration estimate.

The extrapolation could be done graphically, but the application of regression, either simple or weighted, to estimating the original concentration is obvious. The resulting function y = a + bx with the response (y) set at zero gives x = −a/b.

Standard additions is a valid method when the calibration is known to be linear. Nonlinear extrapolation is unwise (§6.2). The standard paradigm of the method, with several different levels of added analyte, is featured in most texts because it is thought (wrongly) to allow the analyst to check that the calibration is linear. However, testing for nonlinearity is unlikely to be fruitful without an inordinate number of measurements to obtain just one result. In any event, standard additions should not be attempted unless nonlinearity is known to be undetectable by previous experimentation during validation.

As we can assume linearity, a simpler experimental design is preferable, with only one level of added analyte, perhaps with replicated measurement (Fig. 9.5.3). The added level should be the highest possible concentration consistent with a linear calibration function. A line passing through the means of the responses at both concentrations is identical with a regression line, so regression is not needed in the calculations. This design not only cuts down the analyst's workload, but also improves the precision of the final result for the same number of measurements.

Fig. 9.5.3. Standard additions with the spike added at only one level (triplicate measurements of response).

Standard additions is sometimes regarded as problematic because the extrapolation step was thought to introduce extra imprecision in comparison with external calibration. A careful study, with realistic models of uncertainty, has shown that this fear is unfounded.
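As a sketch of the conventional paradigm, the code below (Python, with invented net responses, not data from the text) fits y = a + bx to the responses at several added concentrations and recovers the concentration estimate as a/b, i.e. minus the x-intercept:

```python
# Standard-additions calculation: fit net response vs added concentration,
# then extrapolate to zero response. Illustrative numbers only.
added = [0.0, 0.0, 2.0, 2.0, 4.0, 4.0]            # added analyte, mg/l
resp  = [0.110, 0.112, 0.208, 0.212, 0.310, 0.308]  # net responses

n = len(added)
mx = sum(added) / n
my = sum(resp) / n
b = sum((x - mx) * (y - my) for x, y in zip(added, resp)) / \
    sum((x - mx) ** 2 for x in added)
a = my - b * mx
c_est = a / b      # estimated original concentration (minus the x-intercept)
```

Setting y = 0 in y = a + bx gives x = −a/b; the magnitude a/b is the estimate of the concentration originally present in the test solution.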
Further reading
• 'Standard additions — myth and reality'. (2009). AMC Technical Briefs No. 37. Free download via www.rsc.org/amc.
• Ellison, S.L.R. and Thompson, M. (2008). Standard Additions: Myth and Reality. Analyst, 133, pp. 992–997.

9.6 Detection Limits
Key points
— There are numerous possible definitions of detection limit. The simplest is: the concentration cL of analyte that corresponds with a signal of R0 + 3σ, where R0 and σ describe the variation in the analytical signal when the actual concentration of analyte is zero.
— Modern, more complicated definitions provide very similar estimates.
— Detection limits provide only a rough guide to method performance and should not be taken too seriously.
— Detection limits cited in the literature and in instrument manufacturers' brochures may be misleadingly low.
A detection limit cL is the smallest concentration of analyte that can be reliably detected by the analytical system. Detection limits are usually given undue attention in relation to their usefulness. The main use of a detection limit is to warn the analyst of a concentration level that is better avoided if at all possible. However, in the determination of undesirable contaminants — a very common activity — analysts often have to work near detection limits. Detection limits are encountered when the expanded uncertainty of measurement is roughly comparable with the concentration of the analyte. But there are complications that affect this basically simple idea.
• There are several different ideas about how the detection limit can be defined in statistical terms. All of these ideas are based on the standard deviation of replicated results near zero concentration.
• The magnitude of the detection limit estimate will depend on the conditions of replication of the measurements. Detection limits quoted in descriptions of methods and in instrument brochures are often far too small because they are estimated under unrealistic conditions of replication, such as 'instrumental conditions' (§9.2).
Real-life detection limits (based on reproducibility statistics) are sometimes as much as 10 times higher than these instrumental detection limits.
• The accuracy of an estimated detection limit will be poor because it is typically based on about ten replicated measurements, while the standard error of an estimated standard deviation is given by σ/√(2n). With n = 10 measurements the 95% confidence limits on s (and therefore on cL) will fall at about ±40% of the true value.
• Detection limits give rise to an artifactual dichotomy of the concentration axis that distorts perception of reality. Many analysts and end-users alike regard a result of 1.1cL as a valid result and 0.9cL as invalid. Modern thought is moving to the position that detection limits are unnecessary: all that the end-user needs is the result and its uncertainty.

The simplest (and therefore the most commendable) definition of detection limit is this: the concentration cL of analyte that corresponds to an analytical signal of RL = R0 + 3σ, where R0 (mean) and σ (standard deviation) describe the distribution of the analytical signal when the actual concentration of analyte is zero. We can see the meaning of this by reference to a short calibration graph (Fig. 9.6.1). The normal distribution describes the variation in the analytical signal (response) when the blank solution is repeatedly presented for measurement. A response larger than R0 + 3σ will occur rarely if no analyte is present (about one time in a thousand, as we are dealing with one-tailed probabilities), so if we saw
Fig. 9.6.1. Schematic diagram of a calibration function at low concentrations showing the variation in the response (analytical signal) for zero concentration of analyte, and the detection limit cL .
Fig. 9.6.2. Schematic diagram of a calibration function at low concentrations of analyte, showing a more complex deﬁnition of detection limit.
a response greater than that, we could be confident that some analyte was actually present. We could say that, at that point, the concentration of analyte was significantly greater than zero at a confidence level of about 99.9%.

International standards nowadays prefer a more complex idea (and terminology) of detection limit. For this purpose we consider the previous calibration graph in slightly more detail (Fig. 9.6.2). As before, we have the distribution of responses at zero concentration with mean R0 and standard deviation σ. We now arbitrarily define a 'critical level of response' Rc that defines a probability α in the upper tail of the distribution. This corresponds via the calibration function to a 'critical concentration' xc. If we looked at the distribution of responses at concentration xc, half of the responses would be below Rc and therefore, in some sense, 'not detected'; xc is clearly too low to act as a serviceable detection limit. However, at some higher concentration, where the response was RL, the area β in the tail of the distribution below the response Rc could be made suitably small to define a sensible detection limit xL. In practice, we usually make both α and β equal to 0.05, so that we have Rc = R0 + 1.65σ and RL = R0 + 3.29σ. xL therefore corresponds closely with the previous definition of detection limit cL.
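Both definitions reduce to simple arithmetic once the blank standard deviation and the calibration slope are known. A hedged sketch (the blank readings and slope below are invented for illustration, not taken from the text):

```python
# Detection limit from replicated blank measurements, under the simple
# R0 + 3*sigma definition and the alpha = beta = 0.05 (3.29*sigma) variant.
import statistics

blank = [0.0021, 0.0018, 0.0025, 0.0019, 0.0022,
         0.0017, 0.0023, 0.0020, 0.0024, 0.0021]   # blank responses
slope = 0.058                                      # response per mg/l

s0 = statistics.stdev(blank)       # SD of the blank responses
cL_simple = 3 * s0 / slope         # simple 3-sigma definition
cL_iso = 3.29 * s0 / slope         # alpha = beta = 0.05 definition
```

Note how close the two estimates are, which is the point made in the key points above: the modern, more complicated definition gives a very similar answer to the simple one.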
Further reading
• Capability of detection — Part 1: Terms and definitions. (1997). ISO 11843-1. International Standards Organisation, Geneva.
• 'Measurement uncertainty and confidence intervals near natural limits'. (2008). AMC Technical Briefs No. 26A. Free download from www.rsc.org/amc.
• Analytical Methods Committee. (2008). Measurement Uncertainty Evaluation for a Non-negative Measurand: an Alternative to Limit of Detection. Accred. Qual. Assur., 13, pp. 29–32.

9.7 Collaborative Trials — Overview
Key points
— A collaborative trial is an interlaboratory study to characterise the performance of an analytical method.
— The main performance features are precisions of repeatability and reproducibility.
Collaborative trials are interlaboratory studies to characterise the performance of a well-defined analytical method applied to a well-defined type of test material. The performance features characterised are repeatability precision, reproducibility precision and, if certified reference materials are included, trueness. Usually each of these features will vary as a function of the concentration of the analyte, so the tests need to be carried out using at least five different materials of the defined type, with concentrations spanning the relevant range. The organising body selects and prepares these materials and distributes them to the participating laboratories, which should be at least eight in number (and preferably considerably more). The participant laboratories, which should be proficient in the type of analytical test involved, analyse each of the materials in duplicate, preferably 'blind' (i.e., without knowing the identity of the duplicates during the analysis). The participants report the results obtained to the organiser, who then carries out the statistical analysis of the results to estimate the various performance indicators.

The basic statistical technique in collaborative trials is one-way analysis of variance applied separately to the results from each material (§4.6). However, there are several refinements that are regarded
as essential. An important aspect is the initial removal of results identified as outliers. There is an elaborate procedure for doing that, described below (§9.8). Alternatively, an approach based on robust analysis of variance has been found to provide very similar outcomes without resort to outlier tests (but see Note below). The justification for rejecting outliers in collaborative trials is that the resulting statistics are regarded as describing the essential properties of the method, not those of the participant laboratories.

When the standard deviations of repeatability and reproducibility have been obtained for the various test materials, it is common practice to attempt to find a functional relationship between precision and the concentration of the analyte (§9.9). This relationship can provide a compact description of the capabilities of the method and a means of interpolation to concentrations other than those actually represented in the study.

The results of collaborative trials are often compared with the well-known 'Horwitz function'. This function stems from an empirical observation about the trend of reproducibility relative standard deviation σR as a function of concentration c in the food analysis sector. Perhaps the most useful formulation of the function is σR = 0.02c^0.8495, with both variables expressed as mass fractions. The trend of results from collaborative trials follows this function closely over seven orders of magnitude (between concentrations of about 10^−8 and 10%), although it must be stressed that there are both random and systematic deviations from the trend. The function is therefore not necessarily a good descriptor of individual methods. It is, however, often used as an independent fitness-for-purpose criterion in method validation and proficiency testing. In that context it is used to prescribe the performance required, not to describe the performance obtained.
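The Horwitz function itself is easily coded. The sketch below expresses it both as an absolute standard deviation (mass fractions) and as a percentage RSD:

```python
# The Horwitz function sigma_R = 0.02 * c**0.8495, with c and sigma_R
# both expressed as mass fractions (c = 1 corresponds to 100%).
def horwitz_sd(c):
    """Predicted reproducibility SD (mass fraction) at mass fraction c."""
    return 0.02 * c ** 0.8495

def horwitz_rsd_percent(c):
    """Predicted reproducibility RSD, in %, at mass fraction c."""
    return 100 * horwitz_sd(c) / c
```

At c = 1 (i.e. 100%) the predicted RSD is 2%, and it roughly doubles for every 100-fold decrease in concentration, reaching about 16% at a mass fraction of 10^−6 (1 ppm).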
Notes and further reading
• Robust ANOVA must be used as a strict alternative to outlier deletion. The practice of using both methods and then choosing the outcome that is smaller must be avoided.
• 'The amazing Horwitz function'. (2004). AMC Technical Briefs No. 17. Free download via www.rsc.org/amc.
• Precision of test methods — Part 1: Guide for the determination of repeatability and reproducibility for a standard test method. (1979). ISO 5725. International Standards Organisation, Geneva.
9.8 The Collaborative Trial — Outlier Removal

Key points
— Outliers are conventionally removed from collaborative trial datasets before the analysis of variance is carried out.
— This is justified because the study is meant to characterise the method rather than the participant laboratories.

The most frequently used method for purging the initial valid data of outliers is defined in the 1995 IUPAC Harmonised Protocol. This procedure consists of the alternating use of the Cochran and Grubbs tests until no further outliers are flagged, or until the proportion of dropped laboratories would exceed two-ninths of the original number of laboratories providing valid data (Fig. 9.8.1).

Cochran test (see §4.10). First apply the Cochran outlier test (one-tail test at p = 2.5%) and remove any laboratory whose value of the test statistic exceeds the tabulated critical value.

Grubbs tests (see §7.4). Apply the single-value Grubbs test (two-tail test at p = 2.5%) and remove any outlying laboratory. If no laboratory is flagged, then apply the pair-value tests (two-tail, with p = 2.5% overall), with two values at the same end and one value at each end. Remove any result flagged by these tests whose value of the test statistic exceeds the tabulated critical value. (Note: the Grubbs tests are to be applied one material at a time to the set of replicate means from all laboratories, and not to the individual values from replicated designs, because the distribution of all the values taken together is multimodal, not Gaussian; i.e., their differences from the overall mean for that material are not independent.)

Stop removal when the next application of the tests would flag as outliers more than two-ninths of the laboratories.

Final estimation. Recalculate the ANOVA statistics after the laboratories flagged by the preceding procedure have been removed.
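As an illustration of the first step of this loop, the Cochran statistic for a duplicate design is the largest within-laboratory variance divided by the sum of all the within-laboratory variances (the variance of a duplicate pair being d²/2). The sketch below uses invented results; the tabulated critical values are not reproduced here:

```python
# Cochran statistic for a set of duplicated results, one pair per laboratory.
def cochran_statistic(duplicates):
    """duplicates: list of (x1, x2) result pairs, one per laboratory."""
    variances = [(x1 - x2) ** 2 / 2 for x1, x2 in duplicates]
    return max(variances) / sum(variances)

pairs = [(10.1, 10.3), (9.8, 10.0), (10.2, 10.1),
         (9.9, 10.2), (12.0, 9.0)]       # last laboratory is discordant
C = cochran_statistic(pairs)
```

In practice C would be compared with the one-tail 2.5% critical value for the appropriate number of laboratories; here the discordant final pair dominates the statistic.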
Fig. 9.8.1. Flow chart of the procedure for removing outliers from collaborative trial data before ANOVA: starting from all valid data, precision measures are calculated, and the Cochran test, the single-Grubbs test and then the paired-Grubbs test are applied in turn, each dropping a flagged laboratory unless the fraction dropped would exceed 2/9; the loop repeats while any laboratory has been dropped, and one-way ANOVA is then executed on the remaining data.

Further reading
• Horwitz, W. (1995). Protocol for the Design, Conduct and Interpretation of Method Performance Studies. Pure Appl. Chem., 67, pp. 331–343.
9.9 Collaborative Trials — Summarising the Results as a Function of Concentration

Key points
— Expressing precision data as a function of analyte concentration is a useful way of summarising performance information.
— Finding a good fit is not always straightforward: standard regression methods may not be applicable.

It is usually beneficial to summarise the statistics obtained for each material in a collaborative trial by treating the precision as a function of concentration. This provides a compact summary of the study. However, it may not be as simple as it first appears. The main problems are as follows.
• There are few data points, and the relative uncertainty in the standard deviation estimate at each point is large. There will therefore be a large relative uncertainty in the estimated functional relationship.
• The data may be markedly heteroscedastic, so some form of weighted procedure should be used (see §6.7).
• The intrinsic shape of the true functional relationship is unlikely to be linear, and its value must be strictly positive: we cannot use a function that imputes a negative standard deviation.
• Lack of fit may be apparent if there is a significant variation in the matrices of the test materials.
A number of functional forms are suggested in ISO 5725, but they are all fundamentally flawed. These features are apparent in Fig. 9.9.1.

Some simple methods that automatically take account of the heteroscedasticity may be conditionally appropriate in specific circumstances. When all of the materials have analyte concentrations well above the detection limit of the method, a constant relative standard deviation (RSD) is a reasonable assumption, unless the data are clearly at odds with that. In such an instance a suitable expedient might be to calculate the average of the RSDs; interpolation could then be executed by applying this average value to new concentration values. Alternatively, a similar outcome could be obtained by applying simple regression to the log-transformed data.
Simple linear regression will be suspect, as it will not take account of the heteroscedasticity and will tend to give a negative or otherwise misleading intercept.
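The log-transformed expedient mentioned above can be sketched as follows. The data are invented, with an underlying constant RSD of about 0.1, so the fitted slope should be close to unity and the antilog of the intercept close to 0.1:

```python
# Simple regression of log(sigma) on log(c): under the constant-RSD model
# sigma = beta * c, the slope (gamma) should be 1 and the intercept log(beta).
import math

c     = [0.01, 0.1, 1.0, 10.0, 100.0]          # concentrations
sigma = [0.0011, 0.0098, 0.102, 0.99, 10.3]    # SDs, roughly 0.1 * c

lx = [math.log10(v) for v in c]
ly = [math.log10(v) for v in sigma]
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
gamma = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / \
        sum((x - mx) ** 2 for x in lx)
log_beta = my - gamma * mx
beta = 10 ** log_beta           # RSD estimate for interpolation
```

A t-test on gamma against 1 (not shown) would be the check on the constant-RSD model described in the text.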
Fig. 9.9.1. Standard deviation vs. concentration (points) for a collaborative trial of a method for determining propyl gallate. Two fits are shown: simple linear regression (dotted line) and log-log regression (solid curve). The vertical bars show the 95% confidence intervals for the estimates. Both functions show lack of fit to some points and indicate inappropriate values near zero concentration.

The fundamental reason is that the relationship between the standard deviation of measurement and the concentration of the analyte must have a positive intercept; that is, the standard deviation at zero concentration must be greater than zero, because it is describing a measurement result.

The model for constant RSD is σ = βc, where σ is the standard deviation at concentration c. Taking logarithms, the transformed equation is log σ = log β + log c, so a simple regression of log σ on log c should have a slope of one (unity) if the model is correct, and a t-test on the slope should be able to confirm that. Given a unit slope, the intercept would be an estimate of log β, and we would calculate β̂ as its antilog. The value β̂ is then the RSD for interpolation, that is, for calculating values of σ from new values of c within the range of the original data and well away from the detection limit. An example of this method is shown in §6.10.

The method of regressing logarithms will also work if the functional relationship has features similar to the Horwitz function, that is, of the form σ = βc^γ with γ ≈ 1, so that log σ = log β + γ log c. The regression coefficient will then estimate the unknown exponent γ, which may or may not differ from the Horwitz exponent of 0.8495 (see §9.7). However, taking logarithms will tend to be misleading if the concentration values are lower than about 100 times the detection limit. Many analytical systems can be better described by a function of the form σ = √(α² + (βc)²), which is technically appropriate as it has an intercept of α and tends towards a constant RSD of β at high concentrations.
However, a weighted fitting of this type of function goes beyond elementary statistical considerations.

Further reading
• Thompson, M., Mathieson, K., Damant, A.P., et al. (2008). A General Model for Interlaboratory Precision Accounts for Statistics from Proficiency Testing in Food Analysis. Accred. Qual. Assur., 13, pp. 223–230.

9.10 Comparing Two Analytical Methods by Using Paired Results

Key points
— Comparison between two analytical methods can be undertaken with paired results, using either a simple t-test or a regression-type method, but the correct technique for the interpretation of the results depends on the concentration range spanned by the test materials and on the precisions of the two methods.
— If the comparison is based on results from a single laboratory, the outcome may not be generally applicable.

A common analytical task is the comparison of results from two methods for measuring the concentration of the same analyte in a number n of test materials. Usually one of the methods is recognised as a reference method and the other, perhaps more rapid or convenient, is under trial. This 'paired results' method is a particularly valuable approach because the comparison between methods is based on the behaviour of 'real-life' test materials, not on reference materials or spiked solutions, which might behave atypically. The comparison is obviously a statistical matter.
If the concentration range is relatively small, it might be possible to assume a single variance for each method, in which case a t-test of the differences between corresponding pairs would be suitable (§3.8–§3.10). If the concentration range is wider, it is probably advantageous to make the comparison a function of concentration. Quite commonly, the reference method will produce results x_i, i = 1, ..., n, that are substantially more precise than the corresponding trial method results y_i. In such instances regression of y on x (but not x on y), as in §5.12, will probably be a suitable statistical technique. If the precisions are comparable, however, regression may lead to a misleading interpretation because a basic assumption of regression is violated.
In such instances a more general technique that accommodates variable precision on both variables would be required, but an account of the method is beyond the scope of the present work.

Often the reference method and the trial method are quite different in the analytical procedures and in the physical principles behind the procedures. In such instances the potential existence of laboratory bias has to be taken into account. Laboratory bias, small or large, is always present in results, as can clearly be seen in the results of proficiency tests. Paired results from a single laboratory will therefore not be descriptive of the methods, because of the unique biases in that laboratory: the conclusions of the study would apply only to the laboratory concerned. If a general (rather than a laboratory-specific) conclusion is required, it is necessary to compare the mean results of a number of laboratories. This requires a large study, comparable in size (and indeed cost) to a double collaborative trial.

There is, however, an important exception to this limitation: when the two methods differ only in one particular, the effect under study, the laboratory bias (apart from the effect under study) will be the same for both methods and thus cancel out. There is also one mitigating circumstance where suitable data can be obtained at no cost, and that is where both methods are well-characterised and well-represented in a large proficiency test. Then it is possible to compare the means of medium-sized datasets (e.g., 20–50 laboratories) for both methods and for each material, and to use the respective variances as weights for fitting a functional relationship. This 'functional relationship' fitting will provide an unbiased outcome. In short, some care is needed in the design of such experiments, especially in regard to the required generality of the conclusions.

Notes and further reading
• Requirements for a valid comparison of the results of methods are discussed in Thompson, M., Owen, L., Wilkinson, K., et al. (2002). A Comparison of the Kjeldahl and Dumas Methods for the Determination of Protein in Food, Using Data from a Proficiency Testing Scheme. Analyst, 127, pp. 1666–1668.
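For the narrow-range case, the paired comparison reduces to a t statistic on the signed differences between corresponding results. A sketch with invented paired results:

```python
# Paired t-test on differences between a reference and a trial method,
# assuming a single variance over a narrow concentration range.
import math
import statistics

ref   = [4.2, 7.8, 3.1, 9.6, 5.5, 6.7, 8.1, 2.9]   # reference method
trial = [4.4, 7.6, 3.3, 9.9, 5.4, 6.9, 8.4, 3.1]   # trial method
d = [t - r for t, r in zip(trial, ref)]            # signed differences

n = len(d)
mean_d = statistics.mean(d)
sd_d = statistics.stdev(d)
t_stat = mean_d / (sd_d / math.sqrt(n))   # compare with t(n-1) critical value
```

The statistic t_stat is referred to the t distribution with n − 1 degrees of freedom; a significant value indicates a bias between the two methods over the range studied.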
Chapter 10

Internal Quality Control

This chapter is concerned with the statistical aspects of internal quality control, with special emphasis on the correct setting up of control charts. It covers the meaning of statistical control, within-run and run-to-run precision, and the use of multiple-analyte control charts.

10.1 Repeatability Precision and the Analytical Run

Key points
— Duplication can be used to estimate or control repeatability (within-run) variation.
— The key statistic is the absolute difference between corresponding duplicated results.
— It is usually necessary to take account of how precision depends on the analyte concentration.
— 'Maps' of absolute difference against concentration can utilise control lines for a prescribed repeatability precision.

Repeatability conditions of precision are formally defined as those under which replicate measurements are made on the same test material by the same analyst using the same method, equipment and batch of reagents, and within a 'short' period of time (§9.2). The undefined short period can be most usefully interpreted as the length of an analytical 'run': a continuous period, involving anything from one to a large number of measurements, during which the factors contributing to the magnitude of errors are deemed to remain constant.
Of course, the conditions are never really constant, and some systematic changes can be expected in a typical run. The effect of undetected changes of this kind can be handled by conducting the sequence of analyses in a random order. The errors can then be regarded as part of the repeatability variation.

However, repeatability standard deviation (σr) is of only limited use to analytical chemists, as it is usually considerably smaller than the uncertainty of the measurement. Its main value is in enabling the analyst to gauge whether results replicated within a run are consistent, with each other or with some externally derived criterion. Duplication within a run thus provides the analyst with a restricted type of quality control, which is executed by consideration of the difference between the duplicate results on the test materials. The method has the advantage that the variation observed is that of materials that are entirely typical, both in composition and state of comminution. To represent the true variation within the run, the duplicated test portions must be at random positions in the analytical sequence; if they were adjacent, the variation observed between pairs would tend to be too small, because they would not account for unremarked systematic changes.

The test statistic is the difference d = x1 − x2 between corresponding pairs of results, and this has a standard deviation σd = √2 σr. The difference d has an expectation (long-term average) of zero only if there is no consistent trend in the instrumental performance. (For instance, if the sensitivity of the instrument were consistently falling during the run, the first of a pair of duplicated results would tend to be greater even if the duplicate test portions were analysed in a random order.) This apparent difficulty can be overcome simply by considering only the absolute difference d between the corresponding results.

A complication arises when, as often happens,
the concentration of the analyte varies considerably among the test materials comprising the run. A complication arises when. The test statistic is the diﬀerence d = x1 − x2 between corresponding √ pairs of results. if the sensitivity of the instrument were consistently falling during the run. The method has the advantage that the variation observed is that of materials that are entirely typical. Under . The difference d has an expectation (longterm average) of zero only if there is no consistent trend in the instrumental performance. If we knew or prescribed values for the parameters α. Duplication within a run thus provides the analyst with a restricted type of quality control.
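The calculation just described is easily sketched in code. The following Python fragment (our own illustration; the function names and the example pairs are not from the text, although α = 0.015 and β = 0.04 are the parameters used in the selenium example of §10.2) evaluates σd at the mean concentration of each duplicate pair and forms the scaled differences d/σd:

```python
import math

def sigma_d(c, alpha, beta):
    """Predicted sd of the difference d = x1 - x2 between duplicates,
    from sigma_r = sqrt(alpha^2 + (beta*c)^2) and sigma_d = sqrt(2)*sigma_r."""
    return math.sqrt(2.0 * (alpha**2 + (beta * c)**2))

def scaled_differences(pairs, alpha, beta):
    """Return d/sigma_d for each duplicate pair (x1, x2), evaluating
    sigma_d at the mean concentration c = (x1 + x2)/2.  Under the normal
    assumption these values should resemble a sample from N(0, 1)."""
    return [(x1 - x2) / sigma_d((x1 + x2) / 2.0, alpha, beta)
            for x1, x2 in pairs]

# Illustrative duplicate pairs (ppm); parameters as in Section 10.2.
z = scaled_differences([(0.39, 0.42), (2.07, 2.09), (0.63, 0.63)],
                       alpha=0.015, beta=0.04)
```

A dotplot or normal-probability plot of the resulting z values then provides the within-run check described above.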
Under the normal assumption, the scaled differences d/σd should behave like a sample from a standard normal distribution, regardless of any varying concentration in the materials. A dotplot is sufficient to show the distribution. More information, however, can be extracted by adding a concentration dimension to the plot. In this map (it is not strictly a 'chart') d is plotted directly against c, on linear or logarithmic axes as convenient. The control lines are functions of concentration and have to be plotted as such, at 2σd and 3σd. The map is not a control chart because we are not presenting the data as a sequence: there is no point in plotting these results on a Shewhart control chart, because the results are not a temporal sequence. A 'control map' based on this idea is popular in the geoanalytical community but could be more widely useful. An example is discussed in the next section (§10.2). In addition, repeatability duplicates can be used to estimate the relationship between precision and the concentration of the analyte, although a large number of results are needed to do this adequately.

Notes and further reading
• If comparability within run is the customer's sole requirement, then repeatability standard deviation can be used as the basis for uncertainty. The rationale for this usage lies in the definition of the measurand: in the circumstances referred to, the measurand is not the absolute concentration of the analyte, but the concentration differentials among two or more test materials analysed in the same run. That usage is consistent with the concept of the run as an unchanging analytical system. If such results are released into the public domain, however, this limitation of the uncertainty should be made clear.
• For results that are normally distributed, the median of the absolute difference d falls at 0.954σr, close in numerical value to σr. Adding this line to the map gives the analyst an extra test of the data, by counting the data points above and below the line — they should be roughly equal apart from random differences.
• In some instances it may be preferable to use fitness-for-purpose uncertainty rather than repeatability standard deviation as a criterion of performance. The most likely circumstance is when the analyst has no previous knowledge of σd, such as might happen with determinations that are rarely undertaken.
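The median-line test mentioned above is easily mechanised. This small sketch (our own code, not from the text) derives the 0.954σr factor and performs the above/below counting:

```python
import math

# Median of |d| for d ~ N(0, sigma_d^2) is about 0.6745 * sigma_d, where
# 0.6745 is the upper quartile of N(0, 1).  With sigma_d = sqrt(2) * sigma_r
# this gives the 0.954 * sigma_r line quoted in the text.
MEDIAN_FACTOR = 0.6745 * math.sqrt(2.0)

def median_line_test(abs_diffs, sigma_r):
    """Count absolute differences above and below the line 0.954*sigma_r.
    For in-control, normally distributed data the two counts should be
    roughly equal apart from random differences."""
    line = MEDIAN_FACTOR * sigma_r
    above = sum(1 for d in abs_diffs if d > line)
    below = sum(1 for d in abs_diffs if d < line)
    return above, below
```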
10.2 Examples of Within-Run Quality Control

Key points
— A 'control map' is produced by plotting d against concentration.
— Control lines are inserted at σr, 2σd and 3σd.
— Zero differences are set to a small positive value for plotting on logarithmic axes.
— Overall, the precision of the results conforms well to the expected variation, except possibly at low concentrations, and exceeds that of the external (10%) criterion.

A large batch of samples of soil was analysed for selenium in one run in a completely random order. One in ten of the samples was analysed in duplicate. From previous experience, the repeatability standard deviation was expected to conform to the function σr = √(0.015² + (0.04c)²). The duplicate results (ppm) are shown in the table.

[Table: the duplicate selenium results (ppm), reported as pairs Se 1 / Se 2.]

Differences between the pairs of results and corresponding values of σd were calculated. A dotplot of the scaled differences d/σd is shown in Fig. 10.2.1. The data do not differ significantly from the standard normal distribution, so there is no evidence here that the original data deviated from expectations. The bunching of points at zero is the outcome of excessive rounding.

[Fig. 10.2.1. Dotplot of scaled differences d/σd for the selenium data.]

For further work, absolute differences d were plotted against concentration, with control lines calculated at σr, 2σd and 3σd. The resulting map is shown in Fig. 10.2.2. It is perhaps easier to interpret the same features plotted on logarithmic axes (Fig. 10.2.3). (Differences of zero were set to 0.05 to allow them to be plotted on logarithmic axes: that does not affect the interpretation.)

[Fig. 10.2.2. Control map for results (points) duplicated under repeatability conditions, with control lines at 3σd (solid), 2σd (solid) and σr (dashed).]

[Fig. 10.2.3. Same data and lines as Fig. 10.2.2, plotted on logarithmic axes.]

Thirteen points fall below the median line (dotted) and nine above — roughly equal apart from random differences. Three points fall above the 2σd line, as compared with the expectation of 1.1. The probability of observing three or more points above this line, assuming the data conform to the expected repeatability, is about p = 0.1, so the excess is not significant. As these points are concentrated at lower concentrations, however, this suggests that the detection limit of the method was not as low as expected. There is one discrepant result (that is, one exceeding the 3σd line), at a low concentration, again suggesting that the detection limit may be higher than expected.

An alternative approach is to use an independent fitness-for-purpose criterion to judge the results. Figure 10.2.4 shows an absolute difference map with control lines for a repeatability relative standard deviation (RSDr), i.e., σr/c, of 10%. (This analytical precision would be sufficient for most environmental applications.) In this instance the results easily fulfil the requirement, with only four points above the theoretical median for 10% RSDr, indicating that, if anything, the repeatability precision overall may be slightly better than expected.

[Fig. 10.2.4. The selenium results (points) plotted on a 10% control map. The control lines represent 10% RSDr.]

A singular advantage of using log-log plots with fixed RSDs is that general-purpose map blanks can be printed in advance in large numbers and the results quickly entered by hand. It is also relatively easy to write a macro that does the same job by computer.
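The counting arguments used in this example can be reproduced with a short calculation. The sketch below (our own illustration; the 22-pair count is taken from the example) classifies each absolute difference against the control lines and computes the chance of seeing three or more points above the 2σd line purely by chance:

```python
import math

def classify(d_abs, sigma_d):
    """Zone of one absolute difference relative to the control lines."""
    if d_abs > 3 * sigma_d:
        return "action"       # beyond the 3*sigma_d line
    if d_abs > 2 * sigma_d:
        return "warning"      # between the 2*sigma_d and 3*sigma_d lines
    return "ok"

def tail_prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance that k or more of n
    points fall above a control line by chance alone."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# For d ~ N(0, sigma_d^2), P(|d| > 2*sigma_d) is about 0.0455.  With 22
# duplicate pairs the chance of three or more exceedances is of the order
# of 0.1, so three such points are not a significant excess.
p3 = tail_prob_at_least(3, 22, 0.0455)
```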
Notes and further reading
• The dataset is in the file named Selenium.
• 'A simple fitness-for-purpose control chart based on duplicate results obtained from routine test materials'. AMC Technical Briefs No. 9 (2002). Free download via www.rsc.org/amc.

10.3 Internal Quality Control (IQC) and Run-to-Run Precision

Key points
— To apply statistical control to routine analysis, we have to use a surrogate: a control material that resembles the test materials closely and is readily available for measurement.
— Control limits are set according to the run-to-run precision for the system.

The purpose of internal quality control in analysis (IQC) is to ensure as far as possible that the magnitude of the errors affecting the analytical system is not changing during its routine use. During method validation we estimate the uncertainty of the method and show that it is fit for purpose. When the method is in use, every run of analysis should be checked to show that the errors of measurement are probably no larger than they were at validation time. The time-base for IQC is therefore the analytical run.

For this purpose we employ the concept of statistical control, which means in general that some critical feature of the system is behaving like a normally-distributed variable. In industrial production, the critical feature is normally part of the specification, such as the length of a screw. For chemical analysis, however, we have to generate separately a representative feature of the system. This is done by adding one or more 'control materials' to the run of test materials. The control materials are treated throughout in exactly the same manner as the test materials, from the weighing of the test portion to the final measurement. In that way the control materials act as a surrogate, and their behaviour is a proper indicator of the performance of the system. Clearly the control materials must be of the same type as the materials for which the analytical system was validated, in respect of matrix composition and analyte concentration. The results obtained in successive runs can be plotted on a control chart (§10.4), which shows when the system becomes out of control and, by implication, needs investigation and possible remedial action before analysis resumes.

When we undertake a number of successive runs in the same laboratory, the conditions of measurement will inevitably be different in each run: the instrument will be set up differently, or a different instrument of the same type may be used, perhaps by a different analyst. Newly prepared reagents or a new calibrator set may be used. The environmental conditions in the laboratory may be different. Each run thus has a slightly different set of circumstances, giving rise to a 'run bias' effect. In the long term this variation looks like a random 'between-runs' effect in addition to the repeatability variation. Results replicated in successive runs (that is, with the same material being analysed by the same method) will therefore be more variable than repeatability replicates. This combined effect of repeatability and between-run variations is referred to here as run-to-run variation (and elsewhere, rather unhelpfully, as 'intermediate' variation).

It is run-to-run standard deviation that should be used to set up control charts for internal quality control. An incorrect use of repeatability standard deviation for this purpose would result in too frequent an indication of loss of statistical control, whereas use of reproducibility standard deviation or standard uncertainty would result in too low a proportion. For the purposes of internal quality control, we need to know only whether the process (the analytical system) has changed since validation. It must be remembered that the parameters that define statistical control and are used to set up control charts should refer only to the behaviour of the process itself. External criteria, such as certified reference values and their uncertainties, are irrelevant to quality control per se.

Notes and further reading
• There is more information on control charts in §7.
• Thompson, M. and Wood, R. (1995). Harmonised Guidelines for Internal Quality Control in Analytical Chemistry Laboratories. Pure Appl. Chem., 67, 649–666.
• 'The J-chart: a simple plot that combines the capabilities of Shewhart and cusum charts, for use in analytical quality control'. AMC Technical Briefs No. 12 (2003). Free download via www.rsc.org/amc.
10.4 Setting Up a Control Chart

Key points
— It is difficult to estimate run-to-run standard deviation accurately unless routine conditions prevail during the measurements.
— Early results may be atypically disperse.
— An interim chart can be set up immediately after validation by using a repeatability mean and an inflated repeatability precision.
— The interim chart should be replaced after ten or more runs with run-to-run statistics and reviewed periodically after that.

There are practical problems in setting up a control system for analysis. Run-to-run standard deviation cannot be estimated adequately in the usual type of one-off validation: the necessary conditions cannot be realised except during routine use of the method. Real-life replication is required. To achieve this realism, any control material has to be in a random position in a run-length sequence of typical test materials, over an extended series of runs. Many observations are needed to estimate the standard deviation with suitable accuracy, far more than the usual ten. Moreover, the analysts will not be familiar with the method at initial validation time, and will produce results of atypically low precision: almost invariably an improved precision comes with experience of the method.

How then does the analyst actually start the control chart? The best strategy is to use an interim control chart and update it as more information becomes available, as follows.
• Start an interim control chart with the mean result for the control material established at validation time. Define the control limits on the basis of 1.6 times a repeatability standard deviation estimated from the results at validation. (The factor of 1.6 is derived from a broadly-applicable empirical observation of the magnitude of run-to-run variation.) Do not use uncertainty values attached to a certified value of a reference material: that is a description of knowledge about the material, not about the analytical system.
• After results have accumulated from ten runs, replace the control limits with those based on robust estimates of the mean and standard deviation of the new results. After further (say about 30) results have accumulated, these estimates should be checked.
• Review the control limits occasionally, as may be necessary if the mean of the process has clearly changed. This requires the exercise of judgement and should not be done without careful consideration. A substantive change may demand a partial revalidation of the process.
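The interim-limit recipe can be written down directly. In this sketch (the function name is ours), warning and action limits sit at ±2 and ±3 multiples of the inflated standard deviation:

```python
def interim_limits(mean, s_r, factor=1.6):
    """Interim control limits from validation statistics: inflate the
    repeatability standard deviation by the empirical factor 1.6 as a
    stand-in for run-to-run precision, then place warning limits at
    mean +/- 2s and action limits at mean +/- 3s."""
    s = factor * s_r
    return {"warning": (mean - 2 * s, mean + 2 * s),
            "action": (mean - 3 * s, mean + 3 * s)}

# Zinc example of Section 10.5: validation mean 370, s_r = 8.9 ppm,
# so the inflated standard deviation is 1.6 * 8.9 = 14.2 ppm.
limits = interim_limits(370.0, 8.9)
```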
Notes and further reading
• 'Internal quality control in routine analysis'. AMC Technical Briefs No. 46 (2010). Free download via www.rsc.org/amc.
• Howarth, R. J. (1995). Quality Control Charting for the Analytical Laboratory. Part 1. Univariate Methods. A Review. Analyst, 120, 1851–1873.

10.5 Internal Quality Control — Example

Key points
— The best option for a control chart for this dataset is with control lines based on robust statistics of more than about 15 runs of analysis.
— The interim chart based on inflating an estimate of repeatability standard deviation was very similar to the 'permanent' control chart based on run-to-run precision.

Here we consider some quality control data derived from the routine analysis of soil samples by inductively-coupled plasma atomic emission spectrometry. The element of interest is zinc, and the main emphasis is on the typical (but perhaps unexpected) difficulty in determining the control limits. Note that we do not proceed on the assumption that the data will be exactly normally distributed. A more likely occurrence would be that the majority of points are roughly normally distributed and a minority are outliers.

Four methods of estimating the control lines are illustrated in Fig. 10.5.1. Graph (a) shows the mean and action limits determined from rolling statistics (simple mean and standard deviation) from each set of ten successive results from 50 runs. (For example, the limits at Run 50 are based on data from Runs 41–50.) The positions of the control lines are very variable, showing that any choice of just ten runs would give rise to highly questionable outcomes. Graph (b) shows the corresponding control lines determined from all data up to the current round (that is, simple statistics based on Runs 1 to n). The lines are very unstable until Run 25, and after that they are narrower but still somewhat ragged. Graph (c) shows the statistics estimated by a robust method (Huber's proposal H15, see §7.6) with rolling groups of ten successive results. The lines are more stable in position after Run 14, but noticeably affected by some following individual runs. When the robust statistics are calculated from all data up to Run n (Graph (d)), the resulting lines are narrow and stable almost from the start. Using statistics from the first run up to any point after Run 20 would give a serviceable control chart. In this example at least, robust statistics from a long succession of data would give the best outcome for a control chart.

[Fig. 10.5.1. Positions of control lines (mean and action limits) estimated by various methods at different points in the accumulation of data. (a) Rolling groups of ten data, simple statistics. (b) All data up to run n, simple statistics. (c) Rolling groups of ten data, robust statistics. (d) All data up to run n, robust statistics.]

For setting up an interim control chart, the initial repeatability statistics were a mean of 370 and a standard deviation of 8.9 ppm. The interim estimate of run-to-run standard deviation was therefore 1.6 × 8.9 = 14.2 ppm (see §10.4). A chart with control lines at 370 ± k × 14.2, k = 2, 3, was used as the interim chart. Figure 10.5.2 shows this chart used for the first 25 runs. The chart has the appearance of well-behaved data apart from Runs 13 and 14, which are out of control.

[Fig. 10.5.2. Interim control chart based on a standard deviation of 1.6 times the repeatability value.]

A 'permanent' control chart based on option (d) (all data up to Run 50, robust statistics) is shown in Fig. 10.5.3. This has every appearance of a sensible control chart, and it is remarkably similar to the interim chart in this instance. There are a number of clear outliers signifying out-of-control conditions (Runs 13, 14, 27, 34 and 47), and one marginal case (Run 41).

[Fig. 10.5.3. 'Permanent' control chart for zinc with control limits based on robust statistics from the whole dataset.]
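For readers who want to reproduce robust estimates of this kind, the following is a minimal sketch of a Huber-type winsorising iteration (ISO 13528's Algorithm A is a close relative). We use the common winsorising constant k = 1.5 and consistency factor 1.134; the exact tuning used in the text may differ:

```python
import statistics

def huber_h15(x, k=1.5, tol=1e-6, max_iter=100):
    """Huber-type robust mean and standard deviation: winsorise values
    beyond mean +/- k*sd and iterate.  1.4826 makes the MAD consistent
    with the sd for normal data; 1.134 corrects the winsorised sd for
    k = 1.5."""
    mu = statistics.median(x)
    s = 1.4826 * statistics.median(abs(v - mu) for v in x) or statistics.stdev(x)
    for _ in range(max_iter):
        lo, hi = mu - k * s, mu + k * s
        w = [min(max(v, lo), hi) for v in x]   # winsorised sample
        new_mu, new_s = statistics.fmean(w), 1.134 * statistics.stdev(w)
        if abs(new_mu - mu) < tol and abs(new_s - s) < tol:
            return new_mu, new_s
        mu, s = new_mu, new_s
    return mu, s
```

Unlike the simple mean and standard deviation, these estimates are only mildly affected by a few out-of-control runs, which is why the option (d) lines settle so quickly.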
Notes and further reading
• The data for this example were a subset from the file AGRG-IQC, downloadable from AMC Datasets via www.rsc.org/amc. This dataset comprises multi-element results from 156 successive runs (including rejected runs), with a variable number of repeat results for each run. The subset used here comprised the first result for zinc from each run up to Run 50.
• The repeatability standard deviation for the interim chart was obtained independently.
• The closeness of agreement between the control lines of the interim and permanent control charts was in this instance unusually good.

10.6 Multivariate Internal Quality Control

Key points
— Results from simultaneous multi-analyte methods are likely to be correlated because of variations in procedure that affect all analytes.
— Such correlations may be diagnostic and point to specific causes of problems.
— Multiple symbolic control charts are useful because they clearly show runs where many channels are affected simultaneously, and analytes that are affected in many runs.

Chemical analysis with multiple outputs, either simultaneous (e.g., ICP-AES) or nearly simultaneous (e.g., HPLC), is commonplace nowadays. The question of how best to apply IQC principles to such systems is often discussed. One preliminary consideration is the extent to which the variables (the results for the analytes) are correlated. The use of multivariate statistical methods in this context has seldom been reported so far. In any event, such methods may provide a robust account of the analytical system as a whole, but both the analyst and the customer will require information about the validity of results for each individual analyte, and these methods are beyond the scope of this book. The present discussion is therefore limited to the multiple use of univariate methods.

Variation in parts of the analytical procedure that are common to all analytes will tend to cause correlation. Variation in the volume of test solution injected into a chromatograph would be of that kind: it would tend to affect all analytes in proportion. Other variations in method may have outcomes that are limited to a subset of analytes. In an acid decomposition of a soil sample, for example, variation in the final temperature of drying would cause variation in the recovery of a suite of volatile elements (e.g., As, Hg, Se), while other elements would be completely unaffected. An extreme example, with only one analyte affected, could be caused by malfunction in a single channel of a multichannel instrument. These features may be worth some consideration for their potential diagnostic value in instances of out-of-control runs.

Another aspect of this is the well-known 'Bonferroni' problem. Suppose we have a system measuring 30 analytes simultaneously, and we are considering results that fall outside the warning limits of a control chart (that is, the approximate 95% confidence limits). If all of the channels were independent (not correlated), in every run we would expect to see results for one or more analytes in the warning zone purely by chance. That might lead the unwary to reject data at a quite unnecessary rate. The probability (under the usual assumptions) of observing a result for a single analyte outside the action limits is about one in 300, which is so rare in an in-control system that we are justified in assuming that the system has changed. But in an uncorrelated 30-channel system operating in control we could expect to encounter a result outside the action limits with a probability of one in ten. Isolated instances almost certainly would mean nothing. Fortunately, largely uncorrelated multi-analyte systems are seldom encountered: when things go wrong, the effects tend to be visible on a number of analytes simultaneously.

Such occurrences are clearly visible in a multiple symbolic control chart. To make such a chart, the results x for each channel are standardised as z = (x − μ)/σ, where μ and σ are respectively the estimated mean and standard deviation for that channel. The values of z are then converted into symbols indicating the zone into which the result would fall on the corresponding Shewhart chart, and the symbols are plotted according to analyte (rows) and run number (columns). Figure 10.6.1 shows such a chart for the results of 25 elements in a control material over the first 50 runs of a routine procedure. (The example used previously [§10.5] represents one row of this data.) On this chart the great majority of instances of results falling outside action limits are concentrated into a few runs: a large number of instances occur together (and usually in the same direction), pre-eminently in Runs 13, 14, 15, 24, 27, 34, 41 and 47. (Another phenomenon potentially present in multiple symbolic charts is where a particular analyte shows signs of persistent problems over a number of runs; that would be indicative of a specific problem with that element. No example is present in Fig. 10.6.1.)

[Fig. 10.6.1. Symbolic multiple-analyte control chart.]

It is clear that the channels in this example are mostly highly correlated. Even when the anomalous runs were deleted from the dataset, there is still a degree of correlation among the variations. The following correlations, among the alkali metals, are typical of the whole correlation matrix.

[Table: correlation matrix for Li, Na, K and Rb.]

Notes and further reading
• When a number of analytes in a material are determined by quite separate methods on separate test portions, the results can be assumed to be independent.
• The data for this example were a subset from the file named AGRG-IQC, downloadable from AMC Datasets on the website www.rsc.org/amc. This dataset comprises multi-element results from 156 successive runs (including rejected runs), with a variable number of repeat results for each run.
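A multiple symbolic chart of the kind described can be generated with a few lines of code. In this sketch (our own; the symbol alphabet is arbitrary), results are standardised per channel and mapped to Shewhart zones:

```python
def zone_symbol(x, mu, sd):
    """Map a result to a Shewhart-zone symbol: '.' in control,
    'w'/'W' outside the warning limits (low/high), 'a'/'A' outside
    the action limits (low/high)."""
    z = (x - mu) / sd
    if z < -3: return 'a'
    if z > 3: return 'A'
    if z < -2: return 'w'
    if z > 2: return 'W'
    return '.'

def symbolic_chart(results, stats):
    """results[analyte] is the list of results per run; stats[analyte]
    is (estimated mean, estimated sd).  Returns one symbol string per
    analyte (rows), with runs as columns."""
    return {a: ''.join(zone_symbol(x, *stats[a]) for x in xs)
            for a, xs in results.items()}
```

Printing the strings one per line, rows as analytes, reproduces the layout of Fig. 10.6.1: out-of-control runs show up as vertical stripes of capital letters.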
Chapter 11

Proficiency Testing

Proficiency testing, an externally-provided means for participants to check the accuracy of their routine measurements on a regular basis, is now incumbent on all laboratories seeking accreditation.

11.1 Proficiency Tests — Purpose and Organisation

Key points
— Proficiency tests are regular interlaboratory comparisons of results obtained by 'blind' analysis.
— The main purpose is to help laboratories to achieve a suitably low uncertainty.
— Participants' results are usually converted into scores that give an indication of accuracy.
— Participation in a scheme (where available) is an almost universal requirement for accreditation.

Proficiency tests are interlaboratory exercises, provided on a regular basis, that allow participating laboratories to check the accuracy of their results. For each round of the scheme, the scheme provider sends portions of one or more test materials to the participants, who analyse the materials 'blind' (that is, with no indication of the concentrations of the analytes) by their routine methods. After the reporting deadline, the provider processes the results, usually converting them into scores that give an indication of the accuracy, often in relation to a predetermined maximum level of uncertainty. The provider then sends a report of the round to the participants, showing the results and/or scores of all participants. Rounds are provided at various frequencies, most commonly several times per year.

The materials and analytes should be typical of the participants' normal work. The materials should be sufficiently close to homogeneous and stable that variations among the results reflect accurately the variations in the participants' performance rather than variations in the test material. The conversion of participants' results into meaningful and readily understandable scores is almost universal; nearly all such scoring systems are based on the properties of the normal distribution, but some statistical methods needing special software may be required in the process. While it is the job of the scheme provider to execute these methods, it is useful for the participant to be aware of what is involved.

The primary purpose of proficiency testing is to enable participants to be confident about their normal analytical methods. If an unexpected inaccuracy in the routine results is detected, an investigation can be triggered and remedial actions instituted where necessary. Moreover, accreditation agencies expect participants to have and apply a written procedure for dealing with poor scores. This function is so important that participation in a proficiency test, where one is available, has been made a universal requirement for accreditation. Proficiency tests cannot, however, act as a substitute for internal quality control (§10.3), which should be carried out in every run of analysis, or for monitoring the performance of individual analysts.

The accreditation requirement has had the unfortunate effect of encouraging participants to try to excel in accuracy rather than merely to assess the performance of routine operations. This tendency is reinforced when laboratories use their scores in promotional activities, for example by quoting favourable scores in tenders for work. These secondary uses have to a certain extent subverted the original ethos of proficiency testing.

Notes and further reading
• Thompson, M., Ellison, S. L. R., and Wood, R. (2006). The International Harmonised Protocol for the Proficiency Testing of Analytical Chemistry Laboratories. Pure Appl. Chem., 78, 145–196. (Free download from the IUPAC website.)
• Lawn, R. E., Thompson, M., and Walker, R. F. (1997). Proficiency Testing in Analytical Chemistry. The Royal Society of Chemistry, Cambridge.
• ISO Guide 43 (1994). Proficiency testing by interlaboratory comparisons — Part 1: Development and operation of proficiency testing schemes. International Organisation for Standardisation, Geneva.
• ISO 13528 (2005). Statistical methods for use in proficiency testing by interlaboratory comparisons. International Organisation for Standardisation, Geneva.
• ILAC-G13 (2000). Guidelines for the requirements for the competence of providers of proficiency testing schemes. (Available free online at www.ilac.org/documents/)

11.2 Scoring in Proficiency Tests

Key points
— Converting a participant's result into a score is pointless unless it adds information about the accuracy of the result.
— An ideal score should convey the same information about accuracy regardless of the nature of the analytical measurement.
— The z-score is close to ideal.

A participant's result x for an analyte in a round of a proficiency test is usually converted into a score that reflects the accuracy of the result; scoring is pointless unless it has this property. The ideal score should be universally applicable: a particular value, say 1.5, should convey the same information about the accuracy of a result, regardless of the analyte, its concentration, the test material or the physical principle underlying the measurement. The z-score, given by

z = (x − xA)/σp,

is appropriate, where the 'assigned value' xA is the scheme provider's best estimate of the true value of the measurand, and σp is the 'standard deviation for proficiency' (also known informally as the 'target value'). Consequently, the efficacy of a scheme depends critically on the selection of appropriate values for xA and σp.

A hypothetical laboratory using an unbiased method producing results with an uncertainty u = σp would tend to produce z-scores that are a random sample from a standard normal distribution N(0, 1), that is, with a mean of zero and a variance of unity. It is appropriate to interpret z-scores on this basis: we would expect about 95% of z-scores from exactly compliant laboratories to fall between ±2, and very few to fall outside ±3. Real-life laboratories will not be exactly compliant, however. Laboratories operating with poor uncertainty (u > σp) and/or with a bias tend to produce higher proportions of scores outside these limits; laboratories operating with no bias and an uncertainty smaller than σp tend to produce a smaller proportion of scores outside the limits.
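The z-score and its conventional interpretation can be sketched in a few lines (the zone labels are our own wording; schemes differ in how they describe the zones):

```python
def z_score(x, x_a, sigma_p):
    """Proficiency z-score: z = (x - x_A) / sigma_p."""
    return (x - x_a) / sigma_p

def interpret(z):
    """Conventional reading of a z-score against the N(0, 1) benchmark:
    |z| <= 2 is unremarkable, 2 < |z| <= 3 is questionable, |z| > 3
    normally triggers investigation."""
    if abs(z) <= 2: return "compliant"
    if abs(z) <= 3: return "questionable"
    return "unsatisfactory"
```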
Some examples of results and z-scores obtained in rounds of some proficiency tests are shown in Figs. 11.2.1 and 11.2.2. In typical reports, the participants' scores are shown as a bar chart or ordered bar chart, with each individual laboratory identified by an anonymised code. In some instances the number of participants is so large that a bar chart is impracticable and is replaced by a histogram.

[Fig. 11.2.1. Bar chart of results for alumina showing an approximately symmetrical distribution in a round of the GeoPT proficiency test.]

11.3 Setting the Value of the Assigned Value xA in Proficiency Tests

Key points
— There are a number of possible ways of determining an 'assigned value'.
— The value most often used is the consensus of the participants' results.

There are several possibilities for the choice of assigned value xA.
• The certified value of the analyte in a certified reference material (CRM). Metrologically sound. However, this value is seldom used: the cost of a CRM is usually too great for use in a routine manner, there are seldom enough appropriate CRMs available and, in addition, the uncertainty of the certified value is often too large to be useful.
• A value obtained by analysis alongside a number of matrix-matched CRMs in the same analytical run (that is, using the CRMs as calibrators). Similar comments apply to this value.
• A value from a national reference laboratory obtained by a method such as isotope dilution mass spectrometry.
• A value based on formulation. Sometimes applicable. This can be used where the analyte is added gravimetrically or volumetrically to a base material containing none. However, there are often difficulties in accurately spiking the base material with low trace levels of analyte.
• A consensus of expert laboratories. One problem with this assigned value stems from the difficulty of identifying the expert laboratories to the satisfaction of every stakeholder. A practical problem is that the variation between the experts' results is often comparable with that of the whole participant set, so the assigned value does not have a sufficiently small uncertainty.
• A consensus of all participants. This is by far the most commonly used assigned value, and it costs nothing. A consensus is usually easy to identify and has a sufficiently small standard error if there are more than about 20 participants. The consensus has been criticised on metrological grounds, as it is perfectly possible for the great majority of the participants to be using a biased analytical method. In such instances there would be a latent uncertainty in the assigned value, and participants using an unbiased method could receive poor z-scores. However, at present there is seldom an economically viable alternative. Methods of determining a consensus are considered below (§11.4).

Further reading
• Thompson, M., Ellison, S. L. R., and Wood, R. (2006). The International Harmonised Protocol for the Proficiency Testing of Analytical Chemistry Laboratories. Pure Appl. Chem., 78, 145–196. (Free download from the IUPAC website.)
A consensus is usually easy to identify and has a suﬃciently small standard error if there are more than about 20 participants. and Wood.
11.4 Calculating a Participant Consensus

Key points
— A robust mean is usually a good estimator of a consensus if the results from a round of a proficiency test seem unimodal and (outliers aside) roughly symmetrical.
— If the dataset seems to be unimodal but skewed, a mode estimated by kernel density methods may be a suitable consensus.
— Where the results are apparently multimodal, which may appear in proficiency test datasets through the use of a number of methods with differing detection limits, it may be impossible to find a consensus.

In the context of proficiency testing, 'consensus' does not mean absolute concordance, but an identifiable and unique point of maximum agreement between the participants' results. All of the usual measures of central tendency have been considered in this context. In addition to the value of the selected statistic itself, an estimate of its uncertainty is required to ensure that the assigned value is sufficiently stable. Methods for estimating these statistics abound, but experience and judgement are needed to select the method appropriate for particular datasets.

• The mean. The almost inevitable presence of outliers and heavy tails in sets of results from proficiency tests means that the arithmetic mean may be biased and the variance inflated. One of several robust estimates is suitable to avoid these problems if the dataset is unimodal and reasonably close to symmetric (e.g., Fig. 11.4.1). In such datasets the various estimates of central tendency are almost coincident, and the robust mean is a good estimator. The standard error of the robust mean can be estimated as σrob/√n from the robust standard deviation σrob and the number of participants n. (This is a reasonable estimate if the robustification is not too severe: otherwise the value of n should be adjusted for any downweighting.)

• The median. The median is a type of robust mean but is more resistant than some estimators to the influence of skewness. However, the mode is usually preferable for skewed distributions.
• The mode. The mode is intuitively attractive as a consensus estimator, and serves well even if the dataset shows a moderate degree of skewness. The mode of a smooth distribution is the point of highest density. Real datasets presented as histograms or dotplots are not smooth, however, owing to the class boundaries or because of the limited digit resolution of the data. A degree of smoothing is therefore required to identify the mode, and to check that there is indeed only one mode. This smoothing can be conveniently carried out by kernel density estimation (Fig. 11.4.3). The standard error of the mode can be estimated via the bootstrap (a computer-intensive method of estimation).

In some instances more than one mode (other than outliers) may be apparent (Figs. 11.4.4–11.4.6). This could happen if substantial subsets of the participants used one of several discrepant analytical methods or variants of a single method. In such instances it is usually not possible to identify a consensus. Occasionally, however, there may be additional evidence (such as use by participants of a prescribed method) that enables the provider to determine that one such mode represents reliable results and other modes suspect results. In that case the provider can, with due caution, use the reliable mode as the consensus.

Fig. 11.4.1. Histogram of results for alumina in a rock test material from a round of the GeoPT proficiency test. The results are heavy tailed in comparison with the robustly fitted normal distribution (solid line). Extreme outliers have been omitted.
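The robust consensus and its standard error described above can be sketched in a few lines of Python. This is an illustrative simplification with invented data: the median stands in for the robust mean and the scaled median absolute deviation (MAD) for the robust standard deviation, whereas scheme providers often prefer Huber-type estimators.

```python
import math
import statistics

def robust_consensus(results):
    """Sketch of a robust consensus (assigned value) and its standard error.

    The median replaces the robust mean, and 1.4826 * MAD replaces the
    robust standard deviation sigma_rob (the factor makes the MAD
    consistent with sigma for normally distributed results).
    """
    n = len(results)
    med = statistics.median(results)
    mad = statistics.median([abs(x - med) for x in results])
    sigma_rob = 1.4826 * mad
    std_error = sigma_rob / math.sqrt(n)   # standard error of the consensus
    return med, sigma_rob, std_error

# Invented round: unimodal, roughly symmetric, with two outlying results
results = [49.8, 50.1, 50.3, 49.9, 50.0, 50.2, 49.7, 50.4, 55.0, 44.9]
xa, sigma_rob, se = robust_consensus(results)
```

The two outlying results barely move the consensus, whereas the arithmetic mean and standard deviation would be strongly affected by them.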
Fig. 11.4.2. Bar chart of results for lead from a round of the GeoPT proficiency test, showing a strong positive skew.

Fig. 11.4.3. Kernel density representation of results for lead from a round of the GeoPT proficiency test, showing a positive skew and the single mode at about 6 ppm. (Same data as Fig. 11.4.2.) (The subzero density in the low tail is the result of smoothing.)
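The kernel density smoothing used in these figures can be sketched as follows, with a Gaussian kernel and a hand-picked bandwidth h. The data are invented to mimic a positively skewed round, and in practice the bandwidth would be chosen more carefully than by eye.

```python
import math

def kde(points, h):
    """Gaussian kernel density: each result is replaced by a small
    normal distribution of width h and the densities are summed."""
    n = len(points)
    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in points) \
               / (n * h * math.sqrt(2 * math.pi))
    return density

def kde_mode(points, h, grid=2001):
    """Locate the mode as the highest point of the smoothed density,
    searched on a fine grid spanning the data."""
    lo, hi = min(points) - 3 * h, max(points) + 3 * h
    f = kde(points, h)
    xs = [lo + i * (hi - lo) / (grid - 1) for i in range(grid)]
    return max(xs, key=f)

# Invented positively skewed results: the mode falls below the mean
results = [4.9, 5.2, 5.5, 5.7, 5.8, 6.0, 6.1, 6.3, 6.9, 7.8, 9.5, 12.0]
mode = kde_mode(results, h=0.6)
```

Scanning the same grid for local maxima also reveals whether more than one mode is present.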
Fig. 11.4.4. Bar chart of results for niobium from a round of the GeoPT proficiency test. Despite the high proportion of results with good z-scores, there is a suggestion of multimodality.

Fig. 11.4.5. Kernel density representation of results for niobium from a round of the GeoPT proficiency test (same data as Fig. 11.4.4). The tendency to multimodality is clear.
A mixture model regards the dataset as a mixture of random samples of observations from two or more different distributions. It is a useful additional technique for characterising such subsets of the data.

Fig. 11.4.6. The results for niobium from a round of the GeoPT proficiency test are well described as a mixture model of three normally distributed subsets. (Same data as Fig. 11.4.4.)

Notes and further reading
• Some statistical methods mentioned in this section are beyond the scope of this book, but straightforward accounts can be read in the AMC Technical Briefs listed below. A kernel density provides a smooth representation of data by replacing data points by small (usually normal) distributions and then adding the resulting densities at each point on the measurement axis. The bootstrap estimates distributions of statistics by resampling the data with replacement a large number of times.
• 'Representing data distributions with kernel density estimates'. AMC Technical Briefs No. 4. (Revised March 2006). Free download from www.rsc.org/amc.
• 'The bootstrap: a simple approach to estimating standard errors and confidence intervals when theory fails'. AMC Technical Briefs No. 8. (August 2001). Free download from www.rsc.org/amc.
• 'Mixture models for describing multimodal data'. AMC Technical Briefs No. 23. (March 2006). Free download from www.rsc.org/amc.
11.5 Setting the Value of the 'Target Value' σp in Proficiency Tests

Key points
— In best practice the target value should be a criterion known to participants in advance and describing the uncertainty regarded as fit for purpose in the relevant application sector.
— Target values based on a robustified standard deviation of the participants' results do not provide any useful additional information.

The best value of σp is simply the uncertainty that is regarded as fit for purpose in the application sector. If the fit-for-purpose uncertainty varies with the concentration of the analyte, it should be expressed as a functional relationship. That information should be available to the participants before the analysis. It is up to the participants to attempt to comply. The z-scores then give a good idea of the degree of compliance. Under this convention z-scores will be comparable over any analytical method, analyte or matrix.

It is important to emphasise (because it is widely misunderstood) that σp is not intended to predict the uncertainty of individual laboratories. Individual laboratories will tend to have different precisions and biases that jointly contribute to the between-laboratory variation. The value of σp is not intended to predict that diversity: rather it is set to prescribe in advance the uncertainty that is required by the scheme provider. Equally it does not imply that the collected z-scores of all of the participants will be a random sample from the standard normal distribution N(0, 1). Real datasets deviate from N(0, 1) to a greater or lesser extent, often with heavy tails, a proportion of outliers and, occasionally, skews or multiple modes.

Some proficiency tests use a robust standard deviation of the participants' results in a round to define σp. The resulting z-scores will nearly always show in excess of 90% of laboratories with scores between ±2. This may be comforting for the participants, the great majority of whom could claim that their result was 'satisfactory', and equally so for the provider, who could claim that the scheme was achieving its purpose. In fact such a score tells us nothing about whether the results are accurate enough in relation to their intended purpose. Moreover, it does not give the participant any prior guidance as to the standard of accuracy that is required: such guidance is necessary so that appropriate analytical methods can be chosen in advance or modified to meet the requirement. Finally, it does not tend to reduce the variation among the great majority of the participant laboratories. In short, scoring is pointless unless it adds information to that already present in the raw results.

11.6 Using Information from Proficiency Test Scores

Key points
— z-scores should be regarded as action limits rather than a method of classifying participants.
— Laboratories should have a documented strategy for examining and acting upon z-scores.

In a scheme where the σp value is determined by fitness for purpose, a z-score outside the range ±3 will always show that the original result was not fit for purpose. z-scores in the range ±2 can be regarded as calling for no action by the participant. Scores in the intermediate range would not be especially uncommon, and isolated instances could be ignored. Scores outside the range ±3 would be very unusual for a participant conforming to the fitness criterion and can be taken as calling for investigation and possibly remedial action, such as modification of the analytical system. So z = ±3 can best be regarded as defining action limits.

There is a temptation amongst practitioners to regard these arbitrary limits as class boundaries and to name the classes accordingly, for instance 'satisfactory' for |z| < 2 and 'unsatisfactory' for |z| > 3. These class labels are best avoided: there is a danger that they will be interpreted literally and misapplied, especially by those not familiar with statistical inference. There is also a tendency among non-scientists to want to construct a ranked 'league table' from a set of z-scores. That is especially invidious, as ranks can change greatly from round to round without any underlying change in the performances of the participants, and should be strongly discouraged if encountered.

There is also an understandable desire for people to try to summarise performance by averaging z-scores, for a single analyte over a period of time or for a number of different analytes at a particular time. There are several problems associated with creating these summary scores. One such is that a single z-score of large magnitude will have a persistent effect in time, possibly long after the problem giving rise to it has been rectified. Another is that, when several analytes are determined 'simultaneously', the results are unlikely to be independent and the average misleading (unless corrected for covariance). Finally, in an average score from a number of analytes, a particular analyte may consistently attract a score of large magnitude that is hidden in the average.

No such problems are attached to interpreting z-scores by standard univariate control chart procedures, using the usual rules of interpretation (§7.1). Either Shewhart charts or zone charts are suitable for this purpose. If several analytes are determined together in successive rounds, parallel symbolic control charts are especially informative.

In Fig. 11.6.1 we see no trend in the scores for an analyte in successive rounds, but the score in Round 10 is less than −3, so the analytical system needs investigating.

Fig. 11.6.1. z-scores for a single analyte from successive rounds of a proficiency test.

In Fig. 11.6.2 we see five instances where |z| > 3, each calling for investigation. Under the usual rules, Analyte 2 in Round 11 would also trigger investigation, because there are two successive results where −3 < z < −2. We can also see that Analyte 2 alone is unduly prone to low results while, in Round 4, five out of the six analytes attract unduly high results, possibly by a procedural mistake affecting all of them in the same way.

Fig. 11.6.2. Multiple univariate control chart used to summarise the results for a number of analytes determined in successive rounds of a proficiency test. (Analyte 1 results are also shown in Fig. 11.6.1.)

Notes and further reading
• A z-score of unexpectedly large magnitude shows that the internal quality control system also needs investigation: a problem causing the troublesome z-score (unless due to a sporadic mistake) should have been detected promptly by internal quality control and the result rejected by the analyst.
• It is important to realise that a poorly-performing participant can still receive a majority of 'satisfactory' z-scores. If a participant's method was unbiased but the standard deviation were twice the target value (i.e., 2σp), the laboratory would on average still receive a z-score between ±2 on about 67% of occasions and between ±3 on about 87% of occasions.
• 'Understanding and acting on scores obtained in proficiency testing schemes'. AMC Technical Briefs No. 11. (December 2002). Free download from www.rsc.org/amc.
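The action-limit reading of z-scores, and the arithmetic behind the note on a poorly-performing participant, can be checked in a few lines. The wording of the action messages is ours for illustration, not a standard classification.

```python
import math

def action(z):
    """Action-limit interpretation of a z-score (not a class label)."""
    if abs(z) <= 2:
        return "no action"
    if abs(z) <= 3:
        return "isolated instance: may be ignored"
    return "investigate; possible remedial action"

def phi(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Unbiased laboratory with standard deviation 2*sigma_p: z ~ N(0, 2),
# so P(|z| < 2) = P(|N(0,1)| < 1) and P(|z| < 3) = P(|N(0,1)| < 1.5)
p2 = 2 * phi(2 / 2) - 1   # about 0.68, the 'about 67% of occasions'
p3 = 2 * phi(3 / 2) - 1   # about 0.87, the 'about 87% of occasions'
```

The two probabilities confirm the note above: a participant performing at twice the target standard deviation still collects mostly 'satisfactory' scores.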
11.7 Occasional Use of Certified Reference Materials in Quality Assurance

Key points
— Certified reference materials (CRMs) are not ideal for internal quality control, despite their providing a route to traceability.
— CRMs used as an occasional check on accuracy are best regarded as akin to proficiency testing.

Certified reference materials (CRMs) are sometimes advocated for use as control materials in internal quality control (IQC) of analysis. For several reasons, however, it is better to keep the concepts of quality control and traceability distinct. The use of a CRM (where available) on a scale appropriate for IQC would usually be too expensive, but on a lesser scale ineffectual. Moreover, there is a discrepancy in concept between IQC and the CRM. The control chart is based on the properties (i.e., the mean and variance) of the whole analytical system. For the CRM, in contrast, the certified value and its uncertainty describe the material alone.

The analysis of a CRM, however, can provide a useful occasional check of an ongoing analytical system, if a sufficiently close match to the test materials can be found. In such instances it is, in many ways, better to regard the action as a kind of one-laboratory proficiency test rather than part of IQC, in that the CRM is a direct route to an acceptable traceability. The outcome could be assessed by calculating a 'pseudo-z-score'

z = (x − xcrm)/√(ucrm² + uf²)

from the result x, the certified value xcrm and its uncertainty ucrm, and the uncertainty uf regarded as fit for purpose for the result. The fit-for-purpose uncertainty uf would have to be specified in advance, possibly as a function of concentration. It is very important to notice that any such test would be pointless unless the uncertainty on the certified value is negligible. Unless ucrm is less than about uf/2, the score z would reflect the uncertainty in the certified value to an undue extent and mask the behaviour of the analytical system.
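The pseudo-z-score above can be computed directly; the numbers used here are hypothetical.

```python
import math

def pseudo_z(x, x_crm, u_crm, u_f):
    """Pseudo-z-score for an occasional CRM check:
    z = (x - x_crm) / sqrt(u_crm**2 + u_f**2).
    Informative only when u_crm is small, roughly u_crm < u_f / 2."""
    return (x - x_crm) / math.sqrt(u_crm ** 2 + u_f ** 2)

# Hypothetical: certified value 12.4 with u_crm = 0.1,
# fit-for-purpose uncertainty u_f = 0.5, observed result 13.1
z = pseudo_z(x=13.1, x_crm=12.4, u_crm=0.1, u_f=0.5)
```

Because u_crm is well below u_f/2 here, the score mainly reflects the analytical system rather than the uncertainty of the certified value.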
Chapter 12

Sampling in Chemical Measurement

This chapter is concerned with the statistics involved in the neglected topic of uncertainty from sampling. In many application sectors sampling uncertainty is a substantial or even dominant term in the uncertainty budget and therefore very important. The end-user of analytical results needs to know the combined uncertainty (analytical plus sampling) to make valid decisions about the sampling 'target', decisions such as the commercial value of a batch or lot of a material, or whether material conforms with a legal or contractual specification.

12.1 Traditional and Modern Approaches to Sampling Uncertainty

Key points
— Sampling uncertainty is traditionally ignored if the sample is 'representative'.
— The modern approach regards sampling as an integral part of the measurement process and includes its contribution to the combined uncertainty.
— Analytical chemists should use the recommended terminology of sampling.

Nearly all analysis is preceded by sampling, the process of taking a small portion (the sample) from the much larger amount (the target), the composition of which is in question. The sample is small enough to be removed to the laboratory for further physical preparation such as grinding before
analysis. 'Sampling' is usually taken to include all operations down to the preparation of the test sample. 'Analysis' is taken to mean all of the subsequent steps, starting with the selection and weighing of the test portion. The recommended terminology for the various stages of sampling is shown here. Key terms are underlined. Some stages (subsample, laboratory sample) are omitted or merged in many instances.

Of course, taking a sample is pointless unless it reasonably approximates the average composition of the target: such a sample is called 'representative'. Obtaining a representative sample is sometimes very difficult, especially when the target is very large, a shipload of ore for example, and parts of the target are difficult to access. Procedures for obtaining representative samples of nearly all materials of commerce have been produced and often form parts of contractual agreements or legal requirements. Such samples are usually accepted by analytical chemists and end-users of analytical data without further consideration. In effect, the uncertainty introduced into the final analytical result by sampling is ignored in the traditional approach. Any residual heterogeneity in the test sample gives rise to an uncertainty that is attributed to the analytical variation. Moreover, many textbooks refer to the general principles of sampling for chemical analysis, but they do not provide instructions for obtaining representative samples of specific materials.

A more modern approach to sampling avoids the idea of a representative sample, but considers the uncertainty introduced by the sampling as an integral part of the measurement uncertainty. It is an important development because the end user of analytical results needs information about the target, not the sample: the sample is analysed, while the target usually is not. This approach provides a quantitative basis for the often repeated saying among analytical chemists, that 'the result is only as good as the sample'. Moreover, we cannot ensure that we are using resources optimally unless we can compare the uncertainty contributions from sampling and analysis. Despite the modern trend towards considering sampling and analysis as parts of a single measurement operation, their contributions usually have to be estimated separately because sampling is often conducted at a location remote from the analytical laboratory and seldom under the complete control of analytical chemists.

The word 'sample' is often used informally among analytical chemists to indicate 'analyte', 'matrix', 'test material', 'test solution', 'test portion', 'aliquot', and so on. Such usage should be discontinued to avoid confusion. The following sections in this chapter refer to the estimation and use of uncertainty from sampling.
Notes and further reading
• 'Terminology – the key to understanding analytical science. Part 2: Sampling and sample preparation'. AMC Technical Briefs No. 19. (March 2005).
• Crosby, N. T. and Patel, I. (1995). General Principles of Good Sampling Practice. Royal Society of Chemistry, Cambridge.

12.2 Sampling Uncertainty in Context

Key points
— Only the combined uncertainty (sampling plus analytical) is relevant to the customer's needs.
— The concepts of bias and precision can be applied to sampling.
— Taking proper account of sampling uncertainty can affect decisions based on analysis.

All sampling targets are actually or potentially heterogeneous: the chemical composition can vary from point to point in the material. This implies that replicate test samples from a single target will differ in composition from
each other and from the target. This variation gives rise to uncertainty from sampling us that is additional to and independent of the uncertainty ua derived from purely analytical activities. The combined uncertainty on the composition of the target is thus u = √(us² + ua²). It is this combined uncertainty that is relevant to the needs of the end-user of the data, who is required to make rational decisions about the target (not about the laboratory sample). In a high proportion of instances, us makes a substantial contribution to the combined uncertainty. In environmental studies, and in the examination of raw materials such as food or ores, us could even be the dominant contribution.

This fact has important implications for decision making by end-users of analytical data, legislators, enforcers of legislation and analytical chemists (§8.3). Typical decisions are the probable commercial value of a batch of material, or whether it is probably within specification. Analysis is often conducted to determine whether the material in a target conforms to a legislative or contractual limit, either an upper limit or a lower limit. Hitherto, only the analytical uncertainty was taken into account in making such decisions. The sampling uncertainty was in effect taken to be zero. That, however, is not a logical standpoint unless the sample is defined as the target by law. (For instance, a single bottle of milk may have to conform to a regulation, rather than the whole consignment from which the bottle is an example.) The uncertainty from sampling should be taken into proper account in decision making.

That basis for decision making is illustrated in Fig. 12.2.1 as examples A and A′. Result A is clearly below the limit and result A′ above. As the analytical uncertainties do not encompass the limit value, the results are regarded as definitive. If sampling uncertainty is taken into account, so that the combined uncertainty is larger (B, B′), the same analytical result would give rise to a different decision.

Fig. 12.2.1. Measurement results (points) and expanded uncertainties (vertical bars) in relation to a decision limit: A, A′ when only analytical uncertainty is considered; B, B′ when combined uncertainty (analytical plus sampling) is considered.

To create a conceptual framework for sampling uncertainty, we must consider ideas like precision and bias applied to sampling. These terms are familiar when applied to analytical methods but in sampling there are differences in the way that they can be tackled. This is because sampling uncertainty is partly the outcome of the heterogeneity of the material under test as well as the process of collecting it. Moreover, successive targets of the same type of material can vary in the degree of heterogeneity, so that the value of us could vary from target to target. A working value of us will have to be a robust average that is typical of the material as a whole. An analytical result on a sample from a target with an anomalously high value of us could be unfit for purpose, even though the validated sampling protocol was scrupulously followed. This potential variation in the degree of heterogeneity implies that quality control is especially important in sampling.

Notes and further reading
• At the time of writing, the subject of sampling uncertainty is immature: to date very little practical experience has accrued in using tools for dealing with uncertainty from sampling, and even less has been documented.
• Ramsey, M. H. and Ellison, S. L. R. (eds). (2007). Measurement Uncertainty Arising from Sampling — a Guide to Methods and Approaches. The Guide is the joint production, under the Chairmanship of Prof. M. H. Ramsey, of Eurachem, CITAC, Eurolab, Nordtest and the Analytical Methods Committee. It contains chapters on fundamental concepts, estimation of sampling uncertainty and management issues. Six practical examples are examined in detail. Free download available from the Eurachem website www.eurachem.org/guides.
• Ramsey, M. H. and Thompson, M. (2007). Sampling Uncertainty in the Context of Fitness for Purpose. Accred. Qual. Assur., 12, pp. 503–513.
• 'What is uncertainty from sampling, and why is it important?' (July 2008). AMC Technical Brief No. 16A. Free download via www.rsc.org/amc.
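The combined uncertainty and its effect on a limit decision (the situation of Fig. 12.2.1) can be sketched with hypothetical numbers; the coverage factor k = 2 for expanded uncertainty is a common convention, not a requirement of the text.

```python
import math

def combined_u(u_s, u_a):
    """Combined standard uncertainty: sampling and analytical terms
    add in quadrature, u = sqrt(u_s**2 + u_a**2)."""
    return math.sqrt(u_s ** 2 + u_a ** 2)

def compare_with_limit(x, limit, u, k=2):
    """Compare a result with an upper limit using expanded uncertainty
    k*u: 'above' or 'below' when decisive, 'inconclusive' otherwise."""
    if x - k * u > limit:
        return "above"
    if x + k * u < limit:
        return "below"
    return "inconclusive"

# Hypothetical upper limit of 100: analytically 'clearly below', but
# adding a dominant sampling term makes the decision inconclusive
analytical_only = compare_with_limit(95.0, 100.0, u=2.0)
with_sampling = compare_with_limit(95.0, 100.0, u=combined_u(3.0, 2.0))
```

The same analytical result thus supports a different decision once the sampling contribution is included, which is the point of examples B and B′.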
12.3 Random and Systematic Sampling

Key points
— In random sampling all parts of the target have an equal chance of being selected.
— Random sampling may be difficult or impossible to carry out.
— Systematic sampling is often used instead of random sampling.
— It is not possible to make valid inferences about the target unless the whole of it is accessible for sampling.

By definition, an unbiased sampling procedure should provide samples with an expectation (a long-term average) of composition equal to that of the target. Individual samples will vary, but the mean of a large number of samples will approach the target composition. This outcome can be ensured only if the sample is a random sample. This implies that every part of the target must have an equal chance of being incorporated in the sample. Random sampling, however, is often impossible, too costly or too time consuming to carry out, and some kind of systematic sampling is used in its stead. The difference between random and systematic sampling is shown in Figs. 12.3.1 and 12.3.2, using as an example the collection of a composite sample of 20 increments of soil taken from a field.

Fig. 12.3.1. Example of random sampling of soil: increments (points) taken from a field. A substantial hotspot (shaded ellipse) could be overlooked in this instance, but a second random sampling would be likely to detect it.

Fig. 12.3.2. Example of systematic sampling of soil from a field: increments (points) taken at the intersections of a rectangular grid. A substantial hotspot (shaded ellipse) could be systematically overlooked.

'Stratified random' sampling is a compromise between the two forms, in which the target is divided into equal parts ('strata'), and each part sampled randomly. The strata could be real, as when a shipment of coal arrives in a number of railway trucks, or notional, as when a field is divided into equal rectangles (Fig. 12.3.3). Systematic samples are often as good as random samples in practice, although both are capable of missing an important 'hotspot' (a region of anomalously high concentration of the analyte) in a target. Only increasing the number of increments taken (and therefore the cost of the sampling operation) could reduce this possibility. Sampling procedures that cannot access all parts of the target, however, cannot possibly produce a random sample. Inferences from such samples should therefore be treated with caution, as they may not be based on sound statistical principles.

Fig. 12.3.3. Example of stratified random sampling of soil from a field: increments (points) taken at random positions within artificial strata.

Notes and further reading
• Ramsey, M. H. and Thompson, M. (2007). Sampling Uncertainty in the Context of Fitness for Purpose. Accred. Qual. Assur., 12, pp. 503–513.

12.4 Random Replication of Sampling

Key points
— Sampling protocols do not provide instructions for collecting randomly replicated samples.
— Ideas for several regular sampling scenarios are presented.

Estimating uncertainty from sampling involves the replication of the established sampling procedure. To encompass the potential variation fully, the replication has to be done in a randomised way.
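Randomised duplication of a stratified random sample can be sketched by drawing a fresh random position within each stratum for each sample. The grid size and unit-square coordinates here are invented for illustration.

```python
import random

def stratified_increments(nx, ny, seed):
    """Increment positions for one stratified random sample: the target
    (scaled to the unit square) is divided into nx * ny notional strata
    and one increment is placed at a random point within each."""
    rng = random.Random(seed)
    return [((i + rng.random()) / nx, (j + rng.random()) / ny)
            for i in range(nx) for j in range(ny)]

# Duplicate samples follow the same protocol but use fresh random positions
sample_1 = stratified_increments(5, 4, seed=1)   # 20 increments
sample_2 = stratified_increments(5, 4, seed=2)   # independent duplicate
```

Both samples honour the stratification, so each covers the whole target, yet their increment positions are statistically independent.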
. for example. it should be reconed after the ﬁrst sample is taken and then quartered again. the increments for the second sample should be collected at new random positions. Fig. a common practice is to walk the ﬁeld in a Wshaped path (Fig. That may tax the ingenuity of the sampler in some instances. If. Schematic method for duplication of stratiﬁed random sampling of soil in a ﬁeld. the sampler Fig.2.1.2) and collect increments at roughly equally spaced intervals along each leg. Duplicate sampling from a ﬁeld.4.1. the protocol demands that increments for a sample be collected at random positions within the target.4. do not provide instructions for how that can be carried out. In sampling soil or crops from a ﬁeld. Increments (solid circles) are taken at roughly equal intervals along each leg of a ‘W’shaped path.4. 12. To duplicate this adequately. Solid circles show the positions of increments for the ﬁrst sample. although certain ideas are broadly applicable. 12. while open circles show the positions for the increments for the second. If the target is usually sampled by coning and quartering. 12. 12.4.228 Notes on Statistics and Data Quality for Analytical Chemists however. This is illustrated for stratiﬁed random sampling in Fig.
Schematic method for duplication of random sampling from a conveyor belt. This is equivalent to saying that the sampling method is akin to an empirical analytical method. Notes and further reading • Ramsey. 12 . two diﬀerent random selections can be made (Fig. If the test material is presented in a systematic way.3. could walk a second ‘W’ in a diﬀerent orientation. Assur. Increments are taken according to two independent lists of random numbers or times.. H. according to the agreed protocol) there is no bias by deﬁnition. as the sampler need not worry about bias.Sampling in Chemical Measurement 229 Fig. 503–513. This is a comforting point of view. Sampling Uncertainty in the Context of Fitness for Purpose. Those holding this view are encouraged .3). Accred. Qual. 12. (2007). Sampling bias is a contentious subject. and Thompson. — A useful method for comparing sampling methods is the ‘paired sample’ approach.5 Sampling Bias Key points — The existence of sampling bias is denied by some scientists. Some authorities claim that it does not exist: if the sample is taken ‘correctly’ (that is. 12. on a conveyor belt for example. 12. M. — It is diﬃcult to address sampling bias experimentally. M.4. pp.4. where the method deﬁnes the measurand.
• A certiﬁed sampling reference target (analogue of a certiﬁed reference material) could in principle address all sources of sampling bias. Bias in sampling could be general (method bias) or speciﬁc (sampler bias). 12 . Sampling Uncertainty in the Context of Fitness for Purpose. but would be extremely costly to create. and Thompson.10]) is carried out by sampling a (preferably large) series of typical targets by two methods. pp. 503–513. • Intersampler studies with a single sampling protocol (analogue of a collaborative trial) could address betweensampler variation. M. H. as all of the samplers have to travel to a number of targets. Consequently. for instance if the sample was contaminated by the sampling tools or if the sampler misinterpreted the protocol.12). the collection of biases of individual samplers is regarded as a random factor. diﬃcult to maintain and could not be distributed to users.. and in some a signiﬁcant diﬀerence between samplers has been found. • Ramsey. In these trials. However.230 Notes on Statistics and Data Quality for Analytical Chemists by the fact that it is diﬃcult to address sampling bias adequately in practice. • The ‘paired samples’ approach (analogue of the paired methods approach [§9. any bias detected will reﬂect the method bias plus an unknown term from the personal bias (if any) of the sampler. . (2007). The bias between the methods can be characterised statistically by the methods used for comparing two analytical methods (§5. it is easy to see how bias could arise. The ‘reproducibility sampling variance’ could be used as an extra term in the combined uncertainty. All of the samples are then analysed by the same method. the method under scrutiny and by an established reference method. A small number of such studies have been carried out on an experimental basis. Accred. As a single sampler would normally be involved. Qual. These trials are costly to organise. The ‘paired samples’ method is simple to carry out. 
precision alone determines uncertainty and random replication is suﬃcient to quantify it. We can consider analogues of methods used to study bias in analytical methods as potential tools for handling sampling bias. standard uncertainty and standard deviation are treated almost as identical in what follows. Very few examples have been reported. M. Assur. Notes and further reading • If sampling bias is ignored. so that any analytical bias is cancelled out.
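The statistical core of the 'paired samples' comparison is a paired-difference test on the results from the two sampling methods. The sketch below is illustrative only: the function name and the ten-target dataset are hypothetical, and the critical value is the standard two-tailed Student's t for 9 degrees of freedom.

```python
import statistics

def paired_sample_bias(method_a, method_b):
    """Estimate sampling-method bias from paired results on the same targets.

    Each target is sampled by the method under scrutiny (A) and by a
    reference method (B); all samples are analysed by the same analytical
    method, so analytical bias cancels in the differences.
    """
    d = [a - b for a, b in zip(method_a, method_b)]
    n = len(d)
    mean_d = statistics.mean(d)             # estimated bias of A relative to B
    se_d = statistics.stdev(d) / n ** 0.5   # standard error of the mean difference
    return mean_d, mean_d / se_d            # bias estimate and its t-statistic

# Hypothetical paired results (mg/kg) on ten targets:
a = [52.1, 48.9, 55.3, 60.2, 47.5, 51.8, 58.0, 49.6, 53.2, 56.4]
b = [50.0, 47.2, 53.9, 58.1, 45.9, 50.3, 56.2, 47.8, 51.5, 54.6]
bias, t = paired_sample_bias(a, b)
# Compare |t| with the two-tailed critical value t(0.975, df = 9) = 2.262;
# here the bias of about 1.75 mg/kg is clearly significant.
```

Because a single sampler would normally collect both series, a significant result reflects method bias plus any personal bias of that sampler, exactly as noted above.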
12.6 Sampling Precision

Key points
— Replicated sampling and analysis followed by ANOVA is required for estimating the sampling standard deviation.
— A multiple-target nested design and hierarchical ANOVA is needed for representative results.

Sampling precision can be quantified as a standard deviation σs estimated by a replicated experiment. If the act of sampling is replicated in a randomised way, the variation in the composition of the samples obtained is a measure of the precision. However, we have to estimate the composition by analysis, and that introduces analytical variation characterised as σa. To separate the two sources of variation we have to replicate the measurement as well and use analysis of variance (§4.7). A simple balanced design for this experiment is shown in Fig. 12.6.1, with multiple samples (n ≥ 8) taken by the same procedure (but randomised, see §12.4) from a single target and with duplicate analysis of each sample.

Fig. 12.6.1. Simple balanced design for estimating sampling standard deviation.

However, this simple design is based on the assumption that the particular target under study is typical of all targets in the class of material. It disregards the possibility that targets may vary in their degree of heterogeneity and therefore in the value of σs. A greatly preferable estimate, characterising a whole class of material, may be obtained by taking duplicate samples from a succession of different targets of the same type (Fig. 12.6.2). This procedure is also straightforward and involves no extra work, although a greater time span may be required to accumulate the results. The results are treated by hierarchical analysis of variance to obtain the sampling standard deviation. Estimates of the between-target variation and the analytical variation are also obtained. Moreover, the procedure points directly to a method for the quality control of sampling (§12.8) in a natural way.

Fig. 12.6.2. Nested (multiple-target) balanced design for estimating sampling standard deviation.

As an example we can consider the sampling of animal feedstuff and its analysis for aluminium. Twelve successive targets were sampled in duplicate and each sample analysed in duplicate. The results are shown in the following table, in which the column heading 's1a1' refers to the first result on the first sample, and so on. The results are illustrated in Fig. 12.6.3, which shows no indications of (a) anomalous targets or (b) discrepant analytical duplication; target 9, however, gave rise to possibly discrepant samples.

Results from a nested duplication exercise to validate the sampling protocol for aluminium in animal feed.

Target   s1a1   s1a2   s2a1   s2a2
1        128    114    124    101
2        109     98    120    110
3        113    121    110    106
4         76     65     86     74
5         96     88    121    122
6         95    104     84     84
7        111    110    115    110
8        124    114    122    115
9         59     72    113    122
10        87     95     97    101
11        91    109     88     95
12       102    105     91     93

Hierarchical ANOVA gave the following statistics.

Source of variation   Degrees of freedom   Sum of squares   Mean square   F      p
Between sites         11                    7242.25          658.39       1.77   0.170
Between samples       12                    4463.00          371.92       7.67   0.000
Analytical            24                    1164.00           48.50
Total                 47                   12869.25
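The hierarchical ANOVA above can be reproduced directly from the tabulated results. The following Python sketch is our own illustration, not the authors' computation; the variable names are ours, and the variance-component formulas are the standard expected-mean-square relations for a balanced nested design with duplicate samples and duplicate analyses.

```python
# Aluminium results (ppm): for each target, two samples, each analysed twice.
data = [
    [[128, 114], [124, 101]], [[109,  98], [120, 110]],
    [[113, 121], [110, 106]], [[ 76,  65], [ 86,  74]],
    [[ 96,  88], [121, 122]], [[ 95, 104], [ 84,  84]],
    [[111, 110], [115, 110]], [[124, 114], [122, 115]],
    [[ 59,  72], [113, 122]], [[ 87,  95], [ 97, 101]],
    [[ 91, 109], [ 88,  95]], [[102, 105], [ 91,  93]],
]

nt = len(data)                                   # 12 targets
grand = sum(x for t in data for s in t for x in s) / (4 * nt)

# Sums of squares for the three nested strata.
ss_anal = sum((x - sum(s) / 2) ** 2 for t in data for s in t for x in s)
ss_samp = sum(2 * (sum(s) / 2 - sum(map(sum, t)) / 4) ** 2
              for t in data for s in t)
ss_targ = sum(4 * (sum(map(sum, t)) / 4 - grand) ** 2 for t in data)

ms_anal = ss_anal / (2 * nt)                     # 24 degrees of freedom
ms_samp = ss_samp / nt                           # 12 degrees of freedom
ms_targ = ss_targ / (nt - 1)                     # 11 degrees of freedom

# Variance components from the expected mean squares:
# E(ms_anal) = sa2;  E(ms_samp) = sa2 + 2*ss2;  E(ms_targ) = sa2 + 2*ss2 + 4*st2.
sigma_a = ms_anal ** 0.5
sigma_s = ((ms_samp - ms_anal) / 2) ** 0.5
sigma_t = ((ms_targ - ms_samp) / 4) ** 0.5
```

Run on these data, the script reproduces the table (SS = 7242.25, 4463.00 and 1164.00) and gives σ̂a ≈ 7.0, σ̂s ≈ 12.7 and σ̂t ≈ 8.5 ppm, matching the component estimates quoted in the text.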
From the mean squares we can calculate these estimates:

• The analytical standard deviation component, σ̂a = 7.0 ppm.
• The sampling standard deviation component, σ̂s = 12.7 ppm. (See §12.7.)
• The between-target standard deviation component, σ̂t = 8.5 ppm. (This statistic is of no importance in the present context except to show that there was little variation in the concentration.)

Notes and further reading

• The dataset can be found in the file named Aluminium.
• The above estimates are likely to be rather variable from this (in statistical terms) small experiment.
• In the example it was reasonable to treat the sampling standard deviation as homoscedastic, as there was no indication in the plot of the data (Fig. 12.6.3) of a wide concentration range or a correspondingly wide variation in σs. Had such a variation been apparent, some attempt at scaling the data should have been made, for instance by log-transformation, to render it reasonably close to homoscedastic.
• Ramsey, M. H. and Thompson, M. (2007). Sampling Uncertainty in the Context of Fitness for Purpose. Accred. Qual. Assur., 12, 503–513.

12.7 Precision of the Estimated Value of σs

Key points
— The sampling precision estimated from eight duplicated samples will itself be very variable.
— For best outcomes, an analytical method with standard deviation σa < σs/2 should be used.

The sampler and analytical chemist must be aware that an estimated sampling precision will be uncomfortably variable. A suggested minimum of n = 8 samples is the usual compromise between an acceptable estimate of the sampling precision and the cost of carrying out the experiment. If we assume that both errors (sampling and analytical) are normally distributed, we can estimate this variability (Table 12.7.1). The calculations are based on an experiment with eight targets sampled in duplicate, analysed in duplicate by using an analytical method of high precision (σa << σs), and analysed by methods with various analytical standard deviations σa. With eight replicate samples, the relative standard error of the estimated sampling standard deviation σ̂s will be about 25%. Thus the 95% confidence limits will be about 0.5σs and 1.5σs.

Table 12.7.1. Confidence limits (95%) for an estimate σ̂s of a true sampling standard deviation of unity (i.e., σs = 1).

Analytical standard deviation   95% confidence limits on σ̂s   Proportion of zero estimates
σa << σs                        0.55–1.45                       0%
σa = σs/2                       0.5–1.5                         1%
σa = σs                         0.35–1.7                        8%
σa = 2σs                        0.0–2.2                        32%

The precision of the estimated sampling standard deviation improves with the number of duplicated samples. However, it improves only slowly: in comparison with an eight-sample experiment, 32 samples would be required to reduce the standard error by half, for which 2n analyses would be required. That would usually be impracticable as a one-off method validation. However, if the sampling method is in routine use, data can be collected over many sampling events and the estimate gradually refined. The procedure would be analogous to establishing limits for a control chart while it is in use, as in §10.4 and §10.5. The results would have to be robustified in some way against the possible incidence of atypically heterogeneous targets.

With σa > σs/2 far worse precisions will be obtained, to the extent that it may be impossible to estimate σs. There would be a wide and highly asymmetric distribution of outcomes, with a high proportion of zero results. The practical rule of thumb is to use, if possible, an analytical method with a precision σa < σs/2. Unfortunately, the analyst will not know if this criterion is fulfilled until after the experiment. If σs happens to be very small (i.e., the target is close to homogeneous), it may be impossible to estimate its value for lack of a suitable analytical method. In such instances, however, the sampling standard deviation will make a negligible contribution to the combined uncertainty of the measurement and can be safely ignored.
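The variability behind Table 12.7.1 can be explored by simulation. The sketch below is a simplified model of our own, not the authors' calculation: it assumes eight targets sampled in duplicate with a single analysis per sample and σa treated as known, so the proportions it produces are close to, but not identical with, the tabulated ones.

```python
import random

random.seed(1)

def estimate_sigma_s(sigma_s, sigma_a, n_targets=8):
    """One simulated validation: n_targets sampled in duplicate, one analysis
    per sample, sigma_a assumed known. The difference d between duplicate
    samples has variance 2*(sigma_s**2 + sigma_a**2); the moment estimate of
    sigma_s**2 is truncated at zero when analytical noise swamps sampling
    variation."""
    sd_d = (2 * (sigma_s ** 2 + sigma_a ** 2)) ** 0.5
    d2 = [random.gauss(0.0, sd_d) ** 2 for _ in range(n_targets)]
    var_s = sum(d2) / (2 * n_targets) - sigma_a ** 2
    return max(var_s, 0.0) ** 0.5

# A poor analytical method (sigma_a = 2*sigma_s) very often yields a zero
# estimate of the sampling standard deviation:
coarse = [estimate_sigma_s(1.0, 2.0) for _ in range(4000)]
zero_fraction = sum(e == 0.0 for e in coarse) / len(coarse)

# Even a near-perfect analytical method leaves a wide spread around sigma_s = 1:
precise = sorted(estimate_sigma_s(1.0, 0.01) for _ in range(4000))
lo, hi = precise[100], precise[3899]   # approximate 95% range of outcomes
```

Under this simplified design the 95% range of outcomes for σa << σs comes out near 0.5–1.5, and roughly 40% of the σa = 2σs trials return zero, illustrating the wide, asymmetric distributions described in the text.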
12.8 Quality Control of Sampling

Key point
— A combined sampling/analytical control chart can be constructed from the results of duplicate samples.

We have seen (§12.6) that a 'generally-applicable' estimated value σ̂s can be attached to the sampling standard deviation for a typical target in a defined class. However, the sampler may encounter particular targets, apparently within the defined class, for which the 'general' σ̂s is not appropriate. Even if the sampling is carried out exactly according to a validated protocol, the sampling precision may be poor, because the target is more heterogeneous than is usual for the type of material, because the sampling has been carried out ineptly or, more probably, a combination of the two. Such instances should be detected if possible, because an incorrect assumption about the heterogeneity will tend to invalidate decisions about the target. Meanwhile, excessive heterogeneity could make the result unfit for purpose. Quality control of sampling can alleviate this situation.

A simple way of conducting sampling QC is to take duplicate samples A and B at random from each target. Each sample is analysed once. This design is shown in Fig. 12.8.1. The mean result (xA + xB)/2 can be taken as the result for the target, and the difference between the results can be used as an indicator of compliance. The standard deviation of a signed difference d = xA − xB for a compliant (in-control) outcome would be σd = √(2(σs² + σa²)). This value can be used to define control lines for a Shewhart or other control chart, so that a single point falling outside the ±3σd limit indicates a system out of control. However, as the order in which the results are obtained is arbitrary, it is preferable to use a one-sided control chart with control lines at zero, 2σd and 3σd, on which the absolute difference |xA − xB| should be plotted. The lines will have the same implications as in an ordinary Shewhart chart.

Fig. 12.8.1. Design for routine sampling quality control.

An example of such a chart is shown in Fig. 12.8.2. The control lines were set according to σ̂d = √(2(σ̂s² + σ̂a²)) = √(2(12.7² + 7.0²)) = 20.5, using the values for σ̂s and σ̂a established previously (§12.6). Two out-of-control conditions were detected: at target 21, with a difference outside the action limit, and at target 27, because two successive targets gave differences above the warning limit. It is not clear whether these excursions resulted from an analytical problem, a sampling problem or a combination of the two. A more elaborate design with duplicate analyses of both samples (as in Fig. 12.6.2) would enable this ambiguity to be clarified, but would obviously cost more to execute as a routine practice.

Fig. 12.8.2. Routine internal quality control chart for combined analytical and sampling variation for the determination of aluminium in animal feed.

Notes and further reading

• The dataset can be found in the file named Alsamiqc.
• If the concentration of the analyte varies substantially in successive targets, it may be preferable to construct a control chart for the relative absolute difference.
• An alternative approach to sampling IQC, the Split Absolute Difference (SAD) method, is sometimes applicable and does not require duplicate samples. See: Analyst, 2004, 129, 359–363.
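The chart construction above reduces to a few lines of arithmetic plus the Shewhart-style decision rules. A minimal Python sketch, with function names and the short series of absolute differences of our own invention (the σ̂s and σ̂a values are those from §12.6):

```python
def sampling_qc_limits(sigma_s, sigma_a):
    """Warning and action lines for a one-sided chart of |xA - xB|:
    sigma_d = sqrt(2 * (sigma_s**2 + sigma_a**2))."""
    sigma_d = (2 * (sigma_s ** 2 + sigma_a ** 2)) ** 0.5
    return sigma_d, 2 * sigma_d, 3 * sigma_d

def out_of_control(abs_diffs, warning, action):
    """Flag chart violations: a single point above the action line, or two
    successive points above the warning line."""
    flags = []
    for i, d in enumerate(abs_diffs):
        if d > action:
            flags.append((i, "action"))
        elif d > warning and i > 0 and abs_diffs[i - 1] > warning:
            flags.append((i, "two successive above warning"))
    return flags

# sigma_s and sigma_a as estimated in section 12.6 (12.7 and 7.0 ppm):
sigma_d, warn, act = sampling_qc_limits(12.7, 7.0)   # sigma_d comes to about 20.5
# A made-up series of absolute differences between duplicate samples:
flags = out_of_control([10, 65, 12, 45, 44, 8], warn, act)
```

With these inputs the limits fall at roughly 41 (warning) and 61.5 (action), so the series flags the second point (above the action line) and the fifth (second successive point above the warning line), mirroring the two kinds of excursion seen in Fig. 12.8.2.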