
Journal of Chromatography A, 1158 (2007) 158–167

Review

Set-up and evaluation of interlaboratory studies


Yvan Vander Heyden, Johanna Smeyers-Verbeke ∗
Vrije Universiteit Brussel (VUB), Pharmaceutical Institute, Department of Analytical Chemistry and Pharmaceutical Technology,
Laarbeeklaan 103, B-1090 Brussels, Belgium
Available online 22 February 2007

Abstract
Interlaboratory comparisons by means of method-performance precision and bias studies and of proficiency-testing schemes are described. The set-up
of the experiments as well as the evaluation of the data by means of graphical and statistical methods are considered. The use of interlaboratory
data for the estimation of measurement uncertainty is also addressed.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Interlaboratory studies; Method performance; Proficiency testing; Bias evaluation; Uncertainty

Contents

1. Introduction ... 159
2. Method performance studies—precision experiments ... 159
   2.1. Aim and definitions ... 159
   2.2. Organisation of a precision experiment ... 160
   2.3. The components of precision ... 161
   2.4. Evaluation of the data ... 161
        2.4.1. Outliers ... 161
        2.4.2. Calculation of the variances ... 163
        2.4.3. Repeatability and reproducibility as a function of concentration ... 163
   2.5. What precision to expect? ... 163
3. Method performance studies—bias experiments ... 164
   3.1. Aim and definitions ... 164
   3.2. Organisation of the bias experiment ... 164
   3.3. The evaluation and interpretation of the data ... 164
4. Proficiency studies ... 164
   4.1. Aim ... 164
   4.2. Organisation of a proficiency study ... 164
   4.3. Evaluation of the data ... 165
5. Interlaboratory studies and uncertainty ... 165
6. Conclusion ... 166
Acknowledgement ... 166
References ... 166

∗ Corresponding author. Tel.: +32 2 477 4737; fax: +32 2 477 4735.
E-mail address: fabi@vub.ac.be (J. Smeyers-Verbeke).

0021-9673/$ – see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.chroma.2007.02.053

1. Introduction

Interlaboratory studies, which are sometimes also called collaborative studies, trials or ring tests, are studies in which several laboratories analyse the same material(s). Depending on the focus of the study, three main types can be distinguished.

Collaborative trials or method-performance studies assess the performance characteristics of a specific method. In the ISO 5725 guidelines [1] they are called accuracy experiments and consider the evaluation of the precision as well as the trueness from the interlaboratory trial. The ISO 5725-2 guideline [1] specifically describes precision experiments for the determination of the repeatability and the reproducibility. The second part of accuracy is trueness, which, in an interlaboratory context, measures the bias of the measurement method. ISO describes the bias experiments in ISO 5725-4 [1].

Laboratory-performance or proficiency studies focus on the laboratory, with the aim of assessing the proficiency of the individual laboratories. In such studies, sometimes called round-robin studies, test samples of which the concentrations are known or have been assigned (often from the interlaboratory experiment itself) are analysed by a group of laboratories. The laboratories apply whatever method is in use in their laboratory. Proficiency testing is an essential part of the accreditation of analytical laboratories. The International Harmonized Protocol for the proficiency testing of analytical chemistry laboratories [2], which is the result of a cooperation between AOAC International, ISO and IUPAC, has recently been revised to include the experience gained in the last 10 years [3]. In contrast to collaborative trials, which are generally organised by one of the participating laboratories, proficiency tests, which are planned a few times per year, are managed by an external body.

The objective of material-certification studies is to provide (certified) reference materials. A group of selected laboratories analyses, preferably with different methods, a material to determine the most probable value of the concentration of a certain analyte with the smallest possible uncertainty.

Since the latter studies for the production and evaluation of reference materials are very specialised [4] and generally organised by institutions created for that purpose, we will focus only on the method-performance and proficiency studies.

2. Method performance studies—precision experiments

2.1. Aim and definitions

The aim of a precision experiment is to evaluate the repeatability and the reproducibility of an analysis method. By convention everything related to repeatability is denoted by r and everything related to reproducibility by R.

ISO 5725-1 [1] defines repeatability as "the precision under repeatability conditions". These are "conditions where independent test results are obtained with the same method on identical test items in the same laboratory by the same operator using the same equipment within short intervals of time". Short intervals of time mean that the measurements are carried out without recalibration of the instrument, unless the recalibration is an inherent part of the method. The word "independent" means that all steps of the method must be carried out for each replicate measurement. For instance, it is not acceptable to extract one sample and then carry out replicate determinations on it by chromatography. This results only in an estimate of the injection repeatability and cannot be considered an estimate of the method repeatability.

An analyst can measure the repeatability of a method as practised by him/herself in his own laboratory. When the repeatability of a standard measurement method as carried out by qualified laboratories in general is to be measured, this must be done by a collaborative experiment.

The repeatability can be described by the repeatability standard deviation s_r. Another measure is the repeatability limit, r. This is defined by ISO as "the value less than or equal to which the absolute difference between two test results obtained under repeatability conditions may be expected to be with a probability of 95%". It can be shown that r = 2.8 s_r.

The repeatability variance and the relative repeatability standard deviation are written as s_r^2 and RSD_r, respectively.

Reproducibility is defined by ISO as the "precision under reproducibility conditions". The latter are "conditions where test results are obtained with the same method on identical test items in different laboratories with different operators using different equipment". It follows that the reproducibility of a method is necessarily obtained from a collaborative precision study.

The reproducibility limit R is defined by ISO as "the value less than or equal to which the absolute difference between two test results obtained under reproducibility conditions may be expected to be with a probability of 95%". It is obtained as R = 2.8 s_R.

The repeatability and reproducibility limits, r and R, can be used to include precision clauses in the description of the method. A typical precision clause is: "The absolute difference between two single test results obtained under repeatability conditions should not be greater than 0.7 mg/L". An experimental difference between two values larger than r therefore indicates that the laboratory's repeatability for the method investigated is not up to standard. Part 6 of the ISO standard 5725 [1] describes a procedure on how to judge the acceptability of results when the repeatability limit is exceeded.

Notice that the GUM [5] defines reproducibility as "the closeness of agreement between the results of measurements of the same measurand carried out under changed conditions of measurement". The Guide continues to specify that the changed conditions may include different principles or methods of measurement, different observers, different measuring instruments, different locations, different conditions of use or different periods of time, and a note specifies that "a valid statement of reproducibility requires specification of the conditions changed". Reproducibility is then no longer linked to a specific method and therefore, in a GUM context, an interlaboratory study as described in this article measures reproducibility in a measurement set-up in which all possible conditions change except the method applied.
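The factor 2.8 in both limits comes from the difference of two results that each have standard deviation s: that difference has standard deviation sqrt(2)·s, and the 95% normal coverage factor 1.96 gives 1.96·sqrt(2) ≈ 2.8. A minimal sketch, with hypothetical standard deviations that are not taken from the text:

```python
def precision_limit(s):
    """95% limit for the absolute difference between two single
    results: 1.96 * sqrt(2) * s, rounded by ISO to 2.8 * s."""
    return 2.8 * s

# Hypothetical standard deviations (mg/L), for illustration only.
s_r, s_R = 0.25, 0.60
r, R = precision_limit(s_r), precision_limit(s_R)
print(f"r = {r:.2f} mg/L, R = {R:.2f} mg/L")  # r = 0.70, R = 1.68

# A pair of single results differing by more than r suggests the
# laboratory's repeatability is not up to standard.
print(abs(10.90 - 10.10) <= r)
```

With these numbers the observed difference of 0.80 mg/L exceeds r = 0.70 mg/L, which would trigger the ISO 5725-6 acceptability procedure mentioned above.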

2.2. Organisation of a precision experiment

Guidelines and protocols to set up and interpret method performance studies are available from, among others, ISO [1] and IUPAC [6]. The latter is a revision of an earlier protocol [7], which is also adopted by the AOAC as the guideline for the AOAC Official Methods Program [8].

ISO [1] specifies that the method to be investigated shall be standardised, which means that the procedure is described in detail and that in general it is in use in a number of laboratories. ISO adds however that the same approach can also be used to study the precision of non-standardised methods. In any case a clearly written protocol has to be available. This should, where relevant, include system suitability checks and the way the test items are prepared before the actual analysis starts.

Since collaborative studies are very time-consuming, AOAC [8] warns not to conduct such experiments with unoptimised methods. A preliminary pilot trial can be useful to familiarise the participants with the measurement method and to evaluate the protocol for errors and ambiguities. For the latter, CIPAC [9] and AOAC [8] suggest performing a pilot trial with three or four laboratories.

The study should be organised and supervised by a panel that consists of experts familiar with the method. They usually delegate certain responsibilities to an executive officer. The executive officer, who takes care of the actual organisation of the study, usually belongs to one of the laboratories participating in the study. Moreover, a statistical expert should be involved in the design of the experiments and in the analysis and reporting of the data. Within each laboratory a supervisor selects a proficient analyst and organises the actual performance of the measurements, taking into account the instructions from the executive officer, as well as the reporting of the results.

The laboratories that participate should be representative of those laboratories that will apply the method later. This also means that the laboratories should achieve a representative level of competence. There may be a tendency to include only the very best laboratories, but this will lead to an underestimation of the precision measures.

The objective of a precision experiment is to get an estimate s of the true standard deviation σ. Enough measurements must be carried out for the values of s to approach σ sufficiently well. The ISO standard 5725-1 [1] includes a complex equation that allows computing how many measurements should be carried out, and concludes that the number of laboratories p participating in the experiment should be at least 8. It does not appear useful to include more than 15 laboratories (although this is not discouraged). The AOAC guideline [8] states that enough laboratories should participate to ensure that at least 8 valid results would be obtained. The older version of the ISO standard [10] added that when only one single level of concentration is of interest, at least 15 laboratories should be included. AOAC states that at least 15 laboratories are needed for collaborative studies of qualitative analyses.

The number of laboratories mentioned above is needed if one wants to develop an internationally accepted reference method. This does not mean that method-performance studies with a smaller number of participants are not useful. They are then an additional step in the intralaboratory validation of a few laboratories with common interests, or a preliminary step in a true interlaboratory study (see [11]).

The basic experimental design is a balanced uniform level experiment that requires p laboratories to analyse the material n times at each of q levels of concentration. Indeed, precision can (and very often does) depend on concentration, and the validation of a method has to be done over the whole range concerned. It is considered that in general it should be possible to cover the expected range by 5 materials, i.e. q = 5, where a material is a specific combination of matrix and level of analyte. However, there are cases where this is neither possible nor useful. For instance, when a method is validated for the analysis of a drug in formulations, its concentration may be situated in the relatively small range at which it is pharmacologically active but not toxic, so that drugs will not be formulated at largely different concentrations. A smaller number of levels can then be accepted. AOAC states that it should not be less than 3.

For qualitative analyses AOAC [8] recommends 2 analyte levels per matrix, 6 samples per level and 6 negative controls per matrix.

For the evaluation of the repeatability, n independent replicate test results have to be obtained under repeatability conditions at each level. The number of replicates is most often recommended to be n = 2. Several organisations, among them ISO and AOAC, prefer not to repeat the experiment but recommend a blindly coded distribution of duplicate samples, so that the operator does not know which samples are replicates and therefore is not influenced to produce results which are as alike as possible. However, this set-up is only possible if the qn measurements can be performed within a short time interval. Alternatively a split-level replication can be applied. In the split-level experiment a material with (usually two) slightly different concentrations is used. Split-level materials are also known as Youden matched pairs. Each of the levels is analysed once. ISO discusses this approach in part 5 of the 5725 standard [1]. IUPAC [6] and AOAC [8] also propose this design as an alternative.

The materials investigated and the test items that are sent to the laboratories should be representative of those that will be encountered in normal use. If it is not possible to cover all types of material in the procedure, then the materials for which the precision measures are considered to apply should be clearly mentioned, e.g. water in cheese. For a given type of material the precision is usually related to the concentration of the analyte. In such cases the whole range of concentration should be studied and the precision should be reported as a function of concentration. The ISO standard puts the emphasis very much on the effect of concentration on precision and much less on the matrix effects that can occur between materials belonging to the same type of material, which is a pity. It is for instance far from evident that the same kind of results will be obtained for the determination of water content in all types of cheese.

A difficult question is how to account for possible heterogeneity between the test items distributed to the laboratories (or even within the test items). If it is not possible to avoid heterogeneity, for instance when the items are solids, such as powders, the final report should make it clear that the precision measures reported
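The balanced uniform-level layout described above can be pictured as a simple p × q × n allocation of test items. The helper below is purely illustrative (its name and output format are not from any guideline); it also shuffles the item codes within each laboratory, mimicking the blind coding of duplicates:

```python
import random

def uniform_level_design(p, q, n, seed=1):
    """Sketch of a balanced uniform-level layout: each of the p
    laboratories receives n blind-coded test items at each of the
    q concentration levels (p * q * n items in total)."""
    rng = random.Random(seed)
    plan = []
    for lab in range(1, p + 1):
        items = [(level, rep) for level in range(1, q + 1)
                 for rep in range(1, n + 1)]
        rng.shuffle(items)  # blind coding: replicates are not adjacent
        plan += [{"lab": lab, "code": f"L{lab}-{k:02d}",
                  "level": level, "rep": rep}
                 for k, (level, rep) in enumerate(items, 1)]
    return plan

plan = uniform_level_design(p=8, q=5, n=2)
print(len(plan))  # 8 labs x 5 levels x 2 replicates = 80 test items
```

Note that the shuffle only hides which items are duplicates from the analyst; the organiser of course keeps the key linking each item code to its level and replicate.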

include possible heterogeneity between samples as a source of variation. It is possible to assess heterogeneity to some extent, but this requires more extensive experiments. Indications on how to do this can be found in [12]. Sometimes it is not possible at all to send identical items to the participating laboratories. ISO 5725-5 [1] gives leather as an example: no two hides are identical and between-hide variation could therefore be a serious source of variation. To distinguish it from sources of variation due to the measurement, a hierarchical design, which requires more samples to be tested, should be used instead of the balanced uniform level design.

Measures should also be taken to ensure stability of the samples. For food samples, for instance, it may be necessary to transport them in a deep-frozen state, and to include directions for storage of the sample if it is not analysed immediately on receipt by the participating laboratory and on how to carry out the thawing of the sample.

2.3. The components of precision

Each test result y can be represented as follows [1]:

y = m + B + e    (1)

where m is the general average for the material analysed, B represents the laboratory component of bias and e is the random error, representing the residual variance. The latter is estimated by s_r^2, the repeatability variance, while the variance of B gives rise to s_L^2, the between-laboratory variance.

The reproducibility variance is the sum of the repeatability variance and the between-laboratory variance:

s_R^2 = s_r^2 + s_L^2    (2)

Eqs. (1) and (2) state that laboratory components of bias are systematic errors from the point of view of a single laboratory (Eq. (1)) but random errors from the interlaboratory point of view (Eq. (2)).

2.4. Evaluation of the data

The basic statistical model is a random effects one-way analysis of variance (ANOVA) model, from which for each sample the components of variance s_r^2 and s_L^2 can be extracted; by using Eq. (2) it is then possible to obtain s_R^2. The IUPAC/AOAC protocol [6,8] states this as such. The ISO standard 5725-2 [1] does not use this terminology, probably because it prefers to apply a calculation method which can be carried out by hand. It does not explicitly follow the usual calculation scheme for ANOVA but yields the same results.

2.4.1. Outliers

Before computing s_r, s_L and s_R it is necessary to clean the data from outliers. There are two possible types of outliers, namely:

(i) outlying laboratories: they deviate from the others in precision (repeatability), meaning that they deliver work of a lesser standard than the others, or the means of their results are too low or too high compared to the others, indicating unacceptable laboratory bias;
(ii) outlying results for individual laboratories at a given level: these too can deviate either in precision or in mean value compared to the other results at the same level.

From a statistical point of view, unequal repeatabilities violate a fundamental assumption of ANOVA, namely equality of variances, also called homogeneity of variances, and too low or too high results lead to distributions that are no longer normal, another assumption of ANOVA.

2.4.1.1. Procedures for outlier removal. The ISO 5725-2 [1] and IUPAC/AOAC [6,8] standards describe a procedure for outlier removal. Both procedures are very similar, the main difference being the confidence level at which outliers are detected (see further).

Visual evaluation for consistency of results: It should be noted that outlier rejection is not applied here (and should never be applied) as an automatic procedure, but rather that the statistical conclusions are only one aspect of the whole context leading to the final decision. A graphical way to evaluate the consistency of results and laboratories is therefore useful. ISO recommends Mandel's k and h statistics. These statistics can also be used to evaluate the quality of laboratories in laboratory-performance studies. They are obtained as follows:

k_i = \frac{s_i}{\sqrt{\sum_{i=1}^{p} s_i^2 / p}}    (3)

h_i = \frac{\bar{y}_i - \bar{\bar{y}}}{\sqrt{\frac{1}{p-1}\sum_{i=1}^{p}(\bar{y}_i - \bar{\bar{y}})^2}}    (4)

The statistic k_i is a within-laboratory consistency statistic that compares, for a selected level, the standard deviation within one laboratory to the mean standard deviation of the different laboratories for that level.

The statistic h_i is a between-laboratory consistency statistic that measures, for a selected level, the standardised deviation of the mean value obtained by laboratory i (\bar{y}_i) from the grand mean for that level (\bar{\bar{y}}). It is therefore a measure of the laboratory bias.

The k_i and h_i values obtained in that way for the different levels (samples) can be plotted grouped per laboratory or grouped per sample, together with the critical values of the k and h statistics at the significance levels of 1% and 5%. These critical values, as a function of the number of laboratories and the number of replicates per laboratory (for k), can be found in the ISO guideline 5725-2 [1]. An example is given in Fig. 1, which is adapted from ref. [13]. It can be observed from Mandel's k plot that laboratory 4 almost consistently shows the highest repeatability standard deviation (i.e. the poorest repeatability), while Mandel's h plot indicates that laboratory 7 tends to report higher values than the other laboratories. However, a decision is not taken before the statistical tests described in the following sections have been performed.

Test for homogeneity of variances: The homogeneity of variances is tested with the Cochran test. This test investigates if the

variances obtained by the p laboratories at a certain level (s_i^2) are not too dissimilar, by calculating the following statistic:

C = \frac{s_{max}^2}{\sum_{i=1}^{p} s_i^2}    (5)

with s_{max}^2 the largest of the p variances. Critical C-values are tabulated [1,8].

[Fig. 1. Mandel's k (a) and h (b) statistics (adapted from [12]). The lines represent the 1% (—) and 5% (- - -) limits of the statistics.]

This test can be recycled, which means that if, in a first round, laboratories or values in an individual laboratory are rejected, the Cochran test is applied again to the remaining laboratories.

Test for outlying laboratory averages: The averages of the results for the same level obtained by the p laboratories are tested for outliers by the Grubbs' tests. There are several variants of these tests. First the single outlier test is applied, which means that it is investigated whether there is one average which is too high or too low compared to the others. The test can be performed by calculating:

G = \frac{\bar{y}_i - \bar{\bar{y}}}{s}    (6)

with \bar{y}_i the mean value for the suspected outlying laboratory (either the lowest or the highest result), \bar{\bar{y}} the grand mean and s the standard deviation of the p mean values. The absolute value of G is compared with the critical values for this test, which are tabulated [1].

IUPAC/AOAC perform the single Grubbs' test by calculating the percentage reduction in the standard deviation when the suspect laboratory is deleted:

R = 100\left(1 - \frac{s_1}{s}\right)    (7)

with s as defined above and s_1 the standard deviation obtained by removal of the suspected outlying laboratory (either the lowest or the highest result). This value is compared in the usual way with tabulated R-values. This is equivalent to the test of Eq. (6) because their critical values are related [14].

If the single outlier test is negative, the Grubbs' test for two outlying observations, which can detect two simultaneous high or two simultaneous low values, is applied. Again, different equivalent tests are applied by ISO and IUPAC/AOAC. Moreover, a procedure proposed in the IUPAC/AOAC standard also detects one high and one low value occurring simultaneously. The reader is referred to the guidelines for more information on these tests.

Stragglers and outliers: In contrast to most other statistical tests, the decision is not taken at the 5% level of confidence, nor is it taken automatically. In this context, outliers are eliminated at the 1% level (ISO [1]) or the 2.5% level (IUPAC/AOAC [6,8]). ISO calls outliers detected at the 5% level stragglers; they are flagged, but are included in the analysis except when an assignable cause for the outlying result is found.

In the procedure for outlier treatment in both the ISO and IUPAC/AOAC guidelines, the Cochran and Grubbs' tests are sequentially applied. The procedure stops if no further outliers are flagged or if too many laboratories have to be eliminated. IUPAC/AOAC specify that no more than 2/9 of the laboratories should be eliminated, since having to eliminate more might indicate problems with the method.

Robust statistics: The outlier tests described above are based on a normal distribution of errors, which can be assumed but not proven, and decisions have to be taken by the panel. To avoid this, it is possible to apply so-called robust statistics. ISO describes them in part 5 of the 5725 standard [1] as an alternative to the above statistical procedures. In this case outliers are not deleted, but the statistics are such that extreme values have lower weights in the calculations.

2.4.2. Calculation of the variances

The ANOVA table, as well as the calculation of the different variances to be performed for each material (sample), is summarised in Table 1.

Simple hand calculations are also possible and are carried out as follows when all n_i = n = 2:

s_r^2 = \frac{1}{2p}\sum_{i=1}^{p} d_i^2    (8)

and

s_L^2 = \frac{1}{p-1}\sum_{i=1}^{p}(\bar{y}_i - \bar{\bar{y}})^2 - \frac{s_r^2}{2}    (9)

where d_i is the difference and \bar{y}_i the mean of the two results obtained by laboratory i, and \bar{\bar{y}} the grand mean, i.e. \sum \bar{y}_i / p. When s_r^2 and s_L^2 have been computed, s_R^2 can be obtained from Eq. (2). Similar equations for n ≥ 2 can be found in the ISO standard [1].

Table 1
ANOVA table

Source                        Mean squares                                                          Estimate of
Laboratory                    MS_L = \sum_{i=1}^{p} n_i(\bar{y}_i - \bar{\bar{y}})^2 / (p - 1)      \sigma_r^2 + \bar{n}\sigma_L^2
Residual (= repeatability)    MS_r = \sum_{i=1}^{p}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2 / (N - p)  \sigma_r^2

with \bar{n} = \frac{1}{p-1}\left(\sum_{i=1}^{p} n_i - \frac{\sum_{i=1}^{p} n_i^2}{\sum_{i=1}^{p} n_i}\right) and N = \sum_{i=1}^{p} n_i

Calculation of the variances:
- Repeatability variance: s_r^2 = MS_r, with d.f. = N - p
- Variance component between laboratories (between-laboratory variance): s_L^2 = (MS_L - MS_r)/\bar{n}; if s_L^2 < 0, set s_L^2 = 0
- Reproducibility variance: s_R^2 = s_r^2 + s_L^2

2.4.3. Repeatability and reproducibility as a function of They found the following relationship between the relative
concentration reproducibility standard deviation RSDR (%) and the concentra-
A last step is to investigate if there is a relationship between tion expressed as a decimal fraction:
sr and the concentration m and between sR and m. Indeed, it is known that precision measures depend on concentration. The ISO document recommends testing the following models (b0 and b1 are the regression coefficients):

sr = b1m (10)

sr = b0 + b1m (11)

log sr = b0 + b1 log m (12)

Model (10), a straight line through the origin, means that the coefficient of variation is constant. No formal procedures are described to decide which of the equations fits best. The simplest of the models that is found to fit the data sufficiently well is adopted for further use. For all models, weighted regression methods, which however are quite complicated, are also described. Similar models are developed for sR.

Other relations than those described by Eqs. (10)–(12) can of course also be established. If the precision is found not to depend on the concentration, the reported precision measures, sr and sR, are obtained from the variance pooled over the different concentrations.

2.5. What precision to expect?

An important question in the evaluation of the precision is what values of sr and sR can be considered acceptable. Horwitz et al. [15] and Boyer et al. [16] carried out interesting work in this context. They found, from the evaluation of a large number of collaborative studies, that there was a strong relationship between the concentration of an analyte and the observed precision. Interlaboratory collaborative studies on various commodities ranging in concentration from a few percent (salt in foods) down to the ppb (ng/g) level (aflatoxin M1 in foods), but also including studies on, for example, drug formulations, antibiotics in feeds and pesticide residues, were included in their study.

RSDR (%) = 2^(1 − 0.5 log C) (13)

This equation states that RSDR approximately doubles for every 100-fold decrease in concentration, starting at 2% for C = 1, i.e. for a 100% pure sample. This means, for instance, that when an assay is carried out by determining the content of a single-component sample, a relative reproducibility standard deviation of 2% is to be expected.

Eq. (13) can be rewritten as:

RSDR (%) = 2C^−0.1505, or σR = 0.02C^0.8495, or log σR = 0.8495 log C − 1.699 (14)

It was also concluded from Horwitz's initial study that the RSDR only depends on the concentration and not on the analyte or method. The author [15] concludes that RSDR values are suspect when they exceed by more than a factor of 2 what is expected from Eq. (13). The ratio between the reproducibility obtained and the one expected from Eq. (13) is sometimes called the HORRAT (short for Horwitz ratio) and it should not exceed 2.

Another interesting result of the Horwitz study is that the corresponding repeatability measure (RSDr) is generally one-half to two-thirds of the reproducibility measure RSDR.

In a recent article the Horwitz function (Eq. (13)) has been revisited [17] and the authors conclude that it is a useful tool to summarise historical data but that it is an unsuitable criterion to assess interlaboratory comparisons. Horwitz himself already showed [16] that for very low concentrations the estimates of the reproducibility standard deviation, σR, are somewhat better, i.e. lower, than expected from the above equations. He concluded that for such concentrations σR = C/3, or log σR = log C − 0.477. Thompson came to a similar conclusion [18], but with different coefficients, and, moreover, also found a divergence from the first Horwitz equation at high C (>0.138, i.e. 14%) [19].
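The Horwitz prediction and the HORRAT check are straightforward to compute; a minimal sketch (function names are illustrative, not from any standard library):

```python
import math

def horwitz_rsd_r(c: float) -> float:
    """Predicted reproducibility RSD (%) from Eq. (13): RSD_R = 2^(1 - 0.5*log10 C),
    with C the dimensionless mass fraction (C = 1 for a 100% pure sample)."""
    return 2 ** (1 - 0.5 * math.log10(c))

def horrat(rsd_r_found: float, c: float) -> float:
    """Horwitz ratio: the RSD_R found in a collaborative study divided by the
    value predicted by Eq. (13); values above 2 flag suspect reproducibility."""
    return rsd_r_found / horwitz_rsd_r(c)

# C = 1 (a 100% pure sample) gives 2%; every 100-fold dilution doubles it:
print(horwitz_rsd_r(1.0))    # 2.0
print(horwitz_rsd_r(1e-2))   # ≈ 4.0
print(horwitz_rsd_r(1e-6))   # ≈ 16.0 (ppm level)
print(horrat(6.0, 1e-2))     # ≈ 1.5, i.e. acceptable (< 2)
```

The second form of Eq. (14), σR = 0.02C^0.8495, is the same relation expressed as an absolute standard deviation, since 2^(1 − 0.5 log C) = 2C^−0.1505.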
Thompson proposes the following set of equations:

σR = 0.22C if C < 1.2 × 10^−7
σR = 0.02C^0.8495 if 1.2 × 10^−7 < C < 0.138
σR = 0.01C^0.5 if C > 0.138

3. Method performance studies—bias experiments

3.1. Aim and definitions

Method performance studies are mainly conducted with the aim to evaluate the precision of the measurement method. They are much less applied for the evaluation of the bias. However, interlaboratory experiments are required if one wants to estimate the method bias, which is the systematic error inherent to the method. Indeed, there are two components of bias: the method bias and the laboratory bias, the latter being the bias introduced by the way a specific laboratory applies an unbiased method. A single laboratory that evaluates the bias by an internal method validation cannot distinguish the method bias from the laboratory bias. A collaborative trial is needed for this. Part 4 of the ISO 5725 guidelines [1] describes how to evaluate the bias of a standard measurement method.

3.2. Organisation of the bias experiment

The experimental set-up described in the ISO guidelines is very similar to that of the precision experiments. The main differences are (i) the requirement that a reference value is known, for example by using reference materials or materials whose properties are known or are obtained from analysis by another method known to be unbiased, and (ii) the number of laboratories and of replicates per laboratory. The ISO standard 5725-4 [1] includes a table that allows one to derive the minimum number of laboratories, p, and of replicates, n, needed to conclude that there is no bias with 95% probability (α = 0.05) but also to detect a predefined bias with 95% probability (β = 0.05). Taking the β-error into account is important to avoid that a relevant bias goes unnoticed due to a too low number of measurements [20,21].

IUPAC/AOAC [6,8] do not consider separate bias experiments and specify that, if a true or assigned value is not known, a consensus value must be obtained from the collaborative study itself.

3.3. The evaluation and interpretation of the data

The data are evaluated for outliers as described for the precision experiment, and the repeatability variance (sr²) and reproducibility variance (sR²) are estimated from the experiments. They are compared with the variances, σr² and σR², obtained from an independent precision experiment, if available. The latter variances, σr² and σR², are used in the assessment of the bias if the test applied for the comparison is not significant. In the other case it is necessary to examine the reason for the discrepancy, and it might be necessary to repeat the experiments.

The bias is estimated as δ̂ = ȳ̄ − μ, which is the difference between the grand mean (ȳ̄) and the accepted reference value (μ). The standard deviation associated with this bias estimate is given by [1]:

sδ̂ = √[(sR² − (1 − 1/n)sr²)/p] (15)

A 95% confidence interval is calculated that allows testing the significance of the bias.

4. Proficiency studies

4.1. Aim

Proficiency studies assess the performance of individual laboratories by comparing their measurement results with those from other laboratories or with known or assigned values. They are generally operated on a regular basis, most often ranging from 2 to 6 rounds per year. Proficiency studies are also one of the elements considered in the assessment for the accreditation of laboratories, and as a result of this development many proficiency-testing providers or specific schemes have now themselves been accredited.

4.2. Organisation of a proficiency study

There are several guidelines and protocols for the set-up and evaluation of proficiency studies from, among others, ISO [22,23], Eurachem [24] and ILAC (International Laboratory Accreditation Cooperation) [25]. The most recent is the International Harmonized Protocol for the Proficiency Testing of Analytical Chemistry Laboratories, which is the result of a cooperation between ISO, AOAC and IUPAC [3]. It is a revision of an earlier Harmonized Protocol [2].

The proficiency scheme is organised by the scheme provider in cooperation with an advisory group consisting of persons with knowledge of the methods and procedures involved, and including a statistical expert.

The materials investigated, or the test items that are sent to the laboratories, should be representative of those that are routinely tested by the participating laboratories. The material must be sufficiently homogeneous and stable. The revised Harmonized Protocol [3] provides improved homogeneity testing and also includes requirements concerning the repeatability of the analytical methods used to check homogeneity. It also describes pros and cons of several methods to determine an assigned value, and particularly pays a lot of attention to the estimation of the assigned value from the consensus of the participants' results, since this is the most frequently used and least costly method.

Participating laboratories are generally free in the choice of the analytical method, but it should be consistent with their normal routine practice. It is obvious that the method should not be adapted for the proficiency exercise.

Neither ISO [22] nor the revised Harmonized Protocol [3] explicitly specifies the number of laboratories to be included in the scheme, but the latter warns that with a small number of participants (<15) the uncertainty on the consensus value might be unacceptably high and consequently jeopardise the interpretation of the z-scores (see below).
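This warning can be made concrete: the standard uncertainty of a consensus value shrinks only with √p. A minimal sketch, assuming a simple median/MAD estimator and the textbook 1.25·s/√p approximation for the standard error of a median (the Protocol's own robust algorithm is more elaborate):

```python
import statistics

def consensus_with_uncertainty(results):
    """Assigned value as the median of the participants' results, a MAD-based
    robust standard deviation, and the approximate standard uncertainty of
    the consensus (~1.25 * s / sqrt(p) for near-normal data)."""
    p = len(results)
    median = statistics.median(results)
    # 1.4826 scales the median absolute deviation to a normal-consistent sigma
    s_rob = 1.4826 * statistics.median([abs(x - median) for x in results])
    u_cons = 1.25 * s_rob / p ** 0.5
    return median, s_rob, u_cons

# One outlying laboratory barely moves the robust consensus:
xa, s, u = consensus_with_uncertainty([10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 15.0])
print(xa)   # 10.0
```

With p = 7 the uncertainty is about 0.47·s; halving it to 0.25·s would require p = 25, which is why small schemes make the z-scores hard to interpret.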
4.3. Evaluation of the data

When the assigned value is determined as a consensus value, the influence of extreme values should be minimised and therefore the data should be cleared of outliers. Notice that outliers should only be removed for the calculation of summary statistics; they should still be considered in the evaluation of the performance of the individual laboratories (see below). Graphical techniques (such as Mandel's h statistic but also box plots [26]) and statistical outlier tests such as the Grubbs' tests are applied (see Section 2.4).

The most widely used performance indicator of the laboratories is the z-score, which is calculated as:

z = (x − X)/s (16)

with x the result obtained by a laboratory, X the assigned value (see above) and s a standard deviation. The revised Harmonized Protocol describes several ways in which the latter parameter can be obtained. It can for example be determined in a proficiency test as the standard deviation of the laboratory results, after the elimination of outliers. However, it is to be preferred that s is a set target value of precision that could e.g. be derived from the precision needed to perform a certain task, from method performance studies or from the Horwitz curve [15]. The Harmonized Protocol [3] calls this s the "fitness-for-purpose based standard deviation for proficiency assessment".

With good estimates of X and s, z corresponds to a standardised variable and the interpretation is based on the assumption that the z-scores are normally distributed with a mean of zero and a standard deviation of one. Consequently one expects |z| > 2 in only 4.55% of the cases and |z| > 3 in only 0.27%. Therefore the z-scores are generally interpreted as follows:

|z| < 2 satisfactory performance
2 ≤ |z| ≤ 3 questionable performance
|z| > 3 unsatisfactory performance

It is not unusual that more than one test item is analysed within one round of a proficiency test. The combination of several z-scores to summarise the overall performance of a laboratory is however not recommended, to avoid an individual significant score being systematically masked [3].

In a split-level experiment useful information can also be obtained from a Youden plot, which consists of plotting the results of two samples with similar concentration against each other [27,28]. An example, adapted from ref. [29], is given in Fig. 2.

Fig. 2. Youden plot for data adapted from [29]. y1 and y2 are the concentrations of clenbuterol as determined by 10 laboratories in two urine samples (Youden pairs).

The median values for both samples are also plotted. They divide the plot into four quadrants and their intersection point is accepted as the most probable value. With only random error, one expects the points to be equally distributed over the four quadrants. The circle represents the 95% limit around the origin of the plot. As occurs frequently, here too there are more points in the upper right and lower left quadrants, which indicates that several laboratories report results that are too high or too low, which means that there is a systematic error. If all laboratories apply the same method, these systematic errors are laboratory biases.

Proficiency testing can also serve other purposes than assessing the proficiency of laboratories, as illustrated by Thompson et al. [30]. Historical data collected over several years in a proficiency scheme in which many laboratories used the Kjeldahl or the Dumas method for the determination of protein in various foodstuffs were used to examine the bias between both methods. The findings were confirmed later in a designed interlaboratory experiment [31].
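The z-score bands of Section 4.3 translate directly into code; a minimal sketch (function names are illustrative):

```python
def z_score(x: float, assigned: float, s_target: float) -> float:
    """Eq. (16): z = (x - X)/s, with s ideally a fitness-for-purpose
    target standard deviation rather than the observed spread."""
    return (x - assigned) / s_target

def performance(z: float) -> str:
    """Conventional interpretation of proficiency z-scores."""
    az = abs(z)
    if az < 2:
        return "satisfactory"
    if az <= 3:
        return "questionable"
    return "unsatisfactory"

# A laboratory reporting 5.5 against an assigned value of 5.0, with a
# fitness-for-purpose s of 0.2, scores z = 2.5:
z = z_score(5.5, 5.0, 0.2)
print(round(z, 2), performance(z))   # 2.5 questionable
```

Note that the classification is only as meaningful as the choices of X and s; with a consensus-based X from few participants, an apparently unsatisfactory z may reflect the scheme rather than the laboratory.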
5. Interlaboratory studies and uncertainty

Analysts and metrologists in general know that the measurement result is only an approximation of the true value. To be complete, an uncertainty statement should therefore accompany the measurement result. Although this has been accepted for a very long time, it is only relatively recently that guidelines about how to produce a quantitative statement of uncertainty have been published. The basic document concerning the uncertainty of measurement results is the Guide to the Expression of Uncertainty in Measurement [5], often called the GUM. It was first published in 1993 and corrected in 1995 by ISO. It is the result of a collaboration between several organisations in a task force called ISO/TAG4.

The GUM proposes an error-propagation or error-budget approach to estimate the uncertainty related to a measurement result. The analytical process is decomposed into its components, the uncertainties related to these components are separately quantified in the form of a standard deviation and then combined into an error budget. This also includes an estimation of all sources of systematic error, which according to the GUM should be corrected for if significant. The advantage of this approach is that, once the contributions of individual sources of uncertainty have been quantified, it may be possible to adapt the method used such that the more important contributions become smaller, finally leading to a method that yields results with less uncertainty. However, it soon became clear that the principles of the GUM could relatively easily be applied to physical measurements (such as weights and volumes, or electrical measurements), but much less easily to chemical ones.

As an alternative way to estimate uncertainty for chemical measurements, the Analytical Methods Committee of the Royal Society of Chemistry (AMC) [32] proposed an approach based on precision data assessed in an interlaboratory study. It is based on reproducibility data and fulfils the requirement to also include systematic errors because, as explained in Section 2.3, systematic errors within a laboratory, such as the laboratory bias, become random errors if a population of laboratories is considered. Ellison has discussed several issues related to the use of interlaboratory data in the estimation of measurement uncertainty [33].

The reproducibility standard deviation from a method-performance experiment includes all uncertainty contributions that randomly occur in a population of laboratories, but it does not include the uncertainty associated with the method bias. This uncertainty consists of the variation of the bias estimate, as expressed by its standard deviation sδ̂ (see Eq. (15)), and the uncertainty associated with the accepted reference value, uref. It follows from Eq. (15) that the former will be small compared to the reproducibility standard deviation if a large number of laboratories is included in the interlaboratory trial.

The uncertainty associated with a measurement result y therefore becomes:

uy = √(sR² + uref²) (17)

A further simplification of this expression is possible when uref² is negligible in comparison to the reproducibility variance. O'Donnell and Hibbert, using simulated data, have compared several methods to take account of the bias in the measurement uncertainty. They conclude that correcting for the bias is consistently the best throughout the range of biases evaluated [34].

When the method has been validated in an interlaboratory method-performance study, the reproducibility standard deviation can be applied to assess the uncertainty due to (lack of) precision. The laboratory that applies the method is then supposed to show that it is sufficiently proficient: reproducibility takes into account intermediate precision (i.e. repeatability and in-house sources of variation due to e.g. time or different operators) and laboratory biases, and it has to be shown that the laboratory's repeatability is comparable to the one obtained in the interlaboratory study.

Reproducibility is supposed to include sources of variance such as different times and operators. Therefore the laboratory should show that its intermediate precision is not larger than the reproducibility standard deviation. As said above, reproducibility takes laboratory biases into account, and therefore the laboratory must have evidence that its laboratory bias is lower than the reproducibility standard deviation. It may derive this evidence from analysing relevant reference materials or in other ways, such as by participating in proficiency studies.

Some applications of interlaboratory data in the estimation of the uncertainty of chromatographic methods have been described [35,36].

Notice, however, that depending on the situation other precision measures than the reproducibility standard deviation can be included in the uncertainty statement. Operational definitions of uncertainty have been proposed that take into account different sources of uncertainty according to the situation considered [37]. Barwick and Ellison propose a guide for the evaluation of uncertainty from in-house validation data [38].

6. Conclusion

Interlaboratory studies serve several needs and aspects of the quality management of chemical measurements. They allow the validation of analytical methods, the assessment of the proficiency of individual laboratories, the estimation of measurement uncertainty and the certification of reference materials in a wide range of application fields (ranging from clinical chemistry and microbiology over environmental science to food analysis).

Several guidelines, very often intended for application in different areas, are therefore very general. They have to be adapted for specific situations so that they can easily be integrated to fit particular customers' needs.

Acknowledgement

This article is partly based on a draft text on the same subject, prepared in collaboration with Professor D.L. Massart, who unfortunately passed away on 26 December 2005.

References

[1] Accuracy (trueness and precision) of measurement methods and results—parts 1–6 (ISO 5725), International Organisation for Standardization (ISO), Geneva, 1994.
[2] IUPAC, The International Harmonized Protocol for the proficiency testing of (chemical) analytical laboratories, Pure Appl. Chem. 65 (1993) 2123–2144.
[3] IUPAC, The International Harmonized Protocol for the proficiency testing of analytical laboratories, Pure Appl. Chem. 78 (2006) 145–196.
[4] General requirements for the competence of reference material producers (ISO 34), International Organisation for Standardization (ISO), Geneva, 2000.
[5] Guide to the expression of uncertainty in measurements, International Organisation for Standardization (ISO), Geneva, 1995.
[6] IUPAC, Protocol for the design, conduct and interpretation of method-performance studies, Pure Appl. Chem. 67 (1995) 331–343.
[7] IUPAC, Protocol for the design, conduct and interpretation of collaborative studies, Pure Appl. Chem. 60 (1988) 855–864.
[8] AOAC, Guidelines for Collaborative Study, Procedures to validate characteristics of a method of analysis, J. AOAC Int. 78 (1995) 143A–160A.
[9] Guidelines for CIPAC collaborative study procedures for assessment of performance of analytical methods, Collaborative International Pesticides Analytical Council (CIPAC), UK, 1989; available online: http://www.cipac.org/howprepa.htm.
[10] Precision of test methods—determination of repeatability and reproducibility for a standard test method by inter-laboratory tests (ISO 5725), International Organisation for Standardization (ISO), Geneva, 1986.
[11] J.O. De Beer, B.M.J. De Spiegeleer, J. Hoogmartens, I. Samson, D.L. Massart, M. Moors, Analyst 117 (1992) 933.
[12] Certification of reference materials—general and statistical principles (ISO 35), International Organisation for Standardization (ISO), Geneva, 1989.
[13] Y. Vander Heyden, J. Saevels, E. Roets, J. Hoogmartens, D. Decolin, M.G. Quaglia, W. Van den Bossche, R. Leemans, O. Smeets, F. Van de Vaart, B. Mason, G.C. Taylor, W. Underberg, A. Bult, P. Chiap, J. Crommen, J. De Beer, S.H. Hansen, D.L. Massart, J. Chromatogr. A 830 (1999) 3.
[14] P.C. Kelly, J. Assoc. Off. Anal. Chem. 71 (1988) 161.
[15] W. Horwitz, L.R. Kamps, K.W. Boyer, J. Assoc. Off. Anal. Chem. 63 (1980) 1344.
[16] K.W. Boyer, W. Horwitz, R. Albert, Anal. Chem. 57 (1985) 454.
[17] T.P.J. Linsinger, R.D. Josephs, Trends Anal. Chem. 25 (2006) 1125.
[18] M. Thompson, P.J. Lowthian, J. AOAC Int. 80 (1997) 676.
[19] M. Thompson, Analyst 125 (2000) 385.
[20] C. Hartmann, J. Smeyers-Verbeke, W. Penninckx, Y. Vander Heyden, P. Vankeerberghen, D.L. Massart, Anal. Chem. 67 (1995) 4491.
[21] C. Hartmann, J. Smeyers-Verbeke, D.L. Massart, J. Pharm. Biomed. Anal. 17 (1998) 193.
[22] Proficiency testing by interlaboratory comparisons—part 1: development and operation of proficiency testing schemes (ISO 43), International Organisation for Standardization (ISO), Geneva, 1994.
[23] Statistical methods for use in proficiency testing by interlaboratory comparisons (ISO 13528), International Organisation for Standardization (ISO), Geneva, 2005.
[24] Selection, use and interpretation of proficiency testing schemes, EURACHEM Nederland and Laboratory of the Government Chemist (LGC), UK, 2000; available online: http://www.eurachem.ul.pt/guides/ptguide2000.pdf.
[25] Guidelines for the requirements for the competence of providers of proficiency testing schemes, ILAC Guide G13, International Laboratory Accreditation Cooperation (ILAC), 2000; available online: http://www.ilac.org/publicationslist.html.
[26] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. De Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Part A, Elsevier, Amsterdam, 1997.
[27] W.J. Youden, Ind. Qual. Control 15 (1959) 24.
[28] J. Mandel, T.W. Lashof, J. Qual. Technol. 6 (1974) 22.
[29] H. Beernaert, Ringtest clenbuterol in urine, IHE Report, Instituut voor Hygiene en Epidemiologie (IHE), Brussels, September 1992.
[30] M. Thompson, L. Owen, K. Wilkinson, R. Wood, A. Damant, Analyst 127 (2002) 1666.
[31] M. Thompson, L. Owen, K. Wilkinson, R. Wood, A. Damant, Meat Sci. 68 (2004) 631.
[32] AMC (Analytical Methods Committee), Analyst 120 (1995) 2303–2308.
[33] S.L.R. Ellison, Accred. Qual. Assur. 3 (1998) 95–100.
[34] G.E. O'Donnell, D.B. Hibbert, Analyst 130 (2005) 721.
[35] P. Dehouck, Y. Vander Heyden, J. Smeyers-Verbeke, D.L. Massart, J. Crommen, Ph. Hubert, R.D. Marini, O.S.N.M. Smeets, G. Decristoforo, W. Van de Wauw, J. De Beer, M.G. Quaglia, C. Stella, J.-L. Veuthey, O. Estevenon, A. Van Schepdael, E. Roets, J. Hoogmartens, Anal. Chim. Acta 481 (2003) 261–272.
[36] P. Dehouck, Y. Vander Heyden, J. Smeyers-Verbeke, D.L. Massart, R.D. Marini, Ph. Hubert, P. Chiap, J. Crommen, W. Van de Wauw, J. De Beer, R. Cox, G. Mathieu, J.C. Reepmeyer, B. Voigt, O. Estevenon, A. Nicolas, A. Van Schepdael, E. Adams, J. Hoogmartens, J. Chromatogr. A 1010 (2003) 63.
[37] E. Hund, D.L. Massart, J. Smeyers-Verbeke, Trends Anal. Chem. 20 (2001) 394.
[38] V.J. Barwick, S.L.R. Ellison, Protocol for uncertainty evaluation from validation data, VAM Technical Report No. LGC/VAM/1998/088, Valid Analytical Measurement (VAM) Programme, Laboratory of the Government Chemist (LGC), Teddington, 2000; available online: http://www.vam.org.uk/publications/publicationdocs/315.pdf.