
1.02 Quality of Analytical Measurements: Statistical Methods for Internal Validation
M. C. Ortiz, L. A. Sarabia, M. S. Sánchez, and A. Herrero, University of Burgos, Burgos, Spain
ª 2009 Elsevier B.V. All rights reserved.

1.02.1 Introduction
1.02.2 Confidence and Tolerance Intervals
1.02.2.1 Confidence Interval
1.02.2.2 Confidence Interval on the Mean of a Normal Distribution
1.02.2.3 Confidence Interval on the Variance of a Normal Distribution
1.02.2.4 Confidence Interval on the Difference in Two Means
1.02.2.5 Confidence Interval on the Ratio of Variances of Two Normal Distributions
1.02.2.6 Confidence Interval on the Median
1.02.2.7 Joint Confidence Intervals
1.02.2.8 Tolerance Intervals
1.02.3 Hypothesis Test
1.02.3.1 Elements of a Hypothesis Test
1.02.3.2 Hypothesis Test on the Mean of a Normal Distribution
1.02.3.3 Hypothesis Test on the Variance of a Normal Distribution
1.02.3.4 Hypothesis Test on the Difference in Two Means
1.02.3.5 Test Based on Intervals
1.02.3.6 Hypothesis Test on the Variances of Two Normal Distributions
1.02.3.7 Hypothesis Test on the Comparison of Several Independent Variances
1.02.3.8 Goodness-of-Fit Tests: Normality Tests
1.02.4 One-Way Analysis of Variance
1.02.4.1 The Fixed Effects Model
1.02.4.2 Power of the ANOVA for the Fixed Effects Model
1.02.4.3 Uncertainty and Testing of the Estimated Parameters for the Fixed Effects Model
1.02.4.4 The Random Effects Model
1.02.4.5 Power of the ANOVA for the Random Effects Model
1.02.4.6 Confidence Intervals of the Estimated Parameters for the Random Effects Model
1.02.5 Statistical Inference and Validation
1.02.5.1 Trueness
1.02.5.2 Precision
1.02.5.3 Statistical Aspects of the Experiments to Determine Precision
1.02.5.4 Consistency Analysis and Incompatibility of Data
1.02.5.5 Accuracy
1.02.5.6 Ruggedness
References

Symbols

1 − α      confidence level
1 − β      power
CCα        limit of decision
CCβ        capability of detection
F_{ν1,ν2}  F distribution with ν1 and ν2 degrees of freedom
H0         null hypothesis
H1         alternative hypothesis
N(μ,σ)     normal distribution with mean μ and standard deviation σ
NID(μ,σ²)  independent random variables equally distributed as normal with mean μ and variance σ²
s          sample standard deviation
s²         sample variance
t_ν        Student's t distribution with ν degrees of freedom
X̄          sample mean
α          significance level, probability of type I error
β          probability of type II error
δ          bias (systematic error)
ε          random error
μ          mean
ν          degree(s) of freedom, d.f.
σ          standard deviation
σ²         variance
σ_R        reproducibility (as standard deviation)
σ_r        repeatability (as standard deviation)
χ²_ν       (chi-squared) distribution with ν degrees of freedom
1.02.1 Introduction

The set of operations to determine the value of an amount (measurand) suitably defined is called the measurement. The method of measurement is the sequence of operations that is used in the performance of measurements. It is documented in enough detail so that the measurement may be done without additional information.
Once a method is designed or selected, it is necessary to evaluate its characteristics of operation and to
identify the factors that can change these characteristics and to what extent they can change. If, in addition, the
method is developed to solve a particular analytical problem, it is necessary to verify that the method is fit for
purpose.1 This process of evaluation is called validation of the method. It implies the determination of several
parameters that characterize the performance of the method: capability of detection, selectivity, specificity,
ruggedness, and accuracy (trueness and precision). In any case, it is the measurements themselves that allow evaluation of the performance characteristics of the method and its fitness for purpose. In addition, in the use stage of the method, it is the obtained measurements that will be used to make decisions on the analyzed sample, for example whether the amount of an analyte fulfills a legal specification and consequently
whether the material from which the sample is taken is valid. Therefore, it is necessary to suitably model the
data that a method provides. In what follows we will consider that the data provided by the analytical method
are real numbers; other possibilities exist, for example, the count of bacteria or impacts in a detector take only
(discrete) natural values. In addition, sometimes, the data resulting from an analysis are qualitative, for example,
detection of the presence of an analyte in a sample.
With regard to the analytical measurement, it is admitted that the value (measurement), x, provided by the method of analysis consists of three terms, the true value of the parameter μ, a systematic error (bias) δ, and a random error ε with zero mean, in an additive way:

x = μ + δ + ε    (1)

All the possible measurements that a method can provide when analyzing a specific sample constitute the population of the measurements. Ideally, this supposes that limitless samples are available and that the method of analysis remains unaltered. Under these conditions, the model of the analytical method, Equation (1), is mathematically a random variable, X, with mathematical expectation μ + δ and variance equal to the variance of ε, that is, a random variable X of mean E(X) = μ + δ and variance V(X) = V(ε).
A random variable, and thus the analytical method, is described by its probability distribution F_X(x), that is, the probability that the method provides measurements less than or equal to x for any value x. Symbolically this is written as F_X(x) = pr{X ≤ x} for any real value x. In most of the applications, it is assumed that F_X(x) is differentiable, which implies, among other things, that none of the possible results of the method has positive probability, that is, the probability of obtaining exactly a specific value is zero. In the case of a differentiable probability distribution, the derivative of F_X(x) is the probability density function (pdf) f_X(x). Any function f(x) such that (1) it is positive, f(x) ≥ 0, and (2) the area under the function is 1, ∫_R f(x) dx = 1, is the pdf of a random variable. The probability that the random variable X takes values in the interval [a, b] is the area under the pdf over the interval [a, b], that is

pr{X ∈ [a,b]} = ∫_a^b f(x) dx    (2)

and the mean and variance of X are written as

E(X) = ∫_R x f(x) dx    (3)

V(X) = ∫_R (x − E(X))² f(x) dx    (4)

In general, the mean and the variance do not uniquely characterize a random variable and therefore the method of analysis. Figure 1 shows the pdf of four random variables with the same mean 6.00 and standard deviation 0.61.
These four distributions, uniform or rectangular (Figure 1(a)), triangular (Figure 1(b)), normal (Figure 1(c)), and Weibull (Figure 1(d)), are rather frequent in the scope of analytical determinations; they appear in the EURACHEM/CITAC Guide1 (Appendix E) and they are also used in metrology.2
If the only available information regarding a quantity X is the lower limit, l, and the upper limit, u, but the
quantity could be anywhere in between, with no idea of whether any part of the range is more likely than
another part, then a rectangular distribution in the interval [l,u] would be assigned to X. This is so because this is
the pdf that maximizes the ‘information entropy’ of Shannon, in other words the pdf that adequately
characterizes the incomplete knowledge about X. Frequently, in reference materials, the certified concentration is expressed in terms of a number and unqualified limits (e.g., 1000 ± 2 mg l−1). In this case, a rectangular distribution should be used (Figure 1(a)).
When the available information concerning X includes the knowledge that values close to c (between l and u)
are more likely than those near the bounds, the adequate distribution is a triangular one (Figure 1(b)), with the
maximum of its pdf in c.
If a good estimate, m, and associated standard uncertainty, s, are the only information available regarding X,
then, according to the principle of maximum entropy, a normal probability distribution N(m,s) (Figure 1(c))
would be assigned to X (remember that m and s may have been obtained from repeated applications of a
measurement method).
Finally, the Weibull distribution (Figure 1(d)) is very versatile; it can mimic the behavior of other
distributions such as the normal or exponential. It is adequate for the analysis of reliability of processes, and
in chemical analysis it is useful in describing the behavior of the figures of merit of a long-term procedure, for

Figure 1 Probability density functions of four random variables with mean 6 and variance 0.375. (a) Uniform in [4.94, 7.06].
(b) Symmetric triangular in [4.5, 7.5]. (c) Normal (6, 0.61). (d) Weibull with shape 1.103 and scale 0.7 shifted to give a mean of 6.

example, the distribution of the capability of detection CCβ3 or the determination of ammonia in water by UV–vis spectroscopy during 350 different days.4
In the four cases given in Figure 1, the probability of obtaining values between 5 and 7 has been computed
from Equation (2). For the uniform distribution (Figure 1(a)) this probability is 0.94, whereas for the
triangular distribution (Figure 1(b)) it is 0.88, for the normal distribution (Figure 1(c)) 0.90, and for the
Weibull distribution (Figure 1(d)) 0.93. Therefore, the proportion of the values that each distribution accumulates in the interval [5,7] orders the distributions as uniform, Weibull, normal, and triangular, although the triangular and normal distributions tend to give values symmetrically around the mean and the Weibull distribution does not. If another interval is considered, say [5.4,6.6], the distributions accumulate probabilities of 0.57, 0.64, 0.67, and 0.54, respectively, in which the differences between the values are greater than before and which, in addition, orders the distributions as normal, triangular, uniform, and Weibull.
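These interval probabilities follow directly from Equation (2). As a quick numerical illustration (not part of the original text), the following Python sketch reproduces the probabilities for the interval [5,7] with scipy.stats; the location of the Weibull is an assumption, chosen so that its mean equals 6 as stated in the caption of Figure 1.

from math import gamma
from scipy import stats

# The four distributions of Figure 1, all with mean 6 and standard deviation ~0.61
distributions = {
    "uniform": stats.uniform(loc=4.94, scale=7.06 - 4.94),        # rectangular on [4.94, 7.06]
    "triangular": stats.triang(c=0.5, loc=4.5, scale=7.5 - 4.5),  # symmetric triangular on [4.5, 7.5]
    "normal": stats.norm(loc=6.0, scale=0.61),                    # N(6, 0.61)
    "weibull": stats.weibull_min(c=1.103, scale=0.7,              # shape 1.103, scale 0.7,
                                 loc=6.0 - 0.7 * gamma(1 + 1 / 1.103)),  # shifted to give mean 6 (assumed)
}

for name, dist in distributions.items():
    p = dist.cdf(7) - dist.cdf(5)  # pr{5 < X < 7}, Equation (2)
    print(f"{name:10s}  mean = {dist.mean():.2f}  pr{{5 < X < 7}} = {p:.2f}")
# The printed probabilities are close to the values quoted above: 0.94, 0.88, 0.90, and 0.93.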
If for each of those variables, value b is determined so that there is a fixed probability, p, of obtaining values
below b (i.e., the value b such that p = pr{X < b} for each distribution X), the results of Table 1 are obtained. For
example, 5% of the times, a uniform distribution gives values less than b ¼ 5.05, less than 4.97 if it is a triangular
distribution, and so on. In the table, the extreme values among the four distributions for each value of p have
been marked, and great differences are observed caused as much by the form in which the values far from 6 are
distributed (i.e., very different for the normal, the triangular, or the uniform distribution) as by the asymmetry
of the Weibull distribution.
Therefore, the mean and variance of a random variable constitute very limited information on the values
provided by the random variable, unless additional information is at hand about the form of its density (pdf).
For example, if one knows that the distribution is uniform or symmetrical triangular or normal, the random
variable is completely characterized by its mean and variance.
In practice, the pdf of a method of analysis is unknown. We only have a finite number, n, of these
measurements, which are the results obtained when applying repeatedly (n times) the same method to the
same sample. These n measurements constitute a statistical sample of the random variable X defined by the
method of analysis.
Figure 2 shows 100 results obtained when applying four methods of analysis, named A, B, C, and D, to aliquot parts of a sample to determine an amount whose true value (unknown) is μ = 6. Clearly, the four methods behave differently.
From the experimental data, the (sample) mean and variance are computed as

x̄ = (Σ_{i=1}^{n} x_i) / n    (5)

s² = (Σ_{i=1}^{n} (x_i − x̄)²) / (n − 1)    (6)

x̄ and s² are estimates of the mean and variance of the distribution of X. These estimates with the data in Figure 2 are shown in Table 2.

Table 1 Values of b such that p = pr{X < b}, where X is one of the random variables in Figure 1

p       Uniform    Triangular   Normal     Weibull
0.01    4.96       4.71         4.58 (a)   5.34 (b)
0.05    5.05       4.97 (a)     4.99       5.37 (b)
0.50    6.00 (b)   6.00 (b)     6.00 (b)   5.83 (a)
0.95    6.95 (a)   7.03         7.01       7.22 (b)
0.99    7.04 (a)   7.29         7.42       8.12 (b)

(a) Minimum b among the four distributions.
(b) Maximum b among the four distributions.

Figure 2 Histograms of 100 measurements obtained with four different analytical methods on aliquot parts of a sample.
(a) method A; (b) method B; (c) method C; (d) method D.

Table 2 Some characteristics of the distributions in Figure 2

                         Method
                         A       B       C       D
Mean, x̄                 6.66    6.66    6.16    6.16
Variance, s²             0.25    1.26    1.26    0.25
pr{5 < X < 7}            0.76    0.54    0.63    0.94
pr{X < 6}                0.09    0.24    0.46    0.35
pr{5 < N(x̄,s) < 7}       0.75    0.55    0.62    0.94
pr{N(x̄,s) < 6}           0.09    0.27    0.44    0.37

According to the model of Equation (1), E(X) = μ + δ is estimated by x̄; that is, the mean estimates the true value μ plus the bias δ. The bias estimated for methods A and B is 0.66, whereas for methods C and D it is 0.16. The bias of a method is one of its performance characteristics and must be evaluated during the validation of the method. In fact, technical guides, for example that by the International Organization for Standardization (ISO), state that a method fulfills trueness better the smaller its bias is. To estimate the bias, it is necessary to have samples with known concentration μ (e.g., certified material, spiked samples).
The value of the variance is independent of the true content, μ, of the sample. For this reason, to estimate the variance, it is only necessary to have replicated measurements on aliquot parts of the same sample. Table 2 shows that methods B and C have the same variance, 1.26, which is 5 times greater than that of methods A and D, 0.25. The dispersion of the data that a method provides is the precision of the method and constitutes another performance characteristic to be determined in the validation of the method. In agreement with model (1), a measure of the dispersion is the variance V(X), which is estimated by means of s².
On some occasions, for evaluating trueness and precision, it is more descriptive to use statistics other than the mean and variance. For example, when the distribution is rather asymmetric, as in Figure 1(d), it is more reasonable to use the median than the mean. The median is the value at which the distribution accumulates 50% of the probability, 5.83 for the pdf in Figure 1(d) and 6.00 for the other three distributions, which are symmetric around their mean. In practice, it is frequent to find anomalous data (outliers) that influence the mean and, above all, the variance, which is improperly increased; in these cases, it is advisable to use robust estimates of centrality and dispersion.5–7 In Chapter 1.07 of the present book, there is a detailed description of these robust procedures.

Figure 2 and Table 2 show that the two characteristics, trueness and precision, are independent in the sense that a method with better trueness (small bias), cases C and D, can be more, case D, or less, case C, precise. Analogously, A and B have an appreciable bias but A is more precise than B. A method is said to be accurate when it is precise and fulfills trueness.
The histograms are an estimate of the pdf and allow evaluation of the performance of each method in a more detailed way than when considering only trueness and precision. For example, the probability of obtaining values in any interval can be estimated with the histogram. The third row in Table 2 shows the estimated frequencies for the interval [5,7]. Method D (better trueness, better precision) provides 94% of the values in the interval, whereas method B (worse trueness and precision) provides only 54% of the values in the interval. Trueness and precision interact, according to the data in Table 2: the effect of increasing the precision when the bias is high (from B to A) is an increase of 22% in the proportion of results in the interval [5,7], whereas when the bias is 'small' (from C to D) the increase is about 31%. This interaction should be taken into account when optimizing a process and also in the ruggedness analysis, which is another performance characteristic to be validated according to most of the guides. As can be seen in the last row of Table 2, if the method that provides more results below 6 is needed, C would be the method selected.
The previous analysis shows the usefulness of knowing the pdf of the results of a method of analysis. As in practice we have only a limited number of results, two basic strategies are possible to estimate it: (1) to demonstrate that the experimental data are compatible with a known distribution (e.g., normal) and use the corresponding pdf; (2) to estimate the pdf by a data-driven technique based on a computer-intensive method such as the kernel method8 described in Section 1.03.2.4 of the present book, or by using other methods such as adaptive or penalized likelihood.9,10 The data of Figure 2 can be adequately modeled by a normal distribution, according to the results of four tests (chi-square, Shapiro–Wilk, skewness, kurtosis) (see Section 1.02.3.8); the procedure will be detailed later. The last two rows in Table 2 show the probabilities of obtaining values in the interval [5,7] or less than 6 with the fitted normal distribution. When comparing these values with those computed from the empirical histograms (compare rows 3 and 5, and rows 4 and 6), evidently, there are no appreciable differences.
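As a check (not from the original text), the last two rows of Table 2 can be reproduced from the fitted normal distributions alone, using only the sample means and variances of the first two rows; the sketch below assumes scipy is available.

from scipy import stats

# Sample mean and variance of each method, taken from the first two rows of Table 2
table2 = {"A": (6.66, 0.25), "B": (6.66, 1.26), "C": (6.16, 1.26), "D": (6.16, 0.25)}

for method, (mean, variance) in table2.items():
    fitted = stats.norm(loc=mean, scale=variance ** 0.5)   # N(x̄, s)
    p_interval = fitted.cdf(7) - fitted.cdf(5)             # pr{5 < N(x̄, s) < 7}
    p_below_6 = fitted.cdf(6)                              # pr{N(x̄, s) < 6}
    print(f"Method {method}: {p_interval:.2f}  {p_below_6:.2f}")
# Output is close to rows 5 and 6 of Table 2, with differences only in the last rounded digit.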
In the validation of an analytical method and during its later use, methodological strategies of a statistical type are needed to make decisions from the available experimental data. The knowledge of these strategies provides a way of thinking and acting that, subordinated to the chemical knowledge, makes objective both the analytical results and their comparison with those of other researchers and/or other analytical methods.
Ultimately, a good method of analysis is a serious attempt to come close to the true value of the measurement. The difference between the true value, always unknown, and the one experimentally obtained can be estimated. For this reason, the result of a measurement has to be accompanied by an evaluation of its (un)certainty or degree of reliability. This is done by means of a confidence interval. When it is required to establish the quality of an analytical method, its capability of detection, precision, etc. have to be compared with those corresponding to other methods. This is formalized with a hypothesis test. Confidence intervals and hypothesis tests are the basic tools in the validation of analytical methods.
In this introduction, the word sample has been used with two different meanings. Usually, there is no confusion
because the context allows one to distinguish whether it is a sample in the statistical or chemical sense.
In chemistry, according to the International Union of Pure and Applied Chemistry (IUPAC) (Section 18.3.2 of Inczédy et al.11), 'sample' should be used only when it refers to a part selected from a larger amount of material. This meaning coincides with that of a statistical sample and implies the existence of sampling error, that is, error caused by the fact that the sample can be more or less representative of the content in the material. For example, suppose that we want to measure the amount of pesticide that remains in the soil of an arable plot after a certain time. For this, we take several samples 'representative' of the soil of the parcel (statistical sampling) and this introduces an uncertainty in the results characterized by a (theoretical) variance σ²_s. Afterward, the quantity of pesticide in each chemical sample is determined by an analytical method, which has its own uncertainty, characterized by σ²_m, in such a way that the uncertainty in the quantity of pesticide in the parcel is σ²_s + σ²_m provided that the method gives results independent of the location of the sample. Sometimes, when evaluating whether a method is adequate for a task, the sampling error can be an important part of the uncertainty in the result and, of course, should be taken into account to plan the experimentation. The modeling of sampling errors is studied in Chapter 1.01.

When the sampling error is negligible, for example when a portion is taken from a homogeneous solution, IUPAC recommends using terms such as test portion, aliquot, or specimen.

1.02.2 Confidence and Tolerance Intervals

There are important questions when evaluating a method, for example, what is the maximum value that it
provides? In fact, given the random character of the results, the question cannot be answered with just a number.
In order to include the degree of certainty in the answer, the question should be reformulated: What is the
maximum value, U, that will be obtained 95% of the times that the method is used? The answer is the tolerance
interval and to build it, the probability distribution must be known. Let us suppose that we know that it is a
N(,), then a possible answer is U ¼  þ z0.05 because then the probability that the analytical method gives
values greater than U is pr{method > U} ¼ pr{N(,) >  þ z0.05}, which, according to the result in Appendix,
is equal to pr{N(0,1) > z0.05} ¼ 0.05.
For any 100 × (1 − α)%, the value

U = μ + z_α σ    (7)

is the maximum value provided by the method, with a probability α that the aforementioned assertion is false. Analogously, the results that will be obtained 100 × (1 − α)% of the times will be above the following value, L:

L = μ − z_α σ    (8)

Then, the statement 'the method provides values greater than L' will be false 100 × α% of the times.
Finally, the interval [L,U] contains 100 × (1 − α)% of the values provided by the method

[L, U] = [μ − z_{α/2} σ, μ + z_{α/2} σ]    (9)

or, in other words, the statement that the method gives values between L and U is false 100 × α% of the times.

1.02.2.1 Confidence Interval


We have already noted that estimation of solely the mean, x, and the variance, s2, from n independent results
provides very limited information on the performance of the method. The objective now is to make affirmations
of the type ‘in the sample, the amount of the analyte,  (unknown), is between L and U ( P [L,U]) with a
certain probability that the statement is true’. In general, to obtain a confidence interval for a random variable,
X, from a sample x1,x2,. . .,xn consists of obtaining two functions l(x1,x2,. . .,xn) and u(x1,x2,. . .,xn) such that
prfX P½l; ug ¼ prfl  X  ug ¼ 1 –  ð10Þ
1   is the confidence level and  is the significance level, that is to say, the statement that the value of X is
between l and u will be false 100  % of the times. In the following, this idea will be specified in some
interesting cases.

1.02.2.2 Confidence Interval on the Mean of a Normal Distribution


Case 1: Suppose that we have a random variable that follows a normal distribution with known variance. This
will be the case, for example, of using an already validated method of analysis. In this case, we know that " in
Equation (1) is normally distributed and its variance. Taking into account the pffiffiffiproperties of the normal
distribution (see Appendix), the sample mean, X, is a random variable N ð; = nÞ; thus, by Equation (10),
the following holds:
 
 
pr  – z=2 pffiffiffi  X   þ z=2 pffiffiffi ¼ 1 –  ð11Þ
n n

that is, 100 × (1 − α)% of the values of the sample mean are in the interval in Equation (11). A simple algebraic manipulation (subtract μ and X̄, and multiply by −1) gives

pr{X̄ − z_{α/2} σ/√n ≤ μ ≤ X̄ + z_{α/2} σ/√n} = 1 − α    (12)

Therefore, according to Equation (10), the confidence interval on the mean that is obtained from Equation (12) is

[X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n]    (13)

Analogously, the confidence intervals at the 100 × (1 − α)% level for the maximum and minimum values of the mean are computed from

pr{μ ≤ X̄ + z_α σ/√n} = 1 − α    (14)

pr{X̄ − z_α σ/√n ≤ μ} = 1 − α    (15)

When measuring n aliquot parts of a test sample, we obtain n values x1, x2, ..., xn; the sample mean x̄ is the value that the random variable X̄ takes and an estimate of the quantity μ.
Example 1:
Suppose that the analytical method follows a N(μ,4) and we have a sample of size 10 made up of the values 98.87, 92.54, 99.42, 105.66, 98.70, 97.23, 98.44, 103.73, 94.45, and 101.08.
The mean is 99.01 and, thus, the interval at the 95% confidence level for this sample is [99.01 − 1.96 × 4/√10, 99.01 + 1.96 × 4/√10] = [96.54,101.48].

When considering different samples of size 10, different intervals will be obtained at the same 95% confidence level. The endpoints of these intervals are nonrandom values, and the unknown mean value, which is also a specific value, will or will not belong to the interval. Therefore, the affirmation 'the interval contains the mean' is a deterministic assertion that is true or false for each of the intervals. What one knows is that it is true for 100 × (1 − α)% of those intervals. In our case, as 95% of the constructed intervals will contain the true value, we say, with a confidence level of 95%, that the interval [96.54,101.48] contains μ. This is the interpretation with the frequentist approach adopted in this chapter, that is to say that the information on random variables is obtained by means of samples of them and that the parameters to be estimated are not known but are fixed amounts (e.g., the amount of analyte in a sample, μ, is estimated by the measurement results obtained by analyzing it n times). With a Bayesian approach to the problem, a probability distribution is attributed to the amount of analyte μ and, once an interval of interest [a,b] is fixed, the 'a priori' distribution of μ, the experimental results, and Bayes' theorem are used to calculate the 'a posteriori' probability that μ belongs to the interval [a,b]. It is shown that, although in most practical cases the uncertainty intervals obtained from repeated measurements using either theory may be similar, their interpretation is completely different. In Chapter 1.08 of this book, Bayesian statistical techniques are described. The works by Lira and Wöger12 and Zech13 are devoted to comparing both approaches from the point of view of the experimental data and their uncertainty.
Case 2: Suppose a normally distributed random variable with unknown variance that must be estimated together with the mean from n experimental data. The confidence interval is computed as in Case 1, but now the standardized sample mean (X̄ − μ)/(S/√n) follows a Student's t distribution with n − 1 degrees of freedom (d.f.) (see Appendix); thus, the interval at the 100 × (1 − α)% confidence level is given from

pr{X̄ − t_{α/2,ν} s/√n ≤ μ ≤ X̄ + t_{α/2,ν} s/√n} = 1 − α    (16)

where t_{α/2,ν} is the upper percentage point of the Student t distribution with ν = n − 1 d.f. and s is the standard deviation computed with the sample. Analogously, the one-sided intervals at the 100 × (1 − α)% confidence level come from

pr{μ ≤ X̄ + t_{α,ν} s/√n} = 1 − α    (17)

pr{X̄ − t_{α,ν} s/√n ≤ μ} = 1 − α    (18)

Example 2:
Suppose that the distribution of the analytical method is normal but its standard deviation is not known. With the data of Example 1, the sample standard deviation, s, is computed as 3.91. As t_{0.025,9} = 2.262 (see Appendix), the confidence interval at the 95% level is [99.01 − 2.26 × 1.24, 99.01 + 2.26 × 1.24] = [96.21,101.81].
The 95% confidence interval on the minimum of the mean is made, according to Equation (18), by all the values greater than 96.74 = 99.01 − 1.83 × 1.24. The corresponding interval on the maximum will be made by the values less than 101.28.
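The two intervals of Examples 1 and 2 can be reproduced with a few lines of Python (a sketch assuming numpy and scipy; the ten values are those of Example 1):

import numpy as np
from scipy import stats

x = np.array([98.87, 92.54, 99.42, 105.66, 98.70, 97.23, 98.44, 103.73, 94.45, 101.08])
n, mean, s, alpha = len(x), x.mean(), x.std(ddof=1), 0.05

# Example 1: known sigma = 4, Equation (13)
z = stats.norm.ppf(1 - alpha / 2)
half_z = z * 4 / np.sqrt(n)
print(f"z interval: [{mean - half_z:.2f}, {mean + half_z:.2f}]")   # ~[96.53, 101.49]

# Example 2: sigma estimated by s, Equation (16)
t = stats.t.ppf(1 - alpha / 2, df=n - 1)
half_t = t * s / np.sqrt(n)
print(f"t interval: [{mean - half_t:.2f}, {mean + half_t:.2f}]")   # ~[96.22, 101.81]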
The length of the confidence intervals from Equations (12)–(15) tends toward zero when the sample size tends to infinity. This permits the computation of the sample size needed to obtain an interval of given length, d. It will suffice to consider d/2 = z_{α/2}σ/√n and take as n the nearest integer greater than (2z_{α/2}σ/d)². For example, if we want a 95% confidence interval with length, d, less than 2, in the hypothesis of Example 1, we will need a sample size greater than or equal to 62.
The same argument can be applied when the standard deviation is unknown. However, in this case, to compute n by (2t_{α/2,ν}s/d)² it is necessary to have an initial estimation of s, which, in general, is obtained in a pilot test of size n′, in such a way that in the previous expression the d.f., ν, are n′ − 1. An alternative is to state the length of the interval in standard deviation units (remember that the standard deviation is unknown). For instance, in Example 2, if we want d = 0.5s, we will need a sample size greater than (4z_{α/2})² = 61.5; note the substitution of t_{α/2,ν} by z_{α/2}, which is mandatory because we do not have the sample size needed to compute t_{α/2,ν}, which is precisely what we want to estimate.
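A sketch of these two sample-size computations (rounding up to the next integer is assumed):

import math
from scipy import stats

alpha, sigma, d = 0.05, 4.0, 2.0
z = stats.norm.ppf(1 - alpha / 2)

n_known_sigma = math.ceil((2 * z * sigma / d) ** 2)   # interval of length d with known sigma
print(n_known_sigma)                                  # 62

# Length stated in standard deviation units, d = 0.5 s  ->  n > (4 z_{alpha/2})^2
n_unknown_sigma = math.ceil((4 * z) ** 2)
print(n_unknown_sigma)                                # 62 (about 61.5 before rounding, as in the text)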

1.02.2.3 Confidence Interval on the Variance of a Normal Distribution


In this case, the data come from a N(,) distribution with  and  unknown, and we have a sample with values
x1,x2,. . .,xn. The distribution of the random variable ‘sample variance’ S2 is related to the chi-square distribu-
tion, w2 (see Appendix). As a consequence, the confidence interval at 100  (1  )% level for the variance 2 is
obtained from
( )
ðn – 1ÞS 2 2 ðn – 1ÞS 2
pr    ¼ 1– ð19Þ
w2=2; w21 – =2;

where w2=2; is the critical value of a chi-square distribution with  ¼ n  1 d.f. at level /2. As in the previous
case for the sample mean, we should distinguish between the random variable sample variance S2 and one of its
concrete values, s2 computed with Equation (6), that takes this variable when we have the sample x1,x2,. . .,xn.
The intervals on the maximum and minimum of the variance at the 100 × (1 − α)% confidence level are obtained from Equations (20) and (21), respectively.

pr{σ² ≤ (n − 1)S²/χ²_{1−α,ν}} = 1 − α    (20)

pr{(n − 1)S²/χ²_{α,ν} ≤ σ²} = 1 − α    (21)

Example 3:
Knowing that the n = 10 data of Example 2 come from a normal distribution with both mean and variance unknown, the 95% confidence interval on σ² is found from Equation (19) as [7.22,50.83] because s² = 15.25, χ²_{0.025,9} = 19.02, and χ²_{0.975,9} = 2.70. If the analyst is interested in obtaining a confidence interval for the maximum variance, the 95% upper confidence interval is found from Equation (20) as [0,41.22] because χ²_{0.95,9} = 3.33, that is, the upper bound for the variance is 41.22 with a probability of error equal to 0.05. To obtain confidence intervals for the standard deviation, it suffices to take the square root of the aforementioned intervals because this operation is a monotonically increasing transformation of the values; therefore, the intervals at the 95% confidence level for the standard deviation are [2.69,7.13] and [0,6.42].
The size, n, of the sample needed so that s²/σ² is between 1 − k and 1 + k is given by the nearest integer greater than 1 + (1/2)[z_{α/2}(√(k + 1) + 1)/k]². For example, for k = 0.5, such that the length of the confidence interval verifies 0.5 < s²/σ² < 1.5, we need n = 40 data (at least). Just for comparative purposes, we will admit that with the sample of size 40 we obtain the same variance s² = 15.25, χ²_{0.025,39} = 58.12, and χ²_{0.975,39} = 23.65; hence, the interval at the 95% confidence level is [10.23,25.15], which verifies the required specifications.
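The chi-square computations of Example 3 can be verified with the following sketch (assuming scipy; χ²_{α,ν} denotes, as above, the point exceeded with probability α):

from scipy import stats

n, s2, alpha = 10, 15.25, 0.05
nu = n - 1

chi2_upper = stats.chi2.ppf(1 - alpha / 2, nu)        # chi2_{0.025,9} = 19.02
chi2_lower = stats.chi2.ppf(alpha / 2, nu)            # chi2_{0.975,9} = 2.70
print([nu * s2 / chi2_upper, nu * s2 / chi2_lower])   # two-sided interval, ~[7.22, 50.83]

chi2_alpha = stats.chi2.ppf(alpha, nu)                # chi2_{0.95,9} = 3.33
print(nu * s2 / chi2_alpha)                           # one-sided upper bound on sigma^2, ~41.2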

1.02.2.4 Confidence Interval on the Difference in Two Means


Case 1. Known variances: Consider two independent random variables distributed as N1(μ1,σ1) and N2(μ2,σ2) with unknown means and known variances σ1² and σ2². We wish to find a 100 × (1 − α)% confidence interval on the difference in means μ1 − μ2. Let x11, x12, ..., x1n1 be a random sample of n1 observations from N1 and x21, x22, ..., x2n2 be a random sample of n2 observations from N2. The 100 × (1 − α)% confidence interval on μ1 − μ2 is obtained from the equation

pr{X̄1 − X̄2 − z_{α/2} √(σ1²/n1 + σ2²/n2) ≤ μ1 − μ2 ≤ X̄1 − X̄2 + z_{α/2} √(σ1²/n1 + σ2²/n2)} = 1 − α    (22)

where X̄1 and X̄2 are the random variables of the sample mean, which take the values x̄1 and x̄2. The expressions analogous to Equations (14) and (15) for the one-sided intervals are obvious.
Case 2. Unknown variances: The approach to this question is similar to the previous case, but here even the variances σ1² and σ2² are unknown. However, it can be reasonable to assume that both variances are equal, σ1² = σ2² = σ², and that the differences observed in their estimates with the data of both samples are not significant. Later, in the section dedicated to hypothesis tests, the methodology to decide about this question will be explained. An estimate of the common variance σ² is given by the pooled sample variance, which is an arithmetic average of both variances weighted by the corresponding d.f.,

s_p² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)    (23)

The 100 × (1 − α)% confidence interval is obtained from the following equation:

pr{X̄1 − X̄2 − t_{α/2,ν} s_p √(1/n1 + 1/n2) ≤ μ1 − μ2 ≤ X̄1 − X̄2 + t_{α/2,ν} s_p √(1/n1 + 1/n2)} = 1 − α    (24)

where ν = n1 + n2 − 2 are the d.f. of the Student's t distribution. The one-sided intervals at the 100 × (1 − α)% confidence level have an obvious expression deduced from Equation (24) by substituting t_{α/2,ν} for t_{α,ν}. If a fixed length is desired for the confidence interval, the computation explained in Section 1.02.2.2 can be immediately adapted to obtain the needed sample size.

Example 4:
We want to verify that a substance is sufficiently stable to remain unchanged in composition when it is stored for a month. Two series of measurements (n1 = n2 = 8) were made before and after a storage period. The results were x̄1 = 90.8, s1² = 3.89 and x̄2 = 92.7, s2² = 4.02, respectively. Compute a 95% confidence interval on the difference of means.
The two-sided interval when assuming equal variances (s_p² = 3.96) (Equation (24)) is (90.8 − 92.7) ± 2.1448 × 1.99 √(1/8 + 1/8) = [−4.034, 0.234]. As zero belongs to the interval, we can conclude that the substance is stable at the 95% confidence level.
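A short sketch of Example 4 with the pooled variance of Equation (23) (assuming scipy):

import math
from scipy import stats

n1, mean1, var1 = 8, 90.8, 3.89
n2, mean2, var2 = 8, 92.7, 4.02
alpha = 0.05

sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)   # pooled variance, ~3.96
t = stats.t.ppf(1 - alpha / 2, df=n1 + n2 - 2)              # t_{0.025,14} = 2.1448
half = t * math.sqrt(sp2 * (1 / n1 + 1 / n2))
print([(mean1 - mean2) - half, (mean1 - mean2) + half])     # ~[-4.03, 0.23]; zero is inside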

When the assumption σ1² = σ2² is not reasonable, we can still obtain an interval on the difference μ1 − μ2 by using the fact that the statistic [X̄1 − X̄2 − (μ1 − μ2)]/√(s1²/n1 + s2²/n2) is distributed approximately as a t with d.f. given by

ν = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)]    (25)

The 100 × (1 − α)% confidence interval is obtained from the following equation:

pr{X̄1 − X̄2 − t_{α/2,ν} √(s1²/n1 + s2²/n2) ≤ μ1 − μ2 ≤ X̄1 − X̄2 + t_{α/2,ν} √(s1²/n1 + s2²/n2)} = 1 − α    (26)

Example 5:
We want to compute a confidence interval on the difference of two means with unknown and nonequal variances, with results that come from an experiment carried out with four aliquot samples by two different analysts. The first analyst obtains x̄1 = 3.285, whereas the second one obtains x̄2 = 3.257. The variances were s1² = 3.33 × 10⁻⁵ and s2² = 9.17 × 10⁻⁵, respectively. Assuming that σ1² ≠ σ2², Equation (25) gives ν = 4.9, so the d.f. to apply Equation (26) are 5 and t_{0.025,5} = 2.571. Thus, the 95% confidence interval is (3.285 − 3.257) ± 2.571 √((3.33 × 10⁻⁵/4) + (9.17 × 10⁻⁵/4)) = [0.014, 0.042]. That is, at the 95% confidence level, the two analysts provide unequal measurements because zero is not in the interval.
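The same result is obtained with the following sketch of the interval with the approximate d.f. of Equation (25); the degrees of freedom are rounded to 5 as in the example:

import math
from scipy import stats

n1, mean1, var1 = 4, 3.285, 3.33e-5
n2, mean2, var2 = 4, 3.257, 9.17e-5
alpha = 0.05

a, b = var1 / n1, var2 / n2
nu = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))   # Equation (25), ~4.9
t = stats.t.ppf(1 - alpha / 2, df=round(nu))                  # t_{0.025,5} = 2.571
half = t * math.sqrt(a + b)
print(nu, [(mean1 - mean2) - half, (mean1 - mean2) + half])   # ~[0.014, 0.042]; zero is outside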

The confidence intervals for the maximum and the minimum are obtained by considering the last or the first term, respectively, in Equation (26) and replacing t_{α/2,ν} by t_{α,ν}.
Case 3. Confidence interval for paired samples: Sometimes we are interested in evaluating an effect (e.g., the reduction of a polluting agent in an industrial spill by means of a catalyst) but it is impossible to have two homogeneous populations of samples without and with treatment to obtain the two means of the recoveries, because the amount of polluting agent is not controllable. In these cases, the solution is to determine the polluting agent before and after applying the procedure to the same spill. The difference between both determinations is a measure of the effect of the catalyst. The (statistical) samples obtained in this way are known as paired samples. Formally, the problem consists of having two paired samples of size n, x11, x12, ..., x1n and x21, x22, ..., x2n, and computing the differences between each pair of data, di = x1i − x2i, i = 1, 2, ..., n. If these differences follow a normal distribution, the 100 × (1 − α)% confidence interval on their mean, μd, is obtained from

pr{d̄ − t_{α/2,ν} sd/√n ≤ μd ≤ d̄ + t_{α/2,ν} sd/√n} = 1 − α    (27)

where d̄ and sd are the mean and standard deviation of the differences di and ν = n − 1 are the d.f.

1.02.2.5 Confidence Interval on the Ratio of Variances of Two Normal Distributions


This section approaches the question of giving a confidence interval on the ratio σ1²/σ2² of the variances of two distributions N1(μ1,σ1) and N2(μ2,σ2) with unknown means and variances. Let x11, x12, ..., x1n1 be a random sample of n1 observations from N1 and x21, x22, ..., x2n2 be a random sample of n2 observations from N2. The 100 × (1 − α)% confidence interval on the ratio of variances is given by the following equation:

pr{F_{1−α/2;ν1,ν2} (S1²/S2²) ≤ σ1²/σ2² ≤ F_{α/2;ν1,ν2} (S1²/S2²)} = 1 − α    (28)

where F_{1−α/2;ν1,ν2} and F_{α/2;ν1,ν2} are the critical values of an F distribution with ν1 = n2 − 1 d.f. in the numerator and ν2 = n1 − 1 d.f. in the denominator. The Appendix contains a description of the F distribution.
We can also compute one-sided confidence intervals. The 100 × (1 − α)% upper or lower confidence limit on σ1²/σ2² is obtained from Equations (29) and (30), respectively:

pr{σ1²/σ2² ≤ F_{α;ν1,ν2} (S1²/S2²)} = 1 − α    (29)

pr{F_{1−α;ν1,ν2} (S1²/S2²) ≤ σ1²/σ2²} = 1 − α    (30)
Example 6:
In this example, we compute a two-sided 95% confidence interval for the ratio of the variances in Example 4. The resulting interval is [0.20 × (3.89/4.02), 4.99 × (3.89/4.02)] = [0.19,4.83]. As 1 belongs to this interval, we can admit that both variances are equal.

1.02.2.6 Confidence Interval on the Median


This case is different from the previous cases, because the confidence interval is a 'distribution-free' interval, that is, there is no distribution assumed for the data. As is known, a percentile (pct) is the value xpct such that 100 × pct% of the values are less than or equal to xpct. It is possible to compute confidence intervals on any pct, but for values of pct near one or zero we need sample sizes, n, that are very large because the values n × pct and n × (1 − pct) must be greater than 5. For the median (pct = 0.5), it suffices to consider samples of size 10 or more.
The fundamentals of these confidence intervals are based on the binomial distribution, whose details are outside the scope of this chapter and can be found in Sprent.14 We use the data of Example 1 to show step by step how a 100 × (1 − α)% confidence interval on the median is computed. The procedure consists of three steps:
1. To sort the data in ascending order. In our case, 92.54, 94.45, 97.23, 98.44, 98.70, 98.87, 99.42, 101.08, 103.73, and 105.66. The rank of each datum is the position that it occupies in the sorted list; for example, the rank of 98.44 is four.
2. To calculate the rank, rl, of the value that will be the lower endpoint of the interval. It is the nearest integer less than (1/2)(n − z_{α/2}√n + 1). In our case, this value is 0.5(10 − 1.96√10 + 1) = 2.40, thus rl = 2.
3. To calculate the rank, ru, of the value that will be the upper endpoint of the interval. It is the nearest integer greater than (1/2)(n + z_{α/2}√n − 1). In our case, this value is 0.5(10 + 1.96√10 − 1) = 7.60, then ru = 8.
Hence, the 95% confidence interval on the median is made by the values that are between positions 2 and 8, that is, [94.45,101.08].
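The three steps can be written as a short sketch (assuming scipy for the normal percentile; ranks are 1-based as in the text):

import math
from scipy import stats

x = sorted([98.87, 92.54, 99.42, 105.66, 98.70, 97.23, 98.44, 103.73, 94.45, 101.08])
n, alpha = len(x), 0.05
z = stats.norm.ppf(1 - alpha / 2)

r_lower = math.floor(0.5 * (n - z * math.sqrt(n) + 1))     # nearest integer below 2.40 -> 2
r_upper = math.ceil(0.5 * (n + z * math.sqrt(n) - 1))      # nearest integer above 7.60 -> 8
print(r_lower, r_upper, [x[r_lower - 1], x[r_upper - 1]])  # [94.45, 101.08]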

1.02.2.7 Joint Confidence Intervals


Sometimes it is necessary to compute confidence intervals for several parameters while maintaining a 100 × (1 − α)% confidence that all of them contain the true value of the corresponding parameter. For example, if two parameters are statistically independent, we can assure a 95% joint level of confidence by taking separately the corresponding 100 × (1 − α)^{1/2}% confidence intervals because (1 − α)^{1/2} × (1 − α)^{1/2} = (1 − α). In general, if there are k parameters, we will compute the 100 × (1 − α)^{1/k}% confidence interval for each of them.
However, if the sample statistics used are not independent, the above computation is not valid. The Bonferroni inequality states that the probability that all the affirmations are true is greater than or equal to 1 − Σ_{i=1}^{k} α_i, where 1 − α_i is the confidence level considered for the ith interval (usually α_i = α/k). For example, if a joint 90% confidence interval is needed for the means of two distributions, according to the Bonferroni inequality α_i = α/2 = 0.10/2 = 0.05; thus, each should be the corresponding 95% confidence interval.

1.02.2.8 Tolerance Intervals


In the introduction to Section 1.02.2, the tolerance intervals of a normal distribution have been calculated knowing its mean and variance. Remember that the confidence interval [l,u] contains 100 × (1 − α)% of the values of the distribution or, equivalently, pr{X ∉ [l,u]} = α. Actually, the values of the parameters that define the probability distribution are unknown; this uncertainty should be transferred into the endpoints of the interval. There are several types of tolerance regions, but in this chapter we will restrict ourselves to two common cases.
Case 1. β-content tolerance interval: Given a random variable X, an interval [l,u] is a β-content tolerance interval at the confidence level γ if the following is fulfilled:

pr{pr{X ∈ [l,u]} ≥ β} ≥ γ    (31)

In words, [l,u] contains at least 100 × β% of the X values with γ confidence level. For the case of an analytical method, this is to say that we have to determine, based on a sample of size n, the interval that will contain 95% (β = 0.95) of the results and this assertion must be true 90% of the times (γ = 0.90). Evidently, the β-content tolerance intervals can be one-sided, which means that the procedure will provide 95% of its results above l (respectively, below u) 90% of the times. We leave to the reader the corresponding formal definitions.
One-sided and two-sided β-content tolerance intervals can be computed either by controlling the center or by controlling the tails, and for both continuous and discrete random variables (a review can be seen in Patel15). Here we will limit ourselves to the case of a normally distributed X with unknown mean and variance, of which we have a sample of size n, with which the mean x̄ and standard deviation s are estimated. We want to obtain a two-sided β-content tolerance interval controlling the center, that is, an interval such that

pr{pr{X ∈ [x̄ − ks, x̄ + ks]} ≥ β} ≥ γ    (32)

To determine k, several approximations have been reported; consult Patel15 for a discussion on them. The approach by Wald and Wolfowitz16 is based on determining k1 such that

pr{N(0,1) ≤ 1/√n + k1} − pr{N(0,1) ≤ 1/√n − k1} = β    (33)

Therefore

k = k1 √((n − 1)/χ²_{γ,n−1})    (34)

χ²_{γ,n−1} is the point exceeded with probability γ when using the chi-square distribution with n − 1 d.f.

Example 7:
With the data in Example 1, x̄ = 99.01, s = 3.91, k1 = 2.054, and χ²_{0.95,9} = 3.33; thus, according to Equation (34), k = 3.377 and, as a consequence, the interval [99.01 − 3.38 × 3.91, 99.01 + 3.38 × 3.91] = [85.81,112.23] contains 95% of the results of the method 95% of the times that the procedure is repeated with a sample of size 10. Remember the following points: (1) This tolerance interval is for individual results and not for mean values as the confidence intervals. (2) The standard deviation of the mean is estimated as s/√n = 1.24, whereas the standard deviation of the individual results of the method is estimated to be 3.91. (3) The length of the β-content tolerance interval does not tend toward zero when the sample size increases, as it does for the confidence intervals. Now, the value of k in Equation (34) tends to z_{(1−β)/2}, which gives the theoretical interval that, in our example, would be [91.35,106.67] with z_{0.025} = 1.96.
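A sketch of the Wald–Wolfowitz computation for Example 7 (n = 10, β = γ = 0.95); solving Equation (33) numerically with a root finder is an implementation choice, not part of the original procedure:

import math
from scipy import stats
from scipy.optimize import brentq

n, beta, gamma_level = 10, 0.95, 0.95
x_mean, s = 99.01, 3.91

# Equation (33): k1 such that Phi(1/sqrt(n) + k1) - Phi(1/sqrt(n) - k1) = beta
def eq33(k1):
    return (stats.norm.cdf(1 / math.sqrt(n) + k1)
            - stats.norm.cdf(1 / math.sqrt(n) - k1) - beta)

k1 = brentq(eq33, 1.0, 5.0)                            # ~2.054
chi2_gamma = stats.chi2.ppf(1 - gamma_level, n - 1)    # point exceeded with probability gamma, ~3.33
k = k1 * math.sqrt((n - 1) / chi2_gamma)               # Equation (34), ~3.38
print(k, [x_mean - k * s, x_mean + k * s])             # ~[85.8, 112.2]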
Case 2. The interval [l,u] is called a β-expectation tolerance interval if

E{pr{X ∈ [l,u]}} = β    (35)

Unlike the β-content tolerance interval, condition (35) only demands that, on average, the probability that the random variable takes values between l and u is β.
As in the previous case, we limit ourselves to obtaining intervals of the form [x̄ − ks, x̄ + ks]. When the distribution of the random variable is normal and we have a sample of size n, the solution was obtained for the first time by Wilks17 and is

k = t_{(1−β)/2,ν} √((n + 1)/n)    (36)

where t_{(1−β)/2,ν} is the upper (1 − β)/2 percentage point of the t distribution with ν = n − 1 d.f.
For the same data as before, the β-expectation tolerance interval at 95% is [99.01 − 2.37 × 3.91, 99.01 + 2.37 × 3.91] = [89.73,108.29] because t_{0.025,9} = 2.262. This interval is shorter than the β-content tolerance interval because it only assures the mean of the probabilities that the interval contains the value of the random variable X. In fact, the interval [89.73,108.29] contains 95% of the values of X only 64% of the times, a conclusion drawn by applying Equation (32) with k = 2.37. Also, note that when the sample size tends to infinity, the value of k in Equation (36) tends toward z_{(1−β)/2}.
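For comparison, a minimal sketch of the Wilks factor of Equation (36) with the same data (assuming scipy):

import math
from scipy import stats

n, beta = 10, 0.95
x_mean, s = 99.01, 3.91

k = stats.t.ppf(1 - (1 - beta) / 2, n - 1) * math.sqrt((n + 1) / n)   # 2.262 * sqrt(1.1) ~ 2.37
print(k, [x_mean - k * s, x_mean + k * s])                            # ~[89.7, 108.3]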
Case 3. It is also possible to obtain tolerance intervals independent of the distribution (provided it is continuous) of variable X. These intervals are based on the rank of the observations, but they demand very large sample sizes, which makes them quite useless in practice. For example, to guarantee that the β-content tolerance interval [l,u] is made as l = x(1) and u = x(n) (i.e., the endpoints are the smallest and the greatest values in the sample), it is necessary that n fulfills approximately the equation log(n) + (n − 1)log(β) = log(1 − γ) − log(1 − β).18 If we need, as in Example 7, β = γ = 0.95, the value of n has to be 89. Nevertheless, Willinks19 has used the Monte Carlo method to compute shorter 'distribution-free' tolerance intervals (β-content and β-expectation); this is of utility in the calculation of the uncertainty proposed in the Draft Supplement2 but it still requires sample sizes that are rather large in the scope of chemical analysis. A complete theoretical development on the tolerance intervals (including their estimation by means of Bayesian methods) is in the book by Guttman.20
Tolerance intervals are of interest to show that a method is fit for purpose because, when establishing that the interval [x̄ − ks, x̄ + ks] will contain on average 100 × β% of the values provided by the method (or 100 × β% of the values with γ confidence level), we are including precision and trueness. It suffices that the tolerance interval [x̄ − ks, x̄ + ks] is included in the specifications that the method should fulfill for it to be valid for that purpose. Note that a method with high precision (small value of s) but with a significant bias can still fulfill the specifications in the sense that a high proportion of its values are within the specifications. In addition, in the estimation of s, the repeatability, the intermediate precision, or the reproducibility can be introduced to consider the scope of application of the method. The use of the tolerance interval avoids the problem of introducing the bias as a component of the uncertainty, which would be in clear contradiction with the model of Equation (1).
With the aim of developing analytical methods fit for purpose, the Société Française des Sciences et Techniques Pharmaceutiques (SFSTP) has proposed21–24 the use of β-expectation tolerance intervals in the validation of quantitative methods. In four case studies, it has shown the validity of β-expectation tolerance intervals as an adequate way to reconcile the objectives of the analytical method in routine analysis with those of the validation step, and it proposes them as a criterion to select the calibration curve.25 It has also analyzed their adequacy to the guides26 that establish the performance criteria that should be validated and their usefulness in the problem of the transfer of an analytical method.27 González and Herrador28 have proposed their computation from the estimation of the uncertainty of the analytical assay. In all these cases, the β-expectation tolerance intervals based on the normality of the data are used, that is, using Equation (36). To avoid dependence on the underlying distribution and the use of the classic distribution-free methods, Rebafka et al.29 have proposed the use of a bootstrap technique to calculate β-expectation tolerance intervals, whereas Fernholz and Gillespie30 have studied the estimation by means of the bootstrap of the β-content tolerance intervals.

Nevertheless, confidence intervals will have to be used when the trueness and precision of a method are evaluated not against external requirements but for comparison with other methods and for quantifying the uncertainty and the bias of the results that the method provides.
There are other aspects of the determination of the uncertainty that are of practical interest, for example the problem raised by the fact that any uncertainty interval, particularly an expanded uncertainty interval, should be restricted to the range of feasible values of the measurand. Cowen and Ellison31 analyzed how to modify the interval when the data are close to a natural limit of the feasible range such as 0 or 100% mass or mole fraction.

1.02.3 Hypothesis Test

This section is dedicated to introducing a statistical methodology to decide whether an affirmation is false, for example, the affirmation 'this method of analysis applied to this reference sample provides the certified value'. If on the basis of the experimental results it is decided that it is false, we will conclude that the method has bias. The affirmation is habitually called a hypothesis and the procedure of decision making is called hypothesis testing. A statistical hypothesis is an assertion about the probability distribution followed by a random variable. Sometimes one has to decide on a parameter, for example, whether the mean of a normal distribution is a specific value. On other occasions it may be required to decide on other characteristics of the distribution, for example, whether the experimental data are compatible with the hypothesis that they come from a normal or uniform distribution.

1.02.3.1 Elements of a Hypothesis Test


As the results provided by analytical methods are modeled by means of a probability distribution, it is evident that both the validation of a method and its routine use imply making decisions that are formulated in a natural way as problems of hypothesis testing. In order to describe the elements of a hypothesis test, we will use a concrete case.

Example 8:
For an experimental procedure, we need solutions with pH values less than 2. The preparation of these solutions provides pH values that follow a normal distribution with σ = 0.55. pH values obtained from 10 measurements were 2.09, 1.53, 1.70, 1.65, 2.00, 1.68, 1.52, 1.71, 1.62, and 1.58. Is the pH of the resulting solution adequate to proceed with the experiment?
We may express this formally as

H0: μ = 2.00 (inadequate solution)
H1: μ < 2.00 (valid solution)    (37)

The statement 'μ = 2.00' in Equation (37) is called the null hypothesis, denoted as H0, and the statement 'μ < 2.00' is called the alternative hypothesis, H1. As the alternative hypothesis specifies values of μ that are less than 2.00, it is called a one-sided alternative. In some situations, we may wish to formulate a two-sided alternative hypothesis to specify values of μ that could be either greater or less than 2.00, as in

H0: μ = 2.00
H1: μ ≠ 2.00    (38)
The hypotheses are not affirmations about the sample but about the distribution from which those values come, that is to say, μ is the unknown value of the pH of the solution, which will be the same as the value provided by the procedure if the bias is zero (see the model of Equation (1)). In general, to test a hypothesis, the analyst must consider the experimental objective and decide upon a null hypothesis for the test, as in Equation (37). Hypothesis-testing procedures rely on using the information in a random sample; if this information is inconsistent with the null

Table 3 Decisions in hypothesis testing

                          The unknown truth
Researcher's decision     H0 is true       H0 is false
Accept H0                 No error         Type II error
Reject H0                 Type I error     No error

hypothesis, we would conclude that the hypothesis is false. If sufficient evidence does not exist to prove falseness, the conclusion of the test is not to reject the null hypothesis, but this does not actually prove that it is correct. It is therefore critical to choose the null hypothesis carefully in each problem.
In practice, to test a hypothesis, we must take a random sample, compute an appropriate test statistic from the sample data, and then use the information contained in this statistic to make a decision. However, as the decision is based on a random sample, it is subject to error. Two kinds of potential errors may be made when testing hypotheses. If the null hypothesis is rejected when it is true, then a type I error has been made. A type II error occurs when the researcher fails to reject the null hypothesis when it is false. The situation is described in Table 3.
In Example 8, if the experimental data lead to rejection of the null hypothesis H0 when it is true, our (wrong) conclusion is that the pH of the solution is less than 2. A type I error has been made and the analyst will use the solution in the procedure when in fact it is not chemically valid. If, on the contrary, the experimental data lead to acceptance of the null hypothesis when it is false, the analyst will not use the solution when in fact the pH is less than 2 and a type II error has been made. Note that both types of error have to be considered because their consequences are very different. In the case of a type I error, a nonsuitable solution is accepted, the procedure will be inadequate, and the analytical result will be wrong, with the later damages that it may cause (e.g., the loss of a client or a mistaken environmental diagnosis). On the contrary, a type II error implies that a valid solution is not used, with the corresponding extra cost of the analysis. It is clear that the analyst has to specify the risk assumed of making these errors, and this is done in terms of the probability that they occur.
The probabilities of occurrence of type I and type II errors are denoted by specific symbols:

α = pr{type I error} = pr{reject H0 / H0 is true}
β = pr{type II error} = pr{accept H0 / H0 is false}                                (39)
In Equation (39), the symbol '/' indicates that the probability is calculated under that condition. In the example
we are following, α will be calculated by means of a normal distribution of mean 2 and standard deviation 0.55.
The probability α of the test is called the significance level, and the power of the test is 1 − β, which measures
the probability of correctly rejecting the null hypothesis.
Statistically, in the example, one wants to decide about the value of the mean of a normal distribution
with known variance and a one-sided alternative hypothesis (a one-tail test). With these data, the statistic is
(Table 4, second row) Zcalc = (x̄ − μ)/(σ/√n) = (1.708 − 2.0)/(0.55/√10) = −1.679.
In addition, the analyst must assume the risk α, say 0.05, which means that the decision rule that he/she is going to
apply to the experimental results will accept an inadequate solution 5% of the times. Therefore, the critical or
rejection region is defined (Table 4, second row) as the set CR = {Zcalc < −z0.05 = −1.645}, that is, the null hypothesis
will be rejected for the samples of size 10 that provide values of the statistic less than −1.645. In the example,
the value Zcalc belongs to the critical region; thus, the decision is to reject the null hypothesis (i.e., the prepared
solution is adequate) at the 5% significance level.
Given the present facilities of computation, instead of the CR, the statistical packages calculate the so-called
p-value, which is the probability of obtaining a value of the statistic at least as extreme as the computed one under
the null hypothesis H0. In our case, p-value = pr{Z ≤ −1.679} = 0.0466. When the p-value is less than the significance
level α, the null hypothesis is rejected, because this is equivalent to saying that the value of the statistic belongs
to the critical region.
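These calculations are easy to reproduce; the following is a minimal Python sketch (using scipy.stats, and assuming the sample mean 1.708, σ = 0.55, and n = 10 quoted above) that recovers Zcalc, the critical value, and the p-value.

```python
import numpy as np
from scipy import stats

x_bar, mu0, sigma, n, alpha = 1.708, 2.00, 0.55, 10, 0.05

# Test statistic for H0: mu = 2.00 against H1: mu < 2.00 (known variance)
z_calc = (x_bar - mu0) / (sigma / np.sqrt(n))   # -1.679
z_crit = stats.norm.ppf(alpha)                  # -1.645, lower critical value
p_value = stats.norm.cdf(z_calc)                # 0.0466

print(f"z = {z_calc:.3f}, critical value = {z_crit:.3f}, p-value = {p_value:.4f}")
print("reject H0" if z_calc < z_crit else "do not reject H0")
```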

Table 4 Some parametric hypothesis tests

     Null hypothesis   Alternative hypothesis   Statistic                                    Critical region
1    μ = μ0            μ ≠ μ0                   Zcalc = (x̄ − μ0)/(σ/√n)                      {Zcalc < −zα/2 or Zcalc > zα/2}
2    μ = μ0            μ < μ0                                                                {Zcalc < −zα}
3    μ = μ0            μ > μ0                                                                {Zcalc > zα}
4    μ = μ0            μ ≠ μ0                   tcalc = (x̄ − μ0)/(s/√n)                      {tcalc < −tα/2,n−1 or tcalc > tα/2,n−1}
5    μ = μ0            μ < μ0                                                                {tcalc < −tα,n−1}
6    μ = μ0            μ > μ0                                                                {tcalc > tα,n−1}
7    μ1 = μ2           μ1 ≠ μ2                  Zcalc = (x̄1 − x̄2)/√(σ1²/n1 + σ2²/n2)         {Zcalc < −zα/2 or Zcalc > zα/2}
8    μ1 = μ2           μ1 > μ2                                                               {Zcalc > zα}
9    μ1 = μ2           μ1 ≠ μ2                  tcalc = (x̄1 − x̄2)/(sp √(1/n1 + 1/n2)) (*)    {tcalc < −tα/2,n1+n2−2 or tcalc > tα/2,n1+n2−2}
10   μ1 = μ2           μ1 > μ2                                                               {tcalc > tα,n1+n2−2}
11   μd = 0            μd ≠ 0                   tcalc = d̄/(sd/√n) (**)                       {tcalc < −tα/2,n−1 or tcalc > tα/2,n−1}
12   μd = 0            μd > 0                                                                {tcalc > tα,n−1}
13   σ² = σ0²          σ² ≠ σ0²                 χ²calc = (n − 1)s²/σ0²                       {χ²calc < χ²1−α/2,n−1 or χ²calc > χ²α/2,n−1}
14   σ² = σ0²          σ² > σ0²                                                              {χ²calc > χ²α,n−1}
15   σ1² = σ2²         σ1² ≠ σ2²                Fcalc = s1²/s2²                              {Fcalc < F1−α/2,n1−1,n2−1 or Fcalc > Fα/2,n1−1,n2−1}
16   σ1² = σ2²         σ1² > σ2²                                                             {Fcalc > Fα,n1−1,n2−1}

The values zγ are the percentiles of a standard normal distribution such that γ = pr{N(0,1) > zγ}.
The values tγ,ν are the percentiles of a Student's t distribution with ν degrees of freedom such that γ = pr{tν > tγ,ν}.
(*) sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2) is the pooled variance.
(**) d̄ is the mean of the differences di = x1i − x2i between the paired samples; sd is its standard deviation.
The values χ²γ,ν are the percentiles of a χ² distribution with ν degrees of freedom such that γ = pr{χ²ν > χ²γ,ν}.
The values Fγ,ν1,ν2 are the percentiles of an F distribution with ν1 degrees of freedom for the numerator and ν2 degrees of
freedom for the denominator, such that γ = pr{F > Fγ,ν1,ν2}.

However, what is the power of the decision rule (statistic and critical region) that has been used? Equation
(39) implies that to calculate β it is necessary to specify what exactly is understood by the alternative
hypothesis; in our case, what is understood by 'pH smaller than 2'. From a mathematical point of view, the
answer is clear: any number smaller than 2, for example 1.9999; from the point of view of the analyst, this
mathematical answer does not make sense. Sometimes on the basis of previous knowledge, in other cases
because of regulatory stipulations, or simply because of the details of the standardized working procedure, the analyst
can decide which value of pH is to be considered less than 2.00, for example, a pH less than 1.60. This is the same as
assuming as 'pH equal to 2' any smaller value whose distance to 2 is less than 0.40. In these conditions

β = pr{ N(0,1) < zα − (|δ|/σ)√n }                                                  (40)
where |δ| = 0.40 in our problem; replacing it in Equation (40), we obtain β = 0.26 (calculations can be
seen in Example A9 of Appendix). That is to say, whatever the decision taken, this decision rule leads to throwing
away a valid solution 26% of the times. Evidently, this is an inadequate rule.
A simple analysis of Equation (40) explains the situation. To decrease β, we should decrease the value
zα − (|δ|/σ)√n. This may be done by decreasing zα (i.e., increasing the significance level α) or increasing

Figure 3 Simultaneous (opposite) behavior of α (probability of type I error) and β (probability of type II error) for different sample sizes, n = 10, 15, 20, and 25.

(|δ|/σ)√n. As both the procedure precision, σ, and the difference of pH that we wish to detect are fixed, the
only possibility left is to increase the sample size. Solving Equation (40) for n, we have

n ≥ (zα + zβ)² / (|δ|/σ)²                                                          (41)
The values of α and β for sample sizes of 10, 15, 20, and 25, maintaining δ and σ fixed, are given in Figure 3.
As can be seen, α and β exhibit opposite behavior and, unless the sample size is increased, it is not
possible to simultaneously decrease the probability of both errors. In our case, Equation (41) gives n = 20.5
for α = β = 0.05, thus n = 21 because the sample size must be an integer. The dotted lines in Figure 3 intersect
at β values of 0.263, 0.126, 0.058, and 0.025, which correspond to the sample sizes considered while maintaining
the significance level α = 0.05. Again, we see that, for a given α, the risk β decreases as n increases.
Equation (40) also allows the analyst to decide the standard deviation (precision) necessary to obtain a
decision rule according to the risks α and β that he/she wishes to accept. For example, if one has to make a
decision on the validity of the prepared solution with 10 results and the analyst wishes α = β = 0.05, the only
option according to Equation (40) is to increase |δ|/σ. Solving 0.05 = pr{N(0,1) < 1.645 − (0.40/σ)√10}, one
obtains σ = 0.3845. This means that the procedure should be improved from the current value of 0.55 to 0.38.
If only five results were allowed, the standard deviation would have to decrease to 0.27 to maintain both the
significance level and the power of the test.
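A short sketch of these calculations (Python with scipy.stats; the variable names are ours) reproduces β from Equation (40), the sample size of Equation (41), and the standard deviation required to keep α = β = 0.05 with n = 10.

```python
import numpy as np
from scipy import stats

delta, sigma, alpha, beta_target, n = 0.40, 0.55, 0.05, 0.05, 10
z = stats.norm.ppf  # quantile function; z(1 - alpha) is the z_alpha of the text

# Equation (40): probability of a type II error for the one-tail test
beta = stats.norm.cdf(z(1 - alpha) - (delta / sigma) * np.sqrt(n))       # about 0.26

# Equation (41): sample size for alpha = beta = 0.05
n_min = (z(1 - alpha) + z(1 - beta_target)) ** 2 / (delta / sigma) ** 2  # about 20.5 -> 21

# Standard deviation needed to keep alpha = beta = 0.05 with n = 10
sigma_req = delta * np.sqrt(n) / (z(1 - alpha) + z(1 - beta_target))     # about 0.3845

print(round(beta, 3), int(np.ceil(n_min)), round(sigma_req, 4))
```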
Finally, there is an aspect of Equation (40) that should not go unnoticed. Maintaining α, β, and n
fixed, it is possible to modify the pH value that can be distinguished from 2.00 if the analyst simultaneously
increases the precision of the method so that the ratio |δ|/σ stays constant. Said otherwise, without changing
any of the specifications of the hypothesis test, by diminishing σ we can discriminate a value of pH nearer to 2.
Qualitatively this argument is obvious: if a procedure is more precise, similar results are easier to
distinguish, so that with a more precise procedure values will appear as different that would be considered equal
with a less precise procedure. Equation (40) quantifies this relation for the hypothesis test we are doing.
In summary, a hypothesis test includes the following steps: (1) defining the null, H0, and the alternative, H1,
hypotheses according to the purpose of the test and the properties of the distribution of the random variable, which,
according to Equation (1), is the distribution of the values provided by a method of measurement; (2) deciding on the
probabilities α and β, that is, the risks for the two types of error that will be assumed for the decision; (3) computing
the needed sample size; (4) obtaining the results, computing the corresponding test statistic, and evaluating whether it
belongs to the critical region CR; and (5) writing the analytical conclusion, which should entail more than reporting
the pure statistical test decision. The conclusion should include the elements of the statistical test, the assumed

distribution, α, β, and n. Care must be taken in writing the conclusion; for example, it is more adequate to say 'there
is no experimental evidence to reject the null hypothesis’ than ‘the null hypothesis is accepted’.
Table 4 summarizes the tests most frequently used in the validation of analytical procedures and in the
analysis of their results.

1.02.3.2 Hypothesis Test on the Mean of a Normal Distribution


Case 1. Known variance: Admit that the data follow a normal N(μ,σ) distribution with unknown μ. The
corresponding tests are in rows 1–3 of Table 4. The test statistic is always the same, but, depending on
whether the alternative hypothesis is two-sided (row 1 in Table 4) or one-sided (rows 2 and 3), the critical
region is different. The value zα/2 verifies α/2 = pr{N(0,1) > zα/2}, and analogously for zα. For the two-
tail test, the relation among n, α, and β is given by

n ≥ (zα/2 + zβ)² / (|δ|/σ)²                                                        (42)
whereas for the one-tail tests, Equation (41) must be used.
Case 2. Unknown variance: In this case, both the mean, μ, and the standard deviation, σ, of the normal
distribution are unknown. The hypothesis tests are in row 4 of Table 4 for the two-tail case and in rows 5 and 6
for the one-tail tests. The statistic tcalc should be compared with the values tα,n−1 or tα/2,n−1 of the Student's t
distribution with n − 1 d.f. The equation that relates α, β, and n is

β = pr{ −tα/2,n−1 ≤ t′n−1(Δ) ≤ tα/2,n−1 }                                          (43)

where Δ = (|δ|/σ)√n is the noncentrality parameter of a noncentral t(Δ) distribution, which in Equation (43)
has n − 1 d.f. Note the analogy with the 'shift' of the N(0,1) in Equation (40). The discussion about
the relative effect of the sample size and the precision is similar to the case in which the variance is
known. The corresponding equations for the one-tail tests are β = pr{ −tα,n−1 ≤ t′n−1(Δ) } if H1: μ < μ0 and
β = pr{ t′n−1(Δ) ≤ tα,n−1 } if H1: μ > μ0.
To compute n from Equation (43), the standard deviation is needed. To overcome this additional difficulty, the
comments in Case 2 of Section 1.02.2.2 are valid and can also be applied here; usually, a value of 2 or 3 is taken
for the difference expressed in standard deviation units. Let us compare the solutions with known and unknown
variance by supposing the data of Example 8, but considering that the variance is unknown. We wish to detect
differences in pH of 0.73 standard deviations (the same |δ|/σ as in Example 8). By using a sample of size 10, the
probability β is 0.32 instead of the previous 0.26 (calculations can be seen in Example A10 of Appendix). This
increase in the probability of a type II error is due to the smaller amount of information we
have about the problem; now the standard deviation is unknown.
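The β just quoted can be obtained from the noncentral t distribution of Equation (43); a minimal sketch for the one-tail case (scipy.stats.nct; the ratio |δ|/σ = 0.40/0.55 is the one assumed in the example):

```python
import numpy as np
from scipy import stats

d_ratio, n, alpha = 0.40 / 0.55, 10, 0.05   # |delta|/sigma = 0.73
df = n - 1
ncp = d_ratio * np.sqrt(n)                  # noncentrality parameter of Eq. (43)

# One-tail test H1: mu < mu0 -> beta = pr{ t'_(n-1)(ncp) <= t_(alpha, n-1) }
t_alpha = stats.t.ppf(1 - alpha, df)        # 1.833
beta = stats.nct.cdf(t_alpha, df, ncp)      # approximately 0.32 (cf. Example A10)
print(round(beta, 2))
```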
Case 3. The paired t-test: In Case 3 of Section 1.02.2.4 the experimental procedure and the reasons for
considering paired samples have already been explained. To decide on the effect of a treatment, the null
hypothesis is that the mean of the differences is zero, that is, H0: μd = 0, against the two-sided alternative H1: μd ≠ 0.
This is the test shown in row 11 of Table 4; there is only one one-tail test (row 12) because, if needed, it
suffices to consider the opposite differences di = x2i − x1i instead of di = x1i − x2i, i = 1,...,n. The statistic and
the critical region are analogous to those of Case 2 (test on the mean with unknown variance).

Example 9:
Table 5 shows the results of the recoveries obtained in 10 different places with two solid-phase extraction (SPE)
cartridges after fortification of wastewater with a sulfonamide. We want to decide whether cartridge A is more
Table 5 Recovery of sulfonamides spiked in wastewater obtained by using two different extraction cartridges

Place 1 2 3 4 5 6 7 8 9 10

Cartridge A (%) 77.2 74.0 75.6 80.0 75.2 69.2 75.4 74.0 71.6 60.4
Cartridge B (%) 74.4 70.0 70.2 77.2 75.9 60.0 77.0 76.0 70.0 55.0

See Example 9 for more details.



efficient than cartridge B and to compute the β risk of the test. To answer these questions, it is important to establish
that we consider 'different' those differences between the mean recoveries that are greater than 2%.
We use a paired t-test on the mean of the differences between the recoveries obtained with the two cartridges
on the same sample (those of cartridge A minus those of cartridge B). By considering these differences, we
eliminate the effect that the wastewater could have on the performance of the two cartridges. The test is carried
out as follows:
H0: μd = 0 (the differences of the recoveries are null)
H1: μd > 0 (cartridge A provides recoveries greater than cartridge B)              (44)
Following row 12 of Table 4, the statistic is tcalc = d̄/(sd/√n) = 2.69/(3.526/√10) = 2.412 and the critical
region is CR = {tcalc > tα,n−1}. The critical value t0.05,9 is equal to 1.833; thus the statistic belongs to the critical
region. Therefore, the null hypothesis is rejected for α = 0.05 and we can conclude that cartridge A is more
efficient than cartridge B, because the mean of the differences is positive.
To evaluate the power (1 − β) of the test, the equation β = pr{t′n−1(Δ) ≤ tα,n−1} with d = |μ − μ0|/σ =
|δ|/σ = 2/3.53 = 0.57 provides 1 − β = 0.4966 for α = 0.05 and n = 10 (calculations can be seen in Example
A11 of Appendix). Hence, 50% of the times, the conclusion of accepting that there is no difference between
recoveries will be wrong. In this case, the risk of a type II error is very large; in other words, the power is very poor
when we want to discriminate differences of 2% in recovery, because the ratio d is small.
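A sketch of the paired test of Example 9, computed directly from the recoveries of Table 5 with scipy.stats; the power calculation again uses the noncentral t distribution with the 2% difference considered relevant above.

```python
import numpy as np
from scipy import stats

a = np.array([77.2, 74.0, 75.6, 80.0, 75.2, 69.2, 75.4, 74.0, 71.6, 60.4])  # cartridge A
b = np.array([74.4, 70.0, 70.2, 77.2, 75.9, 60.0, 77.0, 76.0, 70.0, 55.0])  # cartridge B
d = a - b
n = len(d)

t_calc = d.mean() / (d.std(ddof=1) / np.sqrt(n))   # 2.412
# scipy's paired t-test gives the same statistic; halve the two-sided p-value
t_scipy, p_two = stats.ttest_rel(a, b)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha, n - 1)             # 1.833
print(round(t_calc, 3), round(p_two / 2, 4), t_calc > t_crit)

# Power when a mean difference of 2% in recovery matters (d = 2/3.53)
ncp = (2.0 / d.std(ddof=1)) * np.sqrt(n)
power = 1 - stats.nct.cdf(t_crit, n - 1, ncp)      # about 0.50
print(round(power, 2))
```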

1.02.3.3 Hypothesis Test on the Variance of a Normal Distribution


The variance is a measure of the dispersion of the data used to evaluate the precision of a procedure of analysis;
thus, decisions have to be made on this parameter frequently. The corresponding hypothesis tests are in rows 13
and 14 of Table 4.

Example 10:
A validated procedure has a repeatability of σ0 = 1.40 mg l−1 when measuring concentrations around 400 mg l−1.
After a technical revision of the instrument, the laboratory is interested in testing the hypothesis

H0: σ² = σ0² = 1.96 (the repeatability has not changed)
H1: σ² > 1.96 (the repeatability got worse)                                        (45)

The analyst decides that a repeatability of up to 2.0 times the initial one, 1.40 mg l−1, is admissible and assumes the
risks α = β = 0.05. The sample size needed to guarantee the requirements of the analyst is formally obtained
from a one-tail hypothesis test on the variance, where

β = pr{ χ²n−1 < k/λ² }                                                             (46)

k is the value such that α = pr{χ²n−1 > k} and λ = σ/σ0. As λ = 2.0, Equation (46) gives β = 0.0402 for n = 14,
whereas for n = 13, β = 0.0511 (calculations can be seen in Example A12 of Appendix). Therefore,
he/she decides to do 14 determinations on aliquot parts of a sample with 400 mg l−1, obtaining a variance of 3.10
(s = 1.76).
The statistic is χ²calc = (14 − 1) × 3.10/1.96 = 20.56. As the critical region is made up of the values
CR = {χ²calc > χ²0.05,13 = 22.36}, he/she concludes that there is no sufficient experimental evidence to say that
the precision has worsened. In this case, the acceptance of the null hypothesis, that is, to maintain that the
repeatability is below 2.0 times the initial one, will be erroneous 5% of the times because β was fixed at 5%.
The decision rule is equally protected against type I and type II errors.
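The numbers of Example 10 can be checked with a few lines of Python (scipy.stats.chi2; λ = 2.0 and the sample variance 3.10 are those assumed above):

```python
from scipy import stats

sigma0_sq, s_sq, alpha = 1.96, 3.10, 0.05

# beta from Eq. (46) as a function of n, for lambda = sigma/sigma0 = 2.0
lam = 2.0
for n in (13, 14):
    k = stats.chi2.ppf(1 - alpha, n - 1)          # critical value of the one-tail test
    beta = stats.chi2.cdf(k / lam**2, n - 1)      # about 0.051 (n = 13) and 0.040 (n = 14)
    print(n, round(beta, 4))

# Test with the n = 14 determinations actually made
n = 14
chi2_calc = (n - 1) * s_sq / sigma0_sq            # 20.56
chi2_crit = stats.chi2.ppf(1 - alpha, n - 1)      # 22.36
print(round(chi2_calc, 2), round(chi2_crit, 2), chi2_calc > chi2_crit)
```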

Table 6 Recoveries of triazines from wastewater using solid-phase microextraction (SPME) and solid-phase extraction
(SPE)

Recovery (%)

SPME 91 85 90 81 79 78 84 87 93 91
SPE 86 82 85 86 79 82 80 77 79 82

See Example 11 for more details.


1.02.3.4 Hypothesis Test on the Difference in Two Means
Case 1. Known variances: We assume that X1 is normal with unknown mean μ1 and known variance σ1², and that
X2 is also normal with unknown mean μ2 and known variance σ2². We are concerned with testing the hypothesis
that the means μ1 and μ2 are equal. The two-sided alternative hypothesis is in line 7 of Table 4 and the one-sided
one in line 8, when we have a random sample of size n1 of X1 and another random sample of size n2 of X2.

Example 11:
A solid-phase microextraction (SPME) procedure to extract triazines from wastewater has been carried out.
The results must be compared with previous ones where extraction was made by means of SPE. The
repeatability of both procedures is known to be 5.36% for the SPME procedure and 3.12% for SPE. The mean
recovery for 10 samples (Table 6) is 85.9% with SPME and 81.8% with SPE. At a 0.05 significance
level, is the recovery of the SPME procedure greater than that of SPE?

As the repeatability (standard deviation) of both procedures is known, a test to compare two means of normal
distributions with known variances is adequate.

H0: μSPME = μSPE (the recovery is the same for both procedures)
H1: μSPME > μSPE (the recovery using the SPME procedure is greater than that using SPE)        (47)

The statistic is Zcalc = (x̄SPME − x̄SPE)/√(σ²SPME/n1 + σ²SPE/n2) = (85.9 − 81.8)/√(28.73/10 + 9.73/10) = 2.091,
following line 8 of Table 4. For a significance level α = 0.05, CR = {Zcalc > zα = 1.645}. As the statistic
2.091 ∈ CR, the null hypothesis is rejected and we conclude that the mean recovery with SPME is
greater than that with SPE.
If a difference in recovery of 3% is relevant in the analysis, what is the risk β for this hypothesis test? A
simple modification of Equation (40) shows that

β = pr{ N(0,1) < zα − |δ| / √(σ1²/n1 + σ2²/n2) }                                   (48)

By substituting our data in Equation (48), one obtains β = 0.55. That means that in 55% of the cases we will
incorrectly accept that the recovery is the same for both procedures.
It is also possible to derive formulas to estimate the sample size required to obtain a specified β for given α
and δ. For the one-sided alternative, the sample size n = n1 = n2 is

n ≥ (zα + zβ)² (σ1² + σ2²) / δ²                                                    (49)

Again, with the data of the problem at hand and β = 0.05, Equation (49) gives 46.30, that is, 47
aliquot samples should be analyzed with each procedure so that α = β = 0.05.
For the two-sided alternative, the sample size n = n1 = n2 is

n ≥ (zα/2 + zβ)² (σ1² + σ2²) / δ²                                                  (50)
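A minimal sketch of the calculations of Example 11 and of Equations (48) and (49) (scipy.stats.norm; the means, repeatabilities, and the 3% difference are those quoted above):

```python
import numpy as np
from scipy import stats

m1, m2, s1, s2, n1, n2, alpha = 85.9, 81.8, 5.36, 3.12, 10, 10, 0.05
se = np.sqrt(s1**2 / n1 + s2**2 / n2)

z_calc = (m1 - m2) / se                               # 2.091
z_alpha = stats.norm.ppf(1 - alpha)                   # 1.645
print(round(z_calc, 3), z_calc > z_alpha)

# Eq. (48): beta when a difference of 3% in recovery matters
delta = 3.0
beta = stats.norm.cdf(z_alpha - delta / se)           # about 0.55

# Eq. (49): common sample size for alpha = beta = 0.05
z_beta = stats.norm.ppf(1 - 0.05)
n_req = (z_alpha + z_beta)**2 * (s1**2 + s2**2) / delta**2   # about 46.3 -> 47
print(round(beta, 2), int(np.ceil(n_req)))
```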

Case 2. Unknown variances: As in Section 1.02.2.4 (Case 2), there are two possibilities: (1) the unknown variances
are equal, σ1² = σ2² = σ², although for chance reasons the numerical values of the sample variances differ, and (2) both
variances are different, σ1² ≠ σ2². The question of deciding between (1) and (2) will be approached in Section 1.02.3.6.
Let X1 and X2 be two independent normal random variables with unknown but equal variances. The statistic
and the critical region for the two-tail test are in line 9 of Table 4, and in line 10 we can see the one-tail case.
For the two-sided alternative, with risks α and β, we consider further that the two means are different if their
difference is at least a quantity δ = |μ1 − μ2|. As the variances are unknown, the comments in Case 2 of Section
1.02.2.2 are also applicable. If we have samples from a pilot experiment with respective sizes n1′ and n2′, and sp²
is the pooled variance computed with them (see the statistic in lines 9 and 10 of Table 4), the sample size
needed, n = n1 = n2, is

n ≥ (tα/2,ν + tβ,ν)² / (δ²/(2sp²))                                                 (51)

where ν = n1′ + n2′ − 2 are the d.f. of the Student's t distribution. If the aforementioned is not possible, the
difference to be detected should be expressed in standard deviation units; then δ = |μ1 − μ2| = kσ and the
following expression applies:

n ≥ (zα/2 + zβ)² / (k²/2)                                                          (52)

where zα/2 and zβ are the corresponding upper percentage points of the standard normal distribution.

Example 12:
An experimenter wishes to compare the means of two procedures, stating that they are to be considered different
if they differ by 2 or more standard deviations (k = 2). In addition, he/she wants to assume α = 0.05 and β = 0.10.
As z0.025 = 1.96 and z0.10 = 1.282, Equation (52) gives n = 5.26; thus six samples must be considered for each
procedure. If he/she wishes to distinguish 1 standard deviation (k = 1), then n = 21.02, that is, he/she should
have 22 data from each procedure.
Although it is preferable to always take equal sample sizes, it can happen that it is more expensive or
laborious to collect the data of X1 than that of X2. In this case, there are weighted sample sizes to be
considered.32
In the case where equality of the variances σ1² and σ2² cannot be admitted, there is no completely justified solution
for the test. However, approximations exist with good power and easy-to-use tests, such as the Welch test. This
method consists of substituting the known variances in the expression of Zcalc (Case 1 of this section) by their
sample estimates, in such a way that the statistic becomes

tcalc = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)                                            (53)

which follows a Student's t with the d.f. ν given in Equation (25). The critical region for the two-tail test is
CR = {tcalc < −tα/2,ν or tcalc > tα/2,ν}. The critical region for the one-tail tests is CR = {tcalc < −tα,ν} if
H1: μ1 < μ2 and CR = {tcalc > tα,ν} if H1: μ1 > μ2.
As the variances are different, it seems reasonable to take the sample sizes, n1 and n2, also different. If
σ2 = rσ1, r ≥ 1, then, similar to Equation (52), one obtains

n1 ≥ (zα/2 + zβ)² / (k²/(r + 1))                                                   (54)

Once n1 is determined, n2 is obtained as n2 ¼ r  n1. The computation of the sample sizes with different
variances when pilot samples are at hand can be found in Schouten.32

1.02.3.5 Test Based on Intervals


The problem of deciding 'the equality' of the means of two distributions, discussed in the previous
section, highlights the fact that the result of interest (the two means are equal) is obtained by accepting the
null hypothesis. Hence, the type II error becomes very important. To compute it, it is necessary to
decide which is the least difference between the means that is to be detected, δ = |μ1 − μ2|. A more
natural framework is to define the null and alternative hypotheses in such a way that the decision of
accepting the equality of means is made by rejecting the null hypothesis, that is, the test should be
posed as

H0: |μ1 − μ2| ≥ δ (the means are different)
H1: |μ1 − μ2| < δ (the means are equal)

Contrary to the tests considered so far, the hypotheses of this test, called an interval hypothesis test, are defined not
by one point but by an interval. The two one-sided tests (TOST) procedure consists of decomposing the interval
hypotheses H0 and H1 into two sets of one-sided hypotheses:

H01: μ1 − μ2 ≤ −δ
H11: μ1 − μ2 > −δ

and

H02: μ1 − μ2 ≥ δ
H12: μ1 − μ2 < δ

The TOST procedure consists of rejecting the interval hypothesis H0 (and thus concluding equality of μ1 and
μ2) if and only if both H01 and H02 are rejected at a chosen level of significance α.
If two normal distributions with the same unknown variance, σ², are supposed and two samples of sizes n1 and
n2 are taken from each one, respectively, the two sets of one-sided hypotheses are tested with the ordinary one-
tail test (row 10 of Table 4). Thus, the critical region is

CR = { [(x̄1 − x̄2) + δ] / [sp √(1/n1 + 1/n2)] ≥ tα,ν  and  [δ − (x̄1 − x̄2)] / [sp √(1/n1 + 1/n2)] ≥ tα,ν }        (55)

where sp² is the pooled variance and ν = n1 + n2 − 2 its d.f.


The TOST procedure turns out to be operationally identical to the procedure of declaring equality only if
the usual confidence interval at 100 × (1 − 2α)% on μ1 − μ2 is completely contained in the interval [−δ, δ].
The expression that relates the sample sizes with α and β is

n1 ≥ (zα + zβ/2)² / ((δ/σ)²/f)                                                     (56)

where f = (r + 1)/r and n2 = r × n1. When the sample sizes are equal, f = 2.
As σ is unknown in Equation (56), it should be adapted as in Case 2 of Section 1.02.3.4. When comparing
Equation (56) with those corresponding to the two-tail t-test on the difference of means, one observes that it is
completely analogous once the two risks are exchanged (see Equations (50) and (52)). That is, the significance level
and the power of the t-test become the power and significance level, respectively, of the TOST procedure,
which completely agrees with the exchange of the hypotheses.
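A minimal sketch of the TOST decision rule of Equation (55), assuming two samples with a common unknown variance; the helper function tost() is hypothetical, and the margin δ = 5% used in the illustration with the Table 6 recoveries is arbitrary.

```python
import numpy as np
from scipy import stats

def tost(x1, x2, delta, alpha=0.05):
    """Two one-sided tests for |mu1 - mu2| < delta (pooled-variance form of Eq. (55))."""
    n1, n2 = len(x1), len(x2)
    nu = n1 + n2 - 2
    sp2 = ((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1)) / nu
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = np.mean(x1) - np.mean(x2)
    t1 = (diff + delta) / se          # rejects H01: mu1 - mu2 <= -delta
    t2 = (delta - diff) / se          # rejects H02: mu1 - mu2 >= +delta
    t_crit = stats.t.ppf(1 - alpha, nu)
    return (t1 >= t_crit) and (t2 >= t_crit)   # True -> declare the means equivalent

# Purely illustrative use with the recoveries of Table 6 and a margin of 5%
spme = [91, 85, 90, 81, 79, 78, 84, 87, 93, 91]
spe = [86, 82, 85, 86, 79, 82, 80, 77, 79, 82]
print(tost(spme, spe, delta=5.0))
```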
The tests based on intervals have a great tradition in statistics, see for example the book (very technical) by
Lehmann33. The TOST procedure is a particular case that has also been used under the name bioequivalence
test.34 Mehring35 has proposed some technical improvements to obtain optimal interval hypothesis tests,
including equivalence testing. It is shown that TOST is always biased, in particular, the power tends to zero
for increasing variances independently of the difference of means. As a result, an unbiased test36 and a suitable
compromise between the most powerful test and the shape of its critical region37 have been proposed. In
40 Quality of Analytical Measurements: Statistical Methods for Internal Validation

chemistry the use of TOST has been proposed to verify the equality of two procedures.38,39 Kuttatharmmakull
et al.40 provide a detailed analysis of the sample sizes necessary in a TOST procedure to compare methods of
measurement. There are different versions of TOST for ratio of variables and for proportions; the details of the
equations for these cases can be consulted in Section 8.13 of the book by Martin Andrés and Luna del Castillo41
and in the book by Wellek.42 The latter is a comprehensive review of inferential procedures that enable one to
‘prove the null hypothesis’ for many areas of applied statistical data analysis.

1.02.3.6 Hypothesis Test on the Variances of Two Normal Distributions


Suppose that two procedures follow normal distributions X1 and X2 with unknown means and variances. We
wish to test the hypothesis of the equality of the two variances, that is, H0: σ1² = σ2². In practice, this is a relevant
problem because this hypothesis is related to the equality of the precision of the two procedures, and it also serves as
a previous check to decide about the equality of variances before applying the test on the equality of means (Case
2 of Section 1.02.3.4) or computing a confidence interval on the difference of means (Case 2 of Section 1.02.2.4).
Assume that two random samples of size n1 of X1 and of size n2 of X2 are available and let s1² and s2² be
the sample variances. To test the two-sided alternative, we use the statistic and CR of line 15 of Table 4.
The probability β can be computed as a function of the ratio of variances λ² = σ1²/σ2² that is to be detected, by
the equation

β = pr{ F1−α/2,n1−1,n2−1/λ² < Fn1−1,n2−1 < Fα/2,n1−1,n2−1/λ² }                     (57)

where Fn1−1,n2−1 denotes an F distribution with n1 − 1 and n2 − 1 d.f. and Fα/2,n1−1,n2−1 its upper α/2 point, so that
pr{Fn1−1,n2−1 > Fα/2,n1−1,n2−1} = α/2. Similarly, F1−α/2,n1−1,n2−1 is the upper 1 − (α/2) point.

Example 13:
Aliquot samples have been analyzed in random order under the same experimental conditions to carry out a
stability test. The results are given in Table 7 and must be compared for assessing test material stability. (1) Is
there experimental evidence of instability in the material? (2) What is the probability of accepting the null
hypothesis when it is in fact wrong? To answer these questions, the analyst considers that the material is not
stable if the mean of the test sample differs from the mean of the control sample by 2 standard deviations. (3)
What should the sample size be if 1 standard deviation is needed for the fitness for purpose of this analysis (and
α = β = 0.05)?

1. As we only know the estimates of the variance, a t-test to compare means before and after the storage of the
analyte has been carried out.
The first step is to test if the variances can be considered equal by using a two-tail F-test:

H0: σ1² = σ2²
H1: σ1² ≠ σ2²

Following row 15 in Table 4, Fcalc = s1²/s2² = 50.76/26.22 = 1.93 and CR = {Fcalc < F1−α/2,n1−1,n2−1 or
Fcalc > Fα/2,n1−1,n2−1} with F0.025,8,8 = 4.43 and F1−0.025,8,8 = 1/F0.025,8,8 = 1/4.43 = 0.2257. Hence, there is
no experimental evidence to say that the variances differ.
Therefore, a hypothesis t-test on the difference of the two means with equal variances is formulated
(Case 2 of Section 1.02.3.4). The statistic and the CR are given in line 9 of Table 4.
H0: μ1 = μ2 (the test material is stable)

Table 7 Data for analysis of stability (arbitrary units)

Control sample 46.31 44.90 44.12 36.07 39.20 36.39 50.71 47.85 45.60
Test sample 43.12 43.00 44.75 39.66 37.74 37.50 54.79 53.08 55.07

See Example 13 for more details.



H1: μ1 ≠ μ2 (the material is not stable)


The pooled variance, sp², is 38.49, so sp = 6.20, with 9 + 9 − 2 = 16 d.f.; tcalc = (x̄1 − x̄2)/(sp√(1/n1 + 1/n2))
= (45.41 − 43.46)/(6.20√(1/9 + 1/9)) = 0.67 and t0.025,16 = 2.120. Therefore, the critical region is
CR = {tcalc < −2.120 or tcalc > 2.120}. Hence, there is no evidence to reject the null hypothesis, that is, with
these data there is no experimental evidence of instability.
2. Power of the test: With the condition imposed by the analyst, Equation (52) with k = 2 provides β ≤ 0.05.
3. In this case, the analyst is interested in computing the sample size under the assumption that only 1 standard
deviation is admissible for the fitness for purpose of this analysis. Therefore, k = 1 and, from Equation (52), n = 25.99,
so n1 = n2 = 26. The sample size is greater than the former because he/she is interested in distinguishing a
smaller quantity than in the previous point 2.
Incidentally, the null hypothesis of the F-test has been accepted. When a standard deviation twice that of the
control samples is to be detected, Equation (57) gives a probability β for this test of 0.56. That means that
56% of the times the null hypothesis will be wrongly accepted, and in this case we have accepted the null
hypothesis. When the F-test is used as a previous step to the test of equality of means, and the latter will be
used with α = 0.05, it is common to fix α = 0.10 for the F-test. Maintaining that a change of 2 times the standard
deviation of the control samples is to be detected, Equation (57) would provide n = n1 = n2 = 24 with β = 0.11
(all calculations of β can be seen in Example A13 of Appendix). In general, the F-tests on the
equality of variances are very conservative and large sample sizes are needed to assure an adequately small
probability of type II error.
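The F-test of Example 13 and the β of Equation (57) can be reproduced as follows (scipy.stats.f; the ratio λ = 2 of standard deviations to be detected is the one assumed above):

```python
import numpy as np
from scipy import stats

control = [46.31, 44.90, 44.12, 36.07, 39.20, 36.39, 50.71, 47.85, 45.60]
test = [43.12, 43.00, 44.75, 39.66, 37.74, 37.50, 54.79, 53.08, 55.07]
n1, n2, alpha = len(test), len(control), 0.05

F_calc = np.var(test, ddof=1) / np.var(control, ddof=1)   # 50.76/26.22, about 1.9
F_hi = stats.f.ppf(1 - alpha / 2, n1 - 1, n2 - 1)         # 4.43
F_lo = 1 / F_hi                                           # 0.226
print(round(F_calc, 2),
      "reject H0" if (F_calc < F_lo or F_calc > F_hi) else "do not reject H0")

# Eq. (57): beta when a ratio of standard deviations lambda = 2 is to be detected
lam2 = 2.0**2
beta = (stats.f.cdf(F_hi / lam2, n1 - 1, n2 - 1)
        - stats.f.cdf(F_lo / lam2, n1 - 1, n2 - 1))       # about 0.56
print(round(beta, 2))
```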

1.02.3.7 Hypothesis Test on the Comparison of Several Independent Variances


When the hypothesis of equality of variances of several groups of data coming from normal and independent
distributions is to be tested, a good practice is to plot the data for a visual inspection of their dispersion.

Example 14:
Table 8 shows the results of the determination of the acetic degree of a vinegar by means of an acid–base titration,
employing sodium hydroxide as the titrant. These data are adapted from the practice 'Analysis and comparison
of the acetic grade of a vinegar' included in Ortiz et al.,43 and each series is a replicated determination carried
out by a group of students on the same vinegar sample. The means and variances obtained by each group are
also included in Table 8.
Figure 4 shows the plot of the results obtained by different groups of students.

The most commonly used tests to compare several variances are Cochran's, Bartlett's, and Levene's tests.
In all cases, we wish to test the hypothesis

H0: σ1² = σ2² = ... = σk²
Ha: At least one σi² is different                                                  (58)

Table 8 Determination of acetic degree of a vinegar by means of an acid–base titration

Group 1 Group 2 Group 3 Group 4 Group 5

6.028 5.974 5.886 6.132 5.916


6.028 6.004 5.970 6.120 6.123
5.998 6.005 5.880 6.131 6.034
6.089 5.852 5.910 6.072 6.004
6.059 5.944 5.910 6.071 6.152
Mean x̄i        6.0404         5.9558         5.9112         6.1052         6.0458
Variance si²   1.203 × 10−3   3.997 × 10−3   1.267 × 10−3   0.969 × 10−3   8.993 × 10−3

See Example 14 for more details.



Figure 4 Data for testing the equality of variances: acetic degree plotted against the group number.

Table 9 Critical values for Cochran’s test for testing homogeneity of several variances at 5% significance level

ν, degrees of freedom

k 1 2 3 4 5 6 7 8 9 10

2 0.9985 0.9750 0.9392 0.9057 0.8772 0.8534 0.8332 0.8159 0.8010 0.7880
3 0.9669 0.8709 0.7977 0.7457 0.7071 0.6771 0.6530 0.6333 0.6167 0.6025
4 0.9065 0.7679 0.6841 0.6287 0.5895 0.5598 0.5365 0.5175 0.5017 0.4884
5 0.8412 0.6838 0.5981 0.5441 0.5065 0.4783 0.4564 0.4387 0.4241 0.4118
6 0.7808 0.6161 0.5321 0.4803 0.4447 0.4184 0.3980 0.3817 0.3682 0.3568
7 0.7271 0.5612 0.4800 0.4307 0.3974 0.3726 0.3535 0.3384 0.3259 0.3154
8 0.6798 0.5157 0.4377 0.3910 0.3595 0.3362 0.3185 0.3043 0.2926 0.2820
9 0.6385 0.4775 0.4027 0.3584 0.3286 0.3067 0.2901 0.2768 0.2659 0.2568
10 0.6020 0.4450 0.3733 0.3311 0.3029 0.2823 0.2666 0.2541 0.2439 0.2353

k, number of groups; ν, degrees of freedom.


Adapted from Sachs, L. Applied Statistics. A Handbook of Techniques; New York: Springer-Verlag, 1982.

The sample size of each group is denoted as ni, i = 1, 2, ..., k, and N = Σ_{i=1}^{k} ni.
Case 1. Testing the equality of several variances according to Cochran's test: The null hypothesis is that the
variances of the k groups of data are the same. This test detects whether one variance is greater than the
rest. The statistic is

Gcalc = max si² / Σ_{i=1}^{k} si²                                                  (59)

The critical region at significance level α is given by

CR = { Gcalc > Gα,k,ν }                                                            (60)

where Gα,k,ν is the value tabulated in Table 9 for ν d.f. In the case ni = n for all i, ν = n − 1.
With the data of Example 14, Gcalc = 8.993 × 10−3/16.429 × 10−3 = 0.5474 and G0.05,5,4 = 0.5441; thus, at the
0.05 significance level, the null hypothesis should be rejected and the variance of group 5 should be considered
different from the rest.
Case 2. Bartlett's test is appropriate to detect whether the variance, similar within each of the k groups of data,
differs from one group to another. The statistic is defined using the following equations:

χ²calc = 2.3026 (q/c)                                                              (61)

q = (N − k) log sp² − Σ_{i=1}^{k} (ni − 1) log si²                                 (62)

c = 1 + [ Σ_{i=1}^{k} 1/(ni − 1) − 1/(N − k) ] / [3(k − 1)]                        (63)

In Equation (62), 'log' means the decimal logarithm and sp² is the pooled variance that, analogous to Equation
(23), for k variances is sp² = Σ_{i=1}^{k} (ni − 1)si²/(N − k). The critical region is

CR = { χ²calc > χ²α,k−1 }                                                          (64)

In Example 14, c = 1.10, q = 3.43, and χ²calc = 7.19, which does not belong to the critical region defined in
Equation (64) because χ²0.05,4 = 9.49.
Cochran’s and Bartlett’s tests are very sensitive to the normality assumption. The Levene’s test, particularly
when it is based on the medians of each group, is more robust to the lack of normality of data.
Case 3. Levene's test: Consider, in the ith group of replicates, the absolute deviations of the values xij from the
mean of their respective group:

lij = |xij − x̄i|,  j = 1, 2, ..., ni                                               (65)

Consider the data arranged as in Table 8 and compute the usual F statistic for the deviations lij:

Fcalc = [ Σ_{i=1}^{k} ni (l̄i − l̄)² / (k − 1) ] / [ Σ_{i=1}^{k} Σ_{j=1}^{ni} (lij − l̄i)² / (N − k) ]        (66)

where l̄i is the mean of the ith group of deviations and l̄ is their global mean. The critical region at the
100 × (1 − α)% confidence level is

CR = { Fcalc > Fα,k−1,N−k }                                                        (67)

Note that the denominator of Equation (66) is the pooled (within-group) variance of the deviations, whereas the
numerator measures the variability between the group means of these deviations.
Computing the deviations of Equation (65) with the data of Table 8, Fcalc = (2.205 × 10−3)/(0.905 × 10−3) = 2.44.
As F0.05,4,20 = 2.866, there is no evidence to reject the null hypothesis (the variances are equal).
The Levene's test using group medians instead of means is more recommendable. The adaptation is simple;
one has to consider the absolute value of the differences with respect to the median, x̃i, of each group:

lij = |xij − x̃i|,  j = 1, 2, ..., ni                                               (68)

The statistic is again that of Equation (66) and it is tested in the same way as before.
With the same data of Table 8, but applying the transformation in Equation (68), one obtains
Fcalc = (2.146 × 10−3)/(1.360 × 10−3) = 1.58, and the conclusion is the same. The variances of the five groups
should be considered equal.
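For the data of Table 8, Bartlett's and Levene's tests are available directly in scipy.stats, and Cochran's statistic of Equation (59) is a one-liner; a sketch (the critical value 0.5441 is taken from Table 9):

```python
import numpy as np
from scipy import stats

groups = [
    [6.028, 6.028, 5.998, 6.089, 6.059],
    [5.974, 6.004, 6.005, 5.852, 5.944],
    [5.886, 5.970, 5.880, 5.910, 5.910],
    [6.132, 6.120, 6.131, 6.072, 6.071],
    [5.916, 6.123, 6.034, 6.004, 6.152],
]

variances = [np.var(g, ddof=1) for g in groups]
G_calc = max(variances) / sum(variances)                  # Cochran, Eq. (59): 0.547
print(round(G_calc, 4), G_calc > 0.5441)                  # critical value from Table 9

chi2_b, p_b = stats.bartlett(*groups)                     # Bartlett: chi2 = 7.19
F_mean, p_mean = stats.levene(*groups, center='mean')     # Levene (means): 2.44
F_med, p_med = stats.levene(*groups, center='median')     # Levene (medians): 1.58
print(round(chi2_b, 2), round(F_mean, 2), round(F_med, 2))
```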
Frequently it happens that the three tests do not agree in their results, as is the case here. However, a joint
interpretation clarifies the situation: in the data of Example 14, the variance of group 5 is greater than the
variance of the other groups. When the Levene's test is applied, a large difference is observed between the
statistic based on the means and the one based on the medians. This suggests that the increase in the variance of the
last group is caused by some data being different from the others; this is graphically seen in Figure 4.

1.02.3.8 Goodness-of-Fit Tests: Normality Tests


The tests on distributions, or goodness-of-fit tests, are designed to decide whether the experimental data are
compatible with a predetermined probability distribution, generally characterized by one or several para-
meters, such as the normal, the Student's t, or the F distribution. Almost all the inferential procedures proposed
in this chapter are based on normality; thus, in most cases, it is necessary to verify whether the data are

compatible with this assumption. In this section, we will show the chi-square test that is used for any
distribution and the D’Agostino test that is recommendable for evaluating the normality of a data set.
Case 1. Chi-square test: The test is designed to detect frequencies that are inadequate for a specified probability
distribution F0. Given a sample x1, x2, ..., xn from a random variable, one is interested in testing the hypothesis

H0: The distribution of the random variable is F0
H1: This is not the case                                                           (69)

To compute the statistic, the n sample values are grouped into k classes (intervals). Denote by Oi, i = 1,...,k, the
frequency observed in each class and by Ei the expected frequency for the same class provided the distribution
is exactly F0. Then,

χ²calc = Σ_{i=1}^{k} (Oi − Ei)²/Ei                                                 (70)

The critical region at the (1 − α) × 100% confidence level is

CR = { χ²calc > χ²α,k−p−1 }                                                        (71)

where χ²α,k−p−1 is the value such that pr{χ²k−p−1 > χ²α,k−p−1} = α, and p is a number that depends on the
distribution F0, for instance p = 2 for a normal, p = 1 for a Poisson, and p = 0 for a uniform distribution. The test
requires that the expected frequencies are not too small; otherwise, the data are regrouped into bigger classes.
In the practice of chemical analysis, the sample sizes are not large; when grouping the data, the d.f. of the chi-
square statistic are few, the critical value of Equation (71) becomes large, and a large
discrepancy between the expected and observed frequencies is necessary to reject the null hypothesis. That means that
the test is very conservative.

Example 15:
To show the validity of the use of the crystal violet (CV) as an internal standard in the determination by
HPLC-MS-MS of malachite green (MG) in trout, a sample of trout was spiked with increasing concentrations
of MG between 0.5 and 5.0 mg kg–1 and in all of them with 1 mg kg–1 of CV. The area of the CV-specific peak
(transition 372 > 356) resulted: 1326, 1384, 1419, 1464, 1425, 1409, 1387, 1449, 1311, 1338, 1350, and 1345
abundance per count. To verify whether the CV is constant and independent of the concentration of MG, we
can test the hypothesis

H0 : The distribution of the random variable is uniform


H1 : This is not the case

Table 10 shows the calculation of both the observed and the expected frequencies under the uniform distribution in
the interval [1311, 1464], the endpoints being respectively the minimum and maximum values in the sample.
By summing up the values of the last column of Table 10, the statistic is χ²calc = 0.51, which does not belong
to the critical region because it is not greater than χ²0.05,5−0−1 = 9.49. Therefore, there is no evidence to reject the
hypothesis that the data come from a uniform distribution.

Table 10 χ² test for the uniform distribution applied to assess the validity of crystal violet as internal
standard; data of Example 15

Class                  Observed frequency (Oi)    Expected frequency (Ei)    (Oi − Ei)²/Ei

[1311, 1341.6) 3 2.40 0.15


[1341.6, 1372.2) 2 2.40 0.07
[1372.2, 1402.8) 2 2.40 0.07
[1402.8, 1433.4) 3 2.40 0.15
[1433.4, 1464) 2 2.40 0.07
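The χ² statistic of Table 10 can be reproduced with numpy and scipy (five equal classes over [1311, 1464]; the value 0.50 obtained here differs slightly from 0.51 only because Table 10 rounds each term to two decimals):

```python
import numpy as np
from scipy import stats

areas = np.array([1326, 1384, 1419, 1464, 1425, 1409, 1387, 1449,
                  1311, 1338, 1350, 1345])

k = 5
observed, edges = np.histogram(areas, bins=k, range=(areas.min(), areas.max()))
expected = np.full(k, len(areas) / k)            # 2.4 per class under uniformity

chi2_calc = ((observed - expected) ** 2 / expected).sum()   # 0.50
chi2_crit = stats.chi2.ppf(0.95, k - 1)                     # 9.49 (p = 0 for the uniform)
print(round(chi2_calc, 2), chi2_calc > chi2_crit)
```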

Table 11 Significance limits for the D’Agostino normality test

Significance level

Sample size        α = 0.05                α = 0.01
                   DL         DU           DL         DU
10 0.2513 0.2849 0.2379 0.2857
12 0.2544 0.2854 0.2420 0.2862
14 0.2568 0.2858 0.2455 0.2865
16 0.2587 0.2860 0.2482 0.2867
18 0.2603 0.2862 0.2505 0.2868
20 0.2617 0.2863 0.2525 0.2869
22 0.2629 0.2864 0.2542 0.2870
24 0.2639 0.2865 0.2557 0.2871
26 0.2647 0.2866 0.2570 0.2872
28 0.2655 0.2866 0.2581 0.2873
30 0.2662 0.2866 0.2592 0.2873

Adapted from Martín Andrés, A.; Luna del Castillo, J. D. Bioestadística para las ciencias de la salud; Norma Capitel:
Madrid, Spain, 2004.

Case 2. The D'Agostino normality test: The problem of verifying the normality of a set of data has been extensively
treated. When the empirical and theoretical histograms are compared, the most commonly used tests are the
chi-square and the Kolmogorov–Smirnov tests. However, there are many characteristics that are specific to
the pdf of a normal distribution, for example, the skewness and the kurtosis, which are statistics related to the
moments of order higher than two. A very powerful test is that of D'Agostino.

H0: The distribution of the random variable is normal
H1: This is not the case
To apply the test, the data are sorted in increasing order, so that x1 ≤ x2 ≤ ... ≤ xn. The statistic is

Dcalc = [ Σ_{i=1}^{n} i·xi − ((n + 1)/2) Σ_{i=1}^{n} xi ] / √( n³ [ Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)²/n ] )        (72)

Index i in Equation (72) refers to the ordered data. Table 11 shows some of the critical values of the statistic,
Dα,n, with the two limits, DL and DU, for each sample size n and significance level α. The critical region of
the test is

CR = { Dcalc < DL or Dcalc > DU }                                                  (73)

For further details, consult the work by D'Agostino and Stephens.44
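A sketch of the D statistic of Equation (72); the function dagostino_D below is a hypothetical helper written for this chapter, not a library routine, and the resulting value is to be compared with the limits DL and DU of Table 11 for the corresponding n and α.

```python
import numpy as np

def dagostino_D(x):
    """D statistic of Equation (72), computed from the ordered data."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    numerator = np.sum(i * x) - (n + 1) / 2 * np.sum(x)
    denominator = np.sqrt(n ** 3 * (np.sum(x ** 2) - np.sum(x) ** 2 / n))
    return numerator / denominator

# Illustration with the 12 crystal violet areas of Example 15; for n = 12 and
# alpha = 0.05, Table 11 gives DL = 0.2544 and DU = 0.2854
areas = [1326, 1384, 1419, 1464, 1425, 1409, 1387, 1449, 1311, 1338, 1350, 1345]
print(round(dagostino_D(areas), 4))   # compare with the interval [0.2544, 0.2854]
```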


As with the confidence intervals, a Bayesian approach exists for the construction of the hypothesis tests that
many statisticians prefer because of its internal coherence. Consult Chapter 1.08 and the basic references on this
subject; for a recent comparative analysis of both approaches, see Moreno and Girón.45

1.02.4 One-Way Analysis of Variance

Sometimes, more than two means must be compared. One can think of comparing, say, five means by applying
the test of comparison of two means of Section 1.02.3.4 to each of the 10 pairs of means that can be formed by taking
them two at a time. This option has the serious problem of requiring enormous sample sizes,
because to test the null hypothesis 'the five means are equal' with α = 0.05 and assuming that the 10 tests are

Table 12 Arrangement of data for an ANOVA

Factor

Level 1 Level 2 Level 3 Level k

x11 x21 x31 xk1


x12 x22 x32 xk2
x13 x23 x33 xk3
...        ...        ...        ...
x1n x2n x3n xkn

independent, each one of the hypotheses 'the means x̄i and x̄j are equal' should be tested with a significance
level of 0.0051 to obtain (1 − 0.0051)^10 ≈ 0.95. The appropriate procedure for testing the equality of several
means is the analysis of variance (ANOVA).
The ANOVA has many more applications; it is particularly useful in the validation of a model fit to some
experimental data and, hence, in an analytical calibration (Chapter 1.05), or in the analysis of response surfaces
(Chapter 1.12).
Table 12 shows how the data are usually arranged in a general case, with a factor at k levels (e.g., five
different extraction cartridges) and n data at each level (e.g., four determinations with each cartridge). Each of the
N = k × n values xij (i = 1,2,...,k, j = 1,2,...,n) is the result obtained when using the ith cartridge with the jth
aliquot sample. In general, at each level i a different number of replicates, ni, may be available, with N = Σ_{i=1}^{k} ni.
To make the notation easier, we will suppose that all ni are equal, that is, ni = n for each level.
Suppose that the data in Table 12 can be described by the model

xij = μ + τi + εij,  i = 1, 2, ..., k;  j = 1, 2, ..., n                           (74)

where μ is a parameter common to all treatments, called the overall mean, τi is a parameter associated with the
ith level, called the factor effect, and εij is the random error component. In our example, μ is the content of the
sample and τi is the variation in this quantity caused by the use of the ith cartridge. Note that in the model of
Equation (74) the factor effect is additive; this is an assumption that may be unacceptable in some practical
situations.
We would like to test certain hypotheses about the treatment effects and to estimate them. For the hypothesis
testing, the model errors are assumed to be normally and independently distributed random variables, with mean
zero and variance σ², NID(0,σ²). The variance σ² is assumed to be constant for all levels of the factor.
The model of Equation (74) is called the one-way ANOVA, because only one factor is studied. The analysis
for two or more factors can be seen in the Chapter 1.12 about factorial techniques. Furthermore, the data of
Table 12 are required to be obtained in random order so that the environment in which the factor varies is as
uniform as possible.
There are two ways for choosing the k factor levels in the experiment. In the first case, the k levels are
specifically chosen by the researcher, as the cartridges in our example. In this case, we wish to test hypotheses
about the size of the τi, and the conclusions will apply only to the factor levels considered in the analysis;
they cannot be extended to similar levels that were not considered. This is called the 'fixed effects model'.
Alternatively, the k levels could be a random sample from a larger population of levels. In this case, we would
like to be able to extend the conclusions based on the sample to all levels in the population, whether or not
they have been explicitly considered in the analysis. Here the τi are random variables and
information about the specific values included in the analysis is useless. Instead, we test hypotheses about
their variability. This is called the 'random effects model'. This model is used to evaluate the repeatability and
reproducibility of a method and also the laboratory bias when the method of analysis is being tested by a
proficiency test. In the same experiment, and provided there are at least two factors, fixed and random effects
can simultaneously appear.46,47

1.02.4.1 The Fixed Effects Model


In this model, the effect of the factors is defined as a deviation with respect to the overall mean, so that

Σ_{i=1}^{k} τi = 0                                                                 (75)

From the individual data, the mean value per level is defined as

x̄i = Σ_{j=1}^{n} xij / n,  i = 1, 2, ..., k                                        (76)

and the overall mean is

x̄ = Σ_{i=1}^{k} Σ_{j=1}^{n} xij / N                                                (77)
A simple calculation gives

Σ_{i=1}^{k} Σ_{j=1}^{n} (xij − x̄)² = n Σ_{i=1}^{k} (x̄i − x̄)² + Σ_{i=1}^{k} Σ_{j=1}^{n} (xij − x̄i)²        (78)

Equation (78) shows that the total variability of the data, measured by the sum of squares of the differences between
each datum and the overall mean, can be partitioned into a sum of squares of differences between level means
and the overall mean and a sum of squares of differences between individual values and their level mean. The term
n Σ_{i=1}^{k} (x̄i − x̄)² measures the differences between levels, whereas Σ_{i=1}^{k} Σ_{j=1}^{n} (xij − x̄i)² can be due to random
error alone. It is common to write Equation (78) as
SST ¼ SSF þ SSE ð79Þ
where SST is the total sum of squares, SSF is the sum of squares due to change levels of factor, which is called
sum of squares between levels, and SSE is the sum of squares due to error, which is called sum of squares within
levels. There are N individual values, thus SST has N  1 d.f. Similarly, as there are k factor levels, SSF has k  1
d.f. Finally, SSE has N  k d.f. We are interested in testing
H0: τ1 = τ2 = τ3 = ... = τk = 0 (there is no effect due to the factor)
H1: τi ≠ 0 for at least one i
Because of the assumption that the errors εij are NID(0,σ²), the values xij are NID(μ + τi, σ²), and therefore
SST/σ² is distributed as a χ²N−1. Cochran's theorem guarantees that, under the null hypothesis, SSF/σ² and
SSE/σ² are independent chi-square distributions with k − 1 and N − k d.f., respectively. Therefore, under the
null hypothesis, the statistic

Fcalc = [SSF/(k − 1)] / [SSE/(N − k)] = MSF/MSE                                    (80)

Table 13 Skeleton of an ANOVA of fixed effects

Source of variation        Sum of squares   d.f.    Mean squares   E(MS)                                 Fcalc
Factor (between levels)    SSF              k − 1   MSF            σ² + n Σ_{i=1}^{k} τi²/(k − 1)        MSF/MSE
Error (within levels)      SSE              N − k   MSE            σ²
Total                      SST              N − 1

Table 14 Effect of the type of fiber in a solid-phase microextraction of triazine (ppb)

Type of fiber

Fiber 1 Fiber 2 Fiber 3 Fiber 4 Fiber 5

490 612 509 620 490


Replicates 478 609 496 601 502
492 599 489 580 495
499 589 500 603 479
Means xi 489.75 602.25 498.50 601.00 491.50
Variances si2 76.25 108.92 69.67 268.67 93.67

follows an Fk−1,N−k distribution, whereas under the alternative hypothesis it follows a noncentral F.46 The
quantities MSF and MSE are called mean squares. Their expected values are E(MSF) = σ² + n Σ_{i=1}^{k} τi²/(k − 1) and
E(MSE) = σ², respectively. Therefore, under the null hypothesis, both are unbiased estimates of the residual
variance, σ², whereas under the alternative hypothesis the expected value of MSF is greater than σ². Therefore,
if the null hypothesis is false, the numerator is significantly greater than the denominator of Equation (80) and
the critical region at significance level α is

CR = { Fcalc > Fα,k−1,N−k }                                                        (81)

Usually, the test procedure is summarized in a table, called ANOVA table (Table 13).

Example 16:
To investigate the influence of fibers composition on an SPME procedure, an experiment was performed using
five different fibers. The data shown in Table 14 are the results (after extraction) of four replicated analyses,
with each fiber, of a sample spiked with 1000 ppb of triazine. All the analyses were carried out in random order
and maintaining the rest of experimental conditions controlled.
In the last two rows of Table 14, the means and variances for each fiber are given. Before doing the ANOVA,
the hypothesis of equality of variances should be tested:

H0: σ1² = σ2² = ... = σk²
H1: At least one σi² is different

With the variances in Table 14, the statistic of Cochran's test (Equation (59)) is Gcalc = 268.67/617.168 = 0.435.
As G0.05,k,n−1 = 0.5981 (see Table 9), the statistic does not belong to the critical region
(Equation (60)) and there is no evidence to reject the null hypothesis at the 5% significance level.
The statistic of Bartlett's test is χ²calc = 1.792 (Equation (61)) and the critical value is χ²0.05,4 = 9.488, so there
is no evidence to reject the null hypothesis either (Equation (64)). The same happens with the Levene's test;
computing the absolute deviations, according to Equation (65), of the data of Table 14, Fcalc = 14.70/44.01 = 0.33.
As F0.05,4,15 = 3.06, there is no evidence to reject the null hypothesis on the equality of variances. By using the
median instead of the mean (Equation (68)), Fcalc = 15.13/44.23 = 0.34, and the conclusion is the same. From the
analysis of the equality of variances, we can conclude that the variances of the five levels should be considered
equal.

Table 15 Results of the ANOVA for data of Table 14

Source Sum of squares d.f. Mean squares Fcalc

Between fibers 56 551.3 4 14 137.8 114.54


Error (within fibers) 1 851.5 15 123.4
Total 58 402.8 19

The ANOVA of the experimental data gives the results in Table 15. Considering the critical region defined
in Equation (81), as Fcalc = 114.54 is greater than the critical value F0.05,4,15 = 3.06, we reject the null hypothesis
and hence conclude that the effect of the factor 'fiber composition' on the extraction is significant at the 0.05
level.
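The ANOVA of Table 15 can be reproduced from the raw data of Table 14, either from the sums of squares of Equations (78) and (79) or with scipy.stats.f_oneway; a sketch:

```python
import numpy as np
from scipy import stats

fibers = [
    [490, 478, 492, 499],
    [612, 609, 599, 589],
    [509, 496, 489, 500],
    [620, 601, 580, 603],
    [490, 502, 495, 479],
]
k, n = len(fibers), len(fibers[0])
N = k * n
all_data = np.concatenate(fibers)

# Sums of squares as in Equations (78) and (79)
grand_mean = all_data.mean()
SSF = n * sum((np.mean(g) - grand_mean) ** 2 for g in fibers)        # 56551.3
SSE = sum(((np.array(g) - np.mean(g)) ** 2).sum() for g in fibers)   # 1851.5
MSF, MSE = SSF / (k - 1), SSE / (N - k)
F_calc = MSF / MSE                                                   # 114.5
print(round(F_calc, 2), F_calc > stats.f.ppf(0.95, k - 1, N - k))

# The same F statistic in one call
print(stats.f_oneway(*fibers))
```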

1.02.4.2 Power of the ANOVA for the Fixed Effects Model


The power of the ANOVA is computed by the following expression:

1 − β = pr{ F′k−1,N−k;λ > Fα,k−1,N−k }                                             (82)

where Fα,k−1,N−k is the critical value of Equation (81), F′k−1,N−k;λ is a noncentral F distribution with k − 1 and
N − k d.f. for the numerator and denominator, respectively, and λ is the noncentrality parameter, whose value is
given by

λ = n Σ_{i=1}^{k} τi² / σ²                                                         (83)

The noncentrality parameter λ depends on the number of replicates and also on the difference in means that we
wish to detect, in terms of Σ_{i=1}^{k} τi². Furthermore, the error variance is usually unknown. In such cases, we must
choose the ratio Σ_{i=1}^{k} τi²/σ² that we wish to detect. As the power, 1 − β, of the test increases with λ, we next ask
what is the minimum λ subject to the condition that two of the τi differ by δ or more. The minimum λ is
obtained if two of the τi differ by δ and the remaining k − 2 equal the mean of these two;46 therefore,

Σ_{i=1}^{k} τi² = δ²/2                                                             (84)

For example, with the data of Example 16 (Table 14), we are now interested in the risk of affirming that the type of
fiber composition is not significant for the recovery.
The answer consists of evaluating the probability β by Equation (82). Suppose that we want to discriminate
effects greater than twice the MSE, that is, Σ_{i=1}^{k} τi²/σ² ≥ 2, so λ = n × 2 = 8, F0.05,4,15 = 3.06 and β = 0.54
(calculations can be seen in Example A14 of Appendix). To put it in words, 54 out of 100 times we will accept the
null hypothesis (there is no effect of fiber) when it is wrong. This is not good enough for a useful decision rule.
According to Equation (84), our λ value means that we want to discriminate a difference δ at least equal to 2σ
between two types of fiber.
Another question related to the above equation is to know the sample size before starting an experiment. In
many situations, we would like to know what the sample size should be so that both risks α and β will
be good enough. Thus, the following question can be answered: how many replicates should be carried out in
the experiment for α = β = 0.05 while maintaining the ratio Σ_{i=1}^{k} τi²/σ² ≈ 3? Note that, in this case, the
analyst considers there to be an 'effect of fiber type' if it is greater than 3 times MSE, which is equivalent, using Equation (84),
to detecting a difference between two fibers at least equal to δ = σ√6 ≈ 2.5σ.
To calculate the sample size, a table must be made writing β as a function of n in Equation (82) with k, α, and
λ fixed at 5, 0.05, and 3 × n, respectively. Following the results shown in Table 16, we need n = 8 replicates
with each fiber to achieve β ≤ 0.05, although in practice n = 7 suffices.
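The β values of Table 16 come from the noncentral F distribution of Equation (82); a minimal sketch with scipy.stats.ncf (λ = 3n, as assumed above for Σ τi²/σ² = 3):

```python
from scipy import stats

k, alpha = 5, 0.05
for n in range(4, 10):
    N = k * n
    F_crit = stats.f.ppf(1 - alpha, k - 1, N - k)
    lam = 3 * n                      # noncentrality parameter of Eq. (83) for the chosen ratio
    beta = stats.ncf.cdf(F_crit, k - 1, N - k, lam)
    print(n, round(beta, 3))         # should reproduce Table 16: 0.347, 0.203, ..., 0.014
```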

1.02.4.3 Uncertainty and Testing of the Estimated Parameters for the Fixed Effects Model
It is possible to derive estimators for the parameters μ and τi (i = 1,...,k) in the one-way ANOVA modeled by
Equation (74). The normality assumption on the errors is not needed to obtain an estimate by least squares;

Table 16 Probability of type II error, β, as a function of the number n of replicates in the ANOVA of the fiber types

n    4       5       6       7       8       9
β    0.347   0.203   0.111   0.058   0.029   0.014

however, the solution is not unique, so the constraint of Equation (75) is imposed. Using this constraint we obtain the estimates

$$\hat{\mu} = \bar{x} \quad \text{and} \quad \hat{\tau}_i = \bar{x}_i - \bar{x}, \quad i = 1, \ldots, k \qquad (85)$$

where x̄ and x̄i have been defined in Equations (77) and (76), respectively. If the number of replicates, ni, in each level is not equal (unbalanced ANOVA), then the constraint of Equation (75) should be changed to $\sum_{i=1}^{k} n_i \tau_i = 0$ and the weighted average of the x̄i should be used instead of the unweighted average in Equation (85).
Now, if we assume that the errors are NID(0,σ²) and ni = n, i = 1,...,k, the estimates of Equation (85) are also the maximum likelihood ones. For unbalanced designs, the least squares solution is biased and the maximum likelihood one is better. The reader interested in this subject should consult statistical monographs that describe this matter at a high level, such as Milliken and Johnson48 and Searle.49
The mean of the ith level is μi = μ + τi, i = 1,...,k. In our case, with a balanced design, an estimator of μi would be μ̂i = μ̂ + τ̂i = x̄i and, as the errors are NID(0,σ²), x̄i is NID(μi,σ²/n). Using MSE as an estimator of σ², Equation (16) gives the confidence interval at the (1 − α) × 100% level:

$$\left[\bar{x}_i - t_{\alpha/2,N-k}\sqrt{\frac{MS_E}{n}},\; \bar{x}_i + t_{\alpha/2,N-k}\sqrt{\frac{MS_E}{n}}\right] \qquad (86)$$

A (1 − α) × 100% confidence interval on the difference in any two level means, say μi − μj, would be

$$\left[\bar{x}_i - \bar{x}_j - t_{\alpha/2,N-k}\sqrt{\frac{2MS_E}{n}},\; \bar{x}_i - \bar{x}_j + t_{\alpha/2,N-k}\sqrt{\frac{2MS_E}{n}}\right] \qquad (87)$$

With the data in Example 16 (Table 14), a 95% confidence interval on the difference between fibers 1 and 2 is given by $(489.75 - 602.25) \pm 2.131\sqrt{2 \times 123.43/4}$, that is, [−129.24, −95.76].
Finally, the (1 − α) × 100% confidence interval on the global mean is

$$\left[\bar{x} - t_{\alpha/2,N-k}\sqrt{\frac{MS_E}{nk}},\; \bar{x} + t_{\alpha/2,N-k}\sqrt{\frac{MS_E}{nk}}\right] \qquad (88)$$
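A minimal sketch (ours, not from the source) of the intervals in Equations (86) and (87), checked against the fiber example (MSE = 123.43, n = 4 replicates, N − k = 15 d.f.):

```python
from math import sqrt
from scipy.stats import t

MSE, n, df = 123.43, 4, 15
t_crit = t.ppf(1 - 0.05 / 2, df)             # t_{alpha/2, N-k} = 2.131 for alpha = 0.05

def ci_level_mean(xbar_i):
    half = t_crit * sqrt(MSE / n)            # Equation (86)
    return xbar_i - half, xbar_i + half

def ci_difference(xbar_i, xbar_j):
    half = t_crit * sqrt(2 * MSE / n)        # Equation (87)
    d = xbar_i - xbar_j
    return d - half, d + half

print(ci_difference(489.75, 602.25))         # approximately (-129.2, -95.8), as in the text
```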

Rejecting the null hypothesis in the fixed effects model of the ANOVA implies that there are differences between the k levels, but the exact nature of the differences is not specified. To treat this question, two procedures are used: orthogonal contrasts and multiple comparison tests.
Case 1. Orthogonal contrasts: For example, with the data of Table 14, we would like to test the hypothesis H0: μ4 = μ5. The linear relation related to this hypothesis is x̄4 − x̄5 = 0; this linear combination is called a contrast. A contrast is tested by comparing its sum of squares to the mean square error. The resulting statistic would be distributed as an F, with 1 and N − k d.f. Each contrast is defined by the coefficients of the linear combination, in the previous case (0,0,0,1,−1). Two contrasts C = (c1,c2,...,ck) and D = (d1,d2,...,dk) are orthogonal if $\sum_{i=1}^{k} c_i d_i = 0$. There are many ways to choose the orthogonal contrast coefficients for a set of levels. Usually, something in the experiment should suggest which comparisons will be of interest. In order to show the procedure, we raise with the data of Table 14 a fictitious case of orthogonal contrasts with a purely didactic purpose; in each problem, its peculiarities and the prior knowledge of the analyst will suggest the

Table 17 ANOVA table with orthogonal contrasts for composition of fibers for SPME

Source                            Sum of squares   d.f.   Mean squares   Fcalc

Between fibers                    56 551.3         4      14 137.8       114.54
C1: μ4 = μ5                       23 981.0         1      23 981.0       194.28
C2: μ1 + μ3 = μ4 + μ5             10 668.0         1      10 668.0       86.43
C3: μ1 = μ3                       153.1            1      153.1          1.24
C4: 4μ2 = μ1 + μ3 + μ4 + μ5       21 550.0         1      21 550.0       174.59
Error (within fibers)             1 851.5          15     123.4
Total                             58 402.8         19

contrasts to be studied. The comparisons between the means per fiber and the associated orthogonal contrasts proposed are

H0: μ4 = μ5                        C1 = −x̄4 + x̄5
H0: μ1 + μ3 = μ4 + μ5              C2 = x̄1 + x̄3 − x̄4 − x̄5
H0: μ1 = μ3                        C3 = x̄1 − x̄3
H0: 4μ2 = μ1 + μ3 + μ4 + μ5        C4 = −x̄1 + 4x̄2 − x̄3 − x̄4 − x̄5
The sum of squares associated with each contrast C is

$$SS_C = \frac{n\left(\sum_{i=1}^{k} c_i \bar{x}_i\right)^2}{\sum_{i=1}^{k} c_i^2} \qquad (89)$$

For example, SSC1 = 4 × (−1 × 601.00 + 1 × 491.50)²/2 = 23 981 with 1 d.f. These sums of squares are incorporated into the ANOVA table as shown in Table 17.
Now, to test each of the hypotheses written in the table, it suffices to compare the corresponding Fcalc with the critical value F0.05,1,15 = 4.54. Except for C3, all the contrasts are significant according to Equation (81). Thus, we should reject the hypothesis that fibers 4 and 5 give the same recovery. The hypothesis that the mean of the recovery for fibers 1 and 3 is the same as for fibers 4 and 5 is also rejected. Also, fiber 2 differs significantly from the mean of the other four, whereas there is no experimental evidence to reject that fibers 1 and 3 provide the same recovery. It is evident that the possibilities of the analysis of the experimental results are very wide.
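The following sketch (ours) reproduces the contrast test of Equation (89) for C1: μ4 = μ5, using the per-fiber means of Table 14 and MSE = 123.43 with 15 d.f.; the function name is our own.

```python
from scipy.stats import f

means = [489.75, 602.25, 498.50, 601.00, 491.50]     # xbar_1 ... xbar_5 (Table 18)
n, MSE, df_e = 4, 123.43, 15

def contrast_F(c, means, n, MSE):
    num = n * sum(ci * m for ci, m in zip(c, means)) ** 2
    SS_C = num / sum(ci ** 2 for ci in c)            # Equation (89), 1 d.f.
    return SS_C, SS_C / MSE                          # F = MS_C / MSE

SS, Fcalc = contrast_F([0, 0, 0, -1, 1], means, n, MSE)
print(round(SS), round(Fcalc, 1), f.ppf(0.95, 1, df_e))   # ~23981, ~194, F_0.05,1,15 = 4.54
```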
Case 2. Comparison of several means: Many different methods have been described that were specifically
designed for the comparison of several means. Here, we will describe the method of Newman–Keuls. The
hypothesis test is the following:
H0 : All the differences two by two are equal to zero
H1 : At least one difference is nonnull

Table 18 Newman–Keuls procedure for multiple comparison test; data of SPME fibers

Levels   Rank   Mean     Homogeneous groups

2        1      602.25   *
4        2      601.00   *
3        3      498.50       *
5        4      491.50       *
1        5      489.75       *

Table 19 Skeleton for using the corresponding tabulated values for the Newman–Keuls procedure

                              t: difference of ranks plus one
t = k                      t = k − 1                  ...   t = 2

x̄r(1) − x̄r(k)              x̄r(1) − x̄r(k−1)            ...   x̄r(1) − x̄r(2)
                           x̄r(2) − x̄r(k)              ...   x̄r(2) − x̄r(3)
                                                      ...   ...
                                                            x̄r(k−1) − x̄r(k)

qα(k, k(n − 1))            qα(k − 1, k(n − 1))        ...   qα(2, k(n − 1))

The subscript r(i) indicates the ith rank. k is the number of levels in the ANOVA and qα are the tabulated values at significance level α.

Table 20 Values of qα(t,ν), the upper percentage points of the Studentized range for α = 0.05

ν \ t   2        3       4       5       6       7       8       9       10

1 17.969 26.98 32.82 37.08 40.41 43.12 45.50 47.36 49.07


2 6.085 8.33 9.80 10.88 11.74 12.44 13.03 13.54 13.99
3 4.501 5.91 6.82 7.50 8.04 8.48 8.85 9.18 9.46
4 3.926 5.04 5.76 6.29 6.71 7.05 7.35 7.60 7.83
5 3.635 4.60 5.22 5.67 6.03 6.33 6.58 6.80 6.99
6 3.460 4.34 4.90 5.30 5.63 5.90 6.12 6.32 6.49
7 3.344 4.16 4.68 5.06 5.36 5.61 5.82 6.00 6.16
8 3.261 4.04 4.53 4.89 5.17 5.40 5.60 5.77 5.92
9 3.199 3.95 4.41 4.76 5.02 5.24 5.43 5.59 5.74
10 3.151 3.88 4.33 4.66 4.91 5.12 5.30 5.46 5.60
11 3.113 3.82 4.26 4.57 4.82 5.03 5.20 5.35 5.49
12 3.081 3.77 4.20 4.51 4.75 4.95 5.12 5.27 5.39
13 3.055 3.73 4.15 4.45 4.69 4.88 5.05 5.19 5.32
14 3.033 3.70 4.11 4.41 4.64 4.83 4.99 5.13 5.25
15 3.014 3.67 4.08 4.37 4.59 4.78 4.94 5.08 5.20
16 2.998 3.65 4.05 4.33 4.56 4.74 4.90 5.03 5.15
17 2.984 3.63 4.02 4.30 4.52 4.70 4.86 4.99 5.11
18 2.971 3.61 4.00 4.28 4.49 4.67 4.82 4.96 5.07
19 2.960 3.59 3.98 4.25 4.47 4.65 4.79 4.92 5.04
20 2.950 3.58 3.96 4.23 4.45 4.62 4.77 4.90 5.01

t is the difference of ranks plus one. ν is the degrees of freedom of MSE.


Adapted from Sachs, L. Applied Statistics. A Handbook of Techniques; New York: Springer-Verlag, 1982.

The procedure consists of the following steps:

1. To sort the means per level, x̄i, i = 1,2,...,k, in decreasing order, x̄r(1) ≥ x̄r(2) ≥ ... ≥ x̄r(k). The subindex r(i) refers to the rank of the corresponding mean, that is, the position that it occupies in the ordered list. For example, the means of Table 14 have the following ranks: r(1) = 2, r(2) = 4, r(3) = 3, r(4) = 5, and r(5) = 1, respectively. That means that the first one, which is 489.75, has rank 5, that is, it is in the fifth position in the decreasing ordered list. Table 18 shows the ordered means and the ranks in the second column.
2. To create a table with the differences between the means from greatest to lowest, with t equal to the difference of ranks plus one. Table 19 contains all the possible pairwise contrasts of the means, grouped in each column by the number t (number of means that separate them plus one) in the ordered list.
We wish to test each of the following hypotheses:

Table 21 Results of the Newman–Keuls test to data of SPME fibers

Contrast levels   Differences |x̄i − x̄j|   t   q0.05,t,15   Critical values, Rt      CR: x̄r(i) − x̄r(i+t−1) ≥ Rt

1–2               112.5                   5   4.37         4.37 × 5.555 = 24.27     Reject H0
1–3               8.75                    3   3.67         3.67 × 5.555 = 20.39     No evidence to reject H0
1–4               111.25                  4   4.08         4.08 × 5.555 = 22.66     Reject H0
1–5               1.75                    2   3.01         3.01 × 5.555 = 16.72     No evidence to reject H0
2–3               103.75                  3   3.67         3.67 × 5.555 = 20.39     Reject H0
2–4               1.25                    2   3.01         3.01 × 5.555 = 16.72     No evidence to reject H0
2–5               110.75                  4   4.08         4.08 × 5.555 = 22.66     Reject H0
3–4               102.5                   2   3.01         3.01 × 5.555 = 16.72     Reject H0
3–5               7.0                     2   3.01         3.01 × 5.555 = 16.72     No evidence to reject H0
4–5               109.5                   3   3.67         3.67 × 5.555 = 20.39     Reject H0

H0: the difference is null, x̄i − x̄j = 0; H1: x̄i − x̄j ≠ 0.



H0: x̄r(i) − x̄r(i+t−1) = 0
H1: x̄r(i) − x̄r(i+t−1) > 0

The statistic is

$$R_t = q_\alpha\!\left(t, k(n-1)\right)\sqrt{\frac{MS_E}{n}} \qquad (90)$$

The value of the statistic depends, as always, on the significance level α, on t, and on the d.f. N − k of MSE. Further, the first term in Rt changes with the difference of ranks, t; the corresponding values of qα(t, k(n − 1)) are written in the last row of Table 19, and some of their tabulated values are in Table 20. The critical region is

$$CR = \left\{ \bar{x}_{r(i)} - \bar{x}_{r(i+t-1)} \geq R_t \right\} \qquad (91)$$

The results obtained when applying the method of Newman–Keuls to the data of Example 16 are given in Table 21. The first column contains the means to be compared; for example, 1–2 indicates that the comparison is between x̄1 and x̄2. The second column contains the differences (without sign) between them. The values of t (difference of ranks plus one) are in the third column: as x̄1 has rank 5 and the rank of x̄2 is 1 (the ranks are in Table 18), the value of t is 5. With the value of q (Table 20) and Equation (90), the critical value Rt is computed for each of the tests. The critical value Rt allows the analyst to decide whether the estimated difference is significant or not. The decision of rejecting or not rejecting the null hypothesis is shown in the last column of Table 21.
Usually, the result of this multiple comparison is presented as in the last column of Table 18. The aligned symbols '*' indicate that the corresponding means are all equal two by two: for example, the means x̄2 and x̄4 on one side, and any pair among x̄1, x̄3, and x̄5 on the other side, as can be seen in Table 21. It is possible to conclude that there are two fiber groups; as far as the recovery is concerned, fibers 2 and 4 provide results that are significantly equal and greater than the others. The other group is made up of the other three, which are similar to each other but different from the two previous ones.
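A minimal sketch (ours) of the critical values Rt of Equation (90), computed with SciPy's studentized range distribution (available in SciPy ≥ 1.7) for the SPME data (MSE = 123.4, n = 4, 15 error d.f.):

```python
from math import sqrt
from scipy.stats import studentized_range

MSE, n, df_e = 123.4, 4, 15

def R_t(t, alpha=0.05):
    q = studentized_range.ppf(1 - alpha, t, df_e)   # q_alpha(t, k(n-1))
    return q * sqrt(MSE / n)                        # Equation (90)

for t in (2, 3, 4, 5):
    print(t, round(R_t(t), 2))   # roughly 16.7, 20.4, 22.7, 24.3, as in Table 21
```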

1.02.4.4 The Random Effects Model


In many cases, the factor of interest is a random variable as well, so that the chosen levels are in fact a sample of
this random variable and we want to extract conclusions about the population from which the sample comes.
For example, in the case of validating an analytical method, several laboratories will apply it to aliquot samples
so that it is possible to decide what part of the variability of the results is attributable to the change of laboratory
and what to the repetition of the procedure inside the same laboratory. These are the concepts of reproduci-
bility and repeatability. The same happens in the analytical control of processes: It is necessary to distribute the
variability observed between the one due to the measurement procedure and the one assignable to the process.
The linear statistical model is

$$x_{ij} = \mu + \tau_i + \varepsilon_{ij} \quad \text{with } i = 1,2,\ldots,k;\; j = 1,2,\ldots,n \qquad (92)$$

where τi and εij are independent random variables. Note that the model is identical in structure to the fixed effects case (Equation (74)), but the parameters have a different interpretation. If V(τi) = στ², then the variance of any observation is

$$V(x_{ij}) = \sigma_\tau^2 + \sigma^2 \qquad (93)$$

The variances in Equation (93) are called variance components, and the model, Equation (92), is called the components of variance or random effects model. To test hypotheses in this model, we require that the εij are NID(0,σ²), all of the τi are NID(0,στ²), and the τi and εij are independent.
The sum of squares equality SST = SSF + SSE still holds. However, instead of testing hypotheses about individual level effects, we test the hypothesis

H0: στ² = 0
H1: στ² > 0

If στ² = 0, all levels are identical; if στ² > 0, then there is variability between levels. Thus, under the null hypothesis, the ratio

$$F_{calc} = \frac{SS_F/(k-1)}{SS_E/(N-k)} = \frac{MS_F}{MS_E} \qquad (94)$$

is distributed as an F with k − 1 and N − k d.f. The expected values of MSF and MSE are

$$E(MS_F) = \sigma^2 + n\sigma_\tau^2 \qquad (95)$$

and

$$E(MS_E) = \sigma^2 \qquad (96)$$

Therefore, the critical region is

$$CR = \left\{ F_{calc} > F_{\alpha;k-1,N-k} \right\} \qquad (97)$$

1.02.4.5 Power of the ANOVA for the Random Effects Model


The power of the random effects ANOVA is obtained from

$$1 - \beta = \mathrm{pr}\left\{ F_{k-1,N-k} > \frac{F_{\alpha;k-1,N-k}}{\lambda^2} \right\} \qquad (98)$$

where $\lambda^2 = 1 + n\,\sigma_\tau^2/\sigma^2$. As σ² is usually unknown, we may either use a prior estimate or define the value of στ² that we are interested in detecting in terms of the ratio στ²/σ². An application to determine the number of replicates in a proficiency test can be seen in Example A15 of the Appendix.
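The following sketch (ours, not from the source) combines Equations (95)–(98): the variance components are estimated from the mean squares, and the power is evaluated with a central F distribution for a chosen ratio στ²/σ²; the numeric values in the usage line are only illustrative.

```python
from scipy.stats import f

def variance_components(MSF, MSE, n):
    """sigma^2 and sigma_tau^2 estimated from Equations (95) and (96)."""
    sigma2 = MSE
    sigma_tau2 = max((MSF - MSE) / n, 0.0)   # truncated at zero if MSF < MSE
    return sigma2, sigma_tau2

def power_random_effects(k, n, ratio, alpha=0.05):
    """Power for a given ratio sigma_tau^2 / sigma^2, Equation (98)."""
    df1, df2 = k - 1, k * (n - 1)
    f_crit = f.ppf(1 - alpha, df1, df2)
    lam2 = 1 + n * ratio
    return f.sf(f_crit / lam2, df1, df2)      # pr{F > F_crit / lambda^2}

print(power_random_effects(k=5, n=4, ratio=2.0))   # illustrative values only
```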

1.02.4.6 Confidence Intervals of the Estimated Parameters for the Random Effects Model
In general, the mean value per level x̄i does not have more statistical meaning than being a sample of the random factor. But sometimes, as in the case of proficiency tests, this mean value is of interest for each participant laboratory. The variance of the mean value per level is theoretically equal to V(x̄i) = στ² + σ²/n. From Equations (95) and (96), MSF/n (with k − 1 d.f.) estimates the variance of the mean per level. As a consequence, the 100 × (1 − α)% confidence interval is

$$\left[\bar{x}_i - t_{\alpha/2,k-1}\sqrt{\frac{MS_F}{n}},\; \bar{x}_i + t_{\alpha/2,k-1}\sqrt{\frac{MS_F}{n}}\right] \qquad (99)$$

When calculating the variance of the global mean, it is necessary to consider the variability contributed by the factor, as the factor always acts. For example, when evaluating an analytical method, results without the variability attributable to the factor laboratory are not conceivable. The variance of the overall mean is $V(\bar{x}) = \sum_{i=1}^{k} V(\bar{x}_i)/k^2$, which is estimated by MSF/nk, with k − 1 d.f., in such a way that the 100 × (1 − α)% confidence interval is

$$\left[\bar{x} - t_{\alpha/2,k-1}\sqrt{\frac{MS_F}{nk}},\; \bar{x} + t_{\alpha/2,k-1}\sqrt{\frac{MS_F}{nk}}\right] \qquad (100)$$

The ANOVA of random effects is a model of practical interest because it allows attributing real meaning to
many statements that seem evident. For example, when in a proficiency test the samples are distributed to
laboratories, it is insisted that they must be homogeneous. Strictly speaking, on most occasions it is impossible to assure homogeneity, but it is enough that the variability attributable to the change of sample is
significantly smaller than the one attributable to the procedure of analysis. This can be guaranteed by means of
an ANOVA of random effects.

1.02.5 Statistical Inference and Validation


1.02.5.1 Trueness
The trueness is a key concept; several international organizations are unifying its definition. For example, the
definition ‘‘The closeness of agreement between the average value obtained from a large series of test results
and an accepted reference value’’ has been adopted by the IUPAC (Inczédy et al.,11 Chapter 18). The definition
of the ISO7 exactly coincides with it, and it is the definition accepted by the European Union in the Decision
2002/657/EC3 as far as the operation of the analytical methods and the interpretation of results are concerned.
The trueness is usually expressed in terms of bias, which combines all the components of the systematic error, denoted by δ in Equation (1).
Decisions on the trueness of a method are problems of hypothesis testing on the central value of a distribution; in case the random error can be assumed to have zero mean, they are in fact tests on the mean because, according to Equation (1), the expected mean value for a series of measurements will be μ + δ and the problem is reduced to testing whether δ is zero or not (equivalently, to testing whether x̄ is significantly equal to μ). Which test to use depends only on the information available about the distribution of the random error – its type (normal, parametric, or unknown) and, in the case of normality, whether the variance is known or not. Some common cases are given below:
1. To decide whether an analytical procedure fulfills trueness by the use of a reference sample whose value is assumed to be true. If normal data with known variance σ² are supposed, then the tests of Section 1.02.3.2 will be used.
2. To decide whether an analytical procedure has a specifically positive (or negative) bias by the use of a reference sample whose value is assumed to be true. If normal data are assumed, the one-tail versions of Cases 1 and 2 of Section 1.02.3.2 will be of use.
When a hypothesis test is to be posed, one can think about omitting some known data, for example the variance. The effect is a loss of power, that is, with the same value of α and the same sample size, the probability of type II error will be greater. Said otherwise, if it is desired to keep the same power, larger sample sizes are needed to acquire the same experimental evidence; a calculation on this matter is in Case 2 of Section 1.02.3.2. The same applies to the use of one-tail tests (2, 3, 5, and 6 in Table 4) with respect to the corresponding two-tail tests (1 and 4), or to the use of nonparametric tests that do not impose any type of distribution a priori.
3. In other occasions, the question of trueness is considered comparatively between two methods: ‘To decide if
the difference in means between both, when they are applied to the same sample of reference, is significant or
not’. It is the two-tail test. The one-tail case is ‘to decide whether one has bias of specific sign against the other’.
In these tests (Section 1.02.3.4), two experimental means are compared, one coming from applying n1 times the
first method to the reference sample (aliquot parts) and the other one coming from applying n2 times the
second method. Under the normality assumption, we will have to know whether the variances of both methods
are known (tests 7 and 8 of Table 4) or it is necessary to estimate them from the sample. In addition, in the
second case we have to decide whether they are equal (tests 9 and 10 of Table 4) or different (Equation (53)).
4. Sometimes it is impossible to use reference samples that are similar enough; a solution is the use of the 'test on the difference of means with paired data' (Case 3 of Section 1.02.3.2). For example, suppose that we wish to introduce a new, faster, online method of analysis to indirectly determine the content of an analyte in the wastewater of a company, and we need to decide whether it maintains the trueness at the same level as the previous method, which is sufficiently proven.
Once the new method is ready and validated with reference samples, real samples have to be measured. The difficulty is that we cannot be sure about the amount that is to be found because this may vary from day to day. In order to eliminate the 'sample factor', paired designs are used: Both methods are applied to aliquot parts of the same sample, and two series of paired results x1i and x2i are obtained when applying the old and new methods, respectively. Computing the mean of each series makes no sense here because in that case we would be introducing variability due to the change of sample in each series. The correct procedure is to subtract the paired values, di = x1i − x2i; the differences are now caused exclusively by the change of method, and their mean estimates the bias attributable to the

new method. It will be enough to apply a test on the mean; thus, the normality and independence hypotheses must
be evaluated on the differences, as well as the standard deviation, which is also estimated with them.
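A minimal sketch (ours) of the paired-data comparison just described: the test on the mean is applied to the differences di. The numbers below are hypothetical placeholders, not data from the chapter.

```python
import numpy as np
from scipy.stats import ttest_rel

x_old = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.3])   # hypothetical old-method results
x_new = np.array([10.0, 11.8, 9.7, 12.0, 11.1, 11.2])   # hypothetical new-method results

t_stat, p_value = ttest_rel(x_old, x_new)   # equivalent to a one-sample t test on d_i = x_old - x_new
print(t_stat, p_value)                      # if p >= alpha, no evidence of bias between the methods
```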
This test for paired data is frequently used to evaluate the improvement achieved in a procedure by a
technical variation, as is the case of Example 9. The effect of the change on the trueness must always be
evaluated in the range of concentrations in which one is going to use the procedure. An alternative to the use of
this test is the analysis of the pairs of data by a linear regression; in this case, the regression method used should
consider the existence of error in both axes (see Section 1.05.10).
The hypotheses of normality (Section 1.02.3.8), independence of the determinations and, if necessary, equality of variances will have to be tested with the appropriate tests (Section 1.02.3.7).
Also, it is important to remember that the presence of outlier data tends to greatly increase the variance, so that the tests become insensitive, that is to say, more experimental evidence is needed to reject the null hypothesis. The nonparametric alternative has, in general, a high cost in terms of power for the same significance level (or in terms of sample size). For this reason, it should not be used unless it is strictly necessary. In addition, some nonparametric tests also assume hypotheses about the distribution of the values, for example, that it is symmetric or unimodal.

1.02.5.2 Precision
The other very important criterion in the validation of a method is the precision. In the ISO 5725,7 the IUPAC
(Inczédy et al.,11 Sections 2 and 3), and the European norm,3 we can read ‘‘Precision, the closeness of agreement
between independent test results obtained under stipulated conditions’’.
Precision is usually expressed as imprecision: the smaller the dispersion of the random component in Equation (1), the more precise the procedure. It must be remembered that precision depends solely on the distribution of the random errors and is not related to the reference value or the value assigned to the sample. In a first approach, it is computed as a standard deviation of the results; nevertheless, even ISO 5725-5 recommends the use of a robust estimation.
Two measures, limits in a certain sense, of the precision of an analytical method are the reproducibility and
the repeatability.
Repeatability is defined as ‘‘precision under repeatability conditions. Repeatability conditions means condi-
tions where independent test results are obtained with the same method on identical test items in the same
laboratory by the same operator using the same equipment with short intervals of time’’.
The repeatability limit, r, is the value below which the absolute value of the difference between two test results obtained under repeatability conditions lies with a probability of 95%. The repeatability limit is given by

$$r = z_{\alpha/2}\,\sqrt{2}\,\sigma_r \qquad (101)$$

where zα/2 is the critical value of the standard normal distribution and σr is the repeatability expressed as standard deviation.
Reproducibility is defined as ‘‘precision under reproducibility conditions. Reproducibility conditions means
conditions where test results are obtained with the same method on identical test items in different laboratories
with different operators using different equipment’’.
The reproducibility limit, R, defined in Equation (102), is the value below which the absolute value of the difference between two test results obtained under reproducibility conditions lies with a probability of 95%.

$$R = z_{\alpha/2}\,\sqrt{2}\,\sigma_R \qquad (102)$$

where σR is the standard deviation computed under reproducibility conditions. If n < 10, a correction factor7 should be applied to Equation (6) when σr (or σR) is estimated.
The ISO introduces the concept of intermediate precision when only some of the factors described in the reproducibility conditions are varied. A particularly interesting case is when the 'internal' factors of the laboratory (analyst, instrument, day) are varied, which in the Commission Decision3 is called intralaboratory reproducibility.

One of the causes of the ambiguity when defining precision is the laboratory bias. When the method is
applied only in a laboratory, the laboratory bias is a systematic error of that laboratory. If the analytical method
is evaluated in general, the laboratory bias becomes a part of the random error: to change the laboratory
contributes to the variance, which is expected for a determination done with that method in any laboratory.
The most eclectic position is the one described in the ISO 5725 that declares ‘‘The laboratory bias is
considered constant when the method is used under repeatability conditions, but is considered as a random
variable if series of applications of the method are made under reproducibility conditions.’’
On the basis of these points, we can deduce that to evaluate the precision of an analytical method is equivalent to estimating the variance of the random error in the results, and that the discrepancies that can appear when establishing the sources of variability must be explicitly identified, for example the laboratory bias.
The precision of two methods can be compared by a hypothesis test on the equality of variances, which under the normality assumption is a (Snedecor) F-test (Section 1.02.3.6).
Another usual problem is to decide whether the observed variance can be considered significantly equal or not to an external value, using a χ² test (Section 1.02.3.3).
It is common that the lack of control on a concrete aspect of an analytical procedure is the origin of a great
variability. If the experimental conditions are not stable, we will have an additional variability in the
determinations. The F-test permits one to decide whether the precision improves significantly when an
optimization is carried out.
In fact, many improvements in the procedures are due to the identification of the causes of the variability in the results and their quantification. This aspect of control and improvement of the precision is dealt with in some detail in the section dedicated to the ruggedness of chemical analyses.
The technique used to construct an ANOVA with random effects (Section 1.02.4.4) is also the adequate technique to split the variance of each experimental result into addends that are specially adapted to estimate the repeatability and the reproducibility of an analytical method when an interlaboratory test comparison has been carried out.
Figure 5 ANOVA of random effects for an interlaboratory study. The scheme shows, from bottom to top: the determinations xij = μ + Δi + εij made by each of the k laboratories (j = 1,...,n); the mean x̄i and variance si² estimated per laboratory; the pooled (intralaboratory) variance sp² = [(n − 1)s1² + (n − 1)s2² + ... + (n − 1)sk²]/[k(n − 1)], with k(n − 1) degrees of freedom, which estimates V(ε), so that sp estimates the repeatability, sr; the variance of the means, $s_{\bar{x}}^2 = \sum_{i=1}^{k}(\bar{x}_i - \bar{x})^2/(k-1)$, with k − 1 degrees of freedom, so that $n\,s_{\bar{x}}^2$ estimates V(ε) + nV(Δ) and thus the reproducibility is $s_R = \sqrt{V(\varepsilon) + V(\Delta)}$; if the variability due to the change of laboratory is not significant, it also estimates the intralaboratory variance.

In the following, the use of an ANOVA to estimate reproducibility and repeatability in a proficiency test is briefly explained.
There is no doubt that a good analytical procedure has to be insensitive to the change of laboratory. To
decide about this question, k laboratories apply a procedure to aliquot samples; each laboratory makes n
determinations. In the terminology of the ANOVA, we have a random factor (the laboratory) at k levels and n
replicates in each level. In general, it is not necessary to have the same number of replicates in all the levels.
We denote by xij the experimental results, where i ¼ 1,. . .,k identifies the laboratory and j ¼1,. . .,n the
replicate.
Figure 5 is a skeleton of Equations (93)–(96) and shows how to compute an estimate of the variance of the random variable ε in Equation (93). If the analytical procedure is well defined, the k estimates si² are expected to be approximately equal and to gather the variability due to the use of the analytical method by only one laboratory. In these conditions, the pooled variance sp² is a joint ('pooled') estimate of the same variance, that is, by definition, the repeatability of the method expressed as standard deviation (ISO 5725)

$$s_r = \sqrt{V(\varepsilon)} \approx s_p \qquad (103)$$

From the same data we can obtain k estimates of the bias Δi (Figure 5, top) and then the variance of the laboratory bias, considering this bias as a random variable. Taking into account the quantities estimated by the variances described in Figure 5, one obtains the following expression for the interlaboratory variance:

$$V(\Delta) \approx s_{\bar{x}}^2 - \frac{s_p^2}{n} \qquad (104)$$

which, linked to Equation (1), provides the following estimate of the reproducibility as standard deviation (ISO 5725):

$$s_R \approx \sqrt{V(\Delta) + V(\varepsilon)} \qquad (105)$$
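A minimal sketch (ours) of the computations summarized in Figure 5 and Equations (103)–(105) for a balanced table of k laboratories × n replicates; the data array below is a hypothetical placeholder.

```python
import numpy as np

data = np.array([            # rows = laboratories, columns = replicates (hypothetical values)
    [10.1, 10.3, 10.2, 10.4],
    [10.6, 10.5, 10.7, 10.6],
    [ 9.9, 10.0, 10.1,  9.8],
])
k, n = data.shape

s2_i = data.var(axis=1, ddof=1)           # variance per laboratory
sp2 = s2_i.mean()                         # pooled (repeatability) variance, balanced case
s2_means = data.mean(axis=1).var(ddof=1)  # variance of the laboratory means
V_delta = max(s2_means - sp2 / n, 0.0)    # Equation (104), truncated at zero

s_r = np.sqrt(sp2)                        # Equation (103)
s_R = np.sqrt(V_delta + sp2)              # Equation (105)
print(round(s_r, 3), round(s_R, 3))
```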

In the ANOVA, the null hypothesis is that Δ1 = Δ2 = ... = Δk = 0 (i.e., there is no effect of the factor), and the alternative is that at least one laboratory has nonnull bias (there is an effect of the factor).
The conclusion of the ANOVA is obtained by deciding whether the two variances, $n\,s_{\bar{x}}^2$ and sp², can be considered significantly equal. To decide this, an F-test is applied. The logic is clear: If there is no laboratory effect, V(Δ) should be significantly zero and thus both variances are equal; in other words, they estimate the same quantity.
In practice, the expression for the computation of the power of the ANOVA with random effects (Equation (98)) is useful in deciding the number of laboratories, k, that should participate and the number of replicated determinations, n, that each one must make.
It is essential to remember that an ANOVA requires normal distribution of the residuals and equality of the variances s1², s2², ..., sk².
When the number of replicates is two (n = 2), a common way of carrying out the interlaboratory analysis is to use the Youden graph50 to show the trueness and precision of each laboratory. Actually, Youden's graph is nothing other than the graphical representation of an ANOVA shown in Kateman and Pijpers51 (p 112). In addition to being used for comparing the quality of the laboratories, the Youden graph can be used to compare two methods of analysis in terms of their laboratory bias.
An approach for the comparison of two methods in the intralaboratory situation has been proposed by Kuttatharmmakul et al.52 Instead of the reproducibility, as included in Figure 5 and the ISO guidelines, the (operator + instrument + time)-different intermediate precision is considered in the comparison.
In the case of precision, the effect of outlier data is really devastating; hence, a careful analysis to detect those outliers is essential. In general, more than one test is needed (usual ones are those of Dixon, Grubbs, and Cochran), especially to accept the hypotheses of the ANOVA made for the determination of
repeatability and reproducibility. In view of the difficulties, the AMC5,6 advises the use of robust methods to
evaluate the precision and trueness and for proficiency testing. This path is also followed in the new ISO norm
about reproducibility and repeatability.

1.02.5.3 Statistical Aspects of the Experiments to Determine Precision


The analysis of data implies three steps:
1. Critical examination of the data, in order to identify outliers or other irregularities, and to verify the
suitability of the model.
2. To compute for each level of concentration the preliminary values of precision and the mean values.
3. To establish the final values of precision and means, including the establishment of a relation between
precision and the level of concentration when the analysis indicates that such relation can exist.
The analysis includes a systematic application of statistical tests for detecting outliers, and a great variety of
such tests are available from the literature and could be used for this task.

1.02.5.4 Consistency Analysis and Incompatibility of Data


From the data collected at a specific number of levels, a decision must be taken about certain individual results or values that seem to be 'different' from those of the rest of the laboratories or that can distort the estimates. Specific tests are used for the detection of these outlying numerical results.
Case 1. Elimination of data: This is the classic procedure based on detecting and, if this is the case, eliminating the outlier data. The tests are of two types. The test of Cochran is related to the interlevel variability of the factor and should be applied first; its objective is to detect an anomalous variance in one or several of the levels of the factor. The test of Cochran has already been described in Section 1.02.3.7.
Later, the test of Grubbs is applied. It is basically a test on the intralevel variability to discover possible outlying individual data. It can be used (if ni > 2) for those levels in which the test of Cochran has led to the suspicion that the interlevel variation is attributable to an individual result. It is applied in two stages:
1. Detection of a unique outlying observation (single Grubbs' test)
In a data set xi (i = 1,2,...,n) sorted in increasing order, to test whether the greatest observation, xn, is incompatible with the rest, the following statistic is computed:

Table 22 Critical values for Grubbs’ test

          One largest or one smallest        Two largest or two smallest

n         α = 0.05      α = 0.01             α = 0.05      α = 0.01

4 1.481 1.496 0.0002 0.0000


5 1.715 1.764 0.0090 0.0018
6 1.887 1.973 0.0349 0.0116
7 2.020 2.139 0.0708 0.0308
8 2.126 2.274 0.1101 0.0563
9 2.215 2.387 0.1492 0.0851
10 2.290 2.482 0.1864 0.1150
11 2.355 2.564 0.2213 0.1448
12 2.412 2.636 0.2537 0.1738
13 2.462 2.699 0.2836 0.2016
14 2.507 2.755 0.3112 0.2280
15 2.549 2.806 0.3367 0.2530
16 2.585 2.852 0.3603 0.2767
17 2.620 2.894 0.3822 0.2990
18 2.651 2.932 0.4025 0.3200
19 2.681 2.968 0.4214 0.3398
20 2.709 3.001 0.4391 0.3585

Adapted with permission from ISO 5725-2. Accuracy, Trueness and Precision of Measurement Methods and Results; Genève, 1994; p. 22, Table 5.

$$G_{n,calc} = \frac{x_n - \bar{x}}{s} \qquad (106)$$

On the contrary, to verify whether the smallest observation, x1, is significantly different from the rest, the statistic G1 is computed as

$$G_{1,calc} = \frac{\bar{x} - x_1}{s} \qquad (107)$$

In Equations (106) and (107), x̄ and s are, respectively, the mean and standard deviation of the xi.
To decide whether the greatest or smallest value is significantly different from the rest at the α% significance level, the values obtained in Equations (106) and (107) are compared with the corresponding critical values written down in Table 22.
The decision includes two 'anomaly levels':
(a) If Gi,calc < G0.05,i, with i = 1 or i = n, accept that the corresponding x1 or xn is similar to the rest.
(b) If G0.05,i < Gi,calc < G0.01,i, with i = 1 or i = n, the corresponding x1 or xn is considered a straggler.
(c) If G0.01,i < Gi,calc, with i = 1 or i = n, the corresponding x1 or xn is incompatible with the rest of the data of the same level (statistical outlier).
2. Detection of two outlying observations (double Grubbs' test)
Sometimes it is necessary to verify that two extreme data (very large or very small) incompatible with the others do not exist. In the case of the two greatest observations, xn and xn−1, the statistic G is computed as

$$G = \frac{s_{n-1,n}^2}{s_0^2} \qquad (108)$$

where $s_0^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2$ and $s_{n-1,n}^2 = \sum_{i=1}^{n-2}\left(x_i - \frac{1}{n-2}\sum_{j=1}^{n-2} x_j\right)^2$.
Similarly, it is possible to decide jointly on the two smallest observations, x1 and x2, by means of the following statistic:

$$G = \frac{s_{1,2}^2}{s_0^2} \qquad (109)$$

Table 23 Data of Example 17

Series 1 Series 2 Series 3 Series 4 Series 5

13.50 13.50 13.70 13.04 13.48


13.40 13.51 13.71 13.03 13.47
13.47 13.35 13.76 15.93 13.92
13.49 13.35 13.80 13.04 13.46

Table 24 Robust and nonrobust estimates of the centrality and dispersion parameters
(data of Table 23)

With all data (n ¼ 20) Without 15.93 (n ¼ 19)

Nonrobust procedures
Mean, x 13.60 13.47
Standard deviation, s 0.60 0.25
Robust procedures
Median 13.49 13.48
H15, centrality parameter 13.50 13.48
MAD 0.26 0.21
H15, dispersion 0.27 0.24

where $s_{1,2}^2 = \sum_{i=3}^{n}\left(x_i - \frac{1}{n-2}\sum_{j=3}^{n} x_j\right)^2$.
The decision rule is analogous to that of the case of an extreme value but with the corresponding critical
values in Table 22.
In general, the norms, for example ISO 5725,7 propose to inspect the origin of the anomalous results and, if no assignable cause exists, to eliminate the incompatible ones and keep the stragglers, indicating their condition with an asterisk.

Example 17:
With the didactic purpose of applying the test of Grubbs and verifying the effect of outliers, the data of Table 23 have been considered as a unique series of 20 results.

The greatest value is 15.93 and the lowest is 13.03; as s = 0.60 and x̄ = 13.60, Equation (106) gives G20,calc = 3.889 and Equation (107) gives G1,calc = 0.942. By consulting the critical values in Table 22, G0.05,20 = 2.709 and G0.01,20 = 3.001; therefore, according to the decision rule in Case 1 (single Grubbs' test), the value 15.93 should be considered different from the rest.
Applying the test again, with 19 data, the greatest value is now 13.92 and the lowest is still 13.03, with G19,calc = 1.804 and G1,calc = 1.785. As the tabulated values are G0.05,19 = 2.681 and G0.01,19 = 2.968, there is no evidence to say that either of the extreme values is different from the rest. Table 24 contains the mean and standard deviation, with and without the value 15.93. A large effect is observed on the standard deviation, which is reduced by more than 50%.
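The single Grubbs' statistics of Example 17 can be reproduced with a few lines (a sketch of ours); the 20 values are those of Table 23, read series by series.

```python
import numpy as np

x = np.array([13.50, 13.40, 13.47, 13.49,    # series 1
              13.50, 13.51, 13.35, 13.35,    # series 2
              13.70, 13.71, 13.76, 13.80,    # series 3
              13.04, 13.03, 15.93, 13.04,    # series 4
              13.48, 13.47, 13.92, 13.46])   # series 5

mean, s = x.mean(), x.std(ddof=1)
G_max = (x.max() - mean) / s      # Equation (106): about 3.89 for 15.93
G_min = (mean - x.min()) / s      # Equation (107): about 0.94 for 13.03
print(round(G_max, 3), round(G_min, 3))
# Compare with the critical values of Table 22 for n = 20: 2.709 (alpha = 0.05) and 3.001 (alpha = 0.01)
```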
The test of Grubbs can also be applied to the mean values by level. In practice, the test of Grubbs is also used
to restore the equality of variances in the ANOVA when the homogeneity of variances is rejected (Section
1.02.3.7). The work by Ortiz et al.43 contains a complete analysis with sequential application of the tests of
Cochran, Bartlett, and Grubbs.
Case 2. Robust methods: The procedure described in the previous section is focused on the detection of anomalous data within a set of results. Nevertheless, the elimination of these data is not recommendable when the variability of the analytical procedure is to be evaluated, because the procedure is sensitive to the values present, that is to say, it depends on the data that have been eliminated (Equations (106)–(109) can lead to elimination of data in successive stages because of the reduction of the variance), and because the attainable real variance is underestimated.
As previously indicated, the values of repeatability (r) and reproducibility (R) are determined by means of an ANOVA whose validity depends on whether the hypotheses of normality and homogeneity of variances are fulfilled. The robust methodology proposed in this section avoids these limitations. Its technical details can be found in Hampel et al.53 and Huber.54

Figure 6 The function Ψm,s,c(x) of Equation (110): values below m − cs are raised to m − cs, values above m + cs are lowered to m + cs, and values in between are left unchanged.
An alternative to the procedures based on the elimination of outlier data, as exposed in the ISO 5725-5 norm, consists in using the H15 estimator proposed by Huber (c = 1.5 and 'Proposal 2 Scale', Huber54), recommended by the Analytical Methods Committee5,6 and accepted in the Harmonized Protocol.55 It is an estimator whose influence function is monotone and limits the influence of the anomalous data by 'moving them' toward the position of the majority, while maintaining for them the maximum (bounded) influence. This is carried out by transforming the original data by means of the function

$$\Psi_{m,s,c}(x) = \max\left[m - cs,\; \min(m + cs,\, x)\right] \qquad (110)$$

where m and s are the central and dispersion parameters, which must be iteratively estimated. The function in Equation (110) is represented in Figure 6.
The estimate is exactly the generalization of the maximum-likelihood estimate. It is asymptotically optimal for high-quality data, that is, data with little contamination and not very different from data following a Student's t distribution with three d.f. Remember that Hampel et al.53 have shown that Student's t distributions with between 3 and 9 d.f. reproduce high-quality experimental data, and that for t3 the efficiency of the mean and of the standard deviation is 50 and 0%, respectively. Therefore, in practice there is a need for robust estimates even with high-quality empirical data (such as those provided by present analytical methods).
The H15 estimator provides enough protection against a high concentration of data that are abnormally large but near to the correct data. Nevertheless, the clearly anomalous data are not rejected by the H15 estimator; they maintain the maximum, though bounded, influence. This produces an avoidable loss of efficiency of the H15 estimator of between 5 and 15% when the proportion of anomalous data present is also between 5 and 15% (rather usual percentages in routine analyses). In order to avoid this limited weakness, robust estimators such as the median and the median of absolute deviations (MAD) (Equation (111)) are necessary, at least in the first step of the calculation, to reliably identify most of the 'suitable' data.

$$\mathrm{MAD} = \frac{\mathrm{median}\,\left|x_i - \mathrm{median}\{x_i\}\right|}{0.6745} \qquad (111)$$
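A minimal sketch (ours, not the ISO/AMC algorithm verbatim) of an iterative Huber-type estimate with c = 1.5: the data are transformed with Equation (110) and m and s are re-estimated until convergence, starting from the median and the MAD of Equation (111). The scale correction constant 1.134 used below is the one commonly quoted for c = 1.5 and is an assumption here.

```python
import numpy as np

def huber_h15(x, c=1.5, tol=1e-6, max_iter=100):
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    s = np.median(np.abs(x - m)) / 0.6745          # starting dispersion: MAD, Equation (111)
    for _ in range(max_iter):
        x_star = np.clip(x, m - c * s, m + c * s)  # Equation (110)
        m_new = x_star.mean()
        s_new = 1.134 * x_star.std(ddof=1)         # assumed consistency factor for c = 1.5
        if abs(m_new - m) < tol and abs(s_new - s) < tol:
            break
        m, s = m_new, s_new
    return m_new, s_new

# Usage: huber_h15(values_of_table_23) can be compared with the H15 estimates reported in Table 24.
```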
The robust procedure obtained when adapting the H15 estimator to the problem of the estimation of repeatability and reproducibility as posed in the ISO norm consists of two stages, and it has been followed in a way identical to the proposal in Sanz et al.56 As in the parametric procedure, it uses the mean of the table of data and the standard deviation. Therefore, once the robust procedure is applied, the data necessary to estimate the reproducibility or the intermediate precision are at hand.
In order to verify the utility of these robust procedures, with the same data of Table 23 considered as a unique series of 20 values, the median and the centrality parameter of the H15 estimator have been written down in Table 24. These are very similar to the nonrobust estimates, both with 20 and with 19 data. Nevertheless, when comparing the robust parameters of dispersion, MAD and H15, they do not differ when considering 20 or 19 data and are similar to the standard deviation obtained after applying the method of

Table 25 Robust and nonrobust estimates of the repeatability and reproducibility with data of Table 23

ANOVA                      With all data (n = 20)   Without 15.93 (n = 19)   Without series 4 and 13.92 (n = 15)

Nonrobust procedure
Fcalc (p-value)            0.22 (0.92)              17.02 (<5 × 10−5)        24.91 (<5 × 10−5)
SSF (d.f.)                 0.094 (4)                0.229 (4)                0.083 (3)
SSE (d.f.)                 0.431 (15)               0.013 (14)               0.003 (11)
sR                         0.657                    0.260                    0.153
sr                         0.657                    0.116                    0.058
p-value, Cochran's test    8.9 × 10−9               0.001                    0.093
p-value, Bartlett's test   3.9 × 10−8               0.001                    0.100
p-value, Levene's test     0.53                     0.61                     0.005
Robust procedures
Robust sR                  0.281                    0.172
Robust sr                  0.072                    0.072

Grubbs and repeating the calculations without the outlier. For this reason, it is a good strategy to systematically apply robust procedures together with the classic ones. A difference in the results is an indication of the presence of outlier data, in which case the robust estimations will have to be used.
The effect, and therefore the advantage, of the robust procedures is much more remarkable when an ANOVA of random effects is evaluated, for example, to estimate the reproducibility and repeatability of a method by means of an interlaboratory test such as the one described in Figure 5. To show this, we will use the data of Table 23, this time considering its structure of levels of the factor (k = 5) and replicates (n = 4).
The values of reproducibility and repeatability should not be accepted if the homogeneity of variances assumption of the ANOVA is not fulfilled. In this case, it is necessary to verify whether some of the levels have outlier data. The first column of Table 25 shows that the ANOVA with all the data is not acceptable because of the lack of equality of variances (rejection in the tests of variance homogeneity). In addition, it is observed that the anomaly in the data causes the estimates sR and sr to be equal and very different from the robust estimates.
Once the value 15.93 of series 4 is removed, the ANOVA (column 2 of Table 25) is significant but there is still a lack of variance homogeneity. Nevertheless, the new estimates of sR and sr are more similar to those obtained with the robust procedure.
The lack of equality of variances forces one to eliminate series 4, which has a very different variance (smaller than the others), and later the value 13.92 of series 5. The final result of this sequential process is in the third column of Table 25. The ANOVA is significant, the homogeneity of variances can be accepted, and the estimates of the reproducibility and the repeatability are 0.153 and 0.058, similar to the ones that would be obtained with the robust procedure without series 4. The values sR and sr can be too small because of the elimination of data, with the risk of unrealistic underestimates that the laboratories cannot fulfill. For this reason, it is advised to avoid reducing the sample and to maintain the initial robust estimates.5–7
As the presence of outliers in the experimental work is unavoidable, the robust statistical methodology has
been consolidated as an essential tool in the chemical analysis. Also, Chapter 1.07 of this book is dedicated to
robust statistical techniques.

1.02.5.5 Accuracy
According to the IUPAC (Inczédy et al.,11 Sections 2–3), the ISO,7 and the Directive of the European Union (Commission Decision,3 Definition 1.1), accuracy is a concept defined as ''Closeness of agreement between a test result and the accepted reference value''. It is estimated by determining trueness and precision. Evidently, this definition collects together the systematic and random errors, because for an individual determination xi − μ = (xi − x̄) + (x̄ − μ) = ε + δ.
In practice, it is unreasonable to think that an analytical procedure has no bias; experimentally we can decide about the hypothesis of null bias. If it is significant, it is possible to correct the measurement by subtracting the value δ. However, this implies an increase in the variance of the final result because δ is estimated from experimental replicates and therefore has uncertainty. For this reason, when the uncertainty of a measurement is expressed, it is usual to include a term to account for the bias in a form similar to Equation (105). For a detailed treatment of this question, consult the guide of the EURACHEM/CITAC.1

1.02.5.6 Ruggedness
The ruggedness of a method is defined as its capacity to maintain trueness and precision throughout the time.
The same applies for the robustness of a material of reference or any other reagent.
Ruggedness means susceptibility of an analytical method to changes in experimental conditions, which can
be expressed as a list of sample materials, analytes, storage conditions, environmental conditions, and/or sample
preparation conditions under which the method can be applied as presented or with specified minor
modifications.3
The study on ruggedness can be approached using two different statistical methodologies, one of which
consists of using the well-known control charts (they are confidence intervals on the mean, the variance, or the
range of the measured parameter) and continuously writing down the results obtained on known samples
throughout the time. This type of ‘a posteriori’ control is essential to maintain the quality (precision, trueness,

Table 26 Experimental factors with nominal (+) and extreme (−) levels selected for a Plackett–Burman design for seven factors

                                         Level
Factor (units)                           −              +

x1: buffer solution (ml)                 1.0            1.5
x2: pH                                   4.5            4.8
x3: extracting agent (ml)                20             25
x4: extraction cycles                    One            Two
x5: petroleum benzin (ml)                20             25
x6: volume of elution (ml)               7              8
x7: evaporation                          To dryness     Evaporated until 1 ml

capability of detection, etc.) of a measurement method and to establish alarm mechanisms when the observed
drift can alter the quality of the procedure affecting the value of the analytical results. Chapter 1.04 of this book
deals specifically with control charts.
The other approach to the problem of ruggedness involves evaluating ‘a priori’ the variability expected in the
analytical procedure and identifying the sources of that variability.
Before routinely using a procedure, the effect of small changes in the reagents, in the working conditions, or in the specifications of its protocol should have been verified. It can happen that small changes in the volume of extracting reagent do not lead to great variations in the response, whereas a small variation in, say, pH does.
One way of knowing and controlling this quality criterion is by making small changes in the potentially influential factors and observing the effect on the response.
The influence of each factor should not be analyzed separately, because methodologically this is not adequate and, in addition, it is not realistic: in practice, unforeseeable combinations of all the factors will occur that can affect the results. Instead, the methodology of the design of experiments should be used; details about this are in Chapters 1.09–1.15 of this collection. As the number of factors that potentially affect the response is

Table 27 Plackett–Burman design for the seven factors of Table 26

Factor Response (y)

Run x1 x2 x3 x4 x5 x6 x7 SDZ SMT SMP

1 1 1 1 1 1 1 1 10.50 10.50 10.50


10.30 10.30 10.30
2 1 1 1 1 1 1 1 5.87 6.31 8.91
7.56 7.74 8.72
3 1 1 1 1 1 1 1 7.55 8.88 7.06
8.08 11.01 8.23
4 1 1 1 1 1 1 1 3.82 6.58 6.11
5.61 7.20 7.53
5 1 1 1 1 1 1 1 6.74 10.00 9.61
8.19 10.63 10.38
6 1 1 1 1 1 1 1 6.89 5.35 4.29
7.63 6.99 6.40
7 1 1 1 1 1 1 1 8.21 7.86 10.24
8.70 8.56 9.94
8 1 1 1 1 1 1 1 8.56 6.95 7.02
5.85 8.11 7.96

The three responses are the areas by duplicate under the chromatographic peak (in a.u.) of sulfadiazine (SDZ), sulfamethazine
(SMT), and sulfamethoxypyridazine (SMP).

Table 28 Estimated coefficients of the linear model (Equation (112)) fit for each sulfonamide by means of a Plackett–Burman design

          Sulfadiazine               Sulfamethazine             Sulfamethoxypyridazine
          Coefficient   p-value      Coefficient   p-value      Coefficient   p-value

b0        7.504         <0.0001a     8.309         <0.0001a     8.325         <0.0001a
b1        −0.093        0.726        0.252         0.274        0.095         0.635
b2        0.456         0.111        0.165         0.465        0.314         0.142
b3        1.030         0.004a       1.409         0.000a       1.208         0.000a
b4        0.690         0.020a       −0.021        0.924        0.874         0.002a
b5        0.666         0.039a       0.203         0.374        −0.605        0.014a
b6        −0.057        0.820        0.475         0.058        0.351         0.105
b7        0.204         0.447        −0.391        0.106        −0.161        0.426

a Significant factor at a 0.05 significance level.

large, highly fractionated two-level factorial designs have to be used (to decrease, for example, the 2⁷ = 128 different experiments that a complete factorial design for seven factors contains); Chapter 1.10 is dedicated to these strategies. Plackett–Burman designs and D-optimal designs have proven to be useful tools in ruggedness analysis.57–59

Example 18:
An analysis of the ruggedness of an extraction procedure for three sulfonamides is carried out. The seven factors considered are buffer solution, pH, methanol as extracting agent, extraction cycles, petroleum benzin, volume of elution, and evaporation. A Plackett–Burman design has been proposed to estimate the effects of the factors by fitting a linear model for each sulfonamide:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_7 x_7 \qquad (112)$$

where y represents the response to be modeled, which is the area under the corresponding chromatographic peak for each of the three sulfonamides. The details about the experimental domain can be seen in Table 26, where the nominal level is codified as '−' and the extreme level as '+'.
Table 27 shows the experimental runs and the two values (replicates) of the three responses.
Finally, Table 28 contains the estimated coefficients of model (112) and their p-values. The conclusion is that only the extracting agent (x3), the number of times the extraction is carried out (x4), and the volume of petroleum benzin (x5) are significant at the 5% level. Hence, special care should be taken with these factors because small changes in any of them can cause large changes in the response.
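To illustrate how the coefficients of a model such as Equation (112) can be estimated from a two-level screening design by least squares, here is a minimal sketch (ours). The design matrix is a generic 8-run Plackett–Burman layout and the responses are hypothetical placeholders, not the runs of Table 27 (whose signs are not reproduced here).

```python
import numpy as np

# One of several equivalent 8-run, 7-factor two-level Plackett-Burman layouts
X = np.array([
    [ 1,  1,  1, -1,  1, -1, -1],
    [-1,  1,  1,  1, -1,  1, -1],
    [-1, -1,  1,  1,  1, -1,  1],
    [ 1, -1, -1,  1,  1,  1, -1],
    [-1,  1, -1, -1,  1,  1,  1],
    [ 1, -1,  1, -1, -1,  1,  1],
    [ 1,  1, -1,  1, -1, -1,  1],
    [-1, -1, -1, -1, -1, -1, -1],
])
y = np.array([10.4, 6.7, 8.0, 4.7, 7.5, 7.3, 8.5, 7.3])   # hypothetical responses

design = np.hstack([np.ones((8, 1)), X])                  # add the intercept column for b0
b, *_ = np.linalg.lstsq(design, y, rcond=None)            # b0, b1, ..., b7
print(np.round(b, 3))
```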

Appendix
1. Some Basic Elements of Statistics
A distribution function (cumulative distribution function (cdf)) in R is any function F such that
1. F is an application from R to the interval [0,1]
2. $\lim_{x \to -\infty} F(x) = 0$
3. $\lim_{x \to +\infty} F(x) = 1$
4. F is a monotonically increasing function, that is, a ≤ b implies F(a) ≤ F(b).

5. F is continuous on the left or on the right. For example, F is continuous on the left if $\lim_{x \to a,\, x < a} F(x) = F(a)$ for each real number a.
Any probability defined in R corresponds to a distribution function and vice versa.
If p is the probability defined for intervals of real numbers, F(x) is defined as the probability accumulated up to x, that is, F(x) = p(−∞, x). It is easy to show that F(x) verifies the above definition of a distribution function.
If F is a cdf continuous on the left, its associated probability p is defined by

$$p[a,b] = p\{a \le x \le b\} = \lim_{x \to b,\, x > b} F(x) - F(a)$$
$$p(a,b] = p\{a < x \le b\} = \lim_{x \to b,\, x > b} F(x) - \lim_{x \to a,\, x > a} F(x)$$
$$p[a,b) = p\{a \le x < b\} = F(b) - F(a)$$
$$p(a,b) = p\{a < x < b\} = F(b) - \lim_{x \to a,\, x > a} F(x)$$

If the distribution function is continuous, then the above limits coincide with the value of the function in the
corresponding point. The probability density function f(x), abbreviated pdf, if it exists, is the derivative of the
cdf.
Each random variable X is characterized by a distribution function FX(x).
When several random variables are handled, it is necessary to define the joint distribution function.

$$F_{X_1,X_2,\ldots,X_k}(a_1,a_2,\ldots,a_k) = \mathrm{pr}\{X_1 \le a_1 \text{ and } X_2 \le a_2 \text{ and } \ldots \text{ and } X_k \le a_k\} \qquad (A1)$$

If the previous joint probability is equal to the product of the individual probabilities, it is said that the random variables are independent:

$$F_{X_1,X_2,\ldots,X_k}(a_1,a_2,\ldots,a_k) = \mathrm{pr}\{X_1 \le a_1\} \cdot \mathrm{pr}\{X_2 \le a_2\} \cdots \mathrm{pr}\{X_k \le a_k\} \qquad (A2)$$

Equations (3) and (4) define the mean and variance of a random variable. Some basic properties are

$$E(aX + bY) = aE(X) + bE(Y) \quad \text{for any } X \text{ and } Y \qquad (A3)$$

$$V(aX) = a^2 V(X) \quad \text{for any random variable } X \qquad (A4)$$

Given a random variable X, the standardized variable is obtained by subtracting the mean and dividing by the standard deviation, $Y = (X - E(X))/\sqrt{V(X)}$. The standardized variable has E(Y) = 0 and V(Y) = 1.
For any two random variables, the variance is

$$V(X + Y) = V(X) + V(Y) + 2\,\mathrm{Cov}(X,Y) \qquad (A5)$$

and the covariance is defined as

$$\mathrm{Cov}(X,Y) = \iint (x - E(X))(y - E(Y))\, f_{X,Y}(x,y)\, dx\, dy \qquad (A6)$$

In the definition of the covariance (Equation (A6)), $f_{X,Y}(x,y)$ is the joint pdf of the random variables. In the case where they are independent, the joint pdf is equal to the product $f_X(x) f_Y(y)$ and the covariance is zero.
In general, E(XY) ≠ E(X)E(Y), except where the variables are independent, in which case the equality holds.
In Analytical Chemistry applications, it is very common to obtain the final measurement from other
intermediate ones that have experimental variability, by means of a formula. A strategy for the calculation of the
uncertainty (variance) in the final result has been developed under two basic hypotheses: the formula is
approximated linearly, and the quadratic terms are then assimilated to the variances of the random variables
involved (see, for example, the 'Guide to the Expression of Uncertainty in Measurement').2 This procedure,
called in many texts the method of transmission (propagation) of errors, can lead to unacceptable results. Hence,
an improvement based on Monte Carlo simulation has been suggested for the calculation of the combined
uncertainty (see Supplement 1 to the aforementioned guide).
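As an illustration only (the measurand, the input values, and their standard uncertainties below are hypothetical, not taken from the guide), the two approaches can be compared with a few lines of MATLAB:

% Sketch: linear propagation versus Monte Carlo for a hypothetical y = a*b/c
a = 2.0;  sa = 0.05;          % assumed means and standard uncertainties
b = 1.5;  sb = 0.04;
c = 0.8;  sc = 0.03;
y = a*b/c;
% 'transmission of errors': first-order (linear) approximation
sy_lin = y*sqrt((sa/a)^2 + (sb/b)^2 + (sc/c)^2);
% Monte Carlo propagation: simulate the inputs and push them through the formula
N = 1e5;
ys = (a + sa*randn(N,1)).*(b + sb*randn(N,1))./(c + sc*randn(N,1));
sy_mc = std(ys);              % compare with sy_lin; quantiles of ys give coverage intervals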

Figure A1 Box and whisker plot. (A) Data of method A in Figure 2. (B) Data of method A with an outlier.
A useful representation of the data is the so-called box and whisker plot (or simply box plot). To explain how it
is constructed, we will use the 100 values of method A in Figure 2.
These data have the following summary statistics:
Minimum: 5.23
Maximum: 7.86
First or lower quartile, Q1 = 6.39. It is the value below which 25% of the data lie.
Second quartile (median), Q2 = 6.66. It is the value below which 50% of the data lie.
Third or upper quartile, Q3 = 6.98. It is the value below which 75% of the data lie.
Interquartile range, IR = Q3 − Q1 = 0.59 in our case.
With these quartiles, the central rectangle (the box) is drawn; it contains the 50% of the data around the median.
The lower and upper limits are computed as LL = Q1 − 1.5·IR and UL = Q3 + 1.5·IR. In the example,
LL = 6.39 − 1.5 × 0.59 = 5.505 and UL = 6.98 + 1.5 × 0.59 = 7.865.
Then, the 'whiskers' are drawn by joining the lower side of the rectangle to the smallest datum that is greater
than or equal to LL, and the upper side of the rectangle to the largest datum that is less than UL.
The three smallest values in our case are 5.396, 5.233, and 5.507; thus the lower whisker goes down to 5.507 and the
other two values are left 'disconnected'. The other whisker reaches the maximum, 7.86, because it is less than UL.
The box and whisker plot is the first one in Figure A1.
The advantage of using box plots is that the quartiles are practically insensitive to outliers. For example,
suppose that the value 7.86 is changed to 8.86; this change affects neither the median nor the quartiles, so the box plot
remains similar but with a datum outside the upper whisker, as can be seen in the second box plot in
Figure A1.
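The construction can be reproduced with the Statistics Toolbox. In the sketch below, x is assumed to be the vector containing the 100 results of method A; note that the quantile estimator used by the software may give quartiles slightly different from those quoted above.

% Sketch: quartiles, interquartile range, and whisker limits for a data vector x
Q  = quantile(x, [0.25 0.50 0.75]);   % Q1, median, Q3
IR = Q(3) - Q(1);                     % interquartile range
LL = Q(1) - 1.5*IR;                   % lower limit
UL = Q(3) + 1.5*IR;                   % upper limit
lowWhisker = min(x(x >= LL));         % smallest datum not below LL
upWhisker  = max(x(x <= UL));         % largest datum not above UL
boxplot(x)                            % draws the box and whisker plot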

Table A1 Selected points of the Z = N(0,1) distribution

z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09

1.5   0.0668  0.0655  0.0643  0.0630  0.0618  0.0606  0.0594  0.0582  0.0571  0.0559
1.6   0.0548  0.0537  0.0526  0.0516  0.0505  0.0495  0.0485  0.0475  0.0465  0.0455
1.7   0.0446  0.0436  0.0427  0.0418  0.0409  0.0401  0.0392  0.0384  0.0375  0.0367
1.8   0.0359  0.0351  0.0344  0.0336  0.0329  0.0322  0.0314  0.0307  0.0301  0.0294
1.9   0.0287  0.0281  0.0274  0.0268  0.0262  0.0256  0.0250  0.0244  0.0239  0.0233
2.0   0.02275 0.02222 0.02169 0.02118 0.02068 0.02018 0.01970 0.01923 0.01876 0.01831

Values of p such that p = pr{N(0,1) > z}; z is the sum of the row and column headings.



Table A2 Selected points of the t distribution with ν degrees of freedom

ν        0.25    0.10    0.05    0.025    0.01    0.005    0.0025    0.001    0.0005

1       1.000   3.078   6.314   12.706   31.821  63.657   127.321   318.309  636.619
2       0.816   1.886   2.920    4.303    6.965   9.925    14.089    22.327   31.598
3       0.765   1.638   2.353    3.182    4.541   5.841     7.453    10.214   12.924
4       0.741   1.533   2.132    2.776    3.747   4.604     5.598     7.173    8.610
5       0.727   1.476   2.015    2.571    3.365   4.032     4.773     5.893    6.869
6       0.718   1.440   1.943    2.447    3.143   3.707     4.317     5.208    5.959
7       0.711   1.415   1.895    2.365    2.998   3.499     4.029     4.785    5.408
8       0.706   1.397   1.860    2.306    2.896   3.355     3.833     4.501    5.041
9       0.703   1.383   1.833    2.262    2.821   3.250     3.690     4.297    4.781
10      0.700   1.372   1.812    2.228    2.764   3.169     3.581     4.144    4.587
20      0.687   1.325   1.725    2.086    2.528   2.845     3.153     3.552    3.850
100     0.677   1.290   1.660    1.984    2.364   2.626     2.871     3.174    3.390
1000    0.675   1.282   1.645    1.962    2.330   2.581     2.813     3.098    3.300
∞       0.675   1.282   1.645    1.960    2.326   2.576     2.807     3.090    3.290

Values of tα,ν such that α = pr{tν > tα,ν}; the column heading is α.

2. The Normal Distribution


A normal distribution with mean μ and standard deviation σ, N(μ,σ), has as pdf the following function, defined
for all real numbers:

f(x) = (1/(σ√(2π))) exp(−(1/2)((x − μ)/σ)²)    (A7)
The normal distribution is a continuous random variable with E(N(μ,σ)) = μ and V(N(μ,σ)) = σ², and these
two parameters completely define the distribution.
Particularly interesting is the N(0,1), usually called Z, because any other normal distribution N(μ,σ) is
transformed into a Z when standardized, that is, Z = (N(μ,σ) − μ)/σ.
The distribution function of a normal random variable does not have an analytical expression; hence it is
necessary to use tables or somewhat complex numerical formulas to calculate the probabilities. As any normal distribution
can be transformed into a N(0,1), it is customary to use only the table of this distribution. Table A1 contains
some of its values that, in any case, cover the cases used in this chapter. For example, if z = 1.83, reading from
row 1.8 and column 0.03, p = pr{N(0,1) > 1.83} = 0.0336.
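The same value is obtained without tables using the commands listed in Section 7 of this appendix, for example:

% Upper tail of N(0,1) at z = 1.83
p = 1 - normcdf(1.83, 0, 1)    % gives approximately 0.0336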

Table A3 Selected α percentage points of the χ² distribution with ν degrees of freedom

ν       0.990     0.975     0.950     0.900     0.100    0.050    0.025    0.010

1      0.00016   0.00098   0.0039    0.0158      2.71     3.84     5.02     6.63
2      0.0201    0.0506    0.1026    0.2107      4.61     5.99     7.38     9.21
3      0.115     0.216     0.352     0.584       6.25     7.81     9.35    11.34
4      0.297     0.484     0.711     1.064       7.78     9.49    11.14    13.28
5      0.554     0.831     1.15      1.61        9.24    11.07    12.83    15.09
6      0.872     1.24      1.64      2.20       10.64    12.59    14.45    16.81
7      1.24      1.69      2.17      2.83       12.02    14.07    16.01    18.48
8      1.65      2.18      2.73      3.49       13.36    15.51    17.53    20.09
9      2.09      2.70      3.33      4.17       14.68    16.92    19.02    21.67
10     2.56      3.25      3.94      4.87       15.99    18.31    20.48    23.21
100   70.06     74.22     77.93     82.36      118.50   124.34   129.56   135.81

Values of x such that α = pr{χ²ν > x}.

The sum of independent normal random variables, N(μ1,σ1) + N(μ2,σ2) + ··· + N(μn,σn), also follows a normal
distribution, N(μ1 + ··· + μn, (σ1² + ··· + σn²)^(1/2)).

3. Student’s t Distribution
If X is a random variable N(μ,σ) and X1, X2, ..., Xn are n independent random variables with the same
distribution as X, then the random variable (X̄ − μ)/(σ/√n) is a N(0,1), where X̄ denotes the random variable
(X1 + ··· + Xn)/n.
Let S² be the sample variance. The statistic t = (X̄ − μ)/(S/√n) follows a t distribution with ν = n − 1 d.f.
Its mean and variance are, respectively, E(tν) = 0 and V(tν) = ν/(ν − 2) for ν > 2. The general appearance of its
pdf is similar to that of the standard normal distribution: both are symmetric around zero, unimodal, and take values in
(−∞, +∞). However, the t distribution has heavier tails than the normal, that is, it exhibits greater variability. As the
number of d.f. tends to infinity, the limiting distribution is the standard normal. The family of t distributions
depends on only one parameter, its degrees of freedom.
Table A2 contains some values of the t distribution. For example, if ν = 5, for α = 0.025, the value t = 2.571
in the table is the one for which 0.025 = pr{t5 > 2.571} holds. Compare with the value 1.96 in Table A1 that
would correspond, under the same conditions, to a N(0,1), that is, 0.025 = pr{N(0,1) > 1.96}.
Because of the symmetry, 0.025 = pr{t5 < −2.571} also holds and, consequently, 0.95 = pr{−2.571 <
t5 < 2.571}.
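These values can be reproduced with the commands of Section 7, for example:

% t distribution with 5 d.f.: upper 2.5% point and the corresponding normal value
tinv(0.975, 5)                      % 2.571, so 0.025 = pr{t5 > 2.571}
norminv(0.975, 0, 1)                % 1.96 for the N(0,1)
tcdf(2.571, 5) - tcdf(-2.571, 5)    % approximately 0.95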

4. The χ² (Chi-Square) Distribution


Under the same conditions as in the previous section for the variables X1, X2, ..., Xn, the random variable
χ² = (n − 1)S²/σ² follows a chi-square distribution with ν = n − 1 d.f.
The mean and variance of a χ² distribution with ν d.f. are E(χ²ν) = ν and V(χ²ν) = 2ν. The chi-square
distribution is nonnegative and its pdf is skewed to the right. However, as ν increases, the distribution becomes
more symmetric. Some percentage points of the chi-square distribution are given in Table A3.
For example, if ν = 5 and α = 0.025, the value of the chi-square distribution with 5 d.f., χ²0.025,5, that leaves
the probability 0.025 to its right is 12.83. That is, 0.025 = pr{χ²5 > 12.83}. Analogously, 0.975 = pr{χ²5 > 0.83}
and, consequently, 0.95 = pr{0.83 < χ²5 < 12.83}.
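Again, the same points are obtained directly with the commands of Section 7:

% chi-square with 5 d.f.: points leaving 0.025 in each tail
chi2inv(0.975, 5)                       % 12.83
chi2inv(0.025, 5)                       % 0.83
chi2cdf(12.83, 5) - chi2cdf(0.83, 5)    % approximately 0.95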

Table A4 Selected percentage points of the Fν1,ν2 distribution for α = 0.05

ν2 \ ν1      1       2       3       4       5       6       7       8       9      10

1       161.45  199.50  215.71  224.58  230.16  233.99  236.77  238.88  240.54  241.88
2        18.51   19.00   19.16   19.25   19.30   19.33   19.35   19.37   19.38   19.40
3        10.13    9.55    9.28    9.12    9.01    8.94    8.89    8.85    8.81    8.79
4         7.71    6.94    6.59    6.39    6.26    6.16    6.09    6.04    6.00    5.96
5         6.61    5.79    5.41    5.19    5.05    4.95    4.88    4.82    4.77    4.74
6         5.99    5.14    4.76    4.53    4.39    4.28    4.21    4.15    4.10    4.06
7         5.59    4.74    4.35    4.12    3.97    3.87    3.79    3.73    3.68    3.64
8         5.32    4.46    4.07    3.84    3.69    3.58    3.50    3.44    3.39    3.35
9         5.12    4.26    3.86    3.63    3.48    3.37    3.29    3.23    3.18    3.14
10        4.96    4.10    3.71    3.48    3.33    3.22    3.14    3.07    3.02    2.98
11        4.84    3.98    3.59    3.36    3.20    3.09    3.01    2.95    2.90    2.85
12        4.75    3.89    3.49    3.26    3.11    3.00    2.91    2.85    2.80    2.75
13        4.67    3.81    3.41    3.18    3.03    2.92    2.83    2.77    2.71    2.67
14        4.60    3.74    3.34    3.11    2.96    2.85    2.76    2.70    2.65    2.60
15        4.54    3.68    3.29    3.06    2.90    2.79    2.71    2.64    2.59    2.54

Values of x such that 0.05 = pr{Fν1,ν2 > x}.



The chi-square distribution has an important property: let χ²1, χ²2, ..., χ²n be independent chi-square random
variables with ν1, ν2, ..., νn d.f., respectively. Then the random variable χ²1 + χ²2 + ··· + χ²n is a chi-square
distribution with ν = ν1 + ν2 + ··· + νn d.f.
This property becomes apparent if we note that if Z1, Z2, ..., Zn are independent N(0,1) random variables,
then the random variable Z1² + Z2² + ··· + Zn² is a chi-square distribution with n d.f.
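A quick numerical illustration of this property (a sketch with simulated standard normal variables, purely for illustration):

% Sum of n squared independent N(0,1) variables behaves as a chi-square with n d.f.
N = 1e5;  n = 6;
S = sum(randn(N,n).^2, 2);       % N realizations of Z1^2 + ... + Zn^2
mean(S > chi2inv(0.95, n))       % fraction above the upper 5% point, approximately 0.05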

5. The F Distribution
Let V and W be independent chi-square random variables with ν1 and ν2 d.f., respectively. Then, the
ratio F = (V/ν1)/(W/ν2) follows an F distribution with ν1 d.f. in the numerator and ν2 d.f. in the denominator,
usually abbreviated as Fν1,ν2. The mean and variance of Fν1,ν2 are E(Fν1,ν2) = ν2/(ν2 − 2) for
ν2 > 2 and V(Fν1,ν2) = 2ν2²(ν1 + ν2 − 2)/[ν1(ν2 − 2)²(ν2 − 4)] for ν2 > 4.
The F distribution is nonnegative and skewed to the right. Some percentage points of the F distribution are
given in Table A4 for α = 0.05. For, say, ν1 = 5 and ν2 = 10, F0.05;5,10 = 3.33 is the value such that
0.05 = pr{F5,10 > 3.33}.
The lower percentage points can be found by taking into account that F1−α;ν1,ν2 = 1/Fα;ν2,ν1. For example,
F0.95;5,10 = 1/F0.05;10,5 = 1/4.74 = 0.21. Therefore, 0.90 = pr{0.21 < F5,10 < 3.33}.
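As before, these percentage points can be computed directly:

% F distribution with 5 and 10 d.f.: upper and lower 5% points
finv(0.95, 5, 10)                       % 3.33
1/finv(0.95, 10, 5)                     % 0.21, using F(1-alpha; v1, v2) = 1/F(alpha; v2, v1)
fcdf(3.33, 5, 10) - fcdf(0.21, 5, 10)   % approximately 0.90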

6. Convergence of Random Variables


Sometimes it is useful to consider how a sequence of random variables converges to another random variable.
Let X1, X2, ... be a sequence of random variables and let F1(x), F2(x), ... be the corresponding sequence
of distribution functions. If the distribution functions become more and more similar to the distribution function
F of a random variable X when n → ∞, we say that the sequence converges to X 'in distribution'. Formally, this
means that

lim n→∞ Fn(x) = F(x) for each x at which F is continuous

We say that Xn converges to X 'in probability' if, for every ε > 0,

lim n→∞ pr{|Xn − X| > ε} = 0

This means that the probability of the set where Xn differs from X by more than ε becomes smaller and smaller.
Furthermore, we say that Xn converges to X 'almost surely' if

pr{lim n→∞ |Xn − X| = 0} = 1

Almost sure convergence means that the set of outcomes for which the realized values of Xn get closer and
closer to the value of X has probability one. It can be proven that almost sure convergence implies
convergence in probability, which in turn implies convergence in distribution. The following three fundamental
convergence results are the most widely used in practice.
The 'weak law of large numbers' states that if X1, X2, ..., Xn, ... are independent and identically distributed
random variables with finite mean μ, then

(X1 + ··· + Xn)/n → μ in probability.

If the random variables also have a finite variance (a weaker condition is also possible), then we have the ‘strong
law of large numbers’,

(X1 + ··· + Xn)/n → μ almost surely.

The 'central limit theorem' says that for independent (or weakly correlated) random variables X1, X2, ..., Xn
with the same distribution,

√n (X̄ − μ)/σ → N(0,1) in distribution

where μ and σ² are the mean and variance of the random variables Xi. This means that the distribution of the
standardized sample mean becomes more and more like that of a standard normal random variable as n increases.
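A minimal simulation (with an exponential parent distribution, chosen only because it is markedly non-normal) shows the effect:

% Sketch: the standardized mean of n exponential variables approaches N(0,1)
N = 1e5;  n = 30;  mu = 1;  sigma = 1;   % exponential with mean 1 and standard deviation 1
X = -log(rand(N, n));                    % N samples of size n from an exponential
Z = sqrt(n)*(mean(X,2) - mu)/sigma;      % standardized sample means
mean(Z > 1.645)                          % approaches pr{N(0,1) > 1.645} = 0.05 as n grows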

7. Some Computational Aspects


The availability of personal computers makes it possible to carry out the statistical calculations without tables. It is
advisable to use software specifically dedicated to statistics, but in the initial stages of learning it is worthwhile
to do the calculations by hand, in order to acquire the intuition needed to avoid the errors that come from an
unreflective and automatic use of the software.
The basic distributions (normal, Student's t, F, and chi-square) can be programmed with the algorithms in
Abramowitz and Stegun.60
Chapter 5 (appendices) of the book by Meier and Zünd61 shows the necessary numerical approximations and
programs in BASIC for these distributions. To compute the noncentral F, the needed numerical approximation
can be consulted in Johnson and Kotz,62 and Evans et al.63
All the calculations in this chapter have been made with the Statistics Toolbox for MATLAB.64 What follows
is a list of the basic commands used. Note that all the commands referring to cumulative distribution
functions, Equations (A8)–(A11), compute the cumulative probability α up to the corresponding value of the
distribution. However, throughout the text and in Tables A1–A4, the tabulated probability α is always the upper
percentage point, that is, the probability above the corresponding value.
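For instance, the upper percentage points used in the tables are obtained from these lower-tail commands simply by taking the complement:

% alpha = pr{N(0,1) > z}: upper 5% point and its tail probability
norminv(1 - 0.05, 0, 1)     % 1.645, the upper 5% point of N(0,1)
1 - normcdf(1.645, 0, 1)    % 0.05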

Normal distribution

α = pr{N(μ,σ) < z}    (A8)

– z = norminv(α, μ, σ)
Example A1: α = 0.05, μ = 0, σ = 1; then norminv(0.05, 0, 1) gives z = −1.645.
– α = normcdf(z, μ, σ)
Example A2: z = 1.645, μ = 0, σ = 1; then normcdf(1.645, 0, 1) gives α = 0.95.

Student's t distribution with ν degrees of freedom

α = pr{tν < tα,ν}    (A9)

– t = tinv(α, ν)
Example A3: α = 0.05, ν = 5; then tinv(0.05, 5) gives t = −2.015.
– α = tcdf(t, ν)
Example A4: t = 1.645, ν = 5; then tcdf(1.645, 5) gives α = 0.9196.
χ² distribution with ν degrees of freedom

α = pr{χ²ν < χ²α,ν}    (A10)

– x = chi2inv(α, ν)
Example A5: α = 0.05, ν = 5; then chi2inv(0.05, 5) gives x = 1.1455.
– α = chi2cdf(x, ν)
Example A6: x = 9.24, ν = 5; then chi2cdf(9.24, 5) gives α = 0.9001.

Fν1,ν2 distribution with ν1 and ν2 degrees of freedom

α = pr{Fν1,ν2 < Fα,ν1,ν2}    (A11)

– x = finv(α, ν1, ν2)
Example A7: α = 0.95, ν1 = 5, ν2 = 15; then finv(0.95, 5, 15) gives x = 2.9013.
– α = fcdf(x, ν1, ν2)
Example A8: x = 2.90, ν1 = 5, ν2 = 15; then fcdf(2.90, 5, 15) gives α = 0.9499.

Power for the z-test, Equation (40)

Example A9. With the data of Example 8, |δ| = 0.40, σ = 0.55, n = 10, α = 0.05,
normcdf(norminv(0.95,0,1) - 0.40*sqrt(10)/0.55) gives β = 0.2562.

Power for the t-test, Equation (43)

Example A10. With the same data as for the z-test, the command uses the noncentral t distribution, 'nctcdf',
and the t distribution quantile, 'tinv', both with n − 1 = 9 d.f. and noncentrality parameter (0.40/0.55)√10.
nctcdf(tinv(0.95,9), 9, 0.40*sqrt(10)/0.55) gives β = 0.3165.
Example A11. With the data of Example 9, α = 0.05, n = 10, and the noncentrality parameter is 0.57√10.
nctcdf(tinv(0.95,9), 9, 0.57*sqrt(10)) gives β = 0.5034.

Power for the chi-square test, Equation (46)

Example A12. We have λ = 2, α = 0.05, and n = 14 or n = 13 to obtain a value of β ≤ 0.05.
chi2cdf(chi2inv(0.95,13)/(2^2), 13) gives β = 0.0402.
chi2cdf(chi2inv(0.95,12)/(2^2), 12) gives β = 0.0511.
(Note that the d.f. equal 14 − 1 = 13 or 13 − 1 = 12.)

Power for the F-test, Equation (57)

Example A13. Data: α = 0.05, n1 = n2 = 9, λ = σ1/σ2 = 2, as in question (3) of Example 13.
fcdf(finv(0.975,8,8)/(2^2), 8, 8) - fcdf(finv(0.025,8,8)/(2^2), 8, 8) gives β = 0.5558.
If we look for the sample size n = n1 = n2 such that β ≤ 0.10, trying some values, we get
fcdf(finv(0.975,22,22)/(2^2), 22, 22) - fcdf(finv(0.025,22,22)/(2^2), 22, 22), which gives β = 0.1115, and
fcdf(finv(0.975,23,23)/(2^2), 23, 23) - fcdf(finv(0.025,23,23)/(2^2), 23, 23), which gives β = 0.0981.
Consequently, n = 24.

Power for the ANOVA with fixed effects, Equation (82)

Example A14. α = 0.05, ν1 = 4, ν2 = 15, noncentrality parameter λ = n Στi²/σ² = 4 × 2 = 8.
ncfcdf(finv(0.95,4,15), 4, 15, 8) gives β = 0.5364.
Note that in this ANOVA there are k = 5 levels of the factor and n = 4 replicates per level.

Power for the ANOVA with random effects, Equation (98)

Example A15. Suppose that 10 laboratories participate in a proficiency test to evaluate a method. The assumed
risks are α = β = 0.05, and it is desired to detect at least an interlaboratory variance equal to the intralaboratory
variance, that is, λ² = 1 + n·(interlaboratory variance)/(intralaboratory variance) = 1 + n × 1. With these data,

k = 10; n = 4; fcdf(finv(0.95,k-1,k*(n-1))/(1 + 1*n), k-1, n*(k-1)) gives β = 0.0973
k = 10; n = 5; fcdf(finv(0.95,k-1,k*(n-1))/(1 + 1*n), k-1, n*(k-1)) gives β = 0.0494
Thus, each laboratory must do five determinations.
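The search for n can be automated with a small loop; the sketch below simply wraps the command used above (k = 10 laboratories, interlaboratory variance equal to the intralaboratory one, degrees of freedom written exactly as in the commands above):

% Sketch: smallest number of replicates n per laboratory giving beta <= 0.05
k = 10;  alpha = 0.05;
for n = 2:8
    beta = fcdf(finv(1-alpha, k-1, k*(n-1))/(1 + 1*n), k-1, n*(k-1));
    fprintf('n = %d   beta = %.4f\n', n, beta);
end
% the first n with beta <= 0.05 gives the required number of replicates (n = 5 here)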

References

1. EURACHEM/CITAC Guide CG4. Quantifying Uncertainty in Analytical Measurement, 2nd ed.; Ellison, S. L. R., Rosslein, M.,
Williams, A., Eds.; 2000. ISBN 0-948926-15-5. Available from the Eurachem Secretariat (see http://www.eurochem.org).
2. Draft Supplement 1 to the ‘Guide to the Expression of Uncertainty in Measurement’. Evaluation of measurement data.
Propagation of distributions using a Monte Carlo method (2004).
3. Commission Decision 12 August 2002, Brussels. Off. J. Eur. Commun. L 221 (17 August 2002) 8-36. Implementing Council
Directive 96/23/EC concerning the performance of analytical methods and the interpretation of results.
4. Aldama, J. M. Practicum of Master in Advanced Chemistry; University of Burgos: Burgos, Spain, 2007.
5. Analytical Methods Committee. Robust Statistics – How Not to Reject Outliers, Part 1. Basic Concepts. Analyst 1989, 114, 1693–1697.
6. Analytical Methods Committee. Robust Statistics – How Not to Reject Outliers, Part 2. Inter-laboratory Trials. Analyst 1989, 114,
1699–1702.
7. ISO 5725. Accuracy (Trueness and Precision) of Measurement Methods and Results, Part 1. General Principles and Definitions,
Part 2. Basic Method for the Determination of Repeatability and Reproducibility of a Standard Measurement Method, Part 3.
Intermediate Measures of the Precision of a Standard Measurement Method, Part 4. Basic Methods for the Determination of the
Trueness of a Standard Measurement Method, Part 5. Alternative Methods for the Determination of the Precision of a Standard
Measurement Method, Part 6. Use in Practice of Accuracy Values. Genève, 1994.
8. Analytical Methods Committee. Technical brief No. 4. Ed. M. Thompson. 2006. www.rsc.org/amc/.
9. Silverman, B. W. Density Estimation for Statistics and Data Analysis; Chapman and Hall: London, Great Britain, 1986.
10. Wand, M. P.; Jones, M. C. Kernel Smoothing; Chapman and Hall: London, Great Britain, 1995.
11. Inczédy, J.; Lengyel, T.; Ure, A. M.; Gelencsér, A.; Hulanicki, A. Compendium of Analytical Nomenclature (IUPAC), 3rd ed.; Port City
Press Inc.: Baltimore, 2nd printing, 2000; p 50.
12. Lira, I.; Wöger, W. Comparison Between the Conventional and Bayesian Approaches to Evaluate Measurement Data. Metrologia
2006, 43, S249–S259.
13. Zech, G. Frequentist and Bayesian Confidence Intervals. Eur. Phys. J. Direct 2002, C12, 1–81.
14. Sprent, P. Applied Nonparametric Statistical Methods; Chapman and Hall, Ltd: New York, 1989.
15. Patel, J. K. Tolerance Limits. A Review. Commun. Stat. Theory Methods 1986, 15 (9), 2716–2762.
16. Wald, A.; Wolfowitz, J. Tolerance Limits for a Normal Distribution. Ann. Math. Stat. 1946, 17, 208–215.
17. Wilks, S. S. Determination of Sample Sizes for Setting Tolerance Limits. Ann. Math. Stat. 1941, 12, 91–96.
18. Kendall, M.; Stuart, A. The Advanced Theory of Statistics, Inference and Relationship. Charles Griffin & Company Ltd: London,
1979; pp 547–548; Section 32.11; Vol. 2.
19. Willink, R. On Using the Monte Carlo Method to Calculate Uncertainty Intervals. Metrologia 2006, 43, L39–L42.
20. Guttman, I. Statistical Tolerance Regions; Charles Griffin and Company: London, 1970.
21. Huber, Ph.; Nguyen-Huu, J. J.; Boulanger, B.; Chapuzet, E.; Chiap, P.; Cohen, N.; Compagnon, P. A.; Dewé, W.; Feinberg, M.;
Lallier, M.; Laurentie, M.; Mercier, N.; Muzard, G.; Nivet, C.; Valat, L. Harmonization of Strategies for the Validation of Quantitative
Analytical Procedures. A SFSTP Proposal – Part I. J. Pharm. Biomed. Anal. 2004, 36, 579–586.
22. Huber, Ph.; Nguyen-Huu, J. J.; Boulanger, B.; Chapuzet, E.; Chiap, P.; Cohen, N.; Compagnon, P. A.; Dewé, W.; Feinberg, M.;
Lallier, M.; Laurentie, M.; Mercier, N.; Muzard, G.; Nivet, C.; Valat, L.; Rozet, E. Harmonization of Strategies for the Validation of
Quantitative Analytical Procedures. A SFSTP Proposal – Part II. J. Pharm. Biomed. Anal. 2007, 45, 70–81.
23. Huber, Ph.; Nguyen-Huu, J. J.; Boulanger, B.; Chapuzet, E.; Cohen, N.; Compagnon, P. A.; Dewé, W.; Feinberg, M.; Laurentie,
M.; Mercier, N.; Muzard, G.; Valat, L.; Rozet, E. Harmonization of Strategies for the Validation of Quantitative Analytical
Procedures. A SFSTP Proposal – Part III. J. Pharm. Biomed. Anal. 2007, 45, 82–86.
24. Feinberg, M. Validation of Analytical Methods Based on Accuracy Profiles. J. Chromatogr. A 2007, 1158, 174–183.
25. Rozet, E.; Hubert, C.; Ceccato, A.; Dewé, W.; Ziemons, E.; Moonen, F.; Michail, K.; Wintersteiger, R.; Streel, B.; Boulanger, B.;
Hubert, Ph. Using Tolerance Intervals in Pre-Study Validation of Analytical Methods to Predict In-Study Results. The Fit-for-
Future-Purpose Concept. J. Chromatogr. A 2007, 1158, 126–137.
26. Rozet, E.; Ceccato, A.; Hubert, C.; Ziemons, E.; Oprean, R.; Rudaz, S.; Boulanger, B.; Hubert, Ph. Analysis of Recent
Pharmaceutical Regulatory Documents on Analytical Method Validation. J. Chromatogr. A 2007, 1158, 111–125.
27. Dewé, W.; Govaerts, B.; Boulanger, B.; Rozet, E.; Chiap, P.; Hubert, Ph. Using Total Error as Decision Criterion in Analytical
Method Transfer. Chemom. Intell. Lab. Syst. 2007, 85, 262–268.
28. González, A. G.; Herrador, M. A. Accuracy Profiles from Uncertainty Measurements. Talanta 2006, 70, 896–901.
29. Rebafka, T.; Clémençon, S.; Feinberg, M. Bootstrap-Based Tolerance Intervals for Application to Method Validation. Chemom.
Intell. Lab. Syst. 2007, 89, 69–81.
30. Fernholz, L. T.; Gillespie, J. A. Content-Correct Tolerance Limits Based on the Bootstrap. Technometrics 2001, 43 (2),
147–155.
31. Cowen, S.; Ellison, S. L. R. Reporting Measurement Uncertainty and Coverage Intervals Near Natural Limits. Analyst 2006, 131,
710–717.
32. Schouten, H. J. A. Sample Size Formulae with a Continuous Outcome for Unequal Group Sizes and Unequal Variances. Stat.
Med. 1999, 18, 87–91.
33. Lehmann, E. L. Testing Statistical Hypothesis; Wiley & Sons: New York, 1959.

34. Schuirmann, D. J. A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence
of Average Bioavailability. J. Pharmacokinet. Biopharm. 1987, 15, 657–680.
35. Mehring, G. H. On Optimal Tests for General Interval Hypothesis. Commun. Stat. Theory Methods 1993, 22 (5), 1257–1297.
36. Brown, L. D.; Hwang, J. T. G.; Munk, A. An Unbiased Test for the Bioequivalence Problem. Ann. Stat. 1997, 25, 2345–2367.
37. Munk, A.; Hwang, J. T. G.; Brown, L. D. Testing Average Equivalence. Finding a Compromise Between Theory and Practice.
Biom. J. 2000, 42 (5), 531–552.
38. Hartmann, C.; Smeyers-Verbeke, J.; Penninckx, W.; Vander Heyden, Y.; Vankeerberghen, P.; Massart, D. L. Reappraisal of
Hypothesis Testing for Method Validation: Detection of Systematic Error by Comparing the Means of Two Methods or of Two
Laboratories. Anal. Chem. 1995, 67, 4491–4499.
39. Limentani, G. B.; Ringo, M. C.; Ye, F.; Bergquist, M. L.; McSorley, E. O. Beyond the t-Test. Statistical Equivalence Testing. Anal.
Chem. 2005, 77, 221A–226A.
40. Kuttatharmmakull, S.; Massart, D. L.; Smeyers-Verbeke, J. Comparison of Alternative Measurement Methods: Determination of
the Minimal Number of Measurements Required for the Evaluation of the Bias by Means of Interval Hypothesis Testing. Chemom.
Intell. Lab. Syst. 2000, 52, 61–73.
41. Andrés Martin, A.; Luna del Castillo, J. D. Bioestadı́stica para las ciencias de la salud; Norma-Capitel: Madrid, 2004.
42. Wellek, S. Testing Statistical Hypotheses of Equivalence; Chapman & Hall/CRC Press LLC: Boca Raton, FL, 2003.
43. Ortiz, M. C.; Herrero, A.; Sanllorente, S.; Reguera, C. The Quality of the Information Contained in Chemical Measures (electronic
book); Servicio de Publicaciones, Universidad de Burgos: Burgos, 2005. Available at http://web.ubu.es/investig/grupos/
cien_biotec/QA4/index.htm.
44. D’Agostino, R. B.; Stephens, M. A.; Eds.; Goodness-of-Fit Techniques. Marcel Dekker Inc.: New York, 1986.
45. Moreno, E.; Girón, F. J. On the Frequentist and Bayesian Approaches to Hypothesis Testing (with discussion). Stat. Oper. Res.
Trans. 2006, 30 (1), 3–28.
46. Scheffé, H. The Analysis of Variance; Wiley & Sons: New York, 1959.
47. Anderson, V. L.; MacLean, R. A. Design of Experiments. A Realistic Approach; Marcel Dekker Inc.: New York, 1974.
48. Milliken, G. A.; Johnson, D. E. Analysis of Messy Data: Designed Experiments; Wadsworth Publishing Co, Belmont, NJ, 1984;
Vol. I.
49. Searle, S. R. Linear Models; Wiley & Sons, Inc: New York, 1971.
50. Youden, W. J. Statistical Techniques for Collaborative Tests; Association of Official Analytical Chemists: Washington, DC,
1972.
51. Kateman, G.; Pijpers, F. W. Quality Control in Analytical Chemistry; Wiley & Sons: New York, 1981.
52. Kuttatharmmakull, S.; Massart, D. L.; Smeyers-Verbeke, J. Comparison of Alternative Measurement Methods. Anal. Chim. Acta
1999, 391, 203–225.
53. Hampel, F. R.; Ronchetti, E. M.; Rousseeuw, P. J.; Stahel, W. A. Robust Statistics. The Approach Based on Influence Functions;
Wiley-Interscience: Zurich, 1985.
54. Huber, P. J. Robust Statistics; Wiley & Sons: New York, 1981.
55. Thompson, M.; Wood, R. J. Assoc. Off. Anal. Chem. Int. 1993, 76, 926–940.
56. Sanz, M. B.; Ortiz, M. C.; Herrero, A.; Sarabia, L. A. Robust and Non Parametric Statistic in the Validation of Chemical Analysis
Methods. Quı́m. Anal. 1999, 18, 91–97.
57. Garcı́a, I.; Sarabia, L.; Ortiz, M. C.; Aldama, J. M. Usefulness of D-optimal Designs and Multicriteria Optimization in Laborious
Analytical Procedures. Application to the Extraction of Quinolones from Eggs. J. Chromatogr. A 2005, 1085, 190–198.
58. Garcı́a, I.; Sarabia, L.; Ortiz, M. C.; Aldama, J. M. Robustness of the Extraction Step When Parallel Factor Analysis (PARAFAC) is
Used to Quantify Sufonamides in Kidney by High Performance Liquid Chromatography-Diode Array Detection (HPLC-DAD).
Analyst 2004, 129 (8), 766–771.
59. Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; de Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. Handbook of
Chemometrics and Qualimetrics: Part A; Elsevier: Amsterdam, 1997.
60. Abramowitz, M.; Stegun, I. A. Handbook of Mathematical Functions; Government Printing Office, 1964.
61. Meier, P. C.; Zünd, R. E. Statistical Methods in Analytical Chemistry, 2nd ed.; Wiley & Sons: New York, 2000.
62. Johnson, N.; Kotz, S. Distributions in Statistics: Continuous Univariate Distributions – 2; Wiley & Sons: New York, 1970; p 191
(Equation (5)).
63. Evans, M.; Hastings, N.; Peacock, B. Statistical Distributions, 2nd ed.; Wiley & Sons: New York, 1993; pp 73–74.
64. Statistics Toolbox for use with Matlab, version 5.3, The MathWorks, Inc., 2006.

Biographical Sketches

María Cruz Ortiz received her Ph.D. in Chemistry from the University of Valladolid (Spain)
in 1988. She has been a member of the Analytical Chemistry staff of the University of Burgos
(Faculty of Sciences) since 1989, where she teaches and does research in Analytical Chemistry
and Chemometrics. Her research activity has focused on experimental design, optimization,
pattern recognition, quality assurance, validation of analytical methods according to official
regulations, and multivariate and/or multiway regression models, all of them applied to
problems in food chemistry, typification, etc.; this work has resulted in about 100 papers. She is at
present the head of an active research group, the Chemometrics and Qualimetrics group of
the University of Burgos.

Luis A. Sarabia received his Ph.D. in Statistics from the University of Valladolid (Spain) in
1979. Since 1974 he has been teaching Statistics and Mathematics, mostly to graduate and
postgraduate students of Chemistry. At present, his research is centred on Chemometrics as a
member of the Chemometrics and Qualimetrics group of the University of Burgos. His
research activities include the development of software and the implementation of nonparametric
and robust statistical methods, genetic algorithms, neural networks, etc. He is also involved in
multivariate/multiway regression methods, the methodology of experimental design, quality
assurance, and validation. He is the author of about a hundred papers on these matters.

Ma Sagrario Sánchez received her Ph.D. in Mathematics from the University of Valladolid in
1997. She has been working at the University of Burgos since 1991, and has been a member of the
permanent staff since 2002. Her teaching activities are mostly directed to students of the
degrees in Chemistry and in Food Science and Technology, and to postgraduate courses. She has
also been a permanent member of the Chemometrics and Qualimetrics group since its foundation.
Her main research activities lie within the areas of interest of the group, which
include modeling and analysis of n-way data, modeling of categories, design of experiments,
optimization, etc., by using classical methods as well as computationally intensive methods
(such as neural networks or evolutionary algorithms).

Ana Herrero, after completing her undergraduate studies in Chemistry at the University
of Valladolid (Spain) in 1991, received her Ph.D. from the University of Burgos (Spain) in 1996.
She has been working at the University of Burgos since 1992, teaching and doing research in Analytical
Chemistry. Her research involves Chemometrics, and she is a member of the Chemometrics
and Qualimetrics group of the University of Burgos. Experimental design, optimization,
pattern recognition, multivariate regression analysis, quality assurance, and validation of
analytical methods are among the fields she works in.
