The choice of statistical method can greatly influence the calculated normal range apart from any biological or chemical considerations. Nonparametric normal range estimates are, for practical purposes, as accurate as estimates that assume the distribution of data to be gaussian or log-gaussian when the distribution assumed is true. When data are not distributed as assumed, nonparametric estimates are more accurate.
When the clinical chemist establishes a "normal range," his intention is to make a statement about typical values in a population larger than those involved in his study. His statement is statistical in nature. Rather than deriving the normal range, he is deriving an estimate of the normal range, intended to apply to this larger population, and this estimate has a certain amount of uncertainty associated with it.

If the normal range calculation is viewed in this context the next logical step is to compare statistical methods of normal range estimation on the basis of how well they estimate the true normal range. We will show that the method used most commonly today is not the most appropriate.

It has become standard to calculate normal range estimates by assuming data to be described either by gaussian or log-gaussian curves.¹ Usually, choice is limited to these two frequency functions. No other option is considered unless there is strong evidence that the data are described by neither curve.

Recently the validity of the gaussian model has been the subject of considerable discussion (1-5). Elveback et al. (3), Mainland (4), and others disagree with the a priori assumption that most biological measurements are adequately described by gaussian or log-gaussian curves. These authors prefer the use of nonparametric² methods for estimating the normal range because they apply regardless of the underlying form of the statistical population from which data are obtained. If the distribution of the data is gaussian or log-gaussian, normal range estimates obtained by nonparametric methods will require more samples to obtain the same precision of estimation as those obtained by methods based on the assumption that the distribution is gaussian. However, if data are neither gaussian nor log-gaussian, and they are treated as if they are, results obtained can be severely biased.

Two nonparametric methods of normal range estimation are the method of PE's³ with associated nonparametric confidence intervals and the method of nonparametric TI's. The PE method is discussed in (5). Recently Brunden et al. (6) applied nonparametric TI methods to the estimation of normal ranges for various blood constituents.

From Bio-Science Laboratories, 7600 Tyrone Ave., Van Nuys, Calif. 91405.
¹ Often called "normal" and "log-normal" curves. Log-normal and log-gaussian mean that logarithms of the data are distributed in a gaussian fashion.
² Parametric estimates are estimates derived from data assumed to be described by a specific frequency distribution, such as gaussian. Parameters characterizing the assumed distribution curve are first estimated and then used for estimating normal range endpoints. Nonparametric estimates do not involve any a priori assumption regarding frequency distribution.
³ Abbreviations used: PE, percentile estimate; TI, tolerance interval, an interval that includes a specified proportion, P, of the population with a specified probability, γ; NED, normal equivalent deviate; K-S, Kolmogorov-Smirnov (test).
Received Sept. 8, 1970; accepted Feb. 1, 1971.
assumed to be gaussian distributed. In the case of real data, negative estimates can occur with estimation methods that assume gaussian distribution (20).

Gaussian PE's and TI's: The k Factor Method

For a gaussian distribution the population mean and standard deviation, μ and σ, determine every point of the curve. In particular the true 2.5 and 97.5 percentiles of the gaussian distribution are μ − 1.96σ and μ + 1.96σ. PE's are obtained by replacing μ and σ by their estimates from the observed data, x̄ and s. If data are distributed according to the log-gaussian distribution, their logarithms have a gaussian distribution. These facts are the basis for the estimates of lines 3 and 6, Table 2. They are the familiar normal range estimates given as x̄ ± 2s or log transforms of this expression.
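The gaussian and log-gaussian PE calculations described above can be sketched as follows. The sample below is hypothetical, and 1.96 is the two-sided 95% gaussian multiplier (the familiar x̄ ± 2s form rounds this to 2):

```python
import math
import random
import statistics


def gaussian_pe(data):
    """Gaussian percentile estimates of the 95% normal range:
    replace mu and sigma in mu +/- 1.96 sigma by the sample mean and s."""
    xbar = statistics.mean(data)
    s = statistics.stdev(data)  # n - 1 denominator
    return xbar - 1.96 * s, xbar + 1.96 * s


def log_gaussian_pe(data):
    """The same estimate computed on logarithms, then transformed back."""
    lo, hi = gaussian_pe([math.log(x) for x in data])
    return math.exp(lo), math.exp(hi)


# Hypothetical sample: 200 values from a gaussian with mu = 100, sigma = 10,
# for which the true 95% limits are 80.4 and 119.6.
random.seed(1)
sample = [random.gauss(100, 10) for _ in range(200)]
lo95, hi95 = gaussian_pe(sample)  # close to 80.4 and 119.6
```

As the text notes, such estimates are accurate only when the assumed distribution is in fact correct for the data.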
[Figure 2 (panels a and b): estimated normal limits plotted against sample size n = 30, 60, 100, 1000.]

Fig. 2. Accuracy and precision of estimated normal limits by the k factor method with γ = 0.90. In this example, the true distribution is gaussian with μ = 100 and σ = 10. In a, the "expected" or average value, E_U, of the estimated upper limit, is shown as a solid point for selected values of n. Vertical lines correspond to ±2 s.d. intervals about E_U. In b, corresponding data are shown for the lower limit.

Next consider gaussian TI's. Thus, for P = 0.95 and γ = 0.90, a 95-90 TI is a pair of numbers L and U, such that 95% or more of the population values are greater than L and less than U, and this statement is true with probability 0.90 (i.e., 90%). A method for computing a TI when the data have a gaussian distribution was first presented by Wald and Wolfowitz (8). They determined a constant k, depending on the number of samples n, such that

    Pr{ ∫_L^U [1/(σ√(2π))] exp[−(1/2)((x − μ)/σ)²] dx ≥ P } = γ     (1)

for L = x̄ − ks and U = x̄ + ks.

To apply the method to normal range estimation, assume a priori that the data are gaussian or log-gaussian distributed. If they are log-gaussian their logs are gaussian distributed and calculations are based on these logarithms. In that case the antilogs of L and U are estimated limits of the normal range. If the data are taken as gaussian distributed, the estimated normal limits are L and U. In the context of normal range estimation, Equation 1 may be paraphrased as follows: if L and U are computed from repeated samples, a proportion γ of the computed intervals will include at least 95% of the population values; the wider the computed TI, the more likely it is that it will include at least 95% of the population values.

We will now show that, in general, the width of the TI is directly related to the choice of γ. Consider a hypothetical population whose test values have a gaussian distribution with true mean μ = 100, and true standard deviation σ = 10. For this gaussian frequency distribution of test values, the middle 95% ranges from 80.4 to 119.6, the 2.5 and 97.5 percentile points, respectively. These are the true 95% normal limits. To see how γ influences normal range estimates, the "expected" or average values of L and U and their standard deviations were calculated for two different cases in which γ is 0.90 and 0.50, respectively.

In Figure 2a are graphed two standard deviation intervals about the expected value of U, i.e., E_U ± 2σ_U, for γ = 0.90 and for numbers of samples n = 30, 60, 100, and 1000. Figure 2b shows the corresponding graphs for E_L ± 2σ_L. Note that when n = 30, the expected value of U is E_U = 123.9, and that as n increases this expected value becomes closer to the true value, 119.6. In addition, as one would expect, the amount of variation in repeated calculations of L and U decreases for larger n. For example, when n = 30, a value of U as high as U = 131 is within the two sigma interval about E_U.

Figures 3a and 3b are of the same form. The only difference is that γ = 0.5 instead of 0.9, i.e., a computed TI will encompass 95% or more of the population with 50% probability rather than 90% as in Figure 2. Note that the centers of the 2-sigma intervals tend to be closer to the correct values when γ = 0.5 than when γ = 0.9.

Previously, one of us (H. J. H.) has advocated that the k factor method with γ = 0.90 be used in normal range estimation (9). However, the above analysis shows that the k factor method with γ = 0.90 results in normal range estimates that are generally too wide unless the number of samples is extremely large (n ≈ 1000). If the normal range estimate is too wide, this weakens the usefulness of the test because its diagnostic power is decreased.
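The meaning of Equation 1, and the upward bias of the upper limit shown in Figure 2, can be checked by simulation. The sketch below assumes k = 2.413, a tabled two-sided tolerance factor for n = 30, P = 0.95, γ = 0.90 (the specific k value is an assumption of this example, not given in the text), and estimates both γ and the average upper limit:

```python
import math
import random

MU, SIGMA, N = 100.0, 10.0, 30
K = 2.413  # assumed 95-90 tolerance factor for n = 30


def phi(z):
    """Standard gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def coverage(lo, hi):
    """True proportion of the N(MU, SIGMA) population inside (lo, hi)."""
    return phi((hi - MU) / SIGMA) - phi((lo - MU) / SIGMA)


random.seed(2)
trials, covered, upper_sum = 20000, 0, 0.0
for _ in range(trials):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = sum(sample) / N
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (N - 1))
    lo, hi = xbar - K * s, xbar + K * s
    upper_sum += hi
    covered += coverage(lo, hi) >= 0.95  # did this TI cover 95% of the population?

gamma_hat = covered / trials   # near 0.90, by construction of k
e_upper = upper_sum / trials   # near 123.9, well above the true limit 119.6
```

The simulated average upper limit sits well above the true 97.5 percentile of 119.6, which is the bias the text describes for γ = 0.90 at small n.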
[Figure: cumulative probability (0.01 to 0.975) plotted against log haptoglobin (1.1 to 2.4).]

distribution, then the probability of obtaining a chi-square value as large or larger than 16.7 is less than 0.05 but greater than 0.01.
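The quoted chi-square bound can be reproduced with the closed-form survival function available for an even number of degrees of freedom. The excerpt does not state the degrees of freedom for this goodness-of-fit test; df = 8 below is purely an illustrative assumption, under which χ² = 16.7 falls between the 0.05 and 0.01 critical values (15.51 and 20.09):

```python
import math


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an EVEN number of degrees
    of freedom: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)**k / k!."""
    assert df % 2 == 0
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


p_value = chi2_sf_even_df(16.7, 8)  # df = 8 is an assumption; about 0.033
```

Any df between roughly 7 and 8 gives a p-value in the quoted 0.01 to 0.05 band, consistent with the text.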
Thus, it is clear there is very little to support a conclusion that the data follow either a gaussian or log-gaussian distribution. In spite of this difficulty, the estimated normal range depends very much on the distribution selected, as previously shown in Table 2a.

Nonparametric Normal Range Estimation

A nonparametric solution to the TI problem was originally given by Wilks (13). Wilks' solution is based on the binomial probability distribution. For n individuals randomly selected from the population about which inferences are desired, a 95% TI is computed by ordering (ranking) the n test results x(1), . . ., x(n) and then finding the highest rank r and the smallest rank n − s + 1 greater than r such that

    Σ_{j=0}^{n−r−s} C(n, j) (0.95)^j (0.05)^(n−j) ≥ γ     (2)

where γ is the probability of including 95% of the population within the TI. Test results for the individuals ranked rth and (n − s + 1)th, x(r) and x(n−s+1), are the endpoints of the normal range as estimated by the TI method. Tables have been published in Somerville (14), from which m = r + s can be obtained.

Example: To derive an interval having a 90% probability of covering 95% of the target population, i.e., a 95-90 TI, for the 100 haptoglobin values cited earlier, note that from Table 1 of reference 14, m = 2. Choose r = 1, and obtain x(1) = 14 as the lower limit of the estimated normal range. In order that r + s = m, s = 1 and the estimated upper limit of the normal range has rank n − s + 1 = 100. Since x(100) = 225, the upper limit is 225. An estimated normal range corresponding to the 95-90 TI is therefore 14 to 225.

If we are content to derive a 95-50 TI, in which γ = 0.50 is the probability of covering 95% of the population, then from Somerville's tables, m = 5. If r is set to 2 then s = 3 and the estimated normal range has lower limit x(2) = 21 and upper limit x(98) = 191.

Note in this case where m is 5, that we have arbitrarily chosen r = 2 and s = 3, so that r + s = m. If r = 3 and s = 2 is chosen, however, a second normal range estimate is obtained, namely 21 to 199. A third TI estimate is obtained by averaging the two TI estimates above. The estimated normal range by this method is 21 to 195 (cf. lines 8 to 10, Table 2).

Now, consider the method of percentile estimation. The PE method is also based on ranking the test results in order of magnitude. An estimate of the 2.5 percentile of the frequency distribution of the target population is the 2.5 percentile of the observed sample frequency distribution. The sample 2.5 percentile is the lth ordered sample value where l = 0.025(n + 1). The corresponding estimate of the 97.5 percentile is the lth largest sample. For most values of n, l is not a whole number and it is best to interpolate between the two ordered sample values whose ranks are nearest and on each side of l.

Example: For the haptoglobin data, n = 100, and therefore l = 0.025(101) = 2.525; the estimated 2.5 percentile is obtained by interpolating between x(2) and x(3).

The PE method gives single numbers as estimates of the population normal limits. It is also possible to derive a confidence interval for each of the normal limits. A confidence interval is a continuous interval that covers the true value of the
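The Somerville table lookups in the examples above can be reproduced directly from the binomial condition of Equation 2: the narrowest interval corresponds to the largest m = r + s for which the coverage probability still reaches γ. A sketch:

```python
from math import comb


def max_m(n, p=0.95, gamma=0.90):
    """Largest m = r + s such that (x(r), x(n-s+1)) covers at least a
    proportion p of the population with probability >= gamma (Equation 2)."""
    def cov(m):
        # Probability that the interval covers >= p of the population.
        return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(n - m + 1))

    m = 2  # a two-sided interval needs r >= 1 and s >= 1
    if cov(m) < gamma:
        return None  # even (x(1), x(n)) fails: n is too small
    while m + 1 < n and cov(m + 1) >= gamma:
        m += 1
    return m


# For n = 100: m = 2 for a 95-90 TI and m = 5 for a 95-50 TI,
# matching the values read from Somerville's tables in the examples.
```

This also makes the narrowing effect of lowering γ concrete: a 95-50 TI may discard more extreme observations (m = 5) than a 95-90 TI (m = 2).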
For a 90% confidence interval for the 2.5 percentile, ranks a and b are chosen such that

    Σ_{j=a}^{b−1} C(n, j) (0.025)^j (0.975)^(n−j) ≥ 0.90     (3)

The ath and bth ranked test values (x(a), x(b)) then comprise a 90% confidence interval for the 2.5 percentile in the population. Therefore, the probability is less than 0.1 that x(b), the upper limit of the confidence interval, is less than the true 2.5 percentile, or x(a), the lower limit of the confidence interval, is greater than the true 2.5 percentile, i.e., the converse of Equation 3. Table 3 gives values of a and b for various values of n, and also relates the confidence interval for the upper limit to a and b.

Table 3. Values of a and b for various values of n

    n          a    b
    120-131    1    7
    132-159    1    8
    160-187    1    9
    188-189    1    10
    190-216    2    10
    217-246    2    11
    247-251    2    12
    252-276    3    12
    277-307    3    13
    308-310    3    14
    311-338    4    14
    339-366    4    15
    367-369    5    15

    The ath lowest sample value is the lower limit of the 90% confidence interval.
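Equation 3 can be applied numerically by fixing a and searching for the smallest b whose cumulative binomial probability reaches 0.90. The sketch below does this for a = 1; for n = 120 it yields (a, b) = (1, 7), agreeing with the first row of Table 3 (the published table may allocate tail probabilities differently for other n, so this minimal-b search is not guaranteed to match every row):

```python
from math import comb


def ci_ranks_for_2_5_percentile(n, conf=0.90, a=1):
    """Smallest b such that (x(a), x(b)) is a conf-level confidence interval
    for the population 2.5 percentile (Equation 3)."""
    prob = 0.0
    for b in range(a + 1, n + 1):
        j = b - 1  # adding the term for rank j extends the interval to b
        prob += comb(n, j) * 0.025 ** j * 0.975 ** (n - j)
        if prob >= conf:
            return a, b
    return None  # no adequate b exists: n is too small
```

Small n returns None here, which is consistent with the text's point that a minimum number of samples is needed before such confidence intervals exist at all.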
precision of estimated normal limits. Our recommendation for the minimum number of samples in order to estimate a normal range with accuracy is n = 120. This is the smallest number of sample values that permits 90% confidence intervals for the endpoints of the normal range.

References

1. Letters to the editor: Normal ranges and Gaussian distributions. Clin. Chem. 16, 809 (1970).
2. Letters to the Journal: The "normal" range. J. Amer. Med. Ass. 212, 883 (1970).
3. Elveback, L. R., Guillier, C. L., and Keating, F. R., Health, normality and the ghost of Gauss. J. Amer. Med. Ass. 211, 69 (1970).
4. Mainland, D., Normal values in medicine. Ann. N.Y. Acad. Sci. 161, 327 (1969). (See also editorial, this issue, Clin. Chem.)
11. Lilliefors, H. W., On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J. Amer. Statist. Ass. 62, 399 (1967).
12. Cochran, W. G., The chi-square test of goodness of fit. Ann. Math. Statist. 23, 329 (1952).
13. Wilks, S. S., Statistical prediction with special reference to the problem of tolerance limits. Ann. Math. Statist. 13, 400 (1941).
14. Somerville, P. N., Tables for obtaining nonparametric tolerance limits. Ann. Math. Statist. 29, 399 (1958).
15. Pearson, E. S., and Hartley, H. O., Biometrika Tables for Statisticians, 1, Cambridge University Press, New York, N.Y., 1966, p 150.
16. Anscombe, F. J., Rejection of outliers. Technometrics 2, 123 (1960).
17. Dixon, W. J., Processing data for outliers. Biometrics 9, 74 (1953).
18. Grubbs, F. E., Procedures for detecting outlying observations in samples. Technometrics 11, 1 (1969).
19. Herrera, L., The precision of percentiles in establishing normal limits in medicine. J. Lab. Clin. Med. 52, 34 (1958).
20. Henry, R. J., Improper statistics characterizing the normal range. Amer. J. Clin. Pathol. 34, 326 (1960).