NWIP WD 5667-20
Secretariat of ISO/TC147/SC 6
Water quality - Sampling (general methods)
Dear Member,
DRAFT TO ACCOMPANY NEW WORK ITEM PROPOSAL - ISO/WD 5667-20 WATER QUALITY -
SAMPLING – PART 20: GUIDANCE ON THE USE OF SAMPLE DATA FOR DECISION MAKING (See
doc ISO/TC147/SC6 N 332)
The attached draft is to be read in conjunction with the New Work Item Proposal in document
ISO/TC147/SC6 N 332. Please note that this NWIP is being processed as part of a pilot study on the
electronic committee-internal balloting application and respond as detailed in N332.
The attached draft contains an updated version of the text that was developed during the time at which
this project was at Stage 0 (preliminary item) on the TC147/SC6 programme and also the response of
the author of the draft to comments made by the Canadian member body on the Stage 0 draft.
Please note that the reply date on the NWIP is 9 May 2005. The results will be discussed at the
meeting of ISO/TC147/SC6 planned for 3 June 2005 in Japan.
Yours sincerely
David Upstone
Secretary ISO/TC147/SC6
BSI, 389 Chiswick High Road, London W4 4AL, UK. Tel: + 44 20 8996 7174. Fax: + 44 20 8996 7799.
E-mail: david.upstone@bsi-global.com
NEW WORK ITEM PROPOSAL
ISO/TC 147/SC 6 N 331
Date of presentation: 2005-02-09
Reference number: (to be given by the Secretariat)
Proposer: ISO/TC147/SC6
Secretariat: BSI
A proposal for a new work item within the scope of an existing committee shall be submitted to the secretariat of that committee with a copy to
the Central Secretariat and, in the case of a subcommittee, a copy to the secretariat of the parent technical committee. Proposals not within the
scope of an existing committee shall be submitted to the secretariat of the ISO Technical Management Board.
The proposer of a new work item may be a member body of ISO, the secretariat itself, another technical committee or subcommittee, or
organization in liaison, the Technical Management Board or one of the advisory groups, or the Secretary-General.
The proposal will be circulated to the P-members of the technical committee or subcommittee for voting, and to the O-members for information.
See overleaf for guidance on when to use this form.
IMPORTANT NOTE: Proposals without adequate justification risk rejection or referral to originator.
Guidelines for proposing and justifying a new work item are given overleaf.
Preparatory work (at a minimum an outline should be included with the proposal)
A draft is attached An outline is attached. It is possible to supply a draft by
The proposer or the proposer's organization is prepared to undertake the preparatory work required Yes No
Proposed Project Leader (name and address): Mr T Warn
Name and signature of the Proposer: (include contact information)
5) Urgency of the activity, considering the needs of other fields or organizations. Indicate target date and, when a series of standards is
proposed, suggest priorities.
6) The benefits to be gained by the implementation of the proposed standard; alternatively, the loss or disadvantage(s) if no standard is
established within a reasonable time. Data such as product volume or value of trade should be included and quantified.
7) If the standardization activity is, or is likely to be, the subject of regulations or to require the harmonization of existing regulations, this should
be indicated.
If a series of new work items is proposed having a common purpose and justification, a common proposal may be drafted including all elements
to be clarified and enumerating the titles and scopes of each individual item.
e) Relevant documents: List any known relevant documents (such as standards and regulations), regardless of their source. When the
proposer considers that an existing well-established document may be acceptable as a standard (with or without amendment), indicate this with
appropriate justification and attach a copy to the proposal.
f) Cooperation and liaison: List relevant organizations or bodies with which cooperation and liaison should exist.
ISO/WD 5667-20
Secretariat: BSI
Warning
This document is not an ISO International Standard. It is distributed for review and comment. It is subject to
change without notice and may not be referred to as an International Standard.
Recipients of this draft are invited to submit, with their comments, notification of any relevant patent rights of
which they are aware and to provide supporting documentation.
Copyright notice
This ISO document is a working draft or committee draft and is copyright-protected by ISO. While the
reproduction of working drafts or committee drafts in any form for use by participants in the ISO standards
development process is permitted without prior permission from ISO, neither this document nor any extract
from it may be reproduced, stored or transmitted in any form for any other purpose without prior written
permission from ISO.
Requests for permission to reproduce this document for the purpose of selling it should be addressed as
shown below or to ISO's member body in the country of the requester:
[Indicate the full address, telephone number, fax number, telex number, and electronic mail address, as
appropriate, of the Copyright Manager of the ISO member body responsible for the secretariat of the TC or
SC within the framework of which the working document has been prepared.]
Reproduction for sales purposes may be subject to royalty payments or a licensing agreement.
Introduction
1 Scope
2 References
3 Terms and definitions
4 Summary of key points
5 Types of error and variation
5.1 General
5.2 Analytical error
5.3 Sampling error
6 Activities
6.1 Estimation of summary statistics
6.2 Limit values and compliance
6.3 Confidence of failure
6.4 Methods for percentile standards
6.5 Non-parametric methods
6.6 Look-up tables
7 Water quality limit values
7.1 General
7.2 Ideal limit values
7.3 Absolute limits
7.4 Percentage of failed samples
7.5 Calculating limits in effluent discharges
8 Declaring that a substance has been detected
9 Detecting change
10 Classification
10.1 General
10.2 Confidence that class has changed
Annex A (informative) Calculation of confidence limits in clause 6.4
Annex B (informative) Calculation of confidence limits in clause 6.5
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 5667-20 was prepared by Technical Committee ISO/TC 147, Water quality, Subcommittee SC 6,
Sampling (general methods).
ISO 5667 consists of the following parts, under the general title, Water quality — Sampling:
Part 5: Guidance on sampling of drinking water and water used for food and beverage processing;
Part 13: Guidance on sampling of sludges from sewage and water-treatment works;
Part 14: Guidance on quality assurance of environmental water sampling and handling;
Part 15: Guidance on preservation and handling of sludge and sediment samples;
Introduction
This guide is about the use of data obtained by taking samples. Its purpose is to deal with the use of such data in taking decisions, i.e. in measuring success or failure in the presence of the errors associated with sampling. The guide aims to help control the risk that such errors lead to wrong decisions.
Poor decisions can also stem from the way in which water quality standards for discharges and environmental
waters are framed or set out in regulations and permits. This guide looks also at the problems that are caused
when compliance with these standards is assessed using data obtained by sampling.
NOTE Decisions might result in the commendation or criticism of people, sites, companies, sectors or nations. They
may lead to legal action or decisions to reduce or increase discharges to the environment.
There are several sampling methods available and their respective merits are under debate. This guide does
not deal with sampling methods. It deals with the additional issue of using the results from sampling to take
decisions, even where the choice of method of sampling is correct and it is used properly.
1 Scope
This guide establishes general principles for dealing with the use of sample data for decision-making. The
scope includes assessment of:
• Change
It is not the purpose of this guide to recommend particular statistical techniques. Nor does it cover a wide
range of techniques and the circumstances in which they should be used. The purpose is to establish the
principle that sampling errors (and errors generally) must be assessed and taken into account as part of the
process of taking decisions.
NOTE A few statistical techniques are used as illustrative examples. These are techniques that have seen routine
use in some regulatory regimes.
2 References
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references the latest edition of the referenced
document (including any amendments) applies.
4 Summary of key points

Sampling error should be quantified and taken into account in all cases where water quality varies and sampling is used to estimate information that is used to take decisions. This includes assessing compliance with standards and thresholds (see Clause 7), deciding whether water quality has changed (see Clause 9), and putting waters into grades in classification systems (see Clause 10). This guide recommends that:
• Standards where compliance is assessed by sampling should be defined or used so that sampling error can be estimated and dealt with appropriately (see Clause 7.2);
• Absolute limits should be treated as a percentile when assessing compliance using sampling (see 7.3);
• Standards defined as limits to be met by a percentage of samples should be defined or used as the corresponding percentiles (see Clause 7.4);
• The degree of confidence should be estimated when aiming to demonstrate failure of water quality standards (see Clause 6); and,
• The degree of confidence in changes or differences should be estimated when aiming to demonstrate change or no change (see Clause 10.2).
This guide sets out basic requirements and illustrative methods that will be adequate in many decisions of the
type set out in Clause 1 above and Clauses 5.1 and 5.3 below, or that will serve as a preliminary look at the
sensitivity of a decision to sampling error. This guide does not cover a full range of statistical techniques.
This guide does not deal with the mechanics of taking the samples themselves, how to ensure the samples are representative, or how to perform chemical analysis on the samples. These are big topics in themselves and they are covered in other guides. If badly done, they add to the difficulties from sampling error and, in some cases, the resulting errors may dominate those from sampling error.
5 Types of error and variation

5.1 General
In many procedures by which data are used to take decisions, there will be a set of results taken over a period
of time (e.g. a year). This information might be used to make judgements such as the following:
• Water quality in this river failed to meet the required standards during this year;
• This company has better effluent discharge compliance than that one; or,
• Most of the risk of environmental impact is from this particular type of effluent discharge.
It is unlikely that there are many significant changes in water quality from second to second throughout a year,
but variations from day to day are common. These can be due to diurnal cycles, the play of random errors and
bias from the laboratory, the weather, step changes and day-to-day variations (perhaps in the natural
processes in water or caused by discharges and abstractions and changes in these), seasonal and economic
cycles, and several underlying and overlapping long-term trends.
In addition, the set of samples must be representative of the average quality of the masses of water from which
they were taken. For example, a set of samples must be representative of the period of time being reviewed,
i.e. when estimating an annual mean, it would not be acceptable for all the samples to be taken in April.
5.2 Analytical error

Analytical errors are those errors that are introduced by the process of chemical analysis; they reflect the fact that these measurements are not error-free. It may be that the result for a single sample can be specified to within a specific range, e.g. ± 15 %.
NOTE This depends on the capabilities of the equipment and the laboratory that has been used to perform the
analysis.
When a mean is calculated from a set of samples, the errors in the chemical analysis tend to average out according to the square root of the number of samples. For example, if the analytical error associated with a single sample were ± 15 %, then the error in the estimate of the mean of a set of chemical analyses would tend to reduce to something like ± 4 % for 12 samples or to ± 2,5 % for 36 samples.
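The square-root averaging described above can be sketched as a short calculation. This is only an illustration; the ± 15 % figure and the sample counts are taken from the example in the text, and independent random analytical errors are assumed:

```python
import math

def averaged_analytical_error(single_sample_error_pct, n):
    """Analytical error on a mean of n results, assuming independent
    random errors that average out as 1/sqrt(n)."""
    return single_sample_error_pct / math.sqrt(n)

# A single-sample error of +/- 15 % shrinks as more samples are averaged
print(round(averaged_analytical_error(15.0, 12), 1))  # -> 4.3
print(round(averaged_analytical_error(15.0, 36), 1))  # -> 2.5
```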
In using samples to take decisions this kind of error from chemical analysis augments but is often smaller than
that from statistical sampling error.
NOTE The estimate of the mean and its errors can be misleading if the samples are unrepresentative, for example if the samples for the estimate of the annual mean are all taken in winter, or from a particular patch or depth of water. Sometimes several or most of the data will be reported as less than a specified limit of detection and, depending on the types of decisions that depend on the data, may require special techniques 1).

When the sample results are used to estimate the value of other summary statistics, such as percentiles 2), the picture is similar, i.e. the errors are inversely proportional to the square root of the number of samples but are larger than for the mean.
5.3 Sampling error

Sampling error is due to quality variations in the water being sampled and to the limited ability of the sampling process to reflect those variations accurately. In a set of samples taken over a period of time, the results are affected by the operation of the laws of chance in the way the particular samples came to be collected. This produces error even if the analytical errors happened to be close to zero and the sampling were truly representative.
In using sampling, the main source of error is usually associated with the number of samples taken. In the
types of decision listed below sampling error is usually a bigger issue than, for example, that associated with
errors of chemical analysis. (Though in practice, errors in chemical analysis, for example, are bound up within
what is observed as statistical sampling error, say, in the list of measurements taken in a year.)
Sampling error should be assessed in cases where water quality varies, such as the following:
• when using samples to estimate summary statistics, e.g. the monthly mean, the annual percentile or the annual maximum;
• when making statements about whether this year’s summary statistics are higher or lower than last year’s;
• when using summary statistics to place water quality in a particular class within a classification system; or
1) Such data are called censored data. Special techniques are available for getting the best out of such data.
6 Activities
6.1 Estimation of summary statistics

An estimate of a summary statistic depends on the values that happen to be captured by sampling. Due to sampling error, the estimate is almost certain to differ from the true value, i.e. that which would be obtained if it were possible to achieve continuous error-free monitoring.
Sampling error can be managed by calculating confidence limits. Confidence limits define the range within which the true value of the summary statistic is expected to lie. In the example in Table 1, the estimate of the mean from 8 samples is 101 mg/l and there is a pair of 95 % confidence limits, 47 and 155. There is 95 % confidence that the true mean exceeds the optimistic confidence limit of 47, and 95 % confidence that the true mean is less than the pessimistic confidence limit of 155. Overall, there is 90 % confidence that the true mean lies in the range between 47 and 155.
Table 1 — Estimate of a mean with confidence limits

Mean 101 mg/l
Standard deviation 82
Number of samples 8
Optimistic (lower) 95 % confidence limit 47
Pessimistic (upper) 95 % confidence limit 155
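One conventional way of producing limits like those in Table 1 is a Student's t interval on the mean. The sketch below assumes that route; the one-sided 95 % t value for 7 degrees of freedom (1.895) is a standard constant, and the result lands close to the 47 and 155 quoted in the text, with small differences due to rounding:

```python
import math

T_95_DF7 = 1.895  # one-sided 95 % Student's t value for 7 degrees of freedom

def mean_confidence_limits(mean, sd, n, t_crit):
    """Lower and upper limits, each one-sided at 95 %, which together
    form a 90 % two-sided confidence interval on the true mean."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

lower, upper = mean_confidence_limits(101.0, 82.0, 8, T_95_DF7)
print(round(lower), round(upper))  # close to the 47 and 155 of Table 1
```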
The gap between the confidence limits widens as the sampling rate is decreased. It is also larger for estimates
of extreme summary statistics such as the 95-percentile and 99-percentile. For a typical water quality pollutant,
the confidence limits for a mean estimated from 12 samples are ±30 % around this estimate. For an estimate
of the 95-percentile, this range is –20 % to +80 % around the estimate of the 95-percentile.
6.2 Limit values and compliance

It will be argued in Clause 7 that the impact of sampling error makes it vital that water quality standards and similar controls are defined or used as means or percentiles and not, for example, as absolute limits. In other words, the decision to use sampling means that definitions of water quality standards should be restricted to summary statistics that can be assessed properly by sampling. In this clause the discussion is limited to summary statistics that are means and percentiles.
NOTE The use of a limit that is expressed as an annual mean implies, for example, that the pollutant causes damage that builds up over time. But it can also apply if the impact of the pollutant is associated with values higher than the mean, so long as the shape of the statistical distribution can be expected to be fairly stable and if action to reduce the mean will also reduce the number or scale of peak events. The residual risk is often covered by the size of the safety factors built into the standard in the first place. Given all these conditions, the use of a mean as a standard has the advantage that it is generally efficient in terms of getting the smallest sampling errors from a set of samples.
If the water quality standard is defined as a mean, then it is a simple matter to estimate the mean from a set of
samples. This estimate can then be compared with the value of the mean that is set down as the water quality
standard. If the value estimated from the samples is worse than this value, the site under test can be said to
have failed. If the estimate is better than this value, the site under test can be said to have passed. This type
of assessment is called a face-value assessment.
A difficulty with the face-value approach is that any estimate of the mean depends on the values captured by sampling. There is a risk, caused by sampling error, that a compliant site (one which met the mean standard) might be reported as a failure purely because the set of samples happened by chance to capture a few high values.
This risk can be controlled by allowing for the doubt that the sampling error puts into the estimate of the mean.
One way of doing this is to calculate confidence limits (see Table 1).
In Table 1, the face-value estimate of the mean is 101. Around this there is the pair of confidence limits, 47
and 155. These define a confidence interval. With a 90 % confidence interval, there is 95 % confidence that
the true mean is less than the upper or pessimistic confidence limit (155 in Table 1). There is a chance of only
5 % that a true mean as high as this could have come about from the action of chance in sampling.
Similarly, there is 95 % confidence that the true mean exceeds the optimistic or lower confidence limit (47 in
Table 1). There is a chance of only 5 % that a value as low as this could have arisen by chance.
To assess compliance, the confidence limits should be compared with the standard.
• If the pessimistic confidence limit is less than the mean limit, there is at least 95 % confidence that the standard was met (for a 90 % confidence interval).
• If the optimistic confidence limit exceeds the mean limit, there is at least 95 % confidence that the standard was not met.
• Where the mean limit lies between the optimistic and pessimistic confidence limits, it is not possible to state compliance or failure with at least 95 % confidence.
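This three-way comparison can be written down directly. The sketch below uses the Table 1 confidence limits of 47 and 155 and the example limit values discussed in the text:

```python
def assess(optimistic_cl, pessimistic_cl, limit):
    """Compare a 90 % confidence interval on the mean with a mean limit."""
    if pessimistic_cl < limit:
        return "pass"        # at least 95 % confidence the standard was met
    if optimistic_cl > limit:
        return "fail"        # at least 95 % confidence the standard was not met
    return "unresolved"      # the limit lies between the confidence limits

# Confidence limits 47 and 155, as in Table 1
print(assess(47, 155, 160))  # -> pass
print(assess(47, 155, 40))   # -> fail
print(assess(47, 155, 100))  # -> unresolved
```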
With the results shown in Table 1, the site would pass a water quality standard set as a mean of 160 and fail a standard of 40. These decisions have at least 95 % confidence. Performance against a standard of 100 is unresolved at 95 % confidence because the value of 100 lies between the confidence limits.
A limit expressed as a percentile may be preferred to the mean where, for example, damage is associated with high concentrations rather than with the average. It can also apply if the impact of the pollutant is associated with values higher than the 95-percentile limit, so long as the shape of the statistical distribution can be expected to be fairly stable and if action to reduce the 95-percentile will also reduce the number of peak events. Similarly, the use of a standard like the annual 95-percentile implies knowledge or an assumption that the duration of individual events of high concentration is not important so long as the total exceedence is less than 5 %, though such a standard can still apply where the distribution of the duration of events is expected to remain fairly stable.
Where none of this applies other types of standard can be used, though these will imply that compliance may
need to be assessed by monitoring that is nearly continuous, and not by a small set of samples.
6.3 Confidence of failure

In taking decisions, the response could vary from “report the failure” to “take legal action” to “spend a lot of money to rectify the problem”. The consequences of being wrong vary and, in principle, each type of decision requires its own degree of confidence, i.e. its own accepted risk of being wrong. The more important the decision, the less the decision-taker should allow the play of errors in sampling and measurement to lead to a wrong decision.
The confidence of failure is a single statistic that replaces the need to compute different confidence limits for
each type of decision. It varies on a scale from 0 % to 100 % (see Table 2).
Table 2 — Confidence of failure

Mean 101
Standard deviation 82
Number of samples 8
Confidence of failure of a mean standard of 30: 98 %
Confidence of failure of a mean standard of 180: 1 %
In Table 2, the 98 % confidence of failure states that there is a risk of only 2 % that the site under test met the
mean standard of 30 but it appeared to fail because the action of chance produced a set of bad samples. In
this case it is appropriate to take any action where it is acceptable to live with a risk of up to 2 % that such
action is unnecessary.
This looks at the risk that failure is wrongly reported. The parallel, though less common, exercise is to look at the risk that success is claimed wrongly. In Table 2, there is 1 % confidence that a standard of 180 is failed; in other words, there is 99 % confidence that the standard was met.
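The confidence of failure can be sketched as below. A normal approximation is used for simplicity; a Student's t calculation is more appropriate for a sample size as small as the 8 in Table 2, so the figures here differ slightly from the quoted 98 % and 1 %:

```python
from math import sqrt
from statistics import NormalDist

def confidence_of_failure(mean, sd, n, limit):
    """Confidence (0 to 1) that the true mean exceeds the limit,
    using a normal approximation to the sampling distribution."""
    standard_error = sd / sqrt(n)
    return NormalDist().cdf((mean - limit) / standard_error)

# Table 2 data: mean 101, standard deviation 82, 8 samples
print(confidence_of_failure(101, 82, 8, 30))   # high: failure of a limit of 30 is near certain
print(confidence_of_failure(101, 82, 8, 180))  # low: a limit of 180 is almost certainly met
```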
6.4 Methods for percentile standards

One way of estimating percentiles from the results of sampling is to use an assumption about the statistical distribution from which the samples were taken, e.g. whether it is log-normal or normal. Such methods are called parametric methods, as distinct from non-parametric methods (which generally need make no assumption about distribution).

Parametric methods depend on the fact, for example, that the 95-percentile of a normal distribution is 1.64 standard deviations above the mean. The example in Table 3 gives an estimate of the 95-percentile of 250 for a mean and standard deviation of 101 and 82 respectively. In this example a log-normal distribution is assumed and the mean and standard deviation are converted to the log domain using the Method of Moments 5).
NOTE It is not the intention in this example to advocate the assumption of a log-normal distribution in circumstances where it is not appropriate, or the use of the particular methods of calculating percentiles and confidence limits that were used for this example. But it is useful to do these sorts of calculations to indicate the scale of error, if only as a preliminary. What is being advocated is that it is folly to act as if the true 95-percentile in the above example were 250, in ignorance that the range on this was something like 160 to 760, or worse.
It may be important in the context of the decisions made as a result of these data to use different statistical techniques. It may also be that the data are unrepresentative in time or space, or that there were mistakes in the mechanics of taking and handling the samples. The data may be affected by limitations in analytical technique and expressed as “less than” some detection limit. There may be underlying trends. Many of these factors will mean that the true error is bigger than suggested by the range from 160 to 760.
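As an illustration of the parametric route, the Method of Moments step can be sketched as follows. This is an outline of one common convention, not the document's prescribed method; the exact figures behind Table 3 may use slightly different constants, so the result lands near, rather than exactly on, the quoted 250:

```python
import math
from statistics import NormalDist

def lognormal_percentile(mean, sd, p):
    """Percentile of a log-normal distribution fitted to a sample mean
    and standard deviation by the Method of Moments."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)  # variance of ln(x)
    mu = math.log(mean) - sigma2 / 2.0         # mean of ln(x)
    z = NormalDist().inv_cdf(p)                # about 1.645 for p = 0.95
    return math.exp(mu + z * math.sqrt(sigma2))

# Mean 101 and standard deviation 82, as in Table 3
print(round(lognormal_percentile(101.0, 82.0, 0.95)))  # near 250
```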
5) This provides equations that convert the mean and standard deviation into estimates of the mean and standard deviation for the logarithms of the data without the need to take logarithms of the sample results themselves (see Annex A).
6) They are calculated in this particular case using the properties of the Shifted T-Distribution (see Annex A).
Table 3 — Parametric estimate of a 95-percentile

Mean 101
Standard deviation 82
Number of samples 8
Estimate of the 95-percentile 250
95 % confidence limits on the 95-percentile 160 and 760
Confidence of failure of a 95-percentile standard of 800: 4 %
In Table 3, the confidence of failure of 4 % states that there is 96 % confidence that the standard was met. There is a risk of only 4 % that the site truly met the 95-percentile standard of 800 but appeared to fail because the set of samples contained, through chance, an unexpected number of high values.
The assumption of log-normality is worth making if data can be assumed to be roughly compatible with the
log-normal distribution. The assumption injects extra information into the calculation and so can boost the
precision of estimates of percentiles, when compared with non-parametric methods.
NOTE A parametric method may be risky where there is no evidence that the data follow a parametric distribution.
6.5 Non-parametric methods

There are instances where parametric methods cause difficulties. It has been noted in Clause 6.4 that it might sometimes be wrong to assume, for example, a log-normal distribution.

Non-parametric methods for the estimation of percentiles are based on ranking the sample results from smallest to largest. An estimate of the 95-percentile is given as the value that is approximately 95 % of the way along this ranked list, interpolating where this point falls between a pair of samples.
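The ranking-and-interpolation step can be sketched as below. This uses one common interpolation convention among several, and the data values are invented purely for illustration:

```python
def nonparametric_percentile(samples, p):
    """Estimate a percentile by ranking the results and interpolating
    between the pair of ranked values that straddle the target point."""
    ranked = sorted(samples)
    position = p * (len(ranked) - 1)   # fractional position in the ranked list
    i = int(position)
    if i + 1 >= len(ranked):
        return ranked[-1]
    fraction = position - i
    return ranked[i] + fraction * (ranked[i + 1] - ranked[i])

results = [12, 30, 45, 50, 66, 80, 120, 210]  # hypothetical sample results
print(round(nonparametric_percentile(results, 0.95), 2))  # -> 178.5
```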
Since assumption (or information) is excluded, the estimates from non-parametric methods tend to be less
precise than those from parametric methods.
NOTE Estimates from non-parametric methods might be less risky than a false assumption of log-normality.
Parametric methods may be unhelpful in applications that involve legal action. This may happen if there is a need to avoid assumptions that might be contested, e.g. whether the sample results follow a log-normal distribution (or any other parametric distribution).
When assessing compliance with percentile standards, an alternative and simpler way of using a non-parametric approach is to count the number of failed samples, i.e. the number of sample results whose concentration exceeds the concentration in the percentile standard. The proportion of failed samples is an estimate of the time spent in excess of the threshold. If more than 5 % of samples exceed the limit in a 95-percentile standard, it is tempting to say that the standard was not met. However, this is a face-value assessment of the percentage of time spent outside the limit, and it is vulnerable to sampling error.
This method of using data, counting failed samples, means that some of the information in the samples is not
used, and this can sometimes be significant. For example, there is no difference between a sample that only
just exceeded a limit, and one that exceeded it grossly. Both are merely failed samples under this method.
For 26 samples with one failed sample, the percentage of failed samples is 1/26 × 100 or 3,8 %. This is an
estimate of the true failure rate, i.e. the true time spent in failure. The value of 3,8 is less than 5 %. This states,
at face value, that a 95-percentile standard has not been failed (whereas a 99-percentile standard would have
been).
Table 4 shows that for 26 samples and one failed sample, the 95 % confidence limits about the value of 3,8 %
are 0,20 % and 17,0 %. These two one-sided 95 % limits define a 90 % confidence interval on the estimate of
the time spent in failure (the percentage of failed samples): there is 90 % confidence that the true failure rate
lies in this range.
Table 4 — True failure rate estimated from the percentage of failed samples

Number of samples   Number of failed samples   Percentage of failed samples (%)   True failure rate (%) (90 % confidence interval)
4                   1                          25.0                               1.27–75.1
12                  1                          8.33                               0.43–33.9
26                  1                          3.85                               0.20–17.0
52                  1                          1.92                               0.099–8.8
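The confidence limits in Table 4 follow from the binomial distribution described in Annex B. A minimal sketch, using bisection to find each one-sided limit (the function names are illustrative, not part of the guidance):

```python
from math import comb

def binom_cdf(f, n, p):
    """Probability of at most f failed samples out of n, when each sample
    fails with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(f + 1))

def solve_p(f, n, target):
    """Bisect for the failure probability p at which binom_cdf(f, n, p)
    equals target (the cdf falls as p rises)."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(f, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def failure_rate_limits(n, f, alpha=0.05):
    """One-sided 95 % confidence limits on the true time spent in failure,
    given f failed samples out of n; together they bound a 90 % interval."""
    lower = solve_p(f - 1, n, 1 - alpha) if f > 0 else 0.0
    upper = solve_p(f, n, alpha)
    return lower, upper
```

For one failed sample in 26 this gives limits of about 0,20 % and 17 %, matching the third row of Table 4.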
The third row of Table 4 shows that there is a risk of 5 % that a result as bad or worse than one failed sample
in a set of 26 could have been produced from a site whose true failure rate was as low as 0,20 %. Similarly,
there is a risk of only 5 % that a result as good or better than one failed sample in a set of 26 could have been
produced from a site whose true failure rate was as bad as 17 %.
As before, sampling is used to estimate a summary statistic that in this case refers to the time spent in failure.
Again the use of sampling introduces sampling error.
If there were 12 samples and one of them exceeded the limit, this would represent 8,33 % of failed samples.
This is the face-value estimate of the time spent in failure. Table 4 gives the corresponding 95 %
confidence limits as 0,43 % and 33,9 %.
Suppose the standard was a 95-percentile. As before, it is necessary to compare not only the face-value
estimate of 8,33 with the allowance of 5 %, but to do the same with the optimistic confidence limit. If this
exceeds 5 % there is at least 95 % confidence that the site has truly failed the 95-percentile standard. In this
case, the optimistic confidence limit is only 0,43 % and the failure is not significant at 95 % confidence.
The above deals with the assessment of failure. To be sure of a pass it should be a requirement that the
pessimistic confidence limit is less than 5 %.
Any other position, where the figure of 5 % lies between the two confidence limits, is unresolved at 95 %
confidence.
Table 5 parallels the contents of Table 4 but gives the confidence of failure of a 95-percentile limit for
particular outcomes from sampling.
7 In this particular case this is estimated from the properties of the binomial distribution (see Annex B).
Table 5 shows that there is 54 % confidence that this outcome, 12 samples and 1 failed
sample, indicates that the 95-percentile limit has been failed, i.e. that the concentration in the 95-percentile
limit has been exceeded for more than 5 % of the time. In this case it is appropriate to take decisions as a
consequence of this set of samples only so long as it is acceptable that the risk is 46 % that the action is
unnecessary. This might rule out expensive or irreversible decisions.
Table 5 — Confidence of failure of a 95-percentile limit

Number of samples   Number of failed samples   Confidence of failure (%)
4                   1                          81
12                  1                          54
26                  1                          26
52                  1                          7
150                 1                          0.05
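The confidence-of-failure figures in Table 5 can be derived from the binomial distribution: they are the probability that a site failing for exactly 5 % of the time would have produced fewer failed samples than were observed. A sketch (the function name is illustrative):

```python
from math import comb

def confidence_of_failure(n, f, percentile=0.95):
    """Confidence that a percentile standard has truly been failed, given f
    failed samples out of n: the probability that a site failing for exactly
    (1 - percentile) of the time yields fewer than f failed samples."""
    p_fail = 1.0 - percentile
    return sum(comb(n, k) * p_fail**k * (1.0 - p_fail)**(n - k)
               for k in range(f))
```

For 12 samples and one failed sample this gives 0,95 to the power 12, about 54 %, as in Table 5.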
The preceding clause has discussed the use of the optimistic confidence limit or confidence of failure to
determine which sets of samples show failure that is statistically significant. This process can be simplified by
allowing more failed samples than, for example, the 5 % suggested by a 95-percentile standard.
In this approach, the number of failed samples at which failure is declared is increased to the point where
there is at least 95 % confidence that the site has truly failed for more than the percentage of time allowed by
the percentile standard, i.e. 5 % for the 95-percentile. This gives at most a risk of one in 20 that a site is
wrongly declared as a failure. Table 6 contains
figures for the 95-percentile standard and 95 % confidence of failure.
Table 6 — Number of failed samples giving at least 95 % confidence of failure of a 95-percentile standard

Number of samples   Number of failed samples
4–7                 2
8–16                3
17–28               4
29–40               5
41–53               6
54–67               7
A similar table forms part of the permits sanctioning discharges from sewage treatment plants under the EU
Directive concerning urban wastewater treatment (91/271/EEC). It defines “failure” as a state where there is 95 %
confidence that the 95-percentile standard was failed.
Table 7 — Maximum permitted number of failed samples under Directive 91/271/EEC

Number of samples   Maximum permitted number of failed samples
4–7                 1
8–16                2
17–28               3
29–40               4
41–53               5
Just as extra failed samples are allowed in order to give proof of failure, so fewer failures than the 5 %
associated with the 95-percentile standard are required in order to demonstrate proof of success. In designing
rules for rewarding proven compliance, a different look-up table would be needed. This table would be
designed to control the risk that a site that has truly failed is wrongly reported as compliant because of
sampling error.
There is a limitation in using this type of look-up table to show high confidence of compliance with standards
such as the 95-percentile. This is because at low rates of sampling, reaching the required level of confidence
can appear to require fewer than zero failed samples. Confirming at least 95 % confidence that a limit is met
for 95 % of the time cannot be done with fewer than 57 samples.
NOTE The use of fewer than 57 samples may require the use of parametric methods.
7.1 General
Water quality can be assessed by the use of limit values9. The results from samples are compared with limit
values in order to assess compliance. This clause looks at the extra difficulties that can be caused by the way
limit values are defined.
8 The calculations in this Clause are based on the properties of the binomial distribution (see Annex B).
9 Often called water quality standards, or environmental quality standards, or set as conditions in the permits for
discharges.
-   the value of the limit, e.g. a concentration of 10 µg/l;
-   a summary statistic, e.g. how often the limit may be exceeded, e.g. 1 % of the time10;
-   the period of time over which this statistic applies, e.g. a calendar year.
These three criteria are key and set the limit value. A fourth is relevant when deciding the action to improve
water quality. When the action is finished, the question arises: what residual risk of failure is acceptable in the
long run, for example, as a consequence of rare patterns in the weather? A fourth point covers this as follows:
-   the definition of the design risk, i.e. the proportion of time periods for which failure to meet the criterion
    (enshrined in the three bullet points above) is accepted, e.g. one in 20 calendar years.
In other words, using the numbers introduced so far, it is acceptable that an annual estimate of the 99-
percentile exceeds 10 µg/l in 1 year in 20.
A fifth criterion may be added that deals with the actual assessment of compliance, i.e. from the samples taken in
a particular calendar year:
-   the statistical confidence with which non-compliance is to be demonstrated before failure is reported.
There is a trade-off between the fourth and fifth points. The fourth point, perhaps not as important as the first
three, is best regarded as the outcome given even continuous error-free monitoring. It relates to the
acceptability of the physical consequences of truly failing the standard. The fifth deals with compliance. Failure
might be defined as the case where the monitoring or sampling shows at least 95 % confidence that the failure
is true and not attributable to the play of chance in sampling.
In other words, again using the numbers introduced so far, it is acceptable that an annual estimate of the
upper 95 % confidence limit exceeds 10 µg/l in 1 year in 20.
In an ideal limit all five items are defined. If any were undefined, its value would take an arbitrary, unknown
value that could vary from decision to decision as the limit was used. It is useless to know that the limit is 10
µg/l, whilst allowing any or several of the other four items to vary in an unknown manner for each decision.
Over a long period of time, e.g. 20 years, the 95-percentile value of the concentration should be less than
200 in 19 summers out of 20 and failure will be declared when monitoring shows non-compliance with
95 % confidence.
Over a long period of time, e.g. 20 years, the mean value of the concentration should be less than 0,6 in
five calendar years out of 10 and failure will be declared when monitoring shows non-compliance with
95 % confidence.
The purpose of a limit is compromised if it is defined in a way that ignores the fact that compliance will be
checked by sampling. One type of limit that runs this risk is the maximum value (or absolute limit). This type of
limit is popular because it is easy to understand and use, especially in legal actions.
These benefits should be set against the errors that arise when maxima are assessed against data collected
by sampling. These errors can lead to faulty assessments of performance and so to wrong decisions, e.g. on
legal action and investment to improve quality that does not, in reality, require improvement, or failure to invest
where improvement is truly necessary.
10 This is the annual 99-percentile. Standards might also be expressed as other percentiles and averages for a particular
period of time, e.g. a month.
-   a report that the standard has been failed is almost guaranteed under continuous monitoring or very
    frequent sampling.
To illustrate, consider a site that exceeds a standard for 1 % of the time. Such a site will always be reported as a
failure if assessed using continuous error-free monitoring. Table 8 shows that, under sampling, the probability
that this failure is detected depends strongly on the number of samples11.
Table 8 — Probability of detecting a failure at a site exceeding the standard for 1 % of the time

Number of samples   Probability of detection (%)
4                   4
12                  11
52                  39
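The entries in Table 8 come from the probability of at least one failed sample in the set (see footnote 11). A one-line sketch:

```python
def detection_probability(n, time_in_failure=0.01):
    """Probability that at least one of n random samples catches a site that
    exceeds the standard for the given proportion of the time."""
    return 1.0 - (1.0 - time_in_failure) ** n
```

For 12 samples this gives about 11 %; small differences from the tabulated values reflect rounding.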
When we use sampling, the impression of failure depends on the number of samples. The illusion of improved
performance can be manufactured by taking fewer samples. In the meantime, the “true quality” may have
deteriorated.
As discussed below, absolute limits monitored solely by sampling are not true absolute limits at all. This is
because of the mathematical implication that failure is permitted at times when samples are not taken, i.e.
failure is tolerated for a proportion of the time. Such absolute limits are, in truth, percentiles.
As an ideal limit, the absolute limit has the required clarity for the first item, which might be a value, e.g. a
concentration of 10 µg/l. However, for the second item, the summary statistic is ambiguous. The absolute limit
requires compliance by 100 % of samples in a year.
This has two meanings. The first meaning is that the limit is a 100-percentile, a value that should be met for 100 %
of a year. It has been discussed above that this is illogical if sampling alone is used to assess compliance. The
use of only twelve samples leaves a lot of time where failure could have occurred and might not have been
noticed.
11 These calculations are based on the probability of at least one failed sample in a set of N samples. This is
1 − p^N, where p is the probability of a compliant sample. Thus for 12 samples in Table 8 it is 1 − 0.99^12.
This second option, treating the absolute limits as, for example, 99.5-percentiles, is attractive in cases where
limits have been made so strict that occasional failed samples are likely, but where the occasional failure is of
low concern. This option controls the problem, illustrated in Table 9a, that the percentile (and the severity of
the limit or standard) changes with the sampling rate. It also avoids the untidy and uncomfortable alternative of
inventing rules by which operators and regulators discount certain failed samples.
Table 9a — Percentile equivalent to an absolute limit, by number of samples

Number of samples   Equivalent percentile   Level of proof (%)
4                   75-percentile           50
12                  95-percentile           50
52                  98-percentile           50
In Table 9a an absolute limit of, say, 10 µg/l is equivalent to a 75-percentile if assessed from 4 samples, but a
98-percentile if checked against 52 samples. This change in percentile with sampling rate is equivalent,
typically, to a move from 10 to 30 µg/l in the applied limit. This can be an arbitrary and unfair change in
severity.
Similarly, for a fixed percentile, the severity of the standard is increased in terms of the degree of proof required to
produce a report of failure (see Table 9b).
Table 9b — Confidence of failure for a fixed percentile, by number of samples

Number of samples   Percentile      Confidence of failure (%)
4                   95-percentile   81
12                  95-percentile   54
52                  95-percentile   7
These problems are controlled if the limit is set, as for an ideal limit, to some particular combination of percentile
and level of proof, e.g. the 99,5-percentile with a level of proof of 95 %.
However, it may be that the limit must never be exceeded because any exceedence causes immediate damage.
In this case, it may be best to move the limit to a lower concentration, e.g. 4 µg/l as a 99-percentile instead of
10 µg/l as a "100-percentile", perhaps retaining the original 10 µg/l as an additional control. A failure of the 99-
percentile of 4 µg/l might then be taken as indicating an unacceptable risk of an actual value of 10 µg/l.
If the absolute limit is to be used on its own there is an implication that continuous accurate monitoring is required,
and that there are real-time controls that can prevent exceedence.
12 These conclusions follow from the properties of the binomial distribution. One failed sample in a set of 12 gives a
probability of greater than 95 % that the standard was failed for at least 0,5 % of the time covered by the samples.
In adopting, e.g. the 99,5-percentile instead of the “100-percentile”, the regulator acknowledges that it is
impossible to demonstrate from sampling alone that a limit was met for 100 % of the time. This method, declaring
the percentile point and level of proof, allows the taking of more or fewer samples, whilst retaining the same
severity of standard and level of proof.
The absolute limit has the advantage that lawyers and the public easily understand it. An absolute limit might be
retained in law, but used with a declared policy that the basis for taking a decision, i.e. to prosecute for non-
compliance, treats the absolute limit as a 99,5-percentile.
If the concentration in the sample exceeds a certain limit, then the sample can be said to fail. Some water quality
standards are expressed as the maximum percentage of failed samples in a set of samples taken over a period of
time.
This betrays a lack of appreciation of the difference between a statistical population and a set of samples. A
statistical population is the distribution of all the values that actually occur over a period of time, e.g. in one year.
For one year it can be thought of as approximating to the set of error-free results of chemical analysis taken from
totally representative samples taken for each of the 31,536,000 seconds in the year. In contrast there might be
only 12 samples. The results of samples are used to make an estimate for the population.
Limit values should be expressed as a function of the population, e.g. that the concentration should be below the
limit for at least 95 % of the time. They should not be expressed as limits to be met by, e.g. 95 % of samples.
The percentage of failed samples is an estimate of the time spent in failure.
If a limit value is defined as a limit to be met by at least 95 % of samples, and 8 % of samples fail, the site will be
declared to have failed because 8 % is greater than 5 %, the allowed percentage of failed samples.
Although the 8 % of failed samples exceeds 5 %, the conclusion that the time spent in failure exceeds 5 % is
certain only if sampling is continuous, representative and accurate. There may be only 12 samples in a year and,
purely by chance, a set of samples with high concentrations might have been collected even though the true
failure rate was less than 5 %. In this case the site under test would be wrongly condemned because of sampling
error (or because of the decision to take only 12 samples).
Limits defined as having to be met by a percentage of samples should be redefined and treated as the
corresponding percentiles. Confidence limits, or the confidence of failure, should then be calculated and action
taken according to the accepted risk of acting unnecessarily. This was discussed for Tables 1, 2 and 3.
Data collected by sampling is also used to set the limits needed in permits in order to meet environmental
standards. Table 10 illustrates the calculation (by Monte-Carlo Simulation) of the mean and 95-percentile of
discharge quality in order to meet a 90-percentile standard of 1,3 in a river.
These calculations ensure that the permit conditions are justified in terms of being necessary to meet the
environmental requirement, and that they go no further than this. The need to do them reinforces the
requirement for the "ideal limits" discussed in Clause 7.2 and the need to define limits as means or percentiles.
The calculations themselves are, however, outside the scope of this guidance.
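As an illustration only, a Monte-Carlo calculation of the kind behind Table 10 might be sketched as below. Every distribution and parameter value here is a hypothetical assumption for the sketch, not a figure from this guidance: log-normal river flow and upstream quality, a fixed discharge flow, and log-normal discharge quality, combined by simple mass-balance mixing.

```python
import math
import random

def downstream_90_percentile(dis_conc_gm, n=20000, seed=1):
    """Monte-Carlo estimate of the 90-percentile of river concentration
    downstream of a discharge, by simple mass-balance mixing.
    All input distributions and values are hypothetical."""
    rng = random.Random(seed)

    def lognormal(geo_mean, geo_sd):
        return geo_mean * math.exp(rng.gauss(0.0, math.log(geo_sd)))

    downstream = []
    for _ in range(n):
        river_flow = lognormal(10.0, 2.0)            # hypothetical units
        river_conc = lognormal(0.5, 1.5)             # upstream quality
        dis_flow = 1.0                               # fixed discharge flow
        dis_conc = lognormal(dis_conc_gm, 2.0)       # discharge quality
        mixed = (river_flow * river_conc + dis_flow * dis_conc) / (river_flow + dis_flow)
        downstream.append(mixed)
    downstream.sort()
    return downstream[int(0.90 * n)]
```

The permitted discharge quality would then be tightened (lowering dis_conc_gm) until the simulated downstream 90-percentile falls below the river standard, e.g. 1,3.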
Environmental problems may be due to short-term events. For example, a river may be fully saturated with
oxygen for more than 95 % of the time, but a few minutes of zero oxygen will kill the fish. The use of a 5-
percentile standard in this case works only to the extent that the extreme events are correlated with the 5-
percentile (Clause 6.2), in the context of the safety factors built into the 5-percentile standard. As discussed in
Clauses 6.2 and 7.2, there may be cases where this does not apply. If this risk is to be lived with and
managed by water quality standards, there is an implication that compliance cannot be assessed by sampling,
but requires some form of continuous assessment.
Table 10 — Input data and results
A variation on the absolute limit lies in answering questions such as: "was the substance detected by chemical
analysis?" This should be answered by calculating, for example, whether there is at least 95 % confidence
that the substance is present, above an agreed limit of detection, for at least 10 % of the time. This can be
done by using Table 11.
Table 11 — Minimum number of detections giving at least 95 % confidence that the substance is present for at least 10 % of the time

Number of samples   Minimum number of detections
2–3                 2
4–8                 3
9–13                4
14–18               5
Table 12 — Number of samples needed for at least 95 % probability of detecting a substance, by percentage of time present

Percentage of time present (%)   Number of samples
50                               5
20                               14
10                               29
5                                59
1                                298
9 Detecting change
Sometimes sampling has to be used to demonstrate that water quality has improved or not deteriorated. If the
estimate of the mean was 35 in 2003 and 41 in 2004, this looks like an increase of 17 %.
As before, this conclusion is affected by sampling error. There is a need to calculate the statistical confidence
that the recorded difference is real. A test 14 can be used to calculate the significance of an apparent
difference in the mean. Table 13 gives an example.
In Table 13, an apparent difference in the mean of 25 % is confirmed as significant at a level of confidence of
97 %. This indicates there is a chance of only 3 % that a difference as large as this could arise by chance.
Table 13 — Example of testing an apparent change in the mean

Year   Mean   Standard deviation   Number of samples   Confidence that the change is real (%)
2003   20     10                   25                  97
2004   25     9                    33
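The kind of calculation behind Table 13 can be sketched as a two-sample test on the summary statistics. The version below uses a normal approximation rather than the exact t-distribution, so it reproduces the tabulated confidence only approximately:

```python
import math

def confidence_of_change_in_mean(mean1, sd1, n1, mean2, sd2, n2):
    """One-sided confidence that an apparent increase in the mean is real,
    from two years' summary statistics (normal approximation to the
    two-sample t-test)."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    t = (mean2 - mean1) / standard_error
    # standard normal cumulative distribution via the error function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
```

With the values in Table 13 (means 20 and 25, standard deviations 10 and 9, 25 and 33 samples) this gives about 97,5 %, close to the 97 % shown.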
Similarly, if the estimate of the 95-percentile was 39 in 2003 and 42 in 2004, this looks like an increase of 8 %.
Again this is the face-value conclusion. As in Table 13, it is necessary to calculate the statistical confidence
that the recorded difference is real, as in the following example of the output from a parametric calculation
(see Table 14).
14 There are many tests for detecting change. The above example uses a test suitable for small sets of samples (a
t-test). The aim of the text is to illustrate the need to consider the error, using a technique that produces the sort of
information in the tables.
Table 14 — Example of testing an apparent change in the 95-percentile

Year   Estimate of 95-percentile   Number of samples   Confidence that the change is real (%)
2003   39                          25                  72
2004   42                          33
Table 14 states that there is a risk of 28 % that the increase from 39 to 42 is due to chance (and a 72%
probability that the change is a real one).
Non-parametric methods can be used to tackle the same issue. The first step is to look at the uncertainties in
the estimate of a limit, L. L might be an estimate of the 95-percentile for 2001, i.e. the concentration exceeded
for 5 % of the time. This estimate might be based on N, the number of samples used to estimate L, and E, the
number of exceedences of L in 2001.
Suppose in 2002 M samples were taken and F exceedences of L are observed. A test15 can be carried out to
compare the proportion of exceedences in 2001, (E/N), with the proportion in 2002 (F/M). Table 15 illustrates
the outcome.
Table 15 — Example of a non-parametric test of change in exceedence rate

Year   Number of samples   Number of exceedences   Exceedence rate (%)   Confidence of change (%)
2001   40                  3                       7,5                   93
2002   12                  3                       25,0
In Table 15, the exceedence rate is apparently higher in 2002 (25 %) than in 2001 (7,5 %). At face value this
is a deterioration. It turns out that there is a chance of 7 % that a discrepancy this big could have arisen by
chance.
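Fisher's exact test (footnote 15) can be sketched directly from the hypergeometric distribution. This is the conservative exact form of the test; less conservative variants (e.g. a mid-p correction) give smaller chance values, which may account for differences from the figure quoted above:

```python
from math import comb

def fisher_one_sided(exceed1, n1, exceed2, n2):
    """One-sided Fisher's exact test: the chance that, given the combined
    number of exceedences, the second period would receive at least as many
    of them as observed if both periods shared one exceedence rate."""
    total_exceed = exceed1 + exceed2
    total_n = n1 + n2
    denominator = comb(total_n, n2)
    return sum(comb(total_exceed, k) * comb(total_n - total_exceed, n2 - k)
               for k in range(exceed2, min(total_exceed, n2) + 1)) / denominator
```

For the data of Table 15 the call would be fisher_one_sided(3, 40, 3, 12), comparing 3 exceedences in 40 samples with 3 in 12.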
In summary, to demonstrate change, confidence of change should always be calculated. This distinguishes
changes that can be ascribed to sampling error, from those that really need attention.
10 Classification
10.1 General
Sometimes water quality is described in terms of a classification system. In such a system there may be sets
of water quality standards, for one or more pollutants. Table 16 illustrates a classification system for a single
pollutant. The class limits might be summary statistics of water quality, such as the 95-percentile.
15 This can be done, for example, using “Fisher’s exact test” for 2-by-2 contingency tables.
Table 16 — Example classification scheme for a single pollutant

Class   95-percentile concentration
1       <10
2       10–20
3       20–40
4       40–80
5       >80
To assign the class, an estimate is made of the 95-percentile. If this value were 25, it could be
said that the site was in Class 3 because 25 falls within the range from 20 to 40 (see Table 16). Again this is
a face-value assessment. The true 95-percentile might have differed from 25, and the true grade might have
been Class 2 or 4. There is a risk of a mistake caused by sampling error.
As in the examples discussed above (for example Tables 2, 3, 4 and 12, 13, 14), the statistical confidence
of each possible class should be estimated. In this case this means the confidence that the 95-percentile
exceeded 40, the confidence that it was less than 20, and the confidence that it lay between 20 and 40. This
might give the example in Table 17.
Table 17 — Example classification for a single pollutant using confidence of class (%)

Class 1   Class 2   Class 3   Class 4   Class 5
–         5         56        35        4
In Table 17, there is 56 % confidence that Class 3 is the true class, 35 % confidence that the true class is
Class 4, 5 % that it is Class 2 and even 4 % that it is Class 5, i.e. two classes different from the face-value.
The possibility of Class 1 has zero confidence.
In practice the classification may be based on several pollutants and the “worst” result might be used to define
the class. Table 18 illustrates the case of four pollutants. The fourth sets the face-value class, i.e. Class 4.
This has 68 % confidence. There is 4 % confidence of Class 5 as a result of the third pollutant. The residual
confidence is 28 % (100 - 68 - 4). In this case this residual can be assigned to the worst class of better quality
than the face-value class – this means it is assigned to Class 3.
Table 18 — Confidence of class (%) for four pollutants and the resulting overall classification

Pollutant   Class 1   Class 2   Class 3   Class 4   Class 5
A           –         94        6         –         –
B           –         23        76        1         –
C           –         5         56        35        4
D           32        –         –         68        –
E           –         –         28        68        4
The confidence of failure of a target class can be used to rank priorities, i.e. to manage the risk of action that
later might turn out to have been unnecessary. For a target of Class 1 or Class 2, Table 18 gives 100 %
confidence of failure. For a target of Class 3, the confidence of failure is 72 % [(68 + 4) %].
Suppose that the face-value class changed from Class 4 in 1997 to Class 3 in 2002. At face value this looks
like an improvement16 (see Table 19).
Table 19 — Confidence of class (%) in 1997 and 2002

Class   1997   2002
1       –      –
2       –      40
3       20     60
4       70     –
5       10     –
The confidence of a change from Class 4 to Class 3 is the product of the values of the confidence of class for
Class 4 in 1997 and for Class 3 in 2002. This is 0,7 × 0,6, or 42 %.
All the possible combinations are given in Table 20. This shows 42 % confidence of the change from Class 4
in 1997 to Class 3 in 2002. It also shows 28 % confidence of a change from Class 4 to Class 2, 8 %
confidence of a change from Class 3 in 1997 to Class 2 in 2002, 12 % confidence that the site stayed in
Class 3, and 4 % confidence of a change from Class 5 to Class 2. Finally, there is 6 % confidence in a move
from Class 5 to Class 3.
Table 20 — Confidence (%) of each combination of class in 1997 and 2002

                         Class in 2002
Class in 1997            1     2     3     4     5     Confidence in 1997 (%)
1                        —     —     —     —     —     —
2                        —     —     —     —     —     —
3                        —     8     12    —     —     20
4                        —     28    42    —     —     70
5                        —     4     6     —     —     10
Confidence in 2002 (%)   —     40    60    —     —
The sum of the numbers in the diagonal (dark) squares in Table 20 gives the overall confidence of no change
in class. This is 12 %, i.e. the confidence that the site stayed in Class 3. The entries are zero for no change
from Class 1, 2, 4 or 5.
The sum of the adjacent lower (light) squares gives the confidence of an upgrade. There is 50 % confidence
of an improvement by one class. This is made up of a 42 % confidence of a change from Class 4 to Class 3
and an 8 % confidence of a change from Class 3 to Class 2.
The shaded portion of Table 20 shows that 88 % is the confidence of an improvement of one class or more17.
Following this logic, the situation can be summarised as in Table 21. This shows 50 % confidence that quality
improved by one class and 34 % confidence that the improvement was by two classes.
The data can also be presented as an accumulating sum, from the bottom, stopping at the middle, to give the
numbers in Table 22.
17 Similarly the sum of the upper diagonals gives the confidence of an overall drop in class, which is zero.
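The arithmetic of Tables 19 to 22 can be sketched as follows: the joint confidence of each 1997/2002 combination is the product of the two confidences of class, and the confidences of improvement, no change and deterioration are sums over the relevant cells (the function name is illustrative):

```python
def change_in_class(conf_year1, conf_year2):
    """Given confidence-of-class vectors for two years (fractions summing
    to 1, Class 1 first), return the confidences of improvement, no change
    and deterioration. A lower class number means better quality."""
    improve = no_change = deteriorate = 0.0
    for i, c1 in enumerate(conf_year1):        # class in year 1
        for j, c2 in enumerate(conf_year2):    # class in year 2
            joint = c1 * c2                    # product rule, as in Table 20
            if j < i:
                improve += joint
            elif j == i:
                no_change += joint
            else:
                deteriorate += joint
    return improve, no_change, deteriorate
```

With the Table 19 values this gives 88 % confidence of improvement, 12 % of no change and zero confidence of deterioration, matching Table 20.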
Tables 17 to 22 are real examples from the management of river water quality.
It may be that over a 5-year period different methods and instruments were used. Random errors in
these will come through in the analysis. If in Table 18 the results were based on fewer samples, or on less
accurate methods of chemical analysis, this would come through as a wider spread of the probabilities of class
and a reduced ability to detect changes in class between 1997 and 2002. On the other hand, the risk is
controlled that expensive action to improve water quality, or complacency that all is well, is caused by errors in
monitoring.
A more difficult issue occurs if the methods in 1997 were biased or based on unrepresentative samples. This
undermines the assessment of change but a lack of knowledge of the bias is no excuse for failing to assess
the impact of sampling error itself.
[1] ISO 5667-1, Water quality — Sampling — Part 1: Guidance on the design of sampling programmes.
[2] Council Directive of 21 May 1991 concerning urban waste water treatment (91/271/EEC), O.J. L 135 of
30.5.1991.
[3] Johnson, N.L. and Welch, B.L. (1939). Applications of the Non-central t-Distribution. Biometrika, 31,
362-389.
[4] Pearson, E.S. and Hartley, H.O. (1972). Biometrika Tables for Statisticians, Volume II. Cambridge
University Press.
A.1 Clause 6.4 used as an example a standard parametric method (the Method of Moments, described below)
to estimate confidence limits around percentiles. The values of m and s are converted to the values for the
logarithms of the data using the Method of Moments:

M = ln( m / √(1 + s²/m²) )

S = √( ln(1 + s²/m²) )

A.2 M and S stand for estimates of the mean and standard deviation of the logarithms of the data. The
characters, ln, denote the natural logarithm. The face-value estimate of the 95-percentile is then:

q = exp(M + 1.6449 S)
A.3 To calculate confidence limits, the factor 1.6449 is replaced with t0, a value which depends on the
sampling rate. The value of t0 is given by:

t0 = [ δ + λ √(1 + δ²/(2f) − λ²/(2f)) ] / [ √n (1 − λ²/(2f)) ]

where f is the number of degrees of freedom, in this case n − 1, where n is the number of samples. Also,
in this equation:

δ = z √n

and z is the Standard Normal Deviate: 1.6449 for the 95-percentile.
A.4 The value of λ approximates to the Standard Normal Deviate for the confidence
limit used to define the optimistic confidence limit. For 95 % confidence, λ approximates to 1.6449. The true
value of λ is calculated more precisely as a function of f and z.
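The Annex A calculation can be sketched in a few lines, following the equations above. It is an approximation, since the true λ depends on f and z (A.4), and the function name and example values are illustrative only:

```python
import math

def percentile_with_limit(n, m, s, z=1.6449, lam=1.6449):
    """Face-value estimate and one-sided confidence limit for a percentile
    under a log-normal assumption (Method of Moments plus the t0 factor
    of A.3). z sets the percentile, lam the confidence level."""
    # Method of Moments: mean and sd of the logarithms of the data
    M = math.log(m / math.sqrt(1.0 + (s / m) ** 2))
    S = math.sqrt(math.log(1.0 + (s / m) ** 2))
    f = n - 1                      # degrees of freedom
    delta = z * math.sqrt(n)
    t0 = (delta + lam * math.sqrt(1.0 + delta**2 / (2 * f) - lam**2 / (2 * f))) / \
         (math.sqrt(n) * (1.0 - lam**2 / (2 * f)))
    face_value = math.exp(M + z * S)
    confidence_limit = math.exp(M + t0 * S)
    return face_value, confidence_limit
```

For example, 12 samples with mean 10 and standard deviation 5 give a face-value 95-percentile of about 19,5, with an upper confidence limit well above it, reflecting the small sample size.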
B.1 If a water achieves the limit for a proportion, p, of the time, the chance that f samples will fail out of a total
of n is R in:

R = [ n! / ( f! (n − f)! ) ] p^(n−f) (1 − p)^f
B.2 In the equation the term f! is f factorial. Where f is 4, f! is 4 times 3 times 2 times 1. Similarly n! is n
factorial etc.
B.3 Usually we know that f samples have failed out of a total of n and we want to estimate the proportion of
time, p, that the water failed and compare this with, say, a proportion 0.95 (or 95 per cent) that is the standard.
(A 95-percentile is a value that is exceeded 0.05 of the time)
B.4 A face value estimate of p is given by the proportion of failed samples, or f/n.
B.5 A confidence limit on the estimate of p is obtained by summing the f + 1 values of R calculated by the
above equation for numbers of failed samples of 0, 1, 2, …, f. In this we make an initial guess of the value of p.
This gives:
[Total R] = R0 + R1 + R2 + … + Rf
B.6 By iteration, the value of p is found which gives a sum of the values of R, [Total R], equal to 0.95.
This value of p is the 95 % confidence limit.
B.7 To derive a look-up table we set p to 0.95 (for a 95-percentile standard) and choose a value of n, say 20
samples. We then work out [Total R] several times: for zero failed samples, f = 0; one failed sample, f = 1; two
failed samples, f = 2; and so on. We continue until we first get a value of [Total R] that exceeds the required
degree of confidence that the 95-percentile has been failed. One less, f − 1, gives the maximum
permitted number of failed samples. The outcome of these calculations for 20 samples is shown in Table B.1:
The entire process is repeated for other sampling rates to give the full look-up table.
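The look-up procedure of B.7 can be sketched directly (the function names are illustrative):

```python
from math import comb

def total_r(n, f, p=0.95):
    """[Total R]: the sum R0 + R1 + ... + Rf of B.5, i.e. the probability of
    at most f failed samples when the water achieves the limit for a
    proportion p of the time."""
    return sum(comb(n, k) * (1 - p)**k * p**(n - k) for k in range(f + 1))

def failures_needed(n, p=0.95, confidence=0.95):
    """Smallest number of failed samples, f, at which [Total R] computed for
    f - 1 failures exceeds the required confidence that the standard was
    failed; one less is then the maximum permitted number of failed samples."""
    f = 1
    while total_r(n, f - 1, p) < confidence:
        f += 1
    return f
```

Calling failures_needed over a range of sampling rates reproduces the second column of Table 6; the maximum permitted number of failed samples, as in Table 7, is one less in each case.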
Comments from the Canadian member body on the Stage 0 draft, with the author's responses:

• Comment: Please avoid use of loaded terms such as "optimistic" and "pessimistic" – use upper and lower.
  Response: I find upper and lower get muddled for water quality standards that must be exceeded. Not yet
  actioned.

• 6.3 Confidence of Failure, 2nd paragraph (TE). Comment: Insert discussion of one-tailed versus two-tailed
  tests.
  Response: Not yet done. Not sure what to do. Suggestions?

• 6.4 Methods for percentile standards, 2nd paragraph (TE). Comment: While one can't disagree with an
  assumption of log-normal distribution in this particular example, the weakness of having this discussion in
  a standard is that many readers might infer that log-normal is commonplace and will use the example
  method in inappropriate situations.
  Response: The text is weak if the discussion of this example gives an impression that this technique is best
  and universal. I've tried to change it. But I'm concerned not to frighten people away from using techniques
  in cases where log-normal is reasonable in the context of the decision being considered. In my experience
  the log-normal assumption is also useful as a first step in quantifying uncertainty.