Consensus building for interlaboratory studies, key comparisons, and meta-analysis

To cite this article before publication: Koepke et al, 2017, Metrologia, at press:
https://doi.org/10.1088/1681-7575/aa6c0e

Manuscript version: Accepted Manuscript. Not subject to copyright in the USA. Contribution of NIST.
AUTHOR SUBMITTED MANUSCRIPT - MET-100872.R1

Consensus Building for Interlaboratory Studies, Key Comparisons, and Meta-Analysis

Amanda Koepke
National Institute of Standards and Technology, U.S. Department of Commerce, Boulder, CO, USA

Thomas Lafarge
National Institute of Standards and Technology, U.S. Department of Commerce, Gaithersburg, MD, USA

Antonio Possolo
E-mail: antonio.possolo@nist.gov
National Institute of Standards and Technology, U.S. Department of Commerce, Gaithersburg, MD, USA

Blaza Toman
National Institute of Standards and Technology, U.S. Department of Commerce, Gaithersburg, MD, USA
Abstract. Interlaboratory studies in measurement science, including key comparisons, and meta-analyses in several fields, including medicine, serve to intercompare measurement results obtained independently, and typically produce a consensus value for the common measurand that blends the values measured by the participants.

Since interlaboratory studies and meta-analyses reveal and quantify differences between measured values, regardless of the underlying causes for such differences, they also provide so-called “top-down” evaluations of measurement uncertainty.

Measured values are often substantially over-dispersed by comparison with their individual, stated uncertainties, thus suggesting the existence of yet unrecognized sources of uncertainty (dark uncertainty). We contrast two different approaches to take dark uncertainty into account both in the computation of consensus values and in the evaluation of the associated uncertainty, which have traditionally been preferred by different scientific communities. One inflates the stated uncertainties by a multiplicative factor. The other adds laboratory-specific “effects” to the value of the measurand.

After distinguishing what we call recipe-based and model-based approaches to data reductions in interlaboratory studies, we state six guiding principles that should inform such reductions. These principles favor model-based approaches that expose and facilitate the critical assessment of validating assumptions, and give preeminence to substantive criteria to determine which measurement results to include, and which to exclude, as opposed to purely statistical considerations, and also how to weigh them.

Following an overview of maximum likelihood methods, three general purpose procedures for data reduction are described in detail, including explanations of how the consensus value and degrees of equivalence are computed, and the associated uncertainty evaluated: the DerSimonian-Laird procedure; a hierarchical Bayesian procedure; and the Linear Pool. These three procedures have been implemented and made widely accessible in a Web-based application (NIST Consensus Builder).

We illustrate principles, statistical models, and data reduction procedures in four examples: (i) the measurement of the Newtonian constant of gravitation; (ii) the measurement of the half-lives of radioactive isotopes of caesium and strontium; (iii) the comparison of two alternative treatments for carotid artery stenosis; and (iv) a key comparison where the measurand was the calibration factor of a radio-frequency power sensor.

Keywords: consensus, interlaboratory, key comparison, meta-analysis, DerSimonian-Laird, hierarchical, Bayesian, linear pool, uncertainty, top-down, bottom-up, gravitation, half-life, stenosis, radio-frequency
1. Introduction

“Consensus building (also known as collaborative problem solving or collaboration) is a conflict-resolution process used mainly to settle complex, multiparty disputes” [17]. In measurement science, the need for consensus building arises when the same measurand is measured by multiple methods or laboratories, the measured values differ appreciably from one another, and there is a compelling reason for reconciling them.
The reason may be the assignment of value to a property of a reference material that has been measured by different methods or laboratories, coupled with the desire of assimilating the information that each measurement result provides about the measurand. (However, Marchandise and Colinet [97] voice a dissenting view, expressing skepticism about the value of “a statistical consensus of several laboratories” as a means to assign values to reference materials.) Or the reason may be the definition of a reference value that pools independent measurement results and is intended to serve as a baseline against which to gauge deviations of individual measured values, to establish their exchangeability or equivalence.

For example, the certified value of the mass fraction of sulfur in the bituminous coal of NIST Standard Reference Material (SRM) 2685c [60] is a consensus value derived from two measurement results by application of the linear pooling method described in §5.4: one obtained using inductively coupled plasma mass spectrometry, the other a consensus value with associated uncertainty resulting from a prior interlaboratory study.

Similarly, the certified value of the mass fraction of zinc in NIST SRM 3168a [59] is a weighted average, computed using the DerSimonian-Laird procedure described in §5.2, of two measurement results: one from a gravimetric preparation using high-purity zinc metal assayed by NIST, and the other from analysis by inductively coupled plasma optical emission spectrometry.
31
32 76 In key comparison CCQM-K5 [124], multiple laboratories measured the mass fractions
33 77 of several polychlorinated biphenyl (PCB) congeners in aliquots of a blend of marine
34
35 78 sediments. Consensus values were computed to gauge degrees of equivalence (§6) for the
36 79 participating laboratories.
37
38
39 80 1.1. Interlaboratory Studies, Key Comparisons, Proficiency Tests, and Meta-Analysis
pte

40
41
42 81 An interlaboratory study involves an intercomparison of measurement results for the same
43 82 measurand, obtained by different laboratories working independently, and serves one or
44
83 more of the following purposes:
45
46
47 84 (i) To assess, compare, and demonstrate the measurement performance of the
48 participating laboratories;
ce

85
49
50 86 (ii) To produce a consensus estimate of the true value of the measurand, and to qualify
51 87 it with an evaluation of measurement uncertainty, for example to characterize
52 88 a reference material (say, NIST SRM 2685c mentioned above), or to compare
53
different realizations of the same SI unit (for example, of the kilogram [116]), or
Ac

54 89

55 90 measurements of a fundamental constant (for example, of the Newtonian constant


56
57
91 of gravitation, §7.1);
58 92 (iii) To evaluate contributions from sources of uncertainty attributable to differences
59
93 between laboratories, separately from contributions that are laboratory-specific,
60
94 thus including the so-called Gauge Repeatability and Reproducibility (Gauge R&R)
AUTHOR SUBMITTED MANUSCRIPT - MET-100872.R1 Page 4 of 65

1
2
3 Consensus Building 4
4

pt
5 95 studies that are commonly carried out in manufacturing [37];
6
7
96 (iv) To determine whether the measurement results are homogeneous, or mutually
8 97 consistent, in the sense that the estimate of the between-laboratory variance
9 98 component (which ISO 5725-1 [70] denotes σL2 , and that below we denote τ2 ) does

cri
10
11 99 not differ significantly from zero;
12 100 (v) To identify individual measurement results that differ significantly from the
13
14 101 consensus value;
15 102 (vi) To determine the differences between pairs of values measured by the participants,
16
and to evaluate the associated uncertainties, to identify the differences that differ

us
103
17
18 104 significantly from zero.
19
Assessing, comparing, and demonstrating the measurement performance of the participating laboratories is often achieved by performing proficiency tests, which are a key tool for maintaining measurement quality [137]. However, many proficiency tests do not involve consensus building. In fact, in a typical proficiency test the measurement results of the participating laboratories are compared against a reference value that will have been determined in advance by an expert laboratory (in particular when the test material is a pre-existing certified reference material). Only some proficiency tests involve a consensus building exercise to assign a value to the measurand in the test material: those where the reference value is derived by blending the measurement results produced by a select subset of the participants.
Comparisons between different measurement methods for the same measurand also may be arranged similarly to interlaboratory studies, even when they are conducted within the same laboratory. The same models and procedures that are used to reduce data from interlaboratory studies may be used to reduce the data from intercomparisons between methods. For example, NIST SRM 3168a, mentioned above, is one of a large suite of single element standard solutions intended to support calibrations in analytical chemistry. Values are routinely assigned to these materials by combining the results of two measurement methods using a statistical procedure (DerSimonian-Laird’s, §5.2) that is widely used in meta-analysis. However, there are statistical methods developed specifically to compare two measurement methods that are rather different from the methods used to analyze results of interlaboratory studies [11].
Key comparisons (KCs), which are defined in the Mutual Recognition Arrangement (MRA) [26], are interlaboratory studies involving national metrology institutes (NMIs), whose principal goal is to establish the degree of equivalence of measurement services provided by the NMIs. This involves detecting and identifying individual measurement results that are significantly inconsistent with a reference or consensus value, and pairs of measured values that differ significantly from one another.
The bulk of the studies involving measurements of the same attribute made by multiple laboratories that are published in any year originate in the medical field, and are done retrospectively, based on previously published results. Because they re-analyze and combine the results of prior data analyses, such studies are usually referred to as meta-analyses. The goal often is to boost confidence in conclusions that each individual study and corresponding analysis, if taken alone, might fall short of supporting decisively. The models and methods that are used in medical meta-analysis, to pool or synthesize evidence through consensus building [35; 36], are very much the same as those used to reduce data from interlaboratory studies in measurement science.
Some clinical trials (of medications, or of medical devices or procedures) also include the need to combine measurement results using very much the same methods that are used in conventional meta-analysis [53]. Similarly to interlaboratory studies in measurement science, and differently from many meta-analyses in medicine, which are post hoc, clinical trials are carefully planned in advance.
The term “meta-analysis” was introduced by Glass [58] referring “to the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings.” Meta-analysis is practiced in medical research to compare different therapies for the same condition or the performance of medical centers, to boost the strength of the evidence in favor of a particular therapy, or to substantiate a health risk factor, by pooling information gathered in independent studies.
Typically, meta-analysis involves: (i) defining the property intended to be measured; (ii) identifying relevant published studies; and (iii) gathering and combining the underlying data to produce a consensus estimate of an effect, or to evaluate differences between estimates of the value of the property and a consensus value, or to rank the participants according to some performance metric [62].
The individual studies that are combined in meta-analysis generally are not interlaboratory studies themselves. In the example of medical meta-analysis that we review in §7.3, each study provided a single measurement of the relative performance of two alternative procedures for the treatment of carotid stenosis [99]. For example, the study listed first in Table 5 was conducted in a university teaching hospital and involved 23 patients that were randomly allocated to undergo one or the other of the two treatment procedures, and produced an estimate of the log-odds ratio of stroke or death and the associated uncertainty [103]. This example also illustrates a situation, more common in medical meta-analysis than in metrological interlaboratory studies, where it is necessary to do considerable pre-processing of the measurement results before combining them.


Some of the interlaboratory studies undertaken in measurement science also are retrospective and meta-analytic. For example, the 2014 value of the Newtonian constant of gravitation, G, recommended by the Committee on Data for Science and Technology (CODATA) [102] is the result of a meta-analysis that combines 14 measurement results, obtained in experiments carried out independently and published between 1982 and 2014. In §7.1 we provide an alternative reduction of the same data.
1.2. Challenges

The meaningful reduction of data from interlaboratory studies and meta-analyses faces several challenges that depend on the type and purpose of the study. For example: in a proficiency test there is no need to decide which results to use because the purpose is to score all of them; and the issue of material homogeneity is relevant only in those studies where different laboratories measure aliquots of the same material with supposedly identical values of the measurand. One or several of the following challenges arise commonly:

(i) Determining which measurement results should be combined into a consensus value, and which should be set aside;

(ii) Selecting a procedure for data reduction that produces results that are fit for purpose and that is consistent with a statistical model capturing all the recognized contributions from sources of uncertainty;

(iii) Assessing the homogeneity of the material whose aliquots are sent to the participating laboratories for measurement, or the (temporal) stability of the material or artifact whose properties are the subject of measurement, and correcting for differences or instability when practicable;

(iv) Evaluating the extent of the heterogeneity of the measurement results, which “partly determines the difficulty in drawing overall conclusions” [65].
Challenges (i) and (ii) are discussed in §3 (under (P1)), and §5, respectively. Material homogeneity (iii) is a common requirement in interlaboratory studies that are designed so that each participating laboratory — for practical reasons — measures a different aliquot of the same material, but where the aliquots can be considered equivalent. A material is homogeneous (for a particular measurand, and for aliquots of a specified size) when the value of the measurand does not vary significantly between different aliquots (of that same, specified size) of the material.
Material homogeneity is typically ascertained by performing an analysis of variance of replicated determinations of the measurand in each of several aliquots of the material, made by the laboratory responsible for the preparation of the material for measurement [144]. ISO 5725-1 provides guidance on the selection of materials to be used for an accuracy experiment [70, §6.4]. Sharpless et al. [128] provide examples of assessments of homogeneity (both of value and of variability) in the context of the use of NIST Standard Reference Materials. Ulrich et al. [142] illustrate evaluations of homogeneity, and in particular of sufficient homogeneity as proposed by Fearn and Thompson [50], for a candidate reference material.
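As a rough sketch of such an analysis of variance (a hypothetical helper with made-up data, not the procedures of the works just cited), one may compare the between-aliquot and within-aliquot mean squares computed from replicated determinations:

```python
def anova_homogeneity(groups):
    """One-way ANOVA sketch for a homogeneity check: `groups` holds the
    replicated determinations made in each of several aliquots.  Returns
    the between-aliquot mean square, the within-aliquot mean square, and
    their ratio F; a large F suggests the aliquots are not homogeneous.
    (Illustrative only: a full assessment would compare F against a
    critical value, or test for *sufficient* homogeneity.)"""
    k = len(groups)                      # number of aliquots
    N = sum(len(g) for g in groups)      # total number of determinations
    grand = sum(sum(g) for g in groups) / N
    ss_between = sum(len(g) * (sum(g) / len(g) - grand)**2 for g in groups)
    ss_within = sum(sum((y - sum(g) / len(g))**2 for y in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (N - k)      # assumes replicates within aliquots vary
    return ms_between, ms_within, ms_between / ms_within
```

For instance, `anova_homogeneity([[1.0, 1.1], [1.05, 0.95], [1.2, 1.1]])` compares the scatter of the three aliquot means with the scatter of the duplicate determinations within each aliquot.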
In some cases, the preparation procedure used to form the aliquots itself induces some heterogeneity, as it did, for example, in key comparison CCQM-K23b [143], which focused on the measurement of the composition of a gas mixture emulating natural gas. Because the gas mixtures sent to the participants were prepared gravimetrically by the pilot laboratory, one separately from the other, rather than having been drawn from a single, pre-mixed and homogenized batch, their compositions were not identical. In this case, the differences in composition of the mixtures measured by the participants could be estimated, and the measurement results were adjusted accordingly. And, of course, the associated uncertainties were increased owing to the imperfect knowledge of the true values of the necessary adjustments.
Challenge (iii) also includes the effects of temporal drift in the value of the measurand when this is a property of an artifact that circulates among the participants, a so-called traveling standard [158]. Key comparison CCEM-K2 affords an example of one such eventuality, and statistical methods have been developed to reduce the measurement results taking a linear trend into account [157].
The Cochrane Handbook for Systematic Reviews of Interventions [64] has the following to say about challenge (iv): “Any kind of variability among studies in a systematic review may be termed heterogeneity. It can be helpful to distinguish between different types of heterogeneity. Variability in the participants, interventions and outcomes studied may be described as clinical diversity (sometimes called clinical heterogeneity), and variability in study design and risk of bias may be described as methodological diversity (sometimes called methodological heterogeneity). Variability in the intervention effects being evaluated in the different studies is known as statistical heterogeneity, and is a consequence of clinical or methodological diversity, or both, among the studies. Statistical heterogeneity manifests itself in the observed intervention effects being more different from each other than one would expect due to random error (chance) alone.”
Heterogeneity, or mutual inconsistency of the measurement results, means that the dispersion, or scatter, of the measured values is greater than what the uncertainties associated with the measured values suggest it should be. Throughout the remainder of this contribution, when we refer to homogeneity, we mean the opposite of heterogeneity as just defined. In other words, when homogeneity prevails, no uncertainty component needs to be introduced, above and beyond those that are expressed in the laboratory-specific uncertainties, to “explain” the variability of the measured values.
In the terminology of ISO 5725-1, heterogeneity means that there is a significant variance component attributable to lack of reproducibility. The laboratory-specific uncertainties, resulting from the uncertainty evaluation performed by each participating laboratory individually, generally include not only the variance component attributable to lack of repeatability, but also uncertainty contributions from other sources that laboratories already have visibility of and can evaluate even without intercomparing their measurement results with the results obtained by other laboratories.
Since the laboratory-specific uncertainty evaluations supposedly reflect contributions from all identified sources of uncertainty, such “excess” dispersion of the measured values perforce is attributable to sources that are not visible, hence are not listed in the corresponding uncertainty budgets. Taken together, these “invisible” sources contribute what Thompson and Ellison [136] felicitously call dark uncertainty.

We shall use this term to designate both the joint contribution of these yet unrecognized sources of uncertainty, and its quantification, either in the form of a standard deviation, for which we reserve the letter τ, or the corresponding variance, τ², which we also call “excess variance”, following customary usage in the statistical arts.

It is widely recognized that it is very important to detect and then to evaluate dark uncertainty, and also that the manner of evaluating it should depend on the purpose of the evaluation [120]. Cochran’s Q test is often used to detect heterogeneity [25]. However, it has small probability of detecting real heterogeneity when the number of participating laboratories is small. Higgins and Thompson [65] point out that when many laboratories are involved, the test may detect statistically significant heterogeneity that is substantively irrelevant. Hoaglin [67] reviews these and other shortcomings of the test. A coverage interval for τ is generally more useful and informative than a statistical test.
The recognition of the aforementioned “extra” uncertainty component predates the introduction of the colorful term “dark uncertainty”, and is reflected both in ISO 5725-1 and in guidance issued by the European Federation of National Associations of Measurement, Testing and Analytical Laboratories (EUROLAB). For example, when persistent differences are detected between measured values and the value of the reference material measured by the participants, but these differences are deemed to be insignificant, then EUROLAB [46, Table 3.1] suggest that an additional uncertainty contribution should be introduced “to account for uncorrected bias.”
Suppose that x_1, …, x_n denote estimates of the same measurand produced by n laboratories, with associated uncertainties u_1, …, u_n, and the consensus value is μ̂; then Q = Σ_{j=1}^n (x_j − μ̂)²/u_j². Suppose also that τ is estimated by τ_DL as in the DerSimonian-Laird procedure (§5.2). In these circumstances, the heterogeneity index suggested by Higgins and Thompson [65], I² = (Q − n + 1)/Q = τ²_DL/(τ²_DL + ũ²), where ũ denotes the “typical” value of the {u_j} (for example their median or their geometric average), provides a more accessible metric of heterogeneity than Q. In the examples that we describe in §7, we always provide values of I², which represents the proportion of the total variability of the measured values that is attributable to heterogeneity.
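A minimal sketch of computing I² directly from Q, given measured values, stated uncertainties, and a consensus value (hypothetical function and numbers; the estimate τ_DL and the consensus value itself are the subject of §5.2):

```python
def heterogeneity_index(x, u, mu):
    """Higgins-Thompson I^2 = (Q - (n - 1))/Q, truncated at zero:
    the proportion of the total variability of the measured values x
    (with stated uncertainties u, relative to consensus value mu)
    that is attributable to heterogeneity."""
    n = len(x)
    Q = sum((xj - mu)**2 / uj**2 for xj, uj in zip(x, u))
    return max(0.0, (Q - (n - 1)) / Q)
```

With three hypothetical results 10.0, 10.5, and 9.5, each with u = 0.1, and consensus value 10.0, the scatter far exceeds the stated uncertainties and I² is close to 1; when the scatter is commensurate with the stated uncertainties, I² is truncated to 0.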
There have been interlaboratory studies where only the measured values were reported, unqualified by any evaluations of laboratory-specific measurement uncertainty, but where the heterogeneity of the results was such that alone it demanded improvements in measurement quality. This happened in a cooperative investigation of precision and trueness in chemical, spectrochemical, and modal analysis of silicate rocks [43; 125]. The intercomparison turned out to be particularly consequential because it revealed surprising “enormous discrepancies in some of the chemical results” and indicated “an urgent need for thorough inter-laboratory standardization” [48].
At the time of the cooperative investigation just mentioned, the term “accuracy” used to mean what currently is called “trueness.” ISO 5725-1 [70, §3.7] defines “trueness” as “the closeness of agreement between the average value obtained from a large series of test results and an accepted reference value.” ISO 5725-1 (§3.6) proposes that the concept of accuracy should comprise both trueness and precision, and defines accuracy as the closeness of agreement between a test result and the accepted reference value. This definition assumes that the test involves measuring a standard that instantiates this reference value.

ISO 5725-1 defines precision as “the closeness of agreement between independent test results obtained under stipulated conditions” (3.12). The corresponding definition offered by the International Vocabulary of Metrology [77] (VIM 2.15) is similar, and includes the explanation that “the ‘specified conditions’ can be, for example, repeatability conditions of measurement, intermediate precision conditions of measurement, or reproducibility conditions of measurement”.
“Repeatability” and “reproducibility” are two terms often mentioned in relation to interlaboratory studies. The VIM (2.21, 2.25) defines the former as closeness of agreement between indications or measured quantity values obtained by replicate measurements on the same or similar objects made under repeatability conditions (VIM 2.20), the latter under reproducibility conditions (VIM 2.24). Repeatability conditions require the same measurement procedure, measuring system, operating conditions, location, and operator for replicate measurements of the same or similar objects over a short period of time, while reproducibility conditions encompass different locations, operators, or measuring systems. Mandel [90] argues that “the concepts of repeatability and reproducibility are, by themselves, not sufficient to completely characterize a test method, although in most cases these measures are the only ones derived from the data.”
However these various aspects of metrological performance may be labeled and defined, an important point that ISO 5725-1 conveys clearly and forcefully is that interlaboratory comparisons, and collaborative assessment experiments in particular (ISO 5725-1, §3.22), are valuable devices to assess the quality of measurements. The same view is echoed by EUROLAB: “interlaboratory comparison is a tool for progress” [45].


1.3. Historical Perspective
Very much the same statistical methods are used to reduce measurement results obtained in studies involving multiple laboratories, clinical centers, medical procedures, or measurement methods, in any of the fields of application where such comparisons are made. In a pioneering contribution, Cochran [24] discusses statistical methods to reduce data obtained in “an efficient type of modern field experiment” where “a replicated trial is laid down in the same year at a number of centres, or carried out at the same centre independently throughout a number of years.” The motivating experiments were agricultural field experiments, but he recognized that “the statistical problems which arise in the interpretation of the results of such a set of data are of wide generality.”
57 329 in the interpretation of the results of such a set of data are of wide generality.”
58
330 John Mandel, motivated by problems in measurement science at the National Bureau of
59
60 331 Standards, now the National Institute of Standards and Technology (NIST), emphasized
332 a model-based approach to the problem of estimating a “best value” based on
AUTHOR SUBMITTED MANUSCRIPT - MET-100872.R1 Page 10 of 65

1
2
3 Consensus Building 10
4

pt
5 333 “measurements from different sources, such as different laboratories” [92; 93; 94]. In
6 334 the process, he also explained the crucial role that interlaboratory studies have in the
7
8 335 validation of measurements [91].
The research avenues that John Mandel inaugurated have since been pursued energetically and expanded by NIST statisticians, including Iyer et al. [73], Rukhin and Vangel [123], Rukhin [119], Rukhin and Possolo [122], Toman [138], Vangel and Rukhin [145], and Wang and Iyer [149], among others. Youden [155] and Duewer et al. [39], also from NIST, emphasized graphical methods for the analysis of results from interlaboratory studies, and Zhang [156] addresses cases where the value of the measurand drifts in the course of an interlaboratory study.
The European Community (in a press release dated September 19, 1986, entitled The Community Bureau of Reference: an instrument for the attainment of the “big market”) recognized that “divergences (sometimes of major proportions) between the results of physical measurements or chemical analyses carried out by different laboratories in the Community often lie at the root of disputes between companies or even between Member States as well as of barriers to trade, financial losses, etc.” To address this challenge, the European Community established a Community Bureau of Reference (BCR) to enable “the causes of these discrepancies to be identified and eliminated through the harmonization not of the measuring methods — which may continue to differ — but of the results.”

The BCR was a network of proficient European laboratories developing reference materials and methods, together with an Expert Advisory group on Reference Materials from many different countries and institutes. Currently, the Institute for Reference Materials and Measurements (IRMM, Geel, Belgium), which is part of the Joint Research Centre of the European Commission, offers a variety of measurement services, including reference materials and interlaboratory comparisons, supporting measurement best practices throughout Europe.
The field is much too wide to attempt a review of it here, or even merely to reference all of the most important contributions. Notable book-length overviews by Hedges and Olkin [63], Whitehead [153], Hartung et al. [62], and Cooper et al. [28] focus only on medical meta-analysis. Gaver et al. [54] provide an overview of statistical methods for combining information in medicine, the physical and natural sciences, social sciences, public policy, and national defense. Hund et al. [69] review several different types of interlaboratory studies commonly carried out in analytical chemistry, and Thompson et al. [137] provide detailed guidance for proficiency tests.
Several standards and guides have also been published aiming to foster and facilitate the application of best practices in interlaboratory studies. The following are just a few, particularly noteworthy references in this general category: (i) the ISO 5725 suite of international standards, comprising six parts and published between 1994 and 2005 (including correction supplements), which discuss interlaboratory studies at length, as means to characterize the accuracy of measurement methods and results [70]; (ii) the international standard ISO 13528 on statistical methods for use in proficiency testing by interlaboratory comparison [71]; (iii) the EURACHEM guide on the selection, use, and interpretation of proficiency testing schemes [95]; (iv) the international harmonized protocol for the proficiency testing of analytical chemistry laboratories [137], which is a Technical Report issued by the International Union of Pure and Applied Chemistry (IUPAC); (v) the EUROLAB guides on the evaluation of measurement uncertainty and on alternative approaches to uncertainty evaluation [46; 47].
The particular references listed above do not represent the first steps in the analysis of interlaboratory studies. For example, Simpson and Pearson [129] combined, by simple averaging, correlations between immunity and inoculation against enteric (that is, typhoid) fever obtained in five independent studies. Birge [10] considered “the calculation of the ‘most probable’ values of certain quantities, from a given set of experimental data”, and discussed the case where the dispersion of the values to be combined is appreciably larger than the uncertainties associated with the individual values.
1.4. Goals and Structure

This contribution provides an overview of the main concepts and approaches to consensus building in measurement science and medicine, including model-based evaluations of uncertainty for consensus values and degrees of equivalence. Our main goals are the following:
• To foster model-based approaches to the reduction of data from interlaboratory studies, as opposed to ad hoc and recipe-based approaches (detailed in §2.2), and to explain the nexus between the model selected for the data and the procedures used to reduce the data and evaluate the uncertainty associated with the results.

• To stimulate critical examination of: (i) the assumptions underlying every statistical model that may be used for consensus building; (ii) the adequacy of the model to the data; and (iii) the fitness-for-purpose of the model and of the results.

• To suggest that it is inappropriate to choose a procedure for data reduction before examining the data. Different interlaboratory studies and inter-method comparisons may require different models, hence calling for different procedures for data reduction and uncertainty evaluation. Therefore, it is unrealistic to hope for a “solution” that might be used in all cases.

• To emphasize the need for realistic uncertainty evaluations, in particular for recognizing dark uncertainty (refer to §2, and to (P5) in §3), when its presence is undeniable, while also taking into account the fact that the contribution from dark uncertainty is often evaluated based on the dispersion of a rather small number of measured values.

• To delineate general principles that statistical models for the results of interlaboratory studies should embody and implement.

• To propose several improvements to statistical procedures typically associated with a few paradigmatic models for the results of interlaboratory studies, inter-method studies, and meta-analyses, in particular by relying on the statistical bootstrap [41] and Markov Chain Monte Carlo (MCMC) sampling [56]. These improvements include the means to take into account the fact that uncertainty evaluations associated with measured values often are based on rather limited evidence (an issue also addressed by Forbes [51]).

• To explain that the uncertainties associated with differences between measured values and the consensus value, or between pairs of measured values, should be evaluated consistently with the purpose of identifying differences that are more extreme than what the model allows, generally leading to much greater circumspection (that is, to flagging fewer differences as significant) than conventional treatments.

• To illustrate applications of principles, methods, and computations in an assortment of examples, in particular as facilitated by the availability of the NIST Consensus Builder (NICOB), a Web-based application that may be used for the reduction of data from interlaboratory studies, at consensus.nist.gov [82].
Section 2 discusses two essentially different, commonly employed approaches to the problem of combining heterogeneous results from interlaboratory studies, noting that traditionally they have tended to be favored by two different scientific communities. Section 3 lays out several principles that we believe should guide the modeling and reduction of data from interlaboratory studies, and from key comparisons in particular. Section 4 reviews approaches to modeling and specific statistical models for data from interlaboratory studies. Section 5 describes how selected models may be fitted to the data and used for uncertainty evaluations, highlighting several improvements that we propose. Section 6 discusses the computation of degrees of equivalence in key comparisons. Section 7 describes applications of these principles, models, and methods, to data from actual interlaboratory studies and meta-analyses. Finally, section 8 summarizes lessons learned and offers practical recommendations.
2. Approaches

The approaches most commonly used to model data from interlaboratory studies, inter-method comparisons, and meta-analysis, may be classified as to how they address the problem created by heterogeneous measurement results (§2.1), and also as to whether they are model-based or recipe-based (§2.2).
2.1. Multiplicative versus Additive Approaches

The approach suggested by Birge [10] for the combination of heterogeneous measurement results involves inflating the uncertainties associated with the values to be combined. The uncertainties are multiplied by a “correction” factor sufficiently large to make the results mutually consistent. This approach has been refined by Toman et al. [141] and Bodnar and Elster [12].
Birge’s approach is used routinely by the Particle Data Group (pdg.lbl.gov) [105], and also by CODATA to produce recommended values for some of the fundamental physical constants [102]. The motivating understanding behind Birge’s proposal is that, in cases where the results are heterogeneous, the individual uncertainties are underestimated, hence need to be inflated.
For example, in the calculation of the 2010 recommended value of the Newtonian constant of gravitation, G, the correction factor was 14 [101, Page 1576]. With this choice, the most extreme of the 11 values differed by about twice its uncertainty from the consensus value. Had the same criterion been applied in the 2014 exercise, the expansion factor would have been 16 [102, Page 035009-47].
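The multiplicative correction can be sketched numerically. In the following Python fragment (with made-up values, not data from any study cited here), the Birge ratio is computed as the square root of the reduced chi-squared of the inverse-variance weighted mean, and the reported uncertainties are inflated when that ratio exceeds 1:

```python
import math

def birge_adjust(x, u):
    """Weighted mean and Birge ratio; inflate uncertainties when results disagree.

    x: measured values; u: their standard uncertainties (same length as x).
    """
    w = [1.0 / uj**2 for uj in u]
    xbar = sum(wj * xj for wj, xj in zip(w, x)) / sum(w)
    # Birge ratio: square root of the reduced chi-squared of the weighted mean
    chi2 = sum(wj * (xj - xbar)**2 for wj, xj in zip(w, x))
    rb = math.sqrt(chi2 / (len(x) - 1))
    # Multiplicative correction: inflate only when the results are over-dispersed
    factor = max(rb, 1.0)
    return xbar, rb, [factor * uj for uj in u]

# Made-up values, for illustration only
x = [9.8, 10.4, 10.0, 11.1, 9.6]
u = [0.2, 0.3, 0.2, 0.3, 0.4]
xbar, rb, u_adj = birge_adjust(x, u)
```

For these values the weighted mean is about 10.12 and the Birge ratio is about 2, so each reported uncertainty would be doubled.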
Baker and Jackson [3] contrast Birge’s approach with the more commonly used alternative where laboratory-specific stochastic effects are added to the value of the measurand, shifting its perceived value up or down, possibly differently from one laboratory to another, by an amount whose expected value is zero. The standard deviation, τ, of the effects is the expression of dark uncertainty.
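In symbols, this additive alternative is commonly written (standard random effects notation, not an equation reproduced from this article) as

    x_j = µ + λ_j + ε_j,   j = 1, ..., n,

where x_j is the value measured by laboratory j, µ is the true value of the measurand, λ_j is a laboratory effect with mean zero and standard deviation τ, and ε_j is a measurement error with mean zero and standard deviation gauged by the reported uncertainty u_j.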
These two alternatives — uniform multiplicative correction of laboratory-specific uncertainties, or variable additive adjustments to the value of the measurand — are not only intrinsically different, but are also favored by very different communities: physicists generally relying on the multiplicative correction, biostatisticians and medical researchers preferring additive adjustments, particularly in the context of meta-analysis.
Both multiplicative and additive approaches have been used in measurement science to reduce data from interlaboratory studies, and from key comparisons in particular (for example, by CODATA, as noted above). The laboratory effects models described by Toman and Possolo [139, 140] implement the additive approach. Rukhin [120, 121] recognizes explicitly that the same statistical models are used in medical meta-analysis and in metrological interlaboratory studies.
Toman et al. [141] compare alternative approaches as applied to the same set of measurement results for the Planck constant, including a Bayesian version of Birge’s procedure that accounts for the additional uncertainty of the estimate due to the correction factor. Baker and Jackson [3] also compared the two approaches in reductions of high-energy particle measurement data, and recommend the additive approach when the measurement results are heterogeneous.
2.2. Model-based versus Recipe-based Approaches

Approaches to data reduction for interlaboratory studies may also be classified according to whether they are prescriptive (that is, recipe-like) or model-based. A prescriptive approach involves an algorithmic sequence of steps to be performed to reduce the data, without explicit (or even implicit) relation to a model that articulates the influence the data have upon the measurand. A model-based approach rests on a statistical model for the dispersion of values of the experimental data, which, together with some specified criterion of optimality in estimation, determines a procedure for reducing the data and evaluating the associated uncertainty.
Procedures A and B defined by Cox [29] are instances of the prescriptive approach, even though they are qualified by a list of validating assumptions. The ad hoc procedures described by MacMahon et al. [88] (discussed in §7.2), and by Pommé and Keightley [109], similarly are recipe-based. The diagram in Figure 10-1 [151, Page 171] of the final report of key comparison CCPR-K2.c, organized by the CIPM Consultative Committee for Photometry and Radiometry to compare measurements of spectral power responsivity, summarizes a prescriptive approach that was used to achieve mutual consistency between measurement results.
It should be noted that the difference between model-based and recipe-based approaches is not always clear-cut. For example, any of the recipes commonly employed to classify measured values as outliers may be rationalized in terms of a mixture model, and addressed by employing an estimator (of the consensus value) that detects and automatically downweighs the more deviant results without formally setting any of them aside [38].
Rocke [117] employs a model-based approach to address the challenge posed by outlying observations in interlaboratory studies. Crowder [31] focuses on the selection of an appropriate stochastic model, and on the choice of associated statistical methods, motivated by interlaboratory studies where “a material is manufactured to well-defined specifications and then samples are sent out to the participating laboratories for analysis [...] to assess differences between laboratories and to identify possible sources of incompatibility or other anomalies.”
Other model-based approaches involve maximum likelihood estimation for the model underlying the classical, one-way random effects analysis of variance. Rukhin and Vangel [123] and Vangel and Rukhin [145] describe the procedure for the Gaussian laboratory effects model as used in measurement science. Rukhin et al. [118] discuss the restricted maximum likelihood (REML) variant and its relation with the Mandel-Paule procedure [93; 94; 106]. Searle et al. [127] provide a detailed overview of these methods, and discuss their merits.
Toman [138] presents model-based, Bayesian analyses of the results from three key comparisons, and Possolo [111] and Koepke and Possolo [83] employ model-based approaches to handle skewness and tail-heaviness of the laboratory effects in interlaboratory studies.
Bodnar et al. [13] present several examples of application, to medical meta-analyses, of a Bayesian procedure that uses a reference prior distribution [6] derived by Bodnar et al. [14] under the generally unrealistic assumption that the laboratory-specific standard uncertainties are known with full certainty. This procedure is notable because it allows exact inference under the stated assumptions without recourse to MCMC sampling.
Forbes [51] introduces a welcome model-based approach to the often neglected, yet very important and very common situation where there is some doubt about the validity of the uncertainty statements associated with the values measured by the participants in interlaboratory studies.
3. Principles

The following general principles should guide the combination of measurement results obtained independently by different laboratories or measurement methods:
(P1) No measurement result should be set aside except for substantive, documented cause. The mere fact that a measured value lies far from the bulk of the others is, alone, insufficient reason to set it aside, even if a statistical test suggests that it is an “outlier.”

Marchandise [96, Page 82] articulates this position in the context of value assignment to reference materials: “As outlying results may be the most accurate, statistics are used as a guide, not as a means of rejecting outliers.”

Schlecht and Stevens [126], writing in the context of the chemical analysis of silicate rocks, voice a similar understanding: “Experience with collaborative analysis of other materials has shown that the correct values may be quite different from the most frequent ones”. They reference Lundell [86, Table V], who reports values of the mass fraction of tricalcium phosphate in a phosphate rock measured by thirty laboratories, and notes that the average of the measured values is 78.38 %, and their standard deviation is 0.40 %. The smallest value of the batch, 77.40 %, is 2.41 standard deviations below the average, while the bulk (97 %) of the determinations are within 2 standard deviations of the average. However, the smallest value is “the one which is nearest the correct result.”

Graphical and statistical detection of anomalous results, and examination of mutual consistency indices like Cochran’s Q or Higgins and Thompson’s I² [65], are useful screening tools that may serve to draw the scientists’ attention to measurement results deserving further scrutiny, but should be advisory, not decisional [27]. Bayesian statistical models may be used to compute posterior probabilities of whether each measurement result “belongs” with the rest, which will then serve as yet another advisory contribution to the substantive process of data selection.

In all cases, substantive considerations, rather than statistical tests, should drive the selection of the subset of the measurement results to be combined into a consensus value, and any standard deviation that a statistical test may use as a measurement unit to gauge the significance of a deviation should capture contributions from all relevant sources of uncertainty.
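Both of the screening indices just mentioned are easy to compute from the measured values and their standard uncertainties. A minimal Python sketch, with made-up values for illustration (and, per (P1), the result should be treated as advisory only):

```python
def cochran_q_and_i2(x, u):
    """Cochran's Q statistic and the Higgins-Thompson I^2 heterogeneity index."""
    w = [1.0 / uj**2 for uj in u]
    xbar = sum(wj * xj for wj, xj in zip(w, x)) / sum(w)
    # Q: chi-squared-like measure of dispersion about the weighted mean
    q = sum(wj * (xj - xbar)**2 for wj, xj in zip(w, x))
    # I^2: fraction of total variability attributable to between-laboratory differences
    i2 = max(0.0, (q - (len(x) - 1)) / q) if q > 0 else 0.0
    return q, i2

# Made-up values, for illustration only
x = [10.1, 10.5, 9.7, 10.9, 10.2]
u = [0.2, 0.2, 0.3, 0.2, 0.3]
q, i2 = cochran_q_and_i2(x, u)
```

Here Q ≈ 14.6 on 4 degrees of freedom and I² ≈ 0.73, flagging substantial heterogeneity; this is an invitation to scrutinize the results further, not by itself grounds for discarding any of them.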
(P2) No measured value should dominate the consensus value “automatically”, simply because the associated measurement uncertainty is much smaller than the uncertainties associated with the other measured values.

This provision is particularly relevant when the measured values are markedly heterogeneous. Methods of data reduction consistent with this principle will therefore include some damping mechanism to limit the influence that measured values with unusually small associated uncertainties will have upon the consensus value.

When the consensus value is a weighted average, this damping may be achieved by replacing weights proportional to 1/u_j² with weights proportional to 1/(u_j² + τ²), where u_j denotes the uncertainty associated with the value measured by laboratory j and τ gauges the heterogeneity of the measurement results. This approach is implemented in the procedure described in §5.2.

When there are substantive reasons to weigh the different measurement results differently, doing so does not contradict this principle. This may be accomplished, for example, by specifying unequal weights for the inputs of the Linear Pool (§5.4). If substantive reasons dictate that pre-eminence should be given to the laboratory-specific uncertainty evaluations, then the contribution from dark uncertainty may be de-emphasized by selecting a suitably small value for the median of the corresponding prior distribution in a hierarchical Bayesian procedure (§5.3).
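The damped weighting just described can be sketched in a few lines of Python. In the fragment below (made-up values; the moment-based DerSimonian-Laird formula is used as one common way of estimating τ², offered as an illustration of the weighting scheme rather than as the exact procedure of §5.2), the laboratory reporting the smallest uncertainty no longer dominates once τ² > 0:

```python
import math

def damped_consensus(x, u):
    """Weighted mean with weights proportional to 1/(u_j^2 + tau^2).

    tau^2 (the dark-uncertainty variance) is estimated here by the moment-based
    DerSimonian-Laird formula; other estimators could be substituted.
    """
    w = [1.0 / uj**2 for uj in u]
    sw = sum(w)
    xw = sum(wj * xj for wj, xj in zip(w, x)) / sw
    q = sum(wj * (xj - xw)**2 for wj, xj in zip(w, x))
    tau2 = max(0.0, (q - (len(x) - 1)) / (sw - sum(wj**2 for wj in w) / sw))
    # Damped weights: a very small u_j can no longer dominate when tau2 > 0
    wstar = [1.0 / (uj**2 + tau2) for uj in u]
    mu = sum(wj * xj for wj, xj in zip(wstar, x)) / sum(wstar)
    u_mu = math.sqrt(1.0 / sum(wstar))
    return mu, math.sqrt(tau2), u_mu

# Made-up values; the first laboratory reports an unusually small uncertainty
x = [10.1, 10.5, 9.7, 10.9, 10.2]
u = [0.05, 0.2, 0.3, 0.2, 0.3]
mu, tau, u_mu = damped_consensus(x, u)
```

With these inputs the naive inverse-variance mean sits near 10.16, pulled strongly toward the first value; the damped weights move the consensus to about 10.30 and report τ ≈ 0.35.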
(P3) Measurement methods should be sufficiently well characterized to warrant confidence in the belief that the measured values, taken as a group, are roughly centered at the true value of the measurand, and participating laboratories should have previously demonstrated sufficient competence, in particular by having earned satisfactory scores in proficiency tests.

On the one hand, it is obvious that if all the measured values tend to be too low or too high, no statistical procedure that relies on the data alone will be able to detect this and “correct” the consensus estimate accordingly. For example, when immunoassays are used to measure the concentration of vitamin D, they may be persistently low or high, depending on the antibody that they use for targeting the vitamin, and on how the vitamin is bound to materials in the matrix of the sample [44; 49; 134].

On the other hand, it is desirable that the statistical procedures used for data reductions should be able to cope with situations where individual measured values lie far from the bulk of the others, and also with situations where there is some asymmetry in the apparent distribution of the measured values [27]. For example, some methods for extracting polychlorinated biphenyls (PCBs) from riverine sediments, or for extracting heavy metals from organic materials, may do so only incompletely. In such cases, the distribution of measured values may be markedly asymmetrical, showing a histogram whose left tail is longer than the right tail [111].
(P4) A model should be formulated that explicitly relates the measured values to the true value µ of the measurand, and that includes elements representing contributions from all recognized sources of uncertainty. Furthermore, the estimation of µ, and the evaluation of the associated uncertainty, should be consistent with the statistical model and with some principle of estimation whose general reliability is widely recognized. The model itself, and the principles used to fit the model to the data, and to derive the results of interest from the fitted model — consensus value, associated uncertainty, degrees of equivalence (in key comparisons), performance scores (in proficiency tests), etc. — ought to vary smoothly with changes in the data.

Typically, the required model will be a statistical model, or observation equation in the nomenclature of Possolo and Toman [113] and Forbes and Sousa [52], where µ appears as a parameter of the probability distribution of the measurement results. Specification of a statistical model involves treating the measured values, and possibly also their associated uncertainties, as observed values of random variables. The principle of estimation may be maximum likelihood (§5.1), minimal Bayes risk (§5.3), or even something as simple as the method of moments [7] (§5.2).

The smoothness requirement discourages discrete model selection that may impact the results profoundly, and favors model averaging. It also speaks against procedures that involve outright rejection of some measurement results instead of a modulated, continuously varying down-weighing scheme.
(P5) The statistical model underlying data reductions should be able to detect, evaluate, and propagate dark uncertainty (§1.2) [136].

ISO 5725-1 (§6.3.2 and §6.3.3) provides guidance about the number of laboratories necessary to be able to characterize dark uncertainty reliably, and about the numbers of degrees of freedom that ideally should support the laboratory-specific evaluations of uncertainty {u_j}.

The definition of a mutually consistent subset of the measurement results [30] violates (P1) because it sets aside measurement results based on a statistical criterion alone, and violates (P5) because it suppresses dark uncertainty forcibly, instead of recognizing it. Iyer et al. [72] discuss consistency testing critically, and Toman and Possolo [139] expose several other serious shortcomings of the notion of “largest consistent subset” and of how it is implemented.

Once substantiated, dark uncertainty should be propagated to all derivative quantities, including the degrees of equivalence. Contrary to common misconceptions and unwarranted oversimplification, recognizing dark uncertainty does not entail inflating the laboratory-specific uncertainties to achieve consistency: instead, it implies that the uncertainty surrounding differences (between measured values and the consensus value, and between measured values themselves) is larger than what the laboratory-specific uncertainties alone may suggest.

This is consistent with the GUM [75], which stipulates that the uncertainty associated with a measured value should reflect the contributions of all sources of uncertainty. Therefore, it should include both the sources that are captured in bottom-up evaluations performed by the participating laboratories individually, and those that are revealed in top-down evaluations [112] and whose joint effect is manifest in the dark uncertainty.

EUROLAB [46, §5.1] already recognizes the legitimacy of top-down uncertainty evaluations, in particular as enabled by interlaboratory comparisons: “For standard test procedures, trueness and precision are usually determined by an interlaboratory comparison (see ISO 5725-2). Among the performance characteristics obtained in this manner, the so-called ‘reproducibility standard deviation’ (s_R) is a suitable estimate for the measurement uncertainty.” EUROLAB [47] characterizes these approaches to uncertainty evaluation as “empirical”, contrasting them with the approach based on the measurement model that the GUM focuses on, and adding that “they are based on whole-method performance investigations designed and conducted so as to comprise the effects from as many relevant uncertainty sources as possible.” Désenfant and Priel [34] offer a similar viewpoint.

The willingness of the participants in an interlaboratory study to engage in an intercomparison should include a tacit agreement to abide by the resulting findings, which create an opportunity for collective learning and provide a stimulus for improving measurement quality. This agreement ought to include the willingness to recognize any component of uncertainty that their individual uncertainty evaluations might have missed and that comes to light during the top-down evaluation.

In some cases, the model used to recognize and propagate dark uncertainty may have to be more flexible than the models we illustrate in §7. In particular, the model may have to contemplate the possibility that τ may vary among the participating laboratories.
(P6) Degrees of equivalence (differences between measured values and the consensus value, or between pairs of measured values, qualified with evaluations of associated uncertainty) should be computed consistently with their primary goal of identifying participants with “unusual” results, in the sense that their measured values lie “beyond the range allowed by the model”, as suggested by Jones and Spiegelhalter [78] and elaborated in §6.

Since determining statistical significance typically involves performing statistical tests, one for each degree of equivalence being evaluated, the results of these tests should be interpreted taking into account the fact that multiple tests are being carried out simultaneously.

When classical, non-Bayesian methods are employed for these multiple tests, the collection of corresponding p-values should be adjusted upward (thus making the results generally less significant), using a procedure like the one suggested by Benjamini and Hochberg [4].

The p-value of a classical statistical test is the probability of observing a value of the test statistic at least as extreme as was observed, by chance alone, possibly owing to the vagaries of sampling, in the absence of a real difference [20].
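The Benjamini-Hochberg adjustment mentioned in (P6) amounts to multiplying the i-th smallest of n p-values by n/i and then enforcing monotonicity from the largest p-value down. A minimal Python sketch, with made-up p-values:

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values: p_(i) * n / i, made monotone."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, carrying a running minimum
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values, one per degree of equivalence being tested
p = [0.001, 0.008, 0.039, 0.041, 0.32]
p_adj = benjamini_hochberg(p)
```

For these inputs the two raw p-values near 0.04 are both adjusted to about 0.051, above the conventional 0.05 threshold: fewer differences end up flagged as significant, which is the circumspection that (P6) calls for.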

4. Models

Model-driven approaches have many advantages relative to recipe-driven (prescriptive) approaches. The principal advantage is the transparency of the underlying assumptions that validate the results, thus stimulating and facilitating the assessment of their adequacy in each particular application.

Another consequential advantage is the logical nexus that model-driven approaches establish between the understanding of the data that is captured in the model, and the procedures for data reduction that follow from such understanding. In particular, a model-based approach elucidates the meaning of the consensus value by making it appear explicitly as a parameter in the statistical model for the measurement results [19].

A model-based approach obviously requires a model, which must be selected from among an array of often comparably reasonable alternatives. The conventional approach to model selection relies on some criterion of optimality, typically striking a balance between goodness-of-fit (of the model to the data that it is intended for) and parsimony (or model simplicity, often gauged in terms of the number of adjustable parameters in the model). Commonly used model-selection criteria include Akaike’s Information Criterion (AIC), its second-order version AICc, and the Bayesian Information Criterion (BIC) [18]. Alternatively, one may opt for an inclusive, non-committal approach that entertains several models simultaneously, and produces results that correspond to some form or another of model averaging [23; 42; 68].

In this section we review one classification of models according to how differences between measured values and the consensus value are represented, and another classification according to whether their perspective is sampling-theoretic (frequentist) or Bayesian.
4.1. Random versus Fixed Effects

For interlaboratory studies, the measurement result from laboratory j = 1, …, n ideally is a triplet (x_j, u_j, ν_j) comprising a measured value x_j, an evaluation of the associated standard uncertainty u_j, and the number of degrees of freedom ν_j on which the uncertainty evaluation is based. The {u_j} may be derived from experimental data, from expert opinion or other authoritative reference, or from a combination of the two, or they may be standard deviations of Bayesian posterior distributions.

In many interlaboratory studies, the {ν_j} unfortunately are not reported. In such cases, it is often assumed that the numbers of degrees of freedom are very large, practically infinite, which gives rise to overoptimistic uncertainty evaluations for the consensus value.

David Duewer (NIST, personal communication) has pointed out that when the measurement results are heterogeneous, data reductions that neglect the corresponding dark uncertainty largely invalidate the common practice of using the coverage factor k = 2 to obtain expanded uncertainties intended to define coverage intervals with 95 % probability. In fact, k = 2 corresponds to infinitely many degrees of freedom, hence to full knowledge of the standard uncertainties, while dark uncertainty effectively denies the fullness of such knowledge.
Three different models are frequently used (and often confused) for values measured in interlaboratory studies: (i) random effects models; (ii) fixed effects models; and (iii) common mean, or fixed effect (note the singular), models, defined as follows.

(i) Random Effects. Each measured value is represented as an additive superposition of three elements, x_j = μ + λ_j + ε_j, for j = 1, …, n, where μ denotes the measurand, the {λ_j} denote laboratory effects, and the {ε_j} represent laboratory-specific measurement errors. The {λ_j} are conceived of as values of independent random variables with mean 0 and standard deviation τ ≥ 0. This assumption means that, taken collectively, the measured values are unbiased (that is, are centered on μ). The measurement errors are assumed to be realized values of independent random variables with mean 0, but possibly different standard deviations {σ_j}.

(ii) Fixed Effects. Each measured value is centered at its own mean, x_j = θ_j + ε_j, as if the different laboratories were effectively measuring different measurands. Here, θ_j denotes an unknown, laboratory-specific constant, and ε_j denotes the value of a non-observable random variable with mean 0 and standard deviation σ_j, for j = 1, …, n.

(iii) Common Mean (Fixed Effect). The measured values differ from the true value of the measurand owing to laboratory-specific measurement errors only, x_j = μ + ε_j, where μ denotes an unknown material-specific constant and ε_j denotes the value of a non-observable random variable with mean 0 and standard deviation σ_j, for j = 1, …, n. This model is nested within both of the previous models: it is equivalent to the random effects model when τ = 0, and to the fixed effects model when θ_j = μ for j = 1, …, n.
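The random effects model (i) is straightforward to simulate. The sketch below is our own illustration, with assumed values for μ, τ, and the {σ_j}, and with the Gaussian distributional choices discussed later in §5.1; it generates one set of measured values:

```python
import numpy as np

rng = np.random.default_rng(1)

mu, tau = 10.0, 0.5                          # measurand and dark uncertainty (assumed values)
sigma = np.array([0.2, 0.3, 0.25, 0.2, 0.4]) # laboratory-specific error SDs (assumed)
n = len(sigma)

lam = rng.normal(0.0, tau, size=n)  # laboratory effects lambda_j ~ N(0, tau^2)
eps = rng.normal(0.0, sigma)        # measurement errors eps_j ~ N(0, sigma_j^2)
x = mu + lam + eps                  # measured values x_j = mu + lambda_j + eps_j

# Each x_j has variance tau^2 + sigma_j^2: the "excess variance" beyond
# sigma_j^2 is the contribution of the laboratory effects.
print(x)
```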
In the random effects model, if the data were only the {x_j}, then it would be impossible to distinguish the laboratory effects {λ_j} from the measurement errors {ε_j}. Since the {u_j} also are part of the data, and we know that the absolute values of the {ε_j} should be generally comparable to the {u_j}, we can conclude that any "excess variance" exhibited by the {x_j} is attributable to the {λ_j}, whose dispersion (or scatter) is gauged by τ. The same ambiguity (or non-identifiability) arises in the fixed effects model, and it is usually resolved by imposing an arbitrary constraint on the {θ_j}, for example that they must sum to 0.

Random effects models are appropriate when one wishes to derive lessons from an interlaboratory study that will be applicable to laboratories similar to those that have participated in the study, or when one must recognize between-laboratory differences as a source of measurement uncertainty [89; 93].

ISO 5725-1 (5.1) suggests the same random effects model described above, for the purpose of estimating accuracy (trueness and precision), explaining that τ² (for which it uses the customary designation of "between-laboratory variance") comprises contributions from "both random and systematic components" (that is, volatile and persistent effects [112, §5e]), including "different climatic conditions, variations of equipment within the manufacturer's tolerances, and even differences in the techniques in which operators are trained in different places."

Entertaining a fixed effects model, where the {θ_j} are different from one another, is tantamount to admitting that the different laboratories are measuring different quantities owing to persistent effects (biases) that do not average out as each laboratory replicates its measurements. In such circumstances, no consensus value can be meaningful.
Consensus values are only meaningful when the {x_j} have a common expected value μ, which happens for the random effects model and for the common mean model. Of these two, the random effects model should be preferred, as Borenstein et al. [15, Page 107] argue:

"If we were going to use one model as the default, then the random-effects model is the better candidate because it makes less stringent assumptions about the consistency of effects. Mathematically, the fixed-effect model is really a special case of the random-effects model with the additional constraint that all studies share a common effect size. To impose this constraint is to impose a restriction that is not needed, not proven, and often implausible."

4.2. Sampling-Theoretic vs. Bayesian Procedures

The same model may be interpreted and fitted either from a sampling-theoretic or a Bayesian viewpoint. (§5.3 includes an overview of the Bayesian paradigm.) We will
restrict attention to the random effects model as we explain the differences between these viewpoints, because it is the focus of several of the procedures discussed in §5. The key differences between the two concern (i) how they regard μ, and (ii) how they interpret the uncertainty surrounding estimates of μ.

The sampling-theoretic viewpoint focuses on the variability of the consensus value (the estimate of μ) under hypothetical repetitions of the process that generated the data. The Bayesian viewpoint focuses on the information that the particular data in hand provide about μ, and uses this information to update the prior distribution for μ and produce a posterior distribution that typically will be appreciably less dispersed than the prior.

The sampling-theoretic approach is concerned with (objective) fluctuations of the consensus value attributable to the vagaries of sampling a hypothetical population, while the Bayesian approach updates a (subjective) state of knowledge about μ based on the data that have actually been observed. Accordingly, from the sampling-theoretic viewpoint the uncertainty associated with μ expresses sampling variability, while from the Bayesian viewpoint it expresses the margin of doubt that derives from incomplete or imperfect knowledge of μ.

The merits and demerits of these approaches have been debated ad infinitum, not only within statistical circles but also in many areas of science, as well as within epistemology, the branch of philosophy concerned with knowledge in general and with scientific knowledge in particular [131]. Within measurement science, some argue that "it is only the definition of probability as a degree of belief that is applicable" [104], while others argue quite the opposite, "that the change from a frequentist treatment of measurement error to a Bayesian treatment of states of knowledge is misguided" [152].
These foundational issues aside, it should be noted that the Bayesian procedure we discuss in §5.3 enjoys two very important practical advantages: (i) it captures and propagates effectively the uncertainty surrounding the estimate of the between-laboratory dispersion of measured values (τ), without resorting to complex approximations; and (ii) it offers the means to express a priori knowledge about the value of τ and about the reliability of the uncertainty evaluations produced by the participants in the study.

5. Methods

A large and varied collection of statistical procedures is available for meta-analysis and for the reduction of data from interlaboratory studies. In December 2016, there were more than a dozen R packages available on the Comprehensive R Archive Network (CRAN) offering functions to fit models to such data and to present the results.
In this section we first review maximum likelihood procedures, and then turn to three procedures that adhere to the principles outlined in §3 and have both a long history of usage and a proven track record of reliable performance. For these reasons they have been implemented in the NICOB: the DerSimonian-Laird procedure, a hierarchical Bayesian procedure, and the Linear Pool. They are not, however, interchangeable, because they involve different sets of assumptions. The choice of procedure to use in each application should be guided by consideration of the validating assumptions and by fitness for purpose. We believe that applying them all in turn, in hopes of finding the one that produces the most "convenient" results, would be tantamount to statistical malpractice.

• The DerSimonian-Laird procedure (§5.2) is used most often in practice, particularly in medical meta-analysis. It reverts to the weighted average automatically when the measurement results appear homogeneous, that is, when τ = 0. We have enhanced it with a supplementary bootstrap procedure that recognizes the uncertainty surrounding the estimate of dark uncertainty.

• The hierarchical Bayesian procedure (§5.3) is a preferable alternative to maximum likelihood because it recognizes the uncertainty surrounding the estimate of τ without special "add-ons", and allows expressing a modicum of prior knowledge about the variance components, τ and the {σ_j}, which typically are difficult to evaluate reliably.

• The Linear Pool (§5.4) has been in use the longest and makes the fewest assumptions about the nature of the data. It is particularly attractive because its motivating idea and operating principle can be explained very easily in non-technical terms.
5.1. Maximum Likelihood Estimation

Considering the generally good performance of maximum likelihood estimates (MLE) of parameters in statistical models [150], and the long and illustrious history of their use [132], the procedure is a natural candidate for the reduction of data from interlaboratory studies in model-based contexts.

The MLE depends on specific assumptions about the laboratory effects {λ_j} and the laboratory-specific measurement errors {ε_j}, in addition to those already made in the definition of the random effects model (§4). Most commonly, the {λ_j} are assumed to be a sample from a Gaussian distribution, and the {ε_j} are assumed to be values of Gaussian random variables. The {u_j}, when they are based on finite numbers {ν_j} of degrees of freedom, are assumed to be such that the {ν_j u_j²/σ_j²} are like outcomes of chi-squared random variables with the {ν_j} as numbers of degrees of freedom. Any u_j whose number of degrees of freedom is either infinite or unspecified is treated as a known constant. Furthermore, all the random variables mentioned are assumed to be mutually independent.

Computing the MLE involves maximizing the likelihood function with respect to the parameters μ, τ, and the {σ_j} (these only when the {ν_j} are finite), under the constraint that all except μ must be non-negative. Rukhin and Vangel [123], Vangel and Rukhin [145], and Searle et al. [127], among many others, state the form of the likelihood function and discuss its maximization.

Vangel and Rukhin [145] point out, and illustrate with an example, the possibility that the likelihood function may have multiple local maxima. Rukhin et al. [118] discuss an instance of the same possibility in restricted maximum likelihood (REML) estimation. For this reason, in the application of the method reported in §7.1, to compute a consensus value for the Newtonian constant of gravitation, the likelihood function was maximized numerically multiple times, using different optimizers and starting from different combinations of values of its arguments.
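To illustrate the multi-start strategy, the sketch below (our own illustration, not the authors' code, with made-up data) maximizes the marginal likelihood of the Gaussian random effects model, with the {u_j} treated as known, from several starting points and with two different optimizers. As an assumption of this sketch, τ is parameterized on the log scale so the constraint τ > 0 is satisfied automatically (the boundary τ = 0 can then only be approached, not reached):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up measurement results: values and standard uncertainties
x = np.array([9.8, 10.1, 10.4, 9.9, 10.6])
u = np.array([0.2, 0.3, 0.25, 0.2, 0.4])

def neg_loglik(theta):
    """Negative marginal log-likelihood of the Gaussian random effects
    model with sigma_j = u_j known: x_j ~ N(mu, tau^2 + u_j^2)."""
    mu, log_tau = theta
    v = np.exp(log_tau) ** 2 + u ** 2
    return 0.5 * np.sum(np.log(2 * np.pi * v) + (x - mu) ** 2 / v)

# Multiple optimizers and starting points, as a guard against local maxima
fits = []
for method in ("Nelder-Mead", "BFGS"):
    for mu0 in (x.min(), np.median(x), x.max()):
        for tau0 in (0.01, x.std(ddof=1)):
            fits.append(minimize(neg_loglik, [mu0, np.log(tau0)], method=method))

best = min(fits, key=lambda r: r.fun)      # keep the highest likelihood found
mu_hat, tau_hat = best.x[0], np.exp(best.x[1])
print(mu_hat, tau_hat)
```

Here the objective is −ℓ(μ, τ) = ½ Σ_j [log(2π(τ² + u_j²)) + (x_j − μ)²/(τ² + u_j²)], the negative log of the marginal density of the {x_j} after the {λ_j} have been integrated out.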
Viechtbauer [146] notes that the maximum likelihood estimate of τ² generally is biased low because it fails to take into account the fact that the value of the measurand, μ, is also estimated from the data. For this reason, he suggests that "one should probably avoid [. . . ] maximum likelihood estimators because they can potentially provide quite misleading results." Such underestimation "might lead researchers to (a) ignore possible heterogeneity in the effect sizes resulting from either random population effect sizes or moderator effects, and (b) to overstate the precision of the estimate of μ" [146].

Rukhin et al. [118, Equation (5)] present the REML criterion and discuss its maximization. Even though REML compensates for the typical underestimation of τ, it suffers from two other important shortcomings that it shares with the MLE: (i) it neglects the typically small number of observations on which the estimate of τ is based; and (ii) coverage intervals for the measurand μ are based either on large-sample approximations (obviously unrealistic in most applications in measurement science), or on ad hoc approximations of dubious general validity. Rukhin [120] suggests alternative, promising estimators for τ that have Bayesian underpinnings.
Maximum likelihood may also be employed when the data are incompatible with Gaussian distributions. This possibility is illustrated in §7.3.2, where the likelihood function involves the probability mass function of a non-central hypergeometric distribution. Pinheiro et al. [107] describe models whose likelihood function is a multivariate Student's t distribution. Rukhin and Possolo [122] discuss the MLE for a random effects model where both the {λ_j} and the {ε_j} have Laplace distributions.

In general, the MLE is numerically identical to the Bayesian maximum a posteriori estimates of the parameters of interest when the prior distributions assigned to these parameters are "flat." Since the prior distributions used in the hierarchical Bayesian procedure discussed in §5.3 are either only weakly informative or practically "flat", the estimate of the consensus value that it produces is likely to be close to the corresponding maximum likelihood estimate in most cases, as it is in the example concerning the Newtonian constant of gravitation (§7.1).

The theory behind the MLE provides an approximation to the uncertainty associated with its estimate of the consensus value [150, Theorem 9.27]. This approximation may be unreliable when the number of participating laboratories is as small as it usually is in interlaboratory studies in measurement science. Therefore, realistic uncertainty evaluations for the MLE in such cases require the application of a supplementary procedure, as for the DerSimonian-Laird procedure.

Owing to the limitations just reviewed (typical underestimation of the key parameter τ, computational cost, the need for supplemental computation to achieve realistic uncertainty evaluations, and results that typically are numerically similar to those of the hierarchical Bayesian procedure), we do not emphasize the MLE in this contribution, or promote its use.
5.2. DerSimonian-Laird Procedure

In the DerSimonian-Laird procedure [33; 154], the consensus value, which estimates μ, is a weighted average of the values measured by the participating laboratories, μ̂ = Σ_{j=1}^n w_j x_j / Σ_{j=1}^n w_j, with weights w_j = 1/(τ² + σ_j²) for j = 1, …, n. Bowden and Jackson [16] describe an educational online tool to visualize these weights and to assess the influence that extreme measurement results have upon the estimate of the consensus value.

Since the {σ_j} and τ are unknown, they are substituted by estimates, σ̂_j = u_j and τ̂²_DL = max{0, τ̂²_M}, where τ̂²_M = (Q − n + 1) / (Σ_{j=1}^n u_j⁻² − Σ_{j=1}^n u_j⁻⁴ / Σ_{j=1}^n u_j⁻²), and Q = Σ_{j=1}^n u_j⁻² (x_j − μ̂)². It should be noted that the DerSimonian-Laird estimates of the consensus value, and of τ, do not involve assumptions about the shapes of the probability distributions of either the {λ_j} or the {ε_j}.
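These formulas translate directly into code. The sketch below is our own illustration, with made-up data; note that (Q − n + 1) is the same quantity as (Q − (n − 1)):

```python
import numpy as np

def dersimonian_laird(x, u):
    """DerSimonian-Laird consensus value and moment estimate of tau^2.

    x : measured values {x_j}; u : associated standard uncertainties {u_j}.
    """
    x, u = np.asarray(x, float), np.asarray(u, float)
    n = len(x)
    w0 = 1.0 / u**2                           # fixed-effect weights 1/u_j^2
    mu0 = np.sum(w0 * x) / np.sum(w0)         # weighted mean used in Q
    Q = np.sum(w0 * (x - mu0) ** 2)           # Cochran's Q statistic
    tau2 = max(0.0, (Q - (n - 1)) / (w0.sum() - (w0**2).sum() / w0.sum()))
    w = 1.0 / (tau2 + u**2)                   # random effects weights
    mu = np.sum(w * x) / np.sum(w)            # consensus value
    return mu, tau2

mu, tau2 = dersimonian_laird([9.8, 10.1, 10.4, 9.9, 10.6],
                             [0.2, 0.3, 0.25, 0.2, 0.4])
print(mu, tau2)
```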


When τ = 0, the DerSimonian-Laird estimate of the consensus value is the weighted average of the {x_j} with weights proportional to {1/u_j²}. However, treating these weights as if they were known constants in the evaluation of the uncertainty associated with the consensus value is typically unrealistic. This leads to over-optimistic uncertainty evaluations, and in consequence also to excessive numbers of discrepant degrees of equivalence. We address this shortcoming below.

The approximate standard uncertainty associated with the consensus value is u_DL(μ) = 1/(Σ_{j=1}^n w_j)^{1/2} [66]. The presence of τ² in the denominator of the weights {w_j} acts as a moderating influence, preventing very small laboratory-specific uncertainties from influencing the consensus value to an extent that is often found to be objectionable when conventional weighted averages are used (cf. principle P2).

Unfortunately, this closed-form expression for the standard uncertainty of a weighted mean involving known weights is irrelevant in practice, because the weights depend not only on the {u_j} but also on τ, which has to be estimated from the data even in the utopian scenario in which the {u_j} are treated as known.

The alternatives include a suggestion by Knapp and Hartung [81], and an approximation made possible by technology developed by Biggerstaff and Tweedie [9] and Biggerstaff and Jackson [8]. However, Hoaglin [67] warns that, although improving on naive evaluations, these alternatives still rely on the typically unrealistic assumption that u_j = σ_j.

It is preferable to evaluate u_DL(μ) by application of the parametric statistical bootstrap, which offers the means to take the numbers of degrees of freedom {ν_j} associated with the {u_j} into account, and also to recognize the typically small number of measurement results used to estimate τ. The parametric statistical bootstrap does, however, involve additional assumptions about the {λ_j} and the {ε_j}.
To approximate the distribution of the estimate of τ², Biggerstaff and Tweedie [9] derive the exact mean and variance of Cochran's Q statistic. From these parameters, the distribution of Q is approximated using a gamma distribution. Since τ̂²_M = (Q − n + 1) / (Σ_{j=1}^n u_j⁻² − Σ_{j=1}^n u_j⁻⁴ / Σ_{j=1}^n u_j⁻²), the distribution of τ̂²_M can be approximated using a location-shifted, scaled gamma distribution. Thus, to simulate from the approximate distribution of τ̂²_DL = max{0, τ̂²_M}, we simulate τ̂²_M from the appropriate gamma distribution and take the maximum of the simulated value and 0.

The uncertainty evaluation via the parametric statistical bootstrap is consistent with the GUM Supplement 1 [76]. The raw materials for this evaluation are obtained by repeating the following steps a sufficiently large number of times (say, K = 10⁵), for k = 1, …, K:

(a) Draw τ_k from the approximate probability distribution of τ̂_DL.

(b) Draw x_jk from a Gaussian distribution with mean μ̂ and variance τ_k² + u_j², for j = 1, …, n.

(c) If ν_j is either infinite or unspecified, set u_jk = u_j; otherwise set u_jk = u_j (ν_j/χ²_{ν_j})^{1/2}, where χ²_{ν_j} denotes a value drawn from a chi-squared distribution with ν_j degrees of freedom, for j = 1, …, n.

(d) Compute the DerSimonian-Laird consensus value μ_k corresponding to the triplets (x_1k, u_1k, ν_1), …, (x_nk, u_nk, ν_n).

The standard uncertainty associated with the DerSimonian-Laird consensus value is the standard deviation of the {μ_k}, and one (among many alternatives) 95 % coverage interval for μ ranges from the 2.5th to the 97.5th percentile of the {μ_k}.
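The sketch below (our own, with made-up data and hypothetical numbers of degrees of freedom, all finite) implements steps (b)-(d). As a simplifying assumption, for step (a) it re-estimates τ² from each simulated data set instead of drawing it from the gamma approximation of Biggerstaff and Tweedie, which also propagates the uncertainty in τ̂²:

```python
import numpy as np

def dl_estimate(x, u):
    # DerSimonian-Laird moment estimates (mu_hat, tau2_hat)
    n = len(x)
    w0 = 1.0 / u**2
    mu0 = np.sum(w0 * x) / np.sum(w0)
    Q = np.sum(w0 * (x - mu0) ** 2)
    tau2 = max(0.0, (Q - (n - 1)) / (w0.sum() - (w0**2).sum() / w0.sum()))
    w = 1.0 / (tau2 + u**2)
    return np.sum(w * x) / np.sum(w), tau2

rng = np.random.default_rng(2)
x = np.array([9.8, 10.1, 10.4, 9.9, 10.6])   # made-up measured values
u = np.array([0.2, 0.3, 0.25, 0.2, 0.4])     # made-up standard uncertainties
nu = np.array([9, 14, 9, 19, 4])             # hypothetical degrees of freedom

mu_hat, tau2_hat = dl_estimate(x, u)

K = 5000
mu_k = np.empty(K)
for k in range(K):
    # step (b): simulate measured values around the consensus estimate
    xk = rng.normal(mu_hat, np.sqrt(tau2_hat + u**2))
    # step (c): simulate uncertainty evaluations via chi-squared draws
    uk = u * np.sqrt(nu / rng.chisquare(nu))
    # step (d): recompute the DL consensus value from the simulated triplets
    mu_k[k] = dl_estimate(xk, uk)[0]

u_mu = mu_k.std(ddof=1)                      # standard uncertainty of consensus value
lo, hi = np.percentile(mu_k, [2.5, 97.5])    # one 95 % coverage interval
print(u_mu, lo, hi)
```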
5.3. Hierarchical Bayesian Procedure

A Bayesian treatment can also remedy the defect that the conventional DerSimonian-Laird uncertainty evaluation suffers from: not recognizing the uncertainty surrounding the estimate of τ. The Bayesian treatment offers additional advantages that will become apparent in the following discussion, and in §6.

In a Bayesian analysis, parameters like μ, τ, and the {σ_j}, whose values are unknown, are modeled as outcomes of non-observable random variables, and the measurement results {(x_j, u_j, ν_j)} are modeled as actually observed outcomes of random variables. The expression "random variable" does not imply that there is anything chancy about the value of the corresponding quantity. It simply indicates that there is a probability distribution associated with the quantity, which recognizes that the value of the quantity is generally unknown, owing either to natural sampling variability or to incomplete knowledge.

The probability distributions for the unknowns, reflecting a priori (that is, before acquiring any data) states of knowledge about their true values, are called prior distributions. In many cases encountered in practice there is considerable a priori information about the measurand (for example, that the amount-of-substance fraction of ethane, in a synthetic mixture designed to emulate natural gas, lies within a fairly narrow interval). There may also exist a priori information about other parameters in the model: for example, about the measurement uncertainty associated with a gravimetric preparation.

The Bayesian approach is advantageous, and indeed recommended, when there exists specific, relevant information about the value of the measurand, or about other aspects of the measurement, that may be encapsulated in truly informative prior distributions. These are preferable to so-called "non-informative priors" [80] that are intended to portray complete ignorance about the true value of the measurand. Indeed, in just about all instances of measurement there is some a priori notion about the value of the measurand, for otherwise it would be impossible even to select a measurement method. The elicitation of informative prior distributions, however, is a non-trivial exercise [104], and their use involves suitably customized versions of MCMC.

Since the hierarchical Bayesian model we suggest is intended for general use, the only prior information we incorporate concerns the median values expected for the variance components recognized in the model; otherwise, the prior distributions all are rather noncommittal. Still, it should be kept in mind that weakly informative priors can be influential, particularly when the number of measured values is small [85]. However, the hierarchical Bayesian model, and the method we use to fit it, can in principle accommodate any prior information that may be encapsulated in a parametric distribution.
The distributions selected for the recommended hierarchical Bayesian model are the following:

• μ has a prior Gaussian distribution with mean 0 and very large standard deviation (10⁵), a very diffuse, essentially non-informative, yet proper prior distribution. This should be replaced by a more concentrated distribution when there is credible, specific information about the true value of the measurand: for example, that it must be positive, that it lies within a given bounded interval, or that it is of a particular order of magnitude.

• τ and the {σ_j} have prior half-Cauchy distributions, as suggested by Gelman [55] and further supported by Polson and Scott [108]. We have chosen the medians of these prior distributions as follows: for τ, equal to the median of the absolute values of the differences between the measured values and their median; and for the {σ_j}, equal to the median of the {u_j}.

• Given τ, the {λ_j} are Gaussian with mean 0 and standard deviation τ.

• Given μ, the {λ_j}, and the {σ_j}, the measured values {x_j} are modeled as outcomes of Gaussian random variables with means {μ + λ_j} and standard deviations {σ_j}.

• When the standard uncertainties associated with the measured values are based on infinitely many degrees of freedom, σ_j = u_j. When they are based on finite numbers of degrees of freedom {ν_j}, given the {σ_j} the {ν_j u_j²/σ_j²} are modeled as outcomes of chi-squared random variables with {ν_j} degrees of freedom.
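The medians chosen for these half-Cauchy priors are simple functions of the data, and since the median of a half-Cauchy distribution equals its scale parameter, they translate directly into prior scales. A small sketch (our own, with made-up data):

```python
import numpy as np

# Made-up measurement results (x_j, u_j)
x = np.array([9.8, 10.1, 10.4, 9.9, 10.6])
u = np.array([0.2, 0.3, 0.25, 0.2, 0.4])

# Prior median for tau: the median absolute deviation of the
# measured values from their median
tau_prior_median = np.median(np.abs(x - np.median(x)))

# Prior median for the {sigma_j}: the median of the {u_j}
sigma_prior_median = np.median(u)

# For a half-Cauchy distribution with scale s, the CDF is (2/pi) arctan(x/s),
# so its median equals s: these values can serve directly as prior scales.
print(tau_prior_median, sigma_prior_median)
```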
The estimate of the consensus value μ is the mean of the corresponding posterior distribution, and the associated standard uncertainty u(μ) is the standard deviation of the same distribution. Since the probability distribution that Bayes's rule [32; 114] produces for μ cannot be computed explicitly and analytically under the modeling choices just described, the consensus value produced by the Bayesian procedure is obtained as the average of a sample of values drawn from the posterior distribution of μ via MCMC sampling, and u(μ) is the standard deviation of that sample.

The sample should be sufficiently large to characterize u(μ) accurately enough for the intended purpose. One should also verify that the MCMC sample has been drawn after the Markov chain has reached equilibrium. This can be done using the convergence diagnostic test suggested by [57]. If equilibrium has not been reached, the analysis should be re-run with a larger MCMC sample and a longer burn-in period.
50
51
52 1038 5.4. Linear Pool
53
Ac

54 1039 The Linear Pool was suggested by Stone [133] to aggregate the opinions or states of
55
1040 knowledge of several experts on a particular matter, expressed as probability distributions,
56
57 1041 thereby producing a consensus. However, Bacharach [2] attributes the idea to Pierre
58 1042 Simon, Marquis de Laplace.
59
60 1043 Duewer [38] has suggested the linear pool as a general-purpose procedure for
1044 key comparisons organized by the Consultative Committee for Amount of Substance
AUTHOR SUBMITTED MANUSCRIPT - MET-100872.R1 Page 28 of 65

1
2
3 Consensus Building 28
4

pt
5 1045 (Metrology in Chemistry and Biology), of the International Committee for Weights and
6 1046 Measures. The Linear Pool has also been used to reduce data from interlaboratory studies
7
8 1047 in thermometry [130].
9 1048 In our case, the “experts” are the laboratories or measurement methods involved in an

cri
10
11 1049 intercomparison, and their opinions or states of knowledge are expressed in the form
12 1050 of probability distributions. The Linear Pool produces a sample from a mixture of these
13
1051 probability distributions, which may then be suitably summarized to produce a consensus
14
15 1052 value and an evaluation of the associated uncertainty.
The aforementioned distributions are taken to be either Gaussian (when the number of degrees of freedom is infinite or is left unspecified), or re-scaled and shifted Student's t distributions (when the number of degrees of freedom is finite), in both cases with means equal to the measured values {x_j} and standard deviations equal to the associated standard uncertainties {u_j}.
The n probability distributions are aggregated by mixing with specified weights {w_j}, to produce a consensus distribution whose mean is the consensus value, and whose standard deviation is the standard uncertainty associated with the consensus value. The weights represent the quality or reliability of the participating laboratories, as perceived by the person performing the aggregation. Note that equal weights do not imply that the measured values themselves will end up being weighted equally: those with larger uncertainties influence the consensus value less than those with smaller uncertainties.
If {φ_j} denote the probability densities of the distributions assigned to the participants as described above (Gaussian or re-scaled and shifted Student's t), and {w_j} denote the corresponding non-negative weights normalized to sum to 1, then the mixture distribution has probability density f = Σ_{j=1}^{n} w_j φ_j, where n is the number of participants.
Typically, a sample of large size K is drawn from the mixture distribution of the measurand by repeating the following process K times: select a laboratory at random, with probabilities proportional to the weights, and then draw a value from the corresponding distribution.
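The sampling process just described can be sketched in a few lines. The sketch below is illustrative Python only (the NICOB is not implemented this way), it uses Gaussian distributions throughout (Student's t would be substituted when degrees of freedom are finite), and the function name linear_pool_sample is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_pool_sample(x, u, w, K=100_000):
    # Draw K values from the mixture: select a laboratory at random with
    # probability proportional to its weight, then draw a value from that
    # laboratory's distribution (Gaussian with mean x_j and sd u_j here).
    x, u = np.asarray(x, float), np.asarray(u, float)
    w = np.asarray(w, float) / np.sum(w)
    labs = rng.choice(len(x), size=K, p=w)
    return rng.normal(x[labs], u[labs])

# Consensus value, standard uncertainty, and a 95 % coverage interval
# from the mixture sample (values here are made up for illustration):
s = linear_pool_sample([10.0, 10.2, 9.9], [0.05, 0.10, 0.08], [1, 1, 1])
mu, um = s.mean(), s.std(ddof=1)
lo, hi = np.percentile(s, [2.5, 97.5])
```

The interval (lo, hi) need not be centered at mu, consistent with the remark below about percentile-based coverage intervals.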
The K values obtained through this process are summarized in the same way as the bootstrap sample for the DerSimonian-Laird procedure, or as the MCMC sample drawn in the Bayesian procedure. Specifically, the average of the sample drawn from this mixture distribution is the consensus value, and its standard deviation is the corresponding standard uncertainty. Toman [138, Equation (20)] provides an analytical expression for this standard uncertainty for a common implementation of the Linear Pool. Coverage intervals are built by selecting suitable percentiles from this sample, for example the 2.5th and the 97.5th for a 95 % coverage interval (which generally need not be centered at the consensus value).
The Linear Pool is but one of several ways in which the opinions of multiple experts, expressed as probability distributions, may be merged into a consensus distribution. Clemen and Winkler [22] review some of the alternatives, detailing and comparing their underlying assumptions and properties.
The consensus distribution obtained by linear pooling may be multimodal (meaning that the mixture density f mentioned above may have multiple peaks). In such cases its mean may be a poor indication of its typical value, and its standard deviation may be a misleading indication of the associated uncertainty. To facilitate a critical evaluation of the fitness-for-purpose of the mean and standard deviation of the distribution, the probability density f should be examined graphically.
6. Degrees of Equivalence
The principal goal of some interlaboratory studies, key comparisons in particular, is not so much to produce a consensus value as to detect and identify measurement results that deviate significantly from the consensus value, or from one another when considered in pairs.

The consensus value used for this purpose may not be derived from the measurement results of the participants, or it may not even be a meaningful physical quantity, being defined only as a convenient baseline against which to gauge deviations. For example, in key comparison CCQM-K1 [1], the consensus value was the amount-of-substance fraction of each of several gas species that had been determined gravimetrically by the pilot laboratory as it prepared the gas mixtures for distribution to the participants. And in key comparison CCM.FF-K6.2011 [5] the consensus value was a weighted average of the relative differences between volumes of gas indicated by the transfer standard and the corresponding volumes measured by the reference (national) standard.
The goal of detecting and identifying measurement results that are significantly inconsistent with a reference or consensus value is paramount in proficiency tests [137], and in performance rankings of medical centers [87].
In the context of KCs, the relevant comparisons are based on unilateral and bilateral degrees of equivalence (DoEs). Owing to the legal force of the MRA that frames KCs, there is less latitude in how the DoEs may be defined for KCs than there is for studies done in other contexts. In fact, the MRA is quite specific, stating in its Technical Supplement:
    For the purposes of this arrangement, the term degree of equivalence of measurement standards is taken to mean the degree to which a standard is consistent with the key comparison reference value. The degree of equivalence of each national measurement standard is expressed quantitatively by two terms: its deviation from the key comparison reference value and the uncertainty of this deviation (at a 95 % level of confidence). The degree of equivalence between pairs of national measurement standards is expressed by the difference of their deviations from the reference value and the uncertainty of this difference (at a 95 % level of confidence).
    — Comité International des Poids et Mesures (CIPM) [26, T.3]
That is, the unilateral DoEs are the pairs {(D_j, U_95%(D_j))}, where D_j = x_j − µ̂ and U_95%(D_j) denotes the associated expanded uncertainty for 95 % coverage of the true difference. And the bilateral DoEs are the pairs {(B_ij, U_95%(B_ij))}, where B_ij = D_i − D_j, and the {U_95%(B_ij)} are the counterparts of the {U_95%(D_j)}. Note that B_ij may differ from x_i − x_j: for example, if D_j is replaced by D*_j as defined below.
Occasionally, the measurement results for a participant in a KC are not used in the computation of the consensus value. This may happen when a participant is traceable to the SI via another participant, or when the participant has failed to follow the protocol agreed upon for the comparison. Even in such cases the unilateral DoE for the participant may be computed.
This suggests an alternative definition for the unilateral DoE that could be applied generally, based on the difference D*_j = x_j − µ̂_{−j}, where µ̂_{−j} denotes an estimate of the consensus value derived from the measurement results produced by all the participants but leaving out the results from participant j, for j = 1, …, n.
Viechtbauer and Cheung [148] have used this idea to identify outliers in meta-analysis, and Duewer et al. [40] have promoted it specifically for the evaluation of degrees of equivalence in key comparisons. This alternative definition may then be carried forward and lead to an alternative definition of the bilateral DoEs based on B*_ij = D*_i − D*_j (which generally differs from x_i − x_j), for 1 ≤ i ≠ j ≤ n.
It may be argued that D*_j is more accurate than D_j as an assessment of the “distance” between the value measured by laboratory j and the values measured by the other laboratories. In fact, x_j − µ̂ may be too small in absolute value because µ̂ incorporates (“tracks”) x_j. Furthermore, since x_j and µ̂_{−j} are uncorrelated, this alternative definition greatly simplifies the evaluation of U_95%(D*_j).
The question must also be answered as to whether, once their associated uncertainties are taken into account, the {D_j} and the {B_ij} (or the {D*_j} and the {B*_ij}) differ significantly from 0. We follow guidance from Jones and Spiegelhalter [78] about how to identify participants with “unusual” results in an interlaboratory study, and favor their Approach 2 to Identify Outliers to the Random Effects Distribution.
The perspective in this endeavor is one of statistical testing, rather than of estimation: for the unilateral DoE, the goal is to identify measured values that, as Jones and Spiegelhalter [78] specify, “lie beyond the range allowed by the model”, and that effectively are outliers relative to the random effects distribution. Both Bayesian and sampling-theoretic approaches lead to the same criterion to identify significant discrepancies.
Next we describe how to compute unilateral and bilateral DoEs using the conventional version, as defined in the MRA, and according to the leave-one-out strategy, for the DerSimonian-Laird and the hierarchical Bayesian procedures, and for the Linear Pool.
6.1. DerSimonian-Laird
For the DerSimonian-Laird procedure, the conventional version of the unilateral DoE is D_j = x_j − µ̂, and the bilateral DoE is B_ij = D_i − D_j for i, j = 1, …, n. Both the {U_95%(D_j)} and the {U_95%(B_ij)} are evaluated using the parametric statistical bootstrap, from raw materials computed by repeating the following steps for k = 1, …, K, where K is sufficiently large to achieve the required accuracy (say, K = 10⁵):

(a) Draw τ_k from the approximate sampling distribution of τ̂_DL described in §5.2.
(b) Draw x_{j,k} from a Gaussian distribution with mean µ̂ and variance τ_k² + u_j², for j = 1, …, n.
(c) If ν_j is either infinity or unspecified, set u_{j,k} = u_j; otherwise set u_{j,k} = u_j √(ν_j / χ²_{ν_j}), where χ²_{ν_j} denotes a value drawn from a chi-squared distribution with ν_j degrees of freedom, for j = 1, …, n.
(d) Compute the DerSimonian-Laird consensus value µ_k corresponding to the triplets (x_{1,k}, u_{1,k}, ν_1), …, (x_{n,k}, u_{n,k}, ν_n).
(e) Compute D_{j,k} = x_{j,k} − µ_k, for j = 1, …, n.

Compute U_95%(D_j) as one half of the length of the shortest interval centered at the average of D_{j,1}, …, D_{j,K} that includes 95 % of the {D_{j,k}}. The value of U_95%(B_ij) is computed similarly, based on B_{ij,k} = D_{i,k} − D_{j,k} for i, j = 1, …, n. Defined in this fashion, these expanded uncertainties are versions of the “shorth” estimate of scale [61].
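As an illustration of steps (a)-(e) and of the centered (“shorth”-style) expanded uncertainty, here is a simplified Python sketch; it is not the NICOB implementation, and the function names are ours. Two simplifications are flagged in the comments: the drawn τ_k is reused as the between-laboratory variance instead of being re-estimated from the simulated values, and degrees of freedom are taken as infinite, so step (c) leaves the u_j unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def u95_centered(draws, coverage=0.95):
    # Half-length of the shortest interval centered at the average of the
    # draws that includes `coverage` of them: the `coverage` quantile of
    # the absolute deviations from the average.
    draws = np.asarray(draws, float)
    return np.quantile(np.abs(draws - draws.mean()), coverage)

def dl_consensus(x, v):
    # Weighted mean with weights 1/v_j; `v` already includes the tau^2
    # term drawn in step (a) (a simplification: the full procedure would
    # re-estimate tau from the simulated values).
    w = 1.0 / np.asarray(v, float)
    return np.sum(w * x) / np.sum(w)

def doe_bootstrap(x, u, mu_hat, tau_draws, K=10_000):
    # Steps (a), (b), (d), (e) for the unilateral DoEs; degrees of
    # freedom are taken as infinite, so step (c) leaves the u_j alone.
    x, u = np.asarray(x, float), np.asarray(u, float)
    D = np.empty((K, len(x)))
    for k in range(K):
        tau2 = tau_draws[k] ** 2                       # step (a)
        xk = rng.normal(mu_hat, np.sqrt(tau2 + u**2))  # step (b)
        muk = dl_consensus(xk, tau2 + u**2)            # step (d)
        D[k] = xk - muk                                # step (e)
    U95 = np.array([u95_centered(D[:, j]) for j in range(len(x))])
    return x - mu_hat, U95
```

For a sample from a standard Gaussian, u95_centered approaches 1.96, the familiar 95 % expanded-uncertainty factor.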
For the leave-one-out version, D*_j = x_j − µ̂_{−j} and the bilateral DoE is B*_ij = D*_i − D*_j for i, j = 1, …, n. Instead of performing a parametric bootstrap evaluation of u(µ̂_{−j}), we simply draw K samples from the Student's t approximation suggested by Knapp and Hartung [81]. The deviations that are used for the evaluation of the expanded uncertainty associated with D*_j are of the form D*_{j,k} = x_j + e_{j,k} − µ_{−j,k}, where e_{j,k} is an outcome of either a Student's t or a Gaussian distribution with mean 0 and variance τ²_{−j,k} + u_j² (depending on whether degrees of freedom have, or have not been specified), and τ²_{−j,k} is drawn from the approximate distribution given in §5.2.
6.2. Hierarchical Bayesian
The Bayesian approach that Jones and Spiegelhalter [78] describe is based on the posterior predictive distribution for measured values, whose probability density is g such that g(ξ_j | x_1, …, x_n) = ∭ φ(ξ_j | µ, τ² + σ_j²) q(µ, τ, σ_j | x_1, u_1, ν_1, …, x_n, u_n, ν_n) dµ dτ dσ_j, where ξ_j denotes a prediction for a value that laboratory j may measure, φ(· | µ, τ² + σ_j²) denotes the probability density of a Gaussian distribution with mean µ and variance τ² + σ_j², and q denotes the probability density of the joint posterior distribution of µ, τ, and σ_j given the measurement results.
The unilateral degree of equivalence for laboratory j = 1, …, n comprises D_j = x_j − µ̂, where µ̂ denotes the average of an MCMC sample drawn from the posterior distribution of µ, and U_95%(D_j), which is derived from a sample ξ_{j,1}, …, ξ_{j,K} drawn from the aforementioned predictive distribution by repeating the following steps a sufficiently large number K of times, for k = 1, …, K:

(a) Draw µ_k, τ_k, and σ_{j,k} from the corresponding posterior distributions via MCMC sampling.
(b) Draw ξ_{j,k} from a Gaussian distribution with mean µ_k and variance τ_k² + σ²_{j,k}.
(c) Compute D_{j,k} = x_j − ξ_{j,k}.

The value of U_95%(D_j) is one half of the length of the shortest interval that is symmetrical around the average of the {D_{j,k}} and includes 95 % of the {D_{j,k}} values.
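Assuming MCMC draws from the posterior distributions of µ, τ, and σ_j are already available as arrays (the sketch does not run a sampler, and the function name predictive_doe is ours), steps (a)-(c) and the symmetric shortest interval can be written as:

```python
import numpy as np

rng = np.random.default_rng(1)

def predictive_doe(x_j, mu_draws, tau_draws, sigma_j_draws):
    # Steps (a)-(c): for each retained MCMC draw (mu_k, tau_k, sigma_jk),
    # draw a predictive value xi_jk and form D_jk = x_j - xi_jk.
    xi = rng.normal(mu_draws, np.sqrt(tau_draws**2 + sigma_j_draws**2))
    D = x_j - xi
    # U95(D_j): half-length of the shortest interval symmetric around the
    # average of the D_jk that includes 95 % of them, i.e. the 95th
    # percentile of the absolute deviations from the average.
    return D.mean(), np.quantile(np.abs(D - D.mean()), 0.95)
```

In practice the three arrays would come from the same retained MCMC iterations, so that their joint dependence is preserved.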
Under the leave-one-out approach, we leave out the results from participant j, for j = 1, …, n. This means there is no posterior distribution for σ_j, so in the leave-one-out versions of the unilateral and bilateral DoE we modify the algorithm. For k = 1, …, K, µ_{−j,k} and τ_{−j,k} are drawn from the corresponding posterior distributions, via MCMC sampling. Similarly to the DerSimonian-Laird procedure, D*_{j,k} = x_j + e_{j,k} − µ_{−j,k}, where the e_{j,k} are drawn from either a Student's t or a Gaussian distribution, depending on whether the degrees of freedom have been specified, with mean 0 and variance τ²_{−j,k} + u_j².
6.3. Linear Pool
For the Linear Pool procedure, the conventional version of the unilateral DoE is D_j = x_j − µ̂, where µ̂ is the mean of the sample drawn from the mixture distribution, described in §5.4. Again, the bilateral DoE is B_ij = D_i − D_j for i, j = 1, …, n. The {U_95%(D_j)} and {U_95%(B_ij)} are evaluated using D_{j,k} = x_j + e_{j,k} − µ̂, where e_{j,k} is drawn from a Student's t (or Gaussian) distribution with mean 0 and variance u_j².
For the leave-one-out version, let {x̃_{−j,k}} for k = 1, …, K denote the sample of size K produced when the Linear Pool is applied to all the measurements excluding those from laboratory j, and define µ̂_{−j} as their average, for each j = 1, …, n. The unilateral DoEs have D*_j = x_j − µ̂_{−j}, and the bilateral DoEs have B*_ij = D*_i − D*_j for i, j = 1, …, n. The {U_95%(D*_j)} and {U_95%(B*_ij)} are computed similarly to how they are evaluated in the other two leave-one-out procedures, based on {D*_{j,k}} and {B*_{ij,k}} such that D*_{j,k} = x_j + e_{j,k} − x̃_{−j,k} and B*_{ij,k} = D*_{i,k} − D*_{j,k}. Here the e_{j,k} are drawn from either a Student's t or a Gaussian distribution, depending on whether the degrees of freedom have been specified, with mean 0 and variance u_j².
7. Examples
The following examples are drawn from several different areas of measurement science, and from medicine, and serve to illustrate and compare the procedures for data reduction that we have described in §5, emphasizing those that are implemented in the NICOB: DerSimonian-Laird, hierarchical Bayes, and Linear Pool. Two of the examples also include the results of maximum likelihood estimation and corresponding uncertainty evaluations.
The DerSimonian-Laird procedure and the Knapp-Hartung adjustment are implemented in the R function rma defined in package metafor [147]. The NICOB provides a user-friendly interface to these facilities, and also to the parametric bootstrap procedures described in §5.2.
7.1. Newtonian Constant of Gravitation
Newton's law of universal gravitation states that two material objects attract each other with a force that is directly proportional to the product of their masses and inversely proportional to the squared distance between them. The Newtonian constant of gravitation, G, is the constant of proportionality; it was first measured by Cavendish [21]. The value of G recommended by CODATA in the 2014 adjustment of the fundamental physical constants is 6.674 08 × 10⁻¹¹ m³ kg⁻¹ s⁻², with associated standard uncertainty u(G) = 0.000 31 × 10⁻¹¹ m³ kg⁻¹ s⁻² [102, Table XXXII].
Notwithstanding the effort and rigor that have gone into the characterization of the uncertainty budgets of the many reputable measurements of G, the dispersion of the resulting measured values is still strikingly larger than what the stated uncertainties would lead one to expect. The standard deviation of the measured values listed in Table 1 and depicted in Figure 1 equals 0.001 17 × 10⁻¹¹ m³ kg⁻¹ s⁻², while the laboratory-specific standard uncertainties range from 0.000 092 × 10⁻¹¹ m³ kg⁻¹ s⁻² to 0.000 99 × 10⁻¹¹ m³ kg⁻¹ s⁻², and their median equals 0.000 27 × 10⁻¹¹ m³ kg⁻¹ s⁻². Therefore, the measured values are about four times more dispersed than the typical, within-laboratory standard uncertainty.
This manifestation of dark uncertainty has given great impetus to collective efforts, which were discussed during a Newtonian Constant of Gravitation Workshop held at NIST in October, 2014, to identify the underlying causes. These efforts are expected to lead not only to improved estimates of G, but also, and maybe most importantly, to advances in the measurement of very weak forces generally.
The value of G recommended by CODATA in the 2014 adjustment is a weighted average of the estimates of G listed in Table 1, computed similarly to the value recommended in the previous release, but employing a different weighting strategy. The CODATA Task Group “decided that it would be more appropriate to follow its usual approach of treating inconsistent data, namely, to choose an expansion factor that reduces each |r_i| to less than 2” [102], where r_i denotes the normalized residual corresponding to the ith measured value. That is, r_i is the difference between the ith measured value and the adjusted value, divided by the stated measurement uncertainty associated with the ith measured value. The end result was equivalent to inflating all the stated uncertainties by a multiplicative expansion factor of 6.3.
[Table 1 about here.]

[Figure 1 about here.]

Table 2 lists the estimates of the consensus value, and associated uncertainties, produced by six different procedures of data reduction, along with the CODATA 2014 recommended value and associated uncertainty.
Two of the estimates are MLEs: one computed disregarding the correlations stated in Mohr et al. [102, Table XXVII], between two pairs of the values being combined, the other taking these correlations into account. In both cases, the likelihood function is a function of two variables, G and τ (standard deviation of the laboratory effects). In the first case, it has the form of a product of univariate Gaussian probability densities, all with the same mean G and variances {u_j²(G) + τ²}, evaluated at the measured values of G and at the stated, associated uncertainties. In the second case, the likelihood function is the probability density of a multivariate Gaussian distribution in 14-dimensional space, whose mean vector has all entries equal to G, the variances are as in the first case, and the correlation matrix has all off-diagonal elements equal to zero except those that pertain to the two pairs of laboratories whose measured values are deemed to be correlated.
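For the uncorrelated case, the two-parameter likelihood just described can be maximized by profiling: for fixed τ, the MLE of G is the weighted mean with weights 1/(u_j² + τ²). The Python sketch below does this over a grid of τ values; it is an illustration under these stated assumptions, not the computation actually used for Table 2, and the function names are ours.

```python
import numpy as np

def nll(G, tau, x, u):
    # Negative log-likelihood for the uncorrelated case: independent
    # Gaussians with common mean G and variances u_j^2 + tau^2.
    v = u**2 + tau**2
    return 0.5 * np.sum(np.log(2.0 * np.pi * v) + (x - G) ** 2 / v)

def mle_grid(x, u, taus):
    # Profile likelihood: for each tau on the grid, plug in the
    # conditional MLE of G (a weighted mean), then keep the pair
    # (G, tau) with the smallest negative log-likelihood.
    x, u = np.asarray(x, float), np.asarray(u, float)
    best = None
    for tau in taus:
        w = 1.0 / (u**2 + tau**2)
        G = np.sum(w * x) / np.sum(w)
        val = nll(G, tau, x, u)
        if best is None or val < best[0]:
            best = (val, G, tau)
    return best[1], best[2]
```

In the correlated case the product of univariate densities would be replaced by a multivariate Gaussian density with the stated off-diagonal correlation entries.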
Figure 2 shows the densities of the probability distributions mixed in the Linear Pool, and the density of the resulting probability distribution, which has multiple modes (local maxima). Figure 3 shows two versions of the unilateral degrees of equivalence that would normally be computed had this meta-analysis been a key comparison.
[Table 2 about here.]

[Figure 2 about here.]

[Figure 3 about here.]
The estimates of τ, the standard deviation of the laboratory effects that quantifies the contribution from dark uncertainty, are τ̂_DL = 0.000 95 × 10⁻¹¹ m³ kg⁻¹ s⁻² for the DerSimonian-Laird procedure, 0.001 13 × 10⁻¹¹ m³ kg⁻¹ s⁻² for the hierarchical Bayes procedure, and 0.001 02 × 10⁻¹¹ m³ kg⁻¹ s⁻² for both versions of maximum likelihood estimation. The heterogeneity index I², which is defined in §1.2, corresponding to τ̂_DL equals 96 %: this is the proportion of the variability of the measured values that is attributable to dark uncertainty.
The results summarized in Table 2 show that the Bayesian procedure and maximum likelihood produce very similar results, as expected in light of the remarks made toward the end of §5.3. Also, the correlations stated by Mohr et al. [102, Table XXVII], between NIST-82 and LANL-97, and between HUST-05 and HUST-09, have rather minimal impact on the results. More importantly, the several consensus estimates are statistically indistinguishable in light of their associated uncertainties.
Correlations may very easily be taken into account in the Linear Pool, by application of a copula [110]. If the copula is chosen to be Gaussian, then the Linear Pool reduces to sampling from a multivariate Gaussian distribution whose mean is the vector of measured values of G, and whose covariance matrix is diagonal except for the entries corresponding to the pairs of correlated measured values, with the entries in the main diagonal equal to the squared standard uncertainties associated with the measured values.
Mohr et al. [102, Table XV] list the methods that were used in the different experiments: a rough classification simply recognizes that six used fiber torsion, three strip torsion, one cryogenic torsion, three either a suspended or stationary body, and one atomic interferometry. This information may be used as a moderator (that is, a covariate) in the DerSimonian-Laird analysis, as implemented in the R function rma defined in package metafor [147]. Doing so reveals the potentially interesting fact that only strip torsion has a statistically significant (p-value of 0.04) effect of increasing G relative to atomic interferometry, which was chosen as the baseline.

7.2. Half-Life of ¹³⁷Cs and ⁹⁰Sr

MacMahon et al. [88] use two sets of measurement results to illustrate a manner of computing a consensus value in situations where there are “discrepant data”, in the sense of the results being heterogeneous. These are instances of meta-analysis in measurement science because the intercomparisons involve measurement results published previously.
MacMahon et al. [88] employ an ad hoc procedure to compute the consensus value. Their idea is to examine the “convergence” of six different methods to compute consensus values (all with some degree of resistance to deviant data), as the set of measurement results being combined increases monotonically, from the single oldest result to all the measurement results available at the time of their analysis, as increasingly more results are taken into account in the chronological order in which they were obtained.
The measurands are the half-lives of ¹³⁷Cs and ⁹⁰Sr. According to NuDat 2.6, an online data service of the National Nuclear Data Center (http://www.nndc.bnl.gov/), the half-life of the caesium isotope is 30.08 a, with standard uncertainty 0.09 a, and the half-life of the strontium isotope is 28.90 a with standard uncertainty 0.03 a (retrieved on October 10, 2016). These estimates and uncertainty evaluations may not be based on the exact same data that MacMahon et al. [88] use. We mention them here only to serve as a baseline against which to compare the results of alternative data reductions.
Table 3 lists the 19 measurements of the half-life of ¹³⁷Cs, and the 11 measurements of the half-life of ⁹⁰Sr, in chronological order of publication, beginning with the oldest.

[Table 3 about here.]
MacMahon et al. [88] conclude that the best estimate of the half-life of ¹³⁷Cs is 30.06 a, with standard uncertainty 0.03 a, and for ⁹⁰Sr it is 28.9 a, with standard uncertainty 0.04 a.

Table 4 lists the corresponding results obtained by application of the DerSimonian-Laird and hierarchical Bayesian procedures, and by the Linear Pool. All the calculations were done with the data expressed in days, the results having subsequently been converted to years by division by 365.25 d/a.

Figure 4 depicts the measurement results, and the consensus value and associated expanded uncertainty (for 95 % coverage) produced by the DerSimonian-Laird procedure. Figure 5 depicts the bilateral DoE as defined conventionally, based on the DerSimonian-Laird procedure (cf. §6.1).
The consensus values corresponding to different data reduction procedures in Table 4, for both ¹³⁷Cs and ⁹⁰Sr, do not differ significantly from one another once the associated uncertainties are taken into account. Therefore, we may conclude that conventional statistical methods corresponding either to a random effects model (DerSimonian-Laird and hierarchical Bayesian) or to a mixture model (Linear Pool) achieve the same goal as the approach involving the “convergence” of estimates produced by specialized statistical procedures.

[Table 4 about here.]
Figure 4 shows that not only is there heterogeneity generally, but there is also one measurement result (WT55) markedly deviant from the bulk of the others, so much so that one may wonder whether this datum alone might account for most of the heterogeneity in the results.

In fact it does not, as can be ascertained by comparing the estimates of τ and of the heterogeneity index I², when all the measurements are considered, with their counterparts when the measurement result WT55 is excluded: the corresponding changes are very modest for both ¹³⁷Cs and ⁹⁰Sr. For ¹³⁷Cs, τ̂ changes from 52 d to 46 d, and I² changes from 95 % to 93 %. For ⁹⁰Sr, τ̂ changes from 105 d to 104 d, and I² changes from 97.5 % to 97.7 %.
[Figure 4 about here.]

[Figure 5 about here.]
7.3. Carotid Artery Stenosis
Carotid stenosis is the narrowing of the carotid artery, which conveys freshly oxygenated blood to the brain, usually caused by the build-up of plaque. When fragments of plaque break off and blood flow carries them into the brain, they may block smaller arteries and cause a stroke.
Carotid endarterectomy and carotid stenting are two procedures that may be used to treat this condition. Endarterectomy is the surgical removal of the plaque deposits. Stenting involves deployment of an expansible tube (stent) inside the artery that mechanically increases the artery's inner diameter.
Table 5 lists the results of nine randomized controlled clinical trials that Meier et al. [99] selected and combined in a meta-analysis, indicating the numbers of patients involved in each study, and the numbers of these that either suffered a stroke or died within 30 days following the procedure. Figure 6 depicts the data and summarizes the results of the analysis.

[Table 5 about here.]

[Figure 6 about here.]
7.3.1. Log-Odds Ratios and Standard Uncertainties
The data are a set of nine 2 × 2 tables of counts: for example, for Naylor-1998, they are displayed in Table 6. To be able to apply the methods described in §4, each 2 × 2 table of counts needs to be reduced to a scalar summary of the relationship between the probability of stroke or death and the two alternative treatments. The log-odds ratio is the summary that we shall use in this example, and that we explain below.
But first we point out that, under the sampling conditions that produced the counts in the 2 × 2 tables, the log-odds ratio has a probability distribution that is approximately Gaussian, hence the models that involve the assumption of the data being like outcomes of Gaussian random variables should be applicable [74].

[Table 6 about here.]
For Naylor-1998, a naive estimate of the probability of stroke or death among the patients that underwent endarterectomy is pE = 0/12 = 0. Its counterpart for the group that underwent stenting is pS = 5/11. Therefore, the odds of stroke or death are pE/(1 − pE) = 0 in the endarterectomy group, and pS/(1 − pS) = 5/6 in the stenting group. The corresponding odds ratio is the ratio of these two odds, which equals 0, hence whose logarithm (log-odds ratio) is negative infinity, which is obviously problematic as a potential input to subsequent calculations.
Bayesian estimates of pE and pS are much more reasonable than the naive estimates above, especially when the number of cases of stroke or death is zero: pE = (kE + ½)/(nE + 1), and similarly for pS. These estimates are the means of the posterior distributions that correspond to the Jeffreys prior distribution for the probability of success in nE and nS binomial trials, respectively. The posterior distribution for the probability of stroke or death in the endarterectomy group is a beta distribution with shape parameters kE + ½ and nE − kE + ½, and similarly for the stenting group.
To evaluate the log-odds ratio and its standard uncertainty computed using Bayes estimates of the relevant probabilities, we make a large number K of draws from these beta distributions (with nE and nS kept fixed at the values given), form the corresponding log-odds ratios, and finally compute the mean and standard deviation of the resulting K values of the log-odds ratio. The means and standard deviations obtained in this way, based on samples of size K = 1 × 10⁷ drawn from the appropriate posterior distributions, are listed under log(OR) and u(log(OR)) in Table 5.
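The Monte Carlo evaluation just described can be sketched as follows, using only the Python standard library (the sample size K is reduced here from the 10⁷ used above, purely to keep the illustration fast):

```python
import math
import random
import statistics

random.seed(20170427)
kE, nE, kS, nS = 0, 12, 5, 11        # Naylor-1998 counts

def draw_log_odds_ratio():
    # One draw from each Jeffreys posterior, Beta(k + 1/2, n - k + 1/2)
    pE = random.betavariate(kE + 0.5, nE - kE + 0.5)
    pS = random.betavariate(kS + 0.5, nS - kS + 0.5)
    return math.log(pE / (1 - pE)) - math.log(pS / (1 - pS))

K = 100_000
sample = [draw_log_odds_ratio() for _ in range(K)]
log_OR = statistics.fmean(sample)    # plays the role of log(OR) in Table 5
u_log_OR = statistics.stdev(sample)  # plays the role of u(log(OR))
```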

The last column in the same table lists the effective numbers of degrees of freedom νOR that the values of u(log(OR)) are based on, computed using the Welch-Satterthwaite formula [100, Equation (2.44)], as suggested by Taylor and Kuyatt [135, §B.3] and by the GUM (G.4).
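The decomposition of u(log(OR)) into components with their individual degrees of freedom is not reproduced here, so the following is only a generic sketch of the Welch-Satterthwaite formula itself, with made-up inputs:

```python
def welch_satterthwaite(u, nu):
    """Effective degrees of freedom per the Welch-Satterthwaite formula
    (GUM, Annex G.4): nu_eff = u_c**4 / sum(u_i**4 / nu_i), where
    u_c**2 = sum(u_i**2) is the combined variance."""
    uc2 = sum(ui**2 for ui in u)
    return uc2**2 / sum(ui**4 / nui for ui, nui in zip(u, nu))

# Hypothetical components and degrees of freedom, for illustration only
nu_eff = welch_satterthwaite([0.3, 0.4], [11, 10])
```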
7.3.2. Results The results listed in Table 7 correspond to the estimates of the log-odds ratios, and associated standard uncertainties and numbers of degrees of freedom, listed in the last three columns of Table 5. The maximum likelihood estimate is obtained by maximizing the probability density of a non-central hypergeometric distribution in the context of a generalized linear mixed-effects model fitted conditionally on the total numbers of strokes or deaths [98, §7.4.2]. This procedure is implemented in R function rma.glmm defined in package metafor [147].

The heterogeneity index I² is a modest 31 %, and the p-value for Cochran’s Q test is 0.17, hence not suggesting statistically significant heterogeneity. A 95 % Bayesian credible interval for τ ranges from 0.02 to 0.98, suggesting otherwise.
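For reference, Cochran’s Q and the Higgins-Thompson index I² are simple functions of the measured values and their standard uncertainties. The sketch below uses hypothetical data; the p-value would be obtained by referring Q to a chi-squared distribution with n − 1 degrees of freedom:

```python
def cochran_Q_and_I2(x, u):
    """Cochran's Q statistic and the Higgins-Thompson heterogeneity index
    I^2 = max(0, (Q - (n - 1)) / Q), for measured values x with
    standard uncertainties u."""
    w = [1.0 / ui**2 for ui in u]
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    Q = sum(wi * (xi - xbar)**2 for wi, xi in zip(w, x))
    I2 = max(0.0, (Q - (len(x) - 1)) / Q) if Q > 0 else 0.0
    return Q, I2

# Hypothetical data, for illustration only
Q, I2 = cochran_Q_and_I2([1.0, 1.2, 0.9], [0.1, 0.1, 0.1])
```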
[Table 7 about here.]
7.4. Radiofrequency Power Sensor

Table 8 lists, and Figure 7 depicts, measurement results from key comparison CCEM.RF-K25.W [79] of the calibration factor ηCAL at 33 GHz, for a commercial, temperature-compensated thermistor power sensor that circulated among the participants. The calibration factor is the proportion of the power of the radiofrequency signal that the sensor actually measures, hence it is a dimensionless quantity with values between 0 and 1.
[Table 8 about here.]

[Figure 7 about here.]
The original study excluded the measurement result from NRC (Canada) because a statistical criterion described by Randa [115] identified it as an outlier, and computed the KCRV as the arithmetic average 0.8184 of the seven remaining measured values, with associated standard uncertainty 0.0028.
Analyzing all of the data in Table 8, Cochran’s Q-test of homogeneity yields a p-value of 0.59, suggesting the data are not heterogeneous, and the heterogeneity index I² = 0. A 95 % Bayesian credible interval for the dark uncertainty τ ranges from 0.000 081 to 0.0082. The reliability of the criterion described by Randa [115] is questionable because it neglects the uncertainties associated with the measured values. In any case, principle (P1) from §3 rules out the practice of excluding measured values from the calculation of the KCRV based on statistical criteria alone.
The DerSimonian-Laird procedure, applied to all eight measurement results listed in Table 8, estimates the KCRV as 0.8192, with associated standard uncertainty 0.0022. A 95 % coverage interval for the true value of the measurand ranges from 0.8147 to 0.8235 (where the standard uncertainty and the coverage interval were obtained by application of the parametric bootstrap).
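The DerSimonian-Laird procedure is compact enough to sketch in full. Since the values of Table 8 are not reproduced here, the illustration uses hypothetical data; note also that this sketch evaluates the standard uncertainty with the conventional closed-form approximation rather than the parametric bootstrap used above:

```python
import math

def dersimonian_laird(x, u):
    """DerSimonian-Laird consensus estimate: method-of-moments estimate of
    the dark uncertainty tau, then a weighted mean with weights
    1 / (u_i^2 + tau^2)."""
    w = [1.0 / ui**2 for ui in u]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    Q = sum(wi * (xi - xbar)**2 for wi, xi in zip(w, x))
    tau2 = max(0.0, (Q - (len(x) - 1)) / (sw - sum(wi**2 for wi in w) / sw))
    wstar = [1.0 / (ui**2 + tau2) for ui in u]
    mu = sum(wi * xi for wi, xi in zip(wstar, x)) / sum(wstar)
    return mu, math.sqrt(1.0 / sum(wstar)), math.sqrt(tau2)

# Hypothetical, deliberately discrepant measurement results
mu, u_mu, tau = dersimonian_laird([0.80, 0.90, 0.82], [0.01, 0.01, 0.01])
```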

Table 9 lists the results of data reductions from the final report of the key comparison, and those produced by the NICOB, and Figure 8 depicts unilateral degrees of equivalence. The close agreement between the unilateral degrees of equivalence (computed as defined in the MRA) produced by the DerSimonian-Laird and Bayesian procedures is particularly striking. The Linear Pool produces considerably larger uncertainty evaluations than the other two procedures. Still, none of the three procedures suggests a significant discrepancy for the measurement results from NRC.

[Table 9 about here.]
[Figure 8 about here.]
8. Conclusions

The reduction of measurement data acquired in interlaboratory studies, including key comparisons and inter-method comparisons, and in meta-analysis should be based on statistical models that articulate explicitly the relationship between the measurement results and the measurand. Specifying the relationship provides transparency to the underlying assumptions that impart meaning to the consensus value and validate the evaluations of associated measurement uncertainty. Transparency, in turn, promotes and facilitates public, critical examination of the assumptions and the assessment of whether they seem adequate for the data in hand, and are likely to enable the production of results that are fit for purpose.
The selection of measurement results to include in any such reduction, and of measurement results to set aside, should rest firmly on substantive reasons, not on purely statistical criteria. Such selection is primarily the responsibility of the scientists involved in the study or performing the meta-analysis, mindful of the use intended for the conclusions. Statistical indicators and tests may help identify those results that deserve special attention and scrutiny, yet should serve only in an advisory, not decisional capacity. In particular, the mere observation that a measured value appears to be an “outlier”, when compared to the bulk of the others, is an insufficient reason to exclude it; there are documented instances in which such discrepant values have turned out to be the ones closest to the true values sought.
Interlaboratory studies and meta-analyses not only provide independent measurements that may be combined into a consensus value, but also enable top-down evaluations of measurement uncertainty. In many cases, these evaluations reveal that the measured values are considerably more dispersed than what might be expected given their individual, stated uncertainties.
This “excess variance”, or heterogeneity of the results, suggests the presence of as yet unrecognized sources of uncertainty that together contribute a measure of dark uncertainty. Such a revelation should serve as a driver for discovery and as motivation for possible improvement of the quality of measurements made by the participating laboratories. But even before the provenance and causes of dark uncertainty have been elucidated, it behooves the community to recognize it rather than ignore it at the risk of providing misleadingly optimistic uncertainty evaluations.
In the case of interlaboratory studies in measurement science, where extensive advance planning is usually done and agreements are reached about the conduct of the studies, the participating laboratories in fact tacitly agree to all the duties and obligations of learning together, and of demonstrating not only individual but collective measurement capabilities. In consequence, they remain together for better or for worse, and together should strive to improve the quality of their measurements, in particular where there are clear inconsistencies in their measurement results.
Two fundamentally different approaches have been identified to quantify dark uncertainty, and to use the result in the computation of consensus values and their associated uncertainties. Interestingly, the two approaches seem to be more rooted in the traditions of different scientific communities than in the consideration of their intrinsic advantages or disadvantages.
The example involving the Newtonian constant of gravitation affords a comparison between the results of the two approaches: common multiplicative inflation of study-specific, stated uncertainties, as applied by the CODATA Task Group, on the one hand; and several versions of the model involving additive laboratory effects, on the other. In this case at least, all produced results that are statistically indistinguishable once the associated uncertainties are taken into account.
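The two approaches can be contrasted in a few lines. One common form of the multiplicative adjustment is the Birge-ratio rescaling [10, 12], sketched below with hypothetical, deliberately discrepant data; the additive alternative instead replaces each variance u_i² by u_i² + τ², as in the DerSimonian-Laird procedure:

```python
import math

def birge_inflation(x, u):
    """Multiplicative adjustment: multiply every stated uncertainty by the
    Birge ratio sqrt(Q / (n - 1)) when it exceeds 1, where Q is Cochran's
    statistic, so that the inflated uncertainties render the data consistent."""
    w = [1.0 / ui**2 for ui in u]
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    Q = sum(wi * (xi - xbar)**2 for wi, xi in zip(w, x))
    rB = math.sqrt(max(1.0, Q / (len(x) - 1)))
    return [rB * ui for ui in u]

# Hypothetical, deliberately discrepant data
inflated = birge_inflation([0.0, 1.0], [0.1, 0.1])
```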


The data reduction procedures that we have described in detail are not meant to be exchangeable. Certainly, the decision to rely on one as opposed to another should not be based on the extent to which it produces results that are aligned with some expedient reason or subjective desire. In all cases, explicit reasons should drive the choice of model and method of fitting, including considerations of goodness-of-fit and assessment of the sensitivity of the conclusions to the validating assumptions. The availability of the NIST Consensus Builder should not only streamline data reductions and facilitate the evaluation of the impact that individual measurement results have upon the final results, but also expedite sensitivity analysis.
Acknowledgments

The authors are greatly indebted to their NIST colleagues David Duewer, Adam Pintar, Andrew Rukhin, and Jolene Splett, who generously spent much time and effort reviewing and preparing a large and most helpful collection of suggestions for improving a draft of this contribution. Two anonymous reviewers for Metrologia generously provided most helpful suggestions that led to considerable improvements of the original submission. The authors are also grateful to the hosts of MATHMET 2016, Markus Bär and Clemens Elster, of the Physikalisch-Technische Bundesanstalt, in Berlin, Germany, for welcoming them to the workshop.
5 1535 References
6
7 1536 [1] A. Alink, M.J.T. Milton, F. Guenther, E.W.B. de Leer, H.J. Heine, A. Marschal, G. S. Heo, C. Takahashi, W. L. Zhen,
1537 Y. Kustikov, and E. Deak. Final report of Key Comparison CCQM-K1. Technical report, Nederlands Meetinstituut,
8
1538 Van Swinden Laboratory, Delft, The Netherlands, January 1999. URL kcdb.bipm.org/appendixB/
9
appbresults/ccqm-k1.b/ccqm-k1_final_report.pdf. Corrected version, September 2001.

cri
1539
10
11 1540 [2] M. Bacharach. Normal Bayesian dialogues. Journal of the American Statistical Association, 74(368):837–846,
1541 December 1979. doi: 10.1080/01621459.1979.10481039.
12
13 1542 [3] R. D. Baker and D. Jackson. Meta-analysis inside and outside particle physics: two traditions that should
14 1543 converge? Research Synthesis Methods, 4(2):109–124, 2013. doi: 10.1002/jrsm.1065.
15 1544 [4] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to
16 1545 multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57:289–300, 1995. doi:

us
17 1546 10.2307/2346101.
18 1547 [5] M. Benková, S. Makovnik, B. Mickan, R. Arias, K. Chahine, T. Funaki, C. Li, H. M. Choi, D. Seredyuk, C.-
19 1548 M. Su, C. Windenberg, and J. Wright. CIPM key comparison CCM.FF-K6.2011: Comparison of the primary
20 1549 (national) standards of low-pressure gas flow — final report. Technical report, Consultative Committee for
21 1550 Mass and Related Quantities, Bureau International des Poids et Mesures, Sèvres, France, February 2014. URL
22 1551 kcdb.bipm.org/appendixB/appbresults/ccm.ff-k6/ccm.ff-k6_final_report.pdf.
23
24
25
26
27
28
1552
1553

1554
1555

1556
1557
an
[6] J. M. Bernardo. Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society,
Series B (Methodological), 41:113–128, 1979. doi: 10.2307/2985028.
[7] P. J. Bickel and K. A. Doksum. Mathematical Statistics — Basic Ideas and Selected Topics, volume I. Chapman and
Hall/CRC, San Francisco, California, 2nd edition, 2015.
[8] B. J. Biggerstaff and D. Jackson. The exact distribution of Cochran’s heterogeneity statistic in one-way random
effects meta-analysis. Statistics in Medicine, 27:6093–6110, 2008. doi: 10.1002/sim.3428.
dM
29 1558 [9] B. J. Biggerstaff and R. L. Tweedie. Incorporating variability in estimates of heterogeneity in the random effects
30 1559 model in meta-analysis. Statistics in Medicine, 16:753–768, 1997. doi: 10.1002/(SICI)1097-0258(19970415)16:
31 1560 7<753::AID-SIM494>3.0.CO;2-G.
32
1561 [10] R. T. Birge. The calculation of errors by the method of least squares. Physical Review, 40:207–227, April 1932.
33 1562 doi: 10.1103/PhysRev.40.207.
34
1563 [11] J. M. Bland and D. G. Altman. Statistical methods for assessing agreement between two methods of clinical
35
1564 measurement. Lancet, 327:307–310, 1986. doi: 10.1016/S0140-6736(86)90837-8.
36
37 1565 [12] O. Bodnar and C. Elster. On the adjustment of inconsistent data using the birge ratio. Metrologia, 51(5):516–521,
38 1566 2014. doi: 10.1088/0026-1394/51/5/516.
39 1567 [13] O. Bodnar, A. Link, B. Arendacká, A. Possolo, and C. Elster. Bayesian estimation in random effects meta-analysis
using a non-informative prior. Statistics in Medicine, 36:378–399, 2016. doi: 10.1002/sim.7156.
pte

40 1568

41 1569 [14] O. Bodnar, A. Link, and C. Elster. Objective Bayesian inference for a generalized marginal random effects model.
42 1570 Bayesian Analysis, 11(1):25–45, March 2016. doi: 10.1214/14-BA933.
43 1571 [15] M. Borenstein, L. V. Hedges, J. P.T. Higgins, and H. R. Rothstein. A basic introduction to fixed-effect and random-
44 1572 effects models for meta-analysis. Research Synthesis Methods, 1:97–111, 2010. doi: 10.1002/jrsm.12.
45 1573 [16] J. Bowden and C. Jackson. Weighing evidence “steampunk” style via the Meta-Analyser. The American Statistician,
46 1574 70(4):385–394, 2016. doi: 10.1080/00031305.2016.1165735.
47 1575 [17] H. Burgess and B. Spangler. Consensus building. In G. Burgess and H. Burgess, editors, Beyond Intractability.
48
ce

1576 Conflict Research Consortium, University of Colorado, Boulder, Colorado, USA, September 2003. URL www.
49 1577 beyondintractability.org/essay/consensus-building.
50
1578 [18] K.P. Burnham and D.R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic
51 1579 Approach. Springer-Verlag, New York, NY, 2nd edition, 2002.
52
1580 [19] J. B. Carlin. Letters to the Editor: “tutorial in biostatistics. meta-analysis: formulating, evaluating, combining,
53
1581 and reporting” by S.-T. Norman (1999). Statistics in Medicine, 19:753–761, 2000. doi: 10.1002/(SICI)
Ac

54
1582 1097-0258(20000315)19:5<753::AID-SIM427>3.0.CO;2-F.
55
56 1583 [20] G. Casella and R. L. Berger. Statistical Inference. Duxbury, Pacific Grove, California, 2nd edition, 2002.
57 1584 [21] H. Cavendish. Experiments to determine the density of the earth. by Henry Cavendish, Esq. F. R. S. and A. S.
58 1585 Philosophical Transactions of the Royal Society of London, 88:469–526, 1798. doi: 10.1098/rstl.1798.0022.
59 1586 [22] R. T. Clemen and R. L. Winkler. Aggregating probability distributions. In W. Edwards, R. F. Miles Jr., and
60 1587 D. von Winterfeldt, editors, Advances in Decision Analysis: From Foundations to Applications, chapter 9, pages
1588 154–176. Cambridge University Press, Cambridge, UK, 2007. ISBN 978-0-521-68230-5.
Page 43 of 65 AUTHOR SUBMITTED MANUSCRIPT - MET-100872.R1

1
2
3 REFERENCES 43
4

pt
5 1589 [23] M. Clyde. Model averaging. In S. J. Press, editor, Subjective and Objective Bayesian Statistics: Principles, Models,
6 1590 and Applications, chapter 13, pages 320–335. John Wiley & Sons, Hoboken, NJ, 2nd edition, 2003.
7 1591 [24] W. G. Cochran. Problems arising in the analysis of a series of similar experiments. Supplement to the Journal of
8 1592 the Royal Statistical Society, 4(1):102–118, 1937. doi: 10.2307/2984123.
9 1593 [25] W. G. Cochran. The combination of estimates from different experiments. Biometrics, 10(1):101–129, March

cri
10 1594 1954.
11 1595 [26] Comité International des Poids et Mesures (CIPM). Mutual recognition of national measurement standards and
12 1596 of calibration and measurement certificates issued by national metrology institutes. Bureau International des
13 1597 Poids et Mesures (BIPM), Pavillon de Breteuil, Sèvres, France, October 14th 1999. URL www.bipm.org/en/
14 1598 cipm-mra/. Technical Supplement revised in October 2003.
15 1599 [27] Consultative Committee for Amount of Substance. CCQM guidance note: Estimation of a consensus KCRV and
16 1600 associated degrees of equivalence. Technical report, Bureau International des Poids et Mesures (BIPM), Sèvres,

us
17 1601 France, April 12th 2013. URL www.bipm.org/cc/CCQM/Allowed/19/CCQM13-22\_Consensus\_KCRV\
18 1602 _v10.pdf. Version 10.
19 1603 [28] H. Cooper, L. V. Hedges, and J. C. Valentine, editors. The Handbook of Research Synthesis and Meta-Analysis.
20 1604 Russell Sage Foundation Publications, New York, NY, 2nd edition, 2009.
21
1605 [29] M. G. Cox. The evaluation of key comparison data. Metrologia, 39:589–595, 2002. doi: 10.1088/0026-1394/
22 39/6/10.
23
24
25
26
27
28
1606

1607
1608

1609
1610

1611
187–200, 2007. doi: 10.1088/0026-1394/44/3/005.
an
[30] M. G. Cox. The evaluation of key comparison data: determining the largest consistent subset. Metrologia, 44:

[31] M. Crowder. Interlaboratory comparisons: Round robins with random effects. Journal of the Royal Statistical
Society, Series C (Applied Statistics), 41:409–425, 1992. doi: 10.2307/2347571.
[32] M. H. DeGroot and M. J. Schervish. Probability and Statistics. Addison-Wesley, 4th edition, 2011.
dM
29 1612 [33] R. DerSimonian and N. Laird. Meta-analysis in clinical trials. Controlled Clinical Trials, 7(3):177–188, September
30 1613 1986. doi: 10.1016/0197-2456(86)90046-2.
31 1614 [34] M. Désenfant and M. Priel. Road map for measurement uncertainty evaluation. Measurement, 39:841–848,
32 1615 2006. doi: 10.1016/j.measurement.2006.04.008.
33 1616 [35] S. Dias, A. J. Sutton, A. E. Ades, and N. J. Welton. Evidence synthesis for decision making 2. Medical Decision
34 1617 Making, 33(5):607–617, 2013. doi: 10.1177/0272989X12458724.
35
1618 [36] S. Dias, N. J. Welton, A. J. Sutton, and A. E. Ades. Evidence synthesis for decision making 1. Medical Decision
36 1619 Making, 33(5):597–606, 2013. doi: 10.1177/0272989X13487604.
37
1620 [37] M. Down, F. Czubak, G. Gruska, S. Stahley, and D. Benham. Measurement Systems Analysis — Reference
38
1621 Manual. Measurement Systems Analysis Work Group: Chrysler Group LLC, Ford Motor Company, General Motors
39
1622 Corporation, Southfield, MI, Fourth edition, June 2010. Automotive Industry Action Group (www.aiag.org).
pte

40
41 1623 [38] D. L. Duewer. A robust approach for the determination of CCQM key comparison reference values and
42 1624 uncertainties. Technical report, Consultative Committee for Amount of Substance: Metrology in Chemistry
1625 (CCQM), International Bureau of Weights and Measures (BIPM), Sèvres, France, 2004. URL www.bipm.info/
43
1626 cc/CCQM/Allowed/10/CCQM04-15.pdf. 9th Annual Meeting, Working Document CCQM/04-15.
44
45 1627 [39] D. L. Duewer, M. C. Kline, K. E. Sharpless, J. B. Thomas, K. T. Gary, and A. L. Sowell. Micronutrients measurement
46 1628 quality assurance program: Helping participants use interlaboratory comparison exercise results to improve their
1629 long-term measurement performance. Analytical Chemistry, 71(9):1870–1878, 1999. doi: 10.1021/ac981074k.
47
48 1630 [40] D. L. Duewer, K. W. Pratt, C. Cherdchu, N. Tangpaisarnkul, A. Hioki, M. Ohata, P. Spitzer, M. Máriássy, and
ce

49 1631 L. Vyskočil. “Degrees of equivalence” for chemical measurement capabilities: primary pH. Accreditation and
50 1632 Quality Assurance, 19:329–342, 2014. doi: 10.1007/s00769-014-1076-1.
51 1633 [41] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, London, UK, 1993.
52 1634 [42] C. Elster and B. Toman. Analysis of key comparisons: estimating laboratories’ biases by a fixed effects model
53 1635 using Bayesian model averaging. Metrologia, 47:113–119, 2010.
Ac

54 1636 [43] H. W. Fairbairn et al. A cooperative investigation of precision and accuracy in chemical, spectrochemical and
55 1637 modal analysis of silicate rocks. Geological Survey Bulletin 980, U.S. Geological Survey, 1951.
56 1638 [44] D. Enko, G. Kriegshäuser, R. Stolba, E. Worf, and G. Halwachs-Baumann. Method evaluation study of a new
57 1639 generation of vitamin D assays. Biochemia Medica, 25(2):203–212, 2015. doi: 10.11613/BM.2015.020.
58
1640 [45] EUROLAB. Interlaboratory comparison : the views of laboratories. http://www.eurolab.org/cookbooks.
59 1641 aspx. Cookbook No. 17.
60
AUTHOR SUBMITTED MANUSCRIPT - MET-100872.R1 Page 44 of 65

1
2
3 REFERENCES 44
4

pt
5 1642 [46] EUROLAB. Guide to the evaluation of measurement uncertainty for quantitative test results. Technical Report
6 1643 1/2006, European Federation of National Associations of Measurement, Testing and Analytical Laboratories
7 1644 (EUROLAB), Paris, France, August 2006. URL www.eurolab.org.
8 1645 [47] EUROLAB. Measurement uncertainty revisited: Alternative approaches to uncertainty evaluation. Technical
9 1646 Report 1/2007, European Federation of National Associations of Measurement, Testing and Analytical

cri
10 1647 Laboratories (EUROLAB), Paris, France, March 2007. URL www.eurolab.org.
11 1648 [48] H. W. Fairbairn. Preparation and distribution of the samples. In H. W. Fairbairn at al., editor, A Cooperative
12 1649 Investigation of Precision and Accuracy in Chemical, Spectrochemical and Modal Analysis of Silicate Rocks,
13 1650 Geological Survey Bulletin 980, pages 1–6. U.S. Geological Survey, 1951.
14 1651 [49] C.-J. L. Farrell, S. Martin, B. McWhinney, I. Straub, P. Williams, and M. Herrmann. State-of-the-art vitamin
15 1652 D assays: A comparison of automated immunoassays with liquid chromatography-tandem mass spectrometry
16 1653 methods. Clinical Chemistry, 58(3):531–542, 2012. doi: 10.1373/clinchem.2011.172155.

us
17 1654 [50] T. Fearn and M. Thompson. A new test for ‘sufficient homogeneity’. Analyst, 126:1414–1417, 2001. doi:
18 1655 10.1039/b103812p.
19 1656 [51] A. B. Forbes. A hierarchical model for the analysis of inter-laboratory comparison data. Metrologia, 53(6):
20 1657 1295–1305, 2016. doi: 10.1088/0026-1394/53/6/1295.
21
1658 [52] A. B. Forbes and J. A. Sousa. The GUM, Bayesian inference and the observation and measurement equations.
22 Measurement, 44(8):1422–1435, 2011. doi: 10.1016/j.measurement.2011.05.007.
23
24
25
26
27
28
1659

1660
1661

1662
1663
1664
1665
Springer, Switzerland, 5th edition, 2015.
an
[53] L. M. Friedman, C. D. Furberg, D. DeMets, D. M. Reboussin, and C. B. Granger. Fundamentals of Clinical Trials.

[54] D. P. Gaver, D. Draper, P. K. Goel, J. B. Greenhouse, L. V. Hedges, C. N. Morris, and C. Waternaux. Combining
information: Statistical Issues and Opportunities for Research. National Academy Press, Washington, DC, 1992.
Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences, Commission on Physical
Sciences, Mathematics and Applications, National Research Council.
dM
29
30 1666 [55] A. Gelman. Prior distributions for variance parameters in hierarchical models (comment on article by Browne
31 1667 and Draper). Bayesian Analysis, 1(3):515–533, 2006. doi: doi:10.1214/06-BA117A.
32 1668 [56] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. Chapman
33 1669 & Hall / CRC, Boca Raton, FL, 3rd edition, 2013.
34 1670 [57] J. Geweke. Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In J. M.
35 1671 Bernado, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 4: Proceedings of the Fourth
36 1672 Valencia International Meeting. Clarendon Press, Oxford, UK, 1998.
37 1673 [58] G. V. Glass. Primary, secondary, and meta-analysis of research. Educational Researcher, 5(10):3–8, November
38 1674 1976. doi: 10.2307/1174772.
39 1675 [59] C. A. Gonzales and R. L. Watters. Standard Reference Material 3168a, Zinc (Zn) Standard Solution. Office of
pte

40 1676 Reference Materials, National Institute of Standards and Technology, Department of Commerce, Gaithersburg,
41 1677 Maryland, 2013. URL www.nist.gov/srm/.
42 1678 [60] C. A. Gonzales and R. L. Watters, Jr. Standard reference material 2685c: Bituminous coal. National Institute of
43 1679 Standards and Technology, Gaithersburg, MD, October 2013. URL www.nist.gov/srm/. Certificate of Analysis.
44
1680 [61] R. Grübel. The length of the Shorth. The Annals of Statistics, 16(2):619–628, 1988. doi: 10.2307/2241744.
45
46 1681 [62] J. Hartung, G. Knapp, and B. K. Sinha. Statistical Meta-Analysis with Applications. John Wiley & Sons, Hoboken,
1682 NJ, 2008. ISBN 978-0-470-29089-7.
47
48 1683 [63] L. V. Hedges and I. Olkin. Statistical Methods for Meta-Analysis. Academic Press, San Diego, CA, 1985.
ce

49 1684 [64] J. P. T. Higgins and S. Green. Cochrane handbook for systematic reviews of interventions. The Cochrane
50 1685 Collaboration, March 2011. URL www.cochrane-handbook.org. Version 5.1.0.
51 1686 [65] J. P. T. Higgins and S. G. Thompson. Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21:
52 1687 1539–1558, 2002. doi: 10.1002/sim.1186.
53 1688 [66] J. P. T. Higgins, S. G. Thompson, and D. J. Spiegelhalter. A re-evaluation of random-effects meta-analysis. Journal
Ac

54 1689 of the Royal Statistical Society, Series A (Statistics in Society), 172(1):137–159, January 2009.
55 1690 [67] D. C. Hoaglin. Misunderstandings about Q and ‘Cochran’s Q test’ in meta-analysis. Statistics in Medicine, 35:
56 1691 485–495, 2016. doi: 10.1002/sim.6632.
57
1692 [68] J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky. Bayesian model averaging: A tutorial. Statistical
58
1693 Science, 14(4):382–417, 1999.
59
60 1694 [69] E. Hund, D. L. Massart, and J. Smeyers-Verbeke. Inter-laboratory studies in analytical chemistry. Analytica
1695 Chimica Acta, 423:145–165, 2000. doi: 10.1016/S0003-2670(00)01115-6.
Page 45 of 65 AUTHOR SUBMITTED MANUSCRIPT - MET-100872.R1

1
2
3 REFERENCES 45
4

pt
5 1696 [70] ISO. Accuracy (trueness and precision) of measurement methods and results — Part 1: General principles and
6 1697 definitions. International Organization for Standardization (ISO), Genève, Switzerland, December 1994. ISO
7 1698 5725-1:1994(E).
8 1699 [71] ISO. Statistical methods for use in proficiency testing by interlaboratory comparison. International Organization
9 1700 for Standardization (ISO), Geneva, Switzerland, Second edition, August 2015. ISO 13528:2015(E).

cri
10 1701 [72] H. K. Iyer, C. M. Wang, and D. F. Vecchia. Consistency tests for key comparison data. Metrologia, 41:223–230,
11 1702 2004. doi: 10.1088/0026-1394/41/4/001.
12 1703 [73] H. K. Iyer, C. M. J. Wang, and T. Mathew. Models and confidence intervals for true values in interlaboratory
13 1704 trials. Journal of the American Statistical Association, 99(468):1060–1071, December 2004. doi: 10.1198/
14 1705 016214504000001682.
15 1706 [74] N. P. Jewell. Statistics for Epidemiology. Chapman & Hall/CRC, Boca Raton, FL, 2004.
16
[75] Joint Committee for Guides in Metrology. Evaluation of measurement data — Guide to the expression of

us
1707
17
1708 uncertainty in measurement. International Bureau of Weights and Measures (BIPM), Sèvres, France, 2008.
18 1709 URL www.bipm.org/en/publications/guides/gum.html. BIPM, IEC, IFCC, ILAC, ISO, IUPAC, IUPAP and
19 1710 OIML, JCGM 100:2008, GUM 1995 with minor corrections.
20
1711 [76] Joint Committee for Guides in Metrology. Evaluation of measurement data — Supplement 1 to the “Guide
21
1712 to the expression of uncertainty in measurement” — Propagation of distributions using a Monte Carlo method.
22 International Bureau of Weights and Measures (BIPM), Sèvres, France, 2008. URL www.bipm.org/en/
23
24
25
26
27
28
1713
1714

1715
1716
1717
1718

1719
an
publications/guides/gum.html. BIPM, IEC, IFCC, ILAC, ISO, IUPAC, IUPAP and OIML, JCGM 101:2008.
[77] Joint Committee for Guides in Metrology. International vocabulary of metrology — Basic and general concepts and
associated terms (VIM). International Bureau of Weights and Measures (BIPM), Sèvres, France, 3rd edition, 2012.
URL www.bipm.org/en/publications/guides/vim.html. BIPM, IEC, IFCC, ILAC, ISO, IUPAC, IUPAP and
OIML, JCGM 200:2012 (2008 version with minor corrections).
[78] H. E. Jones and D. J. Spiegelhalter. The identification of “unusual” health-care providers from a hierarchical
dM
29
1720 model. The American Statistician, 65(3):154–163, 2011. doi: 10.1198/tast.2011.10190.
30
31 1721 [79] R. Judaschke. CCEM Key comparison CCEM.RF-K25.W. RF power from 33 ghz to 50 ghz in waveguide. final
report of the pilot laboratory. Metrologia, 52(1A):01001, 2015. doi: 10.1088/0026-1394/52/1A/01001.
[80] R. E. Kass and L. Wasserman. The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435):1343–1370, September 1996. doi: 10.1080/01621459.1996.10477003.
[81] G. Knapp and J. Hartung. Improved tests for a random effects meta-regression with a single covariate. Statistics in Medicine, 22:2693–2710, 2003. doi: 10.1002/sim.1482.
[82] A. Koepke, T. Lafarge, B. Toman, and A. Possolo. NIST Consensus Builder — User's Manual. National Institute of Standards and Technology, Gaithersburg, MD, 2017. URL consensus.nist.gov.
[83] A. A. Koepke and A. Possolo. Bayesian approach to adaptive robustness for inter-laboratory studies and for meta-analysis. International Society for Bayesian Analysis (ISBA) 2016 World Meeting, Sardinia, Italy, June 2016. Conference poster presentation.
[84] H. H. Ku, editor. Precision Measurement and Calibration — Selected NBS Papers on Statistical Concepts and Procedures. NBS Special Publication 300 — Volume 1. National Bureau of Standards, Washington, DC, February 1969. URL nvlpubs.nist.gov/nistpubs/Legacy/SP/nbsspecialpublication300v1.pdf.
[85] P. C. Lambert, A. J. Sutton, P. R. Burton, K. R. Abrams, and D. R. Jones. How vague is vague? A simulation study of the impact of the use of vague prior distributions in MCMC using WinBUGS. Statistics in Medicine, 24:2401–2428, 2005. doi: 10.1002/sim.2112.
[86] G. E. F. Lundell. The chemical analysis of things as they are. Industrial & Engineering Chemistry — Analytical Edition, 5(4):221–225, 1933. doi: 10.1021/ac50084a001.
[87] T. A. MacKenzie, G. L. Grunkemeier, G. K. Grunwald, A. J. O'Malley, C. Bohn, Y. X. Wu, and D. J. Malenka. A primer on using shrinkage to compare in-hospital mortality between centers. The Annals of Thoracic Surgery, 99:757–761, March 2015. doi: 10.1016/j.athoracsur.2014.11.039.
[88] D. MacMahon, A. Pearce, and P. Harris. Convergence of techniques for the evaluation of discrepant data. Applied Radiation and Isotopes, 60:275–281, 2004. doi: 10.1016/j.apradiso.2003.11.028.
[89] J. Mandel. The Statistical Analysis of Experimental Data. Interscience Publishers (John Wiley & Sons), New York, NY, 1964.
[90] J. Mandel. Repeatability and reproducibility. Journal of Quality Technology, 4(2):74–85, April 1972.
[91] J. Mandel. The validation of measurement through interlaboratory studies. Chemometrics and Intelligent Laboratory Systems, 11(2):109–119, 1991. doi: 10.1016/0169-7439(91)80058-X.
[92] J. Mandel and T. W. Lashof. The interlaboratory evaluation of testing methods. ASTM Bulletin, 239:53–61, July 1959. Reprinted in Ku [84, Page 170-53].
[93] J. Mandel and R. Paule. Interlaboratory evaluation of a material with unequal numbers of replicates. Analytical Chemistry, 42(11):1194–1197, September 1970. doi: 10.1021/ac60293a019.
[94] J. Mandel and R. Paule. Correction — interlaboratory evaluation of a material with unequal numbers of replicates. Analytical Chemistry, 43(10):1287–1287, 1971. doi: 10.1021/ac60304a001.
[95] I. Mann and B. Brookman, editors. Selection, Use and Interpretation of Proficiency Testing (PT) Schemes. EURACHEM, second edition, 2011. URL www.eurachem.org.
[96] H. Marchandise. New reference materials — improvement of methods of measurements. Technical Report EUR 9921 EN, Community Bureau of Reference — Commission of the European Communities, Luxembourg, 1985.
[97] H. Marchandise and E. Colinet. Assessment of methods of assigning certified values to reference materials. Fresenius' Zeitschrift für analytische Chemie, 316:669–672, January 1983. doi: 10.1007/BF00488426.
[98] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall / CRC, London, UK, 2nd edition, 1989.
[99] P. Meier, G. Knapp, U. Tamhane, S. Chaturvedi, and H. S. Gurm. Short term and intermediate term comparison of endarterectomy versus stenting for carotid artery stenosis: systematic review and meta-analysis of randomised controlled clinical trials. BMJ, 340:c467, 2010. doi: 10.1136/bmj.c467.
[100] R. G. Miller. Beyond ANOVA, Basics of Applied Statistics. John Wiley & Sons, New York, NY, 1986.
[101] P. J. Mohr, B. N. Taylor, and D. B. Newell. CODATA recommended values of the fundamental physical constants: 2010. Reviews of Modern Physics, 84(4):1527–1605, October–December 2012. doi: 10.1063/1.4724320.
[102] P. J. Mohr, D. B. Newell, and B. N. Taylor. CODATA recommended values of the fundamental physical constants: 2014. Reviews of Modern Physics, 88:035009, July–September 2016. doi: 10.1103/RevModPhys.88.035009.
[103] A. R. Naylor, A. Bolia, R. J. Abbott, I. F. Pye, J. Smith, N. Lennard, A. J. Loyd, N. J. M. London, and P. R. F. Bell. Randomized study of carotid angioplasty and stenting versus carotid endarterectomy: a stopped trial. Journal of Vascular Surgery, 28(2):326–334, 1998. doi: 10.1016/S0741-5214(98)70182-X.
[104] A. O'Hagan. Eliciting and using expert knowledge in metrology. Metrologia, 51(4):S237–S244, 2014. doi: 10.1088/0026-1394/51/4/S237.
[105] K. A. Olive and Particle Data Group. Review of particle physics. Chinese Physics C, 38(9):090001, 2014. doi: 10.1088/1674-1137/38/9/090001.
[106] R. Paule and J. Mandel. Consensus values and weighting factors. Journal of Research of the National Bureau of Standards, 87:377–385, 1982.
[107] J. C. Pinheiro, C. Liu, and Y. N. Wu. Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t distribution. Journal of Computational and Graphical Statistics, 10(2):249–276, 2001.
[108] N. G. Polson and J. G. Scott. On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4):887–902, 2012. doi: 10.1214/12-BA730.
[109] S. Pommé and J. Keightley. Determination of a reference value and its uncertainty through a power-moderated mean. Metrologia, 52(3):S200–S212, 2015. doi: 10.1088/0026-1394/52/3/S200.
[110] A. Possolo. Copulas for uncertainty analysis. Metrologia, 47:262–271, 2010. doi: 10.1088/0026-1394/47/3/017.
[111] A. Possolo. Five examples of assessment and expression of measurement uncertainty. Applied Stochastic Models in Business and Industry, 29:1–18, January/February 2013. doi: 10.1002/asmb.1947. Discussion and Rejoinder pp. 19–30.
[112] A. Possolo. Simple Guide for Evaluating and Expressing the Uncertainty of NIST Measurement Results. NIST Technical Note 1900. National Institute of Standards and Technology, Gaithersburg, MD, 2015. doi: 10.6028/NIST.TN.1900.
[113] A. Possolo and B. Toman. Assessment of measurement uncertainty via observation equations. Metrologia, 44:464–475, 2007. doi: 10.1088/0026-1394/44/6/005.
[114] A. Possolo and B. Toman. Tutorial for metrologists on the probabilistic and statistical apparatus underlying the GUM and related documents. National Institute of Standards and Technology, Gaithersburg, MD, November 2011. doi: 10.13140/RG.2.1.2256.8482. URL www.itl.nist.gov/div898/possolo/TutorialWEBServer/TutorialMetrologists2011Nov09.xht.
[115] J. Randa. Update to Proposal for KCRV and Degree of Equivalence for GTRF Key Comparisons. National Institute of Standards and Technology, Boulder, CO, February 2005. Consultative Committee for Electricity and Magnetism, Working Group on Radiofrequency Quantities (GT-RF).
[116] I. A. Robinson and S. Schlamminger. The watt or Kibble balance: a technique for implementing the new SI definition of the unit of mass. Metrologia, 53:A46–A74, 2016. doi: 10.1088/0026-1394/53/5/A46.
[117] D. M. Rocke. Robust statistical analysis of interlaboratory studies. Biometrika, 70(2):421–431, 1983. doi: 10.1093/biomet/70.2.421.
[118] A. Rukhin, B. Biggerstaff, and M. Vangel. Restricted maximum likelihood estimation of a common mean and the Mandel-Paule algorithm. Journal of Statistical Planning and Inference, 83:319–330, 2000. doi: 10.1016/S0378-3758(99)00098-1.
[119] A. L. Rukhin. Weighted means statistics in interlaboratory studies. Metrologia, 46(3):323–331, 2009. doi: 10.1088/0026-1394/46/3/021.
[120] A. L. Rukhin. Estimating heterogeneity variance in meta-analysis. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 75(3):451–469, 2013. doi: 10.1111/j.1467-9868.2012.01047.x.
[121] A. L. Rukhin. Restricted likelihood representation and decision-theoretic aspects of meta-analysis. Bernoulli, 20(4):1979–1998, November 2014. doi: 10.3150/13-BEJ547.
[122] A. L. Rukhin and A. Possolo. Laplace random effects models for interlaboratory studies. Computational Statistics and Data Analysis, 55:1815–1827, 2011. doi: 10.1016/j.csda.2010.11.016.
[123] A. L. Rukhin and M. G. Vangel. Estimation of a common mean and weighted means statistics. Journal of the American Statistical Association, 93:303–308, 1998. doi: 10.1080/01621459.1998.10474111.
[124] M. Schantz and S. Wise. CCQM–K25: Determination of PCB congeners in sediment. Metrologia, 41(Technical Supplement):08001, 2004. doi: 10.1088/0026-1394/41/1A/08001.
[125] W. Schlecht. Cooperative investigation of precision and accuracy in chemical analysis of silicate rocks. Analytical Chemistry, 23(11):1568–1571, 1951. doi: 10.1021/ac60059a014.
[126] W. G. Schlecht and R. E. Stevens. Results of chemical analysis of samples of granite and diabase. In H. W. Fairbairn, editor, A Cooperative Investigation of Precision and Accuracy in Chemical, Spectrochemical and Modal Analysis of Silicate Rocks, Geological Survey Bulletin 980, pages 7–24. U.S. Geological Survey, 1951.
[127] S. R. Searle, G. Casella, and C. E. McCulloch. Variance Components. John Wiley & Sons, Hoboken, NJ, 2006. ISBN 0-470-00959-4.
[128] K. E. Sharpless, K. A. Lippa, D. L. Duewer, and A. L. Rukhin. The ABCs of Using Standard Reference Materials in the Analysis of Foods and Dietary Supplements: A Practical Guide. NIST Special Publication 260-181r1. National Institute of Standards and Technology, Gaithersburg, MD, 2015. doi: 10.6028/NIST.SP.260-181r1.
[129] R. J. S. Simpson and K. Pearson. Report on certain enteric fever inoculation statistics. The British Medical Journal, 2(2288):1243–1246, 1904. doi: 10.2307/20282622.
[130] A. G. Steele, K. D. Hill, and R. J. Douglas. Data pooling and key comparison reference values. Metrologia, 39(3):269–277, 2002.
[131] M. Steup. Epistemology. In E. N. Zalta, editor, The Stanford Encyclopedia of Philosophy. The Metaphysics Research Lab, Center for the Study of Language and Information, Stanford University, Stanford, California, spring 2014 edition, 2014. URL plato.stanford.edu/archives/spr2014/entries/epistemology/.
[132] S. M. Stigler. The epic story of maximum likelihood. Statistical Science, 22(4):598–620, 2007. doi: 10.1214/07-STS249.
[133] M. Stone. The opinion pool. The Annals of Mathematical Statistics, 32:1339–1342, December 1961. doi: 10.1214/aoms/1177704873.
[134] S. S.-C. Tai, M. Bedner, and K. W. Phinney. Development of a candidate reference measurement procedure for the determination of 25-hydroxyvitamin D3 and 25-hydroxyvitamin D2 in human serum using isotope-dilution liquid chromatography-tandem mass spectrometry. Analytical Chemistry, 82(5):1942–1948, 2010. doi: 10.1021/ac9026862.
[135] B. N. Taylor and C. E. Kuyatt. Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results. NIST Technical Note 1297. National Institute of Standards and Technology, Gaithersburg, MD, 1994. URL physics.nist.gov/Pubs/guidelines/TN1297/tn1297s.pdf.
[136] M. Thompson and S. L. R. Ellison. Dark uncertainty. Accreditation and Quality Assurance, 16:483–487, October 2011. doi: 10.1007/s00769-011-0803-0.
[137] M. Thompson, S. L. R. Ellison, and R. Wood. The International Harmonized Protocol for the proficiency testing of analytical chemistry laboratories (IUPAC Technical Report). Pure and Applied Chemistry, 78(1):145–196, 2006. doi: 10.1351/pac200678010145.
[138] B. Toman. Bayesian approaches to calculating a reference value in key comparison experiments. Technometrics, 49(1):81–87, February 2007. doi: 10.1198/004017006000000273.
[139] B. Toman and A. Possolo. Laboratory effects models for interlaboratory comparisons. Accreditation and Quality Assurance, 14:553–563, October 2009. doi: 10.1007/s00769-009-0547-2.
[140] B. Toman and A. Possolo. Erratum to: Laboratory effects models for interlaboratory comparisons. Accreditation and Quality Assurance, 15:653–654, 2010. doi: 10.1007/s00769-010-0707-4.
[141] B. Toman, J. Fischer, and C. Elster. Alternative analyses of measurements of the Planck constant. Metrologia, 49:567–571, 2012. doi: 10.1088/0026-1394/49/4/567.
[142] J. C. Ulrich, J. E. S. Sarkis, and M. A. Hortellani. Homogeneity study of candidate reference material in fish matrix. Journal of Physics: Conference Series, 575(1):012040, 2015. doi: 10.1088/1742-6596/575/1/012040. VII Brazilian Congress on Metrology (Metrologia 2013).
[143] A. M. H. van der Veen, H. Chander, P. R. Ziel, E. W. B. de Leer, D. Smeulders, L. Besley, V. Smarçaro da Cunha, Z. Zhou, H. Qiao, H.-J. Heine, J. Tichy, T. L. Esteban, K. Kato, Z. N. Szilágyi, J. S. Kim, J.-C. Woo, H.-G. Bae, A. P. Castorena, F. R. Murillo, V. M. S. Caballero, C. R. Nambo, M. de J. A. Salas, A. Rakowska, F. Dias, L. A. Konopelko, T. A. Popova, V. V. Pankratov, M. A. Kovrizhnih, T. A. Kuzmina, O. V. Efremova, Y. A. Kustikov, S. Musil, and M. J. T. Milton. International comparison CCQM K23b: Natural gas type II. Metrologia, 47(Technical Supplement):08013, 2010. doi: 10.1088/0026-1394/47/1A/08013.
[144] A. M. H. van der Veen, T. Linsinger, and J. Pauwels. Uncertainty calculations in the certification of reference materials. 2. Homogeneity study. Accreditation and Quality Assurance, 6:26–30, January 2001. doi: 10.1007/s007690000238.
[145] M. G. Vangel and A. L. Rukhin. Maximum likelihood analysis for heteroscedastic one-way random effects ANOVA in interlaboratory studies. Biometrics, 55:129–136, March 1999. doi: 10.1111/j.0006-341X.1999.00129.x.
[146] W. Viechtbauer. Bias and efficiency of meta-analytic variance estimators in the random-effects model. Journal of Educational and Behavioral Statistics, 30(3):261–293, 2005. doi: 10.3102/10769986030003261.
[147] W. Viechtbauer. Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3):1–48, 2010. doi: 10.18637/jss.v036.i03.
[148] W. Viechtbauer and M. W.-L. Cheung. Outlier and influence diagnostics for meta-analysis. Research Synthesis Methods, 1:112–125, 2010. doi: 10.1002/jrsm.11.
[149] C. M. Wang and H. K. Iyer. Detection of influential observations in the determination of the weighted-mean KCRV. Metrologia, 42:262–265, 2005. doi: 10.1088/0026-1394/42/4/010.
[150] L. Wasserman. All of Statistics, A Concise Course in Statistical Inference. Springer Science+Business Media, New York, NY, 2004.
[151] L. Werner. Report on the Key Comparison CCPR-K2.c-2003: Spectral Responsivity in the Range of 200 nm to 400 nm. Technical report, Physikalisch-Technische Bundesanstalt, Germany, May 2014.
[152] D. R. White. In pursuit of a fit-for-purpose uncertainty guide. Metrologia, 53:S107–S124, 2016. doi: 10.1088/0026-1394/53/4/S107.
[153] A. Whitehead. Meta-Analysis of Controlled Clinical Trials. John Wiley & Sons, Chichester, England, 2002. ISBN 978-0-471-98370-5.
[154] A. Whitehead and J. Whitehead. A general parametric approach to the meta-analysis of randomized clinical trials. Statistics in Medicine, 10(11):1665–1677, 1991. doi: 10.1002/sim.4780101105.
[155] W. J. Youden. Graphical diagnosis of interlaboratory test results. Industrial Quality Control, 25(11):133–137, 1959. Reprinted in Ku [84, §3.1, Page 133].
[156] N. F. Zhang. Statistical analysis for interlaboratory comparisons with linear trends in multiple loops. Metrologia, 49(3):390–394, 2012. doi: 10.1088/0026-1394/49/3/390.
[157] N. F. Zhang, N. Sedransk, and D. G. Jarrett. Statistical uncertainty analysis of key comparison CCEM-K2. IEEE Transactions on Instrumentation and Measurement, 52(2):491–494, May 2003. doi: 10.1109/TIM.2003.811669.
[158] N. F. Zhang, H. K. Liu, N. Sedransk, and W. E. Strawderman. Statistical analysis of key comparisons with linear trends. Metrologia, 41:231–237, 2004. doi: 10.1088/0026-1394/41/4/002.
FIGURES
[Figure 1 plot: measured values of G / (10⁻¹¹ m³ kg⁻¹ s⁻²), from 6.671 to 6.676, with uncertainty bars, for NIST-82, TR&D-96, LANL-97, UWash-00, BIPM-01, UWup-02, MSL-03, HUST-05, UZur-06, HUST-09, JILA-10, BIPM-14, LENS-14, and UCI-14.]
Figure 1. Measurement results for the Newtonian constant of gravitation G that were used to determine the 2014 CODATA recommended value: the red dots represent the measured values, and the vertical, thick line segments represent the measured value plus or minus one standard uncertainty; the thin, vertical line segments that extend the thick segments indicate the contribution that dark uncertainty adds (in root sum of squares) to the stated uncertainties. The thin, horizontal green line indicates the DerSimonian-Laird estimate of G, and the light-green rectangle surrounding it represents plus or minus 2 times the associated standard uncertainty. The thin, horizontal, dashed blue line, and the light-blue rectangle "behind" the light-green rectangle, represent the 2014 CODATA recommended value and 2 times its associated standard uncertainty. Obviously, the 2014 CODATA recommended value and the DerSimonian-Laird estimate are statistically indistinguishable.
[Figure 2 plot: probability density against G / (10⁻¹¹ m³ kg⁻¹ s⁻²), from 6.670 to 6.676.]
Figure 2. Probability density (thick blue line with multiple peaks) of the consensus value of the Linear Pool estimate of the Newtonian constant of gravitation, based on a sample of size K = 1 × 10⁷ drawn from the equally weighted mixture of Gaussian distributions assigned to the 14 measured values of G, whose probability densities are represented by thin blue lines, re-scaled to reflect the mixing proportions, all equal to 1/14.
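The Linear Pool distribution can be sampled directly: pick one of the 14 laboratories uniformly at random, then draw from that laboratory's Gaussian distribution. The following sketch (plain Python, not the NIST Consensus Builder itself, and ignoring the inter-laboratory correlations) does this with the measured values and standard uncertainties of Table 1; the sample mean and standard deviation should reproduce, to the digits shown, the Linear Pool row of Table 2.

```python
# Monte Carlo sample from the equally weighted Gaussian mixture (Linear Pool)
# for the 14 measured values of G in Table 1; an illustration only,
# not the NIST Consensus Builder code.
import random

x = [6.67248, 6.6729, 6.67398, 6.674255, 6.67559, 6.67422, 6.67387,
     6.67222, 6.67425, 6.67349, 6.67234, 6.67554, 6.67191, 6.67435]
u = [0.00043, 0.0005, 0.00070, 0.000092, 0.00027, 0.00098, 0.00027,
     0.00087, 0.00012, 0.00018, 0.00014, 0.00016, 0.00099, 0.00013]
pairs = list(zip(x, u))

random.seed(1)
K = 10**6  # the paper used K = 1e7; 1e6 already fixes the leading digits
sample = [random.gauss(*random.choice(pairs)) for _ in range(K)]

mean = sum(sample) / K
sd = (sum((s - mean)**2 for s in sample) / (K - 1)) ** 0.5
print(round(mean, 5), round(sd, 5))  # ≈ 6.67367 and 0.00124, as in Table 2
```

The multimodality of the pooled density comes directly from the mixture form: each tight, mutually inconsistent Gaussian contributes its own peak.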
[Figure 3 plots: unilateral degrees of equivalence D (upper panel) and D* (lower panel), in 10⁻¹¹ m³ kg⁻¹ s⁻², for NIST-82 through UCI-14, with DerSimonian-Laird, Bayes, and Linear Pool results side by side for each laboratory.]
Figure 3. The red (DerSimonian-Laird), green (Bayes), and blue (Linear Pool) dots, and line segments of matching colors, depict the unilateral DoE corresponding to three different methods of data reduction for the measurements of the Newtonian constant of gravitation, even though this comparison is a meta-analysis involving published studies, not a key comparison: the MRA version in the upper panel, and the Leave-One-Out version in the lower panel. Any differences between the three {D_j} corresponding to the same laboratory (differences in the heights of the corresponding dots) are attributable to differences in the consensus values produced by the three data reduction procedures. The {D*_j} (lower panel) are generally larger in absolute value than the corresponding {D_j} (upper panel), and similarly for the expanded uncertainties.
[Figure 4 plots: half-life T½, in days, against study, for ¹³⁷Cs (upper panel, WT55 through Sc04) and ⁹⁰Sr (lower panel, WT55 through Sc04), with percentage weights printed below the points; see caption.]
Figure 4. Each large (red) dot represents the value of T½ measured in one of the studies of the half-lives of ¹³⁷Cs and ⁹⁰Sr. The thick, vertical (blue) line segment represents T½ ± u(T½). The thin, vertical line segments protruding from the ends of the thick segments depict the combined effect of dark uncertainty, estimated as τ̂(¹³⁷Cs) = 52 d and τ̂(⁹⁰Sr) = 105 d, and of the study-specific standard uncertainties u(T½). The thin, horizontal (dark green) line marks the consensus value of the half-life, which is 10 971 d = 30.04 a for ¹³⁷Cs and 10 494 d = 28.73 a for ⁹⁰Sr, as produced by the DerSimonian-Laird procedure. The shaded (light green) band centered on the horizontal line that indicates the consensus value represents a 95 % coverage band for the true value of the half-life, as produced by the DerSimonian-Laird procedure. The expanded uncertainty was evaluated using the parametric bootstrap. The small (blue) numbers in two staggered rows indicate the weights (expressed as percentages) assigned to the measured values to form the weighted average that is the consensus value.
[Figure 5 matrices: study-by-study grids of bilateral degrees of equivalence, for ¹³⁷Cs (WT55 through Sc04, upper panel) and ⁹⁰Sr (WT55 through Sc04, lower panel).]
Figure 5. Graphical representation of the bilateral degrees of equivalence (DoE) that indicate significant differences between pairs of studies, for the half-lives of ¹³⁷Cs and ⁹⁰Sr, even though this comparison is a meta-analysis involving published studies, not a key comparison. The light blue squares indicate B_ij that do not differ significantly from 0, in the sense that the interval B_ij ± U_95%(B_ij) includes 0. The yellow squares with a blue asterisk indicate statistically significant differences.
[Figure 6 forest plot, with the following rows:]

Study              νOR        w    log(OR) [95 % CI]
Naylor-1998         14     0.69 %  −4.27 [−8.81, 0.28]
CAVATAS-2001       499    19.11 %   0.16 [−0.50, 0.81]
Brooks-2001        102     0.38 %   0.04 [−6.13, 6.21]
Brooks-2004         83     0.38 %   0.02 [−6.15, 6.19]
SAPPHIRE-2004/8    330     8.14 %  −0.19 [−1.39, 1.02]
EVA-3S-2006/8      445    15.31 %  −1.03 [−1.82, −0.24]
SPACE-2006        1165    27.05 %  −0.21 [−0.67, 0.24]
BACASS-2007         15     0.61 %   2.10 [−2.73, 6.94]
ICSS-2009         1575    28.33 %  −0.69 [−1.12, −0.27]
RE Model                 100.00 %  −0.41 [−0.79, −0.03]

Figure 6. Forest plot summarizing the data and results for the carotid artery stenosis meta-analysis. For Naylor-1998, for example, the plot shows the number of degrees of freedom (νOR = 14) underlying the standard uncertainty of the corresponding log-odds ratio, the weight that the DerSimonian-Laird procedure assigned to the result (0.69 %), the estimate of the log-odds ratio (−4.27), and an approximate, 95 % coverage interval for the true log-odds ratio. The (red) diamond at the bottom indicates the consensus value (−0.41) and associated uncertainty.
[Figure 7 plot: calibration factor ηCAL, from 0.80 to 0.85, with uncertainty bars, for KRISS, LNE, NIM, NIST, NPL, NRC, PTB, and VNIIFTRI.]
Figure 7. The large (red) dots represent the values of the calibration factor ηCAL of traveling standard SN 216 that was measured by the laboratories participating in key comparison CCEM.RF-K25.W. The vertical (blue) line segments depict the intervals {ηCAL,j ± u(ηCAL,j)}.
[Figure 8 plot: unilateral DoE, from −0.04 to 0.04, for the eight participating laboratories, with DerSimonian-Laird, Bayesian, and Linear Pool results side by side for each laboratory.]
Figure 8. Unilateral degrees of equivalence as defined in the MRA corresponding to the DerSimonian-Laird, Bayesian, and Linear Pool procedures, for the calibration factor ηCAL of traveling standard SN 216 used in key comparison CCEM.RF-K25.W.
TABLES
EXPERIMENT       G          u(G)
             /10⁻¹¹ m³ kg⁻¹ s⁻²
NIST-82      6.672 48    0.000 43
TR&D-96      6.6729      0.0005
LANL-97      6.673 98    0.000 70
UWash-00     6.674 255   0.000 092
BIPM-01      6.675 59    0.000 27
UWup-02      6.674 22    0.000 98
MSL-03       6.673 87    0.000 27
HUST-05      6.672 22    0.000 87
UZur-06      6.674 25    0.000 12
HUST-09      6.673 49    0.000 18
JILA-10      6.672 34    0.000 14
BIPM-14      6.675 54    0.000 16
LENS-14      6.671 91    0.000 99
UCI-14       6.674 35    0.000 13

Table 1. Measurement results (estimates and standard uncertainties) for the Newtonian gravitational constant used for the CODATA 2014 adjustment of the fundamental physical constants [102, Table XXVII].
PROCEDURE                 CONSENSUS   STD. UNC.   95 % COV. INT.
                               /10⁻¹¹ m³ kg⁻¹ s⁻²
DerSimonian-Laird         6.673 80    0.000 28    (6.67320, 6.67440)
Hierarchical Bayesian     6.673 77    0.000 33    (6.67310, 6.67442)
Linear Pool               6.673 67    0.001 24    (6.67118, 6.67579)
Linear Pool (r)           6.673 87    0.001 39    (6.67168, 6.67643)
Maximum Likelihood        6.673 78    0.000 30
Maximum Likelihood (r)    6.673 80    0.000 31
CODATA 2014               6.674 08    0.000 31

Table 2. Results of four consensus building procedures applied to the measurement results for the Newtonian constant of gravitation listed in Table 1, alongside the corresponding results from the 2014 CODATA adjustment of the fundamental physical constants. The standard uncertainty associated with the maximum likelihood estimate was computed using a conventional large-sample approximation [150, Theorem 9.27]. The variants of the Linear Pool and of maximum likelihood marked "(r)", and the CODATA estimate, take the correlations of 0.351 between the estimates from NIST-82 and LANL-97, and of 0.134 between HUST-05 and HUST-09, into account, while the others ignore them [102, Table XXVII].
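The DerSimonian-Laird entries of Table 2 can be checked with a few lines of code: the moment estimator of the dark-uncertainty variance τ² is computed from Cochran's Q statistic, and the consensus value is the inverse-variance weighted mean with the laboratory variances inflated by τ². The sketch below is a plain-Python illustration, not the NIST Consensus Builder itself, and it ignores the two inter-laboratory correlations.

```python
# DerSimonian-Laird consensus for the G data of Table 1
# (illustrative sketch; correlations between NIST-82/LANL-97 and
# HUST-05/HUST-09 are ignored here).
x = [6.67248, 6.6729, 6.67398, 6.674255, 6.67559, 6.67422, 6.67387,
     6.67222, 6.67425, 6.67349, 6.67234, 6.67554, 6.67191, 6.67435]
u = [0.00043, 0.0005, 0.00070, 0.000092, 0.00027, 0.00098, 0.00027,
     0.00087, 0.00012, 0.00018, 0.00014, 0.00016, 0.00099, 0.00013]

def dersimonian_laird(x, u):
    w = [1.0 / ui**2 for ui in u]                       # fixed-effect weights
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw    # fixed-effect mean
    q = sum(wi * (xi - xbar)**2 for wi, xi in zip(w, x))  # Cochran's Q
    tau2 = max(0.0, (q - (len(x) - 1)) /
               (sw - sum(wi**2 for wi in w) / sw))      # moment estimate of tau^2
    wstar = [1.0 / (ui**2 + tau2) for ui in u]          # random-effects weights
    mu = sum(wi * xi for wi, xi in zip(wstar, x)) / sum(wstar)
    se = (1.0 / sum(wstar)) ** 0.5
    return mu, se, tau2 ** 0.5

mu, se, tau = dersimonian_laird(x, u)
print(round(mu, 5), round(se, 5))  # ≈ 6.67380 and 0.00028, as in Table 2
```

The estimated dark uncertainty τ̂ comes out close to 0.001 × 10⁻¹¹ m³ kg⁻¹ s⁻², several times larger than the smallest stated uncertainties, which is what drives the near-equal random-effects weights visible in Figure 1.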
STUDY   T½(¹³⁷Cs)   u(T½(¹³⁷Cs))        STUDY   T½(⁹⁰Sr)   u(T½(⁹⁰Sr))
WT55      9 715        146              WT55    10 120       150
Br55     10 957        146              An58    10 700       580
Fa61     11 103        146              Fl65    10 230       150
Fl62     10 994        256              Fl65    10 410       330
Go63     10 840         18              Ho77    10 636        88
Ri63     10 665        110              La78    10 282        12
Le65     11 220         47              Ra83    10 588        91
Fl65     10 921        183              Ko89    10 665        37
Fl65     11 286        256              Ma94    10 561        14
Ha70     11 191        157              WL96    10 495         4
Em72     11 023         37              Sc04    10 557        11
DP73     11 020.8        4.1
Co73     11 034         29
GS78     10 906         33
Ho80     11 009         11
MT80     10 967.8        4.5
Go92     10 940.8        6.9
Un02     11 018.3        9.5
Sc04     10 970         20

Table 3. Measured values of the half-life of ¹³⁷Cs and ⁹⁰Sr, and associated standard measurement uncertainty, expressed in days, reported by MacMahon et al. [88, Tables 1 and 2]. The labels in the first and fourth columns are succinct abbreviations of the corresponding sources that MacMahon et al. [88] reference.
37 corresponding sources that MacMahon et al. [88] reference.
38
39
pte

40
41
42
43
44
45
46
47
48
ce

49
50
51
52
53
Ac

54
55
56
57
58
59
60
AUTHOR SUBMITTED MANUSCRIPT - MET-100872.R1 Page 60 of 65

1
2
3 TABLES 60
4

pt
                        T½(¹³⁷Cs)   u(T½(¹³⁷Cs))   T½(⁹⁰Sr)   u(T½(⁹⁰Sr))
Convergence             30.06 a     0.03 a         28.89 a    0.04 a
DerSimonian-Laird       30.04 a     0.04 a         28.73 a    0.11 a
Hierarchical Bayesian   29.94 a     0.19 a         28.71 a    0.16 a
Linear Pool             29.94 a     0.93 a         28.68 a    0.77 a
NuDat 2.6               30.08 a     0.09 a         28.90 a    0.03 a

Table 4. Consensus values and associated standard uncertainties for the half-life of ¹³⁷Cs and ⁹⁰Sr, as computed by MacMahon et al. [88] (labeled "Convergence"), by the three main data reduction procedures discussed in §5, and as reported by the National Nuclear Data Center (NuDat 2.6).
STUDY              nE    kE   nS    kS   log(OR)   u(log(OR))    νOR
Naylor-1998         12    0    11    5   −4.2670     2.3209        14
CAVATAS-2001       253   21   251   18    0.1585     0.3342       499
Brooks-2001         51    0    53    0    0.0393     3.1485       102
Brooks-2004         42    0    43    0    0.0211     3.1492        83
SAPPHIRE-2004/8    167    5   167    6   −0.1883     0.6150       330
EVA-3S-2006/8      262    9   265   24   −1.0290     0.4009       445
SPACE-2006         584   36   599   45   −0.2123     0.2316      1165
BACASS-2007         10    1    10    0    2.1059     2.4654        15
ICSS-2009          857   34   853   65   −0.6915     0.2175      1575

Table 5. Results of nine randomized, controlled trials comparing the incidence of strokes among patients suffering from carotid stenosis, and estimates of the corresponding log-odds ratios and associated uncertainties (columns headed "log(OR)" and "u(log(OR))"). For each study, nE denotes the total number of patients that underwent carotid endarterectomy, and kE denotes the number of these who suffered a stroke or died within 30 days following the procedure; nS and kS have similar meaning for carotid stenting. These counts are as reported by Meier et al. [99]. Since several counts are rather small, the relative uncertainties of the corresponding log-odds ratios are large: we report an excessive number of significant digits so that a reader wishing to reproduce our calculations will appreciate the differences resulting from even very large Monte Carlo samples.
                 STROKE / DEATH   NONE
ENDARTERECTOMY          0          12    12
STENTING                5           6    11
                        5          18    23

Table 6. Counts of outcomes of two alternative procedures to treat carotid artery stenosis corresponding to the measurement results reported in Table 5 for Naylor-1998.
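A 2×2 table of this kind determines the log-odds ratio and its large-sample standard uncertainty: with cell counts a = kE, b = nE − kE, c = kS, d = nS − kS, log(OR) = log(ad/(bc)) and u(log(OR)) = (1/a + 1/b + 1/c + 1/d)^½. The sketch below (plain Python) reproduces the rows of Table 5 that have no empty cells; studies with zero counts, such as Naylor-1998, need the special treatment that Meier et al. [99] describe, which is not shown here.

```python
# Log-odds ratio and its standard uncertainty from a 2x2 table of counts,
# illustrated with the CAVATAS-2001 row of Table 5 (no empty cells).
import math

def log_odds_ratio(nE, kE, nS, kS):
    a, b = kE, nE - kE          # endarterectomy: events, non-events
    c, d = kS, nS - kS          # stenting: events, non-events
    lor = math.log((a * d) / (b * c))                 # log-odds ratio
    u = math.sqrt(1/a + 1/b + 1/c + 1/d)              # large-sample std. unc.
    return lor, u

lor, u = log_odds_ratio(253, 21, 251, 18)
print(round(lor, 4), round(u, 4))  # ≈ 0.1585 and 0.334, as in Table 5
```

The same call with the ICSS-2009 counts (857, 34, 853, 65) recovers −0.6915 and 0.2175, the last row of Table 5.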
PROCEDURE               CONSENSUS   STD. UNC.   95 % COV. INT.
DerSimonian-Laird       −0.41       0.21        (−0.83, +0.012)
Hierarchical Bayesian   −0.41       0.24        (−0.88, +0.066)
Linear Pool             −0.46       2.35        (−6.34, +5.01)
Maximum Likelihood      −0.46       0.19        (−0.83, −0.093)

Table 7. Results of four consensus building procedures for the log-odds ratio that compares the performance of carotid endarterectomy and carotid stenting using the data compiled by Meier et al. [99]. The standard uncertainty and coverage interval for DerSimonian-Laird were computed using the parametric statistical bootstrap, and their counterparts for maximum likelihood are based on the conventional large-sample approximation [150, Theorem 9.27].
LAB        ηCAL     u(ηCAL)
KRISS      0.8247   0.0095
LNE        0.8184   0.0112
NIM        0.8196   0.0033
NIST       0.8170   0.0070
NPL        0.8069   0.0072
NRC        0.8355   0.0130
PTB        0.8186   0.0038
VNIIFTRI   0.8236   0.0058

Table 8. Measurement results for the calibration factor ηCAL of traveling standard SN 216 used in key comparison CCEM.RF-K25.W.
PROCEDURE               CONSENSUS   STD. UNC.   95 % COV. INT.
CCEM.RF-K25.W           0.8184      0.0028
DerSimonian-Laird       0.8192      0.0022      (0.8147, 0.8235)
Hierarchical Bayesian   0.8192      0.0022      (0.8192, 0.8239)
Linear Pool             0.8205      0.0112      (0.7993, 0.8471)

Table 9. Estimates of the consensus value computed in key comparison CCEM.RF-K25.W, and by the three consensus building procedures implemented in the NICOB, of the calibration factor ηCAL of traveling standard SN 216. The standard uncertainty and coverage interval for the DerSimonian-Laird procedure were computed using the parametric statistical bootstrap.
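The parametric bootstrap mentioned in the caption can be sketched as follows: synthetic measured values are drawn from Gaussian distributions centered at the fitted consensus value, with standard deviations that combine each laboratory's stated uncertainty and the estimated dark uncertainty, and the consensus procedure is refitted to each synthetic data set. The standard deviation of the refitted consensus values is the bootstrap standard uncertainty. This is a plain-Python illustration using the Table 8 data and a DerSimonian-Laird fit, not the NICOB code itself.

```python
# Parametric-bootstrap standard uncertainty for a DerSimonian-Laird consensus,
# sketched on the eta_CAL data of Table 8 (illustration, not the NICOB itself).
import math
import random

x = [0.8247, 0.8184, 0.8196, 0.8170, 0.8069, 0.8355, 0.8186, 0.8236]
u = [0.0095, 0.0112, 0.0033, 0.0070, 0.0072, 0.0130, 0.0038, 0.0058]

def dl(x, u):
    """DerSimonian-Laird consensus value and moment estimate of tau^2."""
    w = [1.0 / ui**2 for ui in u]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    q = sum(wi * (xi - xbar)**2 for wi, xi in zip(w, x))
    tau2 = max(0.0, (q - (len(x) - 1)) / (sw - sum(wi**2 for wi in w) / sw))
    ws = [1.0 / (ui**2 + tau2) for ui in u]
    return sum(wi * xi for wi, xi in zip(ws, x)) / sum(ws), tau2

mu, tau2 = dl(x, u)            # consensus matches Table 9's 0.8192
random.seed(2)
boot = []
for _ in range(5000):          # resample measured values and refit
    xb = [random.gauss(mu, math.sqrt(ui**2 + tau2)) for ui in u]
    boot.append(dl(xb, u)[0])
m = sum(boot) / len(boot)
se = math.sqrt(sum((b - m)**2 for b in boot) / (len(boot) - 1))
print(round(mu, 4), round(se, 4))  # se comes out near Table 9's 0.0022
```

Refitting, rather than merely resampling, is the point of the exercise: the bootstrap propagates the variability of the estimated τ² into the uncertainty of the consensus value, which a plug-in formula would miss.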