
This article was downloaded by: [Moskow State Univ Bibliote]

On: 10 January 2014, At: 06:14


Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41
Mortimer Street, London W1T 3JH, UK

Journal of the American Statistical Association


Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/uasa20

An Evaluation of Ten Pairwise Multiple Comparison Procedures by Monte Carlo Methods
S. G. Carmer (a) & M. R. Swanson (b)
(a) Department of Agronomy, University of Illinois Urbana, Ill., 61801, USA
(b) U.S.D.A. Consumer and Marketing Service, Washington, D.C., 20013, USA
Published online: 05 Apr 2012.

To cite this article: S. G. Carmer & M. R. Swanson (1973) An Evaluation of Ten Pairwise Multiple Comparison Procedures by Monte
Carlo Methods, Journal of the American Statistical Association, 68:341, 66-74

To link to this article: http://dx.doi.org/10.1080/01621459.1973.10481335


Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the
publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or
warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions
and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed
by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with
primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings,
demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly
in connection with, in relation to or arising out of the use of the Content.

This article may be used for research, teaching, and private study purposes. Any substantial or systematic
reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is
expressly forbidden. Terms & Conditions of access and use can be found at
http://www.tandfonline.com/page/terms-and-conditions
© Journal of the American Statistical Association
March 1973, Volume 68, Number 341
Applications Section

An Evaluation of Ten Pairwise Multiple Comparison Procedures by Monte Carlo Methods
S. G. CARMER and M. R. SWANSON*

Computer simulation techniques were used to study the Type I and Type III error rates and the correct decision rates for ten pairwise multiple comparison procedures. Results indicated that Scheffé's test, Tukey's test, and the Student-Newman-Keuls test are less appropriate than either the least significant difference with the restriction that the analysis of variance F value be significant at $\alpha = .05$, two Bayesian modifications of the least significant difference, or Duncan's multiple range test. Because of its ease of application, many researchers may prefer the restricted least significant difference.

1. INTRODUCTION

Numerous procedures are available for performance of pairwise multiple comparisons among the observed treatment means from designed experiments. With any particular procedure, the observed difference between any two means is compared to the appropriate critical value for that procedure. If the observed difference exceeds the critical value, the two means are declared significantly different; otherwise, the difference is considered nonsignificant. Since the magnitudes of the critical values vary among procedures, results obtained from the application of one procedure to a given set of data will often differ from those obtained if another procedure is utilized.

Disagreements among statisticians concerning the appropriate principles on which to base a procedure for pairwise multiple comparisons have undoubtedly provided healthy stimuli to further study of the problem. However, many consulting statisticians may have some doubts as to which procedure to recommend for use by their clientele. In addition, experimenters in such areas as the agricultural, behavioral and biological sciences, education, and engineering, are often left in a somewhat confused state when forced to choose a procedure for application to their own data. This situation has led to criticisms in subject matter journals of other experimenters' choices of procedures; for a recent example, see [11, p. 408; 12]. Further evidence of the confusion that exists is provided by the periodic requests for professional advice that appear in the statistical literature (e.g., [15, 8]).

The simulation study reported herein was prompted mainly by the authors' own uncertainty as to the most appropriate procedure to recommend to students and researchers in the agricultural sciences. Data were generated to simulate a number of experimental situations (88,000 experiments in all, with various numbers of treatments and replicates). Since the true treatment means were chosen in advance, it was possible to determine whether, for any given pair of treatments, a decision reached on the basis of "observed" means was correct or incorrect. A less detailed account of a portion of the study has been prepared as the basis for recommendations to agricultural researchers (Carmer and Swanson [4]). The results presented here will be of primary value to consulting statisticians faced with the problem of suggesting suitable procedures to their clientele, but they may also be of some interest to mathematical statisticians concerned with theoretical aspects of the problem.

The study concerned ten pairwise multiple comparisons procedures, described in Section 2, in relation to the following questions:

1. What are the Type I error rates (i.e., probability of rejecting a true null hypothesis) when a set of treatments with real differences among some means includes one or more subsets of treatments with no true differences between means?
2. What are the reverse decision or Type III error rates (i.e., probability of declaring one treatment superior to another when the reverse is actually true) when relatively small true differences exist for some pairs of treatments in the set?
3. What are the correct decision rates (i.e., probability of declaring one treatment superior to another when it actually is) when the magnitudes of relative true differences between means range from small to large?
4. What are the effects on the above rates of differing amounts of replication, of various numbers of treatments, and of varying levels of homogeneity among the true means?

The present study is considerably broader in scope than those of Chen [5], Balaam [1], and Boardman and Moffitt [2] in that a larger number of procedures are evaluated over a wider range in levels of homogeneity among treatment means.

* S. G. Carmer is professor of Biometry, Department of Agronomy, University of Illinois Urbana, Ill. 61801. M. R. Swanson is statistician, U.S.D.A. Consumer and Marketing Service, Washington D.C. 20013. Portions of this research were supported by funds from the Illinois Agricultural Experiment Station. The authors wish to express their appreciation to J.W. Johnson, E. McKenzie, Jr., and J.S. Rice who, as part of their graduate studies, assisted in the simulation.
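As an illustration only, the decision categories named in questions 1 through 3 can be expressed as a small Python function. The function name, argument names, and category labels are hypothetical and are not taken from the authors' FORTRAN program. Two points are assumptions: a nonsignificant result for a true difference of zero is counted as a correct decision, and a real difference that is not declared significant is labeled a Type II error, following the categories listed in Section 3.2.

```python
def classify_decision(true_diff, observed_diff, critical_value):
    """Classify one pairwise decision (hypothetical helper, for illustration).

    true_diff      -- true mean of treatment A minus true mean of treatment B
    observed_diff  -- observed mean of A minus observed mean of B
    critical_value -- critical difference supplied by the chosen procedure
    """
    # A difference is declared significant when it exceeds the critical value.
    significant = abs(observed_diff) > critical_value

    if true_diff == 0:
        # Rejecting a true null hypothesis is a Type I error.
        return "type_I" if significant else "correct"

    if not significant:
        # Assumption: a real difference left undeclared is a Type II error.
        return "type_II"

    # Significant and a real difference exists: check the direction.
    same_direction = (observed_diff > 0) == (true_diff > 0)
    return "correct" if same_direction else "type_III"
```

For example, `classify_decision(5, -12, 10)` returns `"type_III"`, since the observed difference is significant but points in the direction opposite to the true difference.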

2. DESCRIPTION OF PROCEDURES STUDIED

For experiments which involve n equally replicated treatments and f degrees of freedom for error, and result in uncorrelated means with equal variances, four of the more commonly used multiple comparisons procedures can be related to the upper $\alpha_p$ percentage points of the distribution of studentized ranges, $Q(\alpha_p, p, f)$, where p is the number of means included in the range.

Letting $s_d$ represent the standard error of the difference between any two treatment means, the critical value for the well-known multiple t test, or ordinary least significant difference, is computed as:

$$\mathrm{LSD} = Q(\alpha_c, 2, f)\, s_d/\sqrt{2} = t(\alpha_c, f)\, s_d$$

where $t(\alpha_c, f)$ is the tabular value of Student's t for the chosen comparisonwise Type I error rate, $\alpha_c$.

On the other hand, choice of an experimentwise Type I error rate, $\alpha_e$, results in the Tukey significant difference, for which the critical value is computed as:

$$\mathrm{TSD} = Q(\alpha_e, n, f)\, s_d/\sqrt{2}.$$

If $\alpha_c$ and $\alpha_e$ are assigned equal values, e.g., .05, and n is greater than 2, the TSD value will be larger than the LSD value. Thus, with respect to declaring significant differences, the TSD is a more conservative procedure than the LSD.

While the LSD and TSD each require the computation of a single critical value, the Student-Newman-Keuls procedure involves the calculation of (n - 1) critical values:

$$\mathrm{SNK}_p = Q(\alpha_e, p, f)\, s_d/\sqrt{2}$$

for p = 2, 3, ..., n. Thus, $\mathrm{SNK}_2$ equals the LSD value, $\mathrm{SNK}_n$ equals the TSD value, and, for intermediate values of p, $\mathrm{SNK}_p$ is intermediate to the LSD and TSD.

The LSD, TSD, and SNK procedures all utilize ordinary studentized ranges tabulated, e.g., in Table II.2 of Harter, Clemm, and Guthrie [10]. Special studentized ranges, first presented by Duncan [6] and later recomputed by Harter, Clemm, and Guthrie [10], with $\alpha_p = [1 - (1 - \alpha)^{p-1}]$ for p = 2, 3, ..., n, are used for Duncan's multiple range test. For this procedure the (n - 1) critical values are computed as:

$$\mathrm{MRT}_p = Q(\alpha_p, p, f)\, s_d/\sqrt{2}$$

for p = 2, 3, ..., n. Except for p = 2, the $\mathrm{MRT}_p$ values are all larger than the LSD, but smaller than the TSD, and smaller than the corresponding $\mathrm{SNK}_p$ values.

A fifth procedure, due to Scheffé [14], is based on the F distribution. Although this method is quite general in that it is applicable to any imaginable linear contrast among the treatment means, it is sometimes used as a pairwise multiple comparisons procedure. When employed in this latter context, the single critical value is

$$\mathrm{SSD} = [q\, F(\alpha, q, f)]^{1/2}\, s_d$$

where $F(\alpha, q, f)$ is the tabular F value with q = (n - 1) and f degrees of freedom at the $\alpha$ significance level. Except for the case of 2 treatments, the SSD is larger and thus more conservative than the TSD.

Use of the LSD is often preceded by a preliminary significance test based on the observed F ratio of the among treatments mean square divided by the error mean square. If the observed F is significant, the ordinary LSD is performed. However, if the F value is not significant, no pairwise comparisons of means are made, thus eliminating the possibility of committing Type I errors. Since this procedure is generally attributed to R.A. Fisher, it is referred to here as the Fisher significant difference. Its critical value is computed as:

$$\mathrm{FSD} = \mathrm{LSD} = t(\alpha_c, f)\, s_d,$$

if the observed F ratio is significant, or

$$\mathrm{FSD} = \infty,$$

if the F value is not significant. In general practice, the preliminary F test is performed at the same nominal significance level as that for the pairwise comparisons. However, the authors are aware of no rule or theoretical basis that requires the same nominal significance level to be used. Consequently, for this study, the significance level of the preliminary F test was set at three different levels, giving three restricted least significant difference procedures. These are referred to as FSD1, FSD2, and FSD3 for procedures with F tests performed at $\alpha$ = .01, .05, and .10, respectively.

During the last decade, D.B. Duncan and coworkers have investigated Bayesian approaches to the multiple comparisons problem from which they have developed procedures which directly incorporate the observed F value into the critical value. In lieu of choosing a significance level, a measure of the relative seriousness of a Type I error to a Type II error is selected.

Two of these Bayes rules were included in the present study. One, referred to here as the Bayes significant difference, is the short application procedure described by Duncan [7, pp. 181-4]. This is an approximate procedure which, according to Duncan, can be used for n greater than 15 and f greater than 30. Its critical value is computed as:

$$\mathrm{BSD} = t_\infty\, [F/(F - 1)]^{1/2}\, s_d$$

where F is the observed ratio of the among treatments mean square divided by the error mean square, and $t_\infty$ is taken from Table 6 of Duncan [7], for the selected value of k, the Type I to Type II error weight ratio which replaces the traditional significance level.

The second Bayes rule is based on improvements and extensions due to Waller [16]. This more exact procedure is referred to here as the Bayes exact test; its single critical value is computed as:

$$\mathrm{BET} = t(k, F, f, q)\, s_d$$

where $t(k, F, f, q)$ is the minimum average risk t value for the chosen Type I to Type II error weight ratio, the observed F value, and the degrees of freedom for error

and among treatments, respectively. Fairly extensive tables of minimum risk t values have been published by Waller and Duncan [17], and were used in this study.

The properties of the BSD and BET procedures are such that large observed F ratios result in small critical values similar in magnitude to the LSD, while small F values, less than 2.5, result in considerably larger critical values. Thus, the Bayes rules should avoid Type I errors when the observed F value indicates a great deal of homogeneity among means, and should avoid Type II errors when the F value indicates considerable heterogeneity among means. More complete descriptions and details of the BET and BSD are provided in the papers by Waller and Duncan [17] and Duncan [7]. The latter paper also includes expository discussions of the LSD, FSD, TSD, MRT, SNK, and SSD procedures along with a numerical example illustrating their usage.

3. SIMULATION METHODS

3.1 Description of the Model

A FORTRAN IV Program for an IBM 360/75 was written to empirically demonstrate the properties of the ten multiple comparisons procedures just described. Data for 1,000 randomized complete block experiments were generated for each of the 88 combinations of 22 sets of true treatment means (Table 1) and four different numbers of replications (3, 4, 6, or 8), giving a total of 88,000 simulated experiments. Simulated data for a single experiment were generated from the linear model:

$$Y_{bt} = \mu + \beta_b + \tau_t + \epsilon_{bt}$$

where $Y_{bt}$ is the simulated observation for the bth replication of the tth treatment, $\mu$ is the grand mean which always equalled 100, $\beta_b$ is the bth block effect which, for convenience, was always equal to zero, $\tau_t$ is the tth treatment effect which equalled the appropriate true mean from Table 1 minus 100, and $\epsilon_{bt}$ is the contribution due to random error independently sampled from a normal distribution with a mean of zero and a standard deviation of 10.

Numerical sample values of $\epsilon_{bt}$ were generated in the following manner. Pseudo-random variates uniformly distributed over the interval zero to one were computed using a FORTRAN composite random number generator described by Marsaglia and Bray [13]. The uniform variates were then transformed to independent standard normal variates using the method of Box and Muller [3]; multiplication by 10 produced numerical sample values of $\epsilon_{bt}$ which were normally and independently distributed with a mean of zero and standard deviation of 10.

The true values of the grand mean, $\mu = 100$, and the standard deviation, $\sigma = 10$, were selected arbitrarily to give a coefficient of variation equal to 10 percent, a value comparable to that strived for in many agricultural field experiments. The 22 sets of true treatment means (see Table 1) were also chosen somewhat arbitrarily, but the basis of their selection included the desirability of having wide ranges in the degree of heterogeneity of homogeneity among true means, in the overall level of homogeneity among means, and in the number, m, of pairwise comparisons with a true difference equal to zero.

1. SETS OF TRUE MEANS USED IN SIMULATION STUDY (a)

Set   n    m   $\theta^2$  $H^2$   True treatment means: $\mu + \tau_t$
  1   5   10    .000           100, 100, 100, 100, 100
  2   5    0    .100   1.000   104, 102, 100, 98, 96
  3   5    2    .250   2.000   105, 105, 100, 95, 95
  4   5    0    .625   1.000   110, 105, 100, 95, 90
  5   5    2   1.000   2.000   110, 110, 100, 90, 90
  6   5    0   1.010   1.405   111, 109, 100, 91, 89
  7   5    0   2.500   1.000   120, 110, 100, 90, 80
  8  10   45    .000           100, 100, 100, 100, 100, 100, 100, 100, 100, 100
  9  10    1    .067   1.125   104, 103, 102, 101, 100, 100, 99, 98, 97, 96
 10  10    1    .600   1.125   112, 109, 106, 103, 100, 100, 97, 94, 91, 88
 11  10    1    .967   2.632   120, 105, 103, 101, 100, 100, 99, 97, 95, 80
 12  10   20   1.600   9.000   112, 112, 112, 112, 112, 88, 88, 88, 88, 88
 13  10    0   1.656   3.560   114, 113, 112, 111, 110, 92, 90, 88, 86, 84
 14  10    1   1.667   1.125   120, 115, 110, 105, 100, 100, 95, 90, 85, 80
 15  10    1   2.731   1.880   122, 120, 118, 104, 100, 100, 96, 83, 80, 77
 16  20  190    .000           100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
                               100, 100, 100, 100, 100, 100, 100, 100, 100, 100
 17  20   14    .063   2.375   104, 104, 103, 103, 102, 102, 101, 101, 100, 100,
                               100, 100, 99, 99, 98, 98, 97, 97, 96, 96
 18  20   14    .568   2.375   112, 112, 109, 109, 106, 106, 103, 103, 100, 100,
                               100, 100, 97, 97, 94, 94, 91, 91, 88, 88
 19  20    1   1.367   3.101   122, 120, 118, 106, 105, 104, 103, 102, 101, 100,
                               100, 99, 98, 97, 96, 95, 94, 82, 80, 78
 20  20   14   1.579   2.375   120, 120, 115, 115, 110, 110, 105, 105, 100, 100,
                               100, 100, 95, 95, 90, 90, 85, 85, 80, 80
 21  20   10   2.663   3.231   120, 120, 120, 120, 120, 112, 111, 110, 109, 108,
                               94, 92, 90, 88, 86, 82, 81, 80, 79, 78
 22  20    1   2.719   3.068   124, 123, 122, 120, 118, 116, 103, 102, 101, 100,
                               100, 99, 98, 97, 84, 82, 80, 78, 77, 76

(a) n = number of treatments;
m = number of pairwise comparisons with true difference equal to zero;
$\theta^2 = \left[\sum_{t=1}^{n} \tau_t^2/(n - 1)\right]/(\sigma^2 = 100)$;
$H^2 = \left[\sum_{t=1}^{q} (\tau_t - \tau_{t+1})^2\right]/\left[(\tau_1 - \tau_n)^2/q\right]$, where q = n - 1.

2. ANALYSIS OF VARIANCE TABLE FOR RANDOMIZED COMPLETE BLOCK DESIGN WITH n TREATMENTS AND r REPLICATIONS

Source of variation   df               Expected mean square                               Theoretical F value
Total                 nr - 1
Replications          r - 1
Among treatments      n - 1            $\sigma^2 + r \sum_{t=1}^{n} \tau_t^2/(n - 1)$     $1 + r\theta^2$
Error                 (r - 1)(n - 1)   $\sigma^2$

The phrase heterogeneity of homogeneity among true means refers to what extent the means tend to cluster with more homogeneity among some means than others [7, p. 174]. An index to the heterogeneity of homogeneity among means can be computed as:

$$H^2 = \left[\sum_{t=1}^{q} (\tau_t - \tau_{t+1})^2\right] \Big/ \left[(\tau_1 - \tau_n)^2/q\right]$$

where q = (n - 1). The denominator is equivalent to what the sum of squares of increments would be if the true means were equally spaced over the entire range of

values in the set, which would be the case in the absence of heterogeneity of homogeneity. Thus, values of $H^2 = 1.0$, which are obtained for sets 2, 4, and 7, indicate equal spacing between true means, while values greater than 1.0 indicate a tendency for the means to cluster with more homogeneity among some than others. As shown in Table 1, sets 9, 10, and 14 exhibit only a small degree of heterogeneity of homogeneity due to the means of treatments 5 and 6 both being equal to 100. The most severe heterogeneity of homogeneity occurs for set 12 which consists of two subsets, each with 5 equal means.

For any particular treatment set, the level of homogeneity is reflected in the theoretical F value which equals $1 + r\theta^2$, where r is the number of replications, and $\theta^2$ is the relative expected value of the fixed effects component for treatments in the among treatments mean square (see Tables 1 and 2). The theoretical F values range from F = 1.0, for sets 1, 8, and 16 with any value of r, to F = 22.848 for set 15 with 8 replications.

The 22 sets of true treatment means obviously do not exhaust the possible configurations of true differences among means. However, they do include a wide diversity in patterns of homogeneity, and they are, at least in part, somewhat representative of situations found in actual experiments in the agricultural sciences.

3.2 Analysis of Data from Simulated Experiments

For each of the 88,000 simulated experiments, the analysis of variance outlined symbolically in Table 2 was computed, and the observed means were ranked in order largest to smallest. Using $\alpha = .05$, critical values for the LSD, TSD, SSD, MRT and SNK procedures were calculated without regard to the observed F value. Critical values for the FSD1, FSD2, and FSD3 were set equal to the $\alpha = .05$ LSD value whenever the observed F was significant at $\alpha$ = .01, .05, and .10, respectively; when the preliminary F test failed to be significant, an arbitrary large value, $10^6$, was assigned as the critical value to prevent subsequent declaration of significance for any observed pairwise difference. For the BSD and BET procedures, k, the Type I to Type II error weight ratio, was chosen to be 100, a value which, according to Duncan [7] and Waller and Duncan [17], approximates the usual $\alpha = .05$ significance level. For the BSD, the observed F value is directly involved in the computations for the critical value; while for the BET, the observed F value determines the point of entry in the tables of minimum average risk t values furnished by Waller and Duncan [17].

The analysis of each simulated experiment was concluded by comparing observed differences between pairs of treatment means with the various critical values to make decisions concerning significance. Based on the value of the actual true difference, each decision was categorized as being either a correct decision or a Type I, II, or III error. Comparisonwise frequencies of these decisions were tabulated for the 1,000 experiments with a particular number of replications and set of true means, and for all experiments having a particular number of replications and number of treatments. Also, experimentwise Type I error rates were determined for those experiments in which the set of true means included one or more true differences equal to zero. In the authors' use of the phrase, experimentwise error rate refers to the ratio, multiplied by 100, of the number of experiments in which at least one Type I error is actually committed divided by the total number of experiments in which at least one true difference equals zero.

Although the simulation was performed using specific and arbitrary values for the parametric values of treatment effects and error variance, results were summarized, when possible, on a relative, rather than absolute, scale. Thus, the results are not limited solely to the specific situations simulated.

4. TYPE I ERROR RATES

4.1 All Treatment Effects Equal to Zero

Observed comparisonwise and experimentwise Type I error rates are given in Table 3 for treatment sets 1, 8, and 16, where all 5, 10, and 20 true treatment means, respectively, were equal. In general, changes in the frequencies of Type I errors due to varying the number of replications were partially masked by random variation in results produced by the simulator. For example, based on the probabilities of values of the studentized range tabulated by Harter, Clemm, and Guthrie [10], experimentwise error rates for the LSD in the case of 10 treatments should have been approximately 54.9 percent, 57.4 percent, 59.0 percent, and 60.2 percent for 3, 4, 6, and 8 replications, respectively. The corresponding empirical percentages were 52.0, 59.2, 62.7, and 59.6, respectively (with estimated standard errors (1) of about 1.55). Because of the simulator's inability to precisely pinpoint the effects of increasing numbers of replications, and to conserve space, the Type I error rates for 3, 4, 6, and 8 replications were averaged for presentation in Table 3.

As one should expect, the observed comparisonwise error rates for the LSD were close to 5 percent, and the observed experimentwise rates for the FSD1, FSD2, and FSD3 were close to 1 percent, 5 percent, and 10 percent, respectively. Also as expected, the experimentwise rates for the TSD and SNK were about 5 percent. Highest experimentwise rates were observed for the LSD; these values were only slightly lower than the expected rates of 28.5 percent, 62.7 percent, and 91.8 percent for 5, 10, and 20 treatments, respectively, given by Duncan [7, Table 2] for error degrees of freedom equal to infinity. The MRT also showed close agreement between observed and expected experimentwise error rates; the latter are $100\alpha_e = 100[1 - (1 - .05)^q]$ = 18.5 percent, 37.0 percent, and 62.3 percent for q = (n - 1) = 4, 9, and 19, respectively. Experimentwise rates for the

(1) Letting $100\hat{p}$ denote an observed percentage based on X experiments, and $100p$ the corresponding expected value, the standard error of $100\hat{p}$ is $100[p(1 - p)/X]^{1/2}$. For $p = .60$ and X = 1,000, e.g., the standard error is 1.549.
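Two of the figures quoted in Section 4.1 can be checked with a few lines of Python. This is a sketch for verification only; the function names are hypothetical, and the standard error uses the binomial form stated in the footnote.

```python
import math


def mrt_experimentwise_rate(n_treatments, alpha=0.05):
    """Expected experimentwise rate (percent) for the MRT when all
    n_treatments true means are equal: 100 * [1 - (1 - alpha)^q], q = n - 1."""
    q = n_treatments - 1
    return 100 * (1 - (1 - alpha) ** q)


def percent_standard_error(p, n_experiments):
    """Standard error (in percentage points) of an observed percentage
    with expected proportion p, based on n_experiments experiments."""
    return 100 * math.sqrt(p * (1 - p) / n_experiments)


for n in (5, 10, 20):
    print(n, round(mrt_experimentwise_rate(n), 1))   # 18.5, 37.0, 62.3

print(round(percent_standard_error(0.60, 1000), 3))  # 1.549
```

The printed values reproduce the 18.5, 37.0, and 62.3 percent rates given for the MRT and the standard error of 1.549 given in the footnote.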

3. OBSERVED COMPARISONWISE AND EXPERIMENT- anticipate well-conceived experiments in many areas of


WISE TYPE I ERROR RATESa the agricultural, behavioral, and biological sciences, etc.,
which might have 20, 10, or even 5 treatments with no
Comparisonwlse Experimentwise
differences among the true means. For many subject
Procedure _ _...::..:..::..:..::...-="'--
error rate _ error ra te matter researchers, the results of the present section
n n
may be less pertinent than those in Section 4.2, since
10 20 10 20
experiments with one or more proper subsets of equal
FSD1 .55 % .29% .21 % 1.1 % 1.0 %
true means would seem more likely to occur in practice
FSD2 1.82 .82 4.8 5.4 5.2 than experiments with an entire set of homogeneous true
FSD3 2.91 1.44 9.6 9.9 10.0 means.
LSD 4.99 5. 01 5.03 25.6 58.4 89.5
4.2 Some, but Not All, Treatment Effects Equal to Zero
TSD .79 .18 .04 5.0 4.8 4.7
In Table 4 observed experimentwise Type I error rates
SSD .34 .01 .00 2.4 .3 .0 (as defined in Section 3.2, and averaged for 3, 4, 6, and 8
BSD 3.85 1.62 .66 15.0 15.5 15.1 replications) are presented for the 14 treatment sets
BET 3.37 1.50 .58 15.6 18.4 18.7 which contain one or more proper subsets of homogeneous
MRT 3.69 2.55 1.82 18.2 37.3 62.6
true means within a set containing real differences be-

tween some pairs of true means.


SNK 1.30 .24 .05 5.7 5.0 4.8
In all cases, very low experimentwise error rates were
found for the SSD and TSD procedures; rates for the
a Percentages are based on simulation of 4,000 experiments in each case.
SNK were also generally quite low. Whenever the value
of (}2 was .967 or greater, the highest rates were observed
BSD and BET procedures were lower than for either the
for the BSD and BET procedures. Type I error fre-
MRT or LSD, but higher than for any of the three FSD
quencies were equal for the LSD and the three FSD pro-
procedures. The results with Scheffe's procedure reflected
cedures except at low values of (}2 where failure to declare
the direct effect that the degrees of freedom among
the analysis of variance F value significant prevented the
treatments have on the critical value of the SSD; the
commission of many Type I errors by the FSD pro-
experimentwise and comparisonwise error rates were
cedures. Lower rates were exhibited by the IHRT than
both essentially zero for the case of 20 treatments.
by either the BSD, BET or FSD procedures except at
Empirical comparisonwise Type I error rates for
low values of (}2.
several of the procedures can be compared with theo-
Expected experimentwise error rates for the LSD for
retical rates presented by Harter [9]. For the TSD the
the situations in Table 4 can be determined by multiply-
observed comparisonwise rates are in reasonable agree-
ing together the probabilities of not committing a Type I
ment with values given in Harter's Table 1A; for the
error for each homogeneous subset and then subtracting
SNK and MRT, the observed values are in reasonable
the product from 1. Using the values given in Table 2 of
agreement with the theoretical minimum values shown
Duncan [6J, the probability of not making a Type I error
in Harter's Tables 1A and 1C, respectively.
for set 12, which contains two homogeneous subsets of
Although the FSD2, TSD, and SNK procedures all
exhibited experimentwise error rates of about 5 percent,
4. OBSERVED EXPERIMENTWISE TYPE I ERROR RATES
their comparisonwise rates were not equal. The latter
FOR TREATMENT SETS WITH ONE OR MORE
were lowest for the TSD and highest for the FSD2. The
PROPER SUBSETS OF EQUAL MEANSa
critical value for the SNK changes as p increases, for
p = 2, 3, ..., n, and consequently the SNK would be
expected to produce more Type I errors than the TSD.

Set  n  m  $\theta^2$ (b)  Procedure: FSD1  FSD2  FSD3  LSD  TSD  SSD  BSD  BET  MRT  SNK


In experiments producing a significant analysis of. van- 12 10 20 1.600 44.9 45.2 45.2 45.2 3.1 .2 58.5 52,2 32.6 11.0
ance F value, the critical value for the FSD2 IS, of 17 20 14 .063 5.4 15.1 22.8 45.8 .5 .0 11.5 10.3 21.1 .7
course, smaller than that for the TSD and smaller than 18 20 14 .568 43.1 45.5 45.8 46.0 .5 .0 44.1 39.7 25.2 2.9

those for the SNK, except for p = 2; thus, more Type I 20 20 14 1.579 44.4 44.4 44.4 44.4 .5 .0 56.6 53.4 28.2 6.4

errors would be expected with use of the FSD2, par- 21 20 10 2.663 27.8 27.8 27.8 27.8 .5 .0 38.7 36.4 16.7 3.6

ticularly in the cases of 10 and 20 treatments. .250 4.1 7.0 8.6 9.7 1.3 .6 11.0 8.6 8.3 5.6

2 1.000 7.6 8.5 8.6 8.6 1.3 .5 15.0 10.9 8.2 7.1
The previously discussed results shown in Table 3 are,
10 .067.9 2.3 3.3 ~.3 .2 .0 2.4 1.9 2.7 .2
of course, consequences of the fundamental differences
10 10 .600 4.1 4.6 4.7 4.8 .2 .0 5.8 4.6 3.0 .4
in the theoretical development of the various procedures. 11 10 .967 4.7 4.8 4.8 4.8 .1 .0 7.0 5.8 2.7 .3
Many statisticians may find them useful, in consulting 19 20 1 1.367 5.5 5.5 5.5 5.5 .1 .0 7.2 6.6 2.4 .2
with subject matter researchers, to help explain the 14 10 1 1.667 4.8 4.8 4.8 4.8 .2 .0 7.6 6.2 3.4 1.2

effects of a particular choice of error rate. It should be 22 20 1 2.719 5.3 5.3 5.3 5.3 .1 .0 8.2 7.6 3.0 .4

pointed out, however, that, in many research situations, 15 10 1 2.731 5.0 5.0 5.0 5.0 .2 .0 8.1 7.0 4.0 1.3

experiments where all the true treatment means are equal
can be expected to be rare. It is indeed unrealistic to
a Percentages are based on simulation of 4,000 experiments in each case.
b See Table 1.
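The product rule that Section 4.2 describes for the expected experimentwise rates in Table 4 can be sketched in Python as follows. The function names are hypothetical. The subset probabilities .286 (homogeneous subset of five) and .203 (subset of four) are the values the text takes from Duncan's tables, and .05 is the comparisonwise level that applies to a subset of two.

```python
from math import prod


def lsd_expected_rate(subset_error_probs):
    """Expected experimentwise Type I rate (percent) for the LSD:
    one minus the product, over homogeneous subsets, of the probability
    of committing no Type I error within each subset."""
    return 100 * (1 - prod(1 - p for p in subset_error_probs))


def mrt_max_expected_rate(subset_sizes, alpha=0.05):
    """Maximum expected experimentwise rate (percent) for the MRT, using
    (1 - alpha)^s per subset, where s is one less than the subset size."""
    return 100 * (1 - prod((1 - alpha) ** (size - 1) for size in subset_sizes))


# Set 12: two homogeneous subsets of size five.
print(round(lsd_expected_rate([0.286, 0.286]), 1))          # 49.0
# Sets 17, 18, and 20: one subset of size four and eight of size two.
print(round(lsd_expected_rate([0.203] + [0.05] * 8), 1))    # 47.1
# Sets 3 and 5: two subsets of size two.
print(round(lsd_expected_rate([0.05, 0.05]), 2))            # 9.75

print(round(mrt_max_expected_rate([5, 5]), 1))              # 33.7
print(round(mrt_max_expected_rate([4] + [2] * 8), 1))       # 43.1
```

These values match the 49.0, 47.1, and 9.75 percent quoted for the LSD and the 33.7 and 43.1 percent maxima quoted for the MRT.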

size five, is [(1 - .286) (1 - .286) J = .510; thus, the ex- 6. EFFECT OF THE MAGNITUDE OF (P ON OBSERVED
pected experimentwise error rate is 49.0 percent. Simi- TYPE 11/ ERROR RATES (PERCENT)
larly, for sets 17, 18, and 20, which each have one homo-
geneous subset of size four and eight subsets of size two, &11 / "
.1
the probability of not committing a Type I error is .1 .5 .5
[(1 - .203)(1 - .05)8J = .529, resulting in an expected Procedure S"
.063 2.719 .063
experimentwise rate of 47.1 percent for each set. Set 2.719

21 includes just one homogeneous subset of size 5; thus,


FSD2 .709 1. 873 .108 .344
its expected experimentwise rate is 28.6 percent. Sets 3
LSD 1.719 1.873 .312 .344
and 5 each contain two homogeneous subsets of size two,
and, consequently, have expected experimentwise rates of BSD .404 2.354 .059 .581
9.75 percent. For each of the remaining sets in Table 4, BET .334 2.127 .044 .525
there is only one homogeneous subset of size two; for MRT .616 1.085 .088 .150
these sets the expected experimentwise and comparison-
Treatment set 17 22 17 22
wise rates are, of course, equal to 5 percent. All of the
expected values agree quite favorably with the observed
rates for the LSD. set 18. It is also consistent with the rates observed for

Performance of similar computations for the MRT, sets 9, 10, 11, 19, 14, 22, and 15, all of which were below
where the minimum probability of not committing a 5 percent.
Type I error is [(l - .05) 8 J, where s is one less than the Critical values for five of the procedures are dependent
subset size, gives maximum expected experimentwise on the analysis of variance F ratio. The effects of in-
rates as follows: 33.7 percent for set 12, 43.1 percent for creasing values of (j2 on the observed comparisonwise
sets 17, 18, and 20, 18.5 percent for set 21, 9.75 percent Type I error rates (averaged for 3, 4, 6, and 8 replica-
for sets 3 and 5, and 5 percent for the remaining sets. tions) for these procedures are shown in Table 5. The
Observed rates for sets 12, 21, 3 and 5 were only slightly individual treatment sets in each of the four pairs of sets
lower than expected; but observed rates for sets 17, 18, have approximately equal values of (j2 but quite different
and 20 were markedly less than the maximum expected values of H2. Frequencies of Type I errors were appar-
rates. This latter result might have been caused by ently affected only slightly, if at all, by changes in H2
rankings of observed means which failed to correspond at a given level of homogeneity. Thus, for the treatment
with those of true means being more frequent in cases sets under study here, the heterogeneity of homogeneity
with homogeneous subsets of size two than in cases with was not sufficiently substantial to indicate a need for a
larger subsets. This, in turn, would have caused the multiple-significant-difference procedure, and there is
critical values for the MRT used for the comparisons in little support from these results for Duncan's [7, p. 174J
question to be based on a range of three or more means concern in this particular aspect of the problem. Error
rather than on a range of two. This notion is also con- rates increased, however, as the value of (j2 increased. The
sistent with the observed increases in experimentwise results indicate that, within the vagaries of sampling vari-
rates from set 17 to 18 and from set 18 to set 20, because ation, the FSD procedures were limited to a maximum
the frequency of "improper" rankings would be expected rate of 5 percent, while the Bayesian procedures ex-
to be less in set 18 than in set 17, and less in set 20 than in hibited the disadvantage of no such upper bound.
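
The expected experimentwise rates discussed above for the LSD and the MRT are
products, over the homogeneous subsets of a treatment set, of the probability
of making no Type I error within each subset. The following Python sketch
reproduces that arithmetic. The function names are illustrative and not from
the paper; the per-subset LSD error probabilities (.05, .203, and .286 for
subsets of size two, four, and five) are simply the values that appear in the
products quoted in the text, and independence between subsets is assumed, as
the products themselves imply.

```python
from math import prod

# Per-subset probability of a Type I error for the 5% LSD, keyed by the size
# of the homogeneous subset. Values are those used in the text's products.
LSD_SUBSET_ERROR = {2: 0.05, 4: 0.203, 5: 0.286}


def lsd_experimentwise(subset_sizes):
    """Expected experimentwise Type I error rate (percent) for the LSD."""
    p_no_error = prod(1.0 - LSD_SUBSET_ERROR[k] for k in subset_sizes)
    return 100.0 * (1.0 - p_no_error)


def mrt_max_experimentwise(subset_sizes, alpha=0.05):
    """Maximum expected experimentwise rate (percent) for the MRT.

    Each subset of size k contributes (1 - alpha) ** (k - 1), i.e. the
    exponent s is one less than the subset size.
    """
    p_no_error = prod((1.0 - alpha) ** (k - 1) for k in subset_sizes)
    return 100.0 * (1.0 - p_no_error)
```

For instance, `lsd_experimentwise([4] + [2] * 8)` corresponds to sets 17, 18,
and 20 (one subset of size four and eight of size two), and
`mrt_max_experimentwise([5])` to set 21.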

5. OBSERVED COMPARISONWISE TYPE I ERROR RATES FOR SEVERAL VALUES OF φ² AND H² a

                                    Procedure
Set b    φ² b     H² b     FSD1    FSD2    FSD3     BSD     BET
  ?      .067    1.125      .90    2.32    3.27    2.45    1.92
 17      .063    2.375      .80    2.02    2.89    1.16     .99
  ?     1.000    2.000     4.00    4.44    4.47    8.00    5.72
 11      .967    2.632     4.70    4.82    4.82    6.97    5.80
 20     1.579    2.375     4.89    4.89    4.89    7.14    6.56
 12     1.600    9.000     5.03    5.05    5.05    7.86    6.60
 15     2.731    1.880     5.02    5.02    5.02    8.07    6.97
 22     2.719    3.068     5.27    5.27    5.27    8.20    7.60

a Percentages are based on simulation of 4,000 experiments in each case.
b See Table 1.

5. TYPE III ERROR RATES

If the true difference between two means, say τi - τj = δij, is such that δij
is positive, then reaching the decision that τj is greater than τi results in a
reverse decision or Type III error. Thus, the Type III error rate is the
probability of declaring τj significantly greater than τi, when in fact the
reverse is true, i.e., τi is greater than τj.

The observed comparisonwise Type III error rates were quite low; in general
they were less than 2 percent for true differences as small as .1σ and
decreased rapidly as the magnitude of δij/σ increased. For all procedures, the
rates quickly approached zero for true differences greater than .5σ. These
results are in basic agreement with the findings of Balaam [1] who reported
probabilities of Type III errors to be practically zero for true differences
greater than .5σ for sets of four treatment means.
72 Journal of the American Statistical Association, March 1973

Balaam [1] found little variation in the frequency of reverse decisions over
the range of φ² included in his study. However, his work preceded the
development of the BSD and BET procedures. As is shown in Table 6, these two
procedures were sensitive to the magnitude of φ² and resulted in higher Type
III error rates as φ² increased. The FSD2 and MRT procedures were somewhat less
affected, and the LSD was not affected by the value of φ². As would be
expected, the influence decreased as the magnitude of the relative true
difference increased.

6. CORRECT DECISION AND TYPE II ERROR RATES

The sensitivity of a procedure or its ability to correctly detect real
differences among means depends on several factors. These factors include:

1. The number of replications, which affects both the magnitude of s_d and the
   degrees of freedom associated with it;
2. The number of treatments, which also affects the degrees of freedom
   associated with s_d and directly affects the tabular values used for the
   TSD, SSD, BET, MRT, and SNK procedures;
3. The magnitudes of the relative true differences; and
4. The level of homogeneity among the true treatment means, which affects the
   critical values for the BSD, BET, and FSD procedures.

In actual practice the experimenter, of course, controls the numbers of
replications and treatments, but, in general, his prior knowledge concerning
the 3rd and 4th items is meager. Consequently, average correct decision rates,
pooled over a diversity of levels of homogeneity and degrees of heterogeneity
of homogeneity among means, are likely to be of most practical interest to the
consulting statistician and to the experimenter. For this reason, observed
comparisonwise correct decision rates for relative true differences, δij/σ,
ranging from .2 to 4.0 were pooled over the six sets of 5 treatments, the seven
sets of 10 treatments, and the six sets of 20 treatments with nonhomogeneous
true means. These results are given in Tables 7, 8, and 9, respectively.²

For nonzero true differences the Type II error rate is equal to:

    (100 - correct decision rate - Type III error rate),

when all are expressed as percentages on a comparisonwise basis. Since, as was
shown in the previous section, Type III error rates were quite low and usually
negligible, the Type II error rate was, for practical purposes, equal to:

    (100 - correct decision rate).

Since examination of correct decision rates provides a more positive approach
to the evaluation of the ten procedures than does examination of Type II error
rates, tabulations of frequencies of failure to detect real differences between
means are not presented here.

7. OBSERVED CORRECT DECISION RATES (%) FOR SETS OF 5 TREATMENT MEANS

Repli-                                    Procedure
cation  δij/σ   FSD1   FSD2   FSD3    LSD    TSD    SSD    BSD    BET    MRT    SNK
  ?       .2     1.7    3.6    4.3    4.8     .8     .5    6.2    4.9    4.3    2.3
          .4     1.4    4.1    5.6    7.8    2.1    1.1    9.0    1.1    6.4    2.6
          .5     2.7    5.2    6.0    7.7    1.7    1.0   10.5    8.0    6.7    3.2
          .6     2.2    6.2    8.4   11.8    3.0    2.0   13.5   11.2    9.9    4.0
          .8     2.7    6.9   10.6   1?.4    4.9    2.8   15.2   13.5   12.0    ?.8
          .9     9.0   13.1   14.8   15.2    4.3    2.2   ??.4   16.7   14.0    8.8
         1.0     9.5   15.0   17.2   18.9    5.2    3.1   25.6   20.2   17.2   10.9
         1.1    11.8   19.8   21.5   21.8    6.6    3.6   29.9   23.6   20.2   12.8
         1.5     9.8   24.4   31.4   36.5   12.4    1.0   40.6   35.0   32.8   16.8
         1.8    18.4   36.8   44.0   48.2   19.0   11.9   56.1   41.9   43.0   25.0
         2.0    21.4   41.4   53.8   57.9   24.6   16.3   65.6   59.0   53.8   34.1
         2.2    21.4   48.3   59.2   65.8   33.9   23.4   69.4   63.7   60.7   39.3
         3.0    63.4   85.8   88.9   89.6   ?9.8   47.0   92.8   90.2   87.2   69.3
         4.0    ?4.8   91.5   91.2   99.1   88.8   77.3   99.6   99.0   98.5   90.5

  ?       .2     1.9    3.0    3.6    4.4     .9     .5    5.3    4.0    3.6    2.2
          .4     1.4    3.9    5.5    7.6    1.4     .8    7.2    5.8    5.9    2.2
          .5     4.2    7.3    8.5    9.7    2.1     .9   11.1    8.9    8.3    4.4
          .6     2.2    6.0    8.8   11.8    3.0    1.6   10.5    9.4    9.5    4.2
          .8     2.9    8.4   12.6   19.2    4.8    2.8   16.1   14.0   14.9    6.5
          .9    16.8   21.2   21.7   21.8    5.8    3.6   28.6   23.2   19.8   14.2
         1.0    15.7   21.0   23.0   24.8    1.8    4.4   ?0.4   35.4   22.6   15.5
         1.1    21.2   28.4   29.3   29.8    9.0    5.6   37.0   30.9   26.9   17.4
         1.5    19.0   35.5   42.4   46.8   18.0   12.1   48.2   43.2   41.4   23.1
         1.8    39.0   59.4   62.4   64.5   13.0   21.9   69.6   63.4   60.1   43.1
         2.0    49.3   66.2   70.6   73.4   41.2   29.1   77.1   72.2   69.3   52.0
         2.2    43.3   70.9   77.7   81.3   50.9   37.8   82.5   78.8   17.8   56.9
         3.0    91.4   96.5   96.9   96.9   82.4   12.2   98.2   96.8   95.8   89.0
         4.0    92.4   98.6   99.7   99.9   97.6   94.8   99.8   99.8   99.8   98.1

  ?       .2     2.6    3.1    4.3    5.1    1.0     .5    5.5    4.7    4.4    2.8
          .4     2.3    5.2    1.2    9.9    2.1    1.0    1.4    0.6    7.4    3.1
          .5     7.8   10.8   11.7   12.2    2.7    1.4   13.8   11.7   10.9    7.0
          .6     3.3    7.8   11.6   16.2    4.2    2.0   12.4   11.0   12.8    5.5
          .8     4.2   10.4   16.5   22.7    6.8    3.6   18.0   10.3   18.4    8.1
          .9    30.4   31.6   31.8   31.8   11.0    5.8   39.0   35.2   30.4   24.6
         1.0    28.7   34.1   36.0   37.6   13.1    7.4   42.4   38.3   34.8   26.5
         1.1    40.7   43.2   43.5   43.6   16.5   10.8   50.9   45.8   40.8   32.8
         1.5    48.4   66.3   69.9   11.4   38.8   24.8   73.2   69.1   67.7   49.0
         1.8    72.7   81.5   83.5   83.7   53.1   41.8   80.1   82.?   80.6   65.8
         2.0    78.9   88.1   89.8   90.4   66.9   53.2   91.8   90.2   88.6   77.2
         2.2    79.1   91.?   94.3   94.1   77.4   65.8   94.7   93.0   93.2   83.9
         3.0    99.8   99.8   99.8   99.8   91.4   94.0   ?9.9   99.8   99.6   99.0
         4.0   100.0  100.0  100.0  100.0  100.0   99.9  100.0  100.0  100.0  100.0

  ?       .2     3.1    4.5    5.2    5.9    1.0     .0    0.4    5.5    5.1    3.2
          .4     3.5    7.9   10.5   12.5    2.0    1.0   10.2    9.1    9.9    4.4
          .5    11.6   14.4   15.1   15.5    3.8    2.0   17.6   15.6   14.1    9.8
          .6     5.7   13.0   11.2   22.3    5.5    3.0   16.9   16.0   17.6    7.8
          .8     7.0   11.6   24.1   31.2   10.8    0.6   24.1   23.1   26.1   12.8
          .9    40.6   40.9   40.9   41.0   14.4    8.6   50.2   45.6   39.0   34.6
         1.0    40.1   45.1   46.9   48.1   20.5   12.4   52.6   49.6   ??.6   36.9
         1.1    56.2   56.1   56.8   56.8   25.0   10.8   03.6   60.3   54.1   44.8
         1.5    71.2   80.6   82.4   82.9   55.2   41.3   84.1   81.1   79.4   64.3
         1.8    90.3   92.2   92.3   92.4   14.1   6?.2   94.3   93.3   91.2   84.2
         2.0    93.0   96.3   96.8   91.0   84.2   74.8   97.6   97.1   96.2   91.3
         2.2    95.4   98.3   98.6   98.8   92.1   85.4   99.2   98.1   98.1   95.3
         3.0   100.0  100.0  100.0  100.0   99.8   99.4  100.0  100.0  100.0  100.0
         4.0   100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0

8. OBSERVED CORRECT DECISION RATES (%) FOR SETS OF 10 TREATMENT MEANS

Repli-                                    Procedure
cation  δij/σ   FSD1   FSD2   FSD3    LSD    TSD    SSD    BSD    BET    MRT    SNK
  ?       .5     6.3    1.0    1.4    8.3     .6     .1    9.7    1.?    6.0    2.1
         1.0    20.3   21.2   21.3   21.3    2.6     .3   27.0   22.4   16.6    6.5
         1.5    32.8   38.3   39.8   41.0    1.4    1.3   44.2   39.2   34.4   16.7
         2.0    57.6   62.2   63.0   63.4   17.5    4.3   68.5   63.2   56.2   29.8
         2.5    11.0   19.8   81.4   82.5   34.0   10.8   84.0   80.1   75.8   45.0
         3.0    86.3   93.3   94.1   94.2   54.1   21.9   94.9   92.8   90.2   62.8
         4.0    82.2   93.6   96.7   99.5   87.2   51.2   98.1   98.8   99.0   89.8

  ?       .5     8.1    8.8    9.2   10.2     .6     .1   11.7   10.0    7.4    2.5
         1.0    27.6   21.7   27.8   27.8    3.1     .4   34.1   30.4   22.3    9.4
         1.5    48.4   52.1   52.9   53.4   12.0    2.3   56.3   52.5   46.0   23.6
         2.0    75.6   77.3   77.5   77.6   28.9    7.9   81.1   18.6   11.8   48.5
         2.5    89.6   92.2   92.5   92.6   53.4   21.3   93.8   92.2   89.0   66.1
         3.0    97.9   98.3   98.3   98.3   77.0   42.4   98.8   98.3   97.0   84.8
         4.0    94.7   99.0   99.6   99.9   91.3   83.5   99.9   99.9   99.9   98.4

  ?       .5    10.1   11.6   12.2   13.3    1.0     .1   15.3   13.8    9.9    4.4
         1.0    39.6   39.6   39.6   39.6    6.6     .8   47.2   44.7   34.2   19.4
         1.5    10.2   71.5   71.6   11.7   24.4    5.6   75.8   ??.3   65.6   43.1
         2.0    92.2   92.2   92.2   92.2   55.4   21.0   94.6   93.8   89.1   75.8
         2.5    98.8   98.9   98.9   98.9   82.4   49.7   99.4   99.1   98.2   91.6
         3.0    99.8   99.8   99.8   99.8   96.4   78.8   99.9   99.9   99.8   98.3
         4.0    99.9   99.9   99.9  100.0   99.9   99.2   99.9   99.9  100.0   99.9

  ?       .5    13.2   14.4   14.9   16.2    1.4     .1   18.5   17.4   12.6    6.3
         1.0    50.0   50.0   50.0   50.0   10.7    1.6   57.9   56.1   44.4   29.0
         1.5    83.1   84.0   84.0   84.0   39.1   11.3   87.4   86.1   19.3   61.3
         2.0    97.8   91.8   91.?   91.8   75.7   ?9.2   98.6   98.4   96.7   90.8
         2.5    99.8   99.8   99.8   99.8   95.1   14.6   99.9   99.9   99.8   98.4
         3.0   100.0  100.0  100.0  100.0   99.7   94.4  100.0  100.0  100.0  100.0
         4.0   100.0  100.0  100.0  100.0   99.9   99.9  100.0  100.0  100.0  100.0

² To conserve space, only the rates for δij/σ = .5, 1.0, 1.5, 2.0, 2.5, 3.0,
and 4.0 are shown in Tables 8 and 9. With few exceptions, the rates for omitted
values of δij/σ were such that they increased monotonically over the range from
.1σ to 4.0σ.

For sets of 5 treatments (Table 7) the procedures can be divided into two
groups on the basis of their sensitivity. The FSD2, FSD3, LSD, BSD, BET, and MRT
procedures were consistently better able to detect real differences than were
the less sensitive FSD1, TSD, SSD, and SNK. Differences in sensitivity between
the two groups became less pronounced as the number of replications and the
magnitude of relative true difference increased.

The inferiority of the SSD, TSD, and SNK procedures for detecting real
differences appeared even more evident with sets of 10 and 20 treatments
(Tables 8 and 9). This, of course, is due to the dependence of critical values
for these procedures on the number of treatments. In general, there was a
remarkably high degree of similarity in sensitivity for the FSD2, FSD3, LSD,
BSD, and BET procedures. Although the correct decision rates for this group
were influenced by number of replications, they were less affected by the
number of treatments than were results for the SSD, TSD, and SNK procedures.
Rates for the MRT were considerably higher than for the SSD, TSD, and SNK, but
consistently lower than for the FSD2, FSD3, LSD, BSD, and BET with the
differences in sensitivity being greater for 20 treatments than for 10.

Because of their dependence on the observed analysis of variance F value,
correct decision rates for the FSD, BSD, and BET procedures were expected to
vary with the size of φ². This influence is shown in Table 10 for the FSD2 and
BET procedures. Except with three replications and a small value of φ², .568,
the FSD2 was not appreciably affected by the level of homogeneity among true
means. On the other hand, the correct decision rates for the BET consistently
increased as φ² increased. In general, the BSD behaved similarly to the BET,
and the FSD1 and FSD3 procedures behaved similarly to the FSD2.

9. OBSERVED CORRECT DECISION RATES (%) FOR SETS OF 20 TREATMENT MEANS

Repli-                                    Procedure
cation  δij/σ   FSD1   FSD2   FSD3    LSD    TSD    SSD    BSD    BET    MRT    SNK
  ?       .5     6.8    1.1    1.5    8.6     .2     .0    8.8    1.8    4.1     .6
         1.0    22.0   22.0   22.0   22.0     .9     .0   26.?   23.8   14.2    2.5
         1.5    40.7   43.0   43.4   43.6    3.6     .0   45.2   42.0   31.5    1.0
         2.0    66.5   66.5   66.5   66.5   11.0     .2   10.6   68.0   54.8   19.2
         2.5    84.8   84.9   84.9   84.9   25.6     .8   81.0   85.4   15.6   35.9
         3.0    94.7   94.8   94.8   94.8   46.2    2.1   95.9   95.1   90.1   59.0
         4.0    99.6   99.8   99.8   99.8   85.1   19.5   99.8   99.7   99.3   90.2

  ?       .5     8.4    8.9    9.3   10.6     .2     .0   11.1   10.3    6.0     .8
         1.0    28.6   28.6   28.6   28.6    1.5     .0   33.1   32.0   19.5    4.3
         1.5    53.8   54.8   54.8   54.9    6.8     .1   51.8   55.4   42.4   13.0
         2.0    19.5   19.5   19.5   19.5   20.3     .4   83.2   82.0   10.2   34.3
         2.5    93.5   93.5   93.5   93.5   43.9    2.4   94.9   94.3   88.2   58.3
         3.0    98.5   98.5   98.5   98.5   69.2    8.2   99.0   98.8   96.7   80.8
         4.0    99.9   99.9   99.9   99.9   96.6   45.1   99.9   99.9   99.9   98.4

  ?       .5    10.1   11.5   12.0   13.1     .4     .0   14.4   13.6    8.0    1.5
         1.0    39.1   39.1   39.1   39.1    3.0     .0   46.9   45.5   29.1    9.0
         1.5    13.1   13.1   13.2   13.2   15.5     .2   11.1   15.8   62.8   28.1
         2.0    92.8   92.8   92.8   92.8   42.8    1.9   95.0   94.6   88.5   63.5
         2.5    99.0   99.0   99.0   99.0   14.4   10.1   99.4   99.3   ??.8   86.2
         3.0    99.9   99.9   99.9   99.9   93.4   31.6   99.9   99.9   99.8   91.6
         4.0   100.0  100.0  100.0  100.0   99.9   86.1  100.0  100.0  100.0   99.9

  ?       .5    13.8   14.1   15.3   16.5     .5     .0   18.1   17.4   10.4    2.3
         1.0    50.8   50.8   50.8   50.8    5.4     .0   58.4   51.4   40.3   15.1
         1.5    84.4   84.4   84.4   84.4   26.9     .6   81.8   81.2   16.5   45.8
         2.0    91.1   91.1   97.7   91.7   63.9    5.8   98.6   98.5   95.9   82.9
         2.5    99.8   99.8   99.8   99.8   90.8   26.5   99.9   99.9   99.1   96.1
         3.0   100.0  100.0  100.0  100.0   98.9   62.8  100.0  100.0  100.0  100.0
         4.0   100.0  100.0  100.0  100.0   99.9   98.1  100.0  100.0  100.0  100.0

10. EFFECTS OF SIZE OF φ² ON OBSERVED CORRECT DECISION RATES FOR THE FSD2 AND BET PROCEDURES a

                                        Relative true difference, δij/σ
Replicate  Procedure  Set b   φ² b      .3     .6     .9    1.2    1.5    1.8    2.1    2.4
    ?      FSD2        18     .568     5.2   10.4   18.3   28.7   42.0   54.6   66.3   16.1
                       19    1.361     5.6   10.4   19.2   30.1   42.8   51.4   10.9   81.1
                       21    2.663     5.3    9.1   19.1   29.8   42.1   58.4   69.8   83.8
                       22    2.719     5.8   10.9                 44.4   51.0   10.4   82.3
           BET         18     .568     3.1    1.5   13.6   22.2   33.9   45.1   59.3   71.4
                       19    1.361     5.8   10.8   19.5   30.8   42.9   51.0   10.1   80.5
                       21    2.663     6.8   12.4   22.5   34.0   47.6   63.2   13.1   85.6
                       22    2.719     1.2   13.2                 49.1   61.3   14.2   85.0

    ?      FSD2        18     .568     8.5   21.6   42.1   65.1   84.4   94.6   98.5   99.8
                       19    1.361     8.1   22.8   44.2   66.5   84.1   94.1   98.6   99.8
                       21    2.663     8.4   22.4   43.6   61.4   84.8   94.5   99.2   99.8
                       22    2.719     8.1   22.1                 83.4   94.5   98.3   99.1
           BET         18     .568     9.2   22.1   43.9   66.1   84.1   94.6   98.5   99.8
                       19    1.361    11.?   21.9   49.8   11.6   88.0   95.6   99.0   99.8
                       21    2.663    ??.8   28.9   51.4   13.8   89.2   96.4   99.5   99.9
                       22    2.719    12.3   28.6                 88.3   96.5   99.0   99.8

a Percentages are based on simulation of 1,000 experiments.
b See Table 1.

7. CONCLUDING REMARKS

The underlying thrust of the work reported here was to consider the practical
appropriateness of ten procedures for pairwise multiple comparisons among
treatment means. If one agrees with the notion that an experimenter should want
to use a procedure capable of detecting real differences when they exist, then
it seems clear from Tables 7, 8, and 9 that the SSD should never be employed
for pairwise multiple comparisons. This statement is intended to apply only to
the use of Scheffé's procedure as a pairwise multiple comparisons procedure and
not to its use for testing unplanned linear contrasts among means.

For the other procedures studied, the largest differences in correct decision
rates occurred over a range of true differences from about .5σ to about 2.5σ,
and it is in this range of true differences that the TSD and SNK are clearly
inferior in ability to detect real differences. Although the SSD, TSD, and SNK
provide excellent protection against Type I errors (Tables 3 and 4), it is the
authors' feeling that, in evaluation of the various procedures, concern for
ability to detect real differences should receive a high priority. At least for
small numbers of treatments the FSD1 procedure also appears to stress
protection against Type I errors at the expense of sensitivity (Table 7).

Having passed judgment against the procedures which, in the authors' opinion,
overemphasize the importance of Type I errors, it also seems reasonable not to
recommend procedures which unduly deemphasize protection against Type I errors.
From this point of view, then, the ordinary LSD and perhaps the FSD3 can be
eliminated from consideration; in addition, their sensitivities to real
differences are not appreciably greater than those of the FSD2, BSD, BET, and
MRT. These latter four procedures thus constitute a group from which the
consulting statistician or experimenter might generally make a choice.
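
The comparisonwise rates on which these remarks rest can be illustrated with a
small Monte Carlo sketch in Python. It is not the authors' program. It assumes
a completely randomized design with 5 treatments and 4 replications (15 error
degrees of freedom), uses tabulated 5 percent critical values for that layout,
and takes the FSD2 to be the LSD applied only when the analysis of variance F
test is significant at the 5 percent level (the procedures themselves are
defined earlier in the paper, not in this section). The function name, the
seed, and the number of simulated experiments are illustrative choices.

```python
import random
from statistics import mean

# Assumed layout: 5 treatments x 4 replications, completely randomized,
# so the error mean square has 5 * (4 - 1) = 15 degrees of freedom.
TREATMENTS = 5
REPS = 4
T_CRIT = 2.131  # tabulated two-sided 5% Student's t, 15 df
F_CRIT = 3.06   # tabulated 5% F, 4 and 15 df


def simulate(true_means, n_experiments=2000, sigma=1.0, seed=1):
    """Comparisonwise rates (percent) for the pair (0, 1), where
    true_means[0] > true_means[1], under the LSD and the assumed FSD2."""
    assert len(true_means) == TREATMENTS
    rng = random.Random(seed)
    counts = {"LSD": [0, 0], "FSD2": [0, 0]}  # [correct, Type III]
    for _ in range(n_experiments):
        data = [[rng.gauss(mu, sigma) for _ in range(REPS)] for mu in true_means]
        means = [mean(row) for row in data]
        grand = mean(means)
        ms_treat = REPS * sum((m - grand) ** 2 for m in means) / (TREATMENTS - 1)
        ss_error = sum((y - m) ** 2 for row, m in zip(data, means) for y in row)
        ms_error = ss_error / (TREATMENTS * (REPS - 1))
        lsd = T_CRIT * (2.0 * ms_error / REPS) ** 0.5
        diff = means[0] - means[1]
        f_significant = ms_treat / ms_error > F_CRIT
        for name, allowed in (("LSD", True), ("FSD2", f_significant)):
            if allowed and diff > lsd:
                counts[name][0] += 1   # correct decision
            elif allowed and diff < -lsd:
                counts[name][1] += 1   # reverse decision (Type III)
    rates = {}
    for name, (correct, type3) in counts.items():
        c = 100.0 * correct / n_experiments
        r = 100.0 * type3 / n_experiments
        # Type II rate = 100 - correct decision rate - Type III rate.
        rates[name] = {"correct": c, "type_III": r, "type_II": 100.0 - c - r}
    return rates
```

Calling `simulate([1.0, 0.0, 0.0, 0.0, 0.0])` gives rates for a true difference
of 1σ between the first two means. Because the assumed FSD2 can only withhold
declarations that the LSD would make, its correct decision and Type III rates
can never exceed those of the LSD in the same simulated experiments.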

Unfortunately, based on the correct decision and error rates observed in this
study, the choice among the FSD2, BSD, BET, and MRT procedures appears
difficult and not completely objective. While the MRT often produces a lower
frequency of Type I errors, the other three are generally more sensitive in
detecting real differences, especially as the number of treatments increases.
Moreover, the authors concur with the view expressed by Duncan [7] that
dependence of the critical value on the observed analysis of variance F value
is more appealing than dependence on the number of treatments in the
experiment. Since the BET is an improved and more exact version of the BSD, it
seems reasonable to prefer the former even though the latter requires the use
of a shorter and less sophisticated table of values.

Removing the MRT and BSD from further consideration based on these comments,
the choice is narrowed to two procedures, the FSD2 and BET. At the conclusion
of their paper, Waller and Duncan [17] state: "In overall operation, however,
the procedures (referring to the FSD2 and BET) are quite similar, saying a lot,
we feel for the new rule (BET)." Although the efforts of Duncan and Waller have
shed considerable light on the multiple comparisons problem, many consulting
statisticians and subject matter researchers may deem it reasonable to reverse
the conclusion reached in the quoted sentence and state that in overall
operation the procedures are quite similar, saying a lot for the FSD2.

In many, perhaps most, experimental situations, the choice of either the BET or
the FSD2 appears to be reasonable. Many statisticians and subject matter
researchers may prefer the BET; many others may choose the FSD2. The critical
value for the BET is a continuous function of the observed F value; this means
that, in comparison to the FSD2, the BET is a slightly more sensitive procedure
when the observed F value is relatively large (Table 10) and a somewhat more
conservative procedure when the observed F value is low (Table 4).

On the other hand, again in comparison to the FSD2, the BET has a somewhat
greater potential for commission of Type I and Type III errors when the
observed F value is large (Tables 4, 5, and 6) and a slightly lower potential
for detecting real differences when the observed F value is low (Table 10).

Use of the BET is not very difficult (the procedure is easier to apply, for
example, than the MRT), and the necessary tables for its use should be expected
to find their way into future statistical textbooks. Yet, many subject matter
researchers, and many statisticians who consult with them, will find the FSD2
attractive because of its simplicity and the fact that they are already
familiar with Student's t table.

In conclusion, it is the authors' feeling that, at the present time, there is
little or no reason for users of the FSD2 to be critical of those who may
prefer the BET, nor is there reason for users of the BET to be critical of
those who may prefer the FSD2.

[Received December 1971. Revised June 1972.]

REFERENCES

[1] Balaam, L.N., "Multiple Comparisons-A Sampling Experiment," The Australian
    Journal of Statistics, 5 (August 1963), 62-84.
[2] Boardman, T.J. and Moffitt, D.R., "Graphical Monte Carlo Type I Error Rates
    for Multiple Comparison Procedures," Biometrics, 27 (September 1971),
    738-44.
[3] Box, G.E.P. and Muller, M.E., "A Note on the Generation of Normal
    Deviates," Annals of Mathematical Statistics, 29 (June 1958), 610-11.
[4] Carmer, S.G. and Swanson, M.R., "Detection of Differences Between Means: A
    Monte Carlo Study of Five Pairwise Multiple Comparisons Procedures,"
    Agronomy Journal, 63 (November-December 1971), 940-5.
[5] Chen, T.C., "Multiple Comparisons of Population Means," MSc thesis, Iowa
    State University, Ames, Iowa, 1960.
[6] Duncan, D.B., "Multiple-Range and Multiple-F Tests," Biometrics, 11 (March
    1955), 1-42.
[7] Duncan, D.B., "A Bayesian Approach to Multiple Comparisons," Technometrics,
    7 (May 1965), 171-222.
[8] Dunnett, C.W., "Query 272: Multiple Comparison Tests," Biometrics, 26
    (March 1970), 139-41.
[9] Harter, H.L., "Error Rates and Sample Sizes for Range Tests in Multiple
    Comparisons," Biometrics, 13 (December 1957), 511-36.
[10] Harter, H.L., Clemm, D.S. and Guthrie, E.H., "The Probability Integrals of
    the Range and of the Studentized Range-Probability Integral and Percentage
    Points of the Studentized Range; Critical Values for Duncan's New Multiple
    Range Test," Wright-Patterson Air Force Base: Wright Air Development Center
    Technical Report 58-484, Vol. II, 1959.
[11] Hopkins, K.D. and Chadbourn, R.A., "A Schema for Proper Utilization of
    Multiple Comparisons in Research and a Case Study," American Educational
    Research Journal, 4 (November 1967), 407-12.
[12] Hopkins, K.D. and Chadbourn, R.A., "Multiple Comparisons in Research: A
    Response to a Comment," American Educational Research Journal, 6 (November
    1969), 704-5.
[13] Marsaglia, George and Bray, T.A., "One-Line Random Number Generators and
    Their Use in Combinations," Communications of the Association of Computing
    Machinery, 11 (November 1968), 757-59.
[14] Scheffé, H., "A Method for Judging All Contrasts in the Analysis of
    Variance," Biometrika, 40 (June 1953), 87-104.
[15] Steel, R.G.D., "Query 163: Error Rates in Multiple Comparisons,"
    Biometrics, 17 (June 1961), 326-8.
[16] Waller, R.A., "A Bayes Solution to the Symmetric Multiple Comparisons
    Problem," Ph.D. thesis, Johns Hopkins University, Baltimore, Md., 1967.
[17] Waller, R.A., and Duncan, D.B., "A Bayes Rule for the Symmetric Multiple
    Comparisons Problem," Journal of the American Statistical Association, 64
    (December 1969), 1484-503.
