
A Multiple-Comparisons Method Based on the Distribution of the Root Node Distance of a Binary Tree

J. A. DI RIENZO, A. W. GUZMAN, and F. CASANOVES

This article proposes an easy-to-implement cluster-based method for identifying groups of nonhomogeneous means. The method overcomes the common problem of the classical multiple-comparison methods, which lead to the construction of groups that often have substantial overlap. In addition, it solves the problem of other cluster-based methods that do not have a known level of significance and are not easy to apply. The new procedure is compared by simulation with a set of classical multiple-comparison methods and a cluster-based one. Results show that the new procedure compares quite favorably with those included in this article.

Key Words: Cluster analysis; Cluster-based multiple comparisons; Complexity index; Dendrograms; Genotype evaluation.

1. INTRODUCTION
Experimental results in the field of biology and agriculture are often analyzed by means of analysis of variance and the associated techniques of multiple comparisons. The lack of transitivity of multiple-comparison methods makes it very difficult to determine the arrangement of means belonging to homogeneous groups, particularly when the number of means is large. On the other hand, dendrograms produced by some clustering techniques have long been used to arrange entities in nonoverlapping groups. Applied to the means of a set of treatments, they produce a clear picture of the differences and relationships between treatments, but there are no widely accepted criteria to apply to the dendrograms in order to establish groups of statistically homogeneous means.
In order to apply inferential procedures to clustering (commonly used as a descriptive technique) and develop methods for identifying nonoverlapping groups of homogeneous means, some authors have proposed a number of cluster-based multiple-comparison methods.

Julio Alejandro Di Rienzo and Fernando Casanoves are Professors of Estadística y Biometría, Facultad de Ciencias Agropecuarias, Universidad Nacional de Córdoba, CC 509 (5000) Córdoba, Argentina (E-mail: dirienzo@agro.uncor.edu). Adolfo Washington Guzman is a Doctorate Student at the Instituto de Matemática e Estatística, Universidade de São Paulo, Rua do Matão 1010 (05508-900), São Paulo, Brasil.

© 2002 American Statistical Association and the International Biometric Society
Journal of Agricultural, Biological, and Environmental Statistics, Volume 7, Number 2, Pages 129–142

Pioneering work by Scott and Knott (SK) (1974) introduced a divisive algorithm with
a stopping rule based on a criterion that resembles the F-test. Jolliffe (1975) proposed the application of the single linkage clustering algorithm to the matrix of p-values of the Student–Newman–Keuls (SNK) technique. Calinski and Corsten (1985) applied the complete linkage method to the matrix of p-values from Tukey's test. Cox and Spjøtvoll (1982) introduced another approach based on the consideration of all possible partitions, choosing one of them by an F-like test criterion.
Many papers have been written comparing these and other new methods of multiple comparisons with the classical ones. Carmer and Lin (1983) and Willavize, Carmer, and Walker (1980) pointed out the sensitivity of cluster-based methods to increased experimental error. Tasaki, Yoden, and Goto (1987) compared six cluster-based methods of multiple comparisons and concluded that each one had its own area of application.
Most recently, Bautista, Smith, and Steiner (1997) proposed a recursive cluster-based approach to means separation (BSS), combining a nested analysis of variance with a clustering technique. This method, like other recursive methods, does not have a known level of significance, and the nominal level must be seen only as an index. On the other hand, according to its authors, the method does not perform well with equally spaced means. Probably this problem is not unique to the Bautista et al. method but is common to other methods intended to provide nonoverlapping groups, including the one proposed here. Although this is not a desirable property, it is probably of less concern in view of the uncommonness of such a case.
In summary, the main problem with the classical multiple-comparison methods is that they often construct groups of means that may have substantial overlap, whereas the cluster-based methods encountered in the literature that overcome this problem do not have a known level of significance and are usually not easy to apply.
The objective of this work is to develop an easy-to-implement method that will solve the problems of classical and cluster-based techniques, distinguishing groups of nonoverlapping statistically homogeneous means with a known experimentwise error rate while preserving appropriately low comparisonwise Type I and II error rates.

2. THE TEST
Let x_1, …, x_k be a set of uncorrelated means calculated from random samples of size n from normal distributions N(μ_i, σ^2), i = 1, …, k. Define the matrix D = {d_ij} with d_ij = |x_i − x_j| / (S^2/n)^{1/2}, where S^2 is the within-groups common variance (estimated by the analysis of variance mean square error). Applying the average linkage clustering technique to the matrix D, it is possible to obtain a dendrogram (binary tree) whose terminal nodes are the k treatment means. Each node in the dendrogram has a related measure that represents the distance between the clusters that it joins. If S_K and S_L are two clusters and S_K ≠ S_L, the distance between them is defined as

    q(S_K, S_L) = (1 / (#(S_K) #(S_L))) Σ_{x_i ∈ S_K, x_j ∈ S_L} d_ij;

Figure 1. Dendrogram Showing the Relationships Between Means and the Cut-Off Criterion (Q_{1−α}) Obtained by the DGC Test.

otherwise, q(S_K, S_L) = 0. The means with the smallest distance between them appear in the dendrogram joined by the node with associated distance q_1 (Figure 1). The next node will have the associated distance q_2, which represents the distance between the following pair of most alike means or the distance between a mean and the cluster previously formed at distance q_1 (depending on which is the smallest distance). At last the tree ends in the root node at distance q_{k−1}. If Q is the random variable root node distance, the 1 − α quantile of the distribution of Q under the hypothesis H_0: μ_1 = μ_2 = ⋯ = μ_k could be used to construct a test of size α. Hence, any value of Q ≥ Q_{1−α} would lead to the rejection of this hypothesis. Moreover, since the distance associated with every other node is smaller than Q, the critical point for the null distribution of Q is an upper limit for the critical points of the distances associated with them, and the clusters joined by a node with distance greater than the critical point of Q can be declared different from a statistical point of view. The elements d_ij of the distance matrix are distributed, under the usual assumptions of analysis of variance, as (2F_{1,k(n−1)})^{1/2}, and since Q is an average of these distances (under average linkage), the null distribution of Q depends on k and n. We report the 95 and 99% percentage points of the null distribution of Q, obtained by Monte Carlo simulation (Tables I and II of the Appendix), for a selected number of treatments (k) and replications (n). To apply the method, one only needs an average linkage dendrogram obtained from the matrix of Euclidean distances between treatment means, an estimate of the standard error of a mean (SEM), and the appropriate percentage point of the null distribution of Q for a given number

of treatments (k) and replications (n). The α-level cut-off criterion for the dendrogram is obtained as c = SEM × Q_{k,n,1−α}.
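As a sketch of how the procedure above can be implemented, the following Python fragment clusters a set of treatment means with average linkage and cuts the tree at c = SEM × Q_{k,n,1−α}. It assumes SciPy's hierarchical clustering; the function name `dgc_groups` and the `q_crit` argument are illustrative, with the actual Q_{k,n,1−α} values to be taken from the Appendix tables.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def dgc_groups(means, mse, n, q_crit):
    """Cluster treatment means and cut the dendrogram at c = SEM * Q_{k,n,1-alpha}."""
    means = np.asarray(means, dtype=float)
    sem = np.sqrt(mse / n)                  # standard error of a mean
    # average linkage on Euclidean distances between the (one-dimensional) means
    tree = linkage(means.reshape(-1, 1), method="average", metric="euclidean")
    cutoff = sem * q_crit                   # alpha-level cut-off criterion
    root_distance = tree[-1, 2]             # distance associated with the root node
    groups = fcluster(tree, t=cutoff, criterion="distance")
    return root_distance, cutoff, groups
```

Cutting the raw-scale dendrogram at SEM × Q is equivalent to cutting the standardized-distance dendrogram at Q, since every d_ij is the raw distance divided by SEM.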

3. DISTRIBUTION OF Q
The percentage points of the null distribution of Q provided in the Appendix were
obtained by Monte Carlo simulation according to the following algorithm:

(a) k random samples of size n from a normal distribution were generated.


(b) A matrix of absolute differences between sample means was calculated and each
element divided by the standard error of the mean estimated as the square root of
the pooled variance divided by n (D matrix).
(c) Matrix D was used to perform a hierarchical cluster analysis based on the average
linkage principle, implemented in the HIERCL algorithm (option 5) described in
Spath (1980).
(d) The distance q associated with the root node was recorded.
(e) Steps a–d were repeated 1,000 times to obtain estimates of the 95 and 99% percentage points of the null distribution of Q as the sample q(950) and q(990) order statistics.
(f) Steps a–e were repeated 500 times obtaining 500 estimates of the 95 and 99%
percentage points of the null distribution of Q. These estimates were averaged to
produce an entry to the corresponding table.
(g) Steps a–f were repeated for different choices of k and n to complete the tables
presented.
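Steps (a)–(e) above can be condensed into the following sketch, with a reduced repetition count for illustration (the published tables use 1,000 inner runs averaged over 500 repetitions, and the HIERCL implementation rather than SciPy):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def root_distance_quantile(k, n, reps=500, alpha=0.05, seed=0):
    """Estimate the (1 - alpha) percentage point of the root-node distance Q under H0."""
    rng = np.random.default_rng(seed)
    q = np.empty(reps)
    for r in range(reps):
        x = rng.standard_normal((k, n))        # (a) k samples of size n under H0
        means = x.mean(axis=1)
        s2 = x.var(axis=1, ddof=1).mean()      # pooled within-group variance
        # (b) scaling the means by SEM makes Euclidean distances equal to d_ij
        scaled = means / np.sqrt(s2 / n)
        tree = linkage(scaled.reshape(-1, 1), method="average")   # (c)
        q[r] = tree[-1, 2]                     # (d) root-node distance
    return float(np.quantile(q, 1 - alpha))    # (e) empirical percentage point
```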

The average standard error of the percentage point estimates was 0.0018 for Q_{0.95} and 0.0035 for Q_{0.99}.
Simulations were carried out on a Pentium-based computer running a program written in object-oriented Pascal (Borland Delphi 2.0, 1996) and using a random number generator based on the implementation of the RAND3 algorithm (Knuth, 1981) according to Press, Flannery, Teukolsky, and Vetterling (1986).

4. AN EXAMPLE
In a study of chickpea genotype characterization, 14 genotypes were compared in a
randomized complete block design with four replications. One of the studied traits was
the pod’s average length. Four traditional multiple-comparison tests and four cluster-based
tests, including the proposed method (hereafter DGC), were applied.
It is difficult to obtain a simple conclusion about genotype differences based on the traditional multiple-comparison procedures due to overlapping, a frequent situation when many treatment means are compared. Instead, cluster-based multiple-comparison procedures give a simple picture of genotype differences (Figure 2).

Figure 2. Representation of Differences Between Average Length of the Pod (mm) in a Study of Genotype Characterization of 14 Genotypes of Chickpea for Four Classical Multiple-Comparison Procedures and Three Cluster-Based Ones. Equal letters indicate no significant differences between treatment means. (Data from J. Carreras, α = 0.05.)

Applying DGC to this example, the cut-off criterion for the dendrogram is c = 3.05 × (1.203/4)^{1/2} = 1.67 because the number of treatments is k = 14, the number of replications is n = 4, the error mean square is MSE = 1.203, and Q_{14,4,0.95} = 3.05. The resulting dendrogram, with the cut-off point and equal letters identifying nondifferent means, is shown in Figure 3. The proposed procedure yields nonoverlapping groups of means that are easy to interpret. This is also the case for other cluster-based multiple-comparison procedures (Figure 2). The methods BSS and DGC identify the same four groups of treatment means. SK produces the same arrangement except for the allocation of the third highest mean. A different treatment grouping picture is shown by Jolliffe's method, which only distinguishes the smallest mean from the rest.
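The arithmetic of the cut-off criterion above can be checked directly:

```python
# Check of the example's cut-off, c = SEM * Q_{k,n,1-alpha},
# with MSE = 1.203, n = 4, and Q_{14,4,0.95} = 3.05 from the text.
mse, n, q_crit = 1.203, 4, 3.05
sem = (mse / n) ** 0.5          # standard error of a mean, about 0.548
c = q_crit * sem
print(round(c, 2))              # -> 1.67
```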

5. DGC PERFORMANCE EVALUATION


Performance of the proposed method was evaluated by simulation. A Monte Carlo study based on 10,000 simulated completely randomized experiments was used for performance comparisons. Simulations were done for the combinations of k = 5 and k = 20 treatments, two levels of replication (n = 5, n = 10), two degrees of precision (σ = 1 and σ = 2), and two different arrangements of treatment differences (Table 1). The data set identified with letter A corresponds to the example proposed by Willavize et al. (1980), also used by Carmer and Lin (1983), and is a modification of the original example of Scott and Knott (1974). Set B is a subset of A, and sets C and D correspond to the case where all treatment means are equal. The level of significance was 0.05.
Figure 3. Average Linkage Dendrogram Obtained From the Euclidean Distance Between the Average Length of Pod of 14 Genotypes of Chickpea. Equal letters indicate no significant differences between treatment means. (Data from J. Carreras, α = 0.05.)

The values of σ were selected taking into account that the minimum difference between treatment means for data set A was three units. These means are the most easily confounded, and therefore it was assumed that the highest tolerable coefficient of variation (CV) for the differences between means should not be much greater than 40% in the worst case (n = 5). Hence, the maximum selected σ was two.
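The CV figures in this reasoning follow from the standard error of a difference between two means, (2σ²/n)^{1/2}, relative to the minimum true difference of three units. A quick check (the helper name is illustrative):

```python
import math

def cv_of_difference(sigma, n, diff=3.0):
    """CV of a difference of means: sqrt(2*sigma^2/n) / diff."""
    return math.sqrt(2 * sigma ** 2 / n) / diff

# Worst simulated case (sigma = 2, n = 5): CV is about 42%.
print(round(100 * cv_of_difference(2, 5)))   # -> 42
# Most precise case of Willavize et al. (sigma = 1, n = 4): about 24%.
print(round(100 * cv_of_difference(1, 4)))   # -> 24
```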
For each combination of n, k, and σ, the comparisonwise and experimentwise Type I and Type II error rates were estimated. The experimentwise Type I error rate was calculated as the ratio of the number of experiments in which at least one Type I error is actually committed to the total number of experiments in which at least one true difference equals zero. The comparisonwise Type I error rate was calculated as the ratio of the number of comparisons in which a Type I error is actually committed to the total number of comparisons in which the true difference equals zero.
The experimentwise Type II error rate was calculated as the ratio of the number of experiments in which at least one Type II error is actually committed to the total number of experiments in which at least one true difference is not equal to zero. The comparisonwise Type II error rate was calculated as the ratio of the number of comparisons in which a Type II error is actually committed to the total number of comparisons in which the true difference is not equal to zero (Carmer and Swanson, 1973).

Table 1. Sets of Treatment Means Used in the Simulation Study. Sets C and D correspond to the true null hypothesis; sets A and B correspond to the partially true null hypothesis.

k = 20
A 122 112 109 109 106 106 103 103 100 100 100 100 97 97 94 94 91 91 88 78
C 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100

k = 5
B 112 109 109 106 106
D 100 100 100 100 100

Table 2. Comparisonwise and Experimentwise Type I Error Rates Under the True Null Hypothesis

σ n k DGC SK Jolliffe BSS Duncan LSD SNK Tukey

Comparisonwise Type I Error Rates
1 5 5 0.018 0.024 0.002 0.025 0.016 0.017 0.009 0.007
20 0.003 0.014 0.000 0.016 0.005 0.008 0.000 0.000
10 5 0.017 0.024 0.002 0.024 0.014 0.016 0.008 0.006
20 0.002 0.017 0.000 0.016 0.004 0.008 0.000 0.000
2 5 5 0.018 0.025 0.002 0.024 0.016 0.017 0.010 0.007
20 0.002 0.015 0.000 0.015 0.004 0.008 0.000 0.000
10 5 0.018 0.026 0.002 0.025 0.015 0.017 0.009 0.006
20 0.002 0.017 0.000 0.016 0.004 0.007 0.000 0.000

Experimentwise Type I Error Rates
1 5 5 0.038 0.046 0.005 0.050 0.050 0.050 0.043 0.043
20 0.017 0.030 0.000 0.050 0.050 0.050 0.029 0.031
10 5 0.037 0.046 0.004 0.048 0.048 0.048 0.042 0.040
20 0.015 0.036 0.000 0.051 0.051 0.051 0.022 0.022
2 5 5 0.038 0.047 0.005 0.049 0.049 0.049 0.043 0.043
20 0.016 0.030 0.000 0.048 0.048 0.048 0.027 0.028
10 5 0.037 0.049 0.004 0.051 0.051 0.051 0.044 0.041
20 0.013 0.034 0.000 0.049 0.049 0.049 0.022 0.022
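The four error-rate definitions can be sketched as follows, for boolean arrays over simulated experiments (the array names are illustrative): `declared[e, p]` marks pair p declared different in experiment e, and `truth[e, p]` marks pairs whose true difference is nonzero.

```python
import numpy as np

def error_rates(declared, truth):
    """Comparisonwise and experimentwise Type I/II error rates as defined above."""
    type1 = declared & ~truth                    # declared different, truly equal
    type2 = ~declared & truth                    # declared equal, truly different
    comp_t1 = type1.sum() / max((~truth).sum(), 1)
    comp_t2 = type2.sum() / max(truth.sum(), 1)
    has_null = (~truth).any(axis=1)              # experiments with a true zero difference
    has_alt = truth.any(axis=1)                  # experiments with a true nonzero difference
    exp_t1 = type1.any(axis=1)[has_null].mean() if has_null.any() else 0.0
    exp_t2 = type2.any(axis=1)[has_alt].mean() if has_alt.any() else 0.0
    return comp_t1, comp_t2, exp_t1, exp_t2
```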
The error rates were also calculated for the methods of Tukey, Duncan, Student–
Newman and Keuls (SNK), Fisher-LSD (least signiŽ cant difference), and the cluster-based
methods of Scott and Knott (SK), Bautista et al. (BSS), and Jolliffe. Because traditional
use of multiple-comparisonmethods includes a previous overall F -test, all procedures com-
pared were run only after a signiŽ cant F -test. Otherwise, all treatment means were declared
equal in order to calculate comparisonwise and experimentwise error rates.
Table 2 displays the comparisonwise and experimentwise error rates when all treatment
effects are equal to zero (true null hypothesis; data sets C and D). Tables 3 and 4 show the
comparisonwise and experimentwise error rates when some but not all treatment effects
are equal to zero (partially true null hypothesis; data sets A and B). Comparisonwise error
rates in Table 2 were based on 100,000 and 1,900,000 comparisons, depending on whether
k = 5 or k = 20. In Table 3, the number of comparisons for k = 5 and k = 20 were 20,000
and 120,000, respectively, and for Table 4, error rates were based on 80,000 and 1,780,000
comparisons (different true means) for k = 5 and k = 20, respectively.

5.1 TRUE NULL HYPOTHESIS

5.1.1 Experimentwise Type I Error Rates


Under the true null hypothesis (all means equal), DGC shares with the SK, Jolliffe, Tukey, and SNK methods the property of having an experimentwise Type I error rate smaller than the nominal α-level of 0.05 (Table 2). BSS, Duncan, and LSD showed higher error rates, close to the nominal α-level. Jolliffe showed the smallest error rate.

5.1.2 Comparisonwise Type I Error Rates


Regarding comparisonwise Type I error rates, DGC performed similarly to Duncan and LSD and showed smaller error rates than SK and BSS, with greater differences among methods for k = 20 (Table 2). It is also noticeable that DGC has smaller error rates than Duncan and LSD when k is large but not for small k values. In addition, Jolliffe, Tukey, and SNK showed the smallest error rates.
As in the previous case, comparisonwise Type I error rates were not affected by the number of replications or the degree of precision (σ).

5.2 PARTIALLY TRUE NULL HYPOTHESIS

5.2.1 Type I Error Rates Under Partially True Null Hypothesis


When the overall null hypothesis is not true (as is the case for the sets of true means A and B in Table 1), Type I error rates appear to be more sensitive to σ as well as to the number of treatments (k) and the number of replications (Table 3). All cluster-based methods perform similarly in the sense that they are more sensitive to σ, except Jolliffe's method.

Experimentwise Type I Error Rates. Tukey's test is the most conservative method in terms of experimentwise error rate for the cases analyzed in this simulation because it preserves the error rate for experiments in which all the means are equal. When the number of equal means is less than k, Tukey's test appears to be very conservative. The other methods, when σ = 1 and k = 5, showed an average error rate of about 9% and are very similar in this respect. When k = 20, the error rates were 30–40% for the non-cluster-based methods and between 10 and 26% for the cluster-based ones. When σ = 2, the error rates increased in the cluster-based methods, which also showed greater sensitivity to the number of replications, except Jolliffe's method.

Comparisonwise Type I Error Rates. Except for Tukey's test, which shows the lowest comparisonwise error rate, for σ = 1 all other methods show an average error rate close to 4%, with the cluster-based methods having a smaller or equal error rate compared with the non-cluster-based ones. None of the methods compared showed differences attributable to the number of replications. The cluster-based methods showed a decreasing error rate with the number of means compared, DGC being the least sensitive in this respect. The increase in variance (σ = 2) does not seem to be important for the non-cluster-based methods, but it is for the cluster-based ones, except Jolliffe. The error rates increased to 12–17% for n = 5 and to 7–10% for n = 10. Nevertheless, the DGC method showed less sensitivity to the increase in variance than the other cluster-based methods, except Jolliffe.

Table 3. Comparisonwise and Experimentwise Type I Error Rates Under the Partially True Null Hypothesis

σ n k DGC SK Jolliffe BSS Duncan LSD SNK Tukey

Comparisonwise Type I Error Rates
1 5 5 0.040 0.046 0.049 0.037 0.049 0.049 0.049 0.008
20 0.030 0.028 0.027 0.016 0.044 0.051 0.032 0.001
10 5 0.041 0.032 0.050 0.036 0.050 0.050 0.050 0.006
20 0.028 0.026 0.028 0.011 0.042 0.050 0.032 0.000
2 5 5 0.115 0.135 0.040 0.117 0.048 0.049 0.045 0.008
20 0.138 0.167 0.020 0.149 0.042 0.050 0.027 0.001
10 5 0.072 0.100 0.051 0.069 0.050 0.051 0.050 0.007
20 0.069 0.094 0.026 0.066 0.041 0.049 0.031 0.000

Experimentwise Type I Error Rates
1 5 5 0.076 0.089 0.094 0.071 0.094 0.094 0.094 0.016
20 0.224 0.221 0.263 0.105 0.362 0.404 0.294 0.006
10 5 0.080 0.063 0.097 0.071 0.096 0.097 0.097 0.012
20 0.226 0.202 0.277 0.070 0.363 0.407 0.303 0.004
2 5 5 0.218 0.261 0.078 0.221 0.093 0.094 0.086 0.016
20 0.715 0.770 0.207 0.729 0.356 0.407 0.264 0.007
10 5 0.132 0.188 0.098 0.124 0.097 0.098 0.097 0.015
20 0.442 0.541 0.261 0.403 0.354 0.404 0.289 0.005

5.2.2 Type II Error Rates Under Partially True Null Hypothesis


Experimentwise Type II Error Rates. For σ = 1, DGC and SK have the smallest experimentwise Type II error rates, the lowest being that of DGC (Table 4). Jolliffe, BSS, Duncan, LSD, and SNK showed error rates two or three times greater than those of DGC or SK. Tukey's test showed the highest rate, approximately 10 times that of the other non-cluster-based methods. An increase in the number of treatments and a decrease in the number of replications are followed by an increase in the error rates. When σ = 2, the experimentwise Type II error rates increased for all the methods, but the most powerful are the cluster-based ones (except Jolliffe) and, within these, DGC is the best.

Comparisonwise Type II Error Rates. Regarding the comparisonwise Type II error rates, for σ = 1 all were small, and the smallest were those of DGC. When σ = 2, the rates increased, and the effects of the number of treatments and the number of replications became more evident. The smallest comparisonwise Type II error rates were those of the DGC, SK, and BSS methods.

Table 4. Comparisonwise and Experimentwise Type II Error Rates Under the Partially True Null Hypothesis

σ n k DGC SK Jolliffe BSS Duncan LSD SNK Tukey


Comparisonwise Type II Error Rates
1 5 5 0.003 0.004 0.014 0.017 0.004 0.004 0.006 0.036
20 0.000 0.001 0.003 0.001 0.001 0.001 0.001 0.028
10 5 0.000 0.000 0.000 0.010 0.000 0.000 0.000 0.000
20 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
2 5 5 0.278 0.274 0.600 0.256 0.311 0.291 0.393 0.526
20 0.066 0.061 0.366 0.062 0.080 0.071 0.111 0.204
10 5 0.058 0.060 0.180 0.058 0.076 0.070 0.101 0.228
20 0.013 0.011 0.075 0.017 0.019 0.017 0.029 0.122

Experimentwise Type II Error Rates


1 5 5 0.012 0.018 0.037 0.027 0.030 0.030 0.038 0.203
20 0.033 0.049 0.108 0.100 0.093 0.092 0.111 0.957
10 5 0.000 0.000 0.000 0.010 0.000 0.000 0.000 0.001
20 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.041
2 5 5 0.746 0.784 0.908 0.734 0.876 0.873 0.909 0.985
20 0.998 0.995 1.000 0.999 1.000 1.000 1.000 1.000
10 5 0.206 0.225 0.429 0.223 0.384 0.380 0.430 0.799
20 0.582 0.588 0.948 0.770 0.911 0.905 0.951 1.000

6. DISCUSSION
Taking into account the experimentwise Type I error rate under the true null hypothesis, all methods worked well in controlling this error rate. Jolliffe had the best performance, with an actual α-level far below the nominal one, followed by DGC.
Contrasting comparisonwise error rates, the best performance was shown by Jolliffe, Tukey, and SNK, followed by DGC, Duncan, and LSD, with SK and BSS being the worst. However, in all the cases, error rates were below the nominal level of significance. DGC showed the largest error rate drop for all combinations of number of replications and degree of precision when the number of means changed from small (k = 5) to large (k = 20).
When the null hypothesis is false, the results must be interpreted taking into account the magnitude of the standard deviation. At this point, it is important to discuss what acceptable values of the standard deviation are in order to evaluate the performance of these methods. Carmer and Lin (1983) and Willavize et al. (1980) pointed out the sensitivity of the cluster-based algorithms, indicating that, for high experimental errors, the Type I error rates were comparatively very high, and they recommended using cluster-based methods with caution and only when the experimental error was small.
Both groups of authors drew similar conclusions working with the set of means labeled A in this article, using σ = 1, 5, and 10 and n = 4. They observed that, for the average of the means in set A, the proposed standard deviations correspond to CVs between 1 and 10%. But it is easy to see that there are 11 groups of distinct means and that the smallest distance between groups is three units in 8 of the 10 interdistances between adjacent groups. The number of replications to compare these treatments should be determined to estimate differences of three units with a standard error that, in terms of CV, should not be greater than 10–20%. Using σ = 1 and n = 4, the standard error of a difference is (2σ²/n)^{1/2} ≈ 0.71, which represents a CV of 24% for a difference of three units. This is the best situation these authors considered in their experimental set-up. Meanwhile, in our worst scenario (σ = 2 and n = 5), the standard error of a difference of three units represents a CV of 42%. It is essential to compare the performance of the methods in the context of a well-designed experiment; otherwise, we are asking that the comparison method not only do its job but also overcome the deficiencies of the design.
Examining the case with σ = 1 and n = 5, which could be considered an extreme case within acceptable designs for pairwise comparisons (a CV of 21% for the minimum difference between means), all cluster-based methods perform satisfactorily. They were better than or equal to the non-cluster-based ones in terms of Type I error. DGC and SK showed the best performance in terms of power with respect to every other cluster-based or non-cluster-based method and are clearly much more powerful than Tukey's and Jolliffe's tests.
Comparing DGC with SK, the latter showed the highest comparisonwise and experimentwise Type I error rates under the true null hypothesis. In the case of the partially true null hypothesis, comparisonwise and experimentwise Type I error rates were very similar for σ = 1, but DGC performed better when σ = 2. Considering experimentwise Type II error rates, DGC was the best, whereas comparisonwise Type II error rates were very similar.
Finally, DGC, like other cluster-based methods, yields nonoverlapping groups of homogeneous means, a property recognized by Jolliffe, Allen, and Christie (1989) as the most desirable one of these procedures. From the point of view of the complexity index of Shaffer (1981), when error rates are similar, the smallest complexity index is a useful criterion that always supports the use of partitioning methods.
There are many additional subjects that should be explored, such as the theoretical null distribution of Q, the effect of departures from the assumptions of normality and homogeneity of variance, and the problem of an unequal number of replications. We leave those topics for future research.

[Received April 1999. Accepted May 2001.]

REFERENCES
Bautista, M. G., Smith, D. W., and Steiner, R. L. (1997), “A Cluster-Based Approach to Means Separation,” Journal
of Agricultural, Biological, and Environmental Statistics, 2, 179–197.
Calinski, T., and Corsten, L. C. A. (1985), “Clustering Means in ANOVA by Simultaneous Testing,” Biometrics,
41, 39–48.
Carmer, S. G., and Lin, W. T. (1983), “Type I Error Rates for Divisive Clustering Methods for Grouping Means in
Analysis of Variance,” Communications in Statistics Simulation and Computation, Series B, 12, 451–466.
Carmer, S. G., and Swanson, M. R. (1973), “An Evaluation of Ten Pairwise Multiple Comparison Procedures by
Monte Carlo Methods,” Journal of the American Statistical Association, 68, 66–74.
Cox, D. R., and Spjøtvoll, E. (1982), "On Partitioning Means Into Groups," Scandinavian Journal of Statistics, 9, 147–152.

Gates, C. E., and Bilbro, J. D. (1978), “Illustration of Cluster Analysis Method for Means Separation,” Agronomy
Journal, 70, 462–465.
Jolliffe, I. T. (1975), “Cluster Analysis as a Multiple Comparison Method,” Applied Statistics, Proceedings of
Conference at Dalhousie University, Halifax, 159–168.
Jolliffe, I. T., Allen, O. B., and Christie, B. R. (1989), “Comparison of Variety Means Using Cluster Analysis and
Dendrograms,” Experimental Agriculture, 25, 259–269.
Knuth, D. E. (1981), The Art of Computer Programming (Vol. 2: Seminumerical Algorithms, 2nd ed.), Reading, MA: Addison-Wesley.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1986), Numerical Recipes, Cambridge: Cambridge University Press.
Scott, A. J., and Knott, M. (1974), “A Cluster Analysis Method for Grouping Means in the Analysis of Variance,”
Biometrics, 30, 507–512.
Shaffer, J. P. (1981), “Complexity: An Interpretability Criterion for Multiple Comparisons,” Journal of the Amer-
ican Statistical Association, 76, 395–401.
Spath, H. (1980), Cluster Analysis Algorithms, New York: Wiley.
Tasaki, T., Yoden, A., and Goto, M. (1987), “Graphical Data Analysis in Comparative Experimental Studies,”
Computational Statistics & Data Analysis, 5, 113–125.
Willavize, S. A., Carmer, S. G., and Walker, W. M. (1980), “Evaluation of Cluster Analysis for Comparing
Treatment Means,” Agronomy Journal, 72, 317–320.

APPENDIX
