1. INTRODUCTION
Experimental results in the fields of biology and agriculture are often analyzed by means
of analysis of variance and the associated techniques of multiple comparisons. The lack
of transitivity of multiple-comparison methods makes it very difficult to determine the
arrangement of means belonging to homogeneous groups, particularly when the number of
means is large. On the other hand, dendrograms produced by some clustering techniques
have long been used to arrange entities in nonoverlapping groups. Applied to the means of a
set of treatments, they produce a clear picture of the differences and relationships between
treatments, but there are no widely accepted criteria to apply to the dendrograms in order
to establish groups of statistically homogeneous means.
In order to apply inferential procedures to clustering (commonly used as a descriptive
technique) and develop methods for identifying nonoverlapping groups of homogeneous
means, some authors have proposed a number of cluster-based multiple-comparison methods.
Pioneering work by Scott and Knott (SK) (1974) introduced a divisive algorithm with
a stopping rule based on a criterion that resembles the F-test. Jolliffe (1975) proposed
the application of the single linkage clustering algorithm to the matrix of p-values of the
Student–Newman–Keuls (SNK) technique. Calinski and Corsten (1985) applied the
complete linkage method to the matrix of p-values from Tukey's test. Cox and Spjøtvoll
(1982) introduced another approach based on the consideration of all possible partitions,
choosing one of them by an F-like test criterion.
Julio Alejandro Di Rienzo and Fernando Casanoves are Professors of Estadística y Biometría, Facultad de Ciencias
Agropecuarias, Universidad Nacional de Cordoba, CC 509 (5000) Cordoba, Argentina (E-mail: dirienzo@agro.uncor.edu).
Adolfo Washington Guzman is a doctoral student at the Instituto de Matematica e Estadística,
Universidad de Sao Paulo, Rua do Matao 1010 (05508-900), Sao Paulo, Brasil.
130 J. A. DI RIENZO, A. W. GUZMAN, AND F. CASANOVES
Many papers have compared these and other new multiple-comparison methods
with the classical ones. Carmer and Lin (1983) and Willavize, Carmer, and
Walker (1980) pointed out the sensitivity of cluster-based methods to increased
experimental error. Tasaki, Yoden, and Goto (1987) compared six cluster-based methods of
multiple comparisons and concluded that each had its own area of application.
More recently, Bautista, Smith, and Steiner (1997) proposed a recursive cluster-based
approach to means separation (BSS) combining a nested analysis of variance with a
clustering technique. This method, like other recursive methods, does not have a known level of
significance, and the nominal level must be seen only as an index. Moreover,
according to its authors, the method does not perform well with equally spaced means.
This problem is probably not unique to the Bautista et al. method but common to other methods
intended to provide nonoverlapping groups, including the one proposed here. Although this
is not a desirable property, it is probably of minor concern because such a case is
uncommon.
In summary, the main problem with the classical multiple-comparison methods is that
they often construct groups of means that may have substantial overlap, whereas the cluster-
based methods encountered in the literature that overcome this problem do not have a known
level of significance and are usually not easy to apply.
The objective of this work is to develop an easy-to-implement method that solves the
problems of classical and cluster-based techniques, distinguishing nonoverlapping groups of
statistically homogeneous means with a known experimentwise error rate while preserving
appropriately low comparisonwise Type I and Type II error rates.
2. THE TEST
Let $\bar{x}_1, \ldots, \bar{x}_k$ be a set of uncorrelated means calculated from random samples of size
$n$ from normal distributions $N(\mu_i, \sigma^2)$, $i = 1, \ldots, k$. Define the matrix $D = \{d_{ij}\}$ with
$d_{ij} = |\bar{x}_i - \bar{x}_j| / (S^2/n)^{1/2}$, where $S^2$ is the within-groups common variance (estimated
by the analysis of variance mean square error). Applying the average linkage clustering
technique to the matrix $D$, it is possible to obtain a dendrogram (binary tree) whose terminal
nodes are the $k$ treatment means. Each node in the dendrogram has a related measure that
represents the distance between the clusters that it joins. If $S_K$ and $S_L$ are two clusters and
$S_K \neq S_L$, the distance between them is defined as
$$q(S_K, S_L) = \frac{1}{\#(S_K)\,\#(S_L)} \sum_{\bar{x}_i \in S_K,\; \bar{x}_j \in S_L} d_{ij};$$
otherwise, $q(S_K, S_L) = 0$.
IDENTIFYING GROUPS OF NONHOMOGENEOUS MEANS 131
Figure 1. Dendrogram Showing the Relationships Between Means and the Cut-Off Criterion ($Q_{1-\alpha}$) Obtained
by the DGC Test.
The means with smallest distance between them appear in the
dendrogram joined by the node with associated distance $q_1$ (Figure 1). The next node will
have the associated distance $q_2$, which represents the distance between the next most similar
pair of means or the distance between a mean and the cluster previously formed at
distance $q_1$ (whichever is smaller). Finally, the tree ends in the root
node at distance $q_{k-1}$. If $Q$ is the random variable root node distance, the $1 - \alpha$ quantile
of the distribution of $Q$ under the hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ could be used to
construct a test of size $\alpha$. Hence, any value of $Q \geq Q_{1-\alpha}$ would lead to the rejection of this
hypothesis. Moreover, since the distance associated with every other node is smaller than
$Q$, the critical point for the null distribution of $Q$ is an upper limit for the critical point of
the distance associated with them, and the clusters joined by a node with distance greater
than the critical point of $Q$ can be declared different from a statistical point of view. The
elements $d_{ij}$ of the distance matrix are distributed, under the usual assumptions of analysis
of variance, as $(2F_{1,k(n-1)})^{1/2}$, and since $Q$ is an average of these distances (under average
linkage), the null distribution of $Q$ depends on $k$ and $n$. We report the 95% and 99% percentage
points of the null distribution of $Q$, obtained by Monte Carlo simulation (Tables I and II
of the Appendix), for a selected number of treatments ($k$) and replications ($n$). To apply
the method, one only needs an average linkage dendrogram obtained from the matrix of
Euclidean distances between treatment means, an estimate of the standard error of a mean
(SEM), and the appropriate percentage point of the null distribution of $Q$ for a given number
of treatments ($k$) and replications ($n$). The $\alpha$-level cut-off criterion for the dendrogram is
obtained as $c = \mathrm{SEM} \times Q_{k,n,1-\alpha}$.
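The procedure described in this section can be illustrated with a minimal Python sketch. It is not the authors' program: the function names `average_linkage` and `dgc_groups`, the naive clustering loop, and the input values in the usage note are assumptions made for illustration. The sketch builds the average linkage tree on the absolute differences between treatment means and keeps together only the clusters joined at a distance no greater than c = SEM × Q.

```python
from itertools import combinations


def average_linkage(values):
    """Agglomerate 1-D values by average linkage.

    Returns the merges as (members_a, members_b, distance) in merge order.
    The distance between two clusters is the mean absolute difference
    over all cross pairs, as in the definition of q(S_K, S_L).
    """
    clusters = [(i,) for i in range(len(values))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            dist = sum(abs(values[i] - values[j]) for i in a for j in b)
            dist /= len(a) * len(b)
            if best is None or dist < best[0]:
                best = (dist, a, b)
        dist, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, dist))
    return merges


def dgc_groups(means, sem, q_crit):
    """Cut the dendrogram at c = sem * q_crit and return one group label per mean."""
    cutoff = sem * q_crit
    parent = list(range(len(means)))  # union-find over treatment indices

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for a, b, dist in average_linkage(means):
        if dist <= cutoff:  # nodes above the cut-off separate groups
            parent[find(a[0])] = find(b[0])
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(means))]
```

With hypothetical inputs, `dgc_groups([10.0, 10.5, 14.0, 14.2, 20.0], sem=0.5, q_crit=2.0)` cuts the tree at c = 1.0 and yields three groups. In practice `q_crit` would be read from the tabulated percentage points of Q for the given k and n.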
3. DISTRIBUTION OF Q
The percentage points of the null distribution of Q provided in the Appendix were
obtained by Monte Carlo simulation according to the following algorithm:
The average standard error of the percentage point was 0.0018 for $Q_{0.95}$ and 0.0035
for $Q_{0.99}$.
Simulations were carried out on a Pentium-based computer running a program
written in object-oriented Pascal (Borland Delphi 2.0, 1996) and using a random
number generator based on the implementation of the RAND3 algorithm (Knuth, 1981)
given by Press, Flannery, Teukolsky, and Vetterling (1986).
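As a rough illustration of how percentage points of Q might be approximated, the following Python sketch simulates the root node distance under the null hypothesis using only the definitions of Section 2. It is not the authors' algorithm or program. The sketch assumes a completely randomized design with error degrees of freedom k(n − 1), sets σ = 1 without loss of generality, and uses Python's standard generator rather than RAND3; the function names and the number of simulations are hypothetical.

```python
import random


def root_node_distance(means, scale):
    """Root distance of the average linkage tree on |mean_i - mean_j| / scale."""
    clusters = [[m] for m in means]
    dist = 0.0
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(abs(x - y) for x in clusters[a] for y in clusters[b])
                d /= len(clusters[a]) * len(clusters[b]) * scale
                if best is None or d < best[0]:
                    best = (d, a, b)
        dist, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return dist  # distance of the last merge, i.e., the root node


def simulate_q_quantile(k, n, prob, n_sim=2000, seed=0):
    """Empirical quantile of Q under H0 (all k means equal, sigma = 1)."""
    rng = random.Random(seed)
    df = k * (n - 1)  # assumed error degrees of freedom
    draws = []
    for _ in range(n_sim):
        # Each treatment mean is N(0, 1/n) under H0.
        means = [rng.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(k)]
        # S^2 ~ chi-square(df) / df, drawn as Gamma(df / 2, scale 2) / df.
        s2 = rng.gammavariate(df / 2.0, 2.0) / df
        draws.append(root_node_distance(means, (s2 / n) ** 0.5))
    draws.sort()
    return draws[min(n_sim - 1, int(prob * n_sim))]
```

A call such as `simulate_q_quantile(14, 4, 0.95)` gives a noisy estimate; many more simulations would be needed to reach standard errors of the order reported above.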
4. AN EXAMPLE
In a study of chickpea genotype characterization, 14 genotypes were compared in a
randomized complete block design with four replications. One of the studied traits was
the pod’s average length. Four traditional multiple-comparison tests and four cluster-based
tests, including the proposed method (hereafter DGC), were applied.
It is difficult to draw a simple conclusion about genotype differences from the
traditional multiple-comparison procedures because of overlapping, a frequent situation when
many treatment means are compared. In contrast, cluster-based multiple-comparison
procedures give a simple picture of genotype differences (Figure 2).
Figure 2. Representation of Differences Between Average Length of the Pod (mm) in a Study of Genotype
Characterization of 14 Genotypes of Chickpea for Four Classical Multiple-Comparison Procedures and Three
Cluster-Based Ones. Equal letters indicate no significant differences between treatment means. (Data from
J. Carreras, $\alpha = 0.05$.)
Applying DGC to this example, the cut-off criterion for the dendrogram is
$c = 3.05 \times (1.203/4)^{1/2} = 1.67$ because the number of treatments is $k = 14$, the number of
replications is $n = 4$, the error mean square is MSE = 1.203, and $Q_{14,4,0.95} = 3.05$. The resulting
dendrogram includes the cut-off point, and equal letters identify means that do not differ
(Figure 3). The proposed procedure yields nonoverlapping groups of means that are
easy to interpret. This is also the case for other cluster-based multiple-comparison
procedures (Figure 2). The methods BSS and DGC identify the same four groups of treatment
means. SK produces the same arrangement except for the allocation of the third highest
mean. A different grouping is given by Jolliffe's method, which only
distinguishes the smallest mean from the rest.
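The arithmetic of the cut-off in this example can be checked with a few lines of Python. The helper name `dgc_cutoff` is hypothetical; the inputs (MSE = 1.203, n = 4, Q = 3.05) are the values quoted in the text.

```python
import math


def dgc_cutoff(mse, n, q_crit):
    """Cut-off c = SEM * Q, where SEM = sqrt(MSE / n)."""
    return q_crit * math.sqrt(mse / n)


c = dgc_cutoff(mse=1.203, n=4, q_crit=3.05)
print(round(c, 2))  # 1.67
```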
Figure 3. Average Linkage Dendrogram Obtained From the Euclidean Distance Between the Average Length of
Pod of 14 Genotypes of Chickpea. Equal letters indicate no significant differences between treatment means. (Data
from J. Carreras, $\alpha = 0.05$.)
differences between means should not be much greater than 40% in the worst case (n = 5).
Hence, the maximum selected $\sigma$ was two.
Table 1. Sets of Treatment Means Used in Simulation Study. Sets C and D correspond to the true null
hypothesis; sets A and B correspond to the partially true null hypothesis.
k = 20
A 122 112 109 109 106 106 103 103 100 100 100 100 97 97 94 94 91 91 88 78
C 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
k = 5
B 112 109 109 106 106
D 100 100 100 100 100
Table 2. Comparisonwise and Experimentwise Type I Error Rates Under True Null Hypothesis
For each combination of $n$, $k$, and $\sigma$, the comparisonwise and experimentwise Type I
and Type II error rates were estimated. The experimentwise Type I error rate was calculated
as the number of experiments in which at least one Type I error is actually
committed divided by the total number of experiments in which at least one true difference
equals zero. The comparisonwise Type I error rate was calculated as the number of
comparisons in which a Type I error is actually committed divided by the total number of
comparisons in which the true difference equals zero.
The experimentwise Type II error rate was calculated as the number of
experiments in which at least one Type II error is actually committed divided by the total
number of experiments in which at least one true difference is not equal to zero. The
comparisonwise Type II error rate was calculated as the number of comparisons
in which a Type II error is actually committed divided by the total number of comparisons
in which the true difference is not equal to zero (Carmer and Swanson, 1973).
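The four error-rate definitions can be sketched in Python as follows. This is an illustration, not the simulation program used in the study: the function names and the representation of each experiment's outcome as a list of group labels (two treatments are declared different when their labels differ) are assumptions.

```python
from itertools import combinations


def count_errors(true_means, labels):
    """Count pairwise Type I and Type II errors in one experiment."""
    counts = {"type1": 0, "null_pairs": 0, "type2": 0, "alt_pairs": 0}
    for i, j in combinations(range(len(true_means)), 2):
        declared_different = labels[i] != labels[j]
        if true_means[i] == true_means[j]:
            counts["null_pairs"] += 1
            counts["type1"] += int(declared_different)
        else:
            counts["alt_pairs"] += 1
            counts["type2"] += int(not declared_different)
    return counts


def error_rates(true_means, experiments):
    """experiments: one list of group labels per simulated experiment."""
    pairs = {"type1": 0, "null_pairs": 0, "type2": 0, "alt_pairs": 0}
    exps = {"type1": 0, "null_pairs": 0, "type2": 0, "alt_pairs": 0}
    for labels in experiments:
        counts = count_errors(true_means, labels)
        for key, value in counts.items():
            pairs[key] += value           # comparisonwise tallies
            exps[key] += int(value > 0)   # experimentwise tallies

    def rate(num, den):
        return num / den if den else None

    return {
        "comparisonwise_type1": rate(pairs["type1"], pairs["null_pairs"]),
        "experimentwise_type1": rate(exps["type1"], exps["null_pairs"]),
        "comparisonwise_type2": rate(pairs["type2"], pairs["alt_pairs"]),
        "experimentwise_type2": rate(exps["type2"], exps["alt_pairs"]),
    }
```

For example, with the means of set B and two hypothetical outcomes, `error_rates([112, 109, 109, 106, 106], [[0, 1, 1, 2, 2], [0, 0, 0, 1, 2]])` counts one Type I error among four null comparisons and two Type II errors among sixteen non-null comparisons.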
The error rates were also calculated for the methods of Tukey, Duncan,
Student–Newman–Keuls (SNK), Fisher-LSD (least significant difference), and the cluster-based
methods of Scott and Knott (SK), Bautista et al. (BSS), and Jolliffe. Because traditional
use of multiple-comparison methods includes a previous overall F-test, all procedures
compared were run only after a significant F-test. Otherwise, all treatment means were declared
equal in order to calculate comparisonwise and experimentwise error rates.
Table 2 displays the comparisonwise and experimentwise error rates when all treatment
effects are equal to zero (true null hypothesis; data sets C and D). Tables 3 and 4 show the
comparisonwise and experimentwise error rates when some but not all treatment effects
are equal to zero (partially true null hypothesis; data sets A and B). Comparisonwise error
rates in Table 2 were based on 100,000 and 1,900,000 comparisons for
k = 5 and k = 20, respectively. In Table 3, the numbers of comparisons for k = 5 and k = 20 were 20,000
and 120,000, respectively, and for Table 4, error rates were based on 80,000 and 1,780,000
comparisons (different true means) for k = 5 and k = 20, respectively.
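These totals can be related to the sets of means in Table 1 by counting, for each set, the pairs with equal and with different true means. The short Python sketch below does this for sets A and B; the function name `pair_counts` is hypothetical.

```python
from collections import Counter
from math import comb

SET_A = [122, 112, 109, 109, 106, 106, 103, 103, 100, 100,
         100, 100, 97, 97, 94, 94, 91, 91, 88, 78]
SET_B = [112, 109, 109, 106, 106]


def pair_counts(means):
    """Return (all pairs, pairs with equal true means, pairs with different true means)."""
    total = comb(len(means), 2)
    equal = sum(comb(count, 2) for count in Counter(means).values())
    return total, equal, total - equal


for name, means in (("A", SET_A), ("B", SET_B)):
    print(name, *pair_counts(means))
# A 190 12 178
# B 10 2 8
```

Dividing each reported total by the corresponding per-experiment count (10 and 190 for Table 2, 2 and 12 for Table 3, 8 and 178 for Table 4) gives 10,000 in every case. This is consistent with 10,000 simulated experiments per configuration, although that figure is an inference and is not stated in this passage.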
Experimentwise Type I Error Rates. Tukey's test is the most conservative method in
terms of experimentwise error rate for the cases analyzed in this simulation because it
preserves the error rate for experiments in which all the means are equal. When the number
of equal means is less than $k$, Tukey's test appears to be very conservative. Other methods,
when $\sigma = 1$ and $k = 5$, showed an average error rate of about 9% and are very similar in
this respect. When $k = 20$, the error rates were 30–40% in the non-cluster-based methods
and between 10 and 26% in the cluster-based ones. When $\sigma = 2$, the error rates increased in
the cluster-based methods that also showed greater sensitivity to the number of replications,
except Jolliffe's method.
Comparisonwise Type I Error Rates. Except for Tukey's test, which shows the lowest
comparisonwise error rate, for $\sigma = 1$ all other methods show an average error rate close to
4%, with the cluster-based methods having a smaller or equal error rate compared with the
non-cluster-based ones. None of the methods compared showed differences attributable to
the number of replications. The cluster-based methods showed a decreasing error rate with
the number of means compared, DGC being the least sensitive in this respect. The increase
in variance ($\sigma = 2$) does not seem to be important for the non-cluster-based methods, but it
is for the cluster-based ones except Jolliffe's. The error rates increased to 12–17% for n = 5
and to 7–10% for n = 10. Nevertheless, the DGC method showed less sensitivity to the
increase in variance than the other cluster-based methods except Jolliffe's.
Table 3. Comparisonwise and Experimentwise Type I Error Rates Under Partially True Null Hypothesis
Table 4. Comparisonwise and Experimentwise Type II Error Rates Under Partially True Null Hypothesis
6. DISCUSSION
Taking into account the experimentwise Type I error rate under the true null hypothesis,
all methods controlled this error rate well. Jolliffe's method had the best performance,
with an actual $\alpha$-level far below the nominal one, followed by DGC.
In terms of comparisonwise error rates, the best performance was shown by Jolliffe,
Tukey, and SNK, followed by DGC, Duncan, and LSD, with SK and BSS being the worst.
However, in all cases, error rates were below the nominal level of significance. DGC
showed the largest drop in error rate, for all combinations of number of replications and degree
of precision, when the number of means changed from small (k = 5) to
large (k = 20).
When the null hypothesis is false, the results must be interpreted taking into account the
magnitude of the standard deviation. At this point, it is important to discuss what acceptable
values of the standard deviation are in order to evaluate the performance of these methods.
Carmer and Lin (1983) and Willavize et al. (1980) pointed out the sensitivity of the cluster-
based algorithms, indicating that, for high experimental errors, the Type I error rates were
comparatively very high, and they recommended the use of the cluster-based methods with
caution and only when the experimental error was small.
Both groups of authors drew similar conclusions working with the set of means labeled A in this
article using $\sigma$ = 1, 5, and 10 and n = 4. They observed that, for the average of the means
in set A, the standard deviations proposed correspond to CVs between 1 and 10%. But it
is easy to see that there are 11 groups of distinct means and that the distance between
adjacent groups is three units in 8 of the 10 cases. The number of
replications to compare these treatments should be determined to estimate differences of
three units with a standard error that, in terms of CV, should not be greater than 10–20%.
Using $\sigma = 1$ and n = 4, we have that the standard deviation of the difference is one,
and this represents a CV of 24% for a difference of three units. This is the best situation
these authors considered in their experimental set-up. Meanwhile, in our worst scenario
($\sigma = 2$ and n = 5), the standard error of a difference of three represented a CV of 42%. It
is essential to compare the performance of the methods in the context of a well-designed
experiment; otherwise, we are asking that the comparison method not only do its job but
also overcome the deficiencies in the design.
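The CV figures quoted in this discussion can be reproduced with a short calculation. The sketch assumes that the CV is the standard error of a difference between two means, taken as sqrt(2σ²/n), divided by the difference of three units; this reading is not spelled out in the text, but it matches the percentages given for σ = 1 with n = 4, for σ = 2 with n = 5, and for the σ = 1, n = 5 case also considered in this section. The function name is hypothetical.

```python
import math


def cv_of_difference(sigma, n, difference):
    """Standard error of a difference of two means, relative to that difference."""
    return math.sqrt(2.0 * sigma ** 2 / n) / difference


for sigma, n in ((1, 4), (2, 5), (1, 5)):
    print(sigma, n, round(100 * cv_of_difference(sigma, n, 3)))
# 1 4 24
# 2 5 42
# 1 5 21
```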
Examining the case with $\sigma = 1$ and n = 5, which could be considered an extreme case
within acceptable designs for pairwise comparisons, with a CV of 21% for the minimum
differences between means, all cluster-based methods perform satisfactorily. They were
better than or equal to the non-cluster-based ones in terms of Type I error. DGC and SK
showed the best performance in terms of power with respect to every other cluster-based
or non-cluster-based method and are clearly much more powerful than Tukey's and Jolliffe's
tests.
Comparing DGC to SK, the latter showed the highest comparisonwise and experimentwise
Type I error rates under the true null hypothesis. In the case of the partially true null
hypothesis, comparisonwise and experimentwise Type I error rates were very similar for
$\sigma = 1$, but DGC performed better when $\sigma = 2$. Considering experimentwise Type II error
rates, DGC was the best, but in comparisonwise Type II error rates, the two methods were very similar.
Finally, DGC, like other cluster-based methods, yields nonoverlapping groups of
homogeneous means, a property recognized by Jolliffe, Allen, and Christie (1989) as the most
desirable one of these procedures. From the point of view of the complexity index of Shaffer
(1981), when error rates are similar, the smallest complexity index is a useful criterion that
always supports the use of partitioning methods.
There are many additional subjects that should be explored, such as the theoretical null
distribution of Q, the effect of departures from the assumptions of normality and homogeneity
of variance, and the problem of unequal numbers of replications. We leave those topics for
future research.
REFERENCES
Bautista, M. G., Smith, D. W., and Steiner, R. L. (1997), “A Cluster-Based Approach to Means Separation,” Journal
of Agricultural, Biological, and Environmental Statistics, 2, 179–197.
Calinski, T., and Corsten, L. C. A. (1985), “Clustering Means in ANOVA by Simultaneous Testing,” Biometrics,
41, 39–48.
Carmer, S. G., and Lin, W. T. (1983), “Type I Error Rates for Divisive Clustering Methods for Grouping Means in
Analysis of Variance,” Communications in Statistics Simulation and Computation, Series B, 12, 451–466.
Carmer, S. G., and Swanson, M. R. (1973), “An Evaluation of Ten Pairwise Multiple Comparison Procedures by
Monte Carlo Methods,” Journal of the American Statistical Association, 68, 66–74.
Cox, D. R., and Spjøtvoll, E. (1982), “On Partitioning Means Into Groups,” Scandinavian Journal of Statistics, 9,
147–152.
Gates, C. E., and Bilbro, J. D. (1978), “Illustration of Cluster Analysis Method for Means Separation,” Agronomy
Journal, 70, 462–465.
Jolliffe, I. T. (1975), “Cluster Analysis as a Multiple Comparison Method,” Applied Statistics, Proceedings of
Conference at Dalhousie University, Halifax, 159–168.
Jolliffe, I. T., Allen, O. B., and Christie, B. R. (1989), “Comparison of Variety Means Using Cluster Analysis and
Dendrograms,” Experimental Agriculture, 25, 259–269.
Knuth, D. E. (1981), The Art of Computer Programming (Vol. 2, 2nd ed.), Seminumerical Algorithms, Reading,
MA: Addison-Wesley.
Press, W. H., Flannery, P., Teukolsky, S. A., and Vetterling, W. T. (1986), Numerical Recipes, Cambridge: Cam-
bridge University Press.
Scott, A. J., and Knott, M. (1974), “A Cluster Analysis Method for Grouping Means in the Analysis of Variance,”
Biometrics, 30, 507–512.
Shaffer, J. P. (1981), “Complexity: An Interpretability Criterion for Multiple Comparisons,” Journal of the Amer-
ican Statistical Association, 76, 395–401.
Spath, H. (1980), Cluster Analysis Algorithms, New York: Wiley.
Tasaki, T., Yoden, A., and Goto, M. (1987), “Graphical Data Analysis in Comparative Experimental Studies,”
Computational Statistics & Data Analysis, 5, 113–125.
Willavize, S. A., Carmer, S. G., and Walker, W. M. (1980), “Evaluation of Cluster Analysis for Comparing
Treatment Means,” Agronomy Journal, 72, 317–320.
APPENDIX