Specific Multiple Comparison Procedures III: Tukey HSD and Scheffe Tests

Background
This handout should be read after the introduction to multiple comparison
handouts and the handout on familywise and per comparison error rates. Remember that,
at least in theory, all MC procedures attempt to control familywise Type 1 error rates at
acceptable levels. Thus, it's important to be familiar with the concepts of per comparison
and familywise Type 1 error rates.

There are a variety of different approaches to multiple comparisons. In this
handout and in other ones we will discuss several of the major approaches. This handout
will discuss the Tukey HSD (honestly significant difference) and Scheffe approaches to
multiple comparisons. The former test is sometimes called the Tukey WSD (wholly
significant difference) test. I would read this handout after the handouts on the Fisher
LSD and Bonferroni approaches.

The data that we will use to exemplify these tests is the film data we've used
previously. Here it is:

        ANGER    SAD     FEAR    HAPPY   NEUTRAL
S1       8        9      14       2      12
S2      10       15      16       5       8
S3       7       12      18       7      10
S4      12       13      12      10       4
S5      13       11      10       6       6
MEAN    10.0     12.0    14.0     6.0     8.0
SD       2.55     2.24    3.16    2.92    3.16

We will also need to know that the MSW from the omnibus ANOVA is 8. The F test
from the ANOVA is highly significant: F (4,20) = 6.25, p =.002.
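
As a quick check on these summary values, here is a minimal Python sketch (my own illustration, not one of the course SAS programs) that recomputes MSB, MSW, and the omnibus F from the table above. The function name one_way_anova is hypothetical, and the shortcut of averaging the group variances to get MSW assumes equal ns.

```python
from statistics import mean, variance

# Film data from the table above (5 subjects per condition).
film = {
    "ANGER":   [8, 10, 7, 12, 13],
    "SAD":     [9, 15, 12, 13, 11],
    "FEAR":    [14, 16, 18, 12, 10],
    "HAPPY":   [2, 5, 7, 10, 6],
    "NEUTRAL": [12, 8, 10, 4, 6],
}

def one_way_anova(groups):
    """Return (MSB, MSW, F) for a one-way ANOVA with equal group sizes."""
    scores = list(groups.values())
    n = len(scores[0])                      # per-group sample size
    J = len(scores)                         # number of groups
    group_means = [mean(g) for g in scores]
    grand_mean = mean(group_means)          # valid because ns are equal
    ssb = n * sum((m - grand_mean) ** 2 for m in group_means)
    msb = ssb / (J - 1)
    msw = mean([variance(g) for g in scores])   # pooled variance, equal ns
    return msb, msw, msb / msw

msb, msw, f_ratio = one_way_anova(film)
print(f"MSB = {msb:.2f}, MSW = {msw:.2f}, F(4, 20) = {f_ratio:.2f}")
```

The p value is not computed here because that would require an F distribution function from outside the standard library.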

Tukey HSD Test
Overview
The Tukey HSD test can only be used to conduct pairwise comparisons.
Typically, it is used to test all possible pairwise comparisons or to test only some
pairwise comparisons when you've selected them post-hoc. In the latter case, the logic is
that you were only interested in pairwise comparisons before the study began but you had
to look at the data before deciding which pairwise comparisons to actually test. In this
case, the effective field of potential comparisons is all pairwise ones even though
you're only doing some of them. Most commonly, the Tukey HSD test is used to test all
pairwise comparisons. For these purposes it is typically more powerful than the classic
Bonferroni and Scheffe tests. As I'll note below, the comparison to stepdown Bonferroni
is more complex. Let me add a point here. As we'll see below, you can typically figure
out in advance which of these tests will be more powerful for the particular situation that
you're confronting.

Logic and Rationale
The Tukey HSD test has an interesting conceptual basis. Let's assume that we are
only interested in comparing pairs of means (e.g., anger vs. sad), that we have equal
group ns, that the assumptions of the ANOVA are met, and that we want to set our
familywise Type 1 error rate at alpha = .05. If you think about it, one way to control our
FW Type 1 error rate would be to focus on the task of setting the Type 1 error rate for the
largest mean difference at .05. For example, let's assume that the critical t value and
minimum mean difference for all contrasts will be the same (e.g., we're using an
approach more like Bonferroni than stepdown Bonferroni). What if we found some
magical critical value that always held the Type 1 error rate for the pairwise
contrast with the maximum t or F value at or below .05? For example, let's assume that the
complete null hypothesis is true (all population means are equal) and that we conducted the
film study 1 million times. Each time we ran the 5 film conditions and sampled 25 people
(5 per condition). Let's further say that each time we conducted all 10 pairwise contrasts
and we picked out the maximal contrast. For the particular sample in the table above, that
maximal contrast would be fear-happy. Of course, in other samples, other pairwise
contrasts would be the ones that would be maximal. The idea here is to come up with a
statistical test that controls Type 1 errors even when the contrast tested is always the
maximal contrast. Thus, even though we're picking out the contrast each time that is most
likely to be significant, we want to ensure that we reject the null hypothesis for this
contrast only 5% of the time.

That's what the Tukey HSD test does. Its critical values and overall procedure
are designed to control rejection rates of the maximal contrast to the desired alpha level.
Note that if we control the maximal contrast to the desired alpha level, then we
automatically control the familywise error rate to that same alpha level given that we
are going to use the same critical t or F value to test all contrasts. Thus, if we fail to reject
the maximal pairwise contrast for a given set of data, then we will automatically fail to
reject any other pairwise contrast. If we fail to reject the maximal contrast 95% of the
time, then we will fail to reject any contrast 95% of the time. Conversely, if we reject the
maximal contrast 5% of the time, our familywise Type 1 error rate can only be 5%. Put
still another way, if ns are equal, it can never happen that we would reject a non-maximal
contrast but fail to reject the maximal contrast: rejection of the latter is a
necessary condition for rejection of the former.

Studentized Range Distribution

As the section above would suggest, the key thing that we need from a
mathematical standpoint is the ability to characterize the sampling distribution of the
maximal contrast. The studentized range (SR) distribution allows us to do this.

The SR distribution is defined as follows. Let's say we generate g independent Z
scores sampled from a normal distribution with a mean of 0 and variance of 1. For
example, we could generate 5 such scores, in which case g = 5. Let's further say that we
compute the absolute value of the difference between each pair of Z scores and that we
pick out the pair that's associated with the biggest difference. Of course, the two Z scores
picked out will be the biggest and the smallest. Now, let's also say that we have a
chi-square variable (let's call it V) with degrees of freedom = df that is independent of the Z
scores that we generated. If this is the case, it turns out that the following ratio is
distributed as a studentized range variable (denoted Q).

$$Q^{R}_{g,\,df} = \frac{Z_{\text{Max}} - Z_{\text{Min}}}{\sqrt{V/df}} \qquad (1)$$

That is, Q is distributed as a studentized range variable with parameters g and df. The
SAS program stud_range1.sas on the web site generates the studentized range
distribution for the case in which g=5 and df=45. You can modify this
program to see how the shape changes with different values of g and df.
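
If you don't have SAS handy, the definition in equation (1) can also be simulated directly. The following is a minimal Python sketch of my own (it is not a translation of stud_range1.sas, and the function names are hypothetical). It draws g normal scores and an independent chi-square variable, forms the ratio, and estimates the upper 5% cutoff by Monte Carlo, so the result is approximate and depends on the seed and number of replications.

```python
import math
import random

def simulate_q(g, df, rng):
    """Draw one studentized range variate from the definition in equation (1)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(g)]   # g independent N(0, 1) scores
    v = rng.gammavariate(df / 2.0, 2.0)           # chi-square with df degrees of freedom
    return (max(z) - min(z)) / math.sqrt(v / df)

def upper_cutoff(g, df, alpha=0.05, reps=50_000, seed=1):
    """Monte Carlo estimate of the upper-alpha cutoff of the SR distribution."""
    rng = random.Random(seed)
    draws = sorted(simulate_q(g, df, rng) for _ in range(reps))
    return draws[int((1 - alpha) * reps)]

q_5_45 = upper_cutoff(g=5, df=45)
print(f"approximate upper 5% cutoff for g = 5, df = 45: {q_5_45:.2f}")
```

Changing g and df in the last call shows how the cutoff moves: it grows as g increases and shrinks as df increases.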

Given this definition of the S.R. distribution we can further show that the
following test statistic is also distributed as a studentized range variable:


$$Q^{R}_{g,\,df} = \frac{\bar{X}_{\text{Max}} - \bar{X}_{\text{Min}}}{\sqrt{MSW/n}}, \quad \text{where } n = \text{group sample size} \qquad (2)$$

If time allows, we will show in class that this ratio is in fact distributed as a studentized
range variable. In essence, Q in the equation above is the sampling distribution of the
difference between the biggest and smallest means across an infinite number of samples
assuming the null hypothesis is true. If Q is a sampling distribution in this manner, then
we can use it to test hypotheses concerning the difference between the biggest and
smallest means or, it turns out, between any pair of means. Note that the two parameters
here are g and df and correspond to the number of groups (i.e., number of means involved) and the df
corresponding to the MSW. In addition, of course, you have to specify a given alpha level
(typically .05).


Procedure
When conducting tests using the SR distribution, we will proceed in a manner
similar to the way we proceeded when conducting tests using the t or F distributions. We
will have a critical Q value that corresponds to our 5% cutoff point. We will compare our
observed Qs to this critical Q value.

Here's how we proceed. First, we need to find the critical value of the studentized
range distribution that corresponds to the upper 5% of scores of the studentized range
distribution. Thus, we need to make reference to a table of studentized range scores. You
can use the SAS function probmc to do this (we'll demonstrate this in class).
Alternatively, you can make reference to Table A.4 in Maxwell & Delaney. In the case of
the film data, we would look for the value associated with $\alpha_{FW} = .05$, g = 5, and df = 20.
The critical value of Q here is 4.23. Note what the Q critical value of 4.23 represents in
the context of the film study. Assume that the null hypothesis is true (all population
means are equal) and we repeated the sampling scheme of the film study an infinite
number of times. Each time, we identify the pair of means with the maximal difference
and compute an empirical Q using this pair of means. If we plot out the distribution of
obtained Q values, we will find that they will exceed 4.23 5% of the time. Thus, our
critical Q value is indeed our 5% cutoff point.
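
The thought experiment above can be approximated on a computer. Here is a minimal Python sketch (my own illustration; the function names, the seed, and the choice of 20,000 replications rather than an infinite number are all assumptions). It repeats the film design under the complete null hypothesis, picks out the maximal pair each time, and counts how often the observed Q exceeds 4.23.

```python
import math
import random

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def max_pair_q(g, n, rng):
    """Observed Q for the largest vs. smallest sample mean in one null study."""
    samples = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(g)]
    means = [sample_mean(s) for s in samples]
    msw = sample_mean([sample_var(s) for s in samples])   # pooled variance, equal ns
    return (max(means) - min(means)) / math.sqrt(msw / n)

def rejection_rate(q_crit, g=5, n=5, reps=20_000, seed=2):
    """Share of simulated studies whose maximal pairwise Q exceeds q_crit."""
    rng = random.Random(seed)
    hits = sum(max_pair_q(g, n, rng) > q_crit for _ in range(reps))
    return hits / reps

rate = rejection_rate(q_crit=4.23)
print(f"maximal contrast rejected in {rate:.3f} of simulated studies")
```

With enough replications the printed proportion should settle close to .05. Because every other pairwise Q in a study is no larger than the maximal one, the same proportion is also the familywise error rate.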

Once you've got the critical Q value, you can compute the observed Q values
from each pair of means as

$$Q_{\text{Observed}} = \frac{\bar{X}_{A} - \bar{X}_{B}}{\sqrt{MSW/n}}$$

Here, A and B denote any two groups of interest. You would compare the observed Q to
the critical Q value. If its absolute value exceeds the critical Q, you reject the null hypothesis.

It's probably the case, however, that once you've got the critical Q value, the
easiest way to proceed is to compute the minimal mean difference that needs to be
detected to declare the difference significant. Refer to formula (2) above and note that

$$\bar{X}_{\text{minimum diff}} = \text{critical } Q^{R}_{g,\,df} \times \sqrt{\frac{MSW}{n}}$$

Thus, for the film data

$$\bar{X}_{\text{minimum diff}} = 4.23 \times \sqrt{\frac{8}{5}} = 5.35$$

In order for any difference among means to be declared significant, the absolute value of
the difference has to be > 5.35. In the context of the film data, we would find that 3 mean
differences meet this criterion: fear-happy (dif = 8), fear-neutral (dif = 6), and
sad-happy (dif = 6). This is the same result we got from the Bonferroni contrasts. However,
note that our minimum mean difference for the Tukey test (5.35) is smaller than the
minimum difference we computed for the Bonferroni test (5.64). If you're going to
conduct all pairwise contrasts on means, the Tukey test is more powerful than the
Bonferroni test.
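
The minimum-difference shortcut is easy to script. The sketch below is my own illustration in Python rather than the course SAS code. It takes the critical Q of 4.23 as given from the table, computes the minimum significant difference, and lists every pair of film means whose absolute difference exceeds it.

```python
import math
from itertools import combinations

means = {"ANGER": 10.0, "SAD": 12.0, "FEAR": 14.0, "HAPPY": 6.0, "NEUTRAL": 8.0}
MSW, N_PER_GROUP = 8.0, 5
Q_CRIT = 4.23   # critical Q for g = 5, df = 20, alpha_FW = .05 (table value quoted in the text)

# Minimum mean difference needed for significance: critical Q * sqrt(MSW / n).
min_diff = Q_CRIT * math.sqrt(MSW / N_PER_GROUP)

# All 10 pairwise comparisons, keeping those that exceed the minimum difference.
significant = [
    (a, b, abs(means[a] - means[b]))
    for a, b in combinations(means, 2)
    if abs(means[a] - means[b]) > min_diff
]

print(f"minimum significant difference = {min_diff:.2f}")
for a, b, diff in significant:
    print(f"{a} vs {b}: |difference| = {diff:.1f}")
```

Note that this sketch assumes equal ns, as the Tukey HSD test itself does.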

The choice between Tukey and stepdown Bonferroni is more complicated. The
first step in stepdown Bonferroni (difference between largest and smallest means) is
identical to the first step in the traditional (non-step-down) Bonferroni test. Thus, at this
step we favor Tukey. Note, though, how with each succeeding step, the critical t or F
value for stepdown Bonferroni gets smaller and smaller (you're dividing .05 by a smaller
number). The Tukey value stays constant across all comparisons. Typically, there is a
point in the sequence at which stepdown Bonferroni becomes more powerful than Tukey.
Thus, for early comparisons in the sequence (comparisons involving bigger differences
among means), Tukey is more powerful and desirable, but for later comparisons,
stepdown Bonferroni is more desirable. On balance, I would probably choose Tukey
simply because if the first step is not significant with stepdown Bonferroni, you can't
declare any contrasts to be significant.

From a more conceptual perspective, here's the key thing to understand. Our
critical Q of 4.23 and our minimal mean difference of 5.35 originate in the attempt to
control Type 1 errors for the maximal contrast. That is, if the null hypothesis is true, we
only want to declare the 2 maximally different means to be significantly different 5% of
the time. However, you should recognize that in the process we also control the
familywise Type 1 error rate for the whole set of pairwise contrasts at alpha = .05. The
reasons are: (1) Under the complete null hypothesis, if we use this procedure the maximal
contrast will not be significant 95% of the time; and, (2) if the maximal contrast is not
significant using the Tukey procedure, there's no way any other contrasts can be
significant. Thus, if the null hypothesis is true, 95% of the time we will judge no contrasts
to be significant using the Tukey HSD test.

The SAS program film_cont_tukey.sas that is on the web site shows you how to
conduct Tukey HSD contrasts using SAS. You will see that the output using the means
command makes more intuitive sense than the output using the lsmeans command. I will
explain the latter output in class.

Let me also note that the Tukey procedure assumes both equal ns and
homogeneity of population variances. In the unequal-n case, the Tukey-Kramer test can
be used. This test is discussed in Maxwell & Delaney pp. 184-185. For violations of
homogeneity of variance, neither Tukey procedure is appropriate but there are
alternatives available that are quite similar to the Welch ADF omnibus test that we have
discussed. See the Maxwell & Delaney discussion on pp. 145-150 and 184-5.

Scheffe Test

Overview
The Scheffe MCP can be used to test any contrast or set of contrasts,
whether they are pairwise or complex. In my view, it is the only true post-hoc contrast. In
essence, it is appropriate for the case where the researcher is willing initially to consider
any possible contrast or set of contrasts without being able to specify in advance which
specific contrasts he or she wants to do. Instead, after looking at the data, the researcher
decides to do a contrast or contrasts. This might sound like an unrealistic situation to you
and it probably is. However, it is important that there be an MCP that helps
control familywise error rates for truly post-hoc, exploratory type situations. If you're not
in that situation, the typical advice is don't do a Scheffe test. The reason is that, relative
to other procedures, it has low power. Given the tradeoff between Type 1 errors and
power, low power for the Scheffe test should make sense. This test is designed to control
Type 1 errors even if the family of contrasts conducted was infinite in size. For example,
in the context of the film data, there are only 10 pairwise contrasts that one could
conduct. However, if one considered both pairwise and complex contrasts, there literally
is an infinite number of possibilities. A truly post-hoc procedure needs to protect familywise
error rates even if the family were infinitely large or, more realistically, if any of the
infinite number of contrasts were selected by the researcher. One would expect that such
a procedure would be associated with per comparison Type 1 error rates that are quite
low.
Let me note, however, that there are actually instances in which the Scheffe test is
more powerful than other MCPs. In particular, this test can be more powerful than the
classic Bonferroni test under some circumstances. I will discuss this point below.

Logic and Procedure

The Scheffe test has an underlying logic that is similar to that of the Tukey HSD
test and is actually quite nifty. Recall how the latter controls familywise Type 1 error
rates for all pairwise contrasts by controlling the Type 1 error rate for the maximal
pairwise contrast. The Scheffe procedure has a similar logic except now the goal is to
control familywise Type 1 error rates for any set of contrasts (pairwise and/or complex)
of any size. It does this by controlling the Type 1 error rate for the maximal contrast that
one could possibly observe. Note that we're not just thinking about the maximal pairwise
contrasts here. We're thinking about the maximal contrast of any type.

You might ask: if there are an infinite number of potential contrasts, how on earth
would we ever be able to identify the contrast with the largest value? Wouldn't that
contrast vary appreciably from sample to sample or study to study? Well, it turns out that
we know what the maximal contrast will always be. I will present the results here and
refer you to the discussion in Maxwell and Delaney (pp. 186-192) to further flesh out the
proofs and underlying rationale if you're interested.

When we conduct Scheffe contrasts, we typically think in terms of Fs rather than
ts. Recall from an earlier handout that when we test a contrast using the F distribution

$$F = \frac{SS_{\psi}}{MSW}, \quad \text{where} \quad SS_{\psi} = \frac{\hat{\psi}^{2}}{\sum_{j=1}^{J} c_j^{2}/n_j}$$

It follows that the largest F value we could observe would be linked to the contrast with
the largest SS. In other words,
$$F_{\text{maximum}} = \frac{SS_{\text{max}}}{MSW}$$

As shown in Maxwell & Delaney, we can show that the SS for any contrast must always
be less than or equal to the SS Between from the omnibus ANOVA. That is,

$$SS_{\psi} \le SSB$$

We can further show that it is in fact possible to identify a contrast whose value is
identical to the SSB. This contrast has the following coefficients:

$$c_j = n_j(\bar{Y}_j - \bar{Y})$$

As shown in footnote 6 in M&D on p. 764, the SS for a contrast with these coefficients will always
equal the SSB. The point here is not that $c_j = n_j(\bar{Y}_j - \bar{Y})$ is an attractive or meaningful
contrast that you should use. Rather, the point is that the SS for the maximal contrast possible
is indeed equal to SSB. In other words, we now know that:


$$F_{\text{maximum}} = \frac{SSB}{MSW}$$

However, we also know that:

$$SSB = (J - 1)\,MSB, \quad \text{where } J = \text{number of groups}$$

Thus, by substitution:

$$F_{\text{maximum}} = \frac{(J-1)\,MSB}{MSW} = (J-1) \times \frac{MSB}{MSW} \qquad (3)$$
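
If you'd like to see the maximal contrast at work without going through the proof, here is a small numerical check in Python. It is my own sketch using the film means; the helper name contrast_ss is hypothetical. It builds the coefficients $c_j = n_j(\bar{Y}_j - \bar{Y})$, confirms that the contrast SS matches SSB, and compares that with the SS for the largest pairwise contrast (fear vs. happy).

```python
means = [10.0, 12.0, 14.0, 6.0, 8.0]   # anger, sad, fear, happy, neutral
ns = [5, 5, 5, 5, 5]
MSW = 8.0

def contrast_ss(coefs, means, ns):
    """Sum of squares for a contrast: psi_hat**2 / sum(c_j**2 / n_j)."""
    psi_hat = sum(c * m for c, m in zip(coefs, means))
    return psi_hat ** 2 / sum(c ** 2 / n for c, n in zip(coefs, ns))

grand_mean = sum(n * m for n, m in zip(ns, means)) / sum(ns)
ssb = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))

# Coefficients of the maximal contrast: c_j = n_j * (Ybar_j - Ybar).
max_coefs = [n * (m - grand_mean) for n, m in zip(ns, means)]
ss_max = contrast_ss(max_coefs, means, ns)
ss_pair = contrast_ss([0, 0, 1, -1, 0], means, ns)   # fear vs. happy

f_max = ss_max / MSW
f_omnibus = (ssb / (len(means) - 1)) / MSW           # MSB / MSW

print(f"SSB = {ssb:.1f}, SS(maximal) = {ss_max:.1f}, SS(fear vs. happy) = {ss_pair:.1f}")
print(f"F(maximal) = {f_max:.2f} = (J - 1) * {f_omnibus:.2f}")
```

The reason this works is that with these coefficients both the squared contrast estimate's square root and the denominator reduce to $\sum_j n_j(\bar{Y}_j - \bar{Y})^2$, so the ratio is SSB squared over SSB.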


Now, consider equation (3) above. Let's assume the null hypothesis is true. We know that
under the null hypothesis $MSB/MSW$ is distributed as a central F variable with (J-1) and (N-J)
degrees of freedom. It follows from (3) that $F_{\text{maximum}}$, the value of the maximum contrast,
is distributed as J-1 times a central F variable. That is, let's assume that the null
hypothesis is true (all population means are equal). We do a million experiments and for
each experiment we compute the maximal contrast, the one where the coefficients are
$c_j = n_j(\bar{Y}_j - \bar{Y})$. The F for this contrast is distributed as J-1 times an F variable. If we want to find
the upper 5% cutoff point for this contrast, we: (1) find the upper 5% cutoff point for a
central F distribution with J-1 and N-J df (denote this $F_{\text{crit},\,.05,\,J-1,\,N-J}$); and, (2) multiply this
cutoff point by J-1 (i.e., compute $(J-1) \times F_{\text{crit},\,.05,\,J-1,\,N-J}$). The result is the upper 5% critical
value (= cutoff point) for the maximal contrast. This is the critical value that we will use
when we do Scheffe contrasts.
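
Here too a small simulation can stand in for the million experiments. The Python sketch below is my own illustration, with hypothetical function names, a fixed seed, and 20,000 replications. It assumes the film design (J = 5, n = 5) and the tabled .05 critical F of 2.867 for 4 and 20 df. It estimates how often the F for the maximal contrast, computed as SSB/MSW, exceeds (J-1) times the critical F.

```python
import random

def max_contrast_f(J, n, rng):
    """F for the maximal contrast, SSB / MSW, from one study under the null."""
    samples = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(J)]
    means = [sum(s) / n for s in samples]
    grand = sum(means) / J                      # equal ns, so a simple average
    ssb = n * sum((m - grand) ** 2 for m in means)
    ssw = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    msw = ssw / (J * (n - 1))
    return ssb / msw                            # equals (J - 1) * MSB / MSW

def exceed_rate(cutoff, J=5, n=5, reps=20_000, seed=3):
    """Share of simulated studies whose maximal-contrast F exceeds cutoff."""
    rng = random.Random(seed)
    return sum(max_contrast_f(J, n, rng) > cutoff for _ in range(reps)) / reps

F_CRIT = 2.867                       # tabled .05 critical F for 4 and 20 df
scheffe_cutoff = (5 - 1) * F_CRIT    # (J - 1) * F_crit
rate = exceed_rate(scheffe_cutoff)
print(f"maximal contrast exceeds {scheffe_cutoff:.2f} in {rate:.3f} of studies")
```

The proportion should come out near .05, which is just another way of saying that the omnibus F exceeds its own critical value 5% of the time.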

The logic here is similar to that for the Tukey test. The key point is that by
holding Type 1 errors for the maximal contrast to the desired alpha level, we hold the
familywise error rate for any family of contrasts to that desired alpha level. The insight
here is that if we reject the null hypothesis for the maximal possible contrast only 5% of
the time, our familywise Type 1 error rate cannot exceed .05.

The logic here reveals another interesting point. An observed value of a Scheffe
contrast is significant if it exceeds the 5% cutoff point = $(J-1) \times F_{\text{crit},\,.05,\,J-1,\,N-J}$. The
ANOVA is significant only if its F value exceeds $F_{\text{crit},\,.05,\,J-1,\,N-J}$. There is a 1-to-1 relation
here. The ANOVA is significant only if the maximal contrast as tested by Scheffe's
method is significant and the maximal contrast as tested by Scheffe's method is
significant only if the ANOVA is significant. A significant ANOVA tells us, then,
that at least one contrast is significant.

Here's how we proceed when doing Scheffe's comparisons:

(1) We compute $(J-1) \times F_{\text{crit},\,.05,\,J-1,\,N-J}$. This is our upper 5% critical value.
(2) We compute our observed contrasts and F values of interest using the
standard formulae:

$$SS_{\psi} = \frac{\hat{\psi}^{2}}{\sum_{j=1}^{J} c_j^{2}/n_j}, \qquad F_{\text{Observed}} = \frac{SS_{\psi}}{MSW}$$

(3) We compare our observed F computed in step 2 to the critical F
computed in step 1. If $F_{\text{Observed}} > F_{\text{Crit}}$, we reject the null hypothesis and
declare the contrast significant.



So, let's take the film data as an example. We'll just do one contrast to exemplify the
procedure. Let's say that we want to compare sad ($\bar{Y}_{\text{SAD}} = 12$) to happy ($\bar{Y}_{\text{Happy}} = 6$):

(1) We need to find our critical F.
We see that $F_{.05,4,20} = 2.867$ from a table of F values.
Thus, $F_{\text{Crit}}$ for our contrast = (J-1)*2.867 = 4*2.867 = 11.47

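(2) We compute the observed value of our contrast

$$SS_{\psi} = \frac{\hat{\psi}^{2}}{\sum_{j=1}^{J} c_j^{2}/n_j} = \frac{(12 - 6)^{2}}{\left(\frac{1}{5} + \frac{1}{5}\right)} = \frac{36}{2/5} = 90$$

(3) We compute our observed F

$$F_{\text{Observed}} = \frac{SS_{\psi}}{MSW} = \frac{90}{8} = 11.25$$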


(4) We compare the observed to the critical F. Here we see that 11.25 < 11.47. Thus, we
fail to reject the null hypothesis that the sad and happy population means are equal.

Note that when we did Bonferroni and Tukey contrasts, we found that sad and happy did
significantly differ. The Scheffe contrast is associated with relatively low power when
compared to other procedures.

One could also compute a Scheffe complex contrast using a similar procedure.
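
The steps above, for both the pairwise case and a complex case, can be sketched in a few lines of Python. This is my own illustration and not the course SAS program. The function name scheffe_f is hypothetical, and the complex contrast (anger, sad, and fear films vs. happy and neutral films) is an example I made up purely to show the mechanics.

```python
MEANS = {"ANGER": 10.0, "SAD": 12.0, "FEAR": 14.0, "HAPPY": 6.0, "NEUTRAL": 8.0}
N_PER_GROUP, MSW = 5, 8.0
F_CRIT_SCHEFFE = (5 - 1) * 2.867    # (J - 1) * tabled F(.05; 4, 20)

def scheffe_f(coefs):
    """Observed F for a contrast given as {group: coefficient}."""
    psi_hat = sum(c * MEANS[g] for g, c in coefs.items())
    ss = psi_hat ** 2 / sum(c ** 2 / N_PER_GROUP for c in coefs.values())
    return ss / MSW

pairwise = {"SAD": 1, "HAPPY": -1}
# Hypothetical complex contrast: mean of anger, sad, fear vs. mean of happy, neutral.
complex_contrast = {"ANGER": 1/3, "SAD": 1/3, "FEAR": 1/3, "HAPPY": -1/2, "NEUTRAL": -1/2}

for label, coefs in [("sad vs. happy", pairwise),
                     ("anger/sad/fear vs. happy/neutral", complex_contrast)]:
    f_obs = scheffe_f(coefs)
    verdict = "significant" if f_obs > F_CRIT_SCHEFFE else "not significant"
    print(f"{label}: F = {f_obs:.2f} vs. critical {F_CRIT_SCHEFFE:.2f} -> {verdict}")
```

The same critical value is used for both contrasts, which is the whole point of the Scheffe approach: it does not change with the number or type of contrasts tested.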

If we considered conducting a Scheffe test for a set of pairwise contrasts, one thing that
we could do is compute the minimum mean difference necessary for the contrast to be
significant.

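To do this, we set:

$$F_{\text{Crit}} = \frac{\hat{\psi}^{2}}{MSW \times \sum_{j=1}^{J} c_j^{2}/n_j}$$

Thus,

$$\sqrt{F_{\text{Crit}} \times MSW \times \sum_{j=1}^{J} \frac{c_j^{2}}{n_j}} = \hat{\psi} = \bar{Y}_{A} - \bar{Y}_{B} = \text{min. mean diff.}$$

In the context of the film data, then

$$\text{min. mean diff.} = \sqrt{11.47 \times 8 \times \left(\frac{2}{5}\right)} = 6.06$$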

Thus, for any mean difference to be declared significant, it has to be greater than 6.06.

Note how this min. mean difference is greater than those for the Bonferroni or Tukey
HSD tests comparing all pairs of means:


Test          Minimum Mean Difference
Bonferroni    5.64
Tukey         5.35
Scheffe       6.06
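
For reference, the three values in this table can be reproduced with the short Python sketch below (my own illustration). The critical Q and F are the table values used earlier in this handout. The Bonferroni critical t of 3.153 is an assumption on my part: it is the two-tailed t table value for a per comparison alpha of .05/10 = .005 with 20 df.

```python
import math

MSW, N, J = 8.0, 5, 5
T_CRIT_BONF = 3.153   # assumed table value: two-tailed t, alpha = .05/10, 20 df
Q_CRIT = 4.23         # studentized range, g = 5, 20 df
F_CRIT = 2.867        # F(.05; 4, 20)

se_pair = math.sqrt(MSW * 2 / N)   # standard error of a pairwise mean difference

min_diffs = {
    "Bonferroni": T_CRIT_BONF * se_pair,
    "Tukey": Q_CRIT * math.sqrt(MSW / N),
    "Scheffe": math.sqrt((J - 1) * F_CRIT) * se_pair,
}

for test, diff in min_diffs.items():
    print(f"{test:<12}{diff:.2f}")
```

Swapping in critical values for a different number of groups or df gives a quick way to see in advance which test will be most powerful for a given design.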

Clearly, we would not choose the Scheffe test over these procedures in this case.
This test, in essence, is typically a last resort. It is used when at least one of the contrasts
that you are doing was not anticipated in advance and is based on post-hoc peeking at
the data.

You should note, however, that when you're doing a large number of contrasts
(and particularly when you're doing so when ns and df are very small) the critical values
for the classic Bonferroni test (non-step-down) can actually be greater than those for the
Scheffe. Thus, the latter can be more powerful than the former. In this regard, recall that
the Scheffe critical value stays constant across variations in the number of contrasts actually
performed, while Bonferroni critical values go up (and per comparison alphas go down)
as you add contrasts. For example, if you have 5 groups and 20 df (as we do in the film
study) and you conduct 18 contrasts (!), you'd be better off doing Scheffe's than Bonferroni.
Table 5.5 in Maxwell & Delaney offers a nice summary of when to use Bonferroni rather
than Scheffe. I suspect that for situations you actually encounter in practice you'll want to
do Bonferroni contrasts rather than Scheffe's. Moreover, you'll probably want to do
stepdown Bonferronis most of all.

Note that if the choice is stepdown Bonferroni vs. Scheffe, you're typically going
to have greater power overall with the former. In those cases where Table 5.5
indicates that Scheffe is better than traditional Bonferroni, the choice is more complex.
For the first contrast you conduct (the one with the biggest value), Scheffe is more
powerful than both traditional Bonferroni and stepdown Bonferroni. At a certain point in
the stepdown sequence, however, the stepdown Bonferroni will become more powerful.

Let me emphasize, though, that this whole discussion of Bonferroni vs. Scheffe
presumes that you have identified the contrasts that you want to use in advance. If you
are truly picking comparisons to do on a post-hoc basis, you should use Scheffe no matter
how many contrasts you're doing.

The SAS program film_cont_scheffe.sas demonstrates how to do Scheffe
contrasts using SAS.