
REVIEW

Multiple Testing of Hypotheses in Comparing Two Groups


L. ADRIENNE CUPPLES, Ph.D.; TIMOTHY HEEREN, B.A.; ARTHUR SCHATZKIN, M.D., Dr.P.H.; and
THEODORE COLTON, Sc.D.; Boston, Massachusetts

Researchers frequently encounter studies that compare two groups on many variables. We discourage the use of multiple tests of hypotheses on individual variables, an approach that ignores the correlation among the variables and increases the chance of a type I error. Instead of examining each variable separately, we recommend using multivariate procedures that integrate all measures on a person into a unified analysis of the differences between the two groups. We describe three multivariate procedures: Hotelling's T², discriminant analysis, and logistic regression. We also discuss the use of Bonferroni's adjustment to preserve the overall chance of a type I error in conducting individual tests on each variable after doing the multivariate procedures. We review the underlying assumptions and relative merits and disadvantages of the three multivariate methods and recommend which method to use in various circumstances.

From the Department of Epidemiology and Biostatistics, Boston University School of Public Health; Boston, Massachusetts.

CLINICAL RESEARCHERS frequently ask, "Do two groups differ on one or more variables?" Case-control studies, cohort studies, and clinical trials are all examples of studies in which this question is ubiquitous. In a case-control study, researchers may wish to compare "cases" and controls on several potential exposures or background variables, as in a study of oral contraceptive use and the incidence of myocardial infarction (1). In cohort studies and clinical trials, researchers may wish to measure various outcomes as well as baseline characteristics, as in the Framingham Heart Study (2, 3) and the University Group Diabetes Project (4).

Whereas statistical methods (t-tests and chi-squared tests) for comparing two groups on one variable are well known, techniques for comparing groups on two or more variables are less widely understood and applied. In many cases, investigators simply use t-tests or chi-squared tests, or both, for all variables. This strategy entails doing many tests of hypotheses, an approach with two potential weaknesses. First, separate tests on each variable ignore the fact that some variables may be correlated; hence, the result of a test for a particular variable may, in fact, be due to differences between the groups on some other related measure (the comparison is confounded). Second, the overall probability of finding at least one statistically significant difference between two groups when, in fact, there are no differences (type I error) is somewhat larger than the conventional 5% and 1% levels chosen for conducting each individual test.

This article examines the problem of multiple testing of hypotheses in the comparison of two groups on many variables. This problem should be distinguished from the similar problem of multiple comparisons that arises in analysis of variance, in which an investigator compares three or more groups on one variable (instead of many). The use of multiple comparisons in analysis of variance is discussed in many other sources (5-15) and is not addressed here.

Predictors of Coronary Heart Disease in a Cohort Study

To illustrate the issues, we analyzed data from the Framingham Heart Study (2, 3, 16, 17). (The sole purpose of our analyses is to show the statistical concerns; we believe the results presented here are consistent with other published results.) For our purposes, we considered only men, whom we divided into two groups: those who did and those who did not develop coronary heart disease over a 26-year period. In our analyses, coronary heart disease comprised one or more of three conditions: angina pectoris, coronary insufficiency, and myocardial infarction. To compare the two groups of patients, we chose ten characteristics determined at baseline and putatively associated with coronary heart disease: systolic blood pressure, total serum cholesterol level, casual blood glucose level, number of cigarettes smoked per day, Framingham relative weight, age, serum hemoglobin level, vital lung capacity, serum uric acid level, and ventricular rate. (We used the first recorded data among the first three examinations to explore the long-range predictability of these measures. If data were missing from the first examination, we used data from the second examination; if data were not available on the second examination, we used information from the third examination. If data for any of the ten variables were missing on all three examinations, we excluded that patient from analysis.) The total sample consisted of 2248 men, 640 who developed and 1608 who did not develop coronary heart disease over the 26-year period; we excluded 88 men who had missing information.

The means and standard deviations of each variable in the two groups and the p values (two-tailed) associated with the conventional two-sample t-test (Appendix 1) are given in Table 1. From these ten separate tests of hypotheses (one for each variable), one might conclude that all variables except hemoglobin level differ significantly between the two groups. In other words, initial measurements of systolic blood pressure, cholesterol level, casual blood sugar level, Framingham relative weight, age, vital lung capacity, uric acid level, ventricular rate, and perhaps cigarettes smoked per day (p < 0.06) are
122 Annals of Internal Medicine. 1984;100:122-129.

Downloaded From: http://annals.org/ by a Penn State University Hershey User on 09/17/2016


all long-range predictors of coronary heart disease.

Table 1. Comparison of Mean Baseline Values of Men Who Did and Did Not Develop Coronary Heart Disease*

Variable                           CHD (n = 640)   Not CHD (n = 1608)   p Value†
Systolic blood pressure, mm Hg     142 ± 23        135 ± 19             0.0001
Cholesterol, mg/100 mL             236 ± 44        218 ± 42             0.0001
Blood sugar, mg/100 mL             87 ± 28         81 ± 21              0.0323
Cigarettes smoked, n/d             15 ± 14         14 ± 14              0.0613
Framingham relative weight, %      104 ± 13        101 ± 14             0.0001
Age, yrs                           46 ± 8          43 ± 9               0.0001
Hemoglobin, dg/100 mL              143 ± 12        143 ± 13             0.9700
Lung capacity, cL                  367 ± 67        376 ± 70             0.0058
Uric acid, mg/100 mL               51 ± 12         49 ± 12              0.0014
Ventricular rate, beats/min        75 ± 14         73 ± 13              0.0058

* Mean ± standard deviation. CHD = coronary heart disease.
† p Value for two-sided test of hypothesis, as determined by two-sample t-test for independent groups using the SAS (Statistical Analysis System) computer program (18, 19). In this and subsequent tables, one can maintain an overall significance level of 0.05 by applying the Bonferroni rule of 0.005 to determine which variables are significantly different.

This analysis neglects two important facts. First, some of these variables are correlated (Table 2); hence, the comparison on one variable may be confounded by its relation with another variable. For example, the correlation between uric acid level and Framingham relative weight is 0.29; therefore, the significantly higher average level of uric acid among men with coronary heart disease may, in fact, be a consequence of their higher average relative weight. To assess whether the observed differences in uric acid levels between the two groups (suggested in Table 1) can be attributed to the development of coronary heart disease, or whether they can be explained by other differences between the groups, an appropriate statistical analysis should account for the correlations among all ten variables.

A second weakness of an analysis consisting of ten separate t-tests is that the chance of finding a difference between the two groups on at least one variable, when in fact there are no differences (type I error), is much higher than one might expect. If we use the 0.05 level to decide which variables differ significantly, we look down the column of p values and find eight variables to be statistically different (and one marginally so). However, if men with coronary heart disease and those without the disease were actually alike on average on all ten characteristics, the chance of finding one or more statistically significant results (assuming all variables are independent of one another) is

$$1 - \Pr = 1 - (1 - 0.05)^{10} = 0.40$$

where Pr = probability of no significant differences. Even though the researcher is ostensibly using an alpha level of 0.05, the chance of a type I error in comparing these two groups is 0.40. This result derives from Bonferroni's inequality (5, 8).

These two problems indicate a need for a unified analysis that incorporates the correlations among the variables and is an overall test for differences between the two groups. Because the methods we describe involve simultaneous examination and analysis of several variables, these methods have become known as multivariate analysis.

Multivariate Methods for Comparing Two Groups

We describe three multivariate approaches to remedy the multiple testing problem: Hotelling's T², discriminant analysis, and logistic regression. All three methods overcome the weaknesses previously discussed by incorporating the correlations among the variables and by limiting the overall chance of a type I error to a desired level, such as 0.05 or 0.01. Instead of viewing each person as having several separate characteristics (such as systolic blood pressure distinct from cholesterol level), these procedures assume that each person provides information on all variables simultaneously.

All three multivariate methods should be done in two stages: an overall test of the differences between the two groups, and an examination of the individual variables to see where differences, if any, lie. The overall test incorporates the correlations among the variables. If the overall test is statistically significant, then the two groups differ on at least one variable. A researcher can then proceed to examine each variable to find where the differences are. This second stage of multivariate analysis is generally not equivalent to univariate procedures, because discriminant analysis and logistic regression incorporate the intercorrelations among the variables. A method for controlling the overall alpha level must be used at this stage.

If the overall test is not statistically significant, the primary conclusion is that the two groups do not differ on these variables (assuming that the samples are sufficiently large to detect clinically meaningful differences between groups when they exist). How one proceeds at this point in the analysis is unclear. Statistically significant results may appear at a second stage of closer scrutiny of individual variables, even though the overall test is not significant. Although we could ignore the results of a secondary test, we believe that valuable information can be gained from this second examination. However, investigators should view significant secondary test results as suggesting the need for further research, rather than as definitive, because statistically significant results may, in fact, be due simply to an increased chance of a type I error resulting from multiple testing.

The computation needed by all three procedures is burdensome and not amenable to calculation by hand or pocket calculator. Calculations, however, are easily done with the widely available computerized statistical packages BMDP (Biomedical Computer Programs: P-series) (20), SAS (Statistical Analysis System) (18, 19), and SPSS (Statistical Package for the Social Sciences) (21, 22). The first program, BMDP, will compute Hotelling's T² directly, whereas SAS and SPSS programs compute an equivalent statistic through one-way multivariate analysis of variance (MANOVA) when there are two groups. All
Cupples et al. • Multiple Testing of Hypotheses 123



Table 2. Correlations for Framingham Data*

Variable                        (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)   (10)
(1) Systolic blood pressure     1.0
(2) Cholesterol                0.14    1.0
(3) Blood sugar                0.12   0.03    1.0
(4) Cigarettes                -0.07   0.06  -0.05    1.0
(5) Relative weight            0.28   0.13   0.08  -0.07    1.0
(6) Age                        0.24   0.09   0.10  -0.15   0.03    1.0
(7) Hemoglobin                 0.07   0.11   0.00   0.08   0.14  -0.11    1.0
(8) Lung capacity             -0.17  -0.08  -0.05  -0.01  -0.07  -0.39   0.05    1.0
(9) Uric acid                  0.14   0.09  -0.03  -0.04   0.29   0.01   0.08  -0.02    1.0
(10) Pulse                     0.25   0.09   0.10   0.14   0.14  -0.02   0.05  -0.13   0.08    1.0

* n = 2248 men
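
The error-rate arithmetic that motivates the multivariate approach (ten tests at the 0.05 level, and the Bonferroni rule of 0.005 cited with Table 1) can be checked with a short calculation. The following is a minimal sketch in Python, assuming independent tests as the text does; the function and variable names are illustrative and are not drawn from BMDP, SAS, or SPSS.

```python
# Sketch under stated assumptions: the helper names are illustrative, and the
# family-wise calculation assumes the k tests are independent, as the text does.

def familywise_error_rate(alpha: float, k: int) -> float:
    """Chance of at least one type I error across k independent tests."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_threshold(alpha: float, k: int) -> float:
    """Per-test level that keeps the overall alpha level at or below alpha."""
    return alpha / k


# Two-sided p values as reported in Table 1.
table1_p_values = {
    "systolic blood pressure": 0.0001,
    "cholesterol": 0.0001,
    "blood sugar": 0.0323,
    "cigarettes smoked": 0.0613,
    "Framingham relative weight": 0.0001,
    "age": 0.0001,
    "hemoglobin": 0.9700,
    "lung capacity": 0.0058,
    "uric acid": 0.0014,
    "ventricular rate": 0.0058,
}

k = len(table1_p_values)
unadjusted = familywise_error_rate(0.05, k)     # ten tests, each at 0.05
threshold = bonferroni_threshold(0.05, k)       # 0.05 / 10
adjusted = familywise_error_rate(threshold, k)  # ten tests at the reduced level
significant = [name for name, p in table1_p_values.items() if p < threshold]

print(f"Overall error chance, ten tests at 0.05: {unadjusted:.2f}")
print(f"Bonferroni per-test threshold: {threshold:.3f}")
print(f"Overall error chance at that threshold: {adjusted:.3f}")
print("Below the Bonferroni threshold:", ", ".join(significant))
```

Because the variables in Table 2 are correlated, the independence assumption is only approximate here; Bonferroni's adjustment does not depend on it, which is one reason the adjustment is convenient at the second stage of analysis.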

three packages do discriminant analysis, which can also be done with multiple linear regression techniques (6). Logistic regression can also be done with the SAS and BMDP programs.

In the following discussion, we assume that all variables are continuous. After describing these methods, we discuss the issue of nominal variables.

HOTELLING'S T² AND BONFERRONI'S ADJUSTMENT

The multivariate counterpart of the conventional independent samples t-statistic for the comparison of mean values of two groups on a single variable is Hotelling's T² statistic (8, 9, 20). Instead of using one variable, several variables, say k, are used in multivariate analyses. The null hypothesis states that with multivariate consideration of all k variables simultaneously, there are no differences in the mean values between the two groups. The alternative hypothesis is that the two groups differ in their mean values on at least one variable. To compute Hotelling's T² statistic, we need the mean value of each variable for each group and a pooled estimate of the common variability in the two groups (Appendix 1). One rejects the null hypothesis if

$$F = \frac{n_1 + n_2 - k - 1}{(n_1 + n_2 - 2)\,k}\; T^2$$

yields a significantly large value. This test statistic (F) has an F distribution with k (numerator) and $n_1 + n_2 - k - 1$ (denominator) degrees of freedom, where $n_1$ and $n_2$ are the sample sizes of the two groups (8).

When Hotelling's T² is statistically significant, one rejects the overall null hypothesis and proceeds to examine the individual variables in the second stage of multivariate analysis. Again, one is faced with many tests of hypotheses and would like to control the overall alpha level. One method, called Bonferroni's adjustment (5, 8), reduces the significance level (alpha) for each comparison and then uses a conventional two-sample t-test. Specifically, to do k tests, one for each variable, divide the alpha level by k; if the test statistic yields a p-value of less than alpha/k, then one concludes that the two groups differ on a particular variable.

Unlike Hotelling's T² statistic, these multiple t-tests at this second stage of analysis do not incorporate the correlations among variables, even though Bonferroni's adjustment controls the overall alpha level. In fact, these tests use the same statistic that we criticized earlier on this account. Hence, this procedure allows a researcher to describe the differences in mean values between the groups, but it does not consider confounding variables or intercorrelations among variables that may explain these differences.

For the data from the Framingham Heart Study, Hotelling's T² is 188.73. Hence,

$$F = \frac{2237}{(2246)(10)}\,(188.73) = 18.79$$

with 10 and 2237 degrees of freedom. This F value is highly significant with p < 0.0001, suggesting that men who have coronary heart disease differ from men who have no coronary heart disease in their averages on at least one of the ten variables.

Given the overall conclusion that men with coronary heart disease differ from men without coronary heart disease, we next examine each of the ten variables separately. With Bonferroni's adjustment we should conclude that there is a significant difference between the two groups whenever the p-value associated with the conventional t-statistic is less than 0.05/10, or 0.005. Applying this procedure to the results in Table 1, we see that the two groups differ in their averages of systolic blood pressure, cholesterol level, Framingham relative weight, age, and uric acid level. Vital lung capacity and ventricular rate appear to be marginally significant. Thus, if the two groups were actually alike, the overall chance of seeing at least one difference as extreme as these is at most 0.05.

DISCRIMINANT ANALYSIS AND BONFERRONI'S ADJUSTMENT

Another approach to the problem of comparing two groups on multiple variables is through discriminant analysis (6, 8, 22). First, the variables (say k) for each patient are reduced to a single variable by using a "linear discriminant" function, L, to combine them:

$$L = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$$
124 January 1984 • Annals of Internal Medicine • Volume 100 • Number 1



where $X_1, X_2, \ldots, X_k$ are the k variables and $b_0, b_1, b_2, \ldots, b_k$ are linear coefficients. The two groups are then compared with respect to L, which is chosen to maximize the discrimination between the groups.

The differences between the groups are measured by a statistic called Mahalanobis' distance, D², which can be converted to an F statistic:

$$F = \frac{n_1 n_2\,(n_1 + n_2 - k - 1)}{(n_1 + n_2)(n_1 + n_2 - 2)\,k}\; D^2$$

with k (numerator) and $n_1 + n_2 - k - 1$ (denominator) degrees of freedom. This statistic tests the null hypothesis that no differences exist between the two groups on any of the k variables considered simultaneously. If an overall difference is found, the coefficients of the discriminant function can be examined to determine which variables are statistically different between the two populations. These individual comparisons are made through t-statistics. (These t-statistics are not from the usual t-test, but are similar to t-statistics for individual variables in a multiple regression.) This examination accounts for the intercorrelations among the variables and should incorporate some method for controlling the overall alpha level, such as Bonferroni's technique described earlier.

Another use of discriminant analysis is in the related problem of classifying persons into one of two groups on the basis of a set of variables. For example, if a new person appears with these k variables, one may use the discriminant function to predict the group to which this person belongs. A previously published description (6) of discriminant analysis focuses on this approach.

The results of a discriminant analysis of the Framingham data are given in Table 3. Mahalanobis' D² is 0.4116, and the corresponding F statistic with k = 10 and $(n_1 + n_2 - k - 1) = 2237$ degrees of freedom is

$$F = \frac{(1608)(640)(2237)}{(2248)(2246)(10)}\,(0.4116) = 18.77$$

This value is significantly large (p < 0.0001), indicating that men with coronary heart disease differ from men without the disease on at least one of the ten variables.

Table 3. Discriminant Analysis of Framingham Data*

Discriminant Function Variable   Coefficient   t Value†   p Value†
Constant                         -6.731        -7.56      0.000
Systolic blood pressure          -0.011        -4.09      0.000
Cholesterol                      -0.009        -7.32      0.000
Blood sugar                      -0.002        -0.96      0.336
Cigarettes                       -0.012        -3.18      0.002
Relative weight                  -0.012        -3.13      0.002
Age                              -0.038        -5.95      0.000
Hemoglobin                        0.005         1.37      0.171
Lung capacity                    -0.001        -1.51      0.131
Uric acid                        -0.005        -1.13      0.258
Ventricular rate                 -0.002        -0.39      0.699

* Mahalanobis' D² = 0.4116.
† The t-value is used to test the null hypothesis that the coefficient is zero, and the corresponding p-value is computed for a two-sided test.

Given the overall difference, it is appropriate to take a closer look at the individual variables, which one evaluates as part of the overall analysis. As with regression, the coefficients of the discriminant function are not directly comparable because they depend on the scale of the variables. To achieve comparability, one can compute standardized coefficients by subtracting the corresponding mean from each observation and dividing by the corresponding standard deviation before computing the discriminant coefficients. (In reading the literature, it is important to know which coefficients, unstandardized or standardized, are presented.) However, t-statistics, which test the null hypotheses to see if each coefficient is zero, adjust for the differences in scales and hence are comparable. These t-statistics are given in Table 3 with their associated p values. If Bonferroni's adjustment is used at the alpha level (p < 0.05/10 = 0.005), these results suggest that systolic blood pressure, cholesterol level, cigarettes smoked per day, relative weight, and age are significantly related to the occurrence of coronary heart disease.

These results differ from those obtained by the analysis with Hotelling's T². The overall tests are equivalent (in fact, $T^2 = D^2\,[(n_1)(n_2)]/(n_1 + n_2)$) and the resulting F statistics are theoretically identical. (Our reported F statistic for Hotelling's T² is 18.79, whereas the F statistic here is 18.77. The former value was computed by the BMDP program, and the latter by the SAS program. We attribute the minor difference in the two values to different computational methods in these two computer packages.) However, the procedures consider the individual variables at the second stage of analysis in different ways. As in an examination of the contribution of each variable to a multiple regression analysis, discriminant analysis tests the contribution of each variable while controlling for other variables considered in the analysis. Whereas Hotelling's T² statistic incorporates the relations among all variables, the t-tests for the individual variables at the second stage of analysis do not. As a result, these individual comparisons may be confounded by some other variables. For example, the two groups differ with respect to both mean relative weight and uric acid level, as seen from the individual comparisons following the Hotelling's T² analysis. But relative weight and uric acid levels are correlated (r = 0.29 in Table 2). Hence, given a patient's relative weight, the uric acid level may not contribute additional information to the likelihood of developing coronary heart disease. The discriminant analysis and its t-statistics account for this intercorrelation and indicate that once relative weight is considered, the uric acid level adds no new information to the discrimination between men who do and do not develop coronary heart disease. The discriminant t-statistics reflect intercorrelation among variables, whereas the two-sample t-tests used in Hotelling's approach do not.

LOGISTIC REGRESSION AND BONFERRONI'S ADJUSTMENT

The comparison of two groups is done within the framework of multiple regression by the logistic regression method (23-26). The goal is to predict the probability, p, that a person belongs to one of two groups (for example, men with coronary heart disease) using the k variables. (It is important to distinguish this probability, p, from a p value, which is computed in tests of hypotheses.) As in multiple linear regression, an overall test is used to determine if a significant relationship exists between p and the set of k variables. If this test statistic is significant, then individual tests are done to determine which variables contribute most to the prediction.

The equation or model for the logistic approach is

$$\log_e\!\left(\frac{p}{1 - p}\right) = b_0 + b_1 X_1 + \cdots + b_k X_k$$

where $X_1, X_2, \ldots, X_k$ are the k variables and $b_0, b_1, \ldots, b_k$ are the associated regression coefficients. An equivalent formula that often appears in the literature (23, 24) is

$$p = \frac{\exp(b_0 + b_1 X_1 + \cdots + b_k X_k)}{1 + \exp(b_0 + b_1 X_1 + \cdots + b_k X_k)}$$

where $\exp(y)$ means $e^y$.

Several advantages of this model are that it meets the requirement that p is between 0 and 1, and that it can generate relative risks for each of the variables or combinations (for epidemiologic purposes).

The overall test considers the null hypothesis that the k variables, taken together, do not improve the ability to predict the group to which a person belongs ($b_1, b_2, \ldots, b_k$ are all zero). This hypothesis is tested by means of a chi-squared statistic, called the model chi-squared statistic, that is analogous to an F-test for the overall effect of several independent variables in multiple linear regression. One rejects the null hypothesis if this statistic is significantly large when compared to a chi-squared distribution with k degrees of freedom. If the null hypothesis is rejected by the overall test, one then proceeds to test the contribution of each separate variable to the overall model, as in multiple linear regression. Again, the overall alpha level can be controlled with Bonferroni's adjustment. To test the null hypothesis that the associated coefficient is zero, the corresponding test statistic is compared with a chi-squared distribution with one degree of freedom. These tests account for the intercorrelations or confounding effects of the variables.

Table 4. Logistic Analysis of Framingham Data*

Logistic Function Variable   Coefficient   χ² Value†   p Value†
Constant                     -7.541        68.87       0.000
Systolic blood pressure       0.010        14.41       0.000
Cholesterol                   0.008        50.49       0.000
Blood sugar                   0.002         0.79       0.375
Cigarettes                    0.012        10.40       0.000
Relative weight               0.013        10.52       0.000
Age                           0.039        35.92       0.000
Hemoglobin                   -0.005         1.83       0.177
Lung capacity                 0.001         2.52       0.113
Uric acid                     0.005         1.42       0.233
Ventricular rate              0.002         0.21       0.644

* Model chi-square (10 degrees of freedom) = 177.39.
† The χ² value is used to test the null hypothesis that the coefficient is zero, and the corresponding p-value is computed by SAS (Statistical Analysis System) (18, 19) for a two-sided test.

The results of a logistic analysis of the Framingham data are given in Table 4. The model chi-squared statistic, 177.39 with ten degrees of freedom, is significantly large (p < 0.0001) and indicates that the ten variables are related to the probability of developing coronary heart disease. We now examine the individual variables. (Regression coefficients are scale-dependent and should not be directly compared; however, the chi-squared statistics are standardized measures and can be compared.) The chi-squared statistics for each variable along with associated p values are given in Table 4, indicating that systolic blood pressure, cholesterol level, cigarettes smoked per day, Framingham relative weight, and age are associated with coronary heart disease (using Bonferroni's adjustment of p < 0.005).

These results are very similar to those of the discriminant analysis. Both procedures incorporate the relations among all ten variables at both stages of the analysis (overall and individual comparisons). Results thus indicate the relation between each variable and coronary heart disease, accounting for the effect of the other nine variables in the analysis.

Using the logistic regression model, it is now possible to estimate relative risks (odds ratios) from these results (Appendix 2). For example, smoking two packs of cigarettes a day increases the risk 1.6 times compared with not smoking, when all other characteristics are equal; and a person with a systolic blood pressure of 170 mm Hg has 1.8 times the risk of a person with a pressure of 110 mm Hg. One can also estimate the chance of coronary heart disease among persons with several risk factors; persons who smoke two packs of cigarettes a day and whose blood pressure is 170 mm Hg have a risk 2.9 times that of persons who do not smoke and whose blood pressure is 110 mm Hg, when all other characteristics are equal. For a cohort study, the model can also be used to estimate relative risks directly (Appendix 2).

Recommendations

The above analyses of the Framingham data suggest somewhat different conclusions: Discriminant analysis and logistic regression yield almost identical results, indicating that baseline measures of systolic blood pressure, cholesterol level, cigarettes smoked per day, Framingham relative weight, and age are each statistically significant long-range predictors of coronary heart disease. However, the Hotelling's T² analysis suggests that baseline measures of systolic blood pressure, cholesterol level, Framingham relative weight, age, uric acid level, and possibly vital capacity and ventricular rate are significantly related to the long-range development of coronary heart disease.

Which of the three analyses is best? There is no clear answer to this question; generally, it depends upon the research goals and the type of data collected in a study. In the following section, we describe two goals. Researchers should decide before the design of a study what their goals are and then proceed accordingly.

Before listing these goals, the underlying assumptions



of these procedures should be noted. First, logistic regression does not rely on any assumptions about the underlying distribution and variance structure of the k variables and thus is called a nonparametric procedure. However, both Hotelling's T² and discriminant analysis assume that the variation in the two populations is the same and that data are normally distributed; hence, data should be continuous (measurable) and have an approximately bell-shaped distribution. If data are not continuous (such as ordinal measures) but are still approximately bell-shaped, these statistics (T² and D²) are reliable measures of the differences between the two groups. Neither Hotelling's T² nor discriminant analysis is appropriate when some of the measures are nominal or when there is great disparity in the variability of the two populations. We recommend using logistic regression in such circumstances (27, 28).

Goal 1: The primary objective is to describe differences between two groups.

If the assumptions stated above are satisfied, we recommend using Hotelling's T² statistic in this situation. The method has intuitive appeal, because it addresses the objective of the study directly, and subsequent analyses of individual t-tests (using Bonferroni's adjustment) provide immediate measures of the magnitude of the differences between the two groups. For example, in the data from the Framingham Heart Study, relative weight and uric acid level are both significantly different in the two groups. The results indicate that, on average, men with coronary heart disease are heavier in relative weight by approximately 3% and have uric acid levels that are approximately 2 mg/100 mL higher. The weakness of this approach is that tests on individual variables do not consider confounding or intercorrelations among variables.

Goal 2: The primary objective is to determine which variables are the best correlates of the group to which a person belongs.

Both discriminant analysis and logistic regression are suited to this goal. Instead of describing the differences between two groups, these procedures indicate which variables best discriminate between the two groups when all variables are considered together. In the Framingham data, both discriminant and logistic regression analyses indicate that systolic blood pressure, cholesterol level, smoking, relative weight, and age are determinants of

ally done by maximum likelihood estimation and requires special computerized routines.

Another advantage of discriminant analysis is that it can be readily extended to more than two groups. For example, in a study of the determinants of myocardial infarction and angina pectoris, one might compare three groups: persons with myocardial infarction, persons with angina pectoris, and healthy persons. Discriminant analysis suits this purpose, whereas logistic regression does not, because it requires a dichotomous dependent variable.

Sometimes, an investigator has interest in only one variable and views all others as extraneous. In this situation, one might apply a univariate procedure (Student's t-test or chi-squared test) to the primary variable, but this approach ignores the possibility of confounding by other variables. It would be better to explore the difference between groups on the primary variable after adjusting for the other, extraneous variables. Both discriminant analysis and logistic regression accomplish this goal, whereas Hotelling's T² statistic and subsequent t-tests do not. For example, suppose one is primarily interested in the relation between systolic blood pressure and coronary heart disease in the Framingham data. Because other variables are also associated with coronary heart disease, it is necessary to consider the effect of systolic blood pressure after adjusting for other variables, such as cholesterol level, smoking, relative weight, and age. To test a priori hypotheses, an ordinary significance level (for example, 0.05) can be used for the variables of interest in the second stage of analysis, whereas Bonferroni's adjustment is used for all other variables.

Discussion

We have described the use of Bonferroni's adjustments in the second stage of analysis, after one has done an overall test of differences between the groups. It is apparent that the alpha level for each test with this technique becomes small quickly as the number of variables increases. Consequently, a researcher may be faced with a significant overall statistic and no statistically significant differences on individual variables. The perplexing conclusion is that the two groups differ, but it is unclear on which variable they differ. This problem is a weakness of Bonferroni's technique. Generally, researchers should be wary of comparing groups on many variables.

A significant overall statistic and no significance on an
coronary heart disease in men when all ten variables are individual variable may also be due to multicollinearity.
considered together. This problem, which arises when two or more variables
An advantage of logistic regression is that it relies on are highly correlated and thereby "cancel each other
fewer assumptions than does discriminant analysis. In out," is familiar in multiple linear regression analysis
many studies, some variables are nominal, thereby pre- (26). This same difficulty can occur in discriminant and
cluding discriminant analysis (27, 29). When all vari- logistic regression analyses. For example, both systolic
ables are normally distributed, however, discriminant and diastolic blood pressures are associated with coro-
analysis is more powerful than logistic regression. Also, nary heart disease, but they are also highly correlated
discriminant analysis can be done by multiple linear re- with each other. When both variables are included in a
gression (6) for which there are many computerized discriminant analysis or logistic regression model, the re-
packages. Ordinary regression techniques cannot be used sult may be that neither is significant at the second stage
for logistic regression because the variability of p changes of analysis; and when only one is included, each is signifi-
with the value of the A^s. Thus, logistic regression is usu- cant. A solution is to use only one of the variables or a
single summary variable in the analysis.
There are many situations when multivariate techniques are appropriate but perhaps not practical. One such situation is a study with a considerable amount of missing data. (In our analysis of the Framingham data, we believe that the loss of 88 persons, 3.9%, was not large enough to alter the conclusions of the analysis.) Conventionally, one excludes from the multivariate analysis persons who have missing information on any of the variables. The number of persons with complete information may be much smaller than the number with information on any one variable, and so univariate methods may make better use of available data. However, univariate techniques cannot identify which variables from the total set are the best predictors. No statistical technique can handle bias introduced from missing information; however, a feasible approach when there is a large amount of missing data is to check that those persons with missing information are not atypical in available demographic information.
Finally, it is worth noting that the problem of multiple testing of hypotheses appears in many situations. Here, we have described multiple testing in the context of comparing two independent groups on many variables. Certainly, these same problems arise when the two groups are pair-matched. Another situation involving multiple testing is the periodic analysis of accumulating data in long-term clinical trials. Special statistical methods, called "sequential analysis," are used in this situation (30, 31). Multiple testing also arises in repeated analyses on different subgroups of patients (32) as well as in the common practice of battery testing. A typical example of battery testing is a hospitalized patient on whom many laboratory tests have been done, and one of the results is abnormal. How should this result be interpreted? Is it indicative of a real health problem or is it simply a type I error? Boyd and Lacher (33) and Abt (34) discuss this problem and propose some solutions.
Our concern in this article has been statistical significance. Whether statistical differences are, in fact, medically meaningful is always a concern, whether in multivariate or univariate situations.

Summary
Current medical research frequently entails the examination of many variables in the comparison of two or more groups. This reflects the complex nature of pathophysiologic processes in which many variables are associated with one another. Fortunately, major advances in data analysis have been made recently in the area of multivariate techniques, which enable an investigator to study multiple variables. In part, computers have made these achievements possible, because the computer can readily handle the burdensome computations of multivariate techniques. This article describes three procedures for the comparison of two groups on several variables: Hotelling's $T^2$, discriminant analysis, and logistic regression. Instead of examining such variables separately, researchers should apply multivariate techniques to obtain a unified analysis of the differences between the groups.

Appendix 1: The Two-Sample t-Statistic and Hotelling's $T^2$

THE TWO-SAMPLE t-STATISTIC
To compare two independent groups on a single variable, one often uses Student's t-statistic:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \quad \text{with} \quad s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$

where $\bar{x}_1, \bar{x}_2$ = sample means for groups 1 and 2; $n_1, n_2$ = sample sizes for groups 1 and 2; $s_1, s_2$ = sample standard deviations for groups 1 and 2; and $s$ = pooled-sample standard deviation. The assumptions of this statistic are that two samples are independently selected, that underlying population distributions are normally distributed, and that the variation in the two populations is the same. (The third assumption leads to the use of the pooled-sample variance, $s^2$.)

HOTELLING'S $T^2$
To compare two independent groups on many variables (for example, k), one may use Hotelling's $T^2$:

$$ T^2 = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} (\bar{x}_{1i} - \bar{x}_{2i}) \, S^{ij} \, (\bar{x}_{1j} - \bar{x}_{2j})}{\frac{1}{n_1} + \frac{1}{n_2}} $$

where $\bar{x}_{1i}, \bar{x}_{2i}$ = sample means of the ith variable in groups 1 and 2; $\bar{x}_{1j}, \bar{x}_{2j}$ = sample means of the jth variable in groups 1 and 2; $S^{ij}$ = the (i,j)th element of the inverse of the pooled-sample covariance matrix; $n_1, n_2$ = sample sizes for groups 1 and 2; and k = number of variables. The assumptions for this statistic are the same as those for Student's t-statistic. When k = 1, there is only one variable, and Hotelling's $T^2$ reduces to the square of the t-statistic given above.

COMPARISON
One can readily see the similarities between these two statistics. Both statistics are based on differences in the sample means of the two groups and are weighted by a measure of common variability in the two populations. $S^{ij}$ in the $T^2$ statistic is analogous to $s^2$ in Student's t-statistic.

Appendix 2: Estimation of Relative Risk Using the Logistic Model

CASE-CONTROL STUDIES (ODDS RATIO)
In case-control studies of rare diseases, relative risk is estimated by the odds ratio. Suppose we wish to estimate the risk of disease at level $X_i^*$ compared with that at level $X_i$ for the ith variable, adjusting for other variables considered in the analysis. This relative risk is estimated by

$$ \exp[b_i (X_i^* - X_i)] $$

where $b_i$ is the estimated logistic regression coefficient for the ith variable (24). For example, to compare a person in the Framingham data whose systolic blood pressure is 170 mm Hg with one whose pressure is 110 mm Hg, the estimated relative risk (odds ratio) is

$$ \exp[0.010 (170 - 110)] = 1.82 $$

A person who has a systolic blood pressure of 170 mm Hg thus is 1.82 times as likely to develop coronary heart disease as one whose pressure is 110 mm Hg, when all other variables are constant. Thus, the model allows an investigator to estimate the independent effect of a risk factor while adjusting for other variables.
More involved comparisons can be made. To compare a person whose systolic blood pressure is 170 mm Hg and who smokes two packs of cigarettes a day with a nonsmoker with a blood pressure of 110 mm Hg, the relative risk is estimated by

$$ \exp[0.010 (170 - 110) + 0.012 (40 - 0)] = 2.94 $$

COHORT STUDIES
In cohort studies relative risk can be estimated directly from the logistic equation by assuming mean values for all other variables. For example, to

128 January 1984 • Annals of Internal Medicine • Volume 100 • Number 1



compare a person with a systolic blood pressure of 170 mm Hg with a person whose pressure is 110 mm Hg, we first compute

$$
\begin{aligned}
\log_e\left(\frac{p_{170}}{1 - p_{170}}\right) &= -7.541 + 0.010(170) + 0.008(223.1) + 0.002(82.7) + 0.012(14.3) + 0.013(101.9) \\
&\quad + 0.039(43.9) - 0.005(143) + 0.001(373.4) + 0.005(49.6) + 0.002(73.6) \\
&= 0.010(170) - 2.3288 \\
&= -0.6288
\end{aligned}
$$

and

$$ \log_e\left(\frac{p_{110}}{1 - p_{110}}\right) = 0.010(110) - 2.3288 = -1.2288 $$

Solving each of these equations for $p_{170}$ and $p_{110}$, respectively, gives

$$ p_{170} = \frac{\exp(-0.6288)}{1 + \exp(-0.6288)} = 0.348 $$

and

$$ p_{110} = \frac{\exp(-1.2288)}{1 + \exp(-1.2288)} = 0.226 $$

Thus, the chance that a person with a systolic blood pressure of 170 mm Hg will develop coronary heart disease in 26 years is 0.348 and, similarly, the chance for a person with a pressure of 110 mm Hg to develop coronary heart disease is 0.226, assuming mean values on all other variables. Hence, the relative risk is estimated as

$$ \frac{p_{170}}{p_{110}} = \frac{0.348}{0.226} = 1.54 $$

A person with a systolic blood pressure of 170 mm Hg therefore is 1.54 times more likely to develop coronary heart disease than is a person with a pressure of 110 mm Hg, when all other variables are at their mean values.

ACKNOWLEDGMENTS: The authors thank Dr. William Kannel for making the Framingham Heart Study data available to us and Joni Kindell for preparation of the manuscript.

Requests for reprints should be addressed to L. Adrienne Cupples, Ph.D.; Boston University School of Public Health, Talbot Center, Room 325, 80 East Concord Street; Boston, MA 02118.

References
1. SHAPIRO S, SLONE D, ROSENBERG L, KAUFMAN DW, STOLLEY PD, MIETTINEN OS. Oral-contraceptive use in relation to myocardial infarction. Lancet. 1979;1:743-7.
2. SHURTLEFF D. Some characteristics related to the incidence of cardiovascular disease and death: Framingham Study 18 years follow-up. In: KANNEL WB, GORDON T, eds. The Framingham Study: An Epidemiological Investigation of Cardiovascular Disease. Bethesda, Maryland: Department of Health, Education, and Welfare, Public Health Service, National Institutes of Health, National Heart, Lung, and Blood Institute; 1974. (DHEW publication no. (NIH) 74-599.)
3. DAWBER TR. The Framingham Study: The Epidemiology of Atherosclerotic Disease. Cambridge, Massachusetts: Harvard University Press; 1980. (Commonwealth Fund Series.)
4. The University Group Diabetes Program: a study of the effects of the hypoglycemic agents on vascular complications in patients with adult-onset diabetes: II. Mortality results. Diabetes. 1970;19:785-830.
5. WALLENSTEIN S, ZUCKER CL, FLEISS JL. Some statistical methods useful in circulation research. Circ Res. 1980;47:1-9.
6. KLEINBAUM DG, KUPPER LL. Applied Regression Analysis and Other Multivariate Methods. North Scituate, Massachusetts: Duxbury Press; 1978:264-83.
7. ARMITAGE P. Statistical Methods in Medical Research. New York: Halsted Press; 1971:202-7.
8. MORRISON DF. Multivariate Statistical Methods. 2nd ed. New York: McGraw-Hill Book Company; 1976:32-6, 111-6, 128-141, 230-45.
9. WINER BJ. Statistical Principles in Experimental Design. 2nd ed. New York: McGraw-Hill Book Company; 1971:54-7, 170-204. (Psychology Series.)
10. HAWKINS BR. Table of critical chi-square values for investigations involving multiple comparisons. Tissue Antigens. 1981;17:243-4.
11. MILLER R. Developments in multiple comparisons, 1966-1976. J Am Stat Assoc. 1977;72:779-88.
12. MILLER R JR. Simultaneous Statistical Inference. New York: McGraw-Hill Book Company; 1966.
13. SCHEFFE H. Analysis of Variance. New York: John Wiley and Sons, Inc.; 1959:55-86. (Wiley Series in Probability & Mathematical Statistics.)
14. SNEDECOR GW, COCHRAN WG. Statistical Methods. 6th ed. Ames, Iowa: Iowa State University Press; 1967:268-75.
15. STOLAR MH. Interpretation of research data: hypothesis testing. Am J Hosp Pharm. 1980;37:1539-45.
16. TRUETT J, CORNFIELD J, KANNEL W. A multivariate analysis of the risk of coronary heart disease in Framingham. J Chronic Dis. 1967;20:511-24.
17. CORNFIELD J. Joint dependence of risk of coronary heart disease on serum cholesterol and systolic blood pressure: a discriminant function analysis. Fed Proc. 1962;21:58-61.
18. SAS User's Guide, 1979 Edition. Cary, North Carolina: SAS Institute; 1979:183-90, 245-63.
19. SAS Supplemental Library User's Guide, 1980 Edition. Cary, North Carolina: SAS Institute; 1980:83-102.
20. DIXON WJ, BROWN MB, eds. BMDP-77, Biomedical Computer Programs, P-series. Berkeley: University of California Press; 1977:170-84, 501-4.
21. NIE NH, HULL CH, JENKINS JG, STEINBRENNER K, BENT DH. SPSS, Statistical Package for the Social Sciences. 2nd ed. New York: McGraw-Hill Book Company; 1975:434-62.
22. HULL CH, NIE N, eds. SPSS Update: New Procedures & Facilities for Releases 7-9. New York: McGraw-Hill Book Company; 1981:1-79.
23. KLEINBAUM DG, KUPPER LL, MORGENSTERN H. Epidemiologic Research: Principles & Quantitative Methods. Belmont, California: Lifetime Learning Publications; 1982:419-507. (Research Methods Series.)
24. SCHLESSELMAN JJ, STOLLEY PD. Case Control Studies: Design, Conduct, Analysis. New York: Oxford University Press; 1982:227-80.
25. ANDERSON S, AUQUIER A, HAUCK WW. Statistical Methods for Comparative Studies: Techniques for Bias Reduction. New York: John Wiley and Sons, Inc.; 1980:161-77. (Wiley Series in Probability & Mathematical Statistics: Applied Probability & Statistics.)
26. NETER J, WASSERMAN W. Applied Linear Statistical Models. Homewood, Illinois: Richard D. Irwin, Inc.; 1974:320-35.
27. PRESS SJ, WILSON S. Choosing between logistic regression and discriminant analysis. J Am Stat Assoc. 1978;73:699-705.
28. HALPERIN M, BLACKWELDER WC, VERTER JI. Estimation of the multivariate logistic risk function: a comparison of the discriminant function and maximum likelihood approaches. J Chronic Dis. 1971;24:125-58.
29. EFRON B. The efficiency of logistic regression compared to normal discriminant analysis. J Am Stat Assoc. 1975;70:892-8.
30. ARMITAGE P. Sequential Medical Trials. 2nd ed. New York: Halsted Press; 1975.
31. O'BRIEN PC, FLEMING TR. A multiple testing procedure for clinical trials. Biometrics. 1979;35:549-56.
32. TEMPLE R, PLEDGER GW. The FDA's critique of the anturane reinfarction trial. New Engl J Med. 1980;303:1488-92.
33. BOYD JC, LACHER DA. The multivariate reference range: an alternative interpretation of multi-test profiles. Clin Chem. 1982;28:259-65.
34. ABT K. Statistical problems in the analysis of comparative pharmaco-EEG trials. Pharmakopsychiat/Neuro-Psychopharmakol. 1979;12:228-36.
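As a supplementary illustration of Appendix 1, the following minimal Python sketch computes Hotelling's T-squared statistic from the pooled-sample covariance matrix and checks numerically that, with a single variable, it equals the square of Student's t-statistic. The data are simulated and purely hypothetical (they are not the Framingham data), and the function names are our own; NumPy is assumed to be available.

```python
import numpy as np


def pooled_covariance(g1, g2):
    """Pooled-sample covariance matrix (rows = persons, columns = variables)."""
    n1, n2 = len(g1), len(g2)
    s1 = np.atleast_2d(np.cov(g1, rowvar=False))  # sample covariance, group 1
    s2 = np.atleast_2d(np.cov(g2, rowvar=False))  # sample covariance, group 2
    return ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)


def hotelling_t2(g1, g2):
    """Hotelling's T-squared for two independent groups on k variables."""
    n1, n2 = len(g1), len(g2)
    diff = g1.mean(axis=0) - g2.mean(axis=0)  # differences in sample means
    s_inv = np.linalg.inv(pooled_covariance(g1, g2))
    return float(diff @ s_inv @ diff / (1 / n1 + 1 / n2))


def two_sample_t(x1, x2):
    """Student's two-sample t-statistic with pooled variance."""
    n1, n2 = len(x1), len(x2)
    s2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return float((x1.mean() - x2.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2)))


# Hypothetical simulated data: 40 and 55 persons measured on 3 variables.
rng = np.random.default_rng(0)
group1 = rng.normal(loc=[0.5, 0.0, 0.2], scale=1.0, size=(40, 3))
group2 = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(55, 3))

t2_all = hotelling_t2(group1, group2)                # all three variables
t2_one = hotelling_t2(group1[:, :1], group2[:, :1])  # first variable only (k = 1)
t_stat = two_sample_t(group1[:, 0], group2[:, 0])

print("T-squared, k = 3:", t2_all)
print("T-squared, k = 1:", t2_one)
print("t squared       :", t_stat ** 2)
```

The last two printed values should agree up to rounding error, which is the k = 1 reduction stated in Appendix 1.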

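The arithmetic in Appendix 2 can be checked with a short script. The sketch below uses only the coefficients and mean values printed in Appendix 2; the function and variable names are our own, and the pairing of each coefficient with a named variable is not assumed beyond what the appendix states (0.010 for systolic blood pressure and 0.012 for cigarettes smoked per day).

```python
import math


def odds_ratio(terms):
    """Odds ratio exp(sum of b * (x_star - x)) over (b, x_star, x) triples."""
    return math.exp(sum(b * (x_star - x) for b, x_star, x in terms))


def logistic_probability(logit):
    """Solve log(p / (1 - p)) = logit for p."""
    return math.exp(logit) / (1 + math.exp(logit))


# Case-control style estimates (odds ratios) from Appendix 2.
or_sbp = odds_ratio([(0.010, 170, 110)])
or_sbp_smoking = odds_ratio([(0.010, 170, 110), (0.012, 40, 0)])

# Cohort style estimate: intercept plus every other variable held at its
# mean value, as printed in Appendix 2 (coefficient, mean) pairs.
intercept = -7.541
other_terms = [
    (0.008, 223.1), (0.002, 82.7), (0.012, 14.3), (0.013, 101.9),
    (0.039, 43.9), (-0.005, 143), (0.001, 373.4), (0.005, 49.6), (0.002, 73.6),
]
offset = intercept + sum(b * mean for b, mean in other_terms)  # about -2.3288

p170 = logistic_probability(0.010 * 170 + offset)
p110 = logistic_probability(0.010 * 110 + offset)
relative_risk = p170 / p110

print(round(or_sbp, 2), round(or_sbp_smoking, 2))  # 1.82 2.94
print(round(p170, 3), round(p110, 3))              # 0.348 0.226
print(round(relative_risk, 2))                     # 1.54
```

Note that the odds ratio (1.82) exceeds the relative risk computed directly from the probabilities (1.54); the two agree closely only when the disease is rare, which is the condition Appendix 2 attaches to the case-control estimate.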
