Copyright © 1997 by The Johns Hopkins University School of Hygiene and Public Health Printed in U.S.A.
All rights reserved
This paper introduces the projection methods for describing and testing the differences between pairs of
continuous distributions. These methods include the projection plot, the projection spline, and the iter-1 test.
The projection plot displays the difference between corresponding quantiles against the average of the
The data that epidemiologists observe and analyze are usually sampled from some underlying, unknown distribution. If that distribution is continuous, it can be characterized by 1) its location, which is the typical value, such as the mean, median, or mode; 2) its spread, which is the dispersion around that typical value, such as the standard deviation or interquartile range; and 3) its shape, which describes other aspects of how the data are grouped, including asymmetry, peakedness, and the number of data clusters. Common approaches to the comparison of samples from continuous distributions, including Student's t test and linear regression methods, represent a comparison of locations only. The wealth of information available in a comparison of spreads and shapes most often goes untapped.
The empirical quantile-quantile plot, introduced by Wilk and Gnanadesikan (1) and popularized by Tukey (2) and Chambers et al. (3), provides a succinct graphical comparison of location, spread, and shape for two samples from continuous distributions. The quantile-quantile plot has the advantages that no assumptions are made about the shape of either of the underlying distributions, that all of the data are used in generating the plot, and that there is no arbitrary grouping or smoothing of the data. However, interpretation of the quantile-quantile plot is based solely on practiced visual assessment. There is no associated statistical test to aid in interpreting apparent differences.
This paper introduces the projection methods (4), a set of new methods for describing and testing the differences between pairs of continuous distributions. These methods include 1) the projection plot, a graphical display of the difference between distributions that retains the advantages of the empirical quantile-quantile plot from which it is derived; 2) the iter-1 test, a global test of difference based on the projection plot; and 3) the projection spline, a summary model fit to the projection plot that enables a difference between distributions to be classified as a difference in shape, in spread, or in location. The projection methods improve on the quantile-quantile plot by including a statistical test of whether distributions differ and improve on all currently available methods by providing a statistical summary of how distributions differ. They will enhance epidemiologic practice by making the comparison of full distributions an accessible tool for routine data analysis.
Received for publication October 7, 1996, and accepted for publication July 8, 1997.
Abbreviation: LR, likelihood ratio.
Department of Health and Social Behavior and Department of Epidemiology, Harvard School of Public Health, Boston, MA.
Reprint requests to Dr. Camara Jones, Department of Health and Social Behavior, Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115.
BACKGROUND
The empirical quantile-quantile plot is constructed by plotting the sample quantiles (order statistics) of
New Methods for Comparing Distributions 1057
one batch of data against the corresponding sample quantiles of a second batch. If the two batches contain the same number of data points, then the smallest observation from batch X is plotted against the smallest observation from batch Y, the second smallest from X against the second smallest from Y, and so forth until the largest observation from X is plotted against the largest observation from Y. If the two batches differ in size, the convention is to plot all of the ordered data from the smaller batch against interpolated values from the larger batch that estimate the corresponding quantiles (3).
As seen in figure 1, if two distributions are identical,
If two distributions differ in shape, corresponding quantiles will deviate from a straight line pattern.
Therefore, one approach to test whether two distributions differ would be to fit a line to the empirical quantile-quantile plot and then test the fitted line for difference from the line y = x. Although the correlation and unequal variances of sample quantiles violate the assumptions of ordinary least squares regression, this violation does not present a major stumbling block to model-fitting, since the asymptotic covariance of sample quantiles has been described (5). The asymptotic covariance between the ith and jth sample quantiles (order statistics), for i ≤ j, is
PROJECTION PLOT
The problems of asymmetry between "dependent" and "independent" quantiles are resolved when the data on the empirical quantile-quantile plot are viewed from the perspective of the line y = x. This symmetry can be seen in the top left panel of figure 2. With respect to the line y = x, it does not matter whether group A quantiles are plotted on the x-axis and group B quantiles on the y-axis, or vice versa. The points (A,B) and (B,A) are equidistant from the line y = x, and they project on the line at the same point.
FIGURE 1. Interpretation of empirical quantile-quantile plots, constructed using simulated data sets for which there was no difference (top left), a location difference only (top right), a spread difference only (bottom left), and a difference in skewness only (bottom right) between the parent distributions.
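The pairing of quantiles and the geometry relative to the line y = x can be illustrated with a minimal sketch. It is not the author's implementation. The function names are hypothetical, and the plotting positions (i + 0.5)/m used to interpolate quantiles of the larger batch are an assumed convention, one of several in common use. The projection-plot coordinates follow the paper's description (difference between corresponding quantiles against their average), and the perpendicular distance uses the (B-A)/sqrt(2) result derived in this section.

```python
import math

def sample_quantile(sorted_values, p):
    """Linearly interpolated quantile of pre-sorted data at probability p.

    Assumes plotting positions (i + 0.5)/n for the ordered values; this is
    an illustrative convention, not necessarily the one used in the paper.
    """
    n = len(sorted_values)
    pos = min(max(p * n - 0.5, 0.0), n - 1.0)
    lo = math.floor(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def paired_quantiles(a, b):
    """Pair the ordered data of the smaller batch with matching quantiles
    interpolated from the larger batch."""
    a, b = sorted(a), sorted(b)
    m = min(len(a), len(b))
    probs = [(i + 0.5) / m for i in range(m)]
    qa = a if len(a) == m else [sample_quantile(a, p) for p in probs]
    qb = b if len(b) == m else [sample_quantile(b, p) for p in probs]
    return list(zip(qa, qb))

def projection_point(a_q, b_q):
    """Projection-plot coordinates (average, difference) of a quantile pair.

    The point (a_q, b_q) projects onto y = x at (average, average).
    """
    return (a_q + b_q) / 2, b_q - a_q

def perpendicular_distance(a_q, b_q):
    """Signed perpendicular distance of (a_q, b_q) from the line y = x."""
    return (b_q - a_q) / math.sqrt(2)

if __name__ == "__main__":
    pairs = paired_quantiles([3, 1, 2], [60, 10, 20, 30, 40, 50])
    for a_q, b_q in pairs:
        print(projection_point(a_q, b_q), perpendicular_distance(a_q, b_q))
```

Swapping the two batches changes only the sign of the difference and of the distance. The average, and therefore the projection point on y = x, is unchanged, which is the symmetry described above.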
point (A,B) from the line y = x is the difference B-A
x-axis, the right triangle with hypotenuse of length B-A has legs of length (B-A)/sqrt(2). The perpendicular distance of the point (A,B) from the line y = x is therefore (B-A)/sqrt(2).
In the bottom left panel of figure 2, the calculation of the coordinates of the point at which (A,B) projects on the line y = x is shown. The small right triangle
(1) Decide on alpha, the two-sided type 1 error rate used to evaluate coefficients at each
iteration.
This will be guided by the purpose of the analysis and the sample sizes available.
(5) Perform ordinary least squares regression to estimate the parameters of the model
where y is the difference between corresponding quantiles, and the xplus variables are
functions of the average of the corresponding quantiles. Note that when all x are non-
negative, the variable x can be expressed as the xplus function of a knot at 0.
(6) Adjust the standard errors of the parameter estimates for the covariance of the quantile
differences.
Adjusted standard errors are obtained by pre- and post-multiplying the covariance
matrix estimated in step (4) above by the (X'X)^-1 matrix from the regression in step (5).
(7) Calculate the Wald statistic and the associated p value for each slope parameter.
The Wald statistic is the ratio of the parameter estimate to its standard error. Under
the null hypothesis that the parameter is zero, this statistic has a t distribution with
degrees of freedom equal to the sample size minus the number of parameters in the
model.
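Steps (5) and (7) can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the knot values in the usage note are invented, and the design row (an intercept plus one xplus term per knot) is an assumed form of a linear spline, since the model equation itself does not appear in the text above. The sketch stops at the Wald statistic and its degrees of freedom; the p value would come from a t distribution routine, which is not included here.

```python
def xplus(x, knot):
    """Truncated linear term: 0 at or below the knot, x - knot above it."""
    return max(0.0, x - knot)

def design_row(x, knots):
    """Assumed regression row: intercept followed by one xplus term per knot.

    With non-negative x, a knot at 0 reproduces x itself, as noted in step (5).
    """
    return [1.0] + [xplus(x, k) for k in knots]

def wald_statistic(estimate, adjusted_se):
    """Ratio of a parameter estimate to its (adjusted) standard error."""
    return estimate / adjusted_se

def degrees_of_freedom(n_obs, n_params):
    """Sample size minus the number of parameters in the model."""
    return n_obs - n_params

if __name__ == "__main__":
    # Hypothetical average quantile of 125 with knots at 0, 110, 120, 130.
    print(design_row(125.0, [0, 110, 120, 130]))
    # Hypothetical slope estimate and adjusted standard error.
    print(wald_statistic(0.12, 0.05), degrees_of_freedom(200, 5))
```

For an average of 125 and knots at 0, 110, 120, and 130, the row is [1.0, 125.0, 15.0, 5.0, 0.0]: the knot at 0 carries x itself, and knots above x contribute nothing.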
is based on three characteristics of the projection spline: the number of spline segments, the slope if only one segment, and the intercept if only one segment with zero slope. The algorithm is summarized in figure 8 and described below:
1. Evaluate the number of spline segments. If there is more than one segment, there is evidence of a shape difference.
2. If there is only one segment, evaluate the slope of that segment. If the slope differs significantly from zero, there is evidence of a spread difference.
3. If there is only one segment with slope equal to zero, evaluate the intercept of that segment. If the intercept differs significantly from zero, there is evidence of a location difference.
4. If there is only one segment with slope equal to zero and intercept that does not differ significantly from zero, there is evidence of no difference in shape, spread, or location.
PERFORMANCE OF THE PROPOSED METHODS
One thousand samples, each consisting of 200 observations, are randomly generated from each of 10 known parent distributions as described in table 1. For
FIGURE 7. Iterative fit of the projection spline, which in this case is iteratively fit with α = 0.05 to the projection plot comparing two systolic blood pressure distributions. The candidate knots are spaced every 10 mmHg at multiples of 10. The minimum sample size in an interval is set at 20, so some knots are removed before the first iteration to coalesce adjacent spline segments. The caret (^) marks the knot that is associated with the least significant change in slope for a given iteration, which is dropped from the model before the next iteration.
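The four-step classification of a fitted projection spline (shape, then spread, then location, then none) can be written as a short decision function. This is a sketch under stated assumptions: the function name, its arguments, and the labels "shape", "spread", and "location" are illustrative, while "none" follows the category name used in the paper. The significance flags are assumed to come from the Wald tests on the final spline.

```python
def classify_projection_spline(n_segments,
                               slope_significant=False,
                               intercept_significant=False):
    """Return the highest level of difference suggested by the spline.

    n_segments: number of segments remaining in the fitted spline.
    slope_significant: whether the single segment's slope differs
        significantly from zero (ignored if there are several segments).
    intercept_significant: whether the intercept of a single zero-slope
        segment differs significantly from zero.
    """
    if n_segments > 1:
        return "shape"      # step 1: more than one segment
    if slope_significant:
        return "spread"     # step 2: one segment, nonzero slope
    if intercept_significant:
        return "location"   # step 3: zero slope, nonzero intercept
    return "none"           # step 4: zero slope and zero intercept

if __name__ == "__main__":
    print(classify_projection_spline(3))
    print(classify_projection_spline(1, slope_significant=True))
    print(classify_projection_spline(1, intercept_significant=True))
    print(classify_projection_spline(1))
```

The order of the checks matters: a spline with several segments is classified as a shape difference regardless of its slopes or intercept, matching the hierarchy in the algorithm.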
pendence (with intervals of width 10 centered at multiples of 10), and the two-sample Kolmogorov-Smirnov test are performed. (The parameters of the iter-1 and chi-square tests were chosen because of an interest in comparing systolic blood pressure distributions.) The p values from these tests are dichotomized as <0.05 or ≥0.05 to reflect a decision rule that rejects the null hypothesis of no difference with a nominal type 1 error rate of 0.05. The results from these simulations are presented in table 2.
Type 1 error rates. Estimates of the actual type 1 error rates of the three tests are based on the 1,000 comparisons between the reference and "no difference" distributions. The type 1 error rate of each test is the proportion of simulations with p < 0.05, in which the null hypothesis of no difference is erroneously rejected. The top line of table 2 shows that this occurs 52/1,000 times with the iter-1 test, 51/1,000 times with the chi-square test, and 43/1,000 times with the Kolmogorov-Smirnov test. All three tests have observed type 1 error rates near the nominal α level of 0.05.
Power against specific alternatives. Comparisons of each of the other distributions described in table 1 with the reference distribution provide estimates of the power of the iter-1, the chi-square, and the Kolmogorov-Smirnov tests to detect various types and combinations of location, spread, and shape differences. The power of each test is the proportion of simulations with p < 0.05, in which the null hypothesis of no difference is correctly rejected. For example, the second line of table 2 shows that the iter-1 test has an estimated 71 percent power to detect a 5-unit shift in location relative to the reference distribution. This compares with the chi-square test with 68 percent power and the Kolmogorov-Smirnov test with 81 percent power to detect the same difference.
The iter-1 test performs better than the chi-square and the Kolmogorov-Smirnov tests in most types of comparison. This superior performance is especially pronounced when there are shape differences between the two distributions being compared. The Kolmogorov-Smirnov test is the most powerful of the three tests in detecting a location shift, but otherwise it performs poorly. None of the three tests is very powerful in detecting an isolated difference in modality.
Performance of the projection spline
The simulated comparisons described above are summarized by projection splines fit with candidate knots at multiples of 10, minimum sample size of 20 observations in an interval, and two-sided α = 0.05. Each projection spline is classified into one of four categories according to the number of spline segments
FIGURE 9. Parent distributions for the simulated comparisons. The probability densities for each of the nine distributions named in the
captions (broken lines) are compared with the probability density of the reference distribution (solid line).
(1 vs. >1), the slope (0 vs. ≠0) if there is a single segment, and the intercept (0 vs. ≠0) if there is a single segment with slope 0. The highest level of difference between the two distributions is theorized to relate to this classification as summarized in figure 8.
The distribution of the four classes of spline in these simulations is presented in table 3. Each row summarizes 1,000 comparisons of samples from the named distribution with samples from the reference distribution. For example, when samples from the no difference distribution are compared with samples from the reference distribution, 93/1,000 splines consist of more than one segment, 23/1,000 splines consist of one segment with nonzero slope, 43/1,000 splines consist of one segment with zero slope and nonzero intercept, and 841/1,000 splines consist of one segment with zero slope and zero intercept.
Sensitivity. The data in table 3 can be used to calculate the sensitivity of the various classes of projection spline. In each row of the table, the bold value is the number of splines falling into the proper category based on the proposed classification of projection splines. For example, comparison of samples from the no difference and the reference distributions results in 841/1,000 splines correctly allocated to the "none" category. In other words, the sensitivity of the one
By understanding how distributions differ, we may be led to new hypotheses about why they differ and may spawn innovative approaches to improve the public's health.
ACKNOWLEDGMENTS
This work was supported in part by National Heart, Lung, and Blood Institute Cardiovascular Training grant T32 HL07024.
The author gratefully acknowledges the valuable input of Drs. Karen Bandeen-Roche and Curtis Meinert.
methods for data analysis. Pacific Grove, CA: Wadsworth and Brooks/Cole Publishing Company, 1983.
4. Jones CP. Methods for comparing distributions: development and application exploring "race"-associated differences in systolic blood pressure. Doctoral dissertation. Department of Epidemiology, The Johns Hopkins School of Hygiene and Public Health, Baltimore, MD, 1994.
5. Serfling RJ. Approximation theorems of mathematical statistics. New York, NY: John Wiley and Sons, 1980.
6. Weisberg S. Applied linear regression. 2nd ed. New York, NY: John Wiley and Sons, 1985.
7. Smith PL. Splines as a useful and convenient statistical tool. Am Stat 1979;33:57-62.
8. Wold S. Spline functions in data analysis. Technometrics
9. Drizd T, Dannenberg AL, Engel A. Blood pressure levels in