
American Journal of Epidemiology Vol. 146, No. 12
Copyright © 1997 by The Johns Hopkins University School of Hygiene and Public Health Printed in U.S.A.
All rights reserved

Living Beyond Our "Means": New Methods for Comparing Distributions

Camara Phyllis Jones

This paper introduces the projection methods for describing and testing the differences between pairs of
continuous distributions. These methods include the projection plot, the projection spline, and the iter-1 test.
The projection plot displays the difference between corresponding quantiles against the average of the

Downloaded from https://academic.oup.com/aje/article/146/12/1056/111425 by guest on 12 April 2022


corresponding quantiles. It is analogous to an empirical quantile-quantile plot that has been rotated 45
degrees. The projection spline is a knotted linear spline iteratively fit to the projection plot so that all knots are
associated with significant changes in slope. It summarizes nonrandom deviations from linearity on the
projection plot, allowing classification of the highest level of difference between two distributions as a
difference in shape, in spread, or in location. The iter-1 test compares the first iteration of the projection spline
with the line y = 0, providing a global test of difference between two distributions that is more powerful in
simulations than either the chi-square test of independence or the Kolmogorov-Smirnov test. These methods
will enhance epidemiologic practice by making the comparison of full distributions an accessible tool for
routine data analysis. Am J Epidemiol 1997;146:1056-66.

computer simulation; epidemiologic methods; models, statistical

The data that epidemiologists observe and analyze are usually sampled from some underlying, unknown
distribution. If that distribution is continuous, it can be characterized by 1) its location, which is the
typical value, such as the mean, median, or mode; 2) its spread, which is the dispersion around that typical
value, such as the standard deviation or interquartile range; and 3) its shape, which describes other aspects
of how the data are grouped, including asymmetry, peakedness, and the number of data clusters. Common
approaches to the comparison of samples from continuous distributions, including Student's t test and linear
regression methods, represent a comparison of locations only. The wealth of information available in a
comparison of spreads and shapes most often goes untapped.
The empirical quantile-quantile plot, introduced by Wilk and Gnanadesikan (1) and popularized by Tukey
(2) and Chambers et al. (3), provides a succinct graphical comparison of location, spread, and shape for two
samples from continuous distributions. The quantile-quantile plot has the advantages that no assumptions
are made about the shape of either of the underlying distributions, that all of the data are used in generating
the plot, and that there is no arbitrary grouping or smoothing of the data. However, interpretation of the
quantile-quantile plot is based solely on practiced visual assessment. There is no associated statistical test
to aid in interpreting apparent differences.
This paper introduces the projection methods (4), a set of new methods for describing and testing the
differences between pairs of continuous distributions. These methods include 1) the projection plot, a
graphical display of the difference between distributions that retains the advantages of the empirical
quantile-quantile plot from which it is derived; 2) the iter-1 test, a global test of difference based on the
projection plot; and 3) the projection spline, a summary model fit to the projection plot that enables a
difference between distributions to be classified as a difference in shape, in spread, or in location. The
projection methods improve on the quantile-quantile plot by including a statistical test of whether
distributions differ and improve on all currently available methods by providing a statistical summary of how
distributions differ. They will enhance epidemiologic practice by making the comparison of full distributions
an accessible tool for routine data analysis.

Received for publication October 7, 1996, and accepted for publication July 8, 1997.
Abbreviation: LR, likelihood ratio.
Department of Health and Social Behavior and Department of Epidemiology, Harvard School of Public Health,
Boston, MA.
Reprint requests to Dr. Camara Jones, Department of Health and Social Behavior, Harvard School of Public
Health, 677 Huntington Avenue, Boston, MA 02115.

BACKGROUND
The empirical quantile-quantile plot is constructed by plotting the sample quantiles (order statistics) of


one batch of data against the corresponding sample quantiles of a second batch. If the two batches contain
the same number of data points, then the smallest observation from batch X is plotted against the smallest
observation from batch Y, the second smallest from X against the second smallest from Y, and so forth until
the largest observation from X is plotted against the largest observation from Y. If the two batches differ in
size, the convention is to plot all of the ordered data from the smaller batch against interpolated values from
the larger batch that estimate the corresponding quantiles (3).
As seen in figure 1, if two distributions are identical, corresponding quantiles from the distributions will fall
roughly along the line y = x. If two distributions differ in location only, corresponding quantiles will fall
roughly along a line parallel to the line y = x but be displaced from it by the magnitude of the location
difference. If two distributions differ in spread but not in shape, corresponding quantiles will fall roughly
along a straight line the slope of which differs from 1. If two distributions differ in shape, corresponding
quantiles will deviate from a straight line pattern.
Therefore, one approach to test whether two distributions differ would be to fit a line to the empirical
quantile-quantile plot and then test the fitted line for difference from the line y = x. Although the
correlation and unequal variances of sample quantiles violate the assumptions of ordinary least squares
regression, this violation does not present a major stumbling block to model-fitting, since the asymptotic
covariance of sample quantiles has been described (5). The asymptotic covariance between the ith and jth
sample quantiles (order statistics), for i ≤ j, is

$$\frac{p_i\,(1 - p_j)}{n \,\mathrm{dens}_i \,\mathrm{dens}_j}$$

where p_i is i/(n + 1), the percentile associated with the ith order statistic; n is the total sample size; and
dens_i is the probability density of the underlying distribution at the ith order statistic. This covariance can
be taken into account with ordinary least squares re-
gression by doing postregression adjustment of the
standard errors (6).
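
The covariance structure described above can be sketched in a few lines. The following is a minimal illustration, not the author's implementation: it assumes the usual asymptotic form p_i(1 - p_j)/(n dens_i dens_j) for i ≤ j, and it assumes that the densities at the order statistics are supplied by the analyst (how they are estimated is not specified here). The function name `quantile_covariance` is hypothetical.

```python
def quantile_covariance(dens):
    """Asymptotic covariance matrix of the n sample quantiles (order statistics).

    dens[i] is an assumed (known or separately estimated) probability density
    at the (i+1)th order statistic. Entry [i][j] uses the smaller index for
    p and the larger index for (1 - p), so the matrix is symmetric.
    """
    n = len(dens)
    # p_i = i / (n + 1) for i = 1..n
    p = [(i + 1) / (n + 1) for i in range(n)]
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            lo, hi = min(i, j), max(i, j)
            cov[i][j] = p[lo] * (1.0 - p[hi]) / (n * dens[i] * dens[j])
    return cov


# Illustration with three order statistics and a flat density of 1.
cov = quantile_covariance([1.0, 1.0, 1.0])
```

With a flat density the diagonal entries reduce to p_i(1 - p_i)/n, which makes the unequal variances and positive correlation of neighboring quantiles easy to see.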
It is the lack of symmetry of least squares regression
methods that is problematic when fitting models to the
empirical quantile-quantile plot. The choice of which
set of quantiles to plot on the x-axis and which set to
plot on the y-axis of the quantile-quantile plot is com-
pletely arbitrary. In fact, reciprocal plots are mirror
images of one another. Yet with least squares regression methods, the two sets of quantiles are not treated
symmetrically. The slopes from reciprocal models are not exact reciprocals, since least squares methods
minimize the vertical distance between the data and the
fitted line. The standard errors from reciprocal models
are likewise not equivalent, since only the covariance
of the quantiles plotted on the y-axis is taken into
account whereas the quantiles plotted on the x-axis are
treated as fixed. Occasionally, reciprocal models fit to
the quantile-quantile plot can lead to conflicting infer-
ences about whether two distributions differ.
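
The asymmetry can be checked numerically. The sketch below uses made-up quantile values (an assumption for illustration only) and a hypothetical helper `ols_slope`; it shows that the slope of B on A and the slope of A on B multiply to the squared correlation rather than to 1.

```python
def ols_slope(x, y):
    """Ordinary least squares slope from regressing y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical corresponding quantiles from two groups (illustrative values).
a = [100, 110, 118, 131, 150]
b = [98, 113, 120, 128, 155]

slope_b_on_a = ols_slope(a, b)   # group A on the x-axis
slope_a_on_b = ols_slope(b, a)   # group B on the x-axis

# For exact reciprocals this product would be 1; with least squares it is
# the squared correlation, which is below 1 unless the points are collinear.
product = slope_b_on_a * slope_a_on_b
```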

PROJECTION PLOT
The problems of asymmetry between "dependent" and "independent" quantiles are resolved when the
data on the empirical quantile-quantile plot are viewed from the perspective of the line y = x. This
symmetry can be seen in the top left panel of figure 2. With respect to the line y = x, it does not matter
whether group A quantiles are plotted on the x-axis and group B quantiles on the y-axis, or vice versa. The
points (A,B) and (B,A) are equidistant from the line y = x, and they project on the line at the same point.

FIGURE 1. Interpretation of empirical quantile-quantile plots, constructed using simulated data sets for
which there was no difference (top left), a location difference only (top right), a spread difference only
(bottom left), and a difference in skewness only (bottom right) between the parent distributions.

Am J Epidemiol Vol. 146, No. 12, 1997



In the top right panel of figure 2, the calculation of the perpendicular distance of the point (A,B) from the
line y = x is shown. The vertical distance of the point (A,B) from the line y = x is the difference
between the corresponding quantiles, B-A. Because the line y = x makes a 45 degree angle with the
x-axis, the right triangle with hypotenuse of length B-A has legs of length (B-A)/sqrt(2). The perpendicular
distance of the point (A,B) from the line y = x is therefore (B-A)/sqrt(2).
In the bottom left panel of figure 2, the calculation of the coordinates of the point at which (A,B) projects
on the line y = x is shown. The small right triangle with hypotenuse of length (B-A)/sqrt(2) has legs of
length (B-A)/2. The sum of A and (B-A)/2 is (A+B)/2, or the average of the corresponding quantiles. The
projection of the point (A,B) on the line y = x is therefore at the point ((A+B)/2, (A+B)/2).
The bottom right panel of figure 2 suggests rotating the quantile-quantile plot 45 degrees so that the line
y = x is the new x-axis. The new coordinates of the point that is (A,B) on the quantile-quantile plot
become ((A+B)/2, (B-A)/sqrt(2)) on the rotated plot.
The proposed projection plot (figure 3) is the rotated empirical quantile-quantile plot, after the y-coordinate
has been multiplied by the constant sqrt(2) for ease of interpretation. It is called the "projection plot" because
the data on the quantile-quantile plot are viewed from the vantage point of the line y = x, as if
perpendicularly projected onto that line. On the projection plot, the difference between corresponding
quantiles, B-A, is plotted against the average of the corresponding quantiles, (A+B)/2. In figure 4, the
relation between the quantile-quantile plot and the projection plot is shown. Note that the line y = x on
the quantile-quantile plot becomes the line difference = 0 on the projection plot.

FIGURE 2. Derivation of the projection plot from the empirical quantile-quantile plot, illustrating the
symmetry of the points (A,B) and (B,A) with respect to the line y = x on the quantile-quantile plot (top
left), calculation of the perpendicular distance of the point (A,B) from the line y = x (top right), calculation
of the coordinates of the perpendicular projection of the point (A,B) on the line y = x (bottom left), and
the view if one rotated the plot 45 degrees (bottom right).

FIGURE 3. The projection plot, constructed by plotting the difference between corresponding quantiles
against the average of the corresponding quantiles.

Inferences about relative locations, spreads, and shapes that could be made from the empirical
quantile-quantile plot can likewise be made from the related projection plot. If two distributions do not
differ, the data on the projection plot will fall along the line difference = 0. If two distributions differ in
location only, the data on the projection plot will fall along a line with slope zero and intercept that
corresponds to the difference in location. If two distributions differ in spread but not shape, the data on the
projection plot will fall along a line with a nonzero slope. If two distributions differ in shape, the data will
deviate from a straight line pattern.
The projection plot therefore retains all of the advantages of the empirical quantile-quantile plot from
which it is derived. Differences in location, spread,


and shape are readily appreciated from the plot, no assumptions are made about the shape of either
underlying distribution, all of the data are used in generating the plot, and there is no arbitrary grouping or
smoothing of the data.
Moreover, the projection plot improves on the empirical quantile-quantile plot through its symmetric
treatment on both the x- and the y-axis of the two distributions being compared. This feature allows
model-fitting to the data using ordinary least squares methods.

FIGURE 4. Comparison of the quantile-quantile and projection plots, showing the point (A,B) on the
empirical quantile-quantile plot (top left) and the corresponding point ((A+B)/2, B-A) on the projection plot
(top right). The comparison between two systolic blood pressure distributions is shown using the
quantile-quantile plot (bottom left) and the corresponding projection plot (bottom right).

PROJECTION SPLINE AND ITER-1 TEST

The projection spline is a knotted linear spline iteratively fit to the projection plot to provide a flexible,
potentially nonlinear summary of the relation between two distributions. A knotted linear spline is a
piecewise polynomial of power 1, that is, a function consisting of line segments connected end to end (7, 8).
A knotted linear spline with two knots (three segments) can be modeled as

$$\beta_0 + \beta_1 x + \beta_2\,\mathrm{xplus}(\mathrm{knot}_1) + \beta_3\,\mathrm{xplus}(\mathrm{knot}_2),$$

where xplus(k) is evaluated as 0 for x ≤ k, and as (x − k) for x > k (7). As is illustrated in figure 5,
the coefficient associated with each knot describes the change in slope at that knot.

FIGURE 5. Example of a knotted linear spline with two knots (three segments), which can be modeled as
$\beta_0 + \beta_1 x + \beta_2\,\mathrm{xplus}(\mathrm{knot}_1) + \beta_3\,\mathrm{xplus}(\mathrm{knot}_2)$.

The steps in the iterative fit of a projection spline are detailed in figure 6 and illustrated in figure 7. Initially,
knots are evenly spaced throughout the range of the data. The model is fit, and the significance of each
knot coefficient is evaluated. If any knot is associated with a nonsignificant change in slope, the knot with
the least significant change in slope is removed from the model. The reduced model is fit, and the
significance of each remaining knot coefficient is reevaluated. The iterative removal of knots, one at a time,
is continued until all remaining knots are associated with significant changes in slope at the specified α
level.
The initial iteration of the projection spline, called the iter-1 spline, provides the basis for a global test of
difference between the two distributions. The multivariate Wald test is used to compare the parameter
estimates from the iter-1 spline with a vector of zeroes. The author refers to this application of the
multivariate Wald test as the iter-1 test.
The final iteration of the projection spline is used to classify the highest level of difference between the
distributions, with the hierarchy of difference being shape > spread > location > none. The classification


The four initial steps in the fit of a projection spline are:

(1) Decide on alpha, the two-sided type 1 error rate used to evaluate coefficients at each
iteration.
This will be guided by the purpose of the analysis and the sample sizes available.

(2) Choose the initial spacing of the candidate knots.


This determines where and how often the fitted spline is allowed to change slopes.

(3) Choose the minimum sample size allowed in an interval.


This reflects the amount of data considered necessary for estimation of the changes in
slope. If an interval contains fewer data than this minimum, it is merged with the
interval to the right by removing the intervening knot from the model. If the rightmost
interval contains too few data, it is merged with the interval to the left.

(4) Estimate the covariance matrix of the quantile differences.

Because the two sets of quantiles being compared come from independent samples, the
covariance of the quantile differences is the sum of the quantile covariances. The same
covariance matrix is used at each iteration to adjust the standard errors of the
parameter estimates.

The four iterative steps in the fit of a projection spline are:

(5) Perform ordinary least squares regression to estimate the parameters of the model

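$$y = \beta_0 + \beta_1\,\mathrm{xplus}(0) + \beta_2\,\mathrm{xplus}(\mathrm{knot}_1) + \beta_3\,\mathrm{xplus}(\mathrm{knot}_2) + \dots + \beta_n\,\mathrm{xplus}(\mathrm{knot}_{n-1}),$$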

where y is the difference between corresponding quantiles, and the xplus variables are
functions of the average of the corresponding quantiles. Note that when all x are non-
negative, the variable x can be expressed as the xplus function of a knot at 0.

(6) Adjust the standard errors of the parameter estimates for the covariance of the quantile
differences.
Adjusted standard errors are obtained by pre- and post-multiplying the covariance
matrix estimated in step (4) above by the $(X'X)^{-1}$ matrix from the regression in step (5).

(7) Calculate the Wald statistic and the associated p value for each slope parameter.
The Wald statistic is the ratio of the parameter estimate to its standard error. Under
the null hypothesis that the parameter is zero, this statistic has a t distribution with
degrees of freedom equal to the sample size minus the number of parameters in the
model.

(8) Eliminate the least significant knot.


If all knot coefficients have p values less than alpha, do nothing more. The final model
has been identified. If any knot coefficient has p value greater than or equal to alpha,
identify the knot whose coefficient has the largest p value and eliminate it from the
model. Return to step (5) above to start another iteration.

FIGURE 6. The iterative fit of a projection spline.
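
The control flow of figure 6 can be sketched as follows. This is a simplified illustration under stated assumptions, not the author's code: it assumes equal sample sizes (so no interpolation of quantiles is needed), it writes the linear term as x directly rather than as xplus(0), it omits the minimum-interval merging of step (3), and it leaves the regression, standard error adjustment, and p value calculation of steps (5) to (7) to a caller-supplied function. All function names (`projection_points`, `xplus`, `design_row`, `eliminate_knots`) and the `knot_p_values` callback are hypothetical.

```python
def projection_points(a, b):
    """Projection plot coordinates for two equal-size samples.

    Returns (average, difference) pairs: ((A+B)/2, B-A) for each pair of
    corresponding sample quantiles.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    qa, qb = sorted(a), sorted(b)
    return [((x + y) / 2.0, y - x) for x, y in zip(qa, qb)]


def xplus(x, knot):
    """0 for x <= knot, and (x - knot) for x > knot."""
    return x - knot if x > knot else 0.0


def design_row(x, knots):
    """One design-matrix row: intercept, x, and one xplus term per knot."""
    return [1.0, x] + [xplus(x, k) for k in knots]


def eliminate_knots(knots, knot_p_values, alpha=0.05):
    """Step (8): drop the least significant knot until all have p < alpha.

    knot_p_values(knots) is an assumed callback standing in for steps (5)
    to (7); it must return one p value per knot, in the same order.
    """
    knots = list(knots)
    while knots:
        p = knot_p_values(knots)
        worst = max(range(len(knots)), key=lambda i: p[i])
        if p[worst] < alpha:
            break              # every remaining knot is significant
        del knots[worst]       # remove the knot with the largest p value
    return knots
```

Keeping the p value calculation behind a callback separates the elimination loop from the choice of regression software; in a real analysis the callback would refit the model at each iteration, so the p values of the remaining knots would change between iterations.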

is based on three characteristics of the projection spline: the number of spline segments, the slope if
only one segment, and the intercept if only one segment with zero slope. The algorithm is summarized in
figure 8 and described below:
1. Evaluate the number of spline segments. If there is more than one segment, there is evidence of a
   shape difference.
2. If there is only one segment, evaluate the slope of that segment. If the slope differs significantly
   from zero, there is evidence of a spread difference.
3. If there is only one segment with slope equal to zero, evaluate the intercept of that segment. If the
   intercept differs significantly from zero, there is evidence of a location difference.
4. If there is only one segment with slope equal to zero and intercept that does not differ significantly
   from zero, there is evidence of no difference in shape, spread, or location.

PERFORMANCE OF THE PROPOSED METHODS
One thousand samples, each consisting of 200 observations, are randomly generated from each of 10
known parent distributions as described in table 1. For


[Figure 7 panels: Candidate knots; Initial knots; Iteration 1; Iteration 2; Iteration 3; Iteration 4;
Iteration 5; Iteration 6; Iteration 7]

FIGURE 7. Iterative fit of the projection spline, which in this case is iteratively fit with α = 0.05 to the projection plot comparing two systolic
blood pressure distributions. The candidate knots are spaced every 10 mmHg at multiples of 10. The minimum sample size in an interval is
set at 20, so some knots are removed before the first iteration to coalesce adjacent spline segments. The caret (^) marks the knot that is
associated with the least significant change in slope for a given iteration, which is dropped from the model before the next iteration.

example, the "reference" distribution is Gaussian, with a mean of 120 and a standard deviation of 15. The
relations of each of the other nine distributions to the reference distribution are displayed in figure 9. Note
that five of these distributions are non-Gaussian distributions that are generated as mixtures of Gaussian
distributions. Pairwise comparisons between the 1,000 samples from the reference distribution and the 1,000
samples from each of the other nine distributions are done to assess 1) the performance of the iter-1 test as
a global test of difference, and 2) the performance of the projection spline as a basis for classifying the
highest level of difference.

Shape?
  If not shape: Spread?
    If neither shape nor spread: Location?
      None?

             Shape   Spread   Location   None
Segments     >1      1        1          1
Slope                ≠0       0          0
Intercept                     ≠0         0

FIGURE 8. Classification of the highest level of difference between two distributions, based on the
projection spline.

TABLE 1. Parent distributions for randomly generated data sets

                          Component Gaussian distribution(s)    Differs from "reference"?
Name of distribution      Mean    Standard     No. sampled      Location   Spread   Shape
                                  deviation
Reference                 120     15           200
No difference             120     15           200              No         No       No
Location differs          125     15           200              Yes        No       No
Spread differs            120     20           200              No         Yes      No
Skewness differs          117     9            160              No         No       Yes
                          132     24.9         40
Kurtosis differs          120     9            160              No         No       Yes
                          120     28.3         40
Modality differs          108     9            100              No         No       Yes
                          132     9            100
Location/spread differ    125     20           200              Yes        Yes      No
All differ, skew          120     15           160              Yes        Yes      Yes
                          145     15           40
All differ, symmetric     120     15           100              Yes        Yes      Yes
                          145     15           100

Performance of the iter-1 test
For each simulated comparison with the reference distribution, the iter-1 test (with candidate knots at
multiples of 10 and minimum sample size of 20 observations in an interval), the chi-square test of
independence (with intervals of width 10 centered at multiples of 10), and the two-sample
Kolmogorov-Smirnov test are performed. (The parameters of the iter-1 and chi-square tests were chosen
because of an interest in comparing systolic blood pressure distributions.) The p values from these tests are
dichotomized as <0.05 or ≥0.05 to reflect a decision rule that rejects the null hypothesis of no difference
with a nominal type 1 error rate of 0.05. The results from these simulations are presented in table 2.
Type 1 error rates. Estimates of the actual type 1 error rates of the three tests are based on the 1,000
comparisons between the reference and "no difference" distributions. The type 1 error rate of each test is
the proportion of simulations with p < 0.05, in which the null hypothesis of no difference is erroneously
rejected. The top line of table 2 shows that this occurs 52/1,000 times with the iter-1 test, 51/1,000 times with
the chi-square test, and 43/1,000 times with the Kolmogorov-Smirnov test. All three tests have observed
type 1 error rates near the nominal α level of 0.05.
Power against specific alternatives. Comparisons of each of the other distributions described in table 1
with the reference distribution provide estimates of the power of the iter-1, the chi-square, and the
Kolmogorov-Smirnov tests to detect various types and combinations of location, spread, and shape
differences. The power of each test is the proportion of simulations with p < 0.05, in which the null
hypothesis of no difference is correctly rejected. For example, the second line of table 2 shows that the
iter-1 test has an estimated 71 percent power to detect a 5-unit shift in location relative to the reference
distribution. This compares with the chi-square test with 68 percent power and the Kolmogorov-Smirnov test
with 81 percent power to detect the same difference.
The iter-1 test performs better than the chi-square and the Kolmogorov-Smirnov tests in most types of
comparison. This superior performance is especially pronounced when there are shape differences between
the two distributions being compared. The Kolmogorov-Smirnov test is the most powerful of the three tests in
detecting a location shift, but otherwise it performs poorly. None of the three tests is very powerful in
detecting an isolated difference in modality.

Performance of the projection spline
The simulated comparisons described above are summarized by projection splines fit with candidate
knots at multiples of 10, minimum sample size of 20 observations in an interval, and two-sided α = 0.05.
Each projection spline is classified into one of four categories according to the number of spline segments

[Figure 9 panels: No difference; Location differs; Spread differs; Skewness differs; Kurtosis differs;
Modality differs; Location/spread differ; All differ, skew; All differ, symmetric]

FIGURE 9. Parent distributions for the simulated comparisons. The probability densities for each of the nine distributions named in the
captions (broken lines) are compared with the probability density of the reference distribution (solid line).
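
The mixture parameters in table 1 can be checked against the stated design. The sketch below is an illustration only: it assumes that the "No. sampled" column gives fixed counts drawn from each Gaussian component, and the names `mixture_moments`, `draw_sample`, `SKEWNESS`, and `KURTOSIS` are hypothetical. It confirms that the "skewness differs" and "kurtosis differs" mixtures have an overall mean of 120 and a standard deviation close to 15, matching the reference distribution in location and spread.

```python
import math
import random


def mixture_moments(components):
    """Mean and standard deviation of a Gaussian mixture.

    components is a list of (mean, sd, count) tuples; counts act as weights.
    """
    total = sum(n for _, _, n in components)
    mean = sum(n * m for m, _, n in components) / total
    second = sum(n * (s * s + m * m) for m, s, n in components) / total
    return mean, math.sqrt(second - mean * mean)


def draw_sample(components, rng):
    """Draw a fixed number of observations from each Gaussian component."""
    sample = []
    for m, s, n in components:
        sample.extend(rng.gauss(m, s) for _ in range(n))
    return sample


# Component parameters as listed in table 1: (mean, sd, number sampled).
SKEWNESS = [(117, 9, 160), (132, 24.9, 40)]
KURTOSIS = [(120, 9, 160), (120, 28.3, 40)]

skew_mean, skew_sd = mixture_moments(SKEWNESS)
kurt_mean, kurt_sd = mixture_moments(KURTOSIS)
one_sample = draw_sample(SKEWNESS, random.Random(0))  # 200 observations
```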

(1 vs. >1), the slope (0 vs. ≠0) if there is a single segment, and the intercept (0 vs. ≠0) if there is a
single segment with slope 0. The highest level of difference between the two distributions is theorized to
relate to this classification as summarized in figure 8.
The distribution of the four classes of spline in these simulations is presented in table 3. Each row
summarizes 1,000 comparisons of samples from the named distribution with samples from the reference
distribution. For example, when samples from the no difference distribution are compared with samples from
the reference distribution, 93/1,000 splines consist of more than one segment, 23/1,000 splines consist of
one segment with nonzero slope, 43/1,000 splines consist of one segment with zero slope and nonzero
intercept, and 841/1,000 splines consist of one segment with zero slope and zero intercept.
Sensitivity. The data in table 3 can be used to calculate the sensitivity of the various classes of
projection spline. In each row of the table, the bold value is the number of splines falling into the proper
category based on the proposed classification of projection splines. For example, comparison of samples
from the no difference and the reference distributions results in 841/1,000 splines correctly allocated to the
"none" category. In other words, the sensitivity of the one


segment, zero slope, zero intercept projection spline to detect a true lack of difference between underlying
distributions is 84 percent. In these simulations, the sensitivity of the projection spline ranges from 51
percent in detecting an isolated difference in modality to 98 percent in detecting an isolated difference in
skewness.

TABLE 2. Power to detect differences in the simulated comparisons, using the iter-1 test, the chi-square
test of independence, and the two-sample Kolmogorov-Smirnov (K-S) test

                          No. of comparisons with p < 0.05
                          from 1,000 simulations
Name of distribution      Iter-1    Chi-square    K-S
No difference             52        51            43
Location differs          708       683           808
Spread differs            850       751           279
Skewness differs          953       863           418
Kurtosis differs          918       693           310
Modality differs          450       321           83
Location/spread differ    969       935           878
All differ, skew          884       799           709
All differ, symmetric     1,000     1,000         1,000

TABLE 3. Highest level of difference inferred in the simulated comparisons, based on the projection spline*

                          Projection spline classification of the
                          highest level of difference
Name of distribution      Shape    Spread    Location    None
No difference             93       23        43          841
Location differs          107      18        809         66
Spread differs            421      561       1           17
Skewness differs          975      1         1           23
Kurtosis differs          844      1         6           149
Modality differs          507      4         8           481
Location/spread differ    375      613       9           3
All differ, skew          645      183       151         21
All differ, symmetric     552      424       24          0

* Each row represents the results of 1,000 simulations comparing the named distribution with the reference
distribution. Bold values are the number of splines falling into the proper category based on the proposed
classification of projection splines.

Strength of evidence. Likelihood ratio tests can be done using the data in table 3 to assess the strength of
evidence that a given class of spline represents in support of one hypothesis over a specified alternative
hypothesis. As an example, let us estimate the strength of a spline with one segment, zero slope, and
nonzero intercept as evidence in favor of a location difference over no difference. The likelihood of
observing such a spline is 809/1,000 in the "location differs"-reference comparison, and 43/1,000 in the no
difference-reference comparison. This yields a likelihood ratio (LR) of 0.809/0.043, or 18.8. The specified
spline therefore represents very strong evidence in favor of the hypothesis of a location difference over the
hypothesis of no difference.
The results of analogous calculations for each of the four classes of spline are presented in table 4 and
summarized below:
• One segment, zero slope, zero intercept: Very strong evidence of no difference in underlying
  distributions against all of the simulated alternatives except an isolated difference in modality
  (LR 1.7).
• One segment, zero slope, nonzero intercept: Very strong evidence of a location difference against
  all of the simulated alternatives.
• One segment, nonzero slope: Very strong evidence of a spread difference against all of the
  simulated alternatives except combined differences in location, spread, and shape (LRs 3.2 and
  1.4).
• More than one segment: Strong evidence of a shape difference against all of the simulated
  alternatives except an isolated spread difference (LR 1.7) and a combined difference in location
  and spread (LR 1.9).
The LRs reported in table 4 can be applied to the analysis of an actual data set when the alternative
hypotheses being considered are analogous in both type and magnitude of difference to the simulated
comparisons on which the LRs are based.

TABLE 4. Likelihood ratios based on simulated comparisons*

                          Projection spline classification of the
                          highest level of difference
Name of distribution      Shape    Spread    Location    None
No difference             7.6      25.5      18.8
Location differs          6.6      32.6                  12.7
Spread differs            1.7                809.0       49.5
Skewness differs                   587.0     809.0       36.6
Kurtosis differs                   587.0     134.8       5.6
Modality differs                   146.8     101.1       1.7
Location/spread differ    1.9                89.9        280.3
All differ, skew                   3.2       5.4         40.0
All differ, symmetric              1.4       33.7        Infinite

* Each likelihood ratio provides a measure of the evidence that the given class of spline favors the inferred
level of difference over the named alternative.

DISCUSSION
This paper introduces the projection methods for the pairwise comparison of continuous distributions. The
projection plot, derived from the empirical quantile-quantile plot, displays the difference between
corresponding quantiles against the average of the corresponding quantiles. It provides a simultaneous
graphical comparison of location, spread, and shape


between two distributions. The related projection spline, an iteratively fitted knotted linear spline, allows identification of the highest level of difference between the two distributions being compared. The iter-1 test, based on the first iteration of the projection spline, provides a global test of difference between the two distributions.

The projection plot is simple to construct, makes no assumptions about the shape of either underlying distribution, and uses all of the available data without arbitrary grouping or smoothing. Because of the vertical symmetry of the projection plot, ordinary least squares regression methods can be used to fit consistent statistical models to the data. This improvement over the quantile-quantile plot enables the development of a statistical test for evaluating the significance of apparent differences.

The iter-1 test provides a global test of difference between two distributions that is more powerful in simulations for most types of comparison than either the chi-square test of independence or the Kolmogorov-Smirnov test. The difference in performance is especially pronounced when there are shape differences between the two distributions being compared. A line fit to the projection plot could also serve as the basis for a global test of difference. However, it would be less powerful than a knotted linear spline for nonlinear alternative hypotheses, that is, when a shape difference between distributions is suspected.

The projection spline provides the basis for a detailed description of the nonrandom differences between two distributions. The algorithm for classifying the highest level of difference based on the projection spline is easy to apply, and the performance is generally quite good. However, a multisegment projection spline provides only weakly positive evidence for a shape difference over a spread difference in the simulations performed. Similarly, a single-segment projection spline with nonzero slope provides only weakly positive evidence for a spread difference over a mixed shape/spread/location difference.

It is noteworthy that although the projection plot is nonparametric, the projection spline is not. The final model depends on the number, spacing, and location of candidate knots, the minimum sample size allowed in an interval, and the α level used for knot retention. The effects of variations in each of these parameters and in total sample size on the performance of the projection spline merit further study. Development of a lexicon for interpreting the nature of shape differences and refinement of a method for quantifying the degree of shape differences represent other promising extensions of this work.

Why study distributions?

Information is lost when a full set of data is reduced to a smaller set of summary statistics, such as a proportion or a mean. Yet before the advent of modern computing, the best way to understand a large batch of numbers was to reduce it to a few summary statistics. Now, powerful computers facilitate the graphical presentation and statistical manipulation of large amounts of data, making the study of full distributions quite feasible.

The simultaneous comparison of location, spread, and shape is a powerful approach for detecting group differences. If two distributions have the same location but differ in shape or in spread, they will not be distinguished by Student's t test or by linear regression methods. Yet differences in shape or in spread can be of biological significance. The shape of a distribution contains information about the process that generated the distribution. A bimodal distribution might suggest two etiologies, for example, and a truncated distribution might suggest some biological threshold. Similarly, the spread of a distribution contains information about tolerable limits of variability and may reflect the degree of homeostatic control of a biological process. Indeed, increases in spread with increasing age have been noted for a number of variables, including systolic blood pressure (9, 10).

The study of full distributions is also compelling because distributions characterize populations. Just as individuals within a population can differ from one another, whole populations can also differ from each other (11). The ability to describe differences on a population level facilitates discourse about population-level risk factors and population-level interventions.

Potential applications

It is anticipated that the proposed methods will enjoy wide application in many areas of epidemiologic inquiry, since they can be used whenever two samples from continuous distributions are to be compared. These methods will enable epidemiologists to compare 1) continuous outcomes between exposed and unexposed groups in prospective studies, 2) continuous risk factors between cases and controls in retrospective studies, 3) continuous baseline characteristics or continuous outcomes between treatment groups in clinical trials, and 4) continuous measures between populations of interest in cross-sectional studies. For example, these methods can be used to examine differences in infant lung function by intrauterine cigarette smoke exposure, differences in measures of physical functioning by extremes of socioeconomic status, or differences in systolic blood pressure by "race."
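As a rough illustration of what comparing two samples with a projection plot involves, the sketch below computes the plotted points (difference between corresponding quantiles against their average) and fits an ordinary least squares line to them. It is only a sketch under stated assumptions, not the author's implementation: the function names `projection_plot_points` and `ols_line`, the choice of 19 common cut points, the use of Python's `statistics.quantiles` to define "corresponding quantiles", the sign convention (second sample minus first), and the toy data are all assumptions made for illustration. The knotted spline fitting and the iter-1 test are not attempted here.

```python
from statistics import quantiles


def projection_plot_points(x, y, n=20):
    """Return (average, difference) pairs of corresponding quantiles.

    Assumptions (not taken from the paper): both samples are summarized
    at the same n - 1 cut points using statistics.quantiles with its
    default 'exclusive' method, and the difference is y minus x.
    """
    qx = quantiles(x, n=n)
    qy = quantiles(y, n=n)
    return [((a + b) / 2.0, b - a) for a, b in zip(qx, qy)]


def ols_line(points):
    """Ordinary least squares slope and intercept for (u, v) points."""
    m = len(points)
    mean_u = sum(u for u, _ in points) / m
    mean_v = sum(v for _, v in points) / m
    sxx = sum((u - mean_u) ** 2 for u, _ in points)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in points)
    slope = sxy / sxx
    return slope, mean_v - slope * mean_u


# Hypothetical data: an evenly spaced reference sample and two variants.
reference = [i / 10.0 for i in range(-30, 31)]
shifted = [v + 2.0 for v in reference]      # location differs only
stretched = [2.0 * v for v in reference]    # spread differs only

shift_slope, shift_intercept = ols_line(
    projection_plot_points(reference, shifted))
stretch_slope, stretch_intercept = ols_line(
    projection_plot_points(reference, stretched))
```

In this toy setting the shifted sample gives a flat line with a nonzero intercept, and the stretched sample gives a single line with a nonzero slope. These are the two patterns the paper associates with a location difference and a spread difference, respectively.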


By understanding how distributions differ, we may be led to new hypotheses about why they differ and may spawn innovative approaches to improve the public's health.

ACKNOWLEDGMENTS

This work was supported in part by National Heart, Lung, and Blood Institute Cardiovascular Training grant T32 HL07024.

The author gratefully acknowledges the valuable input of Drs. Karen Bandeen-Roche and Curtis Meinert.

REFERENCES

1. Wilk MB, Gnanadesikan R. Probability plotting methods for the analysis of data. Biometrika 1968;55:1-17.
2. Tukey JW. Exploratory data analysis. Reading, MA: Addison-Wesley Publishing Company, 1977.
3. Chambers JM, Cleveland WS, Kleiner B, et al. Graphical methods for data analysis. Pacific Grove, CA: Wadsworth and Brooks/Cole Publishing Company, 1983.
4. Jones CP. Methods for comparing distributions: development and application exploring "race"-associated differences in systolic blood pressure. Doctoral dissertation. Department of Epidemiology, The Johns Hopkins School of Hygiene and Public Health, Baltimore, MD, 1994.
5. Serfling RJ. Approximation theorems of mathematical statistics. New York, NY: John Wiley and Sons, 1980.
6. Weisberg S. Applied linear regression. 2nd ed. New York, NY: John Wiley and Sons, 1985.
7. Smith PL. Splines as a useful and convenient statistical tool. Am Stat 1979;33:57-62.
8. Wold S. Spline functions in data analysis. Technometrics
9. Drizd T, Dannenberg AL, Engel A. Blood pressure levels in persons 18-74 years of age in 1976-80, and trends in blood pressure from 1960 to 1980 in the United States. Hyattsville, MD: National Center for Health Statistics, 1986. Vital and health statistics, series 11, no. 234. (DHHS publication no. (PHS) 86-1684).
10. Berkman LF. The changing and heterogeneous nature of aging and longevity: a social and biomedical perspective. Ann Rev Gerontol Geriatr 1988;8:37-68.
11. Rose G. The strategy of preventive medicine. Oxford, England: Oxford University Press, 1992.
