
Thirteen Ways to Look at the Correlation Coefficient

JOSEPH LEE RODGERS and W. ALAN NICEWANDER*

The American Statistician, Vol. 42, No. 1 (Feb. 1988), pp. 59-66

*Joseph Lee Rodgers is Associate Professor and W. Alan Nicewander is Professor and Chair, Department of Psychology, University of Oklahoma, Norman, Oklahoma 73019. The authors thank the reviewers, whose comments improved the article.

In 1885, Sir Francis Galton first defined the term "regression" and completed the theory of bivariate correlation. A decade later, Karl Pearson developed the index that we still use to measure correlation, Pearson's r. Our article is written in recognition of the 100th anniversary of Galton's first discussion of regression and correlation. We begin with a brief history. Then we present 13 different formulas, each of which represents a different computational and conceptual definition of r. Each formula suggests a different way of thinking about this index, from algebraic, geometric, and trigonometric settings. We show that Pearson's r (or simple functions of r) may variously be thought of as a special type of mean, a special type of variance, the ratio of two means, the ratio of two variances, the slope of a line, the cosine of an angle, and the tangent to an ellipse, and may be looked at from several other interesting perspectives.

INTRODUCTION

We are currently in the midst of a "centennial decade" for correlation and regression. The empirical and theoretical developments that defined regression and correlation as statistical topics were presented by Sir Francis Galton in 1885. Then, in 1895, Karl Pearson published Pearson's r. Our article focuses on Pearson's correlation coefficient, presenting both the background and a number of conceptualizations of r that will be useful to teachers of statistics.

We begin with a brief history of the development of correlation and regression. Following, we present a longer review of ways to interpret the correlation coefficient. This presentation demonstrates that the correlation has developed into a broad and conceptually diverse index; at the same time, for a 100-year-old index it is remarkably unaffected by the passage of time.

The basic idea of correlation was anticipated substantially before 1885 (MacKenzie 1981). Pearson (1920) credited Gauss with developing the normal surface of n correlated variates in 1823. Gauss did not, however, have any particular interest in the correlation as a conceptually distinct notion; instead he interpreted it as one of the several parameters in his distributional equations.
In a previous historical paper published in 1895, Pearson credited Auguste Bravais, a French astronomer, with developing the bivariate normal distribution in 1846 (see Pearson 1920). Bravais actually referred to one parameter of the bivariate normal distribution as "une correlation," but like Gauss, he did not recognize the importance of the correlation as a measure of association between variables. [By 1920, Pearson had rescinded the credit he gave to Bravais. But Walker (1929) and Seal (1967) reviewed the history that Pearson both reported and helped develop, and they supported Bravais's claim to historical precedence.] Galton's cousin, Charles Darwin, used the concept of correlation in 1868 by noting that "all the parts of the organisation are to a certain extent connected or correlated together." Then, in 1877, Galton first referred to "reversion" in a lecture on the relationship between physical characteristics of parent and offspring seeds. The "law of reversion" was the first formal specification of what Galton later renamed "regression."

During this same period, important developments in philosophy also contributed to the concepts of correlation and regression. In 1843, the British philosopher John Stuart Mill first presented his "Five Canons of Experimental Inquiry." Among those was included the method of concomitant variation: "Whatever phenomenon varies in any manner whenever another phenomenon varies in some particular manner, is either a cause or an effect of that phenomenon, or is connected with it through some fact of causation." Mill suggested three prerequisites for valid causal inference (Cook and Campbell 1979). First, the cause must temporally precede the effect. Second, the cause and effect must be related. Third, other plausible explanations must be ruled out. Thus the separability of correlation and causation and the specification of the former as a necessary but not sufficient condition for the latter were being recognized almost simultaneously in the established discipline of philosophy and the fledgling discipline of biometry.

By 1885 the stage was set for several important contributions. During that year, Galton was the president of the Anthropological Section of the British Association. In his presidential address, he first referred to regression as an extension of the "law of reversion." Later in that year (Galton 1885) he published his presidential address along with the first bivariate scatterplot showing a correlation (Fig. 1). In this graph he plotted the frequencies of combinations of children's height and parents' height. When he smoothed the results and then drew lines through points with equal frequency, he found that "lines drawn through entries of the same value formed a series of concentric and similar ellipses." This was the first empirical representation of the isodensity contour lines from the bivariate normal distribution. With the assistance of J. D. Hamilton Dickson, a Cambridge mathematician, Galton was able to derive the theoretical formula for the bivariate normal distribution. This formalized mathematically the topic on which Gauss and Bravais had been working a half century before. Pearson (1920) stated that "in 1885 Galton had completed the theory of bi-variate normal correlation" (p. 37).

In the years following 1885, several additional events added mathematical import to Galton's 1885 work. In 1888, Galton noted that r measures the closeness of the "co-relation," and suggested that r could not be greater than 1 (although he had not yet recognized the idea of negative correlation). Seven years later, Pearson (1895) developed the mathematical formula that is still most commonly used to measure correlation, the Pearson product-moment correlation coefficient.

[Figure 1. The First Bivariate Scatterplot (from Galton 1885). "Diagram based on Table I": mid-parents' heights plotted against adult children's heights, with deviations from 68¼ inches; all female heights multiplied by 1.08.]


Table 1. Landmarks in the History of Correlation and Regression

Date | Person | Event
1823 | Carl Friedrich Gauss, German mathematician | Developed the normal surface of N correlated variates.
1843 | John Stuart Mill, British philosopher | Proposed four canons of induction, including concomitant variation.
1846 | Auguste Bravais, French naval officer and astronomer | Referred to "une correlation"; worked on the bivariate normal distribution.
1868 | Charles Darwin, Galton's cousin, British natural philosopher | "All parts of the organisation are ... connected or correlated."
1877 | Sir Francis Galton, British, the first biometrician | First discussed "reversion," the predecessor of regression.
1885 | Sir Francis Galton | First referred to "regression"; published the bivariate scatterplot with normal isodensity lines, the first graph of correlation; "completed the theory of bi-variate normal correlation" (Pearson 1920).
1888 | Sir Francis Galton | Defined r conceptually; specified its upper bound.
1895 | Karl Pearson, British statistician | Defined the (Galton-)Pearson product-moment correlation coefficient.
1920 | Karl Pearson | Wrote "Notes on the History of Correlation."
1985 | | Centennial of regression and correlation.

In historical perspective, it seems more appropriate that the popular name for the index should be the Galton-Pearson r. The important developments in the history of correlation and regression are summarized in Table 1.

By now, a century later, contemporary scientists often take the correlation coefficient for granted. It is not appreciated that before Galton and Pearson, the only means for establishing a relationship between variables was to educe a causative connection. There was no way to discuss, let alone measure, the association between variables that lacked a cause-effect relationship. Today, the correlation coefficient (and its associated regression equation) constitutes the principal statistical methodology for observational experiments in many disciplines. Carroll (1961), in his presidential address to the Psychometric Society, called the correlation coefficient "one of the most frequently used tools of psychometricians . . . and perhaps also one of the most frequently misused" (p. 347). Factor analysis, behavioral genetics models, structural equations models (e.g., LISREL), and other related methodologies use the correlation coefficient as the basic unit of data.

This article focuses on the Pearson product-moment correlation coefficient. Pearson's r was the first formal correlation measure, and it is still the most widely used measure of relationship. Indeed, many "competing" correlation indexes are in fact special cases of Pearson's formula. Spearman's rho, the point-biserial correlation, and the phi coefficient are examples, each computable as Pearson's r applied to special types of data (e.g., Henrysson 1971).

Our presentation will have a somewhat didactic flavor. On first inspection, the measure is simple and straightforward. There are surprising nuances of the correlation coefficient, however, and we will present some of these. Following Pearson, our focus is on the correlation coefficient as a computational index used to measure bivariate association. Whereas a more statistically sophisticated appreciation of correlation demands attention to the sampling model assumed to underlie the observations (e.g., Carroll 1961; Marks 1982), as well as understanding of its extension to multiple and partial correlation, our focus will be more basic. First, we restrict our primary interest to bivariate settings. Second, most of our interpretations are distribution free, since computation of a sample correlation requires no assumptions about a population (see Nefzger and Drasgow 1957). For consideration of the many problems associated with the inferential use of r (restriction of range, attenuation, etc.) we defer to other treatments (e.g., Lord and Novick 1968).

We present 13 different ways to conceptualize this most basic measure of bivariate relationship. This presentation does not claim to exhaust all possible interpretations of the correlation coefficient. Others certainly exist, and new renderings will certainly be proposed.

1. CORRELATION AS A FUNCTION OF RAW SCORES AND MEANS

The Pearson product-moment correlation coefficient is a dimensionless index, which is invariant to linear transformations of either variable. Pearson first developed the mathematical formula for this important measure in 1895:

r = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / [\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2]^{1/2}.  (1.1)

This, or some simple algebraic variant, is the usual formula found in introductory statistics textbooks. In the numerator, the raw scores are centered by subtracting out the mean of each variable, and the sum of cross-products of the centered variables is accumulated. The denominator adjusts the scales of the variables to have equal units. Thus Equation (1.1) describes r as the centered and standardized sum of cross-products of two variables. Using the Cauchy-Schwarz inequality, it can be shown that the absolute value of the numerator is less than or equal to the denominator (e.g., Lord and Novick 1968, p. 87); therefore, the limits of ±1 are established for r. Several simple algebraic transformations of this formula can be used for computational purposes.
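As a concrete illustration of Equation (1.1), the following short Python sketch (ours, not the authors'; numpy, the function name pearson_r, and the simulated data are assumptions made for the example) computes r from raw scores:

import numpy as np

def pearson_r(x, y):
    # Equation (1.1): centered cross-products, rescaled by the centered sums of squares.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()   # center each variable
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.6 * x + rng.normal(size=100)        # linearly related data
print(pearson_r(x, y))                    # agrees with np.corrcoef(x, y)[0, 1]

Because both numerator and denominator are built from centered, rescaled scores, the result is unchanged by linear transformations of either variable, as the text notes.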


2. CORRELATION AS STANDARDIZED COVARIANCE

The covariance, like the correlation, is a measure of linear association between variables. The covariance is defined on the sum of cross-products of the centered variables, unadjusted for the scale of the variables. Although the covariance is often ignored in introductory textbooks, the variance (which is not) is actually a special case of the covariance; that is, the variance is the covariance of a variable with itself. The covariance of two variables is unbounded in infinite populations, and in the sample it has indeterminate bounds (and unwieldy interpretation). Thus the covariance is often not a useful descriptive measure of association, because its value depends on the scales of measurement for X and Y. The correlation coefficient is a rescaled covariance:

r = s_{XY} / (s_X s_Y),  (2.1)

where s_{XY} is the sample covariance, and s_X and s_Y are sample standard deviations. When the covariance is divided by the two standard deviations, the range of the covariance is rescaled to the interval between -1 and +1. Thus the interpretation of correlation as a measure of relationship is usually more tractable than that of the covariance (and different correlations are more easily compared).

3. CORRELATION AS STANDARDIZED SLOPE OF THE REGRESSION LINE

The relationship between correlation and regression is most easily portrayed in

r = b_{Y.X} (s_X / s_Y) = b_{X.Y} (s_Y / s_X),  (3.1)

where b_{Y.X} and b_{X.Y} are the slopes of the regression lines for predicting Y from X and X from Y, respectively. Here, the correlation is expressed as a function of the slope of either regression line and the standard deviations of the two variables. The ratio of standard deviations has the effect of rescaling the units of the regression slope into units of the correlation. Thus the correlation is a standardized slope.

A similar interpretation involves the correlation as the slope of the standardized regression line. When we standardize the two raw variables, the standard deviations become unity and the slope of the regression line becomes the correlation. In this case, the intercept is 0, and the regression line is easily expressed as

z_Y = r z_X.  (3.2)

From this interpretation, it is clear that the correlation rescales the units of the standardized X variable to predict units of the standardized Y variable. Note that the slope of the regression of z_Y on z_X restricts the regression line to fall between the two diagonals portrayed in the shaded region of Figure 2. Positive correlations imply that the line will pass through the first and third quadrants; negative correlations imply that it will pass through the second and fourth quadrants. The regression of z_X on z_Y has the same angle with the Y axis that the regression of z_Y on z_X has with the X axis, and it will fall in the unshaded region of Figure 2, as indicated.

4. CORRELATION AS THE GEOMETRIC MEAN OF THE TWO REGRESSION SLOPES

The correlation may also be expressed as a simultaneous function of the two slopes of the unstandardized regression lines, b_{Y.X} and b_{X.Y}. The function is, in fact, the geometric mean, and it represents the first of several interpretations of r as a special type of mean:

r = \pm (b_{Y.X} b_{X.Y})^{1/2}.  (4.1)

This relationship may be derived from Equation (3.1) by multiplying the second and third terms in the equality to give r^2, canceling the standard deviations, and taking the square root.

There is an extension of this interpretation involving multivariate regression. Given the matrices of regression coefficients relating two sets of variables, B_{Y.X} and B_{X.Y}, the square roots of the eigenvalues of the product of these matrices are the canonical correlations for the two sets of variables. These values reduce to the simple correlation coefficient when there is a single X and a single Y variable.

5. CORRELATION AS THE SQUARE ROOT OF THE RATIO OF TWO VARIANCES (PROPORTION OF VARIABILITY ACCOUNTED FOR)

Correlation is sometimes criticized as having no obvious interpretation for its units. This criticism is mitigated by squaring the correlation. The squared index is often called the coefficient of determination, and the units may be interpreted as proportion of variance in one variable accounted for by differences in the other [see Ozer (1985) for a discussion of several different interpretations of the coefficient of determination]. We may partition the total sum of squares for Y (SS_{TOT}) into the sum of squares due to regression (SS_{REG}) and the sum of squares due to error (SS_{ERR}). The variability in Y accounted for by differences in X is the ratio of SS_{REG} to SS_{TOT}, and r is the square root of that ratio:

r = [\sum (\hat{Y}_i - \bar{Y})^2 / \sum (Y_i - \bar{Y})^2]^{1/2} = (SS_{REG} / SS_{TOT})^{1/2}.

Equivalently, the numerator and denominator of this equation may be divided by (N - 1)^{1/2}, and r becomes the square root of the ratio of the variances (or the ratio of the standard deviations) of the predicted and observed variables:

r = (s_{\hat{Y}}^2 / s_Y^2)^{1/2} = s_{\hat{Y}} / s_Y.  (5.1)

(Note that s_{\hat{Y}}^2 is a biased estimate of \sigma_{\hat{Y}}^2, whereas s_Y^2 is unbiased.) This interpretation is the one that motivated Pearson's early conceptualizations of the index (see Mulaik 1972, p. 4). The correlation as a ratio of two variances may be compared to another interpretation (due to Galton) of the correlation as the ratio of two means. We will present that interpretation in Section 13.
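Interpretations (2.1) through (5.1) can be checked against one another numerically. The sketch below is our addition (the simulated data and variable names are assumptions); it verifies that all four expressions return the same value:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(size=500)

sx, sy = x.std(ddof=1), y.std(ddof=1)      # sample standard deviations
sxy = np.cov(x, y)[0, 1]                   # sample covariance
b_yx, b_xy = sxy / sx ** 2, sxy / sy ** 2  # the two least-squares slopes

r_cov = sxy / (sx * sy)                          # (2.1) standardized covariance
r_slope = b_yx * (sx / sy)                       # (3.1) standardized slope
r_geom = np.sign(b_yx) * np.sqrt(b_yx * b_xy)    # (4.1) geometric mean of the slopes
y_hat = y.mean() + b_yx * (x - x.mean())         # regression predictions
r_sd = y_hat.std() / y.std()                     # (5.1) SD of predicted over SD of observed
print(r_cov, r_slope, r_geom, r_sd)              # four identical values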


6. CORRELATION AS THE MEAN CROSS-PRODUCT OF STANDARDIZED VARIABLES

Another way to interpret the correlation as a mean (see Sec. 4) is to express it as the average cross-product of the standardized variables:

r = \sum z_X z_Y / N.  (6.1)

Equation (6.1) can be obtained directly by dividing both the numerator and denominator in Equation (1.1) by the product of the two sample standard deviations. Since the mean of a distribution is its first moment, this formula gives insight into the meaning of the "product-moment" in the name of the correlation coefficient.

The next two portrayals involve trigonometric interpretations of the correlation.

7. CORRELATION AS A FUNCTION OF THE ANGLE BETWEEN THE TWO STANDARDIZED REGRESSION LINES

As suggested in Section 3, the two standardized regression lines are symmetric about either diagonal. Let the angle between the two lines be β (see Fig. 2). Then

r = \sec(\beta) \pm \tan(\beta).  (7.1)

A simple proof of this relationship is available from us. Equation (7.1) is not intuitively obvious, nor is it as useful for computational or conceptual purposes as some of the others. Its value is to show that there is a systematic relationship between the correlation and the angular distance between the two regression lines. The next interpretation, also trigonometric, has substantially more conceptual value.

8. CORRELATION AS A FUNCTION OF THE ANGLE BETWEEN THE TWO VARIABLE VECTORS

The standard geometric model to portray the relationship between variables is the scatterplot. In this space, observations are plotted as points in a space defined by variable axes. An "inside out" version of this space, usually called "person space," can be defined by letting each axis represent an observation. This space contains two points, one for each variable, that define the endpoints of vectors in this (potentially) huge dimensional space. Although the multidimensionality of this space precludes visualization, the two variable vectors define a two-dimensional subspace that is easily conceptualized.

If the variable vectors are based on centered variables, then the correlation has a straightforward relationship to the angle α between the variable vectors (Rodgers 1982):

r = \cos(\alpha).  (8.1)

When the angle is 0° or 180°, the vectors fall on the same line and cos(α) = ±1. When the angle is 90°, the vectors are perpendicular and cos(α) = 0. [Rodgers, Nicewander, and Toothaker (1984) showed the relationship between orthogonal and uncorrelated variable vectors in person space.]

[Figure 2. Geometry of Bivariate Correlation for Standardized Variables: the standardized regression lines z_Y = r z_X and z_X = r z_Y, plotted together with the diagonals of the (z_X, z_Y) plane.]
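A brief sketch (ours, with assumed simulated data) confirms that the mean cross-product of z scores (6.1) and the cosine of the angle between the centered variable vectors (8.1) are the same quantity:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = -0.4 * x + rng.normal(size=200)
n = len(x)

zx = (x - x.mean()) / x.std()   # z scores (divisor N, matching Eq. 6.1)
zy = (y - y.mean()) / y.std()
r_mean = (zx * zy).sum() / n    # (6.1) mean cross-product of z scores

xc, yc = x - x.mean(), y - y.mean()   # centered variable vectors in "person space"
cos_a = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))   # (8.1) cos(alpha)
print(r_mean, cos_a)            # identical values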


Visually, it is much easier to view the correlation by observing an angle than by looking at how points cluster about the regression line. In our opinion, this interpretation is by far the easiest way to "see" the size of the correlation, since one can directly observe the size of an angle between two vectors. This inside-out space that allows r to be represented as the cosine of an angle is relatively neglected as an interpretational tool, however. Exceptions include a number of factor-analytic interpretations, Draper and Smith's (1981, pp. 201-203) geometric portrayals of multiple regression analysis, and Huck and Sandler (1984, p. 52). Fisher also used this space quite often to conceptualize his elegant statistical insights (see Box 1978).

9. CORRELATION AS A RESCALED VARIANCE OF THE DIFFERENCE BETWEEN STANDARDIZED SCORES

Define z_Y - z_X as the difference between standardized X and Y variables for each observation. Then

r = 1 - s_{(z_Y - z_X)}^2 / 2.  (9.1)

This can be shown by starting with the variance of a difference score: s_{Y-X}^2 = s_Y^2 + s_X^2 - 2 r s_X s_Y. Since the standard deviations and variances become unity when the variables are standardized, we can easily solve for r to get Equation (9.1).

It is interesting to note that in this equation, since the correlation is bounded by the interval from -1 to +1, the variance of this difference score is bounded by the interval from 0 to 4. Thus the variance of a difference of standardized scores can never exceed 4. The upper bound for the variance is achieved when the correlation is -1.

We may also define r as the variance of a sum of standardized variables:

r = s_{(z_Y + z_X)}^2 / 2 - 1.  (9.2)

Here, the variance of the sum also ranges from 0 to 4, and the upper bound is achieved when the correlation is +1. The value of this ninth interpretation is to show that the correlation is a linear transformation of a certain type of variance. Thus, given the correlation, we can directly define the variance of either the sum or difference of the standardized variables, and vice versa.

All nine of the preceding interpretations of the correlation coefficient were algebraic and trigonometric in nature. No distributional assumptions were made about the nature of the univariate or bivariate distributions of X and Y. In the final interpretations, bivariate normality will be assumed. We maintain our interest in conceptual and computational versions of r, but we base our last set of interpretations on this common assumption about the population distribution.

10. CORRELATION ESTIMATED FROM THE BALLOON RULE

This interpretation is due to Chatillon (1984a). He suggested drawing a "birthday balloon" around the scatterplot of a bivariate relationship. The balloon is actually a rough ellipse, from which two measures, h and H, are obtained (see Fig. 3). h is the vertical diameter of the ellipse at the center of the distribution on the X axis; H is the vertical range of the ellipse on the Y axis. Chatillon showed that the correlation may be roughly computed as

r = [1 - (h/H)^2]^{1/2}.  (10.1)

He gave a theoretical justification of the efficacy of this rough-and-ready computational procedure, assuming both bivariate normality and bivariate uniformity. He also presented a number of examples in which the technique works quite well. An intriguing suggestion he made is that the "balloon rule" can be used to construct approximately a bivariate relationship with some specified correlation. One draws an ellipse that produces the desired r and then fills in the points uniformly throughout the ellipse. Thomas (1984) presented a "pocket nomograph," which was a 3" x 5" slide that could be used to "sight" a bivariate relationship and estimate a correlation based on the balloon rule.

11. CORRELATION IN RELATION TO THE BIVARIATE ELLIPSES OF ISOCONCENTRATION

Two different authors have suggested interpretations of r related to the bivariate ellipses of isoconcentration. Note that these ellipses are more formal versions of the "balloon" from Section 10 and that they are the geometric structures that Galton observed in his empirical data (see Fig. 1). Chatillon (1984b) gave a class of bivariate distributions (including normal, uniform, and mixtures of uniform) that have elliptical isodensity contours. There is one ellipse for every positive constant, given the population correlation. The balloon that one would draw around a scatterplot would approximate one of these ellipses for a large positive constant. If the variables are standardized, then these ellipses are centered on the origin. For ρ > 0, the major axes fall on the positive diagonal; for ρ < 0, the negative diagonal.

Marks (1982) showed, through simple calculus, that the slope of the tangent line at z_X = 0 is the correlation.

[Figure 3. The Correlation Related to Functions of the Ellipses of Isoconcentration: an ellipse of isoconcentration in (z_X, z_Y) coordinates, showing the tangent at z_X = 0, the major axis of length D, and the minor axis of length d.]
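The variance interpretations (9.1) and (9.2) and the two ellipse formulas (10.1) and (11.1) can be illustrated together. The sketch below is our own; the ellipse geometry used (vertical extent H, vertical diameter h at z_X = 0, and axis lengths for a contour constant c of a standardized bivariate normal) is our derivation, stated as an assumption rather than taken from the article:

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 0.7 * x + rng.normal(size=100_000)
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

r_diff = 1 - np.var(zy - zx) / 2   # (9.1) rescaled variance of a difference
r_sum = np.var(zy + zx) / 2 - 1    # (9.2) rescaled variance of a sum

# For a standardized bivariate normal with correlation rho, the contour
# z_x^2 - 2*rho*z_x*z_y + z_y^2 = c*(1 - rho^2) has vertical extent H = 2*sqrt(c),
# vertical diameter h = 2*sqrt(c*(1 - rho^2)) at z_x = 0, and axis half-lengths
# sqrt(c*(1 + rho)) and sqrt(c*(1 - rho)) along the diagonals.
rho, c = 0.7, 2.0
h, H = 2 * np.sqrt(c * (1 - rho ** 2)), 2 * np.sqrt(c)
D, d = 2 * np.sqrt(c * (1 + rho)), 2 * np.sqrt(c * (1 - rho))
r_balloon = np.sqrt(1 - (h / H) ** 2)            # (10.1) balloon rule; equals rho exactly
r_axes = (D ** 2 - d ** 2) / (D ** 2 + d ** 2)   # (11.1) axis form; equals rho exactly
print(r_diff, r_sum, r_balloon, r_axes)          # all close to 0.7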


Figure 3 shows this tangent line, the slope of which equals r. When the correlation is 0, the ellipse is a circle and the tangent has slope 0. When the correlation is unity, the ellipse approaches a straight line that is the diagonal (with slope 1). Note that, since all of the ellipses of isoconcentration are parallel, the interpretation is invariant to the choice of the ellipse. It is also worth noting that the slope of the tangent line at z_X = 0 is the same as the slope of the standardized regression line (see Sec. 3).

Schilling (1984) also used this framework to derive a similar relationship. Let the variables be standardized so that the ellipses are centered on the origin, as before. If D is the length of the major axis of some ellipse of isoconcentration and d is the length of the minor axis, then

r = (D^2 - d^2) / (D^2 + d^2).  (11.1)

These axes are also portrayed in Figure 3, and the interpretation is invariant to choice of ellipse, as before.

12. CORRELATION AS A FUNCTION OF TEST STATISTICS FROM DESIGNED EXPERIMENTS

The previous interpretations of r were based on quantitative variables. Our 12th representation of the correlation shows its relationship to the test statistic from designed experiments, in which one of the variables (the independent variable) is a categorical variable. This demonstrates the artificiality of the correlational/experimental distinction in discussing experimental design. In fact, Fisher (1925) originally presented the analysis of variance (ANOVA) in terms of the intraclass correlation (see Box 1978).

Suppose we have a designed experiment with two treatment conditions. The standard statistical model to test for a difference between the conditions is the two-independent-sample t test. If X is defined as a dichotomous variable indicating group membership (0 if group 1, 1 if group 2), then the correlation between X and the dependent variable Y is

r = t / (t^2 + n - 2)^{1/2},  (12.1)

where n is the combined total number of observations in the two treatment groups. This correlation coefficient may be used as a measure of the strength of a treatment effect, as opposed to the significance of an effect. The test of significance of r in this setting provides the same test as the usual t test. Clearly, then, r can serve as a test statistic in a designed experiment, as well as provide a measure of association in observational settings.

In ANOVA settings with more groups or multiple factors, the extension of this relationship defines the multiple correlation coefficients associated with main effects and interactions in more complex experiments. For example, in a one-way ANOVA setting with k groups and a total of N subjects, the squared multiple correlation between the dependent variable and the columns of the design matrix is related to the F statistic through the following formula (Draper and Smith 1981, p. 93): R^2 = F(k - 1) / [F(k - 1) + (N - k)].

13. CORRELATION AS THE RATIO OF TWO MEANS

This is the third interpretation of the correlation involving means (see Secs. 4 and 6). It provides an appropriate conclusion to our article, since it was first proposed by Galton. Furthermore, Galton's earliest conceptions about and calculations of the correlation were based on this interpretation. An elaboration of this interpretation was presented by Nicewander and Price (1982).

For Galton, it was natural to focus on correlation as a ratio of means, because he was interested in questions such as, How does the average height of sons of unusually tall fathers compare to the average height of their fathers? The following development uses population rather than sample notation because it is only in the limit (of increasing sample size) that the ratio-of-means expression will give values identical to Pearson's r.

Consider a situation similar to one that would have interested Galton. Let X be a variable denoting mother's IQ, and let Y denote the IQ of her oldest child. Further assume that the means μ(X) and μ(Y) are 0 and that the standard deviations σ(X) and σ(Y) are unity. Now select some arbitrarily large value of X (say X_c), and compute the mean IQ of mothers whose IQ is greater than X_c. Let this mean be denoted by μ(X | X > X_c), that is, the average IQ of mothers whose IQ is greater than X_c. Next, average the IQ scores, Y, of the oldest offspring of these exceptional mothers. Denote this mean by μ(Y | X > X_c), that is, the average IQ of the oldest offspring of mothers whose IQs are greater than X_c. Then it can be shown that

r = [\mu(Y \mid X > X_c) - \mu_Y] / [\mu(X \mid X > X_c) - \mu_X] = \mu(Y \mid X > X_c) / \mu(X \mid X > X_c).  (13.1)

The proof of (13.1) requires an assumption of bivariate normality of standardized X and Y. The proof is straightforward and entails only the fact that, for z_X and z_Y, r is the slope of the regression line as well as the ratio of these two conditional means.

Our example is specific, but the interpretation applies to any setting in which explicit selection occurs on one variable, which implicitly selects on a second variable. Brogden (1946) used the ratio-of-means interpretation to show that when a psychological test is used for personnel selection, the correlation between the test score and the criterion measure gives the proportionate degree to which the test is a "perfect" selection device. Other uses of this interpretation are certainly possible.
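Both of the last two interpretations admit quick numerical checks. The sketch below is our construction (group sizes, effect size, cutoff, and sample sizes are arbitrary assumptions): the first block verifies the identity (12.1) for a two-group design, and the second approximates the ratio of conditional means (13.1) by simulating a standardized bivariate normal and selecting on X:

import numpy as np

rng = np.random.default_rng(4)

# (12.1): the correlation between a 0/1 group indicator and Y reproduces the pooled t.
y1, y2 = rng.normal(0.0, 1.0, 40), rng.normal(0.8, 1.0, 60)
y = np.concatenate([y1, y2])
x = np.concatenate([np.zeros(40), np.ones(60)])
n = len(y)
sp = np.sqrt(((40 - 1) * y1.var(ddof=1) + (60 - 1) * y2.var(ddof=1)) / (n - 2))
t = (y2.mean() - y1.mean()) / (sp * np.sqrt(1 / 40 + 1 / 60))   # two-sample t statistic
print(np.corrcoef(x, y)[0, 1], t / np.sqrt(t ** 2 + n - 2))     # same value

# (13.1): for a standardized bivariate normal (X, Y) with correlation r_true,
# the ratio of the conditional means above a cutoff X_c converges to r_true.
r_true, x_c = 0.6, 1.5
xs = rng.normal(size=1_000_000)
ys = r_true * xs + np.sqrt(1 - r_true ** 2) * rng.normal(size=1_000_000)
sel = xs > x_c                            # explicit selection on X
print(ys[sel].mean() / xs[sel].mean())    # close to r_true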


CONCLUSION

Certainly there are other ways to interpret the correlation coefficient. A wealth of additional fascinating and useful portrayals is available when a more statistical and less algebraic approach is taken to the correlation problem. We in no sense presume to have summarized all of the useful or interesting approaches, even within the fairly tight framework that we have defined. Nevertheless, these 13 approaches illustrate the diversity of interpretations available for teachers and researchers who use correlation.

Galton's original work on the correlation was motivated by a very specific biometric problem. It is remarkable that such a focused effort would lead to the development of what is perhaps the most broadly applied index in all of statistics. The range of interpretations for the correlation coefficient demonstrates the growth of this remarkable index over the past century. On the other hand, Galton and Pearson's index is surprisingly unchanged from the one originally proposed.

[Received June 1987. Revised August 1987.]

REFERENCES

Box, J. F. (1978), R. A. Fisher: The Life of a Scientist, New York: John Wiley.
Brogden, H. E. (1946), "On the Interpretation of the Correlation Coefficient as a Measure of Predictive Efficiency," Journal of Educational Psychology, 37, 65-76.
Carroll, J. B. (1961), "The Nature of the Data, or How to Choose a Correlation Coefficient," Psychometrika, 26, 347-372.
Chatillon, G. (1984a), "The Balloon Rules for a Rough Estimate of the Correlation Coefficient," The American Statistician, 38, 58-60.
Chatillon, G. (1984b), "Reply to Schilling," The American Statistician, 38, 330.
Cook, T. C., and Campbell, D. T. (1979), Quasi-Experimentation, Boston: Houghton Mifflin.
Draper, N. R., and Smith, H. (1981), Applied Regression Analysis, New York: John Wiley.
Fisher, R. A. (1925), Statistical Methods for Research Workers, Edinburgh, U.K.: Oliver & Boyd.
Galton, F. (1885), "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute, 15, 246-263.
Henrysson, S. (1971), "Gathering, Analyzing, and Using Data on Test Items," in Educational Measurement, ed. R. L. Thorndike, Washington, DC: American Council on Education, pp. 130-159.
Huck, S. W., and Sandler, H. M. (1984), Statistical Illusions: Solutions, New York: Harper & Row.
Lord, F. M., and Novick, M. R. (1968), Statistical Theories of Mental Test Scores, Reading, MA: Addison-Wesley.
MacKenzie, D. A. (1981), Statistics in Britain: 1865-1930, Edinburgh, U.K.: Edinburgh University Press.
Marks, E. (1982), "A Note on the Geometric Interpretation of the Correlation Coefficient," Journal of Educational Statistics, 7, 233-237.
Mulaik, S. A. (1972), The Foundations of Factor Analysis, New York: McGraw-Hill.
Nefzger, M. D., and Drasgow, J. (1957), "The Needless Assumption of Normality in Pearson's r," The American Psychologist, 12, 623-625.
Nicewander, W. A., and Price, J. M. (1982), "The Correlation Coefficient as the Ratio of Two Means: An Interpretation Due to Galton and Brogden," unpublished paper presented to the Psychometric Society, Montreal, Canada, May.
Ozer, D. J. (1985), "Correlation and the Coefficient of Determination," Psychological Bulletin, 97, 307-315.
Pearson, K. (1895), Royal Society Proceedings, 58, 241.
Pearson, K. (1920), "Notes on the History of Correlation," Biometrika, 13, 25-45.
Rodgers, J. L. (1982), "On Huge Dimensional Spaces: The Hidden Superspace in Regression and Correlation Analysis," unpublished paper presented to the Psychometric Society, Montreal, Canada, May.
Rodgers, J. L., Nicewander, W. A., and Toothaker, L. (1984), "Linearly Independent, Orthogonal, and Uncorrelated Variables," The American Statistician, 38, 133-134.
Schilling, M. F. (1984), "Some Remarks on Quick Estimation of the Correlation Coefficient," The American Statistician, 38, 330.
Seal, H. L. (1967), "The Historical Development of the Gauss Linear Model," Biometrika, 54, 1-24.
Thomas, H. (1984), "Psychophysical Estimation of the Correlation Coefficient From a Bivariate Scatterplot: A Pocket Visual Nomograph," unpublished paper presented to the Psychometric Society, Santa Barbara, California, June.
Walker, H. M. (1929), Studies in the History of Statistical Method, Baltimore: Williams & Wilkins.
