In 1885, Sir Francis Galton first defined the term "regression" and completed the theory of bivariate correlation. A decade later, Karl Pearson developed the index that we still use to measure correlation, Pearson's r. Our article is written in recognition of the 100th anniversary of Galton's first discussion of regression and correlation. We begin with a brief history. Then we present 13 different formulas, each of which represents a different computational and conceptual definition of r. Each formula suggests a different way of thinking about this index, from algebraic, geometric, and trigonometric settings. We show that Pearson's r (or simple functions of r) may variously be thought of as a special type of mean, a special type of variance, the ratio of two means, the ratio of two variances, the slope of a line, the cosine of an angle, and the tangent to an ellipse, and may be looked at from several other interesting perspectives.

INTRODUCTION

We are currently in the midst of a "centennial decade" for correlation and regression. The empirical and theoretical developments that defined regression and correlation as statistical topics were presented by Sir Francis Galton in 1885. Then, in 1895, Karl Pearson published Pearson's r. Our article focuses on Pearson's correlation coefficient, presenting both the background and a number of conceptualizations of r that will be useful to teachers of statistics.

We begin with a brief history of the development of correlation and regression. Following, we present a longer review of ways to interpret the correlation coefficient. This presentation demonstrates that the correlation has developed into a broad and conceptually diverse index; at the same time, for a 100-year-old index it is remarkably unaffected by the passage of time.
The basic idea of correlation was anticipated substantially before 1885 (MacKenzie 1981). Pearson (1920) credited Gauss with developing the normal surface of n correlated variates in 1823. Gauss did not, however, have any particular interest in the correlation as a conceptually distinct notion; instead he interpreted it as one of the several parameters in his distributional equations. In a previous historical paper published in 1895, Pearson credited Auguste Bravais, a French astronomer, with developing the bivariate normal distribution in 1846 (see Pearson 1920). Bravais actually referred to one parameter of the bivariate normal distribution as "une correlation," but like Gauss, he did not recognize the importance of the correlation as a measure of association between variables. [By 1920, Pearson had rescinded the credit he gave to Bravais. But Walker (1929) and Seal (1967) reviewed the history that Pearson both reported and helped develop, and they supported Bravais's claim to historical precedence.] Galton's cousin, Charles Darwin, used the concept of correlation in 1868 by noting that "all the parts of the organisation are to a certain extent connected or correlated together." Then, in 1877, Galton first referred to "reversion" in a lecture on the relationship between physical characteristics of parent and offspring seeds. The "law of reversion" was the first formal specification of what Galton later renamed "regression."

*Joseph Lee Rodgers is Associate Professor and W. Alan Nicewander is Professor and Chair, Department of Psychology, University of Oklahoma, Norman, Oklahoma 73019. The authors thank the reviewers, whose comments improved the article.

© 1988 American Statistical Association. The American Statistician, February 1988, Vol. 42, No. 1, p. 59.

During this same period, important developments in philosophy also contributed to the concepts of correlation and regression. In 1843, the British philosopher John Stuart Mill first presented his "Five Canons of Experimental Inquiry." Among those was included the method of concomitant variation: "Whatever phenomenon varies in any manner whenever another phenomenon varies in some particular manner, is either a cause or an effect of that phenomenon, or is connected with it through some fact of causation." Mill suggested three prerequisites for valid causal inference (Cook and Campbell 1979). First, the cause must temporally precede the effect. Second, the cause and effect must be related. Third, other plausible explanations must be ruled out. Thus the separability of correlation and causation and the specification of the former as a necessary but not sufficient condition for the latter were being recognized almost simultaneously in the established discipline of philosophy and the fledgling discipline of biometry.

By 1885 the stage was set for several important contributions. During that year, Galton was the president of the Anthropological Section of the British Association. In his presidential address, he first referred to regression as an extension of the "law of reversion." Later in that year (Galton 1885) he published his presidential address along with the first bivariate scatterplot showing a correlation (Fig. 1). In this graph he plotted the frequencies of combinations of children's height and parents' height. When he smoothed the results and then drew lines through points with equal frequency, he found that "lines drawn through entries of the same value formed a series of concentric and similar ellipses." This was the first empirical representation of the isodensity contour lines from the bivariate normal distribution. With the assistance of J. D. Hamilton Dickson, a Cambridge mathematician, Galton was able to derive the theoretical formula for the bivariate normal distribution. This formalized mathematically the topic on which Gauss and Bravais had been working a half century before. Pearson (1920) stated that "in 1885 Galton had completed the theory of bi-variate normal correlation" (p. 37).

In the years following 1885, several additional events added mathematical import to Galton's 1885 work. In 1888, Galton noted that r measures the closeness of the "co-relation," and suggested that r could not be greater than 1 (although he had not yet recognized the idea of negative correlation). Seven years later, Pearson (1895) developed the mathematical formula that is still most commonly used
[Figure 1. Galton's first bivariate scatterplot: frequencies of combinations of children's heights and parents' heights.]
to measure correlation, the Pearson product-moment correlation coefficient. In historical perspective, it seems more appropriate that the popular name for the index should be the Galton-Pearson r. The important developments in the history of correlation and regression are summarized in Table 1.

By now, a century later, contemporary scientists often take the correlation coefficient for granted. It is not appreciated that before Galton and Pearson, the only means for establishing a relationship between variables was to educe a causative connection. There was no way to discuss (let alone measure) the association between variables that lacked a cause-effect relationship. Today, the correlation coefficient, and its associated regression equation, constitutes the principal statistical methodology for observational experiments in many disciplines. Carroll (1961), in his presidential address to the Psychometric Society, called the correlation coefficient "one of the most frequently used tools of psychometricians . . . and perhaps also one of the most frequently misused" (p. 347). Factor analysis, behavioral genetics models, structural equations models (e.g., LISREL), and other related methodologies use the correlation coefficient as the basic unit of data.

This article focuses on the Pearson product-moment correlation coefficient. Pearson's r was the first formal correlation measure, and it is still the most widely used measure of relationship. Indeed, many "competing" correlation indexes are in fact special cases of Pearson's formula. Spearman's rho, the point-biserial correlation, and the phi coefficient are examples, each computable as Pearson's r applied to special types of data (e.g., Henrysson 1971).

Our presentation will have a somewhat didactic flavor. On first inspection, the measure is simple and straightforward. There are surprising nuances of the correlation coefficient, however, and we will present some of these. Following Pearson, our focus is on the correlation coefficient as a computational index used to measure bivariate association. Whereas a more statistically sophisticated appreciation of correlation demands attention to the sampling model assumed to underlie the observations (e.g., Carroll 1961; Marks 1982), as well as understanding of its extension to multiple and partial correlation, our focus will be more basic. First, we restrict our primary interest to bivariate settings. Second, most of our interpretations are distribution free, since computation of a sample correlation requires no assumptions about a population (see Nefzger and Drasgow 1957). For consideration of the many problems associated with the inferential use of r (restriction of range, attenuation, etc.) we defer to other treatments (e.g., Lord and Novick 1968).

We present 13 different ways to conceptualize this most basic measure of bivariate relationship. This presentation does not claim to exhaust all possible interpretations of the correlation coefficient. Others certainly exist, and new renderings will certainly be proposed.

1. CORRELATION AS A FUNCTION OF RAW SCORES AND MEANS

The Pearson product-moment correlation coefficient is a dimensionless index, which is invariant to linear transformations of either variable. Pearson first developed the mathematical formula for this important measure in 1895:

    r = Σ (X_i − X̄)(Y_i − Ȳ) / [Σ (X_i − X̄)² Σ (Y_i − Ȳ)²]^(1/2).   (1.1)

This, or some simple algebraic variant, is the usual formula found in introductory statistics textbooks. In the numerator, the raw scores are centered by subtracting out the mean of each variable, and the sum of cross-products of the centered variables is accumulated. The denominator adjusts the scales of the variables to have equal units. Thus Equation (1.1) describes r as the centered and standardized sum of cross-products of two variables. Using the Cauchy-Schwartz inequality, it can be shown that the absolute value of the numerator is less than or equal to the denominator (e.g., Lord and Novick 1968, p. 87); therefore, the limits of ±1 are established for r. Several simple algebraic transformations of this formula can be used for computational purposes.
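As a concrete illustration of Equation (1.1), the following minimal sketch computes r directly from raw scores: it centers each variable about its mean, accumulates the cross-products in the numerator, and scales by the root of the product of summed squared deviations in the denominator. The function name and example data are ours, not the article's; it also checks the invariance to linear transformations noted above.

```python
# Sketch of Equation (1.1): Pearson's r computed from raw scores and means.
# Function name and data are illustrative assumptions, not from the article.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    x_bar = sum(x) / n                      # mean of X
    y_bar = sum(y) / n                      # mean of Y
    # Numerator: sum of cross-products of the centered scores.
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: adjusts the scales of the variables to equal units.
    den = sqrt(sum((xi - x_bar) ** 2 for xi in x) *
               sum((yi - y_bar) ** 2 for yi in y))
    return num / den

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(x, y)                         # r = 0.8 for this data

# Dimensionless and invariant to (positive-slope) linear transformations:
# rescaling and shifting X leaves r unchanged.
r_scaled = pearson_r([10 * xi + 3 for xi in x], y)
```

Because both numerator and denominator rescale by the same factor under a linear transformation of either variable, `r` and `r_scaled` agree exactly, consistent with the text's description of r as a dimensionless index.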