DOCUMENT RESUME

ED 216 874                                                SE 037 439

AUTHOR       O'Brien, Francis J., Jr.
TITLE        Proof That the Sample Bivariate Correlation Coefficient Has Limits (Plus or Minus) 1.
PUB DATE     82
NOTE         32p.
EDRS PRICE   MF01/PC02 Plus Postage.
DESCRIPTORS  *Correlation; *Educational Research; Higher Education; Mathematical Formulas; *Proof (Mathematics); *Research Tools; *Statistics
IDENTIFIERS  *Applied Statistics

ABSTRACT
This paper presents in detail a proof of the limits of the sample bivariate correlation coefficient which requires only knowledge of algebra. Notation and basic formulas of standard (z) scores, bivariate correlation formulas in unstandardized and standardized form, and algebraic inequalities are reviewed first, since they are necessary for understanding the proof. Then the proof is presented, with an appendix containing additional proofs of related material. (NNS)

Reproductions supplied by EDRS are the best that can be made from the original document.
Proof That the Sample Bivariate Correlation Coefficient Has Limits (Plus or Minus) 1

Francis J. O'Brien, Jr., Ph.D.
Virtually all social science students who have studied applied statistics have been introduced to the concepts and formulas of linear correlation of two variables. Applied statistics textbooks routinely report the theoretical limits of the bivariate correlation coefficient; namely, that the coefficient is no more than +1 and no less than -1. However, no commonly used applied statistics textbook proves this. One of the best textbooks available to students of education and psychology introduces the proof (Glass and Stanley, 1970). Undoubtedly, one of the constraints placed on authors by publishers is space limitations available for detailed explanations, derivations and proofs.
This paper will set forth in detail a proof of the limits of the sample bivariate correlation coefficient. Since the proof requires only knowledge of algebra, most students of applied statistics at the advanced undergraduate or introductory graduate level should have little difficulty in understanding the proof. As a former instructor of graduate level introductory applied statistics, I know that the typical student can understand the proof as it is presented here.

The key for understanding statistical proofs is a presentation of detailed steps in a well articulated and coherent manner. A review of relevant statistical and mathematical concepts is also helpful (and usually required). When students are presented in detail important statistical proofs, they feel that some of the mystery and magic of mathematics has been unveiled. My experience has been that the typical student of applied statistics can follow a good number of proofs because most proofs can be presented algebraically without use of calculus. In addition to enhancing knowledge, an occasional proof often increases academic motivation.
Some Preliminary Concepts

The proof requires knowledge of several concepts in statistics and mathematics. In order to make this paper self-contained, some preliminary concepts stated in a consistent notation will be reviewed. We will review the concepts and formulas of standard scores (z scores), bivariate correlation formulas in unstandardized and standardized form, and algebraic inequalities.

Notation and Basic Formulas

Table 1 is a layout of symbolic values written in the notation to be used in this paper. The model presented in Table 1 is of two measures in unstandardized (raw score) and standardized (z score) form. Table 2 presents some familiar formulas based on unstandardized variables that will be useful for the development of the proof. (1)

(1) This paper is one of a series contemplated for publication [see O'Brien, 1982]. Eventually I hope to present a textbook of applied statistics proofs and derivations to supplement standard applied statistics textbooks.
Table 1

Table Layout for Two Measures in Unstandardized and Standardized Form

                    Measure X                             Measure Y
Case    Unstandardized  Standardized              Unstandardized  Standardized
1       X_1             (X_1 - X̄)/S_x = z_x1      Y_1             (Y_1 - Ȳ)/S_y = z_y1
2       X_2             (X_2 - X̄)/S_x = z_x2      Y_2             (Y_2 - Ȳ)/S_y = z_y2
3       X_3             (X_3 - X̄)/S_x = z_x3      Y_3             (Y_3 - Ȳ)/S_y = z_y3
...     ...             ...                       ...             ...
i       X_i             (X_i - X̄)/S_x = z_xi      Y_i             (Y_i - Ȳ)/S_y = z_yi
...     ...             ...                       ...             ...
n       X_n             (X_n - X̄)/S_x = z_xn      Y_n             (Y_n - Ȳ)/S_y = z_yn

Sample size      n_x          n_zx                 n_y          n_zy
Sample mean      X̄            z̄_x                  Ȳ            z̄_y
Sample variance  S_x²         S_zx²                S_y²         S_zy²

NOTE: all sample size terms are equal; that is, n_x = n_y = n_zx = n_zy. Any of these sample size terms could be identified by just the symbol n. We will use n when it is not important to distinguish among the other sample size terms, but will use the table values above when it is necessary or important to do so.
Table 2

Relevant Formulas for Unstandardized Measures

                 Measure X                                   Measure Y
Sample mean      X̄ = (Σ_{i=1}^{n_x} X_i)/n_x                 Ȳ = (Σ_{i=1}^{n_y} Y_i)/n_y
Sum              n_x X̄ = Σ_{i=1}^{n_x} X_i                   n_y Ȳ = Σ_{i=1}^{n_y} Y_i
Sample variance  S_x² = Σ_{i=1}^{n_x}(X_i - X̄)²/(n_x - 1)    S_y² = Σ_{i=1}^{n_y}(Y_i - Ȳ)²/(n_y - 1)
Sum of squares   (n_x - 1)S_x² = Σ_{i=1}^{n_x}(X_i - X̄)²     (n_y - 1)S_y² = Σ_{i=1}^{n_y}(Y_i - Ȳ)²

NOTES:
1. The sample size terms are equal: n_x = n_y. Also n_x = n_y = n.
2. "Sum" is simply an algebraic manipulation of "Sample mean"; i.e., multiply over the sample size term in "Sample mean" to get "Sum". Also, "Sum of squares" is such a manipulation based on "Sample variance". "Sum" and "Sum of squares" will be useful later on.
3. Descriptive statistics for standardized scores will be developed in the body of the text.
Standard Scores

It will be recalled that the standard score for an unstandardized measure (raw score) is "the score minus the mean divided by the standard deviation". For case 1 of measure X in Table 1, the standard (z) score is:

    z_x1 = (X_1 - X̄)/S_x

For any (hypothetical) case i, the standard score of an X measure is:

    z_xi = (X_i - X̄)/S_x

The same procedure can be applied to Y measures. For case 1:

    z_y1 = (Y_1 - Ȳ)/S_y

Similarly, for the ith (hypothetical) case, we have:

    z_yi = (Y_i - Ȳ)/S_y

Since a standard score distribution (such as in Table 1) is a distribution of variable measures, we can calculate means, standard deviations, variances, correlations, and so forth, just as we can calculate these statistics for unstandardized measures.

Most students will recall that the mean of z scores is equal to 0 and the standard deviation (and variance) of standardized scores is equal to 1. (The proof of these statements is given in the Appendix. (2))
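These two facts can be checked numerically. Below is a minimal sketch in Python; the data values and the helper functions `mean` and `sample_sd` are illustrative assumptions, not part of the original paper.

```python
# Verify numerically that standard (z) scores have mean 0 and
# sample standard deviation 1, using the definitions in the text.
# The data values are arbitrary, chosen only for illustration.

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    # sample standard deviation with an n - 1 divisor, as in Table 2
    m = mean(v)
    return (sum((x - m) ** 2 for x in v) / (len(v) - 1)) ** 0.5

X = [2.0, 4.0, 4.0, 7.0, 9.0]
mx, sx = mean(X), sample_sd(X)

# z score: "the score minus the mean divided by the standard deviation"
zx = [(x - mx) / sx for x in X]

print(mean(zx))       # effectively 0 (up to floating point error)
print(sample_sd(zx))  # effectively 1
```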
The mean of X standardized scores is defined as:

    z̄_x = (Σ_{i=1}^{n_zx} z_xi)/n_zx = 0

Similarly, for Y measures:

    z̄_y = (Σ_{i=1}^{n_zy} z_yi)/n_zy = 0

The variance of X in z score notation is defined as:

    S_zx² = Σ_{i=1}^{n_zx}(z_xi - z̄_x)²/(n_zx - 1)

(2) The Appendix contains proofs of certain concepts or relationships that may be of interest to the reader but are not crucial for the development of the proof in this paper (the theoretical limits of the sample bivariate correlation coefficient).
For the standardized Y measure the variance is defined as:

    S_zy² = Σ_{i=1}^{n_zy}(z_yi - z̄_y)²/(n_zy - 1)
Sum of Squared Standard Scores

To understand the proof it is necessary to know the result of summing a distribution of squared standard scores. If we square each standard score for the X measure in Table 1 and sum them, we obtain:

    Σ_{i=1}^{n_zx} z_xi² = z_x1² + z_x2² + ... + z_xn²

If we substitute the appropriate means and variances in the right hand side of the expression, we obtain:

    Σ_{i=1}^{n_zx} z_xi² = (X_1 - X̄)²/S_x² + (X_2 - X̄)²/S_x² + ... + (X_n - X̄)²/S_x²

Since 1/S_x² is a constant, we can factor it outside and write:

    Σ_{i=1}^{n_zx} z_xi² = [(X_1 - X̄)² + (X_2 - X̄)² + ... + (X_n - X̄)²]/S_x²

Rewriting the right hand side in summation notation, we obtain:

    Σ_{i=1}^{n_zx} z_xi² = Σ_{i=1}^{n_x}(X_i - X̄)²/S_x²

From Table 2, we know that we can substitute the sum of squares term into the numerator on the right hand side. This results in:

    Σ_{i=1}^{n_zx} z_xi² = (n_x - 1)S_x²/S_x² = n_x - 1

(Recall that n_x - 1 = n - 1.)

If we were to work through the same steps for Y, we would obtain:

    Σ_{i=1}^{n_zy} z_yi² = (n_y - 1)S_y²/S_y² = n_y - 1

(Recall that n_y - 1 = n - 1.)

These relationships between squared z scores and sample size are very important for the proof later on. They will be summarized later for easy reference.
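The result that the squared z scores sum to n - 1 can also be confirmed numerically. A minimal sketch in Python follows; the data values and helper functions are illustrative assumptions.

```python
# Check that the sum of squared z scores equals n - 1, as derived above.
# The data values are arbitrary illustrative values.

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    # sample standard deviation with an n - 1 divisor, as in Table 2
    m = mean(v)
    return (sum((x - m) ** 2 for x in v) / (len(v) - 1)) ** 0.5

X = [1.0, 3.0, 3.0, 6.0, 8.0, 9.0]
n = len(X)
mx, sx = mean(X), sample_sd(X)

zx = [(x - mx) / sx for x in X]
sum_z_squared = sum(z * z for z in zx)

print(sum_z_squared, n - 1)  # the two values agree up to floating point error
```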
Correlation Formulas

Unstandardized Form

Using the notation and variables in Tables 1 and 2, the unstandardized form correlation for two measures (X and Y) is defined as follows:

    r_xy = [Σ_{i=1}^{n}(X_i - X̄)(Y_i - Ȳ)/(n - 1)] / sqrt[(Σ_{i=1}^{n_x}(X_i - X̄)²/(n_x - 1))(Σ_{i=1}^{n_y}(Y_i - Ȳ)²/(n_y - 1))]

Note that the numerator contains the term n - 1 because it is not important or necessary to distinguish between n_x - 1 or n_y - 1. However, in the denominator it is helpful to distinguish n_x - 1 from n_y - 1. In any case, all of the sample size terms would be equal to the same numerical value if a correlation coefficient were computed on a set of data (n_x - 1 = n_y - 1 = n - 1).
Standardized Form

The correlation of measure X and measure Y in standard score form is defined as follows:

    r_{z_x z_y} = [Σ_{i=1}^{n}(z_xi - z̄_x)(z_yi - z̄_y)/(n - 1)] / sqrt[(Σ_{i=1}^{n_zx}(z_xi - z̄_x)²/(n_zx - 1))(Σ_{i=1}^{n_zy}(z_yi - z̄_y)²/(n_zy - 1))]

It is proven in the Appendix that this correlation formula is equal to:

    r_{z_x z_y} = Σ_{i=1}^{n} z_xi z_yi / (n - 1)

If we rearrange this formula by multiplying over the n - 1 term, we obtain:

    (n - 1) r_{z_x z_y} = Σ_{i=1}^{n} z_xi z_yi

This relationship will be useful in the proof; it will be restated for easy reference later.

The reader may recall that the same correlation coefficient results when the variables are in raw score form or standard score form. That is:

    r_xy = r_{z_x z_y}

This statement is proven in the Appendix. We will restate it prior to the proof for the reader's convenience.
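Both claims in this section lend themselves to a quick numerical check. The sketch below, in Python, verifies that the correlation of the z scores equals the correlation of the raw scores and that the z-score cross products sum to (n - 1)r. The data values and helper functions are illustrative assumptions.

```python
# Check two claims from this section:
# (1) the correlation computed on z scores equals the correlation
#     computed on raw scores, and
# (2) the sum of z-score cross products equals (n - 1) * r.
# The data values are arbitrary illustrative values.

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    m = mean(v)
    return (sum((x - m) ** 2 for x in v) / (len(v) - 1)) ** 0.5

def corr(a, b):
    # sample correlation coefficient with n - 1 divisors throughout
    n = len(a)
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)
    return cov / (sample_sd(a) * sample_sd(b))

X = [1.0, 2.0, 4.0, 5.0, 7.0]
Y = [2.0, 1.0, 5.0, 4.0, 9.0]
n = len(X)

zx = [(x - mean(X)) / sample_sd(X) for x in X]
zy = [(y - mean(Y)) / sample_sd(Y) for y in Y]

r_raw = corr(X, Y)
r_z = corr(zx, zy)
cross = sum(a * b for a, b in zip(zx, zy))

print(r_raw, r_z)              # identical up to floating point error
print(cross, (n - 1) * r_raw)  # identical up to floating point error
```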
Inequalities

Before starting the proof it is necessary to review one further topic: algebraic inequalities. In the proof we are required to manipulate the forms of inequality "greater than or equal to" and "less than or equal to". An example will serve as a refresher.

For two variables, say A and B, we can write:

    A ≥ B

which means "A is greater than or equal to B". Equivalently, we can write:

    B ≤ A

which means the same thing: "B is less than or equal to A". For example, 3 ≥ 1 or equivalently 1 ≤ 3. All of this may be obvious. What students sometimes forget is what happens when multiplying or dividing by negative quantities. For example, if 3 > 1 and we multiply this inequality by -1, we would obtain:

    -1(3) < -1(1)
      -3  <  -1

That is, the inequality sign is reversed when multiplying by a negative number. The same result occurs for more complex expressions. For example, if

    1 - A ≥ B

multiplying each side of the inequality by -1, we obtain:

    -1(1 - A) ≤ -1(B)
       A - 1  ≤  -B

Example: 1 - 1/4 ≥ 0. Multiplying through by -1, we obtain:

    -1(1 - 1/4) ≤ -1(0)
       -3/4    ≤   0
Summary of Important Concepts

We have reviewed standard scores (z), correlation formulas and algebraic inequalities. All of these concepts are important for understanding the proof that follows. For the reader's convenience, we will summarize these concepts for easy reference. This is done in Table 3.

Table 3

Summary of Important Concepts

1.  Σ_{i=1}^{n} z_xi² = n - 1

2.  Σ_{i=1}^{n} z_yi² = n - 1

3.  (n - 1) r_xy = (n - 1) r_{z_x z_y} = Σ_{i=1}^{n} z_xi z_yi

4.  If (1 - A) ≥ B, then -(1 - A) ≤ -B.
Proof

We are now ready to present the proof. Formally, we want to prove the following statements:

    r_xy ≤ +1
    r_xy ≥ -1

Writing each of the statements in one linear form:

    -1 ≤ r_xy ≤ +1

This states the same information as the above two separate statements. The proof consists of two parts: one part shows the lower limit of r_xy (i.e., r_xy ≥ -1), and the second part shows the upper limit of r_xy (i.e., r_xy ≤ +1). We will prove the upper limit first.

Proof that r_xy ≤ +1

To prove this limit, we will perform algebraic manipulations on a statement which is mathematically true. That statement is:

    Σ_{i=1}^{n} (z_xi - z_yi)² ≥ 0
In words, the statement means: the sum of squared differences of n standardized value pairs will always be equal to or greater than 0. The reader may refer to Table 1 for clarification. The squared differences are taken in each row (pairs) of z values, starting at z_x1, z_y1 and continuing down to the last pair of z's (z_xn, z_yn).

Most students readily agree that the squared sum will be greater than 0. But can it ever be exactly equal to 0? Yes, theoretically it can. Referring to Table 1, if one imagines each standardized X and Y measure to have the same numerical value, (1) then it is apparent that each difference will be 0; so, the squared value of 0 is also 0. Now, a sum of squared 0's will itself be equal to 0. While it may be unlikely to occur in practice, it is only required that

    Σ_{i=1}^{n} (z_xi - z_yi)² ≥ 0

be true in a mathematical sense. Thus, the statement is true.

We will expand this squared sum, perform algebraic manipulations and substitutions, and arrive at the proof for the upper limit of the sample correlation coefficient. The actual steps in the derivation will now be presented. Notes pertaining to the algebra are provided for the reader's reference. Refer to Tables 1, 2 and 3 as needed. It is suggested that the reader first examine the algebraic statement, then read the accompanying note for explanation.

(1) That is, within pairs, not all pairs. Example:

    z_xi     z_yi
    1.41     1.41
     .68      .68
     .05      .05
    etc.
That r_xy ≤ +1

1.  Σ_{i=1}^{n} (z_xi - z_yi)² ≥ 0

        The statement from before.

2.  Σ_{i=1}^{n} (z_xi² + z_yi² - 2 z_xi z_yi) ≥ 0

        Squaring each term, we obtain an expansion of the binomial in this form: (A - B)² = A² + B² - 2AB.

3.  Σ_{i=1}^{n} z_xi² + Σ_{i=1}^{n} z_yi² - 2 Σ_{i=1}^{n} z_xi z_yi ≥ 0

        Distributing the summation operator to each term, and bringing the constant 2 outside the summation sign.

4.  (n - 1) + (n - 1) - 2(n - 1) r_xy ≥ 0

        This next step is very important. We substitute three quantities, all from Table 3:

            Σ_{i=1}^{n} z_xi² = n - 1
            Σ_{i=1}^{n} z_yi² = n - 1
            Σ_{i=1}^{n} z_xi z_yi = (n - 1) r_{z_x z_y} = (n - 1) r_xy

5.  2(n - 1) - 2(n - 1) r_xy ≥ 0

        Collecting the like terms of (n - 1).

6.  2(n - 1)(1 - r_xy) ≥ 0

        Factoring the 2(n - 1) term.

7.  1 - r_xy ≥ 0

        Dividing each side of the inequality by 2(n - 1), which does not change the inequality sign because 2(n - 1) is always positive (n must always be > 1).

8.  -(1 - r_xy) ≤ -0

        Here we make use of multiplying each side of the inequality by -1 (see Table 3), which reverses the inequality sign and reverses the sign of 1 - r_xy.

9.  r_xy - 1 ≤ 0

10. r_xy ≤ +1

        Now, add +1 to each side. This gives us the result.

END OF PROOF FOR UPPER LIMIT.
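The key identity behind the upper-limit argument, Σ(z_xi - z_yi)² = 2(n - 1)(1 - r_xy) ≥ 0, can be confirmed numerically. A minimal sketch in Python follows; the data values and helper functions are illustrative assumptions.

```python
# Numerical check of the identity used in the upper-limit proof:
#   sum over i of (z_xi - z_yi)^2  =  2(n - 1)(1 - r_xy)  >=  0
# The data values are arbitrary illustrative values.

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    m = mean(v)
    return (sum((x - m) ** 2 for x in v) / (len(v) - 1)) ** 0.5

def corr(a, b):
    n = len(a)
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)
    return cov / (sample_sd(a) * sample_sd(b))

X = [1.0, 2.0, 4.0, 5.0, 7.0]
Y = [2.0, 1.0, 5.0, 4.0, 9.0]
n = len(X)

zx = [(x - mean(X)) / sample_sd(X) for x in X]
zy = [(y - mean(Y)) / sample_sd(Y) for y in Y]
r = corr(X, Y)

lhs = sum((a - b) ** 2 for a, b in zip(zx, zy))  # sum of squared differences
rhs = 2 * (n - 1) * (1 - r)

print(lhs, rhs)  # identical up to floating point error, and nonnegative
```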
Proof that r_xy ≥ -1

Part two of the proof will be much simpler because the structure of the proof is very much like the first part. We will follow the same basic steps. We start out with a statement that is mathematically true, namely:

    Σ_{i=1}^{n} (z_xi + z_yi)² ≥ 0

Again this statement is true in a mathematical sense, even though the "equals 0" aspect is very unlikely to occur in statistical practice. The development of the proof with appropriate notes follows.

1.  Σ_{i=1}^{n} (z_xi + z_yi)² ≥ 0

        The statement restated. Squaring each term results in a binomial expansion in this form: (A + B)² = A² + B² + 2AB.

2.  Σ_{i=1}^{n} (z_xi² + z_yi² + 2 z_xi z_yi) ≥ 0

3.  Σ_{i=1}^{n} z_xi² + Σ_{i=1}^{n} z_yi² + 2 Σ_{i=1}^{n} z_xi z_yi ≥ 0

        Distributing the summation operator and bringing out the 2.

4.  (n - 1) + (n - 1) + 2(n - 1) r_xy ≥ 0

        Making the same three substitutions as in part one.

5.  2(n - 1)(1 + r_xy) ≥ 0

        Adding like terms and factoring.

6.  1 + r_xy ≥ 0

        Dividing each side by 2(n - 1).

7.  r_xy ≥ -1

        Adding -1 to each side and simplifying.

END OF PROOF FOR LOWER LIMIT.
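The companion identity for the lower limit, Σ(z_xi + z_yi)² = 2(n - 1)(1 + r_xy) ≥ 0, can be checked the same way. As before, the data values and helper functions in this Python sketch are illustrative assumptions.

```python
# Numerical check of the identity used in the lower-limit proof:
#   sum over i of (z_xi + z_yi)^2  =  2(n - 1)(1 + r_xy)  >=  0
# The data values are arbitrary illustrative values.

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    m = mean(v)
    return (sum((x - m) ** 2 for x in v) / (len(v) - 1)) ** 0.5

def corr(a, b):
    n = len(a)
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)
    return cov / (sample_sd(a) * sample_sd(b))

X = [1.0, 2.0, 4.0, 5.0, 7.0]
Y = [9.0, 6.0, 5.0, 3.0, 1.0]   # negatively related with X
n = len(X)

zx = [(x - mean(X)) / sample_sd(X) for x in X]
zy = [(y - mean(Y)) / sample_sd(Y) for y in Y]
r = corr(X, Y)

lhs = sum((a + b) ** 2 for a, b in zip(zx, zy))  # sum of squared sums
rhs = 2 * (n - 1) * (1 + r)

print(lhs, rhs)  # identical up to floating point error, and nonnegative
```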
We have just proven that -1 ≤ r_xy ≤ +1. See the Appendix for additional proofs of related material.
APPENDIX

Selected Proofs

1. That the mean of standard scores is equal to 0.

We will start with the definition of the mean of z scores for the X measure:

    z̄_x = Σ_{i=1}^{n_zx} z_xi / n_zx

Expanding the right side:

    z̄_x = [(X_1 - X̄)/S_x + (X_2 - X̄)/S_x + ... + (X_n - X̄)/S_x] / n_zx

Factoring the constant 1/S_x outside and writing the sum of deviations in summation notation:

    z̄_x = (1/(n_zx S_x)) Σ_{i=1}^{n_x} (X_i - X̄)

Distributing the summation sign inside the parentheses:

    z̄_x = (1/(n_zx S_x)) (Σ_{i=1}^{n_x} X_i - Σ_{i=1}^{n_x} X̄)

Since Σ_{i=1}^{n_x} X_i is equal to n_x X̄, and the sum of the constant X̄ is n_x X̄, (1)

    z̄_x = (1/(n_zx S_x)) (n_x X̄ - n_x X̄) = 0

Thus the mean of z_x scores is equal to 0. Similar reasoning for the Y measure will produce the same result, namely:

    z̄_y = Σ_{i=1}^{n_zy} z_yi / n_zy = (1/(n_zy S_y)) (n_y Ȳ - n_y Ȳ) = 0

Therefore, variables in standardized form have mean equal to 0.

(1) Recall that when taking the sum of a constant (say C) we have:

    Σ_{i=1}^{n} C = C + C + ... + C = nC

That is, the sum of a constant is equal to the constant times the number of terms added (in this case, n).
2. That the variance and standard deviation of standard scores is equal to 1.

By definition, the variance for X measures in standard score form is:

    S_zx² = Σ_{i=1}^{n_zx} (z_xi - z̄_x)² / (n_zx - 1)

Since we know that z̄_x = 0, we now have:

    S_zx² = Σ_{i=1}^{n_zx} z_xi² / (n_zx - 1)

If we rewrite z_xi in terms of the unstandardized mean and standard deviation:

    S_zx² = Σ_{i=1}^{n_zx} [(X_i - X̄)/S_x]² / (n_zx - 1)

Rearranging terms:

    S_zx² = (1/S_x²) Σ_{i=1}^{n_x} (X_i - X̄)² / (n_zx - 1)

From Table 2, we can substitute into the numerator of S_zx² the "sum of squares" for the X measure. This results in:

    S_zx² = (n_x - 1) S_x² / [S_x² (n_zx - 1)]

Since n_x = n_zx, we can cancel terms, leaving:

    S_zx² = 1

Similar reasoning for Y standardized measures will produce, as the next to the last step in the derivation:

    S_zy² = (n_y - 1) S_y² / [S_y² (n_zy - 1)]

Since n_y = n_zy,

    S_zy² = 1

In each case, the standard deviation for the appropriate variance term is simply the square root of 1. That is:

    S_zx = sqrt(S_zx²) = sqrt(1) = 1    and    S_zy = sqrt(S_zy²) = sqrt(1) = 1

Thus the variance and standard deviation of z scores are equal to 1.
3. That r_xy = r_{z_x z_y}.

We want to show that when measures X and Y are converted to standard scores and correlated, the resulting correlation is the same as the correlation between the unstandardized (raw) measures of X and Y. Let us first rewrite the correlation formula for z scores:

    r_{z_x z_y} = [Σ_{i=1}^{n} (z_xi - z̄_x)(z_yi - z̄_y) / (n - 1)] / sqrt[(Σ_{i=1}^{n_zx} (z_xi - z̄_x)²/(n_zx - 1))(Σ_{i=1}^{n_zy} (z_yi - z̄_y)²/(n_zy - 1))]

Since z̄_x = z̄_y = 0, we can simplify to get:

    r_{z_x z_y} = [Σ_{i=1}^{n} z_xi z_yi / (n - 1)] / sqrt[(Σ_{i=1}^{n_zx} z_xi²/(n_zx - 1))(Σ_{i=1}^{n_zy} z_yi²/(n_zy - 1))]

In the denominator, we recognize that

    Σ_{i=1}^{n_zx} z_xi² = n_zx - 1    and    Σ_{i=1}^{n_zy} z_yi² = n_zy - 1

Substituting these values, we obtain:

    r_{z_x z_y} = [Σ_{i=1}^{n} z_xi z_yi / (n - 1)] / sqrt[((n_zx - 1)/(n_zx - 1))((n_zy - 1)/(n_zy - 1))]

The denominator cancels out completely, leaving:

    r_{z_x z_y} = Σ_{i=1}^{n} z_xi z_yi / (n - 1)

(Recall that this relationship was used in the proof for the limits of r_xy.) Now, expanding the z score terms:

    r_{z_x z_y} = (1/(n - 1)) Σ_{i=1}^{n} [(X_i - X̄)/S_x][(Y_i - Ȳ)/S_y]

This is identical to:

    r_{z_x z_y} = [(1/(n - 1)) Σ_{i=1}^{n} (X_i - X̄)(Y_i - Ȳ)] / (S_x S_y)

Recognizing that S_x = sqrt(S_x²) and S_y = sqrt(S_y²), we can write:

    r_{z_x z_y} = [Σ_{i=1}^{n} (X_i - X̄)(Y_i - Ȳ) / (n - 1)] / sqrt(S_x² S_y²)

Rewriting the denominator of the variance product term in raw score terms (see Table 2):

    r_{z_x z_y} = [Σ_{i=1}^{n} (X_i - X̄)(Y_i - Ȳ) / (n - 1)] / sqrt[(Σ_{i=1}^{n_x} (X_i - X̄)²/(n_x - 1))(Σ_{i=1}^{n_y} (Y_i - Ȳ)²/(n_y - 1))]

This is precisely the form for r_xy that was defined earlier in the paper. Therefore, the correlation between measures in raw score and z score forms is identical.
References

Glass, Gene V, and Stanley, Julian C. Statistical Methods in Education and Psychology. Englewood Cliffs, New Jersey: Prentice-Hall, 1970.

O'Brien, Francis J., Jr. A proof that t² and F are identical: the general case. ERIC Clearinghouse for Science, Mathematics and Environmental Education, Ohio State University, April, 1982. (Note: the ED number for this document was not available at the time of this publication.)