
DOCUMENT RESUME

ED 216 874                                                  SE 037 439

AUTHOR          O'Brien, Francis J., Jr.
TITLE           Proof That the Sample Bivariate Correlation
                Coefficient Has Limits (Plus or Minus) 1.
PUB DATE        82
NOTE            32p.
EDRS PRICE      MF01/PC02 Plus Postage.
DESCRIPTORS     *Correlation; *Educational Research; Higher
                Education; Mathematical Formulas; *Proof
                (Mathematics); *Research Tools; *Statistics
IDENTIFIERS     *Applied Statistics

ABSTRACT
This paper presents in detail a proof of the limits of the sample bivariate correlation coefficient which requires only knowledge of algebra. Notation and basic formulas of standard (z) scores, bivariate correlation formulas in unstandardized and standardized form, and algebraic inequalities are reviewed first, since they are necessary for understanding the proof. Then the proof is presented, with an appendix containing additional proofs of related material. (NNS)


Proof That the Sample Bivariate Correlation Coefficient Has Limits ±1

Francis J. O'Brien, Jr., Ph.D.

Virtually all social science students who have studied applied statistics have been introduced to the concepts and formulas of linear correlation of two variables. Applied statistics textbooks routinely report the theoretical limits of the bivariate correlation coefficient; namely, that the coefficient is no more than +1 and no less than -1. However, no commonly used applied statistics textbook proves this. One of the best textbooks available to students of education and psychology introduces the proof (Glass and Stanley, 1970). Undoubtedly, one of the constraints placed on authors by publishers is the space available for detailed explanations, derivations and proofs.

This paper will set forth in detail a proof of the limits of the sample bivariate correlation coefficient. Since the proof requires only knowledge of algebra, most students of applied statistics at the advanced undergraduate or introductory graduate level should have little difficulty in understanding the proof. As a former instructor of graduate-level introductory applied statistics, I know that the typical student can understand the proof as it is presented here.

The key for understanding statistical proofs is a presentation of detailed steps in a well articulated and coherent manner. A review of relevant statistical and mathematical concepts is also helpful (and usually required). When students are presented important statistical proofs in detail, they feel that some of the mystery and magic of mathematics has been unveiled. My experience has been that the typical student of applied statistics can follow a good number of proofs because most proofs can be presented algebraically without use of calculus. In addition to enhancing knowledge, an occasional proof often increases academic motivation.

Some Preliminary Concepts

The proof requires knowledge of several concepts in statistics and mathematics. In order to make this paper self-contained, some preliminary concepts stated in a consistent notation will be reviewed. We will review the concepts and formulas of standard scores (z scores), bivariate correlation formulas in unstandardized and standardized form, and algebraic inequalities.

Notation and Basic Formulas

Table 1 is a layout of symbolic values written in the notation to be used in this paper. The model presented in Table 1 is of two measures in unstandardized (raw score) and standardized (z score) form. Table 2 presents some familiar formulas based on unstandardized variables that will be useful for the development of the proof.

¹This paper is one of a series contemplated for publication [see O'Brien, 1982]. Eventually I hope to present a textbook of applied statistics proofs and derivations to supplement standard applied statistics textbooks.

Table 1

Table Layout for Two Measures in Unstandardized and Standardized Form

  Case | X (unstandardized) | X (standardized)                | Y (unstandardized) | Y (standardized)
  1    | $X_1$              | $(X_1 - \bar X)/S_x = Z_{x_1}$  | $Y_1$              | $(Y_1 - \bar Y)/S_y = Z_{y_1}$
  2    | $X_2$              | $(X_2 - \bar X)/S_x = Z_{x_2}$  | $Y_2$              | $(Y_2 - \bar Y)/S_y = Z_{y_2}$
  ...  | ...                | ...                             | ...                | ...
  i    | $X_i$              | $(X_i - \bar X)/S_x = Z_{x_i}$  | $Y_i$              | $(Y_i - \bar Y)/S_y = Z_{y_i}$
  ...  | ...                | ...                             | ...                | ...
  n    | $X_n$              | $(X_n - \bar X)/S_x = Z_{x_n}$  | $Y_n$              | $(Y_n - \bar Y)/S_y = Z_{y_n}$

  Sample size     | $n_x$    | $n_{z_x}$   | $n_y$    | $n_{z_y}$
  Sample mean     | $\bar X$ | $\bar Z_x$  | $\bar Y$ | $\bar Z_y$
  Sample variance | $S_x^2$  | $S_{z_x}^2$ | $S_y^2$  | $S_{z_y}^2$

NOTE: All sample size terms are equal; that is, $n_x = n_y = n_{z_x} = n_{z_y}$. Any of these sample size terms could be identified by just one symbol, such as n. We will use n when it is not important to distinguish among the sample size terms, but will use the table values above when it is necessary or important to do so.
,,

Table 2

Relevant Formulas for Unstandardized Measures

                    Measure X                                            Measure Y
  Sample Mean       $\bar X = \sum_{i=1}^{n_x} X_i / n_x$                $\bar Y = \sum_{i=1}^{n_y} Y_i / n_y$
  Sum               $n_x \bar X = \sum_{i=1}^{n_x} X_i$                  $n_y \bar Y = \sum_{i=1}^{n_y} Y_i$
  Sample Variance   $S_x^2 = \sum_{i=1}^{n_x}(X_i - \bar X)^2/(n_x-1)$   $S_y^2 = \sum_{i=1}^{n_y}(Y_i - \bar Y)^2/(n_y-1)$
  Sum of Squares    $(n_x - 1)S_x^2 = \sum_{i=1}^{n_x}(X_i - \bar X)^2$  $(n_y - 1)S_y^2 = \sum_{i=1}^{n_y}(Y_i - \bar Y)^2$

NOTES:

1. The sample size terms are equal: $n_x = n_y$. Also $n_x = n_y = n$.

2. "Sum" is simply an algebraic manipulation of "Sample Mean"; i.e., multiply over the sample size term in "Sample Mean" to get "Sum". Also, "Sum of Squares" is such a manipulation based on "Sample Variance". "Sum" and "Sum of Squares" will be useful later on.

3. Descriptive statistics for standardized scores will be developed in the body of the text.
Ilr

Standard Scores

It will be recalled that the standard score for an unstandardized measure (raw score) is "the score minus the mean divided by the standard deviation". For case 1 of measure X in Table 1, the standard (z) score is:

$$Z_{x_1} = \frac{X_1 - \bar X}{S_x}$$

For any (hypothetical) case i, the standard score of an X measure is:

$$Z_{x_i} = \frac{X_i - \bar X}{S_x}$$

The same procedure can be applied to Y measures. For case 1:

$$Z_{y_1} = \frac{Y_1 - \bar Y}{S_y}$$

Similarly, for the ith (hypothetical) case, we have:

$$Z_{y_i} = \frac{Y_i - \bar Y}{S_y}$$

Since a standard score distribution (such as in Table 1) is a distribution of variable measures, we can calculate means, standard deviations, variances, correlations, and so forth, just as we can calculate these statistics for unstandardized measures.
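In code, standardization is a one-line transformation. A minimal sketch in plain Python (the data are invented for illustration; `statistics.stdev` uses the same n-1 denominator as $S_x$ in Table 2):

```python
import statistics

# Convert raw X scores to standard (z) scores: z_i = (X_i - mean) / S_x.
x = [3.0, 5.0, 7.0, 9.0]
x_bar = statistics.mean(x)
s_x = statistics.stdev(x)        # sample standard deviation (n-1 denominator)

z_x = [(xi - x_bar) / s_x for xi in x]
print([round(z, 3) for z in z_x])
```

Scores below the mean become negative z values and scores above it become positive ones, mirroring the deviation terms in the formula.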

Most students will recall that the mean of z scores is equal to 0 and the standard deviation (and variance) of standardized scores is equal to 1. (The proof of these statements is given in the Appendix.¹)

The mean of X standardized scores is defined as:

$$\bar Z_x = \frac{\sum_{i=1}^{n_{z_x}} Z_{x_i}}{n_{z_x}} = 0$$

Similarly, for Y measures:

$$\bar Z_y = 0$$

The variance of X in z score notation is defined as:

$$S_{z_x}^2 = \frac{\sum_{i=1}^{n_{z_x}} \left(Z_{x_i} - \bar Z_x\right)^2}{n_{z_x} - 1}$$

¹The Appendix contains proofs of certain concepts or relationships that may be of interest to the reader but are not crucial for the development of the proof in this paper (the theoretical limits of the sample bivariate correlation coefficient).

For the standardized Y measure the variance is defined as:

$$S_{z_y}^2 = \frac{\sum_{i=1}^{n_{z_y}} \left(Z_{y_i} - \bar Z_y\right)^2}{n_{z_y} - 1}$$
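These two facts — mean 0 and variance 1 for standardized scores — are easy to confirm numerically. A sketch in plain Python with made-up data (`statistics.variance` is the sample variance, matching the n-1 denominator above):

```python
import statistics

y = [1.0, 4.0, 4.0, 6.0, 10.0]
y_bar = statistics.mean(y)
s_y = statistics.stdev(y)

z_y = [(yi - y_bar) / s_y for yi in y]

z_mean = statistics.mean(z_y)
z_var = statistics.variance(z_y)    # n-1 denominator, matching the definition above

assert abs(z_mean) < 1e-12          # mean of z scores is 0
assert abs(z_var - 1.0) < 1e-12     # variance of z scores is 1
print(round(z_mean, 12), round(z_var, 12))
```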

Sum of Squared Standard Scores

To understand the proof it is necessary to know the result of summing a distribution of squared standard scores. If we square each standard score for the X measure in Table 1 and sum them, we obtain:

$$\sum_{i=1}^{n_{z_x}} Z_{x_i}^2 = Z_{x_1}^2 + Z_{x_2}^2 + \cdots + Z_{x_n}^2$$

If we substitute the appropriate means and variances in the right hand side of the expression, we obtain:

$$\sum_{i=1}^{n_{z_x}} Z_{x_i}^2 = \frac{(X_1 - \bar X)^2}{S_x^2} + \frac{(X_2 - \bar X)^2}{S_x^2} + \cdots + \frac{(X_n - \bar X)^2}{S_x^2}$$

Since $1/S_x^2$ is a constant, we can factor it outside and write:

$$\sum_{i=1}^{n_{z_x}} Z_{x_i}^2 = \frac{1}{S_x^2}\left[(X_1 - \bar X)^2 + (X_2 - \bar X)^2 + \cdots + (X_n - \bar X)^2\right]$$

Rewriting the right hand side in summation notation, we obtain:

$$\sum_{i=1}^{n_{z_x}} Z_{x_i}^2 = \frac{\sum_{i=1}^{n_x}(X_i - \bar X)^2}{S_x^2}$$

From Table 2, we know that we can substitute the sum of squares term into the numerator on the right hand side. This results in:

$$\sum_{i=1}^{n_{z_x}} Z_{x_i}^2 = \frac{(n_x - 1)S_x^2}{S_x^2} = n_x - 1$$

(Recall that $n_x - 1 = n - 1$.)

If we were to work through the same steps for Y, we would obtain:

$$\sum_{i=1}^{n_{z_y}} Z_{y_i}^2 = \frac{(n_y - 1)S_y^2}{S_y^2} = n_y - 1$$

(Recall that $n_y - 1 = n - 1$.)

These relationships between squared z scores and sample size are very important for the proof later on. They will be summarized later for easy reference.
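The result $\sum Z^2 = n - 1$ can be checked directly: whatever the raw data, the squared z scores must sum to one less than the sample size. A sketch with arbitrary invented data:

```python
import statistics

x = [2.0, 3.0, 5.0, 8.0, 9.0, 12.0]
n = len(x)
x_bar = statistics.mean(x)
s_x = statistics.stdev(x)

sum_z_sq = sum(((xi - x_bar) / s_x) ** 2 for xi in x)

# The sum of squared z scores equals n - 1, regardless of the data.
assert abs(sum_z_sq - (n - 1)) < 1e-9
print(sum_z_sq)   # ~5.0 for these six cases
```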

Correlation Formulas

Unstandardized Form

Using the notation and variables in Tables 1 and 2, the unstandardized form of the correlation for two measures (X and Y) is defined as follows:

$$r_{xy} = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\dfrac{\sum_{i=1}^{n_x}(X_i - \bar X)^2}{n_x - 1}}\,\sqrt{\dfrac{\sum_{i=1}^{n_y}(Y_i - \bar Y)^2}{n_y - 1}}}$$

Note that the numerator contains the term n-1 because it is not important or necessary to distinguish between $n_x - 1$ and $n_y - 1$. However, in the denominator it is helpful to distinguish $n_x - 1$ from $n_y - 1$. In any case, all of the sample size terms would be equal to the same numerical value if a correlation coefficient were computed on a set of data ($n_x - 1 = n_y - 1 = n - 1$).

Standardized Form

The correlation of measure X and measure Y in standard score form is defined as follows:

$$r_{z_x z_y} = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(Z_{x_i} - \bar Z_x)(Z_{y_i} - \bar Z_y)}{\sqrt{\dfrac{\sum_{i=1}^{n_{z_x}}(Z_{x_i} - \bar Z_x)^2}{n_{z_x} - 1}}\,\sqrt{\dfrac{\sum_{i=1}^{n_{z_y}}(Z_{y_i} - \bar Z_y)^2}{n_{z_y} - 1}}}$$

It is proven in the Appendix that this correlation formula is equal to:

$$r_{z_x z_y} = \frac{\sum_{i=1}^{n} Z_{x_i} Z_{y_i}}{n - 1}$$

If we rearrange this formula by multiplying over the n-1 term, we obtain:

$$(n-1)\,r_{z_x z_y} = \sum_{i=1}^{n} Z_{x_i} Z_{y_i}$$

This relationship will be useful in the proof; it will be restated for easy reference later.

The reader may recall that the same correlation coefficient results when the variables are in raw score form or standard score form. That is:

$$r_{xy} = r_{z_x z_y}$$

This statement is proven in the Appendix. We will restate it prior to the proof for the reader's convenience.
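Both claims — that the standard-score correlation reduces to a sum of cross products over n-1, and that standardizing leaves the correlation unchanged — can be verified numerically. A sketch in plain Python (the data are invented; the helper functions `zscores` and `corr` are ad hoc for this example, not from any particular library):

```python
import statistics

def zscores(v):
    """Standardize a list using the sample (n-1) standard deviation."""
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(vi - m) / s for vi in v]

def corr(a, b):
    """Sample correlation from the raw-score definition."""
    n = len(a)
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (n - 1)
    return num / (statistics.stdev(a) * statistics.stdev(b))

x = [1.0, 2.0, 4.0, 5.0, 8.0]
y = [2.0, 1.0, 5.0, 4.0, 9.0]
n = len(x)

r_xy = corr(x, y)
zx, zy = zscores(x), zscores(y)

# Standard-score form: r equals the sum of z-score cross products over n-1.
r_z = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

assert abs(r_xy - r_z) < 1e-12                 # r_xy = r_{z_x z_y}
assert abs((n - 1) * r_z - sum(a * b for a, b in zip(zx, zy))) < 1e-12
print(round(r_xy, 4))
```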

Inequalities

Before starting the proof, it is necessary to review one further topic: algebraic inequalities. In the proof we are required to manipulate the inequality forms "greater than or equal to" and "less than or equal to". An example will serve as a refresher. For two variables, say A and B, we can write:

$$A \geq B$$

which means "A is greater than or equal to B". Equivalently, we can write:

$$B \leq A$$

which means the same thing: "B is less than or equal to A". For example, $3 \geq 1$, or equivalently, $1 \leq 3$.

All of this may be obvious. What students sometimes forget is what happens when multiplying or dividing by negative quantities. For example, if $3 > 1$ and we multiply this inequality by $-1$, we would obtain:

$$-1\left[(3) > (1)\right] \;\Rightarrow\; -3 < -1$$

That is, the inequality sign is reversed when multiplying by a negative number. The same result occurs for more complex expressions. For example:

$$1 - A \geq B$$

Multiplying each side of the inequality by $-1$, we obtain:

$$-1\left[(1 - A) \geq (B)\right] \;\Rightarrow\; -(1 - A) \leq -(B) \;\Rightarrow\; A - 1 \leq -B$$

Example: $1 - \tfrac14 > 0$. Multiplying through by $-1$, we obtain:

$$-1\left[\left(1 - \tfrac14\right) > 0\right] \;\Rightarrow\; -\left(1 - \tfrac14\right) < 0 \;\Rightarrow\; \tfrac14 - 1 < 0$$
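The sign-reversal rule is mechanical enough to express as executable checks. A toy sketch with made-up values:

```python
# Multiplying both sides of an inequality by a negative number reverses it.
a, b = 3, 1
assert a > b
assert -1 * a < -1 * b        # -3 < -1: the direction flips

# The same holds for compound expressions such as 1 - A >= B.
A, B = 0.25, 0.5
assert (1 - A) >= B
assert (A - 1) <= -B          # after multiplying through by -1
print("inequality reversal confirmed")
```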

Summary of Important Concepts

We have reviewed standard scores (z), correlation formulas and algebraic inequalities. All of these concepts are important for understanding the proof that follows. For the reader's convenience, we will summarize these concepts for easy reference. This is done in Table 3.

Table 3

Summary of Important Concepts

1. $\sum_{i=1}^{n} Z_{x_i}^2 = n - 1$

2. $\sum_{i=1}^{n} Z_{y_i}^2 = n - 1$

3. $(n-1)\,r_{xy} = (n-1)\,r_{z_x z_y} = \sum_{i=1}^{n} Z_{x_i} Z_{y_i}$

4. Multiplying an inequality by $-1$ reverses its sign: $-1\left[(1-A) \geq (B)\right] \;\Rightarrow\; (A-1) \leq -(B)$

Proof

We are now ready to present the proof. Formally, we want to prove the following statements:

$$r_{xy} \geq -1 \qquad\text{and}\qquad r_{xy} \leq +1$$

Writing each of the statements in one linear form:

$$-1 \leq r_{xy} \leq +1$$

This states the same information as the above two separate statements. The proof consists of two parts: one part shows the lower limit of $r_{xy}$ (i.e., $r_{xy} \geq -1$), and the second part shows the upper limit of $r_{xy}$ (i.e., $r_{xy} \leq +1$). We will prove the upper limit first.

Proof that $r_{xy} \leq +1$

To prove this limit, we will perform algebraic manipulations on a statement which is mathematically true. That statement is:

$$\sum_{i=1}^{n}\left(Z_{x_i} - Z_{y_i}\right)^2 \geq 0$$

In words, the statement means: the sum of squared differences of n standardized value pairs will always be equal to or greater than 0. The reader may refer to Table 1 for clarification. The squared differences are taken for each row (pair) of $Z_x$ and $Z_y$ values, starting at $(Z_{x_1}, Z_{y_1})$ and continuing down to the last pair $(Z_{x_n}, Z_{y_n})$.

Most students readily agree that the squared sum will be greater than 0. But can it ever be exactly equal to 0? Yes, theoretically it can. Referring to Table 1, if one imagines each standardized X and Y measure to have the same numerical value,¹ then it is apparent that each difference will be 0; so, the squared value of 0 is also 0. Now, a sum of squared 0's will itself be equal to 0. While it may be unlikely to occur in practice, it is only required that the statement be true in a mathematical sense. Thus, the statement is true.

We will expand this squared sum, perform algebraic manipulations and substitutions, and arrive at the proof for the upper limit of the sample correlation coefficient. The actual steps in the derivation will now be presented. Notes pertaining to the algebra are provided for the reader's reference; refer to Tables 1, 2 and 3 as needed. It is suggested that the reader first examine each algebraic statement, then read the accompanying note for explanation.

¹That is, within pairs, not all pairs. Example:

    Z_x     Z_y
    1.41    1.41
    -.68    -.68
    .05     .05
    etc.

Step 1. The true statement from before:

$$\sum_{i=1}^{n}\left(Z_{x_i} - Z_{y_i}\right)^2 \geq 0$$

Step 2. Squaring each term, we obtain an expansion of the binomial in the form $(A - B)^2 = A^2 + B^2 - 2AB$:

$$\sum_{i=1}^{n}\left(Z_{x_i}^2 + Z_{y_i}^2 - 2\,Z_{x_i} Z_{y_i}\right) \geq 0$$

Step 3. Distributing the summation operator to each term, and bringing the constant 2 outside the summation sign:

$$\sum_{i=1}^{n} Z_{x_i}^2 + \sum_{i=1}^{n} Z_{y_i}^2 - 2\sum_{i=1}^{n} Z_{x_i} Z_{y_i} \geq 0$$

Step 4. This next step is very important. We substitute three quantities, all from Table 3:

$$\sum_{i=1}^{n} Z_{x_i}^2 = n - 1, \qquad \sum_{i=1}^{n} Z_{y_i}^2 = n - 1, \qquad \sum_{i=1}^{n} Z_{x_i} Z_{y_i} = (n-1)\,r_{xy}$$

Making the three substitutions:

$$(n-1) + (n-1) - 2(n-1)\,r_{xy} \geq 0$$

Step 5. Collecting the like terms of (n-1):

$$2(n-1) - 2(n-1)\,r_{xy} \geq 0$$

Step 6. Factoring the 2(n-1) term:

$$2(n-1)\left(1 - r_{xy}\right) \geq 0$$

Step 7. Dividing each side of the inequality by 2(n-1), which does not change the inequality sign because 2(n-1) is always positive (n must always be greater than 1):

$$1 - r_{xy} \geq 0$$

Step 8. Multiplying each side of the inequality by -1, which reverses the inequality sign (see Table 3):

$$r_{xy} - 1 \leq 0$$

Step 9. Adding +1 to each side gives us:

$$r_{xy} \leq +1$$

END OF PROOF FOR UPPER LIMIT.
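The chain of substitutions above can be checked numerically: for any data set, $\sum(Z_{x_i} - Z_{y_i})^2$ should equal $2(n-1)(1 - r_{xy})$ up to rounding error, from which the upper limit follows. A sketch with invented data:

```python
import statistics

x = [1.0, 3.0, 4.0, 6.0, 9.0, 11.0]
y = [2.0, 2.0, 5.0, 5.0, 8.0, 12.0]
n = len(x)

def zscores(v):
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(vi - m) / s for vi in v]

zx, zy = zscores(x), zscores(y)

# r_xy via the standard-score form: sum of cross products over n-1.
r_xy = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# The starting statement of the proof, and its simplified form from Steps 2-6:
lhs = sum((a - b) ** 2 for a, b in zip(zx, zy))
rhs = 2 * (n - 1) * (1 - r_xy)

assert lhs >= 0                      # a sum of squares is never negative
assert abs(lhs - rhs) < 1e-9         # sum (zx - zy)^2 = 2(n-1)(1 - r_xy)
assert r_xy <= 1.0 + 1e-12           # hence the upper limit
print(round(r_xy, 4))
```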

Proof that $r_{xy} \geq -1$

Part two of the proof will be much simpler because the structure of the proof is very much like the first part. We will follow the same basic steps. We start out with a statement that is mathematically true, namely:

$$\sum_{i=1}^{n}\left(Z_{x_i} + Z_{y_i}\right)^2 \geq 0$$

Again, this statement is true in a mathematical sense even though the "equals 0" aspect is very unlikely to occur in statistical practice. The development of the proof with appropriate notes follows.

Step 1 restated:

$$\sum_{i=1}^{n}\left(Z_{x_i} + Z_{y_i}\right)^2 \geq 0$$

Squaring each term results in a binomial expansion in the form $(A + B)^2 = A^2 + B^2 + 2AB$:

$$\sum_{i=1}^{n}\left(Z_{x_i}^2 + Z_{y_i}^2 + 2\,Z_{x_i} Z_{y_i}\right) \geq 0$$

Distributing the summation operator and bringing out the 2:

$$\sum_{i=1}^{n} Z_{x_i}^2 + \sum_{i=1}^{n} Z_{y_i}^2 + 2\sum_{i=1}^{n} Z_{x_i} Z_{y_i} \geq 0$$

Making the same three substitutions as in part one, we obtain:

$$(n-1) + (n-1) + 2(n-1)\,r_{xy} \geq 0$$

Adding like terms and factoring:

$$2(n-1)\left(1 + r_{xy}\right) \geq 0$$

Dividing each side by 2(n-1):

$$1 + r_{xy} \geq 0$$

Adding -1 to each side and simplifying:

$$r_{xy} \geq -1$$

END OF PROOF FOR LOWER LIMIT.

We have just proven that $-1 \leq r_{xy} \leq +1$. See the Appendix for additional proofs of related material.
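As an empirical companion to the full result, the sketch below draws many random samples and confirms that the computed coefficient never leaves [-1, +1], and that the identities behind both halves of the proof hold each time (plain Python; the random-data setup is illustrative only):

```python
import random
import statistics

random.seed(1)

def zscores(v):
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(vi - m) / s for vi in v]

for _ in range(200):
    n = random.randint(3, 30)
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.gauss(0, 1) for _ in range(n)]

    zx, zy = zscores(x), zscores(y)
    r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

    # Identities behind the two halves of the proof:
    assert abs(sum((a - b) ** 2 for a, b in zip(zx, zy)) - 2 * (n - 1) * (1 - r)) < 1e-8
    assert abs(sum((a + b) ** 2 for a, b in zip(zx, zy)) - 2 * (n - 1) * (1 + r)) < 1e-8

    # The limits themselves (allowing for rounding error):
    assert -1.0 - 1e-12 <= r <= 1.0 + 1e-12

print("all samples within [-1, +1]")
```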


APPENDIX

Selected Proofs

1. That the mean of standard scores is equal to 0.

We will start with the definition of the mean of z scores for the X measure:

$$\bar Z_x = \frac{\sum_{i=1}^{n_{z_x}} Z_{x_i}}{n_{z_x}}$$

Expanding the right side:

$$\bar Z_x = \frac{1}{n_{z_x}}\left[\frac{(X_1 - \bar X)}{S_x} + \frac{(X_2 - \bar X)}{S_x} + \cdots + \frac{(X_n - \bar X)}{S_x}\right]$$

Factoring the constant $1/S_x$ outside and writing the sum of deviations in summation notation:

$$\bar Z_x = \frac{1}{n_{z_x} S_x}\sum_{i=1}^{n_x}\left(X_i - \bar X\right)$$

Distributing the summation sign inside the parentheses:

$$\bar Z_x = \frac{1}{n_{z_x} S_x}\left(\sum_{i=1}^{n_x} X_i - \sum_{i=1}^{n_x} \bar X\right)$$

Since $\sum_{i=1}^{n_x} X_i$ is equal to $n_x \bar X$, and the sum of the constant $\bar X$ is $n_x \bar X$,² we have:

$$\bar Z_x = \frac{1}{n_{z_x} S_x}\left(n_x \bar X - n_x \bar X\right) = 0$$

Thus the mean of $Z_x$ scores is equal to 0. Similar reasoning for the Y measure will produce the same result, namely:

$$\bar Z_y = \frac{\sum_{i=1}^{n_{z_y}} Z_{y_i}}{n_{z_y}} = \frac{1}{n_{z_y} S_y}\left(n_y \bar Y - n_y \bar Y\right) = 0$$

Therefore, variables in standardized form have mean equal to 0.

²Recall that when taking the sum of a constant (say C) we have:

$$\sum_{i=1}^{n} C = C + C + C + \cdots + C = nC$$

That is, the sum of a constant is equal to the constant times the number of terms added (in this case, n).

2. That the variance and standard deviation of standard scores are equal to 1.

By definition, the variance for X measures in standard score form is:

$$S_{z_x}^2 = \frac{\sum_{i=1}^{n_{z_x}}\left(Z_{x_i} - \bar Z_x\right)^2}{n_{z_x} - 1}$$

Since we know that $\bar Z_x = 0$, we now have:

$$S_{z_x}^2 = \frac{\sum_{i=1}^{n_{z_x}} Z_{x_i}^2}{n_{z_x} - 1}$$

If we rewrite $Z_{x_i}$ in terms of the unstandardized mean and standard deviation:

$$S_{z_x}^2 = \frac{\sum_{i=1}^{n_{z_x}}\left(\dfrac{X_i - \bar X}{S_x}\right)^2}{n_{z_x} - 1}$$

Rearranging terms:

$$S_{z_x}^2 = \frac{1}{n_{z_x} - 1}\cdot\frac{1}{S_x^2}\sum_{i=1}^{n_x}\left(X_i - \bar X\right)^2$$

From Table 2, we can substitute the "sum of squares" for the X measure into the numerator. This results in:

$$S_{z_x}^2 = \frac{(n_x - 1)\,S_x^2}{(n_{z_x} - 1)\,S_x^2}$$

Since $n_x = n_{z_x}$, we can cancel terms, leaving:

$$S_{z_x}^2 = 1$$

Similar reasoning for Y standardized measures will produce, as the next to the last step in the derivation:

$$S_{z_y}^2 = \frac{(n_y - 1)\,S_y^2}{(n_{z_y} - 1)\,S_y^2}$$

Since $n_y = n_{z_y}$, this also equals 1.

In each case, the standard deviation is simply the square root of the appropriate variance term, that is, the square root of 1:

$$S_{z_x} = \sqrt{S_{z_x}^2} = \sqrt{1} = 1 \qquad\text{and}\qquad S_{z_y} = \sqrt{S_{z_y}^2} = \sqrt{1} = 1$$

Thus the variance and standard deviation of z scores are equal to 1.

a.

25.

0 3.

That

xy

rz

x y

We want to show that when measures X and Y are converted to standard


scores and correlated, the resulting correlation is the same as the correlation between the unstandardized (raw) measures of X and Y.Let us first
rewrite the correlation formula foY z scores:

:E=

,e 1

n-1
r

(z E
x

i=1

(Z E y)

tex

Yi

z z
x y

-am
z
2

-Z
x

i=1
n

^
i=1

-1

-1

x
,

Since Z

= 0, we can simplify to get:

4i

n
(Zx ) (Z

n-1
r

y .
3.

3.

i=1

z z

x y

nz
2

(Z

Y.

i=1
nz -1
Y4

In the denominator, we recognize. that'


i=1

(z'xi

)2

= n -1
z

.111140114,

nz

Substituting these valu es, we obtain:

and

r.

26.

S--(Z
r

.)(Z

x.

1.

y.

n-1 i=1
z z

x y

-1

.nz)c-1

-1
z

-1

n
z

The denominator cancels out completely leaving:

r
z z

11-1

x y

xi yi

1=1

(Recall that this relationship was used in the proof for the limits of r ).
xy
Nowo expanding the z score terms:

r.--

,
r

z z
x y

cp

n-1

(Xi -T)

(y.- 1-i)

.4i=1 '(S ) (S
x

This is identical to :

n
1.

RHY i)'

n-1 i=1

30
441

27.

Recognizing that

S2

and \02

, we can write:

40P

'

)
(X -R).(Y.-i
1

n-1

1=1
2

(S

2
) (S

Rewriting the denominator of the variance product term in

aw score

terms (see Table 2)':

.
i Em-X)
n-1
.

z z
x y

r-

-i)

i=1
....

n
Y
2

21(X. -X)

.4*

i=1

i=1
.

n x-1

Therefore,

-Y)2

Y:

n
,110i46

=mow.

01/0.

This is precisely the fOrm for

.....

...-

n
x

that was defined earlier it the iapei.

xY

the correlation ketween measures in rawrscore and z score

forms is identical.

References

Glass, Gene V and Stanley, Julian C. Statistical Methods in Education and Psychology. Englewood Cliffs, New Jersey: Prentice-Hall, 1970.

O'Brien, Francis J., Jr. A proof that t² and F are identical: the general case. ERIC Clearinghouse for Science, Mathematics and Environmental Education, Ohio State University, April, 1982. (Note: the ED number for this document was not available at the time of this publication.)