
CHAPTER 4 CORRELATION

CHAPTER 4: CORRELATION

4.1 Association between variables


Often one has two or more variables measured for a population or sample, e.g. rainfall
and temperature, or age and income.
One may be interested in the relationship between the variables: do certain values of
one tend to be associated with certain values of the other? One may also want to describe and
measure the relationship, if any.

4.2 Scattergrams
A scattergram is a graph of two variables (usually labelled x and y) that is plotted in
order to illustrate the relationship between them, if any. It is constructed by drawing the
usual x-axis and y-axis and then plotting a point for each pair of x- and y-values in the
dataset.
Example 4.2.1: A sample of five steel cables in a workshop yielded the following data,
which is illustrated in a scattergram.

Cable   No. of strands in cable (x)   Breaking strength, tonnes (y)
A       4                             15
B       3                             10
C       2                             8
D       5                             17
E       5                             16

Unlike the axes usually drawn in mathematics, it is not necessary to show the origin in
a scattergram; the axis for each variable need be drawn only for the range of values that is
required for that variable.
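As a rough illustration of the construction described above, the following Python sketch prints a text-only scattergram of the cable data from Example 4.2.1. The function name text_scattergram and the character layout are illustrative assumptions, not part of these notes, and the sketch only handles whole-number data. Note that each axis covers only the range of values present in the data, so the origin is not shown.

```python
def text_scattergram(xs, ys):
    """Return text rows for a simple scattergram of whole-number data.

    Hypothetical helper for illustration: one row per y-value (highest
    first), one column per x-value, '*' where a data point exists.
    """
    x_lo, x_hi = min(xs), max(xs)
    points = set(zip(xs, ys))
    rows = []
    for y in range(max(ys), min(ys) - 1, -1):
        marks = "".join(
            " *" if (x, y) in points else " ." for x in range(x_lo, x_hi + 1)
        )
        rows.append(f"{y:>3} |{marks}")
    # x-axis line and x-axis labels, drawn only over the observed range
    rows.append("    +" + "--" * (x_hi - x_lo + 1))
    rows.append("     " + "".join(f"{x:>2}" for x in range(x_lo, x_hi + 1)))
    return rows


# Data from Example 4.2.1
strands = [4, 3, 2, 5, 5]          # x: number of strands in cable
strength = [15, 10, 8, 17, 16]     # y: breaking strength (tonne)

rows = text_scattergram(strands, strength)
print("\n".join(rows))
```

The points rise from the lower left to the upper right of the printed grid, which is the pattern discussed in the sections that follow.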


4.3 Types of relationship


The scatter of points may tend to follow a pattern, e.g. a straight line or a curve.
Hereafter, these notes consider only straight-line (i.e. linear) patterns, which are the most
common.

4.4 Strength of (linear) relationship


The closer the points lie to a straight line, the stronger is the linear relationship
between the variables.

4.5 Direction of relationship


When high values of one variable tend to be associated with high values of the other,
and low values of one tend to be associated with low values of the other, the relationship is
said to be a positive, or direct, one. When high values of one variable tend to be associated
with low values of the other, the relationship is said to be a negative, or inverse, one.

4.6 Pearson’s correlation coefficient


For data that has been measured on an interval or ratio type of scale, the usual way of
measuring linear correlation is by Pearson’s (product-moment) correlation coefficient.
This is usually denoted by “R” (for a population) or “r” (for a sample). The formula for R is:

\[
R = \frac{N \sum XY - \sum X \sum Y}
         {\sqrt{\left[ N \sum X^2 - \left( \sum X \right)^2 \right]
                \left[ N \sum Y^2 - \left( \sum Y \right)^2 \right]}}
\]

Example 4.6.1: Calculate the product-moment correlation coefficient for the data in
Example 4.2.1.

Cable   No. of strands in cable (x)   Breaking strength, tonnes (y)   xy    x²    y²
A       4                             15                              60    16    225
B       3                             10                              30    9     100
C       2                             8                               16    4     64
D       5                             17                              85    25    289
E       5                             16                              80    25    256
Total   19                            66                              271   79    934
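The totals in this table can be substituted into the product-moment formula. The Python sketch below does this for the five cables, assuming that the sample coefficient r uses the same formula as R with the sample size n in place of N. The function name pearson_r is an illustrative choice, not something defined in these notes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation computed from the raw sums.

    Assumes the sample version of the formula, with n = number of pairs.
    """
    n = len(xs)
    sum_x = sum(xs)                               # 19 for the cable data
    sum_y = sum(ys)                               # 66
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # 271
    sum_x2 = sum(x * x for x in xs)               # 79
    sum_y2 = sum(y * y for y in ys)               # 934
    numerator = n * sum_xy - sum_x * sum_y        # 5(271) - 19(66) = 101
    denominator = sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )                                             # sqrt(34 * 314)
    return numerator / denominator


# Data from Example 4.2.1
strands = [4, 3, 2, 5, 5]
strength = [15, 10, 8, 17, 16]

r = pearson_r(strands, strength)
print(round(r, 2))  # 0.98
```

A value of about 0.98 is close to +1, which agrees with the nearly straight, upward-sloping pattern of the five points.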

4.7 Interpretation of Pearson’s correlation coefficient


The product-moment correlation coefficient is defined in such a way that its value
always lies between -1 and +1. The general rule for interpreting the coefficient is illustrated
in this diagram.


Note that values close to zero (whether positive or negative) indicate low (or weak)
correlation, whereas values close to 1 in size (whether positive or negative) indicate high (or
strong) correlation.
A serious and common misunderstanding about correlation is that “high correlation
implies causation”; in other words, if there is moderate to strong correlation between two
variables, then one of the variables “is caused by” or “depends on” the other one. This is not
necessarily so. For example, a high positive correlation has been found to exist between the
number of murders in the UK over a number of years and the number of marriages in the
Anglican Church over the same period; it is obviously nonsense to suggest that “murders
cause marriages” or that “marriages cause murders”. Often the explanation for such so-called
“nonsense correlations” is that the two variables in question are affected by (“caused by”) a
third variable or several other variables. In this example, the number of murders and the
number of marriages occurring over time are both clearly affected by the growth in
population over time.
A high correlation between two variables merely indicates that there is a
mathematical relationship between the two sets of values; further research is required in
order to determine whether there is a causal relationship between the two variables or
whether there is some other explanation for the observed association between them.

4.8 Rank correlation


If two variables are measured on ordinal-type scales (i.e. if the data values are ranks),
then one can use Spearman’s rank correlation coefficient to measure the correlation
between the variables. This coefficient is usually denoted by a Greek letter called “rho” (ρ).
The ranking may be done from highest to lowest or vice versa, as long as it is done in the
same way for both the variables.
Rho uses the differences between the ranks of the two variables in order to measure
correlation. If d denotes the difference between the rank on x and the rank on y for each pair
of values, the formula for rho is:

\[
\rho = 1 - \frac{6 \sum d^2}{n \left( n^2 - 1 \right)}
\]

Example 4.8.1: Calculate the rank correlation coefficient for the following data, which
refers to some students’ exam results.

Student   Rank in English   Rank in Maths   Difference in ranks (d)   d²
A         5                 8               -3                        9
B         7                 6               1                         1
C         2                 10              -8                        64
D         1                 9               -8                        64
E         6                 4               2                         4
F         8                 1               7                         49
G         9                 2               7                         49
H         10                3               7                         49
I         4                 5               -1                        1
J         3                 7               -4                        16
Total     55                55              0                         306
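The rank differences in this table can be checked with a short Python sketch of the rho formula. The function name spearman_rho is an illustrative assumption, and the sketch takes the ranks as given, without handling the ranking step itself.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation from two lists of ranks.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the
    difference between the two ranks of each pair.
    """
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Ranks of students A to J from Example 4.8.1
english = [5, 7, 2, 1, 6, 8, 9, 10, 4, 3]
maths = [8, 6, 10, 9, 4, 1, 2, 3, 5, 7]

rho = spearman_rho(english, maths)
print(round(rho, 2))  # -0.85
```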

n = 10
ρ = 1 - 6(306)/(10 × 99) = 1 - 1836/990 = -0.85
If one has data for one variable measured on an interval or ratio scale and for another
variable measured on an ordinal scale, one can convert the former values into rank values and
then apply Spearman’s coefficient. When ranking data, if there are ties, each of the common
values must be given the average of the rank positions that they cover, as described in Section
1.10.
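The conversion of measured values into ranks, with tied values sharing the average of the rank positions they cover, might be sketched in Python as follows. The function name average_ranks, its highest_first parameter and the marks used are hypothetical illustrations, not data or notation from these notes.

```python
def average_ranks(values, highest_first=True):
    """Rank values, giving tied values the average of their rank positions.

    Hypothetical helper: rank 1 goes to the highest value by default.
    """
    # Indices of the values in ranking order
    order = sorted(
        range(len(values)), key=lambda i: values[i], reverse=highest_first
    )
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of values tied with the one at 'pos'
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Rank positions are 1-based: the run covers pos+1 .. end+1
        shared = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Made-up exam marks, for illustration only: two students tie on 70
marks = [70, 85, 70, 90]
print(average_ranks(marks))  # [3.5, 2.0, 3.5, 1.0]
```

The two tied marks cover rank positions 3 and 4, so each receives the rank 3.5. The same ranking direction should be used for both variables before the differences d are calculated.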

4.9 Interpretation of rank correlation coefficient


Like r, ρ always lies between -1 and +1. The interpretation of ρ is similar to the
interpretation of r. Thus in the example of the previous section, the ρ value of -0.85 indicates
that there is a strong negative correlation between a student’s rank in English and his/her rank
in Maths.
