
TUTORIAL | SCOPE

CORRELATION COEFFICIENT: ASSOCIATION BETWEEN TWO CONTINUOUS VARIABLES

Dr Jenny Freeman and Dr Tracey Young use statistics to calculate the correlation coefficient: the association between two continuous variables

Many statistical analyses can be undertaken to examine the relationship between two continuous variables within a group of subjects. Two of the main purposes of such analyses are:

• To assess whether the two variables are associated. There is no distinction between the two variables and no causation is implied, simply association.
• To enable the value of one variable to be predicted from any known value of the other variable. One variable is regarded as a response to the other predictor (explanatory) variable and the value of the predictor variable is used to predict what the response would be.

For the first of these, the statistical method for assessing the association between two continuous variables is known as correlation, whilst the technique for the second, prediction of one continuous variable from another, is known as regression. Correlation and regression are often presented together and it is easy to get the impression that they are inseparable. In fact, they have distinct purposes and it is relatively rare that one is genuinely interested in performing both analyses on the same set of data. However, when preparing to analyse data using either technique it is always important to construct a scatter plot of the values of the two variables against each other. By drawing a scatter plot it is possible to see whether or not there is any visual evidence of a straight line or linear association between the two variables.

This tutorial will deal with correlation, and regression will be the subject of a later tutorial.

CORRELATION
The correlation coefficient is a measure of the degree of linear association between two continuous variables, i.e. when plotted together, how close to a straight line is the scatter of points. No assumptions are made about whether the relationship between the two variables is causal, i.e. whether one variable is influencing the value of the other variable; correlation simply measures the degree to which the two vary together. A positive correlation indicates that as the values of one variable increase the values of the other variable increase, whereas a negative correlation indicates that as the values of one variable increase the values of the other variable decrease. The standard method (often ascribed to Pearson) leads to a statistic called r, Pearson's correlation coefficient. In essence r is a measure of the scatter of the points around an underlying linear trend: the closer the spread of points to a straight line the higher the value of the correlation coefficient; the greater the spread of points the smaller the correlation coefficient. Given a set of n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn), the formula for the Pearson correlation coefficient r is given by:

r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²]

FIGURE 1. Perfect positive correlation (r = +1).
FIGURE 2. Perfect negative correlation (r = –1).
FIGURE 3. No linear association (r = 0).

Certain assumptions need to be met for a correlation coefficient to be valid, as outlined in Box 1. Both x and y must be continuous random variables (and Normally distributed if the hypothesis test is to be valid).

Pearson's correlation coefficient r can only take values between –1 and +1; a value of +1 indicates perfect positive association (figure 1), a value of –1 indicates perfect negative association (figure 2), and a value of 0 indicates no linear association (figure 3).

The easiest way to check whether it is valid to calculate a correlation coefficient is to examine the scatterplot of the data. This plot should be produced as a matter of routine when correlation coefficients are calculated, as it will give a good indication of whether the relationship between the two variables is roughly linear and thus whether it is appropriate to calculate a correlation coefficient at all.

FIGURE 4. The correlation for this plot is 0.8. It is heavily influenced by the extreme cluster of four points away from the main body.
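The Pearson formula above can be sketched directly in code. The following is a minimal illustration in plain Python (the function name pearson_r is our choice, not the tutorial's):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: sum of cross-deviations scaled by the spreads of x and y."""
    n = len(xs)
    assert n == len(ys) and n >= 2
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-products, exactly as in the formula
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# A perfect straight-line relationship gives r = +1
print(pearson_r([1, 2, 3], [2, 4, 6]))  # → 1.0
```

As the text notes, r only measures how tightly the points hug a straight line; it says nothing about the slope of that line or about causation.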
SCOPE | JUNE 09 | 31
In addition, as the correlation coefficient is highly sensitive to a few abnormal values, a scatterplot will show whether this is the case, as illustrated in figures 4 and 5.

FIGURE 5. The correlation for this plot is close to 0. Whilst it is clear that the relationship is not linear and so a correlation is not appropriate, it is also clear that there is a strong n-shaped relationship between these two variables.

EXAMPLE
Consider the heights (cm) and weights (kg) of 10 elderly men:

(173, 65), (165, 57), (174, 77), (183, 89), (178, 93), (188, 73), (180, 83), (182, 86), (163, 70), (179, 82)

Plotting these data indicates that, unsurprisingly, there is a positive linear relationship between height and weight (figure 6). The shorter a person is, the lower their weight and, conversely, the taller a person is, the greater their weight. In order to examine whether there is an association between these two variables, the correlation coefficient can be calculated (table 1). In calculating the correlation coefficient, no assumptions are made about whether the relationship is causal, i.e. whether one variable is influencing the value of the other variable.

FIGURE 6. Plot of weight against height for 10 elderly men.

Thus the Pearson correlation coefficient for these data is 0.63, indicating a positive association between height and weight. When calculating the correlation coefficient it is assumed that at least one of the variables is Normally distributed. If the data do not have a Normal distribution, a non-parametric correlation coefficient, Spearman's rho (rs), can be calculated. This is calculated in the same way as the Pearson correlation coefficient, except that the data are ordered by size and given ranks (from 1 to n, where n is the total sample size) and the correlation is calculated using the ranks rather than the actual values. For the data above the Spearman correlation coefficient is 0.59 (table 2).

The square of the correlation coefficient gives the proportion of the variance of one variable explained by the other. For the example above, the square of the correlation coefficient is 0.398, indicating that about 39.8 per cent of the variance of one variable is explained by the other.

HYPOTHESIS TESTING
The null hypothesis is that the correlation coefficient is zero. However, its significance level is influenced by the number of observations, so it is worth being cautious when comparing correlations based on different sized samples. Even a very small correlation can be statistically significant if the number of observations is large. For example, with 10 observations a correlation of 0.63 is significant at the 5 per cent level, whereas with 150 observations a correlation of 0.16 is significant at the 5 per cent level. Figure 7 illustrates this.

FIGURE 7A. Correlation of 0.63, P = 0.048, n = 10.
FIGURE 7B. Correlation of 0.16, P = 0.04, n = 150.

The statistical test is based on the test statistic t = r / se(r), which under the null hypothesis follows a Student's t distribution on n − 2 degrees of freedom. The standard error of r is given by:

se(r) = √[(1 − r²) / (n − 2)]

and an approximate 95 per cent confidence interval is obtained by transforming r to Fisher's z = ½ ln[(1 + r) / (1 − r)], which has standard error 1 / √(n − 3), calculating z ± 1.96 / √(n − 3), and back-transforming the limits to the r scale.

For the Pearson correlation coefficient above, the standard error is 0.27, the t statistic is 2.30 and the P-value is 0.05.

WHEN NOT TO USE A CORRELATION COEFFICIENT
Whilst the correlation coefficient is a useful measure for summarising how two continuous variables are related, there are certain situations when it should not be calculated, as has already been alluded to above. As it measures the linear association between two variables, it should not be used when the relationship is non-linear. Where outliers are present in the data, care should be taken when interpreting its value. It should not be used when the values of one of the variables are fixed in advance, for example when measuring the responses to different doses of a drug. Causation should not be inferred from a correlation coefficient: there are many other criteria that need to be satisfied before causation can be concluded. Finally, just because two variables are correlated at a particular range of values, it should not be assumed that the same relationship holds for a different range.

SUMMARY
This tutorial has outlined how to construct the correlation coefficient between two continuous variables. However, correlation simply quantifies the degree of linear association (or not) between two variables. It is often more useful to describe the relationship between the two variables, or even predict a value of one variable for a given value of the other, and this is done using regression. If it is sensible to assume that one variable may be causing a response in the other then regression analysis should be used. If, on the other hand, there is doubt as to which variable is the causal one, it would be most sensible to use correlation to describe the relationship. Regression analysis will be covered in a subsequent tutorial.
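The test statistic and interval described under HYPOTHESIS TESTING can be sketched for the height/weight example. This is a minimal illustration in plain Python; the Fisher z construction for the interval is the standard approach and is our assumption here, as are the variable names:

```python
from math import sqrt, log, tanh

heights = [173, 165, 174, 183, 178, 188, 180, 182, 163, 179]
weights = [65, 57, 77, 89, 93, 73, 83, 86, 70, 82]

n = len(heights)
mx, my = sum(heights) / n, sum(weights) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(heights, weights))
sxx = sum((x - mx) ** 2 for x in heights)
syy = sum((y - my) ** 2 for y in weights)
r = sxy / sqrt(sxx * syy)            # ≈ 0.63

# Test statistic: t = r / se(r), on n - 2 degrees of freedom
se_r = sqrt((1 - r ** 2) / (n - 2))  # ≈ 0.27
t = r / se_r                         # ≈ 2.30

# Approximate 95% confidence interval via Fisher's z transformation (our assumption)
z = 0.5 * log((1 + r) / (1 - r))
half = 1.96 / sqrt(n - 3)
lo, hi = tanh(z - half), tanh(z + half)
print(round(r, 2), round(se_r, 2), round(t, 2))
```

With these data the interval runs from roughly 0.00 to 0.90, just excluding zero, which is consistent with a P-value of about 0.05.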


TABLE 1. Calculation of Pearson's correlation coefficient (r).

Subject | x (height, cm) | x − x̄ | (x − x̄)² | y (weight, kg) | y − ȳ | (y − ȳ)² | (x − x̄)(y − ȳ)
1     | 173  | –3.5  | 12.25  | 65  | –12.5 | 156.25  | 43.75
2     | 165  | –11.5 | 132.25 | 57  | –20.5 | 420.25  | 235.75
3     | 174  | –2.5  | 6.25   | 77  | –0.5  | 0.25    | 1.25
4     | 183  | 6.5   | 42.25  | 89  | 11.5  | 132.25  | 74.75
5     | 178  | 1.5   | 2.25   | 93  | 15.5  | 240.25  | 23.25
6     | 188  | 11.5  | 132.25 | 73  | –4.5  | 20.25   | –51.75
7     | 180  | 3.5   | 12.25  | 83  | 5.5   | 30.25   | 19.25
8     | 182  | 5.5   | 30.25  | 86  | 8.5   | 72.25   | 46.75
9     | 163  | –13.5 | 182.25 | 70  | –7.5  | 56.25   | 101.25
10    | 179  | 2.5   | 6.25   | 82  | 4.5   | 20.25   | 11.25
Total | 1765 | 0.0   | 558.50 | 775 | 0.0   | 1148.50 | 505.50

x̄ = 1765 / 10 = 176.5 cm; ȳ = 775 / 10 = 77.5 kg; r = 505.50 / √(558.50 × 1148.50) = 0.63
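Table 1's column totals, and the resulting r, can be reproduced with a few lines of plain Python (a sketch; the variable names are ours):

```python
heights = [173, 165, 174, 183, 178, 188, 180, 182, 163, 179]
weights = [65, 57, 77, 89, 93, 73, 83, 86, 70, 82]

mean_h = sum(heights) / len(heights)  # 176.5 cm
mean_w = sum(weights) / len(weights)  # 77.5 kg

# The three column totals from Table 1
ss_h = sum((h - mean_h) ** 2 for h in heights)                             # 558.50
ss_w = sum((w - mean_w) ** 2 for w in weights)                             # 1148.50
s_hw = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))  # 505.50

r = s_hw / (ss_h * ss_w) ** 0.5
print(round(r, 2))  # → 0.63
```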

TABLE 2. Calculation of Spearman's rank correlation coefficient (rs). dx and dy are the deviations of each rank from the mean rank of 5.5.

Subject | Rank x (height) | dx   | dx²   | Rank y (weight) | dy   | dy²   | dx × dy
1     | 3 (173)  | –2.5 | 6.25  | 2 (65)  | –3.5 | 12.25 | 8.75
2     | 2 (165)  | –3.5 | 12.25 | 1 (57)  | –4.5 | 20.25 | 15.75
3     | 4 (174)  | –1.5 | 2.25  | 5 (77)  | –0.5 | 0.25  | 0.75
4     | 9 (183)  | 3.5  | 12.25 | 9 (89)  | 3.5  | 12.25 | 12.25
5     | 5 (178)  | –0.5 | 0.25  | 10 (93) | 4.5  | 20.25 | –2.25
6     | 10 (188) | 4.5  | 20.25 | 4 (73)  | –1.5 | 2.25  | –6.75
7     | 7 (180)  | 1.5  | 2.25  | 7 (83)  | 1.5  | 2.25  | 2.25
8     | 8 (182)  | 2.5  | 6.25  | 8 (86)  | 2.5  | 6.25  | 6.25
9     | 1 (163)  | –4.5 | 20.25 | 3 (70)  | –2.5 | 6.25  | 11.25
10    | 6 (179)  | 0.5  | 0.25  | 6 (82)  | 0.5  | 0.25  | 0.25
Total | 55       | 0.0  | 82.50 | 55      | 0.0  | 82.50 | 48.50

Mean rank of x = 55 / 10 = 5.5; mean rank of y = 55 / 10 = 5.5; rs = 48.5 / √(82.5 × 82.5) = 0.59
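The ranking step in Table 2 can be sketched as follows: replace each value by its rank and apply the Pearson formula to the ranks. No ties occur in this example; with ties, tied values would share averaged ranks, which this minimal sketch does not handle:

```python
from math import sqrt

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

heights = [173, 165, 174, 183, 178, 188, 180, 182, 163, 179]
weights = [65, 57, 77, 89, 93, 73, 83, 86, 70, 82]

# Spearman's rho is simply Pearson's r computed on the ranks
rs = pearson_r(ranks(heights), ranks(weights))
print(round(rs, 2))  # → 0.59
```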

BOX 1: The assumptions underlying the validity of the hypothesis test associated with the correlation coefficient

1 The two variables are observed on a random sample of individuals.
2 The data for at least one of the variables should have a Normal distribution in the population.
3 For the calculation of a valid confidence interval for the correlation coefficient both variables should have a Normal distribution.
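The non-linear pitfall described under WHEN NOT TO USE A CORRELATION COEFFICIENT (and pictured in figure 5) can be demonstrated numerically. The data below are made up for illustration, not taken from the tutorial:

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# A perfect n-shaped (inverted parabola) relationship: y depends
# entirely on x, yet the *linear* association is exactly zero.
xs = [-2, -1, 0, 1, 2]
ys = [4 - x ** 2 for x in xs]  # 0, 3, 4, 3, 0
print(pearson_r(xs, ys))  # → 0.0
```

A correlation of zero therefore means no linear association, not no association at all, which is why the scatterplot should always be inspected first.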
