R Help Sheet 6: Correlation and Regression

Summary

How to look for relationships between continuous variables using correlation and regression

Functions

cor.test, lm, abline

Introduction

This help sheet covers correlation and regression. The main difference between these approaches is
the issue of causality: correlation does not examine causality and simply describes whether and how
a change in one variable is related to another. For example, we might ask how CO2 levels relate to
the air temperature, without implying that CO2 drives temperature changes or vice-versa; we simply
want to know “does temperature increase (or decrease) as CO2 increases?” We get this information
from the p-value of the correlation test, which indicates the evidence for whether a correlation is
there or not. We can also ask the question “how strong is the correlation between temperature and
CO2?”, which indicates how closely they are associated (see below).

In contrast, we could also model temperature according to CO2 levels using linear regression. Here,
we assume temperature responds to changes to CO2 levels. This allows us to say how much
temperature increase we would expect from a given CO2 increase; in other words, we can use CO2
levels to predict temperature.

Let’s start by loading the in-built cars dataset:
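The original command is not reproduced in this copy of the sheet; a minimal sketch, assuming the standard cars dataset that ships with base R, would be:

```r
# cars is part of base R (the "datasets" package), so nothing needs installing
data(cars)

# Check the structure: 50 rows, with columns speed and dist
str(cars)
```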

This gives the speed and stopping distance (dist) of cars. We’ll start by having a look at the data:
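One way to have a look, sketched here with base R functions (the sheet's own command is not shown in this copy, so treat this as an assumption about what was run):

```r
# First few rows of the data
head(cars)

# Scatter plot of stopping distance against speed
plot(dist ~ speed, data = cars)
```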

BIO2426 Analysis of Biological Data Page 1 of 5

[Figure: scatter plot of stopping distance (dist, y-axis, 0 to 120) against speed (x-axis, 5 to 25)]

Looks like stopping distance is pretty closely related to speed; let’s test this using Pearson’s
correlation:
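A sketch of the test using cor.test (the object name pearson is our own choice for illustration; the sheet may simply have printed the result directly):

```r
# Pearson's correlation between stopping distance and speed
pearson <- cor.test(cars$dist, cars$speed, method = "pearson")
pearson

# The individual pieces can be pulled out of the result if needed
pearson$estimate   # correlation statistic (cor)
pearson$statistic  # t value
pearson$parameter  # degrees of freedom
pearson$p.value    # p-value
```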

Note the method="pearson" bit, which tells R to use Pearson’s correlation. Remember also that
the order of dist and speed doesn’t actually matter here as we aren’t assuming any causality.

We can see that there is strong evidence for a correlation (t = 9.46, df = 48, p = 1.5 x 10^-12). We can
also see that the correlation is strong and positive; distance and speed are closely related (the
correlation statistic, cor, is 0.81). The strength of correlation can vary from 1 (perfectly positively
correlated; dist and speed fall on a perfect straight line, with no scatter) through 0 (correlation so
weak that dist and speed are virtually unrelated) to -1 (dist and speed are perfectly negatively
correlated: stopping distance decreases as speed goes up, and vice-versa). Remember that it is
possible to get a significant weak correlation or a non-significant strong correlation; there is a
difference between the evidence for the correlation and the strength of that correlation.

Spearman’s Rank Correlation (non-parametric)

Let’s say we discover that one (or both) of our variables isn’t normally distributed; what do we do
then? If we can’t transform the data to normalise it (help sheet 8), we need to use a non-parametric
alternative. The most common option here is Spearman’s rank correlation, which ranks both
variables separately and then sees if, for example, the cars with the highest speed tended to also
have high stopping distances:
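A sketch of the Spearman version follows. The object name spearman and the exact = FALSE argument are our additions: the cars data contain tied values, so R cannot compute an exact p-value and would otherwise print a warning before falling back to the same approximation.

```r
# Spearman's rank correlation between stopping distance and speed
# exact = FALSE asks for the approximate p-value up front, avoiding the ties warning
spearman <- cor.test(cars$dist, cars$speed, method = "spearman", exact = FALSE)
spearman

spearman$estimate  # rho, the strength of the rank correlation
spearman$p.value   # p-value
```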

The Spearman test also supports a correlation between speed and stopping distance (S = 3532, n = 50,
p = 8.83 x 10^-14; note that we have to use the sample size n here instead of quoting the degrees of
freedom, because the test doesn’t estimate any parameters). The estimate of the strength of the
correlation, rho, is similar to that estimated in the Pearson correlation (0.83, indicating a strong
positive correlation).

Linear Regression

We know that stopping distance is positively correlated with the speed of the car: but how much
does stopping distance increase for every extra mph of speed? To predict this, we need to use linear
regression:
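A minimal sketch of the fit, assuming the usual formula interface to lm with dist as the response and speed as the predictor:

```r
# Fit a straight line: stopping distance modelled as a function of speed
model <- lm(dist ~ speed, data = cars)

# Printing the model shows the intercept and the gradient (labelled "speed")
model

# The same two numbers as a named vector
coef(model)
```

Unlike in the correlation test, the order matters here: the variable to the left of the ~ is the one being predicted.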

Our linear model (lm, assigned to the object model) predicts that for every 1 mph increase in speed,
the stopping distance increases by around 3.9 feet. It also predicts a stopping distance of -17.5 feet
at a speed of 0 mph, as indicated by the intercept with the y-axis; not a particularly sensible
prediction! To make this clearer, let’s check what our modelled relationship looks like on a graph.
Plot the graph from earlier, this time using ylim to extend the y-axis between -20 and 120 and xlim to
extend the x-axis between 0 and 25 (see help sheet 9 for more on graphical functions). Next, use
abline (a function for plotting a straight line with intercept a and gradient b) to put a line on based
on our model:
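These two steps might look like the following sketch (the model is refitted here so the block runs on its own):

```r
# Refit the model so this block is self-contained
model <- lm(dist ~ speed, data = cars)

# Scatter plot with both axes extended so the intercept at speed = 0 is visible
plot(dist ~ speed, data = cars, xlim = c(0, 25), ylim = c(-20, 120))

# abline accepts a fitted lm object and uses its intercept and gradient
abline(model)
```

Passing the model to abline is equivalent to typing the two coefficients in by hand, for example abline(a = -17.5791, b = 3.9324).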

[Figure: scatter plot of stopping distance (dist, y-axis, -20 to 120) against speed (x-axis, 0 to 25), with the fitted regression line]

We can see that our line crosses x=0 (the y-axis) just above -20 (at -17.5 feet), and for every 10 mph
increase we get an increase in stopping distance of around 40 feet (10 x 3.9 = 39 feet).

But to know if this fitted relationship is actually any good, we need to test the model to see if it
explains a significant amount of variation. This is done using the summary command:
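A sketch of the call (model_summary is a name we've chosen for illustration; typing summary(model) on its own prints the same output):

```r
# Refit the model so this block is self-contained
model <- lm(dist ~ speed, data = cars)

# Full summary: residuals, coefficients, R-squared, F-statistic and p-value
model_summary <- summary(model)
model_summary

# The F-statistic with its two degrees of freedom (treatment and error)
model_summary$fstatistic
```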

Lots of information! However, there isn’t that much here that you haven’t met before. Let’s start at
the bottom first. The test statistic is the F-statistic (89.57). The p-value is 1.49 x 10^-12. Here, we
have two degrees of freedom: 1 for the line and 48 for the data*. This means we would report our
result like so: There was a significant increase in stopping distance with speed (F = 89.57, df = 1,48,
p = 1.49 x 10^-12). Another way of saying this is that fitting our relationship between stopping
distance and speed provides a significantly better explanation of the stopping distance than just
using a mean stopping distance (intercept), with no relationship between stopping distance and speed.

How well does our model explain the variation in stopping distance? To test this, we can look at our
R-squared value, which tells us what proportion of the variation in the data is explained by our
model. Here, R-squared = 0.65 or 65%, so we’re doing a decent job of explaining stopping distance,
but there’s still quite a bit (35%) of unexplained variation in the data. This can be seen in the scatter
around the line in our graph above; if the points lined up perfectly on the line, our R-squared value
would be 100%. The adjusted R-squared value accounts for the fact that the more complicated you
make your model, the more of the data you can explain. We’ll be using pretty simple models though,
so it doesn’t really matter which R-squared value you choose to use.
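If you want the R-squared values as numbers rather than reading them off the printed summary, they can be extracted directly, as in this sketch:

```r
# Refit the model so this block is self-contained
model <- lm(dist ~ speed, data = cars)

# Proportion of the variation in dist explained by the model
r2 <- summary(model)$r.squared
r2

# Adjusted version, which penalises extra model terms
adj_r2 <- summary(model)$adj.r.squared
adj_r2
```

With a single predictor, R-squared is simply the square of the Pearson correlation between dist and speed.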

The table of Coefficients just tells you the same information we looked at earlier (estimates of the
intercept and gradient of the model), together with information on how accurate those estimates
are. You don’t need to worry about the information on the residuals too much; this just tells you a
bit more about the scatter around the line.

We can use our modelled relationship to predict stopping distances based on speed. For example,
our model predicts that the stopping distance at 150 mph would be as follows:

= -17.5791 + 3.9324 * 150

= 572.3 feet, or nearly 175 metres – about one and a half football pitches! However, here, we’re
extrapolating beyond the range of our data, which is often ill-advised: we didn’t measure the
stopping distances of cars at speeds any higher than 25mph.
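The same calculation can be done without typing in the coefficients by hand, using predict. This is a sketch rather than the sheet's own method; the name predicted is ours:

```r
# Refit the model so this block is self-contained
model <- lm(dist ~ speed, data = cars)

# Predicted stopping distance (feet) at 150 mph
# newdata must be a data frame with a column named like the predictor
predicted <- predict(model, newdata = data.frame(speed = 150))
predicted

# Compare against the speeds actually measured to spot extrapolation
range(cars$speed)
```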

Linear regression makes a number of assumptions about the data, and isn’t valid if these
assumptions aren’t met - see help sheet 8 for details.

*The degrees of freedom thing is a little complex, but is to do with the way the test is being done. We’ve fitted
a line with two parameters: an intercept (-17.5 feet at speed = 0 mph) and a gradient (3.9 feet for every mph
increase in speed). We’re comparing this line to the null hypothesis of no relationship between stopping
distance and speed; this hypothesis explains stopping distance using only a single parameter, a mean stopping
distance, which doesn’t vary with speed. So our more complicated model, with df=2 (intercept and gradient), is
being compared to the simpler null model, with df=1 (intercept, but no gradient). So our treatment degrees of
freedom = 2-1 = 1. Since we have 50 datapoints, and have fitted two parameters (intercept and gradient), we’re
left with 50-2 = 48 freely varying datapoints: so our error degrees of freedom = 48.
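The comparison described in this footnote can be made explicit by fitting the null model yourself and comparing the two with anova. This is an illustrative sketch; null_model and comparison are names we've made up:

```r
# Null model: a single parameter, the mean stopping distance (intercept only)
null_model <- lm(dist ~ 1, data = cars)

# Our fitted model: two parameters, intercept and gradient
model <- lm(dist ~ speed, data = cars)

# Compare the two; the table shows the residual df of each model
# and the df, F and p-value for the extra parameter
comparison <- anova(null_model, model)
comparison
```

The second row of the table should match the F-statistic and degrees of freedom reported by summary(model).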