
Correlation:

(Bálint Tóth, Pázmány Péter Catholic University: toth.balint.pte@gmail.com, do not share without the author's permission)

In everyday speech, when we say "something is correlated with another thing", we mean that when thing A happens, thing B also happens. Or if we say something like "weight is correlated with chocolate consumption", we intuitively understand what it means: the more chocolate we eat, the heavier we become. Correlation, as a statistical term, means something extremely similar; we just have to make this intuition more formal and more precise.
Consider the following three visually represented datasets:

[Figure: scatterplot of blood pressure against income; the green dots show no pattern]

Here, the green points are individual observations (people in our study). We can see that the green dots seem to be random. We have many people with a low income but high blood pressure, we see many people with a high income and high blood pressure, and everything in between. There does not seem to be any pattern in the observations. We can intuitively say that blood pressure and income are uncorrelated. Look at the next dataset:

[Figure: scatterplot of income against years of education; the red trend line slopes upwards]
Here, we see that a pattern emerges. The higher you go in years of education, the higher the income gets. We know this because there are no people with few years of education and a very high income, but the higher we go on education, the more points we see that are high on income. This is demonstrated by the red line drawn through the cloud of points, which clearly points upwards. Income and years of education are positively correlated. This means that if one of them increases, the other is also likely to increase.

[Figure: scatterplot of income against years spent in prison; the red trend line slopes downwards]

Here, the situation is the opposite. The more time you spend in prison, the lower your income will be. Hence, income and years spent in prison are negatively correlated.
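If you would like to generate data like this yourself, here is a minimal Python sketch, assuming numpy and matplotlib are available. The variables and numbers are made up purely for illustration; they are not the data behind the figures above.

# Illustrative sketch (invented data): simulating the three patterns
# described above and plotting them as scatterplots.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 200

income = rng.normal(3000, 800, n)                      # hypothetical monthly income
blood_pressure = rng.normal(120, 15, n)                # unrelated to income
education = rng.uniform(8, 20, n)                      # years of education
income_edu = 500 * education + rng.normal(0, 2000, n)  # rises with education
prison = rng.uniform(0, 10, n)                         # years spent in prison
income_prison = 8000 - 600 * prison + rng.normal(0, 1500, n)  # falls with prison time

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(income, blood_pressure, color="green", s=10)
axes[0].set(xlabel="Income", ylabel="Blood pressure", title="No correlation")
axes[1].scatter(education, income_edu, color="green", s=10)
axes[1].set(xlabel="Years of education", ylabel="Income", title="Positive correlation")
axes[2].scatter(prison, income_prison, color="green", s=10)
axes[2].set(xlabel="Years in prison", ylabel="Income", title="Negative correlation")
plt.tight_layout()
plt.show()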

This much is clear. But how do we actually measure the extent to which two variables are correlated? Obviously, for most scientific applications, we need a number that tells us how "strong" the relationship is. Well, there is something we call the "Pearson Product-Moment Correlation Coefficient", or "Pearson's r" for short, which is the most commonly used measure of correlation between two variables. It is calculated in such a way that its value always ranges between -1 and 1. A value of -1 means that the two variables are perfectly negatively correlated (when one variable increases by a certain amount, the other decreases by a proportionate amount), 0 means that there is no correlation between the variables at all, and 1 means that the two variables are perfectly positively correlated (when one increases, the other also increases by a proportionate amount). To see how Pearson's r is calculated, we have to learn about covariance.

To understand what covariance is, let's go back to variance for a second. Remember that the variance of a variable (a set of numbers, basically) is calculated by taking each number, subtracting the mean from it, squaring the difference, summing these squared differences together, and dividing the result by the number of cases we have (N). We can represent this formula as:

\[ \mathrm{Var}(X) = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N} \]
As usual, this looks scary, but it is not. It just tells you to do the following: from each number we have (X_i, the i-th number), subtract the mean of the numbers (X-bar), square the difference, sum these numbers, and divide by N, the number of cases. Now, if you consider what (X_i - X-bar)² means mathematically, you would probably say that we multiply (X_i - X-bar) by (X_i - X-bar), since this is what squaring something means. So the above formula can be represented equally well by the following:

\[ \mathrm{Var}(X) = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(X_i - \bar{X})}{N} \]

This means exactly the same as the above formula; we just "unboxed" the square (2² is the same as 2 times 2, or 2*2). Again, this is the formula for variance, which tells us how "variable" a set of numbers is by itself. But for correlation, we are not interested in how variable a single set of numbers is; we are interested in whether two sets of numbers vary together, or in other words, covary (when one increases, the other increases, or the other way around). What we do, therefore, is change the formula a bit so that it takes into account not the variation within one variable but the joint variation in two variables (that is, how they move together, whether one increases or decreases with the other). We achieve this by switching out one of the two (X_i - X-bar) factors for (Y_i - Y-bar), where Y_i is the i-th element of another set of numbers, and Y-bar is the mean of that set of numbers:

\[ \mathrm{COV}(X, Y) = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N} \]

This new formula is called the covariance of variable X and variable Y, and is represented as COV(X, Y). We can illustrate this with a very simple toy dataset, with all the appropriate numbers calculated in advance (the mean of X is 4, the mean of Y is 6):

X    Y    X-Mean(X)    Y-Mean(Y)    (X-Mean(X))²    (Y-Mean(Y))²    (X-Mean(X))(Y-Mean(Y))
2    7    -2           1            4               1               -2
4    3    0            -3           0               9               0
6    8    2            2            4               4               4
For the covariance formula, this gives us:

[(-2)(1) + 0(-3) + 2(2)] / 3

= (-2 + 0 + 4) / 3

= 2 / 3 = 0.6667

So the covariance of the two variables is 0.6667.
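If you want to double-check this calculation on a computer, here is a minimal Python sketch, assuming only numpy is available:

# Verifying the toy covariance calculation above.
import numpy as np

X = np.array([2, 4, 6])
Y = np.array([7, 3, 8])

# Population covariance: the mean of the products of the deviations.
cov_by_hand = np.sum((X - X.mean()) * (Y - Y.mean())) / len(X)
print(round(cov_by_hand, 4))  # 0.6667, i.e. 2/3

# numpy's np.cov divides by N-1 by default; bias=True gives the
# N-denominator version used in this handout.
print(np.cov(X, Y, bias=True)[0, 1])  # also 0.6667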

Remember that we are actually looking for the correlation coefficient, not the covariance; we only looked at covariance to understand how correlation is calculated. If you got this far, there is nothing to fear, as the Pearson correlation coefficient is calculated by making a very simple adjustment to the covariance formula: we multiply N (the number of people) in the denominator by the standard deviation of X and the standard deviation of Y:

\[ r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N \, s_X \, s_Y} \]

I leave it to you as homework to verify that the standard deviations of the two variables above are 1.63 (X) and 2.16 (Y), respectively. So the formula reduces to:

[(-2)(1) + 0(-3) + 2(2)] / [3(1.63 × 2.16)]

= (-2 + 0 + 4) / 10.56

= 2 / 10.56 = 0.189

So the correlation between variables X and Y is 0.189. Remember, values larger than 0 indicate a positive correlation, and the maximum value is 1, so this means that there is a small positive correlation between the two variables.
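Again, a quick Python check, assuming numpy is available:

# Verifying the hand-calculated Pearson's r for the toy dataset.
import numpy as np

X = np.array([2, 4, 6])
Y = np.array([7, 3, 8])

# Covariance divided by the product of the standard deviations.
# np.std divides by N by default, matching the formulas in this handout.
cov = np.sum((X - X.mean()) * (Y - Y.mean())) / len(X)
r = cov / (np.std(X) * np.std(Y))
print(round(r, 3))  # 0.189

# numpy's built-in correlation matrix gives the same number:
print(np.corrcoef(X, Y)[0, 1])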

At this point, you may (and hopefully will) ask the question: "Why the hell do we have to divide by the multiplied standard deviations, if covariance already indicates how much the two variables move together? What is the difference between correlation and covariance?" The answer to this is fairly simple:

Covariance is scale-bound (ugly term, don't get scared), which means that its actual value depends on the scale of measurement. If we calculate the covariance between height and weight, with height measured in inches and weight measured in pounds, the covariance will be different from the covariance between the same two variables measured in centimetres and kilograms. Obviously, this is not very good, since in science we often have to compare results between studies and experiments, and we need a measure of the "correlation" or "moving together" of two variables that stays the same regardless of what measurements we use in the experiment, and that can be directly compared between studies/experiments. Correlation gets around this issue by dividing by the two standard deviations in the denominator. What this achieves is to express the covariance in standard deviation units. So a correlation of 0.6, for example, means that if one variable increases by one standard deviation, the other is expected to increase by 0.6 standard deviations. Since every variable has a standard deviation, this neat trick makes sure that we can directly compare results between experiments. A correlation of 0.6 will be the same regardless of what the original units of measurement were.
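Here is a small Python sketch of this point, assuming numpy is available and using made-up height/weight data: converting the units changes the covariance but leaves the correlation untouched.

# Scale-boundness of covariance vs. scale-invariance of correlation.
import numpy as np

rng = np.random.default_rng(0)
height_in = rng.normal(67, 3, 100)                  # height in inches (invented)
weight_lb = 4 * height_in + rng.normal(0, 15, 100)  # weight in pounds (invented)

height_cm = height_in * 2.54                        # same people, metric units
weight_kg = weight_lb * 0.4536

print(np.cov(height_in, weight_lb, bias=True)[0, 1])  # one covariance...
print(np.cov(height_cm, weight_kg, bias=True)[0, 1])  # ...a very different one

print(np.corrcoef(height_in, weight_lb)[0, 1])        # identical correlations
print(np.corrcoef(height_cm, weight_kg)[0, 1])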

Now, let's take a look at how a correlation analysis can be conducted in SPSS and how the results should be interpreted:

For this tutorial, we are going to be using a dataset from one of my earlier experiments, with
the variable names translated to English. I modified the data somewhat for easier exposition,
so some of what we’ll do does not make sense in the original research context, but you will
not notice. Basically, the research was aimed at finding out whether playing an educational
board game (Ticket to Ride) can be as effective as listening to a PPT presentation by our
Geography teacher. Kids were allocated to two groups (Group variable) and their test results
were measured before and after the activity they were allocated to. They also filled out a
questionnaire assessing different cognitive and emotional states during the experiment (how
happy, attentive, etc. they were).
To run the analysis, open Analyze → Correlate → Bivariate. Next, put all the variables whose correlations you want to examine into the box:
Press OK and run the analysis. Next, we’ll examine the anatomy of the resulting correlation
matrix:

The matrix contains all of our variables in both the rows and the columns. The matrix itself is essentially divided into 3×1 sections (3 rows, 1 column), one for EACH PAIR OF VARIABLES. These are marked with yellow in the picture. Each such section consists of 3 rows: the first one indicates the correlation coefficient between the two variables (at whose intersection the section is located), which is reported as "r = xxx"; the second indicates whether the correlation coefficient is significant (this is the p-value); and the third is just the number of values from which these were calculated (n).

The teal colored values on the diagonal are all 1, since each variable correlates with itself perfectly (if you put the values of a variable next to the same values, they will always match exactly).

For example, the first row (first variable) is "I felt bored during the activity". The first column is the same variable, so we ignore that and look at the second column instead (colored markings). This part of the table shows the correlation between "I felt bored during the activity" (row) and "Time flew by during the activity" (column). These variables have a correlation of .779, which indicates that they have a strong positive correlation: higher values of one variable correspond to higher values of the other. This correlation is also statistically significant (next row).

This would be reported as: there was a significant positive correlation between "Time flew by during the activity" and "I felt bored during the activity" (r = 0.779, p < 0.001).
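As an aside, if you ever need the same kind of correlation matrix outside SPSS, here is a minimal Python sketch using pandas and scipy. The variable names and numbers below are invented stand-ins for the questionnaire items, not the actual experimental data.

# A correlation matrix and a p-value, mirroring the SPSS output.
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({
    "bored":     [1, 2, 5, 4, 2, 1, 3],  # invented questionnaire scores
    "time_flew": [2, 1, 5, 4, 3, 2, 3],
    "happy":     [3, 4, 1, 2, 4, 5, 3],
})

# r for every pair of variables (the first row of each SPSS section):
print(df.corr())

# r and p for one pair (the first and second rows of an SPSS section):
r, p = pearsonr(df["bored"], df["time_flew"])
print(f"r = {r:.3f}, p = {p:.3f}")

Either way, the interpretation of the coefficients is exactly the same as in SPSS.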
