
Solved Example

Question: The marks obtained by 5 students in algebra and trigonometry are given below:

Algebra        15   16   12   10    8
Trigonometry   18   11   10   20   17

Calculate the Pearson correlation coefficient.
Solution:
Construct the following table:

x      y      x²     y²     xy
15     18     225    324    270
16     11     256    121    176
12     10     144    100    120
10     20     100    400    200
8      17     64     289    136

∑x = 61     ∑y = 76     ∑x² = 789     ∑y² = 1234     ∑xy = 902


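The formula for the Pearson correlation coefficient is:

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$

$$r = \frac{5 \times 902 - 61 \times 76}{\sqrt{[5 \times 789 - 61^2][5 \times 1234 - 76^2]}} = \frac{-126}{\sqrt{224 \times 394}}$$

r = -0.424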
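
The arithmetic above can be checked with a short script. This is a minimal sketch that applies the same raw-sums formula to the marks in the question; the function name `pearson_r` is an illustrative choice, not something from the original solution.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r via the raw-sums formula used in the worked solution."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

algebra = [15, 16, 12, 10, 8]
trigonometry = [18, 11, 10, 20, 17]

r_marks = pearson_r(algebra, trigonometry)
print(round(r_marks, 3))  # -0.424
```

The negative sign says that, in this small sample, higher algebra marks tend to go with lower trigonometry marks.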

Example Problem

The following example includes the changes we will need to make for hypothesis testing with the
correlation coefficient, as well as an example of how to do the computations. Below are the data for six
participants giving their number of years in college (X) and their subsequent yearly income (Y). Income
here is in thousands of dollars, but this fact does not require any changes in our computations. Test
whether there is a relationship with alpha = .05.

# of Years of College (X)   Income (Y)   X²     Y²      XY
0                           15           0      225     0
1                           15           1      225     15
3                           20           9      400     60
4                           25           16     625     100
4                           30           16     900     120
6                           35           36     1225    210

ΣX = 18     ΣY = 140     ΣX² = 78     ΣY² = 3600     ΣXY = 505

Notice that I have included the computation for obtaining the summary values for you for completeness.
Be sure you know how to obtain all the summed values, as they will not always be given on the exam.

Step 1: State the Hypotheses in Words and Symbols

H1: The correlation between years of education and income is not equal to zero in the population.
H0: The correlation between years of education and income is equal to zero in the population.

As usual the null states that there is no effect or no relationship, and the research hypothesis states
that there is an effect. When we write them in symbols we will use the Greek letter "rho" (ρ) to indicate
the correlation in the population. Thus:

H1: ρ ≠ 0
H0: ρ = 0

Step 2: Find the Critical Value

Again, we will use a table to find the critical value in Appendix A of your book. Locate the table, and
find the degrees of freedom for the appropriate test to find the critical value. For this test df = n - 2,
where n is the number of pairs of scores we have.

df = 6 - 2 = 4
r critical = ±0.811

Step 3: Run the Statistical Test

$$r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right]}}$$

$$r = \frac{505 - \frac{(18)(140)}{6}}{\sqrt{\left[78 - \frac{18^2}{6}\right]\left[3600 - \frac{140^2}{6}\right]}} = \frac{505 - \frac{2520}{6}}{\sqrt{\left[78 - \frac{324}{6}\right]\left[3600 - \frac{19600}{6}\right]}}$$

$$r = \frac{505 - 420}{\sqrt{[78 - 54][3600 - 3266.67]}} = \frac{85}{\sqrt{(24)(333.33)}} = \frac{85}{\sqrt{7999.92}} = \frac{85}{89.44} = .95$$

Step 4: Make a Decision about the Null

Reject the null. Since the value we computed in Step 3 (.95) is larger than the critical value in Step 2
(.811), we reject the null.

Step 5: Write a Conclusion

There is a relationship between years spent in college and income. The more years of school, the more
the subsequent income.

r²

Often we will square the r-value we compute in order to get a measure of the size of the effect. Just like
with eta-square in ANOVA, we will compute the percentage of variability in Y that is accounted for by X.
For the current example r² = .90, so 90% of the variability in income is accounted for by education.
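
The five steps can be traced in code. The sketch below recomputes r from the six data pairs, compares it with the tabled critical value of 0.811 quoted in Step 2 (hard-coded here, not derived), and squares it for the effect size. Variable names such as `sp_xy` and `ss_x` are illustrative.

```python
from math import sqrt

years = [0, 1, 3, 4, 4, 6]            # X: years of college
income = [15, 15, 20, 25, 30, 35]     # Y: income in thousands of dollars
n = len(years)

# Sum of cross-products and sums of squares, each corrected by the means
sp_xy = sum(x * y for x, y in zip(years, income)) - sum(years) * sum(income) / n
ss_x = sum(x * x for x in years) - sum(years) ** 2 / n
ss_y = sum(y * y for y in income) - sum(income) ** 2 / n

r = sp_xy / sqrt(ss_x * ss_y)

r_critical = 0.811                    # tabled value for df = 4, alpha = .05 (from Step 2)
reject_null = abs(r) > r_critical     # two-tailed decision rule

print(round(r, 2), round(r * r, 2), reject_null)  # 0.95 0.9 True
```

Comparing the absolute value of r with the critical value mirrors the two-tailed hypothesis H1: ρ ≠ 0.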

For example, you could correlate a person's age with their blood sugar levels. Here, the units
are completely different; age is measured in years and blood sugar level is measured in mmol/L
(a measure of concentration).

It is also important to realize that the v

An example of calculating the Pearson coefficient:


Find the value of the correlation coefficient from the following
table:

Age and Glucose levels of 6 subjects

We'll calculate the value of r using the formula mentioned above.

To use that formula we need to compute Σ(X*Y), Σ(X), Σ(Y), Σ(X²), Σ(Y²).

The table below shows the computed values of all the summations mentioned above.
From our table we get:

• Σ(X) = 247
• Σ(Y) = 486
• Σ(X*Y) = 20,485
• Σ(X²) = 11,409
• Σ(Y²) = 40,022
• n is the sample size, in our case 6

$$r = \frac{6(20{,}485) - (247 \times 486)}{\sqrt{[6(11{,}409) - 247^2][6(40{,}022) - 486^2]}}$$

r = 0.5298.

The range of the correlation coefficient is from -1 to +1. Our result is 0.5298, which means
the variables have a moderate positive correlation.
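
When only the summary totals are available, as here, r can be computed from them directly. The following is a small sketch under that assumption; `pearson_from_sums` is a hypothetical helper name, and the inputs are the six totals listed above.

```python
from math import sqrt

def pearson_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson r from the five summations and the sample size."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r_glucose = pearson_from_sums(
    n=6, sum_x=247, sum_y=486, sum_xy=20485, sum_x2=11409, sum_y2=40022
)
print(round(r_glucose, 4))  # 0.5298
```

Working from totals avoids retyping the raw data, but any error in a single total propagates straight into r, so it is worth recomputing the sums when the raw table is at hand.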

What is Spearman Rank Correlation / Spearman's Rho?
The Spearman rank correlation coefficient, rs, is the nonparametric version of the Pearson correlation
coefficient. Your data must be ordinal, interval or ratio. Spearman’s returns a value from -1 to 1, where:
+1 = a perfect positive correlation between ranks
-1 = a perfect negative correlation between ranks
0 = no correlation between ranks.
Contents:

1. No Tied ranks example.


2. What to do with tied ranks.

Spearman Rank Correlation: Worked Example (No Tied Ranks)
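The formula for the Spearman rank correlation coefficient when there are no tied ranks is:

$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where d is the difference between the two ranks of each observation and n is the number of observations.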
Example Question:
The scores for nine students in physics and math are as follows:
 Physics: 35, 23, 47, 17, 10, 43, 9, 6, 28
 Mathematics: 30, 33, 45, 23, 8, 49, 12, 4, 31
Compute the students' ranks in the two subjects and compute the Spearman rank correlation.

Step 1: Find the ranks for each individual subject. I used the Excel rank function to find the ranks. If you
want to rank by hand, order the scores from greatest to smallest; assign the rank 1 to the highest score, 2
to the next highest and so on:

Step 2: Add a third column, d, to your data. The d is the difference between ranks. For example, the first
student's physics rank is 3 and math rank is 5, so d = 3 - 5 = -2.

Step 3: In a fourth column, square your d values.

Step 4: Sum (add up) all of your d-squared values.

4 + 4 + 1 + 0 + 1 + 1 + 1 + 0 + 0 = 12. You'll need this for the formula (the Σd² is just "the sum of
d-squared values").
Step 5: Insert the values into the formula. These ranks are not tied, so use the first formula:

rs = 1 - (6 × 12)/(9(81 - 1))
   = 1 - 72/720
   = 1 - 0.1
   = 0.9

The Spearman Rank Correlation for this set of data is 0.9.
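
The same steps can be sketched in code. The example below ranks each subject from highest to lowest, takes the rank differences, and applies the no-ties formula; the helper names are illustrative, and the ranking function assumes there are no tied scores.

```python
def rank_descending(values):
    """Rank 1 = largest value. Assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_no_ties(xs, ys):
    """Spearman's rs via 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(xs)
    rank_x, rank_y = rank_descending(xs), rank_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

physics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
maths = [30, 33, 45, 23, 8, 49, 12, 4, 31]

print(rank_descending(physics))  # [3, 5, 1, 6, 7, 2, 8, 9, 4]
print(rank_descending(maths))    # [5, 3, 2, 6, 8, 1, 7, 9, 4]

rs_scores = spearman_no_ties(physics, maths)
print(round(rs_scores, 2))       # 0.9
```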
Spearman Rank Correlation: What to do with Tied Ranks
Tied ranks are where two items in a column have the same rank. Let’s say two items in the above
example tied for ranks 5 and 6. The following image shows each tied data point assigned a mean rank of
5.5:

When this happens, you have a couple of options. You could still use the simpler formula above *if* you
only have one or two tied ranks here and there. The image above shows the workings for the ties and
the d-squared values you'll need to input into the simple version of the formula. However, that
option may leave you with little confidence in any p-values you produce (Kinnear and Gray, 1999). A better
option may be to calculate correlation with another method, like Kendall's Tau.
Another option is simply to use the full version of Spearman's formula (actually a slightly
modified Pearson's r), which will deal with tied ranks:

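$$r_s = \frac{\sum_i \left(R(x_i) - \overline{R(x)}\right)\left(R(y_i) - \overline{R(y)}\right)}{\sqrt{\sum_i \left(R(x_i) - \overline{R(x)}\right)^2 \sum_i \left(R(y_i) - \overline{R(y)}\right)^2}}$$

Full Spearman's r formula (Clef, 2013, p. 4)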

Where:

• R(x) and R(y) are the ranks,
• R(x)bar and R(y)bar are the mean ranks.
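
A minimal sketch of the tied-ranks approach follows: tied values receive the mean of the rank positions they occupy, and Pearson's r is then computed on those ranks. The function names and the four-point data set are hypothetical, chosen only to show one tie.

```python
from math import sqrt

def mean_ranks(values):
    """Rank 1 = largest value; tied values share the mean of their positions."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # first position occupied by v
        count = ordered.count(v)                # number of values tied with v
        ranks.append(first + (count - 1) / 2)   # mean of first .. first+count-1
    return ranks

def spearman_full(xs, ys):
    """Pearson's r computed on mean ranks, which copes with ties."""
    rank_x, rank_y = mean_ranks(xs), mean_ranks(ys)
    n = len(rank_x)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: the two 20s tie for positions 2 and 3
xs = [10, 20, 20, 30]
ys = [1, 2, 3, 4]

print(mean_ranks(xs))                    # [4.0, 2.5, 2.5, 1.0]
print(round(spearman_full(xs, ys), 4))   # 0.9487
```

With no ties this gives the same value as the simple d-squared formula; with ties the two can differ, which is why the full version is the safer choice.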

Spearman Correlation - Example

A sample of 1,000 companies were asked about their number of employees and their
revenue over 2018. For making these questions easier, they were offered answer
categories. After completing the data collection, the contingency table below shows the
results.

The question we'd like to answer is: is company size related to revenue? A good look
at our contingency table shows the obvious: companies having more employees
typically make more revenue. But note that this relation is not perfect: there are 60
companies with 1 employee making $50,000 - $99,999 while there are 89 companies with
2-5 employees making $0 - $49,999. This relation becomes clear if we visualize our
results in the chart below.

The chart shows an indisputable positive monotonic relation between size and
revenue: larger companies tend to make more revenue than smaller companies. Next
question: how strong is the relation? The first option that comes to mind is computing
the Pearson correlation between company size and revenue. However, that's not going
to work because we don't have company size or revenue in our data. We only have size
and revenue categories. Company size and revenue are ordinal variables in our data:
we know that 2-5 employees is larger than 1 employee but we don't know how
much larger.
So which numbers can we use to calculate how strongly ordinal variables are related?
Well, we can assign ranks to our categories as shown below.

As a last step, we simply compute the Pearson correlation between the size and
revenue ranks. This results in a Spearman rank correlation (Rs) = 0.81. This tells us
that our variables are strongly monotonically related. But in contrast to a normal
Pearson correlation, we do not know if the relation is linear to any extent.

Spearman Rank Correlation - Basic Properties


Like we just saw, a Spearman correlation is simply a Pearson correlation computed on
ranks instead of data values or categories. This results in the following basic properties:

• Spearman correlations are always between -1 and +1;
• Spearman correlations are suitable for all but nominal variables. However, when both
  variables are either metric or dichotomous, Pearson correlations are usually the better
  choice;
• Spearman correlations indicate monotonic, rather than linear, relations;
• Spearman correlations are hardly affected by outliers. However, outliers should be
  excluded from analyses instead of determining whether Spearman or Pearson correlations
  are preferable;
• Spearman correlations serve the exact same purposes as Kendall's tau.

Spearman Rank Correlation - Assumptions

• The Spearman correlation itself only assumes that both variables are at least ordinal
  variables. This excludes only nominal variables.
• The statistical significance test for a Spearman correlation assumes independent
  observations or, more precisely, independent and identically distributed variables.

Spearman Correlation - Example II


A company needs to determine the expiration date for milk. They therefore take a tiny
drop each hour and analyze the number of bacteria it contains. The results are shown
below.

For bacteria versus time,

• the Pearson correlation is 0.58 but
• the Spearman correlation is 1.00.

There is a perfect monotonic relation between time and bacteria: with each hour
passed, the number of bacteria grows. However, the relation is very nonlinear, as
shown by the Pearson correlation.
This example nicely illustrates the difference between these correlations. However, I'd
argue against reporting a Spearman correlation here. Instead, model this curvilinear
relation with a (probably exponential) function. This'll probably predict the number of
bacteria with pinpoint precision.
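
The contrast between the two coefficients is easy to reproduce. The sketch below uses hypothetical counts that double every hour (the original measurements are not reproduced here, so the Pearson value will not match the 0.58 quoted above) and computes Pearson's r on the raw values and on their ranks.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Rank 1 = smallest value. Assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

hours = list(range(1, 11))
bacteria = [2 ** h for h in hours]    # hypothetical counts, doubling each hour

pearson_raw = pearson(hours, bacteria)                 # on the raw values
spearman = pearson(ranks(hours), ranks(bacteria))      # on the ranks

print(round(pearson_raw, 2), round(spearman, 2))
```

Because the counts rise every hour, the ranks agree exactly and the Spearman value is 1, while the Pearson value on the raw counts stays clearly below 1.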

Spearman Correlation - Formulas and Calculation


First off, an example calculation, exact significance levels and critical values are given
in this Googlesheet (shown below).

Right. Now, computing Spearman’s rank correlation always starts off with replacing
scores by their ranks (use mean ranks for ties). Spearman’s correlation is now
computed as the Pearson correlation over the (mean) ranks.

Alternatively, compute Spearman correlations with

$$R_s = 1 - \frac{6\sum D^2}{n^3 - n}$$

where D denotes the difference between the 2 ranks for each observation.

For reasonable sample sizes of N ≥ 30, the (approximate) statistical significance uses the t distribution.
In this case, the test statistic

$$T = R_s\sqrt{\frac{N - 2}{1 - R_s^2}}$$

follows a t-distribution with

$$Df = N - 2$$

degrees of freedom.
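
As a rough illustration, the test statistic and its degrees of freedom can be computed as follows. The function name `spearman_t` and the input values (Rs = 0.5, N = 30) are hypothetical; turning T into a p-value would additionally need a t-distribution table or a statistics library, which this sketch leaves out.

```python
from math import sqrt

def spearman_t(rs, n):
    """Return (T, Df) for the t approximation; requires -1 < rs < 1."""
    df = n - 2
    t = rs * sqrt(df / (1 - rs ** 2))
    return t, df

# Hypothetical inputs chosen to satisfy N >= 30
t_value, df = spearman_t(0.5, 30)
print(round(t_value, 3), df)  # 3.055 28
```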

This approximation is inaccurate for smaller sample sizes of N < 30. In this case, look up the (exact)
significance level from the table given in this Googlesheet. These exact p-values are based on a
permutation test that we may discuss some other time. Or not.

Spearman Rank Correlation - Software


Spearman correlations can be computed in Googlesheets or Excel but statistical
software is a much easier option. JASP -which is freely downloadable- comes up with
the correct Spearman correlation and its significance level as shown below.

SPSS also comes up with the correct correlation. However, its significance level is
based on the t-distribution:

$$t = 0.77\sqrt{\frac{4}{1 - 0.77^2}} = 2.42$$

and

t(4) = 2.42, p = 0.072

Again, this approximation is only accurate for larger sample sizes of N ≥ 30. For N = 6, it is wildly off as
shown below.

Example: The hypothesis tested is that prices should decrease with distance from the key
area of gentrification surrounding the Contemporary Art Museum. The line followed is
Transect 2 in the map below, with continuous sampling of the price of a 50cl bottle of water at
every convenience store.

Map to show the location of environmental gradients for transect lines in El Raval,
Barcelona
 
Hypothesis
We might expect to find that the price of a bottle of water decreases as distance from the
Contemporary Art Museum increases. Higher property rents close to the museum should be
reflected in higher prices in the shops.
The hypothesis might be written like this:
The price of a convenience item decreases as distance from the Contemporary Art Museum
increases.
The more objective scientific research method is always to assume that no such price-
distance relationship exists and to express the null hypothesis as:
there is no significant relationship between the price of a convenience item and
distance from the Contemporary Art Museum.
What can go wrong?
Having decided upon the wording of the hypothesis, you should consider whether there are
any other factors that may influence the study. Some factors that may influence prices
include:
• The type of retail outlet. You must be consistent in your choice of retail outlet. For
  example, bars and restaurants often charge significantly more for water than a
  convenience store. You should decide which type of outlet to use and stick with it
  for all your data collection.
• Some shops have different prices for the same item: a high tourist and lower local
  price, dependent upon the shopkeeper's perception of the customer.
• Shops near main roads may charge more than shops in less accessible back
  streets, due to the higher rents demanded for main road retail sites.
• The positive spread effects from other nearby areas of gentrification or from
  competing areas of tourist attraction.
• The negative spread effects from nearby areas of urban decay.
• Higher prices may be charged during the summer when demand is less flexible,
  making seasonal comparisons less reliable.
• Cumulative sampling may distort the expected price-distance gradient if several
  shops cluster within a short area along the transect line followed by a
  considerable gap before the next group of retail outlets.
You should mention such factors in your investigation.
Data collected (see data table below) suggests a fairly strong negative relationship as
shown in this scatter graph:
Scatter graph to show the change in the price of a convenience item with distance
from the Contemporary Art Museum.
The scatter graph shows the possibility of a negative correlation between the two variables
and the Spearman's rank correlation technique should be used to see if there is indeed a
correlation, and to test the strength of the relationship.

Spearman’s Rank correlation coefficient


A correlation can easily be drawn as a scatter graph, but the most precise way to compare
several pairs of data is to use a statistical test - this establishes whether the correlation is
really significant or if it could have been the result of chance alone.
Spearman’s Rank correlation coefficient is a technique which can be used to summarise the
strength and direction (negative or positive) of a relationship between two variables.
The result will always be between 1 and minus 1.

Method - calculating the coefficient

• Create a table from your data.
• Rank the two data sets. Ranking is achieved by giving the ranking '1' to the
  biggest number in a column, '2' to the second biggest value and so on. The
  smallest value in the column will get the lowest ranking. This should be done for
  both sets of measurements.
• Tied scores are given the mean (average) rank. For example, the three tied
  scores of 1 euro in the example below are ranked fifth in order of price, but
  occupy three positions (fifth, sixth and seventh) in a ranking hierarchy of ten. The
  mean rank in this case is calculated as (5+6+7) ÷ 3 = 6.
• Find the difference in the ranks (d): This is the difference between the ranks of
  the two values on each row of the table. The rank of the second value (price) is
  subtracted from the rank of the first (distance from the museum).
• Square the differences (d²) to remove negative values, and then sum them (Σd²).
Convenience   Distance from   Rank       Price of 50cl   Rank    Difference between   d²
Store         CAM (m)         distance   bottle (€)      price   ranks (d)

1             50              10         1.80            2       8                    64
2             175             9          1.20            3.5     5.5                  30.25
3             270             8          2.00            1       7                    49
4             375             7          1.00            6       1                    1
5             425             6          1.00            6       0                    0
6             580             5          1.20            3.5     1.5                  2.25
7             710             4          0.80            9       -5                   25
8             790             3          0.60            10      -7                   49
9             890             2          1.00            6       -4                   16
10            980             1          0.85            8       -7                   49

Σd² = 285.5

Data Table: Spearman's Rank Correlation

• Calculate the coefficient (Rs) using the formula below. The answer will always be
  between 1.0 (a perfect positive correlation) and -1.0 (a perfect negative
  correlation).

When written in mathematical notation the Spearman Rank formula looks like this:

$$R_s = 1 - \frac{6\sum d^2}{n^3 - n}$$

Now to put all these values into the formula.

• Find the sum of the d² values by adding up all the values in the d² column. In our
  example this is 285.5. Multiplying this by 6 gives 1713.
• Now for the bottom line of the equation. The value n is the number of sites at
  which you took measurements. In our example this is 10. Substituting this
  value into n³ - n we get 1000 - 10 = 990.
• We now have the formula: Rs = 1 - (1713/990), which gives a value for Rs of
  1 - 1.73 = -0.73
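
The whole method can be traced in a short script using the distances and prices from the data table. This is a sketch of the procedure described above (rank 1 for the biggest value, mean ranks for ties, then the simple formula); the helper name `mean_ranks_desc` is illustrative.

```python
def mean_ranks_desc(values):
    """Rank 1 = biggest value; tied values share the mean of their positions."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]

distance = [50, 175, 270, 375, 425, 580, 710, 790, 890, 980]          # metres from CAM
price = [1.80, 1.20, 2.00, 1.00, 1.00, 1.20, 0.80, 0.60, 1.00, 0.85]  # euros

rank_distance = mean_ranks_desc(distance)
rank_price = mean_ranks_desc(price)

# d = rank of distance minus rank of price, squared and summed
sum_d2 = sum((rd - rp) ** 2 for rd, rp in zip(rank_distance, rank_price))

n = len(distance)
rs_water = 1 - 6 * sum_d2 / (n ** 3 - n)

print(rank_price)                   # [2.0, 3.5, 1.0, 6.0, 6.0, 3.5, 9.0, 10.0, 6.0, 8.0]
print(sum_d2, round(rs_water, 2))   # 285.5 -0.73
```

Because the prices contain ties, computing Pearson's r on the mean ranks instead of using the simple formula may give a slightly different value.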

What does this Rs value of -0.73 mean?


The closer Rs is to +1 or -1, the stronger the likely correlation. A perfect positive correlation
is +1 and a perfect negative correlation is -1. The Rs value of -0.73 suggests a fairly strong
negative relationship.

A further technique is now required to test the significance of the relationship.


The Rs value of -0.73 must be looked up on the Spearman Rank significance table below as
follows:
• Work out the 'degrees of freedom' you need to use. This is the number of pairs in
  your sample minus 2 (n-2). In the example it is 8 (10 - 2).
• Now plot your result on the table, using the size of Rs and ignoring its sign.
• If it is below the line marked 5%, then it is possible your result was the product of
  chance and you cannot reject the null hypothesis.
• If it is above the 0.1% significance level, then we can be 99.9% confident the
  correlation has not occurred by chance.
• If it is above 1%, but below 0.1%, you can say you are 99% confident.
• If it is above 5%, but below 1%, you can say you are 95% confident (i.e.
  statistically there is a 5% likelihood the result occurred by chance).

In the example, the value 0.73 gives a significance level of slightly less than 5%. That
means that, if there were really no relationship between price and distance, a correlation
this strong would be expected to arise by chance in fewer than about 5 samples in 100. The
null hypothesis can therefore be rejected at the 5% significance level.

Graph of significance levels for Spearman's Rank correlation coefficients using Student's t distribution

• The fact that two variables correlate cannot prove anything - only further research can
  actually prove that one thing affects the other.
• Data reliability is related to the size of the sample. The more data you collect, the
  more reliable your result.
What values can the Spearman correlation coefficient, rs, take?
The Spearman correlation coefficient, rs, can take values from +1 to -1. An rs of +1
indicates a perfect positive association of ranks, an rs of zero indicates no association between
ranks and an rs of -1 indicates a perfect negative association of ranks. The closer rs is to
zero, the weaker the association between the ranks.

An example of calculating Spearman's correlation


To calculate a Spearman rank-order correlation on data without any ties we will use the
following data:

Marks

English   56   75   45   71   62   64   58   80   76   61
Maths     66   70   40   60   65   56   59   77   67   63

We then complete the following table:

English (mark)   Maths (mark)   Rank (English)   Rank (maths)   d    d²

56               66             9                4              5    25
75               70             3                2              1    1
45               40             10               10             0    0
71               60             4                7              3    9
62               65             6                5              1    1
64               56             5                9              4    16
58               59             8                8              0    0
80               77             1                1              0    0
76               67             2                3              1    1
61               63             7                6              1    1

Where d = difference between ranks and d² = difference squared.

We then calculate the following:

$$\sum d^2 = 25 + 1 + 0 + 9 + 1 + 16 + 0 + 0 + 1 + 1 = 54$$

We then substitute this into the main equation with the other information as follows:

$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 54}{10(10^2 - 1)} = 1 - \frac{324}{990} = 0.67$$

as n = 10. Hence, we have a ρ (or rs) of 0.67. This indicates a strong positive
relationship between the ranks individuals obtained in the maths and English exam.
That is, the higher you ranked in maths, the higher you ranked in English also, and vice
versa.

How do you report a Spearman's correlation?
How you report a Spearman's correlation coefficient depends on whether or not you
have determined the statistical significance of the coefficient. If you have simply run the
Spearman correlation without any statistical significance tests, you are able to simply
state the value of the coefficient as shown below:

However, if you have also run statistical significance tests, you need to include some
more information as shown below:

where df = N – 2, where N = number of pairwise cases.

How do you express the null hypothesis for this test?


The general form of a null hypothesis for a Spearman correlation is:

H0: There is no [monotonic] association between the two variables [in the population].

Remember, you are making an inference from your sample to the population that the
sample is supposed to represent. However, as this is a general understanding of
an inferential statistical test, it is often not included. A null hypothesis statement for the
example used earlier in this guide would be:
H0: There is no [monotonic] association between maths and English marks.

How do I interpret a statistically significant Spearman correlation?
It is important to realize that statistical significance does not indicate the strength of
Spearman's correlation. In fact, the statistical significance testing of the Spearman
correlation does not provide you with any information about the strength of the
relationship. Thus, achieving a value of p = 0.001, for example, does not mean that the
relationship is stronger than if you achieved a value of p = 0.04. This is because the
significance test is investigating whether you can reject or fail to reject the null
hypothesis. If you set α = 0.05, achieving a statistically significant Spearman rank-order
correlation means that you can be sure that there is less than a 5% chance that the
strength of the relationship you found (your ρ coefficient) happened by chance if the null
hypothesis were true.
