Correlation Stats Example Jan.7
Trigonometry 18 11 10 20 17
Calculate the Pearson correlation coefficient.
Solution:
Construct the following table:
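x y x² y² xy
8 17 64 289 136
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$
$$r = \frac{5 \times 902 - 61 \times 76}{\sqrt{[5 \times 789 - (61)^2][5 \times 1234 - (76)^2]}}$$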
r = -0.424
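The substitution above can be checked with a short Python sketch. The function name `pearson_r_from_sums` is only illustrative; the totals (n = 5, ∑x = 61, ∑y = 76, ∑x² = 789, ∑y² = 1234, ∑xy = 902) are the ones used in the solution.

```python
import math

def pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson r from summary totals, using the computational formula."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Totals taken from the worked solution
r = pearson_r_from_sums(n=5, sum_x=61, sum_y=76, sum_x2=789, sum_y2=1234, sum_xy=902)
print(round(r, 3))  # -0.424
```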
Example Problem
The following example includes the changes we will need to make for hypothesis testing with the
correlation coefficient, as well as an example of how to do the computations. Below are the data for
six participants giving their number of years in college (X) and their subsequent yearly income (Y).
Income here is in thousands of dollars, but this fact does not require any changes in our
computations. Test whether there is a relationship with alpha = .05.

# of Years of College (X)   Income (Y)   X²    Y²     XY
0                           15           0     225    0
1                           15           1     225    15
3                           20           9     400    60
4                           25           16    625    100
4                           30           16    900    120
6                           35           36    1225   210
ΣX = 18                     ΣY = 140     ΣX² = 78   ΣY² = 3600   ΣXY = 505

Notice that I have included the computation for obtaining the summary values for you for
completeness. Be sure you know how to obtain all the summed values, as they will not always be given
on the exam.

Step 1: State the Hypotheses in Words and Symbols
H1: The correlation between years of education and income is not equal to zero in the population.
H0: The correlation between years of education and income is equal to zero in the population.
As usual the null states that there is no effect or no relationship, and the research hypothesis
states that there is an effect. When we write them in symbols we will use the Greek letter "rho" (ρ)
to indicate the correlation in the population. Thus:
H1: ρ ≠ 0
H0: ρ = 0

Step 2: Find the Critical Value
Again, we will use a table to find the critical value in Appendix A of your book. Locate the table,
and find the degrees of freedom for the appropriate test to find the critical value. For this test
df = n - 2, where n is the number of pairs of scores we have.
df = 6 - 2 = 4
r critical = ±0.811

Step 3: Run the Statistical Test
$$r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right]}}$$
$$r = \frac{505 - \frac{(18)(140)}{6}}{\sqrt{\left[78 - \frac{18^2}{6}\right]\left[3600 - \frac{140^2}{6}\right]}} = \frac{505 - \frac{2520}{6}}{\sqrt{\left[78 - \frac{324}{6}\right]\left[3600 - \frac{19600}{6}\right]}}$$
$$r = \frac{505 - 420}{\sqrt{[78 - 54][3600 - 3266.67]}} = \frac{85}{\sqrt{(24)(333.33)}} = \frac{85}{\sqrt{7999.92}} = \frac{85}{89.44} = .95$$

Step 4: Make a Decision about the Null
Reject the null. Since the value we computed in Step 3 (.95) is larger than the critical value in
Step 2 (.811), we reject the null.

Step 5: Write a Conclusion
There is a relationship between years spent in college and income. The more years of school, the
more the subsequent income.

r²
Often we will square the r-value we compute in order to get a measure of the size of the effect.
Just like with eta-square in ANOVA, we will compute the percentage of variability in Y that is
accounted for by X. For the current example r² = .90, so 90% of the variability in income is
accounted for by education.
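The summed values and the decision in this example can be reproduced with a minimal Python sketch. The variable names are illustrative, and the critical value 0.811 is simply copied from Step 2 rather than looked up in code.

```python
import math

years = [0, 1, 3, 4, 4, 6]           # X: years of college
income = [15, 15, 20, 25, 30, 35]    # Y: yearly income in thousands of dollars

n = len(years)
sum_x, sum_y = sum(years), sum(income)
sum_x2 = sum(x * x for x in years)
sum_y2 = sum(y * y for y in income)
sum_xy = sum(x * y for x, y in zip(years, income))

# Same formula as Step 3: each sum is corrected by (product of totals) / n
ss_xy = sum_xy - sum_x * sum_y / n
ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
r = ss_xy / math.sqrt(ss_x * ss_y)

r_critical = 0.811                   # df = 4, alpha = .05 (value taken from the text)
reject_null = abs(r) > r_critical

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)   # 18 140 78 3600 505
print(round(r, 2), round(r ** 2, 2), reject_null)
```

Comparing the absolute value of r with the critical value reflects the two-tailed hypothesis ρ ≠ 0.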
The table below shows the computed values of all the summations
mentioned above.
From our table we get:
Σ(X) = 247
Σ(Y) = 486
Σ(X*Y) = 20,485
Σ(X²) = 11,409
Σ(Y²) = 40,022
n is the sample size, in our case = 6
r = 0.5298.
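As a check, the totals above can be substituted into the computational formula for Pearson's r. This sketch assumes the same formula as in the first example, since the substitution step itself is not shown here:

```latex
r = \frac{6(20{,}485) - (247)(486)}
         {\sqrt{\left[6(11{,}409) - 247^2\right]\left[6(40{,}022) - 486^2\right]}}
  = \frac{2868}{\sqrt{(7445)(3936)}}
  \approx 0.5298
```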
Step 1: Find the ranks for each individual subject. I used the Excel rank function to find the ranks. If you
want to rank by hand, order the scores from greatest to smallest; assign the rank 1 to the highest score, 2
to the next highest and so on:
Step 2: Add a third column, d, to your data. The d is the difference between ranks. For example, the first
student’s physics rank is 3 and math rank is 5, so the difference is 2 points. In a fourth column, square
your d values.
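Insert the values into the Spearman formula, ρ = 1 - 6Σd²/(n(n² - 1)), here with Σd² = 12 and n = 9:
ρ = 1 - (6 × 12)/(9(81 - 1))
= 1 - 72/720
= 1 - 0.1
= 0.9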
The Spearman Rank Correlation for this set of data is 0.9.
Spearman Rank Correlation: What to do with Tied Ranks
Tied ranks are where two items in a column have the same rank. Let’s say two items in the above
example tied for ranks 5 and 6. The following image shows each tied data point assigned a mean rank of
5.5:
When this happens, you have a couple of options. You could still use the simple formula *if* you only
have one or two tied ranks here and there. The image above shows the workings for the ties and the
d-squared values you’ll need to input into the simple version of the formula above. However, that
option may leave you with little confidence in any p-values you produce (Kinnear and Gray, 1999). A better
option may be to calculate correlation with another method, like Kendall’s Tau.
Another option is simply to use the full version of Spearman’s formula (actually a slightly
modified Pearson’s r), which will deal with tied ranks:
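One way to handle ties, sketched here in Python, is to assign mean ranks and then compute Pearson's r on those ranks. This is a minimal illustration under that assumption, not necessarily the exact formula the text refers to; the helper names and the sample scores are hypothetical.

```python
import math

def mean_ranks(values):
    """Rank from highest (rank 1) to lowest; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1      # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman correlation as Pearson's r over mean ranks (works with ties)."""
    return pearson(mean_ranks(xs), mean_ranks(ys))

# Hypothetical scores (not from the text): two values tie for ranks 5 and 6
scores = [90, 85, 80, 75, 60, 60, 50]
print(mean_ranks(scores))  # [1.0, 2.0, 3.0, 4.0, 5.5, 5.5, 7.0]
```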
Spearman Correlation - Example
A sample of 1,000 companies were asked about their number of employees and their
revenue over 2018. For making these questions easier, they were offered answer
categories. After completing the data collection, the contingency table below shows the
results.
The question we'd like to answer is: is company size related to revenue? A good look
at our contingency table shows the obvious: companies having more employees
typically make more revenue. But note that this relation is not perfect: there are 60
companies with 1 employee making $50,000 - $99,999 while there are 89 companies with
2-5 employees making $0 - $49,999. This relation becomes clear if we visualize our
results in the chart below.
As a last step, we simply compute the Pearson correlation between the size and
revenue ranks. This results in a Spearman rank correlation (Rs) = 0.81. This tells us
that our variables are strongly monotonically related. But in contrast to a normal
Pearson correlation, we do not know if the relation is linear to any extent.
Right. Now, computing Spearman’s rank correlation always starts off with replacing
scores by their ranks (use mean ranks for ties). Spearman’s correlation is now
computed as the Pearson correlation over the (mean) ranks.
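If there are no ties, this is equivalent to the simpler formula
$$R_s = 1 - \frac{6 \cdot \sum D^2}{n^3 - n}$$
where D denotes the difference between the 2 ranks for each observation and n is the number of
observations.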
For reasonable sample sizes of N ≥ 30, the (approximate) statistical significance uses the t distribution.
In this case, the test statistic
$$T = R_s \cdot \sqrt{\frac{N - 2}{1 - R_s^2}}$$
follows a t-distribution with
$$Df = N - 2$$
degrees of freedom.
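The test statistic is easy to compute directly. The sketch below uses a hypothetical helper `spearman_t` and plugs in Rs = 0.81 from the companies example, taking the 1,000 sampled companies as N (an assumption, since the text does not state N for that correlation explicitly). Looking up the p-value for the resulting t is left to a table or statistics package.

```python
import math

def spearman_t(rs, n):
    """Return (t, df) for the approximation T = Rs * sqrt((N - 2) / (1 - Rs^2))."""
    if not -1 < rs < 1:
        raise ValueError("rs must lie strictly between -1 and 1")
    df = n - 2
    t = rs * math.sqrt(df / (1 - rs ** 2))
    return t, df

t, df = spearman_t(0.81, 1000)
print(round(t, 1), df)  # roughly 43.6 with 998 degrees of freedom
```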
This approximation is inaccurate for smaller sample sizes of N < 30. In this case, look up the (exact)
significance level from the table given in this Googlesheet. These exact p-values are based on a
permutation test that we may discuss some other time. Or not.
SPSS also comes up with the correct correlation. However, its significance level is
based on the t-distribution:
$$t = 0.77 \cdot \sqrt{\frac{4}{1 - 0.77^2}} = 2.42$$
and
$$t(4) = 2.42, \; p = 0.072$$
Again, this approximation is only accurate for larger sample sizes of N ≥ 30. For N = 6, it is wildly off as
shown below.
Example: The hypothesis tested is that prices should decrease with distance from the key
area of gentrification surrounding the Contemporary Art Museum. The line followed is
Transect 2 in the map below, with continuous sampling of the price of a 50cl bottle of water at
every convenience store.
Map to show the location of environmental gradients for transect lines in El Raval,
Barcelona
Hypothesis
We might expect to find that the price of a bottle of water decreases as distance from the
Contemporary Art Museum increases. Higher property rents close to the museum should be
reflected in higher prices in the shops.
The hypothesis might be written like this:
The price of a convenience item decreases as distance from the Contemporary Art Museum
increases.
The more objective scientific research method is always to assume that no such price-
distance relationship exists and to express the null hypothesis as:
there is no significant relationship between the price of a convenience item and
distance from the Contemporary Art Museum.
What can go wrong?
Having decided upon the wording of the hypothesis, you should consider whether there are
any other factors that may influence the study. Some factors that may influence prices may
include:
The type of retail outlet. You must be consistent in your choice of retail outlet. For
example, bars and restaurants often charge significantly more for water than a
convenience store. You should decide which type of outlet to use and stick with it
for all your data collection.
Some shops have different prices for the same item: a high tourist and lower local
price, dependent upon the shopkeeper's perception of the customer.
Shops near main roads may charge more than shops in less accessible back
streets, due to the higher rents demanded for main road retail sites.
The positive spread effects from other nearby areas of gentrification or from
competing areas of tourist attraction.
The negative spread effects from nearby areas of urban decay.
Higher prices may be charged during the summer when demand is less flexible,
making seasonal comparisons less reliable.
Cumulative sampling may distort the expected price-distance gradient if several
shops cluster within a short area along the transect line followed by a
considerable gap before the next group of retail outlets.
You should mention such factors in your investigation.
Data collected (see data table below) suggests a fairly strong negative relationship as
shown in this scatter graph:
Scatter graph to show the change in the price of a convenience item with distance
from the Contemporary Art Museum.
The scatter graph shows the possibility of a negative correlation between the two variables
and the Spearman's rank correlation technique should be used to see if there is indeed a
correlation, and to test the strength of the relationship.
Convenience   Distance from   Rank       Price of 50cl   Rank    Difference between   d²
Store         CAM (m)         distance   bottle (€)      price   ranks (d)
1             50              10         1.80            2       8                    64
3             270             8          2.00            1       7                    49
4             375             7          1.00            6       1                    1
5             425             6          1.00            6       0                    0
7             710             4          0.80            9       -5                   25
8             790             3          0.60            10      -7                   49
9             890             2          1.00            6       -4                   16
10            980             1          0.85            8       -7                   49
Σd² = 285.5
When written in mathematical notation the Spearman Rank formula looks like this :
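Assuming the standard form of the Spearman rank formula, with n the number of stores sampled (10 here) and Σd² = 285.5 from the data table, the calculation can be sketched as:

```latex
r_s = 1 - \frac{6\sum d^2}{n^3 - n}
    = 1 - \frac{6 \times 285.5}{10^3 - 10}
    = 1 - \frac{1713}{990}
    \approx -0.73
```

The negative sign indicates that price tends to fall as distance increases. When the result is compared with a significance table, it is the magnitude, 0.73, that is looked up.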
In the example, the value 0.73 gives a significance level of slightly less than 5%. That
means that, if there were really no relationship between price and distance, a correlation at
least this strong would arise by chance in fewer than 5 out of 100 samples. You can therefore
reject the null hypothesis at the 5% significance level. Put another way, if the null hypothesis
were true, fewer than 5 out of 100 researchers completing the same study as yours would be
expected to obtain a correlation this strong.
The fact two variables correlate cannot prove anything - only further research can
actually prove that one thing affects the other.
Data reliability is related to the size of the sample. The more data you collect, the
more reliable your result.
What values can the Spearman correlation coefficient, rs, take?
The Spearman correlation coefficient, rs, can take values from +1 to -1. A rs of +1
indicates a perfect association of ranks, a rs of zero indicates no association between
ranks and a rs of -1 indicates a perfect negative association of ranks. The closer rs is to
zero, the weaker the association between the ranks.
Marks
English 56 75 45 71 62 64 58 80 76 61
Maths 66 70 40 60 65 56 59 77 67 63

English (mark) Maths (mark) Rank (English) Rank (maths) d
56 66 9 4 5
75 70 3 2 1
45 40 10 10 0
71 60 4 7 3
62 65 6 5 1
64 56 5 9 4
58 59 8 8 0
80 77 1 1 0
76 67 2 3 1
61 63 7 6 1
We then substitute this into the main equation with the other information as follows:
as n = 10. Hence, we have a ρ (or rs) of 0.67. This indicates a strong positive
relationship between the ranks individuals obtained in the maths and English exam.
That is, the higher you ranked in maths, the higher you ranked in English also, and vice
versa.
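The result can be verified from the marks with a short Python sketch. It assumes the simple no-ties formula, rs = 1 - 6Σd²/(n(n² - 1)), which applies here because no two students share a mark; the helper name `ranks_desc` is illustrative.

```python
english = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
maths = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

def ranks_desc(marks):
    """Rank 1 = highest mark. Assumes no tied marks, as in this data set."""
    ordered = sorted(marks, reverse=True)
    return [ordered.index(m) + 1 for m in marks]

rank_english = ranks_desc(english)
rank_maths = ranks_desc(maths)

# d is the absolute difference between the two ranks, as in the table
d = [abs(r_eng - r_math) for r_eng, r_math in zip(rank_english, rank_maths)]
sum_d2 = sum(di ** 2 for di in d)

n = len(english)
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(rs, 2))  # 54 0.67
```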
However, if you have also run statistical significance tests, you need to include some
more information as shown below:
H0: There is no [monotonic] association between the two variables [in the population].
Remember, you are making an inference from your sample to the population that the
sample is supposed to represent. However, as this is a general understanding of
an inferential statistical test, it is often not included. A null hypothesis statement for the
example used earlier in this guide would be:
H0: There is no [monotonic] association between maths and English marks.