
Md. Ashraful Islam
Lecturer (Statistics)
Course Name: STATISTICS
Dept. of Natural Science, PCIU
Cell: 01620728620
Email: ashraful144cu@gmail.com

Correlation
A correlation is a statistical measure of the relationship between two variables. The measure is
best suited to variables that have a linear relationship with each other.

The fit of the data can be visually represented in a scatterplot. Using a scatterplot, we can generally
assess the relationship between the variables and determine whether they are correlated or not.
Correlation Coefficient

The correlation coefficient is a value that indicates the strength of the relationship between
variables.

It is denoted by r and can take any value from -1 to 1.

The interpretations of the values are:

 -1: Perfect negative correlation. The variables tend to move in opposite directions (i.e.,
when one variable increases, the other variable decreases).
 0: No correlation. The variables do not have a linear relationship with each other.
 1: Perfect positive correlation. The variables tend to move in the same direction (i.e.,
when one variable increases, the other variable also increases).

If two variables are correlated, it does not imply that one variable causes the changes in another
variable. Correlation only assesses relationships between variables, and there may be different
factors that lead to the relationships.

In correlation analysis, we estimate a sample correlation coefficient. The sample correlation
coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of
the linear association between the two variables. The correlation between two variables can be
positive or negative.

The sign of the correlation coefficient indicates the direction of the association. The magnitude
of the correlation coefficient indicates the strength of the association.

For example, a correlation of r = 0.9 suggests a strong, positive association between two
variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation
close to zero suggests no linear association between two continuous variables.

It is important to note that there may be a non-linear association between two continuous
variables, but computation of a correlation coefficient does not detect this. Therefore, it is always
important to evaluate the data carefully before computing a correlation coefficient. Graphical
displays are particularly useful to explore associations between variables.
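
The point about non-linear association can be illustrated with a small Python sketch. The data below are hypothetical (not from the lecture): y is completely determined by x through a parabola, yet the correlation coefficient comes out as zero. The helper name `pearson_r` is an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, deviation form of the formula."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: an exact but non-linear (U-shaped) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(round(r, 4))  # 0.0, although y depends entirely on x
```

A scatterplot of these points would show the U-shape at once, which is why plotting the data first is recommended.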

The figure below shows four hypothetical scenarios in which one continuous variable is plotted
along the X-axis and the other along the Y-axis.
 Figure 1 depicts a strong positive association (r = 0.9), similar to what we might see for the
correlation between infant birth weight and birth length.
 Figure 2 depicts a weaker association (r = 0.2) that we might expect to see between age
and body mass index (which tends to increase with age).
 Figure 3 might depict the lack of association (r approximately 0) between the extent of
media exposure in adolescence and age at which adolescents initiate sexual activity.
 Figure 4 might depict the strong negative association (r = -0.9) generally observed
between the number of hours of aerobic exercise per week and percent body fat.
Correlation Coefficient

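The correlation coefficient indicates the strength of the linear relationship between two
variables. It can be found using any of the following equivalent formulas:

$$r_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$$

$$r_{xy} = \frac{\operatorname{Cov}(x, y)}{\sqrt{v(x) \cdot v(y)}}$$

$$r_{xy} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right) \times \left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}$$

Where:

$$\operatorname{Cov}(x, y) = \sum xy - \frac{\sum x \sum y}{n}$$

$$v(x) = \sum x^2 - \frac{(\sum x)^2}{n}$$

$$v(y) = \sum y^2 - \frac{(\sum y)^2}{n}$$

Here n is the number of paired observations. As written, Cov(x, y), v(x) and v(y) omit the usual
division by n; that common factor cancels in the ratio, so the value of r is unchanged.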

 rxy – the correlation coefficient of the linear relationship between the variables x and y
 xi – the values of the x-variable in a sample
 x̅ – the mean of the values of the x-variable
 yi – the values of the y-variable in a sample
 ȳ – the mean of the values of the y-variable
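
The deviation form and the sums form of the formula should give the same value on any data set. The following Python sketch implements both and compares them. The function names and the data values are illustrative assumptions, not part of the lecture.

```python
import math

def pearson_r_deviation(xs, ys):
    """r from deviations about the means: sum((x - x_bar)(y - y_bar)) / sqrt(...)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pearson_r_sums(xs, ys):
    """r from raw sums: (sum(xy) - sum(x)sum(y)/n) / sqrt(...)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    cov_term = sum_xy - sum_x * sum_y / n
    var_x_term = sum_x2 - sum_x ** 2 / n
    var_y_term = sum_y2 - sum_y ** 2 / n
    return cov_term / math.sqrt(var_x_term * var_y_term)

# Hypothetical paired observations, chosen only for illustration.
xs = [1, 2, 4, 7, 9]
ys = [2, 3, 5, 6, 10]

r_dev = pearson_r_deviation(xs, ys)
r_sum = pearson_r_sums(xs, ys)
print(abs(r_dev - r_sum) < 1e-12)  # True: the two forms agree
```

The sums form is convenient for hand calculation from a table of x², y² and xy columns, which is how the worked example in this section proceeds.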
Example-01

Find the correlation coefficient between the two variables:

| Income (in thousands)      | 3 | 5 | 8 | 10 | 12 |
| Expenditure (in thousands) | 2 | 5 | 6 | 7  | 8  |

Solution:

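| x       | y       | x²        | y²        | xy        |
| 3       | 2       | 9         | 4         | 6         |
| 5       | 5       | 25        | 25        | 25        |
| 8       | 6       | 64        | 36        | 48        |
| 10      | 7       | 100       | 49        | 70        |
| 12      | 8       | 144       | 64        | 96        |
| Σx = 38 | Σy = 28 | Σx² = 342 | Σy² = 178 | Σxy = 245 |

$$r_{xy} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right) \times \left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}$$

$$r_{xy} = \frac{245 - \dfrac{38 \times 28}{5}}{\sqrt{\left(342 - \dfrac{38^2}{5}\right) \times \left(178 - \dfrac{28^2}{5}\right)}} \approx 0.96$$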
It means there is a strong positive relationship between x
and y values.
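
For readers who want to check the arithmetic, the substitution can be evaluated step by step as follows (intermediate values rounded to two decimal places):

$$r_{xy} = \frac{245 - 212.8}{\sqrt{(342 - 288.8) \times (178 - 156.8)}} = \frac{32.2}{\sqrt{53.2 \times 21.2}} = \frac{32.2}{\sqrt{1127.84}} \approx \frac{32.2}{33.58} \approx 0.96$$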
Rank correlation coefficient
A correlation can easily be drawn as a scatter graph, but the most precise way to compare several
pairs of data is to use a statistical test - this establishes whether the correlation is really
significant or if it could have been the result of chance alone.

Spearman's Rank correlation coefficient is a technique which can be used to summarise the
strength and direction (negative or positive) of a relationship between two variables.

The result will always be between -1 and +1.

Method - calculating the coefficient

 Create a table from your data.


 Rank the two data sets. Ranking is achieved by giving the ranking '1' to the biggest
number in a column, '2' to the second biggest value and so on. The smallest value in the
column will get the lowest ranking. This should be done for both sets of measurements.
 Tied scores are given the mean (average) rank. For example, the three tied scores of 1
euro in the example below are ranked fifth in order of price, but occupy three positions
(fifth, sixth and seventh) in a ranking hierarchy of ten. The mean rank in this case is
calculated as (5+6+7) ÷ 3 = 6.
 Find the difference in the ranks (d): This is the difference between the ranks of the two
values on each row of the table. The rank of the second value (price) is subtracted from
the rank of the first (distance from the museum).
 Square the differences (d²) to remove negative values, and then sum them (Σd²).
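
The ranking rule described in the steps above (rank 1 for the largest value, mean rank for tied values) can be sketched in Python as follows. `average_ranks` is a hypothetical helper name; the prices are the ten bottle prices from the data table in this section.

```python
def average_ranks(values):
    """Rank values so the largest gets rank 1; tied values share their mean rank."""
    # Indices of the values, ordered from largest to smallest value.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of values tied with the one at 'pos'.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions pos..end (0-based) occupy ranks pos+1..end+1; take their mean.
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

prices = [1.80, 1.20, 2.00, 1.00, 1.00, 1.20, 0.80, 0.60, 1.00, 0.85]
print(average_ranks(prices))
# [2.0, 3.5, 1.0, 6.0, 6.0, 3.5, 9.0, 10.0, 6.0, 8.0]
```

The three prices of 1.00 occupy positions five, six and seven, so each receives the mean rank 6, as in the tie example above.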

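| x  | y  | Rank of x | Rank of y | d = R(x) - R(y) | d² |
| 10 | 55 | 5         | 3         | 2               | 4  |
| 16 | 95 | 1         | 1         | 0               | 0  |
| 11 | 0  | 4         | 5         | -1              | 1  |
| 15 | 33 | 2         | 4         | -2              | 4  |
| 12 | 66 | 3         | 2         | 1               | 1  |
|    |    |           |           |                 | Σd² = 10 |

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 10}{5(5^2 - 1)} = 1 - \frac{60}{120} = 1 - 0.50 = 0.50$$
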
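| x  | y  | Rank of x | Rank of y | d = R(x) - R(y) | d²   |
| 10 | 55 | 5.5       | 4         | 1.5             | 2.25 |
| 16 | 95 | 1         | 1         | 0               | 0    |
| 11 | 55 | 4         | 4         | 0               | 0    |
| 15 | 33 | 2         | 6         | -4              | 16   |
| 12 | 66 | 3         | 2         | 1               | 1    |
| 10 | 55 | 5.5       | 4         | 1.5             | 2.25 |
|    |    |           |           |                 | Σd² = 21.50 |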

| Convenience store | Distance from CAM (m) | Rank distance | Price of 50cl bottle (€) | Rank price | Difference between ranks (d) | d²    |
| 1                 | 50                    | 10            | 1.80                     | 2          | 8                            | 64    |
| 2                 | 175                   | 9             | 1.20                     | 3.5        | 5.5                          | 30.25 |
| 3                 | 270                   | 8             | 2.00                     | 1          | 7                            | 49    |
| 4                 | 375                   | 7             | 1.00                     | 6          | 1                            | 1     |
| 5                 | 425                   | 6             | 1.00                     | 6          | 0                            | 0     |
| 6                 | 580                   | 5             | 1.20                     | 3.5        | 1.5                          | 2.25  |
| 7                 | 710                   | 4             | 0.80                     | 9          | -5                           | 25    |
| 8                 | 790                   | 3             | 0.60                     | 10         | -7                           | 49    |
| 9                 | 890                   | 2             | 1.00                     | 6          | -4                           | 16    |
| 10                | 980                   | 1             | 0.85                     | 8          | -7                           | 49    |
|                   |                       |               |                          |            |                              | Σd² = 285.5 |

Data Table: Spearman's Rank Correlation

 Calculate the coefficient (Rs) using the formula below. The answer will always be
between 1.0 (a perfect positive correlation) and -1.0 (a perfect negative correlation).

When written in mathematical notation, the Spearman Rank formula looks like this:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

Now to put all these values into the formula.

 Find the sum of the d² values by adding up all the values in the d² column.
In our example this is 285.5. Multiplying this by 6 gives 1713.
 Now for the bottom line of the equation. The value n is the number of sites at which you
took measurements. In our example this is 10. Substituting this value into n³ - n (which equals
n(n² - 1)) we get 1000 - 10 = 990.
 We now have the formula: 𝜌 = 1 - (1713/990), which gives a value for 𝜌:

1 - 1.73 = -0.73
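
The same calculation can be reproduced with a short Python sketch. The two rank lists are copied from the data table; the function name `spearman_simple` is an illustrative choice, and the function assumes the ranks have already been assigned.

```python
def spearman_simple(rank_x, rank_y):
    """Spearman's rho from two lists of ranks, using 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ranks taken from the data table (distance rank, price rank) for the ten stores.
distance_ranks = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
price_ranks = [2, 3.5, 1, 6, 6, 3.5, 9, 10, 6, 8]

sum_d2 = sum((a - b) ** 2 for a, b in zip(distance_ranks, price_ranks))
rho = spearman_simple(distance_ranks, price_ranks)
print(sum_d2, round(rho, 2))  # 285.5 -0.73
```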

What does this 𝜌 value of -0.73 mean?

The closer 𝜌 is to +1 or -1, the stronger the likely correlation. A perfect positive correlation is +1
and a perfect negative correlation is -1. The 𝜌 value of -0.73 suggests a fairly strong negative
relationship.

Assumptions for Spearman’s Rank Correlation


Your data must be ordinal, interval or ratio.
In addition, because Spearman’s measures the strength of a monotonic relationship, the two
variables should be monotonically related. This means that as one variable increases, the other
either consistently increases or consistently decreases, though not necessarily at a constant rate.
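
The difference between a monotonic and a linear relationship can be seen in a small Python sketch. The data are hypothetical: y = x³ always rises with x but not along a straight line. The sketch relies on the fact that Spearman's coefficient equals Pearson's r computed on the ranks; the helper names are illustrative, and `ranks_desc` assumes there are no ties.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, deviation form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks_desc(values):
    """Rank 1 for the largest value. Assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

# Hypothetical data: monotonic (always increasing) but clearly non-linear.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 3 for x in xs]

pearson = pearson_r(xs, ys)
spearman = pearson_r(ranks_desc(xs), ranks_desc(ys))
print(round(pearson, 2), round(spearman, 2))  # 0.93 1.0
```

Because the ranks of x and y match exactly, Spearman's coefficient is 1 even though Pearson's r falls short of 1.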

Test Results
Spearman’s returns a value from -1 to 1, where:
+1 = a perfect positive correlation between ranks
-1 = a perfect negative correlation between ranks
0 = no correlation between ranks.

Spearman Rank Correlation: Worked Example (No Tied Ranks)
The formula for the Spearman rank correlation coefficient when there are no tied ranks is:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

Example Question:
The scores for nine students in physics and math are as follows:

 Physics: 35, 23, 47, 17, 10, 43, 9, 6, 28


 Mathematics: 30, 33, 45, 23, 8, 49, 12, 4, 31

Compute the students’ ranks in the two subjects and compute the Spearman rank correlation.

Step 1: Find the ranks for each individual subject. I used the Excel rank function to find the
ranks. If you want to rank by hand, order the scores from greatest to smallest; assign the rank 1
to the highest score, 2 to the next highest and so on:
Step 2: Add a third column, d, to your data. The d is the difference between ranks. For example,
the first student’s physics rank is 3 and math rank is 5, so the difference is 3 - 5 = -2. In a
fourth column, square your d values (here, (-2)² = 4).

Step 3: Sum (add up) all of your d-squared values.

4 + 4 + 1 + 0 + 1 + 1 + 1 + 0 + 0 = 12. You’ll need this for the formula (Σd² is just “the sum
of d-squared values”).

Step 4: Insert the values into the formula. These ranks are not tied, so use the formula for no
tied ranks given above, with n = 9:

= 1 – (6*12)/(9(81-1))
= 1 – 72/720
= 1-0.1
= 0.9
The Spearman Rank Correlation for this set of data is 0.9.
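
The ranks, the d² values and the final coefficient for this example can be reproduced with the following Python sketch. The scores are those given in the question; `ranks_desc` is a hypothetical helper that assumes no tied scores, which holds for this data.

```python
physics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
maths = [30, 33, 45, 23, 8, 49, 12, 4, 31]

def ranks_desc(scores):
    """Rank 1 for the highest score. Assumes all scores are distinct."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]

rank_physics = ranks_desc(physics)
rank_maths = ranks_desc(maths)
d_squared = [(p - m) ** 2 for p, m in zip(rank_physics, rank_maths)]

n = len(physics)
rho = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))

print(rank_physics)   # [3, 5, 1, 6, 7, 2, 8, 9, 4]
print(rank_maths)     # [5, 3, 2, 6, 8, 1, 7, 9, 4]
print(d_squared)      # [4, 4, 1, 0, 1, 1, 1, 0, 0]
print(round(rho, 2))  # 0.9
```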

Spearman Rank Correlation: What to do with Tied Ranks


Tied ranks are where two items in a column have the same rank. Let’s say two items in the above
example tied for ranks 5 and 6. The following image shows each tied data point assigned a mean
rank of 5.5:
When this happens, you have a couple of options. You could still use the simpler formula (the one
for no tied ranks) *if* you only have one or two tied ranks here and there. The image above shows
the workings for the ties and the d-squared values you’ll need to input into the simple version of
the formula above. However, that option may leave you with little confidence in any p-values you
produce (Kinnear and Gray, 1999). A better option may be to calculate correlation with another
method, like Kendall’s Tau.

Another option is simply to use the full version of Spearman’s formula (actually a slightly
modified Pearson’s r), which will deal with tied ranks:

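Full Spearman’s r formula (Clef, 2013. p. 4), which is Pearson’s r applied to the ranks:

$$\rho = \frac{\sum_i \left(R(x_i) - \overline{R(x)}\right)\left(R(y_i) - \overline{R(y)}\right)}{\sqrt{\sum_i \left(R(x_i) - \overline{R(x)}\right)^2 \, \sum_i \left(R(y_i) - \overline{R(y)}\right)^2}}$$

Where:

 R(x) and R(y) are the ranks,
 R(x)bar and R(y)bar, written with an overline in the formula, are the mean ranks.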
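
As a minimal sketch of the full version, the Python code below computes Pearson's r directly on the ranks, which is what the text describes. It uses the distance and price ranks from the convenience store table earlier in this section, where the price ranks contain ties. The function name `spearman_full` is an illustrative assumption.

```python
import math

def spearman_full(rank_x, rank_y):
    """Spearman's rho as Pearson's r on the ranks; valid when ranks are tied."""
    n = len(rank_x)
    mean_rx = sum(rank_x) / n
    mean_ry = sum(rank_y) / n
    num = sum((a - mean_rx) * (b - mean_ry) for a, b in zip(rank_x, rank_y))
    den_x = sum((a - mean_rx) ** 2 for a in rank_x)
    den_y = sum((b - mean_ry) ** 2 for b in rank_y)
    return num / math.sqrt(den_x * den_y)

# Ranks from the convenience store table; the price ranks include tied values.
distance_ranks = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
price_ranks = [2, 3.5, 1, 6, 6, 3.5, 9, 10, 6, 8]

rho_full = spearman_full(distance_ranks, price_ranks)
print(round(rho_full, 3))  # -0.757
```

On this data the full version gives about -0.757, slightly stronger than the -0.73 from the simple formula. The gap comes from the tied price ranks, and the two versions coincide when there are no ties.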

Two referees in a flower beauty competition rank the 10 types of flowers as follows:

Use the rank correlation coefficient to find the degree of agreement
between the referees.
Solution:

Interpretation: Degree of agreement between the referees ‘A’ and ‘B’ is 0.636 and
they have “strong agreement” in evaluating the competitors.
What is a Scatter Diagram?
A scatter diagram (also known as a scatter plot, scatter graph, or correlation chart) is a tool for
analyzing the relationship between two variables and determining how closely they are
related. One variable is plotted on the horizontal axis and the other is plotted on the vertical axis.
The pattern of their intersecting points can graphically show relationship patterns.

A scatter diagram is most often used to investigate possible cause-and-effect relationships. While
the diagram shows relationships, it does not by itself prove that one variable causes the other.
We can therefore use a scatter diagram to examine theories about cause-and-effect relationships
and to search for root causes of an identified problem.

For example, we can analyze the pattern of motorcycle accidents on a highway. You select the
two variables: motorcycle speed and number of accidents, and draw the diagram. Once the
diagram is completed, you notice that as the speed of vehicle increases, the number of accidents
also goes up. This shows that there is a relationship between the speed of vehicles and accidents
happening on the highway.
