Correlation
Ashraful Islam
Lecturer (Statistics)
Course Name: STATISTICS Dept. of Natural Science, PCIU
Cell: 01620728620
Email: ashraful144cu@gmail.com
Correlation
A correlation is a statistical measure of the relationship between two variables. The measure is
best suited to variables that have a linear relationship with each other.
The fit of the data can be visually represented in a scatterplot. Using a scatterplot, we can generally
assess the relationship between the variables and determine whether they are correlated or not.
Correlation Coefficient
The correlation coefficient is a value that indicates the strength of the relationship between
variables.
-1: Perfect negative correlation. The variables tend to move in opposite directions (i.e.,
when one variable increases, the other variable decreases).
0: No correlation. The variables do not have a linear relationship with each other.
1: Perfect positive correlation. The variables tend to move in the same direction (i.e.,
when one variable increases, the other variable also increases).
If two variables are correlated, it does not imply that one variable causes the changes in another
variable. Correlation only assesses relationships between variables, and there may be different
factors that lead to the relationships.
The sign of the correlation coefficient indicates the direction of the association. The magnitude
of the correlation coefficient indicates the strength of the association.
For example, a correlation of r = 0.9 suggests a strong, positive association between two
variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation
close to zero suggests no linear association between two continuous variables.
It is important to note that there may be a non-linear association between two continuous
variables, but computation of a correlation coefficient does not detect this. Therefore, it is always
important to evaluate the data carefully before computing a correlation coefficient. Graphical
displays are particularly useful to explore associations between variables.
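The point about non-linear association can be illustrated with a minimal Python sketch. The data are hypothetical (chosen for illustration, not taken from any study): y is completely determined by x through y = x², yet the correlation coefficient comes out at zero because the relationship is not linear. The helper name `pearson_r` is an assumption of this sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation via the deviation-from-the-mean formula."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y depends on x exactly, but through a curve (y = x^2).
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print(round(r_curved, 4))
```

A scatterplot of these points would show a clear U-shape, which is why plotting the data first is recommended.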
The figure below shows four hypothetical scenarios in which one continuous variable is plotted
along the X-axis and the other along the Y-axis.
Figure 1 depicts a strong positive association (r=0.9), similar to what we might see for the
correlation between infant birth weight and birth length.
Figure 2 depicts a weaker association (r = 0.2) that we might expect to see between age
and body mass index (which tends to increase with age).
Figure 3 might depict the lack of association (r approximately 0) between the extent of
media exposure in adolescence and age at which adolescents initiate sexual activity.
Figure 4 might depict the strong negative association (r= -0.9) generally observed
between the number of hours of aerobic exercise per week and percent body fat.
Correlation Coefficient
The correlation coefficient indicates the strength of the linear relationship between two
variables. It can be found using any of the following equivalent formulas:
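$$r_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$$

$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{\sqrt{v(x) \cdot v(y)}}$$

$$r_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right) \times \left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}}$$

Where:

$$\mathrm{Cov}(x, y) = \sum xy - \frac{\sum x \sum y}{n}$$

$$v(x) = \sum x^2 - \frac{(\sum x)^2}{n}$$

$$v(y) = \sum y^2 - \frac{(\sum y)^2}{n}$$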
rxy – the correlation coefficient of the linear relationship between the variables x and y
xi – the values of the x-variable in a sample
x̅ – the mean of the values of the x-variable
yi – the values of the y-variable in a sample
ȳ – the mean of the values of the y-variable
n – the number of (x, y) pairs in the sample
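As a reading aid, here is a short sketch of why the deviation form (first formula) and the computational form (third formula) have the same numerator. It uses only the definitions x̄ = Σx/n and ȳ = Σy/n:

```latex
\begin{align*}
\sum (x - \bar{x})(y - \bar{y})
  &= \sum xy - \bar{x}\sum y - \bar{y}\sum x + n\,\bar{x}\,\bar{y} \\
  &= \sum xy - \frac{\sum x \sum y}{n} - \frac{\sum x \sum y}{n} + \frac{\sum x \sum y}{n} \\
  &= \sum xy - \frac{\sum x \sum y}{n}
\end{align*}
```

The same expansion with y replaced by x gives Σ(x − x̄)² = Σx² − (Σx)²/n, and likewise for y, so the denominators agree as well.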
Example-01
Solution:
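| x | y | x² | y² | xy |
|---|---|---|---|---|
| 3 | 2 | 9 | 4 | 6 |
| 5 | 5 | 25 | 25 | 25 |
| 8 | 6 | 64 | 36 | 48 |
| 10 | 7 | 100 | 49 | 70 |
| 12 | 8 | 144 | 64 | 96 |
| Σx = 38 | Σy = 28 | Σx² = 342 | Σy² = 178 | Σxy = 245 |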
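$$r_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right) \times \left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}}$$

$$r_{xy} = \frac{245 - \frac{38 \times 28}{5}}{\sqrt{\left(342 - \frac{38^2}{5}\right) \times \left(178 - \frac{28^2}{5}\right)}}$$

$$= 0.96$$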
This means there is a strong positive linear relationship between the x
and y values.
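The arithmetic of Example-01 can be checked with a minimal Python sketch. It recomputes the column totals from the table, applies the computational formula, and confirms that the deviation-from-the-mean formula gives the same value. The variable names are chosen for this sketch only.

```python
import math

# Data from the Example-01 table.
x = [3, 5, 8, 10, 12]
y = [2, 5, 6, 7, 8]
n = len(x)

# Column totals, as in the last row of the table.
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Computational formula: Cov(x, y) / sqrt(v(x) * v(y)).
cov_xy = sum_xy - sum_x * sum_y / n   # 245 - 212.8
v_x = sum_x2 - sum_x ** 2 / n         # 342 - 288.8
v_y = sum_y2 - sum_y ** 2 / n         # 178 - 156.8
r_computational = cov_xy / math.sqrt(v_x * v_y)

# Deviation formula, for comparison.
mean_x = sum_x / n
mean_y = sum_y / n
num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
den = math.sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
r_deviation = num / den

print(round(r_computational, 2))  # 0.96
```

Both forms give r ≈ 0.959, which rounds to the 0.96 reported in the solution.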
Rank correlation coefficient
A correlation can easily be drawn as a scatter graph, but the most precise way to compare several
pairs of data is to use a statistical test - this establishes whether the correlation is really
significant or if it could have been the result of chance alone.
Spearman's Rank correlation coefficient is a technique which can be used to summarise the
strength and direction (negative or positive) of a relationship between two variables.
$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

$$= 1 - \frac{60}{120}$$

$$= 1 - 0.50 = 0.50$$
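| X | Y | Rank of x | Rank of y | d_i = rank of x − rank of y | d_i² |
|---|---|---|---|---|---|
| 10 | 55 | 5.5 | 4 | 1.5 | 2.25 |
| 16 | 95 | 1 | 1 | 0 | 0 |
| 11 | 55 | 4 | 4 | 0 | 0 |
| 15 | 33 | 2 | 5 | -3 | 9 |
| 12 | 66 | 3 | 2 | 1 | 1 |
| 10 | 55 | 5.5 | 4 | 1.5 | 2.25 |

Σd_i² = 14.50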
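| Convenience store | Distance from CAM (m) | Rank distance | Price of 50cl bottle (€) | Rank price | Difference between ranks (d) | d² |
|---|---|---|---|---|---|---|
| 1 | 50 | 10 | 1.80 | 2 | 8 | 64 |
| 2 | 175 | 9 | 1.20 | 3.5 | 5.5 | 30.25 |
| 3 | 270 | 8 | 2.00 | 1 | 7 | 49 |
| 4 | 375 | 7 | 1.00 | 6 | 1 | 1 |
| 5 | 425 | 6 | 1.00 | 6 | 0 | 0 |
| 6 | 580 | 5 | 1.20 | 3.5 | 1.5 | 2.25 |
| 7 | 710 | 4 | 0.80 | 9 | -5 | 25 |
| 8 | 790 | 3 | 0.60 | 10 | -7 | 49 |
| 9 | 890 | 2 | 1.00 | 6 | -4 | 16 |
| 10 | 980 | 1 | 0.85 | 8 | -7 | 49 |

Σd² = 285.5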
Calculate the coefficient (ρ) using the formula below. The answer will always be
between +1.0 (a perfect positive correlation) and -1.0 (a perfect negative correlation).
When written in mathematical notation, the Spearman Rank formula looks like this:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
Find the sum of the d² values by adding up all the values in the d² column.
In our example this is 285.5. Multiplying this by 6 gives 1713.
Now for the bottom line of the equation. The value n is the number of sites at which you
took measurements. This, in our example, is 10. Substituting this value into n(n² − 1) = n³ − n, we
get 1000 − 10 = 990.
We now have the formula: 𝜌 = 1 − (1713/990), which gives a value for 𝜌:
1 − 1.73 = −0.73
The closer 𝜌 is to +1 or -1, the stronger the likely correlation. A perfect positive correlation is +1
and a perfect negative correlation is -1. The 𝜌 value of -0.73 suggests a fairly strong negative
relationship.
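The convenience-store calculation can be reproduced with a minimal Python sketch. It ranks the raw distances and prices from the table (largest value gets rank 1, tied values share the average of their ranks), sums the squared rank differences, and applies the formula above. The helper name `average_ranks` is an assumption of this sketch, not a library function.

```python
def average_ranks(values):
    """Rank from largest (rank 1) to smallest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

# Raw data from the convenience-store table.
distance_m = [50, 175, 270, 375, 425, 580, 710, 790, 890, 980]
price_eur = [1.80, 1.20, 2.00, 1.00, 1.00, 1.20, 0.80, 0.60, 1.00, 0.85]

rank_distance = average_ranks(distance_m)
rank_price = average_ranks(price_eur)

# Squared rank differences and their sum.
d_squared = [(rd - rp) ** 2 for rd, rp in zip(rank_distance, rank_price)]
sum_d2 = sum(d_squared)

n = len(distance_m)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(rho, 2))  # 285.5 -0.73
```

The computed ranks match the "Rank distance" and "Rank price" columns of the table, including the shared ranks 3.5 and 6 for the tied prices.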
Test Results
Spearman’s coefficient takes a value from -1 to 1, where:
+1 = a perfect positive correlation between ranks
-1 = a perfect negative correlation between ranks
0 = no correlation between ranks.
Example Question:
The scores for nine students in physics and math are as follows:
Compute the students’ ranks in the two subjects and compute the Spearman rank correlation.
Step 1: Find the ranks for each individual subject. I used the Excel rank function to find the
ranks. If you want to rank by hand, order the scores from greatest to smallest; assign the rank 1
to the highest score, 2 to the next highest and so on:
Step 2: Add a third column, d, to your data. The d is the difference between ranks. For example,
the first student’s physics rank is 3 and math rank is 5, so the difference is 2 points. In a fourth
column, square your d values.
Step 5: Insert the values into the formula. These ranks are not tied, so use the first formula:
= 1 – (6*12)/(9(81-1))
= 1 – 72/720
= 1-0.1
= 0.9
The Spearman Rank Correlation for this set of data is 0.9.
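Another option is simply to use the full version of Spearman’s formula (actually a slightly
modified Pearson’s r), which will deal with tied ranks:

$$\rho = \frac{\sum \left(R(x_i) - \overline{R(x)}\right)\left(R(y_i) - \overline{R(y)}\right)}{\sqrt{\sum \left(R(x_i) - \overline{R(x)}\right)^2 \, \sum \left(R(y_i) - \overline{R(y)}\right)^2}}$$

Where:
R(x_i), R(y_i) – the ranks of the i-th x and y values
\overline{R(x)}, \overline{R(y)} – the mean ranks of x and y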
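A minimal Python sketch of this tie-aware version: it applies Pearson's r directly to the ranks. The ranks are taken from the convenience-store table earlier in this handout, which contains tied prices, so the full version and the simplified formula give slightly different answers. The helper name `pearson_r` is an assumption of this sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation via the deviation-from-the-mean formula."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Ranks copied from the convenience-store table (prices contain ties).
rank_distance = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
rank_price = [2, 3.5, 1, 6, 6, 3.5, 9, 10, 6, 8]

# Full version: Pearson's r computed on the ranks.
rho_full = pearson_r(rank_distance, rank_price)

# Simplified version, for comparison.
n = len(rank_distance)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_distance, rank_price))
rho_simple = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(round(rho_full, 3), round(rho_simple, 3))
```

For these ranks the full version gives roughly -0.76 against -0.73 from the simplified formula. When there are no ties, the two formulas agree exactly.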
Two referees in a flower beauty competition rank the 10 types of flowers as follows:
Use the rank correlation coefficient to find the degree of agreement
between the referees.
Solution:
Interpretation: The degree of agreement between referees ‘A’ and ‘B’ is 0.636, so
they show “strong agreement” in evaluating the competitors.
What is a Scatter Diagram?
A scatter diagram (also known as scatter plot, scatter graph, and correlation chart) is a tool for
analyzing relationships between two variables for determining how closely the two variables are
related. One variable is plotted on the horizontal axis and the other is plotted on the vertical axis.
The pattern of their intersecting points can graphically show relationship patterns.
A scatter diagram is most often used to investigate possible cause-and-effect relationships. While
the diagram shows relationships, it does not by itself prove that one variable causes the other.
Thus, we can use a scatter diagram to examine theories about cause-and-effect relationships and
to search for root causes of an identified problem.
For example, suppose we want to analyze the pattern of motorcycle accidents on a highway. We select
two variables, motorcycle speed and number of accidents, and draw the diagram. Once the
diagram is completed, we notice that as the speed of the motorcycles increases, the number of accidents
also goes up. This shows that there is a relationship between speed and the number of accidents
on the highway.