[Figure: scale of the correlation coefficient r, from -1 (negative) through 0 to 1 (positive)]
Correlation
• r = 0 -> there is no linear association
• r = 1 -> there is a perfect positive linear association
• r = -1 -> there is a perfect negative linear association
– Perfect association means that an exact linear
formula describes the association between X
and Y.
– r = 0 does not mean that there is no association at all.
It means only that the association, if one exists,
is not linear.
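The point that r = 0 rules out only a linear association can be checked numerically. The sketch below uses made-up data and a small helper, `pearson_r`, written here for illustration (it is not from the slides): Y = X² on a symmetric range of X is a perfect, formula-driven association, yet its Pearson r is zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative data only (assumption): X symmetric around zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_quadratic = [x ** 2 for x in xs]    # exact but non-linear relationship
ys_linear = [2 * x + 1 for x in xs]    # exact linear relationship

r_quadratic = pearson_r(xs, ys_quadratic)
r_linear = pearson_r(xs, ys_linear)

print(f"r for Y = X^2:    {r_quadratic:.3f}")
print(f"r for Y = 2X + 1: {r_linear:.3f}")
```

A scatterplot of the quadratic data would show the U-shaped pattern immediately, which is why the coefficient should be read together with the plot.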
Correlation
• The value of the correlation coefficient does not depend
on the specific measurement units used.
– For example, the correlation between height and
weight will be identical regardless of whether inches
and pounds, or centimeters and kilograms are used
as measurement units.
• When evaluating the correlation between variables, it
is important to check the statistical significance of
the correlation as well.
– The p-value indicates how likely a correlation at least
this strong would be in the sample if there were no
correlation in the population.
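The unit-independence of r mentioned above can be illustrated with a short sketch. The height and weight values are invented for the example, and `pearson_r` is a helper defined here rather than anything from the slides. Converting inches to centimeters and pounds to kilograms multiplies each variable by a positive constant, which leaves r unchanged.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up measurements (assumption, for illustration only).
heights_in = [60, 63, 66, 69, 72]        # inches
weights_lb = [115, 130, 150, 160, 190]   # pounds

# The same measurements expressed in metric units.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)

print(f"r (inches, pounds):      {r_imperial:.6f}")
print(f"r (centimeters, kg):     {r_metric:.6f}")
```

The two printed values agree up to floating-point rounding.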
Correlation
• The most widely used type of correlation coefficient is
Pearson r, also called linear or product-moment
correlation. For Pearson correlation, both X and Y values
must be sampled from populations that follow a normal
distribution.
– Pearson r assumes that the variables under
consideration were measured on at least an interval
scale.
• Spearman rank correlation rs does not make this
assumption. This non-parametric method separately
ranks X and Y values and then calculates the correlation
between the two sets of ranks.
– Spearman rs assumes that the variables under
consideration were measured on at least an ordinal
scale.
Correlation
• Pearson correlation quantifies the linear relationship
between X and Y.
– As X goes up, does Y go up a consistent amount?
• Spearman correlation quantifies the monotonic
relationship between X and Y (a monotonic function is
one that is either entirely non-increasing or entirely
non-decreasing).
– As X goes up, does Y go up as well (by any amount)?
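The difference between the two coefficients can be sketched by computing Spearman's coefficient as the Pearson correlation of the ranks, as described above. The helpers `pearson_r`, `average_ranks`, and `spearman_rs` are written for this example (assumed names, not from the slides), and the data are artificial: Y = exp(X) always increases with X, but not by a consistent amount.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Artificial data (assumption): monotonic but strongly non-linear.
xs = [1, 2, 3, 4, 5]
ys = [math.exp(x) for x in xs]

r = pearson_r(xs, ys)
rs = spearman_rs(xs, ys)

print(f"Pearson r:   {r:.3f}")   # below 1: the increase is not consistent
print(f"Spearman rs: {rs:.3f}")  # 1: Y always goes up when X goes up
```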
Assumptions
• The assumptions for Pearson correlation coefficient
are:
– Level of measurement (each variable should be
measured on an interval scale at least)
– Related pairs (each unit of observation should have a
pair of values)
– Absence of outliers (outliers can skew the results of
the correlation by pulling the line of best fit too far in
one direction or another)
– Normality
– Linearity (this could be checked by a scatterplot)
– Homoscedasticity (this could be checked by a
scatterplot)
Assumptions
• Linearity and homoscedasticity refer to the shape
formed by the points in the scatterplot.
• For linearity, the variables should show a “straight line”
relationship. If a line were drawn through the cloud of
points, it should be straight and not curved.
• Homoscedasticity (“same scatter”) refers to the
distance of the points from that straight line. The
shape of the scatterplot should be tube-like, with roughly
the same spread along the whole line.
– A sequence of random variables is homoscedastic if
all its random variables have the same finite
variance. This is also known as homogeneity of
variance.
Phi-coefficient
• The phi-coefficient is a special case of Pearson’s
r in which each variable has only two states,
coded 0 and 1 (dichotomous or binary variables).
[Table: contingency table with columns Cases (+), Non-cases (-), Total]
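As a rough check that phi is Pearson's r applied to 0/1 data, the sketch below computes phi from the four cell counts of a 2×2 table using the standard formula (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)), then compares it with Pearson's r on the same observations coded as 0 and 1. The counts, the exposed/unexposed row labels, and the function names are assumptions made for illustration; the original table values are not available.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table laid out as:
                    cases (+)   non-cases (-)
       exposed          a            b
       unexposed        c            d
    """
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator

# Hypothetical counts (assumption, not taken from the slides).
a, b, c, d = 30, 10, 15, 45

# Expand the table into one (exposure, outcome) pair per individual.
exposure = [1] * (a + b) + [0] * (c + d)
outcome = [1] * a + [0] * b + [1] * c + [0] * d

phi = phi_coefficient(a, b, c, d)
r_binary = pearson_r(exposure, outcome)

print(f"phi from the table:   {phi:.4f}")
print(f"Pearson r on 0/1 data: {r_binary:.4f}")
```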
Source: www.tylervigen.com
Beware!
• Do not attempt to interpret a
correlation coefficient without
looking at the corresponding
scatter plot!
• Each data set has a correlation
coefficient of 0.7
Beware!
Association is not
causation.
The observed association between two
variables might be due to the action of
a third, unobserved variable.
Coefficient of determination
• The coefficient of determination r² is the proportion of the
variance in the dependent variable that is predictable
from the independent variable(s).
• r = 0.70 (association between height and weight)
• r² = 0.49
– 49% of the variance in weight is explained by /
predictable from height
– 51% of the variance in weight is not explained by / not
predictable from height
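The arithmetic above, and the reading of r squared as "explained variance", can be sketched as follows. For a straight-line least-squares fit with one predictor, the squared Pearson r equals 1 minus the ratio of residual to total sum of squares. The data points and helper names are invented for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def r_squared_from_fit(xs, ys):
    """Fit y = intercept + slope * x by least squares; return 1 - SS_res / SS_tot."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# The slide's example: r = 0.70 between height and weight.
r_slide = 0.70
explained_pct = round(100 * r_slide ** 2)   # share of variance explained
unexplained_pct = 100 - explained_pct       # share left unexplained
print(f"explained: {explained_pct}%, unexplained: {unexplained_pct}%")

# Made-up data (assumption) to show r^2 matches the variance decomposition.
xs = [150, 160, 165, 170, 180, 185]
ys = [52, 58, 70, 63, 80, 77]
r2_from_r = pearson_r(xs, ys) ** 2
r2_from_fit = r_squared_from_fit(xs, ys)
print(f"r^2 from r: {r2_from_r:.4f}, from the fit: {r2_from_fit:.4f}")
```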
Data modeling
Birth weight Y [kg] vs Gestational age X [weeks]
[Figure: scatter plot with fitted regression line; x-axis 0 to 45 weeks, y-axis 0 to 5 kg]
• Fitted line: y = 0.2415x - 5.7
• R² = 0.764
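The fitted line y = 0.2415x - 5.7 can be used to predict birth weight from gestational age. The sketch below simply evaluates that line; the function name and the example ages are chosen here for illustration. The line is only meaningful within the range of gestational ages that were actually observed: at x = 0 it would predict -5.7 kg.

```python
SLOPE_KG_PER_WEEK = 0.2415   # slope of the fitted line from the plot
INTERCEPT_KG = -5.7          # intercept of the fitted line from the plot

def predict_birth_weight_kg(gestational_age_weeks):
    """Evaluate the fitted regression line at a given gestational age."""
    return SLOPE_KG_PER_WEEK * gestational_age_weeks + INTERCEPT_KG

# Example ages (assumption): values inside the plotted x-range.
for weeks in (30, 35, 40):
    weight = predict_birth_weight_kg(weeks)
    print(f"{weeks} weeks -> predicted birth weight {weight:.2f} kg")
```

Each additional week of gestation adds about 0.24 kg to the predicted weight, which is what the slope expresses.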