
Correlation and regression

Associate Professor Georgi Iskrov, PhD


Department of Social Medicine
and Public Health
Outline
• Correlation
• Pearson correlation coefficient
• Spearman rank correlation coefficient
• Phi coefficient
• Coefficient of determination
• Regression
• Simple linear regression
• Least squares method
Correlation
• A correlation exists when changes in one variable tend to
be accompanied by consistent and predictable changes
in the other variable.
• Correlation quantifies the linear association between
two variables.
• The measurement scales used should be at least
interval (Pearson correlation coefficient), but specific
correlation coefficients (Spearman rank correlation
coefficient, Phi coefficient) are available to handle
other types of data.
Correlation
• The direction and strength of the correlation are
expressed by the correlation coefficient r.
• The value of this coefficient can range between –1 and +1:

[Scale: values near –1 or +1 indicate a strong correlation, values in between a moderate one, and values near 0 a weak one; the sign gives the direction – negative below 0, positive above 0]
Correlation
• r = 0 -> there is no linear association
• r = 1 -> there is a perfect positive association
• r = -1 -> there is a perfect negative association
– Perfect association means that we could use a
precise formula to describe the association between X
and Y.
– r = 0 does not mean that there is no association at all.
It only means that, if an association exists, it is
certainly not a linear one (see the sketch below).
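
A minimal Python sketch (not from the slides; the data are invented for illustration) showing how a perfect but non-linear association still gives r = 0:

import numpy as np
from scipy.stats import pearsonr

# Y is perfectly determined by X (Y = X^2), yet the
# relationship is quadratic, not linear
x = np.linspace(-3, 3, 61)
y = x ** 2

r, _ = pearsonr(x, y)
print(f"r = {r:.3f}")  # 0.000: no linear association,
                       # despite a perfect quadratic one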
Correlation
• The value of the correlation coefficient does not depend
on the specific measurement units used.
– For example, the correlation between height and
weight will be identical regardless of whether inches
and pounds, or centimeters and kilograms are used
as measurement units.
• In order to evaluate the correlation between variables, it
is important to confirm the statistical significance of
this correlation as well.
– The p-value is a primary source of information about
the reliability of the correlation.
Correlation
• The most widely-used type of correlation coefficient is
Pearson r, also called linear or product-moment
correlation. For Pearson correlation, both X and Y values
must be sampled from populations that follow a normal
distribution.
– Pearson r assumes that the variables under
consideration were measured on at least an interval
scale.
• Spearman rank correlation rs does not make this
assumption. This non-parametric method separately
ranks X and Y values and then calculates the correlation
between the two sets of ranks.
– Spearman rs assumes that the variables under
consideration were measured on at least an ordinal
scale.
Correlation
• Pearson correlation quantifies the linear relationship
between X and Y.
– As X goes up, does Y go up a consistent amount?
• Spearman correlation quantifies the monotonic
relationship between X and Y (a monotonic function is one
that is either entirely non-increasing or entirely
non-decreasing).
– As X goes up, does Y go up as well (by any amount)?
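
To make the distinction concrete, here is a small Python sketch (invented data) comparing the two coefficients on a monotonic but non-linear relationship:

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Y always goes up when X goes up, but not by a consistent amount
x = np.linspace(1, 10, 30)
y = np.exp(x)

r, _ = pearsonr(x, y)    # noticeably below 1: not a straight line
rs, _ = spearmanr(x, y)  # exactly 1: perfectly monotonic

print(f"Pearson r   = {r:.3f}")
print(f"Spearman rs = {rs:.3f}")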
Assumptions
• The assumptions for Pearson correlation coefficient
are:
– Level of measurement (each variable should be
measured on an interval scale at least)
– Related pairs (each unit of observation should have a
pair of values)
– Absence of outliers (outliers can skew the results of
the correlation by pulling the line of best fit too far in
one direction or another)
– Normality
– Linearity (this could be checked by a scatterplot)
– Homoscedasticity (this could be checked by a
scatterplot)
Assumptions
• Linearity and homoscedasticity refer to the shape of the
values formed by the scatterplot.
• For linearity, a “straight line” relationship between the
variables should be formed. If a line were drawn through
all the dots, it should be straight and not curved.
• Homoscedasticity (“same scatter”) refers to the distances
of the points from that straight line. The shape of the
scatterplot should be tube-like.
– A sequence of random variables is homoscedastic if
all its random variables have the same finite
variance. This is also known as homogeneity of
variance.
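
As a hands-on illustration (my own sketch with simulated data), both checks can be eyeballed by plotting the raw values and the residuals around the fitted line:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Simulated data: linear trend with the same error variance everywhere
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=100)

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y)
ax1.set_title("Linearity: points should follow a straight line")
ax2.scatter(x, residuals)
ax2.axhline(0, color="red")
ax2.set_title("Homoscedasticity: an even 'tube' around zero")
plt.show()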
Phi-coefficient
• The phi-coefficient is actually a variation of Pearson’s
definition of r when each variable has two states that
are given values of 0 and 1 respectively (dichotomous or
binary variables).
                     Cases (+)   Non-cases (-)   Total
Exposed (+)              a            b           a+b
Non-exposed (-)          c            d           c+d
Total                   a+c          b+d        a+b+c+d
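
In the notation of the table above, the phi coefficient is computed as (standard formula, supplied here for reference):

φ = (a × d – b × c) / √((a+b)(c+d)(a+c)(b+d))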


Phi-coefficient

                     Cases (+)   Non-cases (-)   Total
Vaccinated (+)           2           148          150
Non-vaccinated (-)     148           102          250
Total                  150           250          400
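
A minimal Python sketch (my own) applying the formula above to this table and cross-checking the result against Pearson r on 0/1-coded data:

from math import sqrt
from scipy.stats import pearsonr

# Counts from the 2x2 table above
a, b = 2, 148    # vaccinated: cases, non-cases
c, d = 148, 102  # non-vaccinated: cases, non-cases

phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(f"phi = {phi:.3f}")  # -0.579: vaccination is negatively
                           # associated with being a case

# Cross-check: phi equals Pearson r on dichotomous 0/1 data
exposure = [1] * (a + b) + [0] * (c + d)
outcome = [1] * a + [0] * b + [1] * c + [0] * d
r, _ = pearsonr(exposure, outcome)
print(f"r   = {r:.3f}")  # same value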


Beware!
• Correlation and coincidence

[Figure: examples of spurious correlations]
Source: www.tylervigen.com
Beware!
• Do not attempt to interpret a
correlation coefficient without
looking at the corresponding
scatter plot!
• Each data set has a correlation
coefficient of 0.7
Beware!

Association is not
causation.
The observed association between two
variables might be due to the action of
a third, unobserved variable.
Coefficient of determination
• Coefficient of determination r2 is the proportion of the
variance in the dependent variable that is predictable
from the independent variable/variables.
• r = 0.70 (association between height and weight)
• r2 = 0.49
– 49% of the variance in weight is explained by /
predictable from height
– 51% of the variance in weight is not explained by / not
predictable from height
Data modeling
Birth weight Y [kg] vs Gestational age X [weeks]

[Scatterplot: birth weight in kg (Y-axis, 0 to 5) against gestational age in weeks (X-axis, 0 to 45)]

• The bigger the gestational age, the higher the birth weight…


Data modeling
• Deterministic models:
– All data are known beforehand
– Once you run the model, you know exactly what is
going to change and by how much.
• Probabilistic models:
– Element of chance is involved
– You know the likelihood that something will change,
but you do not know exactly when and by how much.
Data modeling
• Probabilistic models:
– Correlation models
– Regression models
• Simple vs Multiple (one predictor variable vs many
predictor variables)
• Linear vs Non-linear
Data modeling
X-axis Y-axis
independent dependent
predictor predicted
carrier response
input output
Linear regression
• Origin of the term regression – children of tall parents tend
to be shorter than their parents and vice versa; children
“regressed” toward the mean height of all children.
• Model – equation that describes the relationship
between variables.
• Parameters (regression coefficients)
• Simple linear regression
– Yt = a + (b × X) + ε
– Outcome = Intercept + (Slope × Predictor) + Random
Error
• Multiple regression
– Yt = a + (b1 × X1) + (b2 × X2) + … + (bn × Xn) + ε
Linear regression
Birth weight Yt predicted based on gestational age X

[Scatterplot with fitted regression line: y = 0.2415x – 5.7, R2 = 0.764; birth weight in kg (Y-axis, 0 to 5) against gestational age in weeks (X-axis, 0 to 45)]

• There are many straight lines that could be drawn through
the data.
• How to choose the line of best fit among them?
Least squares method
• Residual – the difference between the actual Y value and
the Yt value predicted by the model.
• Least squares method finds the values of the slope and
intercept that minimize the sum of the squares of the
residuals.
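
A minimal sketch of the computation in Python (the data points below are invented to mimic the earlier birth-weight scatterplot; they are not the lecture's actual data):

import numpy as np

# Hypothetical (X, Y) pairs: gestational age in weeks, birth weight in kg
x = np.array([28, 30, 32, 34, 35, 36, 38, 39, 40, 41])
y = np.array([1.1, 1.6, 2.0, 2.5, 2.6, 3.0, 3.3, 3.4, 3.9, 4.1])

# Closed-form least squares estimates of slope b and intercept a
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Coefficient of determination: 1 - residual SS / total SS
y_pred = a + b * x
r2 = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"Yt = {b:.4f} * X + {a:.4f},  r2 = {r2:.3f}")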
Least squares method
• In regression, the coefficient of determination r2 is a
statistical measure of how well the regression line
approximates the real-world data.
• An r2 of 1 indicates that the regression line perfectly fits
the data.
• An r2 < 0.50 (r < 0.70) is not considered relevant from a
clinical point of view.
Assumptions
• Linear relationship (this could be checked by a
scatterplot)
• Multivariate normality (any linear combination of the
variables is also normally distributed)
• No or little multicollinearity (multicollinearity occurs when
the independent variables are too highly correlated with
each other)
• No auto-correlation (autocorrelation occurs when the
residuals are not independent from each other)
• Homoscedasticity (this could be checked by a scatter
plot, i.e. residuals are equal across the regression line)
Assumptions
• In the case of simple linear regression, the two variables
should be measured on either an interval or a ratio scale.
• Recommended, but not mandatory: sample size
(regression analysis requires at least 20 cases per
independent variable in the analysis)
Interpolation and
extrapolation
• Interpolation is making a prediction within the range of
values of the predictor in the sample used to generate
the model.
• Extrapolation is making a prediction outside the range
of values of the predictor in the sample used to generate
the model.
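
Using the fitted model from the earlier slide (Yt = 0.2415 × X – 5.7), and assuming the sample covered roughly 25–42 weeks of gestation, a quick Python sketch of the difference:

def predict_weight(weeks):
    # Fitted model from the earlier slide: Yt = 0.2415 * X - 5.7
    return 0.2415 * weeks - 5.7

print(predict_weight(35))  # interpolation: ~2.75 kg, plausible
print(predict_weight(0))   # extrapolation to X = 0: -5.7 kg, impossible
print(predict_weight(60))  # far extrapolation: ~8.79 kg, unrealistic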
Interpolation and
extrapolation
• Extrapolating back to zero is not always meaningful
from a physiological point of view (see the intercept
coefficient).
• Extrapolating way beyond the range of the data, even
though mathematically correct, may be risky from a clinical
point of view.
• In order to evaluate the regression model, it is important
to confirm the statistical significance of the regression
coefficients. The p-value is a primary source of
information about the reliability of the model.
• However, the coefficient of determination is important
as well.
