Biostatistics Series Module 6: Correlation and Linear Regression
Figure 1: Scatter diagram depicting direct and inverse linear relationships
Figure 2: Scatter diagram depicting a curvilinear relationship
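This module notes that a CI for r is obtained through Fisher's z transformation. A minimal sketch of that calculation, using only the Python standard library (the values r = 0.7 and n = 50 are made-up illustrations, not data from the module):

```python
import math

def pearson_r_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation coefficient via Fisher's z.

    z = atanh(r) is approximately normal with standard error
    1/sqrt(n - 3); the interval is back-transformed with tanh.
    """
    z = math.atanh(r)                    # Fisher's z transformation
    se = 1.0 / math.sqrt(n - 3)          # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

lo, hi = pearson_r_ci(0.7, 50)           # e.g., r = 0.7 observed in n = 50 pairs
print(f"95% CI for r: ({lo:.2f}, {hi:.2f})")
```

Note that the back-transformed interval is not symmetric about r; this is expected, because the sampling distribution of r is skewed when the population correlation is nonzero.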
If one or both variables in a correlation analysis are not normally distributed, a rank correlation coefficient, which depends on the rank order of the values rather than the actual observed values, can be calculated. Examples include Spearman's rho (ρ) (after Charles Edward Spearman) and Kendall's tau (τ) (after Maurice George Kendall) statistics. In essence, Spearman's rank correlation coefficient rho, which is the more frequently used nonparametric correlation, is simply Pearson's product moment correlation coefficient calculated for the rank values of x and y rather than their actual values. It is also appropriate to use ρ rather than r when at least one variable is measured on an ordinal scale or when the sample size is small (say n ≤ 10); ρ is also less sensitive than r to deviations from a linear relation.

Although less often used, Kendall's tau is another nonparametric correlation offered by many statistical packages. Some statisticians recommend that it be used, rather than Spearman's coefficient, when the data set is small with the possibility of a large number of tied ranks. This means that if we rank all of the scores and many scores have the same rank, Kendall's tau should be used. It is also considered to be a more accurate gauge of the correlation in the underlying population.

Calculation of the CI of r requires r to be transformed to give a normal distribution by making use of Fisher's z transformation. The width of the CI depends on the sample size, and it is possible to calculate the sample size required for a given level of accuracy.

Coefficient of Determination

When exploring the linear relationship between numerical variables, a part of the variation in one of the variables can be thought of as being due to its relationship with the other variable, with the rest due to undetermined (often random) causes. A coefficient of determination can be calculated to denote the proportion of the variability of y that can be attributed to its linear relation with x. This is taken simply as r multiplied by itself, that is, r². It is also denoted as R².

For any given value of r, r² will denote a value that is closer to 0 and will be devoid of a sign. Thus, if r is +0.7 or −0.7, r² will be 0.49. We can interpret this 0.49 figure as meaning that 49% of the variability in y is due to variation in x. Values of r² close to 1 imply that most of the variability in y is explained by its linear relationship with x. The value (1 − r²) has sometimes been referred to as the coefficient of alienation.

In statistical modeling, the r² statistic gives information about the goodness of fit of a model. In regression, it
denotes how well the regression line approximates the real data points. An r² of 1 indicates that the regression line perfectly fits the data. Note that values of r² outside the range 0–1 can occur when it is used to measure the agreement between observed and modeled values and the modeled values are not obtained by linear regression.

Point Biserial and Biserial Correlation

The point biserial correlation is a special case of the product-moment correlation, in which one variable is continuous and the other variable is binary. The point biserial correlation coefficient measures the association between a binary variable x, taking values 0 or 1, and a continuous numerical variable y. It is assumed that for each value of x, the distribution of y is normal, with different means but the same variance. It is often abbreviated as rPB.

The binary variable frequently has categories such as yes or no, present or absent, success or failure, etc.

If the variable x is not naturally dichotomous but is artificially dichotomized, we calculate the biserial correlation coefficient rB instead of the point-biserial correlation coefficient.

Although not often used, an example where we may apply the point biserial correlation coefficient would be in cancer studies. How strong is the association between administering the anticancer drug (active drug vs. placebo) and the length of survival after treatment? The value would be interpreted in the same way as Pearson's r. Thus, the value would range from −1 to +1, where −1 indicates a perfect inverse association, +1 indicates a perfect direct association, and 0 indicates no association at all. Take another example. Suppose we want to calculate the correlation between intelligence quotient and the score on a certain test, but not all the test scores are available to us, although we know whether each subject passed or failed. We could then use the biserial correlation.

The Phi Coefficient

The phi coefficient (also called the mean square contingency coefficient) is a measure of association for two binary variables. It is denoted as ϕ or rϕ.

Also introduced by Karl Pearson, this statistic is similar to Pearson's correlation coefficient in its derivation. In fact, a Pearson's correlation coefficient estimated for two binary variables will return the phi coefficient. The interpretation of the phi coefficient requires caution. It has a maximum value that is determined by the distribution of the two variables. If both have a 50/50 split, values of phi will range from −1 to +1.

Application of the phi coefficient is particularly seen in educational and psychological research, in which the use of dichotomous variables is frequent. Suppose we are interested in exploring whether, in a group of students opting for university courses, there is gender preference for physical sciences or life sciences. Here, we have two binary variables – gender (male or female) and discipline (physical science or life science). We can arrange the data as a 2 × 2 contingency table and calculate the phi coefficient.

Simple Linear Regression

If two variables are highly correlated, it is then feasible to predict the value of one (the dependent variable) from the value of the other (the independent variable) using regression techniques. In simple linear regression, the value of one variable (x) is used to predict the value of the other variable (y) by means of a simple mathematical function, the linear regression equation, which quantifies the straight-line relationship between the two variables. This straight line, or regression line, is actually the "line of best fit" for the data points on the scatter plot showing the relationship between the variables in question.

The regression line has the general formula:

y = a + bx.

where "a" and "b" are two constants denoting the intercept of the line on the Y-axis (y-intercept) and the gradient (slope) of the line, respectively. The other name for b is the "regression coefficient."

Physically, "b" represents the change in y for every 1 unit change in x, while "a" represents the value that y would take if x were 0. Once the values of a and b have been established, the expected value of y can be predicted for any given value of x, and vice versa. Thus, a model for predicting y from x is established. There may be situations in which a straight line passing through the origin will be appropriate for the data, and in these cases the equation of the regression line simplifies to y = bx.

But how do we fit a straight line to a scattered set of points which seem to be in a linear relationship? If the points are not all on a single straight line, we can, by eye estimation, draw multiple lines that seem to fit the series of data points on the scatter diagram. But which is the line of best fit? This problem had mathematicians stumped literally for centuries. The solution came in the form of the method of least squares, which was first published by the French mathematician Adrien-Marie Legendre in 1805 but used earlier by Carl Friedrich Gauss in Germany in 1795 to calculate the orbits of celestial bodies. Gauss developed the idea further and today is known as the father of regression.

Look at Figure 4. When we have a scattered series of dots which are lying approximately but not exactly on a straight line, we can, by eye estimation, draw a
number of lines that seem to fit the series. But which is the line of best fit? The method of least squares, in essence, selects the line that would provide the least sum of squares for the vertical residuals or offsets. In Figure 4, it is line B, as shown in the lower panel. For a particular value of x, the vertical distance between the observed and fitted value of y is known as the residual or offset. Since some of the residuals are above and some below the line of best fit, we require a + or − sign to denote the residuals mathematically. Squaring removes the effect of the − sign. The method of least squares finds the values of "a" and "b" that minimize the sum of the squares of all the residuals. The method of least squares is not the only technique, but it is regarded as the simplest technique for linear regression, that is, for the task of finding the straight line of best fit for a series of points depicting a linear relationship on a scatter diagram.

Figure 4: The principle of the method of least squares for linear regression. The sum of the squared residuals is the least for the line of best fit

You may wonder why the statistical procedure of fitting a line is called "regression," which in common usage means "going backward." Interestingly, the term was used neither by Legendre nor by Gauss but is attributed to the English scientist Francis Galton, who had a keen interest in heredity. In Victorian England, Galton measured the heights of 202 fathers and their first-born adult sons and plotted them on a graph of median height versus height group. The scatter for fathers and sons approximated to two lines that intersected at a point representing the average height of the adult English population. Studying this plot, Galton made the very interesting observation that tall fathers tend to have tall sons, but they are not as tall as their fathers, and short fathers tend to have short sons, but they are not as short as their fathers; and in the course of just two or three generations, the height of individuals tended to go back or "regress" to the mean population height. He published a famous paper titled "Regression towards mediocrity in hereditary stature." This phenomenon of regression to the mean can be observed in many biological variables. The term regression subsequently somehow got attached to the procedure of line fitting itself.

Note that in our discussion above, we have discussed the predictive relationship between two numerical variables. This is simple linear regression. If the value of y requires more than one numerical variable for a reasonable prediction, we are encountering the situation called multiple linear regression. We will be discussing the basics of this in a future module.

Pitfalls in Correlation and Regression Analysis

Correlation and linear regression analysis are based on certain assumptions pertaining to the data sets. If these assumptions are not met, conclusions drawn can be misleading. Both assume that the relationship between the two variables is linear. The observations have to be independent – they are not independent if there is more than one pair of observations (that is, repeat measurements) from one individual. For correlation, both variables should be random variables, although for regression, only the response variable y needs to be random.

Figure 5: Examples of misleading correlations

Inspecting a scatter plot is of utmost importance before estimation of the correlation coefficient for many reasons:
• A nonlinear relationship may exist between two variables that would be inadequately described, or possibly even undetected, by the correlation coefficient. For instance, the correlation coefficient, if calculated for the set of data points in Figure 2, would be almost zero, but we would be grossly wrong if we concluded that there is no association between the variables. The zero coefficient only tells us that there is no linear (straight-line) association between the variables, when in reality there is a clear curvilinear (curved-line) association between them
• An outlier may create a false correlation. Inspect the scatter plot in Figure 5a. The r value of 0.923
suggests a strong correlation. However, a closer look makes it obvious that the series of dots is actually quite scattered, and the apparent correlation is being created by the outlier point. This kind of outlier is called a univariate outlier. If we consider the x value of this point, it is way beyond the range of the rest of the x values; similarly, the y value of the point is much beyond the range of y values for the rest of the dots. A univariate outlier is easy to spot by simply sorting the values or constructing boxplots
• Conversely, an outlier can also spoil a correlation. The scatter plot in Figure 5b suggests only moderate correlation at an r value of 0.583, but closer inspection reveals that a single bivariate outlier reduces what is otherwise an almost perfect association between the variables. Note that the deviant case is not an outlier in the usual univariate sense. Individually, its x value and y value are unexceptional. What is exceptional is the combination of values on the two variables that it exhibits, making it an outlier, and this would be evident only on a scatter plot
• Clustering within datasets may also inflate a correlation. Look at Figure 5c. Two clusters are evident, and individually they do not appear to show strong correlation. However, combining the two suggests a decent correlation. This combination may be undesirable in real life. Clustering within datasets may be a pointer that the sampling has not really been random.

When using a regression equation for prediction, errors in prediction may not be just random but may also be due to inadequacies in the model. In particular, extrapolating beyond the range of observed data can be risky and is best avoided. Consider the simple regression equation:

Weight = a + b × height.

Suppose we give a height value of 0. The corresponding weight value, strangely, is not 0 but equals a. What is the matter here? Is the equation derived through regression faulty? The fact is that the equation is not at fault, but we are trying to extrapolate its use beyond the range of values used in deriving the equation through least squares regression. This is a common pitfall. Equations derived from one sample should not be automatically applied to another sample. Equations derived from adults, for instance, should not be applied to children.

One cannot automatically assume a cause and effect relationship between two variables that are correlated, even if the correlation is strong. In other words, correlation does not imply causation.

As a now widely stated example, numerous epidemiological studies showed that women taking combined hormone replacement therapy (HRT) also had a lower-than-average incidence of coronary heart disease (CHD). The correlation was strong, leading researchers to propose that HRT was protective against CHD. However, subsequent randomized controlled trials showed that HRT causes a small but statistically significant increase in the risk of CHD. Reanalysis of the data from the epidemiological studies showed that women undertaking HRT were more likely to be from higher socioeconomic groups, with better-than-average diet and exercise regimens. The use of HRT and the decreased incidence of CHD were coincident effects of a common cause (i.e., the benefits associated with a higher socioeconomic status), rather than a direct cause and effect, as had been supposed.

Correlation may simply be due to chance. For example, one could compute r between the abdominal circumference of subjects and their shoe sizes, intelligence, or income. Irrespective of the value of r, these associations would make no sense.

For any two correlated variables, A and B, the following relationships are possible:
• A causes B or vice versa (direct causation)
• A causes C, which causes B, or the other way round (indirect causation)
• A causes B and B causes A (bidirectional or cyclic causation)
• A and B are consequences of a common cause but do not cause each other
• There is no causal connection between A and B; the correlation is just coincidence.

Thus, causality ascertainment requires consideration of several other factors, including temporal relationship, dose-effect relationship, effect of dechallenge and rechallenge, and biological plausibility. Of course, a strong correlation may be an initial pointer that a cause-effect relationship exists, but per se it is not sufficient to infer causality. There must also be no reasonable alternative explanation that challenges causality. Establishing causality is one of the most daunting challenges in both public health and drug research. Carefully controlled studies are needed to address this question.
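Several of the pitfalls above can be reproduced numerically. The sketch below (with made-up illustrative data, not the values behind the module's figures) shows a single univariate outlier inflating Pearson's r for an otherwise weakly related cloud of points, while the rank-based Spearman's ρ is far less affected; the helper functions assume no tied values.

```python
def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on ranks (assumes no ties)."""
    rank = lambda v: [sorted(v).index(e) + 1 for e in v]
    return pearson(rank(x), rank(y))

# A weakly related cloud of eight points (illustrative values).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 2, 7, 1, 8, 3, 6, 4]
print(round(pearson(x, y), 3))    # near zero

# Adding one extreme outlier inflates Pearson's r dramatically...
x2, y2 = x + [40], y + [40]
print(round(pearson(x2, y2), 3))  # close to 1

# ...but Spearman's rho, computed on rank order alone, stays modest.
print(round(spearman(x2, y2), 3))
```

This mirrors the Figure 5a scenario: sorting the values or a boxplot would flag the point (40, 40) immediately, and the gap between the two coefficients is itself a warning sign worth checking on a scatter plot.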
Correlation is sometimes wrongly used to assess the agreement between two measurement methods. Two sets of measurements would be perfectly correlated if the scatter diagram shows that they all lie on a single straight line, but they are not likely to be in perfect agreement unless this line passes through the origin. It is very likely that two tests designed to measure the same variable would return figures that would be strongly correlated, but that does not automatically mean that the repeat measurements are also in strong agreement. Data which seem to be in poor agreement can produce quite high correlations. In addition, a change in scale of measurement does not affect the correlation, but it can affect the agreement.

Bland–Altman Plot

Bland and Altman devised a simple but informative graphical method of comparing repeat measurements. When repeat measurements have been taken on a series of subjects or samples, the difference between pairs of measurements (Y-axis) is plotted against the arithmetic mean of the corresponding measurements (X-axis). The resulting scatter diagram is the Bland–Altman plot (after John Martin Bland and Douglas G. Altman, who first proposed it in 1983 and then popularized it through a Lancet paper in 1986), an example of which is given in Figure 6. The repeat measurements could represent results of two different assay methods or scores from the same subjects by two different raters.

Computer software that draws a Bland–Altman plot can usually add a 'bias' line parallel to the X-axis. This represents the difference between the means of the two sets of measurements. Lines denoting 95% limits of agreement (mean difference ± 1.96 SD of the differences) can be added on either side of the bias line. Alternatively, lines denoting 95% confidence limits of the mean of the differences can be drawn surrounding the bias line.

Bland–Altman plots are generally interpreted informally. Three things may be looked at:
• How big is the average discrepancy between the methods, which is indicated by the position of the bias line. This discrepancy may be too large to accept clinically. However, if the differences within mean ± 1.96 SD are not clinically important, the two methods may be used interchangeably
• Whether the scatter around the bias line is too much, with a number of points falling outside the 95% agreement limit lines
• Whether the difference between the methods tends to get larger or smaller as the values increase. If it does, as is indicated in Figure 7, it indicates the existence of a proportional bias, which means that the methods do not agree equally through the range of measurements.

The Bland–Altman plot may also be used to assess the repeatability of a method by comparing repeated measurements on a series of subjects or samples by that single method. A coefficient of repeatability can be calculated as 1.96 times the SD of the differences between the paired measurements. Since the same method is used for the repeated measurements, it is expected that the mean difference should be zero. This can be checked from the plot.

Intraclass Correlation Coefficient

Although originally introduced in genetics to judge sibling correlations, the intraclass correlation coefficient (ICC) statistic is now most often used to assess the consistency, or conformity, of measurements made by multiple observers measuring the same parameter or two or more raters scoring the same set of subjects.

The methods of ICC calculation have evolved over time. The earliest work on intraclass correlations focused on paired measurements, and the first ICC statistics to be proposed were modifications of the Pearson's correlation coefficient (which can be regarded as an interclass correlation). Beginning with Ronald Fisher, the intraclass correlation has been regarded within the framework of analysis of variance, and its calculation is now based on the true (between-subject) variance and the variance of the measurement error (during repeat measurement).

The ICC takes a value between 0 and 1. Complete inter-rater agreement is indicated by a value of 1, but this is seldom achieved. Arbitrarily, the agreement boundaries proposed are <0.40: Poor; 0.40–0.60: Fair; 0.60–0.74: Good; and >0.75: Strong. Software may report two coefficients with their respective 95% CIs. The ICC for single measures is an index for the reliability of multiple ratings by a single typical rater. The ICC for average measures is an index for the reliability of different raters averaged together. This ICC is always slightly higher than the single measures ICC. Software may also offer different models for ICC calculation. One model assumes that all subjects were rated by the same raters. A different model may be used when this precondition is not true. The model may test for consistency, when systematic differences between raters are irrelevant, or