You are on page 1of 6

# THIS MONTH

## Table 1 | Regression coefficients and R2 for different predictors

POINTS OF SIGNIFICANCE and predictor correlations
Predictors in model Regression coefficients

## Multiple linear regression H, J

βH
0.7
βJ
–0.08
β0
–46.5

When multiple variables are associated with Predictors fitted Estimated regression coefficients
bH bJ b0 R2
a response, the interpretation of a prediction
Uncorrelated predictors, r(H,J) = 0
equation is seldom simple. H 0.71 –51.7 0.66
Last month we explored how to model a simple relationship J –0.088 69.3 0.19
between two variables, such as the dependence of weight on H, J 0.71 –0.088 –47.3 0.85
height1. In the more realistic scenario of dependence on several Correlated predictors, r(H,J) = 0.9
variables, we can use multiple linear regression (MLR). Although H 0.44 –8.1 (ns) 0.64
MLR is similar to linear regression, the interpretation of MLR J 0.097 60.2 0.42
correlation coefficients is confounded by the way in which the H, J 0.63 –0.056 –36.2 0.67
predictor variables relate to one another. Actual (βH, βJ, β0) and estimated regression coefficients (bH, bJ, b0) and coefficient of
In simple linear regression1, we model how the mean of vari- determination (R2) for uncorrelated and highly correlated predictors in scenarios where either H
or J or both H and J predictors are fitted in the regression. Regression coefficient estimates for all
able Y depends linearly on the value of a predictor variable X; this values of predictor sample correlation, r(H,J) are shown in Figure 2. ns, not significant.
relationship is expressed as the conditional expectation E(Y|X)
© 2015 Nature America, Inc. All rights reserved.

= b 0 + b 1X. For more than one predictor variable X 1, . . ., X p, The slope bj is the change in Y if predictor j is changed by one unit
this becomes b0 + SbjXj. As for simple linear regression, one can and others are held constant. When normality and independence
use the least-squares estimator (LSE) to determine estimates bj of assumptions are fulfilled, we can test whether any (or all) of the
the bj regression parameters by minimizing the residual sum of slopes are zero using a t-test (or regression F-test). Although the
squares, SSE = S(yi – ŷi)2, where ŷi = b0 + Sjbjxij. When we use the interpretation of bj seems to be identical to its interpretation in the

regression sum of squares, SSR = S(ŷi –Y )2, the ratio R2 = SSR/ simple linear regression model, the innocuous phrase “and others
(SSR + SSE) is the amount of variation explained by the regres- are held constant” turns out to have profound implications.
sion model and in multiple regression is called the coefficient of To illustrate MLR—and some of its perils—here we simulate
determination. predicting the weight (W, in kilograms) of adult males from their
height (H, in centimeters) and their maximum jump height (J, in
a Uncorrelated predictors
r(H,J) = 0
b Regression for uncorrelated predictors
Weight on height Weight on jump height
centimeters). We use a model similar to that presented in our previ-
80 70 70 ous column1, but we now include the effect of J as E(W|H,J) = bHH
0.71
+ bJJ + b0 + e, with bH = 0.7, bJ = –0.08, b0 = –46.5 and normally
60 distributed noise e with zero mean and s = 1 (Table 1). We set bJ
W (kg)

W (kg)
J (cm)

## 65 65 negative because we expect a negative correlation between W and

40
–0.088 J when height is held constant (i.e., among men of the same height,
20 60 60
lighter men will tend to jump higher). For this example we simulated
npg

160 165
H (cm)
170 160 165
H (cm)
170 20 40
J (cm)
60 80 a sample of size n = 40 with H and J normally distributed with means
of 165 cm (s = 3) and 50 cm (s = 12.5), respectively.
c Correlated predictors d Regression for correlated predictors
Although the statistical theory for MLR seems similar to that for
r(H,J) = 0.9 Weight on height Weight on jump height
80 70 70
simple linear regression, the interpretation of the results is much
0.71
60
0.44 0.097 more complex. Problems in interpretation arise entirely as a result of
the sample correlation2 among the predictors. We do, in fact, expect
W (kg)

W (kg)
J (cm)

65 65
40 a positive correlation between H and J—tall men will tend to jump
–0.088
higher than short ones. To illustrate how this correlation can affect
20
160 165 170
60
160 165 170
60
20 40 60 80
the results, we generated values using the model for weight with
H (cm) H (cm) J (cm) samples of J and H with different amounts of correlation.
Let’s look first at the regression coefficients estimated when the
Figure 1 | The results of multiple linear regression depend on the
correlation of the predictors, as measured here by the Pearson correlation
predictors are uncorrelated, r(H,J) = 0, as evidenced by the zero
coefficient r (ref. 2). (a) Simulated values of uncorrelated predictors, slope in association between H and J (Fig. 1a). Here r is the Pearson
r(H,J) = 0. The thick gray line is the regression line, and thin gray lines correlation coefficient2. If we ignore the effect of J and regress W
show the 95% confidence interval of the fit. (b) Regression of weight (W) on H, we find Ŵ = 0.71H – 51.7 (R2 = 0.66) (Table 1 and Fig. 1b).
on height (H) and of weight on jump height (J) for uncorrelated predictors Ignoring H, we find Ŵ = –0.088J + 69.3 (R2 = 0.19). If both predic-
shown in a. Regression slopes are shown (bH = 0.71, bJ = –0.088). tors are fitted in the regression, we obtain Ŵ = 0.71H – 0.088J – 47.3
(c) Simulated values of correlated predictors, r(H,J) = 0.9. Regression and
(R2 = 0.85). This regression fit is a plane in three dimensions (H, J,
95% confidence interval are denoted as in a. (d) Regression (red lines)
using correlated predictors shown in c. Light red lines denote the 95%
W) and is not shown in Figure 1. In all three cases, the results of the
confidence interval. Notice that bJ = 0.097 is now positive. The regression F-test for zero slopes show high significance (P ≤ 0.005).
line from b is shown in blue. In all graphs, horizontal and vertical dotted When the sample correlations of the predictors are exactly zero,
lines show average values. the regression slopes (bH and bJ) for the “one predictor at a time”

## NATURE METHODS | VOL.12 NO.12 | DECEMBER 2015 | 1103

THIS MONTH
Effect of predictor correlation r(H,J) on regression coefficient estimates and R 2 because although J and H are likely to be positively correlated, other
bH bJ b0 R2
1.0 1.0 1.0
scenarios might use negatively correlated predictors (e.g., lung

Fraction of P values
0 0.8
1.0 0.8

0.8 J not
capacity and smoking habits). For example, if we include only H
−40
H
βH
0.6
considered 0.5
0.66 in the regression and ignore the effect of J, bH steadily decreases
−80
0.4 from about 1 to 0.35 as r(H,J) increases. Why is this? For a given
0.2 −120
height, larger values of J (an indicator of fitness) are associated with
0 0 0
1.0 80 1.0 1.0
Predictors in regression

0.08

0
0.8
75
0.8
lower weight. If J and H are negatively correlated, as J increases, H
H not
βJ 70 0.5 decreases, and both changes result in a lower value of W. Conversely,
J considered −0.16

−0.24
65
0.19
as J decreases, H increases, and thus W increases. If we use only H
1.0 1.0 0
0
1.0
60
0
0
1.0
0
1.0
as a predictor, J is lurking in the background, depressing W at low
0.8
−0.04 0.8
−25
0.8 values of H and enhancing W at high levels of H, so that the effect of
0.8
βH
βJ β0
0.5
H is overestimated (bH increases). The opposite effect occurs when J
H,J 0.6
−0.12 −75
0.85
and H are positively correlated. A similar effect occurs for bJ, which
0.4
0 −0.16 0 −100 0 0
increases in magnitude (becomes more negative) when J and H are
−1.0 −0.5 0 0.5 1.0 −1.0 −0.5 0 0.5 1.0 −1.0 −0.5 0 0.5 1.0 −1.0 −0.5 0 0.5 1.0
Predictor correlation r(H,J) negatively correlated. Supplementary Figure 1 shows the effect of
correlation when both regression coefficients are positive.
Figure 2 | Results and interpretation of multiple regression changes with
the sample correlation of the predictors. Shown are the values of regression
When both predictors are fitted (Fig. 2), the regression coefficient
coefficient estimates (bH, bJ, b0) and R2 and the significance of the test used estimates (bH, bJ, b0) are centered at the actual coefficients (bH, bJ,
to determine whether the coefficient is zero from 250 simulations at each b0) with the correct sign and magnitude regardless of the correla-
value of predictor sample correlation –1 < r(H,J) < 1 for each scenario where tion of the predictors. However, the standard error in the estimates
© 2015 Nature America, Inc. All rights reserved.

either H or J or both H and J predictors are fitted in the regression. Thick and steadily increases as the absolute value of the predictor correlation
thin black curves show the coefficient estimate median and the boundaries increases.
of the 10th–90th percentile range, respectively. Histograms show the fraction
Neglecting important predictors has implications not only for R2,
of estimated P values in different significance ranges, and correlation
intervals are highlighted in red where >20% of the P values are >0.01. Actual which is a measure of the predictive power of the regression, but
regression coefficients (bH, bJ, b0) are marked on vertical axes. The decrease also for interpretation of the regression coefficients. Unconsidered
in significance for bJ when jump height is the only predictor and r(H,J) is variables that may have a strong effect on the estimated regression
moderate (red arrow) is due to insufficient statistical power (bJ is close to coefficients are sometimes called ‘lurking variables’. For example,
zero). When predictors are uncorrelated, r(H,J) = 0, R2 of individual regressions muscle mass might be a lurking variable with a causal effect on both
sum to R2 of multiple regression (0.66 + 0.19 = 0.85). Panels are organized to body weight and jump height. The results and interpretation of the
correspond to Table 1, which shows estimates of a single trial at two different
regression will also change if other predictors are added.
predictor correlations.
Given that missing predictors can affect the regression, should we
regressions and the multiple regression are identical, and the simple try to include as many predictors as possible? No, for three reasons.
regression R2 sums to multiple regression R2 (0.66 + 0.19 = 0.85; First, any correlation among predictors will increase the standard
Fig. 2). The intercept changes when we add a predictor with a non- error of the estimated regression coefficients. Second, having more
zero mean to satisfy the constraint that the least-squares regression slope parameters in our model will reduce interpretability and cause
line goes through the sample means, which is always true when the problems with multiple testing. Third, the model may suffer from
regression model includes an intercept. overfitting. As the number of predictors approaches the sample size,
npg

Balanced factorial experiments show a sample correlation of zero we begin fitting the model to the noise. As a result, we may seem to
among the predictors when their levels have been fixed. For exam- have a very good fit to the data but still make poor predictions.
ple, we might fix three heights and three jump heights and select MLR is powerful for incorporating many predictors and for esti-
two men representative of each combination, for a total of 18 sub- mating the effects of a predictor on the response in the presence
jects to be weighed. But if we select the samples and then measure of other covariates. However, the estimated regression coefficients
the predictors and response, the predictors are unlikely to have zero depend on the predictors in the model, and they can be quite vari-
correlation. able when the predictors are correlated. Accurate prediction of the
When we simulate highly correlated predictors r(H,J) = 0.9 (Fig. response is not an indication that regression slopes reflect the true
1c), we find that the regression parameters change depending on relationship between the predictors and the response.
whether we use one or both predictors (Table 1 and Fig. 1d). If we
Note: Any Supplementary Information and Source Data files are available in the
consider only the effect of H, the coefficient bH = 0.7 is inaccurately
online version of the paper.
estimated as bH = 0.44. If we include only J, we estimate bJ = –0.08
inaccurately, and even with the wrong sign (bJ = 0.097). When we COMPETING FINANCIAL INTERESTS
use both predictors, the estimates are quite close to the actual coef- The authors declare no competing financial interests.
ficients (bH = 0.63, bJ = –0.056). Martin Krzywinski & Naomi Altman
In fact, as the correlation between predictors r(H,J) changes, the
1. Altman, N. & Krzywinski, M. Nat. Methods 12, 999–1000 (2015).
estimates of the slopes (bH, bJ) and intercept (b0) vary greatly when 2. Altman, N. & Krzywinski, M. Nat. Methods 12, 899–900 (2015).
only one predictor is fitted. We show the effects of this variation for
Naomi Altman is a Professor of Statistics at The Pennsylvania State University.
all values of predictor correlation (both positive and negative) across Martin Krzywinski is a staff scientist at Canada’s Michael Smith Genome Sciences
250 trials at each value (Fig. 2). We include negative correlation Centre.

## 1104 | VOL.12 NO.12 | DECEMBER 2015 | NATURE METHODS

THIS MONTH

POINTS OF SIGNIFICANCE a Relationship between weight and height b Linear regression of weight on height
75
σ=3 σ=3
70

Weight (kg)
65
55 65 75
Weight (kg) 60

## “The statistician knows…that in nature there 150 165 180

55
150 160 170 180
Height (cm) Height (cm)
never was a normal distribution, there never
Figure 2 | In a linear regression relationship, the response variable has a
was a straight line, yet with normal and linear distribution for each value of the independent variable. (a) At each height,
assumptions, known to be false, he can often derive weight is distributed normally with s.d. s = 3. (b) Linear regression of
n = 3 weight measurements for each height. The mean weight varies as
results which match, to a useful approximation, m(Height) = 2 × Height/3 – 45 (black line) and is estimated by a regression
those found in the real world.”1 line (blue line) with 95% confidence interval (blue band). The 95%
prediction interval (gray band) is the region in which 95% of the population
We have previously defined association between X and Y as meaning is predicted to lie for each fixed height.
that the distribution of Y varies with X. We discussed correlation as
a type of association in which larger values of Y are associated with in the estimated regression equation. Here X is referred to as the
larger values of X (increasing trend) or smaller values of X (decreas- predictor, and Y is referred to as the predicted variable.
ing trend)2. If we suspect a trend, we may want to attempt to pre- Consider a relationship between weight Y (in kilograms) and
dict the values of one variable using the values of the other. One of height X (in centimeters), where the mean weight at a given height
© 2015 Nature America, Inc. All rights reserved.

the simplest prediction methods is linear regression, in which we is m(X) = 2X/3 – 45 for X > 100. Because of biological variability,
attempt to find a ‘best line’ through the data points. the weight will vary—for example, it might be normally distributed
Correlation and linear regression are closely linked—they both with a fixed s = 3 (Fig. 2a). The difference between an observed
quantify trends. Typically, in correlation we sample both variables weight and mean weight at a given height is referred to as the error
randomly from a population (for example, height and weight), for that weight.
and in regression we fix the value of the independent variable (for To discover the linear relationship, we could measure the weight
example, dose) and observe the response. The predictor variable of three individuals at each height and apply linear regression to
may also be randomly selected, but we treat it as fixed when mak- model the mean weight as a function of height using a straight
ing predictions (for example, predicted weight for someone of a line, m(X) = b0 + b1X (Fig. 2b). The most popular way to esti-
given height). We say there is a regression relationship between X mate the intercept b0 and slope b1 is the least-squares estimator
and Y when the mean of Y varies with X. (LSE). Let (xi, yi) be the ith pair of X and Y values. The LSE esti-
In simple regression, there is one independent variable, X, and mates b0 and b1 by minimizing the residual sum of squares (sum
one dependent variable, Y. For a given value of X, we can estimate of squared errors), SSE = S(yi – ŷi)2, where ŷi = m(xi) = b0 + b1xi are
the average value of Y and write this as a conditional expectation the points on the estimated regression line and are called the fitted,
E(Y|X), often written simply as m(X). If m(X) varies with X, then predicted or ‘hat’ values. The estimates are given by b0 = – b1
we say that Y has a regression on X (Fig. 1). Regression is a specific and b1 = rsX/sY, and where and are means of samples X and
kind of association and may be linear or nonlinear (Fig. 1c,d). Y, sX and sY are their s.d. values and r = r(X,Y) is their correlation
npg

## The most basic regression relationship is a simple linear regres- coefficient2.

sion. In this case, E(Y|X) = m(X) = b0 + b1X, a line with intercept The LSE of the regression line has favorable properties for very
b0 and slope b1. We can interpret this as Y having a distribution general error distributions, which makes it a popular estimation
with mean m(X) for any given value of X. Here we are not interested method. When Y values are selected at random from the condi-
in the shape of this distribution; we care only about its mean. The tional distribution E(Y|X), the LSEs of the intercept, slope and fitted
deviation of Y from m(X) is often called the error, e = Y – m(X). It’s values are unbiased estimates of the population value regardless
important to realize that this term arises not because of any kind of the distribution of the errors, as long as they have zero mean.
of error but because Y has a distribution for a given value of X. By “unbiased,” we mean that although they might deviate from
In other words, in the expression Y = m(X) + e, m(X) specifies the the population values in any sample, they are not systematically
location of the distribution, and e captures its shape. To predict Y too high or too low. However, because the LSE is very sensitive to
at unobserved values of X, one substitutes the desired values of X extreme values of both X (high leverage points) and Y (outliers),
diagnostic outlier analyses are needed before the estimates are used.
a No association b No regression c Linear regression d Nonlinear regression
In the context of regression, the term “linear” can also refer to a
linear model, where the predicted values are linear in the param-
eters. This occurs when E(Y|X) is a linear function of a known
Y function g(X), such as b0 + b1g(X). For example, b0 + b1X2 and
X b 0 + b 1sin(X) are both linear regressions, but exp( b 0+ b 1X) is
nonlinear because it is not a linear function of the parameters b0
Figure 1 | A variable Y has a regression on variable X if the mean of Y (black
line) E(Y|X) varies with X. (a) If the properties of Y do not change with
and b1. Analysis of variance (ANOVA) is a special case of a linear
X, there is no association. (b) Association is possible without regression. model in which the t treatments are labeled by indicator variables
Here E(Y|X) is constant, but the variance of Y increases with X. (c) Linear X1 . . . Xt, E(Y|X1 . . . Xt) = μi is the ith treatment mean, and the LSE
regression E(Y|X) = b0+ b1X. (d) Nonlinear regression E(Y|X) = exp(b0+ b1X). predicted values are the corresponding sample means3.

## NATURE METHODS | VOL.12 NO.11 | NOVEMBER 2015 | 999

THIS MONTH
a fixed value of X. For example, the 95% confidence interval for
a Uncertainty in linear regression
of weight on height
b Regression to the mean
m(x) is b0 + b1x ± t0.975SE(ŷ(x)) (Fig. 2b) and depends on the error
Weight on height Height on weight
75
σ=2 σ=4 W = 71.6 variance (Fig. 3a). This is called a point-wise interval because the
H ’ = 172.7
70
95% coverage is for a single fixed value of X. One can compute a

Height (cm)
Weight (kg)
Weight (kg)

65 band that covers the entire line 95% of the time by replacing t0.975
60 with W0.975 = √(2F0.975), where F0.975 is the critical value from the
55 F2,n–2 distribution. This interval is wider because it must cover the
150 160 170 180 150 160 170 180 H = 175 W = 71.6
Height (cm) Height (cm) Weight (kg) entire regression line, not just one point on the line.
Figure 3 | Regression models associate error to response which tends to
To express uncertainty about where a percentage (for example,
pull predictions closer to the mean of the data (regression to the mean). 95%) of newly observed data points would fall, we use the pre-
(a) Uncertainty in a linear regression relationship can be expressed by a 95% diction interval b0 + b1x + t0.975 (MSE(1 + 1/n + (x – )2/sXX)).
confidence interval (blue band) and 95% prediction interval (gray band). This interval is wider than the confidence interval because it must
Shown are regressions for the relationship in Figure 2a using different incorporate both the spread in the data and the uncertainty in the
amounts of scatter (normally distributed with s.d. s). (b) Predictions model parameters. A prediction interval for Y at a fixed value of
using successive regressions X → Y → Xʹ to the mean. When predicting
X incorporates three sources of uncertainty: the population vari-
using height H = 175 cm (larger than average), we predict weight W = 71.6
kg (dashed line). If we then regress H on W at W = 71.6 kg, we predict ance s2, the variance in estimating the mean and the variability
Hʹ = 172.7 cm, which is closer than H to the mean height (64.6 cm). Means due to estimating s2 with the MSE. Unlike confidence intervals,
of height and weight are shown as dotted lines. which are accurate when the sampling distribution of the estima-
tor is close to normal, which usually occurs in sufficiently large
Recall that in ANOVA, the SSE is the sum of squared deviations samples, the prediction interval is accurate only when the errors
© 2015 Nature America, Inc. All rights reserved.

of the data from their respective sample means (i.e., their pre- are close to normal, which is not affected by sample size.
dicted values) and represents the variation in the data that is not Linear regression is readily extended to multiple predictor vari-
accounted for by the treatments. Similarly, in regression, the SSE ables X1, . . ., Xp, giving E(Y|X1, . . ., Xp) = b0 + SbiXi. Clever choice
is the sum of squared deviations of the data from the predicted of predictors allows for a wide variety of models. For example,
values that represents variation in data not explained by regres- Xi = Xi yields a polynomial of degree p. If there are p + 1 groups,
sion. In ANOVA we also compute the total and treatment sum letting Xi = 1 when the sample comes from group i and 0 other-
of squares; the analogous quantities in linear regression are the wise yields a model in which the fitted values are the group means.
total sum of squares, SST = (n–1)s2Y, and the regression sum of In this model, the intercept is the mean of the last group, and the
squares, SSR = S(ŷi – )2, which are related by SST = SSR + SSE. slopes are the differences in means.
Furthermore, SSR/SST = r 2 is the proportion of variance of Y A common misinterpretation of linear regression is the ‘regres-
explained by the linear regression of X (ref. 2). sion fallacy’. For example, we might predict weight W = 71.6 kg for
When the errors have constant variance s2, we can model the a larger than average height H = 175 cm and then predict height
uncertainty in regression parameters. In this case, b0 and b1 have Hʹ = 172.7 cm for someone with weight W = 71.6 kg (Fig. 3b).
means b 0 and b 1, respectively, and variances s 2(1/n + 2/s XX) Here we will find Hʹ < H. Similarly, if H is smaller than average,
and s2/sXX, where sXX = (n – 1)s2X . As we collect X over a wider we will find Hʹ> H. The regression fallacy is to ascribe a causal
range, sXX increases, so the variance of b1 decreases. The predicted mechanism to regression to the mean, rather than realizing that
value ŷ(x) has a mean b0 +b1x and variance s2(1/n + (x – )2/sXX). it is due to the estimation method. Thus, if we start with some
npg

Additionally, the mean square error (MSE) = SSE/(n – 2) is an value of X, use it to predict Y, and then use Y to predict X, the
unbiased estimator of the error variance (i.e., s2). This is identi- predicted value will be closer to the mean of X than the original
cal to how MSE is used in ANOVA to estimate the within-group value (Fig. 3b).
variance, and it can be used as an estimator of s2 in the equations Estimating the regression equation by LSE is quite robust to
above to allow us to find the standard error (SE) of b0, b1 and ŷx. non-normality of and correlation in the errors, but it is sensitive
For example, SE(ŷ(x)) = √(MSE(1/n + (x – )2/sXX)). to extreme values of both predictor and predicted. Linear regres-
If the errors are normally distributed, so are b0, b1 and (ŷ(x)). sion is much more flexible than its name might suggest, includ-
Even if the errors are not normally distributed, as long as they ing polynomials, ANOVA and other commonly used statistical
have zero mean and constant variance, we can apply a version of methods.
the central limit theorem for large samples4 to obtain approximate
COMPETING FINANCIAL INTERESTS
normality for the estimates. In these cases the SE is very helpful
The authors declare no competing financial interests.
in testing hypotheses. For example, to test that the slope is b1 =
2/3, we would use t* = (b1 – b1)/SE(b1); when the errors are nor- Naomi Altman & Martin Krzywinski
mal and the null hypothesis true, t* has a t-distribution with d.f.
1. Box, G. J. Am. Stat. Assoc. 71, 791–799 (1976).
= n – 2. We can also calculate the uncertainty of the regression
2. Altman, N. & Krzywinski, M. Nat. Methods 12, 899–900 (2015).
parameters using confidence intervals, the range of values that 3. Krzywinski, M. & Altman, N. Nat. Methods 11, 699–700 (2014).
are likely to contain bi (for example, 95% of the time)5. The inter- 4. Krzywinski, M. & Altman, N. Nat. Methods 10, 809–810 (2013).
val is bi ± t0.975SE(bi), where t0.975 is the 97.5% percentile of the 5. Krzywinski, M. & Altman, N. Nat. Methods 10, 1041–1042 (2013).
t-distribution with d.f. = n – 2.
Naomi Altman is a Professor of Statistics at The Pennsylvania State University.
When the errors are normally distributed, we can also use con- Martin Krzywinski is a staff scientist at Canada’s Michael Smith Genome Sciences
fidence intervals to make statements about the predicted value for Centre.

## 1000 | VOL.12 NO.11 | NOVEMBER 2015 | NATURE METHODS

THIS MONTH

POINTS OF SIGNIFICANCE a Fluctuation of correlation of random data b Spurious correlations in random data
−0.94 −0.93 −0.91

## Association, correlation 0.91 0.91 0.89

and causation −1 0
r
1 −1 0
r
1

## Figure 2 | Correlation coefficients fluctuate in random data, and spurious

Correlation implies association, but not causation. correlations can arise. (a) Distribution (left) and 95% confidence
intervals (right) of correlation coefficients of 10,000 n = 10 samples of
Conversely, causation implies association, but not two independent normally distributed variables. Statistically significant
correlation. coefficients (α = 0.05) and corresponding intervals that do not include r = 0
are highlighted in blue. (b) Samples with the three largest and smallest
Most studies include multiple response variables, and the dependen- correlation coefficients (statistically significant) from a.
cies among them are often of great interest. For example, we may
wish to know whether the levels of mRNA and the matching protein spent outdoors is a confounding variable—a cause common to
vary together in a tissue, or whether increasing levels of one metabo- both observations. In such a situation, a direct causal link cannot
lite are associated with changed levels of another. This month we be inferred; the association merely suggests a hypothesis, such as a
begin a series of columns about relationships between variables (or common cause, but does not offer proof. In addition, when many
features of a system), beginning with how pairwise dependencies variables in complex systems are studied, spurious associations can
can be characterized using correlation. arise. Thus, association does not imply causation.
© 2015 Nature America, Inc. All rights reserved.

Two variables are independent when the value of one gives no In everyday language, dependence, association and correlation are
information about the value of the other. For variables X and Y, we used interchangeably. Technically, however, association is synony-
can express independence by saying that the chance of measuring mous with dependence and is different from correlation (Fig. 1a).
any one of the possible values of X is unaffected by the value of Y, and Association is a very general relationship: one variable provides
vice versa, or by using conditional probability, P(X|Y) = P(X). For information about another. Correlation is more specific: two vari-
example, successive tosses of a coin are independent—for a fair coin, ables are correlated when they display an increasing or decreasing
P(H) = 0.5 regardless of the outcome of the previous toss, because a trend. For example, in an increasing trend, observing that X > μX
toss does not alter the properties of the coin. In contrast, if a system implies that it is more likely that Y > μY. Because not all associations
is changed by observation, measurements may become associated are correlations, and because causality, as discussed above, can be
or, equivalently, dependent. Cards drawn without replacement are connected only to association, we cannot equate correlation with
not independent; when a red card is drawn, the probability of draw- causality in either direction.
ing a black card increases, because now there are fewer red cards. For quantitative and ordinal data, there are two primary mea-
Association should not be confused with causality; if X causes sures of correlation: Pearson’s correlation (r), which measures lin-
Y, then the two are associated (dependent). However, associations ear trends, and Spearman’s (rank) correlation (s), which measures
can arise between variables in the presence (i.e., X causes Y) and increasing and decreasing trends that are not necessarily linear
absence (i.e., they have a common cause) of a causal relationship, (Fig. 1b). Like other statistics, these have population values, usu-
as we’ve seen in the context of Bayesian networks1. As an example, ally referred to as ρ. There are other measures of association that
npg

suppose we observe that people who daily drink more than 4 cups are also referred to as correlation coefficients, but which might not
of coffee have a decreased chance of developing skin cancer. This measure trends.
does not necessarily mean that coffee confers resistance to cancer; When “correlated” is used unmodified, it generally refers to
one alternative explanation would be that people who drink a lot of Pearson’s correlation, given by ρ(X, Y) = cov(X, Y)/σXσY, where
coffee work indoors for long hours and thus have little exposure to cov(X, Y) = E((X – μX)(Y – μY)). The correlation computed from
the sun, a known risk. If this is the case, then the number of hours the sample is denoted by r. Both variables must be on an interval or
ratio scale; r cannot be interpreted if either variable is ordinal. For a
linear trend, |r| = 1 in the absence of noise and decreases with noise,
a Association and correlation b Correlation coefficients c Anscombe’s quartet
but it is also possible that |r| < 1 for perfectly associated nonlinear
Associated Not associated
trends (Fig. 1b). In addition, data sets with very different associa-
tions may have the same correlation (Fig. 1c). Thus, a scatter plot
rs 11 0.91 1 0.82 0.82
Correlated should be used to interpret r. If either variable is shifted or scaled, r
does not change and r(X, Y) = r(aX + b, Y). However, r is sensitive to
0 0.04 0.78 1 0.82 0.82
nonlinear monotone (increasing or decreasing) transformation. For
example, when applying log transformation, r(X, Y) ≠ r(X, log(Y)).
Figure 1 | Correlation is a type of association and measures increasing or It is also sensitive to the range of X or Y values and can decrease as
decreasing trends quantified using correlation coefficients. (a) Scatter plots values are sampled from a smaller range.
of associated (but not correlated), non-associated and correlated variables.
If an increasing or decreasing but nonlinear relationship is sus-
In the lower association example, variance in y is increasing with x. (b) The
Pearson correlation coefficient (r, black) measures linear trends, and the pected, Spearman’s correlation is more appropriate. It is a nonpara-
Spearman correlation coefficient (s, red) measures increasing or decreasing metric method that converts the data to ranks and then applies the
trends. (c) Very different data sets may have similar r values. Descriptors formula for the Pearson correlation. It can be used when X is ordinal
such as curvature or the presence of outliers can be more specific. and is more robust to outliers. It is also not sensitive to monotone

## NATURE METHODS | VOL.12 NO.10 | OCTOBER 2015 | 899

THIS MONTH
enough that r = 0.42 (P = 0.063) is not statistically significant—its
a Effect of noise on correlation b Effect of sample size on correlation
0.95 0.69 0.42 (ns) 0.96 0.59 0.42 confidence interval includes ρ = 0.
When the linear trend is masked by noise, larger samples are need-
ed to confidently measure the correlation. Figure 3b shows how the
1 1 correlation coefficient varies for subsamples of size m drawn from
r 0 0 samples at different noise levels: m = 4–20 (σ = 0.1), m = 4–100 (σ
−1 −1
σ = 0.1 σ = 0.3 σ = 0.6 = 0.3) and m = 4–200 (σ = 0.6). When σ = 0.1, the correlation coef-
0 0.25 0.5 0.75 1 4 8 12 16 20 4 25 50 75 100 4 100 200
σ m ficient converges to 0.96 once m > 12. However, when noise is high,
not only is the value of r lower for the full sample (e.g., r = 0.59 for σ
Figure 3 | Effect of noise and sample size on Pearson’s correlation coefficient = 0.3), but larger subsamples are needed to robustly estimate ρ.
r. (a) r of an n = 20 sample of (X, X + ε), where ε is the normally distributed
The Pearson correlation coefficient can also be used to quantify
noise scaled to standard deviation σ. The amount of scatter and value of r at
three values of σ are shown. The shaded area is the 95% confidence interval.
how much fluctuation in one variable can be explained by its cor-
Intervals that do not include r = 0 are highlighted in blue (σ < 0.58), and relation with another variable. A previous discussion about analysis
those that do are highlighted in gray and correspond to nonsignificant r values of variance4 showed that the effect of a factor on the response vari-
(ns; e.g., r = 0.42 with P = 0.063). (b) As sample size increases, r becomes able can be described as explaining the variation in the response; the
less variable, and the estimate of the population correlation improves. Shown response varied, and once the factor was accounted for, the variation
are samples with increasing size and noise: n = 20 (σ = 0.1), n = 100 (σ = decreased. The squared Pearson correlation coefficient r2 has a simi-
0.3) and n = 200 (σ = 0.6). Traces at the bottom show r calculated from a
lar role: it is the proportion of variation in Y explained by X (and vice
subsample, created from the first m values of each sample.
versa). For example, r = 0.05 means that only 0.25% of the variance
of Y is explained by X (and vice versa), and r = 0.9 means that 81% of
© 2015 Nature America, Inc. All rights reserved.

increasing transformations because they preserve ranks—for exam- the variance of Y is explained by X. This interpretation is helpful in
ple, s(X, Y) = s(X, log(Y)). For both coefficients, a smaller magnitude assessments of the biological importance of the magnitude of r when
corresponds to increasing scatter or a non-monotonic relationship. it is statistically significant.
It is possible to see large correlation coefficients even for random Besides the correlation among features, we may also talk about the
data (Fig. 2a). Thus, r should be reported together with a P value, correlation among the items we are measuring. This is also expressed
which measures the degree to which the data are consistent with as the proportion of variance explained. In particular, if the units are
the null hypothesis that there is no trend in the population. For clustered, then the intraclass correlation (which should be thought of
Pearson’s r, to calculate the P value we use the test statistic √[d.f. × as a squared correlation) is the percent variance explained by the clus-
r2/(1 – r2)], which is t-distributed with d.f. = n – 2 when (X, Y) has ters and given by σb2/(σb2 + σw2), where σb2 is the between-cluster
a bivariate normal distribution (P for s does not require normality) variation and σb2 + σw2 is the total between- and within-cluster varia-
and the population correlation is 0. Even more informative is a 95% tion. This formula was discussed previously in an examination of
confidence interval, often calculated using the bootstrap method2. the percentage of total variance explained by biological variation5
In Figure 2a we see that values up to |r| < 0.63 are not statistically where the clusters are the technical replicates for the same biological
significant—their confidence intervals span zero. More important, replicate. As with the correlation between features, the higher the
there are very large correlations that are statistically significant (Fig. intraclass correlation, the less scatter in the data—this time measured
2a) even though they are drawn from a population in which the not from the trend curve but from the cluster centers.
true correlation is ρ = 0. These spurious cases (Fig. 2b) should be Association is the same as dependence and may be due to direct or
npg

expected any time a large number of correlations is calculated— indirect causation. Correlation implies specific types of association
for example, a study with only 140 genes yields 9,730 correlations. such as monotone trends or clustering, but not causation. For exam-
Conversely, modest correlations between a few variables, known to ple, when the number of features is large compared with the sample
be noisy, could be biologically interesting. size, large but spurious correlations frequently occur. Conversely,
Because P depends on both r and the sample size, it should never when there are a large number of observations, small and substan-
be used as a measure of the strength of the association. It is possible tively unimportant correlations may be statistically significant.
for a smaller r, whose magnitude can be interpreted as the estimated
COMPETING FINANCIAL INTERESTS
effect size, to be associated with a smaller P merely because of a large
The authors declare no competing financial interests.
sample size3. Statistical significance of a correlation coefficient does
not imply substantive and biologically relevant significance. Naomi Altman & Martin Krzywinski
The value of both coefficients will fluctuate with different sam-
1. Puga, J.L., Krzywinski, M. & Altman, N. Nat. Methods 12, 799–800 (2015).
ples, as seen in Figure 2, as well as with the amount of noise and/
2. Kulesa, A., Krzywinski, M., Blainey, P. & Altman, N. Nat. Methods 12, 477–
or the sample size. With enough noise, the correlation coefficient 478 (2015).
can cease to be informative about any underlying trend. Figure 3a 3. Krzywinski, M. & Altman, N. Nat. Methods 10, 1139–1140 (2013).
shows a perfectly correlated relationship (X, X) where X is a set of 4. Krzywinski, M. & Altman, N. Nat. Methods 11, 699–700 (2014).
5. Altman, N. & Krzywinski, M. Nat. Methods 12, 5–6 (2015).
n = 20 points uniformly distributed in the range [0, 1] in the pres-
ence of different amounts of normally distributed noise with a stan-
Naomi Altman is a Professor of Statistics at The Pennsylvania State University.
dard deviation σ. As σ increases from 0.1 to 0.3 to 0.6, r(X, X + σ) Martin Krzywinski is a staff scientist at Canada’s Michael Smith Genome Sciences
decreases from 0.95 to 0.69 to 0.42. At σ = 0.6 the noise is high Centre.