
POINTS OF SIGNIFICANCE

Multiple linear regression

When multiple variables are associated with a response, the interpretation of a prediction equation is seldom simple.

Last month we explored how to model a simple relationship between two variables, such as the dependence of weight on height1. In the more realistic scenario of dependence on several variables, we can use multiple linear regression (MLR). Although MLR is similar to linear regression, the interpretation of MLR correlation coefficients is confounded by the way in which the predictor variables relate to one another.

In simple linear regression1, we model how the mean of variable Y depends linearly on the value of a predictor variable X; this relationship is expressed as the conditional expectation E(Y|X) = β0 + β1X. For more than one predictor variable X1, ..., Xp, this becomes β0 + ΣβjXj. As for simple linear regression, one can use the least-squares estimator (LSE) to determine estimates bj of the βj regression parameters by minimizing the residual sum of squares, SSE = Σ(yi – ŷi)², where ŷi = b0 + Σjbjxij. When we use the regression sum of squares, SSR = Σ(ŷi – Ȳ)², the ratio R² = SSR/(SSR + SSE) is the amount of variation explained by the regression model and in multiple regression is called the coefficient of determination.

The slope bj is the change in Y if predictor j is changed by one unit and others are held constant. When normality and independence assumptions are fulfilled, we can test whether any (or all) of the slopes are zero using a t-test (or regression F-test). Although the interpretation of bj seems to be identical to its interpretation in the simple linear regression model, the innocuous phrase "and others are held constant" turns out to have profound implications.

Table 1 | Regression coefficients and predictor correlations

Actual coefficients: βH = 0.7, βJ = –0.08, β0 = –46.5

Predictors fitted     bH       bJ        b0          R²
Uncorrelated predictors, r(H,J) = 0
H                     0.71               –51.7       0.66
J                              –0.088     69.3       0.19
H, J                  0.71     –0.088    –47.3       0.85
Correlated predictors, r(H,J) = 0.9
H                     0.44               –8.1 (ns)   0.64
J                              0.097      60.2       0.42
H, J                  0.63     –0.056    –36.2       0.67

Actual (βH, βJ, β0) and estimated (bH, bJ, b0) regression coefficients and coefficient of determination (R²) for uncorrelated and highly correlated predictors in scenarios where either H or J or both H and J predictors are fitted in the regression. Regression coefficient estimates for all values of predictor sample correlation, r(H,J), are shown in Figure 2. ns, not significant.

To illustrate MLR—and some of its perils—here we simulate predicting the weight (W, in kilograms) of adult males from their height (H, in centimeters) and their maximum jump height (J, in centimeters). We use a model similar to that presented in our previous column1, but we now include the effect of J as W = βHH + βJJ + β0 + ε, with βH = 0.7, βJ = –0.08, β0 = –46.5 and normally distributed noise ε with zero mean and σ = 1 (Table 1). We set βJ to be negative, reflecting the effect of J when height is held constant (i.e., among men of the same height, lighter men will tend to jump higher). For this example we simulated a sample of size n = 40 with H and J normally distributed with means of 165 cm (σ = 3) and 50 cm (σ = 12.5), respectively.
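As a minimal sketch (not the authors' code), the model above can be simulated and fitted with numpy's least-squares routine. The seed is arbitrary, and the sample is enlarged from n = 40 so that the estimates settle visibly near the true coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # enlarged from the column's n = 40 so estimates stabilize

# Uncorrelated predictors, as in the r(H,J) = 0 scenario:
H = rng.normal(165, 3, n)     # height, cm
J = rng.normal(50, 12.5, n)   # maximum jump height, cm

# Model from the text: W = 0.7*H - 0.08*J - 46.5 + noise (sd = 1)
W = 0.7 * H - 0.08 * J - 46.5 + rng.normal(0, 1, n)

# Least-squares fit of the full model (design matrix with intercept column)
X = np.column_stack([H, J, np.ones(n)])
(bH, bJ, b0), *_ = np.linalg.lstsq(X, W, rcond=None)
print(bH, bJ, b0)  # close to the actual 0.7, -0.08, -46.5
```

With both predictors fitted, the estimates land near the actual coefficients, mirroring the H, J row of Table 1.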

Although the statistical theory for MLR seems similar to that for simple linear regression, the interpretation of the results is much more complex. Problems in interpretation arise entirely as a result of the sample correlation2 among the predictors. We do, in fact, expect a positive correlation between H and J—tall men will tend to jump higher than short ones. To illustrate how this correlation can affect the results, we generated values using the model for weight with samples of J and H with different amounts of correlation.

Let's look first at the regression coefficients estimated when the predictors are uncorrelated, r(H,J) = 0, as evidenced by the zero slope in association between H and J (Fig. 1a). Here r is the Pearson correlation coefficient2. If we ignore the effect of J and regress W on H, we find Ŵ = 0.71H – 51.7 (R² = 0.66) (Table 1 and Fig. 1b). Ignoring H, we find Ŵ = –0.088J + 69.3 (R² = 0.19). If both predictors are fitted in the regression, we obtain Ŵ = 0.71H – 0.088J – 47.3 (R² = 0.85). This regression fit is a plane in three dimensions (H, J, W) and is not shown in Figure 1. In all three cases, the results of the F-test for zero slopes show high significance (P ≤ 0.005).

Figure 1 | The results of multiple linear regression depend on the correlation of the predictors, as measured here by the Pearson correlation coefficient r (ref. 2). (a) Simulated values of uncorrelated predictors, r(H,J) = 0. The thick gray line is the regression line, and thin gray lines show the 95% confidence interval of the fit. (b) Regression of weight (W) on height (H) and of weight on jump height (J) for uncorrelated predictors shown in a. Regression slopes are shown (bH = 0.71, bJ = –0.088). (c) Simulated values of correlated predictors, r(H,J) = 0.9. Regression and 95% confidence interval are denoted as in a. (d) Regression (red lines) using correlated predictors shown in c. Light red lines denote the 95% confidence interval. Notice that bJ = 0.097 is now positive. The regression line from b is shown in blue. In all graphs, horizontal and vertical dotted lines show average values.
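The exact additivity of R² for uncorrelated predictors can be checked numerically. The following is a sketch, not the column's code: the r2 helper is a hypothetical convenience, and J is residualized against H so that the sample correlation r(H,J) is exactly zero rather than only approximately so:

```python
import numpy as np

def r2(y, X):
    """R^2 of a least-squares fit of y on X (intercept column added here)."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(1)
n = 200
H = rng.normal(165, 3, n)
J_raw = rng.normal(50, 12.5, n)

# Remove from J its least-squares projection onto H (plus intercept),
# forcing the sample correlation r(H,J) to be exactly zero.
A = np.column_stack([H, np.ones(n)])
J = J_raw - A @ np.linalg.lstsq(A, J_raw, rcond=None)[0] + 50

W = 0.7 * H - 0.08 * J - 46.5 + rng.normal(0, 1, n)

r2_H = r2(W, H[:, None])
r2_J = r2(W, J[:, None])
r2_HJ = r2(W, np.column_stack([H, J]))
print(r2_H + r2_J, r2_HJ)  # the two simple-regression R^2 values sum to the multiple-regression R^2
```

With exactly zero sample correlation the identity holds to floating-point precision, matching the 0.66 + 0.19 = 0.85 decomposition in the text.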

When the sample correlations of the predictors are exactly zero, the regression slopes (bH and bJ) for the "one predictor at a time" regressions and the multiple regression are identical, and the simple regression R² values sum to the multiple regression R² (0.66 + 0.19 = 0.85; Fig. 2). The intercept changes when we add a predictor with a nonzero mean to satisfy the constraint that the least-squares regression line goes through the sample means, which is always true when the regression model includes an intercept.

Balanced factorial experiments show a sample correlation of zero among the predictors when their levels have been fixed. For example, we might fix three heights and three jump heights and select two men representative of each combination, for a total of 18 subjects to be weighed. But if we select the samples and then measure the predictors and response, the predictors are unlikely to have zero correlation.

When we simulate highly correlated predictors, r(H,J) = 0.9 (Fig. 1c), we find that the regression parameters change depending on whether we use one or both predictors (Table 1 and Fig. 1d). If we consider only the effect of H, the coefficient βH = 0.7 is inaccurately estimated as bH = 0.44. If we include only J, we estimate βJ = –0.08 inaccurately, and even with the wrong sign (bJ = 0.097). When we use both predictors, the estimates are quite close to the actual coefficients (bH = 0.63, bJ = –0.056).

In fact, as the correlation between predictors r(H,J) changes, the estimates of the slopes (bH, bJ) and intercept (b0) vary greatly when only one predictor is fitted. We show the effects of this variation for all values of predictor correlation (both positive and negative) across 250 trials at each value (Fig. 2). We include negative correlation because although J and H are likely to be positively correlated, other scenarios might use negatively correlated predictors (e.g., lung capacity and smoking habits). For example, if we include only H in the regression and ignore the effect of J, bH steadily decreases from about 1 to 0.35 as r(H,J) increases. Why is this? For a given height, larger values of J (an indicator of fitness) are associated with lower weight. If J and H are negatively correlated, as J increases, H decreases, and both changes result in a lower value of W. Conversely, as J decreases, H increases, and thus W increases. If we use only H as a predictor, J is lurking in the background, depressing W at low values of H and enhancing W at high values of H, so that the effect of H is overestimated (bH increases). The opposite effect occurs when J and H are positively correlated. A similar effect occurs for bJ, which increases in magnitude (becomes more negative) when J and H are negatively correlated. Supplementary Figure 1 shows the effect of correlation when both regression coefficients are positive.

Figure 2 | Results and interpretation of multiple regression change with the sample correlation of the predictors. Shown are the values of regression coefficient estimates (bH, bJ, b0) and R² and the significance of the test used to determine whether the coefficient is zero from 250 simulations at each value of predictor sample correlation –1 < r(H,J) < 1 for each scenario where either H or J or both H and J predictors are fitted in the regression. Thick and thin black curves show the coefficient estimate median and the boundaries of the 10th–90th percentile range, respectively. Histograms show the fraction of estimated P values in different significance ranges, and correlation intervals are highlighted in red where >20% of the P values are >0.01. Actual regression coefficients (βH, βJ, β0) are marked on vertical axes. The decrease in significance for bJ when jump height is the only predictor and r(H,J) is moderate (red arrow) is due to insufficient statistical power (βJ is close to zero). When predictors are uncorrelated, r(H,J) = 0, the R² values of individual regressions sum to the R² of the multiple regression (0.66 + 0.19 = 0.85). Panels are organized to correspond to Table 1, which shows estimates of a single trial at two different predictor correlations.

When both predictors are fitted (Fig. 2), the regression coefficient estimates (bH, bJ, b0) are centered at the actual coefficients (βH, βJ, β0) with the correct sign and magnitude regardless of the correlation of the predictors. However, the standard error in the estimates steadily increases as the absolute value of the predictor correlation increases.

Neglecting important predictors has implications not only for R², which is a measure of the predictive power of the regression, but also for interpretation of the regression coefficients. Unconsidered variables that may have a strong effect on the estimated regression coefficients are sometimes called 'lurking variables'. For example, muscle mass might be a lurking variable with a causal effect on both body weight and jump height. The results and interpretation of the regression will also change if other predictors are added.

Given that missing predictors can affect the regression, should we try to include as many predictors as possible? No, for three reasons. First, any correlation among predictors will increase the standard error of the estimated regression coefficients. Second, having more slope parameters in our model will reduce interpretability and cause problems with multiple testing. Third, the model may suffer from overfitting. As the number of predictors approaches the sample size, we begin fitting the model to the noise. As a result, we may seem to have a very good fit to the data but still make poor predictions.

MLR is powerful for incorporating many predictors and for estimating the effects of a predictor on the response in the presence of other covariates. However, the estimated regression coefficients depend on the predictors in the model, and they can be quite variable when the predictors are correlated. Accurate prediction of the response is not an indication that regression slopes reflect the true relationship between the predictors and the response.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Martin Krzywinski & Naomi Altman

1. Altman, N. & Krzywinski, M. Nat. Methods 12, 999–1000 (2015).
2. Altman, N. & Krzywinski, M. Nat. Methods 12, 899–900 (2015).

Naomi Altman is a Professor of Statistics at The Pennsylvania State University. Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.

© 2015 Nature America, Inc. All rights reserved.
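The sign flip reported in Table 1 can be reproduced with a short simulation. This is a sketch rather than the authors' code: the seed, the enlarged sample size and the fit helper are arbitrary choices. With r(H,J) = 0.9, the coefficient of J fitted alone comes out positive even though βJ = –0.08, while fitting both predictors recovers estimates near the true values:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
z1, z2 = rng.normal(size=(2, n))

# Build predictors with population correlation 0.9:
H = 165 + 3 * z1
J = 50 + 12.5 * (0.9 * z1 + np.sqrt(1 - 0.9**2) * z2)
W = 0.7 * H - 0.08 * J - 46.5 + rng.normal(0, 1, n)

def fit(*cols):
    """Least-squares coefficients of W on the given columns plus intercept."""
    A = np.column_stack([*cols, np.ones(n)])
    return np.linalg.lstsq(A, W, rcond=None)[0]

bH_alone = fit(H)[0]          # attenuated: near 0.4 rather than 0.7
bJ_alone = fit(J)[0]          # wrong sign: positive despite betaJ = -0.08
bH_both, bJ_both, _ = fit(H, J)  # both close to the true 0.7 and -0.08
print(bH_alone, bJ_alone, bH_both, bJ_both)
```

The one-predictor slope on H settles near 0.7 − r(H,J)/3 ≈ 0.4 here, consistent with the decrease from about 1 to 0.35 described above.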

POINTS OF SIGNIFICANCE

Simple linear regression

"... there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world."1

We have previously defined association between X and Y as meaning that the distribution of Y varies with X. We discussed correlation as a type of association in which larger values of Y are associated with larger values of X (increasing trend) or smaller values of X (decreasing trend)2. If we suspect a trend, we may want to attempt to predict the values of one variable using the values of the other. One of the simplest prediction methods is linear regression, in which we attempt to find a 'best line' through the data points.

Correlation and linear regression are closely linked—they both quantify trends. Typically, in correlation we sample both variables randomly from a population (for example, height and weight), and in regression we fix the value of the independent variable (for example, dose) and observe the response. The predictor variable may also be randomly selected, but we treat it as fixed when making predictions (for example, predicted weight for someone of a given height). We say there is a regression relationship between X and Y when the mean of Y varies with X.

In simple regression, there is one independent variable, X, and one dependent variable, Y. For a given value of X, we can estimate the average value of Y and write this as a conditional expectation E(Y|X), often written simply as m(X). If m(X) varies with X, then we say that Y has a regression on X (Fig. 1). Regression is a specific kind of association and may be linear or nonlinear (Fig. 1c,d).

Figure 1 | A variable Y has a regression on variable X if the mean of Y (black line) E(Y|X) varies with X. (a) If the properties of Y do not change with X, there is no association. (b) Association is possible without regression. Here E(Y|X) is constant, but the variance of Y increases with X. (c) Linear regression E(Y|X) = β0 + β1X. (d) Nonlinear regression E(Y|X) = exp(β0 + β1X).

Here we consider linear regression. In this case, E(Y|X) = m(X) = β0 + β1X, a line with intercept β0 and slope β1. We can interpret this as Y having a distribution with mean m(X) for any given value of X. Here we are not interested in the shape of this distribution; we care only about its mean. The deviation of Y from m(X) is often called the error, ε = Y – m(X). It's important to realize that this term arises not because of any kind of error but because Y has a distribution for a given value of X. In other words, in the expression Y = m(X) + ε, m(X) specifies the location of the distribution, and ε captures its shape. To predict Y at unobserved values of X, one substitutes the desired values of X in the estimated regression equation. Here X is referred to as the predictor, and Y is referred to as the predicted variable.

Consider a relationship between weight Y (in kilograms) and height X (in centimeters), where the mean weight at a given height is m(X) = 2X/3 – 45 for X > 100. Because of biological variability, the weight will vary—for example, it might be normally distributed with a fixed σ = 3 (Fig. 2a). The difference between an observed weight and the mean weight at a given height is referred to as the error for that weight.

Figure 2 | In a linear regression relationship, the response variable has a distribution for each value of the independent variable. (a) At each height, weight is distributed normally with s.d. σ = 3. (b) Linear regression of n = 3 weight measurements for each height. The mean weight varies as m(Height) = 2 × Height/3 – 45 (black line) and is estimated by a regression line (blue line) with 95% confidence interval (blue band). The 95% prediction interval (gray band) is the region in which 95% of the population is predicted to lie for each fixed height.

To discover the linear relationship, we could measure the weight of three individuals at each height and apply linear regression to model the mean weight as a function of height using a straight line, m(X) = β0 + β1X (Fig. 2b). The most popular way to estimate the intercept β0 and slope β1 is the least-squares estimator (LSE). Let (xi, yi) be the ith pair of X and Y values. The LSE estimates β0 and β1 by minimizing the residual sum of squares (sum of squared errors), SSE = Σ(yi – ŷi)², where ŷi = m̂(xi) = b0 + b1xi are the points on the estimated regression line and are called the fitted, predicted or 'hat' values. The estimates are given by b0 = Ȳ – b1X̄ and b1 = rsY/sX, where X̄ and Ȳ are the means of samples X and Y, sX and sY are their s.d. values and r = r(X,Y) is their correlation.

The LSE of the regression line has favorable properties for very general error distributions, which makes it a popular estimation method. When Y values are selected at random from the conditional distribution of Y given X, the LSEs of the intercept, slope and fitted values are unbiased estimates of the population values regardless of the distribution of the errors, as long as they have zero mean. By "unbiased," we mean that although they might deviate from the population values in any sample, they are not systematically too high or too low. However, because the LSE is very sensitive to extreme values of both X (high leverage points) and Y (outliers), diagnostic outlier analyses are needed before the estimates are used.

In the context of regression, the term "linear" can also refer to a linear model, where the predicted values are linear in the parameters. This occurs when E(Y|X) is a linear function of a known function g(X), such as β0 + β1g(X). For example, β0 + β1X² and β0 + β1sin(X) are both linear regressions, but exp(β0 + β1X) is nonlinear because it is not a linear function of the parameters β0 and β1. Analysis of variance (ANOVA) is a special case of a linear model in which the t treatments are labeled by indicator variables X1 ... Xt, E(Y|X1 ... Xt) = μi is the ith treatment mean, and the LSE predicted values are the corresponding sample means3.
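The closed-form estimates above can be checked against a generic least-squares fit. A sketch assuming numpy; the simulated heights and weights follow the m(X) = 2X/3 – 45, σ = 3 scenario of Figure 2:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(150, 180, 30)                  # heights, cm
y = 2 * x / 3 - 45 + rng.normal(0, 3, 30)      # weights with sd-3 noise

# Closed-form least-squares estimates from the text:
r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b1 = r * sY / sX
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = Ybar - b1 * Xbar

# They agree with a generic degree-1 polynomial fit:
slope, intercept = np.polyfit(x, y, 1)
print(np.allclose([b1, b0], [slope, intercept]))  # True
```

Both routes minimize the same SSE, so they agree to floating-point precision.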

Figure 3 | Regression models associate error with the response, which tends to pull predictions closer to the mean of the data (regression to the mean). (a) Uncertainty in a linear regression relationship can be expressed by a 95% confidence interval (blue band) and 95% prediction interval (gray band). Shown are regressions for the relationship in Figure 2a using different amounts of scatter (normally distributed with s.d. σ = 2 and σ = 4). (b) Predictions using successive regressions X → Y → Xʹ regress to the mean. When predicting using height H = 175 cm (larger than average), we predict weight W = 71.6 kg (dashed line). If we then regress H on W at W = 71.6 kg, we predict Hʹ = 172.7 cm, which is closer than H to the mean height. Means of height and weight are shown as dotted lines.

Recall that in ANOVA, the SSE is the sum of squared deviations of the data from their respective sample means (i.e., their predicted values) and represents the variation in the data that is not accounted for by the treatments. Similarly, in regression, the SSE is the sum of squared deviations of the data from the predicted values and represents variation in the data not explained by regression. In ANOVA we also compute the total and treatment sums of squares; the analogous quantities in linear regression are the total sum of squares, SST = (n – 1)s²Y, and the regression sum of squares, SSR = Σ(ŷi – Ȳ)², which are related by SST = SSR + SSE. Furthermore, SSR/SST = r² is the proportion of variance of Y explained by the linear regression on X (ref. 2).

When the errors have constant variance σ², we can model the uncertainty in the regression parameters. In this case, b0 and b1 have means β0 and β1, respectively, and variances σ²(1/n + X̄²/sXX) and σ²/sXX, where sXX = (n – 1)s²X. As we collect X over a wider range, sXX increases, so the variance of b1 decreases. The predicted value ŷ(x) has mean β0 + β1x and variance σ²(1/n + (x – X̄)²/sXX).

Additionally, the mean square error, MSE = SSE/(n – 2), is an unbiased estimator of the error variance (i.e., σ²). This is identical to how MSE is used in ANOVA to estimate the within-group variance, and it can be used as an estimator of σ² in the equations above to allow us to find the standard error (SE) of b0, b1 and ŷ(x). For example, SE(ŷ(x)) = √(MSE(1/n + (x – X̄)²/sXX)).

If the errors are normally distributed, so are b0, b1 and ŷ(x). Even if the errors are not normally distributed, as long as they have zero mean and constant variance, we can apply a version of the central limit theorem for large samples4 to obtain approximate normality for the estimates. In these cases the SE is very helpful in testing hypotheses. For example, to test that the slope is β1 = 2/3, we would use t* = (b1 – β1)/SE(b1); when the errors are normal and the null hypothesis is true, t* has a t-distribution with d.f. = n – 2. We can also calculate the uncertainty of the regression parameters using confidence intervals, the range of values that are likely to contain βi (for example, 95% of the time)5. The interval is bi ± t0.975SE(bi), where t0.975 is the 97.5th percentile of the t-distribution with d.f. = n – 2.

When the errors are normally distributed, we can also use confidence intervals to make statements about the predicted value for a fixed value of X. For example, the 95% confidence interval for m(x) is b0 + b1x ± t0.975SE(ŷ(x)) (Fig. 2b) and depends on the error variance (Fig. 3a). This is called a point-wise interval because the 95% coverage is for a single fixed value of X. One can compute a band that covers the entire line 95% of the time by replacing t0.975 with W0.975 = √(2F0.975), where F0.975 is the critical value from the F2,n–2 distribution. This interval is wider because it must cover the entire regression line, not just one point on the line.

To express uncertainty about where a percentage (for example, 95%) of newly observed data points would fall, we use the prediction interval b0 + b1x ± t0.975√(MSE(1 + 1/n + (x – X̄)²/sXX)). This interval is wider than the confidence interval because it must incorporate both the spread in the data and the uncertainty in the model parameters. A prediction interval for Y at a fixed value of X incorporates three sources of uncertainty: the population variance σ², the variance in estimating the mean and the variability due to estimating σ² with the MSE. Unlike confidence intervals, which are accurate when the sampling distribution of the estimator is close to normal (as usually occurs in sufficiently large samples), the prediction interval is accurate only when the errors are close to normal, and this is not affected by sample size.

Linear regression is readily extended to multiple predictor variables X1, ..., Xp, giving E(Y|X1, ..., Xp) = β0 + ΣβiXi. Clever choice of predictors allows for a wide variety of models. For example, Xi = Xⁱ yields a polynomial of degree p. If there are p + 1 groups, letting Xi = 1 when the sample comes from group i and 0 otherwise yields a model in which the fitted values are the group means. In this model, the intercept is the mean of the last group, and the slopes are the differences in means.

A common misinterpretation of linear regression is the 'regression fallacy'. For example, we might predict weight W = 71.6 kg for a larger than average height H = 175 cm and then predict height Hʹ = 172.7 cm for someone with weight W = 71.6 kg (Fig. 3b). Here we will find Hʹ < H. Similarly, if H is smaller than average, we will find Hʹ > H. The regression fallacy is to ascribe a causal mechanism to regression to the mean, rather than realizing that it is due to the estimation method. Thus, if we start with some value of X, use it to predict Y, and then use Y to predict X, the predicted value will be closer to the mean of X than the original value (Fig. 3b).

Estimating the regression equation by LSE is quite robust to non-normality of and correlation in the errors, but it is sensitive to extreme values of both predictor and predicted. Linear regression is much more flexible than its name might suggest, including polynomials, ANOVA and other commonly used statistical methods.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Naomi Altman & Martin Krzywinski

1. Box, G. J. Am. Stat. Assoc. 71, 791–799 (1976).
2. Altman, N. & Krzywinski, M. Nat. Methods 12, 899–900 (2015).
3. Krzywinski, M. & Altman, N. Nat. Methods 11, 699–700 (2014).
4. Krzywinski, M. & Altman, N. Nat. Methods 10, 809–810 (2013).
5. Krzywinski, M. & Altman, N. Nat. Methods 10, 1041–1042 (2013).

Naomi Altman is a Professor of Statistics at The Pennsylvania State University. Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.
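The confidence- and prediction-interval formulas above can be sketched directly. This is illustrative code, not the column's: the data are simulated from the Figure 2 scenario, and t0.975 is hard-coded to approximately 1.98 (its value for d.f. = 118) rather than looked up from a t-table:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 120
x = rng.uniform(150, 180, n)
y = 2 * x / 3 - 45 + rng.normal(0, 3, n)

b1, b0 = np.polyfit(x, y, 1)
yhat = b0 + b1 * x
mse = np.sum((y - yhat) ** 2) / (n - 2)   # unbiased estimate of sigma^2
sxx = np.sum((x - x.mean()) ** 2)         # (n - 1) * sX^2
t975 = 1.98  # approximate 97.5th t percentile for d.f. = n - 2 = 118

def intervals(x0):
    """95% confidence interval for m(x0) and prediction interval for a new Y."""
    center = b0 + b1 * x0
    se_mean = np.sqrt(mse * (1 / n + (x0 - x.mean()) ** 2 / sxx))
    se_pred = np.sqrt(mse * (1 + 1 / n + (x0 - x.mean()) ** 2 / sxx))
    return ((center - t975 * se_mean, center + t975 * se_mean),
            (center - t975 * se_pred, center + t975 * se_pred))

ci, pi = intervals(165.0)
print(ci, pi)  # the prediction interval is always the wider of the two
```

The prediction interval's extra "1 +" term is the population variance σ², which does not shrink with sample size; that is why the gray band in Figure 2b stays wide even when the regression line itself is well estimated.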

POINTS OF SIGNIFICANCE

Association, correlation and causation

Correlation implies association, but not causation. Conversely, causation implies association, but not correlation.

Most studies include multiple response variables, and the dependencies among them are often of great interest. For example, we may wish to know whether the levels of mRNA and the matching protein vary together in a tissue, or whether increasing levels of one metabolite are associated with changed levels of another. This month we begin a series of columns about relationships between variables (or features of a system), beginning with how pairwise dependencies can be characterized using correlation.

Two variables are independent when the value of one gives no information about the value of the other. For variables X and Y, we can express independence by saying that the chance of measuring any one of the possible values of X is unaffected by the value of Y, and vice versa, or by using conditional probability, P(X|Y) = P(X). For example, successive tosses of a coin are independent—for a fair coin, P(H) = 0.5 regardless of the outcome of the previous toss, because a toss does not alter the properties of the coin. In contrast, if a system is changed by observation, measurements may become associated or, equivalently, dependent. Cards drawn without replacement are not independent; when a red card is drawn, the probability of drawing a black card increases, because now there are fewer red cards.

Association should not be confused with causality; if X causes Y, then the two are associated (dependent). However, associations can arise between variables in the presence (i.e., X causes Y) and absence (i.e., they have a common cause) of a causal relationship, as we've seen in the context of Bayesian networks1. As an example, suppose we observe that people who daily drink more than 4 cups of coffee have a decreased chance of developing skin cancer. This does not necessarily mean that coffee confers resistance to cancer; one alternative explanation would be that people who drink a lot of coffee work indoors for long hours and thus have little exposure to the sun, a known risk. If this is the case, then the number of hours spent outdoors is a confounding variable—a cause common to both observations. In such a situation, a direct causal link cannot be inferred; the association merely suggests a hypothesis, such as a common cause, but does not offer proof. In addition, when many variables in complex systems are studied, spurious associations can arise. Thus, association does not imply causation.

Figure | Fluctuation of correlation of random data and spurious correlations in random data. [...] correlations can arise. (a) Distribution (left) and 95% confidence intervals (right) of correlation coefficients of 10,000 n = 10 samples of two independent normally distributed variables. Statistically significant coefficients (α = 0.05) and corresponding intervals that do not include r = 0 are highlighted in blue. (b) Samples with the three largest and smallest correlation coefficients (statistically significant) from a.

In everyday language, dependence, association and correlation are used interchangeably. Technically, however, association is synonymous with dependence and is different from correlation (Fig. 1a). Association is a very general relationship: one variable provides information about another. Correlation is more specific: two variables are correlated when they display an increasing or decreasing trend. For example, in an increasing trend, observing that X > μX implies that it is more likely that Y > μY. Because not all associations are correlations, and because causality, as discussed above, can be connected only to association, we cannot equate correlation with causality in either direction.

For quantitative and ordinal data, there are two primary measures of correlation: Pearson's correlation (r), which measures linear trends, and Spearman's (rank) correlation (s), which measures increasing and decreasing trends that are not necessarily linear (Fig. 1b). Like other statistics, these have population values, usually referred to as ρ. There are other measures of association that are also referred to as correlation coefficients, but which might not measure trends.

Figure 1 | Correlation is a type of association and measures increasing or decreasing trends quantified using correlation coefficients. (a) Scatter plots of associated (but not correlated), non-associated and correlated variables. In the lower association example, variance in y is increasing with x. (b) The Pearson correlation coefficient (r, black) measures linear trends, and the Spearman correlation coefficient (s, red) measures increasing or decreasing trends. (c) Anscombe's quartet: very different data sets may have similar r values.

When "correlated" is used unmodified, it generally refers to Pearson's correlation, given by ρ(X, Y) = cov(X, Y)/(σXσY), where cov(X, Y) = E((X – μX)(Y – μY)). The correlation computed from the sample is denoted by r. Both variables must be on an interval or ratio scale; r cannot be interpreted if either variable is ordinal. For a linear trend, |r| = 1 in the absence of noise and decreases with noise, but it is also possible that |r| < 1 for perfectly associated nonlinear trends (Fig. 1b). In addition, data sets with very different associations may have the same correlation (Fig. 1c). Thus, a scatter plot should be used to interpret r. If either variable is shifted or scaled, r does not change and r(X, Y) = r(aX + b, Y). However, r is sensitive to nonlinear monotone (increasing or decreasing) transformation. For example, when applying log transformation, r(X, Y) ≠ r(X, log(Y)). It is also sensitive to the range of X or Y values and can decrease as values are sampled from a smaller range.

If an increasing or decreasing but nonlinear relationship is suspected, Spearman's correlation is more appropriate. It is a nonparametric method that converts the data to ranks and then applies the formula for the Pearson correlation. It can be used when X is ordinal

such as curvature or the presence of outliers can be more specific. and is more robust to outliers. It is also not sensitive to monotone

THIS MONTH

enough that r = 0.42 (P = 0.063) is not statistically significant—its

a Effect of noise on correlation b Effect of sample size on correlation

0.95 0.69 0.42 (ns) 0.96 0.59 0.42 confidence interval includes ρ = 0.

When the linear trend is masked by noise, larger samples are need-

ed to confidently measure the correlation. Figure 3b shows how the

1 1 correlation coefficient varies for subsamples of size m drawn from

r 0 0 samples at different noise levels: m = 4–20 (σ = 0.1), m = 4–100 (σ

−1 −1

σ = 0.1 σ = 0.3 σ = 0.6 = 0.3) and m = 4–200 (σ = 0.6). When σ = 0.1, the correlation coef-

0 0.25 0.5 0.75 1 4 8 12 16 20 4 25 50 75 100 4 100 200

σ m ficient converges to 0.96 once m > 12. However, when noise is high,

not only is the value of r lower for the full sample (e.g., r = 0.59 for σ

Figure 3 | Effect of noise and sample size on Pearson’s correlation coefficient = 0.3), but larger subsamples are needed to robustly estimate ρ.

r. (a) r of an n = 20 sample of (X, X + ε), where ε is the normally distributed

The Pearson correlation coefficient can also be used to quantify

noise scaled to standard deviation σ. The amount of scatter and value of r at

three values of σ are shown. The shaded area is the 95% confidence interval.

how much fluctuation in one variable can be explained by its cor-

Intervals that do not include r = 0 are highlighted in blue (σ < 0.58), and relation with another variable. A previous discussion about analysis

those that do are highlighted in gray and correspond to nonsignificant r values of variance4 showed that the effect of a factor on the response vari-

(ns; e.g., r = 0.42 with P = 0.063). (b) As sample size increases, r becomes able can be described as explaining the variation in the response; the

less variable, and the estimate of the population correlation improves. Shown response varied, and once the factor was accounted for, the variation

are samples with increasing size and noise: n = 20 (σ = 0.1), n = 100 (σ = decreased. The squared Pearson correlation coefficient r2 has a simi-

0.3) and n = 200 (σ = 0.6). Traces at the bottom show r calculated from a

lar role: it is the proportion of variation in Y explained by X (and vice

subsample, created from the first m values of each sample.

versa). For example, r = 0.05 means that only 0.25% of the variance

of Y is explained by X (and vice versa), and r = 0.9 means that 81% of

© 2015 Nature America, Inc. All rights reserved.

increasing transformations because they preserve ranks—for exam- the variance of Y is explained by X. This interpretation is helpful in

ple, s(X, Y) = s(X, log(Y)). For both coefficients, a smaller magnitude assessments of the biological importance of the magnitude of r when

corresponds to increasing scatter or a non-monotonic relationship. it is statistically significant.

It is possible to see large correlation coefficients even for random Besides the correlation among features, we may also talk about the

data (Fig. 2a). Thus, r should be reported together with a P value, correlation among the items we are measuring. This is also expressed

which measures the degree to which the data are consistent with as the proportion of variance explained. In particular, if the units are

the null hypothesis that there is no trend in the population. For clustered, then the intraclass correlation (which should be thought of

Pearson’s r, to calculate the P value we use the test statistic √[d.f. × as a squared correlation) is the percent variance explained by the clus-

r2/(1 – r2)], which is t-distributed with d.f. = n – 2 when (X, Y) has ters and given by σb2/(σb2 + σw2), where σb2 is the between-cluster

a bivariate normal distribution (P for s does not require normality) variation and σb2 + σw2 is the total between- and within-cluster varia-

and the population correlation is 0. Even more informative is a 95% tion. This formula was discussed previously in an examination of

confidence interval, often calculated using the bootstrap method2. the percentage of total variance explained by biological variation5

In Figure 2a we see that values up to |r| < 0.63 are not statistically where the clusters are the technical replicates for the same biological

significant—their confidence intervals span zero. More important, replicate. As with the correlation between features, the higher the

there are very large correlations that are statistically significant (Fig. intraclass correlation, the less scatter in the data—this time measured

2a) even though they are drawn from a population in which the not from the trend curve but from the cluster centers.

true correlation is ρ = 0. These spurious cases (Fig. 2b) should be Association is the same as dependence and may be due to direct or

npg

expected any time a large number of correlations is calculated— indirect causation. Correlation implies specific types of association

for example, a study with only 140 genes yields 9,730 correlations. such as monotone trends or clustering, but not causation. For exam-

Conversely, modest correlations between a few variables, known to ple, when the number of features is large compared with the sample

be noisy, could be biologically interesting. size, large but spurious correlations frequently occur. Conversely,

Because P depends on both r and the sample size, it should never when there are a large number of observations, small and substan-

be used as a measure of the strength of the association. It is possible tively unimportant correlations may be statistically significant.

for a smaller r, whose magnitude can be interpreted as the estimated

COMPETING FINANCIAL INTERESTS

effect size, to be associated with a smaller P merely because of a large

The authors declare no competing financial interests.

sample size3. Statistical significance of a correlation coefficient does

not imply substantive and biologically relevant significance. Naomi Altman & Martin Krzywinski

The value of both coefficients will fluctuate with different sam-

1. Puga, J.L., Krzywinski, M. & Altman, N. Nat. Methods 12, 799–800 (2015).

ples, as seen in Figure 2, as well as with the amount of noise and/

2. Kulesa, A., Krzywinski, M., Blainey, P. & Altman, N. Nat. Methods 12, 477–

or the sample size. With enough noise, the correlation coefficient 478 (2015).

can cease to be informative about any underlying trend. Figure 3a 3. Krzywinski, M. & Altman, N. Nat. Methods 10, 1139–1140 (2013).

shows a perfectly correlated relationship (X, X) where X is a set of 4. Krzywinski, M. & Altman, N. Nat. Methods 11, 699–700 (2014).

5. Altman, N. & Krzywinski, M. Nat. Methods 12, 5–6 (2015).

n = 20 points uniformly distributed in the range [0, 1] in the pres-

ence of different amounts of normally distributed noise with a stan-

Naomi Altman is a Professor of Statistics at The Pennsylvania State University.

dard deviation σ. As σ increases from 0.1 to 0.3 to 0.6, r(X, X + σ) Martin Krzywinski is a staff scientist at Canada’s Michael Smith Genome Sciences

decreases from 0.95 to 0.69 to 0.42. At σ = 0.6 the noise is high Centre.
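The contrast drawn above between Pearson's r (linear trends, invariant only to shifting and scaling) and Spearman's s (any monotone trend, invariant to monotone increasing transforms) can be checked directly. Below is a minimal pure-Python sketch; the helper names pearson_r and spearman_s are ours, not from the article:

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_s(x, y):
    """Spearman's s: Pearson's r computed on the ranks (assumes no ties)."""
    def ranks(v):
        order = sorted(v)
        return [order.index(e) + 1 for e in v]
    return pearson_r(ranks(x), ranks(y))

# A perfectly associated but nonlinear monotone trend: y = exp(4x).
x = [0.1 * i for i in range(1, 11)]
y = [math.exp(4 * a) for a in x]

r = pearson_r(x, y)   # |r| < 1: the trend is monotone but not linear
s = spearman_s(x, y)  # exactly 1: every rank agrees

# r is unchanged by shifting/scaling but changed by a log transform;
# s is unchanged by any monotone increasing transform.
r_affine = pearson_r([2 * a + 3 for a in x], y)
r_log = pearson_r(x, [math.log(b) for b in y])  # log(y) = 4x, perfectly linear
s_log = spearman_s(x, [math.log(b) for b in y])
```

Here r_affine equals r, s_log equals s, and r_log jumps to 1 because the log transform linearizes the trend, illustrating r(X, Y) ≠ r(X, log(Y)) while s(X, Y) = s(X, log(Y)).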

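The noise experiment of Figure 3a is also easy to reproduce. For uniform X on [0, 1] and Y = X + ε, the population correlation is σX/√(σX² + σ²), which gives roughly 0.95, 0.69 and 0.43 at σ = 0.1, 0.3 and 0.6. A stdlib-only sketch (all variable names are ours, and a large n is used so the estimates are stable, unlike the article's n = 20):

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

random.seed(1)
n = 2000                                 # large n keeps the estimates stable
x = [random.random() for _ in range(n)]  # X uniform on [0, 1]

# r(X, X + eps) at the three noise levels used in Figure 3a.
r_by_sigma = {}
for sigma in (0.1, 0.3, 0.6):
    y = [a + random.gauss(0, sigma) for a in x]
    r_by_sigma[sigma] = pearson_r(x, y)

# Test statistic from the text: sqrt(d.f. * r^2 / (1 - r^2)), d.f. = n - 2.
r_noisy = r_by_sigma[0.6]
t = math.sqrt((n - 2) * r_noisy**2 / (1 - r_noisy**2))
```

As expected, r falls as σ grows. Note also that at n = 2000 the same r ≈ 0.43 that was not significant at n = 20 yields a very large t statistic, echoing the point that P depends on both r and the sample size and is not a measure of effect size.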