
POINTS OF SIGNIFICANCE

Multiple linear regression

When multiple variables are associated with a response, the interpretation of a prediction equation is seldom simple.

Last month we explored how to model a simple relationship between two variables, such as the dependence of weight on height1. In the more realistic scenario of dependence on several variables, we can use multiple linear regression (MLR). Although MLR is similar to linear regression, the interpretation of MLR correlation coefficients is confounded by the way in which the predictor variables relate to one another.

In simple linear regression1, we model how the mean of variable Y depends linearly on the value of a predictor variable X; this relationship is expressed as the conditional expectation E(Y|X) = β0 + β1X. For more than one predictor variable X1, ..., Xp, this becomes β0 + Σj βjXj. As for simple linear regression, one can use the least-squares estimator (LSE) to determine estimates bj of the βj regression parameters by minimizing the residual sum of squares, SSE = Σi (yi − ŷi)², where ŷi = b0 + Σj bj xij. When we use the regression sum of squares, SSR = Σi (ŷi − Ȳ)², the ratio R² = SSR/(SSR + SSE) is the amount of variation explained by the regression model and in multiple regression is called the coefficient of determination.

The slope βj is the change in Y if predictor j is changed by one unit and others are held constant. When normality and independence assumptions are fulfilled, we can test whether any (or all) of the slopes are zero using a t-test (or regression F-test). Although the interpretation of βj seems to be identical to its interpretation in the simple linear regression model, the innocuous phrase "and others are held constant" turns out to have profound implications.

Table 1 | Regression coefficients and predictor correlations

Actual coefficients: βH = 0.7, βJ = −0.08, β0 = −46.5

Predictors fitted        bH       bJ        b0          R²
Uncorrelated predictors, r(H,J) = 0
  H                      0.71     —         −51.7       0.66
  J                      —        −0.088    69.3        0.19
  H, J                   0.71     −0.088    −47.3       0.85
Correlated predictors, r(H,J) = 0.9
  H                      0.44     —         −8.1 (ns)   0.64
  J                      —        0.097     60.2        0.42
  H, J                   0.63     −0.056    −36.2       0.67

Actual (βH, βJ, β0) and estimated (bH, bJ, b0) regression coefficients and coefficient of determination (R²) for uncorrelated and highly correlated predictors in scenarios where either H or J or both H and J predictors are fitted in the regression. Regression coefficient estimates for all values of predictor sample correlation, r(H,J), are shown in Figure 2. ns, not significant.
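The least-squares fit and the R² computation described above can be sketched in a few lines of NumPy. This is an illustrative implementation written for this summary (the column itself does not publish code); `fit_mlr` is a name chosen here.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares estimates for y = b0 + sum_j bj * X[:, j], plus R^2.
    Returned coefficients: b[0] is the intercept b0, b[1:] are the slopes bj."""
    A = np.column_stack([np.ones(len(y)), X])   # design matrix with intercept column
    b, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes SSE = sum (y - yhat)^2
    yhat = A @ b
    sse = np.sum((y - yhat) ** 2)               # residual sum of squares
    ssr = np.sum((yhat - y.mean()) ** 2)        # regression sum of squares
    return b, ssr / (ssr + sse)                 # R^2 = SSR / (SSR + SSE)

# toy check on an exact (noise-free) linear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = -46.5 + 0.7 * X[:, 0] - 0.08 * X[:, 1]
b, r2 = fit_mlr(X, y)
```

With noise-free data the estimates recover the coefficients exactly and R² is 1; with noise, R² drops below 1 as SSE grows.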

To illustrate MLR—and some of its perils—here we simulate predicting the weight (W, in kilograms) of adult males from their height (H, in centimeters) and their maximum jump height (J, in centimeters). We use a model similar to that presented in our previous column1, but we now include the effect of J as E(W|H,J) = βHH + βJJ + β0 + ε, with βH = 0.7, βJ = −0.08, β0 = −46.5 and normally distributed noise ε with zero mean and σ = 1 (Table 1). We set βJ negative to model the effect of J when height is held constant (i.e., among men of the same height, lighter men will tend to jump higher). For this example we simulated a sample of size n = 40 with H and J normally distributed with means of 165 cm (σ = 3) and 50 cm (σ = 12.5), respectively.
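The simulated sample described above can be reproduced in outline as follows; the random seed and the use of NumPy are choices made here, since the column does not specify its implementation.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed, for reproducibility only

n = 40
beta_H, beta_J, beta_0 = 0.7, -0.08, -46.5       # actual coefficients (Table 1)
H = rng.normal(165, 3.0, n)                      # height in cm
J = rng.normal(50, 12.5, n)                      # jump height in cm, drawn independently of H
W = beta_H * H + beta_J * J + beta_0 + rng.normal(0, 1.0, n)   # noise with sigma = 1

# fit both predictors by least squares
A = np.column_stack([np.ones(n), H, J])
(b0, bH, bJ), *_ = np.linalg.lstsq(A, W, rcond=None)
```

With n = 40 and σ = 1, the fitted bH and bJ land close to the actual βH and βJ when both predictors are in the model.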

Although the statistical theory for MLR seems similar to that for simple linear regression, the interpretation of the results is much more complex. Problems in interpretation arise entirely as a result of the sample correlation2 among the predictors. We do, in fact, expect a positive correlation between H and J—tall men will tend to jump higher than short ones. To illustrate how this correlation can affect the results, we generated values using the model for weight with samples of J and H with different amounts of correlation.
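One way to draw H and J with a prescribed amount of correlation is to sample from a bivariate normal distribution. This construction is an assumption for illustration; the column does not state how its correlated samples were generated.

```python
import numpy as np

def simulate_predictors(n, r, rng):
    """Draw n (H, J) pairs with population correlation r.
    Bivariate-normal sampling is an assumption made for this sketch."""
    s_H, s_J = 3.0, 12.5
    cov = [[s_H ** 2,       r * s_H * s_J],
           [r * s_H * s_J,  s_J ** 2     ]]
    H, J = rng.multivariate_normal([165.0, 50.0], cov, size=n).T
    return H, J

# with a large sample the observed correlation is close to the target r
H, J = simulate_predictors(50_000, 0.9, np.random.default_rng(2))
r_hat = np.corrcoef(H, J)[0, 1]
```

For small samples such as n = 40, the realized sample correlation r(H,J) will scatter around the target value.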

Let’s look first at the regression coefficients estimated when the predictors are uncorrelated, r(H,J) = 0, as evidenced by the zero slope in the association between H and J (Fig. 1a). Here r is the Pearson correlation coefficient2. If we ignore the effect of J and regress W on H, we find Ŵ = 0.71H − 51.7 (R² = 0.66) (Table 1 and Fig. 1b). Ignoring H, we find Ŵ = −0.088J + 69.3 (R² = 0.19). If both predictors are fitted in the regression, we obtain Ŵ = 0.71H − 0.088J − 47.3 (R² = 0.85). This regression fit is a plane in three dimensions (H, J, W) and is not shown in Figure 1. In all three cases, the results of the regression F-test for zero slopes show high significance (P ≤ 0.005).

Figure 1 | The results of multiple linear regression depend on the correlation of the predictors, as measured here by the Pearson correlation coefficient r (ref. 2). (a) Simulated values of uncorrelated predictors, r(H,J) = 0. The thick gray line is the regression line, and thin gray lines show the 95% confidence interval of the fit. (b) Regression of weight (W) on height (H) and of weight on jump height (J) for uncorrelated predictors shown in a. Regression slopes are shown (bH = 0.71, bJ = −0.088). (c) Simulated values of correlated predictors, r(H,J) = 0.9. Regression and 95% confidence interval are denoted as in a. (d) Regression (red lines) using correlated predictors shown in c. Light red lines denote the 95% confidence interval. Notice that bJ = 0.097 is now positive. The regression line from b is shown in blue. In all graphs, horizontal and vertical dotted lines show average values.
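The three regressions just described can be sketched as follows. To mimic the r(H,J) = 0 scenario exactly, J is orthogonalized against H so that the sample correlation is exactly zero; this is a device for the sketch, not a step taken in the column.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
H = rng.normal(165, 3.0, n)
J = rng.normal(50, 12.5, n)

# force the SAMPLE correlation r(H, J0) to be exactly zero by removing from J
# its projection onto centered H (a device for this demonstration only)
Hc = H - H.mean()
Jc = J - J.mean()
J0 = J.mean() + Jc - (Jc @ Hc) / (Hc @ Hc) * Hc

W = 0.7 * H - 0.08 * J0 - 46.5 + rng.normal(0, 1.0, n)

def r2(y, *cols):
    """R^2 of the least-squares regression of y on the given predictor columns."""
    A = np.column_stack([np.ones(len(y)), *cols])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    yhat = A @ b
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

r2_H, r2_J, r2_HJ = r2(W, H), r2(W, J0), r2(W, H, J0)
```

With the sample correlation exactly zero, the single-predictor R² values add up exactly to the two-predictor R², matching the 0.66 + 0.19 = 0.85 pattern in Table 1.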

When the sample correlations of the predictors are exactly zero, the regression slopes (bH and bJ) for the “one predictor at a time” regressions and the multiple regression are identical, and the simple regression R² values sum to the multiple regression R² (0.66 + 0.19 = 0.85; Fig. 2). The intercept changes when we add a predictor with a nonzero mean to satisfy the constraint that the least-squares regression line goes through the sample means, which is always true when the regression model includes an intercept.

Balanced factorial experiments show a sample correlation of zero among the predictors when their levels have been fixed. For example, we might fix three heights and three jump heights and select two men representative of each combination, for a total of 18 subjects to be weighed. But if we select the samples and then measure the predictors and response, the predictors are unlikely to have zero correlation.

When we simulate highly correlated predictors, r(H,J) = 0.9 (Fig. 1c), we find that the regression parameters change depending on whether we use one or both predictors (Table 1 and Fig. 1d). If we consider only the effect of H, the coefficient βH = 0.7 is inaccurately estimated as bH = 0.44. If we include only J, we estimate βJ = −0.08 inaccurately, and even with the wrong sign (bJ = 0.097). When we use both predictors, the estimates are quite close to the actual coefficients (bH = 0.63, bJ = −0.056).

In fact, as the correlation between predictors r(H,J) changes, the estimates of the slopes (bH, bJ) and intercept (b0) vary greatly when only one predictor is fitted. We show the effects of this variation for all values of predictor correlation (both positive and negative) across 250 trials at each value (Fig. 2). We include negative correlation because although J and H are likely to be positively correlated, other scenarios might use negatively correlated predictors (e.g., lung capacity and smoking habits). For example, if we include only H in the regression and ignore the effect of J, bH steadily decreases from about 1 to 0.35 as r(H,J) increases. Why is this? For a given height, larger values of J (an indicator of fitness) are associated with lower weight. If J and H are negatively correlated, as J increases, H decreases, and both changes result in a lower value of W. Conversely, as J decreases, H increases, and thus W increases. If we use only H as a predictor, J is lurking in the background, depressing W at low values of H and enhancing W at high levels of H, so that the effect of H is overestimated (bH increases). The opposite effect occurs when J and H are positively correlated. A similar effect occurs for bJ, which increases in magnitude (becomes more negative) when J and H are negatively correlated. Supplementary Figure 1 shows the effect of correlation when both regression coefficients are positive.

When both predictors are fitted (Fig. 2), the regression coefficient estimates (bH, bJ, b0) are centered at the actual coefficients (βH, βJ, β0) with the correct sign and magnitude regardless of the correlation of the predictors. However, the standard error in the estimates steadily increases as the absolute value of the predictor correlation increases.

Neglecting important predictors has implications not only for R², which is a measure of the predictive power of the regression, but also for interpretation of the regression coefficients. Unconsidered variables that may have a strong effect on the estimated regression coefficients are sometimes called ‘lurking variables’. For example, muscle mass might be a lurking variable with a causal effect on both body weight and jump height. The results and interpretation of the regression will also change if other predictors are added.

Figure 2 | Results and interpretation of multiple regression change with the sample correlation of the predictors. Shown are the values of the regression coefficient estimates (bH, bJ, b0) and R² and the significance of the test used to determine whether each coefficient is zero, from 250 simulations at each value of predictor sample correlation −1 < r(H,J) < 1, for each scenario in which either H or J or both H and J predictors are fitted in the regression. Thick and thin black curves show the coefficient estimate median and the boundaries of the 10th–90th percentile range, respectively. Histograms show the fraction of estimated P values in different significance ranges, and correlation intervals are highlighted in red where >20% of the P values are >0.01. Actual regression coefficients (βH, βJ, β0) are marked on the vertical axes. The decrease in significance for bJ when jump height is the only predictor and r(H,J) is moderate (red arrow) is due to insufficient statistical power (βJ is close to zero). When predictors are uncorrelated, r(H,J) = 0, the R² of the individual regressions sum to the R² of the multiple regression (0.66 + 0.19 = 0.85). Panels are organized to correspond to Table 1, which shows estimates from a single trial at two different predictor correlations.
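The growing variability of the estimates with predictor correlation can be checked with a small simulation in the spirit of Figure 2. This is a sketch under the same model assumptions, not the authors' code; `bH_spread` is a name chosen here.

```python
import numpy as np

rng = np.random.default_rng(4)

def bH_spread(r, trials=250, n=40):
    """Standard deviation of the fitted bH across simulated trials when both
    predictors are in the model, at predictor correlation r."""
    s_H, s_J = 3.0, 12.5
    cov = [[s_H ** 2, r * s_H * s_J], [r * s_H * s_J, s_J ** 2]]
    ests = []
    for _ in range(trials):
        H, J = rng.multivariate_normal([165.0, 50.0], cov, size=n).T
        W = 0.7 * H - 0.08 * J - 46.5 + rng.normal(0, 1.0, n)
        A = np.column_stack([np.ones(n), H, J])
        b, *_ = np.linalg.lstsq(A, W, rcond=None)
        ests.append(b[1])             # the fitted bH
    return float(np.std(ests))

spread_r0 = bH_spread(0.0)
spread_r9 = bH_spread(0.9)   # theory predicts roughly a 1/sqrt(1 - 0.9^2) ~ 2.3x inflation
```

The spread at r = 0.9 comes out clearly larger than at r = 0, consistent with the standard-error inflation described in the text.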

Given that missing predictors can affect the regression, should we try to include as many predictors as possible? No, for three reasons. First, any correlation among predictors will increase the standard error of the estimated regression coefficients. Second, having more slope parameters in our model will reduce interpretability and cause problems with multiple testing. Third, the model may suffer from overfitting. As the number of predictors approaches the sample size, we begin fitting the model to the noise. As a result, we may seem to have a very good fit to the data but still make poor predictions.
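The overfitting hazard described above is easy to demonstrate: padding the design matrix with pure-noise predictors drives the in-sample R² toward 1 without adding any real predictive value. This is an illustrative sketch, not an analysis from the column.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
H = rng.normal(165, 3.0, n)
W = 0.7 * H - 46.5 + rng.normal(0, 1.0, n)

def in_sample_r2(p):
    """In-sample R^2 after adding p pure-noise predictors alongside H."""
    A = np.column_stack([np.ones(n), H, rng.normal(size=(n, p))])
    b, *_ = np.linalg.lstsq(A, W, rcond=None)
    yhat = A @ b
    return 1 - np.sum((W - yhat) ** 2) / np.sum((W - W.mean()) ** 2)

r2_honest = in_sample_r2(0)     # H only
r2_stuffed = in_sample_r2(38)   # 40 parameters for 40 observations: fits the noise
```

With as many parameters as observations, the model interpolates the data (R² near 1), yet the noise predictors carry no information about weight, so out-of-sample predictions would be poor.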

MLR is powerful for incorporating many predictors and for estimating the effects of a predictor on the response in the presence of other covariates. However, the estimated regression coefficients depend on the predictors in the model, and they can be quite variable when the predictors are correlated. Accurate prediction of the response is not an indication that regression slopes reflect the true relationship between the predictors and the response.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Martin Krzywinski & Naomi Altman

1. Altman, N. & Krzywinski, M. Nat. Methods 12, 999–1000 (2015).
2. Altman, N. & Krzywinski, M. Nat. Methods 12, 899–900 (2015).

Naomi Altman is a Professor of Statistics at The Pennsylvania State University. Martin Krzywinski is a staff scientist at Canada’s Michael Smith Genome Sciences Centre.

© 2015 Nature America, Inc. All rights reserved.
