
Six Sigma Green Belt Training

Correlation/Regression
© 2004 American Society for Quality. All Rights Reserved.
Recognize, Define, Measure, Analyze, Improve, Control

About This Module . . .

Correlation Analysis is used to quantify: the degree of association between variables
Regression Analysis is used to quantify: the functional relationship between variables

Six Sigma, A Quest for Process Perfection Attack Variation and Meet Goals

Data files:
\DataFile\Correl.mtw
\DataFile\RegressAnova.mtw
\DataFile\Correg Your Turn.mtw

Version 2.1

What We Will Learn . . .

Correlation
– How to measure the linear relationship between two variables
– How to interpret the Pearson Correlation Coefficient, r

Regression
– Y = f(X): how to regress a dependent variable, Y, on an independent variable, X (simple linear regression)
– How to interpret the Coefficient of Determination, R-Sq
– How to interpret the ANOVA table for simple linear regression
– How to analyze residuals

Real World Examples
ADMINISTRATIVE: A software company wants to know the relationship between calls in queue and service time.
DESIGN: A chemical engineer, designing a new process, wants to investigate the relationship between a key input variable and the stack-loss of ammonia.
MANUFACTURING: A quality manager wants to predict the strength of a plastic molding by destructively testing a "coupon."

Terms
Correlation
– Used when both Y and X are continuous
– Measures the strength of linear relationship between Y and X
– Metric: Pearson Correlation Coefficient, r (r varies between -1 and +1)
  – Perfect positive relationship, r = 1
  – No linear relationship, r = 0
  – Perfect negative relationship, r = -1
Regression
– Simple linear regression used when both Y and X are continuous
– Quantifies the relationship between Y and X (Y = b0 + b1X)
– Metric: Coefficient of Determination, R-Sq (varies from 0.0 to 1.0 or zero to 100%)
  – None of the variation in Y is explained by X, R-Sq = 0.0
  – All of the variation in Y is explained by X, R-Sq = 1.0
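The two metrics above can be computed by hand. The sketch below is one way to do it in plain Python, assuming small made-up data sets (the values are hypothetical and are not taken from the course data files); `pearson_r` is an illustrative helper name, not a Minitab function.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data (not from the course data files)
x = [98.0, 99.0, 100.0, 101.0, 102.0, 103.0]
y_up = [2.0 * xi + 1.0 for xi in x]   # points on an increasing line
y_down = [-xi for xi in x]            # points on a decreasing line

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)

# For simple linear regression (one X), R-Sq equals r squared
r_sq_up = r_up ** 2
print(round(r_up, 6), round(r_down, 6), round(r_sq_up, 6))
```

Points that fall exactly on a line give r = +1 or r = -1, and in both cases R-Sq = 1.0, which matches the limits listed above.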

Correlation Coefficients: Illustration
[Figure: three scatterplots of Y versus X, showing r = +1.0, r = -1.0, and r = 0.0]

Correlation: Minitab Example
– Voltage for the same power supply is measured at Station 1 and Station 2.
– Determine the correlation for voltage between the two stations.
Approach:
– Open Datafile\CORREL.mtw (the data are displayed in the Data Window)
– Go to Stat > Basic Statistics > Correlation…
[Data table: paired voltage readings for Station 1 and Station 2, as listed in CORREL.mtw]

Correlation: Minitab Example (Cont'd)
1. Select C1 Station 1 and C2 Station 2
2. Push Select
3. Observe 'Station 1' and 'Station 2' as Variables:
4. Select Display p-values
5. Select OK

Correlation: Minitab Example (Cont'd)
From Minitab Session Window:
  Correlations: Station 1, Station 2
  Pearson correlation of Station 1 and Station 2 = 0.959
  P-Value = 0.000
Null Hypothesis: NO correlation between Station 1 and Station 2 (H0 is rejected because p is less than 0.05)
Graph > Scatterplot…
[Figure: Scatterplot of Station 1 vs Station 2]

Simple Linear Regression Analysis
– Used to fit lines and curves to data when the parameters (b's) are linear
– The fitted lines
  – Quantify the relationship between the predictor (input) variable (X) and response (output) variable (Y)
  – Help to identify the vital few X's ("funneling")
  – Enable predictions of the response Y to be made from a knowledge of the predictor X
  – Identify the impact of controlling a process input variable (X) on a process output variable (Y)
– Produces an equation of the form $\hat{Y} = b_0 + b_1 X$, where $\hat{Y}$ is an estimate ("fitted value") of the population value Y

Regression: Minitab Example
– The voltage at Station 1 is correlated with the voltage at Station 2.
– A Green Belt is given the task of predicting voltage at Station 2 from the voltage at Station 1.
Approach:
– Open Datafile\CORREL.mtw (the data are displayed in the Data Window)
– Go to Stat > Regression > Fitted Line Plot…
[Data table: paired voltage readings for Station 1 and Station 2, as listed in CORREL.mtw]

Regression: Minitab Example (Cont'd)
1. Select C1 Station 1 and C2 Station 2
2. Push Select
3. Observe 'Station 1' as Response (Y): and 'Station 2' as Predictor (X):
4. Select Linear as Type of Regression Model
5. Select OK

Regression: Minitab Example (Cont'd)
Prediction equation: Station 1 = 1.020 + 0.8729 Station 2
S = 0.0557288, R-Sq = 92.0%, R-Sq(adj) = 91.5%
– Fitted line: obeys the prediction equation
– Coefficient of Determination: use R-Sq for simple linear regression (one X)
[Figure: Fitted Line Plot of Station 1 versus Station 2]
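As a quick consistency check, R-Sq for a simple linear regression should equal the square of the Pearson correlation coefficient. The sketch below uses only the two values reported in the Minitab output for the station voltage example (r = 0.959 and R-Sq = 92.0%); the variable names are illustrative.

```python
# Values reported in the Minitab Session Window and fitted line plot
r = 0.959              # Pearson correlation of Station 1 and Station 2
r_sq_reported = 92.0   # R-Sq, in percent

# For one predictor, R-Sq = r squared (expressed here in percent)
r_sq_from_r = 100.0 * r ** 2
print(f"R-Sq from r: {r_sq_from_r:.1f}%  (reported: {r_sq_reported:.1f}%)")
```

The two agree to the reported precision, which is why the correlation and regression analyses of the same two columns tell a consistent story.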

Linear Regression of Station 1 on Station 2
– How is dependent Station 1 related to independent Station 2, or what is the regression of Station 1 on Station 2?
– From the Session Window, the regression equation is
    Station 1 = 1.020 + 0.8729 Station 2
  where 1.020 is the intercept, b0, and 0.8729 is the slope, b1
  – The intercept, b0, is where the fitted line (regression line) crosses the Y-axis when X = 0
  – The slope, b1, is "rise over run" or ΔY/ΔX
– The coefficients b0 and b1 are estimates of the population parameters β0 and β1: they are linear coefficients
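To show how the regression equation is used for prediction, here is a minimal sketch that evaluates it for one input. The coefficients come from the Session Window output above; the input of 9.0 volts and the function name `predict_station1` are assumptions chosen for illustration.

```python
B0 = 1.020    # intercept, b0, from the Session Window
B1 = 0.8729   # slope, b1, from the Session Window

def predict_station1(station2_voltage):
    """Fitted value of Station 1 for a given Station 2 voltage."""
    return B0 + B1 * station2_voltage

# Hypothetical Station 2 reading, inside the range of the plotted data
pred = predict_station1(9.0)
print(round(pred, 4))
```

Predictions are only trustworthy inside the range of X values used to fit the line; the intercept at X = 0 is far outside the measured voltages and should not be read as a physical prediction.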

Origin of the Regression Equation
– What is the best fitted line between the Time to Invoice and the Items Ordered?
– The best fitted line goes through the means of Y and X (shown by the cross)
[Figure: Scatter Plot of Time to Invoice (Y) versus Items Ordered (X)]

Least Squares Method
– Residual, r = Observed Value - Predicted Value
– The "least squares method" minimizes the sum of the squared residuals
– The resulting equations for the intercept and slope are called the normal equations
[Figure: Fitted Line and Residuals, Time to Invoice (Y) versus Items Ordered (X), with one residual r marked]
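The least squares method can be sketched in a few lines. The example below solves the normal equations for a simple linear regression in closed form, then checks two properties that follow from them: the fitted line passes through the means of X and Y, and the residuals sum to zero. The data are hypothetical (they are not the Time to Invoice data behind the slide), and `least_squares_fit` is an illustrative name.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx                 # slope
    b0 = mean_y - b1 * mean_x      # intercept
    return b0, b1

# Hypothetical data: items ordered (X) and time to invoice (Y)
items = [45.0, 55.0, 60.0, 70.0, 80.0, 95.0]
time_to_invoice = [48.0, 60.0, 58.0, 75.0, 79.0, 96.0]

b0, b1 = least_squares_fit(items, time_to_invoice)

# Residual = observed value - predicted value
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(items, time_to_invoice)]

mean_x = sum(items) / len(items)
mean_y = sum(time_to_invoice) / len(time_to_invoice)
print("slope:", round(b1, 4), "intercept:", round(b0, 4))
print("sum of residuals:", round(sum(residuals), 10))
```

Individual residuals can be positive, negative, or zero, but with an intercept in the model their sum is always zero up to rounding error.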

Least Squares Method (Cont'd)
A residual may be positive, negative, or zero:
– Positive: point above the fitted line
– Zero: point on the fitted line
– Negative: point below the fitted line
[Figure: Fitted Line and Residuals, Time to Invoice (Y) versus Items Ordered (X), with a positive, a zero, and a negative residual marked]

Statistical Significance
– An analysis of variance (ANOVA) table informs us about the statistical significance of the regression analysis
– The null hypothesis, H0: the regression results from common cause variation. When H0 is true, there is no statistically significant regression and the best prediction of Y is the mean of Y
– As before, the p-value is used to evaluate the null hypothesis: if p is less than 0.05, the null hypothesis is rejected, and the regression is statistically significant
Approach:
– Use Datafile\REGRESSANOVA.mtw
– Go to Stat > Regression… > Regression

ANOVA for Simple Linear Regression
1. Select Options
2. Select Pure Error in Lack of Fit Tests
3. Select OK

ANOVA for Simple Linear Regression
Observe the ANOVA (Minitab Session Window)

Analysis of Variance
Source            DF       SS       MS        F       P
Regression         1   32.123   32.123   722.31   0.000
Residual Error    12    0.534    0.044
  Lack of Fit      3    0.212    0.071     1.98   0.188
  Pure Error       9    0.322    0.036
Total             13   32.657

– Regression is significant: p < 0.05
– No lack of fit: p >= 0.05
– The sum of squares (SS) for Regression involves each predicted value of Y minus the mean of Y
– The SS for Residual Error involves each observed value minus the predicted value, that is, the residual
  – SS for Residual Error can be further decomposed into SS Lack of Fit and SS Pure Error
  – SS Pure Error is the within subgroup variation and SS Lack of Fit is the SS Residual Error minus the SS Pure Error
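The mean squares and F-values in the ANOVA table follow directly from the SS and DF columns. The sketch below recomputes them from the values in the Session Window output (SS of 32.123, 0.534, 0.212, and 0.322 with DF of 1, 12, 3, and 9). Because the printed SS values are rounded to three decimals, the recomputed F-values are expected to match the table only approximately.

```python
# SS and DF as printed in the Minitab ANOVA table
ss = {"Regression": 32.123, "Residual Error": 0.534,
      "Lack of Fit": 0.212, "Pure Error": 0.322}
df = {"Regression": 1, "Residual Error": 12,
      "Lack of Fit": 3, "Pure Error": 9}

# Mean square = sum of squares / degrees of freedom
ms = {source: ss[source] / df[source] for source in ss}

# F for the regression and F for the lack-of-fit test
f_regression = ms["Regression"] / ms["Residual Error"]
f_lack_of_fit = ms["Lack of Fit"] / ms["Pure Error"]

for source in ss:
    print(f"{source:15s} MS = {ms[source]:.3f}")
print(f"F (regression)  = {f_regression:.2f}")
print(f"F (lack of fit) = {f_lack_of_fit:.2f}")
```

The p-values in the table come from the F distribution with the matching degrees of freedom (1 and 12 for the regression, 3 and 9 for lack of fit); they are not recomputed here because the Python standard library has no F distribution.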

Degrees of Freedom (Linear Regression)
– Every sum of squares (SS) has a number called degrees of freedom, DOF
– DOF is the number of independent pieces of information, involving the n independent responses Y1, Y2, …, Yn, that are needed to compile the sum of squares
– SS about the mean needs (n - 1) pieces of information
– SS due to regression needs one piece of information, b1
– SS of residuals needs (n - 2) pieces of information: in general, DOF for SS of residuals equals (number of observations - number of parameters estimated); the parameters are b0 and b1 for simple linear regression

ANOVA for Simple Linear Regression (Cont'd)
Source of Variation and Degrees of Freedom

Source            Degrees of Freedom
Regression        1
Residual Error    n - 2
  Lack of Fit     DOF RESIDUAL ERROR - DOF PURE ERROR
  Pure Error      $\sum_{j=1}^{p} (m_j - 1)$
Total             n - 1

m = sample size of subgroup
p = number of subgroups

ANOVA for Simple Linear Regression (Cont'd)
Identifying subgroups and sample size
– Observe Minitab Data Window
– A subgroup contains all of the predictor variables, Xi, that have the same value
– The sample size is the number of cells in each subgroup

Subgroup #1, m = 2
Subgroup #2, m = 4
Subgroup #3, m = 3
Subgroup #4, m = 2
Subgroup #5, m = 3

DOF PURE ERROR = 1 + 3 + 2 + 1 + 2 = 9
Each subgroup: "within subgroup" variation
[Data table: Predictor, Xi, and Response, Yi, as listed in the Minitab Data Window]
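The pure error degrees of freedom can be counted mechanically by grouping on the predictor value. In the sketch below, only the subgroup sizes (2, 4, 3, 2, 3) come from the slide; the predictor levels 1.0 through 5.0 are an assumption made for illustration.

```python
from collections import Counter

# Assumed predictor levels; subgroup sizes 2, 4, 3, 2, 3 are from the slide
x_values = [1.0] * 2 + [2.0] * 4 + [3.0] * 3 + [4.0] * 2 + [5.0] * 3

# A subgroup is the set of observations sharing one predictor value
subgroup_sizes = Counter(x_values)

n = len(x_values)
dof_pure_error = sum(m - 1 for m in subgroup_sizes.values())
dof_residual_error = n - 2     # two parameters estimated: b0 and b1
dof_lack_of_fit = dof_residual_error - dof_pure_error

print("n =", n)
print("DOF pure error     =", dof_pure_error)
print("DOF residual error =", dof_residual_error)
print("DOF lack of fit    =", dof_lack_of_fit)
```

These counts agree with the DF column of the ANOVA table for this data set: 12 for Residual Error, 3 for Lack of Fit, and 9 for Pure Error.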

ANOVA for Simple Linear Regression (Cont'd)
Source of Variation, Sum of Squares, Mean Square, and F-value
– Regression: SS = $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$; MS = MS REGRESSION; F = MS REGRESSION / MS RESIDUAL ERROR
– Residual Error: SS = $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$; MS = MS RESIDUAL ERROR
  – Lack of Fit: SS = SS RESIDUAL ERROR - SS PURE ERROR; MS = MS LACK OF FIT; F = MS LACK OF FIT / MS PURE ERROR
  – Pure Error: SS = $\sum_{j=1}^{p} \sum_{k=1}^{m_j} (Y_{jk} - \bar{Y}_j)^2$; MS = MS PURE ERROR
– Total: SS = $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$
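The sums of squares above can be computed step by step. The following sketch fits a simple linear regression to a small replicated data set and builds each ANOVA quantity from its definition. The data are hypothetical (four predictor levels with two replicates each), chosen only so that the pure error term is defined.

```python
from collections import defaultdict

# Hypothetical replicated data: two responses at each predictor level
x = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
y = [1.1, 1.3, 2.0, 2.4, 2.9, 3.1, 4.2, 3.8]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x
fitted = [b0 + b1 * xi for xi in x]

# Sums of squares from their definitions
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_regression = sum((fi - mean_y) ** 2 for fi in fitted)
ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Pure error: variation of each response about its own subgroup mean
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
ss_pure_error = sum(
    sum((yk - sum(g) / len(g)) ** 2 for yk in g) for g in groups.values()
)
ss_lack_of_fit = ss_residual - ss_pure_error

# Degrees of freedom and F-values
df_residual = n - 2
df_pure_error = sum(len(g) - 1 for g in groups.values())
df_lack_of_fit = df_residual - df_pure_error
f_regression = (ss_regression / 1) / (ss_residual / df_residual)
f_lack_of_fit = ((ss_lack_of_fit / df_lack_of_fit)
                 / (ss_pure_error / df_pure_error))

print("SS total      :", round(ss_total, 4))
print("SS regression :", round(ss_regression, 4))
print("SS residual   :", round(ss_residual, 4))
print("SS pure error :", round(ss_pure_error, 4))
print("SS lack of fit:", round(ss_lack_of_fit, 4))
print("F regression  :", round(f_regression, 2))
print("F lack of fit :", round(f_lack_of_fit, 2))
```

The decomposition SS Total = SS Regression + SS Residual Error holds for any least squares fit with an intercept, which makes it a useful check on hand calculations.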

Analysis of Residuals
– Residuals are used to test the adequacy of the prediction equation (model)
– In residual plots, three types of plots indicate model inadequacy
  1. Fans
  2. Bands sloping up or down
  3. Curved bands
– The plots will be dramatic, not subtle!
Approach:
– Open Datafile\Residuals
– Go to Stat > Regression > Fitted Line Plot…
Note: Fitted Line Plot… does not have Lack of Fit Tests.

Analysis of Residuals (Cont'd)
1. Select Graphs…
2. Select Four in One Plot
3. Select OK
4. In Fitted Line Plot dialog box, Select OK

Analysis of Residuals (Cont'd)
Units = -2.343 + 0.08993 Minutes
S = 1.78117, R-Sq = 89.7%, R-Sq(adj) = 89.2%
– R-Sq is 89.7%
– The regression is significant
– Can we do better?
– How do the residuals look?
[Figure: Fitted Line Plot of Units versus Minutes]

Analysis of Residuals (Cont'd)
[Figure: Residual Plots for Units (four in one): Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data]

Analysis of Residuals (Cont'd)
– Residuals must be normally distributed. Are they?
– First, Store residuals, then Stat > Basic Statistics > Normality Test…
– Probability Plot of RESI1 (Normal): Mean = -9.69595E-15, StDev = 1.742, N = 24, AD = 0.336, P-Value = 0.479
– p > 0.05: Can assume residuals are normal
[Figures: Normal Probability Plot of the Residuals (response is Units); Probability Plot of RESI1]

Analysis of Residuals (Cont'd)
– The plot of Residuals vs. Fits shows a curved band.
– Try Stat > Regression > Fitted Line Plot… and select Quadratic. Select Graphs > Four in One Plot.
[Figure: Residuals Versus the Fitted Values (response is Units)]

Analysis of Residuals (Cont'd)
Units = 2.672 - 0.02075 Minutes + 0.000466 Minutes**2
S = 1.26903, R-Sq = 95.0%, R-Sq(adj) = 94.5%
– Improving the model adequacy increased R-Sq from 89.7% to 95.0%
– How do the residuals look?
[Figures: Fitted Line Plot of Units versus Minutes; Residual Plots for Units (four in one): Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data]
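When a model gains a term, R-Sq can never decrease, so the adjusted value is the fairer comparison between the linear and quadratic fits. The sketch below applies the standard adjustment formula to the reported R-Sq values, assuming n = 24 observations (the N shown on the normality test of the residuals). The function name `adjusted_r_sq` is illustrative.

```python
def adjusted_r_sq(r_sq, n, n_predictors):
    """Adjusted R-Sq for a model with an intercept and n_predictors terms."""
    return 1.0 - (1.0 - r_sq) * (n - 1) / (n - n_predictors - 1)

n = 24  # number of observations, taken from the normality test output

# Linear model: one term (Minutes); reported R-Sq = 89.7%
linear_adj = adjusted_r_sq(0.897, n, 1)

# Quadratic model: two terms (Minutes, Minutes**2); reported R-Sq = 95.0%
quadratic_adj = adjusted_r_sq(0.950, n, 2)

print(f"linear    R-Sq(adj) = {100 * linear_adj:.1f}%")
print(f"quadratic R-Sq(adj) = {100 * quadratic_adj:.1f}%")
```

Both results reproduce the R-Sq(adj) values Minitab reports (89.2% and 94.5%), so the improvement from the quadratic term is not just an artifact of adding a parameter.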

Your Turn
– Open Datafile\CORREG YOUR TURN
– Analyze the data sets:
  1. Do the variables correlate?
  2. What is the prediction equation?
  3. Is the regression statistically significant?
  4. Does the analysis of residuals indicate anything unusual?
– Another approach: Stat > Regression > Regression… > Options… > Lack of Fit Tests
  – Select Pure Error when your data is replicated
  – Select Data Subsetting when your data is not replicated

What We Have Learned . . .

Correlation
– How to measure the linear relationship between two variables
– How to interpret the Pearson Correlation Coefficient, r

Regression
– Y = f(X): how to regress a dependent variable, Y, on an independent variable, X (simple linear regression)
– How to interpret the Coefficient of Determination, R-Sq
– How to interpret the ANOVA table for simple linear regression
– How to analyze residuals