
Correlation and Simple Regression

© 2001 ConceptFlow 1
Module Objectives

By the end of this module the participant should be able to:


• Measure the strength of correlation between two variables
• Determine if a correlation coefficient is statistically significant
• Perform simple linear regression including polynomial regression
• Perform model diagnostics and validate assumptions
• Use a regression model to predict the value of a response variable for
a given value of predictor

Why Learn Correlation and Regression?

• Explore whether a relationship exists between variables, with the aid of data
• Screen variables and determine which variable(s) have the biggest
impact on the response variable(s)
• Describe the nature of the relationship with the help of an equation and
use it for prediction

Correlation

What is Correlation?

• Correlation is a measure of the strength of association between two
quantitative variables
(Ex: Pressure and Yield)
• Measures the degree of linearity between two variables assumed to be
completely independent of each other

The correlation coefficient, or Pearson correlation coefficient, is a way of
measuring the strength of correlation

Correlation Coefficient

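• Correlation coefficient, r, is defined as follows:

$$ r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} $$

where $\bar{x}$, $\bar{y}$ are the sample means and $s_x$, $s_y$ are the sample standard deviations of x and y.

• Sample correlation coefficient is denoted as ‘r’
• Population correlation coefficient is denoted as ‘ρ’
• Correlation coefficient lies between -1 and +1

Sample correlation coefficient ‘r’ is an estimate of
population correlation coefficient ‘ρ’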
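
The definition of r can be checked numerically. The following is a minimal Python sketch, not part of the original module, that computes r as the averaged product of standardized deviations. The helper names (`mean`, `sample_std`, `pearson_r`) are illustrative assumptions, and the data are the twelve store records listed in the Cereal Sales worksheet table of this module. That table may not be identical to the worksheet behind the Minitab output (the descriptive statistics report a maximum of 1924, which does not appear in the table), so the printed value may differ slightly from the r = 0.978 that Minitab reports.

```python
import math

# Cereal data as listed in the worksheet table of this module
shelf_space = [574, 635, 533, 560, 628, 615, 540, 587, 656, 594, 622, 567]
sales = [960, 1779, 651, 831, 1460, 1370, 851, 1220, 1889, 1370, 1609, 1120]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(x, y):
    """Pearson r: sum of products of standardized deviations, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_std(x), sample_std(y)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y))
    return sum(products) / (n - 1)


r = pearson_r(shelf_space, sales)
print(f"r = {r:.3f}")
```

A value close to +1 indicates a strong positive linear association, which is what the scatter plot of this data suggests.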

Illustration of Correlation Coefficient

[Figure: regression plot of Y (Sales) vs. X (Shelf Space), Sales = -4710.51 + 10.0720 Shelf Space; S = 87.2641, R-Sq = 95.7%, R-Sq(adj) = 95.3%. The plot marks the deviations (x_i - Xbar) and (y_i - Ybar) of a data point from the means Xbar and Ybar.]

Correlation Coefficient

[Figure: three scatter plots of Y vs. X showing r = +1.0, r = -1.0, and r = 0.0 (no correlation)]
Strength and Direction of “+” Correlation

[Figure: three scatter plots of Output vs. Input with fitted lines]
• Weak positive correlation: Y = 56.6537 + 0.181987X, R Squared = 0.115
• Moderate positive correlation: Y = 25.7595 + 0.645418X, R Squared = 0.369
• Strong positive correlation: Y = 9.77271 + 0.745022X, R Squared = 0.876

Strength and Direction of “-” Correlation

[Figure: three scatter plots of Output vs. Input with fitted lines]
• Weak negative correlation: Y = 74.8524 - 0.181987X, R Squared = 0.115
• Moderate negative correlation: Y = 90.3013 - 0.645418X, R Squared = 0.369
• Strong negative correlation: Y = 99.1754 - 0.745022X, R Squared = 0.876

Correlation vs. Causation

• Data show that the average life expectancy of Americans increased when the
divorce rate went up!
• Is there a correlation between shark attacks and Popsicle sales?

[Figure: two plots, Average Life Expectancy vs. Divorce Rate in America, and # of Shark Attacks vs. Popsicle Sales]

Correlation does not imply causation! A third variable may
be ‘lurking’ that causes both x and y to vary
Business Process Example: Cereal Sales

Minitab Worksheet: data in Sales.mtw

A market research analyst for a brand of cereal is interested in finding
out if there is a relationship between the sales generated and shelf
space used to display the cereal. She conducted a study and collected
data from 12 different stores selling this brand of cereal.

The data contains sales $ generated for a certain month and the shelf
space dedicated to the product.

Shelf Space, Sq in    Sales, $
574                   960
635                   1779
533                   651
560                   831
628                   1460
615                   1370
540                   851
587                   1220
656                   1889
594                   1370
622                   1609
567                   1120

What would you do?
What questions might you ask?
Example: Cereal Sales

• Practical Problem
  • Is there a relationship between sales $ from cereal and the shelf
    space used to display the cereal?
  • If there is a relationship, how strong is that relationship?
• Statistical Problem
  • Are the variables ‘Sales’ and ‘Shelf Space’ correlated?
  • Null hypothesis: Sales and Shelf Space are not correlated
  • Alternate hypothesis: Sales and Shelf Space are correlated

Example: Cereal Sales

• State the Hypotheses and Significance Level
  • H0: ρ = 0
  • Ha: ρ ≠ 0
  • α = 0.01
  • Notice that the hypotheses are about a population parameter
• What Hypothesis Test is Appropriate?
  • These hypotheses deal with the correlation coefficient
  • Make decisions based on the Pearson correlation coefficient and the ‘p-value’

Example: Cereal Sales

• Practical and Graphical:
  • Practical questions about the data?
  • Plot the data using different techniques

Tool Bar Menu > Graph > Plot
Tool Bar Menu > Stat > Basic Statistics > Display Descriptive Statistics

[Figure: scatter plot of Sales, $ vs. Shelf Space, Sq in]

Descriptive Statistics
Variable: Sales, $

Anderson-Darling Normality Test
A-Squared:       0.177
P-Value:         0.898

Mean             1258.00
StDev            402.92
Variance         162346
Skewness         -1.8E-02
Kurtosis         -1.04056
N                12

Minimum          651.00
1st Quartile     863.25
Median           1304.00
3rd Quartile     1575.00
Maximum          1924.00

95% Confidence Interval for Mu:      1002.00   1514.00
95% Confidence Interval for Sigma:   285.43    684.11
95% Confidence Interval for Median:  864.94    1573.22

Example: Cereal Sales

Tool Bar Menu > Stat > Basic Statistics > Correlation

Example: Cereal Sales

Correlations: Shelf Space, Sales


Pearson correlation of Shelf Space and Sales = 0.978
p-value = 0.000

What is the Decision?


• Pearson correlation, or correlation coefficient,
for the sample: r = 0.978
• Does that mean ‘ρ’ is greater than zero? Or could
it be that r = 0.978 is due to chance
variation while ‘ρ’ is still zero?
• Answer this question using the table
on the next slide
• What about the p-value?

How Big Should ‘r’ Be?

Find the row for the size of your sample: any sample correlation that is greater
than the table value is considered to be “important”, or statistically
significant, at that significance level.

The table values correspond to critical values of t with n - 2 degrees of freedom:

$$ r = \sqrt{\frac{t^2}{n - 2 + t^2}} \quad \text{or} \quad t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} $$

Sample Size   d.f.    Significance Level
n             n-2     0.05      0.025     0.01      0.005
3             1       0.9877    0.9969    0.9995    0.9999
4             2       0.9000    0.9500    0.9800    0.9900
5             3       0.8054    0.8783    0.9343    0.9587
6             4       0.7293    0.8114    0.8822    0.9172
7             5       0.6694    0.7545    0.8329    0.8745
8             6       0.6215    0.7067    0.7887    0.8343
9             7       0.5822    0.6664    0.7498    0.7977
10            8       0.5494    0.6319    0.7155    0.7646
11            9       0.5214    0.6021    0.6851    0.7348
12            10      0.4973    0.5760    0.6581    0.7079
13            11      0.4762    0.5529    0.6339    0.6835
14            12      0.4575    0.5324    0.6120    0.6614
15            13      0.4409    0.5140    0.5923    0.6411
16            14      0.4259    0.4973    0.5742    0.6226
17            15      0.4124    0.4821    0.5577    0.6055
18            16      0.4000    0.4683    0.5425    0.5897
19            17      0.3887    0.4555    0.5285    0.5751
20            18      0.3783    0.4438    0.5155    0.5614
21            19      0.3687    0.4329    0.5034    0.5487
22            20      0.3598    0.4227    0.4921    0.5368
27            25      0.3233    0.3809    0.4451    0.4869
32            30      0.2960    0.3494    0.4093    0.4487
37            35      0.2746    0.3246    0.3810    0.4182
42            40      0.2573    0.3044    0.3578    0.3932
47            45      0.2429    0.2876    0.3384    0.3721
52            50      0.2306    0.2732    0.3218    0.3542
62            60      0.2108    0.2500    0.2948    0.3248
72            70      0.1954    0.2319    0.2737    0.3017
82            80      0.1829    0.2172    0.2565    0.2830
92            90      0.1726    0.2050    0.2422    0.2673
102           100     0.1638    0.1946    0.2301    0.2540
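
The two conversion formulas between r and t can be sketched in a few lines of Python. This is an illustrative example rather than part of the original module: the function names are made up here, and the critical t value of 2.764 (one-sided, α = 0.01, 10 degrees of freedom) is an assumption taken from a standard t-table rather than from the slides.

```python
import math


def t_from_r(r, n):
    """t statistic for a sample correlation r from n observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def r_from_t(t, n):
    """Correlation that corresponds to a t value with n - 2 degrees of freedom."""
    return math.sqrt(t ** 2 / (n - 2 + t ** 2))


n = 12                # sample size in the cereal example
r_sample = 0.978      # Pearson correlation reported by Minitab
t_critical = 2.764    # assumed: one-sided t for alpha = 0.01, 10 d.f. (standard t-table)

r_critical = r_from_t(t_critical, n)
t_sample = t_from_r(r_sample, n)

print(f"critical r for n = {n}: {r_critical:.4f}")
print(f"t for r = {r_sample}: {t_sample:.2f}")
print("reject H0" if r_sample > r_critical else "fail to reject H0")
```

Converting the critical t back to r should land on the table entry for n = 12 in the 0.01 column, which is a quick way to sanity-check a row of the table.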

Example: Cereal Sales

What is the statistical interpretation?


• From the table, for a sample size of 12, for a
one-sided test (i.e., is ρ > 0?) at α = 0.01, r_critical = 0.6581
• Since r > r_critical, reject the null hypothesis
or
• p-value (0.000) < α-risk (0.01): reject the null hypothesis
• Infer Ha: there is sufficient evidence of a correlation
between sales $ and shelf space

Example: Mystery Data Set

Minitab Worksheet

• Open the file: Mystery.mtw


• Calculate the correlation
• Are the two variables related?
• Plot the two variables
• What is your conclusion?

Example: Mystery Data Set

[Figure: scatter plot of Output vs. Input]

r = -0.253, p-value = 0.012

Really?

No correlation or very weak correlation!


Example: Mystery Data Set

• Now what do you think?
• A quadratic relationship is more appropriate!

[Figure: scatter plot of Output vs. Input showing a curved (quadratic) pattern]

The Pearson correlation coefficient measures
only the strength of a linear relationship
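
To see why a low r does not rule out a relationship, consider a small Python sketch with made-up data (the Mystery.mtw values are not reproduced in this module, so the numbers below are purely hypothetical). The output is an exact quadratic function of the input, yet the Pearson correlation is essentially zero because the relationship has no linear component.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: inputs symmetric around zero, output exactly quadratic
inputs = [i / 2 for i in range(-6, 7)]   # -3.0, -2.5, ..., 3.0
outputs = [x ** 2 for x in inputs]

r = pearson_r(inputs, outputs)
print(f"r = {r:.3f}")
```

Plotting the data before trusting the coefficient, as the slide suggests, is what exposes this kind of pattern.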
Regression

Correlation and Regression

• Correlation tells how much linear association exists between two
variables
• Regression provides an equation describing the nature of the relationship

Correlations: Shelf Space, Sales


Pearson correlation of Shelf Space and Sales = 0.978
p-value = 0.000

Regression Analysis: Sales versus Shelf Space


The regression equation is Sales = - 4711 + 10.1 Shelf Space

Regression Terminology

• Response Variable
  • This is the uncontrolled variable, also known as the dependent
    variable, output variable, or Y variable
• Regressor Variable
  • The response depends on these variables, also known as
    independent variables, input variables, or X variables
• Noise Variable
  • Input variables (X) that are not controlled in the experiment
• Regression Equation
  • Equation that describes the relationship between the independent variables
    and the dependent variable
• Residuals
  • Difference between observed response values and predicted
    response values (Residual = Y - Ŷ)

Regression Objectives

• Determination of a Model
• Explore existence of relationship
• Prediction
• Describe nature of relationship using an equation and use equation
for prediction
• Estimation
• To assess accuracy of prediction achieved by regression equation
• Determination of KPIV
• Screen variables and determine which variable has biggest impact
on response variable

Types of Regression

• Simple Linear Regression
  • Single regressor (x) variable such as x1 and model linear with
    respect to coefficients
  • Example 1: y = a0 + a1 x + error
  • Example 2: y = a0 + a1 x + a2 x^2 + a3 x^3 + error
  • Note: ‘Linear’ refers to the coefficients a0, a1, a2, etc. It implies
    that each term containing a coefficient is added to the model. In
    Example 2, the relationship between x and y is a cubic
    polynomial in nature, but the model is linear with respect to the
    coefficients.

Types of Regression

• Multiple Linear Regression


• Multiple regressor (x) variables such as x1, x2, x3 and model linear
with respect to coefficients
• Example: y = b0 + b1 x1 + b2 x2 + b3 x3 + error
• Simple Non-Linear Regression
• Single regressor (x) variable such as x and model non-linear with
respect to coefficients
• Example: y = b0 + b1 (1 - e^(-b2 x)) + error
• Multiple Non-Linear Regression
• Multiple regressor (x) variables such as x1, x2, x3 and model non-
linear with respect to coefficients
• Example: y = (b0+ b1 x1) / b2 x2 + b3 x3 + error
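
One way to make ‘linear with respect to coefficients’ concrete is an additivity check: for a model that is linear in its coefficients, adding two coefficient sets and then predicting gives the same result as predicting with each set and adding the predictions. The Python sketch below is illustrative only. The function names and the coefficient values are hypothetical, and the error term is omitted.

```python
import math


def cubic_model(x, a):
    """y = a0 + a1*x + a2*x^2 + a3*x^3: cubic in x, linear in the coefficients."""
    return a[0] + a[1] * x + a[2] * x ** 2 + a[3] * x ** 3


def exponential_model(x, b):
    """y = b0 + b1*(1 - exp(-b2*x)): non-linear in the coefficient b2."""
    return b[0] + b[1] * (1 - math.exp(-b[2] * x))


def is_additive_in_coefficients(model, x, c1, c2):
    """True if model(x, c1 + c2) equals model(x, c1) + model(x, c2)."""
    summed = [u + v for u, v in zip(c1, c2)]
    return math.isclose(model(x, summed), model(x, c1) + model(x, c2),
                        rel_tol=1e-9, abs_tol=1e-9)


x = 2.0  # hypothetical predictor value

cubic_is_linear = is_additive_in_coefficients(
    cubic_model, x, [1.0, 0.5, -0.2, 0.1], [0.3, 1.5, 0.4, -0.05])
exponential_is_linear = is_additive_in_coefficients(
    exponential_model, x, [1.0, 0.5, 0.2], [0.3, 1.5, 0.4])

print("cubic model linear in coefficients:", cubic_is_linear)
print("exponential model linear in coefficients:", exponential_is_linear)
```

This distinction matters in practice because models that are linear in their coefficients can be fitted directly by least squares, while non-linear models generally need iterative fitting.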

Simple Linear Regression

Simple Linear Regression

• Use one independent variable (x) to explain the variation in the dependent
variable (y)
  • Example 1: use shelf space to explain the variation in sales $
  • Example 2: use the amount of fertilizer applied to explain the
    yield of a crop
• Method of Least Squares
  • Use the ‘Method of Least Squares’ to find the best-fitting regression
    line

Method of Least Squares

Objective:
• Find the line that minimizes the sum of squares
of the residuals

Residuals are the errors of prediction: Residual = Y - Ŷ

[Figure: regression plot of Sales vs. Shelf Space showing the regression line and the residual between an observed value Y and its fitted value Ŷ]
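
For a single predictor, the least-squares line has a well-known closed form: slope = Sxy / Sxx and intercept = ybar - slope * xbar. The Python sketch below applies it to the twelve records listed in the worksheet table of this module. It is an illustration, not the Minitab procedure, and the function name is an assumption. Because the extracted table may differ slightly from the worksheet behind the Minitab output, the fitted coefficients may not match the reported equation exactly.

```python
# Cereal data as listed in the worksheet table of this module
shelf_space = [574, 635, 533, 560, 628, 615, 540, 587, 656, 594, 622, 567]
sales = [960, 1779, 651, 831, 1460, 1370, 851, 1220, 1889, 1370, 1609, 1120]


def least_squares_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared residuals (observed minus fitted) for a given line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


intercept, slope = least_squares_line(shelf_space, sales)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(shelf_space, sales)]
sse = sum_squared_residuals(shelf_space, sales, intercept, slope)

print(f"Sales = {intercept:.2f} + {slope:.4f} Shelf Space")
print(f"sum of squared residuals = {sse:.0f}")
```

Any other line, for example one with a slightly different slope, gives a larger sum of squared residuals, which is what ‘best fitting’ means here.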
Business Process Example: Cereal Sales
Minitab Worksheet: data in Sales.mtw

A market research analyst for a brand of cereal is interested in
predicting the sales generated from information on shelf space used to
display the cereal. She conducted a study and collected data from 12
different stores selling this brand of cereal.

• The data contains sales $ generated for a certain month and the shelf
space dedicated to the product
• How will we create a simple linear regression model for the two variables?
• Predict the sales $ using the regression equation when shelf space is 615 sq. in.

Shelf Space, Sq in    Sales, $
574                   960
635                   1779
533                   651
560                   831
628                   1460
615                   1370
540                   851
587                   1220
656                   1889
594                   1370
622                   1609
567                   1120
Example: Cereal Sales

Practical and Graphical:
• Practical questions about the process?
• Plot the data using different techniques

Tool Bar Menu > Graph > Plot
Tool Bar Menu > Stat > Basic Statistics > Display Descriptive Statistics

[Figure: scatter plot of Sales, $ vs. Shelf Space, Sq in, and the Descriptive Statistics summary for Sales, $ (N = 12, Mean 1258.00, StDev 402.92, Anderson-Darling P-Value 0.898)]

Example: Cereal Sales

Tool Bar Menu > Stat > Regression > Fitted Line Plot

Example: Cereal Sales

• The regression equation is
  • Sales = -4710.51 + 10.0720 Shelf Space
  • S = 87.2641, R-Sq = 95.7%, R-Sq(adj) = 95.3%
• Also from previous, correlation coefficient, r = 0.978
• What do these numbers mean?

[Figure: regression plot of Sales vs. Shelf Space with the fitted line]
Example: Cereal Sales

Session Output from Minitab

Regression Analysis: Sales versus Shelf Space


The regression equation is
Sales = -4710.51 + 10.0720 Shelf Space

S = 87.2641 R-Sq = 95.7 % R-Sq(adj) = 95.3 %


Analysis of Variance

Source        DF    SS         MS         F          P
Regression     1    1709656    1709656    224.511    0.000
Error         10      76150       7615
Total         11    1785806

Regression is significant

What About R-squared?

• R-squared is a measure describing the quality of regression
• Measures the proportion of variation that is explained by the
regression model
• R² = SSregression / SStotal = (SStotal - SSerror) / SStotal = 1 - [SSerror / SStotal]

Source DF SS MS F P
Regression 1 1709656 1709656 224.511 0.000
Error 10 76150 7615
Total 11 1785806

R² = 1709656 / 1785806 = 95.74%

95.7% of variation in sales can be explained by variation in shelf space
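
The summary statistics in the Minitab output all follow from the sums of squares in the Analysis of Variance table. The short Python sketch below recomputes them from the table values. The variable names are illustrative, and the relation S = sqrt(MSerror) is the standard definition of the regression standard error rather than something stated on the slide.

```python
import math

# Values copied from the Analysis of Variance table
ss_regression = 1709656
ss_error = 76150
ss_total = 1785806
df_regression = 1
df_error = 10

# Two equivalent ways of computing R-squared
r_squared = ss_regression / ss_total
r_squared_alt = 1 - ss_error / ss_total

# Mean squares, F statistic, and standard error of the regression
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error
f_statistic = ms_regression / ms_error
s = math.sqrt(ms_error)

print(f"R-Sq = {100 * r_squared:.1f}%")
print(f"F = {f_statistic:.3f}")
print(f"S = {s:.2f}")
```

The square root of R-squared also recovers the magnitude of the correlation coefficient for a simple linear regression, which ties this table back to the r reported earlier.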


What About R-Sq?

• What is the R-squared on a regression with two data points?


• Does that mean a model with two data points is better?

[Figure: regression plot of Sales vs. Shelf Space fitted to only two data points; R-Sq = 100%]
Example: Cereal Sales
Tool Bar Menu > Stat > Regression > Fitted Line Plot

• What is the R-squared if we choose a ‘cubic’ polynomial regression?

Sales = -32708.1 + 151.576 Shelf Space - 0.237788 Shelf Space**2 + 0.0001329 Shelf Space**3
S = 97.2444, R-Sq = 95.8%, R-Sq(adj) = 94.2%

[Figure: regression plot of Sales vs. Shelf Space with the fitted cubic curve; R-Sq = 95.8%]

Example: Cereal Sales

• Which model is better? Linear or Cubic model?

Linear: Sales = -4710.51 + 10.0720 Shelf Space
        S = 87.2641, R-Sq = 95.7%, R-Sq(adj) = 95.3%
Cubic:  Sales = -32708.1 + 151.576 Shelf Space - 0.237788 Shelf Space**2 + 0.0001329 Shelf Space**3
        S = 97.2444, R-Sq = 95.8%, R-Sq(adj) = 94.2%

[Figure: side-by-side regression plots of Sales vs. Shelf Space, linear fit (R-Sq = 95.7%) and cubic fit (R-Sq = 95.8%)]

• R-Squared gets bigger as we add more and more terms!
• So should we keep adding terms?
What is R-Sq (adj)?

• A more realistic measure is a modified R-squared, R-Sq(adj)
• Takes into account the number of terms in the model and the number of
data points

$$ R^2_{adj} = 1 - \frac{SS_{error} / (n - p)}{SS_{total} / (n - 1)} = 1 - (1 - R^2)\,\frac{n - 1}{n - p} $$

• where n = number of data points and p = number of terms in the
model (including the constant)
• Becomes smaller when added terms provide little new information and
as the number of model terms gets closer to the total sample size
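
The adjusted R-squared values reported for the two cereal models can be reproduced from the formula. The Python sketch below is illustrative (the function name is an assumption). It takes p = 2 for the linear model and p = 4 for the cubic model, counting the constant as a term, and uses the rounded R-Sq of 95.8% for the cubic model because its sums of squares are not shown in the slides.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R-squared for n data points and p model terms (constant included)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p)


n = 12  # stores in the cereal study

# Linear model: R-squared from the ANOVA table, terms = constant + slope
r2_linear = 1 - 76150 / 1785806
adj_linear = adjusted_r_squared(r2_linear, n, p=2)

# Cubic model: rounded R-Sq from the fitted line plot, terms = constant + 3 powers
adj_cubic = adjusted_r_squared(0.958, n, p=4)

print(f"linear: R-Sq(adj) = {100 * adj_linear:.1f}%")
print(f"cubic:  R-Sq(adj) = {100 * adj_cubic:.1f}%")
```

Although the cubic model has the slightly higher R-squared, the penalty for its two extra terms leaves it with the lower adjusted value.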

Example: Cereal Sales

• Which model is better? Linear or Cubic model?

Linear: Sales = -4710.51 + 10.0720 Shelf Space
        S = 87.2641, R-Sq = 95.7%, R-Sq(adj) = 95.3%
Cubic:  Sales = -32708.1 + 151.576 Shelf Space - 0.237788 Shelf Space**2 + 0.0001329 Shelf Space**3
        S = 97.2444, R-Sq = 95.8%, R-Sq(adj) = 94.2%

[Figure: side-by-side regression plots of Sales vs. Shelf Space, linear fit (R-Sq(adj) = 95.3%) and cubic fit (R-Sq(adj) = 94.2%)]

• Linear model is better since the additional terms in the cubic model did not
add value. How about a quadratic model?
Example: Cereal Sales

The regression equation is


Sales = -4710.51 + 10.0720 Shelf Space
S = 87.2641 R-Sq = 95.7 % R-Sq(adj) = 95.3%

Predict ‘Sales’ for a ‘Shelf Space’ of 615 sq. in. using the above equation

• Substitute the value for ‘Shelf Space’ in the above equation
• Sales = -4710.51 + 10.072 (615) = $1483.77
• What about the uncertainty around this prediction?
Are sales expected to be exactly $1483.77?

Example: Cereal Sales
Tool Bar Menu > Stat > Regression > Fitted Line Plot

Example: Cereal Sales

Prediction uncertainty for individual values

[Figure: regression plot of Sales vs. Shelf Space (Sales = -4710.51 + 10.0720 Shelf Space; S = 87.2641, R-Sq = 95.7%, R-Sq(adj) = 95.3%) showing the regression line and the 95% PI band]

Example: Cereal Sales
Tool Bar Menu > Stat > Regression

1. Stat > Regression > Regression

Example: Cereal Sales

Predicted Values for New Observations

New Obs    Fit       SE Fit    95.0% CI              95.0% PI
1          1483.8    29.4      (1418.4, 1549.2)      (1278.6, 1688.9)

Values of Predictors for New Observations

New Obs    Shelf Sp
1          615

What is the 95% CI?
• We are 95% certain that sales will be between $1278.6 and $1688.9 when
shelf space is 615 sq. in.
• In the example, the actual value was $1370
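
The confidence and prediction intervals in the Minitab output can be approximately reconstructed from the fit, SE Fit, and S. The Python sketch below uses the standard formulas, fit ± t × SE Fit for the mean response and fit ± t × sqrt(S² + SE Fit²) for a single new response. The two-sided 95% critical value t = 2.228 for 10 degrees of freedom is an assumption taken from a standard t-table, and small differences from the Minitab figures are expected because SE Fit is rounded.

```python
import math

# Values from the regression output
intercept = -4710.51
slope = 10.0720
s = 87.2641          # standard error of the regression
se_fit = 29.4        # standard error of the fitted value at 615 sq. in.
t_crit = 2.228       # assumed: two-sided 95% t value for 10 d.f. (standard t-table)

shelf_space_new = 615
fit = intercept + slope * shelf_space_new

# 95% confidence interval for the mean response
ci_half_width = t_crit * se_fit
ci = (fit - ci_half_width, fit + ci_half_width)

# 95% prediction interval for a single new response
pi_half_width = t_crit * math.sqrt(s ** 2 + se_fit ** 2)
pi = (fit - pi_half_width, fit + pi_half_width)

print(f"fit = {fit:.2f}")
print(f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI = ({pi[0]:.1f}, {pi[1]:.1f})")
```

The prediction interval is wider because it includes the scatter of individual stores around the line (S) in addition to the uncertainty in the line itself (SE Fit).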

Example: Cereal Sales
Tool Bar Menu > Stat > Regression > Fitted Line Plot

Example: Cereal Sales

[Figure: regression plot of Sales vs. Shelf Space (Sales = -4710.51 + 10.0720 Shelf Space; S = 87.2641, R-Sq = 95.7%, R-Sq(adj) = 95.3%) showing the regression line, the 95% CI band (95% confidence interval for the mean response), and the 95% PI band (95% prediction interval for a single response)]

Assumptions for Regression

• To use the results of regression, assumptions about residuals must be
satisfied
• What are the assumptions about residuals?
• Residuals are normally distributed with mean of zero
• Residuals show no pattern (random)
• Residuals have constant variance (homogeneous variance or no
heteroscedasticity)
• Residuals are independent of the values of regressor (x) variables
• Residuals are independent of each other

Checking Assumptions
Tool Bar Menu > Stat > Regression > Regression

Residuals are the errors in the fit of the regression line

• Difference between the observed value of the response variable and the fitted
value

Assumptions for Regression

[Figure: Normal Probability Plot of the Residuals (response is Sales), Normal Score vs. Residual]
Residuals are normally distributed around a mean of zero

[Figure: Residuals Versus Shelf Sp (response is Sales)]
Residuals are independent of the ‘Shelf Space’ variable

Assumptions for Regression

[Figure: Residuals Versus the Fitted Values (response is Sales)]
• Residuals do not exhibit heteroscedasticity (they have homogeneous variance)
• Residuals are randomly distributed

[Figure: Histogram of the Residuals (response is Sales)]
• Histogram of residuals resembles a normal distribution with mean of zero
No assumptions were violated; regression results are valid
Quiz on Residuals

• Following are hypothetical examples of residual plots


• What are the issues, if any, associated with each plot and what actions
could be taken to mitigate the issue?

[Figure: three hypothetical residual plots. Case 1: Residuals vs. Fitted Value. Case 2: Normal Score vs. Residuals. Case 3: Residuals vs. Fitted Value.]

Quiz on Residuals

• What are the issues, if any, associated with each plot and what actions
could be taken to mitigate the issue?

[Figure: three hypothetical residual plots. Case 4: Residuals vs. Shelf Space. Case 5: Residuals vs. Order of Data. Case 6: Residuals vs. Fitted Value.]

Impact of Outliers on Regression

• What is the impact of an outlier on regression?


• What is an outlier?
• Unusual observations that can have undue influence on regression
• Create an outlier for the Cereal Sales data
• Change observation 8 from 1238 to 1288 and
complete regression
• What effect will it have on regression?

Example: Cereal Sales

Session Output from Minitab

Regression Analysis: Sales versus Shelf Space

The regression equation is
Sales = - 4697 + 10.1 Shelf Space

Predictor    Coef       SE Coef    T         P
Constant     -4696.5    414.3      -11.34    0.000
Shelf Sp     10.0555    0.6978     14.41     0.000

S = 90.59    R-Sq = 95.4%    R-Sq(adj) = 94.9%

Analysis of Variance

Source            DF    SS         MS         F         P
Regression         1    1704037    1704037    207.66    0.000
Residual Error    10      82061       8206
Total             11    1786098

[Figures (response is Sales): Normal Probability Plot of the Residuals; Residuals Versus Shelf Sp; Residuals Versus the Fitted Values]
Example: Cereal Sales

• Some outliers as in previous case do not exert undue influence


• Change observation 3 from 651 to 1851 and complete regression

Regression Analysis: Sales versus Shelf Space

The regression equation is


Sales = - 2096 + 5.83 Shelf Space

Predictor Coef SE Coef T P


Constant -2096 1501 -1.40 0.193
Shelf Sp 5.829 2.527 2.31 0.044
S = 328.1 R-Sq = 34.7% R-Sq(adj) = 28.2%

Analysis of Variance
Source DF SS MS F P
Regression 1 572701 572701 5.32 0.044
Residual Error 10 1076305 107630
Total 11 1649006

Unusual Observations
Obs Shelf Sp Sales Fit SE Fit Residual St Resid
3 533 1851.0 1010.7 177.9 840.3 3.05R

R denotes an observation with a large standardized residual

Impact of Outliers on Regression

[Figures (response is Sales): Residuals Versus the Fitted Values; Normal Probability Plot of the Residuals; Residuals Versus Shelf Sp]
Model Diagnostics

• Diagnostics enable detection of potential problems with the
regression model
• Several measures are available to detect outliers or potential problems,
including:
  • Standardized Residuals (unusual response)
  • Leverages (unusual predictor)
  • Cook’s Distance (influential observation)

Standardized Residuals

• Identifies observations with unusual response variable (Y) values
• These observations are not fit well by the model
• Standardized residual = residual / standard deviation of the residual
• Rule: Investigate observations with |standardized residual| > 2

Standardized Residuals

• Minitab reports standardized residuals > 2
• These are potential outliers

Regression Analysis: Sales versus Shelf Space

The regression equation is
Sales = - 2096 + 5.83 Shelf Space

Predictor    Coef      SE Coef    T        P
Constant     -2096     1501       -1.40    0.193
Shelf Sp     5.829     2.527      2.31     0.044

S = 328.1    R-Sq = 34.7%    R-Sq(adj) = 28.2%

Analysis of Variance

Source            DF    SS         MS        F       P
Regression         1     572701    572701    5.32    0.044
Residual Error    10    1076305    107630
Total             11    1649006

Unusual Observations
Obs    Shelf Sp    Sales     Fit       SE Fit    Residual    St Resid
3      533         1851.0    1010.7    177.9     840.3       3.05R

R denotes an observation with a large standardized residual

[Figure: Residuals Versus the Fitted Values (response is Sales)]

Leverage
Tool Bar Menu > Stat > Regression > Regression > Storage > Hi (leverages)

• Measures how unusual a predictor (x) value is


• Calculated as a distance from the center of all x’s
• Leverage depends only on predictor variable (x)
• Rule: investigate leverage values > 2p/n where p=number of terms in
the model (including one for constant) and n=number of data points

High leverages have potential influence on regression

[Figure: plot of Sales vs. Shelf Space]
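
For a simple linear regression, leverage has a standard closed form, h_i = 1/n + (x_i - xbar)² / Sxx, which depends only on the predictor values. The Python sketch below applies it to the shelf space column of the worksheet table and compares each value with the 2p/n threshold. The function name is illustrative, and the relation SE Fit = S × sqrt(h_i) used in the last step is a standard result that is not stated on the slides.

```python
import math

# Shelf space values (sq in) from the worksheet table
shelf_space = [574, 635, 533, 560, 628, 615, 540, 587, 656, 594, 622, 567]


def leverages(x):
    """Leverage of each observation in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


h = leverages(shelf_space)
p = 2                                  # terms in the model: constant and slope
threshold = 2 * p / len(shelf_space)   # 2p/n rule from the slide

flagged = [(obs, x, hi)
           for obs, (x, hi) in enumerate(zip(shelf_space, h), start=1)
           if hi > threshold]

print(f"threshold 2p/n = {threshold:.2f}")
for obs, (x, hi) in enumerate(zip(shelf_space, h), start=1):
    print(f"obs {obs:2d}  shelf space {x}  leverage {hi:.3f}")
print("flagged observations:", flagged)

# Consistency check against the Minitab output for the modified data:
# SE Fit for observation 3 should be close to S * sqrt(h_3), with S = 328.1
se_fit_obs3 = 328.1 * math.sqrt(h[2])
print(f"SE Fit for observation 3 = {se_fit_obs3:.1f}")
```

Because leverage ignores the response, changing a Sales value (as in the outlier exercises) leaves these numbers unchanged.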

Cook’s Distance

• Combines standardized residual and leverage into one metric
• Cook’s distance is a measure of how influential an observation is
• Compares the regression coefficients fitted with and without an observation in
the data

Rule: Cook’s distance > 1 flags an influential observation

[Figure: plot of Sales vs. Shelf Space]
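
To show how the three diagnostics fit together, the Python sketch below recomputes the standardized residual and Cook’s distance for observation 3 of the modified data (Sales changed from 651 to 1851). It uses the residual and S reported in the Minitab output, the shelf space column from the worksheet table, and the commonly used formulas: standardized residual = residual / (S × sqrt(1 - h)) and Cook’s distance = (standardized residual² / p) × h / (1 - h). These formulas are assumptions in the sense that the slides do not spell them out.

```python
import math

# Shelf space values (sq in) from the worksheet table; only x is needed for leverage
shelf_space = [574, 635, 533, 560, 628, 615, 540, 587, 656, 594, 622, 567]

# Values reported in the Minitab output for the modified data
s = 328.1               # S, standard error of the regression
residual_obs3 = 840.3   # residual of observation 3
p = 2                   # terms in the model: constant and slope

n = len(shelf_space)
x_bar = sum(shelf_space) / n
sxx = sum((x - x_bar) ** 2 for x in shelf_space)

# Leverage of observation 3 (list index 2)
h3 = 1 / n + (shelf_space[2] - x_bar) ** 2 / sxx

# Standardized residual: residual divided by its estimated standard deviation
std_resid3 = residual_obs3 / (s * math.sqrt(1 - h3))

# Cook's distance combines the standardized residual and the leverage
cooks_d3 = (std_resid3 ** 2 / p) * h3 / (1 - h3)

print(f"leverage = {h3:.3f}")
print(f"standardized residual = {std_resid3:.2f}")
print(f"Cook's distance = {cooks_d3:.2f}")
```

Under these assumptions, observation 3 stays below the leverage threshold but exceeds both the standardized-residual rule and the Cook’s distance rule, which matches the ‘R’ flag in the Minitab output.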

Exercise: Cereal Sales
Tool Bar Menu > Graph > Plot…
• Revisit cereal sales data in Sales.mtw
• Create a plot of “Shelf Space” vs “Sales” in Minitab

Shelf Space, Sq in Sales, $


574 960
635 1779
533 651
560 831
628 1460
615 1370
540 851
587 1220
656 1889
594 1370
622 1609
567 1120

Exercise: Cereal Sales
Tool Bar Menu > Graph > Plot…
• Create an outlier by changing observation 3 from 651 to 1851 and
complete regression
• Create a plot of “Shelf Space” vs “Sales” in Minitab

Shelf Space, Sq in Sales, $


574 960
635 1779
533 1851 (changed from 651)
560 831
628 1460
615 1370
540 851
587 1220
656 1889
594 1370
622 1609
567 1120

Exercise: Cereal Sales
Tool Bar Menu > Stat > Regression > Regression
Perform regression using the new data and complete the model diagnostics

Exercise: Cereal Sales

• Flags:
  • Standardized residuals > 2
  • Leverage > 0.33 (i.e., 2p/n = 2*2/12)
  • Cook’s distance > 1

Do any of the data demonstrate high values of standardized residuals,
leverage, or Cook’s distance?

Simple Regression Exercise

• Create an appropriate regression model to predict the output of your
project
• What is the best regression model?
• Perform model diagnostics to detect any outliers or unusual
observations
• Validate any assumptions used for creating the model
• What are the limitations of the regression model?

Key Learning Points

Objectives Review

By the end of this module the participant should be able to:


• Measure the strength of correlation between two variables
• Determine if a correlation coefficient is statistically significant
• Perform simple linear regression including polynomial regression
• Perform model diagnostics and validate assumptions
• Use a regression model to predict the value of a response variable for
a given value of predictor

Trademarks and Service Marks

Six Sigma is a federally registered trademark of Motorola, Inc.


Breakthrough Strategy is a federally registered trademark of Six Sigma Academy.
ESSENTEQ is a trademark of Six Sigma Academy.
METREQ is a trademark of Six Sigma Academy.
Weaving excellence into the fabric of business is a trademark of Six Sigma Academy.
FASTART is a trademark of Six Sigma Academy.
Breakthrough Design is a trademark of Six Sigma Academy.
Breakthrough Lean is a trademark of Six Sigma Academy.
Design with the Power of Six Sigma is a trademark of Six Sigma Academy.
Legal Lean is a trademark of Six Sigma Academy.
SSA Navigator is a trademark of Six Sigma Academy.
SigmaCALC is a trademark of Six Sigma Academy.
SigmaFlow is a trademark of Compass Partners, Inc.
SigmaTRAC is a trademark of DuPont.
MINITAB is a trademark of Minitab, Inc.
