© 2001 ConceptFlow 1
Module Objectives
© 2001 ConceptFlow 2
Why Learn Correlation and Regression?
© 2001 ConceptFlow 3
Correlation
© 2001 ConceptFlow 4
What is Correlation?
© 2001 ConceptFlow 5
Correlation Coefficient
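$$ r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y} $$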
© 2001 ConceptFlow 6
Illustration of Correlation Coefficient
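[Figure: regression plot of Sales (Y) with the fitted line Sales = -4710.51 + 10.0720 Shelf Space, marking the deviations of a data point from the means (x_i - x̄ and y_i - ȳ) and the level of Ybar.]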
© 2001 ConceptFlow 7
Correlation Coefficient
[Figure: three scatter plots of Y against X: r = +1.0, r = -1.0, and r = 0.0 (no correlation).]
© 2001 ConceptFlow 8
Strength and Direction of “+” Correlation
[Figure: three scatter plots of Output against Input with fitted lines.]
• Y = 25.7595 + 0.645418X, R Squared = 0.369
• Y = 56.6537 + 0.181987X, R Squared = 0.115
• Y = 9.77271 + 0.745022X, R Squared = 0.876
© 2001 ConceptFlow 9
Strength and Direction of “-” Correlation
[Figure: three scatter plots of Output against Input with fitted lines.]
• Y = 90.3013 - 0.645418X, R Squared = 0.369
• Y = 74.8524 - 0.181987X, R Squared = 0.115
• Y = 99.1754 - 0.745022X, R Squared = 0.876
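For a straight-line fit with one input, the correlation coefficient can be recovered from R Squared and the sign of the slope. The sketch below applies that to the values shown on the "+" and "-" slides; the function name is hypothetical.

```python
from math import copysign, sqrt

def r_from_r_squared(r_squared, slope):
    """Correlation coefficient for a one-input linear fit.

    The magnitude is sqrt(R Squared); the sign follows the slope.
    """
    return copysign(sqrt(r_squared), slope)

# (slope, R Squared) pairs read from the fitted lines on the two slides.
fits = [(0.645418, 0.369), (0.181987, 0.115), (0.745022, 0.876),
        (-0.645418, 0.369), (-0.181987, 0.115), (-0.745022, 0.876)]
for slope, r_squared in fits:
    print(f"slope {slope:+.6f}  R Squared {r_squared:.3f}  r {r_from_r_squared(r_squared, slope):+.3f}")
```

The same R Squared appears on both slides for mirrored slopes, so strength and direction are separate pieces of information.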
© 2001 ConceptFlow 10
Correlation vs. Causation
• Data shows that average life expectancy of Americans increased when the
divorce rate went up!
• Is there a correlation between shark attacks and Popsicle sales?
[Figure: charts labeled "# of Shark Attack" and "Average Life Expectancy".]
Minitab Worksheet
• Practical Problem
• Is there a relationship between sales $ from cereal and the shelf
space used to display the cereal?
• If there is a relationship, how strong is that relationship?
• Statistical Problem
• Are the variables ‘Sales’ and ‘Shelf Space’ correlated?
• Null hypothesis: Sales and Shelf Space are not correlated
• Alternate hypothesis: Sales and Shelf Space are correlated
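In symbols, the two hypotheses can be written as follows. The symbol \rho for the population correlation between Sales and Shelf Space does not appear in the slides and is introduced here only as notation.

```latex
H_0 : \rho = 0 \qquad \text{(Sales and Shelf Space are not correlated)}
\\
H_a : \rho \neq 0 \qquad \text{(Sales and Shelf Space are correlated)}
```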
© 2001 ConceptFlow 13
Example: Cereal Sales
© 2001 ConceptFlow 14
Example: Cereal Sales
Descriptive statistics for Sales, $:
Mean            1258.00
StDev           402.92
Variance        162346
Skewness        -1.8E-02
Kurtosis        -1.04056
N               12
Minimum         651.00
1st Quartile    863.25
Median          1304.00
3rd Quartile    1575.00
Maximum         1924.00
95% Confidence Interval for Mu        1002.00   1514.00
95% Confidence Interval for Sigma      285.43    684.11
95% Confidence Interval for Median     864.94   1573.22
[Figure: plot of Sales, $ against Shelf Space, Sq in.]
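The 95% confidence interval for Mu in this summary can be checked by hand from the mean, standard deviation, and N. The sketch below assumes the usual t-based interval and takes the critical value 2.201 (two-sided 95%, 11 degrees of freedom) from a standard t table; that constant is not shown in the slides.

```python
from math import sqrt

# Values from the descriptive statistics for Sales, $.
mean = 1258.00
stdev = 402.92
n = 12

# Assumed constant: two-sided 95% critical value of Student's t, n - 1 = 11 d.f.
T_CRIT_DF11 = 2.201

half_width = T_CRIT_DF11 * stdev / sqrt(n)
ci_low = mean - half_width
ci_high = mean + half_width
print(f"95% CI for Mu: {ci_low:.2f} to {ci_high:.2f}")
```

The result agrees with the reported interval of 1002.00 to 1514.00 to within rounding.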
© 2001 ConceptFlow 15
Example: Cereal Sales
Tool Bar Menu > Stat > Basic Statistics > Correlation
© 2001 ConceptFlow 16
Example: Cereal Sales
© 2001 ConceptFlow 17
How Big Should ‘r’ Be?
Find the row for your sample size n. Any correlation that is greater than the table value is considered to be “important” or statistically significant.

$$ r = \sqrt{\frac{t^2}{n - 2 + t^2}} \qquad \text{or} \qquad t = \frac{\sqrt{n - 2} \cdot r}{\sqrt{1 - r^2}} $$

Sample Size   d.f.    Significance Level
n             n-2     0.05     0.025    0.01     0.005
3             1       0.9877   0.9969   0.9995   0.9999
4             2       0.9000   0.9500   0.9800   0.9900
5             3       0.8054   0.8783   0.9343   0.9587
6             4       0.7293   0.8114   0.8822   0.9172
7             5       0.6694   0.7545   0.8329   0.8745
8             6       0.6215   0.7067   0.7887   0.8343
9             7       0.5822   0.6664   0.7498   0.7977
10            8       0.5494   0.6319   0.7155   0.7646
11            9       0.5214   0.6021   0.6851   0.7348
12            10      0.4973   0.5760   0.6581   0.7079
13            11      0.4762   0.5529   0.6339   0.6835
14            12      0.4575   0.5324   0.6120   0.6614
15            13      0.4409   0.5140   0.5923   0.6411
16            14      0.4259   0.4973   0.5742   0.6226
17            15      0.4124   0.4821   0.5577   0.6055
18            16      0.4000   0.4683   0.5425   0.5897
19            17      0.3887   0.4555   0.5285   0.5751
20            18      0.3783   0.4438   0.5155   0.5614
21            19      0.3687   0.4329   0.5034   0.5487
22            20      0.3598   0.4227   0.4921   0.5368
27            25      0.3233   0.3809   0.4451   0.4869
32            30      0.2960   0.3494   0.4093   0.4487
37            35      0.2746   0.3246   0.3810   0.4182
42            40      0.2573   0.3044   0.3578   0.3932
47            45      0.2429   0.2876   0.3384   0.3721
52            50      0.2306   0.2732   0.3218   0.3542
62            60      0.2108   0.2500   0.2948   0.3248
72            70      0.1954   0.2319   0.2737   0.3017
82            80      0.1829   0.2172   0.2565   0.2830
92            90      0.1726   0.2050   0.2422   0.2673
102           100     0.1638   0.1946   0.2301   0.2540
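The relationship between r and t on this slide can be sketched in a few lines. The example below converts in both directions and reproduces one table entry; the critical value 1.8125 (upper 5% point of Student's t with 10 degrees of freedom) is an assumption taken from a standard t table, which suggests the table's significance levels are one-tailed.

```python
from math import sqrt

def t_from_r(r, n):
    """t statistic (n - 2 d.f.) for a sample correlation r from n pairs."""
    return sqrt(n - 2) * r / sqrt(1 - r * r)

def critical_r(t_crit, n):
    """Smallest |r| that reaches the critical t value for sample size n."""
    return sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))

# Assumed constant: upper 5% point of Student's t with 10 d.f.
T_CRIT_DF10 = 1.8125

r_crit = critical_r(T_CRIT_DF10, 12)
print(f"critical r for n = 12 at 0.05: {r_crit:.4f}")  # table lists 0.4973
print(f"back to t: {t_from_r(r_crit, 12):.4f}")
```

Because the two formulas are inverses of each other, converting a critical t to r and back returns the original t.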
© 2001 ConceptFlow 18
Example: Cereal Sales
© 2001 ConceptFlow 19
Example: Mystery Data Set
Minitab Worksheet
© 2001 ConceptFlow 20
Example: Mystery Data Set
[Figure: two plots of Output against Input for the mystery data set; annotation: "Really?"]
© 2001 ConceptFlow 23
Correlation and Regression
© 2001 ConceptFlow 24
Regression Terminology
• Response Variable
• This is the uncontrolled variable - also known as dependent
variable, output variable or Y variable
• Regressor Variable
• Response depends on these variables - also known as
independent variables, input variables, or X variables
• Noise Variable
• Input variables (X) that are not controlled in the experiment
• Regression Equation
• Equation that describes relationship between independent variables
and dependent variable
• Residuals
• Difference between observed response values and predicted
response values (observed minus predicted)
© 2001 ConceptFlow 25
Regression Objectives
• Determination of a Model
• Explore existence of relationship
• Prediction
• Describe nature of relationship using an equation and use equation
for prediction
• Estimation
• To assess accuracy of prediction achieved by regression equation
• Determination of KPIV
• Screen variables and determine which variable has biggest impact
on response variable
© 2001 ConceptFlow 26
Types of Regression
© 2001 ConceptFlow 27
Types of Regression
© 2001 ConceptFlow 28
Simple Linear Regression
© 2001 ConceptFlow 29
Simple Linear Regression
© 2001 ConceptFlow 30
Method of Least Squares
Residuals are the errors of prediction
[Figure: scatter plot with the regression line and the residuals annotated.]
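A minimal sketch of the method of least squares for one input: the slope and intercept below minimize the sum of squared residuals. The function names and the data are hypothetical; the cereal worksheet itself is not reproduced in the slides.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(xs, ys, intercept, slope):
    """Observed minus predicted response for every data point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical data, roughly linear with some scatter.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(xs, ys)
res = residuals(xs, ys, intercept, slope)
print(f"y = {intercept:.3f} + {slope:.3f} x")
print("sum of residuals:", round(sum(res), 10))
```

With an intercept in the model, the least-squares residuals always sum to zero, which the last line illustrates.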
Tool Bar Menu > Stat > Basic Statistics > Display Descriptive Statistics
Tool Bar Menu > Graph > Plot
Descriptive Statistics
Variable: Sales, $
Anderson-Darling Normality Test
A-Squared       0.177
P-Value         0.898
Mean            1258.00
StDev           402.92
Variance        162346
Skewness        -1.8E-02
Kurtosis        -1.04056
N               12
Minimum         651.00
1st Quartile    863.25
Median          1304.00
3rd Quartile    1575.00
Maximum         1924.00
95% Confidence Interval for Mu        1002.00   1514.00
95% Confidence Interval for Sigma      285.43    684.11
95% Confidence Interval for Median     864.94   1573.22
[Figure: plot of Sales, $ against Shelf Space, Sq in.]
© 2001 ConceptFlow 33
Example: Cereal Sales
Tool Bar Menu > Stat > Regression > Fitted Line Plot
© 2001 ConceptFlow 34
Example: Cereal Sales
Source       DF   SS        MS        F         P
Regression   1    1709656   1709656   224.511   0.000
Error        10   76150     7615
Total        11   1785806
Regression is significant
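The F value in this table is the ratio of the two mean squares, and each mean square is a sum of squares divided by its degrees of freedom. A short check using the table's own numbers (variable names are illustrative):

```python
# Values from the analysis of variance table.
ss_regression, df_regression = 1709656.0, 1
ss_error, df_error = 76150.0, 10

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error
f_statistic = ms_regression / ms_error

print(f"MS regression = {ms_regression:.0f}")
print(f"MS error      = {ms_error:.0f}")
print(f"F             = {f_statistic:.3f}")
```

A large F with a P value of 0.000 is what leads to the conclusion that the regression is significant.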
© 2001 ConceptFlow 36
What About R-squared?
Source DF SS MS F P
Regression 1 1709656 1709656 224.511 0.000
Error 10 76150 7615
Total 11 1785806
[Figure: fitted line plot of Sales against Shelf Space; R-Sq = 100%.]
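R-squared can be computed directly from the analysis of variance table above as the share of the total sum of squares explained by the regression. The sketch below also derives the adjusted R-squared and S using their standard definitions, which are an assumption here since the slides do not spell them out.

```python
from math import sqrt

# Values from the analysis of variance table.
ss_regression = 1709656.0
ss_error, df_error = 76150.0, 10
ss_total, df_total = 1785806.0, 11

r_sq = ss_regression / ss_total
# Adjusted R-squared compares mean squares instead of sums of squares.
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)
# S is the square root of the error mean square.
s = sqrt(ss_error / df_error)

print(f"R-Sq      = {100 * r_sq:.1f}%")
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"S         = {s:.4f}")
```

These agree, to rounding, with the S = 87.2641, R-Sq = 95.7 % and R-Sq(adj) = 95.3 % reported with the regression plot for the cereal data.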
© 2001 ConceptFlow 38
Example: Cereal Sales
Tool Bar Menu > Stat > Regression > Fitted Line Plot
[Figure: fitted line plot of Sales; R-Sq = 95.8%.]
© 2001 ConceptFlow 39
Example: Cereal Sales
[Figure: two fitted line plots of Sales side by side; R-Sq = 95.8% and R-Sq = 95.7%.]
© 2001 ConceptFlow 41
Example: Cereal Sales
[Figure: two fitted line plots of Sales side by side.]
• Linear model is better since the additional terms in cubic model did not
add value. How about a quadratic model?
© 2001 ConceptFlow 42
Example: Cereal Sales
© 2001 ConceptFlow 43
Example: Cereal Sales
Tool Bar Menu > Stat > Regression > Fitted Line Plot
© 2001 ConceptFlow 44
Example: Cereal Sales
Regression Plot
Sales = -4710.51 + 10.0720 Shelf Space
S = 87.2641 R-Sq = 95.7 % R-Sq(adj) = 95.3 %
[Figure: regression plot of Sales with the fitted regression line and 95% PI bands; annotation: "Prediction uncertainty for individual values".]
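Using the fitted equation from this slide for prediction is a one-line calculation. The function name and the shelf-space value of 600 sq in are illustrative; 600 was chosen only because it lies inside the range covered by the data.

```python
def predict_sales(shelf_space):
    """Point prediction from Sales = -4710.51 + 10.0720 * Shelf Space."""
    return -4710.51 + 10.0720 * shelf_space

# Illustrative input inside the observed range of shelf space.
print(f"predicted sales at 600 sq in: {predict_sales(600):.2f}")
```

Predictions far outside the observed shelf-space range would extrapolate the line and should be treated with caution.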
© 2001 ConceptFlow 45
Example: Cereal Sales
Tool Bar Menu > Stat > Regression
© 2001 ConceptFlow 46
Example: Cereal Sales
What is the 95% CI?
© 2001 ConceptFlow 47
Example: Cereal Sales
Tool Bar Menu > Stat > Regression > Fitted Line Plot
© 2001 ConceptFlow 48
Example: Cereal Sales
Regression Plot
Sales = -4710.51 + 10.0720 Shelf Space
S = 87.2641 R-Sq = 95.7 % R-Sq(adj) = 95.3 %
[Figure: regression plot of Sales against Shelf Space with the fitted regression line, 95% CI bands, and 95% PI bands. Annotations: "95% prediction interval for a single response" and "95% confidence interval for the mean response".]
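The difference between the two bands can be made explicit with the standard textbook formulas for simple linear regression, given here as a sketch. The symbols x_0 (the shelf space of interest) and \hat{y}_0 (the fitted value at x_0) are notation introduced for this illustration; S is the value reported in the plot output.

```latex
% 95% confidence interval for the mean response at x_0
\hat{y}_0 \pm t_{0.025,\,n-2}\; S \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}

% 95% prediction interval for a single response at x_0
\hat{y}_0 \pm t_{0.025,\,n-2}\; S \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

The extra 1 under the square root accounts for the scatter of an individual observation around the mean, so the PI is always wider than the CI.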
© 2001 ConceptFlow 49
Assumptions for Regression
© 2001 ConceptFlow 50
Checking Assumptions
Tool Bar Menu > Stat > Regression > Regression
© 2001 ConceptFlow 51
Assumptions for Regression
[Figure: residual plots: normal probability plot (Normal Score against Residual) and Residual against Shelf Sp.]
© 2001 ConceptFlow 52
Assumptions for Regression
[Figure: residual plots: Residual against Fitted Value and histogram of residuals (Frequency against Residual).]
[Figure: residual plots for Case 1, Case 2, and Case 3 (axis labels: Normal Score, Residuals, Fitted Value).]
© 2001 ConceptFlow 54
Quiz on Residuals
• What are the issues, if any, associated with each plot and what actions
could be taken to mitigate the issue?
[Figure: residual plots for Case 4, Case 5, and Case 6 (axis labels: Residuals, Shelf Space).]
© 2001 ConceptFlow 55
Impact of Outliers on Regression
© 2001 ConceptFlow 56
Example: Cereal Sales
Predictor   Coef      SE Coef   T        P
Constant    -4696.5   414.3     -11.34   0.000
Shelf Sp    10.0555   0.6978    14.41    0.000
S = 90.59 R-Sq = 95.4% R-Sq(adj) = 94.9%
[Figure: normal probability plot of residuals (Normal Score against Residual).]
Analysis of Variance
Source           DF   SS        MS        F        P
Regression       1    1704037   1704037   207.66   0.000
Residual Error   10   82061     8206
Total            11   1786098
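The T column in this output is each coefficient divided by its standard error, and with a single predictor the square of the slope's T equals the F value in the analysis of variance. A short check with the numbers above (variable names are illustrative):

```python
# Coefficients and standard errors from the regression output.
constant_coef, constant_se = -4696.5, 414.3
slope_coef, slope_se = 10.0555, 0.6978

t_constant = constant_coef / constant_se
t_slope = slope_coef / slope_se

print(f"T (Constant) = {t_constant:.2f}")
print(f"T (Shelf Sp) = {t_slope:.2f}")
# For one predictor, F in the ANOVA table equals the slope's T squared.
print(f"T squared    = {t_slope ** 2:.2f}")
```

The small difference between T squared and the reported F of 207.66 comes from rounding in the printed coefficients.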
[Figure: two residual plots.]
Analysis of Variance
Source           DF   SS        MS        F      P
Regression       1    572701    572701    5.32   0.044
Residual Error   10   1076305   107630
Total            11   1649006
Unusual Observations
Obs   Shelf Sp   Sales    Fit      SE Fit   Residual   St Resid
3     533        1851.0   1010.7   177.9    840.3      3.05R
© 2001 ConceptFlow 58
Impact of Outliers on Regression
[Figure: residual plots: Residual against Fitted Value and normal probability plot (Normal Score against Residual).]
© 2001 ConceptFlow 60
Standardized Residuals
© 2001 ConceptFlow 61
Standardized Residuals
Regression Analysis: Sales versus Shelf Space
The regression equation is
Sales = - 2096 + 5.83 Shelf Space
Predictor   Coef    SE Coef   T       P
Constant    -2096   1501      -1.40   0.193
Shelf Sp    5.829   2.527     2.31    0.044
S = 328.1 R-Sq = 34.7% R-Sq(adj) = 28.2%
Analysis of Variance
Source           DF   SS        MS        F      P
Regression       1    572701    572701    5.32   0.044
Residual Error   10   1076305   107630
Total            11   1649006
Unusual Observations
Obs   Shelf Sp   Sales    Fit      SE Fit   Residual   St Resid
3     533        1851.0   1010.7   177.9    840.3      3.05R
[Figure: Residuals Versus the Fitted Values (response is Sales).]
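The standardized residual for observation 3 can be reproduced from the other columns of the output. The sketch below assumes the usual least-squares relationships, SE Fit = S * sqrt(h) and St Resid = Residual / (S * sqrt(1 - h)), where h is the leverage; neither formula is printed in the slides.

```python
from math import sqrt

# Values for observation 3 from the regression output.
sales = 1851.0
fit = 1010.7
se_fit = 177.9
s = 328.1

residual = sales - fit               # observed minus fitted
leverage = (se_fit / s) ** 2         # assumed: SE Fit = S * sqrt(h)
st_resid = residual / (s * sqrt(1 - leverage))

print(f"Residual = {residual:.1f}")
print(f"Leverage = {leverage:.3f}")
print(f"St Resid = {st_resid:.2f}")
```

The result matches the reported 3.05 to rounding, which supports the assumed formulas for this output.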
© 2001 ConceptFlow 62
Leverage
Tool Bar Menu > Stat > Regression > Regression > Storage > Hi (leverages)
[Figure: plot of Sales against Shelf Space; surviving annotation fragment: "influence on regression".]
© 2001 ConceptFlow 63
Cook’s Distance
[Figure: plot of Sales; surviving annotation fragment: "flags an influential observation".]
© 2001 ConceptFlow 64
Exercise: Cereal Sales
Tool Bar Menu > Graph > Plot…
• Revisit cereal sales data in Sales.mtw
• Create a plot of “Shelf Space” vs “Sales” in Minitab
© 2001 ConceptFlow 65
Exercise: Cereal Sales
Tool Bar Menu > Graph > Plot…
• Create an outlier by changing observation 3 from 651 to 1851 and
complete the regression
• Create a plot of “Shelf Space” vs “Sales” in Minitab
© 2001 ConceptFlow 66
Exercise: Cereal Sales
Tool Bar Menu > Stat > Regression > Regression
Perform the regression using the new data and complete the model diagnostics
© 2001 ConceptFlow 67
Exercise: Cereal Sales
• Flags:
• Standardized residuals > 2
• Leverage > 0.33 (i.e., 2p/n = 2*2/12)
• Cook’s distance > 1
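The three flags can be combined into one small check. In the sketch below the function name is hypothetical, the standardized residual is compared in absolute value, and Cook's distance uses the common formula D = (St Resid^2 / p) * h / (1 - h), which is an assumption since the slides do not state it. The leverage of 0.294 for observation 3 is likewise an assumed value, back-calculated as (SE Fit / S)^2 from the outlier regression output.

```python
def diagnostic_flags(st_resid, leverage, n, p):
    """Apply the three rules of thumb to a single observation.

    n is the number of observations, p the number of model coefficients.
    """
    cooks_distance = (st_resid ** 2 / p) * leverage / (1 - leverage)
    return {
        "cooks_distance": cooks_distance,
        "large_residual": abs(st_resid) > 2,
        "high_leverage": leverage > 2 * p / n,
        "influential": cooks_distance > 1,
    }

# Observation 3 after the outlier is introduced (leverage is an assumed value).
flags = diagnostic_flags(st_resid=3.05, leverage=0.294, n=12, p=2)
for name, value in flags.items():
    print(name, value)
```

Under these assumptions the observation is flagged for its residual and its Cook's distance but falls just below the leverage cutoff of 0.33.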
© 2001 ConceptFlow 68
Simple Regression Exercise
© 2001 ConceptFlow 69
Key Learning Points
© 2001 ConceptFlow 70
Objectives Review
© 2001 ConceptFlow 71
Trademarks and Service Marks