First Edition
Copyright © 2020 Lakmini U. Mallawarachchi
ISBN: 979-8653861543
STATISTICAL DATA ANALYSIS - 2
Step by Step Guide to SPSS & MINITAB
Lakmini U. Mallawarachchi
Preface
Statistical Data Analysis-2, Step by Step Guide to SPSS & MINITAB, takes
a straightforward, step-by-step approach that familiarizes readers with
the SPSS and MINITAB software packages.
I hope that this book will be useful to students, instructors and
researchers in the applied and social sciences. It can also be used as
self-study material and as a textbook.
Lakmini U. Mallawarachchi
June 2020
Table of Contents
4.1.2 Intrinsically non-linear models
4.2 Exponential model
4.2.1 Test the significance of the model
4.2.2 Test the significance of the parameters
4.2.3 Diagnostic Testing for Errors
REFERENCES
CHAPTER ONE: SIMPLE LINEAR REGRESSION
The Pearson correlation coefficient is calculated as:

r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √[Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)²]
1.2.1 Correlation Strength
Example 1.1: If X and Y are two random variables, find whether there is
a relationship between X and Y using SPSS and MINITAB.
X Y
15 30
25 45
30 60
35 65
40 75
45 80
50 105
55 120
60 135
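Before turning to the software, the correlation itself can be verified in a few lines of code. The sketch below (Python, offered only as a cross-check rather than part of the SPSS/MINITAB workflow) applies the Pearson formula to the data above:

```python
# Pearson correlation for the Example 1.1 data,
# using r = Sxy / sqrt(Sxx * Syy).
from math import sqrt

x = [15, 25, 30, 35, 40, 45, 50, 55, 60]
y = [30, 45, 60, 65, 75, 80, 105, 120, 135]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

r = sxy / sqrt(sxx * syy)
print(round(r, 3))  # 0.982
```

The value agrees with the correlation reported by SPSS and MINITAB in the steps that follow.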
Step 1: Analyze → Correlate → Bivariate
Step 2: In the ‘Bivariate Correlations’ dialogue box, move the variables
to be analyzed into the ‘Variables’ list. In this example, the selected
variables are X and Y. Then press OK to proceed.
Correlations
                        X       Y
X   Pearson Correlation 1       .982
Y   Pearson Correlation .982    1
According to the above output, there is a strong positive correlation of
0.982 between the variables X and Y.
In MINITAB,
Step 3: Generated MINITAB output is given below.
Correlations: X, Y
Pearson correlation of X and Y = 0.982
Y X1 X2 X3 X4
27 20 50 75 15
23 27 55 60 20
18 22 62 68 16
26 27 55 60 20
23 24 75 72 8
27 30 62 73 18
30 32 79 71 11
23 24 75 72 8
22 22 62 68 16
24 27 55 60 20
16 40 90 78 32
28 32 79 71 11
31 50 84 72 12
22 40 90 78 32
24 20 50 75 15
31 50 84 72 12
29 30 62 73 18
22 27 55 60 20
In SPSS,
Step 3: Generated SPSS output is given below.
Correlations (N = 18 for every pair)
                         Y        X1       X2       X3       X4
Y    Pearson Correlation 1        .373     .059     .048     -.522*
X1   Pearson Correlation .373     1        .758**   .288     .192
X2   Pearson Correlation .059     .758**   1        .555*    .099
     Sig. (2-tailed)     .815     .000              .017     .697
X3   Pearson Correlation .048     .288     .555*    1        .060
     Sig. (2-tailed)     .852     .247     .017              .813
X4   Pearson Correlation -.522*   .192     .099     .060     1
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
1.3 Simple Linear Regression
y = β₀ + β₁x + ε
Source              DF     SS     MS = SS/DF          F
Regression          1      SSR    MSR = SSR/1         MSR/MSE
Residuals (Errors)  n-2    SSE    MSE = SSE/(n-2)
Total               n-1    SST
The sums of squares are defined as:

SST = Σ(yᵢ - ȳ)²
SSR = Σ(ŷᵢ - ȳ)²
SSE = Σ(yᵢ - ŷᵢ)², with SST = SSR + SSE

Where, ȳ = mean of the y variable and ŷᵢ = estimated y value.
Example 1.3: Company A wants to find out whether the interest rate (X)
has a significant influence on the number of clients (Y) who open
fixed deposits. Analyze the data using SPSS and MINITAB.
X Y
12 265
14 228
16 242
18 260
20 286
22 291
24 320
26 352
28 396
In SPSS,
Step 1: Analyze → Regression → Linear
Step 2: In the ‘Linear Regression’ dialogue box, select ‘Y’ as the
dependent variable and ‘X’ as the independent variable and press the ok
button.
Regression
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .911a   .829       .805                23.970
a. Predictors: (Constant), X
According to the above output, the Adjusted R Square is 0.805, i.e.
80.5% of the variability in Y is explained by the overall model.
ANOVAa
Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  19548.150        1    19548.150     34.023   .001b
   Residual    4021.850         7    574.550
   Total       23570.000        8
a. Dependent Variable: Y
b. Predictors: (Constant), X
Coefficientsa
Model          Unstandardized Coefficients   Standardized Coefficients   t       Sig.
               B         Std. Error          Beta
1  (Constant)  112.833   31.960                                          3.530   .010
   X           9.025     1.547               .911                        5.833   .001
a. Dependent Variable: Y
Y = 112.833 + 9.025X
Number of clients = 112.833 + 9.025* interest rates
The above formula indicates that a one-unit increase in the interest
rate would increase the number of clients by about 9.025.
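As a cross-check on the SPSS output, the same least-squares estimates can be computed directly from the formulas in Section 1.3; a minimal Python sketch:

```python
# Least-squares fit of Example 1.3 (number of clients Y on interest rate X),
# using b1 = Sxy/Sxx and b0 = ybar - b1*xbar.
x = [12, 14, 16, 18, 20, 22, 24, 26, 28]
y = [265, 228, 242, 260, 286, 291, 320, 352, 396]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)

b1 = sxy / sxx                 # slope
b0 = my - b1 * mx              # intercept

sst = sum((yi - my) ** 2 for yi in y)   # total sum of squares
ssr = b1 * sxy                          # regression sum of squares
sse = sst - ssr                         # error sum of squares
f = ssr / (sse / (n - 2))               # F statistic on (1, n-2) df

print(round(b0, 3), round(b1, 3))  # 112.833 9.025
print(round(ssr / sst, 3))         # R Square = 0.829
print(round(f, 2))                 # F = 34.02
```

The slope, intercept, R Square and F statistic all match the SPSS tables above.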
In MINITAB,
Step 3: Generated MINITAB output is given below.
Analysis of Variance
Source DF SS MS F P
Regression 1 19548 19548 34.02 0.001
Residual Error 7 4022 575
Total 8 23570
Unusual Observations
Errors should be normally distributed: This is tested using a
histogram of the errors; if the errors are normally distributed, the
shape of the histogram should be symmetric. The normality assumption
is checked formally using the Anderson-Darling (A-D) test.
Error mean should be zero: Usually this is tested using the plot
of residuals vs. fitted values or a one-sample t test.
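The mean-zero check can also be done numerically. The following Python sketch refits Example 1.3 and summarizes its residuals (the sample standard deviation should match the StDev value MINITAB prints on its probability plot):

```python
# Residual diagnostics for the Example 1.3 fit: the residual mean should be
# (numerically) zero, and the sample SD matches MINITAB's probability plot.
x = [12, 14, 16, 18, 20, 22, 24, 26, 28]
y = [265, 228, 242, 260, 286, 291, 320, 352, 396]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mean_e = sum(resid) / n
sd_e = (sum(e * e for e in resid) / (n - 1)) ** 0.5  # sample SD (mean is ~0)

print(abs(mean_e) < 1e-9)   # True: least squares forces a zero residual mean
print(round(sd_e, 2))       # 22.42
```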
Example 1.4: Test the assumptions for errors using the data given in
the above example.
In MINITAB,
Step 2: In the ‘Regression’ dialogue box, include ‘Y’ as the response
variable and ‘X’ as the predictors.
Step 3: Click ‘options’ button in the ‘Regression’ dialogue box, and select
‘Durbin Watson’ statistic. Then press ok button to proceed.
In SPSS,
Step 3: Click the ‘Statistics’ button in the ‘Linear Regression’ dialogue
box. Then select ‘Durbin-Watson’ and press ‘Continue’ to proceed. The
generated SPSS output is indicated below.
Model Summaryb
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .911a   .829       .805                23.970                       1.061
a. Predictors: (Constant), X
b. Dependent Variable: Y
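The Durbin-Watson statistic can also be computed from its definition, DW = Σ(eᵢ - eᵢ₋₁)² / Σeᵢ²; a Python sketch for the Example 1.3 residuals:

```python
# Durbin-Watson statistic for the Example 1.3 residuals:
# DW = sum((e_i - e_{i-1})^2) / sum(e_i^2); values near 2 suggest no autocorrelation.
x = [12, 14, 16, 18, 20, 22, 24, 26, 28]
y = [265, 228, 242, 260, 286, 291, 320, 352, 396]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

dw = sum((e[i] - e[i - 1]) ** 2 for i in range(1, n)) / sum(ei * ei for ei in e)
print(round(dw, 3))  # 1.061, matching the SPSS Durbin-Watson column
```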
In MINITAB,
Step 1: Stat → Regression → Regression
Step 2: In the ‘regression’ dialogue box, select ‘Y’ as the response and
select ‘X’ as the predictors. Then click ‘storage’ button to proceed.
Step 6: In the ‘Normality Test’ dialogue box, select ‘RESI1’ as the variable
and press ok to proceed.
Probability Plot of RESI1 (Normal)
Mean = -6.31594E-14, StDev = 22.42, N = 9, AD = 0.820, P-Value = 0.020
In SPSS,
Step 4: Generated SPSS output is given below.
Step 6: In the ‘Explore’ dialogue box, move ‘Standardized Residuals’ into
the dependent list and select the ‘Plots’ button to proceed.
Step 7: In the ‘Plots’ dialogue box, select ‘Normality plots with tests’ and
press ‘continue’ button to proceed.
Step 8: Generated SPSS output is given below.
Tests of Normality
                        Kolmogorov-Smirnova           Shapiro-Wilk
                        Statistic   df   Sig.         Statistic   df   Sig.
Standardized Residual   .295        9    .023         .806        9    .024
a. Lilliefors Significance Correction
Since both significance values (.023 and .024) are less than 0.05, the
hypothesis that the residuals are normally distributed is rejected.
Test for the constant variance of errors
In MINITAB,
Step 2: In the ‘regression’ dialogue box, select ‘Y’ as the response and
select ‘X’ as the predictors. Then click ‘storage’ button to proceed.
Step 3: In the ‘storage’ dialogue box, select ‘residuals’ and ‘fits’. Then
press ok button to proceed.
Step 5: Generated MINITAB outputs are given below.
Residuals Versus the Fitted Values
(response is Y)
According to the above graph, it is clear that the data points are not
scattered randomly. This confirms that the errors do not have a
constant variance.
Conclusion: According to the above observations, the fitted model
(Y = 112.833 + 9.025X) does not fulfill the assumptions on the
residuals. Therefore, this model cannot be used for prediction or
forecasting purposes.
sample of 15 companies that recently went public revealed the
following.
c). Conduct the ANOVA table for the regression analysis and interpret
the results.
e). Carry out the diagnostic test to check the assumption of errors.
f). Do you think the hypothesis of the financial analyst can be rejected?
Justify the reasons statistically.
a).
The correlation coefficient between the two variables (r = 0.905,
p = 0.000) is significantly greater than zero, and there is a strong
positive correlation between the size and the price per share.
b).
Coefficientsa
Model          Unstandardized Coefficients   Standardized Coefficients   t        Sig.
               B        Std. Error           Beta
1  (Constant)  10.059   .193                                             52.105   .000
   Size        .011     .002                 .905                        6.014    .000
a. Dependent Variable: Price
c).
Model Summaryb
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .905a   .819       .796                .1781                        2.063
a. Predictors: (Constant), Size
b. Dependent Variable: Price
ANOVAa
Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  1.147            1    1.147         36.164   .000b
   Residual    .254             8    .032
   Total       1.401            9
a. Dependent Variable: Price
b. Predictors: (Constant), Size
Test for the normality of errors
Probability Plot of RESI1 (Normal)
Residuals Versus the Fitted Values
(response is Price)
According to the above graph, it is clear that the data points are
scattered randomly. This confirms that the errors have a constant variance.
CHAPTER TWO: MULTIPLE REGRESSION
2.1 Introduction
This section discusses about fitted models developed for the response
variable (Y) when there is more than one independent variable (X). The
general linear model in multiple regression can be written in the form
of;
y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₚxₚ + ε
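Estimating the coefficients of such a model amounts to solving the normal equations (XᵀX)b = Xᵀy. The Python sketch below implements this with Gaussian elimination and checks it on a small synthetic data set whose true coefficients are known (the data and names here are illustrative, not taken from the book):

```python
def ols(rows, y):
    """Least-squares coefficients for y = b0 + b1*x1 + ... via the normal equations."""
    X = [[1.0] + list(r) for r in rows]          # prepend intercept column
    p = len(X[0])
    # Build the augmented system [X'X | X'y], then Gauss-Jordan with partial pivoting.
    A = [[sum(X[i][j] * X[i][k] for i in range(len(X))) for k in range(p)] +
         [sum(X[i][j] * y[i] for i in range(len(X)))] for j in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c and A[r][c] != 0.0:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[j][p] / A[j][j] for j in range(p)]

# Synthetic check: the data follow y = 2 + 3*x1 - 1*x2 exactly,
# so the solver must recover those coefficients.
rows = [(1, 2), (2, 1), (3, 5), (4, 2), (5, 7), (6, 1)]
y = [2 + 3 * x1 - x2 for x1, x2 in rows]
b = ols(rows, y)
print([round(v, 6) for v in b])  # [2.0, 3.0, -1.0]
```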
2.1.1 Test the significance of the model
Source DF SS MS=SS/DF F
Regression p SSR MSR=SSR/p MSR/MSE
Residuals (Errors) n-p-1 SSE MSE=SSE/(n-p-1)
Total n-1 SST
There are a few approaches that can be used to select the variables
for a multiple regression model. They are:
In MINITAB,
Step 2: In the ‘Best Subset Regression’ dialogue box, select ‘Y’ as the
response, ‘X1-X4’ as the predictors and press ok to continue.
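What best-subsets regression does can be imitated directly: fit every subset of predictors and compare R². A self-contained Python sketch using the Example 2.1 data (the data table appears later in this chapter):

```python
from itertools import combinations

# Example 2.1 data: Y followed by X1-X4.
data = [
    (26, 27, 55, 60, 20), (23, 24, 75, 72, 8),  (27, 30, 62, 73, 18),
    (30, 32, 79, 71, 11), (23, 24, 75, 72, 8),  (22, 22, 62, 68, 16),
    (24, 27, 55, 60, 20), (16, 40, 90, 78, 32), (28, 32, 79, 71, 11),
    (31, 50, 84, 72, 12), (22, 40, 90, 78, 32), (24, 20, 50, 75, 15),
    (31, 50, 84, 72, 12), (29, 30, 62, 73, 18), (22, 27, 55, 60, 20),
]
y = [r[0] for r in data]
X = {'X1': [r[1] for r in data], 'X2': [r[2] for r in data],
     'X3': [r[3] for r in data], 'X4': [r[4] for r in data]}

def ols(rows, yy):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    M = [[1.0] + list(r) for r in rows]
    p = len(M[0])
    A = [[sum(M[i][j] * M[i][k] for i in range(len(M))) for k in range(p)] +
         [sum(M[i][j] * yy[i] for i in range(len(M)))] for j in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c and A[r][c] != 0.0:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[j][p] / A[j][j] for j in range(p)]

def r_squared(names):
    rows = list(zip(*[X[nm] for nm in names]))
    b = ols(rows, y)
    fit = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], r)) for r in rows]
    my = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fit))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

best_by_size = {}
for k in range(1, 5):
    subset = max(combinations(X, k), key=r_squared)
    best_by_size[k] = (subset, round(r_squared(subset), 3))
    print(k, best_by_size[k])
```

For these data the best single predictor turns out to be X4, and the full model reproduces the R Square (≈ 0.717) reported by SPSS later in this chapter.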
Forward Selection (FS) Method
This method adds one variable at a time and tests the significance of
the resulting model. FS can be calculated manually using the formula below.
Test statistic:
In SPSS,
Backward Elimination Method
Test Statistic;
In SPSS,
Step 2: In the ‘Linear Regression’ dialogue box, select ‘Y’ as the
dependent, ‘X1-X4’ as the Independents and press ok to continue.
Stepwise regression
2.1.3 Multicollinearity
multicollinearity exists between them, because it will lead to
misinterpretation of the generated results.
Detecting Multicollinearity
VIFⱼ = 1 / (1 - Rⱼ²)

where Rⱼ² is the R Square obtained by regressing the j-th independent
variable on the remaining independent variables.
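The VIFs can also be obtained in one shot from the predictor correlation matrix: the diagonal of its inverse gives the VIF of each variable. A Python sketch using the Example 2.1 predictor columns, offered as a cross-check against the SPSS ‘Collinearity Statistics’ output shown later:

```python
from math import sqrt

# Predictor columns X1-X4 from Example 2.1.
cols = {
    'X1': [27, 24, 30, 32, 24, 22, 27, 40, 32, 50, 40, 20, 50, 30, 27],
    'X2': [55, 75, 62, 79, 75, 62, 55, 90, 79, 84, 90, 50, 84, 62, 55],
    'X3': [60, 72, 73, 71, 72, 68, 60, 78, 71, 72, 78, 75, 72, 73, 60],
    'X4': [20, 8, 18, 11, 8, 16, 20, 32, 11, 12, 32, 15, 12, 18, 20],
}
names = list(cols)

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return num / sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))

# Correlation matrix of the predictors; the diagonal of its inverse is the VIFs.
p = len(names)
R = [[corr(cols[i], cols[j]) for j in names] for i in names]
aug = [row[:] + [1.0 if i == j else 0.0 for j in range(p)] for i, row in enumerate(R)]
for c in range(p):                 # Gauss-Jordan inversion (no pivoting needed:
    d = aug[c][c]                  # a correlation matrix is positive definite)
    aug[c] = [v / d for v in aug[c]]
    for r in range(p):
        if r != c:
            g = aug[r][c]
            aug[r] = [v - g * w for v, w in zip(aug[r], aug[c])]
vifs = [aug[i][p + i] for i in range(p)]
print([round(v, 3) for v in vifs])  # ≈ [2.350, 3.404, 1.805, 1.057], as in SPSS
```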
Use step wise regression and identify the most significantly
influential variables to the model
Use forward selection method and add the independent variables to
the model.
Remove the collinear independent variables from the model.
Example 2.1: Analyze the data in the table and develop a suitable
model.
Y X1 X2 X3 X4
26 27 55 60 20
23 24 75 72 8
27 30 62 73 18
30 32 79 71 11
23 24 75 72 8
22 22 62 68 16
24 27 55 60 20
16 40 90 78 32
28 32 79 71 11
31 50 84 72 12
22 40 90 78 32
24 20 50 75 15
31 50 84 72 12
29 30 62 73 18
22 27 55 60 20
In SPSS,
Step 2: In the ‘Bivariate’ dialogue box, move all the variables (Y, X1,
X2, X3 & X4) into the ‘Variables’ column and press the ‘Options’ button.
Correlations (N = 15 for every pair)
                         Y        X1       X2       X3       X4
Y    Pearson Correlation 1        .361     .024     -.069    -.576*
     Sig. (2-tailed)              .186     .932     .806     .025
X1   Pearson Correlation .361     1        .731**   .344     .194
     Sig. (2-tailed)     .186              .002     .209     .489
X2   Pearson Correlation .024     .731**   1        .637*    .112
     Sig. (2-tailed)     .932     .002              .011     .692
X3   Pearson Correlation -.069    .344     .637*    1        .131
     Sig. (2-tailed)     .806     .209     .011              .641
X4   Pearson Correlation -.576*   .194     .112     .131     1
     Sig. (2-tailed)     .025     .489     .692     .641
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
Interpretation
Draw a scatter plot and get the correlation matrix in MINITAB
Step 1: Graph → Scatterplot → Simple. Select ‘Y’ for the Y variables and
‘X1, X2, X3, X4’ for the X variables and click the ‘Multiple Graphs’ button.
Step 2: In the ‘Multiple Graphs’ dialogue box, select ‘In separate panels
of the same graph’ and press ok to proceed.
Step 3: Generated MINITAB output is given below.
Scatterplot of Y vs X1, X2, X3, X4 (in separate panels of the same graph)
Step 5: In the ‘Correlation’ dialogue box, move all the variables (Y,
X1, X2, X3 & X4) into the ‘Variables’ column and press the ‘OK’ button.
2. Test the significance of the model.
In SPSS,
Step 3: In the ‘Linear Regression: Statistics’ dialogue box, select
‘Collinearity diagnostics’ and, under Residuals, select ‘Durbin-Watson’,
and press Continue to proceed.
Model Summaryb
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .847a   .717       .604                2.628                        2.747
a. Predictors: (Constant), X4, X2, X3, X1
b. Dependent Variable: Y
The R square of the above table indicates that 71.7 % of the observed
variability has been captured by the fitted model.
ANOVAa
Model          Sum of Squares   df   Mean Square   F       Sig.
1  Regression  175.342          4    43.836        6.348   .008b
   Residual    69.058           10   6.906
   Total       244.400          14
a. Dependent Variable: Y
b. Predictors: (Constant), X4, X2, X3, X1
According to the ANOVA table, it can be seen that F value (6.348) is
statistically significant as the corresponding P value (0.008) is less than
0.05. Therefore, it can be concluded with 95% confidence that the fitted
model is statistically significant.
Coefficientsa
Model          Unstandardized Coefficients   Standardized Coefficients   t        Sig.   Collinearity Statistics
               B        Std. Error           Beta                                        Tolerance   VIF
1  (Constant)  26.743   8.795                                            3.041    .012
   X1          .418     .115                 .937                        3.635    .005   .425        2.350
   X2          -.199    .094                 -.658                       -2.122   .060   .294        3.404
   X3          .084     .159                 .120                        .529     .608   .554        1.805
   X4          -.395    .097                 -.700                       -4.051   .002   .946        1.057
a. Dependent Variable: Y
According to the above ‘Coefficients’ table, all the VIF values of the
variables are less than 5, which indicates the absence of a
multicollinearity problem. Further, the Sig. column indicates that the
coefficients of X1 and X4 are statistically significant, as their p
values are less than 0.05, while the coefficients of X2 and X3 are not
significant, as their p values are greater than 0.05. The fitted model
can be written as:

ŷ = 26.743 + 0.418X1 - 0.199X2 + 0.084X3 - 0.395X4
As X2 and X3 are not significant, it’s better to carry out the stepwise
regression method in order to find out the best fitted model.
Step 2: In the ‘Linear Regression’ dialogue box, select ‘Y’ as the
dependent variable and ‘X1, X2, X3 & X4’ as the independent variables,
set Method to ‘Stepwise’, and press the Options button.
Step 4: Generated SPSS output is given as follows.
Coefficientsa
Model          Unstandardized Coefficients   Standardized Coefficients   t        Sig.
               B        Std. Error           Beta
1  (Constant)  30.684   2.343                                            13.095   .000
   X4          -.325    .128                 -.576                       -2.542   .025
2  (Constant)  24.652   3.092                                            7.973    .000
   X4          -.379    .110                 -.671                       -3.458   .005
   X1          .219     .087                 .491                        2.530    .026
3  (Constant)  30.920   3.756                                            8.233    .000
   X4          -.389    .094                 -.689                       -4.154   .002
   X1          .403     .108                 .903                        3.741    .003
   X2          -.169    .072                 -.559                       -2.344   .039
a. Dependent Variable: Y
Interpretation for;
In MINITAB,
Step 3: In the ‘Methods’ dialogue box, change Alpha to enter as ‘0.05’ and
Alpha to remove as ‘0.06’ and press ok button to proceed.
Similar results were obtained for the stepwise regression method
carried out using the MINITAB software. The coefficient values of the
best fitted model are those given for the third model above. Therefore,
the best fitted model can be written as:

ŷ = 30.920 + 0.403X1 - 0.169X2 - 0.389X4
Interpretations for;
Test for normality of errors
Probability Plot of RESI1 (Normal)
According to the graph below, it is clear that the data points are
scattered randomly. This confirms that the errors have a constant variance.
Residuals Versus the Fitted Values
(response is Y)
The plot of residuals vs. fitted values indicates that the data points
are scattered randomly, so it can be concluded that the errors have a
constant variance.
CHAPTER THREE: POLYNOMIAL REGRESSION
3.1 Introduction
y = β₀ + β₁x + β₂x² + ε (second order model)
Price Sales
65 132
65 141
65 153
65 158
65 164
85 172
85 94
85 100
85 110
105 118
105 127
105 76
105 85
105 102
In MINITAB,
The R square of the above table indicates that 46.9 % of the observed
variability has been captured by the fitted model.
Step 3: Get the fitted line plot for sales and price.
Step 4: Generated output of MINITAB is given below.
Fitted Line Plot of Sales vs Price
Price Sales Price^2
65 132 4225
65 141 4225
65 153 4225
65 158 4225
65 164 4225
85 172 7225
85 94 7225
85 100 7225
85 110 7225
105 118 11025
105 127 11025
105 76 11025
105 85 11025
105 102 11025
The R square of the above table indicates that 47.9 % of the observed
variability has been captured by the fitted model. Here the percentage
representing the variability of the model has increased by 1.0%.
3.2.2 Test the significance of the parameters
Step 7: Get the fitted line plot for Sales, Price and Price².
Fitted Line Plot
Sales = 340.2 - 4.005 Price + 0.01650 Price²
S = 24.1104, R-Sq = 47.9%, R-Sq(adj) = 38.5%
Once the model is developed, the assumptions on the errors need to be
tested in order to validate the model for prediction or forecasting.
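The fitted quadratic above can be reproduced without MINITAB. Because only three distinct Price values occur and the model has three parameters, the least-squares quadratic must pass exactly through the three group means of Sales, so the coefficients follow from simple elimination (a Python sketch):

```python
# Reproducing the quadratic fit Sales = a + b*Price + c*Price^2.
# With three distinct Price values and three parameters, the least-squares
# quadratic interpolates the three group means of Sales.
sales = {65: [132, 141, 153, 158, 164],
         85: [172, 94, 100, 110],
         105: [118, 127, 76, 85, 102]}
m = {p: sum(v) / len(v) for p, v in sales.items()}

p1, p2, p3 = 65, 85, 105
d1 = m[p2] - m[p1]          # = b*(p2 - p1) + c*(p2^2 - p1^2)
d2 = m[p3] - m[p2]          # = b*(p3 - p2) + c*(p3^2 - p2^2)
# Subtracting eliminates b because the prices are equally spaced (p2-p1 = p3-p2).
c = (d2 - d1) / ((p3**2 - p2**2) - (p2**2 - p1**2))
b = (d1 - c * (p2**2 - p1**2)) / (p2 - p1)
a = m[p1] - b * p1 - c * p1**2

print(round(a, 1), round(b, 3), round(c, 4))  # 340.2 -4.005 0.0165
```

The coefficients agree with the MINITAB fitted line plot shown above.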
Test for normality of errors
Probability Plot of RESI1 (Normal)
According to the graph below, it is clear that the data points are
scattered randomly. This confirms that the errors have a constant variance.
Residuals Versus the Fitted Values
(response is Sales)
CHAPTER FOUR: NON LINEAR REGRESSION
4.1 Introduction
Regression models that are not linear in the parameters are called
nonlinear. Nonlinear models can be grouped into two types, namely
intrinsically linear and intrinsically nonlinear.
Intrinsically linear models can be transformed into linear form;
examples include:
(i) the exponential model y = αe^(βt), which becomes ln y = ln α + βt
after taking logarithms
(ii)
Example 4.1: Develop a model for the data given below. Carry out
diagnostic tests to confirm the validity of the model.
t y
1 355
2 211
3 197
4 166
5 142
6 106
7 104
8 60
9 56
10 38
11 36
12 32
13 21
14 19
15 15
Correlations: t, y
The results indicate that the correlation coefficient is close to -1.0,
which confirms that there is a strong negative linear relationship
between these two variables.
The R square of the above table indicates that 82.3 % of the observed
variability has been captured by the fitted model. Further, in order to
find the relationship, regression analysis was carried out.
Plot of y versus t
Randomness assumption
As the Durbin-Watson statistic is not close to 2.0, it can be concluded
that the errors are non-random.
As shown in the graph below, the residuals have been plotted against the
fitted values to test whether the errors have a constant variance.
Residuals Versus the Fitted Values
(response is y)
The above plot indicates that the data points are not scattered
randomly, so it can be concluded that the errors do not have a
constant variance.
Normality assumption
Probability Plot of RESI1 (Normal)
Mean = -6.63173E-14, StDev = 40.31, N = 15, AD = 0.870, P-Value = 0.019
As indicated in the above graph, the Anderson-Darling test is used to
test the null hypothesis H0: the errors are normally distributed. The
value of the test statistic (AD = 0.870) is significant (P = 0.019 <
0.05). Thus H0 is rejected, and it can be claimed that the errors are
not normally distributed.
Data set with LOG y (natural logarithm of y)
t y LOG y
1 355 5.87212
2 211 5.35186
3 197 5.2832
4 166 5.11199
5 142 4.95583
6 106 4.66344
7 104 4.64439
8 60 4.09434
9 56 4.02535
10 38 3.63759
11 36 3.58352
12 32 3.46574
13 21 3.04452
14 19 2.94444
15 15 2.70805
Correlations: t, LOG y
4.2.1 Test the significance of the model
The ANOVA table is used to test the null hypothesis H0: the model is
not significant. The above result indicates that the F value (622.34)
is significant, as the corresponding P value (0.000) is less than 0.05.
Therefore, it can be concluded with 95% confidence that the fitted
model is significant.
The above results indicate that the P values for the parameters are
less than 0.05. Therefore, we can conclude that the parameters are
significantly different from zero, and the fitted model is obtained by
regressing LOG y on t.
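The log-linear fit can be reproduced with ordinary least squares on (t, LOG y). The sketch below uses all 15 observations, so its coefficients are indicative only; they may differ slightly from the book's final model if unusual observations were excluded there:

```python
from math import log, exp

t = list(range(1, 16))
y = [355, 211, 197, 166, 142, 106, 104, 60, 56, 38, 36, 32, 21, 19, 15]
ly = [log(v) for v in y]        # natural log, matching the LOG y column above

n = len(t)
mt, ml = sum(t) / n, sum(ly) / n
stt = sum((ti - mt) ** 2 for ti in t)
sty = sum((ti - mt) * (li - ml) for ti, li in zip(t, ly))
b = sty / stt                   # slope of ln y on t (negative: y is decaying)
a = ml - b * mt                 # intercept
r = sty / (stt * sum((li - ml) ** 2 for li in ly)) ** 0.5

print(round(a, 3), round(b, 3), round(r, 3))   # ln y ≈ a + b*t, r close to -1
# Back-transforming gives the exponential form y = exp(a) * exp(b * t).
print(round(exp(a), 1))
```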
The line of best fit corresponding to the model equation is indicated
below.
Fitted Line Plot of LOG y vs t
Randomness assumption
Constant variance assumption
As shown in the graph below, the residuals have been plotted against the
fitted values to test whether the errors have a constant variance.
Residuals Versus the Fitted Values
(response is LOG y)
The plot of residuals vs. fitted values indicates that the data points
are scattered randomly, so it can be concluded that the errors have a
constant variance.
Normality assumption
Probability Plot of RESI2 (Normal)
Mean = -1.16146E-15, StDev = 0.1139, N = 13, AD = 0.162, P-Value = 0.928
As indicated in the above graph, the Anderson-Darling test is used to
test the null hypothesis H0: the errors are normally distributed. The
value of the test statistic (AD = 0.162) is not significant (P = 0.928 >
0.05). Thus H0 is not rejected, and it can be claimed that the errors
are normally distributed.
REFERENCES