First Edition
Copyright © 2020 Lakmini U. Mallawarachchi
ISBN: 979-8653861543
STATISTICAL DATA ANALYSIS - 2
Step by Step Guide to SPSS & MINITAB
Lakmini U. Mallawarachchi
Preface
Statistical Data Analysis-2, Step by Step Guide to SPSS & MINITAB, takes
a straightforward, step-by-step approach that familiarizes readers with
the SPSS and MINITAB software packages.
I hope that this book will be useful to students, instructors and
researchers in the applied and social sciences. It can also be used as
self-study material and as a textbook.
Lakmini U. Mallawarachchi
June 2020
Table of Contents
4.1.2 Intrinsically non-linear models
4.2 Exponential model
4.2.1 Test the significance of the model
4.2.2 Test the significance of the parameters
4.2.3 Diagnostic Testing for Errors
REFERENCES
CHAPTER ONE: SIMPLE LINEAR REGRESSION
The Pearson correlation coefficient is calculated as:

r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √[Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)²]
1.2.1 Correlation Strength
Example 1.1: If X and Y are two random variables, find whether there is
a relationship between X and Y using SPSS and MINITAB.
X Y
15 30
25 45
30 60
35 65
40 75
45 80
50 105
55 120
60 135
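Before turning to the software, the correlation itself can be verified in a few lines of code. The sketch below (Python, offered only as a cross-check rather than part of the SPSS/MINITAB workflow) applies the Pearson formula to the data above:

```python
# Pearson correlation for the Example 1.1 data,
# using r = Sxy / sqrt(Sxx * Syy).
from math import sqrt

x = [15, 25, 30, 35, 40, 45, 50, 55, 60]
y = [30, 45, 60, 65, 75, 80, 105, 120, 135]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

r = sxy / sqrt(sxx * syy)
print(round(r, 3))  # 0.982
```

The value agrees with the correlation reported by SPSS and MINITAB in the steps that follow.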
Step 1: Analyze → Correlate → Bivariate
Step 2: In the ‘Bivariate Correlations’ dialogue box, move the variables
to be analyzed into the ‘Variables’ list. In this example, the selected
variables are X and Y. Then press OK to proceed.
Correlations
                        X       Y
X   Pearson Correlation 1       .982
Y   Pearson Correlation .982    1
According to the above output, there is a strong positive correlation of
0.982 between the variables X and Y.
In MINITAB,
Step 3: Generated MINITAB output is given below.
Correlations: X, Y
Pearson correlation of X and Y = 0.982
Y X1 X2 X3 X4
27 20 50 75 15
23 27 55 60 20
18 22 62 68 16
26 27 55 60 20
23 24 75 72 8
27 30 62 73 18
30 32 79 71 11
23 24 75 72 8
22 22 62 68 16
24 27 55 60 20
16 40 90 78 32
28 32 79 71 11
31 50 84 72 12
22 40 90 78 32
24 20 50 75 15
31 50 84 72 12
29 30 62 73 18
22 27 55 60 20
In SPSS,
Step 3: Generated SPSS output is given below.
Correlations (N = 18 for every pair)
                         Y        X1       X2       X3       X4
Y    Pearson Correlation 1        .373     .059     .048     -.522*
X1   Pearson Correlation .373     1        .758**   .288     .192
X2   Pearson Correlation .059     .758**   1        .555*    .099
     Sig. (2-tailed)     .815     .000              .017     .697
X3   Pearson Correlation .048     .288     .555*    1        .060
     Sig. (2-tailed)     .852     .247     .017              .813
X4   Pearson Correlation -.522*   .192     .099     .060     1
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
1.3 Simple Linear Regression
y = β₀ + β₁x + ε
Source              DF     SS     MS = SS/DF          F
Regression          1      SSR    MSR = SSR/1         MSR/MSE
Residuals (Errors)  n-2    SSE    MSE = SSE/(n-2)
Total               n-1    SST
The sums of squares are defined as:

SST = Σ(yᵢ - ȳ)²
SSR = Σ(ŷᵢ - ȳ)²
SSE = Σ(yᵢ - ŷᵢ)², with SST = SSR + SSE

Where, ȳ = mean of the y variable and ŷᵢ = estimated y value.
Example 1.3: Company A wants to find out whether the interest rate (X)
has a significant influence on the number of clients (Y) who open
fixed deposits. Analyze the data using SPSS and MINITAB.
X Y
12 265
14 228
16 242
18 260
20 286
22 291
24 320
26 352
28 396
In SPSS,
Step 1: Analyze → Regression → Linear
Step 2: In the ‘Linear Regression’ dialogue box, select ‘Y’ as the
dependent variable and ‘X’ as the independent variable and press the ok
button.
Regression
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .911a   .829       .805                23.970
a. Predictors: (Constant), X
According to the above output, the Adjusted R Square is 0.805, i.e.
80.5% of the variability in Y is explained by the overall model.
ANOVAa
Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  19548.150        1    19548.150     34.023   .001b
   Residual    4021.850         7    574.550
   Total       23570.000        8
a. Dependent Variable: Y
b. Predictors: (Constant), X
Coefficientsa
Model          Unstandardized Coefficients   Standardized Coefficients   t       Sig.
               B         Std. Error          Beta
1  (Constant)  112.833   31.960                                          3.530   .010
   X           9.025     1.547               .911                        5.833   .001
a. Dependent Variable: Y
Y = 112.833 + 9.025X
Number of clients = 112.833 + 9.025* interest rates
The above formula indicates that a one-unit increase in the interest
rate would increase the number of clients by about 9.025.
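As a cross-check on the SPSS output, the same least-squares estimates can be computed directly from the formulas in Section 1.3; a minimal Python sketch:

```python
# Least-squares fit of Example 1.3 (number of clients Y on interest rate X),
# using b1 = Sxy/Sxx and b0 = ybar - b1*xbar.
x = [12, 14, 16, 18, 20, 22, 24, 26, 28]
y = [265, 228, 242, 260, 286, 291, 320, 352, 396]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)

b1 = sxy / sxx                 # slope
b0 = my - b1 * mx              # intercept

sst = sum((yi - my) ** 2 for yi in y)   # total sum of squares
ssr = b1 * sxy                          # regression sum of squares
sse = sst - ssr                         # error sum of squares
f = ssr / (sse / (n - 2))               # F statistic on (1, n-2) df

print(round(b0, 3), round(b1, 3))  # 112.833 9.025
print(round(ssr / sst, 3))         # R Square = 0.829
print(round(f, 2))                 # F = 34.02
```

The slope, intercept, R Square and F statistic all match the SPSS tables above.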
In MINITAB,
Step 3: Generated MINITAB output is given below.
Analysis of Variance
Source DF SS MS F P
Regression 1 19548 19548 34.02 0.001
Residual Error 7 4022 575
Total 8 23570
Unusual Observations
Errors should be normally distributed: This is tested using a
histogram of the errors; if the errors are normally distributed, the
shape of the histogram should be symmetric. The normality assumption
is checked formally using the Anderson-Darling (A-D) test.
Error mean should be zero: Usually this is tested using the plot
of residuals vs. fitted values or a one-sample t test.
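The mean-zero check can also be done numerically. The following Python sketch refits Example 1.3 and summarizes its residuals (the sample standard deviation should match the StDev value MINITAB prints on its probability plot):

```python
# Residual diagnostics for the Example 1.3 fit: the residual mean should be
# (numerically) zero, and the sample SD matches MINITAB's probability plot.
x = [12, 14, 16, 18, 20, 22, 24, 26, 28]
y = [265, 228, 242, 260, 286, 291, 320, 352, 396]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mean_e = sum(resid) / n
sd_e = (sum(e * e for e in resid) / (n - 1)) ** 0.5  # sample SD (mean is ~0)

print(abs(mean_e) < 1e-9)   # True: least squares forces a zero residual mean
print(round(sd_e, 2))       # 22.42
```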
Example 1.4: Test the assumptions for errors using the data given in
the above example.
In MINITAB,
Step 2: In the ‘Regression’ dialogue box, include ‘Y’ as the response
variable and ‘X’ as the predictors.
Step 3: Click ‘options’ button in the ‘Regression’ dialogue box, and select
‘Durbin Watson’ statistic. Then press ok button to proceed.
In SPSS,
Step 3: Click the ‘Statistics’ button in the ‘Linear Regression’ dialogue
box. Then select ‘Durbin-Watson’ and press ‘Continue’ to proceed. The
generated SPSS output is indicated below.
Model Summaryb
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .911a   .829       .805                23.970                       1.061
a. Predictors: (Constant), X
b. Dependent Variable: Y
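The Durbin-Watson statistic can also be computed from its definition, DW = Σ(eᵢ - eᵢ₋₁)² / Σeᵢ²; a Python sketch for the Example 1.3 residuals:

```python
# Durbin-Watson statistic for the Example 1.3 residuals:
# DW = sum((e_i - e_{i-1})^2) / sum(e_i^2); values near 2 suggest no autocorrelation.
x = [12, 14, 16, 18, 20, 22, 24, 26, 28]
y = [265, 228, 242, 260, 286, 291, 320, 352, 396]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

dw = sum((e[i] - e[i - 1]) ** 2 for i in range(1, n)) / sum(ei * ei for ei in e)
print(round(dw, 3))  # 1.061, matching the SPSS Durbin-Watson column
```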
In MINITAB,
Step 1: Stat → Regression → Regression
Step 2: In the ‘regression’ dialogue box, select ‘Y’ as the response and
select ‘X’ as the predictors. Then click ‘storage’ button to proceed.
Step 6: In the ‘Normality Test’ dialogue box, select ‘RESI1’ as the variable
and press ok to proceed.
Probability Plot of RESI1 (Normal)
Mean = -6.31594E-14, StDev = 22.42, N = 9, AD = 0.820, P-Value = 0.020
In SPSS,
Step 4: Generated SPSS output is given below.
Step 6: In the ‘Explore’ dialogue box, move ‘Standardized Residuals’ into
the dependent list and select the ‘Plots’ button to proceed.
Step 7: In the ‘Plots’ dialogue box, select ‘Normality plots with tests’ and
press ‘continue’ button to proceed.
Step 8: Generated SPSS output is given below.
Tests of Normality
                        Kolmogorov-Smirnova           Shapiro-Wilk
                        Statistic   df   Sig.         Statistic   df   Sig.
Standardized Residual   .295        9    .023         .806        9    .024
a. Lilliefors Significance Correction
Since both significance values (.023 and .024) are less than 0.05, the
hypothesis that the residuals are normally distributed is rejected.
Test for the constant variance of errors
In MINITAB,
Step 2: In the ‘regression’ dialogue box, select ‘Y’ as the response and
select ‘X’ as the predictors. Then click ‘storage’ button to proceed.
Step 3: In the ‘storage’ dialogue box, select ‘residuals’ and ‘fits’. Then
press ok button to proceed.
Step 5: Generated MINITAB outputs are given below.
Residuals Versus the Fitted Values
(response is Y)
According to the above graph, it is clear that the data points are not
scattered randomly. This confirms that the errors do not have a
constant variance.
Conclusion: According to the above observations, the fitted model
(Y = 112.833 + 9.025X) does not fulfill the assumptions on the
residuals. Therefore, this model cannot be used for prediction or
forecasting purposes.
sample of 15 companies that recently went public revealed the
following.
c). Conduct the ANOVA table for the regression analysis and interpret
the results.
e). Carry out the diagnostic test to check the assumption of errors.
f). Do you think the hypothesis of the financial analyst can be rejected?
Justify the reasons statistically.
a).
The correlation coefficient between the two variables (r = 0.905,
p = 0.000) is significantly greater than zero, and there is a strong
positive correlation between the size and the price per share.
b).
Coefficientsa
Model          Unstandardized Coefficients   Standardized Coefficients   t        Sig.
               B        Std. Error           Beta
1  (Constant)  10.059   .193                                             52.105   .000
   Size        .011     .002                 .905                        6.014    .000
a. Dependent Variable: Price
c).
Model Summaryb
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .905a   .819       .796                .1781                        2.063
a. Predictors: (Constant), Size
b. Dependent Variable: Price
ANOVAa
Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  1.147            1    1.147         36.164   .000b
   Residual    .254             8    .032
   Total       1.401            9
a. Dependent Variable: Price
b. Predictors: (Constant), Size
Test for the normality of errors
Probability Plot of RESI1 (Normal)
Residuals Versus the Fitted Values
(response is Price)
According to the above graph, it is clear that the data points are
scattered randomly. This confirms that the errors have a constant variance.
CHAPTER TWO: MULTIPLE REGRESSION
2.1 Introduction
This section discusses about fitted models developed for the response
variable (Y) when there is more than one independent variable (X). The
general linear model in multiple regression can be written in the form
of;
y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₚxₚ + ε
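Estimating the coefficients of such a model amounts to solving the normal equations (XᵀX)b = Xᵀy. The Python sketch below implements this with Gaussian elimination and checks it on a small synthetic data set whose true coefficients are known (the data and names here are illustrative, not taken from the book):

```python
def ols(rows, y):
    """Least-squares coefficients for y = b0 + b1*x1 + ... via the normal equations."""
    X = [[1.0] + list(r) for r in rows]          # prepend intercept column
    p = len(X[0])
    # Build the augmented system [X'X | X'y], then Gauss-Jordan with partial pivoting.
    A = [[sum(X[i][j] * X[i][k] for i in range(len(X))) for k in range(p)] +
         [sum(X[i][j] * y[i] for i in range(len(X)))] for j in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c and A[r][c] != 0.0:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[j][p] / A[j][j] for j in range(p)]

# Synthetic check: the data follow y = 2 + 3*x1 - 1*x2 exactly,
# so the solver must recover those coefficients.
rows = [(1, 2), (2, 1), (3, 5), (4, 2), (5, 7), (6, 1)]
y = [2 + 3 * x1 - x2 for x1, x2 in rows]
b = ols(rows, y)
print([round(v, 6) for v in b])  # [2.0, 3.0, -1.0]
```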
2.1.1 Test the significance of the model
Source DF SS MS=SS/DF F
Regression p SSR MSR=SSR/p MSR/MSE
Residuals (Errors) n-p-1 SSE MSE=SSE/(n-p-1)
Total n-1 SST
There are a few approaches that can be used to select the variables
for a multiple regression model. They are:
In MINITAB,
Step 2: In the ‘Best Subset Regression’ dialogue box, select ‘Y’ as the
response, ‘X1-X4’ as the predictors and press ok to continue.
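What best-subsets regression does can be imitated directly: fit every subset of predictors and compare R². A self-contained Python sketch using the Example 2.1 data (the data table appears later in this chapter):

```python
from itertools import combinations

# Example 2.1 data: Y followed by X1-X4.
data = [
    (26, 27, 55, 60, 20), (23, 24, 75, 72, 8),  (27, 30, 62, 73, 18),
    (30, 32, 79, 71, 11), (23, 24, 75, 72, 8),  (22, 22, 62, 68, 16),
    (24, 27, 55, 60, 20), (16, 40, 90, 78, 32), (28, 32, 79, 71, 11),
    (31, 50, 84, 72, 12), (22, 40, 90, 78, 32), (24, 20, 50, 75, 15),
    (31, 50, 84, 72, 12), (29, 30, 62, 73, 18), (22, 27, 55, 60, 20),
]
y = [r[0] for r in data]
X = {'X1': [r[1] for r in data], 'X2': [r[2] for r in data],
     'X3': [r[3] for r in data], 'X4': [r[4] for r in data]}

def ols(rows, yy):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    M = [[1.0] + list(r) for r in rows]
    p = len(M[0])
    A = [[sum(M[i][j] * M[i][k] for i in range(len(M))) for k in range(p)] +
         [sum(M[i][j] * yy[i] for i in range(len(M)))] for j in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c and A[r][c] != 0.0:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[j][p] / A[j][j] for j in range(p)]

def r_squared(names):
    rows = list(zip(*[X[nm] for nm in names]))
    b = ols(rows, y)
    fit = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], r)) for r in rows]
    my = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fit))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

best_by_size = {}
for k in range(1, 5):
    subset = max(combinations(X, k), key=r_squared)
    best_by_size[k] = (subset, round(r_squared(subset), 3))
    print(k, best_by_size[k])
```

For these data the best single predictor turns out to be X4, and the full model reproduces the R Square (≈ 0.717) reported by SPSS later in this chapter.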
Forward Selection (FS) Method
This method adds one variable at a time and tests the significance of
the resulting model. FS can be calculated manually using the formula below.
Test statistic:
In SPSS,
Backward Elimination Method
Test Statistic;
In SPSS,
Step 2: In the ‘Linear Regression’ dialogue box, select ‘Y’ as the
dependent, ‘X1-X4’ as the Independents and press ok to continue.
Stepwise regression
2.1.3 Multicollinearity
multicollinearity exists between them, because it will lead to
misinterpretation of the generated results.
Detecting Multicollinearity
VIFⱼ = 1 / (1 - Rⱼ²)

where Rⱼ² is the R Square obtained by regressing the j-th independent
variable on the remaining independent variables.
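The VIFs can also be obtained in one shot from the predictor correlation matrix: the diagonal of its inverse gives the VIF of each variable. A Python sketch using the Example 2.1 predictor columns, offered as a cross-check against the SPSS ‘Collinearity Statistics’ output shown later:

```python
from math import sqrt

# Predictor columns X1-X4 from Example 2.1.
cols = {
    'X1': [27, 24, 30, 32, 24, 22, 27, 40, 32, 50, 40, 20, 50, 30, 27],
    'X2': [55, 75, 62, 79, 75, 62, 55, 90, 79, 84, 90, 50, 84, 62, 55],
    'X3': [60, 72, 73, 71, 72, 68, 60, 78, 71, 72, 78, 75, 72, 73, 60],
    'X4': [20, 8, 18, 11, 8, 16, 20, 32, 11, 12, 32, 15, 12, 18, 20],
}
names = list(cols)

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return num / sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))

# Correlation matrix of the predictors; the diagonal of its inverse is the VIFs.
p = len(names)
R = [[corr(cols[i], cols[j]) for j in names] for i in names]
aug = [row[:] + [1.0 if i == j else 0.0 for j in range(p)] for i, row in enumerate(R)]
for c in range(p):                 # Gauss-Jordan inversion (no pivoting needed:
    d = aug[c][c]                  # a correlation matrix is positive definite)
    aug[c] = [v / d for v in aug[c]]
    for r in range(p):
        if r != c:
            g = aug[r][c]
            aug[r] = [v - g * w for v, w in zip(aug[r], aug[c])]
vifs = [aug[i][p + i] for i in range(p)]
print([round(v, 3) for v in vifs])  # ≈ [2.350, 3.404, 1.805, 1.057], as in SPSS
```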
Use step wise regression and identify the most significantly
influential variables to the model
Use forward selection method and add the independent variables to
the model.
Remove the collinear independent variables from the model.
Example 2.1: Analyze the data in the table and develop a suitable
model.
Y X1 X2 X3 X4
26 27 55 60 20
23 24 75 72 8
27 30 62 73 18
30 32 79 71 11
23 24 75 72 8
22 22 62 68 16
24 27 55 60 20
16 40 90 78 32
28 32 79 71 11
31 50 84 72 12
22 40 90 78 32
24 20 50 75 15
31 50 84 72 12
29 30 62 73 18
22 27 55 60 20
In SPSS,
Step 2: In the ‘Bivariate’ dialogue box, move all the variables (Y, X1,
X2, X3 & X4) into the ‘Variables’ column and press the ‘Options’ button.
Correlations (N = 15 for every pair)
                         Y        X1       X2       X3       X4
Y    Pearson Correlation 1        .361     .024     -.069    -.576*
     Sig. (2-tailed)              .186     .932     .806     .025
X1   Pearson Correlation .361     1        .731**   .344     .194
     Sig. (2-tailed)     .186              .002     .209     .489
X2   Pearson Correlation .024     .731**   1        .637*    .112
     Sig. (2-tailed)     .932     .002              .011     .692
X3   Pearson Correlation -.069    .344     .637*    1        .131
     Sig. (2-tailed)     .806     .209     .011              .641
X4   Pearson Correlation -.576*   .194     .112     .131     1
     Sig. (2-tailed)     .025     .489     .692     .641
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
Interpretation
Draw a scatter plot and get the correlation matrix in MINITAB
Step 1: Graph → Scatterplot → Simple. Select ‘Y’ for the Y variables and
‘X1, X2, X3, X4’ for the X variables and click the ‘Multiple Graphs’ button.
Step 2: In the ‘Multiple Graphs’ dialogue box, select ‘In separate panels
of the same graph’ and press ok to proceed.
Step 3: Generated MINITAB output is given below.
Scatterplot of Y vs X1, X2, X3, X4 (in separate panels of the same graph)
Step 5: In the ‘Correlation’ dialogue box, move all the variables (Y,
X1, X2, X3 & X4) into the ‘Variables’ column and press the ‘OK’ button.
2. Test the significance of the model.
In SPSS,
Step 3: In the ‘Linear Regression: Statistics’ dialogue box, select
‘Collinearity diagnostics’ and, under Residuals, select ‘Durbin-Watson’,
and press Continue to proceed.
Model Summaryb
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .847a   .717       .604                2.628                        2.747
a. Predictors: (Constant), X4, X2, X3, X1
b. Dependent Variable: Y
The R square of the above table indicates that 71.7 % of the observed
variability has been captured by the fitted model.
ANOVAa
Model          Sum of Squares   df   Mean Square   F       Sig.
1  Regression  175.342          4    43.836        6.348   .008b
   Residual    69.058           10   6.906
   Total       244.400          14
a. Dependent Variable: Y
b. Predictors: (Constant), X4, X2, X3, X1
According to the ANOVA table, it can be seen that F value (6.348) is
statistically significant as the corresponding P value (0.008) is less than
0.05. Therefore, it can be concluded with 95% confidence that the fitted
model is statistically significant.
Coefficientsa
Model          Unstandardized Coefficients   Standardized Coefficients   t        Sig.   Collinearity Statistics
               B        Std. Error           Beta                                        Tolerance   VIF
1  (Constant)  26.743   8.795                                            3.041    .012
   X1          .418     .115                 .937                        3.635    .005   .425        2.350
   X2          -.199    .094                 -.658                       -2.122   .060   .294        3.404
   X3          .084     .159                 .120                        .529     .608   .554        1.805
   X4          -.395    .097                 -.700                       -4.051   .002   .946        1.057
a. Dependent Variable: Y
According to the above ‘Coefficients’ table, all the VIF values of the
variables are less than 5, which indicates the absence of a
multicollinearity problem. Further, the Sig. column indicates that the
coefficients of X1 and X4 are statistically significant, as their p
values are less than 0.05, while the coefficients of X2 and X3 are not
significant, as their p values are greater than 0.05. The fitted model
can be written as:

ŷ = 26.743 + 0.418X1 - 0.199X2 + 0.084X3 - 0.395X4
As X2 and X3 are not significant, it’s better to carry out the stepwise
regression method in order to find out the best fitted model.
Step 2: In the ‘Linear Regression’ dialogue box, select ‘Y’ as the
dependent variable and ‘X1, X2, X3 & X4’ as the independent variables,
set Method to ‘Stepwise’, and press the Options button.
Step 4: Generated SPSS output is given as follows.
Coefficientsa
Model          Unstandardized Coefficients   Standardized Coefficients   t        Sig.
               B        Std. Error           Beta
1  (Constant)  30.684   2.343                                            13.095   .000
   X4          -.325    .128                 -.576                       -2.542   .025
2  (Constant)  24.652   3.092                                            7.973    .000
   X4          -.379    .110                 -.671                       -3.458   .005
   X1          .219     .087                 .491                        2.530    .026
3  (Constant)  30.920   3.756                                            8.233    .000
   X4          -.389    .094                 -.689                       -4.154   .002
   X1          .403     .108                 .903                        3.741    .003
   X2          -.169    .072                 -.559                       -2.344   .039
a. Dependent Variable: Y
Interpretation for;
In MINITAB,
Step 3: In the ‘Methods’ dialogue box, change Alpha to enter as ‘0.05’ and
Alpha to remove as ‘0.06’ and press ok button to proceed.
Similar results were obtained for the stepwise regression method
carried out using the MINITAB software. The coefficient values of the
best fitted model are those given for the third model above. Therefore,
the best fitted model can be written as:

ŷ = 30.920 + 0.403X1 - 0.169X2 - 0.389X4
Interpretations for;
Test for normality of errors
Probability Plot of RESI1 (Normal)
According to the graph below, it is clear that the data points are
scattered randomly. This confirms that the errors have a constant variance.
Residuals Versus the Fitted Values
(response is Y)
The plot of residuals vs. fitted values indicates that the data points
are scattered randomly, so it can be concluded that the errors have a
constant variance.
CHAPTER THREE: POLYNOMIAL REGRESSION
3.1 Introduction
y = β₀ + β₁x + β₂x² + ε (second order model)
Price Sales
65 132
65 141
65 153
65 158
65 164
85 172
85 94
85 100
85 110
105 118
105 127
105 76
105 85
105 102
In MINITAB,
The R square of the above table indicates that 46.9 % of the observed
variability has been captured by the fitted model.
Step 3: Get the fitted line plot for sales and price.
Step 4: Generated output of MINITAB is given below.
Fitted Line Plot of Sales vs Price
Price Sales Price^2
65 132 4225
65 141 4225
65 153 4225
65 158 4225
65 164 4225
85 172 7225
85 94 7225
85 100 7225
85 110 7225
105 118 11025
105 127 11025
105 76 11025
105 85 11025
105 102 11025
The R square of the above table indicates that 47.9 % of the observed
variability has been captured by the fitted model. Here the percentage
representing the variability of the model has increased by 1.0%.
3.2.2 Test the significance of the parameters
Step 7: Get the fitted line plot for Sales, Price and Price².
Fitted Line Plot
Sales = 340.2 - 4.005 Price + 0.01650 Price²
S = 24.1104, R-Sq = 47.9%, R-Sq(adj) = 38.5%
Once the model is developed, the assumptions on the errors need to be
tested in order to validate the model for prediction or forecasting.
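The fitted quadratic above can be reproduced without MINITAB. Because only three distinct Price values occur and the model has three parameters, the least-squares quadratic must pass exactly through the three group means of Sales, so the coefficients follow from simple elimination (a Python sketch):

```python
# Reproducing the quadratic fit Sales = a + b*Price + c*Price^2.
# With three distinct Price values and three parameters, the least-squares
# quadratic interpolates the three group means of Sales.
sales = {65: [132, 141, 153, 158, 164],
         85: [172, 94, 100, 110],
         105: [118, 127, 76, 85, 102]}
m = {p: sum(v) / len(v) for p, v in sales.items()}

p1, p2, p3 = 65, 85, 105
d1 = m[p2] - m[p1]          # = b*(p2 - p1) + c*(p2^2 - p1^2)
d2 = m[p3] - m[p2]          # = b*(p3 - p2) + c*(p3^2 - p2^2)
# Subtracting eliminates b because the prices are equally spaced (p2-p1 = p3-p2).
c = (d2 - d1) / ((p3**2 - p2**2) - (p2**2 - p1**2))
b = (d1 - c * (p2**2 - p1**2)) / (p2 - p1)
a = m[p1] - b * p1 - c * p1**2

print(round(a, 1), round(b, 3), round(c, 4))  # 340.2 -4.005 0.0165
```

The coefficients agree with the MINITAB fitted line plot shown above.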
Test for normality of errors
Probability Plot of RESI1 (Normal)
According to the graph below, it is clear that the data points are
scattered randomly. This confirms that the errors have a constant variance.
Residuals Versus the Fitted Values
(response is Sales)
CHAPTER FOUR: NON LINEAR REGRESSION
4.1 Introduction
Regression models that are not linear in the parameters are called
nonlinear. Nonlinear models can be grouped into two types, namely
intrinsically linear and intrinsically nonlinear.
Intrinsically linear models can be transformed into linear form;
examples include:
(i) the exponential model y = αe^(βt), which becomes ln y = ln α + βt
after taking logarithms
(ii)
Example 4.1: Develop a model for the data given below. Carry out
diagnostic tests to confirm the validity of the model.
t y
1 355
2 211
3 197
4 166
5 142
6 106
7 104
8 60
9 56
10 38
11 36
12 32
13 21
14 19
15 15
Correlations: t, y
The results indicate that the correlation coefficient is close to -1.0,
which confirms that there is a strong negative linear relationship
between these two variables.
The R square of the above table indicates that 82.3 % of the observed
variability has been captured by the fitted model. Further, in order to
find the relationship, regression analysis was carried out.
Plot of y versus t
Randomness assumption
As the Durbin-Watson statistic is not close to 2.0, it can be concluded
that the errors are non-random.
As shown in the graph below, the residuals have been plotted against the
fitted values to test whether the errors have a constant variance.
Residuals Versus the Fitted Values
(response is y)
The above plot indicates that the data points are not scattered
randomly, so it can be concluded that the errors do not have a
constant variance.
Normality assumption
Probability Plot of RESI1 (Normal)
Mean = -6.63173E-14, StDev = 40.31, N = 15, AD = 0.870, P-Value = 0.019
As indicated in the above graph, the Anderson-Darling test is used to
test the null hypothesis H0: the errors are normally distributed. The
value of the test statistic (AD = 0.870) is significant (P = 0.019 <
0.05). Thus H0 is rejected, and it can be claimed that the errors are
not normally distributed.
Data set with LOG y (natural logarithm of y)
t y LOG y
1 355 5.87212
2 211 5.35186
3 197 5.2832
4 166 5.11199
5 142 4.95583
6 106 4.66344
7 104 4.64439
8 60 4.09434
9 56 4.02535
10 38 3.63759
11 36 3.58352
12 32 3.46574
13 21 3.04452
14 19 2.94444
15 15 2.70805
Correlations: t, LOG y
4.2.1 Test the significance of the model
The ANOVA table is used to test the null hypothesis H0: the model is
not significant. The above result indicates that the F value (622.34)
is significant, as the corresponding P value (0.000) is less than 0.05.
Therefore, it can be concluded with 95% confidence that the fitted
model is significant.
The above results indicate that the P values for the parameters are
less than 0.05. Therefore, we can conclude that the parameters are
significantly different from zero, and the fitted model is obtained by
regressing LOG y on t.
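The log-linear fit can be reproduced with ordinary least squares on (t, LOG y). The sketch below uses all 15 observations, so its coefficients are indicative only; they may differ slightly from the book's final model if unusual observations were excluded there:

```python
from math import log, exp

t = list(range(1, 16))
y = [355, 211, 197, 166, 142, 106, 104, 60, 56, 38, 36, 32, 21, 19, 15]
ly = [log(v) for v in y]        # natural log, matching the LOG y column above

n = len(t)
mt, ml = sum(t) / n, sum(ly) / n
stt = sum((ti - mt) ** 2 for ti in t)
sty = sum((ti - mt) * (li - ml) for ti, li in zip(t, ly))
b = sty / stt                   # slope of ln y on t (negative: y is decaying)
a = ml - b * mt                 # intercept
r = sty / (stt * sum((li - ml) ** 2 for li in ly)) ** 0.5

print(round(a, 3), round(b, 3), round(r, 3))   # ln y ≈ a + b*t, r close to -1
# Back-transforming gives the exponential form y = exp(a) * exp(b * t).
print(round(exp(a), 1))
```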
The line of best fit corresponding to the model equation is indicated
below.
Fitted Line Plot of LOG y vs t
Randomness assumption
Constant variance assumption
As shown in the graph below, the residuals have been plotted against the
fitted values to test whether the errors have a constant variance.
Residuals Versus the Fitted Values
(response is LOG y)
The plot of residuals vs. fitted values indicates that the data points
are scattered randomly, so it can be concluded that the errors have a
constant variance.
Normality assumption
Probability Plot of RESI2 (Normal)
Mean = -1.16146E-15, StDev = 0.1139, N = 13, AD = 0.162, P-Value = 0.928
As indicated in the above graph, the Anderson-Darling test is used to
test the null hypothesis H0: the errors are normally distributed. The
value of the test statistic (AD = 0.162) is not significant (P = 0.928 >
0.05). Thus H0 is not rejected, and it can be claimed that the errors
are normally distributed.
REFERENCES