Data Analysis and Statistical Packages 1

Q1. Perform the following tasks in SPSS.
a) Use catalog.sav sample data file to fit a multiple linear model to predict "Sales of
Men's Clothing" on the basis of variables "Number of Catalogs Mailed", "Number
of Pages in Catalog", "Number of Phone Lines Open for Ordering", "Amount
Spent on Print Advertising" and "Number of Customer Service Representatives".
Use forward selection, backward elimination and enter methods in this respect.
Interpret your result in each case.
Variables Entered/Removed (a)
Model  Variables Entered  Variables Removed  Method
1      mail               .                  Forward (Criterion: Probability-of-F-to-enter <= .050)
2      phone              .                  Forward (Criterion: Probability-of-F-to-enter <= .050)
3      print              .                  Forward (Criterion: Probability-of-F-to-enter <= .050)
4      page               .                  Forward (Criterion: Probability-of-F-to-enter <= .050)
a. Dependent Variable: men
Model Summary (e)
Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .803a  .645      .642               3785.49685
2      .877b  .770      .766               3061.36064
3      .885c  .784      .778               2980.12178
4      .891d  .794      .787               2919.90929
a. Predictors: (Constant), mail
b. Predictors: (Constant), mail, phone
c. Predictors: (Constant), mail, phone, print
d. Predictors: (Constant), mail, phone, print, page
e. Dependent Variable: men
Name: M. Hashim Javed Roll: BT-588221

ANOVA (a)
Model          Sum of Squares  df   Mean Square     F        Sig.
1  Regression  3069712621.002  1    3069712621.002  214.216  .000b
   Residual    1690938397.841  118  14329986.422
   Total       4760651018.843  119
2  Regression  3664135333.036  2    1832067666.518  195.485  .000c
   Residual    1096515685.807  117  9371928.939
   Total       4760651018.843  119
3  Regression  3730440424.702  3    1243480141.567  140.014  .000d
   Residual    1030210594.141  116  8881125.812
   Total       4760651018.843  119
4  Regression  3780175938.309  4    945043984.577   110.844  .000e
   Residual    980475080.535   115  8525870.266
   Total       4760651018.843  119
a. Dependent Variable: men
b. Predictors: (Constant), mail
c. Predictors: (Constant), mail, phone
d. Predictors: (Constant), mail, phone, print
e. Predictors: (Constant), mail, phone, print, page
Coefficients (a)
               Unstandardized Coefficients  Standardized Coefficients
Model          B           Std. Error       Beta                       t       Sig.
1  (Constant)  -14064.614  2099.365                                    -6.699  .000
   mail        2.991       .204             .803                       14.636  .000
2  (Constant)  -15361.047  1705.559                                    -9.006  .000
   mail        1.971       .209             .529                       9.424   .000
   phone       334.103     41.952           .447                       7.964   .000
3  (Constant)  -20665.869  2554.586                                    -8.090  .000
   mail        1.862       .207             .500                       8.977   .000
   phone       339.159     40.880           .454                       8.296   .000
   print       .218        .080             .121                       2.732   .007
4  (Constant)  -23898.558  2838.361                                    -8.420  .000
   mail        1.847       .203             .496                       9.083   .000
   phone       327.802     40.329           .439                       8.128   .000
   print       .208        .078             .115                       2.656   .009
   page        50.508      20.912           .104                       2.415   .017
a. Dependent Variable: men

Excluded Variables (a)
Model       Beta In  t       Sig.  Partial Correlation  Collinearity Statistics (Tolerance)
1  page     .149b    2.773   .006  .248                 .980
   phone    .447b    7.964   .000  .593                 .625
   print    .104b    1.877   .063  .171                 .957
   service  .153b    1.997   .048  .182                 .501
2  page     .110c    2.496   .014  .226                 .968
   print    .121c    2.732   .007  .246                 .955
   service  -.064c   -.933   .353  -.086                .416
3  page     .104d    2.415   .017  .220                 .965
   service  -.079d   -1.183  .239  -.110                .413
4  service  -.072e   -1.096  .275  -.102                .412
a. Dependent Variable: men
b. Predictors in the Model: (Constant), mail
c. Predictors in the Model: (Constant), mail, phone
d. Predictors in the Model: (Constant), mail, phone, print
e. Predictors in the Model: (Constant), mail, phone, print, page
Residuals Statistics (a)
                      Minimum      Maximum     Mean        Std. Deviation  N
Predicted Value       2636.2012    34103.5547  16242.8134  5636.14978      120
Residual              -8822.03613  9087.41895  .00000      2870.41572      120
Std. Predicted Value  -2.414       3.169       .000        1.000           120
Std. Residual         -3.021       3.112       .000        .983            120
a. Dependent Variable: men
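
The fit statistics of the final forward-selection model (Model 4: mail, phone, print, page) can be cross-checked from the ANOVA sums of squares. The sketch below is only an arithmetic check in Python, not SPSS syntax; the variable names are chosen here for illustration, and the case count of 120 is taken from the total df of 119. The result says that Model 4 explains about 79.4% of the variation in men's clothing sales, while `service` stays excluded because its Sig. (.275) is above the .050 entry criterion.

```python
# Values copied from the Model 4 rows of the ANOVA table above.
ss_regression = 3780175938.309
ss_residual = 980475080.535
ss_total = 4760651018.843
n_cases = 120        # total df (119) + 1
n_predictors = 4     # mail, phone, print, page

df_residual = n_cases - n_predictors - 1

# R Square is the share of total variation explained by the regression.
r_square = ss_regression / ss_total
# Adjusted R Square penalises the number of predictors.
adj_r_square = 1 - (1 - r_square) * (n_cases - 1) / df_residual
# F is the ratio of the regression and residual mean squares.
f_stat = (ss_regression / n_predictors) / (ss_residual / df_residual)
# Std. Error of the Estimate is the square root of the residual mean square.
std_error = (ss_residual / df_residual) ** 0.5

print(f"R Square = {r_square:.3f}")
print(f"Adjusted R Square = {adj_r_square:.3f}")
print(f"F = {f_stat:.3f}")
print(f"Std. Error of the Estimate = {std_error:.3f}")
```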

b) Use bankloan.sav sample data file to fit a binary logistic regression model to predict
default on the basis of variables age, ed, income, debtinc, creddebt and othdebt.
Interpret your result.
Dependent Variable Encoding
Original Value Internal Value
No 0
Yes 1
Block 0: Beginning Block
Classification Table (a,b)
                             Predicted
                             default     Percentage
Observed                     No    Yes   Correct
Step 0  default  No          517   0     100.0
                 Yes         183   0     .0
        Overall Percentage               73.9
a. Constant is included in the model.
b. The cut value is .500
Variables in the Equation
                  B       S.E.  Wald     df  Sig.  Exp(B)
Step 0  Constant  -1.039  .086  145.782  1   .000  .354
Variables not in the Equation
                             Score    df  Sig.
Step 0  Variables  age       13.265   1   .000
                   ed        9.205    1   .002
                   income    3.526    1   .060
                   debtinc   106.238  1   .000
                   creddebt  41.928   1   .000
                   othdebt   14.863   1   .000
        Overall Statistics   148.310  6   .000

Block 1: Method = Enter
Omnibus Tests of Model Coefficients
               Chi-square  df  Sig.
Step 1  Step   153.662     6   .000
        Block  153.662     6   .000
        Model  153.662     6   .000
Model Summary
Step  -2 Log likelihood  Cox & Snell R Square  Nagelkerke R Square
1     650.702a           .197                  .289
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.
Classification Table (a)
                             Predicted
                             default     Percentage
Observed                     No    Yes   Correct
Step 1  default  No          483   34    93.4
                 Yes         122   61    33.3
        Overall Percentage               77.7
a. The cut value is .500
Variables in the Equation
                   B       S.E.  Wald    df  Sig.  Exp(B)
Step 1a  age       -.047   .014  10.632  1   .001  .954
         ed        .392    .105  13.802  1   .000  1.480
         income    -.013   .007  3.077   1   .079  .987
         debtinc   .111    .027  16.986  1   .000  1.117
         creddebt  .341    .088  15.186  1   .000  1.407
         othdebt   -.069   .062  1.221   1   .269  .933
         Constant  -1.198  .563  4.523   1   .033  .302
a. Variable(s) entered on step 1: age, ed, income, debtinc, creddebt, othdebt.
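
As an aid to interpreting the logistic regression output, the classification percentages and the Exp(B) column can be recomputed from the Step 1 tables. The following is a rough Python sketch, not SPSS syntax, and the variable names are illustrative. Because the B values are rounded to three decimals, the recomputed odds ratios can differ from SPSS in the last digit. Read this way, each one-unit increase in `ed` multiplies the odds of default by about 1.48, `age` lowers the odds, and `income` and `othdebt` are not significant at the 0.05 level (Sig. .079 and .269).

```python
import math

# Step 1 classification table (cut value .500), copied from the output above.
true_neg, false_pos = 483, 34    # observed default = No
false_neg, true_pos = 122, 61    # observed default = Yes

total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_neg + true_pos) / total           # overall percentage correct
specificity = true_neg / (true_neg + false_pos)    # non-defaulters classified correctly
sensitivity = true_pos / (false_neg + true_pos)    # defaulters classified correctly

# B coefficients from "Variables in the Equation" (Step 1).
coefficients = {
    "age": -0.047, "ed": 0.392, "income": -0.013,
    "debtinc": 0.111, "creddebt": 0.341, "othdebt": -0.069,
}
# Exp(B) is the multiplicative change in the odds of default per unit increase.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

print(f"Overall correct: {100 * accuracy:.1f}%")
print(f"Specificity: {100 * specificity:.1f}%  Sensitivity: {100 * sensitivity:.1f}%")
for name, ratio in odds_ratios.items():
    print(f"{name:9s} Exp(B) = {ratio:.3f}")
```

The low sensitivity (about one defaulter in three is identified) shows that the overall percentage alone overstates how well the model detects defaults at the .500 cut value.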

c) Find frequency distribution of "Preferred breakfast" for those senior citizens who
are also living an active life (Use sample data set cereal.sav)
Ans:
The first step is to select only those cases for which the lifestyle is recorded as active (active = 1).
bfast * agecat Crosstabulation
Count
                         agecat
bfast          Under 31  31-45  46-60  Over 60  Total
Breakfast Bar  58        60     23     12       153
Oatmeal        2         12     31     57       102
Cereal         51        45     38     17       151
Total          111       117    92     86       406
The frequency distribution for senior citizens who are living an active life is the "Over 60" column (highlighted in yellow in the original output).
Of the 86 active senior citizens, most prefer Oatmeal (57), followed by Cereal (17) and Breakfast Bar (12).
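
The counts for active senior citizens can also be turned into a percentage distribution. This is a small illustrative Python sketch that assumes "senior citizens" corresponds to the "Over 60" age category; the dictionary name is made up for the example.

```python
# "Over 60" counts among active respondents, copied from the crosstabulation above.
active_seniors = {"Breakfast Bar": 12, "Oatmeal": 57, "Cereal": 17}

total = sum(active_seniors.values())
percentages = {name: 100 * count / total for name, count in active_seniors.items()}
most_preferred = max(active_seniors, key=active_seniors.get)

for name, count in active_seniors.items():
    print(f"{name:14s} {count:3d}  {percentages[name]:5.1f}%")
print(f"Total          {total:3d}")
print(f"Most preferred: {most_preferred}")
```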

d) Carry out a test of independence of attributes "Preferred breakfast" and "marital
status". (Use cereal.sav)
Case Processing Summary
                 Cases
                 Valid          Missing       Total
                 N    Percent   N   Percent   N    Percent
bfast * marital  880  100.0%    0   0.0%      880  100.0%
bfast * marital Crosstabulation
Count
               marital
bfast          Unmarried  Married  Total
Breakfast Bar  108        123      231
Oatmeal        95         215      310
Cereal         100        239      339
Total          303        577      880
Chi-Square Tests
                              Value    df  Asymp. Sig. (2-sided)
Pearson Chi-Square            21.157a  2   .000
Likelihood Ratio              20.623   2   .000
Linear-by-Linear Association  16.226   1   .000
N of Valid Cases              880
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 79.54.
Conclusion:
Since the p-value (reported as .000, i.e. p < 0.001) is less than the chosen significance level
(α = 0.05), we reject the null hypothesis of independence and conclude that there is an association
between Preferred Breakfast and Marital Status.

e) Carry out a test of independence of attributes "Preferred breakfast" and "Age
category". (Use cereal.sav)
Case Processing Summary
                Cases
                Valid          Missing       Total
                N    Percent   N   Percent   N    Percent
bfast * agecat  880  100.0%    0   0.0%      880  100.0%
bfast * agecat Crosstabulation
Count
                         agecat
bfast          Under 31  31-45  46-60  Over 60  Total
Breakfast Bar  84        90     39     18       231
Oatmeal        4         24     97     185      310
Cereal         93        92     95     59       339
Total          181       206    231    262      880
Chi-Square Tests
                              Value     df  Asymp. Sig. (2-sided)
Pearson Chi-Square            309.336a  6   .000
Likelihood Ratio              350.688   6   .000
Linear-by-Linear Association  4.986     1   .026
N of Valid Cases              880
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 47.51.
Conclusion:
Since the p-value (reported as .000, i.e. p < 0.001) is less than the chosen significance level
(α = 0.05), we reject the null hypothesis of independence and conclude that there is an association
between Preferred Breakfast and Age Category.
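
The Pearson chi-square values in parts d) and e) can be reproduced from the two crosstabulations. The sketch below is a plain-Python illustration (no SPSS involved) with hypothetical function names. The p-value helper uses a closed form that holds only for an even number of degrees of freedom, which happens to cover both tests here (df = 2 and df = 6).

```python
import math

def chi_square_independence(table):
    """Return (Pearson chi-square, degrees of freedom, minimum expected count)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / N.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            min_expected = min(min_expected, expected)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof, min_expected

def chi2_upper_tail_even_dof(x, dof):
    """Upper-tail chi-square probability; this closed form is valid only for even dof."""
    if dof % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))

# Rows: Breakfast Bar, Oatmeal, Cereal (counts copied from the output above).
marital_table = [[108, 123], [95, 215], [100, 239]]                        # Unmarried, Married
agecat_table = [[84, 90, 39, 18], [4, 24, 97, 185], [93, 92, 95, 59]]     # four age categories

for label, table in (("marital", marital_table), ("agecat", agecat_table)):
    chi2, dof, min_expected = chi_square_independence(table)
    p_value = chi2_upper_tail_even_dof(chi2, dof)
    print(f"bfast * {label}: chi-square = {chi2:.3f}, df = {dof}, "
          f"p = {p_value:.2e}, min expected = {min_expected:.2f}")
```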

f) Use grocery_coupons.sav sample data file to test that the mean of amount spent is
equal to 105. Find 90% confidence interval for the mean of amount spent. Also test
that the mean of amount spent by both male and female customers is equal. What
would you say about the equality of means for amount spent on stores of different
sizes?
Ans:
Part 1: Use grocery_coupons.sav sample data file to test that the mean of amount spent is
equal to 105. Find 90% confidence interval for the mean of amount spent.
T-Test
One-Sample Statistics
          N     Mean     Std. Deviation  Std. Error Mean
amtspent  1404  99.9338  48.54435        1.29555
One-Sample Test
Test Value = 105
                                                          90% Confidence Interval of the Difference
          t       df    Sig. (2-tailed)  Mean Difference  Lower    Upper
amtspent  -3.910  1403  .000             -5.06621         -7.1986  -2.9338
Conclusion:
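Since the p-value (reported as .000, i.e. p < 0.001) is less than the chosen significance level
(α = 0.10), we reject the null hypothesis and conclude that the mean of Amount Spent is significantly
different from 105. The interval reported by SPSS, [-7.1986, -2.9338], is the 90% confidence interval
for the difference between the mean and the test value; adding 105 gives the 90% confidence interval
for the mean of Amount Spent, [97.8014, 102.0662].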
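
The one-sample test can be checked from the summary statistics alone. This is a minimal Python sketch, not SPSS output. It assumes the normal quantile is an adequate stand-in for the t quantile with 1403 degrees of freedom, so the interval limits agree with SPSS only to about two decimal places.

```python
from statistics import NormalDist

# Summary statistics copied from the One-Sample Statistics table above.
n, sample_mean, sample_sd = 1404, 99.9338, 48.54435
test_value = 105

std_error = sample_sd / n ** 0.5
t_stat = (sample_mean - test_value) / std_error

# Assumption: with df = 1403 the t quantile is very close to the normal quantile.
z_crit = NormalDist().inv_cdf(0.95)      # two-sided 90% interval
margin = z_crit * std_error

ci_mean = (sample_mean - margin, sample_mean + margin)
# SPSS reports the interval for (mean - test value), i.e. the same limits shifted by 105.
ci_difference = (ci_mean[0] - test_value, ci_mean[1] - test_value)

print(f"t = {t_stat:.3f}")
print(f"90% CI for the mean: ({ci_mean[0]:.2f}, {ci_mean[1]:.2f})")
print(f"90% CI for the difference: ({ci_difference[0]:.2f}, {ci_difference[1]:.2f})")
```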
Part 2: Also test that the mean of amount spent by both male and female customers is
equal.
T-TEST GROUPS=gender(0 1)
/MISSING=ANALYSIS
/VARIABLES=amtspent
/CRITERIA=CI(.95).
T-Test
Group Statistics
          gender  N    Mean      Std. Deviation  Std. Error Mean
amtspent  Male    740  107.5761  49.09908        1.80492
          Female  664  91.4168   46.49620        1.80440
Independent Samples Test (amtspent)
Levene's Test for Equality of Variances: F = .458, Sig. = .499
t-test for Equality of Means
                                                                                                      95% Confidence Interval of the Difference
                             t      df        Sig. (2-tailed)  Mean Difference  Std. Error Difference  Lower     Upper
Equal variances assumed      6.313  1402      .000             16.15930         2.55971                11.13803  21.18058
Equal variances not assumed  6.332  1397.923  .000             16.15930         2.55218                11.15280  21.16581
Conclusion:
We use the top row (Equal variances assumed) because the p-value of Levene's test (.499) is above 0.05.
Since the p-value of the t-test (reported as .000, i.e. p < 0.001) is less than the chosen significance
level (α = 0.05), we conclude that the mean of Amount Spent is significantly different for male and
female customers. The 95% confidence interval for the difference in means (male minus female) is
[11.138, 21.181].
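
Both rows of the Independent Samples Test can be recomputed from the group statistics. The Python sketch below is illustrative only: the variable names are made up, and the confidence interval uses a normal quantile as an approximation to the t quantile with 1402 degrees of freedom, so its limits differ slightly from the SPSS values.

```python
from statistics import NormalDist

# Group statistics copied from the output above.
n_male, mean_male, sd_male = 740, 107.5761, 49.09908
n_female, mean_female, sd_female = 664, 91.4168, 46.49620

mean_diff = mean_male - mean_female

# Equal variances assumed: pool the two sample variances.
pooled_var = ((n_male - 1) * sd_male ** 2 + (n_female - 1) * sd_female ** 2) / (n_male + n_female - 2)
se_pooled = (pooled_var * (1 / n_male + 1 / n_female)) ** 0.5
t_pooled = mean_diff / se_pooled

# Equal variances not assumed (Welch): each group keeps its own variance.
se_welch = (sd_male ** 2 / n_male + sd_female ** 2 / n_female) ** 0.5
t_welch = mean_diff / se_welch

# Assumption: normal quantile in place of the t quantile (df = 1402).
z_crit = NormalDist().inv_cdf(0.975)
ci_diff = (mean_diff - z_crit * se_pooled, mean_diff + z_crit * se_pooled)

print(f"Mean difference = {mean_diff:.4f}")
print(f"Pooled: SE = {se_pooled:.5f}, t = {t_pooled:.3f}")
print(f"Welch:  SE = {se_welch:.5f}, t = {t_welch:.3f}")
print(f"Approximate 95% CI for the difference: ({ci_diff[0]:.2f}, {ci_diff[1]:.2f})")
```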
Part 3: What would you say about the equality of means for amount spent on stores of
different sizes?
ONEWAY amtspent BY size
  /MISSING ANALYSIS.
Oneway
ANOVA
amtspent
                Sum of Squares  df    Mean Square  F      Sig.
Between Groups  20053.727       2     10026.864    4.275  .014
Within Groups   3286191.636     1401  2345.604
Total           3306245.363     1403
Conclusion:
Since the p-value (0.014) is less than the chosen significance level (α = 0.05), we conclude that
the mean of Amount Spent is not the same for stores of all sizes; at least one store size has a mean
that differs significantly from the others.

Question 3 is a group assignment (Group size not more than 5 students).
a) Create a Questionnaire having 10-15 questions to collect data from 30
respondents. Topic may be of your choice. Submit the soft copy of
questionnaire through class representative not later than 20th October 2019.
b) Enter your data in SPSS. Naming of variables and data types should be
appropriate.
c) Carry out data analysis of this data set.
d) Submit report of the statistical analysis of this data in MS Word.

Q4. Perform the following tasks in Minitab.
a) In order to ascertain the age distribution of operatives in a certain industry, random
samples of 1720 males and 1230 females are drawn. The sample means and standard
deviations were 33.93 years and 14.20 years for the males and 27.44 years and 10.79
years for the females. Calculate the 95 percent confidence interval for
i. The mean age of all the male operatives.
i. Ans:
Variable  N     Mean    StDev   SE Mean  95% CI
C1        1720  34.173  14.047  0.339    (33.509, 34.837)
ii. The differences between their mean ages.

ii. Ans:
Estimate for difference: 6.204
95% CI for difference: (5.305, 7.104)
T-Test of difference = 0 (vs not =): T-Value = 13.52 P-Value = 0.000 DF = 2930
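
The Minitab output above reports a sample mean of 34.173 and a standard deviation of 14.047 for column C1, which differ from the summary values stated in the question (33.93 and 14.20), so it was presumably produced from generated data rather than from the stated statistics. As a hedged alternative, the two intervals can be computed directly from the stated summary values. The Python sketch below assumes large-sample normal intervals and unpooled variances for the difference; the names are illustrative.

```python
from statistics import NormalDist

# Summary statistics as stated in the question.
n_male, mean_male, sd_male = 1720, 33.93, 14.20
n_female, mean_female, sd_female = 1230, 27.44, 10.79

# Assumption: samples this large justify the normal quantile.
z_crit = NormalDist().inv_cdf(0.975)

# i. 95% CI for the mean age of male operatives.
se_male = sd_male / n_male ** 0.5
ci_male = (mean_male - z_crit * se_male, mean_male + z_crit * se_male)

# ii. 95% CI for the difference between the mean ages (male minus female).
mean_diff = mean_male - mean_female
se_diff = (sd_male ** 2 / n_male + sd_female ** 2 / n_female) ** 0.5
ci_diff = (mean_diff - z_crit * se_diff, mean_diff + z_crit * se_diff)

print(f"95% CI for male mean age: ({ci_male[0]:.2f}, {ci_male[1]:.2f})")
print(f"95% CI for the difference: ({ci_diff[0]:.2f}, {ci_diff[1]:.2f})")
```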
b) A psychology class performed an experiment to compare whether a recall score in which
instructions to form images of 25 words were given is better than an initial recall score
for which no imagery instructions were given. Twelve students participated in the
experiment with the following results:
With Imagery     20 24 20 18 22 19 20 19 17 21 17 20
Without Imagery   5  9  5  9  6 11  8 11  7  9  8 16
Does it appear that the average recall score is higher when imagery is used? Also
construct a 95% confidence interval for the difference between the means of the two
conditions and interpret the results.
Ans:
Paired T-Test and CI: With Imagery, Without Imagery
Paired T for With Imagery - Without Imagery
                 N   Mean    StDev  SE Mean
With Imagery     12  19.750  2.006  0.579
Without Imagery  12  8.667   3.055  0.882
Difference       12  11.08   3.70   1.07
95% CI for mean difference: (8.73, 13.44)
T-Test of mean difference = 0 (vs not = 0): T-Value = 10.37 P-Value = 0.000

Conclusion:
The mean recall score is 19.75 with imagery and 8.667 without imagery, so the score appears
clearly higher when imagery is used. The paired t-test confirms this: since the p-value (reported as
0.000, i.e. p < 0.001) for the mean difference is less than the chosen significance level (α = 0.05),
the mean difference between Recall Score with Imagery and Recall Score without Imagery is
significant. The 95% confidence interval for the mean difference of Recall Score is [8.73, 13.44];
because the whole interval lies above zero, recall is higher with imagery.
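
The paired t-test can be verified by hand from the twelve pairs of scores. The following Python sketch is only a cross-check of the Minitab figures. The critical value 2.201 (two-sided 95%, 11 degrees of freedom) is taken from a standard t table and hard-coded as an assumption rather than computed.

```python
from statistics import mean, stdev

with_imagery = [20, 24, 20, 18, 22, 19, 20, 19, 17, 21, 17, 20]
without_imagery = [5, 9, 5, 9, 6, 11, 8, 11, 7, 9, 8, 16]

# A paired test works on the within-student differences.
differences = [w - wo for w, wo in zip(with_imagery, without_imagery)]
n = len(differences)

mean_diff = mean(differences)
se_diff = stdev(differences) / n ** 0.5     # sample SD of the differences / sqrt(n)
t_stat = mean_diff / se_diff

T_CRIT_DF11 = 2.201   # assumed table value: t(0.975, df = 11)
ci = (mean_diff - T_CRIT_DF11 * se_diff, mean_diff + T_CRIT_DF11 * se_diff)

print(f"Mean difference = {mean_diff:.2f}, SE = {se_diff:.2f}")
print(f"t = {t_stat:.2f} with {n - 1} df")
print(f"95% CI for the mean difference: ({ci[0]:.2f}, {ci[1]:.2f})")
```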
c) Generate 4 samples of sizes 5, 6, 7 and 7 from normal populations with means 45, 40, 47
and 38 respectively. While the standard deviations of these distributions are 4, 6, 7 and 8
respectively. Test the equality of means.
Ans:
One-way ANOVA: Population versus Factor
Source  DF  SS      MS     F     P
Factor  3   595.3   198.4  4.48  0.014
Error   21  929.3   44.3
Total   24  1524.6
S = 6.652   R-Sq = 39.05%   R-Sq(adj) = 30.34%
Level  N  Mean    StDev
1      5  47.891  3.939
2      6  36.004  6.483
3      7  47.847  9.382
4      7  41.373  4.636
[Character plot: individual 95% CIs for each level mean, based on pooled StDev; axis from 36.0 to 54.0]
Pooled StDev = 6.652
Conclusion:
The F-statistic is 4.48 and, since the p-value (0.014) is less than the chosen significance level
(α = 0.05), we conclude that not all of the means are equal; at least one of the means is
significantly different.

Q5.
a) Explain the procedure for testing of equality of several means in Minitab and SPSS.
b) Use Minitab/SPSS to test equality of means for the following experiment of wheat yield
for different varieties. Varieties are shown by A, B, C, D and E.
A (8)    B (5.3)  C (4.1)  D (5)    E (16)
D (6.8)  A (4.9)  B (4.1)  C (3.2)  E (18)
B (6.3)  E (16)   C (4.7)  D (4.0)  A (5.0)
C (5.7)  D (3.3)  E (25)   A (4.0)  B (4.2)
E (18)   C (4.7)  A (4.2)  D (6.6)  B (6.2)
Ans:
One-way ANOVA: resp versus fac
Source  DF  SS      MS      F      P
fac     4   740.14  185.03  44.59  0.000
Error   20  83.00   4.15
Total   24  823.13
S = 2.037   R-Sq = 89.92%   R-Sq(adj) = 87.90%
Level  N  Mean    StDev
1      5  5.220   1.613
2      5  5.220   1.052
3      5  4.480   0.918
4      5  5.140   1.549
5      5  18.600  3.715
[Character plot: individual 95% CIs for each level mean, based on pooled StDev; axis from 5.0 to 20.0]
Pooled StDev = 2.037
Conclusion:
The F-statistic is 44.59 and, since the p-value (reported as 0.000, i.e. p < 0.001) is less than the
chosen significance level (α = 0.05), we conclude that not all of the means are equal; at least one of
the means is significantly different.
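
The one-way ANOVA table for the wheat yields can be rebuilt from the raw data by grouping the 25 plot values by variety (levels 1 to 5 in the Minitab output correspond to A to E). This Python sketch is an illustration of the sums-of-squares arithmetic, not a replacement for Minitab or SPSS. The critical value 2.87 for F(4, 20) at α = 0.05 is an assumed table value, hard-coded rather than computed.

```python
from statistics import mean

# Yields grouped by variety, read off the layout in part b).
yields = {
    "A": [8, 4.9, 5.0, 4.0, 4.2],
    "B": [5.3, 4.1, 6.3, 4.2, 6.2],
    "C": [4.1, 3.2, 4.7, 5.7, 4.7],
    "D": [5, 6.8, 4.0, 3.3, 6.6],
    "E": [16, 18, 16, 25, 18],
}

all_values = [y for group in yields.values() for y in group]
grand_mean = mean(all_values)

# Between-groups SS: group size times squared distance of group mean from grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in yields.values())
# Within-groups (error) SS: squared distances of observations from their own group mean.
ss_within = sum((y - mean(g)) ** 2 for g in yields.values() for y in g)

df_between = len(yields) - 1
df_within = len(all_values) - len(yields)
f_stat = (ss_between / df_between) / (ss_within / df_within)

F_CRIT_4_20 = 2.87   # assumed table value: F(0.05; 4, 20)

print(f"SS between = {ss_between:.2f}, SS within = {ss_within:.2f}")
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
print("Reject equality of means" if f_stat > F_CRIT_4_20 else "Do not reject equality of means")
```

The group means printed by Minitab (5.22, 5.22, 4.48, 5.14 and 18.6) suggest that variety E is the one driving the significant result, although a formal multiple-comparison procedure would be needed to confirm which pairs differ.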

Q6.
a) Consider the experiment in which two fair dice are tossed and the absolute difference of
dots is recorded. Simulate this experiment 600 times using minitab. Find the frequency
distribution of the absolute differences and find mean and variance of this distribution.
b) Compare the statistical packages SPSS and Minitab with respect to statistical data
analysis in social sciences and physical sciences.
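
The dice experiment in part a) can be sketched outside Minitab as well. The Python example below simulates 600 tosses with an arbitrary seed (chosen here only for reproducibility) and compares the simulated mean and variance with the exact values obtained by enumerating all 36 equally likely outcomes. It illustrates the experiment and is not the Minitab procedure itself.

```python
import random
from collections import Counter
from statistics import mean, pvariance, variance

rng = random.Random(2019)   # arbitrary seed, an assumption made for reproducibility

# Simulate 600 tosses of two fair dice and record the absolute difference of dots.
differences = [abs(rng.randint(1, 6) - rng.randint(1, 6)) for _ in range(600)]

frequency = Counter(differences)
sim_mean = mean(differences)
sim_variance = variance(differences)      # sample variance of the 600 simulated values

# Exact distribution: enumerate the 36 equally likely outcomes.
outcomes = [abs(a - b) for a in range(1, 7) for b in range(1, 7)]
exact_mean = mean(outcomes)               # 70/36
exact_variance = pvariance(outcomes)      # population variance over the 36 outcomes

print("Difference  Frequency")
for value in sorted(frequency):
    print(f"{value:10d}  {frequency[value]:9d}")
print(f"Simulated mean = {sim_mean:.3f}, variance = {sim_variance:.3f}")
print(f"Exact mean = {exact_mean:.3f}, variance = {exact_variance:.3f}")
```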

Q7.
a) Perform regression analysis to predict trade on the basis of other two variables on sample
dataset Employ.MTW. Also use matrix approach to do the same task. Furthermore,
calculate predicted values.
b) Discuss the normality tests available in Minitab.
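
The matrix approach asked for in part a) solves the normal equations (X'X)b = X'y and then computes the predicted values as Xb. The Employ.MTW values are not reproduced in this document, so the Python sketch below uses small made-up data (`x1`, `x2`, `y` are hypothetical, constructed so that y = 1 + 2*x1 + 3*x2 exactly). It only illustrates the computation, and all helper names are invented for the example.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Bring the row with the largest pivot into place for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Hypothetical data: two predictors and a response with an exact linear relationship.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

# Design matrix with a leading column of ones for the constant term.
X = [[1, a, b] for a, b in zip(x1, x2)]
Xt = transpose(X)
XtX = matmul(Xt, X)
Xty = [sum(x * v for x, v in zip(row, y)) for row in Xt]

beta = solve(XtX, Xty)                                           # (X'X)^-1 X'y
predicted = [sum(b * x for b, x in zip(beta, row)) for row in X] # fitted values Xb

print("Coefficients (constant, x1, x2):", [round(b, 6) for b in beta])
print("Predicted values:", [round(p, 6) for p in predicted])
```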

Q8. Perform the following tasks in R.
a) A psychology class performed an experiment to compare whether a recall score in which
instructions to form images of 25 words were given is better than an initial recall score
for which no imagery instructions were given. Twelve students participated in the
experiment with the following results:
With Imagery     20 24 20 18 22 19 20 19 17 21 17 20
Without Imagery   5  9  5  9  6 11  8 11  7  9  8 16
Does it appear that the average recall score is higher when imagery is used? Also
construct a 95% confidence interval for the difference between the means of the two
conditions and interpret the results.
b) Consider the experiment in which two fair dice are tossed and the absolute difference of
dots is recorded. Simulate this experiment 600 times. Find the frequency distribution of
the absolute differences and find mean and variance of this distribution.
