
Department of Statistics

Program: M.Sc. Statistics

Course Number 1569

Course Title Data Analysis and Statistical Packages

Semester/Year 3rd / 2019 Autumn

Instructor Sir. Zahoor Ahmad

ASSIGNMENT No. 01

Submission Date

Due Date

Student Name Muhammad Hashim Javed


Student ID BT-588221

Signature*

Name: M. Hashim Javed Roll: BT-588221


Q1. Perform the following tasks in SPSS.

a) Use catalog.sav sample data file to fit a multiple linear model to predict "Sales of Men's
Clothing" on the basis of variables "Number of Catalogs Mailed", "Number of Pages in
Catalog", "Number of Phone Lines Open for Ordering", "Amount Spent on Print
Advertising" and "Number of Customer Service Representatives".
Use forward selection, backward elimination and enter methods in this respect. Interpret
your result in each case.
Variables Entered/Removed(a)

Model  Variables Entered  Variables Removed  Method
1      mail               .                  Forward (Criterion: Probability-of-F-to-enter <= .050)
2      phone              .                  Forward (Criterion: Probability-of-F-to-enter <= .050)
3      print              .                  Forward (Criterion: Probability-of-F-to-enter <= .050)
4      page               .                  Forward (Criterion: Probability-of-F-to-enter <= .050)

a. Dependent Variable: men

Model Summary(e)

Model  R        R Square  Adjusted R Square  Std. Error of the Estimate
1      .803(a)  .645      .642               3785.49685
2      .877(b)  .770      .766               3061.36064
3      .885(c)  .784      .778               2980.12178
4      .891(d)  .794      .787               2919.90929

a. Predictors: (Constant), mail
b. Predictors: (Constant), mail, phone
c. Predictors: (Constant), mail, phone, print
d. Predictors: (Constant), mail, phone, print, page
e. Dependent Variable: men

Conclusion:
Forward selection built the model in four steps, adding one predictor at each step. Step 1
entered ‘mail’ (adjusted R² = 0.642), step 2 added ‘phone’ (0.766), step 3 added ‘print’ (0.778)
and step 4 added ‘page’ (0.787). Each added variable raised the adjusted R², with the largest gain
coming from ‘phone’. The variable ‘service’ was never entered because it did not meet the entry
criterion (probability of F-to-enter <= .050).



ANOVA(a)

Model          Sum of Squares   df   Mean Square      F        Sig.
1  Regression  3069712621.002     1  3069712621.002   214.216  .000(b)
   Residual    1690938397.841   118    14329986.422
   Total       4760651018.843   119
2  Regression  3664135333.036     2  1832067666.518   195.485  .000(c)
   Residual    1096515685.807   117     9371928.939
   Total       4760651018.843   119
3  Regression  3730440424.702     3  1243480141.567   140.014  .000(d)
   Residual    1030210594.141   116     8881125.812
   Total       4760651018.843   119
4  Regression  3780175938.309     4   945043984.577   110.844  .000(e)
   Residual     980475080.535   115     8525870.266
   Total       4760651018.843   119

a. Dependent Variable: men
b. Predictors: (Constant), mail
c. Predictors: (Constant), mail, phone
d. Predictors: (Constant), mail, phone, print
e. Predictors: (Constant), mail, phone, print, page

Conclusion:
The ANOVA table tests the overall fit of the model at each of the four forward-selection
steps (after entering ‘mail’, ‘phone’, ‘print’ and ‘page’ in turn). At every step the F statistic is
significant (Sig. = .000, i.e. p < .001), which is below 0.05, so the predictors in each model jointly
have a significant linear relationship with the dependent variable.
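
As a rough cross-check of the Model Summary and ANOVA tables, the sketch below recomputes R², adjusted R² and F from the sums of squares reported at each forward-selection step. It is written in Python purely for illustration (the assignment itself uses SPSS), and the helper name `fit_stats` is an assumption of this sketch, not part of any package.

```python
# Recompute R^2, adjusted R^2 and F from the SPSS ANOVA sums of squares.
# Values are copied from the ANOVA table; fit_stats is a name made up here.

def fit_stats(ss_reg, ss_res, k, n):
    """Return (R^2, adjusted R^2, F) for a model with k predictors and n cases."""
    ss_tot = ss_reg + ss_res
    r2 = ss_reg / ss_tot
    adj_r2 = 1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))
    f = (ss_reg / k) / (ss_res / (n - k - 1))
    return r2, adj_r2, f

N_CASES = 120  # total df (119) + 1

# (step, regression SS, residual SS, number of predictors)
steps = [
    (1, 3069712621.002, 1690938397.841, 1),
    (2, 3664135333.036, 1096515685.807, 2),
    (3, 3730440424.702, 1030210594.141, 3),
    (4, 3780175938.309, 980475080.535, 4),
]

results = {}
for step, ss_reg, ss_res, k in steps:
    results[step] = fit_stats(ss_reg, ss_res, k, N_CASES)
    r2, adj_r2, f = results[step]
    print(f"step {step}: R2={r2:.3f}  adj R2={adj_r2:.3f}  F={f:.3f}")
```

The printed values should agree with the SPSS tables to the displayed precision, which confirms that the four steps differ only in how much of the fixed total sum of squares moves from Residual to Regression.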

Coefficients(a)

                Unstandardized Coefficients   Standardized Coefficients
Model           B            Std. Error       Beta                        t        Sig.
1  (Constant)   -14064.614   2099.365                                     -6.699   .000
   mail              2.991       .204         .803                        14.636   .000
2  (Constant)   -15361.047   1705.559                                     -9.006   .000
   mail              1.971       .209         .529                         9.424   .000
   phone           334.103     41.952         .447                         7.964   .000
3  (Constant)   -20665.869   2554.586                                     -8.090   .000
   mail              1.862       .207         .500                         8.977   .000
   phone           339.159     40.880         .454                         8.296   .000
   print              .218       .080         .121                         2.732   .007
4  (Constant)   -23898.558   2838.361                                     -8.420   .000
   mail              1.847       .203         .496                         9.083   .000
   phone           327.802     40.329         .439                         8.128   .000
   print              .208       .078         .115                         2.656   .009
   page             50.508     20.912         .104                         2.415   .017

a. Dependent Variable: men

Excluded Variables(a)

Model       Beta In    t       Sig.  Partial Correlation  Collinearity Statistics (Tolerance)
1  page     .149(b)    2.773   .006   .248                .980
   phone    .447(b)    7.964   .000   .593                .625
   print    .104(b)    1.877   .063   .171                .957
   service  .153(b)    1.997   .048   .182                .501
2  page     .110(c)    2.496   .014   .226                .968
   print    .121(c)    2.732   .007   .246                .955
   service  -.064(c)   -.933   .353  -.086                .416
3  page     .104(d)    2.415   .017   .220                .965
   service  -.079(d)  -1.183   .239  -.110                .413
4  service  -.072(e)  -1.096   .275  -.102                .412

a. Dependent Variable: men
b. Predictors in the Model: (Constant), mail
c. Predictors in the Model: (Constant), mail, phone
d. Predictors in the Model: (Constant), mail, phone, print
e. Predictors in the Model: (Constant), mail, phone, print, page

Residuals Statistics(a)

                      Minimum      Maximum     Mean        Std. Deviation  N
Predicted Value       2636.2012    34103.5547  16242.8134  5636.14978      120
Residual              -8822.03613  9087.41895  .00000      2870.41572      120
Std. Predicted Value  -2.414       3.169       .000        1.000           120
Std. Residual         -3.021       3.112       .000        .983            120

a. Dependent Variable: men

b) Use bankloan.sav sample data file to fit a binary logistic regression model to predict
default on the basis of variables age, ed, income, debtinc, creddebt and othdebt.
Interpret your result.

Dependent Variable Encoding

Original Value Internal Value

No 0
Yes 1

Block 0: Beginning Block

Classification Table(a,b)

                                Predicted
                                default          Percentage
Observed                        No      Yes      Correct
Step 0  default  No             517     0        100.0
                 Yes            183     0        .0
        Overall Percentage                       73.9

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

B S.E. Wald df Sig. Exp(B)

Step 0 Constant -1.039 .086 145.782 1 .000 .354

Variables not in the Equation

                              Score    df  Sig.
Step 0  Variables  age        13.265   1   .000
                   ed         9.205    1   .002
                   income     3.526    1   .060
                   debtinc    106.238  1   .000
                   creddebt   41.928   1   .000
                   othdebt    14.863   1   .000
        Overall Statistics    148.310  6   .000



Block 1: Method = Enter

Omnibus Tests of Model Coefficients

               Chi-square  df  Sig.
Step 1  Step   153.662     6   .000
        Block  153.662     6   .000
        Model  153.662     6   .000

Model Summary

Step  -2 Log likelihood  Cox & Snell R Square  Nagelkerke R Square
1     650.702(a)         .197                  .289

a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

Classification Table(a)

                                Predicted
                                default          Percentage
Observed                        No      Yes      Correct
Step 1  default  No             483     34       93.4
                 Yes            122     61       33.3
        Overall Percentage                       77.7

a. The cut value is .500

Variables in the Equation

                     B       S.E.  Wald    df  Sig.  Exp(B)
Step 1(a)  age       -.047   .014  10.632  1   .001  .954
           ed        .392    .105  13.802  1   .000  1.480
           income    -.013   .007  3.077   1   .079  .987
           debtinc   .111    .027  16.986  1   .000  1.117
           creddebt  .341    .088  15.186  1   .000  1.407
           othdebt   -.069   .062  1.221   1   .269  .933
           Constant  -1.198  .563  4.523   1   .033  .302

a. Variable(s) entered on step 1: age, ed, income, debtinc, creddebt, othdebt.
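
To help read the logistic regression output, the following sketch (Python, for illustration only) recomputes the odds ratios Exp(B) from the B column and summarises the Step 1 classification table. In the table, Exp(B) above 1 (ed, debtinc, creddebt) means higher odds of default per unit increase and Exp(B) below 1 (age) means lower odds, while income and othdebt are not significant at the 0.05 level (Sig. = .079 and .269). The `applicant` values at the end are hypothetical and were made up only to show how a predicted probability is obtained.

```python
import math

# Coefficients (B) copied from "Variables in the Equation", Step 1.
coef = {
    "age": -0.047, "ed": 0.392, "income": -0.013,
    "debtinc": 0.111, "creddebt": 0.341, "othdebt": -0.069,
}
constant = -1.198

# Odds ratio for a one-unit increase in each predictor: Exp(B) = e^B.
odds_ratio = {name: math.exp(b) for name, b in coef.items()}

# Step 1 classification table: rows = observed, columns = predicted (No, Yes).
obs_no = (483, 34)
obs_yes = (122, 61)
total = sum(obs_no) + sum(obs_yes)
accuracy = (obs_no[0] + obs_yes[1]) / total
specificity = obs_no[0] / sum(obs_no)     # non-defaulters correctly classified
sensitivity = obs_yes[1] / sum(obs_yes)   # defaulters correctly classified

def predicted_probability(x):
    """P(default = Yes) from the fitted logit for a dict of predictor values."""
    logit = constant + sum(coef[name] * value for name, value in x.items())
    return 1 / (1 + math.exp(-logit))

# Hypothetical applicant, invented for this example only.
applicant = {"age": 30, "ed": 2, "income": 40, "debtinc": 10,
             "creddebt": 1.5, "othdebt": 3}
p = predicted_probability(applicant)

for name, ratio in odds_ratio.items():
    print(f"{name:9s} Exp(B) = {ratio:.3f}")
print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
print(f"predicted probability of default for the hypothetical applicant: {p:.3f}")
```

The summary makes the main weakness of the model visible: overall accuracy rises only from 73.9% to 77.7%, and only about one third of the actual defaulters are classified correctly at the .500 cut value.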

c) Find frequency distribution of "Preferred breakfast" for those senior citizens who
are also living an active life (Use sample data set cereal.sav)
Ans:

First step is to select those cases only, for which the life-style is recorded as active. [active=1]
bfast * agecat Crosstabulation (Count)

                       agecat
bfast          Under 31  31-45  46-60  Over 60  Total
Breakfast Bar  58        60     23     12       153
Oatmeal        2         12     31     57       102
Cereal         51        45     38     17       151
Total          111       117    92     86       406

The frequency distribution for senior citizens who are living an active life is the "Over 60" column of
the table: Breakfast Bar 12, Oatmeal 57 and Cereal 17 (total 86). Active senior citizens therefore
prefer Oatmeal most often (57 of 86, about 66%).



d) Carry out a test of independence of attributes "Preferred breakfast" and "marital
status". (Use cereal.sav)

Case Processing Summary

                 Valid          Missing       Total
                 N    Percent   N   Percent   N    Percent
bfast * marital  880  100.0%    0   0.0%      880  100.0%

bfast * marital Crosstabulation (Count)

                      marital
bfast          Unmarried  Married  Total
Breakfast Bar  108        123      231
Oatmeal        95         215      310
Cereal         100        239      339
Total          303        577      880

Chi-Square Tests

                              Value      df  Asymp. Sig. (2-sided)
Pearson Chi-Square            21.157(a)  2   .000
Likelihood Ratio              20.623     2   .000
Linear-by-Linear Association  16.226     1   .000
N of Valid Cases              880

a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 79.54.

Conclusion:
As the p-value (reported as .000, i.e. p < .001) is less than our chosen significance level (α = 0.05),
we reject the null hypothesis of independence and conclude that there is an association between
Preferred Breakfast and Marital Status.



e) Carry out a test of independence of attributes "Preferred breakfast" and "Age
category". (Use cereal.sav)

Case Processing Summary

                Valid          Missing       Total
                N    Percent   N   Percent   N    Percent
bfast * agecat  880  100.0%    0   0.0%      880  100.0%

bfast * agecat Crosstabulation (Count)

                       agecat
bfast          Under 31  31-45  46-60  Over 60  Total
Breakfast Bar  84        90     39     18       231
Oatmeal        4         24     97     185      310
Cereal         93        92     95     59       339
Total          181       206    231    262      880

Chi-Square Tests

                              Value       df  Asymp. Sig. (2-sided)
Pearson Chi-Square            309.336(a)  6   .000
Likelihood Ratio              350.688     6   .000
Linear-by-Linear Association  4.986       1   .026
N of Valid Cases              880

a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 47.51.

Conclusion:
As the p-value (reported as .000, i.e. p < .001) is less than our chosen significance level (α = 0.05),
we reject the null hypothesis of independence and conclude that there is an association between
Preferred Breakfast and Age Category.
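
The Pearson chi-square values in parts (d) and (e) can be reproduced from the two crosstabulations alone. The sketch below is a Python illustration, not the SPSS procedure itself; the function names are made up for this example. It computes the expected counts under independence, the chi-square statistic and the smallest expected count. The p-value helper uses a closed form that holds only for even degrees of freedom, which happens to cover both tests here (df = 2 and df = 6).

```python
import math

def chi_square_independence(table):
    """Pearson chi-square, df and smallest expected count for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            min_expected = min(min_expected, expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, min_expected

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form needs an even df."""
    if df % 2 != 0:
        raise ValueError("closed form used here needs an even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# Rows: Breakfast Bar, Oatmeal, Cereal (counts copied from the crosstabulations)
bfast_by_marital = [[108, 123], [95, 215], [100, 239]]
bfast_by_agecat = [[84, 90, 39, 18], [4, 24, 97, 185], [93, 92, 95, 59]]

marital = chi_square_independence(bfast_by_marital)
agecat = chi_square_independence(bfast_by_agecat)
p_marital = chi2_sf_even_df(marital[0], marital[1])
p_agecat = chi2_sf_even_df(agecat[0], agecat[1])

print("marital:", [round(v, 3) for v in marital], p_marital)
print("agecat: ", [round(v, 3) for v in agecat], p_agecat)
```

Both p-values are far below 0.001, which is why SPSS displays them as .000.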



f) Use grocery_coupons.sav sample data file to test that the mean of amount spent is
equal to 105. Find 90% confidence interval for the mean of amount spent. Also test
that the mean of amount spent by both male and female customers is equal. What
would you say about the equality of means for amount spent on stores of different
sizes?
Ans:
Part 1: Use grocery_coupons.sav sample data file to test that the mean of amount spent is
equal to 105. Find 90% confidence interval for the mean of amount spent.

T-Test

One-Sample Statistics

          N     Mean     Std. Deviation  Std. Error Mean
amtspent  1404  99.9338  48.54435        1.29555

One-Sample Test (Test Value = 105)

                                                          90% Confidence Interval of the Difference
          t       df    Sig. (2-tailed)  Mean Difference  Lower    Upper
amtspent  -3.910  1403  .000             -5.06621         -7.1986  -2.9338

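Conclusion:
As the p-value (reported as .000, i.e. p < .001) is less than our chosen significance level (α = 0.10),
it is concluded that the mean of Amount Spent is significantly different from 105. The interval
[-7.1986, -2.9338] reported by SPSS is the 90% confidence interval for the difference (mean minus 105);
adding the test value 105 gives the 90% confidence interval for the mean of Amount Spent,
[97.8014, 102.0662].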
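
The one-sample results can be checked by hand from the One-Sample Statistics table. The sketch below (Python, illustrative only) recomputes the standard error and t statistic and converts the SPSS interval for the difference into an interval for the mean. As a second check it uses a normal approximation to the t critical value, which is an assumption of this sketch that is reasonable only because df = 1403 is large.

```python
import math
from statistics import NormalDist

n, mean, sd = 1404, 99.9338, 48.54435   # One-Sample Statistics table
test_value = 105

se = sd / math.sqrt(n)
t = (mean - test_value) / se

# SPSS reports the 90% CI of the difference (mean - test value);
# adding the test value back gives the CI of the mean itself.
ci_diff = (-7.1986, -2.9338)
ci_mean = tuple(test_value + bound for bound in ci_diff)

# Normal approximation to the t critical value (assumption: with 1403 df
# the z and t quantiles differ only in the third decimal place).
z = NormalDist().inv_cdf(0.95)
ci_mean_approx = (mean - z * se, mean + z * se)

print(f"SE = {se:.5f}, t = {t:.3f}")
print("90% CI for the mean (from SPSS bounds):", ci_mean)
print("90% CI for the mean (normal approximation):", ci_mean_approx)
```

The two intervals agree to about two decimal places, and neither contains 105, which matches the rejection of the null hypothesis.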

Part 2: Also test that the mean of amount spent by both male and female customers is
equal.
T-TEST GROUPS=gender(0 1)
/MISSING=ANALYSIS
/VARIABLES=amtspent
/CRITERIA=CI(.95).
T-Test

Group Statistics

          gender  N    Mean      Std. Deviation  Std. Error Mean
amtspent  Male    740  107.5761  49.09908        1.80492
          Female  664  91.4168   46.49620        1.80440

Independent Samples Test

Levene's Test for Equality of Variances

                                   F     Sig.
amtspent  Equal variances assumed  .458  .499

t-test for Equality of Means

                                                                                                       95% Confidence Interval of the Difference
                                       t      df        Sig. (2-tailed)  Mean Difference  Std. Error Difference  Lower     Upper
amtspent  Equal variances assumed      6.313  1402      .000             16.15930         2.55971                11.13803  21.18058
          Equal variances not assumed  6.332  1397.923  .000             16.15930         2.55218                11.15280  21.16581

Conclusion:
We use the top row (Equal variances assumed) because the p-value of Levene's test (.499) is above 0.05.
As the p-value of the t-test (reported as .000, i.e. p < .001) is less than our chosen significance level
(α = 0.05), it is concluded that the mean of Amount Spent differs significantly between male and
female customers. The 95% confidence interval for the difference in means (Male minus Female) is
[11.138, 21.181].
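
The independent-samples results follow directly from the Group Statistics table. The sketch below (Python, for illustration) recomputes the pooled standard error and t statistic for the "Equal variances assumed" row, the unpooled standard error for the "Equal variances not assumed" row, and checks that the mean difference sits at the midpoint of the reported confidence interval.

```python
import math

# Group Statistics table
n1, m1, s1 = 740, 107.5761, 49.09908   # Male
n2, m2, s2 = 664, 91.4168, 46.49620    # Female

# Equal variances assumed: pooled variance with n1 + n2 - 2 df
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
diff = m1 - m2
t_pooled = diff / se_pooled

# Equal variances not assumed: unpooled (Welch) standard error
se_welch = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
t_welch = diff / se_welch

# The mean difference must lie at the midpoint of the SPSS interval
ci_reported = (11.13803, 21.18058)
midpoint = sum(ci_reported) / 2

print(f"diff = {diff:.4f}, df = {df}")
print(f"pooled: SE = {se_pooled:.5f}, t = {t_pooled:.3f}")
print(f"Welch:  SE = {se_welch:.5f}, t = {t_welch:.3f}")
print(f"midpoint of reported CI = {midpoint:.4f}")
```

Because the group standard deviations are so similar, the pooled and unpooled rows give almost the same t value, which is consistent with the non-significant Levene's test.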

Part 3: What would you say about the equality of means for amount spent on stores of
different sizes?

ONEWAY amtspent BY size
/MISSING ANALYSIS.

Oneway

ANOVA
amtspent

                Sum of Squares  df    Mean Square  F      Sig.
Between Groups  20053.727       2     10026.864    4.275  .014
Within Groups   3286191.636     1401  2345.604
Total           3306245.363     1403

Conclusion:
As the p-value (0.014) is less than our chosen significance level (α = 0.05), it is concluded
that the mean of Amount Spent is not the same for all store sizes; at least one store size has a
significantly different mean.



Q3 is a group assignment (Group size not more than 5 students).
a) Create a Questionnaire having 10-15 questions to collect data from 30 respondents.
Topic may be of your choice. Submit the soft copy of questionnaire through class
representative not later than 20th October 2019.
b) Enter your data in SPSS. Naming of variables and data types should be appropriate.
c) Carry out data analysis of this data set.
d) Submit report of the statistical analysis of this data in MS Word.

WHODAS 2.0
WORLD HEALTH ORGANIZATION DISABILITY ASSESSMENT
SCHEDULE 2.0

12-item version, self-administered


This questionnaire asks about difficulties due to health conditions. Health conditions include diseases or illnesses, other
health problems that may be short or long lasting, injuries, mental or emotional problems, and problems with alcohol or
drugs.

Think back over the past 30 days and answer these questions, thinking about how much difficulty you had doing the
following activities. For each question, please circle only one response.
In the past 30 days, how much difficulty did you have in:

Response options for each of S1 to S12: None, Mild, Moderate, Severe, Extreme or cannot do.

S1 Standing for long periods such as 30 minutes?
S2 Taking care of your household responsibilities?
S3 Learning a new task, for example, learning how to get to a new place?
S4 How much of a problem did you have joining in community activities (for example, festivities,
religious or other activities) in the same way as anyone else can?
S5 How much have you been emotionally affected by your health problems?
S6 Concentrating on doing something for ten minutes?
S7 Walking a long distance such as a kilometre [or equivalent]?
S8 Washing your whole body?
S9 Getting dressed?
S10 Dealing with people you do not know?
S11 Maintaining a friendship?
S12 Your day-to-day work?

H1 Overall, in the past 30 days, how many days were these difficulties present? (Record number of days)
H2 In the past 30 days, for how many days were you totally unable to carry out your usual activities or
work because of any health condition? (Record number of days)
H3 In the past 30 days, not counting the days that you were totally unable, for how many days did you
cut back or reduce your usual activities or work because of any health condition? (Record number of days)

Descriptive Statistics

                    N   Minimum  Maximum  Mean   Std. Deviation
overal_score        30  27       50       35.47  5.776
Valid N (listwise)  30

Independent Samples Test

Levene's Test for Equality of Variances

                                       F     Sig.
overal_score  Equal variances assumed  .050  .825

t-test for Equality of Means

                                                                                                         95% Confidence Interval of the Difference
                                           t     df      Sig. (2-tailed)  Mean Difference  Std. Error Difference  Lower   Upper
overal_score  Equal variances assumed      .052  28      .959             .125             2.427                  -4.846  5.096
              Equal variances not assumed  .054  13.517  .958             .125             2.328                  -4.884  5.134

ANOVA
overal_score

                Sum of Squares  df  Mean Square  F     Sig.
Between Groups  367.133         15  24.476       .571  .853
Within Groups   600.333         14  42.881
Total           967.467         29



Q4. Perform the following tasks in Minitab.
a) In order to ascertain the age distribution of operatives in a certain industry, random
samples of 1720 males and 1230 females are drawn. The sample means and standard
deviations were 33.93 years and 14.20 years for the males and 27.44 years and 10.79
years for the females. Calculate the 95 percent confidence interval for
i. The mean age of all the male operatives.
i – Ans:

Variable  N     Mean    StDev   SE Mean  95% CI
C1        1720  34.173  14.047  0.339    (33.509, 34.837)

ii. The differences between their mean ages.

ii – Ans:
Estimate for difference: 6.204
95% CI for difference: (5.305, 7.104)
T-Test of difference = 0 (vs not =): T-Value = 13.52 P-Value = 0.000 DF = 2930
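
The Minitab output above is based on a data column whose mean (34.173) and standard deviation (14.047) differ slightly from the summary values stated in the question (33.93 and 14.20). As a cross-check, the sketch below computes both intervals directly from the stated summary statistics. It is a Python illustration that uses the large-sample normal critical value, which is an assumption of this sketch and is reasonable for samples of 1720 and 1230.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)   # about 1.96 for a 95% interval

# Summary statistics stated in the question
n_m, mean_m, sd_m = 1720, 33.93, 14.20   # males
n_f, mean_f, sd_f = 1230, 27.44, 10.79   # females

# i. 95% CI for the mean age of all male operatives
se_m = sd_m / math.sqrt(n_m)
ci_male = (mean_m - z * se_m, mean_m + z * se_m)

# ii. 95% CI for the difference between the mean ages (males - females)
diff = mean_m - mean_f
se_diff = math.sqrt(sd_m ** 2 / n_m + sd_f ** 2 / n_f)
ci_diff = (diff - z * se_diff, diff + z * se_diff)

print(f"male mean: ({ci_male[0]:.3f}, {ci_male[1]:.3f})")
print(f"difference {diff:.2f}: ({ci_diff[0]:.3f}, {ci_diff[1]:.3f})")
```

With the stated summary statistics the intervals come out at roughly (33.26, 34.60) for the male mean and (5.59, 7.39) for the difference, close to but not identical with the Minitab output.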

b) A psychology class performed an experiment to compare whether a recall score in
which instructions to form images of 25 words were given is better than an initial
recall score for which no imagery instructions were given. Twelve students
participated in the experiment with the following results:

With Imagery:    20 24 20 18 22 19 20 19 17 21 17 20
Without Imagery:  5  9  5  9  6 11  8 11  7  9  8 16

Does it appear that the average recall score is higher when imagery is used? Also
construct 95% confidence interval for the difference between the mean of both the
imageries and interpret the results.
Ans:
Paired T-Test and CI: With Imagery, Without Imagery

Paired T for With Imagery - Without Imagery

                  N  Mean    StDev  SE Mean
With Imagery     12  19.750  2.006  0.579
Without Imagery  12   8.667  3.055  0.882
Difference       12  11.08   3.70   1.07

95% CI for mean difference: (8.73, 13.44)
T-Test of mean difference = 0 (vs not = 0): T-Value = 10.37 P-Value = 0.000

Conclusion:
The mean recall score is 19.75 with imagery and 8.667 without imagery, so the score is clearly
higher when imagery is used. The paired t-test confirms this: the p-value (reported as 0.000) is less
than our chosen significance level (α = 0.05), so the mean difference between Recall Score with
Imagery and Recall Score without Imagery is significant. The 95% confidence interval for the mean
difference in Recall Score is [8.73, 13.44]; because the whole interval lies above zero, imagery raises
the average recall score by roughly 9 to 13 points.
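
The paired t-test works on the twelve within-student differences. The sketch below (Python, for illustration) reproduces the Minitab figures from the raw scores. The critical value 2.201 for 11 degrees of freedom is taken from standard t tables and hard-coded, because the Python standard library has no t quantile function.

```python
import math
from statistics import mean, stdev

with_imagery = [20, 24, 20, 18, 22, 19, 20, 19, 17, 21, 17, 20]
without_imagery = [5, 9, 5, 9, 6, 11, 8, 11, 7, 9, 8, 16]

# Work on the paired differences (with - without) for each student
diffs = [w - wo for w, wo in zip(with_imagery, without_imagery)]
n = len(diffs)
d_bar = mean(diffs)
se = stdev(diffs) / math.sqrt(n)
t = d_bar / se

# Two-sided 5% critical value of t with 11 df, from standard t tables
T_CRIT_11 = 2.201
ci = (d_bar - T_CRIT_11 * se, d_bar + T_CRIT_11 * se)

print(f"mean difference = {d_bar:.3f}, SE = {se:.3f}, t = {t:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```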

c) Generate 4 samples of sizes 5, 6, 7 and 7 from normal populations with means 45, 40, 47
and 38 respectively. While the standard deviations of these distributions are 4, 6, 7 and 8
respectively. Test the equality of means.
Ans:
One-way ANOVA: Population versus Factor

Source DF SS MS F P
Factor 3 595.3 198.4 4.48 0.014
Error 21 929.3 44.3
Total 24 1524.6

S = 6.652 R-Sq = 39.05% R-Sq(adj) = 30.34%

Level  N  Mean    StDev
1      5  47.891  3.939
2      6  36.004  6.483
3      7  47.847  9.382
4      7  41.373  4.636

(Character plot of individual 95% CIs for the means, based on the pooled StDev; axis from 36.0 to 54.0.)

Pooled StDev = 6.652

Conclusion:
The F statistic is 4.48 and the p-value (0.014) is less than our chosen significance level
(α = 0.05), so it is concluded that not all of the means are equal; at least one of the means is
significantly different.



Q5.
a) Explain the procedure for testing of equality of several means in Minitab and SPSS.
b) Use Minitab/SPSS to test equality of means for the following experiment of wheat yield for
different varieties. Varieties are shown by A, B, C, D and E.

A (8)    B (5.3)  C (4.1)  D (5)    E (16)
D (6.8)  A (4.9)  B (4.1)  C (3.2)  E (18)
B (6.3)  E (16)   C (4.7)  D (4.0)  A (5.0)
C (5.7)  D (3.3)  E (25)   A (4.0)  B (4.2)
E (18)   C (4.7)  A (4.2)  D (6.6)  B (6.2)
Ans:
One-way ANOVA: resp versus fac

Source DF SS MS F P
fac 4 740.14 185.03 44.59 0.000
Error 20 83.00 4.15
Total 24 823.13

S = 2.037 R-Sq = 89.92% R-Sq(adj) = 87.90%

Level  N  Mean    StDev
1      5  5.220   1.613
2      5  5.220   1.052
3      5  4.480   0.918
4      5  5.140   1.549
5      5  18.600  3.715

(Character plot of individual 95% CIs for the means, based on the pooled StDev; axis from 5.0 to 20.0.)

Pooled StDev = 2.037

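Conclusion:
The F statistic is 44.59 and the p-value (0.000) is less than our chosen significance level
(α = 0.05), so it is concluded that not all of the means are equal; at least one of the means is
significantly different. Levels 1 to 5 correspond to varieties A to E, and variety E (level 5, mean
18.600) stands well above the other four varieties (means 4.480 to 5.220).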
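
The one-way ANOVA figures can be reproduced from the 25 yields in the layout above. The sketch below (Python, for illustration only) groups the observations by variety and computes the between-groups and within-groups sums of squares and the F statistic. The p-value is left to Minitab or SPSS, since the Python standard library has no F distribution.

```python
from statistics import mean

# Yields grouped by variety, read row by row from the experiment layout
yields = {
    "A": [8.0, 4.9, 5.0, 4.0, 4.2],
    "B": [5.3, 4.1, 6.3, 4.2, 6.2],
    "C": [4.1, 3.2, 4.7, 5.7, 4.7],
    "D": [5.0, 6.8, 4.0, 3.3, 6.6],
    "E": [16.0, 18.0, 16.0, 25.0, 18.0],
}

all_values = [y for ys in yields.values() for y in ys]
grand_mean = mean(all_values)

# Between-groups SS: group size times squared deviation of each group mean
ss_between = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in yields.values())
# Within-groups SS: squared deviations of observations from their group mean
ss_within = sum((y - mean(ys)) ** 2 for ys in yields.values() for y in ys)

df_between = len(yields) - 1
df_within = len(all_values) - len(yields)
f_stat = (ss_between / df_between) / (ss_within / df_within)

for variety, ys in yields.items():
    print(variety, round(mean(ys), 3))
print(f"SS between = {ss_between:.2f}, SS within = {ss_within:.2f}, F = {f_stat:.2f}")
```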



Q6.
a) Consider the experiment in which two fair dice are tossed and the absolute
difference of dots is recorded. Simulate this experiment 600 times using minitab.
Find the frequency distribution of the absolute differences and find mean and
variance of this distribution.
Ans:

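Commands used to generate the 600 values:

MTB > Random 600 'abs_diff';
SUBC> Integer 0 5.

Note that these commands draw each value uniformly from the integers 0 to 5. The absolute
difference of two fair dice is not uniform: the probabilities of 0, 1, 2, 3, 4 and 5 are 6/36, 10/36,
8/36, 6/36, 4/36 and 2/36, with mean 70/36 (about 1.94) and variance about 2.05. A faithful
simulation draws two columns with the subcommand Integer 1 6 and takes the absolute value of their
difference. The output below reflects the uniform draw.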

Tabulated statistics: abs_diff

Rows: abs_diff

Count % of Total

0 104 17.33
1 99 16.50
2 110 18.33
3 89 14.83
4 90 15.00
5 108 18.00
All 600 100.00

Descriptive Statistics: abs_diff

Total
Variable Count Mean StDev Variance
abs_diff 600 2.4767 1.7333 3.0045

Frequency Distribution Chart

[Bar chart of abs_diff: Count (0 to 120) on the vertical axis against abs_diff (0 to 5) on the horizontal axis]
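
For comparison with the tabulated output, the sketch below simulates the experiment as stated in the question: two fair dice are tossed 600 times and the absolute difference of the dots is recorded. It is a Python illustration rather than Minitab code; the seed value is arbitrary and the function name is made up for this example. The exact distribution is also derived from the 36 equally likely outcomes, so the simulated frequencies can be judged against it.

```python
import random
from collections import Counter
from fractions import Fraction
from statistics import mean, variance

def simulate_abs_diff(n_tosses, seed=None):
    """Toss two fair dice n_tosses times; return the absolute differences."""
    rng = random.Random(seed)
    return [abs(rng.randint(1, 6) - rng.randint(1, 6)) for _ in range(n_tosses)]

# Exact distribution from the 36 equally likely (die1, die2) outcomes
exact_counts = Counter(abs(a - b) for a in range(1, 7) for b in range(1, 7))
exact_prob = {d: Fraction(c, 36) for d, c in exact_counts.items()}
exact_mean = sum(d * p for d, p in exact_prob.items())
exact_var = sum(d * d * p for d, p in exact_prob.items()) - exact_mean ** 2

sample = simulate_abs_diff(600, seed=1569)   # seed chosen arbitrarily
freq = Counter(sample)

print("diff  count  rel.freq  exact prob")
for d in range(6):
    print(f"{d:4d}  {freq[d]:5d}  {freq[d] / 600:8.4f}  {float(exact_prob[d]):.4f}")
print("sample mean:", mean(sample), " sample variance:", variance(sample))
print("exact mean:", float(exact_mean), " exact variance:", float(exact_var))
```

A correct simulation should show a difference of 1 as the most frequent value and a difference of 5 as the rarest, unlike the roughly equal counts produced by a uniform draw on 0 to 5.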



b) Compare the statistical packages SPSS and Minitab with respect to statistical data
analysis in social sciences and physical sciences.
Ans:

Comparison of Minitab and IBM SPSS


1. Easy to learn and easy to use. SPSS is menu-driven; the software is very easy to use. Like
Minitab, most of the functionalities in SPSS are organized into pull-down menus in a very
intuitive way. The learning curves for SPSS and Minitab are similar.

2. SPSS is generally stronger than Minitab in statistical analysis, especially in some specific
areas such as ANOVA-related procedures. The add-on modules give SPSS further flexibility and
potential to extend its capabilities, and the same holds for more advanced analysis, where SPSS
is also stronger than Minitab. So SPSS is most suitable if your work involves large datasets,
frequent data management, and intermediate to partially advanced statistical analysis.

3. SPSS Statistics is loaded with powerful analytic techniques and time-saving features to
help you quickly and easily find new insights in your data, so you can make more accurate
predictions and achieve better outcomes for your organization.

4. View interactive SPSS Statistics output on smart devices (smartphones and tablets) and
generate presentation-ready output quickly and easily.

5. Enhanced Monte Carlo simulation to improve model accuracy with


a. Ability to fit a categorical distribution to string fields
b. Support for Automatic Linear Modeling (ALM)
c. Generate heat maps automatically when displaying scatterplots in which the
target or the input, or both, are categorical
d. Automatically determine and use associations between categorical inputs when
generating data for those inputs
e. Generating data in the absence of a predictive model

6. SPSS Advanced Statistics offers generalized linear mixed models (GLMM), general linear
models (GLM), mixed models procedures, generalized linear models (GENLIN) and
generalized estimating equations (GEE) procedures.



Comparison of Minitab and IBM SPSS in ANOVA

Product One-way Two-way MANOVA GLM Mixed model Post-hoc Latin squares

Minitab Yes Yes Yes Yes No Yes Yes

SPSS Yes Yes Yes Yes Yes Yes Yes

Comparison of Minitab and IBM SPSS in Regression

Product  OLS  WLS  2SLS  NLLS  Logistic  GLM  LAD  Stepwise  Quantile  Probit  Cox  Poisson  MLR
Minitab  Yes  Yes  No    Yes   Yes       No   No   Yes       No        No      No   No       No
SPSS     Yes  Yes  Yes   Yes   Yes       Yes  Yes  Yes       Yes       Yes     Yes  Yes      Yes

OLS: Ordinary Least Squares
WLS: Weighted Least Squares
2SLS: 2 Stage Least Squares
NLLS: Non Linear Least Squares
LAD: Least Absolute Deviation
GLM: Generalized Linear Models
MLR: Multiple Linear Regression

Comparison of Minitab and IBM SPSS for Operating System Support

Product Windows Mac OS Linux BSD Unix

Minitab Yes No No No No

SPSS Yes Yes Yes No No



Comparison of Minitab and IBM SPSS for Charts and Diagrams

Chart Bar chart Box plot Correlogram Histogram Line chart Scatterplot

Minitab Yes Yes Yes Yes Yes Yes

SPSS Yes Yes Yes Yes Yes Yes

Group                     Feature                          Minitab  SPSS
Descriptive statistics    Base stat                        Yes      Yes
Descriptive statistics    Normality test                   Yes      Yes
Descriptive statistics    CTA                              Yes      Yes
Nonparametric statistics  Nonparametric comparison, ANOVA  Yes      Yes
Quality control                                            Yes      Yes
Survival analysis                                          Yes      Yes
Data processing           Cluster analysis                 Yes      Yes
Data processing           Discriminant analysis            Yes      Yes
Data processing           BDP                              Yes      Yes
Data processing           Ext.                             Yes      Yes



Q7.
a) Perform regression analysis to predict trade on the basis of other two variables on sample
dataset Employ.MTW. Also use matrix approach to do the same task. Furthermore,
calculate predicted values.

Ans:
Regression Analysis: Trade versus Food, Metals

The regression equation is


Trade = 67.1 + 0.225 Food + 5.90 Metals

Predictor Coef SE Coef T P


Constant 67.05 22.03 3.04 0.004
Food 0.2255 0.2336 0.97 0.339
Metals 5.9001 0.5034 11.72 0.000

S = 11.2084 R-Sq = 74.8% R-Sq(adj) = 74.0%

Analysis of Variance

Source DF SS MS F P
Regression 2 21302 10651 84.78 0.000
Residual Error 57 7161 126
Total 59 28463

Source DF Seq SS
Food 1 4046
Metals 1 17257

Unusual Observations

Obs Food Trade Fit SE Fit Residual St Resid


2 53.0 317.00 340.38 1.87 -23.38 -2.12R
60 57.7 396.00 363.86 1.95 32.14 2.91R

R denotes an observation with a large standardized residual.

Residuals vs Fits for Trade

[Plot: Versus Fits (response is Trade); Residual on the vertical axis (about -30 to 40) against Fitted Value on the horizontal axis (310 to 390)]

Conclusion: The adjusted R-squared for this fitted model is 74.0%, which means that
Food and Metals together explain about 74 percent of the variation in Trade, so the model is
considered a good fit. Of the two predictors, only Metals is individually significant
(P = 0.000); Food is not (P = 0.339).

Using Matrix Approach for Regression


MTB > %"E:\AIOU\M.Sc\3rd Semester\1569\macros\regMATRIX.mac" 'Trade' 'Food' 'Metals'
Executing from file: E:\AIOU\M.Sc\3rd Semester\1569\macros\regMATRIX.mac

This macro is to find coefficients of a regression problem with
two independent variables. Dependent variable is stored in C1,
while independent variables are stored in C2 and C3

coefficients

Matrix m7

67.0512
0.2255
5.9001
coefficients by regr command

f
67.0512 0.2255 5.9001
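
The matrix approach solves the normal equations X'X b = X'y, where X has a column of ones followed by the predictor columns. The macro file itself is not shown in this report, so the sketch below is only an illustration of the same idea in Python and is not the macro's code. All function names are made up for this example, and the data are toy values (not the Employ.MTW columns), chosen so that the true coefficients are known.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def ols(y, *predictors):
    """Least-squares coefficients (constant first) from X'X b = X'y."""
    X = [[1.0, *row] for row in zip(*predictors)]
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(x * yi for x, yi in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)

# Toy data invented for this sketch (NOT the Employ.MTW values)
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1 + 2 * a + 3 * c for a, c in zip(x1, x2)]   # exact plane: b = (1, 2, 3)

b = ols(y, x1, x2)
fitted = [b[0] + b[1] * a + b[2] * c for a, c in zip(x1, x2)]
print("coefficients:", [round(v, 6) for v in b])
print("fitted values:", [round(v, 6) for v in fitted])
```

Applied to the Trade, Food and Metals columns, the same calculation should give the coefficients 67.0512, 0.2255 and 5.9001 shown above, and the predicted values are obtained by evaluating the fitted equation for each row.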



b) Discuss the normality tests available in Minitab?
Ans:

Types of normality tests


The following are types of normality tests that you can use to assess normality.

Anderson-Darling test
This test compares the ECDF (empirical cumulative distribution function) of your sample
data with the distribution expected if the data were normal. If the observed difference is sufficiently
large, you will reject the null hypothesis of population normality.

Ryan-Joiner normality test


This test assesses normality by calculating the correlation between your data and the normal
scores of your data. If the correlation coefficient is near 1, the population is likely to be normal. The
Ryan-Joiner statistic assesses the strength of this correlation; if it is less than the appropriate critical
value, you will reject the null hypothesis of population normality. This test is similar to the Shapiro-
Wilk normality test.
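
The idea behind the Ryan-Joiner statistic can be sketched in a few lines: sort the data, compute approximate normal scores, and correlate the two. The sketch below is a Python illustration under the assumption that the normal scores use plotting positions of the form (i - 3/8) / (n + 1/4); the critical values that Minitab compares the correlation with are not reproduced here. The function name is made up for this example.

```python
import math
from statistics import NormalDist, mean

def ryan_joiner(sample):
    """Correlation between the ordered data and approximate normal scores."""
    y = sorted(sample)
    n = len(y)
    std_normal = NormalDist()
    # Assumed plotting positions for the normal scores
    scores = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    y_bar, s_bar = mean(y), mean(scores)
    sxy = sum((a - y_bar) * (b - s_bar) for a, b in zip(y, scores))
    sxx = sum((a - y_bar) ** 2 for a in y)
    syy = sum((b - s_bar) ** 2 for b in scores)
    return sxy / math.sqrt(sxx * syy)

# Recall scores "with imagery" from Q4(b), reused here only as sample input
with_imagery = [20, 24, 20, 18, 22, 19, 20, 19, 17, 21, 17, 20]
rj = ryan_joiner(with_imagery)
print(f"correlation with normal scores: {rj:.4f}")
```

A value close to 1 is consistent with normality; a value below the appropriate critical value leads to rejection of the null hypothesis of population normality.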

Kolmogorov-Smirnov normality test


This test compares the ECDF (empirical cumulative distribution function) of your sample
data with the distribution expected if the data were normal. If this observed difference is sufficiently
large, the test will reject the null hypothesis of population normality. If the p-value of this test is less
than your chosen α, you can reject your null hypothesis and conclude that the population is
nonnormal.

Comparison of Anderson-Darling, Kolmogorov-Smirnov, and Ryan-Joiner normality tests
Anderson-Darling and Kolmogorov-Smirnov tests are based on the empirical distribution
function. Ryan-Joiner (similar to Shapiro-Wilk) is based on regression and correlation.

All three tests tend to work well in identifying a distribution as not normal when the
distribution is skewed. All three tests are less distinguishing when the underlying distribution is a t-
distribution and nonnormality is due to kurtosis. Usually, between the tests based on the empirical
distribution function, Anderson-Darling tends to be more effective in detecting departures in the tails
of the distribution. Usually, if departure from normality at the tails is the major problem, many
statisticians would use Anderson-Darling as the first choice.

NOTE: If you are checking normality to prepare for a normal capability analysis, the tails are the
most critical part of the distribution.



Q8. Perform the following tasks in R.
a) A psychology class performed an experiment to compare whether a recall score in which
instructions to form images of 25 words were given is better than an initial recall score for
which no images instruction were given. Twelve students participated in the experiment
with the following results:

With Imagery:    20 24 20 18 22 19 20 19 17 21 17 20
Without Imagery:  5  9  5  9  6 11  8 11  7  9  8 16

Does it appear that the average recall score is higher when imagery is used? Also
construct 95% confidence interval for the difference between the mean of both the imageries and
interpret the results.

Data Input
> withImagery=c(20, 24, 20, 18, 22, 19, 20, 19, 17, 21, 17, 20)
> withoutimg=c(5,9,5,9,6,11,8,11,7,9,8,16)

T test Command
> t.test(withImagery, withoutimg, paired = TRUE, alternative = "two.sided")

Paired t-test

data: withImagery and withoutimg


t = 10.365, df = 11, p-value = 5.159e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
8.729917 13.436750
sample estimates:
mean of the differences
11.08333



b) Consider the experiment in which two fair dice are tossed and the absolute difference of
dots is recorded. Simulate this experiment 600 times. Find the frequency distribution of
the absolute differences and find mean and variance of this distribution.

Frequency Distribution
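# Toss two fair dice 600 times and record the absolute difference of the dots
die1 <- sample(1:6, 600, replace = TRUE)
die2 <- sample(1:6, 600, replace = TRUE)
die_diff <- abs(die1 - die2)
# Frequency, relative frequency and cumulative frequencies
freq <- data.frame(table(die_diff))
relFreq <- data.frame(prop.table(table(die_diff)))
relFreq$Relative_Freq <- relFreq$Freq
relFreq$Freq <- NULL
Cumulative_Freq <- cumsum(table(die_diff))
z <- cbind(merge(freq, relFreq), Cumulative_Freq)
z$Cumulative_Relative_Freq <- z$Cumulative_Freq / sum(z$Freq)
print(z)
# Note: the table, mean and variance shown below came from a uniform draw,
# floor(runif(600, min=0, max=6)), so they do not follow the distribution of |die1 - die2|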
die_diff Freq Relative_Freq Cumulative_Freq Cumulative_Relative_Freq
0 0 94 0.1566667 94 0.1566667
1 1 104 0.1733333 198 0.3300000
2 2 108 0.1800000 306 0.5100000
3 3 99 0.1650000 405 0.6750000
4 4 122 0.2033333 527 0.8783333
5 5 73 0.1216667 600 1.0000000

Mean and Variance

mean(die_diff)
[1] 2.45

var(die_diff)
[1] 2.675292
