Assignment # 1 PDF

Department of Statistics
Program: M.Sc. Statistics
Course Number: 1569
Course Title: Data Analysis and Statistical Packages
Semester/Year: 3rd / 2019 Autumn
Instructor: Sir. Zahoor Ahmad
Assignment No.: 01
Submission Date:
Due Date:
Student Name: Muhamamd Hashim Javed
Student ID: BT-588221
Signature*:
Name: M. Hashim Javed Roll: BT-588221

Q1. Perform the following tasks in SPSS.
a) Use catalog.sav sample data file to fit a multiple linear model to predict "Sales of Men's
Clothing" on the basis of variables "Number of Catalogs Mailed", "Number of Pages in
Catalog", "Number of Phone Lines Open for Ordering", "Amount Spent on Print
Advertising" and "Number of Customer Service Representatives".
Use forward selection, backward elimination and enter methods in this respect. Interpret
your result in each case.
Variables Entered/Removed (a)
Model  Variables Entered  Variables Removed  Method
1      mail               .                  Forward (Criterion: Probability-of-F-to-enter <= .050)
2      phone              .                  Forward (Criterion: Probability-of-F-to-enter <= .050)
3      print              .                  Forward (Criterion: Probability-of-F-to-enter <= .050)
4      page               .                  Forward (Criterion: Probability-of-F-to-enter <= .050)
a. Dependent Variable: men
Model Summary (e)
Model  R         R Square  Adjusted R Square  Std. Error of the Estimate
1      .803 (a)  .645      .642               3785.49685
2      .877 (b)  .770      .766               3061.36064
3      .885 (c)  .784      .778               2980.12178
4      .891 (d)  .794      .787               2919.90929
a. Predictors: (Constant), mail
b. Predictors: (Constant), mail, phone
c. Predictors: (Constant), mail, phone, print
d. Predictors: (Constant), mail, phone, print, page
e. Dependent Variable: men
Conclusion:
Forward selection added one predictor at each of four steps. Step 1 entered 'mail' (adjusted
R² = 0.642), step 2 added 'phone' (adjusted R² = 0.766), step 3 added 'print' (adjusted R² = 0.778)
and step 4 added 'page' (adjusted R² = 0.787). The variable 'service' never met the entry criterion
(probability of F-to-enter <= .050), so the final model contains mail, phone, print and page.

ANOVA (a)
Model              Sum of Squares   df   Mean Square      F        Sig.
1   Regression     3069712621.002   1    3069712621.002   214.216  .000 (b)
    Residual       1690938397.841   118  14329986.422
    Total          4760651018.843   119
2   Regression     3664135333.036   2    1832067666.518   195.485  .000 (c)
    Residual       1096515685.807   117  9371928.939
    Total          4760651018.843   119
3   Regression     3730440424.702   3    1243480141.567   140.014  .000 (d)
    Residual       1030210594.141   116  8881125.812
    Total          4760651018.843   119
4   Regression     3780175938.309   4    945043984.577    110.844  .000 (e)
    Residual       980475080.535    115  8525870.266
    Total          4760651018.843   119
a. Dependent Variable: men
b. Predictors: (Constant), mail
c. Predictors: (Constant), mail, phone
d. Predictors: (Constant), mail, phone, print
e. Predictors: (Constant), mail, phone, print, page
Conclusion:
The overall F-test is significant at every step of the forward selection: the significance
value is 0.000 (p < 0.001) for the model with 'mail' alone, after adding 'phone', after adding
'print' and after adding 'page'. Since each p-value is less than 0.05, every one of the four models
explains a significant share of the variation in the dependent variable.
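The F statistics, R² and adjusted R² reported by SPSS can be recomputed from the sums of squares in the ANOVA table. The following is a minimal sketch in Python, not part of the SPSS output; the helper name `anova_summary` is chosen here for illustration, and n = 120 cases is assumed from the total df of 119.

```python
# Recompute F, R^2 and adjusted R^2 from the ANOVA sums of squares.
# Illustrative helper; the names are not SPSS terminology.

def anova_summary(ss_reg, ss_res, k, n):
    """Return (F, R^2, adjusted R^2) for a model with k predictors and n cases."""
    ms_reg = ss_reg / k
    ms_res = ss_res / (n - k - 1)
    ss_tot = ss_reg + ss_res
    f_stat = ms_reg / ms_res
    r2 = ss_reg / ss_tot
    adj_r2 = 1 - ms_res / (ss_tot / (n - 1))
    return f_stat, r2, adj_r2

N_CASES = 120  # assumed: total df (119) + 1

# (regression SS, residual SS, number of predictors) for models 1 to 4
models = [
    (3069712621.002, 1690938397.841, 1),
    (3664135333.036, 1096515685.807, 2),
    (3730440424.702, 1030210594.141, 3),
    (3780175938.309, 980475080.535, 4),
]

results = [anova_summary(ssr, sse, k, N_CASES) for ssr, sse, k in models]
for step, (f_stat, r2, adj_r2) in enumerate(results, start=1):
    print(f"Model {step}: F = {f_stat:.3f}, R2 = {r2:.3f}, adj R2 = {adj_r2:.3f}")
```

The printed values can be compared line by line with the Model Summary and ANOVA tables.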
Coefficients (a)
                  Unstandardized Coefficients  Standardized Coefficients
Model             B            Std. Error      Beta    t       Sig.
1   (Constant)    -14064.614   2099.365                -6.699  .000
    mail          2.991        .204            .803    14.636  .000
2   (Constant)    -15361.047   1705.559                -9.006  .000
    mail          1.971        .209            .529    9.424   .000
    phone         334.103      41.952          .447    7.964   .000
3   (Constant)    -20665.869   2554.586                -8.090  .000
    mail          1.862        .207            .500    8.977   .000
    phone         339.159      40.880          .454    8.296   .000
    print         .218         .080            .121    2.732   .007
4   (Constant)    -23898.558   2838.361                -8.420  .000
    mail          1.847        .203            .496    9.083   .000
    phone         327.802      40.329          .439    8.128   .000
    print         .208         .078            .115    2.656   .009
    page          50.508       20.912          .104    2.415   .017
a. Dependent Variable: men
Excluded Variables (a)
Model          Beta In    t       Sig.  Partial Correlation  Collinearity Statistics (Tolerance)
1   page       .149 (b)   2.773   .006  .248                 .980
    phone      .447 (b)   7.964   .000  .593                 .625
    print      .104 (b)   1.877   .063  .171                 .957
    service    .153 (b)   1.997   .048  .182                 .501
2   page       .110 (c)   2.496   .014  .226                 .968
    print      .121 (c)   2.732   .007  .246                 .955
    service    -.064 (c)  -.933   .353  -.086                .416
3   page       .104 (d)   2.415   .017  .220                 .965
    service    -.079 (d)  -1.183  .239  -.110                .413
4   service    -.072 (e)  -1.096  .275  -.102                .412
a. Dependent Variable: men
b. Predictors in the Model: (Constant), mail
c. Predictors in the Model: (Constant), mail, phone
d. Predictors in the Model: (Constant), mail, phone, print
e. Predictors in the Model: (Constant), mail, phone, print, page
Residuals Statistics (a)
                      Minimum      Maximum     Mean        Std. Deviation  N
Predicted Value       2636.2012    34103.5547  16242.8134  5636.14978      120
Residual              -8822.03613  9087.41895  .00000      2870.41572      120
Std. Predicted Value  -2.414       3.169       .000        1.000           120
Std. Residual         -3.021       3.112       .000        .983            120
a. Dependent Variable: men

b) Use bankloan.sav sample data file to fit a binary logistic regression model to predict
default on the basis of variables age, ed, income, debtinc, creddebt and othdebt.
Interpret your result.
Dependent Variable Encoding
Original Value Internal Value
No 0
Yes 1
Block 0: Beginning Block
Classification Table (a, b)
                                 Predicted default
Observed                         No    Yes   Percentage Correct
Step 0  default  No              517   0     100.0
                 Yes             183   0     .0
        Overall Percentage                   73.9
a. Constant is included in the model.
b. The cut value is .500
Variables in the Equation
B S.E. Wald df Sig. Exp(B)
Step 0 Constant -1.039 .086 145.782 1 .000 .354
Variables not in the Equation
                             Score    df  Sig.
Step 0  Variables  age       13.265   1   .000
                   ed        9.205    1   .002
                   income    3.526    1   .060
                   debtinc   106.238  1   .000
                   creddebt  41.928   1   .000
                   othdebt   14.863   1   .000
        Overall Statistics   148.310  6   .000

Block 1: Method = Enter
Omnibus Tests of Model Coefficients
               Chi-square  df  Sig.
Step 1  Step   153.662     6   .000
        Block  153.662     6   .000
        Model  153.662     6   .000
Model Summary
Step  -2 Log likelihood  Cox & Snell R Square  Nagelkerke R Square
1     650.702 (a)        .197                  .289
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.
Classification Table (a)
                                 Predicted default
Observed                         No    Yes   Percentage Correct
Step 1  default  No              483   34    93.4
                 Yes             122   61    33.3
        Overall Percentage                   77.7
a. The cut value is .500
Variables in the Equation
                     B       S.E.  Wald    df  Sig.  Exp(B)
Step 1 (a)  age      -.047   .014  10.632  1   .001  .954
            ed       .392    .105  13.802  1   .000  1.480
            income   -.013   .007  3.077   1   .079  .987
            debtinc  .111    .027  16.986  1   .000  1.117
            creddebt .341    .088  15.186  1   .000  1.407
            othdebt  -.069   .062  1.221   1   .269  .933
            Constant -1.198  .563  4.523   1   .033  .302
a. Variable(s) entered on step 1: age, ed, income, debtinc, creddebt, othdebt.
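To help read the logistic regression output, the sketch below (Python, not SPSS) checks that each Exp(B) is the exponential of its coefficient B, recomputes the Step 1 classification percentages, and shows how the fitted coefficients turn predictor values into a probability of default. The function name `predicted_probability` and the example applicant are hypothetical and serve only as an illustration.

```python
import math

# Coefficients (B) and reported odds ratios Exp(B) from "Variables in the Equation", Step 1.
coefficients = {
    "age": (-0.047, 0.954),
    "ed": (0.392, 1.480),
    "income": (-0.013, 0.987),
    "debtinc": (0.111, 1.117),
    "creddebt": (0.341, 1.407),
    "othdebt": (-0.069, 0.933),
    "Constant": (-1.198, 0.302),
}

# Exp(B) is exp(B); small gaps arise because B is printed to 3 decimals.
odds_ratios = {name: math.exp(b) for name, (b, _) in coefficients.items()}

def predicted_probability(case):
    """Probability of default for a dict of predictor values (logistic link)."""
    logit = coefficients["Constant"][0]
    for name, value in case.items():
        logit += coefficients[name][0] * value
    return 1 / (1 + math.exp(-logit))

# Step 1 classification table: rows observed (No, Yes), columns predicted (No, Yes).
table = [[483, 34], [122, 61]]
total = sum(sum(row) for row in table)
overall_correct = 100 * (table[0][0] + table[1][1]) / total
correct_no = 100 * table[0][0] / sum(table[0])    # observed No classified as No
correct_yes = 100 * table[1][1] / sum(table[1])   # observed Yes classified as Yes

# Hypothetical applicant; the values are invented purely for illustration.
example_case = {"age": 30, "ed": 2, "income": 40, "debtinc": 10,
                "creddebt": 1.5, "othdebt": 3}
print(round(predicted_probability(example_case), 3))
print(round(overall_correct, 1), round(correct_no, 1), round(correct_yes, 1))
```

An odds ratio above 1 (ed, debtinc, creddebt) means the odds of default rise as the predictor increases, while a value below 1 (age) means they fall. The classification figures also show that, although 77.7% of cases are classified correctly overall, only 33.3% of actual defaulters are identified at the .500 cut value.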

c) Find frequency distribution of "Preferred breakfast" for those senior citizens who
are also living an active life (Use sample data set cereal.sav)
Ans:
First step is to select those cases only, for which the life-style is recorded as active. [active=1]
bfast * agecat Crosstabulation
Count
                          agecat
bfast          Under 31  31-45  46-60  Over 60  Total
Breakfast Bar  58        60     23     12       153
Oatmeal        2         12     31     57       102
Cereal         51        45     38     17       151
Total          111       117    92     86       406
The frequency distribution for senior citizens who are living an active life is the "Over 60" column:
Breakfast Bar 12, Oatmeal 57 and Cereal 17, out of 86 cases. Senior citizens who are living an active
life therefore prefer Oatmeal most often.
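As a small illustration, the counts for active respondents aged over 60 can be turned into a percentage distribution. This is a sketch in Python rather than SPSS output; the variable names are chosen here for illustration.

```python
# Preferred breakfast among active respondents aged over 60
# (the "Over 60" column of the crosstabulation).
over_60_active = {"Breakfast Bar": 12, "Oatmeal": 57, "Cereal": 17}

total = sum(over_60_active.values())
percentages = {name: 100 * count / total for name, count in over_60_active.items()}
most_common = max(over_60_active, key=over_60_active.get)

for name, count in over_60_active.items():
    print(f"{name:<14}{count:>4}{percentages[name]:>8.1f}%")
print("Most common:", most_common)
```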

d) Carry out a test of independence of attributes "Preferred breakfast" and "marital
status". (Use cereal.sav)
Case Processing Summary
                 Valid          Missing        Total
                 N    Percent   N    Percent   N    Percent
bfast * marital  880  100.0%    0    0.0%      880  100.0%
bfast * marital Crosstabulation
Count
                     marital
bfast          Unmarried  Married  Total
Breakfast Bar  108        123      231
Oatmeal        95         215      310
Cereal         100        239      339
Total          303        577      880
Chi-Square Tests
                              Value       df  Asymp. Sig. (2-sided)
Pearson Chi-Square            21.157 (a)  2   .000
Likelihood Ratio              20.623      2   .000
Linear-by-Linear Association  16.226      1   .000
N of Valid Cases              880
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 79.54.
Conclusion:
As the p-value (0.000) is less than our chosen significance level (α = 0.05), we reject the null
hypothesis of independence and conclude that there is an association between Preferred Breakfast
and Marital Status.
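The Pearson chi-square statistic can be reproduced by hand from the crosstabulation: each expected count is (row total × column total) / N, and the statistic sums (observed − expected)² / expected over all cells. A minimal Python sketch follows; the function name `chi_square_independence` is illustrative and is not an SPSS procedure.

```python
def chi_square_independence(observed):
    """Pearson chi-square statistic, degrees of freedom and minimum expected count."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (count - expected) ** 2 / expected
            min_expected = min(min_expected, expected)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df, min_expected

# Rows: Breakfast Bar, Oatmeal, Cereal; columns: Unmarried, Married.
bfast_by_marital = [[108, 123], [95, 215], [100, 239]]
chi2, df, min_expected = chi_square_independence(bfast_by_marital)
print(f"chi-square = {chi2:.3f}, df = {df}, minimum expected count = {min_expected:.2f}")
```

The same function accepts a table with any number of rows and columns, so it can also be applied to a 3 x 4 crosstabulation such as bfast by agecat.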

e) Carry out a test of independence of attributes "Preferred breakfast" and "Age
category". (Use cereal.sav)
Case Processing Summary
                Valid          Missing        Total
                N    Percent   N    Percent   N    Percent
bfast * agecat  880  100.0%    0    0.0%      880  100.0%
bfast * agecat Crosstabulation
Count
                          agecat
bfast          Under 31  31-45  46-60  Over 60  Total
Breakfast Bar  84        90     39     18       231
Oatmeal        4         24     97     185      310
Cereal         93        92     95     59       339
Total          181       206    231    262      880
Chi-Square Tests
                              Value        df  Asymp. Sig. (2-sided)
Pearson Chi-Square            309.336 (a)  6   .000
Likelihood Ratio              350.688      6   .000
Linear-by-Linear Association  4.986        1   .026
N of Valid Cases              880
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 47.51.
Conclusion:
As the p-value (0.000) is less than our chosen significance level (α = 0.05), we reject the null
hypothesis of independence and conclude that there is an association between Preferred Breakfast
and Age Category.

f) Use grocery_coupons.sav sample data file to test that the mean of amount spent is
equal to 105. Find 90% confidence interval for the mean of amount spent. Also test
that the mean of amount spent by both male and female customers is equal. What
would you say about the equality of means for amount spent on stores of different
sizes?
Ans:
Part 1: Use grocery_coupons.sav sample data file to test that the mean of amount spent is
equal to 105. Find 90% confidence interval for the mean of amount spent.
T-Test
One-Sample Statistics
          N     Mean     Std. Deviation  Std. Error Mean
amtspent  1404  99.9338  48.54435        1.29555
One-Sample Test (Test Value = 105)
                                                          90% Confidence Interval of the Difference
          t       df    Sig. (2-tailed)  Mean Difference  Lower    Upper
amtspent  -3.910  1403  .000             -5.06621         -7.1986  -2.9338
Conclusion:
As the p-value (0.000) is less than our chosen significance level (α = 0.1), it is concluded that
the mean of Amount Spent is significantly different from 105. The interval reported by SPSS,
(-7.1986, -2.9338), is the 90% confidence interval for the difference between the mean and the test
value; adding 105 gives the 90% confidence interval for the mean of Amount Spent, (97.8014, 102.0662).
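The one-sample results can be checked from the summary statistics alone. The sketch below is written in Python and uses a normal critical value in place of the t critical value, which is an approximation that works closely here only because df = 1403 is large; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# One-Sample Statistics for amtspent
n, mean, sd = 1404, 99.9338, 48.54435
test_value = 105

se = sd / math.sqrt(n)                # standard error of the mean
t_stat = (mean - test_value) / se     # one-sample t statistic

# Normal approximation to the t critical value (assumption; df = 1403 is large).
z = NormalDist().inv_cdf(0.95)        # two-sided 90% interval
ci_mean = (mean - z * se, mean + z * se)
ci_difference = (ci_mean[0] - test_value, ci_mean[1] - test_value)

print(f"SE = {se:.5f}, t = {t_stat:.3f}")
print(f"90% CI for the mean: ({ci_mean[0]:.2f}, {ci_mean[1]:.2f})")
print(f"90% CI for the difference: ({ci_difference[0]:.2f}, {ci_difference[1]:.2f})")
```

The interval for the mean is the interval for the difference shifted by the test value of 105, which is why SPSS reports negative limits.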
Part 2: Also test that the mean of amount spent by both male and female customers is
equal.
T-TEST GROUPS=gender(0 1)
/MISSING=ANALYSIS
/VARIABLES=amtspent
/CRITERIA=CI(.95).
T-Test
Group Statistics
          gender  N    Mean      Std. Deviation  Std. Error Mean
amtspent  Male    740  107.5761  49.09908        1.80492
          Female  664  91.4168   46.49620        1.80440
Independent Samples Test (amtspent)
Levene's Test for Equality of Variances: F, Sig.; t-test for Equality of Means: remaining columns
                             F     Sig.  t      df        Sig. (2-tailed)  Mean Difference  Std. Error Difference  95% CI Lower  95% CI Upper
Equal variances assumed      .458  .499  6.313  1402      .000             16.15930         2.55971                11.13803      21.18058
Equal variances not assumed              6.332  1397.923  .000             16.15930         2.55218                11.15280      21.16581
Conclusion:
We use the top row (Equal variances assumed) because the p-value of Levene's test (0.499) is above 0.05.
As the p-value of the t-test (0.000) is less than our chosen significance level (α = 0.05), it is
concluded that the mean of Amount Spent is significantly different for male and female customers. The
95% confidence interval for the difference between the two means is (11.138, 21.181).
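The "Equal variances assumed" row can be reproduced from the Group Statistics table with the pooled-variance formula. This is a minimal Python sketch; the function name `pooled_t` is illustrative.

```python
import math

# Group Statistics: (n, mean, standard deviation)
male = (740, 107.5761, 49.09908)
female = (664, 91.4168, 46.49620)

def pooled_t(group1, group2):
    """Pooled-variance two-sample t test from summary statistics."""
    n1, m1, s1 = group1
    n2, m2, s2 = group2
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = m1 - m2
    return diff, se, diff / se, df

diff, se, t_stat, df = pooled_t(male, female)
print(f"difference = {diff:.4f}, SE = {se:.5f}, t = {t_stat:.3f}, df = {df}")
```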
Part 3: What would you say about the equality of means for amount spent on stores of
different sizes?
ONEWAY amtspent BY size

/MISSING ANALYSIS.
Oneway
ANOVA
amtspent
                Sum of Squares  df    Mean Square  F      Sig.
Between Groups  20053.727       2     10026.864    4.275  .014
Within Groups   3286191.636     1401  2345.604
Total           3306245.363     1403
Conclusion:
As the p-value (0.014) is less than our chosen significance level (α = 0.05), it is concluded
that the mean of Amount Spent is not the same for stores of different sizes; at least one store
size has a mean that is significantly different from the others.

Q3 is a group assignment (Group size not more than 5 students).
a) Create a Questionnaire having 10-15 questions to collect data from 30 respondents.
Topic may be of your choice. Submit the soft copy of questionnaire through class
representative not later than 20th October 2019.
b) Enter your data in SPSS. Naming of variables and data types should be appropriate.
c) Carry out data analysis of this data set.
d) Submit report of the statistical analysis of this data in MS Word.
WHODAS 2.0
WORLD HEALTH ORGANIZATION DISABILITY ASSESSMENT
SCHEDULE 2.0
12-item version, self-administered

This questionnaire asks about difficulties due to health conditions. Health conditions include diseases or illnesses, other
health problems that may be short or long lasting, injuries, mental or emotional problems, and problems with alcohol or
drugs.
Think back over the past 30 days and answer these questions, thinking about how much difficulty you had doing the
following activities. For each question, please circle only one response.
In the past 30 days, how much difficulty did you have in:
Response options for each of S1 to S12: None / Mild / Moderate / Severe / Extreme or cannot do
S1 Standing for long periods such as 30 minutes?
S2 Taking care of your household responsibilities?
S3 Learning a new task, for example, learning how to get to a new place?
S4 How much of a problem did you have joining in community activities (for example, festivities,
religious or other activities) in the same way as anyone else can?
S5 How much have you been emotionally affected by your health problems?
S6 Concentrating on doing something for ten minutes?
S7 Walking a long distance such as a kilometre [or equivalent]?
S8 Washing your whole body?
S9 Getting dressed?
S10 Dealing with people you do not know?
S11 Maintaining a friendship?
S12 Your day-to-day work?
H1 Overall, in the past 30 days, how many days were these difficulties present? (Record number of days)
H2 In the past 30 days, for how many days were you totally unable to carry out your usual activities
or work because of any health condition? (Record number of days)
H3 In the past 30 days, not counting the days that you were totally unable, for how many days did you
cut back or reduce your usual activities or work because of any health condition? (Record number of days)
Descriptive Statistics
                    N   Minimum  Maximum  Mean   Std. Deviation
overal_score        30  27       50       35.47  5.776
Valid N (listwise)  30
Independent Samples Test (overal_score)
Levene's Test for Equality of Variances: F, Sig.; t-test for Equality of Means: remaining columns
                             F     Sig.  t     df      Sig. (2-tailed)  Mean Difference  Std. Error Difference  95% CI Lower  95% CI Upper
Equal variances assumed      .050  .825  .052  28      .959             .125             2.427                  -4.846        5.096
Equal variances not assumed              .054  13.517  .958             .125             2.328                  -4.884        5.134
ANOVA
overal_score
                Sum of Squares  df  Mean Square  F     Sig.
Between Groups  367.133         15  24.476       .571  .853
Within Groups   600.333         14  42.881
Total           967.467         29

Q4. Perform the following tasks in Minitab.
a) In order to ascertain the age distribution of operatives in a certain industry, random
samples of 1720 males and 1230 females are drawn. The sample means and standard
deviations were 33.93 years and 14.20 years for the males and 27.44 years and 10.79
years for the females. Calculate the 95 percent confidence interval for
i. The mean age of all the male operatives.
i – Ans:
Variable  N     Mean    StDev   SE Mean  95% CI
C1        1720  34.173  14.047  0.339    (33.509, 34.837)
ii. The differences between their mean ages.
ii – Ans:
Estimate for difference: 6.204
95% CI for difference: (5.305, 7.104)
T-Test of difference = 0 (vs not =): T-Value = 13.52 P-Value = 0.000 DF = 2930
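The Minitab output above reports a sample mean of 34.173 and an estimated difference of 6.204, whereas the question gives 33.93 and implies a difference of 6.49, so the output was presumably computed from generated data rather than from the stated summaries (this is an inference, not something the original says). As a cross-check, the sketch below computes both intervals directly from the summary statistics in the question, using Python and a normal critical value, which is a reasonable assumption for samples this large.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)   # about 1.96 for a 95% interval

# Summary statistics as given in the question.
n_m, mean_m, sd_m = 1720, 33.93, 14.20   # males
n_f, mean_f, sd_f = 1230, 27.44, 10.79   # females

# i. Mean age of all male operatives.
se_m = sd_m / math.sqrt(n_m)
ci_male = (mean_m - z * se_m, mean_m + z * se_m)

# ii. Difference between the mean ages (unpooled standard error).
diff = mean_m - mean_f
se_diff = math.sqrt(sd_m ** 2 / n_m + sd_f ** 2 / n_f)
ci_diff = (diff - z * se_diff, diff + z * se_diff)

print(f"95% CI, male mean age: ({ci_male[0]:.3f}, {ci_male[1]:.3f})")
print(f"95% CI, difference:    ({ci_diff[0]:.3f}, {ci_diff[1]:.3f})")
```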
b) A psychology class performed an experiment to compare whether a recall score in
which instructions to form images of 25 words were given is better than an initial
recall score for which no imagery instructions were given. Twelve students
participated in the experiment with the following results:
With Imagery     20  24  20  18  22  19  20  19  17  21  17  20
Without Imagery   5   9   5   9   6  11   8  11   7   9   8  16
Does it appear that the average recall score is higher when imagery is used? Also
construct a 95% confidence interval for the difference between the two means
and interpret the results.
Ans:
Paired T-Test and CI: With Imagery, Without Imagery
Paired T for With Imagery - Without Imagery
N Mean StDev SE Mean

With Imagery 12 19.750 2.006 0.579
Without Imagery 12 8.667 3.055 0.882
Difference 12 11.08 3.70 1.07
95% CI for mean difference: (8.73, 13.44)

T-Test of mean difference = 0 (vs not = 0): T-Value = 10.37 P-Value = 0.000
Conclusion:
The mean recall score is 19.75 with imagery and 8.667 without imagery, so the score is clearly
higher when imagery is used. The paired t-test confirms this: the p-value (0.000) for the mean
difference is less than our chosen significance level (α = 0.05), so it is concluded that the mean
difference between Recall Score with Imagery and Recall Score without Imagery is significant. The
95% confidence interval for the mean difference of Recall Score is (8.73, 13.44); the whole interval
lies above zero, which again indicates higher recall with imagery.
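The paired t-test works on the twelve within-student differences. The following Python sketch reproduces the statistic and the interval; the critical value 2.201 (two-sided 95%, 11 df) is taken from a t table rather than computed, and the variable names are illustrative.

```python
import math
from statistics import mean, stdev

with_imagery = [20, 24, 20, 18, 22, 19, 20, 19, 17, 21, 17, 20]
without_imagery = [5, 9, 5, 9, 6, 11, 8, 11, 7, 9, 8, 16]

# Paired design: analyse the difference for each student.
differences = [w - wo for w, wo in zip(with_imagery, without_imagery)]
n = len(differences)
mean_diff = mean(differences)
se_diff = stdev(differences) / math.sqrt(n)
t_stat = mean_diff / se_diff

# Two-sided 95% critical value of t with n - 1 = 11 df (table value, not computed here).
T_CRIT_11_DF = 2.201
ci = (mean_diff - T_CRIT_11_DF * se_diff, mean_diff + T_CRIT_11_DF * se_diff)

print(f"mean difference = {mean_diff:.2f}, SE = {se_diff:.2f}, t = {t_stat:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```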
c) Generate 4 samples of sizes 5, 6, 7 and 7 from normal populations with means 45, 40, 47
and 38 respectively. While the standard deviations of these distributions are 4, 6, 7 and 8
respectively. Test the equality of means.
Ans:
One-way ANOVA: Population versus Factor
Source DF SS MS F P
Factor 3 595.3 198.4 4.48 0.014
Error 21 929.3 44.3
Total 24 1524.6
S = 6.652 R-Sq = 39.05% R-Sq(adj) = 30.34%
Level  N  Mean    StDev
1      5  47.891  3.939
2      6  36.004  6.483
3      7  47.847  9.382
4      7  41.373  4.636
(Character plot: individual 95% CIs for each mean based on pooled StDev, axis from 36.0 to 54.0)
Pooled StDev = 6.652
Conclusion:
The F statistic is 4.48 and the p-value (0.014) is less than our chosen significance level
(α = 0.05), so it is concluded that not all of the means are equal; at least one of the means is
significantly different from the others.

Q5.
a) Explain the procedure for testing of equality of several means in Minitab and SPSS.
b) Use Minitab/SPSS to test equality of means for the following experiment of wheat yield for
different varieties. Varieties are shown by A, B, C, D and E.
A (8) B (5.3) C (4.1) D (5) E (16)

D (6.8) A (4.9) B (4.1) C (3.2) E (18)
B (6.3) E (16) C (4.7) D (4.0) A (5.0)
C (5.7) D (3.3) E (25) A (4.0) B (4.2)
E (18) C (4.7) A (4.2) D (6.6) B (6.2)
Ans:
One-way ANOVA: resp versus fac
Source DF SS MS F P
fac 4 740.14 185.03 44.59 0.000
Error 20 83.00 4.15
Total 24 823.13
S = 2.037 R-Sq = 89.92% R-Sq(adj) = 87.90%
Level  N  Mean    StDev
1      5  5.220   1.613
2      5  5.220   1.052
3      5  4.480   0.918
4      5  5.140   1.549
5      5  18.600  3.715
(Character plot: individual 95% CIs for each mean based on pooled StDev, axis from 5.0 to 20.0)
Pooled StDev = 2.037
Conclusion:
The F statistic is 44.59 and the p-value (0.000) is less than our chosen significance level
(α = 0.05), so it is concluded that not all of the means are equal; at least one of the means is
significantly different from the others.
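The one-way ANOVA F statistic for the wheat data can be recomputed from the yields grouped by variety. In the sketch below (Python, with illustrative names) the yields are read off the layout in part b); the group means match Levels 1 to 5 of the Minitab output, which suggests that the levels correspond to varieties A to E in that order.

```python
from statistics import mean

# Yields grouped by variety, read off the layout in part b).
yields = {
    "A": [8, 4.9, 5.0, 4.0, 4.2],
    "B": [5.3, 4.1, 6.3, 4.2, 6.2],
    "C": [4.1, 3.2, 4.7, 5.7, 4.7],
    "D": [5, 6.8, 4.0, 3.3, 6.6],
    "E": [16, 18, 16, 25, 18],
}

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a dict of samples."""
    all_values = [x for g in groups.values() for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

f_stat, df_between, df_within = one_way_anova(yields)
print(f"F = {f_stat:.2f} on {df_between} and {df_within} df")
```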

Q6.
a) Consider the experiment in which two fair dice are tossed and the absolute
difference of dots is recorded. Simulate this experiment 600 times using minitab.
Find the frequency distribution of the absolute differences and find mean and
variance of this distribution.
Ans:
Commands to generate 600 times the absolute difference of two dice:

MTB > Random 600 'abs_diff';
SUBC> Integer 0 5.
Tabulated statistics: abs_diff
Rows: abs_diff
Count % of Total
0 104 17.33
1 99 16.50
2 110 18.33
3 89 14.83
4 90 15.00
5 108 18.00
All 600 100.00
Descriptive Statistics: abs_diff
Total
Variable Count Mean StDev Variance
abs_diff 600 2.4767 1.7333 3.0045
Frequency Distribution Chart
(Bar chart of abs_diff: x-axis abs_diff, values 0 to 5; y-axis Count, scale 0 to 120)
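Note that the subcommand `Integer 0 5` draws each value uniformly from 0 to 5, whereas the absolute difference of two fair dice is not uniform (a difference of 1 is the most likely outcome and 5 the least). A minimal sketch of simulating the two dice separately is given below in Python rather than Minitab; the seed value is arbitrary and only makes the run repeatable.

```python
import random
from collections import Counter
from statistics import mean, variance

random.seed(1569)  # arbitrary seed so that the run is repeatable

# Roll two fair dice 600 times and record the absolute difference each time.
abs_diff = [abs(random.randint(1, 6) - random.randint(1, 6)) for _ in range(600)]

frequency = Counter(abs_diff)
for value in range(6):
    count = frequency[value]
    print(f"{value}: {count:3d}  ({100 * count / 600:.2f}%)")

print("mean:", round(mean(abs_diff), 4))
print("variance:", round(variance(abs_diff), 4))
```

With this approach the simulated mean should fall near 1.94 rather than near 2.5, which is the mean of a uniform draw from 0 to 5.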

b) Compare the statistical packages SPSS and Minitab with respect to statistical data
analysis in social sciences and physical sciences.
Ans:
Comparison of Minitab and IBM SPSS

1. Easy to learn and easy to use. SPSS is menu-driven and very easy to use. As in
Minitab, most of the functionality in SPSS is organized into pull-down menus in a very
intuitive way. The learning curves for SPSS and Minitab are similar.
2. SPSS is generally stronger in statistical analysis, especially in some specific areas, such as
ANOVA-related procedures. The add-on modules give SPSS further flexibility and
room to extend its capabilities. For cutting-edge statistical analysis as well, SPSS is
stronger than Minitab. So SPSS is most suitable if your work involves large datasets,
frequent data management, and intermediate to partially advanced statistical analysis.
3. SPSS Statistics is loaded with powerful analytic techniques and time-saving features to
help you quickly and easily find new insights in your data, so you can make more accurate
predictions and achieve better outcomes for your organization.
4. View interactive SPSS Statistics output on smart devices (smartphones and tablets) and
Generate presentation-ready output quickly and easily
5. Enhanced Monte Carlo simulation to improve model accuracy with

a. Ability to fit a categorical distribution to string fields
b. Support for Automatic Linear Modeling (ALM)
c. Generate heat maps automatically when displaying scatterplots in which the
target or the input, or both, are categorical
d. Automatically determine and use associations between categorical inputs when
generating data for those inputs
e. Generating data in the absence of a predictive model
6. SPSS Advanced Statistics offers generalized linear mixed models (GLMM), general linear
models (GLM), mixed models procedures, generalized linear models (GENLIN) and
generalized estimating equations (GEE) procedures.

Comparison of Minitab and IBM SPSS in ANOVA
Product One-way Two-way MANOVA GLM Mixed model Post-hoc Latin squares
Minitab Yes Yes Yes Yes No Yes Yes
SPSS Yes Yes Yes Yes Yes Yes Yes
Comparison of Minitab and IBM SPSS in Regression
Product  OLS  WLS  2SLS  NLLS  Logistic  GLM  LAD  Stepwise  Quantile  Probit  Cox  Poisson  MLR
Minitab  Yes  Yes  No    Yes   Yes       No   No   Yes       No        No      No   No       No
SPSS     Yes  Yes  Yes   Yes   Yes       Yes  Yes  Yes       Yes       Yes     Yes  Yes      Yes
OLS: Ordinary Least Squares
WLS: Weighted Least Squares
2SLS: 2 Stage Least Squares
NLLS: Non Linear Least Squares
LAD: Least Absolute Deviation
GLM: Generalized Linear Models
MLR: Multiple Linear Regression
Comparison of Minitab and IBM SPSS for Operation System Support
Product Windows Mac OS Linux BSD Unix
Minitab Yes No No No No
SPSS Yes Yes Yes No No

Comparison of Minitab and IBM SPSS for Charts and Diagrams
Chart Bar chart Box plot Correlogram Histogram Line chart Scatterplot
Minitab Yes Yes Yes Yes Yes Yes
SPSS Yes Yes Yes Yes Yes Yes
Capability (group)                                          Minitab  SPSS
Base stat (Descriptive statistics)                          Yes      Yes
Normality test (Descriptive statistics)                     Yes      Yes
CTA (Nonparametric statistics)                              Yes      Yes
Nonparametric comparison, ANOVA (Nonparametric statistics)  Yes      Yes
Quality control                                             Yes      Yes
Survival analysis                                           Yes      Yes
Cluster analysis (Data processing)                          Yes      Yes
Discriminant analysis (Data processing)                     Yes      Yes
BDP (Data processing)                                       Yes      Yes
Ext. (Data processing)                                      Yes      Yes

Q7.
a) Perform regression analysis to predict trade on the basis of other two variables on sample
dataset Employ.MTW. Also use matrix approach to do the same task. Furthermore,
calculate predicted values.
Ans:
Regression Analysis: Trade versus Food, Metals
The regression equation is

Trade = 67.1 + 0.225 Food + 5.90 Metals
Predictor Coef SE Coef T P

Constant 67.05 22.03 3.04 0.004
Food 0.2255 0.2336 0.97 0.339
Metals 5.9001 0.5034 11.72 0.000
S = 11.2084 R-Sq = 74.8% R-Sq(adj) = 74.0%
Analysis of Variance
Source DF SS MS F P
Regression 2 21302 10651 84.78 0.000
Residual Error 57 7161 126
Total 59 28463
Source DF Seq SS
Food 1 4046
Metals 1 17257
Unusual Observations
Obs Food Trade Fit SE Fit Residual St Resid

2 53.0 317.00 340.38 1.87 -23.38 -2.12R
60 57.7 396.00 363.86 1.95 32.14 2.91R
R denotes an observation with a large standardized residual.

Residuals vs Fits for Trade
(Plot of Residual against Fitted Value, response is Trade; residuals range from about -30 to 40 and
fitted values from about 310 to 390)
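Conclusion: R-Sq for this fitted model is 74.8% (adjusted 74.0%), which means that Food and
Metals together explain about three quarters of the variation in Trade, and the overall regression
is significant (F = 84.78, P = 0.000). Metals is a significant predictor (P = 0.000), whereas Food
is not (P = 0.339). On this basis the model is considered a reasonably good fit.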
Using Matrix Approach for Regression

MTB > %"E:\AIOU\M.Sc\3rd Semester\1569\macros\regMATRIX.mac" 'Trade' 'Food' 'Metals'
Executing from file: E:\AIOU\M.Sc\3rd Semester\1569\macros\regMATRIX.mac
This macro is to find coefficients of a regression problem with

two independent variables. Dependent variable is stored in C1,
while independent variables are stored in C2 and C3
coefficients
Matrix m7
67.0512
0.2255
5.9001
coefficients by regr command
f
67.0512 0.2255 5.9001
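The contents of regMATRIX.mac are not shown, so the following is only a sketch of what a matrix approach usually means: solving the normal equations (X'X) b = X'y, where X has a leading column of ones. It is written in Python with the standard library, all function names are illustrative, and the data are hypothetical (not Employ.MTW), built so that the true coefficients are known.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols(x_rows, y):
    """Least-squares coefficients solving (X'X) b = X'y, intercept first."""
    x = [[1.0] + list(row) for row in x_rows]
    xt = transpose(x)
    xtx = matmul(xt, x)
    xty = [sum(a * b for a, b in zip(col, y)) for col in xt]
    return solve(xtx, xty)

# Hypothetical data (not Employ.MTW), built so that y = 1 + 2*x1 + 3*x2 exactly.
x_rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in x_rows]

coefficients = ols(x_rows, y)
predicted = [coefficients[0] + coefficients[1] * x1 + coefficients[2] * x2
             for x1, x2 in x_rows]
print([round(c, 4) for c in coefficients])
```

Predicted values follow from multiplying X by the coefficient vector, as in the last step of the sketch.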

b) Discuss the normality tests available in Minitab?
Ans:
Types of normality tests

The following are types of normality tests that you can use to assess normality.
Anderson-Darling test
This test compares the ECDF (empirical cumulative distribution function) of your sample
data with the distribution expected if the data were normal. If the observed difference is
sufficiently large, you will reject the null hypothesis of population normality.
Ryan-Joiner normality test

This test assesses normality by calculating the correlation between your data and the normal
scores of your data. If the correlation coefficient is near 1, the population is likely to be normal. The
Ryan-Joiner statistic assesses the strength of this correlation; if it is less than the appropriate critical
value, you will reject the null hypothesis of population normality. This test is similar to the
Shapiro-Wilk normality test.
Kolmogorov-Smirnov normality test

This test compares the ECDF (empirical cumulative distribution function) of your sample
data with the distribution expected if the data were normal. If this observed difference is
sufficiently large, the test will reject the null hypothesis of population normality. If the p-value of
this test is less than your chosen α, you can reject your null hypothesis and conclude that the
population is nonnormal.
Comparison of Anderson-Darling, Kolmogorov-Smirnov, and Ryan-Joiner normality tests

Anderson-Darling and Kolmogorov-Smirnov tests are based on the empirical distribution
function. Ryan-Joiner (similar to Shapiro-Wilk) is based on regression and correlation.
All three tests tend to work well in identifying a distribution as not normal when the
distribution is skewed. All three tests are less discriminating when the underlying distribution is a
t-distribution and nonnormality is due to kurtosis. Of the two tests based on the empirical
distribution function, Anderson-Darling tends to be more effective in detecting departures in the tails
of the distribution. If departure from normality at the tails is the major problem, many
statisticians would use Anderson-Darling as the first choice.
NOTE: If you are checking normality to prepare for a normal capability analysis, the tails are the
most critical part of the distribution.
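To make the ECDF-based idea concrete, the sketch below computes a Kolmogorov-Smirnov style statistic: the largest gap between the sample ECDF and a normal CDF. It is written in Python, estimates the normal mean and standard deviation from the sample itself (an assumption of this sketch), computes the statistic only and no p-value, and uses hypothetical samples; it is not Minitab's implementation.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic(sample):
    """Largest gap between the sample ECDF and a normal CDF fitted to the sample."""
    data = sorted(sample)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    d = 0.0
    for i, x in enumerate(data, start=1):
        cdf = fitted.cdf(x)
        # The ECDF jumps from (i - 1)/n to i/n at x; check both sides of the jump.
        d = max(d, abs(i / n - cdf), abs(cdf - (i - 1) / n))
    return d

# Hypothetical samples for illustration only.
near_normal = [NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
skewed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]

print("near-normal sample:", round(ks_statistic(near_normal), 3))
print("skewed sample:     ", round(ks_statistic(skewed), 3))
```

A small statistic means the ECDF tracks the normal curve closely; the heavily skewed sample gives a much larger gap.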

Q8. Perform the following tasks in R.
a) A psychology class performed an experiment to compare whether a recall score in which
instructions to form images of 25 words were given is better than an initial recall score for
which no imagery instructions were given. Twelve students participated in the experiment
with the following results:
With Imagery     20  24  20  18  22  19  20  19  17  21  17  20
Without Imagery   5   9   5   9   6  11   8  11   7   9   8  16
Does it appear that the average recall score is higher when imagery is used? Also
construct a 95% confidence interval for the difference between the two means and
interpret the results.
Data Input
> withImagery=c(20, 24, 20, 18, 22, 19, 20, 19, 17, 21, 17, 20)
> withoutimg=c(5,9,5,9,6,11,8,11,7,9,8,16)
T test Command
> t.test(withImagery, withoutimg, paired = TRUE, alternative = "two.sided")
Paired t-test
data: withImagery and withoutimg

t = 10.365, df = 11, p-value = 5.159e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
8.729917 13.436750
sample estimates:
mean of the differences
11.08333

b) Consider the experiment in which two fair dice are tossed and the absolute difference of
dots is recorded. Simulate this experiment 600 times. Find the frequency distribution of
the absolute differences and find mean and variance of this distribution.
Frequency Distribution
die_diff = floor(runif(600, min=0, max=6))
freq <- data.frame(table(die_diff))
relFreq <- data.frame(prop.table(table(die_diff)))
relFreq$Relative_Freq <- relFreq$Freq
relFreq$Freq <- NULL
Cumulative_Freq <- cumsum(table(die_diff))
z <- cbind(merge(freq, relFreq), Cumulative_Freq)
z$Cumulative_Relative_Freq <- z$Cumulative_Freq / sum(z$Freq)
print(z)
die_diff Freq Relative_Freq Cumulative_Freq Cumulative_Relative_Freq
0 0 94 0.1566667 94 0.1566667
1 1 104 0.1733333 198 0.3300000
2 2 108 0.1800000 306 0.5100000
3 3 99 0.1650000 405 0.6750000
4 4 122 0.2033333 527 0.8783333
5 5 73 0.1216667 600 1.0000000
Mean and Variance
mean(die_diff)
[1] 2.45
var(die_diff)
[1] 2.675292
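For reference, the exact distribution of the absolute difference of two fair dice can be obtained by enumerating the 36 equally likely outcomes. Note that `floor(runif(600, min=0, max=6))` draws uniformly from 0 to 5, so the frequencies above are roughly equal, whereas the exact distribution is not uniform. The sketch below uses Python with exact fractions; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of two fair dice.
counts = Counter(abs(a - b) for a in range(1, 7) for b in range(1, 7))
pmf = {d: Fraction(c, 36) for d, c in sorted(counts.items())}

exact_mean = sum(d * p for d, p in pmf.items())
exact_var = sum(d * d * p for d, p in pmf.items()) - exact_mean ** 2

# Expected frequencies in 600 tosses.
expected_freq_600 = {d: float(600 * p) for d, p in pmf.items()}

for d, p in pmf.items():
    print(d, p, expected_freq_600[d])
print("mean:", exact_mean, "variance:", exact_var)
```

The exact mean is 35/18 (about 1.94) and the exact variance is 665/324 (about 2.05), which is what a simulation based on two separate dice should approach.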
