
Multiple Regression

Q.2 A bank has collected data on 15 credit card transactions (during a month) of its HNI customers. The
bank wants to launch a new credit card reward point scheme and wants to estimate the amount spent
on the basis of monthly number of transactions and monthly income. The data file and output are
given below.

The variables are: amount spent in '000 in a month (Amtspent); number of transactions in a month
(transactions); whether the customer has been abroad (Abroad: 1 = been abroad in the month,
0 = not been abroad in the month; this variable appears under the name "location" in the output
below); and monthly income in '000 (Income).

Regression
Variables Entered/Removed

Model   Variables Entered                         Variables Removed   Method
1       income in 000, location, transactionsa    .                   Enter

a. All requested variables entered.


Model Summaryb

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .995a   .989       .986                5.769                        1.966

a. Predictors: (Constant), income in 000, location, transactions

b. Dependent Variable: Amt spent in '000

ANOVAb

Model Sum of Squares df Mean Square F Sig.

1 Regression 33346.057 3 11115.352 333.974 .000a

Residual 366.103 11 33.282

Total 33712.160 14

a. Predictors: (Constant), income in 000, location, transactions

b. Dependent Variable: Amt spent in '000

Coefficientsa

                 Unstandardized           Standardized                  95.0% CI for B               Collinearity
Model            B        Std. Error      Beta        t       Sig.      Lower Bound   Upper Bound    Tolerance   VIF

1 (Constant) 13.426 7.563 1.775 .103 -3.219 30.071

transactions 2.112 .815 .256 2.592 .025 .318 3.907 .400 2.500

location 4.368 .167 .445 5.845 .004 7.004 15.740 .346 2.888

income in 000 .372 .052 .713 7.215 .000 .259 .486 .500 2.000

a. Dependent Variable: Amt spent in '000


Interpret the output. Mention the Regression equation.
The regression equation is
Amount spent = 13.426 + 2.112*transactions + 4.368*location + 0.372*Income

Comment (giving reasons) on whether the regression model can be used for prediction.

Yes, the model can be used for prediction because:

1) All variables are quantitative or binary.

2) All quantitative variables are normally distributed (the K-S test significance value is > 0.05 for all
variables, so the null hypothesis that they are normal is accepted).
3) The coefficient of determination R2 is high (0.989).
4) The Durbin-Watson statistic for autocorrelation is 1.966, which is within the required range of 1.5 to 2.5, indicating no threat of
autocorrelation.
5) The ANOVA null hypothesis is rejected (Sig. value 0.000), indicating at least one independent variable is related to the dependent variable.
6) All individual t-tests on the slopes are significant, so every slope differs from zero.
7) Since the normal probability plot of the residuals shows that the points lie around the diagonal, the residuals are normally distributed.
8) The partial regression plots show that the independent variables have a linear relationship with the dependent variable.
9) The scatter plot of residuals shows no threat of heteroscedasticity.
State null and alternate hypothesis & interpretation of the hypothesis wherever necessary.
Interpret the meaning of each of the slopes.
From the Coefficients table above:
- transactions (b = 2.112): holding the other variables constant, each additional monthly transaction increases the estimated amount spent by 2.112 thousand (Rs 2,112).
- location/Abroad (b = 4.368): holding the other variables constant, a customer who has been abroad in the month spends 4.368 thousand (Rs 4,368) more than one who has not.
- income (b = 0.372): holding the other variables constant, each additional Rs 1,000 of monthly income increases the estimated amount spent by 0.372 thousand (Rs 372).

Does the regression coefficient b0 have any practical meaning in the context of this problem? Why?
If the model can be used, estimate the amount spent by a customer in the next month, if his estimated
transactions are 30, his monthly income is Rs 400,000, and he is going abroad in the next month.
Also find the 95% confidence interval for the same.

                     Point estimate        Lower bound          Upper bound
                     x        b            x        b           x        b
(Constant)           1        13.426       1        -3.219      1        30.071
transactions         30       2.112        30       0.318       30       3.907
location (abroad)    1        4.368        1        7.004       1        15.740
income in '000       400      0.372        400      0.259       400      0.486
Estimate                      229.954               116.925              357.421
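The arithmetic above can be reproduced with a short script (a sketch using only the coefficients and bounds reported in the output, no libraries assumed):

```python
# Predict next month's spend for: 30 transactions, income Rs 400 ('000),
# going abroad (location = 1), using the reported coefficients.
x = [1, 30, 1, 400]  # constant, transactions, location/abroad, income in '000

b_point = [13.426, 2.112, 4.368, 0.372]   # B column
b_lower = [-3.219, 0.318, 7.004, 0.259]   # 95% CI lower bounds
b_upper = [30.071, 3.907, 15.740, 0.486]  # 95% CI upper bounds

def dot(xs, bs):
    return sum(xi * bi for xi, bi in zip(xs, bs))

point = dot(x, b_point)   # 229.954 ('000 Rs)
lower = dot(x, b_lower)   # 116.925
upper = dot(x, b_upper)   # 357.421
print(point, lower, upper)
```

Note that applying the coefficient bounds term by term, as the answer does, is a shortcut; a formal prediction interval would use the standard error of the forecast instead.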
Factor Analysis
A financial analysis firm conducted a study on factors affecting consumers' selection of financial
products. The questions in the survey were asked on a rating scale of 1 to 7. The variables considered are
listed below.

1. Depth of products and services to meet the range of your investment needs
2. Ability to resolve problems
3. Multiple providers' products to choose from
4. Quality of advice
5. Knowledge of representatives or advisors you deal with
6. Representative knowing your overall situation and needs
7. Degree to which my provider knows me
8. Likelihood to Recommend Primary Provider to Someone I Know
9. Likelihood of Continuing to Use Primary Provider at Least at the Same Level as Now

A factor analysis was conducted on the data and the results are given below
KMO and Bartlett's Test

Kaiser-Meyer-Olkin Measure of Sampling Adequacy. .811


Bartlett's Test of Sphericity Approx. Chi-Square 1520.784
df 36
Sig. .000

Communalities

Initial Extraction
Depth of products and services to meet the .489 .696
range of your investment needs
Ability to resolve problems .431 .516
Multiple providers' products to choose from .441 .543
Quality of advice .546 .608
Knowledge of representatives or advisors .586 .635
you deal with
Representative knowing your overall .600 .723
situation and needs
Degree to which my provider knows me .576 .624
Likelihood to Recommend Primary Provider .388 .623
to Someone I Know
Likelihood of Continuing to Use Primary .357 .575
Provider at Least at the Same Level as Now

Extraction Method: Principal Axis Factoring.

Total Variance Explained

         Initial Eigenvalues                    Extraction Sums of Squared Loadings   Rotation Sums of Squared Loadings
Factor   Total   % of Variance   Cumulative %   Total   % of Var.   Cumulative %      Total   % of Var.   Cumulative %
1        3.929   43.657          43.657         3.549   39.432      39.432            2.512   27.910      27.910
2        1.527   16.964          60.620         1.127   12.523      51.955            1.702   18.907      46.818
3        1.172   13.024          73.645         .766    8.512       60.466            1.228   13.649      60.466



4 .519 5.768 79.413

5 .480 5.339 84.751

6 .403 4.480 89.232

7 .389 4.324 93.556

8 .335 3.728 97.283

9 .244 2.717 100.000

Extraction Method: Principal Axis Factoring.

Factor Matrixa

Factor
1 2 3
Depth of products and services to meet the range of your investment needs .680 .076 .477
Ability to resolve problems .654 -.022 .297
Multiple providers' products to choose from .509 .056 .425
Quality of advice .725 -.237 -.162
Knowledge of representatives or advisors you deal with .768 -.182 -.113
Representative knowing your overall situation and needs .754 -.103 -.380
Degree to which my provider knows me .749 -.085 -.237
Likelihood to Recommend Primary Provider to Someone I Know .337 .700 -.136
Likelihood of Continuing to Use Primary Provider at Least at the Same Level .209 .721 -.105
as Now

Extraction Method: Principal Axis Factoring.

a. 3 factors extracted. 12 iterations required.

Rotated Factor Matrixa

Factor
1 2 3
Depth of products and services to meet the range of .251 .786 .124
your investment needs
Ability to resolve problems .358 .620 .061
Multiple providers' products to choose from .151 .643 .078
Quality of advice .727 .279 -.034
Knowledge of representatives or advisors you deal with .718 .346 .018
Representative knowing your overall situation and needs .829 .123 .145
Degree to which my provider knows me .741 .238 .132
Likelihood to Recommend Primary Provider to Someone .131 .113 .770
I Know
Likelihood of Continuing to Use Primary Provider at .007 .065 .756
Least at the Same Level as Now

Extraction Method: Principal Axis Factoring.


Rotation Method: Varimax with Kaiser Normalization.

a. Rotation converged in 5 iterations.

Answer following questions


a) Interpret the output
b) Identify and name the factors
c) Is there any variable/s which should not be included in the analysis?
d) Can you say that factor analysis is relevant in this case? Why?
(8 Marks)
A) Bartlett's test is significant and KMO is > 0.5, indicating that factor analysis can be performed on the
data. All extraction communalities are > 0.5. The cumulative extracted variance is 60.466%, which is
> 60%. The rotated factor matrix shows a simple structure for all variables (each variable loads on
exactly one factor).
B)
Factor
1 2 3
Depth of products and services to meet the range of your investment .251 .786 .124
needs
Ability to resolve problems .358 .620 .061
Multiple providers' products to choose from .151 .643 .078
Quality of advice .727 .279 -.034
Knowledge of representatives or advisors you deal with .718 .346 .018
Representative knowing your overall situation and needs .829 .123 .145
Degree to which my provider knows me .741 .238 .132
Likelihood to Recommend Primary Provider to Someone I Know .131 .113 .770
Likelihood of Continuing to Use Primary Provider at Least at the .007 .065 .756
Same Level as Now

Factor 1 (Interface Related): Quality of advice; Knowledge of representatives or advisors you deal with; Representative knowing your overall situation and needs; Degree to which my provider knows me
Factor 2 (Product Related): Depth of products and services to meet the range of your investment needs; Ability to resolve problems; Multiple providers' products to choose from
Factor 3 (Loyalty/Repeat): Likelihood to Recommend Primary Provider to Someone I Know; Likelihood of Continuing to Use Primary Provider at Least at the Same Level as Now

Factor 1 is the dominant and most important factor, Factor 2 is second, and Factor 3 is last.
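The grouping above can be checked by assigning each variable to the factor with its highest rotated loading (a sketch using the rotated loadings from the output; variable names abbreviated for brevity):

```python
# Rotated factor loadings (Factor 1, Factor 2, Factor 3) for the nine variables.
loadings = {
    "Depth of products and services":        (0.251, 0.786, 0.124),
    "Ability to resolve problems":           (0.358, 0.620, 0.061),
    "Multiple providers' products":          (0.151, 0.643, 0.078),
    "Quality of advice":                     (0.727, 0.279, -0.034),
    "Knowledge of representatives":          (0.718, 0.346, 0.018),
    "Representative knowing your situation": (0.829, 0.123, 0.145),
    "Degree to which my provider knows me":  (0.741, 0.238, 0.132),
    "Likelihood to Recommend":               (0.131, 0.113, 0.770),
    "Likelihood of Continuing":              (0.007, 0.065, 0.756),
}

groups = {1: [], 2: [], 3: []}
for var, row in loadings.items():
    best = max(range(3), key=lambda i: row[i])  # index of the highest loading
    groups[best + 1].append(var)

for f, members in groups.items():
    print(f"Factor {f}: {members}")
```

Factor 1 collects the four interface-related items, Factor 2 the three product-related items, and Factor 3 the two loyalty items, matching the naming table above.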

C) Is there any variable which should not be included in the analysis? No.

Since the anti-image matrix is not given, one cannot check the diagonal elements of the anti-image
correlation matrix (which should be > 0.5). However, all extraction communalities are > 0.5, and
each variable loads on exactly one factor (only one factor loading is > 0.5 and the rest are < 0.4).

D) Yes.
Bartlett's test is significant and KMO is > 0.5, indicating that factor analysis can be performed on the
data. All extraction communalities are > 0.5. The cumulative extracted variance is 60.466%, which is
> 60%. The rotated factor matrix shows a simple structure for all variables (each variable loads on
exactly one factor).
Discriminant Analysis
Q. 4. A manager dealing with clients' outstanding payments wanted to classify the clients into defaulters and
non-defaulters. A client is considered a defaulter if the payment is not received within 2
months of delivery. He collects past data on the following parameters:
1. Years: number of years the client has been in the business
2. Credit_Rating: the credit rating of the client (1 = very bad, 100 = excellent)
3. Client_Rating: the overall rating of the client (1 = very bad, 100 = excellent)
4. Monthly_Turnover: monthly turnover of the client (in Crores)
Variable not_defaulted: 1 => not defaulted, 0 => defaulted
The Output of analysis is given below

Discriminant
Analysis Case Processing Summary

Unweighted Cases                                                      N     Percent
Valid                                                                 19    100.0
Excluded    Missing or out-of-range group codes                       0     .0
            At least one missing discriminating variable              0     .0
            Both missing or out-of-range group codes and at least
            one missing discriminating variable                       0     .0
            Total                                                     0     .0
Total                                                                 19    100.0

Group Statistics

Valid N (listwise)

not_default Mean Std. Deviation Unweighted Weighted

.00 Years 6.1429 1.86445 7 7.000

Credit_rating 33.5714 15.99851 7 7.000

Client_Rating 36.7143 19.58498 7 7.000

Monthly_Turnover 247.7143 142.05834 7 7.000

1.00 Years 7.7750 .60019 12 12.000

Credit_rating 57.8333 13.09291 12 12.000

Client_Rating 61.1667 9.93463 12 12.000

Monthly_Turnover 367.0000 94.37835 12 12.000

Total Years 7.1737 1.42589 19 19.000

Credit_rating 48.8947 18.29358 19 19.000

Client_Rating 52.1579 18.30380 19 19.000

Monthly_Turnover 323.0526 125.16011 19 19.000

Tests of Equality of Group Means

Wilks' Lambda F df1 df2 Sig.



Years .678 8.067 1 17 .011

Credit_rating .568 12.931 1 17 .002

Client_Rating .562 13.268 1 17 .002

Monthly_Turnover .777 4.882 1 17 .041

Analysis 1
Box's Test of Equality of Covariance Matrices
Log Determinants

not_default Rank Log Determinant

.00 4 19.986

1.00 4 17.521

Pooled within-groups 4 19.776

The ranks and natural logarithms of determinants printed are those of the group covariance matrices.

Test Results

Box's M 23.556

F Approx. 1.652

df1 10

df2 723.305

Sig. .088

Tests null hypothesis of equal population covariance matrices.
Canonical Discriminant Function
Coefficients

Function

Years .257

Credit_rating .038

Client_Rating .028

Monthly_Turnover .001

(Constant) -5.522

Unstandardized coefficients

Summary of Canonical Discriminant Functions


Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          1.208a       100.0           100.0          .740

a. First 1 canonical discriminant functions were used in the analysis.

Wilks' Lambda

Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .453            11.881       4    .018

Standardized Canonical Discriminant Function Coefficients

Function

Years .311

Credit_rating .534

Client_Rating .391

Monthly_Turnover .137

Structure Matrix

Function

Client_Rating .804

Credit_rating .794

Years .627

Monthly_Turnover .488

Pooled within-groups correlations between discriminating variables and standardized canonical
discriminant functions. Variables ordered by absolute size of correlation within function.

Functions at Group Centroids

not_default   Function 1
.00           -1.361
1.00          .794

Unstandardized canonical discriminant functions evaluated at group means.
Classification Statistics
Classification Processing Summary

Processed                                                     19
Excluded    Missing or out-of-range group codes               0
            At least one missing discriminating variable      0
Used in Output                                                19

Prior Probabilities for Groups

                      Cases Used in Analysis
not_default   Prior   Unweighted   Weighted
.00           .500    7            7.000
1.00          .500    12           12.000
Total         1.000   19           19.000

Classification Resultsb,c

                                Predicted Group Membership
                  not_default   .00      1.00     Total
Original   Count  .00           6        1        7
                  1.00          1        11       12
           %      .00           85.7     14.3     100.0



1.00 8.3 91.7 100.0

Cross-validateda Count .00 6 1 7

1.00 2 10 12

% .00 85.7 14.3 100.0

1.00 16.7 83.3 100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is
classified by the functions derived from all cases other than that case.

b. 89.5% of original grouped cases correctly classified.

c. 84.2% of cross-validated grouped cases correctly classified.

Answer following questions

a) Which multivariate technique is used and why?


Discriminant analysis is used as the dependent variable is qualitative and all independent
variables are quantitative.
b) State the function
Canonical Discriminant Function
Coefficients

Function

Years .257

Credit_rating .038

Client_Rating .028

Monthly_Turnover .001

(Constant) -5.522

Unstandardized coefficients
DF = -5.522 + 0.257 x Years + 0.038 x Credit_rating + 0.028 x Client_Rating + 0.001 x Monthly_Turnover
(since all independent variables are significant)
c) How would you assess the model? Which test statistics are used to test the accuracy of the model?
1) High value of the canonical correlation (.740)
2) Wilks' Lambda is significant
3) The ANOVA (tests of equality of group means) is rejected for all variables
4) Classification-table accuracy: the cross-validated accuracy of 84.2% exceeds the required 66.83%, so the required accuracy is attained.

Group               N      Probability
Defaulted (0)       7      0.368421
Not defaulted (1)   12     0.631579
Total               19
Random (proportional chance) accuracy = 0.368421^2 + 0.631579^2 = 0.534626
25% improvement over random accuracy = 1.25 x 0.534626 = 0.668283
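The chance-accuracy figures above can be reproduced directly (a sketch; group sizes taken from the output):

```python
# Proportional chance accuracy for the two groups (7 defaulters, 12 non-defaulters).
n0, n1 = 7, 12
n = n0 + n1
p0, p1 = n0 / n, n1 / n

random_accuracy = p0**2 + p1**2      # ~ 0.534626
required = 1.25 * random_accuracy    # 25% improvement, ~ 0.668283
cross_validated = 16 / 19            # 84.2% of cross-validated cases correct

print(round(random_accuracy, 6), round(required, 6))
print(cross_validated > required)    # model beats the required accuracy
```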

d) Interpret the classification table. (Covered above: 89.5% of original and 84.2% of cross-validated cases were correctly classified, both above the required accuracy.)


e) How would you classify a client having years of business 8, credit rating 80, client rating 60,
and monthly turnover 400?

Functions at Group Centroids
not_default   Function 1
0             -1.361   (N1 = 7)
1             0.794    (N2 = 12)

Weighted cutting score = (-1.361 x 12 + 0.794 x 7) / 19 = -0.56705

Substituting the given values in the DF equation gives a score of 1.654, which is greater than -0.56705; hence the
client is classified as not defaulted (since not defaulted is coded 1 and defaulted 0).
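A minimal sketch of this classification step, using the unstandardized coefficients and group centroids from the output:

```python
# Discriminant score for the new client, then comparison with the weighted cutoff.
coef = {"Years": 0.257, "Credit_rating": 0.038,
        "Client_Rating": 0.028, "Monthly_Turnover": 0.001}
const = -5.522

client = {"Years": 8, "Credit_rating": 80,
          "Client_Rating": 60, "Monthly_Turnover": 400}

score = const + sum(coef[k] * client[k] for k in coef)   # 1.654

# Weighted cutting score: each centroid is weighted by the OTHER group's size.
z0, n0 = -1.361, 7    # group 0 (defaulted)
z1, n1 = 0.794, 12    # group 1 (not defaulted)
cutoff = (z0 * n1 + z1 * n0) / (n0 + n1)                 # about -0.56705

label = 1 if score > cutoff else 0
print(score, cutoff, label)  # classified as not defaulted (1)
```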

f) How would you test the assumption of homogeneity of variance? Using Box's M: since it is not
significant (Sig. = .088), the covariance matrices are homogeneous.
Logistic Regression
Q.2 A bank wanted to decide on credit card applications on the basis of the following variables of its
customers:
Savings - savings in Rs thousands
Income - income in Rs thousands
Time_current_Res - months spent at current residence
Time_Current_Job - months spent at current job
The dependent variable is Default: 1 => the customer will be a defaulter, 0 => the customer will
not be a defaulter. The output is given below.
Logistic Regression
Dependent Variable
Encoding
Original
Value Internal Value
.00 0
1.00 1
Block 0: Beginning Block
Classification Tablea,b

Predicted
Default
Percentage
Observed .00 1.00 Correct
Step 0 Default .00 36 0 100.0
1.00 14 0 .0
Overall Percentage 72.0
a. Constant is included in the model.
b. The cut value is .500
Variables in the Equation

B S.E. Wald df Sig. Exp(B)


Step 0 Constant -.944 .315 8.991 1 .003 .389
Variables not in the Equation

Score df Sig.
Step 0 Variables Savings 19.121 1 .000
Income 9.185 1 .002
Time_current_Res 2.704 1 .100
Time_Current_Job 1.320 1 .251
Overall Statistics 31.701 4 .000

Block 1: Method = Enter


Omnibus Tests of Model Coefficients

Chi-square df Sig.
Step 1 Step 39.883 4 .000
Block 39.883 4 .000
Model 39.883 4 .000
Model Summary
Cox & Snell R Nagelkerke R
Step -2 Log likelihood Square Square
1 19.412a .550 .791
a. Estimation terminated at iteration number 7 because parameter
estimates changed by less than .001.
Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 10.666 8 .221
Contingency Table for Hosmer and Lemeshow Test

Default = .00 Default = 1.00


Observed Expected Observed Expected Total
Step 1 1 5 4.997 0 .003 5
2 5 4.983 0 .017 5
3 5 4.958 0 .042 5
4 5 4.923 0 .077 5
5 4 4.891 1 .109 5
6 5 4.751 0 .249 5
7 5 4.003 0 .997 5
8 1 1.962 4 3.038 5
9 1 .454 4 4.546 5
10 0 .079 5 4.921 5

Classification Tablea

Predicted
Default
Percentage
Observed .00 1.00 Correct
Step 1 Default .00 35 2 94.44
1.00 2 13 85.71
Overall Percentage 92.0
a. The cut value is .500
Variables in the Equation

B S.E. Wald df Sig. Exp(B)


Step 1a Savings -.035 .011 9.496 1 .002 .966
Income -.002 .001 6.722 1 .010 .998
Time_current_Res -.006 .032 .039 1 .843 .994
Time_Current_Job -.036 .035 1.004 1 .316 .965
Constant 9.808 3.520 7.766 1 .005 18186.145
a. Variable(s) entered on step 1: Savings, Income, Time_current_Res, Time_Current_Job.

Logistic Regression: Forward Stepwise run (the Dependent Variable Encoding and Block 0 output for this run are identical to the Enter run above)

Block 1: Method = Forward Stepwise (Wald)


Omnibus Tests of Model Coefficients

Chi-square df Sig.
Step 1 Step 25.098 1 .000
Block 25.098 1 .000
Model 25.098 1 .000
Step 2 Step 13.340 1 .000
Block 38.439 2 .000
Model 38.439 2 .000
Model Summary
Cox & Snell R Nagelkerke R
Step -2 Log likelihood Square Square
1 34.197a .395 .568
2 20.857b .536 .772
a. Estimation terminated at iteration number 6 because parameter
estimates changed by less than .001.
b. Estimation terminated at iteration number 7 because parameter
estimates changed by less than .001.
Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 10.659 8 .222
2 4.157 8 .843
Contingency Table for Hosmer and Lemeshow Test

Default = .00 Default = 1.00


Observed Expected Observed Expected Total
Step 1 1 5 4.989 0 .011 5
2 5 4.963 0 .037 5
3 5 4.907 0 .093 5
4 6 5.773 0 .227 6
5 4 4.482 1 .518 5
6 5 3.997 0 1.003 5
7 2 3.446 4 2.554 6
8 0 1.778 5 3.222 5
9 3 1.176 2 3.824 5
10 1 .488 2 2.512 3
Step 2 1 5 4.996 0 .004 5
2 5 4.985 0 .015 5
3 5 4.968 0 .032 5
4 5 4.933 0 .067 5
5 5 4.895 0 .105 5
6 4 4.672 1 .328 5
7 5 3.916 0 1.084 5
8 1 1.857 4 3.143 5
9 1 .615 4 4.385 5
10 0 .162 5 4.838 5

Classification Tablea

Predicted
Default
Percentage
Observed .00 1.00 Correct
Step 1 Default .00 32 4 88.9
1.00 3 11 78.6
Overall Percentage 86.0
Step 2 Default .00 35 1 97.2
1.00 2 12 85.7
Overall Percentage 94.0
a. The cut value is .500
Variables in the Equation

B S.E. Wald df Sig. Exp(B)


Step 1a   Savings    -.030   .009   11.310   1   .001   .970
          Constant   1.643   .707   5.405    1   .020   5.172
Step 2b Savings -.034 .011 10.484 1 .001 .966
Income -.002 .001 6.882 1 .009 .998
Constant 8.100 2.787 8.446 1 .004 3293.574
a. Variable(s) entered on step 1: Savings.
b. Variable(s) entered on step 2: Income.
Variables not in the Equation

Score df Sig.
Step 1 Variables Income 11.623 1 .001
Time_current_Res 1.933 1 .164
Time_Current_Job .677 1 .411
Overall Statistics 12.671 3 .005
Step 2 Variables Time_current_Res .385 1 .535
Time_Current_Job 1.372 1 .242
Overall Statistics 1.405 2 .495
Answer following questions
a) State the function
Since stepwise logistic regression is used, the last model is the most significant and should be considered; hence the
independent variables in the equation are Savings and Income.
The model is

p = 1 / (1 + e^(-y)),  where y = 8.100 - 0.034 * Savings - 0.002 * Income
b) How would you assess the model? Which test statistics are used to test the accuracy of the model?
The Hosmer and Lemeshow test is not significant, indicating that the logistic regression model is a good fit.
Nagelkerke R Square is > 0.5, indicating a good relationship.
Since stepwise selection is used, there is no need to check the significance of individual variables.
The classification table shows accuracy above the 25% improvement over chance accuracy.

c) Interpret the classification table. Can this model be used for prediction? Yes.

                Predicted .00   Predicted 1.00   Total   Proportion
Not default     35              1                36      0.72
Default         2               12               14      0.28
Total                                            50
Chance accuracy = 0.72^2 + 0.28^2 = 0.5968
25% improvement over chance accuracy = 0.746
The overall classification accuracy at Step 2 (94.0%) exceeds 74.6%, so the model can be used for prediction.
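A quick check of these chance-accuracy numbers (a sketch; counts taken from the Step 2 classification table):

```python
# Proportional chance accuracy for the logistic model (36 non-defaulters, 14 defaulters).
n_nondefault, n_default = 36, 14
n = n_nondefault + n_default
p0, p1 = n_nondefault / n, n_default / n   # 0.72, 0.28

chance_accuracy = p0**2 + p1**2            # 0.5968
required = 1.25 * chance_accuracy          # 0.746
model_accuracy = (35 + 12) / n             # 94.0% correct at Step 2

print(chance_accuracy, required, model_accuracy > required)
```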
e) How would you classify a customer having Savings = Rs 50,000, Income = Rs 6 lakhs, months spent
at current residence = 20, and time at current job = 3 years? Ans: classified as a defaulter, as p > 0.5.
f) How would you classify a customer having Savings = Rs 10,000, Income = Rs 5 lakhs, months spent at
current residence = 25, and time at current job = 2 years? Classified as a defaulter, as p > 0.5.

                      Coefficient   Customer (e)   Customer (f)
Savings ('000)        -0.034        50             10
Income ('000)         -0.002        600            500
Constant              8.1           1              1
y                                   5.2            6.76
exp(-y)                             0.005517       0.001159
p = 1/(1+exp(-y))                   0.994514       0.998842
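These probabilities follow from the Step 2 model (a sketch; savings and income are entered in Rs thousands, as the variables are coded):

```python
import math

# Step 2 logistic model: y = 8.100 - 0.034*Savings - 0.002*Income (both in Rs '000).
def p_default(savings, income):
    y = 8.100 - 0.034 * savings - 0.002 * income
    return 1 / (1 + math.exp(-y))

p_e = p_default(50, 600)   # customer e: Rs 50,000 savings, Rs 6 lakh income
p_f = p_default(10, 500)   # customer f: Rs 10,000 savings, Rs 5 lakh income

print(round(p_e, 6), round(p_f, 6))  # both > 0.5 -> classified as defaulters
```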
Cluster Analysis
An investment company has shortlisted 25 of its corporate customers which it wants to classify into
homogeneous groups and understand more about the groups. A cluster analysis was performed on the
data, and an ANOVA was done on the output of the cluster analysis.
What are the groups formed? Who are the members of each group (give case numbers)? What would
you conclude from the ANOVA? The output is given as "Company". (8 Marks)

Output for “Company”

Cluster

Centroid Linkage
Agglomeration Schedule
Cluster Combined Stage Cluster First Appears
Stage Cluster 1 Cluster 2 Coefficients Cluster 1 Cluster 2 Next Stage
1 21 23 .017 0 0 4
2 24 25 .035 0 0 14
3 14 15 .068 0 0 8
4 20 21 .092 0 1 7
5 9 19 .093 0 0 7
6 10 22 .101 0 0 9
7 9 20 .139 5 4 8
8 9 14 .248 7 3 12
9 10 17 .262 6 0 10
10 10 12 .298 9 0 13
11 6 16 .342 0 0 15
12 9 18 .467 8 0 13
13 9 10 .571 12 10 19
14 13 24 .650 0 2 19
15 6 11 .798 11 0 17
16 7 8 1.013 0 0 22
17 1 6 1.107 0 15 21
18 3 5 1.355 0 0 20
19 9 13 1.809 13 14 24
20 2 3 2.006 0 18 23
21 1 4 2.945 17 0 22
22 1 7 3.101 21 16 23
23 1 2 6.602 22 20 24
24 1 9 10.575 23 19 0
Dendrogram
* * * * * H I E R A R C H I C A L C L U S T E R A N A L Y S I S * * * *
Dendrogram using Centroid Method
Rescaled Distance Cluster Combine
C A S E 0 5 10 15 20 25
Label Num +---------+---------+---------+---------+---------+
Case 21 21 -+
Case 23 23 -+
Case 20 20 -+
Case 9 9 -+-+
Case 19 19 -+ |
Case 14 14 -+ |
Case 15 15 -+ |
Case 18 18 ---+-----+
Case 10 10 -+ | |
Case 22 22 -+ | |
Case 17 17 -+-+ +---------------------------------------+
Case 12 12 -+ | |
Case 24 24 -+-+ | |
Case 25 25 -+ +-----+ |
Case 13 13 ---+ |
Case 3 3 -------+-+ |
Case 5 5 -------+ +---------------------+ |
Case 2 2 ---------+ | |
Case 7 7 -----+---------+ +-----------------+
Case 8 8 -----+ | |
Case 6 6 -+-+ +---------------+
Case 16 16 -+ +-+ |
Case 11 11 ---+ +-------+ |
Case 1 1 -----+ +-+
Case 4 4 -------------+

Oneway
Descriptives

                                                                 95% Confidence Interval for Mean
                  N    Mean       Std. Deviation   Std. Error    Lower Bound   Upper Bound        Minimum   Maximum
turnover     1    7    60596.16   7201.233         2721.810      53936.13      67256.19           53262     75041
             2    3    66879.92   5023.982         2900.597      54399.66      79360.18           62256     72226
             3    15   24659.93   12400.521        3201.801      17792.75      31527.11           10000     55051
         Total    25   39788.47   21583.268        4316.654      30879.34      48697.61           10000     75041
Salary_cost  1    7    16857.14   5984.106         2261.779      11322.77      22391.52           11000     28000
             2    3    23333.33   3785.939         2185.813      13928.54      32738.13           19000     26000
             3    15   4300.00    3400.630         878.039       2416.79       6183.21            1000      12000
         Total    25   10100.00   8551.316         1710.263      6570.19       13629.81           1000      28000
Fixed_cost   1    7    10142.86   4180.453         1580.063      6276.58       14009.13           5000      16000
             2    3    27000.00   2645.751         1527.525      20427.59      33572.41           24000     29000
             3    15   3133.33    1302.013         336.178       2412.30       3854.36            1000      6000
         Total    25   7960.00    8197.967         1639.593      4576.05       11343.95           1000      29000
Profit       1    7    8571.43    4429.339         1674.133      4474.97       12667.88           3000      13000
             2    3    16000.00   4582.576         2645.751      4616.25       27383.75           11000     20000
             3    15   1900.00    1227.657         316.980       1220.15       2579.85            1000      5000
         Total    25   5460.00    5671.420         1134.284      3118.95       7801.05            1000      20000

Test of Homogeneity of Variances

Levene Statistic df1 df2 Sig.

turnover 1.436 2 22 .259

Salary_cost 1.839 2 22 .183

Fixed_cost 3.220 2 22 .040

Profit 5.398 2 22 .045

ANOVA

Sum of Squares df Mean Square F Sig.

turnover Between Groups 8.666E9 2 4.333E9 37.910 .000

Within Groups 2.514E9 22 1.143E8

Total 1.118E10 24

Salary_cost Between Groups 1.350E9 2 6.748E8 36.617 .000

Within Groups 4.054E8 22 1.843E7

Total 1.755E9 24

Fixed_cost Between Groups 1.470E9 2 7.352E8 113.430 .000

Within Groups 1.426E8 22 6481385.281

Total 1.613E9 24

Profit Between Groups 5.911E8 2 2.956E8 35.963 .000

Within Groups 1.808E8 22 8218831.169

Total 7.720E8 24

Answer

Q. What are the groups formed?


A. The output shows a three-cluster solution. Though the maximum jump in the agglomeration coefficients
occurs for the 2-cluster solution, the 3-cluster solution also has a considerable jump, and between the
2-cluster and 3-cluster solutions the 3-cluster solution is preferred.
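The jumps can be read off the last stages of the agglomeration schedule (a sketch using the coefficients from the output; stopping before stage k leaves 25 - k + 1 clusters):

```python
# Agglomeration coefficients for the final stages (stage 20 onward).
coefficients = {20: 2.006, 21: 2.945, 22: 3.101, 23: 6.602, 24: 10.575}

stages = sorted(coefficients)
# Jump in the coefficient at each stage relative to the previous stage.
jumps = {cur: round(coefficients[cur] - coefficients[prev], 3)
         for prev, cur in zip(stages, stages[1:])}

for stage, jump in jumps.items():
    clusters = 25 - stage + 1  # clusters left if we stop before this merge
    print(f"stage {stage}: jump {jump} -> stop here for {clusters} clusters")
```

The largest jump (3.973, at the final merge) points to the 2-cluster solution and the next largest (3.501) to the 3-cluster solution, consistent with the answer above.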

Q.Who are the members of the group (give case number)?

A.

Cluster 1 Cluster 2 Cluster 3


Case 21 Case 3 Case 7
Case 23 Case 5 Case 8
Case 20 Case 2 Case 6
Case 9 Case 16
Case 19 Case 11
Case 14 Case 1
Case 15 Case 4
Case 18
Case 10
Case 22
Case 17
Case 12
Case 24
Case 25
Case 13

Note: since cluster 2 has too few cases, one may also consider the 2-cluster solution as the better solution.

2 cluster solution

Cluster 1 Cluster 2
Case 21 Case 3
Case 23 Case 5
Case 20 Case 2
Case 9 Case 7
Case 19 Case 8
Case 14 Case 6
Case 15 Case 16
Case 18 Case 11
Case 10 Case 1
Case 22 Case 4
Case 17
Case 12
Case 24
Case 25
Case 13

Q. What would you conclude from ANOVA?

A. The ANOVA (for the 3-cluster solution) null hypothesis is rejected for all variables (sig. value < 0.05),
indicating that the means of the variables differ across the clusters, which indicates a good cluster solution.
Conjoint Analysis
Q. 3 A company entering the healthcare insurance sector wants to decide the price for their healthcare
product. They considered four attributes for this. The attributes and their levels are given below.

Charges   Is third party cover   Age        Amount Insured
1500      Yes                    <50        1 Lk
2000      No                     50 to 60   2 Lk
2500                             60 to 70   3 Lk
3000
The variables were coded as follows:

Charges   x1   x2   x3           Third party   x4
1500      1    0    0            Yes           1
2000      0    1    0            No            -1
2500      0    0    1
3000      -1   -1   -1

Age            x5   x6           Amount insured   x7   x8
<50            1    0            1 Lk             1    0
50 to 60       0    1            2 Lk             0    1
60 and above   -1   -1           3 Lk             -1   -1
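The coding scheme above is effects (deviation) coding: an attribute with k levels gets k-1 indicator columns, and the last level is coded -1 in all of them. A minimal sketch (the function name is illustrative, not from the source):

```python
def effects_code(levels, level):
    """Return the k-1 effects codes for `level` within `levels`.

    The last level in `levels` is the omitted one, coded -1 everywhere.
    """
    k = len(levels)
    if level == levels[-1]:
        return [-1] * (k - 1)
    codes = [0] * (k - 1)
    codes[levels.index(level)] = 1
    return codes

charges = ["1500", "2000", "2500", "3000"]
print(effects_code(charges, "1500"))  # x1, x2, x3 -> [1, 0, 0]
print(effects_code(charges, "3000"))  # -> [-1, -1, -1]
print(effects_code(["Yes", "No"], "No"))  # x4 -> [-1]
```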
Note that the rankings are assumed to be such that the highest rank is best.

Output for “Medicare”


Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .924(a)   .853       .834                8.51766

a Predictors: (Constant), x8, x3, x4, x6, x7, x2, x5, x1

ANOVA(b)

Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression    26527.316        8    3315.914      45.705   .000(a)
Residual      4570.684         63   72.551
Total         31098.000        71

a Predictors: (Constant), x8, x3, x4, x6, x7, x2, x5, x1
b Dependent Variable: y
Coefficients(a)

              Unstandardized Coefficients    Standardized
              B          Std. Error          Beta      t        Sig.
(Constant)    35.753     1.320                         27.089   .000
x1            21.888     3.137               .563      6.978    .000
x2            6.067      1.638               .232      3.703    .000
x3            -8.727     2.116               -.280     -4.124   .000
x4            .828       1.095               .040      .756     .000
x5            -4.324     1.738               -.165     -2.487   .016
x6            -1.529     1.490               -.064     -1.026   .000
x7            -10.257    1.284               -.409     -7.990   .000
x8            -4.006     1.408               -.160     -2.846   .006

a Dependent Variable: y
Answer following
a) Find the utility for each attribute at each level and Comment on the output
b) Find the percentage utility for each attribute.
c) Draw the graph of the utility for each attribute
d) Which of the attribute is most important while considering the medical insurance? Why?

(8 Marks)

Answer

Please note that the regression assumptions are not discussed in this answer.

a) Find the utility for each attribute at each level and Comment on the output

b) Find the percentage utility for each attribute.

Ans a & b

Charges          Variable   Partial Utility   Range     Percentage
1500             x1         21.888            41.116    53.074
2000             x2         6.067
2500             x3         -8.727
3000                        -19.228

Is third party   Variable   Partial Utility   Range     Percentage
Yes              x4         0.828             1.656     2.138
No                          -0.828

Age for cover    Variable   Partial Utility   Range     Percentage
<50              x5         -4.324            10.177    13.137
50 to 60         x6         -1.529
60 to 70                    5.853

Amount Insured   Variable   Partial Utility   Range     Percentage
1 LK             x7         -10.257           24.52     31.651
2 LK             x8         -4.006
3 Lk                        14.263

                                     Total:   77.469    100
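The calculations in the tables above can be sketched in a few lines. With effects coding, the utility of the omitted level of each attribute is minus the sum of the estimated level utilities, the range is max minus min utility, and the attribute importance is each range as a percentage of the total range.

```python
# Regression coefficients from the Coefficients table; the omitted
# level of each attribute is reconstructed as -(sum of the others).
coefs = {
    "Charges":        [21.888, 6.067, -8.727],   # x1, x2, x3
    "Is third party": [0.828],                   # x4 (No = -0.828)
    "Age for cover":  [-4.324, -1.529],          # x5, x6
    "Amount Insured": [-10.257, -4.006],         # x7, x8
}

utilities, ranges = {}, {}
for attr, b in coefs.items():
    full = b + [-sum(b)]             # append the omitted level's utility
    utilities[attr] = full
    ranges[attr] = max(full) - min(full)

total = sum(ranges.values())
importance = {a: 100 * r / total for a, r in ranges.items()}
for a in coefs:
    print(f"{a}: range {ranges[a]:.3f}, importance {importance[a]:.2f}%")
```

Running this reproduces the ranges (41.116, 1.656, 10.177, 24.52; total 77.469) and the percentage utilities from the table.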

c) Draw the graph of the utility for each attribute


[Bar charts of the part-worth utilities for each attribute level: Charges (1500, 2000, 2500, 3000), Is third party (Yes, No), and Amount Insured (1 LK, 2 LK, 3 Lk).]
d) Which of the attributes is most important while considering the medical insurance? Why?
Since Charges has the maximum percentage utility (53.07%), it is considered the most important attribute.
Classification & Regression Trees

Classification Tree

[DataSet1] C:\Users\shailaja.rego\Desktop\final exam credit rating expaned for tree.sav
Tree Table

                                                                  Primary Independent Variable
Node   Mean    Std. Deviation   N     Percent   Predicted Mean   Parent Node   Variable   Improvement   Split Values
0      .2800   .45050           150   100.0%    .2800
1      .7500   .43759           48    32.0%     .7500            0             Savings    .104          <= 68.500
2      .0588   .23646           102   68.0%     .0588            0             Savings    .104          > 68.500
3      .9167   .28031           36    24.0%     .9167            1             Income     .027          <= 4363.500
4      .2500   .45227           12    8.0%      .2500            1             Income     .027          > 4363.500
5      .2000   .40684           30    20.0%     .2000            2             Savings    .006          <= 122.000
6      .0000   .00000           72    48.0%     .0000            2             Savings    .006          > 122.000

Growing Method: CRT
Dependent Variable: Default

Gain Summary for Nodes

Node N Percent Mean

3 36 24.0% .9167

4 12 8.0% .2500

5 30 20.0% .2000

6 72 48.0% .0000

Growing Method: CRT
Dependent Variable: Default (Previously defaulted)

Classification

                      Predicted
Observed              Not Defaulted   Defaulted   Percent Correct
Defaulted             105             3           72%
Not Defaulted         9               33          28%
Overall Percentage    76%             24%         92%

Answer following questions

a) Explain the tree output. How many splits does the tree have?

The tree has 3 splits and 7 nodes (root node 0 plus nodes 1-6, of which nodes 3-6 are terminal).

b) Explain each split in terms of variables

The first split is on Savings at 68.500 (<= 68.5 vs > 68.5).

The second split, within Savings <= 68.5, is on Income at 4363.500.

The third split, within Savings > 68.5, is on Savings at 122.000.

If a customer has savings <= 68.5 and income <= 4363.5, then the probability that the customer will default is very high (about 92%).

If a customer has savings <= 68.5 and income > 4363.5, then the probability of default is low (25%).

If a customer has savings > 68.5 and savings <= 122, then the probability of default is low (20%).

If a customer has savings > 122, then the probability of default is essentially zero (0%).
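The split rules above can be written out as a simple scoring function, where the returned probabilities are the terminal-node means from the tree table. The function name is illustrative, not from the source.

```python
def default_probability(savings, income):
    """Score a customer using the CRT tree's split rules."""
    if savings <= 68.5:
        if income <= 4363.5:
            return 0.9167   # node 3: very high risk
        return 0.25         # node 4: low risk
    if savings <= 122.0:
        return 0.20         # node 5: low risk
    return 0.0              # node 6: negligible risk

print(default_probability(savings=50, income=3000))   # -> 0.9167
print(default_probability(savings=200, income=3000))  # -> 0.0
```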

c) What is the classification accuracy? Is it adequate?

The overall classification accuracy is 92%. The observed group proportions are 0.72 (not defaulted) and 0.28 (defaulted).

Chance Accuracy = 0.72^2 + 0.28^2 = 0.5968

The benchmark of a 25% improvement over chance accuracy is 1.25 x 0.5968 = 0.746.

Yes, the accuracy is adequate, since 92% exceeds this benchmark of 74.6%.
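The proportional chance criterion used above is a short calculation:

```python
# Observed group proportions from the classification table.
p_not_default, p_default = 0.72, 0.28

# Proportional chance accuracy and the common 25%-improvement benchmark.
chance_accuracy = p_not_default**2 + p_default**2
criterion = 1.25 * chance_accuracy
observed_accuracy = 0.92

print(f"chance accuracy = {chance_accuracy:.4f}")   # 0.5968
print(f"criterion = {criterion:.3f}")               # 0.746
print(observed_accuracy > criterion)                # True -> adequate
```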

d) According to this output, which variable contributes maximum to the classification?

Savings contributes the maximum: it drives the first split with the largest improvement (.104) and is used in two of the three splits.
