Q.2 A bank has collected data on 15 credit card transactions (during a month) of its HNI customers. The bank wants to launch a new credit card reward point scheme and wants to estimate the amount spent on the basis of the monthly number of transactions and monthly income. The data file and output are given below.
Regression
Variables Entered/Removed
Model   Variables Entered   Variables Removed   Method
ANOVAb
        Sum of Squares   df
Total   33712.160        14
Coefficientsa
Model           Unstd. B   Std. Error   Std. Beta   t       Sig.   95% CI for B (Lower, Upper)   Tolerance   VIF
transactions    2.112      .815         .256        2.592   .025   (.318, 3.907)                 .400        2.500
location        4.368      .167         .445        5.845   .004   (7.004, 15.740)               .346        2.888
income in 000   .372       .052         .713        7.215   .000   (.259, .486)                  .500        2.000
Does the regression coefficient b0 have any practical meaning in the context of this problem? Why?
If the model can be used, estimate the amount spent by a customer in the next month, if his estimated transactions are 30, his monthly income is 400000, and he is going abroad in the next month. Also find the 95% confidence interval for the same.
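One mechanical check on the collinearity statistics reported above: in SPSS output, VIF is the reciprocal of Tolerance. A short sketch (values copied from the Coefficients table; the small gap for "location" is rounding in the printed output):

```python
# VIF = 1 / Tolerance. Values copied from the Coefficients table.
tolerances = {"transactions": 0.400, "location": 0.346, "income in 000": 0.500}
vifs = {"transactions": 2.500, "location": 2.888, "income in 000": 2.000}

for var, tol in tolerances.items():
    # A VIF above 10 (some texts use 5) would signal problematic
    # multicollinearity; all three predictors here are well below that.
    print(f"{var}: 1/Tolerance = {1 / tol:.3f}, reported VIF = {vifs[var]}")
```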
Sr No.   Variable Description
1        Depth of products and services to meet the range of your investment needs
2        Ability to resolve problems
3        Multiple providers' products to choose from
4        Quality of advice
5        Knowledge of representatives or advisors you deal with
6        Representative knowing your overall situation and needs
7        Degree to which my provider knows me
8        Likelihood to Recommend Primary Provider to Someone I Know
9        Likelihood of Continuing to Use Primary Provider at Least at the Same Level as Now
A factor analysis was conducted on the data and the results are given below
KMO and Bartlett's Test
Communalities
                                                                                     Initial   Extraction
Depth of products and services to meet the range of your investment needs            .489      .696
Ability to resolve problems                                                          .431      .516
Multiple providers' products to choose from                                          .441      .543
Quality of advice                                                                    .546      .608
Knowledge of representatives or advisors you deal with                               .586      .635
Representative knowing your overall situation and needs                              .600      .723
Degree to which my provider knows me                                                 .576      .624
Likelihood to Recommend Primary Provider to Someone I Know                           .388      .623
Likelihood of Continuing to Use Primary Provider at Least at the Same Level as Now   .357      .575
Factor Matrixa
Factor
1 2 3
Depth of products and services to meet the range of your investment needs .680 .076 .477
Ability to resolve problems .654 -.022 .297
Multiple providers' products to choose from .509 .056 .425
Quality of advice .725 -.237 -.162
Knowledge of representatives or advisors you deal with .768 -.182 -.113
Representative knowing your overall situation and needs .754 -.103 -.380
Degree to which my provider knows me .749 -.085 -.237
Likelihood to Recommend Primary Provider to Someone I Know .337 .700 -.136
Likelihood of Continuing to Use Primary Provider at Least at the Same Level as Now   .209 .721 -.105
Rotated Factor Matrixa
Factor
1 2 3
Depth of products and services to meet the range of your investment needs            .251 .786 .124
Ability to resolve problems                                                          .358 .620 .061
Multiple providers' products to choose from                                          .151 .643 .078
Quality of advice                                                                    .727 .279 -.034
Knowledge of representatives or advisors you deal with                               .718 .346 .018
Representative knowing your overall situation and needs                              .829 .123 .145
Degree to which my provider knows me                                                 .741 .238 .132
Likelihood to Recommend Primary Provider to Someone I Know                           .131 .113 .770
Likelihood of Continuing to Use Primary Provider at Least at the Same Level as Now   .007 .065 .756
Factor 1 (Interface Related): Quality of advice; Knowledge of representatives or advisors you deal with; Representative knowing your overall situation and needs; Degree to which my provider knows me
Factor 2 (Product Related): Depth of products and services to meet the range of your investment needs; Ability to resolve problems; Multiple providers' products to choose from
Factor 3 (Loyalty/Repeat): Likelihood to Recommend Primary Provider to Someone I Know; Likelihood of Continuing to Use Primary Provider at Least at the Same Level as Now
Factor 1 is the dominant and most important factor, factor 2 is second, and factor 3 is last.
D) Yes.
Bartlett's test is significant and the KMO is > 0.5, indicating that factor analysis can be performed on the data. Every extraction communality is > 0.5. The cumulative total variance explained is 60.466% (extraction), which is > 60%. The rotated factor matrix is a simple-structure matrix for all variables (each variable loads on exactly one factor).
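The extraction communalities above can be recovered from the unrotated Factor Matrix: each variable's communality is the sum of its squared loadings on the extracted factors. A sketch using three of the rows:

```python
# Communality = sum of squared loadings across the extracted factors.
# Loadings copied from the Factor Matrix (three rows shown for brevity).
loadings = {
    "Depth of products and services": (0.680, 0.076, 0.477),   # extraction .696
    "Ability to resolve problems":    (0.654, -0.022, 0.297),  # extraction .516
    "Quality of advice":              (0.725, -0.237, -0.162), # extraction .608
}

for var, row in loadings.items():
    communality = sum(l * l for l in row)
    print(f"{var}: communality = {communality:.3f}")
```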
Discriminant Analysis
Q. 4. A manager dealing with clients' outstanding payments wanted to classify the clients into defaulters and non-defaulters. A client is considered a defaulter if the payment is not received within 2 months of delivery. He collects past data on the following parameters:
1. Years: number of years the client has been in the business
2. Credit_Rating: the credit rating of the client (1 = very bad, 100 = excellent)
3. Client_Rating: the overall rating of the client (1 = very bad, 100 = excellent)
4. Monthly_Turnover: monthly turnover of the client (in Crores)
Variable not_defaulted: 1 => not defaulted, 0 => defaulted
The output of the analysis is given below.
Discriminant
Analysis Case Processing Summary
Valid      19   100.0
Excluded    0     .0
Total      19   100.0
Group Statistics
Valid N (listwise)
Analysis 1
Box's Test of Equality of Covariance Matrices
Log Determinants
not_defaulted   Rank   Log Determinant
.00             4      19.986
1.00            4      17.521
Test Results
Box's M 23.556
F Approx. 1.652
df1 10
df2 723.305
Sig. .088
Canonical Discriminant Function Coefficients
Function
Years .257
Credit_rating .038
Client_Rating .028
Monthly_Turnover .001
(Constant) -5.522
Unstandardized coefficients
Eigenvalues
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
Function
Years .311
Credit_rating .534
Client_Rating .391
Monthly_Turnover .137
Structure Matrix
Function
Client_Rating .804
Credit_rating .794
Years .627
Monthly_Turnover .488
Functions at Group Centroids
not_defaulted   Function 1
.00 -1.361
1.00 .794
Unstandardized canonical
discriminant functions
evaluated at group means
Classification Statistics
Classification Processing Summary
Processed 19
Used in Output 19
Classification Resultsb,c
1.00 1 11 12
1.00 2 10 12
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is
classified by the functions derived from all cases other than that case.
Function
Years .257
Credit_rating .038
Client_Rating .028
Monthly_Turnover .001
(Constant) -5.522
Unstandardized coefficients
DF = -5.522 + 0.257 × Years + 0.038 × Credit_rating + 0.028 × Client_Rating + 0.001 × Monthly_Turnover
(since all independent variables are significant)
c) How would you assess the model? Which test statistic(s) are used to test the accuracy of the model?
1) High value of the canonical correlation (.740)
2) Wilks' Lambda is significant
3) ANOVA rejected for all variables
4) Classification-table accuracy (since 84.2% > 66.83%, the required accuracy is attained)
c. 84.2% of cross-validated grouped cases correctly classified.
Probability
S 7 0.368421
F 12 0.631579
Total 19
Random Accuracy 0.534626
25% improvement over
RA 0.668283
Cutting score = (-1.361 × 12 + 0.794 × 7) / 19 = -0.56705
Substituting the given values into the DF equation, we get 1.654, which is > -0.56705; hence the observation will be classified as not defaulted (since not defaulted is coded 1 and defaulted 0).
f) How would you test the assumption of homogeneity of variance? Using Box's M: since it is not significant, the covariance matrices are homogeneous.
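The classification rule above can be sketched in a few lines; the coefficients and group centroids are copied from the output, while the client's attribute values are purely hypothetical:

```python
# Discriminant function (coefficients from the output) and the cutting
# score computed as the size-weighted mean of the group centroids
# (n = 12 defaulted at -1.361, n = 7 not defaulted at 0.794).
def discriminant_score(years, credit_rating, client_rating, monthly_turnover):
    return (-5.522 + 0.257 * years + 0.038 * credit_rating
            + 0.028 * client_rating + 0.001 * monthly_turnover)

cutoff = (-1.361 * 12 + 0.794 * 7) / 19  # = -0.56705

# Hypothetical client (these attribute values are made up for illustration).
score = discriminant_score(years=10, credit_rating=70,
                           client_rating=65, monthly_turnover=5)
label = "not defaulted" if score > cutoff else "defaulted"
print(f"cutoff = {cutoff:.5f}, score = {score:.3f} -> {label}")
```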
Logistic Regression
Q.2 A bank wanted to decide on credit card applications on the basis of the following variables of its customers:
Savings – savings in Rs thousands
Income – income in Rs thousands
Time_current_Res – months spent at current residence
Time_Current_Job – months spent at current job
The dependent variable is Default: Default = 1 => the customer will be a defaulter; Default = 0 => the customer will not be a defaulter. The output is given below.
Logistic Regression
Dependent Variable Encoding
Original Value   Internal Value
.00              0
1.00             1
Block 0: Beginning Block
Classification Tablea,b
                        Predicted Default
Observed                .00    1.00   Percentage Correct
Step 0 Default .00 36 0 100.0
1.00 14 0 .0
Overall Percentage 72.0
a. Constant is included in the model.
b. The cut value is .500
Variables in the Equation
Score df Sig.
Step 0 Variables Savings 19.121 1 .000
Income 9.185 1 .002
Time_current_Res 2.704 1 .100
Time_Current_Job 1.320 1 .251
Overall Statistics 31.701 4 .000
Omnibus Tests of Model Coefficients
Chi-square df Sig.
Step 1 Step 39.883 4 .000
Block 39.883 4 .000
Model 39.883 4 .000
Model Summary
Cox & Snell R Nagelkerke R
Step -2 Log likelihood Square Square
1 19.412a .550 .791
a. Estimation terminated at iteration number 7 because parameter
estimates changed by less than .001.
Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 10.666 8 .221
Classification Tablea
                        Predicted Default
Observed                .00    1.00   Percentage Correct
Step 1 Default .00 35 2 94.44
1.00 2 13 85.71
Overall Percentage 92.0
a. The cut value is .500
Variables in the Equation
Logistic Regression
Dependent Variable Encoding
Original Value   Internal Value
.00              0
1.00             1
Block 0: Beginning Block
Classification Tablea,b
                        Predicted Default
Observed                .00    1.00   Percentage Correct
Step 0 Default .00 36 0 100.0
1.00 14 0 .0
Overall Percentage 72.0
a. Constant is included in the model.
b. The cut value is .500
Variables in the Equation
Score df Sig.
Step 0 Variables Savings 19.121 1 .000
Income 9.185 1 .002
Time_current_Res 2.704 1 .100
Time_Current_Job 1.320 1 .251
Overall Statistics 31.701 4 .000
Omnibus Tests of Model Coefficients
Chi-square df Sig.
Step 1 Step 25.098 1 .000
Block 25.098 1 .000
Model 25.098 1 .000
Step 2 Step 13.340 1 .000
Block 38.439 2 .000
Model 38.439 2 .000
Model Summary
Cox & Snell R Nagelkerke R
Step -2 Log likelihood Square Square
1 34.197a .395 .568
2 20.857b .536 .772
a. Estimation terminated at iteration number 6 because parameter
estimates changed by less than .001.
b. Estimation terminated at iteration number 7 because parameter
estimates changed by less than .001.
Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 10.659 8 .222
2 4.157 8 .843
Contingency Table for Hosmer and Lemeshow Test
        Default = .00          Default = 1.00
Group   Observed   Expected    Observed   Expected   Total
5 4 4.482 1 .518 5
6 5 3.997 0 1.003 5
7 2 3.446 4 2.554 6
8 0 1.778 5 3.222 5
9 3 1.176 2 3.824 5
10 1 .488 2 2.512 3
Step 2 1 5 4.996 0 .004 5
2 5 4.985 0 .015 5
3 5 4.968 0 .032 5
4 5 4.933 0 .067 5
5 5 4.895 0 .105 5
6 4 4.672 1 .328 5
7 5 3.916 0 1.084 5
8 1 1.857 4 3.143 5
9 1 .615 4 4.385 5
10 0 .162 5 4.838 5
Classification Tablea
                        Predicted Default
Observed                .00    1.00   Percentage Correct
Step 1 Default .00 32 4 88.9
1.00 3 11 78.6
Overall Percentage 86.0
Step 2 Default .00 35 1 97.2
1.00 2 12 85.7
Overall Percentage 94.0
a. The cut value is .500
Variables in the Equation
                  B       S.E.    Wald     df   Sig.   Exp(B)
Step 2b Savings   -.034   .011    10.484   1    .001   .966
Income -.002 .001 6.882 1 .009 .998
Constant 8.100 2.787 8.446 1 .004 3293.574
a. Variable(s) entered on step 1: Savings.
b. Variable(s) entered on step 2: Income.
Variables not in the Equation
Score df Sig.
Step 1 Variables Income 11.623 1 .001
Time_current_Res 1.933 1 .164
Time_Current_Job .677 1 .411
Overall Statistics 12.671 3 .005
Step 2 Variables Time_current_Res .385 1 .535
Time_Current_Job 1.372 1 .242
Overall Statistics 1.405 2 .495
Answer the following questions:
a) State the function.
Since stepwise LR is used, the last model is the most significant and should be considered; hence the independent variables in the equation are Savings and Income.
The model is
p = 1 / (1 + e^(-y)), where y = 8.100 - 0.034 × Savings - 0.002 × Income
b) How would you assess the model? Which test statistic(s) are used to test the accuracy of the model?
The Hosmer and Lemeshow test is not significant, indicating that the LR model is a good fit.
Nagelkerke R Square is > 0.5, indicating a good relationship.
Since stepwise is used, there is no need to check the significance of individual variables.
The classification table shows a 25% improvement over chance accuracy.
c) Interpret the classification table. Can this model be used for prediction?
d)
Not default   35    1    36   0.72
Default        2   12    14   0.28
Total                    50
Chance Accuracy                        0.5968
25% improvement over chance accuracy   0.746
e) How would you classify a customer having Savings = 50000, Income = 6 lakhs, months spent at residence = 20, and time at current job = 3 years? Ans: Classified as Defaulted, as p > 0.5.
f) How would you classify a customer having Savings = 10000, Income = 5 lakhs, months spent at residence = 25, and time at current job = 2 years? Classified as Defaulted, as p > 0.5.
                Coefficient   e)         f)
Savings         -0.034        50         10
Income          -0.002        600        500
Constant        8.1           1          1
y                             5.2        6.76
exp(-y)                       0.005517   0.001159
1/(1+exp(-y))                 0.994514   0.998842
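The worked values above can be checked in a few lines of code, using the Step 2 equation with Savings and Income in Rs thousands:

```python
import math

# Step 2 model: y = 8.100 - 0.034*Savings - 0.002*Income, with Savings and
# Income in Rs thousands; p is the predicted probability of Default = 1.
def default_probability(savings, income):
    y = 8.100 - 0.034 * savings - 0.002 * income
    return 1 / (1 + math.exp(-y))

p_e = default_probability(savings=50, income=600)  # e) Rs 50,000 savings, Rs 6 lakh income
p_f = default_probability(savings=10, income=500)  # f) Rs 10,000 savings, Rs 5 lakh income
print(f"e) p = {p_e:.6f} -> Defaulted (p > 0.5)")
print(f"f) p = {p_f:.6f} -> Defaulted (p > 0.5)")
```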
Cluster Analysis
An investment company has shortlisted 25 of its corporate customers, which it wants to classify into homogeneous groups and understand more about. A cluster analysis was performed on the data, and an ANOVA was done on the output of the cluster analysis.
What are the groups formed? Who are the members of each group (give case numbers)? What would you conclude from the ANOVA? The output is given as "Company". (8 Marks)
Cluster
[DataSet6] C:\Documents and Settings\shailaja.rego\Desktop\cluster exam 2011.sav
Centroid Linkage
Agglomeration Schedule
Cluster Combined Stage Cluster First Appears
Stage Cluster 1 Cluster 2 Coefficients Cluster 1 Cluster 2 Next Stage
1 21 23 .017 0 0 4
2 24 25 .035 0 0 14
3 14 15 .068 0 0 8
4 20 21 .092 0 1 7
5 9 19 .093 0 0 7
6 10 22 .101 0 0 9
7 9 20 .139 5 4 8
8 9 14 .248 7 3 12
9 10 17 .262 6 0 10
10 10 12 .298 9 0 13
11 6 16 .342 0 0 15
12 9 18 .467 8 0 13
13 9 10 .571 12 10 19
14 13 24 .650 0 2 19
15 6 11 .798 11 0 17
16 7 8 1.013 0 0 22
17 1 6 1.107 0 15 21
18 3 5 1.355 0 0 20
19 9 13 1.809 13 14 24
20 2 3 2.006 0 18 23
21 1 4 2.945 17 0 22
22 1 7 3.101 21 16 23
23 1 2 6.602 22 20 24
24 1 9 10.575 23 19 0
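A common way to pick the number of clusters from an agglomeration schedule is to look for large jumps in the coefficient between consecutive stages. A sketch using the coefficients above:

```python
# Agglomeration coefficients, copied from the schedule (stages 1-24).
coefficients = [.017, .035, .068, .092, .093, .101, .139, .248, .262, .298,
                .342, .467, .571, .650, .798, 1.013, 1.107, 1.355, 1.809,
                2.006, 2.945, 3.101, 6.602, 10.575]

# Jump in coefficient going into each stage (stage numbers are 1-based).
jumps = [(i + 2, round(b - a, 3)) for i, (a, b) in
         enumerate(zip(coefficients, coefficients[1:]))]
jumps.sort(key=lambda t: -t[1])

# The two largest jumps are at the final merges (stages 24 and 23),
# which is why 2- and 3-cluster solutions are the natural candidates.
print(jumps[:2])
```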
Dendrogram
* * * * * H I E R A R C H I C A L C L U S T E R A N A L Y S I S * * * *
Dendrogram using Centroid Method
Rescaled Distance Cluster Combine
C A S E 0 5 10 15 20 25
Label Num +---------+---------+---------+---------+---------+
Case 21 21 -+
Case 23 23 -+
Case 20 20 -+
Case 9 9 -+-+
Case 19 19 -+ |
Case 14 14 -+ |
Case 15 15 -+ |
Case 18 18 ---+-----+
Case 10 10 -+ | |
Case 22 22 -+ | |
Case 17 17 -+-+ +---------------------------------------+
Case 12 12 -+ | |
Case 24 24 -+-+ | |
Case 25 25 -+ +-----+ |
Case 13 13 ---+ |
Case 3 3 -------+-+ |
Case 5 5 -------+ +---------------------+ |
Case 2 2 ---------+ | |
Case 7 7 -----+---------+ +-----------------+
Case 8 8 -----+ | |
Case 6 6 -+-+ +---------------+
Case 16 16 -+ +-+ |
Case 11 11 ---+ +-------+ |
Case 1 1 -----+ +-+
Case 4 4 -------------+
Oneway
[DataSet6] C:\Documents and Settings\shailaja.rego\Desktop\cluster exam 2011.sav
Descriptives
N   Mean   Std. Deviation   Std. Error   95% CI Lower Bound   95% CI Upper Bound   Minimum   Maximum
ANOVA
Total 1.118E10 24
Total 1.755E9 24
Descriptives
N   Mean   Std. Deviation   Std. Error   95% CI Lower Bound   95% CI Upper Bound   Minimum   Maximum
Total 1.613E9 24
Total 7.720E8 24
Answer
A.
Note: since cluster 2 has too few cases, one may also consider the 2-cluster solution as the better solution.
2 cluster solution
Cluster 1 Cluster 2
Case 21 Case 3
Case 23 Case 5
Case 20 Case 2
Case 9 Case 7
Case 19 Case 8
Case 14 Case 6
Case 15 Case 16
Case 18 Case 11
Case 10 Case 1
Case 22 Case 4
Case 17
Case 12
Case 24
Case 25
Case 13
A. The ANOVA (for the 3-cluster solution) is rejected for all variables (sig. value < 0.05), indicating that the means of the variables differ across the clusters, indicating a good cluster solution.
Conjoint Analysis
Q. 3 A company entering the healthcare insurance sector wants to decide the price for their healthcare product. They considered four attributes for this. The attributes and their levels are given below.

Charges: 1500, 2000, 2500, 3000
Is third party: Yes, No
Age for cover: <50, 50 to 60, 60 to 70
Amount Insured: 1 LK, 2 LK, 3 LK
The variables were coded as follows:

Charges   x1   x2   x3        Is third party   x4
1500       1    0    0        Yes               1
2000       0    1    0        No               -1
2500       0    0    1
3000      -1   -1   -1
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .924(a)   .853       .834                8.51766
a Predictors: (Constant), x8, x3, x4, x6, x7, x2, x5, x1
ANOVA(b)
Coefficients
Model          B         Std. Error   Beta    t        Sig.
1 (Constant)   35.753    1.320                27.089   .000
  x1           21.888    3.137        .563    6.978    .000
  x2           6.067     1.638        .232    3.703    .000
  x3           -8.727    2.116        -.280   -4.124   .000
  x4           .828      1.095        .040    .756     .000
  x5           -4.324    1.738        -.165   -2.487   .016
  x6           -1.529    1.490        -.064   -1.026   .000
  x7           -10.257   1.284        -.409   -7.990   .000
  x8           -4.006    1.408        -.160   -2.846   .006
a Dependent Variable: y
Answer the following:
a) Find the utility for each attribute at each level and comment on the output.
b) Find the percentage utility for each attribute.
c) Draw the graph of the utility for each attribute.
d) Which attribute is most important while considering the medical insurance? Why?
(8 Marks)
Answer
Ans a & b: The utilities, ranges, and percentage utilities are:

Charges          Variable   Partial Utility   Range    Percentage
1500             x1         21.888            41.116   53.074
2000             x2         6.067
2500             x3         -8.727
3000                        -19.228

Is third party   Variable   Partial Utility   Range    Percentage
Yes              x4         0.828             1.656    2.138
No                          -0.828

Age for cover    Variable   Partial Utility   Range    Percentage
<50              x5         -4.324            10.177   13.137
50 to 60         x6         -1.529
60 to 70                    5.853

Amount Insured   Variable   Partial Utility   Range    Percentage
1 LK             x7         -10.257           24.520   31.651
2 LK             x8         -4.006
3 LK                        14.263

Total range                                   77.469   100
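The ranges and percentage utilities above can be reproduced from the regression coefficients: with this coding, the omitted (base) level's utility is minus the sum of the other levels' coefficients, and an attribute's importance is its utility range as a share of the total range. A sketch:

```python
# Part-worth utilities from the coded regression: the base level's utility
# is minus the sum of the other levels' coefficients for that attribute.
attributes = {
    "Charges":        [21.888, 6.067, -8.727],  # x1, x2, x3; base level 3000
    "Is third party": [0.828],                  # x4; base level No
    "Age for cover":  [-4.324, -1.529],         # x5, x6; base level 60 to 70
    "Amount Insured": [-10.257, -4.006],        # x7, x8; base level 3 LK
}

ranges = {}
for name, utils in attributes.items():
    levels = utils + [-sum(utils)]             # append the base level's utility
    ranges[name] = max(levels) - min(levels)   # utility range of the attribute

total = sum(ranges.values())                   # 77.469
for name, r in ranges.items():
    print(f"{name}: range = {r:.3f}, importance = {100 * r / total:.2f}%")
```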
[Chart: part utilities for "Is third party": Yes 0.828, No -0.828]
[Chart: part utilities for "Amount Insured": 1 LK -10.257, 2 LK -4.006, 3 LK 14.263]
d) Which attribute is most important while considering the medical insurance? Why?
Since Charges has the maximum percentage utility, it is considered the most important attribute.
Classification & Regression Trees
Classification Tree
Node   N    Percent   P(default)
3      36   24.0%     .9167
4      12   8.0%      .2500
5      30   20.0%     .2000
6      72   48.0%     .0000
Classification
Observed         Predicted Defaulted   Predicted Not Defaulted   Percent Correct
Defaulted 105 3 72%
Not Defaulted 9 33 28%
Overall Percentage 76% 24% 92%
a) Explain the tree output. How many splits does the tree have?
If a customer has savings <= 68.5 and income less than 4363.5, then the probability that the customer will default is very high (95%).
If a customer has savings <= 68.5 and income more than 4363.5, then the probability that the customer will default is low (25%).
If a customer has savings > 68.5 and savings < 122, then the probability that the customer will default is low (20%).
If a customer has savings > 122, then the probability that the customer will default is very, very low (0%).
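The four rules above can be sketched as a small function; the split points and node probabilities are transcribed from the stated interpretation (this is a transcription of the rules, not the fitted tree itself):

```python
# Transcription of the four stated rules; split points (savings 68.5 and 122,
# income 4363.5) and node probabilities are taken from the tree interpretation.
def default_probability(savings, income):
    if savings <= 68.5:
        return 0.95 if income < 4363.5 else 0.25  # low savings: income decides
    if savings < 122:
        return 0.20   # moderate savings
    return 0.0        # high savings: essentially no defaults

# The tree uses 3 splits in total: savings twice and income once.
print(default_probability(50, 3000), default_probability(150, 3000))
```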