Solved Problem Regression Factor Analysis Discriminant Logistic Cluster Conjoint (1) Xid-2939872 1 IjL9spRH9K

Multiple Regression
Q.2 A bank has collected data on 15 credit card transaction (during a month) of itsHNI customers. The
bank wants to launch a new credit card reward point scheme and wants to estimate the amount spent
on the basis of monthly number of transactions and monthly income. The Data file and output are
given below.
The variables are, Amount

spent in ‘000 in a month
(Amtspent), number of
transactions in a month
(transactions), Whether been
abroad (Abroad)( 1 = been
abroad in the month, 0 =not
been abroad in the month)
and monthly income in ‘000
in a month (Income)
Regression
Variables Entered/Removed
Variables
Model Variables Entered Removed Method
1 income in 000, . Enter

location,
transactionsa
a. All requested variables entered.

Model Summaryb
Adjusted R Std. Error of the

Model R R Square Square Estimate Durbin-Watson
1 .995a .989 .986 5.769 1.966
a. Predictors: (Constant), income in 000, location, transactions
b. Dependent Variable: Amt spent in '000
ANOVAb
Model Sum of Squares df Mean Square F Sig.
1 Regression 33346.057 3 11115.352 333.974 .000a
Residual 366.103 11 33.282
Total 33712.160 14
a. Predictors: (Constant), income in 000, location, transactions
b. Dependent Variable: Amt spent in '000
a
Coefficients
Standardized
Unstandardized Coefficients Coefficients 95.0% Confidence Interval for B Collinearity Statistics
Model B Std. Error Beta T Sig. Lower Bound Upper Bound Tolerance VIF
1 (Constant) 13.426 7.563 1.775 .103 -3.219 30.071
transactions 2.112 .815 .256 2.592 .025 .318 3.907 .400 2.500
location 4.368 .167 .445 5.845 .004 7.004 15.740 .346 2.888
income in 000 .372 .052 .713 7.215 .000 .259 .486 .500 2.000
a. Dependent Variable: Amt spent in '000

Interpret the output. Mention the Regression equation.
The regression equation is
Amount spent = 13.426 + 2.112*transactions + 4.368*location + 0.372*Income
Comment(giving reasons)on if the regression model can be used for prediction.
Yes the model can be used for prediction as
1)All variables are quantitative or binary

2) All quantitve variables are normally distributed ( as K-S test significance value is >0.05 for all
variables hence null hypothesis that they are normal is accepted)
3)coefficient of determination R2 is high (.989)
4)Durban Wattson stats For autocorrelation is 1.966 which is within the required range of 1.5 to 2.5, indicating no threat of
autocorrelation
5) ANOVA is rejected ( Sig value 0.000) indicating atleast one independent variable is related to the dependent variable.
6)All individual 's are rejected
7) Since the normal plot for residuals shows that the points are around the diagonal, the residuals are normally distributed
8) The Partial grapg shows that the independent variables have linear relationship with the dependent variable.
9) the scatter plot shows no threat of heteroscadascity
State null and alternate hypothesis & interpretation of the hypothesis wherever necessary.
Interpret the meaning of each of the slopes.
a
Coefficients
Standardized
Unstandardized Coefficients Coefficients 95.0% Confidence Interval for B Collinearity Statistics
Model B Std. Error Beta T Sig. Lower Bound Upper Bound Tolerance VIF
1 (Constant) 13.426 7.563 1.775 .103 -3.219 30.071
transactions 2.112 .815 .256 2.592 .025 .318 3.907 .400 2.500
location 4.368 .167 .445 5.845 .004 7.004 15.740 .346 2.888
income in 000 .372 .052 .713 7.215 .000 .259 .486 .500 2.000
a. Dependent Variable: Amt spent in '000
Does the regression coefficient b0 has any practical meaning in the context of this problem? Why?
If the model can be used, estimate amount spent by a customer for the next month, if his estimated
transactions are 30, his monthly income is 400000, and he is going to go abroad in the next month.
Also find 95% confidence interval for the same.
Upper bound Lower limit

b b Point estimate
1 30.071 1 -3.219 1 13.426
30 3.907 30 0.318 30 2.112
1 15.74 1 7.004 1 4.368
400 0.486 400 0.259 400 0.372
357.421 116.925 229.954
Factor Analysis
A Financial analysis firm conducted a study on factors affecting consumers selection of financial
products. The questions in the survey were asked on rating scale of 1 to 7. The variables considered are
listed below
Sr Sr
No. Variable Description No. Variable Description
Depth of products and services to meet Representative knowing your overall
1 the range of your investment needs 6 situation and needs
2 Ability to resolve problems 7 Degree to which my provider knows me
Multiple providers' products to choose Likelihood to Recommend Primary
3 from 8 Provider to Someone I Know
Likelihood of Continuing to Use Primary
4 Quality of advice 9 Provider at Least at the Same Level as Now
5 Knowledge of representatives or advisors you deal with
A factor analysis was conducted on the data and the results are given below
KMO and Bartlett's Test
Kaiser-Meyer-Olkin Measure of Sampling Adequacy. .811

Bartlett's Test of Sphericity Approx. Chi-Square 1520.784
df 36
Sig. .000
Communalities
Initial Extraction
Depth of products and services to meet the .489 .696
range of your investment needs
Ability to resolve problems .431 .516
Multiple providers' products to choose from .441 .543
Quality of advice .546 .608
Knowledge of representatives or advisors .586 .635
you deal with
Representative knowing your overall .600 .723
situation and needs
Degree to which my provider knows me .576 .624
Likelihood to Recommend Primary Provider .388 .623
to Someone I Know
Likelihood of Continuing to Use Primary .357 .575
Provider at Least at the Same Level as Now
Extraction Method: Principal Axis Factoring.
Total Variance Explained
Extraction Sums of Squared Rotation Sums of Squared

Initial Eigenvalues Loadings Loadings
% of Cumulative % of Cumulative % of Cumulative

Factor Total Variance % Total Variance % Total Variance %
1 3.929 43.657 43.657 3.549 39.432 39.432 2.512 27.910 27.910
2 1.527 16.964 60.620 1.127 12.523 51.955 1.702 18.907 46.818
3 1.172 13.024 73.645 .766 8.512 60.466 1.228 13.649 60.466


df 36
Sig. .000
Communalities
Initial Extraction
you deal with
situation and needs
to Someone I Know
4 .519 5.768 79.413
5 .480 5.339 84.751
6 .403 4.480 89.232
7 .389 4.324 93.556
8 .335 3.728 97.283
9 .244 2.717 100.000
Factor Matrixa
Factor
1 2 3
Depth of products and services to meet the range of your investment needs .680 .076 .477
Ability to resolve problems .654 -.022 .297
Multiple providers' products to choose from .509 .056 .425
Quality of advice .725 -.237 -.162
Knowledge of representatives or advisors you deal with .768 -.182 -.113
Representative knowing your overall situation and needs .754 -.103 -.380
Degree to which my provider knows me .749 -.085 -.237
Likelihood to Recommend Primary Provider to Someone I Know .337 .700 -.136
Likelihood of Continuing to Use Primary Provider at Least at the Same Level .209 .721 -.105
as Now
a. 3 factors extracted. 12 iterations required.
Rotated Factor Matrixa
Factor
1 2 3

df 36
Sig. .000
Communalities
Initial Extraction
you deal with
situation and needs
to Someone I Know
Depth of products and services to meet the range of .251 .786 .124
your investment needs
Ability to resolve problems .358 .620 .061
Quality of advice .727 .279 -.034
Knowledge of representatives or advisors you deal with .718 .346 .018
Representative knowing your overall situation and needs .829 .123 .145
Degree to which my provider knows me .741 .238 .132
Likelihood to Recommend Primary Provider to Someone .131 .113 .770
I Know
Likelihood of Continuing to Use Primary Provider at .007 .065 .756
Least at the Same Level as Now

Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 5 iterations.
Answer following questions

a) Interpret the output
b) Identify and name the factors
c) Is there any variable/s which should not be included in the analysis?
d) Can you say that factor analysis is relevant in this case? Why?
(8 Marks)
A) The Bartlett's test is significant, and KMO is >0.5 indicating the FA can be done on the data. Each
extraction communalities are > 0.5. Cumulative of total explained variance is 60.466%(extraction) which is
>60%. Rotated factor matrix is a simple structure matrix for all variables ( each variable is loaded on
exactly one factor
B)
Factor
1 2 3
Depth of products and services to meet the range of your investment .251 .786 .124
needs
Ability to resolve problems .358 .620 .061
Quality of advice .727 .279 -.034
Knowledge of representatives or advisors you deal with .718 .346 .018
Representative knowing your overall situation and needs .829 .123 .145
Degree to which my provider knows me .741 .238 .132
Likelihood to Recommend Primary Provider to Someone I Know .131 .113 .770
Likelihood of Continuing to Use Primary Provider at Least at the .007 .065 .756
Same Level as Now
Factor
1(Interface Related) 2(Product Related) 3(Loyalty/Repeat)
Quality of advice Depth of products and Likelihood to Recommend Primary
services to meet the range of Provider to Someone I Know
your investment needs
Knowledge of representatives or Ability to resolve problems Likelihood of Continuing to Use Primary
advisors you deal with Provider at Least at the Same Level as
Now
Representative knowing your Multiple providers' products
overall situation and needs to choose from.
Degree to which my provider
knows me
Factor 1 is dominant and most important factor, factor 2 is second and last is factor 3
C) Is there any variable/s which should not be included in the analysis? NO

Since anti image matrix not given, one cannot find the diagonal elements of correlations matrix
(of anti image ) which should be > 0.5
Each extraction communalities are > 0.5. and
each variable is loaded on exactly one factor ( only one factor loading is > 0.5 and rest are < 0.4)
D) Yes
The Bartlett's test is significant, and KMO is >0.5 indicating the FA can be done on the data. Each
extraction communalities are > 0.5. Cumulative of total explained variance is 60.466%(extraction) which is
>60%. Rotated factor matrix is a simple structure matrix for all variables ( each variable is loaded on
exactly one factor
Discriminant Analysis
Q. 4. A manager dealing with clients outstanding, wanted to classify their clients into defaulters and
non-defaulters of payment. A client is considered defaulter , if the payment is not received within 2
months of delivery. He collects data on past information for following parameters :
1. Years : No of years the client is in the business
2.Credit_Rating : the credit rating of the client ( 1=very bad 100=excellent)
3. Client Rating : the overall rating of the client ( 1=very bad 100=excellent)
4. Monthly turnover of the client ( in Crores)
Variable not_defaulted : 1 => not defaulted , 0=> defaulted
The Output of analysis is given below
Discriminant
Analysis Case Processing Summary
Unweighted Cases N Percent
Valid 19 100.0
Excluded Missing or out-of-range group 0 .0

codes
At least one missing 0 .0

discriminating variable
Both missing or out-of-range 0 .0

group codes and at least one
missing discriminating variable
Total 0 .0
Total 19 100.0
Group Statistics
Valid N (listwise)
not_default Mean Std. Deviation Unweighted Weighted
.00 Years 6.1429 1.86445 7 7.000
Credit_rating 33.5714 15.99851 7 7.000
Client_Rating 36.7143 19.58498 7 7.000
Monthly_Turnover 247.7143 142.05834 7 7.000
1.00 Years 7.7750 .60019 12 12.000
Credit_rating 57.8333 13.09291 12 12.000
Client_Rating 61.1667 9.93463 12 12.000
Total Years 7.1737 1.42589 19 19.000
Credit_rating 48.8947 18.29358 19 19.000
Client_Rating 52.1579 18.30380 19 19.000
Tests of Equality of Group Means
Wilks' Lambda F df1 df2 Sig.

Analysis Case Processing Summary
Unweighted Cases N Percent
Valid 19 100.0
Excluded Missing or out-of-range group 0 .0

codes
At least one missing 0 .0

Both missing or out-of-range 0 .0

group codes and at least one
missing discriminating variable
Total 0 .0
Years .678 8.067 1 17 .011
Credit_rating .568 12.931 1 17 .002
Client_Rating .562 13.268 1 17 .002
Monthly_Turnover .777 4.882 1 17 .041
Analysis 1
Box's Test of Equality of Covariance Matrices
Log Determinants
not_default Rank Log Determinant
.00 4 19.986
1.00 4 17.521
Pooled within-groups 4 19.776
The ranks and natural logarithms of determinants printed are

those of the group covariance matrices.
Test Results
Box's M 23.556
F Approx. 1.652
df1 10
df2 723.305
Sig. .088
Tests null hypothesis of equal

population covariance matrices.
Canonical Discriminant Function
Coefficients
Function
Years .257
Credit_rating .038
Client_Rating .028
Monthly_Turnover .001
(Constant) -5.522
Unstandardized coefficients
Summary of Canonical Discriminant Functions

Eigenvalues
Canonical
Function Eigenvalue % of Variance Cumulative % Correlation
1 1.208a 100.0 100.0 .740
a. First 1 canonical discriminant functions were used in the analysis.
Wilks' Lambda
Test of
Function
(s) Wilks' Lambda Chi-square df Sig.
1 .453 11.881 4 .018
Standardized Canonical Discriminant

Function Coefficients
Function
Years .311
Credit_rating .534
Client_Rating .391
Structure Matrix
Function
Client_Rating .804
Credit_rating .794
Years .627
Eigenvalues
Canonical
Function Eigenvalue % of Variance Cumulative % Correlation
1 1.208a 100.0 100.0 .740
Pooled within-groups correlations

between discriminating variables and
standardized canonical discriminant
functions
Variables ordered by absolute size of
correlation within function.
Functions at Group
Centroids
not_defa Function
ult 1
.00 -1.361
1.00 .794
Unstandardized canonical
discriminant functions
evaluated at group means
Classification Statistics
Classification Processing Summary
Processed 19
Excluded Missing or out-of-range group 0

codes
At least one missing 0

Used in Output 19
Prior Probabilities for Groups
not_defa Cases Used in Analysis
ult Prior Unweighted Weighted
.00 .500 7 7.000
1.00 .500 12 12.000
Total 1.000 19 19.000
Classification Resultsb,c
not_defa Predicted Group Membership
ult .00 1.00 Total
Original Count .00 6 1 7
1.00 1 11 12
% .00 85.7 14.3 100.0

Classification Processing Summary
Processed 19
Excluded Missing or out-of-range group 0

codes
At least one missing 0

1.00 8.3 91.7 100.0
Cross-validateda Count .00 6 1 7
1.00 2 10 12
% .00 85.7 14.3 100.0
1.00 16.7 83.3 100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is
classified by the functions derived from all cases other than that case.
b. 89.5% of original grouped cases correctly classified.
c. 84.2% of cross-validated grouped cases correctly classified.
a) Which multivariate technique is used and why?

Discriminant analysis is used as the dependent variable is qualitative and all independent
variables are quantitative.
b) State the function
Canonical Discriminant Function
Coefficients
Function
Years .257
Credit_rating .038
Client_Rating .028
(Constant) -5.522
Unstandardized coefficients
D F = -5.522+.257 xYears +.038 xCredit_rating +.028xClient_Rating +.001 xMonthly_Turnover
(since all independent variables are significant)
c) How would you assess the model? Which test statistic /s used to test accuracy of the model?
1) High value of Canonical Correlation (.740)
2) Wilk’s Lamda significant
3) ANOVA rejected for all variables
4) Classification table accuracy (since 84.2% > 66.82% required accuracy is attained )
c. 84.2% of cross-validated grouped cases correctly classified.
Probability
S 7 0.368421
F 12 0.631579
Total 19
Random Accuracy 0.534626
25% improvement over
RA 0.668283
d) Interpret the classification table. Same as above

e) How would you classify a client having years of business as 8, credit rating 80, client rating 60,
and monthly turnover 400.
Functions at Group Centroids
Function
not_default
1
0 -1.361 7= N1
1 0.794 12=N2
(-1.361*12+0.794*7)/19 = -0.56705
Substituting above values in the DF equation we get answer as 1.654 which is > -0.56705 hence the
observation will be classified as not defaulted ( since not defaulted is defined as 1 and defaulted as 0)
f) How would you test the assumption of homogeneity of variance? Using Boxe’s M since it is not
significant, variances are homogenous.
Logistic Regression
Q.2 A Bank wanted to decide on credit card applications on the basis of following variables of its
customers
Savings –savings in Rs thousands
Income - Income Rs thousands
Time_current_Res - months spent at current residence
Time_Current_Job - months spent at current Job
The dependent variable is Default =1=> the customer will be Defaulter , Default =0=> the customer will
not be a Defaulter. The output is given below
Logistic Regression
Dependent Variable
Encoding
Original
Value Internal Value
.00 0
1.00 1
Block 0: Beginning Block
Classification Tablea,b
Predicted
Default
Percentage
Observed .00 1.00 Correct
Step 0 Default .00 36 0 100.0
1.00 14 0 .0
Overall Percentage 72.0
a. Constant is included in the model.
b. The cut value is .500
Variables in the Equation
B S.E. Wald df Sig. Exp(B)

Step 0 Constant -.944 .315 8.991 1 .003 .389
Variables not in the Equation
Score df Sig.
Step 0 Variables Savings 19.121 1 .000
Income 9.185 1 .002
Time_current_Res 2.704 1 .100
Time_Current_Job 1.320 1 .251
Overall Statistics 31.701 4 .000
Block 1: Method = Enter

Omnibus Tests of Model Coefficients
Chi-square df Sig.
Step 1 Step 39.883 4 .000
Block 39.883 4 .000
Model 39.883 4 .000
Model Summary
Cox & Snell R Nagelkerke R
Step -2 Log likelihood Square Square
1 19.412a .550 .791
a. Estimation terminated at iteration number 7 because parameter
estimates changed by less than .001.
Hosmer and Lemeshow Test
Chi-square df Sig.
Step 1 Step 39.883 4 .000
Block 39.883 4 .000
Model 39.883 4 .000
Model Summary
1 19.412a .550 .791
Step Chi-square df Sig.
1 10.666 8 .221
Contingency Table for Hosmer and Lemeshow Test
Default = .00 Default = 1.00

Observed Expected Observed Expected Total
Step 1 1 5 4.997 0 .003 5
2 5 4.983 0 .017 5
3 5 4.958 0 .042 5
4 5 4.923 0 .077 5
5 4 4.891 1 .109 5
6 5 4.751 0 .249 5
7 5 4.003 0 .997 5
8 1 1.962 4 3.038 5
9 1 .454 4 4.546 5
10 0 .079 5 4.921 5
Classification Tablea
Predicted
Default
Percentage
Step 1 Default .00 35 2 94.44
1.00 2 13 85.71
a. The cut value is .500

Step 1a Savings -.035 .011 9.496 1 .002 .966
Income -.002 .001 6.722 1 .010 .998
Time_current_Res -.006 .032 .039 1 .843 .994
Time_Current_Job -.036 .035 1.004 1 .316 .965
Constant 9.808 3.520 7.766 1 .005 18186.145
a. Variable(s) entered on step 1: Savings, Income, Time_current_Res, Time_Current_Job.
Logistic Regression
Dependent Variable
Encoding
Original
Value Internal Value
.00 0
1.00 1
Block 0: Beginning Block
Classification Tablea,b
Predicted
Default
Percentage
Step 0 Default .00 36 0 100.0
1.00 14 0 .0
a. Constant is included in the model.
b. The cut value is .500

Step 0 Constant -.944 .315 8.991 1 .003 .389
Score df Sig.
Step 0 Variables Savings 19.121 1 .000
Income 9.185 1 .002
Block 1: Method = Forward Stepwise (Wald)

Chi-square df Sig.
Step 1 Step 25.098 1 .000
Block 25.098 1 .000
Model 25.098 1 .000
Step 2 Step 13.340 1 .000
Block 38.439 2 .000
Model 38.439 2 .000
Model Summary
1 34.197a .395 .568
2 20.857b .536 .772
b. Estimation terminated at iteration number 7 because parameter
Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 10.659 8 .222
2 4.157 8 .843
Contingency Table for Hosmer and Lemeshow Test
Default = .00 Default = 1.00

Observed Expected Observed Expected Total
Step 1 1 5 4.989 0 .011 5
2 5 4.963 0 .037 5
3 5 4.907 0 .093 5
4 6 5.773 0 .227 6
Chi-square df Sig.
Step 1 Step 25.098 1 .000
Block 25.098 1 .000
Model 25.098 1 .000
Step 2 Step 13.340 1 .000
Block 38.439 2 .000
Model 38.439 2 .000
Model Summary
1 34.197a .395 .568
2 20.857b .536 .772
5 4 4.482 1 .518 5
6 5 3.997 0 1.003 5
7 2 3.446 4 2.554 6
8 0 1.778 5 3.222 5
9 3 1.176 2 3.824 5
10 1 .488 2 2.512 3
Step 2 1 5 4.996 0 .004 5
2 5 4.985 0 .015 5
3 5 4.968 0 .032 5
4 5 4.933 0 .067 5
5 5 4.895 0 .105 5
6 4 4.672 1 .328 5
7 5 3.916 0 1.084 5
8 1 1.857 4 3.143 5
9 1 .615 4 4.385 5
10 0 .162 5 4.838 5
Classification Tablea
Predicted
Default
Percentage
Step 1 Default .00 32 4 88.9
1.00 3 11 78.6
Step 2 Default .00 35 1 97.2
1.00 2 12 85.7
a. The cut value is .500

a
Step 1 Savings -.030 .009 11.310 1 .001 .970
Constant 1.643 .707 5.405 1 .020 5.172
Chi-square df Sig.
Step 1 Step 25.098 1 .000
Block 25.098 1 .000
Model 25.098 1 .000
Step 2 Step 13.340 1 .000
Block 38.439 2 .000
Model 38.439 2 .000
Model Summary
1 34.197a .395 .568
2 20.857b .536 .772
Step 2b Savings -.034 .011 10.484 1 .001 .966
Income -.002 .001 6.882 1 .009 .998
Constant 8.100 2.787 8.446 1 .004 3293.574
a. Variable(s) entered on step 1: Savings.
b. Variable(s) entered on step 2: Income.
Score df Sig.
Step 1 Variables Income 11.623 1 .001
Time_Current_Job .677 1 .411
Step 2 Variables Time_current_Res .385 1 .535
a) State the function
Since stepwise LR is used, the last model is most significant and should be considered hence the
independent variables in the equation are Savings and income
The model is
1
𝑝= Where y = 8.100 – 0.034 * savings – 0.002* income
(1+𝑒 −𝑦 )
b) How would you assess the model? Which test statistic /s used to test accuracy of the model?
Hosmer and Lemeshow Test is not significant indicating that the LR model is a good fit.
Nagelkerke R Square is >0.5 indicating good relationship
Since stepwise is used, no need to check for the significance of individual variables.
Classification table gives 25% improvement over chance Accuracy
c) Interpret the classification table. Can this model be used for prediction?
d) Not
35 1 36 0.72
default
Default 2 12 14 0.28
Total 50
Chance Accuracy 0.5968
25% improvement
0ver chance Accuracy 0.746
e) How would you classify a customer having Savings = 50000, Income =6 Lakhs months spend
at residence = 20 and Time at current job = 3 years? Ans : Classified as Defaulted as p>0.5
f) How would you classify a customer having Savings = 10000, Income =5 Lakhs months spend at
residence = 25 and Time at current job = 2 years? Classified as Defaulted as p>0.5
Savings -0.034 50 10
Income -0.002 600 500
Constant 8.1 1 1
y 5.2 6.76
exp(-y) 0.005517 0.001159
1/(1+exp(-
y)) 0.994514 0.998842
Cluster Analysis
An investment company has shortlisted 25 of its corporate customers which it wants to classify into
homogenous groups and understand more about the groups. A cluster analysis was performed on the
data and also the ANOVA was done on the output of cluster analysis.
What are the groups formed? Who are the members of the group (give case number)? What would
you conclude from ANOVA? The output is given as “Company” (8 Marks)
Output for “Company”
Cluster
[DataSet6] C:\Documents and Settings\shailaja.rego\Desktop\cluster exam 2011.sav
Centroid Linkage
Agglomeration Schedule
Cluster Combined Stage Cluster First Appears
Stage Cluster 1 Cluster 2 Coefficients Cluster 1 Cluster 2 Next Stage
1 21 23 .017 0 0 4
2 24 25 .035 0 0 14
3 14 15 .068 0 0 8
4 20 21 .092 0 1 7
5 9 19 .093 0 0 7
6 10 22 .101 0 0 9
7 9 20 .139 5 4 8
8 9 14 .248 7 3 12
9 10 17 .262 6 0 10
10 10 12 .298 9 0 13
11 6 16 .342 0 0 15
12 9 18 .467 8 0 13
13 9 10 .571 12 10 19
14 13 24 .650 0 2 19
15 6 11 .798 11 0 17
16 7 8 1.013 0 0 22
17 1 6 1.107 0 15 21
18 3 5 1.355 0 0 20
19 9 13 1.809 13 14 24
20 2 3 2.006 0 18 23
21 1 4 2.945 17 0 22
22 1 7 3.101 21 16 23
23 1 2 6.602 22 20 24
24 1 9 10.575 23 19 0
Dendrogram
* * * * * H I E R A R C H I C A L C L U S T E R A N A L Y S I S * * * *
Dendrogram using Centroid Method
Rescaled Distance Cluster Combine
C A S E 0 5 10 15 20 25
Label Num +---------+---------+---------+---------+---------+
Case 21 21 -+
Case 23 23 -+
Case 20 20 -+
Case 9 9 -+-+
Case 19 19 -+ |
Case 14 14 -+ |
Case 15 15 -+ |
Case 18 18 ---+-----+
Case 10 10 -+ | |
Case 22 22 -+ | |
Case 17 17 -+-+ +---------------------------------------+
Case 12 12 -+ | |
Case 24 24 -+-+ | |
Case 25 25 -+ +-----+ |
Case 13 13 ---+ |
Case 3 3 -------+-+ |
Case 5 5 -------+ +---------------------+ |
Case 2 2 ---------+ | |
Case 7 7 -----+---------+ +-----------------+
Case 8 8 -----+ | |
Case 6 6 -+-+ +---------------+
Case 16 16 -+ +-+ |
Case 11 11 ---+ +-------+ |
Case 1 1 -----+ +-+
Case 4 4 -------------+
Oneway
[DataSet6] C:\Documents and Settings\shailaja.rego\Desktop\cluster exam 2011.sav
Descriptives
95% Confidence Interval for
Std. Mean
N Mean Deviation Std. Error Lower Bound Upper Bound Minimum Maximum
turnover 1 7 60596.16 7201.233 2721.810 53936.13 67256.19 53262 75041
2 3 66879.92 5023.982 2900.597 54399.66 79360.18 62256 72226
3 15 24659.93 12400.521 3201.801 17792.75 31527.11 10000 55051
Tota 25 39788.47 21583.268 4316.654 30879.34 48697.61 10000 75041

l
Salary_cos 1 7 16857.14 5984.106 2261.779 11322.77 22391.52 11000 28000

t 2 3 23333.33 3785.939 2185.813 13928.54 32738.13 19000 26000
3 15 4300.00 3400.630 878.039 2416.79 6183.21 1000 12000
Tota 25 10100.00 8551.316 1710.263 6570.19 13629.81 1000 28000

l
Fixed_cost 1 7 10142.86 4180.453 1580.063 6276.58 14009.13 5000 16000
2 3 27000.00 2645.751 1527.525 20427.59 33572.41 24000 29000
3 15 3133.33 1302.013 336.178 2412.30 3854.36 1000 6000
Tota 25 7960.00 8197.967 1639.593 4576.05 11343.95 1000 29000

l
Profit 1 7 8571.43 4429.339 1674.133 4474.97 12667.88 3000 13000
2 3 16000.00 4582.576 2645.751 4616.25 27383.75 11000 20000
3 15 1900.00 1227.657 316.980 1220.15 2579.85 1000 5000
Tota 25 5460.00 5671.420 1134.284 3118.95 7801.05 1000 20000

l
Test of Homogeneity of Variances
Levene Statistic df1 df2 Sig.
turnover 1.436 2 22 .259
Salary_cost 1.839 2 22 .183
Fixed_cost 3.220 2 22 .040
Profit 5.398 2 22 .045
ANOVA
Sum of Squares df Mean Square F Sig.
turnover Between Groups 8.666E9 2 4.333E9 37.910 .000
Within Groups 2.514E9 22 1.143E8
Total 1.118E10 24
Salary_cost Between Groups 1.350E9 2 6.748E8 36.617 .000
Within Groups 4.054E8 22 1.843E7
Total 1.755E9 24
Descriptives
95% Confidence Interval for
Std. Mean
N Mean Deviation Std. Error Lower Bound Upper Bound Minimum Maximum
turnover 1 7 60596.16 7201.233 2721.810 53936.13 67256.19 53262 75041
2 3 66879.92 5023.982 2900.597 54399.66 79360.18 62256 72226
3 15 24659.93 12400.521 3201.801 17792.75 31527.11 10000 55051
Tota 25 39788.47 21583.268 4316.654 30879.34 48697.61 10000 75041

l
Salary_cos 1 7 16857.14 5984.106 2261.779 11322.77 22391.52 11000 28000

t 2 3 23333.33 3785.939 2185.813 13928.54 32738.13 19000 26000
3 15 4300.00 3400.630 878.039 2416.79 6183.21 1000 12000
Tota 25 10100.00 8551.316 1710.263 6570.19 13629.81 1000 28000

l
Fixed_cost 1 7 10142.86 4180.453 1580.063 6276.58 14009.13 5000 16000
2 3 27000.00 2645.751 1527.525 20427.59 33572.41 24000 29000
3 15 3133.33 1302.013 336.178 2412.30 3854.36 1000 6000
Tota 25 7960.00 8197.967 1639.593 4576.05 11343.95 1000 29000

l
Profit 1 7 8571.43 4429.339 1674.133 4474.97 12667.88 3000 13000
2 3 16000.00 4582.576 2645.751 4616.25 27383.75 11000 20000
3 15 1900.00 1227.657 316.980 1220.15 2579.85 1000 5000
Tota 25 5460.00 5671.420 1134.284 3118.95 7801.05 1000 20000

l
Test of Homogeneity of Variances
Levene Statistic df1 df2 Sig.
turnover 1.436 2 22 .259
Salary_cost 1.839 2 22 .183
Fixed_cost 3.220 2 22 .040
Fixed_cost Between Groups 1.470E9 2 7.352E8 113.430 .000
Within Groups 1.426E8 22 6481385.281
Total 1.613E9 24
Profit Between Groups 5.911E8 2 2.956E8 35.963 .000
Within Groups 1.808E8 22 8218831.169
Total 7.720E8 24
Answer
Q. What are the groups formed?

A. The output shows three cluster solution. Though maximum jump in the coefficients is for 2 cluster
solution, 3 cluster solution also has considerable jump and between 2 cluster & 3 cluster 3 cluster is
preferred.
Q.Who are the members of the group (give case number)?
A.
Cluster 1 Cluster 2 Cluster 3

Case 21 Case 3 Case 7
Case 9 Case 16
Case 19 Case 11
Case 14 Case 1
Case 15 Case 4
Case 18
Case 10
Case 22
Case 17
Case 12
Case 24
Case 25
Case 13
Note : since cluster 2 has too less cases, one may also consider 2 cluster solution as better solution.
2 cluster solution
Cluster 1 Cluster 2
Case 21 Case 3
Case 23 Case 5
Case 20 Case 2
Case 9 Case 7
Case 19 Case 8
Case 14 Case 6
Case 15 Case 16
Case 18 Case 11
Case 10 Case 1
Case 22 Case 4
Case 17
Case 12
Case 24
Case 25
Case 13
Q. What would you conclude from ANOVA?
A. ANOVA (for 3 cluster solution) is rejected for all variables (sig. value < 0.05) indicating the means of
variables across these clusters are different indicating A good cluster solution.
Conjoint Analysis
Q. 3 A company entering in the healthcare insurance sector wants to decide the price for their healthcare
product. They considered four attributes for this. The attributes and their levels is given below.
Charges Is third party Age for Amount
cover Insured
1500 Yes <50 1 LK
2000 No 50 to 60 2 LK
2500 60 to 70 3 Lk
3000
The variables were coded as follows
Charges X1 X2 X3
1500 1 0 0 Third party x4
2000 0 1 0 Yes 1
2500 0 0 1 No -1
3000 -1 -1 -1
Age X5 X6 Amount insured X7 X8

<50 1 0 1 Lk 1 0
50 to 60 0 1 2 Lk 0 1
60 and above -1 -1 3 Lk -1 -1
Note that the rankings are assumed to be best is highest
Output for “Medicare”

Model Summary
Std. Error
Mode Adjusted of the
l R R Square R Square Estimate
1 .924(a) .853 .834 8.51766
a Predictors: (Constant), x8, x3, x4, x6, x7, x2, x5, x1
ANOVA(b)
Mod Sum of Mean

el Squares df Square F Sig.
1 Regressio 26527.3
8 3315.914 45.705 .000(a)
n 16
Residual 4570.68
63 72.551
4
Total 31098.0
71
00
a Predictors: (Constant), x8, x3, x4, x6, x7, x2, x5, x1
b Dependent Variable: y
Coefficients(a)
Unstandardized Standardized
Coefficients Coefficients
Mode Std.
l B Error Beta t Sig.
1 (Constant
35.753 1.320 27.089 .000
)
x1 21.888 3.137 .563 6.978 .000
x2 6.067 1.638 .232 3.703 .000
x3 -8.727 2.116 -.280 -4.124 .000
x4 .828 1.095 .040 .756 .000
x5 -4.324 1.738 -.165 -2.487 .016
x6 -1.529 1.490 -.064 -1.026 .000
x7 -10.257 1.284 -.409 -7.990 .000
x8 -4.006 1.408 -.160 -2.846 .006
a Dependent Variable: y
Answer following
a) Find the utility for each attribute at each level and Comment on the output
b) Find the percentage utility for each attribute.
c) Draw the graph of the utility for each attribute
d) Which of the attribute is most important while considering the medical insurance? Why?
(8 Marks)
Answer
Please note Regression requirements are not mentioned in this answer.
a) Find the utility for each attribute at each level and Comment on the output
b) Find the percentage utility for each attribute.
Ans a & B
Charges Partial
Variavle Utility Range Percentage
1500 x1 21.888 41.116 53.07413
2000 x2 6.067
2500 x3 -8.727
3000 -19.228
Is third Partial
party Variavle Utility Range Percentage
Yes x4 0.828 1.656 2.137629
No -0.828
Age
for
cover Variavle Partial Utility
<50 x5 -4.324 10.177 13.13687
50 to 60 x6 -1.529
60 to 70 5.853
Amount
Insured Variavle Partial Utility
1 LK x7 -10.257 24.52 31.65137
2 LK x8 -4.006
3 Lk 14.263
77.469 100
c) Draw the graph of the utility for each attribute

Charges
25
20
15
10
5
0
1500 2000 2500 3000
-5
-10
-15
-20
-25
Is third party
1
0.8
0.6
0.4
0.2
0
Yes No
-0.2
-0.4
-0.6
-0.8
-1
Amount Insured
20
15
10
0
1 LK 2 LK 3 Lk
-5
-10
-15
Amount Insured
20
15
10
0
1 LK 2 LK 3 Lk
-5
-10
-15
a) Which of the attribute is most important while considering the medical insurance? Why?
Since Charges has maximum % utility it is considered as most important
Classification & Regression Trees
Classification Tree
[DataSet1] C:\Users\shailaja.rego\Desktop\final exam credit rating expaned for

tree.sav
Tree Table
Primary Independent Variable

Std.
Deviatio Predicted Improve
Node Mean n N Percent Mean Parent Node Variable ment Split Values
0 .2800 .45050 150 100.0% .2800
1 .7500 .43759 48 32.0% .7500 0 Sav-ings .104 <= 68.500
2 .0588 .23646 102 68.0% .0588 0 Sav-ings .104 > 68.500
3 .9167 .28031 36 24.0% .9167 1 Income .027 <= 4363.500
4 .2500 .45227 12 8.0% .2500 1 Income .027 > 4363.500
5 .2000 .40684 30 20.0% .2000 2 Sav-ings .006 <= 122.000
6 .0000 .00000 72 48.0% .0000 2 Sav-ings .006 > 122.000
Growing Method: CRT

Dependent Variable: Default
Gain Summary for Nodes
Node N Percent Mean
3 36 24.0% .9167
4 12 8.0% .2500
5 30 20.0% .2000
6 72 48.0% .0000
Growing Method: CRT

Dependent Variable: Default
Growing Method: CRT
Dependent Variable: Previously defaulted
Classification
Predicted
Not Percent
Observed Defaulted Defaulted Correct
Defaulted 105 3 72%
Not Defaulted 9 33 28%
Overall Percentage 76% 24% 92%
a) Explain the tree output. How many splits the tree has?
The output has 3 splits and 6 Nodes
b) Explain each split in terms of variables
First split is on Savings <= 68.500

Second split is on income <= 4363.500
Third split is on Savings <= 122.000
If a customer has savings <= 68.5 and income is less than 4363.5 than the probability of that customer will
default is very high(95%)
If a customer has savings <= 68.5 and income is more than 4363.5 than the probability of that customer will
default is low (25%)
If a customer has savings > 68.5 and savings <122 than the probability of that customer will default is low(20%)
If a customer has savings >122 than the probability of that customer will default is very very low (0%)
c) What is the classification accuracy? Is it adequate?
0.72
0.28
Chance Accuracy = 0.5968

25% improvement over Chance Accuracy 0.746
Yes as it has 25% improvement over chance accuracy
d) According to this output, which variable contributes maximum to the classification?
Savings contributes maximum

Solved Problem Regression Factor Analysis Discriminant Logistic Cluster Conjoint (1) Xid-2939872 1 IjL9spRH9K

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Solved Problem Regression Factor Analysis Discriminant Logistic Cluster Conjoint (1) Xid-2939872 1 IjL9spRH9K

Uploaded by

Copyright:

Available Formats

Multiple Regression

The variables are, Amount

1 income in 000, . Enter

a. All requested variables entered.

Adjusted R Std. Error of the

1 .995a .989 .986 5.769 1.966

a. Predictors: (Constant), income in 000, location, transactions

b. Dependent Variable: Amt spent in '000

Model Sum of Squares df Mean Square F Sig.

1 Regression 33346.057 3 11115.352 333.974 .000a

Residual 366.103 11 33.282

a. Predictors: (Constant), income in 000, location, transactions

b. Dependent Variable: Amt spent in '000

1 (Constant) 13.426 7.563 1.775 .103 -3.219 30.071

a. Dependent Variable: Amt spent in '000

Comment(giving reasons)on if the regression model can be used for prediction.

Yes the model can be used for prediction as

1)All variables are quantitative or binary

1 (Constant) 13.426 7.563 1.775 .103 -3.219 30.071

a. Dependent Variable: Amt spent in '000

Upper bound Lower limit

Kaiser-Meyer-Olkin Measure of Sampling Adequacy. .811

Extraction Method: Principal Axis Factoring.

Total Variance Explained

Extraction Sums of Squared Rotation Sums of Squared

% of Cumulative % of Cumulative % of Cumulative

1 3.929 43.657 43.657 3.549 39.432 39.432 2.512 27.910 27.910

2 1.527 16.964 60.620 1.127 12.523 51.955 1.702 18.907 46.818

3 1.172 13.024 73.645 .766 8.512 60.466 1.228 13.649 60.466

Kaiser-Meyer-Olkin Measure of Sampling Adequacy. .811

4 .519 5.768 79.413

5 .480 5.339 84.751

6 .403 4.480 89.232

7 .389 4.324 93.556

8 .335 3.728 97.283

9 .244 2.717 100.000

Extraction Method: Principal Axis Factoring.

Extraction Method: Principal Axis Factoring.

a. 3 factors extracted. 12 iterations required.

Rotated Factor Matrixa

Kaiser-Meyer-Olkin Measure of Sampling Adequacy. .811

Extraction Method: Principal Axis Factoring.

a. Rotation converged in 5 iterations.

Answer following questions

C) Is there any variable/s which should not be included in the analysis? NO

Unweighted Cases N Percent

Excluded Missing or out-of-range group 0 .0

At least one missing 0 .0

Both missing or out-of-range 0 .0

not_default Mean Std. Deviation Unweighted Weighted

.00 Years 6.1429 1.86445 7 7.000

Credit_rating 33.5714 15.99851 7 7.000

Client_Rating 36.7143 19.58498 7 7.000

Monthly_Turnover 247.7143 142.05834 7 7.000

1.00 Years 7.7750 .60019 12 12.000

Credit_rating 57.8333 13.09291 12 12.000

Client_Rating 61.1667 9.93463 12 12.000

Monthly_Turnover 367.0000 94.37835 12 12.000

Total Years 7.1737 1.42589 19 19.000

Credit_rating 48.8947 18.29358 19 19.000

Client_Rating 52.1579 18.30380 19 19.000

Monthly_Turnover 323.0526 125.16011 19 19.000

Tests of Equality of Group Means