Professional Documents
Culture Documents
Cluster Training PDF Modified Compatibility Mode
Cluster Training PDF Modified Compatibility Mode
..
Solution : Identify segments where people have same characters and target each of
these segments in a different way
Approach to Segmentation
Segmentation is of 2 types
Objective Segmentation
Subjective Segmentation
Response rate
Increase in Sales
Conversion proportion
CHAID Analysis
Cluster Analysis
Cluster Analysis
Cluster 6
17.76
Cluster 1
20.61
5.80
Cluster 2
Cluster 5
16.19
12.15
Cluster 4
27.49
Cluster 3
Example of Clusters
Example Cluster 1
High Balance
Low Income
High
Current
Balance
Medium
Example Cluster 2
High Income
Low Balance
Low
Low
Medium
Gross Monthly Income
High
Cluster 1 and Cluster 2 are being differentiated by Income and Current Balance. The objects in
Cluster 1 have similar characteristics (High Income and Low balance), on the other hand the
objects in Cluster 2 have the same characteristic (High Balance and Low Income).
But there are much differences between an object in Cluster 1 and an object in Cluster 2.
Cluster Methodology
Variables Creation
Final Dataset
Validation Sample
Outlier Treatment
Data
Variables Standardization
Cluster
Profiles
Outlier
80
To identify them:
70
50
40
Var 2
60
30
20
10
0
0
10
15
20
25
30
35
40
45
Var 1
To tackle them:
1. The outliers can be deleted from analysis if they are very small in number.
2. The variables selected can be trimmed or capped.
% of Missing
Less than 1%
1-5%
5-10%
Treatments
Mean Imputation
Regression Imputation
Mean Imputation
Regression Imputation
Try to use some proxy Variable
Note: - SAS does not include observations with missing values for Clustering Process
Factor Analysis: By Factor Analysis select those factors, which are explaining almost
90/95 % of total variation together. Then select those variables which
have high loadings towards those factors.
VIF (Variance Inflation Factor): Variables with VIF more than 2 should be dropped
Generally we standardize by making the mean = 0 and variance = 1 thus deunitizing the
variables and bringing them on a common platform to analyze.
Post all the data treatment steps Cluster Development Process is commenced
upon.
Post Cluster Development Cluster Validation is done on the validation sample
to establish that the cluster solution is not Sample dependent.
Cluster Building
Hierarchical Clustering
Each observation is considered as an
individual cluster. Distance from each
observation to all others is calculated &
the nearest observations are clubbed to
form
clusters.
Intensive
distance
calculations required thus making it
difficult to implement.
K-Means Clustering
K distinct observations are randomly
selected at the highest distance from
each
other.
Each
observation
is
considered one by one & clubbed to the
nearest Cluster. If two clusters come
significantly close to each other, they are
merged to each other to form a new
cluster.
Hierarchical Clustering is not suitable for large datasets as the multitude of calculations
involved would be impossibly huge. Thus K-Means clustering is the most used method of
clustering.
Frequency
1
2
3
4
5
6
7
69696
164495
84576
53434
111923
171323
61126
0.8642
0.3587
0.7891
0.6266
0.4809
0.3729
0.7138
14.6487
3.7355
15.5323
4.471
8.6794
2.4891
12.3443
2.7342
1.7221
3.2326
1.9309
1.8346
1.7221
2.4533
Cluster Means
Cluster
1
2
3
4
5
6
7
CNT_LAN_MAT_TW
-0.197450101
-0.375048706
2.622903928
-0.375048706
-0.355935079
-0.375046517
-0.363966798
Variable
Loanno
2.27108366
-0.34924125
-0.09743388
-0.35414491
-0.28306493
-0.28265108
0.1052424
NO_ADV_EMI
-1.046054301
0.200597434
-0.435001209
-0.97960221
1.567501549
0.111602877
-1.071824808
R-Square
CNT_LAN_MAT_TW
Loanno
NO_ADV_EMI
MONTHS_SINCE_LOAN_MATURITY
TENOR
0.923282
0.572698
0.694306
0.590897
0.629882
OVER-ALL
0.682213
MONTHS_SINCE_LOAN_MATURITY
TENOR
-0.509641873
0.994222204
-0.046890848
1.463155948
-0.354749804
-0.724103839
-0.629530996
0.312811446
-0.677162416
0.113553959
0.777377199
0.347933497
-0.705347005
1.968822045
Scatter Plot
80
Cluster 1
70
Cluster 3
60
R - Square =
Var 2
50
Cluster 2
40
Between Variation
30
20
Total Variation
10
0
0
10
15
20
25
30
35
40
45
50
Var 1
Higher R-Square signifies high between variation and low within variation. Thus Higher
the R-Square, the better it is.
RMMSTD
RMMSTD within a cluster = Square root of Average of (Variance of variable 1 in that cluster,
Variance of variable 2 in that cluster, ,Variance of variable p in that cluster) . Assuming p
variables were used for Clustering.
Cluster Validation
The Cluster Solution is Validated on the Validation Sample using the Minimum Euclidean
Distance Method. Validation is done by calculating the distance of each observation in the
Validation sample from the Cluster Seed & assigning it to the closest cluster.
Scatter Plot
80
New
Observation
70
60
Var 2
50
Cluster 1
40
Cluster 3
30
20
Cluster 2
10
0
0
10
15
20
25
30
35
40
Var 1
45
50
Cluster 1
Cluster 7
8.53%
Cluster
1
2
3
4
5
6
7
Total
Frequency
69,696
164,495
84,576
53,434
111,923
171,323
61,126
716,573
%
9.73
22.96
11.80
7.46
15.62
23.91
8.53
9.73%
23.91%
22.96%
Cluster 2
Cluster 6
11.80%
15.62%
7.46%
Cluster 5
100
Cluster 3
Cluster 4
Cluster Population (% )
Validation Sample
Cluster 1
Cluster 7
Cluster
1
2
3
4
5
6
7
Frequency
69,899
164,653
84,837
53,625
112,250
172,084
60,320
%
9.74
22.94
11.82
7.47
15.64
23.98
8.41
8.41%
23.98%
22.94%
Cluster 2
Cluster 6
15.64%
Total
717,668
100
9.74%
Cluster 5
11.82%
7.47%
Cluster 3
Cluster 4
The Validation sample was scored using the cluster solution. The frequency plot shows
a similar distribution on the Validation sample as in the Development sample.
PROFILING
Cluster
Solution
Microsoft Excel
Worksheet
Microsoft Excel
Worksheet