
Submitted By: Sonali Pradhan

Content

Problem 1 - Clustering

1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
1.2 Do you think scaling is necessary for clustering in this case? Justify
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters

Problem 2: CART-RF-ANN

2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
2.2 Split the data into test and train, build classification model CART, Random Forest, Artificial Neural Network
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model
2.4 Final Model: Compare all the models and write an inference which model is best/optimized
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendation?

Problem 1 - Clustering

A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They collected a sample that summarizes the activities of users during the past few months. You are given the task to identify the segments based on credit card usage.

Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis)

Dataset Head:

Number of Rows: 210


Number of Columns: 7

From the information provided, we can infer that there are no null values in the dataset. All the variables in the dataset are continuous (they share the same datatype).

Looking at the summary statistics above, Spending and Advance Payments are measured on the largest scales among the variables, with maximum values of roughly 21 and 17 respectively.

Probability of Full Payment and Credit Limit are measured on the smallest scales, with maximum values of roughly 1 and 4 respectively.

Univariate Analysis:

spending
Skew: 0.39702715402072153

advance_payments
Skew: 0.38380604212562563

current_balance
Skew: 0.5217206481959239

credit_limit
Skew: 0.13341648969738146

min_payment_amt
Skew: 0.3987925792256687

max_spent_in_single_shopping
Skew: 0.5578758322317957
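For reference, the skewness values and the histogram/boxplot views above can be produced with pandas and seaborn. This is a minimal sketch: the file name and the dataframe name `df` are assumptions, while the column names match the variables listed above.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("bank_marketing_part1_Data.csv")  # assumed file name

# Variables covered in the univariate analysis above
cols = ["spending", "advance_payments", "probability_of_full_payment",
        "current_balance", "credit_limit", "min_payment_amt",
        "max_spent_in_single_shopping"]

for col in cols:
    print(col, "Skew:", df[col].skew())            # skewness of the variable
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col], kde=True, ax=axes[0])    # distribution shape
    sns.boxplot(x=df[col], ax=axes[1])             # outlier check
    plt.show()
```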
The diagrams show that Probability of Full Payment is left skewed; all the other variables are right skewed.
Outliers are present only in the Probability of Full Payment and Minimum Payment Amount variables.

Bivariate and Multivariate Analysis:

We can see high positive correlation between the following variables:

1. Advance Payments and Spending
2. Current Balance and Advance Payments
3. Current Balance and Spending
4. Credit Limit with Spending, Advance Payments, and Current Balance
5. Max Spent in Single Shopping with Spending, Advance Payments, and Current Balance
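The pair plot and correlation heatmap behind these observations can be generated as follows; a short sketch, assuming the same `df` as in the earlier snippet.

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, diag_kind="kde")                      # bivariate scatter plots
plt.show()

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")    # multivariate correlation view
plt.show()
```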
Do you think scaling is necessary for clustering in this case? Justify

Yes. Scaling controls the variability of the dataset: it transforms the data into a specific range using a linear transformation, which generates good-quality clusters and improves the accuracy of clustering algorithms.
Feature scaling through standardization (or Z-score normalization) can be an important pre-processing step for many machine learning algorithms. Standardization involves rescaling the features such that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one.
To illustrate this, the results obtained with StandardScaler-style scaling applied are compared to those from the unscaled data.
We used zscore from scipy.stats to standardize the data. Below is the head of the transformed dataframe.

As we can see from the above dataframe, scaling transforms the values so that every variable has a mean of 0 and a standard deviation of 1.
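A minimal sketch of this standardization step, using `zscore` from `scipy.stats` on the dataframe loaded earlier (the name `scaled_df` is an assumption):

```python
from scipy.stats import zscore

# Apply z-score standardization column-wise: (x - mean) / std for every variable
scaled_df = df.apply(zscore)

print(scaled_df.head())
print(scaled_df.mean().round(2))   # approximately 0 for every column
print(scaled_df.std().round(2))    # approximately 1 for every column
```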

Apply hierarchical clustering to scaled data. Identify the number of
optimum clusters using Dendrogram and briefly describe them.

A dendrogram is a visual representation of cluster formation. On the x-axis are the item names or item numbers; on the y-axis is the distance or height. The vertical lines denote the height at which two items or two clusters combine. The higher the level of combining, the more distant the individual items or clusters are. By definition of hierarchical clustering, all items must eventually combine into one cluster. However, the problem of interest here is to determine the optimum number of clusters. Recall that each cluster is a representative of a different population. Since this is a problem of unsupervised learning, there is no error measure to dictate the number of clusters; to some extent this is a subjective choice. After considering the dendrogram, one may decide the level at which the resultant tree needs to be cut. This is another way of saying how close-knit or dispersed the clusters should be. If the number of clusters is large, the cluster size is small and the clusters are homogeneous. If the number of clusters is small, each contains more items and hence the clusters are more heterogeneous. Often practical considerations or domain knowledge play a part in determining the number of clusters. Note that depending on the distance measure and linkage used, the number of clusters and their composition may be different.

Ward’s Linkage:
This is an alternative method of clustering, also known as minimum variance clustering method.
In that sense it is similar to the k-means clustering (which is discussed later), an iterative method
where after every merge, the distances are updated successively. Ward’s method often creates
compact and even-sized clusters. However, the drawback is that it may result in less-than-optimal clusters. Like all other clustering methods, it is also computationally intensive.
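A hedged sketch of this step: Ward linkage on the scaled data, a dendrogram for visual inspection, and a cut into three clusters with `fcluster` (the truncation level and the label column name are assumptions):

```python
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Ward (minimum variance) linkage on the scaled data
wardlink = linkage(scaled_df, method="ward")

# Condensed dendrogram showing only the last 10 merges
plt.figure(figsize=(10, 5))
dendrogram(wardlink, truncate_mode="lastp", p=10)
plt.show()

# Cut the tree into 3 clusters, as chosen from the dendrogram
df["H_clusters"] = fcluster(wardlink, 3, criterion="maxclust")
print(df["H_clusters"].value_counts())
```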

Both linkage methods give almost similar results, with only minor variation, which is expected. Based on the dendrogram, 3 or 4 clusters look reasonable. After further analysis of the dataset, a 3-cluster solution was chosen for the hierarchical clustering.
In a real-world setting, more variables could have been captured, such as tenure, BALANCE_FREQUENCY, balance, purchases, purchase installments, and others.
The three-cluster solution gives a pattern of high/medium/low spending along with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).

Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.

There are many methods that are recommended for determination of an optimal number of
partitions. Unfortunately, however, there is no closed form solution to the problem of determining
k. The choice is somewhat subjective and graphical methods are often employed.

The objective of partitioning is to separate out the observations or units so that the ‘most’ similar items are put together. Recall that singleton clusters will have the lowest value of WSS, but that is not useful. Hence finding k means striking a balance between WSS and cluster size.

Elbow Method:

For a given number of clusters, the total within-cluster sum of squares (WCSS) is computed. That value of k is chosen as optimum where the addition of one more cluster does not lower the total WCSS appreciably. The Elbow method looks at the total WCSS as a function of the number of clusters.

The figure indicates a clear break in the elbow after k=2. Hence one option for the optimum number of clusters is 2.
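A short sketch of the elbow computation: fit K-Means for a range of k on the scaled data and record the total WCSS, which scikit-learn exposes as `inertia_` (the k range and the random_state are assumptions):

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(scaled_df)
    wcss.append(km.inertia_)          # total within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Total WCSS")
plt.show()
```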

Silhouette Method:

This method measures how tightly the observations are clustered and the average distance between clusters. For each observation, a silhouette score is constructed as a function of the average distance between the point and all other points in the cluster to which it belongs, and the average distance between the point and the points in the clusters to which it does not belong. The maximum value of this statistic indicates the optimum value of k.

Here, the maximum value of the Silhouette score is 0.32943733699973826, with a minimum score of -0.473.
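A sketch of the silhouette check on the scaled data, printing the average silhouette score and the minimum per-observation silhouette width for each candidate k (the range of k is an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(scaled_df)
    avg_width = silhouette_score(scaled_df, labels)            # average silhouette width
    min_width = silhouette_samples(scaled_df, labels).min()    # worst-placed observation
    print(k, round(avg_width, 4), round(min_width, 4))
```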

Describe cluster profiles for the clusters defined. Recommend different
promotional strategies for different clusters.

Cluster 0: Customers are relatively less active in their usage of credit cards.

Cluster 1: Customers have more credibility in terms of usage, as they are high on spending and their probability of full payment is also higher.

Cluster Group Profiles
Group 1: High Spending
Group 3: Medium Spending
Group 2: Low Spending
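The group profiles above can be summarized with a group-by on the cluster labels; a minimal sketch, assuming the `df` and `scaled_df` frames from the earlier snippets, a 3-cluster K-Means fit, and a hypothetical label column `KM_clusters`:

```python
from sklearn.cluster import KMeans

# Attach K-Means labels (3 clusters, as finalized above) to the original data
df["KM_clusters"] = KMeans(n_clusters=3, random_state=1).fit_predict(scaled_df)

# Per-cluster means of every variable, plus cluster sizes
profile = df.groupby("KM_clusters").mean()
profile["freq"] = df["KM_clusters"].value_counts()
print(profile.round(2))
```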
Promotional strategies for each cluster

Group 1: High Spending Group
- Giving reward points might increase their purchases.
- The maximum max_spent_in_single_shopping is high for this group, so they can be offered discounts/offers on the next transaction upon full payment.
- Increase their credit limit.
- Encourage their spending habits.
- Give loans against the credit card, as they are customers with a good repayment record.
- Tie up with luxury brands, which will drive more one-time maximum spending.

Group 3: Medium Spending Group
- They are potential target customers who are paying bills, making purchases and maintaining a comparatively good credit score, so we can increase the credit limit or lower the interest rate.
- Promote premium cards/loyalty cards to increase transactions.
- Encourage spending habits by tying up with premium e-commerce sites, travel portals, and airlines/hotels, as this will encourage them to spend more.

Group 2: Low Spending Group
- Customers should be given reminders for payments. Offers can be provided on early payments to improve their payment rate.
- Increase their spending habits by tying up with grocery stores and utilities (electricity, phone, gas, others).

Recommendations:

For the customers in Cluster 0, more emphasis should be given to increasing the usage of the credit cards, as their average spending is very low when compared to Cluster 1. Also, the credit limits should be increased for Cluster 0, which will enhance the customers’ spending capability as well as the money spent in a single transaction.
For Cluster 1, customers should be given more perks for using the credit cards, as they have a high probability of paying the full amount and their credit limit is already high, which can encourage them to use the credit card more frequently.

Problem 2 - CART-RF-ANN

An Insurance firm providing tour insurance is facing higher claim frequency. The
management decides to collect data from the past few years. You are assigned the task
to make a model which predicts the claim status and provide recommendations to
management. Use CART, RF & ANN and compare the models' performances in train
and test sets.

Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis).
Dataset Head:

Number of Rows: 3000


Number of Columns: 10

From the information provided, we can infer that there are no null values in the dataset. The variables have different datatypes: integer, float and object. The object-datatype variables will be converted to categorical variables later.

There are 6 nominal variables, which are the object-datatype variables, and 4 numeric variables: Age, Commission, Duration, and Sales.

There are no missing values.

Also, there are 139 duplicated rows; after their removal the information is as follows:

The above table primarily shows that the variables have very low correlation with each other, which is also confirmed later through the bivariate analysis.
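A minimal sketch of the initial steps described above (the file name and the dataframe name `ins_df` are assumptions):

```python
import pandas as pd

ins_df = pd.read_csv("insurance_part2_data.csv")   # assumed file name

ins_df.info()                                       # datatypes and null counts
print("Duplicated rows:", ins_df.duplicated().sum())
ins_df = ins_df.drop_duplicates()

# Convert object-datatype columns to categorical codes for modelling
for col in ins_df.select_dtypes(include="object").columns:
    ins_df[col] = pd.Categorical(ins_df[col]).codes
```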

Univariate Analysis:

The diagrams show that Age, Claim, Commission, Duration, Sales, Product Name and Destination are right skewed, while all the remaining variables are moderately left skewed.
Outliers are present in all variables except Agency, Type, and Claim, which are nominal variables.

Bivariate Analysis:

We can see high positive correlation only between Sales and Commission. The other variables show either negligible or sometimes even negative correlation with each other, as already hypothesised before.

Data Split: Split the data into test and train, build classification model
CART, Random Forest, Artificial Neural Network

CART Classification Model:


Model: X_train (2001, 9), X_test (900, 9), Train_labels (2001,), Test_labels (900,)
Parameter Guidelines:

Criterion: Gini, Max_depth: 50, Min_samples_leaf: 60, Min_samples_split: 450, Random_state: 5.

Random Forest Classifier Model:

Model: max_depth=6, max_features=6, min_samples_leaf=15, min_samples_split=55, n_estimators=400, random_state=1

Artificial Neural Network Model:

Model: hidden_layer_sizes=200, max_iter=4000, random_state=1, tol=0.01
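A hedged sketch of the split and the three classifiers with the parameter settings listed above; the target column name `Claimed`, the 70:30 split, and the variable names are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

X = ins_df.drop("Claimed", axis=1)      # assumed target column name
y = ins_df["Claimed"]
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1)

# CART with the parameter guidelines listed above
cart = DecisionTreeClassifier(criterion="gini", max_depth=50, min_samples_leaf=60,
                              min_samples_split=450, random_state=5)
cart.fit(X_train, train_labels)

# Random Forest with the parameters listed above
rf = RandomForestClassifier(n_estimators=400, max_depth=6, max_features=6,
                            min_samples_leaf=15, min_samples_split=55, random_state=1)
rf.fit(X_train, train_labels)

# ANN (MLP) with the parameters listed above; 200 means one hidden layer of 200 units
ann = MLPClassifier(hidden_layer_sizes=200, max_iter=4000, tol=0.01, random_state=1)
ann.fit(X_train, train_labels)
```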

Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model.
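Before the per-model results, a sketch of how these metrics can be obtained with scikit-learn, shown here for the CART model from the sketch above; the same calls apply to `rf` and `ann`:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score, roc_curve)
import matplotlib.pyplot as plt

for name, X_, y_ in [("Train", X_train, train_labels), ("Test", X_test, test_labels)]:
    pred = cart.predict(X_)                 # class predictions
    prob = cart.predict_proba(X_)[:, 1]     # probability of the positive class
    print(name, "Accuracy:", accuracy_score(y_, pred))
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))
    print(name, "ROC_AUC:", roc_auc_score(y_, prob))
    fpr, tpr, _ = roc_curve(y_, prob)
    plt.plot(fpr, tpr, label=name + " ROC")

plt.plot([0, 1], [0, 1], linestyle="--")    # chance line
plt.legend()
plt.show()
```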

CART Model:

Train Data:

AUC: 82%
Accuracy: 76%
Precision: 65%
f1-Score: 61%
Confusion Matrix: [[1157, 202],[ 270, 373]]

AUC Train Set

Test Data:

AUC: 79.2%
Accuracy: 78%
Precision: 68%
f1-Score: 63%
Confusion Matrix: [[510, 78],[ 109, 162]]

AUC Test Set
Training and Test set results are almost similar, and with the overall measures fairly high, the model is a moderate model.

Random Forest Classifier Model:

Train Data:
AUC: 81%
Accuracy: 76%
Precision: 75%
f1-Score: 66%
Confusion Matrix: [[1228, 131],[ 258, 385]]

AUC Train Set

Test Data:

AUC: 81%
Accuracy: 79%
Precision: 70%
f1-Score: 62%
Confusion Matrix: [[522, 66],[ 118, 153]]

AUC Test Set

Training and Test set results are almost similar, and with the overall measures high, the
model is a good model.

Artificial Neural Network Model:

Train Data:

AUC: 79.2%
Accuracy: 76%
Precision: 65%
f1-Score: 60%
Confusion Matrix: [[1163, 196],[ 280, 363]]

AUC Train Set

Test Data:

AUC: 79.1%
Accuracy: 77%
Precision: 66%
f1-Score: 61%
Confusion Matrix: [[510, 78],[ 118, 153]]

AUC Test Set

Training and Test set results are almost similar, and with the overall measures fairly high, the model is a moderate model.

Final Model: Compare all the models and write an inference which model is
best/optimized.

AUC Train Set

AUC Test Set

Out of the 3 models, Random Forest has slightly better performance than the CART and Neural Network models.

Overall, all three models are reasonably stable and can be used for making future predictions. From the CART and Random Forest models, the variable Agency Code is found to be the most useful feature for predicting whether a person will make a claim or not.
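The feature-importance observation above can be checked directly on the fitted tree-based models; a small sketch using the Random Forest from the earlier snippet:

```python
import pandas as pd

# Importance of each predictor in the Random Forest model
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```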

Inference: Based on the whole Analysis, what are the business insights
and recommendation?

Insights:

Agency Code is the most important and defining variable for tour insurance claims.

Sales is the most important numeric variable for predicting an insurance claim.

The Random Forest model is the most effective model for evaluating insurance claims.

Recommendations:

Emphasis should be laid on strengthening the correlation between the variables, which is presently very scattered.
Also, the precision is currently only moderate. To improve efficiency, a more compact model should be employed so that the insurance claim structure is not scattered.
