Data Mining Project

5/23/2021
KARTHIKEYAN M
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its customers.
They collected a sample that summarizes the activities of users during the past few months. You are
given the task to identify the segments based on credit card usage.
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis).
1.2 Do you think scaling is necessary for clustering in this case? Justify
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters.
The dataset has 210 rows and 7 columns.
All the columns are of float or integer datatype.
There are no null values in the data set.
The mean and median of each column are nearly equal, which suggests that the distributions are roughly
symmetric and close to normal; this is visible in the distribution plots as well.
There are no duplicate values present in the dataset.
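The initial checks described above (shape, datatypes, nulls, duplicates, mean versus median) can be sketched in pandas as follows. This is only a minimal sketch on synthetic data: the column names var_1 to var_7 and the random values are placeholders, not the bank's data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 210 x 7 data set (hypothetical names and values).
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(10, 2, size=(210, 7)),
                  columns=[f"var_{i}" for i in range(1, 8)])

print(df.shape)    # rows and columns
print(df.dtypes)   # datatype of each column

n_nulls = int(df.isnull().sum().sum())   # total missing values
n_dupes = int(df.duplicated().sum())     # fully duplicated rows
print("nulls:", n_nulls, "duplicates:", n_dupes)

# Mean next to median ("50%") for each column; close values hint at symmetry.
summary = df.describe().T[["mean", "50%"]]
print(summary)
```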
Checking for outliers:
Outliers are present in only two columns, and only at a few data points, so no outlier treatment has
been applied.
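One common way to count outliers per column is the boxplot rule (values beyond 1.5 × IQR from the quartiles). The source does not say which rule was used, so the rule, the column names and the planted outlier below are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical column names).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "spending": rng.normal(15, 3, 210),
    "credit_limit": rng.normal(3, 0.4, 210),
})
df.loc[0, "spending"] = 60.0  # plant one obvious outlier for the demo

# Boxplot rule: flag values more than 1.5 * IQR outside the quartiles.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
outlier_mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)

outlier_counts = outlier_mask.sum()  # number of flagged points per column
print(outlier_counts)
```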
All the variables are positively correlated.
1.2 Do you think scaling is necessary for clustering in this case? Justify
Yes, scaling is required.
The descriptive statistics show that the measures of central tendency are very close to each other, so
scaling and normalizing the data helps to produce good-quality clusters. In addition, Euclidean distance
is very sensitive to differences in magnitude: a variable measured on a larger scale would dominate the
distance, so scaling the data is critical.
The data after z-score scaling:
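Z-score scaling subtracts each column's mean and divides by its standard deviation. A minimal sketch with scikit-learn's StandardScaler, on synthetic data with hypothetical column names:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data on deliberately different scales (hypothetical names).
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "spending": rng.normal(15, 3, 210),
    "credit_limit": rng.normal(3, 0.4, 210),
})

# z = (x - mean) / std, computed column by column.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled.describe().round(2))  # each column now has mean ~0 and std ~1
```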
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them
The average linkage method is chosen here for hierarchical clustering.
Choosing a suitable cut-off for the dendrogram:
Appending the clusters to the original dataset:
Cluster Frequency:
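The hierarchical steps above (average linkage, cutting the dendrogram, appending the labels, counting cluster sizes) can be sketched with SciPy. The three synthetic blobs, the feature names and the choice of cutting into 3 clusters with criterion="maxclust" are assumptions for illustration; the source does not state how the cut was made.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Three synthetic, well-separated blobs (hypothetical data and names).
rng = np.random.default_rng(3)
centers = np.array([[0, 0], [6, 6], [0, 6]])
X = np.vstack([c + rng.normal(0, 0.5, size=(70, 2)) for c in centers])
df = pd.DataFrame(X, columns=["feature_a", "feature_b"])
scaled = ((df - df.mean()) / df.std(ddof=0)).to_numpy()

# Average linkage on the scaled data.
link = linkage(scaled, method="average")

# Truncated dendrogram; drop no_plot=True to draw it with matplotlib.
dend = dendrogram(link, truncate_mode="lastp", p=10, no_plot=True)

# Cut the tree into at most 3 clusters and append the labels.
df["h_cluster"] = fcluster(link, t=3, criterion="maxclust")

cluster_freq = df["h_cluster"].value_counts().sort_index()
print(cluster_freq)
```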
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
Creating clusters using KMeans: forming 3 clusters with K = 3.
Forming clusters with K = 1, 2, 4 and comparing the WSS (within-cluster sum of squares):
As K increases, WSS decreases.
Choosing K: Elbow Method
From the above plot, WSS does not change much after K = 3, hence the optimal value selected is 3.
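The elbow method fits K-Means for a range of K and records the WSS, which scikit-learn exposes as inertia_. A minimal sketch on synthetic blobs (the data, the range 1 to 7 and the random_state are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs standing in for the scaled data (hypothetical).
rng = np.random.default_rng(4)
centers = np.array([[0, 0], [6, 6], [0, 6]])
X = np.vstack([c + rng.normal(0, 0.5, size=(70, 2)) for c in centers])

# Fit K-Means for K = 1..7 and collect the within-cluster sum of squares.
k_values = list(range(1, 8))
wss = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=1)
    km.fit(X)
    wss.append(km.inertia_)

# The "elbow" is the K after which the drop in WSS becomes small.
for k, value in zip(k_values, wss):
    print(k, round(value, 1))
```

Plotting wss against k_values gives the elbow curve; on this synthetic data the drop flattens after K = 3.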
With K = 3, the silhouette score is as follows:
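The score is positive, which means that on average each point lies closer to its own cluster than to the
nearest neighbouring cluster; hence the clustering is good.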
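The silhouette score ranges from -1 to 1, with values near 1 indicating well-separated clusters. A minimal sketch of computing the overall score and the per-point minimum, on synthetic blobs rather than the bank data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Three synthetic blobs standing in for the scaled data (hypothetical).
rng = np.random.default_rng(5)
centers = np.array([[0, 0], [6, 6], [0, 6]])
X = np.vstack([c + rng.normal(0, 0.5, size=(70, 2)) for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Mean silhouette width over all points.
score = silhouette_score(X, labels)

# Per-point widths; a negative minimum would flag points that sit
# closer to a neighbouring cluster than to their own.
min_width = silhouette_samples(X, labels).min()

print("silhouette score:", round(score, 3))
print("minimum silhouette width:", round(min_width, 3))
```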
As per K-Means clustering, cluster 1 has 75 rows, cluster 2 has 70 rows and cluster 3 has 65 rows.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters.
Cluster Profiling – Hierarchical
Cluster Profiling – KMeans
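A cluster profile is typically the per-cluster mean of each original (unscaled) variable plus the cluster size. A minimal sketch, assuming hypothetical columns spending and credit_limit and synthetic values:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic low / medium / high spenders (hypothetical names and values).
rng = np.random.default_rng(6)
df = pd.DataFrame({
    "spending": np.concatenate([rng.normal(12, 1, 70),
                                rng.normal(15, 1, 70),
                                rng.normal(19, 1, 70)]),
    "credit_limit": np.concatenate([rng.normal(2.8, 0.1, 70),
                                    rng.normal(3.2, 0.1, 70),
                                    rng.normal(3.7, 0.1, 70)]),
})

# Cluster on scaled values, but profile on the original units.
scaled = (df - df.mean()) / df.std(ddof=0)
df["kmeans_cluster"] = KMeans(n_clusters=3, n_init=10,
                              random_state=1).fit_predict(scaled)

profile = df.groupby("kmeans_cluster").mean()
profile["freq"] = df["kmeans_cluster"].value_counts()

# Sorting by spending makes the low / medium / high groups easy to read.
print(profile.sort_values("spending").round(2))
```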
Group 1: High Spending Group (Hierarchical – group 1 and KMeans group 2)
They are already spending a lot, hence:
• Offering reward points may increase their purchases.
• Increasing their credit limits may encourage further spending.
• Offering loans and EMI options may attract them to purchase more.
Group 2: Medium Spending Group (Hierarchical – group 3 and KMeans group 1)
• These customers pay their bills on time, hence discounts may attract them to spend more.
• As they pay on time, their credit limit can be increased, which they are likely to appreciate.
Group 3: Low Spending Group (Hierarchical – group 2 and KMeans group 0)
• They are not spending much, hence promotional offers may attract them.
Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides to
collect data from the past few years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF & ANN and compare the models'
performances in train and test sets.
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial
Neural Network
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each
model.
2.4 Final Model: Compare all the models and write an inference which model is best/optimized.
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations
The dataset has 3000 rows, 10 columns.
There are no null values. The columns comprise a few categorical variables and a few integer variables.
Duration has extreme minimum and maximum values.
There are 139 duplicate rows; however, as there is no primary key to identify a record uniquely, the
duplicates are not treated.
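A duplicate check of this kind can be sketched with pandas. The tiny frame below is made up for illustration; Agency_Code comes from the report, while the Age and Sales column names and all values are assumptions.

```python
import pandas as pd

# Tiny made-up frame with one repeated row (hypothetical values).
df = pd.DataFrame({
    "Age": [36, 36, 48, 29],
    "Agency_Code": ["EPX", "EPX", "C2B", "JZI"],
    "Sales": [20.0, 20.0, 35.5, 12.0],
})

# duplicated() marks the second and later copies of an identical row.
n_dupes = int(df.duplicated().sum())

# keep=False shows every row that takes part in a duplicate group.
dupes = df[df.duplicated(keep=False)]

print("duplicate rows:", n_dupes)
print(dupes)
```

Without a unique identifier, two identical rows may be two different policies, which is why the rows are inspected rather than dropped.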
There are outliers in the continuous variables age, commission, duration and sales; however, the CART
model can handle outliers, so they are not treated.
The heat map shows that sales and commission are positively correlated.
Treating the categorical variables:
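The source does not show how the categorical variables were treated. One common option for tree-based models is to replace each category with an integer code; the sketch below assumes that approach, and the Channel and Sales column names and all values are hypothetical.

```python
import pandas as pd

# Made-up sample; the agency codes are the four named in the report.
df = pd.DataFrame({
    "Agency_Code": ["EPX", "C2B", "CWT", "JZI", "EPX"],
    "Channel": ["Online", "Online", "Offline", "Online", "Online"],
    "Sales": [20.0, 35.5, 12.0, 50.0, 18.0],
})

categorical_cols = ["Agency_Code", "Channel"]

# Replace each category with an integer code (categories sorted alphabetically).
for col in categorical_cols:
    df[col] = pd.Categorical(df[col]).codes

print(df)
```

Integer codes impose an arbitrary order, which is usually acceptable for trees; for the neural network, one-hot encoding may be a safer choice.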
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial
Neural Network
Dropping the target variable and making two data sets (features and target):
Splitting the data into train and test sets:
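These two steps can be sketched with scikit-learn's train_test_split. The target name Claimed, the 70/30 ratio, the stratification and the random_state are assumptions; the source does not state them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical column names and values).
rng = np.random.default_rng(9)
df = pd.DataFrame({
    "Age": rng.integers(18, 80, 1000),
    "Sales": rng.gamma(2.0, 30.0, 1000),
    "Claimed": rng.integers(0, 2, 1000),
})

# Separate the predictors from the target.
X = df.drop(columns="Claimed")
y = df["Claimed"]

# Hold out 30% for testing; stratify keeps the class ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)
```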
CART Model:
The CART model is built by trying various combinations of parameters.
Based on the CART model, the feature importances are:
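Trying parameter combinations for a CART model is commonly done with a grid search. The source does not list the parameters that were tried, so the grid below is hypothetical, as are the synthetic data and the feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (hypothetical features).
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, random_state=1)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Hypothetical grid of pruning parameters.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [10, 30],
    "min_samples_split": [30, 90],
}
grid = GridSearchCV(DecisionTreeClassifier(criterion="gini", random_state=1),
                    param_grid, cv=3)
grid.fit(X_train, y_train)
best_cart = grid.best_estimator_

# Importances sum to 1; larger values mean the feature drives more splits.
importances = pd.Series(best_cart.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(grid.best_params_)
print(importances)
```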
Random Forest Model
Feature importance as per the Random Forest model:
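A Random Forest and its feature importances can be sketched in the same way. The hyperparameters below (100 trees, depth 6, minimum leaf size 10) are assumptions, not the values used in the report, and the data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data (hypothetical features).
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, random_state=1)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Assumed hyperparameters, for illustration only.
rf = RandomForestClassifier(n_estimators=100, max_depth=6,
                            min_samples_leaf=10, random_state=1)
rf.fit(X_train, y_train)

# Importances are averaged over all trees in the forest.
rf_importances = pd.Series(rf.feature_importances_,
                           index=X.columns).sort_values(ascending=False)
test_accuracy = rf.score(X_test, y_test)

print(rf_importances)
print("test accuracy:", round(test_accuracy, 3))
```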
Artificial Neural Network:
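A neural network classifier of this kind can be sketched with scikit-learn's MLPClassifier. The source does not state the architecture, so the single hidden layer of 50 units, the iteration limit and the scaling step are assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (hypothetical features).
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Neural networks train more reliably on scaled inputs, so the scaler is
# fitted on the training data only and applied to both sets by the pipeline.
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=1),
)
ann.fit(X_train, y_train)

train_accuracy = ann.score(X_train, y_train)
test_accuracy = ann.score(X_test, y_test)
print("train:", round(train_accuracy, 3), "test:", round(test_accuracy, 3))
```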
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each
model.
CART Model:
CART Conclusion:
Train Data: AUC: 81%, Accuracy: 78%, Precision: 66%, F1-Score: 62%
Test Data: AUC: 79%, Accuracy: 78%, Precision: 65%, F1-Score: 62%
The model has good accuracy, and the training and test results are almost the same, which suggests
that it is not overfitting.
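Metrics of this kind (accuracy, confusion matrix, ROC curve, AUC, classification report) can be computed with sklearn.metrics. The sketch below uses synthetic data and an arbitrary decision tree, so its numbers are unrelated to the figures reported here; the evaluate helper is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
model = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)


def evaluate(model, X, y):
    """Collect the metrics for one data set (hypothetical helper)."""
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y, prob)     # points of the ROC curve
    return {
        "accuracy": accuracy_score(y, pred),
        "auc": roc_auc_score(y, prob),
        "confusion_matrix": confusion_matrix(y, pred),
        "report": classification_report(y, pred),
        "roc_curve": (fpr, tpr),
    }


train_metrics = evaluate(model, X_train, y_train)
test_metrics = evaluate(model, X_test, y_test)

print("train accuracy:", round(train_metrics["accuracy"], 3))
print("test accuracy:", round(test_metrics["accuracy"], 3))
print(test_metrics["confusion_matrix"])
print(test_metrics["report"])
```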
Agency_Code is the most important variable for predicting the target variable.
Random Forest Model:
The model has good accuracy, and the training and test results are almost the same.
Agency_Code is the most important variable for predicting the target variable.
Artificial Neural Network Model:
The model has good accuracy, similar to the other models, and the training and test results are almost
the same.
2.4 Final Model: Compare all the models and write an inference which model is best/optimized.
For all three models, the training and test results are more or less equal within each model.
Random Forest has slightly higher accuracy than the other two models.
The variable Agency_Code is found to be an important variable in all the models.
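A side-by-side comparison of this kind can be assembled into one table by fitting each model and collecting train and test scores. The sketch uses synthetic data and assumed hyperparameters, so its numbers say nothing about the insurance data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Assumed hyperparameters, for illustration only.
models = {
    "CART": DecisionTreeClassifier(max_depth=4, random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=50, max_depth=6,
                                            random_state=1),
    "ANN": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(50,),
                                       max_iter=1000, random_state=1)),
}

rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    rows[name] = {
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, model.predict(X_test)),
        "train_auc": roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
        "test_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    }

# One row per model; a small train/test gap suggests little overfitting.
comparison = pd.DataFrame(rows).T.round(3)
print(comparison)
```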
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations?
Online conversion is high: 90% of the insurance in the data is sold through the online channel.
Most of the sales happen via agencies, and claims are also made via agencies, so the agency workflow
needs to be understood before arriving at a conclusion.
As per the models, Agency_Code is an important variable, and there are four agency codes: EPX, C2B,
CWT and JZI.
Of these, the JZI agency has not done well in sales, hence its performance may have to be evaluated
and, if it is not satisfactory, other agencies can be tried.