
DATA MINING PROJECT


5/23/2021
KARTHIKEYAN M
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its customers.
They collected a sample that summarizes the activities of users during the past few months. You are
given the task to identify the segments based on credit card usage.

1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis).
1.2 Do you think scaling is necessary for clustering in this case? Justify
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters.

1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis).

The dataset has 210 rows and 7 columns.

All the columns are of datatype float/integer

There are no null values in the data set.

The mean and median of each column are more or less equal, which suggests that the distributions are
roughly symmetric and close to normal; this is visible in the distribution plots as well.

There are no duplicate values present in the dataset.
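The initial checks described above (shape, data types, nulls, duplicates, mean versus median) can be sketched with a few pandas calls. This is only an illustration on synthetic data: the column names are hypothetical, and only the 210 x 7 shape is taken from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bank data: 210 rows, 7 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(loc=10.0, scale=2.0, size=(210, 7)),
    columns=[f"feature_{i}" for i in range(1, 8)],
)

n_rows, n_cols = df.shape                    # dimensions of the data set
dtypes = df.dtypes                           # data type of each column
null_counts = df.isnull().sum()              # missing values per column
n_duplicates = int(df.duplicated().sum())    # fully duplicated rows

# describe() reports the median as the "50%" quantile.
summary = df.describe().T
# A small gap between mean and median hints at a roughly symmetric column.
mean_median_gap = (summary["mean"] - summary["50%"]).abs()

print(n_rows, n_cols, int(null_counts.sum()), n_duplicates)
print(mean_median_gap.round(3))
```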

Checking for outliers:

Outliers are present in only two columns, and they amount to very few data points, hence no outlier
treatment has been done.
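One common way to count outliers per column is the 1.5 x IQR rule that box plots use for their whiskers. The sketch below assumes that rule (the report does not say which check was used) and runs on synthetic data with hypothetical column names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(210, 3)), columns=["a", "b", "c"])
df.loc[0, "a"] = 15.0  # plant one obvious outlier for illustration


def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()


outlier_counts = count_iqr_outliers(df)
print(outlier_counts)
```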

All the variables are positively correlated.

1.2 Do you think scaling is necessary for clustering in this case? Justify
Yes, scaling is required.

The descriptive statistics show that the measures of central tendency are fairly close to each other; even
so, it is better to scale the data so that every variable contributes comparably and the clusters are of
good quality. Euclidean distance is also very sensitive to differences in the magnitude of the variables,
which makes scaling critical.

Below is the data after z-score scaling:
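A minimal sketch of z-score scaling, assuming scikit-learn's `StandardScaler` (the report does not name the function used). The two column names and their values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical columns on different scales.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "spending": rng.normal(15.0, 3.0, 210),
    "credit_limit": rng.normal(3.0, 0.4, 210),
})

# z = (x - column mean) / column standard deviation
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled.describe().loc[["mean", "std"]].round(3))
```

After scaling, every column has mean 0 and (population) standard deviation 1, so no single variable dominates the Euclidean distance.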

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them
The average linkage method is chosen here for hierarchical clustering.
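Average-linkage clustering can be sketched with SciPy as follows. The data here is synthetic (three artificial groups standing in for the scaled bank data), and truncating the dendrogram to the last 10 merges is an assumption made for readability.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Synthetic stand-in for the scaled data: 3 groups of 70 rows, 7 columns.
rng = np.random.default_rng(3)
scaled = np.vstack([rng.normal(c, 0.5, size=(70, 7)) for c in (-4.0, 0.0, 4.0)])

# Average linkage: the distance between two clusters is the mean of all
# pairwise Euclidean distances between their members.
link = linkage(scaled, method="average", metric="euclidean")

# no_plot=True returns the dendrogram layout without drawing it;
# drop that argument (with matplotlib installed) to see the plot.
tree = dendrogram(link, truncate_mode="lastp", p=10, no_plot=True)

print(link.shape)       # one row per merge
print(link[-3:, 2])     # heights of the last three merges
```

A large jump in the final merge heights is what suggests where to cut the dendrogram.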

Choosing a suitable cut-off for the dendrogram:

Appending the clusters to the original dataset:

Cluster Frequency:
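The cutting, appending and counting steps can be sketched with `fcluster`. Cutting at three clusters via `criterion="maxclust"` is an assumption based on the three groups profiled later; the data and column names are synthetic.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(4)
data = np.vstack([rng.normal(c, 0.5, size=(70, 7)) for c in (-4.0, 0.0, 4.0)])
df = pd.DataFrame(data, columns=[f"feature_{i}" for i in range(1, 8)])

link = linkage(data, method="average")

# Cut the tree so that at most 3 clusters remain; labels start at 1.
df["h_cluster"] = fcluster(link, t=3, criterion="maxclust")

# Number of rows that fall in each cluster.
cluster_frequency = df["h_cluster"].value_counts().sort_index()
print(cluster_frequency)
```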

1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
Creating clusters using KMeans: forming 3 clusters with K = 3.

Forming clusters with K = 1, 2 and 4 and comparing the WSS (within-cluster sum of squares):

As K increases, WSS decreases.

Choosing K with the elbow method:

From the above plot, WSS does not change much after K = 3, hence 3 is selected as the optimal value.
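The WSS values behind an elbow plot can be collected from the `inertia_` attribute of scikit-learn's `KMeans`. This is a sketch on synthetic data with three artificial groups; the range of K (1 to 10) and the `n_init`/`random_state` settings are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
scaled = np.vstack([rng.normal(c, 0.5, size=(70, 7)) for c in (-4.0, 0.0, 4.0)])

wss = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=1)
    model.fit(scaled)
    wss[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in wss.items():
    print(k, round(value, 1))
```

Plotting `wss` against K gives the elbow curve; the elbow is the K after which the drop in WSS becomes small.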

With K = 3 in KMeans, the silhouette score is as follows:

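The silhouette score ranges from -1 to 1. The score here is positive, meaning that on average the
observations sit closer to their own cluster than to the nearest neighbouring cluster, so the clustering
is reasonable.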
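A minimal sketch of the silhouette calculation, assuming scikit-learn. The data is synthetic, so the score it prints says nothing about the bank data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(6)
scaled = np.vstack([rng.normal(c, 0.5, size=(70, 7)) for c in (-4.0, 0.0, 4.0)])

labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(scaled)

# Average silhouette width over all observations (between -1 and 1).
score = silhouette_score(scaled, labels)

# Per-observation widths; a negative minimum would flag rows that sit
# closer to another cluster than to their own.
sample_widths = silhouette_samples(scaled, labels)
min_width = sample_widths.min()

print(round(score, 3), round(min_width, 3))
```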

As per KMeans clustering:

Cluster 1 has 75 rows, cluster 2 has 70 and cluster 3 has 65 rows.

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters.
Cluster Profiling – Hierarchical

Cluster Profiling – KMeans
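A cluster profile is typically the per-cluster mean of each original (unscaled) variable plus the cluster size. The sketch below assumes that approach; the two columns and their values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "spending": np.concatenate([rng.normal(m, 1.0, 70) for m in (12, 15, 19)]),
    "credit_limit": np.concatenate([rng.normal(m, 0.2, 70) for m in (2.5, 3.2, 3.9)]),
})

# Cluster on the scaled data, but profile on the original units.
scaled = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(scaled)

profile = df.groupby("cluster").mean()
profile["freq"] = df["cluster"].value_counts().sort_index()
print(profile.round(2))
```

Reading the profile row by row is what allows the clusters to be labelled as high, medium or low spending.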

Group 1 : High Spending Group ( Hierarchical – group 1 and Kmeans group 2)

They are already spending a lot, hence:

• Giving reward points might increase their purchases.
• Increasing their credit limits may encourage them to spend more.
• Offering loan and EMI options will attract them to purchase more.

Group 2 : Medium Spending Group ( Hierarchical – group 3 and Kmeans group 1)

• These customers pay their bills on time, hence some discounts will attract them to
spend more.
• As they pay on time, increasing their credit limit is likely to please them.

Group 3 : Low Spending Group ( Hierarchical – group 2 and Kmeans group 0)

• They are not spending much, hence promotional offers may attract them.

Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides to
collect data from the past few years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF & ANN and compare the models'
performances in train and test sets.
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis).
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial
Neural Network
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each
model.
2.4 Final Model: Compare all the models and write an inference which model is best/optimized.
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations

2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis).

The dataset has 3000 rows, 10 columns.

There are no null values. The columns are a mix of categorical and numeric variables.

Duration has extreme minimum and maximum values.

There are 139 duplicate rows; however, as there is no primary-key column to confirm that they are true
repeats, the duplicates are not treated.

There are outliers in continuous variables such as age, commission, duration and sales; however, the
CART model can handle outliers, so they are not treated.

The heat map shows that sales and commission are positively correlated.

Treating the categorical variables:
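One simple way to treat categorical columns for tree-based models is to replace each category with an integer code. The sketch below assumes that approach (the report does not show which encoding was used); the rows are made up, and only the four agency codes come from the report.

```python
import pandas as pd

# Made-up rows; column names other than Agency_Code are assumptions.
df = pd.DataFrame({
    "Agency_Code": ["EPX", "C2B", "CWT", "JZI", "EPX"],
    "Channel": ["Online", "Online", "Offline", "Online", "Online"],
    "Sales": [20.0, 9.5, 45.0, 12.0, 30.0],
})

# Every non-numeric column is treated as categorical.
categorical_cols = [c for c in df.columns
                    if not pd.api.types.is_numeric_dtype(df[c])]

for col in categorical_cols:
    # Categories are sorted alphabetically and numbered from 0.
    df[col] = pd.Categorical(df[col]).codes

print(df)
```

Integer codes impose an arbitrary order on the categories, which is usually acceptable for trees; for a neural network, one-hot encoding is a common alternative.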

2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial
Neural Network

Separating the target variable from the predictors to form two data sets:

Splitting the data into train and test sets:
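The two steps above can be sketched as follows. The target column name `Claimed`, the 70/30 split and the stratification are assumptions, since the report does not state them; the data is synthetic with the report's 3000 rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
df = pd.DataFrame({
    "Age": rng.integers(18, 80, 3000),
    "Sales": rng.gamma(2.0, 30.0, 3000),
    "Claimed": rng.integers(0, 2, 3000),  # hypothetical target column
})

X = df.drop(columns="Claimed")  # predictors
y = df["Claimed"]               # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```

Stratifying on the target keeps the claim rate about the same in both sets.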


CART Model:

The CART model is built by trying various combinations of parameters:
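Trying parameter combinations for CART is commonly done with a grid search. This sketch assumes scikit-learn's `GridSearchCV`; the grid values and the synthetic data are illustrative only, not the values used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the insurance set.
X, y = make_classification(n_samples=600, n_features=9, n_informative=4,
                           random_state=1)

# Illustrative grid; each combination is scored by 3-fold cross-validation.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [10, 30],
    "min_samples_split": [30, 90],
}
search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=1),
    param_grid,
    cv=3,
)
search.fit(X, y)

best_cart = search.best_estimator_
importances = best_cart.feature_importances_  # sums to 1 across features
print(search.best_params_)
```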

Feature importance based on the CART model:

Random Forest Model

Feature importance as per the Random Forest model:
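A ranked feature-importance table for a random forest can be obtained along these lines. The hyperparameters and feature names are assumptions, and the data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=9, n_informative=4,
                           random_state=1)
feature_names = [f"feature_{i}" for i in range(1, 10)]  # hypothetical names

rf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=1)
rf.fit(X, y)

# Impurity-based importances, averaged over all trees in the forest.
importance = (
    pd.Series(rf.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importance.round(3))
```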

Artificial Neural Network:
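A small neural network classifier can be sketched with scikit-learn's `MLPClassifier`. The layer size, iteration limit and tolerance are assumptions; the inputs are scaled first because neural networks are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=9, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# The scaler is fitted on the training data only, inside the pipeline.
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, tol=1e-3,
                  random_state=1),
)
ann.fit(X_train, y_train)

test_accuracy = ann.score(X_test, y_test)
test_prob = ann.predict_proba(X_test)[:, 1]  # probability of class 1
print(round(test_accuracy, 3))
```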

2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each
model.

CART Model:

CART Conclusion:

Train data: AUC 81%, accuracy 78%, precision 66%, F1-score 62%
Test data: AUC 79%, accuracy 78%, precision 65%, F1-score 62%
The model has good accuracy, and the training and test results are almost the same, which suggests it is
not overfitting.
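The metrics reported for each model (accuracy, confusion matrix, ROC curve, AUC, classification report) can be computed with a helper like the one below. This is a sketch on synthetic data with an illustrative decision tree; `evaluate` is a hypothetical helper name, not a library function.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=9, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=20,
                               random_state=1).fit(X_train, y_train)


def evaluate(fitted_model, features, target):
    """Collect the usual classification metrics for one data set."""
    pred = fitted_model.predict(features)
    prob = fitted_model.predict_proba(features)[:, 1]
    fpr, tpr, _ = roc_curve(target, prob)  # points of the ROC curve
    return {
        "accuracy": accuracy_score(target, pred),
        "auc": roc_auc_score(target, prob),
        "confusion_matrix": confusion_matrix(target, pred),
        "report": classification_report(target, pred),
        "roc_points": (fpr, tpr),
    }


train_metrics = evaluate(model, X_train, y_train)
test_metrics = evaluate(model, X_test, y_test)
print(round(train_metrics["auc"], 3), round(test_metrics["auc"], 3))
print(test_metrics["report"])
```

Comparing the train and test dictionaries side by side is a quick check for overfitting: a large gap between the two sets of figures would be a warning sign.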

Agency_Code is the most important variable for predicting the target variable.

Random Forest Model:

Train data: AUC 86%, accuracy 80%, precision 72%, F1-score 66%
Test data: AUC 82%, accuracy 78%, precision 68%, F1-score 62%
The model has good accuracy, and the training and test results are reasonably close.

Agency_Code is the most important variable for predicting the target variable.

Artificial Neural Network Model:

Train data: AUC 82%, accuracy 78%, precision 68%, F1-score 59%
Test data: AUC 80%, accuracy 77%, precision 67%, F1-score 57%
The model has good accuracy, similar to the other models, and the training and test results are almost
the same.

2.4 Final Model: Compare all the models and write an inference which model is best/optimized.

For all three models, the training and test results are more or less equal within each model.

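Random Forest performs slightly better than the other two models: it has the highest train accuracy (80%)
and the highest test AUC (82%), although its test accuracy (78%) is the same as CART's.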

Agency_Code is found to be the most important variable in all the models.

2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations?

Online conversions dominate: 90% of the records show insurance purchased through the online channel.

Most of the sales happen via the agencies, which is also an important factor, and claims come through the
agencies as well, so we may have to understand their workflow to arrive at a conclusion.

As per the models, agency code is an important variable, and there are four agency codes:
EPX, C2B, CWT and JZI.

Of these, the JZI agency has not done well in sales, hence its performance may have to be evaluated and,
if it is not satisfactory, other agencies can be tried.
