
DATA MINING PROJECT

Problem 1
1.1 Read the data, do the necessary initial steps, and
exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).
The initial steps and exploratory data analysis are done in the Jupyter notebook.

PairPlot

Correlation Plot
The correlation plot shows mostly positive correlations between the variables. The exception is
min_payment_amt, which is mostly negatively correlated with the other variables.
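The correlation matrix behind such a plot can be computed along the following lines. This is only a sketch on synthetic stand-in data, not the project's dataset: apart from min_payment_amt, the column names (spending, credit_limit) are hypothetical, and the relationships between them are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): not the project's dataset.
# "spending" and "credit_limit" are hypothetical column names.
rng = np.random.default_rng(0)
spending = rng.normal(15, 3, 200)
df = pd.DataFrame({
    "spending": spending,
    "credit_limit": 0.2 * spending + rng.normal(0, 0.2, 200),
    "min_payment_amt": -0.3 * spending + rng.normal(0, 1.0, 200),
})

# Pearson correlation matrix of all numeric columns
corr = df.corr()

# Average correlation of each variable with the others (diagonal excluded)
mean_corr = (corr.sum() - 1) / (len(corr) - 1)

print(corr.round(2))
print(mean_corr.round(2))
```

A variable whose average correlation is negative stands out in the same way min_payment_amt does in the plot. The matrix can also be passed to a heatmap function such as seaborn.heatmap to draw it.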

1.2 Do you think scaling is necessary for clustering in this
case? Justify
No, I don’t think scaling is necessary for clustering in this case. Scaling is needed when the
variables have different units of measurement. In this case, all variables have the same unit,
so there is no need for scaling.
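One way to check this reasoning is to compare the spread of each variable before clustering: distance-based methods are dominated by variables with a larger spread, even when the units match. The sketch below uses synthetic data with two hypothetical columns (feature_a, feature_b), so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical column names (assumption)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "feature_a": rng.normal(15, 3, 100),
    "feature_b": rng.normal(3, 1.5, 100),
})

# Compare the standard deviations; a ratio far from 1 means one
# variable would dominate Euclidean distances
spread = df.std()
ratio = spread.max() / spread.min()
print(spread.round(2), "ratio:", round(ratio, 2))

# Z-score scaling: each column ends up with mean 0 and unit variance
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled.describe().round(2))
```

If the ratio is close to 1, clustering on raw and scaled data should give similar results; otherwise the scaled version is the safer input.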

1.3 Apply hierarchical clustering to scaled data. Identify
the number of optimum clusters using Dendrogram and
briefly describe them
Hierarchical clustering is done in the Jupyter notebook named ‘data mining problem 1’.

Dendrogram after applying truncate mode

The dendrogram shows 2 main clusters:

Cluster 1: customers with a high current balance, who spend more in a single shopping trip and have a high credit limit.

Cluster 2: customers with a lower current balance, who spend less in a single shopping trip than cluster 1
customers and have a low credit limit.
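The dendrogram and the two-cluster cut described above can be sketched with SciPy as follows. The report does not state the linkage method or the truncation settings, so Ward linkage and p=10 are assumptions, and the data is synthetic.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer data (assumption)
X, _ = make_blobs(n_samples=200, centers=2, n_features=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Agglomerative clustering; Ward linkage is an assumed choice
Z = linkage(X_scaled, method="ward")

# Truncated dendrogram that keeps only the last 10 merged groups.
# no_plot=True returns the tree layout without drawing it;
# drop that argument to plot with matplotlib.
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

# Cut the tree into 2 clusters, as read off the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")
print(sorted(set(labels.tolist())))
```

The labels can then be attached to the original table to describe each cluster by its average values.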
1.4 Apply K-Means clustering on scaled data and
determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and
write inferences on the finalized clusters.
K-Means clustering is done in the Jupyter notebook named ‘data mining problem 1’.

Elbow Curve

The silhouette score is better for K = 3 than for K = 4, so 3 clusters are used.
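The elbow curve and silhouette comparison can be reproduced along these lines. This is a sketch on synthetic data; the range of K values, n_init, and random_state are assumptions rather than the notebook's actual settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer data (assumption)
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

inertia = {}      # within-cluster sum of squares, plotted as the elbow curve
silhouette = {}   # mean silhouette score, higher is better

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertia[k] = km.inertia_
    silhouette[k] = silhouette_score(X_scaled, km.labels_)

# K with the highest mean silhouette score
best_k = max(silhouette, key=silhouette.get)

for k in inertia:
    print(k, round(inertia[k], 1), round(silhouette[k], 3))
print("best K by silhouette:", best_k)
```

The elbow is the K after which inertia stops dropping sharply; the silhouette score gives a second, independent check on that choice.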


1.5 Describe cluster profiles for the clusters defined.
Recommend different promotional strategies for different
clusters.
Using K-Means, 3 main clusters are formed:

Cluster 0: customers with the lowest minimum payment.

Cluster 1: customers with the least spending, the least amount spent in a single shopping trip, the least credit limit, least
advance payment and maximum advance payment.

Cluster 2: customers with the maximum spending and maximum advance payment, who spend the most in a single
shopping trip.
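Cluster profiles like these are usually read from the per-cluster averages of the original variables. The sketch below uses a small toy table, not the project's data; spending and credit_limit are hypothetical column names.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy values for illustration only (assumption): not the project's data
df = pd.DataFrame({
    "spending": [12, 13, 12, 18, 19, 18, 15, 14, 15],
    "credit_limit": [2.5, 2.6, 2.4, 3.8, 3.9, 3.7, 3.1, 3.0, 3.2],
    "min_payment_amt": [5.0, 5.2, 4.9, 3.5, 3.6, 3.4, 2.0, 2.1, 1.9],
})

# Cluster on scaled values, but profile on the original units
X = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean of every variable per cluster, plus the cluster size
profile = df.groupby("cluster").mean()
profile["size"] = df["cluster"].value_counts().sort_index()
print(profile.round(2))
```

Reading across the rows shows which cluster has the highest or lowest value of each variable.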

The bank should tie up with e-commerce sites and come up with various offers for its
customers so that they spend more using their credit cards. It could also give
discount coupons to customers who spend a minimum amount in a month. This could encourage
customers to spend more using their credit cards.

Problem 2
2.1 Read the data, do the necessary initial steps, and
exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).
The initial steps and exploratory data analysis are done in the Jupyter notebook named ‘data mining problem 2’.

Pair Plot
The pair plot shows that, except for the age variable, the variables are left-skewed.
Correlation Plot

The correlation plot shows mostly positive correlations between the variables.
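The skewness read off the pair plot can be checked numerically: a negative skewness coefficient means a longer left tail, a positive one a longer right tail. The sketch below uses synthetic data; apart from age, the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration (assumption); only "age" is a name
# taken from the report
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(38, 10, 500),                 # roughly symmetric
    "left_tailed": 100 - rng.exponential(10, 500),  # long tail to the left
    "right_tailed": rng.exponential(10, 500),       # long tail to the right
})

# Sample skewness per column: < 0 left-skewed, > 0 right-skewed
skew = df.skew()
print(skew.round(2))
```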
2.2 Data Split: Split the data into test and train, build
classification model CART, Random Forest, Artificial
Neural Network
The classification models are built in the Jupyter notebook named ‘data mining problem 2’.
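A minimal sketch of the split and the three models with scikit-learn is given below. The split ratio, the hyperparameters, and the scaling step in front of the neural network are all assumptions, since the report does not list them, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.7, 0.3], random_state=1)

# 70/30 split, stratified on the target (assumed ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)

# Hyperparameters below are illustrative, not the notebook's values
models = {
    "CART": DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                                   random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=200, max_depth=6,
                                            random_state=1),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                      random_state=1)),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name,
          "train:", round(model.score(X_train, y_train), 3),
          "test:", round(model.score(X_test, y_test), 3))
```

The neural network is wrapped in a pipeline with a scaler because multilayer perceptrons are sensitive to feature scale; the tree-based models do not need it.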

2.3 Performance Metrics: Comment and Check the
performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score, classification reports for each model.
The performance metrics for each model are computed in the Jupyter notebook named ‘data mining problem 2’.

For CART
Test data accuracy = 78.23%

Train data accuracy = 76.42%

ROC curve, ROC_AUC score for test data

ROC_AUC score = 0.790

ROC curve, ROC_AUC score for train data

ROC_AUC score = 0.810

For Random Forest

Test data accuracy = 78.92%

Train data accuracy = 80.41%

ROC curve, ROC_AUC score for test data

Area under curve is 0.81

ROC curve, ROC_AUC score for train data


Area under curve is 0.862

For Artificial Neural Network

Test data accuracy = 75.43%

Train data accuracy = 75.82%

ROC curve, ROC_AUC score for test data

Area under curve is 0.7869

ROC curve, ROC_AUC score for train data


Area under curve is 0.7871
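The metrics reported above for each model can be computed along these lines. The sketch fits a single decision tree on synthetic data; evaluate is a hypothetical helper name, and the model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and an illustrative model (assumptions)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)
model = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)


def evaluate(model, X, y):
    """Hypothetical helper: collect the metrics used in this section."""
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y, proba)     # points of the ROC curve
    return {
        "accuracy": accuracy_score(y, pred),
        "confusion_matrix": confusion_matrix(y, pred),
        "auc": roc_auc_score(y, proba),
        "roc_points": (fpr, tpr),
        "report": classification_report(y, pred),
    }


train_metrics = evaluate(model, X_train, y_train)
test_metrics = evaluate(model, X_test, y_test)

print("train accuracy:", round(train_metrics["accuracy"], 4))
print("test accuracy:", round(test_metrics["accuracy"], 4))
print("test AUC:", round(test_metrics["auc"], 4))
print(test_metrics["report"])
```

Calling the same helper on the train and test sets makes it easy to spot overfitting, which shows up as a large gap between the two sets of scores.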

2.4 Final Model: Compare all the models and write an
inference which model is best/optimized.

After comparing all the models, Random Forest is the best model in this case, as it has higher accuracy,
precision, F1 score and AUC score than the other models.

Metric      CART    CART    Random Forest   Random Forest   Artificial Neural   Artificial Neural
            train   test    train           test            Network train       Network test
Accuracy    0.76    0.78    0.80            0.79            0.76                0.75
AUC         0.81    0.79    0.86            0.81            0.79                0.79
Recall      0.58    0.60    0.60            0.60            0.55                0.52
Precision   0.65    0.68    0.74            0.74            0.64                0.63
F1 Score    0.61    0.63    0.66            0.66            0.59                0.57
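The comparison can also be made programmatically. The sketch below enters the test-set scores and train AUC values from the table by hand and picks the best model per metric; the variable names are illustrative.

```python
import pandas as pd

# Test-set scores copied from the comparison table
test_scores = pd.DataFrame(
    {
        "Accuracy": [0.78, 0.79, 0.75],
        "AUC": [0.79, 0.81, 0.79],
        "Recall": [0.60, 0.60, 0.52],
        "Precision": [0.68, 0.74, 0.63],
        "F1": [0.63, 0.66, 0.57],
    },
    index=["CART", "Random Forest", "Neural Network"],
)

# Train AUC from the same table, used to measure the train/test gap
train_auc = pd.Series([0.81, 0.86, 0.79], index=test_scores.index)

# Model with the highest score for each metric (first one wins on ties)
best_per_metric = test_scores.idxmax()

# Larger gaps hint at more overfitting
auc_gap = (train_auc - test_scores["AUC"]).round(2)

print(best_per_metric)
print(auc_gap)
```

Recall is tied between CART and Random Forest on the test set, so idxmax reports whichever comes first in the index for that metric.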

ROC curve for 3 models on training data.


ROC curve for 3 models on test data
