Data Mining Project

Problem 1
1.1 Read the data, do the necessary initial steps, and
exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).
Initial steps and exploratory data analysis are done in the Jupyter notebook.
Pair Plot
Correlation Plot
The correlation plot shows mostly positive correlations between the variables, except for
min_payment_amt, which is mostly negatively correlated with the other variables.
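As an illustration of how such a correlation matrix can be computed, here is a minimal sketch with pandas. The data is synthetic, and only the column name min_payment_amt comes from the report; the other column names and all values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: only the name min_payment_amt is taken from the report.
rng = np.random.default_rng(0)
spending = rng.normal(15, 3, 200)
df = pd.DataFrame({
    "spending": spending,                                          # hypothetical column
    "credit_limit": 0.2 * spending + rng.normal(0, 0.1, 200),      # hypothetical column
    "min_payment_amt": -0.3 * spending + rng.normal(0, 1.0, 200),  # built to correlate negatively
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))
```

The resulting matrix is what a correlation heatmap would display, with positive entries between spending and credit_limit and negative entries in the min_payment_amt row.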
1.2 Do you think scaling is necessary for clustering in this
case? Justify
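No. Scaling is needed when variables are measured in different units or on very different ranges,
because distance-based clustering is then dominated by the variables with the largest values. In this
case all variables are treated as having the same unit, so scaling is not strictly necessary. The
clustering in sections 1.3 and 1.4 is nevertheless run on scaled data, as those questions require.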
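To show what scaling does, the following sketch standardizes each column to zero mean and unit variance (z-scores) using NumPy. The three rows and both feature meanings are invented for illustration and are not taken from the project data.

```python
import numpy as np

# Hypothetical customers: [amount on a large scale, ratio between 0 and 1].
X = np.array([
    [1000.0, 0.20],
    [1100.0, 0.90],
    [1020.0, 0.25],
])

# z-score scaling: subtract the column mean, divide by the column standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)          # population standard deviation (ddof=0)
X_scaled = (X - mean) / std

# Euclidean distance between the first two customers, before and after scaling.
dist_raw = np.linalg.norm(X[0] - X[1])
dist_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(dist_raw, dist_scaled)
```

Before scaling, the distance is almost entirely determined by the first column; after scaling, both columns contribute on a comparable footing.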
1.3 Apply hierarchical clustering to scaled data. Identify
the number of optimum clusters using Dendrogram and
briefly describe them
Hierarchical clustering is done in the Jupyter notebook named ‘data mining problem 1’.
Dendrogram after applying truncate mode
The dendrogram shows 2 main clusters:
Cluster 1: customers with a high current balance, who spend more in a single shopping trip and have a
high credit limit.
Cluster 2: customers with a lower current balance, who spend less in a single shopping trip than
cluster 1 customers and have a low credit limit.
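The hierarchical clustering step could be sketched as follows with SciPy. The linkage method is not stated in the report, so Ward linkage is an assumption here, as are the synthetic two-group data and the choice of p=10 for the truncated dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Synthetic, already-scaled data with two well-separated groups (assumption).
rng = np.random.default_rng(42)
high = rng.normal(loc=[2.0, 2.0, 2.0], scale=0.3, size=(30, 3))
low = rng.normal(loc=[-2.0, -2.0, -2.0], scale=0.3, size=(40, 3))
X_scaled = np.vstack([high, low])

# Build the merge tree; Ward linkage is assumed, not confirmed by the report.
Z = linkage(X_scaled, method="ward")

# Truncated dendrogram showing only the last 10 merged clusters.
# no_plot=True returns the layout without requiring a plotting backend.
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

# Cut the tree into 2 flat clusters, matching the 2 clusters read off the dendrogram.
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])
```

Dropping no_plot=True draws the dendrogram with matplotlib, and the tallest vertical gap between merges is where the 2-cluster cut is read off.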
1.4 Apply K-Means clustering on scaled data and
determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and
write inferences on the finalized clusters.
K-Means clustering is done in the Jupyter notebook named ‘data mining problem 1’.
Elbow Curve
The silhouette score is better for K = 3 than for K = 4, so 3 clusters are chosen.
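The elbow curve and silhouette comparison can be reproduced along these lines with scikit-learn. This is a sketch on synthetic data with three blobs; the range of K values, the seeds, and the data itself are assumptions rather than the notebook's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic, already-scaled data with three groups (assumption).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_scaled = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers])

inertia = {}      # within-cluster sum of squares, plotted as the elbow curve
silhouette = {}   # mean silhouette score per K, higher is better
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_scaled)
    inertia[k] = km.inertia_
    silhouette[k] = silhouette_score(X_scaled, km.labels_)

best_k = max(silhouette, key=silhouette.get)
print(best_k, round(silhouette[3], 3), round(silhouette[4], 3))
```

The elbow is the K after which inertia stops dropping sharply, and the silhouette score gives a second opinion by comparing how close each point is to its own cluster versus the nearest other cluster.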

1.5 Describe cluster profiles for the clusters defined.
Recommend different promotional strategies for different
clusters.
Using K-Means, 3 main clusters are formed:
Cluster 0: customers with the least minimum payment.
Cluster 1: customers with the least spending, the least amount spent in a single shopping trip, the
least credit limit, least advance payment and maximum advance payment.
Cluster 2: customers with the maximum spending, the maximum advance payment and the highest spend in
a single shopping trip.
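Cluster profiles like the ones above are usually read off a table of per-cluster means. The sketch below shows one way to build it with pandas; the column names and the six rows are invented for illustration, and the cluster column stands in for the K-Means labels.

```python
import pandas as pd

# Hypothetical data: two customers per cluster, with made-up values.
df = pd.DataFrame({
    "spending":     [14.0, 15.0, 11.0, 12.0, 18.0, 19.0],
    "credit_limit": [3.2, 3.3, 2.8, 2.9, 3.7, 3.8],
    "cluster":      [0, 0, 1, 1, 2, 2],   # stands in for K-Means labels
})

# Mean of every variable per cluster, plus the number of customers in each cluster.
profile = df.groupby("cluster").mean()
profile["size"] = df.groupby("cluster").size()
print(profile)
```

Comparing rows of this table shows which cluster has the lowest or highest value of each variable.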
The bank should tie up with e-commerce sites and come up with various offers for customers so that
they spend more using their credit card. It can also give discount coupons to customers who spend a
minimum amount in a month. This could encourage customers to spend more using their credit card.
Problem 2
2.1 Read the data, do the necessary initial steps, and
exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).
Initial steps and exploratory data analysis are done in the Jupyter notebook named ‘data mining problem 2’.
Pair Plot
The pair plot shows that every variable except age is left-skewed.
Correlation Plot
The correlation plot shows mostly positive correlations between the variables.
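Skewness read from a pair plot can be checked numerically. The following sketch uses pandas on synthetic data: a negative skewness value indicates a left-skewed variable (long left tail), and a value near zero indicates a roughly symmetric one. Only the column name age comes from the report; the second column and all values are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 1000),                   # roughly symmetric
    "left_tailed": 100 - rng.exponential(10, 1000),    # hypothetical left-skewed variable
})

# Sample skewness per column: negative means left-skewed, positive means right-skewed.
skew = df.skew()
print(skew.round(2))
```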
2.2 Data Split: Split the data into test and train, build
classification model CART, Random Forest, Artificial
Neural Network
The classification models are built in the Jupyter notebook named ‘data mining problem 2’.
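A minimal sketch of this step with scikit-learn is shown below. The 70/30 split, every hyperparameter, and the synthetic data from make_classification are assumptions, since the report does not list the notebook's settings; the neural network is wrapped in a scaling pipeline because multilayer perceptrons are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the project data set.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)

# Assumed 70/30 split, stratified so both sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# Hyperparameters are illustrative only.
models = {
    "CART": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0),
    "ANN": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=0),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # (train accuracy, test accuracy) for each model
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
print(scores)
```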
2.3 Performance Metrics: Comment and Check the
performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score, classification reports for each model.
Performance metrics for each model are computed in the Jupyter notebook named ‘data mining problem 2’.
For CART
Test data accuracy = 78.23%
Train data accuracy = 76.42%
ROC curve, ROC_AUC score for test data
ROC_AUC score = 0.790
ROC curve, ROC_AUC score for train data
ROC_AUC score = 0.810
For Random Forest
Train data accuracy = 80.41%
ROC curve, ROC_AUC score for test data
ROC_AUC score = 0.81
ROC curve, ROC_AUC score for train data

For Artificial Neural Network
Train data accuracy = 75.82%
ROC curve, ROC_AUC score for test data
ROC_AUC score = 0.7869
ROC curve, ROC_AUC score for train data
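The metrics listed in this section can all be obtained from true labels and predicted probabilities. The sketch below uses scikit-learn on eight made-up observations; the labels, probabilities, and the 0.5 threshold are assumptions for illustration and are unrelated to the project results.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)

# Made-up true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.35, 0.7, 0.9])

# Hard predictions at an assumed 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows are true classes, columns are predictions
auc = roc_auc_score(y_true, y_prob)     # AUC uses probabilities, not hard predictions
fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # points for plotting the ROC curve
report = classification_report(y_true, y_pred)

print(acc, auc)
print(cm)
print(report)
```

Running the same calls once on the training set and once on the test set gives the train and test figures reported for each model.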

2.4 Final Model: Compare all the models and write an
inference which model is best/optimized.
After comparing all the models, Random Forest is the best model in this case: on the test set it has the
highest accuracy, precision, F1 score and AUC, and its recall matches that of CART.
RF = Random Forest, ANN = Artificial Neural Network
Metric      CART train  CART test  RF train  RF test  ANN train  ANN test
Accuracy    0.76        0.78       0.80      0.79     0.76       0.75
AUC         0.81        0.79       0.86      0.81     0.79       0.79
Recall      0.58        0.60       0.60      0.60     0.55       0.52
Precision   0.65        0.68       0.74      0.74     0.64       0.63
F1 Score    0.61        0.63       0.66      0.66     0.59       0.57
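As a consistency check on the table, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the rounded precision and recall values in the table; small differences from the reported F1 are expected because the inputs are already rounded to two decimals.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the comparison table.
reported = {
    "CART train": (0.65, 0.58, 0.61),
    "CART test": (0.68, 0.60, 0.63),
    "Random Forest train": (0.74, 0.60, 0.66),
    "Random Forest test": (0.74, 0.60, 0.66),
    "ANN train": (0.64, 0.55, 0.59),
    "ANN test": (0.63, 0.52, 0.57),
}

computed = {}
for name, (precision, recall, f1_reported) in reported.items():
    computed[name] = f1_from(precision, recall)
    print(f"{name}: computed {computed[name]:.3f}, reported {f1_reported:.2f}")
```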
ROC curve for 3 models on training data.

ROC curve for 3 models on test data
