Namta Bansal
Student - PGPDSBA
Contents
Index of Figures
Problem 1 : Clustering
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis).
1.2 Do you think scaling is necessary for clustering in this case? Justify. The learner is expected to check and comment about the difference in scale of different features on the bases of appropriate measure for example std dev, variance, etc. Should justify whether there is a necessity for scaling and which method is he/she using to do the scaling. Can also comment on how that method works.
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.
Problem 2 : CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis).
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial Neural Network
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model.
2.4 Final Model: Compare all the models and write an inference which model is best/optimized.
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations
Index of Figures
Problem 1 : Clustering
1.1 Sample of the dataset
1.2 Variables in the dataset
1.3 Key statistical parameters of the dataset
1.4 Boxplot of variables
1.5 Count of Outliers
1.6 Histogram of variables
1.7 Skewness
1.8 Correlation & Heat Map of Variables
1.9 Pair plot of Variables
1.10 Outlier Treatment
1.11 Key statistics prior to data scaling
1.12 Boxplot prior to data scaling
1.13 Key statistics post data scaling
1.14 Boxplot post data scaling
1.15 Customer Segmentation Dendrogram
1.16 Truncated Dendrogram
1.17 Dendrogram Clusters
1.18 Variable means of Hierarchical Clusters
1.19 wss score for 1 to 15 clusters
1.20 Elbow Plot
1.21 Silhouette Scores for 2 to 14 clusters
1.22 Silhouette Plot
1.23 Clusters in the dataset
1.24 Variable means of K-Means clusters
Problem 2 : CART-RF-ANN
2.1 Sample of the dataset
2.2 Variables in the dataset
2.3 Key statistical parameters of the dataset
2.4 Unique counts of Agency_Code
2.5 Unique counts of Type
2.6 Unique counts of Channel
2.7 Unique counts of Product Name
2.8 Unique counts of Destination
2.9 Duplicate Rows
2.10 Boxplot of the dataset
2.11 Histogram of variables
2.12 Correlation & Heat Map of Variables
2.13 Pair plot of Variables
2.14 Type vs Claimed
2.15 Destination vs Claimed
2.16 Product Name vs Claimed
2.17 Channel vs Claimed
2.18 Agency_Code vs Claimed
2.19 Boxplot of Sales vs Agency_Code
2.20 Boxplot of Sales vs Type
2.21 Boxplot of Sales vs Agency_Code
2.22 Boxplot of Sales vs Type
2.23 Boxplot of Sales vs Product Name
2.24 Proportion of observations in Target class
2.25 Label encoding of categorical variables
2.26 Variables in the dataset post Label encoding
2.27 Sample dataset without target column
2.28 Sample dataset of target column
2.29 Train-Test split
2.30 Optimal parameters for CART
2.31 Decision Tree
2.32 Root Node
2.33 Variable Importance - CART
2.34 Optimal parameters for Random Forest
2.35 Train dataset post scaling (post fit and transform)
2.36 Test dataset post scaling (post transform)
2.37 Optimal parameters for ANN
2.38 CART model Accuracy
2.39 CART – Confusion Matrix
2.40 CART – ROC Curve
2.41 CART – Area under AUC
2.42 CART – Classification report
2.43 RF model Accuracy
2.44 RF – Confusion Matrix
2.45 RF – ROC Curve
2.46 RF – Area under AUC
2.47 RF – Classification report
2.48 ANN model Accuracy
2.49 ANN – Confusion Matrix
2.50 ANN – ROC Curve
2.51 ANN – Area under AUC
2.52 ANN – Classification report
2.53 CART, RF, ANN model comparison
2.54 CART, RF, ANN – ROC Curve
2.55 Feature Importance
2.56 % claims by Product Name
2.57 Boxplot of Sales vs Agency_Code
2.58 % claims by Agency_Code
Problem 1 : Clustering
Description:
A leading bank wants to develop a customer segmentation to give promotional offers to its customers.
They collected a sample that summarizes the activities of users during the past few months. You are
tasked with identifying the segments based on credit card usage.
Dataset: bank_marketing_part1_Data.csv
Data Dictionary:
1.1 Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis).
Null Values:
Univariate Analysis :
➢ Outlier Analysis
Skewness :
Multivariate Analysis :
• There is also some positive correlation (though not very strong) between
o advance_payments and max_spent_in_single_shopping
o max_spent_in_single_shopping and spending
o credit_limit and current_balance
Outlier Treatment :
Outlier treatment : Before vs After
1.2 Do you think scaling is necessary for clustering in this case? Justify. The learner is
expected to check and comment about the difference in scale of different features
on the bases of appropriate measure for example std dev, variance, etc. Should
justify whether there is a necessity for scaling and which method is he/she using to
do the scaling. Can also comment on how that method works.
Clustering algorithms use Euclidean distance to form cohorts. Variables with a larger magnitude
dominate all distance-based models such as clustering, because they receive a higher effective
weight in the distance computation. Scaling makes the relative weight of each variable equal by
converting each variable to a unitless measure or relative distance.
Before Scaling :
There is a lot of variation in the scale of the dataset, as can be seen from the boxplot:
z-score scaling has been applied to the dataset. Post scaling, all the data fields are on the same
scale: each has a mean of 0 and a standard deviation of 1.
Fig 1.13 : Key Statistics post data scaling
Note : Before and After scaling comparison has been done post Outlier Treatment.
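As a minimal sketch of the z-score scaling step described above (the toy data and column names are assumptions standing in for the bank dataset), sklearn's StandardScaler can be used:

```python
# Hedged sketch: z-score scaling with sklearn's StandardScaler.
# The values and column names below are illustrative, not the actual dataset.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "spending": [19.9, 15.2, 18.7, 10.6],
    "advance_payments": [16.9, 14.8, 16.2, 12.7],
    "credit_limit": [3.8, 3.1, 3.6, 2.8],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Post scaling, every column has mean ~0 and (population) std ~1.
print(scaled.mean().round(6).tolist())
print(scaled.std(ddof=0).round(6).tolist())
```

StandardScaler subtracts each column's mean and divides by its standard deviation, which is exactly the z-score transformation referred to in the text.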
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum
clusters using Dendrogram and briefly describe them
Hierarchical cluster analysis or HCA is an unsupervised clustering algorithm which involves creating
clusters that have predominant ordering from top to bottom. The algorithm groups similar objects into
groups called clusters. The endpoint is a set of clusters or groups, where each cluster is distinct from each
other cluster, and the objects within each cluster are broadly similar to each other.
Applying Agglomerative clustering to the data set yields the dendrogram below, using Euclidean
distance as the affinity and Ward's linkage.
Fig 1.15 : Customer Segmentation Dendrogram
Each observation is initially treated as a singleton cluster. The Euclidean distance between each
pair is then computed, and the most similar clusters are successively merged. The process repeats
until the final optimal clusters are formed.
Ward's method says that the distance between two clusters, A and B, is how much the sum of squares
will increase when we merge them:

Δ(A, B) = Σ_{i ∈ A∪B} ||x_i − m_{A∪B}||² − Σ_{i ∈ A} ||x_i − m_A||² − Σ_{i ∈ B} ||x_i − m_B||²

where m denotes the centroid of a cluster.
With hierarchical clustering, the sum of squares starts out at zero (because every point is in its own
cluster) and then grows as we merge clusters. Ward’s method keeps this growth as small as possible.
Given two pairs of clusters whose centers are equally far apart, Ward’s method will prefer to merge the
smaller ones.
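The Ward-linkage procedure described above can be sketched with scipy; the three synthetic blobs and the random seed here are assumptions standing in for the scaled customer data (plotting the dendrogram itself would additionally need matplotlib):

```python
# Hedged sketch: agglomerative clustering with Ward linkage on scaled data.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs (an assumption, not the real dataset).
X = np.vstack([rng.normal(0, 0.3, (20, 3)),
               rng.normal(3, 0.3, (20, 3)),
               rng.normal(-3, 0.3, (20, 3))])

# Euclidean affinity with Ward's linkage, as in the report.
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree at 3 clusters, mirroring the truncated dendrogram.
labels = fcluster(Z, t=3, criterion="maxclust")
print(sorted(np.bincount(labels)[1:].tolist()))  # cluster sizes
```

`scipy.cluster.hierarchy.dendrogram(Z, truncate_mode="lastp", p=10)` would reproduce the "last 10 merges" view mentioned below.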
Truncating the dendrogram to the last 10 merges yields the dendrogram below, with 3 clusters.
Cluster 1 : 70 observations, having the highest average for all features except min_payment_amt. This
group also has the highest probability of making full payment.
Cluster 2 : 67 observations, having the lowest spending and the lowest probability of making full payment. This
group, however, has the highest min_payment_amt.
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply
elbow curve and silhouette score. Explain the results properly. Interpret and write
inferences on the finalized clusters.
K-means is a simple iterative clustering algorithm. Starting with randomly chosen K centroids, the
algorithm proceeds to update the centroids and their clusters to equilibrium while minimizing the total
within cluster variance. It relies on the Euclidean distance to discover cluster centroids.
Applying K-Means clustering to the data set gives the below within sum of squares (wss), i.e. the sum of
squared distances of samples to their closest cluster centre, for the no. of clusters ranging from 1 to 15.
We see that wss drops significantly as the no. of clusters increases from 1 to 3. Thereafter, the drop is
not significant.
Elbow Plot :
The Elbow method looks at the total within sum of squares (wss) as a function of the number of clusters.
The optimum value of k is the one beyond which adding one more cluster does not lower the total wss
appreciably.
Plotting the Elbow Curve for above within sum of squares (wss) clearly shows the ideal no. of clusters as
per wss as 3. Beyond 3 clusters, the curve becomes relatively flatter and wss starts to decrease in a linear
fashion.
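The wss series behind the elbow curve can be sketched as follows; the synthetic three-blob data and seeds are assumptions, and sklearn exposes wss as `inertia_`:

```python
# Hedged sketch: within-cluster sum of squares (KMeans.inertia_) for k = 1..10,
# the series plotted in the elbow curve. Data below is synthetic (an assumption).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, (30, 3)) for c in (0, 4, 8)])

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(km.inertia_)  # sum of squared distances to closest centre

# wss falls steeply up to the true cluster count, then flattens out.
print([round(w, 1) for w in wss])
```

Plotting `range(1, 11)` against `wss` with matplotlib would reproduce the elbow plot; the sharp bend marks the chosen k.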
Silhouette Score :
This method measures how tightly the observations are clustered and the average distance between
clusters. For each observation, a silhouette score is constructed as a function of
• the average distance between the point and all other points in the cluster to which it belongs
• the average distance between the point and the points in the clusters to which it does not belong
The Silhouette score varies from -1 to 1. A score close to 1 means the cluster is dense and well
separated from other clusters. A value near 0 represents overlapping clusters, with samples very close to
the decision boundary of the neighbouring clusters. A negative score indicates that samples might
have been assigned to the wrong clusters.
As per Elbow method, 3 clusters is optimum. Whereas, as per Silhouette method, 2 clusters should be
optimum.
Silhouette width is calculated for 2 and 3 clusters, using the following formula:

s = (b − a) / max(a, b)

where a = the average distance between the observation and the other points in its own cluster, and
b = the average distance between the observation and the points in the nearest neighbouring cluster.
A positive silhouette width implies that the observation is mapped to the correct cluster. The minimum
silhouette width for 2 clusters is negative, implying that at least one observation is incorrectly mapped,
so 2 clusters should not be considered. The minimum silhouette width for 3 clusters is positive; however,
it is very close to 0, implying that that particular observation is only just closer to its own cluster
than to its nearest neighbouring cluster.
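The comparison between the two candidate solutions can be sketched with sklearn's silhouette utilities; the synthetic blobs and seeds are assumptions standing in for the scaled data:

```python
# Hedged sketch: average silhouette score and minimum silhouette width
# for k = 2 and k = 3, as compared in the report. Data is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.5, (40, 3)) for c in (0, 5, 10)])

scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=2).fit_predict(X)
    scores[k] = silhouette_score(X, labels)       # mean of s = (b - a) / max(a, b)
    s_min = silhouette_samples(X, labels).min()   # worst-assigned observation
    print(k, round(scores[k], 3), round(s_min, 3))
```

`silhouette_score` gives the average over all observations, while `silhouette_samples` exposes the per-observation widths whose minimum is discussed above.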
1.5 Describe cluster profiles for the clusters defined. Recommend different
promotional strategies for different clusters.
This group has the highest averages for all variables except min_payment_amt. Customers in this group
have high account balances and high credit limits, spend large amounts in a single shopping trip, and
have the highest probability of making full payment.
These can be considered the privileged customers of the bank, with good credit records. Their credit
cards can be upgraded to Platinum.
This group has the lowest averages for all variables except min_payment_amt. They can be given Silver
credit cards.
• Spending can be increased through tie-ups with online and offline stores such as Big Bazaar,
Amazon, Flipkart, petrol pumps, etc.
• They should be encouraged to pay for utilities and subscriptions through credit card by offering
discounts
• Offers can be provided on early payments to improve their payment rate
This group has medium averages for all variables except min_payment_amt. They can be upgraded
to Gold membership.
• This group should be encouraged to increase their credit card spending by tying up with e-
commerce sites, grocery stores, airlines and hotels.
Problem 2 : CART-RF-ANN
Description:
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides
to collect data from the past few years. You are assigned the task of building a model that predicts the
claim status and of providing recommendations to the management. Use CART, RF & ANN and compare the
models' performances on train and test data.
Dataset: insurance_part2_data-1.csv
Data Dictionary:
8. Amount of sales per customer in procuring tour insurance policies, in rupees (in 100's)
9. The commission received by the tour insurance firm (commission is a percentage of sales)
2.1 Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis).
Variables in the data set:
The max values of the continuous variables show the presence of outliers. Commision is "0" for 1366
records and Sales is "0" for 2089 records.
There is one negative entry in the Duration field. This seems to be bad data and will have to be treated.
Unique counts of all Categorical Variables:
Agency_Code : EPX 45.5%, C2B 30.8%, CWT 15.73%, JZI 7.97%
Channel : Online 98.47%, Offline 1.53%
There are 139 (i.e. 4.6%) duplicate records in the dataset. Since there is no unique Customer ID, we
will go with the assumption that these could belong to different customers, and hence we will not remove
them.
There is one record with negative duration. This has been changed to “0 days”.
Null Values:
➢ Outlier Analysis
No. of Outliers :
Sales : 353
Duration : 382
Commision : 362
Age : 204
There are outliers in Sales, Duration, Commision and Age. The no. of outliers is more than 10% of the
overall data, which can be a cause for concern. However, CART, Random Forest, and ANN can handle
outliers, hence the outliers need not be treated.
➢ Data Distribution & Skewness
Skewness :
Sales : 2.381
Duration : 13.785
Commision : 3.149
Age : 1.15
➢ Pairplot
➢ Boxplot
Fig 2.19 : Boxplot of Sales vs Agency_Code Fig 2.20 : Boxplot of Sales vs Type
Fig 2.21 : Boxplot of Sales vs Agency_Code Fig 2.22 : Boxplot of Sales vs Type
Claimed : No 68.05%, Yes 31.95%
The data is not 50-50 distributed, so the model may predict "No" more accurately than "Yes".
However, since each class makes up more than 10% of the data, we can say that we have reasonable
proportions in both classes.
Before proceeding further with the analysis, the object data type has been converted to categorical
codes.
Fig 2.25 : Label encoding of categorical variables
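The conversion of object columns to categorical codes can be sketched with pandas; the sample values and column subset below are assumptions, not the actual dataset:

```python
# Hedged sketch: converting object columns to categorical codes with pandas,
# mirroring the label-encoding step. Sample values are illustrative only.
import pandas as pd

df = pd.DataFrame({
    "Agency_Code": ["C2B", "EPX", "CWT", "EPX"],
    "Type": ["Airlines", "Travel Agency", "Travel Agency", "Travel Agency"],
    "Claimed": ["No", "Yes", "No", "No"],
})

for col in df.select_dtypes(include="object").columns:
    df[col] = pd.Categorical(df[col]).codes  # integer code per category

# Categories are sorted alphabetically, so "No" -> 0 and "Yes" -> 1.
print(df["Claimed"].tolist())
```

Note that these codes impose an arbitrary alphabetical ordering on the categories, which tree-based models tolerate well.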
2.2 Data Split: Split the data into test and train, build classification model CART,
Random Forest, Artificial Neural Network
1. Extracting the target column “Claimed” into separate vector for training and test set :
Fig 2.27 : Sample dataset without target column
Train dataset has 2100 observations and test dataset has 900 observations with 9 variables
each.
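The split above can be sketched as follows; the synthetic frame, the `random_state`, and the use of stratification are assumptions, but a 70:30 split of 3000 rows does yield the 2100/900 figures quoted:

```python
# Hedged sketch: extracting the target and making a 70:30 train-test split.
# The synthetic 3000-row frame stands in for the insurance data (an assumption).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(3000, 10)),
                  columns=[f"v{i}" for i in range(9)] + ["Claimed"])
df["Claimed"] = (df["Claimed"] > 0).astype(int)

X = df.drop("Claimed", axis=1)  # the 9 predictor variables
y = df["Claimed"]               # target vector

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)  # (2100, 9) (900, 9)
```

Stratifying on `y` keeps the ~68:32 class proportion identical in both splits, which matters for the imbalanced target noted earlier.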
Based on above criteria, the model will run 6 X 5 X 5 X 10 = 1500 times to give the below
parameters :
✓ The tree is not overgrown and branches out uniformly, which is good for the
model.
✓ At the root node, split has been done as per Agency_Code<=0.5. Gini is 0.42. Out of the
2100 samples in Training data, 1471 are “No” and 629 are “Yes”.
✓ Wherever the sample is less than 300, the tree does not split further.
Variable importance :
• Grid Search for finding out the optimal values for the hyper parameters
max_depth : 5, 8, 10, 15, 20
max_features : 4, 6, 8
min_samples_leaf : 20, 30, 50, 100, 150
min_samples_split : 50, 80, 100, 300
n_estimators : 100, 200
cross validation : 5
Based on the above criteria, the model will run 5 X 3 X 5 X 4 X 2 X 5 = 3000 times to give the following
parameters :
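The grid search described above can be sketched with sklearn's GridSearchCV; the synthetic data is an assumption, and the grid here is trimmed to two values per parameter so the sketch runs quickly (the report's full grid is larger):

```python
# Hedged sketch of a GridSearchCV over a Random Forest, with a trimmed grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded insurance data (an assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=4)

param_grid = {
    "max_depth": [5, 8],          # report's grid: 5, 8, 10, 15, 20
    "max_features": [4, 6],       # report's grid: 4, 6, 8
    "min_samples_leaf": [20, 30], # report's grid: 20, 30, 50, 100, 150
    "n_estimators": [100],        # report's grid: 100, 200
}
grid = GridSearchCV(RandomForestClassifier(random_state=4),
                    param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print(sorted(grid.best_params_))  # names of the tuned hyperparameters
```

GridSearchCV fits every parameter combination once per cross-validation fold, which is where the "combinations × folds" run count quoted above comes from.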
Scaling is required for Artificial Neural Networks but not for CART and Random Forest (RF).
Fig 2.35 : Train dataset post scaling (post fit and transform)
• Grid Search for finding out the optimal values for the hyper parameters
hidden_layer_sizes : 5, 20, 50
max_iter : 2500, 5000
solver : adam, sgd
tol : 0.0001, 0.001
activation : logistic, relu
cross validation : 10
Based on above criteria, the model returns the following best parameters :
3. Using the best grid obtained for each of the three classification models, prediction is then made
on training and test data.
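Since scaling is needed only for the ANN, one clean way to sketch this step is a Pipeline, so the scaler is fit on the training folds only; the synthetic data, seeds, and trimmed grid below are assumptions:

```python
# Hedged sketch: StandardScaler + MLPClassifier in a Pipeline, tuned with a
# trimmed version of the ANN grid quoted above (two values per parameter).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded insurance data (an assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=5)

pipe = Pipeline([("scale", StandardScaler()),
                 ("ann", MLPClassifier(max_iter=2500, random_state=5))])
param_grid = {
    "ann__hidden_layer_sizes": [(5,), (20,)],   # report's grid: 5, 20, 50
    "ann__activation": ["logistic", "relu"],
}
grid = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1).fit(X, y)
print(sorted(grid.best_params_))
```

Wrapping the scaler in the pipeline avoids leaking test-fold statistics into the fit, matching the report's "fit and transform on train, transform only on test" convention.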
2.3 Performance Metrics: Comment and Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score, classification reports for each model.
b. Confusion Matrix :
c. ROC Curve :
e. Classification Report :
Comments :
• The CART model shows 79% accuracy on the Train data and 77% accuracy on Test data. Since
the difference in the accuracy scores is very small, we can conclude that the model is neither
over-fitting nor under-fitting.
• Precision is the proportion of predictions of a class that actually belong to that class. The
precision score for "1" is 67% on the train dataset and 71% on the test dataset.
• Recall is the proportion of actual members of a class that are correctly identified. The model
correctly identifies 87% of the class 0s, but only 62% of the class 1s, on Train data. This becomes
89% and 53% respectively for Test data.
• AUC score is 0.826 on Train data and 0.795 on Test data
• ROC curve is a probability curve that plots the TPR against FPR at various threshold values and
essentially separates the ‘signal’ from the ‘noise’. ROC curve for Test data is slightly below that
of Train data
• Training and Test set results are almost similar, and with the overall measures high, the model
is a good model.
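The four metrics quoted for each model can be computed with sklearn's metrics module; the toy labels and probabilities below are illustrative only, not the report's predictions:

```python
# Hedged sketch: the four metrics reported per model.
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])            # toy ground truth
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])            # toy hard predictions
y_prob = np.array([0.1, 0.2, 0.7, 0.9, 0.4, 0.3, 0.8, 0.2])  # toy P(class 1)

print(round(accuracy_score(y_true, y_pred), 3))
print(confusion_matrix(y_true, y_pred).tolist())        # [[TN, FP], [FN, TP]]
print(round(roc_auc_score(y_true, y_prob), 3))          # AUC uses probabilities
print(classification_report(y_true, y_pred, digits=2))
```

Note that ROC AUC is computed from the predicted probabilities (`predict_proba` in practice), not from the hard class labels used for accuracy and the confusion matrix.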
a. Accuracy :
c. ROC Curve :
e. Classification Report :
• The Random Forest model shows 81% accuracy on Train data and 78% on Test data. The model
is neither over-fitting nor under-fitting.
• Precision score for “1” is 72% for train dataset and 74% for test data
• Recall is 59% for positive class on Train data and 49% on Test data
• Train and Test set results are almost similar, and with the overall measures high, the model is a
good model.
a. Accuracy :
b. Confusion Matrix :
c. ROC Curve :
e. Classification Report :
Comments :
• The Artificial Neural Networks model shows 80% accuracy on Train dataset and 76% on Test
data. The model is neither over-fitting nor under-fitting.
• Precision score for "1" is 69% and 70% for the train and test datasets respectively
• Recall for the positive class is 58% on Train data and 47% on Test data
• Training and Test set results are almost similar, and with the overall measures high, the model
is a good model.
2.4 Final Model: Compare all the models and write an inference which model is
best/optimized.
There is not much difference in the scores of the three models, and the ROC curves are similar and
close to each other for all three. However, the Random Forest model has slightly better accuracy and
AUC scores than the CART and ANN models, and its Precision and F1 scores are also better, though its
recall is slightly lower than CART's.
2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations
We can draw below business insights and recommendations from the analysis :
• Agency_Code and Product Name play an important role in deciding claim status (they have the
highest feature importance)
• The JZI agency has the least sales compared to the other agencies. The company can run
promotional campaigns to pull up sales through JZI.