Namta Bansal
Student - PGPDSBA
Contents
Index of Figures
Problem 1 : Clustering
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis).
1.2 Do you think scaling is necessary for clustering in this case? Justify. The learner is expected to check and comment about the difference in scale of different features on the bases of appropriate measure for example std dev, variance, etc. Should justify whether there is a necessity for scaling and which method is he/she using to do the scaling. Can also comment on how that method works.
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.
Problem 2 : CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis).
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial Neural Network
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model.
2.4 Final Model: Compare all the models and write an inference which model is best/optimized.
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations
Index of Figures
Problem 1 : Clustering
1.1 Sample of the dataset
1.2 Variables in the dataset
1.3 Key statistical parameters of the dataset
1.4 Boxplot of variables
1.5 Count of Outliers
1.6 Histogram of variables
1.7 Skewness
1.8 Correlation & Heat Map of Variables
1.9 Pair plot of Variables
1.10 Outlier Treatment
1.11 Key statistics prior to data scaling
1.12 Boxplot prior to data scaling
1.13 Key statistics post data scaling
1.14 Boxplot post data scaling
1.15 Customer Segmentation Dendrogram
1.16 Truncated Dendrogram
1.17 Dendrogram Clusters
1.18 Variable means of Hierarchical Clusters
1.19 wss score for 1 to 15 clusters
1.20 Elbow Plot
1.21 Silhouette Scores for 2 to 14 clusters
1.22 Silhouette Plot
1.23 Clusters in the dataset
1.24 Variable means of K-Means clusters
Problem 2 : CART-RF-ANN
2.1 Sample of the dataset
2.2 Variables in the dataset
2.3 Key statistical parameters of the dataset
2.4 Unique counts of Agency_Code
2.5 Unique counts of Type
2.6 Unique counts of Channel
2.7 Unique counts of Product Name
2.8 Unique counts of Destination
2.9 Duplicate Rows
2.10 Boxplot of the dataset
2.11 Histogram of variables
2.12 Correlation & Heat Map of Variables
2.13 Pair plot of Variables
2.14 Type vs Claimed
2.15 Destination vs Claimed
2.16 Product Name vs Claimed
2.17 Channel vs Claimed
2.18 Agency_Code vs Claimed
2.19 Boxplot of Sales vs Agency_Code
2.20 Boxplot of Sales vs Type
2.21 Boxplot of Sales vs Agency_Code
2.22 Boxplot of Sales vs Type
2.23 Boxplot of Sales vs Product Name
2.24 Proportion of observations in Target class
2.25 Label encoding of categorical variables
2.26 Variables in the dataset post Label encoding
2.27 Sample dataset without target column
2.28 Sample dataset of target column
2.29 Train-Test split
2.30 Optimal parameters for CART
2.31 Decision Tree
2.32 Root Node
2.33 Variable Importance - CART
2.34 Optimal parameters for Random Forest
2.35 Train dataset post scaling (post fit and transform)
2.36 Test dataset post scaling (post transform)
2.37 Optimal parameters for ANN
2.38 CART model Accuracy
2.39 CART – Confusion Matrix
2.40 CART – ROC Curve
2.41 CART – Area under AUC
2.42 CART – Classification report
2.43 RF model Accuracy
2.44 RF – Confusion Matrix
2.45 RF – ROC Curve
2.46 RF – Area under AUC
2.47 RF – Classification report
2.48 ANN model Accuracy
2.49 ANN – Confusion Matrix
2.50 ANN – ROC Curve
2.51 ANN – Area under AUC
2.52 ANN – Classification report
2.53 CART, RF, ANN model comparison
2.54 CART, RF, ANN – ROC Curve
2.55 Feature Importance
2.56 % claims by Product Name
2.57 Boxplot of Sales vs Agency_Code
2.58 % claims by Agency_Code
Problem 1 : Clustering
Description:
A leading bank wants to develop a customer segmentation to give promotional offers to its customers.
They collected a sample that summarizes the activities of users during the past few months. You are
tasked with identifying the segments based on credit card usage.
Dataset: bank_marketing_part1_Data.csv
Data Dictionary:
1.1 Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis).
Null Values:
Univariate Analysis :
➢ Outlier Analysis
Skewness :
Multivariate Analysis :
• There is also some positive correlation (though not very strong) between
o advance_payments and max_spent_in_single_shopping
o max_spent_in_single_shopping and spending
o credit_limit and current_balance
Outlier Treatment :
Outlier treatment : Before vs After
1.2 Do you think scaling is necessary for clustering in this case? Justify. The learner is
expected to check and comment about the difference in scale of different features
on the bases of appropriate measure for example std dev, variance, etc. Should
justify whether there is a necessity for scaling and which method is he/she using to
do the scaling. Can also comment on how that method works.
Clustering algorithms use Euclidean distance to form cohorts. Variables with a larger magnitude
dominate all distance-based models such as clustering, because they receive a higher effective
weight in the distance computation. Scaling makes the relative weight of each variable equal by
converting each variable to a unitless measure or relative distance.
Before Scaling :
There is a lot of variation in the scale of the dataset, as can be seen from the boxplot:
z-score scaling has been applied to the dataset. Post scaling, all the data fields are on the same
scale: each has a mean of 0 and a standard deviation of 1.
Fig 1.13 : Key Statistics post data scaling
Note : Before and After scaling comparison has been done post Outlier Treatment.
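As a minimal sketch of the z-score scaling step described above (the toy data and column names are assumptions standing in for the bank dataset), sklearn's StandardScaler can be used:

```python
# Hedged sketch: z-score scaling with sklearn's StandardScaler.
# The values and column names below are illustrative, not the actual dataset.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "spending": [19.9, 15.2, 18.7, 10.6],
    "advance_payments": [16.9, 14.8, 16.2, 12.7],
    "credit_limit": [3.8, 3.1, 3.6, 2.8],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Post scaling, every column has mean ~0 and (population) std ~1.
print(scaled.mean().round(6).tolist())
print(scaled.std(ddof=0).round(6).tolist())
```

StandardScaler subtracts each column's mean and divides by its standard deviation, which is exactly the z-score transformation referred to in the text.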
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum
clusters using Dendrogram and briefly describe them
Hierarchical cluster analysis or HCA is an unsupervised clustering algorithm which involves creating
clusters that have predominant ordering from top to bottom. The algorithm groups similar objects into
groups called clusters. The endpoint is a set of clusters or groups, where each cluster is distinct from each
other cluster, and the objects within each cluster are broadly similar to each other.
Applying Agglomerative clustering to the data set yields the dendrogram below, using Euclidean
distance as the affinity and Ward's linkage.
Fig 1.15 : Customer Segmentation Dendrogram
Each observation is initially treated as a singleton cluster. The Euclidean distance between each
pair is then computed, and the most similar clusters are successively merged. The process repeats
until the final optimal clusters are formed.
Ward's method says that the distance between two clusters, A and B, is how much the sum of squares
will increase when we merge them:

Δ(A, B) = Σ_{i ∈ A∪B} ||x_i − m_{A∪B}||² − Σ_{i ∈ A} ||x_i − m_A||² − Σ_{i ∈ B} ||x_i − m_B||²

where m denotes the centroid of a cluster.
With hierarchical clustering, the sum of squares starts out at zero (because every point is in its own
cluster) and then grows as we merge clusters. Ward’s method keeps this growth as small as possible.
Given two pairs of clusters whose centers are equally far apart, Ward’s method will prefer to merge the
smaller ones.
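The Ward-linkage procedure described above can be sketched with scipy; the three synthetic blobs and the random seed here are assumptions standing in for the scaled customer data (plotting the dendrogram itself would additionally need matplotlib):

```python
# Hedged sketch: agglomerative clustering with Ward linkage on scaled data.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs (an assumption, not the real dataset).
X = np.vstack([rng.normal(0, 0.3, (20, 3)),
               rng.normal(3, 0.3, (20, 3)),
               rng.normal(-3, 0.3, (20, 3))])

# Euclidean affinity with Ward's linkage, as in the report.
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree at 3 clusters, mirroring the truncated dendrogram.
labels = fcluster(Z, t=3, criterion="maxclust")
print(sorted(np.bincount(labels)[1:].tolist()))  # cluster sizes
```

`scipy.cluster.hierarchy.dendrogram(Z, truncate_mode="lastp", p=10)` would reproduce the "last 10 merges" view mentioned below.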
Truncating the dendrogram to the last 10 merges yields the dendrogram below, with 3 clusters.
Cluster 1 : 70 observations, having the highest average for all features except min_payment_amt. This
group also has the highest probability of making full payment.
Cluster 2 : 67 observations, having the lowest spending and the lowest probability of making full payment. This
group, however, has the highest min_payment_amt.
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply
elbow curve and silhouette score. Explain the results properly. Interpret and write
inferences on the finalized clusters.
K-means is a simple iterative clustering algorithm. Starting with randomly chosen K centroids, the
algorithm proceeds to update the centroids and their clusters to equilibrium while minimizing the total
within cluster variance. It relies on the Euclidean distance to discover cluster centroids.
Applying K-Means clustering to the data set gives the below within sum of squares (wss), i.e. the sum of
squared distances of samples to their closest cluster centre, for the no. of clusters ranging from 1 to 15.
We see that wss drops significantly as the no. of clusters increases from 1 to 3. Thereafter, the drop is
not significant.
Elbow Plot :
The Elbow method looks at the total within sum of squares (wss) as a function of the number of clusters.
The optimum value of k is the one beyond which adding one more cluster does not lower the total wss
appreciably.
Plotting the Elbow Curve for above within sum of squares (wss) clearly shows the ideal no. of clusters as
per wss as 3. Beyond 3 clusters, the curve becomes relatively flatter and wss starts to decrease in a linear
fashion.
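The wss series behind the elbow curve can be sketched as follows; the synthetic three-blob data and seeds are assumptions, and sklearn exposes wss as `inertia_`:

```python
# Hedged sketch: within-cluster sum of squares (KMeans.inertia_) for k = 1..10,
# the series plotted in the elbow curve. Data below is synthetic (an assumption).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, (30, 3)) for c in (0, 4, 8)])

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(km.inertia_)  # sum of squared distances to closest centre

# wss falls steeply up to the true cluster count, then flattens out.
print([round(w, 1) for w in wss])
```

Plotting `range(1, 11)` against `wss` with matplotlib would reproduce the elbow plot; the sharp bend marks the chosen k.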
Silhouette Score :
This method measures how tightly the observations are clustered and the average distance between
clusters. For each observation, a silhouette score is constructed as a function of
• the average distance between the point and all other points in the cluster to which it belongs
• the average distance between the point and the points in the clusters to which it does not belong
The Silhouette score varies from -1 to 1. A score close to 1 means the cluster is dense and well
separated from other clusters. A value near 0 represents overlapping clusters, with samples very close to
the decision boundary of the neighbouring clusters. A negative score indicates that samples might
have been assigned to the wrong clusters.
As per Elbow method, 3 clusters is optimum. Whereas, as per Silhouette method, 2 clusters should be
optimum.
Silhouette width is calculated for 2 and 3 clusters, using the following formula:

s = (b − a) / max(a, b)

where a = the average distance between the observation and the other points in its own cluster, and
b = the average distance between the observation and the points in the nearest neighbouring cluster.
A positive silhouette width implies that the observation is mapped to the correct cluster. The minimum
silhouette width for 2 clusters is negative, implying that at least one observation is incorrectly mapped,
so 2 clusters should not be considered. The minimum silhouette width for 3 clusters is positive; however,
it is very close to 0, implying that that particular observation is only just closer to its own cluster
than to its nearest neighbouring cluster.
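The comparison between the two candidate solutions can be sketched with sklearn's silhouette utilities; the synthetic blobs and seeds are assumptions standing in for the scaled data:

```python
# Hedged sketch: average silhouette score and minimum silhouette width
# for k = 2 and k = 3, as compared in the report. Data is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.5, (40, 3)) for c in (0, 5, 10)])

scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=2).fit_predict(X)
    scores[k] = silhouette_score(X, labels)       # mean of s = (b - a) / max(a, b)
    s_min = silhouette_samples(X, labels).min()   # worst-assigned observation
    print(k, round(scores[k], 3), round(s_min, 3))
```

`silhouette_score` gives the average over all observations, while `silhouette_samples` exposes the per-observation widths whose minimum is discussed above.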
1.5 Describe cluster profiles for the clusters defined. Recommend different
promotional strategies for different clusters.
This group has the highest averages for all variables except min_payment_amt. Customers in this group
have high account balances and high credit limits, spend large amounts in a single shopping trip, and
have the highest probability of making full payment.
These can be considered the privileged customers of the bank, with good credit records. Their credit
cards can be upgraded to Platinum.
This group has the lowest averages for all variables except min_payment_amt. They can be given Silver
credit cards.
• Spending can be increased through tie-ups with online and offline stores such as Big Bazaar,
Amazon, Flipkart, petrol pumps, etc.
• They should be encouraged to pay for utilities and subscriptions through credit card by offering
discounts
• Offers can be provided on early payments to improve their payment rate
This group has medium averages for all variables except min_payment_amt. They can be upgraded
to Gold membership.
• This group should be encouraged to increase their credit card spending by tying up with e-
commerce sites, grocery stores, airlines and hotels.
Problem 2 : CART-RF-ANN
Description:
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides
to collect data from the past few years. You are assigned the task of building a model that predicts the
claim status and of providing recommendations to the management. Use CART, RF & ANN and compare the
models' performances on train and test data.
Dataset: insurance_part2_data-1.csv
Data Dictionary:
8. Amount of sales per customer in procuring tour insurance policies, in rupees (in 100's)
9. The commission received by the tour insurance firm (commission is a percentage of sales)
2.1 Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis).
Variables in the data set:
The max values of the continuous variables show the presence of outliers. Commision is "0" for 1366
records and Sales is "0" for 2089 records.
There is one negative entry in the Duration field. This seems to be bad data and will have to be treated.
Unique counts of all Categorical Variables:
Agency_Code : EPX 45.5%, C2B 30.8%, CWT 15.73%, JZI 7.97%
Channel : Online 98.47%, Offline 1.53%
There are 139 (i.e. 4.6%) duplicate records in the dataset. Since there is no unique Customer ID, we
will go with the assumption that these could belong to different customers, and hence we will not remove
them.
There is one record with negative duration. This has been changed to “0 days”.
Null Values:
➢ Outlier Analysis
No. of Outliers :
Sales : 353
Duration : 382
Commision : 362
Age : 204
There are outliers in Sales, Duration, Commision and Age. The no. of outliers is more than 10% of the
overall data, which can be a cause for concern. However, CART, Random Forest, and ANN can handle
outliers, hence the outliers need not be treated.
➢ Data Distribution & Skewness
Skewness :
Sales : 2.381
Duration : 13.785
Commision : 3.149
Age : 1.15
➢ Pairplot
➢ Boxplot
Fig 2.19 : Boxplot of Sales vs Agency_Code Fig 2.20 : Boxplot of Sales vs Type
Fig 2.21 : Boxplot of Sales vs Agency_Code Fig 2.22 : Boxplot of Sales vs Type
Claimed : No 68.05%, Yes 31.95%
The data is not 50-50 distributed, so the model may predict "No" more accurately than "Yes".
However, since each class makes up more than 10% of the data, we can say that we have reasonable
proportions in both classes.
Before proceeding further with the analysis, the object data type has been converted to categorical
codes.
Fig 2.25 : Label encoding of categorical variables
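The conversion of object columns to categorical codes can be sketched with pandas; the sample values and column subset below are assumptions, not the actual dataset:

```python
# Hedged sketch: converting object columns to categorical codes with pandas,
# mirroring the label-encoding step. Sample values are illustrative only.
import pandas as pd

df = pd.DataFrame({
    "Agency_Code": ["C2B", "EPX", "CWT", "EPX"],
    "Type": ["Airlines", "Travel Agency", "Travel Agency", "Travel Agency"],
    "Claimed": ["No", "Yes", "No", "No"],
})

for col in df.select_dtypes(include="object").columns:
    df[col] = pd.Categorical(df[col]).codes  # integer code per category

# Categories are sorted alphabetically, so "No" -> 0 and "Yes" -> 1.
print(df["Claimed"].tolist())
```

Note that these codes impose an arbitrary alphabetical ordering on the categories, which tree-based models tolerate well.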
2.2 Data Split: Split the data into test and train, build classification model CART,
Random Forest, Artificial Neural Network
1. Extracting the target column “Claimed” into separate vector for training and test set :
Fig 2.27 : Sample dataset without target column
Train dataset has 2100 observations and test dataset has 900 observations with 9 variables
each.
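The split above can be sketched as follows; the synthetic frame, the `random_state`, and the use of stratification are assumptions, but a 70:30 split of 3000 rows does yield the 2100/900 figures quoted:

```python
# Hedged sketch: extracting the target and making a 70:30 train-test split.
# The synthetic 3000-row frame stands in for the insurance data (an assumption).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(3000, 10)),
                  columns=[f"v{i}" for i in range(9)] + ["Claimed"])
df["Claimed"] = (df["Claimed"] > 0).astype(int)

X = df.drop("Claimed", axis=1)  # the 9 predictor variables
y = df["Claimed"]               # target vector

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)  # (2100, 9) (900, 9)
```

Stratifying on `y` keeps the ~68:32 class proportion identical in both splits, which matters for the imbalanced target noted earlier.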
Based on above criteria, the model will run 6 X 5 X 5 X 10 = 1500 times to give the below
parameters :
✓ The tree is not overgrown and branches out uniformly, which is good for the
model.
✓ At the root node, split has been done as per Agency_Code<=0.5. Gini is 0.42. Out of the
2100 samples in Training data, 1471 are “No” and 629 are “Yes”.
✓ Wherever the sample is less than 300, the tree does not split further.
Variable importance :
• Grid Search for finding out the optimal values for the hyper parameters
max_depth : 5, 8, 10, 15, 20
max_features : 4, 6, 8
min_samples_leaf : 20, 30, 50, 100, 150
min_samples_split : 50, 80, 100, 300
n_estimators : 100, 200
cross validation : 5
Based on the above criteria, the model will run 5 X 3 X 5 X 4 X 2 X 5 = 3000 times to give the following
parameters :
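The grid search described above can be sketched with sklearn's GridSearchCV; the synthetic data is an assumption, and the grid here is trimmed to two values per parameter so the sketch runs quickly (the report's full grid is larger):

```python
# Hedged sketch of a GridSearchCV over a Random Forest, with a trimmed grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded insurance data (an assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=4)

param_grid = {
    "max_depth": [5, 8],          # report's grid: 5, 8, 10, 15, 20
    "max_features": [4, 6],       # report's grid: 4, 6, 8
    "min_samples_leaf": [20, 30], # report's grid: 20, 30, 50, 100, 150
    "n_estimators": [100],        # report's grid: 100, 200
}
grid = GridSearchCV(RandomForestClassifier(random_state=4),
                    param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print(sorted(grid.best_params_))  # names of the tuned hyperparameters
```

GridSearchCV fits every parameter combination once per cross-validation fold, which is where the "combinations × folds" run count quoted above comes from.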
Scaling is required for Artificial Neural Networks but not for CART and Random Forest (RF).
Fig 2.35 : Train dataset post scaling (post fit and transform)
• Grid Search for finding out the optimal values for the hyper parameters
hidden_layer_sizes : 5, 20, 50
max_iter : 2500, 5000
solver : adam, sgd
tol : 0.0001, 0.001
activation : logistic, relu
cross validation : 10
Based on above criteria, the model returns the following best parameters :
3. Using the best grid obtained for each of the three classification models, prediction is then made
on training and test data.
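Since scaling is needed only for the ANN, one clean way to sketch this step is a Pipeline, so the scaler is fit on the training folds only; the synthetic data, seeds, and trimmed grid below are assumptions:

```python
# Hedged sketch: StandardScaler + MLPClassifier in a Pipeline, tuned with a
# trimmed version of the ANN grid quoted above (two values per parameter).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded insurance data (an assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=5)

pipe = Pipeline([("scale", StandardScaler()),
                 ("ann", MLPClassifier(max_iter=2500, random_state=5))])
param_grid = {
    "ann__hidden_layer_sizes": [(5,), (20,)],   # report's grid: 5, 20, 50
    "ann__activation": ["logistic", "relu"],
}
grid = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1).fit(X, y)
print(sorted(grid.best_params_))
```

Wrapping the scaler in the pipeline avoids leaking test-fold statistics into the fit, matching the report's "fit and transform on train, transform only on test" convention.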
2.3 Performance Metrics: Comment and Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score, classification reports for each model.
b. Confusion Matrix :
c. ROC Curve :
e. Classification Report :
Comments :
• The CART model shows 79% accuracy on the Train data and 77% accuracy on Test data. Since
the difference in the accuracy scores is very small, we can conclude that the model is neither
over-fitting nor under-fitting.
• Precision is the proportion of predictions of a class that actually belong to that class. The
precision score for "1" is 67% on the train dataset and 71% on the test dataset.
• Recall is the proportion of actual members of a class that are correctly identified. The model
correctly identifies 87% of the class 0s, but only 62% of the class 1s, on Train data. This becomes
89% and 53% respectively for Test data.
• AUC score is 0.826 on Train data and 0.795 on Test data
• ROC curve is a probability curve that plots the TPR against FPR at various threshold values and
essentially separates the ‘signal’ from the ‘noise’. ROC curve for Test data is slightly below that
of Train data
• Training and Test set results are almost similar, and with the overall measures high, the model
is a good model.
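The four metrics quoted for each model can be computed with sklearn's metrics module; the toy labels and probabilities below are illustrative only, not the report's predictions:

```python
# Hedged sketch: the four metrics reported per model.
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])            # toy ground truth
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])            # toy hard predictions
y_prob = np.array([0.1, 0.2, 0.7, 0.9, 0.4, 0.3, 0.8, 0.2])  # toy P(class 1)

print(round(accuracy_score(y_true, y_pred), 3))
print(confusion_matrix(y_true, y_pred).tolist())        # [[TN, FP], [FN, TP]]
print(round(roc_auc_score(y_true, y_prob), 3))          # AUC uses probabilities
print(classification_report(y_true, y_pred, digits=2))
```

Note that ROC AUC is computed from the predicted probabilities (`predict_proba` in practice), not from the hard class labels used for accuracy and the confusion matrix.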
a. Accuracy :
c. ROC Curve :
e. Classification Report :
• The Random Forest model shows 81% accuracy on Train data and 78% on Test data. The model
is neither over-fitting nor under-fitting.
• Precision score for “1” is 72% for train dataset and 74% for test data
• Recall is 59% for positive class on Train data and 49% on Test data
• Train and Test set results are almost similar, and with the overall measures high, the model is a
good model.
a. Accuracy :
b. Confusion Matrix :
c. ROC Curve :
e. Classification Report :
Comments :
• The Artificial Neural Networks model shows 80% accuracy on Train dataset and 76% on Test
data. The model is neither over-fitting nor under-fitting.
• Precision score for "1" is 69% and 70% for the train and test datasets respectively
• Recall for the positive class is 58% on Train data and 47% on Test data
• Training and Test set results are almost similar, and with the overall measures high, the model
is a good model.
2.4 Final Model: Compare all the models and write an inference which model is
best/optimized.
There is not much difference in the scores of the three models, and the ROC curves are similar and
close to each other for all three. However, the Random Forest model has slightly better accuracy and
AUC scores than the CART and ANN models, and its Precision and F1 scores are also better, though its
recall is slightly lower than CART's.
2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations
We can draw below business insights and recommendations from the analysis :
• Agency_Code and Product Name play an important role in deciding claim status (they have the
highest feature importance)
• The JZI agency has the least sales compared to the other agencies. The company can run
promotional campaigns to pull up sales through JZI.