Professional Documents
Culture Documents
Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides to
collect data from the past few years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF & ANN and compare the models'
performances in train and test sets.
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis)
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
# Import stats from scipy
from scipy import stats
In [2]:
df=pd.read_csv("insurance_part2_data.csv")
In [3]:
df.head()
Out[3]:
Product
Age Agency_Code Type Claimed Commision Channel Duration Sales De
Name
Customised
0 48 C2B Airlines No 0.70 Online 7 2.51
Plan
Travel Customised
1 36 EPX No 0.00 Online 34 20.00
Agency Plan
Travel Customised
2 39 CWT No 5.94 Online 3 9.90
Agency Plan
Travel Cancellation
3 36 EPX No 0.00 Online 4 26.00
Agency Plan
In [4]:
df.tail()
Out[4]:
Product
Age Agency_Code Type Claimed Commision Channel Duration Sales
Name
Travel
2995 28 CWT Yes 166.53 Online 364 256.20 Gold Plan
Agency
Travel Customised
2997 36 EPX No 0.00 Online 54 28.00
Agency Plan
Attribute Information:
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
In [6]:
df.dtypes.value_counts()
Out[6]:
object 6
float64 2
int64 2
dtype: int64
There are total of 3000 rows and 10 columns in the dataset.Out of 10, 6 columns are of object type, 2 columns
are of integer type and remaining two are of float type data.
10 variables
Age, Commision, Duration, Sales are numeric variable
rest are categorial variables
3000 records, no missing one
9 independant variable and one target variable - Clamied
In [7]:
df.isnull().sum()
Out[7]:
Age 0
Agency_Code 0
Type 0
Claimed 0
Commision 0
Channel 0
Duration 0
Sales 0
Product Name 0
Destination 0
dtype: int64
In [8]:
round(df.describe().T,3)
Out[8]:
Inference:
In [9]:
df.shape
print('The number of rows of the dataframe is',df.shape[0],'.')
print('The number of columns of the dataframe is',df.shape[1],'.')
In [10]:
AGENCY_CODE : 4
JZI 239
CWT 472
C2B 924
EPX 1365
Name: Agency_Code, dtype: int64
TYPE : 2
Airlines 1163
CLAIMED : 2
Yes 924
No 2076
Name: Claimed, dtype: int64
CHANNEL : 2
Offline 46
Online 2954
PRODUCT NAME : 5
DESTINATION : 3
EUROPE 215
Americas 320
ASIA 2465
In [11]:
dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
df[dups]
Out[11]:
Product
Age Agency_Code Type Claimed Commision Channel Duration Sales
Name
Travel Customised
329 36 EPX No 0.0 Online 5 20.0
Agency Plan
Travel Cancellation
407 36 EPX No 0.0 Online 11 19.0
Agency Plan
Travel Customised
411 35 EPX No 0.0 Online 2 20.0
Agency Plan
Travel Customised
422 36 EPX No 0.0 Online 5 20.0
Agency Plan
... ... ... ... ... ... ... ... ... ...
Travel Cancellation
2940 36 EPX No 0.0 Online 8 10.0
Agency Plan
Travel Customised
2947 36 EPX No 0.0 Online 10 28.0
Agency Plan
Travel Cancellation
2952 36 EPX No 0.0 Online 2 10.0
Agency Plan
Travel Customised
2962 36 EPX No 0.0 Online 4 20.0
Agency Plan
Travel Customised
2984 36 EPX No 0.0 Online 1 20.0
Agency Plan
Though it shows there are 139 records, but it can be of different customers, there is no customer ID or any
unique identifier, hence,we will not drop them off.
Univariate Analysis
In [12]:
def univariateAnalysis_numeric(column,nbins):
print("Description of " + column)
print("-------------------------------------------------------------------------
print(df[column].describe(),end=' ')
plt.figure()
print("Distribution of " + column)
print("-------------------------------------------------------------------------
sns.distplot(df[column], kde=False, color='g');
plt.show()
plt.figure()
print("BoxPlot of " + column)
print("-------------------------------------------------------------------------
ax = sns.boxplot(x=df[column])
plt.show()
In [13]:
In [14]:
df_cat.head()
Out[14]:
In [15]:
df_num.head()
Out[15]:
0 48 0.70 7 2.51
1 36 0.00 34 20.00
2 39 5.94 3 9.90
3 36 0.00 4 26.00
4 33 6.30 53 18.00
In [16]:
for x in Numerical_column_list:
univariateAnalysis_numeric(x,20)
BoxPlot of Commision
----------------------------------------------------------------------
------
For Age variable, Minimum age of insured is 8 years and maximum age of insured is 84 years.Average age
for insured people is around 38.
For Commision Variable, minimum commission earned is zero and a maximum commission that can be
earned is approximately 210.21, with an average earning of approximately 14.53 .
For Duration Variable, minimum duaration is a negtive value , which cannot be true , hence we now there is
atleast one wrong entry. Maximum duration of tour is 4580 and an average duration of tour is
approximately 70 .
For Sales Variable,Minimum and maximum amounts of sales of tour insurance policies are 0 and 539
respectively. On an average amount of sales is approximately 60.25 .
In [17]:
def univariateAnalysis_category(cat_column):
print("Details of " + cat_column)
print("----------------------------------------------------------------")
print(df_cat[cat_column].value_counts())
plt.figure()
df_cat[cat_column].value_counts().plot.bar(title="Frequency Distribution of " +
plt.show()
print(" ")
In [18]:
Out[18]:
In [19]:
sns.pairplot(df[['Age', 'Commision',
'Duration', 'Sales']])
Out[19]:
<seaborn.axisgrid.PairGrid at 0x7ff4b63a7e20>
In [20]:
plt.figure(figsize=(10,8))
plt.title("Figure 3: Heatmap of Variables ")
sns.set(font_scale=1.2)
sns.heatmap(df[['Age', 'Commision',
'Duration', 'Sales']].corr(), annot=True)
Out[20]:
Insights:
In [21]:
clean_dataset=df.copy()
In [22]:
def check_outliers(data):
vData_num = data.loc[:,data.columns != 'class']
Q1 = vData_num.quantile(0.25)
Q3 = vData_num.quantile(0.75)
IQR = Q3 - Q1
count = 0
# checking for outliers, True represents outlier
vData_num_mod = ((vData_num < (Q1 - 1.5 * IQR)) |(vData_num > (Q3 + 1.5 * IQR)))
#iterating over columns to check for no.of outliers in each of the numerical att
for col in vData_num_mod:
if(1 in vData_num_mod[col].value_counts().index):
print("No. of outliers in %s: %d" %( col, vData_num_mod[col].value_count
count += 1
print("\n\nNo of attributes with outliers are :", count)
check_outliers(df)
There are outliers in all the variables, but the sales and commision can be a geneuine business value. Random
Forest and CART can handle the outliers. Hence, Outliers are not treated for now, we will keep the data as it is.
We will treat the outliers for the ANN model to compare the same after the all the steps just for comparsion.
In [23]:
df.hist(figsize=(15,16),layout=(4,2), color="blue");
plt.title("Figure 4:Distribution plot for Continuous Variables")
plt.ylabel("Density")
plt.show()
In [24]:
# Skewness of Data
Out[24]:
Duration 13.784681
Commision 3.148858
Sales 2.381148
Age 1.149713
dtype: float64
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest,
Artificial Neural Network
Object data should be converted into categorical/numerical data to fit in the models. (pd.categorical().codes(),
pd.get_dummies(drop_first=True)) Data split, ratio defined for the split, train-test split should be discussed. Any
reasonable split is acceptable. Use of random state is mandatory. Successful implementation of each model.
Logical reason behind the selection of different values for the parameters involved in each model. Apply grid
search for each model and make models on best_params. Feature importance for each model.
In [25]:
feature: Agency_Code
[0 2 1 3]
feature: Type
[0 1]
feature: Claimed
['No', 'Yes']
[0 1]
feature: Channel
['Online', 'Offline']
[1 0]
[2 1 0 4 3]
feature: Destination
[0 1 2]
In [26]:
df.info()
<class 'pandas.core.frame.DataFrame'>
In [27]:
df.head()
Out[27]:
Product
Age Agency_Code Type Claimed Commision Channel Duration Sales Destinat
Name
0 48 0 0 0 0.70 1 7 2.51 2
1 36 2 1 0 0.00 1 34 20.00 2
2 39 1 1 0 5.94 1 3 9.90 2
3 36 2 1 0 0.00 1 4 26.00 1
4 33 3 0 0 6.30 1 53 18.00 0
In [28]:
df.Claimed.value_counts(normalize=True)
Out[28]:
0 0.692
1 0.308
In [29]:
plt.figure(figsize=(7,6))
sns.countplot(df["Claimed"])
plt.title("Figure 5: Countplot of Target Variable-CLaimed")
plt.show()
/opt/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36:
FutureWarning: Pass the following variable as a keyword arg: x. From v
ersion 0.12, the only valid positional argument will be `data`, and pa
ssing other arguments without an explicit keyword will result in an er
ror or misinterpretation.
warnings.warn(
In [30]:
print("Percentage of 0's",round(df["Claimed"].value_counts().values[0]/df["Claimed"]
print("Percentage of 1's",round(df["Claimed"].value_counts().values[1]/df["Claimed"]
In [31]:
plt.figure(figsize=(16,7))
df["Claimed"].value_counts().plot.pie(explode=[0,0.2],autopct='%1.1f%%',shadow=False
plt.title('Figure 6:Pi Chart of Target Variable-Claimed')
plt.show()
In [32]:
X = df.drop("Claimed", axis=1)
y = df.pop("Claimed")
X.head()
Out[32]:
Product
Age Agency_Code Type Commision Channel Duration Sales Destination
Name
0 48 0 0 0.70 1 7 2.51 2 0
1 36 2 1 0.00 1 34 20.00 2 0
2 39 1 1 5.94 1 3 9.90 2 1
3 36 2 1 0.00 1 4 26.00 1 0
4 33 3 0 6.30 1 53 18.00 0 0
In [33]:
plt.plot(X)
plt.title("Figure:Independent Variable Plot Before Scaling")
plt.show()
In [34]:
y.head()
Out[34]:
0 0
1 0
2 0
3 0
4 0
Name: Claimed, dtype: int8
Feature Scaling
In [35]:
Out[35]:
Product
Age Agency_Code Type Commision Channel Duration Sales Destination
Name
In [36]:
plt.plot(X_scaled)
plt.title("Figure:Independent Variable Plot Prior Scaling")
plt.show()
In [37]:
In [38]:
print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('train_labels',train_labels.shape)
print('test_labels',test_labels.shape)
X_train (2100, 9)
X_test (900, 9)
train_labels (2100,)
test_labels (900,)
In [39]:
param_grid_dtcl = {
'criterion': ['gini'],
'max_depth': [10,20,30,50],
'min_samples_leaf': [50,100,150],
'min_samples_split': [150,300,450],
}
dtcl = DecisionTreeClassifier(random_state=5)
In [ ]:
In [40]:
grid_search_dtcl.fit(X_train, train_labels)
print(grid_search_dtcl.best_params_)
best_grid_dtcl = grid_search_dtcl.best_estimator_
best_grid_dtcl
Out[40]:
random_state=5)
In [41]:
In [42]:
tree_regularized.close()
dot_data
http://webgraphviz.com/ (http://webgraphviz.com/)
In [43]:
Imp
Agency_Code 0.674494
Sales 0.222345
Commision 0.008008
Duration 0.003005
Age 0.000000
Type 0.000000
Channel 0.000000
Destination 0.000000
In [44]:
ytrain_predict_dtcl = best_grid_dtcl.predict(X_train)
ytest_predict_dtcl = best_grid_dtcl.predict(X_test)
In [45]:
ytest_predict_dtcl
ytest_predict_prob_dtcl=best_grid_dtcl.predict_proba(X_test)
ytest_predict_prob_dtcl
pd.DataFrame(ytest_predict_prob_dtcl).head()
Out[45]:
0 1
0 0.656751 0.343249
1 0.979452 0.020548
2 0.921171 0.078829
3 0.656751 0.343249
4 0.921171 0.078829
In [46]:
param_grid_rfcl = {
'max_depth': [4,5,6],#20,30,40
'max_features': [2,3,4,5],## 7,8,9
'min_samples_leaf': [8,9,11,15],## 50,100
'min_samples_split': [46,50,55], ## 60,70
'n_estimators': [290,350,400] ## 100,200
}
rfcl = RandomForestClassifier(random_state=5)
In [47]:
grid_search_rfcl.fit(X_train, train_labels)
Out[47]:
GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=5),
In [48]:
grid_search_rfcl.best_params_
Out[48]:
{'max_depth': 6,
'max_features': 3,
'min_samples_leaf': 9,
'min_samples_split': 50,
'n_estimators': 290}
In [49]:
best_grid_rfcl = grid_search_rfcl.best_estimator_
In [50]:
best_grid_rfcl
Out[50]:
In [51]:
ytrain_predict_rfcl = best_grid_rfcl.predict(X_train)
ytest_predict_rfcl = best_grid_rfcl.predict(X_test)
In [52]:
ytest_predict_rfcl
ytest_predict_prob_rfcl=best_grid_rfcl.predict_proba(X_test)
ytest_predict_prob_rfcl
pd.DataFrame(ytest_predict_prob_rfcl).head()
Out[52]:
0 1
0 0.786094 0.213906
1 0.971485 0.028515
2 0.906544 0.093456
3 0.657028 0.342972
4 0.875002 0.124998
In [53]:
# Variable Importance
print (pd.DataFrame(best_grid_rfcl.feature_importances_,
columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))
Imp
Agency_Code 0.279196
Sales 0.150871
Commision 0.146070
Duration 0.078847
Type 0.057515
Age 0.040628
Destination 0.008741
Channel 0.002758
In [54]:
param_grid_nncl = {
'hidden_layer_sizes': [50,100,200],
'max_iter': [2500,3000,4000],
'solver': ['adam'],
'tol': [0.01],
}
nncl = MLPClassifier(random_state=5)
In [55]:
grid_search_nncl.fit(X_train, train_labels)
grid_search_nncl.best_params_
best_grid_nncl = grid_search_nncl.best_estimator_
best_grid_nncl
Out[55]:
In [56]:
ytrain_predict_nncl = best_grid_nncl.predict(X_train)
ytest_predict_nncl = best_grid_nncl.predict(X_test)
In [57]:
ytest_predict_nncl
ytest_predict_prob_nncl=best_grid_nncl.predict_proba(X_test)
ytest_predict_prob_nncl
pd.DataFrame(ytest_predict_prob_nncl).head()
Out[57]:
0 1
0 0.838865 0.161135
1 0.926699 0.073301
2 0.914996 0.085004
3 0.657225 0.342775
4 0.909727 0.090273
In [58]:
# predict probabilities
probs_cart = best_grid_dtcl.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs_cart = probs_cart[:, 1]
# calculate AUC
cart_train_auc = roc_auc_score(train_labels, probs_cart)
print('AUC: %.3f' % cart_train_auc)
# calculate roc curve
cart_train_fpr, cart_train_tpr, cart_train_thresholds = roc_curve(train_labels, prob
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 13: CART AUC-ROC for Train Data ")
# plot the roc curve for the model
plt.plot(cart_train_fpr, cart_train_tpr)
AUC: 0.812
Out[58]:
[<matplotlib.lines.Line2D at 0x7ff4ad794dc0>]
In [59]:
# predict probabilities
probs_cart = best_grid_dtcl.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs_cart = probs_cart[:, 1]
# calculate AUC
cart_test_auc = roc_auc_score(test_labels, probs_cart)
print('AUC: %.3f' % cart_test_auc)
# calculate roc curve
cart_test_fpr, cart_test_tpr, cart_testthresholds = roc_curve(test_labels, probs_car
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 14: CART AUC-ROC for Test Data ")
# plot the roc curve for the model
plt.plot(cart_test_fpr, cart_test_tpr)
AUC: 0.800
Out[59]:
[<matplotlib.lines.Line2D at 0x7ff4b7e163d0>]
CART Confusion Matrix and Classification Report for the training data
In [60]:
confusion_matrix(train_labels, ytrain_predict_dtcl)
Out[60]:
array([[1258, 195],
[ 268, 379]])
In [61]:
In [62]:
Out[62]:
0.7795238095238095
In [63]:
print(classification_report(train_labels, ytrain_predict_dtcl))
In [64]:
cart_metrics=classification_report(train_labels, ytrain_predict_dtcl,output_dict=Tru
df=pd.DataFrame(cart_metrics).transpose()
cart_train_f1=round(df.loc["1"][2],2)
cart_train_recall=round(df.loc["1"][1],2)
cart_train_precision=round(df.loc["1"][0],2)
print ('cart_train_precision ',cart_train_precision)
print ('cart_train_recall ',cart_train_recall)
print ('cart_train_f1 ',cart_train_f1)
cart_train_precision 0.66
cart_train_recall 0.59
cart_train_f1 0.62
CART Confusion Matrix and Classification Report for the testing data
In [65]:
confusion_matrix(test_labels, ytest_predict_dtcl)
Out[65]:
array([[536, 87],
[113, 164]])
In [66]:
In [67]:
Out[67]:
0.7777777777777778
In [68]:
print(classification_report(test_labels, ytest_predict_dtcl))
In [69]:
cart_metrics=classification_report(test_labels, ytest_predict_dtcl,output_dict=True)
df=pd.DataFrame(cart_metrics).transpose()
cart_test_precision=round(df.loc["1"][0],2)
cart_test_recall=round(df.loc["1"][1],2)
cart_test_f1=round(df.loc["1"][2],2)
print ('cart_test_precision ',cart_test_precision)
print ('cart_test_recall ',cart_test_recall)
print ('cart_test_f1 ',cart_test_f1)
cart_test_precision 0.65
cart_test_recall 0.59
cart_test_f1 0.62
CART Conclusion:
Train Data:
AUC: 82%
Accuracy: 79%
Precision: 70%
f1-Score: 60%
Test Data:
AUC: 80%
Accuracy: 77%
Precision: 80%
f1-Score: 84%
Training and Test set results are almost similar, and with the overall measures high, the model is a good model.
In [70]:
confusion_matrix(train_labels,ytrain_predict_rfcl)
Out[70]:
array([[1296, 157],
[ 249, 398]])
In [71]:
ax=sns.heatmap(confusion_matrix(train_labels,ytrain_predict_rfcl),annot=True, fmt='d
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Figure 19: RF Confusion Matrix of Train Data')
plt.show()
In [72]:
rf_train_acc=best_grid_rfcl.score(X_train,train_labels)
rf_train_acc
Out[72]:
0.8066666666666666
In [73]:
print(classification_report(train_labels,ytrain_predict_rfcl))
In [74]:
rf_metrics=classification_report(train_labels, ytrain_predict_rfcl,output_dict=True)
df=pd.DataFrame(rf_metrics).transpose()
rf_train_precision=round(df.loc["1"][0],2)
rf_train_recall=round(df.loc["1"][1],2)
rf_train_f1=round(df.loc["1"][2],2)
print ('rf_train_precision ',rf_train_precision)
print ('rf_train_recall ',rf_train_recall)
print ('rf_train_f1 ',rf_train_f1)
rf_train_precision 0.72
rf_train_recall 0.62
rf_train_f1 0.66
In [75]:
rf_train_fpr, rf_train_tpr,_=roc_curve(train_labels,best_grid_rfcl.predict_proba(X_t
plt.plot(rf_train_fpr,rf_train_tpr,color='green')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 17: RF AUC-ROC for Train Data ")
rf_train_auc=roc_auc_score(train_labels,best_grid_rfcl.predict_proba(X_train)[:,1])
print('Area under Curve is', rf_train_auc)
In [76]:
confusion_matrix(test_labels,ytest_predict_rfcl)
Out[76]:
array([[546, 77],
[120, 157]])
In [77]:
ax=sns.heatmap(confusion_matrix(test_labels,ytest_predict_rfcl),annot=True, fmt='d')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Figure 20: RF Confusion Matrix of Test Data')
plt.show()
In [78]:
rf_test_acc=best_grid_rfcl.score(X_test,test_labels)
rf_test_acc
Out[78]:
0.7811111111111111
In [79]:
print(classification_report(test_labels,ytest_predict_rfcl))
In [80]:
rf_metrics=classification_report(test_labels, ytest_predict_rfcl,output_dict=True)
df=pd.DataFrame(rf_metrics).transpose()
rf_test_precision=round(df.loc["1"][0],2)
rf_test_recall=round(df.loc["1"][1],2)
rf_test_f1=round(df.loc["1"][2],2)
print ('rf_test_precision ',rf_test_precision)
print ('rf_test_recall ',rf_test_recall)
print ('rf_test_f1 ',rf_test_f1)
rf_test_precision 0.67
rf_test_recall 0.57
rf_test_f1 0.61
In [81]:
rf_test_fpr, rf_test_tpr,_=roc_curve(test_labels,best_grid_rfcl.predict_proba(X_test
plt.plot(rf_test_fpr,rf_test_tpr,color='green')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 18: RF AUC-ROC for Test Data ")
rf_test_auc=roc_auc_score(test_labels,best_grid_rfcl.predict_proba(X_test)[:,1])
print('Area under Curve is', rf_test_auc)
Train Data:
AUC: 86%
Accuracy: 80%
Precision: 72%
f1-Score: 66%
Test Data:
AUC: 82%
Accuracy: 78%
localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 33/41
25/07/2021 Project-CART-RF-ANN - Jupyter Notebook
Precision: 68%
f1-Score: 62
Training and Test set results are almost similar, and with the overall measures high, the model is a good model.
In [82]:
confusion_matrix(train_labels,ytrain_predict_nncl)
Out[82]:
array([[1292, 161],
[ 319, 328]])
In [83]:
ax=sns.heatmap(confusion_matrix(train_labels,ytrain_predict_nncl),annot=True, fmt='d
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Figure 23: ANN Confusion Matrix of Train Data')
plt.show()
In [84]:
nn_train_acc=best_grid_nncl.score(X_train,train_labels)
nn_train_acc
Out[84]:
0.7714285714285715
In [85]:
print(classification_report(train_labels,ytrain_predict_nncl))
In [86]:
nn_metrics=classification_report(train_labels, ytrain_predict_nncl,output_dict=True)
df=pd.DataFrame(nn_metrics).transpose()
nn_train_precision=round(df.loc["1"][0],2)
nn_train_recall=round(df.loc["1"][1],2)
nn_train_f1=round(df.loc["1"][2],2)
print ('nn_train_precision ',nn_train_precision)
print ('nn_train_recall ',nn_train_recall)
print ('nn_train_f1 ',nn_train_f1)
nn_train_precision 0.67
nn_train_recall 0.51
nn_train_f1 0.58
In [87]:
nn_train_fpr, nn_train_tpr,_=roc_curve(train_labels,best_grid_nncl.predict_proba(X_t
plt.plot(nn_train_fpr,nn_train_tpr,color='black')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 21: ANN AUC-ROC for Train Data ")
nn_train_auc=roc_auc_score(train_labels,best_grid_nncl.predict_proba(X_train)[:,1])
print('Area under Curve is', nn_train_auc)
In [88]:
confusion_matrix(test_labels,ytest_predict_nncl)
Out[88]:
array([[550, 73],
[140, 137]])
In [89]:
ax=sns.heatmap(confusion_matrix(test_labels,ytest_predict_nncl),annot=True, fmt='d',
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Figure 24: ANN Confusion Matrix of Test Data')
plt.show()
In [90]:
nn_test_acc=best_grid_nncl.score(X_test,test_labels)
nn_test_acc
Out[90]:
0.7633333333333333
In [91]:
print(classification_report(test_labels,ytest_predict_nncl))
In [92]:
nn_metrics=classification_report(test_labels, ytest_predict_nncl,output_dict=True)
df=pd.DataFrame(nn_metrics).transpose()
nn_test_precision=round(df.loc["1"][0],2)
nn_test_recall=round(df.loc["1"][1],2)
nn_test_f1=round(df.loc["1"][2],2)
print ('nn_test_precision ',nn_test_precision)
print ('nn_test_recall ',nn_test_recall)
print ('nn_test_f1 ',nn_test_f1)
nn_test_precision 0.65
nn_test_recall 0.49
nn_test_f1 0.56
In [93]:
nn_test_fpr, nn_test_tpr,_=roc_curve(test_labels,best_grid_nncl.predict_proba(X_test
plt.plot(nn_test_fpr,nn_test_tpr,color='black')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 22: ANN AUC-ROC for Test Data ")
nn_test_auc=roc_auc_score(test_labels,best_grid_nncl.predict_proba(X_test)[:,1])
print('Area under Curve is', nn_test_auc)
Train Data:
AUC: 82%
Accuracy: 78%
Precision: 68%
f1-Score: 59
Test Data:
AUC: 80%
Accuracy: 77%
localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 37/41
25/07/2021 Project-CART-RF-ANN - Jupyter Notebook
Precision: 67%
f1-Score: 57%
Training and Test set results are almost similar, and with the overall measures high, the model is a good model.
2.4 Final Model - Compare all models on the basis of the performance metrics in
a structured tabular manner (2.5 pts).
Describe on which model is best/optimized (1.5 pts ). A table containing all the values of accuracies, precision,
recall, auc_roc_score, f1 score. Comparison between the different models(final) on the basis of above table
values. After comparison which model suits the best for the problem in hand on the basis of different
measures. Comment on the final model.
In [94]:
Out[94]:
CART CART Random Forest Random Forest Neural Network Neural Network
Train Test Train Test Train Test
In [98]:
plt.figure(figsize=(10,8))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(cart_train_fpr, cart_train_tpr,color='red',label="CART")
plt.plot(rf_train_fpr,rf_train_tpr,color='green',label="RF")
plt.plot(nn_train_fpr,nn_train_tpr,color='black',label="NN")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Figure 25:ROC for 3 Models in Training Data')
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower right')
Out[98]:
<matplotlib.legend.Legend at 0x7ff4b17210d0>
In [99]:
plt.figure(figsize=(10,8))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(cart_test_fpr, cart_test_tpr,color='red',label="CART")
plt.plot(rf_test_fpr,rf_test_tpr,color='green',label="RF")
plt.plot(nn_test_fpr,nn_test_tpr,color='black',label="NN")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Figure 26:ROC for 3 Models in Test Data')
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower right')
Out[99]:
<matplotlib.legend.Legend at 0x7ff4b88992e0>
RF model should be selected, as it has better accuracy, precsion, recall, f1 score better than other two CART &
NN.
2.5 Based on your analysis and working on the business problem, detail out appropriate insights and
recommendations to help the management solve the business objective.
localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 40/41
25/07/2021 Project-CART-RF-ANN - Jupyter Notebook
p g j
There should be at least 3-4 Recommendations and insights in total. Recommendations should be easily
understandable and business specific, students should not give any technical suggestions. Full marks should
only be allotted if the recommendations are correct and business specific.
In [ ]:
In [ ]:
In [ ]: