
2/9/22, 7:00 PM insurance_CART_RF_ANN_models | Kaggle

Problem 2: CART-RF-ANN
An insurance firm providing tour insurance is facing a higher claim frequency. The management has decided to
collect data from the past few years. You are assigned the task of building a model that predicts claim status
and providing recommendations to management. Use CART, RF & ANN and compare the models' performances
on the train and test sets.

Attribute Information:

1. Target: Claim Status (Claimed)

2. Code of tour firm (Agency_Code)

3. Type of tour insurance firms (Type)

4. Distribution channel of tour insurance agencies (Channel)

5. Name of the tour insurance products (Product)

6. Duration of the tour (Duration)

7. Destination of the tour (Destination)

8. Amount of sales of tour insurance policies (Sales)

9. The commission received by the tour insurance firm (Commission)

10. Age of insured (Age)

Importing all required Libraries

https://www.kaggle.com/mihirjhaveri/insurance-cart-rf-ann-models/notebook

In [1]:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn import tree

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.neural_network import MLPClassifier


from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score, roc_curve, classification_report, confusion_matrix

from sklearn.preprocessing import StandardScaler


from sklearn.model_selection import GridSearchCV
# Import stats from scipy

from scipy import stats

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and the null value condition check; write an inference on it.

Loading the Data

In [2]:

df = pd.read_csv("../input/insurance-data/insurance_part2_data (2).csv")

Checking the data


In [3]:

df.head()

Out[3]:

   Age Agency_Code           Type Claimed  Commision Channel  Duration  Sales       Product Name
0   48         C2B       Airlines      No       0.70  Online         7   2.51    Customised Plan
1   36         EPX  Travel Agency      No       0.00  Online        34  20.00    Customised Plan
2   39         CWT  Travel Agency      No       5.94  Online         3   9.90    Customised Plan
3   36         EPX  Travel Agency      No       0.00  Online         4  26.00  Cancellation Plan
4   33         JZI       Airlines      No       6.30  Online        53  18.00        Bronze Plan

In [4]:

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3000 entries, 0 to 2999

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Age 3000 non-null int64

1 Agency_Code 3000 non-null object

2 Type 3000 non-null object

3 Claimed 3000 non-null object

4 Commision 3000 non-null float64

5 Channel 3000 non-null object

6 Duration 3000 non-null int64

7 Sales 3000 non-null float64

8 Product Name 3000 non-null object

9 Destination 3000 non-null object

dtypes: float64(2), int64(2), object(6)

memory usage: 234.5+ KB


Observation

- 10 variables

- Age, Commision, Duration and Sales are numeric variables

- the rest are categorical variables

- 3000 records, no missing values

- 9 independent variables and one target variable - Claimed

Check for missing value in any column

In [5]:

# Are there any missing values ?

df.isnull().sum()

Out[5]:

Age 0

Agency_Code 0

Type 0

Claimed 0

Commision 0

Channel 0

Duration 0

Sales 0

Product Name 0

Destination 0

dtype: int64

Observation

No missing values

Descriptive Statistics Summary


In [6]:

df.describe().T

Out[6]:

count mean std min 25% 50% 75% max


Age 3000.0 38.091000 10.463518 8.0 32.0 36.00 42.000 84.00
Commision 3000.0 14.529203 25.481455 0.0 0.0 4.63 17.235 210.21
Duration 3000.0 70.001333 134.053313 -1.0 11.0 26.50 63.000 4580.00
Sales 3000.0 60.249913 70.733954 0.0 20.0 33.00 69.000 539.00

In [7]:

## Initial descriptive analysis of the data

df.describe(percentiles=[.25,0.50,0.75,0.90]).T

Out[7]:

count mean std min 25% 50% 75% 90% max


Age 3000.0 38.091000 10.463518 8.0 32.0 36.00 42.000 53.000 84.00
Commision 3000.0 14.529203 25.481455 0.0 0.0 4.63 17.235 48.300 210.21
Duration 3000.0 70.001333 134.053313 -1.0 11.0 26.50 63.000 224.200 4580.00
Sales 3000.0 60.249913 70.733954 0.0 20.0 33.00 69.000 172.025 539.00

Observation

- Duration has a negative value, which is not possible - a wrong entry.

- Commision & Sales - mean and median differ significantly, indicating skew.
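The wrong-entry observation above can be acted on directly. A minimal sketch (toy data standing in for the notebook's dataframe; treating any negative Duration as invalid is an assumption):

```python
import pandas as pd

# Toy frame standing in for the insurance data (column name as in the dataset)
df = pd.DataFrame({"Duration": [7, 34, -1, 3, 4580]})

# Flag physically impossible durations (a tour cannot last a negative number of days)
invalid = df["Duration"] < 0

# One possible remedy: mark them missing so a later imputation step can handle them
df.loc[invalid, "Duration"] = pd.NA
```

Whether to impute, drop, or leave such rows is a judgement call; here only one record is affected, so the impact either way is small.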


In [8]:

df.describe(include='all').T

Out[8]:

              count unique              top  freq     mean      std  min  25%   50%     75%
Age            3000    NaN              NaN   NaN   38.091  10.4635    8   32    36      42
Agency_Code    3000      4              EPX  1365      NaN      NaN  NaN  NaN   NaN     NaN
Type           3000      2    Travel Agency  1837      NaN      NaN  NaN  NaN   NaN     NaN
Claimed        3000      2               No  2076      NaN      NaN  NaN  NaN   NaN     NaN
Commision      3000    NaN              NaN   NaN  14.5292  25.4815    0    0  4.63  17.235
Channel        3000      2           Online  2954      NaN      NaN  NaN  NaN   NaN     NaN
Duration       3000    NaN              NaN   NaN  70.0013  134.053   -1   11  26.5      63
Sales          3000    NaN              NaN   NaN  60.2499   70.734    0   20    33      69
Product Name   3000      5  Customised Plan  1136      NaN      NaN  NaN  NaN   NaN     NaN
Destination    3000      3             ASIA  2465      NaN      NaN  NaN  NaN   NaN     NaN

Observation

The categorical variables have at most 5 unique values.


In [9]:

df.head(10)

Out[9]:

   Age Agency_Code           Type Claimed  Commision Channel  Duration  Sales       Product Name
0   48         C2B       Airlines      No       0.70  Online         7   2.51    Customised Plan
1   36         EPX  Travel Agency      No       0.00  Online        34  20.00    Customised Plan
2   39         CWT  Travel Agency      No       5.94  Online         3   9.90    Customised Plan
3   36         EPX  Travel Agency      No       0.00  Online         4  26.00  Cancellation Plan
4   33         JZI       Airlines      No       6.30  Online        53  18.00        Bronze Plan
5   45         JZI       Airlines     Yes      15.75  Online         8  45.00        Bronze Plan
6   61         CWT  Travel Agency      No      35.64  Online        30  59.40    Customised Plan
7   36         EPX  Travel Agency      No       0.00  Online        16  80.00  Cancellation Plan
8   36         EPX  Travel Agency      No       0.00  Online        19  14.00  Cancellation Plan
9   36         EPX  Travel Agency      No       0.00  Online        42  43.00  Cancellation Plan

Observation

- Data looks good at first glance


In [10]:

df.tail(10)

Out[10]:

      Age Agency_Code           Type Claimed  Commision Channel  Duration   Sales       Product Name
2990   51         EPX  Travel Agency      No       0.00  Online         2   20.00    Customised Plan
2991   29         C2B       Airlines     Yes      48.30  Online       381  193.20        Silver Plan
2992   28         CWT  Travel Agency      No      11.88  Online       389   19.80    Customised Plan
2993   36         EPX  Travel Agency      No       0.00  Online       234   10.00  Cancellation Plan
2994   27         C2B       Airlines     Yes      71.85  Online       416  287.40          Gold Plan
2995   28         CWT  Travel Agency     Yes     166.53  Online       364  256.20          Gold Plan
2996   35         C2B       Airlines      No      13.50  Online         5   54.00          Gold Plan
2997   36         EPX  Travel Agency      No       0.00  Online        54   28.00    Customised Plan
2998   34         C2B       Airlines     Yes       7.64  Online        39   30.55        Bronze Plan
2999   47         JZI       Airlines      No      11.55  Online        15   33.00        Bronze Plan

Observation

- Data looks good at first glance

In [11]:

### data dimensions

df.shape

Out[11]:

(3000, 10)

Getting unique counts of all Nominal Variables


In [12]:

for column in df[['Agency_Code', 'Type', 'Claimed', 'Channel',
                  'Product Name', 'Destination']]:
    print(column.upper(), ': ', df[column].nunique())
    print(df[column].value_counts().sort_values())
    print('\n')


AGENCY_CODE : 4

JZI 239

CWT 472

C2B 924

EPX 1365

Name: Agency_Code, dtype: int64

TYPE : 2

Airlines 1163

Travel Agency 1837

Name: Type, dtype: int64

CLAIMED : 2

Yes 924

No 2076

Name: Claimed, dtype: int64

CHANNEL : 2

Offline 46

Online 2954

Name: Channel, dtype: int64

PRODUCT NAME : 5

Gold Plan 109

Silver Plan 427

Bronze Plan 650

Cancellation Plan 678

Customised Plan 1136

Name: Product Name, dtype: int64

DESTINATION : 3

EUROPE 215

Americas 320

ASIA 2465

Name: Destination, dtype: int64


Check for duplicate data

In [13]:

# Are there any duplicates ?

dups = df.duplicated()

print('Number of duplicate rows = %d' % (dups.sum()))

df[dups]

Number of duplicate rows = 139

Out[13]:

      Age Agency_Code           Type Claimed  Commision Channel  Duration  Sales       Product Name
63     30         C2B       Airlines     Yes       15.0  Online        27   60.0        Bronze Plan
329    36         EPX  Travel Agency      No        0.0  Online         5   20.0    Customised Plan
407    36         EPX  Travel Agency      No        0.0  Online        11   19.0  Cancellation Plan
411    35         EPX  Travel Agency      No        0.0  Online         2   20.0    Customised Plan
422    36         EPX  Travel Agency      No        0.0  Online         5   20.0    Customised Plan
...   ...         ...            ...     ...        ...     ...       ...    ...                ...
2940   36         EPX  Travel Agency      No        0.0  Online         8   10.0  Cancellation Plan
2947   36         EPX  Travel Agency      No        0.0  Online        10   28.0    Customised Plan
2952   36         EPX  Travel Agency      No        0.0  Online         2   10.0  Cancellation Plan
2962   36         EPX  Travel Agency      No        0.0  Online         4   20.0    Customised Plan
2984   36         EPX  Travel Agency      No        0.0  Online         1   20.0    Customised Plan

139 rows × 10 columns

Removing Duplicates - not removing them: there is no unique identifier, so they could be different customers.


Though there are 139 duplicate records, they may belong to different customers; with no customer ID or any other
unique identifier in the data, I am not dropping them.
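Since the data has no customer ID, the duplicates are kept. For contrast, a sketch of what de-duplication would look like if a unique identifier existed (the `CustomerID` column here is hypothetical, not part of this dataset):

```python
import pandas as pd

# Hypothetical frame with an ID column the real dataset lacks
df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "Sales": [20.0, 20.0, 20.0, 60.0],
})

# Full-row duplicates: what df.duplicated() counts in the notebook above
n_dup_rows = df.duplicated().sum()

# With a real identifier, repeats of the same customer could be dropped safely
deduped = df.drop_duplicates(subset=["CustomerID"])
```

Without such a key, identical rows are indistinguishable from distinct customers who happen to share attribute values, which is why they are retained here.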

Univariate Analysis

Age variable

In [14]:

print('Range of values: ', df['Age'].max()-df['Age'].min())

Range of values: 76

In [15]:

#Central values

print('Minimum Age: ', df['Age'].min())

print('Maximum Age: ',df['Age'].max())


print('Mean value: ', df['Age'].mean())

print('Median value: ',df['Age'].median())

print('Standard deviation: ', df['Age'].std())

print('Null values: ',df['Age'].isnull().any())

Minimum Age: 8

Maximum Age: 84

Mean value: 38.091

Median value: 36.0

Standard deviation: 10.463518245377944

Null values: False


In [16]:

#Quartiles

Q1=df['Age'].quantile(q=0.25)

Q3=df['Age'].quantile(q=0.75)

print('Age - 1st Quartile (Q1) is: ', Q1)

print('Age - 3rd Quartile (Q3) is: ', Q3)

print('Interquartile range (IQR) of Age is ', stats.iqr(df['Age']))

Age - 1st Quartile (Q1) is: 32.0

Age - 3rd Quartile (Q3) is: 42.0

Interquartile range (IQR) of Age is 10.0

In [17]:

#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1

#lower 1.5*IQR whisker i.e Q1-1.5*IQR

#upper 1.5*IQR whisker i.e Q3+1.5*IQR

L_outliers=Q1-1.5*(Q3-Q1)

U_outliers=Q3+1.5*(Q3-Q1)

print('Lower outliers in Age: ', L_outliers)

print('Upper outliers in Age: ', U_outliers)

Lower outliers in Age: 17.0

Upper outliers in Age: 57.0


In [18]:

print('Number of outliers in Age upper : ', df[df['Age']>57.0]['Age'].count())

print('Number of outliers in Age lower : ', df[df['Age']<17.0]['Age'].count())

print('% of Outlier in Age upper: ', round(df[df['Age']>57.0]['Age'].count()*100/len(df)), '%')

print('% of Outlier in Age lower: ', round(df[df['Age']<17.0]['Age'].count()*100/len(df)), '%')

Number of outliers in Age upper : 198

Number of outliers in Age lower : 6

% of Outlier in Age upper: 7.0 %


% of Outlier in Age lower: 0.0 %
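The same IQR arithmetic is repeated below for Commision, Duration and Sales; it could be factored into a single helper. A sketch, not part of the original notebook:

```python
import pandas as pd

def iqr_outlier_counts(s: pd.Series, k: float = 1.5):
    """Return (lower_fence, upper_fence, n_below, n_above) for a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return lo, hi, int((s < lo).sum()), int((s > hi).sum())

ages = pd.Series([30, 32, 36, 42, 40, 38, 500])  # toy series with one extreme value
lo, hi, n_lo, n_hi = iqr_outlier_counts(ages)
```

Applying one function per variable keeps the fences and counts consistent and avoids copy-paste slips in the hard-coded thresholds.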

In [19]:

plt.title('Age')

sns.boxplot(df['Age'], orient='h', color='purple')

Out[19]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2839607e10>


In [20]:

fig, (ax2,ax3)=plt.subplots(1,2,figsize=(13,5))

#distplot

sns.distplot(df['Age'],ax=ax2)

ax2.set_xlabel('Age', fontsize=15)

ax2.tick_params(labelsize=15)

#histogram

ax3.hist(df['Age'])

ax3.set_xlabel('Age', fontsize=15)

ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)

plt.tight_layout()

Commision variable

In [21]:

print('Range of values: ', df['Commision'].max()-df['Commision'].min())

Range of values: 210.21


In [22]:

#Central values

print('Minimum Commision: ', df['Commision'].min())

print('Maximum Commision: ',df['Commision'].max())

print('Mean value: ', df['Commision'].mean())

print('Median value: ',df['Commision'].median())

print('Standard deviation: ', df['Commision'].std())

print('Null values: ',df['Commision'].isnull().any())

Minimum Commision: 0.0

Maximum Commision: 210.21

Mean value: 14.529203333333266

Median value: 4.63

Standard deviation: 25.48145450662553

Null values: False

In [23]:

#Quartiles

Q1=df['Commision'].quantile(q=0.25)

Q3=df['Commision'].quantile(q=0.75)

print('Commision - 1st Quartile (Q1) is: ', Q1)

print('Commision - 3rd Quartile (Q3) is: ', Q3)

print('Interquartile range (IQR) of Commision is ', stats.iqr(df['Commision']))

Commision - 1st Quartile (Q1) is: 0.0

Commision - 3rd Quartile (Q3) is: 17.235

Interquartile range (IQR) of Commision is 17.235


In [24]:

#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1

#lower 1.5*IQR whisker i.e Q1-1.5*IQR

#upper 1.5*IQR whisker i.e Q3+1.5*IQR

L_outliers=Q1-1.5*(Q3-Q1)

U_outliers=Q3+1.5*(Q3-Q1)

print('Lower outliers in Commision: ', L_outliers)

print('Upper outliers in Commision: ', U_outliers)

Lower outliers in Commision: -25.8525

Upper outliers in Commision: 43.0875

In [25]:

print('Number of outliers in Commision upper : ', df[df['Commision']>43.0875]['Commision'].count())

print('Number of outliers in Commision lower : ', df[df['Commision']<-25.8525]['Commision'].count())

print('% of Outlier in Commision upper: ', round(df[df['Commision']>43.0875]['Commision'].count()*100/len(df)), '%')

print('% of Outlier in Commision lower: ', round(df[df['Commision']<-25.8525]['Commision'].count()*100/len(df)), '%')

Number of outliers in Commision upper : 362

Number of outliers in Commision lower : 0

% of Outlier in Commision upper: 12.0 %

% of Outlier in Commision lower: 0.0 %


In [26]:

plt.title('Commision')

sns.boxplot(df['Commision'], orient='h', color='purple')

Out[26]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836b40150>


In [27]:

fig, (ax2,ax3)=plt.subplots(1,2,figsize=(13,5))

#distplot

sns.distplot(df['Commision'],ax=ax2)

ax2.set_xlabel('Commision', fontsize=15)

ax2.tick_params(labelsize=15)

#histogram

ax3.hist(df['Commision'])

ax3.set_xlabel('Commision', fontsize=15)

ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)

plt.tight_layout()

Duration variable

In [28]:

print('Range of values: ', df['Duration'].max()-df['Duration'].min())

Range of values: 4581


In [29]:

#Central values

print('Minimum Duration: ', df['Duration'].min())

print('Maximum Duration: ',df['Duration'].max())

print('Mean value: ', df['Duration'].mean())

print('Median value: ',df['Duration'].median())

print('Standard deviation: ', df['Duration'].std())

print('Null values: ',df['Duration'].isnull().any())

Minimum Duration: -1

Maximum Duration: 4580

Mean value: 70.00133333333333

Median value: 26.5

Standard deviation: 134.05331313253495

Null values: False

In [30]:

#Quartiles

Q1=df['Duration'].quantile(q=0.25)

Q3=df['Duration'].quantile(q=0.75)

print('Duration - 1st Quartile (Q1) is: ', Q1)

print('Duration - 3rd Quartile (Q3) is: ', Q3)

print('Interquartile range (IQR) of Duration is ', stats.iqr(df['Duration']))

Duration - 1st Quartile (Q1) is: 11.0

Duration - 3rd Quartile (Q3) is: 63.0

Interquartile range (IQR) of Duration is 52.0


In [31]:

#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1

#lower 1.5*IQR whisker i.e Q1-1.5*IQR

#upper 1.5*IQR whisker i.e Q3+1.5*IQR

L_outliers=Q1-1.5*(Q3-Q1)

U_outliers=Q3+1.5*(Q3-Q1)

print('Lower outliers in Duration: ', L_outliers)

print('Upper outliers in Duration: ', U_outliers)

Lower outliers in Duration: -67.0

Upper outliers in Duration: 141.0

In [32]:

print('Number of outliers in Duration upper : ', df[df['Duration']>141.0]['Duration'].count())

print('Number of outliers in Duration lower : ', df[df['Duration']<-67.0]['Duration'].count())

print('% of Outlier in Duration upper: ', round(df[df['Duration']>141.0]['Duration'].count()*100/len(df)), '%')

print('% of Outlier in Duration lower: ', round(df[df['Duration']<-67.0]['Duration'].count()*100/len(df)), '%')

Number of outliers in Duration upper : 382

Number of outliers in Duration lower : 0

% of Outlier in Duration upper: 13.0 %

% of Outlier in Duration lower: 0.0 %


In [33]:

plt.title('Duration')

sns.boxplot(df['Duration'], orient='h', color='purple')

Out[33]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836a4a890>


In [34]:

fig, (ax2,ax3)=plt.subplots(1,2,figsize=(13,5))

#distplot

sns.distplot(df['Duration'],ax=ax2)

ax2.set_xlabel('Duration', fontsize=15)

ax2.tick_params(labelsize=15)

#histogram

ax3.hist(df['Duration'])

ax3.set_xlabel('Duration', fontsize=15)

ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)

plt.tight_layout()

Sales variable

In [35]:

print('Range of values: ', df['Sales'].max()-df['Sales'].min())

Range of values: 539.0


In [36]:

#Central values

print('Minimum Sales: ', df['Sales'].min())

print('Maximum Sales: ',df['Sales'].max())

print('Mean value: ', df['Sales'].mean())

print('Median value: ',df['Sales'].median())

print('Standard deviation: ', df['Sales'].std())

print('Null values: ',df['Sales'].isnull().any())

Minimum Sales: 0.0

Maximum Sales: 539.0

Mean value: 60.24991333333344

Median value: 33.0

Standard deviation: 70.73395353143047

Null values: False

In [37]:

#Quartiles

Q1=df['Sales'].quantile(q=0.25)

Q3=df['Sales'].quantile(q=0.75)

print('Sales - 1st Quartile (Q1) is: ', Q1)

print('Sales - 3rd Quartile (Q3) is: ', Q3)

print('Interquartile range (IQR) of Sales is ', stats.iqr(df['Sales']))

Sales - 1st Quartile (Q1) is: 20.0

Sales - 3rd Quartile (Q3) is: 69.0

Interquartile range (IQR) of Sales is 49.0


In [38]:

#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1

#lower 1.5*IQR whisker i.e Q1-1.5*IQR

#upper 1.5*IQR whisker i.e Q3+1.5*IQR

L_outliers=Q1-1.5*(Q3-Q1)

U_outliers=Q3+1.5*(Q3-Q1)

print('Lower outliers in Sales: ', L_outliers)

print('Upper outliers in Sales: ', U_outliers)

Lower outliers in Sales: -53.5

Upper outliers in Sales: 142.5

In [39]:

print('Number of outliers in Sales upper : ', df[df['Sales']>142.5]['Sales'].count())

print('Number of outliers in Sales lower : ', df[df['Sales']<-53.5]['Sales'].count())

print('% of Outlier in Sales upper: ', round(df[df['Sales']>142.5]['Sales'].count()*100/len(df)), '%')

print('% of Outlier in Sales lower: ', round(df[df['Sales']<-53.5]['Sales'].count()*100/len(df)), '%')

Number of outliers in Sales upper : 353

Number of outliers in Sales lower : 0

% of Outlier in Sales upper: 12.0 %

% of Outlier in Sales lower: 0.0 %


In [40]:

plt.title('Sales')

sns.boxplot(df['Sales'], orient='h', color='purple')

Out[40]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f28368b67d0>


In [41]:

fig, (ax2,ax3)=plt.subplots(1,2,figsize=(13,5))

#distplot

sns.distplot(df['Sales'],ax=ax2)

ax2.set_xlabel('Sales', fontsize=15)

ax2.tick_params(labelsize=15)

#histogram

ax3.hist(df['Sales'])

ax3.set_xlabel('Sales', fontsize=15)

ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)

plt.tight_layout()

There are outliers in all the numeric variables, but Sales and Commision can be genuine business values. Random
Forest and CART can handle outliers, so the outliers are not treated for now; we keep the data as it is.

I will treat the outliers for the ANN model and compare results after all the steps, just for comparison.
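For that ANN pass, one common treatment is capping (winsorising) at the IQR fences rather than dropping rows. A hedged sketch of such a helper; the notebook does not prescribe this exact function:

```python
import pandas as pd

def cap_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Winsorise a Series at the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

sales = pd.Series([20.0, 33.0, 69.0, 539.0])  # toy values echoing the Sales quartiles
capped = cap_iqr(sales)
```

Capping preserves the row count (important when the target is attached to each row) while pulling extreme values back to the fence, which tends to stabilise gradient-based models like the MLP.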

Categorical Variables

Agency_Code

Count Plot


In [42]:

sns.countplot(data = df, x = 'Agency_Code')

Out[42]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f28366bebd0>

Boxplot


In [43]:

sns.boxplot(data = df, x='Agency_Code',y='Sales', hue='Claimed')

Out[43]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f283660d610>

Swarm plot


In [44]:

sns.swarmplot(data = df, x='Agency_Code',y='Sales')

Out[44]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836527e50>

Combined Violin plot and Swarm plot


In [45]:

sns.violinplot(data = df, x='Agency_Code',y='Sales')

sns.swarmplot(data = df, x='Agency_Code',y='Sales', color = 'k', alpha = 0.6)

Out[45]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f28368bc3d0>

Type


In [46]:

sns.countplot(data = df, x = 'Type')

Out[46]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f28368dd950>

In [47]:

sns.boxplot(data = df, x='Type',y='Sales', hue='Claimed')

Out[47]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836b6ca50>


In [48]:

sns.swarmplot(data = df, x='Type',y='Sales')

Out[48]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836a98d10>

In [49]:

sns.violinplot(data = df, x='Type',y='Sales')

sns.swarmplot(data = df, x='Type',y='Sales', color = 'k', alpha = 0.6)

Out[49]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836e20910>


Channel

In [50]:

sns.countplot(data = df, x = 'Channel')

Out[50]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836be08d0>


In [51]:

sns.boxplot(data = df, x='Channel',y='Sales', hue='Claimed')

Out[51]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836d1e890>

In [52]:

sns.swarmplot(data = df, x='Channel',y='Sales')

Out[52]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836c36e50>


In [53]:

sns.violinplot(data = df, x='Channel',y='Sales')

sns.swarmplot(data = df, x='Channel',y='Sales', color = 'k', alpha = 0.6)

Out[53]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2834cd2d50>

Product Name


In [54]:

sns.countplot(data = df, x = 'Product Name')

Out[54]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2834cba390>

In [55]:

sns.boxplot(data = df, x='Product Name',y='Sales', hue='Claimed')

Out[55]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2834c36610>


In [56]:

sns.swarmplot(data = df, x='Product Name',y='Sales')

Out[56]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2834aee750>

In [57]:

sns.violinplot(data = df, x='Product Name',y='Sales')

sns.swarmplot(data = df, x='Product Name',y='Sales', color = 'k', alpha = 0.6)

Out[57]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2834a52490>


Destination

In [58]:

sns.countplot(data = df, x = 'Destination')

Out[58]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f283498c2d0>


In [59]:

sns.boxplot(data = df, x='Destination',y='Sales', hue='Claimed')

Out[59]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f28349567d0>

In [60]:

sns.swarmplot(data = df, x='Destination',y='Sales')

Out[60]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f28348ad290>


In [61]:

sns.violinplot(data = df, x='Destination',y='Sales')

sns.swarmplot(data = df, x='Destination',y='Sales', color = 'k', alpha = 0.6)

Out[61]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836a907d0>

Checking pairwise distribution of the continuous variables


In [62]:

sns.pairplot(df[['Age', 'Commision',

'Duration', 'Sales']])


Out[62]:

<seaborn.axisgrid.PairGrid at 0x7f2834cbcc10>


Checking for Correlations

In [63]:

# construct heatmap with only continuous variables

plt.figure(figsize=(10,8))

sns.set(font_scale=1.2)

sns.heatmap(df[['Age', 'Commision',

'Duration', 'Sales']].corr(), annot=True)

Out[63]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2834099b50>

Converting all objects to categorical codes


In [64]:

for feature in df.columns:
    if df[feature].dtype == 'object':
        print('\n')
        print('feature:', feature)
        print(pd.Categorical(df[feature].unique()))
        print(pd.Categorical(df[feature].unique()).codes)
        df[feature] = pd.Categorical(df[feature]).codes


feature: Agency_Code

['C2B', 'EPX', 'CWT', 'JZI']

Categories (4, object): ['C2B', 'CWT', 'EPX', 'JZI']

[0 2 1 3]

feature: Type

['Airlines', 'Travel Agency']

Categories (2, object): ['Airlines', 'Travel Agency']

[0 1]

feature: Claimed

['No', 'Yes']

Categories (2, object): ['No', 'Yes']

[0 1]

feature: Channel

['Online', 'Offline']

Categories (2, object): ['Offline', 'Online']

[1 0]

feature: Product Name

['Customised Plan', 'Cancellation Plan', 'Bronze Plan', 'Silver Plan', 'Gold Plan']

Categories (5, object): ['Bronze Plan', 'Cancellation Plan', 'Customised Plan', 'Gold Plan', 'Silver Plan']

[2 1 0 4 3]

feature: Destination

['ASIA', 'Americas', 'EUROPE']

Categories (3, object): ['ASIA', 'Americas', 'EUROPE']

[0 1 2]
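One caveat worth noting: `pd.Categorical(...).codes` imposes an arbitrary integer ordering on nominal features. Tree-based models cope with that, but for the ANN a one-hot encoding is often preferred. A sketch of the alternative (not what this notebook does):

```python
import pandas as pd

# One of the notebook's nominal columns, with its actual category values
df = pd.DataFrame({"Destination": ["ASIA", "Americas", "EUROPE", "ASIA"]})

# One binary indicator column per category, with no implied order between them
onehot = pd.get_dummies(df, columns=["Destination"], prefix="Dest")
```

The trade-off is dimensionality: with at most 5 unique values per categorical column here, one-hot encoding stays cheap, but integer codes keep the feature count identical to the original.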


In [65]:

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3000 entries, 0 to 2999

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Age 3000 non-null int64

1 Agency_Code 3000 non-null int8

2 Type 3000 non-null int8

3 Claimed 3000 non-null int8

4 Commision 3000 non-null float64

5 Channel 3000 non-null int8

6 Duration 3000 non-null int64

7 Sales 3000 non-null float64

8 Product Name 3000 non-null int8

9 Destination 3000 non-null int8

dtypes: float64(2), int64(2), int8(6)

memory usage: 111.5 KB

In [66]:

df.head()

Out[66]:

   Age  Agency_Code  Type  Claimed  Commision  Channel  Duration  Sales  Product Name  Destination
0   48            0     0        0       0.70        1         7   2.51             2            0
1   36            2     1        0       0.00        1        34  20.00             2            0
2   39            1     1        0       5.94        1         3   9.90             2            1
3   36            2     1        0       0.00        1         4  26.00             1            0
4   33            3     0        0       6.30        1        53  18.00             0            0

Proportion of 1s and 0s


In [67]:

df.Claimed.value_counts(normalize=True)

Out[67]:

0 0.692

1 0.308

Name: Claimed, dtype: float64
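Given the 69/31 class split shown above, passing `stratify=y` to `train_test_split` keeps the claim ratio essentially identical in train and test. The notebook's split below does not stratify, so this is an alternative sketch with toy data mimicking the class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 69 + [1] * 31)  # mimic the 0.692 / 0.308 claim proportion

# stratify=y preserves the class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=5, stratify=y
)
```

With a 30% holdout and a moderately imbalanced target, stratification mainly guards against an unlucky split that starves one partition of positive examples.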


2.2 Data Split: Split the data into train and test sets; build
classification models CART, Random Forest and Artificial Neural
Network

Extracting the target column into separate vectors for training set
and test set

In [68]:

X = df.drop("Claimed", axis=1)

y = df.pop("Claimed")

X.head()

Out[68]:

   Age  Agency_Code  Type  Commision  Channel  Duration  Sales  Product Name  Destination
0   48            0     0       0.70        1         7   2.51             2            0
1   36            2     1       0.00        1        34  20.00             2            0
2   39            1     1       5.94        1         3   9.90             2            1
3   36            2     1       0.00        1         4  26.00             1            0
4   33            3     0       6.30        1        53  18.00             0            0


In [69]:

# prior to scaling

plt.plot(X)

plt.show()

In [70]:

# Scaling the attributes.

from scipy.stats import zscore

X_scaled=X.apply(zscore)

X_scaled.head()

Out[70]:

        Age  Agency_Code      Type  Commision   Channel  Duration     Sales  Product Name
0  0.947162    -1.314358 -1.256796  -0.542807  0.124788 -0.470051 -0.816433       0.26883
1 -0.199870     0.697928  0.795674  -0.570282  0.124788 -0.268605 -0.569127       0.26883
2  0.086888    -0.308215  0.795674  -0.337133  0.124788 -0.499894 -0.711940       0.26883
3 -0.199870     0.697928  0.795674  -0.570282  0.124788 -0.492433 -0.484288      -0.52575
4 -0.486629     1.704071 -1.256796  -0.323003  0.124788 -0.126846 -0.597407      -1.32033


In [71]:

# after scaling

plt.plot(X_scaled)

plt.show()

Splitting data into training and test set

In [72]:

from sklearn.model_selection import train_test_split

X_train, X_test, train_labels, test_labels = train_test_split(X_scaled, y, test_size=0.30, random_state=5)

Checking the dimensions of the training and test data


In [73]:

print('X_train',X_train.shape)

print('X_test',X_test.shape)

print('train_labels',train_labels.shape)

print('test_labels',test_labels.shape)

X_train (2100, 9)

X_test (900, 9)

train_labels (2100,)

test_labels (900,)
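One refinement worth considering (not used in this notebook): with a roughly 69/31 class split, passing `stratify=y` to `train_test_split` keeps the Claimed proportion nearly identical in both partitions. A sketch on toy data mirroring that ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 toy rows with the ~69/31 ratio seen in df.Claimed.value_counts().
X_demo = pd.DataFrame({"f1": range(100)})
y_demo = pd.Series([0] * 69 + [1] * 31)

# stratify preserves the class ratio across the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=5, stratify=y_demo
)

# Both partitions carry close to 31% positives.
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))
```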

Building a Decision Tree Classifier

In [74]:

param_grid_dtcl = {
    'criterion': ['gini'],
    'max_depth': [10, 20, 30, 50],
    'min_samples_leaf': [50, 100, 150],
    'min_samples_split': [150, 300, 450]
}

dtcl = DecisionTreeClassifier(random_state=1)

grid_search_dtcl = GridSearchCV(estimator=dtcl, param_grid=param_grid_dtcl, cv=10)


In [75]:

grid_search_dtcl.fit(X_train, train_labels)

print(grid_search_dtcl.best_params_)

best_grid_dtcl = grid_search_dtcl.best_estimator_

best_grid_dtcl

#{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}

{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}

Out[75]:

DecisionTreeClassifier(max_depth=10, min_samples_leaf=50, min_samples_split=450,
                       random_state=1)

In [76]:

param_grid_dtcl = {
    'criterion': ['gini'],
    'max_depth': [3, 5, 7, 10, 12],
    'min_samples_leaf': [20, 30, 40, 50, 60],
    'min_samples_split': [150, 300, 450]
}

dtcl = DecisionTreeClassifier(random_state=1)

grid_search_dtcl = GridSearchCV(estimator=dtcl, param_grid=param_grid_dtcl, cv=10)


In [77]:

grid_search_dtcl.fit(X_train, train_labels)

print(grid_search_dtcl.best_params_)

best_grid_dtcl = grid_search_dtcl.best_estimator_

best_grid_dtcl

#{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}

{'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 20, 'min_samples_split': 150}

Out[77]:

DecisionTreeClassifier(max_depth=5, min_samples_leaf=20, min_samples_split=150,
                       random_state=1)

In [78]:

param_grid_dtcl = {
    'criterion': ['gini'],
    # Note: recent scikit-learn versions require an integer max_depth;
    # the float values below only ran on the older version used here.
    'max_depth': [3.5, 4.0, 4.5, 5.0, 5.5],
    'min_samples_leaf': [40, 42, 44, 46, 48, 50, 52, 54],
    'min_samples_split': [250, 270, 280, 290, 300, 310]
}

dtcl = DecisionTreeClassifier(random_state=1)

grid_search_dtcl = GridSearchCV(estimator=dtcl, param_grid=param_grid_dtcl, cv=10)


In [79]:

grid_search_dtcl.fit(X_train, train_labels)

print(grid_search_dtcl.best_params_)

best_grid_dtcl = grid_search_dtcl.best_estimator_

best_grid_dtcl

#{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}

{'criterion': 'gini', 'max_depth': 3.5, 'min_samples_leaf': 44, 'min_samples_split': 250}

Out[79]:

DecisionTreeClassifier(max_depth=3.5, min_samples_leaf=44,
                       min_samples_split=250, random_state=1)

In [80]:

param_grid_dtcl = {
    'criterion': ['gini'],
    'max_depth': [4.85, 4.90, 4.95, 5.0, 5.05, 5.10, 5.15],
    'min_samples_leaf': [40, 41, 42, 43, 44],
    'min_samples_split': [150, 175, 200, 210, 220, 230, 240, 250, 260, 270]
}

dtcl = DecisionTreeClassifier(random_state=1)

grid_search_dtcl = GridSearchCV(estimator=dtcl, param_grid=param_grid_dtcl, cv=10)


In [81]:

grid_search_dtcl.fit(X_train, train_labels)

print(grid_search_dtcl.best_params_)

best_grid_dtcl = grid_search_dtcl.best_estimator_

best_grid_dtcl

#{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}

{'criterion': 'gini', 'max_depth': 4.85, 'min_samples_leaf': 44, 'min_samples_split': 260}

Out[81]:

DecisionTreeClassifier(max_depth=4.85, min_samples_leaf=44,
                       min_samples_split=260, random_state=1)

Generating Tree

In [82]:

train_char_label = ['no', 'yes']

tree_regularized = open('tree_regularized.dot','w')

dot_data = tree.export_graphviz(best_grid_dtcl, out_file= tree_regularized ,

feature_names = list(X_train),

class_names = list(train_char_label))

tree_regularized.close()

dot_data

The generated tree_regularized.dot file can be pasted into http://webgraphviz.com/ to render the tree.
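If Graphviz is not available, `sklearn.tree.plot_tree` renders the same tree directly with matplotlib. A sketch on a small stand-in model (in the notebook you would pass `best_grid_dtcl` and `feature_names=list(X_train)`; the output file name is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Small stand-in for best_grid_dtcl trained on X_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=1)
model = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo)

fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(model, class_names=["no", "yes"], filled=True, ax=ax)
fig.savefig("tree_regularized.png")
```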

Variable Importance - DTCL


In [83]:

print (pd.DataFrame(best_grid_dtcl.feature_importances_, columns = ["Imp"],

index = X_train.columns).sort_values('Imp',ascending=False))

Imp

Agency_Code 0.634112

Sales 0.220899

Product Name 0.086632

Commision 0.021881

Age 0.019940

Duration 0.016536

Type 0.000000

Channel 0.000000

Destination 0.000000
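The importance table above is easier to read as a chart; a sketch using the values printed above (the output file name is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import pandas as pd

# Importances copied from the table printed above; they sum to 1.
imp = pd.Series({
    "Agency_Code": 0.634112, "Sales": 0.220899, "Product Name": 0.086632,
    "Commision": 0.021881, "Age": 0.019940, "Duration": 0.016536,
    "Type": 0.0, "Channel": 0.0, "Destination": 0.0,
}).sort_values()

ax = imp.plot(kind="barh", title="CART feature importances")
ax.figure.tight_layout()
ax.figure.savefig("dtcl_importances.png")
```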

Predicting on Training and Test dataset

In [84]:

ytrain_predict_dtcl = best_grid_dtcl.predict(X_train)

ytest_predict_dtcl = best_grid_dtcl.predict(X_test)

Getting the Predicted Classes and Probs


In [85]:

ytest_predict_dtcl

ytest_predict_prob_dtcl=best_grid_dtcl.predict_proba(X_test)

ytest_predict_prob_dtcl

pd.DataFrame(ytest_predict_prob_dtcl).head()

Out[85]:

0 1
0 0.697947 0.302053
1 0.979452 0.020548
2 0.921171 0.078829
3 0.510417 0.489583
4 0.921171 0.078829

Building a Random Forest Classifier

A first, coarser round of the Random Forest grid search was run as:

param_grid_rfcl = {
    'max_depth': [5, 10, 15],           # 20,30,40
    'max_features': [4, 5, 6, 7],       # 7,8,9
    'min_samples_leaf': [10, 50, 70],   # 50,100
    'min_samples_split': [30, 50, 70],  # 60,70
    'n_estimators': [200, 250, 300]     # 100,200
}

rfcl = RandomForestClassifier(random_state=1)

grid_search_rfcl = GridSearchCV(estimator=rfcl, param_grid=param_grid_rfcl, cv=5)

grid_search_rfcl.fit(X_train, train_labels)
print(grid_search_rfcl.best_params_)
best_grid_rfcl = grid_search_rfcl.best_estimator_
best_grid_rfcl



In [86]:

param_grid_rfcl = {
    'max_depth': [4, 5, 6],             # 20,30,40
    'max_features': [2, 3, 4, 5],       # 7,8,9
    'min_samples_leaf': [8, 9, 11, 15], # 50,100
    'min_samples_split': [46, 50, 55],  # 60,70
    'n_estimators': [290, 350, 400]     # 100,200
}

rfcl = RandomForestClassifier(random_state=1)

grid_search_rfcl = GridSearchCV(estimator=rfcl, param_grid=param_grid_rfcl, cv=5)

In [87]:

grid_search_rfcl.fit(X_train, train_labels)

print(grid_search_rfcl.best_params_)

best_grid_rfcl = grid_search_rfcl.best_estimator_

best_grid_rfcl

#{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}

{'max_depth': 6, 'max_features': 3, 'min_samples_leaf': 8, 'min_samples_split': 46, 'n_estimators': 350}

Out[87]:

RandomForestClassifier(max_depth=6, max_features=3, min_samples_leaf=8,
                       min_samples_split=46, n_estimators=350, random_state=1)

Predicting the Training and Testing data

In [88]:

ytrain_predict_rfcl = best_grid_rfcl.predict(X_train)

ytest_predict_rfcl = best_grid_rfcl.predict(X_test)


Getting the Predicted Classes and Probs

In [89]:

ytest_predict_rfcl

ytest_predict_prob_rfcl=best_grid_rfcl.predict_proba(X_test)

ytest_predict_prob_rfcl

pd.DataFrame(ytest_predict_prob_rfcl).head()

Out[89]:

0 1
0 0.778010 0.221990
1 0.971910 0.028090
2 0.904401 0.095599
3 0.651398 0.348602
4 0.868406 0.131594

Variable Importance via RF

In [90]:

# Variable Importance

print (pd.DataFrame(best_grid_rfcl.feature_importances_,

columns = ["Imp"],

index = X_train.columns).sort_values('Imp',ascending=False))

Imp

Agency_Code 0.276015

Product Name 0.235583

Sales 0.152733

Commision 0.135997

Duration 0.077475

Type 0.071019

Age 0.039503

Destination 0.008971

Channel 0.002705


Building a Neural Network Classifier

In [91]:

param_grid_nncl = {
    'hidden_layer_sizes': [50, 100, 200],  # 50, 200
    'max_iter': [2500, 3000, 4000],        # 5000,2500
    'solver': ['adam'],                    # sgd
    'tol': [0.01]
}

nncl = MLPClassifier(random_state=1)

grid_search_nncl = GridSearchCV(estimator=nncl, param_grid=param_grid_nncl, cv=10)

In [92]:

grid_search_nncl.fit(X_train, train_labels)

grid_search_nncl.best_params_

best_grid_nncl = grid_search_nncl.best_estimator_

best_grid_nncl

Out[92]:

MLPClassifier(hidden_layer_sizes=200, max_iter=2500, random_state=1, tol=0.01)
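Note that an int for `hidden_layer_sizes` (as in the grid above) means a single hidden layer of that width; a tuple stacks several layers, which could be a worthwhile extension of the grid. A quick illustration (tiny synthetic data, so convergence warnings are suppressed):

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

warnings.simplefilter("ignore", ConvergenceWarning)  # tiny demo, few iterations

X_demo, y_demo = make_classification(n_samples=60, random_state=1)

# n_layers_ counts input + hidden + output layers.
single = MLPClassifier(hidden_layer_sizes=200, max_iter=50, random_state=1).fit(X_demo, y_demo)
double = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=50, random_state=1).fit(X_demo, y_demo)

print(single.n_layers_, double.n_layers_)  # 3 4
```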

Predicting the Training and Testing data

In [93]:

ytrain_predict_nncl = best_grid_nncl.predict(X_train)

ytest_predict_nncl = best_grid_nncl.predict(X_test)

Getting the Predicted Classes and Probs


In [94]:

ytest_predict_nncl

ytest_predict_prob_nncl=best_grid_nncl.predict_proba(X_test)

ytest_predict_prob_nncl

pd.DataFrame(ytest_predict_prob_nncl).head()

Out[94]:

0 1
0 0.822676 0.177324
1 0.933407 0.066593
2 0.918772 0.081228
3 0.688933 0.311067
4 0.913425 0.086575

--------------------------------------------------------------------------------

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model

CART - AUC and ROC for the training data


In [95]:

# predict probabilities

probs_cart = best_grid_dtcl.predict_proba(X_train)

# keep probabilities for the positive outcome only

probs_cart = probs_cart[:, 1]

# calculate AUC

cart_train_auc = roc_auc_score(train_labels, probs_cart)

print('AUC: %.3f' % cart_train_auc)

# calculate roc curve

cart_train_fpr, cart_train_tpr, cart_train_thresholds = roc_curve(train_labels, probs_cart)

plt.plot([0, 1], [0, 1], linestyle='--')

# plot the roc curve for the model

plt.plot(cart_train_fpr, cart_train_tpr)

AUC: 0.823

Out[95]:

[<matplotlib.lines.Line2D at 0x7f2836b1d350>]

CART - AUC and ROC for the test data


In [96]:

# predict probabilities

probs_cart = best_grid_dtcl.predict_proba(X_test)

# keep probabilities for the positive outcome only

probs_cart = probs_cart[:, 1]

# calculate AUC

cart_test_auc = roc_auc_score(test_labels, probs_cart)

print('AUC: %.3f' % cart_test_auc)

# calculate roc curve

cart_test_fpr, cart_test_tpr, cart_test_thresholds = roc_curve(test_labels, probs_cart)

plt.plot([0, 1], [0, 1], linestyle='--')

# plot the roc curve for the model

plt.plot(cart_test_fpr, cart_test_tpr)

AUC: 0.801

Out[96]:

[<matplotlib.lines.Line2D at 0x7f2834617a10>]

CART Confusion Matrix and Classification Report for the training data


In [97]:

confusion_matrix(train_labels, ytrain_predict_dtcl)

Out[97]:

array([[1309, 144],

[ 307, 340]])

In [98]:

#Train Data Accuracy

cart_train_acc=best_grid_dtcl.score(X_train,train_labels)

cart_train_acc

Out[98]:

0.7852380952380953

In [99]:

print(classification_report(train_labels, ytrain_predict_dtcl))

precision recall f1-score support

0 0.81 0.90 0.85 1453

1 0.70 0.53 0.60 647

accuracy 0.79 2100

macro avg 0.76 0.71 0.73 2100

weighted avg 0.78 0.79 0.78 2100


In [100]:

cart_metrics=classification_report(train_labels, ytrain_predict_dtcl,output_dict=True)

df=pd.DataFrame(cart_metrics).transpose()

cart_train_f1=round(df.loc["1"][2],2)

cart_train_recall=round(df.loc["1"][1],2)

cart_train_precision=round(df.loc["1"][0],2)

print ('cart_train_precision ',cart_train_precision)

print ('cart_train_recall ',cart_train_recall)

print ('cart_train_f1 ',cart_train_f1)

cart_train_precision 0.7

cart_train_recall 0.53

cart_train_f1 0.6

CART Confusion Matrix and Classification Report for the testing data

In [101]:

confusion_matrix(test_labels, ytest_predict_dtcl)

Out[101]:

array([[553, 70],

[136, 141]])

In [102]:

#Test Data Accuracy

cart_test_acc=best_grid_dtcl.score(X_test,test_labels)

cart_test_acc

Out[102]:

0.7711111111111111


In [103]:

print(classification_report(test_labels, ytest_predict_dtcl))

precision recall f1-score support

0 0.80 0.89 0.84 623

1 0.67 0.51 0.58 277

accuracy 0.77 900

macro avg 0.74 0.70 0.71 900

weighted avg 0.76 0.77 0.76 900

In [104]:

cart_metrics=classification_report(test_labels, ytest_predict_dtcl,output_dict=True)

df=pd.DataFrame(cart_metrics).transpose()

cart_test_precision=round(df.loc["1"][0],2)

cart_test_recall=round(df.loc["1"][1],2)

cart_test_f1=round(df.loc["1"][2],2)

print ('cart_test_precision ',cart_test_precision)

print ('cart_test_recall ',cart_test_recall)

print ('cart_test_f1 ',cart_test_f1)

cart_test_precision 0.67

cart_test_recall 0.51

cart_test_f1 0.58
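Instead of round-tripping through `classification_report(..., output_dict=True)` and a DataFrame, the positive-class metrics can be pulled directly with `precision_recall_fscore_support`. A sketch (the helper name `claim_metrics` is ours, not from the notebook):

```python
from sklearn.metrics import precision_recall_fscore_support

def claim_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive (Claimed = 1) class."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=[1], zero_division=0
    )
    return round(p[0], 2), round(r[0], 2), round(f1[0], 2)

# Tiny check: one of two predicted positives is right (precision 0.5),
# one of two actual positives is found (recall 0.5).
print(claim_metrics([1, 0, 1, 0], [1, 1, 0, 0]))  # (0.5, 0.5, 0.5)
```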


Cart Conclusion

Train Data:

- AUC: 82%

- Accuracy: 79%

- Precision: 70%

- Recall: 53%

- f1-Score: 60%

Test Data:

- AUC: 80%

- Accuracy: 77%

- Precision: 67%

- Recall: 51%

- f1-Score: 58%

Training and Test set results are close, and with the overall measures reasonably high, the model generalizes well.

Agency_Code is the most important variable for predicting claim status.

RF Model Performance Evaluation on Training data

In [105]:

confusion_matrix(train_labels,ytrain_predict_rfcl)

Out[105]:

array([[1297, 156],

[ 255, 392]])


In [106]:

rf_train_acc=best_grid_rfcl.score(X_train,train_labels)

rf_train_acc

Out[106]:

0.8042857142857143

In [107]:

print(classification_report(train_labels,ytrain_predict_rfcl))

precision recall f1-score support

0 0.84 0.89 0.86 1453

1 0.72 0.61 0.66 647

accuracy 0.80 2100

macro avg 0.78 0.75 0.76 2100

weighted avg 0.80 0.80 0.80 2100

In [108]:

rf_metrics=classification_report(train_labels, ytrain_predict_rfcl,output_dict=True)

df=pd.DataFrame(rf_metrics).transpose()

rf_train_precision=round(df.loc["1"][0],2)

rf_train_recall=round(df.loc["1"][1],2)

rf_train_f1=round(df.loc["1"][2],2)

print ('rf_train_precision ',rf_train_precision)

print ('rf_train_recall ',rf_train_recall)

print ('rf_train_f1 ',rf_train_f1)

rf_train_precision 0.72

rf_train_recall 0.61

rf_train_f1 0.66


In [109]:

rf_train_fpr, rf_train_tpr, _ = roc_curve(train_labels, best_grid_rfcl.predict_proba(X_train)[:,1])

plt.plot(rf_train_fpr,rf_train_tpr,color='green')

plt.plot([0, 1], [0, 1], linestyle='--')

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC')

rf_train_auc=roc_auc_score(train_labels,best_grid_rfcl.predict_proba(X_train)[:,1])

print('Area under Curve is', rf_train_auc)

Area under Curve is 0.8563713512840778

RF Model Performance Evaluation on Test data

In [110]:

confusion_matrix(test_labels,ytest_predict_rfcl)

Out[110]:

array([[550, 73],

[121, 156]])


In [111]:

rf_test_acc=best_grid_rfcl.score(X_test,test_labels)

rf_test_acc

Out[111]:

0.7844444444444445

In [112]:

print(classification_report(test_labels,ytest_predict_rfcl))

precision recall f1-score support

0 0.82 0.88 0.85 623

1 0.68 0.56 0.62 277

accuracy 0.78 900

macro avg 0.75 0.72 0.73 900

weighted avg 0.78 0.78 0.78 900

In [113]:

rf_metrics=classification_report(test_labels, ytest_predict_rfcl,output_dict=True)

df=pd.DataFrame(rf_metrics).transpose()

rf_test_precision=round(df.loc["1"][0],2)

rf_test_recall=round(df.loc["1"][1],2)

rf_test_f1=round(df.loc["1"][2],2)

print ('rf_test_precision ',rf_test_precision)

print ('rf_test_recall ',rf_test_recall)

print ('rf_test_f1 ',rf_test_f1)

rf_test_precision 0.68

rf_test_recall 0.56

rf_test_f1 0.62


In [114]:

rf_test_fpr, rf_test_tpr, _ = roc_curve(test_labels, best_grid_rfcl.predict_proba(X_test)[:,1])

plt.plot(rf_test_fpr,rf_test_tpr,color='green')

plt.plot([0, 1], [0, 1], linestyle='--')

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC')

rf_test_auc=roc_auc_score(test_labels,best_grid_rfcl.predict_proba(X_test)[:,1])

print('Area under Curve is', rf_test_auc)

Area under Curve is 0.8181994657271499


Random Forest Conclusion

Train Data:

- AUC: 86%

- Accuracy: 80%

- Precision: 72%

- Recall: 61%

- f1-Score: 66%

Test Data:

- AUC: 82%

- Accuracy: 78%

- Precision: 68%

- Recall: 56%

- f1-Score: 62%

Training and Test set results are close, and with the overall measures reasonably high, the model generalizes well.

Agency_Code is again the most important variable for predicting claim status.

NN Model Performance Evaluation on Training data

In [115]:

confusion_matrix(train_labels,ytrain_predict_nncl)

Out[115]:

array([[1298, 155],

[ 315, 332]])


In [116]:

nn_train_acc=best_grid_nncl.score(X_train,train_labels)

nn_train_acc

Out[116]:

0.7761904761904762

In [117]:

print(classification_report(train_labels,ytrain_predict_nncl))

precision recall f1-score support

0 0.80 0.89 0.85 1453

1 0.68 0.51 0.59 647

accuracy 0.78 2100

macro avg 0.74 0.70 0.72 2100

weighted avg 0.77 0.78 0.77 2100

In [118]:

nn_metrics=classification_report(train_labels, ytrain_predict_nncl,output_dict=True)

df=pd.DataFrame(nn_metrics).transpose()

nn_train_precision=round(df.loc["1"][0],2)

nn_train_recall=round(df.loc["1"][1],2)

nn_train_f1=round(df.loc["1"][2],2)

print ('nn_train_precision ',nn_train_precision)

print ('nn_train_recall ',nn_train_recall)

print ('nn_train_f1 ',nn_train_f1)

nn_train_precision 0.68

nn_train_recall 0.51

nn_train_f1 0.59


In [119]:

nn_train_fpr, nn_train_tpr, _ = roc_curve(train_labels, best_grid_nncl.predict_proba(X_train)[:,1])

plt.plot(nn_train_fpr,nn_train_tpr,color='black')

plt.plot([0, 1], [0, 1], linestyle='--')

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC')

nn_train_auc=roc_auc_score(train_labels,best_grid_nncl.predict_proba(X_train)[:,1])

print('Area under Curve is', nn_train_auc)

Area under Curve is 0.8166831721609928

NN Model Performance Evaluation on Test data

In [120]:

confusion_matrix(test_labels,ytest_predict_nncl)

Out[120]:

array([[553, 70],

[138, 139]])


In [121]:

nn_test_acc=best_grid_nncl.score(X_test,test_labels)

nn_test_acc

Out[121]:

0.7688888888888888

In [122]:

print(classification_report(test_labels,ytest_predict_nncl))

precision recall f1-score support

0 0.80 0.89 0.84 623

1 0.67 0.50 0.57 277

accuracy 0.77 900

macro avg 0.73 0.69 0.71 900

weighted avg 0.76 0.77 0.76 900

In [123]:

nn_metrics=classification_report(test_labels, ytest_predict_nncl,output_dict=True)

df=pd.DataFrame(nn_metrics).transpose()

nn_test_precision=round(df.loc["1"][0],2)

nn_test_recall=round(df.loc["1"][1],2)

nn_test_f1=round(df.loc["1"][2],2)

print ('nn_test_precision ',nn_test_precision)

print ('nn_test_recall ',nn_test_recall)

print ('nn_test_f1 ',nn_test_f1)

nn_test_precision 0.67

nn_test_recall 0.5

nn_test_f1 0.57


In [124]:

nn_test_fpr, nn_test_tpr, _ = roc_curve(test_labels, best_grid_nncl.predict_proba(X_test)[:,1])

plt.plot(nn_test_fpr,nn_test_tpr,color='black')

plt.plot([0, 1], [0, 1], linestyle='--')

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC')

nn_test_auc=roc_auc_score(test_labels,best_grid_nncl.predict_proba(X_test)[:,1])

print('Area under Curve is', nn_test_auc)

Area under Curve is 0.8044225275393896


Neural Network Conclusion

Train Data:

- AUC: 82%

- Accuracy: 78%

- Precision: 68%

- Recall: 51%

- f1-Score: 59%

Test Data:

- AUC: 80%

- Accuracy: 77%

- Precision: 67%

- Recall: 50%

- f1-Score: 57%

Training and Test set results are close, and with the overall measures reasonably high, the model generalizes well.

--------------------------------------------------------------------------------

2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.

Comparison of the performance metrics from the 3 models


In [125]:

index=['Accuracy', 'AUC', 'Recall','Precision','F1 Score']

data = pd.DataFrame({
    'CART Train': [cart_train_acc, cart_train_auc, cart_train_recall, cart_train_precision, cart_train_f1],
    'CART Test': [cart_test_acc, cart_test_auc, cart_test_recall, cart_test_precision, cart_test_f1],
    'Random Forest Train': [rf_train_acc, rf_train_auc, rf_train_recall, rf_train_precision, rf_train_f1],
    'Random Forest Test': [rf_test_acc, rf_test_auc, rf_test_recall, rf_test_precision, rf_test_f1],
    'Neural Network Train': [nn_train_acc, nn_train_auc, nn_train_recall, nn_train_precision, nn_train_f1],
    'Neural Network Test': [nn_test_acc, nn_test_auc, nn_test_recall, nn_test_precision, nn_test_f1]
}, index=index)

round(data, 2)

Out[125]:

           CART Train  CART Test  Random Forest Train  Random Forest Test  Neural Network Train  Neural Network Test
Accuracy         0.79       0.77                 0.80                0.78                  0.78                 0.77
AUC              0.82       0.80                 0.86                0.82                  0.82                 0.80
Recall           0.53       0.51                 0.61                0.56                  0.51                 0.50
Precision        0.70       0.67                 0.72                0.68                  0.68                 0.67
F1 Score         0.60       0.58                 0.66                0.62                  0.59                 0.57

ROC Curve for the 3 models on the Training data


In [126]:

plt.plot([0, 1], [0, 1], linestyle='--')

plt.plot(cart_train_fpr, cart_train_tpr,color='red',label="CART")

plt.plot(rf_train_fpr,rf_train_tpr,color='green',label="RF")

plt.plot(nn_train_fpr,nn_train_tpr,color='black',label="NN")

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC')

plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower right')

Out[126]:

<matplotlib.legend.Legend at 0x7f282f806710>

ROC Curve for the 3 models on the Test data


In [127]:

plt.plot([0, 1], [0, 1], linestyle='--')

plt.plot(cart_test_fpr, cart_test_tpr,color='red',label="CART")

plt.plot(rf_test_fpr,rf_test_tpr,color='green',label="RF")

plt.plot(nn_test_fpr,nn_test_tpr,color='black',label="NN")

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC')

plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower right')

Out[127]:

<matplotlib.legend.Legend at 0x7f282f788850>

CONCLUSION:

I am selecting the RF model, as its accuracy, precision, recall, and F1 score are better than those of the other two models (CART & NN).
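One practical refinement: since catching likely claims (recall) usually matters more to an insurer than precision, the 0.5 probability cut-off implied by `.predict()` is not sacred. Lowering it flags more potential claims at the cost of precision. A sketch with made-up probabilities standing in for `best_grid_rfcl.predict_proba(X_test)[:, 1]`:

```python
import numpy as np

# Hypothetical positive-class probabilities for five policies.
probs = np.array([0.30, 0.48, 0.55, 0.20, 0.65])

preds_default = (probs >= 0.50).astype(int)  # the implicit 0.5 threshold
preds_lowered = (probs >= 0.40).astype(int)  # more claims flagged

print(preds_default)  # [0 0 1 0 1]
print(preds_lowered)  # [0 1 1 0 1]
```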

--------------------------------------------------------------------------------


2.5 Inference: Based on these predictions, what are the business insights and recommendations?

I strongly recommend we collect more real-time unstructured data, and more past data if possible.

This can be understood by looking at the insurance data, drawing relations between variables such as day of the incident, time, and age group, and associating them with external information such as location, behavior patterns, weather information, airline/vehicle types, etc.

• Streamlining online experiences benefits customers, leading to an increase in conversions, which subsequently raises profits.

• As per the data, 90% of the insurance is sold through the online channel.

• Another interesting fact: almost all of the offline business has a claim associated with it; we need to find out why.

• The JZI agency's resources need training to pick up sales, as they are at the bottom; we should run a promotional marketing campaign or evaluate whether to tie up with an alternate agency.

• Since the model achieves roughly 80% accuracy, when a customer books airline tickets or plans, we can cross-sell insurance based on the claim data pattern.

• More sales happen via Agency than Airlines, yet the trend shows claims are processed more on the Airline side. We may need to deep-dive into the process to understand the workflow and the reasons.

Key performance indicators (KPIs) of insurance claims:

• Reduce claims cycle time

• Increase customer satisfaction

• Combat fraud

• Optimize claims recovery

• Reduce claim handling costs

Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend existing products, and give rise to new risk transfer solutions in areas like non-damage business interruption and reputational damage.
