Insurance - CART - RF - ANN - Models - Kaggle
Problem 2: CART-RF-ANN
An insurance firm providing tour insurance is facing a higher claim frequency. The management has decided to
collect data from the past few years. You are assigned the task of building a model that predicts the claim status
and of providing recommendations to the management. Use CART, RF & ANN and compare the models' performances
on the train and test sets.
Attribute Information:
1. Target: Claim Status (Claimed)
https://www.kaggle.com/mihirjhaveri/insurance-cart-rf-ann-models/notebook 1/81
2/9/22, 7:00 PM insurance_CART_RF_ANN_models | Kaggle
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv("../input/insurance-data/insurance_part2_data (2).csv")
In [3]:
df.head()
Out[3]:
   Age Agency_Code           Type Claimed  Commision Channel  Duration  Sales       Product Name
0   48         C2B       Airlines      No       0.70  Online         7   2.51    Customised Plan
1   36         EPX  Travel Agency      No       0.00  Online        34  20.00    Customised Plan
2   39         CWT  Travel Agency      No       5.94  Online         3   9.90    Customised Plan
3   36         EPX  Travel Agency      No       0.00  Online         4  26.00  Cancellation Plan
4   33         JZI       Airlines      No       6.30  Online        53  18.00        Bronze Plan
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Observation
- 10 variables
In [5]:
df.isnull().sum()
Out[5]:
Age 0
Agency_Code 0
Type 0
Claimed 0
Commision 0
Channel 0
Duration 0
Sales 0
Product Name 0
Destination 0
dtype: int64
Observation
No missing values.
In [6]:
df.describe().T
Out[6]:
In [7]:
df.describe(percentiles=[.25,0.50,0.75,0.90]).T
Out[7]:
Observation
- Duration has a negative value, which is not possible; this is a wrong entry.
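The wrong entry flagged above can be located with a quick filter; a minimal sketch, on stand-in data rather than the notebook's df:

```python
import pandas as pd

# stand-in data; in the notebook this check runs on the real df
df = pd.DataFrame({'Duration': [7, 34, -1, 4]})
bad = df[df['Duration'] < 0]   # durations below zero are impossible
print(len(bad))                # number of impossible negative durations
```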
In [8]:
df.describe(include='all').T
Out[8]:
              count unique              top  freq     mean      std  min  25%   50%    75%
Age            3000    NaN              NaN   NaN   38.091  10.4635    8   32    36     42
Agency_Code    3000      4              EPX  1365      NaN      NaN  NaN  NaN   NaN    NaN
Type           3000      2    Travel Agency  1837      NaN      NaN  NaN  NaN   NaN    NaN
Claimed        3000      2               No  2076      NaN      NaN  NaN  NaN   NaN    NaN
Commision      3000    NaN              NaN   NaN  14.5292  25.4815    0    0  4.63  17.23
Channel        3000      2           Online  2954      NaN      NaN  NaN  NaN   NaN    NaN
Duration       3000    NaN              NaN   NaN  70.0013  134.053   -1   11  26.5     63
Sales          3000    NaN              NaN   NaN  60.2499   70.734    0   20    33     69
Product Name   3000      5  Customised Plan  1136      NaN      NaN  NaN  NaN   NaN    NaN
Destination    3000      3             ASIA  2465      NaN      NaN  NaN  NaN   NaN    NaN
Observation
The categorical variables have at most 5 unique values.
In [9]:
df.head(10)
Out[9]:
   Age Agency_Code           Type Claimed  Commision Channel  Duration  Sales       Product Name
0   48         C2B       Airlines      No       0.70  Online         7   2.51    Customised Plan
1   36         EPX  Travel Agency      No       0.00  Online        34  20.00    Customised Plan
2   39         CWT  Travel Agency      No       5.94  Online         3   9.90    Customised Plan
3   36         EPX  Travel Agency      No       0.00  Online         4  26.00  Cancellation Plan
4   33         JZI       Airlines      No       6.30  Online        53  18.00        Bronze Plan
5   45         JZI       Airlines     Yes      15.75  Online         8  45.00        Bronze Plan
6   61         CWT  Travel Agency      No      35.64  Online        30  59.40    Customised Plan
7   36         EPX  Travel Agency      No       0.00  Online        16  80.00  Cancellation Plan
8   36         EPX  Travel Agency      No       0.00  Online        19  14.00  Cancellation Plan
9   36         EPX  Travel Agency      No       0.00  Online        42  43.00  Cancellation Plan
In [10]:
df.tail(10)
Out[10]:
      Age Agency_Code           Type Claimed  Commision Channel  Duration   Sales       Product Name
2990   51         EPX  Travel Agency      No       0.00  Online         2   20.00    Customised Plan
2991   29         C2B       Airlines     Yes      48.30  Online       381  193.20        Silver Plan
2992   28         CWT  Travel Agency      No      11.88  Online       389   19.80    Customised Plan
2993   36         EPX  Travel Agency      No       0.00  Online       234   10.00  Cancellation Plan
2994   27         C2B       Airlines     Yes      71.85  Online       416  287.40          Gold Plan
2995   28         CWT  Travel Agency     Yes     166.53  Online       364  256.20          Gold Plan
2996   35         C2B       Airlines      No      13.50  Online         5   54.00          Gold Plan
2997   36         EPX  Travel Agency      No       0.00  Online        54   28.00    Customised Plan
2998   34         C2B       Airlines     Yes       7.64  Online        39   30.55        Bronze Plan
2999   47         JZI       Airlines      No      11.55  Online        15   33.00        Bronze Plan
In [11]:
df.shape
Out[11]:
(3000, 10)
In [12]:
for column in df.select_dtypes(include='object').columns:   # iterate over the categorical columns
    print(column.upper(),': ',df[column].nunique())
    print(df[column].value_counts().sort_values())
    print('\n')
AGENCY_CODE : 4
JZI 239
CWT 472
C2B 924
EPX 1365
TYPE : 2
Airlines 1163
Travel Agency 1837
CLAIMED : 2
Yes 924
No 2076
CHANNEL : 2
Offline 46
Online 2954
PRODUCT NAME : 5
DESTINATION : 3
EUROPE 215
Americas 320
ASIA 2465
In [13]:
dups = df.duplicated()
df[dups]
Out[13]:
      Age Agency_Code           Type Claimed  Commision Channel  Duration  Sales       Product Name
63     30         C2B       Airlines     Yes       15.0  Online        27   60.0        Bronze Plan
329    36         EPX  Travel Agency      No        0.0  Online         5   20.0    Customised Plan
407    36         EPX  Travel Agency      No        0.0  Online        11   19.0  Cancellation Plan
411    35         EPX  Travel Agency      No        0.0  Online         2   20.0    Customised Plan
422    36         EPX  Travel Agency      No        0.0  Online         5   20.0    Customised Plan
...   ...         ...            ...     ...        ...     ...       ...    ...                ...
2940   36         EPX  Travel Agency      No        0.0  Online         8   10.0  Cancellation Plan
2947   36         EPX  Travel Agency      No        0.0  Online        10   28.0    Customised Plan
2952   36         EPX  Travel Agency      No        0.0  Online         2   10.0  Cancellation Plan
2962   36         EPX  Travel Agency      No        0.0  Online         4   20.0    Customised Plan
2984   36         EPX  Travel Agency      No        0.0  Online         1   20.0    Customised Plan
Although 139 duplicate records are shown, they could belong to different customers; there is no customer ID or any
other unique identifier, so I am not dropping them.
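The duplicate count quoted above comes from `df.duplicated()`; a minimal sketch of the same check on a tiny stand-in frame (in the notebook, `dups.sum()` returns 139):

```python
import pandas as pd

# stand-in frame with one exact duplicate row
df = pd.DataFrame({'Age': [30, 30, 45], 'Sales': [60.0, 60.0, 20.0]})
dups = df.duplicated()   # True for later copies of identical rows
print(dups.sum())        # number of duplicate rows
```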
Univariate Analysis
Age variable
In [14]:
Range of values: 76
In [15]:
#Central values
Minimum Age: 8
Maximum Age: 84
In [16]:
#Quartiles
Q1=df['Age'].quantile(q=0.25)
Q3=df['Age'].quantile(q=0.75)
In [17]:
# IQR=Q3-Q1
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
In [18]:
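The cell that applied these fences did not survive the export; a minimal sketch of the usual follow-up (counting the values outside the 1.5*IQR fences), on stand-in ages rather than the notebook's df['Age']:

```python
import pandas as pd

# stand-in ages; the notebook runs the same fence logic on df['Age']
age = pd.Series([8, 25, 32, 36, 42, 84])
Q1 = age.quantile(q=0.25)
Q3 = age.quantile(q=0.75)
IQR = Q3 - Q1
L_outliers = Q1 - 1.5 * IQR
U_outliers = Q3 + 1.5 * IQR
# rows falling outside either fence count as outliers
outliers = age[(age < L_outliers) | (age > U_outliers)]
print(len(outliers))
```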
In [19]:
plt.title('Age')
sns.boxplot(df['Age'],orient='h',color='purple')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2839607e10>
In [20]:
fig, (ax2,ax3)=plt.subplots(1,2,figsize=(13,5))
#distplot
sns.distplot(df['Age'],ax=ax2)
ax2.set_xlabel('Age', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(df['Age'])
ax3.set_xlabel('Age', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
Commision variable
In [21]:
In [22]:
#Central values
In [23]:
#Quartiles
Q1=df['Commision'].quantile(q=0.25)
Q3=df['Commision'].quantile(q=0.75)
In [24]:
# IQR=Q3-Q1
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
In [25]:
In [26]:
plt.title('Commision')
sns.boxplot(df['Commision'],orient='h',color='purple')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2836b40150>
In [27]:
fig, (ax2,ax3)=plt.subplots(1,2,figsize=(13,5))
#distplot
sns.distplot(df['Commision'],ax=ax2)
ax2.set_xlabel('Commision', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(df['Commision'])
ax3.set_xlabel('Commision', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
Duration variable
In [28]:
In [29]:
#Central values
Minimum Duration: -1
In [30]:
#Quartiles
Q1=df['Duration'].quantile(q=0.25)
Q3=df['Duration'].quantile(q=0.75)
In [31]:
# IQR=Q3-Q1
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
In [32]:
In [33]:
plt.title('Duration')
sns.boxplot(df['Duration'],orient='h',color='purple')
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2836a4a890>
In [34]:
fig, (ax2,ax3)=plt.subplots(1,2,figsize=(13,5))
#distplot
sns.distplot(df['Duration'],ax=ax2)
ax2.set_xlabel('Duration', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(df['Duration'])
ax3.set_xlabel('Duration', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
Sales variable
In [35]:
In [36]:
#Central values
In [37]:
#Quartiles
Q1=df['Sales'].quantile(q=0.25)
Q3=df['Sales'].quantile(q=0.75)
In [38]:
# IQR=Q3-Q1
L_outliers=Q1-1.5*(Q3-Q1)
U_outliers=Q3+1.5*(Q3-Q1)
In [39]:
In [40]:
plt.title('Sales')
sns.boxplot(df['Sales'],orient='h',color='purple')
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f28368b67d0>
In [41]:
fig, (ax2,ax3)=plt.subplots(1,2,figsize=(13,5))
#distplot
sns.distplot(df['Sales'],ax=ax2)
ax2.set_xlabel('Sales', fontsize=15)
ax2.tick_params(labelsize=15)
#histogram
ax3.hist(df['Sales'])
ax3.set_xlabel('Sales', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
There are outliers in all the variables, but Sales and Commision can be genuine business values, and Random
Forest and CART can handle outliers. Hence, the outliers are not treated for now; we will keep the data as it is.
I will treat the outliers for the ANN model and compare the results after all the steps, just for comparison.
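The treatment mentioned for the ANN comparison would typically cap values at the 1.5*IQR fences rather than drop rows; a minimal sketch on stand-in data (simple capping is an assumption, since the treatment cell is not shown here):

```python
import pandas as pd

# stand-in series with one extreme value
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
# cap values at the IQR fences instead of dropping them
capped = s.clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)
print(capped.max())
```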
Categorical Variables
Agency_Code
Count Plot
In [42]:
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f28366bebd0>
Boxplot
In [43]:
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f283660d610>
Swarmplot
In [44]:
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2836527e50>
In [45]:
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f28368bc3d0>
Type
In [46]:
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f28368dd950>
In [47]:
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2836b6ca50>
In [48]:
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2836a98d10>
In [49]:
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2836e20910>
Channel
In [50]:
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2836be08d0>
In [51]:
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2836d1e890>
In [52]:
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2836c36e50>
In [53]:
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2834cd2d50>
Product Name
In [54]:
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2834cba390>
In [55]:
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2834c36610>
In [56]:
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2834aee750>
In [57]:
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2834a52490>
Destination
In [58]:
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f283498c2d0>
In [59]:
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f28349567d0>
In [60]:
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f28348ad290>
In [61]:
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2836a907d0>
In [62]:
sns.pairplot(df[['Age', 'Commision',
'Duration', 'Sales']])
Out[62]:
<seaborn.axisgrid.PairGrid at 0x7f2834cbcc10>
In [63]:
plt.figure(figsize=(10,8))
sns.set(font_scale=1.2)
sns.heatmap(df[['Age', 'Commision', 'Duration', 'Sales']].corr(), annot=True)
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2834099b50>
In [64]:
for feature in df.columns:   # encode each object-typed column as integer codes
    if df[feature].dtype == 'object':
        print('\n')
        print('feature:',feature)
        print(pd.Categorical(df[feature].unique()))
        print(pd.Categorical(df[feature].unique()).codes)
        df[feature] = pd.Categorical(df[feature]).codes
feature: Agency_Code
[0 2 1 3]
feature: Type
[0 1]
feature: Claimed
['No', 'Yes']
[0 1]
feature: Channel
['Online', 'Offline']
[1 0]
feature: Product Name
[2 1 0 4 3]
feature: Destination
[0 1 2]
In [65]:
df.info()
<class 'pandas.core.frame.DataFrame'>
In [66]:
df.head()
Out[66]:
   Age  Agency_Code  Type  Claimed  Commision  Channel  Duration  Sales  Product Name  Destination
0   48            0     0        0       0.70        1         7   2.51             2            0
1   36            2     1        0       0.00        1        34  20.00             2            0
2   39            1     1        0       5.94        1         3   9.90             2            1
3   36            2     1        0       0.00        1         4  26.00             1            0
4   33            3     0        0       6.30        1        53  18.00             0            0
Proportion of 1s and 0s
In [67]:
df.Claimed.value_counts(normalize=True)
Out[67]:
0 0.692
1 0.308
----------------------------------------------------------------------
2.2 Data Split: Split the data into test and train, and build the
classification models CART, Random Forest and Artificial Neural
Network
Extracting the target column into separate vectors for training set
and test set
In [68]:
X = df.drop("Claimed", axis=1)
y = df.pop("Claimed")
X.head()
Out[68]:
   Age  Agency_Code  Type  Commision  Channel  Duration  Sales  Product Name  Destination
0   48            0     0       0.70        1         7   2.51             2            0
1   36            2     1       0.00        1        34  20.00             2            0
2   39            1     1       5.94        1         3   9.90             2            1
3   36            2     1       0.00        1         4  26.00             1            0
4   33            3     0       6.30        1        53  18.00             0            0
In [69]:
# prior to scaling
plt.plot(X)
plt.show()
In [70]:
from scipy.stats import zscore

X_scaled=X.apply(zscore)
X_scaled.head()
Out[70]:
        Age  Agency_Code      Type  Commision   Channel  Duration     Sales  Product Name
0  0.947162    -1.314358 -1.256796  -0.542807  0.124788 -0.470051 -0.816433       0.26883
1 -0.199870     0.697928  0.795674  -0.570282  0.124788 -0.268605 -0.569127       0.26883
2  0.086888    -0.308215  0.795674  -0.337133  0.124788 -0.499894 -0.711940       0.26883
3 -0.199870     0.697928  0.795674  -0.570282  0.124788 -0.492433 -0.484288      -0.52575
4 -0.486629     1.704071 -1.256796  -0.323003  0.124788 -0.126846 -0.597407      -1.32033
In [71]:
# after scaling
plt.plot(X_scaled)
plt.show()
In [72]:
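The split cell itself did not survive the export; a minimal sketch consistent with the shapes printed in the next cell (the random_state value and the absence of stratification are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# stand-in for X_scaled and the Claimed target, mirroring the 3000-row data
X_scaled = pd.DataFrame(np.zeros((3000, 9)))
y = pd.Series(np.random.randint(0, 2, size=3000))

# a 70/30 split reproduces X_train (2100, 9) and X_test (900, 9)
X_train, X_test, train_labels, test_labels = train_test_split(
    X_scaled, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)
```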
In [73]:
print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('train_labels',train_labels.shape)
print('test_labels',test_labels.shape)
X_train (2100, 9)
X_test (900, 9)
train_labels (2100,)
test_labels (900,)
In [74]:
param_grid_dtcl = {
    'criterion': ['gini'],
    'max_depth': [10,20,30,50],
    'min_samples_leaf': [50,100,150],
    'min_samples_split': [150,300,450],
}

dtcl = DecisionTreeClassifier(random_state=1)
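The cell creating grid_search_dtcl was lost in the export; a sketch of the standard pattern it would follow (the cv value is an assumption), fitted here on synthetic data rather than the notebook's X_train:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid_dtcl = {
    'criterion': ['gini'],
    'max_depth': [10, 20, 30, 50],
    'min_samples_leaf': [50, 100, 150],
    'min_samples_split': [150, 300, 450],
}
dtcl = DecisionTreeClassifier(random_state=1)

# exhaustive search over the grid; cv=3 is an assumption
grid_search_dtcl = GridSearchCV(estimator=dtcl, param_grid=param_grid_dtcl, cv=3)

# demo fit on synthetic data (the notebook fits on X_train, train_labels)
X_demo, y_demo = make_classification(n_samples=600, n_features=9, random_state=1)
grid_search_dtcl.fit(X_demo, y_demo)
print(sorted(grid_search_dtcl.best_params_))
```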
In [75]:
grid_search_dtcl.fit(X_train, train_labels)
print(grid_search_dtcl.best_params_)
best_grid_dtcl = grid_search_dtcl.best_estimator_
best_grid_dtcl
Out[75]:
random_state=1)
In [76]:
param_grid_dtcl = {
    'criterion': ['gini'],
    'min_samples_leaf': [20,30,40,50,60],
    'min_samples_split': [150,300,450],
    # ... (remaining parameters truncated)
}

dtcl = DecisionTreeClassifier(random_state=1)
In [77]:
grid_search_dtcl.fit(X_train, train_labels)
print(grid_search_dtcl.best_params_)
best_grid_dtcl = grid_search_dtcl.best_estimator_
best_grid_dtcl
Out[77]:
random_state=1)
In [78]:
param_grid_dtcl = {
    'criterion': ['gini'],
    # ... (remaining parameters truncated)
}

dtcl = DecisionTreeClassifier(random_state=1)
In [79]:
grid_search_dtcl.fit(X_train, train_labels)
print(grid_search_dtcl.best_params_)
best_grid_dtcl = grid_search_dtcl.best_estimator_
best_grid_dtcl
Out[79]:
DecisionTreeClassifier(max_depth=3.5, min_samples_leaf=44,
min_samples_split=250, random_state=1)
In [80]:
param_grid_dtcl = {
    'criterion': ['gini'],
    'min_samples_split': [150, 175, 200, 210, 220, 230, 240, 250, 260, 270],
    # ... (remaining parameters truncated)
}

dtcl = DecisionTreeClassifier(random_state=1)
In [81]:
grid_search_dtcl.fit(X_train, train_labels)
print(grid_search_dtcl.best_params_)
best_grid_dtcl = grid_search_dtcl.best_estimator_
best_grid_dtcl
Out[81]:
DecisionTreeClassifier(max_depth=4.85, min_samples_leaf=44,
min_samples_split=260, random_state=1)
Generating Tree
In [82]:
from sklearn import tree

tree_regularized = open('tree_regularized.dot','w')
dot_data = tree.export_graphviz(best_grid_dtcl, out_file=tree_regularized,
                                feature_names = list(X_train),
                                class_names = list(train_char_label))
tree_regularized.close()
dot_data

# paste the contents of tree_regularized.dot at http://webgraphviz.com/ to render the tree
In [83]:
# Variable Importance
print(pd.DataFrame(best_grid_dtcl.feature_importances_,
                   columns = ["Imp"],
                   index = X_train.columns).sort_values('Imp',ascending=False))
Imp
Agency_Code 0.634112
Sales 0.220899
Commision 0.021881
Age 0.019940
Duration 0.016536
Type 0.000000
Channel 0.000000
Destination 0.000000
In [84]:
ytrain_predict_dtcl = best_grid_dtcl.predict(X_train)
ytest_predict_dtcl = best_grid_dtcl.predict(X_test)
In [85]:
ytest_predict_dtcl
ytest_predict_prob_dtcl=best_grid_dtcl.predict_proba(X_test)
ytest_predict_prob_dtcl
pd.DataFrame(ytest_predict_prob_dtcl).head()
Out[85]:
0 1
0 0.697947 0.302053
1 0.979452 0.020548
2 0.921171 0.078829
3 0.510417 0.489583
4 0.921171 0.078829
In [86]:
param_grid_rfcl = {
    'max_depth': [4,5,6],   # 20,30,40
    # ... (remaining parameters truncated)
}

rfcl = RandomForestClassifier(random_state=1)
In [87]:
grid_search_rfcl.fit(X_train, train_labels)
print(grid_search_rfcl.best_params_)
best_grid_rfcl = grid_search_rfcl.best_estimator_
best_grid_rfcl
Out[87]:
In [88]:
ytrain_predict_rfcl = best_grid_rfcl.predict(X_train)
ytest_predict_rfcl = best_grid_rfcl.predict(X_test)
In [89]:
ytest_predict_rfcl
ytest_predict_prob_rfcl=best_grid_rfcl.predict_proba(X_test)
ytest_predict_prob_rfcl
pd.DataFrame(ytest_predict_prob_rfcl).head()
Out[89]:
0 1
0 0.778010 0.221990
1 0.971910 0.028090
2 0.904401 0.095599
3 0.651398 0.348602
4 0.868406 0.131594
In [90]:
# Variable Importance
print (pd.DataFrame(best_grid_rfcl.feature_importances_,
columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))
Imp
Agency_Code 0.276015
Sales 0.152733
Commision 0.135997
Duration 0.077475
Type 0.071019
Age 0.039503
Destination 0.008971
Channel 0.002705
In [91]:
param_grid_nncl = {
    'tol': [0.01],
    # ... (remaining parameters truncated)
}

nncl = MLPClassifier(random_state=1)
In [92]:
grid_search_nncl.fit(X_train, train_labels)
grid_search_nncl.best_params_
best_grid_nncl = grid_search_nncl.best_estimator_
best_grid_nncl
Out[92]:
In [93]:
ytrain_predict_nncl = best_grid_nncl.predict(X_train)
ytest_predict_nncl = best_grid_nncl.predict(X_test)
In [94]:
ytest_predict_nncl
ytest_predict_prob_nncl=best_grid_nncl.predict_proba(X_test)
ytest_predict_prob_nncl
pd.DataFrame(ytest_predict_prob_nncl).head()
Out[94]:
0 1
0 0.822676 0.177324
1 0.933407 0.066593
2 0.918772 0.081228
3 0.688933 0.311067
4 0.913425 0.086575
----------------------------------------------------------------------
In [95]:
# predict probabilities
probs_cart = best_grid_dtcl.predict_proba(X_train)
probs_cart = probs_cart[:, 1]
# calculate AUC
cart_train_auc = roc_auc_score(train_labels, probs_cart)
print('AUC: %.3f' % cart_train_auc)
# calculate and plot the ROC curve
cart_train_fpr, cart_train_tpr, _ = roc_curve(train_labels, probs_cart)
plt.plot(cart_train_fpr, cart_train_tpr)
AUC: 0.823
Out[95]:
[<matplotlib.lines.Line2D at 0x7f2836b1d350>]
In [96]:
# predict probabilities
probs_cart = best_grid_dtcl.predict_proba(X_test)
probs_cart = probs_cart[:, 1]
# calculate AUC
cart_test_auc = roc_auc_score(test_labels, probs_cart)
print('AUC: %.3f' % cart_test_auc)
# calculate and plot the ROC curve
cart_test_fpr, cart_test_tpr, _ = roc_curve(test_labels, probs_cart)
plt.plot(cart_test_fpr, cart_test_tpr)
AUC: 0.801
Out[96]:
[<matplotlib.lines.Line2D at 0x7f2834617a10>]
In [97]:
confusion_matrix(train_labels, ytrain_predict_dtcl)
Out[97]:
array([[1309, 144],
[ 307, 340]])
In [98]:
cart_train_acc=best_grid_dtcl.score(X_train,train_labels)
cart_train_acc
Out[98]:
0.7852380952380953
In [99]:
print(classification_report(train_labels, ytrain_predict_dtcl))
In [100]:
cart_metrics=classification_report(train_labels, ytrain_predict_dtcl,output_dict=True)
df=pd.DataFrame(cart_metrics).transpose()
cart_train_f1=round(df.loc["1"][2],2)
cart_train_recall=round(df.loc["1"][1],2)
cart_train_precision=round(df.loc["1"][0],2)
cart_train_precision 0.7
cart_train_recall 0.53
cart_train_f1 0.6
In [101]:
confusion_matrix(test_labels, ytest_predict_dtcl)
Out[101]:
array([[553, 70],
[136, 141]])
In [102]:
cart_test_acc=best_grid_dtcl.score(X_test,test_labels)
cart_test_acc
Out[102]:
0.7711111111111111
In [103]:
print(classification_report(test_labels, ytest_predict_dtcl))
In [104]:
cart_metrics=classification_report(test_labels, ytest_predict_dtcl,output_dict=True)
df=pd.DataFrame(cart_metrics).transpose()
cart_test_precision=round(df.loc["1"][0],2)
cart_test_recall=round(df.loc["1"][1],2)
cart_test_f1=round(df.loc["1"][2],2)
cart_test_precision 0.67
cart_test_recall 0.51
cart_test_f1 0.58
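As a sanity check, the class-1 precision, recall and f1 above can be recomputed directly from the test confusion matrix in In [101]:

```python
import numpy as np

# confusion matrix from In [101]: rows = actual (No, Yes), cols = predicted
cm = np.array([[553,  70],
               [136, 141]])
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))   # → 0.67 0.51 0.58
```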
CART Conclusion
Train Data:
- AUC: 82%
- Accuracy: 79%
- Precision: 70%
- f1-Score: 60%
Test Data:
- AUC: 80%
- Accuracy: 77%
- Precision: 67%
- f1-Score: 58%
Training and test set results are similar, and the overall measures are reasonably high, so the model is a good one.
In [105]:
confusion_matrix(train_labels,ytrain_predict_rfcl)
Out[105]:
array([[1297, 156],
[ 255, 392]])
In [106]:
rf_train_acc=best_grid_rfcl.score(X_train,train_labels)
rf_train_acc
Out[106]:
0.8042857142857143
In [107]:
print(classification_report(train_labels,ytrain_predict_rfcl))
In [108]:
rf_metrics=classification_report(train_labels, ytrain_predict_rfcl,output_dict=True)
df=pd.DataFrame(rf_metrics).transpose()
rf_train_precision=round(df.loc["1"][0],2)
rf_train_recall=round(df.loc["1"][1],2)
rf_train_f1=round(df.loc["1"][2],2)
rf_train_precision 0.72
rf_train_recall 0.61
rf_train_f1 0.66
In [109]:
rf_train_fpr, rf_train_tpr,_=roc_curve(train_labels,best_grid_rfcl.predict_proba(X_train)[:,1])
plt.plot(rf_train_fpr,rf_train_tpr,color='green')
plt.title('ROC')
rf_train_auc=roc_auc_score(train_labels,best_grid_rfcl.predict_proba(X_train)[:,1])
In [110]:
confusion_matrix(test_labels,ytest_predict_rfcl)
Out[110]:
array([[550, 73],
[121, 156]])
In [111]:
rf_test_acc=best_grid_rfcl.score(X_test,test_labels)
rf_test_acc
Out[111]:
0.7844444444444445
In [112]:
print(classification_report(test_labels,ytest_predict_rfcl))
In [113]:
rf_metrics=classification_report(test_labels, ytest_predict_rfcl,output_dict=True)
df=pd.DataFrame(rf_metrics).transpose()
rf_test_precision=round(df.loc["1"][0],2)
rf_test_recall=round(df.loc["1"][1],2)
rf_test_f1=round(df.loc["1"][2],2)
rf_test_precision 0.68
rf_test_recall 0.56
rf_test_f1 0.62
In [114]:
rf_test_fpr, rf_test_tpr,_=roc_curve(test_labels,best_grid_rfcl.predict_proba(X_test)[:,1])
plt.plot(rf_test_fpr,rf_test_tpr,color='green')
plt.title('ROC')
rf_test_auc=roc_auc_score(test_labels,best_grid_rfcl.predict_proba(X_test)[:,1])
Random Forest Conclusion
Train Data:
- AUC: 86%
- Accuracy: 80%
- Precision: 72%
- f1-Score: 66%
Test Data:
- AUC: 82%
- Accuracy: 78%
- Precision: 68%
- f1-Score: 62%
Training and test set results are similar, and the overall measures are reasonably high, so the model is a good one.
In [115]:
confusion_matrix(train_labels,ytrain_predict_nncl)
Out[115]:
array([[1298, 155],
[ 315, 332]])
In [116]:
nn_train_acc=best_grid_nncl.score(X_train,train_labels)
nn_train_acc
Out[116]:
0.7761904761904762
In [117]:
print(classification_report(train_labels,ytrain_predict_nncl))
In [118]:
nn_metrics=classification_report(train_labels, ytrain_predict_nncl,output_dict=True)
df=pd.DataFrame(nn_metrics).transpose()
nn_train_precision=round(df.loc["1"][0],2)
nn_train_recall=round(df.loc["1"][1],2)
nn_train_f1=round(df.loc["1"][2],2)
nn_train_precision 0.68
nn_train_recall 0.51
nn_train_f1 0.59
In [119]:
nn_train_fpr, nn_train_tpr,_=roc_curve(train_labels,best_grid_nncl.predict_proba(X_train)[:,1])
plt.plot(nn_train_fpr,nn_train_tpr,color='black')
plt.title('ROC')
nn_train_auc=roc_auc_score(train_labels,best_grid_nncl.predict_proba(X_train)[:,1])
In [120]:
confusion_matrix(test_labels,ytest_predict_nncl)
Out[120]:
array([[553, 70],
[138, 139]])
In [121]:
nn_test_acc=best_grid_nncl.score(X_test,test_labels)
nn_test_acc
Out[121]:
0.7688888888888888
In [122]:
print(classification_report(test_labels,ytest_predict_nncl))
In [123]:
nn_metrics=classification_report(test_labels, ytest_predict_nncl,output_dict=True)
df=pd.DataFrame(nn_metrics).transpose()
nn_test_precision=round(df.loc["1"][0],2)
nn_test_recall=round(df.loc["1"][1],2)
nn_test_f1=round(df.loc["1"][2],2)
nn_test_precision 0.67
nn_test_recall 0.5
nn_test_f1 0.57
In [124]:
nn_test_fpr, nn_test_tpr,_=roc_curve(test_labels,best_grid_nncl.predict_proba(X_test)[:,1])
plt.plot(nn_test_fpr,nn_test_tpr,color='black')
plt.title('ROC')
nn_test_auc=roc_auc_score(test_labels,best_grid_nncl.predict_proba(X_test)[:,1])
Neural Network Conclusion
Train Data:
- AUC: 82%
- Accuracy: 78%
- Precision: 68%
- f1-Score: 59%
Test Data:
- AUC: 80%
- Accuracy: 77%
- Precision: 67%
- f1-Score: 57%
Training and test set results are similar, and the overall measures are reasonably high, so the model is a good one.
----------------------------------------------------------------------
2.4 Final Model: Compare all the models and write an inference on
which model is best/optimized.
In [125]:
'CART Test':[cart_test_acc,cart_test_auc,cart_test_recall,cart_test_precision,cart_test_f1],
round(data,2)
Out[125]:
           CART Train  CART Test  Random Forest Train  Random Forest Test  Neural Network Train  Neural Network Test
Accuracy         0.79       0.77                 0.80                0.78                  0.78                 0.77
AUC              0.82       0.80                 0.86                0.82                  0.82                 0.80
Recall           0.53       0.51                 0.61                0.56                  0.51                 0.50
Precision        0.70       0.67                 0.72                0.68                  0.68                 0.67
F1 Score         0.60       0.58                 0.66                0.62                  0.59                 0.57
In [126]:
plt.plot(cart_train_fpr, cart_train_tpr,color='red',label="CART")
plt.plot(rf_train_fpr,rf_train_tpr,color='green',label="RF")
plt.plot(nn_train_fpr,nn_train_tpr,color='black',label="NN")
plt.title('ROC')
Out[126]:
<matplotlib.legend.Legend at 0x7f282f806710>
In [127]:
plt.plot(cart_test_fpr, cart_test_tpr,color='red',label="CART")
plt.plot(rf_test_fpr,rf_test_tpr,color='green',label="RF")
plt.plot(nn_test_fpr,nn_test_tpr,color='black',label="NN")
plt.title('ROC')
Out[127]:
<matplotlib.legend.Legend at 0x7f282f788850>
CONCLUSION:
----------------------------------------------------------------------
I strongly recommend that we collect more real-time unstructured data, and more past data if possible.
This can be understood by looking at the insurance data, drawing relations between variables such as
day of the incident, time and age group, and associating them with external information such as location,
behaviour patterns, weather information, airline/vehicle types, etc.
Key performance indicators (KPIs) of insurance claims are:
- Reduce claims cycle time
- Increase customer satisfaction
- Combat fraud
- Optimize claims recovery
- Reduce claim handling costs
Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend existing
products, and give rise to new risk-transfer solutions in areas like non-damage business interruption and
reputational damage.