
2/9/22, 7:00 PM insurance_CART_RF_ANN_models | Kaggle

Problem 2: CART-RF-ANN
An insurance firm providing tour insurance is facing a higher claim frequency. The management has decided to
collect data from the past few years. You are assigned the task of building a model that predicts claim status
and providing recommendations to management. Use CART, RF & ANN and compare the models' performances
on the train and test sets.

Attribute Information:

1. Target: Claim Status (Claimed)

2. Code of tour firm (Agency_Code)

3. Type of tour insurance firms (Type)

4. Distribution channel of tour insurance agencies (Channel)

5. Name of the tour insurance products (Product)

6. Duration of the tour (Duration)

7. Destination of the tour (Destination)

8. Amount of sales of tour insurance policies (Sales)

9. The commission received by the tour insurance firm (Commission)

10. Age of insured (Age)

Importing all required Libraries

https://www.kaggle.com/mihirjhaveri/insurance-cart-rf-ann-models/notebook

In [1]:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn import tree

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.neural_network import MLPClassifier


from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score, roc_curve, classification_report, confusion_matrix

from sklearn.preprocessing import StandardScaler


from sklearn.model_selection import GridSearchCV
# Import stats from scipy

from scipy import stats

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and the null value condition check; write an inference on it.

Loading the Data

In [2]:

df = pd.read_csv("../input/insurance-data/insurance_part2_data (2).csv")

Checking the data


In [3]:

df.head()

Out[3]:

   Age Agency_Code           Type Claimed  Commision Channel  Duration  Sales       Product Name
0   48         C2B       Airlines      No       0.70  Online         7   2.51    Customised Plan
1   36         EPX  Travel Agency      No       0.00  Online        34  20.00    Customised Plan
2   39         CWT  Travel Agency      No       5.94  Online         3   9.90    Customised Plan
3   36         EPX  Travel Agency      No       0.00  Online         4  26.00  Cancellation Plan
4   33         JZI       Airlines      No       6.30  Online        53  18.00        Bronze Plan

In [4]:

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3000 entries, 0 to 2999

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Age 3000 non-null int64

1 Agency_Code 3000 non-null object

2 Type 3000 non-null object

3 Claimed 3000 non-null object

4 Commision 3000 non-null float64

5 Channel 3000 non-null object

6 Duration 3000 non-null int64

7 Sales 3000 non-null float64

8 Product Name 3000 non-null object

9 Destination 3000 non-null object

dtypes: float64(2), int64(2), object(6)

memory usage: 234.5+ KB


Observation

- 10 variables

- Age, Commision, Duration and Sales are numeric variables

- the rest are categorical variables

- 3000 records, no missing values

- 9 independent variables and one target variable - Claimed

Check for missing value in any column

In [5]:

# Are there any missing values ?

df.isnull().sum()

Out[5]:

Age 0

Agency_Code 0

Type 0

Claimed 0

Commision 0

Channel 0

Duration 0

Sales 0

Product Name 0

Destination 0

dtype: int64

Observation

No missing values

Descriptive Statistics Summary


In [6]:

df.describe().T

Out[6]:

count mean std min 25% 50% 75% max


Age 3000.0 38.091000 10.463518 8.0 32.0 36.00 42.000 84.00
Commision 3000.0 14.529203 25.481455 0.0 0.0 4.63 17.235 210.21
Duration 3000.0 70.001333 134.053313 -1.0 11.0 26.50 63.000 4580.00
Sales 3000.0 60.249913 70.733954 0.0 20.0 33.00 69.000 539.00

In [7]:

## Initial descriptive analysis of the data

df.describe(percentiles=[.25,0.50,0.75,0.90]).T

Out[7]:

count mean std min 25% 50% 75% 90% max


Age 3000.0 38.091000 10.463518 8.0 32.0 36.00 42.000 53.000 84.00
Commision 3000.0 14.529203 25.481455 0.0 0.0 4.63 17.235 48.300 210.21
Duration 3000.0 70.001333 134.053313 -1.0 11.0 26.50 63.000 224.200 4580.00
Sales 3000.0 60.249913 70.733954 0.0 20.0 33.00 69.000 172.025 539.00

Observation

- Duration has a negative value, which is not possible - a wrong entry.

- Commision & Sales - mean and median differ significantly, indicating skew.
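The wrong-entry observation above can be acted on directly. A minimal sketch (toy data standing in for the notebook's dataframe; treating any negative Duration as invalid is an assumption):

```python
import pandas as pd

# Toy frame standing in for the insurance data (column name as in the dataset)
df = pd.DataFrame({"Duration": [7, 34, -1, 3, 4580]})

# Flag physically impossible durations (a tour cannot last a negative number of days)
invalid = df["Duration"] < 0

# One possible remedy: mark them missing so a later imputation step can handle them
df.loc[invalid, "Duration"] = pd.NA
```

Whether to impute, drop, or leave such rows is a judgement call; here only one record is affected, so the impact either way is small.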


In [8]:

df.describe(include='all').T

Out[8]:

              count unique              top  freq     mean      std  min  25%   50%     75%
Age            3000    NaN              NaN   NaN   38.091  10.4635    8   32    36      42
Agency_Code    3000      4              EPX  1365      NaN      NaN  NaN  NaN   NaN     NaN
Type           3000      2    Travel Agency  1837      NaN      NaN  NaN  NaN   NaN     NaN
Claimed        3000      2               No  2076      NaN      NaN  NaN  NaN   NaN     NaN
Commision      3000    NaN              NaN   NaN  14.5292  25.4815    0    0  4.63  17.235
Channel        3000      2           Online  2954      NaN      NaN  NaN  NaN   NaN     NaN
Duration       3000    NaN              NaN   NaN  70.0013  134.053   -1   11  26.5      63
Sales          3000    NaN              NaN   NaN  60.2499   70.734    0   20    33      69
Product Name   3000      5  Customised Plan  1136      NaN      NaN  NaN  NaN   NaN     NaN
Destination    3000      3             ASIA  2465      NaN      NaN  NaN  NaN   NaN     NaN

Observation

The categorical variables have at most 5 unique values.


In [9]:

df.head(10)

Out[9]:

   Age Agency_Code           Type Claimed  Commision Channel  Duration  Sales       Product Name
0   48         C2B       Airlines      No       0.70  Online         7   2.51    Customised Plan
1   36         EPX  Travel Agency      No       0.00  Online        34  20.00    Customised Plan
2   39         CWT  Travel Agency      No       5.94  Online         3   9.90    Customised Plan
3   36         EPX  Travel Agency      No       0.00  Online         4  26.00  Cancellation Plan
4   33         JZI       Airlines      No       6.30  Online        53  18.00        Bronze Plan
5   45         JZI       Airlines     Yes      15.75  Online         8  45.00        Bronze Plan
6   61         CWT  Travel Agency      No      35.64  Online        30  59.40    Customised Plan
7   36         EPX  Travel Agency      No       0.00  Online        16  80.00  Cancellation Plan
8   36         EPX  Travel Agency      No       0.00  Online        19  14.00  Cancellation Plan
9   36         EPX  Travel Agency      No       0.00  Online        42  43.00  Cancellation Plan

Observation

- Data looks good at first glance


In [10]:

df.tail(10)

Out[10]:

      Age Agency_Code           Type Claimed  Commision Channel  Duration   Sales       Product Name
2990   51         EPX  Travel Agency      No       0.00  Online         2   20.00    Customised Plan
2991   29         C2B       Airlines     Yes      48.30  Online       381  193.20        Silver Plan
2992   28         CWT  Travel Agency      No      11.88  Online       389   19.80    Customised Plan
2993   36         EPX  Travel Agency      No       0.00  Online       234   10.00  Cancellation Plan
2994   27         C2B       Airlines     Yes      71.85  Online       416  287.40          Gold Plan
2995   28         CWT  Travel Agency     Yes     166.53  Online       364  256.20          Gold Plan
2996   35         C2B       Airlines      No      13.50  Online         5   54.00          Gold Plan
2997   36         EPX  Travel Agency      No       0.00  Online        54   28.00    Customised Plan
2998   34         C2B       Airlines     Yes       7.64  Online        39   30.55        Bronze Plan
2999   47         JZI       Airlines      No      11.55  Online        15   33.00        Bronze Plan

Observation

- Data looks good at first glance

In [11]:

### data dimensions

df.shape

Out[11]:

(3000, 10)

Getting unique counts of all Nominal Variables


In [12]:

for column in df[['Agency_Code', 'Type', 'Claimed', 'Channel',
                  'Product Name', 'Destination']]:
    print(column.upper(), ': ', df[column].nunique())
    print(df[column].value_counts().sort_values())
    print('\n')


AGENCY_CODE : 4

JZI 239

CWT 472

C2B 924

EPX 1365

Name: Agency_Code, dtype: int64

TYPE : 2

Airlines 1163

Travel Agency 1837

Name: Type, dtype: int64

CLAIMED : 2

Yes 924

No 2076

Name: Claimed, dtype: int64

CHANNEL : 2

Offline 46

Online 2954

Name: Channel, dtype: int64

PRODUCT NAME : 5

Gold Plan 109

Silver Plan 427

Bronze Plan 650

Cancellation Plan 678

Customised Plan 1136

Name: Product Name, dtype: int64

DESTINATION : 3

EUROPE 215

Americas 320

ASIA 2465

Name: Destination, dtype: int64


Check for duplicate data

In [13]:

# Are there any duplicates ?

dups = df.duplicated()

print('Number of duplicate rows = %d' % (dups.sum()))

df[dups]

Number of duplicate rows = 139

Out[13]:

      Age Agency_Code           Type Claimed  Commision Channel  Duration  Sales       Product Name
63     30         C2B       Airlines     Yes       15.0  Online        27   60.0        Bronze Plan
329    36         EPX  Travel Agency      No        0.0  Online         5   20.0    Customised Plan
407    36         EPX  Travel Agency      No        0.0  Online        11   19.0  Cancellation Plan
411    35         EPX  Travel Agency      No        0.0  Online         2   20.0    Customised Plan
422    36         EPX  Travel Agency      No        0.0  Online         5   20.0    Customised Plan
...   ...         ...            ...     ...        ...     ...       ...    ...                ...
2940   36         EPX  Travel Agency      No        0.0  Online         8   10.0  Cancellation Plan
2947   36         EPX  Travel Agency      No        0.0  Online        10   28.0    Customised Plan
2952   36         EPX  Travel Agency      No        0.0  Online         2   10.0  Cancellation Plan
2962   36         EPX  Travel Agency      No        0.0  Online         4   20.0    Customised Plan
2984   36         EPX  Travel Agency      No        0.0  Online         1   20.0    Customised Plan

139 rows × 10 columns

Removing Duplicates - not removing them: there is no unique identifier, so they could be different customers.


Though there are 139 duplicate records, they may belong to different customers; with no customer ID or any other
unique identifier in the data, I am not dropping them.
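Since the data has no customer ID, the duplicates are kept. For contrast, a sketch of what de-duplication would look like if a unique identifier existed (the `CustomerID` column here is hypothetical, not part of this dataset):

```python
import pandas as pd

# Hypothetical frame with an ID column the real dataset lacks
df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "Sales": [20.0, 20.0, 20.0, 60.0],
})

# Full-row duplicates: what df.duplicated() counts in the notebook above
n_dup_rows = df.duplicated().sum()

# With a real identifier, repeats of the same customer could be dropped safely
deduped = df.drop_duplicates(subset=["CustomerID"])
```

Without such a key, identical rows are indistinguishable from distinct customers who happen to share attribute values, which is why they are retained here.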

Univariate Analysis

Age variable

In [14]:

print('Range of values: ', df['Age'].max()-df['Age'].min())

Range of values: 76

In [15]:

#Central values

print('Minimum Age: ', df['Age'].min())

print('Maximum Age: ',df['Age'].max())


print('Mean value: ', df['Age'].mean())

print('Median value: ',df['Age'].median())

print('Standard deviation: ', df['Age'].std())

print('Null values: ',df['Age'].isnull().any())

Minimum Age: 8

Maximum Age: 84

Mean value: 38.091

Median value: 36.0

Standard deviation: 10.463518245377944

Null values: False


In [16]:

#Quartiles

Q1=df['Age'].quantile(q=0.25)

Q3=df['Age'].quantile(q=0.75)

print('Age - 1st Quartile (Q1) is: ', Q1)

print('Age - 3rd Quartile (Q3) is: ', Q3)

print('Interquartile range (IQR) of Age is ', stats.iqr(df['Age']))

Age - 1st Quartile (Q1) is: 32.0

Age - 3rd Quartile (Q3) is: 42.0

Interquartile range (IQR) of Age is 10.0

In [17]:

#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1

#lower 1.5*IQR whisker i.e Q1-1.5*IQR

#upper 1.5*IQR whisker i.e Q3+1.5*IQR

L_outliers=Q1-1.5*(Q3-Q1)

U_outliers=Q3+1.5*(Q3-Q1)

print('Lower outliers in Age: ', L_outliers)

print('Upper outliers in Age: ', U_outliers)

Lower outliers in Age: 17.0

Upper outliers in Age: 57.0


In [18]:

print('Number of outliers in Age upper : ', df[df['Age']>57.0]['Age'].count())

print('Number of outliers in Age lower : ', df[df['Age']<17.0]['Age'].count())

print('% of Outlier in Age upper: ', round(df[df['Age']>57.0]['Age'].count()*100/len(df)), '%')

print('% of Outlier in Age lower: ', round(df[df['Age']<17.0]['Age'].count()*100/len(df)), '%')

Number of outliers in Age upper : 198

Number of outliers in Age lower : 6

% of Outlier in Age upper: 7.0 %


% of Outlier in Age lower: 0.0 %
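The same IQR arithmetic is repeated below for Commision, Duration and Sales; it could be factored into a single helper. A sketch, not part of the original notebook:

```python
import pandas as pd

def iqr_outlier_counts(s: pd.Series, k: float = 1.5):
    """Return (lower_fence, upper_fence, n_below, n_above) for a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return lo, hi, int((s < lo).sum()), int((s > hi).sum())

ages = pd.Series([30, 32, 36, 42, 40, 38, 500])  # toy series with one extreme value
lo, hi, n_lo, n_hi = iqr_outlier_counts(ages)
```

Applying one function per variable keeps the fences and counts consistent and avoids copy-paste slips in the hard-coded thresholds.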

In [19]:

plt.title('Age')

sns.boxplot(df['Age'], orient='h', color='purple')

Out[19]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2839607e10>


In [20]:

fig, (ax2,ax3)=plt.subplots(1,2,figsize=(13,5))

#distplot

sns.distplot(df['Age'],ax=ax2)

ax2.set_xlabel('Age', fontsize=15)

ax2.tick_params(labelsize=15)

#histogram

ax3.hist(df['Age'])

ax3.set_xlabel('Age', fontsize=15)

ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)

plt.tight_layout()

Commision variable

In [21]:

print('Range of values: ', df['Commision'].max()-df['Commision'].min())

Range of values: 210.21


In [22]:

#Central values

print('Minimum Commision: ', df['Commision'].min())

print('Maximum Commision: ',df['Commision'].max())

print('Mean value: ', df['Commision'].mean())

print('Median value: ',df['Commision'].median())

print('Standard deviation: ', df['Commision'].std())

print('Null values: ',df['Commision'].isnull().any())

Minimum Commision: 0.0

Maximum Commision: 210.21

Mean value: 14.529203333333266

Median value: 4.63

Standard deviation: 25.48145450662553

Null values: False

In [23]:

#Quartiles

Q1=df['Commision'].quantile(q=0.25)

Q3=df['Commision'].quantile(q=0.75)

print('Commision - 1st Quartile (Q1) is: ', Q1)

print('Commision - 3rd Quartile (Q3) is: ', Q3)

print('Interquartile range (IQR) of Commision is ', stats.iqr(df['Commision']))

Commision - 1st Quartile (Q1) is: 0.0

Commision - 3rd Quartile (Q3) is: 17.235

Interquartile range (IQR) of Commision is 17.235


In [24]:

#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1

#lower 1.5*IQR whisker i.e Q1-1.5*IQR

#upper 1.5*IQR whisker i.e Q3+1.5*IQR

L_outliers=Q1-1.5*(Q3-Q1)

U_outliers=Q3+1.5*(Q3-Q1)

print('Lower outliers in Commision: ', L_outliers)

print('Upper outliers in Commision: ', U_outliers)

Lower outliers in Commision: -25.8525

Upper outliers in Commision: 43.0875

In [25]:

print('Number of outliers in Commision upper : ', df[df['Commision']>43.0875]['Commision'].count())

print('Number of outliers in Commision lower : ', df[df['Commision']<-25.8525]['Commision'].count())

print('% of Outlier in Commision upper: ', round(df[df['Commision']>43.0875]['Commision'].count()*100/len(df)), '%')

print('% of Outlier in Commision lower: ', round(df[df['Commision']<-25.8525]['Commision'].count()*100/len(df)), '%')

Number of outliers in Commision upper : 362

Number of outliers in Commision lower : 0

% of Outlier in Commision upper: 12.0 %

% of Outlier in Commision lower: 0.0 %


In [26]:

plt.title('Commision')

sns.boxplot(df['Commision'], orient='h', color='purple')

Out[26]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836b40150>


In [27]:

fig, (ax2,ax3)=plt.subplots(1,2,figsize=(13,5))

#distplot

sns.distplot(df['Commision'],ax=ax2)

ax2.set_xlabel('Commision', fontsize=15)

ax2.tick_params(labelsize=15)

#histogram

ax3.hist(df['Commision'])

ax3.set_xlabel('Commision', fontsize=15)

ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)

plt.tight_layout()

Duration variable

In [28]:

print('Range of values: ', df['Duration'].max()-df['Duration'].min())

Range of values: 4581


In [29]:

#Central values

print('Minimum Duration: ', df['Duration'].min())

print('Maximum Duration: ',df['Duration'].max())

print('Mean value: ', df['Duration'].mean())

print('Median value: ',df['Duration'].median())

print('Standard deviation: ', df['Duration'].std())

print('Null values: ',df['Duration'].isnull().any())

Minimum Duration: -1

Maximum Duration: 4580

Mean value: 70.00133333333333

Median value: 26.5

Standard deviation: 134.05331313253495

Null values: False

In [30]:

#Quartiles

Q1=df['Duration'].quantile(q=0.25)

Q3=df['Duration'].quantile(q=0.75)

print('Duration - 1st Quartile (Q1) is: ', Q1)

print('Duration - 3rd Quartile (Q3) is: ', Q3)

print('Interquartile range (IQR) of Duration is ', stats.iqr(df['Duration']))

Duration - 1st Quartile (Q1) is: 11.0

Duration - 3rd Quartile (Q3) is: 63.0

Interquartile range (IQR) of Duration is 52.0


In [31]:

#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1

#lower 1.5*IQR whisker i.e Q1-1.5*IQR

#upper 1.5*IQR whisker i.e Q3+1.5*IQR

L_outliers=Q1-1.5*(Q3-Q1)

U_outliers=Q3+1.5*(Q3-Q1)

print('Lower outliers in Duration: ', L_outliers)

print('Upper outliers in Duration: ', U_outliers)

Lower outliers in Duration: -67.0

Upper outliers in Duration: 141.0

In [32]:

print('Number of outliers in Duration upper : ', df[df['Duration']>141.0]['Duration'].count())

print('Number of outliers in Duration lower : ', df[df['Duration']<-67.0]['Duration'].count())

print('% of Outlier in Duration upper: ', round(df[df['Duration']>141.0]['Duration'].count()*100/len(df)), '%')

print('% of Outlier in Duration lower: ', round(df[df['Duration']<-67.0]['Duration'].count()*100/len(df)), '%')

Number of outliers in Duration upper : 382

Number of outliers in Duration lower : 0

% of Outlier in Duration upper: 13.0 %

% of Outlier in Duration lower: 0.0 %


In [33]:

plt.title('Duration')

sns.boxplot(df['Duration'], orient='h', color='purple')

Out[33]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836a4a890>


In [34]:

fig, (ax2,ax3)=plt.subplots(1,2,figsize=(13,5))

#distplot

sns.distplot(df['Duration'],ax=ax2)

ax2.set_xlabel('Duration', fontsize=15)

ax2.tick_params(labelsize=15)

#histogram

ax3.hist(df['Duration'])

ax3.set_xlabel('Duration', fontsize=15)

ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)

plt.tight_layout()

Sales variable

In [35]:

print('Range of values: ', df['Sales'].max()-df['Sales'].min())

Range of values: 539.0


In [36]:

#Central values

print('Minimum Sales: ', df['Sales'].min())

print('Maximum Sales: ',df['Sales'].max())

print('Mean value: ', df['Sales'].mean())

print('Median value: ',df['Sales'].median())

print('Standard deviation: ', df['Sales'].std())

print('Null values: ',df['Sales'].isnull().any())

Minimum Sales: 0.0

Maximum Sales: 539.0

Mean value: 60.24991333333344

Median value: 33.0

Standard deviation: 70.73395353143047

Null values: False

In [37]:

#Quartiles

Q1=df['Sales'].quantile(q=0.25)

Q3=df['Sales'].quantile(q=0.75)

print('Sales - 1st Quartile (Q1) is: ', Q1)

print('Sales - 3rd Quartile (Q3) is: ', Q3)

print('Interquartile range (IQR) of Sales is ', stats.iqr(df['Sales']))

Sales - 1st Quartile (Q1) is: 20.0

Sales - 3rd Quartile (Q3) is: 69.0

Interquartile range (IQR) of Sales is 49.0


In [38]:

#Outlier detection from Interquartile range (IQR) in original data

# IQR=Q3-Q1

#lower 1.5*IQR whisker i.e Q1-1.5*IQR

#upper 1.5*IQR whisker i.e Q3+1.5*IQR

L_outliers=Q1-1.5*(Q3-Q1)

U_outliers=Q3+1.5*(Q3-Q1)

print('Lower outliers in Sales: ', L_outliers)

print('Upper outliers in Sales: ', U_outliers)

Lower outliers in Sales: -53.5

Upper outliers in Sales: 142.5

In [39]:

print('Number of outliers in Sales upper : ', df[df['Sales']>142.5]['Sales'].count())

print('Number of outliers in Sales lower : ', df[df['Sales']<-53.5]['Sales'].count())

print('% of Outlier in Sales upper: ', round(df[df['Sales']>142.5]['Sales'].count()*100/len(df)), '%')

print('% of Outlier in Sales lower: ', round(df[df['Sales']<-53.5]['Sales'].count()*100/len(df)), '%')

Number of outliers in Sales upper : 353

Number of outliers in Sales lower : 0

% of Outlier in Sales upper: 12.0 %

% of Outlier in Sales lower: 0.0 %


In [40]:

plt.title('Sales')

sns.boxplot(df['Sales'], orient='h', color='purple')

Out[40]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f28368b67d0>


In [41]:

fig, (ax2,ax3)=plt.subplots(1,2,figsize=(13,5))

#distplot

sns.distplot(df['Sales'],ax=ax2)

ax2.set_xlabel('Sales', fontsize=15)

ax2.tick_params(labelsize=15)

#histogram

ax3.hist(df['Sales'])

ax3.set_xlabel('Sales', fontsize=15)

ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)

plt.tight_layout()

There are outliers in all the numeric variables, but Sales and Commision can be genuine business values. Random
Forest and CART can handle outliers, so the outliers are not treated for now; we keep the data as it is.

I will treat the outliers for the ANN model and compare results after all the steps, just for comparison.
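For that ANN pass, one common treatment is capping (winsorising) at the IQR fences rather than dropping rows. A hedged sketch of such a helper; the notebook does not prescribe this exact function:

```python
import pandas as pd

def cap_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Winsorise a Series at the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

sales = pd.Series([20.0, 33.0, 69.0, 539.0])  # toy values echoing the Sales quartiles
capped = cap_iqr(sales)
```

Capping preserves the row count (important when the target is attached to each row) while pulling extreme values back to the fence, which tends to stabilise gradient-based models like the MLP.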

Categorical Variables

Agency_Code

Count Plot


In [42]:

sns.countplot(data = df, x = 'Agency_Code')

Out[42]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f28366bebd0>

Boxplot


In [43]:

sns.boxplot(data = df, x='Agency_Code',y='Sales', hue='Claimed')

Out[43]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f283660d610>

Swarm plot


In [44]:

sns.swarmplot(data = df, x='Agency_Code',y='Sales')

Out[44]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836527e50>

Combined Violin plot and Swarm plot


In [45]:

sns.violinplot(data = df, x='Agency_Code',y='Sales')

sns.swarmplot(data = df, x='Agency_Code',y='Sales', color = 'k', alpha = 0.6)

Out[45]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f28368bc3d0>

Type


In [46]:

sns.countplot(data = df, x = 'Type')

Out[46]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f28368dd950>

In [47]:

sns.boxplot(data = df, x='Type',y='Sales', hue='Claimed')

Out[47]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836b6ca50>


In [48]:

sns.swarmplot(data = df, x='Type',y='Sales')

Out[48]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836a98d10>

In [49]:

sns.violinplot(data = df, x='Type',y='Sales')

sns.swarmplot(data = df, x='Type',y='Sales', color = 'k', alpha = 0.6)

Out[49]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836e20910>


Channel

In [50]:

sns.countplot(data = df, x = 'Channel')

Out[50]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836be08d0>


In [51]:

sns.boxplot(data = df, x='Channel',y='Sales', hue='Claimed')

Out[51]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836d1e890>

In [52]:

sns.swarmplot(data = df, x='Channel',y='Sales')

Out[52]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836c36e50>


In [53]:

sns.violinplot(data = df, x='Channel',y='Sales')

sns.swarmplot(data = df, x='Channel',y='Sales', color = 'k', alpha = 0.6)

Out[53]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2834cd2d50>

Product Name


In [54]:

sns.countplot(data = df, x = 'Product Name')

Out[54]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2834cba390>

In [55]:

sns.boxplot(data = df, x='Product Name',y='Sales', hue='Claimed')

Out[55]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2834c36610>


In [56]:

sns.swarmplot(data = df, x='Product Name',y='Sales')

Out[56]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2834aee750>

In [57]:

sns.violinplot(data = df, x='Product Name',y='Sales')

sns.swarmplot(data = df, x='Product Name',y='Sales', color = 'k', alpha = 0.6)

Out[57]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2834a52490>


Destination

In [58]:

sns.countplot(data = df, x = 'Destination')

Out[58]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f283498c2d0>


In [59]:

sns.boxplot(data = df, x='Destination',y='Sales', hue='Claimed')

Out[59]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f28349567d0>

In [60]:

sns.swarmplot(data = df, x='Destination',y='Sales')

Out[60]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f28348ad290>


In [61]:

sns.violinplot(data = df, x='Destination',y='Sales')

sns.swarmplot(data = df, x='Destination',y='Sales', color = 'k', alpha = 0.6)

Out[61]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2836a907d0>

Checking pairwise distribution of the continuous variables


In [62]:

sns.pairplot(df[['Age', 'Commision',

'Duration', 'Sales']])


Out[62]:

<seaborn.axisgrid.PairGrid at 0x7f2834cbcc10>


Checking for Correlations

In [63]:

# construct heatmap with only continuous variables

plt.figure(figsize=(10,8))

sns.set(font_scale=1.2)

sns.heatmap(df[['Age', 'Commision',

'Duration', 'Sales']].corr(), annot=True)

Out[63]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2834099b50>

Converting all objects to categorical codes


In [64]:

for feature in df.columns:
    if df[feature].dtype == 'object':
        print('\n')
        print('feature:', feature)
        print(pd.Categorical(df[feature].unique()))
        print(pd.Categorical(df[feature].unique()).codes)
        df[feature] = pd.Categorical(df[feature]).codes


feature: Agency_Code

['C2B', 'EPX', 'CWT', 'JZI']

Categories (4, object): ['C2B', 'CWT', 'EPX', 'JZI']

[0 2 1 3]

feature: Type

['Airlines', 'Travel Agency']

Categories (2, object): ['Airlines', 'Travel Agency']

[0 1]

feature: Claimed

['No', 'Yes']

Categories (2, object): ['No', 'Yes']

[0 1]

feature: Channel

['Online', 'Offline']

Categories (2, object): ['Offline', 'Online']

[1 0]

feature: Product Name

['Customised Plan', 'Cancellation Plan', 'Bronze Plan', 'Silver Plan', 'Gold Plan']

Categories (5, object): ['Bronze Plan', 'Cancellation Plan', 'Customised Plan', 'Gold Plan', 'Silver Plan']

[2 1 0 4 3]

feature: Destination

['ASIA', 'Americas', 'EUROPE']

Categories (3, object): ['ASIA', 'Americas', 'EUROPE']

[0 1 2]
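One caveat worth noting: `pd.Categorical(...).codes` imposes an arbitrary integer ordering on nominal features. Tree-based models cope with that, but for the ANN a one-hot encoding is often preferred. A sketch of the alternative (not what this notebook does):

```python
import pandas as pd

# One of the notebook's nominal columns, with its actual category values
df = pd.DataFrame({"Destination": ["ASIA", "Americas", "EUROPE", "ASIA"]})

# One binary indicator column per category, with no implied order between them
onehot = pd.get_dummies(df, columns=["Destination"], prefix="Dest")
```

The trade-off is dimensionality: with at most 5 unique values per categorical column here, one-hot encoding stays cheap, but integer codes keep the feature count identical to the original.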


In [65]:

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3000 entries, 0 to 2999

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Age 3000 non-null int64

1 Agency_Code 3000 non-null int8

2 Type 3000 non-null int8

3 Claimed 3000 non-null int8

4 Commision 3000 non-null float64

5 Channel 3000 non-null int8

6 Duration 3000 non-null int64

7 Sales 3000 non-null float64

8 Product Name 3000 non-null int8

9 Destination 3000 non-null int8

dtypes: float64(2), int64(2), int8(6)

memory usage: 111.5 KB

In [66]:

df.head()

Out[66]:

   Age  Agency_Code  Type  Claimed  Commision  Channel  Duration  Sales  Product Name  Destination
0   48            0     0        0       0.70        1         7   2.51             2            0
1   36            2     1        0       0.00        1        34  20.00             2            0
2   39            1     1        0       5.94        1         3   9.90             2            1
3   36            2     1        0       0.00        1         4  26.00             1            0
4   33            3     0        0       6.30        1        53  18.00             0            0

Proportion of 1s and 0s


In [67]:

df.Claimed.value_counts(normalize=True)

Out[67]:

0 0.692

1 0.308

Name: Claimed, dtype: float64
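Given the 69/31 class split shown above, passing `stratify=y` to `train_test_split` keeps the claim ratio essentially identical in train and test. The notebook's split below does not stratify, so this is an alternative sketch with toy data mimicking the class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 69 + [1] * 31)  # mimic the 0.692 / 0.308 claim proportion

# stratify=y preserves the class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=5, stratify=y
)
```

With a 30% holdout and a moderately imbalanced target, stratification mainly guards against an unlucky split that starves one partition of positive examples.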


2.2 Data Split: Split the data into train and test sets; build
classification models CART, Random Forest and Artificial Neural
Network

Extracting the target column into separate vectors for training set
and test set

In [68]:

X = df.drop("Claimed", axis=1)

y = df.pop("Claimed")

X.head()

Out[68]:

   Age  Agency_Code  Type  Commision  Channel  Duration  Sales  Product Name  Destination
0   48            0     0       0.70        1         7   2.51             2            0
1   36            2     1       0.00        1        34  20.00             2            0
2   39            1     1       5.94        1         3   9.90             2            1
3   36            2     1       0.00        1         4  26.00             1            0
4   33            3     0       6.30        1        53  18.00             0            0


In [69]:

# prior to scaling

plt.plot(X)

plt.show()

In [70]:

# Scaling the attributes.

from scipy.stats import zscore

X_scaled=X.apply(zscore)

X_scaled.head()

Out[70]:

        Age  Agency_Code      Type  Commision   Channel  Duration     Sales  Product Name
0  0.947162    -1.314358 -1.256796  -0.542807  0.124788 -0.470051 -0.816433       0.26883
1 -0.199870     0.697928  0.795674  -0.570282  0.124788 -0.268605 -0.569127       0.26883
2  0.086888    -0.308215  0.795674  -0.337133  0.124788 -0.499894 -0.711940       0.26883
3 -0.199870     0.697928  0.795674  -0.570282  0.124788 -0.492433 -0.484288      -0.52575
4 -0.486629     1.704071 -1.256796  -0.323003  0.124788 -0.126846 -0.597407      -1.32033


In [71]:

# after scaling

plt.plot(X_scaled)

plt.show()

Splitting data into training and test set

In [72]:

from sklearn.model_selection import train_test_split

X_train, X_test, train_labels, test_labels = train_test_split(X_scaled, y, test_size=0.30, random_state=5)

Checking the dimensions of the training and test data


In [73]:

print('X_train',X_train.shape)

print('X_test',X_test.shape)

print('train_labels',train_labels.shape)

print('test_labels',test_labels.shape)

X_train (2100, 9)

X_test (900, 9)

train_labels (2100,)

test_labels (900,)
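One refinement worth considering (not used in this notebook): with a roughly 69/31 class split, passing `stratify=y` to `train_test_split` keeps the Claimed proportion nearly identical in both partitions. A sketch on toy data mirroring that ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 toy rows with the ~69/31 ratio seen in df.Claimed.value_counts().
X_demo = pd.DataFrame({"f1": range(100)})
y_demo = pd.Series([0] * 69 + [1] * 31)

# stratify preserves the class ratio across the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=5, stratify=y_demo
)

# Both partitions carry close to 31% positives.
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))
```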

Building a Decision Tree Classifier

In [74]:

param_grid_dtcl = {
    'criterion': ['gini'],
    'max_depth': [10, 20, 30, 50],
    'min_samples_leaf': [50, 100, 150],
    'min_samples_split': [150, 300, 450]
}

dtcl = DecisionTreeClassifier(random_state=1)

grid_search_dtcl = GridSearchCV(estimator=dtcl, param_grid=param_grid_dtcl, cv=10)


In [75]:

grid_search_dtcl.fit(X_train, train_labels)

print(grid_search_dtcl.best_params_)

best_grid_dtcl = grid_search_dtcl.best_estimator_

best_grid_dtcl

#{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}

{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}

Out[75]:

DecisionTreeClassifier(max_depth=10, min_samples_leaf=50, min_samples_split=450,
                       random_state=1)

In [76]:

param_grid_dtcl = {
    'criterion': ['gini'],
    'max_depth': [3, 5, 7, 10, 12],
    'min_samples_leaf': [20, 30, 40, 50, 60],
    'min_samples_split': [150, 300, 450]
}

dtcl = DecisionTreeClassifier(random_state=1)

grid_search_dtcl = GridSearchCV(estimator=dtcl, param_grid=param_grid_dtcl, cv=10)


In [77]:

grid_search_dtcl.fit(X_train, train_labels)

print(grid_search_dtcl.best_params_)

best_grid_dtcl = grid_search_dtcl.best_estimator_

best_grid_dtcl

#{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}

{'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 20, 'min_samples_split': 150}

Out[77]:

DecisionTreeClassifier(max_depth=5, min_samples_leaf=20, min_samples_split=150,
                       random_state=1)

In [78]:

param_grid_dtcl = {
    'criterion': ['gini'],
    # Note: recent scikit-learn versions require an integer max_depth;
    # the float values below only ran on the older version used here.
    'max_depth': [3.5, 4.0, 4.5, 5.0, 5.5],
    'min_samples_leaf': [40, 42, 44, 46, 48, 50, 52, 54],
    'min_samples_split': [250, 270, 280, 290, 300, 310]
}

dtcl = DecisionTreeClassifier(random_state=1)

grid_search_dtcl = GridSearchCV(estimator=dtcl, param_grid=param_grid_dtcl, cv=10)


In [79]:

grid_search_dtcl.fit(X_train, train_labels)

print(grid_search_dtcl.best_params_)

best_grid_dtcl = grid_search_dtcl.best_estimator_

best_grid_dtcl

#{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}

{'criterion': 'gini', 'max_depth': 3.5, 'min_samples_leaf': 44, 'min_samples_split': 250}

Out[79]:

DecisionTreeClassifier(max_depth=3.5, min_samples_leaf=44,
                       min_samples_split=250, random_state=1)

In [80]:

param_grid_dtcl = {
    'criterion': ['gini'],
    'max_depth': [4.85, 4.90, 4.95, 5.0, 5.05, 5.10, 5.15],
    'min_samples_leaf': [40, 41, 42, 43, 44],
    'min_samples_split': [150, 175, 200, 210, 220, 230, 240, 250, 260, 270]
}

dtcl = DecisionTreeClassifier(random_state=1)

grid_search_dtcl = GridSearchCV(estimator=dtcl, param_grid=param_grid_dtcl, cv=10)


In [81]:

grid_search_dtcl.fit(X_train, train_labels)

print(grid_search_dtcl.best_params_)

best_grid_dtcl = grid_search_dtcl.best_estimator_

best_grid_dtcl

#{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}

{'criterion': 'gini', 'max_depth': 4.85, 'min_samples_leaf': 44, 'min_samples_split': 260}

Out[81]:

DecisionTreeClassifier(max_depth=4.85, min_samples_leaf=44,
                       min_samples_split=260, random_state=1)

Generating Tree

In [82]:

train_char_label = ['no', 'yes']

tree_regularized = open('tree_regularized.dot','w')

dot_data = tree.export_graphviz(best_grid_dtcl, out_file= tree_regularized ,

feature_names = list(X_train),

class_names = list(train_char_label))

tree_regularized.close()

dot_data

The generated tree_regularized.dot file can be pasted into http://webgraphviz.com/ to render the tree.
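If Graphviz is not available, `sklearn.tree.plot_tree` renders the same tree directly with matplotlib. A sketch on a small stand-in model (in the notebook you would pass `best_grid_dtcl` and `feature_names=list(X_train)`; the output file name is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Small stand-in for best_grid_dtcl trained on X_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=1)
model = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo)

fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(model, class_names=["no", "yes"], filled=True, ax=ax)
fig.savefig("tree_regularized.png")
```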

Variable Importance - DTCL


In [83]:

print (pd.DataFrame(best_grid_dtcl.feature_importances_, columns = ["Imp"],

index = X_train.columns).sort_values('Imp',ascending=False))

Imp

Agency_Code 0.634112

Sales 0.220899

Product Name 0.086632

Commision 0.021881

Age 0.019940

Duration 0.016536

Type 0.000000

Channel 0.000000

Destination 0.000000
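The importance table above is easier to read as a chart; a sketch using the values printed above (the output file name is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import pandas as pd

# Importances copied from the table printed above; they sum to 1.
imp = pd.Series({
    "Agency_Code": 0.634112, "Sales": 0.220899, "Product Name": 0.086632,
    "Commision": 0.021881, "Age": 0.019940, "Duration": 0.016536,
    "Type": 0.0, "Channel": 0.0, "Destination": 0.0,
}).sort_values()

ax = imp.plot(kind="barh", title="CART feature importances")
ax.figure.tight_layout()
ax.figure.savefig("dtcl_importances.png")
```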

Predicting on Training and Test dataset

In [84]:

ytrain_predict_dtcl = best_grid_dtcl.predict(X_train)

ytest_predict_dtcl = best_grid_dtcl.predict(X_test)

Getting the Predicted Classes and Probs


In [85]:

ytest_predict_dtcl

ytest_predict_prob_dtcl=best_grid_dtcl.predict_proba(X_test)

ytest_predict_prob_dtcl

pd.DataFrame(ytest_predict_prob_dtcl).head()

Out[85]:

0 1
0 0.697947 0.302053
1 0.979452 0.020548
2 0.921171 0.078829
3 0.510417 0.489583
4 0.921171 0.078829

Building a Random Forest Classifier

A first, coarser round of the Random Forest grid search was run as:

param_grid_rfcl = {
    'max_depth': [5, 10, 15],           # 20,30,40
    'max_features': [4, 5, 6, 7],       # 7,8,9
    'min_samples_leaf': [10, 50, 70],   # 50,100
    'min_samples_split': [30, 50, 70],  # 60,70
    'n_estimators': [200, 250, 300]     # 100,200
}

rfcl = RandomForestClassifier(random_state=1)

grid_search_rfcl = GridSearchCV(estimator=rfcl, param_grid=param_grid_rfcl, cv=5)

grid_search_rfcl.fit(X_train, train_labels)
print(grid_search_rfcl.best_params_)
best_grid_rfcl = grid_search_rfcl.best_estimator_
best_grid_rfcl



In [86]:

param_grid_rfcl = {
    'max_depth': [4, 5, 6],             # 20,30,40
    'max_features': [2, 3, 4, 5],       # 7,8,9
    'min_samples_leaf': [8, 9, 11, 15], # 50,100
    'min_samples_split': [46, 50, 55],  # 60,70
    'n_estimators': [290, 350, 400]     # 100,200
}

rfcl = RandomForestClassifier(random_state=1)

grid_search_rfcl = GridSearchCV(estimator=rfcl, param_grid=param_grid_rfcl, cv=5)

In [87]:

grid_search_rfcl.fit(X_train, train_labels)

print(grid_search_rfcl.best_params_)

best_grid_rfcl = grid_search_rfcl.best_estimator_

best_grid_rfcl

#{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}

{'max_depth': 6, 'max_features': 3, 'min_samples_leaf': 8, 'min_samples_split': 46, 'n_estimators': 350}

Out[87]:

RandomForestClassifier(max_depth=6, max_features=3, min_samples_leaf=8,
                       min_samples_split=46, n_estimators=350, random_state=1)

Predicting the Training and Testing data

In [88]:

ytrain_predict_rfcl = best_grid_rfcl.predict(X_train)

ytest_predict_rfcl = best_grid_rfcl.predict(X_test)


Getting the Predicted Classes and Probs

In [89]:

ytest_predict_rfcl

ytest_predict_prob_rfcl=best_grid_rfcl.predict_proba(X_test)

ytest_predict_prob_rfcl

pd.DataFrame(ytest_predict_prob_rfcl).head()

Out[89]:

0 1
0 0.778010 0.221990
1 0.971910 0.028090
2 0.904401 0.095599
3 0.651398 0.348602
4 0.868406 0.131594

Variable Importance via RF

In [90]:

# Variable Importance

print (pd.DataFrame(best_grid_rfcl.feature_importances_,

columns = ["Imp"],

index = X_train.columns).sort_values('Imp',ascending=False))

Imp

Agency_Code 0.276015

Product Name 0.235583

Sales 0.152733

Commision 0.135997

Duration 0.077475

Type 0.071019

Age 0.039503

Destination 0.008971

Channel 0.002705


Building a Neural Network Classifier

In [91]:

param_grid_nncl = {
    'hidden_layer_sizes': [50, 100, 200],  # 50, 200
    'max_iter': [2500, 3000, 4000],        # 5000,2500
    'solver': ['adam'],                    # sgd
    'tol': [0.01]
}

nncl = MLPClassifier(random_state=1)

grid_search_nncl = GridSearchCV(estimator=nncl, param_grid=param_grid_nncl, cv=10)

In [92]:

grid_search_nncl.fit(X_train, train_labels)

grid_search_nncl.best_params_

best_grid_nncl = grid_search_nncl.best_estimator_

best_grid_nncl

Out[92]:

MLPClassifier(hidden_layer_sizes=200, max_iter=2500, random_state=1, tol=0.01)
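Note that an int for `hidden_layer_sizes` (as in the grid above) means a single hidden layer of that width; a tuple stacks several layers, which could be a worthwhile extension of the grid. A quick illustration (tiny synthetic data, so convergence warnings are suppressed):

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

warnings.simplefilter("ignore", ConvergenceWarning)  # tiny demo, few iterations

X_demo, y_demo = make_classification(n_samples=60, random_state=1)

# n_layers_ counts input + hidden + output layers.
single = MLPClassifier(hidden_layer_sizes=200, max_iter=50, random_state=1).fit(X_demo, y_demo)
double = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=50, random_state=1).fit(X_demo, y_demo)

print(single.n_layers_, double.n_layers_)  # 3 4
```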

Predicting the Training and Testing data

In [93]:

ytrain_predict_nncl = best_grid_nncl.predict(X_train)

ytest_predict_nncl = best_grid_nncl.predict(X_test)

Getting the Predicted Classes and Probs


In [94]:

ytest_predict_nncl

ytest_predict_prob_nncl=best_grid_nncl.predict_proba(X_test)

ytest_predict_prob_nncl

pd.DataFrame(ytest_predict_prob_nncl).head()

Out[94]:

0 1
0 0.822676 0.177324
1 0.933407 0.066593
2 0.918772 0.081228
3 0.688933 0.311067
4 0.913425 0.086575

--------------------------------------------------------------------------------

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model

CART - AUC and ROC for the training data


In [95]:

# predict probabilities

probs_cart = best_grid_dtcl.predict_proba(X_train)

# keep probabilities for the positive outcome only

probs_cart = probs_cart[:, 1]

# calculate AUC

cart_train_auc = roc_auc_score(train_labels, probs_cart)

print('AUC: %.3f' % cart_train_auc)

# calculate roc curve

cart_train_fpr, cart_train_tpr, cart_train_thresholds = roc_curve(train_labels, probs_cart)

plt.plot([0, 1], [0, 1], linestyle='--')

# plot the roc curve for the model

plt.plot(cart_train_fpr, cart_train_tpr)

AUC: 0.823

Out[95]:

[<matplotlib.lines.Line2D at 0x7f2836b1d350>]

CART - AUC and ROC for the test data


In [96]:

# predict probabilities

probs_cart = best_grid_dtcl.predict_proba(X_test)

# keep probabilities for the positive outcome only

probs_cart = probs_cart[:, 1]

# calculate AUC

cart_test_auc = roc_auc_score(test_labels, probs_cart)

print('AUC: %.3f' % cart_test_auc)

# calculate roc curve

cart_test_fpr, cart_test_tpr, cart_test_thresholds = roc_curve(test_labels, probs_cart)

plt.plot([0, 1], [0, 1], linestyle='--')

# plot the roc curve for the model

plt.plot(cart_test_fpr, cart_test_tpr)

AUC: 0.801

Out[96]:

[<matplotlib.lines.Line2D at 0x7f2834617a10>]

CART Confusion Matrix and Classification Report for the training data


In [97]:

confusion_matrix(train_labels, ytrain_predict_dtcl)

Out[97]:

array([[1309, 144],

[ 307, 340]])

In [98]:

#Train Data Accuracy

cart_train_acc=best_grid_dtcl.score(X_train,train_labels)

cart_train_acc

Out[98]:

0.7852380952380953

In [99]:

print(classification_report(train_labels, ytrain_predict_dtcl))

precision recall f1-score support

0 0.81 0.90 0.85 1453

1 0.70 0.53 0.60 647

accuracy 0.79 2100

macro avg 0.76 0.71 0.73 2100

weighted avg 0.78 0.79 0.78 2100


In [100]:

cart_metrics=classification_report(train_labels, ytrain_predict_dtcl,output_dict=True)

df=pd.DataFrame(cart_metrics).transpose()

cart_train_f1=round(df.loc["1"][2],2)

cart_train_recall=round(df.loc["1"][1],2)

cart_train_precision=round(df.loc["1"][0],2)

print ('cart_train_precision ',cart_train_precision)

print ('cart_train_recall ',cart_train_recall)

print ('cart_train_f1 ',cart_train_f1)

cart_train_precision 0.7

cart_train_recall 0.53

cart_train_f1 0.6

CART Confusion Matrix and Classification Report for the testing data

In [101]:

confusion_matrix(test_labels, ytest_predict_dtcl)

Out[101]:

array([[553, 70],

[136, 141]])

In [102]:

#Test Data Accuracy

cart_test_acc=best_grid_dtcl.score(X_test,test_labels)

cart_test_acc

Out[102]:

0.7711111111111111


In [103]:

print(classification_report(test_labels, ytest_predict_dtcl))

precision recall f1-score support

0 0.80 0.89 0.84 623

1 0.67 0.51 0.58 277

accuracy 0.77 900

macro avg 0.74 0.70 0.71 900

weighted avg 0.76 0.77 0.76 900

In [104]:

cart_metrics=classification_report(test_labels, ytest_predict_dtcl,output_dict=True)

df=pd.DataFrame(cart_metrics).transpose()

cart_test_precision=round(df.loc["1"][0],2)

cart_test_recall=round(df.loc["1"][1],2)

cart_test_f1=round(df.loc["1"][2],2)

print ('cart_test_precision ',cart_test_precision)

print ('cart_test_recall ',cart_test_recall)

print ('cart_test_f1 ',cart_test_f1)

cart_test_precision 0.67

cart_test_recall 0.51

cart_test_f1 0.58
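Instead of round-tripping through `classification_report(..., output_dict=True)` and a DataFrame, the positive-class metrics can be pulled directly with `precision_recall_fscore_support`. A sketch (the helper name `claim_metrics` is ours, not from the notebook):

```python
from sklearn.metrics import precision_recall_fscore_support

def claim_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive (Claimed = 1) class."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=[1], zero_division=0
    )
    return round(p[0], 2), round(r[0], 2), round(f1[0], 2)

# Tiny check: one of two predicted positives is right (precision 0.5),
# one of two actual positives is found (recall 0.5).
print(claim_metrics([1, 0, 1, 0], [1, 1, 0, 0]))  # (0.5, 0.5, 0.5)
```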


Cart Conclusion

Train Data:

- AUC: 82%

- Accuracy: 79%

- Precision: 70%

- Recall: 53%

- f1-Score: 60%

Test Data:

- AUC: 80%

- Accuracy: 77%

- Precision: 67%

- Recall: 51%

- f1-Score: 58%

Training and Test set results are close, and with the overall measures reasonably high, the model generalizes well.

Agency_Code is the most important variable for predicting claim status.

RF Model Performance Evaluation on Training data

In [105]:

confusion_matrix(train_labels,ytrain_predict_rfcl)

Out[105]:

array([[1297, 156],

[ 255, 392]])


In [106]:

rf_train_acc=best_grid_rfcl.score(X_train,train_labels)

rf_train_acc

Out[106]:

0.8042857142857143

In [107]:

print(classification_report(train_labels,ytrain_predict_rfcl))

precision recall f1-score support

0 0.84 0.89 0.86 1453

1 0.72 0.61 0.66 647

accuracy 0.80 2100

macro avg 0.78 0.75 0.76 2100

weighted avg 0.80 0.80 0.80 2100

In [108]:

rf_metrics=classification_report(train_labels, ytrain_predict_rfcl,output_dict=True)

df=pd.DataFrame(rf_metrics).transpose()

rf_train_precision=round(df.loc["1"][0],2)

rf_train_recall=round(df.loc["1"][1],2)

rf_train_f1=round(df.loc["1"][2],2)

print ('rf_train_precision ',rf_train_precision)

print ('rf_train_recall ',rf_train_recall)

print ('rf_train_f1 ',rf_train_f1)

rf_train_precision 0.72

rf_train_recall 0.61

rf_train_f1 0.66


In [109]:

rf_train_fpr, rf_train_tpr, _ = roc_curve(train_labels, best_grid_rfcl.predict_proba(X_train)[:,1])

plt.plot(rf_train_fpr,rf_train_tpr,color='green')

plt.plot([0, 1], [0, 1], linestyle='--')

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC')

rf_train_auc=roc_auc_score(train_labels,best_grid_rfcl.predict_proba(X_train)[:,1])

print('Area under Curve is', rf_train_auc)

Area under Curve is 0.8563713512840778

RF Model Performance Evaluation on Test data

In [110]:

confusion_matrix(test_labels,ytest_predict_rfcl)

Out[110]:

array([[550, 73],

[121, 156]])


In [111]:

rf_test_acc=best_grid_rfcl.score(X_test,test_labels)

rf_test_acc

Out[111]:

0.7844444444444445

In [112]:

print(classification_report(test_labels,ytest_predict_rfcl))

precision recall f1-score support

0 0.82 0.88 0.85 623

1 0.68 0.56 0.62 277

accuracy 0.78 900

macro avg 0.75 0.72 0.73 900

weighted avg 0.78 0.78 0.78 900

In [113]:

rf_metrics=classification_report(test_labels, ytest_predict_rfcl,output_dict=True)

df=pd.DataFrame(rf_metrics).transpose()

rf_test_precision=round(df.loc["1"][0],2)

rf_test_recall=round(df.loc["1"][1],2)

rf_test_f1=round(df.loc["1"][2],2)

print ('rf_test_precision ',rf_test_precision)

print ('rf_test_recall ',rf_test_recall)

print ('rf_test_f1 ',rf_test_f1)

rf_test_precision 0.68

rf_test_recall 0.56

rf_test_f1 0.62


In [114]:

rf_test_fpr, rf_test_tpr, _ = roc_curve(test_labels, best_grid_rfcl.predict_proba(X_test)[:,1])

plt.plot(rf_test_fpr,rf_test_tpr,color='green')

plt.plot([0, 1], [0, 1], linestyle='--')

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC')

rf_test_auc=roc_auc_score(test_labels,best_grid_rfcl.predict_proba(X_test)[:,1])

print('Area under Curve is', rf_test_auc)

Area under Curve is 0.8181994657271499


Random Forest Conclusion

Train Data:

- AUC: 86%

- Accuracy: 80%

- Precision: 72%

- Recall: 61%

- f1-Score: 66%

Test Data:

- AUC: 82%

- Accuracy: 78%

- Precision: 68%

- Recall: 56%

- f1-Score: 62%

Training and Test set results are close, and with the overall measures reasonably high, the model generalizes well.

Agency_Code is again the most important variable for predicting claim status.

NN Model Performance Evaluation on Training data

In [115]:

confusion_matrix(train_labels,ytrain_predict_nncl)

Out[115]:

array([[1298, 155],

[ 315, 332]])


In [116]:

nn_train_acc=best_grid_nncl.score(X_train,train_labels)

nn_train_acc

Out[116]:

0.7761904761904762

In [117]:

print(classification_report(train_labels,ytrain_predict_nncl))

precision recall f1-score support

0 0.80 0.89 0.85 1453

1 0.68 0.51 0.59 647

accuracy 0.78 2100

macro avg 0.74 0.70 0.72 2100

weighted avg 0.77 0.78 0.77 2100

In [118]:

nn_metrics=classification_report(train_labels, ytrain_predict_nncl,output_dict=True)

df=pd.DataFrame(nn_metrics).transpose()

nn_train_precision=round(df.loc["1"][0],2)

nn_train_recall=round(df.loc["1"][1],2)

nn_train_f1=round(df.loc["1"][2],2)

print ('nn_train_precision ',nn_train_precision)

print ('nn_train_recall ',nn_train_recall)

print ('nn_train_f1 ',nn_train_f1)

nn_train_precision 0.68

nn_train_recall 0.51

nn_train_f1 0.59


In [119]:

nn_train_fpr, nn_train_tpr, _ = roc_curve(train_labels, best_grid_nncl.predict_proba(X_train)[:,1])

plt.plot(nn_train_fpr,nn_train_tpr,color='black')

plt.plot([0, 1], [0, 1], linestyle='--')

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC')

nn_train_auc=roc_auc_score(train_labels,best_grid_nncl.predict_proba(X_train)[:,1])

print('Area under Curve is', nn_train_auc)

Area under Curve is 0.8166831721609928

NN Model Performance Evaluation on Test data

In [120]:

confusion_matrix(test_labels,ytest_predict_nncl)

Out[120]:

array([[553, 70],

[138, 139]])


In [121]:

nn_test_acc=best_grid_nncl.score(X_test,test_labels)

nn_test_acc

Out[121]:

0.7688888888888888

In [122]:

print(classification_report(test_labels,ytest_predict_nncl))

precision recall f1-score support

0 0.80 0.89 0.84 623

1 0.67 0.50 0.57 277

accuracy 0.77 900

macro avg 0.73 0.69 0.71 900

weighted avg 0.76 0.77 0.76 900

In [123]:

nn_metrics=classification_report(test_labels, ytest_predict_nncl,output_dict=True)

df=pd.DataFrame(nn_metrics).transpose()

nn_test_precision=round(df.loc["1"][0],2)

nn_test_recall=round(df.loc["1"][1],2)

nn_test_f1=round(df.loc["1"][2],2)

print ('nn_test_precision ',nn_test_precision)

print ('nn_test_recall ',nn_test_recall)

print ('nn_test_f1 ',nn_test_f1)

nn_test_precision 0.67

nn_test_recall 0.5

nn_test_f1 0.57


In [124]:

nn_test_fpr, nn_test_tpr, _ = roc_curve(test_labels, best_grid_nncl.predict_proba(X_test)[:,1])

plt.plot(nn_test_fpr,nn_test_tpr,color='black')

plt.plot([0, 1], [0, 1], linestyle='--')

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC')

nn_test_auc=roc_auc_score(test_labels,best_grid_nncl.predict_proba(X_test)[:,1])

print('Area under Curve is', nn_test_auc)

Area under Curve is 0.8044225275393896


Neural Network Conclusion

Train Data:

- AUC: 82%

- Accuracy: 78%

- Precision: 68%

- Recall: 51%

- f1-Score: 59%

Test Data:

- AUC: 80%

- Accuracy: 77%

- Precision: 67%

- Recall: 50%

- f1-Score: 57%

Training and Test set results are close, and with the overall measures reasonably high, the model generalizes well.

--------------------------------------------------------------------------------

2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.

Comparison of the performance metrics from the 3 models


In [125]:

index=['Accuracy', 'AUC', 'Recall','Precision','F1 Score']

data = pd.DataFrame({
    'CART Train': [cart_train_acc, cart_train_auc, cart_train_recall, cart_train_precision, cart_train_f1],
    'CART Test': [cart_test_acc, cart_test_auc, cart_test_recall, cart_test_precision, cart_test_f1],
    'Random Forest Train': [rf_train_acc, rf_train_auc, rf_train_recall, rf_train_precision, rf_train_f1],
    'Random Forest Test': [rf_test_acc, rf_test_auc, rf_test_recall, rf_test_precision, rf_test_f1],
    'Neural Network Train': [nn_train_acc, nn_train_auc, nn_train_recall, nn_train_precision, nn_train_f1],
    'Neural Network Test': [nn_test_acc, nn_test_auc, nn_test_recall, nn_test_precision, nn_test_f1]
}, index=index)

round(data, 2)

Out[125]:

           CART Train  CART Test  Random Forest Train  Random Forest Test  Neural Network Train  Neural Network Test
Accuracy         0.79       0.77                 0.80                0.78                  0.78                 0.77
AUC              0.82       0.80                 0.86                0.82                  0.82                 0.80
Recall           0.53       0.51                 0.61                0.56                  0.51                 0.50
Precision        0.70       0.67                 0.72                0.68                  0.68                 0.67
F1 Score         0.60       0.58                 0.66                0.62                  0.59                 0.57

ROC Curve for the 3 models on the Training data


In [126]:

plt.plot([0, 1], [0, 1], linestyle='--')

plt.plot(cart_train_fpr, cart_train_tpr,color='red',label="CART")

plt.plot(rf_train_fpr,rf_train_tpr,color='green',label="RF")

plt.plot(nn_train_fpr,nn_train_tpr,color='black',label="NN")

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC')

plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower right')

Out[126]:

<matplotlib.legend.Legend at 0x7f282f806710>

ROC Curve for the 3 models on the Test data


In [127]:

plt.plot([0, 1], [0, 1], linestyle='--')

plt.plot(cart_test_fpr, cart_test_tpr,color='red',label="CART")

plt.plot(rf_test_fpr,rf_test_tpr,color='green',label="RF")

plt.plot(nn_test_fpr,nn_test_tpr,color='black',label="NN")

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC')

plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower right')

Out[127]:

<matplotlib.legend.Legend at 0x7f282f788850>

CONCLUSION:

I am selecting the RF model, as its accuracy, precision, recall, and F1 score are better than those of the other two models (CART & NN).
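One practical refinement: since catching likely claims (recall) usually matters more to an insurer than precision, the 0.5 probability cut-off implied by `.predict()` is not sacred. Lowering it flags more potential claims at the cost of precision. A sketch with made-up probabilities standing in for `best_grid_rfcl.predict_proba(X_test)[:, 1]`:

```python
import numpy as np

# Hypothetical positive-class probabilities for five policies.
probs = np.array([0.30, 0.48, 0.55, 0.20, 0.65])

preds_default = (probs >= 0.50).astype(int)  # the implicit 0.5 threshold
preds_lowered = (probs >= 0.40).astype(int)  # more claims flagged

print(preds_default)  # [0 0 1 0 1]
print(preds_lowered)  # [0 1 1 0 1]
```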

--------------------------------------------------------------------------------


2.5 Inference: Based on these predictions, what are the business insights and recommendations?

I strongly recommend we collect more real-time unstructured data, and more past data if possible.

This can be understood by looking at the insurance data, drawing relations between variables such as day of the incident, time, and age group, and associating them with external information such as location, behavior patterns, weather information, airline/vehicle types, etc.

• Streamlining online experiences benefits customers, leading to an increase in conversions, which subsequently raises profits.

• As per the data, 90% of the insurance is sold through the online channel.

• Another interesting fact: almost all of the offline business has a claim associated with it; we need to find out why.

• The JZI agency's resources need training to pick up sales, as they are at the bottom; we should run a promotional marketing campaign or evaluate whether to tie up with an alternate agency.

• Since the model achieves roughly 80% accuracy, when a customer books airline tickets or plans, we can cross-sell insurance based on the claim data pattern.

• More sales happen via Agency than Airlines, yet the trend shows claims are processed more on the Airline side. We may need to deep-dive into the process to understand the workflow and the reasons.

Key performance indicators (KPIs) of insurance claims:

• Reduce claims cycle time

• Increase customer satisfaction

• Combat fraud

• Optimize claims recovery

• Reduce claim handling costs

Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend existing products, and give rise to new risk transfer solutions in areas like non-damage business interruption and reputational damage.
