You are on page 1of 41

25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides to
collect data from the past few years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF & ANN and compare the models'
performances in train and test sets.

2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis)

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
# Import stats from scipy
from scipy import stats

In [2]:

df=pd.read_csv("insurance_part2_data.csv")

In [3]:

df.head()

Out[3]:

Product
Age Agency_Code Type Claimed Commision Channel Duration Sales De
Name

Customised
0 48 C2B Airlines No 0.70 Online 7 2.51
Plan

Travel Customised
1 36 EPX No 0.00 Online 34 20.00
Agency Plan

Travel Customised
2 39 CWT No 5.94 Online 3 9.90
Agency Plan

Travel Cancellation
3 36 EPX No 0.00 Online 4 26.00
Agency Plan

4 33 JZI Airlines No 6.30 Online 53 18.00 Bronze Plan

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 1/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [4]:

df.tail()

Out[4]:

Product
Age Agency_Code Type Claimed Commision Channel Duration Sales
Name

Travel
2995 28 CWT Yes 166.53 Online 364 256.20 Gold Plan
Agency

2996 35 C2B Airlines No 13.50 Online 5 54.00 Gold Plan

Travel Customised
2997 36 EPX No 0.00 Online 54 28.00
Agency Plan

2998 34 C2B Airlines Yes 7.64 Online 39 30.55 Bronze Plan

2999 47 JZI Airlines No 11.55 Online 15 33.00 Bronze Plan

Attribute Information:

1. Target: Claim Status (Claimed)


2. Agency_Code: Code of tour firm
3. Type: Type of tour insurance firms
4. Channel: Distribution channel of tour insurance agencies
5. Product: Name of the tour insurance products
6. Duration: Duration of the tour
7. Destination: Destination of the tour
8. Sales: Amount of sales of tour insurance policies
9. Commission: The commission received for tour insurance firm
10. Age: Age of insured

In [5]:

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3000 entries, 0 to 2999

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Age 3000 non-null int64

1 Agency_Code 3000 non-null object

2 Type 3000 non-null object

3 Claimed 3000 non-null object

4 Commision 3000 non-null float64

5 Channel 3000 non-null object

6 Duration 3000 non-null int64

7 Sales 3000 non-null float64

8 Product Name 3000 non-null object

9 Destination 3000 non-null object

dtypes: float64(2), int64(2), object(6)

memory usage: 234.5+ KB

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 2/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [6]:

df.dtypes.value_counts()

Out[6]:

object 6

float64 2

int64 2

dtype: int64

Data consists of both categorical and numerical values .

There are total of 3000 rows and 10 columns in the dataset.Out of 10, 6 columns are of object type, 2 columns
are of integer type and remaining two are of float type data.

10 variables
Age, Commision, Duration, Sales are numeric variable
rest are categorial variables
3000 records, no missing one
9 independant variable and one target variable - Clamied

In [7]:

df.isnull().sum()

Out[7]:

Age 0

Agency_Code 0

Type 0

Claimed 0

Commision 0

Channel 0

Duration 0

Sales 0

Product Name 0

Destination 0

dtype: int64

Data does not contain any missing values

In [8]:

round(df.describe().T,3)

Out[8]:

count mean std min 25% 50% 75% max

Age 3000.0 38.091 10.464 8.0 32.0 36.00 42.000 84.00

Commision 3000.0 14.529 25.481 0.0 0.0 4.63 17.235 210.21

Duration 3000.0 70.001 134.053 -1.0 11.0 26.50 63.000 4580.00

Sales 3000.0 60.250 70.734 0.0 20.0 33.00 69.000 539.00

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 3/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

Inference:

duration has negative value, it is not possible. Wrong entry.


Commision & Sales- mean and median varies signficantly
Minimum age of insured is 8 years and maximum age of insured is 84 years.Average group for insured
people is around 38.
Minimum comission an agent can earn is zero and a maximum commission is aprroximately 210.On an
average comiision earned is approximately 14.6.
Minimum amount of sales of tour insurance policies is zero and a maximum amount is 539.
On an average approximately 60.29 is amount of sales of tour insurance policies
Average duration of the tour is 70 and maximum is 4580.

In [9]:

df.shape
print('The number of rows of the dataframe is',df.shape[0],'.')
print('The number of columns of the dataframe is',df.shape[1],'.')

The number of rows of the dataframe is 3000 .

The number of columns of the dataframe is 10 .

Checking for unique Values

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 4/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [10]:

for column in df[['Agency_Code', 'Type', 'Claimed', 'Channel',


'Product Name', 'Destination']]:
print(column.upper(),': ',df[column].nunique())
print(df[column].value_counts().sort_values())
print('\n')

AGENCY_CODE : 4

JZI 239
CWT 472
C2B 924
EPX 1365
Name: Agency_Code, dtype: int64

TYPE : 2

Airlines 1163

Travel Agency 1837

Name: Type, dtype: int64

CLAIMED : 2

Yes 924
No 2076
Name: Claimed, dtype: int64

CHANNEL : 2

Offline 46

Online 2954

Name: Channel, dtype: int64

PRODUCT NAME : 5

Gold Plan 109

Silver Plan 427

Bronze Plan 650

Cancellation Plan 678

Customised Plan 1136

Name: Product Name, dtype: int64

DESTINATION : 3

EUROPE 215

Americas 320

ASIA 2465

Name: Destination, dtype: int64

Checking for Duplicate Values

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 5/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [11]:

dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
df[dups]

Number of duplicate rows = 139

Out[11]:

Product
Age Agency_Code Type Claimed Commision Channel Duration Sales
Name

63 30 C2B Airlines Yes 15.0 Online 27 60.0 Bronze Plan

Travel Customised
329 36 EPX No 0.0 Online 5 20.0
Agency Plan

Travel Cancellation
407 36 EPX No 0.0 Online 11 19.0
Agency Plan

Travel Customised
411 35 EPX No 0.0 Online 2 20.0
Agency Plan

Travel Customised
422 36 EPX No 0.0 Online 5 20.0
Agency Plan

... ... ... ... ... ... ... ... ... ...

Travel Cancellation
2940 36 EPX No 0.0 Online 8 10.0
Agency Plan

Travel Customised
2947 36 EPX No 0.0 Online 10 28.0
Agency Plan

Travel Cancellation
2952 36 EPX No 0.0 Online 2 10.0
Agency Plan

Travel Customised
2962 36 EPX No 0.0 Online 4 20.0
Agency Plan

Travel Customised
2984 36 EPX No 0.0 Online 1 20.0
Agency Plan

139 rows × 10 columns

Though it shows there are 139 records, but it can be of different customers, there is no customer ID or any
unique identifier, hence,we will not drop them off.

Univariate Analysis

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 6/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [12]:

def univariateAnalysis_numeric(column,nbins):
print("Description of " + column)
print("-------------------------------------------------------------------------
print(df[column].describe(),end=' ')

plt.figure()
print("Distribution of " + column)
print("-------------------------------------------------------------------------
sns.distplot(df[column], kde=False, color='g');
plt.show()

plt.figure()
print("BoxPlot of " + column)
print("-------------------------------------------------------------------------
ax = sns.boxplot(x=df[column])
plt.show()

In [13]:

df_num = df.select_dtypes(include = ['float64', 'int64'])


df_cat=df.select_dtypes(["object"])
Categorical_column_list=list(df_cat.columns.values)
Numerical_column_list = list(df_num.columns.values)
Numerical_length=len(Numerical_column_list)
Categorical_length=len(Categorical_column_list)
print("Length of Numerical columns is :",Numerical_length)
print("Length of Categorical columns is :",Categorical_length)

Length of Numerical columns is : 4

Length of Categorical columns is : 6

In [14]:

df_cat.head()

Out[14]:

Agency_Code Type Claimed Channel Product Name Destination

0 C2B Airlines No Online Customised Plan ASIA

1 EPX Travel Agency No Online Customised Plan ASIA

2 CWT Travel Agency No Online Customised Plan Americas

3 EPX Travel Agency No Online Cancellation Plan ASIA

4 JZI Airlines No Online Bronze Plan ASIA

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 7/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [15]:

df_num.head()

Out[15]:

Age Commision Duration Sales

0 48 0.70 7 2.51

1 36 0.00 34 20.00

2 39 5.94 3 9.90

3 36 0.00 4 26.00

4 33 6.30 53 18.00

In [16]:

for x in Numerical_column_list:
univariateAnalysis_numeric(x,20)

BoxPlot of Commision

----------------------------------------------------------------------
------

Insights of Univariate Analysis of Numerical Variables:

For Age variable, Minimum age of insured is 8 years and maximum age of insured is 84 years.Average age
for insured people is around 38.
For Commision Variable, minimum commission earned is zero and a maximum commission that can be
earned is approximately 210.21, with an average earning of approximately 14.53 .
For Duration Variable, minimum duaration is a negtive value , which cannot be true , hence we now there is
atleast one wrong entry. Maximum duration of tour is 4580 and an average duration of tour is
approximately 70 .
For Sales Variable,Minimum and maximum amounts of sales of tour insurance policies are 0 and 539
respectively. On an average amount of sales is approximately 60.25 .

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 8/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [17]:

def univariateAnalysis_category(cat_column):
print("Details of " + cat_column)
print("----------------------------------------------------------------")
print(df_cat[cat_column].value_counts())
plt.figure()
df_cat[cat_column].value_counts().plot.bar(title="Frequency Distribution of " +
plt.show()
print(" ")

In [18]:

df_cat = df.select_dtypes(include = ['object'])


Categorical_column_list = list(df_cat.columns.values)
Categorical_column_list

Out[18]:

['Agency_Code', 'Type', 'Claimed', 'Channel', 'Product Name', 'Destina


tion']

Pairwise Distribution of Continuous variables

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 9/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [19]:

sns.pairplot(df[['Age', 'Commision',
'Duration', 'Sales']])

Out[19]:

<seaborn.axisgrid.PairGrid at 0x7ff4b63a7e20>

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 10/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

Heatmap of continuous variables

In [20]:

plt.figure(figsize=(10,8))
plt.title("Figure 3: Heatmap of Variables ")
sns.set(font_scale=1.2)
sns.heatmap(df[['Age', 'Commision',
'Duration', 'Sales']].corr(), annot=True)

Out[20]:

<AxesSubplot:title={'center':'Figure 3: Heatmap of Variables '}>

Insights:

There is strong positive correlation between Commission and Sales.


Sales and Duration are moderately correlated.
Commission and Duration have low correlation.

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 11/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [21]:

clean_dataset=df.copy()

In [22]:

def check_outliers(data):
vData_num = data.loc[:,data.columns != 'class']
Q1 = vData_num.quantile(0.25)
Q3 = vData_num.quantile(0.75)
IQR = Q3 - Q1
count = 0
# checking for outliers, True represents outlier
vData_num_mod = ((vData_num < (Q1 - 1.5 * IQR)) |(vData_num > (Q3 + 1.5 * IQR)))
#iterating over columns to check for no.of outliers in each of the numerical att
for col in vData_num_mod:
if(1 in vData_num_mod[col].value_counts().index):
print("No. of outliers in %s: %d" %( col, vData_num_mod[col].value_count
count += 1
print("\n\nNo of attributes with outliers are :", count)

check_outliers(df)

No. of outliers in Age: 204

No. of outliers in Commision: 362

No. of outliers in Duration: 382

No. of outliers in Sales: 353

No of attributes with outliers are : 4

There are outliers in all the variables, but the sales and commision can be a geneuine business value. Random
Forest and CART can handle the outliers. Hence, Outliers are not treated for now, we will keep the data as it is.

We will treat the outliers for the ANN model to compare the same after the all the steps just for comparsion.

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 12/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [23]:

df.hist(figsize=(15,16),layout=(4,2), color="blue");
plt.title("Figure 4:Distribution plot for Continuous Variables")
plt.ylabel("Density")
plt.show()

In [24]:

# Skewness of Data

df.skew(axis = 0, skipna = True).sort_values(ascending=False)

Out[24]:

Duration 13.784681

Commision 3.148858

Sales 2.381148

Age 1.149713

dtype: float64

2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest,
Artificial Neural Network

Object data should be converted into categorical/numerical data to fit in the models. (pd.categorical().codes(),
pd.get_dummies(drop_first=True)) Data split, ratio defined for the split, train-test split should be discussed. Any
reasonable split is acceptable. Use of random state is mandatory. Successful implementation of each model.
Logical reason behind the selection of different values for the parameters involved in each model. Apply grid
search for each model and make models on best_params. Feature importance for each model.

Converting object data type to numerical

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 13/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [25]:

for feature in df.columns:


if df[feature].dtype == 'object':
print('\n')
print('feature:',feature)
print(pd.Categorical(df[feature].unique()))
print(pd.Categorical(df[feature].unique()).codes)
df[feature] = pd.Categorical(df[feature]).codes

feature: Agency_Code

['C2B', 'EPX', 'CWT', 'JZI']

Categories (4, object): ['C2B', 'CWT', 'EPX', 'JZI']

[0 2 1 3]

feature: Type

['Airlines', 'Travel Agency']

Categories (2, object): ['Airlines', 'Travel Agency']

[0 1]

feature: Claimed

['No', 'Yes']

Categories (2, object): ['No', 'Yes']

[0 1]

feature: Channel

['Online', 'Offline']

Categories (2, object): ['Offline', 'Online']

[1 0]

feature: Product Name

['Customised Plan', 'Cancellation Plan', 'Bronze Plan', 'Silver Plan',


'Gold Plan']

Categories (5, object): ['Bronze Plan', 'Cancellation Plan', 'Customis


ed Plan', 'Gold Plan', 'Silver Plan']

[2 1 0 4 3]

feature: Destination

['ASIA', 'Americas', 'EUROPE']

Categories (3, object): ['ASIA', 'Americas', 'EUROPE']

[0 1 2]

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 14/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [26]:

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3000 entries, 0 to 2999

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Age 3000 non-null int64

1 Agency_Code 3000 non-null int8

2 Type 3000 non-null int8

3 Claimed 3000 non-null int8

4 Commision 3000 non-null float64

5 Channel 3000 non-null int8

6 Duration 3000 non-null int64

7 Sales 3000 non-null float64

8 Product Name 3000 non-null int8

9 Destination 3000 non-null int8

dtypes: float64(2), int64(2), int8(6)

memory usage: 111.5 KB

In [27]:

df.head()

Out[27]:

Product
Age Agency_Code Type Claimed Commision Channel Duration Sales Destinat
Name

0 48 0 0 0 0.70 1 7 2.51 2

1 36 2 1 0 0.00 1 34 20.00 2

2 39 1 1 0 5.94 1 3 9.90 2

3 36 2 1 0 0.00 1 4 26.00 1

4 33 3 0 0 6.30 1 53 18.00 0

Proportion of Target Variable

In [28]:

df.Claimed.value_counts(normalize=True)

Out[28]:

0 0.692

1 0.308

Name: Claimed, dtype: float64

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 15/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [29]:

# Check Counts in Target Variable

plt.figure(figsize=(7,6))
sns.countplot(df["Claimed"])
plt.title("Figure 5: Countplot of Target Variable-CLaimed")
plt.show()

/opt/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36:
FutureWarning: Pass the following variable as a keyword arg: x. From v
ersion 0.12, the only valid positional argument will be `data`, and pa
ssing other arguments without an explicit keyword will result in an er
ror or misinterpretation.

warnings.warn(

In [30]:

# Check % of counts in Tgt Var

print("Percentage of 0's",round(df["Claimed"].value_counts().values[0]/df["Claimed"]
print("Percentage of 1's",round(df["Claimed"].value_counts().values[1]/df["Claimed"]

Percentage of 0's 69.2 %

Percentage of 1's 30.8 %

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 16/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [31]:

plt.figure(figsize=(16,7))
df["Claimed"].value_counts().plot.pie(explode=[0,0.2],autopct='%1.1f%%',shadow=False
plt.title('Figure 6:Pi Chart of Target Variable-Claimed')
plt.show()

Extracting the target column into train and test data

In [32]:

X = df.drop("Claimed", axis=1)

y = df.pop("Claimed")

X.head()

Out[32]:

Product
Age Agency_Code Type Commision Channel Duration Sales Destination
Name

0 48 0 0 0.70 1 7 2.51 2 0

1 36 2 1 0.00 1 34 20.00 2 0

2 39 1 1 5.94 1 3 9.90 2 1

3 36 2 1 0.00 1 4 26.00 1 0

4 33 3 0 6.30 1 53 18.00 0 0

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 17/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [33]:

plt.plot(X)
plt.title("Figure:Independent Variable Plot Before Scaling")
plt.show()

In [34]:

y.head()

Out[34]:

0 0
1 0
2 0
3 0
4 0
Name: Claimed, dtype: int8

Feature Scaling

In [35]:

# Scaling the attributes.

from scipy.stats import zscore


X_scaled=X.apply(zscore)
round(X_scaled.head(),3)

Out[35]:

Product
Age Agency_Code Type Commision Channel Duration Sales Destination
Name

0 0.947 -1.314 -1.257 -0.543 0.125 -0.470 -0.816 0.269 -0.435

1 -0.200 0.698 0.796 -0.570 0.125 -0.269 -0.569 0.269 -0.435

2 0.087 -0.308 0.796 -0.337 0.125 -0.500 -0.712 0.269 1.304

3 -0.200 0.698 0.796 -0.570 0.125 -0.492 -0.484 -0.526 -0.435

4 -0.487 1.704 -1.257 -0.323 0.125 -0.127 -0.597 -1.320 -0.435

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 18/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [36]:

plt.plot(X_scaled)
plt.title("Figure:Independent Variable Plot Prior Scaling")
plt.show()

Train and Test Split

In [37]:

X_train, X_test, train_labels, test_labels = train_test_split(X_scaled, y, test_size

Checking Dimensions of Train and Test Data

In [38]:

print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('train_labels',train_labels.shape)
print('test_labels',test_labels.shape)

X_train (2100, 9)

X_test (900, 9)

train_labels (2100,)

test_labels (900,)

Building Decision tree Classifier

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 19/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [39]:

param_grid_dtcl = {
'criterion': ['gini'],
'max_depth': [10,20,30,50],
'min_samples_leaf': [50,100,150],
'min_samples_split': [150,300,450],
}

dtcl = DecisionTreeClassifier(random_state=5)

grid_search_dtcl = GridSearchCV(estimator = dtcl, param_grid = param_grid_dtcl, cv =

In [ ]:

In [40]:

grid_search_dtcl.fit(X_train, train_labels)
print(grid_search_dtcl.best_params_)
best_grid_dtcl = grid_search_dtcl.best_estimator_
best_grid_dtcl

{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_sa


mples_split': 450}

Out[40]:

DecisionTreeClassifier(max_depth=10, min_samples_leaf=50, min_samples_


split=450,

random_state=5)

Generating Decision tree

In [41]:

from sklearn import tree


from sklearn.tree import DecisionTreeClassifier

In [42]:

train_char_label = ['no', 'yes']


tree_regularized = open('tree_regularized.dot','w')
dot_data = tree.export_graphviz(best_grid_dtcl, out_file= tree_regularized ,
feature_names = list(X_train),
class_names = list(train_char_label))

tree_regularized.close()
dot_data

http://webgraphviz.com/ (http://webgraphviz.com/)

Variable Importance - DTCL


localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 20/41
25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [43]:

print (pd.DataFrame(best_grid_dtcl.feature_importances_, columns = ["Imp"],


index = X_train.columns).sort_values('Imp',ascending=False))

Imp

Agency_Code 0.674494

Sales 0.222345

Product Name 0.092149

Commision 0.008008

Duration 0.003005

Age 0.000000

Type 0.000000

Channel 0.000000

Destination 0.000000

Predicting Train and Test model

In [44]:

ytrain_predict_dtcl = best_grid_dtcl.predict(X_train)
ytest_predict_dtcl = best_grid_dtcl.predict(X_test)

Getting Probabilities of predicted data

In [45]:

ytest_predict_dtcl
ytest_predict_prob_dtcl=best_grid_dtcl.predict_proba(X_test)
ytest_predict_prob_dtcl
pd.DataFrame(ytest_predict_prob_dtcl).head()

Out[45]:

0 1

0 0.656751 0.343249

1 0.979452 0.020548

2 0.921171 0.078829

3 0.656751 0.343249

4 0.921171 0.078829

Building a Random Forest Classifier

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 21/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [46]:

param_grid_rfcl = {
'max_depth': [4,5,6],#20,30,40
'max_features': [2,3,4,5],## 7,8,9
'min_samples_leaf': [8,9,11,15],## 50,100
'min_samples_split': [46,50,55], ## 60,70
'n_estimators': [290,350,400] ## 100,200
}

rfcl = RandomForestClassifier(random_state=5)

grid_search_rfcl = GridSearchCV(estimator = rfcl, param_grid = param_grid_rfcl, cv =

In [47]:

grid_search_rfcl.fit(X_train, train_labels)

Out[47]:

GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=5),

param_grid={'max_depth': [4, 5, 6], 'max_features': [2,


3, 4, 5],

'min_samples_leaf': [8, 9, 11, 15],

'min_samples_split': [46, 50, 55],

'n_estimators': [290, 350, 400]})

In [48]:

grid_search_rfcl.best_params_

Out[48]:

{'max_depth': 6,

'max_features': 3,

'min_samples_leaf': 9,

'min_samples_split': 50,

'n_estimators': 290}

In [49]:

best_grid_rfcl = grid_search_rfcl.best_estimator_

In [50]:

best_grid_rfcl

Out[50]:

RandomForestClassifier(max_depth=6, max_features=3, min_samples_leaf=


9,

min_samples_split=50, n_estimators=290, random_


state=5)

Using Best Parameters to predict Train & Test Data

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 22/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [51]:

ytrain_predict_rfcl = best_grid_rfcl.predict(X_train)
ytest_predict_rfcl = best_grid_rfcl.predict(X_test)

Getting probabilities of predicted data

In [52]:

ytest_predict_rfcl
ytest_predict_prob_rfcl=best_grid_rfcl.predict_proba(X_test)
ytest_predict_prob_rfcl
pd.DataFrame(ytest_predict_prob_rfcl).head()

Out[52]:

0 1

0 0.786094 0.213906

1 0.971485 0.028515

2 0.906544 0.093456

3 0.657028 0.342972

4 0.875002 0.124998

Variable Importance via Random Forest

In [53]:

# Variable Importance
print (pd.DataFrame(best_grid_rfcl.feature_importances_,
columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))

Imp

Agency_Code 0.279196

Product Name 0.235375

Sales 0.150871

Commision 0.146070

Duration 0.078847

Type 0.057515

Age 0.040628

Destination 0.008741

Channel 0.002758

Building ANN Model

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 23/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [54]:

param_grid_nncl = {
'hidden_layer_sizes': [50,100,200],
'max_iter': [2500,3000,4000],
'solver': ['adam'],
'tol': [0.01],
}

nncl = MLPClassifier(random_state=5)

grid_search_nncl = GridSearchCV(estimator = nncl, param_grid = param_grid_nncl, cv =

In [55]:

grid_search_nncl.fit(X_train, train_labels)
grid_search_nncl.best_params_
best_grid_nncl = grid_search_nncl.best_estimator_
best_grid_nncl

Out[55]:

MLPClassifier(hidden_layer_sizes=100, max_iter=2500, random_state=5, t


ol=0.01)

Using Best Parameters to predict Train & Test Data

In [56]:

ytrain_predict_nncl = best_grid_nncl.predict(X_train)
ytest_predict_nncl = best_grid_nncl.predict(X_test)

Getting probabilities of predicted data

In [57]:

ytest_predict_nncl
ytest_predict_prob_nncl=best_grid_nncl.predict_proba(X_test)
ytest_predict_prob_nncl
pd.DataFrame(ytest_predict_prob_nncl).head()

Out[57]:

0 1

0 0.838865 0.161135

1 0.926699 0.073301

2 0.914996 0.085004

3 0.657225 0.342775

4 0.909727 0.090273

2.3 Performance Metrics: Comment and Check the performance of Predictions


on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 24/41
25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

ROC_AUC score, classification reports for each model.


Comment on the validness of models (overfitting or underfitting) Build confusion matrix for each model.
Comment on the positive class in hand. Must clearly show obs/pred in row/col Plot roc_curve for each model.
Calculate roc_auc_score for each model. Comment on the above calculated scores and plots. Build
classification reports for each model. Comment on f1 score, precision and recall, which one is important here.

CART : AUC & ROC for Train Data

In [58]:

# predict probabilities
probs_cart = best_grid_dtcl.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs_cart = probs_cart[:, 1]
# calculate AUC
cart_train_auc = roc_auc_score(train_labels, probs_cart)
print('AUC: %.3f' % cart_train_auc)
# calculate roc curve
cart_train_fpr, cart_train_tpr, cart_train_thresholds = roc_curve(train_labels, prob
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 13: CART AUC-ROC for Train Data ")
# plot the roc curve for the model
plt.plot(cart_train_fpr, cart_train_tpr)

AUC: 0.812

Out[58]:

[<matplotlib.lines.Line2D at 0x7ff4ad794dc0>]

CART : AUC & ROC for Test Data

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 25/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [59]:

# predict probabilities
probs_cart = best_grid_dtcl.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs_cart = probs_cart[:, 1]
# calculate AUC
cart_test_auc = roc_auc_score(test_labels, probs_cart)
print('AUC: %.3f' % cart_test_auc)
# calculate roc curve
cart_test_fpr, cart_test_tpr, cart_testthresholds = roc_curve(test_labels, probs_car
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 14: CART AUC-ROC for Test Data ")
# plot the roc curve for the model
plt.plot(cart_test_fpr, cart_test_tpr)

AUC: 0.800

Out[59]:

[<matplotlib.lines.Line2D at 0x7ff4b7e163d0>]

CART Confusion Matrix and Classification Report for the training data

In [60]:

confusion_matrix(train_labels, ytrain_predict_dtcl)

Out[60]:

array([[1258, 195],

[ 268, 379]])

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 26/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [61]:

ax=sns.heatmap(confusion_matrix(train_labels, ytrain_predict_dtcl),annot=True, fmt='


plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Figure 15: CART Confusion Matrix of Train Data')
plt.show()

In [62]:

#Train Data Accuracy


cart_train_acc=best_grid_dtcl.score(X_train,train_labels)
cart_train_acc

Out[62]:

0.7795238095238095

In [63]:

print(classification_report(train_labels, ytrain_predict_dtcl))

precision recall f1-score support

0 0.82 0.87 0.84 1453

1 0.66 0.59 0.62 647

accuracy 0.78 2100

macro avg 0.74 0.73 0.73 2100

weighted avg 0.77 0.78 0.78 2100

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 27/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [64]:

cart_metrics=classification_report(train_labels, ytrain_predict_dtcl,output_dict=Tru
df=pd.DataFrame(cart_metrics).transpose()
cart_train_f1=round(df.loc["1"][2],2)
cart_train_recall=round(df.loc["1"][1],2)
cart_train_precision=round(df.loc["1"][0],2)
print ('cart_train_precision ',cart_train_precision)
print ('cart_train_recall ',cart_train_recall)
print ('cart_train_f1 ',cart_train_f1)

cart_train_precision 0.66

cart_train_recall 0.59

cart_train_f1 0.62

CART Confusion Matrix and Classification Report for the testing data

In [65]:

confusion_matrix(test_labels, ytest_predict_dtcl)

Out[65]:

array([[536, 87],

[113, 164]])

In [66]:

ax=sns.heatmap(confusion_matrix(test_labels, ytest_predict_dtcl),annot=True, fmt='d'


plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Figure 16: CART Confusion Matrix of Test Data')
plt.show()

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 28/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [67]:

#Test Data Accuracy


cart_test_acc=best_grid_dtcl.score(X_test,test_labels)
cart_test_acc

Out[67]:

0.7777777777777778

In [68]:

print(classification_report(test_labels, ytest_predict_dtcl))

precision recall f1-score support

0 0.83 0.86 0.84 623

1 0.65 0.59 0.62 277

accuracy 0.78 900

macro avg 0.74 0.73 0.73 900

weighted avg 0.77 0.78 0.77 900

In [69]:

cart_metrics=classification_report(test_labels, ytest_predict_dtcl,output_dict=True)
df=pd.DataFrame(cart_metrics).transpose()
cart_test_precision=round(df.loc["1"][0],2)
cart_test_recall=round(df.loc["1"][1],2)
cart_test_f1=round(df.loc["1"][2],2)
print ('cart_test_precision ',cart_test_precision)
print ('cart_test_recall ',cart_test_recall)
print ('cart_test_f1 ',cart_test_f1)

cart_test_precision 0.65

cart_test_recall 0.59

cart_test_f1 0.62

CART Conclusion:

Train Data:

AUC: 82%
Accuracy: 79%
Precision: 70%
f1-Score: 60%

Test Data:

AUC: 80%
Accuracy: 77%
Precision: 80%
f1-Score: 84%

Training and Test set results are almost similar, and with the overall measures high, the model is a good model.

Change is the most important variable for predicting diabetes

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 29/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

RF Model Performance Evaluation on Training data

In [70]:

confusion_matrix(train_labels,ytrain_predict_rfcl)

Out[70]:

array([[1296, 157],

[ 249, 398]])

In [71]:

ax=sns.heatmap(confusion_matrix(train_labels,ytrain_predict_rfcl),annot=True, fmt='d
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Figure 19: RF Confusion Matrix of Train Data')
plt.show()

In [72]:

rf_train_acc=best_grid_rfcl.score(X_train,train_labels)
rf_train_acc

Out[72]:

0.8066666666666666

In [73]:

print(classification_report(train_labels,ytrain_predict_rfcl))

precision recall f1-score support

0 0.84 0.89 0.86 1453

1 0.72 0.62 0.66 647

accuracy 0.81 2100

macro avg 0.78 0.75 0.76 2100

weighted avg 0.80 0.81 0.80 2100

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 30/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [74]:

rf_metrics=classification_report(train_labels, ytrain_predict_rfcl,output_dict=True)
df=pd.DataFrame(rf_metrics).transpose()
rf_train_precision=round(df.loc["1"][0],2)
rf_train_recall=round(df.loc["1"][1],2)
rf_train_f1=round(df.loc["1"][2],2)
print ('rf_train_precision ',rf_train_precision)
print ('rf_train_recall ',rf_train_recall)
print ('rf_train_f1 ',rf_train_f1)

rf_train_precision 0.72

rf_train_recall 0.62

rf_train_f1 0.66

In [75]:

rf_train_fpr, rf_train_tpr,_=roc_curve(train_labels,best_grid_rfcl.predict_proba(X_t
plt.plot(rf_train_fpr,rf_train_tpr,color='green')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 17: RF AUC-ROC for Train Data ")
rf_train_auc=roc_auc_score(train_labels,best_grid_rfcl.predict_proba(X_train)[:,1])
print('Area under Curve is', rf_train_auc)

Area under Curve is 0.854377395379809

RF Model Performance Evaluation on Test data

In [76]:

confusion_matrix(test_labels,ytest_predict_rfcl)

Out[76]:

array([[546, 77],

[120, 157]])

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 31/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [77]:

ax=sns.heatmap(confusion_matrix(test_labels,ytest_predict_rfcl),annot=True, fmt='d')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Figure 20: RF Confusion Matrix of Test Data')
plt.show()

In [78]:

rf_test_acc=best_grid_rfcl.score(X_test,test_labels)
rf_test_acc

Out[78]:

0.7811111111111111

In [79]:

print(classification_report(test_labels,ytest_predict_rfcl))

precision recall f1-score support

0 0.82 0.88 0.85 623

1 0.67 0.57 0.61 277

accuracy 0.78 900

macro avg 0.75 0.72 0.73 900

weighted avg 0.77 0.78 0.78 900

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 32/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [80]:

rf_metrics=classification_report(test_labels, ytest_predict_rfcl,output_dict=True)
df=pd.DataFrame(rf_metrics).transpose()
rf_test_precision=round(df.loc["1"][0],2)
rf_test_recall=round(df.loc["1"][1],2)
rf_test_f1=round(df.loc["1"][2],2)
print ('rf_test_precision ',rf_test_precision)
print ('rf_test_recall ',rf_test_recall)
print ('rf_test_f1 ',rf_test_f1)

rf_test_precision 0.67

rf_test_recall 0.57

rf_test_f1 0.61

In [81]:

rf_test_fpr, rf_test_tpr,_=roc_curve(test_labels,best_grid_rfcl.predict_proba(X_test
plt.plot(rf_test_fpr,rf_test_tpr,color='green')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 18: RF AUC-ROC for Test Data ")
rf_test_auc=roc_auc_score(test_labels,best_grid_rfcl.predict_proba(X_test)[:,1])
print('Area under Curve is', rf_test_auc)

Area under Curve is 0.8187122981265682

Random Forest Conclusion:

Train Data:

AUC: 86%
Accuracy: 80%
Precision: 72%
f1-Score: 66%

Test Data:

AUC: 82%
Accuracy: 78%
localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 33/41
25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

Precision: 68%
f1-Score: 62

Training and Test set results are almost similar, and with the overall measures high, the model is a good model.

Change is again the most important variable for predicting diabetes

NN Model Performance Evaluation on Training data

In [82]:

confusion_matrix(train_labels,ytrain_predict_nncl)

Out[82]:

array([[1292, 161],

[ 319, 328]])

In [83]:

ax=sns.heatmap(confusion_matrix(train_labels,ytrain_predict_nncl),annot=True, fmt='d
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Figure 23: ANN Confusion Matrix of Train Data')
plt.show()

In [84]:

nn_train_acc=best_grid_nncl.score(X_train,train_labels)
nn_train_acc

Out[84]:

0.7714285714285715

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 34/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [85]:

print(classification_report(train_labels,ytrain_predict_nncl))

precision recall f1-score support

0 0.80 0.89 0.84 1453

1 0.67 0.51 0.58 647

accuracy 0.77 2100

macro avg 0.74 0.70 0.71 2100

weighted avg 0.76 0.77 0.76 2100

In [86]:

nn_metrics=classification_report(train_labels, ytrain_predict_nncl,output_dict=True)
df=pd.DataFrame(nn_metrics).transpose()
nn_train_precision=round(df.loc["1"][0],2)
nn_train_recall=round(df.loc["1"][1],2)
nn_train_f1=round(df.loc["1"][2],2)
print ('nn_train_precision ',nn_train_precision)
print ('nn_train_recall ',nn_train_recall)
print ('nn_train_f1 ',nn_train_f1)

nn_train_precision 0.67

nn_train_recall 0.51

nn_train_f1 0.58

In [87]:

nn_train_fpr, nn_train_tpr,_=roc_curve(train_labels,best_grid_nncl.predict_proba(X_t
plt.plot(nn_train_fpr,nn_train_tpr,color='black')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 21: ANN AUC-ROC for Train Data ")
nn_train_auc=roc_auc_score(train_labels,best_grid_nncl.predict_proba(X_train)[:,1])
print('Area under Curve is', nn_train_auc)

Area under Curve is 0.8124293286500988

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 35/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

NN Model Performance Evaluation on Test data

In [88]:

confusion_matrix(test_labels,ytest_predict_nncl)

Out[88]:

array([[550, 73],

[140, 137]])

In [89]:

ax=sns.heatmap(confusion_matrix(test_labels,ytest_predict_nncl),annot=True, fmt='d',
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Figure 24: ANN Confusion Matrix of Test Data')
plt.show()

In [90]:

nn_test_acc=best_grid_nncl.score(X_test,test_labels)
nn_test_acc

Out[90]:

0.7633333333333333

In [91]:

print(classification_report(test_labels,ytest_predict_nncl))

precision recall f1-score support

0 0.80 0.88 0.84 623

1 0.65 0.49 0.56 277

accuracy 0.76 900

macro avg 0.72 0.69 0.70 900

weighted avg 0.75 0.76 0.75 900

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 36/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [92]:

nn_metrics=classification_report(test_labels, ytest_predict_nncl,output_dict=True)
df=pd.DataFrame(nn_metrics).transpose()
nn_test_precision=round(df.loc["1"][0],2)
nn_test_recall=round(df.loc["1"][1],2)
nn_test_f1=round(df.loc["1"][2],2)
print ('nn_test_precision ',nn_test_precision)
print ('nn_test_recall ',nn_test_recall)
print ('nn_test_f1 ',nn_test_f1)

nn_test_precision 0.65

nn_test_recall 0.49

nn_test_f1 0.56

In [93]:

nn_test_fpr, nn_test_tpr,_=roc_curve(test_labels,best_grid_nncl.predict_proba(X_test
plt.plot(nn_test_fpr,nn_test_tpr,color='black')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 22: ANN AUC-ROC for Test Data ")
nn_test_auc=roc_auc_score(test_labels,best_grid_nncl.predict_proba(X_test)[:,1])
print('Area under Curve is', nn_test_auc)

Area under Curve is 0.8042197124661733

Neural Network Conclusion:

Train Data:

AUC: 82%
Accuracy: 78%
Precision: 68%
f1-Score: 59

Test Data:

AUC: 80%
Accuracy: 77%
localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 37/41
25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

Precision: 67%
f1-Score: 57%

Training and Test set results are almost similar, and with the overall measures high, the model is a good model.

2.4 Final Model - Compare all models on the basis of the performance metrics in
a structured tabular manner (2.5 pts).
Describe on which model is best/optimized (1.5 pts ). A table containing all the values of accuracies, precision,
recall, auc_roc_score, f1 score. Comparison between the different models(final) on the basis of above table
values. After comparison which model suits the best for the problem in hand on the basis of different
measures. Comment on the final model.

Comparison of the performance metrics from the 3 models

In [94]:

index=['Accuracy', 'AUC', 'Recall','Precision','F1 Score']


data = pd.DataFrame({'CART Train':[cart_train_acc,cart_train_auc,cart_train_recall,ca
'CART Test':[cart_test_acc,cart_test_auc,cart_test_recall,cart_test_precision
'Random Forest Train':[rf_train_acc,rf_train_auc,rf_train_recall,rf_train_prec
'Random Forest Test':[rf_test_acc,rf_test_auc,rf_test_recall,rf_test_precisio
'Neural Network Train':[nn_train_acc,nn_train_auc,nn_train_recall,nn_train_pre
'Neural Network Test':[nn_test_acc,nn_test_auc,nn_test_recall,nn_test_precisi
round(data,2)

Out[94]:

CART CART Random Forest Random Forest Neural Network Neural Network
Train Test Train Test Train Test

Accuracy 0.78 0.78 0.81 0.78 0.77 0.76

AUC 0.81 0.80 0.85 0.82 0.81 0.80

Recall 0.59 0.59 0.62 0.57 0.51 0.49

Precision 0.66 0.65 0.72 0.67 0.67 0.65

F1 Score 0.62 0.62 0.66 0.61 0.58 0.56

ROC Curve for the 3 models on the Training data

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 38/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

In [98]:

plt.figure(figsize=(10,8))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(cart_train_fpr, cart_train_tpr,color='red',label="CART")
plt.plot(rf_train_fpr,rf_train_tpr,color='green',label="RF")
plt.plot(nn_train_fpr,nn_train_tpr,color='black',label="NN")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Figure 25:ROC for 3 Models in Training Data')
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower right')

Out[98]:

<matplotlib.legend.Legend at 0x7ff4b17210d0>

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 39/41


25/07/2021 Project-CART-RF-ANN - Jupyter Notebook

ROC Curve for the 3 models on the Test data

In [99]:

plt.figure(figsize=(10,8))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(cart_test_fpr, cart_test_tpr,color='red',label="CART")
plt.plot(rf_test_fpr,rf_test_tpr,color='green',label="RF")
plt.plot(nn_test_fpr,nn_test_tpr,color='black',label="NN")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Figure 26:ROC for 3 Models in Test Data')
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower right')

Out[99]:

<matplotlib.legend.Legend at 0x7ff4b88992e0>

RF model should be selected, as it has better accuracy, precsion, recall, f1 score better than other two CART &
NN.

2.5 Based on your analysis and working on the business problem, detail out appropriate insights and
recommendations to help the management solve the business objective.
localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 40/41
25/07/2021 Project-CART-RF-ANN - Jupyter Notebook
p g j
There should be at least 3-4 Recommendations and insights in total. Recommendations should be easily
understandable and business specific, students should not give any technical suggestions. Full marks should
only be allotted if the recommendations are correct and business specific.

In [ ]:

In [ ]:

In [ ]:

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 41/41

You might also like