CART-RF-ANN
PREPARED BY MURALIDHARAN N
An insurance firm providing tour insurance is facing a higher claim frequency. The
management has decided to collect data from the past few years. You are assigned the task of
building a model that predicts the claim status and of providing recommendations to management.
Train/Test split
VIF values
carat ---> 33.35086119845924
depth ---> 4.573918951598614
table ---> 1.772885281261914
x ---> 463.5542785436457
y ---> 462.769821646584
z ---> 238.65819968687333
cut_Good ---> 3.609618194943713
cut_Ideal ---> 14.348125081188464
cut_Premium ---> 8.623414379121153
cut_Very Good ---> 7.848451571723688
color_E ---> 2.3710704647626115
We still find multicollinearity in the dataset. To bring these values down to a lower level, we can drop columns after fitting a statsmodels OLS model.
From the statsmodels summary we can identify the features that do not contribute to the model.
After removing those features, the VIF values will be reduced. Ideally, VIF should be less than 5.
STATSMODEL
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
To bring the values down to lower levels, we can drop one of each pair of highly correlated variables.
Dropping such variables brings the multicollinearity level down.
1.4 Inference: Based on these predictions, what are the business insights and recommendations?
We had a business problem to predict the price of the stone and to provide insights for the company on the profits
at different price bands. From the EDA we could understand that, among the cuts, the Ideal cut brought the most
profit to the company. The colours H, I and J have brought profits for the company. Looking at clarity, there were no
flawless stones, and no profits were coming from the I1, I2 and I3 stones. The Ideal, Premium and Very Good cuts
were bringing profits, whereas Fair and Good were not.
The predictions were able to capture 95% of the variation in price, as explained by the predictors in the
training set.
Using statsmodels, if we run the model again we obtain p-values and coefficients, which give a better
understanding of the relationships; variables with p-values above 0.05 can be dropped and the model re-run
for better results.
For better accuracy, the depth column was dropped in a later iteration.
The equation: (-0.76) * Intercept + (1.1) * carat + (-0.01) * table + (-0.32) * x + (0.28) * y + (-0.11) * z + (0.1) *
cut_Good + (0.15) * cut_Ideal + (0.15) * cut_Premium + (0.13) * cut_Very_Good + (-0.05) * color_E + (-0.06) *
color_F + (-0.1) * color
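The equation can be evaluated mechanically as a dot product of the coefficients with a record's feature values. A sketch that hard-codes the coefficients reported above (the truncated trailing colour terms are omitted, and the sample record is made up):

```python
# Coefficients copied from the report's fitted equation
coef = {
    "carat": 1.1, "table": -0.01, "x": -0.32, "y": 0.28, "z": -0.11,
    "cut_Good": 0.1, "cut_Ideal": 0.15, "cut_Premium": 0.15,
    "cut_Very_Good": 0.13, "color_E": -0.05, "color_F": -0.06,
}
intercept = -0.76

def predict(row):
    """Intercept plus coefficient-weighted sum; missing dummies default to 0."""
    return intercept + sum(coef[k] * row.get(k, 0.0) for k in coef)

# Hypothetical (scaled) record: a 1.2-carat, Ideal-cut stone
sample = {"carat": 1.2, "table": 57.0, "x": 6.8, "y": 6.9, "z": 4.2, "cut_Ideal": 1}
print(round(predict(sample), 3))
```

Note that the fitted equation operates on the scaled/encoded features, so raw values must go through the same preprocessing first.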
Channel is online
We will further look at the distribution of the dataset in univariate and bivariate analysis.
As there is no unique identifier, I am not dropping the duplicates; they may be different customers' data.
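The duplicate check behind that decision can be sketched with pandas; the three-row frame here is a toy stand-in for the insurance data:

```python
import pandas as pd

# Toy stand-in: two identical rows that may still be two different customers
df = pd.DataFrame({
    "Agency_Code": ["EPX", "EPX", "C2B"],
    "Claimed": ["No", "No", "Yes"],
})
n_dup = df.duplicated().sum()
print(f"{n_dup} duplicate row(s) found; retained, since there is no unique identifier")
```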
AGENCY_CODE: 4
JZI 239
CWT 472
C2B 924
EPX 1365
TYPE: 2
Airlines 1163
Travel Agency 1837
CLAIMED: 2
Yes 924
No 2076
CHANNEL: 2
Offline 46
Online 2954
PRODUCT NAME: 5
Gold Plan 109
Silver Plan 427
Bronze Plan 650
Cancellation Plan 678
Customised Plan 1136
DESTINATION: 3
EUROPE 215
Americas 320
ASIA 2465
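Frequency tables like the ones above come straight from `value_counts()` over the object-typed columns. A sketch on a toy frame (the real data has 3000 rows):

```python
import pandas as pd

# Toy frame mirroring two of the categorical columns
df = pd.DataFrame({
    "Channel": ["Online"] * 4 + ["Offline"],
    "Type": ["Airlines", "Travel Agency", "Travel Agency", "Airlines", "Travel Agency"],
})
for col in df.select_dtypes(include="object"):
    print(f"{col.upper()}: {df[col].nunique()}")   # header: column name and cardinality
    print(df[col].value_counts().to_string())      # category frequencies, descending
```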
Categorical Variables
Agency Code
The distribution of the agency code shows EPX with the maximum frequency.
The box plot shows the split of sales by agency code, with the Claimed column as the hue.
It seems that C2B has had more claims than the other agencies.
The box plot shows the split of sales by type, with the Claimed column as the hue. We can see that the Airlines type has more claims.
The majority of customers used the online channel; very few used the offline channel.
The box plot shows the split of sales by channel, with the Claimed column as the hue.
The Customised plan appears to be the plan customers like most compared with all the other plans.
The box plot shows the split of sales by product name, with the Claimed column as the hue.
Asia is the destination customers choose most often compared with the other destinations.
The box plot shows the split of sales by destination, with the Claimed column as the hue.
BEFORE TREATING OUTLIERS
1.2 Encode the data (having string values) for modelling. Split the data into train
and test (70:30). Apply linear regression using scikit-learn. Perform checks for
significant variables using an appropriate method from statsmodels. Create multiple
models and check the performance of the predictions on the train and test sets using
R-square, RMSE and adjusted R-square. Compare these models and select the best one
with appropriate reasoning.
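The split-fit-score workflow asked for above can be sketched as follows. A synthetic regression problem stands in for the encoded zirconia data; adjusted R-square is computed by hand since scikit-learn does not provide it directly:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded dataset
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)  # 70:30

lr = LinearRegression().fit(X_tr, y_tr)
pred = lr.predict(X_te)

r2 = r2_score(y_te, pred)
rmse = np.sqrt(mean_squared_error(y_te, pred))
n, p = X_te.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalise for number of predictors
print(f"R2={r2:.3f}  RMSE={rmse:.3f}  AdjR2={adj_r2:.3f}")
```

Repeating this per candidate model (with different feature subsets) gives the comparison table the question asks for.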
BEST GRID
MODEL 2
MODEL 3
MLP CLASSIFIER
GRID SEARCH
FITTING THE MODEL USING THE OPTIMAL VALUES FROM GRID SEARCH
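The grid-search-then-refit step named above can be sketched as follows; the parameter grid here is illustrative, not the one actually tuned in the report, and synthetic data stands in for the insurance set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# Cross-validated search over a small, illustrative MLP grid
grid = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(50,), (100,)], "alpha": [1e-4, 1e-2]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)   # refits on the full training set with the best parameters
print("best params:", grid.best_params_)
print("test accuracy:", round(grid.score(X_te, y_te), 3))
```

`grid.best_estimator_` is the model fitted with the optimal values, which is what the subsequent accuracy and confusion-matrix sections evaluate.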
ACCURACY
CONFUSION MATRIX
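The accuracy and confusion-matrix figures reported for each model come from the standard scikit-learn metrics. A sketch with made-up labels, just to show the layout (rows are actual classes, columns are predicted):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical actual vs predicted claim labels (0 = No, 1 = Yes)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("accuracy:", accuracy_score(y_true, y_pred))   # fraction of correct labels
print(confusion_matrix(y_true, y_pred))              # [[TN, FP], [FN, TP]]
```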
ACCURACY
CONFUSION MATRIX
AUC and ROC for the training data for Random Forest
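The training-data AUC/ROC computation for the random forest can be sketched as below, with a synthetic binary problem standing in for the insurance data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve

X, y = make_classification(n_samples=400, n_features=10, random_state=7)
rf = RandomForestClassifier(n_estimators=100, random_state=7).fit(X, y)

# ROC needs the positive-class probability, not the hard label
proba = rf.predict_proba(X)[:, 1]
fpr, tpr, _ = roc_curve(y, proba)
auc = roc_auc_score(y, proba)
print(f"train AUC = {auc:.3f}")
```

A forest scored on its own training data will sit near AUC 1.0, which is why the test-set curve in the next section is the more honest gauge.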
ACCURACY
CONFUSION MATRIX
MODEL 3
ANN
CONFUSION MATRIX
ACCURACY
ACCURACY
CONFUSION MATRIX
2.4 Final Model: Compare all the models and write an inference on which model is best.
CONCLUSION:
2.5 Inference: Based on the whole Analysis, what are the business
insights and recommendations?
Looking at the model, more data would help us understand and predict better.
• Another interesting fact: almost all the offline business has a claim associated with it.
• The JZI agency's resources need training to pick up sales, as they are at the bottom; we need to run a
promotional marketing campaign or evaluate whether we need to tie up with an alternative agency.
• Based on the model we are getting 80% accuracy, so when a customer books airline tickets or
plans, we can cross-sell the insurance based on the claim data pattern.
• Another interesting fact is that more sales happen via agencies than airlines, yet the trend shows more claims
are processed on the airline side. So we may need to deep-dive into the process to understand the workflow
and why.