
University of Texas

Life Insurance Sales

FINAL PROJECT REPORT

SUBMITTED BY
VIVEK AJAYAKUMAR

DATE OF SUBMISSION: 04 JULY 2022


VERSION: 2.1

Sl. No.   CONTENT

1    Introduction to the Business Problem
2    Data Cleaning/Preprocessing
3    Exploratory Data Analysis (EDA)
4    Univariate Analysis
5    Bi-variate Analysis
6    Modelling Approach
7    Linear Regression Model
8    KNN Regression Model
9    Random Forest Regression Model
10   Model Validation
11   Business Recommendation

LIST OF TABLES

1    Data Dictionary
2    Null Value Table
3    Outlier Table
4    Feature Engineering Table
5    Univariate Analysis
6    Bi-variate Analysis
7    Model Metrics

LIST OF FIGURES

1    Count Plot
2    Heat Map
3    Pair Plot
4    Scatter Plot
5    Random Forest Plot

1: Introduction to the Business Problem

Insurance companies flourish all over the world, and their services are distributed through various channels. Insurance agents are the most widely used channel for expanding the business. Insurance agents sell and negotiate life, health, property, or other types of insurance to match the needs of their clients. As an insurance agent, you may work for an insurance company, refer clients to independent brokers, or work as an independent broker.

In this case, the insurance company collected data on its customers, insurance policies and agent details. The company wants to predict the bonus of its agents by analysing the data. From the data, the firm can design appropriate engagement activities for its high-performing agents and upskilling programmes for low-performing agents.

The firm wants to understand the business by analysing the agent bonus. The company is planning to classify the agents into two groups: high-performing agents and low-performing agents. The firm wants to design a model to ensure an appropriate bonus for its employees/agents.

The study is conducted to ensure that each and every agent is driven towards their goal by predicting the appropriate bonus for all agents. This helps in identifying those with higher performance and keeping them motivated. At the same time, low-performing agents need to be identified so that upskilling programmes can be added to their schedule.

2: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an approach to analyse the data using visual techniques.
It is used to discover trends, patterns, or to check assumptions with the help of
statistical summary and graphical representations.

The dataset is uniquely constructed to encompass various features for predicting the agent bonus.

The dataset has three main segments:


1) Data of the policy holder.
2) Details of the policy.
3) Agent details and channel.

The dataset has 4520 entries covering the details of:

Customer ID, Agent Bonus, Age, Customer Tenure, Channel, Occupation, Education Field, Gender, Existing Product Type, Designation, Number of Policy, Marital Status, Monthly Income, Complaint, Policy Tenure, Sum Assured, Zone, Payment Method, Last Month Calls and Customer Care Score.

Data Dictionary

The insurance company uses this dataset for predicting the bonus for its agents; further analysis is done to design appropriate engagement activities to improve the agents' skill set.

The data dictionary is shown below:


CustID                Unique customer ID
AgentBonus            Bonus amount given to each agent in the last month
Age                   Age of customer
CustTenure            Tenure of customer in the organization
Channel               Channel through which acquisition of customer is done
Occupation            Occupation of customer
EducationField        Field of education of customer
Gender                Gender of customer
ExistingProdType      Existing product type of customer
Designation           Designation of customer in their organization
NumberOfPolicy        Total number of existing policies of a customer
MaritalStatus         Marital status of customer
MonthlyIncome         Gross monthly income of customer
Complaint             Indicator of complaint registered in the last one month by customer
ExistingPolicyTenure  Max tenure across all existing policies of customer
SumAssured            Max of sum assured across all existing policies of customer
Zone                  Zone of India to which the customer belongs, like East, West, North and South
PaymentMethod         Frequency of payment selected by customer, like monthly, quarterly, half-yearly and yearly
LastMonthCalls        Total calls attempted by company to a customer for cross-sell
CustCareScore         Customer satisfaction score given by customer in the previous service call

From the data dictionary, we can understand that data from three main segments is taken:
1) Data of the policy holder
2) Details of the policy
3) Agent details and channels

The dataset is uniquely constructed to encompass various features to predict the bonus and also to improve the skill set of the agents.

Data Report
As explained earlier, the data has three main segments, and the data dictionary is shown above. The data has 19 columns that capture customer, policy and agent details. The dataset has 4520 entries with various data types.

The data description is shown below.

Table 1: Data Description


Data Pre-processing

1) Presence of Null Values

The dataset contains null values in both categorical and continuous variables.

Variable              Number of Null Values

Age                   269
CustTenure            226
NumberOfPolicy        45
MonthlyIncome         236
ExistingPolicyTenure  184
SumAssured            154
CustCareScore         52

Table 2: Null Value Table
The null values are imputed using the k-nearest neighbours (KNN) algorithm.
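A minimal sketch of this imputation step, assuming scikit-learn's KNNImputer (the neighbour count and exact column list are assumptions; the report only names the algorithm):

from sklearn.impute import KNNImputer

# Columns with nulls, taken from Table 2 above.
num_cols = ['Age', 'CustTenure', 'NumberOfPolicy', 'MonthlyIncome',
            'ExistingPolicyTenure', 'SumAssured', 'CustCareScore']
imputer = KNNImputer(n_neighbors=5)  # n_neighbors=5 is an assumption
data[num_cols] = imputer.fit_transform(data[num_cols])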

2) Replacement of Values

The categorical data has some issues caused by misspellings and spacing problems, as specified below. Treatment is done to improve the quality of the data.

Variable: Channel
  Data before treatment: ['Agent', 'Third Party Partner', 'Online']
  Data after treatment:  ['Agent', 'Third Party Partner', 'Online']

Variable: Occupation
  Data before treatment: ['Salaried', 'Free Lancer', 'Small Business', 'Laarge Business', 'Large Business']
  Data after treatment:  ['Salaried', 'Free Lancer', 'Small Business', 'Large Business']

Variable: EducationField
  Data before treatment: ['Graduate', 'Post Graduate', 'UG', 'Under Graduate', 'Engineer', 'Diploma', 'MBA']
  Data after treatment:  ['Graduate', 'Post Graduate', 'Under Graduate', 'Engineer', 'Diploma', 'MBA']

Variable: Gender
  Data before treatment: ['Female', 'Male', 'Fe male']
  Data after treatment:  ['Female', 'Male']

Variable: Designation
  Data before treatment: ['Manager', 'Exe', 'Executive', 'VP', 'AVP', 'Senior Manager']
  Data after treatment:  ['Manager', 'Exe', 'Executive', 'VP', 'AVP', 'Senior Manager']

Variable: MaritalStatus
  Data before treatment: ['Single', 'Divorced', 'Unmarried', 'Married']
  Data after treatment:  ['Single', 'Divorced', 'Unmarried', 'Married']

Variable: Zone
  Data before treatment: ['North', 'West', 'East', 'South']
  Data after treatment:  ['North', 'West', 'East', 'South']

Variable: PaymentMethod
  Data before treatment: ['Half Yearly', 'Yearly', 'Quarterly', 'Monthly']
  Data after treatment:  ['Half Yearly', 'Yearly', 'Quarterly', 'Monthly']

Table 3: Data Evaluation Table


The values are analysed and corrected to improve the quality of the data. The discrete values are properly modified and evaluated.
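A minimal sketch of this clean-up with pandas, based on the before/after values in Table 3 (the project's exact replacement calls may differ):

# Merge the misspelled/misspaced categories into their correct values.
data['Occupation'] = data['Occupation'].replace({'Laarge Business': 'Large Business'})
data['Gender'] = data['Gender'].replace({'Fe male': 'Female'})
data['EducationField'] = data['EducationField'].replace({'UG': 'Under Graduate'})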

3) Presence of Outliers

An outlier is an observation that deviates significantly from the rest of the observations. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining. An outlier cannot simply be dismissed as noise or error.

In this dataset, outliers are present in the following variables:

CustTenure, Age, AgentBonus, MonthlyIncome, ExistingPolicyTenure and SumAssured.

(Box plots of AgentBonus, Age, CustTenure, MonthlyIncome, ExistingPolicyTenure and SumAssured)

Table 4: Outlier Table

The outliers are treated and the values updated to improve the quality of the data.
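The report does not state the exact treatment method; a common approach, sketched here under the assumption of IQR-based capping, is:

# Cap each value at 1.5 * IQR beyond the quartiles (an assumed method).
def cap_outliers(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

for col in ['CustTenure', 'Age', 'AgentBonus', 'MonthlyIncome',
            'ExistingPolicyTenure', 'SumAssured']:
    data[col] = cap_outliers(data[col])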

4) Variance of Variables

Variance is a measure of spread for the distribution of a random variable; it determines the degree to which the values of a random variable differ from the expected value.

In this data, the following variables have exceptionally high variance.

Variable        Variance
AgentBonus      1844936.85
SumAssured      54358424250.60
MonthlyIncome   18611449.10

Table 5: Variance Table

Compared to the other variables, these high-variance variables would dominate the modelling. To avoid this, a logarithmic transformation is applied and the values are scaled down to improve the quality of the data.

Dataset after transformation:

5) Feature Engineering

Feature engineering is a machine learning technique that leverages data to create new
variables that aren't in the training set. It can produce new features for both supervised
and unsupervised learning, with the goal of simplifying and speeding up data
transformations while also enhancing model accuracy.

In this dataset, the 'Age' variable is modified. The continuous variable is binned into the age bins 0-10, 10-20, 20-30 and 30-40. These bins are then encoded as categorical values.

Age Bins   Categorical Value

0-10       0
10-20      1
20-30      2
30-40      3

Table 6: Feature Engineering Table
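A minimal sketch of this binning with pandas (assuming ages fall within 0-40 after treatment):

# Bin 'Age' into the ranges of Table 6 and encode the bins as integers.
data['age_bins'] = pd.cut(data['Age'], bins=[0, 10, 20, 30, 40],
                          labels=[0, 1, 2, 3]).astype(int)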
6) Modification of Variables

For modelling purposes, object data types are converted into numerical categorical values.

The results are given below, followed by a sketch of the conversion:

feature: Channel
[Agent, Third Party Partner, Online]
Categories (3, object): [Agent, Online, Third Party Partner]
[0 2 1]

feature: Occupation

9
[Salaried, Free Lancer, Small Business, Large Business]
Categories (4, object): [Free Lancer, Large Business, Salaried, Small Business]
[2 0 3 1]

feature: EducationField
[Graduate, Post Graduate, Under Graduate, Engineer, Diploma, MBA]
Categories (6, object): [Diploma, Engineer, Graduate, MBA, Post Graduate, Under Graduate]
[2 4 5 1 0 3]

feature: Gender
[Female, Male]
Categories (2, object): [Female, Male]
[0 1]

feature: Designation
[Manager, Executive, VP, AVP, Senior Manager]
Categories (5, object): [AVP, Executive, Manager, Senior Manager, VP]
[2 1 4 0 3]

feature: MaritalStatus
[Single, Divorced, Unmarried, Married]
Categories (4, object): [Divorced, Married, Single, Unmarried]
[2 0 3 1]

feature: Zone
[North, West, East, South]
Categories (4, object): [East, North, South, West]
[1 3 0 2]

feature: PaymentMethod
[Half Yearly, Yearly, Quarterly, Monthly]
Categories (4, object): [Half Yearly, Monthly, Quarterly, Yearly]
[0 3 2 1]
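A minimal sketch of this conversion; the printed mappings above match pandas' Categorical codes:

# Encode every remaining object column as numerical categories.
for col in data.select_dtypes(include='object').columns:
    data[col] = pd.Categorical(data[col]).codes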

Overall Analysis of Data Cleaning and Pre-processing

• Null values are imputed.
• Wrong values are replaced to improve the quality of the data.
• Outliers are removed.
• The variance of continuous variables is properly treated.
• Feature engineering is done to improve the model.
• Discrete variables are converted into numerical categorical values.

Univariate Analysis

Univariate analysis is the key to understanding each and every variable in the data.

Analysis of Continuous Variables

(Distribution plots, box plots and descriptions for each continuous variable: CustID, AgentBonus, Age, CustTenure, ExistingProdType, NumberOfPolicy, MonthlyIncome, ExistingPolicyTenure, SumAssured and LastMonthCalls)

Analysis of Categorical Variables

(Count plots for each categorical variable: Channel, Occupation, EducationField, Gender, Designation, MaritalStatus, Zone and PaymentMethod)

Inference:
• The univariate analysis shows that continuous variables like AgentBonus, Age, CustTenure, MonthlyIncome, ExistingPolicyTenure and SumAssured have outliers, which need to be treated to improve the quality of the data.
• High-variance variables are modified to improve the quality of the data.
• From the categorical variables, we can find that:
  o Agents play a vital role in bringing customers to the organization.
  o Customers having Free Lancer and Large Business as their profession bring more business to the firm.
  o The designation of the customer plays a crucial role in canvassing the insurance policy.
  o Married customers are interested in availing of insurance policies.
  o The North and West zones bring most of the business.
  o Customers prefer to pay half-yearly and yearly for their insurance payments.

Bi-variate Analysis

Bivariate analysis is a kind of statistical analysis in which two variables are observed against each other. One of the variables is dependent and the other is independent. The variables are denoted by X and Y. The changes between the two variables are analysed to understand to what extent change has occurred.

In our dataset, agent bonus is the dependent (target) variable. For bi-variate analysis, agent bonus is evaluated against all the other independent variables used for prediction.

Heat Map of Variables:

Figure 1: Heat Map


Inference:
• From the heat map, there is a strong correlation between AgentBonus and both MonthlyIncome and SumAssured. At the same time, there is a moderate relationship between AgentBonus and Age, and between AgentBonus and CustTenure.
• Variables like CustCareScore and Complaint have a weak relationship with the dependent variable, AgentBonus.

Pair Plot of Variables

Figure 2: Pair Plot

Inference:
• From the pair plot, there is a strong linear correlation between AgentBonus and SumAssured, MonthlyIncome and CustTenure.

Bi-variate Analysis of Discrete Variables

(Plots of AgentBonus against each discrete variable: Channel, Occupation, EducationField, Gender, Designation, MaritalStatus, Zone and PaymentMethod)

Table 7: Bi-variate Table


Inference:
• From the bi-variate analysis, most of the variables contribute equally to AgentBonus. This implies that an agent can gain bonus from most of the sections. Apart from that, under 'Designation', positions like VP and AVP help the agent get the maximum bonus. Similarly, 'Unmarried' customers are the least contributors to agent bonus.

Other Plots:

Inferences:

• The median policy tenure is about 3 years.
• The average policy tenure is about 4 years.
• AVP and VP designations hold the longest policy tenures; more focus needs to be given to them for business expansion.

Inferences:
• The South zone needs more attention compared to the other zones.
• The rest of the zones have almost equal participation with respect to gender.
• From a complaint point of view, the West zone needs to rectify those issues.

Inference:
• Customer tenure for married persons is higher for both genders.
• More focus needs to be given to divorced and unmarried persons.

Inferences:

The scatter plot implies that there is a direct positive relationship between SumAssured and AgentBonus.

Modelling Approach Used:

o For predicting 'Agent Bonus', a regression modelling approach is used. To improve the model, null values are replaced, outliers are removed, high-variance variables are transformed, and feature engineering is done. The dataset is split 70:30 for training and testing the model: 70% of the data is in the training set and 30% is in the test set (a sketch of the split is shown below). Models are further tuned using hyperparameters to improve accuracy. Various metrics are considered to compare the model performances.
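A minimal sketch of the split, assuming scikit-learn's train_test_split (dropping CustID is an assumption, since an ID carries no predictive signal; random_state=42 matches the Random Forest parameters shown later):

from sklearn.model_selection import train_test_split

X = data.drop(['AgentBonus', 'CustID'], axis=1)
y = data['AgentBonus']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)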

Models Used:

▪ Linear Regression Model.


▪ KNN Regression Model.
▪ Random Forest Regression Model.

Metrics Considered:

▪ Root Mean Squared Error (RMSE)
▪ Coefficient of Determination (R² score)
▪ Adjusted R Square
▪ Mean Absolute Error (MAE)
▪ Mean Squared Error (MSE)

Linear Regression Model
Linear regression is a linear model, e.g. a model that assumes a linear relationship
between the input variables (x) and the single output variable (y). More specifically,
that y can be calculated from a linear combination of the input variables (x).
When there is a single input variable (x), the method is referred to as simple linear
regression. When there are multiple input variables, literature from statistics often
refers to the method as multiple linear regression.
Different techniques can be used to prepare or train the linear regression equation from
data, the most common of which is called Ordinary Least Squares. It is common to therefore
refer to a model prepared this way as Ordinary Least Squares Linear Regression or just
Least Squares Regression.

The formula used for modelling is:

AgentBonus ~ Age + CustTenure + Channel + Occupation + EducationField + Gender + ExistingProdType + Designation + NumberOfPolicy + MaritalStatus + MonthlyIncome + Complaint + ExistingPolicyTenure + SumAssured + Zone + PaymentMethod + LastMonthCalls + CustCareScore
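A minimal sketch of fitting this model with scikit-learn, which produces a coefficient dump like the one below:

from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
for name, coef in zip(X_train.columns, regression_model.coef_):
    print('The coefficient for {} is {}'.format(name, coef))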

The coefficient for CustTenure is 0.0003300824744941662
The coefficient for Channel is -2.751621283610898e-05
The coefficient for Occupation is 1.5259293154943037e-05
The coefficient for EducationField is 1.1697523443542078e-05
The coefficient for Gender is 0.00031193417658791295
The coefficient for ExistingProdType is -0.00015409480456214582
The coefficient for Designation is 1.8915797248280733e-05
The coefficient for NumberOfPolicy is 0.00020411959362769263
The coefficient for MaritalStatus is -0.0003615014051258259
The coefficient for MonthlyIncome is 0.02808782946595261
The coefficient for Complaint is 0.0003250004697582831
The coefficient for ExistingPolicyTenure is 0.0005479683954928168
The coefficient for SumAssured is 0.06516875856010947
The coefficient for Zone is 9.363981215428923e-05
The coefficient for PaymentMethod is 3.256233181602395e-05
The coefficient for LastMonthCalls is 0.00011451462536120873
The coefficient for CustCareScore is 0.000154645997875137
The coefficient for age_bins is 0.0027310843920501505

The most significant coefficients are listed below:

Variable               Coefficient
SumAssured             0.06516
MonthlyIncome          0.02808
age_bins               0.00273
ExistingPolicyTenure   0.000547
Gender                 0.000312
ExistingProdType       -0.00015

Model metrics:

                    Train Data    Test Data
R Square            0.7587        0.75617
Adjusted R Square   0.75779       0.75519
MAE                 0.00686       0.0069
MSE                 7.6279e-05    7.6589e-05
RMSE                0.00873       0.00875

Analysis:

o From this linear regression model, we can understand that the sum assured, the monthly income of the policy holders, the age group, the existing policy tenure and gender play a vital role in the agent bonus. Along with that, the existing product type has a negative impact on the agent bonus value.
o The R-square values of the train and test data indicate that the model is neither underfit nor overfit. The value is also appreciable for predicting the agent bonus.
o The MAE value is acceptable, so the predicted value would remain close to the actual value.
o The MSE and RMSE values are also acceptable for predicting the agent bonus.

Hyper Tuning of the Linear Regression Model:

For hyper tuning the linear regression model, we can use Lasso, Ridge or Elastic Net modelling. In our case, we cannot use these modelling approaches because the values are already scaled down, and further regularization would shrink the estimates too far. The coefficients of the Lasso model are shown below:
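A minimal sketch of a Lasso fit that would produce the table below (the alpha value is an assumption; the report does not state it):

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.01)  # alpha is an assumption
lasso.fit(X_train, y_train)
print(pd.DataFrame({'Coefficient Estimate': lasso.coef_},
                   index=X_train.columns))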
0 Coefficient Estimate
0 Age 0.0
1 CustTenure -0.0
2 Channel 0.0
3 Occupation 0.0
4 EducationField -0.0
5 Gender 0.0
6 ExistingProdType 0.0

7 Designation 0.0
8 NumberOfPolicy -0.0
9 MaritalStatus 0.0
10 MonthlyIncome 0.0
11 Complaint 0.0
12 ExistingPolicyTenure 0.0
13 SumAssured 0.0
14 Zone -0.0
15 PaymentMethod 0.0
16 LastMonthCalls 0.0
17 CustCareScore 0.0

The coefficient values are shrunk too much for further modelling. Therefore, hyper tuning cannot be implemented this way, and there is no need for Ridge and Elastic Net modelling.

KNN Regression Model


KNN regression is a non-parametric method that, in an intuitive manner, approximates the
association between independent variables and the continuous outcome by averaging the
observations in the same neighbourhood.

A simple implementation of KNN regression is to calculate the average of the numerical


target of the K nearest neighbors. Another approach uses an inverse distance weighted
average of the K nearest neighbors. KNN regression uses the same distance functions as
KNN classification.

In this model, various K values are applied and the RMSE value is calculated for each. The K with the lowest RMSE is then considered (a sketch of this loop is shown below).
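A minimal sketch of that loop, assuming scikit-learn's KNeighborsRegressor:

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

for k in range(1, 21):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, knn.predict(X_test)))
    print('RMSE value for k=', k, 'is:', round(rmse, 5))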

RMSE value for k= 1 is: 0.01749


RMSE value for k= 2 is: 0.01518
RMSE value for k= 3 is: 0.014468
RMSE value for k= 4 is: 0.013879
RMSE value for k= 5 is: 0.01362
RMSE value for k= 6 is: 0.013416
RMSE value for k= 7 is: 0.01333
RMSE value for k= 8 is: 0.01328
RMSE value for k= 9 is: 0.01323
RMSE value for k= 10 is: 0.01319
RMSE value for k= 11 is: 0.01318
RMSE value for k= 12 is: 0.01317
RMSE value for k= 13 is: 0.01315
RMSE value for k= 14 is: 0.01308
RMSE value for k= 15 is: 0.01308
RMSE value for k= 16 is: 0.01308
RMSE value for k= 17 is: 0.01307
RMSE value for k= 18 is: 0.01304
RMSE value for k= 19 is: 0.01299
RMSE value for k= 20 is: 0.01299

From the RMSE values, the accepted value of K is 15.

Model Metrics when K = 15:

                    Train Data   Test Data
R Square            0.53100      0.4555
Adjusted R Square   0.5291       0.4533
MAE                 0.00973      0.010462
MSE                 0.0001482    0.000171
RMSE                0.012177     0.013078

Plot of Predicted Value/ Actual Value

Hyper Tuning of the Model

For hyper tuning the model, various values of K are tried. In this tuning process, K values from 1 to 100 are evaluated to figure out the best model (a sketch is shown below).

The resulting K after hyper tuning is 13.
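A minimal sketch of this search with GridSearchCV (the cv and scoring settings are assumptions; the report only says K values from 1 to 100 were searched):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

grid = GridSearchCV(KNeighborsRegressor(),
                    param_grid={'n_neighbors': range(1, 101)},
                    scoring='neg_root_mean_squared_error', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)  # e.g. {'n_neighbors': 13}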

Model Metrics:

                    Train Data   Test Data
R Square            0.54059      0.4494
Adjusted R Square   0.53875      0.4472
MAE                 0.009617     0.01055
MSE                 0.0001452    0.000172
RMSE                0.01205      0.01315

Plot of Predicted Value/ Actual Value

Inference:
o The R-squared and adjusted R-squared values indicate that the model is not good for predicting the values.
o Hyper tuning fails to improve the model performance.
o The model can be categorized as an underfit model.
o The model has unacceptable MAE, MSE and RMSE values for predicting the agent bonus.

Random Forest Regression:

Random Forest regression is a supervised learning algorithm that uses an ensemble learning method for regression. Ensemble learning is a technique that combines predictions from multiple machine learning algorithms to make a more accurate prediction than a single model.

For modelling, sklearn's RandomForestRegressor is used. The data is split 70:30 into train and test sets. The trained model is then applied to the test data to analyse the quality of the model (a sketch is shown below).
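A minimal sketch of the fit, using the settings visible in the parameter dump printed below (everything else is left at its sklearn default):

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(X_train, y_train)
print(rf.get_params())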

Random Forest tree is shown below:

Figure 5: Random Forest Plot

Regression parameters are shown below:

{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'mse',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 1000,
'n_jobs': None,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}

Model Metrics:

                    Train Data   Test Data
R Square            0.74747      0.73076
Adjusted R Square   0.74646      0.72968
MAE                 0.0071       0.007395
MSE                 7.9843e-05   8.4571e-05
RMSE                0.00893      0.009193

Plot of Predicted/Actual Data

Plot of Predicted/Actual Data

Hyperparameter Tuning
Using grid search, we have tuned the hyperparameters (a sketch is shown after the parameter list below).

The best parameters are:

n_estimators: 800
min_samples_split: 5
min_samples_leaf: 1
max_features: 'sqrt'
max_depth: 90
bootstrap: False
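A minimal sketch of the grid search (the candidate value grids are assumptions; only the best values found are reported above):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {'n_estimators': [600, 800, 1000],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4],
              'max_features': ['auto', 'sqrt'],
              'max_depth': [70, 80, 90, 100],
              'bootstrap': [True, False]}
grid = GridSearchCV(RandomForestRegressor(random_state=42),
                    param_grid, cv=3, n_jobs=-1)
grid.fit(X_train, y_train)
best_rf = grid.best_estimator_
print(grid.best_params_)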

                    Train Data   Test Data
R Square            0.991882     0.812877
Adjusted R Square   0.991849     0.812128
MAE                 0.00121      0.006028
MSE                 2.5667e-06   5.577428e-05
RMSE                0.00160      0.007667

Feature Importance:

CustTenure             15.40674
Channel                 0.567802
Occupation              0.573335
EducationField          0.896722
Gender                  0.392037
ExistingProdType        1.015336
Designation             3.726198
NumberOfPolicy          1.207663
MaritalStatus           1.431711
MonthlyIncome          10.86519
Complaint               0.374762
ExistingPolicyTenure    5.498949
SumAssured             43.32096
Zone                    0.48656
PaymentMethod           0.516037
LastMonthCalls          2.220802
CustCareScore           1.104411
age_bins               10.39478
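A minimal sketch of how these importances can be read off the tuned model (best_rf is the estimator from the grid search above; scaling to percentages is an inference from the values summing to roughly 100):

# Feature importances, scaled to percentages.
importances = pd.Series(best_rf.feature_importances_ * 100,
                        index=X_train.columns)
print(importances)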


Inferences:

o The R-square and adjusted R-square values have changed after tuning the model. The value for the train data is about 0.99, and for the test data it is 0.8128. This implies that the model is neither underfit nor overfit.
o Moreover, the MAE, MSE and RMSE values are also good, and therefore the hyper-tuned model can be considered for modelling.
o The significant features of the model are:
  ▪ SumAssured
  ▪ CustTenure
  ▪ MonthlyIncome
  ▪ age_bins
  ▪ ExistingPolicyTenure

Model Validation
In machine learning, model validation is the process where a trained model is evaluated with a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived.

In our regression problem to predict 'Agent Bonus', we have applied various modelling approaches to predict the value. The model metrics are then evaluated and compared to figure out the best model. From our analysis, the hyper-tuned Random Forest model outperforms the linear regression and KNN regression models.

The model metrics for the various models are shown below:

                    Linear Regression       KNN Regression          Random Forest Regression
                    Train       Test        Train       Test        Train        Test
R Square            0.7587      0.7562      0.5406      0.4494      0.9919       0.8129
Adjusted R Square   0.7578      0.7552      0.5388      0.4472      0.9918       0.8121
MAE                 0.0069      0.0069      0.0096      0.0106      0.0012       0.0060
MSE                 7.6279e-05  7.6589e-05  0.0001      0.0002      2.5667e-06   5.5774e-05
RMSE                0.0087      0.0078      0.0121      0.0132      0.0016       0.0077
Max Error           0.0411      0.0456      0.0504      0.0460      0.0330       0.0082

From the model metrics given above, the Random Forest regression model performs well and is able to predict the values better than the other models.

R-squared and adjusted R-squared values are best used to explain the model, because they express the percentage of output variability the model explains. MSE, RMSE and MAE are better used to compare performance between different regression models.

In our data analysis, the main objective is to predict the agent bonus and to classify the agents in order to improve their performance. Therefore, we consider R-square and adjusted R-square to understand the performance of the model. From the consolidated model metrics, it is clear that the hyper-tuned Random Forest is the best available model to predict the agent bonus. R-squared values range from 0 to 1 and are commonly stated as percentages from 0% to 100%. The model metrics indicate that the R-square value on the test data is 0.8129, i.e., the model explains about 81.29% of the variability in the test data. Therefore, the model is neither overfit nor underfit, and it performs well on both the train data and the test data. Comparing the other model metrics (MAE, MSE, RMSE, Max Error), the hyper-tuned Random Forest also performs well compared to the other two models.

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line the data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. The hyper-tuned Random Forest model has a good RMSE value compared to the other models. In this modelling, the R-squared, adjusted R-squared and RMSE values are considered as the model metrics (a sketch of how they can be computed is shown below).
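A minimal sketch of how these metrics can be computed for any of the fitted models (the helper name is illustrative):

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, max_error)

def report(model, X, y):
    pred = model.predict(X)
    r2 = r2_score(y, pred)
    n, k = X.shape
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # adjusted R-square
    mse = mean_squared_error(y, pred)
    print('R Square:', r2, '| Adjusted R Square:', adj_r2)
    print('MAE:', mean_absolute_error(y, pred), '| MSE:', mse,
          '| RMSE:', np.sqrt(mse), '| Max Error:', max_error(y, pred))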

Business Recommendation
The objective of this analysis is to predict the agent bonus from the dataset. For this, a regression approach is used: three regression models are built and compared to predict the agent bonus.

From the univariate analysis:

• From the Channel, most of the policies come through agents, and the firm needs to tap the potential of the Third Party Partner and Online channels.
• The company needs to make policies to attract freelancers and large business holders.
• Graduates are interested in availing of insurance policies. The insurance company needs to focus on engineers, MBA holders and other sections.
• The designation of the customer plays a vital role in availing of the policy. Most of the policy holders are Managers and Executives.
• Married people are interested in availing of insurance policies, but single, unmarried and divorced people are not interested in having an insurance policy.
• The East and South zones are the worst performers; the company needs to focus its business on these zones as well.
• The preferred policy payment is half-yearly, so the company should design its premium amounts for half-yearly payment, which might attract customers in the future.
From the bi-variate or multi-variate analysis:

• Most of the variables contribute equally to the agent bonus. This implies that an agent can gain bonus from most of the sections. Apart from that, under 'Designation', positions like VP and AVP help the agent get the maximum bonus. Similarly, 'Unmarried' customers are the least contributors to agent bonus.

In this case, the hyper-tuned Random Forest regression model is used for predicting the agent bonus. In this model, the significant features are SumAssured, CustTenure, MonthlyIncome, age bins and ExistingPolicyTenure. Therefore, an agent who canvasses higher sum-assured policies from older customers with higher monthly income and longer policy tenure can earn a higher bonus. The agent needs to focus on these parameters when canvassing customers.

Moreover, the average existing policy tenure is maximum for higher-ranked designations, while policy tenure is lower for undergraduates. Therefore, agents need to focus on people with higher-ranked designations to bring good business to the company. Policy tenure for married people is higher, so we can focus on divorced and unmarried people, which might increase policy tenure.

The South and East regions are the worst-performing regions compared to the other regions. Proper training and daily evaluation need to be done for proper tracking of the progress of the underperforming agents. For high-performing agents, I would recommend increasing the number of policies in the higher sum-assured value category. Focusing on that area can improve the business and the bonus.

Appendix:
# https://towardsdatascience.com/whats-the-difference-between-linear-regression-lasso-ridge-and-elasticnet-8f997c60cf29

Lasso, Ridge and ElasticNet are all part of the linear regression family, where the x (input) and y (output) are assumed to have a linear relationship. In sklearn, LinearRegression refers to the most ordinary least squares linear regression method without regularization (penalty on weights). The main difference among them is whether the model is penalized for its weights. For the rest of the post, I am going to talk about them in the context of the scikit-learn library.

Linear regression (in scikit-learn) is the most basic form, where the model is not penalized for its choice of weights at all. That means, during the training stage, if the model feels like one particular feature is particularly important, the model may place a large weight on that feature. This sometimes leads to overfitting in small datasets. Hence, the following methods were invented.

Lasso is a modification of linear regression, where the model is penalized for the sum of absolute values of the weights. Thus, the absolute values of the weights will (in general) be reduced, and many will tend to be zeros. During training, the objective function becomes (in sklearn's notation):

minimize: (1 / (2n)) * ||y - Xw||² + alpha * Σ|wj|

As you can see, Lasso introduces a new hyperparameter, alpha, the coefficient to penalize the weights.

Ridge takes a step further and penalizes the model for the sum of squared values of the weights. Thus, the weights not only tend to have smaller absolute values, but the extremes of the weights are penalized particularly strongly, resulting in a group of weights that are more evenly distributed. The objective function becomes:

minimize: ||y - Xw||² + alpha * Σ wj²

ElasticNet is a hybrid of Lasso and Ridge, where both the absolute value penalization and the squared penalization are included, regulated with another coefficient, l1_ratio:

minimize: (1 / (2n)) * ||y - Xw||² + alpha * l1_ratio * Σ|wj| + 0.5 * alpha * (1 - l1_ratio) * Σ wj²

Are your data scaled yet?

As you can see in the equations above, the weight penalizations are summed together in the loss function. Suppose we have a feature house_size in the 2000 range, while another feature num_bedrooms is in the range of 3; then we would expect the weight for house_size to be naturally smaller than the weight for num_bedrooms. In such a case, penalizing each feature's weight in the same way becomes inappropriate. Hence, it is important to scale or normalize the data before feeding it to the models. A quick note: the default setting in sklearn for these models sets 'normalize' to false. You will either want to turn 'normalize' on, or use StandardScaler to scale the data. Typically, using StandardScaler is good practice because you may want to scale your testing data using the same scale.

When to use which?

There are a few things to remember:

(1) sklearn's algorithm cheat sheet suggests you try Lasso, ElasticNet, or Ridge when your data-set is smaller than 100k rows. Otherwise, try SGDRegressor.

(2) Lasso and ElasticNet tend to give sparse weights (mostly zeros), because the l1 regularization cares equally about driving big weights down to small weights and driving small weights to zeros. If you have a lot of predictors (features), and you suspect that not all of them are that important, Lasso and ElasticNet may be a really good idea to start with.

(3) Ridge tends to give small but well-distributed weights, because the l2 regularization cares more about driving big weights to small weights than about driving small weights to zeros. If you only have a few predictors, and you are confident that all of them should be really relevant for predictions, try Ridge as a good regularized linear regression method.

(4) You will need to scale your data before using these regularized linear regression methods. Use StandardScaler first, or set 'normalize' in these estimators to 'True'.

# https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/

11 Important Model Evaluation Metrics for Machine Learning Everyone Should Know

Tavish Srivastava — August 6, 2019

Overview

• Evaluating a model is a core part of building an effective machine learning model
• There are several evaluation metrics, like confusion matrix, cross-validation, AUC-ROC curve, etc.
• Different evaluation metrics are used for different kinds of problems

This article was originally published in February 2016 and updated in August 2019 with four new evaluation metrics.

Introduction

The idea of building machine learning models works on a constructive feedback principle.

You build a model, get feedback from metrics, make improvements and continue until you

achieve a desirable accuracy. Evaluation metrics explain the performance of a model. An

important aspect of evaluation metrics is their capability to discriminate among model

results.

I have seen plenty of analysts and aspiring data scientists not even bothering to check

how robust their model is. Once they are finished building a model, they hurriedly map

predicted values on unseen data. This is an incorrect approach.

Simply building a predictive model is not your motive. It’s about creating and selecting

a model which gives high accuracy on out of sample data. Hence, it is crucial to check

the accuracy of your model prior to computing predicted values.


In our industry, we consider different kinds of metrics to evaluate our models. The choice

of metric completely depends on the type of model and the implementation plan of the model.

After you are finished building your model, these 11 metrics will help you in evaluating

your model’s accuracy. Considering the rising popularity and importance of cross-

validation, I’ve also mentioned its principles in this article.


Table of Contents

1. Confusion Matrix
2. F1 Score
3. Gain and Lift Charts
4. Kolmogorov Smirnov Chart
5. AUC – ROC
6. Log Loss
7. Gini Coefficient
8. Concordant – Discordant Ratio
9. Root Mean Squared Error
10. Cross Validation (Not a metric though!)

Warming up: Types of Predictive models

When we talk about predictive models, we are talking either about a regression model

(continuous output) or a classification model (nominal or binary output). The evaluation

metrics used in each of these models are different.

In classification problems, we use two types of algorithms (dependent on the kind of output

it creates):

1. Class output: Algorithms like SVM and KNN create a class output. For instance, in
a binary classification problem, the outputs will be either 0 or 1. However, today
we have algorithms which can convert these class outputs to probability. But these
algorithms are not well accepted by the statistics community.
2. Probability output: Algorithms like Logistic Regression, Random Forest, Gradient
Boosting, Adaboost etc. give probability outputs. Converting probability outputs
to class output is just a matter of creating a threshold probability.

In regression problems, we do not have such inconsistencies in output. The output is

always continuous in nature and requires no further treatment.

Illustrative Example

For a classification model evaluation metric discussion, I have used my predictions for

the problem BCI challenge on Kaggle. The solution of the problem is out of the scope of

our discussion here. However the final predictions on the training set have been used for

this article. The predictions made for this problem were probability outputs which have

been converted to class outputs assuming a threshold of 0.5.

1. Confusion Matrix

A confusion matrix is an N X N matrix, where N is the number of classes being predicted.

For the problem in hand, we have N=2, and hence we get a 2 X 2 matrix. Here are a few

definitions, you need to remember for a confusion matrix :

• Accuracy : the proportion of the total number of predictions that were correct.
• Positive Predictive Value or Precision : the proportion of positive cases that
were correctly identified.
• Negative Predictive Value : the proportion of negative cases that were correctly
identified.
• Sensitivity or Recall : the proportion of actual positive cases which are
correctly identified.
• Specificity : the proportion of actual negative cases which are correctly
identified.

The accuracy for the problem in hand comes out to be 88%. As you can see from the above

two tables, the Positive predictive Value is high, but negative predictive value is quite

low. Same holds for Sensitivity and Specificity. This is primarily driven by the threshold

value we have chosen. If we decrease our threshold value, the two pairs of starkly different

numbers will come closer.

In general, we are concerned with one of the above-defined metrics. For instance, a pharmaceutical company will be more concerned with minimizing wrong positive diagnoses.

Hence, they will be more concerned about high Specificity. On the other hand an attrition

model will be more concerned with Sensitivity. Confusion matrix are generally used only

with class output models.

2. F1 Score

In the last section, we discussed precision and recall for classification problems and

also highlighted the importance of choosing precision/recall basis our use case. What if

for a use case, we are trying to get the best precision and recall at the same time? F1-

Score is the harmonic mean of precision and recall values for a classification problem.

The formula for F1-Score is as follows:

F1 = 2 * (precision * recall) / (precision + recall)

Now, an obvious question that comes to mind is why we are taking a harmonic mean and not an arithmetic mean. This is because the HM punishes extreme values more. Let us understand this

with an example. We have a binary classification model with the following results:

Precision: 0, Recall: 1

Here, if we take the arithmetic mean, we get 0.5. It is clear that the above result

comes from a dumb classifier which just ignores the input and just predicts one of the

classes as output. Now, if we were to take HM, we will get 0 which is accurate as this

model is useless for all purposes.

This seems simple. There are situations however for which a data scientist would like to

give a percentage more importance/weight to either precision or recall. Altering the

above expression a bit such that we can include an adjustable parameter beta for this

purpose, we get:

Fβ = (1 + β²) * (precision * recall) / (β² * precision + recall)

Fbeta measures the effectiveness of a model with respect to a user who attaches β times

as much importance to recall as precision.

3. Gain and Lift charts

Gain and Lift charts are mainly concerned with checking the rank ordering of the probabilities. Here are the steps to build a Lift/Gain chart:

Step 1 : Calculate probability for each observation

Step 2 : Rank these probabilities in decreasing order.

Step 3 : Build deciles with each group having almost 10% of the observations.

Step 4 : Calculate the response rate at each deciles for Good (Responders) ,Bad (Non-

responders) and total.

You will get following table from which you need to plot Gain/Lift charts:

This is a very informative table. Cumulative Gain chart is the graph between Cumulative

%Right and Cummulative %Population. For the case in hand here is the graph :

This graph tells you how well is your model segregating responders from non-responders.

For example, the first decile, which has 10% of the population, has 14% of the responders. This means we have a 140% lift at the first decile.

What is the maximum lift we could have reached in first decile? From the first table of

this article, we know that the total number of responders is 3850. Also, the first decile contains 543 observations. Hence, the maximum lift at the first decile could have been 543/3850 ~ 14.1%. Hence, we are quite close to perfection with this model.

Let’s now plot the lift curve. Lift curve is the plot between total lift and %population.

Note that for a random model, this always stays flat at 100%. Here is the plot for the

case in hand :

You can also plot decile wise lift with decile number :

What does this graph tell you? It tells you that our model does well till the 7th decile.

Post which every decile will be skewed towards non-responders. Any model with lift @ decile

above 100% till minimum 3rd decile and maximum 7th decile is a good model. Else you might

consider over sampling first.

Lift / Gain charts are widely used in campaign targeting problems. They tell us till which decile we can target customers for a specific campaign. They also tell you how much response you can expect from the new target base.

4. Kolmogorov-Smirnov Chart

K-S or Kolmogorov-Smirnov chart measures performance of classification models. More

accurately, K-S is a measure of the degree of separation between the positive and negative

distributions. The K-S is 100, if the scores partition the population into two separate

groups in which one group contains all the positives and the other all the negatives.

On the other hand, If the model cannot differentiate between positives and negatives, then

it is as if the model selects cases randomly from the population. The K-S would be 0. In

most classification models the K-S will fall between 0 and 100, and the higher the value, the better the model is at separating the positive from the negative cases.

For the case in hand, following is the table :

We can also plot the %Cumulative Good and Bad to see the maximum separation. Following

is a sample plot :

The metrics covered till here are mostly used in classification problems. Till here, we

learnt about confusion matrix, lift and gain chart and kolmogorov-smirnov chart. Let’s

proceed and learn few more important metrics.

5. Area Under the ROC curve (AUC – ROC)

This is again one of the popular metrics used in the industry. The biggest advantage of

using ROC curve is that it is independent of the change in proportion of responders. This

statement will get clearer in the following sections.

Let’s first try to understand what is ROC (Receiver operating characteristic) curve. If

we look at the confusion matrix below, we observe that for a probabilistic model, we get

different value for each metric.

Hence, for each sensitivity, we get a different specificity. The two vary as follows:

The ROC curve is the plot between sensitivity and (1- specificity). (1- specificity) is

also known as false positive rate and sensitivity is also known as True Positive rate.

Following is the ROC curve for the case in hand.

Let’s take an example of threshold = 0.5 (refer to confusion matrix). Here is the

confusion matrix :

As you can see, the sensitivity at this threshold is 99.6% and the (1-specificity) is ~60%. This coordinate becomes one point on our ROC curve. To bring this curve down to a single number, we find the area under this curve (AUC).

Note that the area of entire square is 1*1 = 1. Hence AUC itself is the ratio under the

curve and the total area. For the case in hand, we get AUC ROC as 96.4%. Following are a

few thumb rules:

• .90-1 = excellent (A)


• .80-.90 = good (B)
• .70-.80 = fair (C)
• .60-.70 = poor (D)
• .50-.60 = fail (F)

We see that we fall under the excellent band for the current model. But this might simply be over-fitting. In such cases it becomes very important to do in-time and out-of-time validations.

Points to Remember:

1. A model which gives class as output will be represented as a single point in the ROC plot.

2. Such models cannot be compared with each other as the judgement needs to be taken on

a single metric and not using multiple metrics. For instance, model with parameters

(0.2,0.8) and model with parameter (0.8,0.2) can be coming out of the same model, hence

these metrics should not be directly compared.

3. In case of probabilistic model, we were fortunate enough to get a single number which

was AUC-ROC. But still, we need to look at the entire curve to make conclusive decisions.

It is also possible that one model performs better in some region and other performs better

in other.

Advantages of using ROC

Why should you use ROC and not metrics like lift curve?

Lift is dependent on total response rate of the population. Hence, if the response rate

of the population changes, the same model will give a different lift chart. A solution to

this concern can be true lift chart (finding the ratio of lift and perfect model lift at

each decile). But such ratio rarely makes sense for the business.

ROC curve on the other hand is almost independent of the response rate. This is because

it has the two axis coming out from columnar calculations of confusion matrix. The numerator

and denominator of both x and y axis will change on similar scale in case of response rate

shift.

6. Log Loss

AUC ROC considers the predicted probabilities for determining our model’s performance.

However, there is an issue with AUC ROC, it only takes into account the order of

probabilities and hence it does not take into account the model’s capability to predict

higher probability for samples more likely to be positive. In that case, we could use the log loss, which is nothing but the negative average of the log of the corrected predicted probabilities for each instance:

LogLoss = -(1/N) * Σ [ yi * log(p(yi)) + (1 - yi) * log(1 - p(yi)) ]

• p(yi) is predicted probability of positive class


• 1-p(yi) is predicted probability of negative class
• yi = 1 for positive class and 0 for negative class (actual values)

Let us calculate log loss for a few random values to get the gist of the above

mathematical function:

Logloss(1, 0.1) = 2.303

Logloss(1, 0.5) = 0.693

Logloss(1, 0.9) = 0.105

If we plot this relationship, we will get a curve as follows:

It’s apparent from the gentle downward slope towards the right that the Log Loss

gradually declines as the predicted probability improves. Moving in the opposite

direction though, the Log Loss ramps up very rapidly as the predicted probability

approaches 0.

So, lower the log loss, better the model. However, there is no absolute measure on a

good log loss and it is use-case/application dependent.

Whereas the AUC is computed with regards to binary classification with a varying

decision threshold, log loss actually takes “certainty” of classification into account.

7. Gini Coefficient

The Gini coefficient is sometimes used in classification problems. It can be derived straight away from the AUC ROC number. Gini is nothing but the ratio between the area between the ROC curve and the diagonal line, and the area of the upper triangle. Following is the formula used:

Gini = 2*AUC – 1

Gini above 60% is a good model. For the case in hand we get Gini as 92.7%.

8. Concordant – Discordant ratio

This is again one of the most important metrics for any classification prediction problem.

To understand this let’s assume we have 3 students who have some likelihood to pass this

year. Following are our predictions :

A – 0.9

B – 0.5

C – 0.3

Now picture this: if we were to fetch pairs of two from these three students, how many pairs would we have? We will have 3 pairs: AB, BC, CA. Now, after the year ends, we see that A and C passed this year while B failed. Now, we choose all the pairs where we will find one responder and one non-responder. How many such pairs do we have?

We have two pairs AB and BC. Now for each of the 2 pairs, the concordant pair is where

the probability of responder was higher than non-responder. Whereas discordant pair is

where the vice-versa holds true. In case both the probabilities are equal, we say it's a tie. Let's see what happens in our case:

AB – Concordant

BC – Discordant

Hence, we have 50% concordant cases in this example. A concordant ratio of more than 60% is considered to be a good model. This metric is generally not used when deciding how many customers to target, etc. It is primarily used to assess the model's predictive power. Decisions like how many to target are again taken with KS / Lift charts.

9. Root Mean Squared Error (RMSE)

RMSE is the most popular evaluation metric used in regression problems. It follows the assumption that errors are unbiased and follow a normal distribution. Here are the key points to consider on RMSE:

1. The power of ‘square root’ empowers this metric to show large number deviations.
2. The ‘squared’ nature of this metric helps to deliver more robust results which
prevents cancelling the positive and negative error values. In other words, this
metric aptly displays the plausible magnitude of error term.
3. It avoids the use of absolute error values which is highly undesirable in
mathematical calculations.
4. When we have more samples, reconstructing the error distribution using RMSE is
considered to be more reliable.
5. RMSE is highly affected by outlier values. Hence, make sure you’ve removed
outliers from your data set prior to using this metric.
6. As compared to mean absolute error, RMSE gives higher weightage and punishes large
errors.

The RMSE metric is given by:

RMSE = sqrt( (1/N) * Σ (predicted_i - actual_i)² )

where N is the total number of observations.

10. Root Mean Squared Logarithmic Error

In case of Root mean squared logarithmic error, we take the log of the predictions and

actual values. So basically, what changes are the variance that we are measuring. RMSLE

is usually used when we don’t want to penalize huge differences in the predicted and the

actual values when both predicted and true values are huge numbers.

1. If both predicted and actual values are small: RMSE and RMSLE are same.
2. If either predicted or the actual value is big: RMSE > RMSLE
3. If both predicted and actual values are big: RMSE > RMSLE (RMSLE becomes almost
negligible)

11. R-Squared/Adjusted R-Squared

We learned that when the RMSE decreases, the model’s performance will improve. But these

values alone are not intuitive.

In the case of a classification problem, if the model has an accuracy of 0.8, we could

gauge how good our model is against a random model, which has an accuracy of 0.5. So

the random model can be treated as a benchmark. But when we talk about the RMSE metrics,

we do not have a benchmark to compare.

This is where we can use the R-Squared metric. The formula for R-Squared is as follows:

R² = 1 - MSE(model) / MSE(baseline)

MSE(model): Mean Squared Error of the predictions against the actual values

MSE(baseline): Mean Squared Error of mean prediction against the actual values

In other words, it measures how good our regression model is compared to a very simple model that just predicts the mean value of the target from the train set as predictions.

Adjusted R-Squared

A model performing equal to baseline would give R-Squared as 0. Better the model, higher

the r2 value. The best model with all correct predictions would give R-Squared as 1.

However, on adding new features to the model, the R-Squared value either increases or

remains the same. R-Squared does not penalize for adding features that add no value to

the model. So an improved version over the R-Squared is the adjusted R-Squared. The

formula for adjusted R-Squared is given by:

Adjusted R² = 1 - [ (1 - R²) * (n - 1) / (n - (k + 1)) ]

k: number of features

n: number of samples

As you can see, this metric takes the number of features into account. When we add more

features, the term in the denominator n-(k +1) decreases, so the whole expression

increases.

If R-Squared does not increase, that means the feature added isn’t valuable for our

model. So overall we subtract a greater value from 1 and adjusted r2, in turn, would

decrease.

Beyond these 11 metrics, there is another method to check the model performance. These 7

methods are statistically prominent in data science. But, with arrival of machine

learning, we are now blessed with more robust methods of model selection. Yes! I’m

talking about Cross Validation.

Though cross validation isn't really an evaluation metric that is used openly to communicate model accuracy, the result of cross validation provides a good enough intuitive result to generalize the performance of a model.

Let’s now understand cross validation in detail.

12. Cross Validation

Let’s first understand the importance of cross validation. Due to busy schedules, these

days I don’t get much time to participate in data science competitions. Long time back, I

participated in TFI Competition on Kaggle. Without delving into my competition performance,

I would like to show you the dissimilarity between my public and private leaderboard score.

Here is an example of scoring on Kaggle!

For TFI competition, following were three of my solution and scores (Lesser the better)

You will notice that the third entry which has the worst Public score turned to be the

best model on Private ranking. There were more than 20 models above the

“submission_all.csv”, but I still chose “submission_all.csv” as my final entry (which

really worked out well). What caused this phenomenon ? The dissimilarity in my public and

private leaderboard is caused by over-fitting.

Over-fitting is nothing but when your model becomes so highly complex that it starts capturing noise as well. This 'noise' adds no value to the model, only inaccuracy.

In the following section, I will discuss how you can know if a solution is an over-fit or

not before we actually know the test results.

The concept : Cross Validation

Cross Validation is one of the most important concepts in any type of data modelling. It

simply says, try to leave a sample on which you do not train the model and test the model

on this sample before finalizing the model.

Above diagram shows how to validate model with in-time sample. We simply divide the

population into 2 samples, and build model on one sample. Rest of the population is used

for in-time validation.

Could there be a negative side of the above approach?

I believe a negative side of this approach is that we lose a good amount of data from training the model. Hence, the model has very high bias. And this won't give the best estimate for the coefficients. So what's the next best option?

What if, we make a 50:50 split of training population and the train on first 50 and

validate on rest 50. Then, we train on the other 50, test on first 50. This way we train

the model on the entire population, however on 50% in one go. This reduces bias because

of sample selection to some extent but gives a smaller sample to train the model on. This

approach is known as 2-fold cross validation.

k-fold Cross validation

Let’s extrapolate the last example to k-fold from 2-fold cross validation. Now, we will

try to visualize how does a k-fold validation work.

This is a 7-fold cross validation.

Here’s what goes on behind the scene : we divide the entire population into 7 equal

samples. Now we train models on 6 samples (Green boxes) and validate on 1 sample (grey

box). Then, at the second iteration we train the model with a different sample held as

validation. In 7 iterations, we have basically built model on each sample and held each

of them as validation. This is a way to reduce the selection bias and reduce the variance

in prediction power. Once we have all the 7 models, we take average of the error terms to

find which of the models is best.

How does this help to find best (non over-fit) model?

k-fold cross validation is widely used to check whether a model is an overfit or not. If the performance metrics at each of the k modelling runs are close to each other, and the mean of the metric is high, the model is likely not an overfit. In a Kaggle competition, you might rely more on the cross

validation score and not on the Kaggle public score. This way you will be sure that the

Public score is not just by chance.

How do we implement k-fold with any model?

Coding k-fold in R and Python is very similar. Here is how you can code k-fold in Python:
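A minimal, self-contained sketch using scikit-learn's KFold (the synthetic data from make_regression and the RandomForestRegressor are illustrative assumptions, not from the original):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data stands in for the real features X and target y
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=1)

# 7-fold cross validation, matching the 7-fold example described above
kf = KFold(n_splits=7, shuffle=True, random_state=1)
model = RandomForestRegressor(random_state=1)

# One R² score per held-out fold; scores that are close to each other
# across folds suggest the model is not over-fit
scores = cross_val_score(model, X, y, cv=kf, scoring='r2')
print(scores)
print(scores.mean())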

But how do we choose k?

This is the tricky part. We have a trade off to choose k.

For a small k, we have a higher selection bias but low variance in the performances.

For a large k, we have a small selection bias but high variance in the performances.

Think of extreme cases :

k = 2: We have only 2 samples, similar to our 50:50 example. Here we build the model on only

50% of the population each time. But as the validation set is a significant share of the

population, the variance of the validation performance is minimal.

k = number of observations (n): This is also known as “leave one out”. We have n samples,

and the modelling is repeated n times, leaving only one observation out for cross

validation each time. Hence, the selection bias is minimal, but the variance of the validation

performance is very large.

Generally a value of k = 10 is recommended for most purposes.
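For illustration, these extremes map directly onto scikit-learn splitters (a sketch, not from the original):

from sklearn.model_selection import KFold, LeaveOneOut

kf2 = KFold(n_splits=2)    # 50:50 split - higher selection bias, low variance
loo = LeaveOneOut()        # k = n ("leave one out") - minimal bias, high variance
kf10 = KFold(n_splits=10)  # the commonly recommended trade-off
# any of these can be passed as cv= to cross_val_score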

End Notes

Measuring performance on the training sample is pointless, and leaving an in-time

validation batch aside is a waste of data. K-fold gives us a way to use every single

datapoint, which can reduce this selection bias to a good extent. Also, k-fold cross

validation can be used with any modelling technique.

In addition, the metrics covered in this article are some of the most used evaluation

metrics in classification and regression problems.

Capstone Project - Jupyter Notebook

In [1]:  import pandas as pd


import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from factor_analyzer import FactorAnalyzer # Perform statistical tests before PCA
import warnings
warnings.filterwarnings("ignore")

In [2]:  data= pd.read_excel('Sales.xlsx','Sales',engine='openpyxl')

In [3]:  data.head()

Out[3]:
CustID AgentBonus Age CustTenure Channel Occupation EducationField Gender ExistingProdType Designation NumberOfPolicy ...

0 7000000 4409 22.0 4.0 Agent Salaried Graduate Female 3 Manager 2.0
1 7000001 2214 11.0 2.0 Third Party Partner Salaried Graduate Male 4 Manager 4.0
2 7000002 4273 26.0 4.0 Agent Free Lancer Post Graduate Male 4 Exe 3.0
3 7000003 1791 11.0 NaN Third Party Partner Salaried Graduate Fe male 3 Executive 3.0
4 7000004 2955 6.0 NaN Agent Small Business UG Male 3 Executive 4.0


In [4]:  data.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 4520 entries, 0 to 4519

Data columns (total 20 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 CustID 4520 non-null int64

1 AgentBonus 4520 non-null int64

2 Age 4251 non-null float64

3 CustTenure 4294 non-null float64

4 Channel 4520 non-null object

5 Occupation 4520 non-null object

6 EducationField 4520 non-null object

7 Gender 4520 non-null object

8 ExistingProdType 4520 non-null int64

9 Designation 4520 non-null object

10 NumberOfPolicy 4475 non-null float64

11 MaritalStatus 4520 non-null object

12 MonthlyIncome 4284 non-null float64

13 Complaint 4520 non-null int64

14 ExistingPolicyTenure 4336 non-null float64

15 SumAssured 4366 non-null float64

16 Zone 4520 non-null object

17 PaymentMethod 4520 non-null object

18 LastMonthCalls 4520 non-null int64

19 CustCareScore 4468 non-null float64

dtypes: float64(7), int64(5), object(8)

memory usage: 706.4+ KB


In [5]:  data.describe().T

Out[5]:
count mean std min 25% 50% 75% max

CustID 4520.0 7.002260e+06 1304.955938 7000000.0 7001129.75 7002259.5 7003389.25 7004519.0

AgentBonus 4520.0 4.077838e+03 1403.321711 1605.0 3027.75 3911.5 4867.25 9608.0

Age 4251.0 1.449471e+01 9.037629 2.0 7.00 13.0 20.00 58.0

CustTenure 4294.0 1.446903e+01 8.963671 2.0 7.00 13.0 20.00 57.0

ExistingProdType 4520.0 3.688938e+00 1.015769 1.0 3.00 4.0 4.00 6.0

NumberOfPolicy 4475.0 3.565363e+00 1.455926 1.0 2.00 4.0 5.00 6.0

MonthlyIncome 4284.0 2.289031e+04 4885.600757 16009.0 19683.50 21606.0 24725.00 38456.0

Complaint 4520.0 2.871681e-01 0.452491 0.0 0.00 0.0 1.00 1.0

ExistingPolicyTenure 4336.0 4.130074e+00 3.346386 1.0 2.00 3.0 6.00 25.0

SumAssured 4366.0 6.199997e+05 246234.822140 168536.0 439443.25 578976.5 758236.00 1838496.0

LastMonthCalls 4520.0 4.626991e+00 3.620132 0.0 2.00 3.0 8.00 18.0

CustCareScore 4468.0 3.067592e+00 1.382968 1.0 2.00 3.0 4.00 5.0


In [6]:  data.isnull().sum()
# presence of null values

Out[6]: CustID 0

AgentBonus 0

Age 269

CustTenure 226

Channel 0

Occupation 0

EducationField 0

Gender 0

ExistingProdType 0

Designation 0

NumberOfPolicy 45

MaritalStatus 0

MonthlyIncome 236

Complaint 0

ExistingPolicyTenure 184

SumAssured 154

Zone 0

PaymentMethod 0

LastMonthCalls 0

CustCareScore 52

dtype: int64

In [8]:  data.duplicated().sum()
# no duplicate values

Out[8]: 0

In [10]:  data.columns

Out[10]: Index(['CustID', 'AgentBonus', 'Age', 'CustTenure', 'Channel', 'Occupation',

'EducationField', 'Gender', 'ExistingProdType', 'Designation',

'NumberOfPolicy', 'MaritalStatus', 'MonthlyIncome', 'Complaint',

'ExistingPolicyTenure', 'SumAssured', 'Zone', 'PaymentMethod',

'LastMonthCalls', 'CustCareScore'],

dtype='object')


In [7]:  df=data.copy()

In [8]:  df_num = df.select_dtypes(include = [ 'object'])


lstnumericcolumns = list(df_num.columns.values)
print(lstnumericcolumns)

['Channel', 'Occupation', 'EducationField', 'Gender', 'Designation', 'MaritalStatus', 'Zone', 'PaymentMethod']

In [9]:  for i in lstnumericcolumns:


print(i)
print(df[i].unique())

Channel

['Agent' 'Third Party Partner' 'Online']

Occupation

['Salaried' 'Free Lancer' 'Small Business' 'Laarge Business'

'Large Business']

EducationField

['Graduate' 'Post Graduate' 'UG' 'Under Graduate' 'Engineer' 'Diploma'

'MBA']

Gender

['Female' 'Male' 'Fe male']

Designation

['Manager' 'Exe' 'Executive' 'VP' 'AVP' 'Senior Manager']

MaritalStatus

['Single' 'Divorced' 'Unmarried' 'Married']

Zone

['North' 'West' 'East' 'South']

PaymentMethod

['Half Yearly' 'Yearly' 'Quarterly' 'Monthly']

In [11]:  #df['Gender'].mask(df['gender'] == 'female', 0, inplace=True)

In [12]:  df['Occupation'].mask(df['Occupation']=='Laarge Business','Large Business',inplace=True)

In [20]:  df['EducationField'].mask(df['EducationField']=='Under Graduates','Under Graduate',inplace=True)


In [14]:  df['Gender'].mask(df['Gender']=='Fe male','Female',inplace=True)

In [15]:  df['Designation'].mask(df['Designation']=='Exe','Executive',inplace=True)


In [16]:  df['Designation'].unique()

Out[16]: array(['Manager', 'Executive', 'VP', 'AVP', 'Senior Manager'],

dtype=object)

In [22]:  for i in lstnumericcolumns:


print(i)
print(df[i].unique())

Channel

['Agent' 'Third Party Partner' 'Online']

Occupation

['Salaried' 'Free Lancer' 'Small Business' 'Large Business']

EducationField

['Graduate' 'Post Graduate' 'Under Graduate' 'Engineer' 'Diploma' 'MBA']

Gender

['Female' 'Male']

Designation

['Manager' 'Executive' 'VP' 'AVP' 'Senior Manager']

MaritalStatus

['Single' 'Divorced' 'Unmarried' 'Married']

Zone

['North' 'West' 'East' 'South']

PaymentMethod

['Half Yearly' 'Yearly' 'Quarterly' 'Monthly']


In [26]:  df.head(5)

Out[26]:
CustID AgentBonus Age CustTenure Channel Occupation EducationField Gender ExistingProdType Designation NumberOfPolicy ...

0 7000000 4409 22.0 4.0 Agent Salaried Graduate Female 3 Manager 2.0
1 7000001 2214 11.0 2.0 Third Party Partner Salaried Graduate Male 4 Manager 4.0
2 7000002 4273 26.0 4.0 Agent Free Lancer Post Graduate Male 4 Executive 3.0
3 7000003 1791 11.0 NaN Third Party Partner Salaried Graduate Female 3 Executive 3.0
4 7000004 2955 6.0 NaN Agent Small Business Under Graduate Male 3 Executive 4.0

In [27]:  df.to_excel('file1.xlsx')

Exploratory Data Analysis

Univariate Analysis
In [77]:  # for continious variables


In [38]:  def univariateAnalysis_numeric(column,nbins):


print("Description of " + column)
print("----------------------------------------------------------------------------")
print(df[column].describe(),end=' ')

plt.figure()
print("Distribution of " + column)
print("----------------------------------------------------------------------------")
sns.distplot(df[column], kde=True, color='g');
plt.show()

plt.figure()
print("BoxPlot of " + column)
print("----------------------------------------------------------------------------")
ax = sns.boxplot(x=df[column])
plt.show()

In [78]:  df_num = df.select_dtypes(include = [ 'int64','float'])


lstnumericcolumns = list(df_num.columns.values)
len(lstnumericcolumns)

Out[78]: 12


In [81]:  for x in lstnumericcolumns:


univariateAnalysis_numeric(x,20)

BoxPlot of Age

----------------------------------------------------------------------------

In [82]:  # Multivariate Analysis



In [83]:  plt.figure(figsize=(18,7))
sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True,cmap="YlGnBu")
plt.show()


In [84]:  sns.pairplot(df)
plt.show()

In [85]:  # for categorical variables

In [87]:  df.columns

Out[87]: Index(['CustID', 'AgentBonus', 'Age', 'CustTenure', 'Channel', 'Occupation',

'EducationField', 'Gender', 'ExistingProdType', 'Designation',

'NumberOfPolicy', 'MaritalStatus', 'MonthlyIncome', 'Complaint',

'ExistingPolicyTenure', 'SumAssured', 'Zone', 'PaymentMethod',

'LastMonthCalls', 'CustCareScore'],

dtype='object')


In [89]:  df_num = df.select_dtypes(include = [ 'object'])


lst = list(df_num.columns.values)
len(lst)

Out[89]: 8

In [95]:  for i in lst:


print(i)
plt.figure()
sns.countplot(df[i])
plt.show()

Channel

Occupation

Bivariate Analysis


In [105]:  df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 4520 entries, 0 to 4519

Data columns (total 20 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 CustID 4520 non-null int64

1 AgentBonus 4520 non-null int64

2 Age 4251 non-null float64

3 CustTenure 4294 non-null float64

4 Channel 4520 non-null object

5 Occupation 4520 non-null object

6 EducationField 4520 non-null object

7 Gender 4520 non-null object

8 ExistingProdType 4520 non-null int64

9 Designation 4520 non-null object

10 NumberOfPolicy 4475 non-null float64

11 MaritalStatus 4520 non-null object

12 MonthlyIncome 4284 non-null float64

13 Complaint 4520 non-null int64

14 ExistingPolicyTenure 4336 non-null float64

15 SumAssured 4366 non-null float64

16 Zone 4520 non-null object

17 PaymentMethod 4520 non-null object

18 LastMonthCalls 4520 non-null int64

19 CustCareScore 4468 non-null float64

dtypes: float64(7), int64(5), object(8)

memory usage: 706.4+ KB


In [110]:  for i in lst:


print(i)
plt.figure()
sns.barplot(x=df[i],y=df['AgentBonus'])
plt.show()

Channel

Occupation


In [109]:  ​

Out[109]: <AxesSubplot:xlabel='Gender', ylabel='AgentBonus'>

In [28]:  df.drop(['CustID'],inplace=True,axis=1)

# removal of CustId


In [29]:  df.head()

Out[29]:
AgentBonus Age CustTenure Channel Occupation EducationField Gender ExistingProdType Designation NumberOfPolicy MaritalStatus

0 4409 22.0 4.0 Agent Salaried Graduate Female 3 Manager 2.0 Single
1 2214 11.0 2.0 Third Party Partner Salaried Graduate Male 4 Manager 4.0 Divorced
2 4273 26.0 4.0 Agent Free Lancer Post Graduate Male 4 Executive 3.0 Unmarried
3 1791 11.0 NaN Third Party Partner Salaried Graduate Female 3 Executive 3.0 Divorced
4 2955 6.0 NaN Agent Small Business Under Graduate Male 3 Executive 4.0 Divorced

In [118]:  # Treating Missing Values

In [119]:  # KNN Imputer

In [30]:  from sklearn.impute import KNNImputer

In [31]:  imputer = KNNImputer(n_neighbors=2)


In [141]:  from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()


In [147]:  df['Age'].values.reshape(-1,1)

Out[147]: array([[22.],

[11.],

[26.],

...,

[23.],

[10.],

[14.]])

In [148]:  df.columns

Out[148]: Index(['AgentBonus', 'Age', 'CustTenure', 'Channel', 'Occupation',

'EducationField', 'Gender', 'ExistingProdType', 'Designation',

'NumberOfPolicy', 'MaritalStatus', 'MonthlyIncome', 'Complaint',

'ExistingPolicyTenure', 'SumAssured', 'Zone', 'PaymentMethod',

'LastMonthCalls', 'CustCareScore'],

dtype='object')

In [32]:  sform(df[['Age','CustTenure','NumberOfPolicy','MonthlyIncome','ExistingPolicyTenure','SumAssured','CustCareScore']])



In [33]:  df.isnull().sum()

Out[33]: AgentBonus 0

Age 0

CustTenure 0

Channel 0

Occupation 0

EducationField 0

Gender 0

ExistingProdType 0

Designation 0

NumberOfPolicy 0

MaritalStatus 0

MonthlyIncome 0

Complaint 0

ExistingPolicyTenure 0

SumAssured 0

Zone 0

PaymentMethod 0

LastMonthCalls 0

CustCareScore 0

dtype: int64

In [345]:  # null values are replaced using KNN imputer.


In [346]:  np.mean(data)

Out[346]: CustID 7.002260e+06

AgentBonus 4.077838e+03

Age 1.449471e+01

CustTenure 1.446903e+01

ExistingProdType 3.688938e+00

NumberOfPolicy 3.565363e+00

MonthlyIncome 2.289031e+04

Complaint 2.871681e-01

ExistingPolicyTenure 4.130074e+00

SumAssured 6.199997e+05

LastMonthCalls 4.626991e+00

CustCareScore 3.067592e+00

Clus_kmeans 3.546460e-01

dtype: float64

In [160]:  np.mean(df)

Out[160]: AgentBonus 4077.838274

Age 14.228761

CustTenure 14.276106

ExistingProdType 3.688938

NumberOfPolicy 3.566704

MonthlyIncome 22586.653982

Complaint 0.287168

ExistingPolicyTenure 4.195354

SumAssured 621905.446571

LastMonthCalls 4.626991

CustCareScore 3.068031

dtype: float64


In [162]:  np.std(df)

Out[162]: AgentBonus 1403.166467

Age 8.839135

CustTenure 8.823909

ExistingProdType 1.015657

NumberOfPolicy 1.451436

MonthlyIncome 4928.700848

Complaint 0.452441

ExistingPolicyTenure 3.359045

SumAssured 244199.496965

LastMonthCalls 3.619732

CustCareScore 1.379033

dtype: float64

In [165]:  from scipy import stats as st


In [167]:  st.mode(df)

Out[167]: ModeResult(mode=array([[2581, 5.0, 4.0, 'Agent', 'Salaried', 'Graduate', 'Male', 4,

'Executive', 4.0, 'Married', 17104.5, 0, 1.0, 437153.0, 'West',

'Half Yearly', 3, 3.0]], dtype=object), count=array([[ 8, 239, 253, 3194, 2192, 1870, 2688, 1916, 1662,
1107, 2268,

22, 3222, 994, 6, 2566, 2656, 733, 1378]]))

In [171]:  ## Outlinear Treatment

In [34]:  def remove_outlier(col):


sorted(col)
Q1,Q3=np.percentile(col,[25,75])
IQR=Q3-Q1
lower_range= Q1-(1.5 * IQR)
upper_range= Q3+(1.5 * IQR)
return lower_range, upper_range


In [35]:  df.columns

Out[35]: Index(['AgentBonus', 'Age', 'CustTenure', 'Channel', 'Occupation',

'EducationField', 'Gender', 'ExistingProdType', 'Designation',

'NumberOfPolicy', 'MaritalStatus', 'MonthlyIncome', 'Complaint',

'ExistingPolicyTenure', 'SumAssured', 'Zone', 'PaymentMethod',

'LastMonthCalls', 'CustCareScore'],

dtype='object')

In [36]:  x=['AgentBonus', 'Age', 'CustTenure','ExistingProdType','MonthlyIncome','ExistingPolicyTenure','SumAssured']


for i in x:
lr,ur=remove_outlier(df[i])
print(i)
print('Lower Range :',lr,'\nUpper Range :',ur)
df[i]=np.where(df[i]>ur,ur,df[i])
df[i]=np.where(df[i]<lr,lr,df[i])

AgentBonus

Lower Range : 268.5

Upper Range : 7626.5

Age

Lower Range : -8.5

Upper Range : 35.5

CustTenure

Lower Range : -11.0

Upper Range : 37.0

ExistingProdType

Lower Range : 1.5

Upper Range : 5.5

MonthlyIncome

Lower Range : 11277.375

Upper Range : 32484.375

ExistingPolicyTenure

Lower Range : -4.0

Upper Range : 12.0

SumAssured

Lower Range : -31974.375

Upper Range : 1234220.625


In [183]:  for i in x:
print(i)
sns.boxplot(df[i])
plt.show();

ExistingProdType

Variable transformation

Converting all objects to categorical codes


In [40]:  for feature in df.columns:


if df[feature].dtype == 'object':
print('\n')
print('feature:',feature)
print(pd.Categorical(df[feature].unique()))
print(pd.Categorical(df[feature].unique()).codes)
df[feature] = pd.Categorical(df[feature]).codes
feature: Designation

[Manager, Executive, VP, AVP, Senior Manager]

Categories (5, object): [AVP, Executive, Manager, Senior Manager, VP]

[2 1 4 0 3]

feature: MaritalStatus

[Single, Divorced, Unmarried, Married]

Categories (4, object): [Divorced, Married, Single, Unmarried]

[2 0 3 1]

feature: Zone

[North, West, East, South]

Categories (4, object): [East, North, South, West]

[1 3 0 2]

feature: PaymentMethod

[Half Yearly, Yearly, Quarterly, Monthly]

Categories (4, object): [Half Yearly, Monthly, Quarterly, Yearly]

[0 3 2 1]

In [38]:  df['Gender'].value_counts()

Out[38]: Male 2688

Female 1832

Name: Gender, dtype: int64


In [41]:  df.head()

Out[41]:
AgentBonus Age CustTenure Channel Occupation EducationField Gender ExistingProdType Designation NumberOfPolicy MaritalStatus

0 4409.0 22.0 4.0 0 2 2 0 3.0 2 2.0 2

1 2214.0 11.0 2.0 2 2 2 1 4.0 2 4.0 0

2 4273.0 26.0 4.0 0 0 4 1 4.0 1 3.0 3

3 1791.0 11.0 10.5 2 2 2 0 3.0 1 3.0 0

4 2955.0 6.0 5.5 0 3 5 1 3.0 1 4.0 0


In [37]:  for i in df.columns:


if data[i].dtype == 'object':
print(i)
print(df[i].value_counts())
print(data[i].value_counts())
print('')

Channel

Agent 3194

Third Party Partner 858

Online 468

Name: Channel, dtype: int64

Agent 3194

Third Party Partner 858

Online 468

Name: Channel, dtype: int64

Occupation

Salaried 2192

Small Business 1918

Large Business 408

Free Lancer 2

Name: Occupation, dtype: int64

Salaried 2192

Small Business 1918

Large Business 255

Laarge Business 153

Free Lancer 2

Name: Occupation, dtype: int64

EducationField

Graduate 1870

Under Graduate 1420

Diploma 496

Engineer 408

Post Graduate 252

MBA 74

Name: EducationField, dtype: int64

Graduate 1870

Under Graduate 1190


Diploma 496

Engineer 408

Post Graduate 252

UG 230

MBA 74

Name: EducationField, dtype: int64

Gender

Male 2688

Female 1832

Name: Gender, dtype: int64

Male 2688

Female 1507

Fe male 325

Name: Gender, dtype: int64

Designation

Executive 1662

Manager 1620

Senior Manager 676

AVP 336

VP 226

Name: Designation, dtype: int64

Manager 1620

Executive 1535

Senior Manager 676

AVP 336

VP 226

Exe 127

Name: Designation, dtype: int64

MaritalStatus

Married 2268

Single 1254

Divorced 804

Unmarried 194

Name: MaritalStatus, dtype: int64

Married 2268

Single 1254

Divorced 804

Unmarried 194

Name: MaritalStatus, dtype: int64


Zone

West 2566

North 1884

East 64

South 6

Name: Zone, dtype: int64

West 2566

North 1884

East 64

South 6

Name: Zone, dtype: int64

PaymentMethod

Half Yearly 2656

Yearly 1434

Monthly 354

Quarterly 76

Name: PaymentMethod, dtype: int64

Half Yearly 2656

Yearly 1434

Monthly 354

Quarterly 76

Name: PaymentMethod, dtype: int64

In [42]:  df.to_excel('encode+file.xlsx')

Clustering


In [230]:  data.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 4520 entries, 0 to 4519

Data columns (total 20 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 CustID 4520 non-null int64

1 AgentBonus 4520 non-null int64

2 Age 4251 non-null float64

3 CustTenure 4294 non-null float64

4 Channel 4520 non-null object

5 Occupation 4520 non-null object

6 EducationField 4520 non-null object

7 Gender 4520 non-null object

8 ExistingProdType 4520 non-null int64

9 Designation 4520 non-null object

10 NumberOfPolicy 4475 non-null float64

11 MaritalStatus 4520 non-null object

12 MonthlyIncome 4284 non-null float64

13 Complaint 4520 non-null int64

14 ExistingPolicyTenure 4336 non-null float64

15 SumAssured 4366 non-null float64

16 Zone 4520 non-null object

17 PaymentMethod 4520 non-null object

18 LastMonthCalls 4520 non-null int64

19 CustCareScore 4468 non-null float64

dtypes: float64(7), int64(5), object(8)

memory usage: 706.4+ KB

In [249]:  for i in data.columns:


if data[i].dtype == ('int64','float64'):  # NB: comparing a dtype to a tuple matches only 'int64' here; use "in" to also include floats
print(i)

CustID

AgentBonus

ExistingProdType

Complaint

LastMonthCalls


In [250]:  data.columns

Out[250]: Index(['CustID', 'AgentBonus', 'Age', 'CustTenure', 'Channel', 'Occupation',

'EducationField', 'Gender', 'ExistingProdType', 'Designation',

'NumberOfPolicy', 'MaritalStatus', 'MonthlyIncome', 'Complaint',

'ExistingPolicyTenure', 'SumAssured', 'Zone', 'PaymentMethod',

'LastMonthCalls', 'CustCareScore'],

dtype='object')

In [252]:  df1 = data.select_dtypes(include = [ 'int64','float64'])

In [256]:  col=df1.columns

In [300]:  df1=df[[ 'AgentBonus', 'Age', 'CustTenure', 'ExistingProdType',


'NumberOfPolicy', 'MonthlyIncome', 'Complaint', 'ExistingPolicyTenure',
'SumAssured', 'LastMonthCalls', 'CustCareScore']]


In [301]:  # table for continous variable



df1

Out[301]: AgentBonus Age CustTenure ExistingProdType NumberOfPolicy MonthlyIncome Complaint ExistingPolicyTenure SumAssured LastMo

0 4409.0 22.0 4.0 3.0 2.0 20993.0 1 2.0 806761.0

1 2214.0 11.0 2.0 4.0 4.0 20130.0 0 3.0 294502.0

2 4273.0 26.0 4.0 4.0 3.0 17090.0 1 2.0 542086.0

3 1791.0 11.0 10.5 3.0 3.0 17909.0 1 2.0 268635.0

4 2955.0 6.0 5.5 3.0 4.0 18468.0 0 4.0 366405.0

... ... ... ... ... ... ... ... ... ...

4515 3953.0 4.0 8.0 4.0 2.0 26355.0 0 2.0 636473.0

4516 2939.0 9.0 9.0 2.0 2.0 20991.0 0 3.0 296813.0

4517 3792.0 23.0 23.0 5.0 5.0 17057.5 0 2.0 667371.0

4518 4816.0 10.0 10.0 4.0 2.0 20068.0 0 6.0 943999.0

4519 4764.0 14.0 10.0 5.0 2.0 23820.0 0 3.0 700308.0

4520 rows × 11 columns

In [290]:  # Scaling of Data

In [292]:  from sklearn.preprocessing import StandardScaler

In [293]:  X = StandardScaler()

In [302]:  scaled_df = X.fit_transform(df1)


In [303]:  scaled_df

Out[303]: array([[ 0.25492775, 0.93595657, -1.19784794, ..., 0.80802653,

0.10304876, -0.7744781 ],

[-1.36126003, -0.36721936, -1.43334838, ..., -1.38934865,

0.6555759 , -0.04933237],

[ 0.15479037, 1.40983873, -1.19784794, ..., -0.32731766,

-1.2782691 , -0.04933237],

...,

[-0.19937196, 1.05442711, 1.03940617, ..., 0.21010218,

-0.17321481, -1.49962384],

[ 0.55460357, -0.48568989, -0.49134664, ..., 1.3967197 ,

-1.00200553, 1.4009591 ],

[ 0.51631575, -0.01180774, -0.49134664, ..., 0.35138803,

-1.00200553, -0.04933237]])

In [305]:  from sklearn.cluster import KMeans

In [ ]:  # Clustering Optimization

In [306]:  wss =[]

In [307]:  for i in range(1,11):


KM = KMeans(n_clusters=i)
KM.fit(scaled_df)
wss.append(KM.inertia_)


In [309]:  wss

Out[309]: [49720.000000000015,

40138.77380529607,

36991.53521799116,

34874.24407267552,

33247.949982908,

31900.841120890425,

30825.575457900406,

29884.850552160417,

29104.006560150017,

28466.636445111657]

In [310]:  plt.plot(range(1,11), wss)

Out[310]: [<matplotlib.lines.Line2D at 0x265791aab50>]

In [311]:  # from the plot we can take 2 clusters for further processing

In [313]:  k_means = KMeans(n_clusters = 2)


k_means.fit(scaled_df)
labels = k_means.labels_


In [319]:  data["Clus_kmeans"] = labels

In [323]:  data.head(5)

Out[323]:
CustID AgentBonus Age CustTenure Channel Occupation EducationField Gender ExistingProdType Designation ... MaritalStatus

0 7000000 4409 22.0 4.0 Agent Salaried Graduate Female 3 Manager ... Single
1 7000001 2214 11.0 2.0 Third Party Partner Salaried Graduate Male 4 Manager ... Divorced
2 7000002 4273 26.0 4.0 Agent Free Lancer Post Graduate Male 4 Exe ... Unmarried
3 7000003 1791 11.0 NaN Third Party Partner Salaried Graduate Fe male 3 Executive ... Divorced
4 7000004 2955 6.0 NaN Agent Small Business UG Male 3 Executive ... Divorced

5 rows × 21 columns

In [322]:  # silhouette score calculation

In [324]:  from sklearn.metrics import silhouette_samples, silhouette_score

In [325]:  silhouette_score(scaled_df,labels)

Out[325]: 0.19883521509219967

In [326]:  k_means = KMeans(n_clusters = 3)


k_means.fit(scaled_df)
labels = k_means.labels_


In [327]:  silhouette_score(scaled_df,labels)

Out[327]: 0.13229821283027335

In [328]:  k_means = KMeans(n_clusters = 4)


k_means.fit(scaled_df)
labels = k_means.labels_

In [350]:  silhouette_score(scaled_df,labels)

Out[350]: 0.11344412761391207


In [357]:  data[['CustID', 'AgentBonus', 'Age', 'CustTenure', 'Channel', 'Occupation','EducationField', 'Gender', 'ExistingProdT

Out[357]:
CustID AgentBonus Age CustTenure Channel Occupation EducationField Gender ExistingProdType Designation Clus_kmeans

0 7000000 4409 22.0 4.0 Agent Salaried Graduate Female 3 Manager 0
1 7000001 2214 11.0 2.0 Third Party Partner Salaried Graduate Male 4 Manager 0
2 7000002 4273 26.0 4.0 Agent Free Lancer Post Graduate Male 4 Exe 0
3 7000003 1791 11.0 NaN Third Party Partner Salaried Graduate Fe male 3 Executive 0
4 7000004 2955 6.0 NaN Agent Small Business UG Male 3 Executive 0
... ... ... ... ... ... ... ... ... ... ...
4515 7004515 3953 4.0 8.0 Agent Small Business Graduate Male 4 Senior Manager 0
4516 7004516 2939 9.0 9.0 Agent Salaried Under Graduate Female 2 Executive 0
4517 7004517 3792 23.0 23.0 Agent Salaried Engineer Female 5 AVP 0
4518 7004518 4816 10.0 10.0 Online Small Business Graduate Female 4 Executive 0
4519 7004519 4764 14.0 10.0 Agent Salaried Under Graduate Female 5 Manager 0

4520 rows × 11 columns

In [352]:  data.columns

Out[352]: Index(['CustID', 'AgentBonus', 'Age', 'CustTenure', 'Channel', 'Occupation',

'EducationField', 'Gender', 'ExistingProdType', 'Designation',

'NumberOfPolicy', 'MaritalStatus', 'MonthlyIncome', 'Complaint',

'ExistingPolicyTenure', 'SumAssured', 'Zone', 'PaymentMethod',

'LastMonthCalls', 'CustCareScore', 'Clus_kmeans'],

dtype='object')

In [358]:  # Plotting K-Means Cluster


In [362]:  df[labels == 0].info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 703 entries, 29 to 4512

Data columns (total 19 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 AgentBonus 703 non-null float64

1 Age 703 non-null float64

2 CustTenure 703 non-null float64

3 Channel 703 non-null int8

4 Occupation 703 non-null int8

5 EducationField 703 non-null int8

6 Gender 703 non-null int8

7 ExistingProdType 703 non-null float64

8 Designation 703 non-null int8

9 NumberOfPolicy 703 non-null float64

10 MaritalStatus 703 non-null int8

11 MonthlyIncome 703 non-null float64

12 Complaint 703 non-null int64

13 ExistingPolicyTenure 703 non-null float64

14 SumAssured 703 non-null float64

15 Zone 703 non-null int8

16 PaymentMethod 703 non-null int8

17 LastMonthCalls 703 non-null int64

18 CustCareScore 703 non-null float64

dtypes: float64(9), int64(2), int8(8)

memory usage: 71.4 KB


In [363]:  df[labels == 1].info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 1479 entries, 1 to 4516

Data columns (total 19 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 AgentBonus 1479 non-null float64

1 Age 1479 non-null float64

2 CustTenure 1479 non-null float64

3 Channel 1479 non-null int8

4 Occupation 1479 non-null int8

5 EducationField 1479 non-null int8

6 Gender 1479 non-null int8

7 ExistingProdType 1479 non-null float64

8 Designation 1479 non-null int8

9 NumberOfPolicy 1479 non-null float64

10 MaritalStatus 1479 non-null int8

11 MonthlyIncome 1479 non-null float64

12 Complaint 1479 non-null int64

13 ExistingPolicyTenure 1479 non-null float64

14 SumAssured 1479 non-null float64

15 Zone 1479 non-null int8

16 PaymentMethod 1479 non-null int8

17 LastMonthCalls 1479 non-null int64

18 CustCareScore 1479 non-null float64

dtypes: float64(9), int64(2), int8(8)

memory usage: 150.2 KB


In [342]:  filtered_label0 = df[labels == 0]


filtered_label1 = df[labels == 1]

#Plotting the results
plt.scatter(filtered_label0[:,0] , filtered_label0[:,1] , color = 'red')
plt.scatter(filtered_label1[:,0] , filtered_label1[:,1] , color = 'black')
plt.show()

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

<ipython-input-342-2159fa1adc53> in <module>

4 #Plotting the results

----> 5 plt.scatter(filtered_label0[:,0] , filtered_label0[:,1] , color = 'red')

6 plt.scatter(filtered_label1[:,0] , filtered_label1[:,1] , color = 'black')

7 plt.show()

~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)

3022 if self.columns.nlevels > 1:

3023 return self._getitem_multilevel(key)

-> 3024 indexer = self.columns.get_loc(key)

3025 if is_integer(indexer):

3026 indexer = [indexer]

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)

3078 casted_key = self._maybe_cast_indexer(key)

3079 try:

-> 3080 return self._engine.get_loc(casted_key)

3081 except KeyError as err:

3082 raise KeyError(key) from err

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(slice(None, None, None), 0)' is an invalid key
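The TypeError above occurs because a pandas DataFrame cannot be sliced with NumPy-style [:, 0] indexing; positional access needs .iloc. A corrected sketch (plotting the first two columns is an assumption for illustration):

plt.scatter(filtered_label0.iloc[:, 0], filtered_label0.iloc[:, 1], color='red')
plt.scatter(filtered_label1.iloc[:, 0], filtered_label1.iloc[:, 1], color='black')
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()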


In [341]:  filtered_label0

Out[341]: AgentBonus Age CustTenure Channel Occupation EducationField Gender ExistingProdType Designation NumberOfPolicy MaritalStat

29 5223.0 26.0 31.0 0 2 2 1 4.0 3 4.0

31 6576.0 7.0 26.0 0 2 2 0 3.0 3 2.0

38 6524.0 7.0 20.0 2 2 4 1 3.0 3 2.0

47 4450.0 27.0 9.0 0 1 1 1 3.0 4 2.0

49 7594.0 23.0 30.0 0 3 5 0 3.0 4 2.0

... ... ... ... ... ... ... ... ... ... ...

4496 4813.0 19.0 14.0 1 2 2 0 4.0 3 4.0

4500 7626.5 8.0 23.0 1 2 2 1 4.0 0 4.0

4502 6871.0 21.0 27.0 0 2 2 1 4.0 4 5.0

4508 4872.0 24.0 24.0 0 2 2 1 4.0 0 3.5

4512 5693.0 34.0 34.0 0 3 5 0 5.0 3 6.0

703 rows × 19 columns


In [348]:  filtered_label2 = df[label == 2]

filtered_label8 = df[label == 8]

#Plotting the results


plt.scatter(filtered_label2[:,0] , filtered_label2[:,1] , color = 'red')
plt.scatter(filtered_label8[:,0] , filtered_label8[:,1] , color = 'black')
plt.show()

---------------------------------------------------------------------------

NameError Traceback (most recent call last)

<ipython-input-348-3b167cddb1ce> in <module>

----> 1 filtered_label2 = df[label == 2]

3 filtered_label8 = df[label == 8]

5 #Plotting the results

NameError: name 'label' is not defined
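This NameError is a typo: the variable defined earlier is labels, not label. Note also that the last fit used k = 4, so the labels run from 0 to 3 and a label value of 8 cannot occur. A corrected sketch using two clusters that do exist:

filtered_label2 = df[labels == 2]
filtered_label3 = df[labels == 3]

plt.scatter(filtered_label2.iloc[:, 0], filtered_label2.iloc[:, 1], color='red')
plt.scatter(filtered_label3.iloc[:, 0], filtered_label3.iloc[:, 1], color='black')
plt.show()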

In [3]:  import lazypredict

Cap_LinearModel-Copy1 - Jupyter Notebook

In [2]:  import pandas as pd


import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from factor_analyzer import FactorAnalyzer # Perform statistical tests before PCA
import warnings
warnings.filterwarnings("ignore")
from pandas import Series, DataFrame

In [3]:  data= pd.read_excel('encode+file.xlsx',engine='openpyxl')

In [4]:  data.head()

Out[4]:
Unnamed: 0  AgentBonus Age CustTenure Channel Occupation EducationField Gender ExistingProdType Designation ... MaritalStatus

0 0 0.561619 22.0 4.0 0 2 2 0 3.0 2 ... 2
1 1 0.524419 11.0 2.0 2 2 2 1 4.0 2 ... 0
2 2 0.559994 26.0 4.0 0 0 4 1 4.0 1 ... 3
3 3 0.512297 11.0 10.5 2 2 2 0 3.0 1 ... 0
4 4 0.540399 6.0 5.5 0 3 5 1 3.0 1 ... 0

5 rows × 21 columns

In [7]:  data.columns

Out[7]: Index(['Unnamed: 0', 'AgentBonus', 'Age', 'CustTenure', 'Channel',

'Occupation', 'EducationField', 'Gender', 'ExistingProdType',

'Designation', 'NumberOfPolicy', 'MaritalStatus', 'MonthlyIncome',

'Complaint', 'ExistingPolicyTenure', 'SumAssured', 'Zone',

'PaymentMethod', 'LastMonthCalls', 'CustCareScore'],

dtype='object')


In [10]:  #data.drop('Unnamed: 0',axis=1,inplace=True)

In [11]:  data.drop('Age',axis=1,inplace=True)

In [14]:  data.head()

Out[14]:
AgentBonus CustTenure Channel Occupation EducationField Gender ExistingProdType Designation NumberOfPolicy MaritalStatus Mont

0 0.561619 4.0 0 2 2 0 3.0 2 2.0 2

1 0.524419 2.0 2 2 2 1 4.0 2 4.0 0

2 0.559994 4.0 0 0 4 1 4.0 1 3.0 3

3 0.512297 10.5 2 2 2 0 3.0 1 3.0 0

4 0.540399 5.5 0 3 5 1 3.0 1 4.0 0

In [15]:  df=data.copy()

Scaling of Data
In [15]:  from sklearn.preprocessing import StandardScaler
from pandas import DataFrame

In [16]:  scaler = StandardScaler()

In [18]:  df=scaler.fit_transform(data)

In [21]:  df=DataFrame(df)


In [22]:  data.columns

Out[22]: Index(['AgentBonus', 'Age', 'CustTenure', 'Channel', 'Occupation',

'EducationField', 'Gender', 'ExistingProdType', 'Designation',

'NumberOfPolicy', 'MaritalStatus', 'MonthlyIncome', 'Complaint',

'ExistingPolicyTenure', 'SumAssured', 'Zone', 'PaymentMethod',

'LastMonthCalls', 'CustCareScore'],

dtype='object')

In [23]:  df.columns=['AgentBonus', 'Age', 'CustTenure', 'Channel', 'Occupation',


'EducationField', 'Gender', 'ExistingProdType', 'Designation',
'NumberOfPolicy', 'MaritalStatus', 'MonthlyIncome', 'Complaint',
'ExistingPolicyTenure', 'SumAssured', 'Zone', 'PaymentMethod',
'LastMonthCalls', 'CustCareScore']

In [25]:  df.head()

Out[25]: AgentBonus Age CustTenure Channel Occupation EducationField Gender ExistingProdType Designation NumberOfPolicy Marit

0 0.254928 0.935957 -1.197848 -0.609065 -0.523319 -0.437870 -1.211301 -0.742887 0.274700 -1.079416 1

1 -1.361260 -0.367219 -1.433348 1.911973 -0.523319 -0.437870 0.825559 0.325131 0.274700 0.298529 -1

2 0.154790 1.409839 -1.197848 -0.609065 -3.664625 0.713482 0.825559 0.325131 -0.754855 -0.390443 2

3 -1.672717 -0.367219 -0.432472 1.911973 -0.523319 -0.437870 -1.211301 -0.742887 -0.754855 -0.390443 -1

4 -0.815659 -0.959572 -1.021223 -0.609065 1.047333 1.289158 0.825559 -0.742887 -0.754855 0.298529 -1

Train-Test Split
In [16]:  #Copy all the predictor variables into X dataframe
X = df.drop('AgentBonus', axis=1)

# Copy target into the y dataframe.
y = df['AgentBonus']


In [17]:  X.head()

Out[17]:
CustTenure Channel Occupation EducationField Gender ExistingProdType Designation NumberOfPolicy MaritalStatus MonthlyIncome C

0 4.0 0 2 2 0 3.0 2 2.0 2 4.322075

1 2.0 2 2 2 1 4.0 2 4.0 0 4.303844

2 4.0 0 0 4 1 4.0 1 3.0 3 4.232742

3 10.5 2 2 2 0 3.0 1 3.0 0 4.253071

4 5.5 0 3 5 1 3.0 1 4.0 0 4.266420

In [35]:  y.head()

Out[35]: 0 0.254928

1 -1.361260

2 0.154790

3 -1.672717

4 -0.815659

Name: AgentBonus, dtype: float64

In [18]:  # Split X and y into training and test set in 70:30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30 , random_state=1)

Linear Regression Model


In [19]:  from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt
import matplotlib.style
import scipy.stats as stats
from sklearn.metrics import max_error


In [20]:  regression_model=LinearRegression()
regression_model.fit(X_train,y_train)

Out[20]: LinearRegression()

In [21]:  for idx,col_name in enumerate(X_train.columns):

print(f"The coefficent for {col_name} is {regression_model.coef_[idx]}")

The coefficent for CustTenure is 0.0003300824744941662

The coefficent for Channel is -2.751621283610898e-05

The coefficent for Occupation is 1.5259293154943037e-05

The coefficent for EducationField is 1.1697523443542078e-05

The coefficent for Gender is 0.00031193417658791295

The coefficent for ExistingProdType is -0.00015409480456214582

The coefficent for Designation is 1.8915797248280733e-05

The coefficent for NumberOfPolicy is 0.00020411959362769263

The coefficent for MaritalStatus is -0.0003615014051258259


The coefficent for MonthlyIncome is 0.02808782946595261

The coefficent for Complaint is 0.0003250004697582831

The coefficent for ExistingPolicyTenure is 0.0005479683954928168

The coefficent for SumAssured is 0.06516875856010947

The coefficent for Zone is 9.363981215428923e-05

The coefficent for PaymentMethod is 3.256233181602395e-05

The coefficent for LastMonthCalls is 0.00011451462536120873

The coefficent for CustCareScore is 0.000154645997875137

The coefficent for age_bins is 0.0027310843920501505

In [22]:  #Intercept for the model



intercept =regression_model.intercept_

print(f"Intercept of the model {intercept}")

Intercept of the model 0.046132549020202696

R square Score

R-squared explains to what extent the variance of the independent variables explains the variance of the dependent variable. In other words, it
measures the proportion of variance of the dependent variable explained by the independent variables.

R-squared is a popular metric for judging model accuracy. It tells how close the data points are to the fitted line generated by the regression
algorithm; a larger R-squared value indicates a better fit, and it helps us quantify the relationship between the independent variables and the
dependent variable.

The R² score usually ranges from 0 to 1: the closer R² is to 1, the better the regression model. If R² is 0, the model performs no better than
always predicting the mean; if R² is negative, the regression model is worse than that baseline.
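As a sanity check, R² can also be computed directly from its definition; a sketch, assuming y_pred = regression_model.predict(X_test) as run further below:

import numpy as np

ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                      # matches r2_score(y_test, y_pred)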

In [23]:  # R square Score on training data


regression_model.score(X_train, y_train)

Out[23]: 0.7587443916585249

In [24]:  # R square on testing data


regression_model.score(X_test, y_test)

Out[24]: 0.7561716416565496

Predicted Score
In [25]:  y_pred_train = regression_model.predict(X_train)

In [74]:  y_pred_train

Out[74]: array([-0.41926401, 0.36446012, -0.75030793, ..., -0.55271221,

0.20529222, 0.08440873])


In [76]:  y_train

Out[76]: 2461 -0.089663

3681 0.799057

1309 -0.720675

4254 -1.085146

1335 0.060543

...

2895 -0.599185

2763 -0.983536

905 -0.151512

3980 0.651796

235 -0.976909

Name: AgentBonus, Length: 3164, dtype: float64

Max Error

The max_error() function computes the maximum residual error — a metric that captures the worst-case difference between a predicted value and the
true value.

In [27]:  # Test Data


print(max_error(y_test,y_pred))

0.041019251118525446

In [28]:  #Train Data


print(max_error(y_train,y_pred_train))

0.04555936215559531

Adjusted R Square


Adjusted R² is the same as standard R² except that it penalizes models when additional features are added.

To counter this weakness of R-squared, adjusted R-squared penalizes the addition of independent variables that do not increase the explanatory
power of the regression model.

The value of adjusted R-squared is always less than or equal to the value of R-squared. The closer the value is to 1, the better.
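The cells below actually report the plain r2_score; adjusted R² is not a built-in sklearn metric, but it is easy to derive from it. A sketch (the helper name adjusted_r2 is hypothetical):

from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_predictors):
    # Penalizes predictors that add no explanatory power
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

# e.g. adjusted_r2(y_test, y_pred, X_test.shape[1])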

In [26]:  y_pred = regression_model.predict(X_test)

In [64]:  y_pred

Out[64]: array([ 0.42005797, 1.85973827, -0.72368481, ..., -0.75515163,

-1.07738633, 0.70101843])

In [69]:  y_test

Out[69]: 610 1.197397

1519 1.837246

1620 -1.201482

2031 0.315305

494 -1.212527

...

2124 0.819673

3220 -1.239770

1851 -1.356106

1065 -1.160985

462 0.367582

Name: AgentBonus, Length: 1356, dtype: float64

In [29]:  from sklearn.metrics import r2_score

In [30]:  # test Data


print(r2_score(y_test, y_pred))

0.7561716416565496


In [31]:  # train Data


print(r2_score(y_train, y_pred_train))

0.7587443916585249

Test Data
In [32]:  from sklearn.metrics import mean_absolute_error,mean_squared_error

mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_test,y_pred=y_pred) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)

In [33]:  print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)

MAE: 0.00696660182270242

MSE: 7.658935210157367e-05

RMSE: 0.008751534271290587

Train Data

mae = mean_absolute_error(y_true=y_train,y_pred=y_pred_train)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_train,y_pred=y_pred_train) #default=True
rmse = mean_squared_error(y_true=y_train,y_pred=y_pred_train,squared=False)


In [34]:  print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)

MAE: 0.00696660182270242

MSE: 7.658935210157367e-05

RMSE: 0.008751534271290587

RMSE
In [35]:  #RMSE on Training data
predicted_train=regression_model.fit(X_train, y_train).predict(X_train)
np.sqrt(metrics.mean_squared_error(y_train,predicted_train))

Out[35]: 0.008733829934443486

In [36]:  #RMSE on Testing data


predicted_test=regression_model.fit(X_train, y_train).predict(X_test)
np.sqrt(metrics.mean_squared_error(y_test,predicted_test))

Out[36]: 0.008751534271290587

Plot of the data


In [37]:  fig, ax = plt.subplots()


ax.scatter(y_pred, y_test, edgecolors=(0, 0, 1))
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=3)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
plt.show()

In [95]:  # https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/


Tuning of Linear Regression

Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where linearly independent variables are
highly correlated. It has been used in many fields including econometrics, chemistry, and engineering.

Ridge regression is not needed here because there are no highly correlated variables.

Lasso Regression
Lasso regression is a regularization technique used over regression methods for a more accurate prediction. This model uses shrinkage:
data values are shrunk towards a central point such as the mean. The lasso procedure encourages simple, sparse models (i.e.
models with fewer parameters).

In [97]:  from sklearn.linear_model import Lasso

In [132]:  lassoReg = Lasso(alpha=1, normalize=True)

In [134]:  lassoReg.fit(X_train,y_train)

Out[134]: Lasso(alpha=1, normalize=True)

In [129]:  pred_train = lassoReg.predict(X_train)

In [130]:  pred_test = lassoReg.predict(X_test)


In [135]:  # calculating coefficients


coeff = DataFrame(X_train.columns)
coeff['Coefficient Estimate'] = Series(lassoReg.coef_)
print(coeff)

0 Coefficient Estimate

0 Age 0.0

1 CustTenure 0.0

2 Channel -0.0

3 Occupation 0.0

4 EducationField 0.0

5 Gender -0.0

6 ExistingProdType 0.0

7 Designation 0.0

8 NumberOfPolicy 0.0

9 MaritalStatus -0.0

10 MonthlyIncome 0.0

11 Complaint 0.0

12 ExistingPolicyTenure 0.0

13 SumAssured 0.0

14 Zone 0.0

15 PaymentMethod -0.0

16 LastMonthCalls 0.0

17 CustCareScore 0.0

Train data Metrics


In [114]:  mae = mean_absolute_error(y_true=y_train,y_pred=pred_train)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_train,y_pred=pred_train) #default=True
rmse = mean_squared_error(y_true=y_train,y_pred=pred_train,squared=False)


In [115]:  print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)

MAE: 0.8050079397017279

MSE: 1.0005051970719936

RMSE: 1.0002525666410427

In [117]:  # Adjusted R square


# train Data
print(r2_score(y_train, pred_train))

0.0

In [138]:  print(max_error(y_train,pred_train))

2.6210621129222855
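Every Lasso coefficient above shrank to exactly zero, which is why the train R² is 0.0: with alpha=1 and normalize=True the L1 penalty overwhelms this data. A hedged alternative sketch that lets cross validation pick the penalty instead (LassoCV is scikit-learn's built-in for this):

from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score

# Choose alpha by 5-fold cross validation rather than fixing alpha=1
lasso_cv = LassoCV(cv=5, random_state=1)
lasso_cv.fit(X_train, y_train)
print(lasso_cv.alpha_)                            # selected penalty strength
print(r2_score(y_test, lasso_cv.predict(X_test)))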

In [ ]:  print(lasso.score(x_cv,y_cv))

In [140]:  # print(lassoReg.score(y_train,y_test))

In [141]:  # https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
#https://www.datacourses.com/evaluation-of-regression-models-in-scikit-learn-846/ https://medium.com/analytics-vidh
# https://www.geeksforgeeks.org/sklearn-metrics-max_error-function-in-python/
# https://www.geeksforgeeks.org/python-linear-regression-using-sklearn/
# https://medium.com/analytics-vidhya/hyperparameter-tuning-in-linear-regression-e0e0f1f968a1
# https://neptune.ai/blog/fighting-overfitting-with-l1-or-l2-regularization


Cap_KNN_Reg-Copy1 - Jupyter Notebook

In [1]:  import pandas as pd


import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from factor_analyzer import FactorAnalyzer # Perform statistical tests before PCA
import warnings
warnings.filterwarnings("ignore")
from pandas import Series, DataFrame
from sklearn import metrics
import matplotlib.pyplot as plt
import matplotlib.style
import scipy.stats as stats
from sklearn.metrics import max_error

In [2]:  data= pd.read_excel('encode+file.xlsx',engine='openpyxl')

In [3]:  data.head()

Out[3]:
Unnamed: 0  AgentBonus Age CustTenure Channel Occupation EducationField Gender ExistingProdType Designation ... MaritalStatus

0 0 0.561619 22.0 4.0 0 2 2 0 3.0 2 ... 2
1 1 0.524419 11.0 2.0 2 2 2 1 4.0 2 ... 0
2 2 0.559994 26.0 4.0 0 0 4 1 4.0 1 ... 3
3 3 0.512297 11.0 10.5 2 2 2 0 3.0 1 ... 0
4 4 0.540399 6.0 5.5 0 3 5 1 3.0 1 ... 0

5 rows × 21 columns

In [5]:  #data.drop('Unnamed: 0',axis=1,inplace=True)


In [6]:  data.drop('Age',axis=1,inplace=True)

In [7]:  data.head()

Out[7]:
... MaritalStatus MonthlyIncome Complaint ExistingPolicyTenure SumAssured Zone PaymentMethod LastMonthCalls CustCareScore age_bins

0 ... 2 4.322075 1 2.0 5.906745 1 0 5 2.0 2
1 ... 0 4.303844 0 3.0 5.469088 1 3 7 3.0 1
2 ... 3 4.232742 1 2.0 5.734068 1 3 0 3.0 2
3 ... 0 4.253071 1 2.0 5.429163 3 0 0 5.0 1
4 ... 0 4.266420 0 4.0 5.563961 3 0 2 5.0 0

In [8]:  df=data.copy()

In [97]:  data['Age'].unique()

Out[97]: array([22. , 11. , 26. , 6. , 7. , 12. , 8. , 20. , 18. , 10. , 9. ,

5. , 9.5, 30. , 14. , 16. , 13. , 12.5, 2. , 4. , 15. , 27. ,

23. , 35.5, 33. , 19. , 17. , 25. , 21. , 24. , 29. , 31. , 28. ,

3. , 14.5, 13.5, 11.5, 8.5, 7.5, 35. , 32. , 18.5, 34. , 10.5,

22.5, 15.5, 17.5])

The KNN algorithm uses ‘feature similarity’ to predict the values of any new data points: a new point is assigned a value based
on how closely it resembles the points in the training set.

Template for binning a numeric column (df['number'] is a placeholder):

df['agebins'] = pd.cut(x=df['number'], bins=[1, 20, 40, 60, 80, 100])

In [103]:  df['age_bins']=pd.cut(x=df['Age'],bins=[0,10,20,30,40,50,60,70])


In [105]:  df[['age_bins','Age']]

Out[105]: age_bins Age

0 (20, 30] 22.0

1 (10, 20] 11.0

2 (20, 30] 26.0

3 (10, 20] 11.0

4 (0, 10] 6.0

... ... ...

4515 (0, 10] 4.0

4516 (0, 10] 9.0

4517 (20, 30] 23.0

4518 (0, 10] 10.0

4519 (10, 20] 14.0

4520 rows × 2 columns

In [107]:  df['age_bins'].unique()

Out[107]: [(20, 30], (10, 20], (0, 10], (30, 40]]

Categories (4, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 40]]

In [109]:  df['age_bins'].dtype

Out[109]: CategoricalDtype(categories=[(0, 10], (10, 20], (20, 30], (30, 40], (40, 50], (50, 60], (60, 70]],

ordered=True)


In [9]:  from sklearn.neighbors import KNeighborsClassifier


from sklearn.metrics import accuracy_score
from math import sqrt
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from math import sqrt
%matplotlib inline

Train-Test Split


In [10]:  #Copy all the predictor variables into X dataframe
X = df.drop('AgentBonus', axis=1)

# Copy target into the y dataframe.
y = df['AgentBonus']

In [11]:  # Split X and y into training and test set in 70:30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30 , random_state=1)

Preprocessing – Scaling the features


In [21]:  from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

x_train_scaled = scaler.fit_transform(X_train)
X_train = pd.DataFrame(x_train_scaled)

x_test_scaled = scaler.fit_transform(X_test)
X_test = pd.DataFrame(x_test_scaled)


KNN Regression Model


In [12]:  rmse_val = [] #to store rmse values for different k
for K in range(20):
K = K+1
model = neighbors.KNeighborsRegressor(n_neighbors = K)
model.fit(X_train, y_train) #fit the model
pred=model.predict(X_test) #make prediction on test set
error = sqrt(mean_squared_error(y_test,pred)) #calculate rmse
rmse_val.append(error) #store rmse values
print('RMSE value for k= ' , K , 'is:', error)

RMSE value for k= 1 is: 0.017491774596938722

RMSE value for k= 2 is: 0.015175320238339444

RMSE value for k= 3 is: 0.014468624431178693

RMSE value for k= 4 is: 0.013879108011015873

RMSE value for k= 5 is: 0.013621275109958371

RMSE value for k= 6 is: 0.013416450957228706

RMSE value for k= 7 is: 0.013327833089396474

RMSE value for k= 8 is: 0.013284323216308087

RMSE value for k= 9 is: 0.013230517071299937

RMSE value for k= 10 is: 0.013196509089571842

RMSE value for k= 11 is: 0.013180448331691245

RMSE value for k= 12 is: 0.013169558589283292

RMSE value for k= 13 is: 0.013150908011123801

RMSE value for k= 14 is: 0.013080011504198049

RMSE value for k= 15 is: 0.013078160523628627

RMSE value for k= 16 is: 0.013082924022262073

RMSE value for k= 17 is: 0.013069532312496603

RMSE value for k= 18 is: 0.0130415715047404

RMSE value for k= 19 is: 0.0129982690333403

RMSE value for k= 20 is: 0.012989177610160217


In [32]:  #plotting the rmse values against k values


curve = pd.DataFrame(rmse_val) #elbow curve
curve.plot()

Out[32]: <AxesSubplot:>

Best fit K=15


In [14]:  model = neighbors.KNeighborsRegressor(n_neighbors = 15)
model.fit(X_train, y_train)

Out[14]: KNeighborsRegressor(n_neighbors=15)

In [15]:  pred_test=model.predict(X_test)

In [16]:  pred_train=model.predict(X_train)

Max Error

In [17]:  # Test Data


print(max_error(y_test,pred_test))

0.05042173810624884

In [18]:  #Train Data


print(max_error(y_train,pred_train))

0.04598085821358955

Adjusted R Square
In [19]:  from sklearn.metrics import r2_score

In [20]:  # test Data


print(r2_score(y_test, pred_test))

0.4554858797489557

In [21]:  # train Data


print(r2_score(y_train, pred_train))

0.5310027559414495

Test Data
In [22]:  from sklearn.metrics import mean_absolute_error,mean_squared_error

mae = mean_absolute_error(y_true=y_test,y_pred=pred_test)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_test,y_pred=pred_test) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=pred_test,squared=False)


In [23]:  print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)

MAE: 0.010462744190562126

MSE: 0.0001710382826817982

RMSE: 0.013078160523628627

Train Data
In [24]:  mae = mean_absolute_error(y_true=y_train,y_pred=pred_train)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_train,y_pred=pred_train) #default=True
rmse = mean_squared_error(y_true=y_train,y_pred=pred_train,squared=False)

In [25]:  print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)

MAE: 0.009735566430888824

MSE: 0.00014828674591305246

RMSE: 0.01217730454218225

# R square Score
In [26]:  # R square Score on training data
model.score(X_train, y_train)

Out[26]: 0.5310027559414495

In [27]:  # R square on testing data


model.score(X_test, y_test)

Out[27]: 0.4554858797489557


Plot of the data


In [28]:  fig, ax = plt.subplots()
ax.scatter(pred_test, y_test, edgecolors=(0, 0, 1))
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=3)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
plt.show()

Hyperparameter Tuning
In [30]:  ar=[]
for K in range(100):
K = K+1
ar.append(K)


In [31]:  from sklearn.model_selection import GridSearchCV


params = {'n_neighbors':ar}

knn = neighbors.KNeighborsRegressor()

model = GridSearchCV(knn, params, cv=5)
model.fit(X_train,y_train)

Out[31]: GridSearchCV(cv=5, estimator=KNeighborsRegressor(),
                      param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                                  13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                                  23, 24, 25, 26, 27, 28, 29, 30, ...]})

In [32]:  model.best_params_

Out[32]: {'n_neighbors': 13}
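Beyond best_params_, the fitted search object also exposes the cross-validated score and the refit estimator, which saves rebuilding the model by hand (a sketch using the GridSearchCV object above):

    print(model.best_score_)          # mean 5-fold CV score (R² by default) of the best setting
    best_knn = model.best_estimator_  # KNeighborsRegressor(n_neighbors=13), refit on all of X_train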

Predicted Value
In [33]:  pret_test=model.predict(X_test)

In [34]:  pret_train=model.predict(X_train)

Max Error
In [35]:  # Test Data
print(max_error(y_test,pret_test))

0.051673641607796195


In [36]:  #Train Data


print(max_error(y_train,pret_train))

0.04446582026307788

R Square (r2_score)
In [37]:  from sklearn.metrics import r2_score

In [38]:  # test Data


print(r2_score(y_test, pret_test))

0.44941129374205124

In [39]:  # train Data


print(r2_score(y_train, pret_train))

0.5405965659499341

Test Data
In [40]:  from sklearn.metrics import mean_absolute_error,mean_squared_error

mae = mean_absolute_error(y_true=y_test,y_pred=pret_test)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_test,y_pred=pret_test) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=pret_test,squared=False)


In [41]:  print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)

MAE: 0.010552156604526413

MSE: 0.00017294638151704015

RMSE: 0.013150908011123801

Train Data
In [42]:  mae = mean_absolute_error(y_true=y_train,y_pred=pret_train)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_train,y_pred=pret_train) #default=True
rmse = mean_squared_error(y_true=y_train,y_pred=pret_train,squared=False)

In [43]:  print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)

MAE: 0.009617269338237798

MSE: 0.0001452533914848788

RMSE: 0.0120521114948742

R Square Score

Note that the cross-validated choice k = 13 gives a marginally lower test R² (0.449) than the elbow choice k = 15 (0.455); the two settings are effectively equivalent on this data.
In [44]:  # R square Score on training data
model.score(X_train, y_train)

Out[44]: 0.5405965659499341

In [45]:  # R square on testing data


model.score(X_test, y_test)

Out[45]: 0.44941129374205124


Plot of the data


In [46]:  fig, ax = plt.subplots()
ax.scatter(pret_test, y_test, edgecolors=(0, 0, 1))
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=3)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
plt.show()

In [ ]:  ​

Random Forest Regression Model

In [1]:  import pandas as pd


import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from factor_analyzer import FactorAnalyzer # Perform statistical tests before PCA
import warnings
warnings.filterwarnings("ignore")
from pandas import Series, DataFrame
from sklearn import metrics
import matplotlib.pyplot as plt
import matplotlib.style
import scipy.stats as stats
from sklearn.metrics import max_error

In [2]:  data= pd.read_excel('encode+file.xlsx',engine='openpyxl')

In [3]:  data.head()

Out[3]:
   Unnamed: 0  AgentBonus   Age  CustTenure  Channel  Occupation  EducationField  Gender  ExistingProdType  Designation  ...  MaritalStatus
0           0    0.561619  22.0         4.0        0           2               2       0               3.0            2  ...              2
1           1    0.524419  11.0         2.0        2           2               2       1               4.0            2  ...              0
2           2    0.559994  26.0         4.0        0           0               4       1               4.0            1  ...              3
3           3    0.512297  11.0        10.5        2           2               2       0               3.0            1  ...              0
4           4    0.540399   6.0         5.5        0           3               5       1               3.0            1  ...              0

5 rows × 21 columns

In [5]:  #data.drop('Unnamed: 0',axis=1,inplace=True)


In [6]:  data.drop('Age',axis=1,inplace=True)

In [7]:  data.head()

Out[7]:
   AgentBonus  CustTenure  Channel  Occupation  EducationField  Gender  ExistingProdType  Designation  NumberOfPolicy  MaritalStatus  ...
0    0.561619         4.0        0           2               2       0               3.0            2             2.0              2  ...
1    0.524419         2.0        2           2               2       1               4.0            2             4.0              0  ...
2    0.559994         4.0        0           0               4       1               4.0            1             3.0              3  ...
3    0.512297        10.5        2           2               2       0               3.0            1             3.0              0  ...
4    0.540399         5.5        0           3               5       1               3.0            1             4.0              0  ...

In [8]:  df=data.copy()

Data Split
In [9]:  #Copy all the predictor variables into X dataframe
X = df.drop('AgentBonus', axis=1)

# Copy target into the y dataframe.
y = df['AgentBonus']

In [10]:  # Split X and y into training and test set in 70:30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30 , random_state=1)

Random Forest


In [11]:  from sklearn.ensemble import RandomForestRegressor

In [12]:  rf = RandomForestRegressor(n_estimators = 300, max_features = 'sqrt', max_depth = 5, random_state = 18)

In [13]:  rf.fit(X_train,y_train)

Out[13]: RandomForestRegressor(max_depth=5, max_features='sqrt', n_estimators=300,

random_state=18)

In [14]:  pred_train=rf.predict(X_train)

In [15]:  pred_train

Out[15]: array([0.5456535 , 0.56330635, 0.54159001, ..., 0.54381698, 0.56118677,

0.55291051])

In [16]:  pred_test=rf.predict(X_test)

In [17]:  pred_test

Out[17]: array([0.56148345, 0.57258753, 0.54215618, ..., 0.54467076, 0.53538004,

0.56435946])

Max Error
In [18]:  # Test Data
print(max_error(y_test,pred_test))

0.04070131067993954


In [19]:  # Train Data


print(max_error(y_train,pred_train))

0.042071604020055386

R Square Score
In [20]:  # R square Score on training data
rf.score(X_train, y_train)

Out[20]: 0.7474736496532233

In [21]:  # R square Score for test data


rf.score(X_test, y_test)

Out[21]: 0.7307608642872265

R Square (r2_score)
In [22]:  from sklearn.metrics import r2_score

In [23]:  # test Data


print(r2_score(y_test, pred_test))

0.7307608642872265

In [24]:  # train Data


print(r2_score(y_train, pred_train))

0.7474736496532233

Test Data

In [25]:  from sklearn.metrics import mean_absolute_error,mean_squared_error

mae = mean_absolute_error(y_true=y_test,y_pred=pred_test)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_test,y_pred=pred_test) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=pred_test,squared=False)

In [26]:  print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)

MAE: 0.0073953875572963165

MSE: 8.457117582518018e-05

RMSE: 0.009196258795030736

Train Data
In [27]:  mae = mean_absolute_error(y_true=y_train,y_pred=pred_train)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_train,y_pred=pred_train) #default=True
rmse = mean_squared_error(y_true=y_train,y_pred=pred_train,squared=False)

In [28]:  print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)

MAE: 0.007100929904297198

MSE: 7.984334924055136e-05

RMSE: 0.008935510575258213

Plot of the Data


In [29]:  fig, ax = plt.subplots()


ax.scatter(pred_test, y_test, edgecolors=(0, 0, 1))
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=3)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
plt.show()

Tuning of the Model


In [32]:  from sklearn.model_selection import RandomizedSearchCV


# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
 'max_features': ['auto', 'sqrt'],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'min_samples_split': [2, 5, 10],
 'min_samples_leaf': [1, 2, 4],
 'bootstrap': [True, False]}
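The grid above spans 10 × 2 × 12 × 3 × 3 × 2 = 4,320 candidate settings, of which the randomized search below samples only 100. This can be checked directly (a sketch using the random_grid dict above):

    from math import prod

    n_combos = prod(len(v) for v in random_grid.values())
    print(n_combos)  # 4320 candidate combinations; RandomizedSearchCV tries n_iter=100 of them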


In [33]:  # Use the random grid to search for best hyperparameters


# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.

[Parallel(n_jobs=-1)]: Done 25 tasks | elapsed: 52.0s

[Parallel(n_jobs=-1)]: Done 146 tasks | elapsed: 3.8min

[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 8.7min finished

Out[33]: RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(), n_iter=100,
                            n_jobs=-1,
                            param_distributions={'bootstrap': [True, False],
                                                 'max_depth': [10, 20, 30, 40, 50, 60,
                                                               70, 80, 90, 100, 110, None],
                                                 'max_features': ['auto', 'sqrt'],
                                                 'min_samples_leaf': [1, 2, 4],
                                                 'min_samples_split': [2, 5, 10],
                                                 'n_estimators': [200, 400, 600, 800,
                                                                  1000, 1200, 1400, 1600,
                                                                  1800, 2000]},
                            random_state=42, verbose=2)

In [34]:  rf_random.best_params_

Out[34]: {'n_estimators': 600,

'min_samples_split': 5,

'min_samples_leaf': 1,

'max_features': 'sqrt',

'max_depth': 60,

'bootstrap': False}
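Rather than re-typing the winning parameters (as done in the next cell), the refit model can also be pulled straight from the search object, since refit=True is the RandomizedSearchCV default (a sketch using rf_random from above):

    best_rf = rf_random.best_estimator_   # already refit on the full training set
    print(best_rf.score(X_test, y_test))  # test R² of the tuned forest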


In [36]:  base_model = RandomForestRegressor(n_estimators = 600, min_samples_split=5,min_samples_leaf=1,max_features='sqrt',max

In [37]:  base_model

Out[37]: RandomForestRegressor(bootstrap=False, max_depth=60, max_features='sqrt',

min_samples_split=5, n_estimators=600, random_state=42)

In [38]:  base_model.fit(X_train,y_train)

Out[38]: RandomForestRegressor(bootstrap=False, max_depth=60, max_features='sqrt',

min_samples_split=5, n_estimators=600, random_state=42)

In [39]:  pred_btrain=base_model.predict(X_train)

In [40]:  pred_btrain

Out[40]: array([0.55552542, 0.57002255, 0.54290556, ..., 0.55289121, 0.56706034,

0.53903398])

In [41]:  pred_btest=base_model.predict(X_test)

In [52]:  pred_btest

Out[52]: array([0.56443089, 0.57895046, 0.54067113, ..., 0.5459554 , 0.53120521,

0.56616343])

R Square Score
In [54]:  # R square Score on training data
base_model.score(X_train, y_train)

Out[54]: 0.9918822168009825


In [55]:  # R square Score on training data


base_model.score(X_test, y_test)

Out[55]: 0.8128773312202485

Max Error
In [43]:  # Test Data
print(max_error(y_test,pred_btest))

0.03303242857956701

In [44]:  #Train Data


print(max_error(y_train,pred_btrain))

0.008219513331729433

R Square (r2_score)

The gap between the train R² (0.992) and the test R² (0.813) shows the tuned forest fits the training data almost perfectly; the test score is the figure that matters when comparing models.
In [45]:  # test Data
print(r2_score(y_test, pred_btest))

0.8128773312202485

In [46]:  # train Data


print(r2_score(y_train, pred_btrain))

0.9918822168009825

Test Data


In [47]:  from sklearn.metrics import mean_absolute_error,mean_squared_error

mae = mean_absolute_error(y_true=y_test,y_pred=pred_btest)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_test,y_pred=pred_btest) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=pred_btest,squared=False)

In [48]:  print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)

MAE: 0.006028757345944345

MSE: 5.8777428772954286e-05

RMSE: 0.0076666439054487385

Train Data
In [49]:  mae = mean_absolute_error(y_true=y_train,y_pred=pred_btrain)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_train,y_pred=pred_btrain) #default=True
rmse = mean_squared_error(y_true=y_train,y_pred=pred_btrain,squared=False)

In [50]:  print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)

MAE: 0.0012101458086502667

MSE: 2.566666797853658e-06

RMSE: 0.0016020820197023803

Plot of the data


In [51]:  fig, ax = plt.subplots()


ax.scatter(pred_btest, y_test, edgecolors=(0, 0, 1))
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=3)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
plt.show()

In [56]:  for idx, col_name in enumerate(X_train.columns):


print("The coefficient for {} is {}".format(col_name, base_model.coef_[0][idx]))

---------------------------------------------------------------------------

AttributeError Traceback (most recent call last)

<ipython-input-56-9ed5e4324b22> in <module>

1 for idx, col_name in enumerate(X_train.columns):

----> 2 print("The coefficient for {} is {}".format(col_name, base_model.coef_[0][idx]))

AttributeError: 'RandomForestRegressor' object has no attribute 'coef_'


In [61]:  base_model.feature_importances_ *100

Out[61]: array([15.40674396, 0.5678023 , 0.57333485, 0.89672205, 0.39203682,

1.01533605, 3.72619788, 1.20766343, 1.4317113 , 10.86518658,

0.37476185, 5.49894885, 43.32096357, 0.48655976, 0.51603722,

2.22080241, 1.10441135, 10.39477979])

In [64]:  df.columns

Out[64]: Index(['AgentBonus', 'CustTenure', 'Channel', 'Occupation', 'EducationField',

'Gender', 'ExistingProdType', 'Designation', 'NumberOfPolicy',

'MaritalStatus', 'MonthlyIncome', 'Complaint', 'ExistingPolicyTenure',

'SumAssured', 'Zone', 'PaymentMethod', 'LastMonthCalls',

'CustCareScore', 'age_bins'],

dtype='object')
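The importance array follows the column order of X_train, i.e. df.columns minus the target AgentBonus. Pairing and sorting the two makes the ranking explicit (a sketch, assuming base_model and X_train from above):

    importances = pd.Series(base_model.feature_importances_ * 100, index=X_train.columns)
    print(importances.sort_values(ascending=False).round(2))

On this fit, SumAssured (≈43%) dominates, followed by CustTenure (≈15%), MonthlyIncome (≈11%) and age_bins (≈10%); these are the variables any bonus recommendation should lean on.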


In [65]:  df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4520 entries, 0 to 4519
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   AgentBonus            4520 non-null   float64
 1   CustTenure            4520 non-null   float64
 2   Channel               4520 non-null   int64
 3   Occupation            4520 non-null   int64
 4   EducationField        4520 non-null   int64
 5   Gender                4520 non-null   int64
 6   ExistingProdType      4520 non-null   float64
 7   Designation           4520 non-null   int64
 8   NumberOfPolicy        4520 non-null   float64
 9   MaritalStatus         4520 non-null   int64
 10  MonthlyIncome         4520 non-null   float64
 11  Complaint             4520 non-null   int64
 12  ExistingPolicyTenure  4520 non-null   float64
 13  SumAssured            4520 non-null   float64
 14  Zone                  4520 non-null   int64
 15  PaymentMethod         4520 non-null   int64
 16  LastMonthCalls        4520 non-null   int64
 17  CustCareScore         4520 non-null   float64
 18  age_bins              4520 non-null   int64
dtypes: float64(8), int64(11)
memory usage: 671.1 KB
