Life Insurance Sales
FINAL PROJECT REPORT
SUBMITTED BY
VIVEK AJAYAKUMAR
CONTENTS
2. Data Cleaning/Preprocessing
4. Univariate Analysis
5. Bi-variate Analysis
6. Modelling Approach
10. Model Validation
11. Business Recommendation
LIST OF TABLES
1. Data Dictionary
3. Outlier Table
5. Univariate Analysis
6. Bivariate Analysis
7. Model Metrics
LIST OF FIGURES
1. Count Plot
2. Heat Map
3. Pair Plot
4. Scatter Plot
1: Introduction to the Business Problem
Insurance companies have flourished all over the world, and their services are distributed through various channels. Insurance agents are the most widely used channel for expanding the business. Insurance agents sell and negotiate life, health, property, or other types of insurance to match the needs of their clients. As an insurance agent, you may work for an insurance company, refer clients to independent brokers, or work as an independent broker.
In this case, the insurance company has collected data on its customers, insurance policies and agent details. The company wants to predict the bonus of its agents by analysing this data. From the data, the firm can design appropriate engagement activities for its high-performing agents and upskilling programs for its low-performing agents.
The firm wants to understand the business by analysing the agent bonus. The company is planning to classify the agents into two groups: high-performing agents and low-performing agents. The firm wants to design a model to ensure an appropriate bonus for its employees/agents.
The study is conducted to ensure that each and every agent is driven towards their goal by predicting the appropriate bonus for all agents. This helps in identifying high performers and keeping them motivated. At the same time, low-performing agents need to be identified and enrolled in upskilling programs.
2: Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an approach to analyse the data using visual techniques.
It is used to discover trends, patterns, or to check assumptions with the help of
statistical summary and graphical representations.
The dataset is uniquely constructed to encompass various features to predict the agent bonus.
Data Dictionary
The insurance company's dataset is used to predict the bonus for its agents, and further analysis is done to design appropriate engagement activities to improve their skill sets.
From the data dictionary, we can understand that data from three main segments is taken:
1) Data of the policy holder
2) Details of the policy
3) Agent details and channels.
The dataset is uniquely constructed to encompass various features to predict the bonus and also to improve the skill sets of the agents.
Data Report
As explained earlier, the data has three main segments, and the data dictionary is shown above.
The data has 19 columns that capture data of the customer, policy and agent details.
Dataset has 4520 entries with various data types.
The data description is shown below.
The dataset contains null values in both categorical and continuous variables.
2) Replacement of Values
The categorical data has some misspelling and spacing issues, as specified below. These values are treated to improve the quality of the data.
Occupation: ['Salaried', 'Free Lancer', 'Small Business', 'Laarge Business', 'Large Business'] is replaced with ['Salaried', 'Free Lancer', 'Small Business', 'Large Business']
3) Presence of Outliers
An outlier is an observation that deviates significantly from the rest of the observations. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining. An outlier cannot simply be dismissed as noise or error. In this dataset, outliers are present in the following variables: CustTenure, Age, AgentBonus, MonthlyIncome, ExistingPolicyTenure and SumAssured.
[Boxplots showing outliers in AgentBonus, Age, CustTenure, MonthlyIncome, ExistingPolicyTenure and SumAssured]
The outliers are treated and the values updated to improve the quality of the data.
4) Variance of Variables
Variable         Variance
AgentBonus       1844936.85
SumAssured       54358424250.60
MonthlyIncome    18611449.10
Table 5: Variance Table
Compared to the other variables, these high variances will affect the modelling. To avoid this, a logarithmic transformation is applied and the values are scaled down to improve the quality of the data.
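A minimal sketch of this transformation, assuming numpy and a dataframe df holding these columns (np.log1p is one common choice; the notebook's exact transform may differ):

    import numpy as np
    # assuming df is the cleaned dataframe from the previous step
    for col in ['AgentBonus', 'SumAssured', 'MonthlyIncome']:
        df[col] = np.log1p(df[col])  # log(1 + x) scales down high-variance values
    print(df[['AgentBonus', 'SumAssured', 'MonthlyIncome']].var())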
5) Feature Engineering
Feature engineering is a machine learning technique that leverages data to create new variables that aren't in the training set. It can produce new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also enhancing model accuracy.
In this dataset, the 'Age' variable is modified. The continuous variable is binned into age groups (0-10, 10-20, 20-30, 30-40), and these bins are then treated as a categorical variable.
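A sketch of the binning step, assuming pandas and the column names above (the bin edges follow the text; the notebook's exact call may differ):

    import pandas as pd
    # bin the continuous Age variable into the stated age groups
    bins = [0, 10, 20, 30, 40]
    df['age_bins'] = pd.cut(df['Age'], bins=bins, labels=['0-10', '10-20', '20-30', '30-40'])
    df['age_bins'] = df['age_bins'].cat.codes  # categorical bins -> numeric codes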
For modelling purposes, the object data types are converted into numerical category codes:
feature: Channel
[Agent, Third Party Partner, Online]
Categories (3, object): [Agent, Online, Third Party Partner]
[0 2 1]
feature: Occupation
9
[Salaried, Free Lancer, Small Business, Large Business]
Categories (4, object): [Free Lancer, Large Business, Salaried, Small Business]
[2 0 3 1]
feature: EducationField
[Graduate, Post Graduate, Under Graduate, Engineer, Diploma, MBA]
Categories (6, object): [Diploma, Engineer, Graduate, MBA, Post Graduate, Under Graduate
]
[2 4 5 1 0 3]
feature: Gender
[Female, Male]
Categories (2, object): [Female, Male]
[0 1]
feature: Designation
[Manager, Executive, VP, AVP, Senior Manager]
Categories (5, object): [AVP, Executive, Manager, Senior Manager, VP]
[2 1 4 0 3]
feature: MaritalStatus
[Single, Divorced, Unmarried, Married]
Categories (4, object): [Divorced, Married, Single, Unmarried]
[2 0 3 1]
feature: Zone
[North, West, East, South]
Categories (4, object): [East, North, South, West]
[1 3 0 2]
feature: PaymentMethod
[Half Yearly, Yearly, Quarterly, Monthly]
Categories (4, object): [Half Yearly, Monthly, Quarterly, Yearly]
[0 3 2 1]
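The mappings above are category codes; a minimal sketch of how they can be produced, assuming pandas (the notebook likely used pd.Categorical or an equivalent):

    import pandas as pd
    object_cols = ['Channel', 'Occupation', 'EducationField', 'Gender',
                   'Designation', 'MaritalStatus', 'Zone', 'PaymentMethod']
    for col in object_cols:
        cat = pd.Categorical(df[col])
        print('feature:', col)
        print(cat.categories.tolist())
        df[col] = cat.codes  # integer codes in category order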
Univariate Analysis
Univariate Analysis is the key to understanding each and every variable in the data.
[Distribution and box plots of AgentBonus, Age, CustTenure, ExistingProdType, NumberOfPolicy, MonthlyIncome, ExistingPolicyTenure, SumAssured and LastMonthCalls]
[Count plots of Occupation, EducationField, Gender, Designation, MaritalStatus, Zone and PaymentMethod]
Inference:
• From the univariate analysis, continuous variables like AgentBonus, Age, CustTenure, MonthlyIncome, ExistingPolicyTenure and SumAssured have outliers, and these need to be treated to improve the quality of the data.
• High-variance variables are transformed to improve the quality of the data.
• From the categorical variables, we can find that:
o Agents play a vital role in bringing customers to the organization.
o Customers with Free Lancer and Large Business occupations bring more business to the firm.
o The designation of the customer plays a crucial role in canvassing the insurance policy.
o Married customers are more interested in availing an insurance policy.
o The North and West zones bring in most of the business.
o Customers prefer to pay half-yearly or yearly for their insurance premiums.
Bi-variate Analysis
Bivariate analysis is a kind of statistical analysis in which two variables are observed against each other. One of the variables is dependent and the other is independent, conventionally denoted by Y and X. The changes are analysed between the two variables to understand to what extent the change has occurred.
In our dataset, agent bonus is the dependent (target) variable. For bi-variate analysis, agent bonus is evaluated against all the other independent variables for prediction.
Pair Plot of Variables
Inference:
• From the pair plot, there is a strong linear correlation between AgentBonus and each of SumAssured, MonthlyIncome and CustTenure.
Variable Plot
[Plots of AgentBonus against Channel, Occupation, EducationField, Gender, Designation, MaritalStatus, Zone and PaymentMethod]
Other Plots:
Inferences:
• The South zone needs more attention compared to the other zones.
• The rest of the zones have almost equal participation gender-wise.
• From the complaints point of view, the West zone needs to rectify those issues.
Inference:
• Customer tenure for married persons is higher for both genders.
• More focus needs to be given to divorced and unmarried persons.
Inferences:
• The scatter plot implies that there is a direct positive relationship between SumAssured and AgentBonus.
Modelling Approach Used:
Model Used:
Metrics Considered.
Linear Regression Model
Linear regression is a linear model, e.g. a model that assumes a linear relationship
between the input variables (x) and the single output variable (y). More specifically,
that y can be calculated from a linear combination of the input variables (x).
When there is a single input variable (x), the method is referred to as simple linear
regression. When there are multiple input variables, literature from statistics often
refers to the method as multiple linear regression.
Different techniques can be used to prepare or train the linear regression equation from data, the most common of which is Ordinary Least Squares. It is therefore common to refer to a model prepared this way as Ordinary Least Squares Linear Regression, or just Least Squares Regression.
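A minimal sketch of the approach used here, assuming scikit-learn and the scaled dataframe df from the preprocessing step (the full notebook code appears in the Appendix):

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X = df.drop('AgentBonus', axis=1)   # predictors
    y = df['AgentBonus']                # target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

    regression_model = LinearRegression()
    regression_model.fit(X_train, y_train)
    print(regression_model.score(X_train, y_train))  # R-squared on train
    print(regression_model.score(X_test, y_test))    # R-squared on test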
Most significant coefficients are listed below:
Variable               Coefficient
SumAssured             0.06516
MonthlyIncome          0.02808
age_bins               0.00273
ExistingPolicyTenure   0.000547
Gender                 0.000312
ExistingProdType       -0.00015
Analysis:
o From this linear regression model, we can understand that the sum assured, the monthly income of the policy holders, the age group, the existing policy tenure and gender play a vital role in the agent bonus. Existing product type has a negative impact on the agent bonus value.
o The R-squared values of the train and test data indicate that the model is neither underfit nor overfit. The value is also appreciable for predicting the agent bonus.
o The MAE value is acceptable, so the predicted value remains near the actual value.
o The MSE and RMSE values are also acceptable for predicting the agent bonus.
For hyperparameter tuning of the linear regression model, we can use Lasso, Ridge or Elastic Net models. In our case, these approaches cannot be used effectively because the values are already scaled down, and further regularization would shrink the coefficients too much.
The coefficients of the Lasso model are shown below:
Variable               Coefficient Estimate
Age                    0.0
CustTenure             -0.0
Channel                0.0
Occupation             0.0
EducationField         -0.0
Gender                 0.0
ExistingProdType       0.0
Designation            0.0
NumberOfPolicy         -0.0
MaritalStatus          0.0
MonthlyIncome          0.0
Complaint              0.0
ExistingPolicyTenure   0.0
SumAssured             0.0
Zone                   -0.0
PaymentMethod          0.0
LastMonthCalls         0.0
CustCareScore          0.0
The coefficient values are shrunk to zero, so further tuning in this direction is not useful. Hyperparameter tuning via Lasso therefore cannot be implemented, and there is no need for Ridge or Elastic Net modelling.
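A sketch of the Lasso check, assuming scikit-learn and the split above (the alpha value is an assumption; the notebook's value may differ):

    from sklearn.linear_model import Lasso
    import pandas as pd

    lasso_reg = Lasso(alpha=0.05)  # hypothetical alpha
    lasso_reg.fit(X_train, y_train)
    coef_table = pd.DataFrame({'Variable': X_train.columns,
                               'Coefficient Estimate': lasso_reg.coef_})
    print(coef_table)  # near-zero coefficients indicate over-shrinkage here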
KNN Regression Model
In this model, various K values are applied and the RMSE is calculated for each; the K value with the lowest RMSE is then selected.
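A minimal sketch of this K search, assuming scikit-learn and the same train/test split (the scoring loop is an assumption based on the description):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_squared_error

    rmse_per_k = {}
    for k in range(1, 101):  # K values from 1 to 100
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(X_train, y_train)
        pred = knn.predict(X_test)
        rmse_per_k[k] = np.sqrt(mean_squared_error(y_test, pred))
    best_k = min(rmse_per_k, key=rmse_per_k.get)  # K with the lowest RMSE
    print(best_k, rmse_per_k[best_k])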
Model Metrics when K=15
Train Data Test Data
Plot of Predicted vs Actual Values
For hyperparameter tuning of the model, various values of K are tried; K values from 1 to 100 are used to find the best model.
Model Metrics:
Plot of Predicted vs Actual Values
Inference:
o The R-squared and Adjusted R-squared values indicate that the model is not good at predicting the values.
o Hyperparameter tuning fails to improve the model performance.
o The model can be categorized as an underfit model.
o The model has unacceptable MAE, MSE and RMSE values for predicting the agent bonus.
Random Forest Regression Model
For modelling, the sklearn Random Forest Regressor is used. The data is split 70:30 into train and test sets. The trained model is then applied to the test data to assess the quality of the model.
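A sketch of this step, assuming scikit-learn and the split above (n_estimators and random_state match the parameter dump below; the rest are defaults):

    from sklearn.ensemble import RandomForestRegressor

    rf = RandomForestRegressor(n_estimators=1000, random_state=42)
    rf.fit(X_train, y_train)
    print(rf.score(X_train, y_train))  # R-squared on train
    print(rf.score(X_test, y_test))    # R-squared on test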
Table 6: Random Forest parameters
{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'mse',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 1000,
'n_jobs': None,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}
Model Metrics:
HyperParameter Tuning
Using grid search, the following tuned parameters were obtained:
n_estimators: 800
min_samples_split: 5
min_samples_leaf: 1
max_features: sqrt
max_depth: 90
bootstrap: False
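A sketch of how such a grid search might be set up, assuming scikit-learn (the exact grid searched in the notebook is not shown, so this grid is an assumption built around the winning values above):

    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestRegressor

    param_grid = {                      # hypothetical grid around the reported optimum
        'n_estimators': [400, 800],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2],
        'max_features': ['sqrt'],
        'max_depth': [60, 90],
        'bootstrap': [False],
    }
    grid = GridSearchCV(RandomForestRegressor(random_state=42),
                        param_grid, cv=3, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(grid.best_params_)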
Feature Importance (values in %, summing to 100)
CustTenure             15.40674
Channel                0.567802
Occupation             0.573335
EducationField         0.896722
Gender                 0.392037
ExistingProdType       1.015336
Designation            3.726198
NumberOfPolicy         1.207663
MaritalStatus          1.431711
MonthlyIncome          10.86519
Complaint              0.374762
ExistingPolicyTenure   5.498949
SumAssured             43.32096
Zone                   0.48656
PaymentMethod          0.516037
LastMonthCalls         2.220802
CustCareScore          1.104411
age_bins               10.39478
Model Metrics
Inferences:
o The R-squared and Adjusted R-squared values have changed after tuning the model. The value for the train data is about 0.99 and for the test data 0.8128. This implies that the model is neither underfit nor overfit.
o Moreover, the MAE, MSE and RMSE values are also good, and therefore the hyper-tuned model can be considered for modelling.
o The significant features of the model are:
▪ SumAssured
▪ CustTenure
▪ MonthlyIncome
▪ age_bins
▪ ExistingPolicyTenure
Model Validation
In machine learning, model validation refers to the process in which a trained model is evaluated on a test data set. The test data set is a separate portion of the same data set from which the training set is derived.
In our regression problem of predicting the agent bonus, we have tried various modelling approaches. The model metrics are then evaluated and compared to find the best model. From our analysis, the hyper-tuned Random Forest model outperforms the linear regression and KNN regression models.
From the model metrics given above, the Random Forest model performs well and is able to predict the values better than the other models.
R-squared and Adjusted R-squared are best used to explain the model, because they express the percentage of output variability the model accounts for. MSE, RMSE and MAE are better used to compare performance between different regression models.
In our data analysis, the main objective is to predict the agent bonus and to classify the agents in order to improve their performance. Therefore, we consider R-squared and Adjusted R-squared to understand the performance of the model. From the consolidated model metrics, it is clear that the hyper-tuned Random Forest is the best available model to predict the agent bonus. R-squared values range from 0 to 1 and are commonly stated as percentages from 0% to 100%. The model metrics indicate that the R-squared value on the test data is 0.8129, i.e., the model explains about 81.29% of the variance. The model is neither overfit nor underfit; it performs well on both the train data and the test data.
Comparing the other model metrics (MAE, MSE, RMSE, Max Error), the hyper-tuned Random Forest again performs well against the other two models.
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far data points are from the regression line; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. The hyper-tuned Random Forest model has a good RMSE value compared to the other models. In this analysis, the R-squared, Adjusted R-squared and RMSE values are considered as the model metrics.
Business Recommendation
The objective of this analysis is to predict the agent bonus from the data set. For this, regression methods are used; three regression models are built to predict the agent bonus.
In this case, the hyper-tuned Random Forest model is used for predicting the agent bonus. In this model, the significant features are SumAssured, CustTenure, MonthlyIncome, age bins and ExistingPolicyTenure. Therefore, an agent who canvasses a higher sum assured policy from older people with higher monthly income and a longer policy tenure can earn a higher bonus. Agents need to focus on these parameters while canvassing customers.
Moreover, the average existing policy tenure is highest for higher-ranked designations, while policy tenure is lower for undergraduates. Therefore, agents need to focus on people with higher-ranked designations to bring good business to the company. Policy tenure for married people is higher, so focusing on divorced and unmarried people might increase policy tenure.
The South and East regions are the worst-performing regions compared to the others. Proper training and daily evaluation need to be done for the underperforming agents, with proper tracking of their progress. For high-performing agents, I would recommend increasing the number of policies in the higher sum assured category. Focusing on that area can improve the business and the bonus.
Appendix:
# https://towardsdatascience.com/whats-the-difference-between-linear-regression-lasso-ridge-and-elasticnet-8f997c60cf29
Lasso, Ridge and ElasticNet are all part of the linear regression family, where the x (input) and y (output) are assumed to have a linear relationship. LinearRegression refers to the ordinary least squares linear regression method without regularization (penalty on weights). The main difference among them is whether the model is penalized for its weights. For the rest of the post, I am going to talk about them in the context of scikit-learn.
Linear regression (in scikit-learn) is the most basic form, where the model is not penalized at all for its choice of weights. That means, during the training stage, if the model feels like one particular feature is particularly important, the model may place a large weight on that feature. This sometimes leads to overfitting in small datasets.
Lasso is a modification of linear regression, where the model is penalized for the sum of absolute values of the weights. Thus, the absolute values of the weights will (in general) be reduced, and many will tend to be zeros. During training, the objective function becomes:
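(The referenced objective in standard form, with alpha the penalty coefficient:)

\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \sum_{j} \lvert w_j \rvert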
As you see, Lasso introduced a new hyperparameter, alpha, the coefficient to penalize
weights.
37
Ridge takes a step further and penalizes the model for the sum of squared values of the weights. Thus, the weights not only tend to have smaller absolute values, but the extremes of the weights are also heavily penalized, resulting in a group of weights that are more evenly distributed.
ElasticNet is a hybrid of Lasso and Ridge, where both the absolute value penalization and the squared penalization are included, regulated with another coefficient, l1_ratio:
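(One common parameterization of the referenced objective, with rho = l1_ratio; scikit-learn's exact scaling differs slightly:)

\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \rho \sum_{j} \lvert w_j \rvert + \alpha (1-\rho) \sum_{j} w_j^2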
As you can see in the equations above, the weight penalties are summed together in the loss function. Suppose we have a feature house_size in the 2000 range, while another feature num_bedrooms is in the range of 3; then we would expect the weight for house_size to be naturally smaller than the weight for num_bedrooms. In such a case, penalizing each feature's weight in the same way becomes inappropriate. Hence, it is important to scale or normalize the data before feeding it to the models. A quick note: the default setting in sklearn for these models sets 'normalize' to False. You will either want to turn 'normalize' on, or use StandardScaler to scale the data. Typically, using StandardScaler is good practice, because you may want to scale your test data using the same scale as the training data.
When to use which?
(1) sklearn's algorithm cheat sheet suggests trying Lasso, ElasticNet, or Ridge when the data set is relatively small (fewer than about 100k samples).
(2) Lasso and ElasticNet tend to give sparse weights (mostly zeros), because the l1 regularization cares equally about driving down big weights to small weights and driving small weights to zeros. If you have a lot of predictors (features), and you suspect that not all of them are that important, Lasso and ElasticNet may be a really good idea to start with.
(3) Ridge tends to give small but well-distributed weights, because the l2 regularization cares more about driving big weights to small weights than about driving small weights to zeros. If you only have a few predictors, and you are confident that all of them should be really relevant for predictions, try Ridge as a good regularized linear regression method.
(4) You will need to scale your data before using these regularized linear regression methods.
# https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/
11 Important Model Evaluation Metrics for Machine Learning Everyone Should Know
This article was originally published in February 2016 and updated in August 2019.
Introduction
The idea of building machine learning models works on a constructive feedback principle. You build a model, get feedback from metrics, make improvements and continue until you achieve a desirable accuracy.
I have seen plenty of analysts and aspiring data scientists not even bothering to check how robust their model is. Once they are finished building a model, they hurriedly map predicted values on unseen data. Simply building a predictive model is not your motive. It's about creating and selecting a model which gives high accuracy on out-of-sample data. Hence, it is crucial to check the accuracy of your model prior to computing predicted values.
In our industry, we consider different kinds of metrics to evaluate our models. The choice of metric completely depends on the type of model and the implementation plan of the model. After you are finished building your model, these 11 metrics will help you in evaluating your model's accuracy. Considering the rising popularity and importance of cross-validation, its principles are also covered in this article.
Table of Contents
1. Confusion Matrix
2. F1 Score
3. Gain and Lift Charts
4. Kolmogorov Smirnov Chart
5. AUC – ROC
6. Log Loss
7. Gini Coefficient
8. Concordant – Discordant Ratio
9. Root Mean Squared Error
10. Root Mean Squared Logarithmic Error
11. R-Squared/Adjusted R-Squared
12. Cross Validation (Not a metric though!)
When we talk about predictive models, we are talking either about a regression model (continuous output) or a classification model (discrete output). In classification problems, we use two types of algorithms (depending on the kind of output they create):
1. Class output: Algorithms like SVM and KNN create a class output. For instance, in
a binary classification problem, the outputs will be either 0 or 1. However, today
we have algorithms which can convert these class outputs to probability. But these
algorithms are not well accepted by the statistics community.
2. Probability output: Algorithms like Logistic Regression, Random Forest, Gradient
Boosting, Adaboost etc. give probability outputs. Converting probability outputs
to class output is just a matter of creating a threshold probability.
Illustrative Example
For a classification model evaluation metric discussion, I have used my predictions for
the problem BCI challenge on Kaggle. The solution of the problem is out of the scope of
our discussion here. However the final predictions on the training set have been used for
this article. The predictions made for this problem were probability outputs which have
1. Confusion Matrix
For the problem at hand, we have N=2, and hence we get a 2 x 2 matrix. Here are a few definitions to remember for a confusion matrix:
• Accuracy : the proportion of the total number of predictions that were correct.
• Positive Predictive Value or Precision : the proportion of positive cases that
were correctly identified.
• Negative Predictive Value : the proportion of negative cases that were correctly
identified.
• Sensitivity or Recall : the proportion of actual positive cases which are
correctly identified.
• Specificity : the proportion of actual negative cases which are correctly
identified.
The accuracy for the problem at hand comes out to be 88%. As you can see from the above two tables, the positive predictive value is high, but the negative predictive value is quite low. The same holds for sensitivity and specificity. This is primarily driven by the threshold value we have chosen. If we decrease our threshold value, the two pairs of starkly different numbers will come closer.
In general, we are concerned with one of the above-defined metrics. For instance, a pharmaceutical company will be more concerned with minimal wrong positive diagnoses.
Hence, they will be more concerned about high specificity. On the other hand, an attrition model will be more concerned with sensitivity. Confusion matrices are generally used only with class output models.
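A small sketch of computing these quantities, assuming scikit-learn and hypothetical binary label arrays y_true / y_pred:

    from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

    y_true = [0, 1, 1, 0, 1, 0, 1, 1]   # hypothetical actual labels
    y_pred = [0, 1, 0, 0, 1, 1, 1, 1]   # hypothetical class predictions
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(accuracy_score(y_true, y_pred))    # (tp + tn) / total
    print(precision_score(y_true, y_pred))   # tp / (tp + fp)
    print(recall_score(y_true, y_pred))      # sensitivity: tp / (tp + fn)
    print(tn / (tn + fp))                    # specificity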
2. F1 Score
In the last section, we discussed precision and recall for classification problems and
also highlighted the importance of choosing precision/recall basis our use case. What if
for a use case, we are trying to get the best precision and recall at the same time? F1-
Score is the harmonic mean of precision and recall values for a classification problem.
Now, an obvious question that comes to mind is why we are taking a harmonic mean and not an arithmetic mean. This is because the harmonic mean punishes extreme values more. Let us understand this with an example. We have a binary classification model with the following results:
Precision: 0, Recall: 1
Here, if we take the arithmetic mean, we get 0.5. It is clear that the above result comes from a dumb classifier which ignores the input and just predicts one of the classes as output. Now, if we were to take the harmonic mean, we would get 0, which is accurate, as this model is useless.
This seems simple. There are situations, however, for which a data scientist would like to give a percentage weightage to precision over recall, or vice-versa. Modifying the above expression a bit so that we can include an adjustable parameter beta for this purpose, we get:
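(The referenced expressions, in standard form:)

F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad
F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}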
F-beta measures the effectiveness of a model with respect to a user who attaches β times as much importance to recall as to precision.
3. Gain and Lift Charts
Gain and lift charts are mainly concerned with checking the rank ordering of the probabilities. Here are the steps to build them:
Step 1: Calculate the probability for each observation.
Step 2: Rank these probabilities in decreasing order.
Step 3: Build deciles with each group having almost 10% of the observations.
Step 4: Calculate the response rate at each decile for Good (responders), Bad (non-responders) and total.
You will get the following table, from which you need to plot the gain/lift charts:
This is a very informative table. The cumulative gain chart is the graph between cumulative %right and cumulative %population. For the case in hand, here is the graph:
This graph tells you how well your model segregates responders from non-responders. For example, the first decile, though it contains 10% of the population, has 14% of the responders.
What is the maximum lift we could have reached in the first decile? From the first table of this article, we know that the total number of responders is 3850. Also, the first decile contains 543 observations. Hence, the maximum lift at the first decile could have been 543/3850 ~ 14.1%. Hence, we are quite close to perfection with this model.
Let's now plot the lift curve. The lift curve is the plot between total lift and %population. Note that for a random model, this always stays flat at 100%. Here is the plot for the case in hand:
You can also plot decile-wise lift against decile number:
What does this graph tell you? It tells you that our model does well till the 7th decile; post that, every decile is skewed towards non-responders. Any model with lift above 100% till at least the 3rd decile and at most the 7th decile is a good model; otherwise, you might need to revisit the model.
Lift/gain charts are widely used in campaign targeting problems. They tell us up to which decile we can target customers for a specific campaign. They also tell you how much response you can expect from the new target base.
4. Kolmogorov-Smirnov Chart
The K-S or Kolmogorov-Smirnov chart measures the performance of classification models. More accurately, K-S is a measure of the degree of separation between the positive and negative distributions. The K-S is 100 if the scores partition the population into two separate groups in which one group contains all the positives and the other all the negatives.
On the other hand, if the model cannot differentiate between positives and negatives, then it is as if the model selects cases randomly from the population, and the K-S would be 0. In most classification models the K-S will fall between 0 and 100; the higher the value, the better the model is at separating the positive from the negative cases.
We can also plot the %Cumulative Good and Bad to see the maximum separation. Following
is a sample plot :
The metrics covered till here are mostly used in classification problems. Till here, we learnt about the confusion matrix, lift and gain charts and the Kolmogorov-Smirnov chart. Let's proceed and learn a few more important metrics.
5. Area Under the ROC Curve (AUC – ROC)
This is again one of the popular metrics used in the industry. The biggest advantage of using the ROC curve is that it is independent of the change in the proportion of responders. This will become clearer in the following sections.
Let's first try to understand what the ROC (Receiver Operating Characteristic) curve is. If we look at the confusion matrix below, we observe that for a probabilistic model, we get a different value for each metric at each threshold.
Hence, for each sensitivity, we get a different specificity. The two vary as follows:
The ROC curve is the plot between sensitivity and (1 - specificity). (1 - specificity) is also known as the false positive rate, and sensitivity is also known as the true positive rate. Let's take an example with threshold = 0.5 (refer to the confusion matrix). Here is the confusion matrix:
As you can see, the sensitivity at this threshold is 99.6% and the (1-specificity) is ~60%. This coordinate becomes one point on our ROC curve. To bring this curve down to a single number, we find the area under this curve (AUC).
Note that the area of the entire square is 1*1 = 1. Hence AUC itself is the ratio of the area under the curve to the total area. For the case in hand, we get AUC ROC as 96.4%. Following are a few thumb rules:
We see that we fall under the excellent band for the current model. But this might simply be a case of over-fitting, which makes in-time and out-of-time validations very important.
Points to Remember:
1. A model which gives class as output will be represented as a single point in the ROC plot.
2. Such models cannot be compared with each other, as the judgement needs to be taken on a single metric and not using multiple metrics. For instance, a model with parameters (0.2, 0.8) and a model with parameters (0.8, 0.2) could be coming out of the same underlying model, so these metrics should not be compared directly.
3. In the case of a probabilistic model, we were fortunate enough to get a single number, AUC-ROC. But we still need to look at the entire curve to make conclusive decisions. It is also possible that one model performs better in some regions and another performs better in other regions.
Advantages of using ROC
Why should you use ROC and not metrics like lift curve?
Lift is dependent on total response rate of the population. Hence, if the response rate
of the population changes, the same model will give a different lift chart. A solution to
this concern can be true lift chart (finding the ratio of lift and perfect model lift at
each decile). But such ratio rarely makes sense for the business.
The ROC curve, on the other hand, is almost independent of the response rate. This is because its two axes are derived from columnar calculations of the confusion matrix: the numerator and denominator of both the x and y axes change on a similar scale if the response rate shifts.
6. Log Loss
AUC ROC considers the predicted probabilities for determining our model's performance. However, there is an issue with AUC ROC: it only takes into account the order of the probabilities, and hence does not account for the model's capability to predict a higher probability for samples that are more likely to be positive. In that case, we could use log loss, which is nothing but the negative average of the log of the corrected predicted probabilities for each instance.
Let us calculate the log loss for a few random values to get the gist of this mathematical function:
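(The standard binary log loss, with y_i the label and p_i the predicted probability:)

\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]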
Logloss(1, 0.5) = 0.693
It's apparent from the gentle downward slope towards the right that the log loss declines gradually as the predicted probability improves. Moving in the opposite direction though, the log loss ramps up very rapidly as the predicted probability approaches 0.
So, the lower the log loss, the better the model. However, there is no absolute measure of a good log loss; it is use-case dependent.
Whereas the AUC is computed with regard to binary classification with a varying decision threshold, log loss actually takes the 'certainty' of classification into account.
7. Gini Coefficient
The Gini coefficient is sometimes used in classification problems. It can be derived straight away from the AUC ROC number. Gini is nothing but the ratio of the area between the ROC curve and the diagonal line to the area of the triangle above the diagonal. The formula is:
Gini = 2*AUC – 1
A Gini above 60% indicates a good model. For the case in hand, we get a Gini of 92.7%.
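A one-line sketch assuming scikit-learn, with hypothetical names y_true for the labels and y_prob for the predicted probabilities:

    from sklearn.metrics import roc_auc_score
    gini = 2 * roc_auc_score(y_true, y_prob) - 1  # Gini derived from AUC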
8. Concordant – Discordant Ratio
This is again one of the most important metrics for any classification prediction problem. To understand this, let's assume we have 3 students who have some likelihood to pass this year:
A – 0.9
B – 0.5
C – 0.3
Now picture this: if we were to fetch pairs of two from these three students, how many pairs would we have? We would have 3 pairs: AB, BC, CA. Now, after the year ends, we see that A and C passed this year while B failed. We now choose all the pairs where one is a responder and the other a non-responder. How many such pairs do we have?
We have two pairs: AB and BC. For each of the 2 pairs, the concordant pair is the one where the probability of the responder was higher than that of the non-responder, whereas the discordant pair is where the vice-versa holds true. In case both probabilities are equal, we say it's a tie.
AB – Concordant
BC – Discordant
Hence, we have 50% concordant cases in this example. A concordant ratio of more than 60% is considered to be a good model. This metric is generally not used when deciding how many customers to target; it is primarily used to assess the model's predictive power. Decisions like how many to target are again taken using the K-S / lift charts.
9. Root Mean Squared Error (RMSE)
RMSE is the most popular evaluation metric used in regression problems. It follows an assumption that the errors are unbiased and follow a normal distribution. Here are the key points to consider on RMSE:
1. The power of 'square root' empowers this metric to show large deviations.
2. The 'squared' nature of this metric helps to deliver more robust results, preventing positive and negative error values from cancelling out. In other words, this metric aptly displays the plausible magnitude of the error term.
3. It avoids the use of absolute error values, which is highly undesirable in mathematical calculations.
4. When we have more samples, reconstructing the error distribution using RMSE is considered to be more reliable.
5. RMSE is highly affected by outlier values. Hence, make sure you've removed outliers from your data set prior to using this metric.
6. As compared to mean absolute error, RMSE gives higher weightage to, and punishes, large errors.
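(For reference, the standard formula, with y_i the actual and \hat{y}_i the predicted value:)

\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}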
10. Root Mean Squared Logarithmic Error (RMSLE)
In the case of root mean squared logarithmic error, we take the log of the predictions and the actual values. So basically, what changes is the variance that we are measuring. RMSLE is usually used when we don't want to penalize huge differences between the predicted and the actual values when both are huge numbers.
1. If both predicted and actual values are small: RMSE and RMSLE are the same.
2. If either the predicted or the actual value is big: RMSE > RMSLE.
3. If both predicted and actual values are big: RMSE > RMSLE (RMSLE becomes almost negligible).
11. R-Squared/Adjusted R-Squared
We learned that when the RMSE decreases, the model's performance improves. But these values alone are not intuitive.
In the case of a classification problem, if the model has an accuracy of 0.8, we can gauge how good it is against a random model, which has an accuracy of 0.5; the random model can be treated as a benchmark. But when we talk about the RMSE metric, we do not have a benchmark to compare against.
This is where we can use the R-Squared metric. The formula for R-Squared is as follows:
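(In standard form, using the two quantities defined just below:)

R^2 = 1 - \frac{\mathrm{MSE}(\text{model})}{\mathrm{MSE}(\text{baseline})}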
MSE(model): Mean Squared Error of the predictions against the actual values
MSE(baseline): Mean Squared Error of mean prediction against the actual values
In other words, it shows how good our regression model is as compared to a very simple model that just predicts the mean value of the target from the train set as its predictions.
Adjusted R-Squared
A model performing equal to the baseline would give an R-Squared of 0. The better the model, the higher the R² value. The best model, with all correct predictions, would give an R-Squared of 1. However, on adding new features to the model, the R-Squared value either increases or remains the same. R-Squared does not penalize adding features that add no value to the model. So an improved version of R-Squared is the adjusted R-Squared. The formula for adjusted R-Squared uses:
k: number of features
n: number of samples
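(The referenced formula, in standard form:)

R^2_{\text{adjusted}} = 1 - \frac{(1 - R^2)(n - 1)}{n - (k + 1)}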
As you can see, this metric takes the number of features into account. When we add more features, the term n - (k + 1) in the denominator decreases, so the whole subtracted fraction increases.
If R-Squared does not increase, that means the feature added isn't valuable to our model. So overall, we subtract a greater value from 1, and the adjusted R², in turn, decreases.
12. Cross Validation
Beyond these 11 metrics, there is another method to check model performance. These methods are statistically prominent in data science. But, with the arrival of machine learning, we are now blessed with more robust methods of model selection. Yes! I'm talking about cross validation.
Though cross validation isn't really an evaluation metric that is used openly to communicate model accuracy, the result of cross validation provides a good enough intuitive result to generalize the performance of a model.
Let's first understand the importance of cross validation. Due to busy schedules, these days I don't get much time to participate in data science competitions. A long time back, I participated in the TFI competition on Kaggle. I would like to show you the dissimilarity between my public and private leaderboard scores.
For the TFI competition, the following were three of my solutions and scores (the lesser the better):
You will notice that the third entry, which has the worst public score, turned out to be the best model on the private ranking. There were more than 20 models above it on the public leaderboard, yet it was chosen as the final entry (which really worked out well). What caused this phenomenon? The dissimilarity between my public and private leaderboard scores is caused by over-fitting.
Over-fitting is nothing but when your model becomes so complex that it starts capturing noise as well. This 'noise' adds no value to the model, only inaccuracy.
In the following section, I will discuss how you can know whether a solution is over-fit or not before we actually know the test results.
Cross validation is one of the most important concepts in any type of data modelling. It simply says: try to leave a sample of data on which you do not train the model, and test the model on this sample before finalizing it.
The above diagram shows how to validate a model with an in-time sample. We simply divide the population into 2 samples and build the model on one sample. The rest of the population is used for in-time validation.
A negative side of this approach is that we lose a good amount of data from training the model. Hence, the model has very high bias, and this won't give the best estimate of the coefficients.
What if we make a 50:50 split of the training population, train on the first 50% and validate on the remaining 50%? Then we train on the other 50% and test on the first 50%. This way we train the model on the entire population, though on only 50% at one go. This reduces the bias due to sample selection to some extent, but gives a smaller sample to train the model on. This approach is known as 2-fold cross validation.
Let's extrapolate the last example from 2-fold to k-fold cross validation. Now, we will try to visualize how k-fold validation works, for k = 7.
Here's what goes on behind the scenes: we divide the entire population into 7 equal samples. Now we train models on 6 samples (green boxes) and validate on 1 sample (grey box). Then, at the second iteration, we train the model with a different sample held out as validation. In 7 iterations, we have basically built a model on each sample and held each of them out as validation. This is a way to reduce the selection bias and reduce the variance in prediction power. Once we have all 7 models, we average the error terms to find the best model.
k-fold cross validation is widely used to check whether a model is an overfit or not: the performance metrics at each of the k rounds of modelling should be close to each other, and the mean of the metric should be high. In a Kaggle competition, you might rely more on the cross validation score than on the Kaggle public score. This way you can be sure that the public score is not just by chance.
How do we implement k-fold with any model?
Coding k-fold in R and Python is very similar. Below is a sketch of how you code k-fold in Python.
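A minimal sketch assuming scikit-learn (the article's original snippet did not survive extraction; the model and data names here are placeholders):

    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.linear_model import LinearRegression

    kf = KFold(n_splits=7, shuffle=True, random_state=42)  # 7-fold, as in the example above
    model = LinearRegression()
    scores = cross_val_score(model, X, y, cv=kf)  # one score per fold
    print(scores, scores.mean())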
For a small k, we have a higher selection bias but low variance in the performances. For a large k, we have a smaller selection bias but high variance in the performances.
k = 2: We have only 2 samples, similar to our 50-50 example. Here we build the model on only 50% of the population each time. But as the validation set is a significant portion of the population, the variance of the validation performance is minimal.
k = number of observations (n): This is also known as 'leave one out'. We have n samples, and modelling is repeated n times, leaving only one observation out for cross validation each time. Hence, the selection bias is minimal but the variance of the validation performance is very large.
End Notes
Measuring the performance on the training sample is pointless, and leaving an in-time validation batch aside is a waste of data. K-fold gives us a way to use every single data point, which can reduce this selection bias to a good extent.
In addition, the metrics covered in this article are some of the most used metrics of evaluation in classification and regression problems.
Capstone Project - Jupyter Notebook
In [3]: data.head()
Out[3]:
[Output: first five rows - CustID, AgentBonus, Age, CustTenure, Channel, Occupation, EducationField, Gender, ExistingProdType, Designation, NumberOfPolicy, MaritalStatus, ...]
In [4]: data.info()
<class 'pandas.core.frame.DataFrame'>
In [5]: data.describe().T
Out[5]:
count mean std min 25% 50% 75% max
In [6]: data.isnull().sum()
# presence of null values
Out[6]: CustID 0
AgentBonus 0
Age 269
CustTenure 226
Channel 0
Occupation 0
EducationField 0
Gender 0
ExistingProdType 0
Designation 0
NumberOfPolicy 45
MaritalStatus 0
MonthlyIncome 236
Complaint 0
ExistingPolicyTenure 184
SumAssured 154
Zone 0
PaymentMethod 0
LastMonthCalls 0
CustCareScore 52
dtype: int64
In [8]: data.duplicated().sum()
# no duplicate values
Out[8]: 0
In [10]: data.columns
'LastMonthCalls', 'CustCareScore'],
dtype='object')
In [7]: df=data.copy()
[Output: unique values of each categorical column - Channel, Occupation, EducationField, Gender, Designation, MaritalStatus, Zone, PaymentMethod]
In [15]: df['Designation'].mask(df['Designation']=='Exe','Executive',inplace=True)
In [16]: df['Designation'].unique()
dtype=object)
[Output: unique values of each categorical column after cleaning, e.g. Gender: ['Female' 'Male']]
In [26]: df.head(5)
Out[26]:
[Output: first five rows after cleaning - Designation 'Exe' replaced with 'Executive', EducationField 'UG' with 'Under Graduate', Gender 'Fe male' with 'Female']
In [27]: df.to_excel('file1.xlsx')
Univariate Analysis
In [77]: # for continuous variables (loop header and imports reconstructed; the cell body matches the extract)
import matplotlib.pyplot as plt
import seaborn as sns
for column in df.select_dtypes(include='number').columns:
    plt.figure()
    print("Distribution of " + column)
    print("----------------------------------------------------------------------------")
    sns.distplot(df[column], kde=True, color='g')
    plt.show()
    plt.figure()
    print("BoxPlot of " + column)
    print("----------------------------------------------------------------------------")
    ax = sns.boxplot(x=df[column])
    plt.show()
[Output: distribution and box plots for each continuous variable]
In [83]: plt.figure(figsize=(18,7))
sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True,cmap="YlGnBu")
plt.show()
In [84]: sns.pairplot(df)
plt.show()
In [87]: df.columns
'LastMonthCalls', 'CustCareScore'],
dtype='object')
Bivariate Analysis
In [105]: df.info()
<class 'pandas.core.frame.DataFrame'>
In [28]: df.drop(['CustID'],inplace=True,axis=1)
# removal of CustId
In [29]: df.head()
Out[29]:
[Output: first five rows after dropping CustID]
In [147]: df['Age'].values.reshape(-1,1)
Out[147]: array([[22.],
[11.],
[26.],
...,
[23.],
[10.],
[14.]])
In [148]: df.columns
'LastMonthCalls', 'CustCareScore'],
dtype='object')
In [32]: sform(df[['Age','CustTenure','NumberOfPolicy','MonthlyIncome','ExistingPolicyTenure','SumAssured','CustCareScore']]) # sform: user-defined helper (definition not shown in the extract); null counts are zero afterwards
In [33]: df.isnull().sum()
Out[33]: AgentBonus 0
Age 0
CustTenure 0
Channel 0
Occupation 0
EducationField 0
Gender 0
ExistingProdType 0
Designation 0
NumberOfPolicy 0
MaritalStatus 0
MonthlyIncome 0
Complaint 0
ExistingPolicyTenure 0
SumAssured 0
Zone 0
PaymentMethod 0
LastMonthCalls 0
CustCareScore 0
dtype: int64
In [346]: np.mean(data)
AgentBonus 4.077838e+03
Age 1.449471e+01
CustTenure 1.446903e+01
ExistingProdType 3.688938e+00
NumberOfPolicy 3.565363e+00
MonthlyIncome 2.289031e+04
Complaint 2.871681e-01
ExistingPolicyTenure 4.130074e+00
SumAssured 6.199997e+05
LastMonthCalls 4.626991e+00
CustCareScore 3.067592e+00
Clus_kmeans 3.546460e-01
dtype: float64
In [160]: np.mean(df)
Age 14.228761
CustTenure 14.276106
ExistingProdType 3.688938
NumberOfPolicy 3.566704
MonthlyIncome 22586.653982
Complaint 0.287168
ExistingPolicyTenure 4.195354
SumAssured 621905.446571
LastMonthCalls 4.626991
CustCareScore 3.068031
dtype: float64
In [162]: np.std(df)
Age 8.839135
CustTenure 8.823909
ExistingProdType 1.015657
NumberOfPolicy 1.451436
MonthlyIncome 4928.700848
Complaint 0.452441
ExistingPolicyTenure 3.359045
SumAssured 244199.496965
LastMonthCalls 3.619732
CustCareScore 1.379033
dtype: float64
In [167]: st.mode(df)
[Output: mode of each column]
In [35]: df.columns
'LastMonthCalls', 'CustCareScore'],
dtype='object')
AgentBonus
Age
CustTenure
ExistingProdType
MonthlyIncome
ExistingPolicyTenure
SumAssured
In [183]: for i in x:
print(i)
sns.boxplot(df[i])
plt.show();
ExistingProdType
Variable tranformation
feature: MaritalStatus
[2 0 3 1]
feature: Zone
[1 3 0 2]
feature: PaymentMethod
[0 3 2 1]
In [38]: df['Gender'].value_counts()
Female 1832
In [41]: df.head()
Out[41]:
[Output: first five rows with encoded categorical columns]
Value counts for each categorical column, after cleaning and (where the raw data differed) before cleaning:
Channel: Agent 3194, Online 468
Occupation: Salaried 2192, Free Lancer 2
EducationField (after): Graduate 1870, Diploma 496, Engineer 408, MBA 74
EducationField (before): Graduate 1870, Diploma 496, Engineer 408, UG 230, MBA 74
Gender (after): Male 2688, Female 1832
Gender (before): Male 2688, Female 1507, Fe male 325
Designation (after): Executive 1662, Manager 1620, AVP 336, VP 226
Designation (before): Manager 1620, Executive 1535, AVP 336, VP 226, Exe 127
MaritalStatus: Married 2268, Single 1254, Divorced 804, Unmarried 194
Zone: West 2566, North 1884, East 64, South 6
PaymentMethod: Yearly 1434, Monthly 354, Quarterly 76
In [42]: df.to_excel('encode+file.xlsx')
Clustering
In [230]: data.info()
<class 'pandas.core.frame.DataFrame'>
In [250]: data.columns
'LastMonthCalls', 'CustCareScore'],
dtype='object')
In [256]: col=df1.columns
Out[301]:
[Output: preview of the dataframe selected for clustering]
In [293]: X = StandardScaler()
In [303]: scaled_df
0.10304876, -0.7744781 ],
0.6555759 , -0.04933237],
-1.2782691 , -0.04933237],
...,
-0.17321481, -1.49962384],
-1.00200553, 1.4009591 ],
-1.00200553, -0.04933237]])
In [309]: wss
Out[309]: [49720.000000000015,
40138.77380529607,
36991.53521799116,
34874.24407267552,
33247.949982908,
31900.841120890425,
30825.575457900406,
29884.850552160417,
29104.006560150017,
28466.636445111657]
In [311]: # from the plot we can take 2 clusters for further processing
In [323]: data.head(5)
Out[323]:
[Output: first five rows of the original data, 5 rows x 21 columns]
In [325]: silhouette_score(scaled_df,labels)
Out[325]: 0.19883521509219967
In [327]: silhouette_score(scaled_df,labels)
Out[327]: 0.13229821283027335
In [350]: silhouette_score(scaled_df,labels)
Out[350]: 0.11344412761391207
Out[357]:
[Output: data with the K-means cluster label column Clus_kmeans appended]
In [352]: data.columns
dtype='object')
In [341]: filtered_label0
Out[341]:
[Output: rows belonging to cluster 0]
filtered_label8 = df[label == 8]
In [4]: data.head()
Out[4]:
[Output: first five rows, 5 rows x 21 columns]
In [7]: data.columns
dtype='object')
Cap_LinearModel-Copy1 - Jupyter Notebook
In [11]: data.drop('Age',axis=1,inplace=True)
In [14]: data.head()
Out[14]:
AgentBonus CustTenure Channel Occupation EducationField Gender ExistingProdType Designation NumberOfPolicy MaritalStatus Mont
In [15]: df=data.copy()
Scaling of Data
In [15]: from sklearn.preprocessing import StandardScaler
from pandas import DataFrame
In [18]: df=scaler.fit_transform(data)
In [21]: df=DataFrame(df)
In [22]: data.columns
'LastMonthCalls', 'CustCareScore'],
dtype='object')
In [25]: df.head()
Out[25]: AgentBonus Age CustTenure Channel Occupation EducationField Gender ExistingProdType Designation NumberOfPolicy Marit
0 0.254928 0.935957 -1.197848 -0.609065 -0.523319 -0.437870 -1.211301 -0.742887 0.274700 -1.079416 1
1 -1.361260 -0.367219 -1.433348 1.911973 -0.523319 -0.437870 0.825559 0.325131 0.274700 0.298529 -1
2 0.154790 1.409839 -1.197848 -0.609065 -3.664625 0.713482 0.825559 0.325131 -0.754855 -0.390443 2
3 -1.672717 -0.367219 -0.432472 1.911973 -0.523319 -0.437870 -1.211301 -0.742887 -0.754855 -0.390443 -1
4 -0.815659 -0.959572 -1.021223 -0.609065 1.047333 1.289158 0.825559 -0.742887 -0.754855 0.298529 -1
Train-Test Split
In [16]: #Copy all the predictor variables into X dataframe
X = df.drop('AgentBonus', axis=1)
# Copy target into the y dataframe.
y = df['AgentBonus']
In [17]: X.head()
Out[17]:
CustTenure Channel Occupation EducationField Gender ExistingProdType Designation NumberOfPolicy MaritalStatus MonthlyIncome C
In [35]: y.head()
Out[35]: 0 0.254928
1 -1.361260
2 0.154790
3 -1.672717
4 -0.815659
In [18]: # Split X and y into training and test set in 70:30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30 , random_state=1)
In [20]: regression_model=LinearRegression()
regression_model.fit(X_train,y_train)
Out[20]: LinearRegression()
R square Score
R-squared explains to what extent the variance of one variable explains the variance of the second variable. In other words, it measures the
proportion of variance of the dependent variable explained by the independent variable.
R squared is a popular metric for identifying model accuracy. It tells how close are the data points to the fitted line generated by a regression
algorithm. A larger R squared value indicates a better fit. This helps us to find the relationship between the independent variable towards the
dependent variable.
R² score ranges from 0 to 1. The closer R² is to 1, the better the regression model. If R² is equal to 0, the model is not performing better than a random model. If R² is negative, the regression model is erroneous.
Out[23]: 0.7587443916585249
Out[24]: 0.7561716416565496
Predicted Score
In [25]: y_pred_train = regression_model.predict(X_train)
In [74]: y_pred_train
0.20529222, 0.08440873])
In [76]: y_train
3681 0.799057
1309 -0.720675
4254 -1.085146
1335 0.060543
...
2895 -0.599185
2763 -0.983536
905 -0.151512
3980 0.651796
235 -0.976909
Max Error
#The max_error() function computes the maximum residual error: a metric that captures the worst-case error between the predicted value and the true value.
0.041019251118525446
0.04555936215559531
Adjusted R Square
Adjusted R² is the same as standard R² except that it penalizes models when additional features are added.
To counter the problem which is faced by R-square, Adjusted r-square penalizes adding more independent variables which don’t increase the
explanatory power of the regression model.
The value of adjusted r-square is always less than or equal to the value of r-square.
In [64]: y_pred
-1.07738633, 0.70101843])
In [69]: y_test
1519 1.837246
1620 -1.201482
2031 0.315305
494 -1.212527
...
2124 0.819673
3220 -1.239770
1851 -1.356106
1065 -1.160985
462 0.367582
0.7561716416565496
0.7587443916585249
Test Data
In [32]: from sklearn.metrics import mean_absolute_error,mean_squared_error
mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_test,y_pred=y_pred) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)
In [33]: print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.00696660182270242
MSE: 7.658935210157367e-05
RMSE: 0.008751534271290587
Train Data
mae = mean_absolute_error(y_true=y_train, y_pred=y_pred_train)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_train, y_pred=y_pred_train) #default=True
rmse = mean_squared_error(y_true=y_train, y_pred=y_pred_train, squared=False)
In [34]: print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.00696660182270242
MSE: 7.658935210157367e-05
RMSE: 0.008751534271290587
RMSE
In [35]: # RMSE on training data
import numpy as np
from sklearn import metrics
predicted_train = regression_model.fit(X_train, y_train).predict(X_train)  # refitting here is redundant; the model is already fitted
np.sqrt(metrics.mean_squared_error(y_train, predicted_train))
Out[35]: 0.008733829934443486
Out[36]: 0.008751534271290587
In [95]: # https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/
Lasso Regression
Lasso regression is a regularization technique. It is used over regression methods for a more accurate prediction. This model uses shrinkage.
Shrinkage is where data values are shrunk towards a central point as the mean. The lasso procedure encourages simple, sparse models (i.e.
models with fewer parameters)
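The construction of lassoReg is not visible in this export. A plausible sketch; the alpha value here is an assumption, not necessarily the one actually used:

from sklearn.linear_model import Lasso

lassoReg = Lasso(alpha=0.3)  # alpha (regularization strength) assumed for illustration

A larger alpha shrinks more coefficients exactly to zero, which matters for the coefficient table below.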
In [134]: lassoReg.fit(X_train,y_train)
     Feature                Coefficient Estimate
0    Age                     0.0
1    CustTenure              0.0
2    Channel                -0.0
3    Occupation              0.0
4    EducationField          0.0
5    Gender                 -0.0
6    ExistingProdType        0.0
7    Designation             0.0
8    NumberOfPolicy          0.0
9    MaritalStatus          -0.0
10   MonthlyIncome           0.0
11   Complaint               0.0
12   ExistingPolicyTenure    0.0
13   SumAssured              0.0
14   Zone                    0.0
15   PaymentMethod          -0.0
16   LastMonthCalls          0.0
17   CustCareScore           0.0
Every coefficient has been shrunk exactly to zero, which indicates that the chosen alpha penalizes too heavily: the model retains no predictors, consistent with the poor error metrics below.
In [115]: print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.8050079397017279
MSE: 1.0005051970719936
RMSE: 1.0002525666410427
0.0
In [138]: print(max_error(y_train,pred_train))
2.6210621129222855
In [ ]: # print(lasso.score(x_cv, y_cv))  # left unexecuted: lasso, x_cv and y_cv are not defined in this notebook
In [140]: # print(lassoReg.score(y_train,y_test))
In [141]: # https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
# https://www.datacourses.com/evaluation-of-regression-models-in-scikit-learn-846/
# https://medium.com/analytics-vidh
# https://www.geeksforgeeks.org/sklearn-metrics-max_error-function-in-python/
# https://www.geeksforgeeks.org/python-linear-regression-using-sklearn/
# https://medium.com/analytics-vidhya/hyperparameter-tuning-in-linear-regression-e0e0f1f968a1
# https://neptune.ai/blog/fighting-overfitting-with-l1-or-l2-regularization
KNN Regression
In [3]: data.head()
Out[3]: 5 rows × 21 columns (Unnamed: 0, AgentBonus, Age, CustTenure, Channel, Occupation, EducationField, Gender, ExistingProdType, Designation, ..., MaritalStatus, ...; cell values were not captured in this export)
In [6]: data.drop('Age',axis=1,inplace=True)
In [7]: data.head()
Out[7]: (only the trailing column headers survive in this export: ...y, MaritalStatus, MonthlyIncome, Complaint, ExistingPolicyTenure, SumAssured, Zone, PaymentMethod, LastMonthCalls, CustCareScore, age_bins)
In [8]: df=data.copy()
In [97]: data['Age'].unique()
Out[97]: array([..., 23. , 35.5, 33. , 19. , 17. , 25. , 21. , 24. , 29. , 31. , 28. , 3. , 14.5, 13.5, 11.5, 8.5, 7.5, 35. , 32. , 18.5, 34. , 10.5, ...])
# The KNN algorithm uses 'feature similarity' to predict the value of a new data point: the point is assigned a value based on how
# closely it resembles the points in the training set.
In [103]: df['age_bins']=pd.cut(x=df['Age'],bins=[0,10,20,30,40,50,60,70])
In [105]: df[['age_bins','Age']]
In [107]: df['age_bins'].unique()
Categories (4, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 40]]
In [109]: df['age_bins'].dtype
Out[109]: CategoricalDtype(categories=[(0, 10], (10, 20], (20, 30], (30, 40], (40, 50], (50, 60], (60, 70]],
ordered=True)
In [11]: # Split X and y into training and test set in 70:30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30 , random_state=1)
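The cell that builds the KNN model is not shown in this export; a minimal sketch consistent with the Out[14] repr below:

from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=15)  # predicts by averaging the 15 nearest training points
model.fit(X_train, y_train)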
Out[14]: KNeighborsRegressor(n_neighbors=15)
In [15]: pred_test=model.predict(X_test)
In [16]: pred_train=model.predict(X_train)
Max Error
0.05042173810624884
0.04598085821358955
Adjusted R Square
In [19]: from sklearn.metrics import r2_score
0.4554858797489557
0.5310027559414495
Test Data
In [22]: from sklearn.metrics import mean_absolute_error,mean_squared_error
mae = mean_absolute_error(y_true=y_test,y_pred=pred_test)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_test,y_pred=pred_test) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=pred_test,squared=False)
In [23]: print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.010462744190562126
MSE: 0.0001710382826817982
RMSE: 0.013078160523628627
Train Data
In [24]: mae = mean_absolute_error(y_true=y_train,y_pred=pred_train)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_train,y_pred=pred_train) #default=True
rmse = mean_squared_error(y_true=y_train,y_pred=pred_train,squared=False)
In [25]: print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.009735566430888824
MSE: 0.00014828674591305246
RMSE: 0.01217730454218225
R square Score
In [26]: # R square Score on training data
model.score(X_train, y_train)
Out[26]: 0.5310027559414495
Out[27]: 0.4554858797489557
Hyperparameter Tuning
In [30]: ar = []
for K in range(100):
    K = K + 1
    ar.append(K)   # candidate K values: 1 to 100
[1, 2, 3, ..., 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, ..., 100]
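The best_params_ call in In [32] implies a fitted search object; a sketch of how a grid search over these candidate K values might have been set up (the use of GridSearchCV and the cross-validation settings are assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

param_grid = {'n_neighbors': ar}  # the candidate K values built above
model = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5)  # number of folds assumed
model.fit(X_train, y_train)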
In [32]: model.best_params_
Predicted Value
In [33]: pret_test=model.predict(X_test)
In [34]: pret_train=model.predict(X_train)
Max Error
In [35]: # Test Data
print(max_error(y_test,pret_test))
0.051673641607796195
0.04446582026307788
Adjusted R Square
In [37]: from sklearn.metrics import r2_score
0.44941129374205124
0.5405965659499341
Test Data
In [40]: from sklearn.metrics import mean_absolute_error,mean_squared_error
mae = mean_absolute_error(y_true=y_test,y_pred=pret_test)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_test,y_pred=pret_test) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=pret_test,squared=False)
In [41]: print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.010552156604526413
MSE: 0.00017294638151704015
RMSE: 0.013150908011123801
Train Data
In [42]: mae = mean_absolute_error(y_true=y_train,y_pred=pret_train)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_train,y_pred=pret_train) #default=True
rmse = mean_squared_error(y_true=y_train,y_pred=pret_train,squared=False)
In [43]: print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.009617269338237798
MSE: 0.0001452533914848788
RMSE: 0.0120521114948742
R square Score
In [44]: # R square Score on training data
model.score(X_train, y_train)
Out[44]: 0.5405965659499341
Out[45]: 0.44941129374205124
Random Forest Regression
In [3]: data.head()
Out[3]: 5 rows × 21 columns (Unnamed: 0, AgentBonus, Age, CustTenure, Channel, Occupation, EducationField, Gender, ExistingProdType, Designation, ..., MaritalStatus, ...; cell values were not captured in this export)
In [6]: data.drop('Age',axis=1,inplace=True)
In [7]: data.head()
Out[7]: (only the column headers survive in this export: AgentBonus, CustTenure, Channel, Occupation, EducationField, Gender, ExistingProdType, Designation, NumberOfPolicy, MaritalStatus, Mont...)
In [8]: df=data.copy()
Data Split
In [9]: #Copy all the predictor variables into X dataframe
X = df.drop('AgentBonus', axis=1)
# Copy target into the y dataframe.
y = df['AgentBonus']
In [10]: # Split X and y into training and test set in 70:30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30 , random_state=1)
Random Forest
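The cell constructing rf is not captured in this export; a plausible sketch consistent with the repr fragment below:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=18)  # other hyperparameters assumed left at their defaults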
In [13]: rf.fit(X_train,y_train)
Out[13]: RandomForestRegressor(random_state=18)
In [14]: pred_train=rf.predict(X_train)
In [15]: pred_train
Out[15]: array([..., 0.55291051])
In [16]: pred_test=rf.predict(X_test)
In [17]: pred_test
Out[17]: array([..., 0.56435946])
Max Error
In [18]: # Test Data
print(max_error(y_test,pred_test))
0.04070131067993954
0.042071604020055386
R square Score
In [20]: # R square Score on training data
rf.score(X_train, y_train)
Out[20]: 0.7474736496532233
Out[21]: 0.7307608642872265
Adjusted R Square
In [22]: from sklearn.metrics import r2_score
0.7307608642872265
0.7474736496532233
Test Data
mae = mean_absolute_error(y_true=y_test,y_pred=pred_test)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_test,y_pred=pred_test) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=pred_test,squared=False)
In [26]: print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.0073953875572963165
MSE: 8.457117582518018e-05
RMSE: 0.009196258795030736
Train Data
In [27]: mae = mean_absolute_error(y_true=y_train,y_pred=pred_train)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_train,y_pred=pred_train) #default=True
rmse = mean_squared_error(y_true=y_train,y_pred=pred_train,squared=False)
In [28]: print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.007100929904297198
MSE: 7.984334924055136e-05
RMSE: 0.008935510575258213
{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
 'max_features': ['auto', 'sqrt'],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'min_samples_split': [2, 5, 10],
 'min_samples_leaf': [1, 2, 4],
 'bootstrap': [True, False]}
RandomizedSearchCV(..., n_jobs=-1, ..., random_state=42, verbose=2)
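Only fragments of the search setup survive in this export; a sketch consistent with the printed grid and repr (the grid's variable name, n_iter and cv are assumptions):

from sklearn.model_selection import RandomizedSearchCV

rf_random = RandomizedSearchCV(estimator=rf,
                               param_distributions=random_grid,  # the grid printed above
                               n_iter=100,  # assumed
                               cv=3,        # assumed
                               n_jobs=-1, random_state=42, verbose=2)
rf_random.fit(X_train, y_train)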
In [34]: rf_random.best_params_
Out[34]: {...,
 'min_samples_split': 5,
'min_samples_leaf': 1,
'max_features': 'sqrt',
'max_depth': 60,
'bootstrap': False}
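base_model is fitted below without its construction being shown. Given the very high train R² reported later, it is presumably a forest rebuilt from the tuned parameters; a sketch, with the random_state an assumption:

from sklearn.ensemble import RandomForestRegressor

base_model = RandomForestRegressor(**rf_random.best_params_, random_state=18)  # settings taken from the search above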
In [37]: base_model
In [38]: base_model.fit(X_train,y_train)
In [39]: pred_btrain=base_model.predict(X_train)
In [40]: pred_btrain
Out[40]: array([..., 0.53903398])
In [41]: pred_btest=base_model.predict(X_test)
In [52]: pred_btest
Out[52]: array([..., 0.56616343])
R square Score
In [54]: # R square Score on training data
base_model.score(X_train, y_train)
Out[54]: 0.9918822168009825
Out[55]: 0.8128773312202485
Max Error
In [43]: # Test Data
print(max_error(y_test,pred_btest))
0.03303242857956701
0.008219513331729433
Adjusted R Square
In [45]: # test Data
print(r2_score(y_test, pred_btest))
0.8128773312202485
0.9918822168009825
Test Data
mae = mean_absolute_error(y_true=y_test,y_pred=pred_btest)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_test,y_pred=pred_btest) #default=True
rmse = mean_squared_error(y_true=y_test,y_pred=pred_btest,squared=False)
In [48]: print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.006028757345944345
MSE: 5.8777428772954286e-05
RMSE: 0.0076666439054487385
Train Data
In [49]: mae = mean_absolute_error(y_true=y_train,y_pred=pred_btrain)
#squared True returns MSE value, False returns RMSE value.
mse = mean_squared_error(y_true=y_train,y_pred=pred_btrain) #default=True
rmse = mean_squared_error(y_true=y_train,y_pred=pred_btrain,squared=False)
In [50]: print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)
MAE: 0.0012101458086502667
MSE: 2.566666797853658e-06
RMSE: 0.0016020820197023803
In [64]: df.columns
Out[64]: Index([..., 'CustCareScore', 'age_bins'], dtype='object')
In [65]: df.info()
<class 'pandas.core.frame.DataFrame'>