Capstone Grp6 PREDICTING INSURANCE RENEWAL PROPENSITY v3

“PREDICTING INSURANCE RENEWAL PROPENSITY”
Submitted towards partial fulfillment of the criteria for the Award of
Post Graduate Program in Business Analytics Business Intelligence by Great Lakes Institute of Management
Capstone Project Report
Submitted by
AVK Subrahmanyam
Sweta Sahay(S.Id- ZHPTB02QVK)
Shivangi Gupta(S.Id-IHMYYY212M)
Gaurav Chakraborty(S.Id- QWYZRFY1O6)
Batch- PGPDSA- BAHYDSepGrp6
Under the guidance of
(Richa Agarwal)
Great Lakes Institute of Management
June 2020
ABSTRACT
The objective of this project is to predict the probability that a customer will default on the premium payment, so that the
insurance agent can proactively reach out to the policy holder to follow up on the payment of premium. Premiums paid by
customers are the major revenue source for insurance companies. Defaults in premium payments result in significant
revenue losses, and hence insurance companies prefer to know upfront which types of customers are likely to default on
premium payments.
Techniques, Tools, Domain
Techniques: Machine Learning using KNN, Logistic Regression, Ensemble Methods, XGBoost
Tools: R, Tableau, Python, MS Excel
Domain: Financial Analytics (Insurance Sector)
CERTIFICATE
This is to certify that the participants AVK Subrahmanyam, Sweta Sahay, Shivangi Gupta, and Gaurav Chakraborty
who are the students of Great Lakes Institute of Management, have successfully completed their project on “Predicting
Insurance Renewal Propensity”.
This project is the record of authentic work carried out by them during the academic year 2019- 2020.
_______________
Richa Agarwal
(Mentor)
Date: 21st Jun 2020

Place: Hyderabad
ACKNOWLEDGEMENT
We would like to convey our sincere gratitude to our mentor, Richa Agarwal. Without her timely guidance and able
mentorship, this project would not have been possible. Her deep understanding of the use case and business intellect
helped in charting the right approach and deploying the appropriate models for data analytics.
We would also like to thank the Great Lakes Institute of Management for giving us the opportunity to work on a project
that showcases its students’ capabilities.
We would also like to take this opportunity to thank all our faculty members, who have worked diligently to ensure that the concepts of
data science, business analytics and the relevance of business context are well embedded in the analytical solution that we built.
Table of Contents
1. Project Background(Introduction, Need of Study)
2. Project Objective
3. Exploratory Data Analysis
4. Research Approach
5. Inferences from the ML Models
6. Recommendation and Conclusion
7. Bibliography
8. Annexures
List of Tables and Figures
The dataset contains the following information about 79854 policy holders.
Variable Type & Description
Summary Statistics
At a glance, excluding id and renewal, we have 4 floating point features, 1 integer variable, 4 ordinal integer features and
2 categorical text features. For the numeric variables, the summary statistics are as follows:
Variable                    Min     1st Qu   Median   Mean      3rd Qu   Max
perc_premium_paid_by_cash   0       0.034    0.167    0.3143    0.538    1
age_in_days                 7670    14974    18625    18847     22636    37602
Income                      24030   108010   166560   208847    252090   90262600
Delay1                      0       0        0        0.2484    0        13
Delay2                      0       0        0        0.07809   0        17
Delay3                      0       0        0        0.05994   0        11
Veh_Owned                   1       1        2        1.998     3        3
risk score                  91.9    98.83    99.18    99.07     99.52    99.89
no_of_premiums_paid         2       7        10       10.86     14       60
premium                     1200    5400     7500     10925     13800    60000
Abbreviations
No abbreviations were used in the entire report.
Executive Summary
This project helps an insurance company build a model to predict the propensity to pay the renewal premium and build an
incentive plan for its agents to maximize net revenue. Available information includes past transactions from the policy
holders along with their demographics. The client has provided aggregated historical transaction data, such as the number of
premiums delayed by 3/6/12 months across all products, the number of premiums paid, the customer sourcing channel and
customer demographics such as age, monthly income and area type. In addition to the information above, the client has
provided the following relationships. Given this information, the model predicts the propensity of renewal premium
collection, so that the company can create an incentive plan for its agents (at policy level) to maximize the net revenue from these policies.
Chapter 1
Project Introduction
Insurance is an instrument available to individuals and organizations to reduce their exposure to financial risk. It is a
contractual obligation between two parties, wherein one party (the insurer) agrees to pay another party (the insured) an
agreed financial amount, subject to the occurrence of an agreed event. For this, the insured pays an amount, known as the
premium, to the insurer in exchange for protection up to the agreed financial amount. The contract of insurance is
based on 7 key principles:
1. Utmost Good Faith
2. Insurable Interest
3. Proximate Cause
4. Indemnity
5. Subrogation
6. Contribution
7. Loss Minimization
Need of the Study
Premiums paid by customers are a major source of revenue for insurance companies. Defaults in premium payments
result in significant revenue losses, and hence insurance companies make efforts to minimize this leakage in revenue.
Life insurance companies spend heavily on establishing marketing set-ups and pay hefty first-year commissions,
which increases the cost of acquiring a new customer. Studies have shown that $1 spent on customer retention
increases profits by more than $5 spent on new customer acquisition.
The cyclical effect of non-payment of renewal premium is:

1. Fall in Profits: Premiums are determined by insurance companies on the assumption that they will be paid over the
term of the policy, and costs are structured accordingly to retain the profitability of the company. As premium
is the major revenue source of an insurance company, non-receipt of premium results in reduced revenue. At the same
time, operating costs such as collection efforts, reactive follow-up and studies to understand the non-payment
would increase. This not only adds financial strain to the company, but also results in reduced market
share.
2. Increase in Future Premium: As the non-payment of renewal premium results in financial strain and reduced
market share, the company may have to increase future premiums for new policy
holders. This may act as a deterrent to new business in a fiercely competitive and data-driven environment
where policyholders have access to the premiums of other companies.
On the other hand, regular payment of renewal premium indicates customer satisfaction. It leads to higher incentives for the sales
force and higher profits for the company, which may result in a reduction of premium for new policyholders.
Therefore, predicting the payment of renewal premium, which is considered to be an early warning indicator, is a sine qua
non for the insurance company.
Project Objective
An insurance company is interested in predicting the probability that a customer will default on the premium payment. This will
help in strategizing the agent force to reach out to policy holders in advance to follow up on payment of premium.
This is achieved by identifying the patterns of default from the historical data and predicting default in premium
payment by employing appropriate model(s) from the armoury of machine learning and predictive analytics.
For this project, default in premium indicates that the customer has not renewed the premium.
Data Source
The dataset contains the following information about 79854 policy holders.
Statistical tools & techniques
Techniques: Machine Learning using KNN, Logistic Regression, Ensemble Methods, XGBoost
Tools: R, Tableau, Python, MS Excel
Chapter 2
Literature Review
Insurance is an instrument available to individuals and organizations to reduce their exposure to financial risk. It is a
contractual obligation between two parties, wherein one party (the insurer) agrees to pay another party (the insured) an
agreed financial amount, subject to the occurrence of an agreed event, as per the terms of the contract. For this, the insured
pays an amount, known as the premium, to the insurer in exchange for protection up to the agreed financial amount. The
contract of insurance is based on 7 key principles:
1. Utmost Good Faith
2. Insurable Interest
3. Proximate Cause
4. Indemnity
5. Subrogation
6. Contribution
7. Loss Minimization
Insurance companies operate on the proven principle of collecting premia from many insurance policy holders to compensate the
financial loss suffered by the insured population.
This is popularly known as pooling of funds and sharing the risk based on the law of large numbers. The collection of
premium (the major source of revenue) and judicious usage of revenue to meet expenses are key to the survival and
growth of the industry. Insurance agents play a predominant role in procuring and retaining premia collection. The
premium paid in the first year is called the first year premium, and from the second year onwards it is called the renewal premium.
Commissions are paid to agents to procure and retain collection of premia.
The insurance market can be categorized into life and general (non-life). In the case of the former, the life of the customer or his/her
dependents is the risk covered, whereas in general insurance it could be health, vehicles or casualty.
Key Metrics
To understand the significance of renewal premium, the team looked at the performance metrics used for this purpose.
The two key metrics used in the industry for renewal are:
Persistency Ratio: This is based on the number of policies renewed vis-à-vis the previous year. It is calculated as below:
(Total number of policies renewed / Total number of policies outstanding as at previous year end) * 100
Conservation Ratio: This is based on the total premium amount that has been renewed vis-à-vis the total premium collected in
the previous year.
(Total renewal premium collected in the current year / Total premium collected in the previous year) * 100
All insurance companies strive to keep these two ratios as high as possible. The influencers of the
persistency rate are the three stakeholders in the industry: life insurers, agents, and customers. Strategies are formulated
and implemented to keep these ratios at an extremely high level among the three influencer classes. For the purpose of this
study, the insurance company aims to influence the customers by providing guidelines to agents.
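As a minimal sketch, the two ratios defined above can be computed as follows. The function names and the sample figures are illustrative assumptions; they are not taken from the client's data.

```python
def persistency_ratio(policies_renewed, policies_outstanding_prev_year_end):
    """Share (in %) of policies outstanding at previous year end that were renewed."""
    if policies_outstanding_prev_year_end <= 0:
        raise ValueError("policies outstanding must be positive")
    return policies_renewed * 100 / policies_outstanding_prev_year_end


def conservation_ratio(renewal_premium_current_year, total_premium_prev_year):
    """Renewal premium collected this year as a % of total premium collected last year."""
    if total_premium_prev_year <= 0:
        raise ValueError("previous year premium must be positive")
    return renewal_premium_current_year * 100 / total_premium_prev_year


# Hypothetical figures, for illustration only.
persistency = persistency_ratio(policies_renewed=8500,
                                policies_outstanding_prev_year_end=10000)
conservation = conservation_ratio(renewal_premium_current_year=72_000_000,
                                  total_premium_prev_year=90_000_000)
print(f"Persistency ratio: {persistency:.1f}%")
print(f"Conservation ratio: {conservation:.1f}%")
```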
Chapter 3
Exploratory Data Analysis
Variable Rationalization
To study the data better, we performed a preliminary variable reduction at the outset. At this stage, we
reduced the variables based on the following criteria:
• Redundant Variables
• Business relevance
• Correlated Variables
• Clubbed Variables
Important variables identification

Subject matter expertise was used to determine the variables that are important for predicting whether the renewal premium
will be paid or not. Those variables are:
• Percentage of premium paid by cash or credit
• Age of the policyholder, expressed as age in days (converted into years in a dummy variable)
• Income of the policyholder
• Risk score, i.e. the underwriting score in insurance industry terminology
• Number of premiums paid till the date of collection of data
(Please refer to the graphical representation at the end of the document)
Unimportant Variables
On checking the relationship between renewal premium payment and the variables below in EDA, they intuitively appear to
be unimportant. However, their significance can only be interpreted after model validation, so they were retained for the
purpose of running models:
• Accommodation
• Marital Status
• Number of vehicles owned
• Number of dependents
Insights
We learn:
• Very few people have defaulted on their renewals more than once (i.e. 12 months late payment), although
some have defaulted on renewals 3 times.
• We observe that in all 3 situations people have defaulted once.
• If the count of payments is lower, the chance of renewal is higher.
• Residence area type (residence_area_type) and sourcing channel (sourcing_channel) do not affect renewal.
• People who renewed are less likely to pay by cash or credit card than those who did not renew.
• People who did not renew seem to be younger than those who renewed.
The key observations (summary) based on exploratory analysis are as follows:
S.No  Variable Name                       Observation
1     id                                  Unique numerical ID of the policy holder
2     perc_premium_paid_by_cash_credit    Percentage of premium amount paid by cash or credit card
3     age_in_days                         Age of the policy holder in days
4     Income                              Monthly income of the policy holder
5     Count_3-6_months_late               Delay in payment of premium is divided into 3 categories,
      Count_6-12_months_late              each made a separate variable: 3-6 months late, 6-12 months
      Count_more_than_12_months_late      late and more than 12 months late
6     Marital Status                      Marital status of the insurance policy holder
7     Veh_Owned                           Number of vehicles owned by the policy holder (1-3)
8     No_of_dep                           Number of dependents of the insurance policy holder
9     Accommodation                       Accommodation type: rented or owned
10    risk_score                          Underwriting score of the applicant at the time of
                                          application; ranges from 91.9% to 99.9%
11    no_of_premiums_paid                 Total number of premiums paid on time (2-60) till the
                                          time of this data collection
12    sourcing_channel                    Channel through which the customer was sourced
                                          (5 channels: A, B, C, D and E)
13    residence_area_type                 Area type of residence: urban or rural
14    premium                             Monthly premium amount, ranging from 1200 to 60000
15    renewal                             Target (Y) variable, “Premium Renewal”: “0” where the
                                          customer has not renewed the premium and “1” where the
                                          customer has renewed the premium
Correlation
1. Age has a medium negative correlation with the percentage of premium paid by cash or credit, and a medium positive
correlation with the number of premiums paid.
2. Income is highly positively correlated with premium amount.
3. Number of premiums paid is moderately positively correlated with Income.
Research Approach
In the subsequent sections, we will create a predictive model based on logistic regression and other machine learning
models to understand the probability of default in premium payment.
Data preparation
Data Pre-processing
First, we need to convert categorical string variables into numbers. Second, missing values need to be treated; any missing
values are imputed with K-Nearest Neighbors. Third, the number of late payments and the percentage of cash/credit
card payment are the main factors contributing to whether the premium is renewed.
Missing Value Treatment

The data is clean i.e. there is no missing data. Therefore, no treatment is required.
Outlier treatment
The data contains outliers in the following variables:
• age_in_days
• Income
• no_of_premiums_paid
• premium
However, since outliers exist in real-world data, and imputation, capping or removal results in data loss,
the outliers were left as they are in the data set.
Synthetic Minority Oversampling Technique (SMOTE)
SMOTE has been applied because the given dataset is imbalanced: 93.74% of policy holders renewed the premium
and 6.26% did not. This statistical technique increases the number of minority class cases in a balanced way.
SMOTE is applied in a graded manner to improve the class imbalance in steps, i.e. to 11.7%.
After applying SMOTE, the target variable responder class in the training data is better balanced.
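The idea behind SMOTE can be sketched in a few lines: each synthetic minority record is placed at a random position on the line segment between an existing minority record and one of its nearest minority neighbours. The sketch below is a simplified pure-Python illustration and not the R or Python package implementation used for the project; the toy feature values, the function names and the 9374/626 class counts (chosen to mirror the 93.74% / 6.26% split) are assumptions.

```python
import math
import random


def synthetic_samples_needed(n_majority, n_minority, target_share):
    """Number of synthetic minority rows needed to reach target_share of all rows."""
    total = n_majority + n_minority
    needed = (target_share * total - n_minority) / (1.0 - target_share)
    return max(0, math.ceil(needed))


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base_index = rng.randrange(len(minority))
        base = minority[base_index]
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for i, p in enumerate(minority) if i != base_index]
        neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class: (late payment count, share of premium paid in cash).
minority = [(2.0, 0.80), (3.0, 0.90), (1.0, 0.70), (4.0, 0.95)]
new_points = smote(minority, n_synthetic=3, k=2)

# Rows to add so that the minority share rises from 6.26% to 11.7%.
extra = synthetic_samples_needed(n_majority=9374, n_minority=626, target_share=0.117)
print(new_points, extra)
```

Because every synthetic point is an interpolation between two real minority records, it always stays within the range of the observed minority data.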
Variable Transformation
For model building and thereafter, age in days was converted into years and binning was applied.
A Total Late Payments variable was created by combining late payments of 3 months, 6 months and 12 months.
Binning
Binning classifies data into categories and is used here to smooth the effect of outliers. These bins
are also useful in providing insights into the category or categories where customers might default.
One hot encoding

The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for
each category and returns a sparse matrix or dense array (depending on the sparse parameter)
Usage of dummy variables
Variable                          Binned variable   Range              Dummy variable
Premium                           premgrp           1200 - 5000        Premgrp_1
                                                    5000 - 10000       Premgrp_2
                                                    10000 - 15000      Premgrp_3
                                                    15000 - 20000      Premgrp_4
                                                    20000 - 30000      Premgrp_5
                                                    >30000             Premgrp_6
Risk_Score                        Riskgrp           91.9 - 94          Riskgrp_1
                                                    94 - 96            Riskgrp_2
                                                    96 - 98            Riskgrp_3
                                                    >98                Riskgrp_4
Count_3.6_months late             Late_3.6          0 - 5              Late_1
                                                    5 - 10             Late_2
                                                    10 - 15            Late_3
Count_6.12_months late            Late_6.12         0 - 5              Late_4
                                                    5 - 10             Late_5
                                                    10 - 15            Late_6
                                                    >15                Late_7
Count more than 12_months late    Late_more12       0 - 5              Late_8
                                                    5 - 10             Late_9
                                                    10 - 15            Late_10
Age (in years)                    agegroup          21 - 30            Age_1
                                                    30 - 40            Age_2
                                                    40 - 50            Age_3
                                                    50 - 60            Age_4
                                                    60 - 70            Age_5
                                                    >70                Age_6
Income                            income_group      0 - 50000          Income_1
                                                    50000 - 100000     Income_2
                                                    100000 - 300000    Income_3
                                                    300000 - 750000    Income_4
                                                    >750000            Income_5
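The binning and dummy-variable steps can be illustrated with a short sketch for the premium variable. It assumes that each range includes its upper edge (so a premium of exactly 5000 falls in Premgrp_1), which is the default behaviour of R's cut(); the report does not state how boundary values were handled, and the helper names are hypothetical.

```python
from bisect import bisect_left

# Upper edges of the premium bins; anything above the last edge is ">30000".
PREMIUM_EDGES = [5000, 10000, 15000, 20000, 30000]
PREMIUM_LABELS = ["Premgrp_1", "Premgrp_2", "Premgrp_3",
                  "Premgrp_4", "Premgrp_5", "Premgrp_6"]


def assign_bin(value, edges, labels):
    """Return the label of the right-closed bin that contains value."""
    return labels[bisect_left(edges, value)]


def one_hot(label, labels):
    """Encode a bin label as one 0/1 dummy column per possible label."""
    return {name: int(name == label) for name in labels}


premiums = [1200, 5000, 7500, 13800, 60000]
groups = [assign_bin(p, PREMIUM_EDGES, PREMIUM_LABELS) for p in premiums]
dummies = [one_hot(g, PREMIUM_LABELS) for g in groups]

for premium, group in zip(premiums, groups):
    print(premium, group)
```

The same two helpers apply to the other binned variables by swapping in their edges and labels.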
Data split into test and train

The data containing the target variable (Renewal Premium) is split into train data and test data. The purpose of
creating training and test subsets is to build a model on the train data and validate the built model using the
test data set.
Industry best practice states that:
• the training and testing data sets are to be different to avoid overfitting.
• a split of 70% train data and 30% test data is usually an acceptable division.
K-fold cross validation (5 folds) has been adopted to bring in the benefits of multiple random splits of the dataset into
training and testing. This is a powerful tool to prevent overfitting of the model, and it is used to determine the optimal
parameters of the model.
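A minimal sketch of the 70:30 split followed by 5-fold cross validation on the training portion is given below. Splitting within each class (stratification) is an assumption added here so that the small non-renewal class appears in both subsets; the report does not say whether the split was stratified. The function names, seed and toy labels are illustrative.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Split row indices into train/test, keeping the class mix in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def k_fold_indices(indices, k=5, seed=42):
    """Yield (training, validation) index lists for each of the k folds."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Toy target: 94 renewals (1) and 6 non-renewals (0).
labels = [1] * 94 + [0] * 6
train_idx, test_idx = stratified_split(labels)
folds = list(k_fold_indices(train_idx, k=5))
print(len(train_idx), len(test_idx), len(folds))
```

Each training row is used for validation exactly once across the five folds, which is what allows the parameters to be tuned without touching the test data.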
Chapter 4
Tools and Techniques used

- R for data preparation, as the data set is very large (79854 records)
- Training & testing data split - 70:30
Models Used
As the objective of this project is to predict whether a customer will default on the premium payment or not, it is a
classification problem.
The dataset contains the target variable “Renewal Premium”, wherein “0” represents that the customer has not renewed the
premium and “1” that the customer has renewed the premium. Therefore, a supervised learning algorithm needs to be applied
for this prediction.
The supervised learning algorithms for classification that were applied are:
1. Logistic Regression:
Logistic Regression is a classification algorithm that estimates the probability of discrete outcomes such as yes/no, true/false or 0/1.
This model is most useful for understanding the influence of several independent variables on a single outcome
variable. It works very well on linearly separable classes, making use of the odds ratio and the sigmoid function.
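To make the roles of the sigmoid function and the odds ratio concrete, here is a minimal sketch of how a fitted logistic regression turns feature values into a probability. The coefficients, the intercept and the two features are hypothetical values chosen for illustration; they are not the coefficients of the model fitted in this project.

```python
import math


def sigmoid(z):
    """Map a linear score to a probability in (0, 1), in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, intercept):
    """Probability of the positive outcome for one record."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


def odds_ratio(weight):
    """Multiplicative change in the odds for a one-unit increase in a feature."""
    return math.exp(weight)


# Hypothetical model of renewal: features are (late payments, share paid in cash).
weights = [-0.8, -1.5]
intercept = 3.0

p_regular = predict_proba([0, 0.1], weights, intercept)
p_late = predict_proba([3, 0.9], weights, intercept)
print(f"P(renew | no late payments) = {p_regular:.3f}")
print(f"P(renew | 3 late payments)  = {p_late:.3f}")
print(f"Odds ratio per extra late payment = {odds_ratio(weights[0]):.3f}")
```

An odds ratio below 1 for a feature means that each additional unit of that feature lowers the odds of renewal.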
Confusion Matrix for Logistic Regression

Confusion Matrix and Statistics for Train Dataset

              Reference
Prediction    No       Yes
No            12214    7322
Yes           43658    411718

Confusion Matrix and Statistics for Test Dataset

              Reference
Prediction    No       Yes
No            359      391
Yes           1147     22215

Statistic                 Train Dataset       Test Dataset
Accuracy                  0.8927              0.9362
95% CI                    (0.8918, 0.8935)    (0.9331, 0.9393)
No Information Rate       0.8824              0.9375
P-Value [Acc > NIR]       < 2.2e-16           0.8067
Kappa                     0.2801              0.2887
Mcnemar's Test P-Value    < 2.2e-16           < 2e-16
Sensitivity               0.21861             0.23838
Specificity               0.98253             0.9827
Pos Pred Value            0.6252              0.47867
Neg Pred Value            0.90413             0.9509
Prevalence                0.11765             0.06246
Detection Rate            0.02572             0.01489
Detection Prevalence      0.04114             0.0311
Balanced Accuracy         0.60057             0.61054
'Positive' Class          0                   0
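The statistics in the table follow directly from the four cells of the confusion matrix. The sketch below recomputes them for the test dataset, treating class 0 (“No”, i.e. not renewed) as the positive class as the table does; the function name and dictionary keys are illustrative choices.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive the standard statistics from the four confusion matrix cells."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "detection_rate": tp / total,
        "detection_prevalence": (tp + fp) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Test dataset cells with "No" (class 0) as the positive class:
# predicted No / actual No = 359, predicted No / actual Yes = 391,
# predicted Yes / actual No = 1147, predicted Yes / actual Yes = 22215.
test_metrics = confusion_metrics(tp=359, fp=391, fn=1147, tn=22215)
for name, value in test_metrics.items():
    print(f"{name}: {value:.5f}")
```

Reading accuracy alongside sensitivity and balanced accuracy matters here because the classes are heavily imbalanced.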
Interpretation
Using Logistic Regression, we predicted on the train data which customers will default with an accuracy of
89.27%, which seems very good.
On the test data, we predicted which customers will default with an accuracy of 93.62%, which also seems
very good, although it is close to the No Information Rate of 93.75% and the sensitivity for the defaulting class ('Positive' Class 0) is 23.84%.
2. Support vector machine (SVM):

SVM operates on the principle of maximizing the margin between the decision boundary and the nearest data points, as boundaries
with large margins tend to have a lower generalization error. SVM can also solve nonlinear classification problems using kernel methods.
3. K-Nearest Neighbours (KNN)

KNN is a simple supervised learning algorithm that is used for solving both regression and classification
problems. It is called a lazy learner because it does not build a model in advance; for a given new data point, it
finds the K nearest neighbours and, for classification, assigns the class held by the majority of them.
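A minimal sketch of KNN classification is shown below: all the work happens at prediction time, when the distances from the new point to every training point are computed and the K closest labels are put to a vote. The toy features, labels and the choice of K = 3 are assumptions for illustration only.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: (late payment count, share of premium paid in cash) -> renewal (1/0).
train_points = [(0, 0.1), (0, 0.2), (1, 0.3), (5, 0.9), (6, 0.8), (4, 1.0)]
train_labels = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_points, train_labels, query=(5, 0.85)))
print(knn_predict(train_points, train_labels, query=(0, 0.15)))
```

In practice the features would be scaled first, since a variable with a large numeric range would otherwise dominate the distance.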
4. Ensemble methods:
Ensemble methods are machine learning techniques that combine several base models in order to produce one
optimal predictive model. Bagging, Boosting and Stacking are different ensemble methods that are used.
Gradient Boosting:
Gradient boosting is a machine learning technique for regression and classification problems, which produces a
prediction model in the form of an ensemble of weak prediction models, typically decision trees. Gradient
boosted decision trees are the state of the art for structured data problems. Two modern algorithms that build
gradient boosted tree models are XGBoost and LightGBM.
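The core loop of gradient boosting can be sketched with one-split decision trees (stumps): each round fits a stump to the residuals of the current ensemble and adds a damped version of it to the prediction. This is a deliberately simplified illustration using squared error on a single feature; it is not XGBoost or LightGBM, and the toy data, learning rate and number of rounds are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit stumps to the remaining residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, stumps, learning_rate


def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Toy data: the outcome switches from 0 to 1 once the feature exceeds 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = fit_gradient_boosting(xs, ys)
print(round(boosted_predict(model, 1), 3), round(boosted_predict(model, 6), 3))
```

Each weak learner only corrects part of the remaining error, which is why many rounds with a small learning rate are combined.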
XGBoost
XGBoost, or Extreme Gradient Boosting, is a further improvement of the Gradient Boosting method that uses more
accurate approximations to find the best tree model.
(Please refer to the graphical representation at the end of the document for the model)
Inferences from the ML models
Model Interpretation(Test data)
In the above sections we have created prediction models using logistic regression, KNN, SVM and XGBoost. Let us look at the
evaluation metrics of these models on the test data and interpret them.
Test Data Set

Metric                    Logistic Regression   KNN          SVM       XGBoost
Accuracy                  0.9362                0.7875       0.7897    0.9355
95% CI                    (0.9331, 0.9393)      -            -         -
No Information Rate       0.9375                -            -         -
P-Value [Acc > NIR]       0.8067                -            -         -
AUC                       -                     0.6889       0.7381    0.854821
Kappa                     0.2887                -            -         -
Mcnemar's Test P-Value    < 2e-16               -            -         -
Recall                    -                     0.8014       0.79695   0.9955
Sensitivity               0.23838               -            -         -
Specificity               0.98270               -            -         -
Pos Pred Value            0.47867               -            -         -
Neg Pred Value            0.95090               -            -         -
Prevalence                0.06246               -            -         -
Detection Rate            0.01489               -            -         -
Precision                 -                     0.9664       0.9742    0.9459
Detection Prevalence      0.03110               -            -         -
F1_Score                  -                     0.87620993   0.87672   0.9701
Balanced Accuracy         0.61054               -            -         -
'Positive' Class          0                     -            -         -
(- : not reported)

Interpretation:
Logistic Regression: the predicted accuracy is 93.62% (highest).
KNN: the predicted accuracy is 78.75%.
SVM: the predicted accuracy is 78.97%.
XGBoost: the predicted accuracy is 93.55%.
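The F1 score reported for KNN, SVM and XGBoost is the harmonic mean of precision and recall, so it can be checked from the other two figures in the table. The sketch below does this with the reported values; small differences in the last decimals are expected because the inputs are already rounded. The function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported for the test data set.
reported = {
    "KNN": (0.9664, 0.8014),
    "SVM": (0.9742, 0.79695),
    "XGBoost": (0.9459, 0.9955),
}
f1_by_model = {model: f1_score(p, r) for model, (p, r) in reported.items()}
for model, value in f1_by_model.items():
    print(f"{model}: F1 = {value:.4f}")
```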
Chapter 5
1. Overall Best Model

The table below provides a comparative view of the confusion matrix measures (i.e. Accuracy, Sensitivity, Specificity)
of all the models, using the test data. The ones highlighted in yellow are the best performing models.
Comparing the Accuracy, Sensitivity and Specificity measures of the classification matrices (on test data) of all the
models created so far, we can say that the logistic regression algorithm has worked best for this dataset, although
the bagging and boosting models are close enough.
However, one must consider the fact that we balanced the target variable responder class using SMOTE. The results
would have been different if we had used the dataset as it is.
Model/Measure            Logistic Regression   KNN     SVM     XGBoost
Accuracy (%)             93.62                 78.75   78.97   93.55
Sensitivity/Recall (%)   23.84                 80.14   95.76   -
Actionable Insights and Recommendations
The final interpretation is as follows:
The objective here is that an insurance company is looking for a practicable model to predict the probability that a
customer will default on the premium payment. This will help in strategizing the agent force to reach out to policy holders in
advance to follow up on payment of premium.
1. Based on the variable importance of the Logistic Regression model, the insurance company is advised to orient
its agent force to contact policy holders for renewal premium as per the criteria below.
Measure                       Criteria
Age                           1. Between 30-40 years
                              2. Between 50-70 years
Marital Status                Married
Income                        1. Between 50000-100000
                              2. Equal to 300000 and above
Premium                       15,000 and above
Risk score                    94%-96%
Sourcing Channel              C, D & E
Late Payment (3-6 months)     5-10 times
Late Payment (6-12 months)    1. 5-10 times
                              2. 10-15 times
2. Age: The number of defaults is higher for ages between 30-40 years and between 50-70 years. As we saw in the EDA,
the mean age of a policyholder making renewal payment is around 51 years (18,847 days).
3. Number of dependents: Policy holders with 4 dependents are a segment to be focused on for
issuance of new policies.
4. Marital Status: Married policyholders are to be contacted by agents to ensure receipt of renewal
premium.
5. Urban: Policyholders in urban areas are to be contacted by agents to ensure receipt of renewal premium.
6. Income: Policy holders with income in the ranges of 50,000 to 100,000 and above 300,000 are to be contacted
by agents to ensure receipt of renewal premium.
7. As the customers coming from sourcing channels A & B are more regular in payment of premium, business
from them needs encouragement. At the same time, the business coming from sourcing channels C, D & E should be
subjected to closer scrutiny at the time of accepting the business.
8. The insurance company has to closely watch and ensure that the late payments of 3-6 & 6-12 months of a
policyholder do not go above 5. This will be a proactive step to reduce the need for follow-up by agents.
9. Model Comparison and Conclusion
The Logistic Regression model and XGBoost are able to predict renewal premium default with very high accuracy. In this
case, either of the models, Logistic Regression or XGBoost, can be used for high accuracy prediction. However, the key aspect
is SMOTE for balancing the minority and majority classes, without which our models would not be so accurate.
Logistic Regression & XGBoost are the best models for predicting whether a customer may default on the premium payment. From XGBoost,
we use feature importance to identify the variables that need focus; the probability cut-off to be
considered should be chosen based on the decision to be made.
With regard to implementation, the insurance company is advised to adopt a three-pronged approach to policy holders
for pursuing renewal premium payment.
Based on the model interpretation metrics:
• For True Positives, the insurance company needs to adopt multiple modes of follow-up, including sending
a person to provide clarifications or give comfort to the policyholder for renewal. These have to be started
well before the due date of renewal premium payment.
• For the mid-category, a medium approach to follow-up at a mid-interval time period is fine.
• For the policyholders who have been regularly paying premium, a reminder by email and SMS would be
fine.
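The three-pronged approach could be operationalised by mapping each policy's predicted default probability to a follow-up tier. The sketch below is only an illustration: the cut-offs of 0.5 and 0.2 and the tier names are hypothetical and would need to be set by the insurance company based on agent capacity and the cost of a missed renewal.

```python
def follow_up_tier(default_probability, high_cutoff=0.5, low_cutoff=0.2):
    """Map a predicted default probability to one of three follow-up tiers."""
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("default_probability must be between 0 and 1")
    if default_probability >= high_cutoff:
        return "intensive follow-up"      # multiple modes, started well before due date
    if default_probability >= low_cutoff:
        return "medium follow-up"         # moderate effort at a mid-interval
    return "email and SMS reminder"       # regular payers


# Hypothetical predicted default probabilities for three policies.
predictions = {"policy_A": 0.72, "policy_B": 0.31, "policy_C": 0.04}
tiers = {policy: follow_up_tier(p) for policy, p in predictions.items()}
for policy, tier in tiers.items():
    print(policy, tier)
```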
Bibliography
• https://www.inc.com/encyclopedia/insurance-pooling.html
• https://www.ibef.org/industry/insurance-sector-india.aspx
• “The Indian Insurance Industry Report – 2018” by The India Insure Risk Management & Insurance Broking
Services Pvt Ltd., page 48
Graphical Representations
Frequency Distribution & Outlier Detection
Exploratory Data Analysis
Unimportant Variable
Correlation Heatmap
XGBoost Graphs
Logloss for comparing the test & train dataset for prediction
Feature Importance based on test data
Annexures
1. Hyperlinks & Files

Title                                         Artifact/Location                       Remarks
Source of Data                                                                        Data File for Insurance Premium
Data Dictionary                               Insurance premium renewal propensity    Insurance Premium Renewals (provided with dataset)
Modified Data Source (for Model Validation)                                           Variable description and rationale behind selection
Project Notes 1                                                                       As per Capstone Project guidelines/instructions
Project Notes 2                                                                       As per Capstone Project
Project Notes 3                                                                       As per Capstone Project
Project Presentation                                                                  As per Capstone Project
R/Python Code for Reference                                                           Complete R/Python code used in EDA, Model Building, etc.