Submitted by
MONIGA.S
3122218001063
MASTER OF BUSINESS ADMINISTRATION
MBA (2021-2023)
June 2023
BONAFIDE CERTIFICATE
Certified that the Project report titled PREDICTING LIFE INSURANCE RISK
CLASSES USING MACHINE LEARNING is the bonafide work of Ms. MONIGA S
3122218001063 who carried out the work under my supervision. Certified further that to
the best of my knowledge the work reported herein does not form part of any other project
report or dissertation on the basis of which a degree or award was conferred on an earlier
occasion on this or any other candidate.
I hereby declare that the project entitled “PREDICTING LIFE INSURANCE RISK
CLASSES USING MACHINE LEARNING” submitted for the M.B.A. degree is my original
work and the dissertation has not formed the basis for the award of any degree, associateship,
fellowship, or any other similar title.
Place:
Date:
MONIGA S
ACKNOWLEDGEMENT
In all humility and gratitude, I acknowledge my debt to all those who have helped me
carry these ideas well above the level of simplicity and into something concrete.
I express my sincere thanks to our honorable Director, Dr. K.
HARIHARANATH, for inculcating in us a passion for excellence in all our activities
including this project.
With an extreme sense of gratitude, I express my heartfelt thanks to my project
guide, Dr. GIRIJA, for guidance at every stage of the project and for providing the
necessary support and valuable ideas for the successful completion of the project.
At the outset, I owe my sincere gratitude to Mr. RAKESH BABU and Mr.
THIRUMURUGAN for giving me the opportunity to undergo the project at Medchrono
and for their constant support and guidance.
Finally, I thank all the faculty members of the department, my family, and my friends for
their unprecedented help without which this study would not have been accomplished.
SUMMARY OF THE PROJECT
Life insurance plays a vital role in providing financial security to individuals and their
families, making it essential for insurance companies to accurately evaluate the risk
associated with insuring an individual. Traditional methods of risk assessment in the life
insurance domain rely heavily on manual processes and subjective judgments, which can
be time-consuming and prone to human error.
The project will involve the collection and preprocessing of large datasets from diverse
sources, including insurance databases, medical records, and public data repositories.
Feature engineering techniques will be employed to extract relevant information and create
meaningful variables for training the machine learning models. Various algorithms such as
logistic regression, decision trees, and ensemble methods will be implemented and
evaluated to identify the most accurate and efficient approach for predicting risk classes.
The outcomes of this project have the potential to significantly improve the efficiency and
accuracy of risk classification in the life insurance industry. By automating the process
using machine learning, insurance companies can enhance their underwriting procedures,
streamline the application process, and provide more tailored insurance coverage to
policyholders. Ultimately, this project aims to contribute to the advancement of the life
insurance sector by harnessing the power of machine learning for risk assessment and
decision-making.
Medico legal service is the field of study and accumulation of materials that deals
with the application of medical knowledge to the administration of justice. Medicine
and law have been related from the earliest times; the bonds that first united them were
religion, superstition, and magic. The oldest of these written records, the Code of
Hammurabi, includes legislation pertaining to the practice of medicine, dating back to
the year 2200 B.C. It covered the topic of medical malpractice and set out for the first
time the concept of civil and criminal liability for improper and negligent medical care.
Penalties ranged from monetary compensation to cutting off the surgeon's hand, and
fees also were fixed. In 1955, recognizing the growing impact of legislation, regulations,
and court decisions on patient care and the general effect of litigation and legal
medicine on modern society, a group of physicians and surgeons, some of whom were
also educated in the law, came together to organize the field professionally.
Unfortunately, two professional groups suffer from far more ignorance of law and
medicine than is healthy: lawyers who do not regularly deal with medical issues in
their legal practice, and thus know very little about the medical profession and its
problems; and physicians, who frequently have a poor understanding of the law and
how it affects their professional practice.
Medical experts trained in legal perspectives possess good knowledge of the health
sector, and they prepare logical summary reports that support clients during the
various stages of court cases until they reach settlement. They act as a bridge between
lawyers and medical professionals. Due to client activism, there is an increase in
litigation such as personal injury, medical malpractice, and mass torts, and the
regulatory framework governing medical practice is witnessing frequent changes. This
leads to demand for professional assistance, thereby aiding the growth of medico legal
services worldwide.
Between 2017 and 2022, the market for medico legal services increased from USD [ ]
million to USD [ ] million, and it is anticipated to reach USD [ ] million in 2029 at the
projected CAGR. During the projection period from 2021 to 2030, the Global Medico
Legal Services Market is expected to experience considerable expansion.
The global medico legal services market accounted for US$ 11,56,284.5 thousand in
2021, and is expected to grow at a CAGR of 9.3% during the forecast years (2021-2029).
1.1.4 Industry Players
The medico legal industry is a B2B industry. In 2021, the market was growing at a
steady rate, and with the rising adoption of strategies by key players, the market is
expected to rise over the projected horizon. The medico legal services professional
market from 2022 to 2026 is primarily split into Service, Solution, Large Enterprises,
and Small & Mid-Sized Enterprises. Some of the major players in the global medico
legal industry are Apex Medico Limited, Clinical Partners, Exigent, FORENSICDX,
and MAPS Medical Reporting.
Our services include market-based solutions in the major areas of personal injury,
medical malpractice, and mass tort. We have been meeting our clients' needs for over
a decade. Our consultants ensure that they explore every available option to meet your
needs, and we work with you in a timely manner. They effectively deliver medical
indexes, timelines, abstracts, and expert opinions. Our cost-effective solutions save our
customers hundreds to thousands of dollars on a complex medical file.
o Medical Chronology
o Narrative Summary
o Deposition Summary
o Plaintiff Fact Sheet
o Demand Letter
o Medical Synopsis
o Medical Opinion
o Jury Questionnaire
o Value Added Services
o Additional / Technical Services
The problem is to develop a predictive model using the Prudential Insurance dataset
to accurately predict the risk class of an insurance policyholder. The risk class is a
categorical variable that represents the level of risk associated with a policyholder,
which is determined based on various demographics, health, and other relevant
factors.
Determining risk classes or risk scores for insurance applicants is essential for
insurers. It helps them set appropriate premium rates and coverage based on the level
of risk associated with each applicant. This ensures fairness and actuarial soundness in
the insurance industry.
Informed underwriting decisions are crucial for insurers to effectively manage their
business operations. By thoroughly evaluating the risks posed by potential
policyholders, insurers can decide whether to accept, decline, or modify coverage.
Factors such as health conditions, occupation, lifestyle choices, and claims history are
analyzed to mitigate potential risks and provide suitable coverage options.
Segmenting the customer base allows insurers to tailor their offerings to different risk
groups. By categorizing policyholders into specific segments, insurers can design
customized insurance products that meet the unique needs of each group. This
targeted approach ensures that customers receive appropriate benefits while
optimizing profitability and managing risk exposure effectively.
Managing profitability and risk exposure is a fundamental aspect of insurance
companies' operations. Accurate risk assessment and classification help insurers price
their policies appropriately, minimizing the chances of financial losses due to adverse
selection. By effectively managing profitability and risk exposure, insurers can
safeguard their financial stability and fulfill policyholder claims.
CHAPTER 2
LITERATURE REVIEW
Summary:
Data analytics is increasingly important in the insurance industry due to the growing
volume of transactional data. Insurance firms can benefit from customer-level analytics,
risk assessment, and fraud detection. Techniques like clustering and classification help
identify patterns and predict future events. Implementing analytics enables better customer
understanding, loss minimization, and gaining a competitive edge. The paper provides a
review of various algorithms used in insurance data analysis, along with evaluations of
different approaches.
Relevance:
This study helps us understand data mining techniques for detecting fraud among
insurance firms, a crucial issue given the great losses companies face.
Summary:
This research aims to improve risk assessment in life insurance by using predictive
analytics. Real-world data with multiple attributes is analyzed, and dimensionality
reduction techniques like CFS and PCA are applied. Machine learning algorithms,
including Multiple Linear Regression, Artificial Neural Network, REPTree, and Random
Tree, are used to predict applicant risk levels. REPTree performs best with the lowest MAE
and RMSE values for CFS, while Multiple Linear Regression excels with PCA. Overall,
the study shows that predictive analytics can enhance risk assessment in the life insurance
industry.
Relevance:
This study helps us understand how to group customers according to their estimated level
of risk, determined from their historical data held by insurance firms.
Summary:
The paper explores feature selection and feature extraction techniques to address the
challenges of high-dimensional data. It emphasizes the importance of removing redundant
and irrelevant features to improve mining performance and learning accuracy. The analysis
focuses on popular techniques, highlighting their benefits and challenges. The paper aims
to provide beginners with valuable insights into these algorithms, aiding their
understanding of how these techniques can enhance data analysis in the context of high-
dimensional data.
Relevance:
This study gives insights into how feature selection and feature extraction techniques are
used as a preprocessing step to reduce data dimensionality.
Summary:
This paper explores filter-based feature selection methods in machine learning, including
Information Gain and Correlation coefficient. It demonstrates that these approaches
effectively reduce the number of gene expression levels, resulting in improved
classification accuracy. Five classification problems are evaluated, and the results show
that the selected gene subset outperforms the raw data. Additionally, Correlation Based
Feature Selection achieves higher accuracy with fewer genes compared to the Information
Gain approach.
Relevance:
This study gives insights into a feature selection method that is easy to understand and
fast to execute, and which removes noisy data and improves the performance of
algorithms.
5. Sudhakar M, Reddy C (2016) Two step credit risk assessment model for retail bank
loan applications using Decision Tree data mining technique.
Summary:
This paper highlights the significance of data mining techniques in the banking industry,
particularly for credit risk management. It introduces a prediction model utilizing Decision
Tree Induction Data Mining Algorithm to identify trustworthy loan applicants. By
analyzing customer data, banks can make informed decisions on loan approvals, mitigating
risks and enhancing lending processes. The model serves as a valuable tool for
organizations seeking to improve loan decision-making and maximize profitability in the
competitive banking market.
Relevance:
This study gives insights into decision trees, a widely used machine learning technique
for prediction that has been implemented in several studies.
Summary:
The study focuses on understanding the handling of attributes with a high proportion of
missing data in statistical analysis. Specifically, it suggests that attributes exhibiting more
than 30% missing data should be excluded from the analysis. By removing these attributes,
the study aims to ensure the integrity and reliability of the analysis by minimizing the
impact of incomplete or unreliable data.
Relevance:
This study helps us understand why attributes showing more than 30% missing data
would be dropped from the analysis.
Summary:
This paper discusses the use of federated learning (FL) to bring collaborative intelligence
to industries lacking centralized training data. It addresses security concerns and aims to
accelerate Industry 4.0 on the edge computing level. The study defines FL terminologies,
presents a framework, explores FL research advancements, and discusses the economic
impacts. It proposes a FL-transformed manufacturing paradigm, outlines future research
directions, and suggests potential applications in the industry 4.0 domain. Overall, the
paper highlights FL's potential in enabling secure and efficient data intelligence
collaboration across industries.
Relevance:
This study gives insights into how FL tries to bring collaborative intelligence into an
industry without centralizing training data, along with an overall increase in performance
across the entire industry.
Summary:
This study provides insights into the categorization of federated learning (FL) based on
data features and sample space. FL can be classified into two types: horizontal FL and
vertical FL. Horizontal FL involves training models on different data samples from various
devices, while vertical FL focuses on collaborating on different features of the data.
Understanding these distinctions helps in better comprehending the applications and
implications of FL in different contexts.
Relevance:
This study helps us understand that, based on the features and the sample space of the
data, federated learning can be further categorized into two types, namely horizontal FL
and vertical FL.
Summary:
This paper introduces a dataset extracted from a real-life risk insurance portfolio,
containing information on 76,102 policies and 15 variables. It highlights the dataset's
significance in teaching and research, enabling the development of pricing systems,
evaluation of marketing strategies, portfolio analysis, regulatory compliance, and
benchmarking. Previous studies have utilized the dataset to compare pricing methodologies
and assess the impact of managing catastrophic risks on solvency capital requirements in
life insurance. Overall, the dataset serves as a valuable resource for various insurance-
related analyses and investigations.
Relevance:
This study helps us understand that the risk structure to which an insurance company is
exposed can be deduced by reviewing its customer database.
10. Bhalla A (2012) Enhancement in predictive model for insurance underwriting. Int
J Computer Sci Eng Technology
Summary:
The underwriting process in the insurance industry is crucial for assessing risks accurately.
Predictive Analytics uses statistical data to analyze and assign scores to applications,
identifying low-risk applicants. This paper proposes incorporating predictive models into
underwriting to streamline and optimize the process. By focusing on applications with
scores surpassing a defined threshold, underwriters can efficiently evaluate applicants. This
methodology aims to improve the effectiveness and efficiency of underwriting operations,
ensuring accurate risk assessment and better decision-making.
Relevance:
The study proposes integrating predictive models into underwriting to enhance efficiency
and accuracy in risk assessment for insurance applications. By leveraging Predictive
Analytics, underwriters can streamline the process and make informed decisions.
CHAPTER 3
Primary Objective:
To create a machine learning model that can classify policyholders into the correct risk
class, which can help insurance companies in better risk assessment, pricing, and
underwriting decisions.
Secondary Objective:
• To analyze historical data and identify key risk factors that impact
insurance risks, such as demographics, claims data, etc.
• To extract insights and patterns from a large and complex insurance dataset.
The Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology is a process aimed
at increasing the use of data mining over a wide variety of business applications and
industries. The intent is to take case-specific scenarios and general behaviors and make them
domain neutral. CRISP-DM comprises six steps that an entity must implement in order to
have a reasonable chance of success. The six steps are shown in the following diagram:
1. Business Understanding: The business objectives and requirements are
established first and translated into a data mining problem definition.
2. Data Understanding: The initial data is collected and explored in order to
become familiar with it, assess its quality, and form first hypotheses.
3. Data Preparation: Once the data has been collected, it must be transformed
into a useable subset unless it is determined that more data is needed. Once
a dataset is chosen, it must then be checked for questionable, missing, or
ambiguous cases. Data Preparation is common to CRISP-DM and
Foundational Methodology.
4. Modeling: Once prepared for use, the data must be expressed through
appropriate models that give meaningful insights and, hopefully, new
knowledge. This is the purpose of data mining: to create knowledge and
information that has meaning and utility. The use of models reveals patterns
and structures within the data that provide insight into the features of
interest. Models are selected on a portion of the data and adjustments are
made if necessary. Model selection is both an art and a science. This stage
is common to both Foundational Methodology and CRISP-DM.
5. Evaluation: The selected model must be tested. This is usually done by
having a preselected test set on which to run the trained model. This allows
you to see the effectiveness of the model on a set it treats as new. Results
from this are used to determine the efficacy of the model and foreshadow its
role in the next and final stage.
6. Deployment: In the deployment step, the model is used on new data
outside the scope of the dataset and by new stakeholders. The new
interactions at this phase might reveal new variables and needs for the
dataset and model. These new challenges could initiate revision of either
the business needs and actions, or the model and data, or both.
CRISP-DM is a highly flexible and cyclical model. Flexibility is required at each step,
along with communication, to keep the project on track. At any of the six stages, it may be
necessary to revisit an earlier stage and make changes. The key point of this process is that
it is cyclical; therefore, even at the finish you return to the business understanding stage
to discuss the viability of the solution after deployment.
Secondary data for analysis and modeling is obtained from a published academic data
source: Prudential, one of the largest issuers of life insurance in the USA. Both dependent
and independent variables are found in the dataset.
Independent Variables:
Product_Info_1-7
Ins_Age
Ht
Wt
BMI
Employment_Info_1-6
InsuredInfo_1-6
Insurance_History_1-9
Family_Hist_1-5
Medical_History_1-41
Medical_Keyword_1-48
Dependent Variable (DV) or Target:
Response
Statistical tool/platform for Data Analysis and Modeling: Jupyter Notebook (Anaconda)
Statistical language for Data Analysis and Modeling: Python
Machine Learning Algorithms for Modeling: Logistic Regression, Gaussian Naïve Bayes,
Support Vector Classifier, Decision Tree Classifier, Random Forest Classifier,
AdaBoost Classifier, Gradient Boosting Classifier, XGBoost Classifier
Python Libraries for Data Analysis and Visualization: pandas, NumPy, seaborn, matplotlib
Python Libraries for Machine Learning: scikit-learn
JUPYTER NOTEBOOK
Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy. It is an important component of the growing field of data science.
Using statistical methods, algorithms are trained to make classifications or predictions,
uncovering key insights within data mining projects. These insights subsequently drive
decision making within applications and businesses, ideally impacting key growth metrics.
As big data continues to expand and grow, the market demand for data scientists will
increase, requiring them to assist in the identification of the most relevant business
questions and subsequently the data to answer them.
2. Gaussian Naïve Bayes: Gaussian Naïve Bayes is a simple yet effective classification
algorithm that assumes the independence of features and follows the Bayes' theorem. It is
particularly useful for continuous feature data and is based on the assumption that the
features are normally distributed. It calculates the probabilities of different classes and
assigns the input to the class with the highest probability.
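As a minimal sketch of how such a classifier could be fitted with scikit-learn (the names
X_train, y_train, and X_test are assumed placeholders for the prepared feature matrix and
Response labels, not taken from the report's own code):

from sklearn.naive_bayes import GaussianNB

# Fit per-class Gaussian distributions over the (assumed numeric) features.
gnb = GaussianNB()
gnb.fit(X_train, y_train)

pred = gnb.predict(X_test)          # most probable risk class per applicant
proba = gnb.predict_proba(X_test)   # class membership probabilities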
CHAPTER 4
Wt: This KDE plot displays significant variation in each Response group
distribution's composition and structure, where each peak exhibits both
broadening and shouldering. Much of the distribution density is spread
between x=0.2 and x=0.5. The distribution for class 8 is notably centered
around low values of Wt (x=0.2), whereas the remainder of the population
is mostly spread across the interval between x=0.2 and x=0.5.
BMI: This KDE plot displays significant variation in each Response group
distribution's composition and structure, where each peak exhibits both
broadening and shouldering. Most of the distribution density is spread in a
similar fashion to Wt - class 8's peak is centered at x=0.4 whereas most of
the other classes' distributions are spread across the interval between x=0.4
and x=0.6. However, one of the "medium" risk-rating distributions features
a notably stronger skew compared to its peers, with significant negative
skew towards x=0.6 rather than towards x=0.5.
4.2.1.3 Employment Info Column set
These KDE plots show distributions with varying modality; however, the
consistent trend is that they all closely overlap between each Response
group/cohort of applicants, with no major difference in relative densities.
As a result, any variation in these features is unlikely to individually help
predict an applicant's risk rating.
4.2.2 Correlation plots
Inference
A few attributes are selected for deletion where their proportion of missing
values in the training subset is greater than 40%, although any other sensible
threshold could be set instead.
Some attributes can be preprocessed via imputation methods in order to provide machine-
interpretable inputs for our models.
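A hedged illustration of how this thresholding and imputation could be carried out with
pandas and scikit-learn; the 40% cut-off follows the text above, while the choice of mean
imputation and the assumption that categorical columns are already encoded are
illustrative assumptions:

# Drop columns whose fraction of missing values exceeds 40%.
missing_frac = main_data_index_set.isnull().mean()
cols_to_drop = missing_frac[missing_frac > 0.40].index
reduced = main_data_index_set.drop(columns=cols_to_drop)

# Impute the remaining gaps (mean imputation shown as one sensible option;
# assumes all remaining columns are numeric/encoded).
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
reduced[reduced.columns] = imputer.fit_transform(reduced)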
4.3.2. Locating/handling excess zeroes
Increasing k beyond this value does not yield a significant benefit in the rate of
reduction of the training dataset's inertia; hence, we set k=15 as the number of
clusters. The cluster labels will be incorporated as an additional feature in our
datasets and may in fact prove useful in helping to understand applicants' risk
rating assignments.
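A sketch of how the cluster labels could be derived and appended, assuming X is a
placeholder for the scaled numeric feature matrix (k=15 per the elbow analysis above):

from sklearn.cluster import KMeans

# k=15 chosen from the elbow/inertia analysis described above.
kmeans = KMeans(n_clusters=15, n_init=10, random_state=42)
main_data_index_set['KMeansCluster'] = kmeans.fit_predict(X)

The dummy-encoded feature names that appear later (e.g. KMeansCluster_4) suggest this
label was subsequently one-hot encoded, for instance via pd.get_dummies.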
4.4 Feature Selection
The top 5 ranked features (in descending order) are BMI, Wt, Product_Info_4,
Medical_Keyword_15, and Medical_History_23.
This means that these features have strong statistical dependencies with the
Response variable, i.e. they contribute significantly to reducing uncertainty in
the value of the Response variable.
Features that have low/zero MI scores indicate that they do not significantly
contribute towards reducing this uncertainty and are hence less useful for guiding
our predictions.
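The MI ranking described above could be reproduced with a sketch along these lines
(X_train and y_train again being assumed names for the prepared features and Response
labels):

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# MI between each feature and Response; higher scores = more informative features.
mi = mutual_info_classif(X_train, y_train, random_state=42)
mi_scores = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)
print(mi_scores.head(5))   # per the text, expect BMI, Wt, Product_Info_4, ... on top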
4.4.2. Multicollinearity analysis
The features listed above have very high VIF scores, which indicate a high level of
multicollinearity.
However, in the edge cases where some of these values tend towards infinity, they
can be discounted, as they represent dummy variables that are perfectly
anti-correlated.
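A sketch of how these VIF scores might be computed with statsmodels (the feature-matrix
name is assumed); perfectly anti-correlated dummy columns will indeed return infinite
values, matching the edge cases noted above:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# One VIF per column; values above ~10 are commonly read as strong multicollinearity.
vif = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
).sort_values(ascending=False)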
4.4.3. Principal Component Analysis
The first 40 PCs contain just over 80% of the cumulative variance in the
validation dataset. This means that we can still capture a significant majority
of the dataset's cumulative variance were we to use a lower-dimensionality
feature space rather than simply using all features together. We have now
reduced our dataset to 57 features, from a starting value of 126.
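The variance figures quoted above could be checked with a sketch like the following
(names assumed; the PCA is fitted on the standardized training features):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X_train)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_pcs = int(np.argmax(cum_var >= 0.80)) + 1   # smallest PC count holding >=80% variance
print(n_pcs, cum_var[n_pcs - 1])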
4.5. Data Modeling
4.5.1 Classification
Each model has a wide range of AUC values against each class, indicating a
varying degree of sensitivity/specificity across the dataset.
The two highest average AUC values were achieved by the gradient boosting
classifiers (model 11: macro-average AUC=0.84, and model 12: macro-average
AUC=0.83).
It is also worth noting that both of these models performed best when predicting
applicants for classes 3/4/8, as indicated by their high AUC scores (~0.9) for
these classes relative to the remaining groups.
The XGBClassifier model features a similarly broad range of AUC values against
each class, indicating a varying degree of sensitivity/specificity across the dataset.
Furthermore, the model has also performed relatively strongly when predicting
applicants for classes 3/4/8, as indicated by the high AUC scores (~0.9) for their
respective ROC curves - this is a good sign that our model is still able to generalise
well, even when handling previously unseen data.
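A sketch of how the macro-average AUC values quoted above could be computed, assuming
model is any of the fitted classifiers and X_valid/y_valid are the held-out data:

from sklearn.metrics import roc_auc_score

# One-vs-rest AUC per class, averaged with equal class weight ("macro").
proba = model.predict_proba(X_valid)
macro_auc = roc_auc_score(y_valid, proba, multi_class='ovr', average='macro')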
4.6.2. Classification Report
Each of the models displays somewhat mediocre performance for each class, with
the highest value consistently belonging to class 8. Each model also consistently
shows very high recall for class 8, and poorer values for the other groups.
The same trend can also be observed for the F1-score, indicating that the models
are strongly fitted towards predicting applicants in class 8.
For class 3, model 11's precision/recall/F1-score values are equal to 0, which means
that it did not correctly predict any applicants for this risk rating.
Model 12 showed a precision of 0.27, a recall of 0.05 and an F1-score of 0.08.
This does not necessarily mean that model 11 is wholly inaccurate and should not
be trusted altogether.
These are typically harsh metrics that can indicate where models are strongly
under/overfitting, as they evaluate accuracy in a "one vs. rest" fashion; if the
set of possible Response values were much smaller, e.g. by grouping classes 1-8
into Low/Med/High risk, then each model's performance would appear to improve
significantly.
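The per-class precision/recall/F1 figures above come from a standard classification
report; a minimal sketch under the same assumed names (zero_division=0 mirrors the zero
scores seen for class 3):

from sklearn.metrics import classification_report

print(classification_report(y_valid, model.predict(X_valid), zero_division=0))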
4.6.3. Confusion matrix
Most of the models appear to be fairly well fitted to the data, in that they are
capable of replicating the distribution of Response values reasonably well.
A majority of the missed cases are in fact relatively close to where they
should be (e.g. predicted as class 8 when actually class 7).
There are some notably underperforming models which appear to be highly
overfitted towards predicting applicants as belonging to class 8 (which
represents the largest proportion across all applicants in the dataset).
The confusion matrix for model 7 (SVC with sigmoid kernel) demonstrates
this trait extremely well: only a tiny proportion of its predictions were
made in any class apart from 8.
Models 3 and 5 (SVCs with linear and polynomial kernels, respectively)
also demonstrate this trend to a lesser extent as well.
Models 11 and 12, on the other hand, appear to generalize reasonably well
to the dataset, mimicking the distribution plot shown in response values.
Furthermore, the only notable drawback of these models that can be
observed is that they tend to predict some low-risk applicants (e.g. classes
1-2) as high-risk applicants (e.g. classes 6-8) with a higher than expected
frequency - see the top-right corners of each plot.
In terms of using either of these models in a real-world business scenario,
this would only mean that more effort/time is potentially wasted on
scrutinizing low-risk applicants further before offering a policy, rather than
treating high-risk applicants with a light touch and introducing unnecessary
risk into the insurer's portfolio.
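The confusion matrices discussed above could be produced with scikit-learn's built-in
display, sketched here under the same assumed names:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Rows = true risk class, columns = predicted class; mass near the diagonal
# corresponds to the "close misses" (e.g. class 7 predicted as class 8) noted above.
ConfusionMatrixDisplay.from_estimator(model, X_valid, y_valid)
plt.show()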
4.7 Attribute Importance
Inference
This plot shows that the top five most important features (in terms of Gain/model
contribution) are:
o Medical_History_23,
o Medical_History_4,
o Medical_Keyword_3,
o Medical_Keyword_15, and
o BMI.
These five features were also highly ranked within the Mutual Information score
chart, which lends some credibility to the view that these features may be closely
involved in governing what risk rating an applicant should be assigned.
The bottom five (i.e., least important) features in terms of Gain are:
o Medical_History_34,
o Product_Info_2_E1,
o KMeansCluster_4,
o Insurance_History_8, and
o Medical_History_41.
Four of these five features were also poorly ranked within the MI score chart;
however, KMeansCluster_4 was instead moderately ranked (residing within the top
15 features).
This implies that, whilst KMeansCluster_4 was initially deemed to show some
potential in terms of predictive power, the XGBClassifier does not value this
feature as highly when generating predictions for the test dataset.
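A sketch of how a gain-based importance chart like the one described could be drawn for
the fitted XGBoost model (the name xgb_model is assumed):

import matplotlib.pyplot as plt
from xgboost import plot_importance

# "gain" ranks features by their average contribution to the model's splits.
plot_importance(xgb_model, importance_type='gain', max_num_features=20)
plt.show()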
4.7.2. Permutation Importance
Inference
The top five features that appear to have the strongest impact on the model's
performance/accuracy when shuffled randomly are:
o Medical_Keyword_3
o Medical_History_39
o InsuredInfo_5
o Product_Info_2_D1
o Medical_History_17
Most of these features were also highlighted as having the highest feature
importance.
Interestingly, Medical_History_23 (the highest-scored feature in terms of
Feature Importance) does not show up in the top 20 permutation importance
weightings. This could imply that this feature does not have any meaningful
causal relationship with Response and has unintentionally been given
higher importance due to possible overfitting.
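A sketch of the permutation importance computation, again under assumed names; each
feature is shuffled several times and the resulting drop in score is averaged:

import pandas as pd
from sklearn.inspection import permutation_importance

result = permutation_importance(xgb_model, X_valid, y_valid,
                                n_repeats=10, random_state=42)
perm_ranking = pd.Series(result.importances_mean,
                         index=X_valid.columns).sort_values(ascending=False)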
4.7.3. SHAP Values
Inference
o The SHAP summary plot above provides us with a top-down view of feature
importance, for the top 20 features as calculated via the SHAP framework.
o The color of each dot represents whether that feature was high or low (for that row
in the dataset), and its horizontal location shows whether the effect of that value
caused a higher (towards 8) or lower (towards 1) prediction.
o For instance, the BMI feature clearly expresses that as the BMI of the applicant
increases, then their predicted risk rating also increases strongly.
o Product_Info_2_A6 - i.e. whether the applicant's selection for Product_Info_2 is
equal to A6 - shows a clear negative correlation with the predicted risk rating.
o Ins_Age shows a negative correlation when away from the baseline value (class 5)
but is more mixed as the value approaches the baseline.
o Family_Hist_4 appears to show a mixed effect on the value of Response regardless
of the choice of class.
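A minimal sketch of how such a SHAP summary (beeswarm) plot could be generated for the
tree-based model, assuming xgb_model and X_valid as before:

import shap

# TreeExplainer is suited to tree ensembles such as the XGBClassifier used here.
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_valid)
shap.summary_plot(shap_values, X_valid)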
CHAPTER 5
5.1 FINDINGS
5.3 CONCLUSION
This project has successfully achieved its objectives of analyzing historical data, extracting
valuable insights, and building an accurate risk classification model in the insurance
domain. Through thorough analysis, key risk factors impacting insurance risks were
identified, providing a deeper understanding of the underlying patterns and trends within
the data. The developed risk classification model accurately assigns policyholders to the
appropriate risk class, enhancing risk assessment and underwriting processes.
By leveraging advanced techniques, insurers can extract actionable insights that lead to
more accurate risk assessments and optimized underwriting processes. The risk
classification model developed in this project provides insurers with a competitive
advantage, enabling them to make informed strategic decisions and enhance overall
business performance.
In summary, this project demonstrates the value of data analytics in the insurance sector.
The insights gained from analyzing historical data and building an accurate risk
classification model offer significant potential for insurers to improve risk assessment,
underwriting efficiency, and overall profitability. By embracing data-driven approaches,
insurance companies can enhance their ability to understand and manage risks effectively,
ultimately leading to better outcomes for both insurers and policyholders.
APPENDIX: PYTHON CODE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
main_data = pd.read_csv('../Downloads/train.csv/train.csv')
print(main_data.dtypes)
main_data.describe()
# Assumed earlier step (not shown in this extract): index applicants by the Id column.
main_data_index_set = main_data.set_index('Id')

# Plot the distribution of the target risk classes.
plt.hist(main_data_index_set['Response'],
         bins=sorted(main_data_index_set['Response'].unique()))
plt.xlabel('Response')
plt.ylabel('# of Applicants')
plt.title('Response Distribution')
plt.show()
# Set up a subplot grid.
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(25, 15))
ColSet1_ProdInfo_kde = ['Product_Info_1', 'Product_Info_3', 'Product_Info_4',
                        'Product_Info_5', 'Product_Info_6', 'Product_Info_7']

# Produce kernel density estimate plots for each set of columns.
for i, column in enumerate(main_data_index_set[ColSet1_ProdInfo_kde].columns):
    sns.kdeplot(data=main_data_index_set, x=column,
                hue="Response", fill=True, common_norm=True, alpha=0.05,
                ax=axes[i // 3, i % 3])
# Produce a correlation matrix of the dataset - then create a mask to hide the
# upper-right half of the matrix.
corrs = main_data_index_set.corr(numeric_only=True)
mask = np.zeros_like(corrs, dtype=bool)   # boolean dtype so the mask works with seaborn
mask[np.triu_indices_from(mask)] = True

# The original notebook presumably went on to draw the masked heatmap, e.g.:
sns.heatmap(corrs, mask=mask, cmap='coolwarm', center=0)
plt.show()