
PREDICTING LIFE INSURANCE RISK CLASSES

USING MACHINE LEARNING


A PROJECT REPORT

Submitted by

MONIGA.S

3122218001063

in partial fulfillment for the award of the degree of

MASTER OF BUSINESS ADMINISTRATION

MBA (2021-2023)

SSN SCHOOL OF MANAGEMENT

Sri Sivasubramaniya Nadar College of Engineering

(An Autonomous Institution, affiliated to Anna University, Chennai)

Rajiv Gandhi Salai (OMR), Kalavakkam – 603 110

June 2023
BONAFIDE CERTIFICATE

Certified that the Project report titled PREDICTING LIFE INSURANCE RISK
CLASSES USING MACHINE LEARNING is the bonafide work of Ms. MONIGA S
3122218001063 who carried out the work under my supervision. Certified further that to
the best of my knowledge the work reported herein does not form part of any other project
report or dissertation on the basis of which a degree or award was conferred on an earlier
occasion on this or any other candidate.

Signature of Supervisor Signature of HOD

Submitted to Project Viva Voce held on ……………….

Internal Examiner External Examiner


DECLARATION

I hereby declare that the project entitled “PREDICTING LIFE INSURANCE RISK
CLASSES USING MACHINE LEARNING” submitted for the M.B.A. degree is my original
work and the dissertation has not formed the basis for the award of any degree, associateship,
fellowship, or any other similar title.

Place:

Date:

Signature of the Student

MONIGA S
ACKNOWLEDGEMENT
I am overwhelmed in all humbleness and gratitude to acknowledge my debt to
all those who have helped me put these ideas well above the level of simplicity and into
something concrete.
I express my sincere thanks to our honorable Director, Dr. K.
HARIHARANATH, for inculcating in us a passion for excellence in all our activities
including this project.
With an extreme sense of gratitude, I express my heartfelt thanks to my
project guide, Dr. GIRIJA, for guidance at every stage of the project and for providing
the necessary support and valuable ideas for the successful completion of the project.
At the outset, I owe my sincere gratitude to Mr. RAKESH BABU and Mr.
THIRUMURUGAN for giving me the opportunity to undergo the project at Medchrono
and for their constant support and guidance.
Finally, I thank all the faculty members of the department, my family, and my friends for
their invaluable help, without which this study would not have been accomplished.
SUMMARY OF THE PROJECT

Life insurance plays a vital role in providing financial security to individuals and their
families, making it essential for insurance companies to accurately evaluate the risk
associated with insuring an individual. Traditional methods of risk assessment in the life
insurance domain rely heavily on manual processes and subjective judgments, which can
be time-consuming and prone to human error.

This project seeks to overcome these limitations by employing machine learning


algorithms to analyze a wide range of data points, including demographic information,
medical records, lifestyle factors, and financial indicators. By training these algorithms on
historical data that includes information about policyholders and their corresponding risk
classes, the project aims to develop predictive models capable of accurately classifying
future applicants into appropriate risk categories.

The project will involve the collection and preprocessing of large datasets from diverse
sources, including insurance databases, medical records, and public data repositories.
Feature engineering techniques will be employed to extract relevant information and create
meaningful variables for training the machine learning models. Various algorithms such as
logistic regression, decision trees, and ensemble methods will be implemented and
evaluated to identify the most accurate and efficient approach for predicting risk classes.

The outcomes of this project have the potential to significantly improve the efficiency and
accuracy of risk classification in the life insurance industry. By automating the process
using machine learning, insurance companies can enhance their underwriting procedures,
streamline the application process, and provide more tailored insurance coverage to
policyholders. Ultimately, this project aims to contribute to the advancement of the life
insurance sector by harnessing the power of machine learning for risk assessment and
decision-making.

[Keywords: Predicting, Life insurance, Risk assessment, Machine learning, Underwriting]


TABLE OF CONTENTS

CHAPTER NO.    TITLE    PAGE NO.
1 INTRODUCTION
1.1 Industry Profile
1.2 Company Profile
1.3 Problem Statement
1.4 Need for the Study
2 REVIEW OF LITERATURE
3 RESEARCH OBJECTIVES AND METHODOLOGY
3.1 Research Objectives
3.2 Research Methodology
3.3 Data Collection
3.4 Data Understanding
3.5 Tools for Analysis
4 DATA ANALYSIS AND INTERPRETATION
4.1 Descriptive Analysis
4.2 Exploratory Data Analysis
4.3 Data preparation
4.4 Feature selection
4.5 Data Modeling
4.6 Performance Evaluation
4.7 Attribute Performance
5 SUMMARY OF FINDINGS AND SUGGESTION
5.1 Findings
5.2 Suggestions and Recommendation
5.3 Conclusion
LIST OF FIGURES

FIGURE NO. TITLE PAGE NO

1.1 Industry Size

1.2 Company Logo

3.1 CRISP-DM methodology

3.2 Jupyter Notebook

3.3 Python

4.1 Overview of the dataset

4.2 Distribution of target variable

4.3 KDE of Product Info

4.4 KDE of Applicant Info

4.5 KDE of Employment Info

4.6 KDE of Insured Info

4.7 KDE of Insurance History Info

4.8 KDE of Family History Info

4.9 KDE of Medical History Info

4.10 KDE of Medical Keyword Info

4.11 Heatmap Plot

4.12 Null Values

4.13 Distribution Before Imputation



4.14 Distribution After Imputation

4.15 Elbow method

4.16 MI values

4.17 VIF Scores

4.18 Cumulative Variance

4.19 L1 Regularization

4.20 Accuracy Score

4.21 ROC Curve

4.22 Classification Report

4.23 Confusion Matrix

4.24 Feature Importance

4.25 Permutation Importance

4.26 SHAP Values


CHAPTER-1
INTRODUCTION

1.1 INDUSTRY PROFILE


The Medico-Legal industry provides services pertaining to medical and legal research,
legal advice, and the summarization of medical records. It is a B2B business. The
term Medico-Legal combines the two main professions, medicine and law. The
medical profession has its own set of ethical guidelines and code of ethics. Legal
questions, on the other hand, must be determined by judges who are not medically
trained; they rely on the advice of experts and make decisions based on reasonableness
and prudence. The above-mentioned services are provided by medico-legal experts
who analyze the laws governing the medical sector.

Medico-legal service is the field of study and accumulation of materials that deals
with the application of medical knowledge to the administration of justice. Medicine
and law have been related from the earliest times; the bonds that first united them were
religion, superstition, and magic. The oldest of these written records, the Code of
Hammurabi, includes legislation pertaining to the practice of medicine, dating back to
the year 2200 B.C. It covered the topic of medical malpractice and set out for the first
time the concept of civil and criminal liability for improper and negligent medical care.
Penalties ranged from monetary compensation to cutting off the surgeon's hand, and
fees also were fixed. In 1955, a group of physicians and surgeons, some of whom were
also educated in the law, recognized the growing impact of legislation, regulations,
and court decisions on patient care and the general effect of litigation and legal
medicine on modern society.

Unfortunately, both professional groups suffer from far more ignorance of law and
medicine than is healthy. Lawyers who do not regularly deal with medical issues
in their legal practice know very little about the medical profession and its
problems, while physicians frequently have a poor understanding of the law and
how it affects their professional practice.

Medico-legal experts are trained in legal perspectives and possess a good knowledge of
the health sector; they prepare logical summary reports that support clients
during the various stages of court cases until a settlement is reached. They act as a
bridge between lawyers and medical professionals. Due to client activism, there is an
increase in litigation such as personal injury, medical malpractice, and mass torts. The
regulatory framework governing medical practice is witnessing frequent changes. This
leads to demand for professional assistance and thereby aids the growth of
medico-legal services worldwide.

1.1.1 Medico Legal Cases


o A medico-legal case can be defined as a case of injury or ailment, etc., in which
investigation by the law-enforcing agencies is essential to fix the responsibility
regarding the causation of the injury or ailment. Some of the medico-legal cases
are:
o All cases of injuries and burns, the circumstances of which suggest the commission
of an offense by somebody (irrespective of suspicion of foul play).
o All vehicular, factory, or other unnatural accident cases, especially when there is a
likelihood of the patient's death or grievous hurt.
o Cases of suspected or evident sexual assault or evident criminal abortion.
o Cases of unconsciousness where the cause is not natural or not clear.
o All cases of suspected or evident poisoning or intoxication.
o Any other case not falling under the above categories but having legal implications.
The procedure for registering a medico-legal case is as follows:
o Treatment - legal formalities continue until the patient is revived.
o Identification - whether the registered case falls under a medico-legal case or not.
o Intimation to police - if the case is a medico-legal case, then the victim must report
it to the police.
o Acknowledgement receipt - received from the police for future reference.
1.1.2 Summarization of medical reports by Medico Legal Experts

o Reports must be prepared in duplicate on the proper pro-forma, giving all necessary
details.
o Avoid abbreviations and overwriting. Any correction should be initialed with date
and time.
o Reports must be submitted to the authorities promptly.
o Medico-legal documents should be stored under safe custody for 10 years.
o Record the age, sex, father's name, complete address, date and time of reporting,
time of the incident, and by whom the patient was brought.
o Record identification marks and finger impressions.
o All MLCs must be reported to the police for taking legal evidence.
o If the patient is dying, inform the magistrate to record the 'dying declaration'.

1.1.3 Industry Size and Growth

Between 2017 and 2022, the market for medico-legal services expanded considerably, and it
is anticipated to continue growing through 2029 at a steady CAGR. During the projection
period from 2021 to 2030, it is expected that the global medico-legal services market will
experience considerable expansion.

Fig 1.1 Industry size

The global medico-legal services market accounted for US$ 1,156,284.5 thousand in
2021 and is expected to grow at a CAGR of 9.3% during the forecast years (2021-2029).
1.1.4 Industry Players
The medico-legal industry is a B2B industry. In 2021, the market was
growing at a steady rate, and with the rising adoption of strategies by key players, the
market is expected to rise over the projected horizon. The medico-legal services professional
market from 2022 to 2026 is primarily split into Service, Solution, Large Enterprises, and
Small & Mid-Sized Enterprises. Some of the major players in the global medico-legal industry
are Apex Medico Limited, Clinical Partners, Exigent, FORENSICDX, and MAPS Medical
Reporting.

1.2 COMPANY PROFILE

Fig 1.2 Company logo

Medchrono is an outsourcing and offshoring consulting firm, founded in 2021 in
Chennai. The firm provides an easy and simplified review process for legal experts, and its
professional team helps to improve clients' efficiency while cutting costs. Its
clients are US-based lawyers. Medchrono believes in exceeding customers'
expectations and understands that the need of each customer is unique and requires
particular attention. As a result, 'MC' has assembled a mix of highly talented
professionals who are experts in their respective fields and who support clients in all
their efforts.

The services include market-based solutions in the major areas of personal injury,
medical malpractice, and mass tort. The firm has been meeting its clients' needs for over
a decade. Its consultants ensure that they explore every available option to meet client
needs and work with clients in a timely manner. They effectively
deliver medical indexes, timelines, abstracts, and expert opinions, and their solutions
save customers hundreds to thousands of dollars on a complex
medical file.

1.2.1 Service offered

o Medical Chronology
o Narrative Summary
o Deposition Summary
o Plaintiff Fact Sheet
o Demand Letter
o Medical Synopsis
o Medical Opinion
o Jury Questionnaire
o Value Added Services
o Additional / Technical Services

1.2.2 Competitor Analysis

Top Competitors

1. Datascribe LPO is one of the leading alternative legal services
providers in the industry, delivering legal, shared, and advisory
services from secure centers in Bangalore, India, to top law firms and
corporations in the US and Canada. The services offered are litigation
support, legal and immigration support, and medico-legal services.
2. Legacore is a global legal support services company delivering
medico-legal and secretarial/administrative services. The
services offered are litigation support and medico-legal services.
3. Premier Medical Review is a physician- and attorney-owned and
operated firm known for delivering high-caliber medico-legal
consulting to attorneys across the U.S. The services offered are
medical device consulting, pharmaceutical consulting, and medical
malpractice consulting.
4. Broadspire Rehabilitation provides customized, integrated claims
solutions to clients across the globe through industry-leading
expertise, innovative technology, and an unrelenting focus on
continuous improvement.

1.3 PROBLEM STATEMENT

The problem is to develop a predictive model using the Prudential Insurance dataset
to accurately predict the risk class of an insurance policyholder. The risk class is a
categorical variable that represents the level of risk associated with a policyholder,
which is determined based on various demographics, health, and other relevant
factors.

1.4 NEED FOR THE STUDY

Determining risk classes or risk scores for insurance applicants is essential for
insurers. It helps them set appropriate premium rates and coverage based on the level
of risk associated with each applicant. This ensures fairness and actuarial soundness in
the insurance industry.

Informed underwriting decisions are crucial for insurers to effectively manage their
business operations. By thoroughly evaluating the risks posed by potential
policyholders, insurers can decide whether to accept, decline, or modify coverage.
Factors such as health conditions, occupation, lifestyle choices, and claims history are
analyzed to mitigate potential risks and provide suitable coverage options.

Segmenting the customer base allows insurers to tailor their offerings to different risk
groups. By categorizing policyholders into specific segments, insurers can design
customized insurance products that meet the unique needs of each group. This
targeted approach ensures that customers receive appropriate benefits while
optimizing profitability and managing risk exposure effectively.
Managing profitability and risk exposure is a fundamental aspect of insurance
companies' operations. Accurate risk assessment and classification help insurers price
their policies appropriately, minimizing the chances of financial losses due to adverse
selection. By effectively managing profitability and risk exposure, insurers can
safeguard their financial stability and fulfill policyholder claims.

CHAPTER 2
LITERATURE REVIEW

1. Goleiji L, Tarokh M (2015) Identification of influential features and fraud detection in the Insurance Industry using data mining techniques.

Summary:

Data analytics is increasingly important in the insurance industry due to the growing
volume of transactional data. Insurance firms can benefit from customer-level analytics,
risk assessment, and fraud detection. Techniques like clustering and classification help
identify patterns and predict future events. Implementing analytics enables better customer
understanding, loss minimization, and gaining a competitive edge. The paper provides a
review of various algorithms used in insurance data analysis, along with evaluations of
different approaches.

Relevance:

This study helps us to understand the data mining techniques used to detect fraud among
insurance firms, which is a crucial issue because the companies face great losses.

2. Cummins J, Smith B, Vance R, Vanderhel J (2013) Risk Classification in Life Insurance, 1st edn. Springer, New York.

Summary:

This research aims to improve risk assessment in life insurance by using predictive
analytics. Real-world data with multiple attributes is analyzed, and dimensionality
reduction techniques like CFS and PCA are applied. Machine learning algorithms,
including Multiple Linear Regression, Artificial Neural Network, REPTree, and Random
Tree, are used to predict applicant risk levels. REPTree performs best with the lowest MAE
and RMSE values for CFS, while Multiple Linear Regression excels with PCA. Overall,
the study shows that predictive analytics can enhance risk assessment in the life insurance
industry.

Relevance:

This study helps us to understand how to group customers according to their estimated level
of risk, determined from their historical data in insurance firms.

3. Priyanka Jindal & Dharmender Kumar (2017), “A Review on Dimensionality Reduction Techniques”, International Journal of Computer Applications.

Summary:

The paper explores feature selection and feature extraction techniques to address the
challenges of high-dimensional data. It emphasizes the importance of removing redundant
and irrelevant features to improve mining performance and learning accuracy. The analysis
focuses on popular techniques, highlighting their benefits and challenges. The paper aims
to provide beginners with valuable insights into these algorithms, aiding their
understanding of how these techniques can enhance data analysis in the context of high-
dimensional data.

Relevance:

This study gives insights into feature selection and feature extraction techniques, which are
used as a preprocessing step for reducing data dimensionality.

4. Chinnaswamy A, Srinivasan R (eds) (2017) Performance analysis of classifiers on filter-based feature selection approaches on microarray data. Bio-Inspired Computing for Information Retrieval Applications.

Summary:

This paper explores filter-based feature selection methods in machine learning, including
Information Gain and Correlation coefficient. It demonstrates that these approaches
effectively reduce the number of gene expression levels, resulting in improved
classification accuracy. Five classification problems are evaluated, and the results show
that the selected gene subset outperforms the raw data. Additionally, Correlation Based
Feature Selection achieves higher accuracy with fewer genes compared to the Information
Gain approach.

Relevance:

This study gives insights into a feature selection method that is easy to understand and fast
to execute, removes noisy data, and improves the performance of algorithms.

5. Sudhakar M, Reddy C (2016) Two step credit risk assessment model for retail bank
loan applications using Decision Tree data mining technique.

Summary:

This paper highlights the significance of data mining techniques in the banking industry,
particularly for credit risk management. It introduces a prediction model utilizing Decision
Tree Induction Data Mining Algorithm to identify trustworthy loan applicants. By
analyzing customer data, banks can make informed decisions on loan approvals, mitigating
risks and enhancing lending processes. The model serves as a valuable tool for
organizations seeking to improve loan decision-making and maximize profitability in the
competitive banking market.

Relevance:

This study gives insights into decision trees, a widely used machine learning technique for
prediction that has been implemented in several studies.

6. Mertler C, Reinhart R (2016) Advanced and Multivariate Statistical Methods, 6th edn. Routledge, New York.

Summary:

The study focuses on understanding the handling of attributes with a high proportion of
missing data in statistical analysis. Specifically, it suggests that attributes exhibiting more
than 30% missing data should be excluded from the analysis. By removing these attributes,
the study aims to ensure the integrity and reliability of the analysis by minimizing the
impact of incomplete or unreliable data.

Relevance:
This study helps us to understand why attributes showing more than 30% missing data
should be dropped from the analysis.

7. J. Zhou, S. Zhang, Q. Lu, W. Dai, M. Chen, X. Liu, et al., A Survey on Federated Learning and its Applications for Accelerating Industrial Internet of Things, 2021.

Summary:

This paper discusses the use of federated learning (FL) to bring collaborative intelligence
to industries lacking centralized training data. It addresses security concerns and aims to
accelerate Industry 4.0 on the edge computing level. The study defines FL terminologies,
presents a framework, explores FL research advancements, and discusses the economic
impacts. It proposes a FL-transformed manufacturing paradigm, outlines future research
directions, and suggests potential applications in the industry 4.0 domain. Overall, the
paper highlights FL's potential in enabling secure and efficient data intelligence
collaboration across industries.

Relevance:

This study gives insights into how FL tries to bring collaborative intelligence into the
industry without centralizing training data, along with an overall increase in performance
across the entire industry.

8. Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen and H. Yu, Federated Learning, Morgan & Claypool Publishers, 2020.

Summary:

This study provides insights into the categorization of federated learning (FL) based on
data features and sample space. FL can be classified into two types: horizontal FL and
vertical FL. Horizontal FL involves training models on different data samples from various
devices, while vertical FL focuses on collaborating on different features of the data.
Understanding these distinctions helps in better comprehending the applications and
implications of FL in different contexts.

Relevance:
This study helps us to understand that, based on the features and the sample space of the
data, federated learning can further be categorized into two types, namely horizontal FL and
vertical FL.

9. Josep Lledó, Jose M. Pavía (2022), Dataset of an actual life-risk insurance portfolio, Data in Brief 45 (2022).

Summary:

This paper introduces a dataset extracted from a real-life risk insurance portfolio,
containing information on 76,102 policies and 15 variables. It highlights the dataset's
significance in teaching and research, enabling the development of pricing systems,
evaluation of marketing strategies, portfolio analysis, regulatory compliance, and
benchmarking. Previous studies have utilized the dataset to compare pricing methodologies
and assess the impact of managing catastrophic risks on solvency capital requirements in
life insurance. Overall, the dataset serves as a valuable resource for various insurance-
related analyses and investigations.

Relevance:

This study helps us to understand that the risk structure to which an insurance company is
exposed can be deduced by reviewing its customer database.

10. Bhalla A (2012) Enhancement in predictive model for insurance underwriting. Int
J Computer Sci Eng Technology

Summary:

The underwriting process in the insurance industry is crucial for assessing risks accurately.
Predictive Analytics uses statistical data to analyze and assign scores to applications,
identifying low-risk applicants. This paper proposes incorporating predictive models into
underwriting to streamline and optimize the process. By focusing on applications with
scores surpassing a defined threshold, underwriters can efficiently evaluate applicants. This
methodology aims to improve the effectiveness and efficiency of underwriting operations,
ensuring accurate risk assessment and better decision-making.

Relevance:
The study proposes integrating predictive models into underwriting to enhance efficiency
and accuracy in risk assessment for insurance applications. By leveraging Predictive
Analytics, underwriters can streamline the process and make informed decisions.

CHAPTER-3

RESEARCH OBJECTIVES & METHODOLOGY

3.1 RESEARCH OBJECTIVES

Primary Objective:

To create a machine learning model that can classify policyholders into the correct risk
class, which can help insurance companies in better risk assessment, pricing, and
underwriting decisions.

Secondary Objective:

• To analyze historical data and identify key risk factors that impact
insurance risk, such as demographics and claims data.

• To extract insights and patterns from large and complex insurance data set.

• To build a model to classify policyholders into the correct risk class.

3.2. Research Methodology

The Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology is a process aimed
at increasing the use of data mining over a wide variety of business applications and
industries. The intent is to take case-specific scenarios and general behaviors and make them
domain-neutral. CRISP-DM comprises six steps that an entity must implement in order to
have a reasonable chance of success. The six steps are shown in the
following diagram:

1. Business Understanding: This stage is the most important because this is
where the intention of the project is outlined. Foundational Methodology
and CRISP-DM are aligned here. It requires communication and clarity.
The difficulty here is that stakeholders have different objectives, biases, and
modalities of relating information; they do not all see the same things, or in
the same manner. Without a clear, concise, and complete perspective of the
project goals, resources will be needlessly expended.
2. Data Understanding: Data understanding relies on business understanding.
Data is collected at this stage of the process. The understanding of what the
business wants and needs will determine what data is collected, from what
sources, and by what methods. CRISP-DM combines the stages of Data
Requirements, Data Collection, and Data Understanding from the
Foundational Methodology outline.

Fig 3.1 CRISP-DM methodology

3. Data Preparation: Once the data has been collected, it must be transformed
into a useable subset unless it is determined that more data is needed. Once
a dataset is chosen, it must then be checked for questionable, missing, or
ambiguous cases. Data Preparation is common to CRISP-DM and
Foundational Methodology.
4. Modeling: Once prepared for use, the data must be expressed through
appropriate models that give meaningful insights and, hopefully, new
knowledge. This is the purpose of data mining: to create knowledge and
information that has meaning and utility. The use of models reveals patterns
and structures within the data that provide insight into the features of
interest. Models are selected on a portion of the data and adjustments are
made if necessary. Model selection is an art and a science. Both Foundational
Methodology and CRISP-DM then lead into the subsequent stage.
5. Evaluation: The selected model must be tested. This is usually done by
running the trained model on a preselected test set, which allows you to see
the effectiveness of the model on data it treats as new. Results from this are
used to determine the efficacy of the model and foreshadow its role in the
next and final stage.
6. Deployment: In the deployment step, the model is used on new data
outside of the scope of the dataset and by new stakeholders. The new
interactions at this phase might reveal the new variables and needs for the
dataset and model. These new challenges could initiate revision of either
business needs and actions, or the model and data, or both.

CRISP-DM is a highly flexible and cyclical model. Flexibility is required at each step
along with communication to keep the project on track. At any of the six stages, it may be
necessary to revisit an earlier stage and make changes. The key point of this process is that
it’s cyclical; therefore, even at the finish you are having another business understanding
encounter to discuss the viability after deployment.

3.3 DATA COLLECTION

Target population of study – US based Insurance applicants

Type of Data - Secondary Data

No. of samples: 59,381 instances and 128 attributes

Data Collection method - Published Academic sources.

Secondary data for analysis and modeling is obtained from a published academic data
source released by Prudential, one of the largest issuers of life insurance in the USA. Both
dependent and independent variables are found in the dataset.

3.4 DATA UNDERSTANDING


The dataset contains various attributes such as age, height, weight, medical history, and
other demographic information of insurance applicants. The goal is to predict the risk class
of the applicants based on these attributes. The data understanding process would involve
exploring the variables, assessing data quality, checking for missing values, identifying any
patterns or correlations, and gaining a comprehensive understanding of the dataset's
characteristics before proceeding with further analysis and modeling. Customer
segmentation is used to group/classify the customers. The ML algorithm uses a set of
independent variables or predictors to determine the dependent variable or target.

Independent Variables (IV) or Predictors:

 Product_Info_1-7
 Ins_Age
 Ht
 Wt
 BMI
 Employment_Info_1-6
 InsuredInfo_1-6
 Insurance_History_1-9
 Family_Hist_1-5
 Medical_History_1-41
 Medical_Keyword_1-48
Dependent Variable (DV) or Target:
 Response

3.5 TOOLS FOR ANALYSIS

Statistical tool/platform for Data Analysis and Modeling: Jupyter Notebook (Anaconda)
Statistical language for Data Analysis and Modeling: Python
Machine Learning Algorithms for Modeling: Logistic Regression, Gaussian Naïve Bayes,
Support Vector Classifier, Decision Tree Classifier, Random Forest Classifier,
AdaBoost Classifier, Gradient Boosting, XGBoost Classifier
Python Libraries for Data Analysis and Visualization: pandas, NumPy, seaborn, matplotlib
Python Libraries for Machine Learning: scikit-learn

3.5.1 Statistical tool and language

JUPYTER NOTEBOOK

Fig 3.2 Jupyter Notebook


The Jupyter Notebook is an open-source web application that allows the user to create and
share documents that contain live code, equations, visualizations, and narrative text. Uses
include data cleaning and transformation, numerical simulation, statistical modelling,
data visualization, machine learning, and much more.
PYTHON
Fig 3.3 Python
Python is a high-level, general-purpose, and very popular programming language. Python
(latest Python 3) is used in web development and machine learning applications, along with
other cutting-edge technology across the software industry.

3.5.2 Machine Learning

Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy. It is an important component of the growing field of data science.
Using statistical methods, algorithms are trained to make classifications or predictions,
uncovering key insights within data mining projects. These insights subsequently drive
decision making within applications and businesses, ideally impacting key growth metrics.
As big data continues to expand and grow, the market demand for data scientists will
increase, requiring them to assist in the identification of the most relevant business
questions and subsequently the data to answer them.

1. Logistic Regression: Logistic Regression is a popular classification algorithm that is


used to predict binary or categorical outcomes. It models the relationship between the
dependent variable and one or more independent variables using a logistic function, which
enables the calculation of probabilities and classification decisions based on a set of input
features.

2. Gaussian Naïve Bayes: Gaussian Naïve Bayes is a simple yet effective classification
algorithm that assumes the independence of features and follows the Bayes' theorem. It is
particularly useful for continuous feature data and is based on the assumption that the
features are normally distributed. It calculates the probabilities of different classes and
assigns the input to the class with the highest probability.

3. Support Vector Classifier: Support Vector Classifier (SVC) is a machine learning


algorithm that performs classification by constructing hyperplanes in a high-dimensional
feature space. It aims to find the optimal hyperplane that maximally separates the different
classes. SVC is effective in handling both linearly separable and non-linearly separable
datasets by using different kernel functions to transform the data into a higher-dimensional
space.

4. Decision Tree Classifier: Decision Tree Classifier is a versatile and interpretable


algorithm that builds a tree-like model for classification tasks. It splits the dataset based on
the values of different features and creates decision rules to assign the classes. Each
internal node represents a feature test, and each leaf node represents a class prediction.
Decision trees can handle both categorical and numerical features and are capable of
capturing complex relationships in the data.

5. Random Forest Classifier: Random Forest Classifier is an ensemble learning method


that combines multiple decision trees to make predictions. It creates a set of decision trees
using random subsets of the training data and features, and then aggregates their
predictions to determine the final class. Random Forest Classifier reduces overfitting,
improves accuracy, and handles high-dimensional data effectively.

6. AdaBoost Classifier: AdaBoost Classifier is an ensemble learning algorithm that


iteratively trains weak classifiers on different weighted versions of the training data. It
assigns higher weights to misclassified instances, allowing subsequent classifiers to focus
on the more challenging samples. The final prediction is made by combining the
predictions of all weak classifiers, with more weight given to classifiers with higher
accuracy.

7. Gradient Boosting: Gradient Boosting is an ensemble learning technique that builds a


sequence of models, each correcting the mistakes made by the previous model. It
minimizes a loss function by adding weak learners in a forward stage-wise manner.
Gradient Boosting is known for its high predictive power and is particularly effective in
handling heterogeneous data and capturing complex interactions.

8. XGBoost Classifier: XGBoost Classifier is an optimized implementation of the gradient


boosting algorithm. It incorporates additional regularization techniques and advanced
features to improve performance and speed. XGBoost is highly scalable, efficient, and
often used in machine learning competitions due to its ability to handle large datasets,
high-dimensional features, and provide accurate predictions.
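
To illustrate how these algorithms can be compared in practice, the following minimal sketch fits each of the classifiers listed above on a train/validation split and reports its accuracy. The file name, column names, and preprocessing assumed here (a fully numeric DataFrame with a Response target) are illustrative assumptions, not the exact code used in this project.

    # Sketch: comparing the candidate classifiers on a hold-out split.
    # Assumes a preprocessed, fully numeric DataFrame saved as "train_preprocessed.csv"
    # (hypothetical file name) with a "Response" target column.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                                  GradientBoostingClassifier)
    from xgboost import XGBClassifier  # assumes the xgboost package is installed

    df = pd.read_csv("train_preprocessed.csv")
    X = df.drop(columns=["Response"])
    y = df["Response"] - 1                      # shift labels 1-8 to 0-7 so XGBoost accepts them
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Gaussian Naive Bayes": GaussianNB(),
        "Support Vector Classifier": SVC(),     # can be slow on the full dataset
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
        "AdaBoost": AdaBoostClassifier(random_state=42),
        "Gradient Boosting": GradientBoostingClassifier(random_state=42),
        "XGBoost": XGBClassifier(eval_metric="mlogloss"),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        print(f"{name}: accuracy = {model.score(X_val, y_val):.4f}")

The validation split and hold-out accuracy used here mirror the comparison reported later in Chapter 4.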

3.5.3 Feature Selection tools

1. Mutual Information (MI): Mutual Information is a feature selection technique that


measures the dependency between two variables, assessing how much information one
variable provides about the other. In the context of feature selection, it calculates the
relevance between each feature and the target variable. Features with high mutual
information are considered to have more predictive power and are selected for further
analysis.

2. Multicollinearity analysis: Multicollinearity analysis is a technique used to detect and


address high correlation among predictor variables. It assesses the linear relationship
between variables, aiming to identify instances where predictors are highly correlated with
each other. Multicollinearity can negatively impact model interpretation and performance,
so this analysis helps in selecting a subset of independent variables that are minimally
correlated with each other.

3. PCA (Principal Component Analysis): PCA is a dimensionality reduction technique that


transforms a set of correlated variables into a smaller set of uncorrelated variables called
principal components. It identifies the directions in the data that explain the most variance
and projects the original variables onto these components. By selecting a subset of
principal components that capture most of the variance, PCA helps to reduce the
dimensionality of the feature space while retaining as much information as possible.
4. Lasso (L1) Regularization: Lasso regularization is a method used for feature selection
and regularization in linear regression models. It adds a penalty term to the loss function,
which encourages the model to shrink the coefficients of irrelevant or less important
features to zero. Lasso regularization selects features by promoting sparsity and helps in
identifying the most relevant predictors while effectively reducing the impact of irrelevant
or redundant variables.

CHAPTER 4

DATA ANALYSIS AND INTERPRETATION

4.1 DESCRIPTIVE ANALYSIS

4.1.1 Reading the Dataset

Fig 4.1 Overview of the dataset


Inference

 There are a variety of data types within our data frame.


 Any columns with object dtype contain non-numerical (character) data, which
will need to be pre-processed in order for these to be machine interpretable.
 There is a mixture of numeric-valued features that are both normalized and non-
normalized.
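
A brief sketch of how such an overview can be produced is given below; the file name train.csv (the published Prudential training file) is an assumption.

    # Sketch (assumed file name): load the dataset and inspect its structure.
    import pandas as pd

    df = pd.read_csv("train.csv")        # Prudential life insurance training data (assumed path)
    print(df.shape)                      # roughly (59381, 128) instances x attributes
    print(df.dtypes.value_counts())      # mixture of int64, float64 and object columns
    print(df.head())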

4.1.2 Target variable


Fig 4.2 Distribution of target variable
Inference
 The distribution is unbalanced and skewed towards classes 6-8.
 Classes 1-2 also account for a notable proportion of the dataset.
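
A distribution plot of this kind can be produced, for example, as in the sketch below (assumes the DataFrame df loaded earlier).

    # Sketch: inspect and plot the distribution of the target variable.
    import seaborn as sns
    import matplotlib.pyplot as plt

    print(df["Response"].value_counts().sort_index())   # counts per risk class 1-8
    sns.countplot(x="Response", data=df)
    plt.title("Distribution of the Response variable")
    plt.show()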

4.2 EXPLORATORY DATA ANALYSIS

4.2.1 Kernel Density Estimation

4.2.1.1 Product Info Column set


Fig 4.3 KDE of Product Info
Inference
 These KDE plots show distributions with varying modality; however, the
consistent trend is that they all closely overlap between each
Response group/cohort of applicants, with no major difference in relative
densities.
 As a result, any variation in these features is unlikely to individually help
towards predicting an applicant's risk rating.
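
The KDE plots used throughout this section can be generated with seaborn; a minimal sketch for one column set is shown below, assuming the DataFrame df and seaborn version 0.11 or later.

    # Sketch: KDE of each numeric Product_Info column, split by Response group.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    product_cols = [c for c in df.columns
                    if c.startswith("Product_Info") and pd.api.types.is_numeric_dtype(df[c])]
    fig, axes = plt.subplots(nrows=len(product_cols), figsize=(8, 3 * len(product_cols)))
    for ax, col in zip(axes, product_cols):
        sns.kdeplot(data=df, x=col, hue="Response", common_norm=False, ax=ax)
        ax.set_title(f"KDE of {col} by Response class")
    plt.tight_layout()
    plt.show()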

4.2.1.2 Applicant Info Column set

Fig 4.4 KDE of Applicant Info


Inference
 Ins_Age: This KDE plot displays significant variation in each Response
group distribution's composition and structure, where each peak exhibits
both broadening and shouldering. Whilst most of the distribution density is
spread between x=0 and x=0.9, each cohort's distribution does vary
somewhat in shape and skew - for instance, class 8 features positive skew
towards x=0.2 whereas most of the other classes are negatively skewed
towards x=0.4.
 Ht: This KDE plot displays significant variation in each Response group
distribution's composition and structure, where each peak exhibits both
broadening and shouldering. Most of the distribution density is spread
between x=0.6 and x=0.8, and each cohort's skew is similar to that shown in
Ins Age (i.e., positive for class 8, negative for most others).

 Wt: This KDE plot displays significant variation in each Response group
distribution's composition and structure, where each peak exhibits both
broadening and shouldering. Much of the distribution density is spread
between x=0.2 and x=0.5. The distribution for class 8 is notably centered
around low values of Wt (x=0.2), whereas the remainder of the population
is mostly spread across the interval between x=0.2 and x=0.5.
 BMI: This KDE plot displays significant variation in each Response group
distribution's composition and structure, where each peak exhibits both
broadening and shouldering. Most of the distribution density is spread in a
similar fashion to Wt - class 8's peak is centered at x=0.4 whereas most of
the other classes' distributions are spread across the interval between x=0.4
and x=0.6. However, one of the "medium" risk-rating distributions features
a notably stronger skew in its distribution compared to its peers, with
significant negative skew towards x=0.6 rather than towards x=0.5
4.2.1.3 Employment Info Column set

Fig 4.5 KDE of Employment Info


Inference
 These KDE plots show distributions with varying modality; however, the
consistent trend is that they all closely overlap between each
Response group/cohort of applicants, with no major difference in relative
densities.
 As a result, any variation in these features is unlikely to individually help
towards predicting an applicant's risk rating.

4.2.1.4 Insured Info Column set


Fig 4.6 KDE of Insured Info
Inference
 These KDE plots show distributions with varying modality; however, the
consistent trend is that they all closely overlap between each
Response group/cohort of applicants, with no major difference in relative
densities.
 As a result, any variation in these features is unlikely to individually help
towards predicting an applicant's risk rating.

4.2.1.5 Insurance history Info Column set


Fig 4.7 KDE of Insurance History Info
Inference
 These KDE plots show distributions with varying modality; however, the
consistent trend is that they all closely overlap between each
Response group/cohort of applicants, with no major difference in relative
densities.
 As a result, any variation in these features is unlikely to individually help
towards predicting an applicant's risk rating.

4.2.1.6 Family history Info Column set


Fig 4.8 KDE of Family History Info
Inference
 Family_Hist_1: This KDE plot shows a bimodal distribution comprised of two
peaks at x=2 and x=3 (along with a very low-density curve at x=1), with the most
prominent peak at x=3. However, as this feature appears to show little variation in
relative densities between each of the Response classes, any variation in this feature
is unlikely to individually help towards predicting an applicant's risk rating.
 Family_Hist_2 - Family_Hist_5: These KDE plots show a number of unimodal
distributions which display some variation in terms of each Response group
distribution's composition and structure, where each peak exhibits both broadening
and shouldering. Whilst the majority of the distributions' densities are spread
between x=0.2 and x=0.8, each cohort's distribution does vary somewhat in shape
and skew/kurtosis - for instance, class 8 generally features more positive kurtosis
than most of the other classes' distributions, which tend to show broader density
plots.

4.2.1.7 Medical history Info Column set


Fig 4.9 KDE of Medical History Info
Inference
 A majority of these KDE plots feature distributions that closely overlap between
each Response group/cohort of applicants, and hence any variation in the
underlying features is unlikely to individually lend any predictive power for
determining an applicant's risk rating. However, there are a handful of notable
exceptions.
 Medical_History_2/15/24: These KDE plots appear to show multimodal
distributions that feature some degree of predictive distinction in terms of
variance, as each Response group distribution's peaks exhibit different levels of
broadening and shouldering. However, it is important to note the scales of the y-
axes here - the densities for each underlying distribution are very small and provide
little in the way of allowing each feature to individually help in distinguishing
between each Response group.
 Medical_History_10: This KDE plot seems to display some interesting features.
For the low-risk applicant cohorts, the plots appear to show distributions that are
bimodal, however at higher risk levels the distribution tends towards having a
single peak instead. Unfortunately, as will be shown in further detail later on, this
column features a very high proportion of missing values and thus its distribution/s
should not be misconstrued as highly predictive.
 Medical_History_23: This KDE plot helps to illustrate that this feature does show
some potential. As the Response value/risk rating increases, each peak in the
bimodal distribution becomes sharper as the degree of kurtosis becomes more
positive. In simpler terms, having a value (within this column) that is further away
from the peaks' centres would tend to correlate with having a lower risk rating,
whereas values that overlap closely with the peaks tend to represent applicants with
higher risk ratings.
4.2.1.8 Medical Keyword Info Column set

Fig 4.10 KDE of Medical Keyword Info


Inference

 These KDE plots show distributions with varying modality; however, the
consistent trend is that they all closely overlap between each
Response group/cohort of applicants, with no major difference in relative
densities.
 As a result, any variation in these features is unlikely to individually help
towards predicting an applicant's risk rating.
4.2.2 Correlation plots

Fig 4.11 heatmap plot for all the features


Inference

 Column Set 1 - Product Info: These appear to show little interaction/correlation


with a majority of the other feature sets, with the exception of Employment_Info_1,
Employment_Info_5 and Insured_Info_6 - these columns may be directly
correlated as, to give an example, an applicant's employment/financial status will
have an impact on what type of policy/product the applicant is applying for.
 Column Set 2 - Applicant Info: These columns show a varying range of interactions
with the other feature sets; the strongest anti-/correlations (excluding those within
the same column set) are between a handful of the Family_Hist columns as well as
Insured_Info_6.
 Column Set 3 - Employment Info: With the exception of two strong anti-
correlations - between Employment_Info_2 and Employment_Info_3, plus between
Employment_Info_5 and Product_Info_3 - and a couple of moderate interactions
between Employment_Info_6 and Family_Hist_2 / Family_Hist_4, this column set
does not interact very strongly with the rest of the features.
 Column Set 4 - Insured Info: The column InsuredInfo_2 shows a fairly strong
correlation with InsuredInfo_7, and also a strong anticorrelation with some of the
Applicant Info columns; otherwise, this column set does not interact very much
with the rest of the other features.
 Column Set 5 - Insurance History Info: This feature set exhibits several strong
inter-correlations with other Insurance_History columns but does not interact very
much with the rest of the features.
 Column Set 6 - Family History Info: The columns Family_Hist_2 and
Family_Hist_4 show a very strong positive correlation with Ins_Age, and also with
Medical_History_10 and Medical_History_15 to a lesser degree.
 Column Set 7 - Medical History Info: This column set shows a number of
correlation hotspots against several Medical_Keyword columns, as well as against
Ins_Age and some of the Family_Hist columns.
 Column Set 8 - Medical Keyword Info: These columns show a number of
correlation hotspots against several Medical_History columns, but do not otherwise
show any notable interactions with the rest of the features.
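
A heatmap of this kind can be produced as in the sketch below (assumes the DataFrame df; correlations are computed on numeric columns only).

    # Sketch: Pearson correlation heatmap across all numeric features.
    import seaborn as sns
    import matplotlib.pyplot as plt

    corr = df.select_dtypes("number").corr()
    plt.figure(figsize=(18, 14))
    sns.heatmap(corr, cmap="coolwarm", center=0, xticklabels=False, yticklabels=False)
    plt.title("Correlation heatmap of all numeric features")
    plt.show()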
4.3 Data Preparation

4.3.1. Locating/handling missing values

Fig 4.12 Null Values

Inference

 Columns whose proportion of missing values in the training subset is greater than
40% are selected for deletion, although any other sensible threshold could be set
instead.
 The remaining attributes can be preprocessed via imputation methods in order to provide
machine-interpretable inputs for our models.
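
A sketch of this thresholding and imputation step is shown below; the 40% cut-off follows the text above, while the choice of median imputation is an assumption made for illustration.

    # Sketch: drop columns with >40% missing values, then impute the rest with the median.
    from sklearn.impute import SimpleImputer

    missing_share = df.isnull().mean()
    to_drop = missing_share[missing_share > 0.40].index
    df_reduced = df.drop(columns=to_drop)
    print(f"Dropped {len(to_drop)} columns:", list(to_drop))

    num_cols = df_reduced.select_dtypes("number").columns
    imputer = SimpleImputer(strategy="median")
    df_reduced[num_cols] = imputer.fit_transform(df_reduced[num_cols])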
4.3.2. Imputation of missing values

Fig 4.13 Distribution before Imputation

Fig 4.14 Distribution After Imputation


Inference

 Three of the five columns' distributions appear to be mostly unchanged.


 Imputation has introduced additional probability density/peak splitting into
Family_Hist_4 (at approx. x=0.4 and x=0.6) and Medical_History_1 (at approx.
x=10).
4.3.3. K-means clustering

Fig 4.15 Elbow Method


Inference

 Increasing k beyond this value does not yield a significant benefit in the rate of
reduction of the training dataset's inertia; hence, we set k=15 clusters.
 The cluster labels will be incorporated as an additional feature in our datasets and
may in fact prove useful in helping to understand applicants' risk rating
assignments.
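
The elbow analysis and the addition of cluster labels can be sketched as follows, assuming a fully numeric, imputed feature DataFrame X; standard scaling and k=15 follow the discussion above.

    # Sketch: elbow method over a range of k, then append the chosen cluster labels.
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    X_scaled = StandardScaler().fit_transform(X)
    ks = range(2, 31)
    inertias = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
        inertias.append(km.inertia_)

    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Inertia")
    plt.show()

    # k=15 chosen from the elbow above; the labels become an additional feature.
    X["KMeansCluster"] = KMeans(n_clusters=15, n_init=10, random_state=42).fit_predict(X_scaled)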
4.4 Feature Selection

4.4.1. Mutual Information (MI):

Fig 4.16 MI values


Inference

 The top 5 ranked features (in descending order) are BMI, Wt, Product_Info_4,
Medical_Keyword_15, and Medical_History_23.
 This means that these features have strong statistical dependencies with the
Response variable, i.e. they contribute significantly to reducing uncertainty in
the value of the Response variable.
 Features that have low/zero MI scores indicate that they do not significantly
contribute towards reducing this uncertainty and are hence less useful for guiding
our predictions.
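
Mutual information scores of this kind can be computed with scikit-learn, for example as in the sketch below (assumes the feature matrix X and target y used above).

    # Sketch: rank features by mutual information with the Response target.
    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif

    mi = mutual_info_classif(X, y, random_state=42)
    mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
    print(mi_scores.head(10))    # features with the strongest dependencies on Response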
4.4.2. Multicollinearity analysis

Fig 4.17 VIF Scores


Inference

 The features listed above have very high VIF scores, which indicate a high level of
multicollinearity.
 However, in the edge cases where some of these values tend towards infinity, these
can be discounted as they represent dummy variables that are perfectly anti-
correlated.
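
VIF scores can be computed with statsmodels, as in the sketch below (assumes X is numeric with no missing values and that the statsmodels package is available).

    # Sketch: variance inflation factor for each feature (requires statsmodels).
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    X_const = sm.add_constant(X)                  # include an intercept term
    vif = pd.Series(
        [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
        index=X_const.columns,
    ).drop("const")
    print(vif.sort_values(ascending=False).head(10))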
4.4.3. Principal Component Analysis

Fig 4.18 Cumulative Variance


Inference

 The first 40 PCs contain just over 80% of the cumulative variance in the
validation dataset.
 This means that we can still capture a significant majority of the
dataset's cumulative variance, were we to use a lower dimensionality
feature-space instead, rather than simply using all features together.
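
The cumulative-variance curve can be reproduced as in the sketch below (assumes the feature matrix X, standardized before PCA).

    # Sketch: cumulative explained variance from PCA on the standardized features.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    pca = PCA().fit(StandardScaler().fit_transform(X))
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    print("PCs needed for 80% of the variance:", int(np.argmax(cum_var >= 0.80)) + 1)

    plt.plot(cum_var)
    plt.axhline(0.80, linestyle="--")
    plt.xlabel("Number of principal components")
    plt.ylabel("Cumulative explained variance")
    plt.show()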

4.4.4. Lasso (L1) Regularization

Fig 4.19 L1 Regularization


Inference

 L1 regularization has now reduced our dataset down to 57 features, from a starting value of 126.
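
One way to carry out this L1-based selection is via a logistic regression with an L1 penalty wrapped in SelectFromModel, as sketched below; the penalty strength C=0.1 is an illustrative assumption.

    # Sketch: L1 (Lasso-style) feature selection with a penalized logistic regression.
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    X_scaled = StandardScaler().fit_transform(X)
    l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    selector = SelectFromModel(l1_model).fit(X_scaled, y)
    selected = X.columns[selector.get_support()]
    print(f"Kept {len(selected)} of {X.shape[1]} features")
    X_selected = X[selected]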
4.5. Data Modeling

4.5.1 Classification

Fig 4.20 Accuracy Score


Inference

 Based on the provided accuracy scores for different classification algorithms, it


appears that Gradient Boosting has the highest accuracy with a score of 0.5355,
followed by Random Forest with a score of 0.5238. Gaussian Naive Bayes and
Decision Tree have lower accuracy scores of 0.3905 and 0.3842, respectively.
 In order to further improve the performance of these models, hyperparameter
tuning can be applied. Hyperparameter tuning involves selecting the optimal values
for the hyperparameters of the algorithm, which can significantly impact the
model's performance.
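
A hyperparameter search of the kind described can be sketched with GridSearchCV; the grid below is purely illustrative and assumes the training split X_train, y_train from earlier.

    # Sketch: grid search over a few Gradient Boosting hyperparameters (can be slow).
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": [100, 300],
        "learning_rate": [0.05, 0.1],
        "max_depth": [3, 5],
    }
    search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                          param_grid, cv=3, scoring="accuracy", n_jobs=-1)
    search.fit(X_train, y_train)
    print("Best parameters:", search.best_params_)
    print("Best CV accuracy:", round(search.best_score_, 4))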
4.6 Performance evaluation post parameter tuning

4.6.1. ROC Curves

Fig 4.21 ROC Curves


Inference

 Each model has a wide range of AUC values against each class, indicating a
varying degree of sensitivity/specificity across the dataset.
 The two highest average AUC values were achieved by the gradient boosting
classifiers (model 11: macro-average AUC=0.84, and model 12: macro-average
AUC=0.83).
 It is also worth noting that both of these models performed best when predicting
applicants for classes 3/4/8, as indicated by their high AUC scores (~0.9) relative to
those of the remaining groups.
 The XGBClassifier model features a similarly broad range of AUC values against
each class, indicating a varying degree of sensitivity/specificity across the dataset.
Furthermore, the model has also performed relatively strongly when predicting
applicants for classes 3/4/8, as indicated by the high AUC scores (~0.9) for their
respective ROC curves - this is a good sign that our model is still able to generalise
well, even when handling previously unseen data
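
Macro-averaged one-vs-rest AUC values of this kind can be computed as in the sketch below, assuming a fitted classifier `model` with predict_proba and the validation split X_val, y_val.

    # Sketch: macro-average one-vs-rest ROC AUC for a fitted multi-class model.
    from sklearn.metrics import roc_auc_score

    y_proba = model.predict_proba(X_val)             # shape: (n_samples, n_classes)
    macro_auc = roc_auc_score(y_val, y_proba, multi_class="ovr", average="macro")
    print(f"Macro-average AUC: {macro_auc:.2f}")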
4.6.2. Classification Report

Fig 4.22 Classification Report


Inference

 Each of the models displays somewhat mediocre performance for each class, with
the highest value consistently belonging to class 8.
 Each model also consistently shows very high recall for class 8, and poorer values
against the other groups.
 The same trend can also be observed for the F1-score, indicating that the models are
strongly fitted towards predicting applicants in class 8.
 For class 3, model 11's precision/recall/F1-score values are equal to 0, which means
that it did not correctly predict any applicants for this risk rating.
 Model 12 showed a precision of 0.27, a recall of 0.05 and an F1-score of 0.08.
 This does not necessarily mean that model 11 is wholly inaccurate and should not
be trusted altogether.
 These are typically harsh metrics that can indicate where models are strongly
under/overfitting as they evaluate accuracy in a "one vs. rest" fashion - if the set of
possible Response values was much smaller instead, e.g. by grouping together
classes 1-8 into Low/Med/High-risk, then each model's performance would appear
to significantly improve.
4.6.3. Confusion matrix

Fig 4.23 Confusion matrix


Inference

 Most of the models appear to be fairly well-fitted to the data, in that they are capable of
replicating the distribution of Response values reasonably well.
 A majority of the missed cases are in fact relatively close to where they
should be (e.g. predicted as class 8 when actually class 7).
 There are some notably underperforming models which appear to be highly
overfitted towards predicting applicants as belonging to class 8 (which
represents the largest proportion across all applicants in the dataset).
 The confusion matrix for model 7 (SVC with sigmoid kernel) demonstrates
this trait extremely well, which shows that only a tiny proportion of
predictions were made in any other class apart from 8.
 Models 3 and 5 (SVCs with linear and polynomial kernels, respectively)
also demonstrate this trend to a lesser extent as well.
 Models 11 and 12, on the other hand, appear to generalize reasonably well
to the dataset, mimicking the distribution of Response values shown earlier.
 Furthermore, the only notable drawback of these models that can be
observed is that they tend to predict some low-risk applicants (e.g. classes
1-2) as high-risk applicants (e.g. classes 6-8) with a higher than expected
frequency - see the top-right corners of each plot.
 In terms of using either of these models in a real-world business scenario,
this would only mean that more effort/time is potentially wasted on
scrutinizing low-risk applicants further before offering a policy, rather than
treating high-risk applicants with a light touch and introducing unnecessary
risk into the insurer's portfolio.
4.7 Attribute Importance

4.7.1. Feature Importance

Fig 4.24 Feature Importance

Inference

 This plot shows that the top five most important features (in terms of Gain/model
contribution) are:
o Medical_History_23,
o Medical_History_4,
o Medical_Keyword_3,
o Medical_Keyword_15, and
o BMI.
 These five features were also highly ranked within the Mutual Information score
chart , which lends some credibility to the view that these features may be closely
involved in governing what risk rating an applicant should be assigned.
 The bottom five (i.e., least important) features in terms of Gain are:
o Medical_History_34,
o Product_Info_2_E1,
o KMeansCluster_4,
o Insurance_History_8, and
o Medical_History_41.
 Four of these five features were also poorly ranked within the MI score chart,
however KMeansCluster_4 was instead moderately ranked (residing within the top
15 features).
 This implies that, whilst KMeansCluster_4 was initially deemed to show some
potential in terms of predictive power, the XGBClassifier does not value this
feature as highly when generating predictions for the test dataset.
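
Gain-based feature importances can be plotted directly from a fitted XGBoost model, as in the sketch below; xgb_model stands in for the trained XGBClassifier and is an assumed name.

    # Sketch: top features of a fitted XGBClassifier ranked by gain.
    import matplotlib.pyplot as plt
    from xgboost import plot_importance

    plot_importance(xgb_model, importance_type="gain", max_num_features=20,
                    show_values=False)
    plt.title("XGBoost feature importance (gain)")
    plt.show()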
4.7.2. Permutation Importance

Fig 4.25 Permutation Importance

Inference

 The top five features that appear to have the strongest impact on the model's
performance/accuracy when shuffled randomly are:
o Medical_Keyword_3
o Medical_History_39
o InsuredInfo_5
o Product_Info_2_D1
o Medical_History_17
 Most of these features were also highlighted as having the highest feature
importance.
 Interestingly, Medical_History_23 (the highest-scored feature in terms of
Feature Importance) does not appear in the top 20 permutation importance
weightings - this could imply that the feature has no meaningful causal
relationship with Response and has unintentionally been given higher importance
due to overfitting.
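The Appendix computes these weights with eli5's PermutationImportance; an equivalent sketch using scikit-learn's built-in permutation_importance is shown below. X_test_L1reg, y_test and BestModel.base_estimator are assumed from the modelling code, and the number of repeats is illustrative.

import pandas as pd
from sklearn.inspection import permutation_importance

# Shuffle each feature several times and record the mean drop in accuracy.
result = permutation_importance(BestModel.base_estimator,
                                X_test_L1reg, y_test,
                                n_repeats=5, random_state=0)
perm_series = pd.Series(result.importances_mean,
                        index=X_test_L1reg.columns).sort_values(ascending=False)
print(perm_series.head(20))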
4.7.3. SHAP Values

Fig 4.26 SHAP Summary Plot

Inference

 The SHAP summary plot above provides a top-down view of feature importance for the top 20 features, as calculated via the SHAP framework.
 The color of each dot represents whether that feature's value was high or low for that row of the dataset, and its horizontal location shows whether that value pushed the prediction higher (towards class 8) or lower (towards class 1).
 For instance, the BMI feature clearly shows that as an applicant's BMI increases, their predicted risk rating also increases strongly.
 Product_Info_2_A6 - i.e. whether the applicant's selection for Product_Info_2 is equal to A6 - shows a clear negative correlation with the predicted risk rating.
 Ins_Age shows a negative correlation when away from the baseline value (class 5), but the effect is more mixed as the value approaches the baseline.
 Family_Hist_4 appears to have a mixed effect on the value of Response regardless of the class.
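A minimal sketch of how a SHAP summary plot of this kind can be generated with the shap library is given below; BestModel.base_estimator and X_test_L1reg are again assumed from the modelling code, and the choice of class index is illustrative.

import shap

# TreeExplainer supports tree-based models such as the fitted XGBClassifier.
explainer = shap.TreeExplainer(BestModel.base_estimator)
shap_values = explainer.shap_values(X_test_L1reg)

# For a multi-class model, shap_values may be returned as a list with one array
# per class (depending on the shap version); if so, a single class's array can
# be plotted to obtain the beeswarm-style view of the top 20 features.
class_idx = -1  # e.g. the highest risk class
if isinstance(shap_values, list):
    shap.summary_plot(shap_values[class_idx], X_test_L1reg, max_display=20)
else:
    shap.summary_plot(shap_values, X_test_L1reg, max_display=20)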
CHAPTER 5

FINDINGS, SUGGESTIONS & CONCLUSIONS

5.1 FINDINGS

5.1.1 Findings based on Primary Objective

 Models exhibit varying sensitivity/specificity across the dataset, as shown by the
wide range of AUC values for each class.
 Gradient boosting classifiers (models 11 and 12) achieve the highest average AUC
values.
 Models 11 (AdaBoost) and 12 (XGBClassifier) perform well in predicting applicants
for classes 3, 4, and 8.
 Models tend to focus on predicting class 8, as indicated by high recall and F1-score
for that class.
 Some models are highly overfitted towards predicting class 8, especially model 7
with a sigmoid kernel.
 Models 11 and 12 generalize well to the dataset and closely match the distribution
of Response values.
However, they tend to misclassify some low-risk applicants as high-risk applicants
more frequently than expected. In a real-world scenario, this misclassification may
lead to additional scrutiny and wasted time, but it does not introduce unnecessary
risk into the insurer's portfolio.
5.1.2 Findings based on Secondary Objective
 Top five important features: Medical_History_23, Medical_History_4,
Medical_Keyword_3, Medical_Keyword_15, and BMI. These features were also
highly ranked in the Mutual Information score chart.
 Bottom five least important features: Medical_History_34, Product_Info_2_E1,
KMeansCluster_4, Insurance_History_8, and Medical_History_41.
 XGBClassifier doesn't value KMeansCluster_4 highly for predictions.
 Top five features with the greatest impact when shuffled (permutation importance):
Medical_Keyword_3, Medical_History_39, InsuredInfo_5, Product_Info_2_D1, and
Medical_History_17.
 Most of these features were identified as having high importance.
 Medical_History_23, the highest-scored feature by Gain, does not appear in the top
20 permutation importance weightings, indicating possible overfitting.
 SHAP summary plot shows feature importance and effects on prediction.
 BMI has a strong positive correlation with the predicted risk rating.
 Product_Info_2_A6 has a clear negative correlation with the predicted risk rating.
 Ins_Age has a negative correlation away from the baseline value but a mixed effect
approaching the baseline.
 Family_Hist_4 has a mixed effect on the value of Response across different classes.
5.2 SUGGESTIONS
 Refining Risk Rating Prediction:
o Considering the complexity of the dataset and the model's generalization, it
might be beneficial to group risk ratings into broader categories initially
(e.g. classes 1-3 into Low, classes 4-6 into Medium, and classes 7-8 into
High) - a rough sketch of this banding is given after this list.
o Train another layer of classifiers specifically for further segmentation within
each risk category.
o This approach could provide more accurate predictions and allow for
additional refinement based on specific risk bandings.
 Incorporating Business Logic into Feature Engineering:
o Although feature names have been anonymized, further exploration of
potential feature interactions during Exploratory Data Analysis (EDA) can
uncover statistical correlations.
o With access to an "unanonymized" dataset and subject matter expertise,
incorporate domain-specific knowledge to engineer new features that
capture predictive power and validate their effectiveness for production use.
 Enhancing Feature Selection:
o Consider a more comprehensive approach to feature selection by combining
multiple techniques.
o Instead of relying solely on Lasso regularization, use a "voting system" that
incorporates rankings from techniques such as Mutual Information, Variance
Inflation Factor, Principal Component Analysis, and other quantitative-based
methods.
o By leveraging a consensus of feature importance across different techniques,
you can obtain a more robust and reliable set of selected features.
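As referenced in the first suggestion above, the following rough sketch shows one way the 1-8 risk classes could be banded before fitting a first-stage classifier. It assumes the X_train_L1reg features and a pandas Series y_train from the Appendix, and the band boundaries are illustrative rather than validated.

from xgboost import XGBClassifier

# Map the original risk classes into three broad bands.
band_map = {1: 0, 2: 0, 3: 0,   # Low
            4: 1, 5: 1, 6: 1,   # Medium
            7: 2, 8: 2}         # High
y_train_banded = y_train.map(band_map)

# First-stage model predicts the broad band; a second-stage classifier could
# then be trained within each band to recover the finer 1-8 rating.
stage1 = XGBClassifier()
stage1.fit(X_train_L1reg, y_train_banded)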

5.3 CONCLUSION

This project has successfully achieved its objectives of analyzing historical data, extracting
valuable insights, and building an accurate risk classification model in the insurance
domain. Through this analysis, the key factors driving insurance risk were identified,
providing a deeper understanding of the underlying patterns and trends within
the data. The developed risk classification model accurately assigns policyholders to the
appropriate risk class, enhancing risk assessment and underwriting processes.

By leveraging advanced techniques, insurers can extract actionable insights that lead to
more accurate risk assessments and optimized underwriting processes. The risk
classification model developed in this project provides insurers with a competitive
advantage, enabling them to make informed strategic decisions and enhance overall
business performance.

In summary, this project demonstrates the value of data analytics in the insurance sector.
The insights gained from analyzing historical data and building an accurate risk
classification model offer significant potential for insurers to improve risk assessment,
underwriting efficiency, and overall profitability. By embracing data-driven approaches,
insurance companies can enhance their ability to understand and manage risks effectively,
ultimately leading to better outcomes for both insurers and policyholders.
APPENDIX

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
main_data = pd.read_csv('../Downloads/train.csv/train.csv')
print(main_data.dtypes)
main_data.describe()
# NOTE: main_data_index_set is assumed to be created in an earlier preprocessing step not shown in this excerpt.
plt.hist(main_data_index_set['Response'],
         bins=sorted(main_data_index_set['Response'].unique()))
plt.xlabel('Response')
plt.ylabel('# of Applicants')
plt.title('Response Distribution')
# Set up a subplot grid.
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(25,15))
ColSet1_ProdInfo_kde = ['Product_Info_1', 'Product_Info_3', 'Product_Info_4',
                        'Product_Info_5', 'Product_Info_6', 'Product_Info_7']
# Produce kernel density estimate plots for each set of columns.
for i, column in enumerate(main_data_index_set[ColSet1_ProdInfo_kde].columns):
    sns.kdeplot(data=main_data_index_set,
                x=column,
                hue="Response", fill=True, common_norm=True, alpha=0.05,
                ax=axes[i // 3, i % 3])
# Produce a correlation matrix of the dataset - then, create a mask to hide the upper-right half of the matrix.
corrs = main_data_index_set.corr()
mask = np.zeros_like(corrs)
mask[np.triu_indices_from(mask)] = True

# Convert the correlation matrix into a heatmap using Seaborn.


plt.figure(figsize=(24,16))
sns.heatmap(corrs, cmap='RdBu_r', mask=mask)
plt.show()
# Import the IterativeImputer class from sklearn - NOTE: enable_iterative_imputer also needs to be imported as this is an experimental feature.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Take a copy of each dataset before transforming.
copy_X_train = X_train.copy()
copy_X_valid = X_valid.copy()
copy_X_test = X_test.copy()
# Filter the splits down to the columns that require imputation.
X_train_pre_impute = copy_X_train[cols_to_impute]
X_valid_pre_impute = copy_X_valid[cols_to_impute]
X_test_pre_impute = copy_X_test[cols_to_impute]
# Save the other columns into separate dataframes, for re-joining later on.
X_train_no_impute = copy_X_train.drop(cols_to_impute, axis=1)
X_valid_no_impute = copy_X_valid.drop(cols_to_impute, axis=1)
X_test_no_impute = copy_X_test.drop(cols_to_impute, axis=1)
# Initialise the IterativeImputer transformer.
X_imputer = IterativeImputer(random_state=0)
# Transform the train/val/test datasets using iterative imputation.
X_train_post_impute = pd.DataFrame(X_imputer.fit_transform(X_train_pre_impute),
columns=X_train_pre_impute.columns)
X_valid_post_impute = pd.DataFrame(X_imputer.transform(X_valid_pre_impute),
columns=X_valid_pre_impute.columns)
X_test_post_impute = pd.DataFrame(X_imputer.transform(X_test_pre_impute),
columns=X_train_pre_impute.columns)
# Reset the indexes of each dataset, as they are dropped during imputation.
X_train_post_impute.index = X_train_pre_impute.index
X_valid_post_impute.index = X_valid_pre_impute.index
X_test_post_impute.index = X_test_pre_impute.index
# Re-join the imputed columns with the remaining columns in each dataset.
X_train_imputed = pd.concat([X_train_no_impute, X_train_post_impute], axis=1)
X_valid_imputed = pd.concat([X_valid_no_impute, X_valid_post_impute], axis=1)
X_test_imputed = pd.concat([X_test_no_impute, X_test_post_impute], axis=1)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from xgboost import XGBClassifier
# Assuming you have already performed feature selection and have X_train_L1reg and X_test_L1reg.
# Initialize the classifiers
logistic_regression = LogisticRegression()
naive_bayes = GaussianNB()
support_vector_machine = SVC()
decision_tree = DecisionTreeClassifier()
random_forest = RandomForestClassifier()
adaboost = AdaBoostClassifier()
gradient_boosting = GradientBoostingClassifier()
xgboost = XGBClassifier()
# Train and evaluate each classifier
classifiers = [naive_bayes, support_vector_machine, decision_tree, random_forest,
adaboost, gradient_boosting]
classifier_names = [ 'Gaussian Naive Bayes', 'Support Vector Machine', 'Decision Tree',
'Random Forest', 'AdaBoost', 'Gradient Boosting']
for clf, name in zip(classifiers, classifier_names):
    # Train the classifier
    clf.fit(X_train_L1reg, y_train)  # Assuming y_train is the target variable
    # Make predictions on the test set
    y_pred = clf.predict(X_test_L1reg)
    # Evaluate the classifier
    accuracy = clf.score(X_test_L1reg, y_test)  # Assuming y_test is the target variable
    # Print the results
    print(f'{name}: Accuracy = {accuracy:.4f}')
from sklearn.metrics import classification_report
# NOTE: y_valid, Valid_PredProbs, Valid_Preds, plot_roc and plot_confusion_matrix
# are assumed to be defined in earlier cells not included in this excerpt.
for y_valid_predprobs in Valid_PredProbs:
    plot_roc(y_valid, y_valid_predprobs)
for y_valid_preds in Valid_Preds:
    print(classification_report(y_valid, y_valid_preds))
for y_valid_preds in Valid_Preds:
    plot_confusion_matrix(y_valid, y_valid_preds)
from xgboost import plot_importance
# Generate a Feature Importance plot using the selected model.
fig, ax = plt.subplots(figsize=(10, 10))
plot_importance(BestModel.base_estimator,
importance_type="gain",
xlabel="Gain",
show_values=False,
ax=ax)
plt.show()
from eli5.sklearn import PermutationImportance
from eli5 import show_weights
# Calculate the Permutation Importances of the selected model.
perm = PermutationImportance(BestModel.base_estimator,
random_state=0).fit(X_test_L1reg, y_test)
show_weights(perm, feature_names=X_test_L1reg.columns.tolist())
