
Data Science

for
INSURANCE INDUSTRY
Author: Sudhir Behera, Applied Data Science
24 years of experience in the insurance industry: IT, actuarial, business, and data science.

Lecture 1

falytics.com/ProfessionalEducation
Data Science

Data science is a dynamic and interdisciplinary field that
harnesses the power of mathematical principles, statistical
techniques, and computer science to address complex challenges in
both business and scientific domains. Through the application of
predictive modeling, classification methods, and the generation of
meaningful insights, data science empowers organizations to derive
actionable conclusions from vast datasets. It serves as a crucial
catalyst for informed decision-making and problem-solving, paving
the way for innovation and strategic advancements in diverse
sectors.

Insurance Industries – Problem Space

The problem space within the insurance industry
encompasses key domains, including operations, marketing,
underwriting, and actuarial functions. In each of these areas,
challenges arise that demand strategic solutions and innovation
to optimize processes, enhance customer experiences, and
ensure sound risk management. By addressing issues within
these core facets, the insurance sector can navigate
complexities, improve efficiency, and ultimately deliver greater
value to both clients and stakeholders.

Operational
- Error in underwriting decision-making.
- Error in experience analysis.

Marketing
- Customer churn and retention analysis.
- Product mix analysis.

Actuarial
- Claim cost analysis.
- Profitability indicator.

Underwriting
- Profiling of risk segments.

Popular Algorithms

- Decision Trees
- Random Forest
- Gradient Boosting
- Deep learning (CNN, GNN, RNN)
- Recommendation Systems
- Generative AI
- Time series analysis

Mortality
In insurance, the mortality rate is the most obvious factor that
drives product design and premium rates. Insurance
companies use mortality tables, also known as life tables
or actuarial tables, to estimate the probability of death at
different ages and under various conditions.

In short, mortality in insurance involves the study and
analysis of death rates to determine the likelihood of
policyholders dying, helping insurers set appropriate
premiums and manage their financial risks.

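As a minimal sketch of how a mortality table feeds such estimates, the snippet below multiplies one-year survival probabilities to get the chance that a policyholder survives several years. The death probabilities and the function names are hypothetical, chosen only for illustration, and are not taken from any published table.

```python
# Hypothetical one-year death probabilities q_x by age (illustrative only,
# not taken from any real mortality table).
MORTALITY_TABLE = {40: 0.0020, 41: 0.0022, 42: 0.0024, 43: 0.0027, 44: 0.0030}


def survival_probability(age: int, years: int, table: dict) -> float:
    """Probability that a life aged `age` survives `years` more years."""
    prob = 1.0
    for x in range(age, age + years):
        prob *= 1.0 - table[x]  # survive the year from age x to age x + 1
    return prob


def death_probability(age: int, years: int, table: dict) -> float:
    """Probability that a life aged `age` dies within `years` years."""
    return 1.0 - survival_probability(age, years, table)


p = survival_probability(40, 5, MORTALITY_TABLE)
print(f"5-year survival probability from age 40: {p:.4f}")
print(f"5-year death probability from age 40: {1 - p:.4f}")
```

A premium calculation would typically combine these probabilities with interest and expense assumptions, which are outside this sketch.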
ML/AI models and Insurance Models

- Mortality rate
- COI rate
- Premium calculation
- CV accumulation
- Policy illustrations
- & more

Use case.1 Expected loss analysis for underwriting decisions.
Problem definition:
The insurance company seeks to improve its risk assessment capabilities by
leveraging both traditional and newly identified factors that contribute to
expected losses. The predictive model should provide insights into the
potential financial impact of various insurance policies, enabling the
company to make informed underwriting decisions and optimize its risk
management strategies.
Objective:
Develop a predictive model for an insurance company to estimate expected
losses, integrating new and existing rating variables. The goal is to enhance
accuracy in predicting potential financial risks associated with insurance
policies.

Policy_ID Claim_Amount Policy_Premium Deductible Age Gender Location_Type Previous_Claims Coverage_Type Emerging_Risk_Factor External_Factor Expected_Loss
1 5000 1000 200 35 Male Urban 0 Auto High_Technology 0.2 4500
2 2000 800 150 45 Female Suburban 1 Home Regulatory_Change -0.1 1800
3 10000 1500 300 28 Male Rural 2 Health Market_Trends 0.3 11000
4 8000 1200 250 50 Female Urban 0 Auto Economic_Indicators 0.1 7500
5 3000 600 100 40 Male Suburban 3 Home New_Technology -0.2 2800
6 15000 2000 400 32 Female Rural 1 Health Regulatory_Change 0.4 16500
7 6000 1000 200 55 Male Urban 0 Auto Market_Trends -0.3 5800
8 12000 1800 350 48 Female Suburban 2 Home High_Technology 0.2 10500
9 4500 900 150 38 Male Rural 1 Health Economic_Indicators -0.1 4200
10 7000 1100 250 42 Female Urban 3 Auto New_Technology 0.5 6800

Only 10 rows of this dataset are displayed; the full dataset contains many
more.

Predictor variables are

policy ID
claim amount
policy premium: premium paid towards insurance coverage.
deductible: amount paid by the policyholder.
age
gender: male, female and unisex.
location type: rural, urban and suburban.
previous claims: number of claims in the past.
coverage type: auto, home or health.
emerging risk factor, external factor: factors supplied to the company by a
third-party consulting firm; these are the benchmarks.

Target Variable
expected loss

The objective is to build a machine learning model that identifies the factors
influencing the loss and predicts the loss from historical data. The company
will use these findings and predictions to build strategies to reduce the loss.
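As a rough baseline for this use case, the sketch below fits a one-variable least-squares line to the 10 displayed rows, predicting Expected_Loss from Claim_Amount alone. It is an assumption-laden illustration, not the model described in the deck: a real model would use all rating variables and the full dataset. The helper names `fit_line` and `predict_expected_loss` are hypothetical.

```python
from statistics import mean

# The 10 displayed rows, reduced to (Claim_Amount, Expected_Loss).
ROWS = [
    (5000, 4500), (2000, 1800), (10000, 11000), (8000, 7500), (3000, 2800),
    (15000, 16500), (6000, 5800), (12000, 10500), (4500, 4200), (7000, 6800),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


claim_amounts = [claim for claim, _ in ROWS]
expected_losses = [loss for _, loss in ROWS]
slope, intercept = fit_line(claim_amounts, expected_losses)


def predict_expected_loss(claim_amount: float) -> float:
    """Baseline prediction from claim amount only (single-feature assumption)."""
    return intercept + slope * claim_amount


print(f"slope={slope:.3f}, intercept={intercept:.1f}")
print(f"predicted expected loss for a 9000 claim: {predict_expected_loss(9000):.0f}")
```

A single-feature baseline like this mainly gives a reference error level against which richer models, such as the tree-based algorithms listed earlier, can be compared.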
Use case.2 Optimize lead conversions.
Problem definition:
The insurance company aims to enhance the effectiveness of its marketing
campaigns by leveraging machine learning (ML) models to optimize lead
conversion rates. Currently, the lead conversion process is suboptimal,
leading to inefficiencies in resource allocation and missed opportunities.

Objectives:
1. Develop and deploy ML models that can predict the likelihood of lead
conversion based on historical data and relevant features.
2. Identify key features and variables that significantly influence lead
conversion, allowing for targeted and personalized marketing strategies.
decision-making.

XpressCoverage

ID age current_occupation first_interaction profile_completed website_visits time_spent_on_website page_views_per_visit last_activity print_media_type1 print_media_type2 digital_media online_forums partners_referral status
EXT001 57 Small_business_owner Website High 7 1639 1.861 Website Activity Yes No Yes No No 1
EXT002 56 Professional Mobile App Medium 2 83 0.32 Website Activity No No No Yes No 0
EXT003 52 Professional Website Medium 3 330 0.074 Website Activity No No Yes No No 0
EXT004 53 Small_business_owner Website High 4 464 2.057 Website Activity No No No No No 1
EXT005 23 Student Website High 4 600 16.914 Email Activity No No No No No 0
EXT006 50 Small_business_owner Mobile App High 4 212 5.682 Phone Activity No No No Yes No 0
EXT007 56 Professional Mobile App Medium 13 625 2.015 Website Activity No No Yes No No 1
EXT008 57 Professional Mobile App Medium 2 517 2.985 Email Activity No No No No No 0
EXT009 57 Professional Mobile App High 2 2231 2.194 Phone Activity No No Yes No No 1
EXT010 59 Professional Mobile App High 1 1819 3.513 Phone Activity No No No No No 0
EXT011 52 Professional Website Medium 2 433 2.14 Email Activity No No No No No 1
EXT012 57 Professional Website High 3 616 3.485 Website Activity Yes Yes No No No 1
EXT013 35 Professional Website High 4 239 2.214 Phone Activity No No No No No 0
EXT014 23 Student Website High 3 115 2.69 Email Activity No No No No No 0
EXT015 56 Professional Website High 6 358 0.279 Email Activity No No No No No 0
EXT016 62 Small_business_owner Mobile App High 5 1057 5.605 Phone Activity No No No Yes No 0
EXT017 47 Professional Website High 3 1419 3.45 Email Activity No No Yes No Yes 1

In this XpressCoverage dataset there are 15 columns and 17 rows displayed.

Predictor variables are

ID
age
current_occupation
first_interaction
profile_completed
website_visits
time_spent_on_website
page_views_per_visit
last_activity
print_media_type1
print_media_type2
digital_media
online_forums
partners_referral

Target Variable
status

The objective is to build a machine learning model that identifies the factors
influencing lead conversion and predicts whether a customer is going to
convert or not.

status: 1, converted (positive class).
        0, did not convert.
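To make the classification setup concrete, here is a minimal sketch of a logistic regression fitted by gradient descent on a single predictor, time_spent_on_website, using only the 17 displayed leads. The choice of one feature, the learning rate, the epoch count, and the function names are all assumptions made for illustration. A real project would use every predictor, a held-out test set, and most likely an established library.

```python
import math
from statistics import mean, pstdev

# (time_spent_on_website, status) for the 17 displayed leads.
LEADS = [
    (1639, 1), (83, 0), (330, 0), (464, 1), (600, 0), (212, 0), (625, 1),
    (517, 0), (2231, 1), (1819, 0), (433, 1), (616, 1), (239, 0), (115, 0),
    (358, 0), (1057, 0), (1419, 1),
]

times = [t for t, _ in LEADS]
MU, SIGMA = mean(times), pstdev(times)  # used to standardize the feature


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(leads, lr=0.1, epochs=2000):
    """Fit P(status = 1) = sigmoid(b + w * z) by batch gradient descent,
    where z is the standardized time spent on the website."""
    w, b = 0.0, 0.0
    n = len(leads)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for t, y in leads:
            z = (t - MU) / SIGMA
            err = sigmoid(b + w * z) - y  # gradient of the log loss w.r.t. the logit
            grad_w += err * z
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


W, B = fit_logistic(LEADS)


def conversion_probability(time_spent: float) -> float:
    """Estimated probability that a lead converts (status = 1)."""
    return sigmoid(B + W * (time_spent - MU) / SIGMA)


print(f"weight={W:.3f}, bias={B:.3f}")
print(f"P(convert | 2000 on site) = {conversion_probability(2000):.2f}")
print(f"P(convert | 100 on site)  = {conversion_probability(100):.2f}")
```

The sign and size of the fitted weight give a first, crude indication of whether a feature influences conversion, which is the second objective stated for this use case.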
Use case.3 Policy holder (customer) lifetime value analysis.
Problem definition:
The insurance company is grappling with a substantial policyholder churn rate. High
policyholder churn is directly impacting the life-time value (LTV) of our customers. This not
only leads to immediate revenue loss but also diminishes the long-term value these
customers could bring to the company. By addressing the root causes of churn, optimizing
retention strategies, and improving customer satisfaction, we aim to maximize
policyholder life-time value, ensuring both short-term financial stability and long-term
profitability.

Objectives:
1. The primary objective of this data science project is to develop a machine learning
model that accurately predicts customer churn and, based on the predictions, to
implement targeted strategies to increase LTV.

PolicyNumber Revenue AcquisitionCost Lifespan Age Gender Region ChurnStatus
POL001 500 100 24 35 Female North No
POL002 700 150 18 42 Male South No
POL003 600 120 20 28 Non-binary East Yes
POL004 800 200 22 45 Male West No
POL005 550 130 25 30 Female North Yes
POL006 900 180 30 38 Male South No
POL007 750 160 28 32 Female East No
POL008 650 140 21 40 Male West Yes
POL009 720 170 23 33 Female North No
POL010 580 110 19 50 Male South Yes

In this CustChurn dataset there are 8 columns and 2500 rows.

Predictor variables are


PolicyNumber
Revenue
AcquisitionCost
Lifespan
Age
Gender
Region
Target Variable
ChurnStatus

The objective is to build a machine learning model that identifies the factors
that influence a customer to leave and predicts whether he or she will churn.
Based on the insights generated and the predictions of churn status, the insurance
company can employ mitigation strategies to lower the churn rate.

ChurnStatus: Yes, customer left (negative class).
             No, customer is retained within the study period.

Customers' life span is inversely proportional to churn rate, so churn directly impacts LTV.
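The relationship between churn and lifetime can be sketched with the 10 displayed rows. The snippet assumes a constant per-period churn rate, under which the expected lifetime is 1 / churn rate, and uses Revenue minus AcquisitionCost as a deliberately simple value proxy. Neither assumption comes from the deck, and the 10-row sample is far too small to represent the 2500-row dataset.

```python
from statistics import mean

# The 10 displayed rows: (PolicyNumber, Revenue, AcquisitionCost, ChurnStatus).
POLICIES = [
    ("POL001", 500, 100, "No"), ("POL002", 700, 150, "No"),
    ("POL003", 600, 120, "Yes"), ("POL004", 800, 200, "No"),
    ("POL005", 550, 130, "Yes"), ("POL006", 900, 180, "No"),
    ("POL007", 750, 160, "No"), ("POL008", 650, 140, "Yes"),
    ("POL009", 720, 170, "No"), ("POL010", 580, 110, "Yes"),
]

# Share of displayed policyholders who churned during the study period.
churn_rate = sum(1 for _, _, _, churn in POLICIES if churn == "Yes") / len(POLICIES)

# Assumption: with a constant per-period churn rate, the expected customer
# lifetime is 1 / churn_rate periods (geometric model).
expected_lifetime = 1 / churn_rate


def net_value(revenue: float, acquisition_cost: float) -> float:
    """Assumed value proxy: revenue net of acquisition cost."""
    return revenue - acquisition_cost


churned = [net_value(r, a) for _, r, a, c in POLICIES if c == "Yes"]
retained = [net_value(r, a) for _, r, a, c in POLICIES if c == "No"]

print(f"churn rate: {churn_rate:.0%}")
print(f"expected lifetime: {expected_lifetime:.1f} periods")
print(f"mean net value, churned: {mean(churned):.0f}; retained: {mean(retained):.0f}")
```

Halving the churn rate in this model doubles the expected lifetime, which is why retention strategies bear so directly on LTV.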

Use case.4 Claim Cost Optimization.
Problem definition:
The insurance company aims to optimize claim cost. The absence of a systematic claim
cost analysis process impedes their ability to manage insurance claims effectively. Without
a detailed analysis, we struggle to identify cost drivers, trends, and potential savings. This
hampers our capacity to optimize claims management, budget accurately, and make
informed decisions, risking financial inefficiencies and suboptimal resource allocation.
Establishing a structured and data-driven approach to claim cost analysis is essential for
enhancing our understanding of cost factors and improving decision-making in our claims
management processes.

Objectives:
1. Develop a data science solution to optimize claim costs by implementing a structured
and data-driven approach to claim cost analysis.

claim_id claim_type claim_amount policy_type location incident_date age_of_driver gender_of_driver weather_condition vehicle_make vehicle_model vehicle_year property_type health_condition
1 Auto 5000 Comprehensive CityA 1/15/2023 35 Male Clear Toyota Camry 2018 Apartment Good
2 Home 10000 Property CityB 2/10/2023 Thunderstorm Single-Family House
3 Health 3000 Health CityC 3/5/2023 45 Female Standard Sedan 2015
4 Auto 8000 Collision CityA 4/20/2023 28 Male Rain Ford Fusion 2020 Good
5 Home 12000 Property CityB 5/12/2023 Clear Condo
6 Health 2500 Health CityC 6/8/2023 55 Female Compact Car 2019
7 Auto 6000 Comprehensive CityA 7/25/2023 40 Male Snow Chevrolet Equinox 2017 Excellent
8 Home 11000 Property CityB 8/18/2023 Clear Single-Family House
9 Health 3500 Health CityC 9/3/2023 30 Male SUV 2016
10 Auto 7500 Collision CityA 10/30/2023 42 Female Rain Hyundai Sonata 2019 Good

In this ClaimData dataset there are 14 columns and 2000 rows.

Predictor variables are

claim_id
claim_type
policy_type
location
incident_date
age_of_driver
gender_of_driver
weather_condition
vehicle_make
vehicle_model
vehicle_year
property_type
health_condition

Target Variable
claim_amount

This is a regression problem: the objective is to build a machine learning
model that predicts the claim amount and identifies the most important
features.
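Before fitting a regression model, a first look at cost drivers can be as simple as averaging claim_amount by claim_type. The sketch below does this for the 10 displayed rows. It is an exploratory step under the assumption that the displayed rows are representative, not the model itself, and the function name `mean_claim_by_type` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# The 10 displayed rows, reduced to (claim_type, claim_amount).
CLAIMS = [
    ("Auto", 5000), ("Home", 10000), ("Health", 3000), ("Auto", 8000),
    ("Home", 12000), ("Health", 2500), ("Auto", 6000), ("Home", 11000),
    ("Health", 3500), ("Auto", 7500),
]


def mean_claim_by_type(claims):
    """Average claim amount for each claim type."""
    groups = defaultdict(list)
    for claim_type, amount in claims:
        groups[claim_type].append(amount)
    return {claim_type: mean(amounts) for claim_type, amounts in groups.items()}


averages = mean_claim_by_type(CLAIMS)
# List claim types from the most to the least expensive on average.
for claim_type, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{claim_type}: {avg:.0f}")
```

Large gaps between group averages suggest that the grouping variable is likely to rank high in a model's feature importance.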
Use case.5 Product Profit Analysis.
Problem definition:
Express Insurance, Inc.'s actuarial leader has assigned its data science team the project
Product Profit Analysis, with the objective of building a statistical model to predict the
profitability of insurance contracts. This project aims to thoroughly analyze and evaluate
the profitability of the insurance products offered by the company.

Objectives:
1. The goal of this data science project is to develop a robust and accurate machine
learning model to provide valuable insights to actuarial professionals, enabling them
to make well-informed decisions regarding pricing, risk assessment, and strategic
planning.

ACTEXP

In this ACTEXP dataset there are 14 columns and 2000 rows.

Predictor variables are

Policy_ID
Policy_Type
Age
Issue_year
Current_Year
Coverage
RiskClass
Premium
Claims
Claim_Costs

Target Variable
Profitability

This is a regression problem: the objective is to build a machine learning
model that predicts the profitability ratio and identifies the most important
features. A positive ratio indicates profit and a negative ratio indicates loss.
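The deck does not state how the profitability ratio is computed. One plausible definition, used here purely as an assumption, is (Premium - Claim_Costs) / Premium, which is positive when premiums exceed claim costs and negative otherwise. The function name and the sample figures below are hypothetical.

```python
def profitability_ratio(premium: float, claim_costs: float) -> float:
    """Assumed definition: (premium - claim costs) / premium.

    Positive values indicate profit, negative values indicate loss.
    """
    if premium <= 0:
        raise ValueError("premium must be positive")
    return (premium - claim_costs) / premium


# Hypothetical contracts: (premium, claim_costs).
contracts = [(1000.0, 750.0), (1000.0, 1200.0), (800.0, 800.0)]
for premium, claim_costs in contracts:
    ratio = profitability_ratio(premium, claim_costs)
    label = "profit" if ratio > 0 else "loss" if ratio < 0 else "break-even"
    print(f"premium={premium:.0f}, claim_costs={claim_costs:.0f}: {ratio:+.2f} ({label})")
```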
Ethics and Privacy

“Ethical considerations are paramount in data science
projects, particularly in industries like insurance where
sensitive personal information is involved.”

Let us look at a couple of feature variables.

Job type can be used as a proxy for gender to fool the system, though it is not
appropriate or ethical to do so.
e.g. Job type: Receptionist, Human Resources. These can be used as a proxy for
gender, as a greater number of women than men work in these fields.

Zip code can be used as a proxy for race to fool the system, though it is not
appropriate or ethical to do so.
e.g. Little Italy in NYC; in some zip codes the majority of residents are of
Chinese origin or African American.
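One simple way to screen for proxy variables is to compare the distribution of a protected attribute across the values of a candidate feature. The sketch below does this for job type and gender. The applicant records, the 0.2 threshold, and the function name are invented for illustration only.

```python
from collections import Counter

# Hypothetical applicants: (job_type, gender). Invented for illustration.
APPLICANTS = [
    ("Receptionist", "F"), ("Receptionist", "F"), ("Receptionist", "F"),
    ("Receptionist", "M"), ("Engineer", "M"), ("Engineer", "M"),
    ("Engineer", "F"), ("Engineer", "M"),
]


def female_share_by_job(applicants):
    """Share of female applicants within each job type."""
    totals = Counter(job for job, _ in applicants)
    females = Counter(job for job, gender in applicants if gender == "F")
    return {job: females[job] / totals[job] for job in totals}


shares = female_share_by_job(APPLICANTS)
overall = sum(1 for _, gender in APPLICANTS if gender == "F") / len(APPLICANTS)

# Assumed screening rule: flag job types whose female share differs from the
# overall share by more than 0.2, since such a feature can reveal gender.
flagged = [job for job, share in shares.items() if abs(share - overall) > 0.2]

print(f"overall female share: {overall:.2f}")
print(f"female share by job type: {shares}")
print(f"possible proxies for gender: {flagged}")
```

A flagged feature is not automatically unusable, but it deserves review before it enters a pricing or underwriting model.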

Society of Actuaries: Ethical & Responsible Use of Data & Predictive Models
Certificate Program

Tools and Technologies

Programming Languages
- R
- Python

Data Manipulation and Analysis
- Pandas
- NumPy

Data Visualization
- Matplotlib
- Seaborn
- Plotly

Computing Environments
- Jupyter Notebooks
- Google Colab

Data storage and processing services
- AWS
- Azure
- Google Cloud

Study by McKinsey

kaggle.com/sudhirbehera
github.com/falytics/Sudhir.Behera
https://www.udemy.com/user/sudhir-k-behera
