
Car Insurance Claim Prediction by Using Machine Learning Algorithms on Apache Spark Platform

Supervised by : Dr. Tin Zar Thaw


Presented By : Mg Thein Than Ko
Roll No : MIS-18
Batch : Staff Batch 1
Seminar : First Seminar
Date : 25.1.2023
Outline
 Abstract
 Objectives
 Motivation
 Related Works
 Data Collection
 Apache Spark
 Apache Spark MLlib
 Feature Selector
 Regression Method
 Random Forest Method
 Proposed System Design
 Evaluation Metric
 Conclusion
 References
 Thesis Timeline Schedule
Abstract
 One of the main challenges facing insurance companies is to determine the proper
insurance premium for the risk represented by each customer.
 Risk varies widely from one client to another, and a careful understanding of the various risk
factors helps predict the likelihood of insurance claims based on historical data.
 Many car insurance companies use traditional methods to analyze client details,
although the volume of historical claim data is usually large enough to be big data.
 Therefore, car insurance claim prediction models are built by using Logistic
Regression and Random Forest algorithms on the Apache Spark platform.
Objectives

 To build precise models to predict car insurance claims through Logistic Regression and
Random Forest algorithms.
 To apply these models on the Apache Spark platform.
 To reduce the processing time of car insurance claim prediction models by using Spark parallel
processing.
 To support insurance companies in estimating the claim risk of each customer.
Related Works (1/4)

Title Motor Insurance Claim Status Prediction using Machine Learning Techniques,
2021
Dataset The data set contains eleven attributes of motor insurance claim data
from AIC company and five target classes: close, notification, pending,
re-open and settled.
Methods Random Forest (RF) and Multi-Class Support Vector Machine (SVM)
Findings The system procedure included data understanding and exploratory data
analysis, data preprocessing, model training, model testing, classification and
prediction, and finally a comparison of the two built models. The
performance of the models was evaluated with four metrics (Accuracy,
Precision, Recall, and F-measure). The prediction accuracy of the RF model is
slightly better than that of the SVM model in the insurance domain, specifically
in motor insurance.
Related Works (2/4)

Title Zuriahati Mohd Yunos, Siti Mariyam Shamsuddin, Roselina Sallehuddin, and
Razana Alwee, “Joint Conference on Green Engineering Technology &
Applied Computing”, 2019
Dataset Insurance Services Malaysia Berhad (ISM) dataset, used to estimate two
important components: claim frequency and claim severity.
Methods Integration of grey relational analysis (GRA) and a back-propagation
neural network (BPNN).
Findings The hybrid GRABPNN model was compared with a simple BPNN on four
error measures: mean squared error (MSE), root mean squared error (RMSE),
mean absolute error (MAE) and mean absolute percentage error (MAPE). With
GRA used to rank the informative features, and across various numbers of
features and hidden nodes, GRABPNN performed better than the simple BPNN
in modelling claim frequency and claim severity for each claim type.
Related Works (3/4)

Title Baran, Sebastian, and Przemysław Rola. "Prediction of motor insurance claims
occurrence as an imbalanced machine learning problem", Cornell University,
arXiv preprint arXiv:2204.06109 (2022).
Dataset Fremotor1 dataset describing car insurance claims and the parameters of the
insurance policies of an unnamed French insurer from 2003 to 2004.
Methods Logistic regression, Decision tree, Random forest, XGBoost and Feed-forward
network
Findings The main goal of this work was to present and apply various methods of
dealing with an imbalanced dataset in the context of claim occurrence
prediction in car insurance. Even a good ML algorithm might not perform well
on imbalanced data. To overcome this problem, the SMOTE oversampling
technique was used. Accuracy and F1 score were used to compare the results
of the machine learning algorithms. The accuracies of the XGBoost and Random
Forest methods were better than those of the other algorithms.
Related Works (4/4)

Title Shady Abdelhadi, Khaled Elbahnasy, Mohamed Abdelsalam, “A Proposed
Model To Predict Auto Insurance Claims Using Machine Learning
Techniques”, Journal of Theoretical and Applied Information Technology, 30th
November 2020, Vol. 98, No. 22.
Dataset Auto Insurance Claims Dataset from Kaggle, which consists of 12 variables
and 30240 cases
Methods The research used Artificial Neural Network (ANN),
Decision Tree (DT), Naïve Bayes and XGBoost classifiers to develop the
prediction model.
Findings The experimental results showed that the models obtained acceptable results.
The XGBoost and Decision Tree models achieved the best accuracy among
the four models, with accuracies of 92.53% and 92.22%, respectively.
Data Description (1/3)
 The Car Insurance Claim Prediction dataset is collected from the Kaggle website
(www.kaggle.com), which specializes in running statistical analysis and predictive modeling
competitions.
 The dataset contains information on policyholders, with 44 attributes and 97656
cases.
Variable Description

policy_id Unique Identifier of the policyholder

policy tenure Time period of the policy

age of the car Normalized age of the car in years

age of owner/policyholder Normalized age of the policyholder in years

Area_cluster Area cluster of the policyholder

population density Population density of the city (Policyholder City)


Make Encoded Manufacturer/company of the car

Segment Segment of the car (A/B1/B2/C1/C2)

Model Encoded name of the car


Data Description (2/3)
Fuel_type Type of fuel used by the car

Max_torque Maximum Torque generated by the car (Nm@rpm)

Max_power Maximum Power generated by the car (bhp@rpm)

Engine_type Type of engine used in the car

Airbags Number of airbags installed in the car

Is_esc Boolean flag indicating whether Electronic Stability Control (ESC) is present in the car or not.

Is_adjustable_steering Boolean flag indicating whether the steering wheel of the car is adjustable or not.
Is_tpms Boolean flag indicating whether Tyre Pressure Monitoring System (TPMS) is present in the car or not.

Is_parking_sensors Boolean flag indicating whether parking sensors are present in the car or not.

Is_parking_camera Boolean flag indicating whether the parking camera is present in the car or not.

Rear_brakes_type Type of brakes used in the rear of the car

Displacement Engine displacement of the car (cc)

Cylinder Number of cylinders present in the engine of the car

Transmission_type Transmission type of the car

Gear_box Number of gears in the car


Data Description (3/3)
Steering_type Type of the power steering present in the car

Turning_radius The space a vehicle needs to make a certain turn (m)

Length, Width, Height Length, width and height of the car (mm)

Gross_weight The maximum allowable weight of the fully loaded car, including passengers, cargo and equipment (kg)
Is_front_fog_lights, Is_rear_window_wiper, Is_rear_window_washer, Is_rear_window_defogger Boolean flags indicating whether these are available in the car or not.
Is_brake_assist, Is_power_door_lock Boolean flags indicating whether these are available in the car or not.

Is_central_locking Boolean flag indicating whether the central locking feature is available in the car or not.

Is_power_steering Boolean flag indicating whether power steering is available in the car or not.

Is_driver_seat_height_adjustable Boolean flag indicating whether the height of the driver seat is adjustable or not.

Is_day_night_rear_view_mirror Boolean flag indicating whether a day & night rearview mirror is present in the car or not.

Is_ecw Boolean flag indicating whether Engine Check Warning (ECW) is available in the car or not.
Is_speed_alert Boolean flag indicating whether the speed alert system is available in the car or not.

Ncap_rating Safety rating given by NCAP (out of 5)

Is_claim Outcome: Boolean flag indicating whether the policyholder files a claim in the next 6 months or not.
Apache Spark
 Spark is an open-source processing engine that uses a directed acyclic graph (DAG) and its own data
structure, the Resilient Distributed Dataset (RDD), to provide fast analytics.
 Spark handles challenging and computationally intensive tasks such as processing high
volumes of real-time and archived data, and it integrates complex capabilities such as ML
and graph algorithms.
 It makes big data processing widely accessible, and it has a library for ML called MLlib.
 The Spark MLlib library has algorithms for classification, regression, clustering,
collaborative filtering, dimensionality reduction, etc.
Apache Spark MLlib
 Apache Spark MLlib can be used to create a machine learning application that performs
predictive analysis on a dataset. From Spark's built-in machine learning libraries,
this work uses classification through logistic regression and random forest.
 MLlib is a core Spark library that provides many utilities useful for machine learning tasks, such
as:
 Classification
 Clustering
 Modeling
 Singular value decomposition (SVD) and principal component analysis (PCA)
 Hypothesis testing and calculating sample statistics
Feature Selector (1/2)
 A major challenge of Machine Learning is to create models that have robust predictive power
by using as few features as possible. But given the massive sizes of today’s datasets, it is easy to
lose sight of which features are important and which ones aren’t.
 That is why feature selection is an entire skill to be learned in the ML field. Feature
selection is the process of choosing a subset of the most important features while trying to retain
as much information as possible.
 VarianceThresholdSelector
 VarianceThresholdSelector is a selector that removes low-variance features. Features with a
variance not greater than the varianceThreshold will be removed. If not set,
varianceThreshold defaults to 0, which means only features with variance 0 (i.e. features that
have the same value in all samples) will be removed.
Feature Selector (2/2)
 This technique is a quick and lightweight way of eliminating features with very low variance, i.e.
features that carry little useful information.
 Variance Method
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$
 where n is the number of records, x_i is the value at position i and \bar{x} is the mean of the particular
attribute.
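The variance-threshold idea described on this slide can be sketched in plain Python. This is only an illustration of the rule "remove features whose variance is not greater than the threshold", not Spark's implementation: the function name `select_by_variance` and the sample column names are hypothetical, and the population variance (dividing by n) is an assumption, so the denominator used by a given library may differ.

```python
from statistics import pvariance


def select_by_variance(columns, variance_threshold=0.0):
    """Return the names of features whose variance exceeds the threshold.

    columns: dict mapping a feature name to its list of numeric values.
    Assumption: population variance (1/n) is used for this sketch.
    """
    return [
        name
        for name, values in columns.items()
        if pvariance(values) > variance_threshold
    ]


# Hypothetical toy columns, loosely named after dataset attributes.
toy_columns = {
    "is_power_steering": [1, 1, 1, 1],   # constant, variance 0
    "age_of_car": [0.1, 0.5, 0.3, 0.9],  # small but non-zero variance
    "airbags": [2, 2, 6, 2],             # larger variance
}

# With the default threshold of 0, only the constant feature is dropped.
kept_default = select_by_variance(toy_columns)

# A stricter threshold also drops the low-variance feature.
kept_strict = select_by_variance(toy_columns, variance_threshold=1.0)
```

Raising the threshold trades information for a smaller feature set, which is the effect on classifier accuracy that the proposed system intends to analyze.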
Logistic Regression Classifier (1/2)
 Classification, a popular machine learning task, is the process of sorting input data into categories.
 It's the job of a classification algorithm to figure out how to assign "labels" to input data that you
provide.
 For example, a machine learning algorithm could accept stock information as
input and divide the stocks into two categories: stocks that you should sell and stocks that you
should keep.
 Logistic regression is the algorithm that you use for classification. Spark's logistic regression API is
useful for binary classification, or classifying input data into one of two groups.
 In summary, logistic regression produces a logistic function, which is used to
predict the probability that an input vector belongs to one group or the other:
$$p(x) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n)}}$$
where b_0, b_1, ..., b_n are the regression coefficients.

Logistic Regression Classifier (2/2)


 Logistic Regression Equation:
The logistic regression equation can be obtained from the linear regression equation. The
mathematical steps are given below:
 The equation of a straight line can be written as:
$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
 In logistic regression y can only be between 0 and 1, so we divide y by (1 - y):
$$\frac{y}{1 - y}; \quad 0 \text{ for } y = 0, \text{ and infinity for } y = 1$$
 But we need a range from -infinity to +infinity, so we take the logarithm of this ratio:
$$\log\left[\frac{y}{1 - y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
 This is the final equation for logistic regression, where b_0, ..., b_n are the regression
coefficients, x_i are the feature values and y is the predicted probability of the positive class
for the particular record.
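The relationship between the log-odds equation and the predicted probability can be checked numerically. The following is a minimal sketch in plain Python, with hypothetical coefficient and feature values chosen only for illustration: it computes the linear combination, maps it to a probability with the logistic function, and confirms that taking the log-odds of that probability recovers the linear combination.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(coefficients, intercept, features):
    """Probability of the positive class for one record.

    intercept plays the role of b0, coefficients of b1..bn,
    and features of x1..xn in the slide's equation.
    """
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


def log_odds(p):
    """log(p / (1 - p)), the left-hand side of the final equation."""
    return math.log(p / (1.0 - p))


# Hypothetical values: b0 = 0.1, b = (0.5, -0.25), x = (2.0, 4.0).
p = predict_probability([0.5, -0.25], 0.1, [2.0, 4.0])

# The log-odds of p equals the linear combination (0.1 here),
# up to floating-point rounding.
recovered = log_odds(p)
```

A record would then be assigned to the positive class when the probability exceeds a chosen cut-off; 0.5 is a common default, but the cut-off is a modelling choice.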
Random Forest Classifier (1/2)
 Since a random forest combines multiple trees to predict the class of a record, some
decision trees may predict the correct output while others may not. Taken together, however, the
trees are likely to predict the correct output. Therefore, below are two assumptions for a better
Random Forest classifier:
 There should be some actual values in the feature variables of the dataset so that the classifier can
predict accurate results rather than guessed results.
 The predictions from each tree must have very low correlations.
 Below are some points that explain why we should use the Random Forest algorithm:
 It often takes less training time than many other algorithms.
 It predicts output with high accuracy and runs efficiently even on large datasets.
 It can also maintain accuracy when a large proportion of data is missing.
Random Forest Classifier (2/2)
 How does the Random Forest algorithm work?
 Random Forest works in two phases: the first is to create the random forest by
combining N decision trees, and the second is to make predictions with each tree created
in the first phase.
 The working process can be explained in the steps below:
 Step-1: Select K random data points from the training set.
 Step-2: Build the decision tree associated with the selected data points (subset).
 Step-3: Choose the number N of decision trees that you want to build.
 Step-4: Repeat Steps 1 and 2 until N trees have been built.
 Step-5: For new data points, find the prediction of each decision tree, and assign
the new data points to the category that wins the majority vote.
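The steps above can be sketched in plain Python. This is a deliberately simplified illustration, not the Spark implementation: each "tree" is a one-split stump on a single numeric feature, rows are `(feature_value, label)` pairs with labels 0 and 1, and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, k, rng):
    """Step 1: draw K random data points (with replacement) from the training set."""
    return [rng.choice(rows) for _ in range(k)]


def train_stump(sample):
    """Step 2 (simplified): fit a one-split 'tree' on the sampled rows.

    Tries every sampled feature value as a threshold, in both label
    orientations, and keeps the split with the fewest training errors.
    """
    best = None
    for threshold, _ in sample:
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                1 for x, y in sample
                if (left if x <= threshold else right) != y
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def train_forest(rows, n_trees, k, seed=0):
    """Steps 3 and 4: choose N and repeat sampling and tree building N times."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(rows, k, rng)) for _ in range(n_trees)]


def majority_vote(votes):
    """Return the most frequent vote."""
    return Counter(votes).most_common(1)[0][0]


def predict(forest, x):
    """Step 5: collect one prediction per tree and take the majority vote."""
    return majority_vote([tree(x) for tree in forest])


# Hypothetical toy data: low feature values are class 0, high values class 1.
toy_rows = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
toy_forest = train_forest(toy_rows, n_trees=25, k=6)
```

Individual stumps trained on an unlucky sample can be wrong, but the vote across many trees smooths this out, which is the motivation for combining trees in the first place.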
Proposed System Design
 Car Insurance Claim Dataset
 Select attributes with VarianceThresholdSelector
 Split training and testing datasets (80%-20%) randomly
 Create Random Forest Classifier and Logistic Regression Classifier with training dataset
 Predict testing dataset with Random Forest Classifier and with Logistic Regression Classifier
 Accuracy and F1 Score of the result of each classifier
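The random 80%-20% split in this design can be sketched in plain Python. The helper name `train_test_split` and the fixed seed are assumptions made for reproducibility of the illustration; this sketch produces exact split sizes by shuffling and cutting, whereas a distributed engine may instead assign each row probabilistically and give only approximately 80/20 sizes.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical record ids standing in for policy rows.
train_rows, test_rows = train_test_split(range(10))
```

Both classifiers should be trained and evaluated on the same split so that their accuracy and F1 scores are directly comparable.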
Evaluation Metric (1/2)

 F1 score - F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false
positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is
usually more useful than accuracy, especially if you have an uneven class distribution.
F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
 Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
Recall = TP / (TP + FN)
 Precision is the ratio of correctly predicted positive observations to the total predicted positive
observations. High precision corresponds to a low false positive rate.
Precision = TP / (TP + FP)
Evaluation Metric (2/2)

 Confusion Matrix: True positives and true negatives are the observations that are correctly
predicted. We want to minimize false positives and false negatives.

 True Positives (TP) - E.g. the actual class indicates that the policyholder claimed and the predicted class
says the same thing.
 True Negatives (TN) - E.g. the actual class says the policyholder did not claim and the predicted class
says the same thing.
 False Positives (FP) - E.g. the actual class says the policyholder did not claim but the predicted class says
that the policyholder will claim.
 False Negatives (FN) - E.g. the actual class indicates that the policyholder claimed but the predicted class
says that the policyholder will not claim.
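The confusion-matrix counts and the metrics built from them can be computed directly from actual and predicted labels. The sketch below is plain Python with hypothetical function names and made-up labels (1 = claim, 0 = no claim); returning 0.0 when a denominator is zero is a convention assumed here, not something the slides specify.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary classification result."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def scores(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for eight policyholders.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
result = scores(*confusion_counts(actual, predicted))
```

On a dataset where few policyholders claim, a classifier that always predicts "no claim" can still reach high accuracy while its recall and F1 are zero, which is why F1 is reported alongside accuracy.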
Conclusion

 This system will create two car insurance claim prediction classifiers, one with Random Forest
and one with Logistic Regression, based on the car insurance claim dataset.
 This system will select attributes with VarianceThresholdSelector and analyze the impact of the
selected attributes on the accuracy of the car insurance claim prediction classifiers.
 The performance of the classifiers will be evaluated with four metrics: Accuracy, Precision,
Recall, and F1 measure.
 This system will describe which classifier is the more suitable for predicting car insurance
claims in the next six months.
References

1. Endalew Alamir, Teklu Urgessa, Ashebir Hunegnaw, Tiruveedula Gopikrishna, “Motor Insurance Claim Status Prediction
using Machine Learning Techniques”, (IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 12, No. 3, 2021.
2. Zuriahati Mohd Yunos, Siti Mariyam Shamsuddin, Roselina Sallehuddin, and Razana Alwee, “Joint Conference on Green
Engineering Technology & Applied Computing”, 2019.
3. Baran, Sebastian, and Przemysław Rola, “Prediction of motor insurance claims occurrence as an imbalanced machine
learning problem”, Cornell University, arXiv preprint arXiv:2204.06109 (2022).
4. Shady Abdelhadi, Khaled Elbahnasy, Mohamed Abdelsalam, “A Proposed Model To Predict Auto Insurance Claims
Using Machine Learning Techniques”, Journal of Theoretical and Applied Information Technology, 30th November 2020,
Vol. 98, No. 22.
5. Car Insurance Claim Prediction Dataset from Kaggle: https://www.kaggle.com/datasets/ifteshanajnin/carinsuranceclaimprediction-classification
6. Logistic Regression in Machine Learning, Javatpoint: https://www.javatpoint.com/logistic-regression-in-machine-learning
7. Understanding Logistic Regression, GeeksforGeeks: https://www.geeksforgeeks.org/understanding-logistic-regression/
Thesis Schedule
 Period: January, February, March
 Milestones: 1st Seminar, 2nd Seminar, 3rd Seminar, Defence and Thesis Book Submission
Thank You
