To build precise models to predict car insurance claims through Logistic Regression and
Random Forest algorithms.
To apply these models on the Apache Spark platform.
To reduce the processing time of car insurance claim prediction models by using Spark parallel
processing.
To support insurance companies.
Related Works (1/4)
Title Motor Insurance Claim Status Prediction using Machine Learning Techniques,
2021
Dataset The data set contains a total of eleven attributes of motor insurance claim data
from AIC company and five target classes: close, notification, pending, re-
open and settled.
Methods Random Forest (RF) and Multi Class – Support Vector Machine (SVM)
Findings The system procedure included data understanding and exploratory data
analysis, data preprocessing, model training, model testing, classification and
prediction, and finally a comparison of the two models. The
performance of the models was evaluated with four metrics (Accuracy,
Precision, Recall, and F-measure). The RF model's prediction accuracy was slightly
better than that of the SVM model in the insurance domain, specifically in motor
insurance.
Related Works (2/4)
Title Zuriahati Mohd Yunos, Siti Mariyam Shamsuddin, Roselina Sallehuddin, and
Razana Alwee, "Joint Conference on Green Engineering Technology &
Applied Computing", 2019
Dataset Insurance Services Malaysia Berhad (ISM) dataset to estimate the two
important components; claim frequency and claim severity.
Methods Integration of grey relational analysis (GRA) with a back propagation
neural network (BPNN).
Findings The study compared the predictive performance of the hybrid model
GRABPNN and a simple BPNN using four error measurements: mean
squared error (MSE), root mean squared error (RMSE), mean absolute
error (MAE) and mean absolute percentage error (MAPE). The study
found that, given various numbers of features and hidden nodes and with the
informative features ranked, GRABPNN obtained better performance in
modelling claim frequency and claim severity for each claim type than
the other model.
Related Works (3/4)
Title Baran, Sebastian, and Przemysław Rola. "Prediction of motor insurance claims
occurrence as an imbalanced machine learning problem", Cornell University,
arXiv preprint arXiv:2204.06109 (2022).
Dataset Fremotor1 dataset describing car insurance claims and the parameters of the
insurance policies of an unnamed French insurer from 2003 to 2004.
Methods Logistic regression, Decision tree, Random forest, XGBoost and Feed-forward
network
Findings The main goal of this work was to present and apply various methods of
dealing with an imbalanced dataset in the context of claim occurrence
prediction in car insurance. Even a good ML algorithm might not perform well
on imbalanced data. To overcome this problem, the SMOTE oversampling
technique was used. Accuracy and F1 score were used to compare the results
of the machine learning algorithms. The accuracies of XGBoost and Random Forest
were better than those of the other algorithms.
Related Works (4/4)
Is_esc Boolean flag indicating whether Electronic Stability Control (ESC) is present in the car or not.
Is_adjustable_steering Boolean flag indicating whether the steering wheel of the car is adjustable or not.
Is_tpms Boolean flag indicating whether Tyre Pressure Monitoring System (TPMS) is present in the car or not.
Is_parking_sensors Boolean flag indicating whether parking sensors are present in the car or not.
Is_parking_camera Boolean flag indicating whether the parking camera is present in the car or not.
Gross_weight The maximum allowable weight of the fully loaded car, including passengers, cargo and equipment (kg)
Is_front_fog_lights, Is_rear_window_wiper, Is_rear_window_washer, Is_rear_window_defogger Boolean flags indicating whether these are available in the car or not.
Is_brake_assist, Is_power_door_lock Boolean flags indicating whether these are available in the car or not.
Is_central_locking Boolean flag indicating whether the central locking feature is available in the car or not.
Is_power_steering Boolean flag indicating whether power steering is available in the car or not.
Is_driver_seat_height_adjustable Boolean flag indicating whether the height of the driver seat is adjustable or not.
Is_day_night_rear_view_mirror Boolean flag indicating whether a day & night rear-view mirror is present in the car or not.
Is_ecw Boolean flag indicating whether Engine Check Warning (ECW) is available in the car or not.
Is_speed_alert Boolean flag indicating whether the speed alert system is available in the car or not.
Is_claim Outcome: Boolean flag indicating whether the policyholder files a claim in the next 6 months or not.
Apache Spark
Spark is an open-source processing engine that uses a directed acyclic graph (DAG) and its own data
structure, the Resilient Distributed Dataset (RDD), to provide fast analytics.
Spark helps with challenging and computationally intensive tasks such as processing high
volumes of real-time and archived data, and it integrates complex capabilities such as ML
and graph algorithms.
Spark makes big data processing broadly available and has a library for ML called MLlib.
The Spark MLlib library has algorithms for classification, regression, clustering,
collaborative filtering, dimensionality reduction, etc.
Apache Spark MLlib
Apache Spark MLlib can be used to create a machine learning application that performs
predictive analysis on a dataset. From Spark's built-in machine learning libraries,
this work uses classification through logistic regression and random forest.
MLlib is a core Spark library that provides many utilities useful for machine learning tasks, such
as:
Classification
Clustering
Modeling
Singular value decomposition (SVD) and principal component analysis (PCA)
Hypothesis testing and calculating sample statistics
Feature Selector (1/2)
A major challenge in machine learning is to create models that have robust predictive power
while using as few features as possible. Given the massive sizes of today's datasets, it is easy to
lose sight of which features are important and which are not.
That is why feature selection is a skill in its own right in the ML field. Feature
selection is the process of choosing a subset of the most important features while trying to retain
as much information as possible.
VarianceThresholdSelector
VarianceThresholdSelector is a selector that removes low-variance features. Features with a
variance not greater than the varianceThreshold will be removed. If not set,
varianceThreshold defaults to 0, which means only features with variance 0 (i.e. features that
have the same value in all samples) will be removed.
Feature Selector (2/2)
VarianceThresholdSelector
This technique is a quick and lightweight way of eliminating features with very low variance, i.e.
features that carry little useful information.
Variance Method
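The formula on this slide did not survive extraction. A plausible reconstruction, consistent with the symbol definitions that follow, is the standard variance of an attribute. The divisor n (population variance) is an assumption here; some implementations use n - 1 (sample variance) instead.

```latex
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
```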
where n is the number of records, xi is the value at position i and x̄ is the mean of the particular
attribute.
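The selection rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Spark's implementation: the function name select_by_variance, the toy records and the choice of population variance (statistics.pvariance) are all hypothetical and chosen only for the example.

```python
from statistics import pvariance


def select_by_variance(rows, names, threshold=0.0):
    """Keep the columns whose variance is strictly greater than `threshold`.

    Returns the kept column names and the variance of every column.
    """
    columns = list(zip(*rows))  # transpose records into per-attribute columns
    variances = {name: pvariance(col) for name, col in zip(names, columns)}
    kept = [name for name in names if variances[name] > threshold]
    return kept, variances


# Hypothetical toy records: (is_esc, gross_weight, is_power_steering)
names = ["is_esc", "gross_weight", "is_power_steering"]
rows = [
    (1, 1185, 1),
    (0, 1335, 1),
    (1, 1155, 1),
    (0, 1510, 1),
]

kept, variances = select_by_variance(rows, names)
print(kept)       # attributes that survive the default threshold of 0
print(variances)  # variance of each attribute
```

With the default threshold of 0, only is_power_steering is dropped, because it has the same value in every record.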
Logistic Regression Classifier (1/2)
Classification, a popular machine learning task, is the process of sorting input data into categories.
It's the job of a classification algorithm to figure out how to assign "labels" to input data that you
provide.
For example, a machine learning algorithm could accept stock information as
input and divide the stocks into two categories: stocks that you should sell and stocks that you
should keep.
Logistic regression is one algorithm that you can use for classification. Spark's logistic regression API is
useful for binary classification, or classifying input data into one of two groups.
In summary, the process of logistic regression produces a logistic function. Use the function to
predict the probability that an input vector belongs in one group or the other.
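The equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
where b0, b1, ..., bn are the regression coefficients.
In Logistic Regression y can be between 0 and 1 only, so we first form the odds y / (1 - y),
which range from 0 to infinity, and then take the logarithm so that the left-hand side can take any real value:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
The above equation is the final equation for Logistic Regression, where b0, ..., bn are the regression
coefficients, xi are the feature values of a record and y is the predicted probability of the positive class for that record.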
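The relation between the linear equation and the predicted probability can be sketched as follows. This is an illustrative example only: the coefficient values, the feature vector and the 0.5 decision threshold are hypothetical, not values fitted on the car insurance dataset.

```python
import math


def log_odds(intercept, coefficients, features):
    """Linear part: b0 + b1*x1 + ... + bn*xn."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))


def sigmoid(z):
    """Logistic function: maps any real log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and one hypothetical record
b0 = -1.0
b = [0.8, -0.5]
x = [2.0, 1.0]

z = log_odds(b0, b, x)       # value of the right-hand side of the equation
y = sigmoid(z)               # predicted probability of the positive class
label = 1 if y >= 0.5 else 0  # assumed decision threshold of 0.5

print(z, y, label)
print(math.log(y / (1 - y)))  # recovers z, i.e. log[y / (1 - y)] = b0 + b1x1 + ...
```

Solving log[y / (1 - y)] = z for y gives y = 1 / (1 + e^(-z)), which is why the sigmoid appears in the sketch.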
Random Forest Classifier (1/2)
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that
some decision trees predict the correct output while others do not. Taken together, however, the
majority vote of the trees is more likely to be correct. Therefore, below are two assumptions for a better Random forest
classifier:
There should be some actual values in the feature variables of the dataset so that the classifier can
predict accurate results rather than guessed results.
The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
It often takes less training time than other algorithms.
It predicts output with high accuracy and runs efficiently even on large datasets.
It can also maintain accuracy when a large proportion of data is missing.
Random Forest Classifier (2/2)
How does the Random Forest algorithm work?
Random Forest works in two phases: the first is to create the random forest by
combining N decision trees, and the second is to make predictions with each tree created
in the first phase.
The working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Steps 1 and 2 until N trees have been built.
Step-5: For new data points, find the predictions of each decision tree, and assign
the new data points to the category that wins the majority votes.
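The steps above can be sketched in plain Python. To keep the example short, each "tree" is a one-split decision stump and the random choice of features at each split that a full random forest uses is omitted, so this is really bagging with majority voting. The function names, the toy training set and the values of N and K are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(points):
    """Fit a one-split tree on (features, label) pairs by minimising training errors."""
    best = None
    for f in range(len(points[0][0])):
        for threshold in sorted({x[f] for x, _ in points}):
            left = [y for x, y in points if x[f] <= threshold]
            right = [y for x, y in points if x[f] > threshold]
            if not left or not right:
                continue
            errors = (sum(y != majority(left) for y in left)
                      + sum(y != majority(right) for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, majority(left), majority(right))
    if best is None:  # no usable split: always predict the majority label
        only_label = majority([y for _, y in points])
        return lambda x: only_label
    _, feature, cut, lo_label, hi_label = best
    return lambda x: lo_label if x[feature] <= cut else hi_label


def train_forest(training_set, n_trees, k, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):                     # Steps 3 and 4: build N trees
        sample = rng.choices(training_set, k=k)  # Step 1: K random points (with replacement)
        forest.append(train_stump(sample))       # Step 2: one tree per subset
    return forest


def predict(forest, x):
    """Step 5: every tree votes and the majority class wins."""
    return majority([tree(x) for tree in forest])


# Hypothetical toy data: (feature_a, feature_b) -> claim flag
training_set = [
    ((1.0, 0), 0), ((2.0, 1), 0), ((3.0, 0), 0), ((4.0, 1), 0),
    ((6.0, 0), 1), ((7.0, 1), 1), ((8.0, 0), 1), ((9.0, 1), 1),
]
forest = train_forest(training_set, n_trees=25, k=8)
print(predict(forest, (0.5, 0)), predict(forest, (9.5, 1)))
```

Individual stumps trained on unlucky samples can be wrong, but the vote over all 25 is what determines the prediction.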
Proposed System Design
(System design diagram; surviving label: Car Insurance Claim Dataset)
F1 score - The F1 score is the harmonic mean of Precision and Recall. Therefore, this score takes both false
positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is
usually more useful than accuracy, especially if you have an uneven class distribution.
F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
Recall = TP / (TP + FN)
Precision is the ratio of correctly predicted positive observations to the total predicted positive
observations. High precision corresponds to a low false positive rate.
Precision = TP / (TP + FP)
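The formulas above can be checked with a small worked example. The confusion-matrix counts below are hypothetical and were chosen to show an uneven class distribution; returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    return 2 * (r * p) / (r + p) if (r + p) else 0.0


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts: only 50 of 1000 policyholders actually claimed
tp, fp, fn, tn = 30, 10, 20, 940

p = precision(tp, fp)           # 30 / 40
r = recall(tp, fn)              # 30 / 50
f1 = f1_score(p, r)
acc = accuracy(tp, tn, fp, fn)  # 970 / 1000

print(p, r, f1, acc)
```

Here accuracy is 0.97 while F1 is about 0.67, which illustrates why F1 is the more informative of the two when claims are rare.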
Evaluation Metric (1/2)
Confusion Matrix: True positives and true negatives are the observations that are correctly
predicted and therefore shown in green. We want to minimize false positives and false negatives.
True Positives (TP) - E.g. the actual class indicates that the policyholder claimed and the predicted class tells
you the same thing.
True Negatives (TN) - E.g. the actual class says the policyholder did not claim and the predicted class tells you
the same thing.
False Positives (FP) - E.g. the actual class says the policyholder did not claim but the predicted class tells you that
the policyholder will claim.
False Negatives (FN) - E.g. the actual class indicates that the policyholder claimed but the predicted class
tells you that the policyholder will not claim.
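The four cells can be counted directly from actual and predicted labels. In this sketch, 1 stands for "claimed" and 0 for "did not claim"; the function name and the two label lists are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = claim, 0 = no claim)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1  # claimed, predicted claim
        elif a == 0 and p == 0:
            counts["TN"] += 1  # did not claim, predicted no claim
        elif a == 0 and p == 1:
            counts["FP"] += 1  # did not claim, predicted claim
        else:
            counts["FN"] += 1  # claimed, predicted no claim
    return counts


# Hypothetical labels for eight policyholders
actual = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts)
```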
Conclusion
This system will create two car insurance claim prediction classifiers, one with Random Forest
and one with Logistic Regression, based on the car insurance claim dataset.
This system will select attributes with VarianceThresholdSelector and analyze the selected
attributes' impact on the accuracy of the car insurance claim prediction classifiers.
The performance of the classifiers will be evaluated with four metrics: Accuracy, Precision,
Recall, and F1 measure.
This system will describe which classifier is the most suitable for predicting car insurance
claims in the next six months.
References
1. Endalew Alamir, Teklu Urgessa, Ashebir Hunegnaw, Tiruveedula Gopikrishna, “Motor Insurance Claim Status Prediction
using Machine Learning Techniques”, (IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 12, No. 3, 2021.
2. Zuriahati Mohd Yunos, Siti Mariyam Shamsuddin, Roselina Sallehuddin, and Razana Alwee, "Joint Conference on Green
Engineering Technology & Applied Computing", 2019.
3. Baran, Sebastian, and Przemysław Rola. "Prediction of motor insurance claims occurrence as an imbalanced machine
learning problem", Cornell University, arXiv preprint arXiv:2204.06109 (2022).
4. Shady Abdelhadi, Khaled Elbahnasy, Mohamed Abdelsalam, "A Proposed Model To Predict Auto Insurance Claims
Using Machine Learning Techniques", Journal of Theoretical and Applied Information Technology, 30th November 2020,
Vol. 98, No. 22.
5. Car Insurance Claim Prediction Dataset From Kaggle: https://www.kaggle.com/datasets/ifteshanajnin/carinsuranceclaimprediction-classification
6. Logistic Regression in Machine Learning – Javatpoint: https://www.javatpoint.com/logistic-regression-in-machine-learning
7. Understanding Logistic Regression – GeeksforGeeks: https://www.geeksforgeeks.org/understanding-logistic-regression/
Thesis Schedule
Months: January, February, March
Milestones: 1st Seminar, 2nd Seminar, 3rd Seminar, Defence and Thesis Book Submission
Thank You