FINAL REPORT
BIG DATA ANALYTICS AND ITS APPLICATIONS
[GROUP 4]
Dataset: https://www.kaggle.com/datasets/anmolkumar/health-insurance-cross-sell-prediction/code?datasetId=869050&sortBy=voteCount
GROUP MEMBER
COMMITMENT
Group 5 commits that the project on the topic “Applying Big Data In The Spark Environment To Build A Forecast Model For Cross-Selling Vehicle Insurance” was carried out by the group itself, under the guidance of Lecturer Nguyen Thon Da.
All sources consulted by the group are cited in the References section at the end of the report. The remaining content was developed from the team's own research and is presented honestly; if any problem arises, the team will take full responsibility.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
PROJECT MEMBER EVALUATION
DETAILED TASK ASSIGNMENT TABLE
ABSTRACT
CHAPTER 1 : INTRODUCTION TO THE TOPIC
1.1 Reason for choosing the topic
1.2 Objectives of the research
1.3 Research question
1.4 Research tools and languages
1.5 Research process
1.6 Structure of the research
CHAPTER 2 : THEORETICAL BASIS
2.1 Related terms
2.1.1 Vehicle insurance
2.1.2 "Cross-selling" term
2.1.3 Cross-selling vehicle insurance
2.2 Application algorithm
2.2.1 Logistic Regression
2.2.2 Random Forest
2.2.3 Gradient Boosted Trees Algorithm
2.3 K-Fold Cross Validation
2.4 Evaluation method based on confusion matrix
2.4.1 Concepts
2.4.2 Evaluation indicators
CHAPTER 3 : CONDUCTING EXPERIMENTAL PROBLEMS
3.1 General information
3.2 Explore and analyze data
3.2.1 Explore and visualize data
3.2.2 Visualize data using Heatmap charts
3.3 Prepare to run the model
3.3.1 The data balancing process
3.3.2 The data labeling process
3.3.3 The data cleaning process
3.3.4 The scaling data process
3.3.5 Vectorize the input data and separate the train - test data set
3.4 Run the model
3.4.1 Model 1: Logistic Regression
3.4.2 Model 2: Random Forest
3.4.3 Model 3: Gradient Boosted Trees
CHAPTER 4 : EVALUATING THE RESULTS OF THE MODELS
4.1 Evaluate using proposed indicators
4.1.1 Evaluation using AUC/ROC and AUC/PR metrics
4.1.2 Evaluation using confusion matrix metrics
4.2 Conclusion
REFERENCES
LIST OF FIGURES
Figure 1-1 Research process
Figure 2-1 The sigmoid function chart (source: Pham Dinh Khanh)
Figure 2-2 Graph of the exponential function $e^{-x}$ and the function $e^{x}$
Figure 2-3 Random Forest working diagram (Source: simplilearn)
Figure 2-4 Confusion matrix
Figure 2-5 Histogram for the case of AUC = 1
Figure 2-6 Histogram for the case of AUC = 0
Figure 2-7 Histogram for the case of AUC = 0.7
Figure 2-8 Histogram for the case of AUC = 0.5
Figure 3-1 Use the distinct function to select rows with unique values
Figure 3-2 Statistical chart for the variable 'age'
Figure 3-3 Statistical Chart for the 'Gender' variable
Figure 3-4 The count of genders with a driver's license
Figure 3-5 The number of users with or without auto insurance
Figure 3-6 The distribution of vehicle ages
Figure 3-7 Statistics for the 'Vehicle_Damage' variable
Figure 3-8 Statistics for the 'Annual_Premium' variable
Figure 3-9 Statistical chart for the 'Vintage' variable
Figure 3-10 Statistical chart for the 'Policy_Sales_Channel' variable
Figure 3-11 Heatmap chart
Figure 3-12 Data before balancing
Figure 3-13 Data after balancing
Figure 3-14 The code labels the gender variable
Figure 3-15 Results after data labeling
Figure 3-16 Data before removing outliers
Figure 3-17 Find the upper limit and lower limit
Figure 3-18 Data after removing outliers
Figure 3-19 Results table after implementing the Min-max Scaling method
Figure 3-20 List of nominal variables to predict
Figure 3-21 List of variable columns after applying VectorAssembler
Figure 3-22 Data of variable features after being vectorized
Figure 3-23 The results of model test
Figure 3-24 Logistic Regression model evaluation according to the confusion matrix
Figure 3-25 Random Forest model evaluation index according to the confusion matrix
Figure 3-26 Gradient Boosted Trees model evaluation index
Figure 4-1 Confusion matrix of the Random Forest model
LIST OF TABLES
Table 3-1 Description of attributes
LIST OF ACRONYMS
Acronym Description
EDA Exploratory Data Analysis
DT Decision Tree
RF Random Forest
ERF Extreme Random Forest
LR Logistic Regression
KNN K-Nearest Neighbour
ROC Receiver Operating Characteristics
AUC Area Under The Curve
GBT Gradient Boosted Trees
FPR False positive rate
TPR True positive rate
FN False Negative
TN True Negative
FP False Positive
TP True Positive
PROJECT MEMBER EVALUATION
No. Full Name Student ID Role Evaluation Rate
1 Nguyen Thi Thanh Thuy K204061418 Leader 100%
2 Tran Thi Thuan K204061417 Member 100%
3 Nguyen Thuy Dung K204061391 Member 100%
4 Mai Tran Dan Nhi K204061407 Member 100%
5 Tran Thi Thuyen Thuyen K204061419 Member 100%
6 Dinh Tran Xuan Nguyen K204061404 Member 100%
DETAILED TASK ASSIGNMENT TABLE
No. Task Performer
1 Identify course requirements and evaluation criteria. Select a topic
1.1 Determine project requirements All
1.2 Searching the data for the project All
1.3 Looking up the knowledge about the project All
ABSTRACT
In the contemporary era, as the demand for a better quality of life continues to rise, cross-selling campaigns for car insurance have become increasingly prevalent. This study uses the PySpark library and three algorithms, Logistic Regression, Random Forest, and Gradient Boosted Trees, to construct a predictive system for car insurance cross-selling campaigns.
By applying these methods, our goal is to predict which customers will accept cross-selling services and to find the most suitable model for the given dataset. The results of the experiments and evaluations demonstrate high model performance, assessed through accuracy metrics and the AUC-ROC curve, which support comparing the models and selecting the best one for the problem.
Faced with a complex and inflationary economic landscape, this research proposes an effective and integrated approach that uses the PySpark library to handle large datasets. This assists businesses in predicting customer interest in car insurance, enabling them to plan appropriate communication strategies and optimize both their business model and revenue streams.
CHAPTER 1 : INTRODUCTION TO THE TOPIC
This chapter gives general information about the topic. After the project requirements are determined, the introduction covers: the reason for choosing the topic, research objectives, research questions, research objects and methods, tools and languages, the research process, and the structure of the topic.
1.1 Reason for choosing the topic
Nowadays, with the strong development of the economy, society, and every aspect of life, the community demands that the quality of life be continually improved and more fully met. In particular, one of the most essential needs is to protect the health of oneself and one's family. At the same time, the insurance market is on a solid growth path thanks to policy efforts from the government, along with companies' contributions to macroeconomic stability and social security.
As of November 30, 2023, the insurance market has 82 insurance businesses in many diverse forms. Thanks to that, cross-selling has penetrated the insurance market, even though it was not previously widely known in this field. It is a strategy that is now very commonly applied in businesses. If done well, it brings more profit to the business and enhances customer experience, loyalty, and lifetime value. In the insurance market, cross-selling has notably been applied to vehicle insurance.
To reach as much of their customer base as possible, organizations and insurance companies need to grasp user needs based on reviews, predict customers' interest in vehicle insurance, and implement preferential services and promotional communication strategies accordingly. From there, the ultimate goal of optimizing revenue can be achieved. For all those reasons, the group researched: "APPLYING BIG DATA IN THE SPARK ENVIRONMENT TO BUILD A FORECAST MODEL FOR CROSS-SELLING VEHICLE INSURANCE".
1.2 Objectives of the research
Identify the topic, explore the data set, and apply three algorithms, Logistic Regression, Random Forest, and Gradient Boosted Trees, to predict customer demand for cross-sold vehicle insurance. From there, provide the best evaluation results and solutions for insurance companies and organizations.
Improve team members' teamwork, problem-solving, and presentation skills, as well as skills related to thinking and data analysis.
Gain a deeper understanding of the field related to the subject, enriching knowledge and experience in preparation for good job opportunities in the future.
1.3 Research question
The group raised the following two research questions:
What factors lead to the decision to accept cross-selling of health insurance in the vehicle sector?
How well do the results predict the extent to which customers return to use cross-selling services?
1.4 Research tools and languages
Tool: Google Colaboratory.
Language: Python.
1.5 Research process
The team conducted the research in an 8-step process. The first step is to determine the project requirements. From the identified requirements, the group explored the data set to build an overview of the topic and studied the theoretical basis for the research. After studying the theoretical basis, the group prepared the experiment. The next steps are exploring the data, cleaning the data, and running the models. Finally, there are the evaluation, conclusions, and applications. The group's research process is described in more detail in the diagram below:
Figure 1-1 Research process
2.2.1 Logistic Regression
2.2.1.1 Concepts
Logistic regression is a supervised machine-learning technique for binary classification that predicts the likelihood of an outcome, occurrence, or observation.
It describes and explains the relationship between a binary dependent variable and one or more nominal, ordinal, interval, or ratio-level independent variables.
We use logistic regression to classify research objects into the groups that make up the value range of the target variable. For example, when classifying target customers into groups A, B, C, and D, the target variable takes the values y = {A, B, C, D} (the nominal or ordinal form of logistic regression). We also use it to predict an event with just two outcomes, yes or no, where the target variable y takes only two values: 0 means no and 1 means yes (binary logistic regression, the most common form).
2.2.1.2 Classification
Logistic regression is classified into three categories.
- Binary Logistic Regression: The dependent variable has two alternative
outcomes/classes. These variables can signify success or failure, yes or no, victory or
loss, etc.
- Multinomial Logistic Regression: The dependent variable has three or more unordered outcomes/classes, that is, categories with no quantitative significance. For example, predicting the quality of food (Good, Great, or Bad).
- Ordinal Logistic Regression: The dependent variable has three or more ordered or quantitatively meaningful categories. For example, the categories can be “poor”, “good”, “very good”, and “excellent”, with scores such as 0, 1, 2, and 3.
Figure 2-2 The sigmoid function chart (source: Pham Dinh Khanh).
The sigmoid function resembles an S-curve and rises monotonically, which is why it is also called the S-function.
The sigmoid function takes values between 0 and 1.
- We have the exponential function $e^{-x}$, which is the reciprocal of $e^{x}$: $e^{-x} = \frac{1}{e^{x}}$, represented as shown in the figure:
Figure 2-3 Graph of the exponential function $e^{-x}$ and the function $e^{x}$
Notice that $e^{-x}$ has the same shape as $e^{x}$; the only difference is that it is mirrored about the y-axis.
Examining the limits of the exponential function, we obtain:
$$\lim_{x \to +\infty} e^{-x} = 0, \qquad \lim_{x \to -\infty} e^{-x} = +\infty$$
Based on these limits, and writing the sigmoid function as $\sigma(x) = \frac{1}{1 + e^{-x}}$, we can infer the limits of the sigmoid function as follows:
$$\lim_{x \to +\infty} \sigma(x) = 1, \qquad \lim_{x \to -\infty} \sigma(x) = 0$$
Thus, for any value of x, the sigmoid function always produces a value between 0 and 1. As a result, the sigmoid function is well suited to probability forecasting in classification problems.
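The bounded, monotonic behaviour of the sigmoid can be checked numerically. The sketch below is only illustrative: the function name `sigmoid` and the sample inputs are chosen here, and the two-branch form is one common way to avoid floating-point overflow for large negative inputs.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the algebraically equivalent form e^x / (1 + e^x),
    # so that exp() is never called with a large positive argument.
    ex = math.exp(x)
    return ex / (1.0 + ex)


# Illustrative inputs: the outputs rise with x and stay between 0 and 1.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.5f}")
```

For inputs of very large magnitude the floating-point result rounds to exactly 0.0 or 1.0, even though the mathematical value lies strictly between them.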
In the linear regression model, we represented the relationship between the result and the features with the following regression function, where $\beta_0, \dots, \beta_p$ are the coefficients and $x_1, \dots, x_p$ are the features:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$
For classification, because a probability ranges from 0 to 1, we wrap the right side of the equation in the logistic function. This restricts the output to values between 0 and 1:
$$P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}$$
To extract the exponential expression from the denominator, we use the odds rather than the probability. The odds are the ratio between the predicted probability of the positive class and that of the negative class:
$$\text{Odds} = \frac{P(y = 1)}{1 - P(y = 1)}$$
The greater a prediction's odds, the more likely it is to be classified as positive. If the odds are greater than one, the sample is more likely to be categorized as positive than as negative, and vice versa.
The probability may be calculated from the odds using the inverse function:
$$P(y = 1) = \frac{\text{Odds}}{1 + \text{Odds}}$$
Combining this expression with the linear function, we obtain:
$$\frac{P(y = 1)}{1 - P(y = 1)} = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}$$
Finally, taking the logarithm of both sides yields an equation that is a linear function of the predictors:
$$\ln\left(\frac{P(y = 1)}{1 - P(y = 1)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$
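The chain from linear score to probability, odds, and log-odds can be verified with a few lines of Python. This is a sketch under stated assumptions: the coefficients `beta0` and `betas` and the feature vector `x` are hypothetical values picked for illustration, not values fitted on the project's dataset.

```python
import math


def probability(score: float) -> float:
    """Logistic function applied to a linear score."""
    return 1.0 / (1.0 + math.exp(-score))


def odds(p: float) -> float:
    """Odds of the positive class: p / (1 - p)."""
    return p / (1.0 - p)


def log_odds(p: float) -> float:
    """Natural logarithm of the odds (the logit)."""
    return math.log(odds(p))


# Hypothetical coefficients and one hypothetical observation.
beta0 = -1.0
betas = [0.8, -0.5]
x = [2.0, 1.0]

score = beta0 + sum(b * xi for b, xi in zip(betas, x))  # linear function
p = probability(score)

print(f"linear score = {score:.4f}")
print(f"probability  = {p:.4f}")
print(f"odds         = {odds(p):.4f}")
print(f"log-odds     = {log_odds(p):.4f}")  # recovers the linear score
```

The last printed value matches the linear score up to rounding, which is the sense in which logistic regression is linear in the log-odds.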
Logistic regression works best for data sets that are linearly separable. A data set is considered linearly separable if a straight line can separate its two classes. Logistic regression is used when the Y variable can take only two values, and if the data is linearly separable, it divides the data into two groups efficiently.
Logistic regression allows us to assess the importance of an independent variable (i.e., coefficient size) and indicates the direction of the association (positive or negative).
Disadvantages
Logistic regression cannot predict continuous outcomes.
Logistic regression assumes a linear relationship between the independent variables and the log-odds of the dependent variable. In the real world, observations are unlikely to be linearly separable.
Logistic regression can be inaccurate if the sample size is too small, because the model is then fitted on few observations, which can lead to overfitting.
2.2.2 Random Forest
2.2.2.1 Concepts
The Random Forest algorithm uses the Decision Tree algorithm to build several distinct decision trees, each with a random element. The predictions of the individual trees are then combined into the final result. This is an instance of ensemble learning, a technique that combines several classifiers to solve complicated problems.
Random Forests are used for both regression and classification problems.
A Random Forest consists of several decision trees, and the 'forest' is trained using bagging, also known as bootstrap aggregation.
Figure 2-4 Random Forest working diagram (Source: simplilearn)
Step 1: Choose random samples from the given data collection or training set.
Step 2: Build a decision tree for each sample.
Step 3: Collect the prediction of each decision tree as a vote.
Step 4: Finally, choose the prediction that received the most votes as the final prediction.
To further understand how the algorithm works, consider the following example: one decision tree predicts banana as the outcome, while the majority of decision trees predict apple. The Random Forest classifier makes the final decision by majority vote, so it selects apple as its final prediction.
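The sampling and voting steps can be sketched in plain Python. This is a simplified illustration and not the PySpark implementation used in the project: the helper names `bootstrap_sample` and `majority_vote` are hypothetical, and the tree predictions are hard-coded instead of coming from trained trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, seed):
    """Step 1: draw len(rows) items with replacement (a bootstrap sample)."""
    rng = random.Random(seed)
    return rng.choices(rows, k=len(rows))


def majority_vote(tree_predictions):
    """Steps 3 and 4: count one vote per tree and return the most common label.

    Ties are resolved in favour of the label that was seen first.
    """
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label


# Hypothetical training rows; each tree would be fitted on its own sample.
rows = ["r1", "r2", "r3", "r4", "r5"]
print(bootstrap_sample(rows, seed=0))

# Hypothetical predictions from five trees for one piece of fruit.
votes = ["apple", "banana", "apple", "apple", "cherry"]
print(majority_vote(votes))  # apple
```

For regression, the voting step would be replaced by an average of the trees' numeric predictions.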
2.2.2.3 Application
Random forest algorithms are utilized in a variety of disciplines, including:
In the financial industry, it is used to recognize clients who are more likely to repay loans on time or to use bank services more frequently, to detect criminals attempting to defraud institutions, and to predict future stock performance.
In medicine, it is used to establish the appropriate mix of drugs and to examine the patient's medical history to diagnose the ailment.
In e-commerce, it is used to determine whether customers like a product.
2.2.3.1 Concepts
Boosting
Boosting is an ensemble approach that aims to build a strong classifier from a collection of weaker classifiers. With each succeeding round, records with large residuals receive greater weight. Decision trees are the most commonly used weak learners.
Boosting arose from the desire to overcome the limitations of Bagging. We expect the weak models to support one another, each learning from its predecessors and avoiding their mistakes. This is something Bagging cannot accomplish.
A model is built from the training data as a combination of numerous models, in which each succeeding model takes into account and reduces the mistakes of the prior models. This is done through the weights: the weight of correctly predicted data remains constant, while the weight of wrongly predicted data grows. The performance of the first model therefore affects how the second model is built. Models are added until the training set is predicted correctly or the number of models exceeds a certain threshold. The returned result is based on the final model in this chain.
Gradient Boosted Trees
The ensemble strategy is based on the premise that, rather than attempting to develop a single strong model, we create a family of slightly weaker models that, when combined appropriately, yield a stronger model. "If one model can't solve it on its own, let many models solve it together."
Gradient Boosted Trees (GBT) is a machine learning approach that uses an ensemble method to combine several weak models (typically decision trees) into a stronger model.
GBT improves prediction performance by iteratively adding decision trees that correct prior mistakes. At each step, the next weak learner and its weight are chosen to minimize the loss:
$$\min_{c_n,\, w_n} L\left(y,\; W_{n-1} + c_n w_n\right)$$
Where:
+ $L$: the loss function value.
+ $y$: the label.
+ $c_n$: the confidence score of the nth weak learner (also known as its weight).
+ $w_n$: the nth weak learner.
+ $W_{n-1}$: the model built from the first n-1 weak learners.
The loss function is a key component of both the evaluation and objective functions. Specifically, in common formulae:
The loss function computes a non-negative real number that represents the difference between two quantities: $\hat{y}$, the predicted label, and $y$, the true label. It acts as a penalty that the model pays every time it predicts incorrectly, with the size of the penalty proportional to the severity of the error. In every supervised learning task, we aim to minimize this penalty. In the ideal case, $\hat{y} = y$ and the loss function takes its minimum value of zero.
Scanning all values to find the global optimum would require a significant amount of time and resources. Instead, we search for a local solution each time a new model is added to the model chain, progressing step by step toward the global solution.
Gradient Boosting is a more general form of boosting that is used to minimize the cost function. Specifically, for the first optimization problem:
Here $w_n$ is the model that is added next. The new model must then be trained to fit the negative gradient of the loss at the current model, $-\nabla L(W_{n-1})$. (This value is also known as the pseudo-residuals.)
A synopsis of the algorithm is provided below:
Initialize the pseudo-residuals to the same value for all data points.
In the i-th loop:
+ Train a new model to fit the current pseudo-residuals.
+ Calculate the confidence score of the newly trained model.
+ Update the core model: $W = W + c_n w_n$.
+ Finally, compute the pseudo-residuals to create the labels for the next model.
Then repeat with loop i + 1.
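The loop above can be illustrated with a small, self-contained sketch. It makes several simplifying assumptions that are not taken from the report: the loss is the squared error (so the pseudo-residuals are simply the label minus the current prediction), inputs are one-dimensional, each weak learner is a one-split decision stump, the initial model is the mean of the labels, and a fixed `learning_rate` plays the role of the confidence score $c_n$. All function names and data values are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump on 1-D inputs by least squares.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Build W = base + sum(learning_rate * w_n) by fitting pseudo-residuals."""
    base = sum(ys) / len(ys)              # initial constant model
    predictions = [base] * len(xs)
    learners = []
    for _ in range(n_rounds):
        # Negative gradient of the loss 0.5 * (y - prediction)^2.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)  # train w_n on the pseudo-residuals
        learners.append(stump)
        # Update the core model: W = W + c_n * w_n with c_n = learning_rate.
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]

    def model(x):
        return base + learning_rate * sum(s(x) for s in learners)

    return model


def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data with two clusters of labels.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
boosted = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
print("baseline MSE:", round(mean_squared_error(lambda x: mean_y, xs, ys), 4))
print("boosted MSE: ", round(mean_squared_error(boosted, xs, ys), 4))
```

Library implementations such as Spark's gradient-boosted trees use deeper trees, other loss functions, and regularization, so this sketch only conveys the order of the steps.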
Gradient Boosting covers more cases.
Disadvantages:
GBT is a sequential process; therefore, model training takes a long time when dealing with huge data sets.
GBT is prone to overfitting if hyperparameters are not managed.
2.3 K-Fold Cross Validation
K-Fold cross-validation is an evaluation approach for machine learning models that estimates how well a model will predict unseen, real-world data. It divides the training data set into K smaller subsets of equal size. The model is trained on K-1 subsets and tested on the remaining subset. This is repeated K times, each time with a different subset as the test set. The results of the K runs are then combined to give an overall evaluation of the model.
K is often set to 5 or 10. It should be chosen so that each subset is large enough to be statistically representative of the larger data set.
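The splitting scheme can be sketched as an index generator. This is a minimal illustration with a hypothetical helper name, `k_fold_indices`; it assumes the rows are already in random order (no shuffling is done), and when the number of samples is not divisible by K the first folds receive one extra sample.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Each of the 10 samples is used for testing exactly once across the 5 folds.
for fold, (train, test) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"fold {fold}: test={test} train_size={len(train)}")
```

In each iteration the model would be fitted on the `train` rows and scored on the `test` rows, and the K scores would then be averaged.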
2.4 Evaluation method based on confusion matrix
2.4.1 Concepts
The confusion matrix is a widely used tool for classification problems. It is a table that summarizes the numbers of correct and incorrect predictions produced by a classifier, and it is used to assess the effectiveness of a classification model. Accuracy, Precision, Recall, and F1 score are the most commonly used performance measures derived from it.
The confusion matrix is a square matrix in which each row represents a predicted class and each column represents an actual class.
True Positive - TP: occurrences that the model correctly predicts as "occurring = Yes".
False Positive - FP: occurrences that the model mistakenly predicts as "occurring = Yes" but which in reality are "not occurring = No".
True Negative - TN: occurrences that the model correctly predicts as "not occurring = No".
False Negative - FN: occurrences that the model predicts as "not occurring = No" but which in reality are "occurring = Yes". This is the opposite of FP.
False Positives and False Negatives are known in statistics as Type I errors and Type II errors, respectively.
Our objective when developing the model is to minimize the number of false
negatives and false positives.
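The four cells of a binary confusion matrix can be counted directly from paired labels. The sketch below uses a hypothetical helper, `confusion_counts`, and made-up label lists; it treats the label 1 as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # predicted Yes, actually Yes
        elif predicted == positive:
            fp += 1   # predicted Yes, actually No (Type I error)
        elif actual == positive:
            fn += 1   # predicted No, actually Yes (Type II error)
        else:
            tn += 1   # predicted No, actually No
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Made-up labels for eight customers (1 = interested, 0 = not interested).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))
# {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```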
2.4.2 Evaluation indicators
Accuracy is defined as the number of correctly predicted values divided by the total number of values in the dataset. Because of its simple formula and clear meaning, evaluating classification models by accuracy is popular.
However, this measure only shows the percentage of values classified into the correct class. It does not show the accuracy for each class, which class is classified most accurately, or which class is most often confused. Furthermore, accuracy is a poor measure for skewed data sets. We therefore use two further indicators to evaluate a model's reliability: precision and recall.
Recall is the number of samples correctly predicted as positive out of all the samples that actually belong to the positive class: among the truly positive points, how many were correctly classified? High recall indicates a low rate of missed positive samples.
Recall should be given more weight in model selection when mistaking a Positive label for a Negative one (a False Negative) has serious consequences. For example, in predicting whether people have cancer, predicting that a sick person is free of the disease has serious consequences. The higher the recall, the fewer positive points are missed. Recall = 1 means that the model recognizes all points labeled Positive.
Consider the following example. When people believe they are unwell, they go to the hospital to get tested. There are two classes: having the disease (positive) and not having the disease (negative). Precision is the proportion of people classified as positive who genuinely have the disease. If the precision is 0.9, then 90 out of 100 people identified as positive truly have the disease. Recall is the number of people correctly classified as positive divided by the total number of people who have the disease. If the recall is 0.9, then 90 out of 100 patients with the disease are classified as positive. The greater the recall, the more likely a patient with the disease is to be diagnosed as positive.
Precision answers the question "How many of the 100 people predicted to have the disease actually have it?". Recall answers the question "Did we miss any cases?" (that is, did we overlook any individuals who have the disease but were not predicted to have it?).
The F1 score is the harmonic mean of Precision and Recall, and it is employed as an
indicator in cases when selecting either precision or recall may result in model bias.
The False Positive Rate (FPR) is the fraction of all negative observations that are
mistakenly predicted to be positive.
On a ROC plot (TPR against FPR), the classifier whose curve lies closer to the upper
left corner performs better. A random classifier is expected to produce points on the
diagonal (FPR = TPR). The closer the ROC curve lies to this 45-degree diagonal of the
ROC space, the weaker the classifier.
It should be noted that the ROC curve is independent of the class distribution. This
makes it useful for evaluating classifiers that predict rare events such as sickness
or disasters. In contrast, measuring performance by accuracy favors classifiers
that always predict the negative outcome for rare events.
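The weakness of accuracy on rare events can be seen in a small sketch. The labels below are made up for illustration (10 positives among 100 observations), and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of observations whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of truly positive observations that were predicted positive."""
    positives = sum(1 for t in y_true if t == 1)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives if positives else 0.0


# Made-up imbalanced labels: 10 positives, 90 negatives.
y_true = [1] * 10 + [0] * 90
# A classifier that always predicts the negative class.
always_negative = [0] * 100

acc = accuracy(y_true, always_negative)
rec = recall(y_true, always_negative)
print(f"accuracy={acc:.2f} recall={rec:.2f}")
```

The always-negative classifier scores 0.90 on accuracy while finding none of the positive cases, which is why recall and ROC-based measures are preferred for this kind of data.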
AUC stands for "Area Under the ROC Curve." That is, AUC represents the full
area under the ROC curve.
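As a minimal sketch, AUC can be computed without drawing the curve: it equals the fraction of (positive, negative) pairs in which the positive point receives the higher score, with ties counted as one half. The function name and the toy scores are assumptions for illustration only.

```python
def auc_pairwise(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
perfect = auc_pairwise(labels, [0.9, 0.8, 0.2, 0.1])   # positives always ranked higher
inverted = auc_pairwise(labels, [0.1, 0.2, 0.8, 0.9])  # positives always ranked lower
constant = auc_pairwise(labels, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
print(perfect, inverted, constant)
```

The pairwise loop is quadratic in the number of rows, so it only suits small samples; on the full dataset a library evaluator would be the practical choice.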
The higher the AUC value, the more successful the classifier. And AUC is
categorized in the following manner:
- If AUC = 1, the classifier accurately distinguishes between all Positive and
Negative class points.
Figure 2-6 Histogram for the case of AUC = 1
- If AUC = 0, the classifier's predictions are completely inverted: it predicts all
Positive points as Negative and all Negative points as Positive.
Figure 2-9 Histogram for the case of AUC = 0.5
Figure 3-10 Use the distinct function to select rows with unique values
The dataset does not contain any duplicate values.
Figure 3-11 Statistical chart for the variable 'age'
Based on the analysis of the age distribution chart of customers in the dataset, it can
be observed that the primary concentration of data lies in the age range from 20 to 30.
This indicates that the majority of the company's customers fall into the young age group.
Additionally, the age range from 40 to 50 also holds a relatively high proportion,
showing diversity in the age range of customers.
The dataset displays a fairly wide age range, from 20 to 80 years old, with the
average age around 35-40 years. Although the ages span this whole range, the
distribution is not even: the largest concentration is in the age group from 20 to 30.
Between the ages of 22 and 26, the data reaches its highest concentration, while the
concentration decreases significantly in higher age groups, especially from 70 to 80
years old. This shows that the customer base is clearly skewed toward the younger age
group.
The chart illustrates the correlation between the 'Age' variable and the
'Annual_Premium' variable
The scatter plot illustrates the correlation between the 'Age' variable and the
'Annual_Premium' variable (the amount spent on insurance premiums). From the plot, it
can be observed that there is an increasing trend in the amount spent on insurance
premiums as the age of the customer increases, especially in the range from 20 to 50
years old.
This positive correlation suggests that individuals aged between 20 and 50 years old
tend to spend a larger amount on insurance premiums. This could reflect factors such as
the diversity of insurance needs, increased financial security with family, or changes in
risk and personal responsibility over time.
Variable “gender”
Figure 3-13 The count of genders with a driver's license
The information from the dataset indicates that the number of males with a driver's
license is higher than females. This presents an opportunity for cross-selling campaigns,
particularly within the male segment. It might make sense to focus on this segment as
they are a group that frequently uses vehicles and has a demand for auto insurance.
Cross-selling campaigns can be designed to capitalize on the specific needs of
males related to vehicles and insurance. Promotional and marketing strategies can be
tailored to attract and retain the attention of the male audience while providing insurance
packages or special deals that align with their needs.
Variable “Previously_Insured”
Many users have not previously enrolled in insurance, with over half of them not
having approached or utilized insurance. This raises questions about why they are not
interested or do not use insurance services. Delving into details about this group may
provide valuable insights into barriers or unmet needs, which can be used to optimize
marketing strategies and expand the market.
Variable “Vehicle_Age”
Figure 3-16 Statistics for the 'Vehicle_Damage' variable
The information about the proportions indicates an interesting trend: if the vehicle
has not been damaged in the past, users tend not to opt for cross-selling insurance. This
might reflect the belief or perspective of users that their vehicles are less likely to
encounter issues or damages.
With this insight, marketing and sales strategies can be adjusted to focus on
conveying the value of cross-selling insurance for unforeseen situations. The message
could highlight the benefits of cross-selling insurance in protecting against risks
that users may not have considered, even when their vehicles have not previously
experienced issues.
Variable “Annual_Premium”
Figure 3-18 Statistical chart for the 'Vintage' variable:
The 'Vintage' variable appears to be a balanced and independent variable, with no
significant skewness and no presence of outliers. This can contribute to effective
modeling during the data analysis process.
Variable “Policy_Sales_Channel”
Figure 3-20 Heatmap chart
Based on the heatmap chart illustrating correlation coefficients, the team has
identified three variables with correlation coefficients close to 0 concerning the
dependent variable 'Response'. Specifically, 'Region_Code' has a correlation coefficient
of 0.011, 'Driving_License' is 0.01, and 'Vintage' is -0.0011. Correlation values close to 0
often indicate weak or no significant linear correlation between variables.
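For reference, each coefficient shown in a correlation heatmap is a Pearson correlation between two columns. The sketch below computes it directly; the function name and the toy columns are hypothetical and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Toy columns: one rises with the target, one falls, one is unrelated.
target = [1, 2, 3, 4]
rising = pearson([2, 4, 6, 8], target)
falling = pearson([8, 6, 4, 2], target)
unrelated = pearson([1, -1, -1, 1], target)
print(rising, falling, unrelated)
```

A value near 0 only rules out a linear relationship; a variable can still matter to a tree-based model through a non-linear effect.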
Also, based on the heatmap chart:
+ (+) values indicate positive correlation
+ (-) values indicate negative correlation
Excluding the variables that are not correlated with the response and selecting the
variables that positively or negatively influence the response, we have the following:
+ Previously_Insured (-0.34)
+ Age (0.11)
+ Vehicle_Age (0.22)
+ Vehicle_Damage (0.35)
+ Annual_Premium (0.023)
Feature selection for machine learning model:
The data and variables are selected to build the machine learning model based on
the Feature Selection method. This method aims to reduce or eliminate unnecessary input
variables, focusing only on variables that have a significant impact on the model. This
helps optimize the model's performance by making the processing and running of the
model more efficient.
Feature selection also helps avoid using highly correlated variables, minimizes
unrelated variables, and prevents the use of data that complicates the analysis. The result
is a refined dataset, making the machine learning model more effective and easier to
manage during deployment.
The SelectKBest method is employed to reduce the number of input variables, as
shown in the graph below.
Figure 3-21 Data before balancing
The percentage of interested customers is only 10%, while the percentage of
non-interested customers is very high. Therefore, this is an imbalanced dataset. If
this dataset is used as the basis for prediction and analysis models, algorithms may
become biased toward the majority class, as they will 'assume' that the majority of
customers are not interested. Hence, data balancing is necessary to avoid introducing
any bias into the analysis.
Imbalanced data refers to the distribution of samples across classes or labels that is
significantly skewed, leading the model to focus only on learning features of the class
with more data, failing to generalize to the entire dataset. This results in the model
performing well on classes with more data and poorly on classes with less data.
To address this issue, we can use the resample function from the scikit-learn library
to balance the dataset.
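The idea behind resampling can be sketched with the Python standard library as a stand-in for the scikit-learn function. This sketch assumes random oversampling of the minority class with replacement; the report does not state whether over- or under-sampling was used, and the function name, seed, and toy rows are hypothetical.

```python
import random


def oversample_minority(rows, label_key="Response", seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the largest class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy data mirroring the roughly 10% / 90% split described above.
rows = [{"Response": 1}] * 10 + [{"Response": 0}] * 90
balanced = oversample_minority(rows)
counts = {
    0: sum(1 for r in balanced if r["Response"] == 0),
    1: sum(1 for r in balanced if r["Response"] == 1),
}
print(counts)
```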
3.3.2 The data labeling process
Identify the numeric and string columns in order to perform labeling. Encode the
categorical data as numbers to avoid running a biased model. Here, the group only
changes the gender variable because the other categorical variables have no
computational meaning.
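A minimal sketch of label encoding for the gender column follows. The category values "Male" and "Female" and the alphabetical ordering of the codes are assumptions for illustration; the indexer actually used by the group may assign the codes differently.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


# Assumed category values for the gender column.
genders = ["Male", "Female", "Female", "Male"]
mapping = fit_label_encoder(genders)
encoded = [mapping[g] for g in genders]
print(mapping, encoded)
```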
Figure 3-28 Results table after implementing the Min-max Scaling method
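Min-max scaling maps each value x to (x - min) / (max - min), so that every scaled column lies in the range [0, 1]. A minimal sketch follows, with hypothetical age values and a hypothetical function name:

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 35, 50, 80]  # hypothetical sample values
scaled = min_max_scale(ages)
print(scaled)
```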
3.3.5 Vectorize the input data and separate the train - test data set
After selecting 4 variables that are thought to be strongly correlated with the target
variable, proceed to vectorize the input data and split it into training and test sets.
The implementation procedure includes the following steps:
Step 1: Create a list of nominal variables for forecasting
Figure 3-30 List of variable columns after applying VectorAssembler
Step 3: We run the algorithm after processing and analyzing the variables. First, the
standardized data set must be divided into 2 parts at a ratio of 8:2: a training set
(used to find coefficients and build the model) and a testing set (used to produce
prediction values and evaluate the model), along with separating features and targets
in each train and test set.
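The 8:2 split and the separation of features and targets can be sketched as follows. The function name, seed, and toy rows are hypothetical; note that a Spark random split yields sizes that are only approximately 80% and 20%, whereas this sketch produces exact sizes.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into (train, test) parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows of the form (feature vector, target).
rows = [([i], i % 2) for i in range(100)]
train, test = train_test_split(rows)

# Separate features and targets in each part.
X_train = [features for features, _ in train]
y_train = [target for _, target in train]
X_test = [features for features, _ in test]
y_test = [target for _, target in test]
print(len(train), len(test))
```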
3.4 Run the model
3.4.1 Model 1: Logistic Regression
Using the Logistic Regression model is done through the following steps:
In the Logistic Regression step, create a Logistic Regression model and use
cross-validation with the number of folds = 5, which corresponds to the 8:2
training-testing ratio. Then, use a parameter grid of hyperparameters to build the grid.
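The link between 5 folds and the 8:2 ratio, and the way a parameter grid expands, can be sketched in plain Python. In each of the 5 rounds one fifth of the rows (20%) is held out for validation and the remaining 80% is used for training. The helper names are hypothetical, and the grid values are assumptions; the report does not list the values the group actually searched.

```python
from itertools import product


def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, val_idx) pairs; every row is used for validation exactly once."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size


def param_grid(options):
    """Expand {name: [values]} into a list of every parameter combination."""
    names = list(options)
    return [dict(zip(names, combo)) for combo in product(*(options[n] for n in names))]


# Assumed hyperparameter values, for illustration only.
grid = param_grid({"regParam": [0.01, 0.1], "elasticNetParam": [0.0, 0.5]})
folds = list(k_fold_indices(10, k=5))

# Cross-validation trains one model per (parameter combination, fold) pair.
n_fits = len(grid) * len(folds)
print(len(grid), len(folds), n_fits)
```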
After training, the model's output can be displayed with columns containing the
vectorized feature variables, the label, the raw prediction, the probability, and the
prediction.
Evaluate the model using the precision, recall, F1 score, and support indices to check
whether the expected results are obtained.
Figure 3-33 Logistic Regression model evaluation index according to the confusion
matrix
3.4.2 Model 2: Random Forest
Use RandomForestClassifier in the pyspark.ml.classification library to conduct
training of the Random Forest model on input variables defined with numFolds=5.
Random Forest is trained quickly, giving positive, predictable results.
Figure 3-34 Random Forest model evaluation index according to the confusion matrix
3.4.3 Model 3: Gradient Boosted Trees
Import the Gradient Boosted-tree model from the pyspark.ml.classification library
to perform classification on the dataset. Then, build a Gradient Boosted-trees model with
a maximum number of child decision trees of 10. Train the model on the previously
prepared training set. Finally, run the model on the test set and perform cross-validation
testing with folds = 5.
CHAPTER 4 : EVALUATING THE RESULTS OF THE MODELS
Look at the chart or index from the results of each model, compare and evaluate,
and choose the model that best suits the problem of predicting cross-selling of vehicle
insurance. Provide application proposals for cross-selling prediction, present the
group's results, and conclude with the important points to support decision-making.
4.1 Evaluate using proposed indicators
Model evaluation is essential in data analysis because it enables us to choose which
model best suits a specific task and to evaluate its performance. While many assessment
measures exist, only a few are suitable for evaluating machine learning models for
classification.
4.1.1 Evaluation using AUC/ROC and AUC/PR metrics
To determine the best algorithm for the dataset, we will assess the model using the
area under the ROC curve and measures like the area under the Precision-Recall (PR)
curve.
Following cross-validation, the evaluation results were derived from the Logistic
Regression technique.
The evaluation results were derived from the Random Forest technique cross-
validation:
The evaluation results were derived from the Gradient Boosting technique cross-
validation:
Figure 4-36 Confusion matrix of the Random Forest model
The model performs well, with the metrics showing strong F1-score and Recall values.
The Gradient Boosted Tree model performs exceptionally well in its classification
task, as evidenced by the recall metric reaching 0.95. This supports the choice of
this model.
4.2 Conclusion
We compared the efficacy of algorithmic models, such as Random Forest, Gradient
Boosted Tree, and Logistic Regression, for classification purposes based on the
objectives and thorough study procedure that the team initially described. The key goals
were finding the best model for the experimental dataset and evaluating their
optimization potential for cross-selling.
Our project, "Cross-selling predictions for vehicle insurance with Logistic Regression,
Gradient Boosted Tree, and Random Forest", has produced fresh, in-depth understanding
of the factors behind customers' interest, or lack of interest, in vehicle insurance.
The following are some ways that the research can help insurance companies use these
algorithms more effectively:
By using timely insights about possible interest in health cross-selling services, you
can anticipate and proactively address customer preferences and develop marketing
strategies that work.
You may improve user assistance by customizing offerings according to known
preferences. In the automotive sector, this strategy seeks to increase customer
satisfaction, develop pleasant consumer experiences, and establish trust.
REFERENCES
[1] Arora, U. (2016). Gradient Boosting: Visual Conceptualization | Dimensionless
Technologies. Retrieved 25 December 2023, from
<https://dimensionless.in/gradient-boosting/>.
[2] Gradient Boost (GBM) 란 . (2022). Retrieved 25 December 2023, from
<https://velog.io/@hyesoup/Gradient-Boost-GBM%EC%9D%B4%EB%9E%80>
[3] Gradient Boosting - Tất tần tật về thuật toán mạnh mẽ nhất trong Machine
Learning. (2022). Retrieved 25 December 2023, from
<https://viblo.asia/p/gradient-boosting-tat-tan-tat-ve-thuat-toan-manh-me-nhat-trong-machine-learning-YWOZrN7vZQ0>
[4] Random Forest Classifier: A Complete Guide to How It Works in Machine
Learning. (2022). Retrieved 25 December 2023, from
<https://builtin.com/data-science/random-forest-algorithm>
[5] Introduction to Random Forest in Machine Learning. (2022). Retrieved 25
December 2023, from
<https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/>
[6] So Sánh Accuracy Vs Precision Là Gì ? Precision, Recall Và F1. (2022).
Retrieved 25 December 2023, from <https://ceds.edu.vn/precision-la-gi/>
[7] Thắng, N. (2020). Đánh giá model AI với Precision, Recall va F1 Score - Mì AI.
Retrieved 25 December 2023, from
<https://www.miai.vn/2020/06/16/oanh-gia-model-ai-theo-cach-mi-an-lien-chuong-2-precision-recall-va-f-score/>
[8] What is K-fold Cross Validation?. (2022). Retrieved 25 December 2023, from
<https://towardsdatascience.com/what-is-k-fold-cross-validation-5a7bb241d82f>