
UNIVERSITY OF ECONOMICS AND LAW
FACULTY OF INFORMATION SYSTEMS

--------
FINAL REPORT
BIG DATA ANALYTICS AND ITS APPLICATIONS
[GROUP 4]
TOPIC: APPLYING BIG DATA IN THE SPARK ENVIRONMENT TO BUILD A FORECAST MODEL FOR CROSS-SELLING VEHICLE INSURANCE
Dataset: https://www.kaggle.com/datasets/anmolkumar/health-insurance-cross-sell-prediction/code?datasetId=869050&sortBy=voteCount
Course : Big Data Analytics And Its Applications
Course Code : 232MI1401
Lecturer : Mr. Nguyen Thon Da
Ho Chi Minh City, January 20, 2024.
GROUP MEMBER
No. Full Name Student ID Role
1 Nguyen Thi Thanh Thuy K204061418 Leader
2 Tran Thi Thuan K204061417 Member
3 Nguyen Thuy Dung K204061391 Member
4 Mai Tran Dan Nhi K204061407 Member
5 Tran Thi Thuyen Thuyen K204061419 Member
6 Dinh Tran Xuan Nguyen K204061404 Member
ACKNOWLEDGMENTS
Group 5 has applied what we learned in the "Big Data Analytics And Its Applications" course taught by Mr. Nguyen Thon Da. By absorbing both theoretical and practical knowledge throughout the course, our team was able to implement this project.
Group 5 would like to sincerely thank all those who supported us in completing the course report. First, the team would like to thank Mr. Da for providing a solid foundation of knowledge and for the many comments that helped the team complete the project well.
Despite our limited knowledge and experience, our team has tried to develop ideas and meet the stated requirements in the best possible way. However, many obstacles remain, and mistakes cannot be avoided. We hope you will read and comment on the group's topic so that Group 5 can learn from it and improve the work.
Best regards,
Group 5.
COMMITMENT
Group 5 confirms that the project with the topic “Applying Big Data In The Spark Environment To Build A Forecast Model For Cross-Selling Vehicle Insurance” was implemented by the group itself, based on the guidance of Lecturer Nguyen Thon Da.
The group has consulted external sources, and these sources are cited in the references section at the end of the report. All other content is developed from the team's own research and is completely honest. If any problem arises, the team will take full responsibility.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
PROJECT MEMBER EVALUATION
DETAILED TASK ASSIGNMENT TABLE
ABSTRACT
CHAPTER 1 : INTRODUCTION TO THE TOPIC
1.1 Reason for choosing the topic
1.2 Objectives of the research
1.3 Research question
1.4 Research tools and languages
1.5 Research process
1.6 Structure of the research
CHAPTER 2 : THEORETICAL BASIS
2.1 Related terms
2.1.1 Vehicle insurance
2.1.2 "Cross-selling" term
2.1.3 Cross-selling vehicle insurance
2.2 Application algorithm
2.2.1 Logistic Regression
2.2.2 Random Forest
2.2.3 Gradient Boosted Trees Algorithm
2.3 K-Fold Cross Validation
2.4 Evaluation method based on confusion matrix
2.4.1 Concepts
2.4.2 Evaluation indicators
CHAPTER 3 : CONDUCTING EXPERIMENTAL PROBLEMS
3.1 General information
3.2 Explore and analyze data
3.2.1 Explore and visualize data
3.2.2 Visualize data using Heatmap charts
3.3 Prepare to run the model
3.3.1 The data balancing process
3.3.2 The data labeling process
3.3.3 The data cleaning process
3.3.4 The scaling data process
3.3.5 Vectorize the input data and separate the train - test data set
3.4 Run the model
3.4.1 Model 1: Logistic Regression
3.4.2 Model 2: Random Forest
3.4.3 Model 3: Gradient Boosted Trees
CHAPTER 4 : EVALUATING THE RESULTS OF THE MODELS
4.1 Evaluate using proposed indicators
4.1.1 Evaluation using AUC/ROC and AUC/PR metrics
4.1.2 Evaluation using confusion matrix metrics
4.2 Conclusion
REFERENCES
LIST OF FIGURES
Figure 1-1 Research process
Figure 2-1 The sigmoid function chart (source: Pham Dinh Khanh)
Figure 2-2 Graph of the exponential function $e^{-x}$ and the function $e^{x}$
Figure 2-3 Random Forest working diagram (Source: simplilearn)
Figure 2-4 Confusion matrix
Figure 2-5 Histogram for the case of AUC = 1
Figure 2-6 Histogram for the case of AUC = 0
Figure 2-7 Histogram for the case of AUC = 0.7
Figure 2-8 Histogram for the case of AUC = 0.5
Figure 3-1 Use the distinct function to select rows with unique values
Figure 3-2 Statistical chart for the variable 'age'
Figure 3-3 Statistical chart for the 'Gender' variable
Figure 3-4 The count of genders with a driver's license
Figure 3-5 The number of users with or without auto insurance
Figure 3-6 The distribution of vehicle ages
Figure 3-7 Statistics for the 'Vehicle_Damage' variable
Figure 3-8 Statistics for the 'Annual_Premium' variable
Figure 3-9 Statistical chart for the 'Vintage' variable
Figure 3-10 Statistical chart for the 'Policy_Sales_Channel' variable
Figure 3-11 Heatmap chart
Figure 3-12 Data before balancing
Figure 3-13 Data after balancing
Figure 3-14 The code labels the gender variable
Figure 3-15 Results after data labeling
Figure 3-16 Data before removing outliers
Figure 3-17 Find the upper limit and lower limit
Figure 3-18 Data after removing outliers
Figure 3-19 Results table after implementing the Min-max Scaling method
Figure 3-20 List of nominal variables to predict
Figure 3-21 List of variable columns after applying VectorAssembler
Figure 3-22 Data of variable features after being vectorized
Figure 3-23 The results of model test
Figure 3-24 Logistic Regression model evaluation according to the confusion matrix
Figure 3-25 Random Forest model evaluation index according to the confusion matrix
Figure 3-26 Gradient Boosted Trees model evaluation index
Figure 4-1 Confusion matrix of the Random Forest model
LIST OF TABLES
Table 3-1 Description of attributes
LIST OF ACRONYMS
Acronym Description
EDA Exploratory Data Analysis
DT Decision Tree
RF Random Forest
ERF Extreme Random Forest
LR Logistic Regression
KNN K-Nearest Neighbour
ROC Receiver Operating Characteristics
AUC Area Under The Curve
GBT Gradient Boosted Trees
FPR False positive rate
TPR True positive rate
FN False Negative
TN True Negative
FP False Positive
TP True Positive
PROJECT MEMBER EVALUATION
No. Full Name Student ID Role Evaluation Rate
1 Nguyen Thi Thanh Thuy K204061418 Leader 100%
2 Tran Thi Thuan K204061417 Member 100%
3 Nguyen Thuy Dung K204061391 Member 100%
4 Mai Tran Dan Nhi K204061407 Member 100%
5 Tran Thi Thuyen Thuyen K204061419 Member 100%
6 Dinh Tran Xuan Nguyen K204061404 Member 100%
DETAILED TASK ASSIGNMENT TABLE
No. Task Performer
1 Identify course requirements and evaluation criteria. Select a topic
1.1 Determine project requirements All
1.2 Searching the data for the project All
1.3 Looking up the knowledge about the project All
2 Identify and develop the project outline All

3 Develop the project implementation plan Thuy
4 Develop the overview introduction content for the project Nguyen
5 Find research related to our chosen topic. Thuyen + Thuan
6 Develop theoretical foundations content
6.1 Artificial Neural Networks Thuy
6.2 Logistic Regression Dung
6.3 Random Forest Nhi
6.4 AUC-ROC curve Thuyen
7 Explore datasets
7.1 Overview of datasets Dung
7.2 Data mining methods Nhi
7.3 Mining variables in the dataset Nguyen + Thuan
8 Data preprocessing
8.1 Analyze and process data to bring it to a normal distribution Thuy
8.2 Scale and balance data using Rapidminer's technique Nhi
8.3 Feature selection based on correlation analysis Thuyen
9 Proposed model
9.1 Artificial Neural Networks Thuy + Dung
9.2 Logistic Regression Thuan + Nhi
9.3 Random Forest Nguyen + Thuyen
10 Experimental result and Conclusion Thuan
11 Check the full content of the report All
12 Summarize and edit the form of the report Nguyen
13 Slide Design Dung + Nhi
14 Check and adjust the content of the report All
ABSTRACT
In the contemporary era, as the demand for a better quality of life continues to rise, cross-selling campaigns for car insurance have become increasingly prevalent. This study uses the PySpark library and compares Logistic Regression, Random Forest, and Gradient Boosted Trees to construct a predictive system for cross-selling car insurance campaigns.
By applying these methods, our goal is to predict which customers will accept cross-selling services and to find the model best suited to the given dataset. The experiments and evaluations assess model performance through accuracy metrics and the AUC - ROC curve, which support comparing the models and selecting the best one for the problem.
Faced with a complex and inflationary economic landscape, this research proposes an integrated approach that uses the PySpark library to handle large datasets. This helps businesses predict customer interest in car insurance, enabling them to plan appropriate communication strategies and optimize both their business model and revenue streams.
CHAPTER 1 : INTRODUCTION TO THE TOPIC
This chapter gives general information about the topic. After determining the project requirements, the introduction covers: the reason for choosing the topic, research objectives, research questions, research tools and languages, the research process, and the structure of the report.
1.1 Reason for choosing the topic
Nowadays, with the strong development of the economy and society, as well as of every aspect of life, the community demands that the quality of life be increasingly improved and met more fully. In particular, one of the most essential needs is to protect the health of yourself and your family. At the same time, the insurance market is currently on a solid growth path thanks to policy efforts from the government, along with contributions to macroeconomic stability and social security from companies.
As of November 30, 2023, the insurance market has 82 insurance businesses in many diverse forms. Thanks to that, cross-selling has penetrated the insurance market, even though it was not previously widely known in this field. It is now a very commonly applied business strategy. If done well, it brings more profit to the business and enhances customer experience, loyalty, and lifetime value. In particular, cross-selling of vehicle insurance has entered the insurance market.
To grasp user needs and predict customers' interest in vehicle insurance, organizations and insurance companies need to implement preferential services and promotional communication strategies that reach as much of their customer base as possible. From there, the ultimate goal of optimizing revenue can be achieved. For all those reasons, the group researched: "APPLYING BIG DATA IN THE SPARK ENVIRONMENT TO BUILD A FORECAST MODEL FOR CROSS-SELLING VEHICLE INSURANCE".
1.2 Objectives of the research
Identify the topic, explore the data set, and apply three algorithms: Logistic Regression, Random Forest, and Gradient Boosted Trees to predict demand for cross-selling vehicle insurance. From there, we provide evaluation results and propose solutions for insurance companies and organizations.
Improve teamwork skills, problem-solving skills, presentation skills, and skills related to thinking and data analysis for team members.
Understand more about the subject's related field to enrich knowledge and experience, and prepare for good job opportunities in the future.
1.3 Research question
The group raised the following two research questions:
What factors lead to the decision to accept cross-selling of health insurance in the
vehicle sector?
How do the results predict the extent to which customers return to use cross-selling
services?
1.4 Research tools and languages
Tool: Google Colaboratory.
Language: Python.
1.5 Research process
The team conducted the research with an 8-step process. The first step is to determine the project requirements. From the identified requirements, the group explored the data set to build an overview of the topic and studied the theoretical basis for the research. After studying the theoretical basis, the group prepared the experiment. Next come the steps of exploring the data, cleaning the data, and running the models. Finally, there are the evaluations, conclusions, and applications. The group's research process is described in the diagram below:
Figure 1-1 Research process
1.6 Structure of the research

The report includes four chapters:
Chapter 1: Introduction to the topic
This chapter gives general information about the topic. After determining the project requirements, the introduction covers: the reason for choosing the topic, research objectives, research questions, research tools and languages, the research process, and the structure of the report.
Chapter 2: Theoretical basis
The content shows theories related to the problem topic. The algorithm theories
applied to the topic include Logistic Regression, Random Forest, and Gradient Boosted
Trees.
Chapter 3: Conducting experimental problems
Select the data set and define the variables, explore the data in overview and in detail, and identify the factors that matter for the target variable before running the models. Present the data processing techniques and methods. Run the algorithms once the data set has been processed.
Chapter 4: Evaluate the results of the models
Look at the charts and indicators from the results of each model, compare and evaluate them, and choose the model that best suits the problem of predicting cross-selling of vehicle insurance. Present the group's results and conclude with the important points that support decision-making for insurance companies and organizations.
CHAPTER 2 : THEORETICAL BASIS
The content shows theories related to the problem topic. The algorithm theories
applied to the topic include Logistic Regression, Random Forest, and Gradient Boosted
Trees.
2.1 Related terms
2.1.1 Vehicle insurance
Vehicle insurance is a form of insurance that protects property and individuals from
potential harm caused by traffic accidents. Vehicle insurance provides liability coverage
if you are responsible for third-party injury or property damage. These insurance
packages cover motorbikes, vehicles, occupants, and accident victims.
2.1.2 "Cross-selling" term
Cross-selling is a sales approach that encourages clients to spend more money by
purchasing a product or service similar to the one they have previously purchased.
Even though this form requires clients to spend more money, they will readily
accept it because it assists them in finding things that they presently need or will need in
the future.
2.1.3 Cross-selling vehicle insurance
Cross-selling vehicle insurance is a popular sales approach banks and insurance
firms use to promote and sell vehicle insurance products to their clients.
Cross-selling car insurance can be accomplished in various ways, including offering
cross-selling vehicle insurance products when consumers purchase other insurance or
obtain a vehicle loan.
2.2 Application algorithm
2.2.1 Logistic Regression
2.2.1.1 Concepts
Logistic regression is a supervised machine-learning technique for binary
classification that predicts the likelihood of an outcome, occurrence, or observation.
Logistic regression is a predictive technique that describes and explains the
connection between a binary variable and one or more nominal, ordinal, interval, or ratio-
level independent variables.
We use logistic regression to classify research objects into the groups defined by the value range of the target variable. For example, to classify target customers into groups A, B, C, and D, the target variable takes the values y = {A, B, C, D} (the nominal or ordinal logistic regression form). We also use it to anticipate an event with just two outcomes, yes or no, in which case the target variable y has only two values: 0 means no, and 1 means yes (binary logistic regression, the most frequent form).
2.2.1.2 Classification
Logistic regression is classified into three categories.
- Binary Logistic Regression: The dependent variable has two alternative
outcomes/classes. These variables can signify success or failure, yes or no, victory or
loss, etc.
- Multinomial Logistic Regression: The dependent variable may contain three or more unordered outcomes/classes, or patterns with no quantitative significance. For example, predicting the quality of food (Good, Great, and Bad).
- Ordinal Logistic Regression: The dependent variable can have three or more ordered or quantitatively meaningful patterns. For example, these variables can represent “poor”, “good”, “very good”, or “excellent”, and each category can have a score such as 0, 1, 2, 3.
2.2.1.3 Logistic regression function

Logistic regression employs the logit function to model the link between the dependent and independent variables by estimating the probability of occurrence. The logistic function, often known as the sigmoid function, maps any real-valued score to a probability between 0 and 1, which can then be turned into a binary value to make predictions.
Logistic regression employs the sigmoid function as an activation function to determine the data's probability distribution.
The sigmoid function is defined as follows:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Figure 2-2 The sigmoid function chart (source: Pham Dinh Khanh).
The Sigmoid function resembles an S-curve and rises monotonically. That is why it
also goes by the moniker S-function.
The sigmoid function has values ranging from 0 to 1.
- We have the exponential function $e^{-x}$, which is the reciprocal of $e^{x}$: $e^{-x} = \frac{1}{e^{x}}$, represented as shown in the figure:
Figure 2-3 Graph of the exponential function $e^{-x}$ and the function $e^{x}$
Notice that $e^{-x}$ has the same shape as $e^{x}$; the only difference is that it is obtained by reflecting the graph across the y-axis.
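Then, examining the limits of the exponential function, we obtain:
$$\lim_{x \to +\infty} e^{-x} = 0, \qquad \lim_{x \to -\infty} e^{-x} = +\infty$$
Based on these limits, we can infer the limits of the sigmoid function as follows:
$$\lim_{x \to +\infty} \frac{1}{1 + e^{-x}} = 1, \qquad \lim_{x \to -\infty} \frac{1}{1 + e^{-x}} = 0$$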
Thus, for whatever value of x, the sigmoid function always produces a value
between 0 and 1. As a result, the Sigmoid function is well-suited for probability
forecasting in classification issues.
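The bounds described above can be checked numerically. The following is a minimal Python sketch rather than code from the group's notebook; the function name `sigmoid` and the sample inputs are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x),
    # so that exp() is never called with a large positive argument.
    z = math.exp(x)
    return z / (1.0 + z)


# The output approaches 0 on the left, equals 0.5 at x = 0,
# and approaches 1 on the right.
for x in (-10, -1, 0, 1, 10):
    print(f"sigmoid({x}) = {sigmoid(x):.5f}")
```

Splitting the computation by the sign of `x` is a design choice for numerical stability only; both branches compute the same function.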
In the linear regression model, we represented the relationship between the result and the characteristics with the following regression function, where $x_1, \dots, x_p$ are the characteristics and $w_0, \dots, w_p$ are the coefficients:
$$\hat{y} = w_0 + w_1 x_1 + \dots + w_p x_p$$
To classify, because a probability ranges from 0 to 1, we pass the right side of the equation through the logistic function. This restricts the output to values between 0 and 1:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_p x_p)}}$$
We use the odds rather than the probability to extract the exponential expression from the denominator. The odds are the ratio between the probabilities of the positive and negative classes predicted by a logistic regression model:
$$\text{odds} = \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}$$
The greater a prediction's odds, the more likely it is to be classified as positive. If the odds are greater than one, the sample is more likely to be categorized positively than negatively, and vice versa.
The probability may be calculated from the odds using the inverse function:
$$P(y = 1 \mid x) = \frac{\text{odds}}{1 + \text{odds}}$$
We may combine this expression with the linear function to obtain:
$$\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = e^{w_0 + w_1 x_1 + \dots + w_p x_p}$$
Finally, taking the logarithm of both sides yields an equation using a linear function of the predictors:
$$\log\left(\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}\right) = w_0 + w_1 x_1 + \dots + w_p x_p$$
Furthermore, if the sigmoid function's output (the estimated probability) exceeds a predetermined threshold, the model predicts that the value belongs to that class. If the predicted probability is less than the threshold, the model predicts that the value does not belong to that class. In both circumstances, we normally set a threshold of 0.5.
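To make the chain from linear score to probability, odds, and class label concrete, here is a small Python sketch. It is an illustration only: the coefficient values in `weights` and `bias` and the `customer` feature vector are hypothetical, not values fitted on the report's dataset.

```python
import math


def predict_proba(features, weights, bias):
    """P(y = 1 | x): the sigmoid applied to the linear score."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def odds(p):
    """Ratio of the positive-class probability to the negative-class one."""
    return p / (1.0 - p)


def classify(p, threshold=0.5):
    """Predict class 1 when the probability exceeds the threshold."""
    return 1 if p > threshold else 0


# Hypothetical coefficients and one hypothetical customer.
weights = [0.8, -1.2]
bias = 0.1
customer = [1.5, 0.5]

p = predict_proba(customer, weights, bias)
print(f"probability = {p:.4f}")
print(f"odds        = {odds(p):.4f}")
# Taking the log of the odds recovers the linear score (0.7 here).
print(f"log-odds    = {math.log(odds(p)):.4f}")
print(f"class       = {classify(p)}")
```

The log-odds line shows why the coefficients are read on the log-odds scale: the logarithm of the odds is exactly the linear function of the predictors.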
2.2.1.4 Advantages and Disadvantages

• Advantages
Logistic regression is significantly more straightforward than many other Machine Learning approaches, since the formula is simple to grasp and apply.
Logistic regression works best for data sets that are linearly separable. A data set is
considered linearly separable if a straight line separates two distinct data classes. Logistic
regression is utilized when your Y variable can only take two values, and if the data is
linearly separable, it is more efficient to divide it into two groups.
Logistic regression allows us to assess the importance of an independent variable
(i.e., coefficient size) and indicates the direction of the association (positive or negative).
• Disadvantages
Logistic regression cannot predict continuous outcomes.
Logistic regression assumes a linear relationship between the independent (predictor) variables and the log-odds of the dependent variable. In the actual world, observations are unlikely to be linearly separable.
Logistic regression can be inaccurate if the sample size is too small. If the sample
size is small, the model is created using logistic regression based on fewer observations.
This can lead to overfitting.
2.2.2 Random Forest
2.2.2.1 Concepts
In the Random Forest algorithm, we use the Decision Tree algorithm to create several distinct decision trees (each with a random element). The prediction results of these decision trees are then combined. This is a form of ensemble learning, a technique that combines several classifiers to solve complicated problems.
Random Forests are used to tackle both regression and classification problems.
A Random Forest consists of several decision trees. The 'forest' is trained using bagging, also called bootstrap aggregation.
2.2.2.2 How it works

Random Forest achieves its outcomes using an ensemble approach. Training data is
supplied to train various decision trees. This dataset contains observations and
characteristics that will be randomly picked throughout the node-splitting procedure.
A Random Forest system relies on several decision trees. Each decision tree has a root node, decision nodes, and leaf nodes. Each tree's leaf node represents that decision tree's final result. The overall outcome is determined using a majority vote mechanism.
The following stages and graphics illustrate the working process:
Figure 2-4 Random Forest working diagram (Source: simplilearn)
Step 1: Choose random samples from a given data collection or training set.
Step 2: The algorithm builds a decision tree for each sample.
Step 3: Each decision tree makes a prediction, and these predictions are counted as votes (or averaged, for regression).
Step 4: Finally, choose the prediction that received the most votes as the final prediction.
To further understand how the algorithm works, consider the following example: one decision tree predicts banana as the outcome, while the majority of decision trees predict apple. The Random Forest classifier makes the final prediction based on a majority vote, so it selects apple as its final prediction.
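The four steps and the fruit example can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the PySpark implementation used in the experiments: each "tree" is a one-split decision stump, and the fruit `data` (length in cm with a label), the helper names, and the number of trees are all hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def train_stump(sample):
    """Step 2: fit a one-split 'tree' on a single numeric feature."""
    values = sorted({x for x, _ in sample})
    best = None  # (errors, threshold, left_label, right_label)
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left_label = majority([y for x, y in sample if x <= threshold])
        right_label = majority([y for x, y in sample if x > threshold])
        errors = sum(
            y != (left_label if x <= threshold else right_label)
            for x, y in sample
        )
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # Only one distinct feature value: always predict the majority label.
        label = majority([y for _, y in sample])
        return lambda x: label
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def random_forest_predict(trees, x):
    """Steps 3 and 4: collect one vote per tree, return the most voted label."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0], votes


# Hypothetical training data: (length in cm, fruit).
data = [(6, "apple"), (7, "apple"), (8, "apple"),
        (17, "banana"), (18, "banana"), (20, "banana")]

rng = random.Random(0)
trees = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]

prediction, votes = random_forest_predict(trees, 7.5)
print(prediction, dict(votes))
```

A real Random Forest also samples a random subset of features at each split; with a single feature, that part is omitted here.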
2.2.2.3 Application
Random forest algorithms are utilized in a variety of disciplines, including:
In the financial industry, it is used to recognize clients who are more likely to repay loans on time or use bank services more frequently, to detect criminals attempting to defraud institutions, and to predict future stock performance.
In medicine, it is used to establish the appropriate mix of drugs and to examine the patient's medical history to diagnose the ailment.
In e-commerce, it is used to determine whether customers enjoy the product or not.

• Advantages
Can carry out classification and regression tasks.
A random forest delivers accurate, understandable forecasts.
Can handle big data sets effectively.
Random forest algorithms are more accurate in forecasting outcomes than decision
tree algorithms.
Improves model accuracy and reduces overfitting.
Maintains high accuracy even with missing information.
• Disadvantages
A random forest requires greater computational resources.
It takes more time than the decision tree algorithm.
2.2.3 Gradient Boosted Trees Algorithm
2.2.3.1 Concepts
• Boosting
Boosting is an ensemble approach that aims to generate a strong classifier from a collection of weak classifiers. With each succeeding round, records with large residuals receive greater weight. The most common weak learner is the decision tree.
Boosting arose from a desire to overcome the limitations of Bagging. We expect weak models to help each other by learning from each other and avoiding the faults of prior models. This is something Bagging cannot accomplish.
The model is built from training data using a combination of numerous models, where each succeeding model takes into account and reduces the mistakes of the prior models. This is done through weights: the weight of correctly predicted data remains constant, while the weight of wrongly predicted data grows. The performance of the first model therefore affects how the second model is produced. Models are added until the training set is predicted correctly or the number of models reaches a set limit. The returned result is based on the final model in this model chain.
• Gradient Boosted Trees
The Ensemble strategy is based on the premise that rather than attempting to develop a single strong model, we create a family of slightly weaker models that, when combined appropriately, provide an even stronger model. "If one model can't solve it on its own, let many models solve it together."
Gradient Boosted Trees (GBT) is a machine learning approach that uses an
Ensemble Method to integrate several weak models (typically decision trees) to generate
a stronger model.
Gradient Boosted Trees (GBT) are used to improve prediction performance by
iteratively altering decision trees based on prior mistakes.
2.2.3.2 How it works

By assigning extra weight to misclassified observations, boosting drives each new
model to train on the data where earlier models performed badly. The αm coefficient
gives more influence to models with lower errors.
Gradient Boosting treats boosting as a cost function optimization problem. Instead of
altering the weights, each new model is fitted to the gradient of the loss, so that
observations with larger residuals have a greater effect on training. Stochastic
gradient boosting introduces randomness into the algorithm by sampling observations
and predictor variables at each stage.
Gradient Boosting is a method designed to solve the following optimization
problem:
where:
+ L: the loss function value.
+ y: the label.
+ cn: the confidence score of the nth weak learner (also known as its weight).
+ wn: the nth weak learner.
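Using the symbols listed above, and writing W_{n-1} for the model built from the first
n-1 learners (the notation used later in this section), the objective is commonly
written in the following form. This is a sketch based on the standard additive-model
formulation of boosting, not a formula taken from the report itself:

```latex
\min_{c_n,\, w_n} \; L\bigl(y,\; W_{n-1} + c_n w_n\bigr)
```

Read this as: at step n, choose the new weak learner w_n and its weight c_n so that
adding them to the current model reduces the loss as much as possible.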
The loss function is a key component of both the evaluation and objective functions.
Specifically, in common formulae:
The loss function computes a non-negative real number that represents the difference
between two quantities: ŷ, the predicted label, and y, the true label. The loss
function forces the model to pay a penalty every time it predicts incorrectly, with
the size of the penalty proportional to the severity of the error. In every supervised
learning task, we aim to minimize this penalty. In the ideal case, ŷ = y and the loss
function reaches its minimum value of zero.
Scanning all possible values to find the globally optimal solution would require a
significant amount of time and resources. Instead, we search for a local solution each
time a new model is added to the model chain, progressing step by step toward the
global solution.
Gradient Boosting is a more general form of boosting that minimizes a cost function.
To solve the optimization problem above, it is first necessary to understand the
theory of gradient descent.
The gradient descent formula updates the model parameters in the direction that
decreases the loss, that is, opposite to the derivative. This formula is stated in
parameter space; to link it to our problem, we restate it from the function-space
perspective.
If the sequence of boosting models is viewed as a function W, each learner can be
regarded as a parameter of W. To minimize the loss function, we apply Gradient
Descent:
We can see the following relationship:
The next model is then added to the chain. This new model must be trained to fit the
value −∇L(Wn−1) (this value is also known as the pseudo-residuals).
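The gradient descent relationships described above can be written out as follows. This
is a reconstruction from the standard theory, where the learning rate symbol η is an
assumption (the text does not name it), and W_n, c_n, w_n follow the notation of this
section:

```latex
\begin{aligned}
\theta_n &= \theta_{n-1} - \eta \, \frac{\partial L(\theta_{n-1})}{\partial \theta_{n-1}}
  && \text{(gradient descent in parameter space)} \\
W_n &= W_{n-1} - \eta \, \frac{\partial L(W_{n-1})}{\partial W_{n-1}}
  && \text{(the same step in function space)} \\
W_n &= W_{n-1} + c_n w_n
  \;\Longrightarrow\;
  c_n w_n \approx -\eta \, \frac{\partial L(W_{n-1})}{\partial W_{n-1}}
\end{aligned}
```

Comparing the last two lines shows why each new learner is trained on the negative
gradient of the loss: the learner plays the role of the gradient descent step.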
A summary of the algorithm is provided below:
Initialize the pseudo-residuals to the same value for all data points.
In the n-th iteration:
+ Train a new model to fit the current pseudo-residuals.
+ Calculate the confidence score of the newly trained model.
+ Update the main model: W = W + cn wn
+ Finally, compute the new pseudo-residuals, which serve as the labels for the
following model.
Then repeat with iteration n + 1.
Compared with weight-based boosting, Gradient Boosting covers more cases, because it
can be applied with any differentiable loss function.
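The loop above can be sketched in a few lines of plain Python. This is a minimal
illustration under stated assumptions, not the implementation used in this report (which
relies on Spark): it uses squared loss, so the pseudo-residuals are simply y minus the
current prediction; the weak learner is a one-split regression stump; and a fixed
`learning_rate` stands in for the confidence score cn. The function names are
hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump to the targets by least squares."""
    best = None
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not right:  # the largest x leaves nothing on the right side
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x values are equal: fall back to the mean
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Build W = W + c_n * w_n round by round and return the final predictor."""
    base = sum(ys) / len(ys)              # initial constant model
    learners = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # Squared loss: the negative gradient (pseudo-residual) is y - W(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)  # train the new model on the residuals
        learners.append(stump)
        # Assumption: a fixed learning rate plays the role of c_n.
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in learners)

    return predict


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
```

On a small data set, the boosted model should fit the training data more closely than
the constant baseline it starts from, because each round removes part of the remaining
residual.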

 Advantages:
GBT typically performs well in prediction and classification tasks.
Can handle missing values automatically, and can continue training from already
established models to save time.
GBT allows users to create their optimization functions and assessment criteria,
rather than being constrained to the ones given.
 Disadvantages:
GBT is a sequential process; therefore, model training takes a long time when
dealing with huge data sets.
GBT is prone to overfitting if its hyperparameters are not tuned carefully.
2.3 K-Fold Cross Validation
K-Fold cross-validation is an assessment approach for machine learning models that
estimates how well the model will predict on unseen data. This approach divides the
training data set into K smaller subsets of equal size. The model is trained on K-1
subsets and tested on the remaining subset. This procedure is repeated K times, each
time using a different subset as the test set. The results from the K runs are then
combined to provide an overall evaluation of the model.
The value of K is often set to 5 or 10. K should be selected so that each subset is
large enough to be statistically representative of the larger data collection.
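The splitting procedure can be illustrated with a short sketch. It assumes the samples
are addressed by index and are not shuffled (in practice the data are usually shuffled
first); the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

With 10 samples and K = 5, every fold tests on 2 samples and trains on the other 8, and
each sample is used for testing exactly once.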
2.4 Evaluation method based on confusion matrix
2.4.1 Concepts
The confusion matrix is a widely used tool when tackling classification problems.
A confusion matrix is a table that summarizes the number of correct and incorrect
predictions produced by a classifier. It is used to assess the effectiveness of a
classification model. Accuracy, Precision, Recall, and F1_score are the most often
used performance measures derived from it.
The confusion matrix is a square matrix, with each row representing a predicted
class and each column representing an actual class.
Figure 2-5 Confusion matrix
True Positive - TP: occurrences that the model correctly predicts as
"occurring = Yes".
False Positive - FP: occurrences that the model predicts as "occurring = Yes" but
which in reality are "not occurring = No".
True Negative - TN: occurrences that the model correctly predicts as
"not occurring = No".
False Negative - FN: occurrences that the model predicts as "not occurring = No"
but which in reality are "occurring = Yes". This is the opposite of FP.
False Positive and False Negative are commonly known in statistics as Type I
error and Type II error, respectively.
Our objective when developing the model is to minimize the number of false
negatives and false positives.
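As a small illustration of these four definitions, the sketch below counts TP, FP, TN,
and FN from two lists of labels. It assumes binary labels with 1 as the positive class;
the function name is hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN and FN for a binary classification result."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted Yes, really Yes
        elif p == positive:
            fp += 1   # predicted Yes, really No (Type I error)
        elif a == positive:
            fn += 1   # predicted No, really Yes (Type II error)
        else:
            tn += 1   # predicted No, really No
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}
```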
2.4.2 Evaluation indicators
2.4.2.1 Accuracy index
Accuracy is defined as the number of correctly predicted values divided by the total
number of values in the dataset. Because of its simple formula and clear meaning,
evaluating classification models based on accuracy is popular.
However, accuracy only shows the percentage of values assigned to the correct class.
It does not show how accurately each class is predicted, which class is classified
best, or which class is most often confused with another. Furthermore, with
imbalanced data sets, accuracy is a poor measure. We therefore use two further
indicators to evaluate a model's reliability: precision and recall.
2.4.2.2 Precision index
Precision measures the proportion of true positives among all positive predictions
generated by the model. In other words, how many "positive" forecasts turn out to be
"true" in reality?
Precision should be prioritized when selecting models for problems where false
positives are costly. For example, in spam filtering, a false positive (mistaking a
normal email for spam) may disrupt the user's work, since they could miss a vital
email (for example, one about a very high-value contract). The higher the Precision,
the more reliable the model's positive predictions are. Precision = 1 indicates that
all points predicted as Positive are truly Positive, that is, no Negative point is
wrongly predicted as Positive by the model.
2.4.2.3 Recall index
Recall is the proportion of samples correctly predicted as positive among all the
samples that truly belong to the positive class: how many of the truly positive points
were correctly classified? High recall indicates a low rate of missed positive
samples.
Recall should be given higher weight in model selection when wrongly classifying
Positive cases as Negative (False Negatives) has serious consequences (for example,
when predicting cancer, telling a person who is sick that they are free of the disease
can have serious consequences). The higher the recall, the fewer positive points are
missed. Recall = 1 means that all truly Positive points are recognized by the model.
Consider the following example. When someone believes they are unwell, they go to the
hospital to get tested. There are two classes: having the disease (positive) and not
having the disease (negative). Precision is the proportion of persons classified as
positive who genuinely have the illness. If the precision is 0.9, 90 out of 100
persons who are identified as positive truly have the condition. Recall is the number
of persons correctly classified as positive divided by the total number of people who
have the illness. If the recall is 0.9, then 90 out of 100 patients with the condition
are classified as positive. The greater the recall, the more likely a patient with the
condition is to be diagnosed as positive.
Precision answers the question, "How many of the 100 people predicted to have the
disease are actually sick?" Recall answers the question, "Did we miss any positive
cases?" (Did we overlook any individuals who were suffering from the disease but were
not predicted to have it?)
2.4.2.4 F1_score index

When developing models, we want the highest precision and recall possible. The
classification model is regarded as good when both the Recall and Precision indices are
high, that is, the closer the indices are to 1, the better. However, if we tune the
model too far toward Recall, we risk decreasing Precision, and vice versa: adjusting
the model to increase Precision can diminish Recall. Our objective is to achieve a
balance between these two measures.
The F1 score is the harmonic mean of Precision and Recall, and it is used as an
indicator in cases where optimizing for either precision or recall alone may bias the
model.
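The four indicators described in this section can be computed directly from the
confusion matrix counts. The sketch below uses the standard definitions; the function
name is hypothetical, and returning 0.0 when a denominator is zero is a convention
chosen here for simplicity.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For instance, in a hospital scenario with 100 sick patients of whom 90 are detected
(TP = 90, FN = 10), 10 healthy people wrongly flagged (FP = 10), and an assumed 890
healthy people correctly cleared (TN = 890), precision, recall, and F1 all equal 0.9.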
2.4.2.5 AUC-ROC curve

In addition to the indicators above, we use the AUC - ROC curve to assess the
model's performance on the classification problem.
ROC stands for Receiver Operating Characteristic. The ROC curve is created by
plotting the true positive rate (TPR) against the false positive rate (FPR) at each
threshold, with TPR on the vertical axis and FPR on the horizontal axis.
The True Positive Rate (TPR) is the proportion of positive samples that are correctly
classified, out of all positive samples.
The False Positive Rate (FPR) is the proportion of observations that are wrongly
predicted to be positive, out of all negative observations.
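In terms of the confusion matrix counts defined in Section 2.4.1, these two rates
follow the standard definitions:

```latex
\mathrm{TPR} = \frac{TP}{TP + FN},
\qquad
\mathrm{FPR} = \frac{FP}{FP + TN}
```

TPR is the same quantity as Recall, while FPR measures how often truly negative
observations are wrongly flagged as positive.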
The classifier whose curve lies closer to the upper left corner performs better. A
random classifier is expected to produce points along the diagonal (FPR = TPR). The
closer the ROC curve lies to this 45-degree diagonal of the ROC space, the weaker the
classifier.
It should be noted that the ROC curve is independent of the class distribution. This
makes it useful for evaluating classifiers that predict rare events such as illness
or disasters. In contrast, measuring performance by accuracy favors classifiers that
always predict the negative outcome for rare events.
AUC stands for "Area Under the ROC Curve." That is, AUC measures the entire area
under the ROC curve.
The higher the AUC value, the better the classifier. AUC values are interpreted in
the following manner:
- If AUC = 1, the classifier perfectly distinguishes between all Positive and
Negative class points.
Figure 2-6 Histogram for the case of AUC = 1
- If AUC = 0, the classifier predicts all Negative points as Positive and all
Positive points as Negative.
Figure 2-7 Histogram for the case of AUC = 0

- If the AUC is between 0.5 and 1, the classifier can discriminate between positive
and negative class values, detecting more TP and TN than FP and FN.
Figure 2-8 Histogram for the case of AUC = 0.7

- If AUC = 0.5, it indicates that the classifier is unable to discriminate between
negative and positive values
Figure 2-9 Histogram for the case of AUC = 0.5
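The AUC cases listed above can be checked numerically with a small sketch. It relies on
a known equivalence: the area under the ROC curve equals the probability that a randomly
chosen positive sample receives a higher score than a randomly chosen negative one, with
ties counted as one half. The function name is hypothetical, and labels are assumed to
be 0 and 1.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))
```

A classifier that scores every positive above every negative gets AUC = 1, one that
ranks them in exactly the wrong order gets AUC = 0, and one that gives every sample the
same score gets AUC = 0.5. The pairwise loop is quadratic, so it is only meant for
small examples.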
CHAPTER 3 : CONDUCTING EXPERIMENTAL PROBLEMS

This chapter selects the data set and defines its variables, explores the data in
overview and in detail, and identifies the factors that matter most for the target
variable before running the models. It then presents the data processing techniques
and methods, and runs the algorithms on the processed data set.
3.1 General information
Dataset Source: https://www.kaggle.com/datasets/anmolkumar/health-insurance-cross-sell-prediction/code?datasetId=869050&sortBy=voteCount
The dataset is provided by an insurance company, primarily focusing on health
insurance. The company needs assistance in building a predictive model to determine
whether customers (policyholders) from the previous year are interested in the auto
insurance that the company offers.
The dataset consists of 381,109 rows and 12 attributes:
| No. | Attribute name | Description |
| --- | --- | --- |
| 1 | id | Unique customer ID that increases sequentially; each ID appears only once in the dataset. |
| 2 | Gender | Gender of the customer. |
| 3 | Driving_License | 1 - Customer has a driver's license; 0 - Customer does not have a driver's license. |
| 4 | Age | Age of the customer. |
| 5 | Region_Code | Unique code for the customer's region. |
| 6 | Previously_Insured | Insurance status: 1 - Customer has auto insurance; 0 - Customer does not have auto insurance. |
| 7 | Vehicle_Age | Age of the vehicle; values vary depending on how long the customer has used the vehicle. |
| 8 | Vehicle_Damage | 1 - Customer has experienced vehicle damage in the past; 0 - Customer has not experienced vehicle damage in the past. |
| 9 | Annual_Premium | Amount of money the customer pays for insurance. |
| 10 | Policy_Sales_Channel | Anonymized code for the channel through which the customer was approached, such as different agents, mail, phone, or in person. |
| 11 | Vintage | Age of the policy (number of days the customer has been associated with the company's insurance services). |
| 12 | Response | Output that the problem seeks to predict: 1 - Customer is interested; 0 - Customer is not interested. |
Table 3-1 Description of attributes
3.2 Explore and analyze data
3.2.1 Explore and visualize data
3.2.1.1 Explore overview

The dataset contains no missing (null) values and includes the variables in the
following order: id, Gender, Age, Driving_License, Region_Code, Previously_Insured,
Vehicle_Age, Vehicle_Damage, Annual_Premium, Policy_Sales_Channel, Vintage,
Response. The variables in the dataset are of integer, string, and double types. The
categorical variables are label-encoded during the data preparation step before
running the model.
Figure 3-10 Use the distinct function to select rows with unique values
The dataset does not contain any duplicate values.
3.2.1.2 Explore data

 Variable “age”
Figure 3-11 Statistical chart for the variable 'age'
The age distribution chart shows that the data is concentrated mainly in the 20 to 30
age range, which indicates that the majority of the company's customers are young.
The 40 to 50 age range also holds a relatively high proportion, showing some
diversity in customer age.
The dataset covers a fairly wide age range, from 20 to 80 years old, with an average
age of around 35-40 years. The data reaches its highest concentration between the ages
of 22 and 26, while the concentration decreases significantly in the higher age
groups, especially from 70 to 80 years old. This demonstrates a clear skew of the
customer base toward the younger age group.
The chart illustrates the correlation between the 'Age' variable and the
'Annual_Premium' variable
The scatter plot illustrates the relationship between the 'Age' variable and the
'Annual_Premium' variable (the amount spent on insurance premiums). The plot shows an
increasing trend in the amount spent on insurance premiums as the age of the customer
increases, especially in the range from 20 to 50 years old.
This positive relationship suggests that, between 20 and 50 years old, customers tend
to spend more on insurance premiums as they get older. This could reflect factors such
as more diverse insurance needs, a greater need for financial security with a family,
or changes in risk and personal responsibility over time.
 Variable “gender”
Figure 3-12 Statistical Chart for the 'Gender' variable

The gender information in the dataset shows a close balance between males and
females, meaning there is no substantial imbalance in the gender ratio.
Regarding cross-selling, a notable point is that the proportion of users not
interested in cross-selling is quite high. In addition, the gender ratio among
interested users is similar to that among non-interested users, with no significant
deviation. This could be influenced by various factors, ranging from shopping trends
and preferences to marketing strategies.
To leverage the balanced gender ratio, marketing and sales strategies could be
designed to appeal to both males and females.
Figure 3-13 The count of genders with a driver's license
The dataset indicates that more males than females hold a driver's license. This
presents an opportunity for cross-selling campaigns, particularly within the male
segment. It may make sense to focus on this segment, as it is a group that frequently
uses vehicles and has a demand for auto insurance.
Cross-selling campaigns can be designed to capitalize on the specific needs of
males related to vehicles and insurance. Promotional and marketing strategies can be
tailored to attract and retain the attention of the male audience while providing insurance
packages or special deals that align with their needs.
 Variable “Previously_Insured”
Figure 3-14 The number of users with or without auto insurance.
Many users have not previously enrolled in insurance, with over half of them not
having approached or utilized insurance. This raises questions about why they are not
interested or do not use insurance services. Delving into details about this group may
provide valuable insights into barriers or unmet needs, which can be used to optimize
marketing strategies and expand the market.
 Variable “Vehicle_Age”
Figure 3-15 The distribution of vehicle ages

The data categorizes the number of years a vehicle has been in use into 3 groups
(under 1 year, 1-2 years, and over 2 years), as illustrated in the chart. The chart
shows that the percentage of vehicles used for over 2 years is negligible compared to
the other 2 groups. This imbalance can affect the model's quality, since machine
learning models often perform poorly on imbalanced data.
To address this issue, one preprocessing approach is to regroup the data into 2
groups based on vehicle usage time: under 1 year and over 1 year. This step creates a
better balance between the groups and may improve the model's predictive capability
for them.
 Variable “Vehicle_Damage”
Figure 3-16 Statistics for the 'Vehicle_Damage' variable
The information about the proportions indicates an interesting trend: if the vehicle
has not been damaged in the past, users tend not to opt for cross-selling insurance. This
might reflect the belief or perspective of users that their vehicles are less likely to
encounter issues or damages.
With this insight, marketing and sales strategies can be adjusted to focus on
conveying the value of cross-selling insurance for unforeseen situations. It could create a
message highlighting the benefits of cross-selling insurance in protecting against risks
that users may not have considered, even when their vehicles haven't previously
experienced issues.
 Variable “Annual_Premium”
Figure 3-17 Statistics for the 'Annual_Premium' variable

The 'Annual_Premium' variable has a right-skewed distribution and is concentrated
around the range of 30,000-40,000. However, the presence of many outliers could impact
modeling and requires special attention in handling outliers during the data preprocessing
stage.
 Variable “Vintage”
Figure 3-18 Statistical chart for the 'Vintage' variable:
The 'Vintage' variable appears to be a balanced and independent variable, with no
significant skewness and no presence of outliers. This can contribute to effective
modeling during the data analysis process.
 Variable “Policy_Sales_Channel”
Figure 3-19 Statistical chart for the 'Policy_Sales_Channel' variable:

The statistical chart of the 'Policy_Sales_Channel' variable shows an uneven
distribution of the anonymized sales channel codes. Code 160 has the highest count,
while code 140 has the lowest count.
The team takes this as a sign that the 'Policy_Sales_Channel' variable may not
significantly affect the target variable. Therefore, it could potentially be removed
from the dataset without a substantial impact on the model's performance.
The decision to remove this variable should be based on a deeper understanding of the
data and the specific goals of the model. If the 'Policy_Sales_Channel' variable does
not provide important information or has little impact on the model, removing it could
reduce the dataset's dimensionality and improve performance during analysis.
3.2.2 Visualize data using Heatmap charts
Figure 3-20 Heatmap chart
Based on the heatmap chart illustrating correlation coefficients, the team has
identified three variables with correlation coefficients close to 0 concerning the
dependent variable 'Response'. Specifically, 'Region_Code' has a correlation coefficient
of 0.011, 'Driving_License' is 0.01, and 'Vintage' is -0.0011. Correlation values close to 0
often indicate weak or no significant linear correlation between variables.
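To make the meaning of these coefficients concrete, the sketch below computes a
correlation coefficient between two numeric columns. It assumes the heatmap shows
Pearson correlation, which the report does not state explicitly; the function name is
hypothetical, and returning 0.0 for a constant column is a convention chosen here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear relationship
    return covariance / math.sqrt(var_x * var_y)
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0, as
reported for 'Region_Code', 'Driving_License', and 'Vintage', indicates that the
variable carries little linear information about 'Response'.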
Also, based on the heatmap chart:
+ (+) values indicate positive correlation
+ (-) values indicate negative correlation
After excluding the variables that are not correlated with the response and keeping
those that positively or negatively influence it, we have the following:
+ Previously_Insured (-0.34)
+ Age (0.11)
+ Vehicle_Age (0.22)
+ Vehicle_Damage (0.35)
+ Annual_Premium (0.023)
 Feature selection for machine learning model:
The data and variables are selected to build the machine learning model based on
the Feature Selection method. This method aims to reduce or eliminate unnecessary input
variables, focusing only on variables that have a significant impact on the model. This
helps optimize the model's performance by making the processing and running of the
model more efficient.
Feature selection also helps avoid using highly correlated variables, minimizes
unrelated variables, and prevents the use of data that complicates the analysis. The result
is a refined dataset, making the machine learning model more effective and easier to
manage during deployment.
The SelectKBest method is employed to reduce the number of input variables, as
shown in the graph below.
 Visualizing Feature Selection

Based on the chart above, the team chooses k=4 to identify the 4 variables most
strongly related to the model outcome:
+ Age
+ Previously_Insured
+ Vehicle_Age
+ Vehicle_Damage
3.3 Prepare to run the model
3.3.1 The data balancing process
Checking the data imbalance of the response variable:
Figure 3-21 Data before balancing
The percentage of interested customers is only 10%, while the percentage of
non-interested customers is very high. Therefore, this is an imbalanced dataset. If
this dataset is used as is for prediction and analysis models, the algorithms may
become biased, as they will 'assume' that the majority of customers are not
interested. Hence, data balancing is necessary to avoid introducing bias into the
analysis.
Imbalanced data refers to the distribution of samples across classes or labels that is
significantly skewed, leading the model to focus only on learning features of the class
with more data, failing to generalize to the entire dataset. This results in the model
performing well on classes with more data and poorly on classes with less data.
To address this issue, we can use the resample function from the scikit-learn library
to balance the dataset.
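The idea behind balancing by resampling can be sketched with the Python standard
library alone. This is an illustration, not the team's code: it assumes the minority
class is oversampled with replacement until both classes have the same size (the report
does not say whether oversampling or undersampling was used), and the function name and
the dictionary-based row format are hypothetical.

```python
import random


def oversample_minority(rows, label_key="Response", seed=42):
    """Duplicate random rows of the smaller classes until all classes match."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap to the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced
```

Oversampling keeps every original row but repeats minority rows, so it should be
applied to the training data only; otherwise copies of the same row can end up in both
the training and the test set.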
Figure 3-22 Data after balancing
3.3.2 The data labeling process
Identify the numeric and string columns in order to perform labeling. Categorical
data are labeled with numeric codes to avoid running a biased model. Here, the group
only encodes the Gender variable because the other categorical variables have no
computational meaning.
Figure 3-23 The code labels the gender variable

Results after data labeling:
Figure 3-24 Results after data labeling

3.3.3 The data cleaning process
Figure 3-25 Data before removing outliers

Because the 'Annual_Premium' variable in our dataset contained several outliers, the
team handled them using the upper and lower bounds method. First, the team calculated
the upper and lower limits based on the mean and standard deviation of the
'Annual_Premium' variable.
Figure 3-26 Find the upper limit and lower limit

The group then treated the outliers by replacing each 'Annual_Premium' value with the
corresponding limit if it exceeded the upper limit or fell below the lower limit. This
keeps all values within the defined range and removes the effect of outliers, creating
a cleaner, more standardized data set.
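The capping step can be sketched as follows. The report states that the limits are
based on the mean and standard deviation but does not give the multiplier, so the value
of 3 standard deviations used here is an assumption, as is the function name.

```python
import statistics


def clip_outliers(values, n_std=3.0):
    """Cap values at mean +/- n_std standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population standard deviation
    lower = mean - n_std * std
    upper = mean + n_std * std
    clipped = [min(max(v, lower), upper) for v in values]
    return clipped, lower, upper
```

Capping keeps the number of rows unchanged, unlike dropping the outlying rows, which is
why the data set after this step has the same length as before.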
Figure 3-27 Data after removing outliers

3.3.4 The scaling data process
There are two ways to scale data: Normalization and standardization. Both of these
methods are provided in the scikit-learn library. With the data set used, the team decided
to use the Normalization method to scale the data. Specifically, Normalization is scaling
data from any value range to a value range between 0 and 1. This method requires
determining the data's maximum value (max) and minimum value (min).
The value is normalized according to the following formula:
$$y = \frac{x - \min(x)}{\max(x) - \min(x)}$$
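The normalization formula translates directly into code. The sketch below applies it to
a single column held in a Python list; the function name is hypothetical, and mapping a
constant column to 0.0 is a convention chosen here to avoid division by zero.

```python
def min_max_scale(values):
    """Scale values to the range [0, 1] using y = (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lowest) / (highest - lowest) for v in values]
```

For example, the ages 20, 50, and 80 are mapped to 0.0, 0.5, and 1.0.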
The data set after normalization (scaling) has the following results:
Figure 3-28 Results table after implementing the Min-max Scaling method
3.3.5 Vectorize the input data and separate the train - test data set
After selecting the 4 variables considered most strongly correlated with the target
variable, we vectorize the input data and split it into train and test sets. The
implementation procedure includes the following steps:
Step 1: Create a list of nominal variables for forecasting
Figure 3-29 List of nominal variables to predict

Step 2: Apply VectorAssembler to transform the list of variable columns above into
a single vector column (“features”).
Figure 3-30 List of variable columns after applying VectorAssembler
Step 3: Run the algorithm after the variables have been processed and analyzed. First,
the scaled data set is divided into 2 parts at a ratio of 8:2: a training set (used to
find the coefficients and build the model) and a testing set (used to produce
predictions and evaluate the model). The features and the target are also separated in
each of the train and test sets.
3.4 Run the model
3.4.1 Model 1: Logistic Regression
The Logistic Regression model is applied through the following steps:
Create the Logistic Regression model and set up cross-validation with the number of
folds = 5, so that each round trains and validates on an 8:2 split of the training
data. Then, use a parameter grid of hyperparameters to build the search grid.
After training, the fitted model can display columns including the vectorized
features, the label, the prediction, the raw prediction, and the probability.
Figure 3-31 Data of variable features after being vectorized

Test the model on the test set; the results are shown below:
Figure 3-32 The results of model test
Evaluate the model based on the precision, recall, and F1 score indices; the support
column shows the number of samples in each class.
Figure 3-33 Logistic Regression model evaluation index according to the confusion
matrix
3.4.2 Model 2: Random Forest
Use RandomForestClassifier from the pyspark.ml.classification library to train the
Random Forest model on the defined input variables, with cross-validation using
numFolds=5. Random Forest trains quickly and gives positive prediction results.
Figure 3-34 Random Forest model evaluation index according to the confusion matrix
3.4.3 Model 3: Gradient Boosted Trees
Import the Gradient Boosted Trees classifier from the pyspark.ml.classification
library to perform classification on the dataset. Then, build a Gradient Boosted Trees
model with the maximum number of decision trees set to 10. Train the model on the
previously prepared training set. Finally, run the model on the test set and perform
cross-validation with folds = 5.
Figure 3-35 Gradient Boosted Trees model evaluation index
CHAPTER 4 : EVALUATING THE RESULTS OF THE MODELS
Examine the charts and indices produced by each model, compare and evaluate them, and
choose the model that best suits the problem of cross-selling vehicle insurance.
Present the group's results, propose applications for insurance cross-selling, and
summarize the key points that support decision-making.
4.1 Evaluate using proposed indicators
Model evaluation is essential in data analysis because it enables us to choose the
model that best suits a specific task and to assess its performance. While many
assessment measures exist, only some are suitable for evaluating machine learning
models for classification.
4.1.1 Evaluation using AUC/ROC and AUC/PR metrics
To determine the best algorithm for the dataset, we will assess the model using the
area under the ROC curve and measures like the area under the Precision-Recall (PR)
curve.
The evaluation results of the Logistic Regression technique after cross-validation:
Mean AUC of the ROC curve: 0.8308250200707576

Mean AUC of the PR curve: 0.7596466138142874
The evaluation results of the Random Forest technique after cross-validation:

The evaluation results of the Gradient Boosting technique after cross-validation:

Conclusion: The algorithms achieve similar mean AUC values for the ROC and PR curves.
Among them, our evaluation shows that the model built on the dataset with the Gradient
Boosting technique produces better classification outcomes.
4.1.2 Evaluation using confusion matrix metrics
We apply the Gradient Boosting technique to the test dataset (which the model has not
seen before) to perform a more thorough evaluation. This assessment considers the
following metrics:
+ Accuracy ratio: the higher the accuracy ratio, the more suitable the model. The
accuracy ratio ranges from 0 to 1.
+ Precision ratio: the higher the precision ratio, the better the model. The
precision ratio ranges from 0 to 1.
+ Recall ratio: the higher the recall ratio, the better the model. The recall ratio
ranges from 0 to 1.
+ F1 score: a higher F1 score indicates a better model. The F1 score ranges from 0
to 1.
We analyze the significance of the Recall and Precision metrics as follows:
+ The Precision ratio shows what share of the customers predicted to make
cross-purchases actually do so. For this particular problem, the Precision ratio is of
limited value because customers who are predicted to make cross-purchases but do not
have little impact on the company's marketing strategy.
+ In contrast, the Recall ratio shows what share of the customers who actually make
cross-purchases the model identifies. The Recall ratio is critical for this problem
because "missing out" on customers who would make cross-purchases can significantly
reduce the company's revenue. In customer management, the guiding principle is often
"making a mistake is better than missing out."
As a result, we consider the Recall ratio rather than the Precision ratio when
evaluating the model.
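One way to see this trade-off is to vary the score threshold above which a customer is predicted to make a cross-purchase. The sketch below assumes the model outputs a score per customer and that a threshold turns scores into predictions. The report does not describe its threshold, so the function name, the labels, the scores, and the thresholds are all hypothetical.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when every customer whose score is at or above
    the threshold is predicted to make a cross-purchase."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 1 = customer actually makes a cross-purchase.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1]

for threshold in (0.75, 0.5, 0.35):
    precision, recall = precision_recall_at(labels, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, lowering the threshold contacts more customers who will not buy but misses fewer who will, which is the behaviour the Recall-focused evaluation favours.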
The model used is the "bestModel" obtained after cross-validation:
GBTClassificationModel: uid = GBTClassifier_0f2db9ca0fc0, numTrees=100,
numClasses=2, numFeatures=4
To determine the model's efficacy on previously unseen data, we test the best
Gradient Boosting model on the test dataset.
Figure 4-36 Confusion matrix of the Gradient Boosted Tree model
The model performs well: the metrics show strong F1-score and Recall values. The
Gradient Boosted Tree model performs particularly well in its classification task, with
the Recall metric reaching 0.95, which supports the choice of this model.
4.2 Conclusion
We compared the efficacy of three classification models, Logistic Regression, Random
Forest, and Gradient Boosted Tree, following the objectives and study procedure that the
team described at the outset. The key goals were to find the best model for the
experimental dataset and to evaluate the models' potential for optimizing cross-selling.
Our project, which predicts cross-selling of vehicle insurance with Logistic
Regression, Gradient Boosted Tree, and Random Forest, has produced new insights into
which customers are likely to respond to a vehicle insurance offer. The following are
some ways that the research can help insurance companies and financial institutions use
these algorithms more effectively:
By using timely insights about customers' possible interest in cross-sold insurance
services, companies can anticipate and proactively address customer preferences and
develop marketing strategies that work.
Companies can improve customer support by customizing offerings according to known
preferences. In the vehicle insurance sector, this strategy seeks to increase customer
satisfaction, create pleasant customer experiences, and establish trust.
REFERENCES
[1] Arora, U. (2016). Gradient Boosting: Visual Conceptualization | Dimensionless
Technologies. Retrieved 25 December 2023, from
<https://dimensionless.in/gradient-boosting/>
[2] Gradient Boost (GBM) 란. (2022). Retrieved 25 December 2023, from
<https://velog.io/@hyesoup/Gradient-Boost-GBM%EC%9D%B4%EB%9E%80>
[3] Gradient Boosting - Tất tần tật về thuật toán mạnh mẽ nhất trong Machine
Learning. (2022). Retrieved 25 December 2023, from
<https://viblo.asia/p/gradient-boosting-tat-tan-tat-ve-thuat-toan-manh-me-nhat-trong-machine-learning-YWOZrN7vZQ0>
[4] Random Forest Classifier: A Complete Guide to How It Works in Machine
Learning. (2022). Retrieved 25 December 2023, from
<https://builtin.com/data-science/random-forest-algorithm>
[5] Introduction to Random Forest in Machine Learning. (2022). Retrieved 25
December 2023, from
<https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/>
[6] So Sánh Accuracy Vs Precision Là Gì ? Precision, Recall Và F1. (2022).
Retrieved 25 December 2023, from <https://ceds.edu.vn/precision-la-gi/>
[7] Thắng, N. (2020). Đánh giá model AI với Precision, Recall va F1 Score - Mì AI.
Retrieved 25 December 2023, from
<https://www.miai.vn/2020/06/16/oanh-gia-model-ai-theo-cach-mi-an-lien-chuong-2-precision-recall-va-f-score/>
[8] What is K-fold Cross Validation?. (2022). Retrieved 25 December 2023, from
<https://towardsdatascience.com/what-is-k-fold-cross-validation-5a7bb241d82f>