AN APPROACH TO FRAUD DETECTION IN THE CONTEXT OF BANK TRANSACTIONS
Dunfrey P. Aragão
Italo de Pontes Oliveira
Ferdinand Felix
Machine Translated by Google
Summary

- Introduction
- Objectives
  - Main goal
  - Specific objectives
- Methodology
- Data Understanding
  - Data dictionary
  - Attribute Analysis for Genuine and Fraudulent Instances: How is the data distributed?
    - Operation
    - PricingStrategy
    - Value
- Data Handling
  - Attribute Removal
  - Isolation Forest
- Solution development
  - Dataset Balancing
  - Modeling
    - CatBoost
  - Evaluation
- Financial analysis
- Conclusion
Introduction
According to The World Payments Report 2019, carried out by Capgemini Financial Services
During a financial transaction, a series of procedures takes place: 1) the merchant selling the product or providing the service collects credit card data (card number, expiration date, security code, among others) by means of a card machine or by manual entry, as occurs in e-commerce systems, and passes this data to the credit card network; 2) the credit card network authenticates the information provided and requests that the banking institution approve the transaction; 3) the bank makes the payment to the merchant that carried out the sale and approves the transaction.
In this way, the credit card network institutions are responsible for authenticating the customer information provided by the merchant, as well as for validating that the transaction was actually carried out by the customer. It is therefore essential that the validation system respond quickly while remaining reliable, preventing fraudulent transactions from taking place.
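The three-step flow above can be sketched as follows; every name here (CardData, merchant_collects, and so on) is illustrative and not part of any real payment API:

```python
from dataclasses import dataclass

@dataclass
class CardData:
    number: str
    expiration: str     # MM/YY
    security_code: str

def merchant_collects(card: CardData, amount: float) -> dict:
    # Step 1: the merchant collects card data and forwards it to the network.
    return {"card": card, "amount": amount}

def network_authenticates(request: dict) -> bool:
    # Step 2: the card network authenticates the data (toy check only).
    card = request["card"]
    return len(card.number) == 16 and len(card.security_code) == 3

def bank_approves(request: dict, authenticated: bool) -> str:
    # Step 3: the bank approves the transaction and pays the merchant.
    return "approved" if authenticated else "denied"

request = merchant_collects(CardData("4" * 16, "12/25", "123"), 99.90)
status = bank_approves(request, network_authenticates(request))
```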
Objectives
Main goal
• Automatically identify whether a transaction was fraudulent or genuine.
Specific objectives
• Estimate the financial impact caused by the fraud detection system.
Methodology
To identify fraudulent transactions, we adopted the database provided in the challenge organized by Zindi[6][7]. The data comes from Xente, a financial services and e-commerce platform that serves more than 10,000 customers in Uganda. The dataset includes an approximate sample of 140,000 transactions that took place between November 15, 2018 and March 14, 2019.
The data is divided into training (set of transactions that occurred between November 15, 2018 and February
13, 2019 with prior identification of the type of transaction - fraud or genuine) and test (set of transactions
that correspond to the period from February 13, 2019 to March 14, 2019, without identification of the
transaction type).
The main challenges in fraud detection are related to the high level of class imbalance, since only about 0.002% of transactions correspond to fraud. Furthermore, fraudulent transactions do not always behave suspiciously and often go unnoticed.
Considering these difficulties, the approach adopted in this project is presented in Figure 1, which
shows the complete pipeline. Each step is described in detail in the following sections. In summary, they are:
• Data pre-processing: in this step, we sought to identify missing data, check for repeated records in the training set, perform descriptive analysis of trends in fraudulent transactions and, finally, analyze the correlation between attributes in the training set;

• Data balancing: given the high level of imbalance in the data (about 99% of the instances belong to the non-fraudulent class), different balancing techniques were tested. Oversampling[8] was preferred, since undersampling[9] would cause a significant drop in the amount of data available. Based on this, the classifier performed better when balancing techniques based on oversampling were used;

• Identification of outliers: application of techniques that identify instances that differ from the others, using this information as a new attribute for the model;
[6] https://zindi.africa/competitions/xente-fraud-detection-challenge (accessed on 12/07/2019).
[7] https://zindi.africa/competitions/xente-fraud-detection-challenge/data (accessed on 12/07/2019).
[8] Oversampling: replicate instances of the less frequent class until the two transaction types are balanced.
[9] Undersampling: discard instances of the more frequent class until the two transaction types are balanced.
[10] https://catboost.ai/: "CatBoost is a high-performance open source library for gradient boosting on decision trees" (accessed 12/07/2019).
The proposed methodology ranks among the top 70 of the more than 1,000 data scientists participating in the competition.
Figure 1: Methodological flow of actions applied to the dataset to classify the type of transaction.
Data Understanding
The first step of the pipeline (workflow) consists of understanding the data provided by the challenge[11]. Figure 2 presents the flow of actions performed in this first stage, which consists of the following sub-steps: 1) following the CRISP-DM methodology, create data dictionaries and analyze them descriptively, which provided a better understanding of the data and of the business context involved; 2) analyze the correlation between attributes, to identify and remove highly correlated attributes; 3) create new attributes, then return to step (2) until no mutually correlated attributes remain to be removed.
Figure 2: Methodological flow of actions applied to the data set in the data collection and analysis
stage.
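Sub-steps (2) and (3) can be sketched as an iterative correlation filter; the column names and the 0.9 threshold below are assumptions for illustration, not values taken from the challenge data:

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    # Iteratively drop one column of each highly correlated pair
    # until no absolute pairwise correlation reaches the threshold.
    while True:
        corr = df.corr().abs()
        for c in corr.columns:
            corr.loc[c, c] = 0.0        # ignore self-correlation
        if corr.to_numpy().max() < threshold:
            return df
        col = corr.max().idxmax()       # one column of the most correlated pair
        df = df.drop(columns=[col])

toy = pd.DataFrame({
    "Amount": [10, 20, 30, 40],
    "Value":  [10, 20, 30, 40],         # perfectly correlated with Amount
    "Hour":   [1, 5, 3, 2],
})
reduced = drop_correlated(toy)
```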
Data dictionary
Each record in the dataset corresponds to a transaction performed and, for each of these, there is a group of attributes:
- TransactionId: Unique identifier of the transaction on the platform.
- BatchId: Unique number of the set of transactions sent by processing.
- AccountId: Unique identifier of the user on the platform.
- SubscriptionId: Unique identifier of the subscribed user.
- CustomerId: Unique identifier attached to AccountId.
[11] https://zindi.africa/competitions/xente-fraud-detection-challenge/data (accessed on 12/08/2019).
Note: The nomenclature used to identify the labels was fraudulent and genuine. For clarity, we
chose to identify non-fraudulent transactions as genuine transactions.
At this point in the descriptive analysis, we look at what evidence we have about fraudulent data for
each attribute. Below, we list the most important aspects found for each attribute:
Operation
A bias was identified in this attribute: a much higher proportion of fraudulent transactions were debit transactions. Table 1 shows the values found in this analysis.

Transaction type | Fraudulent | Genuine
Debit            | 97%        | 60%
Credit           | 3%         | 40%

Table 1: Proportion of fraudulent and genuine transactions for each type of transaction (debit or credit).
PricingStrategy
Value
The mean value of genuine transactions was $672, with a standard deviation of $3,995, while for fraudulent transactions the mean was $1,560,156, with a standard deviation of $2,082,015. Furthermore, the following observations were made:

• 98.9% of frauds: transaction value > average of all transactions;
• 27% of frauds: transaction value > average of all fraudulent transactions;
• 80.6% of genuine: transaction value < average of all transactions;
• 20% of genuine: transaction value > average of all genuine transactions.
Figure 4: Histogram for the different product categories considering fraudulent and genuine
transactions.
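A minimal sketch of how such shares can be computed, using made-up values rather than the Xente data:

```python
import statistics

# Toy transactions; the values and labels are invented for illustration.
values = [12.0, 30.0, 25.0, 5000.0, 18.0, 7500.0]
is_fraud = [False, False, False, True, False, True]

overall_mean = statistics.mean(values)
fraud_values = [v for v, f in zip(values, is_fraud) if f]
genuine_values = [v for v, f in zip(values, is_fraud) if not f]

# Share of frauds whose value exceeds the overall mean (98.9% in the report).
frauds_above_mean = sum(v > overall_mean for v in fraud_values) / len(fraud_values)
# Share of genuine transactions below the overall mean (80.6% in the report).
genuine_below_mean = sum(v < overall_mean for v in genuine_values) / len(genuine_values)
```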
Data Handling
After a general understanding of the data, we saw the opportunity to create new attributes, as well as to remove attributes that were not related to the target attribute (FraudResult) or that had a high correlation with other attributes.
Attribute Removal
Attributes containing unique identifiers and user-specific values were removed, because their contents have no influence on the classification of the type of transaction (for example, the customer identifier). Attributes with constant values were also removed, such as CountryCode, which identifies the country where the transaction took place: since all data were collected in Uganda, this value is constant for all instances and can be discarded, as it brings no contribution to the model. Altogether, the attributes removed in this step were: AccountId, SubscriptionId, CustomerId, CurrencyCode and CountryCode.
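A possible sketch of this removal step, on a toy frame that mimics a few of the columns named above:

```python
import pandas as pd

# Toy data mimicking some of the dataset's columns; values are invented.
df = pd.DataFrame({
    "TransactionId": ["t1", "t2", "t3"],
    "AccountId": ["a1", "a2", "a1"],
    "CountryCode": [256, 256, 256],   # constant: all data is from Uganda
    "Amount": [100.0, 250.0, 80.0],
})

# Identifier columns are dropped explicitly (plus SubscriptionId,
# CustomerId and CurrencyCode in the real data); constant columns
# are detected automatically.
id_columns = ["AccountId"]
constant_columns = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=id_columns + constant_columns)
```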
genuine) and the transaction amount (represented by Amount and Value); 3) the PricingStrategy attribute has no significant correlation with any other attribute.
- Vl_per_dayWk: Ratio between the transaction value and the position of that day in the week.
- Vl_per_dayYr: Ratio between the transaction value and the position of that day in the year.
- ValueStrategy: Classification of transaction values. It checks whether or not the value of a single transaction is greater than the average of all transactions multiplied by a factor n; according to this analysis, the transaction falls into a classification range (n: 0 to 5).
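One plausible reading of these derived attributes, sketched in code; the exact formulas of the original pipeline are not fully specified, so the day positions and the ValueStrategy binning below are assumptions:

```python
from datetime import datetime

def derived_features(value: float, ts: datetime, overall_mean: float) -> dict:
    vl_per_day_wk = value / ts.isoweekday()         # day position in the week: 1..7
    vl_per_day_yr = value / ts.timetuple().tm_yday  # day position in the year: 1..366
    # ValueStrategy: largest n in 0..5 such that value > n * overall mean.
    value_strategy = max((n for n in range(6) if value > n * overall_mean), default=0)
    return {
        "Vl_per_dayWk": vl_per_day_wk,
        "Vl_per_dayYr": vl_per_day_yr,
        "ValueStrategy": value_strategy,
    }

# Example: a transaction on the first day of the dataset window.
feats = derived_features(3000.0, datetime(2018, 11, 15), overall_mean=672.0)
```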
Isolation Forest
Isolation Forest[13] is an unsupervised algorithm that detects points that deviate from the rest of the dataset, i.e., anomalies. The algorithm is based on the decision tree structure, in which attributes are used to create splits along the tree; instances that share attribute values in common require more splits to be separated, while anomalous instances are isolated in fewer splits. From the trees generated by the algorithm[14], it is possible to identify instances that do not share the same pattern of attributes (in this context, candidate fraudulent transactions).
Figure 6: Methodological flow of actions applied to the dataset to detect outliers and create new
features related to this type of detection.
[13] https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e (accessed on 12/08/2019).
[14] Publicly available source code: https://pyod.readthedocs.io/en/latest/ (accessed on 12/08/2019).
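A minimal sketch of this step using scikit-learn's IsolationForest, with the prediction turned into a new binary attribute for the classifier; the contamination value and the toy data are assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),      # typical transactions
    np.array([[8.0, 8.0], [9.0, 9.0]]),   # two obvious anomalies
])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)                      # -1 = anomaly, 1 = normal
is_outlier = (pred == -1).astype(int)      # new attribute for the model
```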
Figure 7 shows a new graph containing the correlation coefficient between the
attributes after creating the attributes described in this section.
Figure 7: Dataset features correlation graph with implementation of removal and addition of new
features.
Solution development
This section deals with the implementation of the computational algorithms used to solve the transaction-type detection problem. It is subdivided into dataset balancing, model implementation and model evaluation, as shown in Figure 8.
Dataset Balancing
The balancing step is important due to the high imbalance in the dataset regarding the class of each transaction. Balancing the data allows the classification model to learn the transaction type from a reasonable number of available examples, with equal amounts for both classes.
For this, the dataset was balanced using oversampling with the SMOTENC method[15]: new elements are created in the neighborhood of (i.e., similar to) the elements to be multiplied. In other words, the characteristics of the fraudulent elements are observed, undergo small perturbations and are replicated, so that the new data does not have characteristics far from a real fraudulent transaction.
The quantitative representation of the unbalanced data is presented in Figure 9a, while
the quantitative representation of the balanced data is presented in Figure 9b.
[15] https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTENC.html (accessed on 12/08/2019).
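The core idea behind SMOTE-style oversampling (SMOTENC adds handling for categorical columns, omitted here) can be sketched as interpolation between minority-class neighbors:

```python
import numpy as np

def smote_like(X_minority: np.ndarray, n_new: int,
               rng: np.random.Generator) -> np.ndarray:
    # Synthesize new minority samples on the segment between two
    # randomly chosen minority instances.
    synthetic = []
    for _ in range(n_new):
        i, j = rng.choice(len(X_minority), size=2, replace=False)
        gap = rng.random()  # random point along the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Toy fraudulent instances (value, some numeric feature); values invented.
frauds = np.array([[100.0, 1.0], [120.0, 1.5], [90.0, 0.8]])
rng = np.random.default_rng(42)
new_frauds = smote_like(frauds, n_new=5, rng=rng)
```

Because each synthetic point lies between two real minority samples, the new data stays close to the characteristics of real fraudulent transactions, as described above.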
Figure 9: Balancing fraudulent data. SMOTENC was applied to the subset of transactions classified as fraudulent, generating new similar elements.
Modeling
Figure 10: Methodology flow of actions performed in the creation and training of the classifier model.
CatBoost
CatBoost is a gradient boosting library[16], which means the model iteratively improves on the previous models by combining them. Gradient boosting applies the gradient algorithm to an objective function; training is therefore supervised, using a set of labeled training instances as input and creating a model that tries to correctly predict the label of new examples not presented during training. In addition, when compared to other algorithms such as LightGBM, XGBoost and H2O, CatBoost presented superior performance on different databases, with quite competitive computational cost.
[16] https://catboost.ai/ (accessed on 12/08/2019).
[17] https://en.wikipedia.org/wiki/Receiver_operating_characteristic (accessed on 12/08/2019).
Figure 11: Curve created from the GridSearch performed over several model parameters.

Looking closely at the point marked in the graph, at approximately the fiftieth epoch, we can see that it was the point of greatest accuracy for CatBoost on the validation data. This process returned the following configuration parameters:

• Learning rate = 0.1
• Depth = 5
• L2 regularizer = 1
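Since CatBoost itself may not be available everywhere, the sketch below uses scikit-learn's GradientBoostingClassifier as a stand-in for the same gradient-boosting idea, with a small grid that mirrors the reported learning rate and depth; the dataset and grid values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced toy data standing in for the transaction dataset.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.05, 0.1], "max_depth": [3, 5]},
    scoring="f1",   # the competition scores on F1
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_
```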
Evaluation
With the validation data, the model was submitted to evaluation tests that verified the accuracy of the proposed model, following the flow shown in Figure 12, which checked the convergence of the values when classifying the type of a transaction. Training the classification model consists of submitting the balanced dataset to the CatBoost model with the parameters defined in the GridSearch step. This procedure allowed us, in addition to training the model, to extract information about the attributes most relevant to the learning process on this specific dataset.
Figure 12: Classification process done by the model and the final output of the system.
[18] https://github.com/slundberg/shap (accessed on 12/08/2019).
Note that several attributes generated in the pipeline described in this document appear among the most important.

The current model was submitted on the Xente platform; although the challenge has already ended, it still accepts new submissions. The score obtained by the model described in this document was 0.75, as shown in Figure 14, a result that ranks among the top 70 on the leaderboard[19]. The evaluation metrics used were F1-score, precision and recall[20].
Figure 14: Submission and the result achieved on the platform using the proposed model.
[19] https://zindi.africa/competitions/xente-fraud-detection-challenge/leaderboard (accessed on 12/08/2019).
[20] A full description of each cited metric can be found at: https://en.wikipedia.org/wiki/Receiver_operating_characteristic (accessed on 12/08/2019).
Financial analysis
When a False Positive occurs, that is, a genuine transaction is classified as fraud, the consumer cannot make the purchase and the bank, by denying the transaction, fails to receive the operating margin, which is calculated as a percentage of the transaction value. To calculate this cost, it is therefore necessary to know the percentage destined to the bank for each transaction. In Brazil, this value is around 1% to 5% in credit card machines[21][22].
Therefore, the equation for calculating the Cost of False Positives (CFP) can be described as:

CFP = alpha × Σ (i = 1..N) V_i

where alpha is the percentage of bank profit on a debit/credit transaction, V_i indicates the value of the i-th transaction classified as a False Positive, and N is the number of False Positives of the classifier.
On the other hand, when a False Negative occurs, that is, the bank classifies a fraudulent transaction as genuine, the bank must bear the loss of the operation, which is 100% of the transaction value. The Cost of False Negatives (CFN) can therefore be described as:

CFN = Σ (i = 1..N) V_i

where V_i indicates the value of the i-th transaction classified as a False Negative, and N is the number of False Negatives. Therefore, the objective of this optimization is to minimize the total cost:

Cost = CFP + CFN
This idea can be used in the context of fraud detection, since a False Positive will cause some discomfort for the customer, who will have the transaction denied when swiping the card, in addition to the credit network losing part of its margin (its share of the transaction). A False Negative, in turn, amounts to a complete loss, since a fraudulent transaction approved by the credit network will lead the customer to demand a chargeback.
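The two cost terms can be sketched directly from the definitions above; the alpha value is an assumption within the 1%-5% range mentioned in the text, and the toy transactions are invented:

```python
def fraud_costs(values, y_true, y_pred, alpha=0.03):
    # False positive: genuine transaction denied -> the bank loses its margin.
    cfp = alpha * sum(v for v, t, p in zip(values, y_true, y_pred)
                      if t == "genuine" and p == "fraud")
    # False negative: fraud approved -> the bank bears the full value.
    cfn = sum(v for v, t, p in zip(values, y_true, y_pred)
              if t == "fraud" and p == "genuine")
    return cfp, cfn, cfp + cfn

values = [100.0, 2000.0, 50.0]
y_true = ["genuine", "fraud", "genuine"]
y_pred = ["fraud", "genuine", "genuine"]
cfp, cfn, total = fraud_costs(values, y_true, y_pred)
```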
[21] https://www.visa.com.br/sobre-a-visa/geral/taxas-intercambio.html
[22] https://www.hnb.net/images/bank-downloads/card-center/agreement-english.pdf
Considering the model trained with CatBoost, the bank would save an estimated loss amount of R$ 1,441,922.20. In this way, the model made it possible to avoid more than 99.9% of the total loss to the company's coffers.
Conclusion
The main conclusions of the experiments developed here are:

• Discarding highly correlated attributes allowed the model to remain competitive with other competitors.
• The new attributes created helped the model to achieve the expected performance, as identified using the SHAP model interpreter[18] implemented in CatBoost.