
Machine Translated by Google

AN APPROACH TO FRAUD DETECTION IN THE CONTEXT OF BANK TRANSACTIONS

Dunfrey P. Aragão
Italo de Pontes Oliveira
Ferdinand Felix

Contents

Introduction

Objectives

Main goal

Specific objectives

Methodology

Data Understanding

Data dictionary

Descriptive data analysis

Attribute analysis for genuine and fraudulent instances

How is the data distributed considering the different attributes?

Operation

PricingStrategy

Value

Data Handling

Attribute Removal

Feature Correlation Analysis

Creating new attributes based on other attributes

Creating new attributes using outlier detectors

Isolation Forest

Locally Selective Combination in Parallel Outlier Ensembles (LSCP)

K-Nearest Neighbors (KNN)

Generation of New Attributes

Solution development

Dataset Balancing

Modeling

CatBoost

Defining the model parameters

Evaluation

Financial Transaction Type Classification Dictionary

Financial analysis

Conclusion

Introduction
According to The World Payments Report 2019, carried out by Capgemini Financial Services Analysis (2019), the amount handled by financial transactions around the world in 2019 corresponds to 680 billion dollars, without accounting for operations involving the use of physical money. For comparison purposes, this value exceeds Argentina's Gross Domestic Product (GDP) for the year 2019, around 477 billion dollars, and is more than twice the GDP of countries such as Colombia, Finland, Egypt, Chile and Portugal. In a single day, companies like Visa process more than 100 million financial transactions, corresponding to around 24,000 new transactions every second.

During the processing of a financial transaction, a series of procedures take place: 1) the merchant that is selling the product, or providing the service, collects credit card data, such as the card number, expiration date, security code, among others, by means of a card machine or by manual entry, as occurs in e-commerce systems, and passes this data to the credit card network; 2) the credit card network must authenticate the information provided and request that the banking institution approve the transaction; 3) the banking institution makes the payment to the commercial establishment that carried out the sale and approves the transaction.

In this way, the credit card network institutions are responsible for authenticating the customer information provided by the merchant, as well as for validating that the transaction was carried out by the customer. Thus, it is essential that the validation system works with a low response time and that, at the same time, it is reliable, preventing fraudulent transactions from taking place.

This document proposes an approach capable of automatically identifying the type of a transaction, achieving an F1-score of 74.4%. The approach has been validated on a real and publicly available database. The proposed solution is among the Top-70 in the Zindi challenge. The Methodology section provides a detailed description of how the experiments were conducted.

Objectives

Main goal
• Automatically identify whether a transaction was fraudulent or genuine.

1 https://worldpaymentsreport.com/resources/world-payments-report-2019/ (accessed on 12/07/2019).
2 http://worldpopulationreview.com/countries/countries-by-gdp/ (accessed on 12/07/2019).
3 https://usa.visa.com/run-your-business/small-business-tools/retail.html (accessed on 12/07/2019).
4 https://www.mastercard.com.br/pt-br/estabelecimentos/comece-aceitar/processo-pagamento.html (accessed on 12/07/2019).
5 https://zindi.africa/competitions/xente-fraud-detection-challenge (accessed on 12/07/19).

Specific objectives
• Estimate the financial impact caused by the fraud detection system.

Methodology
To identify fraudulent transactions, the database provided in the challenge organized by Zindi was adopted. The data comes from the Xente platform, a business aimed at providing financial services and e-commerce that serves more than 10,000 customers in Uganda. The dataset includes an approximate sample of 140,000 transactions that took place between November 15, 2018 and March 14, 2019. The data is divided into training (the set of transactions that occurred between November 15, 2018 and February 13, 2019, with prior identification of the type of transaction - fraud or genuine) and test (the set of transactions corresponding to the period from February 13, 2019 to March 14, 2019, without identification of the transaction type).

The main challenges involving fraud detection are related to the high level of data imbalance, since about 0.002% of transactions correspond to fraud. Furthermore, fraudulent transactions do not always behave suspiciously; they often go unnoticed.

Considering these difficulties, the approach adopted in this project is presented in Figure 1, which shows the complete pipeline. Each step is described in detail in the following sections. In summary, they are:

• Data pre-processing: In this step, we sought to identify the existence of missing data, verify data repetition in the training set, perform a descriptive analysis of trends in fraudulent transactions and, finally, analyze the correlation of attributes in the training set;

• Data balancing: Given the high level of imbalance in the data, different balancing techniques based on oversampling were used, since undersampling would result in a significant drop in the amount of data available (about 99% of the instances are from the non-fraudulent class). Based on this, a better performance of the classifier was observed when using balancing techniques based on oversampling;

• Identification of outliers: Application of techniques that consist of identifying instances that differ from the others, and using this information as a new attribute for the model;

• Data training: Finally, a supervised learning model called CatBoost was trained to identify whether an instance is fraudulent or non-fraudulent. The results were submitted on the Zindi platform and the

6 https://zindi.africa/competitions/xente-fraud-detection-challenge (accessed on 12/07/2019).
7 https://zindi.africa/competitions/xente-fraud-detection-challenge/data (accessed on 12/07/2019).
8 Oversampling: replicate data from the less frequent transaction type until both types occur in equal amounts.
9 Undersampling: drop data from the more frequent transaction type until both types occur in equal amounts.
10 https://catboost.ai/ "CatBoost is a high-performance open source library for gradient boosting on decision trees" (accessed 12/07/2019).

proposed methodology is among the top-70 among the more than 1,000 data scientists participating in the competition.

Figure 1: Methodological flow of actions applied to the dataset to classify the type of transaction.

Data Understanding
The first step of the pipeline (workflow) consists of understanding the data provided by the challenge. Figure 2 presents the flow of actions performed in this first stage. The pipeline consists of the following sub-steps: 1) following the CRISP-DM methodology, the focus at this point is on creating data dictionaries and analyzing them in a descriptive way, which provided us with a better understanding of the data and the business context involved; 2) correlation analysis between attributes to identify and remove highly correlated attributes; 3) creation of new attributes, returning to step (2) until there are no more attributes correlated with each other that should be removed.

Figure 2: Methodological flow of actions applied to the data set in the data collection and analysis
stage.

Data dictionary

Each record in the data set corresponds to a transaction performed and, for each of these, there is a group of attributes, which are:
- TransactionId: Unique identifier of the transaction on the platform.
- BatchId: Unique number of the set of transactions sent by processing.
- AccountId: Unique identifier of the user on the platform.
- SubscriptionId: Unique identifier of the subscribed user.
- CustomerId: Unique identifier attached to AccountId.

11 https://zindi.africa/competitions/xente-fraud-detection-challenge/data (accessed on 12/08/2019).

- CurrencyCode: Currency of the country.


- CountryCode: Country code.
- ProviderId: Source supplier of the purchased item.
- ProductId: Identifier of the purchased item.
- ProductCategory: The ProductIds are organized into categories according to the
products.
- ChannelId: Identifies the platform used by the user, such as web, Android or iOS systems, pay later or checkout.
- Amount: Transaction amount. If the value is positive, the user used the debit option in their account; if negative, the user used credit.
- Value: Absolute value of the transaction value.
- TransactionStartTime: Transaction day and time.
- PricingStrategy: Pricing category for sale provided by Xente.
- FraudResult: Transaction status: 1) fraud; or 0) non-fraudulent.

Note: The nomenclature used to identify the labels was fraudulent and genuine. For clarity, we
chose to identify non-fraudulent transactions as genuine transactions.

Descriptive data analysis

Attribute analysis for genuine and fraudulent instances

To identify whether any attribute reflects typical patterns of fraudulent transactions, we used histograms to observe each distribution. Figure 3 shows how genuine and fraudulent financial transactions behave in relation to the amount used in the transaction (Value) and the type of strategy used (PricingStrategy).

Figure 3: Histogram for the PricingStrategy and Value attributes (left: fraudulent data; right: genuine data).


In this way, it can be seen that genuine and fraudulent transactions have a very similar pattern for the PricingStrategy attribute, with the difference that there were almost no fraudulent transactions with PricingStrategy equal to one and none of type zero. In addition, it is clear that fraudulent operations have transaction values heavily concentrated in a certain region.

How is the data distributed considering the different attributes?

At this point in the descriptive analysis, we look at what evidence we have about fraudulent data for
each attribute. Below, we list the most important aspects found for each attribute:

Operation

A bias was identified in this attribute, as a much higher proportion of debit transactions were made
for fraudulent data. Table 1 shows the values found in this analysis.

Operation   Fraudulent   Genuine
Debit       97%          60%
Credit      3%           40%

Table 1: Proportion of how fraudulent and genuine transactions are distributed for each type of
transaction (debit or credit).

PricingStrategy

A higher standard deviation was identified in fraudulent transactions, which means that the strategies used are more dispersed, although concentrated at lower values.

Value

The mean value of genuine transactions was $672, with a standard deviation of $3,995, while for fraudulent transactions the mean was $1,560,156, with a standard deviation of $2,082,015. Furthermore, the following observations were made:
• 98.9% of frauds: transaction value > average of all transactions;
• 27% of frauds: transaction value > average of all fraudulent transactions;
• 80.6% of genuine: transaction value < average of all transactions;
• 20% of genuine: transaction value > average of all genuine transactions.

Regarding the other attributes related to the transaction value, we can highlight the following observations:
• CHANNEL=3: Channel where the highest number of fraudulent transactions took place;
• OPERATION=1: The operation with the highest probability of fraud;
• PRODUCT=15: The product with the highest fraud rate;
• PROVIDER_ID=[1, 3, 5]: The suppliers where the largest number of frauds occur;
• PRODUCTCATEGORY=9 (financial services): The financial services category was the one with the highest fraud rate.

To better observe this bias, the number of transactions per category was counted, normalized and visualized in the histogram of Figure 4. It can be seen that fraudulent transactions (represented in blue) mainly focus on Financial Services type operations, as well as Airtime and Utility Bill.

Figure 4: Histogram for the different product categories considering fraudulent and genuine
transactions.

Data Handling
After a general understanding of the data, the opportunity to create new attributes was observed, as well as to remove attributes that were not related to the target attribute (FraudResult) or that have a high correlation with other attributes.

Attribute Removal
Attributes referring to unique identifiers and user-specific values were removed because their contents have no influence on the classification of the type of transaction, such as the customer identifier. In addition, attributes that hold constant values were also removed, such as CountryCode, which identifies the code of the country where the transaction took place: as all data were collected in Uganda, this value is constant for all instances and can be discarded because it brings no contribution to the model. Altogether, the attributes removed in this step were: AccountId, SubscriptionId, CustomerId, CurrencyCode and CountryCode.
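The removal step above can be sketched as follows; a minimal stand-in (not the authors' code), assuming each transaction is a plain Python dict:

```python
# Drop known identifier columns plus any column whose value is constant
# across all records (e.g. CountryCode, CurrencyCode in this dataset).

ID_COLUMNS = {"AccountId", "SubscriptionId", "CustomerId"}

def constant_columns(records):
    """Return the set of columns holding a single value in every record."""
    values = {}
    for row in records:
        for col, val in row.items():
            values.setdefault(col, set()).add(val)
    return {col for col, vals in values.items() if len(vals) == 1}

def remove_attributes(records):
    """Drop identifier columns and constant columns."""
    to_drop = ID_COLUMNS | constant_columns(records)
    return [{c: v for c, v in row.items() if c not in to_drop}
            for row in records]

transactions = [
    {"AccountId": 1, "CountryCode": 256, "CurrencyCode": "UGX", "Value": 500},
    {"AccountId": 2, "CountryCode": 256, "CurrencyCode": "UGX", "Value": 1200},
]
cleaned = remove_attributes(transactions)
```

In a pandas-based pipeline the same effect is achieved with `DataFrame.drop(columns=...)` after inspecting `nunique()`.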

Feature Correlation Analysis


When two random variables are highly correlated, both carry essentially the same information, growing at a proportional rate. This redundancy, if left untreated, requires unnecessary additional processing and memory allocation and, therefore, it is best to discard one of the variables. Figure 5 shows the correlation coefficient between the numeric attributes. Notice that: 1) the Amount and Value attributes are highly correlated; in this case, it was decided to remove the Amount attribute; 2) there is a moderate correlation between the type of transaction (fraudulent or genuine) and the transaction amount (represented by Amount and Value); 3) the PricingStrategy attribute has no significant correlation with any other attribute.

Figure 5: Pearson correlation between numeric type attributes.
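The correlation check that motivated dropping Amount can be illustrated with a small Pearson helper (the function and the 0.9 threshold are illustrative choices, not from the report):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Amount is signed (debit/credit) while Value is its absolute value, so the
# two columns are nearly collinear for mostly-debit data.
amount = [100.0, 250.0, -50.0, 400.0, 120.0]
value = [abs(a) for a in amount]

r = pearson(amount, value)
drop_amount = abs(r) > 0.9  # flag one of the pair for removal
```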

Creating new attributes based on other attributes


New attributes were generated based on existing attributes, namely:

- Operation: Identifies whether the transaction is a credit (1) or debit (-1).


- Hour: Hour of the day the transaction was performed.
- DayOfWeek: Day of the week that the transaction was carried out.
- WeekOfYear: Week of the year that the transaction was carried out.
- Vl_per_weekYr: Ratio between the transaction value and this week's position in the
year.

- Vl_per_dayWk: Ratio between the transaction value and the position of that day in the week.
- Vl_per_dayYr: Ratio between the transaction value and the position of that day in the year.
- ValueStrategy: Classification of transaction values. The check is whether the value of a single transaction is greater than the average of all transactions multiplied by a factor n; according to this analysis, the transaction falls into a classification range (n: 0 to 5).
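The time-derived attributes above can be sketched as follows (the function name and input format are assumptions; the report does not show its implementation):

```python
from datetime import datetime

def time_features(start_time: str, value: float, amount: float) -> dict:
    """Derive the time-based attributes from TransactionStartTime,
    Value and Amount (positive Amount = debit, negative = credit)."""
    t = datetime.fromisoformat(start_time)
    day_of_week = t.isoweekday()            # 1 = Monday .. 7 = Sunday
    week_of_year = t.isocalendar()[1]
    return {
        "Operation": 1 if amount < 0 else -1,   # credit (1) or debit (-1)
        "Hour": t.hour,
        "DayOfWeek": day_of_week,
        "WeekOfYear": week_of_year,
        "Vl_per_weekYr": value / week_of_year,
        "Vl_per_dayWk": value / day_of_week,
        "Vl_per_dayYr": value / t.timetuple().tm_yday,
    }

feats = time_features("2018-11-15T02:18:49", value=1000.0, amount=-1000.0)
```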

Creating new attributes using outlier detectors


In statistics, an outlier is an instance that behaves differently from most instances of the same dataset. Thus, a question that may arise is: "is it possible to include new attributes in the dataset, built with outlier detectors, to improve the model performance?". Based on this, three different outlier detectors were adopted, namely: Isolation Forest, Locally Selective Combination in Parallel Outlier Ensembles (LSCP) and K-Nearest Neighbors (KNN), all described below.

12 The cited algorithms are implemented at: https://scikit-learn.org/ (accessed on 12/08/2019).



Isolation Forest

Isolation Forest is 13a supervised algorithm that detects points outside the standard
curve in the dataset, or anomalies. The algorithm is based on the decision tree structure,
where attributes are used to create nodes along the tree, and the more attributes in common,
the fewer layers are needed to differentiate the instances.
From the decision tree generated by the algorithm, it is possible to identify anomalies that do
not share the same pattern of attributes (in this context, fraudulent transactions).

Locally Selective Combination in Parallel Outlier Ensembles (LSCP)


LSCP is an unsupervised ensemble algorithm that combines, in a single application, several base outlier detectors, selecting for each region of the data the detector best suited to the local behavior. This variety of methodologies lets the detection adapt to the data, using the method that best fits the problem.

K-Nearest Neighbors (KNN)


KNN works with the idea of the distance from a given point to its k nearest neighbors. The idea is that, if a point is an outlier, it will be further away from the other instances.

Generation of New Attributes


After completing the process of identifying outliers using the three detectors described previously, new attributes were generated indicating whether each instance behaves as an outlier or not, as shown in Figure 6. This approach aimed to help identify fraudulent transactions, assuming that fraudulent instances should behave as outliers.

Figure 6: Methodological flow of actions applied to the dataset to detect outliers and create new
features related to this type of detection.

13 https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e (accessed on 12/08/2019).
14 Publicly available source code: https://pyod.readthedocs.io/en/latest/ (accessed on 12/08/2019).

The new attributes created were:

- IsolationForest: Class estimated by Isolation Forest, 1 if the instance is an outlier and 0 if it is normal.
- LSCP: Class estimated by LSCP, 1 if the instance is an outlier and 0 if it is normal.
- KNN: Class estimated by KNN, 1 if the instance is an outlier and 0 if it is normal.
- CountDetection: Sum of the outputs of the three outlier detectors. For example, if none of the detectors classified an instance as an outlier, this value will be zero; if only KNN detected the instance as an outlier, the value will be 1; and if all three detectors detected the instance as an outlier, the value will be 3.
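How the three flags combine into CountDetection can be sketched as follows; the toy distance-based scorer below is a stand-in for the real scikit-learn / PyOD detectors, and the threshold is an illustrative choice:

```python
def kth_neighbor_distance(points, idx, k=2):
    """Distance from points[idx] to its k-th nearest neighbor (1-D toy)."""
    dists = sorted(abs(points[idx] - p) for i, p in enumerate(points) if i != idx)
    return dists[k - 1]

def knn_flags(points, k=2, threshold=5.0):
    """Flag a point as an outlier when its k-th neighbor is far away."""
    return [1 if kth_neighbor_distance(points, i, k) > threshold else 0
            for i in range(len(points))]

def add_count_detection(iso, lscp, knn):
    """CountDetection = number of detectors that flagged each instance."""
    return [a + b + c for a, b, c in zip(iso, lscp, knn)]

values = [10.0, 11.0, 10.5, 9.8, 50.0]   # 50.0 is an obvious outlier
knn = knn_flags(values)
iso = [0, 0, 0, 0, 1]                    # stand-in outputs of the other detectors
lscp = [0, 0, 0, 0, 1]
count_detection = add_count_detection(iso, lscp, knn)
```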

Figure 7 shows a new graph containing the correlation coefficient between the
attributes after creating the attributes described in this section.

Figure 7: Dataset features correlation graph with implementation of removal and addition of new
features.

Solution development
This section deals with the manipulation and implementation of computational algorithms aimed at solving the problem of detecting the transaction type. It is subdivided into dataset balancing, model implementation, and model evaluation, as shown in Figure 8.

Figure 8: Dataset balancing and learning curve generation steps.

Dataset Balancing
The balancing step is important due to the high imbalance in the dataset regarding the class of each transaction. Balancing the data matters because it allows the classification model to learn from a reasonable number of examples, in equal amounts for both transaction types.

For this, the dataset was balanced with an oversampling approach, through the use of the SMOTENC method, which creates new elements neighboring (similar to) the elements one wants to multiply; that is, the characteristics of the fraudulent elements are observed, undergo small changes and are replicated accordingly. The new data therefore does not have characteristics far from those of a real fraudulent transaction.
The quantitative representation of the unbalanced data is presented in Figure 9a, while
the quantitative representation of the balanced data is presented in Figure 9b.

a) Unbalanced data set b) Balanced dataset

15 https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTENC.html (accessed on 12/08/2019).

Figure 9: Balancing fraudulent data. SMOTENC was applied to the subset of the dataset whose transaction type is classified as fraudulent, generating new similar elements.
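The oversampling idea can be sketched with a toy, numeric-only interpolation; the actual pipeline used imblearn's SMOTENC, which additionally handles categorical features (function names and values here are illustrative):

```python
import random

def smote_like_sample(minority, rng):
    """Create one synthetic minority sample by interpolating between a
    random minority instance and another one (a stand-in for picking a
    true nearest neighbor, as SMOTE does)."""
    a = rng.choice(minority)
    b = rng.choice(minority)
    gap = rng.random()
    return [ai + gap * (bi - ai) for ai, bi in zip(a, b)]

def oversample(minority, target_size, seed=0):
    """Grow the minority class to target_size with synthetic samples."""
    rng = random.Random(seed)
    out = list(minority)
    while len(out) < target_size:
        out.append(smote_like_sample(minority, rng))
    return out

frauds = [[1000.0, 2.0], [1200.0, 2.0], [900.0, 3.0]]   # toy minority class
balanced = oversample(frauds, target_size=10)
```

Because each synthetic row lies on a segment between two real fraud rows, its features stay inside the range observed for real frauds.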

Modeling

Figure 10: Methodology flow of actions performed in the creation and training of the classifier model.

CatBoost
CatBoost is a gradient boosting library, meaning the model tries to improve upon previous models by combining them with each other. Gradient boosting applies the gradient algorithm to an objective function and is thus a supervised training approach: it takes a set of labeled training instances as input and creates a model that tries to correctly predict the label of new examples not presented during training. In addition, when compared to other algorithms such as LightGBM, XGBoost and H2O, CatBoost presented superior performance for different databases, with a quite competitive computational cost.

Defining the model parameters


To define the best hyperparameters of the model, the technique known as Grid Search was adopted. In this procedure, cross-validation is performed on the dataset (or, ideally, on a separate validation set), without looking at the test set until the final evaluation. The generated result is shown in Figure 11, which presents the F1-score obtained across the diversity of parameter combinations submitted.

16 https://catboost.ai/ (accessed on 12/08/2019).
17 https://en.wikipedia.org/wiki/Receiver_operating_characteristic (accessed on 12/08/2019).

Figure 11: Curve created from the GridSearch made for several parameters of the model.

Looking closely at the point marked on the graph, approximately at the fiftieth epoch, we can see that it was the point of highest score obtained with CatBoost on the validation data. This process returned the following configuration parameters to be used:
• Learning rate = 0.1
• Depth = 5
• L2 regularizer = 1
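The grid search can be sketched as a generic loop over parameter combinations; the stand-in scoring function below merely pretends the report's optimum is best, whereas the real procedure would cross-validate a CatBoost model for each combination:

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Return the parameter combination with the highest validation score."""
    best_params, best_score = None, float("-inf")
    keys = sorted(param_grid)
    for combo in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

param_grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "depth": [3, 5, 7],
    "l2_leaf_reg": [1, 3],
}

# Stand-in score peaking at the configuration reported above; in practice
# this would train a model and return its cross-validated F1-score.
def fake_score(p):
    return (-abs(p["learning_rate"] - 0.1)
            - abs(p["depth"] - 5)
            - abs(p["l2_leaf_reg"] - 1))

best, _ = grid_search(param_grid, fake_score)
```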

Evaluation

Financial Transaction Type Classification Dictionary


In the problem of classifying the type of a financial transaction carried out by customers of financial institutions, each transaction will be approved or denied by the institution. The institution may come across four possible detection outcomes, shown in Table 2:

Actual \ Classification   Genuine                                 Fraud
Genuine                   Genuine transaction rated genuine       Genuine transaction rated fraud
Fraud                     Fraudulent transaction rated genuine    Fraudulent transaction rated fraud

Table 2: Possible classification outcomes for the types of transactions carried out. The goal is greater accuracy in getting the highlighted classifications right.

1. True Positive: Fraudulent transaction correctly classified as fraud;
2. False Positive: Genuine transaction that should be approved, but is classified as fraudulent by the model;
3. True Negative: Genuine transaction correctly classified as genuine;
4. False Negative: Fraudulent transaction that should not be approved, but is classified as genuine by the model.
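A minimal sketch of how the four counts are computed, taking fraud as the positive class (label 1), which is the convention used in the financial analysis below:

```python
def confusion_counts(y_true, y_pred):
    """Count TP/FP/TN/FN with fraud as the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

y_true = [1, 0, 0, 1, 0, 0]   # 1 = fraud, 0 = genuine
y_pred = [1, 0, 1, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
```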

With the validation data, the model was submitted to evaluation tests to verify the accuracy of the proposed model, following the flow shown in Figure 12, which checked the convergence of values in classifying the type of a transaction. The training of the classification model consists of submitting the balanced dataset to the CatBoost model with the parameters defined in the GridSearch step. This procedure allowed us, in addition to training the model, to extract information about the attributes most relevant to the learning process on this specific dataset.

Figure 12: Classification process done by the model and the final output of the system.

Figure 13 presents the feature importance graph generated by CatBoost during its learning process, using the SHAP interpretability technique. Note that the colored attributes are those with the greatest contribution to the prediction, and that several attributes generated in the pipeline described in this document appear among the most important.

18 https://github.com/slundberg/shap (accessed on 12/08/2019).

Figure 13: Importance of each attribute for the model.

The current model was submitted on the Zindi platform; although the challenge has already ended, it still accepts new submissions. The score obtained by the model described in this document was 0.75, as shown in Figure 14.

Figure 14: Submission and the result achieved on the platform using the proposed model.

Furthermore, the result obtained ranks among the Top 70 best results, which can be accessed via the challenge leaderboard. The score adopted by the creators of the competition is based on three metrics:
• F1-score
• Precision
• Recall (also called sensitivity, or True Positive Rate - TPR)
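These metrics follow directly from the confusion counts, with fraud as the positive class (a minimal sketch with illustrative counts, not the platform's scoring code):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=60, fp=20, fn=21)
```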

19 https://zindi.africa/competitions/xente-fraud-detection-challenge/leaderboard (accessed on 12/08/2019).
20 A full description of each cited metric can be found at: https://en.wikipedia.org/wiki/Receiver_operating_characteristic (accessed on 12/08/2019).

Financial analysis

When a False Positive occurs, that is, a genuine transaction is classified as fraud, the consumer cannot make the purchase and, therefore, the bank, by denying the transaction, fails to receive the operating margin calculated on the transaction value that would have been fulfilled. Therefore, to calculate this cost, it is necessary to know the percentage destined to the bank for each transaction, based on the purchase value. In Brazil, this value is around 1% to 5% in credit card machines.

Therefore, the equation for calculating the Cost of False Positives (CFP) can be described as:

Cost of False Positives (CFP) = alpha × Σ (i=1..N) Vi

where alpha is the percentage of bank profit on a debit/credit transaction, Vi indicates the value of the i-th transaction classified as a False Positive, and N is the number of False Positives produced by the classifier.

On the other hand, when a False Negative occurs, that is, the bank classifies a fraudulent transaction as genuine, the bank must bear the loss of the operation, namely 100% of the transaction value, so the Cost of False Negatives (CFN) can be described as:

Cost of False Negatives (CFN) = Σ (i=1..N) Vi

where Vi indicates the value of the i-th transaction classified as a False Negative, and N is the number of False Negatives. Therefore, the objective is to minimize the total:

Financial Costs = CFP + CFN
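The two cost formulas can be sketched as follows; alpha = 3% is an assumed margin within the 1-5% range mentioned above, and the transaction values are illustrative:

```python
def financial_costs(fp_values, fn_values, alpha=0.03):
    """CFP: lost margin on genuine transactions wrongly denied.
    CFN: full value of frauds wrongly approved."""
    cfp = alpha * sum(fp_values)
    cfn = sum(fn_values)
    return cfp + cfn

# Values of the transactions the classifier got wrong (toy numbers):
false_positive_values = [200.0, 500.0, 300.0]   # genuine, wrongly denied
false_negative_values = [1500.0]                # fraud, wrongly approved
total = financial_costs(false_positive_values, false_negative_values)
```

With these numbers the margin loss (CFP = 30.0) is small compared to the full loss of the approved fraud (CFN = 1500.0), which is why false negatives dominate the cost.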

This idea can be used in the context of fraud detection, since a False Positive will result in a certain discomfort for the customer who, when swiping the card, will have the transaction denied, in addition to the credit network losing a margin of its profit (its participation in the transaction). A False Negative, on the other hand, amounts to a complete loss, since a fraudulent transaction has been approved by the credit network and the customer will demand a reversal of the charge.

Fraudulent transactions correspond to less than 1% of all transactional financial volume. Whereas a false negative indicates that a fraudulent transaction took place but the algorithm did not detect it, and therefore the bank fully assumes the loss, a false positive indicates that a genuine transaction was wrongly denied, and therefore the bank forfeits the stake it would have had in the transaction had it been approved. Therefore, if the algorithm did not exist, the loss would be almost 45 billion dollars.

21 https://www.visa.com.br/sobre-a-visa/geral/taxas-intercambio.html
22 https://www.hnb.net/images/bank-downloads/card-center/agreement-english.pdf

Considering the model trained with CatBoost, the bank managed to avoid an estimated loss amount of R$ 1,441,922.20. In this way, the model made it possible to avoid more than 99.9% of the total loss to the company's coffers.

Conclusion
The main conclusions of the experiments developed here are:
• Discarding highly correlated attributes allowed the model to remain competitive with other competitors;
• The new attributes created helped the model to achieve the expected performance, as identified using the SHAP model interpreter integrated with CatBoost;
• The insertion of the fraud detection technique allowed billions in savings for the credit card network.
